Ornith 1.0 Shows the Open-Source Agentic Coding Race Is Becoming a Workflow Battle, Not Just a Benchmark Contest

Sophie Larsen
1 day ago
3 min read

Ornith 1.0 is a family of agentic coding models that treats workflow construction as the main training target. The lineup includes 9B and 31B dense models plus 35B and 397B mixture-of-experts variants. All four sizes reached the highest scores yet recorded for any fully open-source system on major coding-agent benchmarks.

The largest model posted 82.4 on SWE-Bench Verified and 62.2 on SWE-Bench Pro. It also recorded 77.5 on Terminal-Bench 2.1. These numbers come after joint reinforcement learning on both the scaffold that structures tasks and the final code solution. Earlier open models treated the scaffold as fixed engineering and trained only the solver.

Benchmark numbers and training choice

Ornith 1.0 builds on Gemma 4 and Qwen 3.5 base checkpoints. Post-training then applies reinforcement learning across entire execution traces instead of single-turn completions. The same reward model scores both the quality of the scaffold steps and the correctness of the delivered code.

This joint optimization produces models that generate their own task plans, launch tools, inspect intermediate results, and rewrite failing steps. The 397B MoE variant leads every listed benchmark. The smaller dense models stay close in score and can run on consumer hardware through the supplied GGUF files.
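To make the idea of scoring a whole trace concrete, here is a minimal sketch. The release notes summarized above do not describe the reward model's internals, so this swaps in a hand-written weighted sum purely for illustration. The `Step` structure, the `trace_reward` function, and the `scaffold_weight` value are all hypothetical names and numbers, not anything taken from Ornith 1.0.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One scaffold step in an execution trace (hypothetical structure)."""
    action: str       # e.g. "plan", "run_tests", "edit_file"
    succeeded: bool   # whether the tool call completed without error


def trace_reward(steps: List[Step], tests_passed: int, tests_total: int,
                 scaffold_weight: float = 0.3) -> float:
    """Blend scaffold quality and solution correctness into one trace score.

    scaffold_weight is an illustrative assumption, not a published value.
    """
    if not 0.0 <= scaffold_weight <= 1.0:
        raise ValueError("scaffold_weight must be in [0, 1]")
    # Scaffold quality: fraction of steps that executed cleanly.
    scaffold_score = sum(s.succeeded for s in steps) / len(steps) if steps else 0.0
    # Solution correctness: fraction of tests passing on the final code.
    solution_score = tests_passed / tests_total if tests_total else 0.0
    return scaffold_weight * scaffold_score + (1.0 - scaffold_weight) * solution_score
```

The point of the sketch is only that a single number covers the whole run: a trace with a clean plan but failing tests and a trace with a messy plan but passing tests both get partial credit, so the optimizer has a reason to improve the scaffold as well as the solver.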

Workflow execution now decides outcomes

Gains in agentic coding performance no longer come mainly from raw parameter count. Teams now measure how reliably a model keeps context across dozens of tool calls, recovers from partial failures, and re-plans without human intervention. Ornith 1.0 directly trains these recovery behaviors instead of leaving them to prompt engineering.

Developers who tried earlier open agents report frequent stalls when the model produces a broken plan and then repeats the same error. The new training loop penalizes that repetition. The result is longer autonomous runs before any human step is required.
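One simple way to picture a repetition penalty is to count how often a trace retries an action that has already failed with the same error. The sketch below does exactly that. It is an assumption about what such a penalty could look like, not the training loop's actual formula, and the `repetition_penalty` name and `per_repeat` value are hypothetical.

```python
from typing import List, Tuple


def repetition_penalty(attempts: List[Tuple[str, str]], per_repeat: float = 0.1) -> float:
    """Return a non-positive penalty for repeating an identical failure.

    Each attempt is (action, error); an empty error string means success.
    per_repeat is an illustrative assumption, not a published value.
    """
    seen_failures = set()
    repeats = 0
    for action, error in attempts:
        if not error:
            continue  # successful steps are never penalized
        key = (action, error)
        if key in seen_failures:
            repeats += 1  # same action failed the same way before
        else:
            seen_failures.add(key)
    return -per_repeat * repeats
```

Under this scheme a first failure costs nothing, which leaves room for exploration, while every identical retry lowers the trace score and pushes the model toward changing its plan.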

Open weights change how teams test agents

All Ornith 1.0 checkpoints carry the MIT license. GGUF versions run in Ollama and Unsloth with no extra gatekeeping. Local execution lets engineering groups run private repositories through the same agent loop they use on public benchmarks.

Closed frontier models still post higher absolute scores on some suites. Open weights, however, remove the requirement to send code outside the organization, which matters for regulated codebases. Teams can now compare scaffold behavior side by side instead of guessing from public leaderboards alone.

Long-running agents need memory and recovery layers

Office agents handling multi-hour tasks face the same scaffold problem. They must maintain task state, detect when a step has stalled, and switch to an alternate path without resetting the entire session. Ornith 1.0 demonstrates that reinforcement learning on full traces produces these behaviors more consistently than prompt-only methods.
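The three requirements above (keep task state, notice a stall, switch paths without a reset) can be sketched as a small control loop. This is a hand-written scaffold for illustration only, whereas Ornith 1.0 is described as learning such behavior through training. The `run_with_fallbacks` function and its attempt budget are hypothetical, and a "stall" is assumed here to mean a path that uses up its attempts without succeeding.

```python
from typing import Callable, Dict, List, Optional


def run_with_fallbacks(paths: List[Callable[[Dict], bool]], state: Dict,
                       max_attempts_per_path: int = 2) -> Optional[int]:
    """Try each path in order on shared state; return the index that succeeded.

    A path counts as stalled once it exhausts max_attempts_per_path.
    Returns None if every path stalls.
    """
    for index, path in enumerate(paths):
        for _ in range(max_attempts_per_path):
            # Record the attempt in the shared state rather than resetting it.
            state.setdefault("history", []).append(index)
            try:
                if path(state):
                    return index
            except Exception as exc:
                # Keep the error for later inspection and move on.
                state.setdefault("errors", []).append(repr(exc))
    return None
```

Because `state` is passed through every attempt, an alternate path starts with everything the failed path already recorded, which is the difference between switching paths and restarting the session.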

remio stores meeting notes, documents, and prior decisions in a persistent five-level memory system. When an agent starts a new report or analysis, it already holds the relevant context instead of requiring the user to restate project history. The same recovery logic that helps Ornith 1.0 continue after a failed test run helps remio resume after an interrupted research thread.

Remaining limits and open questions

The published scores still come from curated benchmark environments. Real repositories contain more noisy dependencies and internal tooling than the test suites cover. It remains unclear how far the current scaffolds generalize once the model encounters unfamiliar build systems or undocumented APIs.

Model size also still trades against latency on local hardware. The 397B checkpoint delivers the strongest numbers yet requires substantial GPU memory even after quantization. Smaller dense versions run faster but lose several points on the hardest tasks.
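A rough back-of-the-envelope estimate shows why the largest checkpoint stays demanding after quantization. The helper below computes memory for the weights alone as parameters times bits per weight, divided by eight. It ignores the KV cache, activations, and per-format overhead, so real GGUF files will differ. The `weight_memory_gb` name is hypothetical, and the bit widths are inputs chosen for illustration, not the published quantization settings.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights only, in decimal gigabytes."""
    if params_billions <= 0 or bits_per_weight <= 0:
        raise ValueError("both arguments must be positive")
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Weights-only estimates at an assumed 4 bits per weight.
for size in (9, 31, 35, 397):
    print(f"{size}B at 4-bit: ~{weight_memory_gb(size, 4):.1f} GB")
```

At an assumed 4 bits per weight, the 397B model needs roughly 198.5 GB for weights alone, while the 9B model needs about 4.5 GB. A mixture-of-experts model activates only some experts per token, but all of its weights still have to be held somewhere for inference, so the total parameter count is what drives this figure.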

What to watch next

Teams will test Ornith 1.0 on private codebases within the next quarter. The first public reports of scaffold success rates on internal monorepos will show whether benchmark gains transfer.

Hardware vendors will release updated GGUF quantization recipes aimed at the 31B and 35B checkpoints. Uptake numbers for these quantized files will indicate how many developers move from API-only agents to local runs.

Competing open projects will publish their own joint scaffold-and-solution training runs. Direct head-to-head results on the same extended task suites will clarify whether Ornith 1.0 holds its lead or simply marks the start of a new training pattern.

Developers evaluating agent stacks now have concrete open weights to measure against their own workflow requirements. The next round of results will show whether scaffold-focused training becomes the default approach for both open and closed coding agents.
