10 AI Harnesses. One Job. Who Will Win?

The Setup
A simple question. Which AI coding harness would actually ship code I’d put in front of a live trading system? Not the one that demos best on YouTube. Not the one with the prettiest agent graph. The one that gets the job done… when getting it wrong costs me real money.
I have been building a production trading platform for the past six months. It runs on SST3-AI-Harness/framework (Single Source of Truth v3), though not the same SST3 throughout. Mine kept evolving, same as every other harness out there, as the models kept moving underneath us. What held was the shape: a five-stage workflow and the Ralph review trio (Haiku, Sonnet, Opus, in that order, restart from the top on any fail). It works. It ships. It runs.
But every other week, some new agent framework shows up in my feeds. CrewAI! LangGraph! Smolagents! AutoGen-renamed-to-MAF! Real frameworks, all of them. Built by people who know what they’re doing, working fine in their own studies and edge cases. I’d been ignoring them (my harness/framework already works, why fix what isn’t broken?) but the question kept nagging: would any of them actually fit my use case? My way of working? My quality bar?
So I figured… fine. Let’s actually find out. Properly.
The Itch
I want to know which framework actually ships. So I am running a bake-off.
Same brief. Same clock. Same scorecard. Ten harnesses, all aimed at the same job: build one autonomous controller feature for the production trading platform. The winner’s code goes into the live system. The losers go on the shelf.
(Yes, this is going to take weeks. Yes, I know I could have just picked Claude Agent SDK and called it done. No, that’s not the point.)
The point is I want to KNOW. Not “the docs look nice”, not “the discord seems active”, not “this guy on Twitter swears by it”. I want to know which one ships. And the only way to know is to put them in the same room, give them the same job, and see what falls out the other side.
The 10 Contestants
Three lean floors. Four heavyweights. One Anthropic-native floor. One late entrant. One home-grown harness/framework that sets the bar. All in.
The lean floors are Smolagents (Hugging Face’s minimalist runtime, the “if you can’t beat THIS, why are you adding any abstraction at all” floor), pydantic-ai (Pydantic’s bet on type-safe agent purity), and OpenAI Agents SDK (handoff-flavoured, the SDK successor to OpenAI’s earlier agent experiments).
The heavyweights are CrewAI (multi-agent crews with explicit personas), LangGraph (stateful directed-graph orchestration, currently the LinkedIn buzzword leader), Google ADK (OpenTelemetry-native, the only one that natively pretends to care about observability), and MAF (Microsoft Agent Framework, the successor to AutoGen and Semantic Kernel, Microsoft’s enterprise pitch).
The Anthropic-native floor is Claude Agent SDK. As close to the model as you can get without writing your own loop. It is the baseline: what raw Claude does with no framework wrapped around it.
The late entrant is Agno. Python-first, performance-focused, showed up after the lineup was already drafted, made the cut anyway.
And the home-grown one is SST3-AI-Harness (Single Source of Truth v3). My own. Built from first principles before I knew LangChain or CrewAI existed, on top of twenty years of project management and engineering scar tissue. Currently runs both my production trading platform AND this very blog you are reading. Two production systems on the same harness/framework, two completely different domains… so the “domain-agnostic” claim is real, not theoretical. (I have written more about how SST3 reshapes itself per task in an earlier SST3 deep dive, if you want it.)
Two of these have a special role in the scoring.
SST3 is the bar. Meaning it is what already runs my live trading platform today, and what every other contestant has to beat to take its place. If a contestant ships better code than SST3 under the same brief, SST3 gets dethroned. Simple.
Claude Agent SDK is the baseline. Meaning it is the bare minimum: raw Claude with no framework wrapped around it at all, just the SDK. Every contestant should at least be able to beat THAT (otherwise the framework is making things actively worse, which would be a damning result for any framework that calls itself a framework).
So there are three possible outcomes per contestant. They could land above the bar (beat SST3, in which case SST3 gets retired and they take its seat). They could land between baseline and bar (beat raw Claude but lose to SST3, in which case they are a working framework but not better than what I already have). Or they could land below the baseline (lose even to raw Claude with no wrapper, which means the framework is actively making things worse, which would be embarrassing for whoever shipped it).
Most should land in the middle. The interesting questions are whether any of them hit the top, and whether any of them sit at the bottom.
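For the code-minded, the three bands can be sketched in a few lines of Python. This is a minimal sketch, not the locked rubric: it assumes each run collapses to a single aggregate score, that SST3 scores higher than raw Claude, and that a tie counts as "did not beat" (the real tiebreakers live in the methodology). The names `Outcome` and `classify` are made up for illustration.

```python
from enum import Enum


class Outcome(Enum):
    ABOVE_BAR = "above the bar"
    BETWEEN = "between baseline and bar"
    BELOW_BASELINE = "below the baseline"


def classify(score: float, baseline: float, bar: float) -> Outcome:
    """Place one contestant relative to the baseline (raw Claude) and the bar (SST3).

    Assumption: a tie does not count as beating either reference point.
    """
    if baseline > bar:
        # The three-band picture only makes sense if the bar sits above the baseline.
        raise ValueError("expected the bar to score at least as high as the baseline")
    if score > bar:
        return Outcome.ABOVE_BAR
    if score > baseline:
        return Outcome.BETWEEN
    return Outcome.BELOW_BASELINE
```

With illustrative numbers only, `classify(6.5, baseline=5.0, bar=8.0)` falls in the middle band, which is where I expect most contestants to land.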
The Rules
A bake-off is only worth reading if it is fair. The setup is deliberately, deliberately, deliberately boring on that front.
Same input. Every harness gets the same frozen brief, packaged as the same tarball. No live editing. No follow-up clarifications. What is in the tarball is what each harness sees… and that’s it.
Tokens logged, quality scored. I track every token each harness consumes (some are token-pigs by design and that is fine). I do not score on token count. I score on whether the harness actually delivered, and how well. The question is whether the harness holds water under a real production brief, not whether it uses fewer tokens than its rivals.
One shot. No second attempts on the same brief. No “let me re-prompt and try again”. Whatever the harness produces on its first and only run is what gets scored. The first run is the last run.
Anti-fab gate. Code that compiles but does not do what the spec says fails on a separate axis. A harness cannot win by producing plausible nonsense (the AI agent equivalent of lying with a straight face). Fail-loud beats fail-quiet, every time.
Eleven-column scorecard. Code quality, observability, verification, monitoring, trading safety, deployability, documentation, anti-fabrication, plus three weighted aggregates. Pareto sanity check. Documented tiebreakers, written down before the first run.
Cooling-off and rest days. Fixed running order, set in advance, with rest days between heavy runs to keep MY OWN measurement quality steady. (I am the dumbest part of this experiment. The bake-off is also a measurement of how well I judge the bake-off, and that needs sleep too.)
Full methodology lives in the public bake-off repo. Locked before the first run starts. So I cannot move the goalposts mid-experiment, no matter how much I might want to.
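To make the scorecard rules a little more concrete, here is a rough Python sketch of one way the anti-fab gate, a weighted aggregate, and the Pareto sanity check could fit together. The eight axis names come from the scorecard above. Everything else is an assumption for illustration: the 0 to 10 scale, the `ANTI_FAB_MIN` threshold, the weights, and the `harness_a`/`harness_b`/`harness_c` placeholders. They are not contestants and not results.

```python
AXES = (
    "code_quality", "observability", "verification", "monitoring",
    "trading_safety", "deployability", "documentation", "anti_fabrication",
)

ANTI_FAB_MIN = 5.0  # hypothetical gate threshold, not the real rubric value


def passes_anti_fab_gate(card: dict[str, float]) -> bool:
    """Assumed gate: a low anti-fabrication score disqualifies a run outright."""
    return card["anti_fabrication"] >= ANTI_FAB_MIN


def weighted_aggregate(card: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean across the eight axes (the weights are placeholders)."""
    total = sum(weights[axis] for axis in AXES)
    return sum(weights[axis] * card[axis] for axis in AXES) / total


def dominates(a: dict[str, float], b: dict[str, float]) -> bool:
    """True if a is at least as good as b on every axis and better on at least one."""
    return all(a[x] >= b[x] for x in AXES) and any(a[x] > b[x] for x in AXES)


def pareto_front(cards: dict[str, dict[str, float]]) -> list[str]:
    """Names of the runs that no other run dominates."""
    return sorted(
        name for name, card in cards.items()
        if not any(dominates(other, card)
                   for other_name, other in cards.items() if other_name != name)
    )


if __name__ == "__main__":
    cards = {
        "harness_a": dict.fromkeys(AXES, 7.0),
        "harness_b": {**dict.fromkeys(AXES, 6.0), "observability": 9.0},
        "harness_c": dict.fromkeys(AXES, 5.0),
    }
    print(pareto_front(cards))  # ['harness_a', 'harness_b']
```

The idea behind the sanity check, as sketched here, is that a run which loses to some rival on every single axis should never come out on top of an aggregate. If it does, the weights are wrong, not the harness.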
What’s at Stake
The winner does not get a trophy. The winner’s code goes into the next round of production controller enhancement. On a system that already trades against a real brokerage. With real money.
That is why the rubric is the way it is. If a harness ships impressive-looking code that fails on observability or trading safety, it is not useful to me. The measurement is “would I put this in production tomorrow”, not “did the demo run and look pretty in the YouTube video”.
It also means the cost of getting this wrong is real. I am not running this for fun (well, mostly). The result of the bake-off changes which harness/framework I bet on for the next year of work. Possibly longer.
I am writing this comparison blog post the long way: by actually building production code with all ten harnesses.
Different angle. Same question. Better answer.
Watch This Space
The first run starts shortly. Per-harness postmortems publish here as one arc once all ten runs are done… not in dribs and drabs. Bookmark the site. Grab the RSS feed (yes, RSS, like it’s 2007). Or just check back when the dust has settled.
The full methodology, the scorecard, the running order, and the rationale behind every design choice are all locked in the public bake-off repo. If you want to reproduce the experiment yourself, the templates will be there once all ten runs are finished.
Until then… ten harnesses, one job, may the best one win.
Watch this space as the battle begins.