Benchmark
A specialist model beats the big platforms on the one task it’s trained for, at a fraction of the cost.
This is the open evidence for that claim. We put a model trained for one narrow job against the big-platform AI a team would otherwise reach for, score both on public data anyone can re-run, and publish all of it. A couple of tasks so far, with more underway. The same test runs on your task.
What it costs to run
Inputs: daily queries and hosting scenario.
| Model | Monthly cost | Annual cost |
|---|---|---|
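As a rough illustration of how monthly and annual figures like these could be derived from a daily query volume, here is a minimal sketch. The per-query price, the fixed hosting fee, and the 30-day month are all assumptions made for the example, not published rates or the page's actual formula; the function names are hypothetical.

```python
def monthly_cost(daily_queries: int,
                 cost_per_query: float,
                 fixed_monthly: float = 0.0,
                 days_per_month: float = 30.0) -> float:
    """Estimate a monthly bill from query volume.

    cost_per_query and fixed_monthly are hypothetical inputs: a hosted API
    would mostly be per-query, while a self-hosted model would mostly be a
    fixed hosting fee. The 30-day month is also an assumption.
    """
    if daily_queries < 0 or cost_per_query < 0 or fixed_monthly < 0:
        raise ValueError("inputs must be non-negative")
    return daily_queries * days_per_month * cost_per_query + fixed_monthly


def annual_cost(monthly: float) -> float:
    """Annual cost as twelve times the monthly estimate."""
    return 12 * monthly


if __name__ == "__main__":
    # Placeholder numbers only, chosen to show the arithmetic.
    per_query_only = monthly_cost(daily_queries=1000, cost_per_query=0.5)
    fixed_fee_only = monthly_cost(daily_queries=1000, cost_per_query=0.0,
                                  fixed_monthly=200.0)
    print(f"per-query pricing: {per_query_only:.2f}/month, "
          f"{annual_cost(per_query_only):.2f}/year")
    print(f"fixed hosting fee: {fixed_fee_only:.2f}/month, "
          f"{annual_cost(fixed_fee_only):.2f}/year")
```

Splitting the estimate into a per-query term and a fixed term lets one function cover both hosting scenarios: set the term that does not apply to zero.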
The full results & how it’s scored
Every model and method, scored
| Model | Method | Metric ↓ |
|---|---|---|
Open-source fine-tuned
Hosted API (cost tier)
Open any row’s Reproduce panel for its exact data, recipe, training health, cost, and verification hashes.
Where the errors go
Reproduce any number
Every result is reproducible from published artifacts; no access to our training code is required. Run the released adapter on the eval data, or recompute the scores from the raw prediction logs.
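A minimal sketch of what the second path might look like: check a downloaded file against its published hash, then recompute a score from a prediction log. Everything specific here is an assumption for illustration: SHA-256 as the hash, a JSON Lines log with `prediction` and `label` fields, and error rate as a stand-in for the lower-is-better metric. The actual artifact formats and metric are the ones listed in each row's Reproduce panel.

```python
import hashlib
import json
from typing import Iterable


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hex SHA-256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path: str, expected_hex: str) -> bool:
    """True if the file's digest matches the published one (case-insensitive)."""
    return sha256_of_file(path) == expected_hex.strip().lower()


def error_rate(log_lines: Iterable[str]) -> float:
    """Fraction of records whose prediction differs from the label.

    Assumes one JSON object per line with hypothetical 'prediction' and
    'label' keys; blank lines are skipped.
    """
    records = [json.loads(line) for line in log_lines if line.strip()]
    if not records:
        raise ValueError("prediction log is empty")
    wrong = sum(1 for r in records if r["prediction"] != r["label"])
    return wrong / len(records)
```

To use it, verify the log file first, then pass the open file to `error_rate` (a file object iterates line by line) and compare the result with the published number.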
These are public-data results. See if it holds for your task.
Score your task →
Already know it fits? Book a free call →