Benchmark

A specialist model beats the big platforms on the one task it’s trained for, at a fraction of the cost.

This is the open evidence for that claim. We put a model trained for one narrow job against the big-platform AI a team would otherwise reach for, score both on public data anyone can re-run, and publish all of it. A couple of tasks so far, with more underway. The same test runs on your task.

How it’s tested → View the code →

What it costs to run

[Interactive cost calculator: inputs are daily queries and hosting scenario; the output table lists monthly cost and annual cost per model.]

The full results & how it’s scored

Every model and method, scored

[Results table: columns are Model, Method, and Metric (lower is better); rows are marked as open-source fine-tuned or hosted API (cost tier).]

Open any row’s Reproduce panel for its exact data, recipe, training health, cost, and verification hashes.
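
As a rough illustration of what checking a verification hash could look like, the sketch below hashes a downloaded artifact and compares it with a published digest. SHA-256 and both function names are assumptions made for illustration; the Reproduce panel is the source of truth for the actual algorithm and values.

```python
import hashlib

def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large artifacts do not need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_published_hash(path, published_hex):
    """True if the file's digest equals the published one (case-insensitive)."""
    # Assumption: the published value is a hex string, possibly with stray whitespace.
    return sha256_of_file(path) == published_hex.strip().lower()
```

A mismatch means the local file is not byte-identical to the published artifact, so any score computed from it is not comparable.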

Where the errors go

Reproduce any number

Every result is reproducible from published artifacts, with no access to our training code required. Run the released adapter on the eval data, or recompute the scores from the raw prediction logs.
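
Recomputing a score from raw prediction logs might look like the sketch below. It assumes a JSON Lines log with hypothetical `prediction` and `label` fields, and uses error rate as a stand-in lower-is-better metric; the published logs and the metric for each task may differ.

```python
import json

def error_rate(log_lines):
    """Fraction of logged records whose prediction differs from the label."""
    total = 0
    wrong = 0
    for line in log_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        total += 1
        # Field names are assumptions; adjust them to the published log schema.
        if record["prediction"] != record["label"]:
            wrong += 1
    if total == 0:
        raise ValueError("no records in log")
    return wrong / total
```

The function accepts any iterable of lines, so an open file handle for a downloaded log can be passed in directly.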

These are public-data results. See if they hold for your task.

Score your task →

Already know it fits? Book a free call →