What this measures
The benchmark compares Qwen3-8B with a task-specific LoRA adapter against a hosted API baseline (zero-shot and 5-shot) on two tasks representative of common production AI work. It measures accuracy on the task, total cost of ownership at production volume, and error composition.
Every result is reproducible. Training code, evaluation scripts, model weights, and prediction logs are public. The benchmark is ongoing; more tasks and model families will be added.
Independence & funding
The benchmark is self-funded and unsponsored today. Its credibility is the entire point, so the rules below are fixed in advance and will continue to hold if sponsorship is ever accepted:
- Sponsors fund the work, never the conclusion. A model lab, inference host, or tooling vendor may underwrite a run, but gets no say over the method, the numbers, or how they read.
- Results are published regardless of outcome, including when an open model loses or a sponsor’s product comes out behind.
- Code and methodology stay public. Every result is reproducible from the released adapters, prediction logs, and pipeline, so anyone can check the numbers rather than take them on trust.
- Any sponsorship is disclosed on the run it applies to.
If a result can’t survive being reproduced by a skeptic, it doesn’t ship.
Tasks
Both datasets are public and version-pinned. Test sets are held out from all fine-tuning.
| Task | Dataset | What it tests | Metric | Test size |
|---|---|---|---|---|
| Contract clause extraction | CUAD | Given a commercial contract, identify whether a specific clause type appears and extract the relevant text, or correctly report its absence. 41 clause categories across 510 contracts. | AUPR (+ Token F1) | 500 questions |
| Customer support routing | Banking77 | Given a short customer message, assign the correct intent label from a closed set of 77 banking-specific categories. | Weighted F1 | 500 |
Models
Open-source (QLoRA fine-tuned): Qwen3-8B (8B parameters, Apache 2.0 licence). Served with vLLM on a single NVIDIA RTX 3090.
Hosted API (prompted): GPT-5.4 Mini, tested zero-shot and 5-shot at standard API pricing. This is the cost-efficient API tier a team would actually deploy for a task at this volume, not the pricier flagship or a reasoning model. Other vendors’ models, and flagship or reasoning tiers where a task’s latency and cost budget justify them, are being added. The per-query cost reflects actual token usage at the time of the run and updates automatically when the benchmark reruns.
Why a decoder for a classification task? For a single closed-set task like Banking77 in isolation, a small encoder classifier (a fine-tuned DeBERTa, or SetFit on a few hundred labels) is often the more accurate and cheaper-to-serve tool. Encoder and other non-decoder baselines are being added per task. We report a decoder here because most deployments run a portfolio of tasks: one open base model with per-task LoRA adapters, hot-swapped or served concurrently, covers classification, extraction, and generation on a single inference stack. That is frequently the better fleet-level trade-off, even when an encoder wins a task on its own.
Conditions
| Condition | Applied to | What it means |
|---|---|---|
| Zero-shot | Hosted API | Task instruction only, no examples in the prompt. The model relies on pre-training. |
| 5-shot | Hosted API | Five labelled examples prepended to every query. No weights updated. |
| LoRA | Qwen3-8B | QLoRA fine-tuning on the task training set. A small adapter layer is trained; the base model weights are unchanged. |
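The difference between the two prompted conditions can be sketched as a single prompt builder: zero-shot passes no examples, 5-shot passes five. This is a minimal illustration only. The `Message:` / `Intent:` template, the function name, and the example messages are assumptions for the sketch, not the benchmark's actual prompt.

```python
from typing import Sequence, Tuple


def build_prompt(instruction: str, query: str,
                 examples: Sequence[Tuple[str, str]] = ()) -> str:
    """Build a prompt: zero-shot if `examples` is empty, k-shot otherwise.

    The "Message:/Intent:" layout is a hypothetical template.
    """
    parts = [instruction.strip()]
    # Labelled examples are prepended before the query; no weights change.
    for text, label in examples:
        parts.append(f"Message: {text}\nIntent: {label}")
    # The query is left open-ended so the model completes the label.
    parts.append(f"Message: {query}\nIntent:")
    return "\n\n".join(parts)


# Made-up messages for illustration (not taken from the dataset).
shots = [
    ("My new card still hasn't shown up.", "card_arrival"),
    ("I think someone stole my card.", "lost_or_stolen_card"),
]
zero_shot = build_prompt("Assign one intent label.", "Where is my card?")
few_shot = build_prompt("Assign one intent label.", "Where is my card?", shots)
```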
Metrics
Task accuracy. For extraction (CUAD), the primary metric is AUPR (area under the precision–recall curve for clause detection), the metric the CUAD authors report and the right one when most clause types are absent from any given contract. Token F1 (token-level overlap between the extracted and reference text) is reported alongside as a secondary, partial-credit check. For classification (Banking77), the metric is weighted F1, which averages per-class F1 weighted by class support and so accounts for label imbalance across all categories.
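The three metrics can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the benchmark's evaluation code: token F1 here uses lowercased whitespace tokens, and AUPR is approximated by average precision, which is one common estimator and may differ from the exact computation used in the released scripts.

```python
from collections import Counter
from typing import Hashable, Sequence


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1. Assumes lowercased whitespace tokenisation."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    if not pred or not ref:
        # Both empty counts as a match (e.g. a correct "not found").
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def weighted_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Per-class F1 averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = n - tp
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        total += n * f1
    return total / len(y_true)


def average_precision(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Average precision as an AUPR estimate. Ignores score ties."""
    positives = sum(y_true)
    if positives == 0:
        return 0.0
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    hits, ap = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            ap += hits / rank  # precision at this recall step
    return ap / positives
```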
Deployment cost. Presented as an interactive total-cost-of-ownership calculator on the benchmark page, since real cost depends on query volume and the hardware you deploy on. The hosted API is usage-priced from measured input and output token counts at the published per-token rate. The self-hosted model is priced as a GPU reservation at a rate you choose (on-prem, A40, A10G, or H100), scaled to the number of GPUs your volume needs, plus one-time fine-tuning. The benchmark ran on an RTX 3090; the calculator lets you model your own setup.
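The two pricing models described above can be sketched as a pair of functions. This is a rough sketch, not the calculator's implementation. Every name and number here is a placeholder, including the per-token rates, the per-GPU throughput, and the 730-hour month. The rule that the GPU count is the ceiling of volume over per-GPU capacity is likewise an assumption about how "scaled to the number of GPUs your volume needs" might be computed.

```python
import math
from dataclasses import dataclass


@dataclass
class ApiPricing:
    """Hypothetical per-token rates, in USD per million tokens."""
    usd_per_million_input: float
    usd_per_million_output: float


def api_cost(queries: int, input_tokens: float, output_tokens: float,
             pricing: ApiPricing) -> float:
    """Usage-priced: cost scales linearly with queries and tokens per query."""
    per_query = (input_tokens * pricing.usd_per_million_input
                 + output_tokens * pricing.usd_per_million_output) / 1e6
    return queries * per_query


def gpus_needed(queries_per_month: int, queries_per_gpu_hour: float,
                hours_per_month: float = 730.0) -> int:
    """Assumed rule: enough reserved GPUs to cover the monthly volume."""
    capacity = queries_per_gpu_hour * hours_per_month
    return max(1, math.ceil(queries_per_month / capacity))


def self_hosted_cost(queries_per_month: int, queries_per_gpu_hour: float,
                     gpu_usd_per_hour: float, months: int,
                     finetune_usd: float,
                     hours_per_month: float = 730.0) -> float:
    """Reservation-priced: GPUs are paid for whether busy or idle,
    plus a one-time fine-tuning cost."""
    n = gpus_needed(queries_per_month, queries_per_gpu_hour, hours_per_month)
    return n * gpu_usd_per_hour * hours_per_month * months + finetune_usd
```

The structural point is that API cost grows linearly with volume, while self-hosted cost grows in steps as each additional GPU is reserved, so the break-even point depends on the volume and hardware rate entered.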
Error analysis
Every prediction is classified into one error category by deterministic rules, not by judgment.
Clause extraction (CUAD) uses: correct (token F1 ≥ 0.9, including correctly reporting "not found" when the clause is absent), partial (F1 between 0.5 and 0.9), hallucinated (extracted text that does not appear in the source contract), and format violation.
Support routing (Banking77) uses: correct and wrong class. Because this is a closed-vocabulary classification task, there are no other failure modes; every prediction maps to a valid label.
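A deterministic classifier along these lines might look like the sketch below. Several details are assumptions, not taken from the released pipeline: the order in which rules are checked, the use of an exact substring test for "appears in the source contract", treating `None` as an unparseable output, and treating 0.5 as inclusive. The text above does not say which category an in-document extraction scoring below 0.5 falls into, so the sketch returns a placeholder for that case.

```python
from typing import Optional


def classify_extraction(token_f1_score: float, predicted: Optional[str],
                        contract_text: str) -> str:
    """Assign one error category to a CUAD prediction.

    `predicted` is None when the model output could not be parsed
    (hypothetical convention); "" means the model reported "not found".
    """
    if predicted is None:
        return "format_violation"
    if token_f1_score >= 0.9:
        # Includes a correct "not found" scored as 1.0 upstream.
        return "correct"
    if predicted and predicted not in contract_text:
        # Assumed test: exact substring match against the contract.
        return "hallucinated"
    if token_f1_score >= 0.5:
        return "partial"
    # Not specified in the text above; placeholder only.
    return "unspecified"


def classify_routing(predicted_label: str, gold_label: str) -> str:
    """Banking77 has exactly two outcomes."""
    return "correct" if predicted_label == gold_label else "wrong_class"
```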
Training configuration
All QLoRA fine-tuning uses Unsloth on a single NVIDIA RTX 3090 (24 GB). Hyperparameters for Banking77 were selected by a 6-trial grid sweep over learning rate × LoRA rank. Trials were ranked by validation loss only; no test metrics were examined at selection time. CUAD hyperparameters were transferred from the Banking77 sweep winner, with two changes: rank increased from 4 to 8 to accommodate span extraction (alpha rises from 8 to 16, staying at twice the rank), and early-stopping patience increased from 1 to 2 to account for the different loss profile of extraction tasks.
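The selection procedure can be sketched as a small grid search that only ever sees validation loss. The grid values are an assumption: a 3 × 2 grid is one way to get six trials, and only the winning learning rate (5e-5) and rank (4) come from this page. `toy_validation_loss` is a stand-in for a real fine-tune-and-validate run and does not represent measured losses.

```python
from itertools import product
from typing import Callable, Dict, List, Sequence, Tuple

Trial = Dict[str, float]


def run_sweep(learning_rates: Sequence[float], ranks: Sequence[int],
              validation_loss: Callable[[float, int], float]
              ) -> Tuple[Trial, List[Trial]]:
    """Evaluate every (lr, rank) pair and pick the lowest validation loss.

    Test metrics are never passed in, so they cannot influence selection.
    """
    trials = [
        {"lr": lr, "rank": rank, "val_loss": validation_loss(lr, rank)}
        for lr, rank in product(learning_rates, ranks)
    ]
    best = min(trials, key=lambda t: t["val_loss"])
    return best, trials


# Hypothetical grid: 3 learning rates x 2 ranks = 6 trials.
LEARNING_RATES = [2e-5, 5e-5, 1e-4]
RANKS = [4, 8]


def toy_validation_loss(lr: float, rank: int) -> float:
    """Stand-in for training plus validation; NOT measured data."""
    return abs(lr - 5e-5) * 1e4 + 0.01 * rank


best, trials = run_sweep(LEARNING_RATES, RANKS, toy_validation_loss)
```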
Hyperparameters
| Parameter | Banking77 | CUAD |
|---|---|---|
| Quantization | 4-bit NF4 | 4-bit NF4 |
| LoRA rank | 4 | 8 |
| LoRA alpha | 8 | 16 |
| RSLoRA | Yes | Yes |
| LoRA dropout | 0.05 | 0.05 |
| Target modules | All linear layers | All linear layers |
| Learning rate | 5e-5 (constant after 5% warmup) | 5e-5 (constant after 5% warmup) |
| Optimizer | AdamW 8-bit | AdamW 8-bit |
| Weight decay | 0.01 | 0.01 |
| Max gradient norm | 1.0 | 1.0 |
| Effective batch size | 16 | 16 |
| Per-device batch / accumulation steps | 4 / 4 | 2 / 8 |
| Max sequence length | 1,024 tokens | 2,560 tokens |
| Precision | BF16 | BF16 |
| Seed | 42 | 42 |
| Max epochs | 3 (>1,000 training examples) | 3 (>1,000 training examples) |
| Early stopping patience | 1 eval | 2 evals |
| Checkpoint selection | Best validation loss | Best validation loss |
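Two relationships in the table are worth checking by hand: the effective batch size, and the adapter scaling implied by RSLoRA. The sketch below encodes only the per-task values from the table. The dictionary layout and function names are illustrative. The scaling formulas are the standard ones: alpha / r for plain LoRA and alpha / sqrt(r) for rank-stabilised LoRA.

```python
import math

# Per-task values copied from the table; key names are illustrative.
TASKS = {
    "banking77": {"lora_rank": 4, "lora_alpha": 8,
                  "per_device_batch": 4, "grad_accum_steps": 4},
    "cuad": {"lora_rank": 8, "lora_alpha": 16,
             "per_device_batch": 2, "grad_accum_steps": 8},
}


def effective_batch_size(cfg: dict, num_gpus: int = 1) -> int:
    """Examples contributing to each optimizer step."""
    return cfg["per_device_batch"] * cfg["grad_accum_steps"] * num_gpus


def lora_scaling(rank: int, alpha: float, rslora: bool = True) -> float:
    """Multiplier applied to the adapter update."""
    return alpha / math.sqrt(rank) if rslora else alpha / rank


for name, cfg in TASKS.items():
    print(name,
          effective_batch_size(cfg),
          round(lora_scaling(cfg["lora_rank"], cfg["lora_alpha"]), 3))
```

Both tasks reach the same effective batch size of 16; CUAD trades a smaller per-device batch for more accumulation steps, which fits its longer sequences into the same 24 GB. Under plain LoRA both configurations would share a scaling of 2.0, whereas with RSLoRA the rank-8 adapter gets a somewhat larger multiplier than the rank-4 one.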
Dataset splits
| Task | Train | Validation | Test |
|---|---|---|---|
| Banking77 | 4,996 | 256 | 500 |
| CUAD | 2,998 | 400 | 500 questions (expanded via sliding-window chunking at inference) |
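Sliding-window chunking, mentioned for the CUAD test set, splits a contract that exceeds the context budget into overlapping windows so that each question is asked against every window. A minimal sketch follows. The window and stride values are free parameters here, since this page does not state the ones used, and the function name is hypothetical.

```python
from typing import List, Sequence


def sliding_windows(tokens: Sequence[int], window: int,
                    stride: int) -> List[List[int]]:
    """Split `tokens` into overlapping chunks of at most `window` tokens.

    Consecutive chunks start `stride` tokens apart, so they overlap by
    `window - stride` tokens. Every token lands in at least one chunk.
    """
    if window <= 0 or not 0 < stride <= window:
        raise ValueError("need window > 0 and 0 < stride <= window")
    if len(tokens) <= window:
        return [list(tokens)]
    chunks = []
    start = 0
    while True:
        chunks.append(list(tokens[start:start + window]))
        if start + window >= len(tokens):
            break  # this chunk already reaches the end of the document
        start += stride
    return chunks
```

Because one question can fan out to several windows, the number of model calls at inference exceeds the 500 test questions; per-window predictions then need to be merged back into one answer per question, a step this sketch does not cover.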
Compute budget
| Task | GPU | Training time | GPU hours | Cost |
|---|---|---|---|---|
| Banking77 | RTX 3090 | 84.8 min (early stopped, epoch 1.66 of 3) | 1.41 | $0.69 |
| CUAD | RTX 3090 | 132.8 min (early stopped, epoch 1.65 of 3) | 2.21 | $1.08 |
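The GPU-hours column is simply training minutes divided by 60. The hourly rate behind the cost column is not stated on this page; a rate of about $0.49 per GPU-hour reproduces both figures, so the sketch below uses that value as an inferred assumption, not a quoted price.

```python
# Inferred from the table (cost / GPU hours); not a stated price.
ASSUMED_USD_PER_GPU_HOUR = 0.49


def gpu_hours(minutes: float, num_gpus: int = 1) -> float:
    """Wall-clock training minutes converted to GPU hours."""
    return minutes / 60 * num_gpus


def training_cost(minutes: float,
                  usd_per_gpu_hour: float = ASSUMED_USD_PER_GPU_HOUR) -> float:
    """One-time fine-tuning cost at a flat hourly GPU rate."""
    return gpu_hours(minutes) * usd_per_gpu_hour


for task, minutes in [("Banking77", 84.8), ("CUAD", 132.8)]:
    print(task, round(gpu_hours(minutes), 2), round(training_cost(minutes), 2))
```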
Inference configuration
Fine-tuned models are served with vLLM. Temperature is 0 (greedy) for all models, including APIs. Maximum output tokens vary by task type: 32 for classification, 768 for extraction.
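These decoding settings amount to a small per-task lookup. The sketch below is a config fragment with hypothetical names. The keys `temperature` and `max_tokens` were chosen to match the keyword arguments of vLLM's `SamplingParams`, but how the benchmark's own scripts pass them is an assumption.

```python
# Output budgets from the text above, keyed by task type.
MAX_OUTPUT_TOKENS = {"classification": 32, "extraction": 768}


def decoding_params(task_type: str) -> dict:
    """Greedy decoding (temperature 0) with a task-specific output cap."""
    if task_type not in MAX_OUTPUT_TOKENS:
        raise ValueError(f"unknown task type: {task_type!r}")
    return {"temperature": 0.0, "max_tokens": MAX_OUTPUT_TOKENS[task_type]}


# With vLLM installed, these could be unpacked as
# SamplingParams(**decoding_params("extraction")).
```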
Reproduce
Clone the repo, install dependencies, and follow the README.
Repository: github.com/baseweight-ai/benchmark