Skip to main content

How it’s tested

How the benchmark was designed, run, and measured.

What this measures

The benchmark compares Qwen3-8B with a task-specific LoRA adapter against a hosted API baseline (zero-shot and 5-shot) on two tasks representative of common production AI work. It measures accuracy on the task, total cost of ownership at production volume, and error composition.

Every result is reproducible. Training code, evaluation scripts, model weights, and prediction logs are public. The benchmark is ongoing; more tasks and model families will be added.

Independence & funding

The benchmark is self-funded and unsponsored today. Its credibility is the entire point, so the rules are fixed in advance and hold even once that changes:

  • Sponsors fund the work, never the conclusion. A model lab, inference host, or tooling vendor may underwrite a run, but gets no say over the method, the numbers, or how they read.
  • Results are published regardless of outcome, including when an open model loses, or a sponsor’s product comes out behind.
  • Code and methodology stay public. Every result is reproducible from the released adapters, prediction logs, and pipeline, so anyone can check the numbers rather than take them on trust.
  • Any sponsorship is disclosed on the run it applies to.

If a result can’t survive being reproduced by a skeptic, it doesn’t ship.

Tasks

Both datasets are public and version-pinned. Test sets are held out from all fine-tuning.

TaskDatasetWhat it testsMetricTest size
Contract clause extractionCUADGiven a commercial contract, identify whether a specific clause type appears and extract the relevant text, or correctly report its absence. 41 clause categories across 510 contracts.AUPR (+ Token F1)500 questions
Customer support routingBanking77Given a short customer message, assign the correct intent label from a closed set of 77 banking-specific categories.Weighted F1500

Models

Open-source (QLoRA fine-tuned): Qwen3-8B (8B parameters, Apache 2.0 licence). Served with vLLM on a single NVIDIA RTX 3090.

Hosted API (prompted): GPT-5.4 Mini (the cost-efficient API tier a team would actually deploy for a task at this volume, not the pricier flagship or a reasoning model) tested zero-shot and 5-shot at standard API pricing. Other vendors’ models, and flagship or reasoning tiers where a task’s latency and cost budget justify them, are being added. The per-query cost reflects actual token usage at the time of the run; it updates automatically when the benchmark reruns.

Why a decoder for a classification task? For a single closed-set task like Banking77 in isolation, a small encoder classifier (a fine-tuned DeBERTa, or SetFit on a few hundred labels) is often the more accurate and cheaper-to-serve tool. Encoder and other non-decoder baselines are being added per task. We report a decoder here because most deployments run a portfolio of tasks: one open base model with per-task LoRA adapters, hot-swapped or served concurrently, covers classification, extraction, and generation on a single inference stack, frequently the better fleet-level trade even when an encoder wins a task on its own.

Conditions

ConditionApplied toWhat it means
Zero-shotHosted APITask instruction only, no examples in the prompt. The model relies on pre-training.
5-shotHosted APIFive labelled examples prepended to every query. No weights updated.
LoRAQwen3-8BQLoRA fine-tuning on the task training set. A small adapter layer is trained; the base model weights are unchanged.

Metrics

Task accuracy. For extraction (CUAD), the primary metric is AUPR, area under the precision–recall curve for clause detection, the metric the CUAD authors report and the right one when most clause types are absent from any given contract. Token F1 (token-level overlap, partial credit for near-correct extractions) is reported alongside as a secondary, partial-credit check. Weighted F1 for classification (accounts for label imbalance across all categories).

Deployment cost. Presented as an interactive total-cost-of-ownership calculator on the benchmark page, since real cost depends on query volume and the hardware you deploy on. The hosted API is usage-priced from measured input and output token counts at the published per-token rate. The self-hosted model is priced as a GPU reservation at a rate you choose (on-prem, A40, A10G, or H100), scaled to the number of GPUs your volume needs, plus one-time fine-tuning. The benchmark ran on an RTX 3090; the calculator lets you model your own setup.

Error analysis

Every prediction is classified into one error category by deterministic rules, not by judgment.

Clause extraction (CUAD) uses: correct (token F1 ≥ 0.9, including correctly reporting "not found" when the clause is absent), partial (F1 between 0.5 and 0.9), hallucinated (extracted text that does not appear in the source contract), and format violation.

Support routing (Banking77) uses: correct and wrong class. As a closed-vocabulary classification task, there are no other failure modes; every prediction maps to a valid label.

Training configuration

All QLoRA fine-tuning uses Unsloth on a single NVIDIA RTX 3090 (24 GB). Hyperparameters for Banking77 were selected by a 6-trial grid sweep over learning rate × LoRA rank, ranked by validation loss only, no test metrics were examined at selection time. CUAD hyperparameters were transferred from the Banking77 sweep winner, with rank increased from 4 to 8 to accommodate span extraction, and early-stopping patience increased from 1 to 2 to account for the different loss profile of extraction tasks.

Hyperparameters

ParameterBanking77CUAD
Quantization4-bit NF4
LoRA rank48
LoRA alpha816
RSLoRAYes
LoRA dropout0.05
Target modulesAll linear layers
Learning rate5e-5 (constant after 5% warmup)
OptimizerAdamW 8-bit
Weight decay0.01
Max gradient norm1.0
Effective batch size16
Per-device batch / accumulation steps4 / 42 / 8
Max sequence length1,024 tokens2,560 tokens
PrecisionBF16
Seed42
Max epochs3 (>1,000 training examples)
Early stopping patience1 eval2 evals
Checkpoint selectionBest validation loss

Dataset splits

TaskTrainValidationTest
Banking774,996256500
CUAD2,998400500 questions (expanded via sliding-window chunking at inference)

Compute budget

TaskGPUTraining timeGPU hoursCost
Banking77RTX 309084.8 min (early stopped, epoch 1.66 of 3)1.41$0.69
CUADRTX 3090132.8 min (early stopped, epoch 1.65 of 3)2.21$1.08

Inference configuration

Fine-tuned models are served with vLLM. Temperature is 0 (greedy) for all models, including APIs. Maximum output tokens vary by task type: 32 for classification, 768 for extraction.

Reproduce

Clone the repo, install dependencies, and follow the README.

Repository github.com/baseweight-ai/benchmark