How it's tested | Baseweight Head-to-head

What this measures

The head-to-head compares Qwen3-8B with a task-specific LoRA adapter against a hosted API baseline (zero-shot and 5-shot) on tasks representative of common production AI work. It measures accuracy on the task, total cost of ownership at production volume, and error composition.

Every result is reproducible. Training code, evaluation scripts, model weights, and prediction logs are public. The head-to-head is ongoing: new tasks and model families are added as runs land.

Tasks

All datasets are public and version-pinned. Test sets are held out from all fine-tuning.

Task	Dataset	What it tests	Metric	Test size
Contract clause extraction	CUAD	Given a commercial contract, identify whether a specific clause type appears and extract the relevant text, or correctly report its absence. 41 clause categories across 510 contracts.	AUPR (+ Token F1)	500 questions
Customer support routing	Banking77	Given a short customer message, assign the correct intent label from a closed set of 77 banking-specific categories.	Weighted F1	500

Models

Open-weight (QLoRA fine-tuned): Qwen3-8B (8B parameters, Apache 2.0 licence; open-weight rather than OSI-open-source, since the training data is closed). Model files are revision-pinned SafeTensors with published hashes. Served with vLLM on a single NVIDIA RTX 3090.

Hosted API (prompted): GPT-5.4 Mini (the cost-efficient API tier a team would deploy for a task at this volume) tested zero-shot and 5-shot at standard API pricing. Other vendors’ models, and flagship or reasoning tiers where a task’s latency and cost budget justify them, are being added. The per-query cost reflects measured token usage at the time of the run; it updates automatically when the head-to-head reruns.

Why a decoder for a classification task? For a single closed-set task like Banking77 in isolation, a small encoder classifier (a fine-tuned DeBERTa, or SetFit on a few hundred labels) is often the more accurate and cheaper-to-serve tool. Encoder and other non-decoder baselines are being added per task. We report a decoder here because most deployments run a portfolio of tasks: one open base model with per-task LoRA adapters, hot-swapped or served concurrently, covers classification, extraction, and generation on a single inference stack, frequently the better fleet-level trade even when an encoder wins a task on its own.

Conditions

Condition	Applied to	What it means
Zero-shot	Hosted API	Task instruction only, no examples in the prompt. The model relies on pre-training.
5-shot	Hosted API	Five labelled examples prepended to every query. No weights updated.
LoRA	Qwen3-8B	QLoRA fine-tuning on the task training set. A small adapter layer is trained; the base model weights are unchanged.

Metrics

Task accuracy. For extraction (CUAD), the primary metric is AUPR, area under the precision–recall curve for clause detection, the metric the CUAD authors report and the right one when most clause types are absent from any given contract. Token F1 (token-level overlap, partial credit for near-correct extractions) is reported alongside as a secondary, partial-credit check. Weighted F1 for classification (accounts for label imbalance across all categories).

Deployment cost. Presented as an interactive total-cost-of-ownership calculator on the head-to-head page, since real cost depends on query volume and the hardware you deploy on. The hosted API is usage-priced from measured input and output token counts at the published per-token rate. The self-hosted model is priced as a GPU reservation at a rate you choose (on-prem, A40, A10G, or H100), scaled to the number of GPUs your volume needs, plus one-time fine-tuning. The head-to-head ran on an RTX 3090; the calculator lets you model your own setup.

Error analysis

Every prediction is classified into one error category by deterministic rules alone.

Clause extraction (CUAD) uses: correct (token F1 ≥ 0.9, including correctly reporting "not found" when the clause is absent), partial (F1 between 0.5 and 0.9), hallucinated (extracted text that does not appear in the source contract), and format violation.

Support routing (Banking77) uses: correct and wrong class. As a closed-vocabulary classification task, there are no other failure modes; every prediction maps to a valid label.

Training configuration

All QLoRA fine-tuning uses Unsloth on a single NVIDIA RTX 3090 (24 GB). Hyperparameters for Banking77 were selected by a 6-trial grid sweep over learning rate × LoRA rank, ranked by validation loss only, no test metrics were examined at selection time. CUAD hyperparameters were transferred from the Banking77 sweep winner, with rank increased from 4 to 8 to accommodate span extraction, and early-stopping patience increased from 1 to 2 to account for the different loss profile of extraction tasks.

Hyperparameters

Parameter	Banking77	CUAD
Quantization	4-bit NF4
LoRA rank	4	8
LoRA alpha	8	16
RSLoRA	Yes
LoRA dropout	0.05
Target modules	All linear layers
Learning rate	5e-5 (constant after 5% warmup)
Optimizer	AdamW 8-bit
Weight decay	0.01
Max gradient norm	1.0
Effective batch size	16
Per-device batch / accumulation steps	4 / 4	2 / 8
Max sequence length	1,024 tokens	2,560 tokens
Precision	BF16
Seed	42
Max epochs	3 (>1,000 training examples)
Early stopping patience	1 eval	2 evals
Checkpoint selection	Best validation loss

Dataset splits

Task	Train	Validation	Test
Banking77	4,996	256	500
CUAD	2,998	400	500 questions (expanded via sliding-window chunking at inference)

Compute budget

Task	GPU	Training time	GPU hours	Cost
Banking77	RTX 3090	84.8 min (early stopped, epoch 1.66 of 3)	1.41	$0.69
CUAD	RTX 3090	132.8 min (early stopped, epoch 1.65 of 3)	2.21	$1.08

Inference configuration

Fine-tuned models are served with vLLM. Temperature is 0 (greedy) for all models, including APIs. Maximum output tokens vary by task type: 32 for classification, 768 for extraction.

Reproduce

Clone the repo, install dependencies, and follow the README.

Repository github.com/baseweight-ai/benchmark

How it’s tested