Every AI shop has a tool they’re trying to sell you. We don’t have one.
Fine-tuning shops recommend fine-tuning. RAG consultancies recommend RAG. Prompt engineers recommend more prompting. The recommendation follows from their default, not your production failures.
The result: months committed to the wrong approach. Then the client starts over, if they can justify it internally.
Baseweight exists to fix the sequence. Classify the failure first. Then, and only then, prescribe and execute the right intervention. The technique follows from what the failure analysis shows: retrieval, weights, pipeline design, or data quality. There’s no default approach because the failure mode determines the fix.
Every engagement produces artifacts you own outright: trained weights, training recipes, eval sets, deployment configs. The benchmark ships with you. Retainers are available, but the goal is never dependency.
What We Believe
Diagnosis before prescription
The technique should follow from the problem, not the other way around.
Ownership over dependency
You should own your model, your weights, and your methodology. Full stop.
Patterns over bespoke
Each engagement extends a failure taxonomy built across clients and verticals. The second time we see the same failure pattern in a vertical, the diagnosis is faster and the recommendation is more grounded. We don’t start from zero.
Honesty over revenue
If your problem is solvable with better prompting, we’ll tell you. We’d rather earn trust than manufacture scope.
Founded by Philip Stevens
15 years in applied ML. Production work at Agoda, building personalization and recommendation systems at scale, and at Quantcast, managing the end-to-end ML lifecycle for core targeting models: feature engineering, model architecture, and domain drift monitoring.
Independent consultant since 2023. Baseweight was built around a pattern I kept encountering: teams choosing adaptation techniques before diagnosing the problem. The technique followed the firm, not the evidence.
I work across the full stack and select the approach after analyzing failure modes and data characteristics. Your team owns the output outright and can extend it independently.
- Fine-tuning (LoRA, QLoRA, full)
- Eval design & regression harnesses
- RAG pipeline hardening
- DPO alignment
- Agent workflow design
- Inference optimization
- MSc Computer Science, University of Auckland
If you’ve exhausted the obvious fixes, let’s talk.
30 minutes. Free. No commitment. We’ll tell you what’s wrong, or we’ll tell you it isn’t us.
Book a diagnostic