Every AI shop has a tool they’re trying to sell you. We don’t have one.
Fine-tuning shops recommend fine-tuning. RAG consultancies recommend RAG. Prompt engineers recommend more prompting. The recommendation follows from their default, not your production failures.
The result: months committed to the wrong approach. Then the client starts over, if they can justify it internally.
Baseweight exists to fix the sequence. Classify the failure first. Then, and only then, prescribe and execute the right intervention. The technique follows from what the failure analysis shows: retrieval, weights, pipeline design, or data quality. There’s no default approach because the failure mode determines the fix.
Every engagement produces artifacts you own outright: trained weights, training recipes, eval sets, deployment configs. The benchmark ships with you. Retainers are available, but the goal is never dependency.
What We Believe
Diagnosis before prescription
The technique should follow from the problem, not the other way around.
Ownership over dependency
You should own your model, your weights, and your methodology. Full stop.
Patterns over bespoke
Each engagement extends a failure taxonomy built across clients and verticals. The second time we see the same failure pattern in a vertical, the diagnosis is faster and the recommendation is more grounded. We don’t start from zero.
Honesty over revenue
If your problem is solvable with better prompting, we’ll tell you. We’d rather earn trust than manufacture scope.
Founded by Philip Stevens
15 years in applied ML. Production work at Agoda, building personalization and recommendation systems at scale, and at Quantcast, managing the end-to-end ML lifecycle for core targeting models: feature engineering, model architecture, and domain drift monitoring.
Independent consultant since 2023. Baseweight was built around a pattern I kept encountering: teams choosing adaptation techniques before diagnosing the problem. The technique followed the firm, not the evidence.
I work across the full stack and select the approach after analyzing failure modes and data characteristics. Your team owns the output outright and can extend it independently.
- Fine-tuning (LoRA, QLoRA, full)
- Eval design & regression harnesses
- RAG pipeline hardening
- DPO alignment
- Agent workflow design
- Inference optimization
- MSc Computer Science, University of Auckland
If you’ve exhausted the obvious fixes, let’s talk.
30 minutes. Free. No commitment. We’ll tell you what’s wrong, or we’ll tell you it isn’t us.
Book a diagnostic