
You’ve hit the ceiling. We break through it, without hiring an ML team.

Your prompting and RAG are maxed out. The gap between demo and production isn’t a configuration problem. It’s a model quality problem. We diagnose which failure modes are present, then execute the right adaptation to close the gap.

Book a free call

Free 30-minute call first. We’ll tell you if we can help.

$3–5k diagnostic, fixed scope, if it’s the right fit.

See how it works ↓  ·  About the practice →

Built for the technical founder who’s exhausted prompting and RAG and can’t afford six months to hire around it

Right for this if
  • LLM feature shipped to production
  • Prompting and RAG are maxed out
  • Post-training is the obvious next step, but nobody on the team can execute it
  • Quality “feels better” after the last fix, but you can’t prove it
Not right for this if
  • Still prototyping or pre-production
  • Haven’t exhausted prompting and RAG yet
  • Have an in-house ML team that can own post-training
Honest qualifier

You’ve considered post-training, fine-tuning, or switching to open-source weights, but nobody on the team can execute any of it, and hiring someone who can will take six months you don’t have.

If you haven’t hit the ceiling on prompting yet, come back when you have. We’re not the right call before that point.

Book a free call

We’ll tell you within 30 minutes whether we can help. We work under NDA. Your data and architecture stay confidential.

Three ways teams end up building the wrong fix.

01

You fine-tuned. You don’t know if it helped.

The provider returned a training loss curve. Looks like it converged. But did it fix the failure modes that were breaking production? Did it regress on cases you didn’t include in training? Did it overfit? There’s no way to know without evals calibrated to your actual production distribution.

A green loss curve is not a green light.
02

Performance looked solid in testing. It’s degrading in production.

Your evals passed. In production, the model fails on input types that weren’t in your test set: distribution shift, edge cases real users send but testers don’t. The model was optimized for the eval, not the problem.

The model was ready for your eval. It wasn’t ready for your users.
03

No way to prove the fix worked.

Your team ran a fine-tune or rebuilt the RAG pipeline. Performance feels better. But “feels better” isn’t a number, and without evals calibrated to your production failure cases, there’s no defensible way to know whether quality improved or whether the next model update will silently regress it.

Progress by opinion is indistinguishable from no progress at all.

30 minutes to map the failure modes.

Book a free call

You’ve probably tried the alternatives. None of them closed the gap.

Model switching. Fast to test. If you hit the same ceiling on the new model, the failure mode is in the task, not the model family. You’re buying time, not fixing the problem.

Fine-tuning through a provider API. The API call is easy. What you get back is a training loss curve, not a verdict on whether you fixed the failure modes breaking production. You need evals calibrated to your production distribution for that, and building those is most of the work.

Eval platforms. They measure whatever you configure, accurately. The gap: they require you to already know which failure modes matter and how to detect them. If your evals don’t cover your actual production failures, you get a green dashboard on a broken model.

An ML consultant. A good one can do most of what we do: diagnose failure modes, recommend the right technique, build the solution. The difference is measurement. Without a domain benchmark calibrated to your production bar, the consultant leaves and you have no way to know if the next model update holds. Six months later there’s a regression and you start over.

We fix the model. Not the prompt.

Three cases. Wrong intervention each time.

Without the diagnostic: a RAG refactor that doesn’t close the gap, followed by a fine-tuning engagement that should have come first.

What teams see

Domain extraction performance degrades after a model update or provider change. The assumption is that chunking strategy or the embedding model is to blame: retrieval is the obvious variable.

What the diagnostic finds

Failure analysis shows systematic gaps in domain terminology handling that aren’t retrieval-dependent. The model lacks weight-level knowledge of your domain entities. Better chunking won’t close this gap. Only training signal will.

Technique that follows

LoRA adapter with a custom eval set covering domain-specific entities, edge cases, and adversarial inputs. Eval-gated deployment blocks any release that regresses below threshold, making model swaps survivable without emergency retraining.
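The eval-gated deployment step can be sketched in a few lines. This is an illustrative Python sketch, not the engagement deliverable: the slice names and thresholds below are invented placeholders for whatever your diagnostic defines.

```python
# Illustrative sketch of an eval gate: every candidate model (including a
# provider's silent update) is scored per failure-mode slice, and any slice
# below its agreed must-pass threshold blocks the release.
# Slice names and thresholds here are placeholders, not real deliverables.

MUST_PASS = {
    "domain_entities": 0.92,   # hypothetical bar per failure mode
    "edge_cases": 0.85,
    "adversarial": 0.80,
}

def gate(candidate_scores):
    """Return (deployable, failing_slices) for a candidate model's scores."""
    failing = sorted(
        name for name, bar in MUST_PASS.items()
        if candidate_scores.get(name, 0.0) < bar
    )
    return (not failing, failing)

# A model swap that quietly regresses on adversarial inputs is blocked:
ok, failing = gate({"domain_entities": 0.95, "edge_cases": 0.88, "adversarial": 0.71})
print(ok, failing)  # False ['adversarial']
```

The point of the gate is that "deployable" becomes a computed verdict, not an opinion: a model swap either clears every must-pass bar or it doesn't ship.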


Without the diagnostic: continuous patching against an eval that doesn’t reflect real inputs, shipping updates that pass tests but keep failing users.

What teams see

Performance looked solid on internal testing. In production, accuracy slips on specific input types, after a model provider update, or as data distribution shifts over time. Nobody can explain why, because nobody defined what “working” looks like in production.

What the diagnostic finds

The test set was built from clean, well-formed examples, not the ambiguous, edge-case, adversarial inputs that production traffic actually contains. The model was optimized for the eval, not for the problem. There’s no baseline to measure regression against.

Technique that follows

Eval-first: production failure cases classified and used to construct a representative test set, must-pass thresholds agreed before any technique decision is made. Often reveals the fix is smaller than assumed, or different than planned.
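As a rough sketch of what "eval-first" means mechanically (all names here are illustrative, and it assumes production failures have already been labeled by failure mode): the test set is sampled per failure mode, so coverage tracks what real traffic actually breaks rather than the clean examples testers tend to write.

```python
# Rough sketch of eval-first test-set construction. Assumes production
# failures have already been classified with a failure-mode label; the
# eval set then samples per mode so coverage mirrors real failure
# patterns instead of hand-picked demo inputs.

import random

def build_eval_set(labeled_failures, per_mode=25, seed=0):
    """labeled_failures: iterable of (input_text, failure_mode) pairs."""
    rng = random.Random(seed)   # fixed seed keeps the set reproducible
    by_mode = {}
    for text, mode in labeled_failures:
        by_mode.setdefault(mode, []).append(text)
    return {
        mode: rng.sample(cases, min(per_mode, len(cases)))
        for mode, cases in by_mode.items()
    }
```

Must-pass thresholds are then agreed per failure mode before any technique decision, and every subsequent change is judged against this set rather than against whichever samples someone eyeballed that day.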


Without the diagnostic: a fine-tune trained against a flawed eval set, shipping a model that passes tests but still fails in production.

What teams see

Structured field extraction passes every stakeholder demo but fails on real-world inputs. Quality is assessed by eyeballing samples: no formal definition of correct, no measurement, no baseline.

What the diagnostic finds

Without a reliable eval set, there’s no reliable measurement, and without measurement, there’s no defensible technique decision. Failure pattern analysis classifies which inputs fail and why. The problem is often more tractable than it looks once it’s actually measured.

Technique that follows

Eval-first: adversarial and representative failure cases classified, must-pass gates defined, acceptance criteria agreed before any training begins. The technique follows from what the eval reveals, not from what was planned in advance.


One of these patterns is in your production data. Thirty minutes to find out which one.

Book a free call

Diagnose first. Build second. Hand over everything.

01

Failure diagnosis

We go through your production failure logs and separate systematic failures from noise: only systematic failures respond to training intervention. Each failure mode gets mapped to the fix most likely to close it, with decision rationale. The output is a written report: what’s failing, why it’s systematic, and what fixing it will cost. You see it before we touch anything.

02

Intervention execution

You approve the work before it starts. Target metric, cost, and timeline, all defined before a line of training code runs. We build what the diagnostic prescribed. Not what’s easiest to bill, not what we know best.

03

Complete handoff, including the benchmark

Trained weights, training recipes, deployment configs, adapter templates: all yours. The evals are built from your production failure cases, not the clean examples you had when you started, and ship with everything else. They make the test/prod gap measurable, catch regressions before deployment, and expand as production surfaces new failure modes. Your team runs them. We’re not in the loop unless you want us to be.

Book a free call

Free 30-minute intro call to walk through your failure patterns and whether the diagnostic is the right next step.

The diagnostic outputs are yours.

Evals calibrated to production, not test set artifacts

Every engagement ships evals built from your actual production failures. They tell you whether the intervention worked. They catch regressions before deployment. You know when you’re done. You can prove it.

A compounding benchmark, not a one-time engagement artifact

The eval suite ships calibrated to your production failures and evolves as your data does. Each model update runs through the same gates. Each new failure mode extends the taxonomy. The benchmark lives in your stack and catches regressions before each deployment, not in a consultant’s repo that nobody runs after the engagement ends.
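A minimal sketch of that regression gate, assuming per-slice scores are tracked between releases (slice names and the tolerance are illustrative assumptions, not the shipped benchmark):

```python
# Minimal sketch of the per-update regression check: the candidate's
# per-slice scores are compared against the current baseline, and any
# drop beyond a small tolerance is surfaced before deployment.
# Slice names and the tolerance are illustrative assumptions.

def find_regressions(baseline, candidate, tol=0.01):
    """Return {slice: (baseline_score, candidate_score)} for real drops."""
    return {
        name: (score, candidate.get(name, 0.0))
        for name, score in baseline.items()
        if candidate.get(name, 0.0) < score - tol
    }

baseline = {"domain_entities": 0.94, "edge_cases": 0.87}
candidate = {"domain_entities": 0.95, "edge_cases": 0.79}
print(find_regressions(baseline, candidate))  # {'edge_cases': (0.87, 0.79)}
```

Because the check runs in your stack on every model update, a provider-side change that silently degrades one slice shows up as a named regression, not as a vague sense that quality slipped.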

No ongoing license, no hosted dependency

Trained weights, evals, training recipes, deployment configs. Yours outright: no inference contract, no hosted eval dependency, no call back to us before you can ship a model update. Your team runs it independently.

One fixed-scope entry point. Then a continuous improvement loop.

  • Diagnostic: entry point; converts to retainer on success. $3–5k
  • Core retainer: continuous improvement loop of intervention execution, benchmark maintenance, new failure patterns, and regression monitoring. $8–15k/mo
  • Scale retainer: multiple models or product lines; buyer typically Series B+ or multi-product. $20–40k/mo

The diagnostic is $3–5k, fixed scope. What comes next, if anything, is defined in the report. Target metric, cost, and timeline are agreed before any build work begins. Retainers build on the diagnostic: benchmark maintenance, new failure patterns closed as your data evolves, regressions caught before each model update.

Book a free call

30 minutes. Free. No commitment.

Common questions

How long does the diagnostic take?

Two to three weeks from kickoff to written report. We need access to your production failure logs and a working session to walk through the failure patterns. The report covers what’s failing, why it’s systematic, and what closing each failure mode will cost, all defined before we propose anything further.

What if fine-tuning isn’t the right fix?

We’ll tell you. If better prompting or a RAG refactor is the answer, we’ll say so, even if that means a smaller engagement or no engagement at all. The recommendation follows the data, not the business case for more work.

Are we committed to anything beyond the diagnostic?

No. The diagnostic deliverable is a written report you own regardless of what you do next. If we recommend fine-tuning and you want to take it to an in-house team or another vendor, that’s fine. If your current approach is adequate, that’s what the report says, even if it means no further engagement.

How do you handle our data and infrastructure?

We work under NDA and can operate within your infrastructure. For fine-tuning, we need access to representative training data. We agree on data handling requirements before any access.

What do we own at the end?

Everything. Trained weights, adapters, evaluation sets, training recipes, deployment configs. No ongoing licensing. No vendor lock-in.

Why not make a senior ML hire instead?

A senior ML hire is right if you need someone in-house long-term. The trade-off: 3–6 months to ramp, $200–300k fully loaded, and they’ll approach the problem through the techniques they know best, which may or may not match your failure modes. We bring cross-vertical diagnosis experience and deliver a written recommendation in two weeks. If you need permanent in-house ML capability, that’s the right long-term call; come to us to scope the problem first.

Why not just use an eval platform?

Eval platforms measure whatever you configure them to measure, accurately and well. The gap is that they require you to already know which failure modes are in your production data and how to detect them. If your eval coverage doesn’t match your production failure distribution, the platform faithfully tells you nothing is wrong. The diagnostic comes before the eval platform, not instead of it.

Why not fine-tune through a provider API ourselves?

The API call is easy. What you get back is a training loss curve. The curve doesn’t tell you whether you fixed the specific failure modes that were breaking production, whether you regressed on cases you didn’t include in training, or whether you overfit to your training examples rather than learning the underlying pattern. You need evals calibrated to your production distribution to answer any of those questions, and building those is the diagnostic.

Why not hire an ML consultant?

A good consultant can do most of what we do: diagnose failure modes, recommend the right technique, build the solution. The difference is measurement. Without a domain benchmark calibrated to your production bar, the consultant leaves and you have no way to know if the next model update holds. The benchmark they built lives in a repo nobody runs. The institutional memory of what was tried is in their notes, not yours. Six months later there’s a regression and you start from zero. The retainer model exists specifically to prevent this: benchmark maintenance, new failure modes closed as production evolves, regressions caught before each model update.

We already fine-tuned and it didn’t help. Why would this be different?

Because the failure mode you were targeting was never defined. You ran a training job: the provider returned a loss curve, you deployed, the problem was still there. The diagnostic starts a step earlier: classifying which failure modes are actually in your production data before any training code runs. Fine-tuning without that step is common. It’s also how teams spend months on the wrong fix.

If you’ve been patching the same problem for two sprints and the gap still isn’t closing: that’s what the diagnostic is for.

Book an intro call to walk through your failure patterns, where performance is breaking down, and whether the diagnostic is the right next step. Thirty minutes. We’ll tell you what’s wrong, or we’ll tell you it isn’t us.

No pitch deck. No sales sequence. Just a technical conversation about your stack.

Book a free call