Model switching. Fast to test. If you hit the same ceiling on the new model, the failure mode is in the task, not the model family. You’re buying time, not fixing the problem.
Fine-tuning through a provider API. The API call is easy. What you get back is a training loss curve, not a verdict on whether you fixed the failure modes that are breaking production. You need evals calibrated to your production distribution for that, and building those is most of the work.
Eval platforms. They measure whatever you configure, accurately. The gap: they require you to already know which failure modes matter and how to detect them. If your evals don’t cover your actual production failures, you get a green dashboard on a broken model.
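The gap is concrete: an eval only catches what it encodes. A minimal sketch of the idea, using a hypothetical set of labeled production failure cases and a stand-in `model` callable (all names here are illustrative, not any platform's API), scoring pass rate per failure mode so a miss shows up as a failing mode rather than vanishing into a green aggregate:

```python
from collections import defaultdict

# Hypothetical labeled production failures: each case records the
# failure mode it exercises, the input, and a check on the output.
CASES = [
    {"mode": "refusal", "input": "Summarize this contract clause.",
     "check": lambda out: not out.lower().startswith("i can't")},
    {"mode": "format", "input": "Return the date as YYYY-MM-DD.",
     "check": lambda out: len(out.strip()) == 10},
]

def run_eval(model, cases):
    """Score pass rate per failure mode, not just overall."""
    results = defaultdict(lambda: [0, 0])  # mode -> [passed, total]
    for case in cases:
        out = model(case["input"])
        passed, total = results[case["mode"]]
        results[case["mode"]] = [passed + int(case["check"](out)), total + 1]
    return {mode: p / t for mode, (p, t) in results.items()}

# Stub model for illustration; a real run wires in your deployed model.
stub = lambda prompt: "2024-01-01" if "YYYY" in prompt else "I can't help with that."
print(run_eval(stub, CASES))  # → {'refusal': 0.0, 'format': 1.0}
```

The point of the per-mode breakdown: if a production failure mode has no cases in `CASES`, it simply never appears in the report, which is exactly the green-dashboard-on-a-broken-model trap.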
An ML consultant. A good one can do most of what we do: diagnose failure modes, recommend the right technique, build the solution. The difference is measurement. Without a domain benchmark calibrated to your production bar, once the consultant leaves you have no way to know whether the next model update holds. Six months later there's a regression and you start over.
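Holding the bar is mechanical once the benchmark exists. A minimal sketch, assuming hypothetical per-mode pass rates from the same fixed benchmark run against the current model and a candidate update (the numbers and the 0.95 bar are made up for illustration):

```python
# Hypothetical per-failure-mode pass rates from one fixed benchmark,
# run against the current model and a candidate model update.
current   = {"refusal": 0.97, "format": 0.99, "hallucination": 0.92}
candidate = {"refusal": 0.98, "format": 0.94, "hallucination": 0.92}

PRODUCTION_BAR = 0.95  # the bar each mode must clear in production

def regressions(current, candidate, bar):
    """Modes where the candidate falls below the bar or below the current model."""
    return sorted(
        mode for mode in current
        if candidate[mode] < bar or candidate[mode] < current[mode]
    )

print(regressions(current, candidate, PRODUCTION_BAR))  # → ['format', 'hallucination']
```

Without the benchmark, none of this is runnable: there is nothing to diff the next model against, which is why the regression surfaces six months later instead of before the update ships.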
We fix the model. Not the prompt.