Shipping machine learning to production: what actually matters

A lot of machine learning projects look successful on a slide and quietly die three months after launch. They were never really in production. They were merely running there, with nothing in place to measure, monitor, or maintain them, and that is not the same thing.

After enough ML engagements, the failure modes start to repeat. Here's the shortlist of what actually decides whether an ML system lasts.

1. The evaluation harness is the product

If you can't measure model quality on a fresh batch of data in under an hour, you can't iterate, and you can't catch regressions before users do. The evaluation harness — golden datasets, offline metrics tied to business outcomes, a one-command rerun — is the most undervalued part of any ML stack. Build it first, before the model.
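
As a minimal sketch of what "one-command rerun" can look like, the script below scores a model against a golden dataset and fails when accuracy drops below a recorded baseline. Everything named here is an assumption for illustration: `evaluate`, `EvalResult`, `toy_model`, the tiny `GOLDEN` set, and accuracy as the metric. A real harness would load a versioned dataset and use metrics tied to the business outcome.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class EvalResult:
    accuracy: float
    n_examples: int
    passed: bool


def evaluate(
    predict: Callable[[str], str],
    golden: Sequence[Tuple[str, str]],
    baseline_accuracy: float,
    tolerance: float = 0.01,
) -> EvalResult:
    """Score `predict` on (input, expected) pairs and flag regressions."""
    if not golden:
        raise ValueError("golden dataset is empty")
    correct = sum(1 for text, expected in golden if predict(text) == expected)
    accuracy = correct / len(golden)
    # A run passes if it is no worse than the baseline minus the tolerance.
    passed = accuracy >= baseline_accuracy - tolerance
    return EvalResult(accuracy, len(golden), passed)


# Hypothetical golden set and stand-in model, for illustration only.
GOLDEN = [
    ("refund please", "billing"),
    ("app crashes", "bug"),
    ("love it", "praise"),
    ("charged twice", "billing"),
]


def toy_model(text: str) -> str:
    return "billing" if "refund" in text or "charged" in text else "bug"


if __name__ == "__main__":
    result = evaluate(toy_model, GOLDEN, baseline_accuracy=0.75)
    print(result)
    # A non-zero exit code lets CI block a regression.
    raise SystemExit(0 if result.passed else 1)
```

Wiring the exit code into CI is what turns a notebook metric into a gate: a regression blocks the merge instead of surfacing in a user complaint.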

2. Models drift; pipelines rot

Training data ages. Upstream schemas change. Feature distributions shift seasonally. None of this is news, but most production ML systems still don't have monitoring that would catch any of it within a week.

What we ship by default on ML engagements:

Input distribution monitoring on the top features that drive predictions.
Output drift checks that compare the model's predicted distribution this week to a baseline.
Live shadow scoring during model rollouts, so we can compare a candidate to the incumbent on real traffic before switching.
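
One common way to put a number on the first two checks is the population stability index (PSI), which compares a binned baseline sample to a current sample. The sketch below is an illustration rather than a description of any particular monitoring stack: the function name `psi`, the equal-width binning, and the `eps` floor for empty bins are all assumptions. The same function works for one input feature or for the model's output scores.

```python
import math
from typing import List, Sequence


def psi(
    baseline: Sequence[float],
    current: Sequence[float],
    n_bins: int = 10,
    eps: float = 1e-6,
) -> float:
    """Population stability index between two samples of one numeric value."""
    if len(baseline) == 0 or len(current) == 0:
        raise ValueError("both samples must be non-empty")

    # Bin edges come from the baseline so both samples share the same bins.
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / n_bins
    if width == 0:
        width = 1.0  # constant baseline: avoid dividing by zero

    def bin_fractions(values: Sequence[float]) -> List[float]:
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) // width)
            # Values outside the baseline range are clamped into the edge bins.
            counts[min(max(idx, 0), n_bins - 1)] += 1
        # Floor each fraction at eps so the logarithm below is always defined.
        return [max(c / len(values), eps) for c in counts]

    b = bin_fractions(baseline)
    c = bin_fractions(current)
    return sum((ci - bi) * math.log(ci / bi) for bi, ci in zip(b, c))


if __name__ == "__main__":
    last_month = [i / 100 for i in range(100)]
    this_week = [v + 0.5 for v in last_month]  # synthetic shift, for illustration
    print(f"PSI unchanged: {psi(last_month, last_month):.3f}")
    print(f"PSI shifted:   {psi(last_month, this_week):.3f}")
```

A frequently quoted rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift. Those thresholds are a convention, not a guarantee, and are worth tuning per feature.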

3. The boring deployment story is the right one

Containerized inference, a versioned model registry, a CI pipeline that rebuilds the image when training succeeds, a canary deploy, a rollback button that works. Pick whatever flavor of MLOps tooling fits the client's stack. Resist anything that requires its own conference.
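
The canary and rollback pieces can be very small. The sketch below shows one possible shape, under stated assumptions: `route_to_canary` and `should_roll_back` are hypothetical names, the traffic split hashes a request ID, and the rollback rule compares error rates with illustrative thresholds. In practice this logic often lives in a load balancer or service mesh rather than in application code.

```python
import hashlib


def route_to_canary(request_id: str, canary_fraction: float) -> bool:
    """Deterministically send a stable slice of traffic to the canary."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the hash onto 10,000 buckets; the same ID always gets the same bucket.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < canary_fraction * 10_000


def should_roll_back(
    canary_errors: int,
    canary_requests: int,
    stable_errors: int,
    stable_requests: int,
    max_ratio: float = 1.5,
    min_requests: int = 500,
    rate_floor: float = 0.001,
) -> bool:
    """Roll back when the canary error rate clearly exceeds the stable one."""
    if canary_requests < min_requests or stable_requests == 0:
        return False  # not enough evidence yet
    canary_rate = canary_errors / canary_requests
    stable_rate = stable_errors / stable_requests
    # The floor keeps a near-zero stable rate from triggering on a single error.
    return canary_rate > max(stable_rate * max_ratio, rate_floor)


if __name__ == "__main__":
    ids = [f"req-{i}" for i in range(10_000)]
    share = sum(route_to_canary(r, 0.10) for r in ids) / len(ids)
    print(f"canary share: {share:.3f}")
    print(should_roll_back(30, 1_000, 100, 10_000))
```

Hashing the request ID instead of sampling at random keeps routing reproducible, which makes a bad canary response much easier to trace afterwards.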

4. Failure modes are product features

What happens when the model service is down? When confidence is too low to act on? When a prediction is obviously wrong? These aren't edge cases to handle in a follow-up — they're product decisions, and they need to be designed alongside the model, not bolted on later.
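
One way to make these decisions explicit is to wrap every model call so that each failure mode has a named, designed outcome. The sketch below is a minimal illustration, not a prescribed design: `decide`, `Decision`, the `source` labels, and the 0.8 confidence threshold are all assumptions, and `score` and `fallback` stand in for the real model client and the agreed product behavior.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class Decision:
    label: Optional[str]
    source: str  # "model", "fallback_low_confidence", or "fallback_unavailable"


def decide(
    features: dict,
    score: Callable[[dict], Tuple[str, float]],
    fallback: Callable[[dict], Optional[str]],
    min_confidence: float = 0.8,
) -> Decision:
    """Return the model's answer, or a designed fallback when it can't be trusted."""
    try:
        label, confidence = score(features)
    except Exception:
        # Sketch only: production code would catch the client's specific
        # timeout and connection errors rather than every exception.
        return Decision(fallback(features), "fallback_unavailable")
    if confidence < min_confidence:
        return Decision(fallback(features), "fallback_low_confidence")
    return Decision(label, "model")


if __name__ == "__main__":
    def flaky_model(features: dict) -> Tuple[str, float]:
        raise TimeoutError("model service did not respond")

    def route_to_human(features: dict) -> Optional[str]:
        return "manual_review"

    print(decide({"amount": 120}, flaky_model, route_to_human))
```

Recording `source` on every decision also gives monitoring a direct count of how often the product is running on its fallback path.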

5. Hand it back cleanly

Consulting projects fail at handoff more often than they fail on the ML itself. If we hand off a system the client can't operate without us, we've shipped a dependency, not a product. Documentation, runbooks, on-call training, and a real transition window are part of the engagement, not extras.

This is roughly the shape of every ML implementation we take on at Voight Labs: model, evaluation, monitoring, deployment, handoff. The model is usually the smallest part of the work.

If you're putting an ML system into production and want a second pair of senior engineers on it, let's talk.