[Avg. reading time: 7 minutes]
Life Before MLOps
Challenges Faced by ML Teams
Moving Models from Dev → Staging → Prod
Models were often shared as .pkl or joblib files, passed around manually.
Problem: Dependency mismatches (Python, scikit-learn versions) and fragile handoffs.
Stopgap: Packaging models into Docker images, but still manual and inconsistent.
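A minimal sketch of that handoff, assuming a toy scikit-learn model and an invented file name: the artifact is just pickled bytes, with nothing recording the Python or library versions it depends on.

```python
import pickle
import sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Train a toy model on the data scientist's laptop.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Hand-off artifact: just the bytes, no record of Python or scikit-learn versions.
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

# On the serving side (possibly a different machine, months later):
with open("model.pkl", "rb") as f:
    loaded = pickle.load(f)

# If scikit-learn differs between the two environments, this load (or a later
# predict call) can fail or quietly misbehave; nothing in the artifact checks it.
print("serving scikit-learn version:", sklearn.__version__)
print("prediction:", loaded.predict(X[:1]))
```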
Champion vs Challenger Deployment
Teams struggled to test a new (challenger) model against the current (champion).
Problem: No controlled A/B testing or shadow deployments → risky rollouts.
Stopgap: Manual canary releases or running offline comparisons.
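A hedged sketch of the "shadow" comparison teams improvised: the champion answers live requests while the challenger is scored silently and logged for offline comparison. The scoring functions are placeholders, not real models.

```python
import random

def score_champion(features):
    # Placeholder for the model currently serving production traffic.
    return 0.72

def score_challenger(features):
    # Placeholder for the candidate model under evaluation.
    return 0.65

shadow_log = []

def handle_request(features):
    # Champion answers the live request.
    champion_score = score_champion(features)
    # Challenger runs in "shadow": scored and logged, never returned to users.
    challenger_score = score_challenger(features)
    shadow_log.append({"champion": champion_score, "challenger": challenger_score})
    return champion_score

# Simulated traffic; the two models are compared offline afterwards.
for _ in range(100):
    handle_request({"feature": random.random()})

agree = sum(1 for r in shadow_log if abs(r["champion"] - r["challenger"]) < 0.1)
print(f"close agreement on {agree}/{len(shadow_log)} requests")
```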
Model Versioning Confusion
Models saved as model_final.pkl, model_final_v2.pkl, final_final.pkl.
Problem: Nobody knew which version was truly in production.
Stopgap: Git or S3 versioning for files, but no link to experiments/data.
Inference on Wrong Model Version
Even if multiple versions existed, production systems sometimes pointed to the wrong one.
Problem: Silent failures, misaligned experiments vs prod results.
Stopgap: Hardcoding file paths or timestamps — brittle and error-prone.
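A small sketch of the hardcoded-path stopgap; the paths and file names are invented for illustration. Nothing connects the file the service loads to the experiment or data that produced it, so pointing at the wrong version goes unnoticed.

```python
import glob
import os
import pickle

# Serving code quietly pins one file; updating the model means remembering to
# edit this constant (and every other copy of it).
MODEL_PATH = "models/model_final_v2.pkl"   # hypothetical path

def load_production_model():
    with open(MODEL_PATH, "rb") as f:
        return pickle.load(f)

# The "newest file wins" variant is just as brittle: a timestamp says nothing
# about which experiment, data snapshot, or code commit produced the artifact.
def load_latest_model():
    candidates = glob.glob("models/*.pkl")
    newest = max(candidates, key=os.path.getmtime)
    with open(newest, "rb") as f:
        return pickle.load(f)
```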
Train vs Serve Skew (Data-Model Mismatch)
Preprocessing done in notebooks was rewritten differently in prod code.
Problem: The same model behaved differently in production.
Stopgap: Copy-paste code snippets, but no guarantee of sync.
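A hedged illustration of the skew: the notebook standardizes with statistics computed on the training data, while the re-implemented production code scales differently, so the "same" model sees different inputs. The numbers and function names are made up for the example.

```python
import numpy as np

train = np.array([[10.0], [20.0], [30.0], [40.0]])

# Notebook version: standardize using training mean/std (what the model saw).
train_mean, train_std = train.mean(), train.std()

def preprocess_notebook(x):
    return (x - train_mean) / train_std

# Production re-implementation: someone "simplified" it to min-max scaling.
def preprocess_prod(x):
    return (x - train.min()) / (train.max() - train.min())

x = np.array([[15.0]])
print("features at training time:", preprocess_notebook(x))  # ~[[-0.89]]
print("features at serving time: ", preprocess_prod(x))      # ~[[0.17]]
# Same raw input, different model input -> predictions silently diverge.
```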
Experiment Tracking Chaos
Results scattered across notebooks, Slack messages, spreadsheets.
Problem: Couldn’t reproduce “that good accuracy we saw last week.”
Stopgap: Manually logging metrics in Excel or text files.
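A sketch of that manual stopgap: appending each run's metrics to a shared CSV by hand. The column names and values are illustrative; the point is that nothing ties a row back to the exact code, data, or environment that produced it.

```python
import csv
import os
from datetime import datetime, timezone

def log_run(path, params, accuracy):
    """Append one experiment's result to a shared CSV (the 'Excel' stopgap)."""
    row = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "params": str(params),
        "accuracy": accuracy,
        # Missing: code commit, data version, environment, random seed ...
    }
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=row.keys())
        if write_header:
            writer.writeheader()
        writer.writerow(row)

log_run("experiments.csv", {"C": 1.0, "max_iter": 200}, accuracy=0.91)
```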
Reproducibility Issues
Same code/data gave different results on different machines.
Problem: No control of data versions, package dependencies, or random seeds.
Stopgap: Virtualenvs, requirements.txt — helped a bit but not full reproducibility.
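A sketch of those partial fixes, assuming an arbitrary seed and package list: pinning dependencies and fixing seeds narrows the gap but leaves data versions, GPU nondeterminism, and OS-level differences uncontrolled.

```python
import random

import numpy as np

# Partial fix 1: pin the environment in requirements.txt, e.g.
#   numpy==1.26.4
#   scikit-learn==1.3.2
# Partial fix 2: fix every random seed you can find.
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
# Frameworks add their own seeds (e.g. torch.manual_seed), and data snapshots,
# GPU kernels, and OS/library differences remain outside this script's control.

sample = np.random.rand(3)
print(sample)  # identical across runs on the same machine and library versions
```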
Lack of Monitoring in Production
Once deployed, no one knew if the model degraded over time.
Problem: Models silently failed due to data drift or concept drift.
Stopgap: Occasional manual performance checks, but no automation.
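A minimal sketch of such a manual check: comparing one live feature's distribution against its training distribution with a two-sample Kolmogorov-Smirnov test from scipy. The data and threshold are illustrative, and before MLOps tooling this kind of check was typically run ad hoc rather than on a schedule.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Feature values the model was trained on vs. values seen in production.
train_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)
live_feature = rng.normal(loc=0.4, scale=1.2, size=5_000)   # drifted

stat, p_value = ks_2samp(train_feature, live_feature)
if p_value < 0.01:
    print(f"possible data drift (KS statistic={stat:.3f}, p={p_value:.2g})")
else:
    print("no drift detected on this feature")
```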
Scaling & Performance Gaps
Models trained in notebooks failed under production loads.
Problem: Couldn’t handle large-scale data or real-time inference.
Stopgap: Batch scoring jobs on cron — but too slow for real-time use cases.
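A sketch of the cron-driven stopgap: a script that loads the pickled model, scores a nightly dump, and writes results back. File names are invented; a crontab entry like the one in the comment ran it overnight, which works for reports but never for per-request latency.

```python
# batch_score.py -- run nightly via cron, e.g.:
#   0 2 * * * /usr/bin/python3 /opt/jobs/batch_score.py
import pickle

import pandas as pd

MODEL_PATH = "models/model_final_v2.pkl"     # hypothetical artifact
INPUT_PATH = "exports/customers_latest.csv"  # hypothetical nightly dump
OUTPUT_PATH = "exports/scores_latest.csv"

with open(MODEL_PATH, "rb") as f:
    model = pickle.load(f)

features = pd.read_csv(INPUT_PATH)
features["score"] = model.predict_proba(features)[:, 1]
features.to_csv(OUTPUT_PATH, index=False)
# Scores are already hours old when anyone reads them; no path to real-time use.
```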
Collaboration Breakdowns
Data Scientists, Engineers, Ops worked in silos.
Problem: Miscommunication → wrong datasets, broken pipelines, delays.
Stopgap: Jira tickets and handovers — but still slow and error-prone.
Governance & Compliance Gaps
No audit trail of which model made which prediction.
Problem: Risky for regulated domains (finance, healthcare).
Stopgap: Manual logging of predictions — incomplete and unreliable.
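A sketch of that hand-rolled audit trail: one JSON line per prediction with whatever identifying information the service happened to have. Field names are illustrative, and the model_version field is only as trustworthy as whatever string someone remembered to bump.

```python
import json
from datetime import datetime, timezone

def log_prediction(request_id, features, prediction, model_version,
                   path="predictions.log"):
    """Append an audit record for one prediction (hand-rolled, easy to forget)."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "request_id": request_id,
        "model_version": model_version,   # free text, not tied to any registry
        "features": features,
        "prediction": prediction,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

log_prediction("req-001", {"age": 42, "income": 51000},
               prediction=1, model_version="final_v2")
```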