[Avg. reading time: 7 minutes]

Life Before MLOps

Challenges Faced by ML Teams

Moving Models from Dev → Staging → Prod

Models were often shared as .pkl or joblib files, passed around manually.

Problem: Dependency mismatches (Python or scikit-learn versions) and fragile handoffs.

Stopgap: Packaging models into Docker images, which was still manual and inconsistent.
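A minimal sketch of what that handoff often looked like, assuming a scikit-learn classifier and a hypothetical churn_model.pkl file; the artifact itself carries no record of the environment it was built in:

```python
# Sketch of the manual handoff: dump a trained model to disk and copy the
# file to whoever runs it in production.
import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression().fit(X, y)

# The "handoff" is just a file; nothing records the Python or scikit-learn version.
joblib.dump(model, "churn_model.pkl")

# On the production machine (possibly a different scikit-learn version),
# loading may warn about version mismatch or fail outright.
restored = joblib.load("churn_model.pkl")
print(restored.predict(X[:5]))
```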

Champion vs Challenger Deployment

Teams struggled to test a new (challenger) model against the current (champion) model.

Problem: No controlled A/B testing or shadow deployments → risky rollouts.

Stopgap: Manual canary releases or running offline comparisons.
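A rough illustration of that offline comparison, using synthetic data and arbitrary model choices rather than any team's actual pipeline:

```python
# Sketch of the offline champion-vs-challenger stopgap: score both models on
# the same frozen holdout set and compare a single metric by eye.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_hold, y_train, y_hold = train_test_split(X, y, random_state=0)

champion = LogisticRegression(max_iter=1000).fit(X_train, y_train)             # current prod model
challenger = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)  # candidate

for name, model in [("champion", champion), ("challenger", challenger)]:
    auc = roc_auc_score(y_hold, model.predict_proba(X_hold)[:, 1])
    print(f"{name}: holdout AUC = {auc:.3f}")

# This says nothing about live traffic, latency, or drift, which is why
# rollouts based on it alone were risky.
```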

Model Versioning Confusion

Models saved as model_final.pkl, model_final_v2.pkl, final_final.pkl.

Problem: Nobody knew which version was truly in production.

Stopgap: Git or S3 versioning for files, but no link to experiments/data.
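For the S3 flavor of that stopgap, a sketch (bucket name and key are hypothetical, and this assumes AWS credentials are configured) of why object versioning alone wasn't enough:

```python
# Sketch of the "S3 versioning" stopgap with boto3.
import boto3

s3 = boto3.client("s3")

# One-time: turn on versioning so overwrites keep history.
s3.put_bucket_versioning(
    Bucket="ml-artifacts",  # hypothetical bucket
    VersioningConfiguration={"Status": "Enabled"},
)

# Every upload of the same key creates a new, immutable VersionId...
with open("model_final.pkl", "rb") as f:
    resp = s3.put_object(Bucket="ml-artifacts", Key="churn/model.pkl", Body=f)
print("stored as version:", resp.get("VersionId"))

# ...but the VersionId says nothing about which dataset, commit, or experiment
# produced the artifact, which is exactly the gap left open.
```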

Inference on Wrong Model Version

Even if multiple versions existed, production systems sometimes pointed to the wrong one.

Problem: Silent failures and production results that didn't line up with offline experiments.

Stopgap: Hardcoding file paths or timestamps — brittle and error-prone.
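A sketch of what that hardcoding looked like in a serving script (paths are hypothetical); the "current" model is simply whatever file the code happens to point at:

```python
# Sketch of the hardcoded-path stopgap: a stale path or a missed deploy means
# serving the wrong model version without any error being raised.
import joblib

MODEL_PATH = "/srv/models/churn/2023-06-14/model_final_v2.pkl"  # bumped by hand

model = joblib.load(MODEL_PATH)

def predict(features):
    """Serve predictions from whichever artifact MODEL_PATH points at."""
    return model.predict([features])[0]
```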

Train vs Serve Skew (Data-Model Mismatch)

Preprocessing written in notebooks was rewritten differently in prod code.

Problem: The same model behaves differently in production than it did offline.

Stopgap: Copy-paste code snippets, but no guarantee of sync.
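A toy sketch of the skew itself, with made-up features and constants, showing two "identical" preprocessing functions quietly disagreeing:

```python
# Sketch of train/serve skew: the notebook and the production service each
# implement "the same" preprocessing with different logic and constants.
import numpy as np

def preprocess_notebook(age, income):
    # Training-time version: standardize income using training-set statistics.
    return np.array([age / 100.0, (income - 52_000) / 18_000])

def preprocess_production(age, income):
    # Rewritten later in the service: min-max style scaling, different constants.
    return np.array([age / 120.0, income / 200_000])

row = (42, 61_000)
print("trained on :", preprocess_notebook(*row))
print("served with:", preprocess_production(*row))
# The model sees different feature distributions at serving time,
# so the "same" model behaves differently in production.
```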

Experiment Tracking Chaos

Results scattered across notebooks, Slack messages, spreadsheets.

Problem: Couldn’t reproduce “that good accuracy we saw last week.”

Stopgap: Manually logging metrics in Excel or text files.
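A sketch of that manual logging, assuming an illustrative experiments.csv file and field names:

```python
# Sketch of the manual-tracking stopgap: append each run's parameters and
# metrics to a CSV by hand.
import csv
from datetime import datetime, timezone

def log_run(params, metrics, path="experiments.csv"):
    """Append one experiment's parameters and metrics as a CSV row."""
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([
            datetime.now(timezone.utc).isoformat(),
            params.get("model"),
            params.get("learning_rate"),
            metrics.get("accuracy"),
        ])

log_run({"model": "xgboost", "learning_rate": 0.1}, {"accuracy": 0.87})
# Nothing links this row to the code commit, the data snapshot, or the artifact
# that produced it, so "that good accuracy from last week" stays hard to reproduce.
```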

Reproducibility Issues

Same code/data gave different results on different machines.

Problem: No control of data versions, package dependencies, or random seeds.

Stopgap: Virtualenvs, requirements.txt — helped a bit but not full reproducibility.
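A sketch of the seeds-plus-pinning stopgap; it removes some variance but says nothing about data versions, OS-level libraries, or hardware differences:

```python
# Fix the obvious sources of randomness by hand.
import random
import numpy as np

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
# Frameworks need their own seeds too, e.g. torch.manual_seed(SEED)
# or tf.random.set_seed(SEED) if they are in use.

# Environment pinning was typically a frozen requirements file:
#   pip freeze > requirements.txt
#   pip install -r requirements.txt   # on the "other" machine
```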

Lack of Monitoring in Production

Once deployed, no one knew if the model degraded over time.

Problem: Models silently failed due to data drift or concept drift.

Stopgap: Occasional manual performance checks, but no automation.
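A sketch of what such a manual check might look like: compare recent production inputs against a training-time snapshot, feature by feature (the data and threshold here are illustrative):

```python
# Ad hoc drift check: two-sample Kolmogorov-Smirnov test per feature.
import numpy as np
from scipy.stats import ks_2samp

def drift_report(train_col, prod_col, alpha=0.01):
    """A small p-value hints that the feature's distribution has shifted."""
    res = ks_2samp(train_col, prod_col)
    return {"ks_stat": res.statistic, "p_value": res.pvalue, "drifted": res.pvalue < alpha}

train_income = np.random.default_rng(0).normal(52_000, 18_000, 5_000)
prod_income = np.random.default_rng(1).normal(61_000, 25_000, 5_000)  # shifted
print(drift_report(train_income, prod_income))
# Run by hand from a notebook; nothing schedules it, alerts on it,
# or ties it back to model performance.
```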

Scaling & Performance Gaps

Models trained in notebooks failed under production loads.

Problem: Couldn’t handle large-scale data or real-time inference.

Stopgap: Batch scoring jobs on cron — but too slow for real-time use cases.
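A sketch of that nightly batch job, with hypothetical paths, feature names, and schedule:

```python
# Sketch of the cron-driven batch-scoring stopgap.
# Example crontab entry:  0 2 * * *  /usr/bin/python3 /opt/jobs/score_batch.py
# Predictions are hours old by the time anyone reads them, which is why this
# breaks down for real-time use cases.
import joblib
import pandas as pd

FEATURES = ["age", "income", "tenure_months"]  # hypothetical feature columns

def score_batch(model_path="/srv/models/model.pkl",
                input_path="/data/daily/users.parquet",
                output_path="/data/daily/scores.parquet"):
    model = joblib.load(model_path)
    users = pd.read_parquet(input_path)
    users["score"] = model.predict_proba(users[FEATURES])[:, 1]
    users[["user_id", "score"]].to_parquet(output_path)

if __name__ == "__main__":
    score_batch()
```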

Collaboration Breakdowns

Data Scientists, Engineers, Ops worked in silos.

Problem: Miscommunication → wrong datasets, broken pipelines, delays.

Stopgap: Jira tickets and handovers — but still slow and error-prone.

Governance & Compliance Gaps

No audit trail of which model made which prediction.

Problem: Risky for regulated domains (finance, healthcare).

Stopgap: Manual logging of predictions — incomplete and unreliable.
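A sketch of that manual audit log, with illustrative fields and file path; if the process crashes or the disk fills up, the trail simply has holes, hence "incomplete and unreliable":

```python
# Sketch of the hand-rolled prediction log: one JSON line per prediction.
import json
from datetime import datetime, timezone

def log_prediction(model_version, features, prediction, path="predictions.log"):
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "model_version": model_version,
        "features": features,
        "prediction": prediction,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

log_prediction("model_final_v2.pkl", {"age": 42, "income": 61_000}, 1)
```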

#mlops #development #production

Ver 0.3.6

Last change: 2025-12-02