[Avg. reading time: 7 minutes]

Model Serving Types

Model Serving is the process of deploying trained machine-learning models so they can generate predictions on new data.

Once a model is trained and validated, it must be made available to applications, pipelines, or users that need its outputs — whether that’s a batch job scoring millions of records, a web app recommending products, or an IoT stream detecting anomalies.

Model serving sits in the production stage of the MLOps lifecycle, bridging the gap between model development and business consumption.

It ensures models are:

  • Accessible (via APIs, pipelines, or streams)
  • Scalable (able to handle varying loads)
  • Versioned and governed (using registries and lineage)
  • Monitored (for drift, latency, and performance)

In modern stacks (e.g., Databricks, AWS SageMaker, GCP Vertex AI), serving integrates tightly with model registries, feature stores, and CI/CD pipelines to enable reliable, repeatable ML deployment.

Batch Model Serving

Batch serving runs inference on large datasets at scheduled intervals (hourly, nightly, weekly).

  • Input data is read from storage or a database.
  • Predictions are generated for all records.
  • Outputs are written back to storage or a downstream table.

Example: Predicting CO₂ emissions for new cars.

Pros: Efficient, reproducible, simple to schedule.

Cons: Not real-time; predictions may get stale.

Demo:
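
A minimal sketch of what such a batch job could look like, assuming a scikit-learn-style model serialized with joblib and Parquet files for input and output (the paths, file names, and feature columns are hypothetical, not the exact demo code):

```python
# batch_score.py -- hypothetical nightly batch scoring job
import joblib
import pandas as pd

MODEL_PATH = "models/co2_model.pkl"           # assumed trained-model artifact
INPUT_PATH = "data/new_cars.parquet"          # records to score
OUTPUT_PATH = "data/new_cars_scored.parquet"  # downstream table/file

FEATURES = ["engine_size", "cylinders", "fuel_consumption"]  # assumed columns

def main() -> None:
    model = joblib.load(MODEL_PATH)   # load the validated model
    df = pd.read_parquet(INPUT_PATH)  # read the whole batch from storage

    # Score every record in the batch in one pass.
    df["predicted_co2"] = model.predict(df[FEATURES])

    # Write predictions back for downstream consumers (reports, dashboards, etc.).
    df.to_parquet(OUTPUT_PATH, index=False)

if __name__ == "__main__":
    main()
```

A scheduler (cron, Airflow, or a Databricks Job) would trigger a script like this at the chosen interval.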


Real-Time (Online) Model Serving

Real-time serving exposes the model as a low-latency API endpoint. Each request is scored on demand and returned within milliseconds to seconds.

How it works:

  • An application (e.g., web or mobile) calls the API.
  • The model receives input features and returns a prediction immediately.

As discussed in the previous chapter, two common options are:

  • MLflow Serving
  • FastAPI Serving
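
To make the request/response flow concrete, here is a minimal FastAPI sketch; the model path, feature names, and endpoint route are assumptions for illustration, not the previous chapter's exact code:

```python
# serve.py -- hypothetical real-time endpoint (run with: uvicorn serve:app)
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("models/co2_model.pkl")  # assumed trained-model artifact

class CarFeatures(BaseModel):
    engine_size: float
    cylinders: int
    fuel_consumption: float

@app.post("/predict")
def predict(features: CarFeatures) -> dict:
    # Score a single request on demand and return the result immediately.
    row = [[features.engine_size, features.cylinders, features.fuel_consumption]]
    return {"predicted_co2": float(model.predict(row)[0])}
```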

Example:

Credit-card fraud detection, dynamic pricing, personalized recommendations.

Pros: Instant feedback, personalized predictions

Cons: Needs always-on infrastructure, an online feature store, and auto-scaling

Demo
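
As a usage illustration, a caller (web app, mobile backend, etc.) would hit the hypothetical /predict endpoint like this:

```python
# client.py -- hypothetical caller of the real-time endpoint above
import requests

payload = {"engine_size": 2.0, "cylinders": 4, "fuel_consumption": 8.5}

# The prediction comes back within the same request/response cycle.
response = requests.post("http://localhost:8000/predict", json=payload, timeout=5)
print(response.json())  # e.g. {"predicted_co2": ...}
```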


Streaming (Continuous) Model Serving

Streaming serving applies the model continuously to incoming event streams (Kafka, Kinesis, Delta Live Tables).

Instead of single requests, it handles ongoing flows of data.

  • Data arrives in small micro-batches or as events.
  • The model scores each record as soon as it appears.
  • Results are pushed to dashboards, alerts, or storage.

Example:

IoT anomaly detection, clickstream optimization, live sensor analytics.

Pros:

Near real-time, high-throughput, scalable

Cons:

Complex orchestration, harder to monitor and debug.
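
As a rough sketch of the pattern, the loop below scores each Kafka event as it arrives, using the kafka-python client; the broker address, topic names, and feature fields are assumptions. On a Spark/Databricks stack the same idea would usually be expressed with Structured Streaming instead of a hand-rolled consumer loop.

```python
# stream_score.py -- hypothetical continuous scoring of a Kafka event stream
import json

import joblib
from kafka import KafkaConsumer, KafkaProducer

model = joblib.load("models/co2_model.pkl")   # assumed trained-model artifact

consumer = KafkaConsumer(
    "car-events",                             # assumed input topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda m: json.dumps(m).encode("utf-8"),
)

# Score each record as soon as it appears and push the result downstream.
for message in consumer:
    event = message.value
    row = [[event["engine_size"], event["cylinders"], event["fuel_consumption"]]]
    event["predicted_co2"] = float(model.predict(row)[0])
    producer.send("car-predictions", event)   # assumed output topic
```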

#batch #streaming #realtime

Ver 0.3.6

Last change: 2025-12-02