[Avg. reading time: 0 minutes]
Disclaimer
[Avg. reading time: 4 minutes]
Required Tools
- CLI
- Python (3.11 to 3.13)
- Python Dependency Manager (choose one)
- Code Editor
- VSCode Extension
- Container Engine
Free Cloud Services
| Tool | Purpose | Link |
|---|---|---|
| Databricks Free Edition | ML & Ops | Free Signup |
| Chroma | Free Vector DB | ChromaDB |
#tools #databricks #python #git
[Avg. reading time: 2 minutes]
MLOps & AI Overview
- MLOps & AI Overview
[Avg. reading time: 3 minutes]
Introduction

AI/ML are no longer just research topics - they drive industry, innovation, and jobs.
GenAI has shifted expectations: businesses want faster solutions with production-grade reliability.
MLOps ensures ideas → working models → deployed systems.
Evolution of the Field
2010s: Big Data + early ML adoption (scikit-learn, Spark MLlib).
2015-2022: Deep learning boom (Neural Networks, NLP with BERT).
2022: Generative AI (GPT, diffusion models).
MLOps is critical for scaling, governance, monitoring.
Where MLOps Fits in the Data/AI Journey
MLOps is part of all of this.

Without MLOps, many models stay as “academic projects.”
Today’s hiring market looks for hybrid skills (data + ML + cloud + ops).
Course Positioning
Not too heavy on topics covered in other courses, such as ML algorithms, NLP, Deep Learning, or LLMs.
This course is heavy on CI/CD for MLOps: pipelines, versioning, monitoring, cloud platforms, and related toolsets.
Course Focus = Industry Readiness

[Avg. reading time: 0 minutes]
AI then and now
#MachineLearning #ArtificialIntelligence
[Avg. reading time: 1 minute]
Expert Systems
Early AI systems (1970s–1990s)
Rule-based: encode human expert knowledge as if-then rules.
Precursor to modern ML, focused on symbolic reasoning rather than data-driven learning.
Pros
- Transparent and explainable (rules are visible).
- Effective in narrow, well-defined domains.
Cons
- Knowledge engineering is labor-intensive.
- Doesn’t scale well as rules explode.
- Cannot adapt automatically from new data.
[Avg. reading time: 3 minutes]
Fuzzy Logic
Logic that allows degrees of truth (not just True/False). Models uncertainty with values between 0 and 1.
graph TD
A["Is it Cold?"] --> B["Crisp Logic<br/>Yes = 1<br/>No = 0"]
A --> C["Fuzzy Logic<br/>Maybe Cold = 0.3<br/>Not really cold = 0.7"]
Useful in control systems and decision-making under vagueness.
Still used in various use cases for similarity/fuzzy matching, e.g., deciding how similar "New Jersey" is to "Jersey".
Pros
- Handles imprecise, uncertain, or linguistic data (“high temperature”, “low risk”).
- Good for rule-based control.
Cons
- Not data-driven → rules must be defined manually.
- Limited learning ability compared to ML.
Use Cases
- Washing machines that adjust cycles based on “fuzziness” of dirt level.
- Air conditioning systems adapting to “comfort level”.
- Automotive control (braking, transmission).
- Risk assessment systems.
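A minimal sketch of the "degrees of truth" idea using Python's built-in difflib: instead of a crisp yes/no match, SequenceMatcher returns a similarity score between 0 and 1 (the place names are just illustrative).

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Return a similarity score between 0.0 and 1.0 instead of a crisp True/False."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

print(similarity("New Jersey", "Jersey"))      # ~0.75 -> "mostly similar"
print(similarity("New Jersey", "New Jersey"))  # 1.0   -> fully true
print(similarity("New Jersey", "Texas"))       # low score -> mostly false
```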
[Avg. reading time: 3 minutes]
Machine Learning
A subset of AI where systems learn patterns from data and make predictions or decisions without being explicitly programmed.
- One of the core pillars of AI.
- Between traditional rule-based systems (Expert Systems) and modern Deep Learning/GenAI.
- Provides the foundation for many practical AI applications used in industry today.
Pros
- Automates decision-making at scale.
- Flexible: can be applied to structured and unstructured data.
- Improves with more data and better features.
Cons
- Requires labeled data (for supervised learning).
- Models can overfit or underfit if not designed carefully.
- Often seen as a “black box” with limited interpretability.
Use Cases
- Fraud detection in finance.
- Customer churn prediction in telecom/retail.
- Demand forecasting in supply chain.
- Email spam filtering.
- Customer segmentation for targeted marketing.
- Market basket analysis (“people who buy X also buy Y”).
- Anomaly detection in cybersecurity and IoT.
#Supervised #Unsupervised #classification #regression
[Avg. reading time: 3 minutes]
Generative AI
A class of AI that can create new content (text, code, images, video, music) rather than just predicting outcomes.
Powered by foundation models like GPT, Stable Diffusion, etc.
- Builds on Deep Learning + NLP + multimodal modeling.
- Represents the shift from discriminative models (predicting) to generative models (creating).
Pros
- Enables creativity and automation at scale.
- Reduces time to draft, design, or brainstorm.
Cons
- Can hallucinate false information.
- High computational cost and environmental footprint.
- Raises copyright, ethics, and bias concerns.
Use Cases
- Text: AI writing assistants, code copilots.
- Image/video: marketing content generation, design prototyping.
- Data: generating synthetic data for ML training.
- Education: personalized learning materials and quizzes.
Key differences
| Traditional ML | Generative AI |
|---|---|
| Predicts outcome from features | Produces new content |
| Needs task-specific data | Pretrained on massive corpora |
| Optimized for accuracy | Optimized for creativity, coherence |
| Example: Predict churn | Example: Generate an image of a flying elephant |
[Avg. reading time: 4 minutes]
Reinforcement Learning
RLHF (Reinforcement Learning from Human Feedback)
It is like humans learning what to do and what not to do.
A learning paradigm where an agent interacts with an environment, takes actions, and learns from reward signals.
Instead of labeled data, it uses trial-and-error feedback.
Complements supervised/unsupervised learning.
Strongly linked to decision-making and control tasks.
Example: YouTube recommends a video; if you watch it, the system learns from that, and if you choose "Don't show this", the system reacts to that.
Here the agent is the YouTube recommendation engine, the action is the recommendation it makes, and the reward signal is the user's response: watching, liking, or sharing versus ignoring or marking "not interested".
Pros
- Handles complex sequential decisions.
- Can learn optimal strategies without explicit rules.
- Mimics human/animal learning.
Cons
- Data and compute intensive.
- Reward design is tricky.
- Training can be unstable.
Use Cases
- Game AI: AlphaGo defeating world champions.
- Robotics: teaching robots to walk, grasp, or navigate.
- Finance: algorithmic trading strategies.
- Dynamic pricing in e-commerce.
flowchart TD
A[Prompt] --> B[Base LLM generates multiple responses]
B --> C[Human labelers rank responses]
C --> D[Reward Model learns preferences]
D --> E[Fine-tune LLM with Reinforcement Learning]
E --> F[Aligned ChatGPT]
[Avg. reading time: 4 minutes]
Agentic AI
AI systems that are autonomous agents: they can plan, reason, take actions, and use tools.
Builds on LLMs + RL concepts.
Can execute multi-step tasks with minimal human guidance.
Before Agentic AI
- Traditional AI -> task-specific models.
- LLMs -> good at generating text but mostly passive responders.
Transformation with Agentic AI
- Adds agency: memory, planning, acting.
- Can chain multiple AI capabilities (search + reasoning + action).
Pros
- Automates workflows end-to-end.
- Adaptable across domains.
- Learns from feedback loops.
Cons
- Hard to control (hallucinations, unsafe actions).
- High computational cost.
- Reliability and governance still open challenges.
Use Cases
- AI agents booking travel (search -> compare -> purchase).
- Customer support bots that escalate only when needed.
- Business process automation (invoice handling, data entry).
| Aspect | AI Assistant (Chatbot/LLM) | Agentic AI (Autonomous Agent) |
|---|---|---|
| Nature | Reactive → answers questions | Proactive → plans and executes tasks |
| Memory | Limited to current session | Has memory across interactions |
| Actions | Generates text/code only | Uses tools, APIs, external systems |
| Planning | One-shot response | Multi-step reasoning and decision-making |
| Adaptability | Needs explicit user prompts | Self-adjusts based on goals and feedback |
| Example Use Case | “What’s the weather in NYC?” → gives forecast | “Plan my weekend trip to NYC” → books flight, hotel, creates itinerary |
| Industry Example | Customer support FAQ bot | AI agent that handles returns, refunds, and escalations automatically |
[Avg. reading time: 3 minutes]
MLOps
Why MLOps
Operationalizing ML/AI models with focus on automation, collaboration, and reliability.
Building is easy, sustaining is hard.
Remember dieting/exercise?
- Companies moved past “build model in Jupyter” → now productionize models.
- 80% of ML projects fail due to lack of deployment + monitoring strategy.
- MLOps bridges Data → Model → Production.
Industry requirement
- Versioning models
- Monitoring drift
- Scalable deployment
- Regulatory compliance (audit trail, lineage)
Lifecycle
- Data ingestion -> data validation & quality checks -> feature engineering
- Model training -> validation -> experiment tracking & versioning
- Deployment (batch, real-time, API) -> rollback capabilities
- Monitoring
- Data drift (input distribution)
- Model drift (prediction accuracy)
- Concept drift (feature:label relationship)
- Infrastructure performance
- Continuous improvement -> retraining & iteration
Cross-Functional Teams
- Data Engineers
- Data Scientists
- ML Engineers
- Platform/DevOps Engineers
- Product Managers
Key Capabilities
- Reproducibility
- Scalability
- Governance & compliance
- Automated CI/CD pipelines
#cicd #mlops #devops #medallion
[Avg. reading time: 1 minute]
Differences across AI/ML systems
| Aspect | Traditional ML | NLP (Pre-GenAI) | GenAI | MLOps |
|---|---|---|---|---|
| Data | Structured, tabular | Text, tokens | Multi-modal | Any |
| Training | From small datasets | Task-specific corpora | Massive pretraining + fine-tune | Not about training, about lifecycle |
| Output | Prediction | Classification, tagging, parsing | Content (text, code, image) | Deployment + Ops |
| Role Focus | Data Scientist | NLP Researcher | Prompt Engineer, LLM Engineer | ML Engineer, Platform Eng. |
[Avg. reading time: 2 minutes]
Examples
Retail:
- Traditional ML -> Demand forecasting.
- GenAI -> Personalized product descriptions.
- MLOps -> Continuous retraining as seasons change.
Healthcare:
- Traditional ML -> Predict patient readmission.
- GenAI -> Auto-generate clinical notes.
- MLOps -> Ensure compliance & monitoring under HIPAA.
Finance:
- Traditional ML -> Fraud detection.
- GenAI -> AI-powered customer chatbots.
- MLOps -> Drift detection for fraud models.
| Traditional ML | GenAI | MLOps |
|---|---|---|
| Fraud detection (transaction classification) | AI-powered customer chatbots for support | Drift detection & alerts for fraud models |
| Credit scoring (loan approval risk models) | Personalized financial advice reports | Automated retraining with new credit bureau data |
| Stock price trend prediction | Summarizing financial reports & earnings calls | Compliance monitoring (audit trails for regulators) |
| Customer lifetime value prediction | Generating personalized investment recommendations | Model versioning & rollback in case of errors |
#finance #healthcare #retail #examples
[Avg. reading time: 1 minute]
Job Opportunities
Traditional ML
- Data Scientist
- Applied ML Engineer
- Data Analyst -> ML transition
GenAI
- Prompt Engineer
- LLM Application Developer
- GenAI Product Engineer
- AI Research Scientist
MLOps
- ML Engineer (deployment, monitoring)
- MLOps Engineer (CI/CD pipelines for ML)
- Cloud ML Platform Engineer (Databricks, AWS Sagemaker, GCP Vertex AI, Azure ML)
#jobs #mlengineer #mlopsengineer
[Avg. reading time: 9 minutes]
Terms to Know
Regression
Predicting a continuous numeric value.
Use Case: Predicting house prices based on size, location, and number of rooms.
Linear Regression
A regression model assuming a straight-line relationship between input features and target.
Use Case: Estimating sales revenue as a function of advertising spend.
Classification
Predicting discrete categories.
Use Case: Classifying an email as spam or not spam.
Clustering
Grouping similar data points without labels.
Use Case: Segmenting unlabeled data into groups (e.g., customer segments).
Feature Engineering
Creating new meaningful features from raw data to improve model performance.
Use Case: From “Date of Birth” → create “Age” as a feature for predicting insurance risk.
Overfitting
Model learns training data too well (including noise) -> poor generalization.
Use Case: Overfitting = a spam filter that memorizes training emails but fails on new ones.
Underfitting
Model too simple to capture patterns -> poor performance.
Use Case: Trying to predict house prices using only the average price (ignoring size, location, rooms, etc.).
Bias
A source of error that happens due to overly simplistic assumptions.
- Leads to underfitting.
Variance
A source of error that happens due to too much sensitivity to training data fluctuations.
- Leads to overfitting.
Model Drift
When a model’s performance degrades over time because data distribution changes.
Use Case: A churn model trained pre-pandemic performs poorly after online behavior changes drastically.
MSE
Mean Squared Error
Avg of the squared differences between predicted values and actual values.
Actual a: [10, 20, 30, 40, 50]
Predicted p : [12, 18, 25, 45, 60]
| i | Actual| Predicted | Error | Squared Error |
| - | ------|-----------|-------|---------------|
| 1 | 10 | 12 | -2 | 4 |
| 2 | 20 | 18 | 2 | 4 |
| 3 | 30 | 25 | 5 | 25 |
| 4 | 40 | 45 | -5 | 25 |
| 5 | 50 | 60 | -10 | 100 |
SS = 4 + 4 + 25 + 25 + 100 = 158
MSE (ss_res) = 158 / 5 = 31.6
R Square
Proportion of variance in the target explained by the model.
1.0 = Perfect Prediction. 0.0 = Model is no better than predicting the mean. Negative = Model is worse than just predicting the mean.
Mean of actual values = (10 + 20 + 30 + 40 + 50) / 5 = 30
Total Variation (ss_tot) : (10 - 30)^2 + (20 - 30)^2 + (30 - 30)^2 + (40 - 30)^2 + (50 - 30)^2 = 400 + 100 + 0 + 100 + 400 = 1000
R^2 = 1 - (ss_res / ss_tot)
R^2 = 1 - (158/1000) = 0.842
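A quick sketch to verify the arithmetic above with scikit-learn (assuming it is installed):

```python
from sklearn.metrics import mean_squared_error, r2_score

actual    = [10, 20, 30, 40, 50]
predicted = [12, 18, 25, 45, 60]

print("MSE:", mean_squared_error(actual, predicted))  # 31.6
print("R2 :", r2_score(actual, predicted))            # 0.842
```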
Serialization
The process of converting an in-memory object (e.g., a Python object) into a storable or transferable format (such as JSON, binary, or a file) so it can be saved or shared.
import json
data = {"name": "Ganesh", "course": "MLOps"}
# Serialization → Python dict → JSON string
serialized = json.dumps(data)
## Store the serialized data into JSON file if needed.
Deserialization
The process of converting the stored or transferred data (JSON, binary, file, etc.) back into an in-memory object that your program can use.
# Load it from JSON file
# Deserialization → JSON string → Python dict
deserialized = json.loads(serialized)
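A minimal sketch of the same round trip through a file on disk (the filename data.json is just illustrative):

```python
import json

data = {"name": "Ganesh", "course": "MLOps"}

# Serialization: write the Python dict to a JSON file
with open("data.json", "w") as f:
    json.dump(data, f)

# Deserialization: read the JSON file back into a Python dict
with open("data.json") as f:
    restored = json.load(f)

print(restored == data)  # True
```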

#serialization #deserialization #overfitting #underfitting
[Avg. reading time: 3 minutes]
Model vs Library vs Framework
python -m venv .demomodel
source .demomodel/bin/activate
pip install scikit-learn joblib
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from joblib import dump, load
import numpy as np
# Fake dataset: study_hours -> exam_score
rng = np.random.default_rng(42)
hours = rng.uniform(0, 10, size=100).reshape(-1, 1) # feature X
noise = rng.normal(0, 5, size=100) # noise
scores = 5 + 8*hours.ravel() + noise # target y
X_train, X_test, y_train, y_test = train_test_split(hours, scores, test_size=0.2, random_state=42)
model = LinearRegression()
# Train (fit)
model.fit(X_train, y_train)
# Evaluate
pred = model.predict(X_test)
print("MSE:", mean_squared_error(y_test, pred))
print("R2 :", r2_score(y_test, pred))
print("Learned slope and intercept:", model.coef_[0], model.intercept_)
# Save
dump(model, "linear_hours_to_score.joblib")
# Inference on new data
new_hours = np.array([[2.0], [5.0], [9.0]])
print("Predicted scores:", model.predict(new_hours))
# Predict after load
restored = load("linear_hours_to_score.joblib")
print("Loaded model predicts:", restored.predict(new_hours))
Fun Task
- Identify the Algorithm, Library, and Model in this code
- What are MSE and R2 Score?
- What is .joblib
- What is the number 42?
#model #library #framework #r2score #mse
[Avg. reading time: 2 minutes]
Explanation
- Library: scikit-learn
- Algorithm: Linear Regression (Mathematics)
- Prebuilt Model: LinearRegression (part of the scikit-learn library)
- model.fit(): the custom model built for this data.
- 42: the answer to the ultimate question of Life, the Universe, and Everything.
- model.coef_[0] → the slope learned from the data: how much the target (exam score) increases for 1 extra unit of study hours.
- model.intercept_ → the intercept: the predicted target value when study hours = 0.
Example:
Learned slope and intercept: 8.1 4.9
8.1 * hrs + 4.9
If a student studies 0 hours, predicted score ≈ 4.9 (baseline knowledge).
If a student studies 5 hours, predicted score ≈ 8.1 × 5 + 4.9 = 45.4.
[Avg. reading time: 3 minutes]
Statistical vs ML Models
Statistical Models
- Focus on inference -> understanding relationships between variables.
- Assume an underlying distribution (e.g., linear, normal).
- Typically work well with smaller datasets.
Goal: test hypotheses, estimate parameters.
Example: Linear regression to explain how income depends on education, experience, etc.
Machine Learning Models
- Focus on prediction -> finding patterns that generalize to unseen data.
- Fewer assumptions about data distribution.
- Can handle very large datasets and high-dimensional data.
Goal: optimize predictive performance.
Example: Random Forest predicting whether a customer will churn.
Key Similarities
Both use data to build models.
Both rely on training (fit) and evaluation (test).
Overlaps: linear regression is both a statistical model and an ML model, depending on context.
Book worth reading
The Manga Guide to Linear Algebra.
https://www.amazon.com/dp/1593274130
(Not an affiliate or referral)

On a lighter note

#statistics #ml #linearalgebra
[Avg. reading time: 3 minutes]
Types of ML Models
Supervised Learning
Data has input features (X) and target labels (y).
Model learns mapping: f(X) → y.
Examples:
- Regression -> Predicting house prices, demand forecast, server usage.
- Classification -> Spam vs Non-spam email or Customer churn.
Unsupervised Learning
Data has inputs only, no labels.
Goal: find hidden patterns or structure.
Examples:
- Clustering -> Customer segmentation.
- Association Rules -> Market basket analysis (“people who buy X also buy Y”).
- Dimensionality Reduction -> Principal Component Analysis (PCA) for visualization.
- Taking high-dimensional data and reducing it to fewer dimensions.
Reinforcement Learning (RL)
Agent interacts with environment -> learns by trial and error.
Used for decision-making & control.
Examples:
- Robotics & self-driving cars.
- Newer Video Games.
- OTT Content recommendations.
- Ads.
Semi-Supervised Learning
Mix of few labeled + many unlabeled data points.
Often used in NLP and computer vision.
Example: labeling 1,000 medical images, then using 100,000 unlabeled ones to improve model.
[Avg. reading time: 2 minutes]
ML Lifecycle
Collect Data (Data Engineers Role)
- Gather raw data from systems (databases, APIs, sensors, logs).
- Ensure sources are reliable and updated.
Clean & Prepare
- Handle missing values, outliers, and noise.
- Feature engineering: create new features, scale/encode as needed.
- Data splitting (train/validation/test).
Train Model
- Choose algorithm (supervised, unsupervised, reinforcement, etc.).
- Train on training set, tune hyperparameters.
Evaluate
- Use appropriate metrics:
- Classification → Accuracy, Precision, Recall, F1.
- Regression → RMSE, MAE, R².
- Cross-validation for robustness.
Deploy
- Make model accessible via API, batch jobs, or embedded in applications.
- Consider scaling (cloud, containers, edge devices).
Monitor & Improve
- Track data drift, concept drift, and model performance decay.
- Automate retraining pipelines (MLOps).
- Capture feedback loop to improve features and models.
#collect #clean #train #evaluate
[Avg. reading time: 6 minutes]
Data Preparation
~80% of the time in ML projects is spent on Data Preparation & Cleaning, and ~20% on Model Training.
The process of making raw data accurate, complete, and structured so it can be used for model training.
Wait, data cleaning is not the ML Engineer's job; it belongs to the Data Engineer.
True but..
Data Engineers focus on collection and validation at scale:
- Ingest raw data from source systems (databases, APIs, IoT, logs).
- Build ETL/ELT pipelines (Bronze → Silver → Gold).
- Ensure data quality checks (avoid duplicates, schema validation, type checks, primary key uniqueness).
- Handle big data infrastructure: Spark, Databricks, Airflow, Kafka.
- Deliver curated data (often “Silver” or “Gold” layer) for downstream ML.
ML Engineers / Data Scientists take over once curated data is available:
- Apply ML-specific cleaning & prep:
- Impute missing values intelligently (mean/median/model-based).
- Encode categorical variables (one-hot, embeddings).
- Normalize/standardize numeric features.
- Text normalization, tokenization, embeddings.
- Create features meaningful to the ML model.
- Split data into train/validation/test sets.
flowchart LR
DE[**Data Engineer**<br/><br/>- ETL/ELT Pipelines<br/>- Schema Validation<br/>- Deduplication<br/>- Type Checks]
OVERLAP[**Common** <br/><br/>- Remove Duplicates<br/>- Ensure Consistency]
MLE[**ML Engineer**<br/><br/>- Handle Missing Values<br/>- Feature Scaling<br/>- Imputation<br/>- Encoding & Embeddings<br/>- Train/Val/Test Split]
DE --> OVERLAP
MLE --> OVERLAP
For Example
Tabular Data
Data Engineer: ensures no duplicate customer IDs in database.
ML Engineer: fills missing “Age” values with median, scales “Income”.
Text Data
Data Engineer: stores raw customer reviews as UTF-8 encoded text.
ML Engineer: lowercases, removes stopwords, converts to embeddings.
Image Data
Data Engineer: validates images aren’t corrupted on ingest.
ML Engineer: resizes images, normalizes pixel values.
[Avg. reading time: 6 minutes]
Data Cleaning
Check for Target Leakage
What it is: Features that give away the answer (future info in training data).
Why it matters: Makes the model look perfect in training but useless in production.
Example:
refund_issued_flag when predicting "Will this order be refunded?".
Having this column in the training features is not correct, because in production you will never have it at prediction time. It can, however, be used when testing your model's predictions.
Validate Labels
What it is: Make sure labels are correct, consistent, and usable.
Why it matters: Garbage labels = garbage predictions.
Example:
Churn column has values: yes, Y, 1, true.
Normalize to 1 = churn, 0 = not churn.
Handle Outliers Intentionally
What it is: Extreme values that distort training.
Why it matters: “Emp_Salary = 10,000,000” can throw off predictions.
Example
Cap at 99th percentile.
Flag as anomaly instead of training on it.
Enforce Feature Types
What it is: Make sure data types match their meaning.
Why it matters: Models can’t learn if types are wrong.
Example:
customer_id stored as integer → model may treat it as numeric.
Why is that a problem? The model would treat customer_id = 20 as having more weight than customer_id = 1.
Convert to string (categorical).
Standardize Categories
What it is: Inconsistent labels in categorical columns.
Why it matters: Model may treat the same thing as different classes.
Example:
Country: USA, U.S.A., United States.
Map all to United States.
Normalize Text for ML
What it is: Clean and standardize text features.
Why it matters: Prevents the model from treating “Hello” and “hello!” as different.
Example:
Lowercasing, removing punctuation, stripping whitespace.
Keep a copy of raw text for audit.
Protect Data Splits
What it is: Make sure related rows don’t leak between train/test.
Why it matters: Prevents unfair accuracy boost.
Example:
Same student appears in both train and test sets.
Fix: Group by student_id when splitting.
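A minimal sketch of a group-aware split using scikit-learn's GroupShuffleSplit, so all rows for one student land on the same side (the data and column names are illustrative):

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

df = pd.DataFrame({
    "student_id": [1, 1, 2, 2, 3, 3, 4, 4],
    "score":      [55, 60, 70, 72, 88, 90, 40, 45],
})

# Split so that no student_id appears in both train and test
splitter = GroupShuffleSplit(test_size=0.25, n_splits=1, random_state=42)
train_idx, test_idx = next(splitter.split(df, groups=df["student_id"]))

train, test = df.iloc[train_idx], df.iloc[test_idx]
print(set(train["student_id"]) & set(test["student_id"]))  # empty set -> no leakage
```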
#datacleaning #mlcleaning #normalize_data
[Avg. reading time: 11 minutes]
Data Imputation
Data Imputation is the process of filling in missing values in a dataset with estimated or predicted values.
Data imputation aims to enhance the quality and completeness of the dataset, ultimately improving the performance and reliability of the ML model.
Problems with Missing Data
- Reduced model performance
- Biased Inferences
- Imbalanced Representations
- Increased complexity in Model handling

Data Domain knowledge is important before choosing the right method.
Dropping Rows/Columns
Remove the rows or columns that contain missing values.
- If the percentage of missing data is very small.
- If the column isn’t important for the model.
Example: Drop the few rows (out of a million) where "Age" is missing.
Treat as a Category
Encode “missing” or “NA” or “Unknown” as its own category.
- For categorical variables (like Country, Gender, Payment Method).
When “missing” itself carries meaning (e.g., customer didn’t provide income → may be sensitive).
Example: Add a category Unknown to “Marital Status” column.
Data with Missing Values
| ID | Country |
|---|---|
| 1 | USA |
| 2 | Canada |
| 3 | Null |
| 4 | India |
| 5 | NA (missing) |
After treating as a Category
| ID | Country |
|---|---|
| 1 | USA |
| 2 | Canada |
| 3 | Missing |
| 4 | India |
| 5 | Missing |
The model will see “Missing” as just another value like “USA” or “India.”
Replacing Missing Values (Imputation)
Fill missing values with a reasonable estimate.
Methods:
- Mean/Median/Mode: Quick fixes for numeric/categorical data.
- KNN Imputation: Fill value based on “closest” similar records.
- Regression Imputation: Predict the missing value using other features.
Example: Replace missing “Salary” with median salary of the group.

Using regression models repeatedly (with randomness) to fill missing data, producing several plausible datasets, and then combining them for analysis.
| Age | Education | Income |
|---|---|---|
| 30 | Masters | ? |
| 40 | PhD | 120K |
| 35 | Bachelors | 80K |
- Step 1: Fit regression: Income ~ Age + Education.
- Step 2: Predict missing Income for Age=30, Edu=Masters.
- Step 3: Add random noise → 95K in dataset1, 92K in dataset2, 98K in dataset3.
- Step 4: Analyze all 3 datasets, combine results.
Downside: delays the process and adds computing time; the more missing values, the more computation time.
- Drop : if it’s tiny and negligible.
- Category : if it’s categorical.
- Replace : if it’s numeric and important.
- KNN/Regression : if you want smarter imputations and can afford compute.
It is important to mark the imputed data,
so you know which values came from the source and which were calculated, and can treat the latter with a pinch of salt.

| Method | When to Use | Pros | Cons |
|---|---|---|---|
| Drop Rows/Columns | When % of missing data is very small (e.g., <5%) or the feature is unimportant | - Simple and fast - No assumptions needed | - Lose data (rows) - Risk of losing valuable features (columns) |
| Treat as a Category | For categorical variables where “missing” may carry meaning | - Preserves all rows - Captures the “missingness” as useful info | - Only works for categorical data - Can create an artificial category if missing isn’t meaningful |
| Replace with Mean/Median/Mode | For numeric data (mean/median) or categorical (mode) | - Easy to implement - Keeps dataset size intact | - Distorts distribution - Ignores correlations between features |
| KNN Imputation | When dataset is not too large and similar neighbors make sense | - Considers relationships between features - More accurate than simple averages | - Computationally expensive - Sensitive to scaling and choice of K |
| Regression Imputation | When missing values can be predicted from other variables | - Uses feature relationships - Can be very accurate | - Risk of “overfitting” imputations - Adds complexity |
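A minimal sketch using scikit-learn's SimpleImputer; the add_indicator flag also produces a column marking which values were imputed, which covers the "mark the imputed data" point above (the data is illustrative):

```python
import numpy as np
from sklearn.impute import SimpleImputer

ages = np.array([[25.0], [np.nan], [40.0], [35.0], [np.nan]])

# Median imputation plus an indicator column flagging imputed rows
imputer = SimpleImputer(strategy="median", add_indicator=True)
result = imputer.fit_transform(ages)

print(result)
# Column 0: age, with missing values filled by the median (35.0)
# Column 1: 1.0 where the value was imputed, 0.0 where it came from the source
```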
#dataimputation #knn #encode #dropdata
[Avg. reading time: 13 minutes]
Data Encoding
Data Encoding is the process of converting categorical data (like colors, countries, product types) into a numeric format that ML models can understand.
Unlike numerical data, categorical data is not directly usable because models operate on numbers, not labels.
Encoding ensures categorical values are represented in a way that preserves meaning and avoids misleading the model.
Typically rule-based.
Example: Products
| ID | Product |
|---|---|
| 1 | Laptop |
| 2 | Phone |
| 3 | Tablet |
| 4 | Phone |
Label Encoding
Assigns each category a unique integer.
| ID | Product (Encoded) |
|---|---|
| 1 | 0 |
| 2 | 1 |
| 3 | 2 |
| 4 | 1 |
Pros:
- Very simple, minimal storage.
- Works well for tree-based models.
Cons:
- Implies an order between categories (Laptop < Phone < Tablet).
- Misleads linear models.
One-Hot Encoding
Creates a binary column for each category.
| ID | Laptop | Phone | Tablet |
|---|---|---|---|
| 1 | 1 | 0 | 0 |
| 2 | 0 | 1 | 0 |
| 3 | 0 | 0 | 1 |
| 4 | 0 | 1 | 0 |
Pros:
- No ordinal assumption.
- Easy to interpret.
Cons:
- High dimensionality for many products (e.g., thousands of SKUs).
- Sparse data, more memory needed.
Ordinal Encoding
Encodes categories when they have a natural order.
Works for things like product size or version level.
Example (Product Tier):
| ID | Product Tier |
|---|---|
| 1 | Basic |
| 2 | Standard |
| 3 | Premium |
| 4 | Standard |
After Ordinal Encoding:
| ID | Product Tier (Encoded) |
|---|---|
| 1 | 1 |
| 2 | 2 |
| 3 | 3 |
| 4 | 2 |
Pros:
- Preserves rank/order.
- Efficient storage.
Cons:
- Only valid if order is real (Basic < Standard < Premium).
- Wrong if categories are unordered (Laptop vs Phone).
Target Encoding (Mean Encoding)
Replaces each category with the mean of the target variable.
Target - “Purchased” Yes=1, No=0
| ID | Product | Purchased |
|---|---|---|
| 1 | Laptop | 1 |
| 2 | Phone | 0 |
| 3 | Tablet | 1 |
| 4 | Phone | 1 |
| ID | Product (Encoded) | Purchased |
|---|---|---|
| 1 | 1.0 | 1 |
| 2 | 0.5 | 0 |
| 3 | 1.0 | 1 |
| 4 | 0.5 | 1 |
Compute mean purchase rate:
Laptop = 1.0 Phone = 0.5 Tablet = 1.0
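A minimal sketch of that mean computation with pandas; in practice the means should be computed inside cross-validation folds to avoid leakage:

```python
import pandas as pd

df = pd.DataFrame({
    "Product":   ["Laptop", "Phone", "Tablet", "Phone"],
    "Purchased": [1, 0, 1, 1],
})

# Mean purchase rate per product, mapped back onto each row
means = df.groupby("Product")["Purchased"].mean()
df["Product_Encoded"] = df["Product"].map(means)
print(df)  # Laptop -> 1.0, Phone -> 0.5, Tablet -> 1.0
```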
Pros:
- Great for high-cardinality features (e.g., hundreds of product SKUs).
- Often improves accuracy.
- Keeps dataset compact (just 1 numeric column).
- Often boosts performance in models like Logistic Regression or Gradient Boosted Trees.
Cons:
- Risk of data leakage if target encoding is done on the whole dataset.
- Must use cross-validation to avoid leakage.
- Compute intensive.
| Encoding Type | Best For | Avoid When |
|---|---|---|
| Label Encoding | Tree-based models, low-cardinality products | Linear models, unordered categories |
| One-Hot Encoding | General ML, few product categories | Very high-cardinality features |
| Ordinal Encoding | Ordered categories (tiers, sizes, versions) | Unordered categories (Phone vs Laptop) |
| Target Encoding | High-cardinality products, with proper CV | Without CV (leakage risk) |
Multiple Categorical Columns
| ID | Product | Product Tier | Category | Purchased |
|---|---|---|---|---|
| 1 | Laptop | Premium | PC | 1 |
| 2 | Phone | Basic | Mobile | 0 |
| 3 | Tablet | Standard | Electronics | 1 |
| 4 | Phone | Premium | Mobile | 1 |
- Product: Laptop, Phone, Tablet
- Product Tier: Basic < Standard < Premium (ordered)
- Category: Electronics, Accessories, Clothing (unordered)
Label Encoding (all columns)
Replace each category with an integer.
| ID | Product | Product Tier | Category |
|---|---|---|---|
| 1 | 0 | 2 | 0 |
| 2 | 1 | 0 | 1 |
| 3 | 2 | 1 | 2 |
| 4 | 1 | 2 | 1 |
Artificial order created (e.g., PC=0, Mobile=1, Electronics=2).
One-Hot Encoding (all columns)
| ID | Laptop | Phone | Tablet | Tier_Basic | Tier_Standard | Tier_Premium | Cat_PC | Cat_Mobile | Cat_Electronics |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| 2 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 3 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 |
| 4 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
Very interpretable, but column explosion if you have 50+ products or 100+ categories.
Mixed Encoding (best practice)
- Product → One-Hot (few categories).
- Product Tier → Ordinal (Basic=1, Standard=2, Premium=3).
- Category → One-Hot (PC, Mobile, Electronics).
| ID | Laptop | Phone | Tablet | Tier (Ordinal) | Cat_PC | Cat_Mobile | Cat_Electronics |
|---|---|---|---|---|---|---|---|
| 1 | 1 | 0 | 0 | 3 | 1 | 0 | 0 |
| 2 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
| 3 | 0 | 0 | 1 | 2 | 0 | 0 | 1 |
| 4 | 0 | 1 | 0 | 3 | 0 | 1 | 0 |
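A minimal sketch of the mixed-encoding best practice with pandas: an explicit mapping for the ordered tier and one-hot columns for the unordered features:

```python
import pandas as pd

df = pd.DataFrame({
    "Product":  ["Laptop", "Phone", "Tablet", "Phone"],
    "Tier":     ["Premium", "Basic", "Standard", "Premium"],
    "Category": ["PC", "Mobile", "Electronics", "Mobile"],
})

# Ordinal encoding for the ordered column
tier_order = {"Basic": 1, "Standard": 2, "Premium": 3}
df["Tier"] = df["Tier"].map(tier_order)

# One-hot encoding for the unordered columns
df = pd.get_dummies(df, columns=["Product", "Category"])
print(df)
```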
#onehot_encoding #target_encoding #label_encoding
[Avg. reading time: 6 minutes]
Feature Engineering
The process of transforming raw data into more informative inputs (features) for ML models.
Goes beyond encoding: you can create new features/metrics (like derived columns in the DB world) that pure encoding does not offer.
The goal of FE is to improve model accuracy, interpretability, and generalization.
Example (Laptop Sales):
Purchase Date = 2025-09-02
Derived Features:
- Month = 09
- DayOfWeek = Tuesday
- IsHolidaySeason = No
- IsWeekend = No
- IsLeapYear = No
- Quarter = Q3
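A minimal sketch of deriving these date features with pandas (the holiday-season rule is simplified to November/December):

```python
import pandas as pd

df = pd.DataFrame({"PurchaseDate": pd.to_datetime(["2025-09-02", "2023-12-01"])})

df["Month"]           = df["PurchaseDate"].dt.month
df["DayOfWeek"]       = df["PurchaseDate"].dt.day_name()
df["Quarter"]         = df["PurchaseDate"].dt.quarter
df["IsWeekend"]       = df["PurchaseDate"].dt.dayofweek >= 5
df["IsLeapYear"]      = df["PurchaseDate"].dt.is_leap_year
df["IsHolidaySeason"] = df["PurchaseDate"].dt.month.isin([11, 12])  # simplified assumption

print(df)
```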
Encoding (One-Hot, Label, Target) = only turns categories into numbers.
But real-world data often hides useful patterns in dates, interactions, domain knowledge, or semantics.
| ID | Product | Purchase Date | Price | PurchasedAgain |
|---|---|---|---|---|
| 1 | Laptop | 2023-12-01 | 1200 | 1 |
| 2 | Laptop | 2024-07-15 | 1100 | 0 |
| 3 | Phone | 2024-05-20 | 800 | 1 |
| 4 | Tablet | 2024-08-05 | 600 | 1 |
- Encoding only handles Product → One-Hot or Target.
Feature Engineering adds new insights:
- From Purchase Date: extract Month, DayOfWeek, IsHolidaySeason.
- From Price: create Discounted? (if < avg product price).
- Combine features: Price / AvgCategoryPrice.
Basic Feature Engineering
Improve signals/patterns without domain-specific knowledge.
Scaling/Normalization: Price → (Price – mean) / std
Date/Time Features: Purchase Date → Month=12, DayOfWeek=Friday
Polynomial/Interaction: Price × Tier
Pros:
- Easy to implement.
- Immediately boosts many models (especially linear/Neural Networks).
Cons:
- Risk of adding noise if done blindly.
- Limited unless combined with domain insights.
Domain-Specific Feature Engineering
Apply business/field knowledge.
Examples:
Finance: Debt-to-Income Ratio, Credit Utilization %
Healthcare: BMI = Weight / Height², risk score categories
IoT: Rolling averages, peak detection in sensor data.
Pros:
- Captures real-world meaning → big performance gains.
- Makes models explainable to stakeholders.
Cons:
- Requires domain expertise.
- Not always transferable between datasets.
#feature_engineering #domain_specific
[Avg. reading time: 10 minutes]
Vectors
A vector is just an ordered list of numbers that represents a data point so models can do math on it.
Think “row -> numbers” for tabular data, or “text/image -> numbers” after a transformation.
Example:
Price = 1200, Weight = 2kg, Warranty = 24 months → Vector = [1200, 2, 24]
Types of Vectors
Tabular Feature Vector
Concatenate numeric columns (and encoded categoricals) into a single vector.
Built by the ML engineer/data scientist during data prep/feature engineering (training); the same code runs at inference.
Example: [Price, Weight, Warranty] → [1200, 2, 24].
Sparse Vectors
High-dimensional vectors with many zeros (e.g., One-Hot, Bag-of-Words, TF-IDF).
Encoding/featurization function in your pipeline.
Example
Products = {Laptop, Phone, Pen}
Laptop → [1, 0, 0]
Phone → [0, 1, 0]
Pen → [0, 0, 1]
Dense Vectors (compact, mostly non-zeros)
Lower-dimensional, compact numeric representation with mostly non-zero values.
Created by algorithms (scalers/PCA) or models (embeddings) in your pipeline.
Example: Not actual values
Laptop → [0.65, -0.12, 0.48]
Phone → [0.60, -0.15, 0.52]
Pen → [0.10, 0.85, -0.40]
Laptop and Phone vectors are close together.
Model-Derived Feature Vectors
Dense vectors generated by models such as CNNs or Transformers during feature extraction. Mainly used in computer vision and speech: image classification, object detection, face recognition, voice processing.
Models generate them during feature extraction (training & inference).
Example: BERT sentence vector, ResNet image features.
| Vector Type | Who designs it? | Who computes it? | When it’s computed | Example |
|---|---|---|---|---|
| Tabular feature vector | ML Eng/DS (choose columns) | Pipeline code | Train & Inference | [Price, Weight, Warranty] |
| Sparse (One-Hot/TF-IDF) | ML Eng/DS (choose encoder) | Encoder in pipeline | Train (fit) & Inference (transform) | One-Hot Product |
| Dense (scaled/PCA) | ML Eng/DS (choose scaler/PCA) | Scaler/PCA in pipeline | Train (fit) & Inference (transform) | StandardScaled price, PCA(100) |
| Model features / Embeddings | ML Eng/DS (choose model) | Model (pretrained or trained) | Train & Inference | BERT/ResNet/categorical embedding |
MLOps ensures the same steps run at inference to avoid train/serve skew.
Example of Dense Vector
python -m venv .densevector
source .densevector/bin/activate
pip install sentence-transformers
from sentence_transformers import SentenceTransformer
# Load a pre-trained model (MiniLM is small & fast)
model = SentenceTransformer('all-MiniLM-L6-v2')
text = "Laptop"
# Convert text into dense vector
vector = model.encode(text)
print("Dense Vector Shape:", text, vector.shape)
print("Dense Vector (first 10 values):", vector[:10])
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
# Load model
model = SentenceTransformer('all-MiniLM-L6-v2')
# Words
texts = ["Laptop", "Computer", "Pencil"]
# Encode all
vectors = model.encode(texts)
# Convert to numpy array
vectors = np.array(vectors)
# Cosine similarity matrix
sim_matrix = cosine_similarity(vectors)
# Display similarity scores
for i in range(len(texts)):
    for j in range(i+1, len(texts)):
        print(f"Similarity({texts[i]} vs {texts[j]}): {sim_matrix[i][j]:.4f}")
#vectors #densevector #sparsevector #tabularvector
[Avg. reading time: 8 minutes]
Embeddings
Embeddings transform high-dimensional categorical or textual data into a compact, dense vector space.
Similar items are placed closer together in vector space -> models can understand similarity.
- These representations capture relationships and context among different entities.
- Used in Recommendation Systems, NLP, Image Search and more.
- Can be learned from data using neural networks or retrieved from pretrained models (e.g., Word2Vec, FastText).
Use Cases
- Search & Retrieval: Semantic search, image search.
- NLP: Word/sentence embeddings for sentiment, chatbots, translation.
- Computer Vision: Image embeddings for similarity or classification.
Advantages over traditional encoding:
- Handle high-cardinality categorical features (e.g., millions of products).
- Capture context and semantics (“Laptop” is closer to “Computer” than “Pencil”).
- Lower-dimensional → more efficient than One-Hot or TF-IDF.
Types of Embeddings
Word Embeddings (Text)
Represent words as vectors so that semantically similar words are close together.
Examples: Word2Vec, GloVe, FastText.
“king” – “man” + “woman” = “queen”
Used in: sentiment analysis, translation, chatbots.
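A minimal sketch of the analogy using pretrained GloVe vectors through gensim (assumes the gensim package is installed and downloads a small pretrained model on first run):

```python
import gensim.downloader as api

# One-time download of 50-dimensional pretrained GloVe vectors
glove = api.load("glove-wiki-gigaword-50")

# king - man + woman ≈ ?
print(glove.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# Typically returns "queen" as the closest word
```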
Sentence / Document Embeddings (Text)
Represent longer text (sentences, paragraphs, docs) in vector form.
Capture context and meaning beyond individual words.
Examples: BERT, Sentence-BERT, Universal Sentence Encoder.
“The laptop is fast” and “This computer is quick” → close vectors.
Image Embeddings (Computer Vision)
Represent images as vectors extracted from CNNs or Vision Transformers.
Capture visual similarity (shapes, colors, objects).
Examples: ResNet, CLIP (image+text).
A cheetah photo and a leopard photo → embeddings close together (both cat family).
Used in: image search, face recognition, object detection.
Audio / Speech Embeddings
Convert audio waveforms into dense vectors capturing phonetics and semantics.
Examples: wav2vec, HuBERT.
Voice saying “Laptop” → embedding close to text embedding of “Laptop”.
Used in: speech recognition, speaker identification.
Graph Embeddings
Represent nodes/edges in a graph (social networks, knowledge graphs).
Capture relationships and network structure.
Examples: Node2Vec, DeepWalk, Graph Neural Networks (GNNs).
In a product graph, Laptop node embedding will be close to Mouse if often co-purchased.
| Type | Example Algorithms | Data Type | Use Cases |
|---|---|---|---|
| Word | Word2Vec, GloVe | Text (words) | NLP basics |
| Sentence/Doc | BERT, SBERT | Text (longer) | Semantic search, QA |
| Categorical | Embedding layers | Tabular (IDs) | Recommenders, fraud detection |
| Image | ResNet, CLIP | Vision | Image search, recognition |
| Audio | wav2vec, HuBERT | Audio | Speech-to-text, voice auth |
| Graph | Node2Vec, GNNs | Graphs | Social networks, KG search |
#embeddings #BERT #Word2Vec #NLP
[Avg. reading time: 7 minutes]
Life Before MLOps
Challenges Faced by ML Teams.
Moving Models from Dev → Staging → Prod
Models were often shared as .pkl or joblib files, passed around manually.
Problem: Dependency mismatches (Python, sklearn version), fragile handoffs.
Stopgap: Packaging models with Docker images, but still manual and inconsistent.
Champion vs Challenger Deployment
Teams struggled to test a new (challenger) model against the current (champion).
Problem: No controlled A/B testing or shadow deployments → risky rollouts.
Stopgap: Manual canary releases or running offline comparisons.
Model Versioning Confusion
Models saved as model_final.pkl, model_final_v2.pkl, final_final.pkl.
Problem: Nobody knew which version was truly in production.
Stopgap: Git or S3 versioning for files, but no link to experiments/data.
Inference on Wrong Model Version
Even if multiple versions existed, production systems sometimes pointed to the wrong one.
Problem: Silent failures, misaligned experiments vs prod results.
Stopgap: Hardcoding file paths or timestamps — brittle and error-prone.
Train vs Serve Skew (Data-Model Mismatch)
Preprocessing done in notebooks was re-written differently in prod code.
Problem: Same model behaves differently in production.
Stopgap: Copy-paste code snippets, but no guarantee of sync.
Experiment Tracking Chaos
Results scattered across notebooks, Slack messages, spreadsheets.
Problem: Couldn’t reproduce “that good accuracy we saw last week.”
Stopgap: Manually logging metrics in Excel or text files.
Reproducibility Issues
Same code/data gave different results on different machines.
Problem: No control of data versions, package dependencies, or random seeds.
Stopgap: Virtualenvs, requirements.txt — helped a bit but not full reproducibility.
Lack of Monitoring in Production
Once deployed, no one knew if the model degraded over time.
Problem: Models silently failed due to data drift or concept drift.
Stopgap: Occasional manual performance checks, but no automation.
Scaling & Performance Gaps
Models trained in notebooks failed under production loads.
Problem: Couldn’t handle large-scale data or real-time inference.
Stopgap: Batch scoring jobs on cron — but too slow for real-time use cases.
Collaboration Breakdowns
Data Scientists, Engineers, Ops worked in silos.
Problem: Miscommunication -> wrong datasets, broken pipelines, delays.
Stopgap: Jira tickets and handovers — but still slow and error-prone.
Governance & Compliance Gaps
No audit trail of which model made which prediction.
Problem: Risky for regulated domains (finance, healthcare).
Stopgap: Manual logging of predictions — incomplete and unreliable.
#mlops #development #production
[Avg. reading time: 13 minutes]
Quiz
Note: This is a practice quiz and will not be graded. The purpose is to help you check your understanding of the concepts we covered.
[Avg. reading time: 0 minutes]
Developer Tools
[Avg. reading time: 5 minutes]
Introduction
Before diving into Data or ML frameworks, it's important to have a clean and reproducible development setup. A good environment makes you:
- Faster: less time fighting dependencies.
- Consistent: same results across laptops, servers, and teammates.
- Confident: tools catch errors before they become bugs.
A consistent developer experience saves hours of debugging. You spend more time solving problems, less time fixing environments.
Python Virtual Environment
- A virtual environment is like a sandbox for Python.
- It isolates your project’s dependencies from the global Python installation.
- Easy to manage different versions of libraries.
- Dependencies are tracked in requirements.txt, which has to be managed manually.
Without it, installing one package for one project may break another project.

Open the CMD prompt (Windows)

Open the Terminal (Mac)

# Step 0: Create a project folder under your Home folder.
mkdir project
cd project
# Step 1: Create a virtual environment
python -m venv myenv
# Step 2: Activate it
# On Mac/Linux:
source myenv/bin/activate
# On Windows:
myenv\Scripts\activate.bat
# Step 3: Install packages (they go inside `myenv`, not global)
pip install faker
# Step 4: Open Python
python
# Step 5: Verify
import sys
sys.prefix
sys.base_prefix
# Step 6: Run this sample
from faker import Faker
fake = Faker()
fake.name()
# Step 7: Close Python (Ctrl+D on Mac/Linux, Ctrl+Z then Enter on Windows)
# Step 8: Deactivate the venv when done
deactivate
As a next step, you can either use Poetry or UV as your package manager.
#venv #python #uv #poetry #developer_tools
[Avg. reading time: 3 minutes]
UV
Dependency & Environment Manager
- Written in Rust.
- Syntax is lightweight.
- Automatic Virtual environment creation.
Create a new project:
# Initialize a new uv project
uv init uv_helloworld
Sample layout of the directory structure
.
├── main.py
├── pyproject.toml
├── README.md
└── uv.lock
# Change directory
cd uv_helloworld
# # Create a virtual environment myproject
# uv venv myproject
# or create a UV project with specific version of Python
# uv venv myproject --python 3.11
# # Activate the Virtual environment
# source myproject/bin/activate
# # Verify the Virtual Python version
# which python3
# add library (best practice)
uv add faker
# verify the list of libraries under virtual env
uv tree
# To find the list of libraries inside Virtual env
uv pip list
edit the main.py
from faker import Faker
fake = Faker()
print(fake.name())
uv run main.py
Read More on the differences between UV and Poetry
[Avg. reading time: 12 minutes]
Python Developer Tools
PEP
PEP, or Python Enhancement Proposal, is a design document that describes new features or conventions for Python. PEP 8 in particular is the official style guide for Python code: it provides conventions and recommendations for writing readable, consistent, and maintainable Python code.
- PEP 8 : Style guide for Python code (most famous).
- PEP 20 : "The Zen of Python" (guiding principles).
- PEP 484 : Type hints (basis for MyPy).
- PEP 517/518 : Build system interfaces (basis for pyproject.toml, used by Poetry/UV).
- PEP 572 : Assignment expressions (the := walrus operator).
- PEP 695 : Type parameter syntax for generics (Python 3.12).
Key Aspects of PEP 8 (Popular ones)
Indentation
- Use 4 spaces per indentation level
- Continuation lines should align with opening delimiter or be indented by 4 spaces.
Line Length
- Limit lines to a maximum of 79 characters.
- For docstrings and comments, limit lines to 72 characters.
Blank Lines
- Use 2 blank lines before top-level functions and class definitions.
- Use 1 blank line between methods inside a class.
Imports
- Imports should be on separate lines.
- Group imports into three sections: standard library, third-party libraries, and local application imports.
- Use absolute imports whenever possible.
# Correct
import os
import sys
# Wrong
import sys, os
Naming Conventions
- Use snake_case for function and variable names.
- Use CamelCase for class names.
- Use UPPER_SNAKE_CASE for constants.
- Avoid single-character variable names except for counters or indices.
Whitespace
- Don’t pad inside parentheses/brackets/braces.
- Use one space around operators and after commas, but not before commas.
- No extra spaces when aligning assignments.
Comments
- Write comments that are clear, concise, and helpful.
- Use complete sentences and capitalize the first word.
- Use # for inline comments, but avoid them where the code is self-explanatory.
Docstrings
- Use triple quotes (""") for multiline docstrings.
- Describe the purpose, arguments, and return values of functions and methods.
Code Layout
- Keep function definitions and calls readable.
- Avoid writing too many nested blocks.
Consistency
- Consistency within a project outweighs strict adherence.
- If you must diverge, be internally consistent.
Linting
Linting is the process of automatically checking your Python code for:
- Syntax errors
- Stylistic issues (PEP 8 violations)
- Potential bugs or bad practices
Benefits:
- Keeps your code consistent and readable.
- Helps catch errors early before runtime.
- Encourages team-wide coding standards.
# Incorrect
import sys, os
# Correct
import os
import sys
# Bad spacing
x= 5+3
# Good spacing
x = 5 + 3
Ruff : Linter and Code Formatter
Ruff is a fast, modern tool written in Rust that helps keep your Python code:
- Consistent (follows PEP 8)
- Clean (removes unused imports, fixes spacing, etc.)
- Correct (catches potential errors)
Install
poetry add ruff
uv add ruff
Verify
ruff --version
ruff --help
example.py
import os, sys
def greet(name):
    print(f"Hello, {name}")
def message(name): print(f"Hi, {name}")
def calc_sum(a, b): return a+b
greet('World')
greet('Ruff')
message('Ruff')
poetry run ruff check example.py
poetry run ruff check example.py --fix
poetry run ruff format example.py --check
poetry run ruff format example.py
OR
uv run ruff check example.py
uv run ruff check example.py --fix
uv run ruff format example.py --check
uv run ruff format example.py
MyPy : Type Checking Tool
mypy is a static type checker for Python. It checks your code against the type hints you provide, ensuring that the types are consistent throughout the codebase.
It primarily focuses on type correctness—verifying that variables, function arguments, return types, and expressions match the expected types.
Install
poetry add mypy
or
uv add mypy
or
pip install mypy
sample.py
x = 1
x = 1.0
x = True
x = "test"
x = b"test"
print(x)
def add(a: int, b: int) -> int:
    return a + b
print(add(100, 123))
print(add("hello", "world"))
uv run mypy sample.py
or
poetry run mypy sample.py
or
mypy sample.py
[Avg. reading time: 8 minutes]
Error Handling
Python uses try/except blocks for error handling.
The basic structure is:
try:
    # Code that may raise an exception
except ExceptionType:
    # Code to handle the exception
finally:
    # Code that executes all the time
Uses
Improved User Experience: Instead of the program crashing, you can provide a user-friendly error message.
Debugging: Capturing exceptions can help you log errors and understand what went wrong.
Program Continuity: Allows the program to continue running or perform cleanup operations before terminating.
Guaranteed Cleanup: Ensures that certain operations, like closing files or releasing resources, are always performed.
Some key points
- You can catch specific exception types or use a bare except to catch any exception.
- Multiple except blocks can be used to handle different exceptions.
- An else clause can be added to run if no exception occurs.
- A finally clause will always execute, whether an exception occurred or not.
Without Try/Except
x = 10 / 0
Basic Try/Except
try:
    x = 10 / 0
except ZeroDivisionError:
    print("Error: Division by zero!")
Generic Exception
try:
    file = open("nonexistent_file.txt", "r")
except:
    print("An error occurred!")
Find the exact error
try:
    file = open("nonexistent_file.txt", "r")
except Exception as e:
    print(str(e))
Raise - Else and Finally
try:
    x = -10
    if x <= 0:
        raise ValueError("Number must be positive")
except ValueError as ve:
    print(f"Error: {ve}")
else:
    print(f"You entered: {x}")
finally:
    print("This will always execute")

try:
    x = 10
    if x <= 0:
        raise ValueError("Number must be positive")
except ValueError as ve:
    print(f"Error: {ve}")
else:
    print(f"You entered: {x}")
finally:
    print("This will always execute")
Nested Functions
def divide(a, b):
    try:
        result = a / b
        return result
    except ZeroDivisionError:
        print("Error in divide(): Cannot divide by zero!")
        raise  # Re-raise the exception

def calculate_and_print(x, y):
    try:
        result = divide(x, y)
        print(f"The result of {x} divided by {y} is: {result}")
    except ZeroDivisionError as e:
        print(str(e))
    except TypeError as e:
        print(str(e))

# Test the nested error handling
print("Example 1: Valid division")
calculate_and_print(10, 2)
print("\nExample 2: Division by zero")
calculate_and_print(10, 0)
print("\nExample 3: Invalid type")
calculate_and_print("10", 2)
[Avg. reading time: 4 minutes]
UnitTest
A unit test verifies the correctness of a small, isolated "unit" of code—typically a single function or method—independent of the rest of the program.
Key Benefits of Unit Testing
Isolates functionality – Tests focus on one unit at a time, making it easier to pinpoint where a bug originates.
Enables early detection – Issues are caught during development, reducing costly fixes later in production.
Prevents regressions – Running existing tests after changes ensures new bugs aren’t introduced.
Supports safe refactoring – With a strong test suite, developers can confidently update or restructure code.
Improves quality – High coverage enforces standards, highlights edge cases, and strengthens overall reliability.
Unit Testing in Python
Every language provides its own frameworks for unit testing. In Python, popular choices include:
unittest – The built-in testing framework in the standard library.
pytest – Widely used, simple syntax, rich plugin ecosystem.
doctest – Tests embedded directly in docstrings.
testify – An alternative framework inspired by unittest, with added features.
pytest is the popular testing tool for data/ML code. It’s faster to write, far more expressive for data-heavy tests, and has a rich plugin ecosystem that plays nicely with Spark, Pandas, MLflow, and CI.
git clone https://github.com/gchandra10/pytest-demo.git
uv run pytest -v
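For reference, a minimal pytest sketch (a hypothetical test_math.py) looks like this; pytest auto-discovers functions whose names start with test_:

```python
# test_math.py (hypothetical example)
def add(a: int, b: int) -> int:
    return a + b

def test_add_positive():
    assert add(2, 3) == 5

def test_add_negative():
    assert add(-1, -1) == -2

# Run with: uv run pytest -v
```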
[Avg. reading time: 11 minutes]
DUCK DB
DuckDB ships as a single file and is built with no external dependencies.
All the great features can be read here https://duckdb.org/
Automatic Parallelism: DuckDB has improved its automatic parallelism capabilities, meaning it can more effectively utilize multiple CPU cores without requiring manual tuning. This results in faster query execution for large datasets.
Parquet File Improvements: DuckDB has improved its handling of Parquet files, both in terms of reading speed and support for more complex data types and compression codecs. This makes DuckDB an even better choice for working with large datasets stored in Parquet format.
Query Caching: Improves the performance of repeated queries by caching the results of previous executions. This can be a game-changer for analytics workloads with similar queries being run multiple times.
How to use DuckDB?
Download the CLI Client
- Download the CLI executable for your platform (Windows, macOS, or Linux).
- For other programming languages, visit https://duckdb.org/docs/installation/
- Unzip the file.
- Open Command Prompt / Terminal and run the executable.
DuckDB in Data Engineering
Download orders.parquet from
https://github.com/duckdb/duckdb-data/releases/download/v1.0/orders.parquet
More files are available here
https://github.com/cwida/duckdb-data/releases/
Open Command Prompt or Terminal
./duckdb
# Create / Open a database
.open ordersdb
DuckDB allows you to read the contents of orders.parquet as-is, without needing a table. Double quotes around the file name "orders.parquet" are essential.
describe table "orders.parquet"
Not only this, but it also allows you to query the file as-is. (This feature is similar to one Databricks supports.)
select * from "orders.parquet" limit 3;
DuckDB supports CTAS syntax and helps to create tables from the actual file.
show tables;
create table orders as select * from "orders.parquet";
select count(*) from orders;
DuckDB supports parallel query processing, and queries run fast.
This table has 1.5 million rows, and aggregation happens in less than a second.
select now(); select o_orderpriority,count(*) cnt from orders group by o_orderpriority; select now();
DuckDB also helps to convert parquet files to CSV in a snap. It also supports converting CSV to Parquet.
COPY "orders.parquet" to 'orders.csv' (FORMAT "CSV", HEADER 1);Select * from "orders.csv" limit 3;
It also supports exporting existing Tables to Parquet files.
COPY "orders" to 'neworder.parquet' (FORMAT "PARQUET");
DuckDB supports programming languages such as Python, R, Java, Node.js, and C/C++.
DuckDB also supports higher-level SQL programming such as macros, sequences, and window functions.
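A minimal sketch of the Python API (assuming `pip install duckdb` and the orders.parquet file downloaded earlier):

```python
import duckdb

# Query the Parquet file directly; no table needed
df = duckdb.sql("""
    SELECT o_orderpriority, COUNT(*) AS cnt
    FROM 'orders.parquet'
    GROUP BY o_orderpriority
""").df()

print(df)
```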
Get sample data from Yellow Cab
https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
Copy yellow cabs data into yellowcabs folder
create table taxi_trips as select * from "yellowcabs/*.parquet";
SELECT
PULocationID,
EXTRACT(HOUR FROM tpep_pickup_datetime) AS hour_of_day,
AVG(fare_amount) AS avg_fare
FROM
taxi_trips
GROUP BY
PULocationID,
hour_of_day;
Extensions
https://duckdb.org/docs/extensions/overview
INSTALL json;
LOAD json;
select * from demo.json;
describe demo.json;
Load directly from HTTP location
select * from 'https://raw.githubusercontent.com/gchandra10/filestorage/main/sales_100.csv'
#duckdb #singlefiledatabase #parquet #tools #cli
[Avg. reading time: 8 minutes]
JQ
- jq is a lightweight and flexible command-line JSON processor.
- Reads JSON from stdin or a file, applies filters, and writes JSON to stdout.
- Useful when working with APIs, logs, or config files in JSON format.
- Handy tool in Automation.
- Download JQ CLI (Preferred) and learn JQ.
- Use the VSCode Extension and learn JQ.
Download the sample JSON
https://raw.githubusercontent.com/gchandra10/jqtutorial/refs/heads/master/sample_nows.json
Note: As this has no root element, '.' is used.
1. View JSON file in readable format
jq '.' sample_nows.json
2. Read the First JSON element / object
jq 'first(.[])' sample_nows.json
3. Read the Last JSON element
jq 'last(.[])' sample_nows.json
4. Read top 3 JSON elements
jq 'limit(3;.[])' sample_nows.json
5. Read the elements at index 2 and 3. Slicing works like Python: the LEFT side is inclusive, the RIGHT side exclusive.
jq '.[2:4]' sample_nows.json
6. Extract individual values. | Pipeline the output
jq '.[] | [.balance,.age]' sample_nows.json
7. Extract individual values and do some calculations
jq '.[] | [.age, 65 - .age]' sample_nows.json
8. Return CSV from JSON
jq '.[] | [.company, .phone, .address] | @csv ' sample_nows.json
9. Return Tab Separated Values (TSV) from JSON
jq '.[] | [.company, .phone, .address] | @tsv ' sample_nows.json
10. Return with custom pipeline delimiter ( | )
jq '.[] | [.company, .phone, .address] | join("|")' sample_nows.json
Pro tip: redirect the result to a file (> output.txt) and import it into a database using bulk-load tools such as bcp or LOAD DATA INFILE.
11. Convert the number to string and return | delimited result
jq '.[] | [.balance,(.age | tostring)] | join("|") ' sample_nows.json
12. Process Array return Name (returns as list / array)
jq '.[] | [.friends[].name]' sample_nows.json
or (returns line by line)
jq '.[].friends[].name' sample_nows.json
13. Parse multi level values
returns as list / array
jq '.[] | [.name.first, .name.last]' sample_nows.json
returns line by line
jq '.[].name.first, .[].name.last' sample_nows.json
14. Query values based on condition, say .index > 2
jq 'map(select(.index > 2))' sample_nows.json
jq 'map(select(.index > 2)) | .[] | [.index,.balance,.age]' sample_nows.json
15. Sorting Elements
# Sort by Age ASC
jq 'sort_by(.age)' sample_nows.json
# Sort by Age DESC
jq 'sort_by(-.age)' sample_nows.json
# Sort on multiple keys
jq 'sort_by(.age, .index)' sample_nows.json
Use Cases
curl -s https://www.githubstatus.com/api/v2/status.json
curl -s https://www.githubstatus.com/api/v2/status.json | jq '.'
curl -s https://www.githubstatus.com/api/v2/status.json | jq '.status'
#jq #tools #json #parser #cli #automation
[Avg. reading time: 5 minutes]
SQLite
It's a serverless, embedded database: the database engine is a library compiled into your application.
- The entire database is one file on disk.
- It's self-contained - needs no external dependencies.
- It's the most widely deployed database in the world.
How It’s Different from “Big” Databases
- No client-server architecture - your app directly reads/writes the database file
- No network overhead - everything is local file I/O
- No configuration - no setup, no admin, no user management
- Lightweight - the library is only a few hundred KB
- Single writer at a time - multiple readers OK, but writes are serialized
Key Architectural Concepts
ACID Properties:
- Transactions are atomic, consistent, isolated, durable
- Even if your app crashes mid-write, database stays consistent
Locking & Concurrency:
- Database-level locking (not row or table level like PostgreSQL)
- Write transactions block other writers
- This is fine for mobile/embedded, problematic for high-concurrency servers
Storage & Pages:
- Data stored in fixed-size pages (default 4KB)
- Understanding page size matters for performance tuning
When to Use SQLite
- Mobile apps (iOS, Android)
- Desktop applications
- Embedded systems (IoT devices, cars, planes)
- Small-to-medium websites (< 100K hits/day)
- Local caching
- Application file format (instead of XML/JSON)
- Development/testing
When not to Use SQLite
- High-concurrency web apps with many simultaneous writers
- Distributed systems needing replication
- Client-server architectures where you need central control
- Applications requiring fine-grained access control
Performance Characteristics
- Extremely fast for reads
- Very fast for writes on local storage
- Slower on network drives (NFS, cloud mounts)
- Indexes work like other databases - crucial for query performance
- Analyze your queries - use EXPLAIN QUERY PLAN (see the sketch below)
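A small sketch of these ideas using Python's built-in sqlite3 module (the file, table, and index names are arbitrary):

import sqlite3

conn = sqlite3.connect("demo.db")   # the whole database is this one file
cur = conn.cursor()

cur.execute("CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE INDEX IF NOT EXISTS idx_users_name ON users(name)")
cur.execute("INSERT INTO users (name) VALUES (?)", ("Rachel",))
conn.commit()   # transactions are ACID

# Check whether a query can use the index
for row in cur.execute("EXPLAIN QUERY PLAN SELECT * FROM users WHERE name = ?", ("Rachel",)):
    print(row)

conn.close()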
Demo
git clone https://github.com/gchandra10/python_sqlite_demo
[Avg. reading time: 5 minutes]
Introduction
MLflow Components
MLflow Tracking
- Logs experiments, parameters, metrics, and artifacts
- Provides UI for comparing runs and visualizing results
- Supports automatic logging for popular ML libraries
Use case: Track model performance across different hyperparameters, compare experiment results
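A minimal sketch of the tracking API (experiment name, parameter, and metric values are made up; assumes mlflow is installed):

import mlflow

mlflow.set_experiment("demo_experiment")

with mlflow.start_run():
    mlflow.log_param("alpha", 0.1)        # hyperparameter
    mlflow.log_metric("rmse", 4.2)        # evaluation result
    # mlflow.log_artifact("plot.png")     # any local file: plots, datasets, serialized models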
MLflow Projects
- Packages ML code in reusable, reproducible format
- Uses conda.yaml or requirements.txt for dependencies
- Supports different execution environments (local, cloud, Kubernetes)
Use case: Share reproducible ML workflows, standardize project structure
MLflow Models
- Standardizes model packaging and deployment
- Supports multiple ML frameworks (scikit-learn, TensorFlow, PyTorch, etc.)
- Enables model serving via REST API, batch inference, or cloud platforms
Use case: Deploy models consistently across environments, A/B test different model versions
MLflow Model Registry
- Centralized model store with versioning and stage management
- Tracks model lineage and metadata
- Supports approval workflows and access controls
Use case: Manage model lifecycle from staging to production, collaborate on model deployment
Common Use Cases
Experiment Management
- Compare model architectures, hyperparameters, and feature engineering approaches
- Track training metrics over time and across team members
Model Deployment
- Package models for consistent deployment across dev/staging/prod
- Serve models as REST endpoints or batch processing jobs
Collaboration
- Share reproducible experiments and models across data science teams
- Maintain audit trail of model development and deployment decisions
MLOps Workflows
- Automate model training, validation, and deployment pipelines
- Integrate with CI/CD systems for continuous model delivery
MLflow works well as a lightweight, open-source solution that integrates with existing ML workflows without requiring major infrastructure changes.
[Avg. reading time: 4 minutes]
MLflow Experiment Structure
A typical Chemistry experiment we did in school days.
| Experiment (ML Project) | Run # | Inputs (Parameters) | Process (Code/Recipe) | Outputs (Artifacts) | Metrics (Results) |
|---|---|---|---|---|---|
| Acid + Base Reaction | Run 1 | Acid=10ml, Base=5ml | Stirred 2 mins, room temp | Beaker with solution | pH=7.0 |
| Acid + Base Reaction | Run 2 | Acid=10ml, Base=7ml | Stirred 2 mins, room temp | Beaker with solution | pH=6.2 |
| Acid + Base Reaction | Run 3 | Acid=10ml, Base=7ml | Stirred 5 mins, heat | Beaker with solution | pH=6.0, Color=yellow |
- Experiment → Group of related trials (like a project or ML task).
- Run → One trial with a unique ID (just like a single lab experiment entry).
- Inputs (Parameters) → Model hyperparameters (learning rate, batch size, etc.).
- Process (Code/Recipe) → Training code or pipeline steps.
- Outputs (Artifacts) → Models, plots, datasets, or serialized files.
- Metrics (Results) → Accuracy, loss, F1-score, etc.
MLflow
│
├── Experiment A
│   ├── Run 1
│   │   ├── Parameters
│   │   ├── Metrics
│   │   ├── Artifacts
│   │   └── Tags
│   ├── Run 2
│   │   ├── Parameters
│   │   ├── Metrics
│   │   ├── Artifacts
│   │   └── Tags
│   └── Run 3
│       ├── Parameters
│       ├── Metrics
│       ├── Artifacts
│       └── Tags
│
└── Experiment B
    ├── Run 1
    ├── Run 2
    └── Run N
git clone https://github.com/gchandra10/uni_multi_model.git
[Avg. reading time: 8 minutes]
Why MLflow
MLflow provides comprehensive support for traditional ML workflows, making it effortless to track experiments, manage models, and deploy solutions at scale.
Key Features
Intelligent (Auto)logging
- Simple Integration for scikit-learn, XGBoost, and more
- Automatic Parameter Capture (logs all model hyperparameters without manual intervention)
- Built-in Evaluation Metrics (automatically computes and stores relevant performance metrics)
- Model Serialization (handles complex objects like pipelines seamlessly)
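A hedged sketch of autologging with scikit-learn (the data values are made up; mlflow.autolog() enables autologging for any supported library):

import mlflow
import numpy as np
from sklearn.linear_model import LinearRegression

mlflow.autolog()   # hyperparameters, metrics, and the model are captured automatically

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.1, 6.2, 7.9])

with mlflow.start_run():
    LinearRegression().fit(X, y)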
Compare Model Performance Across Algorithms
- Save Time: No more manually tracking results in spreadsheets or notebooks
- Make Better Decisions: Easily spot which algorithms perform best on your data
- Avoid Mistakes: Never lose track of promising model configurations
- Share Results: Team members can see all experiments and build on each other's work
- Visual charts comparing accuracy, precision, recall across all your models
- Sortable tables showing parameter combinations and their results
- Quick filtering to find models that meet specific performance criteria
- Export capabilities to share findings with stakeholders
Flexible Deployment
- Real-Time Inference for low-latency prediction services
- Batch Processing for large-scale scoring jobs
- Edge Deployment for offline and mobile applications
- Containerized Serving with Docker and Kubernetes support
- Cloud Integration across AWS, Azure, and Google Cloud platforms
- Custom Serving Logic for complex preprocessing and postprocessing requirements
Capabilities
Tracking Server & MLflow UI
Start a new project
VSCode, Open Workspace
Open Shell 1 (Terminal/GitBash)
uv init mlflow_demo
cd mlflow_demo
uv add mlflow pandas numpy scikit-learn matplotlib
Option 1: Store MLflow details in Local Machine
mlflow server --host 127.0.0.1 --port 8080
Open this URL and copy the file to your VSCode
https://github.com/gchandra10/uni_multi_model/blob/main/01-lr-model.py
Open Shell 2
Step: Activate the virtual environment
python 01-lr-model.py
Open your browser and goto http://127.0.0.1:8080
View the Experiment
Option 2: Store MLflow details in a Local Database
mlflow server --host 127.0.0.1 --port 8080 \
--backend-store-uri sqlite:///mlflow.db
Option 3: Store MLflow details in a Remote Database
export AWS_PROFILE=your_profile_name
mlflow server --host 127.0.0.1 --port 8080 \
--default-artifact-root s3://yourbucket \
--backend-store-uri 'postgresql://yourhostdetails/'
Model Serving
Open Shell 3
Optional Step Activate Virtual Environment
export MLFLOW_TRACKING_URI=http://127.0.0.1:8080
mlflow models serve \
-m "models:/Linear_Regression_Model/1" \
--host 127.0.0.1 \
--port 5001 \
--env-manager local
Real Time Prediction
Open Shell 4
Optional Step Activate Virtual Environment
curl -X POST "http://127.0.0.1:5001/invocations" \
-H "Content-Type: application/json" \
--data '{"inputs": [{"ENGINESIZE": 2.0}, {"ENGINESIZE": 3.0}, {"ENGINESIZE": 4.0}]}'
OR
curl -X POST http://127.0.0.1:5001/invocations \
-H "Content-Type: application/json" \
-d '{
"dataframe_split": {
"columns": ["ENGINESIZE"],
"data": [[2.0],[3.0],[4.0]]
}
}'
#mlflow #serving #mlflow_server
[Avg. reading time: 5 minutes]
YAML
Introduction
- YAML Ain’t Markup Language.
- Human-readable alternative to JSON.
- Indentation is significant and defines structure (like Python).
- Used for configuration, not for programming logic.
Key Principles
- Whitespace indentation -> hierarchy
- Colon (:) -> Key Value Pair
- Dash (-) -> List Item
- Comments (#)
Use Cases in MLOps
- MLflow experiment configs (parameters, environments)
- Kubernetes -> Pods, Services, Deployments
- Docker Compose -> multi-container setups
- CI/CD pipelines -> GitHub Actions, GitLab CI, Azure DevOps
{
  "experiment": "CO2_Regression",
  "params": {
    "alpha": 0.1,
    "max_iter": 100
  },
  "tags": ["linear_regression", "mlflow"]
}
experiment: CO2_Regression
params:
  alpha: 0.1
  max_iter: 100
tags:
  - linear_regression
  - mlflow
Validate YAML with YAMLLint or the VSCode YAML Validator extension.
YAML Data Structures
Scalars (strings, numbers, booleans)
learning_rate: 0.01
early_stopping: true
experiment_name: "CO2_Prediction"
Lists
models:
- linear_regression
- random_forest
- xgboost
Dictionaries (maps)
params:
  n_estimators: 100
  max_depth: 5
Description
description: |
  This is a multi-line string.
  It preserves line breaks.
  Useful for comments/description/notes.
Putting together
experiment:
  name: CO2_Regression
  params:
    alpha: 0.1
    max_iter: 100
  metrics:
    - mse
    - r2
  description: |
    Model built using Linear Regression.
    We can use univariate or multi variate.
environments:
  development:
    database: sqlite
  production:
    database: mysql
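In Python, a config like the one above is typically read with PyYAML (a sketch; assumes the pyyaml package is installed and the config is saved as config.yaml):

import yaml

with open("config.yaml") as f:
    config = yaml.safe_load(f)   # YAML becomes plain dicts, lists, and scalars

print(config["experiment"]["name"])               # CO2_Regression
print(config["experiment"]["params"]["alpha"])    # 0.1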
Default Values
Using &anchorName and *anchorName and Merge Key <<
base_config: &base
  host: localhost
  port: 3306

development:
  <<: *base
  database: dev_db

production:
  <<: *base
  database: prod_db
  host: prod.server.com
Using Environment Variables
Note: YAML itself does not expand environment variables; the tool or application that loads the file performs the ${...} substitution.
config:
  path: ${USERPROFILE}\folder1
Mac/Linux/Git Bash
export USERPROFILE="sometext"
Command Prompt
set USERPROFILE="sometext"
YAML Variables
variables:
  base_url: http://example.com
endpoints:
  user: ${variables.base_url}/user
  admin: ${variables.base_url}/admin
https://github.com/gchandra10/python_yaml_demo.git
[Avg. reading time: 1 minute]
Cloud
[Avg. reading time: 6 minutes]
Overview
Definitions
Hardware: physical computer / equipment / devices
Software: programs such as operating systems, Word, Excel
Web Site: Read-only web pages such as company pages, portfolios, newspapers
Web Application: Read/Write - online forms, Google Docs, email, Google apps
Cloud Plays a significant role in the Big Data world.
In today’s market, Cloud helps companies to accommodate the ever-increasing volume, variety, and velocity of data.
Cloud computing is the on-demand delivery of IT resources over the Internet with pay-per-use pricing.

Src : https://thinkingispower.com/the-blind-men-and-the-elephant-is-perception-reality/
Without cloud knowledge, understanding Big Data is like the blind men and the elephant above: each person sees only one part of the picture.
- Volume: Size of the data.
- Velocity: Speed at which new data is generated.
- Variety: Different types of data.
- Veracity: Trustworthiness of the data.
- Value: Usefulness of the data.
- Vulnerability: Security and privacy aspects.
When people focus on only one aspect without the help of cloud technologies, they miss out on the comprehensive picture. Cloud solutions offer ways to manage all these dimensions in an integrated manner, thus providing a fuller understanding and utilization of Big Data.
Advantages of Cloud Computing
- Cost Savings
- Security
- Flexibility
- Mobility
- Insight
- Increased Collaboration
- Quality Control
- Disaster Recovery
- Loss Prevention
- Automatic Software Updates
- Competitive Edge
- Sustainability
Types of Cloud Computing
Public Cloud
Owned and operated by third-party providers. (AWS, Azure, GCP, Heroku, and a few more)
Private Cloud
Cloud computing resources are used exclusively by a single business or organization.
Hybrid
Public + Private: By allowing data and applications to move between private and public clouds, a hybrid cloud gives your business greater flexibility and more deployment options, and helps optimize your existing infrastructure, security, and compliance.
[Avg. reading time: 5 minutes]
Types of Cloud Services
SaaS
Software as a Service
Cloud-based service providers offer end-user applications. Google Apps, DropBox, Slack, etc.
- Web access to Software (primarily commercial).
- Software is managed from a central location.
- One-to-many delivery model.
- No patches, No upgrades
When not to use
- Hardware integration is needed. (Price Scanner)
- Faster processing is required.
- Cannot host data outside the premise.
PaaS
Platform as a Service
Software tools are available over the internet. AWS RDS, Heroku, Salesforce
- Scalable
- Built on Virtualization Technology
- Users don't maintain the underlying software (DB upgrades and patches are handled by the cloud provider).
When not to use PaaS
- Proprietary tools make it hard to move to a different provider (e.g., AWS-specific tools).
- Using new software that is not part of the PaaS toolset.
IaaS
Infrastructure as a Service
Cloud-based hardware services. Pay-as-you-go services for Storage, Networking, and Servers.
Amazon EC2, Google Compute Engine, S3.
- Highly flexible and scalable.
- Accessible by more than one user.
- Cost-effective (if used right).

Serverless computing
Focuses on building apps without spending time managing servers/infrastructure.
Feature automatic scaling, built-in high availability, and pay-per-use.
Use of resources when a specific function or event occurs.
Cloud providers handle deployment, capacity, and server management.
Example: AWS Lambda, AWS Step Functions.
Easy way to remember SaaS, PaaS, IaaS

bigcommerce.com
[Avg. reading time: 5 minutes]
Challenges of Cloud Computing
Privacy: “Both traditional and Big Data sets often contain sensitive information, such as addresses, credit card details, or social security numbers.”
So, it’s the responsibility of users to ensure proper security methods are followed.
Compliance: Cloud providers replicate data across regions to ensure safety. If a company is bound by regulations requiring that data stay within the organization, or not be stored in a specific part of the world, this replication can become a compliance problem.
Data Availability: Everything is dependent on the Internet and speed. It is also dependent on the choice of the cloud provider. Big companies like AWS / GCP / Azure have more data centers and backup facilities.
Connectivity: Internet availability + speed.
Vendor lock-in: Once an organization has migrated its data and applications to the cloud, switching to a different provider can be difficult and expensive. This is known as vendor lock-in. Cloud-agnostic tools like Databricks help enterprises mitigate this problem, but it's still a challenge.
Cost: Cloud computing can be a cost-effective way to deploy and manage IT resources. However, it is essential to carefully consider your needs and budget before choosing a cloud provider.
Continuous Training: Employees may need to be trained to use cloud-based applications. This can be a cost and time investment.
Constant Change in Technology: Cloud providers constantly improve or change their technology. Recently, Microsoft decided to decommission Synapse and launch a new tool called Fabric.
[Avg. reading time: 4 minutes]
AWS
Terms to Know
Elasticity: The ability to acquire resources as you need them and release resources when you no longer need them.
Scale Up vs. Scale Down
Scale-Out vs. Scale In
Latency
Typically, latency is the time for a round trip between two systems: how long it takes data to travel from one to the other and back.
Root User
Owner of the AWS account.
IAM
Identity Access Management
ARN
Amazon Resource Name
For example
arn:aws:iam::123456789012:user/Development/product_1234/*
Policy
A set of rules that defines what actions are allowed or denied.
AWS Popular Services
Amazon EC2
Allows you to deploy virtual servers within your AWS environment.
Amazon S3
A fully managed, object-based storage service that is highly available, highly durable, cost-effective, and widely accessible.
AWS IAM (Identity and Access Management)
Used to manage permissions to your AWS resources
AWS Management Services
Amazon CloudWatch
A comprehensive monitoring tool allows you to monitor your services and applications in the cloud.
Billing & Budgeting
Helps control the cost.
[Avg. reading time: 6 minutes]
AWS Global Infrastructure
The Primary two items are given below.
- Availability Zones
- Regions
Availability Zones (AZs)
AZs are the physical data centers of AWS.
This is where the actual computing, storage, network, and database resources are hosted that we as consumers, provision within our Virtual Private Clouds (VPCs).
A common misconception is that a single availability zone equals a single data center. Multiple data centers located closely form a single availability zone.
Each AZ is paired with at least one other AZ in the same geographical area, and each AZ is isolated from the others with separate power and networking (similar to a DR setup).
Many AWS services use low-latency links between AZs to replicate data for high availability and resilience purposes.
Multiple AZs together form an AWS Region. (Example: N. Virginia)

Regions
Every Region will act independently of the others, containing at least two Availability Zones.
Interestingly, only some AWS services are available in some regions.
- US East (N. Virginia) us-east-1
- US East (Ohio) us-east-2
- EU (Ireland) eu-west-1
- EU (Frankfurt) eu-central-1
Note: As of today, AWS is available in 38 regions and 120 AZs
Edge Location
A smaller AWS data center used by Amazon CloudFront and Lambda@Edge to cache content closer to users.
Reduces latency and improves performance for end users, especially for content delivery and inference endpoints.
A user in Singapore fetching from a U.S. model endpoint may hit an Edge Location nearby for lower latency.
Use Cases:
- DNS Resolution (Route 53)
- Content Caching
#aws #region #az #edgelocation
[Avg. reading time: 3 minutes]
CIDR
CIDR = Classless Inter-Domain Routing
It defines how many IP addresses are in a network (or subnet) using a “slash” notation.
Example: 192.168.10.0/24
- Network address: 192.168.10.0
- Prefix Length: /24 means this network will have 256 total IPs
Number of IPs = 2^(32 - prefix)
But AWS and most networks reserve 5 IPs in each subnet:
- 1 for network address
- 1 for broadcast address
- 3 reserved by AWS (for internal routing, DNS, etc.)
/24 subnet gives 251 usable IPs
| CIDR | Subnet Mask | Total IPs | Usable in AWS | Typical Use |
|---|---|---|---|---|
| /16 | 255.255.0.0 | 65,536 | 65,531 | Entire VPC range |
| /20 | 255.255.240.0 | 4,096 | 4,091 | Large subnet |
| /24 | 255.255.255.0 | 256 | 251 | Typical small subnet |
| /28 | 255.255.255.240 | 16 | 11 | Small test subnet |
| /32 | 255.255.255.255 | 1 | 0 | Single host route |
192.168.10.0 = 11000000.10101000.00001010.00000000
The last 8 bits enumerate the host addresses in the /24:
00000000
00000001
00000010
.....
.....
11111111
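The same arithmetic can be checked with Python's built-in ipaddress module:

import ipaddress

net = ipaddress.ip_network("192.168.10.0/24")
total = net.num_addresses        # 2 ** (32 - 24) = 256
usable_in_aws = total - 5        # AWS reserves 5 IPs per subnet
print(total, usable_in_aws)      # 256 251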
[Avg. reading time: 6 minutes]
EC2
(Elastic Compute Cloud)
Compute: Closely related to CPU/RAM
Elastic Compute Cloud (EC2): AWS EC2 provides resizable compute capacity in the cloud, allowing you to run virtual servers as per your needs.
Instance Types: EC2 offers various instance types optimized for different use cases, such as general purpose, compute-optimized, memory-optimized, and GPU instances.
Pricing Models
On-Demand: Pay for computing capacity by the hour or second.
Reserved: Commit to a one or 3-year term and get a discount.
Spot: Bid for unused EC2 capacity at a reduced cost.
Savings Plans: Commit to consistent compute usage for lower prices.
AMI (Amazon Machine Image): Pre-configured templates for your EC2 instances, including the operating system, application server, and applications.
Security
Security Groups: Act as a virtual firewall for your instances to control inbound and outbound traffic.
Key Pairs: These are used to access your EC2 instances via SSH or RDP securely.
Elastic IPs: These are static IP addresses that can be associated with EC2 instances. They are useful for hosting services that require a consistent IP.
Auto Scaling: Automatically adjusts the number of EC2 instances in response to changing demand, ensuring you only pay for what you need.
Elastic Load Balancing (ELB): Distributes incoming traffic across multiple EC2 instances, improving fault tolerance and availability.
EBS (Elastic Block Store): Provides persistent block storage volumes for EC2 instances, allowing data to be stored even after an instance is terminated.
Regions and Availability Zones: EC2 instances can be deployed in various geographic regions, each with multiple availability zones for high availability and fault tolerance.
Storage
Persistent Storage
- Elastic Block Storage (EBS) Volumes / Logically attached via AWS network.
- Automatically replicated.
- Encryption is available.
Ephemeral Storage - Local storage
- Physically attached to the underlying host.
- When the instance is stopped or terminated, all the data is lost.
- Rebooting will keep the data intact.
DEMO - Deploy EC2
[Avg. reading time: 19 minutes]
S3
(Simple Storage Service)
It’s an IaaS service. S3 uses Object storage instead of File storage (like your machine or Google Drive)
Warehouse vs Book Shelf
| Scenario | File Storage | Object Storage (S3, Blob) |
|---|---|---|
| Read one small file | Faster (local I/O) | Slight overhead (API call) |
| Read 10,000 files concurrently | Struggles with locks & hierarchy | Scales linearly via APIs |
| Edit file in place | Easy | Not possible (must re-upload) |
| Integrate with Spark / Databricks | Not scalable | Native integration (spark.read.parquet("s3://...")) |
| Network access | Typically mounted | Always network-based (HTTP) |
- Highly Available
- Durable
- Cost Effective
- Widely Accessible
- Uptime of 99.99%
- Objects and Buckets: The fundamental elements of Amazon S3 are objects and buckets. Objects are the individual data pieces stored in Amazon S3, while buckets are containers for these objects. An object consists of a file and, optionally, any metadata that describes that file.
- It's also a regional service, meaning that when you create a bucket, you specify a region, and all objects are stored there.
- Globally Unique: The name of an Amazon S3 bucket must be unique across all of Amazon S3, that is, across all AWS customers. It's like a domain name.
- Globally Accessible: Even though you specify a particular region when you create a bucket, once the bucket is created, you can access it from anywhere in the world using the appropriate URL.
- Scalability: Amazon S3 can scale in terms of storage, request rate, and users to support unlimited web-scale applications.
- Security: Amazon S3 includes several robust security features, such as encryption for data at rest and in transit, access controls like Identity and Access Management (IAM) policies, bucket policies, and Access Control Lists (ACLs), and features for monitoring and logging activity, like AWS CloudTrail.
- Data transfer: Amazon S3 supports transfer acceleration, which speeds up uploads and downloads of large objects.
- Event Notification: S3 can notify you of specific events in your bucket. For instance, you could set up a notification to alert you when an object is deleted from your bucket.
- Management Features: S3 has a suite of features to help manage your data, including lifecycle management, which allows you to define rules for moving or expiring objects, versioning to keep multiple versions of an object in the same bucket, and analytics for understanding and optimizing storage costs.
- Consistency: Amazon S3 provides read-after-write consistency for PUTS of new objects and eventual consistency for overwrite PUTS and DELETES. (Note: since December 2020, S3 provides strong read-after-write consistency for all PUT and DELETE operations; the eventual-consistency model below describes the older behavior.)
- Read-after-write Consistency for PUTS of New Objects: When a new object is uploaded (PUT) into an Amazon S3 bucket, it's immediately accessible for read (GET) operations. This is known as read-after-write consistency. You can immediately retrieve a new object as soon as you create it. This applies across all regions in AWS, and it's crucial when immediate, accurate data retrieval is required.
- Eventual Consistency for Overwrite PUTS and DELETES: Overwrite PUTS and DELETES refer to operations where an existing object is updated (an overwrite PUT) or removed (a DELETE). For these operations, Amazon S3 historically provided eventual consistency. If you updated or deleted an object and immediately attempted to read or delete it, you might still have gotten the old version, or found it still there (in the case of a DELETE), for a short period. This state of affairs was temporary, and shortly after the update or deletion, you would see the new version or find the object gone, as expected.

Src: Mailbox
- S3 is like a building full of mailboxes (buckets). Each bucket has a unique name globally, and you can only access the ones you have keys for (permissions).
- The overall S3 service is like a large building that contains multiple lockers.
- Each bucket is a unique container that stores your objects (files, images, datasets).
- Only authorized users (via IAM roles or bucket policies) can open that specific locker.

Src: USPS
- This is the destination: it tells S3 where to deliver or find the object (s3://my-bucket/path/to/file.csv).
- The envelope is the actual content.
- Labels on the envelope are the object metadata (content-type, size, date, and so on).
Notes
Data is stored as an “Object.”
Object storage, also known as object-based storage, manages data as objects. Each object includes the data, associated metadata, and a globally unique identifier.
Unlike file storage, there are no folders or directories in object storage. Instead, objects are organized into a flat address space, called a bucket in Amazon S3’s terminology.
The unique identifier allows an object to be retrieved without needing to know the physical location of the data. Metadata can be customized, making object storage incredibly flexible.
Every object gets a UID (universal ID) and associated META data.
No Folders / SubFolders
For example, if you have an object with the key images/summer/beach.png in your bucket, Amazon S3 has no internal concept of the images or summer as separate entities—it simply sees the entire string images/summer/beach.png as the key for that object.
To store objects in S3, you must first define and create a bucket.
You can think of a bucket as a container for your data.
This bucket name must be unique, not just within the region you specify, but globally against all other S3 buckets, of which there are many millions.
Any object uploaded to your buckets is given a unique object key to identify it.
- S3 bucket ownership is not transferable.
- S3 bucket names must start with a lowercase letter or number; hyphens (-) are allowed in between.
- An AWS account can have a maximum of 100 buckets by default (a soft limit that can be increased).
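A minimal boto3 sketch of the bucket/object workflow (the bucket name is a placeholder and must be globally unique; assumes boto3 is installed and AWS credentials are configured):

import boto3

s3 = boto3.client("s3")
bucket = "chandr34-example-bucket"   # placeholder; bucket names are globally unique

# Outside us-east-1, create_bucket also needs a CreateBucketConfiguration
s3.create_bucket(Bucket=bucket)

# The key is one flat string; "folders" are only a naming convention
s3.put_object(Bucket=bucket, Key="images/summer/beach.png", Body=b"example bytes")

for obj in s3.list_objects_v2(Bucket=bucket).get("Contents", []):
    print(obj["Key"])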
More details
#aws #s3 #storage #objectstorage
[Avg. reading time: 6 minutes]
IAM
(Identity and Access Management)

src: Aws
ARN: Amazon Resource Name
Users - Individual Person / Application
Groups - Collection of IAM Users
Policies - Policy sets permission/control access to AWS resources. Policies are stored in AWS as JSON documents.
A Policy can be attached to multiple entities (users, groups, and roles) in your AWS account.
Multiple Policies can be created and attached to the user.
Roles - A set of permissions that define what actions are allowed and denied. Unlike a user, a role has no permanent credentials of its own; it can be assumed by any trusted entity (a user, an application, or an AWS service).
// Examples of ARNs
arn:aws:s3:::my_corporate_bucket/*
arn:aws:s3:::my_corporate_bucket/Development/*
arn:aws:iam::123456789012:user/chandr34
arn:aws:iam::123456789012:group/bigdataclass
arn:aws:iam::123456789012:group/*
Types of Policies
Identity-based policies: Identity-based policies are attached to an IAM user, group, or role (identities). These policies control what actions an identity can perform, on which resources, and under what conditions.
Resource-based policies: Resource-based policies are attached to a resource such as an Amazon S3 bucket. These policies control what actions a specified principal can perform on that resource and under what conditions.
Permission Boundary: You can use an AWS-managed policy or a customer-managed policy to set the boundary for an IAM entity (user or role). A permissions boundary is an advanced feature for using a managed policy to set the maximum permissions that an identity-based policy can grant to an IAM entity.
Inline Policies: Policies that are embedded in an IAM identity. Inline policies maintain a strict one-to-one relationship between a policy and an identity. They are deleted when you delete the identity.
[Avg. reading time: 13 minutes]
AWS CloudShell
AWS CloudShell is a browser-based shell environment available directly through the AWS Management Console. It provides a command-line interface (CLI) to manage and interact with AWS resources securely without needing to install any software or set up credentials on your local machine.
Use Cases
Quick Access to AWS CLI
Allows you to run AWS CLI commands directly without configuring your local machine. It’s perfect for quick tasks like managing AWS resources (e.g., EC2 instances, S3 buckets, or Lambda functions).
Development and Automation
You can write and execute scripts using common programming languages like Python and Shell. It’s great for testing and automating tasks directly within your AWS environment.
Secure and Pre-Configured Environment
AWS CloudShell comes pre-configured with AWS CLI, Python, Node.js, and other essential tools. It uses your IAM permissions, so you don’t need to handle keys or credentials directly, making it secure and convenient.
Access to Filesystem and Persistent Storage
You get a persistent 1 GB home directory per region to store scripts, logs, or other files between sessions, which can be used to manage files related to your AWS resources.
Cross-Region Management
You can access and manage resources across different AWS regions directly from CloudShell, making it useful for multi-region setups.
Basic Commands
aws s3 ls
aws ec2 describe-instances
sudo yum install -y jq   # CloudShell runs Amazon Linux, which uses yum/dnf rather than apt
list_buckets.sh
#!/bin/bash
echo "Listing all S3 buckets:"
aws s3 ls
bash list_buckets.sh
# get account details
aws sts get-caller-identity
# list available regions
aws ec2 describe-regions --query "Regions[].RegionName" --output table
# create a bucket
aws s3 mb s3://chandr34-newbucket
# upload a file to a bucket
echo "Hello, CloudShell!" > hello.txt
aws s3 cp hello.txt s3://chandr34-newbucket
# List files in bucket
aws s3 ls s3://chandr34-newbucket/
# Delete bucket with files
aws s3 rb s3://chandr34-newbucket --force
# List AMIs
aws ec2 describe-images --owners amazon --query 'Images[*].{ID:ImageId,Name:Name}' --output table
# quickly launch a ec2
aws ec2 create-key-pair --key-name gcnewkeypair --query 'KeyMaterial' --output text > myNewKeyPair.pem
# Change Permission
chmod 0400 myNewKeyPair.pem
# Launch new EC2
aws ec2 run-instances --image-id ami-0866a3c8686eaeeba --count 1 --instance-type t2.micro --key-name gcnewkeypair --security-groups default
# Get Public IP
aws ec2 describe-instances --query "Reservations[].Instances[].PublicIpAddress" --output text
# Login to server
ssh -i myNewKeyPair.pem ubuntu@<getthehostip>
# terminate the instance
aws ec2 terminate-instances --instance-ids <>
CloudFormation
my-webserver.yml
AWSTemplateFormatVersion: '2010-09-09'
Description: CloudFormation template to launch an Ubuntu EC2 instance with Nginx installed.

Resources:
  MyEC2Instance:
    Type: AWS::EC2::Instance
    Properties:
      InstanceType: t2.micro
      ImageId: ami-0866a3c8686eaeeba
      KeyName: gcnewkeypair
      SecurityGroupIds:
        - !Ref InstanceSecurityGroup
      UserData:
        Fn::Base64:
          !Sub |
            #!/bin/bash
            apt update -y
            apt install -y nginx
            systemctl start nginx
            systemctl enable nginx
      Tags:
        - Key: Name
          Value: MyNginxServer

  InstanceSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Enable SSH and HTTP access
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 22
          ToPort: 22
          CidrIp: 0.0.0.0/0 # SSH access, restrict this to your IP range for security
        - IpProtocol: tcp
          FromPort: 80
          ToPort: 80
          CidrIp: 0.0.0.0/0 # HTTP access for Nginx
        - IpProtocol: tcp
          FromPort: 443
          ToPort: 443
          CidrIp: 0.0.0.0/0 # HTTPS access for Nginx

Outputs:
  InstanceId:
    Description: The Instance ID of the EC2 instance
    Value: !Ref MyEC2Instance
  PublicIP:
    Description: The Public IP address of the EC2 instance
    Value: !GetAtt MyEC2Instance.PublicIp
  WebURL:
    Description: URL to access the Nginx web server
    Value: !Sub "http://${MyEC2Instance.PublicIp}"
Launch the Stack via CloudShell
# Create the stack
aws cloudformation create-stack --stack-name gc-stack --template-body file://my-webserver.yml --capabilities CAPABILITY_NAMED_IAM
# Check the status
aws cloudformation describe-stacks --stack-name gc-stack --query "Stacks[0].StackStatus"
aws cloudformation describe-stacks --stack-name gc-stack --query "Stacks[0].Outputs"
# delete the stack
aws cloudformation delete-stack --stack-name gc-stack
aws cloudformation describe-stacks --stack-name gc-stack --query "Stacks[0].StackStatus"
# confirm the deletion status
aws cloudformation list-stacks --query "StackSummaries[?StackName=='gc-stack'].StackStatus"
[Avg. reading time: 16 minutes]
Terraform
Features of Terraform
Infrastructure as Code: Terraform allows you to write, plan, and create infrastructure using configuration files. This makes infrastructure management automated, consistent, and easy to collaborate on.
Multi-Cloud Support: Terraform supports many cloud providers and on-premises environments, allowing you to manage resources across different platforms seamlessly.
State Management: Terraform keeps track of the current state of your infrastructure in a state file. This enables you to manage changes, plan updates, and maintain consistency in your infrastructure.
Resource Graph: Terraform builds a resource dependency graph that helps in efficiently creating or modifying resources in parallel, speeding up the provisioning process and ensuring dependencies are handled correctly.
Immutable Infrastructure: Terraform promotes the practice of immutable infrastructure, meaning that resources are replaced rather than updated directly. This ensures consistency and reduces configuration drift.
Execution Plan: Terraform provides an execution plan (terraform plan) that previews changes before they are applied, allowing you to understand and validate the impact of changes before implementing them.
Modules: Terraform supports reusability through modules, which are self-contained, reusable pieces of configuration that help you maintain best practices and reduce redundancy in your infrastructure code.
Community and Ecosystem: Terraform has a large open-source community and many providers and modules available through the Terraform Registry, which makes it easier to get started and integrate with various services.
Use Cases
- Multi-Cloud Provisioning
- Infrastructure Scaling
- Disaster Recovery
- Environment Management
- Compliance & Standardization
- CI/CD Pipelines
- Speed and Simplicity
- Team Collaboration
- Error Reduction
- Enhanced Security
Install Terraform CLI
Terraform Download: https://developer.hashicorp.com/terraform/downloads
Terraform Structure
Provider Block: Specifies the cloud provider or service (e.g., AWS, Azure, Google Cloud) that Terraform will interact with.
provider "aws" {
region = "us-east-1"
}
Resource Block: Defines the resources to be created or managed. A resource can be a server, network, or other infrastructure component.
resource "aws_instance" "example" {
ami = "ami-0c55b159cbfafe1f0"
instance_type = "t2.micro"
}
Data Block: Fetches information about existing resources, often for referencing in resource blocks.
data "aws_ami" "latest" {
most_recent = true
owners = ["amazon"]
}
Variable Block: Declares input variables to make the script flexible and reusable.
variable "instance_type" {
description = "Type of instance to use"
type = string
default = "t2.micro"
}
Output Block: Specifies values to be output after the infrastructure is applied, like resource IDs or connection strings.
output "instance_ip" {
value = aws_instance.example.public_ip
}
Module Block: Used to encapsulate and reuse sets of Terraform resources.
module "vpc" {
source = "./modules/vpc"
cidr_block = "10.0.0.0/16"
}
Locals Block: Defines local values that can be reused in the configuration.
locals {
environment = "production"
instance_count = 3
}
SET these environment variables.
export AWS_ACCESS_KEY_ID="your-access-key-id"
export AWS_SECRET_ACCESS_KEY="your-secret-access-key"
Simple S3 Bucket
simple_s3_bucket.tf
terraform {
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 5.70.0"
}
}
required_version = ">= 1.2.0"
}
provider "aws" {
region = "us-east-1"
profile = "chandr34"
}
resource "aws_s3_bucket" "demo" {
bucket = "chandr34-my-new-tf-bucket"
tags = {
Createdusing = "tf"
Environment = "classdemo"
}
}
output "bucket_name" {
value = aws_s3_bucket.demo.bucket
}
Create a new folder
Copy the .tf into it
terraform init
terraform validate
terraform plan
terraform apply
terraform destroy
Variable S3 Bucket
variable_bucket.tf
terraform {
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 5.70.0"
}
}
required_version = ">= 1.2.0"
}
provider "aws" {
region = "us-east-1"
profile = "chandr34"
}
variable "bucket_name" {
description = "The name of the S3 bucket to create"
type = string
}
resource "aws_s3_bucket" "demo" {
bucket = var.bucket_name
tags = {
Createdusing = "tf"
Environment = "classdemo"
}
}
output "bucket_name" {
value = aws_s3_bucket.demo.bucket
}
Create a new folder
Copy the .tf into it
terraform init
terraform validate
terraform plan
terraform apply -var="bucket_name=chandr34-variable-bucket"
terraform destroy -var="bucket_name=chandr34-variable-bucket"
Variable file
Any filename with extension .tfvars
terraform.tfvars
bucket_name = "chandr34-variable-bucket1"
terraform apply -auto-approve
Please make sure AWS Profile is created.
Create Public and Private Keys
Linux / Mac Users
// create private/public key
ssh-keygen -b 2048 -t rsa -f ec2_tf_demo
Windows Users
Open PuttyGen and create a Key
Terraform
- mkdir simple_ec2
- cd simple_ec2
- Create main.tf
// main.tf
#https://registry.terraform.io/providers/hashicorp/aws/latest
terraform {
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 5.70.0"
}
}
required_version = ">= 1.2.0"
}
provider "aws" {
region = "us-east-1"
profile = "chandr34"
}
resource "aws_key_pair" "generated_key" {
key_name = "generated-key-pair"
public_key = tls_private_key.generated_key.public_key_openssh
}
resource "tls_private_key" "generated_key" {
algorithm = "RSA"
rsa_bits = 2048
}
resource "local_file" "private_key_file" {
content = tls_private_key.generated_key.private_key_pem
filename = "${path.module}/generated-key.pem"
}
resource "aws_instance" "ubuntu_ec2" {
ami = "ami-00874d747dde814fa"
instance_type = "t2.micro"
key_name = aws_key_pair.generated_key.key_name
vpc_security_group_ids = [aws_security_group.ec2_security_group.id]
tags = {
Name = "UbuntuInstance"
Environment = "classdemo"
}
}
resource "aws_security_group" "ec2_security_group" {
name = "ec2_security_group"
description = "Allow SSH and HTTP access"
ingress {
from_port = 22
to_port = 22
protocol = "tcp"
cidr_blocks = ["0.0.0.0/0"] # Allow SSH from anywhere (use cautiously)
}
ingress {
from_port = 80
to_port = 80
protocol = "tcp"
cidr_blocks = ["0.0.0.0/0"] # Allow HTTP from anywhere
}
egress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"] # Allow all outbound traffic
}
tags = {
Name = "EC2SecurityGroup"
}
}
output "ec2_instance_public_ip" {
value = aws_instance.ubuntu_ec2.public_ip
}
output "private_key_pem" {
value = tls_private_key.generated_key.private_key_pem
sensitive = true
}
goto terminal
- terraform init
- terraform fmt
- terraform validate
- terraform apply
- terraform show
Finally
- terraform destroy
[Avg. reading time: 1 minute]
MLflow Model Lifecycle
[Avg. reading time: 14 minutes]
Decorator
Decorators in Python are a powerful way to modify or extend the behavior of functions or methods without changing their code. Decorators are often used for tasks like logging, authentication, and adding additional functionality to functions. They are denoted by the “@” symbol and are applied above the function they decorate.
def say_hello():
    print("World")

say_hello()
How do we change the output without changing the say_hello() function?
wrapper() is not a reserved word; it can be any name.
Use Decorators
# Define a decorator function
def hello_decorator(func):
    def wrapper():
        print("Hello,")
        func()  # Call the original function
    return wrapper

# Use the decorator to modify the behavior of say_hello
@hello_decorator
def say_hello():
    print("World")

# Call the decorated function
say_hello()
When Python sees @decorator_name, it does:
say_hello = hello_decorator(say_hello)
If you want to suppress the newline at the end of the print statement, use end=''.
# Define a decorator function
def hello_decorator(func):
    def wrapper():
        print("Hello, ", end='')
        func()  # Call the original function
    return wrapper

# Use the decorator to modify the behavior of say_hello
@hello_decorator
def say_hello():
    print("World")

# Call the decorated function
say_hello()
Multiple functions inside the Decorator
def hello_decorator(func):
    def first_wrapper():
        print("First wrapper, doing something before the second wrapper.")
        #func()
    def second_wrapper():
        print("Second wrapper, doing something before the actual function.")
        #func()
    def main_wrapper():
        first_wrapper()   # Call the first wrapper
        second_wrapper()  # Then call the second wrapper, which calls the actual function
        func()
    return main_wrapper

@hello_decorator
def say_hello():
    print("World")

say_hello()
Multiple Decorators
from functools import wraps

def one(func):
    def one_wrapper():
        print(f"Decorator One: Before function - Called by {func.__name__}")
        func()
        print(f"Decorator One: After function - Called by {func.__name__}")
    return one_wrapper

def two(func):
    def two_wrapper():
        print(f"Decorator Two: Before function - Called by {func.__name__}")
        func()
        print(f"Decorator Two: After function - Called by {func.__name__}")
    return two_wrapper

def three(func):
    def three_wrapper():
        print(f"Decorator Three: Before function - Called by {func.__name__}")
        func()
        print(f"Decorator Three: After function - Called by {func.__name__}")
    return three_wrapper

@one
@two
@three
def say_hello():
    print("Hello, World!")

say_hello()
Decorator Order
say_hello = one(two(three(say_hello)))
The wrappers run outermost-in, so the "before" messages print in this order:
ONE
TWO
THREE
SAY_HELLO
Wraps
@wraps is a decorator from Python’s functools module that preserves the original function’s metadata (like its name, docstring, and annotations) when it’s wrapped by another function.
Without using wraps
def some_decorator(func):
    def wrapper():
        """Wrapper docstring"""
        return func()
    return wrapper

@some_decorator
def hello():
    """Original docstring"""
    print("Hi!")

print(hello.__name__)
print(hello.__doc__)
Using Wraps
from functools import wraps

def some_decorator(func):
    @wraps(func)
    def wrapper():
        """Wrapper docstring"""
        return func()
    return wrapper

@some_decorator
def hello():
    """Original docstring"""
    print("Hi!")

print(hello.__name__)
print(hello.__doc__)
Args & Kwargs
*args: This is used to represent positional arguments. It collects all the positional arguments passed to the decorated function as a tuple.
**kwargs: This is used to represent keyword arguments. It collects all the keyword arguments (arguments passed with names) as a dictionary.
from functools import wraps

def my_decorator(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        print("Positional Arguments (*args):", args)
        print("Keyword Arguments (**kwargs):", kwargs)
        result = func(*args, **kwargs)
        return result
    return wrapper

@my_decorator
def example_function(a, b, c=0, d=0):
    print("Function Body:", a, b, c, d)

# Calling the decorated function with different arguments
example_function(1, 2)
example_function(3, 4, c=5)
Popular Example
import time
from functools import wraps

def timer(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        end = time.time()
        print(f"Execution time of {func.__name__}: {end - start} seconds")
        return result
    return wrapper

@timer
def add(x, y):
    """Returns the sum of x and y"""
    return x + y

@timer
def greet(name, message="Hello"):
    """Returns a greeting message with the name"""
    return f"{message}, {name}!"

print(add(2, 3))
print(greet("Rachel"))
The purpose of @wraps is to preserve the metadata of the original function being decorated.
[Avg. reading time: 5 minutes]
HTTP Basics
HTTP (Hypertext Transfer Protocol) is the foundation for data communication on the web.
Common HTTP Methods
| Method | Description | Typical Use |
|---|---|---|
| GET | Retrieve data | Fetch a resource (read-only) |
| POST | Create new data | Add new records or trigger an action |
| PUT | Replace data | Update/replace an existing record |
| DELETE | Remove data | Delete a record |
Popular HTTP Status Codes
200 Series (Success): 200 OK, 201 Created.
300 Series (Redirection): 301 Moved Permanently, 302 Found.
400 Series (Client Error): 400 Bad Request, 401 Unauthorized, 404 Not Found.
500 Series (Server Error): 500 Internal Server Error, 503 Service Unavailable.
REST API
REpresentational State Transfer is a software architectural style developers apply to web APIs.
REST APIs provide simple, uniform interfaces because they can be used to make data, content, algorithms, media, and other digital resources available through web URLs. Essentially, REST APIs are the most common APIs used across the web today.
https://api.zippopotam.us/us/08028
http://api.tvmaze.com/search/shows?q=friends
https://jsonplaceholder.typicode.com/posts
https://jsonplaceholder.typicode.com/posts/1
https://jsonplaceholder.typicode.com/posts/1/comments
https://reqres.in/api/users?page=2
https://reqres.in/api/users/2
CURL & VSCode
curl is a CLI application available for all OS.
https://curl.se/windows/
curl https://api.zippopotam.us/us/08028
curl https://api.zippopotam.us/us/08028 -o zipdata.json
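The same call can be made from Python (a sketch; assumes the requests package is installed):

import requests

resp = requests.get("https://api.zippopotam.us/us/08028")
print(resp.status_code)   # 200 on success
print(resp.json())        # parsed JSON body as a dict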
[Avg. reading time: 3 minutes]
Pydantic
Pydantic is a Python library for data validation, type enforcement, and serialization using standard Python type hints.
It ensures the data coming into your app (like API requests, configs, or ML inputs) is valid, typed, and clean — automatically.
Key Features
Automatic validation: Converts and checks input types (e.g., “5” → int(5)).
BaseModel class: Define data schemas by subclassing BaseModel.
Error messages: Tells you exactly which field is invalid and why.
Data parsing: Converts JSON or dicts into Python objects you can use directly.
Integration with FastAPI: FastAPI uses Pydantic models to validate request bodies and auto-generate documentation.
Why It Matters in MLOps
- Ensures model inputs (e.g., features in an API request) are validated before prediction.
- Prevents serving errors due to missing or wrong data types.
- Makes your FastAPI endpoints self-documenting via OpenAPI and /docs.
Example: Google colab
https://colab.research.google.com/drive/1IkROILidYV8iY9HchMGv2EAqQNK5o8d5?usp=sharing
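A minimal sketch of this behavior (the field names are made up; written for Pydantic v2, and v1 behaves similarly):

from pydantic import BaseModel, ValidationError

class CarFeatures(BaseModel):
    ENGINESIZE: float
    CYLINDERS: int = 4                                  # default value

ok = CarFeatures(ENGINESIZE="2.0", CYLINDERS="6")       # strings are coerced to the declared types
print(ok.ENGINESIZE, ok.CYLINDERS)                      # 2.0 6

try:
    CarFeatures(ENGINESIZE="not-a-number")              # invalid input
except ValidationError as e:
    print(e)                                            # names the bad field and the reason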
[Avg. reading time: 8 minutes]
Model Flavors
Remember MLflow features (Experiments - Runs - Models - Versions)
Rerun the model again.
git clone https://github.com/gchandra10/uni_multi_model
Popular MLflow Model Flavors
| Flavor | Used For | Typical Libraries / Frameworks |
|---|---|---|
| sklearn | Traditional ML models (regression, classification, clustering) | Scikit-Learn, statsmodels |
| xgboost | Gradient boosting trees | XGBoost |
| lightgbm | High-performance gradient boosting | LightGBM |
| catboost | Categorical-feature-friendly boosting | CatBoost |
| pytorch | Deep learning and neural networks | PyTorch |
| tensorflow / keras | Deep learning models | TensorFlow, Keras |
| onnx | Portable models for cross-framework inference | ONNX Runtime |
| fastai | Transfer learning and DL pipelines | FastAI |
| statsmodels | Statistical / econometric models | statsmodels |
| prophet | Time series forecasting | Facebook Prophet |
| gluon | Deep learning (MXNet backend) | Apache MXNet |
| sparkml | Distributed ML pipelines | Apache Spark MLlib |
| pyfunc | Universal interface - wraps all other flavors | MLflow internal meta-flavor |

PyFunc makes ML models cross-platform — one consistent way to load and predict, regardless of how they were built.
- Just like apps can be built separately for iOS or Android, models in MLflow can be saved in different native formats (like Scikit-Learn, PyTorch, XGBoost, etc.).
- A cross-platform app works everywhere, and that's what PyFunc is for ML models: a universal wrapper that runs any model with the same interface.
- This lets teams serve and reuse models easily, without worrying about which library originally trained them.
For Example:
| Library | Save API | Predict Method |
|---|---|---|
| Scikit-Learn | joblib.dump() | model.predict() |
| TensorFlow | model.save() | model(x) |
| PyTorch | torch.save() | model.forward(x) |
| XGBoost | model.save_model() | model.predict(xgb.DMatrix(x)) |
You can use pyfunc for all the flavors
import mlflow.pyfunc
mlflow.pyfunc.save_model()
model = mlflow.pyfunc.load_model("models:/<name>/<stage>")
model.predict(pd.DataFrame(...))
Advantages
- One simple API for inference. Works the same whether the model was trained in Scikit-Learn, XGBoost, PyTorch, or TensorFlow.
- Reduces code differences between data-science teams using different libraries.
- PyFunc packages the model + environment (conda/requirements) together.
- Guarantees that the model runs identically on local machines, servers, or cloud.
- Ideal for CI/CD pipelines and container builds.
- Can be loaded from: Run path: runs:/<run_id>/model Registry stage: models:/name/Production
- Works seamlessly with MLflow Serving, FastAPI, Docker, or SageMaker deploys.
- Enables easy A/B comparisons between models trained in different frameworks.
- You can subclass mlflow.pyfunc.PythonModel to add preprocessing or feature engineering, postprocess predictions, or integrate external systems (feature store, logging, metrics); see the sketch below.
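A hedged sketch of such a wrapper (the class name, artifact key, and preprocessing step are made up; the predict signature is the one mlflow.pyfunc.PythonModel expects):

import mlflow.pyfunc
import pandas as pd

class WrappedModel(mlflow.pyfunc.PythonModel):
    def load_context(self, context):
        # Load the underlying native model from the packaged artifacts
        # (assumes the model was logged with artifacts={"sk_model": "<path>"})
        import joblib
        self.model = joblib.load(context.artifacts["sk_model"])

    def predict(self, context, model_input: pd.DataFrame):
        features = model_input.fillna(0)               # example preprocessing
        preds = self.model.predict(features)           # delegate to the native model
        return [round(float(p), 2) for p in preds]     # example postprocessing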
Limitations
- Framework-specific features are lost.
- Input is pandas centric.
- In some cases, can be slower than native runtimes. (Torch/Tensor flow)
https://github.com/gchandra10/uni_multi_model/blob/main/03_load_test_model.py
#pyfunc #mlflow #tensorflow #pytorch
[Avg. reading time: 6 minutes]
Model Serving
mlflow server
Instantly turn a registered model into a REST API endpoint.
Make sure the MLflow server is still running as in the earlier example.
mlflow server --host 127.0.0.1 --port 8080 \
--backend-store-uri sqlite:///mlflow.db
Windows
SET MLFLOW_TRACKING_URI=http://127.0.0.1:8080
MAC/Linux
export MLFLOW_TRACKING_URI=http://127.0.0.1:8080
Serve the Model
mlflow models serve \
-m "models:/Linear_Regression_Model/1" \
--host 127.0.0.1 \
--port 5001 \
--env-manager local
Use the Model
curl -X POST "http://127.0.0.1:5001/invocations" \
-H "Content-Type: application/json" \
--data '{"inputs": [{"ENGINESIZE": 2.0}, {"ENGINESIZE": 3.0}, {"ENGINESIZE": 4.0}]}'
Pros
- Zero-code serving: Just one CLI command — no need to build an API yourself.
- Auto-handles environment: Loads dependencies automatically.
- Ideal for testing and demos.
- Supports model URIs.
Cons
- Single-threaded process.
- Limited customization.
- Minimal built-in monitoring.
- Not suited for blue-green / CICD promotion pipelines.
FastAPI
- Modern, high-performance Python web framework for building REST APIs.
- FastAPI turns Python functions into fully documented, high-performance REST APIs with minimal code.
- Built on ASGI (Asynchronous Server Gateway Interface).
- Designed for speed, type safety, and developer productivity.
Key Features
- Fast execution: Comparable to Node.js & Go — async by design.
- Automatic validation: Uses Pydantic models to validate and parse JSON inputs.
- Auto-generated API docs: Swagger UI available at /docs, ReDoc at /redoc.
- Type hints = API schema: Python typing directly defines request/response schema.
- Easy to test & extend: Works great with Docker, CI/CD, and modern MLOps stacks.
- Supports both sync & async: You can mix blocking ML inference and async endpoints.
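A hedged sketch of what such a FastAPI serving app could look like (this is not the repo's fast_app.py; the model URI and feature column follow the earlier linear-regression example, and MLFLOW_TRACKING_URI is assumed to be set as shown below):

import mlflow.pyfunc
import pandas as pd
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class CarFeatures(BaseModel):
    ENGINESIZE: float

# Loaded once at startup; assumes the MLflow server is reachable
model = mlflow.pyfunc.load_model("models:/Linear_Regression_Model/1")

@app.post("/predict")
def predict(features: CarFeatures):
    df = pd.DataFrame([{"ENGINESIZE": features.ENGINESIZE}])
    preds = model.predict(df)                      # typically a 1-D array for this model
    return {"predictions": [float(p) for p in preds]}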
export MLFLOW_TRACKING_URI=http://127.0.0.1:8080
Open uni_multi_model in VSCode
cd uni_multi_model
uvicorn fast_app:app --host 127.0.0.1 --port 5002
Uvicorn
- Python runtime Application server used to run Python app code.
- A lightweight, lightning-fast ASGI server (ASGI = Asynchronous Server Gateway Interface).
- Built on uvloop (fast event loop) and httptools (HTTP parser), with native WebSocket support.
- Works great with FastAPI and Pydantic.
#modelserving #mlflow #fastapi
[Avg. reading time: 7 minutes]
Model Serving Types
Model Serving is the process of deploying trained machine-learning models so they can generate predictions on new data.
Once a model is trained and validated, it must be made available to applications, pipelines, or users that need its outputs — whether that’s a batch job scoring millions of records, a web app recommending products, or an IoT stream detecting anomalies.
Model serving sits in the production stage of the MLOps lifecycle, bridging the gap between model development and business consumption.
It ensures models are:
- Accessible (via APIs, pipelines, or streams)
- Scalable (able to handle varying loads)
- Versioned and governed (using registries and lineage)
- Monitored (for drift, latency, and performance)
In modern stacks (e.g., Databricks, AWS SageMaker, GCP Vertex AI), serving integrates tightly with model registries, feature stores, and CI/CD pipelines to enable reliable, repeatable ML deployment.
Batch Model Serving
Batch serving runs inference on large datasets at scheduled intervals (hourly, nightly, weekly).
- Input data is read from storage or database.
- Predictions are generated for all records.
- Outputs are written back to storage or a downstream table.
Example: Predict CO₂ emissions for a batch of new car records.
Pros: Efficient, reproducible, simple to schedule. Cons: Not real-time; predictions may get stale.
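A minimal batch-scoring sketch (the file and column names are placeholders following the CO₂ example; not the course demo):

```python
# Batch serving sketch: score a whole dataset on a schedule and persist the results.
# Assumes MLFLOW_TRACKING_URI is set and new_cars.csv has an ENGINESIZE column.
import mlflow.pyfunc
import pandas as pd

model = mlflow.pyfunc.load_model("models:/Linear_Regression_Model/1")

batch = pd.read_csv("new_cars.csv")                 # input data read from storage
batch["predicted_co2"] = model.predict(batch[["ENGINESIZE"]])
batch.to_csv("new_cars_scored.csv", index=False)    # predictions written back downstream
```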
Demo:
Real-Time (Online) Model Serving
Real-time serving exposes the model as a low-latency API endpoint. Each request is scored on demand and returned within milliseconds to seconds.
How it works:
An application (e.g., web or mobile) calls the API.
The model receives input features and returns a prediction immediately.
As discussed in the previous chapter.
- MLflow Serving
- FastAPI Serving
Example:
Credit-card fraud detection, dynamic pricing, personalized recommendations.
Pros: Instant feedback, personalized predictions
Cons: Needs always-on infra, online feature store, auto-scaling
Demo
Streaming (Continuous) Model Serving
Streaming serving applies the model continuously to incoming event streams (Kafka, Kinesis, Delta Live Tables).
Instead of single requests, it handles ongoing flows of data.
- Data arrives in small micro-batches or as events.
- The model scores each record as soon as it appears.
- Results are pushed to dashboards, alerts, or storage.
Example:
IoT anomaly detection, clickstream optimization, live sensor analytics.
Pros:
Near real-time, high-throughput, scalable
Cons:
Complex orchestration, harder to monitor and debug.
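A toy sketch of the pattern, with a Python generator standing in for a Kafka/Kinesis consumer (the model URI follows the earlier example):

```python
# Streaming serving sketch: score each event as soon as it arrives.
# A generator stands in for a real message-bus consumer here.
import time
import mlflow.pyfunc
import pandas as pd

model = mlflow.pyfunc.load_model("models:/Linear_Regression_Model/1")

def event_stream():
    # Pretend events trickle in from Kafka/Kinesis.
    for size in [1.8, 2.4, 3.0, 3.6]:
        yield {"ENGINESIZE": size}
        time.sleep(1)

for event in event_stream():
    score = model.predict(pd.DataFrame([event]))[0]
    # In a real pipeline this would go to a dashboard, alert, or sink table.
    print(f"event={event} prediction={score:.2f}")
```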
[Avg. reading time: 7 minutes]
Auto ML
AutoML (Automated Machine Learning) is the process of automating the end-to-end machine-learning workflow, from data preprocessing and model selection to hyperparameter tuning, evaluation, and deployment.
The goal is to make machine learning faster, easier, and more accessible without sacrificing performance.
Instead of a data scientist manually trying dozens of models and tuning parameters, AutoML systems do this automatically, guided by optimization techniques and performance metrics.
- Speeds up experimentation
- Democratizes machine learning
- Improves model quality
- Enables scalable model governance
| Area | Example Use Case | What AutoML Helps With |
|---|---|---|
| Retail | Predict customer churn or recommend products | Automatically build and tune classifiers/regressors |
| Finance | Credit-risk modeling, fraud detection | Feature selection, threshold optimization |
| Healthcare | Predict patient readmission | Imbalanced-data handling, model explainability |
| Energy | Predict CO₂ emissions or fuel consumption | Regression with mixed numeric + categorical inputs |
| Marketing | Forecast campaign ROI | Fast model iteration and ranking |
What AutoML Actually Does
Typical AutoML frameworks automate these stages:
Data Preprocessing
- Missing-value imputation
- Encoding categorical variables
- Normalization or standardization
Feature Engineering
- Automatic transformations (log, polynomial, interaction terms)
- Feature selection and importance ranking
Model Selection
- Chooses among algorithms (e.g., Linear, Random Forest, XGBoost, Neural Net)
Model Ensemble / Stacking
- Combines several good models into one stronger ensemble
Model Evaluation and Ranking
- Uses metrics (RMSE, MAE, AUC, F1, etc.) to pick the best
Model Export
- Produces portable artifacts for production (e.g., MOJO, ONNX, pickle)
H2O AutoML
H2O.ai is an open-source AI and machine-learning platform built for speed and scalability.
It’s written in Java and C++ (high performance) with Python and R APIs for easy use.
The flagship open-source library is H2O-3, and H2O AutoML is a major component within it.
Other similar products
- AutoGluon
- FLAML
- PyCaret
- Auto-sklearn
- AutoKeras
Why H2O AutoML Is Popular in Industry
| Feature | Benefit |
|---|---|
| Scalable JVM backend | Runs on a laptop or a multi-node cluster |
| Multiple APIs | Python, R, Java, Scala |
| Easy deployment | Exports MOJO/POJO models for production scoring |
| Interpretable | Provides variable importance and SHAP explanations |
| Open Source | No license barrier; integrates with enterprise tools |
Google Colab
https://colab.research.google.com/drive/1DZjBbcWXeRk-xlmffG7A4zSez7eX1Rba?usp=sharing
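A minimal H2O AutoML sketch (the CSV file and column names are placeholders echoing the earlier CO₂ example; the Colab notebook has the full walkthrough):

```python
# H2O AutoML sketch: let AutoML train many models and rank them on a leaderboard.
# The file and column names (ENGINESIZE, CYLINDERS, CO2EMISSIONS) are placeholders.
import h2o
from h2o.automl import H2OAutoML

h2o.init()                                          # starts a local H2O (JVM) cluster

data = h2o.import_file("FuelConsumptionCo2.csv")
train, test = data.split_frame(ratios=[0.8], seed=42)

aml = H2OAutoML(max_models=10, max_runtime_secs=300, seed=42)
aml.train(x=["ENGINESIZE", "CYLINDERS"], y="CO2EMISSIONS", training_frame=train)

print(aml.leaderboard)                              # ranked models
print(aml.leader.model_performance(test))           # best model on held-out data
```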
[Avg. reading time: 4 minutes]
CPU vs GPU
CPU: few powerful cores optimized for low-latency, branching, and general purpose tasks. Great for data orchestration, preprocessing, control flow.
Use cases in ML:
feature engineering, I/O, tokenization, small classical ML, control logic.
GPU: thousands of simpler cores optimized for massive parallel math, especially dense linear algebra. Great for matrix multiplies, convolutions, attention.
Orders-of-magnitude speedups for medium to large models and batches.
Use cases in ML:
deep learning training, embedding inference, vector search re-ranking, image and generative workloads.
CUDA
GPU is the hardware. CUDA (Compute Unified Device Architecture) is the framework, language, and toolkit from NVIDIA that unlocks that hardware.
When working with a GPU, always check whether CUDA is available, as shown in the sketch below.
Plenty of GPUs exist without CUDA: Apple Silicon M-series chips and game consoles use GPUs, but they do not support CUDA.
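A quick way to check from Python, assuming PyTorch is installed:

```python
# Quick CUDA check with PyTorch: use the GPU when CUDA is available, else fall back to CPU.
import torch

print(torch.cuda.is_available())                   # True only on CUDA-capable NVIDIA GPUs
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))           # e.g. the GPU assigned by Colab

# On Apple Silicon, PyTorch exposes a separate "mps" backend instead of CUDA:
# torch.backends.mps.is_available()
```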
Remember to change the Runtime
https://colab.research.google.com/drive/1byrDchiV4OWdLKOPl8H4UAcdbwFoR7aA?usp=sharing
[Avg. reading time: 0 minutes]
Tools
[Avg. reading time: 6 minutes]
Containers
World before containers
Physical Machines
- 1 physical server
- 1 host OS (say, some Linux distribution)
- 3 applications installed
Limitations:
- Requires a physical server.
- Version dependencies between the host and its applications.
- Patches "hopefully" not breaking the applications.
- All apps must work with the same host OS.
Now give each application its own machine:
- 3 physical servers
- 3 host machines (different OSes)
- 3 applications installed
Limitations:
- Requires multiple physical servers.
- Version dependencies between each host and its applications.
- Patches "hopefully" not breaking the applications.
- Maintenance of 3 machines.
- Networking all three so they work together.
Virtual Machines
- Virtual machines emulate a real computer in software so that applications can run on top of it, all while running on a real physical host.
- To emulate a real computer, virtual machines use a Hypervisor to create a virtual computer.
- On top of the Hypervisor runs a Guest OS, a virtualized operating system in which isolated applications execute.
- Applications that run in virtual machines have access to binaries and libraries on top of that guest operating system.
( + ) Full isolation, full virtualization. ( - ) Too many layers, heavy-duty servers.
Here comes Containers
Containers are lightweight, portable environments that package an application with everything it needs to run—like code, runtime, libraries, and system tools—ensuring consistency across different environments. They run on the same operating system kernel and isolate applications from each other, which improves security and makes deployments easier.
- Containers are isolated processes that share resources with their host and, unlike VMs, don't virtualize the hardware and don't need a Guest OS.
- Containers share resources with other containers on the same host.
- This gives better performance than VMs (no separate guest OS).
- A container engine takes the place of the hypervisor.
Pros
- Isolated Process
- Mounted Files
- Lightweight Process
Cons
- All containers share the host OS kernel.
- Weaker isolation than VMs (security).
[Avg. reading time: 6 minutes]
VMs or Containers
VMs are great for running multiple, isolated OS environments on a single hardware platform. They offer strong security isolation and are useful when applications need different OS versions or configurations.
Containers are lightweight and share the host OS kernel, making them faster to start and less resource-intensive. They’re perfect for microservices, CI/CD pipelines, and scalable applications.
Smart engineers focus on the right tool for the job rather than getting caught up in “better or worse” debates.
Use them in combination to make life better.
Popular container technologies
Docker: The most widely used container platform, known for its simplicity, portability, and extensive ecosystem.
Podman: A daemonless container engine that’s compatible with Docker but emphasizes security, running containers as non-root users.
Images
The image is the prototype or skeleton to create a container, like a recipe to make your favorite food.
Container
A container is the environment, up and running and ready for your application.
If Image = Recipe, then Container = Cooked food.
Where to get the Image from?
Docker Hub
For both Podman and Docker, images are pulled from Docker Hub by default.
NOTE: INSTALL DOCKER OR PODMAN (Not BOTH)
Podman on Windows
https://podman-desktop.io/docs/installation/windows-install
Once installed, verify the installation:
podman info
Podman on MAC
After installing, you need to create and start your first Podman machine:
podman machine init
podman machine start
You can then verify the installation information using:
podman info
Podman on Linux
You can then verify the installation information using:
podman info
Docker Installation
Here is the step-by-step installation guide:
https://docs.docker.com/desktop/setup/install/windows-install/
[Avg. reading time: 0 minutes]
What container does
It gives us the ability to build and run applications without worrying about the environment they run in.
[Avg. reading time: 11 minutes]
Container Examples
If you have installed Podman, replace docker with podman in the commands below.
Syntax
docker pull <imagename>
docker run <imagename>
Examples:
docker pull hello-world
docker run hello-world
docker container ls
docker container ls -a
docker image ls
Optional Setting (For PODMAN)
Edit /etc/containers/registries.conf and add:
unqualified-search-registries = ["docker.io"]
Deploy MySQL Database using Containers
Create the following folder
Linux / Mac
mkdir -p container/mysql
cd container/mysql
Windows
md container
cd container
md mysql
cd mysql
Note: If you already have MySQL Server installed on your machine, change the host port to 3307 as shown below.
-p 3307:3306 \
Run the container
docker run --name mysql -d \
-p 3306:3306 \
-e MYSQL_ROOT_PASSWORD=root-pwd \
-e MYSQL_ROOT_HOST="%" \
-e MYSQL_DATABASE=mydb \
-e MYSQL_USER=remote_user \
-e MYSQL_PASSWORD=remote_user-pwd \
docker.io/library/mysql:8.4.4
-d : detached (background mode)
-p : 3306:3306 maps MySQL's default port 3306 to the host machine's port 3306;
3307:3306 maps MySQL's default port 3306 to the host machine's port 3307
-e MYSQL_ROOT_HOST="%" : allows logging in to MySQL from external clients such as MySQL Workbench
Login to MySQL Container
docker exec -it mysql bash
List all the Containers
docker container ls -a
Stop MySQL Container
docker stop mysql
Delete the container
docker rm mysql
Preserve the data for future use
Inside container/mysql
mkdir data
docker run --name mysql -d \
-p 3306:3306 \
-e MYSQL_ROOT_PASSWORD=root-pwd \
-e MYSQL_ROOT_HOST="%" \
-e MYSQL_DATABASE=mydb \
-e MYSQL_USER=remote_user \
-e MYSQL_PASSWORD=remote_user-pwd \
-v ./data:/var/lib/mysql \
docker.io/library/mysql:8.4.4
Connect to MySQL inside the container (for example, docker exec -it mysql mysql -u root -p) and run the following script:
-- Create database
CREATE DATABASE IF NOT EXISTS friends_tv_show;
USE friends_tv_show;
-- Create Characters table
CREATE TABLE characters (
character_id INT AUTO_INCREMENT PRIMARY KEY,
first_name VARCHAR(50) NOT NULL,
last_name VARCHAR(50) NOT NULL,
actor_name VARCHAR(100) NOT NULL,
date_of_birth DATE,
occupation VARCHAR(100),
apartment_number VARCHAR(10)
);
INSERT INTO characters (first_name, last_name, actor_name, date_of_birth, occupation, apartment_number) VALUES
('Ross', 'Geller', 'David Schwimmer', '1967-10-02', 'Paleontologist', '3B'),
('Rachel', 'Green', 'Jennifer Aniston', '1969-02-11', 'Fashion Executive', '20'),
('Chandler', 'Bing', 'Matthew Perry', '1969-08-19', 'IT Procurement Manager', '19'),
('Monica', 'Geller', 'Courteney Cox', '1964-06-15', 'Chef', '20'),
('Joey', 'Tribbiani', 'Matt LeBlanc', '1967-07-25', 'Actor', '19'),
('Phoebe', 'Buffay', 'Lisa Kudrow', '1963-07-30', 'Massage Therapist/Musician', NULL);
select * from characters;
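To query the container from Python, here is a minimal sketch using mysql-connector-python (credentials taken from the docker run command above; install the package first):

```python
# Query the containerized MySQL database from Python.
# Credentials match the docker run command above; requires mysql-connector-python.
import mysql.connector

conn = mysql.connector.connect(
    host="127.0.0.1",
    port=3306,                     # use 3307 if you remapped the host port
    user="root",                   # remote root login works because MYSQL_ROOT_HOST="%"
    password="root-pwd",
    database="friends_tv_show",
)

cursor = conn.cursor()
cursor.execute("SELECT first_name, last_name, occupation FROM characters")
for row in cursor.fetchall():
    print(row)

cursor.close()
conn.close()
```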
Build your own Image
mkdir -p container
cd container
Python Example
Follow the README.md
Fork & Clone
git clone https://github.com/gchandra10/docker_mycalc_demo.git
Web App Demo
Fork & Clone
git clone https://github.com/gchandra10/docker_webapp_demo.git
Publish Image to Docker Hub
Login to Docker Hub
- Create a Repository “my_faker_calc”
- Under Account Settings
- Personal Access Token
- Create a PAT token with Read/Write access for 1 day
Replace gchandra10 with your Docker Hub username.
docker login docker.io
enter userid
enter PAT token
Then build the Image with your userid
docker build -t gchandra10/my_faker_calc:1.0 .
docker image ls
Copy the ImageID of gchandra10/my_faker_calc:1.0
Tag the ImageID with necessary version and latest
docker image tag <image_id> gchandra10/my_faker_calc:latest
Push the Images to Docker Hub (version and latest)
docker push gchandra10/my_faker_calc:1.0
docker push gchandra10/my_faker_calc:latest
Image Security
Open Source tool Trivy
https://trivy.dev/latest/getting-started/installation/
trivy image python:3.9-slim
trivy image gchandra10/my_faker_calc
trivy image gchandra10/my_faker_calc --severity CRITICAL,HIGH --format table
trivy image gchandra10/my_faker_calc --severity CRITICAL,HIGH --output result.txt
[Avg. reading time: 0 minutes]
Productionizing ML Models
- Observability
- Drift
- Security
- Validation Frameworks
- Model Compression
- Ollama
- Best Practices
- SAAS Tools
[Avg. reading time: 3 minutes]
Observability
ML observability means:
- monitoring model behavior
- understanding WHY the model behaves that way
- detecting issues early
- supporting debugging and retraining decisions
ML Observability Pillars
- Data Quality Monitoring
- Drift Monitoring
- Operational / System Monitoring
- Explainability & Bias Monitoring
- Governance, Lineage & Reproducibility
Data Quality Monitoring
Tracks whether the input data is valid, clean, and reliable.
- missing values
- invalid values
- type issues
- schema changes
- outliers
- range violations
- feature null spikes
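For a taste of what these checks look like in code, here is a tiny sketch using plain pandas (the columns and rules are hypothetical):

```python
# Basic data-quality checks on an incoming batch before it reaches the model.
import pandas as pd

batch = pd.DataFrame({
    "ENGINESIZE": [2.0, None, 42.0],       # a missing value and an out-of-range value
    "FUELTYPE": ["Z", "X", "Q"],           # "Q" is a category never seen in training
})

expected_columns = {"ENGINESIZE", "FUELTYPE"}
known_fuel_types = {"X", "Z", "D", "E"}

report = {
    "schema_ok": set(batch.columns) == expected_columns,
    "missing_values": int(batch.isna().sum().sum()),
    "out_of_range": int((batch["ENGINESIZE"] > 10).sum()),
    "unknown_categories": sorted(set(batch["FUELTYPE"]) - known_fuel_types),
}
print(report)
```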
Operational / System Monitoring
- throughput
- hardware utilization
- inference failures
- API timeouts
- memory leaks
- GPU/CPU load spikes
- queue lag in streaming pipelines
This ensures the model endpoint or batch job is healthy.
Governance, Lineage & Reproducibility
Tracks the lifecycle and accountability of all ML assets.
- dataset versioning
- model versioning
- feature lineage
- pipeline lineage
- audit logs (who deployed, who retrained)
- model approval workflow
- reproducible experiments
- rollback support
[Avg. reading time: 8 minutes]
Drift
Monitoring and observability in ML is about continuously checking:
- What data is coming in
- How that data is changing
- Whether the model’s predictions are still reliable
- Whether the business metrics are degrading
Three key issues:
Data Drift: Incoming feature distributions shift from what the model was trained on.
Concept Drift: The relationship between features and target changes.
Model Performance Decay: Accuracy, precision, recall, RMSE, etc. degrade over time.
Use cases
- Fraud models stop detecting new fraud patterns.
- Demand forecasting fails when consumer behavior changes.
- Recommendation systems decay as user preferences evolve.
- Healthcare/diagnosis models degrade with new demographics.
- NLP sentiment models break due to new slang or cultural shifts.
Example
Phase 1: Training distribution
- sqft mean ~1500
- bedrooms mostly 2 or 3
- house_age mostly 5–15 years
Model learns reasonable patterns.
Phase 2: Production year later
Neighborhood changes + new houses get built.
1. Data Drift
Example:
- sqft mean shifts from 1500 to 2300
- more 4-bedroom homes appear
- house_age shifts from 10 years old to 2 years old (new constructions)
This is feature distribution drift. Model still predicts, but sees very different patterns than training.
2. Concept Drift
Originally:
- Price increases roughly 150 per extra sqft
After market shift:
- Price increases 250 per extra sqft
Meaning: the mapping from features to target changed, even though features look similar.
3. Model Performance Decay
You track weekly RMSE:
- Week 1: RMSE 19k
- Week 15: RMSE 25k
- Week 32: RMSE 42k
Why does it decay?
- Market changed
- New developers building larger homes
- New inflation conditions
- Seasonal patterns changed
- The model is outdated.
Data Quality Drift
Quality of incoming data begins to degrade:
- more missing values
- more zeros
- more invalid/out-of-range values
- more outliers
- schema changes
- feature suddenly becomes constant
- new categories never seen before
This is one of the most important practical drifts.
Example:
“furnished”, “semi-furnished” → suddenly “fully-furnished” appears (NEW category)
Data Freshness Drift (Latency Drift)
Data arrives:
- late
- too early
- stale
- out-of-order
Feature Importance Drift
Rank of feature importance changes:
Example:
- bedrooms used to be the strongest feature
- now open backyard becomes dominant
- previously irrelevant features become important and vice-versa
Input Volume Drift
Sudden spikes or drops in data volume.
Example:
Daily 500 requests suddenly becomes 10,000.
This affects latency, performance, and reliability.
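A minimal sketch of catching the sqft drift described above, using a two-sample Kolmogorov-Smirnov test from SciPy (synthetic numbers, purely illustrative):

```python
# Simple data-drift check: compare the training distribution of sqft
# against the production distribution with a two-sample KS test.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
train_sqft = rng.normal(loc=1500, scale=300, size=5000)   # training-time distribution
prod_sqft = rng.normal(loc=2300, scale=400, size=5000)    # a year later in production

stat, p_value = ks_2samp(train_sqft, prod_sqft)
print(f"KS statistic={stat:.3f}  p-value={p_value:.4f}")

if p_value < 0.01:
    print("Feature distribution has drifted: investigate and consider retraining.")
```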
Demo
https://colab.research.google.com/drive/1gf2Qs3avNej6JP-LmKHe022HUiSqbCmy?usp=sharing
git clone https://github.com/gchandra10/python_model_drift
Open Source Tools
https://github.com/evidentlyai/evidently
[Avg. reading time: 9 minutes]
Security
Machine learning systems introduce a whole new attack surface. In traditional software, you secure code, networks, data, and deployments. In ML, you also have to secure training data, model artifacts, feature pipelines, model endpoints, and the feedback loops that continuously update the model.
If ML security is ignored, attackers can quietly poison training data, steal the model, extract sensitive information, or manipulate predictions in production. The impact can be severe: compliance violations, financial loss, biased decisions, or complete system compromise.
Why It Matters
- ML models behave exactly the way the data teaches them. If attackers can tamper with data, you lose trust in the entire pipeline.
- Models deployed as APIs are prime targets for extraction, prompt injections, and inference manipulation.
- Regulatory pressure is rising, and ML systems now need governance similar to financial or healthcare-grade systems.
- Many orgs automate retraining. Without guardrails, an attacker could push poisoned data into the pipeline and silently change model behavior overnight.
1. Data Security
- Validate and sanitize input data before training or inference.
- Detect drift that might be intentional poisoning.
- Maintain lineage: who produced the data, when, from where.
- Encrypt data in transit and at rest.
2. Model Artifact Security
- Store models in a secure registry (MLflow Model Registry or cloud-managed registry).
- Use signed and versioned models to prevent unverified deployments.
- Restrict access at the catalog or registry level using RBAC.
3. Supply Chain Security
- Training code, libraries, dependencies, Docker images, and notebooks can be compromised.
- Use vulnerability scanning tools on Python packages and containers.
- Pin versions using pyproject.toml or UV/Poetry lockfiles.
- Verify model lineage (code version, data version, training environment).
4. API & Endpoint Hardening
- Rate limiting and throttling to prevent model extraction.
- Authentication and authorization around inference endpoints.
- Input validation to avoid adversarial attacks and prompt injections (LLMs).
- Don’t expose internal model metadata via the API.
5. Monitoring & Detection
- Track prediction patterns to catch sudden spikes or targeted manipulation.
- Use model drift & data drift monitoring tools.
- Alert when confidence scores change unpredictably.
- Store logs for forensics.
6. Secrets & Environment Security
- Never hardcode API keys into notebooks or training code.
- Use cloud secret managers or Databricks secret scopes.
- Lock down S3/Blob/GCS buckets and model storage.
- Use network isolation: private endpoints, VPC peering, firewall rules.
How To Ensure Models Are Not Vulnerable
- Implement model reviews as part of CI/CD, including robustness tests.
- Continuously test your data pipelines for poisoning or schema violations.
- Use secure serving infrastructure (no local Flask servers in production).
- Perform penetration testing specifically targeted at model endpoints.
- Automate retraining only when data validation checks pass.
- Track every model version, input source, and deployment environment.
- Keep models and features inside secured catalogs with RBAC and audit logs.
- Use zero-trust principles for every pipeline component.
Popular Tools
FalconPy by Crowdstrike
[Avg. reading time: 5 minutes]
Validation Frameworks
Data validation frameworks help you prove your data is correct before you process or model it. Instead of writing ad-hoc if-else checks, you declare rules once and let the framework enforce them automatically.
- Consistency
- Repeatability
- Cleaner code
- Faster debugging
- Less human error
Validation Frameworks
- Detect bad data early instead of debugging downstream failures
- Enforce rules across teams so everyone validates the same way
- Automate thousands of checks with very little code
- Reduce manual cleanup work that normally takes hours
- Make pipelines safer, more predictable, and easier to maintain
- Shift data quality to where it belongs: before transformation and modeling
| Manual Validation | Framework-Based Validation |
|---|---|
| Lots of custom code | Declare rules once |
| Hard to maintain | Reuse rules everywhere |
| Easy to miss edge cases | Remove 70–90 percent of custom code. |
| Never consistent between developers | Fail fast instead of debugging downstream |
| Repeated onboarding pain | Easier onboarding for new developers and analysts |
Popular Tools
Pandera (Python)
- Easiest for Python pipelines
- Schema-based, great for ML workflows
- Integrates with Pandas, Polars, Dask, Spark
- Treats data validation like unit tests
Pydantic
- Row-level validation
- Excellent for API inputs and ML inference
- Great complement to Pandera, not a dataframe validator
Pydantic + Pandera
- Pydantic is for validating one row at a time.
- Pandera is for validating the whole dataset at once.
- Pydantic shines in ML inference, web APIs, and configuration files.
- Pandera shines in ETL, data cleaning, feature engineering, and ML training pipelines.
git clone https://github.com/gchandra10/python_validator_demo
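A rough sketch of how the two complement each other (the columns and rules are hypothetical, not the demo repo's schema):

```python
# Pandera validates the whole dataframe at once; Pydantic validates one record at a time.
import pandas as pd
import pandera as pa
from pydantic import BaseModel, Field

# Dataset-level rules (ETL / training time)
schema = pa.DataFrameSchema({
    "ENGINESIZE": pa.Column(float, pa.Check.in_range(0.5, 10.0)),
    "CO2EMISSIONS": pa.Column(int, pa.Check.ge(0)),
})

df = pd.DataFrame({"ENGINESIZE": [2.0, 3.0], "CO2EMISSIONS": [196, 255]})
schema.validate(df)                      # raises SchemaError on bad data

# Record-level rules (inference / API time)
class PredictionRequest(BaseModel):
    ENGINESIZE: float = Field(gt=0.5, lt=10.0)

req = PredictionRequest(ENGINESIZE=2.4)  # raises ValidationError on bad input
print(req)
```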
#pandera #pydantic #validationframework
[Avg. reading time: 5 minutes]
Model Compression
Model compression is the set of techniques used to reduce the size and computational cost of a trained model while keeping its accuracy almost the same.
Why It Exists
- Speed up inference
- Reduce memory footprint
- Fit models on cheaper hardware
- Reduce serving cost
- Enable on-device ML (phones, edge devices, IoT)
- Allow high-traffic systems to scale
Without Compression
- Slow predictions
- GPU or CPU bottlenecks
- More servers needed to keep up
- Higher inference bill
- Some environments can’t run your model at all
- Increased latency kills user experience
Photo Analogy
Think of compressing a photo: the file becomes much smaller, yet the picture still looks nearly the same. Model compression aims for the same trade-off.
The popular mechanisms include
- Quantization
Quantization is the process of reducing the precision (bit-width) of the numerical values used to represent model parameters, usually from n bits to m bits, where n > m.
In ML, FP32 (32-bit floating point) is the default; quantization converts those 32-bit values to 16-bit or 8-bit representations while achieving similar results.
https://colab.research.google.com/drive/1SHGqVZhk8tKpuGQ3KqLhUXIk8NU9W2Er?usp=sharing
When using this with MLflow, log both models as artifacts and serve whichever one the use case requires.
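A minimal dynamic-quantization sketch in PyTorch (a toy model, not the Colab example): the Linear layers drop from FP32 to INT8 weights while the module keeps the same interface.

```python
# Dynamic quantization sketch: convert Linear layers from FP32 weights to INT8.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))

quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
print(model(x), quantized(x))            # outputs stay close, footprint shrinks

# Rough size comparison of the saved artifacts
torch.save(model.state_dict(), "fp32.pt")
torch.save(quantized.state_dict(), "int8.pt")
```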
- Distillation
Model distillation, also known as knowledge distillation, is a technique where a smaller model, often referred to as a student model, is trained to mimic the behavior of a larger, more complex model, known as a teacher model. The goal is to transfer the knowledge and performance of the larger model to the smaller one.
Analogy: reading the whole book vs. being nudged with hints and references.
#compression #quantization #distillation
[Avg. reading time: 4 minutes]
Ollama
- Ollama is an open-source tool that allows you to run large language models (LLMs) on your local machine, providing privacy and offline access.
- It simplifies the process of downloading, running, and managing LLMs with a user-friendly interface, both via a command-line interface (CLI) and an API.
- It's designed for developers and researchers who want to customize and experiment with AI models locally, without depending on cloud services.
Install
Download and Install
https://ollama.com/
Open Terminal
ollama
ollama list
ollama pull deepseek-r1:8b
ollama run deepseek-r1:8b
To close the prompt
/bye
Roles
- user: The human asking questions or giving instructions.
- assistant : The model’s response role. This is what the LLM outputs.
- system : Optional. Used to set initial behavior or constraints, similar to system prompts in OpenAI/ChatGPT.
git clone https://github.com/gchandra10/python_ollama_demo.git
- chat(): conversational, role-based, template-aware
- generate(): raw LLM token generation, no chat template, no memory
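A minimal sketch using the ollama Python package (assuming the deepseek-r1:8b model pulled above; the demo repo may structure this differently):

```python
# chat(): role-based conversation with the locally running Ollama server.
import ollama

response = ollama.chat(
    model="deepseek-r1:8b",
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Explain model drift in one sentence."},
    ],
)
print(response["message"]["content"])

# generate(): raw prompt in, tokens out; no chat template, no roles.
raw = ollama.generate(model="deepseek-r1:8b", prompt="Explain model drift in one sentence.")
print(raw["response"])
```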
Build Custom Models
- Create a Modelfile
- Mention the model and prompt
- Create and use the new Model
[Avg. reading time: 6 minutes]
Best Practices
Continuous Integration (CI): Automate testing and validation for code, data, and models before deployment.
Continuous Delivery/Deployment (CD): Automate the deployment of the complete ML pipeline and the trained model to production environments (often using Docker/Kubernetes).
Continuous Training (CT): Implement automated triggers to retrain models based on performance degradation (drift) or arrival of significant new data.
Version Control: Use Git for code and configuration. Crucially, version control datasets (Data Versioning) and model artifacts (Model Registry).
Reproducibility: Log all experiment metadata—including hyperparameters, package dependencies, and data/code versions—to enable exact reproduction of any past result.
Infrastructure as Code (IaC): Manage all compute resources and environments (e.g., training clusters, deployment services) using code (e.g., Terraform) for consistency.
Continuous Monitoring: Track both operational metrics (latency, throughput, resource usage) and model performance metrics (accuracy, precision, business KPIs) in production.
Drift Detection: Actively monitor for Data Drift (input data changes) and Concept Drift (target relationship changes) and set up automated alerts and retraining workflows.
Data Validation: Implement continuous checks on the schema, quality, and statistical properties of input data streams before they reach the model.
Model Governance & Lineage: Maintain a clear audit trail of every model, documenting who trained it, when, and with what specific assets, for regulatory compliance and debugging.
Modular Pipelines: Break the ML workflow (data ingestion, preprocessing, training, evaluation, deployment) into independent, reusable components.
Feature Stores: Use a centralized platform to define, serve, and share reusable features across different models and teams, ensuring consistency between training and serving.
Collaboration: Facilitate smooth handoffs and shared ownership between Data Scientists, ML Engineers, and Operations teams through common tools and standardized interfaces.
[Avg. reading time: 4 minutes]
SaaS Tools for MLflow
These platforms streamline the entire machine learning lifecycle, often integrating MLflow’s capabilities.
Amazon SageMaker: AWS’s comprehensive, fully-managed platform that covers the entire ML workflow from data preparation to deployment and monitoring.
Google Vertex AI: Google Cloud’s unified platform for building, deploying, and scaling ML models, which includes MLOps tools like pipelines, a model registry, and monitoring.
Microsoft Azure Machine Learning: A cloud service that provides a range of tools and a unified environment to accelerate and manage the ML project lifecycle, with strong native MLflow integration.
Databricks (Managed MLflow): Databricks, co-founded by the creators of MLflow, offers a fully managed and enhanced version of MLflow tightly integrated with their lakehouse platform.
Benefits
Enhanced Collaboration: Provides a shared, centralized platform (via the Tracking Server and Model Registry UI) where data scientists can log, view, compare, and share experiment results and model versions.
Efficient Model Lifecycle Management: The Model Registry offers governance and an audit trail by controlling the transition of model versions through different stages (e.g., from Staging to Production) and linking them to their original training runs.
#saastools #sagemaker #azureml #googlevertexai
[Avg. reading time: 2 minutes]
Good Reads
These are just resources I found interesting and thought you might too. I’m not connected to them and can’t vouch for everything, but I’m sharing in the spirit of helping you discover new ideas, books, and opportunities.
Google Colab Free
https://blog.google/outreach-initiatives/education/colab-higher-education/
DeepLearning.ai
https://www.deeplearning.ai/
Notebook LM
https://notebooklm.google/
ByteByteGo
It’s a very, very useful YT channel.
https://www.youtube.com/@ByteByteGo/videos
Loaded with lots and lots of useful information.

Tags
bert
/MLOps & AI Overview/ML Lifecycle/Feature Engineering/Embeddings
agents
/MLOps & AI Overview/AI then and now/Agentic AI
ai-ml
/MLOps & AI Overview/AI then and now/Differences
artificialintelligence
/MLOps & AI Overview/AI then and now
automation
/Developer Tools/JQ
automl
/MLflow Model Lifecycle/Auto ML
aws
/Cloud/AWS/AWS Global Infra
/Cloud/AWS/CloudShell
/Cloud/AWS/EC2
/Cloud/AWS/IAM
/Cloud/AWS/S3
az
/Cloud/AWS/AWS Global Infra
azureml
/Productionizing ML Models/SAAS Tools
batch
/MLflow Model Lifecycle/Model Serving Types
bestpractices
/Productionizing ML Models/Best Practices
challenges
/Cloud/Challenges
cicd
/MLOps & AI Overview/AI then and now/MLOps
cidr
/Cloud/AWS/CIDR
classification
/MLOps & AI Overview/AI then and now/Machine Learning
claude
/MLOps & AI Overview/AI then and now/Generative AI
clean
/MLOps & AI Overview/ML Lifecycle
cleaning
/MLOps & AI Overview/ML Lifecycle/Data Preparation
cli
/Cloud/AWS/CloudShell
/Developer Tools/DuckDB
/Developer Tools/JQ
cloud
/Cloud/Challenges
/Cloud/Overview
/Cloud/Types
cloudshell
/Cloud/AWS/CloudShell
collect
/MLOps & AI Overview/ML Lifecycle
compression
/Productionizing ML Models/Model Compression
container
/Tools/Containers/What container does
containers
/Tools/Containers
cpu
/MLflow Model Lifecycle/CPU vs GPU
data
/MLOps & AI Overview/Introduction
/MLOps & AI Overview/ML Lifecycle/Data Preparation
databricks
datacleaning
/MLOps & AI Overview/ML Lifecycle/Data Cleaning
dataimputation
/MLOps & AI Overview/ML Lifecycle/Data Imputation
datavalidation
/MLflow Model Lifecycle/pydantic
decorator
/MLflow Model Lifecycle/Decorator
densevector
/MLOps & AI Overview/ML Lifecycle/Feature Engineering/Vectors
deserialization
/MLOps & AI Overview/Terms to Know
development
/MLOps & AI Overview/Life Before MLOps
devops
/MLOps & AI Overview/AI then and now/MLOps
disclaimer
distillation
/Productionizing ML Models/Model Compression
docker
/Tools/Containers/Container Examples
/Tools/Containers/VMs or Containers
domain_specific
/MLOps & AI Overview/ML Lifecycle/Feature Engineering
drift
/Productionizing ML Models/Drift
dropdata
/MLOps & AI Overview/ML Lifecycle/Data Imputation
duckdb
/Developer Tools/DuckDB
ec2
/Cloud/AWS/EC2
edgelocation
/Cloud/AWS/AWS Global Infra
embeddeddb
/Developer Tools/SQLite
embeddings
/MLOps & AI Overview/ML Lifecycle/Feature Engineering/Embeddings
encode
/MLOps & AI Overview/ML Lifecycle/Data Imputation
error
/Developer Tools/Error Handling
evaluate
/MLOps & AI Overview/ML Lifecycle
examples
/MLOps & AI Overview/Examples
/Tools/Containers/Container Examples
exception
/Developer Tools/Error Handling
experiment
/MLflow Introduction/MLflow Experiment Structure
expert-systems
/MLOps & AI Overview/AI then and now/Expert Systems
explanation
/MLOps & AI Overview/Model vs Library vs Framework/Explanation
fastapi
/MLflow Model Lifecycle/Model Serving
feature_engineering
/MLOps & AI Overview/ML Lifecycle/Feature Engineering
finance
/MLOps & AI Overview/Examples
framework
/MLOps & AI Overview/Model vs Library vs Framework
fuzziness
/MLOps & AI Overview/AI then and now/Fuzzy Logic
fuzzy-logic
/MLOps & AI Overview/AI then and now/Fuzzy Logic
genai
/MLOps & AI Overview/AI then and now/Differences
generativeai
/MLOps & AI Overview/AI then and now/Generative AI
git
goodreads
googlevertexai
/Productionizing ML Models/SAAS Tools
gpt
/MLOps & AI Overview/AI then and now/Generative AI
gpu
/MLflow Model Lifecycle/CPU vs GPU
healthcare
/MLOps & AI Overview/Examples
http
/MLflow Model Lifecycle/HTTP Basics
iaac
/Cloud/Terraform
iaas
/Cloud/Types
iam
/Cloud/AWS/IAM
ipv4
/Cloud/AWS/CIDR
jobs
/MLOps & AI Overview/Job Opportunities
jq
/Developer Tools/JQ
json
/Developer Tools/JQ
/MLflow Introduction/YAML
knn
/MLOps & AI Overview/ML Lifecycle/Data Imputation
label_encoding
/MLOps & AI Overview/ML Lifecycle/Data Encoding
library
/MLOps & AI Overview/Model vs Library vs Framework
/MLOps & AI Overview/Model vs Library vs Framework/Explanation
linearalgebra
/MLOps & AI Overview/Statistical vs ML Models
lint
/Developer Tools/Other Python Tools
llm
/MLOps & AI Overview/AI then and now/Differences
/Productionizing ML Models/Ollama
localdb
/Developer Tools/SQLite
machinelearning
/MLOps & AI Overview/AI then and now
medallion
/MLOps & AI Overview/AI then and now/MLOps
ml
/MLOps & AI Overview/Statistical vs ML Models
mlcleaning
/MLOps & AI Overview/ML Lifecycle/Data Cleaning
mlengineer
/MLOps & AI Overview/Introduction
/MLOps & AI Overview/Job Opportunities
mlflow
/MLflow Introduction/MLflow Experiment Structure
/MLflow Introduction/MLflow Features
/MLflow Model Lifecycle/Model Flavors
/MLflow Model Lifecycle/Model Serving
mlflow_server
/MLflow Introduction/MLflow Features
mlops
/MLOps & AI Overview/AI then and now/Differences
/MLOps & AI Overview/AI then and now/MLOps
/MLOps & AI Overview/Introduction
/MLOps & AI Overview/Life Before MLOps
/Productionizing ML Models/Best Practices
/Productionizing ML Models/Drift
/Productionizing ML Models/Observability
/Productionizing ML Models/Security
mlopsengineer
/MLOps & AI Overview/Job Opportunities
model
/MLOps & AI Overview/Model vs Library vs Framework
modelserving
/MLflow Model Lifecycle/Model Serving
mse
/MLOps & AI Overview/Model vs Library vs Framework
mypy
/Developer Tools/Other Python Tools
nlp
/MLOps & AI Overview/ML Lifecycle/Feature Engineering/Embeddings
normalize_data
/MLOps & AI Overview/ML Lifecycle/Data Cleaning
objectstorage
/Cloud/AWS/S3
observability
/Productionizing ML Models/Observability
ollama
/Productionizing ML Models/Ollama
onehot_encoding
/MLOps & AI Overview/ML Lifecycle/Data Encoding
overfitting
/MLOps & AI Overview/Terms to Know
overview
/Cloud/Overview
paas
/Cloud/Types
pandera
/Productionizing ML Models/Validation Frameworks
parquet
/Developer Tools/DuckDB
parser
/Developer Tools/JQ
pep
/Developer Tools/Other Python Tools
permissions
/Cloud/AWS/IAM
podman
/Tools/Containers/VMs or Containers
poetry
/Developer Tools/Introduction
production
/MLOps & AI Overview/Life Before MLOps
pydantic
/MLflow Model Lifecycle/pydantic
/Productionizing ML Models/Validation Frameworks
pyfunc
/MLflow Model Lifecycle/Model Flavors
pytest
/Developer Tools/Unit Test
python
/Developer Tools/Introduction
/MLflow Introduction/YAML
/MLflow Model Lifecycle/Decorator
pytorch
/MLflow Model Lifecycle/Model Flavors
quantization
/Productionizing ML Models/Model Compression
r2score
/MLOps & AI Overview/Model vs Library vs Framework
realtime
/MLflow Model Lifecycle/Model Serving Types
region
/Cloud/AWS/AWS Global Infra
regression
/MLOps & AI Overview/AI then and now/Machine Learning
resources
restapi
/MLflow Model Lifecycle/HTTP Basics
retail
/MLOps & AI Overview/Examples
rl
/MLOps & AI Overview/AI then and now/Reinforcement Learning
rlhf
/MLOps & AI Overview/AI then and now/Reinforcement Learning
robotics
/MLOps & AI Overview/AI then and now/Reinforcement Learning
ruff
/Developer Tools/Other Python Tools
rulebased
/MLOps & AI Overview/AI then and now/Expert Systems
run
/MLflow Introduction/MLflow Experiment Structure
rust
/Developer Tools/UV
s3
/Cloud/AWS/S3
saas
/Cloud/Types
saastools
/Productionizing ML Models/SAAS Tools
sagemaker
/Productionizing ML Models/SAAS Tools
security
/Productionizing ML Models/Security
serialization
/MLOps & AI Overview/Terms to Know
server
/Cloud/AWS/EC2
serving
/MLflow Introduction/MLflow Features
singlefiledatabase
/Developer Tools/DuckDB
sparsevector
/MLOps & AI Overview/ML Lifecycle/Feature Engineering/Vectors
sqlite
/Developer Tools/SQLite
statistics
/MLOps & AI Overview/Statistical vs ML Models
storage
/Cloud/AWS/S3
streaming
/MLflow Model Lifecycle/Model Serving Types
subnet
/Cloud/AWS/CIDR
supervised
/MLOps & AI Overview/AI then and now/Machine Learning
/MLOps & AI Overview/Types of ML Models
tabularvector
/MLOps & AI Overview/ML Lifecycle/Feature Engineering/Vectors
target_encoding
/MLOps & AI Overview/ML Lifecycle/Data Encoding
tensorflow
/MLflow Model Lifecycle/Model Flavors
terraform
/Cloud/Terraform
tools
/Developer Tools/DuckDB
/Developer Tools/JQ
train
/MLOps & AI Overview/ML Lifecycle
try
/Developer Tools/Error Handling
underfitting
/MLOps & AI Overview/Terms to Know
unittesting
/Developer Tools/Unit Test
unsupervised
/MLOps & AI Overview/AI then and now/Machine Learning
/MLOps & AI Overview/Types of ML Models
user
/Cloud/AWS/IAM
uv
/Developer Tools/Introduction
/Developer Tools/UV
validationframework
/Productionizing ML Models/Validation Frameworks
vectors
/MLOps & AI Overview/ML Lifecycle/Feature Engineering/Vectors
venv
/Developer Tools/Introduction
/Developer Tools/UV
vm
/Cloud/AWS/EC2
/Tools/Containers
/Tools/Containers/VMs or Containers
word2vec
/MLOps & AI Overview/ML Lifecycle/Feature Engineering/Embeddings
wraps
/MLflow Model Lifecycle/Decorator
yaml
/MLflow Introduction/YAML