
Disclaimer

#disclaimer

Required Tools


Free Cloud Services

| Tool | Purpose | Link |
|------|---------|------|
| Databricks Free Edition | ML & Ops | Free Signup |
| Chroma | Free Vector DB | ChromaDB |

#tools #databricks #python #git

MLOps & AI Overview

Introduction

AI/ML are no longer just research topics - they drive industry, innovation, and jobs.

GenAI has shifted expectations: businesses want faster solutions with production-grade reliability.

MLOps ensures ideas → working models → deployed systems.

Evolution of the Field

2010s: Big Data + early ML adoption (scikit-learn, Spark MLlib).

2015-2022: Deep learning boom (Neural Networks, NLP with BERT).

2022: Generative AI (GPT, diffusion models).

MLOps is critical for scaling, governance, monitoring.

Where MLOps Fits in the Data/AI Journey

MLOps spans this entire journey, from data to deployed systems.

Without MLOps, many models stay as “academic projects.”

Today’s hiring market looks for hybrid skills (data + ML + cloud + ops).

Course Positioning

Light on topics covered in other courses, such as ML algorithms, NLP, Deep Learning, and LLMs.

This course is heavy on CI/CD and MLOps: pipelines, versioning, monitoring, cloud platforms, and related toolsets.

Course Focus = Industry Readiness

#Data #mlengineer #mlops

AI then and now

#MachineLearning #ArtificialIntelligence

Expert Systems

Early AI systems (1970s–1990s)

Rule-based: encode human expert knowledge as if-then rules.

Precursor to modern ML, focused on symbolic reasoning rather than data-driven learning.
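
To make the idea concrete, here is a toy sketch (hypothetical rules, plain Python) of expert knowledge encoded as if-then rules:

def diagnose(symptoms):
    # Hand-written rules from a (hypothetical) domain expert.
    if "fever" in symptoms and "cough" in symptoms:
        return "Flu suspected"
    if "rash" in symptoms:
        return "Allergy suspected"
    return "No rule matched -> refer to a human expert"

print(diagnose({"fever", "cough"}))  # Flu suspected
print(diagnose({"headache"}))        # No rule matched -> refer to a human expert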

Pros

  • Transparent and explainable (rules are visible).
  • Effective in narrow, well-defined domains.

Cons

  • Knowledge engineering is labor-intensive.
  • Doesn’t scale well as rules explode.
  • Cannot learn automatically from new data.

#expert-systems #rulebased

Fuzzy Logic

Logic that allows degrees of truth (not just True/False). Models uncertainty with values between 0 and 1.

graph TD
    A["Is it Cold?"] --> B["Crisp Logic<br/>Yes = 1<br/>No = 0"]
    A --> C["Fuzzy Logic<br/>Maybe Cold = 0.3<br/>Not really cold = 0.7"]


Useful in control systems and decision-making under vagueness.

Still used in use cases such as fuzzy matching, e.g., judging how similar "New Jersey" is to "Jersey".
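
A minimal sketch of degrees of truth using Python's built-in difflib (string similarity as a value between 0 and 1):

from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    # Returns a degree of similarity between 0.0 and 1.0, not a crisp yes/no.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

print(similarity("New Jersey", "Jersey"))   # 0.75 -> quite similar
print(similarity("New Jersey", "Florida"))  # much lower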

Pros

  • Handles imprecise, uncertain, or linguistic data (“high temperature”, “low risk”).
  • Good for rule-based control.

Cons

  • Not data-driven → rules must be defined manually.
  • Limited learning ability compared to ML.

Use Cases

  • Washing machines that adjust cycles based on “fuzziness” of dirt level.
  • Air conditioning systems adapting to “comfort level”.
  • Automotive control (braking, transmission).
  • Risk assessment systems.

#fuzzy-logic #fuzziness

Machine Learning

A subset of AI where systems learn patterns from data and make predictions or decisions without being explicitly programmed.

  • One of the core pillars of AI.

  • Between traditional rule-based systems (Expert Systems) and modern Deep Learning/GenAI.

  • Provides the foundation for many practical AI applications used in industry today.

Pros

  • Automates decision-making at scale.
  • Flexible: can be applied to structured and unstructured data.
  • Improves with more data and better features.

Cons

  • Requires labeled data (for supervised learning).
  • Models can overfit or underfit if not designed carefully.
  • Often seen as a “black box” with limited interpretability.

Use Cases

  • Fraud detection in finance.
  • Customer churn prediction in telecom/retail.
  • Demand forecasting in supply chain.
  • Email spam filtering.
  • Customer segmentation for targeted marketing.
  • Market basket analysis (“people who buy X also buy Y”).
  • Anomaly detection in cybersecurity and IoT.

#Supervised #Unsupervised #classification #regression

Generative AI

A class of AI that can create new content (text, code, images, video, music) rather than just predicting outcomes.

Powered by foundation models like GPT, Stable Diffusion, etc.

  • Builds on Deep Learning + NLP + multimodal modeling.

  • Represents the shift from discriminative models (predicting) to generative models (creating).

Pros

  • Enables creativity and automation at scale.
  • Reduces time to draft, design, or brainstorm.

Cons

  • Can hallucinate false information.
  • High computational cost and environmental footprint.
  • Raises copyright, ethics, and bias concerns.

Use Cases

  • Text: AI writing assistants, code copilots.
  • Image/video: marketing content generation, design prototyping.
  • Data: generating synthetic data for ML training.
  • Education: personalized learning materials and quizzes.

Key differences

| Traditional ML | Generative AI |
|----------------|---------------|
| Predicts outcome from features | Produces new content |
| Needs task-specific data | Pretrained on massive corpora |
| Optimized for accuracy | Optimized for creativity, coherence |
| Example: Predict churn | Example: Generate images of flying pigs/elephants |

#GPT #Claude #GenerativeAI

Reinforcement Learning

RLHF (Reinforcement Learning from Human Feedback)

It's like humans learning what to do and what not to do.

A learning paradigm where an agent interacts with an environment, takes actions, and learns from reward signals.

Instead of labeled data, it uses trial-and-error feedback.

Complements supervised/unsupervised learning.

Strongly linked to decision-making and control tasks.

Example: YouTube recommends a video; if you watch it, the system learns from that, and if you choose “Don’t show this”, the system reacts to that as well.

Here the agent is the YouTube recommendation engine; the actions are the recommendations it makes; the rewards are user signals such as watching, like/share, or “not interested”.

Pros

  • Handles complex sequential decisions.
  • Can learn optimal strategies without explicit rules.
  • Mimics human/animal learning.

Cons

  • Data and compute intensive.
  • Reward design is tricky.
  • Training can be unstable.

Use Cases

  • Game AI: AlphaGo defeating world champions.
  • Robotics: teaching robots to walk, grasp, or navigate.
  • Finance: algorithmic trading strategies.
  • Dynamic pricing in e-commerce.

flowchart TD
    A[Prompt] --> B[Base LLM generates multiple responses]
    B --> C[Human labelers rank responses]
    C --> D[Reward Model learns preferences]
    D --> E[Fine-tune LLM with Reinforcement Learning]
    E --> F[Aligned ChatGPT]

#rl #rlhf #robotics

Agentic AI

AI systems that are autonomous agents: they can plan, reason, take actions, and use tools.

Builds on LLMs + RL concepts.

Can execute multi-step tasks with minimal human guidance.

Before Agentic AI

  • Traditional AI -> task-specific models.
  • LLMs -> good at generating text but mostly passive responders.

Transformation with Agentic AI

  • Adds agency: memory, planning, acting.
  • Can chain multiple AI capabilities (search + reasoning + action).

Pros

  • Automates workflows end-to-end.
  • Adaptable across domains.
  • Learns from feedback loops.

Cons

  • Hard to control (hallucinations, unsafe actions).
  • High computational cost.
  • Reliability and governance still open challenges.

Use Cases

  • AI agents booking travel (search -> compare -> purchase).
  • Customer support bots that escalate only when needed.
  • Business process automation (invoice handling, data entry).
| Aspect | AI Assistant (Chatbot/LLM) | Agentic AI (Autonomous Agent) |
|--------|----------------------------|-------------------------------|
| Nature | Reactive → answers questions | Proactive → plans and executes tasks |
| Memory | Limited to current session | Has memory across interactions |
| Actions | Generates text/code only | Uses tools, APIs, external systems |
| Planning | One-shot response | Multi-step reasoning and decision-making |
| Adaptability | Needs explicit user prompts | Self-adjusts based on goals and feedback |
| Example Use Case | “What’s the weather in NYC?” → gives forecast | “Plan my weekend trip to NYC” → books flight, hotel, creates itinerary |
| Industry Example | Customer support FAQ bot | AI agent that handles returns, refunds, and escalations automatically |

#agents

MLOps

Why MLOps

Operationalizing ML/AI models with focus on automation, collaboration, and reliability.

Building is easy, sustaining is hard.

Remember dieting/exercise?

  • Companies moved past “build model in Jupyter” → now productionize models.
  • An often-cited estimate: ~80% of ML projects fail due to lack of a deployment + monitoring strategy.
  • MLOps bridges Data → Model → Production.

Industry requirement

  • Versioning models
  • Monitoring drift
  • Scalable deployment
  • Regulatory compliance (audit trail, lineage)

Lifecycle

  • Data ingestion -> data validation & quality checks -> feature engineering
  • Model training -> validation -> experiment tracking & versioning
  • Deployment (batch, real-time, API) -> rollback capabilities
  • Monitoring
    • Data drift (input distribution)
    • Model drift (prediction accuracy)
    • Concept drift (feature:label relationship)
    • Infrastructure performance
  • Continuous improvement -> retraining & iteration

Cross-Functional Teams

  • Data Engineers
  • Data Scientists
  • ML Engineers
  • Platform/DevOps Engineers
  • Product Managers

Key Capabilities

  • Reproducibility
  • Scalability
  • Governance & compliance
  • Automated CI/CD pipelines

#cicd #mlops #devops #medallion

Differences across AI/ML systems

| Aspect | Traditional ML | NLP (Pre-GenAI) | GenAI | MLOps |
|--------|----------------|-----------------|-------|-------|
| Data | Structured, tabular | Text, tokens | Multi-modal | Any |
| Training | From small datasets | Task-specific corpora | Massive pretraining + fine-tune | Not about training, about lifecycle |
| Output | Prediction | Classification, tagging, parsing | Content (text, code, image) | Deployment + Ops |
| Role Focus | Data Scientist | NLP Researcher | Prompt Engineer, LLM Engineer | ML Engineer, Platform Eng. |

#ai-ml #genai #mlops #llm

Examples

Retail:

  • Traditional ML -> Demand forecasting.
  • GenAI -> Personalized product descriptions.
  • MLOps -> Continuous retraining as seasons change.

Healthcare:

  • Traditional ML -> Predict patient readmission.
  • GenAI -> Auto-generate clinical notes.
  • MLOps -> Ensure compliance & monitoring under HIPAA.

Finance:

  • Traditional ML -> Fraud detection.
  • GenAI -> AI-powered customer chatbots.
  • MLOps -> Drift detection for fraud models.
| Traditional ML | GenAI | MLOps |
|----------------|-------|-------|
| Fraud detection (transaction classification) | AI-powered customer chatbots for support | Drift detection & alerts for fraud models |
| Credit scoring (loan approval risk models) | Personalized financial advice reports | Automated retraining with new credit bureau data |
| Stock price trend prediction | Summarizing financial reports & earnings calls | Compliance monitoring (audit trails for regulators) |
| Customer lifetime value prediction | Generating personalized investment recommendations | Model versioning & rollback in case of errors |

#finance #healthcare #retail #examples

Job Opportunities

Traditional ML

  • Data Scientist
  • Applied ML Engineer
  • Data Analyst -> ML transition

GenAI

  • Prompt Engineer
  • LLM Application Developer
  • GenAI Product Engineer
  • AI Research Scientist

MLOps

  • ML Engineer (deployment, monitoring)
  • MLOps Engineer (CI/CD pipelines for ML)
  • Cloud ML Platform Engineer (Databricks, AWS Sagemaker, GCP Vertex AI, Azure ML)

#jobs #mlengineer #mlopsengineer

Terms to Know

Regression

Predicting a continuous numeric value.

Use Case: Predicting house prices based on size, location, and number of rooms.

Linear Regression

A regression model assuming a straight-line relationship between input features and target.

Use Case: Estimating sales revenue as a function of advertising spend.

Classification

Predicting discrete categories.

Use Case: Classifying an email as spam or not spam.

Clustering

Grouping similar data points without labels.

Use Case: Segmenting customers into groups by behavior when no labels exist.

Feature Engineering

Creating new meaningful features from raw data to improve model performance.

Use Case: From “Date of Birth” → create “Age” as a feature for predicting insurance risk.

Overfitting

Model learns training data too well (including noise) -> poor generalization.

Use Case: Overfitting = a spam filter that memorizes training emails but fails on new ones.

Underfitting

Model too simple to capture patterns -> poor performance.

Use Case: Trying to predict house prices using only the average price (ignoring size, location, rooms, etc.).

Bias

A source of error caused by overly simplistic assumptions in the model.

  • Leads to underfitting.

Variance

A source of error caused by too much sensitivity to fluctuations in the training data.

  • Leads to overfitting.

Model Drift

When a model’s performance degrades over time because data distribution changes.

Use Case: A churn model trained pre-pandemic performs poorly after online behavior changes drastically.

MSE

Mean Squared Error

Average of the squared differences between predicted values and actual values.

Actual a: [10, 20, 30, 40, 50]
Predicted p : [12, 18, 25, 45, 60]

| i | Actual| Predicted | Error | Squared Error |
| - | ------|-----------|-------|---------------|
| 1 | 10    | 12        |  -2   | 4             |
| 2 | 20    | 18        |   2   | 4             |
| 3 | 30    | 25        |   5   | 25            |
| 4 | 40    | 45        |  -5   | 25            |
| 5 | 50    | 60        | -10   | 100           |

ss_res = 4 + 4 + 25 + 25 + 100 = 158

MSE = ss_res / n = 158 / 5 = 31.6

R Square

Proportion of variance in the target explained by the model.

1.0 = Perfect Prediction. 0.0 = Model is no better than predicting the mean. Negative = Model is worse than just predicting the mean.

Mean of actual values = (10 + 20 + 30 + 40 + 50) / 5 = 30

Total Variation (ss_tot): (10 - 30)^2 + (20 - 30)^2 + (30 - 30)^2 + (40 - 30)^2 + (50 - 30)^2 = 400 + 100 + 0 + 100 + 400 = 1000

R^2 = 1 - (ss_res / ss_tot)

R^2 = 1 - (158/1000) = 0.842
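
You can verify both numbers with scikit-learn's metrics (a quick sanity check of the worked example above):

from sklearn.metrics import mean_squared_error, r2_score

actual = [10, 20, 30, 40, 50]
predicted = [12, 18, 25, 45, 60]

print(mean_squared_error(actual, predicted))  # 31.6
print(r2_score(actual, predicted))            # 0.842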

Serialization

The process of converting an in-memory object (e.g., a Python object) into a storable or transferable format (such as JSON, binary, or a file) so it can be saved or shared.

import json

data = {"name": "Ganesh", "course": "MLOps"}
# Serialization → Python dict → JSON string
serialized = json.dumps(data)

# The serialized string can also be written to a .json file if needed.

Deserialization

The process of converting the stored or transferred data (JSON, binary, file, etc.) back into an in-memory object that your program can use.

# Load it from JSON file
# Deserialization → JSON string → Python dict

deserialized = json.loads(serialized)

#serialization #deserialization #overfitting #underfitting

Model vs Library vs Framework

python -m venv .demomodel 
source .demomodel/bin/activate 
pip install scikit-learn joblib
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from joblib import dump, load
import numpy as np

# Fake dataset: study_hours -> exam_score
rng = np.random.default_rng(42)
hours = rng.uniform(0, 10, size=100).reshape(-1, 1)     # feature X
noise = rng.normal(0, 5, size=100)                      # noise
scores = 5 + 8*hours.ravel() + noise                    # target y

X_train, X_test, y_train, y_test = train_test_split(
    hours, scores, test_size=0.2, random_state=42
)

model = LinearRegression()

# Train (fit)
model.fit(X_train, y_train)

# Evaluate
pred = model.predict(X_test)
print("MSE:", mean_squared_error(y_test, pred))
print("R2 :", r2_score(y_test, pred))
print("Learned slope and intercept:", model.coef_[0], model.intercept_)

# Save 
dump(model, "linear_hours_to_score.joblib")

# Inference on new data
new_hours = np.array([[2.0], [5.0], [9.0]])
print("Predicted scores:", model.predict(new_hours))

# Predict after load

restored = load("linear_hours_to_score.joblib")
print("Loaded model predicts:", restored.predict(new_hours))

Fun Task

  • Identify the Algorithm, Library, and Model in this code
  • What are MSE and r2_score?
  • What is .joblib?
  • What is the number 42?

#model #library #framework #r2score #mse

Explanation

  • Library: scikit-learn

  • Algorithm: Linear Regression (Mathematics)

  • Prebuilt Model: LinearRegression (part of scikit-learn library)

  • model.fit(): builds the custom model fitted to this data.

  • 42: the answer to the ultimate question of Life, the Universe, and Everything.

  • model.coef_[0] → the slope learned from the data: how much the target (exam score) increases for one extra unit of study hours.

  • model.intercept_ → the intercept: the predicted target value when study hours = 0.

    Example:

    Learned slope and intercept: 8.1 4.9

    8.1 * hrs + 4.9

    If a student studies 0 hours, predicted score ≈ 4.9 (baseline knowledge).

    If a student studies 5 hours, predicted score ≈ 8.1 × 5 + 4.9 = 45.4.

#explanation #library

Statistical vs ML Models

Statistical Models

  • Focus on inference -> understanding relationships between variables.
  • Assume an underlying distribution (e.g., linear, normal).
  • Typically work well with smaller datasets.

Goal: test hypotheses, estimate parameters.

Example: Linear regression to explain how income depends on education, experience, etc.

Machine Learning Models

  • Focus on prediction -> finding patterns that generalize to unseen data.
  • Fewer assumptions about data distribution.
  • Can handle very large datasets and high-dimensional data.

Goal: optimize predictive performance.

Example: Random Forest predicting whether a customer will churn.

Key Similarities

Both use data to build models.

Both rely on training (fit) and evaluation (test).

Overlaps: linear regression is both a statistical model and an ML model, depending on context.


Book worth reading

The Manga Guide to Linear Algebra.

https://www.amazon.com/dp/1593274130

(Not an affiliate or referral)


On a lighter note

#statistics #ml #linearalgebra

Types of ML Models

Supervised Learning

Data has input features (X) and target labels (y).

Model learns mapping: f(X) → y.

Examples:

  • Regression -> Predicting house prices, demand forecast, server usage.
  • Classification -> Spam vs Non-spam email or Customer churn.

Unsupervised Learning

Data has inputs only, no labels.

Goal: find hidden patterns or structure.

Examples:

  • Clustering -> Customer segmentation.
  • Association Rules -> Market basket analysis (“people who buy X also buy Y”).
  • Dimensionality Reduction -> Principal Component Analysis (PCA) for visualization.
    • Taking a high dimensional data and reducing it to fewer dimensions.

Reinforcement Learning (RL)

Agent interacts with environment -> learns by trial and error.

Used for decision-making & control.

Examples:

  • Robotics & self-driving cars.
  • Newer Video Games.
  • OTT Content recommendations.
  • Ads.

Semi-Supervised Learning

Mix of few labeled + many unlabeled data points.

Often used in NLP and computer vision.

Example: labeling 1,000 medical images, then using 100,000 unlabeled ones to improve the model.

#supervised #unsupervised

ML Lifecycle

Collect Data (Data Engineer's Role)

  • Gather raw data from systems (databases, APIs, sensors, logs).
  • Ensure sources are reliable and updated.

Clean & Prepare

  • Handle missing values, outliers, and noise.
  • Feature engineering: create new features, scale/encode as needed.
  • Data splitting (train/validation/test).

Train Model

  • Choose algorithm (supervised, unsupervised, reinforcement, etc.).
  • Train on training set, tune hyperparameters.

Evaluate

  • Use appropriate metrics:
    • Classification → Accuracy, Precision, Recall, F1.
    • Regression → RMSE, MAE, R².
  • Cross-validation for robustness.

Deploy

  • Make model accessible via API, batch jobs, or embedded in applications.
  • Consider scaling (cloud, containers, edge devices).

Monitor & Improve

  • Track data drift, concept drift, and model performance decay.
  • Automate retraining pipelines (MLOps).
  • Capture feedback loop to improve features and models.

#collect #clean #train #evaluate

Data Preparation

~80% of the time in ML projects is spent on Data Preparation & Cleaning, and ~20% on Model Training.

The process of making raw data accurate, complete, and structured so it can be used for model training.

Wait, isn't data cleaning the Data Engineer's job, not the ML Engineer's?

True, but...

Data Engineers focus on collection and validation at scale:

  • Ingest raw data from source systems (databases, APIs, IoT, logs).
  • Build ETL/ELT pipelines (Bronze → Silver → Gold).
  • Ensure data quality checks (avoid duplicates, schema validation, type checks, primary key uniqueness).
  • Handle big data infrastructure: Spark, Databricks, Airflow, Kafka.
  • Deliver curated data (often “Silver” or “Gold” layer) for downstream ML.

ML Engineers / Data Scientists take over once curated data is available:

  • Apply ML-specific cleaning & prep:
    • Impute missing values intelligently (mean/median/model-based).
    • Encode categorical variables (one-hot, embeddings).
    • Normalize/standardize numeric features.
    • Text normalization, tokenization, embeddings.
  • Create features meaningful to the ML model.
  • Split data into train/validation/test sets.
flowchart LR
    DE[**Data Engineer**<br/><br/>- ETL/ELT Pipelines<br/>- Schema Validation<br/>- Deduplication<br/>- Type Checks]

    OVERLAP[**Common**<br/><br/>- Remove Duplicates<br/>- Ensure Consistency]

    MLE[**ML Engineer**<br/><br/>- Handle Missing Values<br/>- Feature Scaling<br/>- Imputation<br/>- Encoding & Embeddings<br/>- Train/Val/Test Split]

    DE --> OVERLAP
    MLE --> OVERLAP

For Example

Tabular Data

Data Engineer: ensures no duplicate customer IDs in database.

ML Engineer: fills missing “Age” values with median, scales “Income”.

Text Data

Data Engineer: stores raw customer reviews as UTF-8 encoded text.

ML Engineer: lowercases, removes stopwords, converts to embeddings.

Image Data

Data Engineer: validates images aren’t corrupted on ingest.

ML Engineer: resizes images, normalizes pixel values.

#data #cleaning

Data Cleaning

Check for Target Leakage

What it is: Features that give away the answer (future info in training data).

Why it matters: Makes the model look perfect in training but useless in production.

Example:

refund_issued_flag when predicting “Will this order be refunded?”.

Having this column during training is wrong because, in production, it will never be available at prediction time. (It can still be used when testing your model's predictions.)

Validate Labels

What it is: Make sure labels are correct, consistent, and usable.

Why it matters: Garbage labels = garbage predictions.

Example:

Churn column has values: yes, Y, 1, true.

Normalize to 1 = churn, 0 = not churn.
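
A minimal pandas sketch of this normalization (column name and spellings assumed for illustration):

import pandas as pd

df = pd.DataFrame({"churn": ["yes", "Y", "1", "true", "no", "N"]})

# Map every known positive spelling to 1, everything else to 0.
positive = {"yes", "y", "1", "true"}
df["churn"] = df["churn"].str.strip().str.lower().isin(positive).astype(int)

print(df["churn"].tolist())  # [1, 1, 1, 1, 0, 0]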

Handle Outliers Intentionally

What it is: Extreme values that distort training.

Why it matters: “Emp_Salary = 10,000,000” can throw off predictions.

Example

Cap at 99th percentile.

Flag as anomaly instead of training on it.

Enforce Feature Types

What it is: Make sure data types match their meaning.

Why it matters: Models can’t learn if types are wrong.

Example:

customer_id stored as integer → model may treat it as numeric.

Why is that a problem? customer_id = 20 would carry more weight than customer_id = 1.

Convert to string (categorical).

Standardize Categories

What it is: Inconsistent labels in categorical columns.

Why it matters: Model may treat the same thing as different classes.

Example:

Country: USA, U.S.A., United States.

Map all to United States.

Normalize Text for ML

What it is: Clean and standardize text features.

Why it matters: Prevents the model from treating “Hello” and “hello!” as different.

Example:

Lowercasing, removing punctuation, stripping whitespace.

Keep a copy of raw text for audit.

Protect Data Splits

What it is: Make sure related rows don’t leak between train/test.

Why it matters: Prevents unfair accuracy boost.

Example:

Same student appears in both train and test sets.

Fix: Group by student_id when splitting.
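
A minimal sketch of a group-aware split using scikit-learn's GroupShuffleSplit (toy data; the student_id column is assumed):

import numpy as np
from sklearn.model_selection import GroupShuffleSplit

X = np.arange(10).reshape(-1, 1)                       # toy features
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])           # toy labels
student_id = np.array([1, 1, 2, 2, 3, 3, 4, 4, 5, 5])  # grouping key

# All rows of a given student land entirely in train OR test.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(X, y, groups=student_id))

assert set(student_id[train_idx]).isdisjoint(student_id[test_idx])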

#datacleaning #mlcleaning #normalize_data

Data Imputation

Data Imputation is the process of filling in missing values in a dataset with estimated or predicted values.

Data imputation aims to enhance the quality and completeness of the dataset, ultimately improving the performance and reliability of the ML model.

Problems with Missing Data

  • Reduced model performance
  • Biased inferences
  • Imbalanced representations
  • Increased complexity in model handling

Data Imputation

Data Domain knowledge is important before choosing the right method.

Dropping Rows/Columns

Remove the rows or columns that contain missing values.

  • If the percentage of missing data is very small.
  • If the column isn’t important for the model.

Example: Drop the few rows out of Million where “Age” is missing.

Treat as a Category

Encode “missing” or “NA” or “Unknown” as its own category.

  • For categorical variables (like Country, Gender, Payment Method).

When “missing” itself carries meaning (e.g., customer didn’t provide income → may be sensitive).

Example: Add a category Unknown to “Marital Status” column.

Data with Missing Values

| ID | Country |
|----|---------|
| 1 | USA |
| 2 | Canada |
| 3 | Null |
| 4 | India |
| 5 | NA (missing) |

After treating as a Category

| ID | Country |
|----|---------|
| 1 | USA |
| 2 | Canada |
| 3 | Missing |
| 4 | India |
| 5 | Missing |

The model will see “Missing” as just another value like “USA” or “India.”

Replacing Missing Values (Imputation)

Fill missing values with a reasonable estimate.

Methods:

  • Mean/Median/Mode: Quick fixes for numeric/categorical data.
  • KNN Imputation: Fill value based on “closest” similar records.
  • Regression Imputation: Predict the missing value using other features.

Example: Replace missing “Salary” with median salary of the group.
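
A minimal sketch with scikit-learn's imputers (toy [age, salary] data; in practice, scale features before KNN imputation):

import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

# Columns: [age, salary]; two salaries are missing.
X = np.array([
    [25, 50_000],
    [30, 62_000],
    [28, np.nan],
    [45, 90_000],
    [44, np.nan],
])

# Median imputation: quick per-column fix.
print(SimpleImputer(strategy="median").fit_transform(X))

# KNN imputation: estimates each missing salary from the closest rows by age.
print(KNNImputer(n_neighbors=2).fit_transform(X))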

Multiple Imputation (Regression-Based)

Using regression models repeatedly (with randomness) to fill missing data, producing several plausible datasets, and then combining them for analysis.

| Age | Education | Income |
|-----|-----------|--------|
| 30 | Masters | ? |
| 40 | PhD | 120K |
| 35 | Bachelors | 80K |
  • Step 1: Fit regression: Income ~ Age + Education.
  • Step 2: Predict missing Income for Age=30, Edu=Masters.
  • Step 3: Add random noise → 95K in dataset1, 92K in dataset2, 98K in dataset3.
  • Step 4: Analyze all 3 datasets, combine results.

Downside: slower process and higher computing cost. The more missing values, the more computation time.


Rules of thumb:

  • Drop: if the missing portion is tiny and negligible.
  • Category: if the feature is categorical.
  • Replace: if it's numeric and important.
  • KNN/Regression: if you want smarter imputations and can afford the compute.

It is important to mark the imputed data

You need to know which values came from the source and which were calculated, so the latter can be handled with a pinch of salt.

Data Imputation Marking


| Method | When to Use | Pros | Cons |
|--------|-------------|------|------|
| Drop Rows/Columns | When % of missing data is very small (e.g., <5%) or the feature is unimportant | Simple and fast; no assumptions needed | Lose data (rows); risk of losing valuable features (columns) |
| Treat as a Category | For categorical variables where “missing” may carry meaning | Preserves all rows; captures the “missingness” as useful info | Only works for categorical data; can create an artificial category if missing isn’t meaningful |
| Replace with Mean/Median/Mode | For numeric data (mean/median) or categorical (mode) | Easy to implement; keeps dataset size intact | Distorts distribution; ignores correlations between features |
| KNN Imputation | When dataset is not too large and similar neighbors make sense | Considers relationships between features; more accurate than simple averages | Computationally expensive; sensitive to scaling and choice of K |
| Regression Imputation | When missing values can be predicted from other variables | Uses feature relationships; can be very accurate | Risk of “overfitting” imputations; adds complexity |

#dataimputation #knn #encode #dropdata

Data Encoding

Data Encoding is the process of converting categorical data (like colors, countries, product types) into a numeric format that ML models can understand.

Unlike numerical data, categorical data is not directly usable because models operate on numbers, not labels.

Encoding ensures categorical values are represented in a way that preserves meaning and avoids misleading the model.

Typically rule-based.


Example: Products

| ID | Product |
|----|---------|
| 1 | Laptop |
| 2 | Phone |
| 3 | Tablet |
| 4 | Phone |

Label Encoding

Assigns each category a unique integer.

| ID | Product (Encoded) |
|----|-------------------|
| 1 | 0 |
| 2 | 1 |
| 3 | 2 |
| 4 | 1 |

Pros:

  • Very simple, minimal storage.
  • Works well for tree-based models.

Cons:

  • Implies an order between categories (Laptop < Phone < Tablet).
  • Misleads linear models.

One-Hot Encoding

Creates a binary column for each category.

| ID | Laptop | Phone | Tablet |
|----|--------|-------|--------|
| 1 | 1 | 0 | 0 |
| 2 | 0 | 1 | 0 |
| 3 | 0 | 0 | 1 |
| 4 | 0 | 1 | 0 |

Pros:

  • No ordinal assumption.
  • Easy to interpret.

Cons:

  • High dimensionality for many products (e.g., thousands of SKUs).
  • Sparse data, more memory needed.

Ordinal Encoding

Encodes categories when they have a natural order.

Works for things like product size or version level.

Example (Product Tier):

| ID | Product Tier |
|----|--------------|
| 1 | Basic |
| 2 | Standard |
| 3 | Premium |
| 4 | Standard |

After Ordinal Encoding:

| ID | Product Tier (Encoded) |
|----|------------------------|
| 1 | 1 |
| 2 | 2 |
| 3 | 3 |
| 4 | 2 |

Pros:

  • Preserves rank/order.
  • Efficient storage.

Cons:

  • Only valid if order is real (Basic < Standard < Premium).
  • Wrong if categories are unordered (Laptop vs Phone).

Target Encoding (Mean Encoding)

Replaces each category with the mean of the target variable.

Target - “Purchased” Yes=1, No=0

| ID | Product | Purchased |
|----|---------|-----------|
| 1 | Laptop | 1 |
| 2 | Phone | 0 |
| 3 | Tablet | 1 |
| 4 | Phone | 1 |

After encoding:

| ID | Product (Encoded) | Purchased |
|----|-------------------|-----------|
| 1 | 1.0 | 1 |
| 2 | 0.5 | 0 |
| 3 | 1.0 | 1 |
| 4 | 0.5 | 1 |

Compute mean purchase rate:

Laptop = 1.0 Phone = 0.5 Tablet = 1.0

Pros:

  • Great for high-cardinality features (e.g., hundreds of product SKUs).
  • Often improves accuracy.
  • Keeps dataset compact (just 1 numeric column).
  • Often boosts performance in models like Logistic Regression or Gradient Boosted Trees.

Cons:

  • Risk of data leakage if target encoding is done on the whole dataset.
  • Must use cross-validation to avoid leakage.
  • Compute intensive.

| Encoding Type | Best For | Avoid When |
|---------------|----------|------------|
| Label Encoding | Tree-based models, low-cardinality products | Linear models, unordered categories |
| One-Hot Encoding | General ML, few product categories | Very high-cardinality features |
| Ordinal Encoding | Ordered categories (tiers, sizes, versions) | Unordered categories (Phone vs Laptop) |
| Target Encoding | High-cardinality products, with proper CV | Without CV (leakage risk) |

Multiple Categorical Columns

| ID | Product | Product Tier | Category | Purchased |
|----|---------|--------------|----------|-----------|
| 1 | Laptop | Premium | PC | 1 |
| 2 | Phone | Basic | Mobile | 0 |
| 3 | Tablet | Standard | Electronics | 1 |
| 4 | Phone | Premium | Mobile | 1 |

  • Product: Laptop, Phone, Tablet
  • Product Tier: Basic < Standard < Premium (ordered)
  • Category: PC, Mobile, Electronics (unordered)

Label Encoding (all columns)

Replace each category with an integer.

| ID | Product | Product Tier | Category |
|----|---------|--------------|----------|
| 1 | 0 | 2 | 0 |
| 2 | 1 | 0 | 1 |
| 3 | 2 | 1 | 2 |
| 4 | 1 | 2 | 1 |

Artificial order created (e.g., PC=0, Mobile=1, Electronics=2).

One-Hot Encoding (all columns)

| ID | Laptop | Phone | Tablet | Tier_Basic | Tier_Standard | Tier_Premium | Cat_PC | Cat_Mobile | Cat_Electronics |
|----|--------|-------|--------|------------|---------------|--------------|--------|------------|-----------------|
| 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| 2 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 3 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 |
| 4 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |

Very interpretable, but column explosion if you have 50+ products or 100+ categories.

Mixed Encoding (best practice)

  • Product → One-Hot (few categories).
  • Product Tier → Ordinal (Basic=1, Standard=2, Premium=3).
  • Category → One-Hot (PC, Mobile, Electronics).
| ID | Laptop | Phone | Tablet | Tier (Ordinal) | Cat_PC | Cat_Mobile | Cat_Electronics |
|----|--------|-------|--------|----------------|--------|------------|-----------------|
| 1 | 1 | 0 | 0 | 3 | 1 | 0 | 0 |
| 2 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
| 3 | 0 | 0 | 1 | 2 | 0 | 0 | 1 |
| 4 | 0 | 1 | 0 | 3 | 0 | 1 | 0 |
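
A minimal pandas sketch of this mixed strategy (same toy data):

import pandas as pd

df = pd.DataFrame({
    "Product":  ["Laptop", "Phone", "Tablet", "Phone"],
    "Tier":     ["Premium", "Basic", "Standard", "Premium"],
    "Category": ["PC", "Mobile", "Electronics", "Mobile"],
})

# Ordinal: Tier has a real order.
df["Tier"] = df["Tier"].map({"Basic": 1, "Standard": 2, "Premium": 3})

# One-Hot: Product and Category are unordered with few categories.
df = pd.get_dummies(df, columns=["Product", "Category"], dtype=int)

print(df)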

#onehot_encoding #target_encoding #label_encoding

Feature Engineering

The process of transforming raw data into more informative inputs (features) for ML models.

Goes beyond encoding: you can create new features/metrics (like derived columns in the DB world) that pure encoding does not offer.

The goal of FE is to improve model accuracy, interpretability, and generalization.

Example (Laptop Sales):

Purchase Date = 2025-09-02

Derived Features:

  • Month = 09
  • DayOfWeek = Tuesday
  • IsHolidaySeason = No
  • IsWeekend = No
  • IsLeapYear = No
  • Quarter = Q3

Encoding (One-Hot, Label, Target) = only turns categories into numbers.

But real-world data often hides useful patterns in dates, interactions, domain knowledge, or semantics.

| ID | Product | Purchase Date | Price | PurchasedAgain |
|----|---------|---------------|-------|----------------|
| 1 | Laptop | 2023-12-01 | 1200 | 1 |
| 2 | Laptop | 2024-07-15 | 1100 | 0 |
| 3 | Phone | 2024-05-20 | 800 | 1 |
| 4 | Tablet | 2024-08-05 | 600 | 1 |
  • Encoding only handles Product → One-Hot or Target.

Feature Engineering adds new insights:

  • From Purchase Date: extract Month, DayOfWeek, IsHolidaySeason.
  • From Price: create Discounted? (if < avg product price).
  • Combine features: Price / AvgCategoryPrice.
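
A minimal pandas sketch of the date-derived features above (IsHolidaySeason is a simplistic Nov/Dec proxy):

import pandas as pd

df = pd.DataFrame({"purchase_date": pd.to_datetime(["2023-12-01", "2024-07-15"])})

dates = df["purchase_date"].dt
df["month"] = dates.month
df["day_of_week"] = dates.day_name()
df["quarter"] = dates.quarter
df["is_weekend"] = dates.dayofweek >= 5
df["is_holiday_season"] = dates.month.isin([11, 12])

print(df)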

Basic Feature Engineering

Improve signals/patterns without domain-specific knowledge.

Scaling/Normalization: Price → (Price – mean) / std

Date/Time Features: Purchase Date → Month=12, DayOfWeek=Friday

Polynomial/Interaction: Price × Tier

Pros:

  • Easy to implement.
  • Immediately boosts many models (especially linear/Neural Networks).

Cons:

  • Risk of adding noise if done blindly.
  • Limited unless combined with domain insights.

Domain-Specific Feature Engineering

Apply business/field knowledge.

Examples:

Finance: Debt-to-Income Ratio, Credit Utilization %

Healthcare: BMI = Weight / Height², risk score categories

IoT: Rolling averages, peak detection in sensor data.

Pros:

  • Captures real-world meaning → big performance gains.
  • Makes models explainable to stakeholders.

Cons:

  • Requires domain expertise.
  • Not always transferable between datasets.

#feature_engineering #domain_specific

Vectors

A vector is just an ordered list of numbers that represents a data point so models can do math on it.

Think “row -> numbers” for tabular data, or “text/image -> numbers” after a transformation.

Example:

Price = 1200, Weight = 2kg, Warranty = 24 months → Vector = [1200, 2, 24]

Types of Vectors

Tabular Feature Vector

Concatenate numeric columns (and encoded categoricals) into a single vector.

Built by the ML engineer/data scientist during data prep/feature engineering (training); the same code must run at inference.

Example: [Price, Weight, Warranty] → [1200, 2, 24].

Sparse Vectors

High-dimensional vectors with many zeros (e.g., One-Hot, Bag-of-Words, TF-IDF).

Produced by an encoding/featurization function in your pipeline.

Example

Products = {Laptop, Phone, Pen}

Laptop → [1, 0, 0]
Phone → [0, 1, 0]
Pen → [0, 0, 1]

Dense Vectors

Lower-dimensional, compact numeric representations that are mostly non-zero.

Created by algorithms (scalers/PCA) or models (embeddings) in your pipeline.

Example (not actual values):

Laptop → [0.65, -0.12, 0.48]
Phone → [0.60, -0.15, 0.52]
Pen → [0.10, 0.85, -0.40]

Laptop and Phone vectors are close together.

Model-Derived Feature Vectors

Dense vectors generated by models such as CNNs or Transformers during feature extraction. Mainly used in computer vision and speech: image classification, object detection, face recognition, voice processing.

Models generate them during feature extraction (training & inference).

Example: BERT sentence vector, ResNet image features.

| Vector Type | Who designs it? | Who computes it? | When it's computed | Example |
|-------------|-----------------|------------------|--------------------|---------|
| Tabular feature vector | ML Eng/DS (choose columns) | Pipeline code | Train & Inference | [Price, Weight, Warranty] |
| Sparse (One-Hot/TF-IDF) | ML Eng/DS (choose encoder) | Encoder in pipeline | Train (fit) & Inference (transform) | One-Hot Product |
| Dense (scaled/PCA) | ML Eng/DS (choose scaler/PCA) | Scaler/PCA in pipeline | Train (fit) & Inference (transform) | StandardScaled price, PCA(100) |
| Model features / Embeddings | ML Eng/DS (choose model) | Model (pretrained or trained) | Train & Inference | BERT/ResNet/categorical embedding |

MLOps ensures the same steps run at inference to avoid train/serve skew.

Example of Dense Vector

python -m venv .densevector 
source .densevector/bin/activate 
pip install sentence-transformers
from sentence_transformers import SentenceTransformer

# Load a pre-trained model (MiniLM is small & fast)
model = SentenceTransformer('all-MiniLM-L6-v2')

text = "Laptop"

# Convert text into dense vector
vector = model.encode(text)

print("Dense Vector Shape:", text, vector.shape)
print("Dense Vector (first 10 values):", vector[:10])
Compare embeddings with cosine similarity:

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Load model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Words
texts = ["Laptop", "Computer", "Pencil"]

# Encode all
vectors = model.encode(texts)

# Convert to numpy array
vectors = np.array(vectors)

# Cosine similarity matrix
sim_matrix = cosine_similarity(vectors)

# Display similarity scores
for i in range(len(texts)):
    for j in range(i+1, len(texts)):
        print(f"Similarity({texts[i]} vs {texts[j]}): {sim_matrix[i][j]:.4f}")

#vectors #densevector #sparsevector #tabularvector

Embeddings

Embeddings transform high-dimensional categorical or textual data into a compact, dense vector space.

Similar items are placed closer together in vector space -> models can understand similarity.

  • These representations capture relationships and context among different entities.
  • Used in Recommendation Systems, NLP, Image Search and more.
  • Can be learned from data using neural networks or retrieved from pretrained models (e.g., Word2Vec, FastText).

Use Cases

  • Search & Retrieval: Semantic search, image search.
  • NLP: Word/sentence embeddings for sentiment, chatbots, translation.
  • Computer Vision: Image embeddings for similarity or classification.

Advantages over traditional encoding:

  • Handle high-cardinality categorical features (e.g., millions of products).
  • Capture context and semantics (“Laptop” is closer to “Computer” than “Pencil”).
  • Lower-dimensional → more efficient than One-Hot or TF-IDF.

Types of Embeddings

Word Embeddings (Text)

Represent words as vectors so that semantically similar words are close together.

Examples: Word2Vec, GloVe, FastText.

“king” – “man” + “woman” = “queen”

Used in: sentiment analysis, translation, chatbots.

Sentence / Document Embeddings (Text)

Represent longer text (sentences, paragraphs, docs) in vector form.

Capture context and meaning beyond individual words.

Examples: BERT, Sentence-BERT, Universal Sentence Encoder.

“The laptop is fast” and “This computer is quick” → close vectors.

Image Embeddings (Computer Vision)

Represent images as vectors extracted from CNNs or Vision Transformers.

Capture visual similarity (shapes, colors, objects).

Examples: ResNet, CLIP (image+text).

A cheetah photo and a leopard photo → embeddings close together (both cat family).

Used in: image search, face recognition, object detection.

Audio / Speech Embeddings

Convert audio waveforms into dense vectors capturing phonetics and semantics.

Examples: wav2vec, HuBERT.

Voice saying “Laptop” → embedding close to text embedding of “Laptop”.

Used in: speech recognition, speaker identification.

Graph Embeddings

Represent nodes/edges in a graph (social networks, knowledge graphs).

Capture relationships and network structure.

Examples: Node2Vec, DeepWalk, Graph Neural Networks (GNNs).

In a product graph, Laptop node embedding will be close to Mouse if often co-purchased.

| Type | Example Algorithms | Data Type | Use Cases |
|------|--------------------|-----------|-----------|
| Word | Word2Vec, GloVe | Text (words) | NLP basics |
| Sentence/Doc | BERT, SBERT | Text (longer) | Semantic search, QA |
| Categorical | Embedding layers | Tabular (IDs) | Recommenders, fraud detection |
| Image | ResNet, CLIP | Vision | Image search, recognition |
| Audio | wav2vec, HuBERT | Audio | Speech-to-text, voice auth |
| Graph | Node2Vec, GNNs | Graphs | Social networks, KG search |

#embeddings #BERT #Word2Vec #NLP

Life Before MLOps

Challenges Faced by ML Teams.

Moving Models from Dev → Staging → Prod

Models were often shared as .pkl or joblib files, passed around manually.

Problem: Dependency mismatches (Python, sklearn version), fragile handoffs.

Stopgap: Packaging models with Docker images, but still manual and inconsistent.

Champion vs Challenger Deployment

Teams struggled to test a new (challenger) model against the current (champion).

Problem: No controlled A/B testing or shadow deployments → risky rollouts.

Stopgap: Manual canary releases or running offline comparisons.

Model Versioning Confusion

Models saved as model_final.pkl, model_final_v2.pkl, final_final.pkl.

Problem: Nobody knew which version was truly in production.

Stopgap: Git or S3 versioning for files, but no link to experiments/data.

Inference on Wrong Model Version

Even if multiple versions existed, production systems sometimes pointed to the wrong one.

Problem: Silent failures, misaligned experiments vs prod results.

Stopgap: Hardcoding file paths or timestamps — brittle and error-prone.

Train vs Serve Skew (Data-Model Mismatch)

Preprocessing done in notebooks was re-written differently in prod code.

Problem: Same model behaves differently in production.

Stopgap: Copy-paste code snippets, but no guarantee of sync.

Experiment Tracking Chaos

Results scattered across notebooks, Slack messages, spreadsheets.

Problem: Couldn’t reproduce “that good accuracy we saw last week.”

Stopgap: Manually logging metrics in Excel or text files.

Reproducibility Issues

Same code/data gave different results on different machines.

Problem: No control of data versions, package dependencies, or random seeds.

Stopgap: Virtualenvs, requirements.txt — helped a bit but not full reproducibility.

Lack of Monitoring in Production

Once deployed, no one knew if the model degraded over time.

Problem: Models silently failed due to data drift or concept drift.

Stopgap: Occasional manual performance checks, but no automation.

Scaling & Performance Gaps

Models trained in notebooks failed under production loads.

Problem: Couldn’t handle large-scale data or real-time inference.

Stopgap: Batch scoring jobs on cron — but too slow for real-time use cases.

Collaboration Breakdowns

Data Scientists, Engineers, Ops worked in silos.

Problem: Miscommunication -> wrong datasets, broken pipelines, delays.

Stopgap: Jira tickets and handovers — but still slow and error-prone.

Governance & Compliance Gaps

No audit trail of which model made which prediction.

Problem: Risky for regulated domains (finance, healthcare).

Stopgap: Manual logging of predictions — incomplete and unreliable.

#mlops #development #production

Quiz

Note: This is a practice quiz and will not be graded. The purpose is to help you check your understanding of the concepts we covered.

Developer Tools

Introduction

Before diving into Data or ML frameworks, it's important to have a clean and reproducible development setup. A good environment makes you:

  • Faster: less time fighting dependencies.
  • Consistent: same results across laptops, servers, and teammates.
  • Confident: tools catch errors before they become bugs.

A consistent developer experience saves hours of debugging. You spend more time solving problems, less time fixing environments.


Python Virtual Environment

  • A virtual environment is like a sandbox for Python.
  • It isolates your project’s dependencies from the global Python installation.
  • Easy to manage different versions of libraries.
  • Dependencies are tracked in requirements.txt, which has to be managed manually.

Without it, installing one package for one project may break another project.

Open the CMD prompt (Windows)

Open the Terminal (Mac)

# Step 0: Create a project folder under your Home folder.

mkdir project

cd project


# Step 1: Create a virtual environment
python -m venv myenv

# Step 2: Activate it
# On Mac/Linux:
source myenv/bin/activate

# On Windows:
myenv\Scripts\activate.bat

# Step 3: Install packages (they go inside `myenv`, not global)
pip install faker

# Step 4: Open Python
python

# Step 5: Verify 

import sys

sys.prefix

sys.base_prefix

# Step 6: Run this sample

from faker import Faker
fake = Faker()
fake.name()

# Step 7: Exit Python (Ctrl+D on Mac/Linux; Ctrl+Z then Enter on Windows)

# Step 8: Deactivate the venv when done

deactivate

As a next step, you can either use Poetry or UV as your package manager.

#venv #python #uv #poetry #developer_tools

UV

Dependency & Environment Manager

  • Written in Rust.
  • Syntax is lightweight.
  • Automatic Virtual environment creation.

Create a new project:

# Initialize a new uv project
uv init uv_helloworld

Sample layout of the directory structure

.
├── main.py
├── pyproject.toml
├── README.md
└── uv.lock
# Change directory
cd uv_helloworld

# # Create a virtual environment myproject
# uv venv myproject

# or create a UV project with specific version of Python

# uv venv myproject --python 3.11

# # Activate the Virtual environment

# source myproject/bin/activate

# # Verify the Virtual Python version

# which python3

# add library (best practice)
uv add faker

# verify the list of libraries under virtual env
uv tree

# To find the list of libraries inside Virtual env

uv pip list

edit the main.py

from faker import Faker
fake = Faker()
print(fake.name())
uv run main.py

Read More on the differences between UV and Poetry

#uv #rust #venv

Python Developer Tools

PEP

A PEP (Python Enhancement Proposal) is a design document that proposes features and conventions for Python. The best-known, PEP 8, is the official style guide for Python code: it provides conventions and recommendations for writing readable, consistent, and maintainable Python code.

PEP Conventions

  • PEP 8 : Style guide for Python code (most famous).
  • PEP 20 : "The Zen of Python" (guiding principles).
  • PEP 484 : Type hints (basis for MyPy).
  • PEP 517/518 : Build system interfaces (basis for pyproject.toml, used by Poetry/UV).
  • PEP 572 : Assignment expressions (the := walrus operator).
  • PEP 695 : Type parameter syntax for generics (Python 3.12).

Indentation

  • Use 4 spaces per indentation level
  • Continuation lines should align with opening delimiter or be indented by 4 spaces.

Line Length

  • Limit lines to a maximum of 79 characters.
  • For docstrings and comments, limit lines to 72 characters.

Blank Lines

  • Use 2 blank lines before top-level functions and class definitions.
  • Use 1 blank line between methods inside a class.

Imports

  • Imports should be on separate lines.
  • Group imports into three sections: standard library, third-party libraries, and local application imports.
  • Use absolute imports whenever possible.
# Correct
import os
import sys

# Wrong
import sys, os

Naming Conventions

  • Use snake_case for function and variable names.
  • Use CamelCase for class names.
  • Use UPPER_SNAKE_CASE for constants.
  • Avoid single-character variable names except for counters or indices.
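
All three conventions in one small (hypothetical) snippet:

MAX_RETRIES = 3                        # constant: UPPER_SNAKE_CASE

class OrderProcessor:                  # class: CamelCase
    def process_order(self, order_id): # function/variable: snake_case
        retry_count = 0
        return order_id, retry_count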

Whitespace

  • Don’t pad inside parentheses/brackets/braces.
  • Use one space around operators and after commas, but not before commas.
  • No extra spaces when aligning assignments.

Comments

  • Write comments that are clear, concise, and helpful.
  • Use complete sentences and capitalize the first word.
  • Use # for inline comments, but avoid them where the code is self-explanatory.

Docstrings

  • Use triple quotes (""") for multiline docstrings.
  • Describe the purpose, arguments, and return values of functions and methods.
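
A short illustration of these conventions (hypothetical function):

def apply_discount(price: float, percent: float) -> float:
    """Return the price after applying a percentage discount.

    Args:
        price: Original price.
        percent: Discount percentage between 0 and 100.

    Returns:
        The discounted price.
    """
    return price * (1 - percent / 100)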

Code Layout

  • Keep function definitions and calls readable.
  • Avoid writing too many nested blocks.

Consistency

  • Consistency within a project outweighs strict adherence.
  • If you must diverge, be internally consistent.

Linting

Linting is the process of automatically checking your Python code for:

  • Syntax errors

  • Stylistic issues (PEP 8 violations)

  • Potential bugs or bad practices

  • Keeps your code consistent and readable.

  • Helps catch errors early before runtime.

  • Encourages team-wide coding standards.


# Incorrect
import sys, os

# Correct
import os
import sys
# Bad spacing
x= 5+3

# Good spacing
x = 5 + 3

Ruff : Linter and Code Formatter

Ruff is a fast, modern tool written in Rust that helps keep your Python code:

  • Consistent (follows PEP 8)
  • Clean (removes unused imports, fixes spacing, etc.)
  • Correct (catches potential errors)

Install

poetry add ruff
uv add ruff

Verify

ruff --version 
ruff --help

example.py

import os, sys 

def greet(name): 
  print(f"Hello, {name}")

def message(name): print(f"Hi, {name}")

def calc_sum(a, b): return a+b

greet('World')
greet('Ruff')
message('Ruff')

poetry run ruff check example.py
poetry run ruff check example.py --fix

poetry run ruff format example.py --check
poetry run ruff format example.py

OR

uv run ruff check example.py
uv run ruff check example.py --fix
uv run ruff format example.py --check
uv run ruff format example.py

MyPy : Type Checking Tool

mypy is a static type checker for Python. It checks your code against the type hints you provide, ensuring that the types are consistent throughout the codebase.

It primarily focuses on type correctness—verifying that variables, function arguments, return types, and expressions match the expected types.

Install

    poetry add mypy

    or

    uv add mypy

    or

    pip install mypy

sample.py

x = 1
x = 1.0       # mypy: incompatible types (float vs inferred int)
x = True      # ok for mypy: bool is a subtype of int
x = "test"    # mypy: incompatible types (str vs int)
x = b"test"   # mypy: incompatible types (bytes vs int)

print(x)

def add(a: int, b: int) -> int:
    return a + b

print(add(100, 123))

print(add("hello", "world"))  # mypy flags this call: str arguments for int parameters


uv run mypy sample.py

or

poetry run mypy sample.py

or

mypy sample.py

#mypy #pep #ruff #lint

Error Handling

Python uses try/except blocks for error handling.

The basic structure is:

try:
    # Code that may raise an exception
except ExceptionType:
    # Code to handle the exception
finally:
    # Code executes all the time

Uses

Improved User Experience: Instead of the program crashing, you can provide a user-friendly error message.

Debugging: Capturing exceptions can help you log errors and understand what went wrong.

Program Continuity: Allows the program to continue running or perform cleanup operations before terminating.

Guaranteed Cleanup: Ensures that certain operations, like closing files or releasing resources, are always performed.

Some key points

  • You can catch specific exception types or use a bare except to catch any exception.

  • Multiple except blocks can be used to handle different exceptions.

  • An else clause can be added to run if no exception occurs.

  • A finally clause will always execute, whether an exception occurred or not.


Without Try/Except

x = 10 / 0  # raises ZeroDivisionError and crashes the program

Basic Try/Except

try:
    x = 10 / 0 
except ZeroDivisionError:
    print("Error: Division by zero!")

Generic Exception

try:
    file = open("nonexistent_file.txt", "r")
except:
    print("An error occurred!")

Find the exact error

try:
    file = open("nonexistent_file.txt", "r")
except Exception as e:
    print(str(e))

Raise - Else and Finally

try:
    x = -10
    if x <= 0:
        raise ValueError("Number must be positive")
except ValueError as ve:
    print(f"Error: {ve}")
else:
    print(f"You entered: {x}")
finally:
    print("This will always execute")

try:
    x = 10
    if x <= 0:
        raise ValueError("Number must be positive")
except ValueError as ve:
    print(f"Error: {ve}")
else:
    print(f"You entered: {x}")
finally:
    print("This will always execute")

Nested Functions


def divide(a, b):
    try:
        result = a / b
        return result
    except ZeroDivisionError:
        print("Error in divide(): Cannot divide by zero!")
        raise  # Re-raise the exception

def calculate_and_print(x, y):
    try:
        result = divide(x, y)
        print(f"The result of {x} divided by {y} is: {result}")
    except ZeroDivisionError as e:
        print(str(e))
    except TypeError as e:
        print(str(e))

# Test the nested error handling
print("Example 1: Valid division")
calculate_and_print(10, 2)

print("\nExample 2: Division by zero")
calculate_and_print(10, 0)

print("\nExample 3: Invalid type")
calculate_and_print("10", 2)

#error #try #exception

UnitTest

A unit test verifies the correctness of a small, isolated "unit" of code—typically a single function or method—independent of the rest of the program.

Key Benefits of Unit Testing

Isolates functionality – Tests focus on one unit at a time, making it easier to pinpoint where a bug originates.

Enables early detection – Issues are caught during development, reducing costly fixes later in production.

Prevents regressions – Running existing tests after changes ensures new bugs aren’t introduced.

Supports safe refactoring – With a strong test suite, developers can confidently update or restructure code.

Improves quality – High coverage enforces standards, highlights edge cases, and strengthens overall reliability.

Unit Testing in Python

Every language provides its own frameworks for unit testing. In Python, popular choices include:

unittest – The built-in testing framework in the standard library.

pytest – Widely used, simple syntax, rich plugin ecosystem.

doctest – Tests embedded directly in docstrings.

testify – An alternative framework inspired by unittest, with added features.


pytest is the popular testing tool for data/ML code. It’s faster to write, far more expressive for data-heavy tests, and has a rich plugin ecosystem that plays nicely with Spark, Pandas, MLflow, and CI.
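
As a minimal illustration (file and function names hypothetical), pytest discovers any test_*.py file and runs the functions whose names start with test_:

# test_calc.py
def add(a, b):
    return a + b

def test_add_integers():
    assert add(2, 3) == 5

def test_add_strings():
    assert add("unit", "test") == "unittest"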

git clone https://github.com/gchandra10/pytest-demo.git
uv run pytest -v

#unittesting #pytest

DUCK DB

DuckDB is a single-file database built with no external dependencies.

All the great features can be read here https://duckdb.org/

Automatic Parallelism: DuckDB has improved its automatic parallelism capabilities, meaning it can more effectively utilize multiple CPU cores without requiring manual tuning. This results in faster query execution for large datasets.

Parquet File Improvements: DuckDB has improved its handling of Parquet files, both in terms of reading speed and support for more complex data types and compression codecs. This makes DuckDB an even better choice for working with large datasets stored in Parquet format.

Query Caching: Improves the performance of repeated queries by caching the results of previous executions. This can be a game-changer for analytics workloads with similar queries being run multiple times.

How to use DuckDB?

Download the CLI Client

DuckDB in Data Engineering

Download orders.parquet from

https://github.com/duckdb/duckdb-data/releases/download/v1.0/orders.parquet

More files are available here

https://github.com/cwida/duckdb-data/releases/

Open Command Prompt or Terminal

./duckdb

# Create / Open a database

.open ordersdb

DuckDB allows you to read the contents of orders.parquet as-is, without needing a table. Double quotes around the file name orders.parquet are essential.

describe table  "orders.parquet"

Not only that, it also allows you to query the file as-is. (Databricks supports a similar feature.)

select * from "orders.parquet" limit 3;

DuckDB supports CTAS (CREATE TABLE AS SELECT) syntax, which helps create tables from the actual file.

show tables;

create table orders  as select * from "orders.parquet";

select count(*) from orders;

DuckDB supports parallel query processing, and queries run fast.

This table has 1.5 million rows, and aggregation happens in less than a second.

select now(); select o_orderpriority,count(*) cnt from orders group by o_orderpriority; select now();

DuckDB also helps to convert parquet files to CSV in a snap. It also supports converting CSV to Parquet.

COPY "orders.parquet" to 'orders.csv'  (FORMAT "CSV", HEADER 1);Select * from "orders.csv" limit 3;

It also supports exporting existing Tables to Parquet files.

COPY "orders" to  'neworder.parquet' (FORMAT "PARQUET");

DuckDB provides client APIs for programming languages such as Python, R, Java, Node.js, and C/C++.

DuckDB also supports higher-level SQL features such as macros, sequences, and window functions.
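
As a minimal sketch of the Python client just mentioned (assumes the duckdb package is installed and orders.parquet is in the working directory):

import duckdb  # pip install duckdb

con = duckdb.connect("ordersdb")  # creates or opens the database file
# Query the Parquet file directly; no table needed
print(con.execute('SELECT count(*) FROM "orders.parquet"').fetchall())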

Get sample data from Yellow Cab

https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page

Copy yellow cabs data into yellowcabs folder

create table taxi_trips as select * from "yellowcabs/*.parquet";
SELECT
    PULocationID,
    EXTRACT(HOUR FROM tpep_pickup_datetime) AS hour_of_day,
    AVG(fare_amount) AS avg_fare
FROM
    taxi_trips
GROUP BY
    PULocationID,
    hour_of_day;

Extensions

https://duckdb.org/docs/extensions/overview

INSTALL json;
LOAD json;
select * from 'demo.json';
describe table 'demo.json';

Load directly from HTTP location

select * from 'https://raw.githubusercontent.com/gchandra10/filestorage/main/sales_100.csv'

#duckdb #singlefiledatabase #parquet #tools #cliVer 0.3.6

Last change: 2025-12-02

[Avg. reading time: 8 minutes]

JQ

  • jq is a lightweight and flexible command-line JSON processor.
  • Reads JSON from stdin or a file, applies filters, and writes JSON to stdout.
  • Useful when working with APIs, logs, or config files in JSON format.
  • Handy tool in Automation.
  1. Download JQ CLI (Preferred) and learn JQ.

JQ Download

2. Use the VSCode Extension and learn JQ.

VSCode Extension

Download the sample JSON

https://raw.githubusercontent.com/gchandra10/jqtutorial/refs/heads/master/sample_nows.json

Note: The file's top level is an array with no named root element, so '.' (the identity filter) refers to the whole document.

1. View JSON file in readable format

jq '.' sample_nows.json

2. Read the First JSON element / object

jq 'first(.[])' sample_nows.json

3. Read the Last JSON element

jq 'last(.[])' sample_nows.json

4. Read top 3 JSON elements

jq 'limit(3;.[])' sample_nows.json

5. Read 2nd & 3rd element. Remember, Python has the same format. LEFT Side inclusive, RIGHT Side exclusive

jq '.[2:4]' sample_nows.json

6. Extract individual values. | Pipeline the output

jq '.[] | [.balance,.age]' sample_nows.json

7. Extract individual values and do some calculations

jq '.[] | [.age, 65 - .age]' sample_nows.json

8. Return CSV from JSON

jq '.[] | [.company, .phone, .address] | @csv ' sample_nows.json

9. Return Tab Separated Values (TSV) from JSON

jq '.[] | [.company, .phone, .address] | @tsv ' sample_nows.json

10. Return with custom pipeline delimiter ( | )

jq '.[] | [.company, .phone, .address] | join("|")' sample_nows.json

Pro tip: Redirect this result to output.txt and import it into a database using bulk-import tools such as bcp or LOAD DATA INFILE.

11. Convert the number to string and return | delimited result

jq '.[] | [.balance,(.age | tostring)] | join("|") ' sample_nows.json

12. Process Array return Name (returns as list / array)

jq '.[] | [.friends[].name]' sample_nows.json

or (returns line by line)

jq '.[].friends[].name' sample_nows.json

13. Parse multi level values

returns as list / array

jq '.[] | [.name.first, .name.last]' sample_nows.json 

returns line by line

jq '.[].name.first, .[].name.last' sample_nows.json 

14. Query values based on condition, say .index > 2

jq 'map(select(.index > 2))' sample_nows.json
jq 'map(select(.index > 2)) | .[] | [.index,.balance,.age]' sample_nows.json

15. Sorting Elements

# Sort by Age ASC
jq 'sort_by(.age)' sample_nows.json
# Sort by Age DESC
jq 'sort_by(-.age)' sample_nows.json
# Sort on multiple keys
jq 'sort_by(.age, .index)' sample_nows.json

Use Cases

curl -s https://www.githubstatus.com/api/v2/status.json
curl -s https://www.githubstatus.com/api/v2/status.json | jq '.'
curl -s https://www.githubstatus.com/api/v2/status.json | jq '.status'

#jq #tools #json #parser #cli #automationVer 0.3.6

Last change: 2025-12-02

[Avg. reading time: 5 minutes]

SQLite

It's a serverless, embedded database: the database engine is a library compiled into your application.

  • The entire database is one file on disk.

  • It’s self-contained - needs no external dependencies.

  • It’s the most widely deployed database in the world.

How It’s Different from “Big” Databases

  • No client-server architecture - your app directly reads/writes the database file
  • No network overhead - everything is local file I/O
  • No configuration - no setup, no admin, no user management
  • Lightweight - the library is only a few hundred KB
  • Single writer at a time - multiple readers OK, but writes are serialized

Key Architectural Concepts

ACID Properties:

  • Transactions are atomic, consistent, isolated, durable
  • Even if your app crashes mid-write, database stays consistent

Locking & Concurrency:

  • Database-level locking (not row or table level like PostgreSQL)
  • Write transactions block other writers
  • This is fine for mobile/embedded, problematic for high-concurrency servers

Storage & Pages:

  • Data stored in fixed-size pages (default 4KB)
  • Understanding page size matters for performance tuning

When to Use SQLite

  • Mobile apps (iOS, Android)
  • Desktop applications
  • Embedded systems (IoT devices, cars, planes)
  • Small-to-medium websites (< 100K hits/day)
  • Local caching
  • Application file format (instead of XML/JSON)
  • Development/testing

When not to Use SQLite

  • High-concurrency web apps with many simultaneous writers
  • Distributed systems needing replication
  • Client-server architectures where you need central control
  • Applications requiring fine-grained access control

Performance Characteristics

  • Extremely fast for reads
  • Very fast for writes on local storage
  • Slower on network drives (NFS, cloud mounts)
  • Indexes work like other databases - crucial for query performance
  • Analyze your queries - use EXPLAIN QUERY PLAN

Demo

git clone https://github.com/gchandra10/python_sqlite_demo
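
Before opening the repo, here is a minimal sketch using Python's built-in sqlite3 module (the file and table names are illustrative):

import sqlite3  # ships with the Python standard library

con = sqlite3.connect("demo.db")  # the entire database is this one file
con.execute("CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, name TEXT)")
con.execute("INSERT INTO users (name) VALUES (?)", ("Rachel",))
con.commit()  # atomic, durable write (ACID)
print(con.execute("SELECT * FROM users").fetchall())
con.close()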

#sqlite #localdb #embeddeddbVer 0.3.6

Last change: 2025-12-02

[Avg. reading time: 5 minutes]

Introduction

MLflow Components

MLflow Tracking

  • Logs experiments, parameters, metrics, and artifacts
  • Provides UI for comparing runs and visualizing results
  • Supports automatic logging for popular ML libraries

Use case: Track model performance across different hyperparameters, compare experiment results

MLflow Projects

  • Packages ML code in reusable, reproducible format
  • Uses conda.yaml or requirements.txt for dependencies
  • Supports different execution environments (local, cloud, Kubernetes)

Use case: Share reproducible ML workflows, standardize project structure

MLflow Models

  • Standardizes model packaging and deployment
  • Supports multiple ML frameworks (scikit-learn, TensorFlow, PyTorch, etc.)
  • Enables model serving via REST API, batch inference, or cloud platforms

Use case: Deploy models consistently across environments, A/B test different model versions

MLflow Model Registry

  • Centralized model store with versioning and stage management
  • Tracks model lineage and metadata
  • Supports approval workflows and access controls

Use case: Manage model lifecycle from staging to production, collaborate on model deployment

Common Use Cases

Experiment Management

  • Compare model architectures, hyperparameters, and feature engineering approaches
  • Track training metrics over time and across team members

Model Deployment

  • Package models for consistent deployment across dev/staging/prod
  • Serve models as REST endpoints or batch processing jobs

Collaboration

  • Share reproducible experiments and models across data science teams
  • Maintain audit trail of model development and deployment decisions

MLOps Workflows

  • Automate model training, validation, and deployment pipelines
  • Integrate with CI/CD systems for continuous model delivery

MLflow works well as a lightweight, open-source solution that integrates with existing ML workflows without requiring major infrastructure changes.Ver 0.3.6

Last change: 2025-12-02

[Avg. reading time: 4 minutes]

MLflow Experiment Structure

Think of a typical chemistry experiment from school days:

| Experiment (ML Project) | Run # | Inputs (Parameters) | Process (Code/Recipe) | Outputs (Artifacts) | Metrics (Results) |
|---|---|---|---|---|---|
| Acid + Base Reaction | Run 1 | Acid=10ml, Base=5ml | Stirred 2 mins, room temp | Beaker with solution | pH=7.0 |
| Acid + Base Reaction | Run 2 | Acid=10ml, Base=7ml | Stirred 2 mins, room temp | Beaker with solution | pH=6.2 |
| Acid + Base Reaction | Run 3 | Acid=10ml, Base=7ml | Stirred 5 mins, heat | Beaker with solution | pH=6.0, Color=yellow |
  • Experiment → Group of related trials (like a project or ML task).
  • Run → One trial with a unique ID (just like a single lab experiment entry).
  • Inputs (Parameters) → Model hyperparameters (learning rate, batch size, etc.).
  • Process (Code/Recipe) → Training code or pipeline steps.
  • Outputs (Artifacts) → Models, plots, datasets, or serialized files.
  • Metrics (Results) → Accuracy, loss, F1-score, etc.
MLflow
│
├── Experiment A
│     ├── Run 1
│     │     ├── Parameters
│     │     ├── Metrics
│     │     ├── Artifacts
│     │     └── Tags
│     ├── Run 2
│     │     ├── Parameters
│     │     ├── Metrics
│     │     ├── Artifacts
│     │     └── Tags
│     └── Run 3
│           ├── Parameters
│           ├── Metrics
│           ├── Artifacts
│           └── Tags
│
└── Experiment B
      ├── Run 1
      ├── Run 2
      └── Run N
git clone https://github.com/gchandra10/uni_multi_model.git
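
The chemistry analogy maps directly onto the MLflow API. A minimal sketch (the experiment, parameter, and tag names are illustrative), assuming mlflow is installed:

import mlflow

mlflow.set_experiment("acid_base_reaction")   # Experiment
with mlflow.start_run(run_name="run_1"):      # Run
    mlflow.log_param("acid_ml", 10)           # Inputs (Parameters)
    mlflow.log_param("base_ml", 5)
    mlflow.log_metric("ph", 7.0)              # Metrics (Results)
    mlflow.set_tag("temperature", "room")     # Tags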

#mlflow #experiment #runVer 0.3.6

Last change: 2025-12-02

[Avg. reading time: 8 minutes]

Why MLflow

MLflow provides comprehensive support for traditional ML workflows, making it effortless to track experiments, manage models, and deploy solutions at scale.

Key Features

Intelligent (Auto)logging

- Simple Integration for scikit-learn, XGBoost, and more
- Automatic Parameter Capture (logs all model hyperparameters without manual intervention)
- Built-in Evaluation Metrics (automatically computes and stores relevant performance metrics)
- Model Serialization (handles complex objects like pipelines seamlessly)

Compare Model Performance Across Algorithms

  • Save Time: No more manually tracking results in spreadsheets or notebooks

  • Make Better Decisions: Easily spot which algorithms perform best on your data

  • Avoid Mistakes: Never lose track of promising model configurations

  • Share Results: Team members can see all experiments and build on each other’s work

  • Visual charts comparing accuracy, precision, recall across all your models

  • Sortable tables showing parameter combinations and their results

  • Quick filtering to find models that meet specific performance criteria

  • Export capabilities to share findings with stakeholders

Flexible Deployment

  • Real-Time Inference for low-latency prediction services
  • Batch Processing for large-scale scoring jobs
  • Edge Deployment for offline and mobile applications
  • Containerized Serving with Docker and Kubernetes support
  • Cloud Integration across AWS, Azure, and Google Cloud platforms
  • Custom Serving Logic for complex preprocessing and postprocessing requirements

Capabilities

Tracking Server & MLflow UI

Start a new project

VSCode, Open Workspace

Open Shell 1 (Terminal/GitBash)

uv init mlflow_demo
cd mlflow_demo
uv add mlflow pandas numpy scikit-learn matplotlib

Option 1: Store MLflow details in Local Machine

mlflow server --host 127.0.0.1 --port 8080

Open this URL and copy the file to your VSCode

https://github.com/gchandra10/uni_multi_model/blob/main/01-lr-model.py

Open Shell 2

Activate the virtual environment, then run:

python 01-lr-model.py

Open your browser and go to http://127.0.0.1:8080

View the Experiment


Option 2: Store MLflow details in a Local Database

mlflow server --host 127.0.0.1 --port 8080 \
--backend-store-uri sqlite:///mlflow.db

Option 3: Store MLflow details in a Remote Database

export AWS_PROFILE=your_profile_name

mlflow server --host 127.0.0.1 --port 8080 \
  --default-artifact-root s3://yourbucket \
  --backend-store-uri 'postgresql://yourhostdetails/'

Model Serving

Open Shell 3

Optional: activate the virtual environment.


export MLFLOW_TRACKING_URI=http://127.0.0.1:8080

mlflow models serve \
  -m "models:/Linear_Regression_Model/1" \
  --host 127.0.0.1 \
  --port 5001 \
  --env-manager local

Real Time Prediction

Open Shell 4

Optional: activate the virtual environment.


curl -X POST "http://127.0.0.1:5001/invocations" \
  -H "Content-Type: application/json" \
  --data '{"inputs": [{"ENGINESIZE": 2.0}, {"ENGINESIZE": 3.0}, {"ENGINESIZE": 4.0}]}'

OR

curl -X POST http://127.0.0.1:5001/invocations \
  -H "Content-Type: application/json" \
  -d '{
        "dataframe_split": {
          "columns": ["ENGINESIZE"],
          "data": [[2.0],[3.0],[4.0]]
        }
      }'
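
The same request can be made from Python. A minimal sketch using the requests library (assumes the serving endpoint above is running on port 5001):

import requests  # pip install requests

payload = {
    "dataframe_split": {
        "columns": ["ENGINESIZE"],
        "data": [[2.0], [3.0], [4.0]],
    }
}
resp = requests.post("http://127.0.0.1:5001/invocations", json=payload, timeout=10)
print(resp.json())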

#mlflow #serving #mlflow_serverVer 0.3.6

Last change: 2025-12-02

[Avg. reading time: 5 minutes]

YAML

Introduction

  • YAML Ain’t Markup Language.
  • Human-readable alternative to JSON.
  • Indentation is key (like Python).
  • Used for configuration, not for programming logic.

Key Principles

  • Whitespace indentation -> hierarchy
  • Colon (:) -> Key Value Pair
  • Dash (-) -> List Item
  • Comments (#)

Use Cases in MLOps

  • MLflow experiment configs (parameters, environments)
  • Kubernetes -> Pods, Services, Deployments
  • Docker Compose -> multi-container setups
  • CI/CD pipelines -> GitHub Actions, GitLab CI, Azure DevOps
The same configuration in JSON:

{
  "experiment": "CO2_Regression",
  "params": {
    "alpha": 0.1,
    "max_iter": 100
  },
  "tags": ["linear_regression", "mlflow"]
}

and in YAML:

experiment: CO2_Regression
params:
  alpha: 0.1
  max_iter: 100
tags:
  - linear_regression
  - mlflow

Validate YAML with YAMLLint or the VSCode YAML Validator extension.

YAML Data Structures

Scalars (strings, numbers, booleans)

learning_rate: 0.01
early_stopping: true
experiment_name: "CO2_Prediction"

Lists

models:
  - linear_regression
  - random_forest
  - xgboost

Dictionaries (maps)

params:
  n_estimators: 100
  max_depth: 5

Multi-line Strings

description: |
  This is a multi-line string.
  It preserves line breaks.
  Useful for comments/description/notes.

Putting together

experiment:
  name: CO2_Regression
  params:
    alpha: 0.1
    max_iter: 100
  metrics:
    - mse
    - r2
  description: |
    Model built using Linear Regression.
    We can use univariate or multi variate.

  environments:
    development:
      database: sqlite
    production:
      database: mysql

Default Values Using Anchors (&anchorName), Aliases (*anchorName), and the Merge Key (<<)


base_config: &base
  host: localhost
  port: 3306

development:
  <<: *base
  database: dev_db

production:
  <<: *base
  database: prod_db
  host: prod.server.com

Using Environment Variables

Note: YAML itself does not expand ${...} references; the tool that consumes the file (Docker Compose, CI runners, etc.) performs the substitution.


config:
  path: ${USERPROFILE}\folder1

Mac/Linux/Git Bash

export USERPROFILE="sometext"

Command Prompt

set USERPROFILE="sometext"

YAML Variables

variables:
  base_url: http://example.com
endpoints:
  user: ${variables.base_url}/user
  admin: ${variables.base_url}/admin
git clone https://github.com/gchandra10/python_yaml_demo.git
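
A minimal sketch of loading YAML with anchors and the merge key from Python, assuming the PyYAML package:

import yaml  # pip install pyyaml

doc = """
base_config: &base
  host: localhost
  port: 3306

development:
  <<: *base
  database: dev_db
"""

cfg = yaml.safe_load(doc)
print(cfg["development"])  # {'host': 'localhost', 'port': 3306, 'database': 'dev_db'}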

#yaml #json #pythonVer 0.3.6

Last change: 2025-12-02

[Avg. reading time: 1 minute]

Cloud

[Avg. reading time: 6 minutes]

Overview

Definitions

Hardware: physical computer / equipment / devices

Software: programs such as operating systems, Word, Excel

Web Site: read-only web pages such as company pages, portfolios, newspapers

Web Application: read-write - online forms, Google Docs, email, Google apps

Cloud plays a significant role in the Big Data world.

In today's market, the cloud helps companies accommodate the ever-increasing volume, variety, and velocity of data.

Cloud computing is the on-demand delivery of IT resources over the Internet with pay-per-use pricing.

Src: https://thinkingispower.com/the-blind-men-and-the-elephant-is-perception-reality/

Without cloud knowledge, understanding Big Data is like the parable of the blind men and the elephant above: each person perceives only one part of the whole.

  1. Volume: Size of the data.
  2. Velocity: Speed at which new data is generated.
  3. Variety: Different types of data.
  4. Veracity: Trustworthiness of the data.
  5. Value: Usefulness of the data.
  6. Vulnerability: Security and privacy aspects.

When people focus on only one aspect without the help of cloud technologies, they miss out on the comprehensive picture. Cloud solutions offer ways to manage all these dimensions in an integrated manner, thus providing a fuller understanding and utilization of Big Data.

Advantages of Cloud Computing

  • Cost Savings
  • Security
  • Flexibility
  • Mobility
  • Insight
  • Increased Collaboration
  • Quality Control
  • Disaster Recovery
  • Loss Prevention
  • Automatic Software Updates
  • Competitive Edge
  • Sustainability

Types of Cloud Computing

Public Cloud

Owned and operated by third-party providers. (AWS, Azure, GCP, Heroku, and a few more)

Private Cloud

Cloud computing resources are used exclusively by a single business or organization.

Hybrid

Public + Private: By allowing data and applications to move between private and public clouds, a hybrid cloud gives your business greater flexibility and more deployment options, and helps optimize your existing infrastructure, security, and compliance.

#cloud #overviewVer 0.3.6

Last change: 2025-12-02

[Avg. reading time: 5 minutes]

Types of Cloud Services

SaaS

Software as a Service

Cloud-based service providers offer end-user applications. Google Apps, DropBox, Slack, etc.

  • Web access to Software (primarily commercial).
  • Software is managed from a central location.
  • One-to-many delivery model (one instance serves many customers).
  • No patches, No upgrades

When not to use

  • Hardware integration is needed (e.g., a price scanner).
  • Faster processing is required.
  • Cannot host data outside the premise.

PaaS

Platform as a Service

Software tools are available over the internet. AWS RDS, Heroku, Salesforce

  • Scalable
  • Built on Virtualization Technology
  • Users don't maintain the software (DB upgrades and patches are handled by the cloud team).

When not to use PaaS

  • Proprietary tools make it hard to move to a different provider (e.g., AWS-specific tools).
  • Using new software that is not part of the PaaS toolset.

IaaS

Infrastructure as a Service

Cloud-based hardware services. Pay-as-you-go services for Storage, Networking, and Servers.

Amazon EC2, Google Compute Engine, S3.

  • Highly flexible and scalable.
  • Accessible by more than one user.
  • Cost-effective (if used right).

Serverless computing

Focuses on building apps without spending time managing servers/infrastructure.

Features automatic scaling, built-in high availability, and pay-per-use pricing.

Resources are used only when a specific function or event runs.

Cloud providers handle deployment, capacity, and server management.

Example: AWS Lambda, AWS Step Functions.

Easy way to remember SaaS, PaaS, IaaS

bigcommerce.com

#cloud #iaas #paas #saasVer 0.3.6

Last change: 2025-12-02

[Avg. reading time: 5 minutes]

Challenges of Cloud Computing

Privacy: “Both traditional and Big Data sets often contain sensitive information, such as addresses, credit card details, or social security numbers.”

So, it’s the responsibility of users to ensure proper security methods are followed.

Compliance: Cloud providers replicate data across regions to ensure safety. This becomes a problem if a company's regulations require that data never be stored outside the organization or in a specific part of the world.

Data Availability: Everything is dependent on the Internet and speed. It is also dependent on the choice of the cloud provider. Big companies like AWS / GCP / Azure have more data centers and backup facilities.

Connectivity: Internet availability + speed.

Vendor lock-in: Once an organization has migrated its data and applications to the cloud, switching to a different provider can be difficult and expensive. This is known as vendor lock-in. Cloud-agnostic tools like Databricks help enterprises mitigate this problem, but it remains a challenge.

Cost: Cloud computing can be a cost-effective way to deploy and manage IT resources. However, it is essential to carefully consider your needs and budget before choosing a cloud provider.

Continuous Training: Employees may need to be trained to use cloud-based applications. This can be a cost and time investment.

Constant Change in Technology: Cloud providers constantly improve or change their technology. Recently, Microsoft decided to decommission Synapse and launch a new tool called Fabric.

#cloud #challengesVer 0.3.6

Last change: 2025-12-02

[Avg. reading time: 4 minutes]

AWS

Terms to Know

Elasticity: The ability to acquire resources as you need them and release resources when you no longer need them.

Scale Up vs. Scale Down

Scale-Out vs. Scale In

Latency

Typically, latency is a measurement of a round trip between two systems: how long it takes data to travel between them.

Root User

Owner of the AWS account.

IAM

Identity Access Management

ARN

Amazon Resource Name

For example

arn:aws:iam::123456789012:user/Development/product_1234/*

Policy

Rules

Amazon EC2

Allows you to deploy virtual servers within your AWS environment.

Amazon S3

A fully managed, object-based storage service that is highly available, highly durable, cost-effective, and widely accessible.

AWS IAM (Identity and Access Management)

Used to manage permissions to your AWS resources

AWS Management Services

Amazon CloudWatch

A comprehensive monitoring tool that allows you to monitor your services and applications in the cloud.

Billing & Budgeting

Helps control the cost.Ver 0.3.6

[Avg. reading time: 6 minutes]

AWS Global Infrastructure

The Primary two items are given below.

  • Availability Zones
  • Regions

Availability Zones (AZs)

AZs are the physical data centers of AWS.

This is where the actual computing, storage, network, and database resources are hosted that we as consumers, provision within our Virtual Private Clouds (VPCs).

A common misconception is that a single availability zone equals a single data center. Multiple data centers located closely form a single availability zone.

Each AZ is paired with at least one other AZ in the same geographical area, and each is isolated from the others with separate power and networking, much like a DR (disaster recovery) setup.

Many AWS services use low latency links between AZs to replicate data for high availability and resilience purposes.

Multiple AZs make up an AWS Region. (Example: Virginia)

Regions

Every Region will act independently of the others, containing at least two Availability Zones.

Note that not all AWS services are available in every region.

  • US East (N. Virginia) us-east-1
  • US East (Ohio) us-east-2
  • EU (Ireland) eu-west-1
  • EU (Frankfurt) eu-central-1

Note: As of today, AWS is available in 38 regions and 120 AZs

AWS Regions

Edge Location

A smaller AWS data center used by Amazon CloudFront and Lambda@Edge to cache content closer to users.

Reduces latency and improves performance for end users, especially for content delivery and inference endpoints.

A user in Singapore fetching from a U.S. model endpoint may hit an Edge Location nearby for lower latency.

Use Cases:

- DNS Resolution (Route 53)
- Content Caching

#aws #region #az #edgelocationVer 0.3.6

Last change: 2025-12-02

[Avg. reading time: 3 minutes]

CIDR

CIDR = Classless Inter-Domain Routing

It defines how many IP addresses are in a network (or subnet) using a “slash” notation.

Example: 192.168.10.0/24
  • Network address: 192.168.10.0
  • Prefix Length: /24 means this network will have 256 total IPs

Number of IPs = 2^(32 - prefix)

But AWS and most networks reserve 5 IPs in each subnet:

  • 1 for network address
  • 1 for broadcast address
  • 3 reserved by AWS (for internal routing, DNS, etc.)

/24 subnet gives 251 usable IPs

| CIDR | Subnet Mask | Total IPs | Usable in AWS | Typical Use |
|---|---|---|---|---|
| /16 | 255.255.0.0 | 65,536 | 65,531 | Entire VPC range |
| /20 | 255.255.240.0 | 4,096 | 4,091 | Large subnet |
| /24 | 255.255.255.0 | 256 | 251 | Typical small subnet |
| /28 | 255.255.255.240 | 16 | 11 | Small test subnet |
| /32 | 255.255.255.255 | 1 | 0 | Single host route |
192.168.10.0 = 11000000.10101000.00001010.00000000

The last 8 bits count through all the host addresses:

00000000
00000001
00000010
00000011
.....
.....
11111111
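
Python's standard-library ipaddress module can verify this arithmetic. A minimal sketch:

import ipaddress

net = ipaddress.ip_network("192.168.10.0/24")
print(net.num_addresses)        # 256 total IPs: 2 ** (32 - 24)
print(net.num_addresses - 5)    # 251 usable in AWS (5 reserved per subnet)
print(list(net.hosts())[:3])    # first few host addresses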

#cidr #ipv4 #subnetVer 0.3.6

Last change: 2025-12-02

[Avg. reading time: 6 minutes]

EC2

(Elastic Compute Cloud)

Compute: Closely related to CPU/RAM

Elastic Compute Cloud (EC2): AWS EC2 provides resizable compute capacity in the cloud, allowing you to run virtual servers as per your needs.

Instance Types: EC2 offers various instance types optimized for different use cases, such as general purpose, compute-optimized, memory-optimized, and GPU instances.

Pricing Models

On-Demand: Pay for computing capacity by the hour or second.

Reserved: Commit to a one- or three-year term and get a discount.

Spot: Bid for unused EC2 capacity at a reduced cost.

Savings Plans: Commit to consistent compute usage for lower prices.

AMI (Amazon Machine Image): Pre-configured templates for your EC2 instances, including the operating system, application server, and applications.

Security

Security Groups: Act as a virtual firewall for your instances to control inbound and outbound traffic.

Key Pairs: These are used to access your EC2 instances via SSH or RDP securely.

Elastic IPs: These are static IP addresses that can be associated with EC2 instances. They are useful for hosting services that require a consistent IP.

Auto Scaling: Automatically adjusts the number of EC2 instances in response to changing demand, ensuring you only pay for what you need.

Elastic Load Balancing (ELB): Distributes incoming traffic across multiple EC2 instances, improving fault tolerance and availability.

EBS (Elastic Block Store): Provides persistent block storage volumes for EC2 instances, allowing data to be stored even after an instance is terminated.

Regions and Availability Zones: EC2 instances can be deployed in various geographic regions, each with multiple availability zones for high availability and fault tolerance.

Storage

Persistent Storage

  • Elastic Block Storage (EBS) Volumes / Logically attached via AWS network.
  • Automatically replicated.
  • Encryption is available.

Ephemeral Storage - Local storage

  • Physically attached to the underlying host.
  • When the instance is stopped or terminated, all the data is lost.
  • Rebooting will keep the data intact.

DEMO - Deploy EC2

#aws #ec2 #vm #serverVer 0.3.6

Last change: 2025-12-02

[Avg. reading time: 19 minutes]

S3

(Simple Storage Service)

It’s an IaaS service. S3 uses Object storage instead of File storage (like your machine or Google Drive)

Warehouse vs Book Shelf

| Scenario | File Storage | Object Storage (S3, Blob) |
|---|---|---|
| Read one small file | Faster (local I/O) | Slight overhead (API call) |
| Read 10,000 files concurrently | Struggles with locks & hierarchy | Scales linearly via APIs |
| Edit file in place | Easy | Not possible (must re-upload) |
| Integrate with Spark / Databricks | Not scalable | Native integration (spark.read.parquet("s3://...")) |
| Network access | Typically mounted | Always network-based (HTTP) |
  • Highly Available
  • Durable
  • Cost Effective
  • Widely Accessible
  • Uptime of 99.99%
  1. Objects and Buckets: The fundamental elements of Amazon S3 are objects and buckets. Objects are the individual data pieces stored in Amazon S3, while buckets are containers for these objects. An object consists of a file and, optionally, any metadata that describes that file.

    • It’s also a regional service, meaning that when you create a bucket, you specify a region, and all objects are stored there.

    • Globally Unique: The name of an Amazon S3 bucket must be unique across all of Amazon S3, that is, across all AWS customers. It’s like a domain name.

    • Globally Accessible: Even though you specify a particular region when you create a bucket, once the bucket is created, you can access it from anywhere in the world using the appropriate URL.

  2. Scalability: Amazon S3 can scale in terms of storage, request rate, and users to support unlimited web-scale applications.

  3. Security: Amazon S3 includes several robust security features, such as encryption for data at rest and in transit, access controls like Identity and Access Management (IAM) policies, bucket policies, and Access Control Lists (ACLs), and features for monitoring and logging activity, like AWS CloudTrail.

  4. Data transfer: Amazon S3 supports transfer acceleration, which speeds up uploads and downloads of large objects.

  5. Event Notification: S3 can notify you of specific events in your bucket. For instance, you could set up a notification to alert you when an object is deleted from your bucket.

  6. Management Features: S3 has a suite of features to help manage your data, including lifecycle management, which allows you to define rules for moving or expiring objects, versioning to keep multiple versions of an object in the same bucket, and analytics for understanding and optimizing storage costs.

  7. Consistency: Amazon S3 originally provided read-after-write consistency for PUTS of new objects and eventual consistency for overwrite PUTS and DELETES. (Since December 2020, S3 delivers strong read-after-write consistency for all PUT, DELETE, and LIST operations at no extra cost.) The original model is still worth understanding:

    • Read-after-write Consistency for PUTS of New Objects: When a new object is uploaded (PUT) into an Amazon S3 bucket, it's immediately accessible for read (GET) operations. This is known as read-after-write consistency. You can immediately retrieve a new object as soon as you create it. This applies across all regions in AWS, and it's crucial when immediate, accurate data retrieval is required.

    • Eventual Consistency for Overwrite PUTS and DELETES: Overwrite PUTS and DELETES refer to operations where an existing object is updated (an overwrite PUT) or removed (a DELETE). Under the pre-2020 model, Amazon S3 provided only eventual consistency for these operations: if you updated or deleted an object and immediately attempted to read it, you might briefly get the old version or still find it there (in the case of a DELETE). Today, such reads return the latest state immediately.

Src: Mailbox

  • S3 is like a building full of mailboxes (buckets). Each bucket has a unique name globally, and you can only access the ones you have keys for (permissions).

  • The overall S3 service is like a large building that contains multiple lockers.

  • Each bucket is a unique container that stores your objects (files, images, datasets).

  • Only authorized users (via IAM roles or bucket policies) can open that specific locker.

Src: USPS

  • This is the destination — it tells S3 where to deliver or find the object (s3://my-bucket/path/to/file.csv).

  • The envelope is the actual content.

  • Labels on envelope is the object metadata (content-type, size, date and so on)

Notes

Data is stored as an “Object.”

Object storage, also known as object-based storage, manages data as objects. Each object includes the data, associated metadata, and a globally unique identifier.

Unlike file storage, there are no folders or directories in object storage. Instead, objects are organized into a flat address space, called a bucket in Amazon S3’s terminology.

The unique identifier allows an object to be retrieved without needing to know the physical location of the data. Metadata can be customized, making object storage incredibly flexible.

Every object gets a UID (universal ID) and associated META data.

No Folders / SubFolders

For example, if you have an object with the key images/summer/beach.png in your bucket, Amazon S3 has no internal concept of the images or summer as separate entities—it simply sees the entire string images/summer/beach.png as the key for that object.

To store objects in S3, you must first define and create a bucket.

You can think of a bucket as a container for your data.

This bucket name must be unique, not just within the region you specify, but globally against all other S3 buckets, of which there are many millions.

Any object uploaded to your buckets is given a unique object key to identify it.

  • S3 bucket ownership is not transferable.
  • S3 bucket names must start with a letter or number; hyphens are allowed in between.
  • An AWS account can have 100 buckets by default (this quota can be increased on request).
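
A minimal sketch of working with object keys using the boto3 SDK (the bucket name is hypothetical, and valid AWS credentials are assumed):

import boto3  # pip install boto3

s3 = boto3.client("s3")
s3.put_object(
    Bucket="my-example-bucket",     # hypothetical; names must be globally unique
    Key="images/summer/beach.png",  # the whole string is the object key
    Body=b"file contents",
)
# "Folders" are just key prefixes; list the objects under one
resp = s3.list_objects_v2(Bucket="my-example-bucket", Prefix="images/summer/")
for obj in resp.get("Contents", []):
    print(obj["Key"])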

More details

S3 Object Keys

#aws #s3 #storage #objectstorageVer 0.3.6

Last change: 2025-12-02

[Avg. reading time: 6 minutes]


IAM


ARN: Amazon Resource Name

Users - Individual Person / Application

Groups - Collection of IAM Users

Policies - Policy sets permission/control access to AWS resources. Policies are stored in AWS as JSON documents.

A Policy can be attached to multiple entities (users, groups, and roles) in your AWS account.

Multiple Policies can be created and attached to the user.

Roles - A set of permissions that define what actions are allowed and denied in AWS. Unlike a user, a role has no long-term credentials; it can be assumed by any trusted entity (a person, application, or AWS service).

// Examples of ARNs

arn:aws:s3:::my_corporate_bucket/*

arn:aws:s3:::my_corporate_bucket/Development/*

arn:aws:iam::123456789012:user/chandr34

arn:aws:iam::123456789012:group/bigdataclass

arn:aws:iam::123456789012:group/*

Types of Policies

Identity-based policies: Identity-based policies are attached to an IAM user, group, or role (identities). These policies control what actions an identity can perform, on which resources, and under what conditions.

Resource-based policies: Resource-based policies are attached to a resource such as an Amazon S3 bucket. These policies control what actions a specified principal can perform on that resource and under what conditions.

Permission Boundary: You can use an AWS-managed policy or a customer-managed policy to set the boundary for an IAM entity (user or role). A permissions boundary is an advanced feature for using a managed policy to set the maximum permissions that an identity-based policy can grant to an IAM entity.

Inline Policies: Policies that are embedded in an IAM identity. Inline policies maintain a strict one-to-one relationship between a policy and an identity. They are deleted when you delete the identity.

#aws #iam #user #permissionsVer 0.3.6

Last change: 2025-12-02

[Avg. reading time: 13 minutes]

AWS CloudShell

AWS CloudShell is a browser-based shell environment available directly through the AWS Management Console. It provides a command-line interface (CLI) to manage and interact with AWS resources securely without needing to install any software or set up credentials on your local machine.

Use Cases

Quick Access to AWS CLI

Allows you to run AWS CLI commands directly without configuring your local machine. It’s perfect for quick tasks like managing AWS resources (e.g., EC2 instances, S3 buckets, or Lambda functions).

Development and Automation

You can write and execute scripts using common programming languages like Python and Shell. It’s great for testing and automating tasks directly within your AWS environment.

Secure and Pre-Configured Environment

AWS CloudShell comes pre-configured with AWS CLI, Python, Node.js, and other essential tools. It uses your IAM permissions, so you don’t need to handle keys or credentials directly, making it secure and convenient.

Access to Filesystem and Persistent Storage

You get a persistent 1 GB home directory per region to store scripts, logs, or other files between sessions, which can be used to manage files related to your AWS resources.

Cross-Region Management

You can access and manage resources across different AWS regions directly from CloudShell, making it useful for multi-region setups.


Basic Commands

    aws s3 ls
    aws ec2 describe-instances

    sudo yum install -y jq   # CloudShell runs Amazon Linux, so use yum/dnf rather than apt

list_buckets.sh

#!/bin/bash
echo "Listing all S3 buckets:"
aws s3 ls

Run it:

bash list_buckets.sh
# get account details

aws sts get-caller-identity

# list available regions

aws ec2 describe-regions --query "Regions[].RegionName" --output table

# create a bucket

aws s3 mb s3://chandr34-newbucket

# upload a file to a bucket 

echo "Hello, CloudShell!" > hello.txt
aws s3 cp hello.txt s3://chandr34-newbucket

# List files in bucket 

aws s3 ls s3://chandr34-newbucket/

# Delete bucket  with files 

aws s3 rb s3://chandr34-newbucket --force

# List AMIs

aws ec2 describe-images --owners amazon --query 'Images[*].{ID:ImageId,Name:Name}' --output table

# quickly launch an EC2

aws ec2 create-key-pair --key-name gcnewkeypair --query 'KeyMaterial' --output text > myNewKeyPair.pem

# Change Permission

chmod 0400 myNewKeyPair.pem

# Launch new EC2

aws ec2 run-instances --image-id ami-0866a3c8686eaeeba --count 1 --instance-type t2.micro --key-name gcnewkeypair --security-groups default

# Get Public IP

aws ec2 describe-instances --query "Reservations[].Instances[].PublicIpAddress" --output text

# Login to server

ssh -i myNewKeyPair.pem ubuntu@<public-ip>

# terminate the instance

aws ec2 terminate-instances --instance-ids <instance-id>

Cloud Formation

my-webserver.yml

AWSTemplateFormatVersion: '2010-09-09'
Description: CloudFormation template to launch an Ubuntu EC2 instance with Nginx installed.

Resources:
  MyEC2Instance:
    Type: AWS::EC2::Instance
    Properties:
      InstanceType: t2.micro
      ImageId: ami-0866a3c8686eaeeba
      KeyName: gcnewkeypair
      SecurityGroupIds:
        - !Ref InstanceSecurityGroup
      UserData:
        Fn::Base64: 
          !Sub |
            #!/bin/bash
            apt update -y
            apt install -y nginx
            systemctl start nginx
            systemctl enable nginx
      Tags:
        - Key: Name
          Value: MyNginxServer

  InstanceSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Enable SSH and HTTP access
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 22
          ToPort: 22
          CidrIp: 0.0.0.0/0  # SSH access, restrict this to your IP range for security
        - IpProtocol: tcp
          FromPort: 80
          ToPort: 80
          CidrIp: 0.0.0.0/0  # HTTP access for Nginx
        - IpProtocol: tcp
          FromPort: 443
          ToPort: 443
          CidrIp: 0.0.0.0/0  # HTTPS access

Outputs:
  InstanceId:
    Description: The Instance ID of the EC2 instance
    Value: !Ref MyEC2Instance
  PublicIP:
    Description: The Public IP address of the EC2 instance
    Value: !GetAtt MyEC2Instance.PublicIp
  WebURL:
    Description: URL to access the Nginx web server
    Value: !Sub "http://${MyEC2Instance.PublicIp}"

Launch the Stack via CloudShell

# Create the stack
aws cloudformation create-stack --stack-name gc-stack --template-body file://my-webserver.yml --capabilities CAPABILITY_NAMED_IAM


# Check the status

aws cloudformation describe-stacks --stack-name gc-stack --query "Stacks[0].StackStatus"


aws cloudformation describe-stacks --stack-name gc-stack --query "Stacks[0].Outputs"

# delete the stack

aws cloudformation delete-stack --stack-name gc-stack


aws cloudformation describe-stacks --stack-name gc-stack --query "Stacks[0].StackStatus"

# confirm the deletion status

aws cloudformation list-stacks --query "StackSummaries[?StackName=='gc-stack'].StackStatus"

#cli #aws #cloudshellVer 0.3.6

Last change: 2025-12-02

[Avg. reading time: 16 minutes]

Terraform

Features of Terraform

Infrastructure as Code: Terraform allows you to write, plan, and create infrastructure using configuration files. This makes infrastructure management automated, consistent, and easy to collaborate on.

Multi-Cloud Support: Terraform supports many cloud providers and on-premises environments, allowing you to manage resources across different platforms seamlessly.

State Management: Terraform keeps track of the current state of your infrastructure in a state file. This enables you to manage changes, plan updates, and maintain consistency in your infrastructure.

Resource Graph: Terraform builds a resource dependency graph that helps in efficiently creating or modifying resources in parallel, speeding up the provisioning process and ensuring dependencies are handled correctly.

Immutable Infrastructure: Terraform promotes the practice of immutable infrastructure, meaning that resources are replaced rather than updated directly. This ensures consistency and reduces configuration drift.

Execution Plan: Terraform provides an execution plan (terraform plan) that previews changes before they are applied, allowing you to understand and validate the impact of changes before implementing them.

Modules: Terraform supports reusability through modules, which are self-contained, reusable pieces of configuration that help you maintain best practices and reduce redundancy in your infrastructure code.

Community and Ecosystem: Terraform has a large open-source community and many providers and modules available through the Terraform Registry, which makes it easier to get started and integrate with various services.

Use Cases

  • Multi-Cloud Provisioning
  • Infrastructure Scaling
  • Disaster Recovery
  • Environment Management
  • Compliance & Standardization
  • CI/CD Pipelines
  • Speed and Simplicity
  • Team Collaboration
  • Error Reduction
  • Enhanced Security

Install Terraform CLI

Terraform Download: https://developer.hashicorp.com/terraform/downloads

Terraform Structure

Provider Block: Specifies the cloud provider or service (e.g., AWS, Azure, Google Cloud) that Terraform will interact with.

provider "aws" {
  region = "us-east-1"
}

Resource Block: Defines the resources to be created or managed. A resource can be a server, network, or other infrastructure component.

resource "aws_instance" "example" {
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "t2.micro"
}

Data Block: Fetches information about existing resources, often for referencing in resource blocks.

data "aws_ami" "latest" {
  most_recent = true
  owners      = ["amazon"]
}

Variable Block: Declares input variables to make the script flexible and reusable.

variable "instance_type" {
  description = "Type of instance to use"
  type        = string
  default     = "t2.micro"
}

Output Block: Specifies values to be output after the infrastructure is applied, like resource IDs or connection strings.

output "instance_ip" {
  value = aws_instance.example.public_ip
}

Module Block: Used to encapsulate and reuse sets of Terraform resources.

module "vpc" {
  source = "./modules/vpc"
  cidr_block = "10.0.0.0/16"
}

Locals Block: Defines local values that can be reused in the configuration.

locals {
  environment = "production"
  instance_count = 3
}

SET these environment variables.

export AWS_ACCESS_KEY_ID="your-access-key-id"
export AWS_SECRET_ACCESS_KEY="your-secret-access-key"

Simple S3 Bucket

simple_s3_bucket.tf


terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.70.0"
    }
  }

  required_version = ">= 1.2.0"
}

provider "aws" {
  region = "us-east-1"
  profile = "chandr34"
}

resource "aws_s3_bucket" "demo" {
  bucket = "chandr34-my-new-tf-bucket"

  tags = {
    Createdusing = "tf"
    Environment  = "classdemo"
  }
}

output "bucket_name" {
  value = aws_s3_bucket.demo.bucket
}
Create a new folder and copy the .tf file into it, then run:

terraform init
terraform validate
terraform plan
terraform apply
terraform destroy

Variable S3 Bucket

variable_bucket.tf


terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.70.0"
    }
  }

  required_version = ">= 1.2.0"
}

provider "aws" {
  region  = "us-east-1"
  profile = "chandr34"
}

variable "bucket_name" {
  description = "The name of the S3 bucket to create"
  type        = string
}

resource "aws_s3_bucket" "demo" {
  bucket = var.bucket_name

  tags = {
    Createdusing = "tf"
    Environment  = "classdemo"
  }
}

output "bucket_name" {
  value = aws_s3_bucket.demo.bucket
}
Create a new folder and copy the .tf file into it, then run:

terraform init
terraform validate
terraform plan
terraform apply -var="bucket_name=chandr34-variable-bucket"
terraform destroy -var="bucket_name=chandr34-variable-bucket"

Variable file

Any filename with the extension .tfvars works, e.g. terraform.tfvars:

bucket_name = "chandr34-variable-bucket1"

Then apply without being prompted for the variable:

terraform apply -auto-approve

AWS Resource Types


Please make sure AWS Profile is created.

Create Public and Private Keys

Linux / Mac Users

// create private/public key

ssh-keygen -b 2048 -t rsa -f ec2_tf_demo

Windows Users

Open PuTTYgen and create a key

Terraform

  • mkdir simple_ec2
  • cd simple_ec2
  • Create main.tf
// main.tf
#https://registry.terraform.io/providers/hashicorp/aws/latest

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.70.0"
    }
  }

  required_version = ">= 1.2.0"
}

provider "aws" {
  region  = "us-east-1"
  profile = "chandr34"
}

resource "aws_key_pair" "generated_key" {
  key_name   = "generated-key-pair"
  public_key = tls_private_key.generated_key.public_key_openssh
}

resource "tls_private_key" "generated_key" {
  algorithm = "RSA"
  rsa_bits  = 2048
}

resource "local_file" "private_key_file" {
  content  = tls_private_key.generated_key.private_key_pem
  filename = "${path.module}/generated-key.pem"
}

resource "aws_instance" "ubuntu_ec2" {
  ami           = "ami-00874d747dde814fa"
  instance_type = "t2.micro"
  key_name      = aws_key_pair.generated_key.key_name
  vpc_security_group_ids = [aws_security_group.ec2_security_group.id]

  tags = {
    Name        = "UbuntuInstance"
    Environment = "classdemo"
  }
}

resource "aws_security_group" "ec2_security_group" {
  name        = "ec2_security_group"
  description = "Allow SSH and HTTP access"

  ingress {
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]  # Allow SSH from anywhere (use cautiously)
  }

  ingress {
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]  # Allow HTTP from anywhere
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]  # Allow all outbound traffic
  }

  tags = {
    Name = "EC2SecurityGroup"
  }
}

output "ec2_instance_public_ip" {
  value = aws_instance.ubuntu_ec2.public_ip
}

output "private_key_pem" {
  value     = tls_private_key.generated_key.private_key_pem
  sensitive = true
}

Go to the terminal and run:

  • terraform init
  • terraform fmt
  • terraform validate
  • terraform apply
  • terraform show

Finally

  • terraform destroy

#terraform #IaaCVer 0.3.6

Last change: 2025-12-02

[Avg. reading time: 1 minute]

MLflow Model Lifecycle

[Avg. reading time: 14 minutes]

Decorator

Decorators in Python are a powerful way to modify or extend the behavior of functions or methods without changing their code. Decorators are often used for tasks like logging, authentication, and adding additional functionality to functions. They are denoted by the “@” symbol and are applied above the function they decorate.

def say_hello():
    print("World")

say_hello()

How do we change the output without changing the say_hello() function?

wrapper() is not a reserved word; it can be any name.

Use Decorators

# Define a decorator function
def hello_decorator(func):
    def wrapper():
        print("Hello,")
        func()  # Call the original function
    return wrapper

# Use the decorator to modify the behavior of say_hello
@hello_decorator
def say_hello():
    print("World")

# Call the decorated function
say_hello()

When Python sees @decorator_name, it does:

say_hello = hello_decorator(say_hello)

If you want to replace the newline character at the end of the print statement, use end=''

# Define a decorator function
def hello_decorator(func):
    def wrapper():
        print("Hello, ", end='')
        func()  # Call the original function
    return wrapper

# Use the decorator to modify the behavior of say_hello
@hello_decorator
def say_hello():
    print("World")

# Call the decorated function
say_hello()

Multiple functions inside the Decorator

def hello_decorator(func):
    def first_wrapper():
        print("First wrapper, doing something before the second wrapper.")
        #func()
    
    def second_wrapper():
        print("Second wrapper, doing something before the actual function.")
        #func()
    
    def main_wrapper():
        first_wrapper()  # Call the first wrapper
        second_wrapper()  # Then call the second wrapper, which calls the actual function
        func()
    
    return main_wrapper

@hello_decorator
def say_hello():
    print("World")

say_hello()

Multiple Decorators

from functools import wraps
def one(func):
    def one_wrapper():
        print(f"Decorator One: Before function - Called by {func.__name__}")
        func()
        print(f"Decorator One: After function - Called by {func.__name__}")
    return one_wrapper

def two(func):
    def two_wrapper():
        print(f"Decorator Two: Before function - Called by {func.__name__}")
        func()
        print(f"Decorator Two: After function - Called by {func.__name__}")
    return two_wrapper

def three(func):
    def three_wrapper():
        print(f"Decorator Three: Before function - Called by {func.__name__}")
        func()
        print(f"Decorator Three: After function - Called by {func.__name__}")
    return three_wrapper

@one
@two
@three
def say_hello():
    print("Hello, World!")

say_hello()

Decorator Order

say_hello = one(two(three(say_hello)))

[ONE 
    TWO
        THREE
            SAY_HELLO]

Wraps

@wraps is a decorator from Python’s functools module that preserves the original function’s metadata (like its name, docstring, and annotations) when it’s wrapped by another function.

Without using wraps

def some_decorator(func):
    def wrapper():
        """Wrapper docstring"""
        return func()
    return wrapper

@some_decorator
def hello():
    """Original docstring"""
    print("Hi!")

print(hello.__name__)
print(hello.__doc__)

Using Wraps

from functools import wraps

def some_decorator(func):
    @wraps(func)
    def wrapper():
        """Wrapper docstring"""
        return func()
    return wrapper

@some_decorator
def hello():
    """Original docstring"""
    print("Hi!")

print(hello.__name__)
print(hello.__doc__)

Args & Kwargs

  • *args: This is used to represent positional arguments. It collects all the positional arguments passed to the decorated function as a tuple.
  • **kwargs: This is used to represent keyword arguments. It collects all the keyword arguments (arguments passed with names) as a dictionary.
from functools import wraps

def my_decorator(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        print("Positional Arguments (*args):", args)
        print("Keyword Arguments (**kwargs):", kwargs)
        result = func(*args, **kwargs)
        return result
    return wrapper

@my_decorator
def example_function(a, b, c=0, d=0):
    print("Function Body:", a, b, c, d)

# Calling the decorated function with different arguments
example_function(1, 2)
example_function(3, 4, c=5)

Popular Example

import time
from functools import wraps

def timer(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        end = time.time()
        print(f"Execution time of {func.__name__}: {end - start} seconds")
        return result
    return wrapper
    
@timer
def add(x, y):
    """Returns the sum of x and y"""
    return x + y

@timer
def greet(name, message="Hello"):
    """Returns a greeting message with the name"""
    return f"{message}, {name}!"

print(add(2, 3))
print(greet("Rachel"))

The purpose of @wraps is to preserve the metadata of the original function being decorated.

#decorator #wraps #pythonVer 0.3.6

Last change: 2025-12-02

[Avg. reading time: 5 minutes]

HTTP Basics

HTTP (Hypertext Transfer Protocol) is the foundation for data communication on the web.

Common HTTP Methods

| Method | Description | Typical Use |
|---|---|---|
| GET | Retrieve data | Fetch a resource (read-only) |
| POST | Create new data | Add new records or trigger an action |
| PUT | Replace data | Update/replace an existing record |
| DELETE | Remove data | Delete a record |

Popular HTTP Status Codes

200 Series (Success): 200 OK, 201 Created.

300 Series (Redirection): 301 Moved Permanently, 302 Found.

400 Series (Client Error): 400 Bad Request, 401 Unauthorized, 404 Not Found.

500 Series (Server Error): 500 Internal Server Error, 503 Service Unavailable.

REST API

REpresentational State Transfer is a software architectural style developers apply to web APIs.

REST APIs provide simple, uniform interfaces because they can be used to make data, content, algorithms, media, and other digital resources available through web URLs. Essentially, REST APIs are the most common APIs used across the web today.

https://api.zippopotam.us/us/08028

http://api.tvmaze.com/search/shows?q=friends

https://jsonplaceholder.typicode.com/posts

https://jsonplaceholder.typicode.com/posts/1

https://jsonplaceholder.typicode.com/posts/1/comments

https://reqres.in/api/users?page=2

https://reqres.in/api/users/2
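
A minimal sketch of calling one of these endpoints from Python with the requests library:

import requests  # pip install requests

resp = requests.get("https://api.zippopotam.us/us/08028", timeout=10)
print(resp.status_code)  # 200 on success
print(resp.json())       # parsed JSON body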

CURL & VSCode

curl is a CLI application available for all operating systems (Windows build: https://curl.se/windows/).

curl https://api.zippopotam.us/us/08028

curl https://api.zippopotam.us/us/08028 -o zipdata.json

VS Code - Get Thunder Client

#RESTAPI #httpVer 0.3.6

Last change: 2025-12-02

[Avg. reading time: 3 minutes]

Pydantic

Pydantic is a Python library for data validation, type enforcement, and serialization using standard Python type hints.

It ensures the data coming into your app (like API requests, configs, or ML inputs) is valid, typed, and clean — automatically.

Key Features

Automatic validation: Converts and checks input types (e.g., “5” → int(5)).

BaseModel class: Define data schemas by subclassing BaseModel.

Error messages: Tells you exactly which field is invalid and why.

Data parsing: Converts JSON or dicts into Python objects you can use directly.

Integration with FastAPI: FastAPI uses Pydantic models to validate request bodies and auto-generate documentation.

Why It Matters in MLOps

  • Ensures model inputs (e.g., features in an API request) are validated before prediction.

  • Prevents serving errors due to missing or wrong data types.

  • Makes your FastAPI endpoints self-documenting via OpenAPI and /docs.
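
As a minimal sketch (the ENGINESIZE field follows the serving examples elsewhere in this course):

from pydantic import BaseModel, ValidationError

class PredictionRequest(BaseModel):
    ENGINESIZE: float  # "2.0" is coerced to 2.0 automatically

req = PredictionRequest(ENGINESIZE="2.0")
print(req.ENGINESIZE)  # 2.0

try:
    PredictionRequest(ENGINESIZE="not-a-number")
except ValidationError as e:
    print(e)  # says exactly which field failed and why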

Example: Google colab

https://colab.research.google.com/drive/1IkROILidYV8iY9HchMGv2EAqQNK5o8d5?usp=sharing

#pydantic #datavalidationVer 0.3.6

Last change: 2025-12-02

[Avg. reading time: 8 minutes]

Model Flavors

Remember the MLflow features (Experiments - Runs - Models - Versions).

Rerun the model:

git clone https://github.com/gchandra10/uni_multi_model

Popular MLflow Model Flavors

| Flavor | Used For | Typical Libraries / Frameworks |
|---|---|---|
| sklearn | Traditional ML models (regression, classification, clustering) | Scikit-Learn, statsmodels |
| xgboost | Gradient boosting trees | XGBoost |
| lightgbm | High-performance gradient boosting | LightGBM |
| catboost | Categorical-feature-friendly boosting | CatBoost |
| pytorch | Deep learning and neural networks | PyTorch |
| tensorflow / keras | Deep learning models | TensorFlow, Keras |
| onnx | Portable models for cross-framework inference | ONNX Runtime |
| fastai | Transfer learning and DL pipelines | FastAI |
| statsmodels | Statistical / econometric models | statsmodels |
| prophet | Time series forecasting | Facebook Prophet |
| gluon | Deep learning (MXNet backend) | Apache MXNet |
| sparkml | Distributed ML pipelines | Apache Spark MLlib |
| pyfunc | Universal interface that wraps all other flavors | MLflow internal meta-flavor |


PyFunc makes ML models cross-platform — one consistent way to load and predict, regardless of how they were built.

  • Just like apps can be built separately for iOS or Android, models in MLflow can be saved in different native formats (like Scikit-Learn, PyTorch, XGBoost, etc.).

  • A cross-platform app works everywhere, and that’s what PyFunc is for ML models: a universal wrapper that runs any model with the same interface.

  • This lets teams serve and reuse models easily, without worrying about which library originally trained them.


For Example:

| Library | Save API | Predict Method |
|---|---|---|
| Scikit-Learn | joblib.dump() | model.predict() |
| TensorFlow | model.save() | model(x) |
| PyTorch | torch.save() | model.forward(x) |
| XGBoost | model.save_model() | model.predict(xgb.DMatrix(x)) |

You can use pyfunc for all the flavors

import mlflow.pyfunc

# Save any flavor through the universal pyfunc interface
mlflow.pyfunc.save_model()

# Load and predict with the same API, whatever the original framework
model = mlflow.pyfunc.load_model("models:/<name>/<stage>")
model.predict(pd.DataFrame(...))

Advantages

  • One simple API for inference. Works the same whether the model was trained in Scikit-Learn, XGBoost, PyTorch, or TensorFlow.
  • Reduces code differences between data-science teams using different libraries.
  • PyFunc packages the model + environment (conda/requirements) together.
  • Guarantees that the model runs identically on local machines, servers, or cloud.
  • Ideal for CI/CD pipelines and container builds.
  • Can be loaded from: Run path: runs:/<run_id>/model Registry stage: models:/name/Production
  • Works seamlessly with MLflow Serving, FastAPI, Docker, or SageMaker deploys.
  • Enables easy A/B comparisons between models trained in different frameworks.
  • You can subclass mlflow.pyfunc.PythonModel to add preprocessing or feature engineering, postprocess predictions, or integrate external systems (feature store, logging, metrics); see the sketch below.
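
A minimal sketch of such a subclass (the class name and the model_path artifact key are illustrative):

import joblib
import mlflow.pyfunc
import pandas as pd

class PreprocessedModel(mlflow.pyfunc.PythonModel):
    def load_context(self, context):
        # Artifacts logged with the model are exposed here as local paths
        self.model = joblib.load(context.artifacts["model_path"])

    def predict(self, context, model_input: pd.DataFrame):
        # Example preprocessing before delegating to the wrapped model
        cleaned = model_input.fillna(0)
        return self.model.predict(cleaned)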

Limitations

  • Framework-specific features are lost.
  • Input is pandas-centric.
  • In some cases, can be slower than native runtimes (PyTorch/TensorFlow).

https://github.com/gchandra10/uni_multi_model/blob/main/03_load_test_model.py

#pyfunc #mlflow #tensorflow #pytorchVer 0.3.6

Last change: 2025-12-02

[Avg. reading time: 6 minutes]

Model Serving

mlflow server

Instantly turn a registered model into a REST API endpoint.

Make sure the MLflow tracking server is still running, as in the earlier example:

mlflow server --host 127.0.0.1 --port 8080 \
--backend-store-uri sqlite:///mlflow.db

Windows

SET MLFLOW_TRACKING_URI=http://127.0.0.1:8080

MAC/Linux

export MLFLOW_TRACKING_URI=http://127.0.0.1:8080

Serve the Model

mlflow models serve \
  -m "models:/Linear_Regression_Model/1" \
  --host 127.0.0.1 \
  --port 5001 \
  --env-manager local

Use the Model

curl -X POST "http://127.0.0.1:5001/invocations" \
  -H "Content-Type: application/json" \
  --data '{"inputs": [{"ENGINESIZE": 2.0}, {"ENGINESIZE": 3.0}, {"ENGINESIZE": 4.0}]}'

Pros

  • Zero-code serving: Just one CLI command — no need to build an API yourself.
  • Auto-handles environment: Loads dependencies automatically.
  • Ideal for testing and demos.
  • Supports model URIs.

Cons

  • Single-threaded process.
  • Limited customization.
  • Minimal built-in monitoring.
  • Not suited for blue-green / CI/CD promotion pipelines.

FastAPI

  • Modern, high-performance Python web framework for building REST APIs.

  • FastAPI turns Python functions into fully documented, high-performance REST APIs with minimal code.

  • Built on ASGI (Asynchronous Server Gateway Interface).

  • Designed for speed, type safety, and developer productivity.

Key Features

  • Fast execution: Comparable to Node.js & Go — async by design.
  • Automatic validation: Uses Pydantic models to validate and parse JSON inputs.
  • Auto-generated API docs: Swagger UI available at /docs, ReDoc at /redoc.
  • Type hints = API schema: Python typing directly defines request/response schema.
  • Easy to test & extend: Works great with Docker, CI/CD, and modern MLOps stacks.
  • Supports both sync & async: You can mix blocking ML inference and async endpoints.

Set the tracking URI before starting the app:

export MLFLOW_TRACKING_URI=http://127.0.0.1:8080

Open uni_multi_model in VSCode

cd uni_multi_model
uvicorn fast_app:app --host 127.0.0.1 --port 5002
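
For reference, a minimal sketch of what a fast_app.py like this could contain (the actual repo file may differ; the model URI matches the serving example above):

from fastapi import FastAPI
from pydantic import BaseModel
import mlflow.pyfunc
import pandas as pd

app = FastAPI()

# Requires MLFLOW_TRACKING_URI to be set, as above
model = mlflow.pyfunc.load_model("models:/Linear_Regression_Model/1")

class CarFeatures(BaseModel):
    ENGINESIZE: float

@app.post("/predict")
def predict(features: CarFeatures):
    df = pd.DataFrame([features.model_dump()])  # Pydantic v2; use .dict() on v1
    return {"prediction": float(model.predict(df)[0])}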

Uvicorn

  • A Python application server used to run Python app code.
  • A lightweight, lightning-fast ASGI server (ASGI = Asynchronous Server Gateway Interface).
  • Built on uvloop (a fast event loop) and httptools (an HTTP parser), with native WebSocket support.
  • Works great with FastAPI and Pydantic.

#modelserving #mlflow #fastapiVer 0.3.6

Last change: 2025-12-02

[Avg. reading time: 7 minutes]

Model Serving Types

Model Serving is the process of deploying trained machine-learning models so they can generate predictions on new data.

Once a model is trained and validated, it must be made available to applications, pipelines, or users that need its outputs — whether that’s a batch job scoring millions of records, a web app recommending products, or an IoT stream detecting anomalies.

Model serving sits in the production stage of the MLOps lifecycle, bridging the gap between model development and business consumption.

It ensures models are:

  • Accessible (via APIs, pipelines, or streams)
  • Scalable (able to handle varying loads)
  • Versioned and governed (using registries and lineage)
  • Monitored (for drift, latency, and performance)

In modern stacks (e.g., Databricks, AWS SageMaker, GCP Vertex AI), serving integrates tightly with model registries, feature stores, and CI/CD pipelines to enable reliable, repeatable ML deployment.

Batch Model Serving

Batch serving runs inference on large datasets at scheduled intervals (hourly, nightly, weekly).

  • Input data is read from storage or database.
  • Predictions are generated for all records.
  • Outputs are written back to storage or a downstream table.

Example: Predict a new car's CO₂ emissions.

Pros: Efficient, reproducible, simple to schedule. Cons: Not real-time; predictions may get stale.
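
A minimal batch-scoring sketch, assuming the registered model from earlier and a hypothetical input file new_cars.csv:

import mlflow.pyfunc
import pandas as pd

# 1. Read the scheduled input batch from storage
batch = pd.read_csv("new_cars.csv")

# 2. Generate predictions for all records
model = mlflow.pyfunc.load_model("models:/Linear_Regression_Model/1")
batch["predicted_co2"] = model.predict(batch[["ENGINESIZE"]])

# 3. Write the scored output back for downstream consumers
batch.to_csv("scored_cars.csv", index=False)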

Demo:


Real-Time (Online) Model Serving

Real-time serving exposes the model as a low-latency API endpoint. Each request is scored on demand and returned within milliseconds to seconds.

How it works:

An application (e.g., web or mobile) calls the API.

The model receives input features and returns a prediction immediately.

As discussed in the previous chapter.

  • MLflow Serving
  • FastAPI Serving

Example:

Credit-card fraud detection, dynamic pricing, personalized recommendations.

Pros: Instant feedback, personalized predictions

Cons: Needs always-on infra, online feature store, auto-scaling

Demo


Streaming (Continuous) Model Serving

Streaming serving applies the model continuously to incoming event streams (Kafka, Kinesis, Delta Live Tables).

Instead of single requests, it handles ongoing flows of data.

  • Data arrives in small micro-batches or as events.
  • The model scores each record as soon as it appears.
  • Results are pushed to dashboards, alerts, or storage.

Example:

IoT anomaly detection, clickstream optimization, live sensor analytics.

Pros:

Near real-time, high-throughput, scalable

Cons:

Complex orchestration, harder to monitor and debug.

#batch #streaming #realtimeVer 0.3.6

Last change: 2025-12-02

[Avg. reading time: 7 minutes]

Auto ML

AutoML (Automated Machine Learning) is the process of automating the end-to-end machine-learning workflow, from data preprocessing and model selection to hyperparameter tuning, evaluation, and deployment.

The goal: make machine learning faster, easier, and more accessible, without sacrificing performance.

Instead of a data scientist manually trying dozens of models and tuning parameters, AutoML systems do this automatically, guided by optimization techniques and performance metrics.

  • Speeds up experimentation
  • Democratizes machine learning
  • Improves model quality
  • Enables scalable model governance

| Area | Example Use Case | What AutoML Helps With |
|---|---|---|
| Retail | Predict customer churn or recommend products | Automatically build and tune classifiers/regressors |
| Finance | Credit-risk modeling, fraud detection | Feature selection, threshold optimization |
| Healthcare | Predict patient readmission | Imbalanced-data handling, model explainability |
| Energy | Predict CO₂ emissions or fuel consumption | Regression with mixed numeric + categorical inputs |
| Marketing | Forecast campaign ROI | Fast model iteration and ranking |

What AutoML Actually Does

Typical AutoML frameworks automate these stages:

Data Preprocessing

  • Missing-value imputation
  • Encoding categorical variables
  • Normalization or standardization

Feature Engineering

  • Automatic transformations (log, polynomial, interaction terms)

  • Feature selection and importance ranking

Model Selection

  • Chooses among algorithms (e.g., Linear, Random Forest, XGBoost, Neural Net)

Model Ensemble / Stacking

  • Combines several good models into one stronger ensemble

Model Evaluation and Ranking

  • Uses metrics (RMSE, MAE, AUC, F1, etc.) to pick the best model

Model Export

  • Produces portable artifacts for production (e.g., MOJO, ONNX, pickle)

H2O AutoML

H2O.ai is an open-source AI and machine-learning platform built for speed and scalability.

It’s written in Java and C++ (high performance) with Python and R APIs for easy use.

The flagship open-source library is H2O-3, and H2O AutoML is a major component within it.
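
A minimal H2O AutoML sketch (the file name and target column are illustrative):

import h2o
from h2o.automl import H2OAutoML

h2o.init()  # starts (or connects to) a local H2O cluster

df = h2o.import_file("FuelConsumption.csv")          # hypothetical dataset
train, test = df.split_frame(ratios=[0.8], seed=42)

aml = H2OAutoML(max_models=10, seed=42)
aml.train(y="CO2EMISSIONS", training_frame=train)    # everything else is automated

print(aml.leaderboard.head())     # candidate models, ranked by metric
preds = aml.leader.predict(test)  # score with the best model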

Other similar products

  • AutoGluon
  • Flaml
  • PyCaret
  • Auto-sklearn
  • AutoKeras

Why H2O AutoML Is Popular in Industry

| Feature | Benefit |
|---|---|
| Scalable JVM backend | Runs on a laptop or a multi-node cluster |
| Multiple APIs | Python, R, Java, Scala |
| Easy deployment | Exports MOJO/POJO models for production scoring |
| Interpretable | Provides variable importance and SHAP explanations |
| Open Source | No license barrier; integrates with enterprise tools |

Google Colab

https://colab.research.google.com/drive/1DZjBbcWXeRk-xlmffG7A4zSez7eX1Rba?usp=sharing

#automlVer 0.3.6

Last change: 2025-12-02

[Avg. reading time: 4 minutes]

CPU vs GPU

CPU: few powerful cores optimized for low-latency, branching, and general-purpose tasks. Great for data orchestration, preprocessing, and control flow.

Use cases in ML:

feature engineering, I/O, tokenization, small classical ML, control logic.

GPU: thousands of simpler cores optimized for massive parallel math, especially dense linear algebra. Great for matrix multiplies, convolutions, attention.

Orders-of-magnitude speedups for medium to large models and batches.

Use cases in ML:

deep learning training, embedding inference, vector search re-ranking, image and generative workloads.

CUDA

GPU is the hardware. CUDA (Compute Unified Device Architecture) is the framework, language, and toolkit from NVIDIA that unlocks that hardware.

When working with a GPU, it is essential to check whether CUDA is available.

Many GPUs do not support CUDA: Apple Silicon M-series chips and game-console GPUs, for example, are GPUs without CUDA.
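
A quick way to check, using PyTorch (a minimal sketch):

import torch

# True only when a CUDA-capable GPU and matching driver are present
print(torch.cuda.is_available())

if torch.cuda.is_available():
    # e.g., "Tesla T4" on a Colab GPU runtime
    print(torch.cuda.get_device_name(0))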

Remember to change the Colab Runtime to a GPU.

https://colab.research.google.com/drive/1byrDchiV4OWdLKOPl8H4UAcdbwFoR7aA?usp=sharing

#cpu #gpuVer 0.3.6

Last change: 2025-12-02

[Avg. reading time: 0 minutes]

Tools

Ver 0.3.6

[Avg. reading time: 6 minutes]

Containers

World before containers

Physical Machines

  • 1 Physical Server
  • 1 Host Machine (say some Linux)
  • 3 Applications installed

Limitation:

  • Needs a physical server.
  • Version dependency (host OS and related apps).
  • Patches "hopefully" not affecting applications.
  • All apps must work with the same host OS.

  • 3 physical servers
  • 3 host machines (different OSes)
  • 3 applications installed

Limitation:

  • Needs multiple physical servers.
  • Version dependency (host OS and related apps).
  • Patches "hopefully" not affecting applications.
  • Maintenance of 3 machines.
  • Networking all three so they work together.

Virtual Machines

  • Virtual Machines emulate a real computer by virtualizing its hardware, running on top of a real computer.

  • To emulate a real computer, virtual machines use a Hypervisor to create a virtual computer.

  • On top of the Hypervisor sits a Guest OS, a virtualized operating system in which we can run isolated applications.

  • Applications that run in Virtual Machines have access to binaries and libraries on top of that operating system.

( + ) Full isolation, full virtualization
( - ) Too many layers, heavy-duty servers

Here comes Containers

Containers are lightweight, portable environments that package an application with everything it needs to run—like code, runtime, libraries, and system tools—ensuring consistency across different environments. They run on the same operating system kernel and isolate applications from each other, which improves security and makes deployments easier.

  • Containers are isolated processes that share resources with their host and, unlike VMs, don’t virtualize the hardware and don’t need a Guest OS.

  • Containers share resources with other Containers in the same host.

  • This gives more performance than VMs (no separate guest OS).

  • Container Engine in place of Hypervisor.

Pros

  • Isolated Process
  • Mounted Files
  • Lightweight Process

Cons

  • Same Host OS
  • Security

#containers #vmVer 0.3.6

Last change: 2025-12-02

[Avg. reading time: 6 minutes]

VMs or Containers

VMs are great for running multiple, isolated OS environments on a single hardware platform. They offer strong security isolation and are useful when applications need different OS versions or configurations.

Containers are lightweight and share the host OS kernel, making them faster to start and less resource-intensive. They’re perfect for microservices, CI/CD pipelines, and scalable applications.

Smart engineers focus on the right tool for the job rather than getting caught up in “better or worse” debates.

Use them in combination to make life better.

Docker: The most widely used container platform, known for its simplicity, portability, and extensive ecosystem.

Podman: A daemonless container engine that’s compatible with Docker but emphasizes security, running containers as non-root users.


Images

The image is the prototype or skeleton to create a container, like a recipe to make your favorite food.

Container

A container is the environment, up and running and ready for your application.

If Image = Recipe, then Container = Cooked food.

Where to get the Image from?

Docker Hub

For both Podman and Docker, images come from Docker Hub by default.

https://hub.docker.com/


NOTE: INSTALL DOCKER OR PODMAN (Not BOTH)

Podman on Windows

https://podman-desktop.io/docs/installation/windows-install

Once installed, verify the installation:

podman info

Podman on MAC

Install Podman

After installing, you need to create and start your first Podman machine:

podman machine init
podman machine start

You can then verify the installation information using:

podman info

Podman on Linux

Install Podman

You can then verify the installation information using:

podman info

Docker Installation

Here is step by step installation

https://docs.docker.com/desktop/setup/install/windows-install/

#vm #docker #podmanVer 0.3.6

Last change: 2025-12-02

[Avg. reading time: 0 minutes]

What container does

It gives us the ability to build and run applications without worrying about the environment they run in.

#containerVer 0.3.6

Last change: 2025-12-02

[Avg. reading time: 11 minutes]

Container Examples

The examples below use docker. If you installed Podman instead, replace docker with podman.

Syntax

docker pull <imagename>
docker run <imagename>

Examples:

docker pull hello-world
docker run hello-world
docker container ls
docker container ls -a
docker image ls

Optional Setting (For PODMAN)

Add the following line to /etc/containers/registries.conf:

unqualified-search-registries = ["docker.io"]

Deploy MySQL Database using Containers

Create the following folder

Linux / Mac

mkdir -p container/mysql
cd container/mysql

Windows

md container
cd container
md mysql
cd mysql

Note: If you already have MySQL Server installed on your machine, change the host port to 3307 as shown below.

-p 3307:3306 \

Run the container


docker run --name mysql -d \
    -p 3306:3306 \
    -e MYSQL_ROOT_PASSWORD=root-pwd \
    -e MYSQL_ROOT_HOST="%" \
    -e MYSQL_DATABASE=mydb \
    -e MYSQL_USER=remote_user \
    -e MYSQL_PASSWORD=remote_user-pwd \
    docker.io/library/mysql:8.4.4

-d : detached (background mode)
-p : 3306:3306 maps MySQL's default port 3306 to host port 3306
     (3307:3306 would map it to host port 3307 instead)
-e MYSQL_ROOT_HOST="%" : allows logging in to MySQL from tools like MySQL Workbench

Login to MySQL Container

docker exec -it mysql bash

List all the Containers

docker container ls -a

Stop MySQL Container

docker stop mysql

Delete the container

docker rm mysql

Preserve the Data for the future

Inside container/mysql

mkdir data
docker run --name mysql -d \
    -p 3306:3306 \
    -e MYSQL_ROOT_PASSWORD=root-pwd \
    -e MYSQL_ROOT_HOST="%" \
    -e MYSQL_DATABASE=mydb \
    -e MYSQL_USER=remote_user \
    -e MYSQL_PASSWORD=remote_user-pwd \
    -v ./data:/var/lib/mysql \
    docker.io/library/mysql:8.4.4

Connect to MySQL and run the following SQL to verify everything works:

-- Create database
CREATE DATABASE IF NOT EXISTS friends_tv_show;
USE friends_tv_show;

-- Create Characters table
CREATE TABLE characters (
    character_id INT AUTO_INCREMENT PRIMARY KEY,
    first_name VARCHAR(50) NOT NULL,
    last_name VARCHAR(50) NOT NULL,
    actor_name VARCHAR(100) NOT NULL,
    date_of_birth DATE,
    occupation VARCHAR(100),
    apartment_number VARCHAR(10)
);

INSERT INTO characters (first_name, last_name, actor_name, date_of_birth, occupation, apartment_number) VALUES
('Ross', 'Geller', 'David Schwimmer', '1967-10-02', 'Paleontologist', '3B'),
('Rachel', 'Green', 'Jennifer Aniston', '1969-02-11', 'Fashion Executive', '20'),
('Chandler', 'Bing', 'Matthew Perry', '1969-08-19', 'IT Procurement Manager', '19'),
('Monica', 'Geller', 'Courteney Cox', '1964-06-15', 'Chef', '20'),
('Joey', 'Tribbiani', 'Matt LeBlanc', '1967-07-25', 'Actor', '19'),
('Phoebe', 'Buffay', 'Lisa Kudrow', '1963-07-30', 'Massage Therapist/Musician', NULL);

select * from characters;

Build your own Image


mkdir -p container
cd container

Python Example

Follow the README.md

Fork & Clone

git clone https://github.com/gchandra10/docker_mycalc_demo.git

Web App Demo

Fork & Clone

git clone https://github.com/gchandra10/docker_webapp_demo.git

Publish Image to Docker Hub

Login to Docker Hub

  • Create a Repository “my_faker_calc”
  • Under Account Settings
    • Personal Access Token
    • Create a PAT token with Read/Write access for 1 day

Replace gchandra10 with your Docker Hub user ID.

docker login docker.io 

enter userid
enter PAT token

Then build the Image with your userid

docker build -t gchandra10/my_faker_calc:1.0  .
docker image ls

Copy the ImageID of gchandra10/my_faker_calc:1.0

Tag the ImageID with the desired version and latest

docker image tag <image_id> gchandra10/my_faker_calc:latest

Push the Images to Docker Hub (version and latest)

docker push gchandra10/my_faker_calc:1.0 
docker push gchandra10/my_faker_calc:latest

Image Security

Open Source tool Trivy

https://trivy.dev/latest/getting-started/installation/

trivy image python:3.9-slim

trivy image gchandra10/my_faker_calc

trivy image gchandra10/my_faker_calc --severity CRITICAL,HIGH --format table

trivy image gchandra10/my_faker_calc --severity CRITICAL,HIGH  --output result.txt

#examples #docker Ver 0.3.6

Last change: 2025-12-02

[Avg. reading time: 0 minutes]

Productionizing ML Models

[Avg. reading time: 3 minutes]

Observability

ML observability means:

  • monitoring model behavior
  • understanding WHY the model behaves that way
  • detecting issues early
  • supporting debugging and retraining decisions

ML Observability Pillars

  1. Data Quality Monitoring
  2. Drift Monitoring
  3. Operational / System Monitoring
  4. Explainability & Bias Monitoring
  5. Governance, Lineage & Reproducibility

Data Quality Monitoring

Tracks whether the input data is valid, clean, and reliable.

  • missing values
  • invalid values
  • type issues
  • schema changes
  • outliers
  • range violations
  • feature null spikes

Operational / System Monitoring

  • throughput
  • hardware utilization
  • inference failures
  • API timeouts
  • memory leaks
  • GPU/CPU load spikes
  • queue lag in streaming pipelines

This ensures the model endpoint or batch job is healthy.

Governance, Lineage & Reproducibility

Tracks the lifecycle and accountability of all ML assets.

  • dataset versioning
  • model versioning
  • feature lineage
  • pipeline lineage
  • audit logs (who deployed, who retrained)
  • model approval workflow
  • reproducible experiments
  • rollback support

#observability #mlopsVer 0.3.6

Last change: 2025-12-02

[Avg. reading time: 8 minutes]

Drift

Monitoring and observability in ML is about continuously checking:

  • What data is coming in
  • How that data is changing
  • Whether the model’s predictions are still reliable
  • Whether the business metrics are degrading

Three key issues:

Data Drift: Incoming feature distributions shift from what the model was trained on.

Concept Drift: The relationship between features and target changes.

Model Performance Decay: Accuracy, precision, recall, RMSE, etc. degrade over time.

Use cases

  • Fraud models stop detecting new fraud patterns.
  • Demand forecasting fails when consumer behavior changes.
  • Recommendation systems decay as user preferences evolve.
  • Healthcare/diagnosis models degrade with new demographics.
  • NLP sentiment models break due to new slang or cultural shifts.

Example

Phase 1: Training distribution

  • sqft mean ~1500
  • bedrooms mostly 2 or 3
  • house_age mostly 5–15 years

Model learns reasonable patterns.

Phase 2: Production year later

Neighborhood changes + new houses get built.

1. Data Drift

Example:

  • sqft mean shifts from 1500 to 2300
  • more 4-bedroom homes appear
  • house_age shifts from 10 years old to 2 years old (new constructions)

This is feature distribution drift. Model still predicts, but sees very different patterns than training.

2. Concept Drift

Originally:

  • Price increases roughly 150 per extra sqft

After market shift:

  • Price increases 250 per extra sqft

Meaning: the mapping from features to target changed, even though features look similar.

3. Model Performance Decay

You track weekly RMSE:

  • Week 1: RMSE 19k
  • Week 15: RMSE 25k
  • Week 32: RMSE 42k

Why does it decay?

  • Market changed
  • New developers building larger homes
  • New inflation conditions
  • Seasonal patterns changed
  • The model is outdated.

Data Quality Drift

Quality of incoming data begins to degrade:

  • more missing values
  • more zeros
  • more invalid/out-of-range values
  • more outliers
  • schema changes
  • feature suddenly becomes constant
  • new categories never seen before

This is one of the most important practical drifts.

Example:

“furnished”, “semi-furnished” → suddenly “fully-furnished” appears (NEW category)

Data Freshness Drift (Latency Drift)

Data arrives:

  • late
  • too early
  • stale
  • out-of-order

Feature Importance Drift

Rank of feature importance changes:

Example:

  • bedrooms used to be the strongest feature
  • now open backyard becomes dominant
  • previously irrelevant features become important and vice-versa

Input Volume Drift

Sudden spikes or drops in data volume.

Example:

Daily 500 requests suddenly becomes 10,000.

This affects latency, performance, and reliability.
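
A minimal data-drift check on the sqft feature from the example above, using a two-sample Kolmogorov-Smirnov test (the 0.05 threshold is a common but illustrative choice):

import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
train_sqft = rng.normal(1500, 300, 5000)  # training-time distribution
prod_sqft = rng.normal(2300, 350, 5000)   # production a year later

stat, p_value = ks_2samp(train_sqft, prod_sqft)
if p_value < 0.05:
    print(f"Drift detected: KS statistic={stat:.3f}, p-value={p_value:.3g}")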

Demo

https://colab.research.google.com/drive/1gf2Qs3avNej6JP-LmKHe022HUiSqbCmy?usp=sharing
git clone https://github.com/gchandra10/python_model_drift

Open Source Tools

https://github.com/evidentlyai/evidently

#mlops #driftVer 0.3.6

Last change: 2025-12-02

[Avg. reading time: 9 minutes]

Security

Machine learning systems introduce a whole new attack surface. In traditional software, you secure code, networks, data, and deployments. In ML, you also have to secure training data, model artifacts, feature pipelines, model endpoints, and the feedback loops that continuously update the model.

If ML security is ignored, attackers can quietly poison training data, steal the model, extract sensitive information, or manipulate predictions in production. The impact can be severe: compliance violations, financial loss, biased decisions, or complete system compromise.

Why It Matters

  • ML models behave exactly the way the data teaches them. If attackers can tamper with data, you lose trust in the entire pipeline.
  • Models deployed as APIs are prime targets for extraction, prompt injections, and inference manipulation.
  • Regulatory pressure is rising, and ML systems now need governance similar to financial or healthcare-grade systems.
  • Many orgs automate retraining. Without guardrails, an attacker could push poisoned data into the pipeline and silently change model behavior overnight.

1. Data Security

  • Validate and sanitize input data before training or inference.
  • Detect drift that might be intentional poisoning.
  • Maintain lineage: who produced the data, when, from where.
  • Encrypt data in transit and at rest.

2. Model Artifact Security

  • Store models in a secure registry (MLflow Model Registry or cloud-managed registry).
  • Use signed and versioned models to prevent unverified deployments.
  • Restrict access at the catalog or registry level using RBAC.

3. Supply Chain Security

  • Training code, libraries, dependencies, Docker images, and notebooks can be compromised.
  • Use vulnerability scanning tools on Python packages and containers.
  • Pin versions using pyproject.toml or UV/Poetry lockfiles.
  • Verify model lineage (code version, data version, training environment).

4. API & Endpoint Hardening

  • Rate limiting and throttling to prevent model extraction.
  • Authentication and authorization around inference endpoints.
  • Input validation to avoid adversarial attacks and prompt injections (LLMs).
  • Don’t expose internal model metadata via the API.

5. Monitoring & Detection

  • Track prediction patterns to catch sudden spikes or targeted manipulation.
  • Use model drift & data drift monitoring tools.
  • Alert when confidence scores change unpredictably.
  • Store logs for forensics.

6. Secrets & Environment Security

  • Never hardcode API keys into notebooks or training code.
  • Use cloud secret managers or Databricks secret scopes.
  • Lock down S3/Blob/GCS buckets and model storage.
  • Use network isolation: private endpoints, VPC peering, firewall rules.

How To Ensure Models Are Not Vulnerable

  • Implement model reviews as part of CI/CD, including robustness tests.
  • Continuously test your data pipelines for poisoning or schema violations.
  • Use secure serving infrastructure (no local Flask servers in production).
  • Perform penetration testing specifically targeted at model endpoints.
  • Automate retraining only when data validation checks pass.
  • Track every model version, input source, and deployment environment.
  • Keep models and features inside secured catalogs with RBAC and audit logs.
  • Use zero-trust principles for every pipeline component.

Popular Tools

FalconPy by CrowdStrike

#security #mlopsVer 0.3.6

Last change: 2025-12-02

[Avg. reading time: 5 minutes]

Validation Frameworks

Data validation frameworks help you prove your data is correct before you process or model it. Instead of writing ad-hoc if-else checks, you declare rules once and let the framework enforce them automatically.

  • Consistency
  • Repeatability
  • Cleaner code
  • Faster debugging
  • Less human error

Validation Frameworks

  • Detect bad data early instead of debugging downstream failures
  • Enforce rules across teams so everyone validates the same way
  • Automate thousands of checks with very little code
  • Reduce manual cleanup work that normally takes hours
  • Make pipelines safer, more predictable, and easier to maintain
  • Shift data quality to where it belongs: before transformation and modeling

| Manual Validation | Framework-Based Validation |
|---|---|
| Lots of custom code | Declare rules once |
| Hard to maintain | Reuse rules everywhere |
| Easy to miss edge cases | Remove 70–90 percent of custom code |
| Never consistent between developers | Fail fast instead of debugging downstream |
| Repeated onboarding pain | Easier onboarding for new developers and analysts |

Pandera (Python)

  • Easiest for Python pipelines
  • Schema-based, great for ML workflows
  • Integrates with Pandas, Polars, Dask, Spark
  • Treats data validation like unit tests

Pydantic

  • Row-level validation
  • Excellent for API inputs and ML inference
  • Great complement to Pandera, not a dataframe validator

Pydantic + Pandera

  • Pydantic is for validating one row at a time.
  • Pandera is for validating the whole dataset at once.
  • Pydantic shines in ML inference, web APIs, and configuration files.
  • Pandera shines in ETL, data cleaning, feature engineering, and ML training pipelines.
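
A minimal Pandera sketch (the schema and column names are illustrative):

import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema({
    "sqft": pa.Column(float, pa.Check.ge(0)),
    "bedrooms": pa.Column(int, pa.Check.isin([1, 2, 3, 4, 5])),
    "price": pa.Column(float, pa.Check.gt(0)),
})

df = pd.DataFrame({"sqft": [1500.0], "bedrooms": [3], "price": [250000.0]})
schema.validate(df)  # raises SchemaError if any rule is violated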
git clone https://github.com/gchandra10/python_validator_demo

#pandera #pydantic #validationframeworkVer 0.3.6

Last change: 2025-12-02

[Avg. reading time: 5 minutes]

Model Compression

Model compression is the set of techniques used to reduce the size and computational cost of a trained model while keeping its accuracy almost the same.

Why It Exists

  • Speed up inference
  • Reduce memory footprint
  • Fit models on cheaper hardware
  • Reduce serving cost
  • Enable on-device ML (phones, edge devices, IoT)
  • Allow high-traffic systems to scale

Without Compression

  • Slow predictions
  • GPU or CPU bottlenecks
  • More servers needed to keep up
  • Higher inference bill
  • Some environments can’t run your model at all
  • Increased latency kills user experience

Photo Analogy: like compressing a RAW photo to JPEG, a compressed model is far smaller while looking (behaving) almost the same.

The popular mechanisms include:

  • Quantization

Quantization refers to the process of reducing the precision or bit-width of the numerical values used to represent model parameters, usually from n bits to m bits, where n > m.

In ML, FP32 (32-bit floating point) is the default; with quantization we convert the 32-bit values to 16 or 8 bits and achieve similar results.
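
A minimal sketch of post-training dynamic quantization in PyTorch (the toy model is illustrative):

import os
import torch
import torch.nn as nn

# A small FP32 model
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))

# Convert Linear-layer weights from FP32 to INT8 for inference
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

def size_mb(m, path="tmp.pt"):
    torch.save(m.state_dict(), path)
    size = os.path.getsize(path) / 1e6
    os.remove(path)
    return size

print(f"FP32: {size_mb(model):.3f} MB, INT8: {size_mb(quantized):.3f} MB")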

https://colab.research.google.com/drive/1SHGqVZhk8tKpuGQ3KqLhUXIk8NU9W2Er?usp=sharing

When using this with MLflow, log both models as artifacts and serve whichever one fits the use case.

  • Distillation

Model distillation, also known as knowledge distillation, is a technique where a smaller model, often referred to as a student model, is trained to mimic the behavior of a larger, more complex model, known as a teacher model. The goal is to transfer the knowledge and performance of the larger model to the smaller one.

Analogy: reading the whole book versus being nudged with hints and references.

#compression #quantization #distillationVer 0.3.6

Last change: 2025-12-02

[Avg. reading time: 4 minutes]

Ollama

  • Ollama is an open-source tool that allows you to run large language models (LLMs) on your local machine, providing privacy and offline access.

  • It simplifies the process of downloading, running, and managing LLMs with a user-friendly interface, both via a command-line interface (CLI) and an API.

  • It’s designed for developers and researchers who want to customize and experiment with AI models locally, without depending on cloud services.

Install

Download and Install

https://ollama.com/

Open Terminal

ollama

ollama list

ollama pull deepseek-r1:8b

ollama run deepseek-r1:8b

To close the prompt

/bye

Roles

  • user: The human asking questions or giving instructions.
  • assistant : The model’s response role. This is what the LLM outputs.
  • system : Optional. Used to set initial behavior or constraints, similar to system prompts in OpenAI/ChatGPT.
git clone https://github.com/gchandra10/python_ollama_demo.git

chat() - conversational, role-based, template-aware
generate() - raw LLM token generation, no chat template, no memory
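
A minimal sketch using the official ollama Python package (pip install ollama; response field access can vary slightly across package versions):

import ollama

# chat(): conversational, role-based, template-aware
resp = ollama.chat(
    model="deepseek-r1:8b",
    messages=[{"role": "user", "content": "Explain MLOps in one sentence."}],
)
print(resp["message"]["content"])

# generate(): raw prompt in, tokens out; no chat template, no memory
raw = ollama.generate(model="deepseek-r1:8b", prompt="Say hello.")
print(raw["response"])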

Build Custom Models

  • Create a Modelfile
  • Mention the model and prompt
  • Create and use the new Model
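
A minimal sketch of those steps (the system prompt and custom model name are illustrative). Create a file named Modelfile:

FROM deepseek-r1:8b
SYSTEM "You are a concise MLOps teaching assistant."

Then create and run the new model:

ollama create mlops-tutor -f Modelfile
ollama run mlops-tutor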

#llm #ollamaVer 0.3.6

Last change: 2025-12-02

[Avg. reading time: 6 minutes]

Best Practices

Continuous Integration (CI): Automate testing and validation for code, data, and models before deployment.

Continuous Delivery/Deployment (CD): Automate the deployment of the complete ML pipeline and the trained model to production environments (often using Docker/Kubernetes).

Continuous Training (CT): Implement automated triggers to retrain models based on performance degradation (drift) or arrival of significant new data.

Version Control: Use Git for code and configuration. Crucially, version control datasets (Data Versioning) and model artifacts (Model Registry).

Reproducibility: Log all experiment metadata—including hyperparameters, package dependencies, and data/code versions—to enable exact reproduction of any past result.

Infrastructure as Code (IaC): Manage all compute resources and environments (e.g., training clusters, deployment services) using code (e.g., Terraform) for consistency.

Continuous Monitoring: Track both operational metrics (latency, throughput, resource usage) and model performance metrics (accuracy, precision, business KPIs) in production.

Drift Detection: Actively monitor for Data Drift (input data changes) and Concept Drift (target relationship changes) and set up automated alerts and retraining workflows.

Data Validation: Implement continuous checks on the schema, quality, and statistical properties of input data streams before they reach the model.

Model Governance & Lineage: Maintain a clear audit trail of every model, documenting who trained it, when, and with what specific assets, for regulatory compliance and debugging.

Modular Pipelines: Break the ML workflow (data ingestion, preprocessing, training, evaluation, deployment) into independent, reusable components.

Feature Stores: Use a centralized platform to define, serve, and share reusable features across different models and teams, ensuring consistency between training and serving.

Collaboration: Facilitate smooth handoffs and shared ownership between Data Scientists, ML Engineers, and Operations teams through common tools and standardized interfaces.

#mlops #bestpracticesVer 0.3.6

Last change: 2025-12-02

[Avg. reading time: 4 minutes]

SaaS Tools for MLflow

These platforms streamline the entire machine learning lifecycle, often integrating MLflow’s capabilities.

Amazon SageMaker: AWS’s comprehensive, fully-managed platform that covers the entire ML workflow from data preparation to deployment and monitoring.

Google Vertex AI: Google Cloud’s unified platform for building, deploying, and scaling ML models, which includes MLOps tools like pipelines, a model registry, and monitoring.

Microsoft Azure Machine Learning: A cloud service that provides a range of tools and a unified environment to accelerate and manage the ML project lifecycle, with strong native MLflow integration.

Databricks (Managed MLflow): Databricks, co-founded by the creators of MLflow, offers a fully managed and enhanced version of MLflow tightly integrated with their lakehouse platform.

Benefits

Enhanced Collaboration: Provides a shared, centralized platform (via the Tracking Server and Model Registry UI) where data scientists can log, view, compare, and share experiment results and model versions.

Efficient Model Lifecycle Management: The Model Registry offers governance and an audit trail by controlling the transition of model versions through different stages (e.g., from Staging to Production) and linking them to their original training runs.

#saastools #sagemaker #azureml #googlevertexaiVer 0.3.6

Last change: 2025-12-02

[Avg. reading time: 2 minutes]

Good Reads

These are just resources I found interesting and thought you might too. I’m not connected to them and can’t vouch for everything, but I’m sharing in the spirit of helping you discover new ideas, books, and opportunities.


Google Colab Free

https://blog.google/outreach-initiatives/education/colab-higher-education/


DeepLearning.ai

https://www.deeplearning.ai/


Notebook LM

https://notebooklm.google/


ByteByteGo

It's a very useful YouTube channel, loaded with lots of practical content.

https://www.youtube.com/@ByteByteGo/videos


#goodreads #resourcesVer 0.3.6

Last change: 2025-12-02

Tags

bert

/MLOps & AI Overview/ML Lifecycle/Feature Engineering/Embeddings

agents

/MLOps & AI Overview/AI then and now/Agentic AI

ai-ml

/MLOps & AI Overview/AI then and now/Differences

artificialintelligence

/MLOps & AI Overview/AI then and now

automation

/Developer Tools/JQ

automl

/MLflow Model Lifecycle/Auto ML

aws

/Cloud/AWS/AWS Global Infra

/Cloud/AWS/CloudShell

/Cloud/AWS/EC2

/Cloud/AWS/IAM

/Cloud/AWS/S3

az

/Cloud/AWS/AWS Global Infra

azureml

/Productionizing ML Models/SAAS Tools

batch

/MLflow Model Lifecycle/Model Serving Types

bestpractices

/Productionizing ML Models/Best Practices

challenges

/Cloud/Challenges

cicd

/MLOps & AI Overview/AI then and now/MLOps

cidr

/Cloud/AWS/CIDR

classification

/MLOps & AI Overview/AI then and now/Machine Learning

claude

/MLOps & AI Overview/AI then and now/Generative AI

clean

/MLOps & AI Overview/ML Lifecycle

cleaning

/MLOps & AI Overview/ML Lifecycle/Data Preparation

cli

/Cloud/AWS/CloudShell

/Developer Tools/DuckDB

/Developer Tools/JQ

cloud

/Cloud/Challenges

/Cloud/Overview

/Cloud/Types

cloudshell

/Cloud/AWS/CloudShell

collect

/MLOps & AI Overview/ML Lifecycle

compression

/Productionizing ML Models/Model Compression

container

/Tools/Containers/What container does

containers

/Tools/Containers

cpu

/MLflow Model Lifecycle/CPU vs GPU

data

/MLOps & AI Overview/Introduction

/MLOps & AI Overview/ML Lifecycle/Data Preparation

databricks

/Required Tools

datacleaning

/MLOps & AI Overview/ML Lifecycle/Data Cleaning

dataimputation

/MLOps & AI Overview/ML Lifecycle/Data Imputation

datavalidation

/MLflow Model Lifecycle/pydantic

decorator

/MLflow Model Lifecycle/Decorator

densevector

/MLOps & AI Overview/ML Lifecycle/Feature Engineering/Vectors

deserialization

/MLOps & AI Overview/Terms to Know

development

/MLOps & AI Overview/Life Before MLOps

devops

/MLOps & AI Overview/AI then and now/MLOps

disclaimer

/Disclaimer

distillation

/Productionizing ML Models/Model Compression

docker

/Tools/Containers/Container Examples

/Tools/Containers/VMs or Containers

domain_specific

/MLOps & AI Overview/ML Lifecycle/Feature Engineering

drift

/Productionizing ML Models/Drift

dropdata

/MLOps & AI Overview/ML Lifecycle/Data Imputation

duckdb

/Developer Tools/DuckDB

ec2

/Cloud/AWS/EC2

edgelocation

/Cloud/AWS/AWS Global Infra

embeddeddb

/Developer Tools/SQLite

embeddings

/MLOps & AI Overview/ML Lifecycle/Feature Engineering/Embeddings

encode

/MLOps & AI Overview/ML Lifecycle/Data Imputation

error

/Developer Tools/Error Handling

evaluate

/MLOps & AI Overview/ML Lifecycle

examples

/MLOps & AI Overview/Examples

/Tools/Containers/Container Examples

exception

/Developer Tools/Error Handling

experiment

/MLflow Introduction/MLflow Experiment Structure

expert-systems

/MLOps & AI Overview/AI then and now/Expert Systems

explanation

/MLOps & AI Overview/Model vs Library vs Framework/Explanation

fastapi

/MLflow Model Lifecycle/Model Serving

feature_engineering

/MLOps & AI Overview/ML Lifecycle/Feature Engineering

finance

/MLOps & AI Overview/Examples

framework

/MLOps & AI Overview/Model vs Library vs Framework

fuzziness

/MLOps & AI Overview/AI then and now/Fuzzy Logic

fuzzy-logic

/MLOps & AI Overview/AI then and now/Fuzzy Logic

genai

/MLOps & AI Overview/AI then and now/Differences

generativeai

/MLOps & AI Overview/AI then and now/Generative AI

git

/Required Tools

goodreads

/Good Reads

googlevertexai

/Productionizing ML Models/SAAS Tools

gpt

/MLOps & AI Overview/AI then and now/Generative AI

gpu

/MLflow Model Lifecycle/CPU vs GPU

healthcare

/MLOps & AI Overview/Examples

http

/MLflow Model Lifecycle/HTTP Basics

iaac

/Cloud/Terraform

iaas

/Cloud/Types

iam

/Cloud/AWS/IAM

ipv4

/Cloud/AWS/CIDR

jobs

/MLOps & AI Overview/Job Opportunities

jq

/Developer Tools/JQ

json

/Developer Tools/JQ

/MLflow Introduction/YAML

knn

/MLOps & AI Overview/ML Lifecycle/Data Imputation

label_encoding

/MLOps & AI Overview/ML Lifecycle/Data Encoding

library

/MLOps & AI Overview/Model vs Library vs Framework

/MLOps & AI Overview/Model vs Library vs Framework/Explanation

linearalgebra

/MLOps & AI Overview/Statistical vs ML Models

lint

/Developer Tools/Other Python Tools

llm

/MLOps & AI Overview/AI then and now/Differences

/Productionizing ML Models/Ollama

localdb

/Developer Tools/SQLite

machinelearning

/MLOps & AI Overview/AI then and now

medallion

/MLOps & AI Overview/AI then and now/MLOps

ml

/MLOps & AI Overview/Statistical vs ML Models

mlcleaning

/MLOps & AI Overview/ML Lifecycle/Data Cleaning

mlengineer

/MLOps & AI Overview/Introduction

/MLOps & AI Overview/Job Opportunities

mlflow

/MLflow Introduction/MLflow Experiment Structure

/MLflow Introduction/MLflow Features

/MLflow Model Lifecycle/Model Flavors

/MLflow Model Lifecycle/Model Serving

mlflow_server

/MLflow Introduction/MLflow Features

mlops

/MLOps & AI Overview/AI then and now/Differences

/MLOps & AI Overview/AI then and now/MLOps

/MLOps & AI Overview/Introduction

/MLOps & AI Overview/Life Before MLOps

/Productionizing ML Models/Best Practices

/Productionizing ML Models/Drift

/Productionizing ML Models/Observability

/Productionizing ML Models/Security

mlopsengineer

/MLOps & AI Overview/Job Opportunities

model

/MLOps & AI Overview/Model vs Library vs Framework

modelserving

/MLflow Model Lifecycle/Model Serving

mse

/MLOps & AI Overview/Model vs Library vs Framework

mypy

/Developer Tools/Other Python Tools

nlp

/MLOps & AI Overview/ML Lifecycle/Feature Engineering/Embeddings

normalize_data

/MLOps & AI Overview/ML Lifecycle/Data Cleaning

objectstorage

/Cloud/AWS/S3

observability

/Productionizing ML Models/Observability

ollama

/Productionizing ML Models/Ollama

onehot_encoding

/MLOps & AI Overview/ML Lifecycle/Data Encoding

overfitting

/MLOps & AI Overview/Terms to Know

overview

/Cloud/Overview

paas

/Cloud/Types

pandera

/Productionizing ML Models/Validation Frameworks

parquet

/Developer Tools/DuckDB

parser

/Developer Tools/JQ

pep

/Developer Tools/Other Python Tools

permissions

/Cloud/AWS/IAM

podman

/Tools/Containers/VMs or Containers

poetry

/Developer Tools/Introduction

production

/MLOps & AI Overview/Life Before MLOps

pydantic

/MLflow Model Lifecycle/pydantic

/Productionizing ML Models/Validation Frameworks

pyfunc

/MLflow Model Lifecycle/Model Flavors

pytest

/Developer Tools/Unit Test

python

/Developer Tools/Introduction

/MLflow Introduction/YAML

/MLflow Model Lifecycle/Decorator

/Required Tools

pytorch

/MLflow Model Lifecycle/Model Flavors

quantization

/Productionizing ML Models/Model Compression

r2score

/MLOps & AI Overview/Model vs Library vs Framework

realtime

/MLflow Model Lifecycle/Model Serving Types

region

/Cloud/AWS/AWS Global Infra

regression

/MLOps & AI Overview/AI then and now/Machine Learning

resources

/Good Reads

restapi

/MLflow Model Lifecycle/HTTP Basics

retail

/MLOps & AI Overview/Examples

rl

/MLOps & AI Overview/AI then and now/Reinforcement Learning

rlhf

/MLOps & AI Overview/AI then and now/Reinforcement Learning

robotics

/MLOps & AI Overview/AI then and now/Reinforcement Learning

ruff

/Developer Tools/Other Python Tools

rulebased

/MLOps & AI Overview/AI then and now/Expert Systems

run

/MLflow Introduction/MLflow Experiment Structure

rust

/Developer Tools/UV

s3

/Cloud/AWS/S3

saas

/Cloud/Types

saastools

/Productionizing ML Models/SAAS Tools

sagemaker

/Productionizing ML Models/SAAS Tools

security

/Productionizing ML Models/Security

serialization

/MLOps & AI Overview/Terms to Know

server

/Cloud/AWS/EC2

serving

/MLflow Introduction/MLflow Features

singlefiledatabase

/Developer Tools/DuckDB

sparsevector

/MLOps & AI Overview/ML Lifecycle/Feature Engineering/Vectors

sqlite

/Developer Tools/SQLite

statistics

/MLOps & AI Overview/Statistical vs ML Models

storage

/Cloud/AWS/S3

streaming

/MLflow Model Lifecycle/Model Serving Types

subnet

/Cloud/AWS/CIDR

supervised

/MLOps & AI Overview/AI then and now/Machine Learning

/MLOps & AI Overview/Types of ML Models

tabularvector

/MLOps & AI Overview/ML Lifecycle/Feature Engineering/Vectors

target_encoding

/MLOps & AI Overview/ML Lifecycle/Data Encoding

tensorflow

/MLflow Model Lifecycle/Model Flavors

terraform

/Cloud/Terraform

tools

/Developer Tools/DuckDB

/Developer Tools/JQ

/Required Tools

train

/MLOps & AI Overview/ML Lifecycle

try

/Developer Tools/Error Handling

underfitting

/MLOps & AI Overview/Terms to Know

unittesting

/Developer Tools/Unit Test

unsupervised

/MLOps & AI Overview/AI then and now/Machine Learning

/MLOps & AI Overview/Types of ML Models

user

/Cloud/AWS/IAM

uv

/Developer Tools/Introduction

/Developer Tools/UV

validationframework

/Productionizing ML Models/Validation Frameworks

vectors

/MLOps & AI Overview/ML Lifecycle/Feature Engineering/Vectors

venv

/Developer Tools/Introduction

/Developer Tools/UV

vm

/Cloud/AWS/EC2

/Tools/Containers

/Tools/Containers/VMs or Containers

word2vec

/MLOps & AI Overview/ML Lifecycle/Feature Engineering/Embeddings

wraps

/MLflow Model Lifecycle/Decorator

yaml

/MLflow Introduction/YAML