Data Preparation
A commonly cited rule of thumb: roughly 80% of the time in an ML project goes to data preparation and cleaning, and only about 20% to model training.
Data preparation is the process of making raw data accurate, complete, and structured so it can be used for model training.
"Wait, data cleaning isn't the ML Engineer's job. It belongs to the Data Engineer."
True, but only up to a point.
Data Engineers focus on collection and validation at scale:
- Ingest raw data from source systems (databases, APIs, IoT, logs).
- Build ETL/ELT pipelines (Bronze → Silver → Gold).
- Run data quality checks (deduplication, schema validation, type checks, primary key uniqueness).
- Handle big data infrastructure: Spark, Databricks, Airflow, Kafka.
- Deliver curated data (often “Silver” or “Gold” layer) for downstream ML.
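To make the quality checks above concrete, here is a minimal sketch in plain Python. The `SCHEMA` dictionary, the column names, and the `validate_rows` helper are all hypothetical and exist only for illustration; at scale these checks would normally run inside the pipeline tooling (Spark, Databricks, and so on) rather than in a loop like this.

```python
from typing import Any

# Hypothetical schema for illustration: column name -> expected Python type.
SCHEMA = {"customer_id": int, "email": str, "age": int}


def validate_rows(rows: list[dict[str, Any]], key: str = "customer_id") -> list[str]:
    """Return a list of human-readable data quality violations."""
    errors = []
    seen_keys = set()
    for i, row in enumerate(rows):
        # Schema validation: every row must have exactly the expected columns.
        if set(row) != set(SCHEMA):
            errors.append(f"row {i}: columns {sorted(row)} do not match schema")
            continue
        # Type checks: None is allowed and left for downstream imputation.
        for col, expected in SCHEMA.items():
            value = row[col]
            if value is not None and not isinstance(value, expected):
                errors.append(f"row {i}: {col} should be {expected.__name__}")
        # Primary key uniqueness.
        if row[key] in seen_keys:
            errors.append(f"row {i}: duplicate {key} {row[key]}")
        seen_keys.add(row[key])
    return errors


rows = [
    {"customer_id": 1, "email": "a@example.com", "age": 34},
    {"customer_id": 2, "email": "b@example.com", "age": None},
    {"customer_id": 2, "email": "c@example.com", "age": "41"},
]
for problem in validate_rows(rows):
    print(problem)
# row 2: age should be int
# row 2: duplicate customer_id 2
```

Note that the missing `age` in the second row is deliberately not flagged: deciding how to fill it is an ML-specific choice, not a pipeline-level one.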
ML Engineers / Data Scientists take over once curated data is available:
- Apply ML-specific cleaning & prep:
  - Impute missing values (mean, median, or model-based).
  - Encode categorical variables (one-hot, embeddings).
  - Normalize/standardize numeric features.
  - Normalize and tokenize text, then convert it to embeddings.
- Create features meaningful to the ML model.
- Split data into train/validation/test sets.
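Two of the steps above, one-hot encoding and the train/validation/test split, can be sketched in a few lines of plain Python. The `one_hot` and `split` helpers, the 70/15/15 proportions, and the `plans` column are assumptions made for this example; in practice a library such as pandas or scikit-learn would usually handle both.

```python
import random


def one_hot(values: list[str]) -> tuple[list[str], list[list[int]]]:
    """One-hot encode a categorical column; returns (categories, encoded rows)."""
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    encoded = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        encoded.append(row)
    return categories, encoded


def split(rows: list, val_frac: float = 0.15, test_frac: float = 0.15, seed: int = 42):
    """Shuffle once with a fixed seed, then cut into train/validation/test."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_frac)
    n_val = round(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


plans = ["basic", "premium", "basic", "free"]  # hypothetical categorical column
categories, encoded = one_hot(plans)
print(categories)  # ['basic', 'free', 'premium']
print(encoded[1])  # [0, 0, 1]

train, val, test = split(list(range(20)))
print(len(train), len(val), len(test))  # 14 3 3
```

Splitting before fitting any imputer or scaler, and fitting those on the training split only, keeps information from the validation and test sets from leaking into training.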
flowchart LR
DE[**Data Engineer**<br/><br/>- ETL/ELT Pipelines<br/>- Schema Validation<br/>- Deduplication<br/>- Type Checks]
OVERLAP[**Common** <br/><br/>- Remove Duplicates<br/>- Ensure Consistency]
MLE[**ML Engineer**<br/><br/>- Handle Missing Values<br/>- Feature Scaling<br/>- Imputation<br/>- Encoding & Embeddings<br/>- Train/Val/Test Split]
DE --> OVERLAP
MLE --> OVERLAP
For example:
Tabular Data
Data Engineer: ensures there are no duplicate customer IDs in the database.
ML Engineer: fills missing “Age” values with median, scales “Income”.
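The ML Engineer's half of the tabular example might look like the sketch below, using only Python's standard `statistics` module. The `impute_median` and `standardize` helpers and the sample values are made up for illustration, and standardization (z-scores) is just one common way to "scale" a column.

```python
from statistics import mean, median, pstdev


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no signal; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


ages = [25, None, 40, 31, None]     # "Age" with missing entries
incomes = [30_000, 50_000, 70_000]  # "Income" on a much larger scale

print(impute_median(ages))  # [25, 31, 40, 31, 31]
print(standardize(incomes))
```

The median is often preferred over the mean for columns like age or income because a few extreme values do not pull it around as much.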
Text Data
Data Engineer: stores raw customer reviews as UTF-8 encoded text.
ML Engineer: lowercases, removes stopwords, converts to embeddings.
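A rough sketch of the lowercasing and stopword-removal steps is below. The `STOPWORDS` set is a tiny hand-picked list and the `normalize` helper is hypothetical; the final step, converting tokens to embeddings, needs a trained model and is left out here.

```python
import re

# Tiny illustrative stopword list; real projects use a larger curated one.
STOPWORDS = {"the", "a", "an", "is", "was", "and", "it", "this", "to", "of"}


def normalize(review: str) -> list[str]:
    """Lowercase a review, split it into word tokens, and drop stopwords."""
    lowered = review.lower()
    tokens = re.findall(r"[a-z0-9']+", lowered)
    return [t for t in tokens if t not in STOPWORDS]


print(normalize("The delivery was FAST, and the packaging is great!"))
# ['delivery', 'fast', 'packaging', 'great']
```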
Image Data
Data Engineer: validates images aren’t corrupted on ingest.
ML Engineer: resizes images, normalizes pixel values.
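As a toy illustration of the resize and normalize steps, the sketch below treats a grayscale image as a nested list of 8-bit pixel values. The `resize_nearest` and `normalize_pixels` helpers are hypothetical, and nearest-neighbour sampling is just the simplest resizing method to show; real pipelines would typically use an image library instead.

```python
def resize_nearest(image, new_h, new_w):
    """Nearest-neighbour resize of a 2D grayscale image stored as nested lists."""
    old_h, old_w = len(image), len(image[0])
    return [
        [image[r * old_h // new_h][c * old_w // new_w] for c in range(new_w)]
        for r in range(new_h)
    ]


def normalize_pixels(image):
    """Map 8-bit pixel values from [0, 255] to [0.0, 1.0]."""
    return [[p / 255 for p in row] for row in image]


image = [
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [255, 255, 0, 0],
    [255, 255, 0, 0],
]
small = resize_nearest(image, 2, 2)
print(small)                    # [[0, 255], [255, 0]]
print(normalize_pixels(small))  # [[0.0, 1.0], [1.0, 0.0]]
```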