Data Cleaning
Check for Target Leakage
What it is: Features that give away the answer (future info in training data).
Why it matters: Makes the model look perfect in training but useless in production.
Example:
refund_issued_flag when predicting “Will this order be refunded?”.
This column must not be used as a training feature, because in production its value is not known at prediction time. It can still be used afterwards to check the model’s predictions against what actually happened.
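A minimal sketch of keeping a leaky column out of the feature set, using plain Python. The sample rows, the other column names (order_total, item_count, refunded), and the helper split_features_and_target are hypothetical; only refund_issued_flag comes from the example above.

```python
# Hypothetical order records; only refund_issued_flag comes from the example above.
orders = [
    {"order_total": 120.0, "item_count": 3, "refund_issued_flag": 1, "refunded": 1},
    {"order_total": 35.5, "item_count": 1, "refund_issued_flag": 0, "refunded": 0},
]

TARGET = "refunded"
# Columns that are only known after the outcome (assumed, hand-maintained list).
LEAKY_COLUMNS = {"refund_issued_flag"}


def split_features_and_target(rows, target, leaky):
    """Return (features, labels); the target and leaky columns are left out of features."""
    features = [
        {k: v for k, v in row.items() if k != target and k not in leaky}
        for row in rows
    ]
    labels = [row[target] for row in rows]
    return features, labels


X, y = split_features_and_target(orders, TARGET, LEAKY_COLUMNS)
print(X[0])  # {'order_total': 120.0, 'item_count': 3}
print(y)     # [1, 0]
```

Keeping the leaky columns in one named set makes the exclusion explicit and easy to review when new columns are added.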
Validate Labels
What it is: Make sure labels are correct, consistent, and usable.
Why it matters: Garbage labels = garbage predictions.
Example:
Churn column has values: yes, Y, 1, true.
Normalize to 1 = churn, 0 = not churn.
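One way the label normalization could look, as a small sketch. The positive spellings come from the example above; the negative spellings and the function name normalize_churn are assumptions. Unknown values raise an error instead of being silently guessed.

```python
# Positive spellings from the example; the negative ones are assumed counterparts.
POSITIVE = {"yes", "y", "1", "true"}
NEGATIVE = {"no", "n", "0", "false"}


def normalize_churn(value):
    """Map a raw churn label to 1 (churn) or 0 (not churn); raise on anything else."""
    key = str(value).strip().lower()
    if key in POSITIVE:
        return 1
    if key in NEGATIVE:
        return 0
    raise ValueError(f"Unrecognized churn label: {value!r}")


raw_labels = ["yes", "Y", 1, "true", "no", "N", 0, "False"]
clean_labels = [normalize_churn(v) for v in raw_labels]
print(clean_labels)  # [1, 1, 1, 1, 0, 0, 0, 0]
```

Raising on unrecognized labels is a deliberate choice here: a typo such as "ys" then surfaces during cleaning rather than turning into a wrong label.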
Handle Outliers Intentionally
What it is: Extreme values that distort training.
Why it matters: A single value such as “Emp_Salary = 10,000,000” can skew averages and feature scaling, and throw off predictions.
Example:
Cap at 99th percentile.
Flag as anomaly instead of training on it.
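Both options can be sketched in a few lines. The salary data is made up, and the nearest-rank percentile helper is one simple definition of a percentile (libraries often interpolate instead, so their cap may differ slightly).

```python
def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct * n / 100
    return ordered[rank - 1]


# Hypothetical data: 99 ordinary salaries plus one extreme value.
salaries = [40_000 + 1_000 * i for i in range(99)] + [10_000_000]
cap = percentile_nearest_rank(salaries, 99)

# Option 1: cap values above the 99th percentile.
capped = [min(s, cap) for s in salaries]

# Option 2: flag extreme rows and leave them out of training.
is_anomaly = [s > cap for s in salaries]
training_salaries = [s for s, flagged in zip(salaries, is_anomaly) if not flagged]

print(cap)              # 138000
print(max(capped))      # 138000
print(sum(is_anomaly))  # 1
```

Whichever option is used, the cap should be computed on the training data only and then reused unchanged for validation and production data.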
Enforce Feature Types
What it is: Make sure data types match their meaning.
Why it matters: A model learns misleading patterns when a column’s type doesn’t match its meaning.
Example:
customer_id stored as integer → model may treat it as numeric.
Why is that a problem? The model would treat customer_id = 20 as a larger quantity than customer_id = 1, even though IDs carry no magnitude or order.
Convert to string (categorical).
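A small sketch of enforcing types with an explicit schema. The schema, the monthly_spend column, and the helper enforce_types are hypothetical; the point is that customer_id ends up as a string while genuinely numeric columns stay numeric.

```python
# Assumed schema: column name -> the type its meaning calls for.
SCHEMA = {"customer_id": str, "monthly_spend": float}


def enforce_types(row, schema):
    """Cast each schema column; a missing column raises KeyError, extra columns are dropped."""
    return {column: cast(row[column]) for column, cast in schema.items()}


raw_rows = [
    {"customer_id": 1, "monthly_spend": "42.0"},
    {"customer_id": 20, "monthly_spend": 17.5},
]
typed_rows = [enforce_types(row, SCHEMA) for row in raw_rows]
print(typed_rows[1])  # {'customer_id': '20', 'monthly_spend': 17.5}
```

With customer_id stored as a string, downstream code is far less likely to feed it to the model as a number by accident.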
Standardize Categories
What it is: Inconsistent labels in categorical columns.
Why it matters: Model may treat the same thing as different classes.
Example:
Country: USA, U.S.A., United States.
Map all to United States.
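The mapping could be sketched as a lookup table keyed on a lowercased, trimmed value. The alias table holds only the three spellings from the example; the function name standardize_country is an assumption.

```python
# Alias table built from the three spellings in the example (keys are lowercase).
COUNTRY_ALIASES = {
    "usa": "United States",
    "u.s.a.": "United States",
    "united states": "United States",
}


def standardize_country(value):
    """Return the canonical country name; unknown values pass through trimmed."""
    trimmed = value.strip()
    return COUNTRY_ALIASES.get(trimmed.lower(), trimmed)


raw_countries = ["USA", "U.S.A.", " united states ", "Canada"]
print([standardize_country(c) for c in raw_countries])
# ['United States', 'United States', 'United States', 'Canada']
```

Unknown values are passed through rather than rejected, so it helps to review the distinct values after mapping and extend the table when new spellings show up.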
Normalize Text for ML
What it is: Clean and standardize text features.
Why it matters: Prevents the model from treating “Hello” and “hello!” as different.
Example:
Lowercasing, removing punctuation, stripping whitespace.
Keep a copy of raw text for audit.
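A minimal sketch of those steps, writing the cleaned text to a new field so the raw text stays available. The field names review_raw and review_clean are hypothetical, and string.punctuation covers ASCII punctuation only.

```python
import string

# Translation table that deletes ASCII punctuation characters.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize_text(text):
    """Lowercase, remove ASCII punctuation, then trim and collapse whitespace."""
    without_punct = text.lower().translate(_PUNCT_TABLE)
    return " ".join(without_punct.split())


records = [{"review_raw": "  Hello!  "}, {"review_raw": "hello"}]
for record in records:
    # Write to a new field; the raw text is kept untouched for audit.
    record["review_clean"] = normalize_text(record["review_raw"])

print(records[0])  # {'review_raw': '  Hello!  ', 'review_clean': 'hello'}
```

Note that this sketch also collapses repeated spaces inside the text, which goes slightly beyond stripping leading and trailing whitespace.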
Protect Data Splits
What it is: Make sure related rows don’t leak between train/test.
Why it matters: Prevents inflated test accuracy caused by the model having already seen closely related rows during training.
Example:
Same student appears in both train and test sets.
Fix: Group by student_id when splitting.
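A group-aware split can be sketched by shuffling the student IDs rather than the rows, then assigning every row of a student to the same side. The sample rows, the quiz and score columns, and the helper group_train_test_split are hypothetical.

```python
import random


def group_train_test_split(rows, group_key, test_fraction=0.25, seed=0):
    """Split rows so that all rows sharing a group value land on the same side."""
    groups = sorted({row[group_key] for row in rows})
    random.Random(seed).shuffle(groups)  # seeded for a reproducible split
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [row for row in rows if row[group_key] not in test_groups]
    test = [row for row in rows if row[group_key] in test_groups]
    return train, test


# Hypothetical data: 4 students with 3 quiz rows each.
rows = [
    {"student_id": f"s{i}", "quiz": q, "score": 60 + 5 * i + q}
    for i in range(1, 5)
    for q in range(1, 4)
]
train, test = group_train_test_split(rows, "student_id")

train_ids = {row["student_id"] for row in train}
test_ids = {row["student_id"] for row in test}
print(train_ids & test_ids)  # set(), no student appears on both sides
```

Libraries such as scikit-learn also provide group-aware splitters (for example GroupShuffleSplit), which may be preferable in a real pipeline.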