Data Cleaning

Check for Target Leakage

What it is: Features that give away the answer (future info in training data).

Why it matters: Makes the model look perfect in training but useless in production.

Example:

refund_issued_flag when predicting “Will this order be refunded?”.

This column should not be used as a feature when building the model, because in production it is never available at prediction time. It can still be used when testing the model’s predictions.
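A minimal sketch of keeping the leaky column out of the feature set. Only refund_issued_flag comes from the example above; the other column names, the target name refunded, and the helper split_features_and_target are assumptions made for illustration.

```python
# Hypothetical order records; every column except refund_issued_flag is assumed.
orders = [
    {"order_value": 120.0, "item_count": 3, "refund_issued_flag": 1, "refunded": 1},
    {"order_value": 35.5, "item_count": 1, "refund_issued_flag": 0, "refunded": 0},
]

LEAKY_COLUMNS = {"refund_issued_flag"}  # only known after the outcome has happened
TARGET = "refunded"                     # assumed name of the label column


def split_features_and_target(rows, target, leaky):
    """Return (features, labels) with the target and leaky columns removed."""
    features = [
        {k: v for k, v in row.items() if k != target and k not in leaky}
        for row in rows
    ]
    labels = [row[target] for row in rows]
    return features, labels


X, y = split_features_and_target(orders, TARGET, LEAKY_COLUMNS)
print(X[0])
```

Listing leaky columns explicitly in one place makes the exclusion easy to review as new columns are added.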

Validate Labels

What it is: Make sure labels are correct, consistent, and usable.

Why it matters: Garbage labels = garbage predictions.

Example:

Churn column has values: yes, Y, 1, true.

Normalize to 1 = churn, 0 = not churn.
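One possible way to do this normalization is sketched below. The positive spellings come from the example above; the negative spellings and the function name normalize_churn are assumptions. Unknown values raise an error rather than being silently guessed.

```python
# Positive spellings are taken from the example; negative ones are assumed.
POSITIVE = {"yes", "y", "1", "true"}
NEGATIVE = {"no", "n", "0", "false"}


def normalize_churn(value):
    """Map a raw churn label to 1 or 0; raise on anything unrecognized."""
    key = str(value).strip().lower()
    if key in POSITIVE:
        return 1
    if key in NEGATIVE:
        return 0
    raise ValueError(f"Unrecognized churn label: {value!r}")


raw = ["yes", "Y", 1, "true", "no", 0]
clean = [normalize_churn(v) for v in raw]
print(clean)
```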

Handle Outliers Intentionally

What it is: Extreme values that distort training.

Why it matters: A single value such as “Emp_Salary = 10,000,000” can skew what the model learns and throw off predictions.

Example:

Cap at the 99th percentile.

Flag as an anomaly instead of training on it.
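Both options can be sketched with the standard library. The salary values below are made up, and the "inclusive" percentile method is one choice among several; a different percentile definition would give a slightly different threshold.

```python
from statistics import quantiles

# Hypothetical salaries: 100 ordinary values plus one extreme entry.
salaries = [50_000 + 500 * i for i in range(100)] + [10_000_000]

# quantiles(..., n=100) returns 99 cut points; the last one is the 99th percentile.
p99 = quantiles(salaries, n=100, method="inclusive")[-1]

# Option 1: cap values above the threshold.
capped = [min(s, p99) for s in salaries]

# Option 2: flag the extreme rows and keep them out of training.
is_anomaly = [s > p99 for s in salaries]
training_salaries = [s for s, flag in zip(salaries, is_anomaly) if not flag]

print(p99, max(capped), sum(is_anomaly))
```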

Enforce Feature Types

What it is: Make sure data types match their meaning.

Why it matters: Models learn misleading patterns when a column’s type does not match its meaning.

Example:

customer_id stored as integer → model may treat it as numeric.

Why is that a problem? As a number, customer_id = 20 would carry more weight than customer_id = 1, even though an ID is only a label.

Convert to string (categorical).
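A small sketch of enforcing types from an explicit schema. The amount column, the SCHEMA mapping, and the helper enforce_types are assumptions added for illustration; only customer_id comes from the example above.

```python
# Hypothetical rows; the amount column is assumed for illustration.
rows = [
    {"customer_id": 1, "amount": "19.99"},
    {"customer_id": 20, "amount": "5.00"},
]

# Assumed schema: identifiers are categorical (str), amounts are numeric (float).
SCHEMA = {"customer_id": str, "amount": float}


def enforce_types(row, schema):
    """Cast each column to the type its meaning calls for."""
    return {col: schema[col](value) for col, value in row.items()}


typed = [enforce_types(r, SCHEMA) for r in rows]
print(typed[1])
```

Declaring the schema once, rather than casting columns ad hoc, means a column missing from the schema fails with a KeyError instead of slipping through with the wrong type.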

Standardize Categories

What it is: Inconsistent labels in categorical columns.

Why it matters: Model may treat the same thing as different classes.

Example:

Country: USA, U.S.A., United States.

Map all to United States.
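The mapping could look like the sketch below. The alias table holds only the spellings from the example, and the function name standardize_country is an assumption. Values not in the table are passed through unchanged so they can be reviewed later.

```python
# Alias table built from the spellings in the example; extend as needed.
COUNTRY_ALIASES = {
    "usa": "United States",
    "u.s.a.": "United States",
    "united states": "United States",
}


def standardize_country(value):
    """Return the canonical country name, or the trimmed input if unknown."""
    cleaned = value.strip()
    return COUNTRY_ALIASES.get(cleaned.lower(), cleaned)


countries = ["USA", "U.S.A.", "United States", " usa ", "Canada"]
standardized = [standardize_country(c) for c in countries]
print(sorted(set(standardized)))
```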

Normalize Text for ML

What it is: Clean and standardize text features.

Why it matters: Prevents the model from treating “Hello” and “hello!” as different.

Example:

Lowercasing, removing punctuation, stripping whitespace.

Keep a copy of raw text for audit.
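A minimal sketch of these steps, assuming ASCII punctuation only (curly quotes and other Unicode punctuation would need extra handling). The field names raw_text and clean_text are hypothetical. In addition to stripping, this version collapses repeated internal whitespace.

```python
import string

# Translation table that deletes ASCII punctuation characters.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize_text(text):
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    lowered = text.lower().translate(_PUNCT_TABLE)
    return " ".join(lowered.split())


# Keep the raw text next to the cleaned version for audit.
records = [{"raw_text": t} for t in ["Hello", "  hello! ", "Hello,   World."]]
for rec in records:
    rec["clean_text"] = normalize_text(rec["raw_text"])

print([r["clean_text"] for r in records])
```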

Protect Data Splits

What it is: Make sure related rows don’t leak between train/test.

Why it matters: Prevents test accuracy from being inflated by rows the model has effectively already seen.

Example:

Same student appears in both train and test sets.

Fix: Group by student_id when splitting.
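A group-aware split can be sketched with the standard library as below. The helper group_split, its parameters, and the exam records are assumptions for illustration. Libraries such as scikit-learn offer group-aware splitters (for example GroupShuffleSplit) that serve the same purpose.

```python
import random


def group_split(rows, group_key, test_fraction=0.25, seed=0):
    """Split rows so that every group lands entirely in train or in test."""
    groups = sorted({row[group_key] for row in rows})
    random.Random(seed).shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [r for r in rows if r[group_key] not in test_groups]
    test = [r for r in rows if r[group_key] in test_groups]
    return train, test


# Hypothetical exam records: each student appears three times.
rows = [
    {"student_id": sid, "score": 60 + sid + attempt}
    for sid in range(8)
    for attempt in range(3)
]

train, test = group_split(rows, "student_id")
train_ids = {r["student_id"] for r in train}
test_ids = {r["student_id"] for r in test}
print(len(train), len(test), train_ids & test_ids)
```

Because the split is decided per student rather than per row, no student can end up on both sides.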

#datacleaning #mlcleaning #normalize_data

Ver 0.3.6

Last change: 2025-12-02