Why Data Cleaning Matters
Data analysts spend 60-80% of their time cleaning and preparing data. Raw data is messy - it contains errors, missing values, duplicates, and inconsistencies that can lead to wrong conclusions.
# The Reality of Real-World Data
Raw Data Problems:
┌─────────────────────────────────────────────────────────────────┐
│ Issue │ Example │ Impact │
├─────────────────────────────────────────────────────────────────┤
│ Missing Values │ Age: NULL, N/A, "" │ Skewed stats │
│ Duplicates │ Same customer twice │ Inflated counts│
│ Inconsistent │ "USA", "US", "U.S.A" │ Wrong grouping │
│ Wrong Data Types │ "25" as text, not int │ Calc errors │
│ Outliers │ Age: 999 │ Skewed averages│
│ Invalid Values │ Date: 2024-13-45 │ Processing fail│
└─────────────────────────────────────────────────────────────────┘
"Garbage In, Garbage Out" - Clean data = Reliable insights
The Data Cleaning Process
# Data Cleaning Workflow
Step 1: EXPLORE
│ └── Understand your data first
▼
Step 2: HANDLE MISSING VALUES
│ └── Drop, fill, or flag nulls
▼
Step 3: REMOVE DUPLICATES
│ └── Identify and remove duplicate rows
▼
Step 4: FIX DATA TYPES
│ └── Convert to correct types
▼
Step 5: STANDARDIZE VALUES
│ └── Consistent formats and categories
▼
Step 6: HANDLE OUTLIERS
│ └── Detect and decide what to do
▼
Step 7: VALIDATE
└── Check data quality after cleaning
Step 1: Explore Your Data
Before cleaning, understand what you're working with:
import pandas as pd
import numpy as np
# Load your data
df = pd.read_csv('sales_data.csv')
# First look at the data
print(df.head()) # First 5 rows
print(df.tail()) # Last 5 rows
print(df.shape) # (rows, columns)
# Data types and missing values
print(df.info())
# Output:
# Column Non-Null Count Dtype
# ------ -------------- -----
# customer_id 1000 non-null int64
# name 998 non-null object ← 2 missing!
# age 950 non-null float64 ← 50 missing!
# country 1000 non-null object
# Summary statistics
print(df.describe())
# Shows: count, mean, std, min, 25%, 50%, 75%, max
# Check for missing values
print(df.isnull().sum())
# customer_id 0
# name 2
# age 50
# country 0
# Check for duplicates
print(f"Duplicates: {df.duplicated().sum()}")
# Unique values in categorical columns
print(df['country'].value_counts())
Step 2: Handle Missing Values
Missing values are the most common data quality issue. You have several options:
Option 1: Remove Rows with Missing Values
# Drop rows with ANY missing values
df_clean = df.dropna()
# Drop rows only if SPECIFIC columns are missing
df_clean = df.dropna(subset=['age', 'income'])
# Drop rows if ALL values are missing
df_clean = df.dropna(how='all')
# Drop rows if more than 2 values are missing
df_clean = df.dropna(thresh=len(df.columns) - 2)
Option 2: Fill Missing Values
# Fill with a constant value
df['age'].fillna(0, inplace=True)
df['country'].fillna('Unknown', inplace=True)
# Fill with mean (for numerical data)
df['age'].fillna(df['age'].mean(), inplace=True)
# Fill with median (better for skewed data)
df['income'].fillna(df['income'].median(), inplace=True)
# Fill with mode (for categorical data)
df['country'].fillna(df['country'].mode()[0], inplace=True)
# Forward fill (use previous value) - good for time series
df['stock_price'].fillna(method='ffill', inplace=True)
# Backward fill (use next value)
df['stock_price'].fillna(method='bfill', inplace=True)
# Fill with group-specific values
df['age'] = df.groupby('country')['age'].transform(
lambda x: x.fillna(x.mean())
)
Option 3: Flag Missing Values
# Create a flag column before filling
df['age_missing'] = df['age'].isnull().astype(int)
# Now fill the value
df['age'].fillna(df['age'].median(), inplace=True)
# You can use the flag in analysis later
# to see if "missingness" affects outcomes
When to Use Each Approach
┌──────────────────────────────────────────────────────────────────┐
│ Approach │ When to Use │
├──────────────────────────────────────────────────────────────────┤
│ DROP rows │ • Few missing values (<5%) │
│ │ • Missing values are random │
│ │ • You have plenty of data │
├──────────────────────────────────────────────────────────────────┤
│ FILL mean │ • Numerical data │
│ │ • Data is normally distributed │
│ │ • Missing values are random │
├──────────────────────────────────────────────────────────────────┤
│ FILL median │ • Numerical data with outliers │
│ │ • Skewed distributions │
├──────────────────────────────────────────────────────────────────┤
│ FILL mode │ • Categorical data │
│ │ • One category dominates │
├──────────────────────────────────────────────────────────────────┤
│ FLAG + fill │ • "Missingness" might be meaningful │
│ │ • You want to analyze patterns │
└──────────────────────────────────────────────────────────────────┘
Step 3: Remove Duplicates
# Check for duplicate rows
print(f"Total rows: {len(df)}")
print(f"Duplicate rows: {df.duplicated().sum()}")
# View the duplicates
print(df[df.duplicated(keep=False)]) # Shows all duplicates
# Remove exact duplicates (keep first occurrence)
df_clean = df.drop_duplicates()
# Remove duplicates based on specific columns
df_clean = df.drop_duplicates(subset=['email'])
df_clean = df.drop_duplicates(subset=['customer_id', 'order_date'])
# Keep last occurrence instead of first
df_clean = df.drop_duplicates(subset=['email'], keep='last')
# Remove ALL duplicates (don't keep any)
df_clean = df.drop_duplicates(subset=['email'], keep=False)
Fuzzy Duplicates
# Sometimes duplicates aren't exact matches
# "John Smith" vs "john smith" vs "JOHN SMITH"
# Standardize before checking
df['name_clean'] = df['name'].str.lower().str.strip()
df_clean = df.drop_duplicates(subset=['name_clean'])
# For advanced fuzzy matching
from fuzzywuzzy import fuzz
def find_similar_names(df, column, threshold=80):
"""Find names that are similar but not exact matches."""
names = df[column].unique()
similar_pairs = []
for i, name1 in enumerate(names):
for name2 in names[i+1:]:
score = fuzz.ratio(str(name1), str(name2))
if score >= threshold:
similar_pairs.append((name1, name2, score))
return similar_pairs
# Find potential duplicates
similar = find_similar_names(df, 'company_name')
print(similar)
# [('Microsoft Corp', 'Microsoft Corporation', 92), ...]
Step 4: Fix Data Types
# Check current data types
print(df.dtypes)
# Common conversions
# String to numeric
df['price'] = pd.to_numeric(df['price'], errors='coerce') # Invalid → NaN
df['quantity'] = df['quantity'].astype(int)
# String to datetime
df['order_date'] = pd.to_datetime(df['order_date'])
df['order_date'] = pd.to_datetime(df['order_date'], format='%d/%m/%Y')
# Handle date errors
df['date'] = pd.to_datetime(df['date'], errors='coerce') # Invalid → NaT
# Numeric to category (saves memory, enables ordering)
df['rating'] = df['rating'].astype('category')
df['size'] = pd.Categorical(df['size'],
categories=['S', 'M', 'L', 'XL'],
ordered=True)
# Boolean conversion
df['is_active'] = df['status'].map({'active': True, 'inactive': False})
# String cleaning before conversion
df['phone'] = df['phone'].str.replace(r'[^0-9]', '', regex=True)
df['phone'] = df['phone'].astype('Int64') # Nullable integer
Step 5: Standardize Values
Text Standardization
# Consistent case
df['name'] = df['name'].str.lower()
df['name'] = df['name'].str.title() # "John Smith"
df['name'] = df['name'].str.upper()
# Remove extra whitespace
df['name'] = df['name'].str.strip() # Leading/trailing
df['name'] = df['name'].str.replace(r'\s+', ' ', regex=True) # Multiple spaces
# Fix inconsistent categories
# Before: ['USA', 'US', 'U.S.A', 'United States', 'usa']
country_mapping = {
'USA': 'United States',
'US': 'United States',
'U.S.A': 'United States',
'U.S.A.': 'United States',
'usa': 'United States'
}
df['country'] = df['country'].replace(country_mapping)
# Or use a function
def standardize_country(country):
country = str(country).strip().upper()
if country in ['USA', 'US', 'U.S.A', 'U.S.A.', 'UNITED STATES']:
return 'United States'
elif country in ['UK', 'U.K.', 'UNITED KINGDOM', 'GREAT BRITAIN']:
return 'United Kingdom'
return country.title()
df['country'] = df['country'].apply(standardize_country)
Date Standardization
# Dates come in many formats
# '2024-01-15', '01/15/2024', '15-Jan-2024', 'January 15, 2024'
# Parse various formats
df['date'] = pd.to_datetime(df['date'], infer_datetime_format=True)
# Standardize to one format
df['date_str'] = df['date'].dt.strftime('%Y-%m-%d')
# Extract useful parts
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day_of_week'] = df['date'].dt.day_name()
df['quarter'] = df['date'].dt.quarter
Numerical Standardization
# Remove currency symbols and convert
df['price'] = df['price'].str.replace(r'[$,]', '', regex=True).astype(float)
# Standardize units (e.g., all weights in kg)
def convert_to_kg(value, unit):
if unit == 'lb':
return value * 0.453592
elif unit == 'g':
return value / 1000
return value
df['weight_kg'] = df.apply(
lambda x: convert_to_kg(x['weight'], x['unit']), axis=1
)
# Round to consistent decimal places
df['price'] = df['price'].round(2)
Step 6: Handle Outliers
Outliers are extreme values that can skew your analysis.
Detecting Outliers
import numpy as np
# Method 1: IQR (Interquartile Range)
Q1 = df['age'].quantile(0.25)
Q3 = df['age'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = df[(df['age'] < lower_bound) | (df['age'] > upper_bound)]
print(f"Outliers found: {len(outliers)}")
# Method 2: Z-Score (for normally distributed data)
from scipy import stats
z_scores = np.abs(stats.zscore(df['age'].dropna()))
outliers = df[z_scores > 3] # More than 3 standard deviations
# Method 3: Visual inspection
import matplotlib.pyplot as plt
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
df['age'].hist(ax=axes[0], bins=50)
axes[0].set_title('Distribution')
df.boxplot(column='age', ax=axes[1])
axes[1].set_title('Box Plot')
plt.show()
Handling Outliers
# Option 1: Remove outliers
df_clean = df[(df['age'] >= lower_bound) & (df['age'] <= upper_bound)]
# Option 2: Cap outliers (Winsorization)
df['age_capped'] = df['age'].clip(lower=lower_bound, upper=upper_bound)
# Option 3: Replace with NaN and handle as missing
df.loc[df['age'] > upper_bound, 'age'] = np.nan
df.loc[df['age'] < lower_bound, 'age'] = np.nan
# Option 4: Log transformation (reduces impact of outliers)
df['income_log'] = np.log1p(df['income']) # log(1+x) handles zeros
# Option 5: Flag outliers for separate analysis
df['is_outlier'] = ((df['age'] < lower_bound) |
(df['age'] > upper_bound)).astype(int)
When to Keep Outliers
┌─────────────────────────────────────────────────────────────────┐
│ KEEP OUTLIERS WHEN: │ REMOVE OUTLIERS WHEN: │
├─────────────────────────────────────────────────────────────────┤
│ • They're valid data points │ • They're data entry errors │
│ • They represent real events │ • They're impossible values │
│ • You're studying extremes │ • They skew your analysis │
│ • Your model handles them │ • Small sample size │
│ │ │
│ Example: A billionaire in │ Example: Age = 999 │
│ income data is real │ is clearly an error │
└─────────────────────────────────────────────────────────────────┘
Step 7: Validate Your Clean Data
# Create a data quality report
def data_quality_report(df):
"""Generate a data quality report."""
report = pd.DataFrame({
'Column': df.columns,
'Type': df.dtypes.values,
'Non-Null': df.count().values,
'Null': df.isnull().sum().values,
'Null %': (df.isnull().sum() / len(df) * 100).round(2).values,
'Unique': df.nunique().values,
'Sample': [df[col].iloc[0] if len(df) > 0 else None for col in df.columns]
})
return report
# Before cleaning
print("BEFORE CLEANING:")
print(data_quality_report(df_raw))
# After cleaning
print("\nAFTER CLEANING:")
print(data_quality_report(df_clean))
# Validation checks
def validate_data(df):
"""Run validation checks on cleaned data."""
issues = []
# Check for remaining nulls
null_counts = df.isnull().sum()
if null_counts.any():
issues.append(f"Remaining nulls: {null_counts[null_counts > 0].to_dict()}")
# Check for duplicates
dup_count = df.duplicated().sum()
if dup_count > 0:
issues.append(f"Remaining duplicates: {dup_count}")
# Check value ranges
if 'age' in df.columns:
invalid_age = df[(df['age'] < 0) | (df['age'] > 120)]
if len(invalid_age) > 0:
issues.append(f"Invalid ages: {len(invalid_age)}")
# Check date ranges
if 'date' in df.columns:
future_dates = df[df['date'] > pd.Timestamp.now()]
if len(future_dates) > 0:
issues.append(f"Future dates: {len(future_dates)}")
if issues:
print("VALIDATION ISSUES:")
for issue in issues:
print(f" - {issue}")
else:
print("All validations passed!")
return len(issues) == 0
validate_data(df_clean)
Data Cleaning in Excel
Not everything needs Python. Excel has powerful cleaning tools:
Excel Data Cleaning Features:
1. REMOVE DUPLICATES
Data → Remove Duplicates → Select columns
2. TEXT FUNCTIONS
=TRIM(A1) Remove extra spaces
=UPPER(A1) Convert to uppercase
=LOWER(A1) Convert to lowercase
=PROPER(A1) Title case
=CLEAN(A1) Remove non-printable characters
=SUBSTITUTE(A1,"old","new") Replace text
3. FIND & REPLACE
Ctrl+H → Find and Replace
4. TEXT TO COLUMNS
Data → Text to Columns → Split by delimiter
5. FLASH FILL (Ctrl+E)
Excel learns your pattern and fills automatically
6. HANDLING ERRORS
=IFERROR(A1/B1, 0) Replace errors with 0
=IFNA(VLOOKUP(...), "Not Found")
7. DATE FIXES
=DATEVALUE(A1) Convert text to date
=TEXT(A1,"YYYY-MM-DD") Standardize format
8. CONDITIONAL FORMATTING
Home → Conditional Formatting → Highlight duplicates
Data Cleaning Best Practices
- Document everything: Keep a log of what you changed and why
- Never modify raw data: Always work on a copy
- Automate with scripts: Make cleaning reproducible
- Validate early, validate often: Check after each step
- Understand your data: Talk to data owners about edge cases
- Be consistent: Apply the same rules throughout
- Handle edge cases explicitly: Don't let them slip through
# Template: Reusable Data Cleaning Pipeline
class DataCleaner:
def __init__(self, df):
self.df = df.copy()
self.log = []
def log_action(self, action, details):
self.log.append(f"{action}: {details}")
def remove_duplicates(self, subset=None):
before = len(self.df)
self.df = self.df.drop_duplicates(subset=subset)
removed = before - len(self.df)
self.log_action("Remove duplicates", f"Removed {removed} rows")
return self
def fill_missing(self, column, method='median'):
nulls = self.df[column].isnull().sum()
if method == 'median':
self.df[column].fillna(self.df[column].median(), inplace=True)
elif method == 'mean':
self.df[column].fillna(self.df[column].mean(), inplace=True)
elif method == 'mode':
self.df[column].fillna(self.df[column].mode()[0], inplace=True)
self.log_action(f"Fill {column}", f"Filled {nulls} nulls with {method}")
return self
def standardize_text(self, column):
self.df[column] = self.df[column].str.strip().str.lower()
self.log_action(f"Standardize {column}", "Trimmed and lowercased")
return self
def get_clean_data(self):
print("Cleaning Log:")
for entry in self.log:
print(f" - {entry}")
return self.df
# Usage
cleaner = DataCleaner(df)
df_clean = (cleaner
.remove_duplicates(subset=['email'])
.fill_missing('age', method='median')
.fill_missing('country', method='mode')
.standardize_text('name')
.get_clean_data()
)
Master Data Analytics
Our Data Analytics program covers data cleaning, analysis, and visualization with hands-on projects using real-world datasets.
Explore Data Analytics Program