Why Statistics Matter in Analytics
Statistics is the foundation of data analytics. It provides the tools to understand data, identify patterns, test hypotheses, and make data-driven decisions with confidence.
A solid grasp of statistics helps you avoid common pitfalls, communicate findings effectively, and draw meaningful conclusions from data.
Descriptive Statistics
Measures of Central Tendency
─────────────────────────────
Mean (Average): Sum of values / Count
Example: (10+20+30)/3 = 20
Median: Middle value when sorted
Example: [10, 20, 30] → 20
Even count: average of two middle values
Mode: Most frequent value
Example: [1, 2, 2, 3] → 2
When to Use:
• Mean: Symmetric data, no extreme outliers
• Median: Skewed data, with outliers
• Mode: Categorical data, finding most common
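The guidance above can be checked on a small sample. The sketch below uses Python's standard `statistics` module; the order values are made up for illustration, with one deliberate outlier to show how far the mean moves while the median stays put.

```python
import statistics

# Hypothetical order values; the last one is an extreme outlier.
orders = [10, 20, 20, 30, 40, 500]

mean_value = statistics.mean(orders)      # pulled upward by the outlier
median_value = statistics.median(orders)  # average of the two middle values
mode_value = statistics.mode(orders)      # most frequent value

print(f"mean={mean_value:.2f} median={median_value} mode={mode_value}")
```

Here the mean lands above 100 even though five of the six values are 40 or less, which is why the median is the safer summary for skewed data.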
Measures of Spread
─────────────────────────────
Range: Max - Min
Simple but sensitive to outliers
Variance: Average of squared deviations from mean
σ² = Σ(x - μ)² / N (population; for a sample, s² = Σ(x - x̄)² / (n - 1))
Standard Deviation: Square root of variance
σ = √variance
Same unit as original data
IQR (Interquartile Range):
Q3 - Q1 (75th - 25th percentile)
Robust to outliers
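As a rough illustration of the spread measures above, the sketch below computes each one with the standard `statistics` module. The data is hypothetical, and the quartiles use the module's "inclusive" interpolation, which is only one of several conventions.

```python
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data with mean 5

value_range = max(values) - min(values)
pop_variance = statistics.pvariance(values)    # divides by N
pop_std = statistics.pstdev(values)            # square root of the population variance
sample_variance = statistics.variance(values)  # divides by n - 1

# Quartiles under the "inclusive" convention; other conventions differ slightly.
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

print(value_range, pop_variance, pop_std, round(sample_variance, 3), iqr)
```

The sample variance is always a little larger than the population variance for the same data because it divides by n - 1 instead of N.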
Percentiles & Quartiles
Percentiles
─────────────────────────────
P10: 10% of data falls below this value
P50: Median (50th percentile)
P90: 90% of data falls below this value
P99: 99% of data falls below this value (used for identifying extreme values)
Quartiles
─────────────────────────────
Q1 (25th percentile): Lower quartile
Q2 (50th percentile): Median
Q3 (75th percentile): Upper quartile
Five-Number Summary
─────────────────────────────
Min, Q1, Median, Q3, Max
→ Used for box plots
Example:
Data: [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
Min = 2
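Q1 = 6 (25th percentile, median of the lower half)
Median = 11 (50th percentile)
Q3 = 16 (75th percentile, median of the upper half)
Max = 20
IQR = Q3 - Q1 = 16 - 6 = 10
Note: quartile conventions differ; linear interpolation gives Q1 = 6.5 and Q3 = 15.5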
Outlier Detection (IQR Method):
Lower fence: Q1 - 1.5 × IQR
Upper fence: Q3 + 1.5 × IQR
Values outside fences = potential outliers
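The fence rule above can be sketched as a small helper. This is a minimal version, assuming the "inclusive" quartile convention from Python's `statistics` module; the function name `iqr_outliers` and the extra value 60 appended to the example data are illustrative choices, not part of the original example.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return (lower_fence, upper_fence, outliers) using the IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return lower, upper, outliers

# The example data plus one hypothetical extreme value.
data = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 60]
lower_fence, upper_fence, flagged = iqr_outliers(data)

print(lower_fence, upper_fence, flagged)
```

Because quartile conventions differ between tools, the exact fence positions can shift slightly, but a value as far out as 60 is flagged under any of them.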
Probability Basics
Probability Rules
─────────────────────────────
P(A) = Number of favorable outcomes / Total outcomes
0 ≤ P(A) ≤ 1
Complement Rule:
P(not A) = 1 - P(A)
Addition Rule:
P(A or B) = P(A) + P(B) - P(A and B)
Multiplication Rule (Independent):
P(A and B) = P(A) × P(B)
Conditional Probability:
P(A|B) = P(A and B) / P(B)
"Probability of A given B occurred"
Example: Customer Conversion
─────────────────────────────
1000 visitors
200 clicked ad (P = 0.20)
40 clicked and then purchased (P = 0.04)
P(purchase | clicked) = 40/200 = 0.20
"Given they clicked, 20% probability of purchase"
Bayes' Theorem
─────────────────────────────
P(A|B) = P(B|A) × P(A) / P(B)
Example: Spam Filter
P(Spam|Contains "free") =
P("free"|Spam) × P(Spam) / P("free")
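Both the customer conversion example and Bayes' theorem can be checked numerically. In the sketch below, the conversion counts come from the example, under the assumption that all 40 purchasers had clicked the ad. The spam filter probabilities are invented purely for illustration, and `bayes_posterior` is a hypothetical helper name.

```python
# Customer conversion example: counts taken from the text.
visitors = 1000
clicked = 200
clicked_and_purchased = 40  # assumption: all 40 purchasers had clicked the ad

p_click = clicked / visitors
p_click_and_purchase = clicked_and_purchased / visitors
# P(A|B) = P(A and B) / P(B)
p_purchase_given_click = p_click_and_purchase / p_click

def bayes_posterior(likelihood, prior, evidence):
    """P(A|B) = P(B|A) * P(A) / P(B)."""
    return likelihood * prior / evidence

# Spam filter: every number below is illustrative, not measured.
p_spam = 0.30             # P(Spam)
p_free_given_spam = 0.60  # P("free" | Spam)
p_free_given_ham = 0.05   # P("free" | not Spam)

# P("free") via the law of total probability.
p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)
p_spam_given_free = bayes_posterior(p_free_given_spam, p_spam, p_free)

print(round(p_purchase_given_click, 2), round(p_spam_given_free, 3))
```

With these made-up inputs, seeing "free" raises the spam probability from the 30% prior to roughly 84%.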
Probability Distributions
Normal Distribution (Gaussian)
─────────────────────────────
• Bell-shaped, symmetric
• Defined by mean (μ) and std dev (σ)
• About 68% of data within 1 σ of mean
• About 95% of data within 2 σ of mean
• About 99.7% of data within 3 σ of mean
Use: Heights, test scores, measurement errors
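The 68/95/99.7 figures can be verified from the normal CDF. A minimal sketch using `statistics.NormalDist` from the standard library; `share_within` is a hypothetical helper name.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Share of a normal distribution lying within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

coverage = {k: share_within(k) for k in (1, 2, 3)}
for k, share in coverage.items():
    print(f"within {k} sigma: {share:.4f}")
```

The exact shares are slightly different from the rounded rule of thumb (for example, about 95.45% within 2 σ), which is why 1.96 rather than 2 is used for a 95% interval.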
Binomial Distribution
─────────────────────────────
• Fixed number of trials (n)
• Two outcomes: success/failure
• Probability of success (p) is constant
• Formula: P(k) = C(n,k) × p^k × (1-p)^(n-k)
Use: Conversion rates, defect rates, A/B tests
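The binomial formula translates directly into code. The sketch below assumes a hypothetical scenario of 20 visitors with a 5% conversion probability each; `binomial_pmf` is an illustrative name.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(k successes in n trials) = C(n,k) * p^k * (1-p)^(n-k)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n_visitors, p_convert = 20, 0.05  # hypothetical values

p_exactly_two = binomial_pmf(2, n_visitors, p_convert)
p_at_least_one = 1 - binomial_pmf(0, n_visitors, p_convert)
total_probability = sum(binomial_pmf(k, n_visitors, p_convert) for k in range(n_visitors + 1))

print(round(p_exactly_two, 4), round(p_at_least_one, 4))
```

Summing the formula over every possible k gives 1, which is a quick sanity check on any hand-written probability function.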
Poisson Distribution
─────────────────────────────
• Events occurring in fixed interval
• Events are independent
• Average rate (λ) is known
• Formula: P(k) = (λ^k × e^-λ) / k!
Use: Customer arrivals, website visits, call volume
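The Poisson formula can be evaluated the same way. This sketch assumes a hypothetical call center averaging λ = 3 calls per interval; `poisson_pmf` is an illustrative name.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(k events) = lam^k * e^(-lam) / k!"""
    return lam**k * exp(-lam) / factorial(k)

rate = 3  # hypothetical average number of calls per interval

p_zero_calls = poisson_pmf(0, rate)
p_at_most_two = sum(poisson_pmf(k, rate) for k in range(3))
p_more_than_two = 1 - p_at_most_two

print(round(p_zero_calls, 4), round(p_at_most_two, 4), round(p_more_than_two, 4))
```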
Exponential Distribution
─────────────────────────────
• Time between events in Poisson process
• Memoryless property
Use: Time between purchases, system failures
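The memoryless property says that having already waited some time does not change the distribution of the remaining wait. A small numeric check, assuming a hypothetical rate of 0.5 events per unit time; `exp_survival` is an illustrative name.

```python
from math import exp

def exp_survival(t, rate):
    """P(T > t) for an exponential waiting time with the given rate."""
    return exp(-rate * t)

rate = 0.5  # hypothetical events per unit time

# P(T > 5 | T > 3) = P(T > 5) / P(T > 3)
p_conditional = exp_survival(5, rate) / exp_survival(3, rate)
# P(T > 2) for a fresh wait of the same extra length
p_fresh = exp_survival(2, rate)

print(round(p_conditional, 6), round(p_fresh, 6))
```

The two probabilities match: after waiting 3 units, the chance of waiting at least 2 more is the same as the chance of waiting at least 2 from the start.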
Sampling & Estimation
Sampling Methods
─────────────────────────────
Simple Random: Each member has equal chance
Stratified: Divide into groups, sample from each
Cluster: Sample entire groups randomly
Systematic: Every nth member
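The four methods can be sketched with the standard `random` module. The population, the `region` field, the 10% sampling fraction, and the cluster size of 10 are all assumptions made for this illustration.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100 members, 20 in "north" and 80 in "south".
population = [
    {"id": i, "region": "north" if i % 5 == 0 else "south"} for i in range(100)
]

# Simple random: every member has the same chance of selection.
simple_sample = rng.sample(population, 10)

# Systematic: every 10th member, starting from a random offset.
start = rng.randrange(10)
systematic_sample = population[start::10]

# Stratified: split by region, then sample 10% within each group.
strata = {}
for member in population:
    strata.setdefault(member["region"], []).append(member)
stratified_sample = []
for members in strata.values():
    stratified_sample.extend(rng.sample(members, len(members) // 10))

# Cluster: split into groups of 10, pick 2 groups, keep every member of them.
clusters = [population[i:i + 10] for i in range(0, len(population), 10)]
cluster_sample = [m for cluster in rng.sample(clusters, 2) for m in cluster]

print(len(simple_sample), len(systematic_sample), len(stratified_sample), len(cluster_sample))
```

Stratified sampling guarantees that the small "north" group is represented in proportion, which simple random sampling does not.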
Sample Size Considerations
─────────────────────────────
Larger sample → More precise estimates
Diminishing returns after certain point
Budget and time constraints matter
Standard Error
─────────────────────────────
SE = σ / √n
Standard error decreases as sample size increases
Measures precision of sample mean
Confidence Intervals
─────────────────────────────
"We are 95% confident the true value lies in this range"
95% CI for mean = x̄ ± 1.96 × (σ/√n)
Example:
Sample mean = 50
Standard error = 2
95% CI = 50 ± 1.96 × 2 = [46.08, 53.92]
Common Confidence Levels:
90% CI: ± 1.645 × SE
95% CI: ± 1.960 × SE
99% CI: ± 2.576 × SE
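The interval formula can be wrapped in a small function that looks up the multiplier instead of hard-coding 1.96. This is a sketch assuming a known population standard deviation (a z-interval); the values σ = 20 and n = 100 are chosen only so that the standard error equals 2, as in the example.

```python
from math import sqrt
from statistics import NormalDist

def mean_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """z-based confidence interval for a mean with known sigma."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # e.g. about 1.96 for 95%
    margin = z * sigma / sqrt(n)                         # z times the standard error
    return sample_mean - margin, sample_mean + margin

# sigma = 20 and n = 100 give SE = 20 / 10 = 2.
low, high = mean_confidence_interval(50, sigma=20, n=100)
low_99, high_99 = mean_confidence_interval(50, sigma=20, n=100, confidence=0.99)

print(round(low, 2), round(high, 2))
print(round(low_99, 2), round(high_99, 2))
```

Raising the confidence level widens the interval, since a larger multiplier is needed to cover more of the sampling distribution.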
Hypothesis Testing
Hypothesis Testing Framework
─────────────────────────────
1. State hypotheses
H₀ (Null): No effect/difference
H₁ (Alternative): There is an effect
2. Choose significance level (α)
Common: 0.05 (5%), 0.01 (1%)
3. Calculate test statistic
4. Find p-value
5. Make decision
p ≤ α → Reject H₀
p > α → Fail to reject H₀
P-Value Interpretation
─────────────────────────────
p-value: Probability of observing results
at least as extreme as the sample,
assuming H₀ is true
p = 0.03 means: "If there's truly no effect,
there's only a 3% chance of seeing results at least this extreme"
Common Tests
─────────────────────────────
Z-test: Compare mean to known value
(known population std dev)
t-test: Compare means
(unknown population std dev)
Chi-square: Test categorical associations
ANOVA: Compare means across 3+ groups
F-test: Compare variances
Example: A/B Test
─────────────────────────────
H₀: Conversion rate A = Conversion rate B
H₁: Conversion rate A ≠ Conversion rate B
Control: 1000 visitors, 50 conversions (5%)
Test: 1000 visitors, 65 conversions (6.5%)
Calculate z-statistic (z ≈ 1.44) → p-value ≈ 0.15
p > 0.05 → Fail to reject H₀
Conclusion: Not enough evidence of difference
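The calculation behind this example can be sketched as a pooled two-proportion z-test. That is one common choice for comparing conversion rates; the text does not say which test it used, so treat the function below (with its hypothetical name `two_proportion_z_test`) as an assumption.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided pooled z-test for a difference between two proportions."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)  # shared rate under H0
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Counts from the example: control 50/1000, test 65/1000.
z_stat, p_value = two_proportion_z_test(50, 1000, 65, 1000)
alpha = 0.05
decision = "reject H0" if p_value <= alpha else "fail to reject H0"

print(f"z={z_stat:.2f} p={p_value:.3f} -> {decision}")
```

With these counts the p-value comes out above 0.05, so the decision is to fail to reject H₀, in line with the conclusion above.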
Correlation & Regression
Correlation
─────────────────────────────
Measures linear relationship between two variables
Pearson Correlation (r):
• Range: -1 to +1
• r = +1: Perfect positive correlation
• r = -1: Perfect negative correlation
• r = 0: No linear correlation
Interpretation:
|r| < 0.3: Weak
0.3 ≤ |r| < 0.7: Moderate
|r| ≥ 0.7: Strong
⚠️ Correlation ≠ Causation!
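Pearson's r can be computed from first principles as the covariance scaled by both standard deviations. The sketch below uses the small `x` and `y` lists from the Python section of this guide; `pearson_r` is an illustrative name.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation: co-variation divided by the product of the spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = pearson_r(x, y)
r_perfect = pearson_r([1, 2, 3], [2, 4, 6])  # exactly linear data

print(round(r, 4), round(r_perfect, 4))
```

By the thresholds above, the first result (about 0.77) counts as a strong positive correlation.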
Simple Linear Regression
─────────────────────────────
Y = β₀ + β₁X + ε
β₀: Intercept (Y when X = 0)
β₁: Slope (change in Y per unit X)
ε: Error term
Example:
Sales = 1000 + 50 × (Ad Spend in $1000s)
Interpretation: Each additional $1000 in ad spend
is associated with $50 more in sales
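The intercept and slope come from ordinary least squares. A minimal sketch: the data points below are generated exactly from Sales = 1000 + 50 × Ad Spend, so the fit should recover those coefficients. Real data would include noise, and `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

ad_spend = [1, 2, 3, 4, 5]              # in $1000s, hypothetical
sales = [1050, 1100, 1150, 1200, 1250]  # exactly 1000 + 50 * ad_spend

intercept, slope = fit_line(ad_spend, sales)
predicted_at_6 = intercept + slope * 6

print(intercept, slope, predicted_at_6)
```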
R-squared (R²)
─────────────────────────────
• Proportion of variance explained by model
• Range: 0 to 1
• R² = 0.75 → Model explains 75% of variance
Adjusted R²:
• Penalizes adding unnecessary variables
• Use for multiple regression
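R² compares the model's squared errors with those of simply predicting the mean. The sketch below uses the `x` and `y` lists from the Python section of this guide; the predicted values are what the least-squares line y = 2.2 + 0.6x gives for x = 1..5. The adjusted R² formula shown is the standard one, with k as the number of predictors.

```python
def r_squared(actual, predicted):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n, k):
    """Penalize R² for the number of predictors k, given n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

actual = [2, 4, 5, 4, 5]
predicted = [2.8, 3.4, 4.0, 4.6, 5.2]  # from y = 2.2 + 0.6x for x = 1..5

r2 = r_squared(actual, predicted)
adj_r2 = adjusted_r_squared(r2, n=len(actual), k=1)

print(round(r2, 4), round(adj_r2, 4))
```

With only five observations the adjustment is noticeable; it shrinks as the sample grows relative to the number of predictors.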
Multiple Regression
─────────────────────────────
Y = β₀ + β₁X₁ + β₂X₂ + ... + βₙXₙ + ε
Example:
Sales = 500 + 40×(Ad_Spend) + 30×(Season) - 5×(Competitor_Price)
Each coefficient shows effect holding others constant
Statistical Significance vs Practical Significance
Statistical Significance
─────────────────────────────
• p ≤ α (commonly α = 0.05)
• The observed effect would be unlikely if there were truly no effect
• Large samples can make tiny effects significant
Practical Significance
─────────────────────────────
• Is the effect size meaningful for business?
• Consider the real-world impact
• Cost-benefit analysis
Effect Size Measures
─────────────────────────────
Cohen's d: Difference in means / pooled std dev
• Small: 0.2
• Medium: 0.5
• Large: 0.8
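Cohen's d can be computed with the pooled sample standard deviation. The sketch below uses the `group1` and `group2` lists from the Python section of this guide; `cohens_d` is an illustrative name, and the pooling formula assumes the two groups have similar variances.

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(group_a, group_b):
    """Difference in means divided by the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_b) - mean(group_a)) / sqrt(pooled_var)

group1 = [23, 25, 28, 24, 26]
group2 = [30, 32, 29, 31, 33]
d = cohens_d(group1, group2)

print(round(d, 2))
```

A value this far above 0.8 indicates a very large effect: the two groups barely overlap.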
Example
─────────────────────────────
A/B Test Result:
• Control: 5.0% conversion
• Test: 5.1% conversion
• p-value: ≈ 0.001 (statistically significant)
• Sample: 1,000,000 users each
Statistically significant? Yes
Practically significant?
• 0.1 percentage point improvement = 1000 more conversions per 1,000,000 users
• But implementation cost = $100,000
• Value of 1000 conversions = $50,000
• Not worth implementing!
Always consider both statistical AND practical significance.
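The cost-benefit check in the example is simple arithmetic and easy to script. In this sketch the $50 value per conversion is inferred from the stated $50,000 for 1000 conversions; the variable names are illustrative.

```python
users = 1_000_000
control_rate, test_rate = 0.050, 0.051
value_per_conversion = 50      # dollars, inferred from $50,000 / 1000 conversions
implementation_cost = 100_000  # dollars

extra_conversions = round(users * (test_rate - control_rate))
added_value = extra_conversions * value_per_conversion
net_benefit = added_value - implementation_cost
worth_implementing = net_benefit > 0

print(extra_conversions, added_value, net_benefit, worth_implementing)
```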
Statistics in Python
import numpy as np
import pandas as pd
from scipy import stats
# Descriptive Statistics
data = [23, 45, 67, 32, 89, 12, 45, 67, 34, 56]
np.mean(data) # 47.0
np.median(data) # 45.0
np.std(data) # 22.02 (population std, ddof=0; pass ddof=1 for the sample std)
np.percentile(data, 75) # 64.25 (default linear interpolation)
# Using Pandas
df = pd.DataFrame({'values': data})
df.describe() # Full summary
# Hypothesis Testing
# One-sample t-test
sample = [52, 48, 55, 51, 49, 53, 50, 52]
t_stat, p_value = stats.ttest_1samp(sample, 50)
# Two-sample t-test
group1 = [23, 25, 28, 24, 26]
group2 = [30, 32, 29, 31, 33]
t_stat, p_value = stats.ttest_ind(group1, group2)
# Correlation
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
correlation, p_value = stats.pearsonr(x, y)
# Chi-square test
observed = [[50, 30], [35, 85]]
chi2, p, dof, expected = stats.chi2_contingency(observed)
# Linear Regression
from scipy.stats import linregress
slope, intercept, r_value, p_value, std_err = linregress(x, y)
Common Pitfalls
- Confusing correlation with causation: Correlation shows relationship, not cause
- P-hacking: Running many tests and reporting only the ones that reach significance
- Ignoring sample size: Small samples can be misleading
- Simpson's Paradox: Aggregated data can show opposite trend from disaggregated
- Survivorship bias: Only analyzing survivors, not failures
- Base rate neglect: Ignoring overall probability when interpreting results
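Simpson's Paradox is the least intuitive item on this list, so a small numeric illustration may help. All counts below are made up: variant A converts better within each device segment, yet B looks better once the segments are pooled, because the variants were shown to very different mixes of devices.

```python
# (conversions, visitors) per segment and variant; all counts are made up.
results = {
    "desktop": {"A": (90, 100), "B": (800, 1000)},
    "mobile": {"A": (300, 1000), "B": (20, 100)},
}

def rate(conversions, visitors):
    return conversions / visitors

# Winner inside each segment.
segment_winner = {
    segment: max(variants, key=lambda v: rate(*variants[v]))
    for segment, variants in results.items()
}

# Pool the segments and compare again.
totals = {}
for variants in results.values():
    for variant, (conv, vis) in variants.items():
        prev_conv, prev_vis = totals.get(variant, (0, 0))
        totals[variant] = (prev_conv + conv, prev_vis + vis)

overall_winner = max(totals, key=lambda v: rate(*totals[v]))

print(segment_winner, overall_winner)
```

Checking results by segment before trusting an aggregate comparison is a simple guard against this trap.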