Statistics Basics
1. Measures of Central Tendency
- Mean (x̄) = (Sum of all values) ÷ (Number of values)
- Median = Middle value when the data is sorted (average of the two middle values if the count is even)
- Mode = Value that occurs most often
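A quick check of the three definitions above, using Python's standard `statistics` module. The data list is made up purely for illustration.

```python
import statistics

# Hypothetical example data, chosen only for illustration.
values = [2, 3, 3, 5, 7, 10]

mean = statistics.mean(values)      # sum of all values / number of values
median = statistics.median(values)  # even count, so the average of the two middle values (3 and 5)
mode = statistics.mode(values)      # value that occurs most often

print("mean:", mean, "median:", median, "mode:", mode)
```

Note that the median (4) is not itself in the data here, because the list has an even number of values.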
---
2. Measures of Dispersion
Variance (σ^2 or s^2)
Population: σ^2 = ( Σ(xi - μ)^2 ) ÷ N
Sample: s^2 = ( Σ(xi - x̄)^2 ) ÷ (n - 1)
Standard Deviation (σ or s)
Population: σ = √( Σ(xi - μ)^2 ÷ N )
Sample: s = √( Σ(xi - x̄)^2 ÷ (n - 1) )
→ Meaning: Both tell how far values spread around the mean. The standard deviation is in the same units as the data; the variance is in squared units.
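A small sketch of the four formulas above, assuming a made-up data set. The `statistics` module has separate functions for the population versions (`pvariance`, `pstdev`) and the sample versions (`variance`, `stdev`).

```python
import statistics

# Hypothetical data; mean is 5 and the squared deviations sum to 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]

pop_var = statistics.pvariance(data)    # divides by N       -> 32 / 8
sample_var = statistics.variance(data)  # divides by n - 1   -> 32 / 7
pop_sd = statistics.pstdev(data)        # square root of the population variance
sample_sd = statistics.stdev(data)      # square root of the sample variance

print("population:", pop_var, pop_sd)
print("sample:    ", sample_var, sample_sd)
```

The sample versions always come out slightly larger than the population versions for the same numbers, because the denominator is smaller.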
---
3. Quartiles & Interquartile Range (IQR)
- Q1 (Lower Quartile) = 25% position in the sorted data
- Q2 (Median) = 50% position in the sorted data
- Q3 (Upper Quartile) = 75% position in the sorted data
IQR = Q3 - Q1
→ Spread of the middle 50% of the data (ignores extreme values).
→ More stable than the range, because it is less affected by outliers.
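A minimal sketch of quartiles and IQR on a made-up data set with one extreme value. It assumes the "inclusive" interpolation method of `statistics.quantiles`; other quartile conventions exist and can give slightly different numbers.

```python
import statistics

# Hypothetical data with one extreme value (100).
data = [1, 3, 5, 7, 9, 11, 13, 15, 100]

# n=4 splits the data into quarters and returns the three cut points.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
full_range = max(data) - min(data)

print("Q1:", q1, "Q2:", q2, "Q3:", q3)
print("IQR:", iqr, "range:", full_range)
```

Here the IQR is 8 while the range is 99: the single extreme value dominates the range but does not move the IQR.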
---
4. Outliers using IQR
Outlier Rule:
- Lower Bound = Q1 - (1.5 × IQR)
- Upper Bound = Q3 + (1.5 × IQR)
→ Any value below the lower bound or above the upper bound is flagged as an outlier (too far from the bulk of the data).
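The outlier rule above can be sketched as a small helper. The function name `iqr_outliers` and the data are illustrative assumptions, and the quartiles again use the "inclusive" method of `statistics.quantiles`.

```python
import statistics

def iqr_outliers(values):
    """Return (lower_bound, upper_bound, outliers) using the 1.5 x IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return lower, upper, outliers

# Hypothetical data: Q1 = 5, Q3 = 13, so IQR = 8.
lower, upper, outliers = iqr_outliers([1, 3, 5, 7, 9, 11, 13, 15, 100])
print("bounds:", lower, upper, "outliers:", outliers)
```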
---
5. Min, Max, Range
- Minimum (min) = Smallest value
- Maximum (max) = Largest value
- Range = max - min
→ Quick measure, but sensitive to outliers.
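To see how sensitive the range is, here is a tiny sketch with made-up numbers: adding one extreme value changes the range drastically. The helper name `value_range` is just for illustration.

```python
def value_range(values):
    """Range = largest value minus smallest value."""
    return max(values) - min(values)

# Hypothetical measurements, then the same list with one extreme value added.
clean = [12, 15, 11, 14, 13]
with_outlier = clean + [90]

clean_range = value_range(clean)
outlier_range = value_range(with_outlier)
print("range without outlier:", clean_range)
print("range with outlier:   ", outlier_range)
```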
---
# Part 2
---
Visualizing Data Notes
1. Scatter Plots
- Show relationship between 2 variables (x,y).
- Each point = one observation.
→ Useful for finding patterns, trends, outliers, and correlation.
---
2. Line Plots
- Points connected with lines (usually time on x-axis).
→ Great for trends over time (stock price, temperature, etc.).
---
3. Distribution Plots: Histograms
- Show how data values are spread across intervals (bins).
- x-axis = value ranges, y-axis = frequency/count.
→ Helps see skew, shape, and spread.
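What a histogram actually computes can be sketched without any plotting library: count how many values fall into each bin, then draw one bar per bin. The helper `histogram_counts`, the data, and the bin edges below are all illustrative assumptions. Bins are treated as half-open intervals, with the last bin also including its right edge.

```python
def histogram_counts(values, bin_edges):
    """Count values per bin. Bins are [left, right); the last bin also includes its right edge."""
    counts = [0] * (len(bin_edges) - 1)
    last = len(counts) - 1
    for v in values:
        for i in range(len(counts)):
            left, right = bin_edges[i], bin_edges[i + 1]
            if left <= v < right or (i == last and v == right):
                counts[i] += 1
                break
    return counts

# Hypothetical data and bin edges.
values = [1, 2, 2, 3, 3, 3, 4, 4, 7, 9]
edges = [0, 2, 4, 6, 8, 10]
counts = histogram_counts(values, edges)

# Text version of the plot: one row per bin, bar length = frequency.
for i, count in enumerate(counts):
    print(f"{edges[i]:>2} to {edges[i + 1]:<2} | {'#' * count}")
```

Most of the values pile up in the low bins with a thin tail to the right, which is what a right-skewed histogram looks like.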
---
4. Categorical Plots: Bar Plots
- Categories on x-axis, bar height = value/frequency.
→ Used for comparing groups or categories.
---
5. Categorical/Distribution Plots: Box & Whisker Plots
- Show median, quartiles, IQR, outliers.
- Box = Q1 to Q3, line = median, whiskers = smallest and largest values that are not outliers (commonly those within 1.5 × IQR of the box).
→ Best for comparing distributions between groups.
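The numbers a box plot draws can be computed directly. This is a rough sketch, assuming the common 1.5 × IQR whisker convention and the "inclusive" quartile method; plotting libraries may use slightly different defaults. The function name `box_plot_stats` and the data are hypothetical.

```python
import statistics

def box_plot_stats(values):
    """Compute the pieces of a box plot: box edges, median line, whisker ends, outliers."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers stop at the most extreme values that are still inside the fences.
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    outliers = sorted(v for v in values if v < lower_fence or v > upper_fence)
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": outliers,
    }

# Hypothetical data with one extreme value.
stats = box_plot_stats([1, 3, 5, 7, 9, 11, 13, 15, 100])
print(stats)
```

The upper whisker ends at 15, not 100: the extreme value is drawn as a separate outlier point instead of stretching the whisker.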
---
6. Other Plot Types
- Violin Plot: combo of boxplot + density curve (shows distribution shape).
- KDE Plot (Kernel Density Estimation): smooth curve that estimates the probability density.
→ Both show the shape of a distribution in more detail than a plain histogram.
---
7. Common Plot Pitfalls
- Wrong scale (zooming or cutting axes can mislead).
- Too many categories → bar/line chart becomes messy.
- Cherry-picking → showing only part of the data.
- Overplotting → too many points on a scatter plot, which hides patterns.
---
Why is the variance denominator n-1 (not n)?
This is the part that confuses many people, so let's break it down simply.
Step 1: Population vs Sample
- Population variance → divide by N (you have all the data).
- Sample variance → divide by n-1 (you have only part of the data).
---
Step 2: The problem with just dividing by n
When you use a sample, you measure deviations from the sample mean (x̄), not from the real population mean (μ). The sample mean is computed from the same values, so it is always at least as close to your sample data as μ is, and the squared deviations come out too small.
→ Result: Variance calculated with n underestimates the true spread on average. It looks smaller than reality.
---
Step 3: Fixing the bias
To correct this "shrinkage", statisticians divide by n-1 instead of n (Bessel's correction). This makes the variance a little bigger → an unbiased estimate of the true population variance.
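The bias and its fix can be checked exactly on a tiny example. This sketch assumes a made-up population of six values and looks at every possible sample of size 2 drawn with replacement (so the draws are independent). Averaged over all samples, dividing by n gives too small a result, while dividing by n-1 matches the true population variance.

```python
from itertools import product
from statistics import mean, pvariance

# Hypothetical population; we know all of it, so the true variance is available.
population = [1, 2, 3, 4, 5, 6]
n = 2
true_var = pvariance(population)

divide_by_n = []
divide_by_n_minus_1 = []
for sample in product(population, repeat=n):  # every sample of size n, with replacement
    m = mean(sample)
    squared_devs = sum((x - m) ** 2 for x in sample)
    divide_by_n.append(squared_devs / n)
    divide_by_n_minus_1.append(squared_devs / (n - 1))

avg_biased = mean(divide_by_n)
avg_unbiased = mean(divide_by_n_minus_1)

print("true variance:        ", true_var)
print("average using n:      ", avg_biased)
print("average using (n - 1):", avg_unbiased)
```

With n = 2 the version that divides by n comes out at exactly half the true variance, which is the factor (n-1)/n.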
---
Step 4: Easy way to remember
- Divide by N if you have the whole population.
- Divide by n-1 if you only have a sample.
→ The n-1 is the number of degrees of freedom: one piece of information is used up when you estimate the mean from the sample itself.
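One way to see the "lost piece of information": deviations from the sample mean always add up to zero, so once you know n-1 of them, the last one is already fixed. A short sketch with made-up numbers:

```python
# Hypothetical sample of n = 3 values.
sample = [4, 8, 15]
sample_mean = sum(sample) / len(sample)

deviations = [x - sample_mean for x in sample]
total = sum(deviations)  # deviations from the sample mean sum to zero

# Knowing the first n - 1 deviations is enough to work out the last one.
recovered_last = -sum(deviations[:-1])

print("deviations:", deviations)
print("sum of deviations:", total)
print("last deviation recovered from the others:", recovered_last)
```

Only n-1 of the deviations are free to vary, which is why the sum of squares is divided by n-1.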
---
In short:
- Population variance: ÷ N
- Sample variance: ÷ (n-1) → avoids underestimating the true spread.