Updated: 2025-09-10 07:03:27

📘 Statistics Basics

1. Measures of Central Tendency

Mean (x̄) = (Sum of all values) ÷ (Number of values)
Median = Middle value when data is sorted (average of the two middle values when the count is even)
Mode = Value that occurs most often
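
A quick sketch of the three measures using Python's built-in `statistics` module. The data values are made up for illustration only.

```python
import statistics

# Hypothetical data, chosen only for illustration.
data = [2, 3, 3, 5, 7, 10]

mean = statistics.mean(data)      # sum of all values / number of values
median = statistics.median(data)  # even count, so average of the two middle values (3 and 5)
mode = statistics.mode(data)      # value that occurs most often

print("mean:", mean, "median:", median, "mode:", mode)
```

👉 Here the mean (5) sits above the median (4) because the larger values pull it upward.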

---

2. Measures of Dispersion

Variance (σ^2 or s^2)

Population: σ^2 = ( Σ(xi - μ)^2 ) ÷ N
Sample: s^2 = ( Σ(xi - x̄)^2 ) ÷ (n - 1)

Standard Deviation (σ or s)

Population: σ = √( Σ(xi - μ)^2 ÷ N )
Sample: s = √( Σ(xi - x̄)^2 ÷ (n - 1) )

👉 Meaning: Tells how far values spread around the mean
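
The four formulas above can be checked with the `statistics` module, which has separate functions for the population and sample versions. The data is hypothetical.

```python
import statistics

# Hypothetical data: mean is 5, squared deviations sum to 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]

pop_var = statistics.pvariance(data)   # population variance: 32 / N = 32 / 8
samp_var = statistics.variance(data)   # sample variance: 32 / (n - 1) = 32 / 7
pop_sd = statistics.pstdev(data)       # square root of the population variance
samp_sd = statistics.stdev(data)       # square root of the sample variance

print(pop_var, samp_var, pop_sd, samp_sd)
```

👉 The sample versions come out slightly larger because they divide by a smaller number.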

---

3. Quartiles & Interquartile Range (IQR)

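Q1 (Lower Quartile) = 25% position (about 25% of the sorted values fall below it)
Q2 (Median) = 50% position (half of the sorted values fall below it)
Q3 (Upper Quartile) = 75% position (about 75% of the sorted values fall below it)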

IQR = Q3 - Q1
👉 Spread of middle 50% data (ignores extreme values).
👉 More stable than range, less affected by outliers.
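
A small sketch of quartiles and IQR with `statistics.quantiles`. Note the assumption: there are several conventions for locating quartiles, and this uses the module's "inclusive" method, so other tools may give slightly different numbers on the same data. The data is hypothetical.

```python
import statistics

# Hypothetical data (9 values), chosen only for illustration.
data = [1, 3, 4, 6, 7, 8, 10, 12, 15]

# n=4 splits the data into four equal parts and returns the three cut points.
# method="inclusive" is one of several quartile conventions (an assumption here).
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print("Q1:", q1, "Q2:", q2, "Q3:", q3, "IQR:", iqr)
```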

---

4. Outliers using IQR

Outlier Rule:

Lower Bound = Q1 - (1.5 × IQR)
Upper Bound = Q3 + (1.5 × IQR)

👉 Any value outside this range = Outlier (too far from the bulk of data).
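
The outlier rule above, sketched as a small function. `iqr_outliers` is a hypothetical helper name, the data is made up, and the quartiles again assume the "inclusive" convention.

```python
import statistics

def iqr_outliers(values):
    """Return (outliers, lower_bound, upper_bound) using the 1.5 x IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower_bound or v > upper_bound]
    return outliers, lower_bound, upper_bound

# Hypothetical data with one value far from the rest.
data = [1, 3, 4, 6, 7, 8, 10, 12, 40]
outliers, lower_bound, upper_bound = iqr_outliers(data)

print("bounds:", lower_bound, upper_bound, "outliers:", outliers)
```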

---

5. Min, Max, Range

Minimum (min) = Smallest value
Maximum (max) = Largest value
Range = max - min 👉 Quick measure, but sensitive to outliers.
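
A tiny sketch of why range is sensitive to outliers: adding one extreme value (hypothetical data) more than triples it.

```python
# Hypothetical data without and with one extreme value.
data = [1, 3, 4, 6, 7, 8, 10, 12]
with_outlier = data + [40]

data_range = max(data) - min(data)
range_with_outlier = max(with_outlier) - min(with_outlier)

print("range:", data_range, "range with outlier:", range_with_outlier)
```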

--- # Part 2 ---

📘 Visualizing Data Notes

1. Scatter Plots

Show relationship between 2 variables (x,y).
Each point = one observation. 👉 Useful for finding patterns, trends, outliers, correlation.

---

2. Line Plots

Points connected with lines (usually time on x-axis). 👉 Great for trends over time (stock price, temperature, etc.).

---

3. Distribution Plots – Histograms

Show how data values are spread across intervals (bins).
x-axis = value ranges, y-axis = frequency/count. 👉 Helps see skew, shape, spread.
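
What a histogram does under the hood is just counting values per bin. Here is a minimal text-only sketch (not any plotting library's API). `histogram_counts`, the bin width, and the data are all assumptions for illustration.

```python
def histogram_counts(values, bin_width, start=0):
    """Count values per half-open bin [left, left + bin_width). Empty bins are omitted."""
    counts = {}
    for v in values:
        index = int((v - start) // bin_width)   # which bin this value falls into
        left = start + index * bin_width        # left edge of that bin
        counts[left] = counts.get(left, 0) + 1
    return dict(sorted(counts.items()))

# Hypothetical data.
data = [1, 2, 2, 3, 5, 6, 6, 7, 8, 14]
width = 5
counts = histogram_counts(data, bin_width=width)

# Text version of the plot: bin range on the left, one '#' per observation.
for left, count in counts.items():
    print(f"{left:>2}-{left + width:<2} | {'#' * count}")
```

👉 Changing `width` changes the picture, so it is worth trying a few bin sizes before reading too much into the shape.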

---

4. Categorical Plots – Bar Plots

Categories on x-axis, bar height = value/frequency. 👉 Used for comparing groups or categories.

---

5. Categorical/Distribution Plots – Box & Whisker Plots

Show median, quartiles, IQR, outliers.
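Box = Q1 to Q3, line inside the box = median, whiskers = smallest and largest values that are not outliers (typically within 1.5 × IQR of the box). Outliers are drawn as separate points.
👉 Best for comparing distributions between groups.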
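
The numbers a box plot draws can be sketched without any plotting library. `box_plot_stats` is a hypothetical helper. It assumes the common 1.5 × IQR rule for outliers and the "inclusive" quartile convention, and real plotting tools may differ in both.

```python
import statistics

def box_plot_stats(values):
    """Compute the pieces of a box plot: box, median line, whiskers, outliers."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    return {
        "q1": q1,                      # bottom of the box
        "median": median,              # line inside the box
        "q3": q3,                      # top of the box
        "whisker_low": min(inside),    # smallest value that is not an outlier
        "whisker_high": max(inside),   # largest value that is not an outlier
        "outliers": sorted(v for v in values if v < lower_fence or v > upper_fence),
    }

# Hypothetical data with one extreme value.
stats = box_plot_stats([1, 3, 4, 6, 7, 8, 10, 12, 40])
print(stats)
```

👉 Notice the upper whisker stops at 12, not at 40: the extreme value is reported separately as an outlier.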

---

6. Other Plot Types

Violin Plot → combo of boxplot + density curve (shows distribution shape).
KDE Plot (Kernel Density Estimation) → smooth curve estimating the probability density.
👉 Both show the distribution's shape as a smooth curve instead of the binned bars of a histogram.

---

7. Common Plot Pitfalls

Misleading scale (zooming in or truncating an axis can exaggerate small differences).
Too many categories → bar/line chart becomes messy.
Cherry-picking → showing only part of the data.
Overplotting → too many points on scatter, hides patterns.

---

❓ Why the sample variance denominator is n-1 (not n)

This is the part that confuses many people, so let’s break it down simply.

Step 1: Population vs Sample

Population variance → divide by N (you have all data).
Sample variance → divide by n-1 (you have only part of data).

---

Step 2: The problem with just dividing by n

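When you use a sample, you measure deviations from the sample mean (x̄), which was itself calculated from that same sample. The sample mean is the point that sits closest to your sample values overall, so the squared deviations from x̄ are never larger in total than the squared deviations from the real population mean (μ).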

👉 Result: Variance calculated with n underestimates the true spread on average. It looks smaller than reality.

---

Step 3: Fixing the bias

To correct this "shrinkage", statisticians divide by n-1 instead of n. This makes the variance a little bigger, so that on average it matches the true population variance (an unbiased estimate).
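
The bias can be seen exactly on a tiny example. This sketch takes a made-up population of six values, lists every possible sample of size 2 (assuming values are drawn independently, with replacement), and averages both versions of the variance over all samples. Fractions are used so there is no rounding.

```python
from fractions import Fraction
from itertools import product

# Hypothetical population, chosen only for illustration.
population = [1, 2, 3, 4, 5, 6]
N = len(population)
mu = Fraction(sum(population), N)
sigma2 = sum((x - mu) ** 2 for x in population) / N   # true population variance

n = 2  # sample size
# Every equally likely sample of size n, assuming draws with replacement.
samples = list(product(population, repeat=n))

def sum_sq_dev(sample):
    """Sum of squared deviations from the sample's own mean."""
    xbar = Fraction(sum(sample), len(sample))
    return sum((x - xbar) ** 2 for x in sample)

# Average of each estimator over all possible samples.
avg_div_n = sum(sum_sq_dev(s) / n for s in samples) / len(samples)
avg_div_n_minus_1 = sum(sum_sq_dev(s) / (n - 1) for s in samples) / len(samples)

print("true variance:      ", sigma2)
print("average using n:    ", avg_div_n)
print("average using n - 1:", avg_div_n_minus_1)
```

👉 Dividing by n gives, on average, only (n-1)/n of the true variance (half of it when n = 2). Dividing by n-1 gives exactly the true variance on average.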

---

Step 4: Easy way to remember

Divide by N if you have the whole population.
Divide by n-1 if you only have a sample.
👉 n-1 is the number of degrees of freedom: one piece of info is used up when you estimate the mean from the sample.

---

✅ In short:

Population variance: ÷N
Sample variance: ÷(n-1) → avoids underestimating true spread.
