Attribute statistics
When working with variables it is useful to have basic statistical measures, such as the mean and variance.
Suppose that we record a number of observations $x_1, \dots, x_N$, for instance the weights of $N = 20$ schoolchildren. We can compute the empirical mean and use it as a best guess of the average weight in the larger population. This gives us a measure we can use as an estimate of the true population value.
Mean (Average)
The sample mean (or average) is computed as:
$$\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i$$
This gives us the central tendency of our data.
Variance
The sample variance measures how spread out the data is around the mean:
$$s^2 = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})^2$$
Note: We divide by $N - 1$ instead of $N$ to get an unbiased estimate of the population variance (this is called Bessel's correction).
Standard Deviation
The sample standard deviation is simply the square root of the variance:
$$s = \sqrt{s^2} = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})^2}$$
This gives us a measure of spread in the same units as the original data.
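As a quick illustration, here is a minimal NumPy sketch of the three quantities above; the weight values are made up for the example, and `ddof=1` applies Bessel's correction so the variance divides by $N - 1$.

```python
import numpy as np

# Hypothetical sample: weights (kg) of N = 20 schoolchildren
weights = np.array([32.1, 35.4, 30.2, 28.9, 41.0, 36.7, 33.3, 29.8, 38.5, 34.0,
                    31.2, 37.8, 27.5, 40.1, 35.9, 33.8, 30.6, 36.2, 32.9, 34.7])

mean = weights.mean()          # sample mean
var = weights.var(ddof=1)      # sample variance, divides by N - 1 (Bessel's correction)
std = weights.std(ddof=1)      # sample standard deviation, same units as the data

print(f"mean = {mean:.2f} kg, variance = {var:.2f} kg^2, std = {std:.2f} kg")
```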
Summary statistics and measures of similarity
The mean provides important information about the data sample, but it can be affected by outliers. A way to avoid this is to use the median, which finds the value that is in the middle of the dataset.
Median
The median is the middle value when the data is sorted:
- For odd $N$: the median is the value at position $\frac{N+1}{2}$
- For even $N$: the median is the average of the values at positions $\frac{N}{2}$ and $\frac{N}{2} + 1$
The median is robust to outliers because it only depends on the middle value(s), not on extreme values.
The median can also be defined as the value $x$ such that half of the observations are lower than $x$.
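The odd/even rule can be written out explicitly as a short sketch (the example values are arbitrary; in practice `np.median` does the same thing):

```python
import numpy as np

def median(values):
    """Middle value of the sorted data; average of the two middle values when N is even."""
    x = np.sort(values)
    n = len(x)
    if n % 2 == 1:                           # odd N: value at position (N + 1) / 2
        return x[n // 2]
    return (x[n // 2 - 1] + x[n // 2]) / 2   # even N: average of positions N/2 and N/2 + 1

print(median([1, 5, 2, 8, 7]))   # 5
print(median([1, 5, 2, 8]))      # 3.5 (replacing 8 by 800 gives the same result: robust to outliers)
```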
Percentile
We can generalize the median to a percentile. If a sample $x$ lies at the $p$'th percentile, this means that $p\%$ of the observations in the dataset are less than or equal to $x$.
Definition: The $p$-th percentile is the value below which $p$ percent of the data falls.
Examples:
- 50th percentile = median (50% of data below this value)
- 25th percentile = first quartile (Q1)
- 75th percentile = third quartile (Q3)
- 100th percentile = maximum value
Computing Percentiles:
For a sorted dataset $x_{(1)} \le x_{(2)} \le \dots \le x_{(N)}$, we can approximate the $p$-th percentile as:
$$x_{\left(\lceil \frac{p}{100} N \rceil\right)}$$
where $\lceil \cdot \rceil$ rounds up to the nearest integer.
However, this is an approximation. The exact percentile requires interpolation because:
- If 180 students have grades less than 11.7, they also have grades less than 11.7001
- We need to choose a value between adjacent data points
Linear Interpolation Method:
- Sort the data: $x_{(1)} \le x_{(2)} \le \dots \le x_{(N)}$
- Calculate the position index: $k = \frac{p}{100} \cdot N + \frac{1}{2}$
- The percentile value is found by linear interpolation: $x_{(\lfloor k \rfloor)} + (k - \lfloor k \rfloor)\left(x_{(\lceil k \rceil)} - x_{(\lfloor k \rfloor)}\right)$
What this means:
- If $k$ is an integer, use that exact data point: $x_{(k)}$
- If $k$ is between two integers (e.g., $k = 180.5$), interpolate between $x_{(180)}$ and $x_{(181)}$
- The fractional part $k - \lfloor k \rfloor$ tells us how much weight to give to the higher value
Example: To find the 90th percentile with $N = 200$ students:
- $k = \frac{90}{100} \cdot 200 + \frac{1}{2} = 180.5$, so interpolate between the 180th and 181st values
- Using the interpolation formula: $x_{(180)} + 0.5\left(x_{(181)} - x_{(180)}\right)$
- This simplifies to the average of the two values: $\frac{x_{(180)} + x_{(181)}}{2}$
Why this works:
- The fractional part is $0.5$, meaning we're exactly halfway between the two data points
- So we weight each value equally: $0.5\,x_{(180)} + 0.5\,x_{(181)}$
Notation:
- $\lceil \cdot \rceil$ = ceiling function (rounds up): $\lceil 180.5 \rceil = 181$
- $\lfloor \cdot \rfloor$ = floor function (rounds down): $\lfloor 180.5 \rfloor = 180$
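Below is a sketch of this interpolation rule, using the position index $k = \frac{p}{100} N + \frac{1}{2}$ assumed above; note that libraries such as NumPy implement several percentile conventions (`np.percentile(..., method=...)`), which can differ slightly from this one.

```python
import numpy as np

def percentile(values, p):
    """p-th percentile via linear interpolation, using index k = p/100 * N + 1/2."""
    x = np.sort(values)
    n = len(x)
    k = p / 100 * n + 0.5                # position index (1-based)
    k = min(max(k, 1), n)                # clamp so the 0th/100th percentiles stay in range
    lo, hi = int(np.floor(k)), int(np.ceil(k))
    frac = k - lo                        # fractional part = weight on the higher value
    return x[lo - 1] + frac * (x[hi - 1] - x[lo - 1])

grades = np.random.default_rng(0).normal(7, 2, size=200)
print(percentile(grades, 50))            # median: average of the 100th and 101st values
print(percentile(grades, 90))            # average of the 180th and 181st values
```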
Mode
The mode is the most frequently occurring value in the dataset.
A dataset can be:
- Unimodal: one mode
- Bimodal: two modes
- Multimodal: more than two modes
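A short sketch using Python's `collections.Counter`; when several values tie for the highest count it returns all of them, which also covers the bimodal and multimodal cases.

```python
from collections import Counter

def modes(values):
    """Return all values that occur most frequently (one mode, two, or more)."""
    counts = Counter(values)
    top = max(counts.values())
    return [v for v, c in counts.items() if c == top]

print(modes([1, 2, 2, 3, 3, 3]))   # [3]      -> unimodal
print(modes([1, 1, 2, 2, 3]))      # [1, 2]   -> bimodal
```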
Covariance and Correlation
Covariance measures how much one variable changes when another variable changes, and vice versa.
Covariance
The sample covariance between two variables $x$ and $y$ is:
$$\operatorname{cov}(x, y) = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})$$
Interpretation:
- Positive covariance: When $x$ increases, $y$ tends to increase
- Negative covariance: When $x$ increases, $y$ tends to decrease
- Zero covariance: No linear relationship between $x$ and $y$
Covariance Matrix
Given a dataset with $M$ attributes $x_1, \dots, x_M$, we can compute the pairwise covariance between any two attributes and collect them in an $M \times M$ covariance matrix $\Sigma$:
$$\Sigma_{ij} = \operatorname{cov}(x_i, x_j)$$
This matrix is:
- Symmetric: $\Sigma_{ij} = \Sigma_{ji}$
- Diagonal elements are variances: $\Sigma_{ii} = \operatorname{var}(x_i)$
- Off-diagonal elements are covariances between different attributes
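A minimal sketch of this computation on a made-up $N \times M$ data matrix, checked against `np.cov` (which computes per-column covariances when `rowvar=False`):

```python
import numpy as np

# Hypothetical N x M data matrix: N = 5 observations, M = 3 attributes
X = np.array([[1.0, 2.0, 0.5],
              [2.0, 1.5, 0.7],
              [3.0, 3.5, 0.2],
              [4.0, 3.0, 0.9],
              [5.0, 5.0, 0.4]])

N, M = X.shape
Xc = X - X.mean(axis=0)          # subtract the mean of each attribute
Sigma = Xc.T @ Xc / (N - 1)      # M x M covariance matrix (Bessel's correction)

print(np.allclose(Sigma, np.cov(X, rowvar=False)))           # True
print(np.allclose(Sigma, Sigma.T))                           # symmetric
print(np.allclose(np.diag(Sigma), X.var(axis=0, ddof=1)))    # diagonal = variances
```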
Drawback: Covariance is affected by the scale of each attribute, making it difficult to compare covariances between different pairs of variables.
Correlation
To overcome the scale dependence of covariance, we standardize by dividing by the standard deviations of both variables. This gives us the correlation coefficient:
$$\operatorname{corr}(x, y) = \frac{\operatorname{cov}(x, y)}{s_x \, s_y}$$
Expanding the denominator:
$$\operatorname{corr}(x, y) = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{N} (y_i - \bar{y})^2}}$$
where $s_x$ and $s_y$ are the sample standard deviations of $x$ and $y$.
Properties:
- Always between -1 and 1: $-1 \le \operatorname{corr}(x, y) \le 1$
- Scale-invariant: Changing the units of $x$ or $y$ doesn't change $\operatorname{corr}(x, y)$
- $\operatorname{corr}(x, y) = 1$: Perfect positive linear relationship
- $\operatorname{corr}(x, y) = -1$: Perfect negative linear relationship
- $\operatorname{corr}(x, y) = 0$: No linear relationship
A correlation of 0 means there is no linear relationship between $x$ and $y$. A positive correlation tells us that when $x$ is large, $y$ is also likely to be large; with a negative correlation, $y$ is likely to be small when $x$ is large.
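A small sketch of the standardization step on synthetic data with a negative linear relationship, checked against `np.corrcoef`:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = -2.0 * x + rng.normal(scale=0.5, size=100)   # roughly linear with negative slope

# correlation = covariance divided by both standard deviations
corr = np.cov(x, y, ddof=1)[0, 1] / (x.std(ddof=1) * y.std(ddof=1))
print(corr)                       # close to -1
print(np.corrcoef(x, y)[0, 1])    # same value
```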
Distance Metrics
There is no single definition of distance, except that it is a function of two observations whose value is large if they are dissimilar and small if they are similar.
Formal Properties of a Metric:
A proper distance measure should satisfy:
- Non-negativity: $d(x, y) \ge 0$
- Identity of indiscernibles: $d(x, y) = 0$ if and only if $x = y$
- Symmetry: $d(x, y) = d(y, x)$
- Triangle inequality: $d(x, z) \le d(x, y) + d(y, z)$
The triangle inequality states that the direct distance between two points is never greater than going through an intermediate point. For example, the distance from home to work is not greater than the distance from home to the bakery plus the distance from the bakery to work.
Distance from Norms:
A common way to define distances is through norms. A norm $\|\cdot\|$ measures the magnitude of a vector and must satisfy:
- Non-negativity: $\|x\| \ge 0$, with $\|x\| = 0$ if and only if $x = 0$
- Scaling: $\|\alpha x\| = |\alpha|\,\|x\|$
- Triangle inequality: $\|x + y\| \le \|x\| + \|y\|$
We can define a distance from a norm as:
$$d(x, y) = \|x - y\|$$
Euclidean Distance
The most common distance metric, measuring the straight-line distance between two points:
$$d_2(x, y) = \|x - y\|_2 = \sqrt{\sum_{k=1}^{M} (x_k - y_k)^2}$$
where $x$ and $y$ are $M$-dimensional vectors.
Manhattan Distance (L1 Norm)
Sum of absolute differences along each dimension:
$$d_1(x, y) = \|x - y\|_1 = \sum_{k=1}^{M} |x_k - y_k|$$
Also called “taxicab distance” or “city block distance” because it’s like traveling along a grid of streets.
Minkowski Distance (Lp Norm)
A generalization that includes both the Euclidean and Manhattan distances:
$$d_p(x, y) = \|x - y\|_p = \left(\sum_{k=1}^{M} |x_k - y_k|^p\right)^{1/p}$$
Special cases:
- $p = 1$: Manhattan distance
- $p = 2$: Euclidean distance
- $p = \infty$: Chebyshev distance (maximum difference): $d_\infty(x, y) = \max_k |x_k - y_k|$
Effect of p: As $p$ increases, the distance becomes less sensitive to the smaller coordinate differences and more dominated by the largest single difference.
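A sketch of the $L_p$ distance for a few values of $p$, treating $p = \infty$ as the maximum coordinate difference (SciPy's `scipy.spatial.distance.minkowski` provides an equivalent for finite $p$):

```python
import numpy as np

def minkowski(x, y, p):
    """L_p distance between two vectors; p = np.inf gives the Chebyshev distance."""
    d = np.abs(np.asarray(x) - np.asarray(y))
    if np.isinf(p):
        return d.max()
    return (d ** p).sum() ** (1 / p)

x, y = np.array([0.0, 0.0]), np.array([3.0, 4.0])
print(minkowski(x, y, 1))        # 7.0  Manhattan
print(minkowski(x, y, 2))        # 5.0  Euclidean
print(minkowski(x, y, np.inf))   # 4.0  Chebyshev: largest single-coordinate difference
```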
Mahalanobis Distance
Unlike the previous metrics, the Mahalanobis distance takes into account the covariance structure of the data:
$$d_M(x, y) = \sqrt{(x - y)^\top \Sigma^{-1} (x - y)}$$
where $\Sigma$ is the covariance matrix of the dataset.
Key properties:
- If $\Sigma = I$ (identity matrix), it reduces to the Euclidean distance
- Accounts for correlations between variables
- Scale-invariant (adjusts for different variances in each dimension)
Intuition: The Mahalanobis distance is lower when two points lie along the direction where the data naturally varies (within the “point cloud”).
Example interpretation:
- Two pairs of points might have the same Euclidean distance (~5.65)
- If one pair lies along the main axis of variation in the data, its Mahalanobis distance is small
- While a pair lying perpendicular to that axis has a much larger Mahalanobis distance
- The Mahalanobis distance captures that movement along the natural spread of data is "cheaper" than movement against it
This makes Mahalanobis distance useful when features have different scales or are correlated.
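A sketch on a synthetic, strongly correlated point cloud, assuming the covariance matrix is estimated from the data itself: two displacements of equal Euclidean length get very different Mahalanobis distances depending on whether they follow or oppose the main direction of variation (`scipy.spatial.distance.mahalanobis` offers an equivalent computation).

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic point cloud, strongly correlated along the direction (1, 1)
X = rng.multivariate_normal(mean=[0, 0], cov=[[4.0, 3.5], [3.5, 4.0]], size=500)
Sigma_inv = np.linalg.inv(np.cov(X, rowvar=False))   # inverse of the estimated covariance

def mahalanobis(x, y, Sigma_inv):
    d = np.asarray(x) - np.asarray(y)
    return float(np.sqrt(d @ Sigma_inv @ d))

origin = np.zeros(2)
along = np.array([4.0, 4.0])      # displacement along the main axis of variation
across = np.array([4.0, -4.0])    # same Euclidean length, but against the variation

print(np.linalg.norm(along), np.linalg.norm(across))   # equal Euclidean distances (~5.66)
print(mahalanobis(origin, along, Sigma_inv))           # small: along the point cloud
print(mahalanobis(origin, across, Sigma_inv))          # large: across the point cloud
```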