Descriptive statistics summarizes (describes) and organizes the observations in a set of data (sample data) and visualizes them in graphs, turning data into knowledge. That is, it describes the data at hand using measures such as the mean and the standard deviation.

Descriptive statistics is the study of different measures (mean, median, variance, …) of sample data and their dependence (and inter-dependence) on the existing features.

Measures of Descriptive Statistics

Centrality (Measure of Central Tendency)

Centrality measures where the bulk of the values lie, and is calculated via the mean, median, and mode.

The Mean

The Mean (or the average value): The mean is the average of all values.
Defined as: $\mu = E[X]$
Estimated as: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
The mean is often used to find the central tendency of interval variables. However, it is more easily influenced by outliers and by the skewness of the distribution than the median or the mode.
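
As a minimal sketch of the sample mean and its sensitivity to outliers, the snippet below averages a small made-up data set and then repeats the calculation with one extreme value appended. The function name `sample_mean` and the data are illustrative assumptions, not part of the original notes.

```python
def sample_mean(values):
    """Arithmetic mean: the sum of the values divided by their count."""
    if not values:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(values) / len(values)


# Hypothetical data, chosen only for illustration.
data = [2, 3, 3, 4, 5, 5, 6]
with_outlier = data + [100]

mean_plain = sample_mean(data)            # 28 / 7
mean_outlier = sample_mean(with_outlier)  # 128 / 8

print(mean_plain, mean_outlier)
```

A single extreme observation moves the mean from 4.0 to 16.0, even though seven of the eight values are at most 6.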

The Median

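The Median (or the midpoint value): The median is the value in the middle of a sorted set.
Defined as:
$\mathrm{median}(x) = x_{(n+1)/2}$ if $n$ is odd, and $\mathrm{median}(x) = \frac{1}{2}\left(x_{n/2} + x_{n/2+1}\right)$ if $n$ is even,
given $x$ is an ordered list of values, and $n$ is the number of values.
The median is used for ordinal data.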
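
The odd and even cases can be sketched in a few lines of Python. This is an illustrative implementation under the usual convention of averaging the two middle values when the count is even; the function name `median_of` and the data are assumptions made for the example.

```python
def median_of(values):
    """Middle value of the sorted data; mean of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_case = median_of([5, 1, 3])          # sorted: 1, 3, 5
even_case = median_of([4, 1, 3, 2])      # sorted: 1, 2, 3, 4
robust_case = median_of([2, 3, 3, 4, 5, 5, 6, 100])

print(odd_case, even_case, robust_case)
```

In the last call the extreme value 100 only shifts the median to 4.5, which is why the median is often preferred when outliers are present.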

The Mode

The Mode (or the most common value): The mode is the value that appears most often. It represents the “peak” of the distribution and coincides with both the mean and the median for symmetric, unimodal distributions.
The mode is used for nominal data.
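
Because the mode only needs counts, it works on labels as well as numbers. The sketch below uses `collections.Counter` from the standard library and returns every value tied for the highest count; the helper name `modes` and the sample data are hypothetical.

```python
from collections import Counter


def modes(values):
    """Return all values that share the highest frequency, in first-seen order."""
    counts = Counter(values)
    if not counts:
        raise ValueError("the mode of an empty data set is undefined")
    top = max(counts.values())
    return [value for value, count in counts.items() if count == top]


colour_modes = modes(["red", "blue", "red", "green"])  # nominal data
tied_modes = modes([1, 1, 2, 2, 3])                     # two values tie

print(colour_modes, tied_modes)
```

Returning a list is a design choice for this example: a data set can have more than one mode, and picking one silently would hide that.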

Quantile (Measure Of Location)

Quantiles are cut points dividing the range of a probability distribution into continuous intervals with equal probabilities.

q-Quantiles

q-Quantiles are values that partition a finite set of values into q subsets of (nearly) equal sizes.

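Writing $x_{k/q}$ for the $k$-th $q$-quantile of a variable $X$, with $k = 1, \dots, q-1$, the most important quantiles are:

  • Median: $x_{1/2}$, the only 2-quantile
  • Tertiles (terciles): $x_{1/3}, x_{2/3}$
  • Quartiles: $x_{1/4}, x_{1/2}, x_{3/4}$. The data is divided into 4 equal parts (quarters). The inter-quartile range (IQR), the difference between the upper quartile and the lower quartile, shows the range that holds the middle half of the values in the data set.
  • Deciles: $x_{k/10}$ for $k = 1, \dots, 9$
  • Percentiles: $x_{k/100}$ for $k = 1, \dots, 99$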
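
The quantiles listed above can be computed with `statistics.quantiles` from the Python standard library (Python 3.8 or later). This is a sketch on made-up data; note that several conventions exist for interpolating sample quantiles, and the `"inclusive"` method used here is only one of them, so other tools may report slightly different cut points.

```python
import statistics

# Hypothetical data, already sorted for readability.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# n=4 returns the three quartile cut points.
quartiles = statistics.quantiles(data, n=4, method="inclusive")
q1, q2, q3 = quartiles

# Inter-quartile range: the width of the middle half of the data.
iqr = q3 - q1

# n=10 returns the nine decile cut points.
deciles = statistics.quantiles(data, n=10, method="inclusive")

print(quartiles, iqr, len(deciles))
```

The second quartile equals the median, which matches the statement that the median is the only 2-quantile.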

Spread (Measures Of Variability/Dispersion)

  • Variance ($\sigma^2$, or sigma squared): A measure of how dispersed, or spread out, a set of numbers is around its mean value.
    Definition: $\sigma^2 = E\left[(X - \mu)^2\right]$, with $\mu = E[X]$
    Estimation: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$

  • Standard Deviation ($\sigma$): A measure of how spread out the numbers are, based on how far the data points lie from the mean. The standard deviation is the square root of the variance.
    The further the data points lie from the mean, the higher the deviation within the data set.

  • Coefficient of Variation: The ratio of the standard deviation to the mean, $CV = \sigma / \mu$.

  • Z-Score: The distance of a data point from the mean, measured in standard deviations: $z = (x - \mu) / \sigma$. It tells how far above or below the mean the point lies relative to the spread of the data.

  • Range: The difference between the maximum and minimum values.

  • Min and Max
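
The spread measures in the list above can be sketched with the standard `statistics` module. The data set is invented for illustration, and the variable names are assumptions. The example treats the data as a whole population for the standard deviation, coefficient of variation, and z-scores, and shows the sample variance (divisor n - 1) alongside for comparison.

```python
import statistics

# Hypothetical data with mean 5 and population variance 4.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mean = statistics.fmean(data)

pop_var = statistics.pvariance(data)     # divides by n
sample_var = statistics.variance(data)   # divides by n - 1
pop_sd = statistics.pstdev(data)         # square root of pop_var

# Coefficient of variation: spread relative to the mean.
cv = pop_sd / mean

# Z-score of each point: distance from the mean in standard deviations.
z_scores = [(x - mean) / pop_sd for x in data]

# Range: maximum minus minimum.
value_range = max(data) - min(data)

print(mean, pop_var, sample_var, pop_sd, cv, value_range)
print(z_scores)
```

The largest value, 9, has a z-score of 2.0: it lies two standard deviations above the mean.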

Distribution Shape (Measures of Shape)

Distribution shape describes how symmetric the numerical data is and how heavy its tails are.
[Figure: Descriptive Statistics/skew.png]

Skewness

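Skewness is a distortion (an asymmetry) relative to the bell curve (normal distribution). It describes unexpected values in one tail.

Definition: $\gamma_1 = E\left[\left(\frac{X - \mu}{\sigma}\right)^3\right]$

Estimation: $g_1 = \frac{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^3}{\left(\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2\right)^{3/2}}$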
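
A moment-based estimate of skewness can be sketched as follows. It is one common estimator (the third central moment divided by the second central moment raised to the power 3/2, both with divisor n); bias-corrected variants also exist. The function name and the data are illustrative assumptions.

```python
def skewness(values):
    """Moment-based sample skewness: m3 / m2**1.5 with divisor n."""
    n = len(values)
    if n == 0:
        raise ValueError("skewness of an empty data set is undefined")
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    if m2 == 0:
        raise ValueError("skewness is undefined when all values are equal")
    return m3 / m2 ** 1.5


symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]          # long tail on the right
left_skewed = [-x for x in right_skewed]    # mirror image

skew_sym = skewness(symmetric)
skew_right = skewness(right_skewed)
skew_left = skewness(left_skewed)

print(skew_sym, skew_right, skew_left)
```

Symmetric data gives a skewness of zero, a long right tail gives a positive value, and mirroring the data flips the sign.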

Kurtosis

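Kurtosis is a symmetric distortion relative to the bell curve (normal distribution). It describes unexpected values in both tails. The normal distribution has a kurtosis of 3, so kurtosis is usually reported as excess kurtosis, the difference from 3.

Definition: $\gamma_2 = E\left[\left(\frac{X - \mu}{\sigma}\right)^4\right] - 3$

Estimation: $g_2 = \frac{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^4}{\left(\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2\right)^{2}} - 3$

  • Negative kurtosis: lighter tails, and typically a lower, flatter peak, than the normal distribution.
  • Positive kurtosis: heavier tails, and typically a higher, sharper peak, than the normal distribution.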
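
The sign convention in the two cases above refers to excess kurtosis, that is, kurtosis minus 3 (the kurtosis of the normal distribution). A minimal sketch of a moment-based estimate, with a hypothetical function name and made-up data, might look like this:

```python
def excess_kurtosis(values):
    """Moment-based sample excess kurtosis: m4 / m2**2 - 3 with divisor n."""
    n = len(values)
    if n == 0:
        raise ValueError("kurtosis of an empty data set is undefined")
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    if m2 == 0:
        raise ValueError("kurtosis is undefined when all values are equal")
    return m4 / m2 ** 2 - 3.0


# Evenly spread values: light tails, negative excess kurtosis.
flat = [1, 2, 3, 4, 5]

# Most values at the centre plus two extreme ones: heavy tails.
heavy_tailed = [0, 0, 0, 0, 0, 0, 0, 0, -10, 10]

kurt_flat = excess_kurtosis(flat)
kurt_heavy = excess_kurtosis(heavy_tailed)

print(kurt_flat, kurt_heavy)
```

For these inputs the evenly spread data gives about -1.3 and the heavy-tailed data gives 2.0.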

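Correlation (Measures of Dependency)

  • Covariance: A measure of the relationship (direction and magnitude) between two random variables, calculated from how changes in one variable are associated with changes in the other.

    Definition: $\mathrm{Cov}(X, Y) = E\left[(X - E[X])(Y - E[Y])\right] = E[XY] - E[X]E[Y]$

    Estimation: $s_{xy} = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$

    According to the definition, if $X$ and $Y$ are independent, then $E[XY] = E[X]E[Y]$, and therefore $\mathrm{Cov}(X, Y) = 0$.

  • Correlation: The covariance divided by the product of the two standard deviations. It is dimensionless and independent of scale. It shows the strength and direction of the linear relationship between the two variables and ranges from -1 to +1.

    Definition: $\rho_{X,Y} = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}$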
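
The sample covariance can be sketched directly from its sum-of-products form. The function name and data below are assumptions for illustration. The last example also shows that the implication for independent variables does not run the other way: a covariance of zero does not imply independence.

```python
def sample_covariance(xs, ys):
    """Sample covariance with divisor n - 1."""
    if len(xs) != len(ys):
        raise ValueError("both variables need the same number of observations")
    n = len(xs)
    if n < 2:
        raise ValueError("at least two observations are required")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    products = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return products / (n - 1)


x = [1, 2, 3, 4, 5]
cov_up = sample_covariance(x, [2, 4, 6, 8, 10])     # y rises with x
cov_down = sample_covariance(x, [10, 8, 6, 4, 2])   # y falls as x rises

# y = x**2 is fully determined by x, yet the covariance is zero.
cov_square = sample_covariance([-1, 0, 1], [1, 0, 1])

print(cov_up, cov_down, cov_square)
```

The sign gives the direction of the relationship, while the magnitude depends on the units of both variables, which is what correlation removes.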

Pearson Correlation (Linear Dependency)

It measures linear correlation as the covariance of two variables divided by the product of their standard deviations, and describes the strength and direction of the linear relationship between two quantitative variables.

  • It’s used when both variables are quantitative, linearly related, and normally distributed.
  • It’s not robust to outliers.

It ranges from -1 to 1, meaning:

  • Close to 0: little or no linear correlation, implying no linear relationship between the variables.
  • Close to 1: a positive correlation, which is stronger the closer the value is to 1.
  • Close to -1: a negative correlation, which is stronger the closer the value is to -1.
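
The three cases can be illustrated with a small, self-contained sketch of the Pearson coefficient. The function name `pearson_r` and the data are hypothetical. The common factor 1/(n - 1) cancels between numerator and denominator, so the code works with raw sums.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    if len(xs) != len(ys):
        raise ValueError("both variables need the same number of observations")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return sxy / math.sqrt(sxx * syy)


x = [1, 2, 3, 4, 5]
r_positive = pearson_r(x, [3, 5, 7, 9, 11])    # y = 2x + 1
r_negative = pearson_r(x, [11, 9, 7, 5, 3])    # y = 13 - 2x

# A symmetric parabola: a clear relationship, but not a linear one.
r_parabola = pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])

print(r_positive, r_negative, r_parabola)
```

The parabola gives a coefficient of 0 even though y is a function of x, so a value near 0 rules out a linear relationship only.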

Spearman’s Rank Correlation (Variable Order Dependency)

It’s a rank correlation coefficient because it uses the rankings (order) of the data from each variable rather than the values themselves.

  • It’s used when the variables are quantitative or ordinal, have a monotonic relationship, and don’t meet the normality assumption.
  • Positive monotonic: when one variable increases, the other also increases.
  • Negative monotonic: when one variable increases, the other decreases.
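
One way to sketch Spearman’s coefficient is to replace each value by its rank (averaging the ranks of tied values) and then apply the Pearson formula to the ranks. The helper names `average_ranks`, `pearson_r`, and `spearman_rho` and the data are assumptions made for this example.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson_r(xs, ys):
    """Pearson correlation of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def spearman_rho(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))


x = [1, 2, 3, 4, 5]
y_cubed = [v ** 3 for v in x]       # monotonic but not linear

tied_ranks = average_ranks([10, 20, 20, 30])
rho_cubed = spearman_rho(x, y_cubed)
r_cubed = pearson_r(x, y_cubed)

print(tied_ranks, rho_cubed, r_cubed)
```

For the cubic relationship the rank correlation is 1 because the ordering is preserved perfectly, while the Pearson coefficient on the raw values stays below 1.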

Kendall-Tau Correlation (Kendall Rank Correlation)

It’s used to detect the existence of a monotonic relationship and can be used on ordinal or quantitative variables.
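
Kendall’s coefficient is commonly computed by comparing every pair of observations: a pair is concordant when both variables order it the same way and discordant when they order it in opposite ways. The sketch below implements the simplest variant, often called tau-a, which applies no correction for ties; the function name and data are illustrative assumptions.

```python
from itertools import combinations


def kendall_tau_a(xs, ys):
    """Kendall tau-a: (concordant - discordant) / total number of pairs."""
    if len(xs) != len(ys):
        raise ValueError("both variables need the same number of observations")
    n = len(xs)
    if n < 2:
        raise ValueError("at least two observations are required")
    concordant = 0
    discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1
        # direction == 0 means a tie; tau-a counts it as neither.
    return (concordant - discordant) / (n * (n - 1) / 2)


x = [1, 2, 3, 4]
tau_same = kendall_tau_a(x, [10, 20, 30, 40])     # identical ordering
tau_reversed = kendall_tau_a(x, [40, 30, 20, 10])  # opposite ordering
tau_one_swap = kendall_tau_a(x, [1, 3, 2, 4])      # one discordant pair of six

print(tau_same, tau_reversed, tau_one_swap)
```

With one discordant pair out of six, the coefficient is (5 - 1) / 6, or about 0.67. The pairwise loop is quadratic in the number of observations, which is fine for a sketch but slow for large data sets.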

Information Theory

Entropy

Entropy is the expected amount of information in an outcome of a random variable.
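
For a discrete distribution, entropy is the sum of -p·log(p) over all outcomes. The sketch below measures it in bits (base-2 logarithm); the function name and the example distributions are assumptions for illustration.

```python
import math


def entropy_bits(probs):
    """Shannon entropy in bits of a discrete probability distribution."""
    if any(p < 0 for p in probs) or abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must be non-negative and sum to 1")
    # Outcomes with probability 0 contribute nothing to the sum.
    return -sum(p * math.log2(p) for p in probs if p > 0)


h_fair_coin = entropy_bits([0.5, 0.5])
h_biased_coin = entropy_bits([0.9, 0.1])
h_certain = entropy_bits([1.0])
h_four_way = entropy_bits([0.25, 0.25, 0.25, 0.25])

print(h_fair_coin, h_biased_coin, h_certain, h_four_way)
```

A fair coin carries 1 bit per toss, a biased coin less, and a certain outcome none, matching the reading of entropy as average surprise.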

Mutual Information

Mutual information is a measure of the dependence between variables.
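
For two discrete variables, mutual information compares the joint distribution with the product of the marginals. The sketch below takes a joint probability table as nested lists and returns the result in bits; the function name and the two example tables are hypothetical.

```python
import math


def mutual_information_bits(joint):
    """Mutual information in bits; joint[i][j] is P(X = i, Y = j)."""
    p_x = [sum(row) for row in joint]          # marginal of X
    p_y = [sum(col) for col in zip(*joint)]    # marginal of Y
    total = 0.0
    for i, row in enumerate(joint):
        for j, p_xy in enumerate(row):
            if p_xy > 0:
                total += p_xy * math.log2(p_xy / (p_x[i] * p_y[j]))
    return total


# Two independent fair coins: the joint equals the product of the marginals.
independent = [[0.25, 0.25],
               [0.25, 0.25]]

# Y always equals X: knowing one variable reveals the other.
identical = [[0.5, 0.0],
             [0.0, 0.5]]

mi_independent = mutual_information_bits(independent)
mi_identical = mutual_information_bits(identical)

print(mi_independent, mi_identical)
```

Independent variables share 0 bits, while two identical fair coins share 1 bit, the full entropy of either one.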

Kullback-Leibler Divergence (KL-Divergence)

A measure of dissimilarity between distributions.

The KL divergence is a function that quantifies the difference between two distributions. It is not symmetric, so it is not a true distance. Since the function is differentiable with respect to the parameters of the distributions, gradient-based methods can be used to minimize it. When it is used to fit an approximation to a posterior distribution, this optimization returns a local optimum.
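
For discrete distributions the divergence is the sum of p·log(p/q) over all outcomes. The sketch below evaluates it in nats (natural logarithm) and only illustrates the quantity itself, not the gradient-based optimization; the function name and the two distributions are assumptions for the example.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions on the same outcomes."""
    if len(p) != len(q):
        raise ValueError("both distributions need the same number of outcomes")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0:
            continue             # terms with p_i = 0 contribute nothing
        if q_i == 0:
            return math.inf      # p puts mass where q has none
        total += p_i * math.log(p_i / q_i)
    return total


p = [0.5, 0.5]
q = [0.9, 0.1]

kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)
kl_pp = kl_divergence(p, p)

print(kl_pq, kl_qp, kl_pp)
```

The two directions give different values (roughly 0.51 and 0.37 nats here), and the divergence of a distribution from itself is 0.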