Descriptive statistics summarizes (describes) and organizes observations (sample data) from a data set and visualizes them in graphs, turning data into knowledge. That is, it describes the collected data, whether a sample or an entire population, using measures such as the mean and the standard deviation.
Descriptive statistics is the study of different measures (mean, median, variance, …) of sample data and their dependence (and inter-dependence) on the existing features.
Central tendency describes where the majority of values lie, and is measured by the mean, the median, and the mode.
The Mean (or the average value): the mean is the average of all values.
Defined as: $\mu = \mathbb{E}[X]$
Estimated as: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
The mean is often used to find the central tendency of interval variables. However, it is more easily influenced by outliers and by the skewness of the distribution than the median is.
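The Median (or the midpoint value): the median is the value in the middle of a sorted set.
Defined as: $\tilde{x} = x_{((n+1)/2)}$ if $n$ is odd, and $\tilde{x} = \frac{1}{2}\left(x_{(n/2)} + x_{(n/2+1)}\right)$ if $n$ is even,
given the sorted values $x_{(1)} \le x_{(2)} \le \dots \le x_{(n)}$.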
Median is used for ordinal data.
The Mode (or the most common value): the mode is the value that appears the greatest number of times. The mode represents the “peak” of the distribution, and it coincides with both the mean and the median for symmetric distributions that have a single peak.
The mode is used for nominal data.
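The three measures of central tendency above can be compared on a small sample. The following is a minimal sketch using Python's standard `statistics` module; the data values are hypothetical and were chosen to include one outlier, so that the difference between the mean and the median is visible.

```python
import statistics

# Hypothetical sample; the value 100 is an outlier added for illustration.
data = [2, 3, 3, 5, 7, 100]

mean = statistics.mean(data)      # sum of the values divided by their count
median = statistics.median(data)  # middle of the sorted values (average of the two middle ones for even n)
mode = statistics.mode(data)      # most frequent value

print("mean:", mean)
print("median:", median)
print("mode:", mode)
```

The single outlier pulls the mean far above most of the values, while the median and the mode stay close to where the bulk of the data lies.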
Quantiles are cut points dividing the range of a probability distribution into contiguous intervals with equal probabilities.
q-Quantiles are values that partition a finite set of values into q subsets of (nearly) equal sizes.
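Given a probability $p \in (0, 1)$ and a cumulative distribution function $F$, the $p$-quantile is $Q(p) = \inf\{x : F(x) \ge p\}$.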
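As an illustration of q-quantiles on a finite sample, here is a short sketch using `statistics.quantiles` from the Python standard library (available from Python 3.8). The sample is hypothetical. Several interpolation conventions exist for sample quantiles, so other tools may return slightly different cut points than the default method used here.

```python
import statistics

# Hypothetical sample: the integers 1 through 11.
data = list(range(1, 12))

# n=4 returns the 3 cut points that split the data into 4 groups (quartiles).
quartiles = statistics.quantiles(data, n=4)

# n=10 returns the 9 cut points that split the data into 10 groups (deciles).
deciles = statistics.quantiles(data, n=10)

print("quartiles:", quartiles)
print("deciles:", deciles)
```

The second quartile is the median, so it should agree with `statistics.median(data)`.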
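Variance ($\sigma^2$): the expected squared deviation of a variable from its mean, $\sigma^2 = \mathbb{E}[(X - \mu)^2]$.
Estimation: $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$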
Standard Deviation ($\sigma$): the square root of the variance, $\sigma = \sqrt{\sigma^2}$.
The further the data points lie from the mean, the higher the deviation within the data set.
Coefficient of Variation: the ratio of the standard deviation to the mean, $CV = \sigma / \mu$.
Z-Score: the distance of a data point from the mean, measured in units of the standard deviation, $z = (x - \mu) / \sigma$. It tells how far a point lies from the mean relative to the spread of the data.
Range: the difference between the maximum and minimum values.
Min and Max
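The dispersion measures above can be computed together on one sample. This is a minimal sketch with the standard `statistics` module; the data are hypothetical, and the sample is treated as a whole population for the standard deviation, coefficient of variation, and z-scores (an assumption made to keep the numbers simple).

```python
import statistics

# Hypothetical data set.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)

# Population variance divides by n; sample variance divides by n - 1.
population_variance = statistics.pvariance(data)
sample_variance = statistics.variance(data)

# Standard deviation is the square root of the variance.
population_std = statistics.pstdev(data)

# Coefficient of variation: standard deviation relative to the mean.
coefficient_of_variation = population_std / mean

# Z-score: distance of each point from the mean in units of standard deviation.
z_scores = [(x - mean) / population_std for x in data]

# Range: difference between the largest and smallest values.
value_range = max(data) - min(data)

print("variance (population, sample):", population_variance, sample_variance)
print("standard deviation:", population_std)
print("coefficient of variation:", coefficient_of_variation)
print("z-scores:", z_scores)
print("min, max, range:", min(data), max(data), value_range)
```

The sample variance is slightly larger than the population variance because it divides by n - 1 rather than n.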
Distribution shape describes the degree of symmetry of numerical data and the weight of its tails.
Skewness is a distortion (an asymmetry) from the symmetric bell curve (normal distribution). Skewness describes unexpected values in one tail.
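Definition: $\gamma_1 = \mathbb{E}\left[\left(\frac{X - \mu}{\sigma}\right)^3\right]$
Estimation: $g_1 = \frac{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^3}{\left(\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2\right)^{3/2}}$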
Kurtosis measures how heavy the tails are compared with the bell curve (normal distribution). Kurtosis describes unexpected values in both tails.
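Definition: $\mathrm{Kurt}[X] = \mathbb{E}\left[\left(\frac{X - \mu}{\sigma}\right)^4\right]$, which equals 3 for a normal distribution; the excess kurtosis is $\mathrm{Kurt}[X] - 3$.
Estimation: $g_2 = \frac{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^4}{\left(\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2\right)^{2}}$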
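Skewness and kurtosis can be estimated from central moments of the sample. The sketch below assumes the simple moment-based estimators (dividing by n, with no bias correction); the function names and data are hypothetical. Statistical libraries often apply bias corrections or report excess kurtosis, so their numbers can differ.

```python
def central_moment(values, k):
    """Return the k-th central moment, dividing by n (no bias correction)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** k for x in values) / n


def skewness(values):
    """Third standardized moment: m3 / m2^(3/2)."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def kurtosis(values):
    """Fourth standardized moment: m4 / m2^2 (subtract 3 for excess kurtosis)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2


# Hypothetical samples.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]

print("skewness (symmetric):", skewness(symmetric))
print("skewness (right-skewed):", skewness(right_skewed))
print("kurtosis (symmetric):", kurtosis(symmetric))
```

A symmetric sample has zero skewness, while a long right tail yields a positive value.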
Covariance: a measure of the relationship (direction and magnitude) between two random variables, calculated from how changes in one variable are associated with changes in the other.
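Definition: $\mathrm{Cov}(X, Y) = \mathbb{E}[(X - \mu_X)(Y - \mu_Y)]$
Estimation: $s_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$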
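According to the definition, if $X$ and $Y$ tend to deviate from their means in the same direction, the covariance is positive; if they tend to deviate in opposite directions, it is negative; and if $X$ and $Y$ are independent, it is zero.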
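To make the sign of the covariance concrete, here is a minimal sketch of the sample covariance (dividing by n - 1). The function name and the data are hypothetical.

```python
def sample_covariance(xs, ys):
    """Sample covariance of two equally long sequences (divides by n - 1)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


# Hypothetical data: one variable rises with x, the other falls with x.
x = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]
falling = [10, 8, 6, 4, 2]

cov_rising = sample_covariance(x, rising)
cov_falling = sample_covariance(x, falling)

print("covariance with rising variable:", cov_rising)
print("covariance with falling variable:", cov_falling)
```

The magnitude of the covariance depends on the units of both variables, which makes it hard to compare across data sets.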
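Correlation: the covariance normalized by the product of the standard deviations of the two variables. It is dimensionless and independent of scale. It shows the strength of the linear relationship between the two variables and has a defined range of -1 to +1.
Definition: $\rho_{X,Y} = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}$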
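The Pearson correlation coefficient measures linear correlation: it is the covariance of two variables divided by the product of their standard deviations, and it describes the strength and direction of the linear relationship between two quantitative variables.
It lies in the range from -1 to 1, meaning: -1 is a perfect negative linear relationship, 0 is no linear relationship, and 1 is a perfect positive linear relationship.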
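The linear correlation coefficient described above can be sketched directly from its definition. The function name and the data below are hypothetical; the normalizing factors of the covariance and the standard deviations cancel, so plain sums of deviations are enough.

```python
import math


def pearson_r(xs, ys):
    """Linear correlation: covariance divided by the product of standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / math.sqrt(sum_xx * sum_yy)


# Hypothetical data.
x = [1, 2, 3, 4, 5]
linear = [3, 5, 7, 9, 11]      # exactly y = 2x + 1
negative = [10, 8, 6, 4, 2]    # exactly y = 12 - 2x
quadratic = [1, 4, 9, 16, 25]  # increasing, but not linear

r_linear = pearson_r(x, linear)
r_negative = pearson_r(x, negative)
r_quadratic = pearson_r(x, quadratic)

print("linear:", r_linear)
print("negative:", r_negative)
print("quadratic:", r_quadratic)
```

Unlike the covariance, the result does not change when either variable is rescaled. The quadratic case stays below 1 even though y always increases with x, because the relationship is not a straight line.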
It's a rank correlation coefficient because it uses the rankings (order) of the data from each variable rather than the values themselves.
It's used to detect the existence of a monotonic relationship and can be used on ordinal or quantitative variables.
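A rank correlation can be sketched by replacing each value with its rank and then computing the linear correlation of the ranks. The example below assumes the rank coefficient described above is Spearman's rho; the helper names and the data are hypothetical, and tied values receive the average of their ranks.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next sorted value is tied with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = rank
        pos = end + 1
    return ranks


def pearson_r(xs, ys):
    """Linear correlation of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / math.sqrt(sum_xx * sum_yy)


def spearman_rho(xs, ys):
    """Rank correlation: linear correlation computed on the ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))


# Hypothetical data: y grows monotonically with x, but not linearly.
x = [1, 2, 3, 4, 5]
cubes = [1, 8, 27, 64, 125]

rho = spearman_rho(x, cubes)
r = pearson_r(x, cubes)

print("rank correlation:", rho)
print("linear correlation:", r)
```

Because the ranks of `x` and `cubes` agree exactly, the rank correlation is 1 even though the linear correlation is lower.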
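Entropy: the expected amount of information in a random variable.
Mutual information: a measure of dependence between variables.
KL divergence (relative entropy): a measure of similarity between distributions.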
The KL divergence is a function that quantifies the difference between two distributions $P$ and $Q$: $D_{KL}(P \| Q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$. It is not symmetric in $P$ and $Q$. Since the function is differentiable, we can use gradient-based methods to optimize it. Performing this optimization then returns a local optimum for the approximation of the posterior distribution.
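The information measures can be illustrated on small discrete distributions. This is a sketch under stated assumptions: the "expected amount of information" is taken to be Shannon entropy, the "measure of dependence" is taken to be mutual information, logarithms are base 2 (so results are in bits), and `q` must be positive wherever `p` is. All function names and distributions are hypothetical.

```python
import math


def entropy(p):
    """Shannon entropy in bits of a discrete distribution given as probabilities."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def kl_divergence(p, q):
    """KL divergence D(P || Q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mutual_information(joint):
    """Mutual information in bits from a joint probability table (list of rows)."""
    row_marginals = [sum(row) for row in joint]
    col_marginals = [sum(col) for col in zip(*joint)]
    total = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0:
                total += pxy * math.log2(pxy / (row_marginals[i] * col_marginals[j]))
    return total


# Hypothetical distributions over two outcomes.
fair = [0.5, 0.5]
biased = [0.9, 0.1]

print("entropy of fair coin:", entropy(fair))
print("KL(fair || biased):", kl_divergence(fair, biased))
print("KL(biased || fair):", kl_divergence(biased, fair))

# Hypothetical joint tables: independent variables vs. perfectly dependent ones.
independent = [[0.25, 0.25], [0.25, 0.25]]
dependent = [[0.5, 0.0], [0.0, 0.5]]

print("MI (independent):", mutual_information(independent))
print("MI (dependent):", mutual_information(dependent))
```

Swapping the arguments of `kl_divergence` changes the result, which reflects the asymmetry of the measure.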