Excel – Descriptive Statistics
Today, we have data all around with millions of new data points being generated every second. Everyone, a company, an individual, is using data to derive inferences and insights that can be useful. Statistics are
crucial for working on any dataset and statisticians employ various statistical tools and techniques to present the data in an easily understandable form.
What is Descriptive Statistics?
Summarizing and presenting an unorganized dataset in an organized way which may be tabular, graphical or numerical is referred to as descriptive statistics. Descriptive statistics help the readers in understanding the data well at first look. It also helps in making statistical inferences from the data.
For example, for Census 2021 the Government of India will conduct primary research to gather the data of its citizens which is helpful in policy-making decisions. The government can employ descriptive statistics
on this huge dataset to better understand the data by describing and summarizing its features.
Descriptive statistics include measures of location such as minimum, maximum, percentiles, quartiles,
and central tendency (mean, median, and mode); and measures of dispersion or variability, including range, variance, and standard deviation.
Measures of Location:
Measures of location are the best way in which one can reduce a set of data summarizing all data points with a single value. A single number is derived that is most representative of the entire dataset. The derived single number generally shows most of the properties of the data distribution.
Let us, for example, study a data set obtained from the measurement of the length of the crocodiles that are there in a pond. We observed the following things:
− the average length of a crocodile is 3.4 meters or that
− 50 % of the crocodiles have a length of fewer than 3.4 meters or that
− 90 % of the crocodiles have a length between 2.8 meters to 4.3 meters
Here in the above-stated example, the first two statements try to specify in some manner the location of the peak of the distribution, whereas the third statement tells us about the variability in the dataset.
Measures of Central Tendency:
These are the measures that help in locating the center of the data. These are as follows:
Mean: The mean value is generally known as the average value of the data. To get the value of the mean, we simply take summation of all the observations and then divide the sum by the total number of observations that are there in the dataset.
Mean = (Sum of Observations)/ (Total Number of Observations)
For Example, the marks of the 5 students are 2, 9, 10, 3, 6. The mean here would be:
Mean = (2+9+10+3+6)/ 5 = 30/ 5 = 6
Interpretation: Here we can say that on an average a student has got 6 marks in the class.
Median: The median is the location or a point that has half the data points smaller than that location and half the data points larger than that location. In the case of asymmetric data, the median is considered to be a more stable measure of central tendency that means. The reason behind it is that the median, in particular, is less affected by outliers (i.e. any extreme values) than the mean.
The median is calculated as follows:
− Sort all values in ascending/descending order.
− If the total number of values is odd then take the middle number of the series.
− If the total number of values even then takes the average of the middle two numbers.
For example, the marks of the 5 students are 2, 9, 10, 3, 6. Here, one must first sort these values in ascending order 2, 3, 6, 9, 10. Since the number of observations here is odd, we would take the middle value as the median for this series which is 6 in this case.
Mode: The Mode is the value that occurs most frequently in the dataset. It is not generally used in the statistical analysis. The mode is useful in the case of categorical data to describe the most frequent category. In a series, there could be more than one mode also and in that case, the series is called bimodal/ multimodal series.
For example, the marks of the 5 students are 2, 9, 10, 6, 6. In this case, 6 occurs most of the time (twice) in this series and hence becomes the mode of this series.
Measures of Dispersion: Measures of dispersion play a very important role in describing the spread of the data and its variation around a central value (Mean/ Median/ Mode). The spread of a data set can be described by various descriptive statistics like range, variance, and standard deviation.
Range: It is the difference between the lowest and highest values in a dataset. It is very simple to compute and useful when someone wishes to interpret the whole of a dataset. The range is useful to show the spread within a dataset and also, for comparing the spread/ variability among two or more similar datasets.
For example, the marks of the 5 students are 2, 9, 10, 3, 6. To calculate the value of the range, we need to subtract the lowest value which is 2 from the highest value which is 10 and we get the range as 8.
Variance: It is a numerical value used to show how widely individual observations vary in a group. When the individual observations vary to a large extent from the mean of the group then the variance of the group would be high, and if the individual observations vary to a small extent from the mean of the group then the variance of the group would below.
One must distinguish between the population variance and sample variance. Generally, they have different notations, and also, they are calculated differently. The variance of a population is denoted by
σ2 and the variance of a sample, by s2.
To calculate the variance of a population the following formula is used:
where σ2 is the variance of the population, X is the mean of the population, Xi is the ith element from the population and N is the total number of elements in the population.
The variance of a sample is defined by a different formula:
where s2 is the variance of the sample, x is the mean of the sample, xi is the ith element from the sample, and n is the total number of elements in the sample. By using the above-stated formula, we get the variance of the sample which is an unbiased estimate of the population variance.
Standard Deviation: Square root of the variance gives the standard deviation for the data. It is preferred over variance as it measures the deviation from the mean in the same units as the original data.
For example, if the data is of distance measurements in kilometers then the standard deviation will also be measured in kilometers, hence, the standard deviation is comparatively easy to interpret as compared to the variance.
Measures of Shape:
Skewness: The skewness measures the asymmetry of the distribution by the tendency of the data to be spread out on one side of mean compared to the other. There are two types of skewness i.e. left-skewed and right-skewed distribution.
The nearer the skewness statistics is to 0, the more symmetric a dataset is. The data is said to be left-skewed, implying, the mean is less than the median when the skewness statistic is negative and it is said to be right-skewed, implying the mean is greater than the median when the statistic is positive.
Kurtosis: It measures the peakedness of a distribution. Kurtosis statistic less than 3, implies, distribution with tails data exceeding the normal distribution also referred to as a platykurtic curve. Statistic greater
than 3, implies the tail data is less extreme compared to the normal distribution, also referred to as a leptokurtic curve.
Now let’s see how to add Data Analysis option in MS-Excel for easy application of data analysis options
on various datasets.
Step 1: Go to ‘File’ that appears on the top left side of MS-Excel.
Step 3: Check ‘Analysis ToolPak’, ‘Analysis ToolPak – VBA’ and ‘Solver Add-in’ in the dialog box. Next, click OK. The Data Analysis option has been added now.
Step 4: Click ‘Data’ from the options and see the ‘Data Analysis’ option is now available for use.
Let’s understand how to use descriptive statistics in MS-Excel on stock prices and infer the results. Consider Axis Bank traded as equity stock on NSE. The data has been downloaded from NSE for a period of 24 months, 1st January 2018 to 30th December 2019.
Note: Return on stock prices is calculated as Pt – Pt-1 / Pt-1where Pt is current periods’ price and Pt-1 is previous periods’ price
Now, we will use the ‘Data Analysis’ option in Excel [as learned earlier] on the monthly returns of Axis Bank stock to find a summary of the descriptive statistics and infer the results.
Step 1: Click on ‘Data’ then select ‘Data Analysis’. Find the ‘Descriptive Statistics’ option as shown in the image below and click OK.
Step 2: Select the ‘Input Range’ here, $D$3:$D$25 as only numeric values to be considered. Define the ‘Output Range’ for the result to be generated and then select ‘Summary Statistics’ and click OK.
The following Summary Statistics are generated for the Monthly Return of Axis Bank stock.
Interpretation of the Summary Statistics:
The count of the data implies that there are 23 observations of monthly return. The maximum monthly return that an investor could earn on a monthly basis is a profit of 16.06% and the minimum being a loss of
11.38% on the Axis Bank stock, implying a range of 27.44% between the maximum loss and profit. The sum of 28.25% implies that an investor can gain 28.25% if he/ she kept the stock for the time period of the analysis i.e. from 1st January 2018 till 30th December 2019.
From the measures of central tendency, we can infer that the average or mean return that the stock yields over the time period are 1.23%, whereas the median is 1.12%. The mean and median are slightly different but they imply that the data is not symmetric. Also, there is no mode in the data of monthly returns. The standard deviation of 6.02% implies the extent of deviation from the mean return that the stock yields. The measures of shape infer that the distribution is rightly skewed with skewness statistic being positive.
Kurtosis statistic is less than 3 which implies, that the investor will experience occasional extreme returns (either positive or negative), more extreme than the usual + or – three standard deviations from the mean that is predicted by the normal distribution of returns.
As seen above, one can use descriptive statistics to better understand the data. We can perform the descriptive statistics on any stock using the historical price data to check the range of profit and loss and take better-informed decisions on the basis of our inferences.