The post Data Sciences Interview preparation appeared first on StepUp Analytics.
Central tendency describes the central value of a data set, i.e. the value around which most observations lie. The 3 common measures of central tendency are:
Mean: the sum of all the values in the data set divided by the number of values, so every value contributes to it.
Median: the middle value once the given values are arranged in ascending or descending order.
Mode: the value that occurs most frequently in the data set.
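As a minimal sketch, the three measures can be computed with Python's standard `statistics` module. The marks used here are illustrative values only.

```python
import statistics

marks = [2, 9, 10, 6, 6]  # illustrative data

mean_value = statistics.mean(marks)      # sum of values / number of values
median_value = statistics.median(marks)  # middle value of the sorted data
mode_value = statistics.mode(marks)      # most frequent value

print(mean_value, median_value, mode_value)
```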
Dispersion is the extent to which numerical data vary about an average value. In other words, dispersion is the “scatteredness” of the data.
The various measures of dispersion are:
Range: It is the difference between the two extreme observations of the distribution, i.e. the largest value minus the smallest value.
Quartile Deviation: Quartile Deviation (QD) is half of the difference between the upper and lower quartiles, i.e. QD = (Q3 − Q1)/2.
Mean Deviation: Mean deviation is the average of the absolute deviations of the values from the mean (or from another average such as the median) in a sample.
Standard Deviation: It is the positive square root of the arithmetic mean of the squared deviations from the mean. It is a measure of the amount of variation or dispersion of a set of values. A low standard deviation indicates that the values tend to be close to the mean (also called the expected value) of the set, while a high standard deviation indicates that the values are spread out over a wider range around the mean.
Quartiles: Quartiles are the 3 cut points that divide the ordered data (or a probability distribution) into 4 parts of equal probability.
The first quartile (Q1) is defined as the middle number between the smallest number and the median of the dataset.
The second quartile (Q2) is the median of the data and 50% of the data lies below this point.
The third quartile (Q3) is the middle value between the median and the highest value of the data set.
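The measures of dispersion and the quartiles described above can be sketched as follows. The data are illustrative, and `statistics.quantiles` uses one particular quartile convention (its default "exclusive" method), so other textbooks or tools may give slightly different Q1 and Q3.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative sample

data_range = max(data) - min(data)            # largest minus smallest
q1, q2, q3 = statistics.quantiles(data, n=4)  # three cut points, four parts
quartile_deviation = (q3 - q1) / 2            # half the interquartile range

mean = statistics.mean(data)
mean_deviation = sum(abs(x - mean) for x in data) / len(data)
std_dev = statistics.pstdev(data)             # divides by n (population form)

print(data_range, quartile_deviation, mean_deviation, std_dev)
```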
Skewness is defined as a lack of symmetry. It gives an idea of the shape of the curve, which can be positively skewed (longer tail on the right) or negatively skewed (longer tail on the left).
Kurtosis gives an idea about the peakedness of the frequency curve and the heaviness of its tails. A curve can be classified as mesokurtic (normal), leptokurtic (more peaked, heavier tails) or platykurtic (flatter, lighter tails).
The key difference between skewness and kurtosis is that skewness describes how lopsided the distribution is, while kurtosis describes how peaked or flat it is.
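One common way to quantify both ideas is through central moments. The sketch below assumes the moment-based definitions (skewness = m3 / m2^1.5, kurtosis = m4 / m2^2, with the normal curve at 3); the function names and data are illustrative.

```python
def central_moment(values, k):
    """k-th central moment, using n in the denominator."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** k for x in values) / len(values)

def skewness(values):
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def kurtosis(values):
    # Values above 3 suggest a leptokurtic curve, below 3 a platykurtic one.
    return central_moment(values, 4) / central_moment(values, 2) ** 2

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]

print(skewness(symmetric), skewness(right_skewed), kurtosis(symmetric))
```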
Ques. What does a low and high standard deviation indicate?
Ans: A low standard deviation indicates that the values tend to be close to the mean of the set, while a high standard deviation indicates that the values are spread out over a wider range.
Ques. Suppose that your height is 5 units and you have to cross a river where the depth observations (in the same units) are 2, 4, 8, 6 and 10. Would you be able to cross the river without sinking? Which method will you apply to decide?
Ans: Here we consider the range (a measure of dispersion) to form a quick picture of whether we can cross the river. The highest observation is 10 and the lowest is 2, which gives a range of 8. Since the water is as deep as 10 at some point, which is more than our height of 5, it won’t be possible to cross the river without sinking.
Ques. Which is the ideal average among the central tendency?
Ans: The arithmetic mean, which satisfies almost all the conditions of an ideal average, is generally regarded as the best measure of central tendency.
Ques. What is the median of the human body?
Ans: Navel is the median of the body which divides the body into two equal parts.
Ques. What are the different types of mean?
Ans: There are 3 types of mean:
Arithmetic: The simple mean is also called the arithmetic mean, where the sum of all the observations is divided by the total number of observations.
Harmonic: The reciprocal of the arithmetic mean of the reciprocals of the values. It is generally used when rates (such as speeds) are to be averaged.
Geometric: The nth root of the product of n values.
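A short sketch of the three means using Python's `statistics` module (`harmonic_mean` and `geometric_mean` are available from Python 3.8). The speeds are hypothetical values for a trip covering two equal distances.

```python
import statistics

speeds = [40, 60]  # hypothetical speeds in km/h over two equal distances

arithmetic = statistics.mean(speeds)
harmonic = statistics.harmonic_mean(speeds)     # correct average speed here
geometric = statistics.geometric_mean([2, 8])   # square root of 2 * 8

print(arithmetic, harmonic, geometric)
```

The harmonic mean is lower than the arithmetic mean because more time is spent travelling at the slower speed.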
Ques. Which is the best measure of dispersion?
Ans: Standard deviation is regarded as the best measure as it satisfies almost all the properties of an ideal measure of dispersion. However, it is affected by outliers because it is based on the mean, which is very sensitive to extreme values.
Ques. Which method of dispersion is based on 50% of the data?
Ans: Quartile deviation is the method of dispersion which includes 50% of the data only so it can’t be regarded as a reliable measure as it ignores the rest of the data.
Ques. Can the mean be used for 2, 3, 4, 5, 6, 7 and 1000?
Ans: Yes, the mean exists for this data, but because an extreme value is present it does not represent the data well. The mean is 146.71, and none of the observations lies anywhere near that value, so it does not give a useful summary.
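A quick check of this example, as a sketch using the standard `statistics` module. The median is shown alongside for comparison because it is not pulled towards the extreme value.

```python
import statistics

values = [2, 3, 4, 5, 6, 7, 1000]

mean_value = statistics.mean(values)      # dragged upwards by 1000
median_value = statistics.median(values)  # stays among the typical values

print(round(mean_value, 2), median_value)
```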
Ques. What is the meaning of absolute and relative measures of dispersion?
Ans: An absolute measure expresses the dispersion in the same units as the original data, whereas a relative measure is a ratio such as the coefficient of variation, defined as the standard deviation divided by the mean. Unlike absolute dispersion, relative dispersion is dimensionless.
Ques. If all the observations are increased by 10, then what will be the effect on the standard deviation?
Ans: It will remain unchanged. The variance is independent of a change of origin, and the standard deviation is the square root of the variance, so it also remains unchanged.
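This property can be verified numerically. The sketch below uses illustrative values and also shows, for contrast, that a change of scale does change the standard deviation.

```python
import statistics

values = [20, 25, 29, 25, 18, 20, 30, 31, 26]  # illustrative data
shifted = [v + 10 for v in values]             # change of origin
scaled = [v * 10 for v in values]              # change of scale

sd_original = statistics.pstdev(values)
sd_shifted = statistics.pstdev(shifted)
sd_scaled = statistics.pstdev(scaled)

print(sd_original, sd_shifted, sd_scaled)
```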
Ques. What are the limits of Karl Pearson’s coefficient of skewness?
Ans: Karl Pearson’s coefficient of skewness, 3(mean − median)/standard deviation, lies between -3 and 3, but in practice these limits are rarely attained.
Ques. When is a distribution called positively skewed and when negatively skewed?
Ans: There are generally two cases:
i. Skewness is said to be positive when the mean is greater than the median and the mode.
ii. Skewness is said to be negative when the mean is less than the median and the mode.
Ques. Enlist all the cases when Bowley’s coefficient of skewness can be used?
Ans: Bowley coefficient which is also known as quartile coefficient of skewness is extremely useful when median and quartiles are being used.
Ques. What is the key difference between median and mean?
Ans: The median divides the ordered data into two equal halves, i.e. the probability of an observation falling in either half is 0.5.
The mean is the arithmetic average of all the values, the balance point of the data. It does not necessarily divide the data into two equal halves, although the mean and the median coincide in some situations, such as symmetric distributions.
Ques. If we have ordered data then can we use mode?
Ans: The mode can be computed for any data, but it is mainly used when we are dealing with unordered (nominal) data.
There are two types of categorical data:
1. Ordered (ordinal) data, mainly values representing rank or order
2. Unordered (nominal) data
When we have ordinal data, the median is usually the best measure, and when the data is nominal, the mode is the best choice.
Ques. We have the values 20, 25, 29, 25, 18, 20, 30, 31, 26 and their mean is approximately 24.9. If we take the sum of the deviations of these values from their mean, what will be the value?
Ans: The sum of the deviations of the values about their mean is always zero.
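A small check of this result, as a sketch. It uses exact fractions so that rounding does not hide the zero, and it also shows that a rounded mean leaves a small non-zero remainder.

```python
from fractions import Fraction

values = [20, 25, 29, 25, 18, 20, 30, 31, 26]

exact_mean = Fraction(sum(values), len(values))      # 224/9
sum_of_deviations = sum(v - exact_mean for v in values)

# With a mean rounded to one decimal place the sum is no longer exactly zero.
sum_with_rounded_mean = sum(v - 24.8 for v in values)

print(exact_mean, sum_of_deviations, sum_with_rounded_mean)
```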
Ques. Suppose we have some data and the analyst wants to check the variability in the data. What statistic should the analyst prefer for this?
Ans: The most efficient measure of dispersion is Standard deviation. It helps us to check the variation or the spread of data about the mean. This one is the most efficient because it takes all the data points into account.
Ques. What does a standard deviation of 0 signify?
Ans: A standard deviation of 0 means that our random variable is constant and all the data points are the same without any dispersion.
Example: If a random variable has values like 9,9,9,9,9,9,9,9,9 then the mean of X will be 9 and its standard deviation will be 0.
Ques. If we calculate variance with n-1 in the denominator instead of n, what will it mean?
Ans: It means that the data we have is a sample and not the population because variance calculated with n-1 in the denominator gives us sample variance.
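The two denominators correspond to two functions in Python's `statistics` module. A minimal sketch with illustrative data:

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative values

population_variance = statistics.pvariance(data)  # divides by n
sample_variance = statistics.variance(data)       # divides by n - 1

print(population_variance, sample_variance)
```

The sample variance is always slightly larger, and the gap shrinks as n grows.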
When we deal with a bivariate distribution or data, we are interested in finding out the correlation and covariance between the two variables. If a change in one variable is accompanied by a change in the other variable, then the variables are said to be correlated. Correlation can be of two types:
i. If the two variables deviate in the same direction, i.e. an increase in one variable is accompanied by a corresponding increase in the other, they are said to be positively correlated.
ii. If the two variables deviate in opposite directions, i.e. an increase in one variable is accompanied by a decrease in the other, they are said to be negatively correlated.
Regression analysis is a statistical modeling technique used to find the relationship between a dependent variable and one or more independent variables. It is undertaken to predict future values based on previously observed data. It helps in estimating how the independent variables affect the dependent variable, and by how much.
The most common form of regression is linear regression which assumes a linear relationship between the input variables (x) and the single output variable (y). More specifically, y can be calculated from a linear combination of the input variables.
For example: if we want to find a relationship between a pen and its ink, we can say that the pen is the dependent variable and the ink is the independent variable.
Ques: Are correlation and dependency the same thing?
Ans: No. Independence between two variables implies zero correlation, but the converse is not true. Two perfectly dependent variables can also have zero correlation, because correlation measures only linear association.
Example: let y be the square of x. Then y is completely determined by x, yet the correlation is positive for positive x, negative for negative x, and zero when x is spread symmetrically about zero. This shows that a pair of variables that are perfectly dependent on each other can still give a zero correlation.
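The example can be checked numerically. The sketch below defines a small Pearson correlation helper (`pearson_r` is an illustrative name) and applies it to y = x² over a symmetric range and over each half separately.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]               # y depends perfectly on x

r_all = pearson_r(xs, ys)               # symmetric range
r_positive = pearson_r(xs[4:], ys[4:])  # x = 1, 2, 3
r_negative = pearson_r(xs[:3], ys[:3])  # x = -3, -2, -1

print(r_all, r_positive, r_negative)
```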
Ques: Suppose that Raman ate 5 plates of pasta ordered online, and the owner of the restaurant is infected with corona. Is there any chance that Raman gets infected? Is this negatively or positively related to the occurrence of the infection?
Ans: Yes, Raman has a chance of getting infected, as he could have come into contact with the restaurant one day or another without knowing it, and since this is a communicable disease the chances of it passing from one person to many are very high. Hence, we can say that the two are positively related to each other.
Ques: What do densely packed points in a scatter plot depict?
Ans: There can be two cases here:
i. If the data points are dense, i.e. close to each other, the variables have a good amount of correlation between them.
ii. If the points are scattered everywhere, we can say that the variables do not have much correlation between them.
Ques: What is the relation between two independent variables?
Ans: Two independent variables are always uncorrelated, i.e. their correlation is zero.
Ques: Is the Pearson coefficient sensitive to outliers?
Ans: Yes, the Pearson coefficient is very much sensitive to the outlier as even a single point can change the direction of the coefficient.
Ques: What is the difference between CORRELATION and REGRESSION?
Ans: THE DIFFERENCE IS:
i. The slope in a linear regression gives the marginal change in the output/target variable when the independent variable changes by one unit. Correlation has no slope.
ii. The intercept in a linear regression gives the value of the target variable when all of the input/independent variables are set to zero. Correlation does not carry this information.
iii. Linear regression can give you a prediction given all the input variables. Correlation analysis does not predict anything.
Ques. Difference between correlation and covariance?
Ans: Correlation is simply the covariance normalized by the standard deviations of both variables, which brings the result into the fixed range of -1 to 1. In simple words, both terms measure the relationship and the dependency between two variables. “Covariance” indicates the direction of the linear relationship between the variables. “Correlation”, on the other hand, measures both the strength and the direction of the linear relationship between two variables.
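A sketch of the practical consequence: changing the units of one variable changes the covariance but leaves the correlation untouched. The helper names and the height/weight figures are hypothetical.

```python
import math

def covariance(xs, ys):
    """Population covariance (n in the denominator)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

def correlation(xs, ys):
    # Covariance divided by the product of the two standard deviations.
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

heights_m = [1.5, 1.6, 1.7, 1.8, 1.9]     # hypothetical heights in metres
weights_kg = [50, 58, 63, 72, 80]         # hypothetical weights
heights_cm = [h * 100 for h in heights_m]  # same heights, different unit

print(covariance(heights_m, weights_kg), covariance(heights_cm, weights_kg))
print(correlation(heights_m, weights_kg), correlation(heights_cm, weights_kg))
```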
Ques. If a constant 50 is subtracted from each value of X and Y, what will be the change in the regression coefficient?
Ans: There will be no change, as the regression coefficient is independent of a change of origin but not of a change of scale.
Ques. Why do we have two lines of regression?
Ans: There are always two lines of regression: the line of Y on X, which is used to predict Y from X, and the line of X on Y, which is used to predict X from Y.
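The two lines can be fitted by least squares as sketched below. `fit_line` is an illustrative helper and the data are made up. The sketch also checks two standard facts: both lines pass through the point of means, and the product of the two regression coefficients equals the squared correlation coefficient.

```python
def fit_line(xs, ys):
    """Least-squares line predicting ys from xs; returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)

b_yx, a_yx = fit_line(xs, ys)  # line of Y on X: y = a_yx + b_yx * x
b_xy, a_xy = fit_line(ys, xs)  # line of X on Y: x = a_xy + b_xy * y

r_squared = b_yx * b_xy        # product of the two regression coefficients
print(b_yx, a_yx, b_xy, a_xy, r_squared)
```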
Ques. What is the angle between the two lines?
Ans: The two lines of regression form two angles between them, namely:
i. an acute angle
ii. an obtuse angle.
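Ques. When do the two lines intersect?
Ans: The two lines of regression always intersect at the point of means (mean of X, mean of Y). They are perpendicular to each other when the correlation coefficient is 0 and they coincide when it is +1 or -1.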
Ques. State the difference between homoscedastic and heteroscedastic
Ans: Homoscedasticity refers to a condition in which the variance of the residual, or error term, in a regression model is constant, so the points are spread evenly around the regression line. Heteroscedasticity is the condition in which the variance of the residuals is not constant, so the spread of the points around the line varies across the range of the data.
Ques. What are the assumptions for REGRESSION?
Ans: The basic assumptions for regression are as follows:
i. Independence of errors
ii. Normality of the error distribution
iii. Quantitative data
iv. Homoscedasticity
Ques. What does Multiple Regression mean?
Ans: Multiple regression is used when we have more than one explanatory (independent) variable, i.e. when the dependent variable is related to two or more other variables.
Ques. State the assumptions of Multiple Regression?
Ans: The assumptions are as follows:
i. Multivariate normality: the residuals are assumed to be normally distributed
ii. No multicollinearity: the explanatory variables are assumed not to be highly correlated with each other
A hypothesis is a statement about the population distribution which we want to verify from the information available to us. It is a proposition made on the basis of past data, facts, and personal assumptions.
When the hypothesis specifies the population completely, it is known as a SIMPLE HYPOTHESIS, whereas when it does not specify the population completely it is known as a COMPOSITE HYPOTHESIS.
We have two kinds of hypotheses:
i. NULL HYPOTHESIS: It is the hypothesis which assumes that there is no significant difference between two or more characteristics, or between the sample and the population. It is usually a commonly accepted fact, and we try to disprove this hypothesis using various tests.
ii. ALTERNATIVE HYPOTHESIS: In statistical terms, the alternative hypothesis states that there is a significant difference. It is the hypothesis opposite, or alternative, to the null hypothesis.
Degree of Freedom: It is the number of values in the final calculation of a statistic that are free to vary to estimate a parameter. The degrees of freedom equal your sample size minus the number of parameters you need to calculate during an analysis.
i. Two-Tailed test: If we have an alpha of 5%, the test allots half of the alpha to one tail of the distribution of the test statistic and the other half to the other tail. In this type of test, we test our null hypothesis in both directions of the distribution.
Suppose we want to test whether the students have scored 70% marks on average or not. Then our null hypothesis will be mean = 70. In a two-tailed test, we test both whether the mean is significantly greater than 70 and whether it is significantly less than 70.
ii. One-Tailed Test: It is exactly like the two-tailed test, but here we test our null hypothesis in only one direction. There are two types of one-tailed test:
Right tailed test: When the null hypothesis is being tested on the right side of the distribution.
Left tailed test: When the null hypothesis is being tested on the left side of the distribution.
P-value: It is the probability of obtaining results at least as extreme as the observed results of a statistical hypothesis test, assuming that the null hypothesis is correct.
Sampling Error: A sample cannot include all the characteristics of the population, so any estimate based on it is prone to some error, and this error is known as sampling error.
In hypothesis testing we come across two types of errors:
Type 1 Error: Rejecting H0 when H0 is true. Its probability is also known as the producer’s risk.
Type 2 Error: Accepting H0 when H0 is false. Its probability is also known as the consumer’s risk.
Ques. Why do we reject our null hypothesis when the p-value is small?
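Ans: The p-value is the observed level of significance, which is compared with the chosen level of significance. A small p-value means that results as extreme as the observed ones would be unlikely if the null hypothesis were true, and hence when the p-value is smaller than the chosen level of significance we reject the null hypothesis.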
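As a sketch of this decision rule, the following computes a two-tailed p-value for a one-sample Z-test using `statistics.NormalDist`. All the numbers (sample mean, hypothesized mean, known population standard deviation, sample size) are hypothetical.

```python
import math
from statistics import NormalDist

# Hypothetical inputs: H0 says the population mean is 70.
sample_mean, mu0, sigma, n = 72.0, 70.0, 10.0, 100

z = (sample_mean - mu0) / (sigma / math.sqrt(n))   # test statistic
p_two_tailed = 2 * (1 - NormalDist().cdf(abs(z)))  # area in both tails

alpha = 0.05
reject_null = p_two_tailed < alpha

print(z, round(p_two_tailed, 4), reject_null)
```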
Ques. Which error is more dangerous?
Ans: Type 1 error is more dangerous as it involves producer risk which means the products/supplies which are capable of being supplied will not get into the market.
Ques. What does the level of significance depict?
Ans: The probability of Type-1 error, is known as Level of Significance and it is also known as the size of the critical region.
Ques. Is p=0.05 significant?
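Ans: By convention, a p-value below 0.05 is taken as evidence against the null hypothesis at the 5% level of significance, so we reject the null hypothesis in favour of the alternative. A p-value of exactly 0.05 lies on the boundary, and the decision then depends on the convention adopted.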
Ques. What is the difference between one tail and two tail?
Ans: In a one-tailed test, the entire 5% critical region lies in either the right or the left tail, whereas a two-tailed test splits the alpha into two equal parts, one in each tail.
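The effect on the critical value can be sketched for a test based on the standard normal distribution, again with `statistics.NormalDist`. The observed statistic of 1.8 is a hypothetical value chosen to fall between the two cut-offs.

```python
from statistics import NormalDist

alpha = 0.05
std_normal = NormalDist()

right_tail_cutoff = std_normal.inv_cdf(1 - alpha)    # all of alpha in one tail
two_tail_cutoff = std_normal.inv_cdf(1 - alpha / 2)  # alpha / 2 in each tail

z_observed = 1.8  # hypothetical test statistic
reject_one_tailed = z_observed > right_tail_cutoff
reject_two_tailed = abs(z_observed) > two_tail_cutoff

print(round(right_tail_cutoff, 3), round(two_tail_cutoff, 3))
print(reject_one_tailed, reject_two_tailed)
```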
Ques. When can the two-tailed test be used?
Ans: A two-tailed test uses both the positive and negative tails of the distribution, so it is used when we want to detect a difference in either direction.
Ques. What is a critical region?
Ans: These are those areas that lead to the rejection of the Null Hypothesis at some probability level.
Ques. Which is the best critical region?
Ans: The best critical region is the region that, for a fixed probability of Type 1 error, minimizes the probability of Type 2 error, i.e. it gives the most powerful test.
GOODNESS OF FIT: It is the statistical technique to determine how well the sample observations fit the distribution expected for the population. It helps in comparing the observed sample distribution with the expected probability distribution.
Now the different types of test which we know are:
CHI-SQUARE TEST: This test is done to test the relationship between categorical variables. The null hypothesis is that there is no significant association between the variables, i.e. they are independent.
A very high chi-square value indicates that the observed data do not fit the expected frequencies well.
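A minimal sketch of the chi-square statistic for a test of independence, computed by hand on a hypothetical 2×2 contingency table. Expected frequencies are row total × column total / grand total.

```python
observed = [[20, 30],
            [30, 20]]  # hypothetical 2x2 contingency table

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi_square = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand_total
        chi_square += (obs - expected) ** 2 / expected

degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)
print(chi_square, degrees_of_freedom)
```

With 1 degree of freedom the 5% critical value is about 3.841, so a statistic above that would lead to rejecting independence at the 5% level.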
T-TEST: This test helps us to know whether there is a significant difference between two means. As the degrees of freedom increase, the t-distribution curve approaches the normal distribution curve. There are three types of tests involving the t-distribution:
i. Independent (Two-Sample) T-Test: this test checks for a significant difference between the means of two independent samples
ii. Paired T-Test: this test measures the before and after effect of a situation on the same units
iii. One-Sample T-Test: this test compares one sample mean with a standard or given mean
F-TEST: This test generally arises when a model is fitted to the data by the method of least squares. It helps in comparing the variances of two different populations and finding out whether there is a significant difference between them.
Z-TEST: This test is applied to check the difference between two population means when the population variances are known or the samples are large. This test is based on the normal distribution.
Ques: Name a distribution whose variance is twice of its mean?
Ans: Chi-square distribution has a mean (n) and variance (2n).
Ques: Under what condition does the gamma distribution reduce to the chi-square distribution?
Ans: A gamma distribution with parameters ½ and ½n is a chi-square distribution with n degrees of freedom.
Ques. What is the distribution of the sum of independent chi-square variates?
Ans: The sum of independent chi-square variates is also a chi-square variate, with degrees of freedom equal to the sum of the individual degrees of freedom.
Ques. If X1 and X2 are two independent chi-square variates with n1 and n2 df, which distribution will X1/X2 follow?
Ans: X1/X2 will follow a beta distribution of the second kind, β2(n1/2, n2/2).
Ques. State the condition for the chi-square test?
Ans: The conditions for the chi-square test are:
i. The sample observations should be independent of each other
ii. N, the total frequency, should be sufficiently large, say greater than 50.
iii. Constraints of the cell frequency should be linear.
iv. No cell frequency should be less than 5
v. We will not make any kind of assumption regarding the parent population.
Ques. What distribution will (2X)^(1/2) follow?
Ans: If X is a chi-square variate with n degrees of freedom, then for large n, (2X)^(1/2) approximately follows a normal distribution with mean (2n)^(1/2) and variance 1.
Ques. Which test can be used for testing the homogeneity of an independent estimate of the population correlation coefficient?
Ans: Chi-square test can be used to test the same.
Ques. What does the term “degree of freedom” mean?
Ans: The number of independent variates that make up a statistic is called its degrees of freedom.
Ques. Under what condition will YATES’S CORRECTION be used?
Ans: In a 2×2 contingency table, if any cell frequency is smaller than 5, pooling is not possible, so the correction for continuity popularly known as “Yates’s correction” is used.
Ques. Under what condition will Student’s t-distribution become the Cauchy distribution?
Ans: When we put v=1 in the pdf of the t-distribution, it becomes the Cauchy distribution.
Ques. What is the value of the odd moments in t-distribution?
Ans: Since the t-distribution is symmetric about zero, all its odd moments that exist are 0.
Ques. Which test will be used to test the significance of the observed partial correlation coefficient?
Ans: T-TEST will be used here.
Ques. State applications of t-distribution?
Ans: The application of t-distribution will be-
Ques. What will be the null hypothesis under the t-test for a single mean
Ans: The null hypothesis under this will be there is no significant difference between the sample mean and the population mean
Ques. Suppose that the government wants to calculate the before and after effect of the pandemic situation in DELHI for working-class people. Which test should the government apply?
Ans: PAIRED T-TEST should be applied by the government as it takes into account the before and after effect.
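The paired t statistic is computed from the differences within each pair. The sketch below uses hypothetical before/after scores for five units. It stops at the test statistic, since the standard library has no t-distribution for the p-value.

```python
import math
import statistics

before = [10, 12, 9, 11, 13]  # hypothetical measurements before
after = [12, 14, 10, 13, 16]  # hypothetical measurements after, same units

differences = [a - b for a, b in zip(after, before)]
n = len(differences)

mean_diff = statistics.mean(differences)
sd_diff = statistics.stdev(differences)  # sample SD, n - 1 denominator

t_statistic = mean_diff / (sd_diff / math.sqrt(n))
degrees_of_freedom = n - 1

print(round(t_statistic, 4), degrees_of_freedom)
```

The statistic is then compared with the t table value for n − 1 degrees of freedom.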
Ques. Define F distribution?
Ans: The ratio of two independent chi-square variates, each divided by its degrees of freedom v1 and v2 respectively, follows the F distribution with (v1, v2) df.
Ques. What is the mode of F-distribution?
Ans: The mode of f distribution is always less than unity.
Ques. Which test will be used to test the equality of two population variances?
Ans: The F-TEST, which is based on the F-distribution, will be used to test the equality of two population variances.
Ques. State the application of Z-transformation?
Ans: Z-TEST has the following application:
Non-parametric methods allow us to analyse a sample without assuming that it is drawn from any particular distribution. These tests are more robust than parametric tests, as they can be applied to a broader range of situations.
However, when the assumptions of the parametric test hold, a non-parametric test has less power to reject a false null hypothesis, as the p-value associated with it tends to be higher than the p-value associated with the parametric test.
These tests are simpler than parametric tests, but when the parametric assumptions are satisfied, the parametric test gives more accurate results.
Some advantages of using non-parametric are:
Ques. What is the difference between parametric and non-parametric methods?
Ans: In the parametric methods the form of the parent population distribution is assumed, but in the non-parametric methods no such assumption is made.
Parametric methods are less suited to socio-economic data, whereas non-parametric methods have spread their roots into psychometry, sociology, and statistics.
Parametric methods cannot be applied to data that is only ranked, whereas non-parametric methods can work with ranked data.
Ques. What is the meaning of RUN?
Ans: A RUN is a sequence of identical letters (symbols) that is preceded and followed by letters of a different kind, or by no letters at all.
Ques. Name two tests which are based upon one sample or matched pairs.
Ans: SIGN TEST and WILCOXON SIGNED RANK TEST are generally used in one sample test.
Ques. Name one test which is based upon two independent samples?
Ans: MANN-WHITNEY TEST is the test that is based upon two independent samples.
Ques: Name one test which can be used as an alternative to ONE WAY ANOVA?
Ans: KRUSKAL WALLIS TEST which is a non-parametric test is an alternative test to one-way ANOVA.
Ques. If we want to check the trends in time data then which non-parametric test can be used?
Ans: The Mann-Kendall test can be used to judge the trend for the same.
Ques. Suppose that we need to compare our data from the median and then find the deviation in the sample. Which test can be used in this case?
Ans: The WILCOXON SIGNED RANK TEST can be used to compare the sample with the median, using the ranks of the deviations from it.
Ques. Which test can be used to find a correlation between two samples?
Ans: Spearman’s rank correlation can be used to find the correlation between two samples.
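A sketch of Spearman's rank correlation using the classic formula 1 − 6Σd² / (n(n² − 1)), where d is the difference between the two ranks of each pair. This version assumes there are no tied values. The helper names and data are illustrative.

```python
def ranks(values):
    """Rank each value from 1 (smallest) upwards; assumes no ties."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

def spearman_rho(xs, ys):
    n = len(xs)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

xs = [10, 20, 30, 40, 50]  # illustrative sample 1
ys = [1, 3, 2, 5, 4]       # illustrative sample 2

rho = spearman_rho(xs, ys)
print(rho)
```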
Ques. Suppose that I have two samples in which their central tendencies are being measured, then which test can be used for the same?
Ans: MEDIAN TEST can be used in this case as they take into account the central tendency which is based upon two samples.
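MLE (maximum likelihood estimation) is the method of estimating the parameters by maximizing a likelihood function, i.e. choosing the parameter values under which the observed data are most probable. Under suitable conditions this method gives asymptotically the most efficient estimators, although the MLE need not always be uniquely defined.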
Steps to calculate MLE:
Hence, by following these simple steps, we will be able to find the maximum likelihood estimate.
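As an illustration of maximizing a likelihood, the sketch below estimates the success probability p of a Bernoulli variable from hypothetical 0/1 data. It maximizes the log-likelihood over a grid of candidate values and compares the result with the known closed-form answer (the sample proportion). The grid search is chosen only for simplicity.

```python
import math

tosses = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]  # hypothetical data, 1 = success

def log_likelihood(p, data):
    """Log-likelihood of independent Bernoulli(p) observations."""
    return sum(math.log(p) if x == 1 else math.log(1 - p) for x in data)

# Candidate values 0.01, 0.02, ..., 0.99 (0 and 1 excluded to avoid log(0)).
grid = [i / 100 for i in range(1, 100)]
p_grid = max(grid, key=lambda p: log_likelihood(p, tosses))

p_closed_form = sum(tosses) / len(tosses)  # sample proportion

print(p_grid, p_closed_form)
```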
ROC CURVE: This curve tells us how capable our model is of distinguishing between different classes. The higher the area under the curve (AUC), the better the model.
Properties of good estimator:
UNBIASEDNESS
An estimator is called an unbiased estimator if its expected value exists and equals the parameter being estimated. Sometimes unbiasedness can give an absurd answer also.
For e.g.: if E(T) = Y, then T is an unbiased estimator of Y.
CONSISTENCY
An estimator is consistent if it converges in probability to the parameter as n tends to infinity. It is sufficient that its expectation tends to the parameter and its variance tends to zero as n tends to infinity.
EFFICIENCY
The ratio of the variances of two estimators is termed the efficiency of one relative to the other, and the estimator with the smaller variance is the more efficient.
SUFFICIENCY
When the estimator contains all the information in the sample about the parameter, it is called a sufficient estimator.
Ques. What is the meaning of MLE?
Ans: MLE stands for maximum likelihood estimation. It is a method for estimating the parameters of a probability distribution by maximizing the likelihood function, so that the observed data are most probable under the assumed statistical model.
Ques. Is MLE UNBIASED and CONSISTENT?
Ans: Under regularity conditions MLEs are consistent, but they need not be unbiased. Any bias typically vanishes as the sample size increases.
Ques. Is MLE the most efficient estimator?
Ans: Yes, under regularity conditions the MLE is asymptotically the most efficient estimator in its class of estimators.
Ques. What is the ROC curve all about?
Ans: The ROC curve is a graphical representation of the diagnostic ability of a binary classifier as its decision threshold is varied.
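A minimal sketch of how the curve is built: each distinct score is used in turn as the threshold, giving one (false positive rate, true positive rate) point, and the area under the resulting curve is obtained with the trapezoidal rule. The labels, scores and function names are hypothetical.

```python
def roc_points(labels, scores):
    """(FPR, TPR) points obtained by lowering the threshold step by step."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for l, s in zip(labels, scores) if s >= threshold and l == 1)
        fp = sum(1 for l, s in zip(labels, scores) if s >= threshold and l == 0)
        points.append((fp / negatives, tp / positives))
    return points

def area_under_curve(points):
    """Trapezoidal area under consecutive (x, y) points."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

labels = [0, 0, 1, 1]           # hypothetical true classes
scores = [0.1, 0.4, 0.35, 0.8]  # hypothetical classifier scores

points = roc_points(labels, scores)
auc = area_under_curve(points)
print(points, auc)
```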
Ques. What is the full form of ROC?
Ans: RECEIVER OPERATING CHARACTERISTIC curve
Ques. What does the ROC curve denote?
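Ans: The ROC curve plots the true positive rate against the false positive rate, so it shows the trade-off between the benefits (true positives) and the costs (false positives) of using a particular test.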
Ques. What does ROC AUC value tell us?
Ans: In ROC AUC –
Ques. What is an Estimator?
Ans: An estimator is a statistic used to estimate an unknown parameter of the population.
Ques. Name two types of ESTIMATIONS?
Ans: The two types of estimations are:
Ques. State the properties of a good estimator?
Ans: THERE ARE 4 PROPERTIES OF GOOD ESTIMATOR:
Ques. Do unbiased estimators always exist?
Ans: No, an unbiased estimator need not always exist, and even when it exists it sometimes gives absurd results.
SUPERVISED LEARNING: In the supervised model we train the model on well-labeled data so that it can predict the outcome for unseen data. The model has to be built and scaled successfully so that the desired results can be obtained.
Ques. Why is supervised learning important/helpful?
Ans:
UNSUPERVISED LEARNING: In the unsupervised model, the data is not labeled, i.e. no predetermined output is given for the inputs. We do not need to supervise the model, as it is free to explore the data and find structure in it on its own.
Ques. Why is unsupervised learning important/helpful?
Ques. Suppose that I want to go to work from home. So, I checked the weather condition, the route which can be taken, and stations where I can stop to take food. which model I am taking into consideration?
Ans: The person is taking into consideration the SUPERVISED MODEL as every aspect here is very well labeled starting from the type of route to the station the person will stop.
Ques. State some techniques to have supervised data?
Ans: There can be two techniques in which we can have the same:
Ques. State some techniques to have unsupervised data?
Ans: There can be two techniques in which we can have the same:
Ques. State some algorithms used in both supervised and unsupervised models?
Ans: Support vector machines, neural networks, linear and logistic regression, random forests, and classification trees are used in supervised models. Unsupervised algorithms are mainly clustering algorithms, such as K-means and hierarchical clustering.
Ques. Which model will be used when we need to classify big data into real life?
Ans: An unsupervised model will be used here, as these models do not need labeled data and hence help in grouping the data.
The post Excel – Descriptive Statistics appeared first on StepUp Analytics.
What is Descriptive Statistics?
Summarizing and presenting an unorganized dataset in an organized way which may be tabular, graphical or numerical is referred to as descriptive statistics. Descriptive statistics help the readers in understanding the data well at first look. It also helps in making statistical inferences from the data.
For example, for Census 2021 the Government of India will conduct primary research to gather the data of its citizens which is helpful in policy-making decisions. The government can employ descriptive statistics on this huge dataset to better understand the data by describing and summarizing its features.
Descriptive statistics include measures of location such as minimum, maximum, percentiles, quartiles, and central tendency (mean, median, and mode); and measures of dispersion or variability, including range, variance, and standard deviation.
Measures of Location:
Measures of location are the best way in which one can reduce a set of data summarizing all data points with a single value. A single number is derived that is most representative of the entire dataset. The derived single number generally shows most of the properties of the data distribution.
Let us, for example, study a data set obtained from the measurement of the length of the crocodiles that are there in a pond. We observed the following things:
− the average length of a crocodile is 3.4 meters or that
− 50 % of the crocodiles have a length of less than 3.4 meters or that
− 90 % of the crocodiles have a length between 2.8 meters and 4.3 meters
Here in the above-stated example, the first two statements specify in some manner the location of the center of the distribution, whereas the third statement tells us about the variability in the dataset.
Measures of Central Tendency:
These are the measures that help in locating the center of the data. These are as follows:
Mean: The mean value is generally known as the average value of the data. To get the value of the mean, we simply take the sum of all the observations and then divide it by the total number of observations in the dataset.
Mean = (Sum of Observations)/ (Total Number of Observations)
For Example, the marks of the 5 students are 2, 9, 10, 3, 6. The mean here would be:
Mean = (2+9+10+3+6)/ 5 = 30/ 5 = 6
Interpretation: Here we can say that on an average a student has got 6 marks in the class.
Median: The median is the location or point that has half the data points smaller than it and half the data points larger than it. In the case of asymmetric data, the median is considered a more stable measure of central tendency than the mean. The reason is that the median is less affected by outliers (i.e. extreme values) than the mean.
The median is calculated as follows:
− Sort all values in ascending/descending order.
− If the total number of values is odd then take the middle number of the series.
− If the total number of values is even then take the average of the middle two numbers.
For example, the marks of the 5 students are 2, 9, 10, 3, 6. Here, one must first sort these values in ascending order 2, 3, 6, 9, 10. Since the number of observations here is odd, we would take the middle value as the median for this series which is 6 in this case.
Mode: The Mode is the value that occurs most frequently in the dataset. It is not generally used in the statistical analysis. The mode is useful in the case of categorical data to describe the most frequent category. In a series, there could be more than one mode also and in that case, the series is called bimodal/ multimodal series.
For example, the marks of the 5 students are 2, 9, 10, 6, 6. In this case, 6 occurs most of the time (twice) in this series and hence becomes the mode of this series.
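The three worked examples above can be checked with a few lines of Python. This is only a minimal sketch using the standard `statistics` module; the variable names are illustrative and not part of the original article.

```python
import statistics

# Marks from the mean and median examples above
marks = [2, 9, 10, 3, 6]
# Marks from the mode example above (6 appears twice)
marks_with_repeat = [2, 9, 10, 6, 6]

mean_marks = statistics.mean(marks)      # (2 + 9 + 10 + 3 + 6) / 5
median_marks = statistics.median(marks)  # middle value of 2, 3, 6, 9, 10
# multimode returns every most-frequent value, so it also
# covers bimodal/multimodal series
modes = statistics.multimode(marks_with_repeat)

print(mean_marks, median_marks, modes)
```

`multimode` is used instead of `mode` so that a series with more than one mode returns all of them.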
Measures of Dispersion: Measures of dispersion play a very important role in describing the spread of the data and its variation around a central value (Mean/ Median/ Mode). The spread of a data set can be described by various descriptive statistics like range, variance, and standard deviation.
Range: It is the difference between the lowest and highest values in a dataset. It is very simple to compute and useful when someone wishes to interpret the whole of a dataset. The range is useful to show the spread within a dataset and also, for comparing the spread/ variability among two or more similar datasets.
For example, the marks of the 5 students are 2, 9, 10, 3, 6. To calculate the value of the range, we need to subtract the lowest value which is 2 from the highest value which is 10 and we get the range as 8.
Variance: It is a numerical value used to show how widely individual observations vary in a group. When the individual observations vary to a large extent from the mean of the group, the variance of the group is high, and when the individual observations vary to a small extent from the mean of the group, the variance of the group is low.
One must distinguish between the population variance and the sample variance. They have different notations and are calculated differently. The variance of a population is denoted by $\sigma^2$ and the variance of a sample by $s^2$.
To calculate the variance of a population, the following formula is used:
$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N}$$
where $\sigma^2$ is the variance of the population, $\bar{X}$ is the mean of the population, $X_i$ is the ith element from the population and $N$ is the total number of elements in the population.
The variance of a sample is defined by a different formula:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
where $s^2$ is the variance of the sample, $\bar{x}$ is the mean of the sample, $x_i$ is the ith element from the sample, and $n$ is the total number of elements in the sample. Dividing by $n - 1$ rather than $n$ makes the sample variance an unbiased estimate of the population variance.
Standard Deviation: Square root of the variance gives the standard deviation for the data. It is preferred over variance as it measures the deviation from the mean in the same units as the original data.
For example, if the data is of distance measurements in kilometers then the standard deviation will also be measured in kilometers, hence, the standard deviation is comparatively easy to interpret as compared to the variance.
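To make the difference between the population and sample versions concrete, here is a small sketch that applies the measures of dispersion to the same five marks used earlier. It relies on Python's standard `statistics` module; treating the marks first as a whole population and then as a sample is purely for illustration.

```python
import statistics

marks = [2, 9, 10, 3, 6]

# Range: highest value minus lowest value
value_range = max(marks) - min(marks)

# Population variance divides the squared deviations by N,
# sample variance divides them by n - 1
population_variance = statistics.pvariance(marks)
sample_variance = statistics.variance(marks)

# Standard deviation is the square root of the matching variance
population_sd = statistics.pstdev(marks)
sample_sd = statistics.stdev(marks)

print(value_range, population_variance, sample_variance)
print(round(population_sd, 4), round(sample_sd, 4))
```

The squared deviations from the mean of 6 add up to 50, so the population variance is 50/5 = 10 and the sample variance is 50/4 = 12.5.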
Measures of Shape:
Skewness: Skewness measures the asymmetry of a distribution, i.e. the tendency of the data to be spread out more on one side of the mean than on the other. There are two types of skewed distribution: left-skewed and right-skewed.
The nearer the skewness statistic is to 0, the more symmetric the dataset is. When the skewness statistic is negative, the data is said to be left-skewed, which typically means the mean is less than the median; when the statistic is positive, the data is said to be right-skewed, which typically means the mean is greater than the median.
Kurtosis: It measures the peakedness of a distribution, or more precisely the heaviness of its tails. A kurtosis statistic less than 3 implies that the tail data is less extreme than in the normal distribution; such a curve is referred to as platykurtic. A statistic greater than 3 implies a distribution with tail data exceeding that of the normal distribution; such a curve is referred to as leptokurtic.
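As a rough illustration of the two shape measures, the sketch below computes moment-based skewness and kurtosis for the five marks used earlier. The function name `moment_shape` is made up for this example, and the formulas are the simple population-moment versions (normal distribution = kurtosis of 3); statistical packages, including Excel, often apply sample-size corrections, so their numbers can differ.

```python
def moment_shape(values):
    """Return (skewness, kurtosis) from population central moments."""
    n = len(values)
    mean = sum(values) / n
    # Second, third and fourth central moments
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2  # a normal distribution gives 3
    return skewness, kurtosis


skewness, kurtosis = moment_shape([2, 9, 10, 3, 6])
print(skewness, round(kurtosis, 3))
```

For these marks the deviations from the mean are symmetric (−4, −3, 0, 3, 4), so the skewness is 0, and the kurtosis of about 1.35 is below 3, i.e. the tails are lighter than those of a normal distribution.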
Now let’s see how to add the Data Analysis option in MS-Excel so that its analysis tools can be applied easily to various datasets.
Step 1: Go to ‘File’, which appears at the top left of MS-Excel.
Step 2: Click ‘Options’, select ‘Add-ins’, choose ‘Excel Add-ins’ in the ‘Manage’ box and click ‘Go’.
Step 3: Check ‘Analysis ToolPak’, ‘Analysis ToolPak – VBA’ and ‘Solver Add-in’ in the dialog box. Next, click OK. The Data Analysis option has been added now.
Step 4: Click ‘Data’ from the options and see the ‘Data Analysis’ option is now available for use.
Case Study:
Let’s understand how to use descriptive statistics in MS-Excel on stock prices and infer the results. Consider Axis Bank traded as equity stock on NSE. The data has been downloaded from NSE for a period of 24 months, 1st January 2018 to 30th December 2019.
Note: The return on stock prices is calculated as (Pt – Pt-1) / Pt-1, where Pt is the current period’s price and Pt-1 is the previous period’s price.
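The return formula in the note can be sketched in a few lines of Python. The function name `simple_returns` and the prices are hypothetical, chosen only to show the arithmetic; they are not the Axis Bank data.

```python
def simple_returns(prices):
    """Return (Pt - Pt-1) / Pt-1 for each consecutive pair of prices."""
    return [
        (current - previous) / previous
        for previous, current in zip(prices, prices[1:])
    ]


# Hypothetical month-end prices, for illustration only
prices = [100.0, 110.0, 99.0]
returns = simple_returns(prices)
print(returns)
```

Note that a series of n prices yields n − 1 returns, which is why 24 months of prices give 23 monthly returns.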
Now, we will use the ‘Data Analysis’ option in Excel [as learned earlier] on the monthly returns of Axis Bank stock to find a summary of the descriptive statistics and infer the results.
Step 1: Click on ‘Data’ then select ‘Data Analysis’. Find the ‘Descriptive Statistics’ option and click OK.
Step 2: Select the ‘Input Range’, here $D$3:$D$25, as only numeric values are to be considered. Define the ‘Output Range’ where the result is to be generated, then select ‘Summary Statistics’ and click OK.
The following Summary Statistics are generated for the Monthly Return of Axis Bank stock.
Interpretation of the Summary Statistics:
The count implies that there are 23 observations of monthly return. The maximum monthly return an investor could have earned is a profit of 16.06% and the minimum is a loss of 11.38% on the Axis Bank stock, implying a range of 27.44% between the maximum loss and the maximum profit. The sum of 28.25% implies that an investor would have gained roughly 28.25% (ignoring the compounding of monthly returns) by holding the stock for the period of the analysis, i.e. from 1st January 2018 till 30th December 2019.
From the measures of central tendency, we can infer that the average or mean return the stock yields over the period is 1.23%, whereas the median is 1.12%. The mean and median differ slightly, which implies that the data is not perfectly symmetric. Also, there is no mode in the data of monthly returns. The standard deviation of 6.02% indicates the typical extent to which monthly returns deviate from the mean return. The measures of shape indicate that the distribution is right-skewed, as the skewness statistic is positive.
A kurtosis statistic below 3 would imply that extreme returns (either positive or negative), beyond the usual + or – three standard deviations from the mean, occur less often than the normal distribution of returns predicts. Note, however, that Excel reports excess kurtosis (kurtosis minus 3), so the value in the Excel output should be compared with 0 rather than with 3: a positive value points to occasional extreme returns being more likely than under the normal distribution, and a negative value to their being less likely.
As seen above, one can use descriptive statistics to better understand the data. We can perform the descriptive statistics on any stock using the historical price data to check the range of profit and loss and take better-informed decisions on the basis of our inferences.
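For readers who want to reproduce this kind of summary outside Excel, the following is a minimal sketch using Python's standard `statistics` module. The function name `summarize` and the return values are hypothetical (they are not the Axis Bank figures), and only a subset of the statistics from the Excel output is computed.

```python
import statistics


def summarize(returns):
    """Return a few of the summary statistics discussed above."""
    return {
        "count": len(returns),
        "sum": sum(returns),
        "mean": statistics.mean(returns),
        "median": statistics.median(returns),
        # Sample standard deviation (n - 1 in the denominator)
        "std_dev": statistics.stdev(returns),
        "minimum": min(returns),
        "maximum": max(returns),
        "range": max(returns) - min(returns),
    }


# Hypothetical monthly returns, NOT the Axis Bank data
monthly_returns = [0.05, -0.02, 0.03, 0.01]
summary = summarize(monthly_returns)
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

The range here is the gap between the largest gain and the largest loss, the same reading given to the 27.44% figure in the case study.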
The post Excel – Descriptive Statistics appeared first on StepUp Analytics.
]]>The post Data Analytics How to Use Graphs to Present Your Data Smartly appeared first on StepUp Analytics.
]]>In data analytics, data means numbers, text or symbols that represent pieces of information. Most often the data is numeric, and the numbers are values of quantitative or qualitative variables. Note that the term “values” is broad enough to cover anything to which a value can be ascribed.
Variables come in different types. It is necessary to understand these types in order to make a graphic presentation that suits the content, nature and treatment of their values. First, there is the quantitative variable. It is also called numerical, in the sense that its values have meaning as measurements.
Examples are a person’s height and weight. A good test for this kind of variable is whether the basic arithmetic operations (adding, subtracting, multiplying and dividing) make sense on its values. Of course, symbols may be used to stand in for the value of a certain number. Quantitative variables can be further classified as continuous and discrete.
A continuous variable is a quantitative variable that describes data obtained by measurement. If your data deals with measuring height, weight or time, then you have a continuous variable. Such a variable has an interval, and any value within this interval is possible. A discrete variable can take only separate, countable values.
Here, not every value is possible. For example, in grading the performance of a product, you may use ratings coded as 1, 2, 3 and 4, but these do not show a real value in the strictest quantitative sense. Statistically speaking, only integer values are possible. Family size is another example: a family can have 3 or 4 members, but not 3.5.
This is the biggest question: how are you going to present your data? This is where descriptive statistics comes in. It is the idea of presenting and describing the features of your data. It can be done through various means: graphical representation, tabular representation and summary statistics. The first two are called visualization techniques. To better understand this division, it helps to look at an overview of descriptive vs. inferential statistics.
Descriptive statistics are used to present quantitative descriptions in a manageable form. They are a way to see something meaningful in the data at hand. In short, you make statements based on, about, and derived from these data. As a limitation, you are not allowed to draw conclusions beyond the data at hand. You cannot make inferences or generalizations.
On the other hand, inferential statistics go beyond the figures. As the name suggests, they let one make inferences and draw conclusions. In descriptive statistics, more than one variable may be involved, in which case the point of interest is the relationship between or among the variables. One should ask: how does one variable change with respect to the other variables?
The post Data Analytics How to Use Graphs to Present Your Data Smartly appeared first on StepUp Analytics.
]]>The post Introduction To Descriptive and Predictive Analytics Part – 1 appeared first on StepUp Analytics.
]]>Often the aim is to address a specific problem through modelling the world in some form and then use the model to develop a better understanding of the world.
A Fundamental Operations Problem: An example
An Operations Problem: Costs
Timeline of Events
Demand is uncertain. Suppose you bought 10 items
Problem Recap
You don’t know what the demand is going to be
You have to decide on the number of units to order from the supplier before seeing the customer demand.
What could help?
The chart shows the demands (y-axis) observed in the past 100 periods (x-axis).
Past Demand Data
Before you make your decision
The problem you just saw is called a Newsvendor problem.
Its characteristics are:
It is called the newsvendor problem because it is similar to the situation of a vendor who sells newspapers:
In this course, we will show you how to think about and analyze this problem
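One simple way to reason about the newsvendor problem is to evaluate each candidate order quantity against past demand and keep the one with the highest average profit. The sketch below does exactly that. The unit cost, selling price and demand history are hypothetical, and it assumes unsold units are worth nothing and unmet demand carries no extra penalty; none of these numbers come from the original example.

```python
def profit(order_qty, demand, unit_cost, unit_price):
    """Profit for one period: revenue on units sold minus cost of units ordered."""
    units_sold = min(order_qty, demand)
    return unit_price * units_sold - unit_cost * order_qty


def best_order_quantity(past_demands, unit_cost, unit_price):
    """Pick the order quantity with the highest average profit on past demand."""
    candidates = range(min(past_demands), max(past_demands) + 1)

    def average_profit(order_qty):
        total = sum(
            profit(order_qty, demand, unit_cost, unit_price)
            for demand in past_demands
        )
        return total / len(past_demands)

    return max(candidates, key=average_profit)


# Hypothetical inputs: buy at 5, sell at 10, five periods of past demand
past_demands = [8, 10, 12, 9, 11]
best_qty = best_order_quantity(past_demands, unit_cost=5, unit_price=10)
print(best_qty)
```

Ordering too little loses sales and ordering too much wastes stock, so the best quantity sits between the smallest and largest observed demand.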
A Business Application at Time Inc.
Time Magazine Supply chain:
Stores were either selling out their inventories (too little inventory) or selling only a small fraction of their allocation (too much inventory). Time Magazine evaluated and adjusted three decisions for every issue:
– National print order (total number of copies printed and shipped)
– Wholesale allotment structure (how those copies are allotted to wholesalers)
– Store distribution (final distribution to stores)
Note: the above three decisions are made before the actual demand is realized.
This requires analyzing past data and forecasting future demand. Time Magazine reports saving $3.5M annually from tackling the newsvendor problem.
Broader applications of the Newsvendor problem
Governments order flu vaccines before the flu season begins, and before the extent or the nature of the flu strain is known.
What is forecasting?
Primary Function is to predict the Future
Why are we interested?
Dictates the decisions we make today
Examples: who uses forecasting in their jobs?
What makes a good forecast?
Point forecasts are usually wrong! Why?
1. Examples: In December 2015, there will be 37 cm of snow.
2. We will sell 314 umbrellas during the rains next Part.
3. Demand could be a random variable.
Therefore, a good forecast should be more than a single number
1. Mean and standard deviation
2. Range (high and low) (e.g. weather forecasts).
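A forecast that carries more than a single number can be sketched as follows. The function name `forecast_summary` and the demand history are hypothetical, and using the plain historical mean, sample standard deviation and observed minimum and maximum is just one naive way to attach distribution information to a forecast.

```python
import statistics


def forecast_summary(past_demand):
    """Describe a forecast with a mean, a standard deviation and a range."""
    return {
        "mean": statistics.mean(past_demand),
        "std_dev": statistics.stdev(past_demand),  # sample standard deviation
        "low": min(past_demand),
        "high": max(past_demand),
    }


# Hypothetical demand observed in five past periods
past_demand = [8, 10, 12, 9, 11]
forecast = forecast_summary(past_demand)
print(forecast)
```

Reporting the spread alongside the mean tells the decision maker how far off the point forecast is likely to be.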
Modeling Uncertain Future: Probability Distributions
An Example of a Model of Future Demand
Example of a Model of Future Demand: How likely is Each Scenario?
Three Scenarios and Probability Distribution
In other words, we project that the demand is not equal to a certain number with probability 1, but, rather can take one of three values with those probabilities
We have just created a probability distribution for the future demand:
Probability distributions like this one, described by a number of distinct scenarios with attached probabilities, are called discrete. Note that the probabilities are:
Three Scenarios Probability Distribution: Scenarios and Their Probabilities
Describing Probability Distribution: Mean and Standard Deviation
Three Scenarios Probability Distribution: Mean
Three Scenarios Probability Distribution: Mean and Standard Deviation
Knowledge of mean and standard deviation values helps to support a general intuition about the nature of a random variable
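The mean and standard deviation of a three-scenario distribution can be computed directly from the scenarios and their probabilities. The sketch below uses made-up demand values and probabilities, since the figures from the original example are not available here; the function name `mean_and_std` is likewise illustrative.

```python
import math


def mean_and_std(scenarios):
    """Mean and standard deviation of a discrete distribution.

    scenarios is a list of (value, probability) pairs.
    """
    total_probability = sum(p for _, p in scenarios)
    if not math.isclose(total_probability, 1.0):
        raise ValueError("probabilities must add up to 1")
    # Mean: probability-weighted average of the scenario values
    mean = sum(value * p for value, p in scenarios)
    # Variance: probability-weighted average squared deviation from the mean
    variance = sum(p * (value - mean) ** 2 for value, p in scenarios)
    return mean, math.sqrt(variance)


# Hypothetical scenarios: (demand, probability)
scenarios = [(5, 0.25), (10, 0.5), (15, 0.25)]
mean_demand, std_demand = mean_and_std(scenarios)
print(mean_demand, round(std_demand, 4))
```

Here the mean is 10 and the variance is 0.25 × 25 + 0.25 × 25 = 12.5, giving a standard deviation of about 3.54.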
Mean and Standard Deviation: More than three scenarios
– $D_1$ with probability $p_1$
– $D_2$ with probability $p_2$
– $D_3$ with probability $p_3$
– ...
– $D_n$ with probability $p_n$
And $p_1 + p_2 + p_3 + \dots + p_n = 1$
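For the general case of n scenarios, the mean and standard deviation follow the same pattern as in the three-scenario case. Writing $p_i$ for the probability of scenario $D_i$ (the symbols $p_i$, $\mu$ and $\sigma$ are assumed here, as the original notation did not survive), the standard formulas are:

$$\mu = \sum_{i=1}^{n} p_i D_i$$

$$\sigma = \sqrt{\sum_{i=1}^{n} p_i \,(D_i - \mu)^2}$$

The mean is the probability-weighted average of the scenarios, and the standard deviation is the square root of the probability-weighted average squared distance from that mean.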
The random variable being modelled has a really large number of scenarios on any small interval of the possible interval of values and
The probability that any one scenario is realized is really small
There exist statistical formulas (also implemented in Excel) that calculate the probability that a normal random variable X with given mean μ and standard deviation σ produces a value within a specified interval of values [Xmin, Xmax].
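Outside Excel, the same interval probability can be obtained from the normal cumulative distribution function. Below is a minimal sketch using `NormalDist` from Python's standard `statistics` module (Python 3.8 or later); the function name `prob_within` and the numbers are hypothetical.

```python
from statistics import NormalDist


def prob_within(mean, std_dev, x_min, x_max):
    """Probability that a normal variable falls inside [x_min, x_max]."""
    distribution = NormalDist(mu=mean, sigma=std_dev)
    # P(x_min <= X <= x_max) = CDF(x_max) - CDF(x_min)
    return distribution.cdf(x_max) - distribution.cdf(x_min)


# Hypothetical demand: mean 10, standard deviation 2,
# interval of one standard deviation either side of the mean
probability = prob_within(10, 2, 8, 12)
print(round(probability, 4))
```

An interval of one standard deviation on each side of the mean captures roughly 68% of the probability, which is a handy sanity check for any implementation.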
Other Continuous Probability Distributions
Returning back: Characteristics of Forecasts
1. Point forecasts are usually wrong! Why?
2. Therefore, a good forecast should be more than a single number
3. Forecasts should include some distribution information
4. Aggregate forecasts are usually more accurate
5. The accuracy of forecasts erodes as we go further into the future
6. Don’t exclude known information
Subjective Forecasting Methods
How to forecast with past data, objectively?
We can leverage past data to come up with forecasts:
Two primary methods:
Causal Models
– Let D be the demand or future outcome to be predicted, and assume that there are n variables (or root causes) that influence the demand.
– A causal model is one in which demand D is formulated as a theoretical function of all those n causes.
– Causal models are generally intricate and complex and need advanced tools in addition to domain expertise.
– In this course, we will focus mainly on time series based models.
Time Series Methods
Next…
Forecasting with past historical data
Moving Averages
Exponential smoothing
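As a preview of these two time series methods, here is a minimal sketch of each. The function names, the demand history, the window size and the smoothing constant `alpha` are all hypothetical, and starting the exponential smoothing from the first observation is only one common convention.

```python
def moving_average_forecast(history, window):
    """Forecast the next period as the average of the last `window` values."""
    if window <= 0 or window > len(history):
        raise ValueError("window must be between 1 and len(history)")
    return sum(history[-window:]) / window


def exponential_smoothing_forecast(history, alpha):
    """Forecast the next period with simple exponential smoothing.

    Each update blends the latest observation (weight alpha) with the
    previous forecast (weight 1 - alpha).
    """
    forecast = history[0]  # assumed starting point: the first observation
    for observed in history[1:]:
        forecast = alpha * observed + (1 - alpha) * forecast
    return forecast


# Hypothetical demand history
history = [10, 12, 11, 13]
ma_forecast = moving_average_forecast(history, window=3)
es_forecast = exponential_smoothing_forecast(history, alpha=0.5)
print(ma_forecast, es_forecast)
```

A larger window or a smaller `alpha` smooths out noise but reacts more slowly to real changes in demand.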
The post Introduction To Descriptive and Predictive Analytics Part – 1 appeared first on StepUp Analytics.
]]>The post Shiny R (Building your first Shiny App) appeared first on StepUp Analytics.
]]>It will launch two tabs ui.R and server.R. To launch the application click on Run App.
Several other examples are available; for reference and details, see the Shiny website below:
http://shiny.rstudio.com/tutorial/
The post Shiny R (Building your first Shiny App) appeared first on StepUp Analytics.
]]>