The post Lasso And Elastic Net Regression appeared first on StepUp Analytics.
For example, ridge regression shrinks the coefficients towards zero, but it does not set any of them exactly to zero, so it does not perform feature selection. In this article, I introduce two methods, lasso and elastic net regression, which deal with this issue well and perform both variable selection and regularization.
Lasso (least absolute shrinkage and selection operator) is a regression analysis method that uses L1 regularization: like ridge regression, it penalizes the size of the regression coefficients, but it penalizes their absolute values rather than their squares. In addition, it is capable of reducing the variability and improving the accuracy of linear regression models. Because of the absolute-value penalty, some of the parameter estimates turn out to be exactly zero. Hence, much like the best subset selection method, lasso performs variable selection among the given variables.
The tuning parameter lambda is chosen by cross-validation. When lambda is small, the result is essentially the least squares (OLS) estimates. As lambda increases, shrinkage occurs and the coefficients of the less important features shrink to zero, removing some features altogether.
So, a major advantage of lasso is that it is a combination of both shrinkage and selection of variables. In cases of a very large number of features, lasso allows us to efficiently find the sparse model that involves a small subset of the features.
The cost function is given below, where the highlighted part is the L1 regularization.
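In standard notation (a reconstruction, since the original formula image is not reproduced here), the lasso cost function is:

```latex
\min_{\beta}\ \sum_{i=1}^{n}\Big(y_i-\beta_0-\sum_{j=1}^{p}\beta_j x_{ij}\Big)^{2}
\;+\;\underbrace{\lambda\sum_{j=1}^{p}\lvert\beta_j\rvert}_{\text{L1 regularization}}
```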
The method was proposed by Professor Robert Tibshirani from the University of Toronto, Canada. He said, “The Lasso minimizes the residual sum of squares subject to the sum of the absolute value of the coefficients being less than a constant. Because of the nature of this constraint, it tends to produce some coefficients that are exactly 0 and hence gives interpretable models”.
In his article titled Regression Shrinkage and Selection via the Lasso, Tibshirani tells us about this technique with respect to various other statistical models such as subset selection and ridge regression. He goes on to say that “lasso can even be extended to generalized regression models and tree-based models. In fact, this technique provides possibilities for even conducting statistical estimations.”
Traditional methods for handling overfitting and performing feature selection, such as cross-validation and stepwise regression, work well with a small set of features, but penalized regression techniques are a great alternative when we are dealing with a large set of features.
Lasso was originally formulated for least squares models and this simple case reveals a substantial amount about the behavior of the estimator, including its relationship to ridge regression and best subset selection and the connections between lasso coefficient estimates and so-called soft thresholding. It also reveals that (like standard linear regression) the coefficient estimates need not be unique if covariates are collinear. [Source: Wikipedia]
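The soft-thresholding connection can be sketched in a few lines of R (an illustrative sketch; `soft_threshold` is a hypothetical helper written for this article, not part of any package):

```r
# Soft-thresholding operator: for an orthonormal design, the lasso estimate
# of each coefficient is the OLS estimate passed through this function.
soft_threshold <- function(z, lambda) {
  sign(z) * pmax(abs(z) - lambda, 0)
}

# Large estimates are shrunk towards zero; small ones become exactly zero,
# which is how lasso drops variables from the model.
soft_threshold(c(-3, -0.5, 0.2, 1.5), lambda = 1)  # -2  0  0  0.5
```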
Though originally defined for least squares, lasso regularization is easily extended to a wide variety of statistical models including generalized linear models, generalized estimating equations, proportional hazards models, and M-estimators, in a straightforward fashion. Lasso’s ability to perform subset selection relies on the form of the constraint and has a variety of interpretations including in terms of geometry, Bayesian statistics, and convex analysis. [Source: Wikipedia]
As discussed above, lasso can set coefficients to zero, while ridge regression, which appears superficially similar, cannot. This is due to the difference in the shape of the constraint boundaries in the two cases.
From the figure, one can see that the constraint region of lasso regression is a rotated square whose corners lie on the axes, while the constraint region of ridge regression is a sphere, which is rotationally invariant and therefore has no corners. A convex object (the elliptical contour of the least squares objective) that grows until it touches the constraint boundary is likely to meet the square at a corner, where some components of the coefficient vector are identically zero; on a sphere, by contrast, no boundary point is distinguished from the others, so the contour is unlikely to touch a point at which some components are zero.
In machine learning, both ridge regression and lasso have their respective advantages. Both techniques tackle overfitting, which is generally present in a realistic statistical model, and the choice between them depends on the computing power and data available. Ridge regression is faster compared to lasso, but lasso has the advantage of completely removing unnecessary parameters from the model.
One important limitation of lasso regression is that, for grouped variables, the lasso fails to do grouped selection. It tends to select one variable from a group and ignore the others.
Elastic-net is a mix of both L1 and L2 regularizations. A penalty is applied to the sum of the absolute values and to the sum of the squared values:
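In glmnet's parameterization (reconstructed here, since the original formula image is not reproduced), the elastic net penalty is:

```latex
\lambda\left[(1-\alpha)\,\tfrac{1}{2}\sum_{j=1}^{p}\beta_j^{2}
\;+\;\alpha\sum_{j=1}^{p}\lvert\beta_j\rvert\right]
```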
Lambda is a shared penalization parameter, while alpha sets the ratio between L1 and L2 regularization in the elastic net. Hence, we expect a hybrid behavior between L1 and L2 regularization: coefficients are still cut, but the cut is less abrupt than with lasso penalization alone. The alpha hyper-parameter lies between 0 and 1 and controls how much L2 or L1 penalization is used. The usual approach to optimizing the lambda hyper-parameter is through cross-validation—by minimizing the cross-validated mean squared prediction error—but in elastic net regression, the optimal lambda also depends upon the alpha hyper-parameter.
This article takes a cross-validated approach that uses the grid search to find the optimal alpha hyper-parameter while also optimizing the lambda hyper-parameter for the data set.
In my previous article, I used the glmnet package to show the ridge regression in R. In this article, I have used the caret package for better comparison between the techniques.
Loading the MASS package to get the data set
library(MASS)
data <- Boston
Splitting the dataset in training and testing data
train <- data[1:400, ]
test <- data[401:506, ]
Setting up a grid range of lambda values
lambda <- 10^seq(-3, 3, length = 100)
Loading the required libraries
library(tidyverse)
library(caret)
library(Metrics)
We fit the ridge regression model on the training data using k-fold cross-validation
set.seed(123)
ridge <- train(
  medv ~ ., data = train, method = "glmnet",
  trControl = trainControl("cv", number = 10),
  tuneGrid = expand.grid(alpha = 0, lambda = lambda))
plot(ridge$finalModel, xlab = "L2 Norm")
Displaying the regression coefficients below
coef(ridge$finalModel, s = ridge$bestTune$lambda)
We save the predicted values of the response variable in a vector prediction_ridge
prediction_ridge <- predict(ridge, test)
Saving the RMSE, SSE and MAPE values in Accuracy_ridge
Accuracy_ridge <- data.frame(
  RMSE = RMSE(prediction_ridge, test$medv),
  SSE = sse(test$medv, prediction_ridge),
  MAPE = mape(test$medv, prediction_ridge))
The only difference between the R code used for ridge and lasso regression is that for lasso regression, we need to specify the argument alpha = 1 instead of alpha = 0 (for ridge regression).
Now executing the Lasso Regression
set.seed(123)
lasso <- train(
  medv ~ ., data = train, method = "glmnet",
  trControl = trainControl("cv", number = 10),
  tuneGrid = expand.grid(alpha = 1, lambda = lambda))
plot(lasso$finalModel, xlab = "L1 Norm")
If we look at the plot, the x-axis is the maximum permissible value the L1 norm can take. So when we have a small L1 norm, we have a lot of regularization. Therefore, an L1 norm of zero gives an empty model, and as you increase the L1 norm, variables will “enter” the model as their coefficients take non-zero values.
Displaying the regression coefficients below
coef(lasso$finalModel, s = lasso$bestTune$lambda)
We save the predicted values of the response variable in a vector prediction_lasso
prediction_lasso <- predict(lasso, test)
Saving the RMSE, SSE and MAPE values in Accuracy_lasso
Accuracy_lasso <- data.frame(
  RMSE = RMSE(prediction_lasso, test$medv),
  SSE = sse(test$medv, prediction_lasso),
  MAPE = mape(test$medv, prediction_lasso))
Elastic net regression does not require us to specify particular values of lambda and alpha; we use the caret package to select the best tuning parameters automatically. The caret package tests a range of possible alpha and lambda values and then selects the best pair, resulting in a final model that is an elastic net model.
Now executing the Elastic Net Regression
set.seed(123)
elasticnet <- train(
  medv ~ ., data = train, method = "glmnet",
  trControl = trainControl("cv", number = 10), tuneLength = 10)
plot(elasticnet$finalModel, xlab = "Elasticnet Regularization")
Displaying the regression coefficients below
coef(elasticnet$finalModel, s = elasticnet$bestTune$lambda)
We save the predicted values of the response variable in a vector predictions_elasticnet
predictions_elasticnet <- predict(elasticnet, test)
Saving the RMSE, SSE and MAPE values in Accuracy_elasticnet
Accuracy_elasticnet <- data.frame(
  RMSE = RMSE(predictions_elasticnet, test$medv),
  SSE = sse(test$medv, predictions_elasticnet),
  MAPE = mape(test$medv, predictions_elasticnet))
We finally bring the RMSE, SSE and MAPE values of the three regression techniques in a dataframe Accuracy.
Accuracy <- rbind.data.frame(Accuracy_ridge = Accuracy_ridge, Accuracy_lasso = Accuracy_lasso, Accuracy_elasticnet = Accuracy_elasticnet)
Accuracy
Here both lasso and elastic net regression perform feature selection in addition to shrinkage. On the other hand, lasso achieves poorer accuracy here because there is a high degree of collinearity among the features. Further, the lasso solution can be underdetermined when the number of predictors exceeds the number of observations, while ridge regression handles this case.
From our example, we see that the penalized regression models performed much better than the multiple linear regression model. Lasso regression performs better than ridge in scenarios with many noise predictors and worse in the presence of correlated predictors. Elastic net, a hybrid of the two, performs well in all these scenarios.
The post Ridge Regression and Its Application appeared first on StepUp Analytics.
The OLS method works quite well when assumptions such as a linear relationship, no autocorrelation, homoscedasticity, more observations than variables, normally distributed residuals, and little or no multicollinearity are fulfilled.
But in many real-life scenarios, these assumptions are violated. In those cases, we need to find alternative approaches to provide solutions. Penalized/Regularized regression techniques such as ridge, lasso and elastic net regression work very well in these cases. In this article, I have tried to explain the ridge regression technique which is a way of creating regression models when the number of predictor variables of a dataset is more than the number of observations or when the data suffers from multicollinearity (independent variables are highly correlated).
Regularization methods provide a means to control our regression coefficients, which can help to reduce the variance and decrease the sampling error. Ridge regression belongs to a class of regression tools that use L2 regularization. L2 regularization works as a small addition to the OLS objective that weights the residuals in a particular way to make the parameter estimates more stable. The L2 penalty term, which equals the squared magnitude of the coefficients, is given by:
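In symbols (reconstructed from the description above, since the formula image is not reproduced here):

```latex
\text{L2 penalty} \;=\; \lambda\sum_{j=1}^{p}\beta_j^{2}
```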
And the regression function is given by:
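A standard way to write the ridge objective (a reconstruction consistent with the L2 penalty described above):

```latex
\hat{\beta}^{\,\text{ridge}} \;=\; \arg\min_{\beta}\;
\sum_{i=1}^{n}\Big(y_i-\beta_0-\sum_{j=1}^{p}\beta_j x_{ij}\Big)^{2}
\;+\;\lambda\sum_{j=1}^{p}\beta_j^{2}
```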
The amount of the penalty can be fine-tuned using a constant called lambda (λ). Selecting a good value for λ is critical. When λ = 0, the penalty term has no effect and ridge regression produces the classical least squares coefficients. As λ grows toward ∞, the impact of the penalty increases and all coefficients are shrunk towards zero. The ideal penalty is therefore somewhere between 0 and ∞.
In this way, ridge regression puts constraints on the magnitude of the coefficients, helping to reduce their magnitude and fluctuations and progressively shrinking them towards zero. This reduces the variance of the model. The outcome is typically a model that fits the training data less well than OLS but generalizes better because it is less sensitive to extreme variance in the data, such as outliers.
Note that, in contrast to the ordinary least square regression, ridge regression is highly affected by the scale of the predictors. Therefore, it is better to standardize (i.e., scale) the predictors before applying the ridge regression so that all the predictors are on the same scale.
Advantages and Disadvantages Of Ridge Regression
The link below gives the mathematical and geometric interpretation of ridge regression: More Info
Loading the MASS package to get the data set
library(MASS)
data <- Boston
Splitting the dataset in training and testing data
train <- data[1:400, ]
test <- data[401:506, ]
Loading libraries required for Ridge regression
library(tidyverse)
library(caret)
library(glmnet)
library(MASS)
library(Metrics)
We will use the glmnet package.
For more details about this package: More Info
There is another function, lm.ridge() in the MASS package, which can also be used. Please see the link below for more details about the function: More Info [Page Number: 79]
Preparing the training data set for training the regression model
x.train <- model.matrix(medv ~ ., train)[, -1]
We save the response variable housing price in a vector y.train
y.train <- train$medv
We need to find the best value for lambda for the given data set with the function cv.glmnet()
set.seed(123)
cv <- cv.glmnet(x.train, y.train, alpha = 0)
plot(cv)
Displaying the best lambda value
cv$lambda.min
We fit the final model on the training data by adding the best lambda value.
model_ridge <- glmnet(x.train, y.train, alpha = 0, lambda = cv$lambda.min)
Displaying the regression coefficients below
coef(model_ridge)
Preparing the test data set to be used as a data matrix and discarding the intercept for predicting the values of the response variable.
x.test <- model.matrix(medv ~ ., test)[, -1]
We save the predicted values of the response variable Housing price in a vector prediction_ridge
prediction_ridge <- as.vector(predict(model_ridge, x.test))
Saving the RMSE, SSE and MAPE value of the predicted values of the test data set in Accuracy_ridge
Accuracy_ridge <- data.frame(
  RMSE = RMSE(prediction_ridge, test$medv),
  SSE = sse(test$medv, prediction_ridge),
  MAPE = mape(test$medv, prediction_ridge))
Now we fit the multiple linear regression model on the training data set
names(train)
model_lm <- lm(medv ~ crim+zn+indus+chas+nox+rm+age+dis+rad+tax+ptratio+black+lstat, data = train)
From the summary of the model, we can find the p-values of the individual predictor variables and decide which variables to keep in the model
summary(model_lm)
model_lm <- lm(medv ~ crim+zn+nox+rm+dis+rad+tax+ptratio+lstat, data = train)
summary(model_lm)
We need to check for multicollinearity with the help of the function vif() from the car package.
library(car)
vif(model_lm)
We also need to exclude the predictor variables with high VIF values to avoid multicollinearity, though we may allow multicollinearity up to a certain level.
model_lm <- lm(medv ~ crim+zn+nox+rm+dis+rad+ptratio+lstat, data = train)
Below I have given the summary of the updated final model with all the significant variables, along with the VIF values of the variables. The values of R-squared and adjusted R-squared are pretty close, which also shows that the predictor variables remaining in the model are significant.
summary(model_lm)
vif(model_lm)
We compute the prediction of the test data set with multiple linear regression which was trained using the training dataset
prediction_lm <- predict(model_lm, test[, -14])
We find out the RMSE, SSE, and MAPE of the regression model and save them in Accuracy_lm
Accuracy_lm <- data.frame(
  RMSE = RMSE(prediction_lm, test$medv),
  SSE = sse(test$medv, prediction_lm),
  MAPE = mape(test$medv, prediction_lm))
We save the RMSE, SSE and MAPE values of both linear and ridge regression models in Accuracy.
Accuracy <- rbind.data.frame(Accuracy_ridge = Accuracy_ridge, Accuracy_lm = Accuracy_lm)
Accuracy
From the Accuracy results above, it is clear that even though the least squares estimates are unbiased, the accuracy of the model is compromised. By adding a degree of bias to the regression estimates, ridge regression reduces the standard errors. With other models like lasso and elastic net regression, we may obtain even better accuracy.
This is because complicated models tend to overfit the training data. In my next article, I will introduce you to lasso and elastic net regression and explain the comparative advantage of using these models over multiple linear or ridge regression models.
The post Queueing Theory and Its Application appeared first on StepUp Analytics.
In this article, we will learn about Queueing Theory and its practical applications. We all have experienced the annoyance of having to wait in a queue. We wait in line at supermarkets to check out, we wait in line in banks and post offices and we wait in line at fast food restaurants. But we as customers do not like waiting. And the managers of these establishments also don’t like their customers to wait as it may cost them business.
So the first question that arises is that “Why is there waiting?”
To which the answer is that there is more demand for service than there is facility available for that service.
And “Why is this so?”
For which there could be a number of reasons such as, shortage of available servers, limitation of space, economic limitations etc.
These limitations can be removed with the expenditure of capital. And to know how much service should then be made available, one needs to know:
Queuing theory attempts to answer these questions through detailed mathematical analysis. The ultimate goal is to achieve an economic balance between cost of service and the cost associated with the waiting for that service.
A queuing system can be described as one in which customers arrive for service, wait for service if it is not immediately available, and, having been served, leave the system.
The term ‘customer’ is used in a general sense and does not necessarily imply a human customer. For example, a customer can be a computer program waiting to be run or an airplane waiting in line to take off.
Queuing Theory was developed to provide models to predict the behavior of systems that attempt to provide service for randomly arising demands.
For defining the characteristics we’ll first explain the following terms:
In this context, there are six basic characteristics of a queuing process that provide an adequate description of a queuing system:
1. Arrival Pattern: In general situations, the process of arrivals is random (stochastic). It is, therefore, necessary to know the probability distribution of the times between successive customer arrivals (inter-arrival times). Customers may also arrive simultaneously (batch or bulk arrival), and if so, we also need the probability distribution describing the size of the batch.
We may also need to describe the reaction of a customer upon entering the system.
2. Service Patterns: We need to describe a probability distribution for the sequence of customer service time. Service may be single or in batch, there are many situations where customers may be served simultaneously by the same server, such as people boarding a train, sightseers on a guided tour. The situation in which service depends on the number of customers waiting is referred to as State-Dependent service.
3. Queue Discipline: It refers to the manner in which the customers are selected for service when a queue has formed. Most common disciplines are:
4. System Capacity: In some queuing processes there is a physical limitation on the amount of waiting room, so that when the line reaches a certain length, no further customers are allowed to enter until space becomes available through a service completion. This is referred to as a finite queuing situation.
5. The number of Service Channels: By this, we are typically referring to the number of parallel service stations which can serve customers simultaneously. It is assumed that service mechanisms of parallel channels operate independently of each other.
6. Stages of Service: A queuing system may have only a single stage of service, or it may have several stages. An example of a multistage queuing system would be a physical examination procedure, where each patient must proceed through several stages, such as medical history, blood tests, etc.
A Queueing process is described by a series of symbols and dashes such as A/B/X/Y/Z where
Some standard symbols for the characteristic distributions are as follows
These notations are referred to as Kendall’s Notation.
Generally, there are three types of system response of interest:
Since most queuing systems are stochastic, these measures are often random variables and their probability distribution is desired.
There are two types of customer waiting times:
The task of a queuing analyst is generally one of two things:
N(t) = N_{s}(t) + N_{q}(t), the number of customers in the system at time t, split into those in service and those waiting in the queue.
P_{n}(t) = P[N(t) = n], the probability that exactly n customers are in the system at time t.
λ is the mean arrival rate, so (1/λ) is the expected inter-arrival time.
µ_{n} = cµ when n ≥ c (all c servers are busy); 1/µ is the expected service time.
p_{n} = P(N = n), the steady-state probability of exactly n customers in the queuing system, where N is the random variable giving the number of customers in the system.
One of the most powerful relations in queuing theory was given by John D. C. Little. This formula relates the steady-state mean system size to the steady-state average customer waiting time.
Little’s formula is: L = λW and, similarly, L_{q} = λW_{q}.
Also, since E(T) = E(T_{q} + S) = E(T_{q}) + E(S), we have W = W_{q} + (1/µ).
Hence it is necessary to find only one of the four expected values.
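As a sketch, Little's formula recovers all four expected values from any one of them; for an M/M/1 queue the mean system size has a closed form, so everything else follows (the arrival and service rates below are assumed, purely for illustration):

```r
lambda <- 4               # arrival rate (assumed: customers per hour)
mu     <- 5               # service rate (assumed: customers per hour)
rho    <- lambda / mu     # traffic intensity; must be < 1 for a steady state

L  <- rho / (1 - rho)     # mean number in an M/M/1 system
W  <- L / lambda          # Little's formula: L = lambda * W
Wq <- W - 1 / mu          # W = W_q + 1/mu (expected service time is 1/mu)
Lq <- lambda * Wq         # Little's formula again: L_q = lambda * W_q

c(L = L, W = W, Wq = Wq, Lq = Lq)  # 4 in system, 1 hour in system on average
```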
The post AB Testing With R: An Example Of Marketing Campaign appeared first on StepUp Analytics.
Even e-commerce companies in India like Amazon and Flipkart have a lot of questions about their websites, application designs, and marketing strategies. These questions can be answered by conducting an A/B test.
We use A/B testing when two versions of a product (say A and B) are tested on similar customers to see which will sell more in the market, or sometimes when two groups of customers, A and B, are tested for similar products to see which group we should target.
For example, for a website:
Null Hypothesis: Assumption that there is no difference between the conversion rates for products A and B
Alternative Hypothesis: There is a difference between the conversion rates for products A and B
To reject the Null Hypothesis we need a p-value that is lower than the significance level i.e. P < 0.05
install.packages("pwr")
library(pwr)
######## 2-sample test for equality of proportions ############
prop.test(c(225, 250), c(3450, 3000))
The p-value is less than 0.05, so we can reject the hypothesis that conversion rates are equal.
But one cannot directly conclude that A and B have dissimilar conversion rates, or vice versa. The true underlying behavior is not known, since we are testing the hypothesis by carrying out the experiment on a sample.
The Disadvantages of Using A/B Test:
Bayesian statistics in A/B testing is mainly based on past (prior) knowledge from similar experiments and on the present data. The past knowledge, known as the prior probability distribution (Wiki), is combined with the current experiment data to draw a conclusion on the test at hand.
In this method, we model the metric for each variant. We have prior knowledge about the conversion rate for A which has a certain range of values based on the historical data. After observing data from both variants, we estimate the most likely values or the new evidence for each variant.
Now we need to know:
What is Posterior Probability Distribution?
Posterior probability is the probability of an event after all the background information about the event has been taken into account. Posterior probability can be seen as an adjustment of the prior probability:
Posterior probability = prior probability + new evidence (called likelihood). And the Posterior Probability Distribution is Posterior Distribution = Prior Distribution + Likelihood Function (“new evidence”)
Open the link for further information: Wiki
By calculating this posterior distribution for each variant, we can express the uncertainty about our beliefs through probability statements.
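For the Bernoulli case this update has a closed form worth seeing once (a sketch; the prior and counts reuse the numbers from this example):

```r
# Beta-Binomial conjugate update: a Beta(alpha, beta) prior on the conversion
# rate plus s conversions in n trials gives a Beta(alpha + s, beta + n - s)
# posterior -- "prior + new evidence" made concrete.
prior_alpha <- 100; prior_beta <- 200   # prior, as in plotBeta(100, 200)
s <- 250; n <- 3000                     # conversions and trials for variant B

post_alpha <- prior_alpha + s           # 350
post_beta  <- prior_beta + n - s        # 2950

prior_alpha / (prior_alpha + prior_beta)  # prior mean, about 0.333
post_alpha / (post_alpha + post_beta)     # posterior mean, about 0.106,
                                          # pulled towards the observed 0.083
```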
install.packages("bayesAB")
library(bayesAB)
The link below contains all the information to explain the parameters and functions in the package bayesAB. CRAN
Using the previous example
library(bayesAB)
A_binom <- rbinom(3450, 1, 0.065)
B_binom <- rbinom(3000, 1, 0.083)
About the rbinom function, rbinom(n, size, prob), where
n = number of observations
size = number of trials per observation
prob = probability of success
We choose the alpha and beta values from the prior knowledge we have about the parameters. Here I have shown the test with two sets of values. We generally use a trial-and-error method to get the distribution to look like our imagined prior distribution. The peak should be centered over our expected mean based on previous experiments.
plotBeta(1, 1)
plotBeta(100, 200)  ## more specific range of p
AB1 <- bayesTest(A_binom, B_binom,
                 priors = c('alpha' = 1, 'beta' = 1),
                 distribution = 'bernoulli')
Saving the outputs of the test in AB2
AB2 <- bayesTest(A_binom, B_binom,
                 priors = c('alpha' = 100, 'beta' = 200),
                 distribution = 'bernoulli')
Here I have checked the AB2 test with an alpha and beta value of 100 and 200 respectively. You can also check the plots and results for AB1.
Print tells us the inputs we have made and the summary statistics of the data.
print(AB2)
summary(AB2)
The summary gives the credible interval. Bayesian intervals treat their bounds as fixed and the estimated parameter as a random variable, whereas frequentist confidence intervals treat their bounds as random variables and the parameter as a fixed value.
It also shows that P(A > B) is only 0.00068%. So B is much better than A, and the posterior expected loss for choosing B over A is low.
plot(AB2)
The means are quite separate, but there is minimal overlap between the distributions. The credible interval highlights this overlap region. To quantify the findings, we calculate the probability of one variation beating another, i.e., if we randomly draw a sample from product A and from product B, what are the chances that the sample from B has a higher conversion rate than that of A.
So, from the diagrams and the summary of the test we can easily solve the problems which we had faced earlier while doing a simple prop.test.
Similarly, we can try the test for other specific distributions such as Poisson, normal, and exponential, and check the results for them. Then we can combine the results of the tests to find an overall credible interval and a percentage of A over B, or vice versa.
A/B test approaches are centered on hypothesis tests used with a point estimate (probability of rejecting the null) of a hard-to-interpret value. Oftentimes, the statistician or data scientist laying down the groundwork for the A/B test will have to do a power test to determine sample size. This quickly gets messy in terms of interpretability. More importantly, it is simply not as robust as Bayesian A/B testing and it does not have the ability to inspect an entire distribution over a parameter.
Bayesian statistics is simply more powerful and informative than a normal A/B test. While frequentist A/B testing requires the length of the test to be defined in advance, Bayesian testing does not. It can calculate the potential dangers of ending the test (the loss value) at any point, and gives a constantly updated probability of either variant being better and by how much. Ending the test early can be disastrous for frequentist A/B testing. A Bayesian approach, therefore, provides us with much greater flexibility during the experiment.
There is no agreed method for choosing a prior and it requires skill to estimate subjective prior beliefs into a mathematically calculated prior. If not done correctly it could lead to misleading results. The posterior distribution can be heavily influenced by the selection of the prior and the selection of the prior is a subjective process. Moreover, Bayesian statistics require a high level of computational resource, particularly in models with a large number of parameters.
The main advantage of the Bayesian approach is the ability to include historical data and to select a prior distribution. The main disadvantage with this approach is the subjective nature of the selection process for the prior.
The post Obtaining A Critical Region And p-Value appeared first on StepUp Analytics.
In the case of the two-tailed test, there are two critical regions (as shown in the above graph). We use the two-tailed test when we are interested in finding whether values are different, i.e., not equal. If the confidence level is assumed to be (1−α), then each of the two critical regions is of size α/2. The following hypothesis is an example of a two-tailed test.
H_{0}: µ = µ_{0}    H_{1}: µ ≠ µ_{0}
One-tailed tests are used when we are interested only in the extreme values that are greater than or less than a comparative value (say µ_{0}). In the case of one-tailed tests, there is only one critical region.
One-tailed tests are of two types-
Hypothesis of a left-tailed test:
H_{0}: µ = µ_{0}    H_{1}: µ < µ_{0}
Hypothesis of a right-tailed test:
H_{0}: µ = µ_{0}    H_{1}: µ > µ_{0}
In the case of one-tailed test, the critical region is of the value α (Unlike α/2 in the case of two-tailed).
Now, in order to obtain the critical value, we must know the type of hypothesis, the distribution the test follows, the level at which we are working, and lastly whether the test is two-tailed or one-tailed (right or left). We have discussed all these terms above, so obtaining the value beyond which the critical region lies is now easy.
Step 1: Check the null and the alternative hypothesis.
Step 2: Take note of the distribution the test follows.
Step 3: Calculate the degrees of freedom, if any.
Step 4: Open the tables and look up for the distribution.
Step 5: If it is a two-tailed test at, say, the 95% level of a normal distribution, then look up the value for 2.5% (α/2). If it is a one-tailed test, then look up the value for 5% and put a negative sign on it if the test is left-tailed.
In the case of the normal distribution, we do not need to calculate degrees of freedom, but for other distributions, like the t-distribution or the chi-square distribution, we do. Also, both the normal and the t distribution are symmetrical, so we need only check one value and switch its sign, but for non-symmetrical distributions we need to check the individual values.
For example, if we are working with the chi-square distribution at the 95% level, we first find its degrees of freedom and then look up the values at both 97.5% and 2.5%. To begin, follow the steps above and practice with the Normal distribution; once you have mastered it, move on to calculating degrees of freedom and using the other distributions.
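As an illustration, Steps 4 and 5 can be carried out in R instead of printed tables; the significance levels and degrees of freedom below are only examples:

```r
# Two-tailed Normal test at the 95% level: alpha/2 = 0.025 in each tail
qnorm(0.975)   # upper critical value, approximately 1.96
qnorm(0.025)   # lower critical value, approximately -1.96

# One-tailed (right-tailed) Normal test at the 95% level: alpha = 0.05 in one tail
qnorm(0.95)    # approximately 1.645

# t-distribution: degrees of freedom must be supplied (here df = 10 as an example)
qt(0.975, df = 10)

# Chi-square is not symmetric, so both tail values are looked up separately
qchisq(c(0.025, 0.975), df = 10)
```

For a left-tailed Normal or t test, the critical value is simply the negative of the right-tailed one, since those distributions are symmetric.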
Critical regions are as critical as their name suggests and should be calculated carefully, or else we might end up with a wrong conclusion (a Type I or Type II error).
The post Obtaining A Critical Region And p-Value appeared first on StepUp Analytics.
In this article, we will discuss the details of the Classical Normal Linear Regression Model (CNLRM). The method of ordinary least squares is attributed to Carl Friedrich Gauss, a German mathematician. Under certain assumptions, this method of estimation has some very attractive statistical properties that have made it one of the most powerful and popular methods of regression analysis.
The two-variable population regression function is:
However, the population functions cannot be obtained directly; hence we estimate them with the help of sample regression functions:
where Ŷᵢ is the estimated (conditional mean) value of Yᵢ.
The OLS estimates of β₁ and β₂ can be obtained as follows:
On differentiating partially with respect to β₁ and β₂, we obtain the following results:
and
Thus we get the estimate of the population regression function as:
Where,
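The estimator formulas, which appear as images in the original post, take the standard OLS form (reconstructed here with the notation of the text):

```latex
\hat{\beta}_2 = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_i (X_i - \bar{X})^2},
\qquad
\hat{\beta}_1 = \bar{Y} - \hat{\beta}_2 \bar{X},
\qquad
\hat{u}_i = Y_i - \hat{Y}_i .
```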
The Assumptions Underlying The Method Of Least Squares
If the objective is only to estimate β₁ and β₂, the method of OLS discussed above suffices. But if the objective is to draw inferences about the true population parameters β₁ and β₂, then we have to look at the functional form of the Yᵢ's, or equivalently of the Xᵢ's and uᵢ's. This is because the population regression function, Yᵢ = β₁ + β₂Xᵢ + uᵢ, depends on the Xᵢ and the error terms.
Therefore, unless we specify how the Xᵢ and uᵢ are created or generated, there is no way we can make any statistical inference about the Yᵢ, nor, as we shall see, about β₁ and β₂. Thus the assumptions made about the Xᵢ variables and the error terms are extremely critical to a valid interpretation of the regression estimates.
The Gaussian, standard, or classical linear regression model (CLRM), which is the cornerstone of most econometric theory, makes the following seven assumptions:
Assumption 1: The regression model is linear in the parameters.
Assumption 2: The values of the Xᵢ's are fixed, or the X values are independent of the error term.
Assumption 3: The mean of the error terms is zero.
Assumption 4: The variance of the error terms is constant; this assumption is also known as homoscedasticity.
Assumption 5: There is no autocorrelation between the error terms (or disturbances).
Assumption 6: The number of observations is greater than the number of parameters to be estimated.
Assumption 7: The X values in the given sample must not all be the same, i.e., the variance of the Xᵢ's is positive.
NOTE: Gauss–Markov Theorem:
Given the assumptions of the classical linear regression model, the least-squares estimators, in the class of linear unbiased estimators, have minimum variance; that is, they are BLUE (Best Linear Unbiased Estimators).
Using the method of OLS we are able to estimate the population parameters β₁ and β₂, under the assumptions of the classical linear regression model, as β̂₁ and β̂₂. But since these estimates differ from sample to sample, the estimators are random variables, and so we have to find their probability distributions.
The Probability Distribution Of Disturbances (𝒖𝒊′𝒔)
Since the Xᵢ's are assumed fixed, or nonstochastic, ours is a conditional regression analysis, conditional on the fixed values of Xᵢ.
Also, Yᵢ = β₁ + β₂Xᵢ + uᵢ.
Hence β̂₂ can be rewritten as a weighted sum of the Yᵢ, with fixed weights kᵢ. Since the betas and the kᵢ are fixed, the estimator β̂₂ is ultimately a linear function of the random variable uᵢ.
Therefore, the probability distributions of the estimators depend on the assumption made about the error terms. Since we need the probability distributions of the estimators to draw inferences about the population parameters, we have to make an assumption about the distribution of the error term.
Since OLS makes no assumption about the probabilistic nature of uᵢ, it is of little help for the purpose of drawing inferences about the population regression function from the sample regression function, the Gauss–Markov theorem notwithstanding.
This void can be filled if we are willing to assume that the 𝑢𝑖′𝑠 follow some probability distribution. For reasons to be explained shortly, in the regression context it is usually assumed that the 𝑢𝑖′𝑠 follow a normal distribution.
Thus, adding the normality assumption to the classical linear regression model (CLRM) discussed earlier, we obtain what is known as the classical normal linear regression model (CNLRM).
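In symbols, the added normality assumption can be written as follows (a standard statement; the equation images are missing from this copy):

```latex
u_i \sim N(0, \sigma^2) \quad \text{independently for each } i,
```

i.e. the disturbances are independently and identically normally distributed with zero mean and constant variance.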
The post Classical Normal Linear Regression Model (CNLRM) appeared first on StepUp Analytics.
Let's look at the three main types of missing values, classified according to their pattern of occurrence in a data set.
Missing Completely at Random (MCAR) occurs when the missing values occur entirely at random and are independent of the other variables in the observation. Here we assume that the variable with missing data is completely unrelated to the other variables or columns in the data. For example:
Suppose that we have a database of school students with 4 columns Student.Id, Name, Gender, and Number of Subjects. With the data available we cannot determine the number of subjects for the given missing observation because the missing data is completely independent of the other observations in the data.
An alternative assumption to MCAR is MAR or Missing at Random. It assumes that we can predict the missing value on the basis of other available data.
From the given data we can build a predictive model that Number of subjects can be predicted on the basis of independent variables like class and age. So in these cases, we can use some advanced imputation techniques to determine the missing values.
MAR is always a safer assumption than MCAR. This is because any statistical analysis which is performed under the assumption of MCAR is also valid for MAR, but the reverse is not true.
NMAR, or Not Missing at Random, is also known as nonignorable missing data. It is completely different from MCAR or MAR: it is a case where we cannot determine the value of the missing data with any of the advanced imputation techniques. For example, a question in a questionnaire may concern a very sensitive issue and is likely to be skipped by the people filling out the questionnaire. This is known as missing not at random data.
In the present study, I have used the iris data set which is already present in the R software. Though the dataset does not have any missing values, I have introduced missing values randomly into the data set to execute the six most popular methods of missing value treatment.
D <- iris
Saving the dataset in a dataframe “D”.
The data frame has five columns.
Of these variables, the first four are numeric and the fifth variable “Species” is a factor with three levels.
NOTE: I have used only the first column i.e. Sepal.Length to explain the imputation techniques.
str(D)
This method is best avoided unless the data is MCAR. We have to see whether deleting the data will affect any of the statistical analysis performed with it. Moreover, it should only be performed if a sufficient amount of data remains after deleting the observations with “NA” values, and if deleting them does not create any bias or remove the representation of any variable.
Creating another data set from the original dataset “D”
df <- D
Introducing “NA” values randomly into the dataset.
df <- as.data.frame(lapply(df, function(cc)
  cc[sample(c(TRUE, NA), prob = c(0.95, 0.05), size = length(cc), replace = TRUE)]))
Determining the number of “NA” values in the data set.
sapply(df,function(x)sum(is.na(x)))
Deleting the observations or rows which have “NA” values.
df<-na.omit (df)
Now, there is also an alternative to this, which is using na.action = na.omit directly in the model.
sapply(df,function(x)sum(is.na(x)))
This method is used only when a certain variable has a very high number of NA values in comparison to other variables. So using the previous method would lead to a loss of too many observations from the dataset. Now here we also need to see whether the given variable is an important predictor of the dependent variable or not. Then decide the better approach to deal with it.
Creating another data set from the original dataset “D”
df1 <- D
Introducing NA values to the first column of the dataset
df1$Sepal.Length [20:140] <-NA
Determining the number of “NA” values in the data set.
sapply(df1,function(x)sum(is.na(x)))
We can see that out of 150 observations, 121 values in the Sepal.Length column are missing.
Deleting the variable Sepal.Length from the dataset.
df1$Sepal.Length <- NULL
df1 <- df1[, -1]   # Another way to do it
sapply(df1, function(x) sum(is.na(x)))
This is a very common technique of replacing the NA values. It is often used when there is not much variation in the data or the variable is not that important predictor of the dependent variable. Though one can easily calculate the mean or median value to impute the missing values, this method leads to an artificial reduction of the variation in the dataset.
Moreover, it reduces the standard error which invalidates most hypothesis tests. Also, it introduces a wrong representation of the relationship of the variable with other variables in the dataset.
Creating another data set from the original dataset “D”
df2<-D
Introducing NA values randomly into the dataset.
set.seed(123)
df2 <- as.data.frame(lapply(df2, function(cc)
  cc[sample(c(TRUE, NA), prob = c(0.60, 0.40), size = length(cc), replace = TRUE)]))
Here I am saving a copy of the variable Sepal.Length as “original”, consisting of the values which have been replaced with NA values in the data set. This is done so that later one can calculate the MAE, RMSE, and MAPE to see the accuracy of the imputation method.
for more details about MSE, RMSE, and MAPE please open this link
fn <- ifelse(is.na(df2$Sepal.Length) == TRUE, df2$Sepal.Length, 0)
original <- D$Sepal.Length[is.na(fn)]
Calculating the value of mean for the variable Sepal.Length and saving it as predictmean
predictmean <- round(mean(df2$Sepal.Length, na.rm = TRUE), digits = 1)
df21 <- df2
Replacing the missing values in the Sepal.Length column with the mean value
df21$Sepal.Length [is.na (df2$Sepal.Length)] <- predictmean
For checking the accuracy, we calculate MAE, RMSE, and MAPE with the help of the “Metrics” library, saving the output in Accuracy_mean.
library(Metrics)
mae_mean <- mae(original, predictmean)
rmse_mean <- rmse(original, predictmean)
mape_mean <- mape(original, predictmean)
Accuracy_mean <- cbind(mae = mae_mean, rmse = rmse_mean, mape = mape_mean)
Accuracy_mean
Calculating the value of the median for the variable Sepal.Length and saving it as predictmedian
predictmedian <- round(median(df2$Sepal.Length, na.rm = TRUE), digits = 1)
df22 <- df2
Replacing the missing values in the Sepal.Length column with the median value
df22$Sepal.Length[is.na(df2$Sepal.Length)] <- predictmedian
For checking the accuracy, we calculate MAE, RMSE, and MAPE with the help of the “Metrics” library, saving the output in Accuracy_median.
library(Metrics)
mae_median <- mae(original, predictmedian)
rmse_median <- rmse(original, predictmedian)
mape_median <- mape(original, predictmedian)
Accuracy_median <- cbind(mae = mae_median, rmse = rmse_median, mape = mape_median)
Accuracy_median
This method of imputation is used when the missing data is of MAR type. We build a decision tree model, trained on the observations already present in the dataset, to predict the missing values. This method can be used to predict both numeric and factor variables. Here we are predicting a numeric variable, so we choose method = "anova"; for factor variables we would write method = "class". It is also important to exclude the missing values from model fitting by using na.action = na.omit.
We have saved the model by the name fitrpart
library(rpart)   # tree-based model
fitrpart <- rpart(Sepal.Length ~ ., data = df2[!is.na(df2$Sepal.Length), ],
                  method = "anova", na.action = na.omit)
Here we are predicting the missing values of Sepal.Length with the model and saving it in predictrpart
predictrpart<- predict(fitrpart, df2[is.na(df2$Sepal.Length),])
For checking the accuracy, we calculate MAE, RMSE, and MAPE with the help of the “Metrics” library, saving the output in Accuracy_rpart.
library(Metrics)
mae_rpart <- mae(original, predictrpart)
rmse_rpart <- rmse(original, predictrpart)
mape_rpart <- mape(original, predictrpart)
Accuracy_rpart <- cbind(mae = mae_rpart, rmse = rmse_rpart, mape = mape_rpart)
Accuracy_rpart
The kNN algorithm computes the distance between the data point and its k nearest neighbors using the Euclidean distance in multidimensional space and imputes the missing values with the weighted average of the values taken by the k nearest neighbors.
Things to remember:
1. k is the number of nearest neighbors used to find the values of the missing data points.
2. variable is the variable consisting of missing values that we choose to impute. We can choose more than one variable with variable = c("a", "b", "c"), where a, b, and c are the variables that contain missing values.
Disadvantages of using Knn
Executing the algorithm
library(VIM)
df23 <- kNN(df2, variable = "Sepal.Length", k = 6)
df23$Sepal.Length_imp <- NULL
Saving the values of variable Sepal.Length that was imputed using kNN in a vector predictkNN
predictkNN <- df23[is.na(df2$Sepal.Length), "Sepal.Length"]
For checking the accuracy, we calculate MAE, RMSE, and MAPE with the help of the “Metrics” library, saving the output in Accuracy_kNN.
library(Metrics)
mae_kNN <- mae(original, predictkNN)
rmse_kNN <- rmse(original, predictkNN)
mape_kNN <- mape(original, predictkNN)
Accuracy_kNN <- cbind(mae = mae_kNN, rmse = rmse_kNN, mape = mape_kNN)
Accuracy_kNN
NOTE: There is another package, DMwR, whose function knnImputation() can be used to do the imputation.
Mice, or Multivariate Imputation via Chained Equations, is a package that uses multiple imputation for missing data treatment. Since multiple imputation creates multiple predictions for each missing value, it takes into account the uncertainty in the imputation and gives good standard errors. If there is not much information in the data used to fit the model, the imputations will be highly variable, leading to high standard errors in the analysis.
Things to remember:-
library(mice)
fitmice <- mice(df2, m = 10, maxit = 30, method = "pmm")
df24 <- complete(fitmice)
Saving the values of variable Sepal.Length that was imputed using Mice in a vector predictmice
predictmice <- df24[is.na(df2$Sepal.Length), "Sepal.Length"]
For checking the accuracy, we calculate MAE, RMSE, and MAPE with the help of the “Metrics” library, saving the output in Accuracy_mice.
library(Metrics)
mae_mice <- mae(original, predictmice)
rmse_mice <- rmse(original, predictmice)
mape_mice <- mape(original, predictmice)
Accuracy_mice <- cbind(mae = mae_mice, rmse = rmse_mice, mape = mape_mice)
Accuracy_mice
The missForest function is used particularly in the case of mixed-type data. It can impute continuous and categorical data, including complex interactions and non-linear relations. It uses the given data frame to train a random forest model and then uses that model to predict the missing values. It yields an out-of-bag (OOB) imputation error estimate without the need for a test set or elaborate cross-validation.
library(missForest)
# Executing the algorithm
fitmissforest <- missForest(df2)
# Saving the output in df25
df25 <- fitmissforest$ximp
Saving the values of variable Sepal.Length that was imputed using missForest in a vector predictmissforest
predictmissforest <- round(df25[is.na(df2$Sepal.Length), "Sepal.Length"], digits = 1)
For checking the accuracy, we calculate MAE, RMSE, and MAPE with the help of the “Metrics” library, saving the output in Accuracy_missforest.
library(Metrics)
mae_missforest <- mae(original, predictmissforest)
rmse_missforest <- rmse(original, predictmissforest)
mape_missforest <- mape(original, predictmissforest)
Accuracy_missforest <- cbind(mae = mae_missforest, rmse = rmse_missforest, mape = mape_missforest)
Accuracy_missforest
Creating a data frame Accuracy consisting of all the accuracy outputs of the last five methods
Accuracy <- cbind(Methods = c("Mean", "Median", "rpart", "kNN", "mice", "missForest"),
                  rbind.data.frame(Accuracy_mean, Accuracy_median, Accuracy_rpart,
                                   Accuracy_kNN, Accuracy_mice, Accuracy_missforest))
library(ggplot2)
Creating a scatter plot to visualize the best method, i.e. the one with the lowest MAPE (mean absolute percentage error). Similarly, you can draw the plots for MAE and RMSE.
Scatter plot
# Scatter plot
ggplot(data = Accuracy, aes(x = Methods, y = mape)) + geom_point()
So, from the above diagram we can see that for the given data set mice and missForest give us the best output. But we cannot conclude that mice or missForest will be the best method of missing value treatment for every dataset, because missing values have different types and patterns. So the best approach is to test the models on the given data and then use the best one to impute the missing values in the data set.
Missing values imputation with missMDA
Missing values imputation with Fuzzy K-means Clustering
Missing values imputation with bpca (Bayesian Principle Component Analysis)
The post Missing Value Imputation Techniques In R appeared first on StepUp Analytics.
In ANOVA we examine whether independent variables have any statistically significant effect on a continuous dependent variable, using sums of squares. But there we have only one dependent variable. That is simple, but in practice problems are complex and we can have more than one dependent variable. We could run an ANOVA for every dependent variable separately, but using Multivariate Analysis of Variance, or MANOVA, we can do it in one analysis.
So, we can think of MANOVA as a multivariate extension of ANOVA. This way MANOVA explains how much variability of dependent variables is explained by the independent variables simultaneously.
MANOVA, like ANOVA, also has some assumptions. Before performing MANOVA we have to check whether the following assumptions are satisfied.
Let us take a sample data set to understand how MANOVA works. The dataset can be downloaded from the given link – Data
Data Dictionary
The data contains 7 columns with 120 observations. The variables of interest are “Temperature” (3 levels), “N.source” (2 levels), “Optical.density”, and “Product.yield”.
Here we want to check whether the different Temperatures or different N.sources have significant effects on Optical density and Product yield. Here we can perform ANOVA separately for Temperature and N.source but this can be done simultaneously in MANOVA.
Now we are going to observe how MANOVA works using graphical representations.
First, the data is loaded into the R environment.
data <- read.delim("MANOVA.txt", header = T)
head(data, n = 10)
data$Temperature <- as.factor(data$Temperature)
Now we could have plots as following,
# For N.source
plot(data$Optical.density, data$Product.yield, col = data$N.source, pch = 15,
     main = "Optical density vs Product yield vs N.source",
     xlab = "Optical density", ylab = "Product yield")
legend("topleft", legend = as.character(levels(data$N.source)), fill = 1:2)
# For Temperature
plot(data$Optical.density, data$Product.yield, col = data$Temperature, pch = 15,
     main = "Optical density vs Product yield vs Temperature",
     xlab = "Optical density", ylab = "Product yield")
legend("topleft", legend = as.character(levels(data$Temperature)), fill = 1:3)
From the above scatter plots we can easily see that the levels of N.source differ significantly in explaining the variability of Optical density and Product yield, whereas the levels of Temperature do not.
We can obtain the same result as the plots suggest mathematically, and simultaneously, using MANOVA.
In R we can perform MANOVA as follows,
summary(manova(cbind(Product.yield, Optical.density) ~ N.source + Temperature, data = data))

             Df  Pillai approx F num Df den Df Pr(>F)
N.source      1 0.72282  149.944      2    115 <2e-16 ***
Temperature   2 0.04278    1.268      4    232 0.2835
Residuals   116
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
From the output we can see that, at the 1% level of significance, N.source has a significant effect on Product yield and Optical density, whereas at the 5% level of significance Temperature has no significant effect on Product yield and Optical density.
So, this is how MANOVA gives us a mathematical result to understand if the effects of some treatments significantly differ to explain the variability of more than two continuous variables simultaneously.
Now, we can have the result of ANOVA as follows,
summary.aov(manova(cbind(Product.yield, Optical.density) ~ N.source + Temperature, data = data))

Response Product.yield :
             Df  Sum Sq Mean Sq  F value Pr(>F)
N.source      1 27579.1 27579.1 187.3099 <2e-16 ***
Temperature   2    80.9    40.5   0.2748 0.7602
Residuals   116 17079.6   147.2
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Response Optical.density :
             Df Sum Sq Mean Sq  F value  Pr(>F)
N.source      1 6.7071  6.7071 193.7043 < 2e-16 ***
Temperature   2 0.1656  0.0828   2.3915 0.09599 .
Residuals   116 4.0166  0.0346
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Here we have the results when we perform the ANOVAs separately. We get the same conclusion for N.source, but we can see that at the 10% level of significance Temperature has a significant effect on Optical density.
So, basically, in ANOVA we separately check how significant each predictor variable is for a single response, whereas in MANOVA we analyse the variability of multiple response variables across the predictor variables simultaneously.
We should use MANOVA when there are multiple dependent variables which are correlated. Unlike individual ANOVAs, MANOVA can detect small but significant effects. In MANOVA, the effect of factors which influence the relationship between two dependent variables can be determined; this may be missed if we perform individual ANOVAs.
With individual ANOVAs the Type I error rate (the chance of rejecting a true null hypothesis) is inflated, but MANOVA, by testing all response variables simultaneously, keeps the error rate at the desired level of significance.
The presence of outliers may increase the Type I error rate, as MANOVA is sensitive to outliers. The presence of multicollinearity violates an assumption of MANOVA: if the dependent variables are highly correlated, then one can be a linear function of the others and becomes statistically redundant.
The post Multivariate Analysis Of Variance Or MANOVA appeared first on StepUp Analytics.
The post Hypothesis Testing Examples appeared first on StepUp Analytics.
Then we may be interested in knowing whether this sample average is in line with the population average of 85. Hypothesis testing is like a litmus test that gives us a path to rejecting or accepting an assumption or claim, except that it is probabilistic rather than deterministic. It is a technique to compare two datasets, or a sample with a population.
Let us first see what a hypothesis is and take a look at some of the terms that are inclusive to hypothesis testing.
A hypothesis is nothing but an assumption that we make about the population parameters, which we then want to verify. Every test includes two hypotheses: the null hypothesis and the alternative hypothesis. The null hypothesis is the statement of no difference and is denoted H0. It simply asserts that there is no real difference between the sample and the population, and that any difference is accidental or due to chance. The alternative hypothesis is the statement against the null hypothesis.
It is the contradiction of the null hypothesis and is usually denoted H1. For example, a sample of 50 light bulbs is tested for lifetime and we want to test whether the average lifetime of the bulbs is 300 days. Then we set up the null hypothesis as “the average lifetime is 300 days” and the alternative hypothesis as “the average lifetime is not 300 days”.
To test a hypothesis we need to have a single value based on the sample observations that can be compared with a pre-defined value so as to reach a decision. This value is computed using a certain formula and follows a particular probability distribution under some assumptions. Since the value calculated is used for testing and is derived from the sample, it is called a test statistic.
We all are familiar with the game of darts. Consider a simpler version of such a game in which an aim on the outer ring results in the disqualification (rejected, straightaway) of the aimer whereas an aim on the inner two circles results in qualification (acceptance) of the aimer for further rounds (only qualified, not yet the winner).
Just like this dartboard is divided into areas of rejection and acceptance, in a similar way a probability curve is divided into acceptance region and the rejection region (also called the critical region). If a test statistic falls in the critical region then the null hypothesis is rejected, it may be accepted otherwise. Hence, a critical region can be defined as the region of rejection of H0 when H0 is true.
A point to note here is that rejecting the null hypothesis is a much stronger conclusion than accepting it (as in the example above, where the aimer is only qualified and is not yet the winner). The reason is that we deal with a sample rather than the population itself. In R.A. Fisher’s own words:
“The null hypothesis is never proved or established, but is possibly disproved, in the course of experimentation. Every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis.”
The confidence with which a decision is taken depends on the significance level chosen. The significance level is the probability of rejecting the null hypothesis when it is true. It is the size of the critical region and is expressed in terms of percentage. For example, a significance level of 5% means that the null hypothesis will be rejected 5 out of 100 times when it is true. The significance level is denoted by α.
As we have seen above, the critical region is a portion of the probability curve. This portion can lie on either end of the curve or on both ends. A test is recognized as one-tailed or two-tailed depending on which side of the curve the critical area lies, which in turn depends on the nature of our alternative hypothesis. For example, in the light bulb problem, if we want to test that the lifetime of bulbs is greater than 300 days, then our alternative hypothesis will be “lifetime of bulbs > 300 days” (right-tailed).
If we want to test if it is less than 300 days then the alternative hypothesis becomes “lifetime of bulbs < 300 days” (left tailed). If we do not care about whether it is greater or less and just want to test if it is 300 days or not then alternative hypothesis becomes “lifetime of bulbs ≠ 300 days” (two-tailed). The critical region, say at 5%, in these cases can be illustrated as below:
For a two-tailed test, the critical region is divided into two parts, one for the right side and other for the left side. While for one-tailed test it remains undivided. So, if we are dealing with a two-tailed test at significance level (size of the critical region) α% then on each side we have (α/2)% of the area.
In simple terms, the p-value is the evidence for accepting or rejecting the null hypothesis. Consider a coin that someone claims is biased. To test this claim we set up the null hypothesis “the coin is unbiased” against the alternative hypothesis “the coin is biased”. The coin is tossed 20 times, and suppose 18 heads and 2 tails are obtained. Clearly, we should have got a similar number of heads and tails if the coin were unbiased. But to establish this we need evidence in the form of a p-value.
It is the probability of obtaining a result equal to or more extreme than the observed value. So, in this case, “the result more extreme than observed” would be (19 heads, 1 tail), (20 heads, 0 tail), (19 tails, 1 head) or (20 tails, 0 head). Calculating the probability of obtaining this result (using binomial distribution) under the null hypothesis we get:
P(18 heads and 2 tails) = P(18 tails and 2 heads) = 0.000181
P(19 heads and 1 tail) = P(19 tails and 1 head) = 0.00001907
P(20 heads and 0 tail) = P(20 tails and 0 head) = 0.0000009536
Adding up the probabilities we get the p-value as 0.0004.
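These tail probabilities are easy to verify in R; the following sketch reproduces the arithmetic above:

```r
# Probabilities of 18, 19 and 20 heads in 20 tosses of a fair coin
dbinom(18:20, size = 20, prob = 0.5)

# Two-tailed p-value: outcomes at least as extreme as 18 heads, in either direction
p_value <- 2 * sum(dbinom(18:20, size = 20, prob = 0.5))
p_value   # approximately 0.0004
```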
This p-value is compared with the significance level, which we take here as 0.05. If the p-value is greater than the significance level, we say that the evidence against the null hypothesis is weak, which means we can accept the null hypothesis. If the p-value is less than the significance level, the evidence against the null hypothesis is strong and we reject it.
So, in this case, the p-value is very small as compared to the significance level, therefore, we can safely say that the null hypothesis is rejected and the coin is indeed a biased one.
Statistics is probabilistic and so is hypothesis testing. There is always a probability of making a wrong decision. While making decisions, four possibilities arise:
Clearly, the last two decisions are correct. The first two wrongly reject or accept the null hypothesis; they are errors. The first rejects the null hypothesis when it is, in fact, true: this is called a Type I error. The probability of committing a Type I error is denoted by α.
The second one accepts the null hypothesis when it is false, it is called the type II error. The probability of committing type II error is denoted by β.
Consider an example of testing whether a new toothpaste is better than the previous toothpaste at fighting dental cavities. The hypotheses are H0: the toothpastes do not differ, against H1: the new toothpaste is better than the old one. Now suppose that the new toothpaste is actually better. If our test accepts the null hypothesis that the toothpastes do not differ, then we commit a Type II error.
While testing the hypothesis, our aim is to reduce both types of error, but it is not possible to control both errors simultaneously. So we fix the probability of Type I error (α) in advance at a satisfactory level and try to minimize the probability of Type II error (β). α is also known as the significance level, or the size of the critical region.
We know that for large sample sizes, almost all the distributions can be approximated by the normal distribution due to the Central Limit Theorem. This forms the basis of the large sample tests. Let us take a look at some of the tests and also how to perform them in R.
Consider a random sample of size n (≥ 30) from a normal population with mean µ and variance σ². We know that the sample mean x̄ of this sample is also normally distributed, as N(µ, σ²/n). Thus the standard normal variate corresponding to x̄ is:
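The test statistic, which appeared as an image in the original post, is the standard z statistic:

```latex
Z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \;\sim\; N(0, 1)
```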
If the population standard deviation is not known, which is usually the case, then we use the sample variance as its estimate. Let us take an example of a pizza delivery boy who claims that he takes, on average, 8.9 minutes to reach his destination to deliver pizzas. To check this claim, the agency that hires him notes his time taken for 50 orders. It gets a mean of 9.3 minutes with a standard deviation of 1.6 minutes.
Now let us check if the average time taken to deliver a pizza is 8.9 minutes or not. For this we start by setting up the null hypothesis as H0: the average time taken to deliver a pizza is 8.9 minutes (µ = 8.9), against the alternative hypothesis H1: the average time taken to deliver a pizza is not 8.9 minutes (µ ≠ 8.9).
According to the situation we have: sample mean (x) = 9.3 minutes, hypothesized population mean (µ) = 8.9 minutes, sample standard deviation (s) = 1.6 minutes (used as an estimate of σ) and sample size (n) = 50. Since the sample is large (≥30), we apply the large sample test; for small samples with unknown σ, the t-test is used instead. On substituting the values in the test statistic formula we get |z| = |9.3 − 8.9|/(1.6/√50) ≈ 1.767.
This is a two-tailed test so the critical region will be on both sides of the curve. The critical value from the standard normal table is 1.96 (we have taken a significance level of 0.05). The calculated value of the test statistic, 1.767, is less than the tabulated value of 1.96. Hence, we may accept the null hypothesis at the 0.05 level of significance.
This test can be performed in R as well. The code for which is given below.
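The same computation can also be sketched in Python using only the standard library (a minimal sketch, using the figures from the pizza example above):

```python
import math

def one_sample_z(xbar, mu0, s, n):
    """Large-sample z test for a single mean; s is used as an estimate of sigma."""
    z = (xbar - mu0) / (s / math.sqrt(n))
    # Two-tailed p-value: p = 2*P(Z <= -|z|), via the complementary error function.
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p

z, p = one_sample_z(xbar=9.3, mu0=8.9, s=1.6, n=50)
print(round(z, 3), round(p, 3))  # z ≈ 1.768, p ≈ 0.077 > 0.05, so H0 is not rejected
```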
Since we are dealing with a two-tailed test, the p-value is calculated as p = P(Z ≤ −z) + P(Z ≥ z) = 2·P(Z ≤ −z) ≈ 0.077. Clearly, the p-value is greater than the significance level; therefore we have little evidence against the null hypothesis and may accept it.
In the above test, we had one sample and we compared the value of the sample mean to that of the population mean. We can also compare the value of two sample means to find out if they belong to populations with identical means or to check if one population is superior or inferior to other.
For this, two samples of sizes n1 and n2 are taken from the same or different populations with means µ1 and µ2 and variances σ1^{2} and σ2^{2} respectively. Also, we know that the sample means x1 and x2, and their difference x1 − x2, are distributed as: x1 ~ N(µ1, σ1^{2}/n1), x2 ~ N(µ2, σ2^{2}/n2) and x1 − x2 ~ N(µ1 − µ2, σ1^{2}/n1 + σ2^{2}/n2).
Under the null hypothesis that the samples are from the same population (µ1 = µ2), the test statistic is given as: z = (x1 − x2)/√(σ1^{2}/n1 + σ2^{2}/n2).
If the population standard deviations are not known, then the sample variances s1^{2} and s2^{2} are used in their place: z = (x1 − x2)/√(s1^{2}/n1 + s2^{2}/n2).
For better understanding, consider an example where it is required to check if the mean level of pay in one state is greater than that in another. Two samples of employees of sizes 1200 and 1000 are taken. The mean and standard deviation of the samples (in thousands of rupees) are given as:
Here, we have to test the null hypothesis, H0: there is no difference between the average pay of the two states, i.e., µ1 = µ2, against the alternative hypothesis, H1: the mean level of pay of state 1 is greater than that of state 2, i.e., µ1 > µ2 (right-tailed test). The population standard deviations are not known, so we estimate them using the sample standard deviations. The value of the test statistic is |z| = 24.28. The tabulated value for the right-tailed test at the 5% level of significance is 1.645.
The calculated value is much greater than the tabulated value at the 5% significance level; thus we reject the null hypothesis and conclude that the mean pay of state 1 is higher than the mean pay of state 2.
In R, this test can be done as follows:
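The computation behind this test can also be sketched in Python. Since the table of sample means and standard deviations is not reproduced above, the demo figures below are hypothetical placeholders, purely to illustrate the formula:

```python
import math

def two_sample_z(x1bar, x2bar, s1, s2, n1, n2):
    """Large-sample z test for the difference of two means;
    s1 and s2 are used as estimates of the population standard deviations."""
    se = math.sqrt(s1**2 / n1 + s2**2 / n2)
    return (x1bar - x2bar) / se

# Hypothetical figures (the original pay table is not shown in the text):
z = two_sample_z(x1bar=10.0, x2bar=8.0, s1=2.0, s2=2.0, n1=100, n2=100)
print(round(z, 3))  # compared against 1.645 for a right-tailed test at the 5% level
```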
This test is used to test whether the standard deviations of two samples differ significantly. Let s1 and s2 be the standard deviations of two independent samples; then, under the null hypothesis that the standard deviations do not differ significantly, i.e., σ1 = σ2, the test statistic for large samples is given as: z = (s1 − s2)/√(σ1^{2}/(2n1) + σ2^{2}/(2n2))
Where σ1^{2} and σ2^{2} are the population variances and n1 and n2 are the sample sizes of sample 1 and sample 2 respectively. The sample variance is used as an estimate of the population variance when it is not known. Consider a farmer with two sets of plots whose yield variability is as follows:
We have to test whether the variability in the two sets of plots is significant. The null hypothesis can be stated as H0: there is no significant difference between the variability of the two plots, i.e., σ1 = σ2, against the alternative hypothesis H1: the two sets of plots have significantly different variability, i.e., σ1 ≠ σ2. We have s1 = 34, s2 = 28, n1 = 40, n2 = 60, σ1^{2} = 34^{2} = 1156, σ2^{2} = 28^{2} = 784. Substituting these values in the test statistic, we get z ≈ 1.31.
This value is less than the tabulated value at the 0.05 level of significance, which is 1.96. Hence, we cannot reject the null hypothesis, and we conclude that the difference between the variability of the two sets of plots is not significant.
In R, it can be done as follows:
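As a cross-check on the arithmetic, the same statistic can be computed in Python (a minimal sketch using the figures from the plots example above):

```python
import math

def sd_difference_z(s1, s2, n1, n2):
    """Large-sample z test for the difference of two standard deviations,
    using the sample variances as estimates of the population variances."""
    se = math.sqrt(s1**2 / (2 * n1) + s2**2 / (2 * n2))
    return (s1 - s2) / se

z = sd_difference_z(s1=34, s2=28, n1=40, n2=60)
print(round(z, 2))  # ≈ 1.31 < 1.96, so H0 is not rejected
```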
The post Hypothesis Testing Examples appeared first on StepUp Analytics.
The post Which Non Parametric Tests to Apply When appeared first on StepUp Analytics.
Nonparametric tests require fewer assumptions than parametric tests. They do not assume anything beforehand about the probability distribution of the population and hence are referred to as distribution-free tests. They are readily comprehensible and easy to use.
The hypotheses which can be tested using nonparametric tests are:
Many tests are available to test these hypotheses, but the main question one encounters is which test would be appropriate. Given below is a list of some nonparametric tests and their applications in hypothesis testing.
The sign test is the simplest of all. As the name suggests, it is based on the signs (pluses or minuses) of the observations and not on their magnitudes (the data here are nominal). The sign test can be of two types, namely:
In one sample sign test, we test whether the sample is drawn from the population with a specified median or not. The only assumption behind the one sample sign test is that the observations are drawn independently from a continuous distribution. The assumption of continuity is important in the sense that it means that no ties should occur, but in practical situations, ties may occur.
In such a situation we ignore the tied observations and the rest of the procedure remains the same. Let us consider an example of the scores of students of a particular school, which claims that its median score is greater than 60. The scores are as follows:
81, 76, 53, 71, 66, 59, 88, 73, 80, 66, 58, 70, 61, 56, 55
So here we have to test the school's claim. The null hypothesis is H0: the median score is 60 (µmedian = 60) against the alternative hypothesis H1: the median score is greater than 60 (µmedian > 60). Now we assign each sample value greater than the hypothesized median (here 60) a plus sign and each sample value less than the median a minus sign. Thus, we have 10 plus signs and 5 minus signs. The critical value K is calculated using the expression K = (n − 1)/2 − 0.98√n
Where n is the number of sample observations. Substituting n = 15 in the above expression, we get K = 3.20. We compare this value with S, the number of times the less frequent sign occurs. In this example the minus sign occurs fewer times (S = 5).
H0 is rejected if S ≤ K. Here, since S > K (5 > 3.20), we may accept the null hypothesis and conclude that the median of the population is 60. For large samples (n > 20), the normal approximation to the binomial distribution can be used, with z given as z = (X − np)/√(np(1 − p))
Where X is the number of plus signs, n is the number of observations and p is the probability of occurrence of a plus sign. The value of p is taken as 0.5, since a plus or a minus sign is equally likely. The command SIGN.test() (from the BSDA package) can be used to conduct the sign test in R. For the above example, the R code is given as:
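The exact one-sided p-value that SIGN.test() reports can also be computed directly from the binomial distribution. A Python sketch using the scores above:

```python
from math import comb

def sign_test_p(plus, minus):
    """One-sided exact p-value for a one-sample sign test:
    P(X >= plus) under Binomial(n, 0.5), with ties already discarded."""
    n = plus + minus
    return sum(comb(n, k) for k in range(plus, n + 1)) / 2**n

scores = [81, 76, 53, 71, 66, 59, 88, 73, 80, 66, 58, 70, 61, 56, 55]
plus = sum(x > 60 for x in scores)    # values above the hypothesized median
minus = sum(x < 60 for x in scores)   # values below it
p = sign_test_p(plus, minus)
print(plus, minus, round(p, 4))  # 10 plus, 5 minus, p ≈ 0.1509
```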
Notice the p-value (=0.15091) in the above code. It is greater than the significance level of 0.05. Hence, we may accept the null hypothesis.
The sign test also has an important application when two paired (dependent) samples are to be tested for a significant difference in a before-and-after measurement. The test assumes that the paired samples are drawn from continuous populations that may be different. The data for this test needs to be at least ordinal for the observations to be compared.
Consider the results of a clinical experiment where a new drug is tested on a group of patients suffering from hypertension. The patients are asked to score their satisfaction level out of 100. A score of zero denoting strong dissatisfaction and a score of 100 denoting strong satisfaction. We proceed in exactly the same way as in one sample case. The scores before and after treatment are as given:
The null hypothesis will be that there is no significant difference between scores before and after treatment. The alternative hypothesis is that the drug has positive effects i.e., the median score for after treatment observations are greater than the before treatment observations.
Calculating the critical value K as in the one sample case, we get K = 0.784 for n = 8 paired observations. Comparing this with S = 2 (as the less frequent sign appears 2 times), we see that S > K. Therefore, we may accept the null hypothesis that there is no significant difference between the before and after samples. The R code for the two sample test can be run simply by replacing y = NULL with a second-sample y vector in the SIGN.test() command. A snippet is as follows:
The Mann-Whitney U test uses the ranks assigned to the sample observations to determine whether two samples come from identical populations. In this test, we assume that the samples are independent and the observations are at least ordinal for the sake of ranking. If the samples are drawn from identical populations, then the mean ranks assigned to the two samples should be more or less the same.
Let us take an example where two samples A and B are given and we have to test if they are drawn from the identical population. The samples are as follows:
The null hypothesis to test is that the two samples are from the same population against the alternative hypothesis that the two samples are from different populations. We start by assigning ranks to the samples. The ranks are given as follows:
In case of ties, the ranks are given as the average of the ranks the observations would have received had there been no ties. The statistics U1 and U2 for the U test are given as: U1 = n1n2 + n1(n1 + 1)/2 − R1 and U2 = n1n2 + n2(n2 + 1)/2 − R2
Where n1 (=12) and n2 (=12) are the sample sizes of samples A and B respectively, and R1 (=123.5) and R2 (=176.5) are the sums of the ranks of samples A and B respectively. This gives U1 = 98.5 and U2 = 45.5. For comparison, we take U = min{U1, U2}. Here, U = 45.5; this value is compared with the tabulated value for n1 and n2.
For n1, n2 > 10, the normal approximation can be used, with mean µU = n1n2/2 and variance σU^{2} = n1n2(n1 + n2 + 1)/12.
Since z = −1.59 and |z| < 1.96 (at the 5% level of significance), the null hypothesis may be accepted and we may conclude that the two samples are from identical populations.
In R programming, the command wilcox.test() is used to conduct Mann-Whitney U test. The R code for the above example is given below.
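The U statistics can be verified from the rank sums reported above; the Python sketch below does this, along with the plain normal approximation (without a tie correction, so the z it produces may differ slightly from the value quoted in the text, which includes tied ranks):

```python
import math

def mann_whitney_u(n1, n2, r1, r2):
    """U statistics from rank sums, plus the (uncorrected) normal-approximation z."""
    u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1
    u2 = n1 * n2 + n2 * (n2 + 1) / 2 - r2
    u = min(u1, u2)
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # no tie correction applied
    return u1, u2, (u - mu) / sigma

u1, u2, z = mann_whitney_u(n1=12, n2=12, r1=123.5, r2=176.5)
print(u1, u2, round(z, 2))  # U1 = 98.5, U2 = 45.5; |z| < 1.96, so H0 is accepted
```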
We get the p-value as 0.1299 which is greater than the significance value of 0.05. Thus, we may accept the null hypothesis that the two samples are from the same population.
The Wald-Wolfowitz run test, or simply the run test, checks whether a sequence of elements is random or not. This test is based on the theory of runs. A run is a sequence of identical letters preceded and followed by a different letter or by no letter at all.
Let us consider the sequence of manufactured items from a production house with good (G) and defective (D) items as follows: GDDGGDGDDDGGDGDDGGDDGDGG. We first define the null hypothesis as H0: the given sample is random. Here we have 15 runs (r = 15), 12 good items (n1 = 12) and 12 defective items (n2 = 12). The number of runs, r, has its own sampling distribution, with mean and variance given as: µr = 2n1n2/(n1 + n2) + 1 and σr^{2} = 2n1n2(2n1n2 − n1 − n2)/((n1 + n2)^{2}(n1 + n2 − 1))
Where n1 and n2 are the counts of the two types of items. Therefore, the test statistic becomes: z = (r − µr)/σr
Substituting the values of n1 and n2 in the above expressions, we obtain z as 0.834. This calculated value of z is then compared with the tabulated value at the α% level of significance. Taking the level of significance as 5%, we observe that the calculated value (0.834) is less than the tabulated value at the 0.05 level (= 1.64); therefore we may accept the null hypothesis and conclude that the sample is random. To run this test in R, let us denote the good items by 1 and the defective items by 0. The command runs.test() is used to conduct a run test in R.
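The run count and the z statistic can be reproduced directly from the sequence; a Python sketch (the two-sided p-value it yields lines up with the runs.test p-value quoted below):

```python
import math
from itertools import groupby

def runs_test(seq):
    """Wald-Wolfowitz run test for a two-symbol sequence."""
    runs = sum(1 for _ in groupby(seq))   # count maximal runs of identical symbols
    n1 = seq.count(seq[0])
    n2 = len(seq) - n1
    mu = 2 * n1 * n2 / (n1 + n2) + 1
    var = (2 * n1 * n2 * (2 * n1 * n2 - n1 - n2)
           / ((n1 + n2) ** 2 * (n1 + n2 - 1)))
    z = (runs - mu) / math.sqrt(var)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return runs, z, p

runs, z, p = runs_test("GDDGGDGDDDGGDGDDGGDDGDGG")
print(runs, round(z, 2), round(p, 2))  # 15 runs, z ≈ 0.83, p ≈ 0.40
```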
The null hypothesis may thus be accepted as the p-value (0.4038) is greater than the significance value of 0.05.
The Kruskal Wallis one way analysis of variance is a useful test when several independent samples are involved. It helps in deciding whether k (>2) independent samples are from the same population or identical population with the same median or not.
It is assumed that the observations are independent and at least ordinal. In a similar fashion to the Mann-Whitney U test, this test also begins by ranking the observations in ascending order. The average ranks of the samples should be about the same if they come from the same population. To test this, the Kruskal-Wallis test statistic is given as: KW = 12/(N(N + 1)) · Σ(Ri^{2}/ni) − 3(N + 1)
Where k is the number of samples, ni is the number of observations in the i-th sample, N = Σni is the total number of observations, and Ri is the sum of the ranks in the i-th sample.
The sampling distribution of the KW statistic can be well approximated by the χ^{2} distribution with (k − 1) degrees of freedom when the number of samples (k) is more than 3 and the number of observations in each sample exceeds 5.
Let us suppose a factory installs three machinery units and wants to determine if the output of the machines varies significantly or not. The output for different machines being given as:
Machine A : 80, 83, 79, 85, 90, 68
Machine B : 82, 84, 60, 72, 86, 67, 91
Machine C : 93, 65, 77, 78, 88
The ranks allotted for the above data are given as:
Ranks of machine A: 9, 11, 8, 13, 16, 4
Ranks of machine B: 10, 12, 1, 5, 14, 3, 17
Ranks of machine C: 18, 2, 6, 7, 15
The average ranks of machines A, B and C are 10.17, 8.86 and 9.6 respectively. Putting these values in the expression of the test statistic we get KW = 0.197, which is less than the tabulated χ² value with 2 degrees of freedom at the 0.05 level (= 5.991). Hence, H0 may be accepted at the 5% level of significance.
Following is the R code for running a Kruskal-Wallis test.
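As a cross-check on the arithmetic, the KW statistic can be computed from scratch in Python (a sketch using the machine outputs listed above):

```python
def kruskal_wallis(samples):
    """KW statistic: rank all observations together (average ranks for ties),
    then apply KW = 12/(N(N+1)) * sum(Ri^2/ni) - 3(N+1)."""
    pooled = sorted(x for s in samples for x in s)
    rank = {}
    i = 0
    while i < len(pooled):                 # assign average rank to each value
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        rank[pooled[i]] = (i + 1 + j) / 2  # average of ranks i+1 .. j
        i = j
    n = len(pooled)
    total = sum(sum(rank[x] for x in s) ** 2 / len(s) for s in samples)
    return 12 / (n * (n + 1)) * total - 3 * (n + 1)

machines = [[80, 83, 79, 85, 90, 68],
            [82, 84, 60, 72, 86, 67, 91],
            [93, 65, 77, 78, 88]]
kw = kruskal_wallis(machines)
print(round(kw, 3))  # ≈ 0.197 < 5.991, so H0 is accepted
```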
The Kolmogorov-Smirnov test is a test of goodness of fit: it tests whether a sample comes from a specified theoretical distribution, i.e., it is concerned with the degree of agreement between the distribution of a set of values and some specified theoretical distribution, which is assumed to be continuous. Let F0(x) be the specified cumulative relative distribution, i.e., for any value x, F0(x) is the proportion of cases expected to have values less than or equal to x (X ≤ x).
Also, let S0(x) be the observed cumulative distribution function. The null hypothesis states that the sample has been drawn from the specified theoretical distribution. For H0 to be true, we would expect the differences between F0(x) and S0(x) to be small. The Kolmogorov-Smirnov test focuses on the largest of the deviations. Thus, the test statistic is given as: D = max |F0(x) − S0(x)|
The value of this statistic is then compared with the tabulated value of D at α% level and H0 is rejected if the calculated value is greater than the tabulated value otherwise it is accepted.
Let us have a look at an example where observed and predicted observations are given, and we have to test whether the predicted sample can be thought to have come from the theoretical distribution (the observed sample). We first find the relative cumulative frequencies by dividing the observations of the observed and predicted samples by 683 and 683.2 respectively. Then we find the differences between the relative cumulative frequencies of the observed and predicted values.
The maximum of these values (=0.015) is the value of the test statistic D. Tabulated value of D at n=10 and 5% level of significance is 0.409. Since the calculated value is less than the tabulated value, therefore we may accept the null hypothesis. The R code for Kolmogorov Smirnov test is as follows:
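The original data table is not reproduced above, so the Python sketch below uses hypothetical cumulative relative frequencies purely to illustrate how D is computed:

```python
def ks_statistic(f0, s0):
    """Kolmogorov-Smirnov D: the largest absolute gap between the theoretical
    and observed cumulative relative frequencies at the same points."""
    return max(abs(a - b) for a, b in zip(f0, s0))

# Hypothetical cumulative relative frequencies (the original table is not shown):
theoretical = [0.10, 0.25, 0.45, 0.70, 0.90, 1.00]
observed    = [0.12, 0.24, 0.47, 0.68, 0.91, 1.00]
d = ks_statistic(theoretical, observed)
print(round(d, 2))  # D = 0.02, to be compared with the tabulated critical value
```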
As seen above, non-parametric tests are reasonably practical and straightforward. Although they require fewer assumptions and can be applied to both small and large samples, their main disadvantage is that they have less statistical power than parametric tests, i.e., they are less able to reject the null hypothesis when the alternative hypothesis is true.
Also, it is disadvantageous to use non-parametric methods when the assumptions of parametric methods are met and the data are measured on an interval or ratio scale. Having said that, it is recommended to use parametric tests whenever possible; when that is not feasible, non-parametric tests will always be available.