The post Theory of Estimation Or What is Estimation appeared first on StepUp Analytics.
Population: A group of individuals under study is called a population. A population may be finite or infinite. E.g. All the registered voters in India.
Sample: A finite subset of the statistical individuals in a population. E.g. Some voters selected from all registered voters.
Parameter: A statistical constant of the population, such as the mean (μ) or the variance (σ^{2}). E.g. The mean income of all the registered voters.
Statistic: A statistical constant of the sample, such as the mean (X̄) or the variance (s^{2}). In other words, any function of the random sample x_{1}, x_{2}, …, x_{n} that does not involve unknown parameters.
Estimator: If a statistic is used to estimate an unknown parameter θ of the distribution, then it is called an estimator. E.g. The sample mean is an estimator of the population mean.
Estimate: A particular value of the estimator is called an estimate of the unknown parameter. E.g. The mean income of the selected voters is ₹25000, which serves as an estimate of the mean income of all the registered voters.
Sampling Distribution: The probability distribution of a statistic over all possible samples of a given size is called its sampling distribution. E.g. If we repeatedly select samples of voters and compute the mean height of each sample, the distribution of these sample means is the sampling distribution of the mean.
Standard Error: The standard deviation of the sampling distribution of a statistic is known as its standard error and is denoted by ‘s.e.’ E.g. If we want to know how much the mean height of a sample of voters varies from sample to sample, the standard error of the sample mean is used.
Now, before discussing the different methods of finding estimates of an unknown population parameter, it is important to know the characteristics of a good estimator. Here, “a good estimator” is one which is as close to the true value of the parameter as possible. The following are some of the criteria that should be satisfied by a good estimator:
Unbiasedness
This is a desirable property of a good estimator. An estimator T_{n} is said to be an unbiased estimator of γ(θ), where γ(θ) is a function of the unknown parameter θ, if the expectation of the estimator is equal to γ(θ), i.e.,
E [T_{n}] = γ (θ)
Example: If X ~ N (μ, σ^{2}), then the sample mean X̄ is an unbiased estimator of μ, since E [X̄] = μ.
Consistency
An estimator is said to be consistent if it converges in probability to the true value of the parameter as the sample size increases. In other words, as the sample size increases, it becomes almost certain that the value of the statistic will be very close to the true value of the parameter, and the standard error (the standard deviation of the sampling distribution of the statistic) becomes smaller. Example: The sample mean is a consistent estimator of the population mean, since as the sample size n→∞, the sample mean converges to the population mean in probability and the variability of the sample mean tends to 0.
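Both properties can be checked with a small simulation. The sketch below is only an illustration: the population N(50, 10²), the two sample sizes and the number of repetitions are assumed values and are not taken from any data in this article.

```r
# Illustration only: the population N(mu = 50, sigma = 10) is an assumed example.
set.seed(1)
mu <- 50
sigma <- 10

# Draw `reps` samples of size n and return the sample mean of each one.
sample_means <- function(n, reps = 5000) {
  replicate(reps, mean(rnorm(n, mean = mu, sd = sigma)))
}

means_10   <- sample_means(10)
means_1000 <- sample_means(1000)

# Unbiasedness: the average of the sample means is close to mu for either n.
bias_10   <- mean(means_10) - mu
bias_1000 <- mean(means_1000) - mu

# Consistency: the spread of the sample means shrinks as n grows
# (the theoretical standard error is sigma / sqrt(n)).
se_10   <- sd(means_10)
se_1000 <- sd(means_1000)

print(c(bias_10 = bias_10, bias_1000 = bias_1000))
print(c(se_10 = se_10, se_1000 = se_1000))
```

The simulated standard errors can be compared with sigma / sqrt(n) to see how quickly the sampling distribution of the mean tightens around μ.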
Efficiency
There is a necessity of some further criterion which will enable us to choose between the estimators, with the common property of consistency. Such a criterion which is based on the variances of the sampling distribution of estimators is usually known as efficiency.
It refers to the size of the standard error of the statistic. If two statistics computed from samples of the same size are compared to decide which one is the better estimator, the statistic that has the smaller standard error (standard deviation of the sampling distribution) is selected.
If T_{1} is the most efficient estimator with variance V_{1} and T_{2} is any other estimator with variance V_{2}, then the efficiency E of T_{2} is given by:
E = V_{1} / V_{2} [∵ Efficiency and variance are inversely proportional]
Since V_{1} ≤ V_{2}, E cannot exceed 1.
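As a rough illustration of efficiency, the sketch below compares the sample mean and the sample median as estimators of the centre of a normal population. The standard normal population, n = 100 and 5000 repetitions are assumptions made for this example only.

```r
# Illustration only: standard normal population, n = 100, 5000 repetitions.
set.seed(2)
n <- 100
reps <- 5000

# Each row of the matrix is one random sample of size n.
samples <- matrix(rnorm(n * reps), nrow = reps)

# Variances of the two sampling distributions.
v_mean   <- var(rowMeans(samples))
v_median <- var(apply(samples, 1, median))

# Efficiency of the median relative to the mean: E = V1 / V2.
efficiency_median <- v_mean / v_median
print(efficiency_median)
```

For a normal population the ratio comes out well below 1 (its large-sample value is 2/π, roughly 0.64), so the median is the less efficient of the two estimators.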
Sufficiency
An estimator is said to be sufficient for a parameter, if it contains all the information in the sample regarding the parameter.
If T_{n} is an estimator of the parameter θ, based on a sample x_{1}, x_{2}, …, x_{n} of size n from a population with density f(x,θ), such that the conditional distribution of x_{1}, x_{2}, …, x_{n} given T_{n} is independent of θ, then T_{n} is a sufficient estimator for θ.
Methods of Point Estimation
So far we have been discussing the requisites of a good estimator. Now we shall briefly outline some of the important methods of obtaining such estimators. Commonly used methods are:
Method of Moments (MoM)
The basic principle is to equate population moments (i.e. the means, variances, etc. of the theoretical model) to the corresponding sample moments (i.e. the means, variances, etc. of the sample data observed) and solve for the parameter(s).
Let x_{1}, x_{2}, …, x_{n} be a random sample from any distribution f(x,θ) which has m unknown parameters θ_{1}, θ_{2}, …, θ_{m}, where m ≤ n. Then the moment estimators θ̂_{1}, θ̂_{2}, …, θ̂_{m} are obtained by equating the first m sample moments to the corresponding m population moments and then solving for θ_{1}, θ_{2}, …, θ_{m}.
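The recipe above can be sketched for a two-parameter case. The example assumes a gamma population with shape 2 and scale 3 (values chosen only for illustration), for which the mean is shape × scale and the variance is shape × scale².

```r
# Illustration only: Gamma(shape = 2, scale = 3) is an assumed population.
set.seed(10)
x <- rgamma(10000, shape = 2, scale = 3)

# First two sample moments (mean and second central moment).
m1 <- mean(x)
m2 <- mean((x - m1)^2)

# Population moments of the gamma distribution:
#   mean = shape * scale,  variance = shape * scale^2
# Equating them to m1 and m2 and solving for the parameters gives:
shape_hat <- m1^2 / m2
scale_hat <- m2 / m1

print(c(shape_hat = shape_hat, scale_hat = scale_hat))
```

With two unknown parameters, two moment equations are enough; the estimates should land close to the assumed values 2 and 3.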
Method of Maximum Likelihood Estimation (MLE)
MLE is widely regarded as the best general method of finding estimators. In particular, MLEs usually have easily determined asymptotic properties and are especially good in large-sample situations. “Asymptotic” here just means when the samples are very large.
Let x_{1}, x_{2}, …, x_{n} be a random sample from a population with density f(x,θ). The likelihood function of the observed sample, as a function of θ, is given by:
L(θ) = f(x_{1},θ) · f(x_{2},θ) · … · f(x_{n},θ) = ∏_{i=1}^{n} f(x_{i},θ)
Notice that the likelihood function is a function of the unknown parameter θ. So different values of θ would give different values for the likelihood. The maximum likelihood approach is to find the value of θ that would have been most likely to give us the particular sample we got. In other words, we need to find the value of θ that maximizes the likelihood function. In most cases, taking logs greatly simplifies the determination of the MLE θ̂. Differentiating the likelihood or log likelihood with respect to the parameter and setting the derivative to 0 gives the MLE for the parameter.
It is necessary to check, either formally or through simple logic, that the turning point is a maximum. The formal approach would be to check that the second derivative is negative.
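The procedure described above can be sketched numerically. The example assumes an exponential population with rate 2 (an illustrative choice), for which the log likelihood is n·log(rate) − rate·Σx and the closed-form MLE is 1 / x̄.

```r
# Illustration only: exponential population with an assumed true rate of 2.
set.seed(3)
x <- rexp(200, rate = 2)

# Log likelihood of the sample as a function of the unknown rate.
loglik <- function(rate) sum(dexp(x, rate = rate, log = TRUE))

# Maximize the log likelihood numerically over a plausible interval.
fit <- optimize(loglik, interval = c(0.001, 10), maximum = TRUE, tol = 1e-8)
mle_numeric <- fit$maximum

# Closed form obtained by setting the derivative n / rate - sum(x) to zero.
mle_closed <- 1 / mean(x)

# The second derivative, -n / rate^2, is negative, so this is a maximum.
second_derivative <- -length(x) / mle_closed^2

print(c(numeric = mle_numeric, closed_form = mle_closed))
```

The numerical and closed-form answers agree, which is a useful sanity check when the derivative cannot be solved by hand.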
Method of Minimum Variance
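It is also known as the Minimum Variance Unbiased Estimator (MVUE). As the name suggests, it is an estimator which is unbiased and also has the minimum variance.
If a statistic T_{n} based on a sample of size n is such that (i) T_{n} is unbiased for γ(θ), and (ii) it has the smallest variance among all unbiased estimators of γ(θ), then T_{n} is called the minimum variance unbiased estimator of γ(θ).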
Method of Least Squares
The principle of least squares is used to fit a curve of the form:
y = f(x; θ_{1}, θ_{2}, …, θ_{k})
where the θ_{i}’s are unknown parameters, to a set of n sample observations (x_{i}, y_{i}); i = 1, 2, …, n from a bivariate population. It consists of minimizing the sum of squares of residuals,
S = Σ_{i=1}^{n} [y_{i} - f(x_{i}; θ_{1}, θ_{2}, …, θ_{k})]^{2}
subject to variations in θ_{1}, θ_{2}, …, θ_{k}. The normal equations for estimating θ_{1}, θ_{2}, …, θ_{k} are given by:
∂S/∂θ_{j} = 0; j = 1, 2, …, k
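For a straight line y = θ_{1} + θ_{2}x, the normal equations reduce to the linear system (XᵀX)θ = Xᵀy. The sketch below solves that system directly and compares it with R's built-in lm(); the simulated data (intercept 3, slope 2) are assumed purely for illustration.

```r
# Illustration only: simulated data with an assumed intercept 3 and slope 2.
set.seed(4)
x <- 1:20
y <- 3 + 2 * x + rnorm(20)

# Design matrix with a column of ones for the intercept.
X <- cbind(1, x)

# Solve the normal equations (X'X) theta = X'y.
theta_normal <- solve(crossprod(X), crossprod(X, y))

# The same fit using lm(), which also minimizes the residual sum of squares.
theta_lm <- coef(lm(y ~ x))

print(cbind(normal_equations = as.numeric(theta_normal),
            lm = as.numeric(theta_lm)))
```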
Confidence Intervals and Confidence Limits
A confidence interval provides an ‘interval estimate’ for an unknown population parameter. It is designed to contain the parameter’s value with some stated probability. The width of the interval provides a measure of the precision of the estimator involved.
Let x_{i}, i = 1, 2, …, n be a random sample of size n from f(x,θ). If T_{1}(x) and T_{2}(x) are two statistics such that T_{1}(x) ≤ T_{2}(x) and
P(T_{1}(x) < θ < T_{2}(x)) = 1 – α
where α is the level of significance, then the random interval (T_{1}(x), T_{2}(x)) is called a 100(1-α)% confidence interval for θ.
Here, T_{1} is called the lower confidence limit and T_{2} is called the upper confidence limit. (1-α) is called the confidence coefficient.
Usually, the value of α is taken as 5% in the testing of hypotheses. Thus, if α = 5%, then about 95% of the intervals constructed in this way from repeated samples would contain the true value of the parameter.
Interval estimate = Point estimate ± Margin of Error
The margin of error is the amount of random sampling error allowed for in the estimate. In other words, it is the range of values above and below the sample statistic.
Margin of Error = Critical Value * Standard Error of the statistic
Here, a critical value is the point (or points) on the scale of the test statistic beyond which we reject the null hypothesis, and is derived from the level of significance α of a particular test into consideration.
Confidence intervals are not unique. In general, they should be obtained via the sampling distribution of a good estimator, in particular, the MLE. Even then there is a choice between one-sided and two-sided intervals and between equal-tailed and shortest length intervals although these are often the same.
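The relation "point estimate ± critical value × standard error" can be sketched for the mean of a normal sample. The income figures below are simulated (an assumed mean of 25000 and standard deviation of 4000), and the critical value is taken from the t distribution because the population variance is treated as unknown.

```r
# Illustration only: simulated incomes with assumed mean 25000 and sd 4000.
set.seed(5)
x <- rnorm(30, mean = 25000, sd = 4000)
alpha <- 0.05

point_estimate <- mean(x)
standard_error <- sd(x) / sqrt(length(x))

# Two-sided critical value from the t distribution with n - 1 degrees of freedom.
critical_value <- qt(1 - alpha / 2, df = length(x) - 1)
margin_of_error <- critical_value * standard_error

ci <- c(lower = point_estimate - margin_of_error,
        upper = point_estimate + margin_of_error)
print(ci)

# Cross-check against the interval reported by t.test().
print(t.test(x, conf.level = 1 - alpha)$conf.int)
```

This is the equal-tailed two-sided interval; a one-sided interval would put all of α in a single tail and use qt(1 - alpha, ...) instead.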
So, we have learned what estimation is, i.e., the process of assigning a numerical value to an unknown population parameter. To be a good estimator of the population parameter, an estimator should have the following characteristics: unbiasedness, consistency, efficiency and sufficiency.
There are different methods of finding estimates, such as the method of moments, MLE, minimum variance and least squares. Of these methods, MLE is considered the best general method of finding estimates.
Also, there are two types of estimation: point and interval estimation. Point estimation provides a single value as the estimate, whereas interval estimation provides a confidence interval which is likely to include the unknown population parameter.
Hence, you now have a basic understanding of the theory of estimation.
The post Lasso And Elastic Net Regression appeared first on StepUp Analytics.
For example, ridge regression shrinks the coefficients towards zero, but it does not set any of them exactly to zero, so it does not perform feature selection. So in this article, I introduce two further methods, lasso and elastic net regression, which deal with these issues very well and perform both variable selection and regularization.
Lasso (least absolute shrinkage and selection operator) is a regression analysis method that uses L1 regularization: like ridge regression, it penalizes the size of the regression coefficients, but it uses their absolute values in the penalty function instead of their squares. In addition, it is capable of reducing the variability and improving the accuracy of linear regression models. Because of the absolute-value penalty, some of the parameter estimates turn out to be exactly zero. Hence, much like the best subset selection method, lasso performs variable selection among the given variables.
The tuning parameter lambda is chosen by cross-validation. When lambda is small, the result is essentially the ordinary least squares (OLS) estimates. As lambda increases, shrinkage occurs and the coefficients of the less important features shrink to zero, thus removing some features altogether.
So, a major advantage of lasso is that it is a combination of both shrinkage and selection of variables. In cases of a very large number of features, lasso allows us to efficiently find the sparse model that involves a small subset of the features.
The cost function is given below, where the last term, λ Σ |β_{j}|, is the L1 regularization.
Σ_{i=1}^{n} (y_{i} - β_{0} - Σ_{j=1}^{p} β_{j} x_{ij})^{2} + λ Σ_{j=1}^{p} |β_{j}|
The method was proposed by Professor Robert Tibshirani from the University of Toronto, Canada. He said, “The Lasso minimizes the residual sum of squares subject to the sum of the absolute value of the coefficients being less than a constant. Because of the nature of this constraint, it tends to produce some coefficients that are exactly 0 and hence gives interpretable models”.
In his article titled Regression Shrinkage and Selection via the Lasso, Tibshirani tells us about this technique with respect to various other statistical models such as subset selection and ridge regression. He goes on to say that “lasso can even be extended to generalized regression models and tree-based models. In fact, this technique provides possibilities for even conducting statistical estimations.”
Traditional methods such as cross-validation and stepwise regression, used to handle overfitting and perform feature selection, work well with a small set of features, but penalized regression techniques are a great alternative when we are dealing with a large set of features.
Lasso was originally formulated for least squares models and this simple case reveals a substantial amount about the behavior of the estimator, including its relationship to ridge regression and best subset selection and the connections between lasso coefficient estimates and so-called soft thresholding. It also reveals that (like standard linear regression) the coefficient estimates need not be unique if covariates are collinear. [Source: Wikipedia]
Though originally defined for least squares, lasso regularization is easily extended to a wide variety of statistical models including generalized linear models, generalized estimating equations, proportional hazards models, and M-estimators, in a straightforward fashion. Lasso’s ability to perform subset selection relies on the form of the constraint and has a variety of interpretations including in terms of geometry, Bayesian statistics, and convex analysis. [Source: Wikipedia]
As discussed above, lasso can set coefficients to zero, while ridge regression, which appears superficially similar, cannot. This is due to the difference in the shape of the constraint boundaries in the two cases.
From the figure, one can see that the constraint region of lasso regression is a rotated square whose corners lie on the axes, while the constraint region of ridge regression is a circle (a sphere in higher dimensions), which is rotationally invariant and, therefore, has no corners. A convex object that lies tangent to the boundary is likely to encounter a corner (or a higher-dimensional equivalent) of a hypercube, for which some components of β are identically zero. In the case of a sphere, the points on the boundary for which some of the components of β are zero are not distinguished from the others, so the convex object is no more likely to contact a point at which some components of β are zero than one for which none of them are.
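The difference can also be seen numerically in a simplified setting. The sketch below assumes orthonormal predictors and the objectives (1/2)·RSS + λ·Σ|β_{j}| for lasso and (1/2)·RSS + (λ/2)·Σβ_{j}² for ridge; under these assumptions the lasso solution is the soft-thresholded OLS estimate, while the ridge solution is a proportional shrinkage. The OLS coefficients and λ are made-up values.

```r
# Assumed setting: orthonormal predictors, so each coefficient can be
# shrunk separately. b_ols and lambda are made-up illustrative values.
soft_threshold <- function(b, lambda) sign(b) * pmax(abs(b) - lambda, 0)
ridge_shrink   <- function(b, lambda) b / (1 + lambda)

b_ols  <- c(3, -1.5, 0.4, -0.2)
lambda <- 0.5

lasso_coef <- soft_threshold(b_ols, lambda)  # small coefficients become exactly 0
ridge_coef <- ridge_shrink(b_ols, lambda)    # all coefficients shrink, none is 0

print(rbind(ols = b_ols, lasso = lasso_coef, ridge = ridge_coef))
```

Coefficients whose absolute value is below λ are set exactly to zero by the lasso rule, whereas the ridge rule only scales them down.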
In the case of ML, both ridge regression and Lasso find their respective advantages. Both these techniques tackle overfitting, which is generally present in a realistic statistical model. It all depends on the computing power and data available to perform these techniques on statistical software. Ridge regression is faster compared to lasso but then again lasso has the advantage of completely reducing unnecessary parameters in the model.
One important limitation of lasso regression is that, for grouped variables, the lasso fails to do grouped selection. It tends to select one variable from a group and ignore the others.
Elastic-net is a mix of both L1 and L2 regularizations. A penalty is applied to the sum of the absolute values and to the sum of the squared values of the coefficients. In the form used by the glmnet package it is:
λ [ (1 - α)/2 · Σ_{j=1}^{p} β_{j}^{2} + α · Σ_{j=1}^{p} |β_{j}| ]
Lambda is a shared penalization parameter, while alpha sets the ratio between L1 and L2 regularization in the elastic net regularization. Hence, we expect a hybrid behavior between L1 and L2 regularization. Though coefficients are cut, the cut is less abrupt than the cut with lasso penalization alone. The alpha hyper-parameter is between 0 and 1 and controls how much L2 or L1 penalization is used (alpha = 0 gives ridge and alpha = 1 gives lasso). The usual approach to optimizing the lambda hyper-parameter is through cross-validation (by minimizing the cross-validated mean squared prediction error), but in elastic net regression the optimal lambda hyper-parameter also depends upon the alpha hyper-parameter.
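To make the roles of lambda and alpha concrete, here is a small helper that evaluates the elastic net penalty for a given coefficient vector. It assumes the glmnet parameterization λ[(1 - α)/2 · Σβ² + α · Σ|β|]; the function name enet_penalty and the example coefficients are hypothetical.

```r
# Hypothetical helper: elastic net penalty in the glmnet parameterization.
enet_penalty <- function(beta, lambda, alpha) {
  lambda * ((1 - alpha) / 2 * sum(beta^2) + alpha * sum(abs(beta)))
}

beta <- c(1, -2)  # made-up coefficients

penalty_lasso <- enet_penalty(beta, lambda = 2, alpha = 1)    # pure L1
penalty_ridge <- enet_penalty(beta, lambda = 2, alpha = 0)    # pure L2
penalty_mixed <- enet_penalty(beta, lambda = 2, alpha = 0.5)  # half and half

print(c(lasso = penalty_lasso, ridge = penalty_ridge, mixed = penalty_mixed))
```

Lambda scales the whole penalty, while alpha only moves weight between the L1 and L2 parts, which is why the best lambda changes when alpha changes.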
This article takes a cross-validated approach that uses the grid search to find the optimal alpha hyper-parameter while also optimizing the lambda hyper-parameter for the data set.
In my previous article, I used the glmnet package to show the ridge regression in R. In this article, I have used the caret package for better comparison between the techniques.
Loading the MASS package to get the data set
library (MASS)
data <- Boston
Splitting the dataset in training and testing data
train <- data [1:400,]
test <- data [401:506,]
Setting up a grid range of lambda values
lambda <- 10^seq (-3, 3, length = 100)
Loading the required libraries
library (tidyverse)
library (caret)
library (Metrics)
We fit the ridge regression model on the training data using k fold cross validation
set.seed (123)
ridge <- train (
  medv ~., data = train, method = "glmnet",
  trControl = trainControl ("cv", number = 10),
  tuneGrid = expand.grid (alpha = 0, lambda = lambda))
plot (ridge$finalModel , xlab = "L2 Norm" )
Displaying the regression coefficients below
coef (ridge$finalModel, ridge$bestTune$lambda)
We save the predicted values of the response variable in a vector prediction_ridge
prediction_ridge <- predict (ridge, test)
Saving the RMSE, SSE and MAPE values in Accuracy_ridge
Accuracy_ridge <- data.frame (
RMSE = RMSE (prediction_ridge, test$medv),
SSE = sse (test$medv, prediction_ridge),
MAPE = mape (test$medv, prediction_ridge))
The only difference between the R code used for ridge and lasso regression is that for lasso regression, we need to specify the argument alpha = 1 instead of alpha = 0 (for ridge regression).
Now executing the Lasso Regression
set.seed (123)
lasso <- train (
  medv ~., data = train, method = "glmnet",
  trControl = trainControl ("cv", number = 10),
  tuneGrid = expand.grid (alpha = 1, lambda = lambda))
plot (lasso$finalModel , xlab = "L1 Norm" )
If we look at the plot, the x-axis is the maximum permissible value the L1 norm can take. So when we have a small L1 norm, we have a lot of regularization. Therefore, an L1 norm of zero gives an empty model, and as you increase the L1 norm, variables will “enter” the model as their coefficients take non-zero values.
Displaying the regression coefficients below
coef (lasso$finalModel, lasso$bestTune$lambda)
We save the predicted values of the response variable in a vector prediction_lasso
prediction_lasso <- predict (lasso, test)
Saving the RMSE, SSE and MAPE values in Accuracy_lasso
Accuracy_lasso <-data.frame (
RMSE = RMSE (prediction_lasso, test$medv),
SSE = sse (test$medv, prediction_lasso),
MAPE = mape (test$medv, prediction_lasso))
The elastic net regression models do not require us to mention a specific value of lambda and alpha. We use caret package to automatically select the best tuning parameters alpha and lambda. The caret package tests a range of possible alpha and lambda values, and then selects the best values for lambda and alpha, resulting in a final model that is an elastic net model.
Now executing the Elastic Net Regression
set.seed (123)
elasticnet <- train (
  medv ~., data = train, method = "glmnet",
  trControl = trainControl ("cv", number = 10), tuneLength = 10)
plot (elasticnet$finalModel , xlab = "Elasticnet Regularization")
Displaying the regression coefficients below
coef (elasticnet$finalModel, elasticnet$bestTune$lambda)
We save the predicted values of the response variable in a vector predictions_elasticnet
predictions_elasticnet <- predict (elasticnet, test)
Saving the RMSE, SSE and MAPE values in Accuracy_elasticnet
Accuracy_elasticnet <-data.frame (
RMSE = RMSE (predictions_elasticnet, test$medv),
SSE = sse (test$medv, predictions_elasticnet),
MAPE = mape (test$medv, predictions_elasticnet))
We finally bring the RMSE, SSE and MAPE values of the three regression techniques in a dataframe Accuracy.
Accuracy <- rbind.data.frame (Accuracy_ridge = Accuracy_ridge, Accuracy_lasso = Accuracy_lasso, Accuracy_elasticnet = Accuracy_elasticnet)
Accuracy
Here both lasso and elastic net regression perform feature selection in addition to shrinkage. On the other hand, the lasso achieves poor results in accuracy. This is because there is a high degree of collinearity in the features. Further, when the number of predictors exceeds the number of observations, the lasso can select at most as many predictors as there are observations, while ridge regression can handle this case.
From our example we see that the penalized regression models performed much better than the multiple linear regression models. In general, lasso regression performs better than ridge in scenarios with many noise predictors and worse in the presence of correlated predictors. Elastic net is a hybrid of the two and performs well in all these scenarios.
The post Ridge Regression and Its Application appeared first on StepUp Analytics.
The OLS function works quite well when assumptions such as a linear relationship, no autocorrelation, homoscedasticity, more observations than variables, normally distributed residuals and little or no multicollinearity are fulfilled.
But in many real-life scenarios, these assumptions are violated. In those cases, we need to find alternative approaches to provide solutions. Penalized/Regularized regression techniques such as ridge, lasso and elastic net regression work very well in these cases. In this article, I have tried to explain the ridge regression technique which is a way of creating regression models when the number of predictor variables of a dataset is more than the number of observations or when the data suffers from multicollinearity (independent variables are highly correlated).
Regularization methods provide a means to control our regression coefficients, which can help to reduce the variance and decrease the sampling error. Ridge regression belongs to a class of regression tools that use L2 regularization. L2 regularization adds a penalty term to the OLS function that makes the parameter estimates more stable. The L2 penalty, which equals the sum of the squared coefficients multiplied by a constant λ, is given by,
λ Σ_{j=1}^{p} β_{j}^{2}
And the function to be minimized is given by,
Σ_{i=1}^{n} (y_{i} - β_{0} - Σ_{j=1}^{p} β_{j} x_{ij})^{2} + λ Σ_{j=1}^{p} β_{j}^{2}
The amount of the penalty can be fine-tuned using the constant lambda (λ). Selecting a good value for λ is critical. When λ = 0, the penalty term has no effect and ridge regression produces the classical least squares coefficients. As λ → ∞, the impact of the penalty grows and all coefficients are shrunk towards zero. The ideal penalty is therefore somewhere between 0 and ∞.
In this way, ridge regression puts constraints on the magnitude of the coefficients, which helps to reduce their magnitude and fluctuations and progressively shrinks them towards zero. This helps to reduce the variance of the model. The outcome is typically a model that fits the training data less well than OLS but generalizes better, because it is less sensitive to particular features of the training data such as outliers.
Note that, in contrast to the ordinary least square regression, ridge regression is highly affected by the scale of the predictors. Therefore, it is better to standardize (i.e., scale) the predictors before applying the ridge regression so that all the predictors are on the same scale.
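The effect of λ can be sketched with the closed-form ridge solution (XᵀX + λI)⁻¹Xᵀy on standardized predictors. This is a minimal base-R illustration on simulated data, not the glmnet implementation (glmnet scales λ differently); the helper ridge_coef and all data values are hypothetical.

```r
# Illustration only: simulated predictors, two nearly collinear and one on a
# much larger scale. ridge_coef() is a hypothetical helper, not a glmnet call.
set.seed(6)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)
x3 <- rnorm(n, sd = 50)
y  <- 2 * x1 + 0.05 * x3 + rnorm(n)

X   <- scale(cbind(x1, x2, x3))  # standardize so the penalty treats predictors equally
y_c <- y - mean(y)               # center the response so no intercept is needed

# Closed-form ridge estimate: solve (X'X + lambda * I) beta = X'y.
ridge_coef <- function(X, y, lambda) {
  p <- ncol(X)
  drop(solve(crossprod(X) + lambda * diag(p), crossprod(X, y)))
}

beta_ols <- ridge_coef(X, y_c, 0)    # lambda = 0 reproduces least squares
beta_10  <- ridge_coef(X, y_c, 10)   # moderate penalty
beta_big <- ridge_coef(X, y_c, 1e6)  # very large penalty

print(rbind(ols = beta_ols, lambda_10 = beta_10, lambda_1e6 = beta_big))
```

With λ = 0 the result matches least squares, and the overall size of the coefficient vector shrinks towards zero as λ grows.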
Advantages and Disadvantages Of Ridge Regression
Here I have given the link of a website below, where you can get the mathematical and geometric interpretation of Ridge regression More Info
Loading the MASS package to get the data set
library (MASS)
data <- Boston
Splitting the dataset in training and testing data
train <- data [1:400,]
test <- data [401:506,]
Loading libraries required for Ridge regression
library(tidyverse)
library(caret)
library (glmnet)
library (MASS)
library (Metrics)
We need to know the glmnet package
For more details about this package: More Info
There is another function lm.ridge () in MASS package which can also be used. Please see the link below for more details about the function. More Info [Page Number: 79]
Preparing the training data set for training the regression model
x.train <- model.matrix (medv~., train) [,-1]
We save the response variable housing price in a vector y.train
y.train <- train$medv
We need to find the best value for lambda for the given data set with the function cv.glmnet()
set.seed (123)
cv <- cv.glmnet (x.train, y.train, alpha = 0)
plot (cv)
Displaying the best lambda value
cv$lambda.min
We fit the final model on the training data by adding the best lambda value.
model_ridge <- glmnet (x.train, y.train, alpha = 0, lambda = cv$lambda.min)
Displaying the regression coefficients below
coef (model_ridge)
Preparing the test data set to be used as a data matrix and discarding the intercept for predicting the values of the response variable.
x.test <- model.matrix (medv ~., test)[,-1]
We save the predicted values of the response variable Housing price in a vector prediction_ridge
prediction_ridge <- as.vector(predict(model_ridge,x.test))
Saving the RMSE, SSE and MAPE value of the predicted values of the test data set in Accuracy_ridge
Accuracy_ridge <- data.frame(
RMSE = RMSE (prediction_ridge, test$medv),
SSE = sse (test$medv, prediction_ridge),
MAPE = mape (test$medv, prediction_ridge))
Now we fit the multiple linear regression model on the training data set
names (train)
model_lm <- lm (medv ~ crim+zn+indus+chas+nox+rm+age+dis+rad+tax+ptratio+black+lstat, data=train)
From the summary of the model we can find the p value of the individual predictor variables and decide which variables to be kept in the model
summary (model_lm)
model_lm <- lm (medv ~ crim+zn+nox+rm+dis+rad+tax+ptratio+lstat, data=train)
summary (model_lm)
We need to check the multicollinearity with the help of the function vif () from car package.
library (car)
vif (model_lm)
We also need to exclude the predictor variables with high vif values to avoid multicollinearity. Though we may allow multicollinearity up to a certain level.
model_lm <- lm (medv ~ crim+zn+nox+rm+dis+rad+ptratio+lstat, data=train)
Below I have mentioned the summary of the updated final model with all the significant variables and the vif values of the variables. The values of the R square and adjusted R square are pretty close, which also shows that the present predictor variables in the model are pretty significant.
summary (model_lm)
vif (model_lm)
We compute the prediction of the test data set with multiple linear regression which was trained using the training dataset
prediction_lm <- predict (model_lm, test [,-14])
We find out the RMSE, SSE, and MAPE of the regression model and save them in Accuracy_lm
Accuracy_lm <-data.frame (
RMSE = RMSE (prediction_lm, test$medv),
SSE = sse (test$medv, prediction_lm),
MAPE = mape (test$medv, prediction_lm))
We save the RMSE, SSE and MAPE values of both linear and ridge regression models in Accuracy.
Accuracy <- rbind.data.frame (Accuracy_ridge = Accuracy_ridge, Accuracy_lm = Accuracy_lm)
Accuracy
From the Accuracy table above, it is clear that even though the least squares estimates are unbiased, the accuracy of the model is compromised. By adding a degree of bias to the regression estimates, ridge regression reduces their standard errors. But with other models like lasso and elastic net regression, we have a possibility of getting a better accuracy value.
This is because complicated models tend to overfit the training data. In my next article, I will introduce you to lasso and elastic net regression and explain the comparative advantage of using these models over multiple linear or ridge regression models.
The post Queueing Theory and Its Application appeared first on StepUp Analytics.
In this article, we will learn about Queueing Theory and its practical applications. We all have experienced the annoyance of having to wait in a queue. We wait in line at supermarkets to check out, we wait in line in banks and post offices and we wait in line at fast food restaurants. But we as customers do not like waiting. And the managers of these establishments also don’t like their customers to wait as it may cost them business.
So the first question that arises is that “Why is there waiting?”
To which the answer is that there is more demand for service than there is available facility for that service.
And “Why is this so?”
For which there could be a number of reasons such as, shortage of available servers, limitation of space, economic limitations etc.
These limitations can be removed with the expenditure of capital. And to know how much service should then be made available, one needs to know:
Queuing theory attempts to answer these questions through detailed mathematical analysis. The ultimate goal is to achieve an economic balance between cost of service and the cost associated with the waiting for that service.
A queuing system can be described as one in which customers arrive for service, wait for service if it is not immediately available and, having waited and been served, leave the system.
The term ‘customer’ is used in general sense and does not imply necessarily a human customer. For example, a customer can be a computer program waiting to be run or an airplane waiting in line to take off.
Queuing Theory was developed to provide models to predict the behavior of systems that attempt to provide service for randomly arising demands.
For defining the characteristics we’ll first explain the following terms:
In the context of above, there are six basic characteristics of a queuing process that provide an adequate description of a queuing system
1. Arrival Pattern: In general situations, the process of arrivals is random (stochastic). It is, therefore, necessary to know the probability distribution of the times between successive customer arrivals (inter-arrival times). Also, customers can arrive simultaneously (batch or bulk arrival), and if so, the probability distribution describing the size of the batch is also needed.
It is also necessary to know the reaction of the customer upon entering the system.
2. Service Patterns: We need to describe a probability distribution for the sequence of customer service time. Service may be single or in batch, there are many situations where customers may be served simultaneously by the same server, such as people boarding a train, sightseers on a guided tour. The situation in which service depends on the number of customers waiting is referred to as State-Dependent service.
3. Queue Discipline: It refers to the manner in which the customers are selected for service when a queue has formed. Most common disciplines are:
4. System Capacity: In some queuing process there is a physical limitation to the amount of waiting room so that when the line reaches a certain length, no further customers are allowed to enter until space becomes available as a result of service completion. This situation is referred to as finite queuing situation.
5. The number of Service Channels: By this, we are typically referring to the number of parallel service stations which can serve customers simultaneously. It is assumed that service mechanisms of parallel channels operate independently of each other.
6. Stages of Service: A queuing system may have only a single stage of service, or it may have several stages. An example of a multistage queuing system would be a physical examination procedure, where each patient must proceed through several stages, such as medical history, blood tests etc.
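A queueing process is described by a series of symbols and slashes such as A/B/X/Y/Z, where A indicates the inter-arrival time distribution, B the service time distribution, X the number of parallel service channels, Y the restriction on system capacity and Z the queue discipline.
Some standard symbols for the arrival and service distributions are M (exponential), D (deterministic), E_{k} (Erlang type k) and G (general).
This notation is referred to as Kendall’s notation.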
Generally, there are three types of system response of interest:
Since most queuing systems are stochastic, these measures are often random variables and their probability distribution is desired.
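There are two types of customer waiting times: the time a customer spends waiting in the queue (T_{q}) and the total time a customer spends in the system (T), which is the time in the queue plus the service time (S).
The task of a queuing analyst is generally one of two things: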
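N(t) = number of customers in the system at time t = N_{s}(t) + N_{q}(t), where N_{s}(t) is the number in service and N_{q}(t) is the number waiting in the queue.
p_{n}(t) = P [N(t) = n], the probability of exactly n customers in the system at time t.
λ = mean arrival rate of customers; (1/λ) is the expected inter-arrival time.
µ_{n} = cµ when n ≥ c (all c servers are busy), where µ is the mean service rate of a single server; 1/µ is the expected service time.
p_{n} = P [N = n] = steady-state probability of exactly n customers in the queuing system,
where N is the random variable giving the number of customers in the system in steady state.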
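One of the most powerful relations in queuing theory was given by John D.C. Little. His formula relates the steady-state mean system size to the steady-state average customer waiting time.
Little’s formula is: L = λW and L_{q} = λW_{q}, where L = E (N) is the expected number of customers in the system, L_{q} is the expected number in the queue, W = E (T) is the expected time in the system and W_{q} = E (T_{q}) is the expected time in the queue.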
Also, since E (T) = E (T_{q} + S) = E (T_{q}) + E (S) or
W = W_{q} + (1/µ)
Hence it is necessary to find only one of the four expected values.
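As a small illustration of why one expected value is enough, the sketch below derives the other three from W_{q} using Little's formula and W = W_{q} + (1/µ). The function name little_measures and the numbers passed to it are made up for the example.

```r
# Derive L, L_q and W from W_q using Little's formula.
# lambda: arrival rate, mu: service rate, Wq: expected waiting time in the queue.
little_measures <- function(lambda, mu, Wq) {
  W  <- Wq + 1 / mu   # W = W_q + 1/mu
  L  <- lambda * W    # Little's formula for the whole system
  Lq <- lambda * Wq   # Little's formula for the queue alone
  c(L = L, Lq = Lq, W = W, Wq = Wq)
}

# Illustrative values only: 2 arrivals and 3 service completions per unit time.
measures <- little_measures(lambda = 2, mu = 3, Wq = 2 / 3)
print(measures)
```

The same function works in the other direction with minor changes, for example starting from L and dividing by λ to get W.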
The post Queueing Theory and Its Application appeared first on StepUp Analytics.
The post AB Testing With R: An Example Of Marketing Campaign appeared first on StepUp Analytics.
Even e-commerce companies in India like Amazon and Flipkart have a lot of questions about their websites, application designs, and marketing strategies. These questions can be answered by conducting an A/B test.
We use A/B testing to compare two versions of a product (A and B) offered to similar customers, to see which version should sell more in the market, or to compare two groups of customers (A and B) offered similar products, to see which group we should target.
For example for a website:
Null Hypothesis: Assumption that there is no difference between the conversion rates for products A and B
Alternative Hypothesis: There is a difference between the conversion rates for products A and B
To reject the Null Hypothesis we need a p-value that is lower than the significance level, e.g. p < 0.05 at the 5% level.
install.packages("pwr")
library(pwr)
# 2-sample test for equality of proportions
prop.test(c(225, 250), c(3450, 3000))
The p-value is less than 0.05, so we can reject the hypothesis that conversion rates are equal.
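To make the test less of a black box, here is a minimal sketch that computes the pooled two-proportion z statistic by hand and compares it with prop.test() run without the continuity correction (correct = FALSE is a choice made for this comparison; the call above uses the default correction). The variable names are illustrative.

```r
# Conversions and visitors for the two variants (numbers from the example above)
conversions <- c(A = 225, B = 250)
visitors    <- c(A = 3450, B = 3000)

rates  <- conversions / visitors             # observed conversion rates
pooled <- sum(conversions) / sum(visitors)   # pooled rate under the null hypothesis
se     <- sqrt(pooled * (1 - pooled) * sum(1 / visitors))

z       <- unname((rates["B"] - rates["A"]) / se)
p_value <- 2 * pnorm(-abs(z))                # two-sided p-value

# Same test via prop.test(), without the continuity correction
test <- prop.test(conversions, visitors, correct = FALSE)

print(rates)
print(c(z = z, p_value = p_value, prop_test_p = test$p.value))
```

Without the correction, the chi-squared statistic reported by prop.test() is the square of this z statistic, so both routes give the same p-value.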
But one cannot directly conclude that A and B have dissimilar conversion rates or vice versa. The true underlying behavior is not known, as we are testing the hypothesis by carrying out the experiment over a sample.
The Disadvantages of Using A/B Test:
Bayesian statistics in A/B testing is based on prior knowledge from similar past experiments together with the present data. The past knowledge, known as the prior (or prior probability distribution), is combined with the current experiment's data to draw a conclusion about the test at hand.
In this method, we model the metric for each variant. We have prior knowledge about the conversion rate for A which has a certain range of values based on the historical data. After observing data from both variants, we estimate the most likely values or the new evidence for each variant.
Now we need to know:
What is Posterior Probability Distribution?
Posterior probability is the probability of an event happening after all the background information about the event has been taken into account. It can be seen as an adjustment of the prior probability:
Posterior probability = prior probability updated with the new evidence (called the likelihood). For distributions, the Posterior Distribution is proportional to the Prior Distribution multiplied by the Likelihood Function (“new evidence”).
By calculating this posterior distribution for each variant, we can express the uncertainty about our beliefs through probability statements.
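For a conversion rate with a Beta prior and Bernoulli data, the posterior is again a Beta distribution, so the update can be sketched in a few lines of base R. This is only an illustration of the idea, using the counts from the earlier prop.test example and a flat Beta(1, 1) prior chosen for the sketch; it is not claimed to be what the bayesAB package runs internally.

```r
set.seed(123)

# Flat Beta(1, 1) prior (an assumption made for this sketch)
prior_alpha <- 1
prior_beta  <- 1

# Observed conversions and visitors for each variant
conv_A <- 225; n_A <- 3450
conv_B <- 250; n_B <- 3000

# Conjugate update: add successes to alpha and failures to beta
post_A <- c(alpha = prior_alpha + conv_A, beta = prior_beta + n_A - conv_A)
post_B <- c(alpha = prior_alpha + conv_B, beta = prior_beta + n_B - conv_B)

# Draw from both posteriors to make probability statements
draws_A <- rbeta(1e5, post_A[["alpha"]], post_A[["beta"]])
draws_B <- rbeta(1e5, post_B[["alpha"]], post_B[["beta"]])

prob_B_better <- mean(draws_B > draws_A)            # estimate of P(B > A)
cred_int_A    <- quantile(draws_A, c(0.025, 0.975)) # 95% credible interval for A

print(prob_B_better)
print(cred_int_A)
```

Statements such as "the probability that B converts better than A" come directly from the posterior draws, which is the kind of output the Bayesian test summarises.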
install.packages("bayesAB")
library(bayesAB)
The CRAN page of the package bayesAB documents its parameters and functions.
Using the previous example
library(bayesAB)
# Simulate one Bernoulli outcome (converted or not) per visitor
A_binom <- rbinom(3450, 1, 0.065)
B_binom <- rbinom(3000, 1, 0.083)
About the rbinom function: rbinom(n, size, prob), where
n = number of observations
size = number of trials
prob = probability of success on each trial
We choose the alpha and beta level from the prior knowledge we had about parameters. Here I have shown the test with two levels of the values. We generally use trial and error method to get the distribution to look like our imagined prior distribution. The peak should be centered over our expected mean based on previous experiments.
plotBeta(1, 1)
plotBeta(100, 200) ## more specific range of p
AB1 <- bayesTest(A_binom, B_binom, priors = c('alpha' = 1, 'beta' = 1), distribution = 'bernoulli')
Saving the outputs of the test in AB2
AB2 <- bayesTest (A_binom, B_binom, priors = c ('alpha' = 100,'beta' = 200), distribution = 'bernoulli')
Here I have checked the AB2 test with an alpha and beta value of 100 and 200 respectively. You can also check the plots and results for AB1.
Print tells us the inputs we have made and the summary statistics of the data.
print (AB2)
summary (AB2)
The summary gives the credible interval. Bayesian intervals treat their bounds as fixed and the estimated parameter as a random variable, whereas frequentist confidence intervals treat their bounds as random variables and the parameter as a fixed value.
It also shows that P(A > B) is about 0.00068%, so B is much better than A. The posterior expected loss for choosing B over A is also low.
plot (AB2)
The means are quite separate, and there is only minimal overlap between the distributions. The credible interval highlights this overlap region. To quantify the findings we calculate the probability of one variation beating another, i.e. if we randomly draw a sample from Product A and from Product B, what are the chances that the sample from B would have a higher conversion rate than that of A.
So, from the diagrams and the summary of the test we can easily solve the problems which we had faced earlier while doing a simple prop.test.
Similarly, we can also try the test for other distributions such as Poisson, normal, exponential, etc. and check the results for them. Then we can combine the results of the tests and find an overall credible interval and a percentage of A over B or vice versa.
A/B test approaches are centered on hypothesis tests used with a point estimate (probability of rejecting the null) of a hard-to-interpret value. Oftentimes, the statistician or data scientist laying down the groundwork for the A/B test will have to do a power test to determine sample size. This quickly gets messy in terms of interpretability. More importantly, it is simply not as robust as Bayesian A/B testing and it does not have the ability to inspect an entire distribution over a parameter.
Bayesian statistics is simply more powerful and informative than a normal A/B test. While frequentist A/B testing requires the length of the test to be defined in advance, Bayesian testing does not. It can calculate the potential dangers of ending the test (the loss value) at any point, and gives a constantly updated probability of either variant being better and by how much. Ending the test early can be disastrous for frequentist A/B testing. A Bayesian approach, therefore, provides us with much greater flexibility during the experiment.
There is no agreed method for choosing a prior, and it requires skill to translate subjective prior beliefs into a mathematically formulated prior. If not done correctly, it could lead to misleading results. The posterior distribution can be heavily influenced by the selection of the prior, and the selection of the prior is a subjective process. Moreover, Bayesian statistics require a high level of computational resources, particularly in models with a large number of parameters.
The main advantage of the Bayesian approach is the ability to include historical data and to select a prior distribution. The main disadvantage with this approach is the subjective nature of the selection process for the prior.
The post Obtaining A Critical Region And p-Value appeared first on StepUp Analytics.
In the case of the two-tailed test, there are two critical regions (as shown in the above graph). We use the two-tailed test when we are interested in whether the values are different, i.e. not equal. If the confidence level is (1-α), then each of the two critical regions is of size α/2. The following hypothesis is an example of a two-tailed test.
H_{0}: µ = µ_{0}, H_{1}: µ ≠ µ_{0}
One-tailed tests are used when we are interested only in the extreme values that are greater than or less than a comparative value (say µ_{0}). In the case of one-tailed tests, there is only one critical region.
One-tailed tests are of two types-
Hypothesis (left-tailed test)-
H_{0}: µ = µ_{0}, H_{1}: µ < µ_{0}
Hypothesis (right-tailed test)-
H_{0}: µ = µ_{0}, H_{1}: µ > µ_{0}
In the case of a one-tailed test, the critical region is of size α (unlike α/2 in the two-tailed case).
Now, in order to obtain the critical value, we must know the type of hypothesis, the distribution the test follows, the percentage level at which we are working and lastly whether the test is two-tailed or one-tailed (right or left tailed). We have discussed all these terms above, so obtaining the value beyond which the critical region lies is now straightforward.
Step 1: Check the null and the alternative hypothesis.
Step 2: Take note of the distribution the test follows.
Step 3: Calculate the degrees of freedom, if any.
Step 4: Open the tables and look up for the distribution.
Step 5: For a two-tailed test at, say, the 95% level of a Normal distribution, look up the value for 2.5% (α/2). For a one-tailed test, look up the value for 5% (α) and attach a negative sign if the test is left-tailed.
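Instead of printed tables, the same lookup can be done with R's quantile functions. A minimal sketch for the Normal distribution at the 95% level (α = 0.05 is the example level used in Step 5; the variable names are illustrative):

```r
alpha <- 0.05

# Two-tailed test: alpha/2 in each tail, critical values are +/- z_two_tailed
z_two_tailed <- qnorm(1 - alpha / 2)   # about 1.96

# One-tailed tests: the whole alpha sits in a single tail
z_right <- qnorm(1 - alpha)            # about 1.645
z_left  <- qnorm(alpha)                # about -1.645 (same value, negative sign)

print(c(two_tailed = z_two_tailed, right = z_right, left = z_left))
```

Because the Normal distribution is symmetric, the left-tailed value is just the right-tailed value with the sign flipped, as described in Step 5.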
In the case of normal distribution, we do not require to calculate the degree of freedom, but in the cases of other distribution like t-distribution or chi-square distribution, we need to calculate the degree of freedom. On the other hand, both Normal, as well as the t distribution, are symmetrical so we need to just check one value and just replace signs, but in the case of non-symmetrical distributions, we need to check the individual values.
For example, if we are working on the chi-square distribution at 95%, we need to first find the degree of freedom of that chi-square and then check the value of both 97.5% as well as 2.5%. To begin, just follow the steps and practice with the Normal distribution. Once you’ve mastered it, go for the calculation of the degree of freedom and then the other distributions.
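For distributions that need degrees of freedom, the lookup is the same with an extra argument. The sketch below uses 10 degrees of freedom purely as an example value.

```r
alpha <- 0.05
df    <- 10   # example degrees of freedom

# Chi-square is not symmetric, so both tail values must be looked up separately
chi_lower <- qchisq(alpha / 2, df)       # lower critical value
chi_upper <- qchisq(1 - alpha / 2, df)   # upper critical value

# The t distribution is symmetric: one value, then flip the sign
t_crit <- qt(1 - alpha / 2, df)

print(c(chi_lower = chi_lower, chi_upper = chi_upper, t_crit = t_crit))
```

The two chi-square values are not mirror images of each other, which is why each has to be checked individually.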
Critical regions are as critical as their name suggests and hence should be calculated carefully, or else we might end up in a wrong conclusion (Type 1 or Type 2 error).
The post Classical Normal Linear Regression Model (CNLRM) appeared first on StepUp Analytics.
In this article, we will discuss the details of the Classical Normal Linear Regression Model (CNLRM). The method of ordinary least squares is attributed to Carl Friedrich Gauss, a German mathematician. Under certain assumptions, this method of estimation has some very attractive statistical properties that have made it one of the most powerful and popular methods of regression analysis.
The two-variable Population Regression Function: 𝑌𝑖 = 𝛽1 + 𝛽2𝑋𝑖 + 𝑢𝑖
However, the population regression function can’t be obtained directly, hence we estimate it with the help of the sample regression function:
Where 𝑌̂𝑖 is the estimated (conditional mean) value of 𝑌𝑖
The OLS estimates of 𝛽1 and 𝛽2 can be obtained as follows:
On differentiating partially with respect to 𝛽1 and 𝛽2 we obtain the following results:
and
Thus we get the estimate of the population regression function as:
Where,
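For reference, the standard textbook form of this derivation for the two-variable model can be written as below. The deviation notation x_i = X_i − X̄ and y_i = Y_i − Ȳ is an assumption of this sketch and may differ from the notation used elsewhere in the article.

```latex
\begin{align}
\text{PRF:}\quad & Y_i = \beta_1 + \beta_2 X_i + u_i \\
\text{SRF:}\quad & Y_i = \hat{\beta}_1 + \hat{\beta}_2 X_i + \hat{u}_i = \hat{Y}_i + \hat{u}_i \\
\text{OLS criterion:}\quad & \min_{\hat{\beta}_1,\hat{\beta}_2} \sum \hat{u}_i^2
   = \sum \left( Y_i - \hat{\beta}_1 - \hat{\beta}_2 X_i \right)^2 \\
\text{Normal equations:}\quad & \sum Y_i = n\hat{\beta}_1 + \hat{\beta}_2 \sum X_i,
   \qquad \sum X_i Y_i = \hat{\beta}_1 \sum X_i + \hat{\beta}_2 \sum X_i^2 \\
\text{Solution:}\quad & \hat{\beta}_2 = \frac{\sum x_i y_i}{\sum x_i^2},
   \qquad \hat{\beta}_1 = \bar{Y} - \hat{\beta}_2 \bar{X},
   \qquad x_i = X_i - \bar{X},\; y_i = Y_i - \bar{Y}
\end{align}
```

The two normal equations are what the partial derivatives with respect to the two estimates reduce to when set to zero.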
The Assumptions Underlying The Method Of Least Squares.
If the objective is only to estimate 𝛽1 and 𝛽2, the method of OLS discussed above suffices. But if the objective is to draw inferences about the true values of the population parameters 𝛽1 and 𝛽2, then we have to look at the functional form of the 𝑌𝑖′𝑠, or of the 𝑋𝑖′𝑠 and 𝑢𝑖′𝑠. This is because the population regression function, 𝑌𝑖 = 𝛽1 + 𝛽2𝑋𝑖 + 𝑢𝑖, depends on the 𝑋𝑖 and the error terms.
Therefore, unless we are specific about how 𝑋𝑖 and 𝑢𝑖 are created or generated, there is no way we can make any statistical inference about 𝑌𝑖 or, as we shall see, about 𝛽1 and 𝛽2. Thus, the assumptions made about the 𝑋𝑖 variables and the error terms are extremely critical to a valid interpretation of the regression estimates.
The Gaussian, standard, or classical linear regression model (CLRM), which is the cornerstone of most econometric theory, makes the following 7 assumptions:
Assumption 1: The regression model is linear in terms of parameters.
Assumption 2: The values of 𝑋𝑖′𝑠 are fixed or 𝑋 values are independent of the error term.
Assumption 3: The mean of the error terms is zero.
Assumption 4: The variance of the error terms is constant; this assumption is also known as Homoscedasticity.
Assumption 5: There is no autocorrelation between the error terms (or disturbances).
Assumption 6: The number of observations is greater than the number of parameters to be estimated.
Assumption 7: The 𝑋 values in the given sample must not all be the same, i.e., the variance of 𝑋𝑖′𝑠 is positive.
NOTE: Gauss-Markov Theorem:
Given the assumptions of the classical linear regression model, the least-squares estimators, in the class of unbiased linear estimators, have minimum variance, that is, they are BLUE (best linear unbiased estimators).
Using the method of OLS we are able to estimate the population parameters 𝛽1 and 𝛽2, under the assumptions of the classical linear regression model, as 𝛽̂1 and 𝛽̂2. But since these estimators differ from sample to sample, they are random variables.
Because the estimators are random variables, we have to find their probability distributions.
The Probability Distribution Of Disturbances (𝒖𝒊′𝒔)
But since 𝑋𝑖′𝑠 are assumed fixed, or nonstochastic, because ours is conditional regression analysis, conditional on the fixed values of 𝑋𝑖.
also 𝑌 𝑖 = 𝛽1 + 𝛽2𝑋𝑖 + 𝑢𝑖
Hence the 𝛽 ̂2 can be rewritten
Since the 𝑘𝑖, the betas and the 𝑋𝑖 are all fixed, the estimator 𝛽̂2 is ultimately a linear function of the random variable 𝑢𝑖.
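The linearity of the slope estimator can be checked numerically. The sketch below assumes the usual weights k_i = x_i / Σx_i², with x_i = X_i − X̄, and uses simulated data with made-up parameter values (β1 = 1, β2 = 0.5); it then compares the hand-computed slope with lm().

```r
set.seed(42)

X <- c(2, 4, 6, 8, 10, 12, 14, 16)        # fixed (nonstochastic) regressor values
u <- rnorm(length(X), mean = 0, sd = 1)   # simulated disturbances
Y <- 1 + 0.5 * X + u                      # made-up PRF: beta1 = 1, beta2 = 0.5

x <- X - mean(X)        # deviations from the mean
k <- x / sum(x^2)       # weights k_i, fixed because the X_i are fixed

beta2_hat <- sum(k * Y)                        # slope as a linear function of Y_i
beta1_hat <- mean(Y) - beta2_hat * mean(X)     # intercept

# Equivalent form: beta2_hat = beta2 + sum(k_i * u_i), linear in the disturbances
beta2_from_u <- 0.5 + sum(k * u)

fit <- lm(Y ~ X)
print(c(by_hand = beta2_hat, from_u = beta2_from_u, lm = unname(coef(fit)["X"])))
```

Since the slope estimate is a weighted sum of the disturbances plus a constant, its sampling distribution is inherited from whatever distribution the disturbances have.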
Therefore, the probability distribution of the estimators depends on the assumption made about the error terms. Since we need the probability distribution of the estimators to draw inferences about the population parameters, we have to make an assumption about the distribution of the error term.
Since OLS does not make any assumption about the probabilistic nature of 𝑢𝑖, it is of little help for the purpose of drawing inferences about the population regression function from the sample regression function, the Gauss-Markov Theorem notwithstanding.
This void can be filled if we are willing to assume that the 𝑢𝑖′𝑠 follow some probability distribution. For reasons to be explained shortly, in the regression context it is usually assumed that the 𝑢𝑖′𝑠 follow a normal distribution.
Thus, adding the normality assumption to the assumptions of the classical linear regression model (CLRM) discussed earlier, we obtain what is known as the classical normal linear regression model (CNLRM).
The post Missing Value Imputation Techniques In R appeared first on StepUp Analytics.
Let’s see the three main types of missing values according to their pattern of occurrence in a data set.
Missing Completely At Random (MCAR) occurs when the missing values occur entirely at random and are independent of other variables in the observation. Here we are assuming that the variable with missing data is completely unrelated to the other variables or columns in the data. For example:
Suppose that we have a database of school students with 4 columns Student.Id, Name, Gender, and Number of Subjects. With the data available we cannot determine the number of subjects for the given missing observation because the missing data is completely independent of the other observations in the data.
An alternative assumption to MCAR is MAR or Missing at Random. It assumes that we can predict the missing value on the basis of other available data.
From the given data we can build a predictive model that Number of subjects can be predicted on the basis of independent variables like class and age. So in these cases, we can use some advanced imputation techniques to determine the missing values.
MAR is always a safer assumption than MCAR. This is because any statistical analysis which is performed under the assumption of MCAR is also valid for MAR, but the reverse is not true.
NMAR is also known as nonignorable missing data. It is completely different from MCAR or MAR. It is a case where we cannot determine the value of the missing data with any of the advanced imputation techniques. For example, if there is a question in a questionnaire which is a very sensitive issue and it is likely to be avoided by the people filling out the questionnaire, or anything that we don’t know. This is known as missing not at random data.
In the present study, I have used the iris data set which is already present in the R software. Though the dataset does not have any missing values, I have introduced missing values randomly into the data set to execute the six most popular methods of missing value treatment.
D <- iris
Saving the dataset in a dataframe “D”.
The data frame has five columns
Of these variables, the first four are numeric and the fifth variable “Species” is a factor with three levels.
NOTE: I have used only the first column i.e. Sepal.Length to explain the imputation techniques.
str(D)
This method is best avoided unless the data are MCAR. We have to check whether deleting the data will affect any of the statistical analysis done with the data. Moreover, it should only be performed if a sufficient amount of data remains after deleting the observations with “NA” values and if deleting them does not create bias or leave any variable poorly represented.
Creating another data set from the original dataset “D”
df <- D
Introducing “NA” values randomly into the dataset.
df<-as.data.frame(lapply(df, function(cc) cc[ sample(c(TRUE, NA), prob = c(0.95, 0.05), size = length(cc), replace = TRUE) ]))
Determining the number of “NA” values in the data set.
sapply(df,function(x)sum(is.na(x)))
Deleting the observations or rows which have “NA” values.
df<-na.omit (df)
Now, there is also an alternative to this, which is using na.action = na.omit directly in the model.
sapply(df,function(x)sum(is.na(x)))
This method is used only when a certain variable has a very high number of NA values in comparison to other variables. So using the previous method would lead to a loss of too many observations from the dataset. Now here we also need to see whether the given variable is an important predictor of the dependent variable or not. Then decide the better approach to deal with it.
Creating another data set from the original dataset “D”
df1 <- D
Introducing NA values to the first column of the dataset
df1$Sepal.Length [20:140] <-NA
Determining the number of “NA” values in the data set.
sapply(df1,function(x)sum(is.na(x)))
We can see that out of 150 observations, 121 values in the Sepal.Length column are missing.
Deleting the variable Sepal.Length from the dataset.
df1$Sepal.Length <- NULL
# df1 <- df1[, -1] ## Another way to do it (use one or the other, not both).
sapply(df1, function(x) sum(is.na(x)))
This is a very common technique for replacing the NA values. It is often used when there is not much variation in the data or the variable is not an important predictor of the dependent variable. Though one can easily calculate the mean or median value to impute the missing values, this method leads to an artificial reduction of the variation in the dataset.
Moreover, it reduces the standard error, which invalidates most hypothesis tests. It also distorts the relationship of the variable with the other variables in the dataset.
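The reduction in variation is easy to see on a toy case. The sketch below removes 60 values of Sepal.Length at random (the count and the seed are arbitrary choices for the illustration), imputes them with the mean, and compares the variances.

```r
set.seed(123)

x <- iris$Sepal.Length
x[sample(length(x), 60)] <- NA      # make 60 values missing at random

observed <- x[!is.na(x)]

imputed <- x
imputed[is.na(imputed)] <- mean(observed)   # mean imputation

var_observed <- var(observed)   # variance of the values we actually have
var_imputed  <- var(imputed)    # variance after filling the gaps with the mean

print(c(observed = var_observed, imputed = var_imputed))
```

The imputed values add nothing to the sum of squares but do increase the sample size, so the variance (and with it the standard error) shrinks even though no new information was added.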
Creating another data set from the original dataset “D”
df2<-D
Introducing NA values randomly into the dataset.
set.seed(123)
df2 <- as.data.frame(lapply(df2, function(cc) cc[sample(c(TRUE, NA), prob = c(0.60, 0.40), size = length(cc), replace = TRUE)]))
Here I am saving a copy of the variable Sepal.Length as “original”, consisting of the values which have been replaced by NA values in the data set. This is done so that later one can calculate the MAE, RMSE, and MAPE to see the accuracy of the imputation method.
# True values of Sepal.Length at the positions that were set to NA
original <- D$Sepal.Length[is.na(df2$Sepal.Length)]
Calculating the value of mean for the variable Sepal.Length and saving it as predictmean
predictmean <- round(mean(df2$Sepal.Length, na.rm = TRUE), digits = 1)
df21 <- df2
Replacing the missing values in the Sepal.Length column with the mean value
df21$Sepal.Length [is.na (df2$Sepal.Length)] <- predictmean
For checking the accuracy, we calculate MAE, RMSE, and MAPE with the help of the library “Metrics”. Saving the output of accuracy in Accuracy_mean
library(Metrics)
mae_mean <- mae(original, predictmean)
rmse_mean <- rmse(original, predictmean)
mape_mean <- mape(original, predictmean)
Accuracy_mean <- cbind(mae = mae_mean, rmse = rmse_mean, mape = mape_mean)
Accuracy_mean
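To show what these three numbers measure, here is a base-R sketch that computes them by hand for mean imputation, without the Metrics package. The formulas follow the usual definitions (with MAPE expressed as a fraction, which is my understanding of what Metrics returns), and the missingness pattern is a simplified stand-in for the one created above.

```r
set.seed(123)

full    <- iris$Sepal.Length
with_na <- full
with_na[sample(length(full), 60)] <- NA   # simplified: 60 values set to NA

missing   <- is.na(with_na)
original  <- full[missing]                                  # true values that were hidden
predicted <- rep(mean(with_na, na.rm = TRUE), sum(missing)) # mean imputation

mae_value  <- mean(abs(original - predicted))               # mean absolute error
rmse_value <- sqrt(mean((original - predicted)^2))          # root mean squared error
mape_value <- mean(abs((original - predicted) / original))  # mean absolute percentage error

accuracy <- c(mae = mae_value, rmse = rmse_value, mape = mape_value)
print(accuracy)
```

Lower values mean the imputed numbers sit closer to the true ones; RMSE penalises large individual misses more heavily than MAE.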
Calculating the value of the median for the variable Sepal.Length and saving it as predictmedian
predictmedian <- round(median(df2$Sepal.Length, na.rm = TRUE), digits = 1)
df22 <- df2
Replacing the missing values in the Sepal.Length column with the median value
df22$Sepal.Length[is.na(df2$Sepal.Length)] <- predictmedian
For checking the accuracy, we calculate MAE, RMSE, and MAPE with the help of the library “Metrics”
Saving the output of accuracy in Accuracy_median
library(Metrics)
mae_median <- mae(original, predictmedian)
rmse_median <- rmse(original, predictmedian)
mape_median <- mape(original, predictmedian)
Accuracy_median <- cbind(mae = mae_median, rmse = rmse_median, mape = mape_median)
Accuracy_median
This method of imputation is used when the missing data is of MAR type. So we create a decision tree model to predict the values of the missing data which was previously trained with the data which was already present in the dataset. This method can be used to predict both numeric and factor variables. Here we are predicting a numeric variable, so we choose the method = “anova”. But we write method= “class” in case of factor variables. It is also important to exclude the missing values from the model by using na.action=na.omit.
We have saved the model by the name fitrpart
library(rpart) # tree based model
fitrpart <- rpart(Sepal.Length ~ ., data = df2[!is.na(df2$Sepal.Length), ], method = "anova", na.action = na.omit)
Here we are predicting the missing values of Sepal.Length with the model and saving it in predictrpart
predictrpart<- predict(fitrpart, df2[is.na(df2$Sepal.Length),])
For checking the accuracy, we calculate MAE, RMSE, and MAPE with the help of the library “Metrics”
Saving the output of accuracy in Accuracy_rpart
library(Metrics)
mae_rpart <- mae(original, predictrpart)
rmse_rpart <- rmse(original, predictrpart)
mape_rpart <- mape(original, predictrpart)
Accuracy_rpart <- cbind(mae = mae_rpart, rmse = rmse_rpart, mape = mape_rpart)
Accuracy_rpart
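The kNN algorithm finds, for each data point with a missing value, its k nearest neighbors according to a distance measure (for example the Euclidean distance in multidimensional space) and imputes the missing value with an aggregate, such as a weighted average, of the values taken by those k nearest neighbors. Note that the kNN() function of the VIM package used below is based on a variation of the Gower distance and, by default, aggregates numeric variables with the median.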
Things to remember:
1. k is the number of nearest neighbors used to find the values of the missing data points.
2. variable is the variable which consists of missing values we choose to impute. We can choose more than one variable with variable = c("a", "b", "c"), where a, b, and c are the variables which consist of missing values.
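The idea behind k and the target variable can be sketched in base R. This is a simplified illustration with a plain Euclidean distance on scaled predictors and an unweighted mean of the neighbours; it is not the VIM implementation, and the helper name knn_impute, the seed and the 30 missing values are choices made for the example.

```r
set.seed(1)

d     <- iris[, 1:4]                 # numeric columns only, to keep the sketch short
miss  <- sample(nrow(d), 30)         # rows whose Sepal.Length will be hidden
truth <- d$Sepal.Length[miss]
d$Sepal.Length[miss] <- NA

# Impute `target` from the k nearest rows that have the target observed
knn_impute <- function(data, target, k = 6) {
  predictors <- scale(data[, setdiff(names(data), target)])  # put columns on one scale
  is_missing <- is.na(data[[target]])
  donors     <- which(!is_missing)
  for (i in which(is_missing)) {
    diffs   <- sweep(predictors[donors, , drop = FALSE], 2, predictors[i, ])
    dists   <- sqrt(rowSums(diffs^2))                # Euclidean distance to each donor
    nearest <- donors[order(dists)[seq_len(k)]]      # indices of the k closest donors
    data[[target]][i] <- mean(data[[target]][nearest])
  }
  data
}

imputed <- knn_impute(d, "Sepal.Length", k = 6)
print(head(cbind(truth = truth, imputed = imputed$Sepal.Length[miss])))
```

A larger k smooths the imputed values towards the overall mean, while a very small k makes them sensitive to individual neighbours.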
Disadvantages of using Knn
Executing the algorithm
library(VIM)
df23 <- kNN(df2, variable = "Sepal.Length", k = 6)
df23$Sepal.Length_imp <- NULL
Saving the values of variable Sepal.Length that was imputed using kNN in a vector predictkNN
predictkNN <- df23[is.na(df2$Sepal.Length), "Sepal.Length"]
For checking the accuracy, we calculate MAE, RMSE, and MAPE with the help of the library “Metrics”. Saving the output of accuracy in Accuracy_kNN
library(Metrics)
mae_kNN <- mae(original, predictkNN)
rmse_kNN <- rmse(original, predictkNN)
mape_kNN <- mape(original, predictkNN)
Accuracy_kNN <- cbind(mae = mae_kNN, rmse = rmse_kNN, mape = mape_kNN)
Accuracy_kNN
NOTE: There is another package, DMwR, whose function knnImputation() can also be used to do the imputation.
Mice or Multivariate Imputation via Chained Equations is a package that uses multiple imputations for a missing data treatment. Now, as multiple imputations create multiple predictions for each missing value; they take into account the uncertainty in the imputation and give the best standard errors. If there is not much information in the given data used to prepare the model, the imputations will be highly variable, leading to high standard errors in the analysis.
Things to remember:-
library(mice)
fitmice <- mice(df2, m = 10, maxit = 30, method = "pmm")
df24 <- complete(fitmice)
Saving the values of variable Sepal.Length that was imputed using Mice in a vector predictmice
predictmice<-df24 [is.na (df2$Sepal.Length), "Sepal.Length"]
For checking the accuracy, we calculate MAE, RMSE, and MAPE with the help of the library “Metrics”. Saving the output of accuracy in Accuracy_mice
library(Metrics)
mae_mice <- mae(original, predictmice)
rmse_mice <- rmse(original, predictmice)
mape_mice <- mape(original, predictmice)
Accuracy_mice <- cbind(mae = mae_mice, rmse = rmse_mice, mape = mape_mice)
Accuracy_mice
The missForest function is used particularly in the case of mixed-type data. It can be used to impute continuous and categorical data including complex interactions and non-linear relations. It uses the given data in the data frame to train the random forest model and then uses the model to predict the missing values. It yields an out-of-bag (OOB) imputation error estimate without the need of a test set or elaborate cross-validation.
library(missForest)
# Executing the algorithm
fitmissforest <- missForest(df2)
# Saving the output in df25
df25 <- fitmissforest$ximp
Saving the values of variable Sepal.Length that was imputed using missForest in a vector predictmissforest
predictmissforest<-round (df25 [is.na (df2$Sepal.Length), "Sepal.Length"], digits=1)
For checking the accuracy, we calculate MAE, RMSE, and MAPE with the help of the library “Metrics”. Saving the output of accuracy in Accuracy_missforest
library(Metrics)
mae_missforest <- mae(original, predictmissforest)
rmse_missforest <- rmse(original, predictmissforest)
mape_missforest <- mape(original, predictmissforest)
Accuracy_missforest <- cbind(mae = mae_missforest, rmse = rmse_missforest, mape = mape_missforest)
Accuracy_missforest
Creating a data frame Accuracy consisting of all the accuracy outputs of the last six methods
Accuracy <- cbind(Methods = c("Mean", "Median", "rpart", "kNN", "mice", "missForest"),
                  rbind.data.frame(Accuracy_mean, Accuracy_median, Accuracy_rpart,
                                   Accuracy_kNN, Accuracy_mice, Accuracy_missforest))
library(ggplot2)
Creating a scatter plot to visualize the best method, i.e. the one with the least MAPE (mean absolute percentage error). Similarly, you can produce the plots for MAE and RMSE.
Scatter plot
# Scatter plot
ggplot(data = Accuracy, aes(x = Methods, y = mape)) + geom_point()
So, from the above diagram, we can see that for the given data set mice and missForest give us the best output. But we cannot conclude that mice or missForest will be the best method of missing value treatment for each and every dataset, because missing values have different types and patterns. So, I think the best way is to test the models with the given data and then use the best model to impute the missing values in the data set.
Missing values imputation with missMDA
Missing values imputation with Fuzzy K-means Clustering
Missing values imputation with bpca (Bayesian Principal Component Analysis)
The post Multivariate Analysis Of Variance Or MANOVA appeared first on StepUp Analytics.
In ANOVA we examine whether independent variables have any statistically significant effect on a continuous dependent variable, using the sum of squares. But there we only have one dependent variable. That is simple, but practical problems are more complex, and we can have more than one dependent variable. We could use ANOVA for every dependent variable separately, but with Multivariate Analysis Of Variance, or MANOVA, we can do that in one analysis.
So, we can think of MANOVA as a multivariate extension of ANOVA. This way MANOVA explains how much variability of dependent variables is explained by the independent variables simultaneously.
Like ANOVA, MANOVA also rests on some assumptions. Before performing MANOVA we have to check whether the following assumptions are satisfied.
Let us take a sample data set to understand how MANOVA works. The dataset can be downloaded from the given link – Data
Data Dictionary
The data contains 7 columns (variables) with 120 observations. The variables of interest are “Temperature” (3 levels), “N.source” (2 levels), “Optical.density” and “Product.yield”.
Here we want to check whether the different Temperatures or different N.sources have significant effects on Optical density and Product yield. Here we can perform ANOVA separately for Temperature and N.source but this can be done simultaneously in MANOVA.
Now we are going to observe how MANOVA works using graphical representations.
First, the data is loaded into the R environment.
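data <- read.delim("MANOVA.txt", header = TRUE)
head(data, n = 10)
# Treat the grouping variables as factors (from R 4.0, read.delim() keeps strings as character)
data$Temperature <- as.factor(data$Temperature)
data$N.source <- as.factor(data$N.source)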
Now we could have plots as following,
# For N.source
plot(data$Optical.density, data$Product.yield, col = data$N.source, pch = 15,
     main = "Optical density vs Product yield vs N.source",
     xlab = "Optical density", ylab = "Product yield")
legend("topleft", legend = as.character(levels(data$N.source)), fill = 1:2)
# For Temperature
plot(data$Optical.density, data$Product.yield, col = data$Temperature, pch = 15,
     main = "Optical density vs Product yield vs Temperature",
     xlab = "Optical density", ylab = "Product yield")
legend("topleft", legend = as.character(levels(data$Temperature)), fill = 1:3)
From the above scatter plots we can see that the N.source groups are clearly separated in terms of Optical density and Product yield, whereas the Temperature groups are not.
We can have the same result mathematically simultaneously, as we get from the plots, using MANOVA.
In R we can perform MANOVA as follows,
> summary(manova(cbind(Product.yield,Optical.density) ~ N.source + Temperature, data = data))
             Df  Pillai approx F num Df den Df Pr(>F)
N.source      1 0.72282  149.944      2    115 <2e-16 ***
Temperature   2 0.04278    1.268      4    232 0.2835
Residuals   116
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
From the output, we can see that at the 1% level of significance N.source has a significant effect on Product yield and Optical density. On the other hand, at the 5% level of significance Temperature has no significant effect on Product yield and Optical density.
So, this is how MANOVA gives us a mathematical result to understand whether the effects of some treatments differ significantly in explaining the variability of two or more continuous variables simultaneously.
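Since MANOVA.txt may not be at hand, the same workflow can be tried on a built-in dataset. The sketch below substitutes iris, with two of its measurements as responses and Species as the factor; this substitution is an assumption for illustration and is unrelated to the fermentation data above.

```r
# Two continuous responses modelled jointly against one factor
fit <- manova(cbind(Sepal.Length, Petal.Length) ~ Species, data = iris)

# Pillai's trace is the default test statistic; it is named here for clarity
pillai <- summary(fit, test = "Pillai")
print(pillai)

# The test table is stored as a matrix and can be queried directly
stats     <- pillai$stats
p_species <- stats["Species", "Pr(>F)"]
print(p_species < 0.01)
```

Pulling the p-value out of the stats matrix is handy when the result has to feed into further code rather than just be read off the printed table.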
Now, we can have the result of ANOVA as follows,
> summary.aov(manova(cbind(Product.yield,Optical.density) ~ N.source + Temperature, data = data))
 Response Product.yield :
             Df  Sum Sq Mean Sq  F value Pr(>F)
N.source      1 27579.1 27579.1 187.3099 <2e-16 ***
Temperature   2    80.9    40.5   0.2748 0.7602
Residuals   116 17079.6   147.2
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
 Response Optical.density :
             Df Sum Sq Mean Sq  F value  Pr(>F)
N.source      1 6.7071  6.7071 193.7043 < 2e-16 ***
Temperature   2 0.1656  0.0828   2.3915 0.09599 .
Residuals   116 4.0166  0.0346
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Here we have the result of performing ANOVA separately for each response. We get the same result for N.source, but we can see that at the 10% level of significance Temperature has a significant effect on Optical density (p = 0.096).
So, basically, in ANOVA, we separately check how significant each predictor variables are, but in MANOVA we analyse the variability of response variables on the multiple predictor variables simultaneously.
We should use MANOVA when there are multiple dependent variables which are correlated. Unlike individual ANOVAs, MANOVA can detect small significant effects. In MANOVA, the effect of those factors which may have an influence on the relationship between two dependent variables can be determined which may be missed if we perform individual ANOVAs.
In individual ANOVAs, type-I error (chance of rejection the true null hypothesis) can be increased, but in MANOVA, all response variables simultaneously keep the error rate equal to the desired level of significance.
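The growth of the type-I error across separate tests can be illustrated with a short calculation. This is a minimal sketch that assumes the individual tests are independent, which correlated responses will not satisfy exactly; the names `alpha`, `k` and `familywise` are illustrative.

```r
# Chance of at least one false rejection across k separate tests,
# assuming the tests are independent (an idealisation).
alpha <- 0.05   # significance level of each individual ANOVA
k <- 2          # number of response variables tested separately

familywise <- 1 - (1 - alpha)^k
familywise
```

With two responses, as in the example above, the chance of at least one false rejection is already noticeably larger than the 5% used for each test.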
The presence of outliers may increase the type-I error, as MANOVA is sensitive to outliers. The presence of multicollinearity violates an assumption of MANOVA: if the dependent variables are highly correlated, then one can be (nearly) a linear function of the others and becomes statistically redundant.
The post Multivariate Analysis Of Variance Or MANOVA appeared first on StepUp Analytics.
]]>The post Hypothesis Testing Examples appeared first on StepUp Analytics.
]]>Then we may be interested in knowing if this sample average is in line with the population average of 85 or not. Hypothesis testing is like a litmus test that gives us a path to the rejection or acceptance of an assumption or a claim, except that it is not deterministic but probabilistic. It is a technique for comparing two datasets, or a sample with the population it is drawn from.
Let us first see what a hypothesis is and take a look at some of the terms used in hypothesis testing.
A hypothesis is an assumption about a population parameter that we want to verify. Every test involves two hypotheses, namely the null hypothesis and the alternative hypothesis. The null hypothesis is the statement of no difference and is denoted by H0. It asserts that there is no real difference between the sample and the population, and that any observed difference is accidental or due to chance. An alternative hypothesis is a statement against the null hypothesis.
It is the contradiction of the null hypothesis and is usually denoted by H1. For example, a sample of 50 light bulbs is tested for lifetime and we want to test if the average lifetime of the bulbs is 300 days. Then we will set up the null hypothesis as “the average lifetime is 300 days” and the alternative hypothesis will be “the average lifetime is not 300 days”.
To test a hypothesis we need to have a single value based on the sample observations that can be compared with a pre-defined value so as to reach a decision. This value is computed using a certain formula and follows a particular probability distribution under some assumptions. Since the value calculated is used for testing and is derived from the sample, it is called a test statistic.
We are all familiar with the game of darts. Consider a simpler version of the game in which a hit on the outer ring results in the disqualification (outright rejection) of the player, whereas a hit on the inner two circles qualifies the player for further rounds (acceptance: only qualified, not yet the winner).
Just as this dartboard is divided into areas of rejection and acceptance, a probability curve is divided into an acceptance region and a rejection region (also called the critical region). If a test statistic falls in the critical region then the null hypothesis is rejected; otherwise it may be accepted. Hence, the critical region is the set of values of the test statistic for which H0 is rejected, and its size is the probability of rejecting H0 when H0 is true.
A point to be noted here is that rejecting the null hypothesis is a much stronger conclusion than accepting it (as in the example above, where the player is only qualified and is not the winner). The reason is that we deal with a sample rather than the population itself. In R.A. Fisher’s own words:
“The null hypothesis is never proved or established, but is possibly disproved, in the course of experimentation. Every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis.”
The confidence with which a decision is taken depends on the significance level chosen. The significance level is the probability of rejecting the null hypothesis when it is true. It is the size of the critical region and is expressed in terms of percentage. For example, a significance level of 5% means that the null hypothesis will be rejected 5 out of 100 times when it is true. The significance level is denoted by α.
As we have seen above, the critical region is a portion of the probability curve. This portion can lie on either end of the curve or on both ends of the curve. A test is called one-tailed or two-tailed depending on which side of the curve the critical region lies, which in turn depends on the nature of our alternative hypothesis. For example, in the light bulb problem, if we want to test whether the lifetime of bulbs is greater than 300 days then our alternative hypothesis will be “lifetime of bulbs > 300 days” (right-tailed).
If we want to test if it is less than 300 days then the alternative hypothesis becomes “lifetime of bulbs < 300 days” (left tailed). If we do not care about whether it is greater or less and just want to test if it is 300 days or not then alternative hypothesis becomes “lifetime of bulbs ≠ 300 days” (two-tailed). The critical region, say at 5%, in these cases can be illustrated as below:
For a two-tailed test, the critical region is divided into two parts, one on the right side and the other on the left side, while for a one-tailed test it remains undivided. So, if we are dealing with a two-tailed test at significance level (size of the critical region) α% then on each side we have (α/2)% of the area.
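The boundaries of these critical regions on the standard normal curve can be looked up with `qnorm()` in base R. The sketch below assumes a 5% significance level; the variable names are illustrative.

```r
# Critical values of the standard normal curve at significance level alpha.
alpha <- 0.05

right_tailed <- qnorm(1 - alpha)      # reject H0 if z is above this value
left_tailed  <- qnorm(alpha)          # reject H0 if z is below this value
two_tailed   <- qnorm(1 - alpha / 2)  # reject H0 if |z| is above this value

round(c(right = right_tailed, left = left_tailed, two = two_tailed), 3)
```

The one-tailed cut-offs are about ±1.645 and the two-tailed cut-off is about 1.96, because the two-tailed test places only α/2 of the area in each tail.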
In simple terms, the p-value is the evidence used to accept or reject the null hypothesis. Consider a coin that someone says is biased. To test this claim we set up the null hypothesis that “the coin is unbiased” against the alternative hypothesis that “the coin is biased”. The coin is then tossed 20 times, and suppose that 18 heads and 2 tails are obtained. Clearly, we should have got a similar number of heads and tails if the coin were unbiased. But to support this conclusion we need evidence in the form of a p-value.
The p-value is the probability, computed under the null hypothesis, of obtaining a result equal to or more extreme than the observed one. In this case, the results as extreme as the observed one are (18 heads, 2 tails) and (18 tails, 2 heads), and the results more extreme than observed are (19 heads, 1 tail), (20 heads, 0 tails), (19 tails, 1 head) and (20 tails, 0 heads). Calculating the probability of each of these results (using the binomial distribution) under the null hypothesis we get:
P(18 heads and 2 tails) = P(18 tails and 2 heads) = 0.000181
P(19 heads and 1 tail) = P(19 tails and 1 head) = 0.00001907
P(20 heads and 0 tail) = P(20 tails and 0 head) = 0.0000009536
Adding up the probabilities of all six outcomes, we get the p-value as 0.0004.
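The same p-value can be reproduced in base R. This is a minimal sketch using `dbinom()` for the individual probabilities and `binom.test()` as a cross-check; the variable names are illustrative.

```r
n <- 20       # number of tosses
heads <- 18   # observed number of heads
p0 <- 0.5     # probability of heads under H0 (unbiased coin)

# Outcomes at least as extreme as 18 heads, counted in both tails
extreme <- c(0:2, 18:20)
p_value <- sum(dbinom(extreme, size = n, prob = p0))
round(p_value, 4)   # approximately 0.0004

# The built-in exact binomial test gives the same two-sided p-value
p_exact <- binom.test(heads, n, p = p0, alternative = "two.sided")$p.value
```

Counting both tails matches the two-sided alternative “the coin is biased”, which does not say in which direction the bias lies.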
This p-value is compared with the significance level, which we will take here as 0.05. If the p-value is greater than the significance level then the evidence against the null hypothesis is weak, which means we may accept the null hypothesis. If the p-value is less than the significance level then the evidence against the null hypothesis is strong and hence we reject the null hypothesis.
So, in this case, the p-value is very small as compared to the significance level, therefore, we can safely say that the null hypothesis is rejected and the coin is indeed a biased one.
Statistics is probabilistic and so is hypothesis testing. There is always a probability of making a wrong decision. While making decisions, four possibilities arise: (1) the null hypothesis is true but we reject it; (2) the null hypothesis is false but we accept it; (3) the null hypothesis is true and we accept it; (4) the null hypothesis is false and we reject it.
Clearly, the last two decisions are correct. The first two decisions wrongly reject and wrongly accept the null hypothesis, so they are errors. The first one rejects the null hypothesis when it is, in fact, true; it is called the type I error. The probability of committing a type I error is denoted by α.
The second one accepts the null hypothesis when it is false; it is called the type II error. The probability of committing a type II error is denoted by β.
Consider an example of testing whether a new toothpaste is better than the previous toothpaste in fighting dental cavities. The hypotheses are H0: there is no difference between the toothpastes, against H1: the new toothpaste is better than the old one. Now suppose that the new toothpaste is actually better. If our test accepts the null hypothesis that there is no difference between the toothpastes, then we commit a type II error.
While testing a hypothesis, our aim is to reduce both types of error, but for a fixed sample size it is not possible to minimize both errors simultaneously. So we fix the probability of type I error (α) in advance at a satisfactory level and try to minimize the probability of type II error (β). α is also known as the significance level or the size of the critical region.
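The trade-off can be made concrete for a right-tailed large sample test of a mean. The sketch below is only an illustration: the function `type2_error()` and all the numbers (a hypothesised mean of 300, a true mean of 305, a standard deviation of 20) are hypothetical and do not come from the examples in this article.

```r
# Probability of a type II error (beta) for a right-tailed z-test of
# H0: mu = mu0 when the true mean is mu1 > mu0. All inputs are hypothetical.
type2_error <- function(mu0, mu1, sigma, n, alpha = 0.05) {
  z_crit <- qnorm(1 - alpha)                # boundary of the critical region
  shift <- (mu1 - mu0) / (sigma / sqrt(n))  # true mean in standard-error units
  pnorm(z_crit - shift)                     # chance that z misses the critical region
}

# With alpha fixed at 0.05, beta shrinks as the sample size grows
sizes <- c(25, 50, 100, 200)
beta_by_n <- sapply(sizes, function(n) type2_error(300, 305, 20, n))
round(beta_by_n, 3)
```

Here α stays fixed while β is driven down by collecting more observations, which is the usual way of reducing β once α has been chosen.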
We know that for large sample sizes, the sampling distribution of most statistics can be approximated by the normal distribution due to the Central Limit Theorem. This forms the basis of the large sample tests. Let us take a look at some of the tests and also how to perform them in R.
Consider a random sample of size n (≥ 30) from a normal population with mean µ and variance σ^{2}. We know that the sample mean (x̄) of this sample will also be normally distributed, as N(µ, σ^{2}/n). Thus, the standard normal variate corresponding to x̄ is: z = (x̄ − µ) / (σ/√n) ~ N(0, 1).
If the population standard deviation is not known, which is usually the case, then we use the sample standard deviation as its estimate. Let us take the example of a pizza delivery boy who claims that he takes, on average, 8.9 minutes to reach his destination to deliver pizzas. To check this claim, the agency that hires him notes the time he takes for 50 orders. It gets a mean of 9.3 minutes with a standard deviation of 1.6 minutes.
Now let us check if the average time taken to deliver a pizza is 8.9 minutes or not. For this we start by setting up the null hypothesis, H0: the average time taken to deliver a pizza is 8.9 minutes (µ = 8.9), against the alternative hypothesis, H1: the average time taken to deliver a pizza is not 8.9 minutes (µ ≠ 8.9).
According to the situation we have: sample mean (x̄) = 9.3 minutes, hypothesised population mean (µ) = 8.9 minutes, sample standard deviation (s) = 1.6 minutes, used as an estimate of the population standard deviation σ, and sample size (n) = 50. Since the sample is large (≥ 30), we apply the large sample test with s in place of σ; for a small sample the t-test would be used instead. On substituting the values in the test statistic formula we get the value of the test statistic as z = (9.3 − 8.9) / (1.6/√50) ≈ 1.77.
This is a two-tailed test so the critical region will be on both sides of the curve. The critical value from the standard normal table is 1.96 (we have taken a significance level of 0.05). The calculated value of the test statistic, 1.767, is less than the tabulated value of 1.96. Hence, we may accept the null hypothesis at the 0.05 level of significance.
This test can also be performed in R.
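A minimal sketch of the calculation in base R, using the figures from the pizza delivery example, could look like this. The variable names are illustrative, and the sample standard deviation is used in place of the unknown population value.

```r
# One-sample large sample (z) test for the pizza delivery example
x_bar <- 9.3    # sample mean (minutes)
mu0   <- 8.9    # mean claimed under H0 (minutes)
s     <- 1.6    # sample standard deviation, used as an estimate of sigma
n     <- 50     # sample size
alpha <- 0.05   # significance level

z <- (x_bar - mu0) / (s / sqrt(n))   # test statistic
z_crit <- qnorm(1 - alpha / 2)       # two-tailed critical value
p_value <- 2 * pnorm(-abs(z))        # two-tailed p-value

round(c(z = z, critical = z_crit, p_value = p_value), 3)
reject_h0 <- abs(z) > z_crit         # FALSE here: H0 is not rejected
```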
Since we are dealing with a two-tailed test here, the p-value is calculated as p = P(Z ≤ -z) + P(Z ≥ z) = 2*P(Z ≤ -z), which is about 0.077 for z ≈ 1.77. Clearly, the p-value is greater than the significance level of 0.05; therefore, we say that we have little evidence against the null hypothesis and may accept it.
In the above test, we had one sample and we compared the value of the sample mean to that of the population mean. We can also compare the value of two sample means to find out if they belong to populations with identical means or to check if one population is superior or inferior to other.
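For this, two independent samples of sizes n1 and n2 are taken from the same or different populations with means µ1 and µ2 and variances σ1^{2} and σ2^{2} respectively. Also, we know that the sample means x̄1 and x̄2 and their difference x̄1 − x̄2 are distributed as: x̄1 ~ N(µ1, σ1^{2}/n1), x̄2 ~ N(µ2, σ2^{2}/n2) and x̄1 − x̄2 ~ N(µ1 − µ2, σ1^{2}/n1 + σ2^{2}/n2).
Under the null hypothesis that the samples are from the same population (µ1 = µ2), the test statistic is given as: z = (x̄1 − x̄2) / √(σ1^{2}/n1 + σ2^{2}/n2) ~ N(0, 1).
If the population standard deviations are not known, then the sample variances s1^{2} and s2^{2} are used as: z = (x̄1 − x̄2) / √(s1^{2}/n1 + s2^{2}/n2).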
For better understanding, consider an example where it is required to check if the mean level of pay of one state is greater than that of another state. Two samples of employees, of sizes 1200 and 1000, are taken. The mean and standard deviation of the samples (in thousands of rupees) are given as:
Here, we have to test the null hypothesis, H0: there is no difference between the average pay of the two states, i.e., µ1 = µ2, against the alternative hypothesis, H1: the mean level of pay of state 1 is greater than that of state 2, i.e., µ1 > µ2 (right-tailed test). The population standard deviations are not known, so they are estimated by the sample standard deviations. The value of the test statistic is |z| = 24.28. The tabulated value for the right-tailed test at the 5% level of significance is 1.645.
The calculated value is much greater than the tabulated value at the 5% significance level, and thus we reject the null hypothesis and conclude that the mean pay of state 1 is higher than the mean pay of state 2.
In R, this test can be done as follows:
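The sketch below is one way to write this test in base R. The helper `two_sample_z()` is a hypothetical function written for illustration, and the means and standard deviations passed to it are placeholder values, not the figures of the pay example; only the sample sizes 1200 and 1000 are taken from the text.

```r
# Large sample test for the difference of two means, with the sample
# standard deviations used as estimates of the population values.
two_sample_z <- function(mean1, mean2, sd1, sd2, n1, n2,
                         alternative = c("two.sided", "greater", "less")) {
  alternative <- match.arg(alternative)
  se <- sqrt(sd1^2 / n1 + sd2^2 / n2)   # standard error of the difference
  z <- (mean1 - mean2) / se             # test statistic under H0: mu1 = mu2
  p_value <- switch(alternative,
    two.sided = 2 * pnorm(-abs(z)),
    greater   = pnorm(z, lower.tail = FALSE),  # H1: mu1 > mu2 (right-tailed)
    less      = pnorm(z)                       # H1: mu1 < mu2 (left-tailed)
  )
  list(z = z, p_value = p_value)
}

# Placeholder inputs (hypothetical means and standard deviations)
res <- two_sample_z(mean1 = 50, mean2 = 48, sd1 = 10, sd2 = 12,
                    n1 = 1200, n2 = 1000, alternative = "greater")
res$z > qnorm(0.95)   # TRUE means H0 is rejected at the 5% level
```

Replacing the placeholder inputs with the sample means and standard deviations of the two states gives the test statistic for the example.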
This test is used to check whether the standard deviations of two samples differ significantly or not. Let s1 and s2 be the standard deviations of two independent samples. Then, under the null hypothesis that the standard deviations do not differ significantly, i.e., σ1 = σ2, the test statistic for large samples is given as: z = (s1 − s2) / √(σ1^{2}/(2n1) + σ2^{2}/(2n2)) ~ N(0, 1).
Here σ1^{2} and σ2^{2} are the population variances and n1 and n2 are the sample sizes for sample 1 and sample 2 respectively. The sample variance is used as an estimate of the population variance when it is not known. Consider a farmer with two sets of plots whose yields have the following variability:
We have to test whether the difference in variability between the two sets of plots is significant. The null hypothesis can be stated as, H0: there is no significant difference between the variability of the two sets of plots, i.e., σ1 = σ2, against the alternative hypothesis, H1: the two sets of plots have significantly different variability, i.e., σ1 ≠ σ2. We have s1 = 34, s2 = 28, n1 = 40, n2 = 60, and the variances are estimated as σ1^{2} = 34^{2} = 1156 and σ2^{2} = 28^{2} = 784. Substituting these values in the test statistic we get z = (34 − 28) / √(1156/80 + 784/120) ≈ 1.31.
This value is less than the tabulated value at the 0.05 level of significance, which is 1.96. Hence, we cannot reject the null hypothesis and conclude that the difference between the variability of the two sets of plots is not significant.
In R, it can be done as follows:
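A minimal sketch in base R, using the values from the plots example, is given here. The variable names are illustrative, and the sample variances stand in for the unknown population variances.

```r
# Large sample test for the difference of two standard deviations
s1 <- 34; n1 <- 40   # standard deviation and size of the first set of plots
s2 <- 28; n2 <- 60   # standard deviation and size of the second set of plots
alpha <- 0.05        # significance level

se <- sqrt(s1^2 / (2 * n1) + s2^2 / (2 * n2))  # standard error of s1 - s2
z <- (s1 - s2) / se                            # test statistic
z_crit <- qnorm(1 - alpha / 2)                 # two-tailed critical value
p_value <- 2 * pnorm(-abs(z))                  # two-tailed p-value

round(c(z = z, critical = z_crit, p_value = p_value), 3)
reject_h0 <- abs(z) > z_crit                   # FALSE here: H0 is not rejected
```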
The post Hypothesis Testing Examples appeared first on StepUp Analytics.
]]>