The post Bayesian Inference in Python appeared first on StepUp Analytics.
Bayes' Theorem uses prior knowledge or experience to produce better estimates. Mathematically speaking, it is a statement about the conditional probability of an event.
Bayes’ theorem is stated mathematically as the following equation:
P(A|B) = P(B|A)P(A)/P(B)
where P(A|B) is the probability of A given that B has occurred, P(B|A) is the probability of B given A, and P(A) and P(B) are the marginal probabilities of A and B respectively.
Bayesian Inference is a method of statistical inference in which we update the probability of our hypothesis (prior) H as more information (data) D becomes available, and finally arrive at our posterior probability. Bayes' Theorem lays down the foundation of Bayesian Inference. Expressed mathematically, we have
P(H|D) = P(H)P(D|H)/P(D)
where P(H) is the prior probability of the hypothesis, P(D|H) is the likelihood of the data given the hypothesis, P(D) is the evidence, and P(H|D) is the posterior probability.
Consider a coin toss example in which the true probability of heads is 0.35. We take as our prior for the probability of heads the Uniform(0,1) distribution. Then, on the basis of the data, we arrive at our posterior distribution for the probability of heads. What we notice is that the posterior becomes more concentrated around the true value as we increase the number of trials.
Now, assuming the prior to be a Bernoulli(0.5) distribution instead, and combining this with our previous result, we obtain a new posterior.
INFLUENCE OF THE PRIOR:
The prior influences the result of our analysis. As is obvious from the above example, the influence of the prior is more dominant when the data volume is small. The influence of the prior eventually subsides as the number of trials becomes larger (at which point frequentist inference methods might be a better option).
HOW TO CHOOSE A PRIOR:
This is quite a subjective question. Some use a non-informative prior (such as Uniform(0,1) in our first example). Unless we are quite sure of our prior knowledge, it is not recommended to use strongly informative priors in our analysis.
The resultant probability distribution which summarizes both prior and the data is the posterior.
HIGHEST POSTERIOR DENSITY (HPD) INTERVAL
The HPD interval is a highly useful tool for summarising the spread of the posterior density. It is defined as the shortest interval containing a given proportion of the probability density (for example, 95%).
There are several ways to compute the posterior numerically, which can be broadly classified as:
NON-MARKOVIAN METHODS
GRID COMPUTING: This is a brute-force approach, mainly used when we cannot compute the posterior analytically. For a single-parameter model, the grid approach is as follows: discretise the parameter into a grid of points, evaluate the prior and the likelihood at each point, multiply them, and normalise the result.
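The steps above can be sketched in plain Python for the coin-toss setting (the data here, 4 heads in 10 tosses, are hypothetical, since the article's own figures are not reproduced):

```python
# A minimal sketch of the single-parameter grid approach with a Uniform(0,1) prior.

def grid_posterior(heads, n, points=201):
    # Step 1: discretise the parameter (probability of heads) into a grid
    grid = [i / (points - 1) for i in range(points)]
    # Step 2: flat prior at every grid point
    prior = [1.0] * points
    # Step 3: binomial likelihood kernel p^heads * (1-p)^(n-heads) at each point
    likelihood = [p**heads * (1 - p)**(n - heads) for p in grid]
    # Step 4: normalise the product so the posterior sums to 1
    unnorm = [l * pr for l, pr in zip(likelihood, prior)]
    total = sum(unnorm)
    return grid, [u / total for u in unnorm]

grid, post = grid_posterior(4, 10)
mode = grid[post.index(max(post))]
print(mode)  # 0.4, the sample proportion of heads
```

With a flat prior the posterior mode coincides with the sample proportion, which is why the grid point 0.4 maximises the posterior here.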
The grid computing approach does not scale well to models with many parameters.
There are other non-Markovian methods as well, such as the QUADRATIC (LAPLACE) APPROXIMATION and VARIATIONAL METHODS.
MARKOVIAN METHODS:
There is a family of methods known as MCMC (MARKOV CHAIN MONTE CARLO) methods. Here as well, we need to compute the prior and the likelihood at each sampled point to approximate the whole posterior distribution. MCMC methods outperform the grid approximation because they are designed to spend more time in higher-probability regions than in lower ones.
# Revisiting the coin toss Example
Using the PyMC3 library (a Python library for probabilistic programming).
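The PyMC3 code itself is not reproduced here (the notebook is linked below); as a library-free sketch of the same coin-toss update, the Beta-Binomial conjugate pair gives the posterior in closed form: a Uniform(0,1) prior is Beta(1,1), and after h heads in n tosses the posterior is Beta(1+h, 1+n−h). The toss counts below are made up for illustration.

```python
# Conjugate Beta-Binomial updating: a closed-form stand-in for the
# MCMC sampling done in the PyMC3 notebook (data here are hypothetical).

def update_beta(a, b, heads, n):
    """Posterior Beta parameters after observing `heads` heads in `n` tosses."""
    return a + heads, b + (n - heads)

# Uniform(0,1) prior is Beta(1, 1); observe 35 heads in 100 tosses
a, b = update_beta(1, 1, 35, 100)
posterior_mean = a / (a + b)
print(a, b, round(posterior_mean, 4))  # Beta(36, 66), mean ≈ 0.3529
```

The posterior mean 36/102 ≈ 0.353 sits close to the sample proportion 0.35, with the uniform prior contributing the equivalent of one extra head and one extra tail.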
Get more articles on Python for Data Science.
Download the IPython notebook used in the above example from GitHub.
The post Random Variable and Distribution: The Concept appeared first on StepUp Analytics.
In this article, we first discuss the properties of a random variable, and then the types of random variables along with their probability functions, in brief.
So, the properties of a random variable are:
Note: We use a capital letter, e.g. X, to stand for the random variable and the corresponding lower-case letter, e.g. x, to stand for a value that it takes.
Now, there are two types of random variables.
Let’s discuss them in brief.
Discrete Random Variable
A random variable that can only take certain numerical values (i.e. discrete values) is called a discrete random variable. For example, the number of applicants for a job or the number of accident-free days in one month at a factory.
The function f_{X}(x) = P(X=x), for each x in the range of X, is the probability function (PF) of X – it specifies how the total probability of 1 is divided up amongst the possible values of X and so gives the probability distribution of X. For a discrete random variable this function is more precisely called the probability mass function (pmf).
Note the requirements for a function to qualify as the probability function of a discrete random variable, for all x within the range of X:
Other than the probability function, the cumulative distribution function (CDF) of X is also very important. It is given by:
F_{X}(x) = P(X ≤ x)
which, for all real values of x, gives the probability that X assumes a value that does not exceed x.
The graph of F_{x}(x) against x starts at a height of 0 then increases by jumps as values of x are reached for which P(X=x) is positive. Once all possible values are included F_{x}(x) takes its maximum value of 1. F_{x}(x) is called a step function.
Let’s understand the above concepts more clearly with the help of an example. Suppose we roll a fair die; then its probability distribution would be P(X = x) = 1/6 for x = 1, 2, …, 6.
From this distribution, it can be seen that the sum of all the probabilities is 1. Also, technically we should write the CDF as a step function.
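The step-function CDF of the fair die can be tabulated directly; a small Python sketch:

```python
from fractions import Fraction

# pmf of a fair die: P(X = x) = 1/6 for x = 1..6
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

# CDF: F(x) = P(X <= x), a step function that jumps at each support point
def F(x):
    return sum(p for value, p in pmf.items() if value <= x)

print(F(3))    # 1/2 -> half the probability lies at or below 3
print(F(2.5))  # 1/3 -> flat between jumps: F(2.5) = F(2)
print(F(6))    # 1   -> reaches its maximum value of 1
```

Between consecutive integers the function stays flat, and at each support point it jumps by P(X = x), exactly as described above.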
Now, let’s understand what continuous random variable is.
Continuous Random Variable
A random variable that can take any numerical value within a given range is called a continuous random variable. For example, the temperature of a cup of coffee served at a restaurant or the weight of refuse on a truck arriving at a landfill.
The probability associated with an interval of values, (a, b) say, is represented as P(a < X < b) or P(a ≤ X ≤ b) – these have the same value – and is the area under the curve of the probability density function (pdf) from a to b. So probabilities can be evaluated by integrating the pdf, f_{X}(x). Thus,
The conditions for a function to serve as a pdf are as follows: f_{X}(x) ≥ 0 for −∞ < x < ∞, and ∫_{−∞}^{∞} f_{X}(x) dx = 1.
You should have noticed that these conditions are equivalent to those of the probability function for a discrete random variable, where the summation is replaced by integration for the continuous case.
The cumulative distribution function (CDF) is defined to be the function: F_{X}(x) = P(X ≤ x) = ∫_{−∞}^{x} f_{X}(t) dt.
For a continuous random variable, F_{x}(x) is a continuous, non-decreasing function, defined for all real values of x.
The graph of F(x) can be shown as:
Now we can work out the CDF from pdf. But how can we get back again?
Well, we integrate the pdf, f(x), to get the CDF, F(x), so it makes sense that to go back we differentiate.
We can obtain the pdf, f(x), from the CDF, F(x), as follows: f_{X}(x) = d/dx F_{X}(x).
Like a discrete random variable, a continuous random variable can also be understood more clearly with the help of an example.
Suppose we have the following probability density function: f_{X}(x) = 3x²/7 for 1 ≤ x ≤ 2, and 0 otherwise. It can be seen that at x = 1, f_{X}(x) = 3/7 and at x = 2, f_{X}(x) = 12/7. Therefore, the given pdf is greater than 0 on the interval [1, 2]. Also, if we integrate the pdf, we get ∫_{1}^{2} (3x²/7) dx = [x³/7]_{1}^{2} = 8/7 − 1/7 = 1.
Now, the cumulative distribution function can be found as F_{X}(x) = ∫_{1}^{x} (3t²/7) dt = (x³ − 1)/7 for 1 ≤ x ≤ 2. Hence, we should write the CDF as: F_{X}(x) = 0 for x < 1, (x³ − 1)/7 for 1 ≤ x ≤ 2, and 1 for x > 2.
Now, we should know how to obtain expected values for both, discrete as well as continuous random variables.
Expected values are numerical summaries of important characteristics of the distributions of random variables. So, let’s see how mean and standard deviation of a random variable are obtained.
Mean
E[X] is a measure of the average/centre/location of the distribution of X. It is called the mean of the distribution of X, or just the mean of X, and is usually denoted by μ.
E[X] is calculated by summing (discrete case) or integrating (continuous case) the product
value × probability of assuming that value
over all values which X can assume.
Thus, for the discrete case: E[X] = Σ_{x} x P(X = x), and, for the continuous case: E[X] = ∫ x f_{X}(x) dx.
Variance and Standard Deviation
The variance, σ^{2}, is a measure of the spread/dispersion/variability of the distribution. Specifically, it is a measure of the spread of the distribution about its mean.
Formally, σ^{2} = Var(X) = E[(X − μ)^{2}] is the expected value (or mean) of the squared deviation of X from its mean. The standard deviation, σ, is the positive square root of this – hence the term sometimes used, “root mean squared deviation”.
Simplifying: σ^{2} = E[X^{2}] − (E[X])^{2} = E[X^{2}] − μ^{2}.
If we take our above example of a discrete random variable, it can be shown how mean and variance are calculated. Let’s calculate these values to understand the calculation of expected values more clearly.
∴ Mean = E[X] = 21/6 = 7/2
and Variance = E[X^{2}] – {E[X]}^{2} = 91/6 – (7/2)^{2} = 35/12
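The die calculation above can be verified directly from the pmf, using exact fractions:

```python
from fractions import Fraction

pmf = {x: Fraction(1, 6) for x in range(1, 7)}  # fair die

mean = sum(x * p for x, p in pmf.items())        # E[X]
ex2 = sum(x**2 * p for x, p in pmf.items())      # E[X^2]
variance = ex2 - mean**2                          # Var(X) = E[X^2] - (E[X])^2

print(mean, variance)  # 7/2 and 35/12
```

Using `Fraction` keeps the arithmetic exact, so the result matches the hand computation 21/6 = 7/2 and 91/6 − 49/4 = 35/12 with no rounding.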
Now, there are some linear functions of X also. Consider changing the origin and the scale of X.
Let Y = aX + b and let E[X] = μ. Then
E[Y] = E[aX + b] = aμ + b.
So Y − E[Y] = aX + b − (aμ + b) = a(X − μ), and hence Var(Y) = E[(Y − E[Y])^{2}] = a^{2} Var(X).
These are important results. The results for the expected value can be thought of simply as “whatever you multiply the random variable by or add to it, you do the same to the mean”. However, the addition of a constant to a random variable does not alter the variance.
This should make sense since the variance is a measure of spread and the spread is not altered when the same constant is added to all values. When you multiply the random variable by a constant you multiply the standard deviation by the same value, so the variance is multiplied by that constant squared.
Now, let’s see what a probability distribution is. The probability distribution for a random variable describes how the probabilities are distributed over the values of the random variable. There are two types of probability distributions, discrete and continuous. For a discrete random variable, say X, the probability distribution is defined by a probability mass function.
Whereas for a continuous random variable, say Y, the probability distribution is defined by a probability density function, as discussed above in discrete and continuous random variable heading. There are some special probability distributions.
Two of the most widely used discrete probability distributions are the binomial and Poisson distribution.
First is the Binomial distribution. The probability mass function of the binomial distribution provides the probability that x successes will occur in n trials of a binomial experiment. If X ~ Bin(n, p), then
P(X = x) = C(n, x) p^{x} (1 − p)^{n−x}
for x = 0, 1, 2, …, n and 0 < p < 1.
Here, there are two outcomes, success or failure, possible on each trial, and p denotes the probability of success on any trial. The trials are independent.
Suppose X ~ Bin(5, 0.7).
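The worked figures for this example are not reproduced here; as an illustration, the pmf formula above gives, for instance, the probability of exactly 3 successes:

```python
from math import comb

def binom_pmf(x, n, p):
    # P(X = x) = C(n, x) * p^x * (1-p)^(n-x)
    return comb(n, x) * p**x * (1 - p)**(n - x)

# X ~ Bin(5, 0.7): probability of exactly 3 successes in 5 trials
print(round(binom_pmf(3, 5, 0.7), 4))  # 0.3087
```

That is C(5, 3) · 0.7³ · 0.3² = 10 · 0.343 · 0.09 = 0.3087.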
Second is the Poisson distribution. The Poisson distribution is often used as a model of the number of arrivals at a facility within a given period of time. If X ~ Poi(μ), then
P(X = x) = e^{−μ} μ^{x} / x!
for x = 0, 1, 2, … and μ > 0.
Here, the parameter μ is the mean of the random variable X.
Suppose the mean number of calls arriving in a 15-minute period is 10. Then, to compute the probability that exactly 5 calls arrive within the next 15-minute period, we have μ = 10 and x = 5.
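Plugging these values into the Poisson pmf:

```python
from math import exp, factorial

def poisson_pmf(x, mu):
    # P(X = x) = e^(-mu) * mu^x / x!
    return exp(-mu) * mu**x / factorial(x)

# Mean of 10 calls per 15 minutes; probability of exactly 5 calls
print(round(poisson_pmf(5, 10), 4))  # ≈ 0.0378
```

So only about a 3.8% chance of as few as 5 calls when 10 are expected on average.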
The most widely used continuous probability distribution is the normal distribution. The graph of the normal distribution is a bell-shaped curve. The probabilities for the normal distribution can be computed using statistical tables for the standard normal distribution, which has a mean of 0 and a standard deviation of 1.
The normal distribution depends on two parameters, μ and σ^{2}, which are the mean and variance of the random variable. The probability density function of the normal distribution, if X ~ N(μ, σ^{2}), is given by:
f_{X}(x) = (1/(σ√(2π))) e^{−(x−μ)²/(2σ²)}
for −∞ < x < ∞.
Suppose X ~ N(μ, σ^{2}); then to convert the random variable X from the normal distribution to the standard normal distribution, we have: Z = (X − μ)/σ.
Z is another random variable, which follows a standard normal distribution with mean zero and variance 1.
Suppose X ~ N(50, 3^{2}); then probabilities for X can be computed by standardising in this way and using the standard normal tables.
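The original worked figures are not reproduced here; as an illustration, P(X < 53) for this distribution can be computed by standardising and evaluating Φ via the error function:

```python
from math import erf, sqrt

def std_normal_cdf(z):
    # Phi(z) via the error function: Phi(z) = (1 + erf(z / sqrt(2))) / 2
    return (1 + erf(z / sqrt(2))) / 2

# X ~ N(50, 3^2): standardise, then use the standard normal CDF
mu, sigma, x = 50, 3, 53
z = (x - mu) / sigma                 # Z = (X - mu) / sigma = 1
print(round(std_normal_cdf(z), 4))   # P(X < 53) = Phi(1) ≈ 0.8413
```

The value 0.8413 is exactly the figure a standard normal table gives for z = 1.00.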
There are many other discrete and continuous probability distributions besides the three discussed above. Other discrete probability distributions include the uniform, Bernoulli, geometric, hypergeometric, and negative binomial distributions; other commonly used continuous probability distributions include the uniform, gamma, exponential, chi-square, beta, log-normal, t and F distributions.
So, to summarise, a random variable is a quantity whose value is determined by the outcome of a random experiment, with probabilities attached to the values it can take. There are two types of random variables, discrete and continuous, and both have two basic requirements: the value of the probability function should lie between 0 and 1, and the probabilities should sum (or integrate) to 1. Next, we saw how to obtain the mean and variance of a random variable, followed by some important discrete and continuous probability distributions.
I hope this article has given you a brief understanding of the concept of a random variable.
The post Correlation Analysis Using R appeared first on StepUp Analytics.
Scatterplots are used to visualize the data and get a rough idea of the existence and degree of correlation between two variables. The plot shows the scatter of the points, hence the name scatterplot.
Here, for the birthweight data, we have a look at the scatterplot of Gestation period (in weeks) v/s Birthweight.
The scatterplot indicates a positive degree of correlation between the Gestation period (in weeks) and Birthweight. Now, to quantify this relationship, we use the coefficient of correlation.
The coefficient of linear correlation provides a measure of how well a linear regression model explains the relationship between two variables. To put it simply, it gives a quantification for the relationship between the variables under study.
The degree of association between the x and y values is summarised by the value of an appropriate correlation coefficient, each of which takes values from -1 to +1. A negative value indicates that the two variables move in opposite directions, e.g. the speed of a train and the time taken to reach the destination exhibit a negative correlation. A positive value indicates that the two variables move in the same direction, e.g. the height and weight of a human being.
In this article, we look at three commonly used correlation coefficients: Pearson’s, Spearman’s, and Kendall’s.
Pearson correlation coefficient (also called Pearson’s product-moment correlation coefficient) measures the strength of the linear relationship between two variables.
The correlation between two variables is calculated in R using cor( ) function.
For the Birthweight data, we observe a moderate positive correlation between Gestation period (in weeks) and Birthweight.
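The article computes this with R’s cor(); as a language-agnostic check, here is a small Python sketch of the defining formula r = Σ(x−x̄)(y−ȳ) / √(Σ(x−x̄)² Σ(y−ȳ)²), with made-up gestation/birthweight pairs since the original dataset is not reproduced here:

```python
from math import sqrt

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))  # covariance numerator
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical gestation (weeks) and birthweight (kg) pairs
gestation = [38, 40, 36, 41, 39]
weight = [2.9, 3.4, 2.6, 3.6, 3.1]
print(round(pearson(gestation, weight), 3))  # ≈ 0.991 for these made-up pairs
```

With these illustrative values the points lie almost on a straight line, so r is close to +1; R’s cor(x, y) computes the same quantity.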
Spearman’s rank correlation measures correlation based on the ranks of observations. If data are quantitative, then it is less precise than Pearson’s correlation coefficient as we use actual observations for Pearson’s correlation coefficient which gives more information than their ranks.
Q. Find Spearman’s rank correlation between the Mathematics and Statistics marks scored by 2nd-year college students.
There is a moderate positive correlation between Mathematics and Statistics marks scored by 2nd-year college students.
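The rank-based computation behind this answer can be sketched outside R as well; the marks below are hypothetical (the original dataset is in the linked download), and the sketch assumes no tied ranks:

```python
def ranks(values):
    # rank 1 = smallest; assumes no ties for this illustrative sketch
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

def spearman(x, y):
    # rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)), d_i = rank difference
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n**2 - 1))

# Hypothetical Mathematics and Statistics marks for five students
maths = [56, 75, 45, 71, 61]
stats = [66, 70, 40, 60, 65]
print(spearman(maths, stats))  # 0.6
```

Because only ranks enter the formula, the actual mark values matter only through their ordering; R’s cor(x, y, method = "spearman") applies the same idea (with an adjustment when ties occur).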
Kendall’s rank correlation coefficient τ measures the strength of the rank correlation between two variables. Any pair of observations (Xi, Yi) and (Xj, Yj), where i ≠ j, is said to be concordant if the ranks of both elements agree, i.e. Xi > Xj and Yi > Yj, or Xi < Xj and Yi < Yj; otherwise it is said to be discordant.
Q. Two judges ranked 10 contestants in a fancy dress competition. The ranks are from most favorite to least favorite. Calculate Kendall’s rank correlation
There is a high degree of positive correlation between the ranks given by the 2 judges.
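The concordant/discordant counting that defines τ can be sketched in Python; the four contestants below are a made-up miniature of the ten in the question, and the sketch assumes no tied ranks:

```python
from itertools import combinations

def kendall_tau(x, y):
    # tau = (concordant - discordant) / total pairs, assuming no tied ranks
    concordant = discordant = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        if (xi - xj) * (yi - yj) > 0:
            concordant += 1
        elif (xi - xj) * (yi - yj) < 0:
            discordant += 1
    n = len(x)
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical ranks from two judges for four contestants
judge1 = [1, 2, 3, 4]
judge2 = [1, 3, 2, 4]
print(kendall_tau(judge1, judge2))  # (5 - 1) / 6 ≈ 0.667
```

Here 5 of the 6 pairs are concordant and 1 is discordant; R’s cor(x, y, method = "kendall") performs the same comparison over all pairs.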
The sample correlation coefficient (studied so far) measures the extent of the linear relationship between the two variables (X, Y) for the sample data. The population parameter, ρ, measures the extent of the linear relationship between the variables (X, Y) in the population.
We are usually interested in testing whether the population correlation coefficient is significant or not. The hypothesis is stated as,
H_{0}: ρ = 0 and is tested against one of the following alternatives,
H_{1}: ρ ≠ 0
H_{1}: ρ > 0
H_{1}: ρ < 0
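Under H_{0}: ρ = 0 (and assuming bivariate normality), the Pearson test behind cor.test() is based on the statistic t = r√(n − 2)/√(1 − r²), which follows a t distribution with n − 2 degrees of freedom. A sketch with illustrative numbers (not the article’s data):

```python
from math import sqrt

def t_statistic(r, n):
    # t = r * sqrt(n - 2) / sqrt(1 - r^2), df = n - 2 under H0: rho = 0
    return r * sqrt(n - 2) / sqrt(1 - r**2)

# Illustrative values: sample correlation 0.6 from n = 27 pairs
t = t_statistic(0.6, 27)
print(round(t, 2))  # 3.75; the two-sided 5% critical value on 25 df is about 2.06
```

Since 3.75 exceeds the critical value, H_{0} would be rejected at the 5% level for these illustrative numbers; cor.test() reports the corresponding exact p-value.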
For the Birthweight data, we test the hypothesis:
H_{0}: The population correlation coefficient between birthweight and gestation period is equal to 0
v/s
H_{1}: The population correlation coefficient between birthweight and gestation period is not equal to 0
The test is carried out in R using cor.test() function in R.
As the p-value is less than 0.05, reject H_{0} and conclude that the population correlation coefficient between birthweight and gestation period is not equal to 0.
Since we are using ranks rather than the actual data, no assumption is needed about the distribution of X, Y or (X, Y), i.e. it is a non-parametric test. For the data giving Mathematics and Statistics marks scored by 2nd-year college students, test the following hypothesis:
H_{0}: The population correlation coefficient between Mathematics and Statistics marks is equal to 0
v/s
H_{1}: The population correlation coefficient between Mathematics and Statistics marks is not equal to 0
As p-value < 0.05, reject H_{0} and conclude that the population correlation coefficient between Mathematics and Statistics marks is not equal to 0.
For the data giving ranks of 10 contestants in a fancy dress competition by 2 judges, test the following hypothesis:
H_{0}: The population correlation coefficient between rank given by Judge 1 and Judge 2 is equal to 0
v/s
H_{1}: The population correlation coefficient between rank given by Judge 1 and Judge 2 is not equal to 0
As the p-value < 0.05, reject H_{0} and conclude that the population correlation coefficient between the ranks given by Judge 1 and Judge 2 is not equal to 0.
Data Used in the above example can be downloaded from here
The post Actuarial Science: Graduation and Statistical Test appeared first on StepUp Analytics.
Premium calculation requires various values, such as mortality rates and interest rates, and all of these values are available in the actuarial tables. But have you ever thought about how these values are estimated before being printed in the actuarial table?
The answer to the above question is that a lot of data is collected and processed to calculate these values (mortality rates, probabilities, etc.). The estimated values should be justifiable; if they are not, they are processed further and made suitable for use. Once the processing is done and the best estimates have been calculated, the values can be printed in the actuarial table and used for calculation purposes.
Now, let’s take an example to see whether an estimated value is a good estimate or not, and if not, what technique should be used to get a good estimate. In the example below we have estimated mortality rates, and we want to see whether these rates are justifiable. We will do this by plotting a scatterplot:
Let’s see the R code to plot a scatter plot:
Store the values in a variable and then create a dataframe
age<-c(30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49)
force_of_mortality <- c(0.000557, 0.000645, 0.000497, 0.000474, 0.000372, 0.000757, 0.000701, 0.000618, 0.000738, 0.000709, 0.000921, 0.000964, 0.001285, 0.001367, 0.001804, 0.001942, 0.001963, 0.002537, 0.002471, 0.003011)
Graduation_A<-data.frame(age,force_of_mortality)
Now plot a scatterplot to see the mortality rate movement
install.packages("ggplot2")
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.5.2
ggplot(Graduation_A, aes(y=force_of_mortality, x=age)) + geom_point() + theme_classic()
The above trend of the mortality rates is not quite justifiable. The mortality rate keeps decreasing after age 35 until age 37, then rises and drops again. We know that the mortality rate generally increases gradually with age, but here the rates are not increasing gradually. We need to process these rates further and make them suitable for our purpose.
The technique used to do so is called graduation. Graduation is a technique used to obtain a smooth and justifiable set of mortality rates (called the graduated mortality rates).
There are three methods that are used to carry out graduation: graduation by parametric formula, graduation by reference to a standard table, and graphical graduation.
The aims of graduation are: –
Why do we need smoothness?
Let’s take an example to understand the importance of smoothness. Suppose we are calculating premiums according to our crude estimated mortality rates. We may then notice that the premium sometimes drops as age increases.
This does not sound good for the business, and such a drop in premium with an increase in age is not justifiable. So, using graduation, we calculate smooth rates that are justifiable and appropriate to use. For smoothing, we can make use of the data at adjacent ages to reduce the sampling error at each age. For example, if the force of mortality is smooth and not changing too rapidly, then our estimate of µ_{X} should not be too far away from our estimates of µ_{X-1} and µ_{X+1}, as well as being the ‘best’ estimate, in some sense, of µ_{X}.
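As a toy illustration of “using data at adjacent ages” (not one of the formal graduation methods listed above), a weighted moving average smooths each crude rate using its neighbours; the weights here are an arbitrary choice for demonstration:

```python
# Illustrative smoothing only: real graduations use parametric formulae,
# standard tables, or graphical methods rather than this simple average.

def moving_average_graduation(crude, w=(0.25, 0.5, 0.25)):
    # Smooth each rate using its neighbours; endpoints keep their crude values
    smoothed = list(crude)
    for i in range(1, len(crude) - 1):
        smoothed[i] = w[0] * crude[i - 1] + w[1] * crude[i] + w[2] * crude[i + 1]
    return smoothed

# First few crude forces of mortality from the example above (ages 30-35)
crude = [0.000557, 0.000645, 0.000497, 0.000474, 0.000372, 0.000757]
print(moving_average_graduation(crude))
```

Each smoothed value borrows information from the adjacent ages, which damps the up-and-down wobble seen in the scatterplot while staying close to the crude data.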
Three desirable features of a graduation are: smoothness, adherence to the data, and suitability for the purpose in hand.
We need to make a balance between smoothing and adherence to data. At one extreme, we could easily smooth the crude estimates by ignoring the data altogether; we want to avoid such extremes since we want the graduation to be representative of the experience.
If the graduation process results in rates that are smooth but show little adherence to the data, then we say that the rates are over-graduated. If insufficient smoothing has been carried out, the result is called under-graduation.
Now let’s begin with the practical part.
We will be using the same example that we used above to plot a scatterplot.
We will be performing a chi-square test to check the suitability of our estimates. The chi-squared test is used to assess whether the observed numbers of individuals falling into specified categories are consistent with a model that predicts the expected numbers in each category. It is a test of overall goodness of fit. First, we will calculate the standardised deviations and apply the chi-square test to them. If the chi-square test is passed, we will then check for the defects which the chi-square test fails to detect. These defects are: a few large individual deviations (outliers), overall bias, and clumping or grouping of deviations of the same sign.
Now let’s understand what the standardised deviation is and how it is calculated. The standardised deviation at age x is
Z_{x} = (D_{x} − E_{x}μ_{x}) / √(E_{x}μ_{x})
where D_{x} is the observed number of deaths, E_{x} is the central exposed to risk and μ_{x} is the graduated force of mortality. The theoretical result linking Z_{x} and the chi-square distribution is that, if the graduated rates are correct, each Z_{x} is approximately N(0, 1), and the sum of the Z_{x}^{2} values approximately follows a chi-square distribution.
The hypotheses tested during graduation are:
H_{0}: There are no significant differences between the two sets of rates.
H_{1}: There is a significant difference between the two sets of rates.
Now, let’s calculate Standardised Deviation (Z_{X}) and Z_{X}^{2}. R code for that is: –
First, add all the remaining entries in Graduation_A’s dataframe before calculating Standardised deviation
Central_expose_risk<-c(70000,66672,68375,65420,61779,66091,68514, 69560, 65000, 66279,67300,65368,65391, 62917,66537,62302,62145, 63856, 61097, 61110)
death<-c(39,43,34,31,23,50,48,43,48,47,62,63,84,86,120,121,122,162,151,184)
graduated_force_of_mortality<- c(.000388,.000429,.000474,.000524,.000579, .000640,.000708,.000782,.000865,.000956,.001056,.001168,.001291,.001427, .001577,.001743,.001962,.002129,.002353,.002601)
Graduation_A<-data.frame(age,Central_expose_risk,death, force_of_mortality,graduated_force_of_mortality)
Now calculate Standardised Deviation (Z_{X}^{2})
product<-round(Graduation_A$Central_expose_risk*Graduation_A$graduated_force_of_mortality,2)
Graduation_A<-data.frame(Graduation_A,product)
standard_deviation<-round((death-product)/(sqrt(product)),2)
Graduation_A<-data.frame(Graduation_A,standard_deviation)
square_standardised_deviation<-round(standard_deviation**2,2)
Graduation_A<-data.frame(Graduation_A,square_standardised_deviation)
View(Graduation_A)
The final step is to perform the chi-square test.
R code to calculate the p-value of chi-square test is: –
1-pchisq(sum(square_standardised_deviation),df=18)
## [1] 0.0007645671
Here we took 18 degrees of freedom because it was stated in the question that 2 parameters were estimated; with 20 age groups, estimating 2 parameters reduces the degrees of freedom by 2.
The p-value comes out to be less than 0.05, so we have sufficient evidence to reject the null hypothesis. Since the graduation fails the chi-square test, we will not go on to check for the defects.
Now, Let’s take another example: –
Perform similar steps that we have performed above.
age<-c(30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49)
Central_expose_risk<-c(70000,66672, 68375, 65420, 61779, 66091, 68514, 69560, 65000, 66279, 67300, 65368, 65391, 62917, 66537, 62302, 62145, 63856, 61097,61110)
death<-c(39,43,34,31,23,50,48,43,48,47,62,63,84,86,120,121,122,162,151,184)
force_of_mortality <- c(0.000557,.000645,.000497,.000474,.000372, .000757, .000701, .000618, .000738, .000709, .000921, .000964, .001285, .001367, .001804, .001942, .001963, .002537, .002471, .003011)
graduated_force_of_mortality <-c(0.000555, .000658, .000488, .000432, .000486, .000596, .000685, .000713, .000709, .000733, .000831, .001015, .001259, .001494, .001679, .001866, .002134, .002423, .002498, .003008)
Graduation_B <- data.frame(age,Central_expose_risk, death, force_of_mortality, graduated_force_of_mortality)
All the values have been entered. Now calculate the required values
product <-round(Graduation_B$Central_expose_risk*Graduation_B$graduated_force_of_mortality, 2)
Graduation_B<-data.frame(Graduation_B, product)
standard_deviation<-round((death-product)/(sqrt(product)),2)
Graduation_B<-data.frame(Graduation_B, standard_deviation)
square_standardised_deviation <- round(standard_deviation**2, 2)
Graduation_B <-data.frame(Graduation_B, square_standardised_deviation)
View(Graduation_B)
Performing Chi-Square Test
1-pchisq(sum(square_standardised_deviation), df=10)
## [1] 0.4936845
The p-value comes out to be greater than 0.05, so we have insufficient evidence to reject the null hypothesis.
Now, we will check for the defects
shapiro.test(square_standardised_deviation)
##
## Shapiro-Wilk normality test
##
## data: square_standardised_deviation
## W = 0.69161, p-value = 3.086e-05
As the p-value (3.086e-05) is less than 0.05, we have sufficient evidence to reject the null hypothesis of normality and conclude that the distribution of the squared standardised deviations is significantly different from a normal distribution.
Next, a sign test to check for overall bias. The hypothesis for the test is:
H_{0}: P ~ Binomial(m, 0.5)
where P denotes the number of positive deviations and m the total number of deviations.
a=length(Graduation_B$standard_deviation)
i=1
c=0
while (i<=a) {
if(Graduation_B$standard_deviation[i]>0){
c=c+1
}
i=i+1
}
print(c)
## [1] 12

binom.test(x=c, n=20, p=.5, alternative = "two.sided")
Output
##
## Exact binomial test
##
## data: c and 20
## number of successes = 12, number of trials = 20, p-value = 0.5034
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
## 0.3605426 0.8088099
## sample estimates:
## probability of success
## 0.6
As the p-value is greater than 0.05, we have insufficient evidence to reject the null hypothesis. We can conclude that the graduation is not biased overall.
Serial Correlation test
cor.test(y=standard_deviation, x=age)

##
## Pearson’s product-moment correlation
##
## data: age and standard_deviation
## t = -0.089788, df = 18, p-value = 0.9294
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4593781 0.4253448
## sample estimates:
## cor
## -0.02115851
The p-value is greater than 0.05, so we have insufficient evidence to reject the null hypothesis. We can conclude that the deviations are not significantly correlated with age, i.e. they are not grouped in clusters.
I hope you have learned something new through this article. That’s all from my side. In case of any doubt, please comment below and we will get back to you soon. Thank you!
The post Theory of Estimation Or What is Estimation appeared first on StepUp Analytics.
Population: A group of individuals under study is called a population. The population may be finite or infinite. Eg. All the registered voters in India.
Sample: A finite subset of the statistical individuals in a population is called a sample. Eg. Selecting some voters from all registered voters.
Parameter: The statistical constants of the population such as mean (μ), variance (σ^{2}) etc. Eg. Mean of income of all the registered voters.
Statistic: The statistical constants of the sample, such as the mean (X̄), variance (s^{2}), etc. In other words, any function of the random sample x_{1}, x_{2}, …, x_{n} is called a statistic.
Estimator: If a statistic is used to estimate an unknown parameter θ of the distribution, then it is called an estimator. Eg. Sample mean is an estimator of population mean.
Estimate: A particular value of the estimator is called an estimate of an unknown parameter. Eg. Mean income of selected voters is ₹25000 which represents mean income of all the registered voters.
Sampling Distribution: The probability distribution of a statistic, over all possible samples, is called its sampling distribution. Eg. If we want the average height of a voter, we can randomly select some of them and use the sample mean to estimate the population mean.
Standard Error: The standard deviation of the sampling distribution of a statistic is known as its standard error and is denoted by ‘s.e.’ Eg. If we want to know the variability of the height of voters, then standard error is used.
Now, before discussing the different methods of finding estimates of an unknown population parameter, it is important to know the characteristics of a good estimator. Here, “a good estimator” is one which is as close to the true value of the parameter as possible. The following are some of the criteria that should be satisfied by a good estimator:
Unbiasedness
This is a desirable property of a good estimator. An estimator T_{n }is said to be an unbiased estimator of γ (θ), where γ (θ) is a function of unknown parameter θ, if the expectation of the estimator is equal to the population parameter, i.e.,
E [T_{n}] = γ (θ)
Example: If X ~ N (μ, σ^{2}), the sample mean X̄ is an unbiased estimator of μ, since E[X̄] = μ.
Consistency
An estimator is said to be consistent if increasing the sample size produces an estimate with smaller standard error (standard deviation of the sampling distribution of the statistic). In other words, as the sample size increases, it becomes almost certain that the value of the statistic will be very close to the true value of the parameter. Example: The sample mean is a consistent estimator of the population mean, since as the sample size n→∞, the sample mean converges to the population mean in probability and the variability of the sample mean tends to 0.
Efficiency
There is a necessity of some further criterion which will enable us to choose between the estimators, with the common property of consistency. Such a criterion which is based on the variances of the sampling distribution of estimators is usually known as efficiency.
It refers to the size of the standard error of the statistic. If two statistics computed from samples of the same size are compared, the one with the smaller standard error (standard deviation of its sampling distribution) is the better estimator and will be selected.
If T_{1} is the most efficient estimator, with variance V_{1}, and T_{2} is any other estimator, with variance V_{2}, then the efficiency E of T_{2} is given by:
E = V_{1}/V_{2}
[∵ efficiency and variance are inversely proportional]
Sufficiency
An estimator is said to be sufficient for a parameter, if it contains all the information in the sample regarding the parameter.
If T_{n} is an estimator of parameter θ, based on a sample x_{1}, x_{2},…, x_{n} of size n from the population with density f(x,θ), such that the conditional distribution of x_{1}, x_{2},…, x_{n} given T_{n}, is independent of θ, then T_{n} is sufficient estimator for θ.
Methods of Point Estimation
So far we have been discussing the requisites of a good estimator. Now we shall briefly outline some of the important methods of obtaining such estimators. Commonly used methods are:
Method of Moments (MoM)
The basic principle is to equate population moments (i.e. the means, variances, etc. of the theoretical model) to the corresponding sample moments (i.e. the means, variances, etc. of the sample data observed) and solve for the parameter(s).
Let x_{1}, x_{2}, …, x_{n} be a random sample from any distribution f(x,θ) which has m unknown parameters θ_{1}, θ_{2}, …, θ_{m}, where m ≤ n. Then the moment estimators θ ̂ _{1}, θ ̂ _{2}, …, θ ̂ _{m }are obtained by equating the first m sample moments to the corresponding m population moments and then solving for θ_{1}, θ_{2}, …, θ_{m}.
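For instance, for an Exponential(λ) distribution the first population moment is E[X] = 1/λ, so equating it to the sample mean gives the moment estimator λ̂ = 1/x̄. A quick Python sketch (with illustrative numbers):

```python
import random
import statistics

def mom_exponential_rate(sample):
    """Method-of-moments estimate of the Exponential rate:
    set the population mean 1/lambda equal to the sample mean and solve."""
    return 1.0 / statistics.fmean(sample)

rng = random.Random(1)
true_rate = 2.0
sample = [rng.expovariate(true_rate) for _ in range(10_000)]
estimate = mom_exponential_rate(sample)
print(round(estimate, 1))  # close to the true rate 2.0
```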
Method of Maximum Likelihood Estimation (MLE)
MLE is widely regarded as the best general method of finding estimators. In particular, MLEs usually have easily determined asymptotic properties and are especially good in large-sample situations. "Asymptotic" here just means that the samples are very large.
Let x_{1}, x_{2}, …, x_{n} be a random sample from a population with density f(x,θ). The likelihood function of the observed sample, as a function of θ, is given by:
L(θ) = f(x_{1},θ) f(x_{2},θ) … f(x_{n},θ) = ∏_{i=1}^{n} f(x_{i},θ)
Notice that the likelihood function is a function of the unknown parameter θ. So different values of θ would give different values for the likelihood. The maximum likelihood approach is to find the value of θ that would have been most likely to give us the particular sample we got. In other words, we need to find the value of θ that maximizes the likelihood function. In most cases, taking logs greatly simplifies the determination of the MLE θ ̂. Differentiating the likelihood or log likelihood with respect to the parameter and setting the derivative to 0 gives the MLE for the parameter.
It is necessary to check, either formally or through simple logic, that the turning point is a maximum. The formal approach would be to check that the second derivative is negative.
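To make the procedure concrete, here is a small Python sketch (illustrative, reusing the 0.35 coin-toss probability mentioned earlier): it maximizes the Bernoulli log-likelihood over a fine grid and confirms the result matches the closed-form MLE, the sample mean.

```python
import math
import random

def bernoulli_loglik(p, data):
    """Log-likelihood l(p) = sum_i [x_i log p + (1 - x_i) log(1 - p)]."""
    return sum(x * math.log(p) + (1 - x) * math.log(1 - p) for x in data)

rng = random.Random(7)
data = [1 if rng.random() < 0.35 else 0 for _ in range(500)]

# Maximize over a fine grid; differentiating and setting dl/dp = 0
# gives the closed form p_hat = sample mean.
grid = [i / 1000 for i in range(1, 1000)]
p_mle = max(grid, key=lambda p: bernoulli_loglik(p, data))
p_closed = sum(data) / len(data)
print(abs(p_mle - p_closed) < 1e-6)  # → True
```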
Method of Minimum Variance
This method seeks the Minimum Variance Unbiased Estimator (MVUE): as the name itself depicts, an estimator which is unbiased as well as having minimum variance.
If a statistic T_{n} based on a sample of size n is such that:
(i) T_{n} is unbiased for θ, i.e. E(T_{n}) = θ, and
(ii) the variance of T_{n} is smallest among all unbiased estimators of θ,
then T_{n} is called the minimum variance unbiased estimator of θ.
Method of Least Squares
The principle of least squares is used to fit a curve of the form:
y = f(x; θ_{1}, θ_{2}, …, θ_{n})
where the θ_{i}'s are unknown parameters, to a set of n sample observations (x_{i}, y_{i}); i = 1, 2, …, n from a bivariate population. It consists of minimizing the sum of squares of residuals,
R = Σ_{i} [y_{i} − f(x_{i}; θ_{1}, θ_{2}, …, θ_{n})]²
subject to variations in θ_{1}, θ_{2}, …, θ_{n}. The normal equations for estimating θ_{1}, θ_{2}, …, θ_{n} are obtained by setting the partial derivatives ∂R/∂θ_{j} = 0 for j = 1, 2, …, n.
Confidence Intervals and Confidence Limits
A confidence interval provides an 'interval estimate' for an unknown population parameter. It is designed to contain the parameter's value with some stated probability. The width of the interval provides a measure of the precision of the estimator involved.
Let x_{i}, i = 1, 2, … n be a random sample of size n from f(x,θ). If T_{1}(x) and T_{2}(x) be any two statistics such that T_{1}(x) ≤ T_{2}(x) then,
P(T_{1}(x) < θ < T_{2}(x)) = 1 – α
where α is level of significance, then the random interval (T_{1}(x), T_{2}(x)) is called 100(1-α)% confidence interval for θ.
Here, T_{1} is called lower confidence limit and T_{2} is called upper confidence limit. (1-α) is called the confidence coefficient.
Usually, the value of α is taken as 5% in the testing of hypothesis. Thus, if α = 5%, there is a 95% chance that the confidence interval contains the true parameter value.
Interval estimate = Point estimate ± Margin of Error
The margin of error is the amount of random sampling error: the range of values above and below the sample statistic.
Margin of Error = Critical Value * Standard Error of the statistic
Here, a critical value is the point (or points) on the scale of the test statistic beyond which we reject the null hypothesis, and is derived from the level of significance α of a particular test into consideration.
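Putting the last few formulas together, here is a Python sketch (with illustrative data) that builds a large-sample 95% confidence interval for a mean as point estimate ± critical value × standard error:

```python
import math
import statistics

def mean_confidence_interval(sample, z=1.96):
    """Large-sample 95% CI for a population mean:
    point estimate +/- critical value * standard error of the mean."""
    m = statistics.fmean(sample)
    se = statistics.stdev(sample) / math.sqrt(len(sample))
    margin = z * se                      # margin of error
    return m - margin, m + margin

sample = [4.1, 5.0, 4.7, 5.3, 4.9, 5.1, 4.6, 5.2, 4.8, 5.0]
lo, hi = mean_confidence_interval(sample)
print(round(lo, 2), round(hi, 2))
```

For a sample this small a t critical value would be more accurate; z = 1.96 keeps the sketch aligned with the large-sample formula in the text.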
Confidence intervals are not unique. In general, they should be obtained via the sampling distribution of a good estimator, in particular, the MLE. Even then there is a choice between one-sided and two-sided intervals and between equal-tailed and shortest length intervals although these are often the same.
So, we have learned what estimation is, i.e., the process of providing a numerical value for an unknown population parameter. To be a good estimator of the population parameter, an estimator should have the characteristics discussed above: unbiasedness, consistency, efficiency and sufficiency.
There are different methods of finding estimates, such as the method of moments, MLE, minimum variance and least squares. Of these, MLE is considered the best general method of finding estimates.
Also, there are two types of estimations, point and interval estimation. Point estimation provides a single value to the estimate, whereas, interval estimation provides confidence interval which is likely to include the unknown population parameter.
Hence, now you have the basic understanding about the theory of estimation.
The post Theory of Estimation Or What is Estimation appeared first on StepUp Analytics.
The post Lasso And Elastic Net Regression appeared first on StepUp Analytics.
Ridge regression, for example, shrinks the coefficients towards zero, but it does not set any of them exactly to zero, and so it does not perform feature selection. In this article, I introduce two methods, lasso and elastic net regression, which deal with these issues very well and perform both variable selection and regularization.
Lasso (or least absolute shrinkage and selection operator) is a regression analysis method that uses L1 regularization and penalizes the absolute size of the regression coefficients, similar to ridge regression. In addition, it is capable of reducing the variability and improving the accuracy of linear regression models. Lasso regression differs from ridge regression in that it uses absolute values in the penalty function instead of squares. This leads to penalizing the regression coefficients such that some of the parameter estimates turn out exactly zero. Hence, much like the best subset selection method, lasso performs variable selection among the given variables.
The tuning parameter lambda is chosen by cross-validation. When lambda is small, the result is essentially the least squares (OLS) estimates. As lambda increases, shrinkage occurs and the less important features' coefficients shrink to zero, thus removing some features altogether.
So, a major advantage of lasso is that it is a combination of both shrinkage and selection of variables. In cases of a very large number of features, lasso allows us to efficiently find the sparse model that involves a small subset of the features.
The cost function is given below; the second term is the L1 regularization penalty:
minimize Σ_{i} (y_{i} − β_{0} − Σ_{j} β_{j} x_{ij})² + λ Σ_{j} |β_{j}|
The method was proposed by Professor Robert Tibshirani, then at the University of Toronto, Canada. He said, "The Lasso minimizes the residual sum of squares subject to the sum of the absolute value of the coefficients being less than a constant. Because of the nature of this constraint, it tends to produce some coefficients that are exactly 0 and hence gives interpretable models".
In his article titled Regression Shrinkage and Selection via the Lasso, Tibshirani tells us about this technique with respect to various other statistical models such as subset selection and ridge regression. He goes on to say that “lasso can even be extended to generalized regression models and tree-based models. In fact, this technique provides possibilities for even conducting statistical estimations.”
Traditional methods like cross-validation, stepwise regression to handle overfitting and perform feature selection works well with a small set of features but penalized regression techniques are a great alternative when we are dealing with a large set of features.
Lasso was originally formulated for least squares models and this simple case reveals a substantial amount about the behavior of the estimator, including its relationship to ridge regression and best subset selection and the connections between lasso coefficient estimates and so-called soft thresholding. It also reveals that (like standard linear regression) the coefficient estimates need not be unique if covariates are collinear. [Source: Wikipedia]
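The soft-thresholding connection mentioned above is easy to state: with standardized, orthonormal predictors, the lasso estimate of each coefficient is the OLS estimate passed through the soft-thresholding operator. A minimal Python sketch:

```python
def soft_threshold(z, lam):
    """Soft-thresholding operator S(z, lam): shrinks z toward zero by lam
    and sets it exactly to zero whenever |z| <= lam. Under an orthonormal
    design this is the lasso update applied to the OLS coefficient z."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

print([soft_threshold(z, 0.5) for z in (-2.0, -0.3, 0.2, 1.5)])
# → [-1.5, 0.0, 0.0, 1.0]  (small coefficients zeroed, large ones shrunk)
```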
Though originally defined for least squares, lasso regularization is easily extended to a wide variety of statistical models including generalized linear models, generalized estimating equations, proportional hazards models, and M-estimators, in a straightforward fashion. Lasso’s ability to perform subset selection relies on the form of the constraint and has a variety of interpretations including in terms of geometry, Bayesian statistics, and convex analysis. [Source: Wikipedia]
As discussed above, lasso can set coefficients to zero, while ridge regression, which appears superficially similar, cannot. This is due to the difference in the shape of the constraint boundaries in the two cases.
From the figure, one can see that the constraint region of lasso regression is a rotated square whose corners lie on the axes, while the constraint region of ridge regression is a sphere, which is rotationally invariant and therefore has no corners. The elliptical contours of the least-squares objective are likely to first touch the lasso constraint region at one of its corners, where some components of the coefficient vector are identically zero. For the sphere, the point of contact can fall anywhere on the boundary, no component is distinguished from the others, and the solution is unlikely to have components exactly equal to zero.
In the case of ML, both ridge regression and Lasso find their respective advantages. Both these techniques tackle overfitting, which is generally present in a realistic statistical model. It all depends on the computing power and data available to perform these techniques on statistical software. Ridge regression is faster compared to lasso but then again lasso has the advantage of completely reducing unnecessary parameters in the model.
One important limitation of lasso regression is that, for grouped variables, the lasso fails to do grouped selection. It tends to select one variable from a group and ignore the others.
Elastic-net is a mix of both L1 and L2 regularizations: a penalty is applied both to the sum of the absolute values of the coefficients and to the sum of their squared values. In the parameterization used by glmnet, the penalty is:
λ [ α Σ_{j} |β_{j}| + (1 − α)/2 Σ_{j} β_{j}² ]
Lambda is a shared penalization parameter, while alpha sets the ratio between L1 and L2 regularization in the elastic net. Hence, we expect a hybrid behavior between L1 and L2 regularization: though coefficients are cut, the cut is less abrupt than with lasso penalization alone. The alpha hyper-parameter lies between 0 and 1 and controls how much L2 or L1 penalization is used. The usual approach to optimizing the lambda hyper-parameter is through cross-validation, by minimizing the cross-validated mean squared prediction error, but in elastic net regression the optimal lambda also depends upon the alpha hyper-parameter.
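To make the mixing explicit, here is a small Python sketch of the penalty in the glmnet parameterization (an assumption here, since parameterizations vary across libraries):

```python
def elastic_net_penalty(beta, lam, alpha):
    """Elastic net penalty, glmnet-style:
    lam * [alpha * sum|b_j| + (1 - alpha)/2 * sum b_j^2].
    alpha = 1 recovers the lasso penalty, alpha = 0 the ridge penalty."""
    l1 = sum(abs(b) for b in beta)
    l2 = sum(b * b for b in beta)
    return lam * (alpha * l1 + (1 - alpha) / 2 * l2)

beta = [0.5, -1.0, 2.0]
print(elastic_net_penalty(beta, lam=0.1, alpha=1.0))  # lasso part only: 0.1 * 3.5
print(elastic_net_penalty(beta, lam=0.1, alpha=0.0))  # ridge part only: 0.1 * 5.25 / 2
```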
This article takes a cross-validated approach that uses the grid search to find the optimal alpha hyper-parameter while also optimizing the lambda hyper-parameter for the data set.
In my previous article, I used the glmnet package to show the ridge regression in R. In this article, I have used the caret package for better comparison between the techniques.
Loading the MASS package to get the data set
library (MASS)
data <- Boston
Splitting the dataset into training and testing data
train <- data [1:400,]
test <- data [401:506,]
Setting up a grid range of lambda values
lambda <- 10^seq (-3, 3, length = 100)
Loading the required libraries
library (tidyverse)
library (caret)
library (Metrics)
We fit the ridge regression model on the training data using k fold cross validation
set.seed (123)
ridge <- train (
medv ~., data = train, method = "glmnet",
trControl = trainControl("cv", number = 10),
tuneGrid = expand.grid(alpha = 0, lambda = lambda))
plot(ridge$finalModel, xlab = "L2 Norm")
Displaying the regression coefficients below
coef (ridge$finalModel, ridge$bestTune$lambda)
We save the predicted values of the response variable in a vector prediction_ridge
prediction_ridge <- predict (ridge, test)
Saving the RMSE, SSE and MAPE values in Accuracy_ridge
Accuracy_ridge <- data.frame (
RMSE = RMSE (prediction_ridge, test$medv),
SSE = sse (test$medv, prediction_ridge),
MAPE = mape (test$medv, prediction_ridge))
The only difference between the R code used for ridge and lasso regression is that for lasso regression, we need to specify the argument alpha = 1 instead of alpha = 0 (for ridge regression).
Now executing the Lasso Regression
set.seed (123)
lasso <- train (
medv ~., data = train, method = "glmnet",
trControl = trainControl("cv", number = 10),
tuneGrid = expand.grid(alpha = 1, lambda = lambda))
plot(lasso$finalModel, xlab = "L1 Norm")
If we look at the plot, the x-axis is the maximum permissible value the L1 norm can take. So when we have a small L1 norm, we have a lot of regularization. Therefore, an L1 norm of zero gives an empty model, and as you increase the L1 norm, variables will “enter” the model as their coefficients take non-zero values.
Displaying the regression coefficients below
coef (lasso$finalModel, lasso$bestTune$lambda)
We save the predicted values of the response variable in a vector prediction_lasso
prediction_lasso <- predict (lasso, test)
Saving the RMSE, SSE and MAPE values in Accuracy_lasso
Accuracy_lasso <-data.frame (
RMSE = RMSE (prediction_lasso, test$medv),
SSE = sse (test$medv, prediction_lasso),
MAPE = mape (test$medv, prediction_lasso))
The elastic net regression models do not require us to mention a specific value of lambda and alpha. We use caret package to automatically select the best tuning parameters alpha and lambda. The caret package tests a range of possible alpha and lambda values, and then selects the best values for lambda and alpha, resulting in a final model that is an elastic net model.
Now executing the Elastic Net Regression
set.seed (123)
elasticnet <- train (
medv ~., data = train, method = "glmnet",
trControl = trainControl("cv", number = 10), tuneLength = 10)
plot(elasticnet$finalModel, xlab = "Elasticnet Regularization")
Displaying the regression coefficients below
coef (elasticnet$finalModel, elasticnet$bestTune$lambda)
We save the predicted values of the response variable in a vector predictions_elasticnet
predictions_elasticnet <- predict (elasticnet, test)
Saving the RMSE, SSE and MAPE values in Accuracy_elasticnet
Accuracy_elasticnet <-data.frame (
RMSE = RMSE (predictions_elasticnet, test$medv),
SSE = sse (test$medv, predictions_elasticnet),
MAPE = mape (test$medv, predictions_elasticnet))
We finally bring the RMSE, SSE and MAPE values of the three regression techniques in a dataframe Accuracy.
Accuracy <- rbind.data.frame (Accuracy_ridge = Accuracy_ridge, Accuracy_lasso = Accuracy_lasso, Accuracy_elasticnet = Accuracy_elasticnet)
Accuracy
Here both lasso and elastic net regression do a great job of feature selection in addition to shrinkage. On the other hand, the lasso achieves poorer accuracy here. This is because there is a high degree of collinearity among the features. Further, the lasso solution need not be unique when the number of predictors exceeds the number of observations, whereas ridge regression can handle this case.
From our example we see that penalized regression models performed much better than the multiple linear regression models. But it can be said that Lasso regression performs better than ridge in scenarios with many noise predictors and worse in the presence of correlated predictors. Elastic net, is a hybrid of the two, and performs well in all these scenarios.
The post Ridge Regression and Its Application appeared first on StepUp Analytics.
The OLS approach works quite well when assumptions such as a linear relationship, no autocorrelation, homoscedasticity, more observations than variables, normally distributed residuals and little or no multicollinearity are fulfilled.
But in many real-life scenarios, these assumptions are violated. In those cases, we need to find alternative approaches to provide solutions. Penalized/Regularized regression techniques such as ridge, lasso and elastic net regression work very well in these cases. In this article, I have tried to explain the ridge regression technique which is a way of creating regression models when the number of predictor variables of a dataset is more than the number of observations or when the data suffers from multicollinearity (independent variables are highly correlated).
Regularization methods provide a means to control our regression coefficients, which can help to reduce the variance and decrease the sampling error. Ridge regression belongs to a class of regression tools that use L2 regularization, which works as a small addition to the OLS objective that penalizes large coefficients to make the parameter estimates more stable. The L2 penalty, which equals the square of the magnitude of the coefficients, is given by:
λ Σ_{j} β_{j}²
And the penalized regression objective is given by:
minimize Σ_{i} (y_{i} − β_{0} − Σ_{j} β_{j} x_{ij})² + λ Σ_{j} β_{j}²
The amount of the penalty can be fine-tuned using a constant called lambda (λ). Selecting a good value for λ is critical. When λ = 0, the penalty term has no effect and ridge regression produces classical least square coefficients. If λ = ∞, the impact of the penalty grows and all coefficients are shrunk to zero. The ideal penalty is therefore somewhere in between 0 and ∞.
In this way, ridge regression puts constraints on the magnitude of the coefficients, which helps to reduce their magnitude and fluctuations and progressively shrinks them towards zero. This definitely helps to reduce the variance of the model. The outcome is typically a model that fits the training data less well than OLS but generalizes better because it is less sensitive to extreme variance in the data, such as outliers.
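The shrinkage effect can be seen directly in the one-predictor closed form (a simplified Python sketch, not the glmnet implementation): the ridge coefficient is the OLS numerator divided by Σx² + λ, so it moves toward zero as λ grows.

```python
def ridge_coef_1d(x, y, lam):
    """Closed-form ridge estimate for one centered predictor, no intercept:
    beta_hat = sum(x_i * y_i) / (sum(x_i^2) + lam).
    lam = 0 recovers OLS; larger lam shrinks beta_hat toward zero."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + lam)

x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [-4.1, -1.9, 0.1, 2.2, 3.9]          # roughly y = 2x
for lam in (0.0, 1.0, 10.0):
    print(lam, round(ridge_coef_1d(x, y, lam), 3))
```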
Note that, in contrast to the ordinary least square regression, ridge regression is highly affected by the scale of the predictors. Therefore, it is better to standardize (i.e., scale) the predictors before applying the ridge regression so that all the predictors are on the same scale.
Advantages and Disadvantages Of Ridge Regression
Below is a link to a website where you can find the mathematical and geometric interpretation of ridge regression: More Info
Loading the MASS package to get the data set
library (MASS)
data <- Boston
Splitting the dataset into training and testing data
train <- data [1:400,]
test <- data [401:506,]
Loading libraries required for Ridge regression
library(tidyverse)
library(caret)
library (glmnet)
library (MASS)
library (Metrics)
We need to know the glmnet package
For more details about this package: More Info
There is another function lm.ridge () in MASS package which can also be used. Please see the link below for more details about the function. More Info [Page Number: 79]
Preparing the training data set for training the regression model
x.train <- model.matrix (medv~., train) [,-1]
We save the response variable housing price in a vector y.train
y.train <- train$medv
We need to find the best value for lambda for the given data set with the function cv.glmnet()
set.seed (123)
cv <- cv.glmnet (x.train, y.train, alpha = 0)
plot (cv)
Displaying the best lambda value
cv$lambda.min
We fit the final model on the training data by adding the best lambda value.
model_ridge <- glmnet (x.train, y.train, alpha = 0, lambda = cv$lambda.min)
Displaying the regression coefficients below
coef (model_ridge)
Preparing the test data set to be used as a data matrix and discarding the intercept for predicting the values of the response variable.
x.test <- model.matrix (medv ~., test)[,-1]
We save the predicted values of the response variable Housing price in a vector prediction_ridge
prediction_ridge <- as.vector(predict(model_ridge,x.test))
Saving the RMSE, SSE and MAPE value of the predicted values of the test data set in Accuracy_ridge
Accuracy_ridge <- data.frame(
RMSE = RMSE (prediction_ridge, test$medv),
SSE = sse (test$medv, prediction_ridge),
Mape = mape (test$medv, prediction_ridge))
Now we fit the multiple linear regression model on the training data set
names (train)
model_lm <- lm (medv ~ crim+zn+indus+chas+nox+rm+age+dis+rad+tax+ptratio+black+lstat, data=train)
From the summary of the model, we can find the p-values of the individual predictor variables and decide which variables should be kept in the model
summary (model_lm)
model_lm <- lm (medv ~ crim+zn+nox+rm+dis+rad+tax+ptratio+lstat, data=train)
summary (model_lm)
We need to check the multicollinearity with the help of the function vif () from car package.
vif (model_lm)
We also need to exclude the predictor variables with high vif values to avoid multicollinearity. Though we may allow multicollinearity up to a certain level.
model_lm <- lm (medv ~ crim+zn+nox+rm+dis+rad+ptratio+lstat, data=train)
Below I have mentioned the summary of the updated final model with all the significant variables and the vif values of the variables. The values of the R square and adjusted R square are pretty close, which also shows that the present predictor variables in the model are pretty significant.
summary (model_lm)
vif (model_lm)
We compute the prediction of the test data set with multiple linear regression which was trained using the training dataset
prediction_lm <- predict (model_lm, test [,-14])
We find out the RMSE, SSE, and MAPE of the regression model and save them in Accuracy_lm
Accuracy_lm <-data.frame (
RMSE = RMSE (prediction_lm, test$medv),
SSE = sse (test$medv, prediction_lm),
MAPE = mape (test$medv, prediction_lm))
We save the RMSE, SSE and MAPE values of both linear and ridge regression models in Accuracy.
Accuracy <- rbind.data.frame (Accuracy_ridge = Accuracy_ridge, Accuracy_lm = Accuracy_lm)
Accuracy
From the Accuracy mentioned above, it is clear that even though the least square estimates are unbiased; the accuracy of the model is compromised. By adding a degree of bias to the regression estimates, ridge regression reduces the standard errors. But with other models like the lasso and elastic net regression, we have a possibility of getting a better accuracy value.
This is because complicated models tend to overfit the training data. In my next article, I will introduce you to lasso and elastic net regression and explain the comparative advantage of using these models over multiple linear or ridge regression models.
To learn more on Statistics for Data Science Read
The post Queueing Theory and Its Application appeared first on StepUp Analytics.
In this article, we will learn about Queueing Theory and its practical applications. We all have experienced the annoyance of having to wait in a queue. We wait in line at supermarkets to check out, we wait in line in banks and post offices and we wait in line at fast food restaurants. But we as customers do not like waiting. And the managers of these establishments also don’t like their customers to wait as it may cost them business.
So the first question that arises is that “Why is there waiting?”
To which the answer is that there is more demand for service than there is available facility for that service.
And “Why is this so?”
For which there could be a number of reasons such as, shortage of available servers, limitation of space, economic limitations etc.
These limitations can be removed with the expenditure of capital. And to know how much service should then be made available, one needs to know:
Queuing theory attempts to answer these questions through detailed mathematical analysis. The ultimate goal is to achieve an economic balance between cost of service and the cost associated with the waiting for that service.
A queuing system can be described as one in which customers arrive for service, wait for service if it is not immediately available and, having been served, leave the system.
The term ‘customer’ is used in general sense and does not imply necessarily a human customer. For example, a customer can be a computer program waiting to be run or an airplane waiting in line to take off.
Queuing Theory was developed to provide models to predict the behavior of systems that attempt to provide service for randomly arising demands.
For defining the characteristics we’ll first explain the following terms:
In the context of above, there are six basic characteristics of a queuing process that provide an adequate description of a queuing system
1. Arrival Pattern: In general situations, the process of arrivals is random (stochastic). It is, therefore, necessary to know the probability distribution of the times between successive customer arrivals (inter-arrival times). Also, the customers can arrive simultaneously (batch or bulk arrival) and if so, the probability distribution describing the size of the batch.
Another aspect of the arrival pattern is the reaction of a customer upon entering the system: for example, a customer may wait no matter how long the queue is, or may balk and refuse to join it if it is too long.
2. Service Patterns: We need to describe a probability distribution for the sequence of customer service time. Service may be single or in batch, there are many situations where customers may be served simultaneously by the same server, such as people boarding a train, sightseers on a guided tour. The situation in which service depends on the number of customers waiting is referred to as State-Dependent service.
3. Queue Discipline: It refers to the manner in which the customers are selected for service when a queue has formed. Most common disciplines are:
4. System Capacity: In some queuing process there is a physical limitation to the amount of waiting room so that when the line reaches a certain length, no further customers are allowed to enter until space becomes available as a result of service completion. This situation is referred to as finite queuing situation.
5. The number of Service Channels: By this, we are typically referring to the number of parallel service stations which can serve customers simultaneously. It is assumed that service mechanisms of parallel channels operate independently of each other.
6. Stages of Service: A queuing system may have only a single stage of service, or it may have several stages. An example of a multistage the queuing system would be a physical examination procedure, where each patient must proceed through several stages, such as medical history, blood tests etc.
A queueing process is described by a series of symbols and slashes such as A/B/X/Y/Z, where A indicates the inter-arrival time distribution, B the service time distribution, X the number of parallel service channels, Y the system capacity and Z the queue discipline.
Some standard symbols for the characteristic distributions are as follows
These notations are referred to as Kendall’s Notation.
Generally, there are three types of system response of interest:
Since most queuing systems are stochastic, these measures are often random variables and their probability distribution is desired.
There are two types of customer waiting times:
The task of a queuing analyst is generally one of two things:
N(t) = N_{s}(t) + N_{q}(t) is the number of customers in the system at time t, where N_{s}(t) is the number in service and N_{q}(t) the number waiting in the queue.
p_{n}(t) = P[N(t) = n] is the probability that there are exactly n customers in the system at time t.
λ is the mean arrival rate, so (1/λ) is the expected inter-arrival time.
µ_{n} = cµ when n ≥ c (all c servers are busy); 1/µ is the expected service time.
p_{n} = P(N = n) is the steady-state probability of exactly n customers in the queuing system,
where N is the random variable giving the number of customers in the system.
One of the most powerful relations in queuing given by John D.C. Little. This formula relates the steady-state mean system size to steady state average customer waiting times.
Little’s formula is:
L = λW and L_{q} = λW_{q}
where L and L_{q} are the expected steady-state numbers of customers in the system and in the queue, and W and W_{q} are the corresponding expected waiting times.
Also, since E (T) = E (T_{q} + S) = E (T_{q}) + E (S) or
W = W_{q} + (1/µ)
Hence it is necessary to find only one of the four expected values.
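As an illustration of how these pieces fit together, here is a Python sketch for the single-server M/M/1 case (illustrative rates; the formulas are the standard steady-state results):

```python
def mm1_metrics(lam, mu):
    """Steady-state measures of an M/M/1 queue with arrival rate lam and
    service rate mu (requires lam < mu for stability)."""
    if lam >= mu:
        raise ValueError("unstable queue: need lam < mu")
    rho = lam / mu               # traffic intensity / server utilization
    W = 1.0 / (mu - lam)         # expected time in system
    Wq = W - 1.0 / mu            # W = Wq + 1/mu from the text
    L = lam * W                  # Little's formula: L = lam * W
    Lq = lam * Wq                # Lq = lam * Wq
    return {"rho": rho, "L": L, "Lq": Lq, "W": W, "Wq": Wq}

m = mm1_metrics(lam=4.0, mu=5.0)   # e.g. 4 arrivals/hour, 5 services/hour
print(m["rho"], m["L"], m["W"])    # → 0.8 4.0 1.0
```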
The post AB Testing With R: An Example Of Marketing Campaign appeared first on StepUp Analytics.
Even e-commerce companies in India like Amazon and Flipkart have a lot of questions about their websites, application designs, and marketing strategies. These questions can be answered by conducting an A/B test.
When two versions of a product (such as A and B) are shown to similar customers to see which version should sell more in the market, or when two groups of customers, A and B, are compared for similar products to see which group we should target, we use A/B testing.
For example for a website:
Null Hypothesis: Assumption that there is no difference between the conversion rates for products A and B
Alternative Hypothesis: There is a difference between the conversion rates for products A and B
To reject the Null Hypothesis we need a p-value that is lower than the significance level i.e. P < 0.05
install.packages("pwr")
library(pwr)
######## 2-sample test for equality of proportions ############
prop.test(c(225, 250), c(3450, 3000))
The p-value is less than 0.05, so we can reject the hypothesis that conversion rates are equal.
But one cannot directly conclude that A and B have dissimilar conversion rates, or the reverse. The true underlying behavior is not known, as we are testing the hypothesis on a sample from an experiment.
The Disadvantages of Using A/B Test:
Bayesian statistics in A/B testing is mainly based on past (prior) knowledge from similar experiments and on the present data. The past knowledge, known as the prior (a prior probability distribution, see Wiki), is combined with the current experiment's data to draw a conclusion about the test at hand.
In this method, we model the metric for each variant. We have prior knowledge about the conversion rate for A which has a certain range of values based on the historical data. After observing data from both variants, we estimate the most likely values or the new evidence for each variant.
Now we need to know:
What is Posterior Probability Distribution?
Posterior probability is the probability of an event after all the background information about the event has been taken into account. The posterior can be viewed as an update of the prior in light of the data:
Posterior distribution ∝ Prior distribution × Likelihood ("new evidence")
Open the link for further information: Wiki
By calculating this posterior distribution for each variant, we can express the uncertainty about our beliefs through probability statements.
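With a Bernoulli metric this is simple to do by hand: the Beta prior is conjugate, so the posterior for each variant is Beta(α + conversions, β + failures), and P(B > A) can be estimated by Monte Carlo. A Python sketch using the same counts as the prop.test example:

```python
import random

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, alpha=1, beta=1,
                   draws=20_000, seed=3):
    """Monte Carlo estimate of P(rate_B > rate_A) under Beta(alpha, beta)
    priors: posterior for each variant is Beta(alpha + conv, beta + n - conv)."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        pa = rng.betavariate(alpha + conv_a, beta + n_a - conv_a)
        pb = rng.betavariate(alpha + conv_b, beta + n_b - conv_b)
        if pb > pa:
            wins += 1
    return wins / draws

# 225/3450 conversions for A, 250/3000 for B, flat Beta(1, 1) prior
p = prob_b_beats_a(225, 3450, 250, 3000)
print(round(p, 3))  # close to 1: B is very likely the better variant
```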
install.packages("bayesAB")
library(bayesAB)
The link below contains all the information to explain the parameters and functions in the package bayesAB. CRAN
Using the previous example
library(bayesAB)
A_binom <- rbinom(3450, 1, 0.065)
B_binom <- rbinom(3000, 1, 0.083)
About the rbinom function: rbinom(n, size, p), where
n = number of observations
size = number of trials
p = vector of probability
We choose the alpha and beta values from the prior knowledge we have about the parameters. Here I have shown the test with two sets of values. We generally use trial and error to make the distribution look like our imagined prior distribution; the peak should be centered over our expected mean based on previous experiments.
plotBeta(1, 1)
plotBeta(100, 200) ## more specific range of p
AB1 <- bayesTest(A_binom, B_binom, priors = c('alpha' = 1, 'beta' = 1), distribution = 'bernoulli')
Saving the outputs of the test in AB2
AB2 <- bayesTest(A_binom, B_binom, priors = c('alpha' = 100, 'beta' = 200), distribution = 'bernoulli')
Here I have checked the AB2 test with an alpha and beta value of 100 and 200 respectively. You can also check the plots and results for AB1.
Print tells us the inputs we have made and the summary statistics of the data.
print (AB2)
summary (AB2)
The summary gives the credible interval. Bayesian intervals treat their bounds as fixed and the estimated parameter as a random variable, whereas frequentist confidence intervals treat their bounds as random variables and the parameter as a fixed value.
It also shows the posterior probability P(A > B), here about 0.00068%. So, B is much better than A, and the posterior expected loss for choosing B over A is low.
plot(AB2)
The means are well separated, with only minimal overlap between the distributions; the credible interval highlights this overlap region. To quantify the findings, we calculate the probability of one variant beating another, i.e. if we randomly draw a sample from Product A and one from Product B, what are the chances that the sample from B has a higher conversion rate than that from A?
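The draw-and-compare idea above can be sketched directly: sample from each variant's posterior Beta distribution and count how often B beats A. The posterior parameters here are illustrative (flat priors updated with conversion counts matching the 0.065 and 0.083 rates simulated earlier), not the exact bayesAB internals.

```r
set.seed(42)
n_draws <- 100000
# Illustrative posteriors: Beta(1 + conversions, 1 + non-conversions)
a_draws <- rbeta(n_draws, 1 + 224, 1 + 3450 - 224)  # variant A, rate ~0.065
b_draws <- rbeta(n_draws, 1 + 249, 1 + 3000 - 249)  # variant B, rate ~0.083
prob_b_beats_a <- mean(b_draws > a_draws)           # Monte Carlo P(B > A)
prob_b_beats_a
```

With these counts, the probability comes out close to 1, mirroring the near-zero P(A > B) reported by summary(AB2).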
So, from the diagrams and the summary of the test, we can easily resolve the problems we faced earlier with a simple prop.test.
Similarly, we can run the test for other distributions, such as the Poisson, normal, and exponential, and check their results. We can then combine the results of the tests to find an overall credible interval and the percentage of A over B, or vice versa.
Frequentist A/B test approaches center on hypothesis tests that yield a point estimate (the probability of rejecting the null) of a hard-to-interpret value. Oftentimes, the statistician or data scientist laying the groundwork for the A/B test must run a power analysis to determine sample size. This quickly gets messy in terms of interpretability. More importantly, it is simply not as robust as Bayesian A/B testing, and it cannot inspect an entire distribution over a parameter.
Bayesian statistics is simply more powerful and informative than a normal A/B test. While frequentist A/B testing requires the length of the test to be defined in advance, Bayesian testing does not. It can calculate the potential dangers of ending the test (the loss value) at any point, and gives a constantly updated probability of either variant being better and by how much. Ending the test early can be disastrous for frequentist A/B testing. A Bayesian approach, therefore, provides us with much greater flexibility during the experiment.
There is no agreed method for choosing a prior, and it requires skill to translate subjective prior beliefs into a mathematically specified prior. If not done correctly, this can lead to misleading results. The posterior distribution can be heavily influenced by the selection of the prior, and that selection is a subjective process. Moreover, Bayesian methods require substantial computational resources, particularly in models with a large number of parameters.
The main advantage of the Bayesian approach is the ability to include historical data and to select a prior distribution. The main disadvantage with this approach is the subjective nature of the selection process for the prior.
The post AB Testing With R: An Example Of Marketing Campaign appeared first on StepUp Analytics.
The post Obtaining A Critical Region And p-Value appeared first on StepUp Analytics.
In the case of the two-tailed test, there are two critical regions (as shown in the above graph). We use the two-tailed test when we are interested in whether the values differ, i.e. are not equal. If the confidence level is (1 − α), then each of the two critical regions has size α/2. The following hypothesis is an example of a two-tailed test.
H₀: µ = µ₀  H₁: µ ≠ µ₀
One-tailed tests are used when we are interested only in the extreme values that are greater than or less than a comparative value (say µ₀). In the case of one-tailed tests, there is only one critical region.
One-tailed tests are of two types.
Left-tailed test, with hypothesis
H₀: µ = µ₀  H₁: µ < µ₀
Right-tailed test, with hypothesis
H₀: µ = µ₀  H₁: µ > µ₀
In the case of a one-tailed test, the critical region has size α (unlike α/2 per tail in the two-tailed case).
Now, in order to obtain the critical value, we must know the hypothesis, the distribution the test statistic follows, the level at which we are working and, lastly, whether the test is two-tailed or one-tailed (right or left). We have discussed all these terms above, so finding the value beyond which the critical region lies is now easy.
Step 1: Check the null and the alternative hypothesis.
Step 2: Take note of the distribution the test follows.
Step 3: Calculate the degrees of freedom, if any.
Step 4: Open the tables and look up the distribution.
Step 5: If it is a two-tailed test at, say, the 95% level of a normal distribution, look up the value for 2.5% (α/2) in each tail. If it is a one-tailed test, look up the value for 5%, attaching a negative sign if the test is left-tailed.
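In R, the table lookup in Step 5 is a single quantile-function call. A sketch at the 95% level (α = 0.05) for the standard normal distribution:

```r
alpha <- 0.05
# Two-tailed: cut off alpha/2 in each tail
two_tailed <- qnorm(c(alpha / 2, 1 - alpha / 2))  # about -1.96 and 1.96
# One-tailed: the whole alpha goes into a single tail
right_tailed <- qnorm(1 - alpha)                  # about 1.645
left_tailed  <- qnorm(alpha)                      # about -1.645
```

A test statistic falling beyond these bounds lands in the critical region, leading to rejection of H₀.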
In the case of the normal distribution, we do not need to calculate degrees of freedom, but for other distributions, such as the t-distribution or the chi-square distribution, we do. On the other hand, both the normal and the t distribution are symmetrical, so we need to look up just one value and switch its sign, but for non-symmetrical distributions we must look up each value individually.
For example, if we are working with the chi-square distribution at the 95% level, we first find its degrees of freedom and then look up the values at both 97.5% and 2.5%. To begin, just follow the steps and practice with the normal distribution. Once you have mastered it, move on to calculating degrees of freedom and then to the other distributions.
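The chi-square lookup above can be sketched the same way; because the distribution is not symmetric, both tail quantiles must be computed separately. The 10 degrees of freedom here are an assumed example, not taken from the article:

```r
df <- 10                     # assumed degrees of freedom for illustration
lower <- qchisq(0.025, df)   # about 3.25
upper <- qchisq(0.975, df)   # about 20.48
c(lower, upper)              # bounds are not mirror images, unlike the normal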
Critical regions are as critical as their name suggests and hence should be calculated carefully, or else we might end up with a wrong conclusion (Type I or Type II error).