The post Hypothesis Testing Examples appeared first on StepUp Analytics.

Suppose the average of a sample comes out different from the population average of 85; we may then be interested in knowing whether this sample average is in line with the population average or not. Hypothesis testing is like a litmus test that gives us a path to rejecting or accepting an assumption or a claim, except that it is not deterministic but probabilistic. It is a technique for comparing a sample with a population, or two samples with each other.

Let us first see what a hypothesis is and take a look at some of the terms that are integral to hypothesis testing.

A hypothesis is nothing but an assumption that we make about the population parameters and that we want to verify. Every test involves two hypotheses: the null hypothesis and the alternative hypothesis. The null hypothesis is the statement of no difference and is denoted by **H0**. It asserts that there is no real difference between the sample and the population, and that any observed difference is accidental or due to chance. The alternative hypothesis is the statement against the null hypothesis.

It is the contradiction of the null hypothesis. It is usually denoted by **H1**. For example, a sample of 50 light bulbs is tested for their life and we want to test if the average lifetime of the bulbs is 300 days. Then we will set up the null hypothesis as “the lifetime is 300 days” and the alternative hypothesis will be “the lifetime is not 300 days”.

To test a hypothesis we need to have a single value based on the sample observations that can be compared with a pre-defined value so as to reach a decision. This value is computed using a certain formula and follows a particular probability distribution under some assumptions. Since the value calculated is used for testing and is derived from the sample, it is called a test statistic.

We all are familiar with the game of darts. Consider a simpler version of such a game in which an aim on the outer ring results in the disqualification (rejected, straightaway) of the aimer whereas an aim on the inner two circles results in qualification (acceptance) of the aimer for further rounds (only qualified, not yet the winner).

Just like this dartboard is divided into areas of rejection and acceptance, a probability curve is divided into an acceptance region and a rejection region (also called the critical region). If the test statistic falls in the critical region then the null hypothesis is rejected; otherwise it may be accepted. Hence, the critical region can be defined as the region of rejection of **H0** when **H0** is true.

A point to be noted here is that rejection of the null hypothesis is a much stronger conclusion than its acceptance (as in the example above, where the aimer is only qualified and is not yet the winner). The reason is that we deal with a sample rather than the population itself. In R.A. Fisher’s own words:

“The null hypothesis is never proved or established, but is possibly disproved, in the course of experimentation. Every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis.”

The confidence with which a decision is taken depends on the significance level chosen. The significance level is the probability of rejecting the null hypothesis when it is true. It is the size of the critical region and is expressed in terms of percentage. For example, a significance level of 5% means that the null hypothesis will be rejected 5 out of 100 times when it is true. The significance level is denoted by α.

As we have seen above, the critical region is a portion of the probability curve. This portion can lie on either end of the curve or on both ends of the curve. A test is recognized as one tailed or two tailed depending upon which side of the curve the critical area lies which further depends on the nature of our alternative hypothesis. For example, in the lightbulb problem if we want to test that the lifetime of bulbs is greater than 300 days then our alternative hypothesis will be “lifetime of bulbs >300 days” (right-tailed).

If we want to test if it is less than 300 days then the alternative hypothesis becomes “lifetime of bulbs < 300 days” (left tailed). If we do not care about whether it is greater or less and just want to test if it is 300 days or not then alternative hypothesis becomes “lifetime of bulbs ≠ 300 days” (two-tailed). The critical region, say at 5%, in these cases can be illustrated as below:

For a two-tailed test, the critical region is divided into two parts, one on the right side and the other on the left side of the curve, while for a one-tailed test it remains undivided. So, if we are dealing with a two-tailed test at significance level (size of the critical region) α%, then on each side we have (α/2)% of the area.
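These critical values can be obtained in R with `qnorm`; a quick sketch at α = 0.05, matching the figures quoted in this article:

```r
alpha <- 0.05
qnorm(1 - alpha)       # right-tailed critical value, approx. 1.645
qnorm(alpha)           # left-tailed critical value, approx. -1.645
qnorm(1 - alpha / 2)   # two-tailed critical value, approx. 1.96
```

Note how the two-tailed value uses α/2 on each side, which is why it is larger than the one-tailed value at the same α.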

In simple terms, the p-value is the evidence for accepting or rejecting the null hypothesis. Consider a coin that someone claims is biased. To test this claim we set up the null hypothesis that “the coin is unbiased” against the alternative hypothesis that “the coin is biased”. The coin is tossed 20 times and suppose that 18 heads and 2 tails are obtained. Clearly, we would expect a similar number of heads and tails if the coin were unbiased. But to show this we need evidence in the form of a p-value.

It is the probability of obtaining a result equal to or more extreme than the observed value. So, in this case, the results to count are (18 heads, 2 tails) and the more extreme outcomes (19 heads, 1 tail) and (20 heads, 0 tails), together with their mirror images (18 tails, 2 heads), (19 tails, 1 head) and (20 tails, 0 heads). Calculating the probability of each outcome (using the binomial distribution) under the null hypothesis we get:

P(18 heads and 2 tails) = P(18 tails and 2 heads) = 0.000181

P(19 heads and 1 tail) = P(19 tails and 1 head) = 0.00001907

P(20 heads and 0 tail) = P(20 tails and 0 head) = 0.0000009536

Adding up the probabilities we get the p-value as 0.0004.
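This calculation can be verified in R, either by summing the binomial probabilities directly or with the built-in `binom.test`:

```r
# Both tails: 18 or more heads, plus the mirror-image outcomes,
# out of 20 tosses of a fair coin
p_value <- 2 * sum(dbinom(18:20, size = 20, prob = 0.5))
p_value                               # approx. 0.0004
binom.test(18, 20, p = 0.5)$p.value   # same two-sided p-value
```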

This p-value is compared with the significance level, which we take here as 0.05. If the p-value is greater than the significance level then the evidence against the null hypothesis is weak, and we may accept the null hypothesis. If the p-value is less than the significance level then the evidence against the null hypothesis is strong, and hence we reject the null hypothesis.

So, in this case, the p-value is very small as compared to the significance level, therefore, we can safely say that the null hypothesis is rejected and the coin is indeed a biased one.

Statistics is probabilistic and so is hypothesis testing. There is always a probability of making a wrong decision. While making decisions, four possibilities arise:

- Rejecting H0 when it is true
- Accepting H0 when it is false
- Accepting H0 when it is true
- Rejecting H0 when it is false

Clearly, the last two decisions are correct. The first two reject or accept the null hypothesis wrongly; they are errors. The first, rejecting the null hypothesis when it is in fact true, is called the type I error. The probability of committing a type I error is denoted by α.

The second one accepts the null hypothesis when it is false, it is called the type II error. The probability of committing type II error is denoted by β.

Consider an example of testing whether a new toothpaste is better than the previous toothpaste in fighting dental cavities. The hypotheses are **H0:** the two toothpastes are equally effective, against **H1:** the new toothpaste is better than the old one. Now suppose that the new toothpaste is actually better. If our test accepts the null hypothesis of no difference then we commit a type II error.

While testing a hypothesis, our aim is to reduce both types of error, but it is not possible to control both simultaneously. So we fix the probability of **type I error (α)** in advance at a satisfactory level and try to minimize the probability of **type II error (β)**. α is also known as the significance level or the size of the critical region.

We know that for large sample sizes, almost all the distributions can be approximated by the normal distribution due to the Central Limit Theorem. This forms the basis of the large sample tests. Let us take a look at some of the tests and also how to perform them in R.

Consider a random sample of size n (≥ 30) from a normal population with mean µ and variance σ^{2}. We know that the sample mean (x̄) of this sample will also be normally distributed as N(µ, σ^{2}/n). Thus, the standard normal variate corresponding to x̄ is **z = (x̄ − µ) / (σ/√n)**.

If the population standard deviation is not known, which is usually the case, then we use sample variance as its estimate. Let us take an example of a pizza delivery boy who claims that he takes on an average 8.9 minutes to reach his destination to deliver pizzas. To check on this claim the agency that hires him notes his time taken for 50 orders. It gets a mean of 9.3 minutes with a standard deviation of 1.6 minutes.

Now let us check if the average time taken to deliver a pizza is 8.9 minutes or not. For this we start by setting up the null hypothesis, **H0**: the average time taken to deliver a pizza is 8.9 minutes (µ = 8.9), against the alternative hypothesis, **H1**: the average time taken to deliver a pizza is not 8.9 minutes (µ ≠ 8.9).

According to the situation we have: sample mean (x̄) = 9.3 minutes, hypothesized population mean (µ) = 8.9 minutes, sample standard deviation (s) = 1.6 minutes and sample size (n) = 50. Since the sample is large (n ≥ 30) we apply the large sample z-test, using the sample standard deviation as an estimate of σ; for small samples the t-test is used instead. On substituting the values in the test statistic formula we get **z = (9.3 − 8.9) / (1.6/√50) ≈ 1.768**.

This is a two-tailed test, so the critical region lies on both sides of the curve. The critical value from the standard normal table is 1.96 (at a significance level of 0.05). The calculated value of the test statistic, 1.768, is less than the tabulated value of 1.96. Hence, we may accept the null hypothesis at the 0.05 level of significance.

This test can be performed in R as well. The code for which is given below.
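A minimal sketch of the computation in R (base R has no built-in z-test, so the statistic and two-tailed p-value are computed by hand):

```r
xbar <- 9.3   # sample mean (minutes)
mu0  <- 8.9   # hypothesized population mean
s    <- 1.6   # sample standard deviation
n    <- 50    # sample size

z <- (xbar - mu0) / (s / sqrt(n))
z                               # approx. 1.768
p_value <- 2 * pnorm(-abs(z))   # two-tailed p-value
p_value                         # approx. 0.077, greater than 0.05
```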

Since we are dealing with a two-tailed test, the p-value is calculated as **p = P(Z ≤ −z) + P(Z ≥ z) = 2·P(Z ≤ −z)**. Clearly, the **p-value** is greater than the significance level; therefore we have little evidence against the null hypothesis and may accept it.

In the above test, we had one sample and we compared the value of the sample mean to that of the population mean. We can also compare the values of two sample means to find out if they belong to populations with identical means, or to check if one population is superior or inferior to the other.

For this, two samples of sizes n1 and n2 are taken from populations with means µ1 and µ2 and variances σ1^{2} and σ2^{2} respectively. We know that the sample means x̄1 and x̄2, and their difference x̄1 − x̄2, are distributed as: **x̄1 − x̄2 ~ N(µ1 − µ2, σ1^{2}/n1 + σ2^{2}/n2)**

Under the null hypothesis that the samples are from the same population (µ1 = µ2), the test statistic is given as: **z = (x̄1 − x̄2) / √(σ1^{2}/n1 + σ2^{2}/n2)**

If the population standard deviations are not known then the sample standard deviations s1 and s2 are used in their place: **z = (x̄1 − x̄2) / √(s1^{2}/n1 + s2^{2}/n2)**

For better understanding, consider an example where it is required to check whether the mean level of pay in one state is greater than that in another. Two samples of employees are taken, of sizes 1200 and 1000. The mean and standard deviation of the samples (in thousands of rupees) are given as:

Here, we have to test the null hypothesis, **H0:** there is no difference between the average pay of the two states, i.e., µ1 = µ2, against the alternative hypothesis, **H1:** the mean level of pay of state 1 is greater than that of state 2, i.e., µ1 > µ2 (right-tailed test). The population standard deviations are not known, so estimating them by the sample standard deviations, we get the value of the test statistic as |z| = 24.28. The tabulated value for a right-tailed test at the 5% level of significance is 1.645.

The calculated value is much greater than the tabulated value at the 5% significance level; thus we reject the null hypothesis and conclude that the mean pay of state 1 is higher than the mean pay of state 2.

In R, this test can be done as follows:
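The article’s summary table of means and standard deviations is not reproduced above, so the sketch below wraps the test in a small function; the means and standard deviations passed in are illustrative placeholders, while the sample sizes match the example:

```r
# Right-tailed two-sample z-test for the difference of means
two_sample_z <- function(xbar1, xbar2, s1, s2, n1, n2) {
  z <- (xbar1 - xbar2) / sqrt(s1^2 / n1 + s2^2 / n2)
  p <- pnorm(z, lower.tail = FALSE)   # right-tailed p-value
  list(z = z, p_value = p)
}

# Illustrative values (in thousands of rupees); n1 and n2 are from the example
res <- two_sample_z(xbar1 = 30, xbar2 = 25, s1 = 5, s2 = 4.5,
                    n1 = 1200, n2 = 1000)
res$z > 1.645   # TRUE: reject H0 at the 5% level
```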

This test is used to check whether the standard deviations of two samples differ significantly. Let s1 and s2 be the standard deviations of two independent samples; then under the null hypothesis that the standard deviations do not differ significantly, i.e., σ1 = σ2, the test statistic for large samples is given as: **z = (s1 − s2) / √(σ1^{2}/2n1 + σ2^{2}/2n2)**

where σ1^{2} and σ2^{2} are the population variances and n1 and n2 are the sample sizes of sample 1 and sample 2 respectively. The sample variance is used as an estimate of the population variance when the latter is not known. Consider a farmer who cultivates two sets of plots whose variability is as follows:

We have to test whether the variability in the two sets of plots is significantly different. The null hypothesis can be stated as **H0:** there is no significant difference between the variability of the two sets of plots, i.e., **σ1 = σ2**, against the alternative hypothesis **H1:** the two sets of plots have significantly different variability, i.e., **σ1 ≠ σ2**. We have s1 = 34, s2 = 28, n1 = 40, n2 = 60, σ1^{2} = 34^{2} = 1156, σ2^{2} = 28^{2} = 784. Substituting these values in the test statistic we get z ≈ 1.31.

This value is less than the tabulated value at the 0.05 level of significance, which is 1.96. Hence we cannot reject the null hypothesis, and we conclude that the difference between the variability of the two sets of plots is not significant.

In R, it can be done as follows:
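Using the figures from the example, the computation can be reproduced directly:

```r
s1 <- 34; n1 <- 40   # standard deviation and size, first set of plots
s2 <- 28; n2 <- 60   # second set of plots

z <- (s1 - s2) / sqrt(s1^2 / (2 * n1) + s2^2 / (2 * n2))
z                               # approx. 1.31, less than 1.96
p_value <- 2 * pnorm(-abs(z))   # two-tailed p-value
```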


The post Correlation Coefficient In Data Science appeared first on StepUp Analytics.

The correlation coefficient between two variables X (say height) and Y (say weight) is a numerical measure of the **linear relationship** between them, denoted by ‘**r**’.

**−1 ≤ r(X, Y) ≤ +1**

**Example:** Suppose we have data on the number of study hours (X) and the number of sleeping hours (Y) of different students, and we need to check to what extent the two affect each other.

**Solution:** With X = (2, 4, 6, 8, 10) and Y = (10, 8, 6, 4, 2):

E(X) = 6, E(Y) = 6

Cov(X, Y) = E[(X − E(X))(Y − E(Y))] = −8

r(X, Y) = Cov(X, Y) / (√V(X) √V(Y))

= −8 / (√8 · √8)

= −1

The negative value shows that they are associated in opposite directions: if the value of one increases the other decreases, i.e., as the number of sleeping hours increases, the study hours of a student decrease.

R code:

X = c(2, 4, 6, 8, 10)
Y = c(10, 8, 6, 4, 2)
cor(X, Y)
[1] -1
plot(X, Y)

**Some Correlation Values And Their Interpretation:**

It should be clearly understood that **correlation** analysis tells us about the presence or absence of a linear relationship between two variables x and y. The following table explains the difference between the correlation coefficient and the regression coefficient.

**Comparison Chart**

In almost any business problem, it is useful to express one quantity in terms of its relationship with others. For example, sales might increase when the marketing department spends more on TV advertisements, or a customer’s average purchase amount on an e-commerce website might depend on a number of factors related to that customer.

Often, correlation is the first step to understanding these relationships and subsequently building better business and statistical models.

Not only this, other examples can be: Consumer spending and GDP are two metrics that maintain a positive relationship with one another. When spending increases, GDP also rises as firms produce more goods and services to meet consumer demand. Conversely, firms slow production amid a slowdown in consumer spending to bring production costs in line with revenues and limit excess supply.

**Statistical importance (helpful for analysis of correlated data)**

Statistical inference based on Pearson’s correlation coefficient often focuses on one of the following two aims:

- One aim is to test the null hypothesis that the true correlation coefficient *ρ* is equal to 0, based on the value of the sample correlation coefficient *r*:

If r is the observed correlation coefficient in a sample of n pairs of observations from a bivariate normal population, then under H_{0}: *ρ* = 0 (i.e. the population correlation coefficient is zero), the statistic

**t = r√(n−2) / √(1−r^{2})**

follows a t distribution with (n−2) degrees of freedom.

R-code: cor.test(X, Y, method = c("pearson")) # X and Y are the collected sample values

- The other aim is to derive a confidence interval that, on repeated sampling, has a given probability of containing *ρ*. In R, cor.test(X, Y) reports this confidence interval along with the test.
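Both aims are served by a single `cor.test` call; a sketch with hypothetical sample values:

```r
# Hypothetical paired observations
X <- c(43, 21, 25, 42, 57, 59)
Y <- c(99, 65, 79, 75, 87, 81)

res <- cor.test(X, Y, method = "pearson")
res$estimate   # sample correlation coefficient r
res$p.value    # p-value for testing H0: rho = 0
res$conf.int   # 95% confidence interval for rho
```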

**Do we always need the exact values of the variables to calculate the correlation between them?**

Often, instead of quantitative data, we come across qualitative data given in a specific order or rank. In such situations we can compute a special type of correlation called Spearman’s rank correlation coefficient.

Spearman’s rank correlation coefficient can be defined as a special case of Pearson’s *ρ* applied to the ranks of the variables. The formula for Spearman’s coefficient looks very similar to that of Pearson, with the distinction of being computed on ranks instead of raw scores:

**ρ(rank.x, rank.y) = Cov(rank.x, rank.y) / (SD[rank.x] · SD[rank.y])**

If all ranks are unique (i.e. there are no ties in ranks), you can also use the simplified version **ρ = 1 − 6Σdᵢ^{2} / (n(n^{2} − 1))**, where dᵢ is the difference between the two ranks of the i-th observation.

It should be noted that Pearson’s coefficient measures the strength of the *linear* relationship, while Spearman’s measures the strength of the *monotonic* relationship through ranks, so the two can differ. For example, consider the following cases:

**1.** The Pearson correlation coefficient is +1 when one variable increases and the other increases by a consistent amount, so the relationship forms a perfect line. The Spearman correlation coefficient is also +1 in this case:

**Pearson = +1, Spearman = +1**

**2.** If the relationship is that one variable increases when the other increases, but the amount is not consistent, the Pearson correlation coefficient is positive but less than +1. The Spearman coefficient still equals +1 in this case.

**Pearson = +0.851, Spearman = +1**

**3. **When a relationship is random or non-existent, then both correlation coefficients are nearly zero.

**Pearson = −0.093, Spearman = −0.093**

**4. **If the relationship is a perfect line for a decreasing relationship, then both correlation coefficients are −1.

**Pearson = −1, Spearman = −1**

**5.** If the relationship is that one variable decreases when the other increases, but the amount is not consistent, then the Pearson correlation coefficient is negative but greater than −1. The Spearman coefficient still equals −1 in this case.

**Pearson = −0.799, Spearman = −1**

Thus, we have seen the various aspects where the correlation coefficient is used to find the association between variables. Pearson’s correlation measures only *linear* relationships; consequently, if your data contain a perfect curvilinear relationship, the correlation coefficient may not detect it and can even show a value of zero.

For further studies, the latest updates, or interview tips on data science and machine learning, subscribe to our emails.


The post ANOVA Using SPSS appeared first on StepUp Analytics.


Assumptions for the ANOVA Test.

The ANOVA test is based on the test statistic F (the variance ratio). The assumptions for the validity of the F-test in ANOVA are as follows:

- The observations are independent.
- The parent population from which the observations are taken is normal, and
- The various treatment and environmental effects are additive in nature.

Below is the example,

**Question**: The following table shows the lives (in hours) of four batches of electric lamps.

Perform an analysis of variance (one-way classification) on these data and show that the significance test does not reject their homogeneity.

Null hypothesis, H0: µ1 = µ2 = µ3 = µ4, where µi is the mean of batch i (i = 1, 2, 3, 4)

Alternative hypothesis, Ha: at least two of the means are different

I will answer several common questions about how to perform Analysis of Variance (ANOVA) in SPSS.

- How to manage data in SPSS?
- How do we treat Batch?

Firstly, enter all the observations in one column, either row or column wise; here I have entered the data row wise.

Secondly, there are four batches of bulbs. If you have entered the data row wise, then put the corresponding batch number in the very next column, say “batch”.

Note: you can also enter the data column wise and put the corresponding batch number in the next column.

Then go to Analyze → Compare Means → One-Way ANOVA. Select the “observation” variable and put it in the Dependent List, and “batch” in the Factor box.

Click Ok.

You will get the desired result.

**Interpretation**

The p-value for the Analysis of Variance (ANOVA) is 0.123, which indicates that we do not have enough evidence to reject the null hypothesis at the 0.05 level of significance. Hence we may accept the null hypothesis, i.e., the batch means are equal. The test is statistically insignificant.

Note: here the interpretation is made on the basis of p-value.
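For comparison, the same one-way ANOVA can be run in R with `aov`; the lamp lifetimes below are illustrative placeholders, since the article’s data table is not reproduced here:

```r
# Illustrative lifetimes (hours) for four batches of lamps
obs <- c(1600, 1610, 1650, 1680, 1580, 1640,   # batch 1
         1540, 1555, 1575, 1600, 1620, 1660,   # batch 2
         1520, 1530, 1560, 1580, 1610, 1670,   # batch 3
         1510, 1520, 1530, 1570, 1600, 1680)   # batch 4
batch <- factor(rep(1:4, each = 6))

fit <- aov(obs ~ batch)
summary(fit)   # the Pr(>F) column is the ANOVA p-value
```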

**Author**

Zishan Hussain


The post Hypothesis Testing appeared first on StepUp Analytics.

In the first hypothesis, we set

HN: β = 0

HA: β ≠ 0 (two-tailed test)

However, if we have prior knowledge about the sign of β (say positive), then the hypotheses are set as

HN: β = 0

HA: β > 0 (one-tailed test)

In the case of a two-tailed or one-tailed test, the decision rule for examining the statistical significance of the estimated coefficient is as follows:

- If |t*| > t∆(n−2), i.e. the absolute value of the computed test statistic (t*) is greater than the critical t at significance level ∆ and n−2 degrees of freedom, then reject HN and conclude that the estimated coefficient is statistically significant at the ∆ level of significance.

(where the degrees of freedom are the total number of observations less the number of independent constraints put on them; for a k-variable model, df = n − k, where n is the total number of observations)

**The P- value Approach**

Regression packages, apart from providing the computed t statistics for the estimated coefficients, also provide p-values, which can be used as an alternative approach to assessing the significance of the estimated coefficients. The p-value is the lowest level of significance at which HN could be rejected based on the test statistic actually observed. For example, if p = 0.002, it implies that HN will be rejected at the 2% level of significance. The usual practice is to reject HN and conclude that the estimated coefficient under consideration is statistically significant if the p-value is less than or equal to 0.10.

**Steps**

**1: State the Null Hypothesis.**

**2: State the Alternative Hypothesis.**

**3: Set α.**

**4: Collect Data.**

**5: Calculate a test statistic.**

**6: Construct Acceptance / Rejection regions.**

**7: Based on steps 5 and 6, draw a conclusion about H_{0}.**

**Upper Tail Test of Population Mean with Known Variance**

The null hypothesis of the upper tail test of the population mean can be expressed as follows: **H_{0}: μ ≤ μ_{0}**

where μ_{0} is a hypothesized upper bound of the true population mean μ.

Let us define the test statistic z in terms of the sample mean x̄, the sample size n and the population standard deviation σ: **z = (x̄ − μ_{0}) / (σ/√n)**

Then the null hypothesis of the upper tail test is to be rejected if z ≥ z_{α} , where z_{α} is the 100(1− α) percentile of the standard normal distribution.


Suppose the food label on a cookie bag states that there are at most 2 grams of saturated fat in a single cookie. In a sample of 35 cookies, it is found that the mean amount of saturated fat per cookie is 2.1 grams. Assume that the population standard deviation is 0.25 grams. At the .05 significance level, can we reject the claim on the food label?

The null hypothesis is that μ ≤ 2. We begin with computing the test statistic.

> xbar = 2.1 # sample mean

> mu0 = 2 # hypothesized value

> sigma = 0.25 # population standard deviation

> n = 35 # sample size

> z = (xbar-mu0)/(sigma/sqrt(n))

> z # test statistic

[1] 2.3664


We then compute the critical value at .05 significance level.

> alpha = .05

> z.alpha = qnorm(1-alpha)

> z.alpha # critical value

[1] 1.6449


The test statistic 2.3664 is greater than the critical value of 1.6449. Hence, at .05 significance level, we reject the claim that there is at most 2 grams of saturated fat in a cookie.

Instead of using the critical value, we apply the pnorm function to compute the upper tail p-value of the test statistic. As it turns out to be less than the .05 significance level, we reject the null hypothesis that μ ≤ 2.

> pval = pnorm(z, lower.tail=FALSE)

> pval # upper tail p−value

[1] 0.0089802


Author: Debasmita Mitra
