# Data Science Interview Preparation

#### Measures of central tendency and dispersion

**Central tendency** describes the central or typical value of a data set, i.e. the value around which the observations tend to cluster. The 3 common measures of central tendency are:

**Mean:** the sum of all the values divided by the number of values, so every value in the data set contributes to it.

**Median:** the middle value once the values are arranged in ascending or descending order (with an even number of values, the average of the two middle values).

**Mode:** the value that occurs most frequently in the data set.

**Dispersion** is the extent to which numerical data vary about an average value; in other words, dispersion is the “scatteredness” of the data.

The various measures of dispersion are:

**Range:** It is the difference between two extreme observations of the distribution.

**Quartile Deviation:** Quartile Deviation (QD) is half of the difference between the upper and lower quartiles, QD = (Q3 − Q1) / 2.

**Mean Deviation:** Mean deviation is the average of the absolute deviations of the values from the mean (or the median) of a sample.

**Standard Deviation:** the positive square root of the arithmetic mean of the squared deviations from the mean. It is a measure of the amount of variation or dispersion of a set of values. A low standard deviation indicates that the values tend to be close to the mean (also called the expected value) of the set, while a high standard deviation indicates that the values are spread out over a wider range around the mean.

**Quartiles:** Quantiles divide a probability distribution into areas of equal probability. Quartiles are the 3 cut points that divide the data into 4 equal parts.

The first quartile (Q1) is defined as the middle number between the smallest number and the median of the dataset.

The second quartile (Q2) is the median of the data and 50% of the data lies below this point.

The third quartile (Q3) is the middle value between the median and the highest value of the data set.
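The measures above can be computed in a few lines. The following is a minimal sketch using Python's standard `statistics` module; the function name `describe` and the sample values are illustrative assumptions, and the quartiles follow the module's default "exclusive" convention (other tools may give slightly different quartile values).

```python
import statistics


def describe(values):
    """Return common measures of central tendency and dispersion."""
    data = sorted(values)
    mean = statistics.mean(data)
    # quantiles(n=4) returns the three cut points Q1, Q2, Q3
    q1, q2, q3 = statistics.quantiles(data, n=4)
    return {
        "mean": mean,
        "median": statistics.median(data),
        "mode": statistics.mode(data),
        "range": data[-1] - data[0],
        "quartile_deviation": (q3 - q1) / 2,
        "mean_deviation": sum(abs(x - mean) for x in data) / len(data),
        # pstdev divides by n, matching the definition of standard deviation above
        "standard_deviation": statistics.pstdev(data),
    }


# Hypothetical sample used only for illustration
summary = describe([2, 4, 4, 5, 7, 9, 11])
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```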

#### Skewness and Kurtosis

Skewness is a lack of symmetry in a distribution. It describes the shape of the curve, which can be positively skewed (longer tail on the right) or negatively skewed (longer tail on the left).

Kurtosis gives an idea of the peakedness or flatness of the frequency curve (equivalently, how heavy its tails are), and classifies curves as normal (mesokurtic), leptokurtic (more peaked), or platykurtic (flatter).

The key difference between skewness and kurtosis is that skewness describes how lopsided the distribution is, whereas kurtosis describes how peaked or flat it is.
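One common way to quantify both ideas is through central moments: skewness as m3 / m2^1.5 and kurtosis as m4 / m2^2 (a normal curve has kurtosis 3 under this definition). The sketch below assumes these moment-based definitions; the function names and sample values are illustrative only.

```python
def central_moment(values, k):
    """k-th central moment: the average of (x - mean) ** k."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** k for x in values) / len(values)


def skewness(values):
    """Moment coefficient of skewness: 0 for symmetric data."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def kurtosis(values):
    """Moment coefficient of kurtosis: about 3 for a normal curve."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2


symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]  # one large value stretches the right tail

print(skewness(symmetric))     # zero: no lopsidedness
print(skewness(right_skewed))  # positive: longer right tail
print(kurtosis(symmetric))     # below 3: flatter than normal (platykurtic)
```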

**Questions**

**Ques.** What do a low and a high standard deviation indicate?

**Ans:** A low standard deviation indicates that the values tend to be close to the mean of the set, while a high standard deviation indicates that the values are spread out over a wider range.

**Ques.** Suppose that you are 5 inches in height and you have to cross a river where the observed depths of the water are 2, 4, 8, 6 and 10. Would you be able to cross the river without sinking? Which method will you apply?

**Ans:** Here we use the range (a measure of dispersion) to get a quick picture of whether we can cross the river. The highest observation is 10 and the lowest is 2, which gives a range of 8. Because the deepest point (10) is well above a height of 5, it won’t be possible to cross the river without sinking, even though some observations are below 5.

**Ques.** Which is the ideal average among the measures of central tendency?

**Ans:** The arithmetic mean satisfies most of the conditions of an ideal average, so it is generally regarded as the best one, although it is sensitive to extreme values.

**Ques.** What is the median of the human body?

**Ans:** The navel is regarded as the median of the body, as it divides the body into two roughly equal parts.

**Ques.** What are the different types of mean?

**Ans:** There are 3 types of mean:

- **Arithmetic**: the simple mean is also called the arithmetic mean; it is the sum of all the observations divided by the total number of observations.
- **Harmonic**: the reciprocal of the arithmetic mean of the reciprocals of the values; it is generally used when averaging rates.
- **Geometric**: the nth root of the product of n values.
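As a quick illustration of the three means, here is a small sketch with Python's `statistics` module. The speeds and growth factors are made-up values chosen so the results are easy to check by hand.

```python
import statistics

# Hypothetical example: equal distances driven at 40 and at 60 km/h
speeds = [40, 60]
arithmetic = statistics.mean(speeds)         # (40 + 60) / 2
harmonic = statistics.harmonic_mean(speeds)  # 2 / (1/40 + 1/60), the true average speed

# Hypothetical growth factors: the geometric mean is the square root of 2 * 8
growth_factors = [2, 8]
geometric = statistics.geometric_mean(growth_factors)

print(arithmetic, harmonic, geometric)
```

For positive values the harmonic mean never exceeds the geometric mean, which never exceeds the arithmetic mean.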

**Ques.** Which is the best measure of dispersion?

**Ans:** Standard deviation is generally considered the best measure, as it satisfies almost all the properties of an ideal measure of dispersion. However, it is affected by outliers because it is built on the mean, which is itself very sensitive to extreme values.

**Ques.** Which measure of dispersion is based on 50% of the data?

**Ans:** Quartile deviation is based on the middle 50% of the data only, so it can’t be regarded as a fully reliable measure, as it ignores the rest of the data.

**Ques.** Is the mean possible for 2, 3, 4, 5, 6, 7 and 1000?

**Ans:** Yes, the mean does exist for this data, but because an extreme value is present it won’t represent the data well. The mean is 146.71, and none of the observations lies anywhere near that value, so it does not give us a useful summary.

**Ques.** What is the meaning of absolute and relative measures of dispersion?

**Ans:** An absolute measure expresses the dispersion in the same units as the original data, whereas a relative measure is a ratio, such as the standard deviation divided by the mean (the coefficient of variation). Unlike absolute dispersion, relative dispersion is dimensionless.

**Ques.** If all the observations are increased by 10, what will be the effect on the standard deviation?

**Ans:** It will remain unchanged: the variance is independent of a change of origin, and the standard deviation is the square root of the variance, so it also remains unchanged.

**Ques.** What is the range of Karl Pearson’s coefficient of skewness?

**Ans:** Karl Pearson’s coefficient of skewness lies between -3 and 3, but in practice these limits are rarely attained.

**Ques.** When is a distribution called positively skewed and when negatively skewed?

**Ans:** As a rule of thumb, there are two cases:

i. skewness is said to be positive when mean > mode/median

ii. skewness is said to be negative when mean < mode/median

**Ques.** List the cases in which Bowley’s coefficient of skewness can be used.

**Ans:** Bowley’s coefficient, also known as the quartile coefficient of skewness, is useful when the data are summarized by the median and the quartiles rather than by the mean and the standard deviation.

**Ques.** What is the key difference between median and mean?

**Ans:** The median divides the whole data into two equal halves, i.e. the probability of falling in either half is exactly 0.5.

The mean is the balance point of the data: the sum of the deviations on either side of it is equal. It does not necessarily divide the data into two equal halves, although the mean and median coincide in some situations (for example, in symmetric distributions).

**Ques.** If we have ordered data then can we use the mode?

**Ans:** The mode can be computed for any type of data, but it is mainly used when we are dealing with unordered (nominal) data.

There are two types of categorical data:

1. Ordered data (mainly values representing rank or order)

2. Unordered data

When we have ordinal data, the median is usually the better measure (the mean needs numeric data), and when the data is nominal, the mode is the only appropriate choice.

**Ques.** We have the values 20, 25, 29, 25, 18, 20, 30, 31, 26 and their mean is approximately 24.9. If we take the sum of the deviations of these values from their mean, what will be the value?

**Ans:** The sum of the deviations about the mean is always zero.

**Ques.** Suppose we have some data and the analyst wants to check the variability in the data. What statistic should the analyst prefer for this?

**Ans:** The most widely used measure of dispersion is the standard deviation. It helps us to check the variation or spread of the data about the mean, and it is preferred because it takes all the data points into account.

**Ques.** What does a standard deviation of 0 signify?

**Ans:** A standard deviation of 0 means that our random variable is constant: all the data points are the same, without any dispersion.

**Example:** If a random variable X takes the values 9,9,9,9,9,9,9,9,9 then the mean of X will be 9 and its standard deviation will be 0.

**Ques.** If we calculate the variance with n-1 in the denominator instead of n, what will it mean?

**Ans:** It means that the data we have is a sample and not the whole population, because the variance calculated with n-1 in the denominator is the sample variance.
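Three of the answers above are easy to verify numerically: the deviations from the mean sum to zero, adding a constant leaves the standard deviation unchanged, and dividing by n-1 gives a slightly larger variance than dividing by n. A minimal sketch with the standard `statistics` module, reusing the values from the question above:

```python
import statistics

values = [20, 25, 29, 25, 18, 20, 30, 31, 26]
mean = statistics.fmean(values)

# 1. Deviations about the mean cancel out (up to floating-point rounding)
sum_of_deviations = sum(x - mean for x in values)

# 2. A change of origin (adding 10 to every value) does not change the spread
shifted = [x + 10 for x in values]
sd_original = statistics.pstdev(values)
sd_shifted = statistics.pstdev(shifted)

# 3. Population variance divides by n, sample variance divides by n - 1
population_variance = statistics.pvariance(values)
sample_variance = statistics.variance(values)

print(round(sum_of_deviations, 10))
print(sd_original, sd_shifted)
print(population_variance, sample_variance)
```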

#### CORRELATION AND REGRESSION

When we deal with a bivariate distribution or data, we are interested in finding the correlation and covariance between the two variables. If a change in one variable is accompanied by a change in the other variable, then the variables are said to be correlated. The correlation can be:

i. Positive: if the two variables deviate in the same direction, i.e. an increase in one variable is accompanied by a corresponding increase in the other, they are said to be positively correlated.

ii. Negative: if the two variables deviate in opposite directions, i.e. an increase in one variable is accompanied by a decrease in the other, they are said to be negatively correlated.

**Regression Analysis** is a statistical modeling technique used to find the relationship between a dependent variable and one or more independent variables. It is undertaken to predict future values based on previously observed data. It helps in estimating how one variable affects another, and by how much.

The most common form of regression is linear regression, which assumes a linear relationship between the input variables (x) and the single output variable (y). More specifically, y can be calculated from a linear combination of the input variables.

For example, suppose we have to find a relationship between a pen and its ink: we can say that the pen is the dependent variable and the ink is the independent variable.
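For a single input variable, the least-squares line can be written in closed form: slope = S_xy / S_xx and intercept = ȳ − slope · x̄. The sketch below implements this directly; the function name `fit_line` and the data points are illustrative assumptions.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Hypothetical data lying exactly on the line y = 2x + 1
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

slope, intercept = fit_line(xs, ys)
predicted = slope * 6 + intercept  # prediction for an unseen x = 6
print(slope, intercept, predicted)
```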

#### Questions

**Ques:** Are correlation and dependency the same thing?

**Ans:** No. Independence of two variables implies zero correlation, but the converse is not true: two perfectly dependent variables can also have zero correlation.

**Example:** Let y = x². Here y depends completely on x, but the relationship is not linear: the correlation is positive when x takes only positive values, negative when x takes only negative values, and zero when the values of x are symmetric about 0. So a pair of variables that are perfectly dependent on each other can still give a zero correlation.
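The example can be checked numerically. The sketch below computes Pearson's r by hand for y = x²; the function name `pearson_r` and the chosen x values are illustrative assumptions.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    var_x = sum((x - x_bar) ** 2 for x in xs)
    var_y = sum((y - y_bar) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# x symmetric about 0: y is fully determined by x, yet r is zero
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]
r_symmetric = pearson_r(xs, ys)

# x restricted to positive values: the same relationship now gives r close to 1
r_positive = pearson_r([1, 2, 3, 4], [1, 4, 9, 16])

print(r_symmetric, r_positive)
```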

**Ques:** Suppose that Raman ate 5 plates of pasta ordered online, and the owner of the restaurant is infected with corona. Is there any chance that Raman gets infected? Are the two events negatively or positively related?

**Ans:** Yes, Raman has a chance of getting infected: the owner may have visited the restaurant on one day or another (without knowing about the infection), and as this is a communicable disease the chances of it being passed on are high. Hence, we can say that the two events are positively related.

**Ques:** What do dense points in a scatter plot depict?

**Ans:** There can be two cases here:

i. If the data points are dense, i.e. they lie close to each other, then there is a good amount of correlation between the variables.

ii. If the points are scattered everywhere, we can say that there is not much correlation between the variables.

**Ques:** What is the relation between two independent variables?

**Ans:** Two independent variables are always uncorrelated.

**Ques:** Is the Pearson coefficient sensitive to outliers?

**Ans:** Yes, the Pearson coefficient is very sensitive to outliers: even a single point can change the sign of the coefficient.

**Ques:** What is the difference between correlation and regression?

**Ans:** The differences are:

i. The slope in a linear regression gives the marginal change in the output/target variable when the independent variable changes by one unit. Correlation has no slope.

ii. The intercept in a linear regression gives the value of the target variable when all the input/independent variables are set to zero. Correlation does not carry this information.

iii. Linear regression can give you a prediction given all the input variables. Correlation analysis does not predict anything.

**Ques.** What is the difference between correlation and covariance?

**Ans:** Correlation is simply the covariance normalized by the standard deviations of both variables, which keeps the result within fixed limits (-1 to 1). In simple words, both terms measure the relationship between two variables. “Covariance” indicates the direction of the linear relationship between the variables. “Correlation”, on the other hand, measures both the strength and the direction of the linear relationship between two variables.

**Ques.** If a constant 50 is subtracted from each value of X and Y, what will be the change in the regression coefficient?

**Ans:** There will be no change, as the regression coefficient is independent of a change of origin, but not of a change of scale.

**Ques.** Why do we have two lines of regression?

**Ans:** There are always two lines of regression: one of Y on X, used to predict Y from X, and one of X on Y, used to predict X from Y. They differ because each one minimizes the errors in a different variable.

**Ques.** What is the angle between the two lines of regression?

**Ans:** The two lines form two angles between them, namely:

i. an acute angle

ii. an obtuse angle.

**Ques.** When do the two lines of regression intersect?

**Ans:** The two lines of regression always intersect at the point of means (x̄, ȳ). They are perpendicular to each other when r = 0 and coincide when r = ±1.

**Ques.** State the difference between homoscedastic and heteroscedastic.

**Ans:** Homoscedasticity refers to a condition in which the variance of the residual, or error term, in a regression model is constant, so the points are spread at a roughly equal distance around the regression line. Heteroscedasticity is the opposite condition: the variance of the residuals changes across the range of the data, so the spread of the points around the line varies widely, often because of outliers.

**Ques.** What are the assumptions of regression?

**Ans:** The basic assumptions of regression are as follows:

i. Independence of errors

ii. Normality of the error distribution

iii. Quantitative data condition

iv. Homoscedasticity

**Ques.** What does Multiple Regression mean?

**Ans:** Multiple regression comes into play when we have more than one explanatory variable, i.e. when the dependent variable is related to two or more independent variables.

**Ques.** State the assumptions of Multiple Regression?

**Ans:** In addition to the basic assumptions of regression, the assumptions are as follows:

i. Multivariate normality: the residuals are assumed to be normally distributed

ii. No multicollinearity: the explanatory variables are assumed not to be highly correlated with each other

#### Theory of Hypothesis Testing and P-Value

A hypothesis is some kind of statement about the population distribution which we want to verify from the kind of information available to us. It is a proposition that is made according to past data, facts, and personal assumptions.

When the hypothesis specifies the population distribution completely, it is known as a SIMPLE HYPOTHESIS, whereas when it does not specify the population completely it is known as a COMPOSITE HYPOTHESIS.

**We have two kinds of hypothesis:**

**i. NULL HYPOTHESIS:** It is a hypothesis which assumes that there is no significant difference between two or more characteristics being compared. It is usually a commonly accepted fact, and we try to disprove (reject) this hypothesis using various tests.

**ii. ALTERNATIVE HYPOTHESIS:** In statistical terms, the hypothesis that there is a significant difference between the two variables is the alternative hypothesis. It is the hypothesis opposite, or alternate, to the null hypothesis.

**Degree of Freedom:** It is the number of values in the final calculation of a statistic that are free to vary to estimate a parameter. The degrees of freedom equal your sample size minus the number of parameters you need to calculate during an analysis.

#### Types of tests: There are two types of tests in statistics

i. **Two-Tailed test:** If we have an alpha of 5%, a two-tailed test allots half of the alpha to one tail of the distribution of the test statistic and the other half to the other tail. In this type of test, we test our null hypothesis in both directions of the distribution.

Suppose we want to test whether the students have scored 70% marks or not. Our null hypothesis will be mean (x) = 70. With a two-tailed test, we test both whether the mean is significantly greater than 70 and whether it is significantly less than 70.

ii. **One-Tailed Test:** It is exactly like the two-tailed test, but here we test our null hypothesis in only one direction. There are two types of one-tailed test:

- **Right tailed test:** when the null hypothesis is tested on the right side of the distribution.
- **Left tailed test:** when the null hypothesis is tested on the left side of the distribution.

**P-value:** It is the probability of obtaining results at least as extreme as the observed results of a statistical hypothesis test, assuming that the null hypothesis is correct.
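To make the one-tailed, two-tailed, and p-value ideas concrete, here is a minimal sketch of a z-test for the null hypothesis mean = 70. The sample mean (72), population standard deviation (10), and sample size (100) are assumed values for illustration, and the function name `z_test_p_values` is hypothetical.

```python
import math
from statistics import NormalDist


def z_test_p_values(sample_mean, mu0, sigma, n):
    """Return the z statistic and the two-tailed, right-tailed and left-tailed p-values."""
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    p_left = standard_normal.cdf(z)
    p_right = 1 - p_left
    p_two = 2 * min(p_left, p_right)  # alpha is split across both tails
    return z, p_two, p_right, p_left


z, p_two, p_right, p_left = z_test_p_values(sample_mean=72, mu0=70, sigma=10, n=100)
print(f"z = {z:.2f}")
print(f"two-tailed p = {p_two:.4f}, right-tailed p = {p_right:.4f}")
```

With these assumed numbers the two-tailed p-value falls just below 0.05, so the null hypothesis would be rejected at the 5% level.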

**Sampling Error:** A sample cannot capture all the characteristics of the population, so conclusions drawn from it are prone to some errors, and those errors are known as sampling errors.

When testing a hypothesis, we come up with two types of errors:

- **Type 1 Error:** rejecting H0 when H0 is true. Its probability is also known as the producer’s risk.
- **Type 2 Error:** accepting (failing to reject) H0 when H0 is false. Its probability is also known as the consumer’s risk.

#### Questions

**Ques.** Why do we reject our null hypothesis when the p-value is small?

**Ans:** The p-value is the observed level of significance, which is compared with the chosen level of significance (alpha). A small p-value means that results as extreme as ours would be unlikely if the null hypothesis were true, so when the p-value is below alpha we reject the null hypothesis.

**Ques.** Which error is more dangerous?

**Ans:** Type 1 error is usually treated as the more dangerous one, as it involves the producer’s risk: products/supplies which are fit to be supplied will not get into the market. In practice the answer depends on the cost of each error in the given context.

**Ques.** What does the level of significance depict?

**Ans:** The probability of a Type 1 error is known as the level of significance; it is also known as the size of the critical region.

**Ques.** Is p=0.05 significant?

**Ans:** By convention, a p-value under 0.05 is taken as evidence against the null hypothesis, so we reject the null hypothesis in favor of the alternative one. A p-value of exactly 0.05 sits on the boundary.

**Ques.** What is the difference between one tail and two tail?

**Ans:** In the one-tailed test, the entire 5% region lies either in the right or in the left tail, whereas the two-tailed test splits the alpha into two parts, one in each tail.

**Ques.** When can the two-tailed test be used?

**Ans:** The two-tailed test uses both the positive and the negative tails of the distribution, so it is used when we want to detect a difference in either direction.

**Ques.** What is a critical region?

**Ans:** It is the region of values of the test statistic that leads to the rejection of the null hypothesis at some probability level.

**Ques.** Which is the best critical region?

**Ans:** The best critical region is the region that helps in minimizing both kinds of errors: for a fixed probability of Type 1 error, it is the one with the smallest probability of Type 2 error.

#### Types Of Test and Goodness of Fit

**GOODNESS OF FIT:** It is the statistical technique used to determine how well the sample observations fit the distribution expected for the population. It compares the observed sample distribution with the expected probability distribution.

**Now the different types of test which we know are:**

**CHI-SQUARE TEST:** This test is used to test the relationship between categorical variables. The null hypothesis is that there is no significant association between them, which means they are independent.

A very high chi-square value indicates that the observed data do not fit the expected values well.
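The chi-square statistic is the sum of (observed − expected)² / expected over all cells. The sketch below applies it to a goodness-of-fit check for a fair die; the observed counts are invented for illustration, and 11.070 is the standard table value for 5 degrees of freedom at the 5% level.

```python
def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical counts from 100 rolls of a die, for faces 1 to 6
observed = [18, 22, 16, 14, 12, 18]
# Under the null hypothesis (fair die) every face is equally likely
expected = [sum(observed) / len(observed)] * len(observed)

statistic = chi_square_statistic(observed, expected)
degrees_of_freedom = len(observed) - 1

# Table value of chi-square for 5 degrees of freedom at the 5% level
CRITICAL_VALUE_5_PERCENT_DF5 = 11.070
reject_null = statistic > CRITICAL_VALUE_5_PERCENT_DF5

print(f"chi-square = {statistic:.2f}, df = {degrees_of_freedom}, reject H0: {reject_null}")
```

Here the statistic is well below the critical value, so these counts are consistent with a fair die.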

**T-TEST:** This test helps us to know whether there is a significant difference between two means, or between a mean and a given value. The t-distribution curve approaches the normal distribution curve as the degrees of freedom increase. There are three types of tests involving the t-distribution:

i. **Independent Two-Sample T-Test:** this test checks for a significant difference between the means of two independent samples

ii. **Paired T-Test:** this test measures the before and after effect of a situation on the same subjects

iii. **One-Sample T-Test:** this test compares one sample mean with a standard or given mean
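A paired t-test is simply a one-sample t-test applied to the before/after differences, with t = (mean − μ0) / (s / √n) and n − 1 degrees of freedom. The following sketch shows that relationship; the scores and function names are illustrative assumptions.

```python
import math
import statistics


def one_sample_t(values, mu0):
    """t statistic for testing whether the mean of `values` equals mu0."""
    n = len(values)
    # stdev uses n - 1 in the denominator (sample standard deviation)
    standard_error = statistics.stdev(values) / math.sqrt(n)
    return (statistics.fmean(values) - mu0) / standard_error


def paired_t(before, after):
    """Paired t statistic: a one-sample t-test on the pairwise differences."""
    differences = [b - a for b, a in zip(before, after)]
    return one_sample_t(differences, 0)


# Hypothetical measurements on the same five subjects
before = [72, 75, 68, 80, 77]
after = [70, 74, 65, 79, 73]

t_statistic = paired_t(before, after)
degrees_of_freedom = len(before) - 1
print(f"t = {t_statistic:.3f} with {degrees_of_freedom} degrees of freedom")
```

The resulting statistic is then compared with the t-table value for the chosen significance level and degrees of freedom.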

**F-TEST:** This test generally arises when a model is fitted to the data by the method of least squares. It compares the variances of two different populations and finds out whether there is any significant difference between them.

**Z-TEST:** This test is applied to check the difference between two population means when the population variances are known or the samples are large. This test is based on the normal distribution.

#### Questions

**Ques:** Name a distribution whose variance is twice its mean?

**Ans:** The chi-square distribution with n degrees of freedom has mean n and variance 2n.

**Ques:** How is the chi-square distribution related to the gamma distribution?

**Ans:** A gamma distribution with parameters ½ and ½n is a chi-square distribution with n degrees of freedom.

**Ques.** What is the sum of independent chi-square variates?

**Ans:** The sum of independent chi-square variates is also a chi-square variate, with degrees of freedom equal to the sum of the individual degrees of freedom.

**Ques.** If X1 and X2 are two independent chi-square variates with n1 and n2 df, which distribution will X1/X2 follow?

**Ans:** If these are two independent variates then X1/X2 follows a beta distribution of the second kind, β₂(n1/2, n2/2).

**Ques.** State the conditions for the chi-square test?

**Ans:** The conditions for the chi-square test are:

i. The sample observations should be independent of each other

ii. N, the total frequency, should be reasonably large, say greater than 50.

iii. Constraints of the cell frequency should be linear.

iv. No cell frequency should be less than 5

v. We will not make any kind of assumption regarding the parent population.

**Ques.** What distribution will (2X)^1/2 follow?

**Ans:** If X is a chi-square variate with n degrees of freedom, then for large n, (2X)^1/2 approximately follows a normal distribution with mean (2n − 1)^1/2 and variance 1.

**Ques.** Which test can be used for testing the homogeneity of independent estimates of the population correlation coefficient?

**Ans:** The chi-square test can be used for this.

**Ques.** What does the term “degree of freedom” mean?

**Ans:** The number of independent variates which make up a statistic is called its degrees of freedom.

**Ques.** In what condition will Yates’s correction be used?

**Ans:** Yates’s correction (a correction for continuity) is used in a 2×2 contingency table when any of the cell frequencies is smaller than 5. In such a table the usual method of pooling cells is not possible, so 0.5 is subtracted from each absolute difference between observed and expected frequency before squaring.

**Ques.** In what condition will Student’s t-distribution follow the Cauchy distribution?

**Ans:** When we put v = 1 in the pdf of the t-distribution, it reduces to the Cauchy distribution.

**Ques.** What is the value of the odd moments in the t-distribution?

**Ans:** The odd moments of the t-distribution (about the mean, when they exist) are all 0, because the distribution is symmetric.

**Ques.** Which test will be used to test the significance of an observed partial correlation coefficient?

**Ans:** The t-test will be used here.

**Ques.** State the applications of the t-distribution?

**Ans:** The applications of the t-distribution are:

- To test whether the sample mean differs significantly from the population mean
- To test the significant difference between the two samples
- To test the significance of an observed sample correlation coefficient.

**Ques.** What will be the null hypothesis under the t-test for a single mean?

**Ans:** The null hypothesis will be that there is no significant difference between the sample mean and the population mean.

**Ques.** Suppose that the government wants to calculate the before and after effect of the pandemic situation in DELHI for working-class people. Which test should the government apply?

**Ans:** The PAIRED T-TEST should be applied by the government, as it takes into account the before and after effect.

**Ques.** Define the F-distribution?

**Ans:** The ratio of two independent chi-square variates, each divided by its degrees of freedom (v1 and v2), follows the F-distribution with (v1, v2) degrees of freedom.

**Ques.** What is the mode of the F-distribution?

**Ans:** The mode of the F-distribution is always less than unity.

**Ques.** Which test will be used to test the equality of two population variances?

**Ans:** The F-TEST will be used to test the equality of two population variances.

**Ques.** State the applications of the Z-transformation?

**Ans:** Fisher’s Z-transformation has the following applications:

- To test whether an observed value of r differs significantly from a hypothetical value in the correlation coefficient.
- To test the significant difference between two independent correlation coefficients.

#### NON-PARAMETRIC TESTS

Non-parametric methods do not assume that the sample is drawn from any particular distribution. These tests are more robust than parametric ones, as they can be applied to a broader range of situations.

These tests generally have less power to reject the null hypothesis: when the parametric assumptions hold, the p-value from a non-parametric test tends to be higher than the p-value from the corresponding parametric test.

These tests are simpler than parametric tests, but when the assumptions of a parametric test are met, the parametric test gives more accurate results.

Some advantages of using non-parametric are:

- They are very simple methods that can be applied in a wide range of situations.
- No assumption is made about the form of the frequency function.
- They can deal with data that are available only as ranks.

#### Questions

**Ques.** What is the difference between parametric and non-parametric methods?

**Ans:** In parametric methods a specific form is assumed for the parent population, but in non-parametric methods this assumption is missing.

Non-parametric methods, unlike parametric ones, have spread their roots into the analysis of socio-economic data, in fields such as psychometry and sociology.

Parametric methods do not work with data given only as ranks, whereas non-parametric methods can work with ranked data.

**Ques.** What is the meaning of RUN?

**Ans:** A run is a sequence of identical letters (or other symbols) that is preceded and followed by a different letter, or by no letter at all.
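Counting runs is the first step of a run test. The sketch below uses `itertools.groupby`, which groups consecutive identical symbols; the sequence and function names are illustrative only.

```python
from itertools import groupby


def count_runs(sequence):
    """Number of runs: maximal blocks of consecutive identical symbols."""
    return sum(1 for _ in groupby(sequence))


def run_lengths(sequence):
    """List of (symbol, length) pairs, one per run."""
    return [(symbol, len(list(group))) for symbol, group in groupby(sequence)]


sequence = "AABBBABB"
print(count_runs(sequence))   # 4 runs: AA, BBB, A, BB
print(run_lengths(sequence))
```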

**Ques.** Name two tests which are based upon one sample or matched pairs.

**Ans:** The SIGN TEST and the WILCOXON SIGNED RANK TEST are generally used as one-sample (or matched-pair) tests.

**Ques.** Name one test which is based upon two independent samples?

**Ans:** The MANN-WHITNEY TEST is based upon two independent samples.

**Ques:** Name one test which can be used as an alternative to ONE WAY ANOVA?

**Ans:** The KRUSKAL WALLIS TEST, which is a non-parametric test, is an alternative to one-way ANOVA.

**Ques.** If we want to check for trends in time series data, which non-parametric test can be used?

**Ans:** The Mann-Kendall test can be used to judge the trend.

**Ques.** Suppose that we need to compare our data with a median and then look at the deviations in the sample. Which test can be used in this case?

**Ans:** The WILCOXON SIGNED RANK TEST can be used: it tests a hypothesized median using the ranks of the deviations from it.

**Ques.** Which test can be used to find a correlation between two samples?

**Ans:** Spearman’s rank correlation can be used to find the correlation between two samples.

**Ques.** Suppose that I have two samples whose central tendencies are to be compared. Which test can be used?

**Ans:** The MEDIAN TEST can be used in this case, as it compares the central tendency of two samples.

#### MAXIMUM LIKELIHOOD FUNCTION, ROC CURVE, AND PROPERTIES OF GOOD ESTIMATOR

Maximum likelihood estimation (MLE) is the method of estimating parameters by maximizing a likelihood function, so that the observed data are most probable under the assumed model. Under regularity conditions this method gives asymptotically efficient results, although an MLE need not always be uniquely defined.

Steps to calculate the MLE:

- Write the likelihood function of the sample (in practice, usually its logarithm)
- Differentiate it with respect to the parameter for which you need the MLE
- Set the first derivative to zero and solve for the parameter
- Differentiate again and check that the second derivative is negative, which confirms a maximum

Hence, by following these simple steps, we will be able to find the maximum likelihood estimate.
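As a worked sketch of these steps, consider a Bernoulli model with k successes in n trials. The log-likelihood is k·log(p) + (n − k)·log(1 − p); setting its first derivative to zero gives p = k/n, and the second derivative is negative there. The code below checks this numerically with a simple grid search; the counts (7 successes in 10 trials) and function names are assumptions made for illustration.

```python
import math


def log_likelihood(p, successes, trials):
    """Bernoulli log-likelihood for `successes` out of `trials`."""
    return successes * math.log(p) + (trials - successes) * math.log(1 - p)


def second_derivative(p, successes, trials):
    """Second derivative of the log-likelihood with respect to p."""
    return -successes / p ** 2 - (trials - successes) / (1 - p) ** 2


successes, trials = 7, 10

# Steps 2-3: solving d/dp log L = 0 by hand gives p = k / n
closed_form = successes / trials

# Numerical check: the grid point with the highest log-likelihood
grid = [i / 1000 for i in range(1, 1000)]
grid_best = max(grid, key=lambda p: log_likelihood(p, successes, trials))

# Step 4: a negative second derivative confirms a maximum
curvature = second_derivative(closed_form, successes, trials)

print(closed_form, grid_best, curvature)
```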

ROC CURVE: This curve tells us how capable our model is of distinguishing between different classes. The higher the area under the curve (AUC), the better the model.

Properties of good estimator:

**UNBIASEDNESS**

An estimator is said to be unbiased if its expected value exists and equals the parameter being estimated. Sometimes an unbiased estimator can give an absurd answer as well.

For example: if E(T) = Y, then T is an unbiased estimator of Y.

**CONSISTENCY**

An estimator is consistent if it converges to the parameter as n tends to infinity. It is sufficient for two things to hold: the expectation of the estimator should tend to the parameter, and its variance should tend to zero as n tends to infinity.

**EFFICIENCY**

The ratio of the variances of two estimators is termed the (relative) efficiency; the estimator with the smaller variance is the more efficient one.

**SUFFICIENCY**

When the estimator contains all the information in the sample about the parameter, it is called a sufficient estimator.

#### Questions

**Ques.** What is the meaning of MLE?

**Ans:** MLE is a method for estimating the parameters of a probability distribution: we maximize the likelihood function so that the observed data are the most probable under the assumed statistical model.

**Ques.** Is MLE UNBIASED and CONSISTENT?

**Ans:** Under regularity conditions MLEs are consistent, but they need not be unbiased; any bias typically vanishes as the sample size increases.

**Ques.** Is MLE the most efficient estimator?

**Ans:** Yes, under regularity conditions MLEs are asymptotically the most efficient estimators, and if an efficient estimator exists the method of maximum likelihood provides it.

**Ques.** What is the ROC curve all about?

**Ans:** The ROC curve is a graphical representation of the diagnostic ability of a binary classifier as its decision threshold is varied.

**Ques.** What is the full form of ROC?

**Ans:** RECEIVER OPERATING CHARACTERISTIC curve

**Ques.** What does the ROC curve denote?

**Ans:** The ROC curve shows the trade-off between the true positive rate (benefit) and the false positive rate (cost) of using a particular test or model.

**Ques.** What does the ROC AUC value tell us?

**Ans:** As a rule of thumb for ROC AUC:

- 0.5 suggests that there is no discrimination
- 0.7-0.8 suggests that the discrimination is acceptable
- more than 0.9 suggests that it is outstanding
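One way to read the AUC is as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one (ties count as half). The sketch below computes AUC from that definition by comparing every positive/negative pair; the labels, scores, and the function name `roc_auc` are illustrative assumptions, and libraries normally use faster equivalent methods.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Hypothetical true labels and model scores
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc(labels, scores)
print(auc)  # 0.75: three of the four positive/negative pairs are ordered correctly
```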

**Ques.** What is an Estimator?

**Ans:** An estimator is a statistic used to estimate some parameter of the population.

**Ques.** Name two types of ESTIMATIONS?

**Ans:** The two types of estimation are:

- Point estimation: the parameter is estimated by a single value (a single point).
- Interval estimation: the parameter is estimated by a range of numbers in which the population parameter is expected to lie.

**Ques.** State the properties of a good estimator?

**Ans:** There are 4 properties of a good estimator:

- **Unbiasedness:** If θ̂ is an unbiased estimate of θ, then we must have E(θ̂) = θ.
- **Consistency:** If an estimator, say θ̂, approaches the parameter θ closer and closer as the sample size n increases, θ̂ is said to be a consistent estimator of θ.
- **Efficiency:** The concept of efficiency refers to the sampling variability of an estimator. If two competing estimators are both unbiased, the one with the smaller variance (for a given sample size) is said to be relatively more efficient.
- **Sufficiency:** An estimator is said to be sufficient if it conveys all the information about the parameter that is contained in the sample.

**Ques.** Do unbiased estimates always exist?

**Ans:** No, an unbiased estimator does not always exist, and even when one exists it can sometimes give absurd results.

#### Supervised and Unsupervised

**SUPERVISED LEARNING:** In the supervised model we train the model on well-labeled data so that it can predict the output for unseen data. The labeled data has to be built and prepared carefully so that the desired results can be obtained.

**Ques.** Why is supervised learning important/helpful?

**Ans:**

- Labeled data helps the model to correct its mistakes using previous experience.
- It helps in the optimization of the results.
- It helps in tackling real-life problems, which helps in increasing efficiency.

**UNSUPERVISED LEARNING:** In the unsupervised model, no predetermined output is set for the inputs, which means we do not need to supervise the model: it is free to explore the data and can go in many directions.

**Ques.** Why is unsupervised learning important/helpful?

**Ans:**

- As the model is free to explore, it can find patterns in every direction.
- It is much easier to obtain unlabeled data than labeled data.

#### Questions

**Ques.** Suppose that I want to go from home to work. So, I check the weather conditions, the route which can be taken, and the stations where I can stop to take food. Which model am I taking into consideration?

**Ans:** The person is taking into consideration the SUPERVISED MODEL, as every aspect here is well labeled, from the type of route to the stations where the person will stop.

**Ques.** State some techniques used in supervised learning?

**Ans:** There are two main techniques:

- Regression: the model predicts a continuous value as the required result.
- Classification: the model assigns observations to different categories as the required result.

**Ques.** State some techniques used in unsupervised learning?

**Ans:** There are two main techniques:

- Clustering: it deals with structure or patterns in the data and finds groups if they naturally exist.
- Association: it helps in establishing associations between items in the data if they naturally exist.

**Ques.** State some algorithms used in both supervised and unsupervised models?

**Ans:** Support vector machines, neural networks, linear and logistic regression, random forests, and classification trees are used in supervised models. Unsupervised algorithms include clustering algorithms such as K-means and hierarchical clustering.
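To give a feel for an unsupervised algorithm, here is a minimal one-dimensional K-means sketch: it alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its cluster. The data points, the starting centroids, and the function name `k_means_1d` are assumptions chosen for illustration; real libraries handle initialization and convergence more carefully.

```python
def k_means_1d(points, centroids, iterations=10):
    """Very small K-means for 1-D data with given starting centroids."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(point - centroids[i]))
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its cluster
        centroids = [
            sum(cluster) / len(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Unlabeled data with two visible groups; no target values are supplied
points = [1.0, 1.5, 2.0, 9.0, 9.5, 10.0]
centroids, clusters = k_means_1d(points, centroids=[0.0, 5.0])
print(centroids)
print(clusters)
```

No labels are given to the algorithm, yet it recovers the two groups, which is the key contrast with the supervised techniques listed above.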

**Ques.** Which model will be used when we need to group real-life big data?

**Ans:** An unsupervised model will be used here, as these models need no labels or predefined direction and hence help in grouping the data.
