The post Data Sciences Interview preparation appeared first on StepUp Analytics.

**Central tendency** describes the centermost value of a data set, i.e. where most of the values lie. The three ways to measure central tendency are:

**Mean:** the measure that includes every value in the data set: the sum of all the values divided by the number of values.

**Median:** the middle value of the data once the observations have been arranged in order.

**Mode:** the value that occurs most frequently in the data set.
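The three measures can be computed directly with Python's standard library; the data values below are made up purely for illustration:

```python
import statistics

data = [2, 4, 4, 6, 8, 10]  # illustrative values, not from the text

mean = statistics.mean(data)      # sum of values / number of values
median = statistics.median(data)  # middle value of the ordered data
mode = statistics.mode(data)      # most frequent value
print(mean, median, mode)
```

Here the median of the six ordered values is the average of the two middle ones (4 and 6), and the mode is 4, the only repeated value.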

**Dispersion** is the extent to which numerical data vary about an average value; dispersion is nothing else but “scatteredness”.

The main measures of dispersion are:

**Range:** It is the difference between two extreme observations of the distribution.

**Quartile Deviation:** Quartile Deviation (QD) is half of the difference between the upper and lower quartiles: QD = (Q3 − Q1)/2.

**Mean Deviation:** Mean deviation is a statistical measure which gives us the average of the absolute deviations of the values from the mean (or another average) in a sample.

**Standard Deviation:** the positive square root of the arithmetic mean of the squared deviations from the mean. It is a measure of the amount of variation or dispersion of a set of values. A low standard deviation indicates that the values tend to be close to the mean (also called the expected value) of the set, while a high standard deviation indicates that the values are spread out over a wider range from the mean.
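The four dispersion measures above can be sketched in a few lines of Python. The data and the quartile method are assumptions for the example; `statistics.quantiles` uses its default "exclusive" method, so other quartile conventions would give slightly different values:

```python
import statistics

x = [2, 4, 6, 8, 10]  # assumed example data

data_range = max(x) - min(x)                       # Range: max minus min
q1, _, q3 = statistics.quantiles(x, n=4)           # lower and upper quartiles
quartile_dev = (q3 - q1) / 2                       # Quartile Deviation
m = statistics.mean(x)
mean_dev = statistics.mean(abs(v - m) for v in x)  # Mean Deviation about the mean
std_dev = statistics.pstdev(x)                     # population Standard Deviation

print(data_range, quartile_dev, mean_dev, std_dev)
```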

**Quartiles:** These divide the probability distribution into areas of equal probability. Quartiles are 3 cut points that divide the data into 4 equal parts.

The first quartile (Q1) is defined as the middle number between the smallest number and the median of the dataset.

The second quartile (Q2) is the median of the data and 50% of the data lies below this point.

The third quartile (Q3) is the middle value between the median and the highest value of the data set.

Skewness is defined as a lack of symmetry. Generally, skewness gives an idea of the shape of the curve, which can be termed positively skewed or negatively skewed.

Kurtosis gives an idea about the peakedness or flatness of the frequency curve; curves can be classified as normal (mesokurtic), leptokurtic, and platykurtic.

The key difference between skewness and kurtosis is that skewness describes the degree of asymmetry (lopsidedness) of a distribution, while kurtosis describes its peakedness or flatness.

**Ques**. What does a low and high standard deviation indicate?**Ans:** A low standard deviation indicates that the values tend to be close to the mean of the set, while a high standard deviation indicates that the values are spread out over a wider range.

**Ques. **Suppose that you are 5 feet tall and you have to cross a river whose depth observations (in feet) are 2, 4, 8, 6, and 10. Would you be able to cross the river without sinking? Which method would you apply here?**Ans:** We use the range (a measure of dispersion) to form an instant picture of whether we can cross the river. The highest observation is 10 and the lowest is 2, giving a range of 8; since the deepest point (10) is well above your height, it would not be possible to cross the river without sinking.

**Ques.** Which is the ideal average among the measures of central tendency?**Ans:** The arithmetic mean, which satisfies all the conditions of an ideal average, can be regarded as the most accurate one.

**Ques. **What is the median of the human body?**Ans:** Navel is the median of the body which divides the body into two equal parts.

**Ques. **What are the different types of mean?**Ans:** There are 3 types of mean:

**Arithmetic**: the simple mean, the sum of all the observations divided by the total number of observations.

**Harmonic**: generally used when rates are involved; it is the reciprocal of the arithmetic mean of the reciprocals.

**Geometric**: the nth root of the product of n values.
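All three means are available in Python's `statistics` module; the values here are chosen only so the results are easy to check by hand:

```python
import statistics

x = [2, 4, 8]  # illustrative values

am = statistics.mean(x)            # arithmetic: (2 + 4 + 8) / 3
hm = statistics.harmonic_mean(x)   # harmonic: 3 / (1/2 + 1/4 + 1/8)
gm = statistics.geometric_mean(x)  # geometric: (2 * 4 * 8) ** (1/3)
print(am, hm, gm)
```

For positive data the three always satisfy harmonic ≤ geometric ≤ arithmetic, which this example shows (24/7 ≤ 4 ≤ 14/3).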

**Ques.** Which is the best measure of dispersion?**Ans:** Standard deviation is the best method, as it satisfies almost all the properties of an ideal measure of dispersion; however, it is affected by outliers, since it is based on the mean, which is a very sensitive index.

**Ques. **Which method of dispersion is based on 50% of the data?**Ans:** Quartile deviation is the measure of dispersion that uses only the middle 50% of the data, so it cannot be regarded as fully reliable, as it ignores the rest of the data.

**Ques.** Does the mean exist for 2, 3, 4, 5, 6, 7, and 1000?**Ans:** Yes, the mean exists for this distribution, but because an extreme value is present it will not give an accurate summary. The mean is about 146.71, and none of the observations accumulate around that value, so it does not provide a useful result.
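The data from this question illustrate how an outlier drags the mean while leaving the median representative:

```python
import statistics

x = [2, 3, 4, 5, 6, 7, 1000]

m = statistics.mean(x)      # pulled far above every typical value
med = statistics.median(x)  # unaffected by the single outlier
print(round(m, 2), med)     # 146.71 5
```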

**Ques.** What is the meaning of absolute and relative measures of dispersion?**Ans:** An absolute measure expresses the dispersion in the same units as the original data, whereas relative dispersion is defined as a ratio, such as the standard deviation divided by the mean. Unlike absolute dispersion, relative dispersion is dimensionless.

**Ques.** If all the observations are increased by 10, what will be the effect on the standard deviation?**Ans:** It will remain unchanged, as the variance is independent of a change of origin, and the standard deviation is the square root of the variance, so it also remains unchanged.
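A quick check of this change-of-origin property (the data values are arbitrary):

```python
import statistics

x = [4, 8, 15, 16, 23]         # arbitrary example data
shifted = [v + 10 for v in x]  # change of origin: every value moved by +10

# The spread around the (shifted) mean is identical.
print(statistics.pstdev(x), statistics.pstdev(shifted))
```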

**Ques. **What is the range of Karl Pearson’s coefficient of skewness?**Ans:** The range of this coefficient is −3 to 3, but in practice these limits are rarely attained.

**Ques.** List the case when the skewness will be called positively and negatively skewed?**Ans:** There are generally two ways to say that:

i. skewness is said to be positive when mean > mode/median

ii. skewness is said to be negative when mean < mode/median

**Ques.** List the cases when Bowley’s coefficient of skewness can be used?**Ans:** Bowley’s coefficient, also known as the quartile coefficient of skewness, is extremely useful when the median and quartiles are being used.

**Ques. **What is the key difference between median and mean?**Ans:** Median divides the whole data into two equal halves i.e. the probability of one half will be exactly equal to the other half which will be 0.5

Mean is the value around which most of the values lie. It does not divide the data into two equal halves but both of these can be equal in various situations.

**Ques. **If we have ordered data, can we use the mode?**Ans: **The mode is used when we are dealing with unordered or nominal data.

There are two types of Categorical data:

1. Ordered data (mainly values representing rank or order)

2. Unordered data

When we have ordinal data, the median is the best measure, and when the data is nominal (unordered categories), the mode is the best choice.

**Ques. **We have the values 20, 25, 29, 25, 18, 20, 30, 31, 26, whose mean is approximately 24.89. If we take the sum of the deviations of these values from their mean, what will the value be?**Ans: **The sum of deviations about the mean is always zero.
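Using the exact values from the question, the property is easy to confirm (only floating-point noise remains):

```python
x = [20, 25, 29, 25, 18, 20, 30, 31, 26]
m = sum(x) / len(x)                # the mean, about 24.89

total_dev = sum(v - m for v in x)  # sum of deviations about the mean
print(abs(total_dev) < 1e-9)       # True: zero up to rounding error
```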

**Ques. ** Suppose we have some data and the analyst wants to check the variability in the data. What statistic should the analyst prefer for this?**Ans:** The most efficient measure of dispersion is Standard deviation. It helps us to check the variation or the spread of data about the mean. This one is the most efficient because it takes all the data points into account.

**Ques. **What do 0 standard deviations signify?

**Ans: **A standard deviation of 0 means that our random variable is constant: all the data points are identical, with no dispersion at all.

**Example:** If a random variable has values like 9,9,9,9,9,9,9,9,9 then the mean of X will be 9 and its standard deviation will be 0.

**Ques. **If we calculate variance with n-1 in the denominator instead of n, what will it mean?**Ans: **It means that the data we have is a sample and not the population because variance calculated with n-1 in the denominator gives us sample variance.
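The two denominators correspond to `statistics.pvariance` (population, divide by n) and `statistics.variance` (sample, divide by n − 1); the data are illustrative:

```python
import statistics

x = [2, 4, 4, 4, 5, 5, 7, 9]  # mean is 5, squared deviations sum to 32

pop_var = statistics.pvariance(x)  # 32 / 8 -> population variance
samp_var = statistics.variance(x)  # 32 / 7 -> sample variance (n - 1)
print(pop_var, samp_var)
```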

When we deal with a bivariate distribution, we are interested in finding the correlation and covariance between the two variables. If a change in one variable is accompanied by a change in the other, the variables are said to be correlated. The correlation can be:

i. If the two variables deviate in the same direction, i.e. an increase in one variable results in a corresponding increase in the other, they are said to be positively correlated.

ii. If the two variables deviate in opposite directions, i.e. an increase in one variable results in a decrease in the other, they are said to be negatively correlated.

**Regression Analysis** is statistical modeling to find the relationship between a dependent variable and an independent variable. This is undertaken to predict future values based on the previously given data. It helps in estimating the relationship between two or more dependent and independent variables or how one variable is affecting the others and by how much.

The most common form of regression is linear regression which assumes a linear relationship between the input variables (x) and the single output variable (y). More specifically, y can be calculated from a linear combination of the input variables.

For Example: if suppose we have to find a relationship between pen and ink, so we can say that pen is a dependent variable, and ink is an independent variable.

**Ques:** Are correlation and dependency the same thing?**Ans:** No. Independence between two variables implies zero correlation, but the converse is not true: a perfect dependency can still have zero correlation.

**Example:** take y = x² over a range of x that is symmetric about zero. Then y is perfectly dependent on x, yet the positive and negative linear associations cancel, giving a Pearson correlation of zero. This tells us that a pair of variables that are perfectly dependent on each other can still have zero correlation.
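The y = x² example worked out numerically: the covariance (and hence the Pearson correlation) comes out exactly zero even though y is fully determined by x:

```python
x = [-2, -1, 0, 1, 2]  # symmetric about zero
y = [v * v for v in x]  # perfectly dependent on x

mx = sum(x) / len(x)  # 0.0
my = sum(y) / len(y)  # 2.0
cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
print(cov)  # 0.0 -> zero correlation despite perfect dependence
```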

**Ques:** Suppose that Raman ate 5 plates of pasta ordered online, and the owner of the restaurant is infected with corona. Is there any chance Raman can get infected? Is the chance of infection negatively or positively related?**Ans: **Yes, Raman has a chance of getting infected, as on one day or another he could (without knowing) have visited the restaurant, and as this is a communicable disease the chance of transfer is very high. Hence, we can say that the chances are positively related to each other.

**Ques:** What do densely packed points depict?**Ans:** There can be two cases here:

i. If the points of data are close enough to each other given the condition that they are dense in nature then they have a good amount of correlation among them.

ii. If the points are scattered everywhere, we can say that they don’t have a very good amount of correlation among them.

**Ques:** What is the relation between two independent variables?**Ans:** Two independent variables are always uncorrelated (though uncorrelated variables need not be independent).

**Ques:** Is the Pearson coefficient sensitive to outliers?**Ans: **Yes, the Pearson coefficient is very sensitive to outliers, as even a single point can change the magnitude and direction of the coefficient.

**Ques:** What is the difference between CORRELATION and REGRESSION?**Ans:** The differences are:

i. The slope in a linear regression gives the marginal change in the output/target variable when the independent variable changes by one unit. Correlation has no slope.

ii. The intercept in a linear regression gives the value of the target variable when the input/independent variables are set to zero. Correlation does not carry this information.

iii. Linear regression can give you a prediction given all the input variables. Correlation analysis does not predict anything.

**Ques**. What is the difference between correlation and covariance?**Ans:** Correlation is simply the covariance normalized by the standard deviations of both variables, which is done to keep the result within fixed limits (−1 to +1). In simple words, both terms measure the relationship and the dependency between two variables: “covariance” indicates the direction of the linear relationship between variables, while “correlation” measures both the strength and the direction of that linear relationship.
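This normalization is a one-liner to verify; the small data set is invented for the example, and the result matches the usual Pearson formula:

```python
import statistics

x = [1, 2, 3, 4, 5]  # illustrative data
y = [2, 4, 5, 4, 5]

mx, my = statistics.mean(x), statistics.mean(y)
cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

# Correlation = covariance normalized by both standard deviations.
r = cov / (statistics.stdev(x) * statistics.stdev(y))
print(round(r, 4))  # a dimensionless value in [-1, 1]
```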

**Ques**. If a constant 50 is subtracted from each value of X and Y, what will be the change in the regression coefficient?**Ans:** There will be no change, as the regression coefficient is independent of a change of origin (though not of a change of scale).
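A sketch of this invariance with made-up data, using the textbook slope formula b = cov(x, y)/var(x):

```python
def slope(x, y):
    """Regression coefficient of y on x: cov(x, y) / var(x)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    return cov / var_x

x = [60, 62, 65, 70, 72]        # invented example values
y = [110, 112, 120, 125, 130]
x_shift = [v - 50 for v in x]   # change of origin in X
y_shift = [v - 50 for v in y]   # change of origin in Y

# The slope is unchanged by the shift (up to floating-point noise).
print(abs(slope(x, y) - slope(x_shift, y_shift)) < 1e-9)  # True
```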

**Ques.** Why do we have two lines of regression?**Ans:** There are always two lines of regression: one of Y on X and another of X on Y.

**Ques**. What is the angle between the two lines?**Ans:** Yes, there are two angles between them namely:

i. acute angle

ii. obtuse angle.

**Ques**. When do the two lines intersect?**Ans:** The two regression lines always intersect each other at the point of the means (x̄, ȳ); they coincide only when the correlation is perfect (r = ±1).

**Ques.** State the difference between homoscedastic and heteroscedastic**Ans:** Homoscedastic refers to a condition in which the variance of the residual, or error term, in a regression model is constant, so the points sit at a roughly equal spread around the line. Heteroscedastic data show a residual variance that changes across observations, often with outliers, so the points vary widely around the line.

**Ques**. What are the assumptions for REGRESSION?**Ans:** The basic assumptions for regression are as follows:

i. Independence of errors

ii. Normality of the error distribution

iii. Quantitative data condition

iv. Homoscedasticity

**Ques.** What does Multiple Regression mean?**Ans:** Multiple regression comes into play when we have more than one explanatory (independent) variable for a single dependent variable.

**Ques.** State the assumptions of Multiple Regression?**Ans:** The assumptions are as follows:

i. Multivariate normality: the residuals are assumed to be normally distributed

ii. No multicollinearity: the explanatory variables are assumed not to be highly correlated with each other

A hypothesis is some kind of statement about the population distribution which we want to verify from the kind of information available to us. It is a proposition that is made according to past data, facts, and personal assumptions.

When the hypothesis specifies the population distribution completely, it is known as a SIMPLE HYPOTHESIS, whereas when it does not specify the population completely it is known as a COMPOSITE HYPOTHESIS.

**We have two kinds of hypothesis:**

**i. NULL HYPOTHESIS:** It is a type of hypothesis which assumes that there is no specific difference between two or more characteristics of the sample. It is usually a commonly accepted fact, and we try to disprove this hypothesis using various tests.

**ii. ALTERNATIVE HYPOTHESIS:** In statistical terms, when there is a significant difference between the two variables it is an alternative hypothesis. It is mainly a hypothesis opposite or alternate of the Null Hypothesis.

**Degree of Freedom:** It is the number of values in the final calculation of a statistic that are free to vary to estimate a parameter. The degrees of freedom equal your sample size minus the number of parameters you need to calculate during an analysis.

i. **Two-Tailed test:** If we have an alpha of 5% then it will allot half of the alpha at one tail of the distribution of the test statistic and the other half on the other. In this type of test, we try to test our null hypothesis in both the direction of the distribution.

Suppose we want to test if the students have scored 70% marks or not. Then according to this, our null hypothesis will be mean (x) = 70. Now, according to the two-tailed tests, we will test both if the mean is significantly greater than x and if the mean is significantly less than x.

ii. **One-Tailed Test:** It is exactly like the two-tailed test, but we test our null hypothesis in only one direction. There are two types of one-tailed tests:

**Right-tailed test:** when the rejection region lies in the right tail of the distribution.

**Left-tailed test:** when the rejection region lies in the left tail of the distribution.

**P-value:** It is the probability of obtaining results as extreme as the observed results of a statistical hypothesis test, assuming that the null hypothesis is correct.

**Sampling Error:** A sample cannot include all the characteristics of the population, so it is prone to some error, and that error is known as sampling error.

In testing, two types of errors can occur:

**Type 1 Error:** the probability of rejecting H0 when H0 is true. This is also known as the producer’s risk.

**Type 2 Error:** the probability of accepting H0 when H0 is false. This is also known as the consumer’s risk.

**Ques. **Why do we reject our null hypothesis when the p-value is small?**Ans: **The p-value is the observed level of significance, which is compared with the chosen level of significance; when the p-value falls below it, we reject the null hypothesis.

**Ques. **Which error is more dangerous?**Ans:** Type 1 error is often considered more dangerous, as it involves the producer’s risk, meaning products/supplies that are fit to be supplied will not get into the market.

**Ques. **What does the level of significance depict?**Ans:** The probability of Type-1 error, is known as Level of Significance and it is also known as the size of the critical region.

**Ques.** Is p=0.05 significant?**Ans:** P-value under 0.05 shows strong evidence against the null hypothesis and hence it shows strong evidence to reject the hypothesis and accept the alternative one.

**Ques. **What is the difference between one tail and two tail?**Ans: **In a one-tailed test, the entire 5% rejection region lies either in the right or the left tail, whereas a two-tailed test splits the alpha between the two tails.

**Ques.** When can the two-tailed test be used?**Ans: **A two-tailed test uses both the positive and negative tails of the distribution, so it is used when a difference in either direction is of interest.

**Ques. **What is a critical region?**Ans: **These are those areas that lead to the rejection of the Null Hypothesis at some probability level.

**Ques. **Which is the best critical region?

**Ans:** The best critical region is the one that minimizes both kinds of errors, i.e. Type 1 and Type 2 error.

**GOODNESS OF FIT:** It is a statistical technique to determine how well the sample observations match the assumed population distribution. It compares the observed sample distribution with the expected probability distribution.

**Now the different types of test which we know are:**

**CHI-SQUARE TEST:** This test is used to test the relationship between categorical variables. The null hypothesis is that there is no significant association between the variables, i.e. that they are independent.

A very high chi-square value indicates that the observed data do not fit the expected distribution well.

**T-TEST: **This test helps us find the significant difference between means. The t-distribution curve approaches the normal distribution curve from below as the degrees of freedom increase. There are three types of tests involving the t-distribution:

i. **Two-Sample (Independent) T-Test:** compares the means of two independent groups

ii. **Paired T-Test:** compares measurements before and after a treatment on the same subjects

iii. **One-Sample T-Test:** compares one sample mean with a standard or given mean
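The one-sample case reduces to the statistic t = (x̄ − μ₀)/(s/√n) with n − 1 degrees of freedom; here it is computed by hand on invented data against a hypothesised mean of 70:

```python
import math
import statistics

sample = [68, 72, 71, 69, 74, 70, 73]  # assumed observations
mu0 = 70                               # hypothesised population mean

n = len(sample)
x_bar = statistics.mean(sample)         # sample mean
s = statistics.stdev(sample)            # sample standard deviation (n - 1)

t = (x_bar - mu0) / (s / math.sqrt(n))  # one-sample t statistic
print(round(t, 3))
```

The statistic would then be compared with the t-distribution on n − 1 = 6 degrees of freedom to obtain a p-value.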

**F-TEST:** This test often arises when data are fitted by the method of least squares. It compares the variances of two different populations to find out whether there is any significant difference between them.

**Z-TEST:** This test is applied to check the difference between two population mean when variances are known in the sample. This test is based on the normal distribution.

**Ques:** Name a distribution whose variance is twice of its mean?**Ans:** Chi-square distribution has a mean (n) and variance (2n).

**Ques:** In what condition does the chi-square distribution tend to the gamma distribution?**Ans:** The chi-square distribution with n degrees of freedom is itself a gamma distribution with parameters ½ and n/2.

**Ques.** What is the sum of the independent chi-square variant?**Ans:** The sum of independent chi-square variate is also a chi-square variate.

**Ques.** If X1 and X2 are two independent chi-square variates with n1 and n2 df, which distribution will X1/X2 follow?**Ans:** If these are independent variates, X1/X2 follows a β₂(n1/2, n2/2) distribution (beta distribution of the second kind).

**Ques.** State the conditions for the chi-square test?**Ans:** The conditions for the chi-square test are:

i. The sample observations should be independent of each other

ii. The total frequency N should be sufficiently large (generally greater than 50)

iii. The constraints on the cell frequencies should be linear

iv. No cell frequency should be less than 5

v. No assumption is made regarding the parent population
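For a contingency table the statistic is χ² = Σ(O − E)²/E, with expected counts built from the row and column totals. A hand computation on an invented 2×2 table:

```python
# Observed counts for a hypothetical 2x2 contingency table
observed = [[30, 10],
            [20, 40]]

row_tot = [sum(r) for r in observed]
col_tot = [sum(c) for c in zip(*observed)]
grand = sum(row_tot)

chi2 = 0.0
for i in range(2):
    for j in range(2):
        expected = row_tot[i] * col_tot[j] / grand  # E = (row * col) / N
        chi2 += (observed[i][j] - expected) ** 2 / expected

print(round(chi2, 3))  # compared against the chi-square table with 1 df
```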

**Ques.** What distribution will (2X)^1/2 follow?**Ans:** If X is a chi-square variate with n degrees of freedom, then for large n, (2X)^1/2 approximately follows a normal distribution with mean (2n − 1)^1/2 and variance 1.

**Ques.** Which test can be used for testing the homogeneity of an independent estimate of the population correlation coefficient?**Ans:** Chi-square test can be used to test the same.

**Ques.** What does the term “degrees of freedom” mean?**Ans:** The number of independent values entering the calculation of a statistic is called its degrees of freedom.

**Ques.** In what condition will YATES’S CORRECTION be used?**Ans:** If any cell frequency is smaller than 5, then the method of pooling, popularly applied alongside “Yates’s correction”, can be used.

**Ques.** In what condition will Student’s t-distribution follow the Cauchy distribution?**Ans:** When ν = 1 in the pdf of the t-distribution, it reduces to the Cauchy distribution.

**Ques.** What is the value of the odd moments in t-distribution?**Ans:** The odd moments in the t-distribution will all be 0.

**Ques**. Which test will be used to test the significance of the observed partial correlation coefficient?**Ans:** T-TEST will be used here.

**Ques**. State applications of t-distribution?**Ans: **The application of t-distribution will be-

- To test whether the sample mean differs significantly from the population mean
- To test the significant difference between the two samples
- To test the significance of an observed sample correlation coefficient.

**Ques.** What will be the null hypothesis under the t-test for a single mean**Ans:** The null hypothesis under this will be there is no significant difference between the sample mean and the population mean

**Ques.** Suppose that the government wants to calculate the before and after effect of the pandemic situation in DELHI for working-class people. Which test should the government apply?**Ans:** PAIRED T-TEST should be applied by the government as it takes into account the before and after effect.

**Ques.** Define the F distribution?**Ans:** The ratio of two independent chi-square variates, each divided by its degrees of freedom v1 and v2, defines the F distribution.

**Ques.** What is the mode of F-distribution?**Ans:** The mode of f distribution is always less than unity.

**Ques.** Which test will be used to test the equality of two population variances?**Ans:** F- DISTRIBUTION will be used to test the equality of two population variances.

**Ques.** State the application of Z-transformation?**Ans:** Z-TEST has the following application:

- To test whether an observed value of r differs significantly from a hypothetical value in the correlation coefficient.
- To test the significant difference between two independent correlation coefficients.

Non-parametric methods do not assume that the sample is drawn from any particular distribution. These tests are more robust than others, as they can be applied to a broader range of situations.

These tests give less power to reject the null hypothesis, as the p-value associated with them is typically higher than the p-value of the corresponding parametric test.

These tests are far simpler than parametric tests, but when we talk about accuracy, parametric tests give more accurate results (when their assumptions hold).

Some advantages of non-parametric methods are:

- They are very simple methods that can be used at any time.
- No assumption is made about the form of the frequency function.
- They can deal with every kind of data, including data that come as ranks.

**Ques.** What is the difference between parametric and non-parametric methods?**Ans:** In parametric methods a parent population with a known distributional form is assumed, but in non-parametric methods this assumption is missing.

Parametric methods rarely take socio-economic data into account, whereas non-parametric methods have spread their roots into psychometry, sociology, and statistics.

In parametric methods, ranking of data is not used, whereas in non-parametric methods ranking is central.

**Ques.** What is the meaning of RUN?**Ans:** A RUN is an unbroken sequence of identical symbols preceded and followed by different symbols (or by no symbol at all).

**Ques.** Names two tests which are based upon one sample or matched pair**Ans:** SIGN TEST and WILCOXON SIGNED RANK TEST are generally used in one sample test.

**Ques.** Name one test which is based upon two independent samples?**Ans:** MANN-WHITNEY TEST is the test that is based upon two independent samples.

**Ques:** Name one test which can be used as an alternative to ONE WAY ANOVA?**Ans:** KRUSKAL WALLIS TEST which is a non-parametric test is an alternative test to one-way ANOVA.

**Ques.** If we want to check the trends in time data then which non-parametric test can be used?**Ans:** The Mann-Kendall test can be used to judge the trend for the same.

**Ques.** Suppose that we need to compare our data with a hypothesised median and then use the deviations in the sample. Which test can be used in this case?**Ans:** The WILCOXON SIGNED RANK TEST can be used, as it compares the sample with the median using the signed deviations.

**Ques.** Which test can be used to find a correlation between two samples?**Ans:** Spearman rank correlation is used to find the correlation between two ranked samples.

**Ques.** Suppose that I have two samples in which their central tendencies are being measured, then which test can be used for the same?**Ans:** MEDIAN TEST can be used in this case as they take into account the central tendency which is based upon two samples.

MLE (maximum likelihood estimation) is a method of estimating parameters by maximizing a likelihood function, so that the observed data are most probable under the assumed model. The method generally gives efficient results, although the MLE need not always be uniquely defined.

Steps to calculate MLE:

- Write down the likelihood function (usually working with its logarithm)
- Differentiate it with respect to the parameter whose MLE you need
- Set the first derivative to zero and solve for the parameter
- Check that the second derivative is negative to confirm a maximum

By following these simple steps, we can find the maximum likelihood estimate.
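As a sketch of these steps for a Poisson sample (the counts are invented): the log-likelihood is Σ(xᵢ log λ − λ − log xᵢ!), and setting its derivative to zero gives λ̂ = x̄, which a crude grid search also recovers:

```python
import math

data = [2, 3, 1, 4, 2, 3, 3]  # assumed Poisson counts

def log_likelihood(lam):
    """Poisson log-likelihood of the sample at rate lam."""
    return sum(x * math.log(lam) - lam - math.log(math.factorial(x))
               for x in data)

# Grid search over candidate rates; the analytic MLE is the sample mean.
grid = [i / 1000 for i in range(1, 8001)]
lam_hat = max(grid, key=log_likelihood)

print(lam_hat, sum(data) / len(data))  # grid maximum vs. sample mean
```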

ROC CURVE: This curve tells us how capable our model is of distinguishing between classes. The higher the area under the curve, the better the model.

Properties of good estimator:

**UNBIASEDNESS**

An estimator is called unbiased if its expectation exists and equals the parameter being estimated. Sometimes unbiasedness can give an absurd answer as well.

For example, if E(T) = θ, then T is an unbiased estimator of θ.

**CONSISTENCY**

Consistency requires two things: the variance of the estimator should tend to zero as n tends to infinity, and its expectation should tend to the parameter being estimated.

**EFFICIENCY**

The ratio of the variances of two competing estimators is termed their relative efficiency.

**SUFFICIENCY**

When the estimator contains all the information about the parameter that is present in the sample, it can be called a sufficient estimator.

**Ques.** What is the meaning of MLE?**Ans:** For making the observed data most probable under the assumptions of statistical models we tend to maximize the likelihood functions, so it’s a method for estimating the parameter of a probability distribution.

**Ques.** Is the MLE UNBIASED and CONSISTENT?**Ans:** MLEs are always consistent but need not be unbiased; as the sample size increases, the bias tends to disappear (they are asymptotically unbiased).

**Ques.** Is MLE the most efficient estimator?**Ans:** Yes, MLE is the most efficient estimator in the class of estimator.

**Ques.** What is the ROC curve all about?**Ans:** The ROC curve is the graphical representation of the diagnostic ability of a binary classification system as its threshold is varied.

**Ques.** What is the full form of ROC?**Ans:** RECEIVER OPERATING CHARACTERISTIC curve

**Ques**. What does the ROC curve denote?**Ans:** ROC curve denotes the benefits or the advantages of using a particular test or distribution.

**Ques.** What does the ROC AUC value tell us?**Ans:** For the ROC AUC:

- 0.5 suggests that there is no discrimination
- 0.7–0.8 is considered acceptable
- more than 0.9 is considered outstanding
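AUC can be computed directly from its probabilistic meaning: the chance that a randomly chosen positive is scored above a randomly chosen negative, with ties counting half. The labels and scores below are invented:

```python
labels = [1, 1, 1, 0, 0, 0]              # hypothetical true classes
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]  # hypothetical model scores

pos = [s for s, y in zip(scores, labels) if y == 1]
neg = [s for s, y in zip(scores, labels) if y == 0]

# Fraction of positive/negative pairs ranked correctly (ties count 0.5).
auc = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg) / (len(pos) * len(neg))
print(round(auc, 4))
```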

**Ques.** What is an Estimator?**Ans:** Estimator is a statistic to estimate some facts about the population.

**Ques.** Name the two types of ESTIMATION?**Ans:** The two types of estimation are:

- Point estimation: estimation that gives a single value for the unknown parameter.
- Interval estimation: a range of numbers within which the population parameter is expected to lie.

**Ques.** State the properties of a good estimator?

**Ans:** There are 4 properties of a good estimator:

**Unbiasedness:** If θ̂ is an unbiased estimate of θ, then we must have E(θ̂) = θ.

**Consistency:** If an estimator θ̂ approaches the parameter θ closer and closer as the sample size n increases, θ̂ is said to be a consistent estimator of θ.

**Efficiency:** The concept of efficiency refers to the sampling variability of an estimator. If two competing estimators are both unbiased, the one with the smaller variance (for a given sample size) is said to be relatively more efficient.

**Sufficiency:** An estimator is said to be sufficient if it conveys as much information about the parameter as is contained in the sample.

**Ques.** Do unbiased estimates always exist?**Ans:** NO, it is not necessary for an unbiased estimator to exist as sometimes it provides absurd results as well.

**SUPERVISED LEARNING:** In the supervised model, we train the model on well-labeled data so that it can predict unforeseen data accurately.

**Ques. **Why is supervised data important/helpful?**Ans: **

- Supervised data helps us correct our mistakes by learning from previous outcomes.
- It helps in the optimization of results.
- It helps in tackling real-life problems, which increases efficiency.

**UNSUPERVISED LEARNING:** In the unsupervised model, no predetermined input-output pairs are set, which means we do not need to supervise the model with labels; it is free to explore the data and can go in many directions.

**Ques. **Why is unsupervised data important/helpful?**Ans:**

- As the model is free to explore, it can grow in every direction.
- It is far easier to obtain this type of data than labeled data.

**Ques. **Suppose that I want to go to work from home. So, I check the weather conditions, the route that can be taken, and the stations where I can stop for food. Which model am I taking into consideration?**Ans:** The SUPERVISED MODEL, as every aspect here is very well labeled, from the type of route to the stations where the person will stop.

**Ques.** State some techniques to have supervised data?**Ans:** There can be two techniques in which we can have the same:

- Regression method: predicting a continuous output value from the inputs.
- Classification method: dividing the inputs into different categories to obtain the required result.

**Ques.** State some techniques used with unsupervised data.
**Ans:** There are two main techniques:

- Clustering: it deals with structure or patterns in the data and, if natural groupings exist, it will find them.
- Association: it helps in establishing associations between different items in the data if they naturally exist.
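The clustering idea can be sketched with a bare-bones 1-D k-means loop. The data and function names below are made up for illustration; this is a simplified sketch, not a production algorithm.

```python
# Bare-bones 1-D k-means sketch: alternate between assigning each point
# to its nearest centroid and moving each centroid to its cluster's mean.

def kmeans_1d(points, centroids, iterations=10):
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: attach each point to the nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Made-up unlabeled data with two natural groups
data = [1.0, 1.2, 0.8, 8.0, 8.3, 7.9]
centers, groups = kmeans_1d(data, centroids=[0.0, 5.0])
print(centers)  # the two group centres, found without any labels
```

Note that no labels are supplied anywhere: the structure (two groups) is discovered from the data itself, which is exactly what distinguishes this from the supervised setting.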

**Ques.** State some algorithms used in supervised and unsupervised models.
**Ans:** Support vector machines, neural networks, linear and logistic regression, random forests, and classification trees are used in supervised models. Unsupervised algorithms include clustering algorithms such as k-means and hierarchical clustering.

**Ques.** Which model is used when we need to classify big, unlabeled real-life data?
**Ans:** The unsupervised model is used here, as it does not require labeled examples and can discover the classes (clusters) on its own.

For more preparations and materials: click


The post Excel – Descriptive Statistics appeared first on StepUp Analytics.

Descriptive statistics are crucial for working on any dataset, and statisticians employ various statistical tools and techniques to present the data in an easily understandable form.

**What is Descriptive Statistics?**

Summarizing and presenting an unorganized dataset in an organized way, which may be tabular, graphical, or numerical, is referred to as descriptive statistics. Descriptive statistics helps readers understand the data well at first look. It also helps in making statistical inferences from the data.

For example, for Census 2021 the Government of India will conduct primary research to gather data on its citizens, which is helpful in policy-making decisions. The government can employ descriptive statistics on this huge dataset to better understand the data by describing and summarizing its features.

Descriptive statistics include measures of location such as minimum, maximum, percentiles, quartiles, and central tendency (mean, median, and mode); and measures of dispersion or variability, including range, variance, and standard deviation.

**Measures of Location:**

Measures of location reduce a set of data by summarizing all data points with a single value: a single number is derived that is most representative of the entire dataset and generally reflects the main properties of the data distribution.

Let us, for example, study a dataset obtained from measuring the lengths of the crocodiles in a pond. We observed the following things:

− the average length of a crocodile is 3.4 meters, or that

− 50% of the crocodiles have a length of less than 3.4 meters, or that

− 90% of the crocodiles have a length between 2.8 meters and 4.3 meters

Here in the above-stated example, the first two statements try to specify in some manner the location of the peak of the distribution, whereas the third statement tells us about the variability in the dataset.

**Measures of Central Tendency:**

These are the measures that help in locating the center of the data. These are as follows:

**Mean: **The mean value is generally known as the average value of the data. To get the value of the mean, we simply take summation of all the observations and then divide the sum by the total number of observations that are there in the dataset.

**Mean = (Sum of Observations)/ (Total Number of Observations)**

For Example, the marks of the 5 students are 2, 9, 10, 3, 6. The mean here would be:

**Mean = (2+9+10+3+6)/ 5 = 30/ 5 = 6**

Interpretation: Here we can say that on an average a student has got 6 marks in the class.

**Median:** The median is the location or point that has half the data points smaller than it and half the data points larger than it. In the case of asymmetric data, the median is considered a more stable measure of central tendency than the mean. The reason is that the median is less affected by outliers (i.e. any extreme values) than the mean.

The median is calculated as follows:

− Sort all values in ascending/descending order.

− If the total number of values is odd, take the middle number of the series.

− If the total number of values is even, take the average of the middle two numbers.

For example, the marks of the 5 students are 2, 9, 10, 3, 6. Here, one must first sort these values in ascending order 2, 3, 6, 9, 10. Since the number of observations here is odd, we would take the middle value as the median for this series which is 6 in this case.

**Mode:** The Mode is the value that occurs most frequently in the dataset. It is not generally used in the statistical analysis. The mode is useful in the case of categorical data to describe the most frequent category. In a series, there could be more than one mode also and in that case, the series is called bimodal/ multimodal series.

For example, the marks of the 5 students are 2, 9, 10, 6, 6. In this case, 6 occurs most of the time (twice) in this series and hence becomes the mode of this series.
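The three measures above can be checked with Python's standard statistics module, reusing the marks examples from the text:

```python
# Central tendency of the marks examples, using Python's statistics module.
import statistics

marks = [2, 9, 10, 3, 6]
print(statistics.mean(marks))    # 6: sum 30 divided by 5 observations
print(statistics.median(marks))  # 6: sorted series 2, 3, 6, 9, 10, middle value

marks2 = [2, 9, 10, 6, 6]
print(statistics.mode(marks2))   # 6: it occurs most frequently (twice)
```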

**Measures of Dispersion:** Measures of dispersion play a very important role in describing the spread of the data and its variation around a central value (Mean/ Median/ Mode). The spread of a data set can be described by various descriptive statistics like range, variance, and standard deviation.

**Range:** It is the difference between the lowest and highest values in a dataset. It is very simple to compute and useful when someone wishes to interpret the whole of a dataset. The range is useful to show the spread within a dataset and also, for comparing the spread/ variability among two or more similar datasets.

For example, the marks of the 5 students are 2, 9, 10, 3, 6. To calculate the value of the range, we need to subtract the lowest value which is 2 from the highest value which is 10 and we get the range as 8.

**Variance:** It is a numerical value used to show how widely individual observations vary in a group. When the individual observations vary to a large extent from the mean of the group, the variance of the group is high, and if the individual observations vary only to a small extent from the mean of the group, the variance of the group is low.

One must distinguish between the population variance and the sample variance. They have different notations and are calculated differently: the variance of a population is denoted by σ², and the variance of a sample by s².

To calculate the variance of a population, the following formula is used:

σ² = Σ (Xi − X̄)² / N

where σ² is the variance of the population, X̄ is the mean of the population, Xi is the ith element from the population, and N is the total number of elements in the population.

The variance of a sample is defined by a slightly different formula:

s² = Σ (xi − x̄)² / (n − 1)

where s² is the variance of the sample, x̄ is the mean of the sample, xi is the ith element from the sample, and n is the total number of elements in the sample. Dividing by n − 1 rather than n makes the sample variance an unbiased estimate of the population variance.

**Standard Deviation:** Square root of the variance gives the standard deviation for the data. It is preferred over variance as it measures the deviation from the mean in the same units as the original data.

For example, if the data is of distance measurements in kilometers then the standard deviation will also be measured in kilometers, hence, the standard deviation is comparatively easy to interpret as compared to the variance.
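A quick sketch of these dispersion measures on the same marks example, using Python's standard statistics module (note the population vs. sample divisor):

```python
# Dispersion of the marks example: range, population vs. sample variance,
# and standard deviation, using Python's statistics module.
import statistics

marks = [2, 9, 10, 3, 6]

data_range = max(marks) - min(marks)     # 10 - 2 = 8
pop_var = statistics.pvariance(marks)    # divides by N (population formula)
sample_var = statistics.variance(marks)  # divides by n - 1 (unbiased)
sample_sd = statistics.stdev(marks)      # square root of the sample variance

print(data_range, pop_var, sample_var, sample_sd)
```

With a mean of 6, the squared deviations sum to 50, so the population variance is 50/5 = 10 while the sample variance is 50/4 = 12.5; the standard deviation is in the same units (marks) as the data.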

**Measures of Shape:**

**Skewness:** Skewness measures the asymmetry of the distribution, i.e. the tendency of the data to be spread out more on one side of the mean than the other. There are two types of skewness: left-skewed and right-skewed distributions.

The nearer the skewness statistic is to 0, the more symmetric the dataset. The data is said to be left-skewed (the mean is less than the median) when the skewness statistic is negative, and right-skewed (the mean is greater than the median) when the statistic is positive.

**Kurtosis:** It measures the peakedness of a distribution. A kurtosis statistic less than 3 implies a distribution whose tails are less extreme than the normal distribution, referred to as a platykurtic curve; a statistic greater than 3 implies tails more extreme than the normal distribution, referred to as a leptokurtic curve.
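As a hedged sketch, moment-based versions of these two statistics can be computed in pure Python. This simple form omits the small-sample corrections that Excel applies, so the numbers are illustrative only.

```python
# Moment-based skewness and kurtosis (the "compare kurtosis to 3" convention
# used in the text). A simple sketch without small-sample corrections.

def skewness(xs):
    n = len(xs)
    m = sum(xs) / n
    m2 = sum((x - m) ** 2 for x in xs) / n  # second central moment
    m3 = sum((x - m) ** 3 for x in xs) / n  # third central moment
    return m3 / m2 ** 1.5

def kurtosis(xs):
    n = len(xs)
    m = sum(xs) / n
    m2 = sum((x - m) ** 2 for x in xs) / n
    m4 = sum((x - m) ** 4 for x in xs) / n  # fourth central moment
    return m4 / m2 ** 2  # equals 3 for a normal distribution

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 10]
print(skewness(symmetric))     # 0.0 for perfectly symmetric data
print(skewness(right_skewed))  # positive: long tail to the right
print(kurtosis(symmetric))     # below 3: flatter than a normal curve
```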

Now let’s see how to add the Data Analysis option in MS-Excel for easy application of data analysis on various datasets.

**Step 1:** Go to ‘File’ at the top left of MS-Excel and click ‘Options’.

**Step 2:** In the Excel Options dialog, select ‘Add-ins’, choose ‘Excel Add-ins’ in the Manage box, and click ‘Go’.

**Step 3:** Check ‘Analysis ToolPak’, ‘Analysis ToolPak – VBA’ and ‘Solver Add-in’ in the dialog box. Next, click OK. The Data Analysis option has been added now.

**Step 4:** Click ‘Data’ from the options and see the ‘Data Analysis’ option is now available for use.

**Case Study:**

Let’s understand how to use descriptive statistics in MS-Excel on stock prices and infer the results. Consider Axis Bank traded as equity stock on NSE. The data has been downloaded from NSE for a period of 24 months, 1st January 2018 to 30th December 2019.

**Note:** The return on stock prices is calculated as (Pt − Pt-1) / Pt-1, where Pt is the current period’s price and Pt-1 is the previous period’s price.
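The return formula can be sketched in Python; the prices below are made up for illustration and are not Axis Bank data:

```python
# Percentage return per period: (P_t - P_{t-1}) / P_{t-1}.
# The prices below are made up for illustration.

def returns(prices):
    return [(p_t - p_prev) / p_prev
            for p_prev, p_t in zip(prices, prices[1:])]

prices = [100.0, 110.0, 99.0]
print(returns(prices))  # [0.1, -0.1]: a +10% month followed by a -10% month
```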

Now, we will use the **‘Data Analysis’** option in Excel [as learned earlier] on the monthly returns of Axis Bank stock to find a summary of the descriptive statistics and infer the results.

**Step 1:** Click on ‘Data’ then select ‘Data Analysis’. Find the ‘Descriptive Statistics’ option as shown in the image below and click OK.

**Step 2:** Select the ‘Input Range’, here $D$3:$D$25, so that only numeric values are considered. Define the ‘Output Range’ for the result to be generated, then select ‘Summary Statistics’ and click OK.

The following Summary Statistics are generated for the Monthly Return of Axis Bank stock.

**Interpretation of the Summary Statistics:**

The count of the data implies that there are 23 observations of monthly returns. The maximum monthly return an investor could earn is a profit of 16.06% and the minimum is a loss of 11.38% on the Axis Bank stock, implying a range of 27.44% between the maximum loss and the maximum profit. The sum of 28.25% implies that an investor would have gained 28.25% in summed monthly returns by keeping the stock for the period of the analysis, i.e. from 1st January 2018 to 30th December 2019.

From the measures of central tendency, we can infer that the average or mean return that the stock yields over the period is 1.23%, whereas the median is 1.12%. The mean and median differ slightly, implying that the data is not perfectly symmetric. Also, there is no mode in the data of monthly returns. The standard deviation of 6.02% measures the extent of deviation from the mean return that the stock yields. The measures of shape show that the distribution is right-skewed, the skewness statistic being positive.

The kurtosis statistic is less than 3 (a platykurtic distribution), which implies that extreme returns (either positive or negative) beyond plus or minus three standard deviations from the mean should occur less often than a normal distribution of returns would predict.

As seen above, one can use descriptive statistics to better understand the data. We can perform the descriptive statistics on any stock using the historical price data to check the range of profit and loss and take better-informed decisions on the basis of our inferences.


The post Web Scraping Using R From Amazone appeared first on StepUp Analytics.

Web scraping is a technique for converting data present in an unstructured format (HTML tags) on the web into a structured format that can easily be accessed and used.

Almost all the main languages provide ways of performing web scraping. In this article, we’ll use R to scrape the data for the most popular feature smartphones of 2019 from the Amazon website.

We’ll get a number of features for each of the 15 popular feature smartphones released in 2019. We’ll also look at the most common problems one might face while scraping data from the internet because of a lack of consistency in website code, and at how to solve them. If you are more comfortable using Python, I recommend going through this website for getting started with web scraping using Python.

There are several ways of scraping data from the web. Some of the popular ways are:

- **Human Copy-Paste:** This is a slow and inefficient way of scraping data from the web, with humans themselves analyzing and copying the data to local storage.
- **Text pattern matching:** Another simple yet powerful approach to extract information from the web is by using the regular-expression matching facilities of programming languages. You can learn more about regular expressions on this website.
- **API Interface:** Many websites like Facebook, Twitter, LinkedIn, etc. provide public and/or private APIs which can be called using standard code to retrieve data in the prescribed format.
- **DOM Parsing:** By using web browsers, programs can retrieve the dynamic content generated by client-side scripts. It is also possible to parse web pages into a DOM tree, based on which programs can retrieve parts of these pages.

We will scrape the following features:

- **Title:** Name, storage, and color of the top 15 smartphones.
- **Price:** Price of the smartphones.
- **Rating:** Ratings of the smartphones.

Now, let’s get started with scraping the Amazon website for the 15 most popular feature smartphones released in 2019. You can access them on this website Click.

**rvest:** Hadley Wickham authored the rvest package for web scraping in R. rvest is useful for extracting the information you need from web pages.

- In RStudio, click the Tools menu and choose ‘Install Packages…’.
- Enter the package name and click the Install button.

rvest contains the basic web scraping functions, which are quite effective. Using the following functions, we will try to extract the data from web sites.

- `read_html(url)`: scrapes the HTML content from a given URL
- `html_nodes()`: identifies HTML wrappers
- `html_nodes(".class")`: calls a node based on its CSS class
- `html_nodes("#id")`: calls a node based on its <div> id
- `html_nodes(xpath = "xpath")`: calls a node based on an XPath (we’ll cover this later)
- `html_attrs()`: identifies attributes (useful for debugging)
- `html_table()`: turns HTML tables into data frames
- `html_text()`: strips the HTML tags and extracts only the text

Let’s implement it and see how it works. We will scrape the Amazon website for a comparison of the top 15 smartphones.

```r
# Load the package
library(rvest)

# Specify the URL for the website to be scraped
url <- "https://www.amazon.in/s?k=top+smartphones&dc&crid=3JN9QKV0R5211&sprefix=top+smart%2Caps%2C376&ref=a9_sc_1"

# Read the HTML content from Amazon
webpage <- read_html(url)
```

In this code, we read the HTML content from the given URL and assign it to the `webpage` variable.

Now, as the next step, we will extract the following information from the website:

- **Title:** The title of the product.
- **Price:** The price of the product.
- **Rating:** The user rating of the product.
- **Size:** The size of the product.
- **Color:** The color of the product.

Next, we will make use of HTML tags, like the title of the product and price, for extracting data using Inspect Element. In order to find out the class of the HTML tag, use the following steps:

**=> go to the Chrome browser => go to this URL => right-click => Inspect Element**

NOTE: If you are not using the Chrome browser, check out this article.

Based on CSS selectors such as class and id, we will scrape the data from the HTML. To find the CSS class for the product title, we need to right-click on the title and select “Inspect” or “Inspect Element”.

As you can see below, I extracted the title of the product with the help of **html_nodes()**, to which I passed the CSS class of the title:

```r
# Scrape the title of the product
title_html <- html_nodes(webpage, ".a-size-medium")
title <- html_text(title_html)
head(title)
```

The extracted titles still contain extra spaces and newline (\n) characters. The next step is to remove them with the help of the **gsub()** function.

```r
# Remove all newline characters
title <- gsub("\n", "", title)
head(title)
```

**Price of the product:**

```r
# Scrape the price of the product
price_html <- html_nodes(webpage, ".a-price-whole")
price <- html_text(price_html)
head(price)
```

**Output:**

**Rating of the products:**

```r
# Scrape the rating of the product
rating_html <- html_nodes(webpage, ".a-icon-alt")
rating <- html_text(rating_html)
head(rating)
```

**Output:**

Now we have successfully scraped all three features (title, price, and rating) for the 15 most popular feature smartphones released in 2019 from Amazon. Let’s combine them to create a dataframe and inspect its structure.

I believe this article has given you a complete understanding of web scraping in R. You now also have a fair idea of the problems you might come across and how to work around them. As most of the data on the web is present in an unstructured format, web scraping is a really handy skill for any data scientist.

Did you enjoy reading this article? Do share your views with me. If you have any doubts/questions, feel free to drop them below.

To read more about R and its implementation Click


The post Data Science: Python for Data Preprocessing appeared first on StepUp Analytics.

We will be preprocessing the data by cleaning it, removing insignificant features, and then performing data exploration. These steps comprise data preprocessing/data wrangling, which is mandatory before data visualization and feature engineering.

Continuing with the first part of this series, we will be looking at different techniques involved in the preprocessing of data. This series of articles will be covering the following topics:-

- Web Scraping
- Data Preprocessing
- Feature Engineering and Model Selection

We will look into the following topics which are covered under preprocessing of raw data:-

- **Data Preprocessing**
  - Data Exploration
  - Identifying Variables
- **Data Cleaning**
  - Dropping Features
  - Missing Values Identification
  - Formatting the Data
- **Data Visualization / Exploratory Data Analysis**
- **Text Data Preprocessing**

Data preprocessing involves a collection of steps that help to purify the data, extract the useful information, and remove the insignificant. Data obtained from the real world is incomplete, inconsistent, and contains numerous errors. To counter these issues, we use data preprocessing, which aids in removing discrepancies in names and related problems.

In the real world, we generally have two broad categories of data. In the first kind of data, we have continuous and categorical features and then in the second kind of data we have the text data. So these two types of data require different steps for preprocessing. So first let’s have a look at the steps for continuous and categorical features.

To understand the following steps of data handling and visualization we will be using two different datasets which are available on Kaggle. The first dataset is regarding the information of Cities of India and the other dataset is related to Online Shopping.

Initially, after loading the dataset, we have to study it to gather general insights about the data. To gather these insights, we look at the continuous features, for which we will use the **head()**, **info()** and **describe()** functions of pandas.

The **head()** function helps us to know the columns and rows contained in the dataset. By default, the **head()** function displays the first 5 rows of the dataset.

After looking at some initial values, we can determine the shape i.e. number of columns and rows contained in the dataset by using a **shape** attribute of the dataframe.

The **info()** function tells us the count of non-null values in each of the columns. By comparing the row count from the shape attribute with the **info()** output, we can detect whether there are any cells with **NaN** values. In the above example, we have no empty cells, as all 493 rows contain values. Here we can also see the data type of each column of the dataset.
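A minimal sketch of this first look on a tiny made-up dataframe (the article itself applies these calls to the Kaggle cities dataset):

```python
# First-look inspection with head(), the shape attribute, and info().
# The dataframe below is made up for illustration.
import pandas as pd

df = pd.DataFrame({
    "state_name": ["UP", "UP", "MH"],
    "name_of_city": ["Agra", "Kanpur", "Pune"],
    "population": [1585704, 2765348, 3124458],
})

print(df.head())  # first rows (up to 5 by default)
print(df.shape)   # (rows, columns)
df.info()         # per-column non-null counts and dtypes; reveals NaN cells
```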

We can look at the variables/columns of the data which can act as predictor variables and target variables, and build a connection between them. In this example, state_name and name_of_city are the predictor variables, whereas the other columns can be the target variables.

Moreover, we can also remove the columns which are of no significance and cannot provide any sort of insight into the dataset. For example, in this dataset the **state_code**, **dist_code** and **location** columns are not of any use, so we will remove them from our dataframe so that they do not hinder the results of other operations.

The **drop()** function removes the columns from the dataframe, and the **axis = 1** parameter specifies that we want to remove columns (not rows).
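A hedged sketch of the drop() call on made-up columns (the data here is illustrative, not the real cities dataset):

```python
# Dropping insignificant columns with drop(); axis=1 says "drop columns".
# The dataframe below is made up for illustration.
import pandas as pd

df = pd.DataFrame({
    "name_of_city": ["Agra", "Pune"],
    "population": [1585704, 3124458],
    "state_code": [9, 27],
    "dist_code": [15, 25],
})

df = df.drop(["state_code", "dist_code"], axis=1)
print(df.columns.tolist())  # ['name_of_city', 'population']
```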

In univariate analysis, we analyze variables based upon their type, i.e. continuous or categorical. For continuous variables, which is also the case with our dataset, we can use the **describe()** function. Through this function, we can discern the different measures which reveal the central tendency and spread of the variable.

Here, the **describe()** function displays the count, mean, standard deviation, minimum and maximum values, along with the three quartile values. For categorical values, we can use a frequency table for a better understanding of the different categories.

As the column ‘**state_name**’ is categorical, we can find the frequency of different states in the dataset. This is done by using **value_counts()** function.
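The two univariate tools can be sketched on made-up data as follows:

```python
# Univariate analysis: describe() for a continuous variable and
# value_counts() as a frequency table for a categorical one. Made-up data.
import pandas as pd

df = pd.DataFrame({
    "state_name": ["UP", "UP", "MH", "MH", "MH"],
    "population": [100, 200, 300, 400, 500],
})

print(df["population"].describe())      # count, mean, std, min, quartiles, max
print(df["state_name"].value_counts())  # frequency of each state
```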

Now, after having gone through the data, we will look to clean the dataset by removing erroneous values and redundant data. We would also want to deal with missing values. Moreover, it is suggested to nullify the impact of outliers as well.

For performing data cleaning, we would be using a different dataset which is related to Online Shopping.

When it comes to data cleaning, we focus on three main steps, which are as follows:-

As we saw earlier, we removed three columns from the dataset because they were of no significance. Similarly, if there are other columns with a large number of missing values, it is recommended to drop those columns entirely. At this stage, we also perform the changes necessary to make the data apt for the preprocessing steps.

As we can see, the column names start with capital letters; we want them in lowercase so that this does not hinder the proceedings.

By using **rename()** function, we will be able to rename the column names as per our requirement.

We can clearly see that the column names have been changed to lowercase.
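A minimal sketch of the rename() step on made-up column names:

```python
# Renaming columns to lowercase with rename(); str.lower is applied
# to every column name. The column names below are illustrative.
import pandas as pd

df = pd.DataFrame({"Cust_ID": [17850], "Description": ["WHITE METAL LANTERN"]})
df = df.rename(columns=str.lower)
print(df.columns.tolist())  # ['cust_id', 'description']
```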

Managing the missing values in the dataset plays a crucial part in data preprocessing. If we do not handle the missing values, then we can get misleading results. First, for checking missing values, we can use the following code snippet.

In the above code snippet, we count the null values in each column and sort the counts. We can see that the cust_id column has the most missing values.

In this code snippet, we look at the rows with missing values. Thus after knowing this, we can decide how we will handle these missing values. The different ways of managing missing values are as follows:-

- We can fill these missing values with random values like ‘0’
- Ignoring missing values, if they are less in number
- Another way is to fill these missing rows with mean, mode or median of the columns. This method is more preferred than the other two methods.

By using the **dropna()** function, we can remove the rows with empty values and then store the result in a new dataframe.

When we check for the missing values, we can see that there are no missing values.
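The missing-value check and dropna() step can be sketched as follows (made-up data with one missing cust_id):

```python
# Counting missing values per column, then dropping incomplete rows.
# The data below is made up for illustration.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "cust_id": [17850.0, np.nan, 13047.0],
    "amount": [10, 20, 30],
})

print(df.isnull().sum().sort_values(ascending=False))  # cust_id has 1 missing
df_clean = df.dropna()                      # keep only complete rows
print(int(df_clean.isnull().sum().sum()))   # 0: no missing values remain
```

Filling with the column mean/median (e.g. `df.fillna(df.mean())`) is the alternative mentioned above when dropping rows would lose too much data.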

Formatting of data involves making the data types compatible with other data types of the columns, removing abbreviations and also creating new columns by using values from existing columns. In this example, we will be formatting the following things.

If we look in the above example of this dataset, the description is present in the upper case. Thus to change it, we will be using the **lower()** function to make the description to lower case.

In the above snippet, we can see cust_id column is of float type and to change this, we will be using the code shown above.

Here in this picture, the cust_id data type is now of integer type. Therefore, this shows how to change the data type as per our needs.
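Both formatting steps can be sketched together on made-up rows:

```python
# Formatting: lower-case a text column and cast cust_id from float to int.
# The rows below are made up in the style of the online-shopping dataset.
import pandas as pd

df = pd.DataFrame({
    "cust_id": [17850.0, 13047.0],
    "description": ["WHITE METAL LANTERN", "RED WOOLLY HOTTIE"],
})

df["description"] = df["description"].str.lower()
df["cust_id"] = df["cust_id"].astype(int)
print(df["description"].iloc[0])  # white metal lantern
print(df["cust_id"].iloc[0])      # 17850
```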

After completing the preprocessing of data, the next step is to perform the visualization of data. This is also known as Exploratory Data Analysis. We will use both the datasets for visualization and getting insights from them. First, let’s look at some visualizations from the cities dataset.

This snippet code displays the top 5 cities with the highest population with all the column values.

For obtaining the states with the highest population, we will use the **groupby()** function and then plot the result.

It is clearly discernible from this plot that states like Uttar Pradesh and Maharashtra have the highest population, whereas states like Meghalaya, Mizoram, and Nagaland have the lowest population.
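The aggregation behind such a plot can be sketched as follows (the state figures are made up):

```python
# The aggregation behind the population plot: group by state, sum, sort.
# The figures below are made up for illustration.
import pandas as pd

df = pd.DataFrame({
    "state_name": ["UP", "UP", "MH"],
    "population": [100, 250, 300],
})

by_state = (df.groupby("state_name")["population"]
              .sum()
              .sort_values(ascending=False))
print(by_state)  # by_state.plot(kind="bar") would chart this
```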

Now let’s look at some of the visualizations of the other dataset.

This displays the highest money spending customers from different countries.

The above visualization helps us to know the number of orders on different days of the week. There can be more visualizations as well. It depends on our creativity and curiosity about what we want to know from the dataset.

Most of the data obtained from websites and other sources is text data, and it needs to be processed in a different manner. I have covered text data preprocessing in an earlier article on Natural Language Processing.

You can have a look at the article and know what are the steps required for text data preprocessing. Generally, the steps involved for preprocessing text data is discussed as follows:-

**Converting text to lowercase:** This is recommended because it removes anomalies between identical words present in both upper and lower case.

**Removing numbers:** It is a tedious task to process text data with numbers, so we remove the numbers from it.

**Removing punctuation and special characters:** Punctuation and special characters provide no information, so we always drop them from the text.

**Removing stop words:** Stop words are very common words like “a”, “an”, “the”, etc., which are abundant in the data but of no use, so they are also removed.
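The four steps above can be sketched as a minimal text-cleaning function (the stop-word list here is a tiny illustrative subset, not a standard list):

```python
# A minimal text-cleaning sketch covering the four steps above:
# lowercase, strip numbers, strip punctuation, remove stop words.
import re
import string

STOP_WORDS = {"a", "an", "the", "is", "of"}  # tiny illustrative list

def clean_text(text):
    text = text.lower()                                   # 1. lowercase
    text = re.sub(r"\d+", "", text)                       # 2. remove numbers
    text = text.translate(                                # 3. remove punctuation
        str.maketrans("", "", string.punctuation))
    words = [w for w in text.split() if w not in STOP_WORDS]  # 4. stop words
    return " ".join(words)

print(clean_text("The 2 Dogs are barking!"))  # dogs are barking
```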

After the above steps, then we have options of some specific steps which can be taken as per our requirements. The Jupyter notebooks for this article can be referred from here.


The post Install Python on Windows and Mac (Anaconda) appeared first on StepUp Analytics.

**1.** Download and install Anaconda (Windows version) from the link below.

Choose either the Python 2 or Python 3 Version depending on your needs. It doesn’t affect the installation process. Download Link

**2.** Select the default options when prompted during the installation of Anaconda.

Note: If you checked this box, steps 4 and 5 are not needed. The reason why it isn’t preselected is that a lot of people don’t have administrative rights on their computers.

**3.** After you finish installing, open **Anaconda Prompt**. Type `jupyter notebook` to check that you can use a Jupyter (IPython) Notebook.

If you want a basic tutorial going over how to open Jupyter and using python, please see the video below.

Windows OS: YouTube Video

Mac OS: YouTube Video

**4.** If you didn’t check the “add Anaconda to path” box during the installation process, you will have to add python and conda to your environment variables. You know you need to do so if you open a **command prompt** (not Anaconda Prompt) and get error messages like the following.

**5.** This step gives two options for adding python and conda to your path (only choose 1 option).

If you don’t know where your conda and/or python is, type `where conda` and `where python` into your **Anaconda Prompt**.

Happy Learning!


The post AI vs Machine Learning vs Deep Learning vs Data Science appeared first on StepUp Analytics.

How are they related to the much talked about and vast field of Data Science? This article tries to set clear lines between these terms and exemplify their functioning and usage in the industry.

The present scenario of ultimate connectivity of everything around the globe has resulted in the generation of huge amounts of data. The big questions now arise: What do we do with this available data? Is this data ready to be processed? This is where data science steps in. Data science is an umbrella term that encompasses data analytics, data mining, machine learning, and several other related disciplines.

Data science is basically an extension of statistics dealing with a large amount of data with the help of computer algorithms. Data science converts the available raw data into a useful computer-compatible form.

This means the data can be molded according to specific requirements. It includes data cleansing, data preparation, and data analysis. This processed data can now be used as input for various machine learning algorithms. Data science usually deals with technologies like R, Python, SQL, Hadoop etc.

We have seen that data science is the upper domain that provides processed data. The way this data is processed to get the output deals with technologies like AI, ML, and DL. There are many ways to distinguish between these terms. However, there are certain differences that hold true in many cases.

First of all, they follow a chronological order. The concept of Artificial Intelligence was introduced first, then machine learning was developed, and after that deep learning gained momentum. Deep Learning is a subset of Machine Learning, and Machine Learning is a subset of Artificial Intelligence. With the advent of new technologies, further advancements within deep learning may be developed in the future.

That being said, all ML is AI but not all AI is ML. Similarly, all DL is ML but not all ML is DL. This can be understood with the given Venn diagram.

*Artificial intelligence (AI), sometimes called machine intelligence, is intelligence demonstrated by machines, in contrast to the natural intelligence displayed by humans and other animals.* (Source: Wikipedia)

It means it is a way to enable a computer to think and make decisions just like a human being. It lets the computer function on its own, making its own decisions in an open, human-interactive environment without any human support. AI can be just a few lines of if-else statements or pages of code; it all depends on the context in which it needs to be used.

*Machine learning is a field of computer science that uses statistical techniques to give computer systems the ability to “learn” with data, without being explicitly programmed. *(Source: Wikipedia)

It is a subset of Artificial Intelligence. It usually deals with computers learning on their own from a dataset and improving their output with experience, without being explicitly told what to do. This can then be used to predict future patterns or trends. Machine learning can involve statistical analysis (unsupervised learning) or predictive analysis (supervised learning). Recommendations on Netflix, Instagram, and Facebook make use of machine learning algorithms by analyzing the past activities of the user.

*Deep learning is part of a broader family of machine learning methods based on learning data representations, as opposed to task-specific algorithms.* (Source: Wikipedia)

It is a more specific class of algorithms used in machine learning. It uses a brain-like structure for functioning, called an artificial neural network, and deals mostly with predictive analysis. Deep learning algorithms use a hierarchical form of learning: each layer at the lowest level deals with low-complexity features and passes what it has learned upward to build a statistical model. This process continues through several levels/layers of the neural network, and iterations continue until an acceptable accuracy is reached.
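To make the layered idea concrete, here is a tiny, illustrative forward pass through a two-layer network. The weights are arbitrary numbers chosen for this sketch; a real system would learn them with a deep learning framework:

```python
import math

def layer(inputs, weights, biases):
    # each neuron: weighted sum of its inputs plus a bias,
    # squashed through a sigmoid activation
    return [1 / (1 + math.exp(-(sum(w * x for w, x in zip(ws, inputs)) + b)))
            for ws, b in zip(weights, biases)]

x = [0.5, 0.8]                                      # low-level input features
h = layer(x, [[0.2, 0.4], [0.6, 0.1]], [0.0, 0.0])  # hidden layer output
y = layer(h, [[0.3, 0.7]], [0.0])                   # final output
print(round(y[0], 3))
```

Each layer consumes the previous layer's output, which is the hierarchical structure described above.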

Let’s take an example to clarify the differences among these terms. Suppose there is a taxi company that sends a taxi to you whenever you book one, similar to Uber. It is synced with the map of the city it operates in. Whenever you enter the pickup and drop locations, the application checks the data of all previous rides along the same route and predicts an estimated cost for the ride. This is machine learning.
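The fare-estimation idea can be sketched as a simple regression over past rides. The ride history and the straight-line model below are invented for illustration; a real app would use far richer features:

```python
# Hypothetical history of rides: (distance in km, fare paid)
past_rides = [(2.0, 60.0), (5.0, 120.0), (8.0, 180.0), (12.0, 260.0)]

# Ordinary least-squares fit of fare = intercept + slope * distance
n = len(past_rides)
mean_x = sum(d for d, _ in past_rides) / n
mean_y = sum(f for _, f in past_rides) / n
slope = (sum((d - mean_x) * (f - mean_y) for d, f in past_rides)
         / sum((d - mean_x) ** 2 for d, _ in past_rides))
intercept = mean_y - slope * mean_x

def estimate_fare(distance_km):
    return intercept + slope * distance_km

print(estimate_fare(10.0))  # 220.0
```

The model "learns" from past data and generalizes to a ride it has never seen, which is the essence of machine learning.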

The same application also tells you the route you will be taken through. If there is traffic congestion anywhere along the route, an alert is issued in the form of a voice message stating “Traffic Jam Ahead!”. Here the application detects an unusual situation, decides on its own that it is unfavorable, and notifies the user. This is artificial intelligence.

Suppose the same application has a feature that identifies the car’s plate number and matches it against the allotted taxi’s number. If they match, the user is notified, so the user doesn’t have to look around for the taxi. Here a neural network captures the taxi’s number plate, reads the numbers from the image, and matches them with the dataset. This is deep learning.

We can say that all the terms we have discussed so far are interrelated. Conceptually, however, they hold major differences, and anyone passionate about data science should understand these differences well. The lines of difference have now been drawn.

Reference article introducing the above techniques: **Read More**

The post AI vs Machine Learning vs Deep Learning vs Data Science appeared first on StepUp Analytics.

The post Data Science Solution Using Titanic Dataset appeared first on StepUp Analytics.

The sinking of the Titanic is one of the most infamous shipwrecks in history. On 15 April 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg. About 1,500 people were killed in the disaster. One of the reasons the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper class.

The basic workflow to solve any data science problem is as follows :

- Identifying the problem
- Acquire test and training data
- Clean the data
- Analyze the data
- Model, predict and solve the problem
- Visualize the data and come up with a solution

Here, our goal is to get a generalized prediction as fast as possible, but that doesn’t mean we should skip exploratory data analysis (EDA).

Before you begin, I recommend reading about the Random Forest algorithm first, as that is the algorithm we are going to use in this tutorial. You can also implement the solution using other algorithms such as logistic regression, decision trees, etc.

```python
# An efficient data structure
import pandas as pd

# importing the data
X = pd.read_csv("C:/Users/DELL/Downloads/train.csv")
X.describe()
```

```python
y = X.pop("Survived")
X.head()
```

```python
# Impute age with the mean value
X["Age"].fillna(X.Age.mean(), inplace=True)
X.describe()
```

```python
# selecting only the numeric variables
numeric_variables = list(X.dtypes[X.dtypes != "object"].index)
X[numeric_variables].head()
```

```python
# importing the random forest regressor
from sklearn.ensemble import RandomForestRegressor

# instantiate parameters
model = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=42)

# fit the model
model.fit(X[numeric_variables], y)
```

Output:

```
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
                      oob_score=True, random_state=42, verbose=0,
                      warm_start=False)
```

Having a high value for n_estimators increases the number of trees, which generally improves the prediction rate.

```python
# prediction error: oob_score_ gives the R^2 value based on OOB predictions
model.oob_score_
```

Output:

```
0.1361695005913669
```

```python
# finding the C-stat value
from sklearn.metrics import roc_auc_score

y_oob = model.oob_prediction_
print("C-stat", roc_auc_score(y, y_oob))
```

Output:

```
C-stat 0.7399551550399983
```

Now we have a benchmark which can be further improved.

```python
# function to show stats on categorical variables
def describe_categorical(X):
    from IPython.display import display, HTML
    display(HTML(X[X.columns[X.dtypes == "object"]].describe().to_html()))
```

```python
describe_categorical(X)
```

```python
# dropping variables which are not relevant
X.drop(["Name", "Ticket", "PassengerId"], axis=1, inplace=True)
```

```python
categorical_variables = ["Sex", "Cabin", "Embarked"]
for variable in categorical_variables:
    # filling missing data with "Missing"
    X[variable].fillna("Missing", inplace=True)
    # create an array of dummies
    dummies = pd.get_dummies(X[variable], prefix=variable)
    X = pd.concat([X, dummies], axis=1)
    # drop the original variable
    X.drop([variable], axis=1, inplace=True)
```

```python
# print all columns
def printall(X, max_rows=10):
    from IPython.display import display, HTML
    display(HTML(X.to_html(max_rows=max_rows)))

printall(X)
```

```python
model = RandomForestRegressor(100, oob_score=True, n_jobs=1, random_state=42)
model.fit(X, y)
print("C-stat : ", roc_auc_score(y, model.oob_prediction_))
```

Output:

```
C-stat :  0.8641256298000618
```

```python
feature_importances = pd.Series(model.feature_importances_, index=X.columns)
# feature_importances.sort()
feature_importances.plot(kind="bar", figsize=(40, 10));
```

Parameters to improve the model:

- n_estimators – the number of trees in the forest
- max_features – the number of features considered for the best split
- min_samples_leaf – the minimum number of samples in newly created leaves
- n_jobs – the number of processors that can be used to train and test the model

```python
# n_estimators: finding the optimal number of trees
results = []
n_estimator_values = [10, 20, 50, 100, 150, 200]
for trees in n_estimator_values:
    model = RandomForestRegressor(trees, oob_score=True, n_jobs=-1,
                                  random_state=42)
    model.fit(X, y)
    print(trees, "trees")
    roc_score = roc_auc_score(y, model.oob_prediction_)
    print("C-stat : ", roc_score)
    results.append(roc_score)
    print(" ")

pd.Series(results, n_estimator_values).plot();
```

```
10 trees
C-stat :  0.8274933691240853

20 trees
C-stat :  0.8562218387498801

50 trees
C-stat :  0.8620644659615038

100 trees
C-stat :  0.8641256298000618

150 trees
C-stat :  0.8635770513107297

200 trees
C-stat :  0.8650230616005709
```

```python
%%timeit
# finding the best execution time (parallel training with n_jobs=-1)
model = RandomForestRegressor(200, oob_score=True, n_jobs=-1, random_state=42)
model.fit(X, y)
```

```
661 ms ± 87.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

```python
%%timeit
model = RandomForestRegressor(200, oob_score=True, n_jobs=1, random_state=42)
model.fit(X, y)
```

```
920 ms ± 188 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

```python
# finding the optimal value for max_features
results = []
max_features_values = ["auto", "sqrt", "log2", None, 0.2, 0.9]
for max_features in max_features_values:
    model = RandomForestRegressor(n_estimators=200, oob_score=True, n_jobs=-1,
                                  random_state=42, max_features=max_features)
    model.fit(X, y)
    print(max_features, "option")
    roc_score = roc_auc_score(y, model.oob_prediction_)
    print("C-stat : ", roc_score)
    results.append(roc_score)
    print(" ")

pd.Series(results, max_features_values).plot(kind="barh", xlim=(0.86, 0.9));
```

```
auto option
C-stat :  0.8650230616005709

sqrt option
C-stat :  0.8665516249640495

log2 option
C-stat :  0.8673851447075491

None option
C-stat :  0.8650230616005709

0.2 option
C-stat :  0.8639365566314086

0.9 option
C-stat :  0.8643573110067215
```

```python
# finding the optimal value for min_samples_leaf
results = []
min_sample_leaf_values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
for min_sample in min_sample_leaf_values:
    model = RandomForestRegressor(n_estimators=200, oob_score=True, n_jobs=-1,
                                  random_state=42, max_features="log2",
                                  min_samples_leaf=min_sample)
    model.fit(X, y)
    print(min_sample, "min sample")
    roc_score = roc_auc_score(y, model.oob_prediction_)
    print("C-stat : ", roc_score)
    results.append(roc_score)
    print(" ")

pd.Series(results, min_sample_leaf_values).plot();
```

```
1 min sample
C-stat :  0.8673851447075491

2 min sample
C-stat :  0.8620831069781316

3 min sample
C-stat :  0.8419135269868661

4 min sample
C-stat :  0.8385048839463566

5 min sample
C-stat :  0.8371147967063985

6 min sample
C-stat :  0.8389096603074169

7 min sample
C-stat :  0.8339564758891764

8 min sample
C-stat :  0.8334292014188476

9 min sample
C-stat :  0.8322201983404169

10 min sample
C-stat :  0.8326249747014774
```

```python
# optimised model
model = RandomForestRegressor(n_estimators=200, oob_score=True, n_jobs=-1,
                              random_state=42, max_features="log2",
                              min_samples_leaf=1)
model.fit(X, y)
roc_score = roc_auc_score(y, model.oob_prediction_)
print("C-stat : ", roc_score)
```

```
C-stat :  0.8673851447075491
```

As you can observe, the prediction result improved from about 0.74 to 0.87, which is a fair improvement.

You can also check the prediction accuracy by implementing other machine learning algorithms and comparing them to select the best one.


The post How Do I Begin A Career in Data Science appeared first on StepUp Analytics.

– David Buckingham

**Overview**

- Introduction to Data Science
- Careers in Data Science
- Skillsets and their roles
- Where do I start?
- How to land a job in Data Science?
- Survey Reports and Stats

Is a Data Science job really the sexiest job of the 21st century? Or is it hype? Is data the currency of the future?

A few years ago, when computer science and software engineering were so captivating, everyone wanted to land a job as a programmer, web developer, software developer, etc. It followed a basic law of economics: supply and demand.

Back then, the supply of computer science engineers was low and demand in the market was much higher. But those days are gone: supply has increased hugely, as a plethora of CS engineers graduates every year. As time passed, salaries came down, although the software industry still pays above average.

Currently, the Data Science industry is in the same position: supply is really low and demand is high. As a consequence, the packages offered are also quite tempting; the average package of a data scientist is about $130,000. Wow! That figure is really eye-catching, isn’t it?

According to a survey done by Analytics India in 2018, the demand for data scientists and other Data Science careers is the highest in IT & Service industries followed by Banking and e-commerce.

By now, a lot of questions must be arising in your mind, like: How do I make a career in Data Science? What skill sets are required for it? Can I transition to Data Science from engineering or another profession? So let’s discuss them one by one. Before we begin, let me give you a brief introduction to what Data Science is.

Technically, Data Science is a multidisciplinary blend of data inference, algorithm development, and technology, used to uncover findings from data and solve complex problems analytically. To put it in simpler words, Data Science is discovering insights from data. For example:

- Netflix mines data to study movie-viewing patterns and understand its viewers’ interests.
- Search engines like Google and Yahoo implement Data Science algorithms to produce the best search results in the blink of an eye.
- Recommendation systems like “people you may know” and suggestions on Facebook and LinkedIn also make use of Data Science algorithms.

There is a plethora of Data Science applications that you use in your day-to-day life.

Here are the top 9 job roles in the world of Data Science:

- Data Analyst
- Data Engineer
- Data Scientist
- Database Administrator
- Machine Learning Engineer
- Data Architect
- Business Analyst
- Statistician
- Data and Analytics Manager

Most of you will have come across these super cool job titles. What skills do you actually require to become one, and what are their roles in the industry? Let’s have a look.

We often face situations where we know our goals but have no clue where to begin. So here are some tips you can follow, step by step, to begin your journey into the world of Data Science, where you have a lot to explore and new things to learn.

**1. Choose your Data Science career wisely**
Choosing the right career is the most crucial step because it could turn your life around 180 degrees.

**2. Introductory online courses and MOOCs**
Now that you have decided on a career, you need to develop its skills. The best way is to enroll in some online courses and MOOCs on Udemy, Coursera, Udacity, etc., which have really good material to begin with. Here is a list of some popular courses:

- Udacity – **Intro to Machine Learning**
- IBM – **Data Science Fundamentals**
- Coursera – **Data-Driven Decision Making**
- California Institute of Technology – **Learning from Data**

**3. Read Data Science blogs and books**
You can find a gazillion articles on Data Science, machine learning, statistics, actuarial science, etc. at StepUp Analytics, Coursera, Analytics Vidhya, and elsewhere. They are very easy to comprehend.

**4. Start coding and get acquainted with data science tools**
To grasp the concepts, it is important to practice what you have learned, because you learn Data Science by doing, not just by reading or watching online tutorials. You can choose any programming language you are comfortable with, but Python and R are the most recommended. HackerRank is a good platform to practice coding.

**5. Practice**
Focusing on practical applications is more important than theory. Work on various projects and participate in online Data Science competitions on Kaggle.

**6. Build a network and stay updated**
Having a Data Science network is also important for your profession, and the best platform for this is LinkedIn. Keep yourself updated on current market needs by subscribing to newsletters from Analytics India, StepUp Analytics, KDnuggets, etc.

- At **47.1%**, LinkedIn, the social networking website for professionals, seems to be the first choice both for looking for and for posting a job in Data Science.
- The next most popular way to find a job is checking with friends and acquaintances for employment opportunities, at **19%**.
- This is followed by the popular job portal Naukri at **15.7%**.

The above stats give us a bigger picture of Data Science jobs in 2017, and the good news is that you need not have a formal degree in Data Science. Even if you are an engineer or from a non-Data-Science background, you can still make a fantastic career in Data Science. All you need is some patience. Well, to bring to your notice, I’m also a B.Tech graduate and have been working in the Data Science domain.

**Aim, Execute, Conquer. Repeat!!**

I hope you find this article helpful. Please share your thoughts/feedbacks in the comment section below.

Reference: **kdnuggets** and **analyticsindiamag**


The post Types of Sampling Techniques: Stratified Sampling appeared first on StepUp Analytics.

Before introducing the notion of sampling, its various types such as Stratified Sampling, and their applications, let us first define the population. A **population** is the full set of all the possible units of analysis. It is also sometimes called the universe of observations.

For example, if we want to find the impact of a medicine on patients of tuberculosis in an area, our population will include all the patients of tuberculosis in that area.

When all the members of the population are explicitly identified, the resulting list is called a **sampling frame**. The sampling frame is a document that can be used with the different selection procedures described below to create a subset of the population for study. This subset is the **sample**. For example, a sampling frame for voters in a precinct would be the voter registration listing.

A sample is a collection of certain values chosen from the population. The sample size, usually denoted by n, is the number of these values. If these values are chosen at random, the sample is called a random sample. Each entry on the sampling frame is called a **sampling unit**.

A census is a study of every unit (everyone or everything) in a population. It is known as a complete enumeration, which means a complete count.

Suppose you wish to study the impact of corporate image advertising in large corporations. You might define the unit of analysis as the corporation, and the population as “Fortune 500 Corporations” (a listing of the 500 largest corporations in the United States compiled by Fortune magazine). If we actually measure the amount of advertising for each of the 500 corporations, we will be conducting a census of the variable.

If the population is infinite, complete enumeration is not possible. Also, if the units are destroyed in the course of inspection (e.g., inspection of crackers, explosive materials, etc.), 100% inspection, though possible, is not at all desirable. Besides these problems, there may be time constraints to complete our research, or administrative and financial implications.

In such cases, sampling is used to study the population. The sample characteristics are utilized to approximately determine or estimate the population. The error involved in such approximations is called sampling error and is inherent and usually unavoidable in any and every sampling scheme.

Sampling is quite often used in day-to-day practical life. For example, in a shop, we assess the quality of sugar, wheat or any other commodity by taking a handful of it from the bag and then decide to purchase it or not. A housewife normally tests the cooked products to find out if they are properly cooked and contain the proper quantity of salt.

**Simple random sampling** is the technique of drawing a sample in such a way that each unit of the population has an equal and independent chance of being included in the sample.

If a sampling frame is available, drawing a representative probability sample is quite easy. You simply select units from the list by using some truly random process, like a random numbers table or a computer program, so that every entry on the list has exactly the same probability of being chosen.

A clear example of a simple random probability sample is drawing a name from a hat: the sampling frame (a list of names defining the universe) is torn up into equally-sized slips of paper, placed in a hat, mixed up (randomized), and then a name is picked from the hat. All names have an equal probability of being picked, and the mixing process ensures that there is no systematic bias in selecting a name.

Once again, there is no way to predict whose name will be drawn. Any name can be chosen, and all names have the same probability of being drawn (1 divided by the number of names in the hat).

Researchers can create a simple random sample using a couple of methods. With a lottery method, each member of the population is assigned a number, after which numbers are selected at random. The example in which the names of 25 employees out of 250 are chosen out of a hat is an example of the lottery method at work. Each of the 250 employees would be assigned a number between 1 and 250, after which 25 of those numbers would be chosen at random.

For larger populations, a manual lottery method can be quite time-consuming. In such a case we can use a ‘Random Number Table’, which has been constructed so that each of the digits 0, 1, 2, …, 9 appears with approximately the same frequency and independently of the others.

The method of drawing the random sample consists of the following steps:

1. Identify the N units in the population with the numbers from 1 to N.
2. Select at random any page of the ‘random number table’ and pick up the numbers in any row, column, or diagonal at random.
3. The population units corresponding to the numbers selected in step 2 constitute the random sample.

For example: if we randomly sample 4 people (or labels) from 8, we need 4 random digits without replacement from 1, 2, 3, …, 8.

From the above table, we simply read off random digits, ignoring those that are out of range or recur (we are sampling without replacement), until we get four of them. Going from left to right across the top row of Table 1 we get 1 2 4 [2] [9] 6 3 5 …. (Here, we have placed square brackets around numbers that are repeats of previously appearing numbers or are out of range.) Taking the first four usable numbers we get 1, 2, 4, 6, and the random sample consists of the individuals with those labels.
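The same draw can be done in code. This sketch (with an arbitrary fixed seed for reproducibility) mirrors the table procedure, since `random.sample` selects without replacement, just as we skip repeated digits in the table:

```python
import random

random.seed(42)          # fixed seed so the draw can be reproduced
N, n = 8, 4              # population size and sample size
labels = list(range(1, N + 1))
sample = random.sample(labels, n)   # n distinct labels, without replacement
print(sample)
```

Every unit has the same chance of selection, which is the defining property of simple random sampling.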

Consider a population of potato sacks, each of which has either 12, 13, 14, 15, 16, 17, or 18 potatoes, and all the values are equally likely. Suppose that, in this population, there is exactly one sack with each number.

So the whole population has seven sacks. If I sample two with replacement, then I first pick one (say 14). I had a 1/7 probability of choosing that one. Then I replace it. Then I pick another. Every one of them still has 1/7 probability of being chosen. And there are exactly 49 different possibilities here (assuming we distinguish between the first and second.) They are: (12,12), (12,13), (12, 14), (12,15), (12,16), (12,17), (12,18), (13,12), (13,13), (13,14), etc.

Consider the same population of potato sacks. If I sample two without replacement, then I first pick one (say 14). I had a 1/7 probability of choosing that one. Then I pick another. At this point, there are only six possibilities: 12, 13, 15, 16, 17, and 18. So there are only 42 different possibilities here (again assuming that we distinguish between the first and the second.) They are: (12,13), (12,14), (12,15), (12,16), (12,17), (12,18), (13,12), (13,14), (13,15), etc.

When we sample with replacement, the two sample values are independent. Practically, this means that what we get on the first one doesn’t affect what we get on the second. Mathematically, this means that the covariance between the two is zero.
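The counts in the potato-sack example can be checked by enumerating every ordered outcome:

```python
from itertools import product, permutations

sacks = [12, 13, 14, 15, 16, 17, 18]          # one sack per potato count
with_repl = list(product(sacks, repeat=2))    # ordered pairs, replacement allowed
without_repl = list(permutations(sacks, 2))   # ordered pairs, no repeats
print(len(with_repl), len(without_repl))      # 49 42
```

With replacement there are 7 × 7 = 49 possibilities; without replacement, 7 × 6 = 42, since a pair like (12, 12) can no longer occur.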

In sampling without replacement, the two sample values aren’t independent. Practically, this means that what we got on the first one affects what we can get for the second one. Mathematically, this means that the covariance between the two isn’t zero, which complicates the computations. In particular, if we have an SRS (simple random sample) without replacement from a population with variance σ², then the covariance of two different sample values is −σ²/(N − 1), where N is the population size. If the population is very large, this covariance is very close to zero. In that case, sampling with replacement isn’t much different from sampling without replacement.
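The standard result here is Cov(X₁, X₂) = −σ²/(N − 1) for an SRS without replacement. Since the potato-sack population is tiny, we can verify it exactly by enumerating every equally likely ordered pair:

```python
from itertools import permutations

pop = [12, 13, 14, 15, 16, 17, 18]
N = len(pop)
mu = sum(pop) / N                              # population mean = 15.0
sigma2 = sum((x - mu) ** 2 for x in pop) / N   # population variance = 4.0

pairs = list(permutations(pop, 2))             # all ordered draws w/o replacement
e_xy = sum(x * y for x, y in pairs) / len(pairs)
cov = e_xy - mu * mu                           # Cov(X1, X2) = E[XY] - E[X]E[Y]
print(cov, -sigma2 / (N - 1))                  # both equal -2/3
```

The negative sign matches intuition: drawing a large sack first leaves only smaller sacks behind, so the second draw tends to be below average.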

Stratified random sampling is a method of sampling that involves the division of a population into smaller groups known as strata. In stratified random sampling or stratification, the strata are formed based on members’ shared attributes or characteristics.

Stratified random sampling is also called proportional random sampling or quota random sampling. The sample size of each stratum in this technique is proportionate to the population size of the stratum when viewed against the entire population. This means that each stratum has the same sampling fraction (n/N).

Stratified random sampling is often a better method than simple random sampling. The population is divided into subgroups (strata), and random samples are taken from each stratum in proportion to its share of the population. The members of each stratum have similar attributes and characteristics, which makes this method widely used and very useful when the target population is heterogeneous. A simple random sample should be taken from each stratum.

For example, suppose we want to draw a probability sample of 100 undergraduates from a university for a study of the effect of professor-student communication patterns on student grades. Figures from the university records state that 65% of the students are majoring in Liberal Arts. We know, then, that a representative sample should include 65 Liberal Arts majors and 35 majors in other fields.

So we begin by separating the registrar’s student list (the sampling frame) into two strata: the Liberal Arts majors and the non-Liberal Arts majors. We then draw a random sample of 65 from the Liberal Arts stratum and another random sample of 35 from the non-Liberal Arts stratum. The result is an unbiased sample in which there is no sampling error on the stratifying variable (academic major).

The sample has exactly the same proportion of Liberal Arts/non-Liberal Arts majors as does the population. Of course, other unstratified variables in the sample are still subject to sampling error.
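A sketch of this 65/35 example in code (the student records are fabricated; only the proportions come from the example above):

```python
import random

random.seed(7)
# hypothetical sampling frame: 65% Liberal Arts, 35% other majors
students = ([("LA", i) for i in range(650)] +
            [("Other", i) for i in range(350)])

def stratified_sample(frame, key, total):
    strata = {}
    for unit in frame:                       # split the frame into strata
        strata.setdefault(key(unit), []).append(unit)
    sample = []
    for members in strata.values():          # proportional allocation
        k = round(total * len(members) / len(frame))
        sample.extend(random.sample(members, k))   # SRS within the stratum
    return sample

s = stratified_sample(students, key=lambda u: u[0], total=100)
print(sum(1 for u in s if u[0] == "LA"),
      sum(1 for u in s if u[0] == "Other"))   # 65 35
```

By construction, the stratifying variable carries no sampling error: the 65/35 split is exact.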

The same method used above can be applied to the polling of elections, the income of varying populations, and income for different jobs across a nation.

In school, while selecting the captains of sports teams, most of our coaches asked us to count off numbers such as 1 to 5, and the students who landed on a number decided by the coach (for instance, 3) would be called out to be the captains of the different teams.

It is a stress-free selection process for both the coach and the players. Such a method of sampling is called systematic sampling.

Systematic sampling is a probability sampling method where the elements are chosen from a target population by selecting a random starting point and selecting other members after a fixed ‘sampling interval’. The sampling interval is calculated by dividing the entire population size by the desired sample size.

**Systematic Sampling Formula for the interval (i) = N/n**
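A minimal sketch of the procedure (the population and sample sizes here are arbitrary): compute the interval i = N/n, pick a random start within the first interval, then take every i-th unit:

```python
import random

random.seed(1)
N, n = 100, 10
i = N // n                         # sampling interval
start = random.randrange(i)        # random starting point in the first interval
sample = list(range(start, N, i))  # every i-th unit after the start
print(len(sample))                 # 10
```

Only the starting point is random; every later selection is determined by the interval, which is what distinguishes systematic sampling from a simple random sample.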

The bias introduced by systematic random sampling is usually small in practical situations, so this procedure is frequently used. However, if it is possible to draw a simple random sample rather than a systematic random sample, one should always do so.

The process of selection can interact with a hidden periodic trait within the population. If the sampling technique coincides with the periodicity of the trait, the sampling technique will no longer be random and representativeness of the sample is compromised.

The researcher must be certain that the chosen constant interval between subjects does not reflect a certain pattern of traits present in the population. If a pattern in the population exists and it coincides with the interval set by the researcher, the randomness of the sampling technique is compromised.

Cluster sampling is a sampling technique that divides the main population into various sections (clusters).

In this sampling technique, the analysis is carried out on a sample which consists of multiple sample parameters such as demographics, habits, background, or any other population attribute which may be the focus of the research. It is usually used when groups that are similar yet internally diverse form a statistical population. Instead of surveying the entire population, cluster sampling allows researchers to collect data by dividing it into smaller, more manageable groups.

We first divide the population under study into some recognizable sub-divisions or clusters, and then a simple random sample of these clusters is drawn.

For example, if we are interested in obtaining the income of opinion data in a city, the whole city may be divided into N different blocks or localities (which determine the clusters) and a simple random sample of n blocks is drawn. The individuals in the selected blocks determine the cluster sample.

In most cases, sampling by clusters happens over multiple stages. A stage consists of the steps taken to get to the desired sample, and cluster sampling is divided into single-stage, two-stage, and multiple-stage sampling.

In single-stage cluster sampling, we divide the entire sampling frame into clusters, usually based on some naturally occurring geographic grouping (e.g. city, town, village, hospital). Then we sample these clusters and measure every element within the selected clusters.

Two-stage sampling is the same as single-stage sampling, except that instead of taking all the elements found in the selected clusters (the first stage of sampling), we take a random sample of elements from each cluster.

For example, in single-stage sampling, we might take an SRS of cities and, within each city, measure characteristics of all hospitals. In a two-stage sampling plan, we would take an SRS of cities, then within each city list out every hospital and take an SRS of hospitals.

If we are interested in obtaining a sample of, say, n households from a particular State, the first-stage units may be districts, the second-stage units may be villages in the districts, and the third-stage units will be households in the villages. Each stage thus results in a reduction of the sample size. This is called multi-stage sampling. In this method, a sample of first-stage units is drawn by a suitable method of sampling; from among the selected first-stage units, a sub-sample of second-stage units is drawn, and further stages may be added to arrive at a sample of the desired sampling units.
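The two-stage idea can be sketched as follows; the cities and hospitals are entirely hypothetical:

```python
import random

random.seed(3)
# the frame, grouped into naturally occurring clusters (cities)
hospitals = {city: [f"{city}-H{k}" for k in range(1, 9)]
             for city in ["CityA", "CityB", "CityC", "CityD", "CityE"]}

cities = random.sample(list(hospitals), 2)             # stage 1: SRS of clusters
sample = [h for city in cities
          for h in random.sample(hospitals[city], 3)]  # stage 2: SRS within each
print(len(sample))                                     # 6
```

Only the selected clusters need to be visited, which is precisely why cluster sampling is cheaper than stratified sampling when travel between groups is expensive.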

Cluster sampling is preferred when it is hard or expensive to visit each group/stratum (as is required in stratified sampling). However, variance increases in this procedure.


The post Free eBooks On Data Science Using Python appeared first on StepUp Analytics.

In the information and digital age, data is all around us. Within this data are answers to many important questions across many societal domains (politics, business, science, etc.). But if you had access to a large dataset, would you be able to find the answers you seek? The answer is no!

One should acquire a decent understanding of how to handle data, how to gather information from raw data, and how to visualize it to derive insights.

This set of eBooks on Data Science Using Python will help you learn the basic building blocks of Python specifically for Data Science and formalize your Data Science journey using Python.

- Python (SciPy)
- Jupyter notebooks
- pandas
- NumPy
- matplotlib
- git
- and many other tools

- **Think Python: How to Think Like a Computer Scientist** – *Allen Downey*, 2012
- **Python Programming** – *Wikibooks*, 2015
- **Learning Python for Data Science**
- **Mastering Python**

A logical, reasonably standardized, but flexible project structure for doing and sharing data science work. This repository will help you with learning resources, including good examples, datasets, and code.

- **Python Data Science Handbook GitHub Repo**
- **PyData Handbook**
- **Python for Data Analysis**
- **Cookie Cutter Data Science**

- **The Data Analytics Handbook** – *Brian Liou, Tristan Tao, & Declan Shener*, 2015
- **An Introduction to Data Science** – *Jeffrey Stanton*, 2013
- **School of Data Handbook** – *School of Data*, 2015

It covers topics on data preparation, data munging, and data wrangling. It introduces IPython, a friendly interface for coding. In addition, it also covers NumPy and Pandas.

- **Python Machine Learning Algorithms**
- **What You Need to Know About Machine Learning**
- **Advanced Python Machine Learning**
- **Getting Started With Tensorflow**

- **Understanding the Chief Data Officer** – *Julie Steele*, 2015

- **Introduction to Machine Learning** – *Amnon Shashua*, 2008
- **Machine Learning** – *Abdelhamid Mellouk & Abdennacer Chebira*
- **Machine Learning – The Complete Guide** – *Wikipedia*
- **Deep Learning** – *Yoshua Bengio, Ian J. Goodfellow, & Aaron Courville*, 2015
- **Neural Networks and Deep Learning** – *Michael Nielsen*, 2015
- **Building Machine Learning Systems with Python**
- **Data Mining Algorithms In R** – *Wikibooks*, 2014
- **R Deep Learning Essentials**

