The post H20 Package: Classification Using Logistic Regression appeared first on StepUp Analytics.


Logistic regression arises from the underlying assumption of the GLMs, which I will discuss in the next section. I will cover two ways of carrying out logistic regression in R: the standard method using the **glm** function from the base package, and the h2o.glm function from the h2o package. We will also see how the accuracy improves from the first model to the second.

**What are Generalized linear models and how do they differ from the classical linear models?**

We already know that the error term in classical linear models is assumed to follow a normal distribution. But when the response variable has binary classes, we assume the error term follows the logistic distribution instead, whose cumulative distribution function (CDF) is F(x) = 1 / (1 + e^(-x)).

Hence the term logistic regression. Graphically, this CDF is an S-shaped curve, also known as the sigmoid function. Its output always lies between 0 and 1.
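The sigmoid can be sketched in a few lines of Python (a minimal illustration as an aside; the analysis below is in R):

```python
import math

def sigmoid(x):
    # Logistic CDF: maps any real number into the open interval (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

print(sigmoid(0))   # → 0.5, the midpoint of the S-curve
print(sigmoid(4))   # approaches 1 for large positive inputs
print(sigmoid(-4))  # approaches 0 for large negative inputs
```

Whatever the input, the output never leaves (0, 1), which is what lets us read it as a probability.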

For the analysis I’ll be using an example dataset and the following steps will be followed:

- Reading the data
- Splitting the data into training and testing sets
- Applying glm on the training set
- Prediction using the test data
- Calculating the accuracy

The dataset considered here contains 21 variables and 3168 observations, where the label variable indicates whether the voice of an individual is male or female. Before we step into the analysis, some pre-processing of the data is required. We first subset the data to keep only those variables that are important for our analysis, and then convert the label variable into a factor with levels 1 and 2, representing female and male respectively.

The data looks somewhat like this:

data <- read.csv("voice.csv")

head(data)

data$label <- factor(data$label)

str(data)

names(data)

Along with label we have a set of 20 other variables that are the descriptive statistics upon which our response variable depends.

Now, we partition the data into training and testing datasets. For this we use the caret package:

library(caret)

set.seed(99)

Train <- createDataPartition(data$label, p = 0.75, list = FALSE)

training <- data[Train,]

testing <- data[-Train,]

We use set.seed(99) so that the partition is reproducible. In the next step, we fit the logistic regression model on the training set:

fit <- glm(label ~ Q25 + Q75 + sp.ent + sfm + mode + meanfun + minfun, data = training, family = binomial(link = "logit"))

To check what model has been formed and in order to interpret the results, we use the summary function to extract all the information possible.

summary(fit)

Results:

The Call section shows that we have executed the code correctly and that our response and predictor variables are placed on the correct sides of the ~. From the table, we can see that all of the independent variables used in the model are significant at the 10% level of significance, although we could remove the mode variable if we considered a 5% level.

Apart from the table, it is necessary to note that the AIC value obtained from this model is 440. We will try to reduce this using a different method in the next section. But before that, we now try to calculate how well this model performs on testing or unseen data.

p <- predict(fit, newdata = testing, type = "response")

head(p) #gives the first 6 observations

The row names are simply the indices of the observations that landed in the testing dataset. To check how accurately the model classifies the gender label on test data, we set a threshold of 0.5: observations with a predicted probability greater than 0.5 are classified as male, and the rest as female.

To measure the accuracy and to check for the misclassification, we form a confusion matrix.

pred1 <- ifelse(p>0.5,2,1)

tab <- table(pred1,testing$label)

tab

Thus, from the table we can see that 379 observations were correctly classified as female and 386 as male. To calculate the accuracy, we divide the correctly classified counts by the total:

sum(diag(tab))/sum(tab)

which gives a great accuracy of 96.6%.
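As a quick arithmetic check of the reported accuracy (a pure-Python aside; the correct counts come from the confusion matrix above, and the test-set size of 792 is an assumption based on an exact 75/25 split of the 3168 rows):

```python
# Confusion-matrix counts from the table above:
# 379 correctly classified females, 386 correctly classified males.
correct = 379 + 386
total = 792  # assumed: 25% of 3168 observations held out for testing
accuracy = correct / total
print(round(accuracy * 100, 1))  # → 96.6
```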

Before we proceed to the analysis, it is necessary to understand what this package is and what it does.

H2O is a leading open-source platform that makes it easy for financial services, insurance, and healthcare companies to deploy AI and deep learning to solve complex problems. To make it easier for non-engineers to create complete analytic workflows, H2O's platform includes interfaces for R, Python and many other tools.

The steps that will be followed here are quite different from the previous case:

- Initialise the H2o package.
- Read in the data
- Data pre-processing if required
- Convert data into H2o readable format
- Split the data into training and testing sets
- Check for the accuracy of the model on the test data

Following the above steps:

library(h2o)

h2o.init(max_mem_size = "2G", nthreads = 2, ip = "localhost", port = 54321)

data <- read.csv("voice.csv")

data1 <- data[,c(4,5,9,10,11,13,14,21)]

d.hex <- as.h2o(data1,destination_frame = “d.hex”)

Now we have the data that can be read in by the h2o package. Converting the data might take a few seconds, depending upon the configuration of your laptop. Wait until the conversion shows 100%. To check the first few values you can use the head function.

For splitting the dataset, we use the h2o package's built-in functions instead of the caret package.

head(d.hex)

Note that set.seed does not control h2o's internal random number generator; for a reproducible split, pass a seed to h2o.splitFrame directly:

split <- h2o.splitFrame(data = d.hex, ratios = 0.75, seed = 99)

train <- split[[1]]

test <- split[[2]]

After running these functions, we can now carry out the glm function from the h2o package only on the training data and then check it on the test data for how accurately it classifies categories. Again running these codes might take a few seconds.

fit3 <- h2o.glm(x = 1:7, y = 8, training_frame = train, family = "binomial", link = "logit")

h2o.performance(fit3,test)

The performance function will give a list of variables of performance measures like RMSE, Logloss, AUC etc. But in this blog, I intend to concentrate only on the AIC and the confusion matrix.

We can clearly see that the AIC value is lower than in the first model, which indicates a better, more robust model.

We extract the confusion matrix to measure accuracy and misclassification. The beauty of this package is that it contains all the functions necessary for the analysis, so you don't have to refer to different packages for different functions.

Clearly, 390 females and 373 males are rightly classified. The accuracy is given by (390+373)/(390+373+13+8) ≈ 97%, which is slightly higher than the previous model, although there is no significant difference in accuracy between the two.

To check the prediction made for each observation in the test data, and how strong the probability behind each prediction is, we use:

h2o.predict(fit3, test)

So the prediction made for the first observation is male, with probability **0.99966**, which is quite high, and so on.

And this is how you can use two different methods of carrying out logistic regression on the same dataset.

**Download** the data used in this blog. Read the latest articles on **Machine Learning**.


The post Enabling Machine Learning Algorithm appeared first on StepUp Analytics.

Linear Regression

**Python Code**

#Import Library
#Import other necessary libraries like pandas, numpy...
from sklearn import linear_model
#Load Train and Test datasets
#Identify feature and response variable(s); values must be numeric numpy arrays
x_train = input_variables_values_training_datasets
y_train = target_variables_values_training_datasets
x_test = input_variables_values_test_datasets
# Create linear regression object
linear = linear_model.LinearRegression()
# Train the model using the training sets and check score
linear.fit(x_train, y_train)
linear.score(x_train, y_train)
#Equation coefficient and Intercept
print('Coefficient: \n', linear.coef_)
print('Intercept: \n', linear.intercept_)
#Predict Output
predicted = linear.predict(x_test)

**R Code**

#Load Train and Test datasets
#Identify feature and response variable(s); values must be numeric
x_train <- input_variables_values_training_datasets
y_train <- target_variables_values_training_datasets
x_test <- input_variables_values_test_datasets
x <- cbind(x_train, y_train)
# Train the model using the training sets and check score
linear <- lm(y_train ~ ., data = x)
summary(linear)
#Predict Output
predicted <- predict(linear, x_test)

Logistic Regression

**Python Code**

#Import Library
from sklearn.linear_model import LogisticRegression
#Assumed you have X (predictor) and y (target) for the training data set
#and x_test (predictor) of the test dataset
# Create logistic regression object
model = LogisticRegression()
# Train the model using the training sets and check score
model.fit(X, y)
model.score(X, y)
#Equation coefficient and Intercept
print('Coefficient: \n', model.coef_)
print('Intercept: \n', model.intercept_)
#Predict Output
predicted = model.predict(x_test)

**R Code**

x <- cbind(x_train, y_train)
# Train the model using the training sets and check score
logistic <- glm(y_train ~ ., data = x, family = 'binomial')
summary(logistic)
#Predict Output
predicted <- predict(logistic, x_test)

Decision Tree

**Python Code**

#Import Library
#Import other necessary libraries like pandas, numpy...
from sklearn import tree
#Assumed you have X (predictor) and y (target) for the training data set
#and x_test (predictor) of the test dataset
# Create tree object; for classification you can change the criterion
# to gini or entropy (information gain); by default it is gini
model = tree.DecisionTreeClassifier(criterion='gini')
# model = tree.DecisionTreeRegressor() for regression
# Train the model using the training sets and check score
model.fit(X, y)
model.score(X, y)
#Predict Output
predicted = model.predict(x_test)

**R Code**

library(rpart)
x <- cbind(x_train, y_train)
# grow tree
fit <- rpart(y_train ~ ., data = x, method = "class")
summary(fit)
#Predict Output
predicted <- predict(fit, x_test)

SVM (Support Vector Machine)

**Python Code**

#Import Library
from sklearn import svm
#Assumed you have X (predictor) and y (target) for the training data set
#and x_test (predictor) of the test dataset
# Create SVM classification object; there are various options associated
# with it, this is a simple one for classification
model = svm.SVC()
# Train the model using the training sets and check score
model.fit(X, y)
model.score(X, y)
#Predict Output
predicted = model.predict(x_test)

**R Code**

library(e1071)
x <- cbind(x_train, y_train)
# Fitting model
fit <- svm(y_train ~ ., data = x)
summary(fit)
#Predict Output
predicted <- predict(fit, x_test)

Naive Bayes

**Python Code**

#Import Library
from sklearn.naive_bayes import GaussianNB
#Assumed you have X (predictor) and y (target) for the training data set
#and x_test (predictor) of the test dataset
# Create Gaussian Naive Bayes object; there are other distributions for
# multinomial classes, like Bernoulli Naive Bayes; see
# http://scikit-learn.org/stable/modules/naive_bayes.html
model = GaussianNB()
# Train the model using the training sets and check score
model.fit(X, y)
#Predict Output
predicted = model.predict(x_test)

**R Code**

library(e1071)
x <- cbind(x_train, y_train)
# Fitting model
fit <- naiveBayes(y_train ~ ., data = x)
summary(fit)
#Predict Output
predicted <- predict(fit, x_test)

KNN (K-Nearest Neighbors)

**Python Code**

#Import Library
from sklearn.neighbors import KNeighborsClassifier
#Assumed you have X (predictor) and y (target) for the training data set
#and x_test (predictor) of the test dataset
# Create KNeighbors classifier object; default value for n_neighbors is 5
model = KNeighborsClassifier(n_neighbors=6)
# Train the model using the training sets and check score
model.fit(X, y)
#Predict Output
predicted = model.predict(x_test)

**R Code**

library(class)
# class::knn takes the training and test matrices and the training labels
# directly, and returns the predicted labels for the test set
predicted <- knn(train = x_train, test = x_test, cl = y_train, k = 5)

K-Means

**Python Code**

#Import Library
from sklearn.cluster import KMeans
#Assumed you have X (attributes) for the training data set
#and x_test (attributes) of the test dataset
# Create KMeans object
model = KMeans(n_clusters=3, random_state=0)
# Train the model using the training set
model.fit(X)
#Predict Output
predicted = model.predict(x_test)

**R Code**

library(cluster)
fit <- kmeans(X, 3) # 3 cluster solution

Random Forest

**Python Code**

#Import Library
from sklearn.ensemble import RandomForestClassifier
#Assumed you have X (predictor) and y (target) for the training data set
#and x_test (predictor) of the test dataset
# Create Random Forest object
model = RandomForestClassifier()
# Train the model using the training sets and check score
model.fit(X, y)
#Predict Output
predicted = model.predict(x_test)

**R Code**

library(randomForest)
x <- cbind(x_train, y_train)
# Fitting model
fit <- randomForest(y_train ~ ., data = x, ntree = 500)
summary(fit)
#Predict Output
predicted <- predict(fit, x_test)

Dimensionality Reduction (PCA)

**Python Code**

#Import Library
from sklearn import decomposition
#Assumed you have training and test data sets as train and test
# Create PCA object; default value of n_components = min(n_samples, n_features)
pca = decomposition.PCA(n_components=k)
# For Factor analysis:
# fa = decomposition.FactorAnalysis()
# Reduce the dimension of the training dataset using PCA
train_reduced = pca.fit_transform(train)
# Reduce the dimension of the test dataset
test_reduced = pca.transform(test)
# For more detail, see
# http://scikit-learn.org/stable/modules/decomposition.html#decompositions

**R Code**

library(stats)
pca <- princomp(train, cor = TRUE)
train_reduced <- predict(pca, train)
test_reduced <- predict(pca, test)


Gradient Boosting (GBM)

**Python Code**

#Import Library
from sklearn.ensemble import GradientBoostingClassifier
#Assumed you have X (predictor) and y (target) for the training data set
#and x_test (predictor) of the test dataset
# Create Gradient Boosting Classifier object
model = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=0)
# Train the model using the training sets and check score
model.fit(X, y)
#Predict Output
predicted = model.predict(x_test)

**R Code**

library(caret)
x <- cbind(x_train, y_train)
# Fitting model
fitControl <- trainControl(method = "repeatedcv", number = 4, repeats = 4)
fit <- train(y ~ ., data = x, method = "gbm", trControl = fitControl, verbose = FALSE)
#Predict Output
predicted <- predict(fit, x_test, type = "prob")[,2]

Thanks

Team StepUpAnalytics


The post Logistics Regression in Python Using Pandas appeared first on StepUp Analytics.

I thought of starting a series in which I will implement various Machine Learning techniques using Python.

To start with, today we will look at Logistic Regression in Python; I have used an IPython notebook.

Here is the dataset used as part of this demo: Download

We will import the following libraries in Python

import pandas as pd # for handling datasets
import statsmodels.api as sm # for statistical modeling
import pylab as pl # for plotting
import numpy as np # for numerical computation

dfTrain = pd.read_csv('E:\\Sajid_Mac\\DataSciencePython\\Logistic Regression with StatsModels\\test.csv')
dfTest = pd.read_csv('E:\\Sajid_Mac\\DataSciencePython\\Logistic Regression with StatsModels\\train.csv')

print(dfTrain.head())

print(dfTest.tail())

# summarize the data
dfTrain.describe()

# take a look at the standard deviation of each column
dfTrain.std()

# frequency table cutting prestige and whether or not someone was admitted
pd.crosstab(dfTest['admit'], dfTest['prestige'], rownames=['admit'])

#explore data
dfTest.groupby('admit').mean()

# plot one column
dfTest['gpa'].hist()
pl.title('Histogram of GPA')
pl.xlabel('GPA')
pl.ylabel('Frequency')
pl.show()

# barplot of gre score grouped by admission status (True or False)
pd.crosstab(dfTest.gre, dfTest.admit.astype(bool)).plot(kind='bar')
pl.title('GRE score by Admission Status')
pl.xlabel('GRE score')
pl.ylabel('Frequency')
pl.show()

# dummify prestige
dummy_ranks = pd.get_dummies(dfTest['prestige'], prefix='prestige')
print(dummy_ranks.head())

# create a clean data frame for the regression
cols_to_keep = ['admit', 'gre', 'gpa']
data = dfTest[cols_to_keep].join(dummy_ranks.loc[:, 'prestige_good':])
print(data.head())

# manually add the intercept
data['intercept'] = 1.0

train_cols = data.columns[1:]
print(data.columns[1:])

#Logistic Regression
logit = sm.Logit(data['admit'], data[train_cols])

# fit the model
result = logit.fit()
print(result.summary())

Result will look something like this:

Optimization terminated successfully.
         Current function value: 0.567976
         Iterations 6
                           Logit Regression Results
==============================================================================
Dep. Variable:                  admit   No. Observations:                  300
Model:                          Logit   Df Residuals:                      295
Method:                           MLE   Df Model:                            4
Date:                Tue, 11 Apr 2017   Pseudo R-squ.:                 0.07857
Time:                        10:20:59   Log-Likelihood:                -170.39
converged:                       True   LL-Null:                       -184.92
                                        LLR p-value:                 7.604e-06
=====================================================================================
                        coef    std err          z      P>|z|      [95.0% Conf. Int.]
-------------------------------------------------------------------------------------
gre                   0.0020      0.001      1.649      0.099      -0.000       0.004
gpa                  -0.2227      0.229     -0.974      0.330      -0.671       0.225
prestige_good        -1.7587      0.394     -4.460      0.000      -2.532      -0.986
prestige_ok           -2.0917      0.463     -4.517      0.000      -2.999      -1.184
prestige_veryGood    -1.0220      0.352     -2.903      0.004      -1.712      -0.332
=====================================================================================

# recreate the dummy variables
dummy_ranks_test = pd.get_dummies(dfTest['prestige'], prefix='prestige')
print(dummy_ranks_test)

#create intercept column
dfTest['intercept'] = 1.0

# keep only what we need for making predictions
cols_to_keep = ['gre', 'gpa', 'prestige', 'intercept']
dfTest = dfTest[cols_to_keep].join(dummy_ranks_test.loc[:, 'prestige_good':])
dfTest.head()

# make predictions on the enumerated dataset
dfTest['admit_pred'] = result.predict(dfTest[train_cols])

# see probabilities
print(dfTest.head())

#convert probabilities to 'yes'/'no'
dfTest['admit_yn'] = np.where(dfTest['admit_pred'] > 0.5, 'yes', 'no')
print(dfTest.head())

cols = ['gre', 'gpa', 'admit_yn']
dfTest[cols].groupby('admit_yn').mean()

                gre       gpa
admit_yn
no       573.229572  3.328288
yes      696.279070  3.732558


dfTest.to_csv('E:\\Sajid_Mac\\DataSciencePython\\Logistic Regression with StatsModels\\output.csv', sep=',')

Download the entire iPython Notebook Here


The post Linear Regression Assumptions appeared first on StepUp Analytics.

Linear regression makes five key assumptions:

- Linear relationship
- Multivariate normality
- No or little multicollinearity
- No auto-correlation
- Homoscedasticity

Linear regression needs at least 2 variables of metric (ratio or interval) scale. A rule of thumb for the sample size is that regression analysis requires at least 20 cases per independent variable in the analysis.

Firstly, linear regression needs the relationship between the independent and dependent variables to be linear. It is also important to check for outliers since linear regression is sensitive to outlier effects. The linearity assumption can best be tested with scatter plots, the following two examples depict two cases, where no and little linearity is present.

Secondly, the linear regression analysis requires all variables to be multivariate normal. This assumption can best be checked with a histogram and a fitted normal curve, or with a Q-Q plot. Normality can also be checked with a goodness-of-fit test, e.g., the Kolmogorov-Smirnov test. When the data is not normally distributed, a non-linear transformation (e.g., a log transformation) might fix this issue; however, it can introduce effects of multicollinearity.
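As an aside, the effect of a log transformation on skewed data can be illustrated with a short Python sketch. The data here is synthetic, and the moment-based skewness check is only a crude complement to a formal test like Kolmogorov-Smirnov:

```python
import math

def skewness(xs):
    # Sample skewness: the third standardized moment
    n = len(xs)
    mean = sum(xs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    return sum(((x - mean) / sd) ** 3 for x in xs) / n

# A right-skewed sample (roughly log-normal shaped)
raw = [math.exp(x / 10.0) for x in range(-20, 21)]
# After the log transformation the sample is symmetric again
logged = [math.log(x) for x in raw]

print(skewness(raw))     # clearly positive (right-skewed)
print(skewness(logged))  # approximately 0 (symmetric)
```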

Thirdly, linear regression assumes that there is little or no multicollinearity in the data. Multicollinearity occurs when the independent variables are not independent of each other. A second important independence assumption is that the error of the mean has to be independent of the independent variables.

**Multicollinearity might be tested with 4 central criteria:**

- Correlation matrix – when computing the matrix of Pearson's bivariate correlations among all independent variables, the correlation coefficients need to be smaller than 1; coefficients close to 1 signal a collinearity problem.

- Tolerance – the tolerance measures the influence of one independent variable on all other independent variables; it is calculated with an initial linear regression analysis in which that variable is regressed on the others. Tolerance is defined as T = 1 – R² for this first-step regression. With T < 0.1 there might be multicollinearity in the data, and with T < 0.01 there certainly is.

- Variance Inflation Factor (VIF) – the variance inflation factor of the linear regression is defined as VIF = 1/T. Similarly, with VIF > 10 there is an indication for multicollinearity to be present; with VIF > 100 there is certainly multicollinearity in the sample.

- Condition Index – the condition index is calculated using a factor analysis on the independent variables. Values of 10-30 indicate a mediocre multicollinearity in the linear regression variables, values > 30 indicate strong multicollinearity.
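To make the tolerance and VIF relationship concrete, here is a minimal Python sketch. For a single pair of predictors, the R² of one regressed on the other is simply their squared Pearson correlation, so tolerance (T = 1 - R²) and VIF (= 1/T) can be computed directly; the numbers below are synthetic, for illustration only:

```python
import math

def pearson_r(xs, ys):
    # Pearson's bivariate correlation coefficient
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Two strongly related predictors (x2 is roughly 2 * x1 plus small noise)
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2.1, 3.9, 6.2, 8.0, 9.8, 12.1]

r2 = pearson_r(x1, x2) ** 2   # R-squared of x1 regressed on x2
tolerance = 1 - r2            # T = 1 - R^2
vif = 1 / tolerance           # VIF = 1 / T

print(tolerance < 0.1, vif > 10)  # both flags indicate multicollinearity
```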

If multicollinearity is found in the data, centering the data (that is, deducting the mean score) might help to solve the problem. Another alternative is to conduct a factor analysis and rotate the factors to ensure independence of the factors in the linear regression analysis.

Fourthly, a linear regression analysis requires that there is little or no autocorrelation in the data. Autocorrelation occurs when the residuals are not independent from each other; in other words, when the value of y(x+1) is not independent of the value of y(x). This typically occurs in stock prices, where the price is not independent of the previous price.

While a scatterplot allows you to check for autocorrelations, you can test the linear regression model for autocorrelation with the Durbin-Watson test. Durbin-Watson’s d tests the null hypothesis that the residuals are not linearly auto-correlated. While d can assume values between 0 and 4, values around 2 indicate no autocorrelation. As a rule of thumb values of 1.5 < d < 2.5 show that there is no auto-correlation in the data, however, the Durbin-Watson test only analyses linear autocorrelation and only between direct neighbors, which are first order effects.
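The Durbin-Watson statistic itself is easy to compute by hand: d = sum((e_t - e_{t-1})^2) / sum(e_t^2). A small pure-Python sketch on synthetic residuals shows the two extremes:

```python
def durbin_watson(residuals):
    # d = sum of squared successive differences / sum of squared residuals
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

# Perfectly alternating residuals: extreme negative autocorrelation, d near 4
alternating = [1, -1] * 50
# Slowly drifting residuals: strong positive autocorrelation, d near 0
drifting = [t / 100 for t in range(100)]

print(durbin_watson(alternating))  # close to 4
print(durbin_watson(drifting))     # close to 0
```

Uncorrelated residuals would land near the middle of the scale, around d = 2.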

The last assumption the linear regression analysis makes is homoscedasticity. A scatter plot is a good way to check whether homoscedasticity (that is, equal error variance along the regression line) holds. If the data is heteroscedastic, the scatter plot typically shows a cone or fan shape, with the spread of the residuals growing along the predictor.

The Goldfeld-Quandt test can test for heteroscedasticity. The test splits the data into high and low values to see if the residual variances of the two samples are significantly different. If heteroscedasticity is present, a non-linear correction might fix the problem.
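The core idea of the Goldfeld-Quandt test can be sketched in pure Python: order the residuals by the suspect predictor, split the sample into low and high halves, and compare the residual variances with an F-ratio. This is an illustrative skeleton only; a full implementation (e.g., statsmodels' het_goldfeldquandt) also drops middle observations and computes a p-value from the F distribution:

```python
def gq_f_ratio(residuals):
    # residuals must already be ordered by the suspect predictor;
    # compare the residual variance of the high half vs the low half
    half = len(residuals) // 2
    low, high = residuals[:half], residuals[half:]

    def variance(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    return variance(high) / variance(low)

# Heteroscedastic pattern: residual spread grows with the predictor
resid = [(-1) ** t * (1 + t / 10) for t in range(40)]
print(gq_f_ratio(resid))  # well above 1, suggesting heteroscedasticity
```

For homoscedastic residuals the ratio stays close to 1.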

