The post Step By Step Guide To Time Series Analysis In R appeared first on StepUp Analytics.

A successful analysis of time series data can help professionals observe patterns and ensure the smooth functioning of the business. The most important part of time series analysis is forecasting, i.e., predicting future values from historical data. These predictions help in determining the future course of action and give an approximate idea of how the business will look a year from now.

Before we dive into the analysis of temporal data in R, let us understand the different components of time series data. These components are shown below in the figure:

**Trend Component:** The general tendency of the data to increase or decrease over a long period of time.

**Seasonal Component:** Variations in the time series that arise due to rhythmic forces operating over a span of less than a year (e.g., monthly or quarterly patterns).

**Cyclical Component:** The oscillatory movements in a time series that last for more than a year.

**Random Component:** Random or irregular variations or fluctuations which are not accounted for by trend, seasonality and cyclical components are defined as the random component. These are also called episodic fluctuations.
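These components are often combined in an additive model, Y_t = trend + seasonal + remainder, which is what **decompose()** extracts for an additive series. As a rough sketch of the idea (using a small hypothetical quarterly series, not the ausbeer data), the trend can be estimated with a centred moving average in a few lines of Python:

```python
# Additive decomposition sketch: Y_t = trend + seasonal + remainder.
# Hypothetical quarterly series: mild upward drift plus a repeating seasonal pattern.
series = [10, 14, 8, 12, 11, 15, 9, 13, 12, 16, 10, 14]

def centred_ma(y, period=4):
    """Centred moving average of even order `period` (the trend estimate)."""
    half = period // 2
    trend = [None] * len(y)  # ends stay undefined, as in decompose()
    for t in range(half, len(y) - half):
        # Average two overlapping windows -> a centred "2x4" MA for quarterly data
        w1 = sum(y[t - half:t + half]) / period
        w2 = sum(y[t - half + 1:t + half + 1]) / period
        trend[t] = (w1 + w2) / 2
    return trend

trend = centred_ma(series)
print(trend)  # first two and last two entries remain None
```

Subtracting this trend from the series and averaging what is left by quarter would give the seasonal component; whatever remains after removing both is the random component.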

Before we begin, make sure the “forecast” package is installed. We can easily observe these components once we use the **decompose()** function in R. For example, let us consider the ausbeer dataset from the ‘fpp’ package in R, subset to the period 1992–2006.

Plotting this time series:

> beer2 = window(ausbeer,start=1992,end=2006-.1)

> plot(beer2)

We can clearly see a seasonal pattern in the plot. Now we decompose this series in order to extract all the components:

library(ggfortify)

autoplot(decompose(beer2))

Before we proceed to the analysis of the time series data, we need to make a simplifying assumption in order to maintain the regularity of the time series: the stationarity assumption. This simply means that the mean and variance of the time series are constant over time, i.e., they are invariant over time. Also, the covariance between two time periods depends only on the gap between the two periods and not on the time at which it is computed.
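To see why differencing (used later for ARIMA) restores a constant mean, consider a toy series with a deterministic linear trend, not the ausbeer data:

```python
# A series with a linear trend: its mean changes over time (non-stationary in mean).
y = [2 * t + 3 for t in range(20)]

# First differences remove the linear trend entirely.
diff = [y[t] - y[t - 1] for t in range(1, len(y))]

print(set(diff))  # every first difference equals 2, i.e., a constant mean
```

A real series also has random noise on top, but the same logic applies: differencing removes the trend component so the mean no longer drifts.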

In order to verify the stationarity of a specific time series, we use the Augmented Dickey–Fuller (ADF) test.

The null hypothesis of this test is that the time series under study is not stationary, and we fail to reject H0 if the **p-value > 0.05**. Carrying out this test in R:

library(tseries)

adf.test(beer2)

The results obtained are as follows:

Hence, clearly, we reject the null hypothesis and conclude that the time series considered is stationary. In case it wasn’t stationary, we could have simply differenced it with the diff() function until it became stationary.

Now, once you have checked for stationarity, we can go on to the next stage: forecasting future observations. There are many forecasting methods, but in this blog we will discuss the following:

- The average method
- Naïve method
- Seasonal naïve method
- ARIMA
- SARIMA

**Average Method:** Under this method, the forecasted future values are just an average of the historical values. It can be easily executed in R using the following commands:

**beerfit1 = meanf(beer2, h=5)** # h = 5 means the forecast is made for the next 5 periods (quarters, for this data)
**plot(beerfit1)**

Here the blue line represents the forecasted values, and the dark and light grey areas represent the 80% and 95% prediction intervals respectively.

**Naïve Method:** As the name suggests, under this method the forecasted values are just equal to the last observation.

**beerfit2 = naive(beer2, h=5)**
**plot(beerfit2)**

**Seasonal Naïve Method:** We know that there is some seasonality in our data, so under this method each forecast is set equal to the last observed value from the same season (here, the same quarter of the previous year):

**beerfit3 = snaive(beer2, h=5)**
**plot(beerfit3)**
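The arithmetic behind these three benchmark methods is simple enough to sketch in pure Python (hypothetical quarterly values, not the ausbeer data):

```python
history = [443, 410, 420, 532, 433, 421, 410, 512]  # two years of quarterly data

h, m = 5, 4  # forecast horizon and seasonal period (quarterly)

# Average method: every forecast is the historical mean.
mean_fc = [sum(history) / len(history)] * h

# Naive method: every forecast equals the last observation.
naive_fc = [history[-1]] * h

# Seasonal naive: each forecast equals the value from the same quarter last year.
snaive_fc = [history[len(history) - m + (i % m)] for i in range(h)]

print(mean_fc[0], naive_fc[0], snaive_fc)
```

These benchmarks are deliberately simple: any serious model (like the ARIMA family below) should at least beat them on a hold-out set.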

**ARIMA:** Now coming to the most important and widely used method of forecasting. ARIMA modelling is based on the assumption that the underlying time series is stationary, or can be made stationary by differencing it one or more times. This is the ARIMA (p, d, q) model, where d denotes the number of times a time series has to be differenced to make it stationary. In most applications d = 1, i.e., we take only the first differences of the time series.

Of course, if a time series is already stationary, then an **ARIMA (p, d, q)** model becomes an **ARMA (p, q)** model.

Now let us break this abbreviation down into its parts:

**Auto Regression or Auto Regressive models:**

Y_t = c + φ₁Y_{t−1} + φ₂Y_{t−2} + … + φ_pY_{t−p} + u_t

The above model is called an autoregressive model of order p, AR(p), since it involves regressing Y at time t on its values lagged up to p periods into the past; the value of p can be read off the PACF plot.

The other class is Moving Average models:

X_t = μ + ε_t + θ₁ε_{t−1} + θ₂ε_{t−2} + … + θ_qε_{t−q}

The above model is called a moving average model of order q, MA(q), since we express X_t as a weighted (moving) average of the current and past white-noise error terms. Combining the AR part, the differencing, and the MA part gives **ARIMA(p, d, q)**.
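To make the autoregressive part concrete, a one-step point forecast from an AR(2) model is just a weighted sum of the last two observations plus a constant (the coefficients below are made up purely for illustration):

```python
# Hypothetical AR(2): Y_t = c + phi1*Y_{t-1} + phi2*Y_{t-2} + error
c, phi1, phi2 = 5.0, 0.6, 0.3

y = [50.0, 52.0, 51.0]  # observed series, most recent value last

# The one-step-ahead point forecast sets the future error to its mean, zero.
forecast = c + phi1 * y[-1] + phi2 * y[-2]
print(forecast)
```

An MA(q) forecast works the same way, except the weighted sum runs over the recent error terms instead of the recent observations.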

**Seasonal ARIMA**: When it comes to highly seasonal data, like our ausbeer time series, we use an extension of ARIMA models called SARIMA models, in which the regular non-seasonal ARIMA part is augmented with additional seasonal terms, written **ARIMA(p, d, q)(P, D, Q)m**, where m = number of observations per year. We use uppercase P, D, Q for the seasonal parts of the model. For quarterly data such as ausbeer, m = 4.

Now comes an important segment: the orders of the ARIMA and seasonal ARIMA models need to be determined. This can be done either by using the ACF and PACF plots, which help determine the orders of the MA and AR parts respectively, or by the more straightforward **auto.arima()** function from the forecast package, which automatically gives us the best model with the lowest AIC value.

But since I want this blog to be easily understood by newbies, we only consider the **auto.arima()** function for now, although this function shouldn’t be applied blindly to any time series. Another advantage of using this function is that it also selects the amount of differencing, including the seasonal orders, automatically.

**beerfit4 = auto.arima(beer2, stepwise = F, approximation = F, trace = T)**
**beerfit4**

This runs a number of iterations and checks the AIC value for different ARIMA orders, finally giving us the best model, i.e., the one with the lowest AIC value.

From the fitted model we can now forecast the next five quarters:

**fore <- forecast(beerfit4, h=5)**
**plot(fore)**

The last step in this analysis is to measure the forecasting accuracy of the models. This can be done with the accuracy() function in the forecast package, which gives a list of different accuracy measures that can be used to determine the best model fit. Here we compare the forecasted values from 2007 onwards with the actual values (beer3):

**beer3 = window(ausbeer, start=2007)**
**accuracy(beerfit1, beer3)**
**accuracy(beerfit2, beer3)**
**accuracy(beerfit3, beer3)**
**accuracy(fore, beer3)**
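The two headline measures reported by accuracy(), RMSE and MAE, can be computed by hand; a quick sketch with toy numbers (not the beer forecasts):

```python
import math

actual   = [428, 397, 416, 509]   # hypothetical hold-out values
forecast = [430, 400, 410, 500]   # hypothetical forecasts for the same quarters

errors = [f - a for f, a in zip(forecast, actual)]

# RMSE penalises large errors more heavily; MAE treats all errors equally.
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
mae  = sum(abs(e) for e in errors) / len(errors)
print(round(rmse, 3), mae)
```

Lower is better for both, which is why the model with the lowest RMSE on the hold-out period is chosen below.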

From the above results we clearly see that the lowest RMSE value, 12.161, is obtained for the seasonal ARIMA model. Hence, it is the best-fitting model.


The post Classical Normal Linear Regression Model (CNLRM) appeared first on StepUp Analytics.

In this article, we will discuss the details of the **C**lassical **N**ormal **L**inear **R**egression **M**odel (**CNLRM**). The method of ordinary least squares is attributed to Carl Friedrich Gauss, a German mathematician. Under certain assumptions, this method of estimation has some very attractive statistical properties that have made it one of the most powerful and popular methods of regression analysis.

The two-variable population regression function (PRF):

**Y_i = β1 + β2X_i + u_i**

However, the population function can’t be obtained directly, hence we estimate it with the help of the sample regression function (SRF):

**Ŷ_i = β̂1 + β̂2X_i**

where Ŷ_i is the estimated (conditional mean) value of Y_i.

The OLS estimates of β1 and β2 are obtained by minimising the sum of squared residuals, Σ(Y_i − β̂1 − β̂2X_i)². On differentiating partially with respect to β̂1 and β̂2 we obtain the following results:

**β̂2 = Σ(X_i − X̄)(Y_i − Ȳ) / Σ(X_i − X̄)²** and **β̂1 = Ȳ − β̂2X̄**

Thus we get the estimate of the population regression function as:

**Ŷ_i = β̂1 + β̂2X_i**

where X̄ and Ȳ denote the sample means of X and Y.
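These closed-form OLS estimates are easy to verify numerically; a minimal pure-Python sketch on a toy sample (made-up data, purely for illustration):

```python
# OLS slope and intercept from the closed-form solutions:
# b2 = sum((X - mean_X)*(Y - mean_Y)) / sum((X - mean_X)^2),  b1 = mean_Y - b2*mean_X
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]

mx = sum(X) / len(X)
my = sum(Y) / len(Y)

b2 = sum((x - mx) * (y - my) for x, y in zip(X, Y)) / sum((x - mx) ** 2 for x in X)
b1 = my - b2 * mx

print(b1, b2)
```

The fitted line Ŷ_i = b1 + b2·X_i always passes through the point of means (X̄, Ȳ), which follows directly from the formula for b1.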

**The Assumptions Underlying The Method Of Least Squares**

If the objective is only to estimate β1 and β2, the method of OLS discussed above suffices. But if the objective is to draw inferences about the true population parameters β1 and β2, then we have to look at the functional form of the Y_i’s, i.e., the functional form of the X_i’s and u_i’s. This is because the value of the population regression function, Y_i = β1 + β2X_i + u_i, depends on the X_i and the error terms.

Therefore, unless we specify how the X_i and u_i are created or generated, there is no way we can make any statistical inference about Y_i and also, as we shall see, about β1 and β2. Thus, the assumptions made about the X_i variables and the error terms are extremely critical to a valid interpretation of the regression estimates.

The Gaussian, standard, or classical linear regression model (CLRM), which is the cornerstone of most econometric theory, makes the following seven assumptions:

**Assumption 1:** The regression model is linear in the parameters.

**Assumption 2:** The values of the X_i’s are fixed, or the X values are independent of the error term.

**Assumption 3:** The mean of the error terms is zero.

**Assumption 4:** The variance of the error terms is constant; this assumption is also known as homoscedasticity.

**Assumption 5:** There is no autocorrelation between the error terms (or disturbances).

**Assumption 6:** The number of observations is greater than the number of parameters to be estimated.

**Assumption 7:** The X values in the given sample must not all be the same, i.e., the variance of the X_i’s is positive.

**NOTE**: Gauss-Markov Theorem:

Given the assumptions of the classical linear regression model, the least-squares estimators, in the class of unbiased linear estimators, have minimum variance; that is, they are BLUE (Best Linear Unbiased Estimators).

Using the method of OLS, under the assumptions of the classical linear regression model, we are able to estimate the population parameters β1 and β2 as β̂1 and β̂2. But since these estimators differ from sample to sample, they are random variables.

Since the estimators are random variables, we have to find their probability distributions.

**The Probability Distribution Of Disturbances (𝒖𝒊′𝒔)**

The X_i’s are assumed fixed, or nonstochastic, because ours is conditional regression analysis, conditional on the fixed values of X_i. Also,

**Y_i = β1 + β2X_i + u_i**

so the randomness in Y_i comes entirely from the disturbances u_i.

Hence β̂2 can be rewritten as

**β̂2 = Σk_iY_i**, where **k_i = (X_i − X̄) / Σ(X_i − X̄)²**

Since the betas and the k_i are fixed, the estimator β̂2 is ultimately a linear function of the random variable Y_i, and hence of the disturbances u_i.
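The weights k_i behind this linearity argument satisfy Σk_i = 0 and Σk_iX_i = 1 (which is what makes β̂2 unbiased under the CLRM assumptions), and this can be verified numerically on toy X values:

```python
X = [1, 2, 3, 4, 5]
mx = sum(X) / len(X)
sxx = sum((x - mx) ** 2 for x in X)

# k_i = (X_i - mean_X) / sum of squared deviations of X
k = [(x - mx) / sxx for x in X]

# The weights sum to zero, and their product with X sums to one.
print(sum(k), sum(ki * xi for ki, xi in zip(k, X)))
```

Substituting Y_i = β1 + β2X_i + u_i into β̂2 = Σk_iY_i and using these two identities gives β̂2 = β2 + Σk_iu_i, so the distribution of β̂2 is inherited entirely from the distribution of the u_i.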

Therefore, the probability distribution of the estimators depends on the assumptions made about the error terms. Since we need the probability distribution of the estimators to draw inferences about the population parameters, we have to make an assumption about the distribution of the error term.

Since OLS makes no assumption about the probabilistic nature of u_i, it is of little help for the purpose of drawing inferences about the population regression function from the sample regression function, the Gauss–Markov theorem notwithstanding.

This void can be filled if we are willing to assume that the 𝑢𝑖′𝑠 follow some probability distribution. For reasons to be explained shortly, in the regression context it is usually assumed that the 𝑢𝑖′𝑠 follow a normal distribution.

Thus, adding the normality assumption of the classical linear regression model (CLRM) discussed earlier, we obtained what is known as the **classical normal linear regression model (CNLRM)**.


The post Application of Linear Regression Via Python’s sklearn Library appeared first on StepUp Analytics.

Without wasting any further time, let’s dive in!

(If you are still confused between regression and classification tasks and need StepUp Analytics to do an article on it, just give us the order. We’ll be happy to help you out!)

In this article, we will cover:

- Introduction to Linear Regression
- The main purpose and its use case (problem domains)
- Basic hypothesis/mathematics behind the design of algorithm.
- Application of Linear Regression on a dataset via Python’s sklearn library
- Summary

Linear regression is an approach for linearly mapping the relationship between a scalar output variable (the dependent variable, also referred to as the *target variable*) and one or more input variables (the independent variables, also called *predictor variables*).

Now, you can infer from the number of variables that when the output is computed on the basis of a single input variable, the approach is known as **Simple Linear Regression**, and when two or more variables are accountable for the output, it is known as **Multiple Linear Regression**.

In this article, we will cover the Simple Linear Regression to get a gist of what is actually happening behind the name.

As you may already know, a machine learning task is either classification (predicting a class/label) or regression (predicting a real-valued quantity). So, whenever you come across a regression task and you see a somewhat linear pattern between the input and the output variables, you may require the Linear Regression model for your problem statement.

**The Core Purpose In Linear Regression Is To Obtain A Line That Best Fits The Data. **

Visualize the data problem as a graph where each point is the value of the variables at that instance. Each point (in blue) in the graph is known as a *data point*, and the fitted line (in red) is called the **Regression Line**.

So the task of the algorithm is to find the best-fit line, for which the actual values are as close to the line as possible. In other words, the task is to make the total prediction error as small as possible, where the error is the distance from a point to the regression line.

The problem statement which is covered in this model is like:

- Predicting weight from the height of a person (Simple Linear Regression)
- Predicting the price of a house on the basis of its age, number of bedrooms, square-ft area, etc. (Multiple Linear Regression)
- Predicting the height of a child on the basis of his father’s height (Simple Linear Regression)
- Predicting the market stock price of a company based on its past performance, and so on

Keep in mind that the input variables can’t be fed in as categorical values. In case you’re dealing with any, it is better that you convert them into numerical values before feeding them into the model.
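That conversion can be sketched in a couple of lines (the `colours` column here is hypothetical): either map each category to an integer code, or one-hot encode it into 0/1 indicator columns:

```python
colours = ["red", "green", "blue", "green"]  # a hypothetical categorical column

# Label encoding: assign each distinct category a stable integer code.
codes = {c: i for i, c in enumerate(sorted(set(colours)))}
encoded = [codes[c] for c in colours]

# One-hot encoding: one 0/1 indicator column per category.
categories = sorted(set(colours))
onehot = [[1 if c == cat else 0 for cat in categories] for c in colours]

print(encoded)
print(onehot[0])
```

One-hot encoding is usually the safer choice for linear models, since integer codes impose an artificial ordering (blue < green < red) that the data may not have.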

Since the beginning of this article, we have been using the word “*Linear*” which implicitly means a straight line. Just a little squeeze in your brain and the idea or basic algorithm behind the Linear Regression will automatically come to you. It will surely have the *straight line equation*, right? Let us see.

**Y = b₀ + b₁X₁ + b₂X₂ + b₃X₃ + …**

The above equation is used in the case of *Multiple Linear Regression* problems, and the same for *Simple Linear Regression* reduces to

**Y = b₀ + b₁X₁**

Here,

**Y** = output / target variable / response

**b₀** = bias coefficient (adds intercept-wise flexibility to the line)

**b₁** = coefficient/parameter related to X₁

**b** = coefficients/parameters learned and updated while training/fitting the model

Now, if you compare the equation with that of straight line then you can see the similarity between the two.

Until now, we discussed the hypothesis used in linear regression tasks, but after the regression line is fitted and an output is produced, we also need to evaluate it: we need to judge the model’s performance by comparing the actual outcomes with the outputs the model predicted. That is where evaluation and optimization techniques come in.

Out of the many techniques currently used, *least squares* is the easiest for beginners and is mostly used with Simple Linear Regression. It computes the distance between the actual value (plotted on the graph) and the value predicted by the model (i.e., the value on the regression line) and squares it. The aim of the model is then to minimize this total squared distance as much as it can. There are various other optimization algorithms, like gradient descent, which will be covered explicitly in further articles.

Mathematically, it is implemented using the formula below:

**J = (1/n) Σᵢ (Predᵢ − Yᵢ)²**

Where **n** = number of instances (or number of examples)

**Pred _{i}** = Predicted value for i

**Y _{i}** = Actual value for i

**J** = Cost Function (least squares in this case)
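Plugging toy numbers into this cost, J = (1/n)·Σ(Predᵢ − Yᵢ)² (the averaged squared-error form; some texts use a 1/2n factor instead):

```python
actual    = [3.0, 5.0, 7.0]   # hypothetical Y values
predicted = [2.5, 5.5, 6.0]   # hypothetical model outputs for the same points

n = len(actual)
# J = (1/n) * sum of squared prediction errors
J = sum((p - y) ** 2 for p, y in zip(predicted, actual)) / n
print(J)
```

Training the model means adjusting b₀ and b₁ so that this number gets as small as possible over the whole training set.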

The code to the notebook can be accessed via

https://gist.github.com/srajan-jha/edf6e4673da408151366aad6d4e1ef27

Sklearn is a Python library which has all the basic machine learning models already implemented in it. It also contains some classical datasets which you can play around with.

Sklearn makes the implementation of any machine learning model much easier; however, I’d suggest you try and code the model from scratch. That way you’d get a clearer picture of the model.

To get a detailed understanding of what actually is happening in the code, feel free to reach out to us in the comments below!

In this article, you read about what a linear model actually is and how it is divided into simple and multiple regression depending upon the number of variables involved.

Moreover, we tried to understand the problem domains where this algorithm might stand out, and I tried to give you all a basic understanding of the mathematics behind the name: Linear Regression. We then covered the need for optimization algorithms, went through the least squares method, and finished with a code implementation using the sklearn library. (Feel free to check out the linear regression code from sklearn.)

You must know that Linear Regression is the most basic algorithm and it might not perform well on every dataset. Look out for a linear relation between the inputs and outputs, and only move forward with this model if you think one exists.

At the end of the day, it all comes down to the accuracy of the model, so if you think Linear Regression isn’t performing well enough on your dataset, know that it’s just the beginning: there are a lot of models still waiting to be discovered by you! Stay tuned with us and get to know them all.


The post Churn Modelling for Mobile Telecommunications appeared first on StepUp Analytics.


Churn is one of the biggest threats to the telecommunications industry. Every telecom company deploys the model that best suits its needs to avoid the voluntary or involuntary churn of a customer. This is called churn modelling. Below I take you through the terms frequently used in building this model.

- Churn represents the loss of an existing customer to a competitor
- A prevalent problem in retail:
- Mobile phone services
- Home mortgage refinance
- Credit card

- Churn is a problem for any provider of a subscription service or recurring purchasable
- Costs of customer acquisition and win-back can be high
- Much cheaper to invest in customer retention
- Difficult to recoup costs of customer acquisition unless the customer is retained for a minimum length of time

- Churn is especially important to mobile phone service providers
- It is easy for a subscriber to switch services
- Phone number portability removes the last important obstacle

**Predicting Churn: Key to a Protective Strategy**

- Predictive modelling can assist churn management
- By tagging customers most likely to churn

- High risk customers should first be sorted by profitability
- Campaign targeted to the most profitable at-risk customers
- Typical retention campaigns include
- Incentives such as price breaks
- Special services available only to select customers

- To be cost-effective, retention campaigns must be targeted to the right customers
- Customers who would probably leave without the incentive
- Costly to offer incentives to those who would stay regardless

Here, we have a sample telecom dataset on which we will run churn modelling using R code.

library(rattle)        # The weather dataset and normVarNames().
library(randomForest)  # Impute missing values using na.roughfix().
library(rpart)         # Decision trees.
library(tidyr)         # Tidy the data set.
library(ggplot2)       # Visualize data.
library(dplyr)         # Data preparation and pipes %>%.
library(lubridate)     # Handle dates.
library(corrgram)      # Correlation matrix plots.

Loading data directly from the web

nm <- read.csv("http://www.sgi.com/tech/mlc/db/churn.names", skip=4, colClasses=c("character", "NULL"), header=FALSE, sep=":")[[1]]
dat <- read.csv("http://www.sgi.com/tech/mlc/db/churn.data", header=FALSE, col.names=c(nm, "Churn"))
nobs <- nrow(dat)
colnames(dat)
dsname <- "dat"
ds <- get(dsname)
dim(ds)
(vars <- names(ds))
target <- 'Churn'
ds$phone.number <- NULL
ds$churn <- (as.numeric(ds$Churn) - 1)
ds$Churn <- NULL
ds$state <- NULL
## Split ds into train and test: 75% of the sample size
smp_size <- floor(0.75 * nrow(ds))
## Set the seed to make the partition reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(ds)), size = smp_size)
train <- ds[train_ind, ]
test <- ds[-train_ind, ]
dim(train)
dim(test)
corrgram(train, lower.panel=panel.ellipse, upper.panel=panel.pie)

Fitting a Model

lm.fit <- lm(churn ~ ., data=train)  # Multiple R-squared: 0.1784, Adjusted R-squared: 0.1724

pred.lm.fit <- predict(lm.fit, test)
RMSE.lm.fit <- sqrt(mean((pred.lm.fit - test$churn)^2))
RMSE.lm.fit  # 0.3232695
# Building a simpler model with a similar R-squared
lm.fit.step <- lm(churn ~ international.plan + voice.mail.plan + total.day.charge + total.eve.minutes + total.night.charge + total.intl.calls + total.intl.charge + number.customer.service.calls, data=train)
# Multiple R-squared: 0.1767, Adjusted R-squared: 0.174
pred.lm.fit.step <- predict(lm.fit.step, test)
RMSE.lm.fit.step <- sqrt(mean((pred.lm.fit.step - test$churn)^2))
RMSE.lm.fit.step  # 0.3227848 <- simpler, and better RMSE

# Logistic regression using a generalized linear model
glm.step <- glm(churn ~ international.plan + voice.mail.plan + total.day.charge + total.eve.minutes + total.night.charge + total.intl.calls + total.intl.charge + number.customer.service.calls, family = binomial, data = train)
pred.glm.step <- predict.glm(glm.step, newdata = test, type = "response")
RMSE.glm.step <- sqrt(mean((pred.glm.step - test$churn)^2))
RMSE.glm.step  # 0.3179586 <- better than the linear model

# Build a decision tree based on the selected variables
rpart.fit.step <- rpart(churn ~ international.plan + voice.mail.plan + total.day.charge + total.eve.minutes + total.night.charge + total.intl.calls + total.intl.charge + number.customer.service.calls, data=train, method="class")
pred.rpart.step <- predict(rpart.fit.step, test)  # see correction below
RMSE.rpart.step <- sqrt(mean((pred.rpart.step - test$churn)^2))
RMSE.rpart.step  # 0.6742183 <- much worse; type="class" was forgotten
# Corrected: predict classes, then convert the factor to 0/1
pred.rpart.step <- as.numeric(predict(rpart.fit.step, test, type="class")) - 1
RMSE.rpart.step <- sqrt(mean((pred.rpart.step - test$churn)^2))
RMSE.rpart.step  # 0.2423902 <- better than the linear model
sum(pred.rpart.step == test$churn) / nrow(test)  # 0.941247: 94% of test cases correctly matched

# Build a random forest ensemble
set.seed(415)
rf.fit.step <- randomForest(as.factor(churn) ~ international.plan + voice.mail.plan + total.day.charge + total.eve.minutes + total.night.charge + total.intl.calls + total.intl.charge + number.customer.service.calls, data=train, importance=TRUE, ntree=2000)
varImpPlot(rf.fit.step)
pred.rf.fit.step <- as.numeric(predict(rf.fit.step, test)) - 1
RMSE.rf.fit.step <- sqrt(mean((pred.rf.fit.step - test$churn)^2))
RMSE.rf.fit.step  # 0.2217221 <- an improvement on the linear model, so a non-linear, decision-tree approach is better
sum(pred.rf.fit.step == test$churn) / nrow(test)  # 0.9508393: 95% of test cases correctly matched
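The two yardsticks used throughout this comparison, RMSE and classification accuracy, reduce to a few lines in any language; a Python sketch with made-up 0/1 labels and predictions (not the churn data):

```python
import math

actual = [0, 1, 0, 0, 1]   # hypothetical churn labels (1 = churned)
pred   = [0, 1, 1, 0, 1]   # hypothetical model predictions

# RMSE over 0/1 values: each wrong prediction contributes a squared error of 1.
rmse = math.sqrt(sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(actual))

# Accuracy: the fraction of predictions that match the actual label.
accuracy = sum(p == a for p, a in zip(pred, actual)) / len(actual)

print(round(rmse, 4), accuracy)
```

For 0/1 outcomes the two measures carry the same information (RMSE² equals the error rate, 1 − accuracy), which is why the table below can rank the models by RMSE alone.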

| Algorithm | RMSE | Comment |
|---|---|---|
| Linear Model | 0.3232695 | |
| Simpler Linear Model | 0.3227848 | simpler, and better RMSE |
| Logistic Regression | 0.3179586 | better than the linear model |
| Decision Tree (without type = "class") | 0.6742183 | much worse than the linear model |
| Decision Tree (with type = "class") | 0.2423902 | better than the linear model |
| Random Forest | 0.2217221 | improvement over the linear model, so a non-linear decision-tree approach is better |


The post Enabling Machine Learning Algorithm appeared first on StepUp Analytics.

Linear Regression

**Python Code**

#Import Library
#Import other necessary libraries like pandas, numpy...
from sklearn import linear_model
#Load Train and Test datasets
#Identify feature and response variable(s); values must be numeric and numpy arrays
x_train = input_variables_values_training_datasets
y_train = target_variables_values_training_datasets
x_test = input_variables_values_test_datasets
# Create linear regression object
linear = linear_model.LinearRegression()
# Train the model using the training sets and check score
linear.fit(x_train, y_train)
linear.score(x_train, y_train)
#Equation coefficient and Intercept
print('Coefficient: \n', linear.coef_)
print('Intercept: \n', linear.intercept_)
#Predict Output
predicted = linear.predict(x_test)

**R Code**

#Load Train and Test datasets
#Identify feature and response variable(s); values must be numeric
x_train <- input_variables_values_training_datasets
y_train <- target_variables_values_training_datasets
x_test <- input_variables_values_test_datasets
x <- cbind(x_train, y_train)
# Train the model using the training sets and check score
linear <- lm(y_train ~ ., data = x)
summary(linear)
#Predict Output
predicted <- predict(linear, x_test)

Logistic Regression

**Python Code**

#Import Library
from sklearn.linear_model import LogisticRegression
#Assumed you have X (predictor) and Y (target) for the training data set and x_test (predictor) of the test dataset
# Create logistic regression object
model = LogisticRegression()
# Train the model using the training sets and check score
model.fit(X, y)
model.score(X, y)
#Equation coefficient and Intercept
print('Coefficient: \n', model.coef_)
print('Intercept: \n', model.intercept_)
#Predict Output
predicted = model.predict(x_test)

**R Code**

x <- cbind(x_train, y_train)
# Train the model using the training sets and check score
logistic <- glm(y_train ~ ., data = x, family = 'binomial')
summary(logistic)
#Predict Output
predicted <- predict(logistic, x_test)

Decision Tree

**Python Code**

#Import Library
#Import other necessary libraries like pandas, numpy...
from sklearn import tree
#Assumed you have X (predictor) and Y (target) for the training data set and x_test (predictor) of the test dataset
# Create tree object
model = tree.DecisionTreeClassifier(criterion='gini') # for classification; the criterion can be gini or entropy (information gain), by default it is gini
# model = tree.DecisionTreeRegressor() for regression
# Train the model using the training sets and check score
model.fit(X, y)
model.score(X, y)
#Predict Output
predicted = model.predict(x_test)

**R Code**

library(rpart)
x <- cbind(x_train, y_train)
# Grow tree
fit <- rpart(y_train ~ ., data = x, method = "class")
summary(fit)
#Predict Output
predicted <- predict(fit, x_test)

SVM (Support Vector Machine)

**Python Code**

#Import Library
from sklearn import svm
#Assumed you have X (predictor) and Y (target) for the training data set and x_test (predictor) of the test dataset
# Create SVM classification object
model = svm.SVC() # there are various options associated with it; this is simple classification
# Train the model using the training sets and check score
model.fit(X, y)
model.score(X, y)
#Predict Output
predicted = model.predict(x_test)

**R Code**

library(e1071)
x <- cbind(x_train, y_train)
# Fitting model
fit <- svm(y_train ~ ., data = x)
summary(fit)
#Predict Output
predicted <- predict(fit, x_test)

Naive Bayes

**Python Code**

#Import Library
from sklearn.naive_bayes import GaussianNB
#Assumed you have X (predictor) and Y (target) for the training data set and x_test (predictor) of the test dataset
# Create Gaussian Naive Bayes object
model = GaussianNB() # there are other distributions for multinomial classes, like Bernoulli Naive Bayes; see http://scikit-learn.org/stable/modules/naive_bayes.html
# Train the model using the training sets and check score
model.fit(X, y)
#Predict Output
predicted = model.predict(x_test)

**R Code**

library(e1071)
x <- cbind(x_train, y_train)
# Fitting model
fit <- naiveBayes(y_train ~ ., data = x)
summary(fit)
#Predict Output
predicted <- predict(fit, x_test)

KNN (K-Nearest Neighbours)

**Python Code**

#Import Library
from sklearn.neighbors import KNeighborsClassifier
#Assumed you have X (predictor) and Y (target) for the training data set and x_test (predictor) of the test dataset
# Create KNeighbors classifier object
model = KNeighborsClassifier(n_neighbors=6) # default value for n_neighbors is 5
# Train the model using the training sets and check score
model.fit(X, y)
#Predict Output
predicted = model.predict(x_test)

**R Code**

library(class)
# knn() from the class package classifies the test set directly from the training set
predicted <- knn(train = x_train, test = x_test, cl = y_train, k = 5)
summary(predicted)

K-Means Clustering

**Python Code**

#Import Library
from sklearn.cluster import KMeans
#Assumed you have X (attributes) for the training data set and x_test (attributes) of the test dataset
# Create KMeans object
model = KMeans(n_clusters=3, random_state=0)
# Train the model using the training sets and check score
model.fit(X)
#Predict Output
predicted = model.predict(x_test)

**R Code**

library(cluster)
fit <- kmeans(X, 3) # 3 cluster solution

**Python Code**

#Import Library
from sklearn.ensemble import RandomForestClassifier
#Assumed you have X (predictor) and y (target) for the training data set and x_test (predictor) of the test data set
# Create Random Forest object
model = RandomForestClassifier()
# Train the model using the training sets
model.fit(X, y)
#Predict Output
predicted = model.predict(x_test)

**R Code**

library(randomForest)
x <- cbind(x_train, y_train)
# Fitting model (y_train is the response column of x)
fit <- randomForest(y_train ~ ., data = x, ntree = 500)
summary(fit)
#Predict Output
predicted <- predict(fit, x_test)

**Python Code**

#Import Library
from sklearn import decomposition
#Assumed you have training and test data sets as train and test
# Create PCA object
pca = decomposition.PCA(n_components=k) # default value of k = min(n_sample, n_features)
# For Factor analysis:
#fa = decomposition.FactorAnalysis()
# Reduce the dimension of the training dataset using PCA
train_reduced = pca.fit_transform(train)
# Reduce the dimension of the test dataset
test_reduced = pca.transform(test)

For more detail on this, please refer to http://scikit-learn.org/stable/modules/decomposition.html#decompositions

**R Code**

library(stats)
pca <- princomp(train, cor = TRUE)
train_reduced <- predict(pca, train)
test_reduced <- predict(pca, test)


**Python Code**

#Import Library
from sklearn.ensemble import GradientBoostingClassifier
#Assumed you have X (predictor) and y (target) for the training data set and x_test (predictor) of the test data set
# Create Gradient Boosting Classifier object
model = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=0)
# Train the model using the training sets
model.fit(X, y)
#Predict Output
predicted = model.predict(x_test)

**R Code**

library(caret)
x <- cbind(x_train, y_train)
# Fitting model
fitControl <- trainControl(method = "repeatedcv", number = 4, repeats = 4)
fit <- train(y ~ ., data = x, method = "gbm", trControl = fitControl, verbose = FALSE)
#Predict Output (probability of the second class)
predicted <- predict(fit, x_test, type = "prob")[,2]

Thanks

Team StepUpAnalytics

The post Enabling Machine Learning Algorithm appeared first on StepUp Analytics.

The post Linear Regression Analysis using R appeared first on StepUp Analytics.

In this post we will consider the case of simple linear regression, with one response variable and a single independent variable. For this example we will use data on pizza franchises.

The purpose of using this data is to determine whether there is a relationship, described by a simple linear regression model, between variables.

First, the steps:

- Open an R script.
- Set the working directory as you prefer.
- Load the dataset into R (.txt, .xlsx, .csv).

If the extension of your file is ‘.xlsx’ then

Use these commands

require(xlsx)

var1 <- read.xlsx("<filename with proper extension>", sheetIndex = <sheet number>)

# sheetIndex is the position of the Excel sheet where your data is stored; use sheetName instead to refer to the sheet by its title.

Now we are ready to build a linear regression model in R.

My R script

Now load the data

As you can see in the image, I first checked my working directory and then changed it, because the data files are stored in a different location.

I can set the working directory in two ways: 1) File -> Change dir, or 2) setwd("<path>").

In the following data

X = annual franchise fee ($1000)

Y = start-up cost ($1000)

for a pizza franchise

You can see the details in the image below.

In this image you can see that the R² value is quite low, which means the linear relationship between the two variables is weak.

**How good is your Regression model?**

- The R² value tells us how much of the variation in the response is explained by the model.
- The differences between the observations and the fitted values, which are not explained by the model, are the error terms or residuals.
- In the regression model above R² ≈ 0.22, so 22% of the variance of the dependent variable is explained by the model; the remaining 78%, which is not explained, is the error term or residual.
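The R²-as-explained-variance interpretation is easy to verify numerically. Here is a minimal Python sketch on synthetic (purely illustrative) data, fitting a least-squares line and computing R² from first principles:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)                  # hypothetical predictor
y = 2.0 + 0.5 * x + rng.normal(0, 3, 100)    # noisy response

# Ordinary least-squares fit of a straight line
slope, intercept = np.polyfit(x, y, 1)
fitted = intercept + slope * x
residuals = y - fitted                        # the part NOT explained by the model

# R^2 = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot
print(r_squared)
```

The fraction `r_squared` is the share of variance explained by the line, and `1 - r_squared` is the share left in the residuals, exactly the 22%/78% split described above.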


The post R FUNCTIONS FOR REGRESSION ANALYSIS appeared first on StepUp Analytics.

Here are some helpful R functions for regression analysis grouped by their goal. The name of the package is in parentheses.

__Linear model__

**Anova:** Anova Tables for Linear and Generalized Linear Models (car)

**anova:** Compute an analysis of variance table for one or more linear model fits (stats)

**coef:** A generic function which extracts model coefficients from objects returned by modelling functions; coefficients is an alias for it (stats)

**coeftest:** Testing Estimated Coefficients (lmtest)

**confint:** Computes confidence intervals for one or more parameters in a fitted model. The base package has a method for objects inheriting from class "lm" (stats)

**deviance:** Returns the deviance of a fitted model object (stats)

**effects:** Returns (orthogonal) effects from a fitted model, usually a linear model. This is a generic function, but currently only has methods for objects inheriting from classes "lm" and "glm" (stats)

**fitted:** A generic function which extracts fitted values from objects returned by modelling functions; fitted.values is an alias for it (stats)

**formula:** Provides a way of extracting formulae which have been included in other objects (stats)

**linear.hypothesis:** Test Linear Hypothesis (car)

**lm:** Used to fit linear models. It can be used to carry out regression, single stratum analysis of variance and analysis of covariance (stats)

**model.matrix:** Creates a design matrix (stats)

**predict:** Predicted values based on the linear model object (stats)

**residuals:** A generic function which extracts model residuals from objects returned by modelling functions (stats)

**summary.lm:** Summary method for class "lm" (stats)

**vcov:** Returns the variance-covariance matrix of the main parameters of a fitted model object (stats)

__Model – Variables selection__

**add1:** Compute all the single terms in the scope argument that can be added to or dropped from the model, fit those models and compute a table of the changes in fit (stats)

**AIC:** Generic function calculating the Akaike information criterion for one or several fitted model objects for which a log-likelihood value can be obtained, according to the formula -2*log-likelihood + k*npar, where npar represents the number of parameters in the fitted model, and k = 2 for the usual AIC, or k = log(n) (n the number of observations) for the so-called BIC or SBC (Schwarz’s Bayesian criterion) (stats)
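The criterion quoted above is simple enough to compute directly. A minimal Python sketch (the numbers below are illustrative, not from a fitted model):

```python
import math

def info_criterion(log_likelihood, n_params, k=2.0):
    """-2*log-likelihood + k*n_params: k=2 gives AIC, k=log(n) gives BIC/SBC."""
    return -2.0 * log_likelihood + k * n_params

# Hypothetical log-likelihood, parameter count and sample size
ll, p, n = -150.0, 4, 100
aic = info_criterion(ll, p)                  # -2*(-150) + 2*4 = 308.0
bic = info_criterion(ll, p, k=math.log(n))
print(aic, bic)
```

Because BIC penalizes each parameter by log(n) > 2 for n > 7, it always exceeds AIC for the same fit, which is why it tends to select smaller models.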

**Cpplot:** Cp plot (faraway)

**drop1:** Compute all the single terms in the scope argument that can be added to or dropped from the model, fit those models and compute a table of the changes in fit (stats)

**extractAIC:** Computes the (generalized) Akaike An Information Criterion for a fitted parametric model (stats)

**leaps:** Subset selection by 'leaps and bounds' (leaps)

**maxadjr:** Maximum Adjusted R-squared (faraway)

**offset:** An offset is a term to be added to a linear predictor, such as in a generalised linear model, with known coefficient 1 rather than an estimated coefficient (stats)

**step:** Select a formula-based model by AIC (stats)

**update.formula:** is used to update model formulae. This typically involves adding or dropping terms, but updates can be more general (stats)

__Diagnostics__

**cookd:** Cook’s Distances for Linear and Generalized Linear Models (car)

**cooks.distance:** Cook's distance (stats)

**covratio:** covariance ratio (stats)

**dfbeta:** DBETA (stats)

**dfbetas:** DBETAS (stats)

**dffits:** DFFTITS (stats)

**hat:** diagonal elements of the hat matrix (stats)

**hatvalues:** diagonal elements of the hat matrix (stats)

**influence.measures:** This suite of functions can be used to compute some of the regression (leave-one-out deletion) diagnostics for linear and generalized linear models (stats)

**lm.influence:** This function provides the basic quantities which are used in forming a wide variety of diagnostics for checking the quality of regression fits (stats)

**ls.diag:** Computes basic statistics, including standard errors, t- and p-values for the regression coefficients (stats)

**outlier.test:** Bonferroni Outlier Test (car)

**rstandard:** standardized residuals (stats)

**rstudent:** studentized residuals (stats)

**vif:** Variance Inflation Factor (car)
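The quantities behind `hatvalues` and `cooks.distance` can be reproduced from first principles. A small NumPy sketch on synthetic (illustrative) data, not tied to any R object:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 30
X = np.column_stack([np.ones(n), rng.uniform(0, 10, n)])  # intercept + one predictor
y = X @ np.array([1.0, 0.5]) + rng.normal(0, 1, n)
p = X.shape[1]

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix
h = np.diag(H)                         # leverages, what hatvalues() returns
s2 = resid @ resid / (n - p)           # residual variance estimate

# Cook's distance: influence of deleting observation i
cooks = resid**2 * h / (p * s2 * (1 - h) ** 2)
```

A handy sanity check: the leverages always sum to the number of parameters (here 2, the trace of the hat matrix).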


__Graphics__

**ceres.plots:** Ceres Plots (car)

**cr.plots:** Component+Residual (Partial Residual) Plots (car)

**influence.plot:** Regression Influence Plot (car)

**leverage.plots:** Regression Leverage Plots (car)

**panel.car:** Panel Function Coplots (car)

**plot.lm:** Four plots (selectable by which) are currently provided: a plot of residuals against fitted values, a Scale-Location plot of sqrt{| residuals |} against fitted values, a Normal Q-Q plot, and a plot of Cook’s distances versus row labels (stats)

**prplot:** Partial Residual Plot (faraway)

**qq.plot:** Quantile-Comparison Plots (car)

**qqline:** adds a line to a normal quantile-quantile plot which passes through the first and third quartiles (stats)

**qqnorm:** is a generic function the default method of which produces a normal QQ plot of the values in y (stats)

**reg.line:** Plot Regression Line (car)

**scatterplot.matrix:** Scatterplot Matrices (car)

**scatterplot:** Scatterplots with Boxplots (car)

**spread.level.plot:** Spread-Level Plots (car)

__Tests__

**ad.test:** Anderson-Darling test for normality (nortest)

**bartlett.test:** Performs Bartlett's test of the null that the variances in each of the groups (samples) are the same (stats)

**bgtest:** Breusch-Godfrey Test (lmtest)

**bptest:** Breusch-Pagan Test (lmtest)

**cvm.test:** Cramer-von Mises test for normality (nortest)

**durbin.watson:** Durbin-Watson Test for Autocorrelated Errors (car)

**dwtest:** Durbin-Watson Test (lmtest)

**levene.test:** Levene’s Test (car)

**lillie.test:** Lilliefors (Kolmogorov-Smirnov) test for normality (nortest)

**ncv.test:** Score Test for Non-Constant Error Variance (car)

**pearson.test:** Pearson chi-square test for normality (nortest)

**sf.test:** Shapiro-Francia test for normality (nortest)

**shapiro.test:** Performs the Shapiro-Wilk test of normality (stats)
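Several of these tests have direct analogues outside R; for instance, the Shapiro-Wilk test is available in Python as `scipy.stats.shapiro` (shown here on a synthetic normal sample):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.normal(size=200)        # data drawn from a normal distribution

stat, p_value = stats.shapiro(sample)
# A large p-value means no evidence against normality
print(stat, p_value)
```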

__Variables transformations__

**box.cox:** Box-Cox Family of Transformations (car)

**boxcox:** Box-Cox Transformations for Linear Models (MASS)

**box.cox.powers:** Multivariate Unconditional Box-Cox Transformations (car)

**box.tidwell:** Box-Tidwell Transformations (car)

**box.cox.var:** Constructed Variable for Box-Cox Transformation (car)

__Ridge regression__

**lm.ridge:** Ridge Regression (MASS)


__Segmented regression__

**segmented:** Segmented relationships in regression models (segmented)

**slope.segmented:** Summary for slopes of segmented relationships (segmented)

__Generalized Least Squares (GLS)__

**ACF.gls:** Autocorrelation Function for gls Residuals (nlme)

**anova.gls:** Compare Likelihoods of Fitted Objects (nlme)

**gls:** Fit Linear Model Using Generalized Least Squares (nlme)

**intervals.gls:** Confidence Intervals on gls Parameters (nlme)

**lm.gls:** fit Linear Models by Generalized Least Squares (MASS)

**plot.gls:** Plot a gls Object (nlme)

**predict.gls:** Predictions from a gls Object (nlme)

**qqnorm.gls:** Normal Plot of Residuals from a gls Object (nlme)

**residuals.gls:** Extract gls Residuals (nlme)

**summary.gls:** Summarize a gls Object (nlme)

__Generalized Linear Models (GLM)__

**family:** Family objects provide a convenient way to specify the details of the models used by functions such as glm (stats)

**glm.nb:** fit a Negative Binomial Generalized Linear Model (MASS)

**glm:** is used to fit generalized linear models, specified by giving a symbolic description of the linear predictor and a description of the error distribution (stats)

**polr:** Proportional Odds Logistic Regression (MASS)

__Non-linear Least Squares (NLS)__

**nlm:** This function carries out a minimization of the function f using a Newton-type algorithm (stats)

**nls:** Determine the nonlinear least-squares estimates of the nonlinear model parameters and return a class nls object (stats)

**nlscontrol:** Allow the user to set some characteristics of the nls nonlinear least squares algorithm (stats)

**nlsModel:** This is the constructor for nlsModel objects, which are function closures for several functions in a list. The closure includes a nonlinear model formula, data values for the formula, as well as parameters and their values (stats)

__Generalized Non-linear Least Squares (GNLS)__

**coef.gnls:** Extract gnls Coefficients (nlme)

**gnls:** Fit Nonlinear Model Using Generalized Least Squares (nlme)

**predict.gnls:** Predictions from a gnls Object (nlme)

__Loess regression__

**loess:** Fit a polynomial surface determined by one or more numerical predictors, using local fitting (stats)

**loess.control:** Set control parameters for loess fits (stats)

**predict.loess:** Predictions from a loess fit, optionally with standard errors (stats)

**scatter.smooth:** Plot and add a smooth curve computed by loess to a scatter plot (stats)

__Splines regression__

**bs:** B-Spline Basis for Polynomial Splines (splines)

**ns:** Generate a Basis Matrix for Natural Cubic Splines (splines)

**periodicSpline:** Create a Periodic Interpolation Spline (splines)

**polySpline:** Piecewise Polynomial Spline Representation (splines)

**predict.bSpline:** Evaluate a Spline at New Values of x (splines)

**predict.bs:** Evaluate a Spline Basis (splines)

**splineDesign:** Design Matrix for B-splines (splines)

**splineKnots:** Knot Vector from a Spline (splines)

**splineOrder:** Determine the Order of a Spline (splines)

__Robust regression__

**lqs:** Resistant Regression (MASS)

**rlm:** Robust Fitting of Linear Models (MASS)


__Structural equation models__

**sem:** General Structural Equation Models (sem)

**tsls:** Two-Stage Least Squares (sem)


__Simultaneous Equation Estimation__

**systemfit:** Fits a set of linear structural equations using Ordinary Least Squares (OLS), Weighted Least Squares (WLS), Seemingly Unrelated Regression (SUR), TwoStage Least Squares (2SLS), Weighted Two-Stage Least Squares (W2SLS) or Three-Stage Least Squares (3SLS) (systemfit)

__Partial Least Squares Regression (PLSR) and Principal Component Regression (PCR)__

**biplot.mvr:** Biplots of PLSR and PCR Models (pls)

**coefplot:** Plot Regression Coefficients of PLSR and PCR models (pls)

**crossval:** Cross-validation of PLSR and PCR models (pls)

**cvsegments:** Generate segments for cross-validation (pls)

**kernelpls.fit:** Kernel PLS (Dayal and MacGregor) (pls)

**msc:** Multiplicative Scatter Correction (pls)

**mvr:** Partial Least Squares and Principal Components Regression (pls)

**mvrCv:** Cross-validation (pls)

**oscorespls.fit:** Orthogonal scores PLSR (pls)

**predplot:** Prediction Plots (pls)

**scoreplot:** Plots of Scores and Loadings (pls)

**scores:** Extract Scores and Loadings from PLSR and PCR Models (pls)

**svdpc.fit:** Principal Components Regression (pls)

**validationplot:** Validation Plots (pls)

__Quantile regression__

**anova.rq:** Anova function for quantile regression fits (quantreg)

**boot.rq:** Bootstrapping Quantile Regression (quantreg)

**lprq:** locally polynomial quantile regression (quantreg)

**nlrq:** Function to compute nonlinear quantile regression estimates (quantreg)

**qss:** Additive Nonparametric Terms for rqss Fitting (quantreg)

**ranks:** Quantile Regression Ranks (quantreg)

**rq:** Quantile Regression (quantreg)

**rqss:** Additive Quantile Regression Smoothing (quantreg)

**rrs.test:** Quantile Regression Rankscore Test (quantreg)

**standardize:** Function to standardize the quantile regression process (quantreg)

__Linear and nonlinear mixed effects models__

**ACF:** Autocorrelation Function (nlme)

**ACF.lme:** Autocorrelation Function for lme Residuals (nlme)

**anova.lme:** compare Likelihoods of Fitted Objects (nlme)

**fitted.lme:** Extract lme Fitted Values (nlme)

**fixed.effects:** Extract Fixed Effects (nlme)

**intervals:** Confidence Intervals on Coefficients (nlme)

**intervals.lme:** Confidence Intervals on lme Parameters (nlme)

**lme:** Linear Mixed-Effects Models (nlme)

**nlme:** Nonlinear Mixed-Effects Models (nlme)

**predict.lme:** Predictions from an lme Object (nlme)

**predict.nlme:** Predictions from an nlme Obj (nlme)

**qqnorm.lme:** Normal Plot of Residuals or Random Effects from an lme object (nlme)

**random.effects:** Extract Random Effects (nlme)

**ranef.lme:** Extract lme Random Effects (nlme)

**residuals.lme:** Extract lme Residuals (nlme)

**simulate.lme:** simulate lme models (nlme)

**summary.lme:** Summarize an lme Object (nlme)

**glmmPQL:** fit Generalized Linear Mixed Models via PQL (MASS)

__Generalized Additive Model (GAM)__

**anova.gam:** compare the fits of a number of gam models (gam)

**gam.control:** control parameters for fitting gam models (gam)

**gam:** Fit a generalized additive model (gam)

**na.gam.replace:** a missing value method that is helpful with gams (gam)

**plot.gam:** an interactive plotting function for gams (gam)

**predict.gam:** make predictions from a gam object (gam)

**preplot.gam:** extracts the components from a gam in a plot-ready form (gam)

**step.gam:** stepwise model search with gam (gam)

**summary.gam:** summary method for gam (gam)

__Survival analysis__

**anova.survreg:** ANOVA tables for survreg objects (survival)

**clogit:** Conditional logistic regression (survival)

**cox.zph:** Test the proportional hazards assumption of a Cox regression (survival)

**coxph:** Proportional Hazards Regression (survival)

**coxph.detail:** Details of a cox model fit (survival)

**coxph.rvar:** Robust variance for a Cox model (survival)

**ridge:** ridge regression (survival)

**survdiff:** Test Survival Curve Differences (survival)

**survexp:** Compute Expected Survival (survival)

**survfit:** Compute a survival Curve for Censored Data (survival)

**survreg:** Regression for a parametric survival model (survival)

__Classification and Regression Trees__

**cv.tree:** Cross-validation for Choosing tree Complexity (tree)

**deviance.tree:** Extract Deviance from a tree Object (tree)

**labels.rpart:** Create Split Labels for an rpart Object (rpart)

**meanvar.rpart:** Mean-Variance Plot for an rpart Object (rpart)

**misclass.tree:** Misclassifications by a Classification tree (tree)

**na.rpart:** Handles Missing Values in an rpart Object (rpart)

**partition.tree:** Plot the Partitions of a simple Tree Model (tree)

**path.rpart:** Follow Paths to Selected Nodes of an rpart Object (rpart)

**plotcp:** Plot a Complexity Parameter Table for an rpart Fit (rpart)

**printcp:** Displays CP table for Fitted rpart Object (rpart)

**prune.misclass:** Cost-complexity Pruning of Tree by error rate (tree)

**prune.rpart:** Cost-complexity Pruning of an rpart Object (rpart)

**prune.tree:** Cost-complexity Pruning of tree Object (tree)

**rpart:** Recursive Partitioning and Regression Trees (rpart)

**rpconvert:** Update an rpart object (rpart)

**rsq.rpart:** Plots the Approximate R-Square for the Different Splits (rpart)

**snip.rpart:** Snip Subtrees of an rpart Object (rpart)

**solder:** Soldering of Components on Printed-Circuit Boards (rpart)

**text.tree:** Annotate a Tree Plot (tree)

**tile.tree:** Add Class Barplots to a Classification Tree Plot (tree)

**tree.control:** Select Parameters for Tree (tree)

**tree.screens:** Split Screen for Plotting Trees (tree)

**tree:** Fit a Classification or Regression Tree (tree)

__Beta regression__

**betareg:** Fitting beta regression models (betareg)

**plot.betareg:** Plot Diagnostics for a betareg Object (betareg)

**predict.betareg:** Predicted values from beta regression model (betareg)

**residuals.betareg:** Residuals function for beta regression models (betareg)

**summary.betareg:** Summary method for Beta Regression (betareg)


The post Implementation Of Classification Algorithms appeared first on StepUp Analytics.

- Title: Car Evaluation Database
- Relevant Information Paragraph

Car Evaluation Database was derived from a simple hierarchical decision model originally developed for the demonstration of DEX (M. Bohanec, V. Rajkovic: Expert system for decision making. Sistemica 1(1), pp. 145-157, 1990). The model evaluates cars according to the following concept structure:

- CAR: car acceptability
- PRICE: overall price
- buying: buying price
- maint: price of the maintenance
- TECH: technical characteristics
- COMFORT: comfort
- doors: number of doors
- persons: capacity in terms of persons to carry
- lug_boot: the size of luggage boot
- safety: estimated safety of the car

Input attributes are printed in lowercase. Besides the target concept (CAR), the model includes three intermediate concepts: PRICE, TECH, and COMFORT. In the original model, every concept is related to its lower-level descendants by a set of examples (for these example sets, see

http://www-ai.ijs.si/BlazZupan/car.html).

The Car Evaluation Database contains examples with the structural information removed, i.e., directly relates CAR to the six input attributes: buying, maint, doors, persons, lug_boot, and safety. Because of the known underlying concept structure, this database may be particularly useful for testing constructive induction and structure discovery methods.

- Number of Instances: 1728 (instances completely cover the attribute space)
- Number of Attributes: 6
- Attribute Values:

buying: v-high, high, med, low

maint: v-high, high, med, low

doors: 2, 3, 4, 5-more

persons: 2, 4, more

lug_boot: small, med, big

safety: low, med, high

- Missing Attribute Values: none
- Class Distribution (number of instances per class)

**1. Load the data**

car_eval <- read.csv("Path", header = FALSE)
colnames(car_eval) <- c("buying","maint","doors","persons","lug_boot","safety","class")
head(car_eval)

## buying maint doors persons lug_boot safety class

## 1 vhigh vhigh 2 2 small low unacc

## 2 vhigh vhigh 2 2 small med unacc

## 3 vhigh vhigh 2 2 small high unacc

## 4 vhigh vhigh 2 2 med low unacc

## 5 vhigh vhigh 2 2 med med unacc

## 6 vhigh vhigh 2 2 med high unacc

**2. Exploratory Data Analysis**

summary(car_eval)

## buying maint doors persons lug_boot safety
## high :432 high :432 2 :432 2 :576 big :576 high:576
## low :432 low :432 3 :432 4 :576 med :576 low :576
## med :432 med :432 4 :432 more:576 small:576 med :576
## vhigh:432 vhigh:432 5more:432
## class
## acc : 384
## good : 69
## unacc:1210
## vgood: 65

str(car_eval)

## 'data.frame': 1728 obs. of 7 variables:
## $ buying : Factor w/ 4 levels "high","low","med",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ maint : Factor w/ 4 levels "high","low","med",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ doors : Factor w/ 4 levels "2","3","4","5more": 1 1 1 1 1 1 1 1 1 1 ...
## $ persons : Factor w/ 3 levels "2","4","more": 1 1 1 1 1 1 1 1 1 2 ...
## $ lug_boot: Factor w/ 3 levels "big","med","small": 3 3 3 2 2 2 1 1 1 3 ...
## $ safety : Factor w/ 3 levels "high","low","med": 2 3 1 2 3 1 2 3 1 2 ...
## $ class : Factor w/ 4 levels "acc","good","unacc",..: 3 3 3 3 3 3 3 3 3 3 ...

**3. Classification Analysis**

library(VGAM)

# Build the model (multinomial is the VGAM family function for multinomial logistic regression)
model1 <- vglm(class ~ buying + maint + doors + persons + lug_boot + safety, family = multinomial, data = car_eval)

# Summarize the model
summary(model1)

# Predict using the model
x <- car_eval[, 1:6]
y <- car_eval[, 7]

probability <- predict(model1, x, type = "response")
car_eval$pred_log_reg <- apply(probability, 1, which.max)
car_eval$pred_log_reg[which(car_eval$pred_log_reg == "1")] <- levels(car_eval$class)[1]
car_eval$pred_log_reg[which(car_eval$pred_log_reg == "2")] <- levels(car_eval$class)[2]
car_eval$pred_log_reg[which(car_eval$pred_log_reg == "3")] <- levels(car_eval$class)[3]
car_eval$pred_log_reg[which(car_eval$pred_log_reg == "4")] <- levels(car_eval$class)[4]

# Accuracy of the model
mtab <- table(car_eval$pred_log_reg, car_eval$class)
library(caret)
confusionMatrix(mtab)
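For comparison, the same fit-predict-confusion-matrix workflow can be sketched in Python with scikit-learn. This is only a sketch on synthetic categorical data (not the car dataset): the target here is a made-up rule over two illustrative attributes, one-hot encoded the way a model formula expands factors:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
# Two illustrative categorical attributes and a 3-class target derived from them
X_cat = rng.choice(["low", "med", "high"], size=(300, 2))
y = np.where(X_cat[:, 0] == "high", "good",
             np.where(X_cat[:, 1] == "low", "unacc", "acc"))

# One-hot encode the factors, then fit a multinomial logistic regression
X = OneHotEncoder().fit_transform(X_cat)
model = LogisticRegression(max_iter=1000)
model.fit(X, y)
pred = model.predict(X)

print(confusion_matrix(y, pred))   # rows: true class, columns: predicted class
print(accuracy_score(y, pred))
```

Because the synthetic target is a deterministic rule of the encoded attributes, the in-sample accuracy should be high; on real data you would of course evaluate on a held-out test set instead.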

