# Different Ways Of Variable Reduction

Once upon a time, a teacher in a village asked his students to narrate the importance of Subhas Chandra Bose in India's fight for freedom. One student began with the names of Bose's parents, described his early life in detail, covered his marital status and much else, and only then came to his importance as a freedom fighter. As a result, he took 40 minutes to answer the question the teacher had actually asked. The teacher advised him that, after a short introduction, he should focus on the main topic and not waste time on irrelevant details.

The same possibility exists in the data science field, where we study the relationship between independent variables and a dependent variable. We should give less importance to unimportant independent variables and consider only those independent variables that genuinely affect the regression with the dependent variable, using the methods described below.

Basically, the variable reduction process can be done in two ways:

- **Feature selection**
- **Feature extraction**

In Feature selection, we discuss

- **backward elimination**
- **forward elimination**
- **bidirectional elimination**

and in Feature extraction, we discuss

- **Correlation analysis**
- **PCA**
- **Exploratory factor analysis**
- **Multicollinearity**
- **Linear discriminant analysis**
- **Wald chi-square method**

The variable reduction is a crucial step for accelerating model building without losing the potential predictive power of the data. With the advent of Big Data and sophisticated data mining techniques, the number of variables encountered is often tremendous making variable selection or dimension reduction techniques imperative to produce models with acceptable accuracy and generalization.

It may be noted that the following techniques need not be applied in the given order. Moreover, before attempting variable reduction we should first carry out a univariate analysis of each variable: check its frequency distribution and summary statistics and, most importantly, check for missing values.
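As a minimal sketch of such a univariate check in R (shown here on the built-in `mtcars` data set, used purely as a placeholder for one's own data frame):

```r
# Univariate screening before any variable reduction
df <- mtcars                # placeholder data set

summary(df)                 # distribution summary of each variable
sapply(df, sd)              # spread of each numeric variable
colSums(is.na(df))          # missing-value count per variable
```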

**Feature Selection**

In **BACKWARD ELIMINATION**, the regression starts with all independent variables, and the least important variable is eliminated at each step, working backward until only significant variables remain.

In **FORWARD ELIMINATION** (also called forward selection), the regression starts with no independent variables, and the most important variable is added at each step, working forward until no remaining variable improves the model.

In **BIDIRECTIONAL ELIMINATION** (stepwise selection), variables can be added or removed at each step, combining the forward and backward approaches.
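All three procedures can be automated in R with the built-in `step()` function, which adds or drops terms by AIC. The sketch below uses the `mtcars` data set rather than the article's data, so treat it as illustrative only:

```r
# Backward elimination: start from the full model, drop the weakest term each step
full <- lm(mpg ~ ., data = mtcars)
backward <- step(full, direction = "backward", trace = 0)

# Forward selection: start from the intercept-only model, add terms one at a time
null <- lm(mpg ~ 1, data = mtcars)
forward <- step(null, scope = formula(full), direction = "forward", trace = 0)

# Bidirectional (stepwise): terms may be added or removed at each step
both <- step(null, scope = formula(full), direction = "both", trace = 0)
```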

**FEATURE EXTRACTION**

In this case, we describe several methods for reducing the number of variables, taking care of dimension reduction as well. The first is **CORRELATION ANALYSIS**. Correlation measures the linear relationship between variables. Suppose we want to find the relationship between a hundred independent variables and one dependent variable.

For this, we create a correlation matrix. On the basis of the matrix, we take as explanatory variables those independent variables that are highly correlated with the dependent variable. The sign of the correlation coefficient indicates the direction of the association, and it always lies between -1 (perfect negative linear association) and 1 (perfect positive linear association). A value of r equal to zero indicates no linear relationship.
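This screening step can be sketched in R as follows; `mtcars` with `mpg` as the dependent variable and the 0.7 cutoff are arbitrary illustrations, not part of the article's data:

```r
# Correlation of every predictor with the dependent variable (mpg here)
cors <- cor(mtcars)[, "mpg"]
cors <- cors[names(cors) != "mpg"]

# Keep predictors whose absolute correlation exceeds a chosen cutoff
keep <- names(cors)[abs(cors) > 0.7]
keep
```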

Now we discuss the correlation between two independent variables. A high correlation coefficient (r) between two independent variables implies redundancy, indicating a possibility that they are measuring the same construct. In such a scenario, it is prudent either to retain only one of the two variables or to adopt an alternative approach based on the two most widely used techniques, viz. Principal Component Analysis (PCA) and Exploratory Factor Analysis.

Now we discuss the next approach to variable reduction: **PRINCIPAL COMPONENT ANALYSIS (PCA)**. Principal Component Analysis is a variable reduction procedure that obtains, from a large group of redundant (correlated) variables, a smaller number of new variables called principal components, which account for most of the variance in the observed variables.

Suppose that among 100 explanatory variables, 44 are highly correlated, and within those 44 the correlation between the 3rd and 5th variables is 0.87 while that between the 3rd and 8th variables is 0.85. Such correlations are highly significant, and in this case PCA is appropriate. Principal Component Analysis can be performed on a set of correlated variables to obtain new composite variables (principal components) that carry the information of all the variables in question.

Each principal component is a linear combination of the optimally weighted variables under consideration and can be used for subsequent analysis. One can compute as many principal components as there are independent variables, and then retain components on the basis of the variability they explain.
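The retention rule can be sketched in R with `prcomp()`; the `mtcars` predictors and the 90% variance cutoff below are assumptions for illustration:

```r
# PCA on standardized predictors; retain components covering ~90% of variance
pc <- prcomp(mtcars[, -1], center = TRUE, scale. = TRUE)
var_explained <- cumsum(pc$sdev^2) / sum(pc$sdev^2)
k <- which(var_explained >= 0.90)[1]   # smallest number of components needed
k
```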

Now we discuss another important variable reduction approach, **Exploratory Factor Analysis**. It is also a variable reduction procedure, similar to Principal Component Analysis in many respects; the underlying computations are much the same, but there are conceptual differences between the two methods, which are explained here.

Factor analysis is a statistical technique concerned with expressing a set of observed variables in terms of a small number of latent factors. The underlying assumption of factor analysis is that there exists a number of unobserved latent variables (or "factors") that account for the correlations among the observed variables, such that if the latent variables are partialled out or held constant, the partial correlations among the observed variables all become zero.

In other words, the latent factors determine the values of the observed variables. The term "common" in common factor analysis describes the variance that is analyzed. It is assumed that the variance of a single variable can be decomposed into common variance, which is shared with the other variables included in the model, and unique variance, which is unique to that variable and includes the error component. **Common factor analysis** analyzes only the common variance of the observed variables; principal component analysis considers the total variance and makes no distinction between common and unique variance. The selection of one technique over the other is based upon several criteria.
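A minimal exploratory factor analysis sketch in R uses the base `factanal()` function; the `mtcars` data and the choice of 2 factors are arbitrary assumptions for illustration:

```r
# Exploratory factor analysis: explain correlations among observed variables
# with a small number of latent common factors (2 here, chosen arbitrarily)
fa <- factanal(mtcars, factors = 2, rotation = "varimax")
fa$loadings   # how strongly each observed variable loads on each factor
```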

Next, we look at **MULTICOLLINEARITY**, which occurs when the independent variables are highly correlated among themselves. It is commonly diagnosed with the Variance Inflation Factor (VIF), which is what the `vif()` call in the code below computes.
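The VIF can be computed by hand, without any package, by regressing each predictor on the others: VIF = 1 / (1 - R²). The helper name `vif_manual` and the `mtcars` columns below are illustrative assumptions:

```r
# VIF for each predictor: regress it on the remaining predictors.
# Values well above ~5-10 are a common warning sign of multicollinearity.
vif_manual <- function(df) {
  sapply(names(df), function(v) {
    r2 <- summary(lm(reformulate(setdiff(names(df), v), response = v),
                     data = df))$r.squared
    1 / (1 - r2)
  })
}
vif_manual(mtcars[, c("wt", "hp", "disp")])
```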

Now we discuss another popular method of variable reduction, the **Wald Chi-Square** test. The Wald chi-square test statistic is the squared ratio of the coefficient estimate to its standard error for each predictor; predictors with small, insignificant Wald statistics are candidates for removal.
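The statistic can be computed directly from a fitted model's coefficient table; the logistic model on `mtcars` below is an arbitrary illustration, not the article's data:

```r
# Wald chi-square per predictor: (Estimate / Std.Error)^2
fit <- glm(am ~ wt + hp, data = mtcars, family = binomial)
est <- summary(fit)$coefficients
wald <- (est[, "Estimate"] / est[, "Std. Error"])^2
wald   # predictors with small Wald statistics are candidates for removal
```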

Now we discuss another method of variable reduction, **Linear Discriminant Analysis (LDA)**, which also works as a dimensionality reduction algorithm: it reduces the number of dimensions from the original number of features to at most C - 1, where C is the number of classes. For example, with 3 classes and 18 features, LDA reduces the 18 features to only 2. After the reduction, a neural network model can be applied to the classification task.
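The C - 1 rule is easy to see with `MASS::lda()` on the built-in `iris` data (an illustrative substitute for the article's data): C = 3 classes and 4 features yield 2 discriminants.

```r
# LDA projects the features onto at most C - 1 linear discriminants
library(MASS)
ld <- lda(Species ~ ., data = iris)
scores <- predict(ld)$x
dim(scores)   # 150 observations projected onto 2 discriminants
```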

**Executable Code:**

```r
library(faraway)
str(divusa)
df = data.frame(divusa[, 2:7])
head(df)
set.seed(1000)
round(cor(df), 2)
library(GGally)
library(caTools)
library(MASS)
split <- sample.split(rownames(df), SplitRatio = 0.80)
train <- df[split == T, ]; str(train)
test  <- df[split == F, ]; str(test)
ggcorr(df)
model <- lm(divorce ~ ., data = train)
summary(model)
vif(model)
```

Now we will work with the model:

```r
model1 <- lm(divorce ~ femlab + marriage + birth + military, data = train); summary(model1)
model2 <- lm(divorce ~ femlab + marriage + birth, data = train); summary(model2)
pc <- prcomp(train[, -1], center = T, scale. = T)
attributes(pc)
print(pc)
summary(pc)
library(MASS)
linear <- lda(divorce ~ ., data = train)
linear
attributes(linear)
```

```
library(faraway)
str(divusa)
'data.frame': 77 obs. of 7 variables:
 $ year      : int 1920 1921 1922 1923 1924 1925 1926 1927 1928 1929 ...
 $ divorce   : num 8 7.2 6.6 7.1 7.2 7.2 7.5 7.8 7.8 8 ...
 $ unemployed: num 5.2 11.7 6.7 2.4 5 3.2 1.8 3.3 4.2 3.2 ...
 $ femlab    : num 22.7 22.8 22.9 23 23.1 ...
 $ marriage  : num 92 83 79.7 85.2 80.3 79.2 78.7 77 74.1 75.5 ...
 $ birth     : num 118 120 111 110 111 ...
 $ military  : num 3.22 3.56 2.46 2.21 2.29 ...
```

```
df = data.frame(divusa[, 2:7])
head(df)
  divorce unemployed femlab marriage birth military
1     8.0        5.2  22.70     92.0 117.9   3.2247
2     7.2       11.7  22.79     83.0 119.8   3.5614
3     6.6        6.7  22.88     79.7 111.2   2.4553
4     7.1        2.4  22.97     85.2 110.5   2.2065
5     7.2        5.0  23.06     80.3 110.9   2.2889
6     7.2        3.2  23.15     79.2 106.6   2.1735
```

```
set.seed(1000)
round(cor(df), 2)
           divorce unemployed femlab marriage birth military
divorce       1.00      -0.21   0.91    -0.53 -0.72     0.02
unemployed   -0.21       1.00  -0.26    -0.27 -0.31    -0.40
femlab        0.91      -0.26   1.00    -0.65 -0.60     0.05
marriage     -0.53      -0.27  -0.65     1.00  0.67     0.26
birth        -0.72      -0.31  -0.60     0.67  1.00     0.14
military      0.02      -0.40   0.05     0.26  0.14     1.00
```

```
library(GGally)
ggcorr(df)
library(caTools)
library(MASS)
split <- sample.split(rownames(df), SplitRatio = 0.80)
train <- df[split == T, ]; str(train)
'data.frame': 61 obs. of 6 variables:
 $ divorce   : num 8 6.6 7.1 7.2 7.2 7.5 7.8 7.8 8 7.5 ...
 $ unemployed: num 5.2 6.7 2.4 5 3.2 1.8 3.3 4.2 3.2 8.7 ...
 $ femlab    : num 22.7 22.9 23 23.1 23.1 ...
 $ marriage  : num 92 79.7 85.2 80.3 79.2 78.7 77 74.1 75.5 67.6 ...
 $ birth     : num 118 111 110 111 107 ...
 $ military  : num 3.22 2.46 2.21 2.29 2.17 ...
```

```
test <- df[split == F, ]; str(test)
'data.frame': 16 obs. of 6 variables:
 $ divorce   : num 7.2 6.1 7.5 9.4 13.6 9.9 9.2 9.2 10 21.9 ...
 $ unemployed: num 11.7 24.9 21.7 9.9 3.9 2.9 4.3 5.5 5.2 6.1 ...
 $ femlab    : num 22.8 24.9 25.3 28.5 31.8 ...
 $ marriage  : num 83 61.3 71.8 88.5 106.2 ...
 $ birth     : num 119.8 76.3 78.5 83.4 113.3 ...
 $ military  : num 3.56 1.94 1.95 13.5 10.98 ...
ggcorr(df)
```

```
model <- lm(divorce ~ ., data = train)
summary(model)

Call:
lm(formula = divorce ~ ., data = train)

Residuals:
    Min      1Q  Median      3Q     Max
-3.7757 -0.8095  0.1004  0.8099  4.2183
```

```
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  4.33958    3.75962   1.154 0.253385
unemployed  -0.12820    0.06594  -1.944 0.057008 .
femlab       0.36660    0.03339  10.980 1.76e-15 ***
marriage     0.10869    0.02758   3.941 0.000231 ***
birth       -0.13537    0.01823  -7.426 7.55e-10 ***
military    -0.02294    0.01487  -1.543 0.128537
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.69 on 55 degrees of freedom
Multiple R-squared: 0.913,  Adjusted R-squared: 0.9051
F-statistic: 115.4 on 5 and 55 DF,  p-value: < 2.2e-16
```

```
vif(model)
unemployed     femlab   marriage      birth   military
  2.033954   3.072896   2.553974   2.411061   1.255901

model1 <- lm(divorce ~ femlab + marriage + birth + military, data = train); summary(model1)

Call:
lm(formula = divorce ~ femlab + marriage + birth + military, data = train)

Residuals:
    Min      1Q  Median      3Q     Max
-3.5737 -1.0960  0.0504  1.0498  3.7809
```

```
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.86188    2.70601  -0.319    0.751
femlab       0.40512    0.02753  14.714  < 2e-16 ***
marriage     0.12694    0.02657   4.777 1.32e-05 ***
birth       -0.11922    0.01662  -7.172 1.80e-09 ***
military    -0.01585    0.01477  -1.074    0.288
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.731 on 56 degrees of freedom
Multiple R-squared: 0.907,  Adjusted R-squared: 0.9004
F-statistic: 136.5 on 4 and 56 DF,  p-value: < 2.2e-16
```

```
model2 <- lm(divorce ~ femlab + marriage + birth, data = train); summary(model2)

Call:
lm(formula = divorce ~ femlab + marriage + birth, data = train)

Residuals:
    Min      1Q  Median      3Q     Max
-3.5830 -1.0663  0.0771  1.0841  4.0233

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.03003    2.57878   0.012    0.991
femlab       0.39637    0.02633  15.052  < 2e-16 ***
marriage     0.11726    0.02503   4.685 1.78e-05 ***
birth       -0.11982    0.01664  -7.203 1.46e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.733 on 57 degrees of freedom
Multiple R-squared: 0.9051,  Adjusted R-squared: 0.9001
F-statistic: 181.2 on 3 and 57 DF,  p-value: < 2.2e-16
```

```
pc <- prcomp(train[, -1], center = T, scale. = T)
attributes(pc)
$names
[1] "sdev"     "rotation" "center"   "scale"    "x"
$class
[1] "prcomp"

print(pc)
Standard deviations (1, .., p=5):
[1] 1.5317557 1.2139813 0.8226453 0.5723273 0.4191302

Rotation (n x k) = (5 x 5):
                  PC1         PC2         PC3         PC4          PC5
unemployed  0.2590132 -0.64154699  0.53135939 -0.13974986  0.468462139
femlab      0.4691847  0.49679795 -0.22686585  0.04723785  0.692356829
marriage   -0.5858718 -0.02705203  0.08701897  0.70051315  0.397154149
birth      -0.5688098 -0.07288045 -0.31653177 -0.65389192  0.378651419
military   -0.2144470  0.57928049  0.74727801 -0.24483719 -0.008770997
```

```
summary(pc)
Importance of components:
                          PC1    PC2    PC3     PC4     PC5
Standard deviation     1.5318 1.2140 0.8226 0.57233 0.41913
Proportion of Variance 0.4693 0.2948 0.1353 0.06551 0.03513
Cumulative Proportion  0.4693 0.7640 0.8993 0.96487 1.00000

library(MASS)
linear <- lda(divorce ~ ., data = train)
linear
Call:
lda(divorce ~ ., data = train)

Prior probabilities of groups:
       6.1        6.6        7.1        7.2        7.5        7.8          8        8.3        8.4        8.5        8.7        8.8        8.9        9.3
0.01639344 0.01639344 0.03278689 0.03278689 0.03278689 0.04918033 0.03278689 0.01639344 0.01639344 0.01639344 0.01639344 0.01639344 0.01639344 0.03278689
       9.4        9.5        9.6        9.9       10.1       10.3       10.6       10.9         11       11.2         12       12.4       13.4       14.4
0.03278689 0.01639344 0.03278689 0.01639344 0.03278689 0.01639344 0.03278689 0.01639344 0.01639344 0.03278689 0.01639344 0.01639344 0.01639344 0.01639344
      14.9       15.8         17       17.9       18.2       19.3       19.5       19.8       20.3       20.5       20.7       20.9       21.1       21.2
0.01639344 0.01639344 0.01639344 0.01639344 0.01639344 0.01639344 0.01639344 0.01639344 0.01639344 0.01639344 0.01639344 0.03278689 0.03278689 0.03278689
      21.5       22.6       22.8
0.01639344 0.03278689 0.01639344
```

From the entire discussion and example above, we can properly appreciate the importance of variable reduction and how to implement it on a real-life example. Sometimes, to predict a single variable, we take 80 to 90 independent variables; some of them mean the same thing, some are irrelevant for predicting the dependent variable, and some have no relationship with it at all. As a result, the prediction cannot be accurate and the expense becomes large. For all these reasons, reduction of variables is very necessary, and the different methods of reducing variables should be adopted carefully and stepwise.