Feature Engineering And Variable Transformation
What is Feature Engineering?
Why do some machine learning projects succeed while others fail? What makes the difference? Easily the most important factor is the features used (that is, the features produced by feature engineering). If you have many independent features that each correlate well with the class, learning is easy.
On the other hand, if the class is a very complex function of the features, you may not be able to learn it. Raw data is often not in a form that a model can use directly, which is why feature engineering matters. It is one of the most important phases of a machine learning project, and one where intuition and creativity count as much as domain knowledge.
“Feature engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data”.
Feature engineering is the process of extracting more information from the dataset we are already using in a project.
In feature engineering, we do not add data from outside the existing dataset. Instead, we create new, helpful features from the data we already have.
Importance of Feature Engineering: The features in your data will directly influence the predictive models you use and the results you can achieve.
Put simply: the better the features you prepare and choose, the better the results you will achieve.
What Is the Role of Better Features?
- Better features mean flexibility
- Better features mean simpler models
- Better features mean better results.
Note: In this blog, I use the Titanic dataset and apply each of these feature engineering operations to it in the implementation part.
The steps involved in solving a machine learning problem are:
- Gathering data
- Cleaning data
- Feature engineering
- Defining the model
- Training and testing the model, and predicting the output
The Process of Feature Engineering
Feature engineering has two parts:
- Variable transformation
- Variable / Feature creation
Transformation refers to replacing a variable with a function of that variable. For instance, replacing a variable x with its square root, cube root, or logarithm is a transformation. In other words, a transformation changes the distribution of a variable or its relationship with other variables.
When should we use Variable Transformation?
Change the scale:
Use a transformation when you want to change the scale of a variable (feature) or standardize its values, for example when the variables in a dataset are on different scales. This kind of transformation does not change the shape of the variable's distribution.
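As a minimal sketch of this point, the snippet below standardizes and min-max scales one variable and checks that its skewness (a measure of shape) stays the same. The fare values are made up for illustration and are not taken from the Titanic data.

```python
import pandas as pd

# Hypothetical fare values, chosen only for illustration.
fare = pd.Series([7.25, 8.05, 13.0, 26.55, 71.28, 512.33])

# Standardize: subtract the mean, divide by the standard deviation.
fare_std = (fare - fare.mean()) / fare.std()

# Min-max scaling: map the values onto the range [0, 1].
fare_minmax = (fare - fare.min()) / (fare.max() - fare.min())

# Both are linear rescalings, so the skewness is the same for all three.
print(round(fare.skew(), 4), round(fare_std.skew(), 4), round(fare_minmax.skew(), 4))
```

Only the units change here; a variable that was skewed before scaling is just as skewed afterwards.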
Transform complex non-linear relationships into linear relationships:
A linear relationship between two variables is much easier to fit with a good model than a non-linear (curved) one. A transformation can convert a non-linear relationship into a linear one. Below you can find two scatter plots; here I used the log transformation, which converted the curved relation into a linear relation.
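The same idea can be sketched numerically. The data below is synthetic (an assumption for illustration): y grows exponentially with x, so the raw relation is curved, while the relation between x and log(y) is a straight line.

```python
import numpy as np

# Synthetic data (an assumption for illustration): y grows exponentially with x.
x = np.linspace(1, 10, 50)
y = 2.0 * np.exp(0.5 * x)

# Pearson correlation measures how close the relation is to a straight line.
r_raw = np.corrcoef(x, y)[0, 1]
r_log = np.corrcoef(x, np.log(y))[0, 1]

print(f"correlation of x with y:      {r_raw:.3f}")
print(f"correlation of x with log(y): {r_log:.3f}")
```

Real data will not line up this perfectly, but a clear rise in correlation after the transformation is a sign that it is helping.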
Symmetric distribution is preferred over skewed (asymmetric) distribution: If a variable in the dataset has a skewed distribution of values, we can use transformations that reduce the skewness.
There are two types of skewed distribution:
- Left-skewed distribution: For left-skewed variables, we take the square, cube, or exponential of the variable.
- Right-skewed distribution: For right-skewed variables, we take the square root, cube root, or logarithm of the variable.
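A rough sketch of both cases, using synthetic samples (the lognormal and beta distributions are illustrative assumptions, not part of the Titanic data) and the pandas `skew()` method, where positive values mean right skew and negative values mean left skew:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic samples (assumptions for illustration only).
right_skewed = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=5000))  # long right tail
left_skewed = pd.Series(rng.beta(5, 1, size=5000))                       # long left tail

# Logarithm for the right-skewed sample, square for the left-skewed one.
right_logged = np.log(right_skewed)
left_squared = left_skewed ** 2

print("right-skewed, raw:    ", round(right_skewed.skew(), 2))
print("right-skewed, log:    ", round(right_logged.skew(), 2))
print("left-skewed, raw:     ", round(left_skewed.skew(), 2))
print("left-skewed, squared: ", round(left_squared.skew(), 2))
```

In both cases the skewness moves toward zero after the transformation, which is what a more symmetric distribution looks like.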
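As a reminder: a left-skewed distribution has its long tail on the left (toward smaller values), and a right-skewed distribution has its long tail on the right (toward larger values).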
Implementation point of view:
Which transformation to use depends on how you read the visualizations and on the dataset itself. This will become clearer with the Titanic dataset.
For example, the Titanic dataset has a feature called “age”. Initially it was a continuous feature, which was not well suited to fitting the best model, so I converted it into a categorical feature. You will get more clarity in the implementation part of feature engineering.
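A minimal sketch of that conversion with `pd.cut` is shown here. The ages, bin edges, and labels are illustrative assumptions and not necessarily the ones used in the notebook.

```python
import pandas as pd

# Hypothetical ages; the bin edges and labels are illustrative choices.
df = pd.DataFrame({"Age": [2, 15, 22, 38, 54, 71]})

bins = [0, 12, 18, 60, 120]
labels = ["child", "teen", "adult", "senior"]

# pd.cut assigns each age to the interval it falls in: (0, 12], (12, 18], ...
df["AgeGroup"] = pd.cut(df["Age"], bins=bins, labels=labels)
print(df)
```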
Common Methods of Variable Transformation:
There are various methods for transforming a variable; here I cover some important ones that are commonly used by data scientists.
Below I discuss some transformation methods in detail, along with their pros and cons.
Square root: The square root, x to x^(1/2) = sqrt(x), is a transformation with a moderate effect on distribution shape; it is weaker than the logarithm and the cube root. It is used for reducing right skewness, and has the advantage that it can be applied to zero values. Note that the square root of an area has the units of a length. It is commonly applied to counted data, especially if the values are mostly rather small.
Cube root: The cube root, x to x^(1/3), is a fairly strong transformation with a substantial effect on distribution shape, though it is weaker than the logarithm. It is also used for reducing right skewness and has the advantage that it can be applied to zero and negative values. Note that the cube root of a volume has the units of a length. It is commonly applied to rainfall data.
Logarithm: The logarithm, x to log10(x), or x to log_e(x), also written ln(x), is a strong transformation with a major effect on distribution shape. It is commonly used for reducing right skewness and is often appropriate for measured variables. It cannot be applied to zero or negative values. One unit on a logarithmic scale means a multiplication by the base of logarithms being used. Exponential growth or decline, y = a*exp(b*x), is made linear by taking logarithms: ln(y) = ln(a) + b*x.
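To compare the relative strength of these three transformations, here is a small sketch on a synthetic right-skewed sample (the lognormal data is an assumption for illustration). It also shows the domain differences described above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic right-skewed, strictly positive sample (an assumption for illustration).
x = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=5000))

skews = {
    "raw": x.skew(),
    "square root": np.sqrt(x).skew(),
    "cube root": np.cbrt(x).skew(),
    "logarithm": np.log(x).skew(),
}
for name, value in skews.items():
    print(f"{name:12s} skewness = {value:.2f}")

# Domain differences: sqrt accepts zero, cbrt also accepts negative values,
# while log is undefined for zero and negative values.
sqrt_of_zero = np.sqrt(0.0)
cbrt_of_negative = np.cbrt(-8.0)
print(sqrt_of_zero, cbrt_of_negative)
```

On this sample the skewness shrinks step by step from the raw values to the square root, the cube root, and the logarithm, matching the ordering of strength given above.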
Reciprocal: The reciprocal, x to 1/x, with its sibling the negative reciprocal, x to -1/x, is a very strong transformation with a drastic effect on distribution shape. It cannot be applied to zero values. Although it can be applied to negative values, it is not useful unless all values are positive. The reciprocal of a ratio may often be interpreted as easily as the ratio itself. For example:
- Population density (people per unit area) becomes area per person
- Persons per doctor becomes doctors per person
- Rates of erosion become time to erode a unit depth
(In practice, we might want to multiply or divide the results of taking the reciprocal by some constant, such as 1000 or 10000, to get numbers that are easy to manage, but that itself has no effect on skewness or linearity.)
The reciprocal reverses order among values of the same sign: largest becomes smallest, etc. The negative reciprocal preserves order among values of the same sign.
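The ordering behaviour can be checked with a few hypothetical positive values (the numbers below are made up, in the spirit of the persons-per-doctor example):

```python
import numpy as np

# Hypothetical positive values, e.g. persons per doctor.
x = np.array([200.0, 500.0, 1000.0, 4000.0])

recip = 1.0 / x       # doctors per person: the order is reversed
neg_recip = -1.0 / x  # negative reciprocal: the original order is kept

# argsort lists the positions of the values from smallest to largest.
print(np.argsort(x))
print(np.argsort(recip))
print(np.argsort(neg_recip))
```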
Square: The square, x to x^2, has a moderate effect on distribution shape and it could be used to reduce left skewness.
Binning: It is used to convert a continuous variable into categories. It can be performed on original values, percentiles, or frequencies. The choice of categorization technique is based on business understanding. For example, we can categorize income into three categories, namely: High, Average and Low.
We can also perform co-variate binning, which depends on the values of more than one variable. Binning is covered further in the implementation part of feature engineering.
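As a sketch of the income example, the snippet below bins the same values in two ways: on original values with hand-picked edges, and on percentiles. The incomes and the cut-off points are hypothetical; in practice the edges would come from business understanding.

```python
import pandas as pd

# Hypothetical incomes, for illustration only.
income = pd.Series([18_000, 25_000, 32_000, 41_000, 55_000,
                    63_000, 90_000, 150_000, 300_000])
labels = ["Low", "Average", "High"]

# Binning on original values: the edges here are arbitrary illustrative choices.
by_value = pd.cut(income, bins=[0, 30_000, 100_000, float("inf")], labels=labels)

# Binning on percentiles: three groups with roughly equal counts.
by_percentile = pd.qcut(income, q=3, labels=labels)

print(pd.DataFrame({"income": income,
                    "by_value": by_value,
                    "by_percentile": by_percentile}))
```

The two approaches can label the same person differently, which is why the choice of technique matters.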
Feature/Variable Creation & Its Benefits
Feature (variable) creation is the process of generating a new feature (variable) from existing variables. Suppose the dataset has a date (dd-mm-yy) as an input variable. We can generate new variables such as day, month, year, week, and weekday that may have a better relationship with the target variable.
This step is used to highlight a hidden relationship in a variable. In the table below, I created three new variables from the variable date (dd-mm-yy). This is covered further in the implementation part of feature engineering.
There are various techniques for creating new features. Let’s look at some of the commonly used methods:
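A short sketch of this kind of feature creation with pandas follows. The dates and column names are made up for illustration.

```python
import pandas as pd

# Hypothetical dates in dd-mm-yy format.
df = pd.DataFrame({"date": ["15-04-12", "01-01-20", "29-02-16"]})

# Parse the strings, stating the format explicitly so day and month are not swapped.
parsed = pd.to_datetime(df["date"], format="%d-%m-%y")

# Each new column is derived only from the existing date column.
df["day"] = parsed.dt.day
df["month"] = parsed.dt.month
df["year"] = parsed.dt.year
df["weekday"] = parsed.dt.day_name()
print(df)
```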
Creating derived variables
This refers to creating new variables from existing variable(s) using a set of functions or other methods. Take the Titanic dataset: the variable age has missing values. To predict the missing values, we used the salutation (Master, Mr, Miss, Mrs) in the name as a new variable. We will discuss this further in the implementation part of feature engineering.
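One way this could look in pandas is sketched below. The rows are illustrative, the `Title` column name is hypothetical, and filling missing ages with the median age per title is an assumption about how the salutation might be used, not necessarily the approach taken in the notebook.

```python
import pandas as pd

# A few rows in the style of the Titanic "Name" and "Age" columns (values are illustrative).
df = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley",
        "Heikkinen, Miss. Laina",
        "Palsson, Master. Gosta Leonard",
        "Allen, Mr. William Henry",
        "Rice, Master. Eugene",
    ],
    "Age": [22.0, 38.0, 26.0, 2.0, None, None],
})

# The salutation sits between the comma and the first period.
df["Title"] = df["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Assumed imputation rule: fill each missing age with the median age
# of the passengers who share the same title.
df["Age"] = df["Age"].fillna(df.groupby("Title")["Age"].transform("median"))
print(df[["Title", "Age"]])
```

The derived variable carries information that the raw name string hides: a "Master" is a young boy, so his missing age is better filled from other boys than from the overall average.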
Creating dummy variables
Dummy variables are mostly used to convert a categorical variable into numerical variables. Dummy variables are also called indicator variables. They make it possible to use a categorical variable as a predictor in statistical models. In the Titanic dataset, the variable gender has only two values, “male” and “female”; here we create two new variables, “Var_Male” and “Var_Female”, as shown in the table below.
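A minimal sketch with `pd.get_dummies`, assuming the gender values live in a column named `Sex` (the column name and the four rows are illustrative):

```python
import pandas as pd

# Illustrative rows; "Sex" is an assumed column name.
df = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

# One indicator column per category; dtype=int gives 0/1 instead of True/False.
dummies = pd.get_dummies(df["Sex"], dtype=int)
dummies = dummies.rename(columns={"male": "Var_Male", "female": "Var_Female"})

df = pd.concat([df, dummies], axis=1)
print(df)
```

With only two categories, one of the two columns is redundant (each is 1 minus the other), so passing `drop_first=True` to `pd.get_dummies` is a common way to keep just one.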
Implementation of Feature Engineering in Titanic Dataset
Below you can see my Jupyter notebook, along with the test and train datasets, where I performed these feature engineering operations on the Titanic dataset.