# K Fold Cross Validation

“If you torture the data long enough, it will confess” – Ronald Coase

For any machine learning model you design, what is the most important thing you expect from it? Accuracy, of course. Conventionally, you divide your dataset into a training set and a test set, feed the inputs to an algorithm such as k-NN or logistic regression, and compute the accuracy score. That accuracy depends on the algorithm and its parameters: changing the ‘k’ value in the k-NN algorithm, or changing the train-test split ratio, alters the accuracy of your model. So the question is: how do you choose an algorithm and assign values to its parameters, and is there a systematic way to do it? One solution to this problem is K-Fold Cross Validation. Let’s see what K-Fold Cross Validation is.

• What is K-Fold Cross Validation?
• Why do you need to use it?
• Implementing K-Fold in Python
• K-Fold Cross Validation vs. the train/test method
• Things to keep in mind

#### What is K-Fold Cross Validation?

In simple words, K-Fold Cross Validation is a popular validation technique used to assess the performance of a machine learning model, typically in terms of accuracy.

K-Fold Cross Validation is a non-exhaustive cross validation technique: it does not compute every possible way of splitting the dataset. Instead, it partitions the dataset into k equal-sized folds and, in turn, uses each fold as the test set while the remaining folds serve as the training set.

Before we move further, let’s walk through an example. Suppose you are fitting a model using the k-NN algorithm with k = 1 to 40. The inputs and the output, along with the k-NN algorithm, are supplied to K-Fold cross validation, which returns an accuracy score for each value of k from 1 to 40. From these scores, you can pick the optimal value of k. Have a look at the following figure for a better understanding (image source: Wikipedia.org).

As you can see, there are 20 samples, divided into test and training sets with test_size = 0.25 (25%). Here the ‘k’ value represents the number of folds, i.e. the number of iterations (not to be confused with the k of the k-NN algorithm). In the first iteration, the first 5 samples are treated as test data and the remaining 15 as training data. In the second iteration, the next 5 samples become the test data and the rest the training data, and similarly for the third and fourth iterations. The loop runs 4 times because k = 4. This shows how the accuracy changes as the training data changes.
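The splitting scheme in the figure can be reproduced with scikit-learn's `KFold` class. This is a minimal sketch assuming a toy dataset of 20 samples and 4 folds, mirroring the figure:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20)        # 20 samples, as in the figure
kf = KFold(n_splits=4)   # k = 4 folds/iterations, no shuffling

# Each iteration holds out a contiguous block of 5 samples as test data
for fold, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    print(f"Fold {fold}: test={test_idx}")
```

With `shuffle` left at its default of `False`, the first fold's test set is samples 0–4, the second fold's is samples 5–9, and so on, exactly like the iterations described above.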

Now suppose you proceed with the plain train/test method instead (no K-Fold), choose the first 5 samples as your test data and the rest as training data, and apply any algorithm. You may get 100% accuracy for that particular split and assume that your model works well. This is a false impression: different splits of the training data can give different accuracy results, so you need to try several splits to know how the model actually performs.

#### Why do you need to use it?

The main motivation for using K-Fold cross validation is to test the model during the training phase in order to:

• Check for over-fitting.
• Estimate how the model will generalize to independent data, i.e., your test dataset.
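As a quick sketch of the first point: a model can score perfectly on the data it was trained on yet lose accuracy on held-out folds. This illustrative example uses scikit-learn's `DecisionTreeClassifier` (an assumption for demonstration; any model that can memorize its training data would do):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data, iris.target

tree = DecisionTreeClassifier(random_state=0)
tree.fit(X, y)
train_acc = tree.score(X, y)                       # accuracy on the training data itself
cv_acc = cross_val_score(tree, X, y, cv=5).mean()  # mean accuracy on held-out folds

print("training accuracy:", train_acc)
print("cross-validated accuracy:", cv_acc)
```

A fully grown tree memorizes the training set, so its training accuracy is perfect, while the cross-validated accuracy on held-out folds is lower; that gap is the over-fitting K-Fold helps you detect.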

#### Implementing K-Fold using Python

Without wasting much time, let’s try writing some code from scratch. For simplicity, let’s begin with the Iris dataset. You can download it from the following link: https://archive.ics.uci.edu/ml/machine-learning-databases/iris/

Or, with a minimal amount of code, you can load the dataset directly without downloading it by hand.

Now let’s load the iris dataset into an object named ‘iris’.
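A minimal sketch using scikit-learn's built-in copy of the dataset (assuming scikit-learn is installed):

```python
from sklearn.datasets import load_iris

iris = load_iris()           # Bunch object holding the data, targets, and metadata
print(iris.feature_names)    # four measurements per flower
print(iris.target_names)     # three species
```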

Store the inputs and the output as NumPy arrays in separate objects.
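For example (repeating the load step so the snippet runs on its own; the names `X` and `y` follow the usual scikit-learn convention):

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X = np.asarray(iris.data)    # inputs: 150 samples x 4 features
y = np.asarray(iris.target)  # output: species label (0, 1, or 2)
print(X.shape, y.shape)
```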

Run the cross validation for each value of k and choose the k with the highest accuracy score.
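One way to sketch this step with scikit-learn's `cross_val_score` (the choice of 10 folds here is an assumption; any reasonable number of folds works):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target

# Mean 10-fold cross validation accuracy for each k in k-NN, k = 1..40
k_range = range(1, 41)
scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores.append(cross_val_score(knn, X, y, cv=10, scoring="accuracy").mean())

# Pick the k whose mean cross-validated accuracy is highest
best_k = k_range[int(np.argmax(scores))]
print("best k:", best_k)
```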

Try implementing K-Fold cross validation on the same dataset using some other algorithms and compare the results.

#### Things to keep in mind

• K-Fold cross validation is not a model-building technique but a model-evaluation technique.
• It is used to compare the performance of different algorithms, and of their parameter settings, on the same dataset.
• Although it can be computationally expensive (depending on the value of k and the size of the dataset), it is a standard approach for evaluating the performance of a model.