# K-Nearest Neighbors Algorithm

The K-Nearest Neighbors Algorithm, also known as the k-NN algorithm, is a fundamental classification algorithm. It classifies objects based on the closest training observations in the feature space.

It is an instance-based, supervised machine learning algorithm.

- Instance-based: there is no explicit model-learning phase; instead, the algorithm memorizes the training instances and uses them directly as its knowledge when classifying an unseen observation.
- Supervised machine learning: both the inputs and the outputs are fed to the algorithm so it can learn the pattern (unlike unsupervised learning, where only the inputs are passed in).

__Table of contents__

- Assumptions
- How does k-NN exactly work?
- How to choose the value of k?
- Exploring more on k-NN using Python
- Pros and Cons

**Assumptions**

- Data points lie in a metric space; the distance between them is commonly measured using Euclidean distance.
- The value of 'k' decides the number of neighbors in the vicinity that influence the classification.

**How does K-NN exactly work?**

Let's take a simple example to understand this algorithm. Consider the spread of red dots and blue dots in the following diagram. Your objective is to find the class of the unseen observation, i.e., the green dot: does it belong to the class of red dots or of blue dots? Let's assume k=3, i.e., you need to find the 3 closest neighbors of the green dot.

Keep increasing the radius of a circle centered on the green dot until the circle contains the 3 closest points. See the following diagram.

Among the three dots, two are blue and one is red. Hence the green dot is assigned to the class of blue dots.

Now let's take a deeper look at the algorithm. The k-NN algorithm classifies an unseen observation by majority voting among the k most similar training instances, where similarity is simply a distance metric between two data points. This distance can be measured using the Euclidean, Manhattan, Minkowski, or Hamming distance; Euclidean distance is a popular choice. For two points p = (p1, ..., pn) and q = (q1, ..., qn), the Euclidean distance is given by:

d(p, q) = sqrt((p1 - q1)^2 + (p2 - q2)^2 + ... + (pn - qn)^2)

Note: the Euclidean, Manhattan, and Minkowski distances hold good only for continuous variables, whereas the Hamming distance is for categorical variables.
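The four distance metrics above can be sketched in a few lines of NumPy (these helper functions are illustrative, not part of any library):

```python
import numpy as np

def euclidean(p, q):
    # square root of the sum of squared coordinate differences
    return np.sqrt(np.sum((np.asarray(p) - np.asarray(q)) ** 2))

def manhattan(p, q):
    # sum of absolute coordinate differences
    return np.sum(np.abs(np.asarray(p) - np.asarray(q)))

def minkowski(p, q, r=3):
    # generalizes the other two: r=1 gives Manhattan, r=2 gives Euclidean
    return np.sum(np.abs(np.asarray(p) - np.asarray(q)) ** r) ** (1 / r)

def hamming(p, q):
    # for categorical data: count of positions where the values differ
    return sum(a != b for a, b in zip(p, q))

print(euclidean([0, 0], [3, 4]))                  # 5.0
print(manhattan([0, 0], [3, 4]))                  # 7
print(hamming(['a', 'b', 'c'], ['a', 'x', 'c']))  # 1
```

Note that `minkowski` with r=2 reproduces the Euclidean result, which is a handy sanity check.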

Generally, K-NN algorithm consists of 3 steps:

- Compute the Euclidean distance between the new instance x and every point in the training data set.
- Choose the k closest training data points according to that distance.
- Estimate the conditional probability of each class among those k neighbors; the new instance is assigned to the class with the highest probability.

In other words, the algorithm searches the memorized training data for the k instances that most closely resemble the new instance, and assigns a class to the new instance accordingly.
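The three steps above can be sketched from scratch in a few lines (a toy illustration, not the sklearn implementation used later in this article):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Step 1: Euclidean distance from x_new to every training point
    distances = np.sqrt(np.sum((X_train - x_new) ** 2, axis=1))
    # Step 2: indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Step 3: majority vote among the k neighbors' labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy data: class 0 clusters near the origin, class 1 near (5, 5)
X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_train, y_train, np.array([0.5, 0.5])))  # 0
print(knn_predict(X_train, y_train, np.array([5.5, 5.5])))  # 1
```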

**How to choose the value of k?**

This is an obvious question, because your results can vary with the value of k, and the right choice depends entirely on your data set. For a small value of k, the model has low bias and high variance; for a large value of k, it has high bias and low variance. See the following figure to visualize this.

It is clear from the above figures that k controls the shape of the decision boundaries. As a programmer, you must choose a k value that best fits the dataset, or you can use k-fold cross-validation to find an optimal value.
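A sketch of choosing k by cross-validation on the Iris dataset, using sklearn's `cross_val_score` (the 5-fold setting and the 1-25 search range are assumptions for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target

scores = {}
for k in range(1, 26):
    knn = KNeighborsClassifier(n_neighbors=k)
    # mean accuracy across 5 folds for this value of k
    scores[k] = cross_val_score(knn, X, y, cv=5).mean()

# pick the k with the highest mean cross-validated accuracy
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

Because each k is scored on held-out folds rather than the training data itself, this guards against picking a k that merely memorizes the training set.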

**Exploring more on k-NN Algorithm using Python**

Before we proceed further, keep in mind these two simple steps to implement a machine learning algorithm in Python:

- Pass the inputs (the features) and the output (the target) as NumPy arrays, stored as separate objects.
- Both the inputs and the output must be numerical.

Without much delay, let us implement k-NN in Python. For simplicity, let's begin with the Iris dataset, which you can download from the following link: https://archive.ics.uci.edu/ml/machine-learning-databases/iris/

Or, with a couple of lines of code, you can view the dataset directly in a notebook:

```python
from IPython.display import HTML
HTML('<iframe src=https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data></iframe>')
```

Now let's load the iris dataset into an object named 'iris'.

```python
from sklearn.datasets import load_iris
iris = load_iris()
```

In the original data file, the 'species' column (the target) is text, not numbers. Let's see how sklearn stores it:

```python
print(iris.target)
```

```
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
```

What happened here is: setosa is encoded as 0, versicolor as 1, and virginica as 2. The output is now numerical and fit to supply to the algorithm.
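You can confirm this encoding yourself: `load_iris` exposes the class names in the same order as the integer codes.

```python
from sklearn.datasets import load_iris

iris = load_iris()
# target_names[i] is the species encoded as integer i
print(list(iris.target_names))  # ['setosa', 'versicolor', 'virginica']
```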

Next, check that the inputs and the output are NumPy arrays, and store them in separate objects:

```python
print(iris.data)
print(iris.target)
X = iris.data
y = iris.target
```

Next, we need to supply these inputs and the output to the algorithm, i.e., fit the data to the algorithm.

Now we import the classifier class from sklearn, instantiate it with k=1, and store it in the object `knn`.

```python
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)
print(knn)
```

Split the dataset into training and test sets, with 30% held out for testing:

```python
from sklearn.model_selection import train_test_split
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.3)
```

Fit the model, predict on the test set, and measure the accuracy:

```python
print(Xtrain)
print(Xtest)
print(ytrain)
print(ytest)

knn.fit(Xtrain, ytrain)
y_pred = knn.predict(Xtest)
print(y_pred)

from sklearn.metrics import accuracy_score
print(accuracy_score(ytest, y_pred))
```
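Once fitted, the model can also classify a brand-new observation. The measurements below are example sepal/petal values chosen for illustration (this sample happens to match a setosa flower in the dataset, so with k=1 the prediction is setosa):

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(iris.data, iris.target)

# sepal length, sepal width, petal length, petal width (in cm)
new_flower = [[5.1, 3.5, 1.4, 0.2]]
pred = knn.predict(new_flower)
print(iris.target_names[pred[0]])  # setosa
```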

**Pros and Cons**

**Pros:**

- It can be very effective when there is plenty of training data, which is why k-NN is used for such applications.
- It is robust to noisy training data.
- It has minimal training time, since k-NN doesn't learn anything up front; it simply stores the instances in memory and uses them to make predictions.

**Cons:**

- The result may vary depending on the value of k, so an appropriate k must be chosen.
- There is a chance that the neighborhood of a new instance does not contain data points from all classes.
- Prediction is computationally expensive, since distances to all training points must be computed, and the entire training dataset must be kept in memory.