Object Detection With CNN, RCNN And Fast RCNN
An image classification algorithm takes the entire image as input, identifies the object in it, and outputs the class to which that object belongs. An object detection algorithm, by contrast, tells you which objects are present in the image and where each one is located. It outputs a bounding box (x, y, width, height) for each object, indicating its location in the image. Object detection is widely used for face detection and vehicle detection. A simple example is how Facebook detects faces in an image.
It can also be used in stores to count the number of people each day and keep statistics on how crowded the store is, such as which day is the most crowded and which is the least. In short, object detection finds objects that belong to a particular class (vehicle, human being, cat, dog, etc.) in an image.
In this article, we give an overview of object detection using CNNs and a detailed explanation of RCNN and Fast RCNN.
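To make the (x, y, width, height) output concrete, here is a minimal sketch of how a single detection could be stored and converted to corner coordinates. It assumes that (x, y) is the top-left corner of the box, which the format above does not state explicitly; the function names and the sample detection are made up for illustration.

```python
def xywh_to_corners(box):
    """Convert (x, y, width, height) to (x_min, y_min, x_max, y_max).

    Assumption: (x, y) is the top-left corner of the box.
    """
    x, y, w, h = box
    return (x, y, x + w, y + h)


def corners_to_xywh(box):
    """Inverse conversion: corner coordinates back to (x, y, width, height)."""
    x_min, y_min, x_max, y_max = box
    return (x_min, y_min, x_max - x_min, y_max - y_min)


# A hypothetical detection: a class label plus one bounding box.
detection = {"label": "dog", "box": (40, 60, 120, 80)}
corners = xywh_to_corners(detection["box"])
```

Both representations carry the same information; the corner form is often more convenient when boxes have to be compared or merged.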
What is object detection?
As seen in the above image, classification means assigning an object a class label. Here, the image contains a single object and the main goal is to give it a class name. In localization, the region of the image occupied by the object is found and enclosed in a bounding box. Object detection combines classification and localization. In real life, images contain several objects that belong to different classes, so we cannot label them with a single class. Hence, object detection methods are used.
Convolutional Neural Network (CNN) Approach to Object Detection
This is the basic structure of a convolutional neural network used for image classification. The input image passes through a series of convolution and pooling layers, followed by fully connected layers. The output is the predicted class along with its confidence.
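The last step, turning the network's raw scores into a predicted class and a confidence, can be sketched as follows. This assumes the final layer is a softmax, which is common for classification CNNs but is an assumption here; the scores and class names are invented for the example.

```python
import math


def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(scores, class_names):
    """Return the most probable class and its probability (the confidence)."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_names[best], probs[best]


# Hypothetical scores from the last fully connected layer.
label, confidence = predict([1.0, 3.0, 0.5], ["cat", "dog", "car"])
```

With these made-up scores the predicted label is "dog", and the confidence is the softmax probability assigned to that class.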
How to use CNN for object detection?
Divide the input image into separate regions and treat each region as a separate image. Feed these images to the CNN and classify each one. Finally, combine the regions to get the original image with the detected regions marked.
The advantage of this method is that forming the regions is quick: the image is simply divided into fixed-size regions. However, the content of the image is not taken into consideration when forming them. Objects in the image can also have different aspect ratios and spatial locations. Therefore, a large number of regions is required to cover them, which makes the method computationally intensive.
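A rough sketch of the fixed-size region approach shows how quickly the number of regions grows. The image size, region size, and strides below are arbitrary values chosen for illustration, and `grid_regions` is a hypothetical helper, not part of any library.

```python
def grid_regions(image_width, image_height, region_size, stride):
    """Return fixed-size square regions as (x, y, width, height) boxes."""
    boxes = []
    for y in range(0, image_height - region_size + 1, stride):
        for x in range(0, image_width - region_size + 1, stride):
            boxes.append((x, y, region_size, region_size))
    return boxes


# Non-overlapping tiles: cheap, but objects rarely line up with the grid.
coarse = grid_regions(640, 480, region_size=160, stride=160)

# Overlapping windows cover more locations at a much higher cost.
dense = grid_regions(640, 480, region_size=160, stride=16)

print(len(coarse), len(dense))  # 12 651
```

The dense grid already needs 651 CNN evaluations for a single region size and a single aspect ratio. Covering several sizes and aspect ratios multiplies that number again.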
RCNN (Region based CNN)
As seen in the CNN approach, a very large number of regions is required for object detection. To solve this problem, a region proposal algorithm is first used to find the most promising regions in the image. Selective search is a commonly used region proposal algorithm, and it is the one RCNN relies on.
Selective search algorithm
Selective search is a region proposal algorithm rather than a detector in its own right. It starts from a graph-based segmentation of the image and groups similar regions based on color, texture, size, and shape. The output image below shows the initial segments. These segments cannot be used directly as region proposals because:
1. As seen in the image, some objects in the input image are split across two or more segments.
2. Region proposals cannot be created for objects that are covered by other objects, e.g. a cup filled with coffee.
Thus, it uses oversegmentation: the objects that are segmented from the background are further segmented into sub-components.
The selective search algorithm then adds the bounding boxes of the current segments to the list of region proposals and groups adjacent segments that are similar. This step is repeated over several iterations, so that smaller segments are combined to form larger segments. This is known as hierarchical segmentation.
This algorithm extracts about 2000 region proposals per image. Thus, instead of classifying a huge number of regions, we can work with just 2000.
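The grouping step can be illustrated with a toy sketch. It is a heavy simplification, not the actual selective search implementation: each segment is described only by a bounding box, a single mean intensity, and a pixel count, and similarity is just the difference in intensity, whereas the real algorithm also uses texture, size, and shape. All names and numbers are hypothetical.

```python
def boxes_touch(a, b):
    """True if two (x1, y1, x2, y2) boxes overlap or share an edge."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def merge(r, s):
    """Merge two regions: union of the boxes, size-weighted mean color."""
    size = r["size"] + s["size"]
    color = (r["color"] * r["size"] + s["color"] * s["size"]) / size
    box = (min(r["box"][0], s["box"][0]), min(r["box"][1], s["box"][1]),
           max(r["box"][2], s["box"][2]), max(r["box"][3], s["box"][3]))
    return {"box": box, "color": color, "size": size}


def hierarchical_grouping(segments):
    """Greedily merge the most similar neighbouring regions.

    Every box seen along the way (initial and merged) becomes a proposal.
    """
    regions = list(segments)
    proposals = [r["box"] for r in regions]
    while len(regions) > 1:
        # Candidate pairs: neighbouring regions, scored by color difference.
        pairs = [(abs(regions[i]["color"] - regions[j]["color"]), i, j)
                 for i in range(len(regions))
                 for j in range(i + 1, len(regions))
                 if boxes_touch(regions[i]["box"], regions[j]["box"])]
        if not pairs:
            break
        _, i, j = min(pairs)
        merged = merge(regions[i], regions[j])
        regions = [r for k, r in enumerate(regions) if k not in (i, j)]
        regions.append(merged)
        proposals.append(merged["box"])
    return proposals


# Three made-up segments: two similar ones on top, a different one below.
segments = [
    {"box": (0, 0, 10, 10), "color": 0.20, "size": 100},
    {"box": (10, 0, 20, 10), "color": 0.25, "size": 100},
    {"box": (0, 10, 20, 20), "color": 0.90, "size": 200},
]
proposals = hierarchical_grouping(segments)
```

Here the two similar top segments are merged first, then the result is merged with the bottom segment, giving five proposals at different scales from three initial segments.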
Steps in R-CNN
- Take the input image
- Find the Regions of Interest (RoI) using the selective search algorithm.
- Reshape each region to the fixed size required by the CNN and feed it as input to a pre-trained CNN (e.g. AlexNet).
- In the final layer, SVMs are added. They detect whether an object is present in the region and, if so, classify it.
- Apply bounding box regression.
The SVM layer above classifies each region proposal into one of the classes. To get a tighter bounding box around the object, linear regression is then applied to the region proposal.
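As a sketch of what the bounding box regression step does with its output, the function below applies four predicted offsets to a region proposal. It assumes the parameterization commonly used in the RCNN family (center shifts scaled by the box size, and log-scale changes for width and height). The offsets themselves would come from the trained regressor; the numbers here are invented.

```python
import math


def apply_box_deltas(proposal, deltas):
    """Refine a proposal (x, y, width, height) with offsets (dx, dy, dw, dh).

    Assumed parameterization: dx, dy shift the box center by a fraction of
    the box size; dw, dh rescale width and height on a log scale.
    """
    x, y, w, h = proposal
    dx, dy, dw, dh = deltas
    cx, cy = x + 0.5 * w, y + 0.5 * h  # center of the proposal
    new_cx = cx + dx * w
    new_cy = cy + dy * h
    new_w = w * math.exp(dw)
    new_h = h * math.exp(dh)
    return (new_cx - 0.5 * new_w, new_cy - 0.5 * new_h, new_w, new_h)


# Hypothetical offsets: shift right and slightly up, shrink the width by 20%.
refined = apply_box_deltas((100.0, 100.0, 50.0, 80.0),
                           (0.1, -0.05, math.log(0.8), 0.0))
```

Offsets of all zeros leave the proposal unchanged, so the regressor only has to learn small corrections relative to what selective search already found.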
Problems in RCNN
- Since every image has about 2000 region proposals, the CNN has to extract features from each region proposal separately.
- There are 3 models in this algorithm:
  - CNN for feature extraction
  - Linear SVM classifier for identifying objects
  - Regression model for bounding boxes
This makes RCNN very slow. It cannot be used on real-world datasets.
What makes RCNN slow?
Running the CNN 2000 times per image makes RCNN computationally intensive. Fast RCNN removes this bottleneck. It passes the whole input image through the CNN once to get a convolutional feature map. The region proposals are then mapped onto this feature map, and the features inside each proposal are pooled (usually with max pooling) to a fixed size. This pooling layer is called RoI (Region of Interest) pooling. Thus, 2000 CNN passes per image are reduced to just 1.
- Take the input image.
- Pass it through a ConvNet to get the feature map, and map the RoIs onto it.
- Apply RoI pooling to each region. This reshapes each region to a fixed size.
- Pass the regions through fully connected layers, which classify them and return the bounding boxes. A softmax classifier and a linear regression model are used simultaneously for this.
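The RoI pooling step can be sketched in plain Python on a tiny single-channel feature map. This is a simplified illustration under stated assumptions, not the layer from any particular framework: the RoI is given in image coordinates and mapped onto the feature map by dividing by the network's stride (assumed to be 16 here), and the bin boundaries are chosen so that no bin is ever empty. The function names are hypothetical.

```python
import math


def project_roi(box, stride):
    """Map an image-space (x, y, width, height) box onto the feature map.

    Returns (x1, y1, x2, y2) in feature-map cells, with x2 and y2 exclusive.
    """
    x, y, w, h = box
    x1, y1 = x // stride, y // stride
    x2 = max(x1 + 1, math.ceil((x + w) / stride))
    y2 = max(y1 + 1, math.ceil((y + h) / stride))
    return (x1, y1, x2, y2)


def ceil_div(a, b):
    return -(-a // b)


def roi_max_pool(feature_map, roi, output_size):
    """Max-pool the cells inside roi into an output_size x output_size grid.

    The roi is assumed to lie inside the feature map.
    """
    x1, y1, x2, y2 = roi
    height, width = y2 - y1, x2 - x1
    pooled = []
    for i in range(output_size):
        rows = range(y1 + (i * height) // output_size,
                     y1 + ceil_div((i + 1) * height, output_size))
        pooled_row = []
        for j in range(output_size):
            cols = range(x1 + (j * width) // output_size,
                         x1 + ceil_div((j + 1) * width, output_size))
            pooled_row.append(max(feature_map[r][c] for r in rows for c in cols))
        pooled.append(pooled_row)
    return pooled


# A made-up 4 x 6 feature map, computed once for the whole image.
feature_map = [
    [1, 2, 3, 4, 5, 6],
    [7, 8, 9, 10, 11, 12],
    [13, 14, 15, 16, 17, 18],
    [19, 20, 21, 22, 23, 24],
]

roi_a = project_roi((16, 0, 64, 64), stride=16)  # a 4 x 4 cell region
roi_b = (0, 0, 6, 3)                             # a 6 x 3 cell region

print(roi_max_pool(feature_map, roi_a, 2))  # [[9, 11], [21, 23]]
print(roi_max_pool(feature_map, roi_b, 2))  # [[9, 12], [15, 18]]
```

Both regions have different shapes, yet both come out as 2 x 2 grids, which is what lets the fully connected layers accept them. The feature map is computed only once and is reused for every region.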
Problems with Fast RCNN
- Even though the CNN now runs once per image instead of once for each of the 2000 regions, Fast RCNN still uses the selective search approach to find RoIs, which is itself slow.
- Although it is better than RCNN, it is still too slow to use on real-world datasets.