The purpose of R-CNN is to solve the problem of object detection. Given an image, we want to be able to draw bounding boxes around all of the objects in it. The process can be split into two general components: the region proposal step and the classification step.
Selective Search generates about 2000 region proposals, the regions with the highest probability of containing an object. Each proposal is then “warped” to the fixed input size expected by a trained CNN (AlexNet in this case), which extracts a feature vector for each region. This vector is used as the input to a set of linear SVMs, one trained per class, which output a classification. The same vector is also fed into a bounding box regressor to refine the box coordinates.
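To make the “warping” step concrete, here is a minimal sketch in plain Python. It assumes the image is a list of pixel rows and that a proposal is given as `(x1, y1, x2, y2)` with exclusive right and bottom edges. The function name `warp_region` and the nearest-neighbour sampling are illustrative choices, not the original implementation. The point is that the crop is stretched to the target size regardless of its aspect ratio.

```python
def warp_region(image, box, out_h, out_w):
    """Crop `box` from `image` and resize it to out_h x out_w.

    image: list of rows, each row a list of pixel values.
    box:   (x1, y1, x2, y2) in pixels, x2 and y2 exclusive (an assumption).
    Nearest-neighbour sampling is used; the aspect ratio is deliberately
    ignored, which is what makes this a "warp" rather than a plain crop.
    """
    x1, y1, x2, y2 = box
    crop_h, crop_w = y2 - y1, x2 - x1
    warped = []
    for r in range(out_h):
        # Map the output row back to a source row inside the crop.
        src_r = y1 + (r * crop_h) // out_h
        row = [image[src_r][x1 + (c * crop_w) // out_w] for c in range(out_w)]
        warped.append(row)
    return warped


# Toy 4x6 "image" whose pixel value encodes its position (row * 10 + col).
toy_image = [[r * 10 + c for c in range(6)] for r in range(4)]
patch = warp_region(toy_image, box=(1, 0, 5, 2), out_h=4, out_w=4)
```

In the real pipeline every one of the roughly 2000 proposals would be warped this way to the CNN's input size (227 x 227 for the AlexNet variant used in R-CNN) before being passed through the network.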
Non-maximum suppression is then used to discard lower-scoring bounding boxes that overlap significantly with a higher-scoring box.
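The suppression step can be sketched as a greedy loop over boxes sorted by score. The helper names `iou` and `nms`, the `(x1, y1, x2, y2)` box format, and the default threshold of 0.5 are assumptions made for illustration. Treat this as a minimal sketch of the idea rather than the exact procedure used in the paper.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression; returns indices of the kept boxes."""
    # Visit boxes from highest to lowest score.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any
        # higher-scoring box that has already been kept.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)  # the second box overlaps the first and is dropped
```

In a detector this is typically run separately for each class, so that overlapping boxes for different classes do not suppress each other.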
The original model was later improved because of three main problems:
- Training took multiple stages (ConvNets to SVMs to bounding box regressors),
- It was computationally expensive, and
- It was extremely slow (R-CNN took 53 seconds per image).
Fast R-CNN solved the speed problem by sharing the computation of the conv layers between different proposals. It does this by swapping the order of generating region proposals and running the CNN, so the conv layers run once per image instead of once per proposal.
In this model, the image is first fed through a ConvNet. The features for each region proposal are then obtained from the last feature map of the ConvNet (ROI pooling). Lastly, these features pass through the fully connected layers and into the regression and classification heads.
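The step of reading region features off the shared feature map can be illustrated with a small ROI max pooling sketch. It assumes a single-channel feature map stored as a list of rows and a region already expressed in feature-map cells as `(x1, y1, x2, y2)` with exclusive right and bottom edges. The names `roi_max_pool` and `ceil_div` are hypothetical, and real implementations also handle multiple channels and the scaling from image to feature-map coordinates.

```python
def ceil_div(a, b):
    """Integer ceiling division for non-negative a and positive b."""
    return -(-a // b)


def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool the cells inside `roi` into a fixed out_h x out_w grid.

    feature_map: list of rows (single channel, for simplicity).
    roi:         (x1, y1, x2, y2) in feature-map cells, x2/y2 exclusive.
    Regions of any size are reduced to the same output size, which is
    what lets the fully connected layers accept them.
    """
    x1, y1, x2, y2 = roi
    roi_h, roi_w = y2 - y1, x2 - x1
    pooled = []
    for i in range(out_h):
        # Row range covered by output bin i (floor for start, ceil for end).
        r0 = y1 + (i * roi_h) // out_h
        r1 = y1 + ceil_div((i + 1) * roi_h, out_h)
        row = []
        for j in range(out_w):
            c0 = x1 + (j * roi_w) // out_w
            c1 = x1 + ceil_div((j + 1) * roi_w, out_w)
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled


# 4x4 toy feature map with values 0..15.
fmap = [[r * 4 + c for c in range(4)] for r in range(4)]
whole = roi_max_pool(fmap, (0, 0, 4, 4), 2, 2)
part = roi_max_pool(fmap, (1, 1, 4, 4), 2, 2)
```

Because the feature map is computed once and every proposal only indexes into it, the expensive convolutional work is shared across all proposals.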
Faster R-CNN works to combat the somewhat complex training pipeline that both R-CNN and Fast R-CNN exhibited. The authors insert a region proposal network (RPN) after the last convolutional layer. This network looks only at the last convolutional feature map and produces region proposals from it, replacing the separate Selective Search step. From that stage on, the same pipeline as Fast R-CNN is used (ROI pooling, FC layers, and the classification and regression heads).
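One way to picture how the RPN produces proposals from a feature map is through its anchor boxes: a fixed set of reference boxes centred on every feature-map cell, which the network then scores and adjusts. The sketch below only generates the anchors. The function name `generate_anchors` is hypothetical, the defaults (stride 16, three scales, three aspect ratios) follow the commonly cited Faster R-CNN configuration, and placing each anchor at the cell centre is a simplifying assumption.

```python
import math


def generate_anchors(feat_h, feat_w, stride=16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return anchor boxes (x1, y1, x2, y2) in image pixels.

    One anchor is created per (cell, scale, ratio) combination, where
    ratio = height / width and each anchor has area close to scale ** 2.
    """
    anchors = []
    for row in range(feat_h):
        for col in range(feat_w):
            # Centre of this feature-map cell in image coordinates.
            cx = (col + 0.5) * stride
            cy = (row + 0.5) * stride
            for scale in scales:
                for ratio in ratios:
                    w = scale / math.sqrt(ratio)
                    h = scale * math.sqrt(ratio)
                    anchors.append((cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2))
    return anchors


anchors = generate_anchors(feat_h=2, feat_w=3)
```

For each anchor, the RPN predicts an objectness score and box offsets. The highest-scoring adjusted boxes, after non-maximum suppression, become the region proposals passed to ROI pooling.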