Faster R-CNN

One of the most used architectures for object detection

Faster R-CNN is an architecture for object detection achieving great results on most benchmark data sets. It builds directly on the work on the R-CNN and Fast R-CNN architectures but is more accurate as it uses a deep network for region proposal unlike the other two.

The breakthrough of Faster R-CNN is that it does the region proposals and classification predictions on the same feature map instead of using a sliding window approach and then splitting the tasks like its predecessors.

Intuition

https://arxiv.org/pdf/1506.01497.pdf

First, the architecture uses a backbone network to extract some features of the input image. Any classification architecture can be used, e.g., some ResNet variant combined with a Feature Pyramide Network (FPN).

Then, an anchor is generated for each feature, and for each anchor, a set of anchor boxes with variable sizes and aspect ratios are created.

RPN

‌The Region Proposal Network (RPN) detects the "good" anchor boxes, which will be forwarded to the next layer. The RPN consists of a classifier and a regressor.

‌The classifier predicts if an anchor box contains an object (IoU of anchor box with ground truth label is above a certain threshold) or contains parts of the background (IoU of anchor box with ground truth label is below a certain threshold).

‌Then, the regressor predicts offsets for the anchor boxes which contain objects to fit them as tightly as possible to the ground truth labels.

NMS

‌The RPN outputs a lot of noise, i.e., multiple bounding boxes for the same object. To reduce the noise and improve the performance of the overall model, Non-Max Suppression (NMS) is applied.

‌NMS works by identifying the bounding boxes with the highest confidence and then discarding the ones with high overlap:

Step 1: Select the box with the highest confidence (=objectiveness) score and pass it forward. Step 2: Then, compare the overlap (IoU) of this box with other boxes. Step 3: Remove the bounding boxes with high overlap (IoU > threshold, often 0.5). Step 4: Then, move to the box with the next highest confidence score. Step 5: Repeat steps 2-4 until all boxes have been checked.

Non-Max Surpression explained. Image source at the bottom of the page

ROI box head

‌Finally, the RoI pooling layer converts generated proposals of variable sizes to a fixed size to run a classifier and regress a bounding box on top of it.

Hyperparameters

Typically, the following hyperparameters are tweaked when using Faster R-CNN:

Backbone network

‌Specifying the architecture for the network on which Faster R-CNN is built

Mostly, the backbone network is a ResNet variation.

IoU thresholds for RPN

These thresholds are used to decide if an anchor box generated contains an object or is part of the background.

Everything that is above the upper IoU threshold of the proposed anchor box and ground truth label will be classified as an object and forwarded. Everything below the lower threshold will be classified as background and the network will be penalized. For all the anchor boxes with an IoU between the thresholds, we're not sure if it's for- or background and we'll just ignore them.

Empirically, setting the lower bound to 0.3 and the upper to 0.7 leads to robust results.

Number of convolution filters in the ROI box head

‌How many convolution filters the final layer to make the classification contains. To a certain degree, increasing the number of filters will enable the network to learn more complex features, but the effect vanishes if you add too many filters and the network will perform worse (see the original ResNet paper to understand why you cannot endlessly chain convolution filters).

A good default value is 4 conv filters.

Number of fully connected layers in the ROI box head

‌How many fully connected layers (FC) the last part of the network contains. Increasing the number of FCs can increase performance for a computational cost, but you might overfit the sub-network if you add too many.

Often, 2 FCs are used as starting point.

NMS number of proposals

Pre NMS

‌The maximum of proposals that are taken into consideration by NMS. The proposals are sorted descending after confidence and only the ones with the highest confidence are chosen.

Post NMS

‌The maximum of proposals that will be forwarded to the ROI box head. Again, the proposals are sorted descending after confidence and only the ones with the highest confidence are chosen.

Config for training

‌Low numbers of NMS proposals in training will result in a lower recall, but higher precision. Vice versa.

The default for post NMS number of proposals in training is 1000, pre NMS 4000.

Config for testing

‌Here, the number of NMS proposals for the simple forward pass, e.g., inference, is defined. Less NMS proposals will increase inference speed, but higher numbers yield greater performance.

If you don't need super-fast inference, a good default is 1000 for post NMS number of proposals and 2000 for pre NMS.

The default ROI pooler resolution is 7 x 7.

Pooler Sampling Ratio

After extracting the Region of Interests from the feature map, they should be adjusted to a certain dimension before feeding them to the fully connected layer that will later do the actual object detection. For this, ROI Align is used which makes use of points that would be sampled from a defined grid, to resize the ROIs. The number of points that we use is defined by Pooler Sampling Ratio.

If Pooler Sampling ratio is set to 2, then 2 * 2 = 4 points are used for the interpolation.

Pooler resolution

It is the size to pool proposals before feeding them to the mask predictor, in model playground default value is set as 7.

Depth of Resnet model

It is the depth variant of resnet to use as the backbone feature extractor, in Model Playground depth can be set as 18/50/101/152

Weights

It's the weights to use for model initialization, and in Model Playground R50-FPN COCO weights is used.

Code implementation

PyTorch
PyTorch
# import necessary libraries
from PIL import Image
import matplotlib.pyplot as plt
import torch
import torchvision.transforms as T
import torchvision
import torch
import numpy as np
import cv2
import os
# get the pretrained model from torchvision.models
# Note: pretrained=True will get the pretrained weights for the model.
# model.eval() to use the model for inference
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()
# Class labels from official PyTorch documentation for the pretrained model
# Note that there are some N/A's
# for complete list check https://tech.amikelive.com/node-718/what-object-categories-labels-are-in-coco-dataset/
COCO_INSTANCE_CATEGORY_NAMES = [
'__background__', 'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus',
'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'N/A', 'stop sign',
'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow',
'elephant', 'bear', 'zebra', 'giraffe', 'N/A', 'backpack', 'umbrella', 'N/A', 'N/A',
'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball',
'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard', 'tennis racket',
'bottle', 'N/A', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl',
'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza',
'donut', 'cake', 'chair', 'couch', 'potted plant', 'bed', 'N/A', 'dining table',
'N/A', 'N/A', 'toilet', 'N/A', 'tv', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone',
'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'N/A', 'book',
'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush'
]
# Functions to run inference on the image and display the output bounding boxes on the image.
def get_prediction(img_path, threshold):
"""
get_prediction
parameters:
- img_path - path of the input image
- threshold - threshold value for prediction score
method:
- Image is obtained from the image path
- the image is converted to image tensor using PyTorch's Transforms
- image is passed through the model to get the predictions
- class, box coordinates are obtained, but only prediction score > threshold
are chosen.
"""
img = Image.open(img_path)
transform = T.Compose([T.ToTensor()])
img = transform(img)
pred = model([img])
pred_class = [COCO_INSTANCE_CATEGORY_NAMES[i] for i in list(pred[0]['labels'].numpy())]
pred_boxes = [[(i[0], i[1]), (i[2], i[3])] for i in list(pred[0]['boxes'].detach().numpy())]
pred_score = list(pred[0]['scores'].detach().numpy())
pred_t = [pred_score.index(x) for x in pred_score if x>threshold][-1]
pred_boxes = pred_boxes[:pred_t+1]
pred_class = pred_class[:pred_t+1]
return pred_boxes, pred_class
def object_detection_api(img_path, threshold=0.5, rect_th=3, text_size=3, text_th=3):
"""
object_detection_api
parameters:
- img_path - path of the input image
- threshold - threshold value for prediction score
- rect_th - thickness of bounding box
- text_size - size of the class label text
- text_th - thickness of the text
method:
- prediction is obtained from get_prediction method
- for each prediction, bounding box is drawn and text is written
with opencv
- the final image is displayed
"""
boxes, pred_cls = get_prediction(img_path, threshold)
img = cv2.imread(img_path)
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
for i in range(len(boxes)):
cv2.rectangle(img, boxes[i][0], boxes[i][1],color=(0, 255, 0), thickness=rect_th)
cv2.putText(img,pred_cls[i], boxes[i][0], cv2.FONT_HERSHEY_SIMPLEX, text_size, (0,255,0),thickness=text_th)
plt.figure(figsize=(20,30))
plt.imshow(img)
plt.xticks([])
plt.yticks([])
plt.show()
# testing on image
object_detection_api('/content/Hasty_Founders.jpg', threshold=0.8)

Further resources