Mask R-CNN
Like Faster R-CNN, but for instance segmentation
Mask R-CNN is the most commonly used architecture for instance segmentation. It is built almost the same way as Faster R-CNN. The major difference is that there is an extra head that predicts masks inside the predicted bounding boxes.
Also, the authors replaced the RoI pool layer with the RoI align layer. RoI pool quantizes the region coordinates to the feature-map grid, which makes the mappings a bit noisy. This misalignment is small enough to be negligible for object detection, but not when you want to create pixel-accurate masks for instance segmentation.
[Figure: the Mask R-CNN framework; image taken from the original paper]
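To see the difference in practice, torchvision ships both operations. Below is a minimal sketch comparing them on a dummy feature map (the tensor values and the fractional box are made up for illustration):

import torch
from torchvision.ops import roi_pool, roi_align

# dummy feature map: (batch, channels, height, width)
feat = torch.arange(64, dtype=torch.float32).reshape(1, 1, 8, 8)

# one region of interest in (batch_index, x1, y1, x2, y2) format;
# the fractional coordinates are exactly where RoI pool's quantization hurts
rois = torch.tensor([[0, 0.5, 0.5, 4.5, 4.5]])

# RoI pool snaps the box to integer feature-map coordinates ...
pooled = roi_pool(feat, rois, output_size=(2, 2), spatial_scale=1.0)

# ... while RoI align samples the exact fractional locations
# with bilinear interpolation, avoiding the misalignment
aligned = roi_align(feat, rois, output_size=(2, 2), spatial_scale=1.0,
                    sampling_ratio=2, aligned=True)

print(pooled.squeeze())
print(aligned.squeeze())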

Hyperparameters

Typically, the following hyperparameters are tweaked when using Mask R-CNN (most of them are inherited from Faster R-CNN):

Backbone network

Specifies the architecture of the network on which Mask R-CNN is built.
Mostly, the backbone network is a ResNet variation.

IoU thresholds for RPN

These thresholds are used to decide whether a generated anchor box contains an object or is part of the background.
Every anchor box whose IoU with a ground-truth box is above the upper threshold is classified as an object (foreground) and forwarded. Every anchor box below the lower threshold is classified as background, and the network is penalized if it predicts an object there. For all anchor boxes with an IoU between the two thresholds, we cannot be sure whether they show foreground or background, so we simply ignore them.
Empirically, setting the lower bound to 0.3 and the upper bound to 0.7 leads to robust results.
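As a rough sketch of how this labeling works (a simplified NumPy version for illustration, not the actual RPN code):

import numpy as np

def label_anchors(ious, lower=0.3, upper=0.7):
    """Assign a training label to each anchor from its best IoU with
    any ground-truth box: 1 = foreground, 0 = background, -1 = ignored."""
    best_iou = ious.max(axis=1)          # best ground-truth match per anchor
    labels = np.full(len(best_iou), -1)  # default: ignore during training
    labels[best_iou >= upper] = 1        # confident object
    labels[best_iou < lower] = 0         # confident background
    return labels

# IoU of 4 anchors against 2 ground-truth boxes (made-up values)
ious = np.array([[0.8, 0.1],
                 [0.5, 0.4],
                 [0.1, 0.05],
                 [0.2, 0.75]])
print(label_anchors(ious))  # [ 1 -1  0  1]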

Number of convolution filters in the ROI box head

The number of convolution filters in the final layers before the classification. To a certain degree, increasing the number of filters enables the network to learn more complex features, but the effect vanishes if you add too many, and the network will perform worse (see the original ResNet paper to understand why you cannot endlessly chain convolution layers).
A good default value is 4 conv filters.

Number of fully connected layers in the ROI box head

How many fully connected (FC) layers the last part of the network contains. Increasing the number of FCs can increase performance at a computational cost, but you might overfit the sub-network if you add too many.
Often, 2 FCs are used as a starting point.
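To make the structure of the ROI box head concrete, here is a minimal PyTorch sketch with a configurable number of conv filters and FC layers (the channel and feature dimensions are illustrative, not the exact Mask R-CNN values):

import torch
import torch.nn as nn

def make_box_head(in_channels=256, num_conv=4, num_fc=2,
                  conv_dim=256, fc_dim=1024, pooler_resolution=7):
    """Stack num_conv 3x3 conv layers followed by num_fc fully
    connected layers, as in a typical ROI box head."""
    layers = []
    for _ in range(num_conv):
        layers += [nn.Conv2d(in_channels, conv_dim, 3, padding=1), nn.ReLU()]
        in_channels = conv_dim
    layers.append(nn.Flatten())
    in_features = in_channels * pooler_resolution ** 2
    for _ in range(num_fc):
        layers += [nn.Linear(in_features, fc_dim), nn.ReLU()]
        in_features = fc_dim
    return nn.Sequential(*layers)

# 4 conv filters and 2 FCs, the defaults suggested above
pooled_rois = torch.rand(8, 256, 7, 7)      # 8 pooled proposals
print(make_box_head()(pooled_rois).shape)   # torch.Size([8, 1024])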

NMS number of proposals

Pre NMS

The maximum number of proposals that NMS takes into consideration. The proposals are sorted by confidence in descending order, and only the ones with the highest confidence are kept.

Post NMS

The maximum number of proposals that will be forwarded to the ROI box head. Again, the proposals are sorted by confidence in descending order, and only the ones with the highest confidence are kept.
Config for training
A low number of NMS proposals during training results in lower recall but higher precision, and vice versa.
The default for the post-NMS number of proposals in training is 1000; for pre-NMS, it is 4000.
Config for testing
Here, the number of NMS proposals for the plain forward pass, i.e., inference, is defined. Fewer NMS proposals increase inference speed, while higher numbers yield better performance.
If you don't need super-fast inference, a good default is 1000 for the post-NMS number of proposals and 2000 for pre-NMS.
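Conceptually, the pre/post-NMS limits are applied like this (a simplified sketch built on torchvision's nms, not the exact RPN implementation; the default values mirror the inference settings suggested above):

import torch
from torchvision.ops import nms

def filter_proposals(boxes, scores, pre_nms_topk=2000,
                     post_nms_topk=1000, nms_thresh=0.7):
    """Keep the top-k proposals by confidence before NMS, run NMS,
    then keep the top-k survivors to forward to the ROI box head."""
    order = scores.argsort(descending=True)[:pre_nms_topk]
    boxes, scores = boxes[order], scores[order]
    keep = nms(boxes, scores, nms_thresh)[:post_nms_topk]
    return boxes[keep], scores[keep]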

Pooler Sampling Ratio

After extracting the Regions of Interest from the feature map, they need to be resized to a fixed dimension before being fed to the fully connected layers that later do the actual object detection. For this, ROI Align is used, which samples points from a defined grid inside each output bin to resize the ROIs. The number of sampling points per bin is defined by the Pooler Sampling Ratio.
If the Pooler Sampling Ratio is set to 2, then 2 * 2 = 4 points are used for the interpolation of each output bin.
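In torchvision's roi_align, for example, this hyperparameter corresponds to the sampling_ratio argument (the feature map and box below are made up for illustration):

import torch
from torchvision.ops import roi_align

feat = torch.rand(1, 256, 32, 32)                  # dummy feature map
rois = torch.tensor([[0, 4.0, 4.0, 20.0, 20.0]])   # (batch_idx, x1, y1, x2, y2)

# sampling_ratio=2 -> 2 * 2 = 4 interpolation points per output bin;
# output_size is the pooler resolution discussed below
out = roi_align(feat, rois, output_size=(7, 7), sampling_ratio=2)
print(out.shape)  # torch.Size([1, 256, 7, 7])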

Pooler resolution

The spatial size to which proposals are pooled before being fed to the mask predictor. In Model Playground, the default value is 14.

Depth of ResNet model

The depth variant of ResNet to use as the backbone feature extractor. In Model Playground, the depth can be set to 18, 50, 101, or 152.

Weights

The weights to use for model initialization. In Model Playground, R50-FPN COCO or R50-FPN LVIS weights are used.
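Putting these options together, here is a sketch of how they map onto detectron2-style config keys, which Model Playground's options resemble (the exact key names and values are assumptions; check your framework's documentation):

from detectron2 import model_zoo
from detectron2.config import get_cfg

cfg = get_cfg()
# start from a Mask R-CNN R50-FPN model pretrained on COCO
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")

cfg.MODEL.RESNETS.DEPTH = 50                    # backbone depth variant
cfg.MODEL.RPN.IOU_THRESHOLDS = [0.3, 0.7]       # background / foreground bounds
cfg.MODEL.RPN.PRE_NMS_TOPK_TRAIN = 4000         # pre-NMS proposals (training)
cfg.MODEL.RPN.POST_NMS_TOPK_TRAIN = 1000        # post-NMS proposals (training)
cfg.MODEL.RPN.PRE_NMS_TOPK_TEST = 2000          # pre-NMS proposals (inference)
cfg.MODEL.RPN.POST_NMS_TOPK_TEST = 1000         # post-NMS proposals (inference)
cfg.MODEL.ROI_BOX_HEAD.NUM_CONV = 4             # conv filters in the box head
cfg.MODEL.ROI_BOX_HEAD.NUM_FC = 2               # FC layers in the box head
cfg.MODEL.ROI_MASK_HEAD.POOLER_RESOLUTION = 14  # pooling size for the mask head
cfg.MODEL.ROI_MASK_HEAD.POOLER_SAMPLING_RATIO = 2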

Code implementation

PyTorch
# import necessary libraries
from PIL import Image
import matplotlib.pyplot as plt
import torch
import torchvision
import torchvision.transforms as T
import numpy as np
import cv2
import random

# These are the classes that are available in the COCO dataset
COCO_INSTANCE_CATEGORY_NAMES = [
    '__background__', 'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus',
    'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'N/A', 'stop sign',
    'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow',
    'elephant', 'bear', 'zebra', 'giraffe', 'N/A', 'backpack', 'umbrella', 'N/A', 'N/A',
    'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball',
    'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard', 'tennis racket',
    'bottle', 'N/A', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl',
    'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza',
    'donut', 'cake', 'chair', 'couch', 'potted plant', 'bed', 'N/A', 'dining table',
    'N/A', 'N/A', 'toilet', 'N/A', 'tv', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone',
    'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'N/A', 'book',
    'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush'
]

# get the pretrained model from torchvision.models
# Note: pretrained=True downloads the COCO-pretrained weights for the model
# model.eval() switches the model to inference mode
model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
model.eval()

# random_colour_masks() fills a predicted mask with a random colour
# get_prediction() returns the final predictions from the model
# instance_segmentation_api() overlays the coloured masks on the original image and plots it

def random_colour_masks(image):
    """
    random_colour_masks
    parameters:
      - image - a binary predicted mask
    method:
      - the mask of each predicted object is given a random colour for visualization
    """
    colours = [[0, 255, 0], [0, 0, 255], [255, 0, 0], [0, 255, 255], [255, 255, 0],
               [255, 0, 255], [80, 70, 180], [250, 80, 190], [245, 145, 50],
               [70, 150, 250], [50, 190, 190]]
    r = np.zeros_like(image).astype(np.uint8)
    g = np.zeros_like(image).astype(np.uint8)
    b = np.zeros_like(image).astype(np.uint8)
    # colour only the pixels that belong to the mask
    r[image == 1], g[image == 1], b[image == 1] = colours[random.randrange(0, len(colours))]
    coloured_mask = np.stack([r, g, b], axis=2)
    return coloured_mask

def get_prediction(img_path, threshold):
    """
    get_prediction
    parameters:
      - img_path - path of the input image
      - threshold - minimum confidence score for a prediction to be kept
    method:
      - the image is loaded from the image path and converted to a tensor
        using PyTorch's transforms
      - the image is passed through the model to get the predictions
      - masks, classes and bounding boxes are obtained from the model, and the
        soft masks are binarized (0 or 1), i.e. the segment of a cat is set
        to 1 and the rest of the image to 0
    """
    img = Image.open(img_path).convert("RGB")
    transform = T.Compose([T.ToTensor()])
    img = transform(img)
    with torch.no_grad():
        pred = model([img])
    pred_score = list(pred[0]['scores'].numpy())
    # scores are sorted in descending order; find the last index above the
    # threshold (assumes at least one prediction scores above it)
    pred_t = [pred_score.index(x) for x in pred_score if x > threshold][-1]
    masks = (pred[0]['masks'] > 0.5).squeeze(1).numpy()
    pred_class = [COCO_INSTANCE_CATEGORY_NAMES[i] for i in list(pred[0]['labels'].numpy())]
    # cv2 drawing functions expect integer pixel coordinates
    pred_boxes = [[(int(i[0]), int(i[1])), (int(i[2]), int(i[3]))]
                  for i in list(pred[0]['boxes'].numpy())]
    masks = masks[:pred_t + 1]
    pred_boxes = pred_boxes[:pred_t + 1]
    pred_class = pred_class[:pred_t + 1]
    return masks, pred_boxes, pred_class

def instance_segmentation_api(img_path, threshold=0.5, rect_th=3, text_size=3, text_th=3):
    """
    instance_segmentation_api
    parameters:
      - img_path - path to the input image
    method:
      - prediction is obtained by get_prediction
      - each mask is given a random colour
      - each mask is blended into the image in the ratio 1:0.5 with OpenCV
      - the final output is displayed
    """
    masks, boxes, pred_cls = get_prediction(img_path, threshold)
    img = cv2.imread(img_path)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    for i in range(len(masks)):
        rgb_mask = random_colour_masks(masks[i])
        img = cv2.addWeighted(img, 1, rgb_mask, 0.5, 0)
        cv2.rectangle(img, boxes[i][0], boxes[i][1], color=(0, 255, 0), thickness=rect_th)
        cv2.putText(img, pred_cls[i], boxes[i][0], cv2.FONT_HERSHEY_SIMPLEX,
                    text_size, (0, 255, 0), thickness=text_th)
    plt.figure(figsize=(20, 30))
    plt.imshow(img)
    plt.xticks([])
    plt.yticks([])
    plt.show()

# Testing on an image
instance_segmentation_api('/content/Hasty_Founders.jpg', 0.75)
In Model Playground, after creating a split for instance segmentation/object detection, you can tweak the following hyperparameters of Mask R-CNN:
    Backbone network: The backbone is the ConvNet architecture used in the first step of Mask R-CNN. Here it's ResNet50.
    IoU Thresholds: IoU thresholds for labeling objects as background or foreground; objects with an IoU in between are ignored. The closer the predicted bounding box is to the ground-truth box, the greater the intersection and therefore the IoU value.
    Number of Fully Connected Layers: The number of hidden layers in the box predictor.
    NMS: In object detection, the objects in an image can be of different sizes and shapes, and to capture each of them, the algorithms create multiple bounding boxes. Ideally, each object in the image should end up with a single bounding box. To select the best bounding box from the multiple predicted ones, object detection algorithms use non-max suppression: a technique that "suppresses" the less likely bounding boxes and keeps only the best one, taking the IoU thresholds into account (a from-scratch sketch follows after this list).
    Post NMS parameters: This step differs between training and testing. In training, we select the top N (default = 2000) proposals based on their objectness scores from all proposals of all images in the entire batch. In testing, N proposals are selected for each image in the batch and kept separately.
    Pre NMS parameters: Selection of the top k anchors based on their objectness scores before NMS is applied.
    Depth of the ResNet model: The depth variant of ResNet to use as the backbone feature extractor.
    Weights: The weights to use for model initialization (R50-FPN COCO).
    Pooler Resolution: The size to which proposals are pooled before being fed to the box predictor. ROI Align/Pool gives a constant output of P x P irrespective of the proposal size, where P is the Pooler Resolution.
    Pooler Sampling Ratio: The number of sampling points per output bin in ROI Align.
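For completeness, here is a from-scratch sketch of the NMS procedure described above (pure NumPy, for illustration only):

import numpy as np

def iou(box, boxes):
    """IoU of one box against an array of boxes, all in (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_box = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_box + areas - inter)

def non_max_suppression(boxes, scores, iou_thresh=0.7):
    """Greedily keep the highest-scoring box and suppress all remaining
    boxes that overlap it by more than iou_thresh."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(int(best))
        # drop every remaining box that overlaps the kept one too much
        order = rest[iou(boxes[best], boxes[rest]) <= iou_thresh]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(non_max_suppression(boxes, scores, iou_thresh=0.5))  # [0, 2]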
