Mask R-CNN is one of the most widely used architectures for instance segmentation. It is built almost the same way as Faster R-CNN. The major difference is an extra head that predicts masks inside the predicted bounding boxes.
Also, the authors replaced the RoI pool layer with the RoI align layer. RoI pool rounds region coordinates to the feature-map grid, so its mappings are often slightly misaligned. The difference is small enough to be negligible for object detection, but not when you want to create pixel-accurate masks for instance segmentation.
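To make the misalignment concrete, here is a minimal sketch of how rounding a box coordinate to the feature-map grid shifts it. The stride of 16 and the helper names are assumptions chosen for illustration; they are not taken from any library.

```python
STRIDE = 16  # assumed downsampling factor between the image and the feature map


def roi_pool_coord(x, stride=STRIDE):
    """RoI-pool-style mapping: scale to feature-map units, then round to a whole cell."""
    return round(x / stride)


def roi_align_coord(x, stride=STRIDE):
    """RoI-align-style mapping: scale to feature-map units and keep the fraction."""
    return x / stride


def misalignment_px(x, stride=STRIDE):
    """Shift in image pixels introduced by rounding the coordinate."""
    return abs(roi_align_coord(x, stride) - roi_pool_coord(x, stride)) * stride


if __name__ == "__main__":
    x1 = 37  # left edge of a box, in image pixels
    print(roi_pool_coord(x1), roi_align_coord(x1), misalignment_px(x1))
```

A shift of a few pixels rarely changes which object a box contains, but it moves every mask pixel predicted inside that box.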
Typically, the following hyperparameters are tweaked when using Faster R-CNN:
Specifying the architecture for the network on which Faster R-CNN is built
The IoU thresholds are used to decide whether a generated anchor box contains an object or is part of the background.
Every anchor box whose IoU with a ground truth label is above the upper threshold is classified as an object and forwarded. Every anchor box below the lower threshold is classified as background, and the network is penalized. For anchor boxes with an IoU between the two thresholds, we cannot tell whether they are foreground or background, so we ignore them.
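The thresholding rule above can be sketched in a few lines of plain Python. The function names and the default thresholds of 0.3 and 0.7 are assumptions for illustration only; the actual values depend on your configuration.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def label_anchor(anchor, gt_boxes, lower=0.3, upper=0.7):
    """Return 1 (object), 0 (background), or -1 (ignored) for one anchor box."""
    best = max((iou(anchor, gt) for gt in gt_boxes), default=0.0)
    if best > upper:
        return 1   # clearly overlaps a ground truth box
    if best < lower:
        return 0   # clearly background
    return -1      # between the thresholds: ignored during training


if __name__ == "__main__":
    gt = [(0, 0, 10, 10)]
    for anchor in [(0, 0, 10, 10), (0, 0, 10, 5), (20, 20, 30, 30)]:
        print(anchor, label_anchor(anchor, gt))
```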
The number of convolution filters in the final layer that makes the classification. To a certain degree, increasing the number of filters enables the network to learn more complex features, but the effect vanishes if you add too many filters, and the network will perform worse (see the original ResNet paper to understand why you cannot endlessly chain convolution filters).
The number of fully connected (FC) layers in the last part of the network. Increasing the number of FC layers can improve performance at a computational cost, but you might overfit the sub-network if you add too many.
The maximum number of proposals that are taken into consideration by NMS. The proposals are sorted in descending order of confidence, and only the ones with the highest confidence are chosen.
The maximum number of proposals that will be forwarded to the RoI box head. Again, the proposals are sorted in descending order of confidence, and only the ones with the highest confidence are chosen.
Config for training
Config for testing
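The two proposal limits described above (the cap before NMS and the cap on what is forwarded to the RoI box head) can be sketched as follows. This is a simplified greedy NMS in plain Python, not the implementation used by any particular framework; the function names, parameter names, and the NMS threshold of 0.7 are assumptions for illustration.

```python
def box_iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def select_proposals(boxes, scores, pre_nms_top_n, post_nms_top_n, nms_thresh=0.7):
    """Return the indices of the proposals that survive both limits and NMS."""
    # Sort by descending confidence and keep only the best pre_nms_top_n.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    order = order[:pre_nms_top_n]

    # Greedy NMS: drop a proposal if it overlaps a better one too strongly.
    keep = []
    for i in order:
        if all(box_iou(boxes[i], boxes[j]) <= nms_thresh for j in keep):
            keep.append(i)

    # Forward at most post_nms_top_n proposals to the RoI box head.
    return keep[:post_nms_top_n]


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (0, 0, 10, 9), (50, 50, 60, 60), (100, 100, 110, 110)]
    scores = [0.9, 0.8, 0.7, 0.6]
    print(select_proposals(boxes, scores, pre_nms_top_n=4, post_nms_top_n=2))
```

Lower limits speed up training and inference, at the risk of discarding a proposal that covers a real object.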
After the Regions of Interest (RoIs) are extracted from the feature map, they must be resized to a fixed dimension before they are fed to the fully connected layer that later does the actual object detection. For this, RoI Align is used, which resizes the RoIs by sampling points from a defined grid. The number of points that we use is defined by the Pooler Sampling Ratio.
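The role of the sampling ratio can be illustrated with a simplified, single-channel RoI Align in plain Python. This is a sketch under stated assumptions: the feature map is a 2D list, the value `feature[y][x]` is treated as sitting exactly at point (y, x), and the function and parameter names are hypothetical. Real implementations differ in details such as pixel-center offsets.

```python
def bilinear(feature, y, x):
    """Bilinearly interpolate a 2D list at the fractional position (y, x)."""
    h, w = len(feature), len(feature[0])
    y = min(max(y, 0.0), h - 1)
    x = min(max(x, 0.0), w - 1)
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def roi_align(feature, roi, output_size, sampling_ratio):
    """Resize roi = (x1, y1, x2, y2), given in feature-map units, to output_size x output_size."""
    x1, y1, x2, y2 = roi
    bin_h = (y2 - y1) / output_size
    bin_w = (x2 - x1) / output_size
    out = []
    for by in range(output_size):
        row = []
        for bx in range(output_size):
            total = 0.0
            # Sample sampling_ratio x sampling_ratio evenly spaced points in this bin.
            for sy in range(sampling_ratio):
                for sx in range(sampling_ratio):
                    y = y1 + by * bin_h + (sy + 0.5) * bin_h / sampling_ratio
                    x = x1 + bx * bin_w + (sx + 0.5) * bin_w / sampling_ratio
                    total += bilinear(feature, y, x)
            row.append(total / sampling_ratio ** 2)  # average the samples
        out.append(row)
    return out


if __name__ == "__main__":
    # A feature map whose value equals its x coordinate.
    ramp = [[float(x) for x in range(4)] for _ in range(4)]
    print(roi_align(ramp, (0.5, 0.5, 2.5, 2.5), output_size=2, sampling_ratio=2))
```

A higher sampling ratio averages more points per output cell, which costs more computation per RoI.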
This is the spatial size to which proposals are pooled before they are fed to the mask predictor. In Model Playground, the default value is 14.
This is the depth variant of ResNet to use as the backbone feature extractor. In Model Playground, the depth can be set to 18, 50, 101, or 152.
These are the weights used for model initialization. In Model Playground, R50-FPN COCO or R50-FPN LVIS weights are used.
```python
# Import the necessary libraries
import random

import cv2
import matplotlib.pyplot as plt
import numpy as np
import torch
import torchvision
import torchvision.transforms as T
from PIL import Image

# These are the classes that are available in the COCO dataset
COCO_INSTANCE_CATEGORY_NAMES = [
    '__background__', 'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus',
    'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'N/A', 'stop sign',
    'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow',
    'elephant', 'bear', 'zebra', 'giraffe', 'N/A', 'backpack', 'umbrella', 'N/A', 'N/A',
    'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball',
    'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard', 'tennis racket',
    'bottle', 'N/A', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl',
    'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza',
    'donut', 'cake', 'chair', 'couch', 'potted plant', 'bed', 'N/A', 'dining table',
    'N/A', 'N/A', 'toilet', 'N/A', 'tv', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone',
    'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'N/A', 'book',
    'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush'
]

# We will use the following colors to fill the pixels
COLOURS = [
    [0, 255, 0], [0, 0, 255], [255, 0, 0], [0, 255, 255], [255, 255, 0], [255, 0, 255],
    [80, 70, 180], [250, 80, 190], [245, 145, 50], [70, 150, 250], [50, 190, 190]
]

# Get the pretrained model from torchvision.models.
# Note: pretrained=True loads the pretrained weights for the model
# (newer torchvision releases prefer the `weights=` argument instead).
# model.eval() puts the model into inference mode.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
model.eval()

# random_colour_masks() fills a predicted mask with a color
# get_prediction() returns the final predictions from the model
# instance_segmentation_api() overlays the colored masks on the original image and plots it


def random_colour_masks(image):
    """
    random_colour_masks
      parameters:
        - image - a predicted binary mask
      method:
        - the mask of each predicted object is given a random colour for visualization
    """
    r = np.zeros_like(image).astype(np.uint8)
    g = np.zeros_like(image).astype(np.uint8)
    b = np.zeros_like(image).astype(np.uint8)
    r[image == 1], g[image == 1], b[image == 1] = random.choice(COLOURS)
    coloured_mask = np.stack([r, g, b], axis=2)
    return coloured_mask


def get_prediction(img_path, threshold):
    """
    get_prediction
      parameters:
        - img_path - path of the input image
        - threshold - minimum confidence score for a detection to be kept
      method:
        - the image is loaded from the image path
        - the image is converted to an image tensor using PyTorch's transforms
        - the image is passed through the model to get the predictions
        - masks, classes and bounding boxes are obtained from the model, and the
          soft masks are made binary (0 or 1),
          e.g. the segment of a cat is set to 1 and the rest of the image to 0
    """
    img = Image.open(img_path).convert("RGB")
    transform = T.Compose([T.ToTensor()])
    img = transform(img)
    with torch.no_grad():
        pred = model([img])[0]  # the model returns one dict per input image
    scores = pred['scores'].cpu().numpy()
    # Detections are sorted by descending score, so the ones above the threshold come first
    num_keep = int((scores > threshold).sum())
    # Masks have shape (N, 1, H, W); drop the channel axis only
    masks = (pred['masks'] > 0.5).squeeze(1).cpu().numpy()
    pred_class = [COCO_INSTANCE_CATEGORY_NAMES[i] for i in pred['labels'].cpu().numpy()]
    # OpenCV expects integer pixel coordinates: [(x1, y1), (x2, y2)]
    pred_boxes = [[(int(b[0]), int(b[1])), (int(b[2]), int(b[3]))]
                  for b in pred['boxes'].cpu().numpy()]
    return masks[:num_keep], pred_boxes[:num_keep], pred_class[:num_keep]


def instance_segmentation_api(img_path, threshold=0.5, rect_th=3, text_size=3, text_th=3):
    """
    instance_segmentation_api
      parameters:
        - img_path - path to the input image
      method:
        - the prediction is obtained by get_prediction
        - each mask is given a random color
        - each mask is added to the image in the ratio 1:0.5 with OpenCV
        - the final output is displayed
    """
    masks, boxes, pred_cls = get_prediction(img_path, threshold)
    img = cv2.imread(img_path)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    for i in range(len(masks)):
        rgb_mask = random_colour_masks(masks[i])
        img = cv2.addWeighted(img, 1, rgb_mask, 0.5, 0)
        cv2.rectangle(img, boxes[i][0], boxes[i][1], color=(0, 255, 0), thickness=rect_th)
        cv2.putText(img, pred_cls[i], boxes[i][0], cv2.FONT_HERSHEY_SIMPLEX,
                    text_size, (0, 255, 0), thickness=text_th)
    plt.figure(figsize=(20, 30))
    plt.imshow(img)
    plt.xticks([])
    plt.yticks([])
    plt.show()


# Testing on an image
instance_segmentation_api('/content/Hasty_Founders.jpg', 0.75)
```