The encoder (left side) typically consists of a pre-trained classification network, such as ResNet, using convolution blocks followed by maxpool downsampling to encode the input image into feature representations at various levels. In this architecture, there is repeated application of two 3x3 convolutions. Each convolution is followed by a ReLU and batch normalization. Then a 2x2 max pooling operation is applied to reduce dimensions by half. Again, at each downsampling step, we double the number of feature channels, while we cut in half the spatial dimensions.