As shown in Figure 4.3, the GoogLeNet architecture is not as straightforward as the previous architectures we studied, although it can be analyzed region by region. The input images are first processed by a classic series of convolutional and max-pooling layers. The information then flows through a stack of nine inception modules. These modules, often called subnetworks (further detailed in Figure 4.4), are blocks of layers stacked vertically and horizontally. For each module, the input feature maps are passed to four parallel sub-blocks composed of one or two layers (convolutions with different kernel sizes, and max-pooling).
The results of these four parallel operations are then concatenated along the depth dimension into a single feature volume:
In the preceding figure, all the convolutional and max-pooling layers use SAME padding. The convolutions...
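The parallel structure described above can be sketched in code. The following is a minimal, illustrative implementation of a naive inception module using the Keras functional API; the filter counts and input shape are assumptions chosen for the example, not values from the figure. Note that every branch uses SAME padding (and stride 1 for the pooling branch), so all four outputs keep the same spatial dimensions and can be concatenated along the channel axis:

```python
import tensorflow as tf
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Input, concatenate

def naive_inception_module(inputs, filters=32):
    # Illustrative filter count; four parallel branches over the same input.
    # SAME padding keeps the spatial dimensions identical across branches.
    branch_1x1 = Conv2D(filters, 1, padding='same', activation='relu')(inputs)
    branch_3x3 = Conv2D(filters, 3, padding='same', activation='relu')(inputs)
    branch_5x5 = Conv2D(filters, 5, padding='same', activation='relu')(inputs)
    branch_pool = MaxPooling2D(3, strides=1, padding='same')(inputs)
    # Concatenate the four results along the depth (channel) dimension.
    return concatenate([branch_1x1, branch_3x3, branch_5x5, branch_pool])

# Hypothetical input shape for demonstration (height, width, channels).
inputs = Input(shape=(28, 28, 192))
outputs = naive_inception_module(inputs, filters=32)
model = tf.keras.Model(inputs, outputs)
# The output depth is the sum of the branch depths: 32 + 32 + 32 + 192 = 288.
```

Because only the depth dimension grows, such modules can be stacked freely; the full GoogLeNet additionally inserts 1 × 1 convolutions before the larger kernels to reduce this depth and keep the computation tractable.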