While the VGG authors took AlexNet and standardized and optimized its structure to obtain a clearer and deeper architecture, researchers at Google took a different approach. As mentioned in their paper, their first consideration was optimizing the computational footprint of CNNs.
Indeed, in spite of careful engineering (as with VGG), the deeper a CNN is, the larger its number of trainable parameters and its number of computations per prediction become, which is costly in terms of both memory and time. For instance, VGG-16 weighs more than 500 MB (in terms of parameter storage), and the VGG submission for ILSVRC took two to three weeks to train on four GPUs. With approximately 5 million parameters, GoogLeNet is 12 times lighter than AlexNet and 21 times lighter than VGG-16, and the network was trained within a week. As a result, GoogLeNet, and more recent inception networks, can even run on more modest machines (such as smartphones).
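To put these parameter counts in perspective, the following minimal sketch uses `tf.keras.applications` to count the weights of comparable models. Note the assumption made here: the original GoogLeNet (Inception v1) is not bundled with Keras Applications, so Inception v3 is used as a rough stand-in from the same family rather than the exact model discussed above.

```python
import tensorflow as tf

# VGG-16 as bundled with Keras Applications (~138 million parameters,
# i.e. over 500 MB of 32-bit weights):
vgg16 = tf.keras.applications.VGG16(weights=None)
print(f"VGG-16 parameters:       {vgg16.count_params():,}")

# The original GoogLeNet (Inception v1) is not available in Keras
# Applications, so Inception v3 (~24 million parameters) serves as a
# rough stand-in to illustrate how much lighter inception-style
# networks are than VGG-16:
inception_v3 = tf.keras.applications.InceptionV3(weights=None)
print(f"Inception v3 parameters: {inception_v3.count_params():,}")
```

Running this confirms the order-of-magnitude gap: VGG-16 carries well over a hundred million trainable parameters, while inception-style networks get by with a small fraction of that.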