MobileNet Baselines
Those who follow me know that I can't resist an opportunity to update an old baseline.
When the MobileNet-V4 paper came out I noted that they re-ran their MobileNet-V1 baseline to get a 74% ImageNet accuracy. The original models were around 71%. That's quite a jump.
Intrigued, I looked more closely at their recipe for the 'small' model, with its unusual optimizer hparams: the AdamW beta1 is lowered from the default 0.9 to 0.6, taking it closer to RMSProp. Additionally, there was fairly high dropout and augmentation for such a small model, but a very long epoch count (9600 ImageNet-1k epochs in their case).
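In PyTorch terms, the optimizer tweak is just the beta1 override. A minimal sketch; the model, learning rate, and weight decay below are illustrative placeholders, not the paper's exact values:

```python
import torch

# Placeholder model standing in for MobileNet-V1; the optimizer hparams are the point here.
model = torch.nn.Linear(10, 10)

# MNV4-style AdamW: beta1 lowered from the 0.9 default to 0.6,
# which nudges the first-moment averaging toward RMSProp-like behaviour.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=2e-3,             # illustrative, not the recipe's exact LR
    betas=(0.6, 0.999),  # beta1 = 0.6 instead of the default 0.9
    weight_decay=0.01,   # illustrative value
)
```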
I set out to try these hparams myself in timm, initially to train a reproduction of the MobileNet-V4-Small, where I successfully hit 73.8% at 2400 epochs (instead of 9600). I then took a crack at MobileNet-V1, as I'd never had that model in timm.
My MobileNet-V1 run just finished: 3600 ImageNet-1k epochs with a 75.4% top-1 accuracy on ImageNet at the 224x224 train resolution (76% at 256x256) -- no distillation, no additional data. The OOD dataset scores on ImageNet-V2, Sketch, etc. look pretty solid, so it doesn't appear to be a gross overfit. Weights here: https://huggingface.co/timm/mobilenetv1_100.ra4_e3600_r224_in1k
Comparing to some other MobileNets:
- Original MobileNet-V1 1.0
  - Weights: by Google, https://github.com/tensorflow/models/tree/master/research/slim/nets/mobilenet
  - Accuracy: 70.9%, Param: 4.2M, GMAC: 0.6
- Original MobileNet-V2 1.0
  - Weights: by Google, https://github.com/tensorflow/models/tree/master/research/slim/nets/mobilenet
  - Accuracy: 71.8%, Param: 3.5M, GMAC: 0.3
- MobileNet-V2 1.0
  - Weights: by me in timm, https://huggingface.co/timm/mobilenetv2_100.ra_in1k
  - Accuracy: 73.0%, Param: 3.5M, GMAC: 0.3
- MobileNet-V2 1.0 (MNV4 Paper)
  - Accuracy: 73.4%, Param: 3.5M, GMAC: 0.3
- Original MobileNet-V4 Small (MNV4 Paper)
  - Accuracy: 73.8%, Param: 3.8M, GMAC: 0.2
- MobileNet-V4 Small
  - Weights: by me in timm, https://huggingface.co/timm/mobilenetv4_conv_small.e2400_r224_in1k
  - Accuracy: 73.8%, Param: 3.8M, GMAC: 0.2
- MobileNet-V1 1.0 (MNV4 Paper)
  - Accuracy: 74.0%, Param: 4.2M, GMAC: 0.6
- MobileNet-V2 1.1 w/ depth scaling
  - Weights: by me in timm, https://huggingface.co/timm/mobilenetv2_110d.ra_in1k
  - Accuracy: 75.0%, Param: 4.5M, GMAC: 0.4
- MobileNet-V1 1.0
  - Weights: this recipe, https://huggingface.co/timm/mobilenetv1_100.ra4_e3600_r224_in1k
  - Accuracy: 75.4%, Param: 4.2M, GMAC: 0.6
- MobileNet-V3 Large 1.0
  - Weights: by Google, https://huggingface.co/timm/tf_mobilenetv3_large_100.in1k
  - Accuracy: 75.5%, Param: 5.5M, GMAC: 0.2
- MobileNet-V3 Large 1.0
  - Weights: by me in timm, https://huggingface.co/timm/mobilenetv3_large_100.ra_in1k
  - Accuracy: 75.8%, Param: 5.5M, GMAC: 0.2
I decided to give the old EfficientNet-B0 a go with these hparams: 78.6% top-1 accuracy. To put that in perspective, the B0 trainings ranked by top-1 are:
- Original (Google, https://huggingface.co/timm/tf_efficientnet_b0.in1k) - 76.7
- AutoAugment (Google, https://huggingface.co/timm/tf_efficientnet_b0.aa_in1k) - 77.1
- AdvProp+AA (Google, https://huggingface.co/timm/tf_efficientnet_b0.ap_in1k) - 77.6
- RandAugment (me in timm, https://huggingface.co/timm/efficientnet_b0.ra_in1k) - 77.7
- This MNV4-inspired recipe (https://huggingface.co/timm/efficientnet_b0.ra4_e3600_r224_in1k) - 78.6
- NoisyStudent+RA (Google, https://huggingface.co/timm/tf_efficientnet_b0.ns_jft_in1k) - 78.8
So a pure ImageNet-1k training, with no distillation and no extra data, landed just a hair under the very impressive NoisyStudent weights, which had unlabeled access to JFT. Additionally, the OOD test set scores are holding up relative to NoisyStudent, which is also impressive. I actually think this recipe could be tweaked to push the B0 to 79%: the accuracy improvement petered out early in this run, so there's room for improvement with a tweak to the aug + reg.
What were my differences from the MobileNet-V4 hparams? Well, for one, I used timm. If you read Supplementary Material section A of the ResNet Strikes Back paper, I detailed a number of fixes and improvements over the default RandAugment that's used in all TensorFlow and most JAX based trainings I'm aware of. I feel some of the issues in the original are detrimental to good training. Other differences?
- Repeated Augmentation (https://arxiv.org/abs/1901.09335, https://arxiv.org/abs/1902.05509)
- Small probability of random gaussian blur & random grayscale added in addition to RandAugment
- Random erasing w/ gaussian noise used instead of cutout, outside of RandAugment
So, the theme I've visited many times (ResNet Strikes Back, https://huggingface.co/collections/timm/searching-for-better-vit-baselines-663eb74f64f847d2f35a9c19, and many timm weights) continues to hold: there is a lot of wiggle room for improving old results through better training regimens.
I wonder how much, in 7-8 years' time, will be added to today's SOTA 100B+ dense transformer architectures through better recipes and training techniques.