MobileNet Baselines
Those who follow me know that I can't resist an opportunity to update an old baseline.
When the MobileNet-V4 paper came out I noted that they re-ran their MobileNet-V1 baseline to get a 74% ImageNet accuracy. The original models were around 71%. That's quite a jump.
Intrigued, I looked more closely at their recipe for the 'small' model, with its unusual optimizer hparams: the AdamW beta1 is lowered from the default 0.9 to 0.6, taking it closer to RMSProp. Additionally, there was fairly high dropout and augmentation for such a small model, but a very long epoch count (9600 ImageNet-1k epochs in their case).
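In PyTorch terms, the optimizer tweak is just the beta1 override. A minimal sketch; the model, learning rate, and weight decay below are illustrative placeholders, not the paper's exact values:

```python
import torch

# Placeholder model standing in for MobileNet-V1; the optimizer hparams are the point here.
model = torch.nn.Linear(10, 10)

# MNV4-style AdamW: beta1 lowered from the 0.9 default to 0.6,
# which nudges the first-moment averaging toward RMSProp-like behaviour.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=2e-3,             # illustrative, not the recipe's exact LR
    betas=(0.6, 0.999),  # beta1 = 0.6 instead of the default 0.9
    weight_decay=0.01,   # illustrative value
)
```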
I set out to try these hparams myself in timm, initially to train a reproduction of the MobileNet-V4-Small, where I successfully hit 73.8% at 2400 epochs (instead of 9600). I then took a crack at MobileNet-V1, as I'd never had that model in timm.
My MobileNet-V1 run just finished: 3600 ImageNet-1k epochs with a 75.4% top-1 accuracy on ImageNet at the 224x224 train resolution (76% at 256x256) -- no distillation, no additional data. The OOD dataset scores on ImageNet-V2, Sketch, etc. look pretty solid, so it doesn't appear to be a gross overfit. Weights here: https://huggingface.co/timm/mobilenetv1_100.ra4_e3600_r224_in1k
Comparing to some other MobileNets:
- Original MobileNet-V1 1.0
  - Weights: by Google, https://github.com/tensorflow/models/tree/master/research/slim/nets/mobilenet
  - Accuracy: 70.9%, Param: 4.2M, GMAC: 0.6
- Original MobileNet-V2 1.0
  - Weights: by Google, https://github.com/tensorflow/models/tree/master/research/slim/nets/mobilenet
  - Accuracy: 71.8%, Param: 3.5M, GMAC: 0.3
- MobileNet-V2 1.0
  - Weights: by me in timm, https://huggingface.co/timm/mobilenetv2_100.ra_in1k
  - Accuracy: 73.0%, Param: 3.5M, GMAC: 0.3
- MobileNet-V2 1.0 (MNV4 Paper)
  - Accuracy: 73.4%, Param: 3.5M, GMAC: 0.3
- Original MobileNet-V4 Small (MNV4 Paper)
  - Accuracy: 73.8%, Param: 3.8M, GMAC: 0.2
- MobileNet-V4 Small
  - Weights: by me in timm, https://huggingface.co/timm/mobilenetv4_conv_small.e2400_r224_in1k
  - Accuracy: 73.8%, Param: 3.8M, GMAC: 0.2
- MobileNet-V1 1.0 (MNV4 Paper)
  - Accuracy: 74.0%, Param: 4.2M, GMAC: 0.6
- MobileNet-V2 1.1 w/ depth scaling
  - Weights: by me in timm, https://huggingface.co/timm/mobilenetv2_110d.ra_in1k
  - Accuracy: 75.0%, Param: 4.5M, GMAC: 0.4
- MobileNet-V1 1.0
  - Weights: this recipe, https://huggingface.co/timm/mobilenetv1_100.ra4_e3600_r224_in1k
  - Accuracy: 75.4%, Param: 4.2M, GMAC: 0.6
- MobileNet-V3 Large 1.0
  - Weights: by Google, https://huggingface.co/timm/tf_mobilenetv3_large_100.in1k
  - Accuracy: 75.5%, Param: 5.5M, GMAC: 0.2
- MobileNet-V3 Large 1.0
  - Weights: by me in timm, https://huggingface.co/timm/mobilenetv3_large_100.ra_in1k
  - Accuracy: 75.8%, Param: 5.5M, GMAC: 0.2
I decided to give the old EfficientNet-B0 a go with these hparams: 78.6% top-1 accuracy. To put that in perspective, the B0 trainings ranked by top-1 are:
- Original (Google, https://huggingface.co/timm/tf_efficientnet_b0.in1k) - 76.7
- AutoAugment (Google, https://huggingface.co/timm/tf_efficientnet_b0.aa_in1k) - 77.1
- AdvProp+AA (Google, https://huggingface.co/timm/tf_efficientnet_b0.ap_in1k) - 77.6
- RandAugment (me in timm, https://huggingface.co/timm/efficientnet_b0.ra_in1k) - 77.7
- This MNV4-inspired recipe (https://huggingface.co/timm/efficientnet_b0.ra4_e3600_r224_in1k) - 78.6
- NoisyStudent+RA (Google, https://huggingface.co/timm/tf_efficientnet_b0.ns_jft_in1k) - 78.8
So a pure ImageNet-1k training, with no distillation and no extra data, landed just a hair under the very impressive NoisyStudent weights, which had unlabeled access to JFT. Additionally, the OOD test set scores are holding up relative to NoisyStudent, which is also impressive. I actually think this recipe could be tweaked to push the B0 to 79%: the accuracy improvement petered out early in this run, so there's room for improvement with a tweak to the aug + reg.
What were my differences from the MobileNet-V4 hparams? Well, for one, I used timm. If you read Supplementary Material section A of the ResNet Strikes Back paper, I detailed a number of fixes and improvements over the default RandAugment that's used in all TensorFlow and most JAX based trainings I'm aware of. I feel some of the issues in the original are detrimental to good training. Other differences?
- Repeated Augmentation (https://arxiv.org/abs/1901.09335, https://arxiv.org/abs/1902.05509)
- Small probability of random gaussian blur & random grayscale added in addition to RandAugment
- Random erasing w/ gaussian noise used instead of cutout, outside of RandAugment
So, the theme I've visited many times (ResNet Strikes Back, https://huggingface.co/collections/timm/searching-for-better-vit-baselines-663eb74f64f847d2f35a9c19, and many timm weights) continues to hold: there is a lot of wiggle room for improving old results through better training regimens.
I wonder how much, in 7-8 years' time, will be added to today's SOTA 100B+ dense transformer architectures through better recipes and training techniques.