Comparisons of timm Optimizers w/ Caution

This repo contains summaries of several sets of experiments comparing a number of optimizers with and without caution (https://huggingface.co/papers/2411.16085) enabled.

The runs were all performed training a smaller ViT (vit_wee_patch16_reg1_gap_256) for 200 epochs (10M samples seen) from scratch on the timm 'mini-imagenet' dataset, a 100 class subset of imagenet with same image sizes as originals.

So far I have results for adamw, laprop, and mars (https://huggingface.co/papers/2411.10438). You can find full results in sub-folders by optimizer names. In all of these runs, the experiments with 'c' prefix in the name have caution enabled.

This is what the 'caution' addition looks like in an optimizer:

    mask = (exp_avg * grad > 0).to(grad.dtype)
    mask.div_(mask.mean().clamp_(min=1e-3))
    exp_avg = exp_avg * mask

Train args:

./distributed_train.sh 2 --dataset hfds/timm/mini-imagenet --num-classes 100 --model vit_wee_patch16_reg1_gap_256 -j 8 --epochs 200 --warmup-prefix --sched-on-updates --warmup-lr 0 --mixup .2 --model-ema --model-ema-decay 0.999 --model-ema-warmup --aa rand-m9-mstd0.5-inc1 --remode pixel --reprob 0.25 --amp --weight-decay .05 --drop 0.1 --drop-path .1 -b 288 --opt cadamw --lr 1e-3

LaProp

optim best_epoch train_loss eval_loss eval_top1 eval_top5 lr
claprop, lr=1e-03 204.0 2.2173619270324707 1.0931779468536378 73.920000390625 91.33000009765624 0.0
claprop, lr=5e-04 183.0 2.262192726135254 1.0912627222061158 73.77000073242188 91.22000260009766 1.3478660293113704e-05
laprop, lr=5e-04 198.0 2.2425642013549805 1.1426102781295775 71.73000213623047 90.55000146484376 1.109508849230001e-06
laprop, lr=1e-03 179.0 2.290040969848633 1.168387135314941 71.15000104980469 90.18000189208983 3.806023374435663e-05
claprop, lr=2e-04 195.0 2.546172380447388 1.2475446645736694 68.30000163574219 89.15000153808593 9.97634228344235e-07
laprop, lr=2e-04 204.0 2.6702351570129395 1.309178423690796 67.07999990234374 88.67000270996094 0.0
claprop, lr=2e-03 193.0 2.678058862686157 1.5239886917114258 62.08000177001953 84.8 1.4890673845226132e-05
laprop, lr=2e-03 200.0 2.70467209815979 1.522907255935669 61.46000135498047 85.28000162353516 1.9732715717284413e-06

LaProp Top-1 Evaluation Accuracy on Mini-ImageNet

Top-1

LaProp Train Loss

Loss

AdamW

optim best_epoch train_loss eval_loss eval_top1 eval_top5
cadamw, lr=1e-03 184.0 2.2688851356506348 1.0868136840820313 73.52000141601563 91.60000036621092
cadamw, lr=5e-04 199.0 2.163278102874756 1.0976034646987916 73.3900005859375 91.31000137939454
cadamw, lr=1e-03, clip grads 203.0 2.1360626220703125 1.1043113907814026 73.33000158691407 91.41000042724608
adamw, lr=1e-03, clip grads 195.0 2.2746386528015137 1.142998440361023 72.11000151367188 90.47000052490236
adamw, lr=5e-04 185.0 2.3040246963500977 1.1535791856765747 71.50000120849609 90.4800001953125
adamw, lr=1e-03 199.0 2.223684310913086 1.1657958560943604 71.22999993896484 90.30999958496092
cadamw, lr=2e-04 189.0 2.538627862930298 1.2325929063796996 68.94999995117188 89.61000139160156
adamw, lr=2e-04 203.0 2.579624652862549 1.3085522148132325 67.11000026855469 88.66000164794922

AdamW Top-1 Evaluation Accuracy on Mini-ImageNet

Top-1

AdamW Train Loss

Loss

MARS

optim best_epoch train_loss eval_loss eval_top1 eval_top5
cmars, lr=1e-03 198.0 2.054780960083008 1.0435627010345458 74.91000185546875 92.08000146484376
cmars, lr=2e-03 203.0 2.0272469520568848 1.0705795244216918 74.31000185546876 91.54000092773435
mars, lr=1e-03 184.0 2.219767808914185 1.07215625667572 74.06000178222656 91.6200013671875
mars, lr=2e-03 197.0 2.1453990936279297 1.0963781481742858 73.73000098876953 91.1500006225586
cmars, lr=5e-04 198.0 2.2018630504608154 1.083557384109497 73.32000045166015 91.67000092773438
mars, lr=5e-04 189.0 2.322845220565796 1.1199828132629397 72.02999995117187 90.86000173339843

MARS Top-1 Evaluation Accuracy on Mini-ImageNet

Top-1

MARS Train Loss

Loss

Downloads last month

-

Downloads are not tracked for this model. How to track
Inference API
Unable to determine this model's library. Check the docs .

Dataset used to train rwightman/timm-optim-caution