Batch Size 30 AdamW vs Batch Size 1 Adafactor SDXL Training Comparison

Community Article Published August 8, 2024

I was hanging out on the OneTrainer Discord yesterday and saw a comment from one of the very old and experienced users. He was saying that AdamW is better than Adafactor, so I asked for his config, which you can see here: https://gist.github.com/FurkanGozukara/5e9ee7d2b2070abb9a173dab342e1221

I did my AdamW training with this config on an RTX A6000 GPU on Massed Compute. I used batch size 30 since I had my regular 15 training images plus 15 reg images every epoch. So every epoch totaled 30 images, and thus with batch size 30 a single step trained one full epoch. It consumed 47 GB of VRAM.
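For reference, here is a minimal sketch of the step math behind that claim, using the image counts above:

```python
import math

# 15 training images + 15 regularization images per epoch (from the setup above)
images_per_epoch = 15 + 15   # 30
batch_size = 30

steps_per_epoch = math.ceil(images_per_epoch / batch_size)
print(steps_per_epoch)  # 1 -> at batch size 30, one step covers the whole epoch
```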

The training dataset used for the experiment is shared below. It is deliberately a bad dataset: if the config works on a bad dataset, it will work even better on a better dataset. This is the general machine learning rule of garbage in, garbage out.

Why is this dataset bad? It has repeating backgrounds, repeating clothes, and a lack of pose variety.

Then I trained the same dataset and concepts with my config, shared in the post below. The training configuration name is default:

https://www.patreon.com/posts/96028218

If you are not my Patreon supporter, the entire config is shown in the tutorial video below:

https://youtu.be/0t5l6CP9eBg

 

Since the AdamW run uses batch size 30, I trained it for up to 750 epochs, because a bigger batch size weakens the per-image effect of the learning rate. Moreover, it had a lower LR than usual and did not train the Text Encoder. I saved a checkpoint every 150 epochs. One step took 11.5 seconds on the RTX A6000 on Massed Compute, and one step trained one epoch. Watch the tutorial above and you will understand the 15 training images + 15 reg images setup.
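The "bigger batch size weakens the LR" point refers to the common learning-rate scaling heuristics. Below is a hedged sketch of those rules of thumb; the base LR value is purely illustrative and is not the value used in either run above.

```python
# Common rules of thumb for adjusting LR when the batch size grows.
# base_lr is an illustrative value, not the actual config's LR.
base_lr = 1e-5
base_batch_size = 1
new_batch_size = 30

linear_scaled_lr = base_lr * (new_batch_size / base_batch_size)        # linear scaling rule
sqrt_scaled_lr = base_lr * (new_batch_size / base_batch_size) ** 0.5   # sqrt scaling rule
print(f"linear: {linear_scaled_lr:.1e}, sqrt: {sqrt_scaled_lr:.1e}")

# If the LR is NOT scaled up with the batch size, each step moves the weights
# less per image seen, which is why the batch size 30 AdamW run needed many
# more epochs (750 vs 150) to make comparable progress.
```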

With my best config, shared on Patreon and also in the YouTube tutorial above, I trained for up to 150 epochs and saved a checkpoint every 30 epochs. One step took 2 seconds, so every epoch took 60 seconds on the RTX A6000 on Massed Compute (slower than usual for some reason). So the total training times of both runs were almost equal.
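The rough total-time arithmetic, using the numbers above:

```python
# AdamW run: 1 step per epoch (batch size 30), 11.5 s per step, 750 epochs
adamw_total_s = 750 * 1 * 11.5        # 8625 s, about 2.4 hours

# Adafactor run: 30 steps per epoch (batch size 1), 2 s per step, 150 epochs
adafactor_total_s = 150 * 30 * 2      # 9000 s, exactly 2.5 hours

print(adamw_total_s / 3600, adafactor_total_s / 3600)  # ~2.4 h vs 2.5 h
```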

AdamW uses more VRAM than Adafactor. Speed depends on batch size, xformers, and similar settings. In my training I don't use xformers and I use batch size 1, because a lower batch size gives better quality. So if you need speed, go up in batch size only until you hit maximum throughput and no further.
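The VRAM gap comes mainly from optimizer state: AdamW keeps two full moment tensors per parameter, while Adafactor factors the second moment and, with the first moment disabled, keeps almost nothing else. A back-of-the-envelope sketch, assuming fp32 optimizer states and an illustrative SDXL UNet parameter count (an assumption, not a measured value):

```python
# Rough optimizer-state estimate. The parameter count is an illustrative
# assumption, not a number measured in the trainings above.
params = 2.6e9        # approximate SDXL UNet parameter count (assumption)
bytes_fp32 = 4

# AdamW stores two fp32 moment tensors (exp_avg, exp_avg_sq) per parameter.
adamw_state_gb = params * 2 * bytes_fp32 / 1024**3
print(f"AdamW optimizer state: ~{adamw_state_gb:.0f} GB")  # ~19 GB on top of weights and gradients

# Adafactor (with no first moment) keeps only factored second-moment statistics,
# one row vector and one column vector per weight matrix, which is orders of
# magnitude smaller -- hence the much lower VRAM usage.
```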

So here come the results. My results are far superior in resemblance. You can see the full grid comparison file at the link below: https://huggingface.co/MonsterMMORPG/Generative-AI/resolve/main/AdamW_vs_Adafactor.png

The grid file is 321 MB at a resolution of 11458 × 22772 px. It compares 20 special prompts + 1 generic prompt so that you can see whether the model is totally cooked or not.

Below are 4 grids of the above comparison as JPGs.

After 450 epochs the AdamW training became totally cooked, and its resemblance was still way behind the Adafactor training. Once again, I have yet to find a better config than my researched config. I have done over 150 full trainings to find my best config :)

Adafactor is normally a dynamic-LR optimizer, but I use it in a special static way. My config is very powerful and allows you to train more or less, until the model becomes cooked or stays undertrained.
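To illustrate what "static" means here, below is a minimal sketch using the Hugging Face transformers implementation of Adafactor with its adaptive step sizing disabled so it runs at a fixed learning rate. The LR value and the placeholder module are illustrative assumptions, not my actual Patreon config.

```python
import torch
from transformers.optimization import Adafactor

# Placeholder module; in a real run this would be the SDXL UNet (and optionally the text encoders).
model = torch.nn.Linear(4, 4)

optimizer = Adafactor(
    model.parameters(),
    lr=1e-5,                 # fixed learning rate (illustrative value)
    scale_parameter=False,   # don't scale updates by parameter RMS
    relative_step=False,     # disable the built-in time-dependent LR schedule
    warmup_init=False,
    weight_decay=0.0,
)
```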

I am open to consultations as well. You can join our Discord channel and message me, or find me on LinkedIn.

See the Patreon exclusive posts index to find our scripts easily, the Patreon scripts updates history to see which updates arrived for which scripts, and the amazing Patreon special generative scripts list that you can use for any of your tasks.

Join our Discord to get help, chat, and discuss, and also tell me your Discord username to get your special rank: SECourses Discord

Please also Star, Watch, and Fork our Stable Diffusion & Generative AI GitHub repository, join our Reddit subreddit, and follow me on LinkedIn (my real profile).