Thoughts on adding Hybrid Sharded Data Parallel to the guide

#107
by mattmcclean - opened

Just wondering if you have considered adding Hybrid Sharded Data Parallel (HSDP) to the guide. It lets you decouple the FSDP sharding degree from the DP degree. This should make the all-gather operation more efficient, since larger messages are transmitted over fewer ranks, so bandwidth utilization should be better than when sharding across the much larger DP degree. The DP dimension is then used only to synchronize the gradients.
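
To make the idea concrete, here is a minimal pure-Python sketch of how ranks could be split into the two kinds of groups. The function name `hsdp_groups`, its parameters, and the layout (consecutive ranks form one shard group, e.g. the GPUs of one node) are assumptions for illustration only, not the API of any particular library.

```python
def hsdp_groups(world_size: int, shard_size: int):
    """Split ranks into shard groups and replicate groups (hypothetical helper).

    Assumes consecutive ranks belong to the same shard group.
    """
    if world_size % shard_size != 0:
        raise ValueError("world_size must be divisible by shard_size")

    num_replicas = world_size // shard_size

    # Parameters are sharded inside each of these groups, so the
    # all-gather only spans shard_size ranks.
    shard_groups = [
        list(range(r * shard_size, (r + 1) * shard_size))
        for r in range(num_replicas)
    ]

    # Ranks holding the same shard index in different replicas;
    # gradients are synchronized across each of these groups.
    replicate_groups = [
        list(range(i, world_size, shard_size))
        for i in range(shard_size)
    ]
    return shard_groups, replicate_groups


# Example: 16 ranks, sharding degree 8, so 2 model replicas.
shard_groups, replicate_groups = hsdp_groups(world_size=16, shard_size=8)
```

With 16 ranks and a sharding degree of 8, each all-gather involves 8 ranks instead of 16, and each rank only needs to synchronize gradients with the one other rank that holds the same shard. In PyTorch FSDP I believe this roughly corresponds to `ShardingStrategy.HYBRID_SHARD`, though the exact group layout there may differ from this sketch.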
