H100 configuration

#2 opened by wyceee

Hey,

Have you guys tested the 1.5B model on the H100? If so, what are the best configuration settings for running it? I'm still using the basic config.

Also, can we expect a bigger model soon?

# Model arguments
model_revision: main
torch_dtype: bfloat16
attn_implementation: flash_attention_2
bf16: true
tf32: true

# Dataset arguments
dataset_id_or_path: 'openai/gsm8k'

# Lora Arguments
# No LoRA is used here

# Training arguments
max_steps: 150 # Original 450
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
learning_rate: 5.0e-7 # 1.0e-6 as in the DeepSeek Math paper; 5.0e-7 from https://hijkzzz.notion.site/unraveling-rlhf-and-its-variants-engineering-insights#147d9a33ecc9806090f3d5c749d31f05
lr_scheduler_type: cosine
warmup_ratio: 0.03
# GRPO specific parameters
beta: 0.001 # 0.04 as in the DeepSeek Math paper; 0.001 from https://hijkzzz.notion.site/unraveling-rlhf-and-its-variants-engineering-insights#147d9a33ecc9806090f3d5c749d31f05
max_prompt_length: 256
max_completion_length: 1024
num_generations: 8
use_vllm: true
# vllm_device: "cuda:3"
vllm_gpu_memory_utilization: 0.2

# Logging arguments
logging_strategy: steps
logging_steps: 2
report_to:
- tensorboard
save_strategy: "steps"
save_steps: 25
seed: 42

# Hugging Face Hub
# push_to_hub: false
# hub_strategy: every_save

# Script arguments
public_maddr: "/ip4/38.101.215.12/tcp/30002"
host_maddr: "/ip4/0.0.0.0/tcp/38331"
max_rounds: 10000
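
For reference, here is a minimal sketch (not the repo's actual launcher) of how a recipe like the one above can be mapped onto TRL's GRPOConfig. The YAML file name and output_dir are placeholders, field availability (e.g. use_vllm, vllm_gpu_memory_utilization) depends on the installed TRL version, and the model arguments, dataset id, and swarm-specific script keys (public_maddr, host_maddr, max_rounds) are consumed by the training script rather than by TRL, so they are filtered out here:

import dataclasses
import yaml
from trl import GRPOConfig

# Load the recipe YAML (file name here is a placeholder).
with open("grpo-config.yaml") as f:
    recipe = yaml.safe_load(f)

# Keep only keys that GRPOConfig actually defines; model arguments
# (model_revision, torch_dtype, attn_implementation), dataset_id_or_path
# and the swarm/script arguments are handled elsewhere by the launcher.
grpo_fields = {field.name for field in dataclasses.fields(GRPOConfig)}
args = GRPOConfig(
    output_dir="runs/h100-1.5b",  # assumed output path, not part of the recipe
    **{k: v for k, v in recipe.items() if k in grpo_fields},
)
print(args.num_generations, args.per_device_train_batch_size)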

Stay tuned on this, some updates are coming soon.

I'm interested too. Running this right now:

# Training arguments

max_steps: 100
per_device_train_batch_size: 2
gradient_accumulation_steps: 8
learning_rate: 1.0e-6
max_prompt_length: 384
max_completion_length: 1024
num_generations: 4
vllm_gpu_memory_utilization: 0.3

Will probably increase the steps further if it runs stably.
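
For a rough sense of how the two recipes compare on a single 80 GB H100, here is a back-of-the-envelope sketch (single-GPU assumption; the exact GRPO batching rules depend on the TRL version): per_device_train_batch_size × gradient_accumulation_steps prompts per optimizer step, each sampled num_generations times, with vllm_gpu_memory_utilization × 80 GB reserved for generation.

# Back-of-the-envelope comparison of the two recipes on one 80 GB H100.
# Single-GPU assumption; exact GRPO batching rules differ across TRL versions.
H100_MEM_GB = 80

def summarize(name, per_device_bs, grad_accum, num_generations, vllm_util):
    prompts_per_step = per_device_bs * grad_accum              # prompts per optimizer step
    completions_per_step = prompts_per_step * num_generations  # sampled completions per step
    vllm_budget_gb = H100_MEM_GB * vllm_util                   # memory reserved for vLLM generation
    print(f"{name}: {prompts_per_step} prompts/step, "
          f"{completions_per_step} completions/step, ~{vllm_budget_gb:.0f} GB for vLLM")

summarize("original recipe", per_device_bs=1, grad_accum=8, num_generations=8, vllm_util=0.2)
summarize("second recipe", per_device_bs=2, grad_accum=8, num_generations=4, vllm_util=0.3)

Both end up sampling 64 completions per optimizer step; the second recipe trades more prompts per step for fewer generations per prompt and gives vLLM a larger memory slice.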
