NaNs when fine-tuning
#4, opened by cbudd
I've only run initial tests, but this model is unstable during fine-tuning. My first-epoch loss is 0.0 and then becomes NaN. The same script runs fine for the 1b variant. Snippet below:
import torch
from transformers import AutoConfig, AutoModelForCausalLM
from peft import LoraConfig, get_peft_model
from trl import SFTConfig, SFTTrainer

# Load the config and enable dropout for fine-tuning
model_config = AutoConfig.from_pretrained(args.model)
model_config.attention_dropout = 0.1
model_config.resid_dropout = 0.1

# Gemma-3 needs eager attention; fall back to SDPA for other models
model = AutoModelForCausalLM.from_pretrained(
    args.model,
    config=model_config,
    torch_dtype=torch.float16,
    attn_implementation="eager" if "gemma-3" in args.model else "sdpa",
)

# Attach LoRA adapters to all linear layers
peft_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules="all-linear",
    lora_dropout=0.1,
    bias="none",
)
model = get_peft_model(model, peft_config)

config = SFTConfig(
    output_dir=out_dir,
    num_train_epochs=args.epochs,
    per_device_train_batch_size=args.batch_size,
    do_eval=True,
    eval_strategy="steps",
    eval_steps=20,
    save_strategy="steps",
    save_steps=20,
    save_total_limit=1,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    save_on_each_node=False,
    weight_decay=0.05,
    fp16=True,
    load_best_model_at_end=True,
)

trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    args=config,
)
trainer.train()
Resolved: it was the combination of AMP in the trainer ("fp16=True") and loading the model in half precision ("torch_dtype=torch.float16").
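For anyone hitting the same thing, here is a minimal sketch of one way to avoid the conflict, reusing the hypothetical args.model, model_config and out_dir names from the snippet above. The idea is to not combine fp16 AMP with fp16 master weights: either keep the weights in float32 and let fp16=True handle the casting, or move everything to bfloat16.

import torch
from transformers import AutoModelForCausalLM

# Option A: keep master weights in fp32 and let the trainer's fp16 AMP
# (fp16=True in SFTConfig) do the half-precision casts during training.
model = AutoModelForCausalLM.from_pretrained(
    args.model,
    config=model_config,
    torch_dtype=torch.float32,
    attn_implementation="eager" if "gemma-3" in args.model else "sdpa",
)

# Option B: load in bfloat16 and set bf16=True (and fp16=False) in SFTConfig.
# bf16 has the same exponent range as fp32, so it is generally less prone
# to overflowing to NaN than fp16.
# model = AutoModelForCausalLM.from_pretrained(
#     args.model,
#     config=model_config,
#     torch_dtype=torch.bfloat16,
#     attn_implementation="eager" if "gemma-3" in args.model else "sdpa",
# )

Either option avoids running the optimizer on fp16 weights while AMP is also scaling fp16 gradients, which is what was producing the 0.0 loss followed by NaNs here.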
Thanks for sharing!