NaNs when fine-tuning
#4, opened by cbudd
I've only run initial tests, but this model is unstable during fine-tuning. My first-epoch loss is 0.0 and then becomes NaN. The same script runs fine for the 1b variant. Snippet below:
import torch
from transformers import AutoConfig, AutoModelForCausalLM
from peft import LoraConfig, get_peft_model
from trl import SFTConfig, SFTTrainer

# Load the config and enable dropout for fine-tuning
model_config = AutoConfig.from_pretrained(args.model)
model_config.attention_dropout = 0.1
model_config.resid_dropout = 0.1

# Gemma-3 needs eager attention; fall back to SDPA for other models
model = AutoModelForCausalLM.from_pretrained(
    args.model,
    config=model_config,
    torch_dtype=torch.float16,
    attn_implementation="eager" if "gemma-3" in args.model else "sdpa",
)

# Attach LoRA adapters to all linear layers
peft_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules="all-linear",
    lora_dropout=0.1,
    bias="none",
)
model = get_peft_model(model, peft_config)

config = SFTConfig(
    output_dir=out_dir,
    num_train_epochs=args.epochs,
    per_device_train_batch_size=args.batch_size,
    do_eval=True,
    eval_strategy="steps",
    eval_steps=20,
    save_strategy="steps",
    save_steps=20,
    save_total_limit=1,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    save_on_each_node=False,
    weight_decay=0.05,
    fp16=True,
    load_best_model_at_end=True,
)

trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    args=config,
)
trainer.train()
Resolved: it was the combination of AMP in the trainer ("fp16=True") and loading the model in half precision ("torch_dtype=torch.float16").
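For anyone hitting the same thing, here is a minimal sketch of one way to avoid the conflict, reusing the hypothetical args.model, model_config and out_dir names from the snippet above. The idea is to not combine fp16 AMP with fp16 master weights: either keep the weights in float32 and let fp16=True handle the casting, or move everything to bfloat16.

import torch
from transformers import AutoModelForCausalLM

# Option A: keep master weights in fp32 and let the trainer's fp16 AMP
# (fp16=True in SFTConfig) do the half-precision casts during training.
model = AutoModelForCausalLM.from_pretrained(
    args.model,
    config=model_config,
    torch_dtype=torch.float32,
    attn_implementation="eager" if "gemma-3" in args.model else "sdpa",
)

# Option B: load in bfloat16 and set bf16=True (and fp16=False) in SFTConfig.
# bf16 has the same exponent range as fp32, so it is generally less prone
# to overflowing to NaN than fp16.
# model = AutoModelForCausalLM.from_pretrained(
#     args.model,
#     config=model_config,
#     torch_dtype=torch.bfloat16,
#     attn_implementation="eager" if "gemma-3" in args.model else "sdpa",
# )

Either option avoids running the optimizer on fp16 weights while AMP is also scaling fp16 gradients, which is what was producing the 0.0 loss followed by NaNs here.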
Thanks for sharing!