LoRA Training OOM with 2x NVIDIA RTX A6000 (2x 48 GB)
I have two RTX A6000s, which totals 96 GB of VRAM, but when I try to fine-tune the model with LoRA I immediately get:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU

Even with batch size = 1 it OOMs the first GPU instantly while the second one still has free memory.
Is this simply not enough memory, or is something wrong with my code?
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 442552 C ...s/idefics-finetune/.venv/bin/python 48500MiB |
| 1 N/A N/A 442552 C ...s/idefics-finetune/.venv/bin/python 19816MiB |
+-----------------------------------------------------------------------------------------+
Here is my training script. I hope there's something wrong with my code:
import torch
from peft import LoraConfig
from transformers import AutoProcessor, BitsAndBytesConfig, Idefics2ForConditionalGeneration
from datasets import load_dataset

USE_LORA = True
USE_QLORA = False

processor = AutoProcessor.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
    do_image_splitting=False
)

# Three options for training, from the lowest precision training to the highest precision training:
# - QLoRA
# - Standard LoRA
# - Full fine-tuning
if USE_QLORA or USE_LORA:
    lora_config = LoraConfig(
        r=8,
        lora_alpha=8,
        lora_dropout=0.1,
        target_modules='.*(text_model|modality_projection|perceiver_resampler).*(down_proj|gate_proj|up_proj|k_proj|q_proj|v_proj|o_proj).*$',
        use_dora=False,  # if USE_QLORA else True,
        init_lora_weights="gaussian"
    )
    if USE_QLORA:
        bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.float16
        )
    model = Idefics2ForConditionalGeneration.from_pretrained(
        "HuggingFaceM4/idefics2-8b",
        torch_dtype=torch.float16,
        quantization_config=bnb_config if USE_QLORA else None,
        # attn_implementation='flash_attention_2'
        # device_map='auto'
    )
    model.add_adapter(lora_config)
    model.enable_adapters()
else:
    model = Idefics2ForConditionalGeneration.from_pretrained(
        "HuggingFaceM4/idefics2-8b",
        torch_dtype=torch.float16,
        _attn_implementation="flash_attention_2",  # Only available on A100 or H100
    )

dataset = load_dataset("dataset/malicious", split="train")
split = dataset.train_test_split(test_size=0.5)
train_dataset = split['train']
test_dataset = split['test']

p_brand = '''What brands can you see on the image?'''

class MyDataCollator:
    def __init__(self, processor):
        self.processor = processor
        self.image_token_id = processor.tokenizer.additional_special_tokens_ids[
            processor.tokenizer.additional_special_tokens.index("<image>")
        ]

    def __call__(self, examples):
        texts = []
        images = []
        for example in examples:
            image = example["image"]
            question = p_brand
            answer = example["brand"]
            messages = [
                {
                    "role": "user",
                    "content": [
                        {"type": "image"},
                        {"type": "text", "text": question},
                    ]
                },
                {
                    "role": "assistant",
                    "content": [
                        {"type": "text", "text": answer}
                    ]
                }
            ]
            text = self.processor.apply_chat_template(messages, add_generation_prompt=False)
            texts.append(text.strip())
            images.append([image])

        batch = self.processor(text=texts, images=images, return_tensors="pt", padding=True)

        labels = batch["input_ids"].clone()
        labels[labels == self.processor.tokenizer.pad_token_id] = self.image_token_id
        batch["labels"] = labels

        return batch

data_collator = MyDataCollator(processor)

from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    num_train_epochs=2,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=8,
    warmup_steps=0,
    learning_rate=1e-4,
    weight_decay=0.01,
    logging_steps=1,
    output_dir="output/test-brand-001",
    save_strategy="steps",
    save_steps=10,
    save_total_limit=1,
    evaluation_strategy="epoch",
    fp16=True,
    remove_unused_columns=False,
    report_to="none",
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,  # You can also evaluate (loss) on the eval set, note that it will incur some additional GPU memory
)

trainer.train()
Any help would be greatly appreciated.
Try adding a MAX_LENGTH to your processor call:

batch = processor(text=texts, images=images, padding=True, truncation=True, max_length=MAX_LENGTH, return_tensors="pt")

I am setting MAX_LENGTH = 768 for my case.
I also commented out eval_dataset and evaluation_strategy for my example.
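Roughly, this is how those two changes could look applied to your posted script (just a sketch; MAX_LENGTH and its value 768 are my choice, not something from your code):

MAX_LENGTH = 768  # my value; tune it for your data

# inside MyDataCollator.__call__, truncate long sequences:
batch = self.processor(
    text=texts,
    images=images,
    padding=True,
    truncation=True,        # drop tokens beyond MAX_LENGTH
    max_length=MAX_LENGTH,
    return_tensors="pt",
)

# and to skip evaluation entirely, remove these two lines:
#   evaluation_strategy="epoch",   from TrainingArguments
#   eval_dataset=test_dataset,     from the Trainer(...) call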
Try adding a MAX_LENGTH to your processor call:
batch = processor(text=texts, images=images, padding=True, truncation=True, max_length=MAX_LENGTH, return_tensors="pt")
I am setting MAX_LENGTH = 768 for my case.
Thank you for those tips, but even when I set max_length=200 or lower it still OOMs the first GPU instantly when training starts.
Are you training with LoRA on just 2x A6000 GPUs and it works for you?
Hi! Can you push some dataset samples to the Hub? I could re-run your code on 2x H100 and report how the memory is distributed.
Hi, I'm having a similar issue with the same GPUs (2x A6000).
I'm trying to reproduce this tutorial:
https://github.com/NielsRogge/Transformers-Tutorials/blob/master/Idefics2/Fine_tune_Idefics2_for_multi_page_PDF_question_answering_on_DUDE.ipynb
My only modification is to use devices=2 in the Lightning trainer, as sketched below.
With QLoRA I hit this issue: https://github.com/TimDettmers/bitsandbytes/issues/89#issuecomment-2094943374
With LoRA it goes OOM.
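For reference, a minimal sketch of that change, assuming the Lightning 2.x API; apart from devices=2, the arguments and the model_module name are illustrative and may not match the notebook exactly:

import lightning as L

trainer = L.Trainer(
    accelerator="gpu",
    devices=2,                  # my only change: train on both A6000s
    max_epochs=2,               # illustrative; the notebook's values may differ
    precision="16-mixed",       # illustrative
    accumulate_grad_batches=8,  # illustrative
)
trainer.fit(model_module)  # model_module: the LightningModule wrapping Idefics2 from the notebook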
Hi! Can you push some dataset samples to the Hub? I could re-run your code on 2x H100 and report how the memory is distributed.
Thanks for your interest in this issue.
I've uploaded a small subset of the dataset under: "ayyylemao/idefics2-test"
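It should be loadable like this (a minimal sketch, assuming the subset is on the Hugging Face Hub under that ID with a "train" split and the same "image"/"brand" columns as my script):

from datasets import load_dataset
dataset = load_dataset("ayyylemao/idefics2-test", split="train")  # assumes a "train" split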
Regards