Lora Training OOM with 2x NVIDIA RTX A6000 (2x48GB)

#71
by ayyylemao

I have two RTX A6000s, which total 96 GB of VRAM, but when I try to fine-tune the model with LoRA I immediately get torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB.
Even with batch size = 1 it OOMs the first GPU instantly, while the second one still has free memory.
Is this simply not enough memory, or is something wrong with my code?

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A    442552      C   ...s/idefics-finetune/.venv/bin/python      48500MiB |
|    1   N/A  N/A    442552      C   ...s/idefics-finetune/.venv/bin/python      19816MiB |
+-----------------------------------------------------------------------------------------+
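For reference, the same per-GPU usage can be read from inside the training process with a quick diagnostic like this (not part of the script below, just to confirm the imbalance):

import torch

# print per-device memory usage as reported by the CUDA driver (matches nvidia-smi)
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    print(f"GPU {i}: {(total - free) / 1024**3:.1f} GiB used of {total / 1024**3:.1f} GiB")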

Here is my training script. I hope there's something wrong with my code:

import torch
from peft import LoraConfig
from transformers import AutoProcessor, BitsAndBytesConfig, Idefics2ForConditionalGeneration
from datasets import load_dataset

USE_LORA = True
USE_QLORA = False

processor = AutoProcessor.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
    do_image_splitting=False
)


# Three options for training, from lowest to highest precision:
# - QLora
# - Standard Lora
# - Full fine-tuning
if USE_QLORA or USE_LORA:
    lora_config = LoraConfig(
        r=8,
        lora_alpha=8,
        lora_dropout=0.1,
        target_modules='.*(text_model|modality_projection|perceiver_resampler).*(down_proj|gate_proj|up_proj|k_proj|q_proj|v_proj|o_proj).*$',
        use_dora=False,  # if USE_QLORA else True
        init_lora_weights="gaussian"
    )
    if USE_QLORA:
        bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.float16
        )
    model = Idefics2ForConditionalGeneration.from_pretrained(
        "HuggingFaceM4/idefics2-8b",
        torch_dtype=torch.float16,
        quantization_config=bnb_config if USE_QLORA else None,
        #attn_implementation='flash_attention_2'
        #device_map='auto'
    )
    model.add_adapter(lora_config)
    model.enable_adapters()
else:
    model = Idefics2ForConditionalGeneration.from_pretrained(
        "HuggingFaceM4/idefics2-8b",
        torch_dtype=torch.float16,
        _attn_implementation="flash_attention_2", # Only available on A100 or H100
    )


dataset = load_dataset("dataset/malicious", split="train")
split = dataset.train_test_split(test_size=0.5)
train_dataset = split['train']
test_dataset = split['test']
p_brand = '''What brands can you see on the image?'''

class MyDataCollator:
    def __init__(self, processor):
        self.processor = processor
        self.image_token_id = processor.tokenizer.additional_special_tokens_ids[
            processor.tokenizer.additional_special_tokens.index("<image>")
        ]

    def __call__(self, examples):
        texts = []
        images = []
        for example in examples:
            image = example["image"]
            question = p_brand
            answer = example["brand"]
            messages = [
                {
                    "role": "user",
                    "content": [
                        {"type": "image"},
                        {"type": "text", "text": question},
                    ]
                },
                {
                    "role": "assistant",
                    "content": [
                        {"type": "text", "text": answer}
                    ]
                }
            ]
            text = self.processor.apply_chat_template(messages, add_generation_prompt=False)
            texts.append(text.strip())
            images.append([image])

        batch = self.processor(text=texts, images=images, return_tensors="pt", padding=True)

        labels = batch["input_ids"].clone()
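        # replace padding token positions in the labels with the image token id;
        # setting them to -100 would instead exclude them from the loss entirely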
        labels[labels == self.processor.tokenizer.pad_token_id] = self.image_token_id
        batch["labels"] = labels
        return batch

data_collator = MyDataCollator(processor)

from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    num_train_epochs=2,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=8,
    warmup_steps=0,
    learning_rate=1e-4,
    weight_decay=0.01,
    logging_steps=1,
    output_dir="output/test-brand-001",
    save_strategy="steps",
    save_steps=10,
    save_total_limit=1,
    evaluation_strategy="epoch",
    fp16=True,
    remove_unused_columns=False,
    report_to="none",
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=test_dataset, # You can also evaluate (loss) on the eval set, note that it will incur some additional GPU memory
)

trainer.train()

Any help would be greatly appreciated.

Try adding a MAX_LENGTH to your processor call, i.e. batch = processor(text=texts, images=images, padding=True, truncation=True, max_length=MAX_LENGTH, return_tensors="pt"). I am setting MAX_LENGTH = 768 in my case.

Also, I commented out eval_dataset and evaluation_strategy in my example.
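Applied to the collator above, the change would look roughly like this (MAX_LENGTH is whatever bound fits your prompts):

MAX_LENGTH = 768  # upper bound on tokens per sample; pick a value that fits your data

# inside MyDataCollator.__call__, the processor call becomes:
batch = self.processor(
    text=texts,
    images=images,
    padding=True,
    truncation=True,        # drop tokens beyond max_length
    max_length=MAX_LENGTH,  # caps sequence length and therefore activation memory
    return_tensors="pt",
)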

Thank you for those tips, but even when I set max_length=200 or lower, it still OOMs the first GPU instantly when training starts.
Are you training with LoRA on just two A6000 GPUs, and does it work for you?

HuggingFaceM4 org

Hi! Can you push some dataset samples to the Hub? I could re-run your code on 2x H100 and report how the memory is distributed.

Hi, I'm having a similar issue using the same GPUs (2xA6000).
I'm trying to reproduce this tutorial:
https://github.com/NielsRogge/Transformers-Tutorials/blob/master/Idefics2/Fine_tune_Idefics2_for_multi_page_PDF_question_answering_on_DUDE.ipynb

My only modification is to use devices=2 in the Lightning Trainer.
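For reference, that change amounts to roughly the following (a sketch only; module stands in for the LightningModule built in the notebook, and the other arguments are illustrative rather than copied from the tutorial):

import lightning as L

# "module" is the Idefics2 LightningModule defined in the linked notebook (not shown here)
trainer = L.Trainer(
    accelerator="gpu",
    devices=2,                  # the only change: train on both A6000s
    precision="16-mixed",       # illustrative
    max_epochs=2,               # illustrative
    accumulate_grad_batches=8,  # illustrative
)
trainer.fit(module)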

Using QLoRA I run into this issue: https://github.com/TimDettmers/bitsandbytes/issues/89#issuecomment-2094943374

Using LoRA it goes OOM.

Thanks for your interest in this issue.
I've uploaded a small subset of the dataset under: "ayyylemao/idefics2-test"
Regards
