# Dohoon_Qwen2-VL-7B-Instruct_ForAju (Ajou University Multimodal Deep Learning Challenge)
These are adapter weights obtained by fine-tuning Qwen/Qwen2-VL-7B-Instruct with LoRA (QLoRA) so that a single multimodal model handles image captioning, VQA, math reasoning, contextual QA, and summarization through a single prompt-routing logic, without separate task branching.

This repository contains only the adapter weights; it does not include the original base model.
- Developer: dohoon0508
- Finetuned from: Qwen/Qwen2-VL-7B-Instruct
- Environment: Google Colab / PyTorch / Transformers / PEFT / bitsandbytes
- Key features:
  - Single System Prompt: all tasks are handled with one system prompt, giving a branch-free inference pipeline
  - Rule-based Task Routing: one of five tasks (Captioning, VQA, Math, Text QA, Summarization) is selected dynamically from the input modality (image/text) and the presence of a question
  - Task-specific Decoding: the maximum number of generated tokens and a sentence-count-based dynamic stopping criterion are tuned to each task
  - Vision Tower Frozen: the vision tower weights are frozen during training for efficiency
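
To make the routing idea concrete, here is a minimal sketch of branch-free task selection. The function name, keyword heuristics, and task labels are illustrative assumptions, not the actual competition code:

```python
from typing import Optional

# Hypothetical keywords used to tell math reasoning apart from general VQA
MATH_HINTS = ("calculate", "sum", "equation", "how many")

def route_task(has_image: bool,
               question: Optional[str],
               context: Optional[str] = None) -> str:
    """Pick one of the five tasks from the input modality and the
    presence of a question, with no per-task model branches."""
    if has_image:
        if not question:
            return "captioning"
        if any(hint in question.lower() for hint in MATH_HINTS):
            return "math"
        return "vqa"
    if question and context:
        return "text_qa"
    return "summarization"
```

For example, an image with no question routes to captioning, while text plus a question routes to contextual QA.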
## 🔧 Usage (Loading the Adapter)

The following shows how to load this adapter onto the base model with the transformers and peft libraries.
```python
import torch
from transformers import AutoProcessor, BitsAndBytesConfig, Qwen2VLForConditionalGeneration
from peft import PeftModel

base_id = "Qwen/Qwen2-VL-7B-Instruct"
adapter_id = "dohoon0508/Dohoon_Qwen2-VL-7B-Instruct_ForAju"

# Load the processor and the 4-bit quantized base model
processor = AutoProcessor.from_pretrained(base_id, trust_remote_code=True)
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # or torch.float16
)
base_model = Qwen2VLForConditionalGeneration.from_pretrained(
    base_id,
    device_map="auto",
    trust_remote_code=True,
    quantization_config=quant_config,
)

# Apply the adapter (LoRA) weights
model = PeftModel.from_pretrained(base_model, adapter_id)
model.eval()
```
Inference example (VQA):

```python
from PIL import Image
import requests

image_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/bee.JPG"
image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")
question = "Question: What is the main subject in this image?"

messages = [
    # Substitute the actual system prompt used during training
    {"role": "system", "content": [{"type": "text", "text": "You are a multimodal assistant..."}]},
    {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": question}]},
]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
enc = processor(text=prompt, images=[image], return_tensors="pt")
out = model.generate(**{k: v.to(model.device) for k, v in enc.items()}, max_new_tokens=128)
generated_text = processor.batch_decode(out, skip_special_tokens=True)[0]
print(generated_text)
```
## 📁 Files

- adapter_model.safetensors: LoRA adapter weights
- adapter_config.json: adapter configuration
- README.md: model card
- tokenizer.json, tokenizer.model, tokenizer_config.json, processor_config.json, and other configuration files
## 🔬 Training Overview

- Tuning method: QLoRA (4-bit NormalFloat) + LoRA
- LoRA target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
- LoRA hyperparameters:
  - r = 32
  - lora_alpha = 16
  - lora_dropout = 0.05
- Vision tower: fully frozen
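
Expressed with peft, the settings above correspond roughly to the following sketch. Only r, lora_alpha, lora_dropout, and the target modules are taken from this card; the bias and task_type values, and the vision-tower attribute name, are assumptions:

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=32,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    bias="none",
    task_type="CAUSAL_LM",
)

# Freeze the vision tower before wrapping (the attribute is `visual`
# in the transformers Qwen2-VL implementation), then attach LoRA:
# for param in base_model.visual.parameters():
#     param.requires_grad = False
# model = get_peft_model(base_model, lora_config)
```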
- Training hyperparameters:
  - per_device_train_batch_size = 1
  - gradient_accumulation_steps = 16
  - learning_rate = 1e-4 (cosine scheduler)
  - warmup_ratio = 0.03
- Precision: bf16 (when available) / fp16
- Data: competition-provided multitask data (.parquet)
- Prompt: a fixed single system prompt plus (image/text + question), with no task branching
- Labeling: prompt tokens are masked to -100 when computing the loss, so only answer tokens contribute
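
The masking rule can be sketched with a small hypothetical helper (the token IDs below are illustrative):

```python
IGNORE_INDEX = -100  # ignored by PyTorch's cross-entropy loss

def build_labels(input_ids: list, prompt_len: int) -> list:
    """Copy input_ids to labels, masking every prompt token with -100
    so the loss is computed only over the answer tokens."""
    labels = list(input_ids)
    labels[:prompt_len] = [IGNORE_INDEX] * prompt_len
    return labels

# First four tokens belong to the prompt, the rest to the answer:
print(build_labels([101, 102, 103, 104, 7, 8, 9], 4))
# → [-100, -100, -100, -100, 7, 8, 9]
```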
## 🧠 Inference Notes

- Decoding:
  - Greedy search (do_sample=False, num_beams=1)
  - no_repeat_ngram_size = 4
  - repetition_penalty = 1.05
- Dynamic generation control:
  - The maximum number of generated tokens is adjusted dynamically per task type (Captioning, Summarization, etc.)
  - A StopOnSentenceCount criterion counts sentence-ending marks (., !, ?) and stops generation early once the target sentence count is reached
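
The StopOnSentenceCount name comes from this card, but its implementation is not published, so the following is a plausible reconstruction: the sentence counting is a plain function, with the transformers StoppingCriteria wiring shown in comments:

```python
import re

SENTENCE_END = re.compile(r"[.!?]")

def reached_sentence_limit(text: str, max_sentences: int) -> bool:
    """Return True once the text contains the target number of
    sentence-ending marks (., !, ?)."""
    return len(SENTENCE_END.findall(text)) >= max_sentences

# Wiring into generation (sketch; assumes transformers is installed):
# from transformers import StoppingCriteria
#
# class StopOnSentenceCount(StoppingCriteria):
#     def __init__(self, tokenizer, prompt_len, max_sentences):
#         self.tokenizer = tokenizer
#         self.prompt_len = prompt_len
#         self.max_sentences = max_sentences
#
#     def __call__(self, input_ids, scores, **kwargs):
#         generated = self.tokenizer.decode(input_ids[0, self.prompt_len:])
#         return reached_sentence_limit(generated, self.max_sentences)
```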
- Post-processing:
  - Forbidden phrases ("I'm sorry", "As an AI", etc.) are removed
  - For math problems, the answer is extracted and forced into the `#### {answer}` format
  - For VQA, only the first sentence is kept for conciseness
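
The math-answer normalization could look like this minimal sketch; the helper name and the last-number fallback are assumptions, and only the `#### {answer}` format is stated in this card:

```python
import re

def force_math_format(text: str) -> str:
    """Normalize a generated solution to the '#### {answer}' format.
    If the model already emitted '#### ...', reuse that answer;
    otherwise fall back to the last number found in the text."""
    m = re.search(r"####\s*(\S+)", text)
    if m:
        return f"#### {m.group(1)}"
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text)
    return f"#### {numbers[-1]}" if numbers else text
```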
## ✅ Recommended Use

- Research and experiments that handle diverse multimodal instructions (image captioning, VQA, text summarization, etc.) with a single model
- Analysis of a model's ability to distinguish tasks from the prompt alone, without separate routing logic
- Case studies of efficient fine-tuning of large language models (LLMs) with LoRA/QLoRA
## ⚠️ Limitations and Caveats

- As a generative model, it can produce hallucinations (content that contradicts facts) or misleading output.
- When applied to sensitive domains or domains with safety/ethics requirements, additional filtering or guardrails are mandatory.
- You must comply with the original licenses and terms of the base model (Qwen/Qwen2-VL-7B-Instruct) and the training data.
## 📚 References

- Base model: Qwen/Qwen2-VL-7B-Instruct
- Project repository: https://github.com/dohoon0508/ajukaggle