---
license: mit
datasets:
- Vi-VLM/Vista
language:
- vi
---

A LLaVA-Qwen1.5-1.8B model fine-tuned with LoRA on a subset of the Vista Vi-LLaVA complex-reasoning data.  
Training loss: ~1.5

Training script
```bash
deepspeed moellava/train/train_mem.py \
    --lora_enable True --lora_r 128 --lora_alpha 256 --mm_projector_lr 0.00000125 \
    --lora_path /kaggle/temp/lora-llavaqwen \
    --deepspeed ./scripts/zero3.json \
    --model_name_or_path Qwen/Qwen1.5-1.8B \
    --version qwen \
    --data_path /kaggle/temp/vi_llava_train.json \
    --image_folder /kaggle/input/coco-2017-dataset/coco2017/train2017 \
    --image_tower google/siglip-base-patch16-256-multilingual \
    --image_projector_type mlp2x_gelu \
    --pretrain_mm_mlp_adapter /kaggle/temp/pt-llavaqwen1.5-1.8b/mm_projector.bin \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --image_aspect_ratio pad \
    --group_by_modality_length True \
    --fp16 True \
    --output_dir ./checkpoints/ft-lora-llavaqwen1.5-1.8b-complex_reasoning \
    --num_train_epochs 1 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 100 \
    --save_total_limit 1 \
    --learning_rate 1e-5 \
    --weight_decay 0. \
    --warmup_ratio 0 \
    --lr_scheduler_type "cosine" \
    --logging_steps 5 \
    --tf32 False \
    --model_max_length 1024 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --report_to wandb \
    --run_name ft-llava-qwen1.5-1.8b-lora-vista_reasoning-cont \
    --push_to_hub True
```
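
For reference, the `--lora_*` flags above map roughly onto a PEFT `LoraConfig` like the one below. This is only a sketch: MoE-LLaVA assembles its own config at train time (target modules are normally discovered automatically from the model's linear layers), so `target_modules` and `lora_dropout` here are illustrative assumptions rather than values taken from the command.

```python
from peft import LoraConfig

# Approximate PEFT equivalent of the --lora_* flags above (sketch only).
lora_config = LoraConfig(
    r=128,                       # --lora_r 128
    lora_alpha=256,              # --lora_alpha 256
    target_modules=[             # assumption: attention + MLP linear layers of Qwen1.5
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.05,           # assumption; not specified in the command above
    bias="none",
    task_type="CAUSAL_LM",
)
```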

Python code to merge LoRA
```python
from typing import Optional, List
# Minimal stand-ins for MoE-LLaVA's ModelArguments/DataArguments dataclasses.
# The defaults below are set to match the fine-tuning flags used above.
class ModelArguments:
    model_name_or_path: Optional[str] = "Qwen/Qwen1.5-1.8B"
    version: Optional[str] = "qwen"
    freeze_backbone: bool = False
    tune_mm_mlp_adapter: bool = False
    mm_vision_select_layer: Optional[int] = -2   # matches --mm_vision_select_layer -2
    pretrain_mm_mlp_adapter: Optional[str] = None  # optionally point at a mm_projector.bin to load projector weights
    mm_use_im_start_end: bool = False
    mm_use_im_patch_token: bool = False          # matches --mm_use_im_patch_token False
    mm_vision_select_feature: Optional[str] = "patch"
    # ===================================================================
    image_tower: Optional[str] = 'google/siglip-base-patch16-256-multilingual'
    video_tower: Optional[str] = None
    image_projector_type: Optional[str] = 'mlp2x_gelu'  # matches --image_projector_type mlp2x_gelu
    video_projector_type: Optional[str] = 'linear'
    video_global_proj: bool = False
    video_temproal_proj: bool = False
    video_spatial_proj: bool = False
    # ===================================================================

    # =============================================================
    only_lora_ffn: bool = True
    moe_enable: bool = False
    train_modules: Optional[List[str]] = None
    moe_mode: str = "sparse"
    moe_layers_idx: Optional[List[int]] = None
    ep_size: int = 1
    num_experts: Optional[List[int]] = 4
    top_k_experts: int = 2
    capacity_factor: float = 1.
    eval_capacity_factor: float = 2.
    min_capacity: int = 0
    use_residual: bool = False
    router_aux_loss_coef: float = 0.01

class DataArguments:
    lazy_preprocess: bool = False
    is_multimodal: bool = False
    image_aspect_ratio: str = 'pad'
    # ===================================================================
    data_path: Optional[List[str]] = None
    image_folder: Optional[str] = None
    video_folder: Optional[str] = None
    num_frames: int = 8

model_args = ModelArguments()
data_args = DataArguments()

import torch
from peft import PeftModel
from moellava.model import LlavaQwen1_5ForCausalLM

model_name_or_path = 'Qwen/Qwen1.5-1.8B'
lora_path = 'llavaqwen1.5-lora'   # directory holding the trained LoRA adapter

# Load the base language model, then attach the LoRA adapter on top of it.
model = LlavaQwen1_5ForCausalLM.from_pretrained(
    model_name_or_path,
)

model.to(torch.float16)
model = PeftModel.from_pretrained(model, lora_path)
model  # inspect the PEFT-wrapped model (notebook-style output)

import transformers

# The tokenizer must match the base model; <|extra_0|> is registered as the unk
# token (Qwen1.5's tokenizer does not define one), mirroring the training setup.
tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_args.model_name_or_path,
    model_max_length=1024,
    padding_side="right",
    use_fast=False,
)
tokenizer.add_special_tokens({'unk_token': '<|extra_0|>'})

# Rebuild the image tower and projector so the model carries its multimodal
# modules (and config entries) before merging and saving.
model.get_model().initialize_vision_modules(
    model_args=model_args,
)

image_tower = model.get_image_tower()
image_tower.to(dtype=torch.float16)

data_args.image_processor = image_tower.image_processor
data_args.is_multimodal = True

# Propagate the multimodal settings into model.config so they are stored with
# the merged checkpoint.
model.config.image_aspect_ratio = data_args.image_aspect_ratio
model.config.tokenizer_padding_side = tokenizer.padding_side

model.config.mm_use_im_start_end = data_args.mm_use_im_start_end = model_args.mm_use_im_start_end
model.config.mm_use_im_patch_token = model_args.mm_use_im_patch_token
model.initialize_vision_tokenizer(model_args, tokenizer=tokenizer)

# Fold the LoRA weights into the base model and write out a standalone checkpoint.
merged_model = model.merge_and_unload()
merged_model.save_pretrained("llava-qwen1.5-1.8b-complex_reasoning-merged")
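
# Optional (not in the original script): also save the tokenizer so the merged
# output directory can be loaded on its own.
tokenizer.save_pretrained("llava-qwen1.5-1.8b-complex_reasoning-merged")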
```
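
Reloading the merged checkpoint

A minimal sketch for loading the merged model without PEFT. It assumes `moellava` is installed and that the tokenizer was saved into the merged directory as suggested above (otherwise load it from `Qwen/Qwen1.5-1.8B`). Image-conditioned inference additionally requires MoE-LLaVA's image preprocessing and conversation template.

```python
import torch
from transformers import AutoTokenizer
from moellava.model import LlavaQwen1_5ForCausalLM

merged_dir = "llava-qwen1.5-1.8b-complex_reasoning-merged"

# Load the merged weights in fp16, matching the dtype used during merging.
model = LlavaQwen1_5ForCausalLM.from_pretrained(merged_dir, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(merged_dir, use_fast=False)
```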