Adapting Multimodal Large Language Models to Domains via Post-Training
This repo contains the biomedicine MLLM developed from gemma-3-4b-it in our paper: On Domain-Adaptive Post-Training for Multimodal Large Language Models. The corresponding training dataset is in biomed-visual-instructions.
The main project page is: Adapt-MLLM-to-Domains
1. To Chat with AdaMLLM
Our model architecture aligns with the base model: gemma-3-4b-it. We provide a usage example below; for more advanced usage, refer to the official google/gemma-3-4b-it model card.
Note: For AdaMLLM, always place the image at the beginning of the input instruction in the messages.
Below are some code snippets to help you get started quickly with running the model. First, install the Transformers library; Gemma 3 is supported starting from transformers 4.50.0.
$ pip install -U transformers
Then, copy the snippet from the section that is relevant for your use case.
Running with the pipeline API
You can initialize the model and processor for inference with pipeline as follows.
from transformers import pipeline
import torch

pipe = pipeline(
    "image-text-to-text",
    model="AdaptLLM/biomed-gemma-3-4b-it",
    device="cuda",
    torch_dtype=torch.bfloat16
)
With instruction-tuned models, you need to use chat templates to process your inputs first. Then, you can pass them to the pipeline.
messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a helpful assistant."}]
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    }
]
output = pipe(text=messages, max_new_tokens=200)
print(output[0]["generated_text"][-1]["content"])
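If you prefer to run inference without the pipeline, the sketch below loads the processor and model directly with AutoProcessor and Gemma3ForConditionalGeneration, following the standard Transformers API for Gemma 3. It reuses the messages defined above; the device placement, dtype, and generation settings are assumptions you may need to adjust for your hardware.

from transformers import AutoProcessor, Gemma3ForConditionalGeneration
import torch

model_id = "AdaptLLM/biomed-gemma-3-4b-it"

# Load the processor and the model in bfloat16 (adjust device_map/dtype to your setup)
processor = AutoProcessor.from_pretrained(model_id)
model = Gemma3ForConditionalGeneration.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.bfloat16
).eval()

# Apply the chat template to the `messages` defined above; the processor
# fetches the image from its URL and tokenizes the full prompt.
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device, dtype=torch.bfloat16)

# Generate, then decode only the newly generated tokens
input_len = inputs["input_ids"].shape[-1]
with torch.inference_mode():
    generation = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(processor.decode(generation[0][input_len:], skip_special_tokens=True))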
2. Domain-Specific Benchmarks
We provide biomed-VQA-benchmark to evaluate any MLLM.
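As a starting point, the benchmark can be loaded with the datasets library. The snippet below is a minimal sketch: it assumes the benchmark is hosted under AdaptLLM/biomed-VQA-benchmark and uses a placeholder subset name; check the dataset card for the exact config and split names.

from datasets import load_dataset

# "SLAKE" is a placeholder subset name; see the dataset card for the available configs/splits.
benchmark = load_dataset("AdaptLLM/biomed-VQA-benchmark", name="SLAKE", split="test")
print(benchmark[0])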
3. To Reproduce this Domain-Adapted MLLM
Using our training data, biomed-visual-instructions, you can easily reproduce our models with the LLaMA-Factory repository.
For reference, we train from google/gemma-3-4b-it for 1 epoch with a learning rate of 1e-5, and a global batch size of 128.
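To inspect the training data before launching a run, you can load it with the datasets library. This is a minimal sketch that assumes the dataset is hosted under AdaptLLM/biomed-visual-instructions; check the dataset card for the exact config and split names.

from datasets import load_dataset

# Assumed repo id for the training data; see the dataset card for configs/splits.
train_data = load_dataset("AdaptLLM/biomed-visual-instructions", split="train")
print(train_data[0])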
Citation
If you find our work helpful, please cite us.
@article{adamllm,
  title={On Domain-Specific Post-Training for Multimodal Large Language Models},
  author={Cheng, Daixuan and Huang, Shaohan and Zhu, Ziyu and Zhang, Xintong and Zhao, Wayne Xin and Luan, Zhongzhi and Dai, Bo and Zhang, Zhenliang},
  journal={arXiv preprint arXiv:2411.19930},
  year={2024}
}
Adapt LLM to Domains (ICLR 2024)
@inproceedings{cheng2024adapting,
  title={Adapting Large Language Models via Reading Comprehension},
  author={Daixuan Cheng and Shaohan Huang and Furu Wei},
  booktitle={The Twelfth International Conference on Learning Representations},
  year={2024},
  url={https://openreview.net/forum?id=y886UXPEZ0}
}