Adapting Multimodal Large Language Models to Domains via Post-Training
This repo contains the biomedicine MLLM developed from gemma-3-4b-it in our paper: On Domain-Adaptive Post-Training for Multimodal Large Language Models. The corresponding training dataset is in biomed-visual-instructions.
The main project page is: Adapt-MLLM-to-Domains
1. To Chat with AdaMLLM
Our model architecture aligns with the base model: gemma-3-4b-it. We provide a usage example below; for more advanced usage, refer to the official google/gemma-3-4b-it model card.
Note: For AdaMLLM, always place the image at the beginning of the input instruction in the messages.
Below are some code snippets to help you get started quickly with running the model. First, install the Transformers library; Gemma 3 is supported starting from transformers 4.50.0.
$ pip install -U transformers
Then, copy the snippet from the section that is relevant for your use case.
Running with the pipeline API
You can initialize the model and processor for inference with pipeline as follows.
from transformers import pipeline
import torch

pipe = pipeline(
    "image-text-to-text",
    model="AdaptLLM/biomed-gemma-3-4b-it",
    device="cuda",
    torch_dtype=torch.bfloat16
)
With instruction-tuned models, you need to use chat templates to process your inputs first. Then, you can pass them to the pipeline.
messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a helpful assistant."}]
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    }
]
output = pipe(text=messages, max_new_tokens=200)
print(output[0]["generated_text"][-1]["content"])
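If you prefer to run inference without the pipeline, the sketch below loads the processor and model directly with AutoProcessor and Gemma3ForConditionalGeneration, following the standard Transformers API for Gemma 3. It reuses the messages defined above; the device placement, dtype, and generation settings are assumptions you may need to adjust for your hardware.

from transformers import AutoProcessor, Gemma3ForConditionalGeneration
import torch

model_id = "AdaptLLM/biomed-gemma-3-4b-it"

# Load the processor and the model in bfloat16 (adjust device_map/dtype to your setup)
processor = AutoProcessor.from_pretrained(model_id)
model = Gemma3ForConditionalGeneration.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.bfloat16
).eval()

# Apply the chat template to the `messages` defined above; the processor
# fetches the image from its URL and tokenizes the full prompt.
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device, dtype=torch.bfloat16)

# Generate, then decode only the newly generated tokens
input_len = inputs["input_ids"].shape[-1]
with torch.inference_mode():
    generation = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(processor.decode(generation[0][input_len:], skip_special_tokens=True))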
2. Domain-Specific Benchmarks
We provide biomed-VQA-benchmark to evaluate any MLLM.
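As a starting point, the benchmark can be loaded with the datasets library. The snippet below is a minimal sketch: it assumes the benchmark is hosted under AdaptLLM/biomed-VQA-benchmark and uses a placeholder subset name; check the dataset card for the exact config and split names.

from datasets import load_dataset

# "SLAKE" is a placeholder subset name; see the dataset card for the available configs/splits.
benchmark = load_dataset("AdaptLLM/biomed-VQA-benchmark", name="SLAKE", split="test")
print(benchmark[0])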
3. To Reproduce this Domain-Adapted MLLM
Using our training data, biomed-visual-instructions, you can easily reproduce our models with the LLaMA-Factory repository.
For reference, we train from google/gemma-3-4b-it for 1 epoch with a learning rate of 1e-5, and a global batch size of 128.
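To inspect the training data before launching a run, you can load it with the datasets library. This is a minimal sketch that assumes the dataset is hosted under AdaptLLM/biomed-visual-instructions; check the dataset card for the exact config and split names.

from datasets import load_dataset

# Assumed repo id for the training data; see the dataset card for configs/splits.
train_data = load_dataset("AdaptLLM/biomed-visual-instructions", split="train")
print(train_data[0])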
Citation
If you find our work helpful, please cite us.
@article{adamllm,
  title={On Domain-Specific Post-Training for Multimodal Large Language Models},
  author={Cheng, Daixuan and Huang, Shaohan and Zhu, Ziyu and Zhang, Xintong and Zhao, Wayne Xin and Luan, Zhongzhi and Dai, Bo and Zhang, Zhenliang},
  journal={arXiv preprint arXiv:2411.19930},
  year={2024}
}
Adapt LLM to Domains (ICLR 2024)
@inproceedings{cheng2024adapting,
  title={Adapting Large Language Models via Reading Comprehension},
  author={Daixuan Cheng and Shaohan Huang and Furu Wei},
  booktitle={The Twelfth International Conference on Learning Representations},
  year={2024},
  url={https://openreview.net/forum?id=y886UXPEZ0}
}