CulturalPangea-7B Model Card
Grounding Multilingual Multimodal LLMs With Cultural Knowledge
Homepage | CulturalPangea-7B | CulturalGround | GitHub | Arxiv
![CulturalPangea-7B](https://neulab.github.io/CulturalGround/static/img/icons/culturalpangea1.png)
Model Details
- Model: CulturalPangea-7B is an open-source multilingual multimodal LLM fine-tuned to interpret and reason about long-tail cultural entities and concepts. It is designed to bridge the cultural gap often present in MLLMs.
- Date: CulturalPangea-7B was trained in 2025.
- Training Dataset: The model was fine-tuned on the CulturalGround dataset, using 14 million open-ended and 6 million multiple-choice culturally grounded VQA pairs sampled from the full pool of 30M samples (22M open-ended, 8M multiple-choice). This data was interleaved with a substantial portion of the original Pangea instruction data to maintain general abilities.
- Architecture: CulturalPangea-7B is a fine-tuned version of Pangea-7B. It uses a frozen CLIP-ViT vision encoder with a Qwen2-7B-Instruct LLM backbone. During training, only the connector and the language model were fine-tuned (a minimal sketch of this setup is shown below).
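To make the "frozen vision encoder, trainable connector and LLM" setup concrete, here is a minimal PyTorch sketch of the parameter freezing. It assumes LLaVA-style parameter naming (the vision encoder's parameters contain the substring "vision_tower"); this is an illustration, not the project's actual training code.

```python
import torch.nn as nn

def freeze_vision_tower(model: nn.Module) -> None:
    """Freeze the CLIP-ViT vision tower so that only the multimodal
    connector and the Qwen2 language model receive gradient updates.
    The "vision_tower" substring follows LLaVA-style naming conventions
    and is an assumption, not taken from the official training scripts."""
    for name, param in model.named_parameters():
        param.requires_grad = "vision_tower" not in name
```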
Uses
CulturalPangea-7B follows the same architecture and usage patterns as LLaVA-NeXT and Pangea-7B.
Direct Use
First, clone and install the LLaVA-NeXT repository:
git clone https://github.com/LLaVA-VL/LLaVA-NeXT
cd LLaVA-NeXT
pip install -e ".[train]"
Then, you can load CulturalPangea-7B using the following code:
from llava.model.builder import load_pretrained_model
model_path = 'neulab/CulturalPangea-7B'
model_name = 'CulturalPangea-7B-qwen'
args = {"multimodal": True}
tokenizer, model, image_processor, context_len = load_pretrained_model(model_path, None, model_name, **args)
Next, define the helper functions for model inference:
import re
from typing import Dict

import torch
import transformers
from PIL import Image

from llava.constants import (
    IGNORE_INDEX,
    IMAGE_TOKEN_INDEX,
    DEFAULT_IMAGE_TOKEN,
    DEFAULT_IM_START_TOKEN,
    DEFAULT_IM_END_TOKEN,
)
from llava.utils import disable_torch_init
def preprocess_qwen(sources, tokenizer: transformers.PreTrainedTokenizer, has_image: bool = False, max_len=2048, system_message: str = "You are a helpful assistant.") -> Dict:
    # Build Qwen (ChatML-style) chat-formatted token ids, replacing each <image>
    # placeholder in the conversation with the special IMAGE_TOKEN_INDEX id.
    roles = {"human": "<|im_start|>user", "gpt": "<|im_start|>assistant"}
    im_start, im_end = tokenizer.additional_special_tokens_ids
    nl_tokens = tokenizer("\n").input_ids
    _system = tokenizer("system").input_ids + nl_tokens
    _user = tokenizer("user").input_ids + nl_tokens
    _assistant = tokenizer("assistant").input_ids + nl_tokens
    input_ids = []
    source = sources
    # Drop a leading non-human turn so the conversation starts with the user.
    if roles[source[0]["from"]] != roles["human"]:
        source = source[1:]
    input_id, target = [], []
    # System turn: <|im_start|>system\n{system_message}<|im_end|>\n
    system = [im_start] + _system + tokenizer(system_message).input_ids + [im_end] + nl_tokens
    input_id += system
    target += [im_start] + [IGNORE_INDEX] * (len(system) - 3) + [im_end] + nl_tokens
    assert len(input_id) == len(target)
    for j, sentence in enumerate(source):
        role = roles[sentence["from"]]
        if has_image and sentence["value"] is not None and "<image>" in sentence["value"]:
            # Split the text around <image> tags and insert IMAGE_TOKEN_INDEX for each image.
            num_image = len(re.findall(DEFAULT_IMAGE_TOKEN, sentence["value"]))
            texts = sentence["value"].split('<image>')
            _input_id = tokenizer(role).input_ids + nl_tokens
            for i, text in enumerate(texts):
                _input_id += tokenizer(text).input_ids
                if i < len(texts) - 1:
                    _input_id += [IMAGE_TOKEN_INDEX] + nl_tokens
            _input_id += [im_end] + nl_tokens
            assert sum([i == IMAGE_TOKEN_INDEX for i in _input_id]) == num_image
        else:
            # A turn with value=None (e.g. the empty assistant turn) leaves the
            # assistant role open for generation.
            if sentence["value"] is None:
                _input_id = tokenizer(role).input_ids + nl_tokens
            else:
                _input_id = tokenizer(role).input_ids + nl_tokens + tokenizer(sentence["value"]).input_ids + [im_end] + nl_tokens
        input_id += _input_id
    input_ids.append(input_id)
    return torch.tensor(input_ids, dtype=torch.long)
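For reference, the sequence built by preprocess_qwen follows the Qwen (ChatML) chat format. Rendered as text it looks roughly like the string below; this is only a sketch, since in the actual token sequence the <image> slot is the special IMAGE_TOKEN_INDEX id rather than a literal string, and the question is just an example.

```python
# Rough text rendering of the prompt that preprocess_qwen tokenizes (sketch only).
EXAMPLE_CHAT_PROMPT = (
    "<|im_start|>system\n"
    "You are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\n"
    "<image>\n"
    "What cultural significance does the landmark in the image hold?<|im_end|>\n"
    "<|im_start|>assistant\n"
)
```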
def generate_output(prompt, image=None, do_sample=False, temperature=0, top_p=0.5, num_beams=1, max_new_tokens=1024):
    image_tensors = []
    # Prepend the image placeholder expected by preprocess_qwen.
    prompt = "<image>\n" + prompt
    # image can be a path to a local file or a PIL image
    if isinstance(image, str):
        image = Image.open(image)
    image_tensor = image_processor.preprocess(image, return_tensors='pt')['pixel_values']
    image_tensors.append(image_tensor.half().cuda())
    input_ids = preprocess_qwen([{'from': 'human', 'value': prompt}, {'from': 'gpt', 'value': None}], tokenizer, has_image=True).cuda()
    with torch.inference_mode():
        output_ids = model.generate(
            input_ids,
            images=image_tensors,
            do_sample=do_sample,
            temperature=temperature,
            top_p=top_p,
            num_beams=num_beams,
            max_new_tokens=max_new_tokens,
            use_cache=True
        )
    outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]
    outputs = outputs.strip()
    return outputs
An example of multimodal inference:
prompt = "What cultural significance does the landmark in the image hold?"
image = "image.png"
print(generate_output(prompt, image=image))
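The helper also exposes the usual decoding parameters, so you can switch from greedy decoding to sampling for more varied answers. The values below are illustrative assumptions, not settings recommended by the model authors.

```python
# Illustrative only: stochastic decoding via the parameters generate_output
# already exposes; the prompt, file name, and values here are assumptions.
print(generate_output(
    "Describe the festival shown in the image and its cultural context.",
    image="image.png",
    do_sample=True,
    temperature=0.7,
    max_new_tokens=256,
))
```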
Citing the Model
If you use CulturalPangea or the CulturalGround dataset, please cite our work:
@preprint{nyandwi2025grounding,
title={Grounding Multilingual Multimodal LLMs With Cultural Knowledge},
author={Nyandwi, Jean de Dieu and Song, Yueqi and Khanuja, Simran and Neubig, Graham},
year={2025}
}