Synthetic Visual Genome
This repository contains the ROBIN-3B model based on Qwen2.5-3B, introduced in the paper Synthetic Visual Genome. It is designed for scene graph understanding and reasoning over dense visual relationships.
🤗 Checkpoints
- Robin-3b Stage 2 [this repo]: 🤗 hf-model
- Robin-3b Stage 1: TBD
- Robin-3b Stage 0: TBD
🚀 Quick Start: Scene Graph Generation with SAM
Generate a scene graph for each image using Segment Anything (SAM) masks and optional GroundingDINO object regions.
- First, install Segment Anything:
pip install git+https://github.com/facebookresearch/segment-anything.git
- Download all the checkpoints (a scripted alternative is sketched right after this list):
  - ViT-H SAM model
  - Robin-3b:
git clone https://huggingface.co/jamepark3922/robin-qwen2.5-3b
  - CLIP-convnext (open_clip_pytorch_model.bin)
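If you prefer scripting the downloads, here is a minimal sketch using huggingface_hub and urllib. The SAM URL is the standard ViT-H release, and the local paths simply mirror the default layout shown below; the specific CLIP-ConvNeXt checkpoint is not pinned here, so it is left out. Adjust everything to your setup.

import os
import urllib.request
from huggingface_hub import snapshot_download

os.makedirs('checkpoints', exist_ok=True)

# Robin-3b (this repo)
snapshot_download(
    repo_id='jamepark3922/robin-qwen2.5-3b',
    local_dir='checkpoints/robin-qwen2.5-3b-sg-stage2',
)

# ViT-H SAM checkpoint (official Segment Anything release)
urllib.request.urlretrieve(
    'https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth',
    'checkpoints/sam_vit_h_4b8939.pth',
)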
The default path of all the checkpoints:
├── demo
├── checkpoints
│   ├── robin-qwen2.5-3b-sg-stage2
│   └── sam_vit_h_4b8939.pth
└── open_clip_pytorch_model.bin
Note: You might need to change the "mm_vision_tower" field in the robin-3b model's config.json to the absolute path of open_clip_pytorch_model.bin.
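For example, here is a minimal (hedged) way to patch that field, assuming the checkpoints live at the default paths above:

import json, os

# Assumed default locations from the layout above; adjust to your setup.
cfg_path = 'checkpoints/robin-qwen2.5-3b-sg-stage2/config.json'
clip_path = os.path.abspath('open_clip_pytorch_model.bin')

with open(cfg_path) as f:
    cfg = json.load(f)
cfg['mm_vision_tower'] = clip_path  # point the vision tower at the absolute CLIP checkpoint path
with open(cfg_path, 'w') as f:
    json.dump(cfg, f, indent=2)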
Scene Graph Generation for a Single Image 🖼️
Refer to the SyntheticVG repo for the full code to generate a scene graph for a single image using SAM and Robin-3b.
import json

import cv2
import numpy as np
import requests
import torch
from PIL import Image
from segment_anything import sam_model_registry

from svg.pipeline.region_proposal.region_generator import SamGroundingDinoRegionGenerator
from svg.pipeline.grounding.grounding_dino import GroundingDinoSAM
from svg.pipeline.captioning.gpt4o import GPT4Captioner
from svg.pipeline.robin import RobinPipeline
from svg.draw_utils import visualize_masks

# Load an example image
image = Image.open(requests.get('http://farm4.staticflickr.com/3377/3573516590_a1f6cf2cbd_z.jpg', stream=True).raw)
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Load the SAM (ViT-H) checkpoint for mask proposals
sam_ckpt = 'sam_vit_h_4b8939.pth'
sam_model = sam_model_registry["vit_h"](checkpoint=sam_ckpt).to(device)
# Optional: grounding_dino + gpt4o captioner for additional region grounding
print('Loading GroundingDino model...')
grounding_model = GroundingDinoSAM(
"IDEA-Research/grounding-dino-base",
sam_model,
device
)
captioner = GPT4Captioner()
region_generator = SamGroundingDinoRegionGenerator(
sam_model=sam_model,
grounding_model=grounding_model, # None if not using
captioner=captioner
)
regions: list[dict] = region_generator.generate_regions(image, region_mode='merged')
# Generate scene graph from regions
robin_path = 'checkpoints/robin-qwen2.5-3b-sg-stage2'  # path to the downloaded Robin-3b checkpoint
model = RobinPipeline(robin_path, device=device)
sg, _ = model.generate_scene_graph(image, regions)
objects: list[str] = sg['objects']
relations: list[tuple[int, int, str]] = sg['relations']
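# Illustration (assumed format): each relation is taken here to be a
# (subject_index, object_index, predicate) triple indexing into `objects`;
# print the predicted triplets as readable phrases.
for subj_idx, obj_idx, predicate in relations:
    print(f'{objects[subj_idx]} {predicate} {objects[obj_idx]}')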
# Visualize the scene graph
image_rgb = np.array(image)
image_with_masks: np.ndarray = visualize_masks(
    image_rgb, regions,
    draw_bbox=True, draw_mask=True, draw_polygon=False,
    white_padding=50
)
cv2.imwrite('scene_graph.jpg', image_with_masks)

# Save the predicted scene graph
with open('scene_graph.json', 'w') as f:
    json.dump(sg, f, indent=4)
You can also run predict.py to generate a scene graph for a single image:
python predict.py --image_path path/to/image.jpg
BibTeX
If you find this work useful, please consider citing:
@misc{park2025syntheticvisualgenome,
title={Synthetic Visual Genome},
author={Jae Sung Park and Zixian Ma and Linjie Li and Chenhao Zheng and Cheng-Yu Hsieh and Ximing Lu and Khyathi Chandu and Quan Kong and Norimasa Kobori and Ali Farhadi and Yejin Choi and Ranjay Krishna},
year={2025},
eprint={2506.07643},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2506.07643},
}