deepseek-vl-7b-coco-refg-lora-upgraded-bbox
This model is a fine-tuned version of deepseek-ai/deepseek-vl-7b-chat for referring expression grounding on the COCO-RefG (RefCOCOg) dataset.
Model Description
- Task: Referring Expression Grounding (REG) - locating objects in images based on natural language descriptions
- Base Model: DeepSeek-VL 7B Chat
- Training Method: LoRA (Vision + Language) + Custom BBox Head with Position Embeddings
- BBox Head Architecture: EnhancedBoundingBoxHead (a minimal interface sketch follows this list)
- Dataset: COCO-RefG (RefCOCOg)
- Training Samples: 7573
- Validation Samples: 5023
- Output: Bounding box coordinates [x, y, width, height] in normalized format [0,1]
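The bbox head implementation itself (imported as LightweightBBoxHead in the usage example below) is not reproduced here. The snippet that follows is only a minimal, hypothetical stand-in sketching its interface: the language model's final hidden state goes in, four sigmoid-normalized [x, y, width, height] values come out. The internal layers are assumptions; the actual head, per the description above, also uses position embeddings.

import torch
import torch.nn as nn

class MinimalBBoxHead(nn.Module):
    # Hypothetical stand-in: hidden_size -> mid_dim -> 4 normalized box coordinates
    def __init__(self, hidden_size, mid_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_size, mid_dim),
            nn.GELU(),
            nn.Linear(mid_dim, 4),
        )

    def forward(self, hidden_state):
        # Sigmoid keeps x, y, w, h in [0, 1]
        return torch.sigmoid(self.net(hidden_state))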
Performance
| Metric | Value |
|---|---|
| Mean IoU | 0.2567 |
| Accuracy @ IoU 0.5 | 12.5% |
| Accuracy @ IoU 0.75 | 2.3% |
| Evaluated samples | 5023 |
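For reference, the snippet below is a minimal sketch of how Mean IoU and the thresholded accuracies above can be computed when predictions and ground truth are given as normalized [x, y, w, h] boxes. The function names and the aggregation are illustrative, not taken from this repo's evaluation code.

import numpy as np

def iou_xywh(box_a, box_b):
    # Intersection-over-union for two boxes in [x, y, w, h] format
    ax1, ay1, ax2, ay2 = box_a[0], box_a[1], box_a[0] + box_a[2], box_a[1] + box_a[3]
    bx1, by1, bx2, by2 = box_b[0], box_b[1], box_b[0] + box_b[2], box_b[1] + box_b[3]
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

def summarize(predictions, targets):
    # Mean IoU plus accuracy at the 0.5 and 0.75 IoU thresholds
    ious = np.array([iou_xywh(p, t) for p, t in zip(predictions, targets)])
    return {
        "mean_iou": float(ious.mean()),
        "acc@0.5": float((ious >= 0.5).mean()),
        "acc@0.75": float((ious >= 0.75).mean()),
    }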
Usage
import torch
from transformers import AutoModelForCausalLM
from huggingface_hub import snapshot_download
from peft import PeftModel
from deepseek_vl.models import VLChatProcessor
from deepseek_vl.utils.io import load_pil_images
from PIL import Image
# Load the model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
processor = VLChatProcessor.from_pretrained("deepseek-ai/deepseek-vl-7b-chat")
model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/deepseek-vl-7b-chat",
    torch_dtype=torch.float16,
    trust_remote_code=True,
    device_map="auto",
)
# Download this repo's files (LoRA adapter + bbox head) and apply the LoRA weights
repo_dir = snapshot_download("robinn6/deepseek-vl-7b-coco-refg-lora-upgraded-bbox")
model = PeftModel.from_pretrained(model, f"{repo_dir}/lora_weights")
model.eval()
# Load the bbox head
from bbox_head_model import LightweightBBoxHead
bbox_head = LightweightBBoxHead(model.language_model.config.hidden_size, mid_dim=512)
checkpoint = torch.load("robinn6/deepseek-vl-7b-coco-refg-lora-upgraded-bbox/bbox_head/pytorch_model.bin", map_location=device)
bbox_head.load_state_dict(checkpoint['model_state_dict'])
bbox_head.to(device)
bbox_head.eval()
# Prepare input
image_path = "path/to/your/image.jpg"
referring_expression = "the red car on the left"
conversation = [
    {
        "role": "User",
        "content": f"<image_placeholder>Where is {referring_expression}?",
        "images": [image_path],
    },
    {"role": "Assistant", "content": ""},
]
# Process and predict
pil_images = load_pil_images(conversation)
inputs = processor(conversations=conversation, images=pil_images, force_batchify=True)
inputs = inputs.to(device)  # the processor output (BatchedVLChatProcessorOutput) provides .to()
with torch.no_grad():
    embeds = model.prepare_inputs_embeds(**inputs)
    outputs = model.language_model.model(
        inputs_embeds=embeds,
        attention_mask=inputs.attention_mask,
        output_hidden_states=True,
    )
    hidden_states = outputs.hidden_states[-1]
    last_token_hidden = hidden_states[:, -1, :]
    bbox_pred = bbox_head(last_token_hidden.float())  # cast the fp16 hidden state to the head's fp32 weights

# bbox_pred contains [x, y, width, height] in normalized coordinates
x, y, w, h = bbox_pred[0].cpu().numpy()
print(f"Bounding box: x={x:.3f}, y={y:.3f}, w={w:.3f}, h={h:.3f}")
Training Details
- Learning Rate: 1e-4
- Optimizer: AdamW
- Scheduler: MultiStepLR (milestones: [200, 400, 600], gamma: 0.5)
- Loss Function: GIoU Loss (α=3.0) + L1 Loss (β=0.3) (see the sketch after this list)
- Epochs: 3
- Batch Size: 1
- Training Time: 239.8
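The training script itself is not included in this repository. The snippet below is a rough sketch of the objective and schedule listed above: the normalized [x, y, w, h] outputs are converted to corner format for the GIoU term via torchvision, the two terms are weighted by α=3.0 and β=0.3, and AdamW is stepped with the listed MultiStepLR schedule. Variable names are illustrative, bbox_head refers to the head loaded in the usage example, and exactly which parameters received gradients is an assumption.

import torch
import torch.nn.functional as F
from torchvision.ops import box_convert, generalized_box_iou_loss

ALPHA, BETA = 3.0, 0.3  # GIoU and L1 weights from the list above

def bbox_loss(pred_xywh, target_xywh):
    # Combined GIoU + L1 loss on normalized [x, y, w, h] boxes
    pred_xyxy = box_convert(pred_xywh, in_fmt="xywh", out_fmt="xyxy")
    target_xyxy = box_convert(target_xywh, in_fmt="xywh", out_fmt="xyxy")
    giou = generalized_box_iou_loss(pred_xyxy, target_xyxy, reduction="mean")
    l1 = F.l1_loss(pred_xywh, target_xywh)
    return ALPHA * giou + BETA * l1

# AdamW at lr=1e-4 with the MultiStepLR schedule listed above
trainable = [p for p in bbox_head.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[200, 400, 600], gamma=0.5
)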
Limitations
- The model is trained on the COCO-RefG (RefCOCOg) dataset and may not generalize well to other domains
- Performance may vary with the complexity of the referring expression
- Bounding box predictions are normalized to [0, 1] and must be scaled by the image width and height to obtain pixel coordinates
Citation
If you use this model, please cite:
@misc{deepseek_vl_7b_coco_refg_lora_upgraded_bbox_2024,
title={deepseek-vl-7b-coco-refg-lora-upgraded-bbox: DeepSeek-VL Fine-tuned for Referring Expression Grounding},
author={robinn6},
year={2024},
publisher={Hugging Face},
howpublished={\url{https://huggingface.co/robinn6/deepseek-vl-7b-coco-refg-lora-upgraded-bbox}}
}