deepseek-vl-7b-coco-refg-lora-upgraded-bbox

This model is a fine-tuned version of deepseek-ai/deepseek-vl-7b-chat for referring expression grounding on the COCO-RefG (RefCOCOg) dataset.

Model Description

Task: Referring Expression Grounding (REG) - locating objects in images based on natural language descriptions
Base Model: DeepSeek-VL 7B Chat
Training Method: LoRA (Vision + Language) + Custom BBox Head with Position Embeddings
BBox Head Architecture: EnhancedBoundingBoxHead
Dataset: COCO-RefG (RefCOCOg)
Training Samples: 7573
Validation Samples: 5023
Output: Bounding box coordinates [x, y, width, height], normalized to the range [0, 1]
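To draw or crop a predicted box, the normalized output has to be scaled by the image size. The following is a minimal sketch, assuming that x and y denote the top-left corner of the box (the COCO convention; this card does not state it explicitly). The helper name to_pixel_xyxy is hypothetical and not part of this repository.

```python
def to_pixel_xyxy(bbox, image_width, image_height):
    """Convert a normalized [x, y, width, height] box to pixel (x1, y1, x2, y2).

    Assumption: (x, y) is the top-left corner, as in COCO annotations.
    """
    x, y, w, h = bbox
    # Clamp to [0, 1] so a slightly out-of-range prediction stays inside the image
    x1 = min(max(x, 0.0), 1.0)
    y1 = min(max(y, 0.0), 1.0)
    x2 = min(max(x + w, 0.0), 1.0)
    y2 = min(max(y + h, 0.0), 1.0)
    return (
        round(x1 * image_width),
        round(y1 * image_height),
        round(x2 * image_width),
        round(y2 * image_height),
    )


# Example on a 640x480 image
print(to_pixel_xyxy([0.25, 0.5, 0.5, 0.25], 640, 480))  # (160, 240, 480, 360)
```

The corner format (x1, y1, x2, y2) is what most drawing and cropping routines expect, which is why the sketch converts away from width and height.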

Performance

Metric	Value
Mean IoU	0.2567
Accuracy @ IoU 0.5	12.5%
Accuracy @ IoU 0.75	2.3%
Evaluated Samples	5023
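
The evaluation code is not included in this card. As a rough sketch of how metrics of this kind are typically computed, assuming boxes in [x, y, width, height] format with (x, y) as the top-left corner, and assuming a prediction counts as correct when its IoU is greater than or equal to the threshold. The function names iou_xywh and summarize are hypothetical.

```python
def iou_xywh(pred, target):
    """Intersection over union of two [x, y, width, height] boxes."""
    px1, py1, px2, py2 = pred[0], pred[1], pred[0] + pred[2], pred[1] + pred[3]
    tx1, ty1, tx2, ty2 = target[0], target[1], target[0] + target[2], target[1] + target[3]
    # Overlap along each axis, zero when the boxes do not intersect
    inter_w = max(0.0, min(px2, tx2) - max(px1, tx1))
    inter_h = max(0.0, min(py2, ty2) - max(py1, ty1))
    inter = inter_w * inter_h
    union = pred[2] * pred[3] + target[2] * target[3] - inter
    return inter / union if union > 0 else 0.0


def summarize(preds, targets, thresholds=(0.5, 0.75)):
    """Mean IoU plus the fraction of samples at or above each IoU threshold."""
    ious = [iou_xywh(p, t) for p, t in zip(preds, targets)]
    result = {"mean_iou": sum(ious) / len(ious)}
    for thr in thresholds:
        result[f"acc@{thr}"] = sum(i >= thr for i in ious) / len(ious)
    return result


# Toy data: one perfect prediction and one that misses the target entirely
preds = [[0.0, 0.0, 0.5, 0.5], [0.5, 0.5, 0.5, 0.5]]
targets = [[0.0, 0.0, 0.5, 0.5], [0.0, 0.0, 0.5, 0.5]]
print(summarize(preds, targets))
```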

Usage

import torch
from transformers import AutoModelForCausalLM
from deepseek_vl.models import VLChatProcessor
from deepseek_vl.utils.io import load_pil_images

# Load the model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
processor = VLChatProcessor.from_pretrained("deepseek-ai/deepseek-vl-7b-chat")
model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/deepseek-vl-7b-chat",
    torch_dtype=torch.float16,
    trust_remote_code=True,
    device_map="auto"
)

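# Load the LoRA weights (set lora_applied = False to run without the adapter)
lora_applied = True
if lora_applied:
    from peft import PeftModel
    # The adapter is stored in the "lora_weights" subfolder of the model repository
    model = PeftModel.from_pretrained(
        model,
        "robinn6/deepseek-vl-7b-coco-refg-lora-upgraded-bbox",
        subfolder="lora_weights",
    )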

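# Load the bbox head (bbox_head_model.py must be importable from the working directory)
from huggingface_hub import hf_hub_download
from bbox_head_model import LightweightBBoxHead
bbox_head = LightweightBBoxHead(model.language_model.config.hidden_size, mid_dim=512)
# torch.load needs a local file, so fetch the checkpoint from the Hub first
checkpoint_path = hf_hub_download(
    repo_id="robinn6/deepseek-vl-7b-coco-refg-lora-upgraded-bbox",
    filename="bbox_head/pytorch_model.bin",
)
checkpoint = torch.load(checkpoint_path, map_location=device)
bbox_head.load_state_dict(checkpoint['model_state_dict'])
bbox_head.to(device)
bbox_head.eval()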

# Prepare input
image_path = "path/to/your/image.jpg"
referring_expression = "the red car on the left"

conversation = [
    {
        "role": "User",
        "content": f"<image_placeholder>Where is {referring_expression}?",
        "images": [image_path],
    },
    {"role": "Assistant", "content": ""},
]

# Process and predict
pil_images = load_pil_images(conversation)
# The processor returns a batch object with its own .to() method;
# pixel values are cast to float16 to match the model dtype
inputs = processor(conversations=conversation, images=pil_images, force_batchify=True)
inputs = inputs.to(device, dtype=torch.float16)

with torch.no_grad():
    embeds = model.prepare_inputs_embeds(**inputs)
    outputs = model.language_model.model(
        inputs_embeds=embeds,
        attention_mask=inputs.attention_mask,
        output_hidden_states=True
    )

    # The final-layer hidden state of the last token is the input to the bbox head
    hidden_states = outputs.hidden_states[-1]
    last_token_hidden = hidden_states[:, -1, :]
    # Cast to float32 to match the bbox head's parameters
    bbox_pred = bbox_head(last_token_hidden.float())

    # bbox_pred contains [x, y, width, height] in normalized coordinates
    x, y, w, h = bbox_pred[0].cpu().numpy()
    print(f"Bounding box: x={x:.3f}, y={y:.3f}, w={w:.3f}, h={h:.3f}")

Training Details

Learning Rate: 1e-4
Optimizer: AdamW
Scheduler: MultiStepLR (milestones: [200, 400, 600], gamma: 0.5)
Loss Function: GIoU Loss (α=3.0) + L1 Loss (β=0.3)
Epochs: 3
Batch Size: 1
Training Time: 239.8
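
The training script is not part of this card, so the following is only a sketch of how the listed loss and schedule could fit together, not the actual implementation. Assumptions: boxes are [x, y, width, height] with a top-left origin, the L1 term is averaged over the four coordinates, and the scheduler is stepped once per optimizer step (the milestones exceed the 3 training epochs, so they are read here as step counts). All function names are hypothetical.

```python
def giou_loss_xywh(pred, target):
    """1 - GIoU for two [x, y, width, height] boxes with positive area."""
    px1, py1, px2, py2 = pred[0], pred[1], pred[0] + pred[2], pred[1] + pred[3]
    tx1, ty1, tx2, ty2 = target[0], target[1], target[0] + target[2], target[1] + target[3]
    inter_w = max(0.0, min(px2, tx2) - max(px1, tx1))
    inter_h = max(0.0, min(py2, ty2) - max(py1, ty1))
    inter = inter_w * inter_h
    union = pred[2] * pred[3] + target[2] * target[3] - inter
    # Area of the smallest axis-aligned box that encloses both boxes
    enclose = (max(px2, tx2) - min(px1, tx1)) * (max(py2, ty2) - min(py1, ty1))
    giou = inter / union - (enclose - union) / enclose
    return 1.0 - giou


def combined_loss(pred, target, alpha=3.0, beta=0.3):
    """alpha * GIoU loss + beta * L1 loss (L1 averaged over coordinates: an assumption)."""
    l1 = sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)
    return alpha * giou_loss_xywh(pred, target) + beta * l1


def lr_at_step(step, base_lr=1e-4, milestones=(200, 400, 600), gamma=0.5):
    """Learning rate under a MultiStepLR-style schedule: multiply by gamma at each milestone."""
    passed = sum(step >= m for m in milestones)
    return base_lr * gamma ** passed


print(combined_loss([0.0, 0.0, 0.25, 0.25], [0.75, 0.75, 0.25, 0.25]))
print([lr_at_step(s) for s in (0, 200, 400, 600)])
```

Unlike plain IoU, the GIoU term still provides a useful signal when the predicted and target boxes do not overlap at all, because it penalizes the empty space in their enclosing box.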

Limitations

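The model is trained on the COCO-RefG dataset and may not generalize well to other domains
Localization accuracy is low: 12.5% of predictions reach IoU 0.5 and 2.3% reach IoU 0.75 on the 5023 evaluated samples, and results may vary with the complexity of the referring expression
Bounding box predictions are in normalized coordinates [0,1] and must be scaled by the image size to obtain pixel coordinates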

Citation

If you use this model, please cite:

@misc{deepseek_vl_7b_coco_refg_lora_upgraded_bbox_2024,
  title={deepseek-vl-7b-coco-refg-lora-upgraded-bbox: DeepSeek-VL Fine-tuned for Referring Expression Grounding},
  author={robinn6},
  year={2024},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/robinn6/deepseek-vl-7b-coco-refg-lora-upgraded-bbox}}
}
