VRSight Object Detection Model

Fine-tuned YOLOv8n model for detecting UI elements and interactive objects in virtual reality environments. This model powers the VRSight system, a post hoc 3D screen reader for blind and low vision VR users.

Model Weights: best.pt (available in the Files tab)
Full System: github.com/MadisonAbilityLab/VRSight
Paper: VRSight (UIST 2025)
Training Dataset: UWMadAbility/DISCOVR

Developed by: Daniel Killough, Justin Feng, Zheng Xue Ching, Daniel Wang, Rithvik Dyava, Yapeng Tian*, Yuhang Zhao
Affiliations: University of Wisconsin-Madison, *University of Texas at Dallas

Quick Start

Installation & Download

pip install ultralytics

# Download model weights
wget -O best.pt https://huggingface.co/UWMadAbility/VRSight/resolve/main/best.pt
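
Alternatively, the weights can be fetched with the huggingface_hub client. A minimal sketch; the repo_id and filename match the download URL above:

from huggingface_hub import hf_hub_download

# Downloads best.pt into the local Hugging Face cache and returns its path
weights_path = hf_hub_download(repo_id="UWMadAbility/VRSight", filename="best.pt")
print(weights_path)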

Basic Usage

from ultralytics import YOLO

# Load model
model = YOLO('best.pt')

# Run inference on VR screenshot
results = model('vr_screenshot.jpg')

# Process results
for result in results:
    boxes = result.boxes
    for box in boxes:
        class_id = int(box.cls[0])
        confidence = float(box.conf[0])
        bbox = box.xyxy[0].tolist()
        
        print(f"Class: {model.names[class_id]}")
        print(f"Confidence: {confidence:.2f}")
        print(f"BBox: {bbox}")

Batch Processing

results = model.predict(
    source='vr_screenshots/',
    save=True,
    conf=0.25,
    device='0'  # GPU 0, or 'cpu'
)
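
For near-real-time use (e.g., feeding a continuous description loop), Ultralytics also supports generator-style streaming inference. A minimal sketch; the video path is a placeholder:

from ultralytics import YOLO

model = YOLO('best.pt')

# Stream results one frame at a time instead of collecting them all in memory
for result in model.predict(source='vr_capture.mp4', stream=True, conf=0.25):
    for box in result.boxes:
        print(model.names[int(box.cls[0])], float(box.conf[0]))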

Model Details

Architecture

  • Base: YOLOv8n (Nano variant - optimized for real-time performance)
  • Input: 640×640 pixels
  • Output: Bounding boxes with class predictions and confidence scores
  • Classes: 30 VR object types across 6 categories
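
As a quick sanity check, the class labels are stored in the checkpoint itself and can be listed directly (model.names is standard Ultralytics API; the count should match the 30 classes listed below):

from ultralytics import YOLO

model = YOLO('best.pt')

# model.names maps class ids to the VR object labels baked into the checkpoint
print(len(model.names))  # expected: 30
for class_id, name in model.names.items():
    print(class_id, name)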

Performance

Metric             Test Set
mAP@50             67.3%
mAP@75             49.5%
mAP                46.3%
Inference Speed    ~20-30+ FPS

Key Finding: Base YOLOv8n trained on COCO rarely detected VR objects, demonstrating the necessity of VR-specific training data. See Table 1 in the paper for per-class metrics.
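
Inference speed depends heavily on hardware and input resolution. A rough way to measure FPS on your own setup (single-image loop, with a warm-up run excluded from timing):

import time
from ultralytics import YOLO

model = YOLO('best.pt')
model('vr_screenshot.jpg')  # warm-up run

n = 50
start = time.perf_counter()
for _ in range(n):
    model('vr_screenshot.jpg', verbose=False)
elapsed = time.perf_counter() - start
print(f"~{n / elapsed:.1f} FPS")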

Object Classes (30 Total)

The model detects 6 categories of VR objects:

Avatars: avatar, avatar-nonhuman, chat-bubble, chat-box
Informational: sign-text, ui-text, sign-graphic, menu, ui-graphic, progress-bar, hud, indicator-mute
Interactables: interactable, button, target, portal, writing-utensil, watch, writing-surface, spawner
Safety: guardian, out-of-bounds
Seating: seat-single, table, seat-multiple, campfire
VR System: hand, controller, dashboard, locomotion-target

See the paper (Table 1) for detailed descriptions and per-class performance.
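
If a downstream application only cares about certain categories (e.g., announcing Safety objects first), detections can be grouped using the category lists above. A minimal sketch showing two of the six categories:

from ultralytics import YOLO

# Category membership copied from the class list above
CATEGORIES = {
    'safety': {'guardian', 'out-of-bounds'},
    'interactables': {'interactable', 'button', 'target', 'portal',
                      'writing-utensil', 'watch', 'writing-surface', 'spawner'},
}

model = YOLO('best.pt')
result = model('vr_screenshot.jpg')[0]

for box in result.boxes:
    name = model.names[int(box.cls[0])]
    if name in CATEGORIES['safety']:
        print(f"Safety object detected: {name} ({float(box.conf[0]):.2f})")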

Training Details

Dataset

  • DISCOVR: 17,691 labeled images from 17 social VR apps
  • Train: 15,207 images | Val: 1,645 images | Test: 839 images
  • Augmentation: Horizontal/vertical flips, rotation, scaling, shearing, HSV jittering

Training Configuration

  • GPU: NVIDIA A100
  • Epochs: 250
  • Image Size: 640×640
  • Method: Fine-tuned from YOLOv8n pretrained weights
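
A sketch of reproducing the fine-tune with the Ultralytics trainer. The dataset YAML path is a placeholder, and hyperparameters beyond those listed above (batch size, learning rate, augmentation settings) are not specified here:

from ultralytics import YOLO

# Start from pretrained YOLOv8n weights and fine-tune on DISCOVR
model = YOLO('yolov8n.pt')
model.train(
    data='discovr.yaml',   # placeholder: YAML describing the DISCOVR train/val/test splits
    epochs=250,
    imgsz=640,
    device=0,              # A100 in the original setup; any CUDA GPU or 'cpu' works
)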

VRSight System Integration

This model is one component of the complete VRSight system, which combines:

  • This object detection model (detects VR objects)
  • Depth estimation (DepthAnythingV2)
  • GPT-4o (scene atmosphere and detailed descriptions)
  • OCR (text reading)
  • Spatial audio (TTS routed to a WebVR app, e.g., PlayCanvas)

To use the full VRSight system, see the GitHub repository.
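
As a rough illustration of how the detection model slots into such a pipeline, the sketch below uses hypothetical placeholder functions for the other components; only the YOLO calls reflect this model's real API, and the actual system is in the GitHub repository:

from ultralytics import YOLO

model = YOLO('best.pt')

def estimate_depth(frame_path, detections):
    # Placeholder for DepthAnythingV2: returns a dummy depth per detection
    return [1.0 for _ in detections]

def describe_scene(frame_path, detections):
    # Placeholder for GPT-4o scene/atmosphere description
    return "scene description placeholder"

def speak_spatially(detections, depths, description):
    # Placeholder for TTS + spatial audio output (e.g., a WebVR app)
    for (name, bbox), depth in zip(detections, depths):
        print(f"{name} at {bbox}, depth {depth}")
    print(description)

def process_frame(frame_path):
    result = model(frame_path)[0]  # this model: VR object detection
    detections = [(model.names[int(box.cls[0])], box.xyxy[0].tolist())
                  for box in result.boxes]
    speak_spatially(detections,
                    estimate_depth(frame_path, detections),
                    describe_scene(frame_path, detections))

process_frame('vr_screenshot.jpg')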

Limitations

  • VR-specific: Trained on social VR apps; performance varies on other types of VR content
  • Lighting: Reduced accuracy in dark environments
  • Coverage: 30 classes cover common social VR objects but not all possible VR elements
  • Application types: Best performance in social VR; may struggle with faster-paced games

See Section 7.2 of the paper for detailed discussion.

Citation

Please cite the VRSight paper when using this model or the DISCOVR dataset:

@inproceedings{killough2025vrsight,
  title={VRSight: An AI-Driven Scene Description System to Improve Virtual Reality Accessibility for Blind People},
  author={Killough, Daniel and Feng, Justin and Ching, Zheng Xue and Wang, Daniel and Dyava, Rithvik and Tian, Yapeng and Zhao, Yuhang},
  booktitle={Proceedings of the 38th Annual ACM Symposium on User Interface Software and Technology},
  pages={1--17},
  year={2025},
  publisher={ACM},
  address={Busan, Republic of Korea},
  doi={10.1145/3746059.3747641}
}

License

CC BY 4.0 - Free to use with attribution

