EgoExOR Scene Graph Foundation Model
This repository hosts the foundation model for surgical scene graph generation trained on EgoExOR, a multimodal, multi-perspective dataset collected in a simulated operating room (OR) environment.
Operating rooms (ORs) demand precise coordination among surgeons, nurses, and equipment in a fast-paced, occlusion-heavy environment, necessitating advanced perception models to enhance safety and efficiency. Existing datasets provide either partial egocentric views or sparse exocentric multi-view context, but none comprehensively combine the two. We introduce EgoExOR, the first OR dataset and accompanying benchmark to fuse first-person and third-person perspectives. Spanning 94 minutes (84,553 frames at 15 FPS) of two simulated spine procedures, Ultrasound-Guided Needle Insertion and Minimally Invasive Spine Surgery, EgoExOR integrates egocentric data (RGB, gaze, hand tracking, audio) from wearable glasses, exocentric RGB and depth from RGB-D cameras, and ultrasound imagery. Its detailed scene graph annotations, covering 36 entities and 22 relations (~573,000 triplets), enable robust modeling of clinical interactions, supporting tasks such as action recognition and human-centric perception. We evaluate the surgical scene graph generation performance of two adapted state-of-the-art models and offer a new baseline that explicitly leverages EgoExOR's multimodal and multi-perspective signals. Our dataset and benchmark set a new foundation for OR perception, offering a rich, multimodal resource for next-generation clinical perception. Our code is available at the EgoExOR GitHub and the dataset at the EgoExOR Hugging Face Dataset.
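To make the annotation format concrete, the snippet below shows one plausible way a frame-level scene graph could be represented as subject-predicate-object triplets. The entity and predicate names are illustrative placeholders, not the actual EgoExOR label vocabulary (see the dataset card for the real 36 entities and 22 relations).

```python
# Illustrative only: a single frame's scene graph as a list of
# (subject, predicate, object) triplets. The names below are hypothetical
# examples, not the EgoExOR label set.
frame_annotation = {
    "frame_idx": 1200,
    "triplets": [
        ("head_surgeon", "holding", "ultrasound_probe"),
        ("assistant", "close_to", "operating_table"),
    ],
}
```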
🧠 Model Overview
Figure: Overview of the proposed EgoExOR model for surgical scene graph generation. The model employs a dual-branch architecture to separately process egocentric and exocentric modalities. Fused embeddings are passed to a large language model (LLM) to autoregressively generate scene graph triplets representing entities and their interactions.
EgoExOR Model. To fully exploit EgoExOR's rich multi-perspective data, we introduce a new baseline model featuring a dual-branch architecture. The egocentric branch processes first-person RGB, hand pose, and gaze data, while the exocentric branch handles third-person RGB-D, ultrasound recordings, audio, and point clouds. Each branch uses a 2-layer transformer to fuse its inputs into N feature embeddings. These are concatenated and fed into the LLM for triplet prediction. By explicitly separating and fusing perspective-specific features, our model better captures actions and staff interactions, outperforming single-stream baselines in modeling complex OR dynamics.
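The sketch below illustrates the dual-branch fusion idea in PyTorch, assuming pre-extracted per-modality token embeddings. The hidden size, number of query tokens, and the handoff to the LLM are placeholder choices, not the released implementation; refer to the EgoExOR GitHub for the actual code.

```python
import torch
import torch.nn as nn

class PerspectiveBranch(nn.Module):
    """Fuses one perspective's modality tokens (ego or exo) with a
    2-layer transformer encoder and summarizes them into N learned query tokens."""
    def __init__(self, d_model=1024, n_queries=8, n_heads=8):
        super().__init__()
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.queries = nn.Parameter(torch.randn(1, n_queries, d_model))

    def forward(self, modality_tokens):
        # modality_tokens: (B, T, d_model), concatenation of this branch's modalities
        b = modality_tokens.size(0)
        x = torch.cat([self.queries.expand(b, -1, -1), modality_tokens], dim=1)
        x = self.encoder(x)
        # Keep only the fused query positions as the branch's N feature embeddings.
        return x[:, : self.queries.size(1)]

class DualBranchFusion(nn.Module):
    """Ego and exo branches; their outputs are concatenated and would be passed
    to the LLM as context tokens for autoregressive triplet generation."""
    def __init__(self, d_model=1024, n_queries=8):
        super().__init__()
        self.ego_branch = PerspectiveBranch(d_model, n_queries)
        self.exo_branch = PerspectiveBranch(d_model, n_queries)

    def forward(self, ego_tokens, exo_tokens):
        fused = torch.cat(
            [self.ego_branch(ego_tokens), self.exo_branch(exo_tokens)], dim=1
        )
        return fused  # (B, 2 * n_queries, d_model)

# Example: batch of 2 samples, 16 ego tokens and 24 exo tokens of dim 1024.
model = DualBranchFusion()
out = model(torch.randn(2, 16, 1024), torch.randn(2, 24, 1024))
print(out.shape)  # torch.Size([2, 16, 1024])
```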
📊 Benchmark Results
This model outperforms prior single-stream baselines like ORacle and MM2SG by effectively leveraging perspective-specific signals.
| Model | UI F1 | MISS F1 | Overall F1 |
|---|---|---|---|
| ORacle (Baseline) | 0.70 | 0.71 | 0.69 |
| MM2SG (Baseline) | 0.77 | 0.68 | 0.72 |
| EgoExOR (Ours) | 0.86 | 0.70 | 0.79 |

UI: Ultrasound-Guided Needle Insertion; MISS: Minimally Invasive Spine Surgery.
Overall, as the table above shows, the dual-branch EgoExOR model achieves the highest macro F1. Several predicates in EgoExOR rely on understanding transient tool-hand trajectories and fine-grained action cues. This emphasizes the importance of explicitly modeling multiple viewpoints and leveraging all available modalities to improve OR scene understanding.
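For reference, the snippet below sketches one assumed way to compute a per-predicate macro F1 over predicted and ground-truth triplets; the official benchmark script in the GitHub repo defines the exact evaluation protocol.

```python
from collections import defaultdict

def macro_f1(pred_triplets, gt_triplets):
    """Hypothetical macro F1 over predicates. Inputs are per-frame collections of
    (subject, predicate, object) triplets; the official script may differ in detail."""
    stats = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for preds, gts in zip(pred_triplets, gt_triplets):
        preds, gts = set(preds), set(gts)
        for t in preds & gts:
            stats[t[1]]["tp"] += 1
        for t in preds - gts:
            stats[t[1]]["fp"] += 1
        for t in gts - preds:
            stats[t[1]]["fn"] += 1
    f1s = []
    for s in stats.values():
        p = s["tp"] / (s["tp"] + s["fp"]) if s["tp"] + s["fp"] else 0.0
        r = s["tp"] / (s["tp"] + s["fn"]) if s["tp"] + s["fn"] else 0.0
        f1s.append(2 * p * r / (p + r) if p + r else 0.0)
    return sum(f1s) / len(f1s) if f1s else 0.0

# Tiny example with illustrative triplets.
preds = [[("surgeon", "holding", "drill")]]
gts = [[("surgeon", "holding", "drill"), ("nurse", "touching", "table")]]
print(round(macro_f1(preds, gts), 2))  # "holding" F1 = 1.0, "touching" F1 = 0.0 -> 0.5
```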
🗂️ Dataset
EgoExOR provides:
- 84,553 frames (94 mins)
- 2 surgical procedures (Ultrasound Injection & MISS)
- 36 entities, 22 predicates
- Over 573,000 triplets
- Multimodal signals: RGB, depth, gaze, audio, ultrasound, point cloud, hand tracking
You can find the dataset processing tools in the EgoExOR GitHub repo; a minimal download sketch is shown below.
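As a quick start, the snippet below sketches pulling the dataset snapshot from the Hugging Face Hub with huggingface_hub. The repo id is assumed to mirror the model id ardamamur/EgoExOR; check the EgoExOR Hugging Face Dataset page linked below for the actual id and file layout.

```python
# Minimal sketch. Assumption: the dataset is hosted under a repo id matching
# "ardamamur/EgoExOR"; replace it with the id shown on the dataset page.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="ardamamur/EgoExOR",  # assumed id, see the dataset card
    repo_type="dataset",
)
print("Dataset downloaded to:", local_dir)
```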
🔗 Links
- 🖥️ Code: EgoExOR GitHub
- 🤗 Dataset: EgoExOR Hugging Face Dataset
- 🤗 Model Card & Weights: EgoExOR Hugging Face Model
Model tree for ardamamur/EgoExOR
Base model: liuhaotian/llava-v1.5-7b