SuperOcc: Toward Cohesive Temporal Modeling for Superquadric-based Occupancy Prediction
Paper • 2601.15644 • Published
A modular, end-to-end pipeline for reconstructing semantically consistent 3D traffic scenes from a single RGB image. Designed for near real-time inference (≥15 FPS on RTX 3090) with all GNN components under 500K parameters.
```
RGB Image (H×W×3)
           │
           ▼
┌─────────────────────────┐
│ Stage 1: Input          │
│ Augmentation            │ → 5-channel tensor [RGB + Positional + Edge]
│ • Positional Encoding   │
│ • Sobel/Canny Edges     │
└──────────┬──────────────┘
           │
           ▼
┌─────────────────────────┐
│ Stage 2: Segmentation   │
│ • Lightweight UNet      │ → Semantic map S (H×W×K)
│ • Edge Weighting        │ → S'(x,y) = S(x,y) * (1 + α*C(x,y))
│ • Boundary Head (SBCB)  │
└──────────┬──────────────┘
           │
           ▼
┌─────────────────────────┐
│ Stage 3: Primitives     │
│ • Connected Components  │ → Cuboids, Cylinders, Cones, Planes
│ • PCA-based Fitting     │ → Scene Graph (nodes + edges)
│ • Graph Construction    │
└──────────┬──────────────┘
           │
           ▼
┌─────────────────────────┐
│ Stage 4: GNN            │
│ • GraphSAGE / GATv2     │ → Refined relational features
│ • Edge Feature Inject   │ → Improved spatial consistency
│ • LayerNorm + Dropout   │
└──────────┬──────────────┘
           │
           ▼
┌─────────────────────────┐
│ Stage 5: Point Cloud    │
│ • Surface Sampling      │ → 2K-20K 3D points
│ • Gaussian Noise        │ → Class/Instance/Primitive labels
│ • PLY Export            │ → Optional GNN features per point
└─────────────────────────┘
```
```bash
pip install torch torchvision torch_geometric scipy scikit-learn numpy
```
```python
import torch
from traffic3d.models.pipeline import Traffic3DPipeline

# Initialize pipeline
pipeline = Traffic3DPipeline(
    num_classes=19,            # Cityscapes classes
    base_ch=32,                # Lightweight UNet (4.3M params)
    gnn_type='sage',           # or 'gat', 'hybrid'
    edge_method='sobel',       # or 'canny'
    points_per_primitive=512,
)

# Forward pass
rgb = torch.randint(0, 256, (1, 3, 512, 1024), dtype=torch.uint8)
results = pipeline(rgb, training=False)

# Access outputs
segmentation = results['seg_outputs']['segmentation']  # [1, 512, 1024]
primitives = results['primitives'][0]                  # List of Primitive objects
point_cloud = results['point_clouds'][0]               # PointCloudOutput

# Save point cloud
from traffic3d.models.point_cloud import PointCloudGenerator
PointCloudGenerator.save_ply(point_cloud, 'scene.ply')
```
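For reference, writing labeled points to ASCII PLY takes only a few lines. The writer below is a hypothetical minimal stand-in (points plus a single integer `label` property), not the repo's `save_ply` implementation:

```python
import numpy as np

def write_ply_ascii(path, points, labels=None):
    """Minimal ASCII PLY writer: points is (N, 3) float, labels is (N,) int."""
    n = len(points)
    props = ["property float x", "property float y", "property float z"]
    if labels is not None:
        props.append("property int label")
    header = "\n".join(
        ["ply", "format ascii 1.0", f"element vertex {n}"] + props + ["end_header"]
    )
    with open(path, "w") as f:
        f.write(header + "\n")
        for i in range(n):
            row = "{:.6f} {:.6f} {:.6f}".format(*points[i])
            if labels is not None:
                row += f" {int(labels[i])}"
            f.write(row + "\n")

pts = np.random.rand(100, 3).astype(np.float32)
write_ply_ascii("scene_demo.ply", pts, labels=np.zeros(100, dtype=np.int64))
```

The resulting file opens directly in MeshLab or CloudCompare.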
| Channel | Description | Purpose |
|---|---|---|
| 0-2 | RGB (normalized) | Visual features |
| 3 | Positional Encoding P(x,y) | Vertical depth prior (top=far, bottom=near) |
| 4 | Edge Confidence C(x,y) | Boundary detection for edge weighting |
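The augmentation stage can be sketched in plain PyTorch. The version below is illustrative only (Sobel-only edges, per-image max normalization is an assumption), not the repo's `input_augmentation` module:

```python
import torch
import torch.nn.functional as F

def augment_input(rgb):
    """Build the 5-channel input from a uint8 RGB tensor of shape (B, 3, H, W)."""
    b, _, h, w = rgb.shape
    x = rgb.float() / 255.0                              # channels 0-2: normalized RGB
    # Channel 3: vertical positional prior, 0 at the top (far), 1 at the bottom (near)
    pos = torch.linspace(0.0, 1.0, h).view(1, 1, h, 1).expand(b, 1, h, w)
    # Channel 4: Sobel gradient magnitude on the grayscale image, scaled to [0, 1]
    gray = x.mean(dim=1, keepdim=True)
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    gx = F.conv2d(gray, kx, padding=1)
    gy = F.conv2d(gray, ky, padding=1)
    edge = torch.sqrt(gx ** 2 + gy ** 2)
    edge = edge / edge.amax(dim=(2, 3), keepdim=True).clamp(min=1e-6)
    return torch.cat([x, pos, edge], dim=1)              # (B, 5, H, W)

out = augment_input(torch.randint(0, 256, (1, 3, 64, 128), dtype=torch.uint8))
```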
Lightweight UNet with edge weighting and auxiliary boundary supervision (SBCB-style, zero inference overhead):

S'(x,y) = S(x,y) * (1 + α * C(x,y))

L_total = L_ce_edge + λ * L_boundary (λ = 0.4)

| Object Type | Primitive | Fitting Method |
|---|---|---|
| Vehicles/Buildings | Cuboid | PCA-based orientation |
| Pedestrians | Cylinder | Bounding extent |
| Trees | Cone | Bounding extent |
| Road/Sky | Plane | PCA normal estimation |
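PCA-based fitting for cuboids reduces to an eigendecomposition of the point covariance. The sketch below is illustrative (function name and conventions are assumptions), not the repo's `primitive_extraction` code:

```python
import numpy as np

def fit_cuboid_pca(points):
    """Estimate centroid, orientation (principal axes), and extents of an (N, 3) point set."""
    centroid = points.mean(axis=0)
    centered = points - centroid
    # Principal axes from the covariance eigenvectors
    eigvals, eigvecs = np.linalg.eigh(np.cov(centered.T))
    axes = eigvecs[:, ::-1]                    # columns sorted by variance, descending
    # Extents: range of the points projected onto each axis
    proj = centered @ axes
    size = proj.max(axis=0) - proj.min(axis=0)
    return centroid, axes, size

# Axis-aligned box of points: extents should come back near (4, 2, 1)
np.random.seed(0)
pts = np.random.rand(2000, 3) * np.array([4.0, 2.0, 1.0])
centroid, axes, size = fit_cuboid_pca(pts)
```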
Node features (26-D): [class_embedding(16), centroid(3), size(3), orientation(4)]

Edge features (5-D): [distance, adjacency_flag, relative_position(3)]

| Model | Architecture | Parameters | Description |
|---|---|---|---|
| GraphSAGE | EdgeAwareSAGEConv × 2 | ~29K | Custom MessagePassing with edge injection |
| GATv2 | GATv2Conv (4-head + 1-head) | ~29K | Dynamic attention with native edge_dim |
| Hybrid | SAGE + GAT + learned gate | ~62K | Automatic blending of both approaches |
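A minimal plain-PyTorch sketch of the edge-injection idea behind `EdgeAwareSAGEConv`: each neighbor message is computed from the neighbor feature concatenated with the edge feature, mean-aggregated, then combined with the node's own feature as in GraphSAGE. The class below is a stand-in (dimensions match the node/edge layout above), not the repo's implementation:

```python
import torch
import torch.nn as nn

class EdgeAwareSAGELayer(nn.Module):
    """Illustrative SAGE-style layer with edge-feature injection."""
    def __init__(self, node_dim, edge_dim, out_dim):
        super().__init__()
        self.msg = nn.Linear(node_dim + edge_dim, out_dim)
        self.upd = nn.Linear(node_dim + out_dim, out_dim)

    def forward(self, x, edge_index, edge_attr):
        # x: (N, node_dim); edge_index: (2, E) as [src, dst]; edge_attr: (E, edge_dim)
        src, dst = edge_index
        m = torch.relu(self.msg(torch.cat([x[src], edge_attr], dim=1)))
        # Mean-aggregate messages per destination node
        agg = torch.zeros(x.size(0), m.size(1))
        agg.index_add_(0, dst, m)
        deg = torch.zeros(x.size(0)).index_add_(0, dst, torch.ones(dst.size(0)))
        agg = agg / deg.clamp(min=1).unsqueeze(1)
        return torch.relu(self.upd(torch.cat([x, agg], dim=1)))

layer = EdgeAwareSAGELayer(node_dim=26, edge_dim=5, out_dim=32)
x = torch.randn(4, 26)                          # 4 primitives
edge_index = torch.tensor([[0, 1, 2, 3], [1, 0, 3, 2]])
edge_attr = torch.randn(4, 5)                   # distance, adjacency, rel. position
out = layer(x, edge_index, edge_attr)
```

GATv2Conv in PyTorch Geometric handles edge features natively via its `edge_dim` argument, which is why the GAT variant needs no custom message passing.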
```python
import torch
from traffic3d.models.pipeline import Traffic3DPipeline, Traffic3DTrainer

pipeline = Traffic3DPipeline(num_classes=19)
trainer = Traffic3DTrainer(pipeline, device=torch.device('cuda'))

# Phase 1: segmentation pre-training
trainer.phase1_pretrain_segmentation(train_loader, epochs=30, lr=1e-3)
# Phase 2: edge-weighted fine-tuning with boundary supervision
trainer.phase2_finetune_edge_weighted(train_loader, epochs=15, lr=5e-4, lambda_boundary=0.4)
# Phase 3: GNN training on extracted scene graphs
trainer.phase3_train_gnn(graph_dataset, epochs=50, lr=1e-3)
# Phase 4: joint end-to-end fine-tuning
trainer.phase4_end_to_end(train_loader, epochs=10, lr=1e-4)
```
| Loss | Formula | Use |
|---|---|---|
| EdgeWeightedCE | CE * (1 + α*C(x,y)) | Segmentation with boundary focus |
| BoundaryLoss | Binary CE on boundary (on-the-fly GT) | Boundary refinement |
| CombinedSegLoss | L_ce + λ * L_boundary (λ=0.4) | Full segmentation training |
| RelationalConsistency | Contrastive on GNN features | Scene graph training |
| ChamferDistance | Bidirectional nearest-neighbor | 3D quality evaluation |
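EdgeWeightedCE amounts to a one-line modification of per-pixel cross-entropy. A sketch (function name and mean reduction are assumptions):

```python
import torch
import torch.nn.functional as F

def edge_weighted_ce(logits, target, edge_conf, alpha=1.0):
    """Per-pixel cross-entropy scaled by (1 + alpha * C(x, y)),
    so boundary pixels contribute more.
    logits: (B, K, H, W); target: (B, H, W) long; edge_conf: (B, H, W) in [0, 1]."""
    ce = F.cross_entropy(logits, target, reduction='none')   # (B, H, W)
    return (ce * (1.0 + alpha * edge_conf)).mean()

logits = torch.randn(2, 19, 8, 8)
target = torch.randint(0, 19, (2, 8, 8))
edge_conf = torch.rand(2, 8, 8)
loss = edge_weighted_ce(logits, target, edge_conf, alpha=1.0)
```

Since the weight is at least 1 everywhere, the weighted loss never falls below the plain cross-entropy.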
| Metric | Target | Description |
|---|---|---|
| 3D IoU | ~0.68 | 3D bounding box overlap |
| Centroid L2 | ~0.49m | Primitive position accuracy |
| Edge Graph Accuracy | ~78% | Scene graph correctness (F1) |
| Chamfer Distance | ~0.041 | Point cloud reconstruction quality |
| Boundary IoU | +15% | Improvement over non-edge baseline |
| FPS | ≥15 | RTX 3090 real-time throughput |
```python
import torch
from traffic3d.utils.evaluation import AblationStudy

ablation = AblationStudy(device=torch.device('cuda'))
results = ablation.run_all()  # Ablates: λ, GNN architecture, edge method, points per primitive
print(ablation.summary_table())
```
| GNN | GNN Parameters | Under 500K | Total Pipeline |
|---|---|---|---|
| sage | 28,736 | ✓ | 4.36M |
| gat | 29,312 | ✓ | 4.36M |
| hybrid | 62,016 | ✓ | 4.39M |
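Parameter budgets like these can be checked with a generic trainable-parameter counter. The module below is an illustrative stand-in, not the repo's GNN:

```python
import torch.nn as nn

def count_params(module):
    """Count trainable parameters, as used to check the <500K GNN budget."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

# Illustrative stand-in module (not the repo's GNN): two small linear layers
gnn_like = nn.Sequential(nn.Linear(26, 64), nn.ReLU(), nn.Linear(64, 32))
n = count_params(gnn_like)   # 26*64 + 64 + 64*32 + 32 = 3808
```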
| Dataset | Use | Classes |
|---|---|---|
| Cityscapes | Primary training | 19 semantic |
| BDD100K | Robustness testing | 19 semantic |
| CARLA | Synthetic 3D GT supervision | Configurable |
```
traffic3d/
├── __init__.py
├── models/
│   ├── input_augmentation.py     # Stage 1: Positional + Edge encoding
│   ├── segmentation.py           # Stage 2: Lightweight UNet + edge weighting
│   ├── primitive_extraction.py   # Stage 3: Primitives + scene graph
│   ├── gnn_refinement.py         # Stage 4: GraphSAGE / GATv2 / Hybrid GNN
│   ├── point_cloud.py            # Stage 5: Surface sampling + PLY export
│   └── pipeline.py               # End-to-end pipeline + 4-phase trainer
├── losses/
│   └── __init__.py               # EdgeCE, BoundaryLoss, ChamferDistance, etc.
├── utils/
│   └── evaluation.py             # Metrics, Evaluator, AblationStudy
└── data/ and configs/
```
MIT License