🟢 LIVE · nuScenes · Apple Silicon MPS

OpenDriveFM

Trust-Aware Multi-Camera BEV Occupancy Prediction
with GPT-2 Causal Trajectory Estimation

Self-Supervised BEV Perception LLM Fine-tuning Sparse Training Imitation Learning World Model Neural Pruning C++ LibTorch
317 FPS (MPS)
3.15ms p50 Latency
2.457m Traj ADE
18.4% Over CV Baseline
0.136 BEV IoU
100% Fault Detection
553K Parameters
83× Smaller than ProtoOcc

System Overview

Camera-only pipeline: no LiDAR at inference. 6 surround cameras → BEV occupancy + ego trajectory.

6 × Camera (90×160 px)
  → CNN Stem or ViTStem (patch_size=16, 50 patches/cam), shared weights × 6 → (B·V, 384, H/8, W/8)
    ├─ BEV Lifter (LSS): K⁻¹ × [u,v,1] = ray, T_cam→ego frame → (B, 192, 64, 64)
    └─ CameraTrustScorer: Laplacian + Sobel physics gate, self-supervised, zero fault labels → score t ∈ [0,1] per camera
  → Trust-Weighted Fusion: softmax(t) × BEV_features (bev_pool_kernel.py, 2.1× speedup)
    ├─ BEV Decoder: 4×ConvTranspose2d → IoU = 0.136
    └─ CausalTrajHead: GPT-2 transformer, 3 layers, 4 heads, 666K params → ADE = 2.457m
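The trust-weighted fusion stage can be sketched in a few lines. This is a NumPy illustration rather than the repo's implementation (bev_pool_kernel.py runs batched in PyTorch); the function name and exact shapes here are assumptions:

```python
import numpy as np

def trust_weighted_fusion(bev_feats, trust):
    """Fuse per-camera BEV features with softmax-normalized trust scores.

    bev_feats: (V, C, H, W) BEV features lifted from each of V cameras
    trust:     (V,) per-camera trust score t in [0, 1]
    """
    w = np.exp(trust - trust.max())
    w /= w.sum()                                # softmax over cameras
    return np.tensordot(w, bev_feats, axes=1)   # weighted sum -> (C, H, W)

feats = np.ones((6, 192, 64, 64), dtype=np.float32)
trust = np.array([0.8, 0.8, 0.8, 0.8, 0.8, 0.1])  # camera 5 degraded
fused = trust_weighted_fusion(feats, trust)
```

Because the weights come from a softmax, a degraded camera is down-weighted rather than hard-dropped, which keeps the fusion differentiable end to end.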
🧠

GPT-2 Causal Trajectory Head

Autoregressive waypoint prediction via behavioral cloning on 404 real nuScenes expert demonstrations. Lower-triangular causal mask. 666K parameters.
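The lower-triangular causal mask works the same way as in any GPT-style decoder; a minimal NumPy sketch (the sequence length and logit values are illustrative):

```python
import numpy as np

T = 6  # waypoint tokens in the trajectory sequence
# Token i may attend only to tokens j <= i (no peeking at future waypoints)
mask = np.tril(np.ones((T, T), dtype=bool))

scores = np.random.randn(T, T)                     # raw attention logits
scores = np.where(mask, scores, -np.inf)           # block future positions
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)           # row-wise softmax
```

Masked positions receive exp(-inf) = 0, so each waypoint's attention distribution covers only its past.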

⚡

Vectorized BEV Pool Kernel

Replaces a Python for-loop over the 6 cameras with a single batched einsum operation. 2.1× speedup on CPU. Runs on Apple MPS.
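The idea behind the vectorized kernel, sketched in NumPy (the real bev_pool_kernel.py operates on PyTorch tensors; the scatter-weight tensor below is a hypothetical stand-in for the camera-to-BEV-cell assignment):

```python
import numpy as np

V, C, N, P = 6, 64, 500, 4096   # cameras, channels, image features, BEV cells
feats = np.random.randn(V, C, N).astype(np.float32)
scatter = np.random.rand(V, N, P).astype(np.float32)  # feature-to-cell weights

# Baseline: Python for-loop over the 6 cameras
out_loop = np.zeros((C, P), dtype=np.float32)
for v in range(V):
    out_loop += feats[v] @ scatter[v]

# Vectorized: one batched einsum over all cameras at once
out_einsum = np.einsum('vcn,vnp->cp', feats, scatter)
```

Both paths compute the same pooled BEV tensor; on GPU/MPS the win comes from launching one kernel instead of six.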

🔬

Sparse Attention Training

SparseCausalTrajHead with strided (64%), local window (73%), and combined (58%) attention patterns. O(T·k) vs O(T²) dense.
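A sketch of the two mask families, with hypothetical stride and window values (the 64%/73%/58% figures depend on the actual sequence length and pattern parameters used in training):

```python
import numpy as np

def strided_mask(T, stride):
    """Token i attends to past tokens j with (i - j) % stride == 0."""
    i, j = np.arange(T)[:, None], np.arange(T)[None, :]
    return (j <= i) & ((i - j) % stride == 0)

def local_mask(T, window):
    """Token i attends to the last `window` tokens (including itself)."""
    i, j = np.arange(T)[:, None], np.arange(T)[None, :]
    return (j <= i) & (i - j < window)

T = 64
strided_m = strided_mask(T, 4)
local_m = local_mask(T, 8)
combined = strided_m | local_m   # union of the two patterns
for name, m in [("strided", strided_m), ("local", local_m), ("combined", combined)]:
    print(f"{name}: {1 - m.sum() / (T * T):.0%} of score entries skipped")
```

The combined mask keeps both the recent-history window and periodic long-range links while still skipping most of the dense O(T²) score matrix.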

📊

LLM Fine-tuning

GPT-2 (124M params) fine-tuned on tokenized nuScenes trajectories. 404-token vocabulary. Causal LM objective. Loss 16.97 → 0.0004.
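Fine-tuning a causal LM on trajectories requires mapping continuous waypoints to discrete tokens. The project's exact vocabulary construction isn't shown here, so this is a hypothetical uniform-binning scheme for illustration only:

```python
import numpy as np

def tokenize_waypoints(xy, lo=-50.0, hi=50.0, n_bins=20):
    """Map continuous (x, y) waypoints to discrete token ids.

    Hypothetical scheme: bin each coordinate uniformly on [lo, hi),
    then flatten the (x_bin, y_bin) pair into one id in [0, n_bins**2).
    """
    bins = np.clip(((xy - lo) / (hi - lo) * n_bins).astype(int), 0, n_bins - 1)
    return bins[:, 0] * n_bins + bins[:, 1]

traj = np.array([[0.0, 0.0], [1.5, 0.2], [49.9, -49.9]])
tokens = tokenize_waypoints(traj)   # one discrete token per waypoint
```

Once waypoints are token ids, the standard next-token cross-entropy objective applies unchanged.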

✂️

Neural Network Pruning

L1 unstructured pruning: 30% sparsity → 464K params. Zero latency regression. Compatible with downstream quantization.
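L1 unstructured pruning keeps the tensor shape and zeroes the smallest-magnitude weights; a NumPy sketch (in PyTorch the equivalent is torch.nn.utils.prune.l1_unstructured):

```python
import numpy as np

def l1_unstructured_prune(w, amount=0.3):
    """Zero the `amount` fraction of weights with the smallest |w|."""
    k = int(round(amount * w.size))
    if k == 0:
        return w.copy()
    # Magnitude threshold below which weights are dropped
    thresh = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    return np.where(np.abs(w) > thresh, w, 0.0)

w = np.array([0.05, -0.8, 0.01, 1.2, -0.02, 0.5, -0.03, 0.9, 0.04, -0.6])
pruned = l1_unstructured_prune(w, amount=0.3)
```

Because the zeros stay in place rather than changing tensor shapes, dense kernels run unmodified, which is why latency is unaffected.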

🎥

BEV Occupancy Forecasting

Predicts T+1, T+2, T+3 future BEV frames from T=4 past observations. Same paradigm as UniAD (NeurIPS 2023). 6.6M params.

CameraTrustScorer

Detects degraded cameras with zero fault labels, using purely contrastive self-supervised learning. The only system among the compared CVPR baselines (ProtoOcc, GAFusion, PointBeV) with real-time fault tolerance.

```python
# Contrastive margin loss: the only supervision signal
# No fault labels. No human annotation. Zero supervision.
L_trust = max(0, t_faulted - t_clean + 0.2)
# t_clean > t_faulted + 0.2 enforced during training
# Training signal comes from data augmentation, not labels
```
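In runnable form, the margin loss behaves like a hinge: it is zero once the clean view's trust exceeds the faulted view's by the margin (the trust values below are illustrative):

```python
import numpy as np

def trust_margin_loss(t_clean, t_faulted, margin=0.2):
    """Hinge-style margin loss: zero once t_clean > t_faulted + margin."""
    return np.maximum(0.0, t_faulted - t_clean + margin)

print(trust_margin_loss(0.9, 0.3))    # margin satisfied -> 0.0
print(trust_margin_loss(0.5, 0.45))   # violated -> approx 0.15
```

The "faulted" view is created by augmenting the clean image (blur, occlusion, noise, ...), so the pair supervises itself.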
| Condition | Trust Score | Drop vs Clean | Category |
|---|---|---|---|
| Clean (baseline) | 0.795 | – | – |
| Blur | 0.340 | -57% | Known |
| Occlusion | 0.310 | -61% | Known |
| Noise | 0.460 | -42% | Known |
| Glare | 0.420 | -47% | Known |
| Rain | 0.491 | -38% | Known |
| Heavy Snow ⚡ | 0.355 | ~-55% | UNSEEN |
| Dense Fog ⚡ | 0.380 | ~-52% | UNSEEN |

⚡ UNSEEN = not in the training set. Detected via physics signals (Laplacian + Sobel) that generalize across fault types.
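The Laplacian signal referenced above is a standard blur detector (variance of the Laplacian response); a self-contained NumPy sketch, with the Sobel branch omitted and image handling simplified relative to the real scorer:

```python
import numpy as np

LAPLACIAN = np.array([[0, 1, 0],
                      [1, -4, 1],
                      [0, 1, 0]], dtype=float)

def conv2d_valid(img, kernel):
    """Naive 3x3 'valid' convolution, enough for a metric sketch."""
    H, W = img.shape
    out = np.empty((H - 2, W - 2))
    for r in range(H - 2):
        for c in range(W - 2):
            out[r, c] = (img[r:r + 3, c:c + 3] * kernel).sum()
    return out

def sharpness(img):
    """Variance of the Laplacian response: low values indicate blur."""
    return conv2d_valid(img, LAPLACIAN).var()

rng = np.random.default_rng(0)
sharp = rng.random((32, 32))
# 2x2 box blur as a stand-in for a degraded camera
blurred = (sharp + np.roll(sharp, 1, 0) + np.roll(sharp, 1, 1)
           + np.roll(sharp, (1, 1), (0, 1))) / 4
```

Blurring suppresses the high-frequency content the Laplacian responds to, so the score drops for degraded frames regardless of which specific fault caused the degradation.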

No Trust vs Uniform vs Trust-Aware

Comparing fusion strategies to isolate the benefit of the CameraTrustScorer. Trust benefit is larger under fault conditions โ€” as designed.

Clean conditions:

| Strategy | Score | Description |
|---|---|---|
| No Trust (baseline) | 0.0706 | Uniform weights, ignores camera quality |
| Uniform Average | 0.0752 | Simple mean across all cameras equally |
| Trust-Aware (ours) ⭐ | 0.0776 | Weighted by self-supervised trust score |

Under Fault Conditions (1 camera faulted):

| Strategy | Score | Description |
|---|---|---|
| No Trust | 0.0643 | |
| Uniform Average | 0.0717 | |
| Trust-Aware ⭐ | 0.0814 | +26.6% over No Trust |

13 Training Experiments (key milestones below)

| Version | Change | Result |
|---|---|---|
| v2 | Initial CNN + trust scorer | First working pipeline |
| v5 | AdamW + CosineAnnealingLR | Loss 26 → 9.5 |
| v7 | Scene-level splits | No data leakage |
| v8 ★ | Geometry BEV lifter | IoU = 0.136 |
| v9 | LiDAR depth supervision | ADE 2.740 → 2.559m |
| v11 ★ | T=4 temporal video fusion + 128×128 BEV | ADE = 2.457m (best) |
| v13 | 3-class semantic labels | IoU = 0.131 (vehicle) |
| v14 | Full LSS from scratch | Needs more epochs |
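ADE (Average Displacement Error), reported throughout, is simply the mean L2 distance between predicted and ground-truth waypoints:

```python
import numpy as np

def ade(pred, gt):
    """Average Displacement Error: mean Euclidean distance over waypoints."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

pred = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
gt   = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 2.0]])
print(ade(pred, gt))  # (1 + 1 + 0) / 3, approx 0.667
```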

CVPR Paper Comparison

| System | Speed | Parameters | Traj | Fault Tolerance | Hardware |
|---|---|---|---|---|---|
| ProtoOcc (CVPR 2025) | 9.5 FPS | 46.2M | ✗ | ✗ | 8×A100 |
| GAFusion (CVPR 2024) | 8 FPS | ~80M | ✗ | ✗ | 2×3090 |
| PointBeV (CVPR 2024) | ~10 FPS | ~40M | ✗ | ✗ | A100 |
| OpenDriveFM ★ | 317 FPS | 553K | ✓ ADE = 2.457m | ✓ 7 fault types | MacBook |

Try It Yourself

Run the live demo locally or explore the full codebase

📂 View on GitHub · 🎮 Live Demo (Gradio)
```bash
# Quick start
git clone https://github.com/AI-688-Image-and-Vision-Computing/Opendrivefm.git
cd opendrivefm
conda env create -f environment.yml
python apps/demo/live_demo_webcam.py --nuscenes
# OR
python scripts/gradio_app.py --share
```