🟢 LIVE · nuScenes · Apple Silicon MPS

OpenDriveFM

Trust-Aware Multi-Camera BEV Occupancy Prediction
with GPT-2 Causal Trajectory Estimation

Self-Supervised BEV Perception LLM Fine-tuning Sparse Training Imitation Learning World Model Neural Pruning C++ LibTorch
317 FPS (MPS)
3.15ms p50 Latency
2.457m Traj ADE
18.4% Over CV Baseline
0.136 BEV IoU
100% Fault Detection
553K Parameters
83× Smaller than ProtoOcc

System Overview

Camera-only pipeline: no LiDAR at inference. 6 surround cameras → BEV occupancy + ego trajectory.

6 × Camera (90×160 px)
  → CNN Stem or ViTStem (patch_size=16, 50 patches/cam), shared weights × 6 → (B·V, 384, H/8, W/8)
    ├─ BEV Lifter (LSS): K⁻¹ × [u,v,1] = ray, T_cam→ego frame → (B, 192, 64, 64)
    └─ CameraTrustScorer: Laplacian + Sobel physics gate, self-supervised, zero fault labels → score t ∈ [0,1] per camera
  → Trust-Weighted Fusion: softmax(t) × BEV_features (bev_pool_kernel.py, 2.1× speedup)
    ├─ BEV Decoder: 4×ConvTranspose2d → IoU = 0.136
    └─ CausalTrajHead: GPT-2 transformer, 3 layers, 4 heads, 666K params → ADE = 2.457m
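The trust-weighted fusion stage can be sketched in a few lines. This is a NumPy illustration rather than the repo's implementation (bev_pool_kernel.py runs batched in PyTorch); the function name and exact shapes here are assumptions:

```python
import numpy as np

def trust_weighted_fusion(bev_feats, trust):
    """Fuse per-camera BEV features with softmax-normalized trust scores.

    bev_feats: (V, C, H, W) BEV features lifted from each of V cameras
    trust:     (V,) per-camera trust score t in [0, 1]
    """
    w = np.exp(trust - trust.max())
    w /= w.sum()                                # softmax over cameras
    return np.tensordot(w, bev_feats, axes=1)   # weighted sum -> (C, H, W)

feats = np.ones((6, 192, 64, 64), dtype=np.float32)
trust = np.array([0.8, 0.8, 0.8, 0.8, 0.8, 0.1])  # camera 5 degraded
fused = trust_weighted_fusion(feats, trust)
```

Because the weights come from a softmax, a degraded camera is down-weighted rather than hard-dropped, which keeps the fusion differentiable end to end.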
🧠

GPT-2 Causal Trajectory Head

Autoregressive waypoint prediction via behavioral cloning on 404 real nuScenes expert demonstrations. Lower-triangular causal mask. 666K parameters.
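The lower-triangular causal mask works the same way as in any GPT-style decoder; a minimal NumPy sketch (the sequence length and logit values are illustrative):

```python
import numpy as np

T = 6  # waypoint tokens in the trajectory sequence
# Token i may attend only to tokens j <= i (no peeking at future waypoints)
mask = np.tril(np.ones((T, T), dtype=bool))

scores = np.random.randn(T, T)                     # raw attention logits
scores = np.where(mask, scores, -np.inf)           # block future positions
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)           # row-wise softmax
```

Masked positions receive exp(-inf) = 0, so each waypoint's attention distribution covers only its past.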

⚡

Vectorized BEV Pool Kernel

Replaces a Python for-loop over the 6 cameras with a single batched einsum operation. 2.1× speedup on CPU. Runs on Apple MPS.
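The idea behind the vectorized kernel, sketched in NumPy (the real bev_pool_kernel.py operates on PyTorch tensors; the scatter-weight tensor below is a hypothetical stand-in for the camera-to-BEV-cell assignment):

```python
import numpy as np

V, C, N, P = 6, 64, 500, 4096   # cameras, channels, image features, BEV cells
feats = np.random.randn(V, C, N).astype(np.float32)
scatter = np.random.rand(V, N, P).astype(np.float32)  # feature-to-cell weights

# Baseline: Python for-loop over the 6 cameras
out_loop = np.zeros((C, P), dtype=np.float32)
for v in range(V):
    out_loop += feats[v] @ scatter[v]

# Vectorized: one batched einsum over all cameras at once
out_einsum = np.einsum('vcn,vnp->cp', feats, scatter)
```

Both paths compute the same pooled BEV tensor; on GPU/MPS the win comes from launching one kernel instead of six.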

🔬

Sparse Attention Training

SparseCausalTrajHead with strided (64%), local window (73%), and combined (58%) attention patterns. O(T·k) vs O(T²) dense.
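A sketch of the two mask families, with hypothetical stride and window values (the 64%/73%/58% figures depend on the actual sequence length and pattern parameters used in training):

```python
import numpy as np

def strided_mask(T, stride):
    """Token i attends to past tokens j with (i - j) % stride == 0."""
    i, j = np.arange(T)[:, None], np.arange(T)[None, :]
    return (j <= i) & ((i - j) % stride == 0)

def local_mask(T, window):
    """Token i attends to the last `window` tokens (including itself)."""
    i, j = np.arange(T)[:, None], np.arange(T)[None, :]
    return (j <= i) & (i - j < window)

T = 64
strided_m = strided_mask(T, 4)
local_m = local_mask(T, 8)
combined = strided_m | local_m   # union of the two patterns
for name, m in [("strided", strided_m), ("local", local_m), ("combined", combined)]:
    print(f"{name}: {1 - m.sum() / (T * T):.0%} of score entries skipped")
```

The combined mask keeps both the recent-history window and periodic long-range links while still skipping most of the dense O(T²) score matrix.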

📊

LLM Fine-tuning

GPT-2 (124M params) fine-tuned on tokenized nuScenes trajectories. 404-token vocabulary. Causal LM objective. Loss 16.97 → 0.0004.
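Fine-tuning a causal LM on trajectories requires mapping continuous waypoints to discrete tokens. The project's exact vocabulary construction isn't shown here, so this is a hypothetical uniform-binning scheme for illustration only:

```python
import numpy as np

def tokenize_waypoints(xy, lo=-50.0, hi=50.0, n_bins=20):
    """Map continuous (x, y) waypoints to discrete token ids.

    Hypothetical scheme: bin each coordinate uniformly on [lo, hi),
    then flatten the (x_bin, y_bin) pair into one id in [0, n_bins**2).
    """
    bins = np.clip(((xy - lo) / (hi - lo) * n_bins).astype(int), 0, n_bins - 1)
    return bins[:, 0] * n_bins + bins[:, 1]

traj = np.array([[0.0, 0.0], [1.5, 0.2], [49.9, -49.9]])
tokens = tokenize_waypoints(traj)   # one discrete token per waypoint
```

Once waypoints are token ids, the standard next-token cross-entropy objective applies unchanged.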

✂️

Neural Network Pruning

L1 unstructured pruning: 30% sparsity → 464K params. Zero latency regression. Compatible with downstream quantization.
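L1 unstructured pruning keeps the tensor shape and zeroes the smallest-magnitude weights; a NumPy sketch (in PyTorch the equivalent is torch.nn.utils.prune.l1_unstructured):

```python
import numpy as np

def l1_unstructured_prune(w, amount=0.3):
    """Zero the `amount` fraction of weights with the smallest |w|."""
    k = int(round(amount * w.size))
    if k == 0:
        return w.copy()
    # Magnitude threshold below which weights are dropped
    thresh = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    return np.where(np.abs(w) > thresh, w, 0.0)

w = np.array([0.05, -0.8, 0.01, 1.2, -0.02, 0.5, -0.03, 0.9, 0.04, -0.6])
pruned = l1_unstructured_prune(w, amount=0.3)
```

Because the zeros stay in place rather than changing tensor shapes, dense kernels run unmodified, which is why latency is unaffected.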

🎥

BEV Occupancy Forecasting

Predicts T+1, T+2, T+3 future BEV frames from T=4 past observations. Same paradigm as UniAD (NeurIPS 2023). 6.6M params.

CameraTrustScorer

Detects degraded cameras with zero fault labels, using purely contrastive self-supervised learning. The only system among the compared CVPR baselines (ProtoOcc, GAFusion, PointBeV) with real-time fault tolerance.

```python
# Contrastive margin loss: the only supervision signal
# No fault labels. No human annotation. Zero supervision.
L_trust = max(0, t_faulted - t_clean + 0.2)
# t_clean > t_faulted + 0.2 enforced during training
# Training signal comes from data augmentation, not labels
```
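In runnable form, the margin loss behaves like a hinge: it is zero once the clean view's trust exceeds the faulted view's by the margin (the trust values below are illustrative):

```python
import numpy as np

def trust_margin_loss(t_clean, t_faulted, margin=0.2):
    """Hinge-style margin loss: zero once t_clean > t_faulted + margin."""
    return np.maximum(0.0, t_faulted - t_clean + margin)

print(trust_margin_loss(0.9, 0.3))    # margin satisfied -> 0.0
print(trust_margin_loss(0.5, 0.45))   # violated -> approx 0.15
```

The "faulted" view is created by augmenting the clean image (blur, occlusion, noise, ...), so the pair supervises itself.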
| Condition | Trust Score | Drop vs Clean | Category |
|---|---|---|---|
| Clean (baseline) | 0.795 | – | – |
| Blur | 0.340 | -57% | Known |
| Occlusion | 0.310 | -61% | Known |
| Noise | 0.460 | -42% | Known |
| Glare | 0.420 | -47% | Known |
| Rain | 0.491 | -38% | Known |
| Heavy Snow ⚡ | 0.355 | ~-55% | UNSEEN |
| Dense Fog ⚡ | 0.380 | ~-52% | UNSEEN |

⚡ UNSEEN = not in the training set. Detected via physics signals (Laplacian + Sobel) that generalize across fault types.
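The Laplacian signal referenced above is a standard blur detector (variance of the Laplacian response); a self-contained NumPy sketch, with the Sobel branch omitted and image handling simplified relative to the real scorer:

```python
import numpy as np

LAPLACIAN = np.array([[0, 1, 0],
                      [1, -4, 1],
                      [0, 1, 0]], dtype=float)

def conv2d_valid(img, kernel):
    """Naive 3x3 'valid' convolution, enough for a metric sketch."""
    H, W = img.shape
    out = np.empty((H - 2, W - 2))
    for r in range(H - 2):
        for c in range(W - 2):
            out[r, c] = (img[r:r + 3, c:c + 3] * kernel).sum()
    return out

def sharpness(img):
    """Variance of the Laplacian response: low values indicate blur."""
    return conv2d_valid(img, LAPLACIAN).var()

rng = np.random.default_rng(0)
sharp = rng.random((32, 32))
# 2x2 box blur as a stand-in for a degraded camera
blurred = (sharp + np.roll(sharp, 1, 0) + np.roll(sharp, 1, 1)
           + np.roll(sharp, (1, 1), (0, 1))) / 4
```

Blurring suppresses the high-frequency content the Laplacian responds to, so the score drops for degraded frames regardless of which specific fault caused the degradation.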

No Trust vs Uniform vs Trust-Aware

Comparing fusion strategies to isolate the benefit of the CameraTrustScorer. Trust benefit is larger under fault conditions โ€” as designed.

Clean conditions:

| Strategy | Score | Description |
|---|---|---|
| No Trust (baseline) | 0.0706 | Uniform weights, ignores camera quality |
| Uniform Average | 0.0752 | Simple mean across all cameras equally |
| Trust-Aware (ours) ⭐ | 0.0776 | Weighted by self-supervised trust score |

Under Fault Conditions (1 camera faulted):

| Strategy | Score | Description |
|---|---|---|
| No Trust | 0.0643 | |
| Uniform Average | 0.0717 | |
| Trust-Aware ⭐ | 0.0814 | +26.6% over No Trust |

13 Training Experiments (key milestones below)

| Version | Change | Result |
|---|---|---|
| v2 | Initial CNN + trust scorer | First working pipeline |
| v5 | AdamW + CosineAnnealingLR | Loss 26 → 9.5 |
| v7 | Scene-level splits | No data leakage |
| v8 ★ | Geometry BEV lifter | IoU = 0.136 |
| v9 | LiDAR depth supervision | ADE 2.740 → 2.559m |
| v11 ★ | T=4 temporal video fusion + 128×128 BEV | ADE = 2.457m (best) |
| v13 | 3-class semantic labels | IoU = 0.131 (vehicle) |
| v14 | Full LSS from scratch | Needs more epochs |
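ADE (Average Displacement Error), reported throughout, is simply the mean L2 distance between predicted and ground-truth waypoints:

```python
import numpy as np

def ade(pred, gt):
    """Average Displacement Error: mean Euclidean distance over waypoints."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

pred = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
gt   = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 2.0]])
print(ade(pred, gt))  # (1 + 1 + 0) / 3, approx 0.667
```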

CVPR Paper Comparison

| System | Speed | Parameters | Traj | Fault Tolerance | Hardware |
|---|---|---|---|---|---|
| ProtoOcc (CVPR 2025) | 9.5 FPS | 46.2M | ✗ | ✗ | 8×A100 |
| GAFusion (CVPR 2024) | 8 FPS | ~80M | ✗ | ✗ | 2×3090 |
| PointBeV (CVPR 2024) | ~10 FPS | ~40M | ✗ | ✗ | A100 |
| OpenDriveFM ★ | 317 FPS | 553K | ✓ ADE = 2.457m | ✓ 7 fault types | MacBook |

Try It Yourself

Run the live demo locally or explore the full codebase

📂 View on GitHub · 🎮 Live Demo (Gradio)
```bash
# Quick start
git clone https://github.com/AI-688-Image-and-Vision-Computing/Opendrivefm.git
cd opendrivefm
conda env create -f environment.yml
python apps/demo/live_demo_webcam.py --nuscenes
# OR
python scripts/gradio_app.py --share
```