GaussianFormer3D: Multi-Modal Gaussian-based Semantic Occupancy Prediction with 3D Deformable Attention

ICRA 2026

RSS 2025 Workshop on Gaussian Representations for Robot Autonomy

Lingjun Zhao, Sizhe Wei, James Hays, Lu Gan
Georgia Institute of Technology

We propose a new LiDAR-camera fusion-based semantic occupancy prediction framework using 3D Gaussian representations. We evaluate it on both on-road and off-road driving scenarios. Our method demonstrates superior performance on overall occupancy Intersection over Union (IoU), achieves substantial performance gains on small objects (e.g., pedestrian, motorcycle) and large surfaces (e.g., manmade, vegetation), and consumes less memory during inference.

Abstract

3D semantic occupancy prediction is essential for achieving safe, reliable autonomous driving and robotic navigation. Compared to camera-only perception systems, multi-modal pipelines, especially LiDAR-camera fusion methods, can produce more accurate and fine-grained predictions. Although voxel-based scene representations are widely used for semantic occupancy prediction, 3D Gaussians have emerged as a continuous and significantly more compact alternative. In this work, we propose a multi-modal Gaussian-based semantic occupancy prediction framework utilizing 3D deformable attention, namely GaussianFormer3D. We introduce a voxel-to-Gaussian initialization strategy that provides 3D Gaussians with accurate geometry priors from LiDAR data, and design a LiDAR-guided 3D deformable attention mechanism to refine these Gaussians using LiDAR-camera fusion features in a lifted 3D space. Extensive experiments on real-world on-road and off-road autonomous driving datasets demonstrate that GaussianFormer3D achieves state-of-the-art prediction performance with reduced memory consumption and improved efficiency.

Method

We propose a novel multi-modal Gaussian-based semantic occupancy prediction framework. By integrating LiDAR and camera data, our method significantly outperforms camera-only baselines while maintaining similar memory usage.

We design a voxel-to-Gaussian initialization module to provide 3D Gaussians with geometry priors from LiDAR data. We also develop an enhanced 3D deformable attention mechanism to update Gaussians by aggregating LiDAR-camera fusion features in a lifted 3D space.
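
As a rough illustration of these two components, the PyTorch sketch below initializes Gaussian means from occupied LiDAR voxels and refines Gaussian queries with a single, simplified 3D deformable attention step over a dense LiDAR-camera fusion volume. All function and module names, tensor shapes, and the use of a single-scale dense fusion volume are our own simplifying assumptions for illustration, not the released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

def voxel_to_gaussian_init(points, pc_range, voxel_size, num_gaussians, feat_dim):
    # points: (N, 3) LiDAR points; pc_range: [x_min, y_min, z_min, x_max, y_max, z_max].
    lo = points.new_tensor(pc_range[:3])
    hi = points.new_tensor(pc_range[3:])
    pts = points[((points >= lo) & (points < hi)).all(dim=1)]
    # Voxelize and keep unique occupied voxel centers as geometry priors for the means.
    idx = torch.unique(((pts - lo) / voxel_size).long(), dim=0)
    centers = lo + (idx.float() + 0.5) * voxel_size
    if centers.shape[0] >= num_gaussians:
        means = centers[torch.randperm(centers.shape[0])[:num_gaussians]]
    else:  # pad with random in-range means if the LiDAR sweep is sparse
        pad = lo + torch.rand(num_gaussians - centers.shape[0], 3) * (hi - lo)
        means = torch.cat([centers, pad], dim=0)
    scales = torch.full((num_gaussians, 3), float(voxel_size))   # start at the voxel size
    rotations = torch.zeros(num_gaussians, 4)
    rotations[:, 0] = 1.0                                        # identity quaternions (w, x, y, z)
    features = torch.zeros(num_gaussians, feat_dim)              # learned embeddings in practice
    return means, scales, rotations, features

class Deformable3DAttention(nn.Module):
    # One simplified head: each Gaussian query predicts K 3D offsets around its mean,
    # samples the lifted LiDAR-camera fusion volume at those points, and aggregates
    # the sampled features with softmax attention weights.
    def __init__(self, dim, num_points=8):
        super().__init__()
        self.num_points = num_points
        self.offset_head = nn.Linear(dim, num_points * 3)
        self.weight_head = nn.Linear(dim, num_points)
        self.proj = nn.Linear(dim, dim)

    def forward(self, queries, means, volume, pc_range):
        # queries: (G, C) Gaussian features; means: (G, 3); volume: (1, C, D, H, W) with axes (z, y, x).
        G, _ = queries.shape
        K = self.num_points
        lo = queries.new_tensor(pc_range[:3])
        hi = queries.new_tensor(pc_range[3:])
        pts = means.unsqueeze(1) + self.offset_head(queries).view(G, K, 3)
        grid = 2.0 * (pts - lo) / (hi - lo) - 1.0                     # normalize to [-1, 1], (x, y, z) order
        grid = grid.view(1, G, K, 1, 3)
        sampled = F.grid_sample(volume, grid, align_corners=False)    # (1, C, G, K, 1), trilinear
        sampled = sampled.squeeze(-1).squeeze(0).permute(1, 2, 0)     # (G, K, C)
        attn = self.weight_head(queries).softmax(dim=-1).unsqueeze(-1)
        return queries + self.proj((attn * sampled).sum(dim=1))       # residual Gaussian update

In the actual framework, fusion features are aggregated across scales and the Gaussians are refined over several decoder blocks; the sketch above compresses that into one dense volume and a single attention step.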

We present extensive evaluations on two on-road datasets, nuScenes-SurroundOcc and nuScenes-Occ3D, and one off-road dataset, RELLIS3D-WildOcc. Results show that our method performs on par with state-of-the-art dense grid-based methods while consuming less memory and running more efficiently.
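
To produce the dense occupancy grids used in these evaluations, the Gaussian set must be decoded back into voxels. A common formulation in the GaussianFormer line of work, given here only as a hedged sketch with our own notation rather than the paper's, predicts the semantics at a voxel center x as

    \hat{o}(x) = \sum_{i=1}^{P} a_i \, c_i \, \exp\!\left( -\tfrac{1}{2} (x - m_i)^{\top} \Sigma_i^{-1} (x - m_i) \right), \qquad \Sigma_i = R_i S_i S_i^{\top} R_i^{\top},

where the i-th Gaussian contributes its semantic logits c_i weighted by its opacity a_i and its density at x, with mean m_i, scale matrix S_i = diag(s_i), and rotation R_i; the per-voxel class is the argmax over the accumulated logits.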


Quantitative Results

3D semantic occupancy prediction performance on the on-road nuScenes-SurroundOcc validation set.


3D semantic occupancy prediction performance on the off-road RELLIS3D-WildOcc validation and test sets.


Inference efficiency comparison on the nuScenes-SurroundOcc validation set.


Qualitative Results

Visualization on the on-road nuScenes-Occ3D validation set.


Visualization comparison with GaussianFormer.


Visualization on the off-road RELLIS3D dataset.


Visualization of multi-resolution occupancy prediction from 3D Gaussians.


Visualization of different Gaussian initialization strategies.


BibTeX

@article{zhao2025gaussianformer3d,
  title={GaussianFormer3D: Multi-Modal Gaussian-based Semantic Occupancy Prediction with 3D Deformable Attention},
  author={Zhao, Lingjun and Wei, Sizhe and Hays, James and Gan, Lu},
  journal={arXiv preprint arXiv:2505.10685},
  year={2025}
}