ShelfGaussian: Shelf-Supervised Open-Vocabulary Gaussian-based 3D Scene Understanding

*Equal Contribution · Corresponding Author · 1Georgia Institute of Technology
[Teaser figure: ShelfGaussian]

We propose ShelfGaussian for Gaussian-based 3D scene understanding under open-vocabulary, multi-modal, and multi-task scenarios. (a) Our model assists a robot in predicting open-set occupancy from arbitrary sensor modalities with the help of VFMs. (b) Compared to existing Gaussian-based methods, ours provides a generalizable solution for 3D scene understanding.

Abstract

We introduce ShelfGaussian, an open-vocabulary multi-modal Gaussian-based 3D scene understanding framework supervised by off-the-shelf vision foundation models (VFMs). Gaussian-based methods have demonstrated superior performance and computational efficiency across a wide range of scene understanding tasks. However, existing methods either model objects as closed-set semantic Gaussians supervised by annotated 3D labels, neglecting their rendering ability, or learn open-set Gaussian representations via purely 2D self-supervision, leading to degraded geometry and restricting them to camera-only settings. To fully exploit the potential of Gaussians, we propose a Multi-Modal Gaussian Transformer that enables Gaussians to query features from diverse sensor modalities, and a Shelf-Supervised Learning Paradigm that efficiently optimizes Gaussians with VFM features jointly at the 2D image and 3D scene levels. We evaluate ShelfGaussian on various perception and planning tasks. Experiments on Occ3D-nuScenes demonstrate its state-of-the-art zero-shot semantic occupancy prediction performance. ShelfGaussian is further evaluated on an unmanned ground vehicle (UGV) to assess its in-the-wild performance across diverse urban scenarios.

Methodology

  • We propose a multi-modal Gaussian transformer architecture that queries features from different sensor modalities to predict feed-forward 3D Gaussians, supporting camera-only, LiDAR-only, LiDAR-camera, camera-radar, and LiDAR-camera-radar sensor inputs (see the feature-sampling sketch after this list).
  • We introduce a novel shelf-supervised learning paradigm to optimize Gaussians using off-the-shelf VFMs at both 2D image and 3D scene levels. A highly efficient, CUDA-accelerated Gaussian-to-Voxel (G2V) splatting module is designed to enable high-dimensional VFM feature distillation and speed up training.
  • ShelfGaussian achieves state-of-the-art zero-shot performance on semantic occupancy prediction on the nuScenes dataset, superior performance on Bird's-Eye-View (BEV) segmentation, and reduced collision rate in trajectory planning when integrated into our Gaussian-Planner.
  • We further test ShelfGaussian on an unmanned ground vehicle (UGV) across diverse urban scenes, demonstrating its superior in-the-wild performance. All code and data will be open-sourced to benefit the community.
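
The following is a minimal PyTorch sketch of the feature-querying idea, not the released implementation: Gaussian centers are projected into the image to bilinearly sample camera features, indexed into a BEV map for LiDAR/radar features, and then fused. All names (sample_camera, fuse_mlp, the toy tensors and calibration) are illustrative assumptions.

# Minimal sketch (not the paper's implementation): per-Gaussian queries gathering
# features from multiple sensor modalities. Names and toy inputs are assumptions.
import torch
import torch.nn.functional as F

def sample_camera(cam_feat, K, centers, img_hw):
    """Project Gaussian centers into the image and bilinearly sample features."""
    # cam_feat: (C, Hf, Wf) image feature map; K: (3, 3) intrinsics;
    # centers: (N, 3) Gaussian centers in the camera frame.
    uvw = centers @ K.T                                  # (N, 3)
    uv = uvw[:, :2] / uvw[:, 2:].clamp(min=1e-6)
    H, W = img_hw
    grid = torch.stack([uv[:, 0] / W * 2 - 1, uv[:, 1] / H * 2 - 1], -1)
    grid = grid.view(1, 1, -1, 2)                        # normalized to [-1, 1]
    feat = F.grid_sample(cam_feat[None], grid, align_corners=False)
    return feat[0, :, 0].T                               # (N, C)

def sample_bev(bev_feat, centers, xy_range):
    """Sample a LiDAR/radar BEV feature map at the Gaussians' (x, y) locations."""
    # bev_feat: (C, Hb, Wb); xy_range: (xmin, xmax, ymin, ymax) of the BEV grid.
    xmin, xmax, ymin, ymax = xy_range
    gx = (centers[:, 0] - xmin) / (xmax - xmin) * 2 - 1
    gy = (centers[:, 1] - ymin) / (ymax - ymin) * 2 - 1
    grid = torch.stack([gx, gy], -1).view(1, 1, -1, 2)
    feat = F.grid_sample(bev_feat[None], grid, align_corners=False)
    return feat[0, :, 0].T                               # (N, C)

# Toy tensors standing in for backbone outputs and calibration.
N, C = 1024, 64
centers = torch.randn(N, 3) * 5 + torch.tensor([0.0, 0.0, 10.0])
cam_feat = torch.randn(C, 32, 56)
bev_feat = torch.randn(C, 128, 128)
K = torch.tensor([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])

cam_q = sample_camera(cam_feat, K, centers, img_hw=(480, 640))
bev_q = sample_bev(bev_feat, centers, xy_range=(-50, 50, -50, 50))
fuse_mlp = torch.nn.Linear(2 * C, C)                     # simple fusion stand-in
fused = fuse_mlp(torch.cat([cam_q, bev_q], dim=-1))      # (N, C) per-Gaussian features
print(fused.shape)

Dropping or adding one of the sampling branches corresponds to the camera-only, LiDAR-only, and mixed-sensor configurations listed above.
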
[Figure: ShelfGaussian method overview]

DINO-Driven Pseudo Labeling Engine

We teleoperate our UGV through urban scenarios to collect paired image and point cloud sequences, along with trajectories, from the onboard camera and LiDAR. LiDAR points are then projected onto the images and decorated with pixel-wise DINO features. These decorated points are aggregated and voxelized at a customized resolution to form 3D pseudo-labels.

[Figure: DINO-driven pseudo labeling engine]
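
Below is a minimal PyTorch sketch of the pseudo-labeling idea under simplifying assumptions (single frame, known extrinsics, image-resolution DINO features): points are projected into the image, decorated with per-pixel features, and voxel-averaged into 3D pseudo-labels. Function and variable names are illustrative, not the released pipeline.

# Minimal sketch: decorate LiDAR points with pixel-wise features and voxelize them.
# All names (decorate_points, dino_feat, voxel_size) are illustrative assumptions.
import torch

def decorate_points(points, dino_feat, K, T_cam_lidar, img_hw):
    """Attach per-pixel features to LiDAR points that fall inside the image."""
    # points: (N, 3) in the LiDAR frame; dino_feat: (C, H, W) at image resolution.
    ones = torch.ones(points.shape[0], 1)
    pts_cam = (torch.cat([points, ones], 1) @ T_cam_lidar.T)[:, :3]   # (N, 3)
    uv = pts_cam @ K.T
    u, v = uv[:, 0] / uv[:, 2], uv[:, 1] / uv[:, 2]
    H, W = img_hw
    valid = (pts_cam[:, 2] > 0.1) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    ui, vi = u[valid].long(), v[valid].long()
    return points[valid], dino_feat[:, vi, ui].T                      # (M, 3), (M, C)

def voxelize(points, feats, voxel_size=0.4):
    """Average point features within each voxel to form 3D pseudo-labels."""
    coords = torch.floor(points / voxel_size).long()
    uniq, inv = torch.unique(coords, dim=0, return_inverse=True)
    acc = torch.zeros(uniq.shape[0], feats.shape[1]).index_add_(0, inv, feats)
    cnt = torch.zeros(uniq.shape[0]).index_add_(0, inv, torch.ones(feats.shape[0]))
    return uniq, acc / cnt.unsqueeze(1)                  # voxel coords, mean features

# Toy data in place of real sensor readings and calibration.
pts = torch.rand(5000, 3) * 40 - 20
dino = torch.randn(64, 480, 640)
K = torch.tensor([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
T = torch.eye(4)
pts_v, feats_v = decorate_points(pts, dino, K, T, img_hw=(480, 640))
vox_coords, vox_feats = voxelize(pts_v, feats_v)
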

CUDA-Accelerated Gaussian-to-Voxel Splatting

Dual-CSR structure for CUDA-accelerated Gaussian-to-Voxel (G2V) splatting. Gaussian-to-Tile CSR: index pointers store per-Gaussian tile offsets, indices record tile IDs, and values store Gaussian IDs. Tile-to-Gaussian CSR: index pointers store per-tile Gaussian offsets, and indices record Gaussian IDs obtained by sorting and run-length encoding (RLE) the tile-Gaussian pairs.

[Figure: Dual-CSR Gaussian-to-Voxel splatting]
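
As an illustration of the dual-CSR idea, the NumPy sketch below builds the Tile-to-Gaussian CSR from per-Gaussian tile lists by sorting tile-Gaussian pairs and run-length encoding the tile IDs. It is a CPU stand-in, not the CUDA kernel, and names like tiles_per_gaussian are assumptions.

# Minimal sketch: build a Tile-to-Gaussian CSR from (tile, Gaussian) pairs by
# sorting and run-length encoding. Illustrative CPU version only.
import numpy as np

def build_tile_to_gaussian_csr(tiles_per_gaussian, num_tiles):
    """tiles_per_gaussian: list of arrays, tile IDs touched by each Gaussian."""
    # Gaussian-to-Tile CSR: per-Gaussian offsets, tile IDs as indices,
    # Gaussian IDs as values.
    g2t_indptr = np.cumsum([0] + [len(t) for t in tiles_per_gaussian])
    g2t_indices = np.concatenate(tiles_per_gaussian)             # tile IDs
    g2t_values = np.repeat(np.arange(len(tiles_per_gaussian)), np.diff(g2t_indptr))

    # Sort (tile, Gaussian) pairs by tile, then run-length encode the tiles
    # to obtain per-tile offsets for the Tile-to-Gaussian CSR.
    order = np.argsort(g2t_indices, kind="stable")
    t2g_indices = g2t_values[order]                              # Gaussian IDs per tile
    counts = np.bincount(g2t_indices[order], minlength=num_tiles)  # RLE run lengths
    t2g_indptr = np.concatenate([[0], np.cumsum(counts)])
    return t2g_indptr, t2g_indices

# Toy example: 3 Gaussians overlapping a 4-tile grid.
tiles = [np.array([0, 1]), np.array([1, 2]), np.array([2, 3])]
indptr, indices = build_tile_to_gaussian_csr(tiles, num_tiles=4)
# Gaussians covering tile 1: indices[indptr[1]:indptr[2]] -> [0, 1]
print(indptr, indices)

Grouping Gaussian IDs contiguously per tile in this way is what lets a CUDA kernel assign one thread block per tile (or voxel column) and read its contributing Gaussians with coalesced accesses.
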

Qualitative Results on Occ3D-nuScenes Dataset

[Figure: Qualitative results on Occ3D-nuScenes]

Qualitative Results on Custom Dataset

[Figure: Qualitative results on the custom dataset]

BibTeX

@article{zhao2025shelfgaussian,
  author    = {Zhao, Lingjun and Luo, Yandong and Hays, James and Gan, Lu},
  title     = {ShelfGaussian: Shelf-Supervised Open-Vocabulary Gaussian-based 3D Scene Understanding},
  year      = {2025},
  note      = {Preprint available soon on arXiv.}
}