CoMamba: Real-time Cooperative Perception Unlocked with State Space Models
Jinlong Li1, Xinyu Liu2, Baolu Li2, Runsheng Xu3, Jiachen Li4, Hongkai Yu2, Zhengzhong Tu1
1 Texas A&M University, 2 Cleveland State University, 3 UCLA, 4 University of California, Riverside.
CoMamba scales remarkably well: its GFLOPs, latency, and GPU memory all grow linearly with the number of connected agents, while it still maintains excellent perception performance.
Overview
This work explores the adoption of state space models for the challenging V2X/V2V cooperative perception task, which requires fusing high-order, multimodal visual information from the LiDAR scans of multiple agents. We propose CoMamba, the first attempt to apply linear-complexity Mamba models to V2X cooperative perception. CoMamba is a novel V2X perception framework that efficiently models feature interactions among connected agents using state-space models. Notably, CoMamba scales linearly with the number of connected agents, whereas previous transformer-based models suffer from complexity that is quadratic in the total data dimensionality.
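To make the scaling claim concrete, here is a back-of-the-envelope comparison (illustrative only; the 100x352 BEV grid is an assumed example size, not taken from the paper). With K agents each contributing an H x W feature map, the fused sequence length is L = K*H*W, so self-attention cost grows quadratically in K while a state-space scan grows linearly:

  # Illustrative scaling comparison; H, W are an assumed BEV grid size.
  H, W = 100, 352
  for K in (1, 2, 4, 8):
      L = K * H * W  # fused sequence length grows linearly with K
      print(f"K={K}: attention ~ L^2 = {L**2:.3e} pairwise terms, "
            f"SSM scan ~ L = {L:.3e} steps")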
Overview of our CoMamba V2X-based perception framework. (a) The CoMamba V2X perception system comprises V2X metadata sharing, a LiDAR visual encoder, feature sharing, and the CoMamba fusion network that produces the final prediction. (b) The CoMamba fusion network leverages the Cooperative 2D-Selective-Scan Module to effectively fuse the complex interactions present in high-cost V2X feature sequences. The Global-wise Pooling Module efficiently aggregates global information from the overlapping features of the CAVs.
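As a concrete illustration of the pooling step, below is a minimal PyTorch sketch of a global pooling over the agent axis. The paper does not specify this implementation; the function name and the choice of max-pooling are assumptions for illustration only:

  import torch

  def global_pooling(x: torch.Tensor) -> torch.Tensor:
      # x: (B, C, K, H, W) stacked features from K CAVs. Collapse the agent
      # axis so overlapping observations are aggregated into one ego-centric
      # map. Max-pooling is an assumed choice, not the paper's exact module.
      return x.max(dim=2).values  # (B, C, H, W)

  fused = global_pooling(torch.randn(2, 64, 3, 100, 352))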
Cooperative 2D-Selective-Scan (CSS2D)
The features of K CAVs are embedded into patches. These patches are then traversed along four different scanning paths, and each flattened 1D sequence (of length K×H×W) is processed independently by a distinct Mamba block in parallel. The resulting outputs are then reshaped and merged back into 3D feature maps with the same dimensions as the input features. Here, K = 3 is used as an illustrative example.
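A minimal PyTorch sketch of this scan-and-merge pattern is shown below, assuming VMamba-style four-directional scanning extended over the stacked agent dimension. The function name and the summation-based merge are assumptions, and a simple Conv1d stands in for the Mamba blocks so the example stays self-contained and runnable:

  import torch

  def css2d(x: torch.Tensor, blocks) -> torch.Tensor:
      """x: (B, C, K, H, W) stacked CAV features; blocks: 4 sequence modules."""
      B, C, K, H, W = x.shape
      # Paths 1/2: row-major traversal of the fused K*H*W grid, forward and reversed.
      seq = x.reshape(B, C, K * H * W)                    # (B, C, L), L = K*H*W
      y1 = blocks[0](seq)
      y2 = blocks[1](seq.flip(-1)).flip(-1)
      # Paths 3/4: column-major traversal (swap H and W), forward and reversed.
      seq_t = x.transpose(3, 4).reshape(B, C, K * H * W)
      y3 = blocks[2](seq_t)
      y4 = blocks[3](seq_t.flip(-1)).flip(-1)
      # Undo the transpose for paths 3/4, then merge all four paths by summation.
      y3 = y3.reshape(B, C, K, W, H).transpose(3, 4).reshape(B, C, K * H * W)
      y4 = y4.reshape(B, C, K, W, H).transpose(3, 4).reshape(B, C, K * H * W)
      return (y1 + y2 + y3 + y4).reshape(B, C, K, H, W)   # same shape as input

  # Stand-in for the Mamba blocks: any sequence-to-sequence module fits here.
  blocks = [torch.nn.Conv1d(64, 64, kernel_size=3, padding=1) for _ in range(4)]
  out = css2d(torch.randn(2, 64, 3, 32, 32), blocks)      # K = 3 agents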
Quantitative Results
LiDAR-based 3D detection performance comparison. We show Average Precision (AP) at IoU=0.5 and 0.7 on four V2X testing sets from the OPV2V, V2XSet, and V2V4Real datasets.
Camera-only 3D detection performance comparison. We show Average Precision (AP) at IoU=0.5 and 0.7 on the OPV2V and V2XSet datasets.
Visualization of 3D detection results
BibTeX
  @article{li2024comamba,
    title={CoMamba: Real-time Cooperative Perception Unlocked with State Space Models},
    author={Li, Jinlong and Liu, Xinyu and Li, Baolu and Xu, Runsheng and Li, Jiachen and Yu, Hongkai and Tu, Zhengzhong},
    journal={arXiv preprint arXiv:2409.10699},
    year={2024}
  }