Equivariant Volumetric Grasping

Pinhao Song1 Yutong Hu1 Pengteng Li2 Renaud Detry1
1KU Leuven
2HKUST(GZ)

Abstract

We propose a new volumetric grasp model that is equivariant to rotations around the vertical axis, leading to a significant improvement in sample efficiency. Our model employs a tri-plane volumetric feature representation---i.e., the projection of 3D features onto three canonical planes. We introduce a novel tri-plane feature design in which features on the horizontal plane are equivariant to 90° rotations, while the sum of features from the other two planes remains invariant to reflections induced by the same transformations. This design is enabled by a new deformable steerable convolution, which combines the adaptability of deformable convolutions with the rotational equivariance of steerable ones. This allows the receptive field to adapt to local object geometry while preserving equivariance properties. We further develop equivariant adaptations of two state-of-the-art volumetric grasp planners, GIGA and IGD. Specifically, we derive a new equivariant formulation of IGD's deformable attention mechanism and propose an equivariant generative model of grasp orientations based on flow matching. We provide a detailed analytical justification of the proposed equivariance properties and validate our approach through extensive simulated and real-world experiments. Our results demonstrate that the proposed projection-based design significantly reduces both computational and memory costs. Moreover, the equivariant grasp models built on top of our tri-plane features consistently outperform their non-equivariant counterparts, achieving higher performance with only a modest computational overhead.

Demo of equivariance

Key Insight

Key Insight Diagram

This figure considers a workspace that contains a pink cone and a blue box. As the workspace rotates in increments of 90°, the XY plane rotates accordingly, but the XZ and YZ planes transform differently: every time the workspace rotates by 90°, the YZ plane becomes the previous XZ plane, and the XZ plane becomes a flipped copy of the previous YZ plane. This observation is a key intuition of our paper. Consider a feature queried at the point marked by the star in the figure. Looking at the XZ and YZ planes, we see that, for all rotations of the scene, the query point always falls on empty space in one of the two planes and on the pink cone in the other. Thus, the sum of the matching features in the XZ and YZ planes is invariant to C4 transformations. Since the figure above lists the C4 transformations exhaustively, this observation is in fact a general rule, which defines how tri-plane features transform under C4 group actions.
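This transformation rule is easy to check numerically. The sketch below uses simple sum projections as stand-in tri-plane features (the helper triplane and the names f_xy, f_xz, f_yz are ours, for illustration only) and verifies both the plane-swap rule and the invariance of the XZ + YZ feature sum at a query point. Note that which of the two side planes picks up the flip depends on the chosen rotation direction; we use the direction that matches the description above.

# Minimal numerical check of the tri-plane C4 transformation rule.
# Toy volume and sum projections are illustrative, not the paper's features.
import numpy as np

rng = np.random.default_rng(0)
N = 8
V = rng.random((N, N, N))            # toy occupancy volume V[x, y, z]

def triplane(vol):
    """Project the volume onto the three canonical planes."""
    f_xy = vol.sum(axis=2)           # (x, y): table plane
    f_xz = vol.sum(axis=1)           # (x, z)
    f_yz = vol.sum(axis=0)           # (y, z)
    return f_xy, f_xz, f_yz

f_xy, f_xz, f_yz = triplane(V)

# Rotate the workspace by 90 deg about the vertical (z) axis.
V_rot = np.rot90(V, k=-1, axes=(0, 1))
g_xy, g_xz, g_yz = triplane(V_rot)

# The XY plane simply rotates with the scene ...
assert np.allclose(g_xy, np.rot90(f_xy, k=-1))
# ... while YZ becomes the old XZ, and XZ a flipped copy of the old YZ.
assert np.allclose(g_yz, f_xz)
assert np.allclose(g_xz, f_yz[::-1, :])

# Invariance of the XZ + YZ feature sum at a query point.
x, y, z = 2, 5, 3                    # query point in the original frame
xr, yr = N - 1 - y, x                # its image under the 90 deg rotation
before = f_xz[x, z] + f_yz[y, z]
after = g_xz[xr, z] + g_yz[yr, z]
assert np.isclose(before, after)
print("C4 tri-plane transformation rule verified.")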

Equivariant Triplane UNet

Model Pipeline Diagram

As shown in the figure above, we propose a dual-branch network to encode the tri-plane features. The first branch processes the XY plane: it is a steerable-CNN UNet designed to be equivariant to g ∈ C4, and it yields a refined table-plane feature field \( \hat{f}_{\text{xy}} \). The second branch, \(h_{\text{s}}(\cdot)\), processes the XZ and YZ planes: it is a reflection-invariant UNet and yields \(\hat{f}_{\text{xz}}\) and \(\hat{f}_{\text{yz}}\).
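As a concrete illustration, the sketch below builds one steerable block per branch with the escnn library. The layer widths, kernel sizes, and the group-pooling step are our own illustrative choices under the stated symmetry groups, not the paper's exact UNet architecture.

# Minimal two-branch sketch with escnn (illustrative layers only; the
# actual UNets are deeper and include down/up-sampling paths).
import torch
from escnn import gspaces
from escnn import nn as enn

# Branch 1: steerable CNN on the XY (table) plane, equivariant to C4.
c4 = gspaces.rot2dOnR2(N=4)
in_xy = enn.FieldType(c4, 8 * [c4.trivial_repr])    # 8 scalar channels
hid_xy = enn.FieldType(c4, 4 * [c4.regular_repr])   # regular features
branch_xy = enn.SequentialModule(
    enn.R2Conv(in_xy, hid_xy, kernel_size=3, padding=1),
    enn.ReLU(hid_xy),
)

# Branch 2: shared network h_s for the XZ and YZ planes; pooling over the
# reflection group makes the output channels reflection-invariant.
d1 = gspaces.flip2dOnR2()
in_s = enn.FieldType(d1, 8 * [d1.trivial_repr])
hid_s = enn.FieldType(d1, 4 * [d1.regular_repr])
branch_s = enn.SequentialModule(
    enn.R2Conv(in_s, hid_s, kernel_size=3, padding=1),
    enn.ReLU(hid_s),
    enn.GroupPooling(hid_s),
)

f_xy = enn.GeometricTensor(torch.randn(1, 8, 40, 40), in_xy)
f_xz = enn.GeometricTensor(torch.randn(1, 8, 40, 40), in_s)
f_xy_hat = branch_xy(f_xy).tensor   # C4-equivariant table-plane feature
f_xz_hat = branch_s(f_xz).tensor    # reflection-invariant side-plane feature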

EquiGIGA and EquiIGD

Model Pipeline Diagram

We further extend our framework to equivariant variants of two SOTA volumetric grasp planners, GIGA and IGD, yielding EquiGIGA and EquiIGD. Experimental results show that EquiGIGA and EquiIGD achieve the highest performance among all SOTA methods, and that the equivariant models built on our tri-plane features consistently outperform their non-equivariant baselines.
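For the flow-matching generative model of grasp orientations mentioned in the abstract, the snippet below shows the generic conditional flow-matching training objective with linear interpolation paths. It is a minimal Euclidean sketch, not the paper's equivariant SO(3) formulation; model, x1, and cond are placeholder names.

# Generic conditional flow-matching loss (rectified linear paths).
# x1: batch of ground-truth targets, e.g. orientation parameters (B, D);
# cond: conditioning features, e.g. queried tri-plane features.
import torch

def cfm_loss(model, x1, cond):
    """model(x_t, t, cond) predicts the velocity field v_t."""
    x0 = torch.randn_like(x1)                        # noise sample
    t = torch.rand(x1.shape[0], 1, device=x1.device) # time in [0, 1]
    x_t = (1 - t) * x0 + t * x1                      # linear interpolation
    v_target = x1 - x0                               # constant target velocity
    v_pred = model(x_t, t, cond)
    return torch.nn.functional.mse_loss(v_pred, v_target)

At sampling time, one would draw x0 from the same Gaussian and integrate the learned velocity field from t = 0 to t = 1 with a few Euler steps.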

Real World Experiment Setup

Setup Diagram

Packed and Pile Scenes

Adversarial Scenes

BibTeX

@article{song2025equivariant,
  title={Equivariant Volumetric Grasping},
  author={Song, Pinhao and Hu, Yutong and Li, Pengteng and Detry, Renaud},
  journal={arXiv preprint arXiv:2507.18847},
  year={2025}
}