👑 SIRE: SE(3) Intrinsic Rigidity Embeddings

TL;DR:
• Discovery of objects via intrinsic rigidity embeddings and a simple 4D reconstruction objective.
• It can be trained using just RGB videos and off-the-shelf 2D point trackers, either on a single video for per-scene 4D reconstruction, or on a dataset of videos to learn generalizable priors over geometry and rigidity.

Abstract

Motion serves as a powerful cue in human perception, helping to organize the physical world into distinct entities by separating independently moving surfaces. We introduce SIRE, a method for learning intrinsic rigidity embeddings from video, capturing the underlying motion structure of dynamic scenes. Our approach softly groups pixels into rigid components by learning a feature space where points belonging to the same rigid object share similar embeddings. We achieve this by extending traditional static 3D scene reconstruction to also include dynamic scene modeling, via a simple yet effective 4D reconstruction loss. By lifting 2D point tracks into SE(3) rigid trajectories and enforcing consistency with their 2D re-projections, our method learns compelling rigidity-based representations without explicit supervision. Crucially, our framework is fully end-to-end differentiable and can be optimized either on video datasets to learn generalizable image priors, or surprisingly even on a single video to capture scene-specific structure, highlighting strong data efficiency. We demonstrate the effectiveness of our rigidity embeddings and 4D reconstruction across multiple settings, including self-supervised depth estimation, SE(3) rigid motion estimation, and object segmentation. Our findings suggest that our simple formulation can pave the way towards self-supervised learning of priors over geometry and object rigidities from large-scale video data.
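
The sketch below illustrates the core idea of the 4D reconstruction loss in PyTorch: reference-frame tracks are lifted to 3D with predicted depth, moved by per-frame SE(3) motions of K rigid components, softly blended by rigidity-embedding weights, and re-projected for comparison against the observed 2D tracks. The number of components K, the tensor layout, and the function names are illustrative assumptions, not the released implementation.

```python
# Minimal sketch of the 4D reconstruction loss, under assumed shapes/names.
import torch
import torch.nn.functional as F

def so3_exp(w):
    """Rodrigues' formula: axis-angle (..., 3) -> rotation matrices (..., 3, 3)."""
    theta = w.norm(dim=-1, keepdim=True).clamp(min=1e-8)
    k = w / theta
    K = torch.zeros(*w.shape[:-1], 3, 3, device=w.device, dtype=w.dtype)
    K[..., 0, 1], K[..., 0, 2] = -k[..., 2], k[..., 1]
    K[..., 1, 0], K[..., 1, 2] = k[..., 2], -k[..., 0]
    K[..., 2, 0], K[..., 2, 1] = -k[..., 1], k[..., 0]
    I = torch.eye(3, device=w.device, dtype=w.dtype).expand_as(K)
    theta = theta[..., None]
    return I + torch.sin(theta) * K + (1 - torch.cos(theta)) * (K @ K)

def reprojection_loss(tracks_2d, depth0, weights, rot, trans, intrinsics):
    """
    tracks_2d:  (T, N, 2) 2D point tracks from an off-the-shelf tracker (pixels)
    depth0:     (N,)      predicted depth of each track in the reference frame
    weights:    (N, K)    soft rigidity assignments per track (rows sum to 1)
    rot, trans: (T, K, 3) per-frame, per-component SE(3) motion parameters
    intrinsics: (3, 3)    camera intrinsics
    """
    # Lift reference-frame tracks to 3D using the predicted depth.
    uv1 = F.pad(tracks_2d[0], (0, 1), value=1.0)                          # (N, 3)
    pts0 = (torch.linalg.inv(intrinsics) @ uv1.T).T * depth0[:, None]     # (N, 3)

    # Move every point with every rigid component's SE(3) motion.
    R = so3_exp(rot)                                                      # (T, K, 3, 3)
    moved = torch.einsum('tkij,nj->tkni', R, pts0) + trans[:, :, None, :] # (T, K, N, 3)

    # Softly blend the per-component trajectories by the rigidity weights.
    pts_t = torch.einsum('nk,tkni->tni', weights, moved)                  # (T, N, 3)

    # Re-project and enforce consistency with the observed 2D tracks.
    proj = torch.einsum('ij,tnj->tni', intrinsics, pts_t)
    proj = proj[..., :2] / proj[..., 2:].clamp(min=1e-6)
    return (proj - tracks_2d).abs().mean()
```

Because every step is differentiable, the same loss can back-propagate into a depth network, the SE(3) motion parameters, and the rigidity embeddings, whether fit to a single video or trained over a dataset.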

Self-Supervised Results

Here we plot depth, rigidity embeddings, and selected rigidity maps from our model trained on CO3D-Dogs without any depth supervision.

Per-Track Self-Attention Grids Interpreted as Rigidity Weights

Here we plot the track images (RGB, SE(3) rotation and translation, and affinity-embedding PCA) at the top, the per-track rigidity grid in the middle, and highlighted rigidity maps at the bottom; a small sketch of how such a grid can be read off the embeddings follows below.
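
A small sketch of this interpretation, assuming the grid is dot-product self-attention over per-track rigidity embeddings: row i of the grid gives a soft "same rigid object as track i" map. The temperature, the softmax normalization, and the regular-grid track sampling are assumptions for illustration.

```python
# Sketch: per-track rigidity grid as self-attention over rigidity embeddings.
import torch
import torch.nn.functional as F

def rigidity_grid(embeddings, temperature=0.07):
    """embeddings: (N, D) per-track rigidity embeddings -> (N, N) affinity grid."""
    z = F.normalize(embeddings, dim=-1)
    return torch.softmax(z @ z.T / temperature, dim=-1)

def rigidity_map(grid, track_idx, hw):
    """Reshape one row of the grid into an H x W map for visualization,
    assuming tracks were sampled on a regular H x W pixel grid."""
    return grid[track_idx].reshape(hw)

# Example: 32x32 = 1024 tracks with 16-dim embeddings.
emb = torch.randn(1024, 16)
grid = rigidity_grid(emb)               # (1024, 1024) per-track rigidity grid
heat = rigidity_map(grid, 0, (32, 32))  # rigidity map highlighted for track 0
```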

4D point clouds predicted by our model

Here we show 4D reconstructions using our method. The top two use our self-supervised depth prior and the bottom two use off-the-shelf depth.
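
For reference, a minimal sketch of how a colored point cloud can be unprojected from a per-frame depth map (whether from our self-supervised prior or an off-the-shelf predictor); stacking these clouds over time yields the 4D visualization. The helper name and intrinsics handling are illustrative assumptions.

```python
# Sketch: unproject a depth map into a colored point cloud for one frame.
import torch

def unproject_depth(depth, rgb, intrinsics):
    """depth: (H, W), rgb: (H, W, 3), intrinsics: (3, 3) -> (H*W, 6) xyz+rgb."""
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing='ij')
    uv1 = torch.stack([u, v, torch.ones_like(u)], dim=-1).float().reshape(-1, 3)
    xyz = (torch.linalg.inv(intrinsics) @ uv1.T).T * depth.reshape(-1, 1)
    return torch.cat([xyz, rgb.reshape(-1, 3)], dim=-1)

# One colored cloud per frame; the sequence over time is the 4D reconstruction.
# clouds = [unproject_depth(d, im, K) for d, im in zip(depths, images)]
```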