👑 SIRE: SE(3) Intrinsic Rigidity Embeddings

TL;DR:
• Discovery of objects via intrinsic rigidity embeddings and a simple 4D reconstruction objective.
• It can be trained using just RGB videos and off-the-shelf 2D point trackers, either on a single video for per-scene 4D reconstruction, or on a dataset of videos to learn generalizable priors over geometry and rigidity.

Abstract

Motion serves as a powerful cue in human perception, helping to organize the physical world into distinct entities by separating independently moving surfaces. We introduce SIRE, a method for learning intrinsic rigidity embeddings from video, capturing the underlying motion structure of dynamic scenes. Our approach softly groups pixels into rigid components by learning a feature space where points belonging to the same rigid object share similar embeddings. We achieve this by extending traditional static 3D scene reconstruction to also include dynamic scene modeling, via a simple yet effective 4D reconstruction loss. By lifting 2D point tracks into SE(3) rigid trajectories and enforcing consistency with their 2D re-projections, our method learns compelling rigidity-based representations without explicit supervision. Crucially, our framework is fully end-to-end differentiable and can be optimized either on video datasets to learn generalizable image priors, or surprisingly even on a single video to capture scene-specific structure, highlighting strong data efficiency. We demonstrate the effectiveness of our rigidity embeddings and 4D reconstruction across multiple settings, including self-supervised depth estimation, SE(3) rigid motion estimation, and object segmentation. Our findings suggest that our simple formulation can pave the way towards self-supervised learning of priors over geometry and object rigidities from large-scale video data.
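
The sketch below illustrates the core idea of the 4D reconstruction loss in PyTorch: reference-frame tracks are lifted to 3D with predicted depth, moved by per-frame SE(3) motions of K rigid components, softly blended by rigidity-embedding weights, and re-projected for comparison against the observed 2D tracks. The number of components K, the tensor layout, and the function names are illustrative assumptions, not the released implementation.

```python
# Minimal sketch of the 4D reconstruction loss, under assumed shapes/names.
import torch
import torch.nn.functional as F

def so3_exp(w):
    """Rodrigues' formula: axis-angle (..., 3) -> rotation matrices (..., 3, 3)."""
    theta = w.norm(dim=-1, keepdim=True).clamp(min=1e-8)
    k = w / theta
    K = torch.zeros(*w.shape[:-1], 3, 3, device=w.device, dtype=w.dtype)
    K[..., 0, 1], K[..., 0, 2] = -k[..., 2], k[..., 1]
    K[..., 1, 0], K[..., 1, 2] = k[..., 2], -k[..., 0]
    K[..., 2, 0], K[..., 2, 1] = -k[..., 1], k[..., 0]
    I = torch.eye(3, device=w.device, dtype=w.dtype).expand_as(K)
    theta = theta[..., None]
    return I + torch.sin(theta) * K + (1 - torch.cos(theta)) * (K @ K)

def reprojection_loss(tracks_2d, depth0, weights, rot, trans, intrinsics):
    """
    tracks_2d:  (T, N, 2) 2D point tracks from an off-the-shelf tracker (pixels)
    depth0:     (N,)      predicted depth of each track in the reference frame
    weights:    (N, K)    soft rigidity assignments per track (rows sum to 1)
    rot, trans: (T, K, 3) per-frame, per-component SE(3) motion parameters
    intrinsics: (3, 3)    camera intrinsics
    """
    # Lift reference-frame tracks to 3D using the predicted depth.
    uv1 = F.pad(tracks_2d[0], (0, 1), value=1.0)                          # (N, 3)
    pts0 = (torch.linalg.inv(intrinsics) @ uv1.T).T * depth0[:, None]     # (N, 3)

    # Move every point with every rigid component's SE(3) motion.
    R = so3_exp(rot)                                                      # (T, K, 3, 3)
    moved = torch.einsum('tkij,nj->tkni', R, pts0) + trans[:, :, None, :] # (T, K, N, 3)

    # Softly blend the per-component trajectories by the rigidity weights.
    pts_t = torch.einsum('nk,tkni->tni', weights, moved)                  # (T, N, 3)

    # Re-project and enforce consistency with the observed 2D tracks.
    proj = torch.einsum('ij,tnj->tni', intrinsics, pts_t)
    proj = proj[..., :2] / proj[..., 2:].clamp(min=1e-6)
    return (proj - tracks_2d).abs().mean()
```

Because every step is differentiable, the same loss can back-propagate into a depth network, the SE(3) motion parameters, and the rigidity embeddings, whether fit to a single video or trained over a dataset.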

Self-Supervised Results

Here we plot depth, rigidity embeddings, and selected rigidity maps from our model trained on CO3D-Dogs without any depth supervision.

Per-Track Self-Attention Grids Interpreted as Rigidity Weights

Here we plot the track images (RGB, SE(3) rotation and translation, and affinity-embedding PCA) at the top, the per-track rigidity grid in the middle, and highlighted rigidity maps at the bottom; a small sketch of how such a grid can be read off the embeddings follows below.
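
A small sketch of this interpretation, assuming the grid is dot-product self-attention over per-track rigidity embeddings: row i of the grid gives a soft "same rigid object as track i" map. The temperature, the softmax normalization, and the regular-grid track sampling are assumptions for illustration.

```python
# Sketch: per-track rigidity grid as self-attention over rigidity embeddings.
import torch
import torch.nn.functional as F

def rigidity_grid(embeddings, temperature=0.07):
    """embeddings: (N, D) per-track rigidity embeddings -> (N, N) affinity grid."""
    z = F.normalize(embeddings, dim=-1)
    return torch.softmax(z @ z.T / temperature, dim=-1)

def rigidity_map(grid, track_idx, hw):
    """Reshape one row of the grid into an H x W map for visualization,
    assuming tracks were sampled on a regular H x W pixel grid."""
    return grid[track_idx].reshape(hw)

# Example: 32x32 = 1024 tracks with 16-dim embeddings.
emb = torch.randn(1024, 16)
grid = rigidity_grid(emb)               # (1024, 1024) per-track rigidity grid
heat = rigidity_map(grid, 0, (32, 32))  # rigidity map highlighted for track 0
```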

4D point clouds predicted by our model

Here we show 4D reconstructions using our method. The top two use our self-supervised depth prior and the bottom two use off-the-shelf depth.
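
For reference, a minimal sketch of how a colored point cloud can be unprojected from a per-frame depth map (whether from our self-supervised prior or an off-the-shelf predictor); stacking these clouds over time yields the 4D visualization. The helper name and intrinsics handling are illustrative assumptions.

```python
# Sketch: unproject a depth map into a colored point cloud for one frame.
import torch

def unproject_depth(depth, rgb, intrinsics):
    """depth: (H, W), rgb: (H, W, 3), intrinsics: (3, 3) -> (H*W, 6) xyz+rgb."""
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing='ij')
    uv1 = torch.stack([u, v, torch.ones_like(u)], dim=-1).float().reshape(-1, 3)
    xyz = (torch.linalg.inv(intrinsics) @ uv1.T).T * depth.reshape(-1, 1)
    return torch.cat([xyz, rgb.reshape(-1, 3)], dim=-1)

# One colored cloud per frame; the sequence over time is the 4D reconstruction.
# clouds = [unproject_depth(d, im, K) for d, im in zip(depths, images)]
```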