FlowCam

Training Generalizable 3D Radiance Fields without Camera Poses via Pixel-Aligned Scene Flow

[Videos: Input Video · Estimated Poses · Novel Trajectory · Rendered RGB · Rendered Depth]

Our model infers camera poses and reconstructs a radiance field in a single feed-forward pass, without any per-scene optimization. Above, we demonstrate real-time, feed-forward pose estimation (~20 fps), followed by feed-forward 3D reconstruction and view synthesis along a novel trajectory. Since our model estimates poses and geometry on short video clips, we apply both pose estimation and view synthesis in a sliding-window fashion over the video and the trajectory. Our model is trained without camera poses, supervised only by re-rendering losses and optical flow on uncurated video.

Abstract

Reconstruction of 3D neural fields from posed images has emerged as a promising method for self-supervised representation learning. The key challenge preventing the deployment of these 3D scene learners on large-scale video data is their dependence on precise camera poses from structure-from-motion, which is prohibitively expensive to run at scale. We propose a method that jointly reconstructs camera poses and 3D neural scene representations online and in a single forward pass. We estimate poses by first lifting frame-to-frame optical flow to 3D scene flow via differentiable rendering, preserving locality and shift-equivariance of the image processing backbone. SE(3) camera pose estimation is then performed via a weighted least-squares fit to the scene flow field. This formulation enables us to jointly supervise pose estimation and a generalizable neural scene representation via re-rendering the input video, and thus to train end-to-end and fully self-supervised on real-world video datasets. We demonstrate that our method performs robustly on diverse, real-world video, notably on sequences traditionally challenging to optimization-based pose estimation techniques.
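To make the core idea concrete, here is a minimal sketch (in PyTorch, ours and not the released code) of lifting 2D optical flow to 3D scene flow. It assumes a pixelNeRF-style renderer has already produced a pixel-aligned expected 3D ray termination point for every pixel of each frame; all function and tensor names are our own illustrative choices.

# A minimal sketch of lifting 2D optical flow to 3D scene flow, assuming
# pixel-aligned expected ray termination points are available for both frames.
import torch
import torch.nn.functional as F

def lift_flow_to_scene_flow(surface_i, surface_j, flow_ij):
    """surface_i, surface_j: (B, 3, H, W) expected ray termination points for
    frames i and j. flow_ij: (B, 2, H, W) optical flow from i to j, in pixels,
    with (x, y) displacement channels. Returns (B, 3, H, W) 3D scene flow."""
    B, _, H, W = flow_ij.shape

    # Pixel grid of frame i, displaced by the flow: the location of each
    # pixel's correspondence in frame j.
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(flow_ij)       # (2, H, W)
    corr = grid.unsqueeze(0) + flow_ij                            # (B, 2, H, W)

    # Normalize to [-1, 1] and sample frame j's surface points at the
    # corresponded locations.
    corr_norm = torch.stack(
        (2.0 * corr[:, 0] / (W - 1) - 1.0,
         2.0 * corr[:, 1] / (H - 1) - 1.0), dim=-1)               # (B, H, W, 2)
    surface_j_at_corr = F.grid_sample(surface_j, corr_norm, align_corners=True)

    # Scene flow is the difference of corresponding 3D points.
    return surface_j_at_corr - surface_i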

Pipeline

Given a set of video frames, our method first computes frame-to-frame camera poses (left) and then re-renders the input video (right). To estimate the pose between two frames, we compute off-the-shelf optical flow to establish 2D correspondences. Using single-view pixelNeRF, we obtain a surface point cloud for each frame, X and Y respectively, as the expected 3D ray termination point of every pixel. Because X and Y are pixel-aligned, the optical flow lets us compute 3D scene flow as the difference of corresponding 3D points. We then find the camera pose P ∈ SE(3) that best explains the 3D flow field by solving a weighted least-squares problem with flow confidence weights W. Using all frame-to-frame poses, we re-render all frames, enforcing an RGB loss as well as a flow loss between the projected, pose-induced 3D scene flow and the 2D optical flow. Our method is trained end-to-end, assuming only an off-the-shelf optical flow estimator.
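The weighted least-squares fit above is, in essence, a weighted Procrustes/Kabsch problem over the corresponded 3D point sets. Below is a minimal sketch of such a solver, assuming the point clouds X and Y have been flattened to N×3 along with per-correspondence confidence weights; it illustrates the technique and is not the released implementation.

# A minimal sketch of a weighted Procrustes / Kabsch solve: given corresponded
# 3D points X (frame i) and Y (frame j) with confidence weights w, find the
# rigid transform minimizing sum_k w_k ||R x_k + t - y_k||^2.
import torch

def weighted_rigid_fit(X, Y, w):
    """X, Y: (N, 3) corresponded 3D points. w: (N,) non-negative weights.
    Returns R (3, 3), t (3,) with Y ≈ X @ R.T + t."""
    w = w / w.sum().clamp(min=1e-8)

    # Weighted centroids and centered point sets.
    mu_x = (w[:, None] * X).sum(dim=0)
    mu_y = (w[:, None] * Y).sum(dim=0)
    Xc, Yc = X - mu_x, Y - mu_y

    # Weighted cross-covariance and its SVD.
    S = (w[:, None] * Xc).T @ Yc                                  # (3, 3)
    U, _, Vt = torch.linalg.svd(S)

    # Keep a proper rotation (det = +1) even for degenerate or reflective fits.
    d = torch.sign(torch.det(Vt.T @ U.T))
    D = torch.diag(torch.stack((torch.ones_like(d), torch.ones_like(d), d)))
    R = Vt.T @ D @ U.T
    t = mu_y - R @ mu_x
    return R, t

In the method described above, X, Y, and the weights would come from the pixel-aligned surface point clouds and flow confidences, flattened over all pixels of the frame pair.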

Additional Sliding-Window CO3D results

Here we show additional pose estimation and novel view synthesis results. As our model operates on shorter video clips, we apply our model in a sliding window fashion for these longer sequences. Though this sliding window approach allows our model to estimate poses and 3D geometry for extended sequences, it also yields temporal flickering in the renderings. Exploring global optimization extensions for longer sequences is exciting future work.
[Videos: Input Video · Estimated Poses · Novel Trajectory · Rendered RGB · Rendered Depth]
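As a rough illustration of the sliding-window scheme described above, the sketch below chains the relative poses of overlapping clips into a single trajectory. The `estimate_clip_poses` callable is a hypothetical stand-in for a feed-forward pose module that returns 4x4 poses relative to each clip's first frame.

# A sketch (under our own assumptions) of chaining relative poses from
# overlapping clips into one trajectory for a longer video.
import torch

def chain_sliding_window_poses(frames, estimate_clip_poses, window=8, overlap=1):
    """frames: list of image tensors. Returns one 4x4 pose per frame, with
    frame 0 fixed to the identity."""
    poses = [torch.eye(4)]
    stride = window - overlap
    for start in range(0, len(frames) - 1, stride):
        clip = frames[start:start + window]
        clip_poses = estimate_clip_poses(clip)   # len(clip) poses, first = identity
        # Anchor the clip at the already-estimated pose of its first frame and
        # append poses only for the frames this window newly covers.
        anchor = poses[start]
        new = clip_poses[len(poses) - start:]
        poses.extend(anchor @ p for p in new)
    return poses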

Additional results on all datasets

Below we show the input video, a rendered time interpolation, and a rendered novel trajectory for scenes from each dataset.

CO3D-Hydrants

[Videos, two scenes: Input · Interp. · Wobble]

CO3D-10Category

[Videos, two scenes: Input · Interp. · Wobble]

RealEstate10K

[Videos, two scenes: Input · Interp. · Wobble]

KITTI

[Videos: Input · Interp. · Wobble]


Fine-Tuned Pose Estimation and View Synthesis on a Large-Scale, Out-of-Distribution Scene

We evaluate our RealEstate10K-trained model on a significantly out-of-distribution scene from the Tanks and Temples dataset, first without any scene adaptation and then with it. Note that our scene-adaptation fine-tuning stage is not equivalent to direct optimization of camera poses and a radiance field, as performed e.g. in BARF: neither the camera poses nor the radiance field are free variables. Instead, we fine-tune the weights of our convolutional inference backbone and MLP renderer on subsequences of the target video for more accurate feed-forward prediction. Even with this significant distribution gap, our RealEstate10K model's estimated trajectory captures the looping structure of the ground truth, albeit with accumulated drift. After the scene-adaptation fine-tuning stage, our model estimates poses that align closely with the ground truth, despite having no loop-closure mechanism. We also plot the trajectory estimated by BARF, which fails to capture the correct pose distribution.

[Videos: Input Video · Rendered RGB · Rendered Depth]

[Trajectories and novel-trajectory renderings: BARF · Our RealEstate10K Model · Our Scene-Adapted Model · Novel trajectory]
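For concreteness, the sketch below shows one way the scene-adaptation stage described above could look: only the network weights are optimized, on random subsequences of the target video. The model interface and the specific MSE/L1 losses are illustrative stand-ins for the re-rendering and flow losses, not the released training code.

# A minimal sketch of scene adaptation: the network weights are the only free
# variables; no camera poses or radiance field are optimized directly.
import torch
import torch.nn.functional as F

def scene_adapt(model, frames, flows, steps=2000, clip_len=8, lr=1e-5):
    """frames: (T, 3, H, W) video; flows: (T-1, 2, H, W) precomputed optical flow.
    `model(clip, clip_flow)` is assumed to return the re-rendered clip and the
    2D flow induced by its estimated poses and geometry."""
    optim = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        # Sample a random subsequence of the target video.
        start = torch.randint(0, max(frames.shape[0] - clip_len, 1), (1,)).item()
        clip = frames[start:start + clip_len]
        clip_flow = flows[start:start + clip_len - 1]

        rendered_rgb, induced_flow = model(clip, clip_flow)
        loss = F.mse_loss(rendered_rgb, clip) + F.l1_loss(induced_flow, clip_flow)

        optim.zero_grad()
        loss.backward()
        optim.step()
    return model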


Related Links

  • Our renderer is based on pixelNeRF.
  • Also see the concurrent work DBARF, which likewise regresses camera poses alongside a generalizable NeRF.

Bibtex

Acknowledgements

We thank David Charatan for helpful discussions and comments on the text. This website is recycled from pixelNeRF.

Please send any questions or comments to Cameron Smith.