CoherentGS: Sparse Novel View Synthesis with Coherent 3D Gaussians
ECCV 2024
Overview of the optimization pipeline. For every input image, we obtain monocular depth (Depth Anything) and dense flow correspondences between all image pairs (FlowFormer++). These inputs are used to initialize a good set of 3D Gaussians for the subsequent optimization stage. The initialized 3D Gaussians, along with depth-based segmentation masks, are then used in a regularized 3D Gaussian optimization to obtain a high-quality reconstruction.
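To make the depth-based initialization concrete, here is a minimal sketch that lifts a monocular depth map to one Gaussian center per pixel. The pinhole intrinsics K and camera-to-world matrix c2w are assumed inputs, and the function is an illustration rather than our released implementation.

```python
import torch

def depth_to_gaussian_centers(depth, K, c2w):
    """Lift each pixel of an (H, W) monocular depth map to a 3D point that
    serves as the initial center of one Gaussian (one Gaussian per pixel).
    K: (3, 3) pinhole intrinsics; c2w: (4, 4) camera-to-world transform."""
    H, W = depth.shape
    v, u = torch.meshgrid(
        torch.arange(H, dtype=torch.float32),
        torch.arange(W, dtype=torch.float32),
        indexing="ij",
    )
    # Pixel centers in homogeneous image coordinates.
    pix = torch.stack([u + 0.5, v + 0.5, torch.ones_like(u)], dim=-1)  # (H, W, 3)
    # Back-project to camera-space rays with z = 1, then scale by depth.
    rays = pix @ torch.linalg.inv(K).T
    pts_cam = rays * depth[..., None]
    # Transform to world space.
    pts_h = torch.cat([pts_cam, torch.ones(H, W, 1)], dim=-1).reshape(-1, 4)
    return (pts_h @ c2w.T)[:, :3]  # (H*W, 3) initial Gaussian centers
```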
Abstract
The field of 3D reconstruction from images has rapidly evolved in the past few years, first with the introduction of Neural Radiance Fields (NeRF) and more recently with 3D Gaussian Splatting (3DGS). The latter provides a significant edge over NeRF in terms of training and inference speed, as well as reconstruction quality. Although 3DGS works well for dense input images, its unstructured point-cloud-like representation quickly overfits to the more challenging setup of extremely sparse input images (e.g., 3 images), creating a representation that appears as a jumble of needles from novel views. To address this issue, we propose regularized optimization and depth-based initialization. Our key idea is to introduce a structured Gaussian representation that can be controlled in 2D image space. We then constrain the Gaussians, in particular their positions, and prevent them from moving independently during optimization. Specifically, we introduce single- and multi-view constraints through an implicit convolutional decoder and a total variation loss, respectively. With the coherency introduced to the Gaussians, we further constrain the optimization through a flow-based loss function. To support our regularized optimization, we propose an approach to initialize the Gaussians using monocular depth estimates at each input view. We demonstrate significant improvements compared to state-of-the-art sparse-view NeRF-based approaches on a variety of scenes.
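As a rough illustration of the flow-based loss mentioned above, the sketch below penalizes the disagreement between where Gaussians lifted from view i reproject into view j and where the precomputed flow says the corresponding pixels should land. All names (pix_i, flow_ij, project_to_j) are illustrative placeholders, not our actual interface.

```python
import torch

def flow_reprojection_loss(pts_i, pix_i, flow_ij, project_to_j):
    """pts_i: (N, 3) centers of Gaussians lifted from view i, one per pixel
    at pix_i (N, 2); flow_ij: (N, 2) precomputed FlowFormer++ flow from
    view i to view j, sampled at pix_i; project_to_j: callable mapping
    world points to (N, 2) pixel coordinates in view j."""
    target = pix_i + flow_ij    # where each view-i pixel should land in view j
    pred = project_to_j(pts_i)  # where the lifted Gaussians actually land
    return (pred - target).abs().mean()
```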
Talk
Implicit Decoder
During regularized optimization, the implicit decoder predicts the residual depth ΔD that moves the Gaussians from their initial positions toward the true scene depth D. The input coordinate n to the decoder corresponds to the input view with camera camₙ. To preserve sharp discontinuities, we apply binary segmentation masks, obtained by thresholding the monocular depth, to the decoder output.
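A minimal sketch of this idea, assuming a per-view latent decoded by a small CNN with one residual channel per depth segment; the architecture and the way the masks composite the output are simplifications, not the exact network from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualDepthDecoder(nn.Module):
    """Sketch of the implicit convolutional decoder: a per-view latent is
    decoded into a residual depth ΔD, and binary segmentation masks
    composite the result so the smooth residual never blurs across depth
    discontinuities. Latent size and channel widths are assumptions."""

    def __init__(self, num_views, num_segments, ch=32, latent_hw=(8, 8)):
        super().__init__()
        # One learnable latent grid per input view; the coordinate n selects it.
        self.latents = nn.Parameter(0.01 * torch.randn(num_views, ch, *latent_hw))
        self.net = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, num_segments, 3, padding=1),
        )

    def forward(self, n, init_depth, seg_masks):
        """n: view index; init_depth: (H, W) depth from initialization;
        seg_masks: (K, H, W) binary masks from thresholding monocular depth."""
        H, W = init_depth.shape
        delta = self.net(self.latents[n:n + 1])                  # (1, K, h, w)
        delta = F.interpolate(delta, size=(H, W), mode="bilinear",
                              align_corners=False)[0]            # (K, H, W)
        # Each depth segment picks up only its own smooth residual.
        residual = (delta * seg_masks).sum(dim=0)                # (H, W)
        return init_depth + residual
```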
Optimization
The implicit decoder enables smooth deformation of the initialized Gaussians, resulting in coherent geometry and high-quality texture.
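For instance, a total variation term over the (residual) depth keeps neighboring Gaussians moving together; restricting it to neighbors within the same depth segment, as below, is one plausible reading rather than the exact loss from the paper.

```python
import torch

def total_variation_loss(depth, seg_masks=None):
    """Total variation on an (H, W) depth (or residual-depth) map, pushing
    neighboring Gaussians to move coherently. Limiting the penalty to
    neighbors within the same segment is our assumption here, so smoothness
    is not enforced across depth discontinuities."""
    dx = (depth[:, 1:] - depth[:, :-1]).abs()
    dy = (depth[1:, :] - depth[:-1, :]).abs()
    if seg_masks is not None:  # seg_masks: (K, H, W) binary
        same_x = (seg_masks[:, :, 1:] * seg_masks[:, :, :-1]).sum(dim=0)
        same_y = (seg_masks[:, 1:, :] * seg_masks[:, :-1, :]).sum(dim=0)
        dx, dy = dx * same_x, dy * same_y
    return dx.mean() + dy.mean()
```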
Comparisons with other few-view NeRF methods
Baseline method (left) vs. CoherentGS (right); scenes trained on 2 views.
Inpainting
In contrast to other methods, our approach does not hallucinate occluded details. This provides a unique advantage: the user can apply any inpainting technique to fill in the missing regions. As a proof of concept, we apply a simple inpainting technique here to generate the missing texture and project it into the scene.
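A rough sketch of that proof of concept, using OpenCV's Telea inpainter as a stand-in inpainting technique and a rendered RGB/depth pair from the optimized model as input; this is illustrative, not our exact pipeline.

```python
import cv2
import numpy as np

def inpaint_and_lift(rgb, depth, hole_mask, K, c2w):
    """rgb: (H, W, 3) uint8 novel-view render; depth: (H, W) float rendered
    depth; hole_mask: (H, W) uint8, 255 where no Gaussian was hit."""
    # Fill the missing texture with a simple diffusion-based inpainter.
    filled = cv2.inpaint(rgb, hole_mask, inpaintRadius=5,
                         flags=cv2.INPAINT_TELEA)
    # Crudely fill hole depths the same way (OpenCV inpaints 8-bit images,
    # so normalize, inpaint, and rescale).
    valid = depth[hole_mask == 0]
    d_min, d_max = float(valid.min()), float(valid.max())
    d8 = ((depth - d_min) / (d_max - d_min + 1e-8) * 255).astype(np.uint8)
    d8 = cv2.inpaint(d8, hole_mask, 5, cv2.INPAINT_TELEA)
    depth_filled = d8.astype(np.float32) / 255.0 * (d_max - d_min) + d_min
    # Lift only the newly filled pixels back into the scene as new points.
    v, u = np.nonzero(hole_mask)
    pix = np.stack([u + 0.5, v + 0.5, np.ones_like(u, dtype=np.float64)], axis=-1)
    pts_cam = (pix @ np.linalg.inv(K).T) * depth_filled[v, u][:, None]
    pts_h = np.concatenate([pts_cam, np.ones((len(u), 1))], axis=1)
    return (pts_h @ c2w.T)[:, :3], filled[v, u]  # new points and their colors
```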
Related Work
Sparse Novel View Synthesis
RI3D: Sparse View Synthesis Using Repair and Inpainting Diffusion Priors
PanoDreamer: 3D Panorama Synthesis from a Single Image
ReShader: View-Dependent Highlights for Single Image View-Synthesis
Implicit Models for View and Time Interpolation
Implicit View-Time Interpolation of Stereo Videos using Multi-Plane Disparities and Non-Uniform Coordinates
Frame Interpolation for Dynamic Scenes with Implicit Flow Encoding
Citation
Acknowledgements
The project was funded in part by a generous gift from Meta. The website template was borrowed from ReconFusion.