CoherentGS: Sparse Novel View Synthesis with Coherent 3D Gaussians

1Texas A&M University,   2Meta Reality Labs,   3LMU Munich

ECCV 2024



arXiv · Video · Code (coming soon)



Overview of the optimization pipeline. For every input image, we obtain monocular depth (Depth Anything) and dense flow correspondences between all image pairs (FlowFormer++). These inputs are used to initialize a good set of 3D Gaussians for the subsequent optimization stage. The initialized 3D Gaussians, together with depth-based segmentation masks, then undergo a regularized 3D Gaussian optimization that produces a high-quality reconstruction.
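Concretely, the depth-based initialization amounts to unprojecting every pixel of each input view into one 3D Gaussian center. The following is a minimal sketch, assuming an aligned monocular depth map and a standard pinhole camera; the function name and tensor layout are ours, not the released code, and the remaining Gaussian attributes (scales, opacities, rotations) would be initialized separately.

```python
import torch

def init_gaussians_from_depth(image, depth, K, cam_to_world):
    """Unproject every pixel of one input view into a 3D Gaussian center.

    image:        (H, W, 3) RGB in [0, 1]
    depth:        (H, W) monocular depth (e.g., from Depth Anything)
    K:            (3, 3) pinhole intrinsics
    cam_to_world: (4, 4) camera-to-world extrinsics
    """
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1).float()   # (H, W, 3)

    # Back-project pixels to camera space: X_cam = depth * K^-1 [u, v, 1]^T
    rays = pix @ torch.linalg.inv(K).T                              # (H, W, 3)
    pts_cam = rays * depth[..., None]                               # (H, W, 3)

    # Move to world space with homogeneous coordinates.
    pts_h = torch.cat([pts_cam, torch.ones(H, W, 1)], dim=-1)       # (H, W, 4)
    means = (pts_h @ cam_to_world.T)[..., :3].reshape(-1, 3)        # (H*W, 3)

    colors = image.reshape(-1, 3)   # one Gaussian per pixel, colored by the image
    return means, colors
```

Because each Gaussian stays anchored to the pixel it was spawned from, the set remains structured in 2D image space, which is what the regularized optimization below relies on.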

Abstract

The field of 3D reconstruction from images has rapidly evolved in the past few years, first with the introduction of Neural Radiance Fields (NeRF) and more recently with 3D Gaussian Splatting (3DGS). The latter provides a significant edge over NeRF in terms of training and inference speed, as well as reconstruction quality. Although 3DGS works well for dense input images, its unstructured, point-cloud-like representation quickly overfits to the more challenging setup of extremely sparse input images (e.g., 3 images), creating a representation that appears as a jumble of needles from novel views. To address this issue, we propose regularized optimization and depth-based initialization. Our key idea is to introduce a structured Gaussian representation that can be controlled in 2D image space. We then constrain the Gaussians, in particular their positions, and prevent them from moving independently during optimization. Specifically, we introduce single- and multi-view constraints through an implicit convolutional decoder and a total variation loss, respectively. With the coherency introduced to the Gaussians, we further constrain the optimization through a flow-based loss function. To support our regularized optimization, we propose an approach to initialize the Gaussians using monocular depth estimates at each input view. We demonstrate significant improvements compared to state-of-the-art sparse-view NeRF-based approaches on a variety of scenes.
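Of these regularizers, the total variation term is the simplest to make concrete. Below is a minimal sketch, assuming it is applied to a depth map rendered from a given view; the function name and the unweighted sum of the two gradient terms are illustrative choices, not the paper's exact formulation.

```python
import torch

def tv_loss(depth):
    """Total variation of a depth map: penalizes differences between
    neighboring pixels so nearby Gaussians cannot drift apart independently.

    depth: (H, W) depth map rendered from some view.
    """
    dx = (depth[:, 1:] - depth[:, :-1]).abs().mean()  # horizontal gradients
    dy = (depth[1:, :] - depth[:-1, :]).abs().mean()  # vertical gradients
    return dx + dy
```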



Implicit Decoder


During the regularized optimization, the implicit decoder predicts the residual depth ΔD that moves the Gaussians from their initial positions toward the true scene depth D. The input coordinate n to the decoder corresponds to the input view with camera cam_n. To preserve sharp discontinuities, we apply binary segmentation masks, obtained by thresholding the monocular depth, to the decoder output.
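One plausible way to realize such a decoder is sketched below: a learned per-view latent code is decoded into one smooth residual channel per depth segment, and the binary masks stitch the channels together so that smoothness never bleeds across a depth discontinuity. The layer sizes, latent-code input, and per-segment output are our assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ResidualDepthDecoder(nn.Module):
    """Illustrative sketch of the implicit convolutional decoder.

    A per-view latent code is decoded into S residual-depth channels
    (one per depth segment); the binary segmentation masks then select
    each segment's own smooth residual.
    """
    def __init__(self, num_views, num_segments, latent_dim=64, out_hw=(384, 512)):
        super().__init__()
        self.h0, self.w0 = out_hw[0] // 8, out_hw[1] // 8
        self.view_codes = nn.Embedding(num_views, latent_dim)
        self.fc = nn.Linear(latent_dim, 32 * self.h0 * self.w0)
        self.deconv = nn.Sequential(                 # three x2 upsampling stages
            nn.ConvTranspose2d(32, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, num_segments, 4, stride=2, padding=1),
        )

    def forward(self, view_idx, seg_masks):
        """view_idx: () long tensor selecting view n; seg_masks: (S, H, W)
        binary masks obtained by thresholding the monocular depth."""
        z = self.view_codes(view_idx)                    # (latent_dim,)
        x = self.fc(z).view(1, 32, self.h0, self.w0)
        delta = self.deconv(x).squeeze(0)                # (S, H, W)
        return (delta * seg_masks).sum(dim=0)            # residual depth ΔD: (H, W)
```

The refined depth D = D_init + ΔD then slides each per-pixel Gaussian along its camera ray, so neighboring Gaussians within a segment move together rather than independently.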



Optimization



The implicit decoder enables smooth deformation of the initialized Gaussians, resulting in coherent geometry and high-quality texture.
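The flow-based loss from the abstract can be illustrated in the same spirit: since each Gaussian is anchored to a pixel, projecting view i's Gaussian centers into view j should land them where the dense flow predicts the corresponding pixels. The sketch below assumes this formulation; the argument layout, L1 penalty, and validity mask are our choices, not the paper's.

```python
import torch

def flow_loss(means_i, K, world_to_cam_j, flow_ij, pix_i, valid):
    """Hedged sketch of a flow-consistency loss between views i and j.

    means_i:        (N, 3) world-space centers of view i's per-pixel Gaussians
    K:              (3, 3) pinhole intrinsics
    world_to_cam_j: (4, 4) extrinsics of view j
    flow_ij:        (N, 2) dense flow (e.g., FlowFormer++) from view i to view j
    pix_i:          (N, 2) pixel coordinates the Gaussians were spawned at
    valid:          (N,) float mask for reliable correspondences
                    (e.g., from a forward-backward flow check)
    """
    # Project view i's Gaussian centers into view j's image plane.
    pts_h = torch.cat([means_i, torch.ones_like(means_i[:, :1])], dim=-1)
    cam = (pts_h @ world_to_cam_j.T)[:, :3]
    proj = cam @ K.T
    uv = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)      # (N, 2)

    # Flow says the same surface point should appear at pix_i + flow_ij.
    target = pix_i + flow_ij
    return ((uv - target).abs().sum(-1) * valid).mean()
```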


Comparisons with other few-view NeRF methods



Baseline method (left) vs. CoherentGS (right), with each scene trained on 2 input views; comparisons are shown for both RGB and depth renderings.

Scenes: llff_orchids, llff_trex, llff_fortress, llff_horns, llff_fern, llff_flower, ip_s3, ip_s1, zed_s1, zed_s8


Inpainting



In contrast to other methods, our approach does not hallucinate occluded details. This provides a unique advantage: the user can apply any inpainting technique to fill in the missing regions. As a proof of concept, we apply a simple inpainting technique here to generate the missing texture and project it into the scene.
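As a minimal stand-in for such a proof of concept (the page does not specify which inpainting method is used), one could fill the uncovered pixels of a rendered novel view with OpenCV's diffusion-based inpainting. The file names below are hypothetical.

```python
import cv2
import numpy as np

# Render a novel view; pixels with no Gaussian coverage remain empty,
# which shows up as low alpha in the accumulated opacity map.
render = cv2.imread("novel_view.png")                              # (H, W, 3) uint8
alpha = cv2.imread("novel_view_alpha.png", cv2.IMREAD_GRAYSCALE)   # (H, W) uint8

# Build a mask of uncovered/occluded regions, then fill them with
# classic diffusion-based inpainting (Telea's method).
hole_mask = (alpha < 10).astype(np.uint8) * 255
filled = cv2.inpaint(render, hole_mask, inpaintRadius=5, flags=cv2.INPAINT_TELEA)
cv2.imwrite("novel_view_inpainted.png", filled)
```

Any stronger generative inpainter could be dropped in at the same point, since the method itself leaves the occluded regions untouched rather than hallucinating content there.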

Citation

Acknowledgements

The website template was borrowed from ReconFusion.