TL;DR: Create 3D scenes from extremely sparse images using fine-tuned diffusion models.


Abstract

In this paper, we propose RI3D, a novel 3DGS-based approach that harnesses the power of diffusion models to reconstruct high-quality novel views given a sparse set of input images. Our key contribution is separating the view synthesis process into two tasks of reconstructing visible regions and hallucinating missing regions, and introducing two personalized diffusion models, each tailored to one of these tasks. Specifically, one model ('repair') takes a rendered image as input and predicts the corresponding high-quality image, which in turn is used as a pseudo ground truth image to constrain the optimization. The other model ('inpainting') primarily focuses on hallucinating details in unobserved areas. To integrate these models effectively, we introduce a two-stage optimization strategy: the first stage reconstructs visible areas using the repair model, and the second stage reconstructs missing regions with the inpainting model while ensuring coherence through further optimization. Moreover, we augment the optimization with a novel Gaussian initialization method that obtains per-image depth by combining 3D-consistent and smooth depth with highly detailed relative depth. We demonstrate that by separating the process into two tasks and addressing them with the repair and inpainting models, we produce results with detailed textures in both visible and missing regions that outperform state-of-the-art approaches on a diverse set of scenes with extremely sparse inputs.
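To make the repair-model constraint concrete, below is a minimal sketch of how a rendered view could be "repaired" and then used as a pseudo ground truth in the photometric loss. The callables `render_fn` and `repair_model` are hypothetical placeholders for the 3DGS rasterizer and the personalized repair diffusion model, not an actual API.

```python
import torch
import torch.nn.functional as F

def repair_pseudo_gt_loss(render_fn, repair_model, gaussians, camera):
    """Render one view, 'repair' it with the personalized diffusion model,
    and use the repaired image as a pseudo ground truth.

    render_fn(gaussians, camera) -> (3, H, W) image, differentiable w.r.t. the Gaussians.
    repair_model(image)          -> repaired (3, H, W) image, treated as a fixed target.
    Both callables are illustrative stand-ins, not the released implementation.
    """
    rendered = render_fn(gaussians, camera)
    with torch.no_grad():
        pseudo_gt = repair_model(rendered)

    # The repaired image acts as the target, so gradients flow only through
    # the rendered image and thus into the Gaussian parameters.
    return F.l1_loss(rendered, pseudo_gt)
```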



Overview



We first initialize the Gaussians by obtaining high-quality per-view depth maps. We then fine-tune the repair and inpainting diffusion models on the scene at hand. Finally, we use the two models to optimize the 3DGS representation in two stages. In the first stage, we use the repair model to reconstruct the areas covered by the input images, while constraining the unseen areas to remain empty. In the second stage, we inpaint the missing regions and continue the optimization using the repair model.
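The schedule above can be summarized in the following sketch. Every callable passed in (`init_gaussians_from_depth`, `finetune_repair`, `finetune_inpainting`, `optimize_step`, `inpaint_missing_regions`) and the iteration counts are hypothetical placeholders for the steps described in the overview, not a definitive implementation.

```python
def reconstruct_scene(input_images, cameras,
                      init_gaussians_from_depth,   # images, cameras -> Gaussians
                      finetune_repair,             # images -> repair diffusion model
                      finetune_inpainting,         # images -> inpainting diffusion model
                      optimize_step,               # Gaussians, cameras, model, mask_unseen -> Gaussians
                      inpaint_missing_regions,     # Gaussians, cameras, model -> Gaussians
                      num_stage1=2000, num_stage2=2000):
    """Illustrative two-stage schedule; all helpers are supplied by the caller."""
    # Initialize Gaussians from enhanced per-view depth maps.
    gaussians = init_gaussians_from_depth(input_images, cameras)

    # Personalize both diffusion models on the scene at hand.
    repair_model = finetune_repair(input_images)
    inpaint_model = finetune_inpainting(input_images)

    # Stage 1: reconstruct regions covered by the inputs with the repair
    # model while keeping unseen regions empty.
    for _ in range(num_stage1):
        gaussians = optimize_step(gaussians, cameras, repair_model, mask_unseen=True)

    # Stage 2: hallucinate the missing regions with the inpainting model,
    # then continue optimizing with the repair model so the result stays coherent.
    gaussians = inpaint_missing_regions(gaussians, cameras, inpaint_model)
    for _ in range(num_stage2):
        gaussians = optimize_step(gaussians, cameras, repair_model, mask_unseen=False)

    return gaussians
```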



Depth Enhancement



The depth estimated by DUSt3R is geometrically consistent in the high-confidence regions, but of poor quality in the remaining areas. Monocular depth is highly detailed, but it is not 3D consistent. Our proposed method combines the two depth maps into a single detailed and geometrically consistent depth map. Applying bilateral filtering further sharpens the boundaries.
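Below is a minimal sketch of one way to combine the two depth maps, assuming the DUSt3R depth comes with a per-pixel confidence map: the detailed monocular depth is aligned to the DUSt3R depth by a least-squares scale and shift fit over the high-confidence pixels, the two are merged, and the result is sharpened with a bilateral filter. The threshold and filter parameters are illustrative defaults, not the values used in the paper.

```python
import numpy as np
import cv2

def fuse_depths(dust3r_depth, confidence, mono_depth,
                conf_thresh=0.5, d=9, sigma_color=0.1, sigma_space=7.0):
    """Fuse geometrically consistent DUSt3R depth with detailed monocular depth.

    dust3r_depth, confidence, mono_depth: (H, W) float32 arrays.
    Returns a fused depth map; all parameters are illustrative.
    """
    mask = confidence > conf_thresh

    # Least-squares scale/shift that maps monocular depth onto DUSt3R depth
    # over the high-confidence pixels.
    A = np.stack([mono_depth[mask], np.ones(mask.sum())], axis=1)
    scale, shift = np.linalg.lstsq(A, dust3r_depth[mask], rcond=None)[0]
    aligned = scale * mono_depth + shift

    # Keep DUSt3R depth where it is reliable, fall back to the aligned
    # monocular depth elsewhere, then sharpen boundaries with a bilateral filter.
    fused = np.where(mask, dust3r_depth, aligned).astype(np.float32)
    return cv2.bilateralFilter(fused, d, sigma_color, sigma_space)
```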



Comparisons to other methods

Compare the renders of our method RI3D (right) with baseline methods (left).






Acknowledgements

The project was funded in part by a generous gift from Meta. Portions of this research were conducted with the advanced computing resources provided by Texas A&M High Performance Research Computing.