Implicit View-Time Interpolation of Stereo Videos using Multi-Plane Disparities and Non-Uniform Coordinates

Texas A&M University, Leia Inc.

CVPR 2023

We propose a lightweight approach to reconstruct images at novel views and times from a stereo video captured with standard cameras (e.g., cellphones). We build upon X-Fields and propose several key ideas, including multi-plane disparities and non-uniform coordinates, that significantly improve the results. Our method runs at near real-time rates (23 fps) and has low memory and storage costs. Our system can be deployed on VR and light field displays to provide an immersive experience for users (a Lume Pad with a light field display is shown on the right).

Abstract

In this paper, we propose an approach for view-time interpolation of stereo videos. Specifically, we build upon X-Fields, which approximates an interpolatable mapping between the input coordinates and 2D RGB images using a convolutional decoder. Our main contribution is to analyze and identify the sources of the problems with using X-Fields in our application and to propose novel techniques that overcome these challenges. In particular, we observe that X-Fields struggles to implicitly interpolate the disparities for large-baseline cameras. Therefore, we propose multi-plane disparities to reduce the spatial distance of the objects in the stereo views. Moreover, we propose non-uniform time coordinates to handle the non-linear and sudden motion spikes in videos. We additionally introduce several simple but important improvements over X-Fields. We demonstrate that our approach produces better results than the state of the art, while running at near real-time rates and having low memory and storage costs.

Video

View Synthesis Results


Multi-Plane Disparities

Single-Plane

The network fails to interpolate objects with large disparity using a single channel output.

Multi-Plane

We propose multi-plane disparities that place objects in different planes to reduce their displacement in encoding space.

The multi-plane disparities output by the view synthesis network between reference viewpoints.

The shifted multi-plane disparities based on plane position and viewpoint.
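
The shifting step described in the captions above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the `(P, H, W)` array layout, the integer-pixel shift, and the rule that each plane moves by `view * plane_position` pixels are all assumptions made for illustration.

```python
import numpy as np


def shift_horizontal(plane, dx):
    """Shift a 2D array dx pixels to the right (left if dx < 0), zero-filling."""
    out = np.zeros_like(plane)
    w = plane.shape[1]
    if dx == 0:
        out[:] = plane
    elif 0 < dx < w:
        out[:, dx:] = plane[:, : w - dx]
    elif -w < dx < 0:
        out[:, : w + dx] = plane[:, -dx:]
    # |dx| >= w: everything is shifted out of frame, so out stays zero.
    return out


def shift_planes(planes, plane_positions, view):
    """Shift each disparity plane according to its position and the viewpoint.

    planes:          (P, H, W) array, one disparity map per plane (assumed layout).
    plane_positions: length-P sequence; hypothetical pixel offset of each
                     plane when view == 1.
    view:            scalar view coordinate, e.g. 0 for the left camera and
                     1 for the right camera (assumed convention).
    """
    shifted = np.empty_like(planes)
    for p, position in enumerate(plane_positions):
        dx = int(round(view * position))  # assumed shift rule
        shifted[p] = shift_horizontal(planes[p], dx)
    return shifted


# Two planes, each holding a single non-zero disparity value at column 3.
planes = np.zeros((2, 1, 8))
planes[:, 0, 3] = 1.0
shifted = shift_planes(planes, plane_positions=[0.0, 4.0], view=0.5)
```

In this sketch the plane at position 0 stays in place while the plane at position 4 slides by two pixels at the halfway viewpoint, so the disparity left to represent inside each plane stays small.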


Non-Uniform Coordinates

Natural videos contain non-linear motion, so it is difficult to represent the two flows at each frame (toward the previous and the next frame) using a single Jacobian. A straightforward way to address this problem is to estimate two different Jacobians at each frame (Dual Jacobians). However, this approach, like the single Jacobian, has difficulty handling motion spikes. With our proposed non-uniform coordinates, we use two different coordinates to estimate the previous and next Jacobians at each frame. The large unused regions in between allow the network to smoothly accommodate motion spikes. The mean flow magnitude plot, which shows the average flow magnitude per frame in a video sequence, compares the different coordinate schemes against the guidance flow.
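
To make the argument concrete, here is a small NumPy sketch under two assumptions that are not stated on this page: flow is modeled X-Fields-style as a Jacobian multiplied by a coordinate difference, and each frame's two coordinates are placed symmetrically at `t - eps` and `t + eps`. The function names and the value of `eps` are hypothetical, chosen only for illustration.

```python
import numpy as np


def linear_flows(jacobian, dt=1.0):
    """Flows toward the previous and next frame under a single Jacobian.

    Assumes flow = jacobian * (coordinate difference), so the two flows
    are forced to be exact mirror images of each other.
    """
    flow_prev = jacobian * (-dt)
    flow_next = jacobian * dt
    return flow_prev, flow_next


def non_uniform_time_coords(num_frames, eps=0.1):
    """Assign two time coordinates to every frame.

    Returns a (num_frames, 2) array whose columns are the coordinates used
    to query the 'previous' and 'next' Jacobian of each frame. The symmetric
    placement and the value of eps are illustrative assumptions.
    """
    frames = np.arange(num_frames, dtype=np.float64)
    return np.stack([frames - eps, frames + eps], axis=1)


# A single Jacobian cannot express different backward and forward motion.
flow_prev, flow_next = linear_flows(np.array([2.0, -1.0]))

# Each frame gets separate coordinates for its two Jacobians.
coords = non_uniform_time_coords(num_frames=3, eps=0.1)
```

The first function shows why one Jacobian per frame implies locally linear motion: the backward flow is always the negated forward flow. The second shows one possible way to give each frame separate query coordinates; how far apart they sit, and therefore where the unused gaps fall, is controlled by `eps`.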

Time coordinates

Mean flow magnitude

Dual Jacobians

The Dual Jacobians network fails to accommodate the large non-linear motion in a shaky video captured with a handheld device.

Non-Uniform Coordinates

Non-uniform coordinates fit the motion spikes, allowing for smooth video frame interpolation.

Related Work

Frame Interpolation for Dynamic Scenes with Implicit Flow Encoding is another work in which we use implicit networks to interpolate between two near-duplicate images of a scene. There, we discuss and demonstrate the effects of several hyperparameters on interpolation.

BibTeX

@inproceedings{Paliwal2023implicit,
  author    = {Paliwal, Avinash and Tsarov, Andrii and Kalantari, Nima Khademi},
  title     = {Implicit View-Time Interpolation of Stereo Videos using Multi-Plane Disparities and Non-Uniform Coordinates},
  booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
  year      = {2023},
}