3D volumetric video
Image by myshoun from Pixabay

“Explore a Scene from Any Vantage Point You Want”: 3D Volumetric Video Breakthrough Means Streaming in 3D May Soon Be a Reality

Brown University researchers have unveiled a new video processing method called PackUV, which they describe as a “key step” toward realistic, storable 3D volumetric video that can be viewed from any angle. Because the method's output is compatible with the video codecs that currently power most video on the internet, the resulting video can also be streamed.

The team behind the new volumetric video processing approach said their technique could enable practical 3D video streaming on everyday devices like smartphones, computers, and smart TVs without requiring new display technologies, ushering in a new era of realistic 3D video entertainment.

3D Volumetric Video Offers Unprecedented Versatility and Challenges

According to Brown computer science graduate student and study leader Aashish Rai, volumetric video involves capturing action with multiple synchronized cameras encircling the target scene. After the scene is recorded, specialized algorithms reconstruct it in three dimensions. Notably, the reconstructed volumetric video can be viewed from any perspective within the recording space.

“With volumetric video, you can basically explore a scene from any vantage point you want,” Rai explained, adding that capturing three dimensions plus a time dimension actually makes the resulting recording “a 4D video.”

Capturing video in this manner allows directors to show scenes from perspectives unattainable by conventional filming techniques. In theory, such a video could be combined with a user interface that lets viewers navigate through a scene, including options such as viewing a sports play from on the field or a concert from the stage.

Still, the Brown researchers note, several challenges have kept volumetric 3D video from wider adoption. These include compressing the video enough to make streaming 3D volumetric content viable with current internet infrastructure and protocols.

“Volumetric video is incredibly hard to store and stream,” Rai explained, adding that a 30-minute clip “can balloon to terabytes of data, and the formats it comes in are completely alien to the infrastructure the internet already runs on — your computer, your streaming service, your video codec.”
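A rough calculation shows how quickly such a recording could reach that scale. The sketch below assumes a naive format that stores every Gaussian of every frame uncompressed. The Gaussian count, the frame rate, and the 59 values per Gaussian (a layout commonly used in 3D Gaussian Splatting) are illustrative assumptions, not figures from the Brown team.

```python
# Back-of-envelope size of a naive, uncompressed sequence of 3D Gaussian frames.
# Every number here is an illustrative assumption, not a figure from the researchers.

# Assumed per-Gaussian layout: position, scale, rotation, opacity, color coefficients.
FLOATS_PER_GAUSSIAN = 3 + 3 + 4 + 1 + 48
BYTES_PER_FLOAT = 4  # 32-bit floats


def naive_size_bytes(gaussians_per_frame: int, fps: int, minutes: float) -> int:
    """Bytes needed to store every Gaussian of every frame with no compression."""
    frames = int(fps * 60 * minutes)
    return gaussians_per_frame * FLOATS_PER_GAUSSIAN * BYTES_PER_FLOAT * frames


size = naive_size_bytes(gaussians_per_frame=1_000_000, fps=30, minutes=30)
print(f"{size / 1e12:.1f} TB")  # prints "12.7 TB"
```

Under these assumptions, a 30-minute clip comes to nearly 13 terabytes before any compression is applied.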

Rendering 3D Video Onto a 2D ‘Surface’ Creates Internet-Capable Files

To overcome the obstacles preventing wider adoption of the technology, the Brown team started with a 3D scene rendering method already in use, called 3D Gaussian Splatting. According to the team’s statement, this approach renders 3D images using “fuzzy blobs that encode the color, opacity, and shape of points in space,” called Gaussians.

In the new approach, the team found a way of mapping a 3D scene and its millions of Gaussians into a more manageable 2D image. According to Rai, the approach is similar to how a mapmaker projects a 3D globe onto a flat, 2D surface, resulting in “a structured, multi-scale image” that encodes all the information contained in the original dynamic 3D scene.
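The general idea of laying Gaussians out as pixels can be sketched in a few lines. This is a deliberately simplified illustration, not the PackUV method: it packs only a hypothetical four-value record (color and opacity) per Gaussian, row by row, into a single-scale square grid. The names `Gaussian` and `pack_to_image` are invented for this example.

```python
import math

# Hypothetical per-Gaussian record for illustration: (red, green, blue, opacity), each in 0..1.
Gaussian = tuple[float, float, float, float]


def pack_to_image(gaussians: list[Gaussian]) -> list[list[tuple[int, ...]]]:
    """Lay Gaussians out row by row in a square grid of 8-bit RGBA pixels."""
    if not gaussians:
        return []
    side = math.isqrt(len(gaussians) - 1) + 1  # smallest square that fits: ceil(sqrt(n))
    empty = (0, 0, 0, 0)                       # padding for unused grid cells
    image = [[empty] * side for _ in range(side)]
    for index, gaussian in enumerate(gaussians):
        row, col = divmod(index, side)
        # Clamp each value to 0..1, then quantize it to an 8-bit channel.
        image[row][col] = tuple(round(max(0.0, min(1.0, v)) * 255) for v in gaussian)
    return image


image = pack_to_image([(1.0, 0.0, 0.0, 1.0), (0.0, 1.0, 0.0, 1.0), (0.0, 0.0, 1.0, 1.0)])
print(len(image), len(image[0]))  # prints "2 2"
```

Once every moment of a scene is an ordinary image like this, a sequence of such images has the same shape as the frames of a conventional video, which is the kind of input standard codecs accept.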

3D Volumetric video
Image Credit: The Interactive 3D Vision and Learning Lab at Brown University.

Next, the team’s process involves stacking the 3D-encoded images together. The result is a video with a much more manageable file size than traditional 3D volumetric videos, which the team notes “is compatible with stalwart video codecs that run Netflix, YouTube and most of the rest of the internet.”

“We basically convert this entire 4D scene into a normal video that you can stream over the internet and share with friends,” Rai explained.

Renders Scenes Up to 30 Minutes Without Breaking Down

In addition to overcoming file-size and streaming limitations that have plagued current 3D volumetric video strategies, the Brown team said their work addresses the tendency of current methods to “break down” over time, thereby limiting the length of potential videos.

The primary challenge is tracking objects when they go out of camera view, such as a ball temporarily “disappearing” behind a competitor. The team said the existing technology also has trouble handling “novel movement,” such as a person entering a room midway through another sequence of events.

According to Rai, their approach solves this limitation by splitting a longer video “into small chunks.” The system then checks the start of each segment to determine whether something has entered or left the scene. Once PackUV makes that determination, Rai said, it instructs the software to “model accordingly.”

“By restarting the tracking process more frequently, the new technique is better able to reacquire objects that have been temporarily blocked and deal appropriately with new movements,” the research team explained, adding that their approach can seamlessly render complex 3D volumetric video scenes up to 30 minutes in length without failure, “far longer than other Gaussian Splatting approaches.”
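The chunk-and-check idea can be illustrated with a toy model. In the sketch below, each frame is reduced to the set of object labels visible in it, and the start of every chunk is compared with the start of the previous one to see what has entered or left. The frame representation and the names `split_into_chunks` and `plan_chunks` are assumptions made for illustration, not the team's implementation.

```python
from typing import Iterator

# Each frame is modeled as the set of object labels visible in it
# (a hypothetical stand-in for real tracking data).
Frame = frozenset[str]


def split_into_chunks(frames: list[Frame], chunk_len: int) -> Iterator[list[Frame]]:
    """Yield consecutive chunks of at most chunk_len frames."""
    if chunk_len <= 0:
        raise ValueError("chunk_len must be positive")
    for start in range(0, len(frames), chunk_len):
        yield frames[start:start + chunk_len]


def plan_chunks(frames: list[Frame], chunk_len: int) -> list[dict]:
    """For each chunk, list what entered or left since the previous chunk began."""
    plans = []
    tracked: Frame = frozenset()  # objects tracked during the previous chunk
    for index, chunk in enumerate(split_into_chunks(frames, chunk_len)):
        visible = chunk[0]  # inspect only the first frame of the chunk
        plans.append({
            "chunk": index,
            "entered": sorted(visible - tracked),
            "left": sorted(tracked - visible),
        })
        tracked = visible  # restart tracking from what is visible now
    return plans


frames = [
    frozenset({"player"}), frozenset({"player"}),
    frozenset({"player", "ball"}), frozenset({"player", "ball"}),
    frozenset({"ball"}),
]
for plan in plan_chunks(frames, chunk_len=2):
    print(plan)
```

In this toy run, the ball is picked up at the start of the second chunk and the player is dropped at the start of the third. Shorter chunks mean such changes are noticed sooner, at the cost of restarting tracking more often.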

3D Volumetric Video Could Impact Entertainment, Manufacturing, and “Other Areas”

To validate their approach, the Brown team put together what they described as potentially “the largest dataset of multi-view video ever assembled” and made it publicly available to other researchers. The dataset covers a wide range of activities, including cooking, woodworking, and various sports.

Critically, the entire dataset was captured with arrays of 50 to 90 synchronized cameras. Rai’s team said the footage includes actions recorded in laboratory settings specially equipped with cameras, as well as action captured “in the real world” by mobile camera arrays.

Although this work is just a first step toward putting streamable 3D volumetric video at the viewer’s fingertips, the researchers said it helps advance a technology with a wealth of potential future applications, particularly those that depend on building a ‘digital twin’ of the real world.

“There are real-world applications in entertainment and sports, for example, but also other use cases (manufacturing and other areas) where you need to create digital twins of the real world,” said Srinath Sridhar, a Brown computer science professor who leads the Interactive 3D Vision and Learning Lab. “Fundamentally, that’s what this work is about.”

Rai will present the work, “PackUV: Packed Gaussian UV Maps for 4D Volumetric Video,” at the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) in June.

Christopher Plain is a Science Fiction and Fantasy novelist and Head Science Writer at The Debrief. Follow and connect with him on X, learn about his books at plainfiction.com, or email him directly at christopher@thedebrief.org.