
How Can AI Make Old Videos Look Smoother?

By Mikhail Klassen, January 2023



It was back in 2015 and I was creating some data visualizations for a meeting with my PhD supervisory committee. This was towards the end of my graduate studies in astrophysics and I had run some supercomputer simulations of a massive star forming inside an interstellar cloud of gas.

I figured the best way to show off the results of these simulations was to create various animations of the data. My analysis code had chewed through gigabytes of simulation data and dumped image files at specific time intervals. So I used ffmpeg to stitch them together into an animation:

Video of an astrophysics simulation showing the evolution of a protostellar disk around a forming star. Author’s own work. See Klassen et al. (2016).

I always wanted higher frame rates, so I could have smoother and more beautiful animation, but I was limited by the data I had from my simulation.

Sometimes I would spend a few hours seeing if I could perhaps generate intermediate frames to make my movies appear smoother when I compiled them. My attempts at the time were clumsy and ultimately abandoned, but today there are many techniques that work very well. In fact, most modern TVs include the option (on by default) to artificially enhance the frame rate of the video.

The formal name for this is video frame interpolation (VFI) and it turns out there’s quite a body of research in this domain.

Video footage is typically assembled from individual still frames. If the frame rate is low, the video can appear “choppy”. To make the video appear smoother, various techniques have been developed for artificially inserting intermediate frames between the original frames of the video.

In the example below, a short anime clip has had its frame rate upscaled 8 times using an AI-based technique. The result is a much smoother video.

Let’s explore a few of these techniques. We’ll start with some approaches that don’t require artificial intelligence and end with some of the state-of-the-art approaches.

As our source materials, we’ll use a 6-second clip of a train locomotive by the filmmaker Pat Whelen, uploaded to the free stock photo and video site Pexels.

Original 23 fps clip of a train locomotive. Credit: Pat Whelen, Pexels. Free to use.

The original video was recorded in high resolution at a frame rate of 23 frames per second. In general, we will aim to double the frame rate in our interpolations.

Frame averaging

One of the simplest ways of adding more frames is to blend two sequential frames to create an average of the two. This is easy to do. I used ffmpeg to perform the interpolation using the minterpolate filter:

ffmpeg \
-i input/trip_nyc_1911_orig_short.mp4 \
-crf 10 \
-vf "minterpolate=fps=46:mi_mode=blend" \
output/trip_nyc_1911_blended.mp4

More details on the filter settings can be found in the ffmpeg documentation for the minterpolate filter. The result, converted to an animated gif, is below.

Source video frame rate enhanced using video frame blending.

If you look carefully at the resulting video, you can see artifacts wherever there is motion. If an object moves too much between frames, the blended intermediate frame clearly shows the object in both positions, semi-transparent wherever the motion occurred.
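
Under the hood, this blend mode amounts to little more than averaging pixel values, which is exactly why moving objects show up twice and semi-transparent. Here is a minimal Python sketch of the operation using OpenCV; the random arrays are just stand-ins for two decoded frames:

import cv2
import numpy as np

# Stand-ins for two consecutive decoded frames (8-bit BGR images).
frame_a = np.random.randint(0, 256, (720, 1280, 3), dtype=np.uint8)
frame_b = np.random.randint(0, 256, (720, 1280, 3), dtype=np.uint8)

# The blended intermediate frame is a 50/50 average of the pixel values;
# anything that moved between the two frames ends up ghosted in the result.
intermediate = cv2.addWeighted(frame_a, 0.5, frame_b, 0.5, 0)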

To overcome this problem, researchers developed motion estimation techniques, which improve the quality of the interpolation.

Motion estimation

To create intermediate frames that don’t look like merely the blurred average of their two neighbors, motion-compensated frame interpolation was developed. This approach works by matching the visual features of objects across sequential frames and calculating the paths those objects take. The result is a vector “motion field” that an interpolation algorithm can exploit to improve the quality of the intermediate frames.

A figure from Choi & Ko (2010), showing the motion field overlaid on a video frame.
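
To give a feel for how block-based motion estimation works, here is a toy Python sketch that finds the displacement of a single block between two grayscale frames by exhaustively minimizing the sum of absolute differences. Real implementations, such as ffmpeg’s, do this over every block with much smarter search strategies; the block and search sizes here are arbitrary.

import numpy as np

def match_block(prev_gray, next_gray, top, left, block=16, search=8):
    """Estimate one block's motion vector via exhaustive SAD search."""
    ref = prev_gray[top:top + block, left:left + block].astype(np.int32)
    best_cost, best_vec = None, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + block > next_gray.shape[0] or x + block > next_gray.shape[1]:
                continue
            candidate = next_gray[y:y + block, x:x + block].astype(np.int32)
            cost = np.abs(ref - candidate).sum()  # sum of absolute differences
            if best_cost is None or cost < best_cost:
                best_cost, best_vec = cost, (dy, dx)
    return best_vec

# Running this over a grid of blocks produces the motion field; the
# interpolator then shifts each block halfway along its vector to build
# the intermediate frame.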

Several motion estimation algorithms have been implemented in ffmpeg over the years; we’ll use one called “overlapped block motion compensation” (OBMC). Here is the result:

Source video frame rate enhanced using a motion interpolation technique called overlapped block motion compensation.

While the quality of the resulting video is much better, minor artifacts are still possible. To produce it, I again relied on the minterpolate filter in ffmpeg, but with a few different settings:

ffmpeg \
-i input/trip_nyc_1911_orig_short.mp4 \
-crf 10 \
-vf "minterpolate=fps=46:mi_mode=mci:mc_mode=obmc:me_mode=bidir" \
output/trip_nyc_1911_obmc.mp4

Frame interpolation using artificial intelligence

The field of computer vision has been making rapid progress with the application of new artificial intelligence techniques. Deep neural networks can perform highly accurate image segmentation, object detection, and even 3D depth estimation.

The previous technique relied on estimating the motion of objects between frames, with those objects detected using visual features. Convolutional neural networks (CNNs) are great at extracting visual features from an image, so it seems reasonable to apply them here.

Next, if we knew a little more about the 3D structure of the scene represented by the image, we could better estimate the movement of different parts of the image and produce more accurate intermediate frames.

The real game-changer is adding depth estimation: a neural network trained to predict how far away each part of an image is.
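
As a rough illustration (and not the actual architecture from the paper mentioned below), a monocular depth estimator is a convolutional network whose output is a one-channel map at the input’s spatial resolution, each value representing relative distance. A toy PyTorch sketch:

import torch
import torch.nn as nn

class TinyDepthNet(nn.Module):
    """Toy encoder-decoder mapping an RGB image to a per-pixel depth map."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))  # shape: (N, 1, H, W)

depth_map = TinyDepthNet()(torch.randn(1, 3, 128, 128))  # -> (1, 1, 128, 128)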

The animation below shows the results from a paper titled “Digging Into Self-Supervised Monocular Depth Estimation” (ICCV 2019, Code). A deep neural network accurately estimates, frame by frame, which parts of the image represent objects that are closer vs further away.

That this technique works well shouldn’t come as too much of a surprise. It’s something our brains do intuitively, even when parallax information is taken away by closing one eye. Using both eyes improves the accuracy even further.

Depth-aware video frame interpolation techniques, such as the DAIN algorithm introduced in 2019 by Bao et al., perform better than previous approaches because they can account for occlusion, i.e. when objects pass behind each other.

More recently, the “Real-time Intermediate Flow Estimation” (RIFE) technique by Huang et al. (ECCV 2022) achieved superior benchmark scores while also being much faster to run. It probably represents the state of the art at the time of writing.
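
The common thread in these flow-based methods is that once a model has predicted where each pixel is headed, synthesizing the halfway frame is largely a warping operation. Here is a hedged Python sketch of that final step, assuming the dense flow field comes from some upstream model (a zero array stands in for it here):

import cv2
import numpy as np

def warp_to_midpoint(frame, flow):
    """Warp a frame halfway along a dense flow field of shape (H, W, 2)."""
    h, w = flow.shape[:2]
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    # Sample each output pixel from a point half a motion vector away.
    map_x = (grid_x + 0.5 * flow[..., 0]).astype(np.float32)
    map_y = (grid_y + 0.5 * flow[..., 1]).astype(np.float32)
    return cv2.remap(frame, map_x, map_y, cv2.INTER_LINEAR)

frame = np.random.randint(0, 256, (720, 1280, 3), dtype=np.uint8)  # stand-in frame
flow = np.zeros((720, 1280, 2), dtype=np.float32)                  # stand-in flow field
halfway = warp_to_midpoint(frame, flow)

RIFE itself predicts the intermediate flow directly from the two input frames and fuses warps of both neighbors with a learned mask, but the basic warping idea is the same.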

An example video clip from the RIFE paper showing 16x interpolation of 2 input images: https://github.com/megvii-research/ECCV2022-RIFE
The depth estimation performed by RIFE from the same input.

Let’s conclude our exercise by applying the RIFE algorithm to our New York City scene. I cloned the GitHub repository to my laptop, downloaded the pretrained model parameters as described in the repo’s README file, and made a few tweaks to the requirements.txt file. I ran the code on my laptop without the help of a GPU. Processing 6 seconds of video footage took about 15 minutes. Here are the results:
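
For reference, here is roughly how the interpolation step is launched, based on the inference script and flags documented in the repo’s README at the time of writing (--exp sets the interpolation factor as a power of two, so --exp=1 doubles the frame rate). Treat the exact flag names as assumptions and check the current README:

import subprocess

# Run RIFE's bundled video-interpolation script from inside the cloned repo.
# Flag names follow the repo's README at the time of writing and may differ
# in newer versions of the code.
subprocess.run(
    ["python3", "inference_video.py", "--exp=1", "--video=trip_nyc_1911_orig_short.mp4"],
    check=True,
)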

Video interpolation using a novel deep CNN technique, “real-time intermediate flow estimation” (RIFE).

It looks even better in the higher-resolution mp4 video that RIFE created. For this article, I converted that video to an animated gif.

This field is moving very quickly. Image generation AIs trained on billions of images from the internet can already produce highly consistent, novel imagery from a sequence of input images or text-based prompts.

Other techniques extend depth estimation approaches to create point clouds from multiple camera angles and then synthesize new artificial perspectives using a technique called neural rendering. See, for example, Rückert et al. (2021) and their neural rendering approach:

Neural rendering in Rückert et al. (2021). Images on the left are generated from a sequence of ground-truth (GT) input images (bottom right).

As the field continues to advance, we can expect greater accuracy in the intermediate frames, with fewer artifacts, even as more and more of the frames are essentially “dreamt up” or hallucinated by very large pre-trained artificial neural networks. Relatively few input frames are needed to produce highly believable, smooth transitions between them.

This opens up a lot of creative possibilities, such as restoring old archival videos, animating still images, and creating virtual reality worlds from a handful of 2D photographs.

It may also lead to the flooding of the world with so much synthetic imagery that it becomes difficult to distinguish what represents ground truth. In the meantime, expect your videos to look really smooth.

My original research data visualization (see above), passed through the RIFE algorithm.


