Hierarchical Flow Diffusion for Efficient Frame Interpolation

CVPR 2025

¹Insta360 Research  ²MagicLeap
[Teaser: Input Overlay · LDMVFI (8.3s) · CBBD (2.1s) · SGM-VFI (0.19s) · Ours (0.20s) · Ground Truth]

Abstract

Most recent diffusion-based methods for video frame interpolation still show a large gap to non-diffusion methods in both accuracy and efficiency. Most of them formulate the problem directly as a denoising procedure in latent space, which is less effective due to the large latent space. We propose to model bilateral optical flow explicitly with hierarchical diffusion models, which gives a much smaller search space in the denoising procedure. Based on the flow diffusion model, we then use a flow-guided image synthesizer to produce the final result. We train the flow diffusion model and the image synthesizer end to end. Our method achieves state-of-the-art accuracy and is more than 10× faster than other diffusion-based methods.
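For concreteness, below is a minimal PyTorch-style sketch of the two-stage inference pipeline described above. All module names (encoder, flow_diffusion, synthesizer) and tensor shapes are illustrative assumptions, not the paper's actual API; only the structure (bilateral flow diffusion conditioned on encoder features, then flow-guided synthesis via backward warping) follows the abstract.

import torch
import torch.nn.functional as F

def backward_warp(feat, flow):
    # Bilinearly sample `feat` at locations displaced by `flow` of shape (b, 2, h, w).
    b, _, h, w = feat.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=feat.device),
        torch.arange(w, device=feat.device),
        indexing="ij",
    )
    base = torch.stack((xs, ys), dim=0).float()        # (2, h, w) pixel grid
    coords = base.unsqueeze(0) + flow                  # (b, 2, h, w) sample locations
    gx = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0      # normalize to [-1, 1]
    gy = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)               # (b, h, w, 2)
    return F.grid_sample(feat, grid, align_corners=True)

@torch.no_grad()
def interpolate(encoder, flow_diffusion, synthesizer, i0, i1, t=0.5):
    f0, f1 = encoder(i0), encoder(i1)                  # condition features
    flow_t0, flow_t1 = flow_diffusion(f0, f1, t)       # denoised bilateral flows
    w0 = backward_warp(f0, flow_t0)                    # warp features toward time t
    w1 = backward_warp(f1, flow_t1)
    return synthesizer(w0, w1)                         # intermediate frame Ĩ_t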

[Figure: (a) baseline diffusion-based strategy; (b) our method with hierarchical flow diffusion]

Different strategies with diffusion models for video frame interpolation. Given an image pair \( (I_0, I_1) \), our goal is to predict the intermediate frame \(\tilde{I}_t\). (a) Most diffusion-based methods formulate the problem directly as a denoising process in the latent space \(\tilde{F}_t\), and train the diffusion network and the encoder-decoder ("E" and "D") separately. This strategy is less effective due to the large latent space; moreover, it struggles with complex motions and large displacements. (b) We use a hierarchical strategy with explicit flow modeling. We first train a flow-based encoder-decoder for image synthesis using image pairs and ground-truth optical flow. Then, unlike most diffusion-based methods that denoise the latent space directly, we use a hierarchical diffusion model, conditioned on the encoder features \((F_0, F_1)\), to explicitly denoise optical flow from coarse to fine. We use the predicted bilateral flow \((\tilde{f}_0, \tilde{f}_1)\) to warp image features for the synthesizer, and finally fine-tune the synthesizer and the diffusion models jointly.
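As a sketch of the coarse-to-fine idea in (b), the loop below denoises flow at the coarsest pyramid level first, then upsamples each estimate to initialize the next, finer level. The per-level denoiser interface, the 4-channel bilateral flow layout, and the "refine around the noised upsampled flow" initialization are our assumptions for illustration; the paper's exact sampler and conditioning may differ.

import torch
import torch.nn.functional as F

def hierarchical_flow_sampling(denoisers, cond_pyramid, num_steps=10):
    # denoisers:    one conditional reverse-diffusion network per pyramid level
    # cond_pyramid: [(F0_l, F1_l), ...] encoder features, ordered coarse -> fine
    # Returns bilateral flow (b, 4, H, W): channels 0-1 = f_0, channels 2-3 = f_1.
    flow = None
    for denoiser, (f0, f1) in zip(denoisers, cond_pyramid):
        b, _, h, w = f0.shape
        x = torch.randn(b, 4, h, w, device=f0.device)  # start from Gaussian noise
        if flow is not None:
            # Upsample the coarser estimate; flow magnitudes scale with resolution.
            up = F.interpolate(flow, size=(h, w), mode="bilinear",
                               align_corners=False) * (w / flow.shape[-1])
            x = x + up                                 # refine around the coarse flow
        for step in reversed(range(num_steps)):
            x = denoiser(x, step, f0, f1)              # one reverse-diffusion update
        flow = x
    return flow

Because each level only corrects the residual around the upsampled coarse flow, the effective search space per level stays small, which is the efficiency argument made in the abstract.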

Results

Visual Comparison. For complex motions, most non-diffusion-based methods (e.g., SGM-VFI) produce blurry results, while most diffusion-based methods (e.g., LDMVFI) struggle to handle large motions. Our method achieves the best accuracy and produces high-quality results in most cases, thanks to the proposed hierarchical flow diffusion models.

[Interactive comparisons: Input Overlay · Ours vs. SGM-VFI · Ours vs. LDMVFI · Ground Truth]
Quantitative Analysis. Our method outperforms current state-of-the-art methods significantly, especially on the Hard and Extreme subsets of SNU-FILM.

[Figure: quantitative comparison on SNU-FILM]

Citation

If you find this work useful in your research, please consider citing:
@inproceedings{yang2025hfd,
  title     = {Hierarchical Flow Diffusion for Efficient Frame Interpolation},
  author    = {Hai, Yang and Wang, Guo and Su, Tan and Jiang, Wenjie and Hu, Yinlin},
  booktitle = {CVPR},
  year      = {2025},
}