One-Shot High-Fidelity Talking-Head Synthesis with Deformable Neural Radiance Field

1Shanghai AI Laboratory, 2DAMO Academy, Alibaba Group, 3Northwestern Polytechnical University
CVPR 2023

Abstract

Teaser image.

In this paper, we propose HiDe-NeRF, which achieves high-fidelity and free-view talking-head synthesis. Drawing on the recently proposed Deformable Neural Radiance Fields, HiDe-NeRF represents the 3D dynamic scene as a canonical appearance field and an implicit deformation field, where the former comprises the canonical source face and the latter models the driving pose and expression.
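
As a rough illustration of this decomposition, the sketch below shows a backward deformation mapping each observed point into canonical space before the static appearance field is queried. The module names canonical_field and deformation_field are hypothetical placeholders, not the authors' implementation.

import torch
import torch.nn as nn

class DeformableNeRF(nn.Module):
    """Canonical appearance field + implicit deformation field (illustrative)."""
    def __init__(self, canonical_field: nn.Module, deformation_field: nn.Module):
        super().__init__()
        self.canonical_field = canonical_field      # static canonical source face
        self.deformation_field = deformation_field  # driving pose and expression

    def forward(self, x_obs: torch.Tensor, driving_code: torch.Tensor):
        # Backward deformation: shift observed-space points into canonical space.
        x_canonical = x_obs + self.deformation_field(x_obs, driving_code)
        # Query color and density from the canonical appearance field.
        rgb, sigma = self.canonical_field(x_canonical)
        return rgb, sigma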

In particular, we improve fidelity in two aspects:

  • to enhance identity expressiveness, we design a generalized appearance module that leverages multi-scale volume features to preserve face shape and details;
  • to improve expression preciseness, we propose a lightweight deformation module that explicitly decouples pose and expression to enable precise expression modeling (see the sketch after this list).
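
To make the decoupling idea concrete, here is a minimal sketch, assuming the rigid head pose is given as a rotation R and translation t, and that a hypothetical expr_net predicts only the non-rigid expression offset; this is an assumption about how the decomposition could work, not the paper's exact module.

import torch

def backward_deform(x_obs: torch.Tensor, R: torch.Tensor, t: torch.Tensor,
                    expr_net, expr_code: torch.Tensor) -> torch.Tensor:
    """Map observed points (N, 3) to canonical space.

    R (3, 3), t (3,): rigid head pose of the driving frame (handled exactly).
    expr_net: hypothetical network predicting the non-rigid expression offset.
    """
    # Undo the rigid pose explicitly, so the network never has to learn it.
    x_posed = (x_obs - t) @ R          # row-vector form of R^T (x - t)
    # Learn only the remaining, expression-induced deformation.
    return x_posed + expr_net(x_posed, expr_code)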

Video

Method Overview

Procedure image.

The Multi-Scale Generalized Appearance Module (MGA) encodes the source image into multi-scale canonical volume features. The Lightweight Expression-Aware Deformation Module (LED) predicts a backward deformation for each observed point, shifting it into the canonical space to retrieve its volume features. The deformation is learned from paired SECCs, conditioned on the positions of points sampled from the rays. The Image Generation Module takes the deformed points as input and samples features from tri-planes at different scales; these multi-scale features are composed for the subsequent neural rendering. We also design a refine network to further refine texture details (e.g., teeth, skin, and hair) and to enhance the resolution of the rendered images.
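
The snippet below sketches one plausible reading of the multi-scale tri-plane sampling step, assuming each scale stores three axis-aligned feature planes and that per-plane features are fused by summation; the plane layout and fusion rule are assumptions, not the paper's exact design.

import torch
import torch.nn.functional as F

def sample_triplane(planes: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    """planes: (3, C, H, W) features for the xy, xz, and yz planes.
    x: (N, 3) deformed (canonical-space) points, normalized to [-1, 1]."""
    coords = torch.stack([x[:, [0, 1]], x[:, [0, 2]], x[:, [1, 2]]])  # (3, N, 2)
    feats = F.grid_sample(planes, coords.unsqueeze(2), align_corners=False)
    return feats.squeeze(-1).sum(dim=0).t()  # fuse the three planes -> (N, C)

def sample_multiscale(triplane_pyramid, x):
    # Concatenate the features retrieved at every scale of the tri-plane pyramid.
    return torch.cat([sample_triplane(p, x) for p in triplane_pyramid], dim=-1)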

Visual Results

Image reenactment

Free view generation

Our method enables free-view generation of talking-head videos. As shown below, users can change the view direction (yaw, pitch, roll) arbitrarily.
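
As a small illustration of what such control amounts to, the sketch below builds a head rotation from yaw, pitch, and roll angles; the axis conventions and composition order are assumptions, and the rendering call is left abstract.

import numpy as np

def euler_to_rotation(yaw: float, pitch: float, roll: float) -> np.ndarray:
    """Compose a 3x3 rotation from yaw (y-axis), pitch (x-axis), roll (z-axis), in radians."""
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    Ry = np.array([[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]])
    Rx = np.array([[1.0, 0.0, 0.0], [0.0, cp, -sp], [0.0, sp, cp]])
    Rz = np.array([[cr, -sr, 0.0], [sr, cr, 0.0], [0.0, 0.0, 1.0]])
    return Ry @ Rx @ Rz

# e.g., sweep the yaw while keeping pitch and roll fixed to orbit the head:
# for yaw_deg in range(-40, 41, 10):
#     R = euler_to_rotation(np.deg2rad(yaw_deg), 0.0, 0.0)
#     frame = render(model, R)   # `render` is a hypothetical entry point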

Shape Preservation

Unlike previous warping-based methods, whose results are misguided by the driving facial shape, our method preserves the source identity information much better.

Comparison image.

Comparison of shape preservation with prior works. AVD-s/d denote the average vertex distance to the source and driving mesh, respectively. The first and third rows contain the source, driving, and generated images; the second and fourth rows show the canonical mesh corresponding to the image above. Best results are marked in yellow.
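
For reference, a mean per-vertex distance of this kind can be computed as below, assuming the two meshes share vertex correspondence (e.g., both come from the same 3DMM topology); this reflects the caption's description, not the authors' released evaluation code.

import numpy as np

def average_vertex_distance(verts_a: np.ndarray, verts_b: np.ndarray) -> float:
    """verts_a, verts_b: (V, 3) arrays of corresponding mesh vertices."""
    return float(np.linalg.norm(verts_a - verts_b, axis=1).mean())

# AVD-s compares the generated canonical mesh with the source mesh,
# AVD-d with the driving mesh; a lower AVD-s indicates better
# preservation of the source face shape.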

BibTeX

@inproceedings{li2023hidenerf,
  author    = {Li, Weichuang and Zhang, Longhao and Wang, Dong and Zhao, Bin and Wang, Zhigang and Chen, Mulin and Zhang, Bang and Wang, Zhongjian and Bo, Liefeng and Li, Xuelong},
  title     = {One-Shot High-Fidelity Talking-Head Synthesis with Deformable Neural Radiance Field},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2023},
}