One-Shot High-Fidelity Talking-Head Synthesis with Deformable Neural Radiance Field

¹Shanghai AI Laboratory, ²DAMO Academy, Alibaba Group, ³Northwestern Polytechnical University

CVPR 2023

Abstract

In this paper, we propose HiDe-NeRF, which achieves high-fidelity and free-view talking-head synthesis. Drawing on the recently proposed Deformable Neural Radiance Fields, HiDe-NeRF represents the 3D dynamic scene as a canonical appearance field and an implicit deformation field, where the former comprises the canonical source face and the latter models the driving pose and expression.

In particular, we improve the fidelity in two respects:

to enhance identity expressiveness, we design a generalized appearance module that leverages multi-scale volume features to preserve face shape and details;
to improve expression preciseness, we propose a lightweight deformation module that explicitly decouples the pose and expression to enable precise expression modeling.

Method Overview

The Multi-Scale Generalized Appearance Module (MGA) encodes the source image into multi-scale canonical volume features. The Lightweight Expression-Aware Deformation Module (LED) predicts a backward deformation for each observed point, shifting it into the canonical space so that its volume features can be retrieved. The deformation is learned from paired SECCs, conditioned on the positions of points sampled from the rays. The Image Generation Module takes the deformed points as input and samples features from tri-planes at different scales. These multi-scale features are then composed for neural rendering. We also design a refinement network to further refine texture details (e.g., teeth, skin, and hair) and to enhance the resolution of the rendered images.
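The deform-then-sample step described above can be illustrated with a small sketch. This is not the authors' implementation: the function names (`bilinear`, `sample_triplane`, `query_canonical`, `toy_deform`) are hypothetical, each plane holds a single scalar channel, the three plane samples are summed (an assumption), and the learned SECC-conditioned deformation is replaced by a fixed toy offset.

```python
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float, float]
Plane = Sequence[Sequence[float]]      # square grid, one feature channel for brevity
TriPlane = Tuple[Plane, Plane, Plane]  # XY, XZ and YZ feature planes


def bilinear(plane: Plane, u: float, v: float) -> float:
    """Bilinearly sample a square grid at coordinates (u, v) in [-1, 1]."""
    n = len(plane)
    # Clamp to the valid range, then map [-1, 1] to pixel space [0, n - 1].
    x = (min(max(u, -1.0), 1.0) + 1.0) * 0.5 * (n - 1)
    y = (min(max(v, -1.0), 1.0) + 1.0) * 0.5 * (n - 1)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, n - 1), min(y0 + 1, n - 1)
    fx, fy = x - x0, y - y0
    top = plane[y0][x0] * (1 - fx) + plane[y0][x1] * fx
    bottom = plane[y1][x0] * (1 - fx) + plane[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


def sample_triplane(triplane: TriPlane, p: Point) -> float:
    """Project a 3D point onto the three planes and sum the samples (assumed aggregation)."""
    x, y, z = p
    xy, xz, yz = triplane
    return bilinear(xy, x, y) + bilinear(xz, x, z) + bilinear(yz, y, z)


def query_canonical(
    p_obs: Point,
    deform: Callable[[Point], Point],
    triplanes: Sequence[TriPlane],
) -> List[float]:
    """Shift an observed point into canonical space, then sample every scale."""
    dx, dy, dz = deform(p_obs)  # backward deformation: observed -> canonical
    p_can = (p_obs[0] + dx, p_obs[1] + dy, p_obs[2] + dz)
    # One feature per scale; the features are kept side by side for later rendering.
    return [sample_triplane(tp, p_can) for tp in triplanes]


def constant_plane(n: int, value: float) -> List[List[float]]:
    """Build an n x n plane filled with a single value (toy data)."""
    return [[value] * n for _ in range(n)]


def toy_deform(p: Point) -> Point:
    """Stand-in for the learned deformation: a fixed offset, for illustration only."""
    return (0.1, 0.0, -0.1)


coarse = (constant_plane(2, 1.0),) * 3  # low-resolution tri-plane
fine = (constant_plane(4, 2.0),) * 3    # higher-resolution tri-plane
features = query_canonical((0.2, 0.3, 0.4), toy_deform, [coarse, fine])
```

In this toy setup every plane is constant, so `features` holds one value per scale regardless of where the point lands. In a real model the planes would carry learned multi-channel features and the offset would depend on the point position and the expression condition.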

Shape Preservation

Unlike previous warping-based methods, whose results are misled by the driving facial shape, our method preserves the source identity information much better.

Comparison of shape preservation with prior works. AVD-s and AVD-d denote the average vertex distance to the source mesh and the driving mesh, respectively. The first and third rows contain the source, driving, and generated images. The second and fourth rows show the canonical mesh corresponding to each image above. The best results are marked in yellow.
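As a rough illustration of the AVD metric, the sketch below computes the mean Euclidean distance between corresponding vertices of two meshes that share the same topology. The exact protocol (mesh reconstruction, alignment, normalization) is not given here, so the function name `average_vertex_distance` and the toy meshes are assumptions made for illustration only.

```python
import math
from typing import Sequence, Tuple

Vertex = Tuple[float, float, float]


def average_vertex_distance(mesh_a: Sequence[Vertex], mesh_b: Sequence[Vertex]) -> float:
    """Mean Euclidean distance between corresponding vertices of two meshes."""
    if not mesh_a or len(mesh_a) != len(mesh_b):
        raise ValueError("meshes must be non-empty and have the same vertex count")
    total = sum(math.dist(a, b) for a, b in zip(mesh_a, mesh_b))
    return total / len(mesh_a)


# Toy two-vertex "meshes" (hypothetical data, not results from the paper).
generated = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
source = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
driving = [(0.0, 3.0, 4.0), (1.0, 3.0, 4.0)]

avd_s = average_vertex_distance(generated, source)   # distance to the source shape
avd_d = average_vertex_distance(generated, driving)  # distance to the driving shape
```

Under this definition, a small AVD-s means the generated face geometry stays close to the source shape, while AVD-d measures how far it is from the driving shape.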

BibTeX

@inproceedings{li2023hidenerf,
  author    = {Li, Weichuang and Zhang, Longhao and Wang, Dong and Zhao, Bin and Wang, Zhigang and Chen, Mulin and Zhang, Bang and Wang, Zhongjian and Bo, Liefeng and Li, Xuelong},
  title     = {One-Shot High-Fidelity Talking-Head Synthesis with Deformable Neural Radiance Field},
  booktitle = {CVPR},
  year      = {2023},
}
