The Multi-Scale Generalized Appearance Module (MGA) encodes the source image into multi-scale canonical volume features. The Lightweight Expression-Aware Deformation Module (LED) predicts a backward deformation for each observed point, shifting it into the canonical space to retrieve its volume features. The deformation is learned from paired SECCs, conditioned on the positions of the points sampled from the rays. The Image Generation Module takes the deformed points as input and samples features from the different scales of tri-planes; these multi-scale features are then composed for neural rendering. We also design a refinement network to recover texture details (e.g., teeth, skin, and hair) and to enhance the resolution of the rendered images.
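For illustration only, below is a minimal PyTorch sketch of such a pipeline. The module names (MGAEncoder, LEDDeformer, sample_triplanes), tensor shapes, and network details are assumptions made for exposition, not the released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MGAEncoder(nn.Module):
    # Encodes the source image into multi-scale canonical tri-plane features.
    def __init__(self, scales=(32, 64, 128), feat_dim=32):
        super().__init__()
        self.scales = scales
        self.feat_dim = feat_dim
        self.heads = nn.ModuleList([nn.Conv2d(3, 3 * feat_dim, 3, padding=1) for _ in scales])

    def forward(self, src_img):
        planes = []
        for s, head in zip(self.scales, self.heads):
            x = F.interpolate(src_img, size=(s, s), mode='bilinear', align_corners=False)
            x = head(x)                                        # (B, 3*C, s, s)
            planes.append(x.unflatten(1, (3, self.feat_dim)))  # (B, 3, C, s, s): one plane per axis pair
        return planes

class LEDDeformer(nn.Module):
    # Predicts a backward offset for each sampled point, conditioned on its
    # position and an expression code (a stand-in for the paired-SECC condition).
    def __init__(self, exp_dim=64, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3 + exp_dim, hidden), nn.ReLU(), nn.Linear(hidden, 3))

    def forward(self, pts, exp_code):
        cond = exp_code.unsqueeze(1).expand(-1, pts.shape[1], -1)   # (B, N, exp_dim)
        offset = self.mlp(torch.cat([pts, cond], dim=-1))           # (B, N, 3)
        return pts + offset                                         # canonical-space points

def sample_triplanes(planes, pts):
    # Bilinearly samples each tri-plane at the xy/xz/yz projections of pts
    # and sums the features over planes and scales.
    B, N, _ = pts.shape
    feats = 0.0
    for plane in planes:                                   # (B, 3, C, H, W)
        for i, axes in enumerate([(0, 1), (0, 2), (1, 2)]):
            grid = pts[..., list(axes)].view(B, N, 1, 2)   # coordinates assumed in [-1, 1]
            f = F.grid_sample(plane[:, i], grid, align_corners=False)   # (B, C, N, 1)
            feats = feats + f.squeeze(-1).permute(0, 2, 1)              # (B, N, C)
    return feats

src_img = torch.randn(1, 3, 256, 256)       # dummy source image
ray_pts = torch.rand(1, 1024, 3) * 2 - 1    # points sampled along rays
exp = torch.randn(1, 64)                    # expression condition for the observed frame

planes = MGAEncoder()(src_img)               # multi-scale canonical features
canon_pts = LEDDeformer()(ray_pts, exp)      # backward-deform sampled points
feats = sample_triplanes(planes, canon_pts)  # (1, 1024, 32) composed features for volume rendering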
Our method enables free-view generation of talking-head videos. As shown below, users can change the view direction (yaw, pitch, roll) arbitrarily.
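For reference, here is a small sketch of how a user-specified view direction could be mapped to a camera rotation, assuming an intrinsic yaw-pitch-roll convention in degrees; the actual camera model used in the paper may differ.

import numpy as np

def view_rotation(yaw, pitch, roll):
    # Rotations about the y-axis (yaw), x-axis (pitch), and z-axis (roll), composed as Rz @ Rx @ Ry.
    y, p, r = np.deg2rad([yaw, pitch, roll])
    Ry = np.array([[ np.cos(y), 0, np.sin(y)],
                   [ 0,         1, 0        ],
                   [-np.sin(y), 0, np.cos(y)]])
    Rx = np.array([[1, 0,          0         ],
                   [0, np.cos(p), -np.sin(p)],
                   [0, np.sin(p),  np.cos(p)]])
    Rz = np.array([[np.cos(r), -np.sin(r), 0],
                   [np.sin(r),  np.cos(r), 0],
                   [0,          0,         1]])
    return Rz @ Rx @ Ry

R = view_rotation(yaw=20, pitch=-10, roll=0)   # e.g., turn the head 20 degrees to the side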
Different from previous warping-based methods, whose results are misguided by the driving facial shape, our method preserves the source identity information much better.
Comparison of shape preservation with prior works. AVD-s/d denote the average vertex distance to the source and driving mesh, respectively. The first and third rows contain the source, driving, and generated images; the second and fourth rows show the corresponding canonical mesh of the image above. Best results are marked in yellow.
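As a clarification of the metric, here is a minimal sketch of how such an average vertex distance could be computed, assuming the two meshes share the same vertex ordering; the variable names and mesh sizes are illustrative.

import numpy as np

def average_vertex_distance(verts_a, verts_b):
    # verts_a, verts_b: (N, 3) arrays of corresponding mesh vertices.
    return np.linalg.norm(verts_a - verts_b, axis=-1).mean()

rng = np.random.default_rng(0)
canonical_verts = rng.normal(size=(5000, 3))                        # hypothetical reconstructed canonical mesh
source_verts = canonical_verts + 0.01 * rng.normal(size=(5000, 3))  # hypothetical source mesh
avd_s = average_vertex_distance(canonical_verts, source_verts)      # AVD-s; AVD-d is computed the same way against the driving mesh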
@inproceedings{li2023hidenerf,
author = {Li, Weichuang and Zhang, Longhao and Wang, Dong and Zhao, Bin and Wang, Zhigang and Chen, Mulin and Zhang, Bang and Wang, Zhongjian and Bo, Liefeng and Li, Xuelong},
title = {One-Shot High-Fidelity Talking-Head Synthesis with Deformable Neural Radiance Field},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2023},
}