CyberHost: Taming Audio-driven Avatar Diffusion Model with Region Codebook Attention

1Bytedance, 2Zhejiang University
*Equal contribution, Project lead, Internship at Bytedance

Abstract

Diffusion-based video generation technology has advanced significantly, catalyzing a proliferation of research in human animation. However, the majority of these studies are confined to same-modality driving settings, with cross-modality human body animation remaining relatively underexplored. In this paper, we introduce CyberHost, an end-to-end audio-driven human animation framework that ensures hand integrity, identity consistency, and natural motion. The key design of CyberHost is the Region Codebook Attention mechanism, which improves the generation quality of facial and hand animations by integrating fine-grained local features with learned motion pattern priors. Furthermore, we develop a suite of human-prior-guided training strategies, including a body movement map, a hand clarity score, pose-aligned reference features, and local enhancement supervision, to improve synthesis results. To our knowledge, CyberHost is the first end-to-end audio-driven human diffusion model capable of zero-shot video generation at the scale of the human body. Extensive experiments demonstrate that CyberHost surpasses previous works in both quantitative and qualitative aspects.

* The teaser videos above are shown after super-resolution post-processing.

Overall Framework

The overall structure of CyberHost.

CyberHost employs a dual U-Net architecture as its foundational structure and utilizes the motion frame strategy for temporal continuity, establishing a baseline for audio-driven body animation.
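For concreteness, a minimal PyTorch sketch of the motion frame strategy is given below; the latent shapes, the number of motion frames, and the temporal concatenation are illustrative assumptions rather than the exact implementation.

import torch

# Illustrative latent shape (batch, channels, frames, height, width); values are assumptions.
B, C, T, H, W = 1, 4, 16, 64, 64
N_MOTION = 4  # number of motion frames carried over from the previous clip (assumed)

noised_latent = torch.randn(B, C, T, H, W)         # latents of the clip being denoised
motion_frames = torch.randn(B, C, N_MOTION, H, W)  # clean latents of the last frames of the previous clip

# One common realization of the motion frame strategy: prepend the motion-frame
# latents along the temporal axis so the denoising U-Net can attend to them and
# keep the new clip temporally continuous with the previously generated one.
unet_input = torch.cat([motion_frames, noised_latent], dim=2)
print(unet_input.shape)  # torch.Size([1, 4, 20, 64, 64])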

Based on this baseline, to enhance the modeling capability for the key human regions, i.e., the hands and face, we apply the proposed Region Codebook Attention to both the facial and hand regions and insert it into multiple stages of the denoising U-Net. The Region Codebook consists of two parts: a motion codebook learned from the dataset and an identity descriptor extracted from cropped local images.
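As a rough illustration, the sketch below shows one way such a module could be wired up in PyTorch: learned motion-codebook tokens are concatenated with identity tokens encoded from the cropped hand or face image, and the combined context is injected into U-Net features through cross-attention. The class name, dimensions, and the toy crop encoder are assumptions made for illustration, not the official implementation.

import torch
import torch.nn as nn

class RegionCodebookAttention(nn.Module):
    # Minimal sketch (not the official implementation): learned motion-codebook tokens
    # plus identity tokens from the cropped region serve as cross-attention context.
    def __init__(self, dim: int = 320, num_codes: int = 64, num_heads: int = 8):
        super().__init__()
        # Motion patterns learned from the dataset, shared across identities.
        self.motion_codebook = nn.Parameter(torch.randn(num_codes, dim) * 0.02)
        # Toy encoder turning the cropped region image into identity descriptor tokens.
        self.id_encoder = nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=16, stride=16),  # patchify the crop
            nn.Flatten(2),                                 # -> (B, dim, num_patches)
        )
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, unet_feat, region_crop):
        # unet_feat: (B, L, dim) flattened spatial features from one U-Net stage
        # region_crop: (B, 3, H, W) cropped hand or face image
        id_tokens = self.id_encoder(region_crop).transpose(1, 2)        # (B, N_id, dim)
        codes = self.motion_codebook.unsqueeze(0).expand(unet_feat.size(0), -1, -1)
        context = torch.cat([codes, id_tokens], dim=1)                  # (B, N_codes + N_id, dim)
        out, _ = self.cross_attn(self.norm(unet_feat), context, context)
        return unet_feat + out  # residual injection back into the U-Net stage

# Toy usage (shapes are assumptions):
# feat = torch.randn(2, 32 * 32, 320); crop = torch.randn(2, 3, 128, 128)
# out = RegionCodebookAttention()(feat, crop)  # (2, 1024, 320)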

To reduce the uncertainty in full-body animation driven solely by audio, several improvements have been implemented: (1) The Body Movement Map is employed to stabilize the root movements of the body. It is encoded and merged with the noised latent, serving as the input for the denoising U-Net. (2) Hand clarity is explicitly enhanced by incorporating the Hand Clarity Score as a residual into the time embedding to mitigate the effects of motion blur in the data. (3) The Pose Encoder encodes the reference skeleton map, which is then integrated into the reference latent, yielding a Pose-aligned Reference Feature.
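The two conditioning paths most specific to this design, the body movement map merge and the hand clarity score residual, can be sketched as follows; the tensor shapes, the Conv3d encoder, and the projection MLP are illustrative assumptions rather than the exact architecture.

import torch
import torch.nn as nn

# (1) Body Movement Map: encode a coarse map of allowed root motion and merge it
# with the noised latent before it enters the denoising U-Net (additive merge assumed).
noised_latent = torch.randn(1, 4, 16, 64, 64)            # (B, C, T, H, W), assumed shape
movement_map = torch.randn(1, 1, 16, 64, 64)             # body movement map at latent resolution
map_encoder = nn.Conv3d(1, 4, kernel_size=3, padding=1)  # toy encoder
unet_input = noised_latent + map_encoder(movement_map)

# (2) Hand Clarity Score: project a scalar sharpness score and add it as a residual
# to the diffusion time embedding, making motion blur an explicit, controllable factor.
emb_dim = 1280                                            # assumed time-embedding width
time_emb = torch.randn(1, emb_dim)
hand_clarity = torch.tensor([[0.9]])                      # closer to 1.0 means sharper hands
clarity_proj = nn.Sequential(nn.Linear(1, emb_dim), nn.SiLU(), nn.Linear(emb_dim, emb_dim))
conditioned_time_emb = time_emb + clarity_proj(hand_clarity)

At inference, supplying a clarity score close to 1.0 then lets the user request sharp hands regardless of the motion blur present in individual training clips.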

Multimodal-Driven Demo

CyberHost supports mixed-signal driving. The driving signals for the generated results below come from hand pose templates and audio.

Audio-Driven Demos on Open-set Test Images

CyberHost achieves zero-shot human animation on open-set test images in audio-driven settings.


Audio to Video Comparison with Baselines

Comparison with DiffGesture & MimicMotion


Comparison with Vlogger

* Images and audio are provided by the Vlogger homepage.

Video to Video Comparison with Baselines

CyberHost supports video-driven body reenactment and surpasses current state-of-the-art methods in terms of generation quality. The driving signals are skeleton maps extracted from the ground-truth (GT) video using DWPose, and the reference frame is the first frame of the GT video.


Ethics Concerns

This work is intended solely for research purposes. The images and audio used in these demos are from public sources. If you have any concerns, please contact us (jianwen.alan@gmail.com) and we will remove the relevant content promptly.


BibTex

If you find this project useful for your research, please cite us:

@article{lin2024cyberhost,
  title={CyberHost: Taming Audio-driven Avatar Diffusion Model with Region Codebook Attention},
  author={Lin, Gaojie and Jiang, Jianwen and Liang, Chao and Zhong, Tianyun and Yang, Jiaqi and Zheng, Yanbo},
  journal={arXiv preprint arXiv:2409.01876},
  year={2024}
}