Neural Dubber: Dubbing for Videos According to Scripts

Chenxu Hu1, Qiao Tian2, Tingle Li1,3, Yuping Wang2, Yuxuan Wang2, Hang Zhao1,3

1IIIS, Tsinghua University     2ByteDance     3Shanghai Qi Zhi Institute

NeurIPS 2021


Task

Schematic diagram of the automatic video dubbing (AVD) task. Given the video script and the video as input, AVD aims to synthesize speech that is temporally synchronized with the video. The scene shows two people talking with each other; a face is grayed out to indicate that the person is not speaking at that moment.

Given a sentence and a corresponding video clip (without audio), the goal of automatic video dubbing (AVD) is to synthesize natural and intelligible speech whose content is consistent with the sentence and whose prosody is synchronized with the lip movements of the active speaker in the video. Compared to traditional speech synthesis, which only needs to generate natural and intelligible speech from the sentence, the AVD task is more difficult because of the additional synchronization requirement.
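To make the task's inputs and outputs concrete, here is a minimal, hypothetical interface sketch in Python; the function name and array shapes are illustrative assumptions, not from the paper.

import numpy as np

def avd_synthesize(script: str, video_frames: np.ndarray) -> np.ndarray:
    """Hypothetical AVD interface (not an API from the paper).

    script:       the text to be spoken
    video_frames: silent video clip of the speaker, shape (T_video, H, W, 3)
    returns:      a speech waveform, shape (T_audio,), whose content matches
                  the script and whose prosody is synchronized with the lip
                  movements of the active speaker in the video
    """
    raise NotImplementedError  # realized by models such as Neural Dubber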

Abstract

Dubbing is a post-production process of re-recording actors' dialogues, which is extensively used in filmmaking and video production. It is usually performed manually by professional voice actors who read lines with proper prosody, and in synchronization with the pre-recorded videos. In this work, we propose Neural Dubber, the first neural network model to solve a novel automatic video dubbing (AVD) task: synthesizing human speech synchronized with the given video from the text. Neural Dubber is a multi-modal text-to-speech (TTS) model that utilizes the lip movement in the video to control the prosody of the generated speech. Furthermore, an image-based speaker embedding (ISE) module is developed for the multi-speaker setting, which enables Neural Dubber to generate speech with a reasonable timbre according to the speaker's face. Experiments on the single-speaker chemistry lecture dataset and the multi-speaker LRS2 dataset show that Neural Dubber can generate speech on par with state-of-the-art TTS models in terms of speech quality. Most importantly, both qualitative and quantitative evaluations show that Neural Dubber can control the prosody of the synthesized speech by the video, and generate high-fidelity speech temporally synchronized with the video.

Method


First, a phoneme encoder and a video encoder process the phonemes and the video frames respectively, turning both into sequences of hidden representations. These hidden representations are fed into the text-video aligner, which produces an expanded sequence Hmel with the same length as the target mel-spectrogram sequence. Meanwhile, a face image randomly selected from the video frames is fed into the image-based speaker embedding (ISE) module to generate an image-based speaker embedding (used only in the multi-speaker setting). We add Hmel and the ISE together and feed the sum into the variance adaptor, which adds variance information (e.g., pitch and energy). Finally, the mel-spectrogram decoder converts the adapted hidden sequence into the mel-spectrogram sequence.
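The pipeline above can be summarized in a short PyTorch sketch. This is a minimal sketch under stated assumptions: every submodule is a stand-in layer (the real model uses Transformer blocks, a visual backbone on mouth crops, etc.), and only the overall data flow follows the description.

import torch
import torch.nn as nn
import torch.nn.functional as F

class NeuralDubberSketch(nn.Module):
    def __init__(self, n_phonemes=100, d=256, n_mels=80, d_visual=512, d_face=512):
        super().__init__()
        self.phoneme_encoder = nn.Embedding(n_phonemes, d)   # stand-in for the phoneme encoder
        self.video_encoder = nn.Linear(d_visual, d)          # stand-in for the video encoder
        self.ise = nn.Linear(d_face, d)                      # stand-in for the ISE module
        self.variance_adaptor = nn.Linear(d, d)              # stand-in: adds variance info (pitch, energy)
        self.mel_decoder = nn.Linear(d, n_mels)              # stand-in mel-spectrogram decoder

    def forward(self, phonemes, video_feats, mel_len, face_feat=None):
        h_text = self.phoneme_encoder(phonemes)              # (B, T_text, d)
        h_video = self.video_encoder(video_feats)            # (B, T_video, d)
        # Text-video aligner: each video frame attends to the phoneme sequence...
        scores = h_video @ h_text.transpose(1, 2) / h_text.size(-1) ** 0.5
        h_av = scores.softmax(dim=-1) @ h_text               # (B, T_video, d)
        # ...and the result is upsampled to the mel length, giving Hmel
        # (one video frame spans several mel-spectrogram frames).
        h_mel = F.interpolate(h_av.transpose(1, 2), size=mel_len).transpose(1, 2)
        if face_feat is not None:                            # multi-speaker setting only
            h_mel = h_mel + self.ise(face_feat).unsqueeze(1) # add the ISE to Hmel
        h_mel = self.variance_adaptor(h_mel)
        return self.mel_decoder(h_mel)                       # (B, T_mel, n_mels)

# Example: 30 phonemes, 50 video-frame features, 200 target mel frames.
model = NeuralDubberSketch()
mels = model(torch.randint(0, 100, (1, 30)), torch.randn(1, 50, 512),
             mel_len=200, face_feat=torch.randn(1, 512))    # -> (1, 200, 80)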

Results

Since the AVD task aims to synthesize, from text, human speech synchronized with the video, audio quality and audio-visual synchronization (AV sync) are the two key evaluation criteria.

Human Evaluation
We conduct a mean opinion score (MOS) evaluation on the test set to measure audio quality and audio-visual synchronization.
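For reference, a MOS is the mean of listeners' 1-to-5 ratings, and the 95% confidence intervals reported in the tables below can be computed with a normal approximation, as in this small sketch (the ratings are made up):

import math

def mos_with_ci(ratings, z=1.96):
    """Mean opinion score with a 95% confidence half-width (normal approximation)."""
    n = len(ratings)
    mean = sum(ratings) / n
    var = sum((r - mean) ** 2 for r in ratings) / (n - 1)  # sample variance
    return mean, z * math.sqrt(var / n)

scores = [4, 5, 4, 3, 5, 4, 4, 5]   # hypothetical 1-5 listener ratings
mean, half = mos_with_ci(scores)
print(f"MOS = {mean:.2f} +/- {half:.2f}")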
Quantitative Evaluation
In order to measure the synchronization between the generated speech and the video quantitatively, we use the pre-trained SyncNet, which can explicitly test for synchronization between speech audio and lip movements in unconstrained videos in the wild. We adopt two metrics, Lip Sync Error - Distance (LSE-D) and Lip Sync Error - Confidence (LSE-C), both computed automatically with the pre-trained SyncNet model.
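As a rough illustration of how these metrics work: SyncNet embeds short audio and video windows into a joint space, and the metrics are derived from audio-video embedding distances across temporal offsets. The sketch below is a simplified assumption of that computation (the real pipeline involves face detection and windowed embeddings); lower LSE-D and higher LSE-C indicate better synchronization.

import numpy as np

def lse_metrics(audio_emb, video_emb, max_offset=15):
    """LSE-D / LSE-C from per-frame SyncNet-style embeddings, both of shape (T, D).

    For each temporal offset, compute the mean audio-video embedding distance.
    LSE-D is the distance at the best offset (lower = better lip sync);
    LSE-C is the median distance minus the minimum distance over offsets
    (higher = more confident synchronization).
    """
    T = len(audio_emb)
    dists = []
    for off in range(-max_offset, max_offset + 1):
        if off >= 0:
            a, v = audio_emb[off:], video_emb[:T - off]
        else:
            a, v = audio_emb[:off], video_emb[-off:]
        dists.append(np.linalg.norm(a - v, axis=1).mean())
    dists = np.asarray(dists)
    return dists.min(), np.median(dists) - dists.min()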

We conduct qualitative and quantitative evaluations on the single-speaker chemistry lecture (Chem) dataset to compare the audio quality and the audio-visual synchronization of the video clips generated by Neural Dubber and other systems.

Table 1: Evaluation results for single-speaker AVD. The subjective metrics for audio quality and AV sync are reported with 95% confidence intervals.

It can be seen that Neural Dubber surpasses the Video-based Tacotron baseline and is on par with FastSpeech 2 in terms of audio quality, demonstrating that Neural Dubber can synthesize high-quality speech. Furthermore, in terms of AV sync, Neural Dubber outperforms FastSpeech 2 and Video-based Tacotron by a large margin and matches the GT (Mel + PWG) system in both qualitative and quantitative evaluations, which shows that Neural Dubber can control the prosody of the synthesized speech and generate speech synchronized with the video.

We conduct human and quantitative evaluations on the multi-speaker LRS2 dataset to compare Neural Dubber with other systems in the multi-speaker setting.

Table 2: Evaluation results for multi-speaker AVD. The subjective metrics for audio quality and AV sync are reported with 95% confidence intervals.

We can see that Neural Dubber outperforms FastSpeech 2 by a significant margin in terms of audio quality, demonstrating the effectiveness of the ISE in multi-speaker AVD. The qualitative and quantitative evaluations show that the speech synthesized by Neural Dubber is far better synchronized than that of FastSpeech 2 and is on par with the ground-truth recordings in terms of synchronization. These results show that Neural Dubber can handle multi-speaker AVD, which is more challenging than the single-speaker setting.

Demos

Single-speaker AVD

Each example below compares five systems: GT, GT (Mel + PWG), FastSpeech 2, Video-based Tacotron, and Neural Dubber. Example scripts:

And we expect that we'll have an increase in that vapor intensity.
Well let's just calculate from the ideal gas law.
So I can figure out the fraction of oxygen particles from the relationship between the pressures.
So green is plus, green is plus, these are two wave functions coming together that have positive sign everywhere.
So now if I know the pressure goes down by about half an atmosphere, well, the total pressure used to be 2.9.
Just 100 thousandth of a mole dissolves in about a liter of water.


Multi-speaker AVD

Each example below compares four systems: GT, GT (Mel + PWG), FastSpeech 2, and Neural Dubber. Example scripts:

WHO KNEW THAT ONE MAN
THAT'S THE BEST KIND OF TEACHING
THE VERY BEST TALENTS
THE FIRST AND SECOND FLOOR FLATS
TIME NOW FOR ONE OF THE GREATEST TV COMEBACKS OF OUR TIME
YOU HAVE A PROBLEM WITH IT


Contact

If you find our work useful in your research, please consider citing:

@inproceedings{hu2021neural,
title={Neural Dubber: Dubbing for Videos According to Scripts},
author={Hu, Chenxu and Tian, Qiao and Li, Tingle and Wang, Yuping and Wang, Yuxuan and Zhao, Hang},
booktitle={Thirty-Fifth Conference on Neural Information Processing Systems},
year={2021}
}

For more information, please contact: hu-cx21@mails.tsinghua.edu.cn
