Chenxu Hu1, Qiao Tian2, Tingle Li1,3, Yuping Wang2, Yuxuan Wang2, Hang Zhao1,3

1IIIS, Tsinghua University     2ByteDance     3Shanghai Qi Zhi Institute

NeurIPS 2021


Task

Schematic diagram of the automatic video dubbing (AVD) task. Given the video script and the video as input, the AVD task aims to synthesize speech that is temporally synchronized with the video. The example scene shows two people talking with each other; a face shown in gray indicates that the person is not speaking at that time.

Given a sentence and a corresponding video clip (without audio), the goal of automatic video dubbing (AVD) is to synthesize natural and intelligible speech whose content is consistent with the sentence and whose prosody is synchronized with the lip movement of the active speaker in the video. Compared to the traditional speech synthesis task, which only requires natural and intelligible speech for a given sentence, the AVD task is more difficult because of the synchronization requirement.

Abstract

Dubbing is a post-production process of re-recording actors' dialogues, which is extensively used in filmmaking and video production. It is usually performed manually by professional voice actors who read lines with proper prosody, and in synchronization with the pre-recorded videos. In this work, we propose Neural Dubber, the first neural network model to solve a novel automatic video dubbing (AVD) task: synthesizing human speech synchronized with the given video from the text. Neural Dubber is a multi-modal text-to-speech (TTS) model that utilizes the lip movement in the video to control the prosody of the generated speech. Furthermore, an image-based speaker embedding (ISE) module is developed for the multi-speaker setting, which enables Neural Dubber to generate speech with a reasonable timbre according to the speaker's face. Experiments on the chemistry lecture single-speaker dataset and LRS2 multi-speaker dataset show that Neural Dubber can generate speech audio on par with state-of-the-art TTS models in terms of speech quality. Most importantly, both qualitative and quantitative evaluations show that Neural Dubber can control the prosody of synthesized speech by the video, and generate high-fidelity speech temporally synchronized with the video.

Method


First, we apply a phoneme encoder and a video encoder to process the phonemes and video frames respectively. After the encoding, both the raw phonemes and the video frames become sequences of hidden representations. Then we feed these hidden representations into the text-video aligner and get the expanded sequence Hmel, which has the same length as the target mel-spectrogram sequence. Meanwhile, a face image randomly selected from the video frames is input into the image-based speaker embedding (ISE) module to generate an image-based speaker embedding (only used in the multi-speaker setting). We add Hmel and ISE together and feed the sum into the variance adaptor to add variance information (e.g., pitch and energy). Finally, we use the mel-spectrogram decoder to convert the adapted hidden sequence into a mel-spectrogram sequence.
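
The data flow through the text-video aligner and the ISE addition can be sketched at the level of sequence lengths. This is only an illustrative sketch, not the authors' implementation: the description above does not spell out the aligner's internals, so the code assumes scaled dot-product attention with video frames as queries and phonemes as keys and values, followed by repeating each attended vector a fixed number of times to reach the mel-spectrogram length. The function names and the `mel_per_frame` parameter are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def align_text_to_video(h_pho, h_vid, mel_per_frame):
    """Assumed aligner: each video frame attends over all phonemes.

    h_pho: list of phoneme hidden vectors (keys and values).
    h_vid: list of video-frame hidden vectors (queries), same dimension.
    mel_per_frame: hypothetical number of mel frames per video frame.
    Returns a sequence of length len(h_vid) * mel_per_frame.
    """
    d = len(h_pho[0])
    h_mel = []
    for q in h_vid:
        # Scaled dot-product score between this video frame and each phoneme.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in h_pho]
        weights = softmax(scores)
        # Weighted sum of phoneme vectors: text content selected by the frame.
        context = [sum(w * k[j] for w, k in zip(weights, h_pho)) for j in range(d)]
        # Repeat so the output length matches the mel-spectrogram sequence.
        h_mel.extend(list(context) for _ in range(mel_per_frame))
    return h_mel


def add_speaker_embedding(h_mel, ise):
    """Add one speaker embedding vector to every position of h_mel."""
    return [[x + s for x, s in zip(row, ise)] for row in h_mel]


# Toy usage: 3 phonemes, 4 video frames, 4 mel frames per video frame.
h_pho = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
h_vid = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
h_mel = align_text_to_video(h_pho, h_vid, mel_per_frame=4)
adapted_input = add_speaker_embedding(h_mel, ise=[0.1, -0.1])
print(len(adapted_input))  # 16
```

Because the output length depends only on the number of video frames, the synthesized speech duration is tied to the video rather than to predicted phoneme durations, which is the property the synchronization requirement calls for.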

Results

Since the AVD task aims to synthesize human speech synchronized with the video from text, audio quality and audio-visual synchronization (AV Sync) are the two key evaluation criteria.

Human Evaluation
We conduct the mean opinion score (MOS) evaluation on the test set to measure the audio quality and the audio-visual synchronization.
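
The tables report the subjective scores with 95% confidence intervals. How those intervals were computed is not stated here; a common choice, assumed in the sketch below, is the normal approximation: the mean rating plus or minus 1.96 standard errors. The function name `mos_with_ci` and the sample ratings are hypothetical.

```python
import math
from statistics import mean, stdev


def mos_with_ci(ratings, z=1.96):
    """Return (mean opinion score, half-width of the confidence interval).

    Assumes a normal approximation; z=1.96 corresponds to a 95% interval.
    Requires at least two ratings, since the sample standard deviation
    is undefined for a single value.
    """
    score = mean(ratings)
    half_width = z * stdev(ratings) / math.sqrt(len(ratings))
    return score, half_width


# Hypothetical 1-to-5 ratings from several listeners for one system.
ratings = [4, 5, 4, 3, 4, 5, 4, 4]
score, half_width = mos_with_ci(ratings)
print(f"MOS = {score:.2f} +/- {half_width:.2f}")
```

With few raters, a Student's t quantile in place of 1.96 would give a somewhat wider and more conservative interval.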
Quantitative Evaluation
To measure the synchronization between the generated speech and the video quantitatively, we use the pre-trained SyncNet, which can explicitly test for synchronization between speech audio and lip movements in unconstrained videos in the wild. We adopt two metrics that can be calculated automatically by the pre-trained SyncNet model: Lip Sync Error - Distance (LSE-D) and Lip Sync Error - Confidence (LSE-C). Lower LSE-D and higher LSE-C indicate better synchronization.
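
To illustrate how the two numbers relate, the sketch below derives them from a table of audio-visual embedding distances. It assumes the convention used by the public SyncNet evaluation code: distances are averaged over frames for each candidate temporal offset, LSE-D is the smallest averaged distance, and LSE-C is the median averaged distance minus that minimum. The input layout and the function name `lip_sync_error` are hypothetical, and no actual SyncNet model is run here.

```python
from statistics import mean, median


def lip_sync_error(dists_per_frame):
    """Compute (LSE-D, LSE-C) from precomputed embedding distances.

    dists_per_frame[t][k] is the distance between the video embedding at
    frame t and the audio embedding shifted by the k-th candidate offset.
    """
    n_offsets = len(dists_per_frame[0])
    # Average over frames to get one distance per candidate offset.
    mean_by_offset = [
        mean(row[k] for row in dists_per_frame) for k in range(n_offsets)
    ]
    lse_d = min(mean_by_offset)               # lower means closer lip match
    lse_c = median(mean_by_offset) - lse_d    # higher means a sharper minimum
    return lse_d, lse_c


# Toy distances: 2 frames, 5 candidate offsets, best match at the middle one.
toy = [
    [8.0, 6.0, 2.0, 6.0, 8.0],
    [10.0, 8.0, 4.0, 8.0, 10.0],
]
lse_d, lse_c = lip_sync_error(toy)
print(lse_d, lse_c)  # 3.0 4.0
```

A well-synchronized clip shows one offset with a clearly smaller distance than the rest, which gives a low LSE-D and a high LSE-C at the same time.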

We conduct qualitative and quantitative evaluations on the chemistry lecture single-speaker dataset to compare the audio quality and the audio-visual synchronization of the video clips generated by Neural Dubber with those of other systems.

Table 1: The evaluation results for single-speaker AVD. The subjective metrics for audio quality and AV Sync are reported with 95% confidence intervals.

Neural Dubber surpasses the Video-based Tacotron baseline and is on par with FastSpeech 2 in terms of audio quality, which demonstrates that Neural Dubber can synthesize high-quality speech. Furthermore, in terms of AV Sync, Neural Dubber outperforms FastSpeech 2 and Video-based Tacotron by a large margin and matches the GT (Mel + PWG) system in both qualitative and quantitative evaluations, which shows that Neural Dubber can control the prosody of speech and generate speech synchronized with the video.

We conduct human evaluation and quantitative evaluation on the LRS2 multi-speaker dataset to compare Neural Dubber with other systems in the multi-speaker setting.

Table 2: The evaluation results for multi-speaker AVD. The subjective metrics for audio quality and AV Sync are reported with 95% confidence intervals.

Neural Dubber outperforms FastSpeech 2 by a significant margin in terms of audio quality, which shows the effectiveness of ISE in multi-speaker AVD. The qualitative and quantitative evaluations show that, in terms of synchronization, the speech synthesized by Neural Dubber is much better than that of FastSpeech 2 and is on par with the ground truth recordings. These results show that Neural Dubber can address multi-speaker AVD, which is more challenging than single-speaker AVD.

Demos

Single-speaker AVD

Systems compared for each sentence: GT, GT (Mel + PWG), FastSpeech 2, Video-based Tacotron, Neural Dubber.

And we expect that we'll have an increase in that vapor intensity.
Well let's just calculate from the ideal gas law.
So I can figure out the fraction of oxygen particles from the relationship between the pressures.
So green is plus, green is plus, these are two wave functions coming together that have positive sign everywhere.
So now if I know the pressure goes down by about half an atmosphere, well, the total pressure used to be 2.9.
Just 100 thousandth of a mole dissolves in about a liter of water.


Multi-speaker AVD

Systems compared for each sentence: GT, GT (Mel + PWG), FastSpeech 2, Neural Dubber.

WHO KNEW THAT ONE MAN
THAT'S THE BEST KIND OF TEACHING
THE VERY BEST TALENTS
THE FIRST AND SECOND FLOOR FLATS
TIME NOW FOR ONE OF THE GREATEST TV COMEBACKS OF OUR TIME
YOU HAVE A PROBLEM WITH IT


Contact

If you find our work useful in your research, please consider citing:

@inproceedings{hu2021neural,
title={Neural Dubber: Dubbing for Videos According to Scripts},
author={Hu, Chenxu and Tian, Qiao and Li, Tingle and Wang, Yuping and Wang, Yuxuan and Zhao, Hang},
booktitle={Thirty-Fifth Conference on Neural Information Processing Systems},
year={2021}
}

For more information, please contact: hu-cx21@mails.tsinghua.edu.cn
