DiffVoice: Text-to-Speech with Latent Diffusion
Online Supplement


This page contains over 100 MiB of audio files, so they may take minutes to load depending on your network connection. If you have problems playing the audio, you can download all audio files on this page by downloading this GitHub repository.

Some of this page's functionality requires JavaScript. If anything does not work, try opening the page in a different, up-to-date browser. This page is maintained by Zhijun Liu.

Abstract

In this work, we present DiffVoice, a novel text-to-speech model based on latent diffusion. We propose to first encode speech signals into a phoneme-rate latent representation with a variational autoencoder enhanced by adversarial training, and then jointly model the duration and the latent representation with a diffusion model. Subjective evaluations demonstrate that our method beats the best publicly available systems in naturalness. By adopting recent generative inverse-problem-solving algorithms for diffusion models, DiffVoice achieves state-of-the-art performance in text-based speech editing and zero-shot adaptation.
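The two-stage design described above can be sketched in toy form. This is an illustrative outline only, not the released implementation: the function names, latent dimension, noise schedule, and the random-projection "encoder" are all placeholders standing in for the adversarially trained VAE and the trained diffusion network.

```python
import numpy as np

# Hypothetical sketch of the two-stage design (all names illustrative):
# Stage 1: a VAE compresses the mel spectrogram to one latent per phoneme.
# Stage 2: a diffusion model jointly models [latent, log-duration] per phoneme.

rng = np.random.default_rng(0)

def encode_phoneme_rate(mel, durations, latent_dim=8):
    """Toy stand-in for the encoder: average the mel frames within each
    phoneme's span, then project to a latent vector."""
    proj = rng.standard_normal((mel.shape[1], latent_dim)) / np.sqrt(mel.shape[1])
    latents, start = [], 0
    for d in durations:
        latents.append(mel[start:start + d].mean(axis=0) @ proj)
        start += d
    return np.stack(latents)            # shape: (num_phonemes, latent_dim)

def forward_diffuse(x0, t, T=100):
    """Variance-preserving forward noising q(x_t | x_0), linear beta schedule."""
    beta = np.linspace(1e-4, 0.02, T)
    alpha_bar = np.cumprod(1.0 - beta)[t]
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps, eps

# 12 phonemes, 80-bin mel spectrogram, toy per-phoneme durations (in frames)
durations = np.array([3, 5, 4, 2, 6, 3, 4, 5, 2, 3, 4, 6])
mel = rng.standard_normal((durations.sum(), 80))

z = encode_phoneme_rate(mel, durations)                        # (12, 8)
x0 = np.concatenate([z, np.log(durations)[:, None]], axis=1)   # joint target (12, 9)
xt, eps = forward_diffuse(x0, t=50)    # noised sample a score network would denoise
```

Because duration is concatenated with the latent, a single diffusion model samples both jointly rather than relying on a separate deterministic duration predictor.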

The following are six different renditions of the abstract by a DiffVoice model. Note that this model does not have any speaker input. It samples DIFFerent VOICEs from the latent space.

This Notion page contains a few more audio samples using the BigVGAN vocoder.

Results for Seen Speakers in LibriTTS

20 random speakers from the train-clean-100 split of LibriTTS, all seen during training.

Speaker IDs: 26, 40, 226, 887, 1069, 1624, 1737, 2136, 2817, 3240, 3699, 4088, 5750, 7302, 7511, 7794, 7800, 7859, 8324, 8580
Ground Truth
DiffVoice
HiFi-GAN
Autoencoder
VITS
FastSpeech 2

Zero-shot Synthesis on LibriTTS test-clean

All speakers are from the test-clean split of LibriTTS and were unseen during training.

Speaker IDs: 121, 237, 260, 908, 1089, 1188, 1284, 1580, 1995, 2300
Reference
Ground Truth
DiffVoice (Prompt)
HiFi-GAN
Autoencoder
DiffVoice (Encoder)
Meta-StyleSpeech
VITS (Xvec)
YourTTS
FastSpeech 2 (Xvec)

Zero-shot Synthesis on VCTK

All speakers are from VCTK and were unseen during training.

Speaker IDs: 225, 234, 238, 245, 248, 261, 294, 302, 326, 335, 347
Reference
Ground Truth
DiffVoice (Prompt)
HiFi-GAN
Autoencoder
DiffVoice (Encoder)
Meta-StyleSpeech
VITS (Xvec)
YourTTS
FastSpeech 2 (Xvec)

Speech Editing with DiffVoice

In this section, we demonstrate DiffVoice's ability to conduct text-based speech editing (including insertion, replacement, and inpainting) with state-of-the-art quality.
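Editing of this kind is typically cast as an inverse problem for the diffusion model: the latents for the unedited text are held fixed while the edited span is resampled. The following is a toy sketch of a common replace-based inpainting scheme; it is an assumption-laden illustration, not DiffVoice's actual sampler, and the denoiser here is a trivial stand-in for a trained score network.

```python
import numpy as np

# Toy replace-based diffusion inpainting (all names illustrative): at every
# reverse step, the region to keep is overwritten with a correctly-noised
# copy of the observation, while the edited region is denoised freely.

rng = np.random.default_rng(1)
T = 50
beta = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - beta)

def toy_denoiser(x, t):
    """Stand-in for a trained score network: shrink toward zero (toy data mean)."""
    return x * np.sqrt(alpha_bar[t])

def inpaint(x_obs, keep_mask, steps=T):
    """keep_mask == 1 marks latents to preserve; 0 marks latents to regenerate."""
    x = rng.standard_normal(x_obs.shape)        # start from pure noise
    for t in reversed(range(steps)):
        eps = rng.standard_normal(x_obs.shape) if t > 0 else 0.0
        # known region: forward-noise the observation to noise level t
        x_known = np.sqrt(alpha_bar[t]) * x_obs + np.sqrt(1 - alpha_bar[t]) * eps
        # unknown region: one toy reverse step toward the denoiser's estimate
        s = max(t - 1, 0)
        x0_hat = toy_denoiser(x, t)
        x_unknown = np.sqrt(alpha_bar[s]) * x0_hat + np.sqrt(1 - alpha_bar[s]) * eps
        x = keep_mask * x_known + (1 - keep_mask) * x_unknown
    return x

x_obs = rng.standard_normal((20, 4))            # 20 phoneme-rate latents
mask = np.ones((20, 1))
mask[8:12] = 0.0                                # edit phonemes 8..11 only
x_edit = inpaint(x_obs, mask)
```

At the final step the kept region collapses back to the observation, so the unedited phonemes are reproduced essentially exactly while the edited span is freshly sampled in a context-consistent way.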

The following audio samples are obtained from the RetrieverTTS website. We especially thank the authors of RetrieverTTS for providing their generated samples for evaluation.

(Speech Inpainting) Ruth was glad to hear that Philip had made a push into the world, and she was sure that his talent and courage would make a way for him. She should pray for his success at any rate, and especially that the Indians, in St. Louis, would not take his scalp.

Original DiffVoice RetrieverTTS

(Speech Insertion) Is the under side of civilization any less important than the upper side of the very same civilization merely because it is deeper and more sombre?

Original DiffVoice RetrieverTTS

(Speech Insertion) They are chiefly formed from combinations under his success of the impressions made in childhood.

Original DiffVoice RetrieverTTS

The following samples are taken from the EditSpeech website; we do not have access to its implementation. Note that EditSpeech is trained on VCTK instead of LibriTTS.

(Speech Insertion) some have accepted it as a miracle never seen before without physical explanation

Original DiffVoice EditSpeech

(Speech Replacement) some have accepted it as a miracle an undeniable fact without physical explanation

Original DiffVoice EditSpeech

(Speech Insertion) for that theoretical and realistic reason cover should not be given

Original DiffVoice EditSpeech

(Speech Replacement) you should always be able to get out in focus your research on some direction

Original DiffVoice EditSpeech

Extra: Comparison with GuidedTTS 2

In this section, we compare our model with GuidedTTS 2 using samples obtained from its demo page. The GuidedTTS 2 samples are downsampled to 16 kHz for comparison. Compared with our work, GuidedTTS 2 is trained on a much larger dataset and does not require transcriptions of the source audio.

Transcript: This audio was generated by a text-to-speech model for Steve Jobs. We use ten second untranscribed speech from Steve Jobs’ Stanford Commencement Address.
Reference (Steve Jobs) DiffVoice(Zero-shot) GuidedTTS2(Zero-shot) GuidedTTS2(Fine-tune)
Transcript: This audio was generated by a text-to-speech model for Sonny. We propose Guided Text-to-Speech 2, a diffusion based generative model for high-quality adaptive text-to-speech using untranscribed data.
Reference (Son Heung-min) DiffVoice(Zero-shot) GuidedTTS2(Zero-shot) GuidedTTS2(Fine-tune)
Transcript: This audio was generated by a text-to-speech model for Emma Watson. We found that the gap between the zero shot approach and the finetune approach is quite large in the case of real world data.
Reference (Emma Watson) DiffVoice(Zero-shot) GuidedTTS2(Zero-shot) GuidedTTS2(Fine-tune)
Transcript: There was unrestrained association of untried and convicted, juvenile with adult prisoners, vagrants, misdemeanants, felons.
Reference (LJ002-0185) DiffVoice(Zero-shot) GuidedTTS2(Zero-shot) GuidedTTS2(Fine-tune)
Transcript: Nor did the methods by which they were perpetrated greatly vary from those in times past.
Reference (LJ029-0212) DiffVoice(Zero-shot) GuidedTTS2(Zero-shot) GuidedTTS2(Fine-tune)
Transcript: He was struck with the appearance of the corpse, which was not emaciated, as after a long disease ending in death;
Reference (LJ017-0244) DiffVoice(Zero-shot) GuidedTTS2(Zero-shot) GuidedTTS2(Fine-tune)