DiffVoice: Text-to-Speech with Latent Diffusion
Online Supplement


This page contains over 100 MiB of audio files, so they may take minutes to load depending on your network connection. If you have problems playing the audio, you can download all audio files on this page by downloading this GitHub repository.

Some of this page's functionality requires JavaScript. If anything does not work, try opening the page in a different, up-to-date browser. This page is maintained by Zhijun Liu.

Abstract

In this work, we present DiffVoice, a novel text-to-speech model based on latent diffusion. We propose to first encode speech signals into a phoneme-rate latent representation with a variational autoencoder enhanced by adversarial training, and then jointly model the duration and the latent representation with a diffusion model. Subjective evaluations demonstrate that our method beats the best publicly available systems in naturalness. By adopting recent generative inverse-problem-solving algorithms for diffusion models, DiffVoice achieves state-of-the-art performance in text-based speech editing and zero-shot adaptation.
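The two-stage design described above can be sketched in toy form. This is an illustrative outline only, not the released implementation: the function names, latent dimension, noise schedule, and the random-projection "encoder" are all placeholders standing in for the adversarially trained VAE and the trained diffusion network.

```python
import numpy as np

# Hypothetical sketch of the two-stage design (all names illustrative):
# Stage 1: a VAE compresses the mel spectrogram to one latent per phoneme.
# Stage 2: a diffusion model jointly models [latent, log-duration] per phoneme.

rng = np.random.default_rng(0)

def encode_phoneme_rate(mel, durations, latent_dim=8):
    """Toy stand-in for the encoder: average the mel frames within each
    phoneme's span, then project to a latent vector."""
    proj = rng.standard_normal((mel.shape[1], latent_dim)) / np.sqrt(mel.shape[1])
    latents, start = [], 0
    for d in durations:
        latents.append(mel[start:start + d].mean(axis=0) @ proj)
        start += d
    return np.stack(latents)            # shape: (num_phonemes, latent_dim)

def forward_diffuse(x0, t, T=100):
    """Variance-preserving forward noising q(x_t | x_0), linear beta schedule."""
    beta = np.linspace(1e-4, 0.02, T)
    alpha_bar = np.cumprod(1.0 - beta)[t]
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps, eps

# 12 phonemes, 80-bin mel spectrogram, toy per-phoneme durations (in frames)
durations = np.array([3, 5, 4, 2, 6, 3, 4, 5, 2, 3, 4, 6])
mel = rng.standard_normal((durations.sum(), 80))

z = encode_phoneme_rate(mel, durations)                        # (12, 8)
x0 = np.concatenate([z, np.log(durations)[:, None]], axis=1)   # joint target (12, 9)
xt, eps = forward_diffuse(x0, t=50)    # noised sample a score network would denoise
```

Because duration is concatenated with the latent, a single diffusion model samples both jointly rather than relying on a separate deterministic duration predictor.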

The following are six different renditions of the abstract by a DiffVoice model. Note that this model does not have any speaker input. It samples DIFFerent VOICEs from the latent space.

This Notion page contains a few more audio samples using the BigVGAN vocoder.

Results for Seen Speakers in LibriTTS

20 random speakers from the train-clean-100 split of LibriTTS, all seen during training.

Speaker IDs: 26, 40, 226, 887, 1069, 1624, 1737, 2136, 2817, 3240, 3699, 4088, 5750, 7302, 7511, 7794, 7800, 7859, 8324, 8580
Ground Truth
DiffVoice
HiFi-GAN
Autoencoder
VITS
FastSpeech 2

Zero-shot Synthesis on LibriTTS test-clean

All speakers are from the test-clean split of LibriTTS and were unseen during training.

Speaker IDs: 121, 237, 260, 908, 1089, 1188, 1284, 1580, 1995, 2300
Reference
Ground Truth
DiffVoice (Prompt)
HiFi-GAN
Autoencoder
DiffVoice (Encoder)
Meta-StyleSpeech
VITS (Xvec)
YourTTS
FastSpeech 2 (Xvec)

Zero-shot Synthesis on VCTK

All speakers are from VCTK and were unseen during training.

Speaker IDs: 225, 234, 238, 245, 248, 261, 294, 302, 326, 335, 347
Reference
Ground Truth
DiffVoice (Prompt)
HiFi-GAN
Autoencoder
DiffVoice (Encoder)
Meta-StyleSpeech
VITS (Xvec)
YourTTS
FastSpeech 2 (Xvec)

Speech Editing with DiffVoice

In this section, we demonstrate DiffVoice's ability to conduct text-based speech editing (including insertion, replacement, and inpainting) with state-of-the-art quality.
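Editing of this kind is typically cast as an inverse problem for the diffusion model: the latents for the unedited text are held fixed while the edited span is resampled. The following is a toy sketch of a common replace-based inpainting scheme; it is an assumption-laden illustration, not DiffVoice's actual sampler, and the denoiser here is a trivial stand-in for a trained score network.

```python
import numpy as np

# Toy replace-based diffusion inpainting (all names illustrative): at every
# reverse step, the region to keep is overwritten with a correctly-noised
# copy of the observation, while the edited region is denoised freely.

rng = np.random.default_rng(1)
T = 50
beta = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - beta)

def toy_denoiser(x, t):
    """Stand-in for a trained score network: shrink toward zero (toy data mean)."""
    return x * np.sqrt(alpha_bar[t])

def inpaint(x_obs, keep_mask, steps=T):
    """keep_mask == 1 marks latents to preserve; 0 marks latents to regenerate."""
    x = rng.standard_normal(x_obs.shape)        # start from pure noise
    for t in reversed(range(steps)):
        eps = rng.standard_normal(x_obs.shape) if t > 0 else 0.0
        # known region: forward-noise the observation to noise level t
        x_known = np.sqrt(alpha_bar[t]) * x_obs + np.sqrt(1 - alpha_bar[t]) * eps
        # unknown region: one toy reverse step toward the denoiser's estimate
        s = max(t - 1, 0)
        x0_hat = toy_denoiser(x, t)
        x_unknown = np.sqrt(alpha_bar[s]) * x0_hat + np.sqrt(1 - alpha_bar[s]) * eps
        x = keep_mask * x_known + (1 - keep_mask) * x_unknown
    return x

x_obs = rng.standard_normal((20, 4))            # 20 phoneme-rate latents
mask = np.ones((20, 1))
mask[8:12] = 0.0                                # edit phonemes 8..11 only
x_edit = inpaint(x_obs, mask)
```

At the final step the kept region collapses back to the observation, so the unedited phonemes are reproduced essentially exactly while the edited span is freshly sampled in a context-consistent way.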

The following audio samples are obtained from the RetrieverTTS website. We especially thank the authors of RetrieverTTS for providing their generated samples for evaluation.

(Speech Inpainting) Ruth was glad to hear that Philip had made a push into the world, and she was sure that his talent and courage would make a way for him. She should pray for his success at any rate, and especially that the Indians, in St. Louis, would not take his scalp.

Original DiffVoice RetrieverTTS

(Speech Insertion) Is the under side of civilization any less important than the upper side of the very same civilization merely because it is deeper and more sombre?

Original DiffVoice RetrieverTTS

(Speech Insertion) They are chiefly formed from combinations under his success of the impressions made in childhood.

Original DiffVoice RetrieverTTS

The following samples are taken from the EditSpeech website; we do not have access to its implementation. Note that EditSpeech is trained on VCTK instead of LibriTTS.

(Speech Insertion) some have accepted it as a miracle never seen before without physical explanation

Original DiffVoice EditSpeech

(Speech Replacement) some have accepted it as a miracle an undeniable fact without physical explanation

Original DiffVoice EditSpeech

(Speech Insertion) for that theoretical and realistic reason cover should not be given

Original DiffVoice EditSpeech

(Speech Replacement) you should always be able to get out in focus your research on some direction

Original DiffVoice EditSpeech

Extra: Comparison with GuidedTTS 2

In this section, we compare our model with GuidedTTS 2 using samples obtained from its demo page. The GuidedTTS 2 samples are downsampled to 16 kHz for comparison. Compared with our work, GuidedTTS 2 is trained on a much larger dataset and does not require transcriptions of the source audio.

Transcript: This audio was generated by a text-to-speech model for Steve Jobs. We use ten second untranscribed speech from Steve Jobs’ Stanford Commencement Address.
Reference (Steve Jobs) DiffVoice(Zero-shot) GuidedTTS2(Zero-shot) GuidedTTS2(Fine-tune)
Transcript: This audio was generated by a text-to-speech model for Sonny. We propose Guided Text-to-Speech 2, a diffusion based generative model for high-quality adaptive text-to-speech using untranscribed data.
Reference (Son Heung-min) DiffVoice(Zero-shot) GuidedTTS2(Zero-shot) GuidedTTS2(Fine-tune)
Transcript: This audio was generated by a text-to-speech model for Emma Watson. We found that the gap between the zero shot approach and the finetune approach is quite large in the case of real world data.
Reference (Emma Watson) DiffVoice(Zero-shot) GuidedTTS2(Zero-shot) GuidedTTS2(Fine-tune)
Transcript: There was unrestrained association of untried and convicted, juvenile with adult prisoners, vagrants, misdemeanants, felons.
Reference (LJ002-0185) DiffVoice(Zero-shot) GuidedTTS2(Zero-shot) GuidedTTS2(Fine-tune)
Transcript: Nor did the methods by which they were perpetrated greatly vary from those in times past.
Reference (LJ029-0212) DiffVoice(Zero-shot) GuidedTTS2(Zero-shot) GuidedTTS2(Fine-tune)
Transcript: He was struck with the appearance of the corpse, which was not emaciated, as after a long disease ending in death;
Reference (LJ017-0244) DiffVoice(Zero-shot) GuidedTTS2(Zero-shot) GuidedTTS2(Fine-tune)