Autoregressive Diffusion Transformer for Text-to-Speech Synthesis


Abstract

Audio language models have recently emerged as a promising approach for various audio generation tasks, relying on audio tokenizers to encode waveforms into sequences of discrete symbols. Audio tokenization often forces a compromise between code bitrate and reconstruction accuracy. When dealing with low-bitrate audio codes, language models are constrained to process only a subset of the information embedded in the audio, which in turn restricts their generative capabilities. To circumvent these issues, we propose encoding audio as vector sequences in the continuous space ℝ^d and autoregressively generating these sequences using a decoder-only diffusion transformer (ARDiT). Our findings indicate that ARDiT excels in zero-shot text-to-speech, performing on par with or even surpassing state-of-the-art models. The high-bitrate continuous speech representation enables almost flawless reconstruction, allowing our model to achieve nearly perfect speech editing. Our experiments reveal that employing Integral Kullback-Leibler (IKL) divergence for distillation at each autoregressive step significantly boosts the perceived quality of the samples. At the same time, distillation condenses the iterative sampling process of the diffusion model into a single step. Furthermore, ARDiT can be trained to predict several continuous vectors in one step, significantly reducing latency during sampling. Impressively, one of our models can generate 170 ms of 24 kHz speech per evaluation step with minimal degradation in performance.
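
As a rough illustration of the last figure, the arithmetic below converts 170 ms per evaluation step at 24 kHz into waveform samples per step, and into the number of steps needed to cover an utterance of a given length. The helper names and the 10-second example are assumptions made for illustration; they are not part of the ARDiT implementation.

```python
import math

SAMPLE_RATE_HZ = 24_000  # sampling rate quoted in the abstract
MS_PER_STEP = 170        # audio generated per evaluation step, per the abstract


def samples_per_step(sample_rate_hz: int = SAMPLE_RATE_HZ,
                     ms_per_step: int = MS_PER_STEP) -> int:
    """Waveform samples covered by one evaluation step."""
    return sample_rate_hz * ms_per_step // 1000


def steps_for_duration(duration_s: float,
                       ms_per_step: int = MS_PER_STEP) -> int:
    """Evaluation steps needed to cover duration_s seconds (hypothetical helper)."""
    return math.ceil(duration_s * 1000 / ms_per_step)


if __name__ == "__main__":
    print(samples_per_step())        # 4080
    print(steps_for_duration(10.0))  # 59
```

Under these numbers, a 10-second utterance needs on the order of 59 evaluation steps for that model.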

You can download all audio files on this page by cloning this GitHub repository.

Prompted Generation

In this task, we evaluate on test set B. We pick a prompt and a target utterance from the same speaker. The models generate the target waveform given the prompt waveform and the transcripts of both utterances. All speakers are unseen by all systems during training.

#: Text

237: "Boys," whispered their mother, warningly, "she can't answer you; just look at her face."

1089: After early nightfall the yellow lamps would light up, here and there, the squalid quarter of the brothels.

1188: But in this vignette, copied from Turner, you have the two principles brought out perfectly.

1320: A circle of a few hundred feet in circumference was drawn, and each of the party took a segment for his portion.

1995: The Sun, the Swamp? Then finding all well, she closed her eyes and slept.

2300: Hence the Edison electrolytic meter is no longer used, despite its excellent qualities.

2830: The Reformer had lectured on this Epistle of saint Paul's in fifteen nineteen and again in fifteen twenty three.

3570: The last items of this category of consumption are not given up except under stress of the direst necessity.

4077: This statement assumes, as granted, a distinction between bigamy and the "Mormon" institution of plural marriage.

4446: "Yes, I was happy, wasn't I?" She pressed his hand gently in gratitude. "Weren't you happy then, at all?"

4970: He knew his uncle would be glad to hear that he had at last turned his thoughts to a practical matter.

5105: "Is it not impossible," he murmured aloud, "that any city should disappear so completely?

5683: 'It is very happy, for her at least, they are not,' said Rachel, and a long silence ensued.

6829: Give me a check for a hundred and fifty, and I'll turn over to you the forged check and quash further proceedings."

7021: The first person he met was a farm labourer walking alongside a load of peat and smacking at his horse.

7176: For all its appalling speed, the sound of his flight was nothing more than a strong pulsating hiss.

7729: The firm defensive attitude of the people of Lawrence had produced its effect.

8230: Some points may be taken as fixed, and such as any theory of memory must arrive at.

8463: Even so, I had just returned from an arduous journey, exhausted and badly needing a rest.

8555: This was done, the once royal family departing from the palace with shamed and downcast looks.

Prompt
ARDiT (DMD, B=4)
ARDiT (DMD, B=1)
StyleTTS 2
ARDiT (B=1)
ARDiT (B=4)
VoiceCraft (16kHz)
HierSpeech++ (16kHz)
UniCATS
Ground Truth
Autoencoder


Speech Inpainting

We evaluate text-based speech editing on the speech inpainting task. The models generate complete waveforms given complete texts and partially masked waveforms. The masked sections are highlighted within the text. All speakers are unseen by all systems during training. The following 20 test cases are from test set C (long).

#: Text

237: In a few moments he heard the cherries dropping smartly into the pail, and he began to swing his scythe with that long, even stroke that few American boys ever learn.

260: The instruments, the tools, our guns, are clashing and clanking violently in their collisions with each other; the nails of my boots cling tenaciously to a plate of iron let into the timbers, and I cannot draw my foot away from the spot.

908: Does the Eagle know what is in the pit? Or wilt thou go ask the Mole: Can Wisdom be put in a silver rod? Or Love in a golden bowl?

1089: To live, to err, to fall, to triumph, to recreate life out of life!

1188: You have the white of foaming water, of buildings and clouds, brought out brilliantly from a white ground; and though part of the subject is in deep shadow, the eye at once catches the one black point admitted in front.

1320: Recovering his recollection on the instant, instead of sounding an alarm, which might prove fatal to himself, he remained stationary, an attentive observer of the other's motions.

1580: My friend did not appear to be depressed by his failure, but shrugged his shoulders in half humorous resignation.

1995: Miss Taylor was soon starving for human companionship, for the lighter touches of life and some of its warmth and laughter.

2961: When all of them, both those who show themselves in the sky, and those who retire from view, had come into being, the Creator addressed them thus:--'Gods, sons of gods, my works, if I will, are indissoluble.

3570: In the modern community there is also a more frequent attendance at large gatherings of people to whom one's everyday life is unknown; in such places as churches, theaters, ballrooms, hotels, parks, shops, and the like.

3575: He spoke French perfectly, I have been told, when need was; but delighted usually in talking the broadest Yorkshire.

4970: Ruth was glad to hear that Philip had made a push into the world, and she was sure that his talent and courage would make a way for him. She should pray for his success at any rate, and especially that the Indians, in Saint Louis, would not take his scalp.

4992: He was going to leave the room--she followed him, and cried, "But, my Lord, how shall I see again the unhappy object of my treachery?

5683: She was up and dressed, and this moment coming down, and would be very happy to see Miss Brandon, if she would step into the drawing-room.

6829: "If the prosecution were withdrawn and the case settled with the victim of the forged check, then the young man would be allowed his freedom. But under the circumstances I doubt if such an arrangement could be made."

6930: Gradually I knew I was mastering him--then all was blank.

7176: "My dear Sir," I should reply (or Madam), "you have come to the right shop.

8455: I remained there alone for many hours, but I must acknowledge that before I left the chambers I had gradually brought myself to look at the matter in another light.

8463: I wanted nothing more than to see my country again, my friends, my modest quarters by the Botanical Gardens, my dearly beloved collections!

8555: Guided by you, how we might stroll towards death, Our only music one another's breath, Through gardens intimate with hollyhocks, Where silent poppies burn between the rocks, By pools where birches bend to confidants Above green waters scummed with lily-plants.

Ground Truth
ARDiT (DMD, B=4)
ARDiT (DMD, B=1)
ARDiT (B=1)
ARDiT (B=4)
UniCATS
VoiceCraft (16kHz)


Prompted Generation (Comparing with Proprietary Systems I)

In this section, we compare our system with proprietary systems, including NaturalSpeech 2/3, MegaTTS 2, UniAudio, CLaM-TTS, VoiceBox, and VALL-E. The source code and model weights for these systems are not available, so the following samples are obtained from their online demo pages. All waveforms are downsampled to 16 kHz. Please note that ARDiT's performance is affected by the fact that the prompt waveforms are sampled at 16 kHz rather than 24 kHz, and that the prompt texts are not semantically coherent with the target texts.

Samples 1-4 are obtained from the NaturalSpeech 3 demo page, and samples 5-20 from the CLaM-TTS demo page.

#: Text

1: It is this that is of interest to theory of knowledge.

2: For, like as not, they must have thought him a prince when they saw his fine cap.

3: What you had best do, my child, is to keep it and pray to it that since it was a witness to your undoing, it will deign to vindicate your cause by its righteous judgment.

4: The strong position held by the Edison system under the strenuous competition that was already springing up was enormously improved by the introduction of the three wire system and it gave an immediate impetus to incandescent lighting.

5: They moved thereafter cautiously about the hut groping before and about them to find something to show that Warrenton had fulfilled his mission.

6: And lay me down in thy cold bed and leave my shining lot.

7: Number ten, fresh nelly is waiting on you, good night husband.

8: Yea, his honourable worship is within, but he hath a godly minister or two with him, and likewise a leech.

9: Instead of shoes, the old man wore boots with turnover tops, and his blue coat had wide cuffs of gold braid.

10: The army found the people in poverty and left them in comparative wealth.

11: Thus did this humane and right minded father comfort his unhappy daughter, and her mother embracing her again, did all she could to soothe her feelings.

12: He was in deep converse with the clerk and entered the hall holding him by the arm.

13: Indeed, there were only one or two strangers who could be admitted among the sisters without producing the same result.

14: For if he's anywhere on the farm, we can send for him in a minute.

15: Their piety would be like their names, like their faces, like their clothes, and it was idle for him to tell himself that their humble and contrite hearts it might be paid a far-richer tribute of devotion than his had ever been. A gift tenfold more acceptable than his elaborate adoration.

16: The air and the earth are curiously mated and intermingled as if the one were the breath of the other.

17: I had always known him to be restless in his manner, but on this particular occasion he was in such a state of uncontrollable agitation that it was clear something very unusual had occurred.

18: His death in this conjuncture was a public misfortune.

19: It is this that is of interest to theory of knowledge.

20: For a few miles, she followed the line hitherto presumably occupied by the coast of Algeria, but no land appeared to the south.

Prompt
ARDiT (DMD, B=1)
NaturalSpeech 3
NaturalSpeech 2
MegaTTS 2
UniAudio
CLaM-TTS
VoiceBox
VALL-E
Ground Truth


Prompted Generation (Comparing with Proprietary Systems II)

In this section, we compare our system with proprietary flow-matching-based TTS systems, including VoiceBox and SpeechFlow. The source code and model weights for these systems are not available, so the following samples are obtained from their online demo pages. All waveforms are downsampled to 16 kHz. Please note that ARDiT's performance is affected by the fact that the prompt waveforms are sampled at 16 kHz rather than 24 kHz, and that the prompt texts are not semantically coherent with the target texts.

Audio samples are obtained from voicebox.metademolab.com

#: Text

5639-40744-0020: Thus did this humane and right minded father comfort his unhappy daughter, and her mother embracing her again, did all she could to soothe her feelings.

61-70970-0024: They moved thereafter cautiously about the hut groping before and about them to find something to show that warrenton had fulfilled his mission.

908-157963-0027: And lay me down in thy cold bed and leave my shining lot.

672-122797-0040: And the whole night the tree stood still and in deep thought.

1284-1180-0002: Instead of shoes, the old man wore boots with turnover tops, and his blue coat had wide cuffs of gold braid.

4077-13754-0000: The army found the people in poverty, and left them in comparative wealth.

1221-135767-0014: Yea his honourable worship is within, but he hath a godly minister or two with him, and likewise a leech.

61-70970-0007: He was in deep converse with the clerk and entered the hall holding him by the arm.

1089-134686-0004: Number ten fresh nelly is waiting on you, good night husband.

6432-63722-0047: Rather a hypothetical question colonel, but I should say it might be a fifty fifty proposition.

5_1: How much wood could a woodchuck chuck if a woodchuck could chuck wood.

18_1: When feline magicians enchant the city and crafty canine illusionists work to restore balance, don't miss the uproarious clash in 'magic and mischief: the paws of mystery.'

8_1: Peter piper picked a peck of pickled peppers.

29_0: Voicebox is the swiss army knife of text to speech acing multiple languages, changing voice styles, and dishing out custom samples.

21_0: In a land where cat pirates sail the high seas and dog buccaneers chase their tails, embark on a swashbuckling comedy adventure in 'furry buccaneers: the quest for the golden bone.'

Prompt
ARDiT (DMD, B=1)
VoiceBox
SpeechFlow


Prompted Generation (Celebrities and Game Characters)

ARDiT, trained only on LibriTTS, is capable of imitating the voices of famous figures.

Prompts and baseline results are obtained from the Mega-TTS and CLaM-TTS demo pages.

#: Text

Michael Caine: And sometimes, in both realms, it's not just about shining the brightest, but enduring the longest.

Jesse Eisenberg: So here we are, trying to catch up, and hoping this day turns around soon.

Optimus Prime: We must unite and harness our strengths, for the fate of our world hangs in the balance.

Rachel McAdams: But to those who knew her well, it was a symbol of her unwavering determination and spirit.

Robert Downey Jr.: We have the responsibility to ensure power and technology are used for the greater good.

Benedict Cumberbatch: However, if you choose to stay, know that the truth I unveil may forever alter the course of your journey.

Mark Zuckerberg: Our goal is to bridge communication gaps and preserve the richness of these unique languages.

Dwarf from Warcraft: Good afternoon everyone. Today, we are super excited to introduce you all to Introduction to Deep Learning, the course of Carnegie Mellon University.

Barack Obama: Good afternoon everyone. Today, we are super excited to introduce you all to Introduction to Deep Learning, the course of Carnegie Mellon University.

Theresa May: Good afternoon everyone. Today, we are super excited to introduce you all to Introduction to Deep Learning, the course of Carnegie Mellon University.

Prompt
ARDiT (DMD, B=1)
MegaTTS
CLaM-TTS


Speech Editing

In this section, we compare the speech editing performance of ARDiT with that of VoiceBox and VoiceCraft.

The following audio samples are obtained from VoiceCraft's demo page.

Text: Will find himself completely at a loss on occasions of common and constant recurrence rare and unpredictable circumstances, speculative ability is one thing and practical ability is another.

Original, ARDiT (DMD, B=1), VoiceBox, VoiceCraft

Text: In zero weather, in mid-winter, when the earth is frozen to a great depth below the surface jack frost has cast his icy spell upon the land, when in driving over the unpaved country roads, they give forth a hard metallic ring.

Original, ARDiT (DMD, B=1), VoiceBox, VoiceCraft

Text: And especially as I am not very much up in latin myself, he said, the suit was on an insurance policy a classified treasure map that he was defending on the ground of misinterpretations.

Original, ARDiT (DMD, B=1), VoiceBox, VoiceCraft

Text: Yet these petty operations incessantly continued in time surmount the greatest difficulties, mountains are elevated and oceans bounded vast challenges emerge and unexplored frontiers beckon by the slender force of human beings.

Original, ARDiT (DMD, B=1), VoiceBox, VoiceCraft

Text: And the carlsruhe inventive professor had to devise an ingenious apparatus which enabled him to bring the preparation at the required temperature on to the very plate of the microscope.

Original, ARDiT (DMD, B=1), VoiceBox, VoiceCraft

Text: This was george steers the son of a british naval captain and ship modeler who had become an american naval officer and was the first man to take charge of the washington navy yard entrusted with the prestigious role of overseeing the operations at the renowned naval headquarters.

Original, ARDiT (DMD, B=1), VoiceBox, VoiceCraft

Speech Rate Control

ARDiT TTS can control the output speech rate to some extent, by controlling the total audio duration.

Prompt

Text: The examination and testimony of the experts, enabled the commission to conclude, that five shots may have been fired.

Very Slow, Slow, Normal
Normal, Fast, Very Fast
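
One way to picture this control is as a mapping from a desired speaking-rate factor to a total duration budget for the utterance. The sketch below is a hypothetical illustration: the rate factors attached to the five labels, the 6-second nominal duration, and the helper name are assumptions made for this example, not the values used to produce the samples above.

```python
# Hypothetical rate factors (> 1 means faster speech); not taken from the ARDiT demo.
RATE_FACTORS = {
    "Very Slow": 0.6,
    "Slow": 0.8,
    "Normal": 1.0,
    "Fast": 1.25,
    "Very Fast": 1.5,
}


def target_duration_s(nominal_duration_s: float, rate_factor: float) -> float:
    """Total audio duration to request for a given speaking-rate factor."""
    if rate_factor <= 0:
        raise ValueError("rate_factor must be positive")
    return nominal_duration_s / rate_factor


if __name__ == "__main__":
    nominal = 6.0  # assumed duration of the sentence at a normal rate, in seconds
    for label, factor in RATE_FACTORS.items():
        print(f"{label:>9}: {target_duration_s(nominal, factor):.2f} s")
```

Shrinking the duration budget forces the same text into less audio, which raises the speech rate; enlarging it has the opposite effect.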