Direct Preference Optimization for Speech Autoregressive Diffusion Models

Zhijun Liu¹, Dongya Jia⁴, Xiaoqiang Wang⁴, Chenpeng Du⁴, Shuai Wang³, Zhuo Chen⁴, Haizhou Li²

¹School of Data Science, The Chinese University of Hong Kong, Shenzhen
²School of Artificial Intelligence, The Chinese University of Hong Kong, Shenzhen
³Nanjing University ⁴ByteDance Seed


Abstract

Autoregressive diffusion models (ARDMs) have recently been applied to speech generation, achieving state-of-the-art (SOTA) performance in zero-shot text-to-speech. By autoregressively generating continuous speech tokens with next-token diffusion, these models offer a promising alternative to next-token prediction, avoiding the technical complexities associated with discrete speech tokenization. As a relatively new paradigm, research on reinforcement learning (RL)-based fine-tuning of speech ARDMs remains limited. In this paper, we propose Autoregressive Diffusion-Direct Preference Optimization (ARDM-DPO) to advance this research. By fine-tuning the recently proposed zero-shot text-to-speech model DiTAR with DPO, we achieve significant improvements in terms of speech expressiveness and robustness for long texts.
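As background, DPO fine-tunes a policy directly on preference pairs, without training an explicit reward model. Below is a minimal scalar sketch of the DPO objective; in ARDM-DPO the log-probabilities would come from the ARDM's diffusion-based likelihood terms rather than exact token log-probs, and all names here are illustrative, not taken from the paper.

```python
import math


def dpo_loss(
    logp_chosen: float,      # log-prob of the preferred sample under the policy
    logp_rejected: float,    # log-prob of the dispreferred sample under the policy
    ref_logp_chosen: float,  # same quantities under the frozen reference model
    ref_logp_rejected: float,
    beta: float = 0.1,       # strength of the implicit KL constraint
) -> float:
    # Implicit reward margin: how much more the policy prefers the chosen
    # sample over the rejected one, relative to the reference model.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # -log(sigmoid(beta * margin)), written as softplus for numerical clarity.
    return math.log1p(math.exp(-beta * margin))
```

With equal policy and reference log-probs the margin is zero and the loss is log 2; the loss shrinks as the policy learns to assign relatively higher likelihood to the preferred sample.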

Task A: Improving Expressiveness by Increasing F0 Variance


Texts for samples 0–10:

Hugh MacConnell came with his sister, and stood about, managing his tea-cup awkwardly and watching every one out of his deep-set, faded eyes.

They constitute two different orders of facts which correspond to each other, which are always interlaced, and which often bring forth results.

In an instant Servadac mounted the side-work, laid himself down in the gap, and thus filling up the breach by his own body, shouted, "March on!"

My remark pleases him, but I soon prove to him that it is not the right way to speak, however perfect may have been the language of that ancient writer.

"See!" said Uncas, pointing north and south, at the evident marks of the broad trail on either side of him, "the dark-hair has gone toward the forest."

"Then, dear," said Mrs. Whitney, "you must be kinder to her than ever; think what it would be for one of you to be away from home even among friends."

"I quite agree--in regard to Griffin's ghost, or whatever it was--that its appearing first to the little boy, at so tender an age, adds a particular touch."

As a result we have this paradoxical situation: The Gospel supplies the world with the salvation of Jesus Christ, peace of conscience, and every blessing.

Then there was a little pause in the conversation, and I felt myself bound to say something as to the violent interruption to which I had this morning been subjected.

But Rodolfo had been struck by the great beauty of Leocadia, the hidalgo's daughter, and presently he began to entertain the idea of enjoying it at all hazards.

The queens had taken their seats upon a magnificent dias or platform, erected upon the borders of the lake, in a theater of wonderful elegance of construction.

Audio for each sample: Prompt / DPO / Base Model / Ground Truth.
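Given a scalar reward such as the F0-variance function listed under "Reward function for Task A", DPO preference pairs can be formed by sampling several generations per text and pairing the highest-scoring one against the lowest-scoring one. A minimal sketch of this pairing step (function names are illustrative, not from the paper):

```python
from typing import Callable, Sequence, Tuple


def make_preference_pair(
    candidates: Sequence[object],          # K generations for one input text
    reward_fn: Callable[[object], float],  # scalar reward, e.g. band-passed F0 std
) -> Tuple[object, object]:
    # Score every candidate, then return (chosen, rejected) for DPO.
    scores = [reward_fn(c) for c in candidates]
    best = max(range(len(candidates)), key=scores.__getitem__)
    worst = min(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], candidates[worst]
```

Best-vs-worst pairing maximizes the reward gap within each group, which tends to give a cleaner preference signal than random pairing.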

Task B: Improving Robustness by Improving Text Likelihood


Texts for samples 0–10 (Chinese; each contains deliberately repeated words or phrases to stress-test robustness):

这些花粉被抖出来抖出来抖出来抖出来抖出来抖出来抖出来,再筛选一番,然后酿成酿成酿成酿成酿成酿成酿成蜜,做成蜡。

近日,除了葛洲坝葛洲坝葛洲坝葛洲坝葛洲坝股价下跌外,其余其余其余其余其余三家均均均均均有不同程度的的的的的的的上涨。

好吧,我们别别别别别别别耽搁耽搁耽搁耽搁耽搁耽搁时间了,收拾收拾收拾收拾收拾收拾收拾东西,干干干干干干正经事吧。

涮墩布墩布墩布墩布墩布墩布拖地拖地拖地拖地拖地拖地拖地也是用洗完洗完洗完洗完洗完洗完衣服,倒在在在在在桶里的的的的的水。

我等不及等不及等不及等不及等不及去去去去去玩水,我站站站站站在沙上沙上沙上沙上沙上的时候,脚快脚快脚快脚快脚快烧起来了。

老太太青面獠牙,姑娘一见心惊胆战心惊胆战心惊胆战心惊胆战心惊胆战心惊胆战心惊胆战,打算打算打算打算打算打算打算赶快逃走。

同时同时同时同时同时同时肥胖者腹部腹部腹部腹部腹部腹部脂肪堆积,又限制限制限制限制限制了肺的呼吸运动运动运动运动运动运动运动。

难道是有哪个幸运小家伙小家伙小家伙小家伙小家伙小家伙小家伙,被他看中看中看中看中看中看中了了了了了了不成不成不成不成不成。

好久没去幼稚园看小朋友们了,不能两手空空呀。不能两手空空呀。不能两手空空呀。不能两手空空呀。不能两手空空呀。不能两手空空呀。不能两手空空呀。

刚进刚进刚进刚进刚进房间,陈缵光不明白明白明白明白明白明白明白,三个人怎么挤得上双层双层双层双层双层双层的的的的的的的单人床。

共同建设面向未来面向未来面向未来面向未来面向未来面向未来的的的的的的的交通,和出行服务服务服务服务服务新生态生态生态生态生态生态生态。

Audio for each sample: Prompt / DPO / Base Model.

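Task B rewards generations by the likelihood of the target text given the produced audio. The exact scoring model is not listed on this page; the following is a hedged sketch assuming access to some ASR model exposing a log p(text | wave) score — the `asr_log_prob` callable and `text_likelihood_reward` name are hypothetical, not from the paper.

```python
from typing import Any, Callable


def text_likelihood_reward(
    wave: Any,  # generated waveform (format depends on the ASR model)
    text: str,  # target transcript
    asr_log_prob: Callable[[Any, str], float],  # hypothetical: returns log p(text | wave)
) -> float:
    # Length-normalize so the reward compares fairly across texts of
    # different lengths; higher means the audio better realizes the text.
    return asr_log_prob(wave, text) / max(len(text), 1)
```

With a real ASR model, skipped, stuttered, or hallucinated words lower the transcript's likelihood, so maximizing this reward directly targets the robustness failures shown in the table above.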
Reward function for Task A

import torch
from torch import Tensor
import numpy as np
import pyworld
from scipy.signal import butter, filtfilt


def apply_band_pass_filter(
    data: np.ndarray,  # float [T]
    lowcut: float,  # Lower bound of the band-pass filter (in Hz).
    highcut: float,  # Upper bound of the band-pass filter (in Hz).
    fs: float,  # Sampling frequency of the data (in Hz).
    order: int = 5,  # Order of the Butterworth filter.
) -> Tensor:  # CPU float [T]
    nyq = 0.5 * fs  # Nyquist frequency
    low = lowcut / nyq
    high = highcut / nyq
    b, a = butter(order, [low, high], btype="band")  # type: ignore
    filtered_data = filtfilt(b, a, data).copy()
    return torch.as_tensor(filtered_data, dtype=torch.float32)


def pitch_filter_std_reward_v4(
    wave: Tensor,  # float [N_sample]
    sr: int = 24000,
    min_f0: float = 120.0,
    max_f0: float = 400.0,
    f0_lowcut: float = 1.0,
    f0_highcut: float = 5.0,
    f0_filter_order: int = 5,
) -> float:
    x = wave.detach().cpu().numpy().astype(np.float64)
    # DIO uses a 5 ms frame period by default, i.e. a 200 Hz F0 frame rate.
    f0, _ = pyworld.dio(x, sr)  # type: ignore
    # Keep only plausibly voiced frames.
    f0 = f0[(f0 > min_f0) & (f0 < max_f0)]

    if len(f0) <= 100:  # require > 0.5 s of voiced frames (100 frames at 200 Hz)
        return 0.0

    # Winsorize at the 5th/95th percentiles to suppress octave errors and outliers.
    p05 = np.percentile(f0, 5)
    p95 = np.percentile(f0, 95)
    f0 = np.clip(f0, p05, p95)

    # Band-pass the voiced F0 contour (1-5 Hz captures prosodic modulation)
    # and use its standard deviation as the expressiveness reward.
    f0f = apply_band_pass_filter(
        f0,
        lowcut=f0_lowcut,
        highcut=f0_highcut,
        fs=200.0,  # matches DIO's 5 ms frame period
        order=f0_filter_order,
    )
    return float(f0f.std())