Direct Preference Optimization for Speech Autoregressive Diffusion Models

Zhijun Liu¹, Dongya Jia⁴, Xiaoqiang Wang⁴, Chenpeng Du⁴, Shuai Wang³, Zhuo Chen⁴, Haizhou Li²

¹School of Data Science, The Chinese University of Hong Kong, Shenzhen
²School of Artificial Intelligence, The Chinese University of Hong Kong, Shenzhen
³Nanjing University ⁴ByteDance Seed


Abstract

Autoregressive diffusion models (ARDMs) have recently been applied to speech generation, achieving state-of-the-art (SOTA) performance in zero-shot text-to-speech. By autoregressively generating continuous speech tokens with next-token diffusion, these models offer a promising alternative to next-token prediction, avoiding the technical complexities associated with discrete speech tokenization. As a relatively new paradigm, research on reinforcement learning (RL)-based fine-tuning of speech ARDMs remains limited. In this paper, we propose Autoregressive Diffusion-Direct Preference Optimization (ARDM-DPO) to advance this research. By fine-tuning the recently proposed zero-shot text-to-speech model DiTAR with DPO, we achieve significant improvements in terms of speech expressiveness and robustness for long texts.
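As background, DPO fine-tunes a policy directly on preference pairs, without training an explicit reward model. Below is a minimal scalar sketch of the DPO objective; in ARDM-DPO the log-probabilities would come from the ARDM's diffusion-based likelihood terms rather than exact token log-probs, and all names here are illustrative, not taken from the paper.

```python
import math


def dpo_loss(
    logp_chosen: float,      # log-prob of the preferred sample under the policy
    logp_rejected: float,    # log-prob of the dispreferred sample under the policy
    ref_logp_chosen: float,  # same quantities under the frozen reference model
    ref_logp_rejected: float,
    beta: float = 0.1,       # strength of the implicit KL constraint
) -> float:
    # Implicit reward margin: how much more the policy prefers the chosen
    # sample over the rejected one, relative to the reference model.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # -log(sigmoid(beta * margin)), written as softplus for numerical clarity.
    return math.log1p(math.exp(-beta * margin))
```

With equal policy and reference log-probs the margin is zero and the loss is log 2; the loss shrinks as the policy learns to assign relatively higher likelihood to the preferred sample.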

Task A: Improving Expressiveness by Increasing F0 Variance


Texts for samples 0–10:

Hugh MacConnell came with his sister, and stood about, managing his tea-cup awkwardly and watching every one out of his deep-set, faded eyes.

They constitute two different orders of facts which correspond to each other, which are always interlaced, and which often bring forth results.

In an instant Servadac mounted the side-work, laid himself down in the gap, and thus filling up the breach by his own body, shouted, "March on!"

My remark pleases him, but I soon prove to him that it is not the right way to speak, however perfect may have been the language of that ancient writer.

"See!" said Uncas, pointing north and south, at the evident marks of the broad trail on either side of him, "the dark-hair has gone toward the forest."

"Then, dear," said Mrs. Whitney, "you must be kinder to her than ever; think what it would be for one of you to be away from home even among friends."

"I quite agree--in regard to Griffin's ghost, or whatever it was--that its appearing first to the little boy, at so tender an age, adds a particular touch."

As a result we have this paradoxical situation: The Gospel supplies the world with the salvation of Jesus Christ, peace of conscience, and every blessing.

Then there was a little pause in the conversation, and I felt myself bound to say something as to the violent interruption to which I had this morning been subjected.

But Rodolfo had been struck by the great beauty of Leocadia, the hidalgo's daughter, and presently he began to entertain the idea of enjoying it at all hazards.

The queens had taken their seats upon a magnificent dias or platform, erected upon the borders of the lake, in a theater of wonderful elegance of construction.

Audio for each sample: Prompt / DPO / Base Model / Ground Truth.
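Given a scalar reward such as the F0-variance function listed under "Reward function for Task A", DPO preference pairs can be formed by sampling several generations per text and pairing the highest-scoring one against the lowest-scoring one. A minimal sketch of this pairing step (function names are illustrative, not from the paper):

```python
from typing import Callable, Sequence, Tuple


def make_preference_pair(
    candidates: Sequence[object],          # K generations for one input text
    reward_fn: Callable[[object], float],  # scalar reward, e.g. band-passed F0 std
) -> Tuple[object, object]:
    # Score every candidate, then return (chosen, rejected) for DPO.
    scores = [reward_fn(c) for c in candidates]
    best = max(range(len(candidates)), key=scores.__getitem__)
    worst = min(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], candidates[worst]
```

Best-vs-worst pairing maximizes the reward gap within each group, which tends to give a cleaner preference signal than random pairing.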

Task B: Improving Robustness by Improving Text Likelihood


Texts for samples 0–10 (Chinese; each contains deliberately repeated words or phrases to stress-test robustness):

这些花粉被抖出来抖出来抖出来抖出来抖出来抖出来抖出来,再筛选一番,然后酿成酿成酿成酿成酿成酿成酿成蜜,做成蜡。

近日,除了葛洲坝葛洲坝葛洲坝葛洲坝葛洲坝股价下跌外,其余其余其余其余其余三家均均均均均有不同程度的的的的的的的上涨。

好吧,我们别别别别别别别耽搁耽搁耽搁耽搁耽搁耽搁时间了,收拾收拾收拾收拾收拾收拾收拾东西,干干干干干干正经事吧。

涮墩布墩布墩布墩布墩布墩布拖地拖地拖地拖地拖地拖地拖地也是用洗完洗完洗完洗完洗完洗完衣服,倒在在在在在桶里的的的的的水。

我等不及等不及等不及等不及等不及去去去去去玩水,我站站站站站在沙上沙上沙上沙上沙上的时候,脚快脚快脚快脚快脚快烧起来了。

老太太青面獠牙,姑娘一见心惊胆战心惊胆战心惊胆战心惊胆战心惊胆战心惊胆战心惊胆战,打算打算打算打算打算打算打算赶快逃走。

同时同时同时同时同时同时肥胖者腹部腹部腹部腹部腹部腹部脂肪堆积,又限制限制限制限制限制了肺的呼吸运动运动运动运动运动运动运动。

难道是有哪个幸运小家伙小家伙小家伙小家伙小家伙小家伙小家伙,被他看中看中看中看中看中看中了了了了了了不成不成不成不成不成。

好久没去幼稚园看小朋友们了,不能两手空空呀。不能两手空空呀。不能两手空空呀。不能两手空空呀。不能两手空空呀。不能两手空空呀。不能两手空空呀。

刚进刚进刚进刚进刚进房间,陈缵光不明白明白明白明白明白明白明白,三个人怎么挤得上双层双层双层双层双层双层的的的的的的的单人床。

共同建设面向未来面向未来面向未来面向未来面向未来面向未来的的的的的的的交通,和出行服务服务服务服务服务新生态生态生态生态生态生态生态。

Audio for each sample: Prompt / DPO / Base Model.

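Task B rewards generations by the likelihood of the target text given the produced audio. The exact scoring model is not listed on this page; the following is a hedged sketch assuming access to some ASR model exposing a log p(text | wave) score — the `asr_log_prob` callable and `text_likelihood_reward` name are hypothetical, not from the paper.

```python
from typing import Any, Callable


def text_likelihood_reward(
    wave: Any,  # generated waveform (format depends on the ASR model)
    text: str,  # target transcript
    asr_log_prob: Callable[[Any, str], float],  # hypothetical: returns log p(text | wave)
) -> float:
    # Length-normalize so the reward compares fairly across texts of
    # different lengths; higher means the audio better realizes the text.
    return asr_log_prob(wave, text) / max(len(text), 1)
```

With a real ASR model, skipped, stuttered, or hallucinated words lower the transcript's likelihood, so maximizing this reward directly targets the robustness failures shown in the table above.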
Reward function for Task A

import torch
from torch import Tensor
import numpy as np
import pyworld
from scipy.signal import butter, filtfilt


def apply_band_pass_filter(
    data: np.ndarray,  # float [T]
    lowcut: float,  # Lower bound of the band-pass filter (in Hz).
    highcut: float,  # Upper bound of the band-pass filter (in Hz).
    fs: float,  # Sampling frequency of the data (in Hz).
    order: int = 5,  # Order of the Butterworth filter.
) -> Tensor:  # CPU float [T]
    nyq = 0.5 * fs  # Nyquist frequency
    low = lowcut / nyq
    high = highcut / nyq
    b, a = butter(order, [low, high], btype="band")  # type: ignore
    filtered_data = filtfilt(b, a, data).copy()
    return torch.as_tensor(filtered_data, dtype=torch.float32)


def pitch_filter_std_reward_v4(
    wave: Tensor,  # float [N_sample]
    sr: int = 24000,
    min_f0: float = 120.0,
    max_f0: float = 400.0,
    f0_lowcut: float = 1.0,
    f0_highcut: float = 5.0,
    f0_filter_order: int = 5,
) -> float:
    x = wave.detach().cpu().numpy().astype(np.float64)
    # DIO uses a 5 ms frame period by default, i.e. a 200 Hz F0 frame rate.
    f0, _ = pyworld.dio(x, sr)  # type: ignore
    # Keep only plausibly voiced frames.
    f0 = f0[(f0 > min_f0) & (f0 < max_f0)]

    if len(f0) <= 100:  # require > 0.5 s of voiced frames (100 frames at 200 Hz)
        return 0.0

    # Winsorize at the 5th/95th percentiles to suppress octave errors and outliers.
    p05 = np.percentile(f0, 5)
    p95 = np.percentile(f0, 95)
    f0 = np.clip(f0, p05, p95)

    # Band-pass the voiced F0 contour (1-5 Hz captures prosodic modulation)
    # and use its standard deviation as the expressiveness reward.
    f0f = apply_band_pass_filter(
        f0,
        lowcut=f0_lowcut,
        highcut=f0_highcut,
        fs=200.0,  # matches DIO's 5 ms frame period
        order=f0_filter_order,
    )
    return float(f0f.std())