```python
from math import ceil

import numpy as np
import torch
import torch.nn as nn
```
#### Mel Spectrogram Analysis
```python
analysis_stft_fft_length = 2048
analysis_stft_window_length = 512
mel_length = 80
mel_min_f0 = 40.0
mel_max_f0 = 7600.0
```
#### Sampling
```python
sampling_rate = 22050
frame_length = 128
```
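For reference, these parameters map onto a log-mel analysis front end roughly as follows. This is a sketch using librosa (the library choice and the log/power conventions are our assumptions; the authors' exact extraction code is not part of this listing):

```python
import librosa
import numpy as np

def log_mel_spectrogram(wav):
    # Sketch only: wires the configuration values above into librosa's mel analysis.
    mel = librosa.feature.melspectrogram(
        y=wav,
        sr=sampling_rate,                         # 22050
        n_fft=analysis_stft_fft_length,           # 2048
        win_length=analysis_stft_window_length,   # 512
        hop_length=frame_length,                  # 128
        n_mels=mel_length,                        # 80
        fmin=mel_min_f0,                          # 40.0
        fmax=mel_max_f0,                          # 7600.0
    )
    return np.log(mel + 1e-10)  # [80, Frames]
```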
Many thanks to the authors of the Python wrappers and of the underlying open-source speech processing tools.
We used the following function to extract f0:
```python
from pyworld import dio, stonemask
from pyreaper import reaper


def reaper_stonemask(double_x, frame_length, sampling_rate):
    """
    double_x: numpy double array [Samples]
    frame_length: int, number of samples in a single frame
    sampling_rate: int
    returns: numpy double array [Frames]
    """
    # Truncate the waveform to a whole number of frames.
    n_frames = len(double_x) // frame_length
    n_samples = n_frames * frame_length
    double_x = double_x[:n_samples]
    # REAPER expects 16-bit integer samples.
    int_x = np.clip(double_x * (65536 // 2), -32768, 32767).astype(np.int16)
    # Frame-center times in seconds.
    times = np.linspace(0, n_frames - 1, n_frames) * frame_length / sampling_rate \
        + frame_length / 2 / sampling_rate
    # Coarse f0 track from REAPER, interpolated to the frame centers.
    _, _, f0_times, f0, _ = reaper(int_x, sampling_rate, minf0=40.0, maxf0=600.0)
    coarse_f0 = np.interp(times, f0_times, f0)
    # Refine the estimate with WORLD's StoneMask.
    fine_f0 = stonemask(double_x, coarse_f0, times, sampling_rate)
    return fine_f0
```
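A minimal usage sketch (soundfile and the file name are placeholders; any mono recording at the configured sampling rate works):

```python
import soundfile as sf

wav, sr = sf.read("sample.wav")   # float64 samples in [-1, 1]; placeholder path
assert sr == sampling_rate        # 22050 in the configuration above
f0 = reaper_stonemask(wav, frame_length, sampling_rate)
print(f0.shape)                   # (len(wav) // frame_length,)
```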
We used the Adam optimizer with $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 1\times 10^{-8}$ in all models, and applied Weight Normalization to all network weights. In all experiments we used the following Noam learning rate schedule, with different learning rates for the generators and discriminators.
```python
def noam_lr(warmup_steps, min_lr, init_lr, step, power=0.35):
    return np.maximum(
        init_lr * warmup_steps ** power * np.minimum(
            step * warmup_steps ** (-1 - power),
            step ** -power
        ),
        min_lr
    )
```
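For concreteness, one way to drive separate generator and discriminator optimizers with this schedule is via `torch.optim.lr_scheduler.LambdaLR`. This is a sketch, not the training script; the modules are placeholders, and the warm-up steps and learning rates are the NHV values from the table below:

```python
import torch.nn as nn
import torch.optim as optim

# Placeholder modules standing in for the actual generator / discriminator.
generator = nn.Linear(80, 1)
discriminator = nn.Linear(1, 1)

g_opt = optim.Adam(generator.parameters(), lr=1.0, betas=(0.9, 0.999), eps=1e-8)
d_opt = optim.Adam(discriminator.parameters(), lr=1.0, betas=(0.9, 0.999), eps=1e-8)

# With base lr = 1.0, LambdaLR uses the schedule value directly as the learning rate.
# max(step, 1) avoids evaluating step ** -power at step 0.
g_sched = optim.lr_scheduler.LambdaLR(
    g_opt, lambda step: noam_lr(4000, 1e-5, 6e-4, max(step, 1)))
d_sched = optim.lr_scheduler.LambdaLR(
    d_opt, lambda step: noam_lr(20000, 1e-5, 2e-4, max(step, 1)))

# In the training loop, call opt.step() followed by sched.step() every iteration.
```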
The weight between the adversarial losses $L_G$, $L_D$ and the reconstruction loss $L_R$ differs for each model; see `adv_ratio` in the configs below.
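One consistent reading of this weighting, written out in our notation (the exact composition is not shown in this listing), is

$$ \mathcal{L}_{\mathrm{gen}} = L_R + \lambda_{\mathrm{adv}} \, L_G, \qquad \mathcal{L}_{\mathrm{disc}} = \lambda_{\mathrm{adv}} \, L_D, $$

where $\lambda_{\mathrm{adv}}$ is the `adv_ratio` (4.0 for the NHV models, 0.1 for b-NSF-adv, and 0.4 for Parallel WaveGAN, as listed in the table).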
We used the following STFT loss definition in PyTorch:
```python
class STFTLoss(nn.Module):
    def __init__(self, fft_lengths, window_lengths, hop_lengths, loss_scale_type):
        """
        Multi-resolution STFT loss.
        fft_lengths: list of int
        window_lengths: list of int
        hop_lengths: list of int
        loss_scale_type: str defining the scale of the loss
        """
        super(STFTLoss, self).__init__()
        self.fft_lengths = fft_lengths
        self.window_lengths = window_lengths
        self.hop_lengths = hop_lengths
        self.loss_scale_type = loss_scale_type

    def forward(self, x, y):
        """
        x: FloatTensor [Batch, 1, T]
        y: FloatTensor [Batch, 1, T]
        returns: FloatTensor [] as total loss
        """
        x, y = x.squeeze(1), y.squeeze(1)
        loss = 0.0
        batch_size = x.size(0)
        z = torch.cat([x, y], dim=0)  # [2 x Batch, T]
        for fft_length, window_length, hop_length in zip(
                self.fft_lengths, self.window_lengths, self.hop_lengths):
            # Keep the analysis window on the same device as the signals.
            window = torch.hann_window(window_length, device=z.device)
            # return_complex=False keeps real/imaginary parts in the last dimension.
            Z = torch.stft(z, fft_length, hop_length, window_length, window,
                           return_complex=False)  # [2 x Batch, Freq, Frame, 2]
            SquareZ = Z.pow(2).sum(dim=-1) + 1e-10  # power spectrogram [2 x Batch, Freq, Frame]
            SquareX, SquareY = SquareZ.split(batch_size, dim=0)
            MagZ = SquareZ.sqrt()
            MagX, MagY = MagZ.split(batch_size, dim=0)
            if self.loss_scale_type == "log_linear":
                # L1 on linear magnitudes plus L1 on log magnitudes
                # (0.5 * log power == log magnitude).
                loss += (MagX - MagY).abs().mean() \
                    + 0.5 * (SquareX.log() - SquareY.log()).abs().mean()
            elif self.loss_scale_type == "linear":
                loss += (MagX - MagY).abs().mean()
            else:
                raise RuntimeError(f"Unrecognized loss scale type {self.loss_scale_type}")
        return loss
```
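For reference, a construction matching the NHV settings in the table below (window sizes as listed there, FFT lengths twice the window sizes, hops one quarter of the window sizes); the input tensors are dummies:

```python
window_lengths = [128, 256, 384, 512, 640, 768, 896, 1024, 1536, 2048, 3072, 4096]
stft_loss = STFTLoss(
    fft_lengths=[2 * w for w in window_lengths],
    window_lengths=window_lengths,
    hop_lengths=[w // 4 for w in window_lengths],
    loss_scale_type="log_linear",
)

# Dummy generated and reference waveforms, [Batch, 1, T].
x_hat = torch.randn(2, 1, 22050)
x_ref = torch.randn(2, 1, 22050)
print(stft_loss(x_hat, x_ref))  # scalar tensor
```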
The hn-sinc-NSF model used in the evaluation was trained for 49 epochs; training took about 3 days on a single 2080Ti.

Models | NHV(cGAN) | NHV(GAN) | b-NSF-adv | Parallel WaveGAN | DDSP(S+N) | DDSP(S+N, cGAN) |
---|---|---|---|---|---|---|
G Learning Rate | Noam LR Scheduling, warm-up steps = 4000, initial LR = 0.0006, min LR = 0.00001 | Same as NHV. | Same as NHV, except that the initial LR is 0.001. | Same as b-NSF-adv. | Same as NHV. | Same as NHV. |
D Learning Rate | Noam LR Scheduling, warm-up steps = 20000, initial LR = 0.0002, min LR = 0.00001 | Same as NHV. | Same as NHV. | Same as NHV. | N/A | Same as NHV. |
Optimizers | Default Adam in PyTorch ($\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 1\times 10^{-8}$). | Same as NHV. | Same as NHV. | Same as NHV. | Same as NHV. | Same as NHV. |
Reconstruction Loss $L_R$ | STFT loss with window sizes (128, 256, 384, 512, 640, 768, 896, 1024, 1536, 2048, 3072, 4096); sum of L1 and log-L1 amplitude distances (log_linear in the implementation); FFT lengths are twice the window sizes; window shifts are 1/4 of the window sizes. | Same as NHV. | Same as NHV, but with window sizes (16, 32, 64, 128, 256, 512, 1024, 2048) and the L1 amplitude distance only. | Same as b-NSF-adv. | Same as NHV. | Same as NHV. |
Weight of $L_R$ and $L_G$, $L_D$ | $L_G$ and $L_D$ weighted by 4.0 | Same as NHV. | $L_G$ and $L_D$ weighted by 0.1 | $L_G$ and $L_D$ weighted by 0.4 | N/A | Same as NHV. |
Generator Details | Described in our paper. Conditioned only on the log-mel spectrogram. The final FIR filter has length 1000 and an exponential decay rate of 0.995. | Same as NHV. | Same as described in the original paper. Conditioned only on the log-mel spectrogram. | Same as described in the original paper. A non-causal 30-layer WaveNet conditioned on log-F0, V/UV, and the log-mel spectrogram. | Network same as the NHV models. The parameters for the harmonic and noise components are the harmonic distributions and the noise filter FFT amplitudes. Outputs scaled by $\exp(\cdot)$. Harmonic distributions upsampled with Hann windows as described in the DDSP paper. | Same as DDSP(S+N). |
Discriminator Details | A non-causal WaveNet conditioned on the log-mel spectrogram. | A 10-layer CNN with kernel size 3 and strides (2, 2, 4, 2, 2, 2, 1, 1, 1, 1). Each layer is followed by a leaky ReLU activation with negative slope 0.2. No conditioning used. | Same as NHV(GAN). | Same as described in the original paper. A 10-layer CNN with kernel size 3 and dilations (1, 1, 2, 3, 4, 5, 6, 7, 8, 1). Each layer is followed by a leaky ReLU activation with negative slope 0.2. No conditioning used. | N/A | Same as NHV. |
Condition Upsampling | N/A | N/A | Basically the same as described in the original paper. BiLSTM + 1D Conv, both with 80 channels. The CNN kernel size is 3, and the CNN is followed by BatchNorm1d and a $\tanh$ activation. | Upsampled by two ConvTranspose2d layers. The first has kernel size (3, 16) with stride (1, 8); the second has kernel size (3, 32) with stride (1, 16), so the time dimension is upsampled by a factor of 128. Each layer is followed by a leaky ReLU activation with negative slope 0.4. | N/A | N/A |
F0 Upsampling | Repeat | Repeat | Repeat | N/A | Linear Interpolation | Linear Interpolation |
Adversarial Loss Formulation | Hinge version of the GAN objective (see the sketch after this table). | Same as NHV. | Same as NHV. | Same as NHV. | N/A | Same as NHV. |
Total Steps | 350K | 350K | 350K | 400K | 400K | 400K |
Training Time | About 2 days on a single 2080Ti | About 1.5 days on a single 2080Ti | About 2 days on a single 2080Ti | About 2 days on a single 2080Ti | Less than 1 day on a single 1080Ti | About 2 days on a single 2080Ti |
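
For completeness, the hinge version of the GAN objective referenced in the table corresponds to losses of the following form (a sketch in PyTorch; `d_real` and `d_fake` stand for discriminator outputs on real and generated waveforms):

```python
import torch
import torch.nn.functional as F

def hinge_d_loss(d_real, d_fake):
    # L_D: push scores on real audio above +1 and scores on generated audio below -1.
    return F.relu(1.0 - d_real).mean() + F.relu(1.0 + d_fake).mean()

def hinge_g_loss(d_fake):
    # L_G: the generator raises the discriminator's score on generated audio.
    return -d_fake.mean()
```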