Training Details of Models

In [15]:
from math import ceil
import numpy as np
import torch
import torch.nn as nn

Feature Extraction

In [16]:
#### Mel Spectrogram Analysis
analysis_stft_fft_length = 2048      # FFT size in samples
analysis_stft_window_length = 512    # analysis window length in samples
mel_length = 80                      # number of mel filterbank channels
mel_min_f0 = 40.0                    # lower frequency bound of the mel filterbank (Hz)
mel_max_f0 = 7600.0                  # upper frequency bound of the mel filterbank (Hz)

#### Sampling
sampling_rate = 22050                # audio sampling rate (Hz)
frame_length = 128                   # samples per frame (hop size), about 5.8 ms at 22050 Hz
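
For illustration only, a log-mel spectrogram with the constants above could be computed as follows. This is a hedged sketch, not the authors' exact feature pipeline: it assumes librosa is used, that the hop size equals frame_length, and "example.wav" is a placeholder path.

In [ ]:
import librosa

# Load audio at the target sampling rate (placeholder path).
waveform, _ = librosa.load("example.wav", sr=sampling_rate)
# Mel spectrogram with the analysis constants defined above.
mel = librosa.feature.melspectrogram(
    y=waveform,
    sr=sampling_rate,
    n_fft=analysis_stft_fft_length,
    hop_length=frame_length,
    win_length=analysis_stft_window_length,
    n_mels=mel_length,
    fmin=mel_min_f0,
    fmax=mel_max_f0,
)
log_mel = np.log(mel + 1e-10)  # [mel_length, Frames]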

F0 Extraction

Many thanks to the authors of the Python wrappers and of the open-source speech processing tools.

We used the following function to extract f0:

In [12]:
from pyworld import dio, stonemask
from pyreaper import reaper


def reaper_stonemask(double_x, frame_length, sampling_rate):
    """
    Extract frame-level F0 with REAPER, then refine it with WORLD's StoneMask.
    double_x: numpy double array [Samples]
    frame_length: int, number of samples in a single frame
    sampling_rate: int
    returns: numpy double array [Frames]
    """
    n_frames = len(double_x) // frame_length
    n_samples = n_frames * frame_length
    double_x = double_x[:n_samples]
    # REAPER expects 16-bit integer samples.
    int_x = np.clip(double_x * (65536 // 2), -32768, 32767).astype(np.int16)
    # Frame-center times in seconds.
    times = (np.arange(n_frames) + 0.5) * frame_length / sampling_rate
    _, _, f0_times, f0, _ = reaper(int_x, sampling_rate, minf0=40.0, maxf0=600.0)
    coarse_f0 = np.interp(times, f0_times, f0)
    # Refine the coarse F0 estimate on the float waveform.
    fine_f0 = stonemask(double_x, coarse_f0, times, sampling_rate)
    return fine_f0
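
A usage sketch of the function above, assuming soundfile is available and "example.wav" is a placeholder mono recording already at 22050 Hz (neither appears in the original notebook):

In [ ]:
import soundfile as sf

double_x, file_sr = sf.read("example.wav")  # float64 samples in [-1, 1]
assert file_sr == sampling_rate
f0 = reaper_stonemask(double_x, frame_length, sampling_rate)
print(f0.shape)  # one F0 value per frame of 128 samples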

Optimizer

We used the Adam optimizer with $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon=1\times 10^{-8}$ in all models, and applied weight normalization to all network weights. In all experiments we used the following Noam learning rate schedule, with different learning rates for the generators and the discriminators.

In [7]:
def noam_lr(warmup_steps, min_lr, init_lr, step, power=0.35):
    """
    Noam-style learning rate schedule.
    The learning rate rises linearly to init_lr over warmup_steps steps,
    then decays as (warmup_steps / step) ** power, floored at min_lr.
    step should be >= 1 to avoid a division by zero.
    """
    return np.maximum(
        init_lr * warmup_steps ** power * np.minimum(
            step * warmup_steps ** (-1 - power),
            step ** -power
        ),
        min_lr
    )
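
For concreteness, here is a minimal sketch (not the authors' training script) of how this schedule and weight normalization can be wired into PyTorch, using the NHV generator settings from the table below (warm-up steps = 4000, initial LR = 0.0006, min LR = 0.00001). The Conv1d is a hypothetical stand-in for a generator network.

In [ ]:
# Weight normalization applied to a placeholder module.
generator = nn.utils.weight_norm(nn.Conv1d(80, 1, kernel_size=3, padding=1))

# base lr is 1.0 so that LambdaLR uses the value returned by noam_lr directly.
optimizer_g = torch.optim.Adam(generator.parameters(), lr=1.0, betas=(0.9, 0.999), eps=1e-8)
scheduler_g = torch.optim.lr_scheduler.LambdaLR(
    optimizer_g,
    lr_lambda=lambda step: noam_lr(warmup_steps=4000, min_lr=1e-5, init_lr=6e-4, step=step + 1),
)
# In the training loop, call optimizer_g.step() and then scheduler_g.step() once per iteration.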

Losses

The relative weighting of the adversarial losses $L_G$, $L_D$ and the reconstruction loss $L_R$ differs between models (adv_ratio in our configurations); see the table below.

We used the following STFTLoss definition in PyTorch.

In [8]:
class STFTLoss(nn.Module):
    def __init__(self, fft_lengths, window_lengths, hop_lengths, loss_scale_type):
        """
        STFT Loss
        fft_lengths: list of int
        window_lengths: list of int
        hop_lengths: list of int
        loss_scale_type: str, either "log_linear" or "linear"
        """
        super(STFTLoss, self).__init__()
        self.fft_lengths = fft_lengths
        self.window_lengths = window_lengths
        self.hop_lengths = hop_lengths
        self.loss_scale_type = loss_scale_type
        
    def forward(self, x, y):
        """
        x: FloatTensor [Batch, 1, T]
        y: FloatTensor [Batch, 1, T]
        returns: FloatTensor [] as total loss
        """
        x, y = x.squeeze(1), y.squeeze(1)
        loss = 0.0
        batch_size = x.size(0)
        z = torch.cat([x, y], dim=0) # [2 x Batch, T]
        for fft_length, window_length, hop_length in zip(
                self.fft_lengths, self.window_lengths, self.hop_lengths):
            window = torch.hann_window(window_length, device=z.device)
            # Complex spectrogram [2 x Batch, Frequency, Frame]
            Z = torch.stft(z, fft_length, hop_length, window_length, window, return_complex=True)
            SquareZ = Z.real.pow(2) + Z.imag.pow(2) + 1e-10 # power spectrogram [2 x Batch, Frequency, Frame]
            SquareX, SquareY = SquareZ.split(batch_size, dim=0)
            MagZ = SquareZ.sqrt()
            MagX, MagY = MagZ.split(batch_size, dim=0)
            if self.loss_scale_type == "log_linear":
                # L1 distance on linear magnitudes plus L1 distance on log magnitudes
                # (0.5 * log power equals log magnitude).
                loss += (MagX - MagY).abs().mean() + 0.5 * (SquareX.log() - SquareY.log()).abs().mean()
            elif self.loss_scale_type == "linear":
                loss += (MagX - MagY).abs().mean()
            else:
                raise RuntimeError(f"Unrecognized loss scale type {self.loss_scale_type}")
        return loss
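
As a quick usage example (with placeholder waveforms, not real data), the loss can be instantiated with the NHV reconstruction-loss settings from the table below: FFT lengths twice the window sizes, hops a quarter of the window sizes, and the log_linear scale.

In [ ]:
window_lengths = [128, 256, 384, 512, 640, 768, 896, 1024, 1536, 2048, 3072, 4096]
stft_loss = STFTLoss(
    fft_lengths=[2 * w for w in window_lengths],
    window_lengths=window_lengths,
    hop_lengths=[w // 4 for w in window_lengths],
    loss_scale_type="log_linear",
)
x_fake = torch.randn(2, 1, 16384)  # placeholder generated waveforms [Batch, 1, T]
x_real = torch.randn(2, 1, 16384)  # placeholder reference waveforms [Batch, 1, T]
print(stft_loss(x_fake, x_real))   # scalar: sum of the losses over the 12 resolutions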

Table of Hyper-Parameters

  • We used the source code of NSF and changed only the sampling rate in its configuration. The hn-sinc-NSF model used in the evaluation was trained for 49 epochs; training took about 3 days on a single 2080 Ti.
  • NSF-noadv was trained without adversarial loss functions; training took less than 1 day on a single 2080 Ti.
Models compared: NHV(cGAN), NHV(GAN), b-NSF-adv, Parallel WaveGAN, DDSP(S+N), DDSP(S+N, cGAN). The hyper-parameters are listed per attribute below.

G Learning Rate
  • NHV(cGAN): Noam LR scheduling, warm-up steps = 4000, initial LR = 0.0006, min LR = 0.00001.
  • b-NSF-adv: Same as NHV, except that the initial LR is 0.001.
  • Parallel WaveGAN: Same as b-NSF-adv.
  • NHV(GAN), DDSP(S+N), DDSP(S+N, cGAN): Same as NHV.

D Learning Rate
  • NHV(cGAN): Noam LR scheduling, warm-up steps = 20000, initial LR = 0.0002, min LR = 0.00001.
  • DDSP(S+N): N/A.
  • NHV(GAN), b-NSF-adv, Parallel WaveGAN, DDSP(S+N, cGAN): Same as NHV.

Optimizers
  • All models: the default Adam in PyTorch, with $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 1\times 10^{-8}$.

Reconstruction Loss $L_R$
  • NHV(cGAN): STFT loss with window sizes (128, 256, 384, 512, 640, 768, 896, 1024, 1536, 2048, 3072, 4096); the sum of L1 and log-L1 amplitude distances (log_linear in the implementation above); FFT lengths are twice the window sizes; window shifts are 1/4 of the window sizes.
  • b-NSF-adv: Same as NHV, except that the window lengths are changed to (16, 32, 64, 128, 256, 512, 1024, 2048) and only the L1 amplitude distance is used.
  • Parallel WaveGAN: Same as b-NSF-adv.
  • NHV(GAN), DDSP(S+N), DDSP(S+N, cGAN): Same as NHV.

Weight of $L_R$ and $L_G$, $L_D$
  • NHV(cGAN): $L_G$ and $L_D$ weighted by 4.0.
  • b-NSF-adv: $L_G$ and $L_D$ weighted by 0.1.
  • Parallel WaveGAN: $L_G$ and $L_D$ weighted by 0.4.
  • DDSP(S+N): N/A.
  • NHV(GAN), DDSP(S+N, cGAN): Same as NHV.

Generator Details
  • NHV(cGAN): Described in our paper. Conditioned only on the log-mel spectrogram. The final FIR filter has length 1000 and an exponential decay rate of 0.995.
  • NHV(GAN): Same as NHV.
  • b-NSF-adv: Same as described in the original paper. Conditioned only on the log-mel spectrogram.
  • Parallel WaveGAN: Same as described in the original paper; a non-causal 30-layer WaveNet.
  • DDSP(S+N): Conditioned on log-F0, V/UV, and the log-mel spectrogram. The network is the same as in the NHV models. The predicted parameters for the harmonic and noise components are the harmonic distributions and the noise filter FFT amplitudes, with outputs scaled by $\exp(\cdot)$. Harmonic distributions are upsampled with Hann windows as described in the DDSP paper.
  • DDSP(S+N, cGAN): Same as DDSP(S+N).

Discriminator Details
  • NHV(cGAN): A non-causal WaveNet conditioned on the log-mel spectrogram.
  • NHV(GAN): A 10-layer CNN with kernel size 3 and strides (2, 2, 4, 2, 2, 2, 1, 1, 1, 1); each layer is followed by a leaky ReLU activation with negative slope 0.2; no conditioning is used.
  • b-NSF-adv: Same as NHV(GAN).
  • Parallel WaveGAN: Same as described in the original paper; a 10-layer CNN with kernel size 3 and dilations (1, 1, 2, 3, 4, 5, 6, 7, 8, 1); each layer is followed by a leaky ReLU activation with negative slope 0.2; no conditioning is used.
  • DDSP(S+N): N/A.
  • DDSP(S+N, cGAN): Same as NHV.

Condition Upsampling
  • NHV(cGAN), NHV(GAN): N/A.
  • b-NSF-adv: Basically the same as described in the original paper: a BiLSTM followed by a 1D convolution, both with 80 channels; the convolution has kernel size 3 and is followed by BatchNorm1d and a $\tanh$ activation.
  • Parallel WaveGAN: Upsampled by two ConvTranspose2d layers; the first has kernel size (3, 16) with strides (1, 8), the second has kernel size (3, 32) with strides (1, 16), so the time dimension is upsampled by a factor of 128. Each layer is followed by a leaky ReLU activation with negative slope 0.4.
  • DDSP(S+N), DDSP(S+N, cGAN): N/A.

F0 Upsampling
  • NHV(cGAN), NHV(GAN), b-NSF-adv: Repeat.
  • Parallel WaveGAN: N/A.
  • DDSP(S+N), DDSP(S+N, cGAN): Linear interpolation.

Adversarial Loss Formulation
  • NHV(cGAN): Hinge version of the GAN objective.
  • DDSP(S+N): N/A.
  • NHV(GAN), b-NSF-adv, Parallel WaveGAN, DDSP(S+N, cGAN): Same as NHV.

Total Steps
  • NHV(cGAN), NHV(GAN), b-NSF-adv: 350K.
  • Parallel WaveGAN, DDSP(S+N), DDSP(S+N, cGAN): 400K.

Training Time
  • NHV(cGAN): about 2 days on a single 2080 Ti.
  • NHV(GAN): about 1.5 days on a 2080 Ti.
  • b-NSF-adv: about 2 days on a 2080 Ti.
  • Parallel WaveGAN: about 2 days on a 2080 Ti.
  • DDSP(S+N): less than 1 day on a 1080 Ti.
  • DDSP(S+N, cGAN): about 2 days on a 2080 Ti.
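
The adversarial losses follow the hinge formulation listed under "Adversarial Loss Formulation" above. A minimal sketch of the hinge objectives, where d_real and d_fake are hypothetical discriminator outputs on real and generated audio:

In [ ]:
def hinge_discriminator_loss(d_real, d_fake):
    # Push D(x) above +1 for real audio and D(x_hat) below -1 for generated audio.
    return torch.relu(1.0 - d_real).mean() + torch.relu(1.0 + d_fake).mean()

def hinge_generator_loss(d_fake):
    # The generator tries to raise D(x_hat).
    return -d_fake.mean()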