Computational Complexity of Vocoders

We report the number of floating-point operations (FLOPs) required to generate each sample of 22050 Hz audio for several neural vocoders; multiplying by the sample rate gives the cost of one second of audio. We do not report real-time factors, as they are implementation dependent.

In these calculations, we do not consider the complexity of pointwise mathematical functions, feature upsampling, or source signal generation. (Bias additions are included in the convolution helpers below.)

An $N$-point FFT is assumed to cost $5N\log_2 N$ FLOPs.

In [2]:
from math import log2

# FLOPs of an N-point FFT, assuming 5 * N * log2(N).
fft = lambda x: (5 * x * log2(x))

def conv(k, inc, outc, group=1):
    # Per-output-sample FLOPs of a (grouped) 1-D convolution with kernel
    # size k: one multiply-add pair per weight, plus outc bias additions.
    return ((inc / group) * (outc / group) * group * k * 2 + outc)

def conv_transpose(k, inc, outc):
    # Per-input-frame FLOPs of a 1-D transposed convolution with kernel
    # size k: one multiply-add pair per weight, plus bias additions.
    return (inc * outc * k * 2 + k * outc)

# A complex multiplication takes 4 real multiplies and 2 real adds.
complex_multiply = lambda x: x * 6
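
As a quick sanity check (an illustrative cell, not part of the original accounting), the helpers can be evaluated directly; for example, a kernel-size-3 convolution from 64 to 128 channels costs $64 \cdot 128 \cdot 3 \cdot 2 + 128 = 49280$ FLOPs per sample.

In [ ]:
# Example: per-sample cost of a 3x1 convolution mapping 64 -> 128 channels.
print(conv(3, 64, 128))  # 49280.0 FLOPs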

b-NSF

b-NSF is essentially five 10-layer WaveNets stacked together. Each WaveNet has 64 residual channels and 128 skip channels.

In [3]:
# Per layer: gate conv (64 -> 2x64 channels), gating multiply, residual and
# skip 1x1 convs, then residual/skip additions; five stacks of 10 layers.
b_nsf_total = (conv(1, 1, 64) + 10 * (conv(3, 64, 128) + 64 + conv(1, 64, 64) + conv(1, 64, 128) + 64 + 128) + conv(2, 128, 2)) * 5
print(f"b-NSF costs {b_nsf_total/10**6:0.0f} * 10^6 FLOPs per sample.")
b-NSF costs 4 * 10^6 FLOPs per sample.
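
Multiplying the per-sample cost by the sample rate gives the cost of one second of audio (an illustrative conversion using the total above):

In [ ]:
# Per-second cost: per-sample FLOPs times 22050 samples per second.
print(f"b-NSF costs {b_nsf_total * 22050 / 10**9:0.0f} GFLOPs per second of audio.")  # -> 82 GFLOPs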

Gaussian WaveNet

The Gaussian WaveNet uses 128 skip channels, 64 residual channels, and 24 convolutional layers with a kernel size of 3.

In [4]:
# Same layer structure as b-NSF, with 24 layers and a 2-channel output
# (the mean and scale of the predicted Gaussian).
wavenet_total = (conv(1, 1, 64) + 24 * (conv(3, 64, 128) + 64 + conv(1, 64, 64) + conv(1, 64, 128) + 64 + 128) + conv(1, 128, 128) + conv(1, 128, 2))
print(f"Gaussian WaveNet costs {wavenet_total/10**6:0.0f} * 10^6 FLOPs per sample.")
Gaussian WaveNet costs 2 * 10^6 FLOPs per sample.

Parallel WaveGAN

Parallel WaveGAN is essentially a 30-layer WaveNet with 64 skip channels and 64 residual channels.

In [5]:
# 30 layers; residual and skip 1x1 convs both output 64 channels, and the
# output convs reduce the 64 skip channels to 1 waveform sample.
pwg_total = (conv(1, 1, 64) + 30 * (conv(3, 64, 128) + 64 + conv(1, 64, 64) + conv(1, 64, 64) + 64 + 64) + conv(1, 64, 64) + conv(1, 64, 1))
print(f"Parallel WaveGAN costs {pwg_total/10**6:0.0f} * 10^6 FLOPs per sample.")
Parallel WaveGAN costs 2 * 10^6 FLOPs per sample.

MelGAN

We use the same hyperparameters as reported in the paper.

In [6]:
# Each residual stack is 3 blocks of 2 convs (6 convs) plus 3 residual additions.
residual_stack = lambda n: (conv(3, n, n) * 6 + 3 * n)
# Upsampling factors are 8, 8, 2, 2 (256x in total). Each term is divided by
# the upsampling still ahead of it, expressing its cost per output sample.
melgan_total = conv(7, 80, 512) / (8 * 8 * 2 * 2) + \
    conv_transpose(16, 512, 256) / (8 * 8 * 2 * 2) + \
    residual_stack(256) / (8 * 2 * 2) + \
    conv_transpose(16, 256, 128) / (8 * 2 * 2) + \
    residual_stack(128) / (2 * 2) + \
    conv_transpose(4, 128, 64) / (2 * 2) + \
    residual_stack(64) / (2) + \
    conv_transpose(2, 64, 32) / (2) + \
    residual_stack(32) + \
    conv(7, 32, 1)
In [7]:
print(f"MelGAN costs {melgan_total/10**5:0.1f} * 10^5 FLOPs per sample.")
MelGAN costs 4.1 * 10^5 FLOPs per sample.

LPCNet

We use the formula given in the LPCNet paper; the first GRU is made sparse, with density $d$. $C=\left(3 d N_{A}^{2}+3 N_{B}\left(N_{A}+N_{B}\right)+2 N_{B} Q\right) \cdot 2 f_{s}$, where $d=0.1$, $N_{A}=384$, $N_{B}=16$, $Q=256$, and $f_s = 22050$. Since $C$ counts FLOPs for one second of audio, dividing by $f_s$ gives the per-sample cost.

In [8]:
# C / f_s: the factor of f_s cancels, leaving the per-sample cost.
lpc_total = (3 * 0.1 * 384 ** 2 + 3 * 16 * (384 + 16) + 2 * 16 * 256) * 2
print(f"LPCNet costs {lpc_total/10**5:0.1f} * 10^5 FLOPs per sample.")
LPCNet costs 1.4 * 10^5 FLOPs per sample.

NHV

Contribution of Convolutions

In [21]:
# Two identical conv stacks (one each for the harmonic and noise filters)
# run once per 128-sample frame; the middle convs use 8 groups.
nhv_convs = (conv(3, 80, 256) + conv(3, 256, 256, 8) * 2 + conv(3, 256, 222)) * 2 / 128
nhv_filters = 0  # accumulator for the filtering cost, computed step by step below

Contribution of Filters

We assume $N = 2048$, although $N = 1024$ would be good enough. The filtering cost is accumulated over the four steps below.
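
For reference, halving the FFT size would reduce each transform's cost from $112640$ to $51200$ FLOPs under the $5N\log_2 N$ assumption (an illustrative comparison):

In [ ]:
# Cost of a single FFT at each candidate size, in FLOPs.
print(fft(2048))  # 112640.0
print(fft(1024))  # 51200.0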

  1. The complex cepstra $\hat h_h$ and $\hat h_n$ are converted to the DFTs of $\tilde h_h$ and $\tilde h_n$. The point-wise exponential function is ignored.
In [22]:
# One 2048-point FFT each for the harmonic and noise filters, per 128-sample frame.
nhv_filters += fft(2048) * 2 / 128
  2. The DFTs of $p[n] w_L[n-mL]$ and $u[n] w_L[n-mL]$ are computed in each frame.
In [23]:
# One FFT each for the windowed pulse train and noise signal, per frame.
nhv_filters += fft(2048) * 2 / 128
  3. Convolution is performed in the frequency domain, followed by overlap-add.
In [24]:
# Pointwise complex product, one inverse FFT, and overlap-add additions, for both branches.
nhv_filters += (complex_multiply(2048) + fft(2048) + 2048) * 2 / 128
  4. Suppose the final FIR filter is implemented with block FFT convolution with a frame shift of 1024.
In [25]:
# Block convolution: forward and inverse FFTs, a complex product, and overlap-add per 1024-sample shift.
nhv_filters += (fft(2048) * 2 + complex_multiply(2048) + 2048) / 1024
In [26]:
nhv_total = nhv_convs + nhv_filters
print(f"NHV costs {nhv_total/10**4:0.1f} * 10^4 FLOPs per sample. NN accounts for {nhv_convs / (nhv_filters + nhv_convs) * 100:02.0f}% of total complexity.")
NHV costs 1.5 * 10^4 FLOPs per sample. NN accounts for 61% of total complexity.
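
For a side-by-side comparison (a convenience summary computed from the totals above, assuming the 22050 Hz sample rate throughout):

In [ ]:
# Summary: per-sample FLOPs and per-second GFLOPs for each vocoder.
totals = {
    "b-NSF": b_nsf_total,
    "Gaussian WaveNet": wavenet_total,
    "Parallel WaveGAN": pwg_total,
    "MelGAN": melgan_total,
    "LPCNet": lpc_total,
    "NHV": nhv_total,
}
for name, flops in totals.items():
    print(f"{name}: {flops:,.0f} FLOPs/sample, {flops * 22050 / 10**9:.2f} GFLOPs/s")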