Computational Complexity of Vocoders

We report the number of floating-point operations (FLOPs) required to generate each sample of 22050 Hz audio for several neural vocoders; multiplying by the sample rate gives the cost of one second of audio. We do not report real-time factors, as they are implementation dependent.

In these calculations, we do not consider the complexity of pointwise mathematical functions, feature upsampling, or source signal generation. (Bias additions are included in the convolution helpers below.)

An $N$-point FFT is assumed to cost $5N\log_2 N$ FLOPs.

In [2]:
from math import log2

# FLOPs of an N-point FFT, assuming 5 * N * log2(N).
fft = lambda x: (5 * x * log2(x))

def conv(k, inc, outc, group=1):
    # Per-output-sample FLOPs of a (grouped) 1-D convolution with kernel
    # size k: one multiply-add pair per weight, plus outc bias additions.
    return ((inc / group) * (outc / group) * group * k * 2 + outc)

def conv_transpose(k, inc, outc):
    # Per-input-frame FLOPs of a 1-D transposed convolution with kernel
    # size k: one multiply-add pair per weight, plus bias additions.
    return (inc * outc * k * 2 + k * outc)

# A complex multiplication takes 4 real multiplies and 2 real adds.
complex_multiply = lambda x: x * 6
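
As a quick sanity check (an illustrative cell, not part of the original accounting), the helpers can be evaluated directly; for example, a kernel-size-3 convolution from 64 to 128 channels costs $64 \cdot 128 \cdot 3 \cdot 2 + 128 = 49280$ FLOPs per sample.

In [ ]:
# Example: per-sample cost of a 3x1 convolution mapping 64 -> 128 channels.
print(conv(3, 64, 128))  # 49280.0 FLOPs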

b-NSF

b-NSF is essentially five 10-layer WaveNets stacked together. Each WaveNet has 64 residual channels and 128 skip channels.

In [3]:
# Per layer: gate conv (64 -> 2x64 channels), gating multiply, residual and
# skip 1x1 convs, then residual/skip additions; five stacks of 10 layers.
b_nsf_total = (conv(1, 1, 64) + 10 * (conv(3, 64, 128) + 64 + conv(1, 64, 64) + conv(1, 64, 128) + 64 + 128) + conv(2, 128, 2)) * 5
print(f"b-NSF costs {b_nsf_total/10**6:0.0f} * 10^6 FLOPs per sample.")
b-NSF costs 4 * 10^6 FLOPs per sample.
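
Multiplying the per-sample cost by the sample rate gives the cost of one second of audio (an illustrative conversion using the total above):

In [ ]:
# Per-second cost: per-sample FLOPs times 22050 samples per second.
print(f"b-NSF costs {b_nsf_total * 22050 / 10**9:0.0f} GFLOPs per second of audio.")  # -> 82 GFLOPs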

Gaussian WaveNet

The Gaussian WaveNet uses 128 skip channels, 64 residual channels, and 24 convolutional layers with a kernel size of 3.

In [4]:
# Same layer structure as b-NSF, with 24 layers and a 2-channel output
# (the mean and scale of the predicted Gaussian).
wavenet_total = (conv(1, 1, 64) + 24 * (conv(3, 64, 128) + 64 + conv(1, 64, 64) + conv(1, 64, 128) + 64 + 128) + conv(1, 128, 128) + conv(1, 128, 2))
print(f"Gaussian WaveNet costs {wavenet_total/10**6:0.0f} * 10^6 FLOPs per sample.")
Gaussian WaveNet costs 2 * 10^6 FLOPs per sample.

Parallel WaveGAN

Parallel WaveGAN is essentially a 30-layer WaveNet with 64 skip channels and 64 residual channels.

In [5]:
# 30 layers; residual and skip 1x1 convs both output 64 channels, and the
# output convs reduce the 64 skip channels to 1 waveform sample.
pwg_total = (conv(1, 1, 64) + 30 * (conv(3, 64, 128) + 64 + conv(1, 64, 64) + conv(1, 64, 64) + 64 + 64) + conv(1, 64, 64) + conv(1, 64, 1))
print(f"Parallel WaveGAN costs {pwg_total/10**6:0.0f} * 10^6 FLOPs per sample.")
Parallel WaveGAN costs 2 * 10^6 FLOPs per sample.

MelGAN

We use the same hyperparameters as reported in the paper.

In [6]:
# Each residual stack is 3 blocks of 2 convs (6 convs) plus 3 residual additions.
residual_stack = lambda n: (conv(3, n, n) * 6 + 3 * n)
# Upsampling factors are 8, 8, 2, 2 (256x in total). Each term is divided by
# the upsampling still ahead of it, expressing its cost per output sample.
melgan_total = conv(7, 80, 512) / (8 * 8 * 2 * 2) + \
    conv_transpose(16, 512, 256) / (8 * 8 * 2 * 2) + \
    residual_stack(256) / (8 * 2 * 2) + \
    conv_transpose(16, 256, 128) / (8 * 2 * 2) + \
    residual_stack(128) / (2 * 2) + \
    conv_transpose(4, 128, 64) / (2 * 2) + \
    residual_stack(64) / (2) + \
    conv_transpose(2, 64, 32) / (2) + \
    residual_stack(32) + \
    conv(7, 32, 1)
In [7]:
print(f"MelGAN costs {melgan_total/10**5:0.1f} * 10^5 FLOPs per sample.")
MelGAN costs 4.1 * 10^5 FLOPs per sample.

LPCNet

We use the formula given in the LPCNet paper; the first GRU is made sparse, with density $d$. $C=\left(3 d N_{A}^{2}+3 N_{B}\left(N_{A}+N_{B}\right)+2 N_{B} Q\right) \cdot 2 f_{s}$, where $d=0.1$, $N_{A}=384$, $N_{B}=16$, $Q=256$, and $f_s = 22050$. Since $C$ counts FLOPs for one second of audio, dividing by $f_s$ gives the per-sample cost.

In [8]:
# C / f_s: the factor of f_s cancels, leaving the per-sample cost.
lpc_total = (3 * 0.1 * 384 ** 2 + 3 * 16 * (384 + 16) + 2 * 16 * 256) * 2
print(f"LPCNet costs {lpc_total/10**5:0.1f} * 10^5 FLOPs per sample.")
LPCNet costs 1.4 * 10^5 FLOPs per sample.

NHV

Contribution of Convolutions

In [21]:
# Two identical conv stacks (one each for the harmonic and noise filters)
# run once per 128-sample frame; the middle convs use 8 groups.
nhv_convs = (conv(3, 80, 256) + conv(3, 256, 256, 8) * 2 + conv(3, 256, 222)) * 2 / 128
nhv_filters = 0  # accumulator for the filtering cost, computed step by step below

Contribution of Filters

We assume $N = 2048$, although $N = 1024$ would be good enough. The filtering cost is accumulated over the four steps below.
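
For reference, halving the FFT size would reduce each transform's cost from $112640$ to $51200$ FLOPs under the $5N\log_2 N$ assumption (an illustrative comparison):

In [ ]:
# Cost of a single FFT at each candidate size, in FLOPs.
print(fft(2048))  # 112640.0
print(fft(1024))  # 51200.0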

  1. The complex cepstra $\hat h_h$ and $\hat h_n$ are converted to the DFTs of $\tilde h_h$ and $\tilde h_n$. The point-wise exponential function is ignored.
In [22]:
# One 2048-point FFT each for the harmonic and noise filters, per 128-sample frame.
nhv_filters += fft(2048) * 2 / 128
  2. The DFTs of $p[n] w_L[n-mL]$ and $u[n] w_L[n-mL]$ are computed in each frame.
In [23]:
# One FFT each for the windowed pulse train and noise signal, per frame.
nhv_filters += fft(2048) * 2 / 128
  3. Convolution is performed in the frequency domain, followed by overlap-add.
In [24]:
# Pointwise complex product, one inverse FFT, and overlap-add additions, for both branches.
nhv_filters += (complex_multiply(2048) + fft(2048) + 2048) * 2 / 128
  4. Suppose the final FIR filter is implemented with block FFT convolution with a frame shift of 1024.
In [25]:
# Block convolution: forward and inverse FFTs, a complex product, and overlap-add per 1024-sample shift.
nhv_filters += (fft(2048) * 2 + complex_multiply(2048) + 2048) / 1024
In [26]:
nhv_total = nhv_convs + nhv_filters
print(f"NHV costs {nhv_total/10**4:0.1f} * 10^4 FLOPs per sample. NN accounts for {nhv_convs / (nhv_filters + nhv_convs) * 100:02.0f}% of total complexity.")
NHV costs 1.5 * 10^4 FLOPs per sample. NN accounts for 61% of total complexity.
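
For a side-by-side comparison (a convenience summary computed from the totals above, assuming the 22050 Hz sample rate throughout):

In [ ]:
# Summary: per-sample FLOPs and per-second GFLOPs for each vocoder.
totals = {
    "b-NSF": b_nsf_total,
    "Gaussian WaveNet": wavenet_total,
    "Parallel WaveGAN": pwg_total,
    "MelGAN": melgan_total,
    "LPCNet": lpc_total,
    "NHV": nhv_total,
}
for name, flops in totals.items():
    print(f"{name}: {flops:,.0f} FLOPs/sample, {flops * 22050 / 10**9:.2f} GFLOPs/s")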