We report the number of floating-point operations (FLOPs) per sample required to synthesize 22050 Hz audio with different neural vocoders. We do not report real-time factors, as they are implementation dependent. In the calculation, we do not consider the complexity of mathematical functions, network biases, feature upsampling, or source signal generation.
An $N$-point FFT is assumed to cost $5N\log_2 N$ FLOPs.
from math import log2

fft = lambda n: 5 * n * log2(n)  # an n-point FFT costs 5 * n * log2(n) FLOPs

# Per-output-sample cost of a 1D convolution: each multiply-add counts as
# 2 FLOPs, plus one addition per output channel.
def conv(k, inc, outc, group=1):
    return (inc / group) * (outc / group) * group * k * 2 + outc

# Per-input-sample cost of a 1D transposed convolution.
def conv_transpose(k, inc, outc):
    return inc * outc * k * 2 + k * outc

complex_multiply = lambda x: x * 6  # x complex multiplications cost 6x real FLOPs
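As a quick sanity check of the helpers (the values follow directly from the definitions above):

print(conv(3, 64, 128))  # 64 * 128 * 3 * 2 + 128 = 49280.0
print(fft(2048))         # 5 * 2048 * 11 = 112640.0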
b-NSF is essentially five 10-layer WaveNets stacked together. Each WaveNet has 64 residual channels and 128 skip channels.
# Each layer: a gated dilated conv (2 * 64 output channels), the gating
# multiplication (64), residual and skip 1x1 convs, and the residual and
# skip additions (64 + 128); times 5 for the 5 stacked WaveNets.
b_nsf_total = (conv(1, 1, 64) + 10 * (conv(3, 64, 128) + 64 + conv(1, 64, 64) + conv(1, 64, 128) + 64 + 128) + conv(2, 128, 2)) * 5
print(f"b-NSF costs {b_nsf_total/10**6:0.0f} * 10^6 FLOPs per sample.")
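The per-layer term in b_nsf_total can be inspected on its own (the name layer is introduced here only for illustration; the expression is the same as above):

layer = conv(3, 64, 128) + 64 + conv(1, 64, 64) + conv(1, 64, 128) + 64 + 128
print(f"Each WaveNet layer costs {layer/10**3:0.1f} * 10^3 FLOPs per sample.")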
The Gaussian WaveNet has 128 skip channels, 64 residual channels, and 24 convolution layers with a kernel size of 3.
wavenet_total = (conv(1, 1, 64) + 24 * (conv(3, 64, 128) + 64 + conv(1, 64, 64) + conv(1, 64, 128) + 64 + 128) + conv(1, 128, 128) + conv(1, 128, 2))
print(f"Gaussian WaveNet costs {wavenet_total/10**6:0.0f} * 10^6 FLOPs per sample.")
Parallel WaveGAN is essentially a 30-layer WaveNet with 64 skip channels and 64 residual channels.
pwg_total = (conv(1, 1, 64) + 30 * (conv(3, 64, 128) + 64 + conv(1, 64, 64) + conv(1, 64, 64) + 64 + 64) + conv(1, 64, 64) + conv(1, 64, 1))
print(f"Parallel WaveGAN costs {pwg_total/10**6:0.0f} * 10^6 FLOPs per sample.")
We use the hyperparameters reported in the MelGAN paper.
# A residual stack: 6 kernel-3 convolutions plus 3 residual additions (n adds each).
residual_stack = lambda n: (conv(3, n, n) * 6 + 3 * n)
# Each term is divided by the total upsampling between that layer's input and
# the final waveform, converting everything to FLOPs per output sample.
melgan_total = conv(7, 80, 512) / (8 * 8 * 2 * 2) + \
    conv_transpose(16, 512, 256) / (8 * 8 * 2 * 2) + \
    residual_stack(256) / (8 * 2 * 2) + \
    conv_transpose(16, 256, 128) / (8 * 2 * 2) + \
    residual_stack(128) / (2 * 2) + \
    conv_transpose(4, 128, 64) / (2 * 2) + \
    residual_stack(64) / (2) + \
    conv_transpose(2, 64, 32) / (2) + \
    residual_stack(32) + \
    conv(7, 32, 1)
print(f"MelGAN costs {melgan_total/10**5:0.1f} * 10^5 FLOPs per sample.")
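The divisors above are consistent with a 256-sample hop (an assumption implied by the upsampling factors, not stated in the snippet itself):

assert 8 * 8 * 2 * 2 == 256  # total upsampling equals the assumed 256-sample hop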
We use the formula given in the LPCNet paper, in which the first GRU is sparsified: $C=\left(3 d N_{A}^{2}+3 N_{B}\left(N_{A}+N_{B}\right)+2 N_{B} Q\right) \cdot 2 f_{s}$, where $d=0.1$, $N_{A}=384$, $N_{B}=16$, $Q=256$, and $f_s = 22050$. Dropping the $f_s$ factor gives the cost per sample.
lpc_total = (3 * 0.1 * 384 ** 2 + 3 * 16 * (384 + 16) + 2 * 16 * 256) * 2  # C / f_s
print(f"LPCNet costs {lpc_total/10**5:0.1f} * 10^5 FLOPs per sample.")
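Multiplying back by $f_s$ recovers the per-second cost $C$ from the formula above (plain arithmetic, included only as a cross-check):

print(f"LPCNet costs {lpc_total * 22050 / 10**9:0.1f} * 10^9 FLOPs per second.")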
# NN cost: the per-frame convolution cost (times 2), spread over the 128-sample frame shift.
nhv_convs = (conv(3, 80, 256) + conv(3, 256, 256, 8) * 2 + conv(3, 256, 222)) * 2 / 128
nhv_filters = 0
We assume a DFT size of $N = 2048$, although $N = 1024$ would suffice.
# Filter construction from the cepstra: two 2048-point FFTs per 128-sample
# frame, one line for each of the two filters.
nhv_filters += fft(2048) * 2 / 128
nhv_filters += fft(2048) * 2 / 128
# Filtering the two source signals: a frequency-domain complex multiplication,
# one FFT, and 2048 additions for overlap-add per frame, for each filter.
nhv_filters += (complex_multiply(2048) + fft(2048) + 2048) * 2 / 128
# The final FIR filter, assumed applied by FFT convolution with a 1024-sample hop.
nhv_filters += (fft(2048) * 2 + complex_multiply(2048) + 2048) / 1024
nhv_total = nhv_convs + nhv_filters
print(f"NHV costs {nhv_total/10**4:0.1f} * 10^4 FLOPs per sample. NN accounts for {nhv_convs / (nhv_filters + nhv_convs) * 100:02.0f}% of total complexity.")
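Finally, the per-sample totals above can be collected and scaled to one second of 22050 Hz audio (simple arithmetic on the variables already defined in the snippets above; no new figures are introduced):

totals = {"b-NSF": b_nsf_total, "Gaussian WaveNet": wavenet_total,
          "Parallel WaveGAN": pwg_total, "MelGAN": melgan_total,
          "LPCNet": lpc_total, "NHV": nhv_total}
for name, per_sample in totals.items():
    print(f"{name}: {per_sample:0.3g} FLOPs per sample, "
          f"{per_sample * 22050:0.3g} FLOPs per second")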