The importance of phase in speech has been noted in several books and papers. To obtain natural phase reconstruction, NHV uses the complex cepstrum so that neural networks can control the phase responses of the LTV filters. This notebook demonstrates that a major source of quality degradation in NHV-noadv, compared to the baseline NHV model, is unnatural phase in its harmonic component. We support this argument with STFT analysis and synthesis.
The artefacts caused by unnatural phase assumptions in speech, such as zero phase and minimum phase, are often described as muffledness or buzziness:
2.3 Excitation source design
The phase manipulation mentioned in the previous section allows the control of the excitation source waveform by phase manipulation. This is necessary, because while there is no degradation in speech quality caused by parameter manipulation using the STRAIGHT procedure, there is some initial degradation in speech quality under headphone listening when no temporal fine structure control is employed. Literature on phase effects in timbre perception has suggested that temporal fine-structure control, based on an all-pass filter design, provides the answer.
Speech Representation and Transformation using Adaptive Interpolation of Weighted Spectrum: Vocoder Revisited, Hideki KAWAHARA, 1997, Proc. ICASSP 1997, pp. 1303-1306, vol.2
..., when compared with its minimum-phase counterpart, the mixed-phase system produces a small but audible improvement in quality. When preferred, the mixed-phase system was judged by the listeners to reduce "buzziness" of the minimum-phase reconstruction. Minimum-phase reconstructions are always more "peaky" or less dispersive than their mixed-phase counterparts because the minimum-phase sequences have energy that is maximally compressed near the time origin, .... This peakiness may explain the apparent buzziness one hears in minimum-phase reconstructions when compared to their mixed-phase counterparts. One also hears a "muffled" quality in the minimum-phase system relative to the mixed-phase system, due perhaps to the removal of accurate timing and fine structure of speech events by replacing the original phase by its minimum-phase counterpart. These undesirable characteristics of the minimum-phase construction are further accentuated in the zero-phase construction.
Discrete-Time Speech Signal Processing: Principles and Practice, Thomas F. Quatieri, pp. 291-292
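The difference between the zero-phase and minimum-phase constructions discussed in the quotation can be illustrated with a small NumPy sketch. It is not part of NHV: the two-pole resonance, its parameters `r` and `theta`, and the FFT size are hypothetical values chosen for illustration. Both impulse responses share the same magnitude spectrum; the minimum-phase one is obtained by folding the real cepstrum onto non-negative quefrencies.

```python
import numpy as np

n_fft = 1024
omega = 2 * np.pi * np.arange(n_fft) / n_fft

# Toy magnitude response: one two-pole resonance (hypothetical values).
r, theta = 0.95, 0.3 * np.pi
z_inv = np.exp(-1j * omega)
magnitude = 1.0 / np.abs(1 - 2 * r * np.cos(theta) * z_inv + r**2 * z_inv**2)

# Zero phase: the spectrum is the real, even magnitude itself,
# so the impulse response is symmetric around the time origin.
zero_phase = np.fft.ifft(magnitude).real

# Minimum phase: fold the real cepstrum onto non-negative quefrencies,
# then exponentiate the resulting log-spectrum.
real_cepstrum = np.fft.ifft(np.log(magnitude)).real
fold = np.zeros(n_fft)
fold[0] = 1.0
fold[1 : n_fft // 2] = 2.0
fold[n_fft // 2] = 1.0
minimum_phase = np.fft.ifft(np.exp(np.fft.fft(real_cepstrum * fold))).real


def energy_fraction(h, start, stop):
    """Fraction of the total energy of h that lies in h[start:stop]."""
    return np.sum(h[start:stop] ** 2) / np.sum(h**2)


print("min-phase energy in first half :", energy_fraction(minimum_phase, 0, n_fft // 2))
print("zero-phase energy in first half:", energy_fraction(zero_phase, 0, n_fft // 2))
```

In this toy case the minimum-phase response is causal, with its energy concentrated right after the time origin, while the zero-phase response spreads its energy symmetrically over positive and negative time. The sketch says nothing about perceived quality; it only shows how the two constructions differ for one fixed magnitude spectrum.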
Replacing NHV-noadv's phase in the harmonic component with NHV's
We will replace the phase in NHV-noadv's harmonic component with the phase from the baseline model. By doing this, the degradation in NHV-noadv's reconstruction is significantly reduced.
The NHV model's configuration is slightly different from the one used in evaluation.
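The phase replacement itself does not depend on the NHV code base, so it can be sketched with NumPy alone. The `stft`, `istft` and `replace_phase` helpers below are hypothetical stand-ins for the notebook's `STFT` module, not its actual implementation, and the two toy signals (harmonic sums with identical amplitudes but different phases) are assumptions made for illustration.

```python
import numpy as np


def stft(x, n_fft=2048, hop=128):
    """Hann-windowed STFT with reflect padding; returns (frames, bins)."""
    window = np.hanning(n_fft + 1)[:-1]  # periodic Hann window
    padded = np.pad(x, n_fft // 2, mode="reflect")
    n_frames = 1 + (len(padded) - n_fft) // hop
    frames = np.stack([padded[i * hop : i * hop + n_fft] for i in range(n_frames)])
    return np.fft.rfft(frames * window, axis=1)


def istft(spec, n_fft=2048, hop=128):
    """Weighted overlap-add inverse of the stft helper above."""
    window = np.hanning(n_fft + 1)[:-1]
    frames = np.fft.irfft(spec, n=n_fft, axis=1) * window
    out = np.zeros(n_fft + hop * (len(frames) - 1))
    norm = np.zeros_like(out)
    for i, frame in enumerate(frames):
        out[i * hop : i * hop + n_fft] += frame
        norm[i * hop : i * hop + n_fft] += window**2
    out /= np.maximum(norm, 1e-8)
    # Drop the samples that were added by the reflect padding.
    return out[n_fft // 2 : len(out) - n_fft // 2]


def replace_phase(magnitude_from, phase_from):
    """Combine the STFT magnitude of one signal with the STFT phase of another."""
    magnitude = np.abs(stft(magnitude_from))
    phase = np.angle(stft(phase_from))
    return istft(magnitude * np.exp(1j * phase))


# Toy signals: same harmonic amplitudes, different harmonic phases.
fs, f0 = 22050, 210.0
t = np.arange(128 * 128) / fs
rng = np.random.default_rng(0)
harmonics = np.arange(1, int(fs / 2 / f0) + 1)
phases = rng.uniform(-np.pi, np.pi, size=len(harmonics))
a = np.cos(2 * np.pi * f0 * np.outer(harmonics, t)).sum(axis=0)
b = np.cos(2 * np.pi * f0 * np.outer(harmonics, t) + phases[:, None]).sum(axis=0)

hybrid = replace_phase(magnitude_from=b, phase_from=a)
rms_error_vs_a = np.sqrt(np.mean((hybrid - a) ** 2))
rms_error_vs_b = np.sqrt(np.mean((hybrid - b) ** 2))
print("RMS error vs a:", rms_error_vs_a)
print("RMS error vs b:", rms_error_vs_b)
```

Because the two toy signals have nearly identical STFT magnitudes, the hybrid should land much closer to `a`, the signal that supplied the phase. This is the same mechanism the notebook relies on when it keeps NHV-noadv's harmonic magnitude and borrows the baseline model's phase.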
from data.vocoder_data import NPYAudioLoader
from data.synthesis_data import NPYMelF0Loader
import torch
from math import ceil
from configs.nhv_config import NHVSynthConfig
from modules.vocoders.nhv.model import NHV
from IPython.display import Audio
from modules.vocoders.utils.cepstrum import complex_cepstrum_to_imp
from modules.vocoders.spectrogram_analyzer import STFT
from utils.plot import plot_waveforms
from modules.vocoders.source_generator import SourceGenerator
p = NHVSynthConfig.parameters()
p.update({
    "liftering": False,
    "envelop_max_quefrency": ceil(22050 / 1000 * 6),
    "noise_max_quefrency": ceil(22050 / 1000 * 6),
    "harmonic_only_reverb": True
})
nhv = NHV(**p)
another_p = NHVSynthConfig.parameters()
another_p.update({"liftering": True})
nhv_noadv = NHV(**another_p)
nhv_state = torch.load("./log/nhv.500000.honly", map_location=NHVSynthConfig.synthesis_target_device)
nhv.load_state_dict(nhv_state["model"])
state = torch.load("./log/nhv.400000.noadv", map_location=NHVSynthConfig.synthesis_target_device)
nhv_noadv.load_state_dict(state["model"])
nhv = nhv.eval()
nhv_noadv = nhv_noadv.eval()
audio_loader = NPYAudioLoader(
    NHVSynthConfig.frame_length,
    NHVSynthConfig.npy_files,
    NHVSynthConfig.train_file_test_function,
    NHVSynthConfig.file_to_idx
)
idx, x, f0 = audio_loader.sample_test()
x = x.view(1, 1, -1)
f0 = f0.view(1, 1, -1)
# Baseline NHV: decode the harmonic and noise components separately.
X = nhv.wave2spec(x)
imp, noise = nhv.decode_source(f0)
tract_imp, noise_imp = nhv.feature2imp(X, f0)
harmonic, noise = nhv.decode_filter(imp, noise, tract_imp, noise_imp)
# Reverb is added to the harmonic component only (harmonic_only_reverb=True).
a_harmonic = harmonic + nhv.add_reverb(harmonic)
a_noise = noise
a = a_harmonic + a_noise
# NHV-noadv: same input features, with reverb added to both components.
X = nhv.wave2spec(x)
imp, noise = nhv_noadv.decode_source(f0)
tract_imp, noise_imp = nhv_noadv.feature2imp(X, f0)
harmonic, noise = nhv_noadv.decode_filter(imp, noise, tract_imp, noise_imp)
b_harmonic = harmonic + nhv_noadv.add_reverb(harmonic)
b_noise = noise + nhv_noadv.add_reverb(noise)
b = b_harmonic + b_noise
# Keep NHV-noadv's harmonic STFT magnitude, but resynthesize it with NHV's harmonic phase.
stft = STFT(filter_length=2048, hop_length=128, win_length=2048, window='hann')
a_harmonic_mag, a_harmonic_phase = stft.transform(a_harmonic.squeeze(1))
b_harmonic_mag, b_harmonic_phase = stft.transform(b_harmonic.squeeze(1))
b_harmonic_with_phase_from_a_harmonic = stft.inverse(b_harmonic_mag, a_harmonic_phase)
# Original Speech
Audio(x.detach().squeeze().numpy(), rate=22050)
# Reconstruction of NHV model
Audio(a.detach().squeeze().numpy(), rate=22050)
# Reconstruction of NHV-noadv
Audio(b.detach().squeeze().numpy(), rate=22050)
# NHV-noadv's harmonic component with its phase replaced by NHV's, remixed with NHV-noadv's noise component
Audio((b_harmonic_with_phase_from_a_harmonic + b_noise).detach().squeeze().numpy(), rate=22050)
pt = 35000
fig = plot_waveforms([
    x.detach().squeeze().numpy()[pt: pt + 512],
    a.detach().squeeze().numpy()[pt: pt + 512],
    b.detach().squeeze().numpy()[pt: pt + 512],
    (b_harmonic_with_phase_from_a_harmonic + b_noise).detach().squeeze().numpy()[pt: pt + 512],
], [
    "Original",
    "NHV(GAN)",
    "NHV-noadv",
    "Remixed"
])
fig.savefig("remix.pdf", transparent=True, bbox_inches='tight', pad_inches=0.05)
The human auditory system is more sensitive to changes in phase when the pitch is low. The following experiment compares the influence of random phase on a 70 Hz impulse train and a 420 Hz impulse train. Although the amount of phase randomness is the same in both cases, the timbre difference at 70 Hz is much more audible.
sg = SourceGenerator(
    frame_length=128, fs=22050, harmonics=100, min_f0=10, max_f0=400)
f = torch.ones(1, 1, 22050) * 70
a = (sg.generate_harmonics(f, random_phase=0) / 20).squeeze().numpy()
b = (sg.generate_harmonics(f, random_phase=0.3 * 3.14) / 20).squeeze().numpy()
c = (sg.generate_harmonics(f * 6, random_phase=0) / 20).squeeze().numpy()
d = (sg.generate_harmonics(f * 6, random_phase=0.3 * 3.14) / 20).squeeze().numpy()
# 70 Hz zero-phase impulse-train
Audio(a, rate=22050, normalize=False)
# 70 Hz random-phase impulse-train
Audio(b, rate=22050, normalize=False)
# 420 Hz zero-phase impulse-train
Audio(c, rate=22050, normalize=False)
# 420 Hz random-phase impulse-train
Audio(d, rate=22050, normalize=False)
fig = plot_waveforms((a[:512], b[:512], c[:512], d[:512]), ("Zero-phase (70Hz)", "Random-phase (70Hz)", "Zero-phase (420Hz)", "Random-phase (420Hz)"), (0, 256, 512))
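For readers without access to `SourceGenerator`, the following is a minimal NumPy sketch of what a zero-phase and a random-phase harmonic train might look like. The `harmonic_train` and `crest_factor` helpers are hypothetical, and the phase model (one fixed offset per harmonic, drawn uniformly from `[-random_phase, random_phase]`) is an assumption; the notebook's generator may behave differently. The crest factor only quantifies how much the waveform shape changes, not how audible the change is.

```python
import numpy as np


def harmonic_train(f0, fs=22050, duration=1.0, max_harmonics=100,
                   random_phase=0.0, seed=0):
    """Sum of equal-amplitude cosine harmonics of f0 below the Nyquist frequency.

    Assumption: each harmonic receives one fixed phase offset drawn uniformly
    from [-random_phase, random_phase].
    """
    rng = np.random.default_rng(seed)
    n_harmonics = min(max_harmonics, int(fs / 2 / f0))
    k = np.arange(1, n_harmonics + 1)[:, None]
    t = np.arange(int(fs * duration))[None, :] / fs
    offsets = random_phase * rng.uniform(-1.0, 1.0, size=(n_harmonics, 1))
    return np.cos(2 * np.pi * f0 * k * t + offsets).sum(axis=0)


def crest_factor(x):
    """Ratio of peak magnitude to RMS level."""
    return np.max(np.abs(x)) / np.sqrt(np.mean(x**2))


results = {}
for f0 in (70.0, 420.0):
    zero = harmonic_train(f0)
    rand = harmonic_train(f0, random_phase=0.3 * np.pi)
    results[f0] = (crest_factor(zero), crest_factor(rand))
    print(f"{f0:.0f} Hz: crest factor zero-phase={results[f0][0]:.2f}, "
          f"random-phase={results[f0][1]:.2f}")
```

With zero phase all harmonics peak at the same instant, so the waveform is a sharp pulse train; the random offsets spread that peak out while leaving the magnitude spectrum unchanged. The listening comparison in the notebook is still needed to judge how audible this change is at each pitch.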