Importance of Phase in Speech¶

The importance of phase in speech has been noted in several books and papers. To obtain natural phase reconstruction, NHV uses the complex cepstrum, which lets neural networks control the phase responses of its LTV filters. This notebook shows that a major source of quality degradation in NHV-noadv, compared to the baseline NHV model, is unnatural phase in its harmonic component. We support this argument with STFT analysis and synthesis.
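As a rough illustration of why a complex cepstrum gives control over phase, the sketch below maps a cepstrum to an impulse response with plain NumPy. It is not the notebook's `complex_cepstrum_to_imp`; the function name `cepstrum_to_impulse`, the buffer size, and the "negative quefrencies first" layout are assumptions made for this example only. A cepstrum that is zero at negative quefrencies yields a minimum-phase response, while an even cepstrum with the same log-magnitude yields a zero-phase one.

```python
import numpy as np


def cepstrum_to_impulse(cepstrum: np.ndarray, n_fft: int = 1024) -> np.ndarray:
    """Map a complex cepstrum to an impulse response: h = IFFT(exp(FFT(c))).

    `cepstrum` has odd length and is ordered with negative quefrencies first,
    so its centre element is quefrency zero (a layout chosen for this sketch).
    """
    half = len(cepstrum) // 2
    buf = np.zeros(n_fft)
    buf[: half + 1] = cepstrum[half:]   # quefrencies 0 .. +half
    buf[-half:] = cepstrum[:half]       # quefrencies -half .. -1
    log_spectrum = np.fft.fft(buf)      # real part: log-magnitude, imaginary part: phase
    return np.fft.ifft(np.exp(log_spectrum)).real


rng = np.random.default_rng(0)
half = 16

# Causal cepstrum (zero at negative quefrencies): minimum-phase response.
causal = np.zeros(2 * half + 1)
causal[half + 1:] = 0.3 * rng.standard_normal(half) / np.arange(1, half + 1)

# Even part of the same cepstrum: identical log-magnitude, zero phase.
even = 0.5 * (causal + causal[::-1])

h_min = cepstrum_to_impulse(causal)
h_zero = cepstrum_to_impulse(even)
```

Both responses share one magnitude spectrum; only the odd part of the cepstrum, and therefore the phase, differs. A network that predicts both halves of the cepstrum can place the response anywhere between these cases.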

The artefacts caused by unnatural phase assumptions in speech, such as zero phase and minimum phase, are often described as muffledness or buzziness:

2.3 Excitation source design

The phase manipulation mentioned in the previous section allows the control of the excitation source waveform by phase manipulation. This is necessary, because while there is no degradation in speech quality caused by parameter manipulation using the STRAIGHT procedure, there is some initial degradation in speech quality under headphone listening when no temporal fine structure control is employed. Literature on phase effects in timbre perception has suggested that temporal fine-structure control, based on an all-pass filter design, provides the answer.

Speech Representation and Transformation using Adaptive Interpolation of Weighted Spectrum: Vocoder Revisited, Hideki KAWAHARA, 1997, Proc. ICASSP 1997, pp. 1303-1306, vol.2

..., when compared with its minimum-phase counterpart, the mixed-phase system produces a small but audible improvement in quality. When preferred, the mixed-phase system was judged by the listeners to reduce "buzziness" of the minimum-phase reconstruction. Minimum-phase reconstructions are always more "peaky" or less dispersive than their mixed-phase counterparts because the minimum-phase sequences have energy that is maximally compressed near the time origin, .... This peakiness may explain the apparent buzziness one hears in minimum-phase reconstructions when compared to their mixed-phase counterparts. One also hears a "muffled" quality in the minimum-phase system relative to the mixed-phase system, due perhaps to the removal of accurate timing and fine structure of speech events by replacing the original phase by its minimum-phase counterpart. These undesirable characteristics of the minimum-phase construction are further accentuated in the zero-phase construction.

Discrete-Time Speech Signal Processing: Principles and Practice, Thomas F. Quatieri, pp. 291-292
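The "peakiness" described in the quotation can be checked numerically. The sketch below builds a toy mixed-phase impulse response, derives its minimum-phase counterpart by folding the real cepstrum (the standard homomorphic method), and derives its zero-phase counterpart by discarding the phase. The toy response, the FFT size, and all function names are assumptions made for this example; none of them come from NHV.

```python
import numpy as np

N_FFT = 4096


def minimum_phase_counterpart(h, n_fft=N_FFT):
    """Minimum-phase sequence with the same magnitude spectrum as `h`,
    obtained by folding the real cepstrum onto positive quefrencies."""
    log_mag = np.log(np.abs(np.fft.fft(h, n_fft)) + 1e-12)
    real_cepstrum = np.fft.ifft(log_mag).real
    fold = np.zeros(n_fft)
    fold[0] = fold[n_fft // 2] = 1.0
    fold[1: n_fft // 2] = 2.0
    return np.fft.ifft(np.exp(np.fft.fft(real_cepstrum * fold))).real


def zero_phase_counterpart(h, n_fft=N_FFT):
    """Sequence with the same magnitude spectrum as `h` and zero phase."""
    return np.fft.ifft(np.abs(np.fft.fft(h, n_fft))).real


def peak(h):
    return np.max(np.abs(h))


# Toy mixed-phase response: a decaying resonance convolved with a
# time-reversed (maximum-phase) one.
n = np.arange(100)
causal_part = 0.9 ** n * np.cos(0.3 * n)
anticausal_part = (0.85 ** n * np.cos(0.7 * n))[::-1]
h_mixed = np.convolve(causal_part, anticausal_part)

h_min = minimum_phase_counterpart(h_mixed)
h_zero = zero_phase_counterpart(h_mixed)

# Energy accumulated up to each sample: the minimum-phase sequence is never behind.
energy_mixed = np.cumsum(h_mixed ** 2)
energy_min = np.cumsum(h_min ** 2)[: len(h_mixed)]

print("peak (mixed, min, zero):", peak(h_mixed), peak(h_min), peak(h_zero))
```

All three sequences carry the same total energy, so a larger peak means a more compressed waveform. The zero-phase version has the largest peak possible for a given magnitude spectrum, which is consistent with the remark that its artefacts are the most pronounced.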

Experiment: Replace `NHV-noadv`'s phase in the harmonic component with `NHV`'s¶

We will replace the phase in NHV-noadv's harmonic component with the phase from the baseline model. Doing so significantly reduces the degradation in NHV-noadv's reconstruction.
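The idea of the phase swap can be tried on synthetic signals before loading any model. The following is a minimal NumPy sketch, assuming a Hann-windowed STFT with weighted overlap-add resynthesis; the helper names `stft` and `istft`, the window sizes, and the two toy harmonic signals are illustrative choices and are not the `STFT` module used later in this notebook.

```python
import numpy as np


def stft(x, n_fft=1024, hop=256):
    """Hann-windowed STFT of a 1-D signal; returns (frames, bins)."""
    window = np.hanning(n_fft + 1)[:-1]          # periodic Hann
    x = np.pad(x, n_fft // 2, mode="reflect")    # centre frames on the signal
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop: i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=1)


def istft(spec, n_fft=1024, hop=256, length=None):
    """Weighted overlap-add inverse of `stft`."""
    window = np.hanning(n_fft + 1)[:-1]
    frames = np.fft.irfft(spec, n=n_fft, axis=1) * window
    out = np.zeros(hop * (len(frames) - 1) + n_fft)
    norm = np.zeros_like(out)
    for i, frame in enumerate(frames):
        out[i * hop: i * hop + n_fft] += frame
        norm[i * hop: i * hop + n_fft] += window ** 2
    out = out / np.maximum(norm, 1e-8)
    return out[n_fft // 2:][:length]             # drop the padding


def harmonic_signal(amps, phases, f0=200.0, fs=22050, n_samples=8192):
    """Sum of harmonics of f0 with the given amplitudes and phase offsets."""
    t = np.arange(n_samples) / fs
    k = np.arange(1, len(amps) + 1)
    return np.sum(
        amps[:, None] * np.cos(2 * np.pi * f0 * k[:, None] * t[None, :] + phases[:, None]),
        axis=0,
    )


k = np.arange(1, 21)
sig_a = harmonic_signal(1.0 / k, 0.05 * k ** 2)          # stands in for the "natural phase" signal
sig_b = harmonic_signal(1.0 / np.sqrt(k), np.zeros(20))  # zero-phase signal, different spectrum

spec_a, spec_b = stft(sig_a), stft(sig_b)

# Keep the magnitude of b, take the phase of a, and resynthesize.
hybrid_spec = np.abs(spec_b) * np.exp(1j * np.angle(spec_a))
hybrid = istft(hybrid_spec, length=len(sig_b))
```

A modified spectrogram of this kind is generally not the exact STFT of any waveform, so the resynthesized signal only approximates the requested magnitude and phase. The swap is most meaningful when both signals share the same pitch contour, as the two reconstructions in this notebook do.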

Loading Trained Models¶

The NHV model's configuration is slightly different from the one used in evaluation.

from data.vocoder_data import NPYAudioLoader
from data.synthesis_data import NPYMelF0Loader
import torch
from math import ceil
from configs.nhv_config import NHVSynthConfig
from modules.vocoders.nhv.model import NHV
from IPython.display import Audio
from modules.vocoders.utils.cepstrum import complex_cepstrum_to_imp
from modules.vocoders.spectrogram_analyzer import STFT
from utils.plot import plot_waveforms
from modules.vocoders.source_generator import SourceGenerator

p = NHVSynthConfig.parameters()
p.update({
    "liftering": False,
    "envelop_max_quefrency": ceil(22050 / 1000 * 6),
    "noise_max_quefrency": ceil(22050 / 1000 * 6),
    "harmonic_only_reverb": True
})
nhv = NHV(**p)
another_p = NHVSynthConfig.parameters()
another_p.update({"liftering": True})
nhv_noadv = NHV(**another_p)

nhv_state = torch.load("./log/nhv.500000.honly", map_location=NHVSynthConfig.synthesis_target_device)
nhv.load_state_dict(nhv_state["model"])
state = torch.load("./log/nhv.400000.noadv", map_location=NHVSynthConfig.synthesis_target_device)
nhv_noadv.load_state_dict(state["model"])
nhv = nhv.eval()
nhv_noadv = nhv_noadv.eval()

Randomly sample a sentence from the testset¶

audio_loader = NPYAudioLoader(
    NHVSynthConfig.frame_length,
    NHVSynthConfig.npy_files,
    NHVSynthConfig.train_file_test_function,
    NHVSynthConfig.file_to_idx
)

idx, x, f0 = audio_loader.sample_test()
x = x.view(1, 1, -1)
f0 = f0.view(1, 1, -1)

Reconstruct with NHV and NHV-noadv¶

# Reconstruction with the baseline NHV model
X = nhv.wave2spec(x)
imp, noise = nhv.decode_source(f0)
tract_imp, noise_imp = nhv.feature2imp(X, f0)
harmonic, noise = nhv.decode_filter(imp, noise, tract_imp, noise_imp)
a_harmonic = harmonic + nhv.add_reverb(harmonic)  # reverb on the harmonic component only
a_noise = noise
a = a_harmonic + a_noise

# Reconstruction with NHV-noadv, from the same spectrogram (computed with nhv.wave2spec)
X = nhv.wave2spec(x)
imp, noise = nhv_noadv.decode_source(f0)
tract_imp, noise_imp = nhv_noadv.feature2imp(X, f0)
harmonic, noise = nhv_noadv.decode_filter(imp, noise, tract_imp, noise_imp)
b_harmonic = harmonic + nhv_noadv.add_reverb(harmonic)  # reverb on both components
b_noise = noise + nhv_noadv.add_reverb(noise)
b = b_harmonic + b_noise

Replace the phase in NHV-noadv's harmonic component¶

stft = STFT(filter_length=2048, hop_length=128, win_length=2048, window='hann')

a_harmonic_mag, a_harmonic_phase = stft.transform(a_harmonic.squeeze(1))
b_harmonic_mag, b_harmonic_phase = stft.transform(b_harmonic.squeeze(1))

b_harmonic_with_phase_from_a_harmonic = stft.inverse(b_harmonic_mag, a_harmonic_phase)

Comparison¶

# Original Speech
Audio(x.detach().squeeze().numpy(), rate=22050)

# Reconstruction of NHV model
Audio(a.detach().squeeze().numpy(), rate=22050)

# Reconstruction of NHV-noadv
Audio(b.detach().squeeze().numpy(), rate=22050)

# NHV-noadv with the phase of its harmonic component replaced by NHV's, remixed with its noise component
Audio((b_harmonic_with_phase_from_a_harmonic + b_noise).detach().squeeze().numpy(), rate=22050)

pt = 35000
fig = plot_waveforms([
    x.detach().squeeze().numpy()[pt: pt + 512],
    a.detach().squeeze().numpy()[pt: pt + 512],
    b.detach().squeeze().numpy()[pt: pt + 512],
    (b_harmonic_with_phase_from_a_harmonic + b_noise).detach().squeeze().numpy()[pt: pt + 512],
], [
    "Original",
    "NHV(GAN)",
    "NHV-noadv",
    "Remixed"
])

fig.savefig("remix.pdf", transparent=True, bbox_inches='tight', pad_inches=0.05)

Experiment: Perception of relative phase difference between sinusoids¶

The human auditory system is more sensitive to changes in phase when the pitch is low. The following experiment compares the effect of random phase on a 70 Hz impulse train and on a 420 Hz impulse train. Although the amount of phase randomness is the same, the timbre difference is much more audible at 70 Hz.

sg = SourceGenerator(
    frame_length=128, fs=22050, harmonics=100, min_f0=10, max_f0=400)
f = torch.ones(1, 1, 22050) * 70
a = (sg.generate_harmonics(f, random_phase=0) / 20).squeeze().numpy()
b = (sg.generate_harmonics(f, random_phase=0.3 * 3.14) / 20).squeeze().numpy()
c = (sg.generate_harmonics(f * 6, random_phase=0) / 20).squeeze().numpy()
d = (sg.generate_harmonics(f * 6, random_phase=0.3 * 3.14) / 20).squeeze().numpy()

# 70 Hz zero-phase impulse-train
Audio(a, rate=22050, normalize=False)

# 70 Hz random-phase impulse-train
Audio(b, rate=22050, normalize=False)

# 420 Hz zero-phase impulse-train
Audio(c, rate=22050, normalize=False)

# 420 Hz random-phase impulse-train
Audio(d, rate=22050, normalize=False)

fig = plot_waveforms(
    (a[:512], b[:512], c[:512], d[:512]),
    ("Zero-phase (70Hz)", "Random-phase (70Hz)", "Zero-phase (420Hz)", "Random-phase (420Hz)"),
    (0, 256, 512),
)
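For readers without the repository at hand, the four test signals can be approximated with NumPy alone. This is a sketch under stated assumptions: `harmonic_source` is a hypothetical stand-in for `SourceGenerator.generate_harmonics`, whose internals are not shown in this notebook. It sums equal-amplitude cosines at up to 100 harmonics below the Nyquist frequency and gives each harmonic a fixed phase offset drawn uniformly from `[-random_phase, +random_phase]`.

```python
import numpy as np


def harmonic_source(f0, fs=22050, duration=0.5, max_harmonics=100,
                    random_phase=0.0, seed=0):
    """Sum of equal-amplitude cosines at the harmonics of `f0` below Nyquist.

    Each harmonic receives a constant phase offset drawn uniformly from
    [-random_phase, +random_phase]; random_phase=0 gives an impulse-like train.
    """
    rng = np.random.default_rng(seed)
    t = np.arange(int(fs * duration)) / fs
    k = np.arange(1, max_harmonics + 1)
    k = k[k * f0 < fs / 2]                      # keep harmonics below Nyquist
    phases = random_phase * rng.uniform(-1.0, 1.0, size=len(k))
    return np.cos(
        2 * np.pi * f0 * k[:, None] * t[None, :] + phases[:, None]
    ).sum(axis=0)


def crest_factor(x):
    """Peak-to-RMS ratio: a simple measure of how impulse-like a waveform is."""
    return np.max(np.abs(x)) / np.sqrt(np.mean(x ** 2))


low_zero = harmonic_source(70.0)
low_random = harmonic_source(70.0, random_phase=0.3 * np.pi)
high_zero = harmonic_source(420.0)
high_random = harmonic_source(420.0, random_phase=0.3 * np.pi)

for name, sig in [("70 Hz zero", low_zero), ("70 Hz random", low_random),
                  ("420 Hz zero", high_zero), ("420 Hz random", high_random)]:
    print(f"{name:>14}: crest factor = {crest_factor(sig):.2f}")
```

The crest factor only describes the waveform shape. It shows that phase randomization flattens the pulses at both pitches, but it says nothing about audibility, which still has to be judged by listening.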