AM/FM Modeling of Harmonic Sounds

This is a support page to deliver additional information and demonstrations related to the paper:
ANALYSIS/SYNTHESIS OF HARMONIC SOUNDS BASED ON AM/FM SCALING OF A PROTOTYPE SIGNAL
submitted to IEEE Transactions on Audio, Speech and Language Processing.

Abstract: The paper describes an enhancement of the harmonic sinusoidal modeling technique that extends its capability to efficiently represent sounds that are not purely tonal. Reconstruction accuracy is improved by introducing additional modulating components into the instantaneous frequencies and instantaneous amplitudes of the partials. For this purpose, a classical heterodyne analysis technique is combined with principal component analysis of partials. The sound is represented by a discrete harmonic envelope and a narrowband prototype signal that carries the modulations. We show by experiments that the model is capable of reproducing transients and a significant part of mechanical noise in musical sounds and yields good subjective quality at an SNR of 15 dB to 26 dB.

In the paper we propose a model for harmonic audio signals that may be used in object-based low-bit-rate coding. The high-level, low-resolution spectral characteristics of the signal are represented by a two-dimensional harmonic envelope (HE, shown below, middle), a complex-valued structure similar (but not identical) to the data delivered by Beauchamp's harmonic phase vocoder and Serra's harmonic sinusoidal model. High temporal resolution is provided by the second part of the model, the prototype signal (below, right).

Advantages: The fundamental advantage of the above representation is that both parts are much less complex than the original signal and can be encoded very efficiently. In the paper we describe the analysis and synthesis process in detail. On this web page we demonstrate that our model accurately represents the significant acoustic features of the sound: its pitch, texture, timbre, and tonal/noisy character. In a traditional sinusoidal model (SM) this would require a high data rate (a small time shift between consecutive frames, or an excessively high number of partials), which is prohibitive in compression applications. In our model the data rate is low (partial control parameters are typically subsampled 1:500 or 1:1000); however, the inclusion of the prototype signal (which is mostly narrowband) makes it possible to efficiently represent the dominant low-level fluctuations of the instantaneous parameters (instantaneous frequency, IF, and instantaneous amplitude, IA) that are responsible for the acoustic features mentioned above.

How it works: The core of the system is a nearly phase-coherent heterodyne analyzer (the big grey box in the scheme shown above) consisting of a bank of complex harmonic oscillators, individual multipliers, and lowpass filters. The harmonic oscillators operate at integer multiples of the instantaneous fundamental frequency (IF0[n]), so that each channel deals with a single harmonic partial and yields its baseband representation (the complex envelope) at the corresponding output. A single output signal for the 1st partial of the glissando sound is shown below, left.
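As a rough illustration, the analyzer described above might be sketched in Python as follows. This is not the code used in the paper: the function name `heterodyne_analyze`, the Hann-shaped lowpass filter, its length `lp_len`, and the synthetic two-partial test signal are all assumptions made for the example.

```python
import numpy as np

def heterodyne_analyze(x, if0, fs, num_partials, lp_len=201):
    """Return complex baseband envelopes, one row per harmonic partial.

    x      : real input signal
    if0    : instantaneous fundamental frequency in Hz, one value per sample
    fs     : sampling rate in Hz
    lp_len : length of the Hann lowpass kernel (an assumption of this sketch)
    """
    # Instantaneous phase of the fundamental: running sum of IF0[n].
    phi0 = 2.0 * np.pi * np.cumsum(if0) / fs
    lowpass = np.hanning(lp_len)
    lowpass /= lowpass.sum()                     # unit gain at 0 Hz
    envelopes = np.empty((num_partials, len(x)), dtype=complex)
    for k in range(1, num_partials + 1):
        # A complex oscillator at k times the fundamental shifts partial k to 0 Hz.
        shifted = x * np.exp(-1j * k * phi0)
        # The lowpass keeps the baseband term; the factor 2 restores the amplitude.
        envelopes[k - 1] = 2.0 * np.convolve(shifted, lowpass, mode="same")
    return envelopes

# Synthetic glissando with two partials (amplitudes 1.0 and 0.5, phase offset 0.3 rad).
fs = 8000
if0 = np.linspace(200.0, 300.0, fs)
phi0 = 2.0 * np.pi * np.cumsum(if0) / fs
x = 1.0 * np.cos(phi0) + 0.5 * np.cos(2 * phi0 + 0.3)
env = heterodyne_analyze(x, if0, fs, num_partials=2)
```

Away from the signal edges, the magnitude and angle of each row of `env` should approximate the amplitude and phase offset of the corresponding partial.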

Each of the complex-valued output signals from the analyzer is subsequently decimated (1:R). This low-resolution information is stored in the Harmonic Envelope (a matrix in which each row corresponds to one partial and each column to a decimated time sample). A single row of the HE obtained for the glissando sound is shown below, in the middle.

For each of the baseband signals, its high-frequency residual (the remainder after subtracting the upsampled version of the decimated signal) is obtained (above, right). The residuals for a single sound are subjected to Principal Component Analysis (PCA). The aim of this analysis is to identify and extract the common residual amplitude modulation that enhances the representation stored in the HE. PCA represents the collection of its input residual signals ak[n] as a linear combination of other signals, which we may consider a local orthogonal basis. Only one of these signals (corresponding to the principal eigenvector of the covariance matrix) is preserved. We denote it by a0[n]. In order to reverse the linear transformation in the decoder we also need a short vector of complex-valued scaling constants, G1.
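The decimation, residual, and PCA steps could be sketched as below. Several details are assumptions of this example, not statements about the paper's implementation: linear interpolation stands in for the upsampling filter, the residuals are treated as zero-mean, and a real-valued a0[n] is obtained by stacking the real and imaginary parts of the residuals before the eigen-decomposition. The helper names are hypothetical.

```python
import numpy as np

def residuals_after_decimation(envelopes, R):
    """Split each baseband signal into a decimated HE row and a high-frequency residual."""
    num_samples = envelopes.shape[1]
    n = np.arange(num_samples)
    coarse_n = n[::R]
    he = envelopes[:, ::R]                       # harmonic envelope, one row per partial
    upsampled = np.empty_like(envelopes)
    for k, row in enumerate(he):
        # Linear interpolation stands in for the upsampling filter (assumption).
        upsampled[k] = (np.interp(n, coarse_n, row.real)
                        + 1j * np.interp(n, coarse_n, row.imag))
    return he, envelopes - upsampled

def principal_component(residuals):
    """Return real a0[n] and complex gains g1 with residuals ~ outer(g1, a0)."""
    K = residuals.shape[0]
    stacked = np.vstack([residuals.real, residuals.imag])   # 2K x N, real-valued
    cov = stacked @ stacked.T                    # unnormalised covariance (zero mean assumed)
    _, eigvecs = np.linalg.eigh(cov)             # eigenvalues come in ascending order
    w = eigvecs[:, -1]                           # principal eigenvector
    a0 = w @ stacked                             # common modulation shared by all partials
    g1 = w[:K] + 1j * w[K:]                      # per-partial complex scaling constants
    return a0, g1

# Synthetic test: three partials sharing one fast modulation with different complex gains.
rng = np.random.default_rng(0)
N, R = 4001, 500
n = np.arange(N)
common = rng.standard_normal(N)
g_true = np.array([1.0 + 0.5j, -0.3 + 0.2j, 0.1 - 0.7j])
slow = np.outer([1.0, 0.5, 0.25], 1.0 + n / N)   # slowly varying part of each envelope
envelopes = slow + np.outer(g_true, common)
he, res = residuals_after_decimation(envelopes, R)
a0, g1 = principal_component(res)
```

In this idealized case the residuals are exactly rank one, so `np.outer(g1, a0)` reproduces `res`. For real sounds a single component captures only the dominant shared modulation.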

The final prototype signal is a product of amplitude and frequency modulation. The FM term of this signal is defined by IF0[n], and the AM term is defined by the real-valued PCA output offset by an arbitrary constant (b > max|a0[n]|) that prevents the amplitude from going below zero (over-modulation). This simple trick allows both terms to be separated during demodulation. An illustrative example below (left) shows the idea (not to scale). Real examples of the prototype IF and IA are also shown below (middle and right).
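A minimal sketch of how such a prototype might be assembled is given below. The function name `build_prototype` and the `margin` parameter used to pick b are hypothetical. The phase is taken as the running sum of IF0[n], which is an assumption about the discretization.

```python
import numpy as np

def build_prototype(a0, if0, fs, margin=1.1):
    """Real AM/FM prototype: amplitude b + a0[n], phase from the running sum of IF0[n]."""
    # Any b > max|a0[n]| keeps the amplitude positive; margin > 1 is one arbitrary choice.
    b = margin * np.max(np.abs(a0))
    phi0 = 2.0 * np.pi * np.cumsum(if0) / fs     # FM term
    amplitude = b + a0                           # AM term, strictly positive
    return amplitude * np.cos(phi0), b

# Example: a 30 Hz amplitude fluctuation carried on a constant 220 Hz fundamental.
fs = 8000
n = np.arange(fs)
a0 = 0.2 * np.sin(2.0 * np.pi * 30.0 * n / fs)
if0 = np.full(fs, 220.0)
prototype, b = build_prototype(a0, if0, fs)
```

Because `b + a0` never changes sign, the envelope of `prototype` equals the AM term, and the sign changes of the waveform are caused by the FM term alone.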

Reconstruction is a straightforward process. The reconstructed sound is generated by means of additive synthesis, as in a normal sinusoidal model. Each partial is synthesized as a product of AM and FM modulations and is composed of two complex exponentials. One exponential is the upsampled corresponding row of the HE. The second exponential is obtained by appropriate scaling of the IF and IA recovered from the prototype. The IF is scaled simply by the partial order, k. The IA is scaled by the appropriate element of the scaling vector, G1. This vector is obtained during PCA in the encoder (see above).
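The sketch below shows one possible reading of this synthesis step. It assumes that the baseband signal of partial k is restored additively, as the upsampled HE row plus G1[k]·a0[n], which follows from the residual definition given earlier. The exact way the paper combines the two terms may differ. Linear interpolation again stands in for the upsampling filter, and all function names are hypothetical.

```python
import numpy as np

def upsample_rows(he, R, num_samples):
    """Linearly interpolate each HE row back to the full rate (assumed upsampler)."""
    n = np.arange(num_samples)
    coarse_n = n[::R]                            # he must have len(coarse_n) columns
    out = np.empty((he.shape[0], num_samples), dtype=complex)
    for k, row in enumerate(he):
        out[k] = (np.interp(n, coarse_n, row.real)
                  + 1j * np.interp(n, coarse_n, row.imag))
    return out

def synthesize(he, R, phi0, a0, g1):
    """Additive synthesis from the HE, the prototype phase and IA, and the gains g1."""
    slow = upsample_rows(he, R, len(phi0))
    y = np.zeros(len(phi0))
    for k in range(1, he.shape[0] + 1):
        baseband = slow[k - 1] + g1[k - 1] * a0  # restored complex envelope (assumption)
        y += np.real(baseband * np.exp(1j * k * phi0))   # phase scaled by partial order k
    return y

# Example: two partials with constant HE rows; only the first carries the modulation.
fs, N, R = 8000, 4001, 500
n = np.arange(N)
phi0 = 2.0 * np.pi * 220.0 * n / fs
a0 = 0.1 * np.sin(2.0 * np.pi * 30.0 * n / fs)
he = np.array([[1.0] * 9, [0.5j] * 9])
g1 = np.array([1.0, 0.0])
y = synthesize(he, R, phi0, a0, g1)
```

With these inputs, `y` equals `(1 + a0) * cos(phi0) - 0.5 * sin(2 * phi0)`, which makes the roles of the HE rows, the scaled IA, and the scaled phase easy to check by hand.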

For this purpose, the IF and IA have to be recovered from the prototype signal. This can be done quite reliably with the standard Gabor method based on the Hilbert transform, since the prototype is narrowband and mono-component. More precisely, we do not estimate the instantaneous frequency (IF) itself, which would involve phase unwrapping. To synthesize the partials we simply measure the instantaneous phase of the prototype, φ0[n], and scale it by k = 1, 2, ..., Kmax.
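The demodulation can be illustrated with a short NumPy sketch. The analytic signal is computed here with a plain FFT-based Hilbert transformer, and the test prototype is synthetic. Both are assumptions of the example, not the authors' implementation. Note that scaling the wrapped phase by an integer k needs no unwrapping, because exp(j·k·φ) is unchanged when φ shifts by a multiple of 2π.

```python
import numpy as np

def analytic_signal(x):
    """Gabor analytic signal: zero the negative-frequency half of the spectrum."""
    N = len(x)
    h = np.zeros(N)
    h[0] = 1.0                                   # keep DC once
    if N % 2 == 0:
        h[N // 2] = 1.0                          # keep the Nyquist bin once
        h[1:N // 2] = 2.0                        # double the positive frequencies
    else:
        h[1:(N + 1) // 2] = 2.0
    return np.fft.ifft(np.fft.fft(x) * h)

# Synthetic prototype: offset AM term (b + a0[n]) on a 220 Hz carrier.
fs = 8000
n = np.arange(fs)
b = 0.5
a0 = 0.2 * np.sin(2.0 * np.pi * 30.0 * n / fs)
phi0 = 2.0 * np.pi * 220.0 * n / fs
prototype = (b + a0) * np.cos(phi0)

z = analytic_signal(prototype)
ia = np.abs(z)                                   # instantaneous amplitude, b + a0[n]
phase = np.angle(z)                              # wrapped instantaneous phase
a0_hat = ia - b                                  # AM term after removing the offset

# Carrier of the 3rd partial, built directly from the wrapped phase.
k = 3
carrier_k = np.exp(1j * k * phase)
```

In this example the carrier and the modulation fall on exact FFT bins, so `a0_hat` matches `a0` to numerical precision. Real prototypes will show some error near the signal edges.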

For demonstrations, please select an item from the menu on the left.

PLEASE NOTE: THESE PAGES ARE STILL UNDER CONSTRUCTION