On Spectral Resolution in Audio Compression Systems

Сучилин Владимир Александрович — Mon, 13 Jan 2020 04:28:11 +0000

Introduction

The features of the perception of audio information by the human ear are well known [1]. One of these features is the inability of human hearing to distinguish rather weak sounds in the presence of a more intense tone nearby. In psychoacoustics, this is called frequency masking. In the frequency domain, it appears when two harmonic oscillations are perceived simultaneously within a restricted frequency range called the masking area [2]. The choice of this area should be tied to the resolution limit of auditory perception, that is, the ability to distinguish between two individual but close pitches. Empirically, below 1000 Hz, normal human hearing perceives a frequency deviation of less than 3 Hz (down to 1.5 Hz). Above 1000 Hz, this resolution can be estimated as follows [2]:

ŝ ≈ 0.0035 f, (∀ f > 1 kHz) (1)

In general, it makes no sense to speak of frequency masking if the tones are indistinguishable to human hearing. Therefore, the interval [f − ŝ; f + ŝ] can be considered the genuine masking area.
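As an illustration, estimate (1) and the resulting masking area can be sketched in Python. The function names and the flat 3 Hz value used at and below 1 kHz are assumptions made for this sketch (the text only states that the perceived deviation below 1000 Hz is less than 3 Hz); they are not part of the cited model.

```python
def auditory_resolution_hz(f_hz):
    """Rough estimate of the auditory frequency resolution at f_hz (in Hz)."""
    if f_hz > 1000.0:
        # Equation (1): valid only above 1 kHz.
        return 0.0035 * f_hz
    # Assumption of this sketch: a flat 3 Hz bound at and below 1 kHz.
    return 3.0


def masking_area_hz(f_hz):
    """Return the interval [f - s; f + s] around a tone at f_hz."""
    s = auditory_resolution_hz(f_hz)
    return (f_hz - s, f_hz + s)


if __name__ == "__main__":
    for f in (500.0, 2000.0, 10000.0):
        low, high = masking_area_hz(f)
        print(f"f = {f:7.1f} Hz: masking area [{low:.1f}; {high:.1f}] Hz")
```

Under these assumptions, the masking area widens linearly with frequency above 1 kHz, which is why a fixed spectral resolution matters more at some frequencies than at others.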

The use of frequency masking with the DFT

To use frequency masking, the digitized audio signal is transformed into the frequency domain by means of the DFT [3]-[4]. The DFT converts N consecutive samples of the digital signal {x_n} into N pairs of coefficients of the complex spectrum S(k), which represents the signal in the frequency domain:

Re[S(k)] = \sum_{n=0}^{N-1} x_n \cos(2\pi k n / N) (2)

Im[S(k)] = -\sum_{n=0}^{N-1} x_n \sin(2\pi k n / N)

Next, the amplitude spectrum of the signal is formed from these coefficients. Because the amplitude spectrum (i.e. the spectrogram) is symmetric about its center, only the first N/2 values are used for the analysis:

|A(k)| = \sqrt{ (Re[S(k)])^2 + (Im[S(k)])^2 }, k = 1…Ñ (3)

where Ñ = (N+1)/2 for odd N, and Ñ = N/2 for even N.
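The computation of the amplitude spectrum can be sketched as follows. This is a minimal illustration that evaluates the DFT sums directly, assuming the usual unnormalized DFT convention (an assumption of this sketch; a different scaling only multiplies all amplitudes by a constant). The function name is hypothetical, and a practical encoder would use an FFT rather than this O(N²) loop.

```python
import math


def dft_amplitude_spectrum(x):
    """Return the amplitudes |A(k)| for k = 1..Ñ of the samples in x."""
    n_samples = len(x)
    # Ñ = (N + 1) / 2 for odd N and N / 2 for even N; integer division
    # covers both cases.
    n_half = (n_samples + 1) // 2
    amplitudes = []
    for k in range(1, n_half + 1):
        # Real and imaginary parts of S(k), computed as direct sums.
        re = sum(x[n] * math.cos(2 * math.pi * k * n / n_samples)
                 for n in range(n_samples))
        im = -sum(x[n] * math.sin(2 * math.pi * k * n / n_samples)
                  for n in range(n_samples))
        amplitudes.append(math.hypot(re, im))
    return amplitudes


if __name__ == "__main__":
    # A pure cosine that completes exactly 3 periods over 16 samples.
    n_points = 16
    signal = [math.cos(2 * math.pi * 3 * n / n_points) for n in range(n_points)]
    for k, a in enumerate(dft_amplitude_spectrum(signal), start=1):
        print(f"k = {k}: |A(k)| = {a:.3f}")
```

For the test tone in the usage example, all of the energy falls into the single spectral line k = 3, while the other lines are zero up to rounding error.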

Thus the DFT yields Ñ samples of the signal spectrum in the range from zero to half of the sampling rate. This spectrum, in contrast to the “continuous” spectrum of the signal, is discrete and approaches the true spectrum more closely as N increases. In digital audio compression, N is always finite, so it is reasonable to take into account the spectral resolution of the DFT, which is given by the following equality, where F_s is the sampling rate:

ȗ = F_s/ N (4)

For an audio spectrogram, the minimum size of the masking area must be chosen in view of the actual spectral resolution, since frequency masking relies on comparing the values at two adjacent frequencies of the spectrogram. This means that frequency masking can be used reasonably only when the following criterion is met:

ȗ ≤ ŝ (5)

Let us define the cut-off frequency as the spectrogram frequency beyond which criterion (5) is no longer satisfied. In view of (1) and (4), it can be expressed simply as:

f_c ≈ F_s / (0.0035 N) (6)

Thus, the cut-off frequency is directly proportional to the sampling rate of the audio signal and inversely proportional to the number of DFT points N.

Example of cut-off frequencies

In audio compression systems (e.g. the MP3 audio encoder [6] or digital broadcasting [4]), the DFT is generally implemented as a 1024-point FFT [7]. For this case, the results of applying (6) to the sampling rates used in audio compression systems are shown in the following table.

Table 1. Cut-off frequencies for a 1024-point FFT

Sampling rate (kHz)	FFT resolution (Hz)	Cut-off frequency (Hz)
32.0	31.25	9000
44.1	43.07	12500
48.0	46.88	13500
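The entries of Table 1 can be checked with a short script that applies (4) and (6) directly. The function names are hypothetical and chosen only for this sketch.

```python
def spectral_resolution_hz(fs_hz, n_points):
    """Equation (4): spacing between adjacent spectral lines, in Hz."""
    return fs_hz / n_points


def cutoff_frequency_hz(fs_hz, n_points):
    """Equation (6): frequency at which the spectral resolution equals 0.0035 f."""
    return fs_hz / (0.0035 * n_points)


if __name__ == "__main__":
    n_points = 1024
    print("Fs (kHz)  resolution (Hz)  cut-off (Hz)")
    for fs_khz in (32.0, 44.1, 48.0):
        fs_hz = fs_khz * 1000.0
        res = spectral_resolution_hz(fs_hz, n_points)
        fc = cutoff_frequency_hz(fs_hz, n_points)
        print(f"{fs_khz:8.1f}  {res:15.2f}  {fc:12.0f}")
```

The unrounded cut-off values from (6) are roughly 8929 Hz, 12305 Hz and 13393 Hz; the figures in Table 1 appear to be these values rounded to the nearest 500 Hz.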

With a doubled FFT size (i.e. a 2048-point FFT), the cut-off frequencies in Table 1 will be increased by two times. They then cover the entire range of frequencies perceived by human hearing at each of the sampling rates in Table 1.

Discussion

From the results in Table 1, it follows that the range in which frequency masking can reasonably be used in audio compression systems depends on the spectral resolution. As an example, consider the spectrogram in Fig. 1, where a 1024-point FFT is used.

Fig. 1. 1024-point FFT spectrogram of a jazz music fragment

For this spectrogram, at a sampling rate of 32 kHz, frequency masking can be applied only in the range up to the red bar in Fig. 1, i.e. 9.0 kHz. For the other two sampling rates, the limits are 12.5 kHz (green bar) and 13.5 kHz (blue bar), respectively. Fortunately, average human hearing is unable to perceive frequencies above 12 kHz [2]. This prevents audible artifacts at the top of the frequency range perceivable by average human hearing, although such artifacts may be noticeable to sophisticated music listeners. At the same time, criterion (5) would be met at all three sampling rates over the entire frequency range if the FFT size used for audio compression is 2048 or more.

Conclusions

As shown above, when psychoacoustic effects are used in audio compression systems, both the spectral resolution and the resolution of auditory perception must be taken into account. A specific frequency can be determined for each sampling rate, which defines the frequency range within which these effects may reasonably be applied without noticeable distortion of the sound. The crucial point is to establish a correct relation between the masking area, on the one hand, and the spectral resolution, on the other. In audio compression systems, the spectral resolution is determined by both the sampling rate and the FFT size. It is shown that with a 1024-point FFT, frequency masking works reasonably only within the frequency range perceived by average human hearing. However, to meet the requirements of expert music listeners, the use of e.g. a 2048-point FFT is preferable.

Electronic scientific and practical journal «Современные научные исследования и инновации» (Modern Scientific Research and Innovation)
