Introduction
PART 1. TOOLS FOR SIGNAL COMPRESSION
Chapter 1. Scalar Quantization
1.1. Introduction
1.2. Optimum scalar quantization
1.3. Predictive scalar quantization
Chapter 2. Vector Quantization
2.1. Introduction
2.2. Rationale
2.3. Optimum codebook generation
2.4. Optimum quantizer performance
2.5. Using the quantizer
2.6. Gain-shape vector quantization
Chapter 3. Sub-band Transform Coding
3.1. Introduction
3.2. Equivalence of filter banks and transforms
3.3. Bit allocation
3.4. Optimum transform
3.5. Performance
Chapter 4. Entropy Coding
4.1. Introduction
4.2. Noiseless coding of discrete, memoryless sources
4.3. Noiseless coding of a discrete source with memory
4.4. Scalar quantizer with entropy constraint
4.5. Capacity of a discrete memoryless channel
4.6. Coding a discrete source with a fidelity criterion
PART 2. AUDIO SIGNAL APPLICATIONS
Chapter 5. Introduction to Audio Signals
5.1. Speech signal characteristics
5.2. Characteristics of music signals
5.3. Standards and recommendations
Chapter 6. Speech Coding
6.1. PCM and ADPCM coders
6.2. The 2.4 bit/s LPC-10 coder
6.3. The CELP coder
Chapter 7. Audio Coding
7.1. Principles of “perceptual coders”
7.2. MPEG-1 layer 1 coder
7.3. MPEG-2 AAC coder
7.4. Dolby AC-3 coder
7.5. Psychoacoustic model: calculating a masking threshold
Chapter 8. Audio Coding: Additional Information
8.1. Low bit rate/acceptable quality coders
8.2. High bit rate lossless or almost lossless coders
Chapter 9. Stereo Coding: A Synthetic Presentation
9.1. Basic hypothesis and notation
9.2. Determining the inter-channel indices
9.3. Downmixing procedure
9.4. At the receiver
9.5. Draft International Standard
PART 3. MATLAB® PROGRAMS
Chapter 10. A Speech Coder
10.1. Introduction
10.2. Script for the calling function
10.3. Script for called functions
Chapter 11. A Music Coder
11.1. Introduction
11.2. Script for the calling function
11.3. Script for called functions
Bibliography
Index
First published 2011 in Great Britain and the United States by ISTE Ltd and John Wiley & Sons, Inc.
Adapted and updated from Outils pour la compression des signaux : applications aux signaux audio published 2009 in France by Hermes Science/Lavoisier © Institut Télécom et LAVOISIER 2009
Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms and licenses issued by the CLA. Enquiries concerning reproduction outside these terms should be sent to the publishers at the undermentioned address:
ISTE Ltd, 27-37 St George’s Road, London SW19 4EU, UK (www.iste.co.uk)
John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA (www.wiley.com)
© ISTE Ltd 2011
The rights of Nicolas Moreau to be identified as the author of this work have been asserted by him in accordance with the Copyright, Designs and Patents Act 1988.
Moreau, Nicolas, 1945-
[Outils pour la compression des signaux. English]
Tools for signal compression / Nicolas Moreau.
p. cm.
“Adapted and updated from Outils pour la compression des signaux : applications aux signaux audio.”
Includes bibliographical references and index.
ISBN 978-1-84821-255-8
1. Sound--Recording and reproducing--Digital techniques. 2. Data compression (Telecommunication) 3. Speech processing systems. I. Title.
TK7881.4.M6413 2011
621.389’3--dc22
2011003206
British Library Cataloguing-in-Publication Data
A CIP record for this book is available from the British Library
ISBN 978-1-84821-255-8
In everyday life, we often come in contact with compressed signals: when using mobile telephones, mp3 players, digital cameras, or DVD players. The signals in each of these applications (telephone-band speech, high-fidelity audio, still or video images) are not only sampled and quantized, to put them into a form suitable for saving in mass storage devices or for sending across networks, but also compressed. The first operation is very basic and is presented in all courses and introductory books on signal processing. The second operation is more specific and is the subject of this book: the standard tools for signal compression are presented first, followed by examples of how these tools are applied in compressing speech and musical audio signals.

In the first part of this book, we focus on a problem which is theoretical in nature: minimizing the mean squared error. The second part is more concrete and qualifies the previous approach by seeking to minimize the bit rate while respecting psychoacoustic constraints. We will see that signal compression consists of seeking not only to eliminate the redundant parts of the original signal but also to eliminate those parts that are inaudible.
The compression techniques presented in this book are not new. Their theoretical framework, information theory and source coding, aims to formalize the first (and last) element of a digital communication chain: the coding of an analog signal (continuous in time and in amplitude) into a digital signal (discrete in time and in amplitude). These techniques stem from the work of C. Shannon published at the beginning of the 1950s. However, apart from the development of speech coders in the 1970s to support an entirely digital switched telephone network, they really came into use only toward the end of the 1980s under the influence of standardization groups such as the “Groupe Spécial Mobile” (GSM), the “Joint Photographic Experts Group” (JPEG), and the “Moving Picture Experts Group” (MPEG).
The results of these techniques are quite impressive and have enabled the development of the applications referred to earlier. Let us consider the example of a music signal. We know that a music signal can be reconstructed with quasi-perfect (CD) quality if it is sampled at 44.1 kHz and quantized with a resolution of 16 bits, which requires a bit rate of 705 kb/s per mono channel when transmitted across a network. The most successful audio coder, MPEG-4 AAC, ensures “transparency” at a bit rate of the order of 64 kb/s, a compression ratio greater than 10, and the more recent MPEG-4 HE-AAC v2 coder, standardized in 2004, provides very acceptable quality (for video on mobile phones) at 24 kb/s for two stereo channels: a compression ratio better than 50!
In Part 1 of this book, the standard tools (scalar quantization, predictive quantization, vector quantization, transform and sub-band coding, and entropy coding) are presented. To compare the performance of these tools, we use an academic example: the quantization of a realization x(n) of a one-dimensional random process X(n). Although this is a theoretical approach, it not only allows an objective assessment of performance but also shows the coherence between all the available tools. In Part 2, we concentrate on the compression of audio signals (telephone-band speech, wideband speech, and high-fidelity audio signals).
Throughout this book, we discuss the basic ideas of signal processing using the following language and notation. We consider a one-dimensional, stationary, zero-mean random process X(n) with power σ_X^2 and power spectral density S_X(f). We also assume that it is Gaussian, primarily because the Gaussian distribution is preserved under all linear transformations, in particular filtering, which greatly simplifies the notation, and also because a Gaussian signal is the most difficult signal to encode: it gives the greatest quantization error at any given bit rate. A column vector of dimension N is denoted X(m) = [X(mN) … X(mN + N − 1)]^t. These N random variables are completely defined statistically by their probability density function:
$$ p_X(x) = \frac{1}{(2\pi)^{N/2} \det^{1/2} R_X} \exp\left( -\frac{1}{2} x^t R_X^{-1} x \right) $$
where R_X = E{X(m) X^t(m)} is the autocovariance matrix, a Toeplitz matrix of dimensions N × N. Moreover, we assume that X(n) is an auto-regressive process of order P, obtained by filtering white noise W(n) of variance σ_W^2 through a filter of order P with transfer function 1/A(z), where A(z) has the form:
$$ A(z) = 1 + a_1 z^{-1} + \cdots + a_P z^{-P}. $$
The purpose of taking the quantization of an auto-regressive waveform as our example is that it allows all the statistical characteristics of the source to be expressed simply as functions of the filter parameters, for example the power spectral density:
$$ S_X(f) = \frac{\sigma_W^2}{|A(f)|^2} $$
where the notation A(f) is an abuse and should properly be written A(exp(j2πf)). It also allows us to give analytical expressions for the quantization error power of the different quantization methods when the quadratic error is chosen as the distortion measure, so that their performance can be compared. From a practical point of view, this example is not artificial because it is a reasonable model for a number of signals, for example for speech signals (which are only locally stationary) when the order P is chosen high enough (e.g. 8 or 10).
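To make this model concrete, here is a minimal MATLAB sketch (in the spirit of the programs of Part 3; the variable names and the example coefficients are ours, not the book's) that generates an AR(2) signal by filtering white noise through 1/A(z) and compares its periodogram with σ_W^2/|A(f)|^2:

% Generate an AR(2) signal x(n) by filtering white noise w(n) through 1/A(z)
% and compare its periodogram with sigma_W^2 / |A(f)|^2 (illustrative sketch).
N = 4096; sigma_W = 1;
a = [1 -0.9 0.5];                           % A(z) = 1 - 0.9 z^-1 + 0.5 z^-2 (example, stable)
w = sigma_W*randn(N,1);                     % white noise W(n)
x = filter(1, a, w);                        % AR process X(n)
f = (0:N-1)'/N;                             % normalized frequencies
Sx_hat = abs(fft(x)).^2 / N;                % periodogram of x(n)
Sx_th  = sigma_W^2 ./ abs(fft(a(:), N)).^2; % sigma_W^2 / |A(exp(j 2 pi f))|^2
semilogy(f(1:N/2), Sx_hat(1:N/2), f(1:N/2), Sx_th(1:N/2));
xlabel('normalized frequency f'); legend('periodogram', '\sigma_W^2 / |A(f)|^2');

Averaging the periodogram over several realizations would make the agreement with the theoretical density clearer.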
Let us consider a discrete-time signal x(n) with values in the range [−A, +A]. Defining a scalar quantization with a resolution of b bits per sample requires three operations:
– partitioning the range [−A, +A] into L = 2^b non-overlapping intervals {Θ_1 … Θ_L} of lengths {Δ_1 … Δ_L};
– numbering the partitioned intervals {i_1 … i_L};
– selecting a reproduction value x̂_i for each interval; the set {x̂_1 … x̂_L} of these reproduction values forms a dictionary (codebook).
Encoding (at the transmitter) consists of deciding which interval x(n) belongs to and associating with it the corresponding number i(n) ∈ {1 … L = 2^b}. It is this number of the chosen interval, the symbol, which is transmitted or stored. Decoding (at the receiver) consists of associating with the number i(n) the corresponding reproduction value. More formally, we observe that quantization is a non-bijective mapping from [−A, +A] onto a finite set C = {x̂_1 … x̂_L} with the assignment rule:
$$ \hat{x}(n) = \hat{x}_i \quad \text{if } x(n) \in \Theta_i. $$
The process is irreversible and involves a loss of information, a quantization error defined as q(n) = x(n) − x̂(n). A distortion measure must be defined; we use the simplest one, the quadratic error:
$$ d(x(n), \hat{x}(n)) = [x(n) - \hat{x}(n)]^2. $$
This measures the error in each sample. For a more global distortion measure, we use the mean squared error (MSE):
$$ E\{ [X(n) - \hat{X}(n)]^2 \}. $$
This error is simply called the quantization error power. We use the notation σ_Q^2 for the MSE.
Figure 1.1(a) shows, on the left, the signal before quantization and the partition of the range [−A, +A] for b = 3, and Figure 1.1(b) shows the reproduction values, the reconstructed signal, and the quantization error. The bitstream between the transmitter and the receiver is not shown.
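As an illustration of these operations, the following minimal MATLAB sketch implements a uniform b-bit quantizer on [−A, +A] (a toy example with our own variable names, not one of the book's programs):

% Minimal uniform scalar quantizer on [-A, A] with b bits (illustrative sketch).
A = 1; b = 3; L = 2^b; Delta = 2*A/L;          % L intervals of equal length Delta
x = A*(2*rand(1,1000) - 1);                    % test signal, uniform in [-A, A]
idx = floor((x + A)/Delta);                    % encoder: interval number 0 ... L-1
idx = min(max(idx, 0), L-1);                   % keep x = +A inside the last interval
xr = -A + (idx + 0.5)*Delta;                   % decoder: reproduction value (mid-interval)
q = x - xr;                                    % quantization error
fprintf('MSE = %g, Delta^2/12 = %g\n', mean(q.^2), Delta^2/12);

For this uniformly distributed test signal, the measured MSE is close to Δ^2/12, which anticipates the high-resolution results given below.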
The problem now consists of defining the optimal quantizer, that is, of defining the intervals {Θ_1 … Θ_L} and the set of reproduction values {x̂_1 … x̂_L} that minimize σ_Q^2.
Assume that x(n) is the realization of a real-valued stationary random process X(n). In scalar quantization, what matters is the distribution of the values that the random process X(n) takes at a given time n; no direct use is made of the correlation that exists between values of the process at different times. It is therefore enough to know the marginal probability density function of X(n), written p_X(.).
To characterize the optimum scalar quantization, the range partition and reproduction values must be found which minimize:
$$ \sigma_Q^2 = E\{ [X(n) - \hat{X}(n)]^2 \} = \sum_{i=1}^{L} \int_{\Theta_i} (x - \hat{x}_i)^2 \, p_X(x) \, dx. \qquad [1.1] $$
This joint minimization is not simple to solve. However, two necessary conditions for optimality are straightforward to find: if the reproduction values are known, the best partition {Θ_1 … Θ_L} can be calculated; once the partition is known, the best reproduction values can be deduced. In other words, the encoder must be optimal for the given decoder, and vice versa. These two necessary conditions are simple to express when the squared error is chosen as the distortion measure.
– Condition 1: Given the codebook {x̂_1 … x̂_L}, the best partition satisfies:
$$ \Theta_i = \{ x : (x - \hat{x}_i)^2 \le (x - \hat{x}_j)^2 \ \ \forall j \ne i \}. $$
This is the nearest neighbor rule.
If t_i denotes the boundary between the intervals Θ_i and Θ_{i+1}, minimizing the MSE with respect to t_i gives:
$$ \frac{\partial \sigma_Q^2}{\partial t_i} = (t_i - \hat{x}_i)^2 \, p_X(t_i) - (t_i - \hat{x}_{i+1})^2 \, p_X(t_i) = 0 $$
such that:
$$ t_i = \frac{\hat{x}_i + \hat{x}_{i+1}}{2}. $$
– Condition 2: Given a partition {Θ_1 … Θ_L}, the optimum reproduction value x̂_i is the centroid (or center of gravity) of the portion of the probability density function restricted to Θ_i:
$$ \hat{x}_i = \frac{\int_{\Theta_i} x \, p_X(x) \, dx}{\int_{\Theta_i} p_X(x) \, dx} = E\{ X \mid X \in \Theta_i \}. \qquad [1.2] $$
First, note that minimizing σ_Q^2 with respect to x̂_i involves only one term of the sum given in [1.1]. From:
$$ \frac{\partial}{\partial \hat{x}_i} \int_{\Theta_i} (x - \hat{x}_i)^2 \, p_X(x) \, dx = -2 \int_{\Theta_i} (x - \hat{x}_i) \, p_X(x) \, dx = 0 $$
we find the first identity of equation [1.2].
Since:
$$ p_X(x \mid x \in \Theta_i) = \frac{p_X(x)}{\mathrm{Prob}\{ X \in \Theta_i \}} \quad \text{for } x \in \Theta_i $$
where p_X(x | x ∈ Θ_i) is the conditional probability density function of X given that X ∈ Θ_i, we find:
$$ \hat{x}_i = \int_{\Theta_i} x \, p_X(x \mid x \in \Theta_i) \, dx = E\{ X \mid X \in \Theta_i \}. $$
The required value is the mean value of X in the interval under consideration.
It can be demonstrated that these two optimization conditions are not sufficient to guarantee optimized quantization except in the case of a Gaussian distribution.
Note that detailed knowledge of the partition is not necessary: the partition is entirely determined by the distortion measure, the nearest neighbor rule, and the set of reproduction values. Figure 1.2 shows a diagram of the encoder and decoder.
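The following sketch illustrates this remark: the encoder needs only the codebook and the nearest neighbor rule, and the decoder is a simple table lookup (the codebook values and variable names are ours, chosen for illustration):

% Encoder/decoder of a scalar quantizer defined only by its codebook:
% nearest neighbor rule at the encoder, table lookup at the decoder.
codebook = [-1.5 -0.5 0.5 1.5];               % example reproduction values (b = 2 bits)
x = randn(8, 1);                              % samples to encode
[~, idx] = min(abs(x - codebook), [], 2);     % nearest neighbor rule -> symbols i(n)
xr = codebook(idx); xr = xr(:);               % decoder output (reproduction values)
disp([x xr]);                                 % original and reconstructed samples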
When the number L of quantization levels is high, the optimum partition and the quantization error power can be expressed, unlike in the previous case, directly as functions of the probability density function p_X(x). This hypothesis, referred to as the high-resolution hypothesis, states that the probability density function can be assumed constant over each interval [t_{i−1}, t_i] and that the reproduction value is located at the middle of the interval. We can therefore write:
$$ p_X(x) \approx p_X(\hat{x}_i) \quad \text{and} \quad \hat{x}_i \approx \frac{t_{i-1} + t_i}{2} \quad \text{for } x \in [t_{i-1}, t_i]. $$
We define the length of the interval [t_{i−1}, t_i] as:
$$ \Delta(i) = t_i - t_{i-1} $$
and:
$$ p(i) = \int_{t_{i-1}}^{t_i} p_X(x) \, dx \approx p_X(\hat{x}_i) \, \Delta(i) $$
is the probability that X(n) belongs to the interval [t_{i−1}, t_i]. The quantization error power is written as:
$$ \sigma_Q^2 = \sum_{i=1}^{L} \int_{t_{i-1}}^{t_i} (x - \hat{x}_i)^2 \, p_X(x) \, dx \approx \sum_{i=1}^{L} p_X(\hat{x}_i) \int_{t_{i-1}}^{t_i} (x - \hat{x}_i)^2 \, dx. $$
Since:
$$ \int_{t_{i-1}}^{t_i} (x - \hat{x}_i)^2 \, dx = \frac{\Delta^3(i)}{12} $$
we find:
$$ \sigma_Q^2 = \frac{1}{12} \sum_{i=1}^{L} p_X(\hat{x}_i) \, \Delta^3(i). \qquad [1.3] $$
This is also written as:
$$ \sigma_Q^2 = \frac{1}{12} \sum_{i=1}^{L} p(i) \, \Delta^2(i). $$
The quantization error power depends only on the lengths Δ(i) of the intervals. We are looking for {Δ(1) … Δ(L)} such that σ_Q^2 is minimized. Let:
$$ \alpha(i) = p_X^{1/3}(\hat{x}_i) \, \Delta(i). $$
As:
$$ \sum_{i=1}^{L} \alpha(i) = \sum_{i=1}^{L} p_X^{1/3}(\hat{x}_i) \, \Delta(i) \approx \int_{-\infty}^{+\infty} p_X^{1/3}(x) \, dx $$
and this integral is independent of the Δ(i), we have to minimize the sum of the cubes of L positive numbers whose sum is constant. This is achieved when the numbers are all equal. Hence, we have:
$$ \alpha(i) = \frac{1}{L} \int_{-\infty}^{+\infty} p_X^{1/3}(x) \, dx $$
which implies:
$$ \Delta(i) = \frac{1}{L \, p_X^{1/3}(\hat{x}_i)} \int_{-\infty}^{+\infty} p_X^{1/3}(x) \, dx. $$
This relation means that an interval is smaller where the probability that X(n) falls in it is higher, and that all the intervals contribute equally to the quantization error power. The expression for the quantization error power is:
$$ \sigma_Q^2 = \frac{1}{12} \sum_{i=1}^{L} p_X(\hat{x}_i) \, \Delta^3(i) $$
where:
$$ \Delta^3(i) = \frac{1}{L^3 \, p_X(\hat{x}_i)} \left[ \int_{-\infty}^{+\infty} p_X^{1/3}(x) \, dx \right]^3. $$
Hence, we have:
$$ \sigma_Q^2 = \frac{1}{12 L^2} \left[ \int_{-\infty}^{+\infty} p_X^{1/3}(x) \, dx \right]^3. $$
Since L = 2^b, we obtain what is known as the Bennett formula:
$$ \sigma_Q^2 = \frac{1}{12} \, 2^{-2b} \left[ \int_{-\infty}^{+\infty} p_X^{1/3}(x) \, dx \right]^3. \qquad [1.4] $$
This demonstration is not mathematically rigorous. It will be discussed at the end of Chapter 4 where we compare this mode of quantization with what is known as quantization with entropy constraint.
Two cases are particularly interesting. When X(n) is uniformly distributed on [−A, +A], we find:
$$ \sigma_Q^2 = \frac{A^2}{3} \, 2^{-2b} = \sigma_X^2 \, 2^{-2b}. $$
Note that going through Bennett's formula is not necessary here: we can obtain this result directly!
For a Gaussian zero-mean signal with power σ_X^2, for which:
$$ p_X(x) = \frac{1}{\sqrt{2\pi} \, \sigma_X} \exp\left( -\frac{x^2}{2\sigma_X^2} \right) $$
we have:
$$ \left[ \int_{-\infty}^{+\infty} p_X^{1/3}(x) \, dx \right]^3 = 6\pi\sqrt{3} \, \sigma_X^2. $$
From this, we deduce that:
$$ \sigma_Q^2 = c \, \sigma_X^2 \, 2^{-2b} \qquad [1.5] $$
where:
$$ c = \frac{\pi\sqrt{3}}{2} \approx 2.72. $$
This equation is referred to throughout this book. From it, we can write the equivalent expression in decibels:
$$ 10 \log_{10} \frac{\sigma_X^2}{\sigma_Q^2} = 6.02 \, b - 10 \log_{10} c \approx 6.02 \, b - 4.35 \ \text{dB}. $$
From this we deduce the 6 dB per bit rule. We can show that for all other distributions (Laplacian, etc.), the minimum quantization error power is always between these two values. The case of the uniformly distributed signal is more favorable, whereas the Gaussian case is less favorable. Shannon’s work and the rate/distortion theory affirm this observation.
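As a quick numerical illustration of equation [1.5] and of the 6 dB per bit rule (a small sketch; the tabulation is ours):

% Signal-to-noise ratio predicted by equation [1.5] for a Gaussian source:
% 10*log10(sigma_X^2/sigma_Q^2) = 6.02*b - 10*log10(c), with c = pi*sqrt(3)/2.
c = pi*sqrt(3)/2;                 % approximately 2.72
b = (1:16)';                      % resolutions in bits per sample
SNR = 6.02*b - 10*log10(c);       % roughly 6 dB gained per additional bit
disp([b SNR]);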
It is interesting to know the statistical properties of the quantization error. We can show that the quantization error is uncorrelated with the reconstructed signal, but that this property does not hold for the original signal. We can also show that the quantization error can be modeled by white noise only within the framework of the high-resolution hypothesis. A detailed analysis is possible (see [LIP 92]).
In practice, p_X(x) is unknown. To construct a quantizer, we use empirical data, assign the same weight to each value, and apply the Lloyd-Max algorithm in its so-called Linde-Buzo-Gray (LBG) form. This algorithm, which generalizes to vector quantization, is presented in the following chapter.
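A minimal sketch of this construction in the scalar case, alternating the two necessary conditions on empirical data (the initialization, the fixed number of iterations, and the variable names are our own choices; the LBG form and its vector generalization are presented in Chapter 2):

% Lloyd-Max algorithm on empirical data: alternate the nearest neighbor rule
% (condition 1) and the centroid rule (condition 2).
x = randn(1e5, 1);                              % training data, equal weight per sample
b = 3; L = 2^b;
c = sort(x(round(linspace(1, numel(x), L))));   % initial codebook: L sorted samples
c = c(:);
for iter = 1:50
    [~, idx] = min(abs(x - c'), [], 2);         % condition 1: nearest neighbor partition
    for i = 1:L                                 % condition 2: centroid of each cell
        if any(idx == i), c(i) = mean(x(idx == i)); end
    end
end
mse = mean((x - c(idx)).^2);
fprintf('empirical MSE = %g, high-resolution value [1.5] = %g\n', ...
        mse, (pi*sqrt(3)/2)*2^(-2*b));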
A non-uniform scalar quantizer can be seen as a uniform scalar quantizer preceded by a non-linear transformation and followed by the inverse transformation. The transformation is defined by its characteristic f(x). From this perspective, the problem consists of choosing the non-linear transformation that minimizes the quantization error power. This is the subject of important developments in the books by Jayant and Noll [JAY 84] and by Gersho and Gray [GER 92]. This approach no longer seems to be of great importance because vector quantization has become the basic tool of choice.
When quantizing real signals (speech, music, images), it is important to estimate the parameter A, which varies with time: real signals do not satisfy the stationarity hypothesis! We examine this problem in the following chapter by introducing a special form of quantization, called gain-shape quantization, which is particularly well adapted to signals with significant instantaneous changes in power, for example audio signals.
In the preceding exposition, we saw that quantization makes no use of the statistical dependence between successive values of the signal. We will see that predictive scalar quantization aims to decorrelate the signal before quantizing it, and that exploiting this correlation improves the overall behavior of the system, that is, it reduces the quantization error power.
An outline of the principle of predictive scalar quantization is shown in Figure 1.3. We subtract an auxiliary signal v(n) from the signal x(n), then perform the encoding/decoding procedure on the signal y(n) = x(n) − v(n). At the decoder, we add v(n) back to the reconstructed values ŷ(n) to obtain x̂(n) = ŷ(n) + v(n).
We can immediately see that, in a real-world coding application, this scheme is not very realistic since the signal v(n) must also be transmitted to the decoder, but let us wait until the end of the chapter before showing how to go from this open-loop scheme to a more realistic, but harder to analyze, closed-loop scheme.
If we subtract a value from the signal before encoding it and add it back after decoding, the quantization error and the reconstruction error are always equal because:
$$ x(n) - \hat{x}(n) = [y(n) + v(n)] - [\hat{y}(n) + v(n)] = y(n) - \hat{y}(n). $$
Hence their respective powers are identical. Since the main interest of the user of the complete system is to have the smallest possible reconstruction error power, the problem becomes simply the reduction of the quantization error power. If we assume an optimized scalar quantization of y(n), we know that the quantization error power can be expressed as:
$$ \sigma_Q^2 = c \, \sigma_Y^2 \, 2^{-2b}. $$
From this, we conclude that seeking to minimize the reconstruction error power leads us back to minimizing the power σ_Y^2 of y(n).
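A small sketch verifying this identity with an arbitrary auxiliary signal v(n) and a uniform quantizer (the signal model, the choice of v(n), and the step size are illustrative assumptions of ours):

% Open-loop scheme: subtract v(n), quantize y(n) = x(n) - v(n), add v(n) back.
% The reconstruction error x - xr equals the quantization error y - yr.
N = 1000; x = filter(1, [1 -0.9], randn(N,1));   % example correlated signal
v = 0.9*[0; x(1:end-1)];                         % example auxiliary signal (a simple prediction)
y = x - v;
Delta = 0.05; yr = Delta*round(y/Delta);         % uniform quantization of y(n)
xr = yr + v;                                     % reconstruction at the decoder
fprintf('max |(x - xr) - (y - yr)| = %g\n', max(abs((x - xr) - (y - yr))));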
We have a wide range of choices for v(n). If we take v(n) in the form:
$$ v(n) = -\sum_{i=1}^{P} a_i \, x(n-i) $$
thereby introducing P parameters, we speak of linear prediction of order P. The signal y(n) is the prediction error, which is expressed as:
$$ y(n) = x(n) + \sum_{i=1}^{P} a_i \, x(n-i). $$
The relationship between x(n) and y(n) is that of a filtering operation with transfer function:
$$ A(z) = 1 + a_1 z^{-1} + \cdots + a_P z^{-P}. $$
Minimizing σ_Y^2 with respect to the coefficients of this prediction filter is the problem to be solved.
This problem has been the subject of numerous studies since the 1960s. All modern books presenting the basic techniques of signal processing devote a chapter to it, for example [KAY 88]. Here we give only a few reminders.
Since this theory, which is used in numerous signal processing applications, was developed primarily with the goal of speech coding at reduced bit rates, and since speech coding processes blocks of N samples, the problem can be posed in the following way: knowing [x(0) … x(N − 1)], determine the coefficients a_1 … a_P that minimize the empirical power of the prediction error:
$$ \epsilon = \frac{1}{N} \sum_{n=0}^{N-1} y^2(n) $$
where:
$$ y(n) = x(n) + \sum_{i=1}^{P} a_i \, x(n-i). $$
Since:
$$ y = \begin{bmatrix} y(0) \\ \vdots \\ y(N-1) \end{bmatrix}, \qquad x = \begin{bmatrix} x(0) \\ \vdots \\ x(N-1) \end{bmatrix}, \qquad a = \begin{bmatrix} a_1 \\ \vdots \\ a_P \end{bmatrix} $$
such that:
$$ y = x + \Gamma a $$
where Γ is the N × P matrix whose column i contains the delayed samples x(n − i), we can write:
$$ N \epsilon = \| x + \Gamma a \|^2 = x^t x + 2 x^t \Gamma a + a^t \Gamma^t \Gamma a. $$
The vector a_opt is that which satisfies:
$$ \Gamma^t (x + \Gamma a_{\mathrm{opt}}) = 0. $$
If Γ^t Γ is invertible, we find:
$$ a_{\mathrm{opt}} = -(\Gamma^t \Gamma)^{-1} \Gamma^t x. $$
The minimum can be written as:
$$ N \epsilon_{\min} = x^t x + x^t \Gamma \, a_{\mathrm{opt}} = x^t x - x^t \Gamma (\Gamma^t \Gamma)^{-1} \Gamma^t x. $$
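A sketch of this least-squares solution in MATLAB (the construction of Γ, the handling of the samples before n = 0 with zeros, and the example signal are our own choices):

% Least-squares determination of the prediction coefficients from N samples:
% minimize ||x + Gamma*a||^2, i.e. a = -(Gamma'*Gamma) \ (Gamma'*x).
N = 256; P = 4;
x = filter(1, [1 -1.1 0.98 -0.37 0.15], randn(N,1));  % example AR(4) data
xp = [zeros(P,1); x];                               % past samples (zeros before n = 0)
Gamma = zeros(N, P);
for i = 1:P
    Gamma(:, i) = xp(P+1-i : P+N-i);                % column i holds x(n - i)
end
a = -(Gamma'*Gamma) \ (Gamma'*x);                   % prediction coefficients a_1 ... a_P
y = x + Gamma*a;                                    % prediction error y(n)
fprintf('empirical prediction error power = %g\n', mean(y.^2));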
Let us now assume that the signal x(n) can be interpreted (modeled) as the realization of a stationary random process X(n) with autocovariance function r_X(k) = E{X(n)X(n − k)}. We need to find the vector a which minimizes the prediction error power:
$$ \sigma_Y^2 = E\{ Y^2(n) \} = E\left\{ \left[ X(n) + \sum_{i=1}^{P} a_i X(n-i) \right]^2 \right\}. $$
Setting the derivatives of σ_Y^2 with respect to the a_i to zero yields the normal equations:
$$ R_X \, a_{\mathrm{opt}} = -r_X \quad \text{with} \quad R_X = \begin{bmatrix} r_X(0) & \cdots & r_X(P-1) \\ \vdots & \ddots & \vdots \\ r_X(P-1) & \cdots & r_X(0) \end{bmatrix}, \qquad r_X = \begin{bmatrix} r_X(1) \\ \vdots \\ r_X(P) \end{bmatrix} $$
and as the autocovariance matrix is positive definite (except in the limiting case where X(n) is a harmonic random process), it is invertible. We therefore have:
$$ a_{\mathrm{opt}} = -R_X^{-1} r_X. \qquad [1.6] $$
We also have:
$$ \sigma_{Y \min}^2 = r_X(0) + r_X^t \, a_{\mathrm{opt}}. \qquad [1.7] $$
Note that equations [1.6] and [1.7] taken together admit the single matrix representation:
$$ \begin{bmatrix} r_X(0) & r_X^t \\ r_X & R_X \end{bmatrix} \begin{bmatrix} 1 \\ a_{\mathrm{opt}} \end{bmatrix} = \begin{bmatrix} \sigma_{Y \min}^2 \\ 0 \end{bmatrix}. \qquad [1.8] $$
The two solutions are comparable, which is unsurprising because (1/N) Γ^t x is an estimate of the vector r_X and (1/N) Γ^t Γ is an estimate of R_X. More precisely, the two approaches are asymptotically equivalent since, when the signal X(n) is an ergodic random process, that is, if:
$$ \lim_{N \to \infty} \frac{1}{N} \sum_{n=0}^{N-1} x(n) \, x(n-k) = r_X(k) $$
we have:
$$ \lim_{N \to \infty} \frac{1}{N} \Gamma^t \Gamma = R_X \quad \text{and} \quad \lim_{N \to \infty} \frac{1}{N} \Gamma^t x = r_X. $$
The difference (essential but subtle) is that the exact covariance matrix, being positive definite (except in the limiting case where the process is harmonic), is always invertible, while the matrix Γ^t Γ, whose entries are the empirical correlations
$$ [\Gamma^t \Gamma]_{ij} = \sum_{n} x(n-i) \, x(n-j), $$
stays symmetric but is not always positive definite.
In practice, when we have only N observed data, we wish to preserve this positivity property. We can see that estimating the autocovariance function as:
$$ \hat{r}_X(k) = \frac{1}{N} \sum_{n=k}^{N-1} x(n) \, x(n-k) \quad \text{for } k = 0 \ldots P $$
and then constructing R̂_X and r̂_X from these values maintains the positive definite character of R̂_X, which therefore guarantees its invertibility. We can then define the filter coefficients by:
$$ \hat{a} = -\hat{R}_X^{-1} \, \hat{r}_X \qquad [1.9] $$
and the residual signal power by:
$$ \hat{\sigma}_Y^2 = \hat{r}_X(0) + \hat{r}_X^t \, \hat{a}. $$
This is called linear predictive coding (LPC).
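A sketch of this LPC analysis by the autocorrelation method, following equation [1.9] (the variable names and the example signal are ours; one could also use the Signal Processing Toolbox function lpc, based on the Levinson-Durbin recursion):

% LPC analysis by the autocorrelation method (equation [1.9]):
% estimate r_X(0..P), solve for the coefficients, get the residual power.
N = 256; P = 10;
x = filter(1, [1 -0.9 0.5], randn(N,1));        % example signal (here an AR(2) process)
r = zeros(P+1, 1);
for k = 0:P
    r(k+1) = sum(x(k+1:N) .* x(1:N-k)) / N;     % biased autocovariance estimate
end
R = toeplitz(r(1:P));                            % P x P Toeplitz matrix R_X
a = -R \ r(2:P+1);                               % coefficients a_1 ... a_P   [1.9]
sigma2_Y = r(1) + r(2:P+1)'*a;                   % residual (prediction error) power
A = [1; a];                                      % A(z) = 1 + a_1 z^-1 + ... + a_P z^-P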
We can also show that this positivity property ensures that all the zeros of the polynomial A(z) lie inside the unit circle, which guarantees the stability of the filter 1/A(z). This is a very important property in practice, as we will see shortly when we look at coding speech at a reduced rate.
We can show that the prediction error Y(n) is white (more precisely, that the signal X(n) is whitened). Note that the normal equations express the orthogonality of the prediction error to the past samples:
$$ E\{ Y(n) \, X(n-i) \} = 0 \quad \text{for } i = 1 \ldots P. $$
Assume that P is large. As Y(n) is uncorrelated with the preceding X(n − i), and Y(n − i) is a linear combination of X(n − i), X(n − i − 1), …, we deduce that Y(n) is uncorrelated with Y(n − i). The prediction error Y(n) is therefore white noise, but this property only holds, strictly speaking, as P → ∞ (asymptotic behavior). The filter which produces Y(n) from X(n) is called the “whitening filter”.
If Y(n) is completely whitened, we can write:
$$ S_Y(f) = \sigma_Y^2 = S_X(f) \, |A(f)|^2. $$
Hence, we have:
$$ S_X(f) = \frac{\sigma_Y^2}{|A(f)|^2}. $$
Recall that the most regularly used spectral density estimate is the periodogram. From the N observed data [x(0) … x(N − 1)], we derive:
$$ \hat{S}_X(f) = \frac{1}{N} \left| \sum_{n=0}^{N-1} x(n) \, e^{-j2\pi f n} \right|^2. $$
The equation hints at a second spectral density estimate. From N observed data, we calculate the whitening filter coefficients through an LPC analysis before using the preceding equation.
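A sketch comparing the two estimates on the same data (the example signal, the order P, and the FFT length are our own choices):

% Two spectral estimates from the same N samples: the periodogram and the
% LPC-based estimate sigma_Y^2 / |A(f)|^2.
N = 512; P = 10; Nfft = 1024;
x = filter(1, [1 -0.9 0.5], randn(N,1));         % example signal
Sx_per = abs(fft(x, Nfft)).^2 / N;               % periodogram
r = zeros(P+1, 1);
for k = 0:P, r(k+1) = sum(x(k+1:N).*x(1:N-k))/N; end
a = -toeplitz(r(1:P)) \ r(2:P+1);                % LPC coefficients (autocorrelation method)
sigma2_Y = r(1) + r(2:P+1)'*a;                   % prediction error power
Sx_lpc = sigma2_Y ./ abs(fft([1; a], Nfft)).^2;  % sigma_Y^2 / |A(f)|^2
f = (0:Nfft/2-1)'/Nfft;
semilogy(f, Sx_per(1:Nfft/2), f, Sx_lpc(1:Nfft/2));
xlabel('normalized frequency'); legend('periodogram', 'LPC estimate');

The LPC-based estimate appears as a smoothed envelope of the periodogram, which is part of what makes it useful for speech coding.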