
Statistical and Adaptive Signal Processing Spectral Estimation, Signal Modeling, Adaptive Filtering, and Array Processing

Dimitris G. Manolakis Massachusetts Institute of Technology Lincoln Laboratory

Vinay K. Ingle Northeastern University

Stephen M. Kogon Massachusetts Institute of Technology Lincoln Laboratory

Library of Congress Cataloging-in-Publication Data A catalog record for this book is available from the U.S. Library of Congress. British Library Cataloguing in Publication Data A catalogue record for this book is available from the British Library. This is a reissue of a McGraw-Hill book.

© 2005 ARTECH HOUSE, INC. 685 Canton Street Norwood, MA 02062

All rights reserved. Printed and bound in the United States of America. No part of this book may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage and retrieval system, without permission in writing from the publisher. All terms mentioned in this book that are known to be trademarks or service marks have been appropriately capitalized. Artech House cannot attest to the accuracy of this information. Use of a term in this book should not be regarded as affecting the validity of any trademark or service mark. International Standard Book Number: 1-58053-610-7 10 9 8 7 6 5 4 3 2 1

To my beloved wife, Anna, and to the loving memory of my father, Gregory.
DGM

To my beloved wife, Usha, and adoring daughters, Natasha and Trupti.
VKI

To my wife and best friend, Lorna, and my children, Gabrielle and Matthias.
SMK

ABOUT THE AUTHORS

DIMITRIS G. MANOLAKIS, a native of Greece, received his education (B.S. in physics and Ph.D. in electrical engineering) from the University of Athens, Greece. He is currently a member of the technical staff at MIT Lincoln Laboratory, in Lexington, Massachusetts. Previously, he was a Principal Member, Research Staff, at Riverside Research Institute. Dr. Manolakis has taught at the University of Athens, Northeastern University, Boston College, and Worcester Polytechnic Institute; and he is coauthor of the textbook Digital Signal Processing: Principles, Algorithms, and Applications (Prentice-Hall, 1996, 3d ed.). His research experience and interests include the areas of digital signal processing, adaptive filtering, array processing, pattern recognition, and radar systems.

VINAY K. INGLE is Associate Professor of Electrical and Computer Engineering at Northeastern University. He received his Ph.D. in electrical and computer engineering from Rensselaer Polytechnic Institute in 1981. He has broad research experience and has taught courses on topics including signal and image processing, stochastic processes, and estimation theory. Professor Ingle is coauthor of the textbooks DSP Laboratory Using the ADSP-2101 Microprocessor (Prentice-Hall, 1991) and DSP Using Matlab (PWS Publishing Co., Boston, 1996).

STEPHEN M. KOGON received the Ph.D. degree in electrical engineering from Georgia Institute of Technology. He is currently a member of the technical staff at MIT Lincoln Laboratory in Lexington, Massachusetts. Previously, he was associated with Raytheon Co., Boston College, and Georgia Tech Research Institute. His research interests are in the areas of adaptive processing, array signal processing, radar, and statistical signal modeling.

1.3.1 Rational or Pole-Zero Models / 1.3.2 Fractional Pole-Zero Models and Fractal Models

2.1.1 Continuous-Time, Discrete-Time, and Digital Signals / 2.1.2 Mathematical Description of Signals / 2.1.3 Real-World Signals

2.2.1 Fourier Transforms and Fourier Series / 2.2.2 Sampling of Continuous-Time Signals / 2.2.3 The Discrete Fourier Transform / 2.2.4 The z-Transform / 2.2.5 Representation of Narrowband Signals

2.4.1 System Invertibility and Minimum-Phase Systems / 2.4.2 All-Pass Systems / 2.4.3 Minimum-Phase and All-Pass Decomposition / 2.4.4 Spectral Factorization

2.3.1 Analysis of Linear, Time-Invariant Systems / 2.3.2 Response to Periodic Inputs / 2.3.3 Correlation Analysis and Spectral Density

1.5.1 Spatial Filtering or Beamforming / 1.5.2 Adaptive Interference Mitigation in Radar Systems / 1.5.3 Adaptive Sidelobe Canceler

3.1.1 Distribution and Density Functions / 3.1.2 Statistical Averages / 3.1.3 Some Useful Random Variables

3.2.1 Definitions and Second-Order Moments / 3.2.2 Linear Transformations of Random Vectors / 3.2.3 Normal Random Vectors / 3.2.4 Sums of Independent Random Variables

3.3 Discrete-Time Stochastic Processes

3.3.1 Description Using Probability Functions / 3.3.2 Second-Order Statistical Description / 3.3.3 Stationarity / 3.3.4 Ergodicity / 3.3.5 Random Signal Variability / 3.3.6 Frequency-Domain Description of Stationary Processes

3.4.1 Time-Domain Analysis / 3.4.2 Frequency-Domain Analysis / 3.4.3 Random Signal Memory / 3.4.4 General Correlation Matrices / 3.4.5 Correlation Matrices from Random Processes

3.5 Whitening and Innovations Representation

3.5.1 Transformations Using Eigen-decomposition / 3.5.2 Transformations Using Triangular Decomposition / 3.5.3 The Discrete Karhunen-Loève Transform

3.6.1 Properties of Estimators / 3.6.2 Estimation of Mean / 3.6.3 Estimation of Variance

4.1.1 Linear Nonparametric Signal Models / 4.1.2 Parametric Pole-Zero Signal Models / 4.1.3 Mixed Processes and the Wold Decomposition

4.2.1 Model Properties / 4.2.2 All-Pole Modeling and Linear Prediction / 4.2.3 Autoregressive Models / 4.2.4 Lower-Order Models

4.3 All-Zero Models

4.3.1 Model Properties / 4.3.2 Moving-Average Models / 4.3.3 Lower-Order Models

4.4.1 Model Properties / 4.4.2 Autoregressive Moving-Average Models / 4.4.3 The First-Order Pole-Zero Model 1: PZ (1,1) / 4.4.4 Summary and Dualities

5.1.1 Effect of Signal Sampling / 5.1.2 Windowing, Periodic Extension, and Extrapolation / 5.1.3 Effect of Spectrum Sampling / 5.1.4 Effects of Windowing: Leakage and Loss of Resolution / 5.1.5 Summary

5.2 Estimation of the Autocorrelation of Stationary Random Signals

5.3 Estimation of the Power Spectrum of Stationary Random Signals

5.3.1 Power Spectrum Estimation Using the Periodogram / 5.3.2 Power Spectrum Estimation by Smoothing a Single Periodogram—The Blackman-Tukey Method / 5.3.3 Power Spectrum Estimation by Averaging Multiple Periodograms—The Welch-Bartlett Method / 5.3.4 Some Practical Considerations and Examples

5.4.1 Estimation of Cross-Power Spectrum / 5.4.2 Estimation of Frequency Response Functions

5.5.1 Estimation of Auto Power Spectrum / 5.5.2 Estimation of Cross Power Spectrum

6 Optimum Linear Filters

6.1 Optimum Signal Estimation

6.2 Linear Mean Square Error Estimation

6.2.1 Error Performance Surface / 6.2.2 Derivation of the Linear MMSE Estimator / 6.2.3 Principal-Component Analysis of the Optimum Linear Estimator / 6.2.4 Geometric Interpretations and the Principle of Orthogonality / 6.2.5 Summary and Further Properties

6.4.1 Design and Properties / 6.4.2 Optimum FIR Filters for Stationary Processes / 6.4.3 Frequency-Domain Interpretations

6.5.1 Linear Signal Estimation / 6.5.2 Forward Linear Prediction / 6.5.3 Backward Linear Prediction / 6.5.4 Stationary Processes / 6.5.5 Properties

6.6.1 Noncausal IIR Filters / 6.6.2 Causal IIR Filters / 6.6.3 Filtering of Additive Noise / 6.6.4

6.8 Channel Equalization in Data Transmission Systems

6.8.1 Nyquist’s Criterion for Zero ISI / 6.8.2 Equivalent Discrete-Time Channel Model / 6.8.3 Linear Equalizers / 6.8.4 Zero-Forcing Equalizers / 6.8.5 Minimum MSE Equalizers

7 Algorithms and Structures for Optimum Linear Filters

7.1 Fundamentals of Order-Recursive Algorithms

7.1.1 Matrix Partitioning and Optimum Nesting / 7.1.2 Inversion of Partitioned Hermitian Matrices / 7.1.3 Levinson Recursion for the Optimum Estimator / 7.1.4 Order-Recursive Computation of the LDL^H Decomposition / 7.1.5 Order-Recursive Computation of the Optimum Estimate

7.2.1 Innovations and Backward Prediction / 7.2.2 Partial Correlation / 7.2.3 Order Decomposition of the Optimum Estimate / 7.2.4 Gram-Schmidt Orthogonalization

7.3 Order-Recursive Algorithms for Optimum FIR Filters

7.3.1 Order-Recursive Computation of the Optimum Filter / 7.3.2 Lattice-Ladder Structure / 7.3.3 Simplifications for Stationary Stochastic Processes / 7.3.4 Algorithms Based on the UDU^H Decomposition

7.5.1 Lattice-Ladder Structures / 7.5.2 Some Properties and Interpretations / 7.5.3 Parameter Conversions

7.6.1 Direct Schür Algorithm / 7.6.2 Implementation Considerations / 7.6.3 Inverse Schür Algorithm

7.7.1 LDL^H Decomposition of Inverse of a Toeplitz Matrix / 7.7.2 LDL^H Decomposition of a Toeplitz Matrix / 7.7.3 Inversion of Real Toeplitz Matrices

8.2.1 Derivation of the Normal Equations / 8.2.2 Statistical Properties of Least-Squares Estimators

8.5 LS Computations Using the Normal Equations

8.5.1 Linear LSE Estimation / 8.5.2 LSE FIR Filtering and Prediction

8.7.1 Singular Value Decomposition / 8.7.2 Solution of the LS Problem / 8.7.3 Rank-Deficient LS Problems

8.4.1 Signal Estimation and Linear Prediction / 8.4.2 Combined Forward and Backward Linear Prediction (FBLP) / 8.4.3 Narrowband Interference Cancelation

8.6.1 Householder Reflections / 8.6.2 The Givens Rotations / 8.6.3 Gram-Schmidt Orthogonalization

9.2.1 Direct Structures / 9.2.2 Lattice Structures / 9.2.3 Maximum Entropy Method / 9.2.4 Excitations with Line Spectra

9.5 Minimum-Variance Spectrum Estimation

9.6 Harmonic Models and Frequency Estimation Techniques

9.6.1 Harmonic Model / 9.6.2 Pisarenko Harmonic Decomposition / 9.6.3 MUSIC Algorithm / 9.6.4 Minimum-Norm Method / 9.6.5 ESPRIT Algorithm

10.1.1 Echo Cancelation in Communications / 10.1.2 Equalization of Data Communications Channels / 10.1.3 Linear Predictive Coding / 10.1.4 Noise Cancelation

10.2.1 Features of Adaptive Filters / 10.2.2 Optimum versus Adaptive Filters / 10.2.3 Stability and Steady-State Performance of Adaptive Filters / 10.2.4 Some Practical Considerations

10.4.1 Derivation / 10.4.2 Adaptation in a Stationary SOE / 10.4.3 Summary and Design Guidelines / 10.4.4 Applications of the LMS Algorithm / 10.4.5 Some Practical Considerations

10.5.1 LS Adaptive Filters / 10.5.2 Conventional Recursive Least-Squares Algorithm / 10.5.3 Some Practical Considerations / 10.5.4 Convergence and Performance Analysis

10.6.1 LS Computations Using the Cholesky and QR Decompositions / 10.6.2 Two Useful Lemmas / 10.6.3 The QR-RLS Algorithm / 10.6.4 Extended QR-RLS Algorithm / 10.6.5 The Inverse QR-RLS Algorithm / 10.6.6 Implementation of QR-RLS Algorithm Using the Givens Rotations / 10.6.7 Implementation of Inverse QR-RLS Algorithm Using the Givens Rotations / 10.6.8 Classification of RLS Algorithms for Array Processing

10.7.1 Fast Fixed-Order RLS FIR Filters / 10.7.2 RLS Lattice-Ladder Filters / 10.7.3 RLS Lattice-Ladder Filters Using Error Feedback Updatings / 10.7.4 Givens Rotation–Based LS Lattice-Ladder Algorithms / 10.7.5 Classification of RLS Algorithms for FIR Filtering

10.8.1 Approaches for Nonstationary SOE / 10.8.2 Preliminaries in Performance Analysis / 10.8.3 The LMS Algorithm / 10.8.4 The RLS Algorithm with Exponential Forgetting / 10.8.5 Comparison of Tracking Performance

11.1.1 Spatial Signals / 11.1.2 Modulation-Demodulation / 11.1.3 Array Signal Model / 11.1.4 The Sensor Array: Spatial Sampling

11.3.1 Optimum Beamforming / 11.3.2 Eigenanalysis of the Optimum Beamformer / 11.3.3 Interference Cancelation Performance / 11.3.4 Tapered Optimum Beamforming / 11.3.5 The Generalized Sidelobe Canceler

11.5.1 Sample Matrix Inversion / 11.5.2 Diagonal Loading with the SMI Beamformer / 11.5.3 Implementation of the SMI Beamformer / 11.5.4 Sample-by-Sample Adaptive Methods

11.6.1 Linearly Constrained Minimum-Variance Beamformers / 11.6.2 Partially Adaptive Arrays / 11.6.3 Sidelobe Cancelers

11.7 Angle Estimation

11.7.1 Maximum-Likelihood Angle Estimation / 11.7.2 Cramér-Rao Lower Bound on Angle Accuracy / 11.7.3 Beamsplitting Algorithms / 11.7.4 Model-Based Methods

12.1.1 Moments, Cumulants, and Polyspectra / 12.1.2 Higher-Order Moments and LTI Systems / 12.1.3 Higher-Order Moments of Linear Signal Models

12.3 Unsupervised Adaptive Filters—Blind Equalizers

12.3.1 Blind Equalization / 12.3.2 Symbol Rate Blind Equalizers / 12.3.3 Constant-Modulus Algorithm

12.4.1 Zero-Forcing Fractionally Spaced Equalizers / 12.4.2 MMSE Fractionally Spaced Equalizers / 12.4.3 Blind Fractionally Spaced Equalizers

12.5.1 Fractional Unit-Pole Model / 12.5.2 Fractional Pole-Zero Models: FPZ (p, d, q) / 12.5.3 Symmetric α-Stable Fractional Pole-Zero Processes

12.6.1 Self-Similar Stochastic Processes / 12.6.2 Fractional Brownian Motion / 12.6.3 Fractional Gaussian Noise / 12.6.4 Simulation of Fractional Brownian Motions and Fractional Gaussian Noises / 12.6.5 Estimation of Long Memory /

Appendix A Matrix Inversion Lemma

Appendix B Gradients and Optimization in Complex Space

D.4.1 Hermitian Forms after Unitary Transformations / D.4.2 Significant Integral of Quadratic and Hermitian Forms

One must learn by doing the thing; for though you think you know it You have no certainty, until you try. —Sophocles, Trachiniae

The principal goal of this book is to provide a unified introduction to the theory, implementation, and applications of statistical and adaptive signal processing methods. We have focused on the key topics of spectral estimation, signal modeling, adaptive filtering, and array processing, whose selection was based on the grounds of theoretical value and practical importance. The book has been primarily written with students and instructors in mind. The principal objectives are to provide an introduction to basic concepts and methodologies that can provide the foundation for further study, research, and application to new problems. To achieve these goals, we have focused on topics that we consider fundamental and have either multiple or important applications.

APPROACH AND PREREQUISITES

The adopted approach is intended to help both students and practicing engineers understand the fundamental mathematical principles underlying the operation of a method, appreciate its inherent limitations, and provide sufficient details for its practical implementation. The academic flavor of this book has been influenced by our teaching whereas its practical character has been shaped by our research and development activities in both academia and industry. The mathematical treatment throughout this book has been kept at a level that is within the grasp of upper-level undergraduate students, graduate students, and practicing electrical engineers with a background in digital signal processing, probability theory, and linear algebra.

ORGANIZATION OF THE BOOK

Chapter 1 introduces the basic concepts and applications of statistical and adaptive signal processing and provides an overview of the book. Chapters 2 and 3 review the fundamentals of discrete-time signal processing, study random vectors and sequences in the time and frequency domains, and introduce some basic concepts of estimation theory.
Chapter 4 provides a treatment of parametric linear signal models (both deterministic and stochastic) in the time and frequency domains. Chapter 5 presents the most practical methods for the estimation of correlation and spectral densities. Chapter 6 provides a detailed study of the theoretical properties of optimum filters, assuming that the relevant signals can be modeled as stochastic processes with known statistical properties; and Chapter 7 contains algorithms and structures for optimum filtering, signal modeling, and prediction. Chapter 8 introduces the principle of least-squares estimation and its application to the design of practical filters and predictors. Chapters 9, 10, and 11 use the theoretical work in Chapters 4, 6, and 7 and the practical methods in Chapter 8, to develop, evaluate, and apply practical techniques for signal modeling, adaptive filtering, and array processing. Finally, Chapter 12 introduces some advanced topics: definition and properties of higher-order moments, blind deconvolution and equalization, and stochastic fractional and fractal signal models with long memory. Appendix A contains a review of the matrix inversion lemma, Appendix B reviews optimization in complex space, Appendix C contains a list of the Matlab functions used throughout the book, Appendix D provides a review of useful results from matrix algebra, and Appendix E includes a proof for the minimum-phase condition for polynomials.

THEORY AND PRACTICE

It is our belief that sound theoretical understanding goes hand-in-hand with practical implementation and application to real-world problems. Therefore, the book includes a large number of computer experiments that illustrate important concepts and help the reader to easily implement the various methods. Every chapter includes examples, problems, and computer experiments that facilitate the comprehension of the material. To help the reader understand the theoretical basis and limitations of the various methods and apply them to real-world problems, we provide Matlab functions for all major algorithms and examples illustrating their use. The Matlab files and additional material about the book can be found at http://www.artechhouse.com/default.asp?frame=Static/manolakismatlab.html. A Solutions Manual with detailed solutions to all the problems is available to the instructors adopting the book for classroom use.

Dimitris G. Manolakis
Vinay K. Ingle
Stephen M. Kogon

This book is an introduction to the theory and algorithms used for the analysis and processing of random signals and their applications to real-world problems. The fundamental characteristic of random signals is captured in the following statement: Although random signals are evolving in time in an unpredictable manner, their average statistical properties exhibit considerable regularity. This provides the ground for the description of random signals using statistical averages instead of explicit equations. When we deal with random signals, the main objectives are the statistical description, modeling, and exploitation of the dependence between the values of one or more discrete-time signals and their application to theoretical and practical problems. Random signals are described mathematically by using the theory of probability, random variables, and stochastic processes. However, in practice we deal with random signals by using statistical techniques. Within this framework we can develop, at least in principle, theoretically optimum signal processing methods that can inspire the development and can serve to evaluate the performance of practical statistical signal processing techniques. The area of adaptive signal processing involves the use of optimum and statistical signal processing techniques to design signal processing systems that can modify their characteristics, during normal operation (usually in real time), to achieve a clearly predefined application-dependent objective. The purpose of this chapter is twofold: to illustrate the nature of random signals with some typical examples and to introduce the four major application areas treated in this book: spectral estimation, signal modeling, adaptive filtering, and array processing. Throughout the book, the emphasis is on the application of techniques to actual problems in which the theoretical framework provides a foundation to motivate the selection of a specific method.

1.1 RANDOM SIGNALS

A discrete-time signal or time series is a set of observations taken sequentially in time, space, or some other independent variable. Examples occur in various areas, including engineering, natural sciences, economics, social sciences, and medicine. A discrete-time signal x(n) is basically a sequence of real or complex numbers called samples. Although the integer index n may represent any physical variable (e.g., time, distance), we shall generally refer to it as time. Furthermore, in this book we consider only time series with observations occurring at equally spaced intervals of time.

Discrete-time signals can arise in several ways. Very often, a discrete-time signal is obtained by periodically sampling a continuous-time signal, that is, x(n) = x_c(nT), where T = 1/Fs (seconds) is the sampling period and Fs (samples per second or hertz) is the sampling frequency. At other times, the samples of a discrete-time signal are obtained by accumulating some quantity (which does not have an instantaneous value) over equal intervals of time, for example, the number of cars per day traveling on a certain road. Finally, some signals are inherently discrete-time, for example, daily stock market prices. Throughout the book, unless otherwise stated, the terms signal, time series, or sequence will be used to refer to a discrete-time signal.

The key characteristics of a time series are that the observations are ordered in time and that adjacent observations are dependent (related). To see graphically the relation between the samples of a signal that are l sampling intervals away, we plot the points {x(n), x(n+l)} for 0 ≤ n ≤ N − 1 − l, where N is the length of the data record. The resulting graph is known as the l lag scatter plot. This is illustrated in Figure 1.1, which shows a speech signal and two scatter plots that demonstrate the correlation between successive samples. We note that for adjacent samples the data points fall close to a straight line with a positive slope. This implies high correlation because every sample is followed by a sample with about the same amplitude. In contrast, samples that are 20 sampling intervals apart are much less correlated because the points in the scatter plot are randomly spread.

FIGURE 1.1 (a) The waveform for the speech signal “signal”; (b) two scatter plots for successive samples and samples separated by 20 sampling intervals.

When successive observations of the series are dependent, we may use past observations to predict future values. If the prediction is exact, the series is said to be deterministic. However, in most practical situations we cannot predict a time series exactly. Such time series are called random or stochastic, and the degree of their predictability is determined by the dependence between consecutive observations. The ultimate case of randomness occurs when every sample of a random signal is independent of all other samples. Such a signal, which is completely unpredictable, is known as white noise and is used as a building block to simulate random signals with different types of dependence.

To summarize, the fundamental characteristic of a random signal is the inability to precisely specify its values. In other words, a random signal is not predictable, it never repeats itself, and we cannot find a mathematical formula that provides its values as a function of time. As a result, random signals can only be mathematically described by using the theory of stochastic processes (see Chapter 3). This book provides an introduction to the fundamental theory and a broad selection of algorithms widely used for the processing of discrete-time random signals.

Signal processing techniques, depending on their main objective, can be classified as follows (see Figure 1.2):

• Signal analysis. The primary goal is to extract useful information that can be used to understand the signal generation process or extract features that can be used for signal classification purposes. Most of the methods in this area are treated under the disciplines of spectral estimation and signal modeling. Typical applications include detection and classification of radar and sonar targets, speech and speaker recognition, detection and classification of natural and artificial seismic events, event detection and classification in biological and financial signals, efficient signal representation for data compression, etc.

• Signal filtering. The main objective of signal filtering is to improve the quality of a signal according to an acceptable criterion of performance. Signal filtering can be subdivided into the areas of frequency selective filtering, adaptive filtering, and array processing. Typical applications include noise and interference cancelation, echo cancelation, channel equalization, seismic deconvolution, active noise control, etc.
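The lag scatter plot and the role of white noise as a building block, both described earlier in this section, can be tried numerically. The following is a minimal sketch in Python (the book itself supplies Matlab functions). It assumes a hypothetical first-order recursion with coefficient 0.95 to create dependence between samples; that recursion, the function names, and the sample size are illustrative choices, not taken from the text.

```python
import random

def lag_pairs(x, lag):
    """Return the points (x(n), x(n + lag)) for 0 <= n <= N - 1 - lag."""
    return [(x[n], x[n + lag]) for n in range(len(x) - lag)]

def correlation_coefficient(pairs):
    """Sample correlation coefficient between the two coordinates."""
    n = len(pairs)
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs)
    var_b = sum((b - mean_b) ** 2 for _, b in pairs)
    return cov / (var_a * var_b) ** 0.5

rng = random.Random(0)  # fixed seed so the run is repeatable
N = 5000
white = [rng.gauss(0.0, 1.0) for _ in range(N)]  # independent samples

# Hypothetical recursion (an assumption, not from the text): each sample is
# 0.95 times the previous sample plus a fresh white-noise sample.
correlated = [white[0]]
for n in range(1, N):
    correlated.append(0.95 * correlated[-1] + white[n])

for name, x in (("white noise", white), ("correlated", correlated)):
    for lag in (1, 20):
        rho = correlation_coefficient(lag_pairs(x, lag))
        print(f"{name:12s} lag {lag:2d}: correlation = {rho:+.2f}")
```

Plotting the pairs returned by `lag_pairs` gives the scatter plot itself. The correlation coefficient is used here only as a one-number summary of how tightly the points cluster around a straight line.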

We conclude this section with some examples of signals occurring in practical applications. Although the description of these signals is far from complete, we provide sufficient information to illustrate their random nature and significance in signal processing applications.

FIGURE 1.2 Classification of methods for the analysis and processing of random signals. (Recoverable block label from the figure: Theory of stochastic processes, estimation, and optimum filtering, Chapters 2, 3, 6, 7.)

Speech signals. Figure 1.3 shows the spectrogram and speech waveform corresponding to the utterance “signal.” The spectrogram is a visual representation of the distribution of the signal energy as a function of time and frequency. We note that the speech signal has significant changes in both amplitude level and spectral content across time. The waveform contains segments of voiced (quasi-periodic) sounds, such as “e,” and unvoiced or fricative (noiselike) sounds, such as “g.”

FIGURE 1.3 Spectrogram and acoustic waveform for the utterance “signal.” The horizontal dark bands show the resonances of the vocal tract, which change as a function of time depending on the sound or phoneme being produced.
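To make the idea of a spectrogram concrete, the sketch below computes signal energy as a function of time and frequency by cutting a signal into short overlapping frames and taking the squared DFT magnitude of each frame. It is an illustration under assumptions, not the method used to produce Figure 1.3: the Hann window, the frame and hop lengths, and the synthetic two-tone test signal are all hypothetical choices, and a direct DFT is used only to keep the example free of external libraries.

```python
import cmath
import math

def dft_power(frame):
    """Squared DFT magnitude of one frame, for bins 0 .. len(frame) // 2."""
    n_samples = len(frame)
    power = []
    for k in range(n_samples // 2 + 1):
        acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / n_samples)
                  for n in range(n_samples))
        power.append(abs(acc) ** 2)
    return power

def spectrogram(x, frame_len, hop):
    """One row of DFT powers per frame; frames start every `hop` samples."""
    # Hann window (an assumed choice) tapers each frame to reduce leakage.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    rows = []
    for start in range(0, len(x) - frame_len + 1, hop):
        frame = [x[start + n] * window[n] for n in range(frame_len)]
        rows.append(dft_power(frame))
    return rows

# Synthetic test signal (hypothetical): a 500 Hz tone followed by a 1500 Hz tone.
fs = 8000                      # sampling frequency in hertz
half = 800                     # samples per tone
x = [math.sin(2 * math.pi * 500 * n / fs) for n in range(half)]
x += [math.sin(2 * math.pi * 1500 * n / fs) for n in range(half)]

frame_len, hop = 80, 40        # 10 ms frames with a 5 ms hop at 8 kHz
rows = spectrogram(x, frame_len, hop)
for i in (0, len(rows) - 1):
    peak_bin = max(range(len(rows[i])), key=rows[i].__getitem__)
    print(f"frame {i}: peak near {peak_bin * fs / frame_len:.0f} Hz")
```

Displaying `rows` as an image, with frame index on the horizontal axis and bin index on the vertical axis, gives the kind of picture the text describes. Shorter frames follow changes in time more closely but give coarser frequency spacing (here fs / frame_len = 100 Hz per bin).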

Speech production involves three processes: generation of the sound excitation, articulation by the vocal tract, and radiation from the lips and/or nostrils. If the excitation is a quasi-periodic train of air pressure pulses, produced by the vibration of the vocal cords, the result is a voiced sound. Unvoiced sounds are produced by first creating a constriction in the vocal tract, usually toward the mouth end. Then we generate turbulence by forcing air through the constriction at a sufficiently high velocity. The resulting excitation is a broadband noiselike waveform. The spectrum of the excitation is shaped by the vocal tract tube, which has a frequency response that resembles the resonances of organ pipes or wind instruments. The resonant frequencies of the vocal tract tube are known as formant frequencies, or simply formants. Changing the shape of the vocal tract changes its frequency response and results in the generation of different sounds. Since the shape of the vocal tract changes slowly during continuous speech, we usually assume that it remains almost constant over intervals on the order of 10 ms. More details about speech signal generation and processing can be found in Rabiner and Schafer 1978; O’Shaughnessy 1987; and Rabiner and Juang 1993.

Electrophysiological signals. Electrophysiology was established in the late eighteenth century when Galvani demonstrated the presence of electricity in animal tissues. Today, electrophysiological signals play a prominent role in every branch of physiology, medicine, and biology. Figure 1.4 shows a set of typical signals recorded in a sleep laboratory (Rechtschaffen and Kales 1968). The most prominent among them is the electroencephalogram (EEG), whose spectral content changes to reflect the state of alertness and the mental activity of the subject. The EEG signal exhibits some distinctive waves, known as rhythms, whose dominant spectral content occupies certain bands as follows: delta (δ), 0.5 to 4 Hz; theta (θ), 4 to 8 Hz; alpha (α), 8 to 13 Hz; beta (β), 13 to 22 Hz; and gamma (γ), 22 to 30 Hz. During sleep, if the subject is dreaming, the EEG signal shows rapid low-amplitude fluctuations similar to those obtained in alert subjects, and this is known as rapid eye movement (REM) sleep. Some other interesting features occurring during nondreaming sleep periods resemble alphalike activity and are known as sleep spindles. More details can be found in Duffy et al. 1989 and Niedermeyer and Lopes Da Silva 1998.
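The EEG band limits listed above can be kept in a small lookup table. The sketch below is only a convenience for experimenting with the numbers. The text gives shared band edges (4, 8, 13, and 22 Hz) without saying which band owns them, so treating each band as a half-open interval is an assumption made here, as is the function name.

```python
# Band edges in hertz as listed in the text. Treating each band as the
# half-open interval [low, high) is an assumption, made so that every
# shared edge (4, 8, 13, 22 Hz) falls into exactly one band.
EEG_BANDS = [
    ("delta", 0.5, 4.0),
    ("theta", 4.0, 8.0),
    ("alpha", 8.0, 13.0),
    ("beta", 13.0, 22.0),
    ("gamma", 22.0, 30.0),
]

def eeg_rhythm(frequency_hz):
    """Return the rhythm name for a frequency, or None if outside 0.5-30 Hz."""
    for name, low, high in EEG_BANDS:
        if low <= frequency_hz < high:
            return name
    return None

for f in (2.0, 10.0, 25.0, 40.0):
    print(f"{f:5.1f} Hz -> {eeg_rhythm(f)}")
```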

FIGURE 1.4 Typical sleep laboratory recordings. The two top signals show eye movements, the next one illustrates EMG (electromyogram) or muscle tonus, and the last one illustrates brain waves (EEG) during the onset of a REM sleep period (from Rechtschaffen and Kales 1968).

The beat-to-beat fluctuations in heart rate and other cardiovascular variables, such as arterial blood pressure and stroke volume, are mediated by the joint activity of the sympathetic and parasympathetic systems. Figure 1.5 shows time series for the heart rate and systolic arterial blood pressure. We note that both heart rate and blood pressure fluctuate in a complex manner that depends on the mental or physiological state of the subject. The individual or joint analysis of such time series can help to understand the operation of the cardiovascular system, predict cardiovascular diseases, and help in the development of drugs and devices for cardiac-related problems (Grossman et al. 1996; Malik and Camm 1995; Saul 1990).

Geophysical signals. Remote sensing systems use a variety of electro-optical sensors that span the infrared, visible, and ultraviolet regions of the spectrum and find many civilian and defense applications. Figure 1.6 shows two segments of infrared scans obtained by a space-based radiometer looking down at earth (Manolakis et al. 1994). The shape of the profiles depends on the transmission properties of the atmosphere and the objects in the radiometer’s field-of-view (terrain or sky background). The statistical characterization and modeling of infrared backgrounds are critical for the design of systems to detect missiles against such backgrounds as earth’s limb, auroras, and deep-space star fields (Sabins 1987; Colwell 1983). Other geophysical signals of interest are recordings of natural and man-made seismic events and seismic signals used in geophysical prospecting (Bolt 1993; Dobrin 1988; Sheriff 1994).

FIGURE 1.5 Simultaneous recordings of the heart rate and systolic blood pressure signals for a subject at rest.

FIGURE 1.6 Time series of infrared radiation measurements obtained by a scanning radiometer.

Radar signals. We conveniently define a radar system to consist of both a transmitter and a receiver. When the transmitter and receiver are colocated, the radar system is said to be monostatic, whereas if they are spatially separated, the system is bistatic. The radar first transmits a waveform, which propagates through space as electromagnetic energy, and then measures the energy returned to the radar via reflections. When the returns are due to an object of interest, the signal is known as a target, while undesired reflections from the earth’s surface are referred to as clutter. In addition, the radar may encounter energy transmitted by a hostile opponent attempting to jam the radar and prevent detection of certain targets. Collectively, clutter and jamming signals are referred to as interference. The challenge facing the radar system is how to extract the targets of interest in the presence of sometimes severe interference environments. Target detection is accomplished by using adaptive processing methods that exploit characteristics of the interference in order to suppress these undesired signals.

A transmitted radar signal propagates through space as electromagnetic energy at approximately the speed of light c = 3 × 10^8 m/s. The signal travels until it encounters an object that reflects the signal’s energy. A portion of the reflected energy returns to the radar receiver along the same path. The round-trip delay of the reflected signal determines the distance or range of the object from the radar. The radar has a certain receive aperture, either a continuous aperture or one made up of a series of sensors. The relative delay of a signal as it propagates across the radar aperture determines its angle of arrival, or bearing. The extent of the aperture determines the accuracy to which the radar can determine the direction of a target. Typically, the radar transmits a series of pulses at a rate known as the pulse repetition frequency.
Any target motion produces a phase shift in the returns from successive pulses caused by the Doppler effect. This phase shift across the series of pulses is known as the Doppler frequency of the target, which in turn determines the target radial velocity. The collection of these various parameters (range, angle, and velocity) allows the radar to locate and track a target. An example of a radar signal as a function of range in kilometers (km) is shown in Figure 1.7. The signal is made up of a target, clutter, and thermal noise. All the signals have been normalized with respect to the thermal noise floor. Therefore, the normalized noise has unit variance (0 dB). The target signal is at a range of 100 km with a signal-to-noise ratio (SNR) of 15 dB. The clutter, on the other hand, is present at all ranges and is highly nonstationary. Its power levels vary from approximately 40 dB at near ranges down to the thermal noise floor (0 dB) at far ranges. Part of the nonstationarity in the clutter is due to the range falloff of the clutter as its power is attenuated as a function of range. However, the rises and dips present between 100 and 200 km are due to terrain-specific artifacts. Clearly, the target is not visible, and the clutter interference must be removed or canceled in order

FIGURE 1.7 Example of a radar return signal, plotted as relative power with respect to noise versus range.

to detect the target. The challenge here is how to cancel such a nonstationary signal in order to extract the target signal; this challenge motivates the use of adaptive techniques that can adapt to the rapidly changing interference environment. More details about radar and radar signal processing can be found in Skolnik 1980; Skolnik 1990; and Nathanson 1991.

1.2 SPECTRAL ESTIMATION

The central objective of signal analysis is the development of quantitative techniques to study the properties of a signal and the differences and similarities between two or more signals from the same or different sources. The major areas of random signal analysis are (1) statistical analysis of signal amplitude (i.e., the sample values); (2) analysis and modeling of the correlation among the samples of an individual signal; and (3) joint signal analysis (i.e., simultaneous analysis of two signals in order to investigate their interaction or interrelationships). These techniques are summarized in Figure 1.8. The prominent tool in signal analysis is spectral estimation, which is a generic term for a multitude of techniques used to estimate the distribution of energy or power of a signal from a set of observations. Spectral estimation is a very complicated process that requires a deep understanding of the underlying theory and a great deal of practical experience. Spectral analysis finds many applications in areas such as medical diagnosis, speech analysis, seismology and geophysics, radar and sonar, nondestructive fault detection, testing of physical theories, and evaluating the predictability of time series.

FIGURE 1.8 Summary of random signal analysis techniques. [Diagram labels: autocorrelation, power spectrum, parametric models, self-similarity, higher-order statistics; cross-correlation, cross power spectrum, coherence, frequency response, higher-order statistics.]

Amplitude distribution. The range of values taken by the samples of a signal and how often the signal assumes these values together determine the signal variability. The signal variability can be seen by plotting the time series and is quantified by the histogram of the signal samples, which shows the percentage of the signal amplitude values within a certain range. The numerical description of signal variability, which depends only on the value of the signal samples and not on their ordering, involves quantities such as mean value, median, variance, and dynamic range.
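The order-independent descriptors named above can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the function name `amplitude_summary`, the bin count, and the sample values are assumptions made for the example, and dynamic range is taken here simply as the difference between the largest and smallest sample.

```python
import statistics

def amplitude_summary(x, num_bins=5):
    """Order-independent description of signal variability."""
    lo, hi = min(x), max(x)
    width = (hi - lo) / num_bins or 1.0   # guard against a constant signal
    counts = [0] * num_bins
    for v in x:
        # Clamp so that the maximum value falls in the last bin.
        k = min(int((v - lo) / width), num_bins - 1)
        counts[k] += 1
    return {
        "mean": statistics.fmean(x),
        "median": statistics.median(x),
        "variance": statistics.pvariance(x),
        "dynamic_range": hi - lo,          # assumed definition: max minus min
        # Percentage of samples falling in each amplitude bin.
        "histogram_percent": [100.0 * c / len(x) for c in counts],
    }

x = [0.0, 1.0, 1.0, 2.0, 4.0]              # hypothetical sample values
summary = amplitude_summary(x, num_bins=4)
reordered = amplitude_summary(x[::-1], num_bins=4)   # same values, reversed order
```

Because every quantity depends only on the sample values, `summary` and `reordered` agree even though the time ordering differs.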

Figure 1.9 shows the one-step increments, that is, the first difference xd (n) = x(n) − x(n−1), or approximate derivative of the infrared signals shown in Figure 1.6, whereas Figure 1.10 shows their histograms. Careful examination of the shape of the histogram curves indicates that the second signal jumps quite frequently between consecutive samples with large steps. In other words, the probability of large increments is significant, as exemplified

FIGURE 1.9 One-step-increment time series for the infrared data shown in Figure 1.6.

FIGURE 1.10 Histograms for the infrared increment signals.

by the fat tails of the histogram in Figure 1.10(b). The knowledge of the probability of extreme values is essential in the design of detection systems for digital communications, military surveillance using infrared and radar sensors, and intensive care monitoring. In general, the shape of the histogram, or more precisely the probability density, is very important in applications such as signal coding and event detection. Although many practical signals follow a Gaussian distribution, many other signals of practical interest have distributions that are non-Gaussian. For example, speech signals have a probability density that can be reasonably approximated by a gamma distribution (Rabiner and Schafer 1978). The significance of the Gaussian distribution in signal processing stems from the following facts. First, many physical signals can be described by Gaussian processes. Second, the central limit theorem (see Chapter 3) states that any process that is the result of the combination of many elementary processes will tend, under quite general conditions, to be Gaussian. Finally, linear systems preserve the Gaussianity of their input signals. To understand the last two statements, consider N independent random quantities x1 , x2 , . . . , xN with the same probability density p(x) and pose the following question: When does the probability distribution pN (x) of their sum x = x1 + x2 + · · · + xN have the same shape (within a scale factor) as the distribution p(x) of the individual quantities? The standard answer is that p(x) should be Gaussian, because the sum of N Gaussian random variables is again a Gaussian, but with variance equal to N times that of the individual signals. However, if we allow for distributions with infinite variance, additional solutions are possible. 
The resulting probability distributions, known as stable or Lévy distributions, have infinite variance and are characterized by a thin main lobe and fat tails, resembling the shape of the histogram in Figure 1.10(b). Interestingly enough, the Gaussian distribution is a stable distribution with finite variance (actually the only one). Because Gaussian and stable non-Gaussian distributions are invariant under linear signal processing operations, they are very important in signal processing.

Correlation and spectral analysis. Although scatter plots (see Figure 1.1) illustrate nicely the existence of correlation, to obtain quantitative information about the correlation structure of a time series x(n) with zero mean value, we use the empirical normalized autocorrelation sequence

$$\hat{\rho}(l) = \frac{\sum_{n=l}^{N-1} x(n)\,x^*(n-l)}{\sum_{n=0}^{N-1} |x(n)|^2}$$

which is an estimate of the theoretical normalized autocorrelation sequence. For lag l = 0, the sequence is perfectly correlated with itself and we get the maximum value of 1. If the sequence does not change significantly from sample to sample, the correlation of the sequence with its shifted copies, though diminished, is still close to 1. Usually, the correlation decreases as the lag increases because distant samples become less and less dependent. Note that reordering the samples of a time series changes its autocorrelation but not its histogram. We say that signals whose empirical autocorrelation decays fast, such as an exponential, have short-memory or short-range dependence. If the empirical autocorrelation decays very slowly, as a hyperbolic function does, we say that the signal has long-memory or long-range dependence. These concepts will be formulated in a theoretical framework in Chapter 3. Furthermore, we shall see in the next section that effective modeling of time series with short or long memory requires different types of models. The spectral density function shows the distribution of signal power or energy as a function of frequency (see Figure 1.11). The autocorrelation and the spectral density of a signal form a Fourier transform pair and hence contain the same information. However, they present this information in different forms, and one can reveal information that cannot

FIGURE 1.11 Illustration of the concept of power or energy spectral density function of a random signal.

be easily extracted from the other. It is fair to say that the spectral density is more widely used than the autocorrelation. Although the correlation and spectral density functions are the most widely used tools for signal analysis, there are applications that require the use of correlations among three or more samples and the corresponding spectral densities. These quantities, which are useful when we deal with non-Gaussian processes and nonlinear systems, belong to the area of higher-order statistics and are described in Chapter 12. Joint signal analysis. In many applications, we are interested in the relationship between two different random signals. There are two cases of interest. In the first case, the two signals are of the same or similar nature, and we want to ascertain and describe the similarity or interaction between them. For example, we may want to investigate if there is any similarity in the fluctuation of infrared radiation in the two profiles of Figure 1.6. In the second case, we may have reason to believe that there is a causal relationship between the two signals. For example, one signal may be the input to a system and the other signal the output. The task in this case is to find an accurate description of the system, that is, a description that allows accurate estimation of future values of the output from the input. This process is known as system modeling or identification and has many practical applications, including understanding the operation of a system in order to improve the design of new systems or to achieve better control of existing systems. In this book, we will study joint signal analysis techniques that can be used to understand the dynamic behavior between two or more signals. An interesting example involves using signals, like the ones in Figure 1.5, to see if there is any coupling between blood pressure and heart rate. 
Some interesting results regarding the effect of respiration and blood pressure on heart rate are discussed in Chapter 5.
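The empirical normalized autocorrelation discussed under correlation and spectral analysis can be sketched as follows, assuming the usual estimator for a real zero-mean sequence (the sum of lagged products divided by the signal energy). The test signal, the seed, and the function name are hypothetical choices for illustration only.

```python
import random

def normalized_autocorrelation(x, max_lag):
    """Empirical normalized autocorrelation of a zero-mean real sequence."""
    n = len(x)
    energy = sum(v * v for v in x)
    return [
        sum(x[i] * x[i - lag] for i in range(lag, n)) / energy
        for lag in range(max_lag + 1)
    ]

random.seed(0)
# Hypothetical slowly varying test signal: a first-order recursion on white noise.
x, prev = [], 0.0
for _ in range(2000):
    prev = 0.9 * prev + random.gauss(0.0, 1.0)
    x.append(prev)
mean = sum(x) / len(x)
x = [v - mean for v in x]                  # remove the sample mean

rho = normalized_autocorrelation(x, max_lag=5)

shuffled = x[:]
random.shuffle(shuffled)                   # same amplitudes, different ordering
rho_shuffled = normalized_autocorrelation(shuffled, max_lag=5)
```

The shuffled copy has exactly the same histogram as the original, yet its autocorrelation at nonzero lags collapses toward zero, which illustrates the remark that reordering changes the autocorrelation but not the histogram.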

1.3 SIGNAL MODELING

In many theoretical and practical applications, we are interested in generating random signals with certain properties or obtaining an efficient representation of real-world random signals that captures a desired set of their characteristics (e.g., correlation or spectral features) in the best possible way. We use the term model to refer to a mathematical description that provides an efficient representation of the "essential" properties of a signal. For example, a finite segment $\{x(n)\}_{n=0}^{N-1}$ of any signal can be approximated by a linear combination of constant ($\lambda_k = 1$) or exponentially fading ($0 < \lambda_k < 1$) sinusoids

$$x(n) = \sum_{k=1}^{M} a_k \lambda_k^{n} \cos(\omega_k n + \phi_k) \qquad (1.3.1)$$

where $\{a_k, \lambda_k, \omega_k, \phi_k\}_{k=1}^{M}$ are the model parameters. A good model should provide an accurate description of the signal with $4M \ll N$ parameters. From a practical viewpoint, we are most interested in parametric models, which assume a given functional form completely specified by a finite number of parameters. In contrast, nonparametric models do not put any restriction on the functional form or the number of model parameters. If any of the model parameters in (1.3.1) is random, the result is a random signal. The most widely used model is given by

$$x(n) = \sum_{k=1}^{M} a_k \cos(\omega_k n + \phi_k)$$

where the amplitudes $\{a_k\}_1^M$ and the frequencies $\{\omega_k\}_1^M$ are constants and the phases $\{\phi_k\}_1^M$ are random. This model is known as the harmonic process model and has many theoretical and practical applications (see Chapters 3 and 9). Suppose next that we are given a sequence w(n) of independent and identically distributed observations. We can create a time series x(n) with dependent observations, by linearly combining the values of w(n) as

$$x(n) = \sum_{k=-\infty}^{\infty} h(k)\, w(n-k) \qquad (1.3.2)$$

which results in the widely used linear random signal model. The model specified by the convolution summation (1.3.2) is clearly nonparametric because, in general, it depends on an infinite number of parameters. Furthermore, the model is a linear, time-invariant system with impulse response h(k) that determines the memory of the model and, therefore, the dependence properties of the output x(n). By properly choosing the weights h(k), we can generate a time series with almost any type of dependence among its samples.

In practical applications, we are interested in linear parametric models. As we will see, parametric models exhibit a dependence imposed by their structure. However, if the number of parameters approaches the range of the dependence (in number of samples), the model can mimic any form of dependence. The list of desired features for a good model includes these: (1) the number of model parameters should be as small as possible (parsimony), (2) estimation of the model parameters from the data should be easy, and (3) the model parameters should have a physically meaningful interpretation.

If we can develop a successful parametric model for the behavior of a signal, then we can use the model for various applications:

1. To achieve a better understanding of the physical mechanism generating the signal (e.g., earth structure in the case of seismograms).
2. To track changes in the source of the signal and help identify their cause (e.g., EEG).
3. To synthesize artificial signals similar to the natural ones (e.g., speech, infrared backgrounds, natural scenes, data network traffic).
4. To extract parameters for pattern recognition applications (e.g., speech and character recognition).
5. To get an efficient representation of signals for data compression (e.g., speech, audio, and video coding).
6. To forecast future signal behavior (e.g., stock market indexes) (Pindyck and Rubinfeld 1998).

In practice, signal modeling involves the following steps: (1) selection of an appropriate model, (2) selection of the “right” number of parameters, (3) fitting of the model to the actual data, and (4) model testing to see if the model satisfies the user requirements for the particular application. As we shall see in Chapter 9, this process is very complicated and depends heavily on the understanding of the theoretical model properties (see Chapter 4), the amount of familiarity with the particular application, and the experience of the user.
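The linear random signal model described earlier in this section, in which dependent samples are produced by linearly combining independent ones, can be sketched as below. The sketch assumes a causal impulse response truncated to a finite number of weights, which is a simplification of the general convolution sum; the weights, seed, and helper names are hypothetical.

```python
import random

def linear_model_output(w, h):
    """x(n) = sum_k h(k) w(n - k) for a causal, finite-length h."""
    return [
        sum(h[k] * w[n - k] for k in range(len(h)) if n - k >= 0)
        for n in range(len(w))
    ]

def lag_one_correlation(s):
    """Normalized correlation between neighboring samples."""
    num = sum(s[i] * s[i - 1] for i in range(1, len(s)))
    return num / sum(v * v for v in s)

random.seed(1)
w = [random.gauss(0.0, 1.0) for _ in range(1000)]   # i.i.d. observations

# Hypothetical weights: an exponentially decaying response truncated to 30 taps.
h = [0.8 ** k for k in range(30)]
x = linear_model_output(w, h)

dependence_in = lag_one_correlation(w)     # close to zero for i.i.d. input
dependence_out = lag_one_correlation(x)    # clearly positive after filtering
```

Changing `h` changes the memory of the model: a longer or more slowly decaying set of weights produces dependence that extends over more samples.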

1.3.1 Rational or Pole-Zero Models

Suppose that a given sample x(n), at time n, can be approximated by the previous sample weighted by a coefficient a, that is, $x(n) \approx a\,x(n-1)$, where a is assumed constant over the signal segment to be modeled. To make the above relationship exact, we add an excitation term w(n), resulting in

$$x(n) = a\,x(n-1) + w(n)$$

where w(n) is an excitation sequence. Taking the z-transform of both sides (discussed in Chapter 2), we have

$$X(z) = a z^{-1} X(z) + W(z)$$

which results in the following system function:

$$H(z) = \frac{X(z)}{W(z)} = \frac{1}{1 - a z^{-1}} \qquad (1.3.5)$$

By using the identity

$$\frac{1}{1 - a z^{-1}} = 1 + a z^{-1} + a^2 z^{-2} + \cdots$$

(4.2.23)

corresponds to the noncausal sequence $h(-n) = (-a)^{-n} u(-n)$, and its ROC is $|z| < 1/|a|$. Hence,

$$R_h(z) = H(z)H(z^{-1}) = \frac{1}{(1 + a z^{-1})(1 + a z)} \qquad (4.2.24)$$

which corresponds to a two-sided sequence because its ROC, $|a| < |z| < 1/|a|$, is a ring in the z-plane. Using partial fraction expansion, we obtain

$$R_h(z) = \frac{1}{1 - a^2}\,\frac{-a z^{-1}}{1 + a z^{-1}} + \frac{1}{1 - a^2}\,\frac{1}{1 + a z} \qquad (4.2.25)$$

The pole $p = -a$ corresponds to the causal sequence $[1/(1 - a^2)](-a)^{l} u(l - 1)$, and the pole $p = -1/a$ to the anticausal sequence $[1/(1 - a^2)](-a)^{-l} u(-l)$. Combining the two terms, we obtain

$$r_h(l) = \frac{1}{1 - a^2}\,(-a)^{|l|} \qquad -\infty < l < \infty \qquad (4.2.26)$$

or

$$\rho_h(l) = (-a)^{|l|} \qquad (4.2.27)$$

Note that complex conjugate poles will contribute two-sided damped sinusoidal terms obtained by combining pairs of the form (4.2.27) with $a = p$ and $a = p^*$.

159 section 4.2 All-Pole Models

160 chapter 4 Linear Signal Models

Impulse train excitations. The response of an AP(P) model to a periodic impulse train with period L is periodic with the same period and is given by

$$\tilde{h}(n) + \sum_{k=1}^{P} a_k \tilde{h}(n-k) = d_0 \sum_{m=-\infty}^{\infty} \delta(n + Lm) = \begin{cases} d_0 & n + Lm = 0 \\ 0 & n + Lm \neq 0 \end{cases} \qquad (4.2.28)$$

which shows that the prediction error is zero for samples inside the period and $d_0$ at the beginning of each period. If we multiply both sides of (4.2.28) by $\tilde{h}^*(n-l)$ and sum over a period $0 \le n \le L-1$, we obtain

$$\tilde{r}_h(l) + \sum_{k=1}^{P} a_k \tilde{r}_h(l-k) = \frac{d_0}{L}\,\tilde{h}^*(-l) \qquad \text{all } l \qquad (4.2.29)$$

where $\tilde{r}_h(l)$ is the periodic autocorrelation of $\tilde{h}(n)$. Since, in contrast to $h(n)$ in (4.2.15), $\tilde{h}(n)$ is not necessarily zero for $n < 0$, the periodic autocorrelation $\tilde{r}_h(l)$ will not in general obey the linear prediction equation anywhere. Similar results can be obtained for harmonic process excitations.

Model parameters in terms of autocorrelation. Equations (4.2.15) for $l = 0, 1, \ldots, P$ comprise $P + 1$ equations that relate the $P + 1$ parameters of $H(z)$, namely, $d_0$ and $\{a_k, 1 \le k \le P\}$, to the first $P + 1$ autocorrelation coefficients $r_h(0), r_h(1), \ldots, r_h(P)$. These $P + 1$ equations can be written in matrix form as

$$\begin{bmatrix} r_h(0) & r_h^*(1) & \cdots & r_h^*(P) \\ r_h(1) & r_h(0) & \cdots & r_h^*(P-1) \\ \vdots & \vdots & \ddots & \vdots \\ r_h(P) & r_h(P-1) & \cdots & r_h(0) \end{bmatrix} \begin{bmatrix} 1 \\ a_1 \\ \vdots \\ a_P \end{bmatrix} = \begin{bmatrix} |d_0|^2 \\ 0 \\ \vdots \\ 0 \end{bmatrix} \qquad (4.2.30)$$

If we are given the first $P + 1$ autocorrelations, (4.2.30) comprises a system of $P + 1$ linear equations, with a Hermitian, Toeplitz matrix that can be solved for $d_0$ and $\{a_k\}$. Because of the special structure in (4.2.30), the model parameters are found from the autocorrelations by using the last set of $P$ equations in (4.2.30), followed by the computation of $d_0$ from the first equation, which is the same as (4.2.17). From (4.2.30), we can write in matrix notation

$$\mathbf{R}_h \mathbf{a} = -\mathbf{r}_h \qquad (4.2.31)$$

where $\mathbf{R}_h$ is the autocorrelation matrix, $\mathbf{a}$ is the vector of the model parameters, and $\mathbf{r}_h$ is the vector of autocorrelations. Since $r_x(l) = \sigma_w^2 r_h(l)$, we can also express the model parameters in terms of the autocorrelation $r_x(l)$ of the output process $x(n)$ as follows:

$$\mathbf{R}_x \mathbf{a} = -\mathbf{r}_x \qquad (4.2.32)$$

These equations are known as the Yule-Walker equations in the statistics literature. In the sequel, we drop the subscript from the autocorrelation sequence or matrix whenever the analysis holds for both the impulse response and the model output. Because of the Toeplitz structure and the nature of the right-hand side, the linear systems (4.2.31) and (4.2.32) can be solved recursively by using the algorithm of Levinson-Durbin (see Section 7.4). After a is solved for, the system gain d0 can be computed from (4.2.17). Therefore, given r(0), r(1), . . . , r(P ), we can completely specify the parameters of the all-pole model by solving a set of linear equations. Below, we will see that the converse is also true: Given the model parameters, we can find the first P + 1 autocorrelations by

solving a set of linear equations. This elegant solution of the spectral factorization problem is unique to all-pole models. In the case in which the model contains zeros ($Q \neq 0$), the spectral factorization problem requires the solution of a nonlinear system of equations.

Autocorrelation in terms of model parameters. If we normalize the autocorrelations in (4.2.31) by dividing throughout by $r(0)$, we obtain the following system of equations

$$\mathbf{P}\mathbf{a} = -\boldsymbol{\rho} \qquad (4.2.33)$$

where $\mathbf{P}$ is the normalized autocorrelation matrix and

$$\boldsymbol{\rho} = [\rho(1)\ \rho(2)\ \cdots\ \rho(P)]^H \qquad (4.2.34)$$

is the vector of normalized autocorrelations. This set of $P$ equations relates the $P$ model coefficients with the first $P$ (normalized) autocorrelation values. If the poles of the all-pole filter are strictly inside the unit circle, the mapping between the $P$-dimensional vectors $\mathbf{a}$ and $\boldsymbol{\rho}$ is unique. If, in fact, we are given the vector $\mathbf{a}$, then the normalized autocorrelation vector $\boldsymbol{\rho}$ can be computed from $\mathbf{a}$ by using the set of equations that can be deduced from (4.2.33)

$$\mathbf{A}\boldsymbol{\rho} = -\mathbf{a} \qquad (4.2.35)$$

where $A_{ij} = a_{i-j} + a_{i+j}$, assuming $a_m = 0$ for $m < 0$ and $m > P$ (see Problem 4.6). Given the set of coefficients in $\mathbf{a}$, $\boldsymbol{\rho}$ can be obtained by solving (4.2.35). We will see that, under the assumption of a stable $H(z)$, a solution always exists. Furthermore, there exists a simple, recursive solution that is efficient (see Section 7.5). If, in addition to $\mathbf{a}$, we are given $d_0$, we can evaluate $r(0)$ with (4.2.20) from $\boldsymbol{\rho}$ computed by (4.2.35). Autocorrelation values $r(l)$ for lags $l > P$ are found by using the recursion in (4.2.18) with $r(0), r(1), \ldots, r(P)$.

EXAMPLE 4.2.3.

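For the AP(3) model with real coefficients we have

$$\begin{bmatrix} r(0) & r(1) & r(2) \\ r(1) & r(0) & r(1) \\ r(2) & r(1) & r(0) \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \\ a_3 \end{bmatrix} = -\begin{bmatrix} r(1) \\ r(2) \\ r(3) \end{bmatrix} \qquad (4.2.36)$$

$$d_0^2 = r(0) + a_1 r(1) + a_2 r(2) + a_3 r(3) \qquad (4.2.37)$$

Therefore, given $r(0), r(1), r(2), r(3)$, we can find the parameters of the all-pole model by solving (4.2.36) and then substituting into (4.2.37). Suppose now that instead we are given the model parameters $d_0, a_1, a_2, a_3$. If we divide both sides of (4.2.36) by $r(0)$ and solve for the normalized autocorrelations $\rho(1)$, $\rho(2)$, and $\rho(3)$, we obtain

$$\begin{bmatrix} 1 + a_2 & a_3 & 0 \\ a_1 + a_3 & 1 & 0 \\ a_2 & a_1 & 1 \end{bmatrix} \begin{bmatrix} \rho(1) \\ \rho(2) \\ \rho(3) \end{bmatrix} = -\begin{bmatrix} a_1 \\ a_2 \\ a_3 \end{bmatrix} \qquad (4.2.38)$$

The value of $r(0)$ is obtained from

$$r(0) = \frac{d_0^2}{1 + a_1 \rho(1) + a_2 \rho(2) + a_3 \rho(3)} \qquad (4.2.39)$$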

If r(0) = 2, r(1) = 1.6, r(2) = 1.2, and r(3) = 1, the Toeplitz matrix in (4.2.36) is positive definite because it has positive eigenvalues. Solving the linear system gives a1 = −0.9063, a2 = 0.2500, and a3 = −0.1563. Substituting these values in (4.2.37), we obtain d0 = 0.8329. Using the last two relations, we can recover the autocorrelation from the model parameters.
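The numbers in this example can be reproduced with a short script. The sketch below is one possible way to do it in plain Python: it solves (4.2.36) with a small Gaussian elimination routine (the helper `solve` is an assumption of this illustration, not a routine from the text), computes the gain from (4.2.37), and then maps the parameters back to the autocorrelations through (4.2.38) and (4.2.39).

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

r = [2.0, 1.6, 1.2, 1.0]                   # r(0), ..., r(3) from the example

# (4.2.36): Toeplitz system R a = -[r(1) r(2) r(3)]^T
R = [[r[abs(i - j)] for j in range(3)] for i in range(3)]
a = solve(R, [-r[1], -r[2], -r[3]])

# (4.2.37): gain from the first Yule-Walker equation
d0 = (r[0] + sum(a[k] * r[k + 1] for k in range(3))) ** 0.5

# (4.2.38): recover the normalized autocorrelations from a1, a2, a3
a1, a2, a3 = a
A = [[1 + a2, a3, 0.0],
     [a1 + a3, 1.0, 0.0],
     [a2, a1, 1.0]]
rho = solve(A, [-a1, -a2, -a3])

# (4.2.39): recover r(0) from the gain and the normalized autocorrelations
r0 = d0 ** 2 / (1 + a1 * rho[0] + a2 * rho[1] + a3 * rho[2])
```

Running the forward and inverse mappings in sequence returns the normalized autocorrelations 0.8, 0.6, 0.5 and $r(0) = 2$, which is the round trip the example describes.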

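Correlation matching. All-pole models have the unique distinction that the model parameters are completely specified by the first $P + 1$ autocorrelation coefficients via a set of linear equations. We can write

$$\begin{bmatrix} d_0 \\ \mathbf{a} \end{bmatrix} \leftrightarrow \begin{bmatrix} r(0) \\ \boldsymbol{\rho} \end{bmatrix} \qquad (4.2.40)$$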


that is, the mapping of the model parameters $\{d_0, a_1, a_2, \ldots, a_P\}$ to the autocorrelation coefficients specified by the vector $\{r(0), \rho(1), \ldots, \rho(P)\}$ is reversible and unique. This statement implies that given any set of autocorrelation values $r(0), r(1), \ldots, r(P)$, we can always find an all-pole model whose first $P + 1$ autocorrelation coefficients are equal to the given autocorrelations. This correlation matching of all-pole models is quite remarkable. This property is not shared by all-zero models and is true for pole-zero models only under certain conditions, as we will see in Section 4.4.

Spectrum. The z-transform of the autocorrelation $r(l)$ of $H(z)$ is given by

$$R(z) = H(z)\,H^*\!\left(\frac{1}{z^*}\right) \qquad (4.2.41)$$

The spectrum is then equal to

$$R(e^{j\omega}) = |H(e^{j\omega})|^2 = \frac{|d_0|^2}{|A(e^{j\omega})|^2} \qquad (4.2.42)$$

The right-hand side of (4.2.42) suggests a method for computing the spectrum: First compute $A(e^{j\omega})$ by taking the Fourier transform of the sequence $\{1, a_1, \ldots, a_P\}$, then take the squared magnitude and divide $|d_0|^2$ by the result. The fast Fourier transform (FFT) can be used to this end by appending the sequence $\{1, a_1, \ldots, a_P\}$ with as many zeros as needed to compute the desired number of frequency points.

Partial autocorrelation and lattice structures. We have seen that an AP(P) model is completely described by the first $P + 1$ values of its autocorrelation. However, we cannot determine the order of the model by using the autocorrelation sequence because it has infinite duration. Suppose that we start fitting models of increasing order $m$, using the autocorrelation sequence of an AP(P) model and the Yule-Walker equations

$$\begin{bmatrix} 1 & \rho^*(1) & \cdots & \rho^*(m-1) \\ \rho(1) & 1 & \cdots & \vdots \\ \vdots & \vdots & \ddots & \rho^*(1) \\ \rho(m-1) & \cdots & \rho(1) & 1 \end{bmatrix} \begin{bmatrix} a_1^{(m)} \\ a_2^{(m)} \\ \vdots \\ a_m^{(m)} \end{bmatrix} = -\begin{bmatrix} \rho^*(1) \\ \rho^*(2) \\ \vdots \\ \rho^*(m) \end{bmatrix} \qquad (4.2.43)$$

Since $a_m^{(m)} = 0$ for $m > P$, we can use the sequence $a_m^{(m)}$, $m = 1, 2, \ldots$, which is known as the partial autocorrelation sequence (PACS), to determine the order of the all-pole model. Recall from Section 2.5 that

$$a_m^{(m)} = k_m \qquad (4.2.44)$$

that is, the PACS is identical to the lattice parameters. A statistical definition and interpretation of the PACS are also given in Section 7.2. The PACS can be defined for any valid (i.e., positive definite) autocorrelation sequence and can be efficiently computed by using the algorithms of Levinson-Durbin and Schur (see Chapter 7). Furthermore, it has been shown (Burg 1975) that

$$r(0) \prod_{m=1}^{P} \frac{1 - |k_m|}{1 + |k_m|} \le R(e^{j\omega}) \le r(0) \prod_{m=1}^{P} \frac{1 + |k_m|}{1 - |k_m|} \qquad (4.2.45)$$

which indicates that the spectral dynamic range increases if some lattice parameter moves close to 1 or equivalently some pole moves close to the unit circle. Equivalent model representations. From the previous discussions (see also Chapter 7) we conclude that a minimum-phase AP(P ) model can be uniquely described by any one of the following representations:

1. Direct structure: $\{d_0, a_1, a_2, \ldots, a_P\}$
2. Lattice structure: $\{d_0, k_1, k_2, \ldots, k_P\}$
3. Autocorrelation: $\{r(0), r(1), \ldots, r(P)\}$


where we assume, without loss of generality, that $d_0 > 0$. Note that the minimum-phase property requires that all poles be inside the unit circle or all $|k_m| < 1$ or that $\mathbf{R}_{P+1}$ be positive definite. The transformation from any of the above representations to any other can be done by using the algorithms developed in Section 7.5.

Minimum-phase conditions. As we will show in Section 7.5, if the Toeplitz matrix $\mathbf{R}_h$ (or equivalently $\mathbf{R}_x$) is positive definite, then $|k_m| < 1$ for all $m = 1, 2, \ldots, P$. Therefore, the AP(P) model obtained by solving the Yule-Walker equations is minimum-phase. Thus, the Yule-Walker equations provide a simple and elegant solution to the spectral factorization problem for all-pole models.

EXAMPLE 4.2.4. The poles of the model obtained in Example 4.2.3 are 0.8316, 0.0373 + 0.4319i, and 0.0373 − 0.4319i. We see that the poles are inside the unit circle and that the autocorrelation sequence is positive definite. If we set $r_h(2) = -1.2$, the autocorrelation is no longer positive definite and the obtained model $\mathbf{a} = [1\ \ {-1.222}\ \ 1.1575]^T$, $d_0 = 2.2271$, is nonminimum-phase.
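The link between the autocorrelation, the lattice parameters, and the minimum-phase condition can be checked numerically. The following is a sketch of the standard Levinson-Durbin recursion restricted to real autocorrelations; it is written for this illustration and is not the implementation developed in Chapter 7. The function name and return layout are assumptions.

```python
def levinson_durbin(r):
    """Return (a, k, E) for real autocorrelations r(0), ..., r(P).

    a: coefficients a_1..a_P of A(z) = 1 + sum_k a_k z^{-k}
    k: lattice parameters k_1..k_P, where k_m equals a_m^{(m)}
    E: final prediction-error power, equal to d0^2 for the AP(P) model
    """
    a, k, E = [], [], r[0]
    for m in range(1, len(r)):
        acc = r[m] + sum(a[i] * r[m - 1 - i] for i in range(m - 1))
        km = -acc / E
        # Order update: a_i^{(m)} = a_i^{(m-1)} + k_m a_{m-i}^{(m-1)}
        a = [a[i] + km * a[m - 2 - i] for i in range(m - 1)] + [km]
        k.append(km)
        E *= 1.0 - km * km
    return a, k, E

# Autocorrelations of Example 4.2.3: all |k_m| < 1, so the model is minimum-phase.
a, k, E = levinson_durbin([2.0, 1.6, 1.2, 1.0])
is_minimum_phase = all(abs(km) < 1 for km in k)

# Changing the sign of r(2) destroys positive definiteness, and some |k_m| >= 1.
_, k_bad, _ = levinson_durbin([2.0, 1.6, -1.2, 1.0])
```

For the first sequence the recursion reproduces the coefficients of Example 4.2.3, and `E` equals $d_0^2$. For the modified sequence the second lattice parameter already exceeds 1 in magnitude, which signals that the resulting model cannot be minimum-phase.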

Pole locations. The poles of $H(z)$ are the zeros $\{p_k\}$ of the polynomial $A(z)$. If the coefficients of $A(z)$ are assumed to be real, the poles are either real or come in complex conjugate pairs. In order for $H(z)$ to be minimum-phase, all poles must be inside the unit circle, that is, $|p_k| < 1$. The model parameters $a_k$ can be written as sums of products of the poles $p_k$. In particular, it is easy to see that

$$a_1 = -\sum_{k=1}^{P} p_k \qquad (4.2.46)$$

$$a_P = \prod_{k=1}^{P} (-p_k) \qquad (4.2.47)$$

Thus, the first coefficient a1 is the negative of the sum of the poles, and the last coefficient aP is the product of the negative of the individual poles. Since |pk | < 1, we must have |aP | < 1 for a minimum-phase polynomial for which a0 = 1. However, note that the reverse is not necessarily true: |aP | < 1 does not guarantee minimum phase. The roots pk can be computed by using any number of standard root-finding routines.
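The two relations between coefficients and poles can be verified with the pole values quoted in Example 4.2.4. This sketch expands $A(z) = \prod_k (1 - p_k z^{-1})$ directly in plain Python rather than calling a root finder; the variable names are illustrative, and the poles are used at the four-digit precision printed in the text.

```python
poles = [0.8316, 0.0373 + 0.4319j, 0.0373 - 0.4319j]   # quoted in Example 4.2.4

# Expand A(z) = prod_k (1 - p_k z^{-1}) into coefficients [1, a_1, ..., a_P].
coeffs = [1.0 + 0j]
for p in poles:
    shifted = [0j] + coeffs                 # coefficients multiplied by z^{-1}
    coeffs = [c - p * s for c, s in zip(coeffs + [0j], shifted)]

# (4.2.46): a_1 is the negative of the sum of the poles.
a1_from_sum = -sum(poles)

# (4.2.47): a_P is the product of the negated poles.
aP_from_product = 1.0 + 0j
for p in poles:
    aP_from_product *= -p

is_minimum_phase = all(abs(p) < 1 for p in poles)
```

The expanded coefficients agree with $a_1 = -0.9063$, $a_2 = 0.2500$, and $a_3 = -0.1563$ from Example 4.2.3 to within the rounding of the quoted poles.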

4.2.2 All-Pole Modeling and Linear Prediction

Consider the AP(P) model

$$x(n) = -\sum_{k=1}^{P} a_k x(n-k) + w(n) \qquad (4.2.48)$$

Now recall from Chapter 1 that the Mth-order linear predictor of $x(n)$ and the corresponding prediction error $e(n)$ are

$$\hat{x}(n) = -\sum_{k=1}^{M} a_k^{o} x(n-k) \qquad (4.2.49)$$

$$e(n) = x(n) - \hat{x}(n) = x(n) + \sum_{k=1}^{M} a_k^{o} x(n-k) \qquad (4.2.50)$$

or

$$x(n) = -\sum_{k=1}^{M} a_k^{o} x(n-k) + e(n) \qquad (4.2.51)$$

Notice that if the order of the linear predictor equals the order of the all-pole model (M = P ) and if ak0 = ak , then the prediction error is equal to the excitation of the all-pole model, that is, e(n) = w(n). Since all-pole modeling and FIR linear prediction are closely related, many properties and algorithms developed for one of them can be applied to the other. Linear prediction is extensively studied in Chapters 6 and 7.
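The statement that the prediction error equals the excitation when the predictor coefficients match the model coefficients can be checked with a small simulation. The sketch below assumes zero initial conditions and Gaussian excitation, and borrows the AP(3) coefficients of Example 4.2.3; the function names and the seed are hypothetical.

```python
import random

def ap_output(a, w):
    """x(n) = -sum_k a_k x(n-k) + w(n), with zero initial conditions."""
    x = []
    for n, wn in enumerate(w):
        past = sum(a[k] * x[n - 1 - k] for k in range(len(a)) if n - 1 - k >= 0)
        x.append(-past + wn)
    return x

def prediction_error(c, x):
    """e(n) = x(n) + sum_k c_k x(n-k), samples before n = 0 taken as zero."""
    return [
        x[n] + sum(c[k] * x[n - 1 - k] for k in range(len(c)) if n - 1 - k >= 0)
        for n in range(len(x))
    ]

random.seed(2)
a = [-0.90625, 0.25, -0.15625]             # AP(3) coefficients from Example 4.2.3
w = [random.gauss(0.0, 1.0) for _ in range(200)]
x = ap_output(a, w)

e = prediction_error(a, x)                 # predictor coefficients equal to a_k
max_gap = max(abs(en - wn) for en, wn in zip(e, w))

e_mismatch = prediction_error([-0.5], x)   # a deliberately wrong first-order predictor
max_gap_mismatch = max(abs(en - wn) for en, wn in zip(e_mismatch, w))
```

With matched coefficients the error filter undoes the all-pole recursion, so `max_gap` is at the level of floating-point rounding, while the mismatched predictor leaves a residual that differs visibly from the excitation.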

4.2.3 Autoregressive Models

Causal all-pole models excited by white noise play a major role in practical applications and are known as autoregressive (AR) models. An AR(P) model is defined by the difference equation

$$x(n) = -\sum_{k=1}^{P} a_k x(n-k) + w(n) \qquad (4.2.52)$$

where $\{w(n)\} \sim \mathrm{WN}(0, \sigma_w^2)$. An AR(P) model is valid only if the corresponding AP(P) system is stable. In this case, the output $x(n)$ is a stationary sequence with a mean value of zero. Postmultiplying (4.2.52) by $x^*(n-l)$ and taking the expectation, we obtain the following recursive relation for the autocorrelation:

$$r_x(l) = -\sum_{k=1}^{P} a_k r_x(l-k) + E\{w(n)x^*(n-l)\} \qquad (4.2.53)$$

Similarly, using (4.1.1), we can show that $E\{w(n)x^*(n-l)\} = \sigma_w^2 h^*(-l)$. Thus, we have

$$r_x(l) = -\sum_{k=1}^{P} a_k r_x(l-k) + \sigma_w^2 h^*(-l) \qquad \text{for all } l \qquad (4.2.54)$$

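The variance of the output signal is

$$\sigma_x^2 = r_x(0) = -\sum_{k=1}^{P} a_k r_x(k) + \sigma_w^2$$

or

$$\sigma_x^2 = \frac{\sigma_w^2}{1 + \sum_{k=1}^{P} a_k \rho_x(k)} \qquad (4.2.55)$$

If we substitute $l = 0, 1, \ldots, P$ in (4.2.54) and recall that $h(n) = 0$ for $n < 0$, we obtain the following set of Yule-Walker equations:

$$\begin{bmatrix} r_x(0) & r_x(1) & \cdots & r_x(P) \\ r_x^*(1) & r_x(0) & \cdots & r_x(P-1) \\ \vdots & \vdots & \ddots & \vdots \\ r_x^*(P) & r_x^*(P-1) & \cdots & r_x(0) \end{bmatrix} \begin{bmatrix} 1 \\ a_1 \\ \vdots \\ a_P \end{bmatrix} = \begin{bmatrix} \sigma_w^2 \\ 0 \\ \vdots \\ 0 \end{bmatrix} \qquad (4.2.56)$$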

Careful inspection of the above equations reveals their similarity to the corresponding relationships developed previously for the AP(P ) model. This should be no surprise since the power spectrum of the white noise is flat. However, there is one important difference we should clarify: AP(P ) models were specified with a gain d0 and the parameters {a1 , a2 , . . . , aP }, but for AR(P ) models we set the gain d0 = 1 and define the model by the

variance of the white excitation $\sigma_w^2$ and the parameters $\{a_1, a_2, \ldots, a_P\}$. In other words, we incorporate the gain of the model into the power of the input signal. Thus, the power spectrum of the output is $R_x(e^{j\omega}) = \sigma_w^2 |H(e^{j\omega})|^2$. Similar arguments apply to all parametric models driven by white noise. We just rederived some of the relationships to clarify these issues and to provide additional insight into the subject.

4.2.4 Lower-Order Models

In this section, we derive the properties of lower-order all-pole models, namely, first- and second-order models, with real coefficients.

First-order all-pole model: AP(1)

An AP(1) model has a transfer function

$$H(z) = \frac{d_0}{1 + a z^{-1}} \qquad (4.2.57)$$

with a single pole at $z = -a$ on the real axis. It is clear that $H(z)$ is minimum-phase if

$$-1 < a < 1 \qquad (4.2.58)$$

From (4.2.18) with $P = 1$ and $l = 1$, we have

$$a_1 = -\frac{r(1)}{r(0)} = -\rho(1) \qquad (4.2.59)$$

Similarly, from (4.2.44) with $m = 1$,

$$a_1^{(1)} = a = -\rho(1) = k_1 \qquad (4.2.60)$$

Since from (4.2.4), $h(0) = d_0$, and from (4.2.5) $h(n) = -a_1 h(n-1)$ for $n > 0$, the impulse response of a single-pole filter is given by

$$h(n) = d_0 (-a)^n u(n) \qquad (4.2.61)$$

The same result can, of course, be obtained by taking the inverse z-transform of $H(z)$. The autocorrelation is found in a similar fashion. From (4.2.18) and by using the fact that the autocorrelation is an even function,

$$r(l) = r(0)(-a)^{|l|} \qquad \text{for all } l \qquad (4.2.62)$$

and from (4.2.20)

$$r(0) = \frac{d_0^2}{1 - a^2} = \frac{d_0^2}{1 - k_1^2} \qquad (4.2.63)$$

Therefore, if the energy $r(0)$ in the impulse response is set to unity, then the gain must be set to

$$d_0 = \sqrt{1 - k_1^2} \qquad r(0) = 1 \qquad (4.2.64)$$

The z-transform of the autocorrelation is then

$$R(z) = \frac{d_0^2}{(1 + a z^{-1})(1 + a z)} = r(0) \sum_{l=-\infty}^{\infty} (-a)^{|l|} z^{-l} \qquad (4.2.65)$$

and the spectrum is

$$R(e^{j\omega}) = |H(e^{j\omega})|^2 = \frac{d_0^2}{|1 + a e^{-j\omega}|^2} = \frac{d_0^2}{1 + 2a\cos\omega + a^2} \qquad (4.2.66)$$

Figures 4.4 and 4.5 show a typical realization of the output, the impulse response, autocorrelation, and spectrum of two AP(1) models. The sample process realizations were obtained by driving the model with white Gaussian noise of zero mean and unit variance. When the positive pole ($p = -a = 0.8$) is close to the unit circle, successive samples

FIGURE 4.4 Sample realization of the output process, impulse response, autocorrelation, and spectrum of an AP(1) model with a = −0.8. [Four panels: sample realization, autocorrelation, and impulse response (amplitude versus sample number); spectrum (amplitude versus frequency in cycles/sampling interval).]

FIGURE 4.5 Sample realization of the output process, impulse response, autocorrelation, and spectrum of an AP(1) model with a = 0.8. [Four panels: sample realization, autocorrelation, and impulse response (amplitude versus sample number); spectrum (amplitude versus frequency in cycles/sampling interval).]

of the output process are similar, as dictated by the slowly decaying autocorrelation and the corresponding low-pass spectrum. In contrast, a negative pole close to the unit circle results in a rapidly oscillating sequence. This is clearly reflected in the alternating sign of the autocorrelation sequence and the associated high-pass spectrum.
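The AP(1) closed forms can be checked numerically. The following is a minimal sketch, assuming a truncated impulse response is an adequate stand-in for the infinite one; the function names and the truncation length of 400 samples are illustrative choices, not part of the text.

```python
import math

def ap1_impulse_response(a, d0, n_samples):
    """Recursion h(n) = -a*h(n-1) for n > 0 with h(0) = d0, which gives (4.2.61)."""
    h = [d0]
    for _ in range(1, n_samples):
        h.append(-a * h[-1])
    return h

def autocorrelation(h, lag):
    """Deterministic autocorrelation r(l) = sum_n h(n) h(n - l) of a real sequence."""
    lag = abs(lag)
    return sum(h[n] * h[n - lag] for n in range(lag, len(h)))

def ap1_spectrum(a, d0, omega):
    """R(e^{jw}) from (4.2.66)."""
    return d0 ** 2 / (1 + 2 * a * math.cos(omega) + a ** 2)

a, d0 = -0.8, 1.0                      # pole at z = -a = 0.8, as in Figure 4.4
h = ap1_impulse_response(a, d0, 400)   # assumed long enough for a negligible tail

r0_closed = d0 ** 2 / (1 - a ** 2)     # (4.2.63)
for lag in range(4):
    numeric = autocorrelation(h, lag)
    closed = r0_closed * (-a) ** lag   # (4.2.62)
    print(lag, round(numeric, 6), round(closed, 6))

# A positive pole (a < 0) puts more power at w = 0 than at w = pi (low-pass).
print(ap1_spectrum(a, d0, 0.0) > ap1_spectrum(a, d0, math.pi))
```

Changing the sign of `a` moves the pole to the negative real axis, and the same comparison then shows the high-pass behavior described above.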

Note that a positive real pole is a type of low-pass filter, while a negative real pole has the spectral characteristics of a high-pass filter. (This situation in the digital domain contrasts with that in the corresponding analog domain, where a real-axis pole can only have low-pass characteristics.) The discrete-time negative real pole can be thought of as one-half of two conjugate poles at half the sampling frequency. Notice that both spectra are even and have zero slope at $\omega = 0$ and $\omega = \pi$. These properties hold for the spectra of all parametric models (i.e., pole-zero models) with real coefficients (see Problem 4.13).

Consider now the real-valued AR(1) process $x(n)$ generated by

$$x(n) = -a x(n-1) + w(n) \tag{4.2.67}$$

where $\{w(n)\} \sim \mathrm{WN}(0, \sigma_w^2)$. Using the formula $R_x(z) = \sigma_w^2 H(z) H^*(1/z^*)$ and previous results, we can see that the autocorrelation and the PSD of $x(n)$ are given by

$$r_x(l) = \frac{\sigma_w^2}{1 - a^2} (-a)^{|l|} \qquad \text{and} \qquad R_x(e^{j\omega}) = \frac{\sigma_w^2}{1 + a^2 + 2a\cos\omega}$$

respectively. Since $\sigma_x^2 = r_x(0) = \sigma_w^2/(1 - a^2)$, the SFM of $x(n)$ is (see Section 4.1.18)

$$\mathrm{SFM}_x = \frac{\sigma_w^2}{\sigma_x^2} = 1 - a^2 \tag{4.2.68}$$

Clearly, if $a = 0$, then from (4.2.67), $x(n)$ is a white noise process and from (4.2.68), $\mathrm{SFM}_x = 1$. If $|a| \to 1$, then $\mathrm{SFM}_x \to 0$; and in the limit when $a = -1$, the process becomes a random walk process, which is a nonstationary process with linearly increasing variance $E\{x^2(n)\} = n\sigma_w^2$. The correlation matrix is Toeplitz, and it is a rare exception in which eigenvalues and eigenvectors can be described by analytical expressions (Jayant and Noll 1984).

Second-order all-pole model: AP(2)

The system function of an AP(2) model is given by

$$H(z) = \frac{d_0}{1 + a_1 z^{-1} + a_2 z^{-2}} = \frac{d_0}{(1 - p_1 z^{-1})(1 - p_2 z^{-1})} \tag{4.2.69}$$

From (4.2.46) and (4.2.47), we have

$$a_1 = -(p_1 + p_2) \qquad a_2 = p_1 p_2 \tag{4.2.70}$$

Recall that $H(z)$ is minimum-phase if the two poles $p_1$ and $p_2$ are inside the unit circle. Under these conditions, $a_1$ and $a_2$ lie in a triangular region defined by

$$-1 < a_2 < 1 \qquad a_2 - a_1 > -1 \qquad a_2 + a_1 > -1 \tag{4.2.71}$$

and shown in Figure 4.6. The first condition follows from (4.2.70) since $|p_1| < 1$ and $|p_2| < 1$. The last two conditions can be derived by assuming real roots and setting the larger root to less than 1 and the smaller root to greater than −1. By adding the last two conditions, we obtain the redundant condition $a_2 > -1$.

FIGURE 4.6 Minimum-phase region (triangle) for the AP(2) model in the $(a_1, a_2)$ parameter space. [Horizontal axis $a_1$ from −2 to 2, vertical axis $a_2$ from −1 to 1; marked regions: complex conjugate poles, real and equal poles, real poles.]

Complex roots occur in the region

$$\frac{a_1^2}{4} < a_2 \le 1 \qquad \text{complex poles} \tag{4.2.72}$$

with $a_2 = 1$ resulting in both roots being on the unit circle. Note that, in order to have complex poles, $a_2$ cannot be negative. If the complex poles are written in polar form

$$p_i = r e^{\pm j\theta} \qquad 0 \le r \le 1 \tag{4.2.73}$$

then

$$a_1 = -2r\cos\theta \qquad a_2 = r^2 \tag{4.2.74}$$

and

$$H(z) = \frac{d_0}{1 - (2r\cos\theta) z^{-1} + r^2 z^{-2}} \qquad \text{complex poles} \tag{4.2.75}$$

Here, $r$ is the radius (magnitude) of the poles, and $\theta$ is the angle or normalized frequency of the poles.
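The triangle conditions (4.2.71) can be compared directly with the pole magnitudes. The sketch below is illustrative only: the helper names and the test grid are assumptions, and the grid is offset so that no point falls exactly on a boundary of the triangle.

```python
import cmath
import math

def ap2_poles(a1, a2):
    """Roots of z^2 + a1*z + a2, i.e. the poles p1 and p2 in (4.2.69)."""
    disc = cmath.sqrt(a1 * a1 - 4.0 * a2)
    return (-a1 + disc) / 2.0, (-a1 - disc) / 2.0

def in_triangle(a1, a2):
    """Minimum-phase conditions (4.2.71)."""
    return -1.0 < a2 < 1.0 and a2 - a1 > -1.0 and a2 + a1 > -1.0

def poles_inside_unit_circle(a1, a2):
    return all(abs(p) < 1.0 for p in ap2_poles(a1, a2))

# Polar parameterization (4.2.74) with r = 0.8 and theta = 2*pi/5.
r, theta = 0.8, 2.0 * math.pi / 5.0
a1, a2 = -2.0 * r * math.cos(theta), r * r
p1, p2 = ap2_poles(a1, a2)
print(round(a1, 4), round(a2, 4), round(abs(p1), 4), round(abs(cmath.phase(p1)), 4))

# Compare the triangle test with the pole-magnitude test on a coarse grid.
mismatches = 0
for i in range(50):
    for j in range(30):
        g1, g2 = -2.45 + 0.1 * i, -1.47 + 0.1 * j
        if in_triangle(g1, g2) != poles_inside_unit_circle(g1, g2):
            mismatches += 1
print("grid points where the two tests disagree:", mismatches)
```

The grid comparison is a spot check rather than a proof; it only confirms that the two tests agree at the sampled points.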

Impulse response. The impulse response of an AP(2) model can be written in terms of its two poles by evaluating the inverse z-transform of (4.2.69). The result is

$$h(n) = \frac{d_0}{p_1 - p_2} \left( p_1^{n+1} - p_2^{n+1} \right) u(n) \tag{4.2.76}$$

for $p_1 \ne p_2$. Otherwise, for $p_1 = p_2 = p$,

$$h(n) = d_0 (n + 1) p^n u(n) \tag{4.2.77}$$

In the special case of a complex conjugate pair of poles $p_1 = r e^{j\theta}$ and $p_2 = r e^{-j\theta}$, Equation (4.2.76) reduces to

$$h(n) = d_0 r^n \frac{\sin[(n + 1)\theta]}{\sin\theta} u(n) \qquad \text{complex poles} \tag{4.2.78}$$

Since $0 < r < 1$, $h(n)$ is a damped sinusoid of frequency $\theta$.

Autocorrelation. The autocorrelation can also be written in terms of the two poles as

$$r(l) = \frac{d_0^2}{(p_1 - p_2)(1 - p_1 p_2)} \left( \frac{p_1^{l+1}}{1 - p_1^2} - \frac{p_2^{l+1}}{1 - p_2^2} \right) \qquad l \ge 0 \tag{4.2.79}$$

from which we can deduce the energy

$$r(0) = \frac{d_0^2 (1 + p_1 p_2)}{(1 - p_1 p_2)(1 - p_1^2)(1 - p_2^2)} \tag{4.2.80}$$

For the special case of a complex conjugate pole pair, (4.2.79) can be rewritten as

$$r(l) = \frac{d_0^2 r^l \{\sin[(l + 1)\theta] - r^2 \sin[(l - 1)\theta]\}}{[(1 - r^2)\sin\theta](1 - 2r^2\cos 2\theta + r^4)} \qquad l \ge 0 \tag{4.2.81}$$

Then from (4.2.80) we can write an expression for the energy in terms of the polar coordinates of the complex conjugate pole pair

$$r(0) = \frac{d_0^2 (1 + r^2)}{(1 - r^2)(1 - 2r^2\cos 2\theta + r^4)} \tag{4.2.82}$$
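As a numerical cross-check of (4.2.78) and (4.2.81), the impulse response can be generated from the difference equation and its autocorrelation summed directly. This is a sketch under stated assumptions: the function names are illustrative, and the 600-sample truncation is assumed to capture essentially all of the energy for r = 0.8.

```python
import math

def ap2_impulse_response(a1, a2, d0, n_samples):
    """h(n) = -a1*h(n-1) - a2*h(n-2) + d0*delta(n), with h(n) = 0 for n < 0."""
    h = []
    for n in range(n_samples):
        value = d0 if n == 0 else 0.0
        if n >= 1:
            value -= a1 * h[n - 1]
        if n >= 2:
            value -= a2 * h[n - 2]
        h.append(value)
    return h

def autocorrelation(h, lag):
    """r(l) = sum_n h(n) h(n - l) for a real sequence and l >= 0."""
    return sum(h[n] * h[n - lag] for n in range(lag, len(h)))

def h_closed(r, theta, d0, n):
    """Closed-form impulse response (4.2.78)."""
    return d0 * r ** n * math.sin((n + 1) * theta) / math.sin(theta)

def r_closed(r, theta, d0, lag):
    """Closed-form autocorrelation (4.2.81)."""
    num = d0 ** 2 * r ** lag * (
        math.sin((lag + 1) * theta) - r ** 2 * math.sin((lag - 1) * theta))
    den = (1 - r ** 2) * math.sin(theta) * (1 - 2 * r ** 2 * math.cos(2 * theta) + r ** 4)
    return num / den

r, theta, d0 = 0.8, 2.0 * math.pi / 5.0, 1.0
a1, a2 = -2.0 * r * math.cos(theta), r * r      # from (4.2.74)
h = ap2_impulse_response(a1, a2, d0, 600)

for lag in range(4):
    print(lag, round(autocorrelation(h, lag), 6), round(r_closed(r, theta, d0, lag), 6))
```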

The normalized autocorrelation is given by

$$\rho(l) = \frac{r^l \{\sin[(l + 1)\theta] - r^2 \sin[(l - 1)\theta]\}}{(1 + r^2)\sin\theta} \qquad l \ge 0 \tag{4.2.83}$$

which can be rewritten as

$$\rho(l) = \frac{1}{\cos\beta}\, r^l \cos(l\theta - \beta) \qquad l \ge 0 \tag{4.2.84}$$

where

$$\tan\beta = \frac{(1 - r^2)\cos\theta}{(1 + r^2)\sin\theta} \tag{4.2.85}$$

Therefore, $\rho(l)$ is a damped cosine wave with its maximum amplitude at the origin.

Spectrum. By setting the two poles equal to

$$p_1 = r_1 e^{j\theta_1} \qquad p_2 = r_2 e^{j\theta_2} \tag{4.2.86}$$

the spectrum of an AP(2) model can be written as

$$R(e^{j\omega}) = \frac{d_0^2}{[1 - 2r_1\cos(\omega - \theta_1) + r_1^2][1 - 2r_2\cos(\omega - \theta_2) + r_2^2]} \tag{4.2.87}$$

There are four cases of interest

| Pole locations | Peak locations | Type of $R(e^{j\omega})$ |
|---|---|---|
| $p_1 > 0$, $p_2 > 0$ | $\omega = 0$ | Low-pass |
| $p_1 < 0$, $p_2 < 0$ | $\omega = \pi$ | High-pass |
| $p_1 > 0$, $p_2 < 0$ | $\omega = 0$, $\omega = \pi$ | Stopband |
| $p_{1,2} = r e^{\pm j\theta}$ | $0 < \omega < \pi$ | Bandpass |

and they depend on the location of the poles on the complex plane. We concentrate on the fourth case of complex conjugate poles, which is of greatest interest. The other three cases are explored in Problem 4.15. The spectrum is given by

$$R(e^{j\omega}) = \frac{d_0^2}{[1 - 2r\cos(\omega - \theta) + r^2][1 - 2r\cos(\omega + \theta) + r^2]} \tag{4.2.88}$$

The peak of this spectrum can be shown to be located at a frequency $\omega_c$, given by

$$\cos\omega_c = \frac{1 + r^2}{2r}\cos\theta \tag{4.2.89}$$

Since $1 + r^2 > 2r$ for $r < 1$, we have

$$|\cos\omega_c| > |\cos\theta| \tag{4.2.90}$$

so the spectral peak is lower than the pole frequency for $0 < \theta < \pi/2$ and higher than the pole frequency for $\pi/2 < \theta < \pi$. This behavior is illustrated in Figure 4.7 for an AP(2) model with $a_1 = -0.4944$, $a_2 = 0.64$, and $d_0 = 1$. The model has two complex conjugate poles with $r = 0.8$ and $\theta = \pm 2\pi/5$. The spectrum has a single peak and displays a passband type of behavior. The impulse response is a damped sine wave, while the autocorrelation is a damped cosine. The typical realization of the output clearly shows a pseudoperiodic behavior that is explained by the shape of the autocorrelation and the spectrum of the model. We also notice that if the poles are complex conjugates, the autocorrelation has pseudoperiodic behavior.

FIGURE 4.7 Sample realization of the output process, impulse response, autocorrelation, and spectrum of an AP(2) model with complex conjugate poles. [Four panels: sample realization, autocorrelation, and impulse response (amplitude versus sample number); spectrum (amplitude versus frequency in cycles/sampling interval).]

Equivalent model descriptions. We now write explicit formulas for $a_1$ and $a_2$ in terms of the lattice parameters $k_1$ and $k_2$ and the autocorrelation coefficients. From the step-up and step-down recursions in Section 2.5, we have

$$a_1 = k_1 (1 + k_2) \qquad a_2 = k_2 \tag{4.2.91}$$

and the inverse relations

$$k_1 = \frac{a_1}{1 + a_2} \qquad k_2 = a_2 \tag{4.2.92}$$

From the Yule-Walker equations (4.2.18), we can write the two equations

$$a_1 r(0) + a_2 r(1) = -r(1) \qquad a_1 r(1) + a_2 r(0) = -r(2) \tag{4.2.93}$$

which can be solved for $a_1$ and $a_2$ in terms of $\rho(1)$ and $\rho(2)$

$$a_1 = -\rho(1)\,\frac{1 - \rho(2)}{1 - \rho^2(1)} \qquad a_2 = \frac{\rho^2(1) - \rho(2)}{1 - \rho^2(1)} \tag{4.2.94}$$

or for $\rho(1)$ and $\rho(2)$ in terms of $a_1$ and $a_2$

$$\rho(1) = -\frac{a_1}{1 + a_2} \qquad \rho(2) = -a_1\rho(1) - a_2 = \frac{a_1^2}{1 + a_2} - a_2 \tag{4.2.95}$$

From the equations above, we can also write the relation and inverse relation between the coefficients $k_1$ and $k_2$ and the normalized autocorrelations $\rho(1)$ and $\rho(2)$ as

$$k_1 = -\rho(1) \qquad k_2 = \frac{\rho^2(1) - \rho(2)}{1 - \rho^2(1)} \tag{4.2.96}$$

and

$$\rho(1) = -k_1 \qquad \rho(2) = k_1^2 (1 + k_2) - k_2 \tag{4.2.97}$$

The gain $d_0$ can also be written in terms of the other coefficients. From (4.2.20), we have

$$d_0^2 = r(0)[1 + a_1\rho(1) + a_2\rho(2)] \tag{4.2.98}$$

which can be shown to be equal to

$$d_0^2 = r(0)(1 - k_1^2)(1 - k_2^2) \tag{4.2.99}$$
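The conversions among the three AP(2) parameter sets can be exercised with a round trip. The sketch below is one possible arrangement; the function names and the sample values of $k_1$ and $k_2$ are hypothetical, chosen only for illustration.

```python
def k_to_a(k1, k2):
    """Lattice to direct form, (4.2.91)."""
    return k1 * (1 + k2), k2

def a_to_k(a1, a2):
    """Direct form to lattice, (4.2.92)."""
    return a1 / (1 + a2), a2

def a_to_rho(a1, a2):
    """Direct form to normalized autocorrelation, (4.2.95)."""
    rho1 = -a1 / (1 + a2)
    rho2 = -a1 * rho1 - a2
    return rho1, rho2

def rho_to_a(rho1, rho2):
    """Solution of the two Yule-Walker equations (4.2.93)."""
    a2 = (rho1 ** 2 - rho2) / (1 - rho1 ** 2)
    a1 = -rho1 * (1 + a2)
    return a1, a2

def gain_factor_from_a(a1, a2):
    """d0^2 / r(0) as in (4.2.98)."""
    rho1, rho2 = a_to_rho(a1, a2)
    return 1 + a1 * rho1 + a2 * rho2

k1, k2 = -0.4, 0.6                    # hypothetical lattice parameters
a1, a2 = k_to_a(k1, k2)
rho1, rho2 = a_to_rho(a1, a2)
print(a1, a2, rho1, rho2)
print(rho_to_a(rho1, rho2), a_to_k(a1, a2))

# The gain factor computed from (a, rho) agrees with the product of (1 - k_m^2) terms.
print(gain_factor_from_a(a1, a2), (1 - k1 ** 2) * (1 - k2 ** 2))
```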

Minimum-phase conditions. In (4.2.71), we have a set of conditions on $a_1$ and $a_2$ so that the AP(2) model is minimum-phase, and Figure 4.6 shows the corresponding admissible region for minimum-phase models. Similar relations and regions can be derived for the other types of parameters, as we will show below. In terms of $k_1$ and $k_2$, the AP(2) model is minimum-phase if

$$|k_1| < 1 \qquad |k_2| < 1 \tag{4.2.100}$$

This region is depicted in Figure 4.8(a). Shown also is the region that results in complex roots, which is specified by

$$0 < k_2 < 1 \tag{4.2.101}$$

$$k_1^2 < \frac{4k_2}{(1 + k_2)^2} \tag{4.2.102}$$

Because of the correlation matching property of all-pole models, we can find a minimum-phase all-pole model for every positive definite sequence of autocorrelation values. Therefore, the admissible region of autocorrelation values coincides with the positive definite region. The positive definite condition is equivalent to having all the principal minors of the autocorrelation matrix in (4.2.30) be positive; that is, the corresponding determinants are positive. For $P = 2$, there are two conditions:

$$\det\begin{bmatrix} 1 & \rho(1) \\ \rho(1) & 1 \end{bmatrix} > 0 \qquad \det\begin{bmatrix} 1 & \rho(1) & \rho(2) \\ \rho(1) & 1 & \rho(1) \\ \rho(2) & \rho(1) & 1 \end{bmatrix} > 0 \tag{4.2.103}$$

These two conditions reduce to

$$|\rho(1)| < 1 \tag{4.2.104}$$

$$2\rho^2(1) - 1 < \rho(2) < 1 \tag{4.2.105}$$

which determine the admissible region shown in Figure 4.8(b). Conditions (4.2.105) can also be derived from (4.2.71) and (4.2.95). The first condition in (4.2.105) is equivalent to

$$\left| \frac{a_1}{1 + a_2} \right| < 1 \tag{4.2.106}$$

which can be shown to be equivalent to the last two conditions in (4.2.71).

FIGURE 4.8 Minimum-phase and positive definiteness regions for the AP(2) model in the (a) $(k_1, k_2)$ space and (b) $(\rho(1), \rho(2))$ space. [Panel (a) marks the regions of complex conjugate poles, real and equal poles, and real poles; both axes in each panel run from −1 to 1.]

It is important to note that the region in Figure 4.8(b) is the admissible region for any positive definite autocorrelation, including the autocorrelation of mixed-phase signals. This is reasonable since the autocorrelation does not contain phase information and allows the signal to have minimum- and maximum-phase components. What we are claiming here, however, is that for every autocorrelation sequence in the positive definite region, we can find a minimum-phase all-pole model with the same autocorrelation values. Therefore, for this problem, the positive definite region is identical to the admissible minimum-phase region.

4.3 ALL-ZERO MODELS

In this section, we investigate the properties of the all-zero model. The output of the all-zero model is the weighted average of delayed versions of the input signal

$$x(n) = \sum_{k=0}^{Q} d_k w(n - k) \tag{4.3.1}$$

where $Q$ is the order of the model. The system function is

$$H(z) = D(z) = \sum_{k=0}^{Q} d_k z^{-k} \tag{4.3.2}$$

The all-zero model can be implemented by using either a direct or a lattice structure. The conversion between the two sets of parameters can be done by using the step-up and step-down recursions described in Chapter 7 and setting $A(z) = D(z)$. Notice that the same set of parameters can be used to implement either an all-zero or an all-pole model by using a different structure.

4.3.1 Model Properties

We next provide a brief discussion of the properties of the all-zero model.

Impulse response. It can be easily seen that the AZ(Q) model is an FIR system with an impulse response

$$h(n) = \begin{cases} d_n & 0 \le n \le Q \\ 0 & \text{elsewhere} \end{cases} \tag{4.3.3}$$

Autocorrelation. The autocorrelation of the impulse response is given by

$$r_h(l) = \sum_{n=-\infty}^{\infty} h(n) h^*(n - l) = \begin{cases} \displaystyle\sum_{k=0}^{Q-l} d_k d_{k+l}^* & 0 \le l \le Q \\ 0 & l > Q \end{cases} \tag{4.3.4}$$

and

$$r_h^*(-l) = r_h(l) \qquad \text{all } l \tag{4.3.5}$$

We usually set $d_0 = 1$, which implies that

$$r_h(l) = d_l^* + d_1 d_{l+1}^* + \cdots + d_{Q-l} d_Q^* \qquad l = 0, 1, \ldots, Q \tag{4.3.6}$$

hence, the normalized autocorrelation is

$$\rho_h(l) = \begin{cases} \dfrac{d_l^* + d_1 d_{l+1}^* + \cdots + d_{Q-l} d_Q^*}{1 + |d_1|^2 + \cdots + |d_Q|^2} & l = 1, 2, \ldots, Q \\ 0 & l > Q \end{cases} \tag{4.3.7}$$

We see that the autocorrelation of an AZ(Q) model is zero for lags $|l|$ exceeding the order $Q$ of the model. If $\rho_h(1), \rho_h(2), \ldots, \rho_h(Q)$ are known, then the $Q$ equations (4.3.7) can be solved for model parameters $d_1, d_2, \ldots, d_Q$. However, unlike the Yule-Walker equations for the AP(P) model, which are linear, Equations (4.3.7) are nonlinear and their solution is quite complicated (see Section 9.3).

Spectrum. The spectrum of the AZ(Q) model is given by

$$R_h(e^{j\omega}) = D(z) D(z^{-1})\big|_{z = e^{j\omega}} = |D(e^{j\omega})|^2 = \sum_{l=-Q}^{Q} r_h(l) e^{-j\omega l} \tag{4.3.8}$$

which is basically a trigonometric polynomial.

Impulse train excitations. The response $\tilde{h}(n)$ of the AZ(Q) model to a periodic impulse train with period $L$ is periodic with the same period, and its spectrum is a sampled version of (4.3.8) at multiples of $2\pi/L$ (see Section 2.3.2). Therefore, to recover the autocorrelation $r_h(l)$ and the spectrum $R_h(e^{j\omega})$ from the autocorrelation or spectrum of $\tilde{h}(n)$, we should have $L \ge 2Q + 1$ in order to avoid aliasing in the autocorrelation lag domain. Also, if $L > Q$, the impulse response $h(n)$, $0 \le n \le Q$, can be recovered from the response $\tilde{h}(n)$ (no time-domain aliasing) (see Problem 4.24).

Partial autocorrelation and lattice-ladder structures. The PACS of an AZ(Q) model is computed by fitting a series of AP(P) models for $P = 1, 2, \ldots$, to the autocorrelation sequence (4.3.7) of the AZ(Q) model. Since the AZ(Q) model is equivalent to an AP(∞) model, the PACS of an all-zero model has infinite extent and behaves as the autocorrelation sequence of an all-pole model. This is illustrated later for the low-order AZ(1) and AZ(2) models.

4.3.2 Moving-Average Models

A moving-average model is an AZ(Q) model with $d_0 = 1$ driven by white noise, that is,

$$x(n) = w(n) + \sum_{k=1}^{Q} d_k w(n - k) \tag{4.3.9}$$

where $\{w(n)\} \sim \mathrm{WN}(0, \sigma_w^2)$. The output $x(n)$ has zero mean and variance of

$$\sigma_x^2 = \sigma_w^2 \sum_{k=0}^{Q} |d_k|^2 \tag{4.3.10}$$

The autocorrelation and power spectrum are given by $r_x(l) = \sigma_w^2 r_h(l)$ and $R_x(e^{j\omega}) = \sigma_w^2 |D(e^{j\omega})|^2$, respectively. Clearly, observations that are more than $Q$ samples apart are uncorrelated because the autocorrelation is zero after lag $Q$.
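The finite extent of the all-zero autocorrelation and its link to the spectrum in (4.3.8) can be illustrated with a short numerical sketch. The coefficient values below are hypothetical, real coefficients are assumed, and the function names are illustrative.

```python
import cmath
import math

def az_autocorrelation(d, lag):
    """r_h(l) = sum_k d[k] * d[k + |l|] for real coefficients; zero when |l| > Q."""
    lag = abs(lag)
    return sum(d[k] * d[k + lag] for k in range(len(d) - lag))

def az_spectrum_direct(d, omega):
    """|D(e^{jw})|^2 evaluated from the coefficients."""
    D = sum(dk * cmath.exp(-1j * omega * k) for k, dk in enumerate(d))
    return abs(D) ** 2

def az_spectrum_from_acs(d, omega):
    """Sum over l = -Q..Q of r_h(l) e^{-jwl}, using the even symmetry of r_h."""
    Q = len(d) - 1
    return az_autocorrelation(d, 0) + 2.0 * sum(
        az_autocorrelation(d, l) * math.cos(omega * l) for l in range(1, Q + 1))

d = [1.0, 0.5, -0.3]          # hypothetical AZ(2) coefficients with d0 = 1
print([round(az_autocorrelation(d, l), 4) for l in range(5)])

for omega in (0.0, 0.7, math.pi / 2, math.pi):
    print(round(az_spectrum_direct(d, omega), 6), round(az_spectrum_from_acs(d, omega), 6))
```

For a moving-average process, multiplying these values by $\sigma_w^2$ gives $r_x(l)$ and $R_x(e^{j\omega})$.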

4.3.3 Lower-Order Models

To familiarize ourselves with all-zero models, we next investigate in detail the properties of the AZ(1) and AZ(2) models with real coefficients.

The first-order all-zero model: AZ(1). For generality, we consider an AZ(1) model whose system function is

$$H(z) = G(1 + d_1 z^{-1}) \tag{4.3.11}$$

The model is stable for any value of $d_1$ and minimum-phase for $-1 < d_1 < 1$. The autocorrelation is the inverse z-transform of

$$R_h(z) = H(z) H(z^{-1}) = G^2 [d_1 z + (1 + d_1^2) + d_1 z^{-1}] \tag{4.3.12}$$

Hence, $r_h(0) = G^2 (1 + d_1^2)$, $r_h(1) = r_h(-1) = G^2 d_1$, and $r_h(l) = 0$ elsewhere. Therefore, the normalized autocorrelation is

$$\rho_h(l) = \begin{cases} 1 & l = 0 \\ \dfrac{d_1}{1 + d_1^2} & l = \pm 1 \\ 0 & |l| \ge 2 \end{cases} \tag{4.3.13}$$

The condition $-1 < d_1 < 1$ implies that $|\rho_h(1)| \le \tfrac{1}{2}$ for a minimum-phase model. From $\rho_h(1) = d_1/(1 + d_1^2)$, we obtain the quadratic equation

$$\rho_h(1) d_1^2 - d_1 + \rho_h(1) = 0 \tag{4.3.14}$$

which has the following two roots:

$$d_1 = \frac{1 \pm \sqrt{1 - 4\rho_h^2(1)}}{2\rho_h(1)} \tag{4.3.15}$$

Since the product of the roots is 1, if $d_1$ is a root, then $1/d_1$ must also be a root. Hence, only one of these two roots can satisfy the minimum-phase condition $-1 < d_1 < 1$. The spectrum is obtained by setting $z = e^{j\omega}$ in (4.3.12), or from (4.3.8)

$$R_h(e^{j\omega}) = G^2 (1 + d_1^2 + 2d_1\cos\omega) \tag{4.3.16}$$

The autocorrelation is positive definite if $R_h(e^{j\omega}) > 0$, which holds for all values of $d_1$. Note that if $d_1 > 0$, then $\rho_h(1) > 0$ and the spectrum has low-pass behavior (see Figure 4.9), whereas a high-pass spectrum is obtained when $d_1 < 0$ (see Figure 4.10). The first lattice parameter of the AZ(1) model is $k_1 = -\rho(1)$. The PACS can be obtained from the Yule-Walker equations by using the autocorrelation sequence (4.3.13). Indeed, after some algebra we obtain

$$k_m = \frac{(-d_1)^m (1 - d_1^2)}{1 - d_1^{2(m+1)}} \qquad m = 1, 2, \ldots, \infty \tag{4.3.17}$$

(see Problem 4.25). Notice the duality between the ACS and PACS of AP(1) and AZ(1) models.
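The closed form (4.3.17) can be compared with a PACS computed recursively from the autocorrelation (4.3.13). The sketch below uses a standard Levinson-Durbin recursion with the sign convention $k_1 = -\rho(1)$ used in this section; it is an illustration under that assumption, with illustrative function names, not the procedure of Problem 4.25.

```python
def az1_acs(d1, max_lag):
    """Normalized ACS (4.3.13): 1 at lag 0, d1/(1 + d1^2) at lag 1, zero afterwards."""
    rho = [0.0] * (max_lag + 1)
    rho[0] = 1.0
    if max_lag >= 1:
        rho[1] = d1 / (1.0 + d1 * d1)
    return rho

def levinson_pacs(r, order):
    """Levinson-Durbin recursion for real data; returns k_1, ..., k_order."""
    a = []                      # a[i - 1] holds a_i of the current predictor
    err = r[0]                  # prediction error power
    pacs = []
    for m in range(1, order + 1):
        acc = r[m] + sum(a[i - 1] * r[m - i] for i in range(1, m))
        k = -acc / err
        a = [a[i - 1] + k * a[m - i - 1] for i in range(1, m)] + [k]
        err *= 1.0 - k * k
        pacs.append(k)
    return pacs

def az1_pacs_closed_form(d1, m):
    """Closed-form PACS (4.3.17)."""
    return (-d1) ** m * (1.0 - d1 * d1) / (1.0 - d1 ** (2 * (m + 1)))

d1, order = 0.5, 8
pacs = levinson_pacs(az1_acs(d1, order), order)
for m, k in enumerate(pacs, start=1):
    print(m, round(k, 8), round(az1_pacs_closed_form(d1, m), 8))
```

The PACS decays geometrically in magnitude with alternating sign for $d_1 > 0$, which mirrors the ACS of an AP(1) model.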

FIGURE 4.9 Sample realization of the output process, ACS, PACS, and spectrum of an AZ(1) model with d1 = 0.95. [Four panels: time series (amplitude versus sample number), ACS ($r(l)$ versus lag $l$), PACS ($k_m$ versus $m$), and spectrum (amplitude versus frequency in cycles/sampling interval).]

FIGURE 4.10 Sample realization of the output process, ACS, PACS, and spectrum of an AZ(1) model with d1 = −0.95. [Four panels: time series (amplitude versus sample number), ACS ($r(l)$ versus lag $l$), PACS ($k_m$ versus $m$), and spectrum (amplitude versus frequency in cycles/sampling interval).]

Consider now the MA(1) real-valued process $x(n)$ generated by

$$x(n) = w(n) + b\,w(n - 1)$$

where $\{w(n)\} \sim \mathrm{WN}(0, \sigma_w^2)$. Using $R_x(z) = \sigma_w^2 H(z) H(z^{-1})$, we obtain the PSD function

$$R_x(e^{j\omega}) = \sigma_w^2 (1 + b^2 + 2b\cos\omega)$$

which has low-pass (high-pass) characteristics if $0 < b \le 1$ ($-1 \le b < 0$). Since $\sigma_x^2 = r_x(0) = \sigma_w^2 (1 + b^2)$, we have (see Section 4.1.18)

$$\mathrm{SFM}_x = \frac{\sigma_w^2}{\sigma_x^2} = \frac{1}{1 + b^2} \tag{4.3.18}$$

which is maximum for $b = 0$ (white noise). The correlation matrix is banded Toeplitz (only a number of diagonals close to the main diagonal are nonzero). Since $r_x(\pm 1) = \sigma_w^2 b$, its normalized off-diagonal entries are $b/(1 + b^2)$:

$$\mathbf{R}_x = \sigma_w^2 (1 + b^2) \begin{bmatrix} 1 & \frac{b}{1 + b^2} & 0 & \cdots & 0 \\ \frac{b}{1 + b^2} & 1 & \frac{b}{1 + b^2} & \cdots & 0 \\ 0 & \frac{b}{1 + b^2} & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \end{bmatrix} \tag{4.3.19}$$

and its eigenvalues and eigenvectors are given by $\lambda_k = R_x(e^{j\omega_k})$, $q_n^{(k)} = \sin\omega_k n$, $\omega_k = \pi k/(M + 1)$, where $k = 1, 2, \ldots, M$ (see Problem 4.30).
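The eigenvalue and eigenvector expressions for the MA(1) correlation matrix can be verified by direct substitution. In the sketch below, the matrix is built from $r_x(0) = \sigma_w^2(1 + b^2)$ and $r_x(\pm 1) = \sigma_w^2 b$; the values of $b$, $\sigma_w^2$, and $M$ are hypothetical, and the function names are illustrative.

```python
import math

def ma1_correlation_matrix(b, sigma_w2, M):
    """M x M tridiagonal Toeplitz matrix with r_x(0) on the diagonal and r_x(1) next to it."""
    R = [[0.0] * M for _ in range(M)]
    for i in range(M):
        R[i][i] = sigma_w2 * (1 + b * b)
        if i + 1 < M:
            R[i][i + 1] = R[i + 1][i] = sigma_w2 * b
    return R

def ma1_psd(b, sigma_w2, omega):
    """R_x(e^{jw}) = sigma_w^2 (1 + b^2 + 2 b cos w)."""
    return sigma_w2 * (1 + b * b + 2 * b * math.cos(omega))

def eigen_residual(R, b, sigma_w2, k):
    """max_n |(R q)_n - lambda_k q_n| for q_n = sin(w_k n), n = 1..M."""
    M = len(R)
    omega_k = math.pi * k / (M + 1)
    q = [math.sin(omega_k * n) for n in range(1, M + 1)]
    lam = ma1_psd(b, sigma_w2, omega_k)
    Rq = [sum(R[i][j] * q[j] for j in range(M)) for i in range(M)]
    return max(abs(Rq[i] - lam * q[i]) for i in range(M))

b, sigma_w2, M = 0.5, 1.0, 6          # hypothetical values
R = ma1_correlation_matrix(b, sigma_w2, M)
residuals = [eigen_residual(R, b, sigma_w2, k) for k in range(1, M + 1)]
print(max(residuals))
```

A residual at the level of rounding error for every $k$ indicates that each $\sin\omega_k n$ vector is an eigenvector with eigenvalue equal to the PSD sampled at $\omega_k$.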

The second-order all-zero model: AZ(2). Now let us consider the second-order all-zero model. The system function of the AZ(2) model is

$$H(z) = G(1 + d_1 z^{-1} + d_2 z^{-2}) \tag{4.3.20}$$

The system is stable for all values of $d_1$ and $d_2$, and minimum-phase [see the discussion for the AP(2) model] if

$$-1 < d_2 < 1 \qquad d_2 - d_1 > -1 \qquad d_2 + d_1 > -1 \tag{4.3.21}$$

which is a triangular region identical to that shown in Figure 4.6. The normalized autocorrelation and the spectrum are

$$\rho_h(l) = \begin{cases} 1 & l = 0 \\ \dfrac{d_1 (1 + d_2)}{1 + d_1^2 + d_2^2} & l = \pm 1 \\ \dfrac{d_2}{1 + d_1^2 + d_2^2} & l = \pm 2 \\ 0 & |l| \ge 3 \end{cases} \tag{4.3.22}$$

and

$$R_h(e^{j\omega}) = G^2 [(1 + d_1^2 + d_2^2) + 2d_1 (1 + d_2)\cos\omega + 2d_2\cos 2\omega] \tag{4.3.23}$$

respectively. The minimum-phase region in the autocorrelation domain is shown in Figure 4.11 and is described by the equations

$$\rho(2) + \rho(1) = -0.5 \qquad \rho(2) - \rho(1) = -0.5 \qquad \rho^2(1) = 4\rho(2)[1 - 2\rho(2)] \tag{4.3.24}$$

derived in Problem 4.26. The formula for the PACS is quite involved. The important thing is the duality between the ACS and the PACS of AZ(2) and AP(2) models (see Problem 4.27).

FIGURE 4.11 Minimum-phase region in the autocorrelation domain for the AZ(2) model. [Horizontal axis $\rho(1)$ and vertical axis $\rho(2)$, each from −1 to 1.]

4.4 POLE-ZERO MODELS

We will focus on causal pole-zero models with a recursive input-output relationship given by

$$x(n) = -\sum_{k=1}^{P} a_k x(n - k) + \sum_{k=0}^{Q} d_k w(n - k) \tag{4.4.1}$$

where we assume that $P > 0$ and $Q \ge 1$. The models can be implemented using either direct-form or lattice-ladder structures (Proakis and Manolakis 1996).

4.4.1 Model Properties

In this section, we present some of the basic properties of pole-zero models.

Impulse response. The impulse response of a causal pole-zero model can be written in recursive form from (4.4.1) as

$$h(n) = -\sum_{k=1}^{P} a_k h(n - k) + d_n \qquad n \ge 0 \tag{4.4.2}$$

where $d_n = 0$ for $n > Q$ and $h(n) = 0$ for $n < 0$. Clearly, this formula is useful if the model is stable. From (4.4.2), it is clear that

$$h(n) = -\sum_{k=1}^{P} a_k h(n - k) \qquad n > Q \tag{4.4.3}$$

so that the impulse response obeys the linear prediction equation for $n > Q$. Thus if we are given $h(n)$, $0 \le n \le P + Q$, we can compute $\{a_k\}$ from (4.4.3) by using the $P$ equations specified by $Q + 1 \le n \le Q + P$. Then we can compute $\{d_k\}$ from (4.4.2), using $0 \le n \le Q$. Therefore, the first $P + Q + 1$ values of the impulse response completely specify the pole-zero model. If the model is minimum-phase, the impulse response of the inverse model $h_I(n) = Z^{-1}\{A(z)/D(z)\}$, $d_0 = 1$, can be computed in a similar manner.

Autocorrelation. The complex spectrum of $H(z)$ is given by

$$R_h(z) = H(z) H^*\!\left(\frac{1}{z^*}\right) = \frac{D(z) D^*(1/z^*)}{A(z) A^*(1/z^*)} = \frac{R_d(z)}{R_a(z)} \tag{4.4.4}$$

where $R_d(z)$ and $R_a(z)$ are both finite two-sided polynomials. In a manner similar to the all-pole case, we can write a recursive relation between the autocorrelation, impulse response, and parameters of the model. Indeed, from (4.4.4) we obtain

$$A(z) R_h(z) = D(z) H^*\!\left(\frac{1}{z^*}\right) \tag{4.4.5}$$

Taking the inverse z-transform of (4.4.5) and noting that the inverse z-transform of $H^*(1/z^*)$ is $h^*(-n)$, we have

$$\sum_{k=0}^{P} a_k r_h(l - k) = \sum_{k=0}^{Q} d_k h^*(k - l) \qquad \text{for all } l \tag{4.4.6}$$

Since $h(n)$ is causal, we see that the right-hand side of (4.4.6) is zero for $l > Q$:

$$\sum_{k=0}^{P} a_k r_h(l - k) = 0 \qquad l > Q \tag{4.4.7}$$

Therefore, the autocorrelation of a pole-zero model obeys the linear prediction equation for $l > Q$. Because the impulse response $h(n)$ is a function of $a_k$ and $d_k$, the set of equations in (4.4.6) is nonlinear in terms of parameters $a_k$ and $d_k$. However, (4.4.7) is linear in $a_k$; therefore, we can compute $\{a_k\}$ from (4.4.7), using the set of equations for $l = Q + 1, \ldots, Q + P$, which can be written in matrix form as

$$\begin{bmatrix} r_h(Q) & r_h(Q - 1) & \cdots & r_h(Q - P + 1) \\ r_h(Q + 1) & r_h(Q) & \cdots & r_h(Q - P + 2) \\ \vdots & \vdots & \ddots & \vdots \\ r_h(Q + P - 1) & r_h(Q + P - 2) & \cdots & r_h(Q) \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_P \end{bmatrix} = -\begin{bmatrix} r_h(Q + 1) \\ r_h(Q + 2) \\ \vdots \\ r_h(Q + P) \end{bmatrix} \tag{4.4.8}$$

or

$$\bar{\mathbf{R}}_h \mathbf{a} = -\bar{\mathbf{r}}_h \tag{4.4.9}$$

Here, $\bar{\mathbf{R}}_h$ is a non-Hermitian Toeplitz matrix, and the linear system (4.4.8) can be solved by using the algorithm of Trench (Trench 1964; Carayannis et al. 1981).

Even after we solve for $\mathbf{a}$, (4.4.6) continues to be nonlinear in $d_k$. To compute $d_k$, we use (4.4.4) to find $R_d(z)$

$$R_d(z) = R_a(z) R_h(z) \tag{4.4.10}$$

where the coefficients of $R_a(z)$ are given by

$$r_a(l) = \sum_{k=k_1}^{k_2} a_k a_{k+l}^* \qquad -P \le l \le P, \qquad k_1 = \begin{cases} 0 & l \ge 0 \\ -l & l < 0 \end{cases}, \qquad k_2 = \begin{cases} P - l & l \ge 0 \\ P & l < 0 \end{cases} \tag{4.4.11}$$

with $a_0 = 1$. From (4.4.10), $r_d(l)$ is the convolution of $r_a(l)$ with $r_h(l)$, given by

$$r_d(l) = \sum_{k=-P}^{P} r_a(k) r_h(l - k) \tag{4.4.12}$$

If $r_h(l)$ was originally the autocorrelation of a PZ(P, Q) model, then $r_d(l)$ in (4.4.12) will be zero for $|l| > Q$. Since $R_d(z)$ is specified, it can be factored into the product of two polynomials $D(z)$ and $D^*(1/z^*)$, where $D(z)$ is minimum-phase, as shown in Section 2.4. Therefore, we have seen that, given the values of the autocorrelation $r_h(l)$ of a PZ(P, Q) model in the range $0 \le l \le P + Q$, we can compute the values of the parameters $\{a_k\}$ and $\{d_k\}$ such that $H(z)$ is minimum-phase.

Now, given the parameters of a pole-zero model, we can compute its autocorrelation as follows. Equation (4.4.4) can be written as

$$R_h(z) = R_a^{-1}(z) R_d(z) \tag{4.4.13}$$

where $R_a^{-1}(z)$ is the spectrum of the all-pole model $1/A(z)$, that is, $1/R_a(z)$. The coefficients of $R_a^{-1}(z)$ can be computed from $\{a_k\}$ by using (4.2.20) and (4.2.18). The coefficients of $R_d(z)$ are computed from (4.3.8). Then $R_h(z)$ is the convolution of the two autocorrelations thus computed, which is equivalent to multiplying the two polynomials in (4.4.13) and equating equal powers of $z$ on both sides of the equation. Since $R_d(z)$ is finite, the summations used to obtain the coefficients of $R_h(z)$ are also finite.

EXAMPLE 4.4.1. Consider a signal that has autocorrelation values of $r_h(0) = 19$, $r_h(1) = 9$, $r_h(2) = -5$, and $r_h(3) = -7$. The parameters of the PZ(2, 1) model are found in the following manner. First form the equation from (4.4.8)

$$\begin{bmatrix} 9 & 19 \\ -5 & 9 \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \end{bmatrix} = \begin{bmatrix} 5 \\ 7 \end{bmatrix}$$

which yields $a_1 = -\tfrac{1}{2}$, $a_2 = \tfrac{1}{2}$. Then we compute the coefficients from (4.4.11), $r_a(0) = \tfrac{3}{2}$, $r_a(\pm 1) = -\tfrac{3}{4}$, and $r_a(\pm 2) = \tfrac{1}{2}$. Computing the convolution in (4.4.12) for $|l| \le Q = 1$, we obtain the following polynomial:

$$R_d(z) = 4z + 10 + 4z^{-1} = 4\left(1 + \frac{z^{-1}}{2}\right)(z + 2) = 8\left(1 + \frac{z^{-1}}{2}\right)\left(1 + \frac{z}{2}\right)$$

Therefore, $D(z)$ is obtained by taking the causal part, that is, $D(z) = 2\sqrt{2}\,[1 + z^{-1}/2]$, and $d_1 = \tfrac{1}{2}$.
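The steps of Example 4.4.1 can be retraced numerically. The sketch below solves the 2 × 2 system by Cramer's rule and factors the first-order $R_d(z)$ in closed form; both shortcuts, and all variable names, are assumptions suited to this small case rather than a general procedure.

```python
import math

r_h_values = {0: 19.0, 1: 9.0, 2: -5.0, 3: -7.0}

def rh(l):
    """Even extension of the given autocorrelation values (defined for |l| <= 3)."""
    return r_h_values[abs(l)]

P, Q = 2, 1

# Linear system (4.4.8) for P = 2, Q = 1, solved by Cramer's rule.
m11, m12 = rh(Q), rh(Q - 1)
m21, m22 = rh(Q + 1), rh(Q)
b1, b2 = -rh(Q + 1), -rh(Q + 2)
det = m11 * m22 - m12 * m21
a1 = (b1 * m22 - m12 * b2) / det
a2 = (m11 * b2 - m21 * b1) / det

# Coefficients of R_a(z): r_a(l) = sum_k a[k] a[k + |l|] with a[0] = 1 (real case).
a = [1.0, a1, a2]
def ra(l):
    l = abs(l)
    return sum(a[k] * a[k + l] for k in range(P + 1 - l))

# Convolution (4.4.12) for the lags 0 and 1.
def rd(l):
    return sum(ra(k) * rh(l - k) for k in range(-P, P + 1))

rd0, rd1 = rd(0), rd(1)

# Factor R_d(z) = G^2 (1 + d1 z^-1)(1 + d1 z): G^2 d1 = rd1 and G^2 (1 + d1^2) = rd0.
d1 = (rd0 - math.sqrt(rd0 ** 2 - 4 * rd1 ** 2)) / (2 * rd1)   # root with |d1| < 1
gain_sq = rd1 / d1

print(a1, a2, ra(0), ra(1), ra(2), rd0, rd1, d1, gain_sq)
```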

Spectrum. The spectrum of $H(z)$ is given by

$$R_h(e^{j\omega}) = |H(e^{j\omega})|^2 = \frac{|D(e^{j\omega})|^2}{|A(e^{j\omega})|^2} \tag{4.4.14}$$

Therefore, $R_h(e^{j\omega})$ can be obtained by dividing the spectrum of $D(z)$ by the spectrum of $A(z)$. Again, the FFT can be used to advantage in computing the numerator and denominator of (4.4.14). If the spectrum $R_h(e^{j\omega})$ of a PZ(P, Q) model is given, then the parameters of the (minimum-phase) model can be recovered by first computing the autocorrelation $r_h(l)$ as the inverse Fourier transform of $R_h(e^{j\omega})$ and then using the procedure outlined in the previous section to compute the sets of coefficients $\{a_k\}$ and $\{d_k\}$.

Partial autocorrelation and lattice-ladder structures. Since a PZ(P, Q) model is equivalent to an AP(∞) model, its PACS has infinite extent and behaves, after a certain lag, as the PACS of an all-zero model.

4.4.2 Autoregressive Moving-Average Models

The autoregressive moving-average model is a PZ(P, Q) model driven by white noise and is denoted by ARMA(P, Q). Again, we set $d_0 = 1$ and incorporate the gain into the variance (power) of the white noise excitation. Hence, a causal ARMA(P, Q) model is defined by

$$x(n) = -\sum_{k=1}^{P} a_k x(n - k) + w(n) + \sum_{k=1}^{Q} d_k w(n - k) \tag{4.4.15}$$

where $\{w(n)\} \sim \mathrm{WN}(0, \sigma_w^2)$. The ARMA(P, Q) model parameters are $\{\sigma_w^2, a_1, \ldots, a_P, d_1, \ldots, d_Q\}$. The output has zero mean and variance of

$$\sigma_x^2 = -\sum_{k=1}^{P} a_k r_x(k) + \sigma_w^2 \left[ 1 + \sum_{k=1}^{Q} d_k h(k) \right] \tag{4.4.16}$$

where $h(n)$ is the impulse response of the model. The presence of $h(n)$ in (4.4.16) makes the dependence of $\sigma_x^2$ on the model parameters highly nonlinear. The autocorrelation of $x(n)$ is given by

$$\sum_{k=0}^{P} a_k r_x(l - k) = \sigma_w^2 \left[ h(-l) + \sum_{k=1}^{Q} d_k h(k - l) \right] \qquad \text{for all } l \tag{4.4.17}$$

which reduces to (4.4.16) for $l = 0$ because $h(0) = d_0 = 1$, and the power spectrum by

$$R_x(e^{j\omega}) = \sigma_w^2 \frac{|D(e^{j\omega})|^2}{|A(e^{j\omega})|^2} \tag{4.4.18}$$

The significance of ARMA(P , Q) models is that they can provide more accurate representations than AR or MA models with the same number of parameters. The ARMA model is able to combine the spectral peak matching of the AR model with the ability of the MA model to place nulls in the spectrum.
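The ARMA power spectrum (4.4.18) is straightforward to evaluate on a frequency grid. The sketch below evaluates the two polynomials directly at each frequency, which stands in for the FFT mentioned earlier; the ARMA(2, 1) coefficients are hypothetical and the function names are illustrative.

```python
import cmath
import math

def polynomial_response(coeffs, omega):
    """Evaluate sum_k coeffs[k] * e^{-j*omega*k}."""
    return sum(c * cmath.exp(-1j * omega * k) for k, c in enumerate(coeffs))

def arma_psd(a, d, sigma_w2, omega):
    """R_x(e^{jw}) = sigma_w^2 |D|^2 / |A|^2 with a_0 = d_0 = 1, as in (4.4.18)."""
    A = polynomial_response([1.0] + list(a), omega)
    D = polynomial_response([1.0] + list(d), omega)
    return sigma_w2 * abs(D) ** 2 / abs(A) ** 2

a = [-0.5, 0.5]        # hypothetical a_1, a_2 (poles inside the unit circle)
d = [0.5]              # hypothetical d_1
sigma_w2 = 1.0

n_points = 8
for i in range(n_points + 1):
    omega = math.pi * i / n_points
    print(round(omega / (2 * math.pi), 4), round(arma_psd(a, d, sigma_w2, omega), 6))
```

The first printed column is frequency in cycles per sampling interval, matching the axis used in the figures of this chapter.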

4.4.3 The First-Order Pole-Zero Model: PZ(1, 1)

Consider the PZ(1, 1) model with the following system function

$$H(z) = G\,\frac{1 + d_1 z^{-1}}{1 + a_1 z^{-1}} \tag{4.4.19}$$

where $d_1$ and $a_1$ are real coefficients. The model is minimum-phase if

$$-1 < d_1 < 1 \qquad -1 < a_1 < 1 \tag{4.4.20}$$

which correspond to the rectangular region shown in Figure 4.12(a).

FIGURE 4.12 Minimum-phase and positive definiteness regions for the PZ(1, 1) model in the (a) $(d_1, a_1)$ space and (b) $(\rho(1), \rho(2))$ space. [Both axes in each panel run from −1 to 1.]

For the minimum-phase case, the impulse responses of the direct and the inverse models are

$$h(n) = Z^{-1}\{H(z)\} = \begin{cases} 0 & n < 0 \\ G & n = 0 \\ G(-a_1)^{n-1}(d_1 - a_1) & n > 0 \end{cases} \tag{4.4.21}$$

and

$$h_I(n) = Z^{-1}\left\{\frac{1}{H(z)}\right\} = \begin{cases} 0 & n < 0 \\ \dfrac{1}{G} & n = 0 \\ \dfrac{1}{G}(-d_1)^{n-1}(a_1 - d_1) & n > 0 \end{cases} \tag{4.4.22}$$

respectively. We note that as the pole $p = -a_1$ gets closer to the unit circle, the impulse response decays more slowly and the model has "longer memory." The zero $z = -d_1$ controls the impulse response of the inverse model in a similar way. The PZ(1, 1) model is equivalent to the AZ(∞) model

$$x(n) = Gw(n) + \sum_{k=1}^{\infty} h(k) w(n - k) \tag{4.4.23}$$

or the AP(∞) model

$$x(n) = -G\sum_{k=1}^{\infty} h_I(k) x(n - k) + Gw(n) \tag{4.4.24}$$

If we wish to approximate the PZ(1, 1) model with a finite-order AZ(Q) model, the order $Q$ required to achieve a certain accuracy increases as the pole moves closer to the unit circle. Likewise, in the case of an AP(P) approximation, better fits to the PZ(1, 1) model require an increased order $P$ as the zero moves closer to the unit circle.

To determine the autocorrelation, we recall from (4.4.6) that for a causal model

$$r_h(l) = -a_1 r_h(l - 1) + Gh(-l) + Gd_1 h(1 - l) \qquad \text{all } l \tag{4.4.25}$$

or, using $h(0) = G$ and $h(1) = G(d_1 - a_1)$ from (4.4.21),

$$\begin{aligned} r_h(0) &= -a_1 r_h(1) + G^2 + G^2 d_1 (d_1 - a_1) \\ r_h(1) &= -a_1 r_h(0) + G^2 d_1 \\ r_h(l) &= -a_1 r_h(l - 1) \qquad l \ge 2 \end{aligned} \tag{4.4.26}$$

Solving the first two equations for $r_h(0)$ and $r_h(1)$, we obtain

$$r_h(0) = G^2\,\frac{1 + d_1^2 - 2a_1 d_1}{1 - a_1^2} \tag{4.4.27}$$

and

$$r_h(1) = G^2\,\frac{(d_1 - a_1)(1 - a_1 d_1)}{1 - a_1^2} \tag{4.4.28}$$

The normalized autocorrelation is given by

$$\rho_h(1) = \frac{(d_1 - a_1)(1 - a_1 d_1)}{1 + d_1^2 - 2a_1 d_1} \tag{4.4.29}$$

and

$$\rho_h(l) = -a_1 \rho_h(l - 1) = (-a_1)^{l-1} \rho_h(1) \qquad l \ge 2 \tag{4.4.30}$$
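The PZ(1, 1) autocorrelation expressions can be checked against a direct summation over a truncated impulse response. This is a sketch under stated assumptions: the parameter values are hypothetical, the names are illustrative, and the closed forms are scaled by $G^2$ because $r_h(l)$ is quadratic in $h(n)$.

```python
def pz11_impulse_response(a1, d1, G, n_samples):
    """h(n) = -a1*h(n-1) + G*delta(n) + G*d1*delta(n-1), with h(n) = 0 for n < 0."""
    h = []
    for n in range(n_samples):
        drive = G if n == 0 else (G * d1 if n == 1 else 0.0)
        previous = h[n - 1] if n >= 1 else 0.0
        h.append(-a1 * previous + drive)
    return h

def autocorrelation(h, lag):
    """r_h(l) = sum_n h(n) h(n - l) for a real sequence and l >= 0."""
    return sum(h[n] * h[n - lag] for n in range(lag, len(h)))

a1, d1, G = -0.6, 0.4, 2.0            # hypothetical minimum-phase parameters
h = pz11_impulse_response(a1, d1, G, 400)

r0_closed = G ** 2 * (1 + d1 ** 2 - 2 * a1 * d1) / (1 - a1 ** 2)
r1_closed = G ** 2 * (d1 - a1) * (1 - a1 * d1) / (1 - a1 ** 2)
rho1_closed = (d1 - a1) * (1 - a1 * d1) / (1 + d1 ** 2 - 2 * a1 * d1)

r = [autocorrelation(h, lag) for lag in range(6)]
print(round(r[0], 6), round(r0_closed, 6))
print(round(r[1], 6), round(r1_closed, 6))
print(round(r[1] / r[0], 6), round(rho1_closed, 6))

# For l >= 2 each lag is -a1 times the previous one.
print([round(r[l] / r[l - 1], 6) for l in range(2, 6)], -a1)
```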

Note that given $\rho_h(1)$ and $\rho_h(2)$, we have a nonlinear system of equations that must be solved to obtain $a_1$ and $d_1$. By using Equations (4.4.20), (4.4.29), and (4.4.30), it can be shown (see Problem 4.28) that the PZ(1, 1) is minimum-phase if the ACS satisfies the conditions

$$\begin{aligned} |\rho(2)| &< |\rho(1)| \\ \rho(2) &> \rho(1)[2\rho(1) + 1] \qquad \rho(1) < 0 \\ \rho(2) &> \rho(1)[2\rho(1) - 1] \qquad \rho(1) > 0 \end{aligned} \tag{4.4.31}$$

which correspond to the admissible region shown in Figure 4.12(b).

4.4.4 Summary and Dualities

Table 4.1 summarizes the key properties of all-zero, all-pole, and pole-zero models. These properties help to identify models for empirical discrete-time signals. Furthermore, the table shows the duality between AZ and AP models. More specifically, we see that

1. An invertible AZ(Q) model is equivalent to an AP(∞) model. Thus, it has a finite-extent autocorrelation and an infinite-extent partial autocorrelation.
2. A stable AP(P) model is equivalent to an AZ(∞) model. Thus, it has an infinite-extent autocorrelation and a finite-extent partial autocorrelation.
3. The autocorrelation of an AZ(Q) model behaves as the partial autocorrelation of an AP(P) model, and vice versa.
4. The spectra of an AP(P) model and an AZ(Q) model are related through an inverse relationship.

TABLE 4.1 Summary of all-pole, all-zero, and pole-zero model properties

| Model | AP(P) | AZ(Q) | PZ(P, Q) |
|---|---|---|---|
| Input-output description | $x(n) + \sum_{k=1}^{P} a_k x(n-k) = w(n)$ | $x(n) = d_0 w(n) + \sum_{k=1}^{Q} d_k w(n-k)$ | $x(n) + \sum_{k=1}^{P} a_k x(n-k) = d_0 w(n) + \sum_{k=1}^{Q} d_k w(n-k)$ |
| System function | $H(z) = \dfrac{1}{A(z)} = \dfrac{d_0}{1 + \sum_{k=1}^{P} a_k z^{-k}}$ | $H(z) = D(z) = d_0 + \sum_{k=1}^{Q} d_k z^{-k}$ | $H(z) = \dfrac{D(z)}{A(z)}$ |
| Recursive representation | Finite summation | Infinite summation | Infinite summation |
| Nonrecursive representation | Infinite summation | Finite summation | Infinite summation |
| Stability conditions | Poles inside unit circle | Always | Poles inside unit circle |
| Invertibility conditions | Always | Zeros inside unit circle | Zeros inside unit circle |
| Autocorrelation sequence | Infinite duration (damped exponentials and/or sine waves); tails off | Finite duration; cuts off | Infinite duration (damped exponentials and/or sine waves after Q − P lags); tails off |
| Partial autocorrelation | Finite duration; cuts off | Infinite duration (damped exponentials and/or sine waves); tails off | Infinite duration (dominated by damped exponentials and/or sine waves after Q − P lags); tails off |
| Spectrum | Good peak matching | Good "notch" matching | Good peak and valley matching |

These dualities and properties have been shown and illustrated for low-order models in the previous sections.
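The cut-off versus tail-off duality can be checked numerically for first-order models. The following is a minimal sketch, not the book's code: the helper names, the test coefficient 0.8, and the truncation of the AP(1) impulse response to 200 terms are all assumptions made for illustration. The lattice parameters are obtained with a plain Levinson-Durbin recursion, using the sign convention in which $k_1 = -\rho(1)$.

```python
import numpy as np

def acs_from_impulse_response(h, max_lag):
    """Normalized ACS rho(0..max_lag) of a model driven by unit-variance white noise."""
    h = np.asarray(h, dtype=float)
    r = np.array([np.dot(h[: len(h) - l], h[l:]) if l < len(h) else 0.0
                  for l in range(max_lag + 1)])
    return r / r[0]

def pacs_levinson(rho):
    """Lattice parameters k_1..k_M from rho(0..M) by the Levinson-Durbin recursion.

    Assumed convention: predictor polynomial 1 + sum_k a_k z^-k, so k_1 = -rho(1).
    """
    M = len(rho) - 1
    a = np.zeros(0)          # a_1 .. a_{m-1} of the previous order
    err = rho[0]             # prediction error of the previous order
    k = np.zeros(M)
    for m in range(1, M + 1):
        acc = rho[m] + np.dot(a, rho[m - 1:0:-1])
        k[m - 1] = -acc / err
        a = np.concatenate([a + k[m - 1] * a[::-1], [k[m - 1]]])
        err *= 1.0 - k[m - 1] ** 2
    return k

# Hypothetical test models: AP(1) x(n) = 0.8 x(n-1) + w(n), AZ(1) x(n) = w(n) + 0.8 w(n-1)
h_ap = 0.8 ** np.arange(200)          # truncated impulse response of the AP(1) model
h_az = np.array([1.0, 0.8])
rho_ap = acs_from_impulse_response(h_ap, 6)
rho_az = acs_from_impulse_response(h_az, 6)
k_ap = pacs_levinson(rho_ap)
k_az = pacs_levinson(rho_az)
```

Under these assumptions `rho_az` vanishes beyond lag 1 while `k_az` decays gradually, and the roles are reversed for `rho_ap` and `k_ap`.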

4.5 MODELS WITH POLES ON THE UNIT CIRCLE

In this section, we show that by restricting some poles to being on the unit circle, we obtain models that are useful for modeling certain types of nonstationary behavior. Pole-zero models with poles on the unit circle are unstable. Hence, if we drive them with stationary white noise, the generated process is nonstationary. However, as we will see in the sequel, placing a small number of real poles at $z = 1$ or complex conjugate poles at $z_k = e^{\pm j\theta_k}$ provides a class of models useful for modeling certain types of nonstationary behavior. The system function of a pole-zero model with $d$ poles at $z = 1$, denoted as PZ(P, d, Q), is

$$H(z) = \frac{D(z)}{A(z)} \, \frac{1}{(1 - z^{-1})^{d}} \tag{4.5.1}$$

and can be viewed as a PZ(P, Q) model, $D(z)/A(z)$, followed by a $d$th-order accumulator. The accumulator $y(n) = y(n-1) + x(n)$ has the system function $1/(1 - z^{-1})$ and can be thought of as a discrete-time integrator. The presence of the unit poles makes the PZ(P, d, Q) model non-minimum-phase. Since the model is unstable, we cannot use the convolution summation to represent it because, in practice, only finite-order approximations are possible. This can be easily seen if we recall that the impulse response of the model PZ(0, d, 0) equals $u(n)$ for $d = 1$ and $(n + 1)u(n)$ for $d = 2$. However, if $D(z)/A(z)$ is minimum-phase, the inverse model $H_I(z) = 1/H(z)$ is stable, and we can use the recursive form (see Section 4.1) to represent the model. Indeed, we always use this representation when we apply this model in practice. The spectrum of the PZ(0, d, 0) model is

$$R_d(e^{j\omega}) = \frac{1}{[2\sin(\omega/2)]^{2d}} \tag{4.5.2}$$

and since $R_d(0) = \sum_{l=-\infty}^{\infty} r_d(l) = \infty$, the autocorrelation does not exist. In the case of complex conjugate poles, the term $(1 - z^{-1})^{d}$ in (4.5.1) is replaced by $(1 - 2\cos\theta_k\, z^{-1} + z^{-2})^{d}$, that is,

$$H(z) = \frac{D(z)}{A(z)} \, \frac{1}{(1 - 2\cos\theta_k\, z^{-1} + z^{-2})^{d}} \tag{4.5.3}$$

The second term is basically a cascade of AP(2) models with complex conjugate poles on the unit circle. This model exhibits strong periodicity in its impulse response, and its "resonance-like" spectrum diverges at $\omega = \theta_k$. With regard to the partial autocorrelation, we recall that the presence of poles on the unit circle results in some lattice parameters taking on the values ±1.

EXAMPLE 4.5.1. Consider the following causal PZ(1, 1, 1) model

$$H(z) = \frac{1 + d_1 z^{-1}}{1 + a_1 z^{-1}} \, \frac{1}{1 - z^{-1}} = \frac{1 + d_1 z^{-1}}{1 - (1 - a_1) z^{-1} - a_1 z^{-2}} \tag{4.5.4}$$

with $-1 < a_1 < 1$ and $-1 < d_1 < 1$. The difference equation representation of the model uses previous values of the output and the present and previous values of the input. It is given by

$$y(n) = (1 - a_1)\, y(n-1) + a_1\, y(n-2) + x(n) + d_1\, x(n-1) \tag{4.5.5}$$

To express the output in terms of the present and previous values of the input (nonrecursive representation), we find the impulse response of the model

$$h(n) = Z^{-1}\{H(z)\} = A_1 u(n) + A_2 (-a_1)^{n} u(n) \tag{4.5.6}$$

where $A_1 = (1 + d_1)/(1 + a_1)$ and $A_2 = (a_1 - d_1)/(1 + a_1)$. Note that the model is unstable, and it cannot be approximated by an FIR system because $h(n) \to A_1 u(n)$ as $n \to \infty$. Finally, we can express the output as a weighted sum of previous outputs and the present input, using the impulse response of the inverse model $H_I(z) = 1/H(z)$

$$h_I(n) = Z^{-1}\{H_I(z)\} = B_1 \delta(n) + B_2 \delta(n-1) + B_3 (-d_1)^{n} u(n) \tag{4.5.7}$$

where $B_1 = (a_1 - d_1 + a_1 d_1)/d_1^2$, $B_2 = -a_1/d_1$, and $B_3 = (-a_1 + d_1 - a_1 d_1 + d_1^2)/d_1^2$. Since $-1 < d_1 < 1$, the sequence $h_I(n)$ decays at a rate governed by the value of $d_1$. If $h_I(n) \approx 0$ for $n \ge p_d$, the recursive formula

$$y(n) = -\sum_{k=1}^{p_d} h_I(k)\, y(n-k) + x(n) \tag{4.5.8}$$

provides a good representation of the PZ(1, 1, 1) model. For example, if $a_1 = 0.3$ and $d_1 = 0.5$, we find that $|h_I(n)| \le 0.0001$ for $n \ge 12$, which means that the current value of the model output can be computed with sufficient accuracy from the 12 most recent values of signal $y(n)$. This is illustrated in Figure 4.13, which also shows a realization of the output process if the model is driven by white Gaussian noise.
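The relations in Example 4.5.1 can be exercised numerically. The sketch below is an illustration under stated assumptions rather than the authors' program: the function names, the random seed, and the record length of 200 samples are hypothetical. It evaluates the closed form (4.5.7), cross-checks it against the difference equation implied by $H_I(z)$, and then computes each output sample from the 12 most recent true outputs as in (4.5.8).

```python
import numpy as np

a1, d1 = 0.3, 0.5   # values used in Example 4.5.1

# Closed-form coefficients of (4.5.7)
B1 = (a1 - d1 + a1 * d1) / d1**2
B2 = -a1 / d1
B3 = (-a1 + d1 - a1 * d1 + d1**2) / d1**2

def inverse_impulse_response(n_max):
    """h_I(0..n_max) from the closed form (4.5.7); requires n_max >= 1."""
    n = np.arange(n_max + 1)
    hI = B3 * (-d1) ** n
    hI[0] += B1
    hI[1] += B2
    return hI

def inverse_by_recursion(n_max):
    """Same sequence from H_I(z) = (1 - (1 - a1) z^-1 - a1 z^-2) / (1 + d1 z^-1); n_max >= 2."""
    num = np.zeros(n_max + 1)
    num[:3] = [1.0, -(1.0 - a1), -a1]
    hI = np.zeros(n_max + 1)
    for n in range(n_max + 1):
        hI[n] = num[n] - (d1 * hI[n - 1] if n > 0 else 0.0)
    return hI

def simulate_exact(x):
    """Output of the PZ(1, 1, 1) model from the difference equation (4.5.5), zero initial state."""
    y = np.zeros(len(x))
    for n in range(len(x)):
        y[n] = x[n]
        if n >= 1:
            y[n] += (1.0 - a1) * y[n - 1] + d1 * x[n - 1]
        if n >= 2:
            y[n] += a1 * y[n - 2]
    return y

def one_step_from_past(x, y, hI, pd):
    """Evaluate (4.5.8) at every n, using the true past outputs y(n-1), ..., y(n-pd)."""
    y_hat = np.zeros(len(x))
    for n in range(len(x)):
        kmax = min(pd, n)
        past = y[n - kmax:n][::-1]            # y(n-1), ..., y(n-kmax)
        y_hat[n] = x[n] - np.dot(hI[1:kmax + 1], past)
    return y_hat

rng = np.random.default_rng(0)                # assumed seed, for repeatability only
x = rng.standard_normal(200)                  # white Gaussian excitation
hI = inverse_impulse_response(40)
y = simulate_exact(x)
y_hat = one_step_from_past(x, y, hI, pd=12)
max_error = np.max(np.abs(y_hat - y))
```

Note that the comparison is made sample by sample with true past outputs; running the truncated recursion freely on its own past outputs is a different experiment, because the truncation slightly moves the pole at $z = 1$.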

FIGURE 4.13 Sample realization of the output process, impulse response, impulse response of the inverse model, and spectrum of a PZ(1, 1, 1) model with $a_1 = 0.3$, $d_1 = 0.5$, and $d = 1$. The value $R(e^{j0}) = \infty$ is not plotted. (Panels: "Sample realization", "Direct model: h(n)", and "Inverse model: h_I(n)", each showing amplitude versus sample number; "Spectrum", showing amplitude versus frequency in cycles/sampling interval.)

Autoregressive integrated moving-average models. In Section 3.3.2 we discussed discrete-time random signals with stationary increments. Clearly, driving a PZ(P, d, Q) model with white noise generates a random signal whose $d$th difference is a stationary ARMA(P, Q) process. Such time series are known in the statistical literature as autoregressive integrated moving-average models, denoted ARIMA(P, d, Q). They are useful in modeling signals with certain stochastic trends (e.g., random changes in the level and slope of the signal). Indeed, many empirical signals (e.g., infrared background measurements and stock prices) exhibit this type of behavior (see Figure 1.6). Notice that the ARIMA(0, 1, 0) process, that is, $x(n) = x(n-1) + w(n)$, where $\{w(n)\} \sim \mathrm{WN}(0, \sigma_w^2)$, is the discrete-time equivalent of the random walk or Brownian motion process (Papoulis 1991).

When the unit poles are complex conjugate, the model is known as a harmonic PZ model. This model produces random sequences that exhibit "random periodic behavior" and are known as seasonal time series in the statistical literature. Such signals repeat themselves cycle by cycle, but there is some randomness in both the length and the pattern of each cycle. The identification and estimation of ARIMA and seasonal models and their applications can be found in Box, Jenkins, and Reinsel (1994); Brockwell and Davis (1991); and Hamilton (1994).
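The statement that the $d$th difference of an ARIMA(0, d, 0) signal recovers its white-noise excitation can be illustrated with a short sketch. The helper names `integrate` and `difference`, the seed, and the record length are assumptions introduced here; zero initial conditions are assumed throughout.

```python
import numpy as np

rng = np.random.default_rng(1)        # assumed seed, for repeatability only
w = rng.standard_normal(500)          # white noise excitation, WN(0, 1)

def integrate(v, d=1):
    """Pass v through d cascaded accumulators 1/(1 - z^-1), zero initial state."""
    for _ in range(d):
        v = np.cumsum(v)
    return v

def difference(v, d=1):
    """Apply (1 - z^-1)^d with zero initial state, undoing integrate()."""
    for _ in range(d):
        v = np.concatenate([[v[0]], np.diff(v)])
    return v

x1 = integrate(w, d=1)                # ARIMA(0, 1, 0): random walk
x2 = integrate(w, d=2)                # ARIMA(0, 2, 0): random level and slope

# Impulse responses of PZ(0, d, 0): u(n) for d = 1 and (n + 1) u(n) for d = 2
delta = np.zeros(5)
delta[0] = 1.0
h1 = integrate(delta, d=1)
h2 = integrate(delta, d=2)
```

Differencing `x1` once, or `x2` twice, returns the stationary sequence `w`, which is how such records are usually handled before fitting an ARMA model.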

4.6 CEPSTRUM OF POLE-ZERO MODELS

In this section we determine the cepstrum of pole-zero models and its properties, and we develop algorithms to convert between direct structure model parameters and cepstral coefficients. The cepstrum has proved to be a valuable tool in speech coding and recognition applications and has been extensively studied in the corresponding literature (Rabiner and Schafer 1978; Rabiner and Juang 1993; Furui 1989). For simplicity, we consider models with real coefficients.


4.6.1 Pole-Zero Models

The cepstrum of the impulse response $h(n)$ of a pole-zero model is the inverse z-transform of

$$\log H(z) = \log D(z) - \log A(z) \tag{4.6.1}$$

$$\phantom{\log H(z)} = \log d_0 + \sum_{i=1}^{Q} \log(1 - z_i z^{-1}) - \sum_{i=1}^{P} \log(1 - p_i z^{-1}) \tag{4.6.2}$$

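where $\{z_i\}$ and $\{p_i\}$ are the zeros and poles of $H(z)$, respectively. If we assume that $H(z)$ is minimum-phase and use the power series expansion

$$\log(1 - \alpha z^{-1}) = -\sum_{n=1}^{\infty} \frac{\alpha^{n}}{n} z^{-n} \qquad |z| > |\alpha|$$

we find that the cepstrum $c(n)$ is given by

$$c(n) = \begin{cases} 0 & n < 0 \\ \log d_0 & n = 0 \\ \dfrac{1}{n}\left[\sum_{i=1}^{P} p_i^{n} - \sum_{i=1}^{Q} z_i^{n}\right] & n > 0 \end{cases} \tag{4.6.3}$$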

Since the poles and zeros are assumed to be inside the unit circle, (4.6.3) implies that $c(n)$ is bounded by

$$-\frac{P + Q}{n} \le c(n) \le \frac{P + Q}{n} \tag{4.6.4}$$

with equality if and only if all the roots are appropriately at $z = 1$ or $z = -1$. If $H(z)$ is minimum-phase, then there exists a unique mapping between the cepstrum and the impulse response, given by the recursive relations (Oppenheim and Schafer 1989)

$$c(0) = \log h(0) = \log d_0$$
$$c(n) = \frac{h(n)}{h(0)} - \frac{1}{n} \sum_{m=0}^{n-1} m\, c(m)\, \frac{h(n-m)}{h(0)} \qquad n > 0 \tag{4.6.5}$$

and

$$h(0) = e^{c(0)}$$
$$h(n) = h(0)\, c(n) + \frac{1}{n} \sum_{m=0}^{n-1} m\, c(m)\, h(n-m) \qquad n > 0 \tag{4.6.6}$$

where we have assumed d0 > 0 without loss of generality. Therefore, given the cepstrum c(n) in the range 0 ≤ n ≤ P +Q, we can completely recover the parameters of the pole-zero model as follows. From (4.6.6) we can compute h(n), 0 ≤ n ≤ P + Q, and from (4.4.2) and (4.4.3) we can recover {ak } and {dk }.
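The recursions (4.6.5) and (4.6.6) translate directly into code. The following is a minimal sketch; the function names and the PZ(1, 1) test model (pole at 0.8, zero at 0.5, $d_0 = 2$) are assumptions chosen for illustration. The result can be compared with the closed form (4.6.3), which for this model gives $c(n) = (0.8^{n} - 0.5^{n})/n$ for $n > 0$.

```python
import numpy as np

def cepstrum_from_impulse_response(h, n_max):
    """c(0..n_max) of a minimum-phase h(n) with h(0) > 0, using (4.6.5)."""
    h = np.asarray(h, dtype=float)
    h = np.concatenate([h, np.zeros(max(0, n_max + 1 - len(h)))])
    c = np.zeros(n_max + 1)
    c[0] = np.log(h[0])
    for n in range(1, n_max + 1):
        # the m = 0 term of the sum is zero, so the loop starts at m = 1
        acc = sum(m * c[m] * h[n - m] for m in range(1, n))
        c[n] = h[n] / h[0] - acc / (n * h[0])
    return c

def impulse_response_from_cepstrum(c):
    """h(0..len(c)-1) from the cepstrum, using (4.6.6)."""
    c = np.asarray(c, dtype=float)
    h = np.zeros(len(c))
    h[0] = np.exp(c[0])
    for n in range(1, len(c)):
        acc = sum(m * c[m] * h[n - m] for m in range(1, n))
        h[n] = h[0] * c[n] + acc / n
    return h

# Hypothetical test model: H(z) = 2 (1 - 0.5 z^-1) / (1 - 0.8 z^-1)
d0 = 2.0
h = np.zeros(16)
h[0] = d0
h[1] = d0 * (0.8 - 0.5)
for n in range(2, 16):
    h[n] = 0.8 * h[n - 1]

c = cepstrum_from_impulse_response(h, 15)
h_back = impulse_response_from_cepstrum(c)
```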

4.6.2 All-Pole Models

The cepstrum of a minimum-phase all-pole model is given by (4.6.2) and (4.6.3) with $Q = 0$. Since $H(z)$ is minimum-phase, the cepstrum $c(n)$ of $1/A(z)$ is simply the negative of the cepstrum of $A(z)$, which can be written in terms of $a_k$ (see also Problem 4.34). As a result, the cepstrum can be obtained from the direct-form coefficients by using the following recursion

$$c(n) = \begin{cases} -a_n - \dfrac{1}{n} \sum_{k=1}^{n-1} (n-k)\, a_k\, c(n-k) & 1 \le n \le P \\[2ex] -\dfrac{1}{n} \sum_{k=1}^{P} (n-k)\, a_k\, c(n-k) & n > P \end{cases} \tag{4.6.7}$$

The inverse relation is

$$a_n = -c(n) - \frac{1}{n} \sum_{k=1}^{n-1} (n-k)\, a_k\, c(n-k) \qquad n > 0 \tag{4.6.8}$$

which shows that the first $P$ cepstral coefficients completely determine the model parameters (Furui 1981). From (4.6.7) it is evident that the cepstrum generally decays as $1/n$. Therefore, it may be desirable sometimes to consider

$$c'(n) = n\, c(n) \tag{4.6.9}$$

which is known as the ramp cepstrum since it is obtained by multiplying the cepstrum by a ramp function. From (4.6.9) and (4.6.4), we note that the ramp cepstrum of an AP(P) model is bounded by

$$|c'(n)| \le P \qquad n > 0 \tag{4.6.10}$$

with equality if and only if all the poles are at $z = 1$ or $z = -1$. Also $c'(n)$ is equal to the negative of the inverse z-transform of the derivative of $\log H(z)$. From the preceding equations, we can write

$$c'(n) = -n\, a_n - \sum_{k=1}^{n-1} a_k\, c'(n-k) \qquad 1 \le n \le P \tag{4.6.11}$$

$$c'(n) = -\sum_{k=1}^{P} a_k\, c'(n-k) \qquad n > P \tag{4.6.12}$$

and

$$a_n = -\frac{1}{n}\left[c'(n) + \sum_{k=1}^{n-1} a_k\, c'(n-k)\right] \qquad n > 0 \tag{4.6.13}$$
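The conversions (4.6.7) and (4.6.8) between all-pole coefficients and cepstral coefficients can be sketched as follows. The function names and the AP(2) test model (poles at $0.9e^{\pm j\pi/3}$, so that $a_1 = -0.9$ and $a_2 = 0.81$) are assumptions made for illustration; for a complex conjugate pole pair, (4.6.3) gives $c(n) = (2/n)\, r^{n} \cos n\theta$, which serves as a reference.

```python
import numpy as np

def ap_to_cepstrum(a, n_max, d0=1.0):
    """c(0..n_max) of H(z) = d0 / (1 + sum_k a[k-1] z^-k), using (4.6.7)."""
    P = len(a)
    c = np.zeros(n_max + 1)
    c[0] = np.log(d0)
    for n in range(1, n_max + 1):
        # upper limit is n - 1 for n <= P and P for n > P
        acc = sum((n - k) * a[k - 1] * c[n - k] for k in range(1, min(n - 1, P) + 1))
        c[n] = -acc / n - (a[n - 1] if n <= P else 0.0)
    return c

def cepstrum_to_ap(c, P):
    """a_1..a_P from c(1..P), using (4.6.8)."""
    a = np.zeros(P)
    for n in range(1, P + 1):
        acc = sum((n - k) * a[k - 1] * c[n - k] for k in range(1, n))
        a[n - 1] = -c[n] - acc / n
    return a

# Hypothetical AP(2) model with poles 0.9 exp(+/- j pi/3)
r, theta = 0.9, np.pi / 3
a = np.array([-2 * r * np.cos(theta), r**2])
c = ap_to_cepstrum(a, 20)
a_back = cepstrum_to_ap(c, 2)
ramp = np.arange(21) * c              # ramp cepstrum c'(n) = n c(n)
```

In this sketch `ramp[1:]` stays within the bound (4.6.10), that is, its magnitude never exceeds $P = 2$.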

It is evident that the first $P$ values of $c'(n)$, $1 \le n \le P$, completely specify the model coefficients. However, since $c'(0) = 0$, the information about the gain $d_0$ is lost in the ramp cepstrum. Equation (4.6.12) for $n > P$ is reminiscent of similar equations for the impulse response in (4.2.5) and the autocorrelation in (4.2.18), with the major difference that for the ramp cepstrum the relation is only true for $n > P$, while for the impulse response and the autocorrelation, the relations are true for $n > 0$ and $k > 0$, respectively.

Since $R(z) = H(z)H(z^{-1})$, we have

$$\log R(z) = \log H(z) + \log H(z^{-1}) \tag{4.6.14}$$

and if $c_r(n)$ is the real cepstrum of $R(e^{j\omega})$, we conclude that

$$c_r(n) = c(n) + c(-n) \tag{4.6.15}$$

For minimum-phase $H(z)$, $c(n) = 0$ for $n < 0$. Therefore,

$$c_r(n) = \begin{cases} c(-n) & n < 0 \\ 2c(0) & n = 0 \\ c(n) & n > 0 \end{cases} \tag{4.6.16}$$

and

$$c(n) = \begin{cases} 0 & n < 0 \\ \dfrac{c_r(0)}{2} & n = 0 \\ c_r(n) & n > 0 \end{cases} \tag{4.6.17}$$

In other words, the cepstrum $c(n)$ can be obtained simply by taking the inverse Fourier transform of $\log R(e^{j\omega})$ to obtain $c_r(n)$ and then applying (4.6.17).

EXAMPLE 4.6.1. From (4.6.7) we find that the cepstrum of the AP(1) model is given by

$$c(n) = \begin{cases} 0 & n < 0 \\ \log d_0 & n = 0 \\ \dfrac{1}{n} (-a)^{n} & n > 0 \end{cases} \tag{4.6.18}$$

From (4.2.18) with $P = 1$ and $k = 1$, we have $a_1^{(1)} = -r(1)/r(0) = k_1$; and from (4.6.7) we have $a_1 = -c(1)$. These results are summarized below:

$$a_1^{(1)} = a = -\rho(1) = k_1 = -c(1) \tag{4.6.19}$$

The fact that $\rho(1) = c(1)$ here is peculiar to a single-pole spectrum and is not true in general for arbitrary spectra: $\rho(1)$ is the integral of a cosine-weighted spectrum, while $c(1)$ is the integral of a cosine-weighted log spectrum.

EXAMPLE 4.6.2. From (4.6.7), the cepstrum for an AP(2) model is equal to

$$c(n) = \begin{cases} 0 & n < 0 \\ \log d_0 & n = 0 \\ \dfrac{1}{n} (p_1^{n} + p_2^{n}) & n > 0 \end{cases} \tag{4.6.20}$$

For a complex conjugate pole pair, we have

$$c(n) = \frac{2}{n}\, r^{n} \cos n\theta \qquad n > 0 \tag{4.6.21}$$

where $p_{1,2} = r \exp(\pm j\theta)$. Therefore, the cepstrum of a damped sine wave is a damped cosine wave. The cepstrum and autocorrelation are similar in that they are both damped cosines, but the cepstrum has an additional $1/n$ weighting. From (4.6.7) and (4.6.8) we can relate the model parameters and the cepstral coefficients:

$$a_1 = -c(1) \qquad a_2 = -c(2) + \tfrac{1}{2} c^{2}(1) \tag{4.6.22}$$

and

$$c(1) = -a_1 \qquad c(2) = -a_2 + \tfrac{1}{2} a_1^{2} \tag{4.6.23}$$

Using (4.2.71) and the relations for the cepstrum, we can derive the conditions on the cepstrum for $H(z)$ to be minimum-phase:

$$c(2) > \frac{c^{2}(1)}{2} - 1 \qquad c(2) < \frac{c^{2}(1)}{2} - c(1) + 1 \qquad c(2) < \frac{c^{2}(1)}{2} + c(1) + 1 \tag{4.6.24}$$

The corresponding admissible region is shown in Figure 4.14. The region corresponding to complex roots is given by

$$\frac{1}{2} c^{2}(1) - 1 < c(2) < \frac{c^{2}(1)}{4} \tag{4.6.25}$$
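Conditions (4.6.24) and (4.6.25) can be cross-checked against the pole locations obtained through (4.6.22). The sketch below is illustrative only; the function names and the sample points are assumptions, and `numpy.roots` is used to locate the poles of $1/A(z)$.

```python
import numpy as np

def is_minimum_phase_cepstral(c1, c2):
    """Conditions (4.6.24) on c(1) and c(2) of an AP(2) model."""
    half = 0.5 * c1**2
    return bool((c2 > half - 1.0) and (c2 < half - c1 + 1.0) and (c2 < half + c1 + 1.0))

def has_complex_poles_cepstral(c1, c2):
    """Condition (4.6.25): complex conjugate poles inside the unit circle."""
    return bool(0.5 * c1**2 - 1.0 < c2 < 0.25 * c1**2)

def poles_from_cepstrum(c1, c2):
    """Poles of 1/A(z), with a1 and a2 obtained from (4.6.22)."""
    a1 = -c1
    a2 = -c2 + 0.5 * c1**2
    return np.roots([1.0, a1, a2])

# Sample points (assumed): a complex pair 0.9 exp(+/- j pi/3) and real poles 0.5, -0.3
complex_point = (2 * 0.9 * np.cos(np.pi / 3), 0.9**2 * np.cos(2 * np.pi / 3))
real_point = (0.5 - 0.3, 0.5 * (0.5**2 + 0.3**2))
# Poles at 2 and 0.25 give a1 = -2.25, a2 = 0.5, which is not minimum-phase
outside_point = (2.25, -0.5 + 0.5 * 2.25**2)
```

For example, `is_minimum_phase_cepstral(*complex_point)` and `has_complex_poles_cepstral(*complex_point)` both hold, while `real_point` satisfies only the first test and `outside_point` neither.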

FIGURE 4.14 Minimum-phase region of the AP(2) model in the cepstral domain. (Axes: c(1) horizontal, from −2.0 to 2.0; c(2) vertical, from −1.0 to 1.0. Labeled regions: complex poles and real poles.)

In comparing Figures 4.6, 4.8, and 4.14, we note that the admissible regions for the PACS and ACS are convex while that for the cepstral coefficients is not. (A region is convex if a straight line drawn between any two points in the region lies completely in the region.) In general, the PACS and the ACS span regions or spaces that are convex. The admissible region in Figure 4.14 for the model coefficients is also convex. However, for $P > 2$ the admissible regions for the model coefficients are not convex, in general.

Cepstral distance. A measure of the difference between two signals, which has many applications in speech coding and recognition, is the distance between their log spectra (Rabiner and Juang 1993). It is known as the cepstral distance and is defined as

$$\mathrm{CD} \triangleq \frac{1}{2\pi} \int_{-\pi}^{\pi} \left|\log R_1(e^{j\omega}) - \log R_2(e^{j\omega})\right|^{2} d\omega \tag{4.6.26}$$

$$\phantom{\mathrm{CD}} = \sum_{n=-\infty}^{\infty} [c_1(n) - c_2(n)]^{2} \tag{4.6.27}$$

where c1 (n) and c2 (n) are the cepstral coefficients of R1 (ej ω ) and R2 (ej ω ), respectively (see Problem 4.36). Since for minimum-phase sequences the cepstrum decays fast, the summation (4.6.27) can be computed with sufficient accuracy using a small number of terms, usually 20 to 30. For minimum-phase all-pole models, which are mostly used in speech processing, the cepstral coefficients are efficiently computed using the recursion (4.6.7).
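The equivalence of (4.6.26) and (4.6.27) can be checked numerically for two all-pole spectra. This is a sketch under stated assumptions: the function names, the two test models, the 30-term truncation, and the 4096-point frequency grid are all choices made here, not values from the text. The sum uses the fact that the cepstrum of $R(e^{j\omega})$ is even, with value $2c(0)$ at the origin, as in (4.6.16).

```python
import numpy as np

def ap_to_cepstrum(a, n_max, d0=1.0):
    """c(0..n_max) of H(z) = d0 / (1 + sum_k a[k-1] z^-k), using (4.6.7)."""
    P = len(a)
    c = np.zeros(n_max + 1)
    c[0] = np.log(d0)
    for n in range(1, n_max + 1):
        acc = sum((n - k) * a[k - 1] * c[n - k] for k in range(1, min(n - 1, P) + 1))
        c[n] = -acc / n - (a[n - 1] if n <= P else 0.0)
    return c

def cepstral_distance(a_1, d0_1, a_2, d0_2, n_terms=30):
    """Truncated form of (4.6.27) for two minimum-phase all-pole spectra."""
    c_1 = ap_to_cepstrum(a_1, n_terms, d0_1)
    c_2 = ap_to_cepstrum(a_2, n_terms, d0_2)
    # c_r(0) = 2 c(0); lags +n and -n contribute equally for n > 0
    return (2.0 * (c_1[0] - c_2[0])) ** 2 + 2.0 * np.sum((c_1[1:] - c_2[1:]) ** 2)

def cepstral_distance_by_integration(a_1, d0_1, a_2, d0_2, n_freq=4096):
    """Direct evaluation of (4.6.26) on a uniform frequency grid."""
    def log_spectrum(a, d0):
        w = 2.0 * np.pi * np.arange(n_freq) / n_freq
        k = np.arange(1, len(a) + 1)
        A = 1.0 + np.exp(-1j * np.outer(w, k)) @ np.asarray(a, dtype=float)
        return 2.0 * np.log(d0) - 2.0 * np.log(np.abs(A))
    diff = log_spectrum(a_1, d0_1) - log_spectrum(a_2, d0_2)
    return np.mean(diff ** 2)

# Hypothetical models: AP(1) with pole 0.5, and AP(2) with poles 0.7 exp(+/- j pi/3), d0 = 2
model_1 = ([-0.5], 1.0)
model_2 = ([-0.7, 0.49], 2.0)
cd_sum = cepstral_distance(*model_1, *model_2)
cd_int = cepstral_distance_by_integration(*model_1, *model_2)
```

Because the poles of both assumed models lie well inside the unit circle, the 30-term sum and the frequency-domain integral should agree closely; poles nearer the unit circle would call for more terms.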

4.6.3 All-Zero Models

The cepstrum of a minimum-phase all-zero model is given by (4.6.2) and (4.6.3) with $P = 0$. The cepstrum corresponding to a minimum-phase AZ(Q) model is related to its real cepstrum by

$$c(n) = \begin{cases} 0 & n < 0 \\ \dfrac{c_r(n)}{2} & n = 0 \\ c_r(n) & n > 0 \end{cases} \tag{4.6.28}$$

Once $c(n)$ has been found, the coefficients of a minimum-phase AZ(Q) model $D(z)$ can be evaluated recursively from

$$d_k = \begin{cases} e^{c(0)} & k = 0 \\ c(k)\, d_0 + \dfrac{1}{k} \sum_{m=0}^{k-1} m\, c(m)\, d_{k-m} & 1 \le k \le Q \end{cases} \tag{4.6.29}$$
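The cepstral route from an autocorrelation sequence to the minimum-phase coefficients, using (4.6.28) and (4.6.29), can be sketched with the FFT. The function name, the FFT length, and the AZ(2) test model are assumptions; the sketch also assumes that the spectrum is strictly positive (no zeros on the unit circle) and that the FFT is long enough for time aliasing of the cepstrum to be negligible.

```python
import numpy as np

def minimum_phase_az_from_acs(r, n_fft=4096):
    """Minimum-phase d_0..d_Q from the ACS r(0..Q) of an AZ(Q) model.

    Assumes R(e^{jw}) > 0 at all frequencies and n_fft much larger than Q.
    """
    r = np.asarray(r, dtype=float)
    Q = len(r) - 1
    # Even ACS in circular order: r(0), r(1..Q), zeros, r(Q..1)
    r_circ = np.zeros(n_fft)
    r_circ[: Q + 1] = r
    r_circ[n_fft - Q:] = r[:0:-1]
    R = np.fft.fft(r_circ).real                # samples of the spectrum
    c_r = np.fft.ifft(np.log(R)).real          # cepstrum of the spectrum (slightly aliased)
    c = np.zeros(Q + 1)
    c[0] = 0.5 * c_r[0]                        # (4.6.28), n = 0
    c[1:] = c_r[1: Q + 1]                      # (4.6.28), n > 0
    d = np.zeros(Q + 1)
    d[0] = np.exp(c[0])                        # (4.6.29), k = 0
    for k in range(1, Q + 1):
        acc = sum(m * c[m] * d[k - m] for m in range(1, k))
        d[k] = c[k] * d[0] + acc / k           # (4.6.29), 1 <= k <= Q
    return d

def acs_of_az(d):
    """r(0..Q) of an AZ(Q) model with coefficients d, driven by unit-variance white noise."""
    d = np.asarray(d, dtype=float)
    return np.array([np.dot(d[: len(d) - l], d[l:]) for l in range(len(d))])

d_min = np.array([2.0, -1.0, 0.5])             # zeros at radius 0.5: minimum-phase
d_max = d_min[::-1]                            # same ACS, zeros at radius 2
d_from_min = minimum_phase_az_from_acs(acs_of_az(d_min))
d_from_max = minimum_phase_az_from_acs(acs_of_az(d_max))
```

Both calls should return the minimum-phase coefficients `d_min`, since the two models share the same autocorrelation.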

This procedure for finding a minimum-phase polynomial D(z) from the autocorrelation consists in first computing the cepstrum from the log spectrum, then applying (4.6.28) and the recursion (4.6.29) to compute the coefficients dk . This approach to the spectral factorization of AZ(Q) models is preferable because finding the roots of R(z) for large Q may be cumbersome. Mixed pole-zero model representations. In the previous sections we saw that the P + Q + 1 parameters of the minimum-phase PZ(P , Q) model can be represented equivalently and uniquely by P + Q + 1 values of the impulse response, the autocorrelation, or the cepstrum. A question arises as to whether PZ(P , Q) can be represented uniquely by a mixture of representations, as long as the total number of representative values is P +Q+1. For example, could we have a unique representation that consists of, say, Q autocorrelation values and P + 1 impulse response values, or some other mixture? The answer to this question has not been explored in general; the relevant equations are sufficiently nonlinear that a totally different approach would appear to be needed to solve the general problem.

4.7 SUMMARY

In this chapter we introduced the class of pole-zero signal models and discussed their properties. Each model consists of two components: an excitation source and a system. In our treatment, we emphasized that the properties of a signal model are shaped by the properties of both components; and we tried, whenever possible, to attribute each property to its originator. Thus, for uncorrelated random inputs, which by definition are the excitations for ARMA models, the second-order moments of the signal model and its minimum-phase characteristics are completely determined by the system. For excitations with line spectra, properties such as minimum phase are meaningful only when they are attributed to the underlying system. If the goal is to model a signal with a line PSD, the most appropriate approach is to use a harmonic process. We provided a detailed description of the autocorrelation, power spectral density, partial correlation, and cepstral properties of all AZ, AP, and PZ models for the general case and for first- and second-order models. An understanding of these properties is very important for model selection in practical applications.

PROBLEMS

4.1 Show that a second-order pole $p_i$ contributes the term $n p_i^{n} u(n)$ and a third-order pole the terms $n p_i^{n} u(n) + n^{2} p_i^{n} u(n)$ to the impulse response of a causal PZ model. The general case is discussed in Oppenheim et al. (1997).

4.2 Consider a zero-mean random sequence $x(n)$ with PSD

$$R_x(e^{j\omega}) = \frac{5 + 3\cos\omega}{17 + 8\cos\omega}$$

(a) Determine the innovations representation of the process $x(n)$.
(b) Find the autocorrelation sequence $r_x(l)$.

4.3 We want to generate samples of a Gaussian process with autocorrelation $r_x(l) = (\tfrac{1}{2})^{|l|} + (-\tfrac{1}{2})^{|l|}$ for all $l$.
(a) Find the difference equation that generates the process $x(n)$ when excited by $w(n) \sim \mathrm{WGN}(0, 1)$.
(b) Generate $N = 1000$ samples of the process and estimate the pdf, using the histogram, and the normalized autocorrelation $\rho_x(l)$, using $\hat{\rho}_x(l)$ [see Equation (1.2.1)].
(c) Check the validity of the model by plotting on the same graph (i) the true and estimated pdf of $x(n)$ and (ii) the true and estimated autocorrelation.

4.4 Compute and compare the autocorrelations of the following processes: (a) $x_1(n) = w(n) + 0.3w(n-1) - 0.4w(n-2)$ and (b) $x_2(n) = w(n) - 1.2w(n-1) - 1.6w(n-2)$, where $w(n) \sim \mathrm{WGN}(0, 1)$. Explain your findings.

4.5 Compute and plot the impulse response and the magnitude response of the systems $H(z)$ and $H_N(z)$ in Example 4.2.1 for $a = 0.7, 0.95$ and $N = 8, 16, 64$. Investigate how well the all-zero systems approximate the single-pole system.

4.6 Prove Equation (4.2.35) by writing explicitly Equation (4.2.33) and rearranging terms. Then show that the coefficient matrix A can be written as the sum of a triangular Toeplitz matrix and a triangular Hankel matrix (recall that a matrix H is Hankel if the matrix JHJH is Toeplitz).

4.7 Use the Yule-Walker equations to determine the autocorrelation and partial autocorrelation coefficients of the following AR models, assuming that $w(n) \sim \mathrm{WN}(0, 1)$.
(a) $x(n) = 0.5x(n-1) + w(n)$.
(b) $x(n) = 1.5x(n-1) - 0.6x(n-2) + w(n)$.
What is the variance $\sigma_x^2$ of the resulting process?

4.8 Given the AR process $x(n) = x(n-1) - 0.5x(n-2) + w(n)$, complete the following tasks.
(a) Determine $\rho_x(1)$.
(b) Using $\rho_x(0)$ and $\rho_x(1)$, compute $\{\rho_x(l)\}_{2}^{15}$ by the corresponding difference equation.
(c) Plot $\rho_x(l)$ and use the resulting graph to estimate its period.
(d) Compare the period obtained in part (c) with the value obtained using the PSD of the model. (Hint: Use the frequency of the PSD peak.)

4.9 Given the parameters $d_0$, $a_1$, $a_2$, and $a_3$ of an AP(3) model, compute its ACS analytically and verify your results, using the values in Example 4.2.3. (Hint: Use Cramer's rule.)

4.10 Consider the following AP(3) model: $x(n) = 0.98x(n-3) + w(n)$, where $w(n) \sim \mathrm{WGN}(0, 1)$.
(a) Plot the PSD of $x(n)$ and check if the obtained process is going to exhibit a pseudoperiodic behavior.
(b) Generate and plot 100 samples of the process. Does the graph support the conclusion of part (a)? If yes, what is the period?
(c) Compute and plot the PSD of the process $y(n) = \tfrac{1}{3}[x(n-1) + x(n) + x(n+1)]$.
(d) Repeat part (b) and explain the difference between the behavior of processes $x(n)$ and $y(n)$.

4.11 Consider the following AR(2) models: (i) $x(n) = 0.6x(n-1) + 0.3x(n-2) + w(n)$ and (ii) $x(n) = 0.8x(n-1) - 0.5x(n-2) + w(n)$, where $w(n) \sim \mathrm{WGN}(0, 1)$.
(a) Find the general expression for the normalized autocorrelation sequence $\rho(l)$, and determine $\sigma_x^2$.
(b) Plot $\{\rho(l)\}_{0}^{15}$ and check if the models exhibit pseudoperiodic behavior.
(c) Justify your answer in part (b) by plotting the PSD of the two models.

4.12 (a) Derive the formulas that express the PACS of an AP(3) model in terms of its ACS, using the Yule-Walker equations and Cramer's rule.
(b) Use the obtained formulas to compute the PACS of the AP(3) model in Example 4.2.3.
(c) Check the results in part (b) by recomputing the PACS, using the algorithm of Levinson-Durbin.

4.13 Show that the spectrum of any PZ model with real coefficients has zero slope at $\omega = 0$ and $\omega = \pi$.

4.14 Derive Equations (4.2.71) describing the minimum-phase region of the AP(2) model, starting from the conditions (a) $|p_1| < 1$, $|p_2| < 1$ and (b) $|k_1| < 1$, $|k_2| < 1$.

4.15 (a) Show that the spectrum of an AP(2) model with real poles can be obtained by the cascade connection of two AP(1) models with real coefficients.
(b) Compute and plot the impulse response, ACS, PACS, and spectrum of the AP models with $p_1 = 0.6$, $p_2 = -0.9$, and $p_1 = p_2 = 0.9$.

4.16 Prove Equation (4.2.89) and demonstrate its validity by plotting the spectrum (4.2.88) for various values of $r$ and $\theta$.

4.17 Prove that if the AP(P) model $A(z)$ is minimum-phase, then

$$\frac{1}{2\pi} \int_{-\pi}^{\pi} \log \frac{1}{|A(e^{j\omega})|^{2}} \, d\omega = 0$$

4.18 (a) Prove Equations (4.2.101) and (4.2.102) and recreate the plot in Figure 4.8(a).
(b) Determine and plot the regions corresponding to complex and real poles in the autocorrelation domain by recreating Figure 4.8(b).

4.19 Consider an AR(2) process $x(n)$ with $d_0 = 1$, $a_1 = -1.6454$, $a_2 = 0.9025$, and $w(n) \sim \mathrm{WGN}(0, 1)$.
(a) Generate 100 samples of the process and use them to estimate the ACS $\hat{\rho}_x(l)$, using Equation (1.2.1).
(b) Plot and compare the estimated and theoretical ACS values for $0 \le l \le 10$.
(c) Use the estimated values of $\hat{\rho}_x(l)$ and the Yule-Walker equations to estimate the parameters of the model. Compare the estimated with the true values, and comment on the accuracy of the approach.
(d) Use the estimated parameters to compute the PSD of the process. Plot and compare the estimated and true PSDs of the process.
(e) Compute and compare the estimated with the true PACS.

4.20 Find a minimum-phase model with autocorrelation $\rho(0) = 1$, $\rho(\pm 1) = 0.25$, and $\rho(l) = 0$ for $|l| \ge 2$.

4.21 Consider the MA(2) model $x(n) = w(n) - 0.1w(n-1) + 0.2w(n-2)$.
(a) Is the process $x(n)$ stationary? Why?
(b) Is the model minimum-phase? Why?
(c) Determine the autocorrelation and partial autocorrelation of the process.

4.22 Consider the following ARMA models: (i) $x(n) = 0.6x(n-1) + w(n) - 0.9w(n-1)$ and (ii) $x(n) = 1.4x(n-1) - 0.6x(n-2) + w(n) - 0.8w(n-1)$.
(a) Find a general expression for the autocorrelation $\rho(l)$.
(b) Compute the partial autocorrelation $k_m$ for $m = 1, 2, 3$.
(c) Generate 100 samples from each process, and use them to estimate $\{\hat{\rho}(l)\}_{0}^{20}$ using Equation (1.2.1).
(d) Use $\hat{\rho}(l)$ to estimate $\{\hat{k}_m\}_{1}^{20}$.
(e) Plot and compare the estimates with the theoretically obtained values.


4.23 Determine the coefficients of a PZ(2, 1) model with autocorrelation values $r_h(0) = 19$, $r_h(1) = 9$, $r_h(2) = -5$, and $r_h(3) = -7$.

4.24 (a) Show that the impulse response of an AZ(Q) model can be recovered from its response $\tilde{h}(n)$ to a periodic train with period $L$ if $L > Q$.
(b) Show that the ACS of an AZ(Q) model can be recovered from the ACS or spectrum of $\tilde{h}(n)$ if $L \ge 2Q + 1$.

4.25 Prove Equation (4.3.17) and illustrate its validity by computing the PACS of the model $H(z) = 1 - 0.8z^{-1}$.

4.26 Prove Equations (4.3.24) that describe the minimum-phase region of the AZ(2) model.

4.27 Consider an AZ(2) model with $d_0 = 2$ and zeros $z_{1,2} = 0.95e^{\pm j\pi/3}$.
(a) Compute and plot $N = 100$ output samples by exciting the model with the process $w(n) \sim \mathrm{WGN}(0, 1)$.
(b) Compute and plot the ACS, PACS, and spectrum of the model.
(c) Repeat parts (a) and (b) by assuming that we have an AP(2) model with poles at $p_{1,2} = 0.95e^{\pm j\pi/3}$.
(d) Investigate the duality between the ACS and PACS of the two models.

4.28 Prove Equations (4.4.31) and use them to reproduce the plot shown in Figure 4.12(b). Indicate which equation corresponds to each curve.

4.29 Determine the spectral flatness measure of the following processes: (a) $x(n) = a_1 x(n-1) + a_2 x(n-2) + w(n)$ and (b) $x(n) = w(n) + b_1 w(n-1) + b_2 w(n-2)$, where $w(n)$ is a white noise sequence.

4.30 Consider a zero-mean wide-sense stationary (WSS) process $x(n)$ with PSD $R_x(e^{j\omega})$ and an $M \times M$ correlation matrix with eigenvalues $\{\lambda_k\}_{1}^{M}$. Szegö's theorem (Grenander and Szegö 1958) states that if $g(\cdot)$ is a continuous function, then

$$\lim_{M \to \infty} \frac{g(\lambda_1) + g(\lambda_2) + \cdots + g(\lambda_M)}{M} = \frac{1}{2\pi} \int_{-\pi}^{\pi} g[R_x(e^{j\omega})] \, d\omega$$

Using this theorem, show that

$$\lim_{M \to \infty} (\det \mathbf{R}_x)^{1/M} = \exp\left\{ \frac{1}{2\pi} \int_{-\pi}^{\pi} \ln[R_x(e^{j\omega})] \, d\omega \right\}$$

4.31 Consider two linear random processes with system functions

$$\text{(i)} \; H(z) = \frac{1 - 0.5z^{-1}}{(1 - z^{-1})^{2}} \qquad \text{and} \qquad \text{(ii)} \; H(z) = \frac{1 - 0.81z^{-1} - 0.4z^{-2}}{1 - z^{-1}}$$

(a) Find a difference equation that leads to a numerically stable simulation of each process.
(b) Generate and plot 100 samples from each process, and look for indications of nonstationarity in the obtained records.
(c) Compute and plot the second difference of (i) and the first difference of (ii). Comment about the stationarity of the obtained records.

4.32 Generate and plot 100 samples for each of the linear processes with system functions

$$\text{(a)} \; H(z) = \frac{1}{(1 - z^{-1})(1 - 0.9z^{-1})} \qquad \text{(b)} \; H(z) = \frac{1 - 0.5z^{-1}}{(1 - z^{-1})(1 - 0.9z^{-1})}$$

and then estimate and examine the values of the ACS $\{\hat{\rho}(l)\}_{0}^{20}$ and the PACS $\{\hat{k}_m\}_{1}^{20}$.

4.33 Consider the process $y(n) = d_0 + d_1 n + d_2 n^{2} + x(n)$, where $x(n)$ is a stationary process with known autocorrelation $r_x(l)$.
(a) Show that the process $y^{(2)}(n)$ obtained by passing $y(n)$ through the filter $H(z) = (1 - z^{-1})^{2}$ is stationary.
(b) Express the autocorrelation $r_y^{(2)}(l)$ of $y^{(2)}(n)$ in terms of $r_x(l)$.
Note: This process is used in practice to remove quadratic trends from data before further analysis.

4.34 Prove Equation (4.6.7), which computes the cepstrum of an AP model from its coefficients.

4.35 Consider a minimum-phase AZ(Q) model $D(z) = \sum_{k=0}^{Q} d_k z^{-k}$ with complex cepstrum $c(k)$. We create another AZ model with coefficients $\tilde{d}_k = \alpha^{k} d_k$ and complex cepstrum $\tilde{c}(k)$.
(a) If $0 < \alpha < 1$, find the relation between $\tilde{c}(k)$ and $c(k)$.
(b) Choose $\alpha$ so that the new model is not minimum-phase.
(c) Choose $\alpha$ so that the new model is maximum-phase.

4.36 Prove Equation (4.6.27), which determines the cepstral distance in the frequency and time domains.


CHAPTER 5

Nonparametric Power Spectrum Estimation

The essence of frequency analysis is the representation of a signal as a superposition of sinusoidal components. In theory, the exact form of this decomposition (spectrum) depends on the assumed signal model. In Chapters 2 and 3 we discussed the mathematical tools required to define and compute the spectrum of signals described by deterministic and stochastic models, respectively. In practical applications, where only a finite segment of a signal is available, we cannot obtain a complete description of the adopted signal model. Therefore, we can only compute an approximation (estimate) of the spectrum of the adopted signal model ("true" or theoretical spectrum). The quality of the estimated spectrum depends on

• How well the assumed signal model represents the data.
• What values we assign to the unavailable signal samples.
• Which spectrum estimation method we use.

Clearly, meaningful application of spectrum estimation in practical problems requires sufficient a priori information, understanding of the signal generation process, knowledge of theoretical concepts, and experience.

In this chapter we discuss the most widely used correlation and spectrum estimation methods, as well as their properties, implementation, and application to practical problems. We discuss only nonparametric techniques that do not assume a particular functional form, but allow the form of the estimator to be determined entirely by the data. These methods are based on the discrete Fourier transform of either the signal segment or its autocorrelation sequence. In contrast, parametric methods assume that the available signal segment has been generated by a specific parametric model (e.g., a pole-zero or harmonic model). Since the choice of an inappropriate signal model will lead to erroneous results, the successful application of parametric techniques, without sufficient a priori information, is very difficult in practice. These methods are discussed in Chapter 9.

We begin this chapter with an introductory discussion on the purpose of, and the DSP approach to, spectrum estimation. We explore various errors involved in estimation from finite-length data records (i.e., based on partial information). We also outline conventional techniques for deterministic signals, using concepts developed in Chapter 2. In Section 3.6, we presented important concepts and results from estimation theory that are used extensively in this chapter. Section 5.3 is the main section of this chapter, in which we discuss various nonparametric approaches to the power spectrum estimation of stationary random signals. This analysis is extended to jointly stationary (bivariate) random signals for the computation of the cross-spectrum in Section 5.4. The computation of auto- and cross-spectra using Thomson's multiple windows (or multitapers) is discussed in Section 5.5. Finally, in Section 5.6 we summarize important topics and concepts from this chapter. A classification of the various spectral estimation methods that are discussed in this book is provided in Figure 5.1.

FIGURE 5.1 Classification of various spectrum estimation methods. (Diagram labels: Spectral estimation; Deterministic signal model: Fourier analysis (Section 5.1); Main limitation: windowing; Bias; Mainlobe width: smoothing, loss of resolution; Sidelobe height: leakage, "wrong" location of peaks; Stochastic signal models; Main limitations: windowing + randomness; Bias + randomness; Nonparametric methods; Fourier analysis (Section 5.3): autocorrelation windowing, periodogram averaging; Capon's minimum variance (Chapter 9); Multitaper method (Section 5.5); Parametric methods; ARMA (pole-zero) models (Chapter 9); Harmonic process (Chapter 9); Long-memory models (Chapter 12).)

5.1 SPECTRAL ANALYSIS OF DETERMINISTIC SIGNALS If we adopt a deterministic signal model, the mathematical tools for spectral analysis are the Fourier series and the Fourier transforms summarized in Section 2.2.1. It should be stressed at this point that applying any of these tools requires that the signal values in the entire time interval from −∞ to +∞ be available. If it is known a priori that a signal is periodic, then only one period is needed. The rationale for defining and studying various spectra for deterministic signals is threefold. First, we note that every realization (or sample function) of a stochastic process is a deterministic function. Thus we can use the Fourier series and transforms to compute a spectrum for stationary processes. Second, deterministic functions

and sequences are used in many aspects of the study of stationary processes, for example, the autocorrelation sequence, which is a deterministic sequence. Third, the various spectra that can be defined for deterministic signals can be used to summarize important features of stationary processes. Most practical applications of spectrum estimation involve continuous-time signals. For example, in speech analysis we use spectrum estimation to determine the pitch of the glottal excitation and the formants of the vocal tract (Rabiner and Schafer 1978). In electroencephalography, we use spectrum estimation to study sleep disorders and the effect of medication on the functioning of the brain (Duffy, Iyer, and Surwillo 1989). Another application is in Doppler radar, where the frequency shift between the transmitted and the received waveform is used to determine the radial velocity of the target (Levanon 1988). The numerical computation of the spectrum of a continuous-time signal involves three steps: 1. Sampling the continuous-time signal to obtain a sequence of samples. 2. Collecting a finite number of contiguous samples (data segment or block) to use for the computation of the spectrum. This operation, which usually includes weighting of the signal samples, is known as windowing, or tapering. 3. Computing the values of the spectrum at the desired set of frequencies. This step is usually implemented using some efficient implementation of the DFT. The above processing steps, which are necessary for DFT-based spectrum estimation, are shown in Figure 5.2. The continuous-time signal is first processed through a low-pass (antialiasing) filter and then sampled to obtain a discrete-time signal. Data samples of frame length N with frame overlap N0 are selected and then conditioned using a window. Finally, a suitable-length DFT of the windowed data is taken as an estimate of its spectrum, which is then analyzed. 
In this section, we discuss in detail the effects of each of these operations on the accuracy of the computed spectrum. The understanding of the implications of these effects is very important in all practical applications of spectrum estimation.

FIGURE 5.2 DFT-based Fourier analysis system for continuous-time signals. [Block diagram: $s_c(t)$ → low-pass filter $H_{lp}(F)$ → $x_c(t)$ → A/D converter (rate $F_s$) → $x(n)$ → frame blocking ($N$, $N_0$) → windowing by $w(n)$ → $x_N(n)$ → DFT → $\tilde{X}_N(k)$.]
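The discrete-time part of the chain in Figure 5.2 (frame blocking, windowing, DFT) can be sketched in a few lines. This is only an illustrative sketch under stated assumptions: the antialiasing filter and the A/D converter are not modeled, the test signal, the values of N, N0, and NFFT, and the helper names `frame_blocks`, `hamming_window`, and `dft` are all hypothetical, and a direct DFT is used instead of an FFT for clarity.

```python
import cmath
import math


def frame_blocks(x, N, N0):
    """Split x into frames of length N; consecutive frames share N0 samples."""
    if not 0 <= N0 < N:
        raise ValueError("overlap N0 must satisfy 0 <= N0 < N")
    hop = N - N0  # number of new samples contributed by each frame
    return [x[start:start + N] for start in range(0, len(x) - N + 1, hop)]


def hamming_window(N):
    """Hamming window of length N (one possible choice of w(n))."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]


def dft(x, nfft):
    """Direct DFT of x, implicitly zero-padded to nfft points."""
    return [sum(xn * cmath.exp(-2j * math.pi * k * n / nfft)
                for n, xn in enumerate(x))
            for k in range(nfft)]


# Hypothetical input: a 1 Hz cosine sampled at Fs = 10 Hz (assumed band-limited).
Fs = 10.0
x = [2 * math.cos(2 * math.pi * n / Fs) for n in range(50)]

N, N0, NFFT = 20, 10, 32          # assumed frame length, overlap, DFT size
window = hamming_window(N)
frames = frame_blocks(x, N, N0)
spectra = [dft([w * v for w, v in zip(window, frame)], NFFT) for frame in frames]
```

Each entry of `spectra` holds the NFFT DFT values of one windowed frame; in practice an FFT routine would replace the direct sum.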

5.1.1 Effect of Signal Sampling

The continuous-time signal $s_c(t)$, whose spectrum we seek to estimate, is first passed through a low-pass filter, also known as an antialiasing filter $H_{lp}(F)$, in order to minimize the aliasing error after sampling. The antialiased signal $x_c(t)$ is then sampled through an analog-to-digital converter (ADC) to produce the discrete-time sequence $x(n)$,† that is,

$$x(n) = x_c(t)\big|_{t=n/F_s} \tag{5.1.1}$$

† We will ignore the quantization of discrete-time signals as discussed in Chapter 2.

From the sampling theorem in Section 2.2.2, we have

$$X(e^{j2\pi F/F_s}) = F_s \sum_{l=-\infty}^{\infty} X_c(F - lF_s) \tag{5.1.2}$$

where $X_c(F) = H_{lp}(F)S_c(F)$. We note that the spectrum of the discrete-time signal $x(n)$ is a periodic replication of $X_c(F)$. Overlapping of the replicas $X_c(F - lF_s)$ results in aliasing. Since any practical antialiasing filter does not have infinite attenuation in the stopband, some nonzero overlap of frequencies higher than $F_s/2$ should be expected within the band of frequencies of interest in $x(n)$. These aliased frequencies give rise to the aliasing error, which, in any practical signal, is unavoidable. It can be made negligible by a properly designed antialiasing filter $H_{lp}(F)$.

5.1.2 Windowing, Periodic Extension, and Extrapolation

In practice, we compute the spectrum of a signal by using a finite-duration segment. The reason is threefold:

1. The spectral composition of the signal changes with time; or
2. We have only a finite set of data at our disposal; or
3. We wish to keep the computational complexity to an acceptable level.

Therefore, it is necessary to partition $x(n)$ into blocks (or frames) of data prior to processing. This operation is called frame blocking, and it is characterized by two parameters: the length of frame $N$ and the overlap between frames $N_0$ (see Figure 5.2). Therefore, the central problem in practical frequency analysis can be stated as follows:

Determine the spectrum of a signal $x(n)$, $-\infty < n < \infty$, from its values in a finite interval $0 \le n \le N-1$, that is, from a finite-duration segment.

Since $x(n)$ is unknown for $n < 0$ and $n \ge N$, we cannot say, without having sufficient a priori information, whether the signal is periodic or aperiodic. If we can reasonably assume that the signal is periodic with fundamental period $N$, we can easily determine its spectrum by computing its Fourier series, using the DFT (see Section 2.2.1). However, in most practical applications, we cannot make this assumption because the available block of data could be either part of the period of a periodic signal or a segment from an aperiodic signal.
In such cases, the spectrum of the signal cannot be determined without assigning values to the signal samples outside the available interval. There are three ways to deal with this issue:

1. Periodic extension. We assume that $x(n)$ is periodic with period $N$, that is, $x(n) = x(n+N)$ for all $n$, and we compute its Fourier series, using the DFT.
2. Windowing. We assume that the signal is zero outside the interval of observation, that is, $x(n) = 0$ for $n < 0$ and $n \ge N$. This is equivalent to multiplying the signal with the rectangular window

$$w_R(n) \triangleq \begin{cases} 1 & 0 \le n \le N-1 \\ 0 & \text{elsewhere} \end{cases} \tag{5.1.3}$$

   The resulting sequence is aperiodic, and its spectrum is obtained by the discrete-time Fourier transform (DTFT).
3. Extrapolation. We use a priori information about the signal to extrapolate (i.e., determine its values for $n < 0$ and $n \ge N$) outside the available interval and then determine its spectrum by using the DTFT.

Periodic extension and windowing can be considered the simplest forms of extrapolation. It should be obvious that a successful extrapolation results in better spectrum estimates than periodic extension or windowing. Periodic extension is a straightforward application of the DFT, whereas extrapolation requires some form of a sophisticated signal model. As we shall see, most of the signal modeling techniques discussed in this book result in some kind of extrapolation. We first discuss, in the next section, the effect of spectrum sampling as imposed by the application of DFT (and its side effect, the periodic extension) before we provide a detailed analysis of the effect of windowing.

5.1.3 Effect of Spectrum Sampling

In many real-time spectrum analyzers, as illustrated in Figure 5.2, the spectrum is computed (after signal conditioning) by using the DFT. From Section 2.2.3, we note that this computation samples the continuous spectrum at equispaced frequencies. Theoretically, if the number of DFT samples is greater than or equal to the frame length $N$, then the exact continuous spectrum (based on the given frame) can be obtained by using the frequency-domain reconstruction (Oppenheim and Schafer 1989; Proakis and Manolakis 1996). This reconstruction, which requires a periodic sinc function [defined in (5.1.9)], is not a practical function to implement, especially in real-time applications. Hence a simple linear interpolation is used for plotting or display purposes. This linear interpolation can lead to misleading results even though the computed DFT sample values are correct. It is possible that there may not be a DFT sample precisely at a frequency where a peak of the DTFT is located. In other words, the DFT spectrum misses this peak, and the resulting linearly interpolated spectrum provides the wrong location and height of the DTFT spectrum peak. This error can be made smaller by sampling the DTFT spectrum at a finer grid, that is, by increasing the size of the DFT. The denser spectrum sampling is implemented by an operation called zero padding and is discussed later in this section.

Another effect of the application of DFT for spectrum calculations is the periodic extension of the sequence in the time domain. From our discussion in Section 2.2.3, it follows that the $N$-point DFT

$$\tilde{X}(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j(2\pi/N)kn} \tag{5.1.4}$$

is periodic with period $N$. This should be expected given the relationship of the DFT to the Fourier transform or the Fourier series of discrete-time signals, which are periodic in $\omega$ with period $2\pi$. A careful look at the inverse DFT

$$x(n) = \frac{1}{N} \sum_{k=0}^{N-1} \tilde{X}(k)\, e^{j(2\pi/N)kn} \tag{5.1.5}$$

reveals that $x(n)$ is also periodic with period $N$. This is a somewhat surprising result since no assumption about the signal $x(n)$ outside the interval $0 \le n \le N-1$ has been made. However, this periodicity in the time domain can be easily justified by recalling that sampling in the time domain results in a periodicity in the frequency domain, and vice versa.

To understand these effects of spectrum sampling, consider the following example in which a continuous-time sinusoidal signal is sampled and then is truncated by a rectangular window before its DFT is performed.

EXAMPLE 5.1.1. A continuous-time signal $x_c(t) = 2\cos 2\pi t$ is sampled with a sampling frequency of $F_s = 1/T = 10$ samples per second, to obtain the sequence $x(n)$. It is windowed by an $N$-point rectangular window $w_R(n)$ to obtain the sequence $x_N(n)$. Determine and plot $|\tilde{X}_N(k)|$, the magnitude of the DFT of $x_N(n)$, for (a) $N = 10$ and (b) $N = 15$. Comment on the shapes of these plots.

Solution. The discrete-time signal $x(n)$ is a sampled version of $x_c(t)$ and is given by

$$x(n) = x_c(t = nT) = 2\cos\frac{2\pi n}{F_s} = 2\cos 0.2\pi n, \qquad T = 0.1\ \text{s}$$

Then, $x(n)$ is a periodic sequence with fundamental period $N = 10$.

a. For $N = 10$, we obtain $x_N(n) = 2\cos 0.2\pi n$, $0 \le n \le 9$, which contains one period of $x(n)$. The periodic extension of $x_N(n)$ and the magnitude plot of its DFT are shown in the top row of Figure 5.3. For comparison, the DTFT $X_N(e^{j\omega})$ of $x_N(n)$ is also superimposed on the DFT samples. We observe that the DFT has only two nonzero samples, which together constitute the correct frequency of the analog signal $x_c(t)$. The DTFT has a mainlobe and several sidelobes due to the windowing effect. However, the DFT samples the sidelobes at their zero values, as illustrated in the DFT plot. Another explanation for this behavior is that since the samples in $x_N(n)$ for $N = 10$ constitute one full period of $\cos 0.2\pi n$, the 10-point periodic extension of $x_N(n)$, shown in the top left graph of Figure 5.3, results in the original sinusoidal sequence $x(n)$. Thus what the DFT "sees" is the exact sampled signal $x_c(t)$. In this case, the choice of $N$ is a desirable one.

b. For $N = 15$, we obtain $x_N(n) = 2\cos 0.2\pi n$, $0 \le n \le 14$, which contains $1\tfrac{1}{2}$ periods of $x(n)$. The periodic extension of $x_N(n)$ and the magnitude plot of its DFT are shown in the bottom row of Figure 5.3. Once again for comparison, the DTFT $X_N(e^{j\omega})$ of $x_N(n)$ is superimposed on the DFT samples. In this case, the DFT plot looks markedly different from that for $N = 10$ although the DTFT plot appears to be similar. In this case, the DFT does not sample two peaks at the exact frequencies; hence if the resulting DFT samples are joined by the linear interpolation, then we will get a misleading result. Since the sequence $x_N(n)$ does not contain full periods of $\cos 0.2\pi n$, the periodic extension of $x_N(n)$ contains discontinuities at $n = lN$, $l = 0, \pm 1, \pm 2, \ldots$, as shown in the bottom left graph of Figure 5.3. This discontinuity results in higher-order harmonics in the DFT values. The DTFT plot also has mainlobes and sidelobes, but the DFT samples these sidelobes at nonzero values. Therefore, the length of the window is an important consideration in spectrum estimation. The sidelobes are the source of the problem of leakage that gives rise to bias in the spectral values, as we will see in the following section. The suppression of the sidelobes is controlled by the window shape, which is another important consideration in spectrum estimation.

FIGURE 5.3 Effect of window length L on the DFT spectrum shape. [Two rows of panels: left, periodic extension of $x_N(n)$ (amplitude versus $n$); right, DFT magnitude $|\tilde{X}(k)|$ with the DTFT superimposed (versus normalized frequency from −0.5 to 0.5).]
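The two cases of Example 5.1.1 can be reproduced numerically. The following is a minimal sketch, assuming a direct (non-FFT) DFT and a hypothetical numerical threshold of 1e-9 for deciding that a DFT sample is "zero"; the helper names are illustrative and do not come from the text.

```python
import cmath
import math


def dft_magnitude(x):
    """Magnitudes of the N-point DFT (5.1.4) of the N-point sequence x."""
    N = len(x)
    return [abs(sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N)))
            for k in range(N)]


def sampled_cosine(N, Fs=10.0):
    """x(n) = 2 cos(2*pi*n/Fs), 0 <= n <= N-1, i.e. 2 cos(0.2*pi*n) for Fs = 10."""
    return [2 * math.cos(2 * math.pi * n / Fs) for n in range(N)]


mag10 = dft_magnitude(sampled_cosine(10))   # case (a): one full period
mag15 = dft_magnitude(sampled_cosine(15))   # case (b): one and a half periods

# Indices of DFT samples that are not numerically zero (threshold is an assumption).
nonzero10 = [k for k, m in enumerate(mag10) if m > 1e-9]
nonzero15 = [k for k, m in enumerate(mag15) if m > 1e-9]
```

For N = 10 only the samples k = 1 and k = 9 survive, each with magnitude 10, whereas for N = 15 every DFT sample is nonzero because the DTFT sidelobes are no longer sampled at their zeros.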

A quantitative description of the above interpretations and arguments related to the capacities and limitations of the DFT is offered by the following result (see Proakis and Manolakis 1996).

THEOREM 5.1 (DFT SAMPLING THEOREM). Let $x_c(t)$, $-\infty < t < \infty$, be a continuous-time signal with Fourier transform $X_c(F)$, $-\infty < F < \infty$. Then, the $N$-point sequences $\{x_p(n),\ 0 \le n \le N-1\}$ and $\{\tilde{X}_p(k),\ 0 \le k \le N-1\}$ form an $N$-point DFT pair, that is,

$$x_p(n) \triangleq \sum_{m=-\infty}^{\infty} x_c(nT - mNT) \;\overset{\text{DFT}}{\underset{N}{\longleftrightarrow}}\; \tilde{X}_p(k) \triangleq F_s \sum_{l=-\infty}^{\infty} X_c\!\left(k\frac{F_s}{N} - lF_s\right) \tag{5.1.6}$$

where $F_s = 1/T$ is the sampling frequency.

Proof. The proof is explored in Problem 5.1.
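Relation (5.1.6) can be checked numerically for a signal whose Fourier transform is known in closed form. The sketch below assumes the Gaussian pair $x_c(t) = e^{-\pi t^2}$, $X_c(F) = e^{-\pi F^2}$, and the hypothetical values T = 0.25 s and N = 16; the infinite sums are truncated to M replicas on each side, which is harmless here because the Gaussian decays so quickly.

```python
import cmath
import math

T = 0.25          # sampling interval in seconds (assumed), so Fs = 4 Hz
Fs = 1.0 / T
N = 16            # DFT length (assumed)
M = 8             # replicas kept on each side when truncating the infinite sums


def xc(t):
    """Continuous-time Gaussian signal."""
    return math.exp(-math.pi * t * t)


def Xc(F):
    """Its Fourier transform, also a Gaussian."""
    return math.exp(-math.pi * F * F)


# Time-aliased sequence x_p(n) and frequency-aliased samples X_p(k) from (5.1.6).
xp = [sum(xc(n * T - m * N * T) for m in range(-M, M + 1)) for n in range(N)]
Xp = [Fs * sum(Xc(k * Fs / N - l * Fs) for l in range(-M, M + 1)) for k in range(N)]

# N-point DFT of x_p(n), computed directly from (5.1.4).
dft_xp = [sum(xp[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
          for k in range(N)]

max_error = max(abs(a - b) for a, b in zip(dft_xp, Xp))
```

With these choices `max_error` is at the level of floating-point rounding, which is consistent with the two aliased sequences forming a DFT pair.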

Thus, given a continuous-time signal $x_c(t)$ and its spectrum $X_c(F)$, we can create a DFT pair by sampling and aliasing in the time and frequency domains. Obviously, this DFT pair provides a "faithful" description of $x_c(t)$ and $X_c(F)$ if both the time-domain aliasing and the frequency-domain aliasing are insignificant. The meaning of relation (5.1.6) is graphically illustrated in Figure 5.4. In this figure, we show the time-domain signals in the left column and their Fourier transforms in the right column. The top row contains continuous-time signals, which are shown as nonperiodic and of infinite extent in both domains, since many real-world signals exhibit this behavior. The middle row contains the sampled version of the continuous-time signal and its periodic Fourier transform (the nonperiodic transform is shown as a dashed curve). Clearly, aliasing in the frequency domain is evident. Finally, the bottom row shows the sampled (periodic) Fourier transform and its corresponding time-domain periodic sequence. Again, aliasing in the time domain should be expected. Thus we have sampled and periodic signals in both domains, with aliasing certain in one domain and possible in both. This figure should be recalled any time we use the DFT for the analysis of sampled signals.

Zero padding

The $N$-point DFT values of an $N$-point sequence $x(n)$ are samples of the DTFT $X(e^{j\omega})$, as discussed in Chapter 2. These samples can be used to reconstruct the DTFT $X(e^{j\omega})$ by using the periodic sinc interpolating function. Alternatively, one can obtain more (i.e., dense) samples of the DTFT by computing a larger $N_{\text{FFT}}$-point DFT of $x(n)$, where $N_{\text{FFT}} > N$. Since the number of samples of $x(n)$ is fixed, the only way we can treat $x(n)$ as an $N_{\text{FFT}}$-point sequence is by appending $N_{\text{FFT}} - N$ zeros to it. This procedure is called the zero padding operation, and it is used for many purposes including the augmentation of the sequence length so that a power-of-2 FFT algorithm can be used.
In spectrum estimation, zero padding is primarily used to provide a better-looking plot of the spectrum of a finite-length sequence. This is shown in Figure 5.5 where the magnitude of an $N_{\text{FFT}}$-point DFT of the eight-point sequence $x(n) = \cos(2\pi n/4)$ is plotted for $N_{\text{FFT}} = 8$, 16, 32, and 64. The DTFT magnitude $|X(e^{j\omega})|$ is also shown for comparison. It can be seen that as more zeros are appended (by increasing $N_{\text{FFT}}$), the resulting larger-point DFT provides more closely spaced samples of the DTFT, thus giving a better-looking plot. Note, however, that the zero padding does not increase the resolution of the spectrum; that is, there are no new peaks and valleys in the display, just a better display of the available information. This type of plot is called a high-density spectrum. For a high-resolution spectrum, we have to collect more information by increasing $N$. The DTFT plots shown in Figures 5.3 and 5.5 were obtained by using a very large amount of zero padding.
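The statement that zero padding only samples the same DTFT more densely can be verified directly. This is a small sketch using the eight-point sequence $x(n) = \cos(2\pi n/4)$ from the text; the helper name `zero_padded_dft` is hypothetical and a direct DFT is used in place of an FFT.

```python
import cmath
import math


def zero_padded_dft(x, nfft):
    """nfft-point DFT of x after appending nfft - len(x) zeros.

    The appended zeros contribute nothing to the sum, so only the original
    samples are visited.
    """
    if nfft < len(x):
        raise ValueError("nfft must be at least len(x)")
    return [sum(xn * cmath.exp(-2j * math.pi * k * n / nfft)
                for n, xn in enumerate(x))
            for k in range(nfft)]


x = [math.cos(2 * math.pi * n / 4) for n in range(8)]   # eight-point sequence

X8 = zero_padded_dft(x, 8)     # no padding
X16 = zero_padded_dft(x, 16)   # 8 zeros appended: twice as many DTFT samples

# Every second sample of the 16-point DFT coincides with the 8-point DFT.
largest_mismatch = max(abs(X16[2 * k] - X8[k]) for k in range(8))
```

The odd-indexed samples of `X16` are new samples of the same DTFT that fall between the original ones; they add display density but no resolution.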

FIGURE 5.4 Graphical illustration of the DFT sampling theorem. [Left column, time domain: $x_c(t)$, $x_c(nT)$, and $x_p(n)$ versus time (s). Right column, frequency domain: $X_c(F)$, DTFT$[x_c(nT)]$, and DFT$[x_p(n)]$ versus frequency (Hz).]

FIGURE 5.5 Effect of zero padding. [Four panels: 8-, 16-, 32-, and 64-point DFT magnitudes $|\tilde{X}(k)|$, each with the DTFT shown dashed, versus normalized frequency.]

5.1.4 Effects of Windowing: Leakage and Loss of Resolution

To see the effect of the window on the spectrum of an arbitrary deterministic signal $x(n)$, defined over the entire range $-\infty < n < \infty$, we notice that the available data record can be expressed as

$$x_N(n) = x(n)\, w_R(n) \tag{5.1.7}$$

where $w_R(n)$ is the rectangular window defined in (5.1.3). Thus, a finite segment of the signal can be thought of as a product of the actual signal $x(n)$ and a data window $w(n)$. In (5.1.7), $w(n) = w_R(n)$, but $w(n)$ can be any arbitrary finite-duration sequence. The Fourier transform of $x_N(n)$ is

$$X_N(e^{j\omega}) = X(e^{j\omega}) \otimes W(e^{j\omega}) \triangleq \frac{1}{2\pi} \int_{-\pi}^{\pi} X(e^{j\theta})\, W(e^{j(\omega-\theta)})\, d\theta \tag{5.1.8}$$

that is, $X_N(e^{j\omega})$ equals the periodic convolution of the actual Fourier transform with the Fourier transform $W(e^{j\omega})$ of the data window. For the rectangular window, $W(e^{j\omega}) = W_R(e^{j\omega})$, where

$$W_R(e^{j\omega}) = \frac{\sin(\omega N/2)}{\sin(\omega/2)}\, e^{-j\omega(N-1)/2} \triangleq A(\omega)\, e^{-j\omega(N-1)/2} \tag{5.1.9}$$

The function $A(\omega)$ is a periodic function in $\omega$ with fundamental period equal to $2\pi$ and is called a periodic sinc function. Figure 5.6 shows three periods of $A(\omega)$ for $N = 11$. We note that $W_R(e^{j\omega})$ consists of a mainlobe (ML)

$$W_{ML}(e^{j\omega}) = \begin{cases} W_R(e^{j\omega}) & |\omega| < \dfrac{2\pi}{N} \\[4pt] 0 & \dfrac{2\pi}{N} < |\omega| \le \pi \end{cases} \tag{5.1.10}$$

and the sidelobes $W_{SL}(e^{j\omega}) = W_R(e^{j\omega}) - W_{ML}(e^{j\omega})$. Thus, (5.1.8) can be written as

$$X_N(e^{j\omega}) = X(e^{j\omega}) \otimes W_{ML}(e^{j\omega}) + X(e^{j\omega}) \otimes W_{SL}(e^{j\omega}) \tag{5.1.11}$$

FIGURE 5.6 Plot of $A(\omega) = \sin(\omega N/2)/\sin(\omega/2)$ for $N = 11$. [Three periods of $A(\omega)$ versus normalized frequency; peak value 11.]
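The periodic sinc function in (5.1.9) is easy to evaluate once the removable singularities at multiples of $2\pi$ are handled. The sketch below is one possible way to do it; the function name `periodic_sinc` and the tolerance used to detect the singular points are assumptions made for illustration.

```python
import cmath
import math


def periodic_sinc(omega, N):
    """A(omega) = sin(omega*N/2) / sin(omega/2), with its limit at multiples of 2*pi."""
    s = math.sin(omega / 2)
    if abs(s) < 1e-12:
        # At omega = 2*pi*m the ratio tends to N * (-1)**(m*(N-1)).
        m = round(omega / (2 * math.pi))
        return float(N) if (m * (N - 1)) % 2 == 0 else -float(N)
    return math.sin(omega * N / 2) / s


N = 11
peak = periodic_sinc(0.0, N)                                   # mainlobe peak, equals N
zeros = [periodic_sinc(2 * math.pi * k / N, N) for k in range(1, N)]

# Cross-check against |W_R(e^{jw})| computed directly as a sum over the window.
omega = 0.3
direct = abs(sum(cmath.exp(-1j * omega * n) for n in range(N)))
closed_form = abs(periodic_sinc(omega, N))
```

The zeros of $A(\omega)$ at $\omega = 2\pi k/N$ are exactly the points at which an $N$-point DFT samples the sidelobes in case (a) of Example 5.1.1.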

The first convolution in (5.1.11) smoothes rapid variations and suppresses narrow peaks in $X(e^{j\omega})$, whereas the second convolution introduces ripples in smooth regions of $X(e^{j\omega})$ and can create "false" peaks. Therefore, the spectrum we observe is the convolution of the actual spectrum with the Fourier transform of the data window. The only way to improve the estimate is to increase the window length $N$ or to choose another window shape. For the rectangular window, increasing $N$ results in a narrower mainlobe, and the distortion is reduced. As $N \to \infty$, $W_R(e^{j\omega})$ tends to an impulse train with period $2\pi$ and $X_N(e^{j\omega})$ tends to $X(e^{j\omega})$, as expected. Since in practice the value of $N$ is always finite, the only way to improve the estimate $X_N(e^{j\omega})$ is by properly choosing the shape of the window $w(n)$. The only restriction on $w(n)$ is that it be of finite duration. It is known that any time-limited sequence $w(n)$ has a Fourier transform $W(e^{j\omega})$ that is nonzero except at a finite number of frequencies. Thus, from (5.1.8) we see that the estimated value $X_N(e^{j\omega_0})$ is computed by using all values of $X(e^{j\omega})$ weighted by $W(e^{j(\omega_0-\theta)})$. The contribution of the sinusoidal components with frequencies $\omega \ne \omega_0$ to the value $X_N(e^{j\omega_0})$ introduces an error known as leakage. As the name suggests, energy from one frequency range "leaks" into another, giving the wrong impression of stronger or weaker frequency components.

To illustrate the effect of the window shape and duration on the estimated spectrum, consider the signal

$$x(n) = \cos 0.35\pi n + \cos 0.4\pi n + 0.25\cos 0.8\pi n \tag{5.1.12}$$

which has a line spectrum with lines at frequencies $\omega_1 = 0.35\pi$, $\omega_2 = 0.4\pi$, and $\omega_3 = 0.8\pi$. This line spectrum (normalized so that the magnitude is between 0 and 1) is shown in the top graph of Figure 5.7 over $0 \le \omega \le \pi$. The spectrum $X_N(e^{j\omega})$ of $x_N(n)$ using the rectangular window is given by

$$X_N(e^{j\omega}) = \tfrac{1}{2}\big[W(e^{j(\omega+\omega_1)}) + W(e^{j(\omega-\omega_1)}) + W(e^{j(\omega+\omega_2)}) + W(e^{j(\omega-\omega_2)}) + 0.25\,W(e^{j(\omega+\omega_3)}) + 0.25\,W(e^{j(\omega-\omega_3)})\big] \tag{5.1.13}$$

The second and the third plots in Figure 5.7 show 2048-point DFTs of $x_N(n)$ for a rectangular data window with $N = 21$ and $N = 81$. We note that the ability to pick out peaks (resolvability) depends on the duration $N - 1$ of the data window.† To resolve two spectral lines at $\omega = \omega_1$ and $\omega = \omega_2$ using a rectangular window, we should have the difference $|\omega_1 - \omega_2|$ greater than the mainlobe width $\Delta\omega$, which is approximately equal to $2\pi/(N-1)$, in radians per sampling interval, from the plot of $A(\omega)$ in Figure 5.6, that is,

$$|\omega_1 - \omega_2| > \Delta\omega \approx \frac{2\pi}{N-1} \qquad \text{or} \qquad N > \frac{2\pi}{|\omega_1 - \omega_2|} + 1$$

† Since there are $N$ samples in a data window, the number of intervals or durations is $N - 1$.
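The resolution rule above can be tried on the signal (5.1.12). The following sketch evaluates the DTFT of the rectangularly windowed data at the two line frequencies and at their midpoint; the ratio `dip_ratio` is an illustrative measure introduced here (not taken from the text): a value well below 1 means there is a clear valley between the two lines, and a value near or above 1 means the lines have merged.

```python
import cmath
import math

W1, W2, W3 = 0.35 * math.pi, 0.4 * math.pi, 0.8 * math.pi


def three_sinusoids(N):
    """N samples of the signal (5.1.12), i.e. a rectangular window of length N."""
    return [math.cos(W1 * n) + math.cos(W2 * n) + 0.25 * math.cos(W3 * n)
            for n in range(N)]


def dtft_magnitude(x, omega):
    """|X_N(e^{j omega})| of the finite sequence x at a single frequency."""
    return abs(sum(xn * cmath.exp(-1j * omega * n) for n, xn in enumerate(x)))


def min_length_to_resolve(w1, w2):
    """Bound N > 2*pi/|w1 - w2| + 1 for a rectangular window."""
    return 2 * math.pi / abs(w1 - w2) + 1


def dip_ratio(N):
    """Magnitude midway between the two close lines relative to the smaller line value."""
    x = three_sinusoids(N)
    mid = dtft_magnitude(x, (W1 + W2) / 2)
    return mid / min(dtft_magnitude(x, W1), dtft_magnitude(x, W2))


bound = min_length_to_resolve(W1, W2)   # about 41 samples for these two lines
ratio_21 = dip_ratio(21)                # N below the bound
ratio_81 = dip_ratio(81)                # N above the bound
```

For N = 21 the midpoint is not lower than the line frequencies in any meaningful way, whereas for N = 81 the spectrum drops sharply between them, in line with the second and third plots of Figure 5.7.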

FIGURE 5.7 Spectrum of three sinusoids using rectangular and Hamming windows. [Four panels, magnitude versus $\omega/\pi$ with the line frequencies 0.35, 0.4, and 0.8 marked: true spectral lines $|X(e^{j\omega})|$; $|X_N(e^{j\omega})|$ for the rectangular window with $N = 21$; $|X_N(e^{j\omega})|$ for the rectangular window with $N = 81$; $|X_N(e^{j\omega})|$ for the Hamming window with $N = 81$.]

For a rectangular window of length $N$, the exact value of $\Delta\omega$ is equal to $1.81\pi/(N-1)$. If $N$ is too small, the two peaks at $\omega = 0.35\pi$ and $\omega = 0.4\pi$ are fused into one, as shown in the $N = 21$ plot. When $N = 81$, the corresponding plot shows a resolvable separation; however, the peaks have shifted somewhat from their true locations. This is called bias, and it is a direct result of the leakage from sidelobes. In both cases, the peak at $\omega = 0.8\pi$ can be distinguished easily (but also has a bias).

Another important observation is that the sidelobes of the data window introduce false peaks. For a rectangular window, the peak sidelobe level is 13 dB below the mainlobe peak, which is not a good attenuation. Thus these false peaks have values that are comparable to that of the true peak at $\omega = 0.8\pi$, as shown in Figure 5.7. These peaks can be minimized by reducing the amplitudes of the sidelobes. The rectangular window cannot help in this regard because of the well-known Gibbs phenomenon associated with it. We need a different window shape. However, any window other than the rectangular window has a wider mainlobe; hence this reduction can be achieved only at the expense of the resolution. To illustrate this, consider the Hamming (Hm) data window, given by

$$w_{Hm}(n) = \begin{cases} 0.54 - 0.46\cos\dfrac{2\pi n}{N-1} & 0 \le n \le N-1 \\[4pt] 0 & \text{otherwise} \end{cases} \tag{5.1.14}$$

with the approximate width of the mainlobe equal to $8\pi/(N-1)$ and the exact mainlobe width equal to $6.27\pi/(N-1)$. The peak sidelobe level is 43 dB below the mainlobe peak, which is considerably better than that of the rectangular window. The Hamming window is obtained by using the hamming(N) function in Matlab. The bottom plot in Figure 5.7 shows the 2048-point DFT of the signal $x_N(n)$ for a Hamming window with $N = 81$. Now the peak at $\omega = 0.8\pi$ is more prominent than before, and the sidelobes are almost suppressed. Note also that since the mainlobe of the Hamming window is wider, the peaks have a wider base, so much so that the first two frequencies are barely recognized. We can correct this problem by choosing a larger window length. This interplay between the shape and the duration of a window function is one of the important issues and, as we will see in Section 5.3, produces similar effects in the spectral analysis of random signals.

Some useful windows

The design of windows for spectral analysis applications has drawn a lot of attention and is examined in detail in Harris (1978). We have already discussed two windows, namely, the rectangular and the Hamming window. Another useful window in spectrum analysis is due to Hann and is mistakenly known as the Hanning window. There are several such windows with varying degrees of tradeoff between resolution (mainlobe width) and leakage (peak sidelobe level). These windows are known as fixed windows since each provides a fixed amount of leakage that is independent of the length $N$. Unlike fixed windows, there are windows that contain a design parameter that can be used to trade between resolution and leakage. Two such windows are the Kaiser window and the Dolph-Chebyshev window, which are widely used in spectrum estimation. Figure 5.8 shows the time-domain window functions and their corresponding frequency-domain log-magnitude plots in decibels for these five windows. The important properties such as peak sidelobe level and mainlobe width of these windows are compared in Table 5.1.

TABLE 5.1 Comparison of properties of commonly used windows. Each window is assumed to be of length $N$.

Window type        Peak sidelobe level (dB)   Approximate mainlobe width   Exact mainlobe width
Rectangular        −13                        $4\pi/(N-1)$                 $1.81\pi/(N-1)$
Hanning            −32                        $8\pi/(N-1)$                 $5.01\pi/(N-1)$
Hamming            −43                        $8\pi/(N-1)$                 $6.27\pi/(N-1)$
Kaiser             $-A$                       (none)                       $(A-8)/[2.285(N-1)]$
Dolph-Chebyshev    $-A$                       (none)                       $\cos^{-1}\left\{\left[\cosh\left(\dfrac{\cosh^{-1}10^{A/20}}{N-1}\right)\right]^{-1}\right\}$

Hanning window. This window is given by the function

$$w_{Hn}(n) = \begin{cases} 0.5 - 0.5\cos\dfrac{2\pi n}{N-1} & 0 \le n \le N-1 \\[4pt] 0 & \text{otherwise} \end{cases} \tag{5.1.15}$$

which is a raised cosine function. The peak sidelobe level is 32 dB below the mainlobe peak, and the approximate mainlobe width is $8\pi/(N-1)$ while the exact mainlobe width is $5.01\pi/(N-1)$. In Matlab this window function is obtained through the function hanning(N).
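The fixed windows discussed so far can be generated from their defining formulas, and their peak sidelobe levels can be estimated from a densely sampled DTFT. The sketch below is a rough numerical illustration only: the window length N = 31, the frequency grid of 2048 points, and the helper names are assumptions, and the sidelobe search simply walks down the mainlobe to its first local minimum and takes the largest value beyond it.

```python
import cmath
import math


def rectangular(N):
    return [1.0] * N


def hanning(N):
    """Raised-cosine window of (5.1.15)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]


def hamming(N):
    """Hamming window of (5.1.14)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]


def peak_sidelobe_db(w, grid=2048):
    """Largest sidelobe relative to the mainlobe peak, in dB (negative number)."""
    mags = [abs(sum(wn * cmath.exp(-1j * (math.pi * i / grid) * n)
                    for n, wn in enumerate(w)))
            for i in range(grid + 1)]          # DTFT magnitude on 0 <= omega <= pi
    i = 0
    while i < grid and mags[i + 1] < mags[i]:  # descend the mainlobe
        i += 1
    return 20 * math.log10(max(mags[i:]) / mags[0])


N = 31  # assumed window length for this illustration
levels = {name: peak_sidelobe_db(make(N))
          for name, make in (("rectangular", rectangular),
                             ("hanning", hanning),
                             ("hamming", hamming))}
```

The estimated levels should come out close to the entries of Table 5.1, with small deviations that depend on the chosen N and grid density.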

Kaiser window. This window function is due to J. F. Kaiser and is given by

$$w_K(n) = \begin{cases} \dfrac{I_0\!\left(\beta\sqrt{1 - [1 - 2n/(N-1)]^2}\right)}{I_0(\beta)} & 0 \le n \le N-1 \\[6pt] 0 & \text{otherwise} \end{cases} \tag{5.1.16}$$

where $I_0(\cdot)$ is the modified zero-order Bessel function of the first kind and $\beta$ is a window shape parameter that can be chosen to obtain various peak sidelobe levels and the corresponding mainlobe widths. Clearly, $\beta = 0$ results in the rectangular window while $\beta > 0$ results in lower sidelobe leakage at the expense of a wider mainlobe. Kaiser has developed approximate design equations for $\beta$. Given a peak sidelobe level of $A$ dB below the peak value, the approximate value of $\beta$ is given by

$$\beta \approx \begin{cases} 0 & A \le 21 \\ 0.5842(A-21)^{0.4} + 0.07886(A-21) & 21 < A \le 50 \\ 0.1102(A-8.7) & A > 50 \end{cases} \tag{5.1.17}$$

Furthermore, to achieve the given values of the peak sidelobe level of $A$ and the mainlobe width $\Delta\omega$, the length $N$ must satisfy

$$\Delta\omega = \frac{A-8}{2.285(N-1)} \tag{5.1.18}$$

In Matlab this window is given by the function kaiser(N,beta).

Dolph-Chebyshev window. This window is characterized by the property that the peak sidelobe levels are constant; that is, it has an "equiripple" behavior. The window $w_{DC}(n)$ is obtained as the inverse DFT of the Chebyshev polynomial evaluated at $N$ equally spaced frequencies around the unit circle. The details of this window function computation are available in Harris (1978). The parameters of the Dolph-Chebyshev window are the constant sidelobe level $A$ in decibels, the window length $N$, and the mainlobe width $\Delta\omega$. However, only two of the three parameters can be independently specified. In spectrum estimation, parameters $N$ and $A$ are generally specified. Then $\Delta\omega$ is given by

$$\Delta\omega = \cos^{-1}\left\{\left[\cosh\left(\frac{\cosh^{-1}10^{A/20}}{N-1}\right)\right]^{-1}\right\} \tag{5.1.19}$$

In Matlab this window is obtained through the function chebwin(N,A).

FIGURE 5.8 Time-domain window functions and their frequency-domain characteristics for rectangular, Hanning, Hamming, Kaiser, and Dolph-Chebyshev windows. [Five rows of panels: left, time domain, $w_R(n)$, $w_{Hn}(n)$, $w_{Hm}(n)$, $w_K(n)$, and $w_{DC}(n)$ versus $n$; right, frequency domain, log magnitude in decibels versus normalized frequency.]

To illustrate the usefulness of these windows, consider the same signal containing three frequencies given in (5.1.12). Figure 5.9 shows the spectrum of $x_N(n)$ using the Hanning, Kaiser, and Chebyshev windows for length $N = 81$. The Kaiser and Chebyshev window parameters are adjusted so that the peak sidelobe level is 40 dB or below. Clearly, these windows have suppressed sidelobes considerably compared to that of the rectangular window but the main peaks are wider with negligible bias. The two peaks in the Hanning window spectrum are barely resolved because the mainlobe width of this window is much wider than that of the rectangular window. The Chebyshev window spectrum has uniform sidelobes while the Kaiser window spectrum shows decreasing sidelobes away from the mainlobes.

FIGURE 5.9 Spectrum of three sinusoids using Hanning, Kaiser, and Chebyshev windows. [Three panels, $|X_N(e^{j\omega})|$ versus $\omega/\pi$ with the line frequencies 0.35, 0.4, and 0.8 marked: Hanning window, Kaiser window, Dolph-Chebyshev window.]
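The design relations (5.1.16) through (5.1.19) translate directly into code. The sketch below is one possible implementation under stated assumptions: the Bessel function $I_0$ is evaluated with a truncated power series (50 terms, an arbitrary choice), and all function names are hypothetical rather than the Matlab functions mentioned in the text.

```python
import math


def kaiser_beta(A):
    """Approximate shape parameter beta for a peak sidelobe level of A dB, per (5.1.17)."""
    if A <= 21:
        return 0.0
    if A <= 50:
        return 0.5842 * (A - 21) ** 0.4 + 0.07886 * (A - 21)
    return 0.1102 * (A - 8.7)


def bessel_i0(x, terms=50):
    """Modified zero-order Bessel function of the first kind (truncated series)."""
    return sum(((x / 2) ** k / math.factorial(k)) ** 2 for k in range(terms))


def kaiser_window(N, beta):
    """Kaiser window of length N, per (5.1.16)."""
    denom = bessel_i0(beta)
    window = []
    for n in range(N):
        r = 1 - 2 * n / (N - 1)
        window.append(bessel_i0(beta * math.sqrt(max(0.0, 1 - r * r))) / denom)
    return window


def kaiser_mainlobe_width(A, N):
    """Mainlobe width from (5.1.18)."""
    return (A - 8) / (2.285 * (N - 1))


def chebyshev_mainlobe_width(A, N):
    """Mainlobe width from (5.1.19)."""
    return math.acos(1 / math.cosh(math.acosh(10 ** (A / 20)) / (N - 1)))


A, N = 40, 81                     # the sidelobe level and length used for Figure 5.9
beta = kaiser_beta(A)
w = kaiser_window(N, beta)
width_kaiser = kaiser_mainlobe_width(A, N)
width_chebyshev = chebyshev_mainlobe_width(A, N)
```

Setting `beta = 0` in `kaiser_window` returns an all-ones sequence, which matches the remark that $\beta = 0$ gives the rectangular window.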

5.1.5 Summary

In conclusion, the frequency analysis of deterministic signals requires a careful study of three important steps. First, the continuous-time signal $x_c(t)$ is sampled to obtain samples $x(n)$ that are collected into blocks or frames. The frames are "conditioned" to minimize certain errors by multiplying by a window sequence $w(n)$ of length $N$. Finally the windowed frames $x_N(n)$ are transformed to the frequency domain using the DFT. The resulting DFT spectrum $\tilde{X}_N(k)$ is a faithful replica of the actual spectrum $X_c(F)$ if the following errors are sufficiently small.

Aliasing error. This is an error due to the sampling operation. If the sampling rate is sufficiently high and if the antialiasing filter is properly designed so that most of the frequencies of interest are represented in $x(n)$, then this error can be made smaller. However, a certain amount of aliasing should be expected. The sampling principle and aliasing are discussed in Section 2.2.2.

Errors due to finite-length window. There are several errors such as resolution loss, bias, and leakage that are attributed to the windowing operation. Therefore, a careful design of the window function and its length is necessary to minimize these errors. These topics were discussed in Section 5.1.4. In Table 5.1 we summarize key properties of five windows discussed in this section that are useful for spectrum estimation.

Spectrum reconstruction error. The DFT spectrum $\tilde{X}_N(k)$ is a number sequence that must be reconstructed into a continuous function for the purpose of plotting. A practical choice for this reconstruction is the first-order polynomial interpolation. This reconstruction error can be made smaller (and in fact comparable to the screen resolution) by choosing a large number of frequency samples, which can be achieved by the zero padding operation in the DFT. It was discussed in Section 5.1.3.
With the understanding of frequency analysis concepts developed in this section, we are now ready to tackle the problem of spectral analysis of stationary random signals. From Chapter 3, we recognize that the true spectral values can only be obtained as estimates. This requires some understanding of key concepts from estimation theory, which is developed in Section 3.6.

5.2 ESTIMATION OF THE AUTOCORRELATION OF STATIONARY RANDOM SIGNALS The second-order moments of a stationary random sequence—that is, the mean value µx , the autocorrelation sequence rx (l), and the PSD Rx (ej ω )—play a crucial role in signal analysis and signal modeling. In this section, we discuss the estimation of the autocorrelation −1 sequence rx (l) using a finite data record {x(n)}N of the process. 0

209 section 5.2 Estimation of the Autocorrelation of Stationary Random Signals

210 chapter 5 Nonparametric Power Spectrum Estimation

For a stationary process x(n), the most widely used estimator of rx (l) is given by the sample autocorrelation sequence N −l−1 1 x(n + l)x ∗ (n) 0≤l ≤N −1 N n=0 rˆx (l) (5.2.1) ∗ (−l) r ˆ −(N − 1) ≤ l < 0 x 0 elsewhere or, equivalently,

rˆx (l)

N −1 1 x(n)x ∗ (n − l) N

0≤l ≤N −1

n=l

rˆx∗ (−l) 0

−(N − 1) ≤ l < 0

(5.2.2)

elsewhere

which is a random sequence. Note that without further information beyond the observed data $\{x(n)\}_0^{N-1}$, it is not possible to provide reasonable estimates of $r_x(l)$ for $|l| \ge N$. Even for lag values $|l|$ close to $N$, the correlation estimates are unreliable since very few $x(n+|l|)x(n)$ pairs are used. A good rule of thumb provided by Box and Jenkins (1976) is that $N$ should be at least 50 and that $|l| \le N/4$. The sample autocorrelation $\hat{r}_x(l)$ given in (5.2.1) has a desirable property that for each $l \ge 1$, the sample autocorrelation matrix

$$\hat{\mathbf{R}}_x = \begin{bmatrix} \hat{r}_x(0) & \hat{r}_x^*(1) & \cdots & \hat{r}_x^*(N-1) \\ \hat{r}_x(1) & \hat{r}_x(0) & \cdots & \hat{r}_x^*(N-2) \\ \vdots & \vdots & \ddots & \vdots \\ \hat{r}_x(N-1) & \hat{r}_x(N-2) & \cdots & \hat{r}_x(0) \end{bmatrix} \tag{5.2.3}$$

is nonnegative definite (see Section 3.5.1). This property is explored in Problem 5.5. Matlab provides functions to compute the correlation matrix $\hat{\mathbf{R}}_x$ (for example, corr), given the data $\{x(n)\}_{n=0}^{N-1}$; however, the book toolbox function rx = autoc(x,L); computes $\hat{r}_x(l)$ according to (5.2.1) very efficiently.

The estimate of covariance $\gamma_x(l)$ from the data record $\{x(n)\}_0^{N-1}$ is given by the sample autocovariance sequence

$$\hat{\gamma}_x(l) = \begin{cases} \dfrac{1}{N}\displaystyle\sum_{n=0}^{N-l-1} [x(n+l) - \hat{\mu}_x][x^*(n) - \hat{\mu}_x^*] & 0 \le l \le N-1 \\[2mm] \hat{\gamma}_x^*(-l) & -(N-1) \le l < 0 \\[1mm] 0 & \text{elsewhere} \end{cases} \tag{5.2.4}$$

so that the corresponding autocovariance matrix $\hat{\boldsymbol{\Gamma}}_x$ is nonnegative definite. Similarly, the sample autocorrelation coefficient sequence $\hat{\rho}_x(l)$ is given by

$$\hat{\rho}_x(l) = \frac{\hat{\gamma}_x(l)}{\hat{\sigma}_x^2} \tag{5.2.5}$$

In the rest of this section, we assume that $x(n)$ is a zero-mean process and hence $\hat{r}_x(l) = \hat{\gamma}_x(l)$, so that we can discuss the autocorrelation estimate in detail. To determine the statistical quality of this estimator, we now consider its mean and variance.

Mean of $\hat{r}_x(l)$. We first note that (5.2.1) can be written as

$$\hat{r}_x(l) = \frac{1}{N}\sum_{n=-\infty}^{\infty} x(n+l)\,w(n+l)\,x^*(n)\,w(n) \qquad |l| \ge 0 \tag{5.2.6}$$

where

$$w(n) = w_R(n) = \begin{cases} 1 & 0 \le n \le N-1 \\ 0 & \text{elsewhere} \end{cases} \tag{5.2.7}$$

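is the rectangular window. The expected value of $\hat{r}_x(l)$ is

$$E\{\hat{r}_x(l)\} = \frac{1}{N}\sum_{n=-\infty}^{\infty} E\{x(n+l)x^*(n)\}\,w(n+l)\,w(n) \qquad l \ge 0$$

and

$$E\{\hat{r}_x(-l)\} = E\{\hat{r}_x^*(l)\} \qquad -l \le 0$$

Therefore

$$E\{\hat{r}_x(l)\} = \frac{1}{N}\, r_x(l)\, r_w(l) \tag{5.2.8}$$

where

$$r_w(l) = w(l) * w(-l) = \sum_{n=-\infty}^{\infty} w(n)\,w(n+l) \tag{5.2.9}$$

is the autocorrelation of the window sequence. For the rectangular window

$$r_w(l) = \begin{cases} N - |l| & |l| \le N-1 \\ 0 & \text{elsewhere} \end{cases} \;\triangleq\; w_B(l) \tag{5.2.10}$$

which is the unnormalized triangular or Bartlett window. Thus

$$E\{\hat{r}_x(l)\} = \frac{1}{N}\, r_x(l)\, w_B(l) = r_x(l)\left(1 - \frac{|l|}{N}\right) \qquad |l| \le N-1 \tag{5.2.11}$$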
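As an illustration of (5.2.1) and (5.2.8) through (5.2.11), the following is a minimal Python sketch using only the standard library. The function names `sample_autocorr`, `window_autocorr`, and `expected_autocorr` are hypothetical and are not part of the book toolbox; negative lags follow from $\hat{r}_x(-l) = \hat{r}_x^*(l)$.

```python
def sample_autocorr(x, max_lag):
    """Biased sample autocorrelation (5.2.1) for lags 0..max_lag (max_lag < len(x))."""
    N = len(x)
    r = []
    for l in range(max_lag + 1):
        # Only N - l products are available at lag l, but we still divide by N.
        acc = sum(x[n + l] * complex(x[n]).conjugate() for n in range(N - l))
        r.append(acc / N)
    return r


def window_autocorr(N, l):
    """Autocorrelation r_w(l) of the length-N rectangular window, as in (5.2.10)."""
    return max(N - abs(l), 0)


def expected_autocorr(r_true, N, l):
    """Expected value of the estimate, E{r_hat(l)} = r_x(l) r_w(l) / N, as in (5.2.8)."""
    return r_true * window_autocorr(N, l) / N
```

For example, with a true value $r_x(2) = 2$ and $N = 8$, `expected_autocorr(2.0, 8, 2)` returns 1.5, which is the factor $(1 - |l|/N) = 0.75$ applied to the true autocorrelation.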

Therefore, we conclude that the relation (5.2.1) provides a biased estimate of $r_x(l)$ because the expected value of $\hat{r}_x(l)$ from (5.2.11) is not equal to the true autocorrelation $r_x(l)$. However, $\hat{r}_x(l)$ is an asymptotically unbiased estimator since if $N \to \infty$, $E\{\hat{r}_x(l)\} \to r_x(l)$. Clearly, the bias is small if $\hat{r}_x(l)$ is evaluated for $|l| \le L$, where $L$ is the maximum desired lag and $L \ll N$.

Variance of $\hat{r}_x(l)$. An approximate expression for the covariance of $\hat{r}_x(l)$ is given by Jenkins and Watts (1968)

$$\operatorname{cov}\{\hat{r}_x(l_1), \hat{r}_x(l_2)\} \simeq \frac{1}{N}\sum_{l=-\infty}^{\infty} \left[ r_x(l)\, r_x(l + l_2 - l_1) + r_x(l + l_2)\, r_x(l - l_1) \right] \tag{5.2.12}$$

This indicates that successive values of $\hat{r}_x(l)$ may be highly correlated and that $\hat{r}_x(l)$ may fail to die out even if it is expected to. This makes the interpretation of autocorrelation graphs quite challenging because we do not know whether the variation is real or statistical. The variance of $\hat{r}_x(l)$, which can be obtained by setting $l_1 = l_2$ in (5.2.12), tends to zero as $N \to \infty$. Thus, $\hat{r}_x(l)$ provides a good estimate of $r_x(l)$ if the lag $|l|$ is much smaller than $N$. However, as $|l|$ approaches $N$, fewer and fewer samples of $x(n)$ are used to evaluate $\hat{r}_x(l)$. As a result, the estimate $\hat{r}_x(l)$ becomes worse and its variance increases.

Nonnegative definiteness of $\hat{r}_x(l)$. An alternative estimator for the autocorrelation sequence is given by

$$\check{r}_x(l) = \frac{1}{N-l}\sum_{n=0}^{N-l-1} x(n+l)\,x^*(n) \qquad 0 \le l \le L \tag{5.2.13}$$

This estimator is not guaranteed to be nonnegative definite. In contrast, the estimator $\hat{r}_x(l)$ in (5.2.1) is nonnegative definite, and any spectral estimates based on it do not have any negative values. Furthermore, the estimator $\hat{r}_x(l)$ has smaller variance and mean square error than the estimator $\check{r}_x(l)$ (Jenkins and Watts 1968). Thus, in this book we use the estimator $\hat{r}_x(l)$ defined in (5.2.1).

5.3 ESTIMATION OF THE POWER SPECTRUM OF STATIONARY RANDOM SIGNALS

From a practical point of view, most stationary random processes have continuous spectra. However, harmonic processes (i.e., processes with line spectra) appear in several applications either alone or in mixed spectra (a mixture of continuous and line spectra). We first discuss the estimation of continuous spectra in detail. The estimation of line spectra is considered in Chapter 9.

The power spectral density of a zero-mean stationary stochastic process was defined in (3.3.39) as

$$R_x(e^{j\omega}) = \sum_{l=-\infty}^{\infty} r_x(l)\, e^{-j\omega l} \tag{5.3.1}$$

assuming that the autocorrelation sequence $r_x(l)$ is absolutely summable. We will deal with the problem of estimating the power spectrum $R_x(e^{j\omega})$ of a stationary process $x(n)$ from a finite record of observations $\{x(n)\}_0^{N-1}$ of a single realization. The ideal goal is to devise an estimate that will faithfully characterize the power-versus-frequency distribution of the stochastic process (i.e., all the sequences of the ensemble) using only a segment of a single realization. For this to be possible, the estimate should typically involve some kind of averaging among several realizations or along a single realization.

In some practical applications (e.g., interferometry), it is possible to directly measure the autocorrelation $r_x(l)$, $|l| \le L < N$, with great accuracy. In this case, the spectrum estimation problem can be treated as a deterministic one, as described in Section 5.1. We will focus on the "stochastic" version of the problem, where $R_x(e^{j\omega})$ is estimated from the available data $\{x(n)\}_0^{N-1}$. A natural estimate of $R_x(e^{j\omega})$, suggested by (5.3.1), is to estimate $r_x(l)$ from the available data and then transform it by using (5.3.1).

5.3.1 Power Spectrum Estimation Using the Periodogram

The periodogram is an estimator of the power spectrum, introduced by Schuster (1898) in his efforts to search for hidden periodicities in solar sunspot data. The periodogram of the data segment $\{x(n)\}_0^{N-1}$ is defined by

$$\hat{R}_x(e^{j\omega}) \triangleq \frac{1}{N}\left|\sum_{n=0}^{N-1} v(n)\, e^{-j\omega n}\right|^2 = \frac{1}{N}\,|V(e^{j\omega})|^2 \tag{5.3.2}$$

where $V(e^{j\omega})$ is the DTFT of the windowed sequence

$$v(n) = x(n)\,w(n) \qquad 0 \le n \le N-1 \tag{5.3.3}$$
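The definition (5.3.2) and (5.3.3) can be sketched directly in Python with the standard library. This is an illustrative direct evaluation of the DTFT (cost proportional to $N$ per frequency), not the FFT-based computation a toolbox would use; the names `periodogram` and `dft_frequencies` are hypothetical.

```python
import cmath
import math


def periodogram(x, omegas, window=None):
    """Evaluate (1/N)|V(e^{jw})|^2 from (5.3.2) at each frequency in omegas.

    window is the data window w(n); None means the rectangular window.
    """
    N = len(x)
    w = window if window is not None else [1.0] * N
    v = [xi * wi for xi, wi in zip(x, w)]  # windowed sequence (5.3.3)
    out = []
    for om in omegas:
        V = sum(v[n] * cmath.exp(-1j * om * n) for n in range(N))
        out.append(abs(V) ** 2 / N)
    return out


def dft_frequencies(nfft):
    """Frequency grid 2*pi*k/nfft, k = 0..nfft-1; nfft > N gives a denser grid."""
    return [2 * math.pi * k / nfft for k in range(nfft)]
```

Calling `periodogram(x, dft_frequencies(1024))` on a short record evaluates the same continuous function of $\omega$ on a denser grid, which is the effect of zero padding.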

The above definition of the periodogram stems from Parseval's relation (2.2.10) on the power of a signal. The window $w(n)$, which has length $N$, is known as the data window. Usually, the term periodogram is used when $w(n)$ is a rectangular window. In contrast, the term modified periodogram is used to stress the use of nonrectangular windows. The values of the periodogram at the discrete set of frequencies $\{\omega_k = 2\pi k/N\}_0^{N-1}$ can be calculated by

$$\tilde{R}_x(k) \triangleq \hat{R}_x(e^{j2\pi k/N}) = \frac{1}{N}\,|\tilde{V}(k)|^2 \qquad k = 0, 1, \ldots, N-1 \tag{5.3.4}$$

where $\tilde{V}(k)$ is the $N$-point DFT of the windowed segment $v(n)$. In Matlab, the modified periodogram computation is implemented by using the function

    Rx = psd(x,Nfft,Fs,window(N),'none');

where window is the name of any Matlab-provided window function (e.g., hamming); Nfft is the size of the DFT, which is chosen to be larger than $N$ to obtain a high-density spectrum (see zero padding in Section 5.1.1); and Fs is the sampling frequency, which is used for plotting purposes. If the window boxcar is used, then we obtain the periodogram estimate.

The periodogram can be expressed in terms of the autocorrelation estimate $\hat{r}_v(l)$ of the windowed sequence $v(n)$ as (see Problem 5.9)

$$\hat{R}_x(e^{j\omega}) = \sum_{l=-(N-1)}^{N-1} \hat{r}_v(l)\, e^{-j\omega l} \tag{5.3.5}$$

which shows that $\hat{R}_x(e^{j\omega})$ is a "natural" estimate of the power spectrum. From (5.3.2) it follows that $\hat{R}_x(e^{j\omega})$ is nonnegative for all frequencies $\omega$. This results from the fact that the autocorrelation sequence $\hat{r}(l)$, $0 \le |l| \le N-1$, is nonnegative definite. If we use the estimate $\check{r}_x(l)$ from (5.2.13) in (5.3.5) instead of $\hat{r}_x(l)$, the obtained periodogram may assume negative values, which implies that $\check{r}_x(l)$ is not guaranteed to be nonnegative definite.

The inverse Fourier transform of $\hat{R}_x(e^{j\omega})$ provides the estimated autocorrelation $\hat{r}_v(l)$, that is,

$$\hat{r}_v(l) = \frac{1}{2\pi}\int_{-\pi}^{\pi} \hat{R}_x(e^{j\omega})\, e^{j\omega l}\, d\omega \tag{5.3.6}$$

because $\hat{r}_v(l)$ and $\hat{R}_x(e^{j\omega})$ form a DTFT pair. Using (5.3.6) and (5.2.1) for $l = 0$, we have

$$\hat{r}_v(0) = \frac{1}{N}\sum_{n=0}^{N-1} |v(n)|^2 = \frac{1}{2\pi}\int_{-\pi}^{\pi} \hat{R}_x(e^{j\omega})\, d\omega \tag{5.3.7}$$

Thus, the periodogram $\hat{R}_x(e^{j\omega})$ shows how the power of the segment $\{v(n)\}_0^{N-1}$, which provides an estimate of the variance of the process $x(n)$, is distributed as a function of frequency.
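The relations (5.3.5) and (5.3.7) can be checked numerically on a short sequence. The sketch below is illustrative only and uses hypothetical helper names; it compares the periodogram computed from the DTFT of $v(n)$ with the one computed from the sample autocorrelation of $v(n)$. Because the periodogram is a trigonometric polynomial of degree $N-1$ in $\omega$, its average over $M > N-1$ equally spaced frequencies equals the integral in (5.3.7) exactly.

```python
import cmath
import math


def autocorr(v, l):
    """Biased sample autocorrelation of v at lag l; negative lags use conjugate symmetry."""
    N = len(v)
    if l < 0:
        return autocorr(v, -l).conjugate()
    return sum(v[n + l] * complex(v[n]).conjugate() for n in range(N - l)) / N


def periodogram_direct(v, omega):
    """(1/N)|V(e^{jw})|^2, as in (5.3.2)."""
    N = len(v)
    V = sum(v[n] * cmath.exp(-1j * omega * n) for n in range(N))
    return abs(V) ** 2 / N


def periodogram_from_autocorr(v, omega):
    """Sum over |l| <= N-1 of r_v(l) e^{-jwl}, as in (5.3.5)."""
    N = len(v)
    return sum(autocorr(v, l) * cmath.exp(-1j * omega * l)
               for l in range(-(N - 1), N))


def mean_over_grid(v, M):
    """Average of the periodogram over M equally spaced frequencies in [0, 2*pi)."""
    return sum(periodogram_direct(v, 2 * math.pi * k / M) for k in range(M)) / M
```

For any sequence `v`, `mean_over_grid(v, M)` with `M >= len(v)` should agree with `autocorr(v, 0)`, the average power of the segment.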

Filter bank interpretation. The above assertion that the periodogram describes a distribution of power as a function of frequency can be interpreted in a different way, in which the power estimate over a narrow frequency band is attributed to the output power of a narrow-bandpass filter. This leads to the well-known filter bank interpretation of the periodogram. To develop this interpretation, consider the basic (unwindowed) periodogram estimator $\hat{R}_x(e^{j\omega})$ in (5.3.2), evaluated at a frequency $\omega_k \triangleq k\,\Delta\omega \triangleq 2\pi k/N$, which can be expressed as

$$\begin{aligned} \hat{R}_x(e^{j\omega_k}) &= \frac{1}{N}\left|\sum_{n=0}^{N-1} x(n)\, e^{-j\omega_k n}\right|^2 = \frac{1}{N}\left|\sum_{n=0}^{N-1} x(n)\, e^{j2\pi k - j\omega_k n}\right|^2 \\ &= \frac{1}{N}\left|\sum_{n=0}^{N-1} x(n)\, e^{j\omega_k (N-n)}\right|^2 \qquad \text{since } \omega_k N = 2\pi k \\ &= \frac{1}{N}\left|\sum_{m=0}^{N-1} x(N-m)\, e^{j\omega_k m}\right|^2 \end{aligned} \tag{5.3.8}$$


Clearly, the term inside the absolute value sign in (5.3.8) can be interpreted as a convolution of $x(n)$ and $e^{j\omega_k n}$, evaluated at $n = N$. Define

$$h_k(n) \triangleq \begin{cases} \dfrac{1}{N}\, e^{j\omega_k n} & 0 \le n \le N-1 \\[1mm] 0 & \text{otherwise} \end{cases} \tag{5.3.9}$$

as the impulse response of a linear system whose frequency response is given by

$$\begin{aligned} H_k(e^{j\omega}) = \mathcal{F}[h_k(n)] &= \frac{1}{N}\sum_{n=0}^{N-1} e^{j\omega_k n}\, e^{-j\omega n} = \frac{1}{N}\sum_{n=0}^{N-1} e^{-j(\omega - \omega_k)n} = \frac{1}{N}\,\frac{e^{-jN(\omega - \omega_k)} - 1}{e^{-j(\omega - \omega_k)} - 1} \\ &= \frac{1}{N}\,\frac{\sin[N(\omega - \omega_k)/2]}{\sin[(\omega - \omega_k)/2]}\; e^{-j(N-1)(\omega - \omega_k)/2} \end{aligned} \tag{5.3.10}$$

which is a linear-phase, narrow-bandpass filter centered at $\omega = \omega_k$. The 3-dB bandwidth of this filter is proportional to $2\pi/N$ rad per sampling interval (or $1/N$ cycles per sampling interval). A plot of the magnitude response $|H_k(e^{j\omega})|$, for $\omega_k = \pi/2$ and $N = 50$, is shown in Figure 5.10, which evidently shows the narrowband nature of the filter.

FIGURE 5.10 The magnitude of the frequency response of the narrow-bandpass filter for $\omega_k = \pi/2$ and $N = 50$. [Plot: power (dB) versus $\omega$ from $-\pi$ to $\pi$.]

Continuing, we also define the output of the filter $h_k(n)$ by $y_k(n)$, that is,

$$y_k(n) \triangleq h_k(n) * x(n) = \frac{1}{N}\sum_{m=0}^{N-1} x(n-m)\, e^{j\omega_k m} \tag{5.3.11}$$

Then (5.3.8) can be written as

$$\hat{R}_x(e^{j\omega_k}) = N\,|y_k(N)|^2 \tag{5.3.12}$$
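The filter bank relation can be checked numerically. The following sketch is illustrative and rests on one indexing assumption that is not taken from the text: the data are stored as $x(0), \ldots, x(N-1)$, treated as zero outside that range, and the filter output is sampled at the last data index $n = N-1$. Under that assumption the sampled output differs from the DFT sum only by a unit-modulus phase factor, so $N\,|y_k(N-1)|^2$ reproduces the periodogram bin. The function names are hypothetical.

```python
import cmath
import math


def filter_output(x, k, n):
    """y_k(n) = (1/N) sum_{m=0}^{N-1} x(n-m) e^{j w_k m}, as in (5.3.11).

    Assumption: x is zero outside the indices 0..N-1.
    """
    N = len(x)
    wk = 2 * math.pi * k / N
    acc = 0.0
    for m in range(N):
        idx = n - m
        if 0 <= idx < N:
            acc += x[idx] * cmath.exp(1j * wk * m)
    return acc / N


def periodogram_bin(x, k):
    """(1/N)|sum_n x(n) e^{-j w_k n}|^2 at w_k = 2*pi*k/N."""
    N = len(x)
    wk = 2 * math.pi * k / N
    X = sum(x[n] * cmath.exp(-1j * wk * n) for n in range(N))
    return abs(X) ** 2 / N


def filter_bank_periodogram(x):
    """One bandpass filter per frequency bin, each sampled once at n = N-1."""
    N = len(x)
    return [N * abs(filter_output(x, k, N - 1)) ** 2 for k in range(N)]
```

Each entry of `filter_bank_periodogram(x)` should agree with `periodogram_bin(x, k)` to rounding error.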

Now consider the average power in $y_k(n)$, which can be evaluated using the spectral density as [see (3.3.45) and (3.4.22)]

$$E\{|y_k(n)|^2\} = \frac{1}{2\pi}\int_{-\pi}^{\pi} R_x(e^{j\omega})\,|H_k(e^{j\omega})|^2\, d\omega \approx \frac{1}{2\pi}\,\Delta\omega\, R_x(e^{j\omega_k}) = \frac{1}{N}\, R_x(e^{j\omega_k}) \tag{5.3.13}$$

since $H_k(e^{j\omega})$ is a narrowband filter. If we estimate the average power $E\{|y_k(n)|^2\}$ using one sample $y_k(N)$, then from (5.3.13) the estimated spectral density is the periodogram given by (5.3.12), which says that the $k$th DFT sample of the periodogram [see (5.3.4)] is given by the average power of a single $N$th output sample of the $\omega_k$-centered narrow-bandpass filter. Now imagine one such filter for each of the frequencies $\omega_k$, $k = 0, \ldots, N-1$. Thus we have a bank of filters, each tuned to a discrete frequency (based on the data record length), providing the periodogram estimates every $N$ samples. This filter bank is inherently built into the periodogram and hence need not be explicitly implemented. The block diagram of this filter bank approach to the periodogram computation is shown in Figure 5.11.

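FIGURE 5.11 The filter bank approach to the periodogram computation. [Block diagram: $x(n)$ feeds the filters $H_0(e^{j\omega}), H_1(e^{j\omega}), \ldots, H_{N-1}(e^{j\omega})$ in parallel; each output $y_k(n)$ is sampled to give $y_k(N)$ and passed through $N|\cdot|^2$ to produce $\tilde{R}_x(k)$, $k = 0, 1, \ldots, N-1$.]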

In Section 5.1, we observed that the periodogram of a deterministic signal approaches the true energy spectrum as the number of observations $N \to \infty$. To see how the power spectrum of random signals is related to the number of observations, we consider the following example.

EXAMPLE 5.3.1 (PERIODOGRAM OF A SIMULATED WHITE NOISE SEQUENCE). Let $x(n)$ be a stationary white Gaussian noise with zero mean and unit variance. The theoretical spectrum of $x(n)$ is

$$R_x(e^{j\omega}) = \sigma_x^2 = 1 \qquad -\pi < \omega \le \pi$$

To study the periodogram estimate, 50 different $N$-point records of $x(n)$ were generated using a pseudorandom number generator. The periodogram $\hat{R}_x(e^{j\omega})$ of each record was computed for $\omega = \omega_k = 2\pi k/1024$, $k = 0, 1, \ldots, 512$, that is, with $N_{\text{FFT}} = 1024$, from the available data using (5.3.4) for $N = 32$, 128, and 256. These results in the form of periodogram overlays (a Monte Carlo simulation) and their averages are shown in Figure 5.12. We notice that $\hat{R}_x(e^{j\omega})$ fluctuates so erratically that it is impossible to conclude from its observation that the signal has a flat spectrum. Furthermore, the size of the fluctuations (as seen from the ensemble average) is not reduced by increasing the segment length $N$. In this sense, we should not expect the periodogram $\hat{R}_x(e^{j\omega})$ to converge to the true spectrum $R_x(e^{j\omega})$ in some statistical sense as $N \to \infty$. Since $R_x(e^{j\omega})$ is constant over frequency, the fluctuations of $\hat{R}_x(e^{j\omega})$ can be characterized by their mean, variance, and mean square error over frequency for each $N$; these are given in Table 5.2. It can be seen that although the mean value tends to 1 (the true value), the standard deviation is not reduced as $N$ increases. In fact, it is close to 1; that is, it is of the order of the size of the quantity to be estimated. This illustrates that the periodogram is not a good estimate of the power spectrum.
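A small Monte Carlo experiment in the spirit of this example can be sketched as follows. It is not the simulation behind Table 5.2: the record lengths, the number of records, the seed, and the choice to evaluate only the $N$ harmonic frequencies (skipping $\omega = 0$ and $\omega = \pi$) are all assumptions made here to keep a pure-Python example fast.

```python
import cmath
import math
import random


def periodogram_bins(x):
    """Periodogram (rectangular window) at w_k = 2*pi*k/N, k = 0..N-1."""
    N = len(x)
    return [abs(sum(x[n] * cmath.exp(-2j * math.pi * k * n / N)
                    for n in range(N))) ** 2 / N
            for k in range(N)]


def white_noise_periodogram_stats(N, records=50, seed=0):
    """Mean and variance of periodogram values for unit-variance white Gaussian noise."""
    rng = random.Random(seed)
    values = []
    for _ in range(records):
        x = [rng.gauss(0.0, 1.0) for _ in range(N)]
        P = periodogram_bins(x)
        values.extend(P[1:N // 2])  # skip the bins at w = 0 and w = pi
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var
```

Running `white_noise_periodogram_stats` for increasing `N` is expected to give a mean near 1 and a variance that stays near 1 rather than shrinking, which is the behavior the example describes.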

Since for each value of $\omega$, $\hat{R}_x(e^{j\omega})$ is a random variable, the erratic behavior of the periodogram estimator, which is illustrated in Figure 5.12, can be explained by considering its mean, covariance, and variance.

TABLE 5.2 Performance of periodogram for white Gaussian noise signal in Example 5.3.1.

    N                                      32        128       256
    E[\hat{R}_x(e^{j\omega_k})]            0.7829    0.8954    0.9963
    var[\hat{R}_x(e^{j\omega_k})]          0.7232    1.0635    1.1762
    MSE                                    0.7689    1.07244   1.1739

FIGURE 5.12 Periodograms of white Gaussian noise in Example 5.3.1. [Panels: periodogram overlay and periodogram average for $N = 32$, 128, and 256; power (dB) versus $\omega$ from 0 to $\pi$.]

Mean of $\hat{R}_x(e^{j\omega})$. Taking the mathematical expectation of (5.3.5) and using (5.2.8), we obtain

$$E\{\hat{R}_x(e^{j\omega})\} = \sum_{l=-(N-1)}^{N-1} E\{\hat{r}_v(l)\}\, e^{-j\omega l} = \frac{1}{N}\sum_{l=-(N-1)}^{N-1} r_x(l)\, r_w(l)\, e^{-j\omega l} \tag{5.3.14}$$

Since $E\{\hat{R}_x(e^{j\omega})\} \ne R_x(e^{j\omega})$, the periodogram is a biased estimate of the true power spectrum $R_x(e^{j\omega})$.

Equation (5.3.14) can be interpreted in the frequency domain as a periodic convolution. Indeed, using the frequency domain convolution theorem, we have

$$E\{\hat{R}_x(e^{j\omega})\} = \frac{1}{2\pi N}\int_{-\pi}^{\pi} R_x(e^{j\theta})\, R_w(e^{j(\omega - \theta)})\, d\theta \tag{5.3.15}$$

where

$$R_w(e^{j\omega}) = |W(e^{j\omega})|^2 \tag{5.3.16}$$

is the spectrum of the window. Thus, the expected value of the periodogram is obtained by convolving the true spectrum $R_x(e^{j\omega})$ with the spectrum $R_w(e^{j\omega})$ of the window. This is equivalent to windowing the true autocorrelation $r_x(l)$ with the correlation or lag window $r_w(l) = w(l) * w(-l)$, where $w(n)$ is the data window. To understand the implications of (5.3.15), consider the rectangular data window (5.2.7). Using (5.2.11), we see that (5.3.14) becomes

$$E\{\hat{R}_x(e^{j\omega})\} = \sum_{l=-(N-1)}^{N-1} \left(1 - \frac{|l|}{N}\right) r_x(l)\, e^{-j\omega l} \tag{5.3.17}$$
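To make the bias in (5.3.17) concrete, the sketch below evaluates the expected periodogram for a hypothetical real autocorrelation $r_x(l) = a^{|l|}$, which is not an example from the text. Its true spectrum, obtained by summing the two geometric series in (5.3.1), is $(1 - a^2)/(1 - 2a\cos\omega + a^2)$. The function names are assumptions.

```python
import math


def expected_periodogram(r, N, omega):
    """Evaluate (5.3.17) for a real, even autocorrelation function r(l)."""
    total = r(0)
    for l in range(1, N):
        # Lags +l and -l contribute equally for a real, even sequence.
        total += 2.0 * (1.0 - l / N) * r(l) * math.cos(omega * l)
    return total


def true_spectrum(a, omega):
    """Spectrum of r(l) = a^{|l|}, |a| < 1 (assumed example process)."""
    return (1.0 - a * a) / (1.0 - 2.0 * a * math.cos(omega) + a * a)


def bias(a, N, omega):
    """E{periodogram} minus the true spectrum at frequency omega."""
    return expected_periodogram(lambda l: a ** abs(l), N, omega) - true_spectrum(a, omega)
```

For $a = 0.9$ and $\omega = 0$, the magnitude of `bias(0.9, N, 0.0)` shrinks as `N` grows, in line with the asymptotic unbiasedness discussed next.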

For nonperiodic autocorrelations, the value of $r_x(l)$ becomes negligible for large values of $|l|$. Hence, as the record length $N$ increases, the term $(1 - |l|/N) \to 1$ for all $l$, which implies that

$$\lim_{N \to \infty} E\{\hat{R}_x(e^{j\omega})\} = R_x(e^{j\omega}) \tag{5.3.18}$$

that is, the periodogram is an asymptotically unbiased estimator of $R_x(e^{j\omega})$. In the frequency domain, we obtain

$$R_w(e^{j\omega}) = \mathcal{F}\{w_R(l) * w_R(-l)\} = |W_R(e^{j\omega})|^2 = \left[\frac{\sin(\omega N/2)}{\sin(\omega/2)}\right]^2 \tag{5.3.19}$$

where

$$W_R(e^{j\omega}) = e^{-j\omega(N-1)/2}\,\frac{\sin(\omega N/2)}{\sin(\omega/2)} \tag{5.3.20}$$

is the Fourier transform of the rectangular window. The spectrum $R_w(e^{j\omega})$, in (5.3.19), of the correlation window $r_w(l)$ approaches a periodic impulse train as the window length increases.† As a result, $E\{\hat{R}_x(e^{j\omega})\}$ approaches the true power spectrum $R_x(e^{j\omega})$ as $N$ approaches $\infty$.

† This spectrum is sometimes referred to as the Fejer kernel.

The result (5.3.18) holds for any window that satisfies the following two conditions:

1. The window is normalized such that

$$\sum_{n=0}^{N-1} |w(n)|^2 = N \tag{5.3.21}$$

This condition is obtained by noting that, for asymptotic unbiasedness, we want $R_w(e^{j\omega})/N$ in (5.3.15) to be an approximation of an impulse in the frequency domain. Since the area under the impulse function is unity, using (5.3.16) and Parseval's theorem, we have

$$\frac{1}{2\pi N}\int_{-\pi}^{\pi} |W(e^{j\omega})|^2\, d\omega = \frac{1}{N}\sum_{n=0}^{N-1} |w(n)|^2 = 1 \tag{5.3.22}$$

2. The width of the mainlobe of the spectrum $R_w(e^{j\omega})$ of the correlation window decreases as $1/N$. This condition guarantees that the area under $R_w(e^{j\omega})$ is concentrated at the origin as $N$ becomes large. For more precise conditions see Brockwell and Davis (1991).
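The normalization in (5.3.21) amounts to rescaling a data window so that its energy equals $N$. A minimal sketch follows; the Hamming coefficients 0.54 and 0.46 are the standard definition of that window, while the function names are hypothetical.

```python
import math


def hamming(N):
    """Symmetric Hamming window of length N (N >= 2)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (N - 1)) for n in range(N)]


def normalize_window(w):
    """Scale w so that sum |w(n)|^2 = N, as required by (5.3.21)."""
    N = len(w)
    energy = sum(abs(v) ** 2 for v in w)
    scale = math.sqrt(N / energy)
    return [scale * v for v in w]
```

The rectangular window already satisfies (5.3.21), so `normalize_window([1.0] * N)` leaves it unchanged; tapered windows are scaled up to compensate for the energy removed by the taper.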

The bias is introduced by the sidelobes of the correlation window through leakage, as illustrated in Section 5.1. Therefore, we can reduce the bias by using the modified periodogram and a "better" window. Bias can be avoided if either $N = \infty$, in which case the spectrum of the window is a periodic train of impulses, or $R_x(e^{j\omega}) = \sigma_x^2$, that is, $x(n)$ has a flat power spectrum. Thus, for white noise, $\hat{R}_x(e^{j\omega})$ is unbiased for all $N$. This fact was apparent in Example 5.3.1 and is very important for practical applications. In the following example, we illustrate that the bias becomes worse as the dynamic range of the spectrum increases.

EXAMPLE 5.3.2 (BIAS AND LEAKAGE PROPERTIES OF THE PERIODOGRAM). Consider an AR(2) process with

$$\mathbf{a}_2 = [1 \;\; {-0.75} \;\; 0.5]^T \qquad d_0 = 1 \tag{5.3.23}$$

and an AR(4) process with

$$\mathbf{a}_4 = [1 \;\; {-2.7607} \;\; 3.8106 \;\; {-2.6535} \;\; 0.9238]^T \qquad d_0 = 1 \tag{5.3.24}$$

where $w(n) \sim \text{WN}(0, 1)$. Both processes have been used extensively in the literature for power spectrum estimation studies (Percival and Walden 1993). Their power spectrum is given by (see Chapter 4)

$$R_x(e^{j\omega}) = \frac{\sigma_w^2\, d_0}{|A(e^{j\omega})|^2} = \frac{\sigma_w^2}{\left|\displaystyle\sum_{k=0}^{p} a_k\, e^{-j\omega k}\right|^2} \tag{5.3.25}$$
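The spectra in (5.3.25) and their dynamic ranges can be evaluated numerically from the coefficient vectors (5.3.23) and (5.3.24). The sketch below is illustrative: the function names and the frequency grid size are assumptions, and the maximum and minimum are taken over a finite grid on $[0, \pi]$ rather than over all $\omega$.

```python
import cmath
import math

A2 = [1.0, -0.75, 0.5]                          # AR(2) coefficients from (5.3.23)
A4 = [1.0, -2.7607, 3.8106, -2.6535, 0.9238]    # AR(4) coefficients from (5.3.24)


def ar_spectrum(a, omega, sigma2=1.0):
    """sigma_w^2 / |A(e^{jw})|^2 with A(e^{jw}) = sum_k a_k e^{-jwk}, as in (5.3.25)."""
    A = sum(ak * cmath.exp(-1j * omega * k) for k, ak in enumerate(a))
    return sigma2 / abs(A) ** 2


def dynamic_range_db(a, grid=4096):
    """10 log10(max/min) of the spectrum over grid + 1 points in [0, pi]."""
    vals = [ar_spectrum(a, math.pi * i / grid) for i in range(grid + 1)]
    return 10.0 * math.log10(max(vals) / min(vals))
```

Evaluating `dynamic_range_db(A2)` and `dynamic_range_db(A4)` gives values consistent with the dynamic ranges of about 15 dB and 65 dB quoted for the two processes in this example.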

For simulation purposes, $N = 1024$ samples of each process were generated. The sample realizations and the shapes of the two power spectra in (5.3.25) are shown in Figure 5.13. The dynamic range of the two spectra, that is, $\max_\omega R_x(e^{j\omega}) / \min_\omega R_x(e^{j\omega})$, is about 15 and 65 dB, respectively.

FIGURE 5.13 Sample realizations and power spectra of the AR(2) and AR(4) processes used in Example 5.3.2. [Panels: sample realization (amplitude versus sample number) and power spectrum (power in dB versus frequency in cycles/sampling interval) for AR(2) and AR(4).]

From the sample realizations, periodograms and modified periodograms, based on the Hanning window, were computed by using (5.3.4) at $N_{\text{FFT}} = 1024$ frequencies. These are shown in Figure 5.14. The periodograms for the AR(2) and AR(4) processes, respectively, are shown in the top row while the modified periodograms for the same processes are shown in the bottom row. These plots illustrate that the periodogram is a biased estimator of the power spectrum. In the case of the AR(2) process, since the spectrum has a small dynamic range (15 dB), the bias in the periodogram estimate is not obvious; furthermore, the windowing in the modified periodogram did not show much improvement. On the other hand, the AR(4) spectrum has a large dynamic range, and hence the bias is clearly visible at high frequencies. This bias is clearly reduced by windowing of the data in the modified periodogram. In both cases, the random fluctuations are not reduced by the data windowing operation.

FIGURE 5.14 Illustration of properties of periodogram as a power spectrum estimator. [Panels: periodogram (top row) and modified periodogram (bottom row) for AR(2) and AR(4); power (dB) versus frequency in cycles/sampling interval.]

EXAMPLE 5.3.3 (FREQUENCY RESOLUTION PROPERTY OF THE PERIODOGRAM). Consider two unit-amplitude sinusoids observed in unit-variance white noise. Let

$$x(n) = \cos(0.35\pi n + \phi_1) + \cos(0.4\pi n + \phi_2) + \nu(n)$$

where $\phi_1$ and $\phi_2$ are jointly independent random variables uniformly distributed over $[-\pi, \pi]$ and $\nu(n)$ is a unit-variance white noise. Since the two frequencies, $0.35\pi$ and $0.4\pi$, are close, we will need (see Table 5.1)

$$N - 1 > \frac{1.81\pi}{0.4\pi - 0.35\pi} \qquad \text{or} \qquad N > 37$$

To obtain a periodogram ensemble, 50 realizations of $x(n)$ for $N = 32$ and $N = 64$ were generated, and their periodograms were computed. The plots of these periodogram overlays and the corresponding ensemble average for $N = 32$ and $N = 64$ are shown in Figure 5.15. For $N = 32$, frequencies in the periodogram cannot be resolved, as expected; but for $N = 64$ it is possible to separate the two sinusoids with ease. Note that the modified periodogram (i.e., data windowing) will not help since windowing increases smoothing and smearing of peaks.

The case of nonzero mean. In the periodogram method of spectrum analysis in this section, we assumed that the random signal has zero mean. If a random signal has nonzero mean, it should be estimated using (3.6.20) and then removed from the signal prior to computing its periodogram. This is because the power spectrum of a nonzero mean signal has an impulse at the zero frequency. If this mean is relatively large, then because of the leakage inherent in the periodogram, this mean will obscure low-amplitude, low-frequency components of the spectrum. Even though the estimate is not an exact value, its removal often provides better estimates, especially at low frequencies.
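The effect of mean removal on the zero-frequency value of the periodogram can be sketched as follows. The sketch assumes that the mean estimate referred to as (3.6.20) is the ordinary sample mean; the function names are hypothetical.

```python
def remove_mean(x):
    """Subtract the sample mean (assumed here to be the estimator of (3.6.20))."""
    mu = sum(x) / len(x)
    return [v - mu for v in x]


def periodogram_at_dc(x):
    """Periodogram value at w = 0: (1/N)|sum_n x(n)|^2, which equals N * mean^2."""
    return abs(sum(x)) ** 2 / len(x)
```

For a record with sample mean $\hat{\mu}_x$, the value at $\omega = 0$ is $N|\hat{\mu}_x|^2$, which grows with $N$; after `remove_mean` it is zero up to rounding, so the large zero-frequency term can no longer leak into neighboring frequencies.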

Covariance of $\hat{R}_x(e^{j\omega})$. Obtaining an expression for the covariance of the periodogram is a rather complicated process. However, it has been shown (Jenkins and Watts 1968) that

$$\operatorname{cov}\{\hat{R}_x(e^{j\omega_1}), \hat{R}_x(e^{j\omega_2})\} \simeq R_x(e^{j\omega_1})\, R_x(e^{j\omega_2}) \left\{ \left[\frac{\sin[(\omega_1 + \omega_2)N/2]}{N \sin[(\omega_1 + \omega_2)/2]}\right]^2 + \left[\frac{\sin[(\omega_1 - \omega_2)N/2]}{N \sin[(\omega_1 - \omega_2)/2]}\right]^2 \right\} \tag{5.3.26}$$

This expression applies to stationary random signals with zero mean and Gaussian probability density. The approximation becomes exact if the signal has a flat spectrum (white noise). Although this approximation deteriorates for non-Gaussian probability densities, the qualitative results that one can draw from this approximation appear to hold for a rather broad range of densities.

From (5.3.26), for $\omega_1 = (2\pi/N)k_1$ and $\omega_2 = (2\pi/N)k_2$ with $k_1$, $k_2$ integers, we have

$$\operatorname{cov}\{\hat{R}_x(e^{j\omega_1}), \hat{R}_x(e^{j\omega_2})\} \simeq 0 \qquad \text{for } k_1 \ne k_2 \tag{5.3.27}$$

Thus, values of the periodogram spaced in frequency by integer multiples of $2\pi/N$ are approximately uncorrelated. As the record length $N$ increases, these uncorrelated periodogram samples come closer together, and hence the rate of fluctuations in the periodogram increases. This explains the results in Figure 5.12.

FIGURE 5.15 Illustration of the frequency resolution property of the periodogram in Example 5.3.3. [Panels: periodogram overlay and periodogram average for $N = 32$ and $N = 64$; power (dB) versus $\omega$ from 0 to $\pi$, with $0.35\pi$ and $0.4\pi$ marked.]

Variance of $\hat{R}_x(e^{j\omega})$. The variance of the periodogram at a particular frequency $\omega = \omega_1 = \omega_2$ can be obtained from (5.3.26)

$$\operatorname{var}\{\hat{R}_x(e^{j\omega})\} \simeq R_x^2(e^{j\omega}) \left\{ 1 + \left[\frac{\sin \omega N}{N \sin \omega}\right]^2 \right\} \tag{5.3.28}$$

For large values of $N$, the variance of $\hat{R}_x(e^{j\omega})$ can be approximated by

$$\operatorname{var}\{\hat{R}_x(e^{j\omega})\} \simeq \begin{cases} R_x^2(e^{j\omega}) & 0 < \omega < \pi \\ 2R_x^2(e^{j\omega}) & \omega = 0, \pi \end{cases} \tag{5.3.29}$$

This result is crucial, because it shows that the variance of the periodogram (estimate) remains at the level of $R_x^2(e^{j\omega})$ (quantity to be estimated), independent of the record length $N$ used. Furthermore, since the variance does not tend to zero as $N \to \infty$, the periodogram is not a consistent estimator; that is, its distribution does not tend to cluster more closely around the true spectrum as $N$ increases.† This behavior was illustrated in Example 5.3.1. The variance of $\hat{R}_x(e^{j\omega_k})$ fails to decrease as $N$ increases because the number of periodogram values $\hat{R}_x(e^{j\omega_k})$, $k = 0, 1, \ldots, N-1$, is always equal to the length $N$ of the data record.

EXAMPLE 5.3.4 (COMPARISON OF PERIODOGRAM AND MODIFIED PERIODOGRAM). Consider the case of three sinusoids discussed in Section 5.1.4. In particular, we assume that these sinusoids are observed in white noise with

$$x(n) = \cos(0.35\pi n + \phi_1) + \cos(0.4\pi n + \phi_2) + 0.25\cos(0.8\pi n + \phi_3) + \nu(n)$$

where $\phi_1$, $\phi_2$, and $\phi_3$ are jointly independent random variables uniformly distributed over $[-\pi, \pi]$ and $\nu(n)$ is a unit-variance white noise. An ensemble of 50 realizations of $x(n)$ was generated using $N = 128$. The periodograms and the Hamming window-based modified periodograms of these realizations were computed, and the results are shown in Figure 5.16. The top row of the figure contains periodogram overlays and the corresponding ensemble average for the unwindowed periodogram, and the bottom row shows the same for the modified periodogram. Spurious peaks (especially near the two close frequencies) in the periodogram have been suppressed by the data windowing operation in the modified periodogram; hence the peak corresponding to $0.8\pi$ is sufficiently enhanced. This enhancement is clearly at the expense of the frequency resolution (or smearing of the true peaks), which is to be expected. The overall variance of the noise floor is still not reduced.

Failure of the periodogram. To conclude, we note that the periodogram in its "basic form" is a very poor estimator of the power spectrum function. The failure of the periodogram when applied to random signals is uniquely pointed out in Jenkins and Watts (1968, p. 213):

    The basic reason why Fourier analysis breaks down when applied to time series is that it is based on the assumption of fixed amplitudes, frequencies and phases. Time series, on the other hand, are characterized by random changes of frequencies, amplitudes and phases. Therefore it is not surprising that Fourier methods need to be adapted to account for the random nature of a time series.

The attempt at improving the periodogram by windowing the available data, that is, by using the modified periodogram in Example 5.3.4, showed that the presence and the length of the window had no effect on the variance. The major problems with the periodogram lie in its variance, which is on the order of $R_x^2(e^{j\omega})$, as well as in its erratic behavior. Thus, to obtain a better estimator, we should reduce its variance; that is, we should "smooth" the periodogram.

† The definition of the PSD by $R_x(e^{j\omega}) = \lim_{N\to\infty} \hat{R}_x(e^{j\omega})$ is not valid because even if $\lim_{N\to\infty} E\{\hat{R}_x(e^{j\omega})\} = R_x(e^{j\omega})$, the variance of $\hat{R}_x(e^{j\omega})$ does not tend to zero as $N \to \infty$ (Papoulis 1991).

From the previous discussion, it follows that the sequence $\hat{R}_x(k)$, $k = 0, 1, \ldots, N-1$, of the harmonic periodogram components can be reasonably assumed to be a sequence of uncorrelated random variables. Furthermore, it is well known that the variance of the average of $K$ uncorrelated random variables with the same variance is $1/K$ times the variance of one of these individual random variables. This suggests two ways of reducing the variance, which also lead to smoother spectral estimators:

• Average contiguous values of the periodogram.
• Average periodograms obtained from multiple data segments.

It should be apparent that owing to stationarity, the two approaches should provide comparable results under similar circumstances.

5.3.2 Power Spectrum Estimation by Smoothing a Single Periodogram: The Blackman-Tukey Method

The idea of reducing the variance of the periodogram through smoothing using a moving-average filter was first proposed by Daniel (1946). The estimator proposed by Daniel is a zero-phase moving-average filter, given by

$$\hat{R}_x^{(PS)}(e^{j\omega_k}) \triangleq \frac{1}{2M+1}\sum_{j=-M}^{M} \hat{R}_x(e^{j\omega_{k-j}}) = \sum_{j=-M}^{M} W(e^{j\omega_j})\, \hat{R}_x(e^{j\omega_{k-j}}) \tag{5.3.30}$$

where $\omega_k = (2\pi/N)k$, $k = 0, 1, \ldots, N-1$, $W(e^{j\omega_j}) \triangleq 1/(2M+1)$, and the superscript (PS) denotes periodogram smoothing. Since the samples of the periodogram are approximately uncorrelated,

$$\operatorname{var}\{\hat{R}_x^{(PS)}(e^{j\omega_k})\} \simeq \frac{1}{2M+1}\operatorname{var}\{\hat{R}_x(e^{j\omega_k})\} \tag{5.3.31}$$

that is, averaging $2M+1$ consecutive spectral lines reduces the variance by a factor of $2M+1$. The quantity $\Delta\omega \approx (2\pi/N)(2M+1)$ determines the frequency resolution, since any peaks within the $\Delta\omega$ range are smoothed over the entire interval $\Delta\omega$ into a single peak and cannot be resolved. Thus, increasing $M$ reduces the variance (resulting in a smoother spectrum estimate), at the expense of spectral resolution. This is the fundamental tradeoff in practical spectral analysis.

FIGURE 5.16 Comparison of periodogram and modified periodogram in Example 5.3.4. [Panels: periodogram overlay and average (top row) and Hamming-window modified periodogram overlay and average (bottom row) for $N = 128$; power (dB) versus $\omega$ from 0 to $\pi$, with $0.35\pi$, $0.4\pi$, and $0.8\pi$ marked.]
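The moving-average smoother in (5.3.30) can be sketched in a few lines. The sketch assumes that the periodogram samples are given on the full grid $\omega_k = 2\pi k/N$, $k = 0, \ldots, N-1$, so that indices can wrap around modulo $N$ (the periodogram is periodic in $\omega$ with period $2\pi$); the function name is hypothetical.

```python
def smooth_periodogram(P, M):
    """Average 2M+1 neighboring periodogram samples around each bin, as in (5.3.30).

    P holds samples at w_k = 2*pi*k/N for k = 0..N-1; indices wrap modulo N.
    """
    N = len(P)
    return [sum(P[(k - j) % N] for j in range(-M, M + 1)) / (2 * M + 1)
            for k in range(N)]
```

A single isolated line is spread evenly over $2M+1$ bins, which shows the loss of resolution that accompanies the variance reduction in (5.3.31); the total of the samples is unchanged by the smoothing.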

Blackman-Tukey approach. The discrete moving average in (5.3.30) is computed in the frequency domain. We now introduce a better and simpler way to smooth the periodogram by operating on the estimated autocorrelation sequence. To this end, we note that the continuous frequency equivalent of the discrete convolution formula (5.3.30) is the periodic convolution

$$\hat{R}_x^{(PS)}(e^{j\omega}) = \frac{1}{2\pi}\int_{-\pi}^{\pi} \hat{R}_x(e^{j(\omega - \theta)})\, W_a(e^{j\theta})\, d\theta \triangleq \hat{R}_x(e^{j\omega}) \otimes W_a(e^{j\omega}) \tag{5.3.32}$$

where $W_a(e^{j\omega})$ is a periodic function of $\omega$ with period $2\pi$, given by

$$W_a(e^{j\omega}) = \begin{cases} \dfrac{1}{\Delta\omega} & |\omega| < \dfrac{\Delta\omega}{2} \\[2mm] 0 & \dfrac{\Delta\omega}{2} \le |\omega| \le \pi \end{cases} \tag{5.3.33}$$

By using the convolution theorem, (5.3.32) can be written as

$$\hat{R}_x^{(PS)}(e^{j\omega}) = \sum_{l=-(L-1)}^{L-1} \hat{r}_x(l)\, w_a(l)\, e^{-j\omega l} \tag{5.3.34}$$

where $w_a(l)$ is the inverse Fourier transform of $W_a(e^{j\omega})$ and $L < N$. As we have already mentioned, the window $w_a(l)$ is known as the correlation or lag window.† The correlation window corresponding to (5.3.33) is

$$w_a(l) = \frac{\sin(l\,\Delta\omega/2)}{\pi l} \qquad -\infty < l < \infty \tag{5.3.35}$$

† The term spectral window is quite often used for $W_a(e^{j\omega}) = \mathcal{F}\{w_a(l)\}$, the Fourier transform of the correlation window. However, this term is misleading because $W_a(e^{j\omega})$ is essentially a frequency-domain impulse response. We use the term correlation window for $w_a(l)$ and the term Fourier transform of the correlation window for $W_a(e^{j\omega})$.

Since $w_a(l)$ has infinite duration, its truncation at $|l| = L \le N$ creates ripples in $W_a(e^{j\omega})$ (Gibbs effect). To avoid this problem, we use correlation windows with finite duration, that is, $w_a(l) = 0$ for $|l| > L$, with $L \le N$. For real sequences, where $\hat{r}_x(l)$ is real and even, $w_a(l)$ [and hence $W_a(e^{j\omega})$] should be real and even. Given that $\hat{R}_x(e^{j\omega})$ is nonnegative, a sufficient (but not necessary) condition that $\hat{R}_x^{(PS)}(e^{j\omega})$ be nonnegative is that $W_a(e^{j\omega}) \ge 0$ for all $\omega$. This condition holds for the Bartlett (triangular) and Parzen (see Problem 5.11) windows, but it does not hold for the Hamming, Hanning, or Kaiser window.

Thus, we note that smoothing the periodogram $\hat{R}_x(e^{j\omega})$ by convolving it with the spectrum $W_a(e^{j\omega}) = \mathcal{F}\{w_a(l)\}$ is equivalent to windowing the autocorrelation estimate $\hat{r}_x(l)$ with the correlation window $w_a(l)$. This approach to power spectrum estimation, which was introduced by Blackman and Tukey (1959), involves the following steps:

1. Estimate the autocorrelation sequence from the unwindowed data.
2. Window the obtained autocorrelation samples.
3. Compute the DTFT of the windowed autocorrelation as given in (5.3.34).

A pictorial comparison between the theoretical [i.e., using (5.3.32)] and the above practical computation of power spectrum using the single-periodogram smoothing is shown in Figure 5.17.
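The three steps above can be sketched for real-valued data as follows. This is a minimal illustration rather than a definitive implementation: it uses a Bartlett lag window $w_a(l) = 1 - |l|/L$ for $|l| \le L-1$ as one possible choice, evaluates the DTFT directly instead of through an FFT, and all function names are hypothetical.

```python
import math


def sample_autocorr_real(x, L):
    """Step 1: biased autocorrelation estimate (5.2.1) of real data for lags 0..L-1."""
    N = len(x)
    return [sum(x[n + l] * x[n] for n in range(N - l)) / N for l in range(L)]


def bartlett_lag_window(L):
    """Triangular lag window w_a(l) = 1 - l/L for l = 0..L-1 (assumed choice)."""
    return [1.0 - l / L for l in range(L)]


def blackman_tukey(x, L, omegas):
    """Steps 2 and 3: window the autocorrelation and evaluate (5.3.34) at omegas."""
    r = sample_autocorr_real(x, L)
    w = bartlett_lag_window(L)
    rw = [ri * wi for ri, wi in zip(r, w)]
    # For a real, even sequence the DTFT reduces to a cosine sum.
    return [rw[0] + 2.0 * sum(rw[l] * math.cos(om * l) for l in range(1, L))
            for om in omegas]
```

Because the Fourier transform of the Bartlett lag window is nonnegative, the resulting estimate should be nonnegative at every frequency, and $w_a(0) = 1$ preserves the total power $\hat{r}_x(0)$. A smaller `L` gives a smoother estimate with coarser resolution.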

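FIGURE 5.17 Comparison of the theory and practice of the Blackman-Tukey method. [Diagram: starting from the signal data record $\{x(n)\}_0^{N-1}$, the THEORY path forms the periodogram $\hat{R}_x(e^{j\omega})$ and convolves it with $W_a(e^{j\omega})$ to obtain $\hat{R}_x^{(PS)}(e^{j\omega})$; the PRACTICE path computes the autocorrelation $\{\hat{r}_x(l)\}_{-(L-1)}^{L-1}$, applies the windowing $\{\hat{r}_x(l)w_a(l)\}_{-(L-1)}^{L-1}$, and takes the DFT using the FFT to obtain $\hat{R}_x^{(PS)}(e^{j\omega})$ at $\omega = 2\pi k/N_{\text{FFT}}$.]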

The resolution of the Blackman-Tukey power spectrum estimator is determined by the duration $2L - 1$ of the correlation window. For most correlation windows, the resolution is measured by the 3-dB bandwidth of the mainlobe, which is on the order of $2\pi/L$ rad per sampling interval. The statistical quality of the Blackman-Tukey estimate $\hat{R}_x^{(PS)}(e^{j\omega})$ can be evaluated by examining its mean, covariance, and variance.

Mean of $\hat{R}_x^{(PS)}(e^{j\omega})$. The expected value of the smoothed periodogram $\hat{R}_x^{(PS)}(e^{j\omega})$ can be obtained by using (5.3.34) and (5.2.11). Indeed, we have

$$ E\{\hat{R}_x^{(PS)}(e^{j\omega})\} = \sum_{l=-(L-1)}^{L-1} E\{\hat{r}_x(l)\}\, w_a(l)\, e^{-j\omega l} = \sum_{l=-(L-1)}^{L-1} r_x(l)\left(1 - \frac{|l|}{N}\right) w_a(l)\, e^{-j\omega l} \tag{5.3.36} $$

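or, using the frequency convolution theorem, we have

$$ E\{\hat{R}_x^{(PS)}(e^{j\omega})\} = R_x(e^{j\omega}) \otimes W_B(e^{j\omega}) \otimes W_a(e^{j\omega}) \tag{5.3.37} $$

where

$$ W_B(e^{j\omega}) = F\left\{\left(1 - \frac{|l|}{N}\right) w_R(n)\right\} = \frac{1}{N}\left[\frac{\sin(\omega N/2)}{\sin(\omega/2)}\right]^2 \tag{5.3.38} $$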

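is the Fourier transform of the Bartlett window. Since $E\{\hat{R}_x^{(PS)}(e^{j\omega})\} \ne R_x(e^{j\omega})$, $\hat{R}_x^{(PS)}(e^{j\omega})$ is a biased estimate of $R_x(e^{j\omega})$. For $L \ll N$, $(1 - |l|/N) \approx 1$ and hence we obtain

$$ E\{\hat{R}_x^{(PS)}(e^{j\omega})\} = \sum_{l=-(L-1)}^{L-1} r_x(l)\left(1 - \frac{|l|}{N}\right) w_a(l)\, e^{-j\omega l} \approx R_x(e^{j\omega}) \otimes W_a(e^{j\omega}) = \frac{1}{2\pi}\int_{-\pi}^{\pi} R_x(e^{j\theta})\, W_a(e^{j(\omega-\theta)})\, d\theta \tag{5.3.39} $$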

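If $L$ is sufficiently large, the Fourier transform of the correlation window $w_a(l)$ consists of a narrow mainlobe. If $R_x(e^{j\omega})$ can be assumed to be constant within the mainlobe, we have

$$ E\{\hat{R}_x^{(PS)}(e^{j\omega})\} \approx R_x(e^{j\omega})\, \frac{1}{2\pi}\int_{-\pi}^{\pi} W_a(e^{j(\omega-\theta)})\, d\theta $$

which implies that $\hat{R}_x^{(PS)}(e^{j\omega})$ is asymptotically unbiased if

$$ \frac{1}{2\pi}\int_{-\pi}^{\pi} W_a(e^{j\omega})\, d\omega = w_a(0) = 1 \tag{5.3.40} $$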

that is, if the spectrum of the correlation window has unit area. Under this condition, if both $L$ and $N$ tend to infinity, then $W_a(e^{j\omega})$ and $W_B(e^{j\omega})$ become periodic impulse trains and the convolution (5.3.37) reproduces $R_x(e^{j\omega})$.

Covariance of $\hat{R}_x^{(PS)}(e^{j\omega})$. The following approximation

$$ \mathrm{cov}\{\hat{R}_x^{(PS)}(e^{j\omega_1}), \hat{R}_x^{(PS)}(e^{j\omega_2})\} \approx \frac{1}{2\pi N}\int_{-\pi}^{\pi} R_x^2(e^{j\theta})\, W_a(e^{j(\omega_1-\theta)})\, W_a(e^{j(\omega_2-\theta)})\, d\theta \tag{5.3.41} $$

derived in Jenkins and Watts (1968), holds under the assumptions that (1) $N$ is sufficiently large that $W_B(e^{j\omega})$ behaves as a periodic impulse train and (2) $L$ is sufficiently large that $W_a(e^{j\omega})$ is sufficiently narrow that the product $W_a(e^{j(\omega_1+\theta)})W_a(e^{j(\omega_2-\theta)})$ is negligible. Hence, the covariance increases with the width of $W_a(e^{j\omega})$ and with the amount of overlap between the windows $W_a(e^{j(\omega_1-\theta)})$ (centered at $\omega_1$) and $W_a(e^{j(\omega_2-\theta)})$ (centered at $\omega_2$).

Variance of $\hat{R}_x^{(PS)}(e^{j\omega})$. When $\omega = \omega_1 = \omega_2$, (5.3.41) gives

$$ \mathrm{var}\{\hat{R}_x^{(PS)}(e^{j\omega})\} \approx \frac{1}{2\pi N}\int_{-\pi}^{\pi} R_x^2(e^{j\theta})\, W_a^2(e^{j(\omega-\theta)})\, d\theta \tag{5.3.42} $$

If $R_x(e^{j\omega})$ is smooth within the width of $W_a(e^{j\omega})$, then

$$ \mathrm{var}\{\hat{R}_x^{(PS)}(e^{j\omega})\} \approx R_x^2(e^{j\omega})\, \frac{1}{2\pi N}\int_{-\pi}^{\pi} W_a^2(e^{j\omega})\, d\omega \tag{5.3.43} $$

or

$$ \mathrm{var}\{\hat{R}_x^{(PS)}(e^{j\omega})\} \approx \frac{E_w}{N}\, R_x^2(e^{j\omega}) \qquad 0 < \omega < \pi \tag{5.3.44} $$

where

$$ E_w = \frac{1}{2\pi}\int_{-\pi}^{\pi} W_a^2(e^{j\omega})\, d\omega = \sum_{l=-(L-1)}^{L-1} w_a^2(l) \tag{5.3.45} $$

is the energy of the correlation window. From (5.3.29) and (5.3.44) we have

$$ \frac{\mathrm{var}\{\hat{R}_x^{(PS)}(e^{j\omega})\}}{\mathrm{var}\{\hat{R}_x(e^{j\omega})\}} \approx \frac{E_w}{N} \qquad 0 < \omega < \pi \tag{5.3.46} $$

which is known as the variance reduction factor or variance ratio and provides the reduction in variance attained by smoothing the periodogram.

In the beginning of this section, we explained the variance reduction in terms of frequency-domain averaging. An alternative explanation can be provided by considering the windowing of the estimated autocorrelation. As discussed in Section 5.2, the variance of the autocorrelation estimate increases as $|l|$ approaches $N$ because fewer and fewer samples are used to compute the estimate. Since every value of $\hat{r}_x(l)$ affects the value of $\hat{R}_x(e^{j\omega})$ at all frequencies, the less reliable values affect the quality of the periodogram everywhere. Thus, we can reduce the variance of the periodogram by minimizing the contribution of autocorrelation terms with large variance, that is, with lags close to $N$, by proper windowing.

As we have already stressed, there is a tradeoff between resolution and variance. For the variance to be small, we must choose a window that contains a small amount of energy $E_w$. Since $|w_a(l)| \le 1$, we have $E_w \le 2L$. Thus, to reduce the variance, we must have $L \ll N$. The bias of $\hat{R}_x^{(PS)}(e^{j\omega})$ is directly related to the resolution, which is determined by the mainlobe width of the window, which in turn is proportional to $1/L$. Hence, to reduce the bias, $W_a(e^{j\omega})$ should have a narrow mainlobe, which demands a large $L$. The requirements for high resolution (small bias) and low variance can be simultaneously satisfied only if $N$ is sufficiently large. The variance reduction for some commonly used windows is examined in Problem 5.12. Empirical evidence suggests that use of the Parzen window is a reasonable choice.

Confidence intervals. In the interpretation of spectral estimates, it is important to know whether the spectral details are real or are due to statistical fluctuations. Such information is provided by the confidence intervals (Chapter 3). When the spectrum is plotted on a logarithmic scale, the $(1 - \alpha) \times 100$ percent confidence interval is constant at every frequency, and it is given by (Koopmans 1974)

$$ \left(10\log \hat{R}_x^{(PS)}(e^{j\omega}) - 10\log\frac{\chi^2_\nu(1-\alpha/2)}{\nu},\;\; 10\log \hat{R}_x^{(PS)}(e^{j\omega}) + 10\log\frac{\nu}{\chi^2_\nu(\alpha/2)}\right) \tag{5.3.47} $$

where

$$ \nu = \frac{2N}{\sum_{l=-(L-1)}^{L-1} w_a^2(l)} \tag{5.3.48} $$

is the degrees of freedom of a $\chi^2_\nu$ distribution.

Computation of $\hat{R}_x^{(PS)}(e^{j\omega})$ using the DFT. In practice, the Blackman-Tukey power spectrum estimator is computed by using an $N$-point DFT as follows:

1. Estimate the autocorrelation $r_x(l)$, using the formula

$$ \hat{r}_x(l) = \hat{r}_x^*(-l) = \frac{1}{N}\sum_{n=0}^{N-l-1} x(n+l)\, x^*(n) \qquad l = 0, 1, \ldots, L-1 \tag{5.3.49} $$

For L > 100, indirect computation of rˆx (l) by using DFT techniques is usually more efficient (see Problem 5.13).

2. Form the sequence

$$ f(l) = \begin{cases} \hat{r}_x(l)\, w_a(l) & 0 \le l \le L-1 \\ 0 & L \le l \le N-L \\ \hat{r}_x^*(N-l)\, w_a(N-l) & N-L+1 \le l \le N-1 \end{cases} \tag{5.3.50} $$

3. Compute the power spectrum estimate

$$ \hat{R}_x^{(PS)}(e^{j\omega})\big|_{\omega=(2\pi/N)k} = F(k) = \mathrm{DFT}\{f(l)\} \qquad 0 \le k \le N-1 \tag{5.3.51} $$

as the $N$-point DFT of the sequence $f(l)$.

Matlab does not provide a direct function to implement the Blackman-Tukey method. However, such a function can be easily constructed by using built-in Matlab functions and the above approach. The book toolbox function

    Rx = bt_psd(x,Nfft,window,L);

implements the above algorithm, in which window is any available Matlab window and Nfft is chosen to be larger than N to obtain a high-density spectrum.

EXAMPLE 5.3.5 (BLACKMAN-TUKEY METHOD). Consider the spectrum estimation of three sinusoids in white noise given in Example 5.3.4, that is,

$$ x(n) = \cos(0.35\pi n + \phi_1) + \cos(0.4\pi n + \phi_2) + 0.25\cos(0.8\pi n + \phi_3) + \nu(n) \tag{5.3.52} $$

where φ 1 , φ 2 , and φ 3 are jointly independent random variables uniformly distributed over [−π , π] and ν(n) is a unit-variance white noise. An ensemble of 50 realizations of x(n) was generated using N = 512. The autocorrelations of these realizations were estimated up to lag L = 64, 128, and 256. These autocorrelations were windowed using the Bartlett window, and then their 1024-point DFT was computed as the spectrum estimate. The results are shown in Figure 5.18. The top row of the figure contains estimate overlays and the corresponding ensemble average for L = 64, the middle row for L = 128, and the bottom row for L = 256. Several observations can be made from these plots. First, the variance in the estimate has considerably reduced over the periodogram estimate. Second, the lower the lag distance L, the lower the variance and the resolution (i.e., the higher the smoothing of the peaks). This observation is consistent with our discussion above about the effect of L on the quality of estimates. Finally, all the frequencies including the one at 0.8π are clearly distinguishable, something that the basic periodogram could not achieve.
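The three computational steps (5.3.49) to (5.3.51) can be sketched in a few lines of code. The following is only an illustrative sketch under stated assumptions, not the book's bt_psd toolbox function: the names autocorr_biased, bartlett_lag_window, lag_window_energy, and blackman_tukey_psd are hypothetical, the lag window is fixed to the Bartlett window, and a direct DFT sum is used in place of an FFT to keep the example dependency-free.

```python
import cmath
import math


def autocorr_biased(x, L):
    """Step 1: biased autocorrelation estimate r(l), l = 0..L-1, as in (5.3.49)."""
    N = len(x)
    return [sum(x[n + l] * x[n].conjugate() for n in range(N - l)) / N
            for l in range(L)]


def bartlett_lag_window(L):
    """Triangular correlation window w_a(l) = 1 - l/L for l = 0..L-1."""
    return [1.0 - l / L for l in range(L)]


def lag_window_energy(w):
    """E_w of (5.3.45) for an even window given by its values at l = 0..L-1."""
    return w[0] ** 2 + 2.0 * sum(v * v for v in w[1:])


def blackman_tukey_psd(x, L, nfft):
    """Blackman-Tukey estimate at the frequencies 2*pi*k/nfft, k = 0..nfft-1."""
    if not (0 < L <= len(x) and nfft >= 2 * L - 1):
        raise ValueError("need 0 < L <= N and nfft >= 2L - 1")
    r = autocorr_biased(x, L)
    w = bartlett_lag_window(L)
    # Step 2: wrap-around sequence f(l) of (5.3.50); negative lags sit at the end.
    f = [0j] * nfft
    for l in range(L):
        f[l] = r[l] * w[l]
    for l in range(1, L):
        f[nfft - l] = (r[l] * w[l]).conjugate()
    # Step 3: DFT of f(l) as in (5.3.51), computed by a direct O(nfft^2) sum.
    spectrum = []
    for k in range(nfft):
        F = sum(f[l] * cmath.exp(-2j * math.pi * k * l / nfft) for l in range(nfft))
        spectrum.append(F.real)  # imaginary part vanishes by conjugate symmetry
    return spectrum
```

For a noiseless test signal x(n) = cos(0.5πn) with N = 64, L = 16, and nfft = 64, the estimate peaks at bin k = 16, that is, at ω = 0.5π. For the same L, lag_window_energy(bartlett_lag_window(16)) evaluates to 10.6875, so the variance ratio (5.3.46) would be about 0.17 for N = 64. For realistic record lengths the direct sum should be replaced by an FFT.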

5.3.3 Power Spectrum Estimation by Averaging Multiple Periodograms: The Welch-Bartlett Method

As mentioned in Section 5.3.1, in general, the variance of the average of $K$ IID random variables is $1/K$ times the variance of each of the random variables. Thus, to reduce the variance of the periodogram, we could average the periodograms from $K$ different realizations of a stationary random signal. However, in most practical applications, only a single realization is available. In this case, we can subdivide the existing record $\{x(n), 0 \le n \le N-1\}$ into $K$ (possibly overlapping) smaller segments as follows:

$$ x_i(n) = x(iD + n)\, w(n) \qquad 0 \le n \le L-1,\; 0 \le i \le K-1 \tag{5.3.53} $$

where $w(n)$ is a window of duration $L$ and $D$ is an offset distance. If $D < L$, the segments overlap; and for $D = L$, the segments are contiguous. The periodogram of the $i$th segment is

$$ \hat{R}_{x,i}(e^{j\omega}) \triangleq \frac{1}{L}\,|X_i(e^{j\omega})|^2 = \frac{1}{L}\left|\sum_{n=0}^{L-1} x_i(n)\, e^{-j\omega n}\right|^2 \tag{5.3.54} $$

We remind the reader that the window $w(n)$ in (5.3.53) is called a data window because it is applied directly to the data, in contrast to a correlation window that is applied to the autocorrelation sequence [see (5.3.34)]. Notice that there is no need for the data window to have an even shape or for its Fourier transform to be nonnegative. The purpose of using the data window is to control spectral leakage.

The spectrum estimate $\hat{R}_x^{(PA)}(e^{j\omega})$ is obtained by averaging $K$ periodograms as follows:

$$ \hat{R}_x^{(PA)}(e^{j\omega}) \triangleq \frac{1}{K}\sum_{i=0}^{K-1} \hat{R}_{x,i}(e^{j\omega}) = \frac{1}{KL}\sum_{i=0}^{K-1} |X_i(e^{j\omega})|^2 \tag{5.3.55} $$

where the superscript (PA) denotes periodogram averaging. To determine the bias and variance of $\hat{R}_x^{(PA)}(e^{j\omega})$, we let $D = L$ so that the segments do not overlap. The so-computed estimate $\hat{R}_x^{(PA)}(e^{j\omega})$ is known as the Bartlett estimate. We also assume that $r_x(l)$ is very small for $|l| > L$. This implies that the signal segments can be assumed to be approximately uncorrelated. To show that the simple periodogram averaging in Bartlett's method reduces the periodogram variance, we consider the following example.

[Figure 5.18 panels: B–T spectrum estimate overlay (left) and average (right) for L = 64, L = 128, and L = 256; power (dB) versus ω from 0 to π, with 0.35π, 0.4π, and 0.8π marked.]

FIGURE 5.18 Spectrum estimation of three sinusoids in white noise using the Blackman-Tukey method in Example 5.3.5.

EXAMPLE 5.3.6 (PERIODOGRAM AVERAGING). Let $x(n)$ be a stationary white Gaussian noise with zero mean and unit variance. The theoretical spectrum of $x(n)$ is

$$ R_x(e^{j\omega}) = \sigma_x^2 = 1 \qquad -\pi < \omega \le \pi $$

An ensemble of 50 different 512-point records of x(n) was generated using a pseudorandom number generator. The Bartlett estimate of each record was computed for K = 1 (i.e., the basic periodogram), K = 4 (or L = 128), and K = 8 (or L = 64). The results in the form of estimate overlays and averages are shown in Figure 5.19. The effect of periodogram averaging is clearly evident.
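The variance reduction seen in this example can be checked numerically with a small Monte Carlo experiment. The sketch below is an illustration under stated assumptions rather than a reproduction of Figure 5.19: it uses shorter records (N = 64), looks at the single frequency ω = π/2 only, and the function names are hypothetical. For white Gaussian noise and K nonoverlapping segments, the ratio of sample variances should come out near 1/K.

```python
import cmath
import random


def periodogram_at(segment, omega):
    """Periodogram (1/L)|sum x(n) exp(-j*omega*n)|^2 of one segment at one frequency."""
    L = len(segment)
    X = sum(v * cmath.exp(-1j * omega * n) for n, v in enumerate(segment))
    return abs(X) ** 2 / L


def bartlett_at(x, K, omega):
    """Average of K nonoverlapping segment periodograms (D = L = N // K)."""
    L = len(x) // K
    return sum(periodogram_at(x[i * L:(i + 1) * L], omega) for i in range(K)) / K


def sample_variance(values):
    """Unbiased sample variance."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def variance_ratio(K, N=64, trials=1000, omega=cmath.pi / 2, seed=0):
    """Sample variance of the K-segment Bartlett estimate over that of the periodogram."""
    rng = random.Random(seed)
    basic, averaged = [], []
    for _ in range(trials):
        x = [rng.gauss(0.0, 1.0) for _ in range(N)]  # unit-variance white noise
        basic.append(bartlett_at(x, 1, omega))
        averaged.append(bartlett_at(x, K, omega))
    return sample_variance(averaged) / sample_variance(basic)
```

Calling variance_ratio(4) gives a value in the neighborhood of 0.25; the exact figure depends on the seed and the number of trials.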

[Figure 5.19 panels: Bartlett estimate overlay (left) and average (right) for K = 1, K = 4, and K = 8; power (dB) versus ω from 0 to π.]

FIGURE 5.19 Spectral estimation of white noise using Bartlett's method in Example 5.3.6.

Mean of $\hat{R}_x^{(PA)}(e^{j\omega})$. The mean value of $\hat{R}_x^{(PA)}(e^{j\omega})$ is

$$ E\{\hat{R}_x^{(PA)}(e^{j\omega})\} = \frac{1}{K}\sum_{i=0}^{K-1} E\{\hat{R}_{x,i}(e^{j\omega})\} = E\{\hat{R}_x(e^{j\omega})\} \tag{5.3.56} $$

where we have assumed that $E\{\hat{R}_{x,i}(e^{j\omega})\} = E\{\hat{R}_x(e^{j\omega})\}$ because of the stationarity assumption. From (5.3.56) and (5.3.15), we have

$$ E\{\hat{R}_x^{(PA)}(e^{j\omega})\} = E\{\hat{R}_x(e^{j\omega})\} = \frac{1}{2\pi L}\int_{-\pi}^{\pi} R_x(e^{j\theta})\, R_w(e^{j(\omega-\theta)})\, d\theta \tag{5.3.57} $$

where $R_w(e^{j\omega})$ is the spectrum of the data window $w(n)$. Hence, $\hat{R}_x^{(PA)}(e^{j\omega})$ is a biased estimate of $R_x(e^{j\omega})$. However, if the data window is normalized such that

$$ \sum_{n=0}^{L-1} w^2(n) = L \tag{5.3.58} $$

the estimate $\hat{R}_x^{(PA)}(e^{j\omega})$ becomes asymptotically unbiased [see the discussion following equation (5.3.15)].

Variance of $\hat{R}_x^{(PA)}(e^{j\omega})$. The variance of $\hat{R}_x^{(PA)}(e^{j\omega})$ is

$$ \mathrm{var}\{\hat{R}_x^{(PA)}(e^{j\omega})\} = \frac{1}{K}\,\mathrm{var}\{\hat{R}_x(e^{j\omega})\} \tag{5.3.59} $$

(assuming segments are independent) or using (5.3.29) gives

$$ \mathrm{var}\{\hat{R}_x^{(PA)}(e^{j\omega})\} \approx \frac{1}{K}\, R_x^2(e^{j\omega}) \tag{5.3.60} $$

Clearly, as $K$ increases, the variance tends to zero. Thus, $\hat{R}_x^{(PA)}(e^{j\omega})$ provides an asymptotically unbiased and consistent estimate of $R_x(e^{j\omega})$. If $N$ is fixed and $N = KL$, we see that increasing $K$ to reduce the variance (or equivalently obtain a smoother estimate) results in a decrease in $L$, that is, a reduction in resolution (or equivalently an increase in bias).

When $w(n)$ in (5.3.53) is the rectangular window of duration $L$, the square of its Fourier transform is equal to the Fourier transform of the triangular sequence $w_T(l) \triangleq L - |l|$, $|l| < L$, which when combined with the $1/L$ factor in (5.3.57), results in the Bartlett window

$$ w_B(l) = \begin{cases} 1 - \dfrac{|l|}{L} & |l| < L \\ 0 & \text{elsewhere} \end{cases} \tag{5.3.61} $$

with

$$ W_B(e^{j\omega}) = \frac{1}{L}\left[\frac{\sin(\omega L/2)}{\sin(\omega/2)}\right]^2 \tag{5.3.62} $$

This special case of averaging multiple nonoverlapping periodograms was introduced by Bartlett (1953). The method has been extended to modified overlapping periodograms by Welch (1970), who has shown that the shape of the window does not affect the variance formula (5.3.59). Welch showed that overlapping the segments by 50 percent reduces the variance by about a factor of 2, owing to doubling the number of segments. More overlap does not result in additional reduction of variance because the data segments become less and less independent. Clearly, the nonoverlapping segments can be uncorrelated only for white noise signals. However, the data segments can be considered approximately uncorrelated if they do not have sharp spectral peaks or if their autocorrelations decay fast.

Thus, the variance reduction factor for the spectral estimator $\hat{R}_x^{(PA)}(e^{j\omega})$ is

$$ \frac{\mathrm{var}\{\hat{R}_x^{(PA)}(e^{j\omega})\}}{\mathrm{var}\{\hat{R}_x(e^{j\omega})\}} \approx \frac{1}{K} \qquad 0 < \omega < \pi \tag{5.3.63} $$

and is reduced by a factor of 2 for 50 percent overlap.

Confidence intervals. The $(1 - \alpha) \times 100$ percent confidence interval on a logarithmic scale may be shown to be (Jenkins and Watts 1968)

$$ \left(10\log \hat{R}_x^{(PA)}(e^{j\omega}) - 10\log\frac{\chi^2_{2K}(1-\alpha/2)}{2K},\;\; 10\log \hat{R}_x^{(PA)}(e^{j\omega}) + 10\log\frac{2K}{\chi^2_{2K}(\alpha/2)}\right) \tag{5.3.64} $$

where $\chi^2_{2K}$ is a chi-squared distribution with $2K$ degrees of freedom.

Computation of $\hat{R}_x^{(PA)}(e^{j\omega})$ using the DFT. In practice, to compute $\hat{R}_x^{(PA)}(e^{j\omega})$ at $L$ equally spaced frequencies $\omega_k = 2\pi k/L$, $0 \le k \le L-1$, the method of periodogram averaging can be easily and efficiently implemented by using the DFT as follows (we have assumed that $L$ is even):

1. Segment data $\{x(n)\}_0^{N-1}$ into $K$ segments of length $L$, each offset by $D$ samples, using

$$ \bar{x}_i(n) = x(iD + n) \qquad 0 \le i \le K-1,\; 0 \le n \le L-1 \tag{5.3.65} $$

If $D = L$, there is no overlap; and if $D = L/2$, the overlap is 50 percent.

2. Window each segment, using data window $w(n)$

$$ x_i(n) = \bar{x}_i(n)\, w(n) = x(iD + n)\, w(n) \qquad 0 \le i \le K-1,\; 0 \le n \le L-1 \tag{5.3.66} $$

3. Compute the $L$-point DFTs $\tilde{X}_i(k)$ of the segments $x_i(n)$, $0 \le i \le K-1$,

$$ \tilde{X}_i(k) = \sum_{n=0}^{L-1} x_i(n)\, e^{-j(2\pi/L)kn} \qquad 0 \le k \le L-1,\; 0 \le i \le K-1 \tag{5.3.67} $$

4. Accumulate the squares $|\tilde{X}_i(k)|^2$

$$ \tilde{S}(k) \triangleq \sum_{i=0}^{K-1} |\tilde{X}_i(k)|^2 \qquad 0 \le k \le L/2 \tag{5.3.68} $$

5. Finally, normalize by $KL$ to obtain the estimate $\hat{R}_x^{(PA)}(k)$:

$$ \hat{R}_x^{(PA)}(k) = \frac{1}{KL}\,\tilde{S}(k) \qquad 0 \le k \le L/2 \tag{5.3.69} $$

At this point we emphasize that the spectrum estimate $\hat{R}_x^{(PA)}(k)$ is always nonnegative. A pictorial description of this computational algorithm is shown in Figure 5.20. A more efficient way to compute $\hat{R}_x^{(PA)}(k)$ is examined in Problem 5.14.
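The five steps above translate almost line by line into code. The following sketch is offered as an illustration only: the names hamming and welch_psd are hypothetical, the segment DFT is a direct sum over L points rather than an FFT, and the Hamming window is rescaled to satisfy the normalization (5.3.58), which is an assumption of this sketch rather than something the algorithm above requires.

```python
import cmath
import math


def hamming(L):
    """Hamming data window, rescaled so that sum of w(n)^2 equals L as in (5.3.58)."""
    w = [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (L - 1)) for n in range(L)]
    scale = math.sqrt(L / sum(v * v for v in w))
    return [v * scale for v in w]


def welch_psd(x, L, D, window=None):
    """Averaged modified periodograms at omega_k = 2*pi*k/L, k = 0..L/2.

    Returns the estimate and the number of segments K.
    """
    N = len(x)
    if not (0 < D <= L <= N):
        raise ValueError("need 0 < D <= L <= N")
    w = window if window is not None else [1.0] * L  # rectangular: Bartlett's method
    K = (N - L) // D + 1                             # number of complete segments
    S = [0.0] * (L // 2 + 1)
    for i in range(K):
        # Steps 1 and 2: extract and window the i-th segment, (5.3.65)-(5.3.66).
        seg = [x[i * D + n] * w[n] for n in range(L)]
        for k in range(L // 2 + 1):
            # Step 3: L-point DFT of the segment, (5.3.67).
            X = sum(seg[n] * cmath.exp(-2j * math.pi * k * n / L) for n in range(L))
            # Step 4: accumulate squared magnitudes, (5.3.68).
            S[k] += abs(X) ** 2
    # Step 5: normalize by K*L, (5.3.69).
    return [s / (K * L) for s in S], K
```

For example, welch_psd(x, 16, 16) corresponds to Bartlett's method with contiguous segments, while welch_psd(x, 16, 8, hamming(16)) corresponds to Welch's method with 50 percent overlap. Because each term is a squared magnitude, the returned values are nonnegative by construction.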

[Figure 5.20: the signal data record, 0 to N − 1, is split into Segment 1, Segment 2, ..., Segment K of length L with offset D; an NFFT-point periodogram is computed for each segment (Periodogram 1, Periodogram 2, ..., Periodogram K), and the periodograms are averaged to give the PSD estimate.]

FIGURE 5.20 Pictorial description of the Welch-Bartlett method.

In Matlab the Welch-Bartlett method is implemented by using the function

    Rx = psd(x,Nfft,Fs,window(L),Noverlap,'none');

where window is the name of any Matlab-provided window function (e.g., hamming); Nfft is the size of the DFT, which is chosen to be larger than L to obtain a high-density spectrum; Fs is the sampling frequency, which is used for plotting purposes; and Noverlap specifies the number of overlapping samples. If the boxcar window is used along with Noverlap=0, then we obtain Bartlett's method of periodogram averaging. (Note that Noverlap is different from the offset parameter D given above.) If Noverlap=L/2 is used, then we obtain Welch's averaged periodogram method with 50 percent overlap.

A biased estimate $\hat{r}_x(l)$, $|l| < L$, of the autocorrelation sequence of $x(n)$ can be obtained by taking the inverse $N$-point DFT of $\hat{R}_x^{(PA)}(k)$ if $N \ge 2L - 1$. Since only samples of the continuous spectrum $\hat{R}_x^{(PA)}(e^{j\omega})$ are available, the obtained autocorrelation sequence $\hat{r}_x^{(PA)}(l)$ is an aliased version of the true autocorrelation $r_x(l)$ of the signal $x(n)$ (see Problem 5.15).

EXAMPLE 5.3.7 (BARTLETT'S METHOD). Consider again the spectrum estimation of three sinusoids in white noise given in Example 5.3.4, that is,

$$ x(n) = \cos(0.35\pi n + \phi_1) + \cos(0.4\pi n + \phi_2) + 0.25\cos(0.8\pi n + \phi_3) + \nu(n) \tag{5.3.70} $$

where $\phi_1$, $\phi_2$, and $\phi_3$ are jointly independent random variables uniformly distributed over $[-\pi, \pi]$ and $\nu(n)$ is a unit-variance white noise. An ensemble of 50 realizations of $x(n)$ was generated using $N = 512$. The Bartlett estimate of each realization was computed for $K = 1$ (i.e., the basic periodogram), $K = 4$ (or $L = 128$), and $K = 8$ (or $L = 64$). The results in the form of estimate overlays and averages are shown in Figure 5.21. Observe that the variance in the estimate has consistently reduced over the periodogram estimate as the number of averaging segments has increased. However, this reduction has come at the price of broadening of the spectral peaks. Since no window is used, the sidelobes are very prominent even for the $K = 8$ case. Thus confidence in the $\omega = 0.8\pi$ spectral line is not very high for the $K = 8$ case.

EXAMPLE 5.3.8 (WELCH'S METHOD). Consider Welch's method for the random process in the above example for $N = 512$, 50 percent overlap, and a Hamming window. Three different values for $L$ were considered: $L = 256$ (3 segments), $L = 128$ (7 segments), and $L = 64$ (15 segments). The estimate overlays and averages are shown in Figure 5.22. In comparing these results with those in Figure 5.21, note that the windowing has considerably reduced the spurious peaks in the spectra but has also further smoothed the peaks. Thus the peak at $0.8\pi$ is recognizable with high confidence, but the separation of the two close peaks is not so clear for $L = 64$. However, the $L = 128$ case provides the best balance between separation and detection. On comparing the Blackman-Tukey (Figure 5.18) and Welch estimates, we observe that the results are comparable in terms of variance reduction and smoothing aspects.
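The segment counts quoted in Examples 5.3.7 and 5.3.8 follow from the segmentation rule (5.3.53): segment i covers samples iD through iD + L − 1, so the largest admissible index satisfies iD + L − 1 ≤ N − 1, giving K = ⌊(N − L)/D⌋ + 1. A short check, using a hypothetical helper name segment_count:

```python
def segment_count(N, L, D):
    """Number of complete length-L segments with offset D in an N-sample record."""
    if not (0 < D <= L <= N):
        raise ValueError("need 0 < D <= L <= N")
    return (N - L) // D + 1


# Welch, 50 percent overlap (D = L/2), N = 512
welch_counts = [segment_count(512, L, L // 2) for L in (256, 128, 64)]
# Bartlett, no overlap (D = L), N = 512
bartlett_counts = [segment_count(512, L, L) for L in (512, 128, 64)]
```

Here welch_counts evaluates to [3, 7, 15] and bartlett_counts to [1, 4, 8], matching the values of K stated in the two examples.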

5.3.4 Some Practical Considerations and Examples

The periodogram and its modified version, which is the basic tool involved in the estimation of the power spectrum of stationary signals, can be computed either directly from the signal samples $\{x(n)\}_0^{N-1}$ using the DTFT formula

$$ \hat{R}_x(e^{j\omega}) = \frac{1}{N}\left|\sum_{n=0}^{N-1} w(n)\, x(n)\, e^{-j\omega n}\right|^2 \tag{5.3.71} $$

or indirectly using the autocorrelation sequence

$$ \hat{R}_x(e^{j\omega}) = \sum_{l=-(N-1)}^{N-1} \hat{r}_x(l)\, e^{-j\omega l} \tag{5.3.72} $$

where $\hat{r}_x(l)$ is the estimated autocorrelation of the windowed segment $\{w(n)x(n)\}_0^{N-1}$. The periodogram $\hat{R}_x(e^{j\omega})$ provides an unacceptable estimate of the power spectrum because

1. it has a bias that depends on the length $N$ and the shape of the data window $w(n)$ and
2. its variance is approximately equal to the square of the true spectrum, $R_x^2(e^{j\omega})$ [see (5.3.29)], regardless of $N$.

Given a data segment of fixed duration $N$, there is no way to reduce the bias, or equivalently to increase the resolution, because it depends on the length and the shape of the window. However, we can reduce the variance either by smoothing the single periodogram of the data (method of Blackman-Tukey) or by averaging multiple periodograms obtained by partitioning the available record into smaller overlapping segments (method of Bartlett-Welch).

The method of Blackman-Tukey is based on the following modification of the indirect periodogram formula

$$ \hat{R}_x^{(PS)}(e^{j\omega}) = \sum_{l=-(L-1)}^{L-1} \hat{r}_x(l)\, w_a(l)\, e^{-j\omega l} \tag{5.3.73} $$

which basically involves windowing of the estimated autocorrelation (5.2.1) with a proper correlation window. Using only the first $L \ll N$ more-reliable values of the autocorrelation sequence reduces the variance of the spectrum estimate by a factor of approximately $L/N$. However, at the same time, this reduces the resolution from about $1/N$ to about $1/L$. The recommended range for $L$ is between $0.1N$ and $0.2N$.

[Figure 5.21 panels: Bartlett estimate overlay (left) and average (right) for K = 1, K = 4, and K = 8; power (dB) versus ω from 0 to π, with 0.35π, 0.4π, and 0.8π marked.]

FIGURE 5.21 Estimation of three sinusoids in white noise using Bartlett's method in Example 5.3.7.

The method of Bartlett-Welch is based on partitioning the available data record into windowed overlapping segments of length $L$, computing their periodograms by using the direct formula (5.3.71), and then averaging the resulting periodograms to compute the estimate

$$ \hat{R}_x^{(PA)}(e^{j\omega}) = \frac{1}{KL}\sum_{i=0}^{K-1}\left|\sum_{n=0}^{L-1} x_i(n)\, e^{-j\omega n}\right|^2 \tag{5.3.74} $$

whose resolution is reduced to approximately $1/L$ and whose variance is reduced by a factor of about $1/K$, where $K$ is the number of segments.

The reduction in resolution and variance of the Blackman-Tukey estimate is achieved by "averaging" the values of the spectrum at consecutive frequency bins by windowing the estimated autocorrelation sequence. In the Bartlett-Welch method, the same effect is achieved by averaging the values of multiple shorter periodograms at the same frequency bin. The PSD estimation methods and their properties are summarized in Table 5.3. The multitaper spectrum estimation method given in the last column of Table 5.3 is discussed in Section 5.5.

[Figure 5.22 panels: Welch estimate overlay (left) and average (right) for L = 256, L = 128, and L = 64; power (dB) versus ω from 0 to π, with 0.35π, 0.4π, and 0.8π marked.]

FIGURE 5.22 Estimation of three sinusoids in white noise using Welch's method in Example 5.3.8.

TABLE 5.3 Comparison of PSD estimation methods.

Methods compared:
  (a) Periodogram: $\hat{R}_x(e^{j\omega})$
  (b) Single-periodogram smoothing (Blackman-Tukey): $\hat{R}_x^{(PS)}(e^{j\omega})$
  (c) Multiple-periodogram averaging (Bartlett-Welch): $\hat{R}_x^{(PA)}(e^{j\omega})$
  (d) Multitaper (Thomson): $\hat{R}_x^{(MT)}(e^{j\omega})$

Description of the method
  (a) Compute DFT of data record.
  (b) Compute DFT of windowed autocorrelation estimate (see Figure 5.17).
  (c) Split record into K segments and average their modified periodograms (see Figure 5.20).
  (d) Window data record using K orthonormal tapers and average their periodograms (see Figure 5.30).

Basic idea
  (a) Natural estimator of $R_x(e^{j\omega})$; the error $|r_x(l) - \hat{r}_x(l)|$ is large for large $|l|$.
  (b) Local smoothing of $\hat{R}_x(e^{j\omega})$ by weighting $\hat{r}_x(l)$ with a lag window $w_a(l)$.
  (c) Overlap data records to create more segments; window segments to reduce bias; average periodograms to reduce variance.
  (d) For properly designed orthogonal tapers, periodograms are independent at each frequency. Hence averaging reduces variance.

Bias
  (a) Severe for small N; negligible for large N.
  (b) Asymptotically unbiased.
  (c) Asymptotically unbiased.
  (d) Negligible for properly designed tapers.

Resolution
  (a) $\propto 1/N$
  (b) $\propto 1/L$, where L is the maximum lag
  (c) $\propto 1/L$, where L is the segment length
  (d) $\propto 1/N$

Variance
  (a) Unacceptable: about $R_x^2(e^{j\omega})$ for all N
  (b) $R_x^2(e^{j\omega}) \times E_w/N$
  (c) $R_x^2(e^{j\omega})/K$, where K is the number of segments
  (d) $R_x^2(e^{j\omega})/K$, where K is the number of tapers

EXAMPLE 5.3.9 (COMPARISON OF BLACKMAN-TUKEY AND WELCH-BARTLETT METHODS). Figure 5.23 illustrates the properties of the power spectrum estimators based on autocorrelation windowing and periodogram averaging using the AR(4) model (5.3.24). The top plots show the power spectrum of the process. The left column plots show the power spectrum obtained by windowing the data with a Hanning window and the autocorrelation with a Parzen window of length L = 64, 128, and 256. We notice that as the length of the window increases, the resolution improves and the variance increases. We see a similar behavior with the method of averaged periodograms as the segment length L increases from 64 to 256. Clearly, both methods give comparable results if their parameters are chosen properly.

Example of ocean wave data. To apply spectrum estimation techniques discussed in this chapter to real data, we will use two real-valued time series that are obtained by recording the height of ocean waves as a function of time, as measured by two wave gages of different designs. These two series are shown in Figure 5.24. The top graph shows the wire wave gage data while the bottom graph shows the infrared wave gage data. The frequency responses of these gages are such that, mainly because of their inertia, frequencies higher than 1 Hz cannot be reliably measured. The frequency range between 0.2 and 1 Hz is also important because the rate at which the spectrum decreases has a physical model associated with it. Both series were collected at a rate of 30 samples per second. There are 4096 samples in each series.† We will also use these data to study joint signal analysis in the next section.

† These data were collected by A. Jessup, Applied Physics Laboratory, University of Washington. They were obtained from StatLib, a statistical archive maintained by Carnegie Mellon University.

[Figure 5.23 panels: autocorrelation windowing (left column) and periodogram averaging (right column) for L = 64, L = 128, and L = 256, with the true power spectrum in the top plots; power (dB) versus frequency (cycles/sampling interval) from 0 to 0.5.]

FIGURE 5.23 Illustration of the properties of the power spectrum estimators using autocorrelation windowing (left column) and periodogram averaging (right column) in Example 5.3.9.

EXAMPLE 5.3.10 (ANALYSIS OF THE OCEAN WAVE DATA). Figure 5.25 depicts the periodogram averaging and smoothing estimates of the wire wave gage data. The top row of plots shows the Welch estimate using a Hamming window, L = 256, and 50 percent overlap between segments. The bottom row shows the Blackman-Tukey estimate using a Bartlett window and a lag length of L = 256. In both cases, a zoomed view of the plots between 0 and 1 Hz is shown in the right column to obtain a better view of the spectra. Both spectral estimates provide a similar spectral behavior, especially over the frequency range of 0 to 1 Hz. Furthermore, both show a broad, low-frequency peak at 0.13 Hz, corresponding to a period of about 8 s. The dominant features of the time series thus can be attributed to this peak and other features in the 0- to 0.2-Hz range. The shape of the spectrum between 0.2 and 1 Hz is a decaying exponential and is consistent with the physical model. Similar results were obtained for the infrared wave gage data.

[Figure 5.24 panels: wire wave gage (top) and infrared wave gage (bottom); wave height (m) from −1 to 1 versus time (s) up to about 140 s.]

FIGURE 5.24 Display of ocean wave data.

[Figure 5.25 panels: Welch estimate, L = 256, Hamming window (top row) and Blackman-Tukey estimate, L = 256, Bartlett window (bottom row); power (dB) versus frequency (Hz) from 0 to 15, each with a zoomed view from 0 to 1 Hz in which 0.13 Hz is marked.]

FIGURE 5.25 Spectrum estimation of the ocean wave data using the Welch and Blackman-Tukey methods.

5.4 JOINT SIGNAL ANALYSIS

Until now, we discussed estimation techniques for the computation of the power spectrum of one random process $x(n)$, which is also known as univariate spectral estimation. In many practical applications, we have two jointly stationary random processes and we wish to study the correlation between them. The analysis and computation of this correlation and the associated spectral quantities are similar to those of univariate estimation and are called bivariate spectrum estimation. In this section, we provide a brief overview of this joint signal analysis.

Let $x(n)$ and $y(n)$ be two zero-mean, jointly stationary random processes with power spectra $R_x(e^{j\omega})$ and $R_y(e^{j\omega})$, respectively. Then from (3.3.61), the cross-power spectral density of $x(n)$ and $y(n)$ is given by

$$ R_{xy}(e^{j\omega}) = \sum_{l=-\infty}^{\infty} r_{xy}(l)\, e^{-j\omega l} \tag{5.4.1} $$

where rxy (l) is the cross-correlation sequence between x(n) and y(n). The cross-spectral density Rxy (ej ω ) is, in general, a complex-valued function that is difficult to interpret or plot in its complex form. Therefore, we need to express it by using real-valued functions that are easier to deal with. It is customary to express the conjugate of Rxy (ej ω ) in terms of its real and imaginary components, that is, Rxy (ej ω ) = Cxy (ω) − j Qxy (ω)

(5.4.2)

Cxy (ω) Re [Rxy (ej ω )]

(5.4.3)

∗ (ej ω )] = − Im[Rxy (ej ω )] Qxy (ω) Im [Rxy

(5.4.4)

where is called the cospectrum and

is called the quadrature spectrum. Alternately, the most popular approach is to express Rxy (ej ω ) in terms of its magnitude and angle components, that is,

where and

Rxy (ej ω ) = Axy (ω) exp[j ;xy (ω)] 2 (ω) + Q2 (ω) Axy (ω) = |Rxy (ej ω )| = Cxy xy

(5.4.6)

;xy (ω) = Rxy (ej ω ) = tan−1 {−Qxy (ω)/Cxy (ω)}

(5.4.7)

(5.4.5)

The magnitude $A_{xy}(\omega)$ is called the cross-amplitude spectrum, and the angle $\Phi_{xy}(\omega)$ is called the phase spectrum. All these derived functions are real-valued and hence can be examined graphically. However, the phase spectrum has the 2π ambiguity in its computation, which makes its interpretation somewhat problematic. From (3.3.64) the normalized cross-spectrum, called the complex coherence function, is given by

$$G_{xy}(\omega) = \frac{R_{xy}(e^{j\omega})}{\sqrt{R_x(e^{j\omega})\,R_y(e^{j\omega})}} \tag{5.4.8}$$

which is a complex-valued frequency-domain correlation coefficient that measures the correlation between the random amplitudes of the complex exponentials with frequency ω in the spectral representations of x(n) and y(n). Hence to interpret this coefficient, its magnitude $|G_{xy}(\omega)|$ is computed, which is referred to as the coherency spectrum. Recall that in Chapter 3, we called $|G_{xy}(\omega)|^2$ the magnitude-squared coherence (MSC). Clearly, $0 \le |G_{xy}(\omega)| \le 1$. Since the coherency spectrum captures the amplitude spectrum but completely ignores the phase spectrum, in practice, the coherency and the phase spectrum are useful real-valued summaries of the cross-spectrum.
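The definitions (5.4.2) through (5.4.8) can be illustrated with a minimal Python sketch that splits a single complex cross-spectrum value into its real-valued summaries. The function names and the numerical values are hypothetical, chosen only for illustration; the phase is computed with a two-argument arctangent so that the quadrant is resolved, which is one common way of handling part of the ambiguity mentioned above.

```python
import cmath
import math

def cross_spectrum_components(R_xy):
    """Real-valued summaries of one cross-spectrum value, per (5.4.2)-(5.4.7)."""
    C = R_xy.real              # cospectrum C_xy, (5.4.3)
    Q = -R_xy.imag             # quadrature spectrum Q_xy, (5.4.4)
    A = math.hypot(C, Q)       # cross-amplitude spectrum A_xy, (5.4.6)
    Phi = math.atan2(-Q, C)    # phase spectrum, (5.4.7), resolved to (-pi, pi]
    return C, Q, A, Phi

def complex_coherence(R_xy, R_x, R_y):
    """Complex coherence (5.4.8); R_x and R_y must be positive."""
    return R_xy / math.sqrt(R_x * R_y)

# Hypothetical spectral values at a single frequency (illustration only).
R_xy, R_x, R_y = 3 - 4j, 10.0, 10.0
C, Q, A, Phi = cross_spectrum_components(R_xy)
G = complex_coherence(R_xy, R_x, R_y)
print(C, Q, A, Phi, abs(G))
```

Rebuilding the cross-spectrum as A exp(jΦ) returns the original complex value, which is a quick consistency check on the sign convention of (5.4.2).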

5.4.1 Estimation of Cross Power Spectrum

Now we apply the techniques developed in Section 5.3 to the problem of estimating the cross-spectrum and its associated real-valued functions. Let $\{x(n), y(n)\}_0^{N-1}$ be the data record available for estimation. By using the periodogram (5.3.5) as a guide, the estimator for $R_{xy}(e^{j\omega})$ is the cross-periodogram given by

$$\hat R_{xy}(e^{j\omega}) = \sum_{l=-(N-1)}^{N-1} \hat r_{xy}(l)\, e^{-j\omega l} \tag{5.4.9}$$

where

$$\hat r_{xy}(l) = \begin{cases} \dfrac{1}{N}\displaystyle\sum_{n=0}^{N-l-1} x(n+l)\,y^*(n) & 0 \le l \le N-1 \\[2ex] \dfrac{1}{N}\displaystyle\sum_{n=0}^{N+l-1} x(n)\,y^*(n-l) & -(N-1) \le l \le -1 \\[2ex] 0 & l \le -N \text{ or } l \ge N \end{cases} \tag{5.4.10}$$

In analogy to (5.3.2), the cross-periodogram can also be written as

$$\hat R_{xy}(e^{j\omega}) = \frac{1}{N}\left[\sum_{n=0}^{N-1} x(n)e^{-j\omega n}\right]\left[\sum_{n=0}^{N-1} y(n)e^{-j\omega n}\right]^* \tag{5.4.11}$$
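The equivalence of the correlation form (5.4.9)-(5.4.10) and the product form (5.4.11) can be checked numerically. The following is a small sketch in plain Python; the function names and the short test sequences are illustrative assumptions, not part of the text.

```python
import cmath

def xcorr_biased(x, y, l):
    """Biased cross-correlation estimate of (5.4.10) at lag l."""
    N = len(x)
    if l >= N or l <= -N:
        return 0j
    if l >= 0:
        s = sum(x[n + l] * complex(y[n]).conjugate() for n in range(N - l))
    else:
        s = sum(x[n] * complex(y[n - l]).conjugate() for n in range(N + l))
    return s / N

def cross_periodogram_from_corr(x, y, omega):
    """Cross-periodogram as the DTFT of the correlation estimate, (5.4.9)."""
    N = len(x)
    return sum(xcorr_biased(x, y, l) * cmath.exp(-1j * omega * l)
               for l in range(-(N - 1), N))

def dtft(x, omega):
    """DTFT of a finite sequence evaluated at one frequency."""
    return sum(v * cmath.exp(-1j * omega * n) for n, v in enumerate(x))

def cross_periodogram(x, y, omega):
    """Cross-periodogram in product form, (5.4.11)."""
    return dtft(x, omega) * dtft(y, omega).conjugate() / len(x)

# Hypothetical short records, used only to compare the two forms.
x = [1.0, 2.0, -1.0, 0.5]
y = [0.5, -1.0, 2.0, 1.0]
omega = 0.7
a = cross_periodogram_from_corr(x, y, omega)
b = cross_periodogram(x, y, omega)
print(a, b)
```

The two values agree to rounding error, and the product form needs only two transforms instead of 2N - 1 correlation lags.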

Once again, it can be shown that the bias and variance properties of the cross-periodogram are as poor as those of the periodogram. Another disturbing result of these periodograms is that from (5.4.11) and (5.3.2), we obtain

$$|\hat R_{xy}(e^{j\omega})|^2 = \frac{1}{N^2}\left|\sum_{n=0}^{N-1} x(n)e^{-j\omega n}\right|^2 \left|\sum_{n=0}^{N-1} y(n)e^{-j\omega n}\right|^2 = \hat R_x(e^{j\omega})\,\hat R_y(e^{j\omega})$$

which implies that if we estimate the MSC from the “raw” autoperiodograms as well as cross-periodograms, then the result is always unity for all frequencies. This seemingly unreasonable result is due to the fact that the frequency-domain correlation coefficient at each frequency ω is estimated by using only one single pair of observations from the two signals. Therefore, a reasonable amount of smoothing in the periodogram is necessary to reduce the inherent variability of the cross-spectrum and to improve the accuracy of the estimated coherency. This variance reduction can be achieved by straightforward extensions of various techniques discussed in Section 5.3 for power spectra. These methods include periodogram smoothing across frequencies and the various modified periodogram averaging techniques. In practice, Welch’s approach to modified periodogram averaging, based on overlapped segments, is preferred owing to its superior performance.

For illustration purposes, we describe Welch’s approach in a brief fashion. In this approach, we subdivide the existing data records {x(n), y(n); 0 ≤ n ≤ N − 1} into K overlapping smaller segments of length L as follows:

$$x_i(n) = x(iD+n)\,w(n) \qquad y_i(n) = y(iD+n)\,w(n) \qquad 0 \le n \le L-1,\; 0 \le i \le K-1 \tag{5.4.12}$$

where w(n) is a data window of length L and D = L/2 for 50 percent overlap. The cross-periodogram of the ith segment is given by

$$\hat R_i(e^{j\omega}) = \frac{1}{L} X_i(e^{j\omega})\,Y_i^*(e^{j\omega}) = \frac{1}{L}\left[\sum_{n=0}^{L-1} x_i(n)e^{-j\omega n}\right]\left[\sum_{n=0}^{L-1} y_i(n)e^{-j\omega n}\right]^* \tag{5.4.13}$$

Finally, the smoothed cross-spectrum $\hat R^{(PA)}_{xy}(e^{j\omega})$ is obtained by averaging K cross-periodograms as follows:

$$\hat R^{(PA)}_{xy}(e^{j\omega}) = \frac{1}{K}\sum_{i=0}^{K-1} \hat R_i(e^{j\omega}) = \frac{1}{KL}\sum_{i=0}^{K-1} X_i(e^{j\omega})\,Y_i^*(e^{j\omega}) \tag{5.4.14}$$
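The effect of segment averaging on the MSC can be seen in a short experiment. The sketch below, in plain Python, implements (5.4.12) through (5.4.14) at a single frequency and compares the MSC of two independent white noise records computed from one segment (K = 1, the “raw” case) and from 15 overlapping segments. The Hamming window, the record length, the segment length, and the random seed are all assumptions made for the illustration.

```python
import cmath
import math
import random

def dtft(x, omega):
    """DTFT of a finite sequence evaluated at one frequency."""
    return sum(v * cmath.exp(-1j * omega * n) for n, v in enumerate(x))

def hamming(L):
    """Hamming data window of length L (an assumed choice of w(n))."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (L - 1)) for n in range(L)]

def welch_cross_spectrum(x, y, L, omega):
    """Averaged cross-periodogram (5.4.14) at one frequency, with D = L/2."""
    w = hamming(L)
    D = L // 2
    K = (len(x) - L) // D + 1                          # number of segments
    acc = 0j
    for i in range(K):
        xi = [x[i * D + n] * w[n] for n in range(L)]   # segments of (5.4.12)
        yi = [y[i * D + n] * w[n] for n in range(L)]
        acc += dtft(xi, omega) * dtft(yi, omega).conjugate()
    return acc / (K * L)

def msc(x, y, L, omega):
    """Magnitude-squared coherence from averaged auto- and cross-spectra."""
    Rxy = welch_cross_spectrum(x, y, L, omega)
    Rx = welch_cross_spectrum(x, x, L, omega).real
    Ry = welch_cross_spectrum(y, y, L, omega).real
    return abs(Rxy) ** 2 / (Rx * Ry)

random.seed(0)
N = 256
x = [random.gauss(0.0, 1.0) for _ in range(N)]
y = [random.gauss(0.0, 1.0) for _ in range(N)]   # independent of x
omega = 0.3 * math.pi

msc_raw = msc(x, y, N, omega)    # one segment: always unity
msc_avg = msc(x, y, 32, omega)   # 15 overlapping segments: below unity
print(msc_raw, msc_avg)
```

With a single segment the MSC equals one even though the two records are independent; only the averaged estimate can fall below one.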

Similar to (5.3.51), the DFT computation of $\hat R^{(PA)}_{xy}(e^{j\omega})$ is given by

$$\hat R^{(PA)}_{xy}(k) = \frac{1}{KL}\sum_{i=0}^{K-1}\left[\sum_{n=0}^{L-1} x_i(n)e^{-j2\pi kn/N}\right]\left[\sum_{n=0}^{L-1} y_i(n)e^{-j2\pi kn/N}\right]^* \tag{5.4.15}$$

where $0 \le k \le N-1$, $N > L$.

Estimation of cospectra and quadrature spectra. Once the cross-spectrum $R_{xy}(e^{j\omega})$ has been estimated, we can compute the estimates of all the associated real-valued spectra by replacing $R_{xy}(e^{j\omega})$ with its estimate $\hat R^{(PA)}_{xy}(e^{j\omega})$ in the definitions of these functions. To estimate the cospectrum, we use

$$\hat C^{(PA)}_{xy}(\omega) = \operatorname{Re}[\hat R^{(PA)}_{xy}(e^{j\omega})] = \operatorname{Re}\left[\frac{1}{KL}\sum_{i=0}^{K-1} X_i(e^{j\omega})\,Y_i^*(e^{j\omega})\right] \tag{5.4.16}$$

and to estimate the quadrature spectrum, we use

$$\hat Q^{(PA)}_{xy}(\omega) = -\operatorname{Im}[\hat R^{(PA)}_{xy}(e^{j\omega})] = -\operatorname{Im}\left[\frac{1}{KL}\sum_{i=0}^{K-1} X_i(e^{j\omega})\,Y_i^*(e^{j\omega})\right] \tag{5.4.17}$$

The analyses of bias, variance, and covariance of these estimates are similar in complexity to those of the autocorrelation spectral estimates, and the details can be found in Goodman (1957) and Jenkins and Watts (1968).

Estimation of cross-amplitude and phase spectra. Following the definitions in (5.4.6) and (5.4.7), we may estimate the cross-amplitude spectrum $A_{xy}(\omega)$ and the phase spectrum $\Phi_{xy}(\omega)$ between the random processes x(n) and y(n) by

$$\hat A^{(PA)}_{xy}(\omega) = \sqrt{[\hat C^{(PA)}_{xy}(\omega)]^2 + [\hat Q^{(PA)}_{xy}(\omega)]^2} \tag{5.4.18}$$

and

$$\hat\Phi^{(PA)}_{xy}(\omega) = \tan^{-1}\{-\hat Q^{(PA)}_{xy}(\omega)/\hat C^{(PA)}_{xy}(\omega)\} \tag{5.4.19}$$

where the estimates $\hat C^{(PA)}_{xy}(\omega)$ and $\hat Q^{(PA)}_{xy}(\omega)$ are given by (5.4.16) and (5.4.17), respectively. Since the cross-amplitude and phase spectral estimates are nonlinear functions of the cospectral and quadrature spectral estimates, their analysis in terms of bias, variance, and covariance is much more complicated. Once again, the details are available in Jenkins and Watts (1968).

Estimation of coherency spectrum. The coherency spectrum is given by the magnitude of the complex coherence $G_{xy}(\omega)$. Replacing $R_{xy}(e^{j\omega})$, $R_x(e^{j\omega})$, and $R_y(e^{j\omega})$ by their estimates in (5.4.8), we see the estimate for the coherency spectrum is given by

$$|\hat G^{(PA)}_{xy}(\omega)| = \frac{|\hat R^{(PA)}_{xy}(e^{j\omega})|}{\sqrt{\hat R^{(PA)}_x(e^{j\omega})\,\hat R^{(PA)}_y(e^{j\omega})}} = \frac{\{[\hat C^{(PA)}_{xy}(\omega)]^2 + [\hat Q^{(PA)}_{xy}(\omega)]^2\}^{1/2}}{\sqrt{\hat R^{(PA)}_x(e^{j\omega})\,\hat R^{(PA)}_y(e^{j\omega})}} \tag{5.4.20}$$

with bias and variance properties similar to those of the cross-amplitude spectrum. In Matlab the function

Rxy=csd(x,y,Nfft,Fs,window(L),Noverlap);

is available, which is similar to the psd function described in Section 5.3.3. It estimates the cross-spectral density of signal vectors x and y by using Welch’s method. The window parameter specifies a window function, Fs is the sampling frequency for plotting purposes, Nfft is the size of the FFT used, and Noverlap specifies the number of overlapping samples. The function

cohere(x,y,Nfft,Fs,window(L),Noverlap);

estimates the coherency spectrum between two vectors x and y. Its values are between 0 and 1.

5.4.2 Estimation of Frequency Response Functions

When random processes x(n) and y(n) are the input and output of some physical system, the bivariate spectral estimation techniques discussed in this section can be used to estimate the system characteristics, namely, its frequency response. Problems of this kind arise in many applications including communications, industrial control, and biomedical signal processing. In communications applications, we need to characterize a channel over which signals are transmitted. In this situation, a known training signal is transmitted, and the channel response is recorded. By using the statistics of these two signals, it is possible to estimate channel characteristics within a reasonable accuracy. In industrial applications such as a gas furnace, the classical methods using step (or sinusoidal) inputs may be inappropriate because of large disturbances generated within the system. Hence, it is necessary to use statistical methods that take into account noise generated in the system.

From Chapter 3, we know that if x(n) and y(n) are input and output signals of an LTI system characterized by the impulse response h(n), then

$$y(n) = h(n) * x(n) \tag{5.4.21}$$

The impulse response h(n), in principle, can be computed through the deconvolution operation. However, deconvolution is not always computationally feasible. If the input and output processes are jointly stationary, then from Chapter 3 we know that the cross-correlation between these two processes is given by

$$r_{yx}(l) = h(l) * r_x(l) \tag{5.4.22}$$

and the cross-spectrum is given by

$$R_{yx}(e^{j\omega}) = H(e^{j\omega})\,R_x(e^{j\omega}) \tag{5.4.23}$$

or

$$H(e^{j\omega}) = \frac{R_{yx}(e^{j\omega})}{R_x(e^{j\omega})} \tag{5.4.24}$$

Hence, if we can estimate the auto power spectrum and cross power spectrum with reasonable accuracy, then we can determine the frequency response of the system.†

Consider next an LTI system with additive output noise, as shown in Figure 5.26. This model situation applies to many practical problems where the input measurements x(n) are essentially without noise while the output measurements y(n) can be modeled by the sum of the ideal response $y_o(n)$ due to x(n) and an additive noise v(n), which is statistically independent of x(n).

FIGURE 5.26 Input-output LTI system model with output noise. (The input x(n) drives h(n) to produce $y_o(n)$; the noise v(n) is added to $y_o(n)$ to give y(n).)

† More general situations involving both additive input noise and additive output noise are discussed in Bendat and Piersol (1980).

If we observe the input x(n) and the ideal output $y_o(n)$, the frequency response can be obtained by

$$H(e^{j\omega}) = \frac{R_{y_o x}(e^{j\omega})}{R_x(e^{j\omega})} \tag{5.4.25}$$

where all signals are assumed stationary with zero mean (see Section 5.3.1). Since x(n) and v(n) are independent, we can easily show that

$$R_{y_o x}(e^{j\omega}) = R_{yx}(e^{j\omega}) \tag{5.4.26}$$

and

$$R_y(e^{j\omega}) = R_{y_o}(e^{j\omega}) + R_v(e^{j\omega}) \tag{5.4.27}$$

where

$$R_{y_o}(e^{j\omega}) = |H(e^{j\omega})|^2 R_x(e^{j\omega}) \tag{5.4.28}$$

is the ideal output PSD produced by the input. From (5.4.25) and (5.4.26), we have

$$H(e^{j\omega}) = \frac{R_{yx}(e^{j\omega})}{R_x(e^{j\omega})} \tag{5.4.29}$$

which shows that we can determine the frequency response by using the cross power spectral density between the noisy output and the input signals. Given a finite record of input-output data $\{x(n), y(n)\}_0^{N-1}$, we estimate $\hat R^{(PA)}_{yx}(e^{j\omega_k})$ and $\hat R^{(PA)}_x(e^{j\omega_k})$ by using one of the previously discussed methods and then estimate $H(e^{j\omega})$ at a set of equidistant frequencies $\{\omega_k = 2\pi k/K\}_0^{K-1}$, that is,

$$\hat H(e^{j\omega_k}) = \frac{\hat R^{(PA)}_{yx}(e^{j\omega_k})}{\hat R^{(PA)}_x(e^{j\omega_k})} \tag{5.4.30}$$
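As a rough illustration of (5.4.30), the following plain-Python sketch estimates the frequency response of a short FIR system from a white noise input and a noisy output, using Welch-type averaged spectra at a few frequencies. The impulse response h = [1, 0.5], the noise level, the Hamming window, the segment length, and the seed are hypothetical choices made only for this example.

```python
import cmath
import math
import random

def dtft(x, omega):
    """DTFT of a finite sequence evaluated at one frequency."""
    return sum(v * cmath.exp(-1j * omega * n) for n, v in enumerate(x))

def hamming(L):
    """Hamming data window of length L (an assumed choice)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (L - 1)) for n in range(L)]

def welch_pair(x, y, L, omega):
    """Averaged estimates of R_yx and R_x at one frequency, 50 percent overlap."""
    w = hamming(L)
    D = L // 2
    K = (len(x) - L) // D + 1
    Ryx, Rx = 0j, 0.0
    for i in range(K):
        X = dtft([x[i * D + n] * w[n] for n in range(L)], omega)
        Y = dtft([y[i * D + n] * w[n] for n in range(L)], omega)
        Ryx += Y * X.conjugate()     # cross-periodogram of output and input
        Rx += abs(X) ** 2            # periodogram of the input
    return Ryx / (K * L), Rx / (K * L)

random.seed(1)
N = 2048
h = [1.0, 0.5]                                      # hypothetical FIR system
x = [random.gauss(0.0, 1.0) for _ in range(N)]      # white Gaussian input
v = [random.gauss(0.0, 0.2) for _ in range(N)]      # additive output noise
y = [sum(h[m] * x[n - m] for m in range(len(h)) if n - m >= 0) + v[n]
     for n in range(N)]

errors = []
for omega in (0.25 * math.pi, 0.5 * math.pi, 0.75 * math.pi):
    Ryx, Rx = welch_pair(x, y, 64, omega)
    H_hat = Ryx / Rx                                 # estimate (5.4.30)
    H_true = sum(hm * cmath.exp(-1j * omega * m) for m, hm in enumerate(h))
    errors.append(abs(H_hat - H_true))
    print(omega, H_hat, H_true)
```

Because the output noise is independent of the input, its contribution to the averaged cross-periodogram shrinks as more segments are averaged, so the estimate stays close to the true response.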

The coherence function, which measures the linear correlation between two signals x(n) and y(n) in the frequency domain, is given by

$$G^2_{xy}(\omega) = \frac{|R_{xy}(e^{j\omega})|^2}{R_x(e^{j\omega})\,R_y(e^{j\omega})} \tag{5.4.31}$$

and satisfies the inequality $0 \le G^2_{xy}(\omega) \le 1$ (see Section 3.3.6). If $R_{xy}(e^{j\omega}) = 0$ for all ω, then $G^2_{xy}(\omega) = 0$. On the other hand, if y(n) = h(n) ∗ x(n), then $G^2_{xy}(\omega) = 1$ because $R_y(e^{j\omega}) = |H(e^{j\omega})|^2 R_x(e^{j\omega})$ and $R_{xy}(e^{j\omega}) = H^*(e^{j\omega})R_x(e^{j\omega})$. Furthermore, we can show that the coherence function is invariant under linear transformations. Indeed, if $x_1(n) = h_1(n) * x(n)$ and $y_1(n) = h_2(n) * y(n)$, then $G^2_{xy}(\omega) = G^2_{x_1 y_1}(\omega)$ (see Problem 5.16). To avoid delta function behavior at ω = 0, we should remove the mean value from the data before we compute $G^2_{xy}(\omega)$. Also, we need $R_x(e^{j\omega})R_y(e^{j\omega}) > 0$ to avoid division by 0.

In practice, the coherence function is usually greater than 0 and less than 1. This may result from one or more of the following reasons (Bendat and Piersol 1980):

1. Excessive measurement noise.
2. Significant resolution bias in the spectral estimates.
3. The system relating y(n) to x(n) is nonlinear.
4. The output y(n) is not produced exclusively by the input x(n).

Using (5.4.28), (5.4.25), $R_{xy_o}(e^{j\omega}) = H^*(e^{j\omega})R_x(e^{j\omega})$, and (5.4.31), we obtain

$$R_{y_o}(e^{j\omega}) = G^2_{xy}(\omega)\,R_y(e^{j\omega}) \tag{5.4.32}$$

which is known as the coherent output PSD. Combining the last equation with (5.4.27), we have

$$R_v(e^{j\omega}) = [1 - G^2_{xy}(\omega)]\,R_y(e^{j\omega}) \tag{5.4.33}$$

which can be interpreted as the part of the output PSD that cannot be produced from the input by using linear operations.
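The steps leading to (5.4.32) and (5.4.33) can be spelled out as follows. This is a sketch of the intermediate algebra under the assumptions already made in the text (v(n) independent of x(n), and $R_x(e^{j\omega})R_y(e^{j\omega}) > 0$); it also uses the standard symmetry $R_{yx}(e^{j\omega}) = R^*_{xy}(e^{j\omega})$.

```latex
\begin{align*}
R_{y_o}(e^{j\omega})
  &= |H(e^{j\omega})|^2 R_x(e^{j\omega})
     && \text{by (5.4.28)} \\
  &= \frac{|R_{y_o x}(e^{j\omega})|^2}{R_x(e^{j\omega})}
     && \text{by (5.4.25)} \\
  &= \frac{|R_{xy}(e^{j\omega})|^2}{R_x(e^{j\omega})}
     && \text{since } R_{y_o x} = R_{yx} = R^*_{xy} \text{ by (5.4.26)} \\
  &= G^2_{xy}(\omega)\, R_y(e^{j\omega})
     && \text{by (5.4.31)} \\[1ex]
R_v(e^{j\omega})
  &= R_y(e^{j\omega}) - R_{y_o}(e^{j\omega})
   = [1 - G^2_{xy}(\omega)]\, R_y(e^{j\omega})
     && \text{by (5.4.27)}
\end{align*}
```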

Substitution of (5.4.27) into (5.4.32) results in

$$G^2_{xy}(\omega) = 1 - \frac{R_v(e^{j\omega})}{R_y(e^{j\omega})} \tag{5.4.34}$$

which shows that $G^2_{xy}(\omega) \to 1$ as $R_v(e^{j\omega})/R_y(e^{j\omega}) \to 0$ and $G^2_{xy}(\omega) \to 0$ as $R_v(e^{j\omega})/R_y(e^{j\omega}) \to 1$. Typically, the coherence function between input and output measurements reveals the presence of errors and helps to identify their origin and magnitude. Therefore, the coherence function provides a useful tool for evaluating the accuracy of frequency response estimates. In Matlab the function

H = tfe(x,y,Nfft,Fs,window(L),Noverlap)

is available that estimates the transfer function of the system with input signal x and output y using Welch’s method. The window parameter specifies a window function, Fs is the sampling frequency for plotting purposes, Nfft is the size of the FFT used, and Noverlap specifies the number of overlapping samples.

We next provide two examples that illustrate some of the problems that may arise when we estimate frequency response functions by using input and output measurements.

EXAMPLE 5.4.1. Consider the AP(4) system

$$H(z) = \frac{1}{1 - 2.7607z^{-1} + 3.8106z^{-2} - 2.6535z^{-3} + 0.9238z^{-4}}$$

discussed in Example 5.3.2. The input is white Gaussian noise, and the output of this system is corrupted by additive white Gaussian noise, as shown in Figure 5.27. We wish to estimate the frequency response of the system from a set of measurements $\{x(n), y(n)\}_0^{N-1}$. Since the input is white, when the output signal-to-noise ratio (SNR) is very high, we can estimate the magnitude response of the system by computing the PSD of the output signal. However, to compute the phase response or a more accurate estimate of the magnitude response, we should use the joint measurements of the input and output signals, as explained above. Figure 5.27 shows estimates of the MSC function, magnitude response functions (in linear and log scales), and phase response functions for two different levels of output SNR: 32 and 0 dB. When SNR = 32 dB, we note that $|G_{xy}(\omega)|$ is near unity at almost all frequencies, as we theoretically expect for ideal LTI input-output relations. The estimated magnitude and phase responses are almost identical to the theoretical ones except at the two sharp peaks of $|H(e^{j\omega})|$. Since the SNR is high, the two notches in $|G_{xy}(\omega)|$ at the same frequencies suggest a bias error due to the lack of sufficient frequency resolution. When SNR = 0 dB, we see that $|G_{xy}(\omega)|$ falls very sharply for frequencies above 0.2 cycle per sampling interval. We notice that the presence of noise increases the random errors in the estimates of magnitude and phase response in this frequency region, and the bias error in the peaks of the magnitude response. Finally, we note that the uncertainty fluctuations in $|G_{xy}(\omega)|$ increase as $|G_{xy}(\omega)| \to 0$, as predicted by the formula

$$\frac{\operatorname{std}[|\hat G^{(PA)}_{xy}(\omega)|]}{|G_{xy}(\omega)|} = \sqrt{2}\,\frac{1 - |\hat G^{(PA)}_{xy}(\omega)|^2}{|G_{xy}(\omega)|\sqrt{K}} \tag{5.4.35}$$

where std(·) means standard deviation and K is the number of averaged segments (Bendat and Piersol 1980).

FIGURE 5.27 Estimated coherence, magnitude response, and phase response for the AP(4) system. The solid lines show the ideal magnitude and phase responses. (Panels, each plotted against frequency f = F/Fs: coherence, with curves labeled SNR = 32 dB and SNR = 0 dB; magnitude (linear); magnitude (log); phase response.)

EXAMPLE 5.4.2. In this example we illustrate the use of frequency response estimation to study the effect of respiration and blood pressure on heart rate. Figure 5.28 shows the systolic blood pressure (mmHg), heart rate (beats per minute), and the respiration (mL) signals with their corresponding PSD functions (Grossman 1998). The sampling frequency is Fs = 5 Hz, and the PSDs were estimated using the method of averaged periodograms with 50 percent overlap. Note the corresponding quasiperiodic oscillations of blood pressure and heart rate occurring approximately every 12 s (0.08 Hz). Close inspection of the heart rate time series will also reveal another rhythm corresponding to the respiratory period (about 4.3 s, or 0.23 Hz). These rhythms reflect nervous system mechanisms that control the activity of the heart and the circulation under most circumstances.

FIGURE 5.28 Continuous systolic blood pressure (SBP), heart rate (HR), and respiration of a young man during quiet upright tilt and their estimated PSDs. (Time series plotted against time in seconds; PSDs plotted against frequency in Hz.)

The left column of Figure 5.29 shows the coherence, magnitude response, and phase response between respiration as input and heart rate as output. Heart rate fluctuates clearly at the respiratory frequency (here at 0.23 Hz); this is indicated by the large amount of heart rate power and the high degree of coherence at the respiratory frequency. Heart function is largely controlled by two branches of the autonomic nervous system, the parasympathetic and sympathetic. Frequency analysis of cardiovascular signals may improve our understanding of the manner in which these two branches interact under varied circumstances. Heart rate fluctuations at the respiratory frequency (termed respiratory sinus arrhythmia) are primarily mediated by the parasympathetic branch of the autonomic nervous system. Increases in respiratory sinus arrhythmia indicate enhanced parasympathetic influence upon the heart. Sympathetic oscillations of heart rate occur only at slower frequencies (below 0.10 Hz) owing to the more sluggish frequency response characteristics of the sympathetic branch of the autonomic nervous system.

The right column of Figure 5.29 shows the coherence, magnitude response, and phase response between systolic blood pressure as input and heart rate as output. Coherent oscillations among cardiac and blood pressure signals can often be discerned in a frequency band with a typical center frequency of 0.10 Hz (usual range, 0.07 to 0.12 Hz). This phenomenon has been tied to the cardiovascular baroreflex system, which involves baroreceptors, that is, bodies of cells in the carotid arteries and aorta that are sensitive to stretch. When blood pressure is increased, these baroreceptors fire proportionally to stretch and pressure changes, sending commands via the brain to the heart and circulatory system. This baroreflex system is the only known physiological system acting to buffer rapid and extreme surges or falls in blood pressure. Increased baroreceptor stretch, for example, slows the heart rate by means of increased parasympathetic activity; decreased baroreceptor stretch will elicit cardiovascular sympathetic activation that will speed the heart and constrict arterial vessels. Thus pressure drops due to a decrease in flow.
The 0.10-Hz blood pressure oscillations (see PSD in Figure 5.28) are sympathetic in origin and are produced by periodic sympathetic constriction of arterial blood vessels.

FIGURE 5.29 Coherence, magnitude response, and phase response between respiration as input and heart rate as output, and systolic blood pressure as input and heart rate as output. (Left column: Respiration → Heart rate; right column: Blood pressure → Heart rate. Rows: coherence, |H(F)|, and arg[H(F)] in radians, each plotted against frequency in Hz.)

5.5 MULTITAPER POWER SPECTRUM ESTIMATION

Tapering is another name for the data windowing operation in the time domain. The periodogram estimate of the power spectrum, discussed in Section 5.3, is an operation on a data record $\{x(n)\}_{n=0}^{N-1}$. One interpretation of this finite-duration data record is that it is obtained by truncating an infinite-duration process x(n) with a rectangular window (or taper). Since bias and variance properties of the periodogram estimate are unacceptable, methods for bias and variance reduction were developed either by smoothing estimates in the frequency domain (using lag windows) or by averaging periodograms computed over several short segments (data windows). Since these window functions (other than the rectangular one) typically taper the response toward both ends of the data record, windows are also referred to as tapers.

In 1982, Thomson suggested an alternate approach for producing a “direct” (or “raw” periodogram-based) spectral estimator. In this method, rather than use a single rectangular data taper as in the periodogram estimate, several data tapers are used on the same data record to compute several modified periodograms. These modified periodograms are then averaged (with or without weighting) to produce the multitaper spectral estimate. The central premise of this multitaper approach is that if the data tapers are properly designed orthogonal functions, then, under mild conditions, the spectral estimates would be independent of each other at every frequency. Thus, averaging would reduce the variance while proper design of

full-length windows would reduce bias and loss of resolution. Thomson suggested windows based on discrete prolate spheroidal sequences (DPSSs) that form an orthonormal set, although any other orthogonal set with desirable properties can also be used. This DPSS set is also known as the set of Slepian tapers. The multitaper method is different in spirit from the other methods in that it does not seek to produce highly smoothed spectra. Detailed discussions of the multitaper approach are given in Thomson (1982) and in Percival and Walden (1993). In this section, we provide a brief sketch of the algorithm.

5.5.1 Estimation of Auto Power Spectrum

Given a data record $\{x(n)\}_{n=0}^{N-1}$ of length N, consider a set of K data tapers $\{w_k(n);\ 0 \le n \le N-1,\ 0 \le k \le K-1\}$. These tapers are assumed to be orthonormal, that is,

$$\sum_{n=0}^{N-1} w_k(n)\,w_l(n) = \begin{cases} 1 & k = l \\ 0 & k \ne l \end{cases} \tag{5.5.1}$$

Let $\hat R_{k,x}(e^{j\omega})$ be the periodogram estimator based on the kth taper. Then, similar to (5.3.2), we obtain

$$\hat R_{k,x}(e^{j\omega}) = \frac{1}{N}\left|\sum_{n=0}^{N-1} w_k(n)\,x(n)\,e^{-j\omega n}\right|^2 \tag{5.5.2}$$

The simple averaged multitaper (MT) estimator is then defined by

$$\hat R^{(MT)}_x(e^{j\omega}) = \frac{1}{K}\sum_{k=0}^{K-1} \hat R_{k,x}(e^{j\omega}) \tag{5.5.3}$$

A pictorial description of this multitaper algorithm is shown in Figure 5.30. Another approach, suggested by Thomson, is to apply adaptive weights (both frequency- and data-dependent) prior to averaging to protect against the biasing degradations of different tapers. In either case, the multitaper estimator is an average of direct spectral estimators (called eigenspectra by Thomson) employing an orthonormal set of tapers. Thomson (1982) showed that under mild conditions, the orthonormality of the tapers results in an approximate independence of each individual $\hat R_{k,x}(e^{j\omega})$ at every frequency ω. This approximate independence further implies that the equivalent degrees of freedom for $\hat R^{(MT)}_x(e^{j\omega})$ are equal to twice the number of data tapers. This increase in degrees of freedom is enough to shrink the width of the 95 percent confidence interval for $\hat R^{(MT)}_x(e^{j\omega})$ and to reduce the variability to the point at which the overall shape of the spectrum is easily recognizable even though the spectrum is not highly smoothed.

Clearly, the success of this approach lies in the selection of K orthonormal tapers. To understand the rationale behind the selection of these tapers, consider the bias or mean of $\hat R_{k,x}(e^{j\omega})$. Following (5.3.15), we obtain

$$E\{\hat R_{k,x}(e^{j\omega})\} = \frac{1}{2\pi N}\int_{-\pi}^{\pi} R_x(e^{j\theta})\,R_{k,w}(e^{j(\omega-\theta)})\,d\theta \tag{5.5.4}$$

where

$$R_{k,w}(e^{j\omega}) = \mathcal{F}\{w_k(n) * w_k(-n)\} = |W_k(e^{j\omega})|^2 \tag{5.5.5}$$

It follows, then, from (5.5.3) that

$$E\{\hat R^{(MT)}_x(e^{j\omega})\} = \frac{1}{2\pi N}\int_{-\pi}^{\pi} R_x(e^{j\theta})\,\bar R_w(e^{j(\omega-\theta)})\,d\theta \tag{5.5.6}$$

where

$$\bar R_w(e^{j\omega}) \triangleq \frac{1}{K}\sum_{k=0}^{K-1} |W_k(e^{j\omega})|^2 \tag{5.5.7}$$

FIGURE 5.30 A pictorial description of the multitaper approach to power spectrum estimation. (The data record is multiplied by Taper 1, Taper 2, …, Taper M; the resulting Periodogram 1, Periodogram 2, …, Periodogram M are averaged to give the PSD estimate.)

The function $\bar R_w(e^{j\omega})$ is the spectral window of the averaged multitaper estimator, which is obtained by averaging spectra of the individual tapers. Hence, for $\bar R_w(e^{j\omega})$ to produce a good leakage-free estimate $\hat R^{(MT)}_x(e^{j\omega})$, all K spectral windows must provide good protection against leakage. Therefore, each taper must have low sidelobe levels. Furthermore, the averaging of K individual periodograms also reduces the overall variance of $\hat R^{(MT)}_x(e^{j\omega})$. The reduction in variance is possible if the $\hat R_{k,x}(e^{j\omega})$ are pairwise uncorrelated with common variance, in which case the variance is scaled by a factor of 1/K. Thus, we need K orthonormal data tapers such that each one provides a good protection against leakage and such that the resulting individual spectral estimates are nearly uncorrelated. One such set is obtained by using DPSS with parameter W and of orders k = 0, . . . , K − 1, where K is chosen to be less than or equal to the number 2W (called the Shannon number, which is also a fixed-resolution bandwidth). The design of these sequences is discussed in detail in Thomson (1982) and in Percival and Walden (1993). In Matlab these tapers are generated by using the [w]=dpss(L,W) function, where L is the length of 2W tapers computed in matrix w. The first four 21-point DPSS tapers with W = 4 and their Fourier transforms are shown in Figure 5.31 while the next four DPSS tapers are shown in Figure 5.32. It can be seen that higher-order tapers assume both positive and negative values. The zeroth-order taper (like other windows) heavily attenuates data values near n = 0 and n = L. The higher-order tapers successively give greater weights to these values to the point that tapers for k ≥ K have very poor bias properties and hence are not used. This behavior is quite evident in the frequency domain where as the taper order increases, mainlobe width and sidelobe attenuation decrease.
The multitapering approach can be interpreted as a technique in which higher-order tapers capture information that is “lost” when only the first taper is used. In Matlab the function

[Pxx,Pxxc,F]=pmtm(x,W,Nfft,Fs)

estimates the power spectrum of the data vector x in the array Pxx, using the multitaper approach. The function uses DPSS tapers with parameter W and adaptive weighted averaging as the default method. The 95 percent confidence interval is available in Pxxc. The size of the DFT used is Nfft, the sampling frequency is Fs, and the frequency values are returned in the vector F.

FIGURE 5.31 DPSS data tapers for k = 0, 1, 2, 3 in the time and frequency domains. (Time domain: $w_0(n)$ to $w_3(n)$ versus n; frequency domain: magnitude in decibels versus normalized frequency.)

FIGURE 5.32 DPSS data tapers for k = 4, 5, 6, 7 in the time and frequency domains. (Time domain: $w_4(n)$ to $w_7(n)$ versus n; frequency domain: magnitude in decibels versus normalized frequency.)

Another much simpler set of orthonormal tapers was suggested by Reidel and Siderenko (1995). This particular set contains harmonically related sinusoidal tapers. One important aspect of multitapering is to reduce the periodogram variance without the loss of resolution caused by smoothing across frequencies. If the spectrum is changing slowly across the band so that sidelobe bias is not severe (recall the argument given for the unbiasedness of the periodogram for the white noise process), then sine tapers can reduce the variance. The kth taper in this set of k = 0, 1, . . . , N − 1 tapers is given by

$$w_k(n) = \sqrt{\frac{2}{N+1}}\,\sin\frac{\pi(k+1)(n+1)}{N+1} \qquad n = 0, 1, \ldots, N-1 \tag{5.5.8}$$

where the amplitude term on the right is a normalization factor that ensures orthonormality of the tapers. These sine tapers have much narrower mainlobe but also much higher sidelobes (recall the rectangular window) than the DPSS tapers. Thus they achieve a smaller bias due to smoothing by the mainlobe than the DPSS tapers, but at the expense of sidelobe suppression. Clearly this performance is acceptable if the spectrum is varying slowly. Owing to their simple nature, these tapers can be analyzed analytically, and it can be shown that (Reidel and Siderenko 1995) the kth sinusoidal taper has its spectral energy concentrated in the frequency bands

$$\frac{\pi k}{N+1} \le |\omega| \le \frac{\pi(k+2)}{N+1} \qquad k = 0, 1, \ldots, N-1 \tag{5.5.9}$$

If the first K < N tapers are used, then the multitaper estimator has the spectral window concentrated in the band

$$\left[-\frac{K+1}{N+1},\ \frac{K+1}{N+1}\right] \tag{5.5.10}$$

A summary of the multitaper algorithm performance and its comparison with other PSD estimation methods are given in Table 5.3.
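The sine tapers of (5.5.8) are simple enough to generate directly, which makes them convenient for a minimal sketch of the simple-average estimator (5.5.3) in plain Python. The function names, the record length, and the number of tapers below are illustrative assumptions. One assumption on scaling should be noted: the tapers here have unit energy, as (5.5.1) requires, so the sketch does not divide the eigenspectra by N; keeping both normalizations would scale a white noise spectrum by 1/N.

```python
import cmath
import math

def sine_taper(k, N):
    """k-th sine taper of (5.5.8); unit energy by construction."""
    a = math.sqrt(2.0 / (N + 1))
    return [a * math.sin(math.pi * (k + 1) * (n + 1) / (N + 1)) for n in range(N)]

def multitaper_psd(x, K, omega):
    """Simple-average multitaper estimate (5.5.3) at one frequency.

    Assumption: unit-energy tapers, so no extra 1/N factor is applied here.
    """
    N = len(x)
    total = 0.0
    for k in range(K):
        w = sine_taper(k, N)
        S = sum(w[n] * x[n] * cmath.exp(-1j * omega * n) for n in range(N))
        total += abs(S) ** 2          # k-th eigenspectrum
    return total / K

N, K = 16, 4
tapers = [sine_taper(k, N) for k in range(K)]

# Gram matrix of the tapers: should be the identity, per (5.5.1).
gram = [[sum(a * b for a, b in zip(tapers[k], tapers[l])) for l in range(K)]
        for k in range(K)]

# Deterministic check: for a unit impulse at n0, each eigenspectrum
# equals the squared taper value at n0, independent of frequency.
n0 = 5
x = [0.0] * N
x[n0] = 1.0
est = multitaper_psd(x, K, 0.4 * math.pi)
expected = sum(t[n0] ** 2 for t in tapers) / K
print(est, expected)
```

The impulse input is used only because its multitaper estimate is known in closed form; for DPSS tapers the same averaging applies, with the tapers supplied by a routine such as the dpss function mentioned above.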

EXAMPLE 5.5.1 (THREE SINUSOIDS IN WHITE NOISE). Consider the random process x(n) containing three sinusoids in white noise discussed earlier, that is,

$$x(n) = \cos(0.35\pi n + \phi_1) + \cos(0.4\pi n + \phi_2) + 0.25\cos(0.8\pi n + \phi_3) + \nu(n)$$

Fifty realizations of x(n), 0 ≤ n ≤ N − 1, were processed using the pmtm function to obtain multitaper spectrum estimates for K = 3, 5, and 7 Slepian tapers. The results are shown in Figure 5.33 in the form of overlays and averages. Several interesting observations and comparisons with the previous methods can be made. The number of tapers used in the estimation determines the variance and the smearing of the spectrum. When fewer tapers are used, the peaks are sharper and narrower but the noise variance is larger. After increasing the number of tapers, the variance is decreased but the peaks become wider. When these estimates are compared with those from Welch’s method, an interesting feature can be noticed. The broadening of the peaks is not just at the base but is present along the entire length of the peak. Therefore, even with seven tapers, peaks are distinguishable. This feature is due to the bandwidth of the average spectral window due to K tapers.

FIGURE 5.33 Spectrum estimation of three sinusoids in white noise using the multitaper method in Example 5.5.1. (Panels: multitaper estimate overlay and multitaper estimate average for K = 3, 5, and 7; power in dB versus ω from 0 to π, with 0.35π, 0.4π, and 0.8π marked.)

EXAMPLE 5.5.2 (OCEAN WAVE DATA). Consider the wire gage wave data of Figure 5.24. The multitaper estimate $\hat R^{(MT)}_x(e^{j\omega})$ of these 4096-point data is obtained using the pmtm function in which the parameter W is set to 4. The results are shown in Figure 5.34. In this graph, the middle solid curve is the spectral estimate in decibels while the upper and lower solid curves are the upper and lower limits of the 95 percent confidence interval for a fixed frequency. For comparison purposes, the “raw” periodogram estimate is also shown as small dots. Clearly, the periodogram has a large variability that is reduced in the multitaper estimate. At the same time, the multitaper estimate is not smooth, but its variability is small enough to follow the shape of the overall structure.

FIGURE 5.34 Spectrum estimation of the wire gage wave data using the multitaper method in Example 5.5.2. (Plot title: PSD of wire gage wave data using multitaper approach; decibels versus frequency in Hz.)

5.5.2 Estimation of Cross Power Spectrum

The multitapering approach can also be extended to the estimation of the cross power spectrum. Following (5.4.11), the multitaper estimator of the cross power spectrum is given by

$$\hat R_{xy}^{(MT)}(e^{j\omega}) = \frac{1}{K L_{xy}} \sum_{k=0}^{K-1} \left[\sum_{n=0}^{L_{xy}-1} w_k(n)\,x(n)\,e^{-j\omega n}\right] \left[\sum_{n=0}^{L_{xy}-1} w_k(n)\,y(n)\,e^{-j\omega n}\right]^{*} \tag{5.5.11}$$

where $w_k(n)$ is the kth-order data taper of length $L_{xy}$ and a fixed-resolution bandwidth of $2W$. As with the auto power spectrum, the use of multitaper averaging reduces the variability of the cross-periodogram $\hat R_{xy}(e^{j\omega})$. Once again, the number of equivalent degrees of freedom for $\hat R_{xy}(e^{j\omega})$ is equal to 2K. The real-valued functions associated with the cross power spectrum can also be estimated by using the multitaper approach in a similar fashion. The cospectrum and the quadrature spectrum are given by

$$\hat C_{xy}^{(MT)}(\omega) = \mathrm{Re}[\hat R_{xy}^{(MT)}(e^{j\omega})] \quad \text{and} \quad \hat Q_{xy}^{(MT)}(\omega) = -\mathrm{Im}[\hat R_{xy}^{(MT)}(e^{j\omega})] \tag{5.5.12}$$

while the cross-amplitude spectrum and the phase spectrum are given by

$$\hat A_{xy}^{(MT)}(\omega) = \sqrt{[\hat C_{xy}^{(MT)}(\omega)]^2 + [\hat Q_{xy}^{(MT)}(\omega)]^2} \quad \text{and} \quad \hat\Phi_{xy}^{(MT)}(\omega) = \tan^{-1}\left\{-\frac{\hat Q_{xy}^{(MT)}(\omega)}{\hat C_{xy}^{(MT)}(\omega)}\right\} \tag{5.5.13}$$

Finally, the coherency spectrum is given by

$$|\hat{\mathcal G}_{xy}^{(MT)}(\omega)| = \frac{\{[\hat C_{xy}^{(MT)}(\omega)]^2 + [\hat Q_{xy}^{(MT)}(\omega)]^2\}^{1/2}}{\sqrt{\hat R_x^{(MT)}(e^{j\omega})\,\hat R_y^{(MT)}(e^{j\omega})}} \tag{5.5.14}$$

Matlab does not provide a function for cross power spectrum estimation using the multitaper approach. However, by using the DPSS function, it is relatively straightforward to implement the simple averaging method of (5.5.11).

EXAMPLE 5.5.3. Again consider the wire gage and the infrared gage wave data of Figure 5.24. The multitaper estimate of the cross power spectrum $\hat R_{xy}^{(MT)}(e^{j\omega})$ of these two 4096-point sequences is obtained by using (5.5.11) in which the parameter W is set to 4. Figure 5.35 shows plots of the estimates of the auto power spectra of the two data sets in solid lines. The cross power spectrum of the two signals is shown with a dotted line. It is interesting to note that the two auto power spectra agree almost perfectly over the band up to 0.3 Hz and then reasonably well up to 0.9 Hz, beyond which point the spectrum due to the infrared gage is consistently higher because of high-frequency noise inherent in the measurements. The cross power spectrum agrees with the two auto power spectra at low frequencies up to 0.2 Hz. Figure 5.36 contains two graphs; the upper graph is for the MSC while the lower one is for the phase spectrum. Consistent with our observation of the cross power spectrum in Figure 5.35, the MSC is almost one over these lower frequencies. The phase spectrum is almost a linear function over the range over which the two auto power spectra agree. Thus, the multitaper approach provides estimates that agree with the conventional techniques.

FIGURE 5.35 Cross power spectrum estimation of the wave data using the multitaper approach. [Panel: PSD and CSD of wave data using multitaper approach; curves labeled infrared gage and wire gage; horizontal axis frequency (Hz); vertical axis decibels.]

FIGURE 5.36 Coherency and phase spectrum of the wave data using the multitaper approach. [Two panels: magnitude-squared coherency (estimated MSC) and phase spectrum (estimated phase), both versus frequency (Hz).]
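The simple averaging method of (5.5.11) and the derived quantities (5.5.12) to (5.5.14) can be sketched in a few lines. The following is an illustration under stated assumptions, not the implementation used for Example 5.5.3: it is written in Python, it uses unit-energy sinusoidal tapers as a closed-form substitute for the Slepian tapers returned by DPSS (so the $1/L_{xy}$ factor of (5.5.11) is absorbed in the taper normalization), and all function names are hypothetical. The test signal is a white sequence and a copy delayed by one sample, for which the coherency should be close to one and the phase close to ω.

```python
import cmath
import math
import random


def sine_tapers(N, K):
    """K orthonormal sinusoidal tapers of length N (assumed stand-in for DPSS output)."""
    scale = math.sqrt(2.0 / (N + 1))
    return [[scale * math.sin(math.pi * (k + 1) * (n + 1) / (N + 1)) for n in range(N)]
            for k in range(K)]


def tapered_dtft(x, w, omega):
    """DTFT of the tapered sequence w(n) x(n) at frequency omega (rad/sample)."""
    return sum(wn * xn * cmath.exp(-1j * omega * n)
               for n, (wn, xn) in enumerate(zip(w, x)))


def multitaper_cross_spectrum(x, y, K, omega):
    """Return (Rxy, Rx, Ry) at omega by averaging over K tapers, following (5.5.11)."""
    Rxy, Rx, Ry = 0j, 0.0, 0.0
    for w in sine_tapers(len(x), K):
        X = tapered_dtft(x, w, omega)
        Y = tapered_dtft(y, w, omega)
        Rxy += X * Y.conjugate()
        Rx += abs(X) ** 2
        Ry += abs(Y) ** 2
    return Rxy / K, Rx / K, Ry / K


def cross_spectral_quantities(x, y, K, omega):
    Rxy, Rx, Ry = multitaper_cross_spectrum(x, y, K, omega)
    C = Rxy.real                         # cospectrum, (5.5.12)
    Q = -Rxy.imag                        # quadrature spectrum, (5.5.12)
    A = math.hypot(C, Q)                 # cross-amplitude spectrum, (5.5.13)
    phase = math.atan2(-Q, C)            # four-quadrant form of tan^{-1}(-Q/C), (5.5.13)
    coherency = A / math.sqrt(Rx * Ry)   # coherency magnitude, (5.5.14)
    return {"C": C, "Q": Q, "A": A, "phase": phase, "coherency": coherency}


if __name__ == "__main__":
    rng = random.Random(1)
    s = [rng.gauss(0.0, 1.0) for _ in range(257)]
    x, y = s[1:], s[:-1]                 # y(n) = x(n - 1)
    print(cross_spectral_quantities(x, y, K=4, omega=0.3 * math.pi))
```

The four-quadrant arctangent is used so that the phase is resolved over (−π, π]; this is a design choice of the sketch rather than something stated in (5.5.13).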

5.6 SUMMARY

In this chapter, we presented many different nonparametric methods for estimating the power spectrum of a wide-sense stationary random process. Nonparametric methods do not depend on any particular model of the process but use estimators that are determined entirely by the data. Therefore, one has to be very careful about the data and the interpretation of results based on them.

We began by revisiting the topic of frequency analysis of deterministic signals. Since the spectrum estimation of random processes is based on the Fourier transformation of data, the purpose of this discussion was to identify and study errors associated with the practical implementation. In this regard, three problems (the sampling of the continuous signal, windowing of the sampled data, and the sampling of the spectrum) were isolated and discussed in detail. Some useful data windows and their characteristics were also given. This background was necessary to understand more complex spectrum estimation methods and their results.

An important topic of autocorrelation estimation was considered next. Although this discussion was not directly related to spectrum estimation, its inclusion was appropriate since one important method (i.e., that of Blackman and Tukey) was based on this estimation. The statistical properties of the estimator and its implementation completed this topic.

The major part of this chapter was devoted to the section on the auto power spectrum estimation. The classical approach was to develop an estimator from the Fourier transform of the given values of the process. This was called the periodogram method, and it resulted in a natural PSD estimator as a Fourier transform of an autocorrelation estimate. Unfortunately, the statistical analysis of the periodogram showed that it was neither an unbiased estimator nor a consistent estimator; that is, its variability did not decrease with increasing data record length. The modification of the periodogram using the data window lessened the spectral leakage and reduced the bias but did not decrease the variance. Several examples were given to verify these aspects.

To improve the statistical performance of the simple periodogram, we then looked at several possible improvements to the basic technique. Two main directions emerged for reducing the variance: periodogram smoothing and periodogram averaging. These approaches produced consistent and asymptotically unbiased estimates. The periodogram smoothing was obtained by applying the lag window to the autocorrelation estimate and then Fourier-transforming it. This method was due to Blackman and Tukey, and results of its mean and variance were given. The periodogram averaging was done by segmenting the data to obtain several records, followed by windowing to reduce spectral leakage, and finally by averaging their periodograms to reduce variance. This was the well-known Welch-Bartlett method, and the results of its statistical analysis were also given. Finally, implementations based on the DFT and Matlab were given for both methods along with several examples to illustrate the performance of their estimates. These nonparametric methods were further extended to estimate the cross power spectrum, coherence functions, and transfer function.

Finally, we presented a newer nonparametric technique for auto power spectrum and cross power spectrum estimation that was based on applying several data windows or tapers to the data followed by averaging of the resulting modified periodograms. The basic principle behind this method was that if the tapers are orthonormal and properly designed (to reduce leakage), then the resulting periodograms can be considered to be independent at each frequency and hence their average would reduce the variance. Two orthogonal sets of data tapers, namely, the Slepian and sinusoidal, were provided. The implementation using Matlab was given, and examples were given to complete the chapter.

PROBLEMS

5.1 Let $x_c(t)$, $-\infty < t < \infty$, be a continuous-time signal with Fourier transform $X_c(F)$, $-\infty < F < \infty$, and let $x(n)$, with DTFT $X(e^{j\omega})$, be obtained by sampling $x_c(t)$ at sampling interval $T$.
(a) Show that the DTFT $X(e^{j\omega})$ is given by

$$X(e^{j\omega}) = F_s \sum_{l=-\infty}^{\infty} X_c(fF_s - lF_s) \qquad \omega = 2\pi f \qquad F_s = \frac{1}{T}$$

(b) Let $\tilde X_p(k)$ be obtained by sampling $X(e^{j\omega})$ every $2\pi/N$ rad, that is,

$$\tilde X_p(k) = X(e^{j2\pi k/N}) = F_s \sum_{l=-\infty}^{\infty} X_c\left(\frac{kF_s}{N} - lF_s\right)$$

Then show that the inverse DFT of $\tilde X_p(k)$ is given by

$$x_p(n) \triangleq \mathrm{IDFT}\{\tilde X_p(k)\} = \sum_{m=-\infty}^{\infty} x_c(nT - mNT)$$

5.2 Matlab provides two functions to generate triangular windows, namely, bartlett and triang. These two functions actually generate two slightly different coefficients.
(a) Use bartlett to generate N = 11, 31, and 51 length windows $w_B(n)$, and plot their samples, using the stem function.
(b) Compute the DTFTs $W_B(e^{j\omega})$, and plot their magnitudes over $[-\pi, \pi]$. Determine experimentally the width of the mainlobe as a function of N. Repeat part (a) using the triang function. How are the lengths and the mainlobe widths different in this case? Which window function is an appropriate one in terms of nonzero samples?
(c) Determine the length of the bartlett window that has the same mainlobe width as that of a 51-point rectangular window.

5.3 Sidelobes of the window transform contribute to the spectral leakage due to the frequency-domain convolution. One measure of this leakage is the maximum sidelobe height, which generally occurs at the first sidelobe for all windows except the Dolph-Chebyshev window.
(a) For simple windows such as the rectangular, Hanning, or Hamming window, the maximum sidelobe height is independent of window length N. Choose N = 11, 31, and 51, and determine the maximum sidelobe height in decibels for the above windows.
(b) For the Kaiser window, the maximum sidelobe height is controlled by the shape parameter β and is proportional to $\beta/\sinh\beta$. Using several values of β and N, verify the relationship between β and the maximum sidelobe height.
(c) Determine the value of β that gives the maximum sidelobe height nearly the same as that of the Hamming window of the same length. Compare the mainlobe widths and the window coefficients of these two windows.
(d) For the Dolph-Chebyshev window, all sidelobes have the same height A in decibels. For A = 40, 50, and 60 dB, determine the 3-dB mainlobe widths for an N = 31 length window.

5.4 Let x(n) be given by

$$y(n) = \cos\omega_1 n + \cos(\omega_2 n + \phi) \quad \text{and} \quad x(n) = y(n)\,w(n)$$

where w(n) is a length-N data window. The $|X(e^{j\omega})|^2$ is computed using Matlab and is plotted over $[0, \pi]$.
(a) Let w(n) be a rectangular window. For $\omega_1 = 0.25\pi$ and $\omega_2 = 0.3\pi$, determine the minimum length N so that the two frequencies in the $|X(e^{j\omega})|^2$ plot are barely separable for any arbitrary $\phi \in [-\pi, \pi]$. (You may want to consider the worst possible value of φ or experiment, using several values of φ.)
(b) Repeat part (a) for a Hamming window.
(c) Repeat part (a) for a Blackman window.

5.5 In this problem we will prove that the autocorrelation matrix $\hat{\mathbf R}_x$ given in (5.2.3), in which the sample correlations are defined by (5.2.1), is a nonnegative definite matrix, that is,

$$\mathbf x^H \hat{\mathbf R}_x \mathbf x \ge 0 \qquad \text{for every } \mathbf x \ne \mathbf 0$$

(a) Show that $\hat{\mathbf R}_x$ can be decomposed into the product $\mathbf X^H \mathbf X$, where $\mathbf X$ is called a data matrix. Determine the form of $\mathbf X$.
(b) Using the above decomposition, now prove that $\mathbf x^H \hat{\mathbf R}_x \mathbf x \ge 0$, for every $\mathbf x \ne \mathbf 0$.

5.6 An alternative autocorrelation estimate $\check r_x(l)$ is given in (5.2.13) and is repeated below.

$$\check r_x(l) = \frac{1}{N-l} \sum_{n=0}^{N-l-1} x(n+l)\,x^*(n) \qquad 0 \le l \le L$$

Consider the above unbiased autocorrelation estimator $\check r_x(l)$ of a zero-mean white Gaussian process with variance $\sigma_x^2$.
(a) Determine the variance of $\check r_x(l)$. Compute its limiting value as $l \to \infty$.
(b) Repeat part (a) for the biased estimator $\hat r_x(l)$. Comment on any differences in the results.

5.8 Show that the autocorrelation matrix $\check{\mathbf R}_x$ formed by using $\check r_x(l)$ is not nonnegative definite, that is,

$$\mathbf x^H \check{\mathbf R}_x \mathbf x < 0 \qquad \text{for some } \mathbf x \ne \mathbf 0$$

5.9 In this problem, we will show that the periodogram $\hat R_x(e^{j\omega})$ can also be expressed as a DTFT of the autocorrelation estimate $\hat r_x(l)$ given in (5.2.1).
(a) Let $v(n) = x(n)\,w_R(n)$, where $w_R(n)$ is a rectangular window of length N. Show that

$$\hat r_x(l) = \frac{1}{N}\, v(l) * v^*(-l) \tag{P.1}$$

(b) Take the DTFT of (P.1) to show that

$$\hat R_x(e^{j\omega}) = \sum_{l=-N+1}^{N-1} \hat r_x(l)\, e^{-j\omega l}$$

5.10 Consider the following simple windows over $0 \le n \le N-1$: rectangular, Bartlett, Hanning, and Hamming.
(a) Determine analytically the DTFT of each of the above windows.
(b) Sketch the magnitude of these Fourier transforms for N = 31.
(c) Verify your sketches by performing a numerical computation of the DTFT using Matlab.

5.11 The Parzen window is given by

$$w_P(l) \triangleq \begin{cases} 1 - 6\left(\dfrac{l}{L}\right)^2 + 6\left(\dfrac{|l|}{L}\right)^3 & 0 \le |l| \le \dfrac{L}{2} \\[2mm] 2\left(1 - \dfrac{|l|}{L}\right)^3 & \dfrac{L}{2} < |l| < L \\[2mm] 0 & \text{elsewhere} \end{cases} \tag{P.2}$$

(a) Show that its DTFT is given by

$$W_P(e^{j\omega}) \simeq \left[\frac{\sin(\omega L/4)}{\sin(\omega/4)}\right]^4 \ge 0 \tag{P.3}$$

Hence using the Parzen window as a correlation window always produces nonnegative spectrum estimates.
(b) Using Matlab, compute and plot the time-domain window $w_P(l)$ and its frequency-domain response $W_P(e^{j\omega})$ for L = 5, 10, and 20.
(c) From the frequency-domain plots in part (b) experimentally determine the 3-dB mainlobe width $\Delta\omega$ as a function of L.

5.12 The variance reduction ratio of a correlation window $w_a(l)$ is defined as

$$\frac{\mathrm{var}\{\hat R_x^{(PS)}(e^{j\omega})\}}{\mathrm{var}\{\hat R_x(e^{j\omega})\}} \simeq \frac{E_w}{N} \qquad 0 < \omega < \pi$$

where

$$E_w = \frac{1}{2\pi} \int_{-\pi}^{\pi} W_a^2(e^{j\omega})\, d\omega = \sum_{l=-(L-1)}^{L-1} w_a^2(l)$$

(a) Using Matlab, compute and plot $E_w$ as a function of L for the following windows: rectangular, Bartlett, Hanning, Hamming, and Parzen.
(b) Using your computations above, show that for $L \gg 1$, the variance reduction ratio for each window is given by the formula in the following table.

Window name    Variance reduction factor
Rectangular    2L/N
Bartlett       0.667L/N
Hanning        0.75L/N
Hamming        0.7948L/N
Parzen         0.539L/N

5.13 For L > 100, the direct computation of $\hat r_x(l)$ using (5.3.49) is time-consuming; hence an indirect computation using the DFT can be more efficient. This computation is implemented by the following steps:
• Given the sequence $\{x(n)\}_{n=0}^{N-1}$, pad enough zeros to make it a (2N − 1)-point sequence.
• Compute the $N_{FFT}$-point FFT of x(n) to obtain $\tilde X(k)$, where $N_{FFT}$ is equal to the next power-of-2 number that is greater than or equal to 2N − 1.
• Compute $\frac{1}{N}|\tilde X(k)|^2$ to obtain $\hat R(k)$.
• Compute the $N_{FFT}$-point IFFT of $\hat R(k)$ to obtain $\hat r_x(l)$.

Develop a Matlab function rx = autocfft(x,L) which computes $\hat r_x(l)$ over $-L \le l \le L$. Compare this function with the autoc function discussed in the chapter in terms of the execution time for $L \ge 100$.
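The four steps listed in Problem 5.13 can be sketched as follows. This is not the requested Matlab function autocfft: it is a Python illustration with a small recursive radix-2 FFT written out so that the block is self-contained, and the names `fft`, `autoc_fft`, and `autoc_direct` are hypothetical. The summand convention $x(n+l)\,x^*(n)$ for the biased estimate is assumed, and only the lags $0 \le l \le L$ are returned, since the negative lags follow from $\hat r_x(-l) = \hat r_x^*(l)$.

```python
import cmath


def fft(a, inverse=False):
    """Recursive radix-2 FFT; len(a) must be a power of 2. The inverse omits the 1/N scale."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], inverse)
    odd = fft(a[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def autoc_fft(x, L):
    """Biased autocorrelation estimate for lags 0..L via a zero-padded FFT (complex output)."""
    N = len(x)
    nfft = 1
    while nfft < 2 * N - 1:          # next power of 2 >= 2N - 1 avoids circular aliasing
        nfft *= 2
    X = fft(list(x) + [0.0] * (nfft - N))
    R = [abs(Xk) ** 2 / N for Xk in X]
    r = fft(R, inverse=True)
    return [r[l] / nfft for l in range(L + 1)]


def autoc_direct(x, L):
    """Direct evaluation of (1/N) * sum_n x(n + l) * conj(x(n)) for lags 0..L."""
    N = len(x)
    return [sum(x[n + l] * x[n].conjugate() for n in range(N - l)) / N
            for l in range(L + 1)]


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0]
    print(autoc_fft(x, 3))
    print(autoc_direct(x, 3))
```

For the short test sequence above the two routines should agree to within rounding error; the zero-lag value is $(1 + 4 + 9 + 16)/4 = 7.5$. A timing comparison, as the problem asks, would need much longer records and a library FFT.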

5.14 The Welch-Bartlett estimate $\hat R_x^{(PA)}(k)$ is given by

$$\hat R_x^{(PA)}(k) = \frac{1}{KL} \sum_{i=0}^{K-1} |X_i(k)|^2$$

If x(n) is real-valued, then the sum in the above expression can be evaluated more efficiently. Let K be an even number. Then we will combine two real-valued sequences into one complex-valued sequence and compute one FFT, which will reduce the overall computations. Specifically, let

$$g_r(n) \triangleq x_{2r}(n) + j\,x_{2r+1}(n) \qquad n = 0, 1, \ldots, L-1, \quad r = 0, 1, \ldots, \frac{K}{2} - 1$$

Then the L-point DFT of $g_r(n)$ is given by

$$\tilde G_r(k) = \tilde X_{2r}(k) + j\,\tilde X_{2r+1}(k) \qquad k = 0, 1, \ldots, L-1, \quad r = 0, 1, \ldots, \frac{K}{2} - 1$$

(a) Show that

$$|\tilde G_r(k)|^2 + |\tilde G_r(L-k)|^2 = 2\left[|\tilde X_{2r}(k)|^2 + |\tilde X_{2r+1}(k)|^2\right] \qquad k, r = 0, \ldots, \frac{K}{2} - 1$$

(b) Determine the resulting expression for $\hat R_x^{(PA)}(k)$ in terms of $\tilde G(k)$.
(c) What changes are necessary if K is an odd number? Provide detailed steps for this case.

5.15 Since $\hat R_x^{(PA)}(e^{j\omega})$ is a PSD estimate, one can determine the autocorrelation estimate $\hat r_x^{(PA)}(l)$ from Welch’s method as

$$\hat r_x^{(PA)}(l) = \frac{1}{2\pi} \int_{-\pi}^{\pi} \hat R_x^{(PA)}(e^{j\omega})\, e^{j\omega l}\, d\omega \tag{P.4}$$

Let $\hat{\tilde R}_x^{(PA)}(k)$ be the samples of $\hat R_x^{(PA)}(e^{j\omega})$ according to

$$\hat{\tilde R}_x^{(PA)}(k) \triangleq \hat R_x^{(PA)}(e^{j2\pi k/N_{FFT}}) \qquad 0 \le k \le N_{FFT} - 1$$

(a) Show that the IDFT $\hat{\tilde r}_x^{(PA)}(l)$ of $\hat{\tilde R}_x^{(PA)}(k)$ is an aliased version of the autocorrelation estimate $\hat r_x^{(PA)}(l)$.
(b) If the length of the overlapping data segment in Welch’s method is L, how should $N_{FFT}$ be chosen to avoid aliasing in $\hat{\tilde r}_x^{(PA)}(l)$?

5.16 Show that the coherence function $\mathcal G_{xy}^2(\omega)$ is invariant under linear transformation, that is, if $x_1(n) = h_1(n) * x(n)$ and $y_1(n) = h_2(n) * y(n)$, then

$$\mathcal G_{xy}^2(\omega) = \mathcal G_{x_1 y_1}^2(\omega)$$

5.17 Bartlett’s method is a special case of Welch’s method in which nonoverlapping sections of length L are used without windowing in the periodogram averaging operation.
(a) Show that the ith periodogram in this method can be expressed as

$$\hat R_{x,i}(e^{j\omega}) = \sum_{l=-L}^{L} \hat r_{x,i}(l)\, w_B(l)\, e^{-j\omega l} \tag{P.5}$$

where $w_B(l)$ is a (2L − 1)-length Bartlett window.
(b) Let $\mathbf u(e^{j\omega}) \triangleq [1 \;\; e^{j\omega} \;\cdots\; e^{j(L-1)\omega}]^T$. Show that $\hat R_{x,i}(e^{j\omega})$ in (P.5) can be expressed as a quadratic product

$$\hat R_{x,i}(e^{j\omega}) = \frac{1}{L}\, \mathbf u^H(e^{j\omega})\, \hat{\mathbf R}_{x,i}\, \mathbf u(e^{j\omega}) \tag{P.6}$$

where $\hat{\mathbf R}_{x,i}$ is the autocorrelation matrix of $\hat r_{x,i}(l)$ values.
(c) Finally, show that the Bartlett estimate is given by

$$\hat R_x^{(B)}(e^{j\omega}) = \frac{1}{KL} \sum_{i=1}^{K} \mathbf u^H(e^{j\omega})\, \hat{\mathbf R}_{x,i}\, \mathbf u(e^{j\omega}) \tag{P.7}$$

5.18 In this problem, we will explore a spectral estimation technique that uses combined data and correlation weighting (Carter and Nuttall 1980). In this technique, the following steps are performed:
• Given $\{x(n)\}_{n=0}^{N-1}$, compute the Welch-Bartlett estimate $\hat R_x^{(PA)}(e^{j\omega})$ by choosing the appropriate values of L and D.
• Compute the autocorrelation estimate $\hat r_x^{(PA)}(l)$, $-L \le l \le L$, using the approach described in Problem 5.15.
• Window $\hat r_x^{(PA)}(l)$, using a lag window $w_a(l)$, to obtain $\hat r_x^{(CN)}(l) \triangleq \hat r_x^{(PA)}(l)\, w_a(l)$.
• Finally, compute the DTFT of $\hat r_x^{(CN)}(l)$ to obtain the new spectrum estimate $\hat R_x^{(CN)}(e^{j\omega})$.

(a) Determine the bias of $\hat R_x^{(CN)}(e^{j\omega})$.
(b) Comment on the effect of additional windowing on the variance and resolution of the estimate.
(c) Implement this technique in Matlab, and compute spectral estimates of the process containing three sinusoids in white noise, which was discussed in the chapter. Experiment with various values of L and with different windows. Compare your results to those given for the Welch-Bartlett and Blackman-Tukey methods.

5.19 Explain why we use the scaling factor

$$\sum_{n=0}^{L-1} w^2(n)$$

which is the energy of the data window in the Welch-Bartlett method.

5.20 Consider the basic periodogram estimator $\hat R_x(e^{j\omega})$ at the zero frequency, that is, at ω = 0.
(a) Show that

$$\hat R_x(e^{j0}) = \frac{1}{N} \left| \sum_{n=0}^{N-1} x(n)\, e^{j0} \right|^2 = \frac{1}{N} \left| \sum_{n=0}^{N-1} x(n) \right|^2$$

(b) If x(n) is a real-valued, zero-mean white Gaussian process with variance $\sigma_x^2$, determine the mean and variance of $\hat R_x(e^{j0})$.
(c) Determine if $\hat R_x(e^{j0})$ is a consistent estimator by evaluating the variance as $N \to \infty$.

5.21 Consider Bartlett’s method for estimating $R_x(e^{j0})$ using L = 1; that is, we use nonoverlapping segments of single samples. The periodogram of one sample x(n) is simply $|x(n)|^2$. Thus we have

$$\hat R_x^{(B)}(e^{j0}) = \frac{1}{N} \sum_{n=0}^{N-1} \hat R_{x,n}(e^{j0}) = \frac{1}{N} \sum_{n=0}^{N-1} |x(n)|^2$$

Again assume that x(n) is a real-valued white Gaussian process with variance $\sigma_x^2$.
(a) Determine the mean and variance of $\hat R_x^{(B)}(e^{j0})$.
(b) Compare the above result with those in Problem 5.20. Comment on any differences.

5.22 One desirable property of lag or correlation windows is that their Fourier transforms are nonnegative.
(a) Formulate a procedure to generate a symmetric lag window of length 2L + 1 with nonnegative Fourier transform.
(b) Using the Hanning window as a prototype in the above procedure, determine and plot a 31-length lag window. Also plot its Fourier transform.

5.23 Consider the following random process

$$x(n) = \sum_{k=1}^{4} A_k \sin(\omega_k n + \phi_k) + \nu(n)$$

where

$$A_1 = 1 \quad A_2 = 0.5 \quad A_3 = 0.5 \quad A_4 = 0.25$$
$$\omega_1 = 0.1\pi \quad \omega_2 = 0.6\pi \quad \omega_3 = 0.65\pi \quad \omega_4 = 0.8\pi$$

and the phases $\{\phi_i\}_{i=1}^{4}$ are IID random variables uniformly distributed over $[-\pi, \pi]$. Generate 50 realizations of x(n) for $0 \le n \le 256$. Let $\nu(n)$ be WN(0, 1).
(a) Compute the Blackman-Tukey estimates for L = 32, 64, and 128, using the Bartlett lag window. Plot your results, using overlay and averaged estimates. Comment on your plots.
(b) Repeat part (a), using the Parzen window.
(c) Provide a qualitative comparison between the above two sets of plots.

5.24 Consider the random process given in Problem 5.23.
(a) Compute the Bartlett estimate, using L = 16, 32, and 64. Plot your results, using overlay and averaged estimates. Comment on your plots.
(b) Compute the Welch estimate, using 50 percent overlap, Hamming window, and L = 16, 32, and 64. Plot your results, using overlay and averaged estimates. Comment on your plots.
(c) Provide a qualitative comparison between the above two sets of plots.

5.25 Consider the random process given in Problem 5.23.
(a) Compute the multitaper spectrum estimate, using K = 3, 5, and 7 Slepian tapers. Plot your results, using overlay and averaged estimates. Comment on your plots.
(b) Make a qualitative comparison between the above plots and those obtained in Problems 5.23 and 5.24.

5.26 Generate 1000 samples of an AR(1) process using a = −0.9. Determine its theoretical PSD.
(a) Determine and plot the periodogram of the process along with the true spectrum. Comment on the plots.
(b) Compute the Blackman-Tukey estimates for L = 10, 20, 50, and 100. Plot these estimates along with the true spectrum. Comment on your results.
(c) Compute the Welch estimates for 50 percent overlap, Hamming window, and L = 10, 20, 50, and 100. Plot these estimates along with the true spectrum. Comment on your results.

5.27 Generate 1000 samples of an AR(1) process using a = 0.9. Determine its theoretical PSD.
(a) Determine and plot the periodogram of the process along with the true spectrum. Comment on the plots.
(b) Compute the Blackman-Tukey estimates for L = 10, 20, 50, and 100. Plot these estimates along with the true spectrum. Comment on your results.
(c) Compute the Welch estimates for 50 percent overlap, Hamming window, and L = 10, 20, 50, and 100. Plot these estimates along with the true spectrum. Comment on your results.

5.28 The multitaper estimation technique requires a properly designed orthonormal set of tapers for the desired performance. One set discussed in the chapter was that of harmonically related sinusoids given in (5.5.8).
(a) Design a Matlab function [tapers] = sine_tapers(N,K) that generates K < N sinusoidal tapers of length N.
(b) Using the above function, compute and plot the Fourier transform magnitudes of the first 5 tapers of length 51.

5.29 Design a Matlab function Pxx = psd_sinetaper(x,K) that determines the multitaper estimates using the sine tapers.
(a) Apply the function psd_sinetaper to the AR(1) process given in Problem 5.26, and compare its performance.
(b) Apply the function psd_sinetaper to the AR(1) process given in Problem 5.27, and compare its performance.

CHAPTER 6

Optimum Linear Filters

In this chapter, we present the theory and application of optimum linear filters and predictors. We concentrate on linear filters that are optimum in the sense of minimizing the mean square error (MSE). The minimum MSE (MMSE) criterion leads to a theory of linear filtering that is elegant and simple, involves only second-order statistics, and is useful in many practical applications. The optimum filter designed for a given set of second-order moments can be used for any realizations of stochastic processes with the same moments. We start with the general theory of linear MMSE estimators and their computation, using the triangular decomposition of Hermitian positive definite matrices. Then we apply the general theory to the design of optimum FIR filters and linear predictors for both nonstationary and stationary processes (Wiener filters). We continue with the design of nonparametric (impulse response) and parametric (pole-zero) optimum IIR filters and predictors for stationary processes. Then we present the design of optimum filters for inverse system modeling, blind deconvolution, and their application to equalization of data communication channels. We conclude with a concise introduction to optimum matched filters and eigenfilters that maximize the output SNR. These signal processing methods find extensive applications in digital communication, radar, and sonar systems.

6.1 OPTIMUM SIGNAL ESTIMATION

As we discussed in Chapter 1, the solution of many problems of practical interest depends on the ability to accurately estimate the value y(n) of a signal (desired response) by using a set of values (observations or data) from another related signal or signals. Successful estimation is possible if there is significant statistical dependence or correlation between the signals involved in the particular application. For example, in the linear prediction problem we use the M past samples $x(n-1), x(n-2), \ldots, x(n-M)$ of a signal to estimate the current sample x(n). The echo canceler in Figure 1.17 uses the transmitted signal to form a replica of the received echo. The radar signal processor in Figure 1.27 uses the signals $x_k(n)$ for $1 \le k \le M$ received by the linear antenna array to estimate the value of the signal y(n) received from the direction of interest. Although the signals in these and other similar applications have different physical origins, the mathematical formulations of the underlying signal processing problems are very similar.

In array signal processing, the data are obtained by using M different sensors. The situation is simpler for filtering applications, because the data are obtained by delaying a single discrete-time signal; that is, we have $x_k(n) = x(n+1-k)$, $1 \le k \le M$ (see Figure 6.1). Further simplifications are possible in linear prediction, where both the desired response and the data are time samples of the same signal, for example, $y(n) = x(n)$ and $x_k(n) = x(n-k)$, $1 \le k \le M$. As a result, the design and implementation of optimum filters and predictors are simpler than those for an optimum array processor.

FIGURE 6.1 Illustration of the data vectors for (a) array processing (multiple sensors) and (b) FIR filtering or prediction (single sensor) applications.

Since array processing problems are the most general ones, we will formulate and solve the following estimation problem: Given a set of data $x_k(n)$ for $1 \le k \le M$, determine an estimate $\hat y(n)$ of the desired response y(n), using the rule (estimator)

$$\hat y(n) \triangleq H\{x_k(n),\ 1 \le k \le M\} \tag{6.1.1}$$

which, in general, is a nonlinear function of the data. When $x_k(n) = x(n+1-k)$, the estimator takes on the form of a discrete-time filter that can be linear or nonlinear, time-invariant or time-varying, and with a finite- or infinite-duration impulse response. Linear filters can be implemented using any direct, parallel, cascade, or lattice-ladder structure (see Section 2.5 and Proakis and Manolakis 1996). The difference between the estimated response $\hat y(n)$ and the desired response y(n), that is,

$$e(n) \triangleq y(n) - \hat y(n) \tag{6.1.2}$$

is known as the error signal. We want to find an estimator whose output approximates the desired response as closely as possible according to a certain performance criterion. We use the term optimum estimator or optimum signal processor to refer to such an estimator. We stress that optimum is not used as a synonym for best; it simply means the best under the given set of assumptions and conditions. If either the criterion of performance or the assumptions about the statistics of the processed signals change, the corresponding optimum filter will change as well. Therefore, an optimum estimator designed for a certain performance metric and set of assumptions may perform poorly according to some other criterion or if the actual statistics of the processed signals differ from the ones used in the design. For this reason, the sensitivity of the performance to deviations from the assumed statistics is very important in practical applications of optimum estimators.

Therefore, the design of an optimum estimator involves the following steps:
1. Selection of a computational structure with well-defined parameters for the implementation of the estimator.
2. Selection of a criterion of performance or cost function that measures the performance of the estimator under some assumptions about the statistical properties of the signals to be processed.
3. Optimization of the performance criterion to determine the parameters of the optimum estimator.
4. Evaluation of the optimum value of the performance criterion to determine whether the optimum estimator satisfies the design specifications.

Many practical applications (e.g., speech, audio, and image coding) require subjective criteria that are difficult to express mathematically. Thus, we focus on criteria of performance that (1) only depend on the estimation error e(n), (2) provide a sufficient measure of the user satisfaction, and (3) lead to a mathematically tractable problem. We generally select a criterion of performance by compromising between these objectives. Since, in most applications, negative and positive errors are equally harmful, we should choose a criterion that weights both negative and positive errors equally. Choices that satisfy this requirement include the absolute value of the error $|e(n)|$, or the squared error $|e(n)|^2$, or some other power of $|e(n)|$ (see Figure 6.2). The emphasis put on different values of the error is a key factor when we choose a criterion of performance. For example, the squared-error criterion emphasizes the effect of large errors much more than the absolute error criterion. Thus, the squared-error criterion is more sensitive to outliers (occasional large values) than the absolute error criterion is.

FIGURE 6.2 Graphical illustration of various error-weighting functions. [Curves $|e|^{1/2}$, $|e|$, $|e|^2$, and $|e|^3$; horizontal axis error value e from −5 to 5; vertical axis weight.]
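The different sensitivity of the squared-error and absolute-error criteria to outliers can be checked with a toy experiment. The sketch below is an illustration only: it considers the simplest possible estimator, a single constant c chosen to minimize the total weighted error over a small data set, and it finds c by a brute-force grid search. The function names and the data values are hypothetical. It is a known fact that the squared-error optimum for this problem is the sample mean and the absolute-error optimum is the sample median, which the search should reproduce to within the grid spacing.

```python
import statistics


def total_cost(errors, power):
    """Sum of |e|**power over a list of errors."""
    return sum(abs(e) ** power for e in errors)


def best_constant(data, power, grid_step=0.01):
    """Brute-force search for the constant c minimizing sum |d - c|**power."""
    lo, hi = min(data), max(data)
    steps = int(round((hi - lo) / grid_step))
    candidates = [lo + i * grid_step for i in range(steps + 1)]
    return min(candidates, key=lambda c: total_cost([d - c for d in data], power))


if __name__ == "__main__":
    clean = [0.9, 1.0, 1.0, 1.0, 1.1, 1.2]
    dirty = clean + [10.0]                     # one occasional large value
    for name, data in (("clean", clean), ("with outlier", dirty)):
        c2 = best_constant(data, 2)            # squared-error optimum
        c1 = best_constant(data, 1)            # absolute-error optimum
        print(f"{name}: squared-error c = {c2:.2f} (mean {statistics.mean(data):.2f}), "
              f"absolute-error c = {c1:.2f} (median {statistics.median(data):.2f})")
```

With these data the single outlier should move the squared-error optimum by more than one unit while leaving the absolute-error optimum essentially unchanged, which is the behavior described in the text.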

To develop a mathematical theory that will help to design and analyze the performance of optimum estimators, we assume that the desired response and the data are realizations of stochastic processes. Furthermore, although in practice the estimator operates on specific realizations of the input and desired response signals, we wish to design an estimator with good performance across all members of the ensemble, that is, an estimator that “works well on average.” Since, at any fixed time n, the quantities y(n), $x_k(n)$ for $1 \le k \le M$, and e(n) are random variables, we should choose a criterion that involves the ensemble or time averaging of some function of $|e(n)|$. Here is a short list of potential criteria of performance:

1. The mean square error criterion

$$P(n) \triangleq E\{|e(n)|^2\} \tag{6.1.3}$$

which leads, in general, to a nonlinear optimum estimator.

2. The mean αth-order error criterion $E\{|e(n)|^\alpha\}$, $\alpha \ne 2$. Using a lower- or higher-order moment of the absolute error is more appropriate for certain types of non-Gaussian statistics than the MSE (Stuck 1978).

3. The sum of squared errors (SSE)

$$E(n_i, n_f) \triangleq \sum_{n=n_i}^{n_f} |e(n)|^2 \tag{6.1.4}$$

which, if it is divided by $n_f - n_i + 1$, provides an estimate of the MSE.

The MSE criterion (6.1.3) and the SSE criterion (6.1.4) are the most widely used because they (1) are mathematically tractable, (2) lead to the design of useful systems for practical applications, and (3) can serve as a yardstick for evaluating estimators designed with other criteria (e.g., signal-to-noise ratio, maximum likelihood). In most practical applications, we use linear estimators, which further simplifies their design and evaluation.

Mean square estimation is a rather vast field that was originally developed by Gauss in the nineteenth century. The current theories of estimation and optimum filtering started with the pioneering work of Wiener and Kolmogorov that was later extended by Kalman, Bucy, and others. Some interesting historical reviews are given in Kailath (1974) and Sorenson (1970).
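The relation between the SSE criterion (6.1.4) and the MSE criterion (6.1.3) can be made concrete with a few lines of code. The following is a minimal sketch with hypothetical function names: it evaluates the SSE over an inclusive index range and divides by the number of terms to obtain a time-average estimate of the MSE. The Gaussian error sequence with variance 4 is an assumed test signal, chosen only so that the expected answer is known.

```python
import random


def sum_squared_errors(e, ni, nf):
    """SSE of (6.1.4): sum of |e(n)|^2 for ni <= n <= nf (inclusive)."""
    return sum(abs(e[n]) ** 2 for n in range(ni, nf + 1))


def mse_estimate(e, ni, nf):
    """Time-average estimate of the MSE: SSE divided by the number of terms."""
    return sum_squared_errors(e, ni, nf) / (nf - ni + 1)


if __name__ == "__main__":
    # Small complex-valued example: |3+4j|^2 + |1|^2 + |-2|^2 = 25 + 1 + 4
    e = [3 + 4j, 1, -2]
    print(sum_squared_errors(e, 0, 2))   # 30.0
    print(mse_estimate(e, 0, 2))         # 10.0

    # Assumed test signal: zero-mean Gaussian errors with variance 4
    rng = random.Random(0)
    noise = [rng.gauss(0.0, 2.0) for _ in range(20000)]
    print(mse_estimate(noise, 0, len(noise) - 1))  # should be close to 4
```

The time average approximates the ensemble average in (6.1.3) only when the error sequence is stationary and ergodic over the interval, which is an assumption of this sketch.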

6.2 LINEAR MEAN SQUARE ERROR ESTIMATION

In this section, we develop the theory of linear MSE estimation. We concentrate on linear estimators for various reasons, including mathematical simplicity and ease of implementation. The problem can be stated as follows:

Design an estimator that provides an estimate $\hat{y}(n)$ of the desired response $y(n)$ using a linear combination of the data $x_k(n)$ for $1 \le k \le M$, such that the MSE $E\{|y(n) - \hat{y}(n)|^2\}$ is minimized.

More specifically, the linear estimator is defined by

$$\hat{y}(n) \triangleq \sum_{k=1}^{M} c_k^*(n)\, x_k(n) \tag{6.2.1}$$

and the goal is to determine the coefficients $c_k(n)$ for $1 \le k \le M$ such that the MSE (6.1.3) is minimized. In general, a new set of optimum coefficients should be computed for each time instant $n$. Since we assume that the desired response and the data are realizations of stochastic processes, the quantities $y(n), x_1(n), \ldots, x_M(n)$ are random variables at any fixed time $n$. For convenience, we formulate and solve the estimation problem at a fixed time instant $n$. Thus, we drop the time index $n$ and restate the problem as follows:

Estimate a random variable $y$ (the desired response) from a set of related random variables $x_1, x_2, \ldots, x_M$ (data) using the linear estimator

$$\hat{y} \triangleq \sum_{k=1}^{M} c_k^* x_k = \mathbf{c}^H \mathbf{x} \tag{6.2.2}$$

where

$$\mathbf{x} = [x_1 \; x_2 \; \cdots \; x_M]^T \tag{6.2.3}$$

is the input data vector and

$$\mathbf{c} = [c_1 \; c_2 \; \cdots \; c_M]^T \tag{6.2.4}$$

is the parameter or coefficient vector of the estimator. Unless otherwise stated, all random variables are assumed to have zero-mean values. The number $M$ of data components used is called the order of the estimator. The linear estimator (6.2.2) is represented graphically as shown in Figure 6.3 and involves a computational structure known as the linear combiner. The MSE

$$P \triangleq E\{|e|^2\} \tag{6.2.5}$$

where

$$e \triangleq y - \hat{y} \tag{6.2.6}$$

is a function of the parameters $c_k$. Minimization of (6.2.5) with respect to the parameters $c_k$ leads to a linear estimator $\mathbf{c}_o$ that is optimum in the MSE sense. The parameter vector $\mathbf{c}_o$ is known as the linear MMSE (LMMSE) estimator and $\hat{y}_o$ as the LMMSE estimate.

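FIGURE 6.3 Block diagram representation of the linear estimator. [Block diagram: the data $x_1, x_2, \ldots, x_M$ are weighted by the estimator parameters $c_1^*, c_2^*, \ldots, c_M^*$ in a linear combiner to form the estimate $\hat{y}$, which is subtracted from the desired response $y$ to give the error $e$.]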
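The linear combiner of (6.2.2) and the error of (6.2.6) can be sketched in a few lines of Python. The function name and the numerical values of c, x, and y below are hypothetical; the point is only that the coefficients enter conjugated.

```python
def linear_combiner(c, x):
    """Return the estimate c^H x = sum_k conj(c_k) * x_k of (6.2.2)."""
    if len(c) != len(x):
        raise ValueError("c and x must both have length M")
    return sum(ck.conjugate() * xk for ck, xk in zip(c, x))


# Hypothetical second-order example (M = 2) with complex-valued quantities.
c = [1 + 1j, 2 + 0j]      # estimator parameters
x = [1j, 3 + 0j]          # data
y = 5 + 0j                # desired response

y_hat = linear_combiner(c, x)   # conj(1+1j)*1j + 2*3
e = y - y_hat                   # error (6.2.6)
print(y_hat, e, abs(e) ** 2)
```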

6.2.1 Error Performance Surface

To determine the linear MMSE estimator, we seek the value of the parameter vector $\mathbf{c}$ that minimizes the function (6.2.5). To this end, we want to express the MSE as a function of the parameter vector $\mathbf{c}$ and to understand the nature of this dependence. By using (6.2.5), (6.2.6), (6.2.2), and the linearity property of the expectation operator, the MSE is given by

$$P(\mathbf{c}) = E\{|e|^2\} = E\{(y - \mathbf{c}^H\mathbf{x})(y^* - \mathbf{x}^H\mathbf{c})\} = E\{|y|^2\} - \mathbf{c}^H E\{\mathbf{x}y^*\} - E\{y\mathbf{x}^H\}\mathbf{c} + \mathbf{c}^H E\{\mathbf{x}\mathbf{x}^H\}\mathbf{c}$$

or more compactly,

$$P(\mathbf{c}) = P_y - \mathbf{c}^H\mathbf{d} - \mathbf{d}^H\mathbf{c} + \mathbf{c}^H\mathbf{R}\mathbf{c} \tag{6.2.7}$$

where

$$P_y \triangleq E\{|y|^2\} \tag{6.2.8}$$

is the power of the desired response,

$$\mathbf{d} \triangleq E\{\mathbf{x}y^*\} \tag{6.2.9}$$

is the cross-correlation vector between the data vector $\mathbf{x}$ and the desired response $y$, and

$$\mathbf{R} \triangleq E\{\mathbf{x}\mathbf{x}^H\} \tag{6.2.10}$$

is the correlation matrix of the data vector $\mathbf{x}$. The matrix $\mathbf{R}$ is guaranteed to be Hermitian and nonnegative definite (see Section 3.4.4).

The function $P(\mathbf{c})$ is known as the error performance surface of the estimator. Equation (6.2.7) shows that the MSE $P(\mathbf{c})$ (1) depends only on the second-order moments of the desired response and the data and (2) is a quadratic function of the estimator coefficients and represents an $(M + 1)$-dimensional surface with $M$ degrees of freedom. We will see that if $\mathbf{R}$ is positive definite, then the quadratic function $P(\mathbf{c})$ is bowl-shaped and has a unique minimum that corresponds to the optimum parameters. The next example illustrates this fact for the second-order case.

EXAMPLE 6.2.1.

If $M = 2$ and the random variables $y$, $x_1$, and $x_2$ are real-valued, the MSE is

$$P(c_1, c_2) = P_y - 2d_1c_1 - 2d_2c_2 + r_{11}c_1^2 + 2r_{12}c_1c_2 + r_{22}c_2^2$$

because $r_{12} = r_{21}$. Thus $P(c_1, c_2)$ is a second-order function of the coefficients $c_1$ and $c_2$. Figure 6.4 shows two plots of the function $P(c_1, c_2)$ that are quite different in appearance. The surface in Figure 6.4(a) looks like a bowl and has a unique extremum that is a minimum. The values for the error surface parameters are $P_y = 0.5$, $r_{11} = r_{22} = 4.5$, $r_{12} = r_{21} = -0.1545$, $d_1 = -0.5$, and $d_2 = -0.1545$. On the other hand, in Figure 6.4(b), we have a saddle point that is neither a minimum nor a maximum (here only the matrix elements have changed to $r_{11} = r_{22} = 1$, $r_{12} = r_{21} = 2$). If we cut the surfaces with planes parallel to the $(c_1, c_2)$ plane, we obtain contours of constant MSE that are shown in Figure 6.4(c) and (d).

FIGURE 6.4 Representative surface and contour plots for positive definite and indefinite quadratic error performance surfaces. [Panels (a) and (b): MSE surfaces over the $(c_1, c_2)$ plane; panels (c) and (d): the corresponding contours of constant MSE, with the minimum $P(c_{1o}, c_{2o})$ marked in (c).]

In conclusion, the error performance surface is bowl-shaped and has a unique minimum only if the matrix $\mathbf{R}$ is positive definite (the determinants of the two matrices are 20.23 and −3, respectively). Only in this case can we obtain an estimator that minimizes the MSE, and the contours are concentric ellipses whose center corresponds to the optimum estimator. The bottom of the bowl is determined by setting the partial derivatives with respect to the unknown parameters to zero, that is,

$$\frac{\partial P(c_1, c_2)}{\partial c_1} = 0 \qquad \text{and} \qquad \frac{\partial P(c_1, c_2)}{\partial c_2} = 0$$

which results in

$$r_{11}c_{1o} + r_{12}c_{2o} = d_1$$
$$r_{12}c_{1o} + r_{22}c_{2o} = d_2$$

This is a linear system of two equations with two unknowns whose solution provides the coefficients $c_{1o}$ and $c_{2o}$ that minimize the MSE function $P(c_1, c_2)$.
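The numbers quoted in Example 6.2.1 can be checked with a short Python sketch. It solves the two normal equations by Cramer's rule and tests positive definiteness through the leading principal minors; the function names are illustrative choices, not routines from the book.

```python
def solve_2x2(r11, r12, r22, d1, d2):
    """Solve r11*c1 + r12*c2 = d1, r12*c1 + r22*c2 = d2 by Cramer's rule."""
    det = r11 * r22 - r12 * r12
    if det == 0:
        raise ValueError("R is singular")
    c1 = (d1 * r22 - r12 * d2) / det
    c2 = (r11 * d2 - r12 * d1) / det
    return c1, c2, det


def mse(c1, c2, Py, r11, r12, r22, d1, d2):
    """Real-valued error surface P(c1, c2) of Example 6.2.1."""
    return (Py - 2 * d1 * c1 - 2 * d2 * c2
            + r11 * c1 ** 2 + 2 * r12 * c1 * c2 + r22 * c2 ** 2)


def is_positive_definite_2x2(r11, r12, r22):
    """Leading principal minors of a real symmetric 2 x 2 matrix are positive."""
    return r11 > 0 and r11 * r22 - r12 * r12 > 0


Py, d1, d2 = 0.5, -0.5, -0.1545

# Surface of Figure 6.4(a): bowl with a unique minimum.
c1o, c2o, det_a = solve_2x2(4.5, -0.1545, 4.5, d1, d2)
P_min = mse(c1o, c2o, Py, 4.5, -0.1545, 4.5, d1, d2)

# Surface of Figure 6.4(b): the stationary point is a saddle.
_, _, det_b = solve_2x2(1.0, 2.0, 1.0, d1, d2)

print(round(det_a, 2), det_b)   # determinants of the two matrices
print(c1o, c2o, P_min)
```

Moving away from (c1o, c2o) in any direction increases the MSE only for the first matrix; for the second matrix some directions decrease it, which is what the saddle in Figure 6.4(b) depicts.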

When the optimum filter is specified by a rational system function, the error performance surface may be nonquadratic. This is illustrated in the following example.

EXAMPLE 6.2.2. Suppose that we wish to estimate the real-valued output $y(n)$ of the “unknown” system (see Figure 6.5)

$$G(z) = \frac{0.05 - 0.4z^{-1}}{1 - 1.1314z^{-1} + 0.25z^{-2}}$$

using the pole-zero filter

$$H(z) = \frac{b}{1 - az^{-1}}$$

by minimizing the MSE $E\{e^2(n)\}$ (Johnson and Larimore 1977). The input signal $x(n)$ is white noise with zero mean and variance $\sigma_x^2$. The MSE is given by

$$E\{e^2(n)\} = E\{[y(n) - \hat{y}(n)]^2\} = E\{y^2(n)\} - 2E\{y(n)\hat{y}(n)\} + E\{\hat{y}^2(n)\}$$

and is a function of the parameters $b$ and $a$.

FIGURE 6.5 Identification of an “unknown” system using an optimum filter. [Block diagram: the white noise $x(n)$ drives both the “unknown” system $G(z)$, with output $y(n)$, and the filter $H(z)$, with output $\hat{y}(n)$; the error is $e(n) = y(n) - \hat{y}(n)$.]

Since the impulse response $h(n) = ba^n u(n)$ of the optimum filter has infinite duration, we cannot use (6.2.7) to compute $E\{e^2(n)\}$ and to plot the error surface. The three components of $E\{e^2(n)\}$ can be evaluated as follows, using Parseval's theorem. The power of the desired response

$$E\{y^2(n)\} = \sigma_x^2 \sum_{n=0}^{\infty} g^2(n) = \frac{\sigma_x^2}{2\pi j} \oint G(z)G(z^{-1})z^{-1}\,dz \triangleq \sigma_x^2 \sigma_g^2$$

is constant and can be computed either numerically by using the first $M$ “nonzero” samples of $g(n)$ or analytically by evaluating the integral using the residue theorem. The power of the optimum filter output is

$$E\{\hat{y}^2(n)\} = E\{x^2(n)\} \sum_{n=0}^{\infty} h^2(n) = \frac{\sigma_x^2}{2\pi j} \oint H(z)H(z^{-1})z^{-1}\,dz = \sigma_x^2 \frac{b^2}{1 - a^2}$$

which is a function of the parameters $b$ and $a$. The middle term is

$$E\{y(n)\hat{y}(n)\} = E\left\{\sum_{k=0}^{\infty} g(k)x(n-k) \sum_{m=0}^{\infty} h(m)x(n-m)\right\} = \sigma_x^2 \sum_{k=0}^{\infty} g(k)h(k) = \frac{\sigma_x^2}{2\pi j} \oint G(z)H(z^{-1})z^{-1}\,dz = \sigma_x^2\, b\, G(z)\big|_{z^{-1}=a}$$

because $E\{x(n-k)x(n-m)\} = \sigma_x^2\,\delta(m-k)$. For convenience we compute the normalized MSE

$$P(b, a) \triangleq \frac{E\{e^2(n)\}}{\sigma_x^2\sigma_g^2} = 1 - \frac{2b}{\sigma_g^2}\, G(z)\big|_{z^{-1}=a} + \frac{1}{\sigma_g^2}\,\frac{b^2}{1 - a^2}$$

whose surface and contour plots are shown in Figure 6.6. We note that the resulting error performance surface is bimodal with a global minimum $P = 0.277$ at $(b, a) = (-0.311, 0.906)$ and a local minimum $P = 0.976$ at $(b, a) = (0.114, -0.519)$. As a result, the determination of the optimum filter requires the use of nonlinear optimization techniques with all associated drawbacks.
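The two minima reported in Example 6.2.2 can be reproduced numerically. The Python sketch below assumes the normalization by the power of the desired response, approximates the energy of g(n) by truncating the impulse response (the numerical route mentioned in the example), and evaluates P(b, a) at the two reported points. Function names and the truncation length are illustrative assumptions.

```python
def impulse_response_energy(num, den, n_samples=2000):
    """Approximate sigma_g^2 = sum_n g(n)^2 for G(z) = num(z^-1) / den(z^-1).

    num and den hold coefficients of increasing powers of z^-1, den[0] == 1.
    g(n) is generated from the difference equation and truncated.
    """
    g = []
    for n in range(n_samples):
        value = num[n] if n < len(num) else 0.0
        for k in range(1, len(den)):
            if n - k >= 0:
                value -= den[k] * g[n - k]
        g.append(value)
    return sum(v * v for v in g)


def evaluate(coeffs, z_inv):
    """Evaluate a polynomial in z^-1 at the point z^-1 = z_inv."""
    return sum(c * z_inv ** k for k, c in enumerate(coeffs))


NUM = [0.05, -0.4]
DEN = [1.0, -1.1314, 0.25]
SIGMA_G2 = impulse_response_energy(NUM, DEN)


def normalized_mse(b, a):
    """P(b, a) = 1 - (2b/sg2) G(z)|_{z^-1=a} + b^2 / (sg2 (1 - a^2)), |a| < 1."""
    if abs(a) >= 1:
        raise ValueError("H(z) is stable only for |a| < 1")
    G_at_a = evaluate(NUM, a) / evaluate(DEN, a)
    return 1 - 2 * b * G_at_a / SIGMA_G2 + b * b / (SIGMA_G2 * (1 - a * a))


# Compare with the minima P = 0.277 and P = 0.976 quoted in the example.
print(round(normalized_mse(-0.311, 0.906), 3))
print(round(normalized_mse(0.114, -0.519), 3))
```

For a fixed pole a the surface is quadratic in b, with minimizer b = (1 − a²)G(z) evaluated at z⁻¹ = a; the nonquadratic, bimodal behavior comes entirely from the dependence on a.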

FIGURE 6.6 Illustration of the nonquadratic form of the error performance surface of a pole-zero optimum filter specified by the coefficients of its difference equation. [Surface plot of $P(b, a)$ over the $(b, a)$ plane, with the two minima marked.]

6.2.2 Derivation of the Linear MMSE Estimator

The approach in Example 6.2.1 can be generalized to obtain the necessary and sufficient conditions† that determine the linear MMSE estimator. Here, we present a simpler matrix-based approach that is sufficient for the scope of this chapter. We first notice that we can put (6.2.7) into the form of a “perfect square” as

$$P(\mathbf{c}) = P_y - \mathbf{d}^H\mathbf{R}^{-1}\mathbf{d} + (\mathbf{R}\mathbf{c} - \mathbf{d})^H\mathbf{R}^{-1}(\mathbf{R}\mathbf{c} - \mathbf{d}) \tag{6.2.11}$$

where only the third term depends on $\mathbf{c}$. If $\mathbf{R}$ is positive definite, the inverse matrix $\mathbf{R}^{-1}$ exists and is positive definite; that is, $\mathbf{z}^H\mathbf{R}^{-1}\mathbf{z} > 0$ for all $\mathbf{z} \neq \mathbf{0}$. Therefore, if $\mathbf{R}$ is positive definite, the term $\mathbf{d}^H\mathbf{R}^{-1}\mathbf{d} > 0$ decreases the cost function by an amount determined exclusively by the second-order moments. In contrast, the term $(\mathbf{R}\mathbf{c} - \mathbf{d})^H\mathbf{R}^{-1}(\mathbf{R}\mathbf{c} - \mathbf{d}) \ge 0$ increases the cost function depending on the choice of the estimator parameters. Thus, the best estimator is obtained by setting $\mathbf{R}\mathbf{c} - \mathbf{d} = \mathbf{0}$. Therefore, the necessary and sufficient conditions that determine the linear MMSE estimator $\mathbf{c}_o$ are

$$\mathbf{R}\mathbf{c}_o = \mathbf{d} \tag{6.2.12}$$

and

$$\mathbf{R} \text{ is positive definite} \tag{6.2.13}$$

† For complex-valued random variables, there are some complications that should be taken into account because $|e|^2$ is not an analytic function. This topic is discussed in Appendix B.
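The equivalence of (6.2.7) and (6.2.11) follows by expanding the quadratic term. The short derivation below is a sketch that uses only the Hermitian symmetry $\mathbf{R}^H = \mathbf{R}$, which also gives $(\mathbf{R}^{-1})^H = \mathbf{R}^{-1}$:

$$
\begin{aligned}
(\mathbf{R}\mathbf{c} - \mathbf{d})^H\mathbf{R}^{-1}(\mathbf{R}\mathbf{c} - \mathbf{d})
&= (\mathbf{c}^H\mathbf{R} - \mathbf{d}^H)\,\mathbf{R}^{-1}(\mathbf{R}\mathbf{c} - \mathbf{d}) \\
&= \mathbf{c}^H\mathbf{R}\mathbf{c} - \mathbf{c}^H\mathbf{d} - \mathbf{d}^H\mathbf{c} + \mathbf{d}^H\mathbf{R}^{-1}\mathbf{d}
\end{aligned}
$$

Adding $P_y - \mathbf{d}^H\mathbf{R}^{-1}\mathbf{d}$ to both sides cancels the last term and returns the right-hand side of (6.2.7).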

In greater detail, (6.2.12) can be written as

$$
\begin{bmatrix}
r_{11} & r_{12} & \cdots & r_{1M} \\
r_{21} & r_{22} & \cdots & r_{2M} \\
\vdots & \vdots & \ddots & \vdots \\
r_{M1} & r_{M2} & \cdots & r_{MM}
\end{bmatrix}
\begin{bmatrix} c_1 \\ c_2 \\ \vdots \\ c_M \end{bmatrix}
=
\begin{bmatrix} d_1 \\ d_2 \\ \vdots \\ d_M \end{bmatrix}
\tag{6.2.14}
$$

where

$$r_{ij} \triangleq E\{x_i x_j^*\} = r_{ji}^* \tag{6.2.15}$$

and

$$d_i \triangleq E\{x_i y^*\} \tag{6.2.16}$$

The equations (6.2.14) are known as the set of normal equations. The invertibility of the correlation matrix $\mathbf{R}$, and hence the existence of the optimum estimator, is guaranteed if $\mathbf{R}$ is positive definite. In theory, $\mathbf{R}$ is guaranteed to be nonnegative definite, but in physical applications it will almost always be positive definite. The normal equations can be solved by using any general-purpose routine for a set of linear equations. Using (6.2.11) and (6.2.12), we find that the MMSE $P_o$ is

$$P_o = P_y - \mathbf{d}^H\mathbf{R}^{-1}\mathbf{d} = P_y - \mathbf{d}^H\mathbf{c}_o \tag{6.2.17}$$

where we can easily show that the term $\mathbf{d}^H\mathbf{c}_o$ is equal to $P_{\hat{y}_o} \triangleq E\{|\hat{y}_o|^2\}$, the power of the optimum estimate. If $\mathbf{x}$ and $y$ are uncorrelated ($\mathbf{d} = \mathbf{0}$), we have the worst situation ($P_o = P_y$) because there is no linear estimator that can reduce the MSE. If $\mathbf{d} \neq \mathbf{0}$, there is always going to be some reduction in the MSE owing to the correlation between the data vector $\mathbf{x}$ and the desired response $y$, assuming that $\mathbf{R}$ is positive definite. The best situation corresponds to $\hat{y} = y$, which gives $P_o = 0$. Thus, for comparison purposes, we use the normalized MSE

$$\mathcal{E} \triangleq \frac{P_o}{P_y} = 1 - \frac{P_{\hat{y}_o}}{P_y} \tag{6.2.18}$$

because it is bounded between 0 and 1, that is,

$$0 \le \mathcal{E} \le 1 \tag{6.2.19}$$

If $\tilde{\mathbf{c}}$ is the deviation from the optimum vector $\mathbf{c}_o$, that is, if $\mathbf{c} = \mathbf{c}_o + \tilde{\mathbf{c}}$, then substituting into (6.2.11) and using (6.2.17), we obtain

$$P(\mathbf{c}_o + \tilde{\mathbf{c}}) = P(\mathbf{c}_o) + \tilde{\mathbf{c}}^H\mathbf{R}\tilde{\mathbf{c}} \tag{6.2.20}$$

Equation (6.2.20) shows that if $\mathbf{R}$ is positive definite, any deviation $\tilde{\mathbf{c}} \neq \mathbf{0}$ from the optimum vector $\mathbf{c}_o$ increases the MSE by an amount $\tilde{\mathbf{c}}^H\mathbf{R}\tilde{\mathbf{c}} > 0$, which is known as the excess MSE, that is,

$$\text{Excess MSE} \triangleq P(\mathbf{c}_o + \tilde{\mathbf{c}}) - P(\mathbf{c}_o) = \tilde{\mathbf{c}}^H\mathbf{R}\tilde{\mathbf{c}} \tag{6.2.21}$$

We emphasize that the excess MSE depends only on the input correlation matrix and not on the desired response. This fact has important implications because any deviation from the optimum can be detected by monitoring the MSE.
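A quick numerical check of (6.2.21) is sketched below in Python for the real-valued matrix of Example 6.2.1. The deviation vector and the second cross-correlation vector d_alt are hypothetical values chosen only to show that the excess MSE does not change when the desired response changes.

```python
def mse(c, Py, d, R):
    """Error surface (6.2.7): P(c) = Py - c^H d - d^H c + c^H R c."""
    cHd = sum(ci.conjugate() * di for ci, di in zip(c, d))
    cHRc = sum(c[i].conjugate() * R[i][j] * c[j]
               for i in range(len(c)) for j in range(len(c)))
    return (Py - 2 * cHd.real + cHRc).real


def optimum_2x2(R, d):
    """Solve the 2 x 2 normal equations R c_o = d by Cramer's rule."""
    det = R[0][0] * R[1][1] - R[0][1] * R[1][0]
    return [(d[0] * R[1][1] - R[0][1] * d[1]) / det,
            (R[0][0] * d[1] - R[1][0] * d[0]) / det]


def excess_mse(R, d, Py, c_dev):
    """P(c_o + c_dev) - P(c_o), evaluated directly from the error surface."""
    c_o = optimum_2x2(R, d)
    c = [a + b for a, b in zip(c_o, c_dev)]
    return mse(c, Py, d, R) - mse(c_o, Py, d, R)


R = [[4.5, -0.1545], [-0.1545, 4.5]]
c_dev = [0.3, -0.2]                      # hypothetical deviation from c_o

quad = sum(c_dev[i] * R[i][j] * c_dev[j] for i in range(2) for j in range(2))
excess = excess_mse(R, [-0.5, -0.1545], 0.5, c_dev)
excess_alt = excess_mse(R, [1.0, 2.0], 3.0, c_dev)   # different d and Py

print(quad, excess, excess_alt)          # all three agree up to rounding
```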

For nonzero-mean random variables, we use the estimator $\hat{y} \triangleq c_0 + \mathbf{c}^H\mathbf{x}$. The elements of $\mathbf{R}$ and $\mathbf{d}$ are replaced by the corresponding covariances and $c_0 = E\{y\} - \mathbf{c}^H E\{\mathbf{x}\}$ (see Problem 6.1). In the sequel, unless otherwise explicitly stated, we assume that all random variables have zero mean or have been reduced to zero mean by replacing $y$ by $y - E\{y\}$ and $\mathbf{x}$ by $\mathbf{x} - E\{\mathbf{x}\}$.

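6.2.3 Principal-Component Analysis of the Optimum Linear Estimator

The properties of optimum linear estimators and their error performance surfaces depend on the correlation matrix $\mathbf{R}$. We can learn a lot about the nature of the optimum estimator if we express $\mathbf{R}$ in terms of its eigenvalues and eigenvectors. Indeed, from Section 3.4.4, we have

$$\mathbf{R} = \mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^H = \sum_{i=1}^{M} \lambda_i \mathbf{q}_i\mathbf{q}_i^H \qquad \text{and} \qquad \boldsymbol{\Lambda} = \mathbf{Q}^H\mathbf{R}\mathbf{Q} \tag{6.2.22}$$

where

$$\boldsymbol{\Lambda} \triangleq \mathrm{diag}\{\lambda_1, \lambda_2, \ldots, \lambda_M\} \tag{6.2.23}$$

contains the eigenvalues of $\mathbf{R}$, assumed to be distinct, and

$$\mathbf{Q} \triangleq [\mathbf{q}_1 \; \mathbf{q}_2 \; \cdots \; \mathbf{q}_M] \tag{6.2.24}$$

contains the eigenvectors of $\mathbf{R}$. The modal matrix $\mathbf{Q}$ is unitary, that is,

$$\mathbf{Q}^H\mathbf{Q} = \mathbf{I} \tag{6.2.25}$$

which implies that $\mathbf{Q}^{-1} = \mathbf{Q}^H$. The relationship (6.2.22) between $\mathbf{R}$ and $\boldsymbol{\Lambda}$ is known as a similarity transformation.

In general, the multiplication of a vector by a matrix changes both the length and the direction of the vector. We define a coordinate transformation of the optimum parameter vector by

$$\mathbf{c}'_o \triangleq \mathbf{Q}^H\mathbf{c}_o \qquad \text{or} \qquad \mathbf{c}_o \triangleq \mathbf{Q}\mathbf{c}'_o \tag{6.2.26}$$

Since

$$\|\mathbf{c}_o\|^2 = (\mathbf{Q}\mathbf{c}'_o)^H\mathbf{Q}\mathbf{c}'_o = \mathbf{c}'^H_o\mathbf{Q}^H\mathbf{Q}\mathbf{c}'_o = \|\mathbf{c}'_o\|^2 \tag{6.2.27}$$

the transformation (6.2.26) changes the direction of the transformed vector but not its length. If we substitute (6.2.22) into the normal equations (6.2.12), we obtain

$$\mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^H\mathbf{c}_o = \mathbf{d} \qquad \text{or} \qquad \boldsymbol{\Lambda}\mathbf{Q}^H\mathbf{c}_o = \mathbf{Q}^H\mathbf{d}$$

which results in

$$\boldsymbol{\Lambda}\mathbf{c}'_o = \mathbf{d}' \tag{6.2.28}$$

where

$$\mathbf{d}' \triangleq \mathbf{Q}^H\mathbf{d} \qquad \text{or} \qquad \mathbf{d} \triangleq \mathbf{Q}\mathbf{d}' \tag{6.2.29}$$

is the transformed “decoupled” cross-correlation vector. Because $\boldsymbol{\Lambda}$ is diagonal, the set of $M$ equations (6.2.28) can be written as

$$\lambda_i c'_{o,i} = d'_i \qquad 1 \le i \le M \tag{6.2.30}$$

where $c'_{o,i}$ and $d'_i$ are the components of $\mathbf{c}'_o$ and $\mathbf{d}'$, respectively. This is an uncoupled set of $M$ first-order equations. If $\lambda_i \neq 0$, then

$$c'_{o,i} = \frac{d'_i}{\lambda_i} \qquad 1 \le i \le M \tag{6.2.31}$$

and if $\lambda_i = 0$, the value of $c'_{o,i}$ is indeterminate.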

The MMSE becomes

$$P_o = P_y - \mathbf{d}^H\mathbf{c}_o = P_y - (\mathbf{Q}\mathbf{d}')^H\mathbf{Q}\mathbf{c}'_o = P_y - \mathbf{d}'^H\mathbf{c}'_o = P_y - \sum_{i=1}^{M} d'^*_i c'_{o,i} = P_y - \sum_{i=1}^{M} \frac{|d'_i|^2}{\lambda_i} \tag{6.2.32}$$

which shows how the eigenvalues and the decoupled cross-correlations affect the performance of the optimum filter. The advantage of (6.2.31) and (6.2.32) is that we can study the behavior of each parameter of the optimum estimator independently of all the remaining ones.

To appreciate the significance of the principal-component transformation, we will discuss the error surface of a second-order estimator. However, all the results can be easily generalized to estimators of order $M$, whose error performance surface exists in a space of $M + 1$ dimensions. Figure 6.7 shows the contours of constant MSE for a positive definite, second-order error surface. The contours are concentric ellipses centered at the tip of the optimum vector $\mathbf{c}_o$. We define a new coordinate system with origin at $\mathbf{c}_o$ and axes determined by the major axis $\tilde{v}_1$ and the minor axis $\tilde{v}_2$ of the ellipses. The two axes are orthogonal, and the resulting system is known as the principal coordinate system. The transformation from the “old” system to the “new” system is done in two steps:

$$\text{Translation:} \quad \tilde{\mathbf{c}} = \mathbf{c} - \mathbf{c}_o \qquad\qquad \text{Rotation:} \quad \tilde{\mathbf{v}} = \mathbf{Q}^H\tilde{\mathbf{c}} \tag{6.2.33}$$

where the rotation changes the axes of the space to match the axes of the ellipsoid. The excess MSE (6.2.21) becomes

$$\Delta P(\tilde{\mathbf{v}}) = \tilde{\mathbf{c}}^H\mathbf{R}\tilde{\mathbf{c}} = \tilde{\mathbf{c}}^H\mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^H\tilde{\mathbf{c}} = \tilde{\mathbf{v}}^H\boldsymbol{\Lambda}\tilde{\mathbf{v}} = \sum_{i=1}^{M} \lambda_i|\tilde{v}_i|^2 \tag{6.2.34}$$

which shows that the penalty paid for the deviation of a parameter from its optimum value is proportional to the corresponding eigenvalue. Clearly, changes in uncoupled parameters (which correspond to $\lambda_i = 0$) do not affect the excess MSE. Using (6.2.22), we have

$$\mathbf{c}_o = \mathbf{R}^{-1}\mathbf{d} = \mathbf{Q}\boldsymbol{\Lambda}^{-1}\mathbf{Q}^H\mathbf{d} = \sum_{i=1}^{M} \frac{\mathbf{q}_i^H\mathbf{d}}{\lambda_i}\,\mathbf{q}_i = \sum_{i=1}^{M} \frac{d'_i}{\lambda_i}\,\mathbf{q}_i \tag{6.2.35}$$

FIGURE 6.7 Contours of constant MSE and principal-component axes for a second-order quadratic error surface. [Contour plot in the $(c_1, c_2)$ plane showing the optimum vector $\mathbf{c}_o$ and the principal axes $\tilde{v}_1$ and $\tilde{v}_2$.]

and the optimum estimate can be written as

$$\hat{y}_o = \mathbf{c}_o^H\mathbf{x} = \sum_{i=1}^{M} \frac{d'^*_i}{\lambda_i}\,(\mathbf{q}_i^H\mathbf{x}) \tag{6.2.36}$$

which leads to the representation of the optimum estimator shown in Figure 6.8. The eigenfilters $\mathbf{q}_i$ decorrelate the data vector $\mathbf{x}$ into its principal components, which are weighted and added to produce the optimum estimate.

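FIGURE 6.8 Principal-components representation of the optimum linear estimator. [Block diagram: the data vector $\mathbf{x}$ is applied to the eigenfilters to form $\mathbf{q}_1^H\mathbf{x}, \mathbf{q}_2^H\mathbf{x}, \ldots, \mathbf{q}_M^H\mathbf{x}$, which are weighted according to (6.2.36) and added to produce the optimum estimate $\hat{y}_o$.]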
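The decoupling in (6.2.28) to (6.2.35) can be illustrated for a real symmetric 2 × 2 correlation matrix, where the eigenvalues and eigenvectors have a closed form. The Python sketch below reuses the moments of Example 6.2.1; the helper name eig_sym_2x2 is an illustrative choice, and the routine assumes distinct eigenvalues, as the text does.

```python
import math


def eig_sym_2x2(r11, r12, r22):
    """Eigenvalues and unit eigenvectors of [[r11, r12], [r12, r22]].

    Assumes the two eigenvalues are distinct.
    """
    mean = 0.5 * (r11 + r22)
    radius = math.hypot(0.5 * (r11 - r22), r12)
    eigenvalues = [mean + radius, mean - radius]
    eigenvectors = []
    for lam in eigenvalues:
        if r12 != 0:
            v = (r12, lam - r11)            # satisfies (R - lam I) v = 0
        else:
            v = (1.0, 0.0) if abs(lam - r11) <= abs(lam - r22) else (0.0, 1.0)
        norm = math.hypot(v[0], v[1])
        eigenvectors.append((v[0] / norm, v[1] / norm))
    return eigenvalues, eigenvectors


Py, d = 0.5, (-0.5, -0.1545)
R = [[4.5, -0.1545], [-0.1545, 4.5]]
lams, qs = eig_sym_2x2(R[0][0], R[0][1], R[1][1])

d_prime = [q[0] * d[0] + q[1] * d[1] for q in qs]                  # d' = Q^H d
c_prime = [dp / lam for dp, lam in zip(d_prime, lams)]             # (6.2.31)
c_o = [sum(cp * q[i] for cp, q in zip(c_prime, qs)) for i in range(2)]  # (6.2.35)
P_o = Py - sum(dp * dp / lam for dp, lam in zip(d_prime, lams))    # (6.2.32)

print(lams, c_o, P_o)
```

Each decoupled coefficient depends on one eigenvalue only, so a small eigenvalue amplifies the corresponding component of d' both in the estimator and in the MSE reduction.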

6.2.4 Geometric Interpretations and the Principle of Orthogonality

It is convenient and pedagogic to think of random variables with zero mean value and finite variance as vectors in an abstract vector space with an inner product (i.e., a Hilbert space) defined by their correlation

$$\langle x, y \rangle \triangleq E\{xy^*\} \tag{6.2.37}$$

and the length of a vector by

$$\|x\|^2 \triangleq \langle x, x \rangle = E\{|x|^2\} < \infty \tag{6.2.38}$$

From the definition of the correlation coefficient in Section 3.2.1 and the above definitions, we obtain

$$|\langle x, y \rangle|^2 \le \|x\|^2\,\|y\|^2 \tag{6.2.39}$$

which is known as the Cauchy-Schwarz inequality. Two random variables are orthogonal, denoted by $x \perp y$, if

$$\langle x, y \rangle = E\{xy^*\} = 0 \tag{6.2.40}$$

which implies they are uncorrelated since they have zero mean.

This geometric viewpoint offers an illuminating and intuitive interpretation for many aspects of MSE estimation that we will find very useful. Indeed, using (6.2.9), (6.2.10), and (6.2.12), we have

$$E\{\mathbf{x}e_o^*\} = E\{\mathbf{x}(y^* - \mathbf{x}^H\mathbf{c}_o)\} = E\{\mathbf{x}y^*\} - E\{\mathbf{x}\mathbf{x}^H\}\mathbf{c}_o = \mathbf{d} - \mathbf{R}\mathbf{c}_o = \mathbf{0}$$

Therefore

$$E\{\mathbf{x}e_o^*\} = \mathbf{0} \tag{6.2.41}$$

or

$$E\{x_m e_o^*\} = 0 \qquad \text{for } 1 \le m \le M \tag{6.2.42}$$

that is, the estimation error is orthogonal to the data used for the estimation. Equations (6.2.41), or equivalently (6.2.42), are known as the orthogonality principle and are widely used in linear MMSE estimation.

To illustrate the use of the orthogonality principle, we note that any linear combination $c_1^*x_1 + \cdots + c_M^*x_M$ lies in the subspace defined by the vectors† $x_1, \ldots, x_M$. Therefore, the estimate $\hat{y}$ that minimizes the squared length of the error vector $e$, that is, the MSE, is determined by the foot of the perpendicular from the tip of the vector $y$ to the “plane” defined by the vectors $x_1, \ldots, x_M$. This is illustrated in Figure 6.9 for $M = 2$. Since $e_o$ is perpendicular to every vector in the plane, we have $x_m \perp e_o$, $1 \le m \le M$, which leads to the orthogonality principle (6.2.42). Conversely, we can start with the orthogonality principle (6.2.41) and derive the normal equations. This interpretation has led to the name normal equations for (6.2.12). We will see several times that the concept of orthogonality has many important theoretical and practical implications. As an illustration, we apply the Pythagorean theorem to the right triangle formed by the vectors $\hat{y}_o$, $e_o$, and $y$ in Figure 6.9, to obtain

$$\|y\|^2 = \|\hat{y}_o\|^2 + \|e_o\|^2 \qquad \text{or} \qquad E\{|y|^2\} = E\{|\hat{y}_o|^2\} + E\{|e_o|^2\} \tag{6.2.43}$$

which decomposes the power of the desired response into two components, one that is correlated to the data and one that is uncorrelated to the data.

FIGURE 6.9 Pictorial illustration of the orthogonality principle. For random vectors orthogonality holds on the “average.” [Sketch for $M = 2$: the estimate $\hat{y}_o = c_{o,1}^*x_1 + c_{o,2}^*x_2$ lies in the plane of $x_1$ and $x_2$, and the error $e_o = y - \hat{y}_o$ is perpendicular to that plane.]
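The orthogonality principle (6.2.42) and the power decomposition (6.2.43) can be observed numerically when the expectations are replaced by time averages over a data record. The Python sketch below is only an illustration under that assumption; the signal model, the seed, and the record length are hypothetical.

```python
import random

random.seed(0)
N = 1000

# Hypothetical real-valued data: x2 is correlated with x1, y depends on both.
x1 = [random.gauss(0, 1) for _ in range(N)]
x2 = [0.5 * a + random.gauss(0, 1) for a in x1]
y = [0.7 * a - 0.2 * b + random.gauss(0, 0.5) for a, b in zip(x1, x2)]


def inner(u, v):
    """Time-average estimate of the inner product <u, v> = E{u v*} (real data)."""
    return sum(a * b for a, b in zip(u, v)) / len(u)


# Normal equations R c_o = d built from the sample moments (M = 2).
r11, r12, r22 = inner(x1, x1), inner(x1, x2), inner(x2, x2)
d1, d2 = inner(x1, y), inner(x2, y)
det = r11 * r22 - r12 ** 2
c1 = (d1 * r22 - r12 * d2) / det
c2 = (r11 * d2 - r12 * d1) / det

y_hat = [c1 * a + c2 * b for a, b in zip(x1, x2)]
e = [t - s for t, s in zip(y, y_hat)]

print(inner(x1, e), inner(x2, e))                     # both vanish up to rounding
print(inner(y, y), inner(y_hat, y_hat) + inner(e, e))  # equal up to rounding
```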

† We should be careful to avoid confusing vector random variables, that is, vectors whose components are random variables, and random variables interpreted as vectors in the abstract vector space defined by Equations (6.2.37) to (6.2.39).

6.2.5 Summary and Further Properties

We next summarize, for emphasis and future reference, some important properties of optimum, in the MMSE sense, linear estimators.

1. Equations (6.2.12) and (6.2.17) show that the optimum estimator and the MMSE depend only on the second-order moments of the desired response and the data. The dependence on the second-order moments is a consequence of both the linearity of the estimator and the use of the MSE criterion.

2. The error performance surface of the optimum estimator is a quadratic function of its coefficients. If the data correlation matrix is positive definite, this function has a unique minimum that determines the optimum set of coefficients. The surface can be visualized as a bowl, and the optimum estimator corresponds to the bottom of the bowl.

3. If the data correlation matrix $\mathbf{R}$ is positive definite, any deviation from the optimum increases the MSE according to (6.2.21). The resulting excess MSE depends on $\mathbf{R}$ only. This property is very useful in the design of adaptive filters.

4. When the estimator operates with the optimum set of coefficients, the error $e_o$ is uncorrelated (orthogonal) to both the data $x_1, x_2, \ldots, x_M$ and the optimum estimate $\hat{y}_o$. This property is very useful if we want to monitor the performance of an optimum estimator in practice and is used also to design adaptive filters.

5. The MMSE, the optimum estimator, and the optimum estimate can be expressed in terms of the eigenvalues and eigenvectors of the data correlation matrix. See (6.2.32), (6.2.35), and (6.2.36).

6. The general (unconstrained) estimator $\hat{y} \triangleq h(\mathbf{x}) = h(x_1, x_2, \ldots, x_M)$ that minimizes the MSE $P = E\{|y - h(\mathbf{x})|^2\}$ with respect to $h(\mathbf{x})$ is given by the mean of the conditional density, that is,

$$\hat{y}_o \triangleq h_o(\mathbf{x}) = E\{y|\mathbf{x}\} = \int_{-\infty}^{\infty} y\,p_y(y|\mathbf{x})\,dy$$

and clearly is a nonlinear function of $x_1, \ldots, x_M$. If the desired response and the data are jointly Gaussian, the linear MMSE estimator is the best in the MMSE sense; that is, we cannot find a nonlinear estimator that produces an estimate with smaller MMSE (Papoulis 1991).

6.3 SOLUTION OF THE NORMAL EQUATIONS

In this section, we present a numerical method for the solution of the normal equations and the computation of the minimum error, using a slight modification of the Cholesky decomposition of Hermitian positive definite matrices known as the lower-diagonal-upper decomposition, or $LDL^H$ decomposition for short. Hermitian positive definite matrices can be uniquely decomposed into the product of a lower triangular, a diagonal, and an upper triangular matrix as

$$\mathbf{R} = \mathbf{L}\mathbf{D}\mathbf{L}^H \tag{6.3.1}$$

where $\mathbf{L}$ is a unit lower triangular matrix

$$
\mathbf{L} \triangleq
\begin{bmatrix}
1 & 0 & \cdots & 0 \\
l_{10} & 1 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
l_{M-1,0} & l_{M-1,1} & \cdots & 1
\end{bmatrix}
\tag{6.3.2}
$$

and

$$\mathbf{D} = \mathrm{diag}\{\xi_1, \xi_2, \ldots, \xi_M\} \tag{6.3.3}$$

is a diagonal matrix with strictly real, positive elements. When the decomposition (6.3.1) is known, we can solve the normal equations

$$\mathbf{R}\mathbf{c}_o = \mathbf{L}\mathbf{D}(\mathbf{L}^H\mathbf{c}_o) = \mathbf{d} \tag{6.3.4}$$

by solving the lower triangular system

$$\mathbf{L}\mathbf{D}\mathbf{k} \triangleq \mathbf{d} \tag{6.3.5}$$

for the intermediate vector $\mathbf{k}$ and the upper triangular system

$$\mathbf{L}^H\mathbf{c}_o = \mathbf{k} \tag{6.3.6}$$

for the optimum estimator $\mathbf{c}_o$. The advantage is that the solution of triangular systems of equations is trivial.

We next provide a constructive proof of the $LDL^H$ decomposition by example and illustrate its application to the solution of the normal equations for $M = 4$. The generalization to an arbitrary order is straightforward and is given in Section 7.1.4.

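EXAMPLE 6.3.1. Writing the decomposition (6.3.1) explicitly for $M = 4$, we have

$$
\begin{bmatrix}
r_{11} & r_{12} & r_{13} & r_{14} \\
r_{21} & r_{22} & r_{23} & r_{24} \\
r_{31} & r_{32} & r_{33} & r_{34} \\
r_{41} & r_{42} & r_{43} & r_{44}
\end{bmatrix}
=
\begin{bmatrix}
1 & 0 & 0 & 0 \\
l_{10} & 1 & 0 & 0 \\
l_{20} & l_{21} & 1 & 0 \\
l_{30} & l_{31} & l_{32} & 1
\end{bmatrix}
\begin{bmatrix}
\xi_1 & 0 & 0 & 0 \\
0 & \xi_2 & 0 & 0 \\
0 & 0 & \xi_3 & 0 \\
0 & 0 & 0 & \xi_4
\end{bmatrix}
\begin{bmatrix}
1 & l_{10}^* & l_{20}^* & l_{30}^* \\
0 & 1 & l_{21}^* & l_{31}^* \\
0 & 0 & 1 & l_{32}^* \\
0 & 0 & 0 & 1
\end{bmatrix}
\tag{6.3.7}
$$

where $r_{ij} = r_{ji}^*$ and $\xi_i > 0$, by assumption. If we perform the matrix multiplications on the right-hand side of (6.3.7) and equate the matrix elements on the left and right sides, we obtain

$$
\begin{aligned}
r_{11} &= \xi_1 &&\Rightarrow& \xi_1 &= r_{11} \\
r_{21} &= \xi_1 l_{10} &&\Rightarrow& l_{10} &= \frac{r_{21}}{\xi_1} \\
r_{22} &= \xi_1|l_{10}|^2 + \xi_2 &&\Rightarrow& \xi_2 &= r_{22} - \xi_1|l_{10}|^2 \\
r_{31} &= \xi_1 l_{20} &&\Rightarrow& l_{20} &= \frac{r_{31}}{\xi_1} \\
r_{32} &= \xi_1 l_{20}l_{10}^* + \xi_2 l_{21} &&\Rightarrow& l_{21} &= \frac{r_{32} - \xi_1 l_{20}l_{10}^*}{\xi_2} \\
r_{33} &= \xi_1|l_{20}|^2 + \xi_2|l_{21}|^2 + \xi_3 &&\Rightarrow& \xi_3 &= r_{33} - \xi_1|l_{20}|^2 - \xi_2|l_{21}|^2 \\
r_{41} &= \xi_1 l_{30} &&\Rightarrow& l_{30} &= \frac{r_{41}}{\xi_1} \\
r_{42} &= \xi_1 l_{30}l_{10}^* + \xi_2 l_{31} &&\Rightarrow& l_{31} &= \frac{r_{42} - \xi_1 l_{30}l_{10}^*}{\xi_2} \\
r_{43} &= \xi_1 l_{30}l_{20}^* + \xi_2 l_{31}l_{21}^* + \xi_3 l_{32} &&\Rightarrow& l_{32} &= \frac{r_{43} - \xi_1 l_{30}l_{20}^* - \xi_2 l_{31}l_{21}^*}{\xi_3} \\
r_{44} &= \xi_1|l_{30}|^2 + \xi_2|l_{31}|^2 + \xi_3|l_{32}|^2 + \xi_4 &&\Rightarrow& \xi_4 &= r_{44} - \xi_1|l_{30}|^2 - \xi_2|l_{31}|^2 - \xi_3|l_{32}|^2
\end{aligned}
\tag{6.3.8}
$$

which provides a row-by-row computation of the elements of the $LDL^H$ decomposition. We note that the computation of the next row does not change the already computed rows.

The lower unit triangular system in (6.3.5) becomes

$$
\begin{bmatrix}
1 & 0 & 0 & 0 \\
l_{10} & 1 & 0 & 0 \\
l_{20} & l_{21} & 1 & 0 \\
l_{30} & l_{31} & l_{32} & 1
\end{bmatrix}
\begin{bmatrix} \xi_1 k_1 \\ \xi_2 k_2 \\ \xi_3 k_3 \\ \xi_4 k_4 \end{bmatrix}
=
\begin{bmatrix} d_1 \\ d_2 \\ d_3 \\ d_4 \end{bmatrix}
\tag{6.3.9}
$$

and can be solved by forward substitution, starting with the first equation. Indeed, we obtain

$$
\begin{aligned}
\xi_1 k_1 &= d_1 &&\Rightarrow& k_1 &= \frac{d_1}{\xi_1} \\
l_{10}\xi_1 k_1 + \xi_2 k_2 &= d_2 &&\Rightarrow& k_2 &= \frac{d_2 - l_{10}\xi_1 k_1}{\xi_2} \\
l_{20}\xi_1 k_1 + l_{21}\xi_2 k_2 + \xi_3 k_3 &= d_3 &&\Rightarrow& k_3 &= \frac{d_3 - l_{20}\xi_1 k_1 - l_{21}\xi_2 k_2}{\xi_3} \\
l_{30}\xi_1 k_1 + l_{31}\xi_2 k_2 + l_{32}\xi_3 k_3 + \xi_4 k_4 &= d_4 &&\Rightarrow& k_4 &= \frac{d_4 - l_{30}\xi_1 k_1 - l_{31}\xi_2 k_2 - l_{32}\xi_3 k_3}{\xi_4}
\end{aligned}
\tag{6.3.10}
$$

which compute the coefficients $k_i$ in “forward” order. Then, the optimum estimator is obtained by solving the upper unit triangular system in (6.3.6) by backward substitution, starting from the last equation. Indeed, we have

$$
\begin{bmatrix}
1 & l_{10}^* & l_{20}^* & l_{30}^* \\
0 & 1 & l_{21}^* & l_{31}^* \\
0 & 0 & 1 & l_{32}^* \\
0 & 0 & 0 & 1
\end{bmatrix}
\begin{bmatrix} c_1^{(4)} \\ c_2^{(4)} \\ c_3^{(4)} \\ c_4^{(4)} \end{bmatrix}
=
\begin{bmatrix} k_1 \\ k_2 \\ k_3 \\ k_4 \end{bmatrix}
\quad\Rightarrow\quad
\begin{aligned}
c_4^{(4)} &= k_4 \\
c_3^{(4)} &= k_3 - l_{32}^* c_4^{(4)} \\
c_2^{(4)} &= k_2 - l_{21}^* c_3^{(4)} - l_{31}^* c_4^{(4)} \\
c_1^{(4)} &= k_1 - l_{10}^* c_2^{(4)} - l_{20}^* c_3^{(4)} - l_{30}^* c_4^{(4)}
\end{aligned}
\tag{6.3.11}
$$

that is, the coefficients of the optimum estimator are computed in “backward” order. As a result of this backward substitution, computing one more coefficient for the optimum estimator changes all the previously computed coefficients. Indeed, the coefficients of the third-order estimator are

$$
\begin{bmatrix}
1 & l_{10}^* & l_{20}^* \\
0 & 1 & l_{21}^* \\
0 & 0 & 1
\end{bmatrix}
\begin{bmatrix} c_1^{(3)} \\ c_2^{(3)} \\ c_3^{(3)} \end{bmatrix}
=
\begin{bmatrix} k_1 \\ k_2 \\ k_3 \end{bmatrix}
\quad\Rightarrow\quad
\begin{aligned}
c_3^{(3)} &= k_3 \\
c_2^{(3)} &= k_2 - l_{21}^* c_3^{(3)} \\
c_1^{(3)} &= k_1 - l_{10}^* c_2^{(3)} - l_{20}^* c_3^{(3)}
\end{aligned}
\tag{6.3.12}
$$

which are different from the first three coefficients of the fourth-order estimator.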

Careful inspection of the formulas for $r_{11}$, $r_{22}$, $r_{33}$, and $r_{44}$ shows that the diagonal elements of $\mathbf{R}$ provide an upper bound for the elements of $\mathbf{L}$ and $\mathbf{D}$, which is the reason for the good numerical properties of the $LDL^H$ decomposition algorithm. The general formulas for the row-by-row computation of the triangular decomposition, forward substitution, and backward substitution are given in Table 6.1 and can be easily derived by generalizing the results of the previous example. The triangular decomposition requires $M^3/6$ operations, and the solution of each triangular system requires $M(M + 1)/2 \approx M^2/2$ operations.

TABLE 6.1 Solution of normal equations using triangular decomposition.

Triangular decomposition. For $i = 1, 2, \ldots, M$,

$$\xi_i = r_{ii} - \sum_{m=1}^{i-1} \xi_m |l_{i-1,m-1}|^2$$

and, for $j = 0, 1, \ldots, i - 1$ (not executed when $i = M$),

$$l_{ij} = \frac{1}{\xi_{j+1}} \left( r_{i+1,j+1} - \sum_{m=0}^{j-1} \xi_{m+1}\, l_{im}\, l_{jm}^* \right)$$

Forward substitution. For $i = 1, 2, \ldots, M$,

$$k_i = \frac{1}{\xi_i} \left( d_i - \sum_{m=0}^{i-2} l_{i-1,m}\, \xi_{m+1}\, k_{m+1} \right)$$

Backward substitution. For $i = M, M - 1, \ldots, 1$,

$$c_i = k_i - \sum_{m=i+1}^{M} l_{m-1,i-1}^*\, c_m$$
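The three steps of Table 6.1 can be sketched in plain Python as follows. This is not the book's ldlt or lduneqs routine; it is a minimal illustration with zero-based indices, and the complex Hermitian matrix, the vector d, and the power Py are hypothetical test values.

```python
def ldl_decompose(R):
    """Return (L, xi) with R = L diag(xi) L^H and L unit lower triangular.

    R is a Hermitian positive definite matrix given as a list of rows.
    """
    M = len(R)
    L = [[1.0 if i == j else 0.0 for j in range(M)] for i in range(M)]
    xi = [0.0] * M
    for i in range(M):
        for j in range(i):                 # row i of L, as in (6.3.8)
            acc = R[i][j] - sum(xi[m] * L[i][m] * L[j][m].conjugate()
                                for m in range(j))
            L[i][j] = acc / xi[j]
        pivot = (R[i][i] - sum(xi[m] * abs(L[i][m]) ** 2 for m in range(i))).real
        if pivot <= 0:
            raise ValueError("R is not positive definite")
        xi[i] = pivot
    return L, xi


def forward_substitution(L, xi, d):
    """Solve L D k = d for k, as in (6.3.10)."""
    k = [0.0] * len(d)
    for i in range(len(d)):
        acc = d[i] - sum(L[i][m] * xi[m] * k[m] for m in range(i))
        k[i] = acc / xi[i]
    return k


def backward_substitution(L, k):
    """Solve L^H c = k for c, as in (6.3.11)."""
    M = len(k)
    c = [0.0] * M
    for i in reversed(range(M)):
        c[i] = k[i] - sum(L[m][i].conjugate() * c[m] for m in range(i + 1, M))
    return c


def mmse(Py, xi, k):
    """Minimum MSE Py - sum_i xi_i |k_i|^2."""
    return Py - sum(x * abs(ki) ** 2 for x, ki in zip(xi, k))


# Hypothetical Hermitian positive definite example (M = 3).
R = [[4, 1 + 1j, 0], [1 - 1j, 3, 1j], [0, -1j, 2]]
d = [1, 2j, -1]
Py = 10.0

L, xi = ldl_decompose(R)
k = forward_substitution(L, xi, d)
c = backward_substitution(L, k)
P_o = mmse(Py, xi, k)
print(xi, c, P_o)
```

The decomposition depends only on R, so the same factors can be reused for several right-hand sides d.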

The decomposition (6.3.1) leads to an interesting and practical formula for the computation of the MMSE without using the optimum estimator coefficients. Indeed, using (6.2.17), (6.3.6), and (6.3.1), we obtain

$$P_o = P_y - \mathbf{c}_o^H\mathbf{R}\mathbf{c}_o = P_y - \mathbf{k}^H\mathbf{L}^{-1}\mathbf{R}(\mathbf{L}^{-1})^H\mathbf{k} = P_y - \mathbf{k}^H\mathbf{D}\mathbf{k} \tag{6.3.13}$$

or in scalar form

$$P_o = P_y - \sum_{i=1}^{M} \xi_i|k_i|^2 \tag{6.3.14}$$

since $\mathbf{D}$ is diagonal. Equation (6.3.14) shows that because $\xi_i > 0$, increasing the order of the filter can only reduce the minimum error and hence leads to a better estimate. Another important application of (6.3.14) is in the computation of arbitrary positive definite quadratic forms. Such problems arise in various statistical applications, such as detection and hypothesis testing, involving the correlation matrix of Gaussian processes (McDonough and Whalen 1995).

Since the determinant of a unit lower triangular matrix equals 1, from (6.3.1) we obtain

$$\det\mathbf{R} = (\det\mathbf{L})(\det\mathbf{D})(\det\mathbf{L}^H) = \prod_{i=1}^{M} \xi_i \tag{6.3.15}$$

which shows that if $\mathbf{R}$ is positive definite, $\xi_i > 0$ for all $i$, and vice versa. The triangular decomposition of symmetric, positive definite matrices is numerically stable. The function [L,D]=ldlt(R) implements the first part of the algorithm in Table 6.1, and it fails only if matrix $\mathbf{R}$ is not positive definite. Therefore, it can be used as an efficient test to find out whether a symmetric matrix is positive definite. The function [co,Po]=lduneqs(L,D,d) computes the MMSE estimator using the last formula in Table 6.1 and the corresponding MMSE using (6.3.14).

To summarize, linear MMSE estimation involves the following computational steps:

1. $\mathbf{R} = E\{\mathbf{x}\mathbf{x}^H\}$, $\mathbf{d} = E\{\mathbf{x}y^*\}$: normal equations $\mathbf{R}\mathbf{c}_o = \mathbf{d}$
2. $\mathbf{R} = \mathbf{L}\mathbf{D}\mathbf{L}^H$: triangular decomposition
3. $\mathbf{L}\mathbf{D}\mathbf{k} = \mathbf{d}$: forward substitution → $\mathbf{k}$
4. $\mathbf{L}^H\mathbf{c}_o = \mathbf{k}$: backward substitution → $\mathbf{c}_o$
5. $P_o = P_y - \mathbf{k}^H\mathbf{D}\mathbf{k}$: MMSE computation
6. $e = y - \mathbf{c}_o^H\mathbf{x}$: computation of residuals

(6.3.16)
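The use of the triangular decomposition as a positive-definiteness test, together with the determinant formula (6.3.15), can be sketched as follows. The routine below is an illustrative stand-in written in Python, not the book's ldlt function; it returns the pivots when all of them are positive and None otherwise, and it is applied to the two matrices of Example 6.2.1.

```python
import math


def ldl_pivots(R):
    """Return the pivots xi_1, ..., xi_M of R = L D L^H, or None on failure.

    A nonpositive pivot means that the Hermitian matrix R is not
    positive definite, so the decomposition is abandoned.
    """
    M = len(R)
    L = [[0.0] * M for _ in range(M)]      # only the strictly lower part is used
    xi = []
    for i in range(M):
        for j in range(i):
            acc = R[i][j] - sum(xi[m] * L[i][m] * L[j][m].conjugate()
                                for m in range(j))
            L[i][j] = acc / xi[j]
        pivot = (R[i][i] - sum(xi[m] * abs(L[i][m]) ** 2 for m in range(i))).real
        if pivot <= 0:
            return None
        xi.append(pivot)
    return xi


def is_positive_definite(R):
    """True when every pivot of the triangular decomposition is positive."""
    return ldl_pivots(R) is not None


R_bowl = [[4.5, -0.1545], [-0.1545, 4.5]]    # Figure 6.4(a)
R_saddle = [[1.0, 2.0], [2.0, 1.0]]          # Figure 6.4(b)

det_bowl = math.prod(ldl_pivots(R_bowl))     # det R as the product of pivots
print(is_positive_definite(R_bowl), round(det_bowl, 2))
print(is_positive_definite(R_saddle))
```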

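The vector $\mathbf{k}$ can also be obtained using the $LDL^H$ decomposition of an augmented correlation matrix. To this end, consider the augmented vector

$$\bar{\mathbf{x}} = \begin{bmatrix} \mathbf{x} \\ y \end{bmatrix} \tag{6.3.17}$$

and its correlation matrix

$$
\bar{\mathbf{R}} = E\{\bar{\mathbf{x}}\bar{\mathbf{x}}^H\} =
\begin{bmatrix}
E\{\mathbf{x}\mathbf{x}^H\} & E\{\mathbf{x}y^*\} \\
E\{y\mathbf{x}^H\} & E\{|y|^2\}
\end{bmatrix}
=
\begin{bmatrix}
\mathbf{R} & \mathbf{d} \\
\mathbf{d}^H & P_y
\end{bmatrix}
\tag{6.3.18}
$$

We can easily show that the $LDL^H$ decomposition of $\bar{\mathbf{R}}$ is

$$
\bar{\mathbf{R}} =
\begin{bmatrix}
\mathbf{L} & \mathbf{0} \\
\mathbf{k}^H & 1
\end{bmatrix}
\begin{bmatrix}
\mathbf{D} & \mathbf{0} \\
\mathbf{0}^H & P_o
\end{bmatrix}
\begin{bmatrix}
\mathbf{L}^H & \mathbf{k} \\
\mathbf{0}^H & 1
\end{bmatrix}
\tag{6.3.19}
$$

which provides the MMSE $P_o$ and the quantities $\mathbf{L}$ and $\mathbf{k}$ required to obtain the optimum estimator $\mathbf{c}_o$ by solving $\mathbf{L}^H\mathbf{c}_o = \mathbf{k}$.

EXAMPLE 6.3.2. Compute, using the $LDL^H$ method, the optimum estimator and the MMSE specified by the following second-order moments: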

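$$
\mathbf{R} =
\begin{bmatrix}
1 & 3 & 2 & 4 \\
3 & 12 & 18 & 21 \\
2 & 18 & 54 & 48 \\
4 & 21 & 48 & 55
\end{bmatrix}
\qquad
\mathbf{d} =
\begin{bmatrix} 1 \\ 2 \\ 1.5 \\ 4 \end{bmatrix}
\qquad \text{and} \qquad
P_y = 100
$$

Solution. We first compute the triangular factors

$$
\mathbf{L} =
\begin{bmatrix}
1 & 0 & 0 & 0 \\
3 & 1 & 0 & 0 \\
2 & 4 & 1 & 0 \\
4 & 3 & 2 & 1
\end{bmatrix}
\qquad
\mathbf{D} =
\begin{bmatrix}
1 & 0 & 0 & 0 \\
0 & 3 & 0 & 0 \\
0 & 0 & 2 & 0 \\
0 & 0 & 0 & 4
\end{bmatrix}
$$

using (6.3.8), and the vector $\mathbf{k}$

$$\mathbf{k} = [1 \;\; -\tfrac{1}{3} \;\; 1.75 \;\; -1]^T$$

using (6.3.9). Then we determine the optimum estimator

$$\mathbf{c} = [34.5 \;\; -12\tfrac{1}{3} \;\; 3.75 \;\; -1]^T$$

by solving the triangular system (6.3.11). The corresponding MMSE

$$P_o = 88.5$$

can be evaluated by using either (6.2.17) or (6.3.14). The reader can easily verify that the $LDL^H$ decomposition of $\bar{\mathbf{R}}$ provides the elements of $\mathbf{L}$, $\mathbf{k}$, and $P_o$.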
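The verification suggested at the end of Example 6.3.2 can be carried out with the following Python sketch, which decomposes the augmented matrix (6.3.18) and reads off L, k, and the MMSE as in (6.3.19). The function name ldl_real is an illustrative choice, and the routine is written for real symmetric positive definite matrices only.

```python
def ldl_real(A):
    """Row-by-row LDL^T decomposition of a real symmetric positive definite matrix."""
    n = len(A)
    L = [[float(i == j) for j in range(n)] for i in range(n)]
    xi = [0.0] * n
    for i in range(n):
        for j in range(i):
            acc = A[i][j] - sum(xi[m] * L[i][m] * L[j][m] for m in range(j))
            L[i][j] = acc / xi[j]
        xi[i] = A[i][i] - sum(xi[m] * L[i][m] ** 2 for m in range(i))
    return L, xi


R = [[1, 3, 2, 4], [3, 12, 18, 21], [2, 18, 54, 48], [4, 21, 48, 55]]
d = [1, 2, 1.5, 4]
Py = 100

# Augmented correlation matrix (6.3.18): [[R, d], [d^T, Py]].
R_bar = [row + [di] for row, di in zip(R, d)] + [d + [Py]]
L_bar, xi_bar = ldl_real(R_bar)

L = [row[:4] for row in L_bar[:4]]   # upper-left block: factor L of R
k = L_bar[4][:4]                     # last row of the augmented factor: k^T
P_o = xi_bar[4]                      # last pivot: the MMSE

# Backward substitution L^T c = k, as in (6.3.11).
c = [0.0] * 4
for i in reversed(range(4)):
    c[i] = k[i] - sum(L[m][i] * c[m] for m in range(i + 1, 4))

print(xi_bar[:4])   # pivots of R
print(k)
print(c)
print(P_o)          # 100 - 275/24, which the example rounds to 88.5
```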

Since the diagonal elements $\xi_k$ are positive, the matrix

$$\mathcal{L} \triangleq \mathbf{L}\mathbf{D}^{1/2} \tag{6.3.20}$$

is lower triangular with positive diagonal elements. Then (6.3.1) can be written as

$$\mathbf{R} = \mathcal{L}\mathcal{L}^H \tag{6.3.21}$$

which is known as the Cholesky decomposition of $\mathbf{R}$ (Golub and Van Loan 1996). The computation of $\mathcal{L}$ requires $M^3/6$ multiplications and additions and $M$ square roots and can be done by using the function L=chol(R)'. The function [L,D]=ldltchol(R) computes the $LDL^H$ decomposition using the function chol.
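The relation (6.3.20) between the two decompositions can be checked with a direct Cholesky routine. The Python sketch below is an illustration only (it is not the Matlab chol function); it factors the matrix R of Example 6.3.2 and compares the result with the factors L and D obtained there.

```python
import math


def cholesky_lower(R):
    """Return lower triangular C with positive diagonal such that R = C C^H."""
    M = len(R)
    C = [[0.0] * M for _ in range(M)]
    for i in range(M):
        for j in range(i + 1):
            acc = R[i][j] - sum(C[i][m] * C[j][m].conjugate() for m in range(j))
            if i == j:
                if acc.real <= 0:
                    raise ValueError("R is not positive definite")
                C[i][i] = math.sqrt(acc.real)
            else:
                C[i][j] = acc / C[j][j]
    return C


R = [[1, 3, 2, 4], [3, 12, 18, 21], [2, 18, 54, 48], [4, 21, 48, 55]]
C = cholesky_lower(R)

# Factors of Example 6.3.2; column j of C should equal column j of L times sqrt(xi_j).
L = [[1, 0, 0, 0], [3, 1, 0, 0], [2, 4, 1, 0], [4, 3, 2, 1]]
xi = [1, 3, 2, 4]
C_from_ldl = [[L[i][j] * math.sqrt(xi[j]) for j in range(4)] for i in range(4)]

for row in C:
    print(row)
```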

6.4 OPTIMUM FINITE IMPULSE RESPONSE FILTERS

In the previous section, we presented the theory of general linear MMSE estimators [see Figure 6.1(a)]. In this section, we apply these results to the design of optimum linear filters, that is, filters whose performance is the best possible when measured according to the MMSE criterion [see Figure 6.1(b)]. The general formulation of the optimum filtering problem is shown in Figure 6.10. The optimum filter forms an estimate $\hat{y}(n)$ of the desired response $y(n)$ by using samples from a related input signal $x(n)$. The theory of optimum filters was developed by Wiener (1942) in continuous time and Kolmogorov (1939) in discrete time. Levinson (1947) reformulated the theory for FIR filters and stationary processes and developed an efficient algorithm for the solution of the normal equations that exploits the Toeplitz structure of the autocorrelation matrix $\mathbf{R}$ (see Section 7.4). For this reason, linear MMSE filters are often referred to as Wiener filters.

FIGURE 6.10 Block diagram representation of the optimum filtering problem. [Block diagram: the input signal $x(n)$ drives the optimum filter, whose output $\hat{y}(n)$ is subtracted from the desired response $y(n)$ to form the error signal $e(n)$.]

We consider a linear FIR filter specified by its impulse response $h(n, k)$. The output of the filter is determined by the superposition summation

$$\hat{y}(n) = \sum_{k=0}^{M-1} h(n, k)\,x(n - k) \tag{6.4.1}$$

$$\hat{y}(n) \triangleq \sum_{k=1}^{M} c_k^*(n)\,x(n - k + 1) \triangleq \mathbf{c}^H(n)\mathbf{x}(n) \tag{6.4.2}$$

where†

$$\mathbf{c}(n) \triangleq [c_1(n) \; c_2(n) \; \cdots \; c_M(n)]^T \tag{6.4.3}$$

and

$$\mathbf{x}(n) \triangleq [x(n) \; x(n - 1) \; \cdots \; x(n - M + 1)]^T \tag{6.4.4}$$

are the filter coefficient vector and the input data vector, respectively. Equation (6.4.1) becomes a convolution if $h(n, k)$ does not depend on $n$, that is, when the filter is time-invariant. The objective is to find the coefficient vector that minimizes the MSE $E\{|e(n)|^2\}$. We prefer FIR over IIR filters because (1) any stable IIR filter can be approximated to any desirable degree by an FIR filter and (2) optimum FIR filters are easily obtained by solving a linear system of equations.

† We define $c_{k+1}(n) \triangleq h^*(n, k)$, $0 \le k \le M - 1$, in order to comply with the definition $\mathbf{R}(n) \triangleq E\{\mathbf{x}(n)\mathbf{x}^H(n)\}$ of the correlation matrix.

6.4.1 Design and Properties

To determine the optimum FIR filter $\mathbf{c}_o(n)$, we note that at every time instant $n$, the optimum filter is the linear MMSE estimator of the desired response $y(n)$ based on the data $\mathbf{x}(n)$. Since for any fixed $n$ the quantities $y(n), x(n), \ldots, x(n - M + 1)$ are random variables, we can determine the optimum filter either from (6.2.12) by replacing $\mathbf{x}$ by $\mathbf{x}(n)$, $y$ by $y(n)$, and $\mathbf{c}_o$ by $\mathbf{c}_o(n)$; or by applying the orthogonality principle (6.2.41). Indeed, using (6.2.41), (6.1.2), and (6.4.2), we have

$$E\{\mathbf{x}(n)[y^*(n) - \mathbf{x}^H(n)\mathbf{c}_o(n)]\} = \mathbf{0} \tag{6.4.5}$$

which leads to the following set of normal equations

$$\mathbf{R}(n)\mathbf{c}_o(n) = \mathbf{d}(n) \tag{6.4.6}$$

where

$$\mathbf{R}(n) \triangleq E\{\mathbf{x}(n)\mathbf{x}^H(n)\} \tag{6.4.7}$$

is the correlation matrix of the input data vector and

$$\mathbf{d}(n) \triangleq E\{\mathbf{x}(n)y^*(n)\} \tag{6.4.8}$$

is the cross-correlation vector between the desired response and the input data vector, that is, the input values stored currently in the filter memory and used by the filter to estimate the desired response. We see that, at every time $n$, the coefficients of the optimum filter are obtained as the solution of a linear system of equations. The filter $\mathbf{c}_o(n)$ is optimum if and only if the Hermitian matrix $\mathbf{R}(n)$ is positive definite.

To find the MMSE, we can use either (6.2.17) or the orthogonality principle (6.2.41). Using the orthogonality principle, we have

$$
\begin{aligned}
P_o(n) &= E\{e_o(n)[y^*(n) - \mathbf{x}^H(n)\mathbf{c}_o(n)]\} \\
&= E\{e_o(n)y^*(n)\} \qquad \text{due to orthogonality} \\
&= E\{[y(n) - \mathbf{c}_o^H(n)\mathbf{x}(n)]\,y^*(n)\}
\end{aligned}
$$

which, because $\mathbf{c}_o^H(n)\mathbf{d}(n) = \mathbf{d}^H(n)\mathbf{c}_o(n)$ is real, can be written as

$$P_o(n) = P_y(n) - \mathbf{d}^H(n)\mathbf{c}_o(n) \tag{6.4.9}$$

The first term

$$P_y(n) \triangleq E\{|y(n)|^2\} \tag{6.4.10}$$

is the power of the desired response signal and represents the MSE in the absence of filtering. The second term $\mathbf{d}^H(n)\mathbf{c}_o(n)$ is the reduction in the MSE that is obtained by using the optimum filter. In many practical applications, we need to know the performance of the optimum filter in terms of MSE reduction prior to computing the coefficients of the filter. Then we can decide if it is preferable to (1) use an optimum filter (assuming we can design one), (2) use a simpler suboptimum filter with adequate performance, or (3) not use a filter at all. Hence, the performance of the optimum filter can serve as a yardstick for other competing methods.

The optimum filter consists of (1) a linear system solver that determines the optimum set of coefficients from the normal equations formed, using the known second-order moments, and (2) a discrete-time filter that computes the estimate $\hat{y}(n)$ (see Figure 6.11). The solution of (6.4.6) can be obtained by using standard linear system solution techniques. In Matlab, we solve (6.4.6) by copt=R\d and compute the MMSE by Popt=Py-dot(conj(d),copt). The optimum filter is implemented by yest=filter(copt,1,x). We emphasize that the optimum filter only needs the input signal for its operation, that is, to form the estimate of y(n); the desired response, if it is available, may be used for other purposes.
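The two-stage structure described above (solve the normal equations, then filter) can be sketched in Python with NumPy as a rough counterpart of the Matlab one-liners. This is only an illustration under stated assumptions: the function names `optimum_fir` and `apply_filter` and the numerical values of R, d, and Py are hypothetical and do not come from the text.

```python
import numpy as np

def optimum_fir(R, d, Py):
    """Solve R c_o = d (6.4.6) and return (c_o, P_o), with P_o = Py - d^H c_o (6.4.9)."""
    R = np.asarray(R, dtype=complex)
    d = np.asarray(d, dtype=complex)
    c_o = np.linalg.solve(R, d)          # linear system solver stage
    P_o = Py - np.vdot(d, c_o).real      # np.vdot conjugates its first argument: d^H c_o
    return c_o, P_o

def apply_filter(c_o, x):
    """Form y_hat(n) = c_o^H x(n) by convolving x with h(k) = conj(c_o[k])."""
    h = np.conj(c_o)
    return np.convolve(x, h)[: len(x)]   # zero initial conditions

# Made-up second-order moments, for illustration only
R = np.array([[1.0, 0.5], [0.5, 1.0]])
d = np.array([0.5, 0.25])
c_o, P_o = optimum_fir(R, d, Py=1.0)
y_hat = apply_filter(c_o, np.array([1.0, 2.0, 3.0]))
```

Note that only the input x(n) enters `apply_filter`; the desired response is needed solely through the moments d and Py used at design time.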

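FIGURE 6.11 Design and implementation of a time-varying optimum FIR filter. [Diagram: the input signal x(n) feeds a tapped delay line of $z^{-1}$ elements; the taps are weighted by $c^*_{1,o}(n), c^*_{2,o}(n), \ldots, c^*_{M,o}(n)$ and summed to give the optimum estimate $\hat{y}_o(n)$. A linear system solver for $\mathbf{R}(n)\mathbf{c}_o(n) = \mathbf{d}(n)$ supplies the coefficients from the a priori information $\mathbf{R}(n)$ and $\mathbf{d}(n)$.]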

Conventional frequency-selective filters are designed to shape the spectrum of the input signal within a specific frequency band in which it operates. In this sense, these filters are effective only if the components of interest in the input signal have their energy concentrated within nonoverlapping bands. To design the filters, we need to know the limits of these bands, not the values of the sequences to be filtered. Note that such filters do not depend on the values of the data (values of the samples) to be filtered; that is, they are not data-adaptive. In contrast, optimum filters are designed using the second-order moments of the processed signals and have the same effect on all classes of signals with the same second-order moments. Optimum filters are effective even if the signals of interest have overlapping spectra. Although the actual data values also do not affect optimum filters, that is, they are also not data-adaptive, these filters are optimized to the statistics of the data and thus provide superior performance when judged by the statistical criterion.

The dependence of the optimum filter only on the second-order moments is a consequence of the linearity of the filter and the use of the MSE criterion. Phase information about the input signal or non-second-order moments of the input and desired response processes is not needed; even if the moments are known, they are not used by the filter. Such information is useful only if we employ a nonlinear filter or use another criterion of performance.

The error performance surface of the optimum direct-form FIR filter is a quadratic function of its impulse response. If the input correlation matrix is positive definite, this function has a unique minimum that determines the optimum set of coefficients. The surface can be visualized as a bowl, and the optimum filter corresponds to the bottom of the bowl. The bottom is moving if the processes are nonstationary and fixed if they are stationary. In general, the shape of the error performance surface depends on the criterion of performance and the structure of the filter. Note that the use of another criterion of performance or another filter structure may lead to error performance surfaces with multiple local minima or saddle points.

6.4.2 Optimum FIR Filters for Stationary Processes

Further simplifications and additional insight into the operation of optimum linear filters are possible when the input and desired response stochastic processes are jointly wide-sense stationary. In this case, the correlation matrix of the input data and the cross-correlation vector do not depend on the time index n. Therefore, the optimum filter and the MMSE are time-invariant (i.e., they are independent of the time index n) and are determined by

$$\mathbf{R}\,\mathbf{c}_o = \mathbf{d} \tag{6.4.11}$$

and

$$P_o = P_y - \mathbf{d}^H\mathbf{c}_o \tag{6.4.12}$$

Owing to stationarity, the autocorrelation matrix is

$$\mathbf{R} \triangleq \begin{bmatrix} r_x(0) & r_x(1) & \cdots & r_x(M-1) \\ r_x^*(1) & r_x(0) & \cdots & r_x(M-2) \\ \vdots & \vdots & \ddots & \vdots \\ r_x^*(M-1) & r_x^*(M-2) & \cdots & r_x(0) \end{bmatrix} \tag{6.4.13}$$

determined by the autocorrelation $r_x(l) = E\{x(n)x^*(n-l)\}$ of the input signal. The cross-correlation vector between the desired response and the input data vector is

$$\mathbf{d} \triangleq [d_1\ \ d_2\ \cdots\ d_M]^T \triangleq [r_{yx}^*(0)\ \ r_{yx}^*(1)\ \cdots\ r_{yx}^*(M-1)]^T \tag{6.4.14}$$

and $P_y$ is the power of the desired response. For stationary processes, the matrix R is Toeplitz and positive definite unless the components of the data vector are linearly dependent. Since the optimum filter is time-invariant, it is implemented by using convolution

$$\hat{y}_o(n) = \sum_{k=0}^{M-1} h_o(k)\,x(n-k) \tag{6.4.15}$$

where $h_o(n) = c^*_{o,n+1}$ is the impulse response of the optimum filter.

Using (6.4.13), (6.4.14), $h_o(n) = c^*_{o,n+1}$, and $r(l) = r^*(-l)$, we can write the normal equations (6.4.11) more explicitly as

$$\sum_{k=0}^{M-1} h_o(k)\,r_x(m-k) = r_{yx}(m) \qquad 0 \le m \le M-1 \tag{6.4.16}$$

which is the discrete-time counterpart of the Wiener-Hopf integral equation, and its solution determines the impulse response of the optimum filter. We notice that the cross-correlation between the input signal and the desired response (right-hand side) is equal to the convolution between the autocorrelation of the input signal and the optimum filter (left-hand side). Thus, to obtain the optimum filter, we need to solve a convolution equation. The MMSE is given by

$$P_o = P_y - \sum_{k=0}^{M-1} h_o(k)\,r_{yx}^*(k) \tag{6.4.17}$$

which is obtained by substituting (6.4.14) into (6.4.12). Table 6.2 summarizes the information required for the design of an optimum (in the MMSE sense) linear time-invariant filter, the Wiener-Hopf equations that define the filter, and the resulting MMSE.

TABLE 6.2 Specification of optimum linear filters for stationary signals. The limits 0 and M − 1 on the summations can be replaced by any values $M_1$ and $M_2$.

| Item | Expression |
|---|---|
| Filter and error definitions | $e(n) \triangleq y(n) - \sum_{k=0}^{M-1} h(k)\,x(n-k)$ |
| Criterion of performance | $P \triangleq E\{\lvert e(n)\rvert^2\} \to$ minimum |
| Wiener-Hopf equations | $\sum_{k=0}^{M-1} h_o(k)\,r_x(m-k) = r_{yx}(m),\quad 0 \le m \le M-1$ |
| Minimum MSE | $P_o = P_y - \sum_{k=0}^{M-1} h_o(k)\,r_{yx}^*(k)$ |
| Second-order statistics | $r_x(l) = E\{x(n)x^*(n-l)\}$, $P_y = E\{\lvert y(n)\rvert^2\}$, $r_{yx}(l) = E\{y(n)x^*(n-l)\}$ |

To summarize, for nonstationary processes $\mathbf{R}(n)$ is Hermitian and nonnegative definite, and the optimum filter $h_o(n)$ is time-varying. For stationary processes, R is Hermitian, nonnegative definite, and Toeplitz, and the optimum filter is time-invariant. A Toeplitz autocorrelation matrix is positive definite if the power spectrum of the input satisfies $R_x(e^{j\omega}) > 0$ for all frequencies ω. In both cases, the filter is used for all realizations of the processes. If M = ∞, we have a causal IIR optimum filter determined by an infinite-order linear system of equations that can only be solved in the stationary case by using analytical techniques (see Section 6.6).

EXAMPLE 6.4.1. Consider a harmonic random process

$$y(n) = A\cos(\omega_0 n + \phi)$$

with fixed, but unknown, amplitude and frequency, and random phase φ, uniformly distributed on the interval from 0 to 2π. This process is corrupted by additive white Gaussian noise $v(n) \sim N(0, \sigma_v^2)$ that is uncorrelated with y(n). The resulting signal x(n) = y(n) + v(n) is available to the user for processing. Design an optimum FIR filter to remove the corrupting noise v(n) from the observed signal x(n).

Solution. The input of the optimum filter is x(n), and the desired response is y(n). The signal y(n) is obviously unavailable, but to design the filter, we only need the second-order moments $r_x(l)$ and $r_{yx}(l)$. We first note that since y(n) and v(n) are uncorrelated, the autocorrelation of the input signal is

$$r_x(l) = r_y(l) + r_v(l) = \tfrac{1}{2}A^2\cos\omega_0 l + \sigma_v^2\,\delta(l)$$

where $r_y(l) = \tfrac{1}{2}A^2\cos\omega_0 l$ is the autocorrelation of y(n). The cross-correlation between the desired response y(n) and the input signal x(n) is

$$r_{yx}(l) = E\{y(n)[y(n-l) + v(n-l)]\} = r_y(l)$$

Therefore, the autocorrelation matrix R is symmetric Toeplitz and is determined by the elements r(0), r(1), . . . , r(M − 1) of its first row. The right-hand side of the Wiener-Hopf equations is $\mathbf{d} = [r_y(0)\ r_y(1)\ \cdots\ r_y(M-1)]^T$. If we know $r_y(l)$ and $\sigma_v^2$, we can numerically determine the optimum filter and the MMSE from (6.4.11) and (6.4.12). For example, suppose that A = 0.5, $f_0 = \omega_0/(2\pi) = 0.05$, and $\sigma_v^2 = 0.5$. The input signal-to-noise ratio (SNR) is

$$\mathrm{SNR}_I = 10\log_{10}\frac{A^2/2}{\sigma_v^2} = -6.02\ \text{dB}$$

The processing gain (PG), defined as the ratio of signal-to-noise ratios at the output and input of a signal processing system

$$\mathrm{PG} \triangleq \frac{\mathrm{SNR}_O}{\mathrm{SNR}_I}$$

provides another useful measure of performance. The first problem we encounter is how to choose the order M of the filter. In the absence of any a priori information, we compute $\mathbf{h}_o$, the MMSE $P_o^h$, and the PG for $1 \le M \le M_{\max} = 50$ and plot the MMSE and the PG in Figure 6.12. We see that an M = 20 order filter provides satisfactory performance. Figure 6.13 shows a realization of the corrupted and filtered signals. Another useful approach to evaluate how well the optimum filter enhances a harmonic signal is to compute the spectra of the input and output signals and the frequency response of the optimum filter. These are shown in Figure 6.14, where we see that the optimum filter has a sharp bandpass about frequency $f_0$, as expected (for details see Problem 6.5).
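The design step of this example can be reproduced numerically. The following is a minimal NumPy sketch, assuming the moments derived above; the function name `wiener_sinusoid` is hypothetical, and the processing gain is left out because its computation is not spelled out in the text.

```python
import numpy as np

A, f0, var_v = 0.5, 0.05, 0.5          # values used in Example 6.4.1
w0 = 2 * np.pi * f0

def wiener_sinusoid(M):
    """Return (h_o, P_o) of the M-tap optimum filter for the sinusoid in white noise."""
    lags = np.arange(M)
    ry = 0.5 * A**2 * np.cos(w0 * lags)            # r_y(l) = (A^2/2) cos(w0 l)
    idx = np.abs(lags[:, None] - lags[None, :])
    R = ry[idx] + var_v * np.eye(M)                # symmetric Toeplitz matrix of r_x(|i-j|)
    d = ry                                         # r_yx(l) = r_y(l)
    h = np.linalg.solve(R, d)                      # normal equations (6.4.11)
    P = ry[0] - d @ h                              # MMSE (6.4.12) with P_y = r_y(0)
    return h, P

# MMSE as a function of the filter order, 1 <= M <= 50
mmse = np.array([wiener_sinusoid(M)[1] for M in range(1, 51)])
snr_in_db = 10 * np.log10(0.5 * A**2 / var_v)      # input SNR in dB
```

For M = 1 the system is scalar, so $h_o(0) = 0.125/0.625 = 0.2$ and $P_o = 0.1$, which is a convenient hand check on the code.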

FIGURE 6.12 Plots of (a) the MMSE $P_o(M)$ and (b) the processing gain as a function of the FIR filter order M, for 1 ≤ M ≤ 50.

To illustrate the meaning of the estimator’s optimality, we will use a Monte Carlo simulation. Thus, we generate K = 100 realizations of the sequence $x(\zeta_i, n)$, 0 ≤ n ≤ N − 1 (N = 1000); we compute the output sequence $\hat{y}(\zeta_i, n)$, using (6.4.15); and then the error sequence $e(\zeta_i, n) = y(\zeta_i, n) - \hat{y}(\zeta_i, n)$ and its variance $\hat{P}(\zeta_i)$. Figure 6.15 shows a plot of $\hat{P}(\zeta_i)$, 1 ≤ i ≤ K. We notice that although the filter performs better or worse than the optimum in particular cases, on average its performance is close to the theoretically predicted one. This is exactly the meaning of the MMSE criterion: optimum performance on the average (in the MMSE sense).

FIGURE 6.13 Example of the noise-corrupted and filtered sinusoidal signals. [Two panels, "Sinusoid + noise" and "Filtered signal", plotted against sample number n from 0 to 1000.]

FIGURE 6.14 PSD of the input signal, magnitude response of the optimum filter, and PSD of the output signal. [Three panels, "Sinusoid + noise", "Optimum filter magnitude response", and "Filtered signal", plotted as magnitude against frequency f from 0 to 0.5.]

FIGURE 6.15 Results of Monte Carlo simulation of the optimum filter. The solid line corresponds to the MMSE and the dashed line to the average of $\hat{P}(\zeta_i)$ values. [Plot of the MMSE against realization index $\zeta_i$ from 1 to 100.]

For a certain realization, the optimum filter may not perform as well as some other linear filters; however, on average, it performs better than any other linear filter of the same order when all possible realizations of x(n) and y(n) are considered.
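The Monte Carlo experiment can be sketched as follows. This is an illustration under assumptions, not the authors' code: the seed is arbitrary, the order M = 20 is the one singled out in the example, and the first M − 1 output samples are discarded so that the filter start-up transient does not bias the error estimate.

```python
import numpy as np

rng = np.random.default_rng(0)           # fixed seed, chosen arbitrarily for repeatability
A, f0, var_v, M = 0.5, 0.05, 0.5, 20
K, N = 100, 1000
w0 = 2 * np.pi * f0

# Design the optimum filter from the known second-order moments
lags = np.arange(M)
ry = 0.5 * A**2 * np.cos(w0 * lags)
R = ry[np.abs(lags[:, None] - lags[None, :])] + var_v * np.eye(M)
h_o = np.linalg.solve(R, ry)
P_o = ry[0] - ry @ h_o                   # theoretical MMSE

n = np.arange(N)
P_hat = np.empty(K)
for i in range(K):
    phi = rng.uniform(0.0, 2 * np.pi)                 # random phase of this realization
    y = A * np.cos(w0 * n + phi)
    x = y + rng.normal(0.0, np.sqrt(var_v), N)        # noisy observation
    y_hat = np.convolve(x, h_o)[:N]                   # convolution (6.4.15)
    e = (y - y_hat)[M - 1:]                           # skip the start-up transient
    P_hat[i] = np.mean(e**2)                          # per-realization error power

print(P_o, P_hat.mean())
```

Individual entries of `P_hat` scatter above and below `P_o`, while their average should lie close to it, which is the behavior described for Figure 6.15.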

6.4.3 Frequency-Domain Interpretations

We will now investigate the performance of the optimum filter, for stationary processes, in the frequency domain. Using (6.2.7), (6.4.13), and (6.4.14), we can easily show that the MSE of an FIR filter h(n) is given by

$$P = E\{|e(n)|^2\} = r_y(0) - \sum_{k=0}^{M-1} h(k)\,r_{yx}^*(k) - \sum_{k=0}^{M-1} h^*(k)\,r_{yx}(k) + \sum_{k=0}^{M-1}\sum_{l=0}^{M-1} h(k)\,r(l-k)\,h^*(l) \tag{6.4.18}$$

The frequency response function of the FIR filter is

$$H(e^{j\omega}) \triangleq \sum_{k=0}^{M-1} h(k)\,e^{-j\omega k} \tag{6.4.19}$$

Using Parseval’s theorem,

$$\sum_{n=-\infty}^{\infty} x_1(n)\,x_2^*(n) = \frac{1}{2\pi}\int_{-\pi}^{\pi} X_1(e^{j\omega})\,X_2^*(e^{j\omega})\,d\omega \tag{6.4.20}$$

we can show that the MSE (6.4.18) can be expressed in the frequency domain as

$$P = r_y(0) - \frac{1}{2\pi}\int_{-\pi}^{\pi}\left[H(e^{j\omega})R_{yx}^*(e^{j\omega}) + H^*(e^{j\omega})R_{yx}(e^{j\omega}) - H(e^{j\omega})H^*(e^{j\omega})R_x(e^{j\omega})\right]d\omega \tag{6.4.21}$$

where $R_x(e^{j\omega})$ is the PSD of x(n) and $R_{yx}(e^{j\omega})$ is the cross-PSD of y(n) and x(n) (see Problem 6.10). This formula holds for both FIR and IIR filters. If we minimize (6.4.21) with respect to $H(e^{j\omega})$, we obtain the system function of the optimum filter and the MMSE. However, we leave this for Problem 6.11 and instead express (6.4.17) in the frequency domain by using (6.4.20). Indeed, we have

$$\begin{aligned} P_o &= r_y(0) - \frac{1}{2\pi}\int_{-\pi}^{\pi} H_o(e^{j\omega})\,R_{yx}^*(e^{j\omega})\,d\omega \\ &= \frac{1}{2\pi}\int_{-\pi}^{\pi}\left[R_y(e^{j\omega}) - H_o(e^{j\omega})\,R_{yx}^*(e^{j\omega})\right]d\omega \end{aligned} \tag{6.4.22}$$

where $H_o(e^{j\omega})$ is the frequency response of the optimum filter. The above equation holds for any filter, FIR or IIR, as long as we use the proper limits to compute the summation in (6.4.19).

We will now obtain a formula for the MMSE that holds only for IIR filters whose impulse response extends from −∞ to ∞. In this case, (6.4.16) is a convolution equation that holds for −∞ < m < ∞. Using the convolution theorem of the Fourier transform, we obtain

$$H_o(e^{j\omega}) = \frac{R_{yx}(e^{j\omega})}{R_x(e^{j\omega})} \tag{6.4.23}$$

which, we again stress, holds for noncausal IIR filters only. Substituting into (6.4.22), we obtain

$$P_o = \frac{1}{2\pi}\int_{-\pi}^{\pi}\left[1 - \frac{|R_{yx}(e^{j\omega})|^2}{R_y(e^{j\omega})\,R_x(e^{j\omega})}\right]R_y(e^{j\omega})\,d\omega$$

or

$$P_o = \frac{1}{2\pi}\int_{-\pi}^{\pi}\left[1 - \mathcal{G}_{yx}(e^{j\omega})\right]R_y(e^{j\omega})\,d\omega \tag{6.4.24}$$

where $\mathcal{G}_{yx}(e^{j\omega})$ is the coherence function between x(n) and y(n). This important equation indicates that the performance of the optimum filter depends on the coherence between the input and desired response processes. As we recall from Section 5.4, the coherence is a measure of both the noise disturbing the observations and the relative linearity between x(n) and y(n). The optimum filter can reduce the MMSE at a certain band only if there is significant coherence, that is, $\mathcal{G}_{yx}(e^{j\omega}) \approx 1$. Thus, the optimum filter $H_o(z)$ constitutes the best, in the MMSE sense, linear relationship between the stochastic processes x(n) and y(n). These interpretations apply to causal IIR and FIR optimum filters, even if (6.4.23) and (6.4.24) only hold approximately in these cases (see Section 6.6).
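Equations (6.4.22) through (6.4.24) can be evaluated numerically on a frequency grid. The sketch below assumes a hypothetical signal-plus-noise case that is not taken from the text: a desired response with autocorrelation $r_y(l) = a^{|l|}$ observed in additive white noise of variance `var_v` that is uncorrelated with y(n), so that $R_{yx} = R_y$ and $R_x = R_y + \sigma_v^2$. Averaging over a uniform grid stands in for $(1/2\pi)\int_{-\pi}^{\pi}(\cdot)\,d\omega$.

```python
import numpy as np

a, var_v = 0.8, 1.0                                 # illustrative values (assumptions)
w = 2 * np.pi * np.arange(4096) / 4096              # uniform grid over one period

Ry = (1 - a**2) / (1 - 2 * a * np.cos(w) + a**2)    # PSD of r_y(l) = a^|l|
Rx = Ry + var_v                                     # x(n) = y(n) + v(n), v white
Ryx = Ry                                            # cross-PSD of y(n) and x(n)

Ho = Ryx / Rx                                       # noncausal IIR optimum filter (6.4.23)
G = np.abs(Ryx) ** 2 / (Ry * Rx)                    # bracketed ratio in (6.4.24)

Py = Ry.mean()                                      # (1/2pi) * integral of R_y = r_y(0)
Po = ((1 - G) * Ry).mean()                          # MMSE from (6.4.24)
Po_alt = (Ry - Ho * np.conj(Ryx)).mean().real       # MMSE from (6.4.22)
```

In this setup the integrand of (6.4.24) reduces to $R_y\sigma_v^2/(R_y + \sigma_v^2)$, so the MMSE is smaller than both the signal power and the noise power; the error is concentrated in the bands where the ratio `G` is well below 1.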

6.5 LINEAR PREDICTION

Linear prediction plays a prominent role in many theoretical, computational, and practical areas of signal processing and deals with the problem of estimating or predicting the value x(n) of a signal at the time instant $n = n_0$, by using a set of other samples from the same signal. Although linear prediction is a subject useful in itself, its importance in signal processing is also due, as we will see later, to its use in the development of fast algorithms for optimum filtering and its relation to all-pole signal modeling.

6.5.1 Linear Signal Estimation

Suppose that we are given a set of values x(n), x(n − 1), . . . , x(n − M) of a stochastic process and we wish to estimate the value of x(n − i), using a linear combination of the remaining samples. The resulting estimate and the corresponding estimation error are given by

$$\hat{x}(n-i) \triangleq -\sum_{\substack{k=0 \\ k\neq i}}^{M} c_k^*(n)\,x(n-k) \tag{6.5.1}$$

and

$$e^{(i)}(n) \triangleq x(n-i) - \hat{x}(n-i) = \sum_{k=0}^{M} c_k^*(n)\,x(n-k) \qquad \text{with } c_i(n) \triangleq 1 \tag{6.5.2}$$

where $c_k(n)$ are the coefficients of the estimator as a function of discrete-time index n. The process is illustrated in Figure 6.16.

FIGURE 6.16 Illustration showing the samples, estimates, and errors used in linear signal estimation, forward linear prediction, and backward linear prediction. [Three panels over the samples x(n − M), . . . , x(n − 1), x(n) against time nT: linear signal estimation with weights $c^*_0, c^*_1, \ldots, 1, \ldots, c^*_M$, estimate $\hat{x}(n-i)$, and error $e^{(i)}(n)$; forward linear prediction with weights $1, a^*_1, a^*_2, \ldots, a^*_M$, estimate $\hat{x}(n)$, and error $e^f(n)$; backward linear prediction with weights $b^*_0, b^*_1, \ldots, b^*_{M-1}, 1$, estimate $\hat{x}(n-M)$, and error $e^b(n)$.]

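To determine the MMSE signal estimator, we partition (6.5.2) as

$$\begin{aligned} e^{(i)}(n) &= \sum_{k=0}^{i-1} c_k^*(n)\,x(n-k) + x(n-i) + \sum_{k=i+1}^{M} c_k^*(n)\,x(n-k) \\ &\triangleq \mathbf{c}_1^H(n)\,\mathbf{x}_1(n) + x(n-i) + \mathbf{c}_2^H(n)\,\mathbf{x}_2(n) \\ &\triangleq [\bar{\mathbf{c}}^{(i)}(n)]^H\,\bar{\mathbf{x}}(n) \end{aligned} \tag{6.5.3}$$

where the partitions of the coefficient and data vectors, around the ith component, are easily defined from the context. To obtain the normal equations and the MMSE for the optimum linear signal estimator, we note that

$$\text{Desired response} = x(n-i) \qquad \text{data vector} = \begin{bmatrix} \mathbf{x}_1(n) \\ \mathbf{x}_2(n) \end{bmatrix}$$

Using (6.4.6) and (6.4.9) or the orthogonality principle, we have†

$$\begin{bmatrix} \mathbf{R}_{11}(n) & \mathbf{R}_{12}(n) \\ \mathbf{R}_{12}^H(n) & \mathbf{R}_{22}(n) \end{bmatrix}\begin{bmatrix} \mathbf{c}_1(n) \\ \mathbf{c}_2(n) \end{bmatrix} = -\begin{bmatrix} \mathbf{r}_1(n) \\ \mathbf{r}_2(n) \end{bmatrix} \tag{6.5.4}$$

or more compactly

$$\mathbf{R}^{(i)}(n)\,\mathbf{c}_o^{(i)}(n) = -\mathbf{d}^{(i)}(n) \tag{6.5.5}$$

and

$$P_o^{(i)}(n) = P_x(n-i) + \mathbf{r}_1^H(n)\,\mathbf{c}_1(n) + \mathbf{r}_2^H(n)\,\mathbf{c}_2(n) \tag{6.5.6}$$

where for j, k = 1, 2

$$\mathbf{R}_{jk}(n) \triangleq E\{\mathbf{x}_j(n)\,\mathbf{x}_k^H(n)\} \tag{6.5.7}$$

$$\mathbf{r}_j(n) \triangleq E\{\mathbf{x}_j(n)\,x^*(n-i)\} \tag{6.5.8}$$

$$P_x(n) = E\{|x(n)|^2\} \tag{6.5.9}$$

For various reasons, to be seen later, we will combine (6.5.4) and (6.5.6) into a single equation. To this end, we note that the correlation matrix of the extended vector

$$\bar{\mathbf{x}}(n) = \begin{bmatrix} \mathbf{x}_1(n) \\ x(n-i) \\ \mathbf{x}_2(n) \end{bmatrix} \tag{6.5.10}$$

can be partitioned as

$$\bar{\mathbf{R}}(n) = E\{\bar{\mathbf{x}}(n)\,\bar{\mathbf{x}}^H(n)\} = \begin{bmatrix} \mathbf{R}_{11}(n) & \mathbf{r}_1(n) & \mathbf{R}_{12}(n) \\ \mathbf{r}_1^H(n) & P_x(n-i) & \mathbf{r}_2^H(n) \\ \mathbf{R}_{12}^H(n) & \mathbf{r}_2(n) & \mathbf{R}_{22}(n) \end{bmatrix} \tag{6.5.11}$$

with respect to its ith row and ith column. Using (6.5.4), (6.5.6), and (6.5.11), we obtain

$$\bar{\mathbf{R}}(n)\,\bar{\mathbf{c}}_o^{(i)}(n) = \begin{bmatrix} \mathbf{0} \\ P_o^{(i)}(n) \\ \mathbf{0} \end{bmatrix} \leftarrow i\text{th row} \tag{6.5.12}$$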

which completely determines the linear signal estimator $\mathbf{c}^{(i)}(n)$ and the MMSE $P_o^{(i)}(n)$. If M = 2L and i = L, we have a symmetric linear smoother $\bar{\mathbf{c}}(n)$ that produces an estimate of the middle sample by using the L past and the L future samples. The above formulation suggests an easy procedure for the computation of the linear signal estimator for any value of i, which is outlined in Table 6.3 and implemented by the function [ci,Pi]=olsigest(R,i). We next discuss two types of linear signal estimation that are of special interest and have their own dedicated notation.

† The minus sign on the right-hand side of the normal equations is the result of arbitrarily setting the coefficient $c_i(n) \triangleq 1$.

TABLE 6.3 Steps for the computation of optimum signal estimators.

1. Determine the matrix $\bar{\mathbf{R}}(n)$ of the extended data vector $\bar{\mathbf{x}}(n)$.
2. Create the M × M submatrix $\mathbf{R}^{(i)}(n)$ of $\bar{\mathbf{R}}(n)$ by removing its ith row and its ith column.
3. Create the M × 1 vector $\mathbf{d}^{(i)}(n)$ by extracting the ith column $\bar{\mathbf{d}}^{(i)}(n)$ of $\bar{\mathbf{R}}(n)$ and removing its ith element.
4. Solve the linear system $\mathbf{R}^{(i)}(n)\,\mathbf{c}_o^{(i)}(n) = -\mathbf{d}^{(i)}(n)$ to obtain $\mathbf{c}_o^{(i)}(n)$.
5. Compute the MMSE $P_o^{(i)}(n) = [\bar{\mathbf{d}}^{(i)}(n)]^H\,\bar{\mathbf{c}}_o^{(i)}(n)$.

6.5.2 Forward Linear Prediction

One-step forward linear prediction (FLP) involves the estimation or prediction of the value x(n) of a stochastic process by using a linear combination of the past samples x(n − 1), . . . , x(n − M) (see Figure 6.16). We should stress that in signal processing applications of linear prediction, what is important is the ability to obtain a good estimate of a sample, pretending that it is unknown, instead of forecasting the future. Thus, the term prediction is used more with signal estimation than forecasting in mind. The forward predictor is a linear signal estimator with i = 0 and is denoted by

$$\begin{aligned} e^f(n) &\triangleq x(n) + \sum_{k=1}^{M} a_k^*(n)\,x(n-k) \\ &= x(n) + \mathbf{a}^H(n)\,\mathbf{x}(n-1) \end{aligned} \tag{6.5.13}$$

where

$$\mathbf{a}(n) \triangleq [a_1(n)\ \ a_2(n)\ \cdots\ a_M(n)]^T \tag{6.5.14}$$

is known as the forward linear predictor and $a_k(n)$ with $a_0(n) \triangleq 1$ as the FLP error filter. To obtain the normal equations and the MMSE for the optimum FLP, we note that for i = 0, (6.5.11) can be written as

$$\bar{\mathbf{R}}(n) = \begin{bmatrix} P_x(n) & \mathbf{r}^{fH}(n) \\ \mathbf{r}^f(n) & \mathbf{R}(n-1) \end{bmatrix} \tag{6.5.15}$$

where

$$\mathbf{R}(n) = E\{\mathbf{x}(n)\,\mathbf{x}^H(n)\} \tag{6.5.16}$$

and

$$\mathbf{r}^f(n) = E\{\mathbf{x}(n-1)\,x^*(n)\} \tag{6.5.17}$$

Therefore, (6.5.5) and (6.5.6) give

$$\mathbf{R}(n-1)\,\mathbf{a}_o(n) = -\mathbf{r}^f(n) \tag{6.5.18}$$

and

$$P_o^f(n) = P_x(n) + \mathbf{r}^{fH}(n)\,\mathbf{a}_o(n) \tag{6.5.19}$$

or

$$\bar{\mathbf{R}}(n)\begin{bmatrix} 1 \\ \mathbf{a}_o(n) \end{bmatrix} = \begin{bmatrix} P_o^f(n) \\ \mathbf{0} \end{bmatrix} \tag{6.5.20}$$

which completely specifies the FLP parameters.
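The procedure of Table 6.3, with the forward predictor obtained as the special case i = 0, can be sketched in NumPy as follows. This is not the book's olsigest function; the name `signal_estimator`, the 0-based index convention, and the autocorrelation values r(l) = 0.5^|l| are assumptions made for illustration.

```python
import numpy as np

def signal_estimator(Rbar, i):
    """Steps 2-5 of Table 6.3 for an (M+1) x (M+1) matrix Rbar and index i = 0, ..., M.

    Returns the extended coefficient vector (with a 1 in position i) and the MMSE.
    """
    Rbar = np.asarray(Rbar, dtype=complex)
    keep = np.arange(Rbar.shape[0]) != i
    Ri = Rbar[np.ix_(keep, keep)]            # step 2: delete the ith row and column
    dbar = Rbar[:, i]                        # step 3: ith column of Rbar ...
    di = dbar[keep]                          # ... with its ith element removed
    c = np.linalg.solve(Ri, -di)             # step 4: R^(i) c = -d^(i)
    cbar = np.ones(Rbar.shape[0], dtype=complex)
    cbar[keep] = c                           # insert the constraint c_i = 1
    P = np.vdot(dbar, cbar).real             # step 5: [dbar]^H cbar
    return cbar, P

# Illustrative stationary case with r(l) = 0.5^|l| and M = 2
Rbar = np.array([[1.0, 0.5, 0.25],
                 [0.5, 1.0, 0.5],
                 [0.25, 0.5, 1.0]])
a_ext, Pf = signal_estimator(Rbar, 0)        # forward predictor [1, a_1, a_2]
c_ext, Ps = signal_estimator(Rbar, 1)        # symmetric smoother (M = 2L, i = L = 1)
```

Multiplying `Rbar` by either returned vector gives a vector that is zero except for the MMSE in position i, which is the content of (6.5.12) and, for i = 0, of (6.5.20).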

6.5.3 Backward Linear Prediction

In this case, we want to estimate the sample x(n − M) in terms of the future samples x(n), x(n − 1), . . . , x(n − M + 1) (see Figure 6.16). The term backward linear prediction (BLP) is not accurate but is used since it is an established convention. A more appropriate name might be postdiction or hindsight. The BLP is basically a linear signal estimator with i = M and is denoted by

$$\begin{aligned} e^b(n) &\triangleq \sum_{k=0}^{M-1} b_k^*(n)\,x(n-k) + x(n-M) \\ &= \mathbf{b}^H(n)\,\mathbf{x}(n) + x(n-M) \end{aligned} \tag{6.5.21}$$

where

$$\mathbf{b}(n) \triangleq [b_0(n)\ \ b_1(n)\ \cdots\ b_{M-1}(n)]^T \tag{6.5.22}$$

is the BLP and $b_k(n)$ with $b_M(n) \triangleq 1$ is the backward prediction error filter (BPEF). For i = M, (6.5.11) gives

$$\bar{\mathbf{R}}(n) = \begin{bmatrix} \mathbf{R}(n) & \mathbf{r}^b(n) \\ \mathbf{r}^{bH}(n) & P_x(n-M) \end{bmatrix} \tag{6.5.23}$$

where

$$\mathbf{r}^b(n) \triangleq E\{\mathbf{x}(n)\,x^*(n-M)\} \tag{6.5.24}$$

The optimum backward linear predictor is specified by

$$\mathbf{R}(n)\,\mathbf{b}_o(n) = -\mathbf{r}^b(n) \tag{6.5.25}$$

and the MMSE is

$$P_o^b(n) = P_x(n-M) + \mathbf{r}^{bH}(n)\,\mathbf{b}_o(n) \tag{6.5.26}$$

and can be put in a single equation as

$$\bar{\mathbf{R}}(n)\begin{bmatrix} \mathbf{b}_o(n) \\ 1 \end{bmatrix} = \begin{bmatrix} \mathbf{0} \\ P_o^b(n) \end{bmatrix} \tag{6.5.27}$$

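In Table 6.4, we summarize the definitions and design equations for optimum FIR filtering and prediction. Using the entries in this table, we can easily obtain the normal equations and the MMSE for the FLP and BLP from those of the optimum filter.

TABLE 6.4 Summary of the design equations for optimum FIR filtering and prediction.

| | Optimum filter | FLP | BLP |
|---|---|---|---|
| Input data vector | $\mathbf{x}(n)$ | $\mathbf{x}(n-1)$ | $\mathbf{x}(n)$ |
| Desired response | $y(n)$ | $x(n)$ | $x(n-M)$ |
| Coefficient vector | $\mathbf{c}(n)$ | $\mathbf{a}(n)$ | $\mathbf{b}(n)$ |
| Estimation error | $e(n) = y(n) - \mathbf{c}^H(n)\mathbf{x}(n)$ | $e^f(n) = x(n) + \mathbf{a}^H(n)\mathbf{x}(n-1)$ | $e^b(n) = x(n-M) + \mathbf{b}^H(n)\mathbf{x}(n)$ |
| Normal equations | $\mathbf{R}(n)\mathbf{c}_o(n) = \mathbf{d}(n)$ | $\mathbf{R}(n-1)\mathbf{a}_o(n) = -\mathbf{r}^f(n)$ | $\mathbf{R}(n)\mathbf{b}_o(n) = -\mathbf{r}^b(n)$ |
| MMSE | $P_o^c(n) = P_y(n) - \mathbf{c}_o^H(n)\mathbf{d}(n)$ | $P_o^f(n) = P_x(n) + \mathbf{a}_o^H(n)\mathbf{r}^f(n)$ | $P_o^b(n) = P_x(n-M) + \mathbf{b}_o^H(n)\mathbf{r}^b(n)$ |
| Required moments | $\mathbf{R}(n) = E\{\mathbf{x}(n)\mathbf{x}^H(n)\}$, $\mathbf{d}(n) = E\{\mathbf{x}(n)y^*(n)\}$ | $\mathbf{r}^f(n) = E\{\mathbf{x}(n-1)x^*(n)\}$ | $\mathbf{r}^b(n) = E\{\mathbf{x}(n)x^*(n-M)\}$ |
| Stationary processes | $\mathbf{R}\mathbf{c}_o = \mathbf{d}$, R is Toeplitz | $\mathbf{R}\mathbf{a}_o = -\mathbf{r}^*$ | $\mathbf{R}\mathbf{b}_o = -\mathbf{J}\mathbf{r} \Rightarrow \mathbf{b}_o = \mathbf{J}\mathbf{a}_o^*$ |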

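6.5.4 Stationary Processes

If the process x(n) is stationary, then the correlation matrix $\bar{\mathbf{R}}(n)$ does not depend on the time n and it is Toeplitz

$$\bar{\mathbf{R}} = \begin{bmatrix} r(0) & r(1) & \cdots & r(M) \\ r^*(1) & r(0) & \cdots & r(M-1) \\ \vdots & \vdots & \ddots & \vdots \\ r^*(M) & r^*(M-1) & \cdots & r(0) \end{bmatrix} \tag{6.5.28}$$

Therefore, all the resulting linear MMSE signal estimators are time-invariant. If we define the correlation vector

$$\mathbf{r} \triangleq [r(1)\ \ r(2)\ \cdots\ r(M)]^T \tag{6.5.29}$$

where $r(l) = E\{x(n)x^*(n-l)\}$, we can easily see that the cross-correlation vectors for the FLP and the BLP are

$$\mathbf{r}^f = E\{\mathbf{x}(n-1)\,x^*(n)\} = \mathbf{r}^* \tag{6.5.30}$$

and

$$\mathbf{r}^b = E\{\mathbf{x}(n)\,x^*(n-M)\} = \mathbf{J}\mathbf{r} \tag{6.5.31}$$

where

$$\mathbf{J} = \begin{bmatrix} 0 & \cdots & 0 & 1 \\ 0 & \cdots & 1 & 0 \\ \vdots & & \vdots & \vdots \\ 1 & \cdots & 0 & 0 \end{bmatrix}, \qquad \mathbf{J}^H\mathbf{J} = \mathbf{J}\mathbf{J}^H = \mathbf{I} \tag{6.5.32}$$

is the exchange matrix that simply reverses the order of the vector elements. Therefore,

$$\mathbf{R}\,\mathbf{a}_o = -\mathbf{r}^* \tag{6.5.33}$$

$$P_o^f = r(0) + \mathbf{r}^T\mathbf{a}_o \tag{6.5.34}$$

$$\mathbf{R}\,\mathbf{b}_o = -\mathbf{J}\mathbf{r} \tag{6.5.35}$$

$$P_o^b = r(0) + \mathbf{r}^H\mathbf{J}\,\mathbf{b}_o \tag{6.5.36}$$

where the Toeplitz matrix R is obtained from $\bar{\mathbf{R}}$ by deleting the last column and row, and where (6.5.34) follows from (6.5.19) because $\mathbf{r}^{fH} = (\mathbf{r}^*)^H = \mathbf{r}^T$. Using the centrosymmetry property of symmetric Toeplitz matrices

$$\mathbf{R}\mathbf{J} = \mathbf{J}\mathbf{R}^* \tag{6.5.37}$$

and (6.5.33), we have

$$\mathbf{J}\mathbf{R}^*\mathbf{a}_o^* = -\mathbf{J}\mathbf{r} \qquad \text{or} \qquad \mathbf{R}\mathbf{J}\mathbf{a}_o^* = -\mathbf{J}\mathbf{r} \tag{6.5.38}$$

Comparing the last equation with (6.5.35), we have

$$\mathbf{b}_o = \mathbf{J}\mathbf{a}_o^* \tag{6.5.39}$$

that is, the BLP coefficient vector is the reverse of the conjugated FLP coefficient vector. Furthermore, from (6.5.34), (6.5.36), and (6.5.39), we have

$$P_o \triangleq P_o^f = P_o^b \tag{6.5.40}$$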

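that is, the forward and backward prediction error powers are equal. This remarkable symmetry between the MMSE forward and backward linear predictors holds for stationary processes but disappears for nonstationary processes. Also, we do not have such a symmetry if a criterion other than the MMSE is used and the process to be predicted is non-Gaussian (Weiss 1975; Lawrence 1991).

EXAMPLE 6.5.1. To illustrate the basic ideas in FLP, BLP, and linear smoothing, we consider the second-order estimators for stationary processes. The augmented equations for the first-order FLP are [r(0) is always real]

$$\begin{bmatrix} r(0) & r(1) \\ r^*(1) & r(0) \end{bmatrix}\begin{bmatrix} a_0^{(1)} \\ a_1^{(1)} \end{bmatrix} = \begin{bmatrix} P_1^f \\ 0 \end{bmatrix}$$

and they can be solved by using Cramer’s rule. Indeed, we obtain

$$a_0^{(1)} = \frac{\det\begin{bmatrix} P_1^f & r(1) \\ 0 & r(0) \end{bmatrix}}{\det\mathbf{R}_2} = \frac{r(0)P_1^f}{\det\mathbf{R}_2} = 1 \;\Rightarrow\; P_1^f = \frac{\det\mathbf{R}_2}{\det\mathbf{R}_1} = \frac{r^2(0) - |r(1)|^2}{r(0)}$$

and

$$a_1^{(1)} = \frac{\det\begin{bmatrix} r(0) & P_1^f \\ r^*(1) & 0 \end{bmatrix}}{\det\mathbf{R}_2} = \frac{-P_1^f\,r^*(1)}{\det\mathbf{R}_2} = -\frac{r^*(1)}{r(0)}$$

for the MMSE and the FLP. For the second-order case we have

$$\begin{bmatrix} r(0) & r(1) & r(2) \\ r^*(1) & r(0) & r(1) \\ r^*(2) & r^*(1) & r(0) \end{bmatrix}\begin{bmatrix} a_0^{(2)} \\ a_1^{(2)} \\ a_2^{(2)} \end{bmatrix} = \begin{bmatrix} P_2^f \\ 0 \\ 0 \end{bmatrix}$$

whose solution is

$$a_0^{(2)} = \frac{P_2^f\det\mathbf{R}_2}{\det\mathbf{R}_3} = 1 \;\Rightarrow\; P_2^f = \frac{\det\mathbf{R}_3}{\det\mathbf{R}_2}$$

$$a_1^{(2)} = \frac{-P_2^f\det\begin{bmatrix} r^*(1) & r(1) \\ r^*(2) & r(0) \end{bmatrix}}{\det\mathbf{R}_3} = -\frac{\det\begin{bmatrix} r^*(1) & r(1) \\ r^*(2) & r(0) \end{bmatrix}}{\det\mathbf{R}_2} = \frac{r(1)r^*(2) - r(0)r^*(1)}{r^2(0) - |r(1)|^2}$$

and

$$a_2^{(2)} = \frac{P_2^f\det\begin{bmatrix} r^*(1) & r(0) \\ r^*(2) & r^*(1) \end{bmatrix}}{\det\mathbf{R}_3} = \frac{\det\begin{bmatrix} r^*(1) & r(0) \\ r^*(2) & r^*(1) \end{bmatrix}}{\det\mathbf{R}_2} = \frac{[r^*(1)]^2 - r(0)r^*(2)}{r^2(0) - |r(1)|^2}$$

Similarly, for the BLP

$$\begin{bmatrix} r(0) & r(1) \\ r^*(1) & r(0) \end{bmatrix}\begin{bmatrix} b_0^{(1)} \\ b_1^{(1)} \end{bmatrix} = \begin{bmatrix} 0 \\ P_1^b \end{bmatrix}$$

where $b_1^{(1)} = 1$, we obtain

$$P_1^b = \frac{\det\mathbf{R}_2}{\det\mathbf{R}_1} \qquad b_0^{(1)} = -\frac{r(1)}{r(0)}$$

and

$$\begin{bmatrix} r(0) & r(1) & r(2) \\ r^*(1) & r(0) & r(1) \\ r^*(2) & r^*(1) & r(0) \end{bmatrix}\begin{bmatrix} b_0^{(2)} \\ b_1^{(2)} \\ b_2^{(2)} \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ P_2^b \end{bmatrix}$$

$$P_2^b = \frac{\det\mathbf{R}_3}{\det\mathbf{R}_2} \qquad b_1^{(2)} = \frac{r^*(1)r(2) - r(0)r(1)}{r^2(0) - |r(1)|^2} \qquad b_0^{(2)} = \frac{r^2(1) - r(0)r(2)}{r^2(0) - |r(1)|^2}$$

We note that

$$P_1^f = P_1^b \qquad a_1^{(1)} = b_0^{(1)*}$$

and

$$P_2^f = P_2^b \qquad a_1^{(2)} = b_1^{(2)*} \qquad a_2^{(2)} = b_0^{(2)*}$$

which is a result of the stationarity of x(n) or equivalently of the Toeplitz structure of $\mathbf{R}_m$. For the linear signal estimator, we have

$$\begin{bmatrix} r(0) & r(1) & r(2) \\ r^*(1) & r(0) & r(1) \\ r^*(2) & r^*(1) & r(0) \end{bmatrix}\begin{bmatrix} c_0^{(2)} \\ c_1^{(2)} \\ c_2^{(2)} \end{bmatrix} = \begin{bmatrix} 0 \\ P_2 \\ 0 \end{bmatrix}$$

with $c_1^{(2)} = 1$. Using Cramer’s rule, we obtain

$$P_2 = \frac{\det\mathbf{R}_3}{\det\mathbf{R}_3^{(2)}} \qquad \text{with} \qquad \det\mathbf{R}_3^{(2)} = \det\begin{bmatrix} r(0) & r(2) \\ r^*(2) & r(0) \end{bmatrix} = r^2(0) - |r(2)|^2$$

$$c_0^{(2)} = \frac{-P_2\det\begin{bmatrix} r(1) & r(2) \\ r^*(1) & r(0) \end{bmatrix}}{\det\mathbf{R}_3} = -\frac{\det\begin{bmatrix} r(1) & r(2) \\ r^*(1) & r(0) \end{bmatrix}}{\det\mathbf{R}_3^{(2)}} = \frac{r^*(1)r(2) - r(0)r(1)}{r^2(0) - |r(2)|^2}$$

$$c_2^{(2)} = \frac{-P_2\det\begin{bmatrix} r(0) & r(1) \\ r^*(2) & r^*(1) \end{bmatrix}}{\det\mathbf{R}_3} = -\frac{\det\begin{bmatrix} r(0) & r(1) \\ r^*(2) & r^*(1) \end{bmatrix}}{\det\mathbf{R}_3^{(2)}} = \frac{r(1)r^*(2) - r(0)r^*(1)}{r^2(0) - |r(2)|^2}$$

from which we see that $c_0^{(2)} = c_2^{(2)*}$; that is, we have a linear phase estimator.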
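The closed-form second-order results of this example, together with the symmetry relations (6.5.39) and (6.5.40), can be checked numerically. The sketch below uses hypothetical complex autocorrelation values, chosen only so that the Hermitian Toeplitz matrix is positive definite; they are not taken from the text.

```python
import numpy as np

# Hypothetical values of r(0), r(1), r(2); the matrix R3 is diagonally dominant
r0, r1, r2 = 2.0, 0.8 + 0.4j, 0.3 - 0.2j
R3 = np.array([[r0, r1, r2],
               [np.conj(r1), r0, r1],
               [np.conj(r2), np.conj(r1), r0]])
R2 = R3[:2, :2]
r = np.array([r1, r2])
J = np.eye(2)[::-1]                        # exchange matrix (6.5.32)

a = np.linalg.solve(R2, -np.conj(r))       # forward predictor (6.5.33)
b = np.linalg.solve(R2, -J @ r)            # backward predictor (6.5.35)
Pf = (r0 + r @ a).real                     # forward MMSE, r(0) + r^T a_o
Pb = (r0 + np.conj(r) @ J @ b).real        # backward MMSE (6.5.36)

# Closed-form expressions obtained with Cramer's rule in Example 6.5.1
den = r0**2 - abs(r1)**2
a1_closed = (r1 * np.conj(r2) - r0 * np.conj(r1)) / den
a2_closed = (np.conj(r1)**2 - r0 * np.conj(r2)) / den
P2_closed = (np.linalg.det(R3) / np.linalg.det(R2)).real
```

With these values, `b` equals the reversed and conjugated `a`, and the two error powers coincide with det R3 / det R2, as the example states.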

6.5.5 Properties

Linear signal estimators and predictors have some interesting properties that we discuss next.

PROPERTY 6.5.1. If the process x(n) is stationary, then the symmetric, linear smoother has linear phase.

Proof. Using the centrosymmetry property $\bar{\mathbf{R}}\mathbf{J} = \mathbf{J}\bar{\mathbf{R}}^*$ and (6.5.12) for M = 2L, i = L, we obtain

$$\bar{\mathbf{c}} = \mathbf{J}\bar{\mathbf{c}}^* \tag{6.5.41}$$

that is, the symmetric, linear smoother has even symmetry and, therefore, has linear phase (see Problem 6.12).

PROPERTY 6.5.2. If the process x(n) is stationary, the forward prediction error filter (PEF) $1, a_1, a_2, \ldots, a_M$ is minimum-phase and the backward PEF $b_0, b_1, \ldots, b_{M-1}, 1$ is maximum-phase.

Proof. The system function of the Mth-order forward PEF can be factored as

$$A(z) = 1 + \sum_{k=1}^{M} a_k^*\,z^{-k} = G(z)(1 - qz^{-1})$$

where q is a zero of A(z) and

$$G(z) = 1 + \sum_{k=1}^{M-1} g_k\,z^{-k}$$

is an (M − 1)st-order filter. The filter A(z) can be implemented as the cascade connection of the filters G(z) and $1 - qz^{-1}$ (see Figure 6.17). The output s(n) of G(z) is

$$s(n) = x(n) + g_1 x(n-1) + \cdots + g_{M-1}x(n-M+1)$$

and it is easy to see that

$$E\{s(n-1)\,e^{f*}(n)\} = 0 \tag{6.5.42}$$

because $E\{x(n-k)\,e^{f*}(n)\} = 0$ for 1 ≤ k ≤ M.

FIGURE 6.17 The prediction error filter with one zero factored out. [Diagram: x(n) enters G(z), whose output s(n) enters $1 - qz^{-1}$, whose output is $e^f(n)$.]

Since the output of the second filter can be expressed as

$$e^f(n) = s(n) - qs(n-1)$$

we have

$$E\{s(n-1)\,e^{f*}(n)\} = E\{s(n-1)s^*(n)\} - q^*E\{s(n-1)s^*(n-1)\} = 0$$

which implies that

$$q = \frac{r_s^*(-1)}{r_s(0)} = \frac{r_s(1)}{r_s(0)} \;\Rightarrow\; |q| \le 1$$

because q is equal to the normalized autocorrelation of s(n). If the process x(n) is not predictable, that is, $E\{|e^f(n)|^2\} \neq 0$, we have

$$\begin{aligned} E\{|e^f(n)|^2\} &= E\{e^f(n)[s^*(n) - q^*s^*(n-1)]\} \\ &= E\{e^f(n)\,s^*(n)\} \qquad \text{due to (6.5.42)} \\ &= E\{[s(n) - qs(n-1)]\,s^*(n)\} \\ &= r_s(0)(1 - |q|^2) \neq 0 \end{aligned}$$

which implies that

$$|q| < 1$$

that is, the zero q of the forward PEF filter is strictly inside the unit circle. Repeating this process, we can show that all zeros of A(z) are inside the unit circle; that is, A(z) is minimum-phase. This proof was presented in Vaidyanathan et al. (1996). The property $\mathbf{b} = \mathbf{J}\mathbf{a}^*$ is equivalent to

$$B(z) = z^{-M}A^*\!\left(\frac{1}{z^*}\right)$$

which implies that B(z) is a maximum-phase filter (see Section 2.4).

PROPERTY 6.5.3. The forward and backward prediction error filters can be expressed in terms of the eigenvalues $\bar{\lambda}_i$ and the eigenvectors $\bar{\mathbf{q}}_i$ of the correlation matrix $\bar{\mathbf{R}}(n)$ as follows

$$\begin{bmatrix} 1 \\ \mathbf{a}_o(n) \end{bmatrix} = P_o^f(n)\sum_{i=1}^{M+1}\frac{1}{\bar{\lambda}_i}\,\bar{\mathbf{q}}_i\,\bar{q}_{i,1}^* \tag{6.5.43}$$

and

$$\begin{bmatrix} \mathbf{b}_o(n) \\ 1 \end{bmatrix} = P_o^b(n)\sum_{i=1}^{M+1}\frac{1}{\bar{\lambda}_i}\,\bar{\mathbf{q}}_i\,\bar{q}_{i,M+1}^* \tag{6.5.44}$$

where $\bar{q}_{i,1}$ and $\bar{q}_{i,M+1}$ are the first and last components of $\bar{\mathbf{q}}_i$. The first equation of (6.5.43) and the last equation in (6.5.44) can be solved to provide the MMSEs $P_o^f(n)$ and $P_o^b(n)$, respectively.

Proof. See Problem 6.13.

PROPERTY 6.5.4. Let $\bar{\mathbf{R}}^{-1}(n)$ be the inverse of the correlation matrix $\bar{\mathbf{R}}(n)$. Then, the inverse of the ith element of the ith column of $\bar{\mathbf{R}}^{-1}(n)$ is equal to the MMSE $P^{(i)}(n)$, and the ith column normalized by the ith element is equal to $\mathbf{c}^{(i)}(n)$.

Proof. See Problem 6.14.

PROPERTY 6.5.5. The MMSE prediction errors can be expressed as

$$P_o^f(n) = \frac{\det\bar{\mathbf{R}}(n)}{\det\mathbf{R}(n-1)} \qquad P_o^b(n) = \frac{\det\bar{\mathbf{R}}(n)}{\det\mathbf{R}(n)} \tag{6.5.45}$$

Proof. See Problem 6.17.

The previous concepts are illustrated in the following example.

EXAMPLE 6.5.2. A random sequence $x(n)$ is generated by passing the white Gaussian noise process $w(n) \sim \mathrm{WN}(0, 1)$ through the filter

$$x(n) = w(n) + \tfrac{1}{2} w(n-1)$$

Determine the second-order FLP, BLP, and symmetric linear signal smoother.

Solution. The complex power spectrum is

$$R(z) = H(z)H(z^{-1}) = (1 + \tfrac{1}{2} z^{-1})(1 + \tfrac{1}{2} z) = \tfrac{1}{2} z + \tfrac{5}{4} + \tfrac{1}{2} z^{-1}$$

Therefore, the autocorrelation sequence is $r(0) = \tfrac{5}{4}$, $r(\pm 1) = \tfrac{1}{2}$, and $r(l) = 0$ for $|l| \geq 2$.

Since the power spectrum $R(e^{j\omega}) = \tfrac{5}{4} + \cos\omega > 0$ for all $\omega$, the autocorrelation matrix is positive definite. The same is true of any principal submatrix. To determine the second-order linear signal estimators, we start with the matrix

$$\bar{\mathbf{R}} = \begin{bmatrix} \tfrac{5}{4} & \tfrac{1}{2} & 0 \\ \tfrac{1}{2} & \tfrac{5}{4} & \tfrac{1}{2} \\ 0 & \tfrac{1}{2} & \tfrac{5}{4} \end{bmatrix}$$

and follow the procedure outlined in Section 6.5.1 or use the formulas in Table 6.3. The results are

Forward linear prediction ($i = 0$): $\{a_k\} \to \{1, -0.476, 0.190\}$, $P_o^f = 1.0119$
Symmetric linear smoothing ($i = 1$): $\{c_k\} \to \{-0.4, 1, -0.4\}$, $P_o^s = 0.8500$
Backward linear prediction ($i = 2$): $\{b_k\} \to \{0.190, -0.476, 1\}$, $P_o^b = 1.0119$

The inverse of the correlation matrix $\bar{\mathbf{R}}$ is

$$\bar{\mathbf{R}}^{-1} = \begin{bmatrix} 0.9882 & -0.4706 & 0.1882 \\ -0.4706 & 1.1765 & -0.4706 \\ 0.1882 & -0.4706 & 0.9882 \end{bmatrix}$$

and we see that dividing the first, second, and third columns by 0.9882, 1.1765, and 0.9882 provides the forward PEF, the symmetric linear smoothing filter, and the backward PEF, respectively. The inverses of the diagonal elements provide the MMSEs Pof , Pos , and Pob . The reader can easily see, by computing the zeros of the corresponding system functions, that the FLP is minimum-phase, the BLP is maximum-phase, and the symmetric linear smoother is mixed-phase. It is interesting to note that the smoother performs better than either of the predictors.
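The numbers in this example can be reproduced with a few lines of NumPy. The sketch below is only an illustration of Property 6.5.4 applied to this matrix; the helper name `estimator_from_inverse` is hypothetical and not part of the text.

```python
import numpy as np

# Correlation matrix of x(n) = w(n) + 0.5 w(n-1): r(0) = 5/4, r(+-1) = 1/2.
R_bar = np.array([[1.25, 0.50, 0.00],
                  [0.50, 1.25, 0.50],
                  [0.00, 0.50, 1.25]])
R_inv = np.linalg.inv(R_bar)

def estimator_from_inverse(R_inv, i):
    """Hypothetical helper: column i of the inverse, scaled so that its
    i-th element is 1, together with the MMSE 1 / R_inv[i, i]."""
    col = R_inv[:, i]
    return col / col[i], 1.0 / col[i]

a, P_f = estimator_from_inverse(R_inv, 0)   # forward PEF
c, P_s = estimator_from_inverse(R_inv, 1)   # symmetric smoother
b, P_b = estimator_from_inverse(R_inv, 2)   # backward PEF

# Zeros of A(z) = a0 + a1 z^-1 + a2 z^-2 are the roots of a0 z^2 + a1 z + a2.
flp_zeros = np.roots(a)
blp_zeros = np.roots(b)

print("a =", np.round(a, 3), " P_f =", round(P_f, 4))
print("c =", np.round(c, 3), " P_s =", round(P_s, 4))
print("b =", np.round(b, 3), " P_b =", round(P_b, 4))
print("|zeros of A(z)| =", np.abs(flp_zeros))
print("|zeros of B(z)| =", np.abs(blp_zeros))
```

The zero magnitudes printed at the end give a direct check of the minimum-phase and maximum-phase claims for the forward and backward filters.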

6.6 OPTIMUM INFINITE IMPULSE RESPONSE FILTERS

So far we have dealt with optimum FIR filters and predictors for nonstationary and stationary processes. In this section, we consider the design of optimum IIR filters for stationary stochastic processes. For nonstationary processes, the theory becomes very complicated. The Wiener-Hopf equations for optimum IIR filters are the same as for FIR filters; only the limits in the convolution summation and the range of values for which the normal equations hold are different. Both are determined by the limits of summation in the filter convolution equation. We can easily see from (6.4.16) and (6.4.17), or by applying the orthogonality principle (6.2.41), that the optimum IIR filter

$$\hat{y}(n) = \sum_k h_o(k)\, x(n-k) \tag{6.6.1}$$

is specified by the Wiener-Hopf equations

$$\sum_k h_o(k)\, r_x(m-k) = r_{yx}(m) \tag{6.6.2}$$

and the MMSE is given by

$$P_o = r_y(0) - \sum_k h_o(k)\, r_{yx}^*(k) \tag{6.6.3}$$

where $r_x(l)$ is the autocorrelation of the input stochastic process $x(n)$ and $r_{yx}(l)$ is the cross-correlation between the desired response process $y(n)$ and $x(n)$. We assume that the processes $x(n)$ and $y(n)$ are jointly wide-sense stationary with zero mean values. The range of summation in the above equations includes all the nonzero coefficients of the impulse response of the filter. The range of $k$ in (6.6.1) determines the number of unknowns and the number of equations, that is, the range of $m$. For IIR filters, we have an infinite number of equations and unknowns, and thus only analytical solutions for (6.6.2) are possible. The key to analytical solutions is that the left-hand side of (6.6.2) can be expressed as the convolution of $h_o(m)$ with $r_x(m)$, that is,

$$h_o(m) * r_x(m) = r_{yx}(m) \tag{6.6.4}$$

which is a convolutional equation that can be solved by using the z-transform. The complexity of the solution depends on the range of $m$. The formula for the MMSE is the same for any filter, either FIR or IIR. Indeed, using Parseval's theorem and (6.6.3), we obtain

$$P_o = r_y(0) - \frac{1}{2\pi j} \oint_C H_o(z)\, R_{yx}^*\!\left(\frac{1}{z^*}\right) z^{-1}\, dz \tag{6.6.5}$$

where $H_o(z)$ is the system function of the optimum filter and $R_{yx}(z) = \mathcal{Z}\{r_{yx}(l)\}$. The power $P_y$ can be computed by

$$P_y = r_y(0) = \frac{1}{2\pi j} \oint_C R_y(z)\, z^{-1}\, dz \tag{6.6.6}$$

where $R_y(z) = \mathcal{Z}\{r_y(l)\}$. Combining (6.6.5) with (6.6.6), we obtain

$$P_o = \frac{1}{2\pi j} \oint_C \left[ R_y(z) - H_o(z)\, R_{yx}^*\!\left(\frac{1}{z^*}\right) \right] z^{-1}\, dz \tag{6.6.7}$$

which expresses the MMSE in terms of z-transforms. To obtain the MMSE in the frequency domain, we replace $z$ by $e^{j\omega}$. For example, (6.6.5) becomes

$$P_o = r_y(0) - \frac{1}{2\pi} \int_{-\pi}^{\pi} H_o(e^{j\omega})\, R_{yx}^*(e^{j\omega})\, d\omega$$

where $H_o(e^{j\omega})$ is the frequency response of the optimum filter.

6.6.1 Noncausal IIR Filters

For the noncausal IIR filter

$$\hat{y}(n) = \sum_{k=-\infty}^{\infty} h_{nc}(k)\, x(n-k) \tag{6.6.8}$$

the range of the Wiener-Hopf equations (6.6.2) is $-\infty < m < \infty$, and they can be easily solved by using the convolution property of the z-transform. This gives

$$H_{nc}(z) R_x(z) = R_{yx}(z) \quad \text{or} \quad H_{nc}(z) = \frac{R_{yx}(z)}{R_x(z)} \tag{6.6.9}$$

where $H_{nc}(z)$ is the system function of the optimum filter, $R_x(z)$ is the complex PSD of $x(n)$, and $R_{yx}(z)$ is the complex cross-PSD between $y(n)$ and $x(n)$.

6.6.2 Causal IIR Filters

For the causal IIR filter

$$\hat{y}(n) = \sum_{k=0}^{\infty} h_c(k)\, x(n-k) \tag{6.6.10}$$

the Wiener-Hopf equations (6.6.2) hold only for $m$ in the range $0 \leq m < \infty$. Since the sequence $r_{yx}(m)$ can be expressed as the convolution of $h_o(m)$ and $r_x(m)$ only for $m \geq 0$, we cannot solve (6.6.2) using the z-transform. However, a simple solution is possible† using the spectral factorization theorem. This approach was introduced for continuous-time processes in Bode and Shannon (1950) and Zadeh and Ragazzini (1950). It is based on the following two observations:

1. The solution of the Wiener-Hopf equations is trivial if the input is white.
2. Any regular process can be transformed to an equivalent white process.

† An analogous matrix-based approach is extensively used in Chapter 7 for the design and implementation of optimum FIR filters.

White input processes. We first note that if the process $x(n)$ is white noise, the solution of the Wiener-Hopf equations is trivial. Indeed, if

$$r_x(l) = \sigma_x^2\, \delta(l)$$

then Equation (6.6.4) gives

$$h_c(m) * \delta(m) = \frac{r_{yx}(m)}{\sigma_x^2} \qquad 0 \leq m < \infty$$

which implies that

$$h_c(m) = \begin{cases} \dfrac{1}{\sigma_x^2}\, r_{yx}(m) & 0 \leq m < \infty \\ 0 & m < 0 \end{cases} \tag{6.6.11}$$

because the filter is causal. The system function of the optimum filter is given by

$$H_c(z) = \frac{1}{\sigma_x^2} [R_{yx}(z)]_+ \tag{6.6.12}$$

where

$$[R_{yx}(z)]_+ \triangleq \sum_{l=0}^{\infty} r_{yx}(l)\, z^{-l} \tag{6.6.13}$$

is the one-sided z-transform of the two-sided sequence $r_{yx}(l)$. The MMSE is given by

$$P_c = r_y(0) - \frac{1}{\sigma_x^2} \sum_{k=0}^{\infty} |r_{yx}(k)|^2 \tag{6.6.14}$$

which follows from (6.6.3) and (6.6.11).

Regular input processes. The PSD of a regular process can be factored as

$$R_x(z) = \sigma_x^2\, H_x(z)\, H_x^*\!\left(\frac{1}{z^*}\right) \tag{6.6.15}$$

where $H_x(z)$ is the innovations filter (see Section 4.1). The innovations process

$$w(n) = x(n) - \sum_{k=1}^{\infty} h_x(k)\, w(n-k) \tag{6.6.16}$$

is white and linearly equivalent to the input process $x(n)$. Therefore, linear estimation of $y(n)$ based on $x(n)$ is equivalent to linear estimation of $y(n)$ based on $w(n)$. The optimum filter that estimates $y(n)$ from $x(n)$ is obtained by cascading the whitening filter $1/H_x(z)$ with the optimum filter that estimates $y(n)$ from $w(n)$ (see Figure 6.18). Since $w(n)$ is white, the optimum filter for estimating $y(n)$ from $w(n)$ is

$$H_c(z) = \frac{1}{\sigma_x^2} [R_{yw}(z)]_+ \tag{6.6.17}$$

where $[R_{yw}(z)]_+$ is the one-sided z-transform of $r_{yw}(l)$. To express $H_c(z)$ in terms of $R_{yx}(z)$, we need the relationship between $R_{yw}(z)$ and $R_{yx}(z)$. From

$$x(n) = \sum_{k=0}^{\infty} h_x(k)\, w(n-k)$$

we obtain

$$E\{y(n)\, x^*(n-l)\} = \sum_{k=0}^{\infty} h_x^*(k)\, E\{y(n)\, w^*(n-l-k)\}$$

or

$$r_{yx}(l) = \sum_{k=0}^{\infty} h_x^*(k)\, r_{yw}(l+k) \tag{6.6.18}$$

that is, $r_{yx}(l) = r_{yw}(l) * h_x^*(-l)$. Taking the z-transform of the above equation leads to

$$R_{yw}(z) = \frac{R_{yx}(z)}{H_x^*(1/z^*)} \tag{6.6.19}$$

which, combined with (6.6.17), gives

$$H_c(z) = \frac{1}{\sigma_x^2} \left[ \frac{R_{yx}(z)}{H_x^*(1/z^*)} \right]_+ \tag{6.6.20}$$

which is the causal optimum filter for the estimation of $y(n)$ from $w(n)$. The optimum filter for estimating $y(n)$ from $x(n)$ is

$$H_c(z) = \frac{1}{\sigma_x^2 H_x(z)} \left[ \frac{R_{yx}(z)}{H_x^*(1/z^*)} \right]_+ \tag{6.6.21}$$

which is causal since it is the cascade connection of two causal filters [see Figure 6.19(a)].

FIGURE 6.18 Optimum causal IIR filter design by the spectral factorization method. [Block diagram: $x(n)$ → whitening filter $1/H_x(z)$ → $w(n)$ → optimum filter for white input $(1/\sigma_x^2)[R_{yw}(z)]_+$ → $\hat{y}(n)$; the cascade of the two blocks is the optimum filter.]

The MMSE from (6.6.3) can also be expressed as

$$P_c = r_y(0) - \frac{1}{\sigma_x^2} \sum_{k=0}^{\infty} |r_{yw}(k)|^2 \tag{6.6.22}$$

which shows that the MMSE decreases as we increase the order of the filter. Table 6.5 summarizes the equations required for the design of optimum FIR and IIR filters.

FIGURE 6.19 Comparison of causal and noncausal IIR optimum filters. [Block diagrams: (a) optimum causal IIR filter: $x(n)$ → whitening filter $1/H_x(z)$ → $w(n)$ → optimum causal filter for white input $(1/\sigma_x^2)[R_{yx}(z)/H_x^*(1/z^*)]_+$ → $\hat{y}(n)$; (b) optimum noncausal IIR filter: $x(n)$ → whitening filter $1/H_x(z)$ → $w(n)$ → optimum noncausal filter for white input $(1/\sigma_x^2)\,R_{yx}(z)/H_x^*(1/z^*)$ → $\hat{y}(n)$.]

TABLE 6.5 Design of FIR and IIR optimum filters for stationary processes.

| Filter type | Solution | Required quantities |
|---|---|---|
| FIR | $e(n) = y(n) - \mathbf{c}_o^H \mathbf{x}(n)$; $\mathbf{c}_o = \mathbf{R}^{-1}\mathbf{d}$; $P_o = r_y(0) - \mathbf{d}^H \mathbf{c}_o$ | $\mathbf{R} = [r_x(m-k)]$, $\mathbf{d} = [r_{yx}(m)]$, $0 \leq k, m \leq M-1$, $M$ finite |
| Noncausal IIR | $H_{nc}(z) = \dfrac{R_{yx}(z)}{R_x(z)}$; $P_{nc} = r_y(0) - \sum_{k=-\infty}^{\infty} h_{nc}(k)\, r_{yx}^*(k)$ | $R_x(z) = \mathcal{Z}\{r_x(l)\}$, $R_{yx}(z) = \mathcal{Z}\{r_{yx}(l)\}$ |
| Causal IIR | $H_c(z) = \dfrac{1}{\sigma_x^2 H_x(z)} \left[ \dfrac{R_{yx}(z)}{H_x^*(1/z^*)} \right]_+$; $P_c = r_y(0) - \sum_{k=0}^{\infty} h_c(k)\, r_{yx}^*(k)$ | $R_x(z) = \sigma_x^2 H_x(z) H_x^*(1/z^*)$, $R_{yx}(z) = \mathcal{Z}\{r_{yx}(l)\}$ |

Finally, since the equation for the noncausal IIR filter can be written as

$$H_{nc}(z) = \frac{1}{\sigma_x^2 H_x(z)}\, \frac{R_{yx}(z)}{H_x^*(1/z^*)} \tag{6.6.23}$$

we see that the only difference from the causal filter is that the noncausal filter includes both the causal and noncausal parts of $R_{yx}(z)/H_x^*(1/z^*)$ [see Figure 6.19(b)]. By using the innovations process $w(n)$, the MMSE can be expressed as

$$P_{nc} = r_y(0) - \frac{1}{\sigma_x^2} \sum_{k=-\infty}^{\infty} |r_{yw}(k)|^2 \tag{6.6.24}$$

and is known as the irreducible MMSE because it is the best performance that can be achieved by a linear filter. Indeed, since $|r_{yw}(k)| \geq 0$, every coefficient we add to the optimum filter can help to reduce the MMSE.


6.6.3 Filtering of Additive Noise

To illustrate the optimum filtering theory developed above, we consider the problem of estimating a "useful" or desired signal $y(n)$ that is corrupted by additive noise $v(n)$. The goal is to find an optimum filter that extracts the signal $y(n)$ from the noisy observations

$$x(n) = y(n) + v(n) \tag{6.6.25}$$

given that $y(n)$ and $v(n)$ are uncorrelated processes with known autocorrelation sequences $r_y(l)$ and $r_v(l)$. To design the optimum filter, we need the autocorrelation $r_x(l)$ of the input signal $x(n)$ and the cross-correlation $r_{yx}(l)$ between the desired response $y(n)$ and the input signal $x(n)$. Using (6.6.25), we find

$$r_x(l) = E\{x(n)\, x^*(n-l)\} = r_y(l) + r_v(l) \tag{6.6.26}$$

and

$$r_{yx}(l) = E\{y(n)\, x^*(n-l)\} = r_y(l) \tag{6.6.27}$$

because $y(n)$ and $v(n)$ are uncorrelated. The design of optimum IIR filters requires the functions $R_x(z)$ and $R_{yx}(z)$. Taking the z-transform of (6.6.26) and (6.6.27), we obtain

$$R_x(z) = R_y(z) + R_v(z) \tag{6.6.28}$$

and

$$R_{yx}(z) = R_y(z) \tag{6.6.29}$$

The noncausal optimum filter is given by

$$H_{nc}(z) = \frac{R_{yx}(z)}{R_x(z)} = \frac{R_y(z)}{R_y(z) + R_v(z)} \tag{6.6.30}$$

which for $z = e^{j\omega}$ shows that, for those values of $\omega$ for which $|R_y(e^{j\omega})| \gg |R_v(e^{j\omega})|$, that is, for high SNR, we have $|H_{nc}(e^{j\omega})| \approx 1$. In contrast, if $|R_y(e^{j\omega})| \ll |R_v(e^{j\omega})|$, that is, for low SNR, we have $|H_{nc}(e^{j\omega})| \approx 0$. Thus, the optimum filter "passes" its input in bands with high SNR and attenuates it in bands with low SNR, as we would expect intuitively. Substituting (6.6.30) into (6.6.7), we obtain for real-valued signals

$$P_{nc} = \frac{1}{2\pi j} \oint_C \frac{R_y(z)\, R_v(z)}{R_y(z) + R_v(z)}\, z^{-1}\, dz \tag{6.6.31}$$

which provides an expression for the MMSE that does not require knowledge of the optimum filter. We next illustrate the design of optimum filters for the reduction of additive noise with a detailed numerical example.

EXAMPLE 6.6.1. In this example we illustrate the design of an optimum IIR filter to extract a random signal with known autocorrelation sequence

$$r_y(l) = \alpha^{|l|} \qquad -1 < \alpha < 1 \tag{6.6.32}$$

which is corrupted by additive white noise with autocorrelation

$$r_v(l) = \sigma_v^2\, \delta(l) \tag{6.6.33}$$

The processes $y(n)$ and $v(n)$ are uncorrelated.

Required statistical moments. The input to the filter is the signal $x(n) = y(n) + v(n)$, and the desired response is the signal $y(n)$. The first step in the design is to determine the required second-order moments, that is, the autocorrelation of the input process and the cross-correlation between input and desired response. Substituting into (6.6.26) and (6.6.27), we have

$$r_x(l) = \alpha^{|l|} + \sigma_v^2\, \delta(l) \tag{6.6.34}$$

and

$$r_{yx}(l) = \alpha^{|l|} \tag{6.6.35}$$

To simplify the derivations and deal with "nice, round" numbers, we choose $\alpha = 0.8$ and $\sigma_v^2 = 1$. Then the complex power spectral densities of $y(n)$, $v(n)$, and $x(n)$ are

$$R_y(z) = \frac{(\tfrac{3}{5})^2}{(1 - \tfrac{4}{5} z^{-1})(1 - \tfrac{4}{5} z)} \qquad \tfrac{4}{5} < |z| < \tfrac{5}{4} \tag{6.6.36}$$

$$R_v(z) = \sigma_v^2 = 1 \tag{6.6.37}$$

and

$$R_x(z) = \frac{8}{5}\, \frac{(1 - \tfrac{1}{2} z^{-1})(1 - \tfrac{1}{2} z)}{(1 - \tfrac{4}{5} z^{-1})(1 - \tfrac{4}{5} z)} \tag{6.6.38}$$

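respectively.

Noncausal filter. Using (6.6.9), (6.6.29), (6.6.36), and (6.6.38), we obtain

$$H_{nc}(z) = \frac{R_{yx}(z)}{R_x(z)} = \frac{9}{40}\, \frac{1}{(1 - \tfrac{1}{2} z^{-1})(1 - \tfrac{1}{2} z)} \qquad \tfrac{1}{2} < |z| < 2$$

Evaluating the inverse z-transform, we have

$$h_{nc}(n) = \tfrac{3}{10} (\tfrac{1}{2})^{|n|} \qquad -\infty < n < \infty$$

The MMSE follows from (6.6.3):

$$P_{nc} = r_y(0) - \sum_{k=-\infty}^{\infty} h_{nc}(k)\, r_{yx}(k) = 1 - \tfrac{3}{10} \sum_{k=-\infty}^{\infty} (\tfrac{1}{2})^{|k|} (\tfrac{4}{5})^{|k|} = \tfrac{3}{10} \tag{6.6.39}$$

and provides the irreducible MMSE.

Causal filter. To find the optimum causal filter, we need to perform the spectral factorization

$$R_x(z) = \sigma_x^2\, H_x(z)\, H_x(z^{-1})$$

which is provided by (6.6.38) with

$$\sigma_x^2 = \tfrac{8}{5} \tag{6.6.40}$$

and

$$H_x(z) = \frac{1 - \tfrac{1}{2} z^{-1}}{1 - \tfrac{4}{5} z^{-1}} \tag{6.6.41}$$

Thus,

$$R_{yw}(z) = \frac{R_{yx}(z)}{H_x(z^{-1})} = \frac{0.36}{(1 - \tfrac{4}{5} z^{-1})(1 - \tfrac{1}{2} z)} = \frac{0.6}{1 - \tfrac{4}{5} z^{-1}} + \frac{0.3 z}{1 - \tfrac{1}{2} z} \tag{6.6.42}$$

where the first term (causal) converges for $|z| > \tfrac{4}{5}$ and the second term (noncausal) converges for $|z| < 2$. Hence, taking the causal part

$$\left[ \frac{R_{yx}(z)}{H_x(z^{-1})} \right]_+ = \frac{\tfrac{3}{5}}{1 - \tfrac{4}{5} z^{-1}}$$

and substituting into (6.6.21), we obtain the causal optimum filter

$$H_c(z) = \frac{5}{8}\, \frac{1 - \tfrac{4}{5} z^{-1}}{1 - \tfrac{1}{2} z^{-1}} \cdot \frac{\tfrac{3}{5}}{1 - \tfrac{4}{5} z^{-1}} = \frac{3}{8}\, \frac{1}{1 - \tfrac{1}{2} z^{-1}} \qquad |z| > \tfrac{1}{2} \tag{6.6.43}$$

The impulse response is

$$h_c(n) = \tfrac{3}{8} (\tfrac{1}{2})^n u(n)$$

which corresponds to a causal and stable IIR filter. The MMSE is

$$P_c = r_y(0) - \sum_{k=0}^{\infty} h_c(k)\, r_{yx}(k) = 1 - \tfrac{3}{8} \sum_{k=0}^{\infty} (\tfrac{1}{2})^k (\tfrac{4}{5})^k = \tfrac{3}{8} \tag{6.6.44}$$

which is, as expected, larger than $P_{nc}$.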
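The two MMSE values can be checked numerically. The sketch below makes two assumptions that are not in the text: the signal $y(n)$ is simulated as a Gaussian AR(1) process (any process with $r_y(l) = 0.8^{|l|}$ would do), and the sample size and seed are arbitrary. It applies the first-order recursion implied by (6.6.43) and evaluates (6.6.31) on a frequency grid.

```python
import numpy as np

alpha, sigma_v2 = 0.8, 1.0      # values chosen in the example
N = 200_000                     # number of simulated samples (assumption)
rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Assumed signal model: Gaussian AR(1) process with r_y(l) = alpha**|l|.
u = rng.normal(scale=np.sqrt(1.0 - alpha**2), size=N)
y = np.empty(N)
y[0] = rng.normal()             # unit-variance start keeps y(n) stationary
for n in range(1, N):
    y[n] = alpha * y[n - 1] + u[n]
x = y + rng.normal(scale=np.sqrt(sigma_v2), size=N)

# Causal Wiener filter (6.6.43): y_hat(n) = 0.5*y_hat(n-1) + 0.375*x(n).
y_hat = np.empty(N)
state = 0.0
for n in range(N):
    state = 0.5 * state + 0.375 * x[n]
    y_hat[n] = state

skip = 1000                     # discard the start-up transient of the recursion
mse_causal = np.mean((y[skip:] - y_hat[skip:]) ** 2)

# Noncausal MMSE from (6.6.31), evaluated on a uniform frequency grid.
omega = 2.0 * np.pi * np.arange(4096) / 4096
R_y = (1.0 - alpha**2) / (1.0 + alpha**2 - 2.0 * alpha * np.cos(omega))
R_v = sigma_v2 * np.ones_like(omega)
P_nc = np.mean(R_y * R_v / (R_y + R_v))

print(f"simulated causal MSE = {mse_causal:.3f} (theory 3/8)")
print(f"P_nc from (6.6.31)   = {P_nc:.4f} (theory 3/10)")
```

The simulated error is a sample estimate, so it only approaches 3/8 as the record length grows; the frequency-domain value is deterministic.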


From (6.6.43), we see that the optimum causal filter is a first-order recursive filter that can be implemented by the difference equation

$$\hat{y}(n) = \tfrac{1}{2} \hat{y}(n-1) + \tfrac{3}{8} x(n)$$

In general, this is possible only when $H_c(z)$ is a rational function.

Computation of MMSE using the innovation. We next illustrate how to find the MMSE by using the cross-correlation sequence $r_{yw}(l)$. From (6.6.42), we obtain

$$r_{yw}(l) = \begin{cases} \tfrac{3}{5} (\tfrac{4}{5})^l & l \geq 0 \\ \tfrac{3}{5}\, 2^l & l < 0 \end{cases} \tag{6.6.45}$$

which, in conjunction with (6.6.22) and (6.6.24), gives

$$P_c = r_y(0) - \frac{1}{\sigma_x^2} \sum_{k=0}^{\infty} r_{yw}^2(k) = 1 - \tfrac{5}{8} (\tfrac{3}{5})^2 \sum_{k=0}^{\infty} (\tfrac{4}{5})^{2k} = \tfrac{3}{8}$$

and

$$P_{nc} = r_y(0) - \frac{1}{\sigma_x^2} \sum_{k=0}^{\infty} r_{yw}^2(k) - \frac{1}{\sigma_x^2} \sum_{k=-\infty}^{-1} r_{yw}^2(k) = \tfrac{3}{10}$$

which agree with (6.6.44) and (6.6.39).

Noncausal smoothing filter. Suppose now that we want to estimate the value $y(n+D)$ of the desired response from the data $x(n)$, $-\infty < n < \infty$. Since

$$E\{y(n+D)\, x(n-l)\} = r_{yx}(l+D) \tag{6.6.46}$$

and

$$\mathcal{Z}\{r_{yx}(l+D)\} = z^D R_{yx}(z) \tag{6.6.47}$$

the noncausal Wiener smoothing filter is

$$H_{nc}^D(z) = \frac{z^D R_{yx}(z)}{R_x(z)} = \frac{z^D R_y(z)}{R_x(z)} = z^D H_{nc}(z) \tag{6.6.48}$$

$$h_{nc}^D(n) = h_{nc}(n+D) \tag{6.6.49}$$

The MMSE is

$$P_{nc}^D = r_y(0) - \sum_{k=-\infty}^{\infty} h_{nc}(k+D)\, r_{yx}(k+D) = P_{nc} \tag{6.6.50}$$

which is independent of the time shift $D$.

Causal prediction filter. We estimate the value $y(n+D)$ ($D > 0$) of the desired response using the data $x(k)$, $-\infty < k \leq n$. The whitening part of the causal prediction filter does not depend on $y(n)$ and is still given by (6.6.41). The coloring part depends on $y(n+D)$ and is given by $R'_{yw}(z) = z^D R_{yw}(z)$ or $r'_{yw}(l) = r_{yw}(l+D)$. Taking into consideration that $D > 0$, we can show (see Problem 6.31) that the system function and the impulse response of the causal Wiener predictor are

$$H_c^{[D]}(z) = \frac{5}{8}\, \frac{1 - \tfrac{4}{5} z^{-1}}{1 - \tfrac{1}{2} z^{-1}} \cdot \frac{\tfrac{3}{5} (\tfrac{4}{5})^D}{1 - \tfrac{4}{5} z^{-1}} = \frac{\tfrac{3}{8} (\tfrac{4}{5})^D}{1 - \tfrac{1}{2} z^{-1}} \tag{6.6.51}$$

and

$$h_c^{[D]}(n) = \tfrac{3}{8} (\tfrac{4}{5})^D (\tfrac{1}{2})^n u(n) \tag{6.6.52}$$

respectively. This shows that as $D \to \infty$, the impulse response $h_c^{[D]}(n) \to 0$, which is consistent with our intuition that the prediction is less and less reliable. The MMSE is

$$P_c^{[D]} = 1 - \tfrac{3}{8} (\tfrac{4}{5})^{2D} \sum_{k=0}^{\infty} (\tfrac{2}{5})^k = 1 - \tfrac{5}{8} (\tfrac{4}{5})^{2D} \tag{6.6.53}$$

and $P_c^{[D]} \to r_y(0) = 1$ as $D \to \infty$, which agrees with our earlier observation. For $D = 2$, the MMSE is $P_c^{[2]} = 93/125 = 0.7440 > P_c$, as expected.

Causal smoothing filter. To estimate the value $y(n+D)$ ($D < 0$) of the desired response using the data $x(k)$, $-\infty < k \leq n$, we need a smoothing Wiener filter. The derivation, which is straightforward but somewhat involved, is left for Problem 6.32. The system function of the optimum smoothing filter is

$$H_c^{[D]}(z) = \frac{3}{8} \left[ \frac{z^D}{1 - \tfrac{1}{2} z^{-1}} + \frac{2^D \sum_{l=0}^{-D-1} 2^l z^{-l}}{1 - \tfrac{1}{2} z^{-1}} - \frac{4}{5}\, \frac{2^D \sum_{l=0}^{-D-1} 2^l z^{-l-1}}{1 - \tfrac{1}{2} z^{-1}} \right] \tag{6.6.54}$$

where $D < 0$. To find the impulse response for $D = -2$, we invert (6.6.54). This gives

$$h_c^{[-2]}(k) = \tfrac{3}{32}\, \delta(k) + \tfrac{51}{320}\, \delta(k-1) + \tfrac{39}{128} (\tfrac{1}{2})^{k-2} u(k-2) \tag{6.6.55}$$

and if we express $r_{yx}(k-2)$ in a similar form, we can compute the MMSE

$$P_c^{[-2]} = 1 - \tfrac{3}{50} - \tfrac{51}{400} - (\tfrac{39}{128}) \tfrac{5}{3} = \tfrac{39}{128} = 0.3047 \tag{6.6.56}$$

which is less than $P_c = 0.375$. This should be expected since the smoothing Wiener filter uses more information than the Wiener filter (i.e., when $D = 0$). In fact it can be shown that

$$\lim_{D \to -\infty} P_c^{[D]} = P_{nc} \quad \text{and} \quad \lim_{D \to -\infty} h_c^{[D]}(n) = h_{nc}(n) \tag{6.6.57}$$

which is illustrated in Figure 6.20 (Problem 6.22). Figure 6.21 shows the impulse responses of the various optimum IIR filters designed in this example. Interestingly, all are obtained by shifting and truncating the impulse response of the optimum noncausal IIR filter.

FIGURE 6.20 MMSE as a function of the time shift D. [Plot: MMSE(D) versus D for $-10 \leq D \leq 10$, with marked levels 0.300, 0.375, and 1.000.]
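The dependence of the causal MMSE on the time shift can be tabulated directly from the innovations cross-correlation (6.6.45). The sketch below assumes that (6.6.22) is applied with the shifted sequence $r_{yw}(l+D)$ for both signs of $D$, which is how the prediction case is set up above; the function names and the truncation length `K` are illustrative choices.

```python
import numpy as np

SIGMA_X2 = 8.0 / 5.0   # innovations variance from (6.6.40)
R_Y0 = 1.0             # r_y(0) for alpha = 0.8

def r_yw(l):
    """Cross-correlation (6.6.45) between y(n) and the innovations w(n)."""
    l = np.asarray(l, dtype=float)
    causal = 0.6 * 0.8 ** np.abs(l)          # used where l >= 0
    anticausal = 0.6 * 2.0 ** (-np.abs(l))   # equals 0.6 * 2**l where l < 0
    return np.where(l >= 0, causal, anticausal)

def mmse_causal(D, K=400):
    """MMSE of the causal estimator of y(n + D); the sum over k >= 0 is
    truncated at K terms, which is ample because the terms decay as 0.64**k."""
    k = np.arange(K)
    return R_Y0 - np.sum(r_yw(k + D) ** 2) / SIGMA_X2

shifts = np.arange(-10, 11)
curve = np.array([mmse_causal(D) for D in shifts])
for D, P in zip(shifts, curve):
    print(f"D = {D:3d}   MMSE = {P:.4f}")
```

For $D = 0$, $2$, and $-2$ the routine can be compared against (6.6.44), (6.6.53), and (6.6.56), and for large negative $D$ against the irreducible value (6.6.39).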

FIR filter. The $M$th-order FIR filter is obtained by solving the linear system

$$\mathbf{R}\mathbf{h} = \mathbf{d}$$

where

$$\mathbf{R} = \mathrm{Toeplitz}(1 + \sigma_v^2, \alpha, \ldots, \alpha^{M-1})$$

and

$$\mathbf{d} = [1 \;\; \alpha \;\; \cdots \;\; \alpha^{M-1}]^T$$

The MMSE is

$$P_o = r_y(0) - \sum_{k=0}^{M-1} h_o(k)\, r_{yx}(k)$$

and is shown in Figure 6.22 as a function of the order $M$ together with $P_c$ and $P_{nc}$. We notice that an optimum FIR filter of order $M = 4$ provides satisfactory performance. This can be explained by noting that the impulse response of the causal optimum IIR filter is negligible for $n > 4$.
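A minimal NumPy sketch of this FIR design follows. It builds $\mathbf{R}$ and $\mathbf{d}$ for $\alpha = 0.8$ and $\sigma_v^2 = 1$, solves the normal equations, and lists the MMSE for a few orders; the function name `fir_wiener` is hypothetical.

```python
import numpy as np

def fir_wiener(M, alpha=0.8, sigma_v2=1.0):
    """Solve R h = d for the M-tap optimum FIR filter of Example 6.6.1.

    R is Toeplitz with first row (1 + sigma_v2, alpha, ..., alpha**(M-1)),
    and d = [1, alpha, ..., alpha**(M-1)]^T.
    """
    idx = np.arange(M)
    R = alpha ** np.abs(idx[:, None] - idx[None, :]) + sigma_v2 * np.eye(M)
    d = alpha ** idx
    h = np.linalg.solve(R, d)
    P = 1.0 - d @ h            # P_o = r_y(0) - sum_k h_o(k) r_yx(k)
    return h, P

orders = range(1, 11)
mmse = np.array([fir_wiener(M)[1] for M in orders])
for M, P in zip(orders, mmse):
    print(f"M = {M:2d}   MMSE = {P:.6f}")
```

As $M$ grows, the printed values should settle at the causal IIR value $P_c$, and the leading taps of `h` should approach $h_c(n) = \tfrac{3}{8}(\tfrac{1}{2})^n$.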

FIGURE 6.21 Impulse response of optimum filters for pure filtering, prediction, and smoothing. [Three stem plots versus $n$: causal Wiener filter $h_c(n)$, causal Wiener predictor $h_c^{[2]}(n)$, and causal Wiener smoother $h_c^{[-2]}(n)$.]

FIGURE 6.22 MMSE as a function of the optimum FIR filter order M. [Plot: MMSE(M) versus FIR filter order $M$, with the causal filter MMSE (0.375) and the noncausal filter MMSE (0.3) shown as reference levels.]

6.6.4 Linear Prediction Using the Infinite Past: Whitening

The one-step forward IIR linear predictor is a causal IIR optimum filter with desired response $y(n) \triangleq x(n+1)$. The prediction error is

$$e^f(n+1) = x(n+1) - \sum_{k=0}^{\infty} h_{lp}(k)\, x(n-k) \tag{6.6.58}$$

where

$$H_{lp}(z) = \sum_{k=0}^{\infty} h_{lp}(k)\, z^{-k} \tag{6.6.59}$$

is the system function of the optimum predictor. Since $y(n) = x(n+1)$, we have $r_{yx}(l) = r_x(l+1)$ and $R_{yx}(z) = z R_x(z)$. Hence, the optimum predictor is

$$H_{lp}(z) = \frac{1}{\sigma_x^2 H_x(z)} \left[ \frac{z\, \sigma_x^2 H_x(z) H_x(z^{-1})}{H_x(z^{-1})} \right]_+ = \frac{[z H_x(z)]_+}{H_x(z)} = \frac{z H_x(z) - z}{H_x(z)}$$

and the prediction error filter (PEF) is

$$H_{PEF}(z) = \frac{E^f(z)}{X(z)} = 1 - z^{-1} H_{lp}(z) = \frac{1}{H_x(z)} \tag{6.6.60}$$

that is, the one-step IIR linear predictor of a regular process is identical to the whitening filter of the process. Therefore, the prediction error process is white, and the prediction error filter is minimum-phase. We will see that the efficient solution of optimum filtering problems includes as a prerequisite the solution of a linear prediction problem. Furthermore, algorithms for linear prediction provide a convenient way to perform spectral factorization in practice. The MMSE is

$$\begin{aligned} P_o^f &= \frac{1}{2\pi j} \oint_C \left\{ R_x(z) - z \left[ 1 - \frac{1}{H_x(z)} \right] z^{-1} R_x^*\!\left(\frac{1}{z^*}\right) \right\} z^{-1}\, dz \\ &= \frac{1}{2\pi j} \oint_C \frac{R_x(z)}{H_x(z)}\, z^{-1}\, dz \\ &= \sigma_x^2\, \frac{1}{2\pi j} \oint_C H_x^*\!\left(\frac{1}{z^*}\right) z^{-1}\, dz = \sigma_x^2 \end{aligned} \tag{6.6.61}$$

because

$$\frac{1}{2\pi j} \oint_C H_x^*\!\left(\frac{1}{z^*}\right) z^{-1}\, dz = h_x(0) = 1$$

From Section 2.4.4 and (6.6.61) we have

$$P_o^f = \sigma_x^2 = \exp\left[ \frac{1}{2\pi} \int_{-\pi}^{\pi} \ln R_x(e^{j\omega})\, d\omega \right] \tag{6.6.62}$$

which is known as the Kolmogorov-Szegö formula. We can easily see that the $D$-step predictor ($D > 0$) is given by

$$H_D(z) = \frac{[z^D H_x(z)]_+}{H_x(z)} = \frac{1}{H_x(z)} \sum_{k=D}^{\infty} h_x(k)\, z^{-k+D} \tag{6.6.63}$$

but is not guaranteed to be minimum-phase for $D \neq 1$.

EXAMPLE 6.6.2.

Consider a minimum-phase AR(2) process

$$x(n) = a_1 x(n-1) + a_2 x(n-2) + w(n)$$

where $w(n) \sim \mathrm{WN}(0, \sigma_w^2)$. The complex PSD of the process is

$$R_x(z) = \sigma_x^2 H_x(z) H_x(z^{-1}) = \frac{\sigma_x^2}{A(z) A(z^{-1})}$$

where $A(z) \triangleq 1 - a_1 z^{-1} - a_2 z^{-2}$ and $\sigma_x^2 = \sigma_w^2$. The one-step forward predictor is given by

$$H_{lp}(z) = z - \frac{z}{H_x(z)} = z - z A(z) = a_1 + a_2 z^{-1}$$

or

$$\hat{x}(n+1) = a_1 x(n) + a_2 x(n-1)$$

as should be expected because the present value of the process depends only on the past two values. Since the excitation $w(n)$ is white and cannot be predicted from the present or previous values of the signal $x(n)$, it is equal to the prediction error $e^f(n)$. Therefore, $\sigma_{e^f}^2 = \sigma_w^2$, as expected from (6.6.62). This shows that the MMSE of the one-step linear predictor depends on the SFM of the process $x(n)$. It is maximum for a white noise process, which is clearly unpredictable.
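The Kolmogorov-Szegö formula (6.6.62) is easy to evaluate numerically. The sketch below averages $\ln R_x(e^{j\omega})$ over a uniform frequency grid for two PSDs: the input PSD (6.6.38) of Example 6.6.1, for which $\sigma_x^2 = 8/5$, and an AR(2) PSD with illustrative coefficients $a_1 = 0.9$, $a_2 = -0.2$, $\sigma_w^2 = 2$. These AR(2) values are assumptions chosen so that $A(z)$ is minimum-phase; they do not come from the text.

```python
import numpy as np

def one_step_mmse(psd, n_grid=4096):
    """Evaluate exp{(1/2pi) * integral of ln R_x} on a uniform grid over
    one period; psd is a function of frequency in radians per sample."""
    omega = 2.0 * np.pi * np.arange(n_grid) / n_grid - np.pi
    return np.exp(np.mean(np.log(psd(omega))))

def psd_example_661(omega):
    """R_x(e^{jw}) from (6.6.38): (2 - 1.6 cos w) / (1.64 - 1.6 cos w)."""
    return (2.0 - 1.6 * np.cos(omega)) / (1.64 - 1.6 * np.cos(omega))

def psd_ar2(omega, a1=0.9, a2=-0.2, sigma_w2=2.0):
    """AR(2) PSD sigma_w^2 / |A(e^{jw})|^2 with A(z) = 1 - a1 z^-1 - a2 z^-2.
    The default coefficients are illustrative assumptions."""
    z_inv = np.exp(-1j * omega)
    A = 1.0 - a1 * z_inv - a2 * z_inv**2
    return sigma_w2 / np.abs(A) ** 2

P_661 = one_step_mmse(psd_example_661)
P_ar2 = one_step_mmse(psd_ar2)

# Spectral flatness: geometric mean of the PSD divided by its arithmetic mean.
grid = 2.0 * np.pi * np.arange(4096) / 4096
sfm_ar2 = P_ar2 / np.mean(psd_ar2(grid))

print("one-step MMSE, Example 6.6.1 input:", P_661)
print("one-step MMSE, AR(2) process:      ", P_ar2)
print("spectral flatness of the AR(2) PSD:", sfm_ar2)
```

The AR(2) result should equal the assumed excitation variance, in line with $\sigma_{e^f}^2 = \sigma_w^2$, and the flatness measure stays below 1 for any nonwhite PSD.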


Predictable processes. A random process $x(n)$ is said to be (exactly) predictable if $P_e = E\{|e^f(n)|^2\} = 0$. We next show that a process $x(n)$ is predictable if and only if its PSD consists of impulses, that is,

$$R_x(e^{j\omega}) = \sum_k A_k\, \delta(\omega - \omega_k) \tag{6.6.64}$$

or in other words, $x(n)$ is a harmonic process. For this reason harmonic processes are also known as deterministic processes. From (6.6.60) we have

$$P_e = E\{|e^f(n)|^2\} = \int_{-\pi}^{\pi} |H_{PEF}(e^{j\omega})|^2 R_x(e^{j\omega})\, d\omega \tag{6.6.65}$$

where $H_{PEF}(e^{j\omega})$ is the frequency response of the prediction error filter. Since $R_x(e^{j\omega}) \geq 0$, the integral in (6.6.65) is zero if and only if $|H_{PEF}(e^{j\omega})|^2 R_x(e^{j\omega}) = 0$. This is possible only if $R_x(e^{j\omega})$ is a linear combination of impulses, as in (6.6.64), and $e^{j\omega_k}$ are the zeros of $H_{PEF}(z)$ on the unit circle (Papoulis 1985).

From the Wold decomposition theorem (see Section 4.1.3) we know that every random process can be decomposed into two components that are mutually orthogonal: (1) a regular component with continuous PSD that can be modeled as the response of a minimum-phase system to white noise and (2) a predictable process that can be exactly predicted from a linear combination of past values. This component has a line PSD and is essentially a harmonic process. A complete discussion of this subject can be found in Papoulis (1985, 1991) and Therrien (1992).

6.7 INVERSE FILTERING AND DECONVOLUTION

In many practical applications, a signal of interest passes through a distorting system whose output may be corrupted by additive noise. When the distorting system is linear and time-invariant, the observed signal is the convolution of the desired input with the impulse response of the system. Since in most cases we deal with linear and time-invariant systems, the terms filtering and convolution are often used interchangeably. Deconvolution is the process of retrieving the unknown input of a known system by using its observed output. If the system is also unknown, which is more common in practical applications, we have a problem of blind deconvolution. The term blind deconvolution was introduced in Stockham et al. (1975) for a method used to restore old records. Other applications include estimation of the vocal tract in speech processing, equalization of communication channels, deconvolution of seismic data for the elimination of multiple reflections, and image restoration.

The basic problem is illustrated in Figure 6.23. The output of the unknown LTI system $G(z)$, which is assumed BIBO stable, is given by

$$x(n) = \sum_{k=-\infty}^{\infty} g(k)\, w(n-k) \tag{6.7.1}$$

where $w(n) \sim \mathrm{IID}(0, \sigma_w^2)$ is a white noise sequence. Suppose that we observe the output $x(n)$ and that we wish to recover the input signal $w(n)$, and possibly the system $G(z)$, using the output signal and some statistical information about the input.

FIGURE 6.23 Basic blind deconvolution model. [Block diagram: unknown input $w(n)$ → unknown system $G(z)$ → $x(n)$ → deconvolution filter $H(z)$ → $y(n)$.]

If we know the system $G(z)$, the inverse system $H(z)$ is obtained by noticing that perfect retrieval of the input is possible if

$$h(n) * g(n) * w(n) = b_0\, w(n - n_0) \tag{6.7.2}$$

where $b_0$ and $n_0$ are constants. From (6.7.2), we have $h(n) * g(n) = b_0\, \delta(n - n_0)$, or equivalently

$$H(z) = b_0\, \frac{z^{-n_0}}{G(z)} \tag{6.7.3}$$

which provides the system function of the inverse system. The input can be recovered by convolving the output with the inverse system $H(z)$. Therefore, the terms inverse filtering and deconvolution are equivalent for LTI systems. There are three approaches for blind deconvolution:

• Identify the system $G(z)$, design its inverse system $H(z)$, and then compute the input $w(n)$.
• Identify directly the inverse $H(z) = 1/G(z)$ of the system, and then determine the input $w(n)$.
• Estimate directly the input $w(n)$ from the output $x(n)$.

Any of the above approaches requires either directly or indirectly the estimation of both the magnitude response $|G(e^{j\omega})|$ and the phase response $\angle G(e^{j\omega})$ of the unknown system. In practice, the problem becomes more complicated because the output $x(n)$ is usually corrupted by additive noise. If this noise is uncorrelated with the input signal and the required second-order moments are available, we show how to design an optimum inverse filter that provides an optimum estimate of the input in the presence of noise. In Section 6.8 we apply these results to the design of optimum equalizers for data transmission systems. The main blind identification and deconvolution problem, in which only statistical information about the output is known, is discussed in Chapter 12.

We now discuss the design of optimum inverse filters for linearly distorted signals observed in the presence of additive output noise. The typical configuration is shown in Figure 6.24. Ideally, we would like the optimum filter to restore the distorted signal $x(n)$ to its original value $y(n)$. However, the ability of the optimum filter to attain ideal performance is limited by three factors. First, there is additive noise $v(n)$ at the output of the system. Second, if the physical system $G(z)$ is causal, its output $s(n)$ is delayed with respect to the input, and we may need some delay $z^{-D}$ to improve the performance of the system. When $G(z)$ is a non-minimum-phase system, the inverse system is either noncausal or unstable and should be approximated by a causal and stable filter. Third, the inverse system may be IIR and should be approximated by an FIR filter.

FIGURE 6.24 Typical configuration for optimum inverse system modeling. [Block diagram: $y(n)$ → $G(z)$ → $s(n)$, plus noise $v(n)$ → $x(n)$ → $H(z)$ → $\hat{y}(n)$; a parallel branch delays $y(n)$ by $z^{-D}$ to form $y(n-D)$, and the error $e(n)$ is the difference between $y(n-D)$ and $\hat{y}(n)$.]

The optimum inverse filter is the noncausal Wiener filter

$$H_{nc}(z) = \frac{z^{-D} R_{yx}(z)}{R_x(z)} \tag{6.7.4}$$

where the term $z^{-D}$ appears because the desired response is $y_D(n) \triangleq y(n-D)$. Since $y(n)$ and $v(n)$ are uncorrelated, we have

$$R_{yx}(z) = R_{ys}(z) \tag{6.7.5}$$

and

$$R_x(z) = G(z)\, G^*\!\left(\frac{1}{z^*}\right) R_y(z) + R_v(z) \tag{6.7.6}$$

The cross-correlation between $y(n)$ and $s(n)$

$$R_{ys}(z) = G^*\!\left(\frac{1}{z^*}\right) R_y(z) \tag{6.7.7}$$

is obtained by using Equation (6.6.18). Therefore, the optimum inverse filter is

$$H_{nc}(z) = \frac{z^{-D}\, G^*(1/z^*)\, R_y(z)}{G(z)\, G^*(1/z^*)\, R_y(z) + R_v(z)} \tag{6.7.8}$$

which, in the absence of noise, becomes

$$H_{nc}(z) = \frac{z^{-D}}{G(z)} \tag{6.7.9}$$

as expected. The behavior of the optimum inverse system is illustrated in the following example.

EXAMPLE 6.7.1. Let the system $G(z)$ be an all-zero non-minimum-phase system given by

$$G(z) = \tfrac{1}{5}(-3z + 7 - 2z^{-1}) = -\tfrac{3}{5}(1 - \tfrac{1}{3} z^{-1})(z - 2)$$

Then the inverse system is given by

$$H(z) = G^{-1}(z) = \frac{5}{-3z + 7 - 2z^{-1}} = \frac{1}{1 - \tfrac{1}{3} z^{-1}} - \frac{1}{1 - 2z^{-1}}$$

which is stable if the ROC is $\tfrac{1}{3} < |z| < 2$. Therefore, the impulse response of the inverse system is

$$h(n) = \begin{cases} (\tfrac{1}{3})^n & n \geq 0 \\ 2^n & n < 0 \end{cases}$$

which is noncausal and stable. Following the discussion given in this section, we want to design an optimum inverse system given that $G(z)$ is driven by a white noise sequence $y(n)$ and that the additive noise $v(n)$ is white, that is, $R_y(z) = \sigma_y^2$ and $R_v(z) = \sigma_v^2$. From (6.7.8), the optimum noncausal inverse filter is given by

$$H_{nc}(z) = \frac{z^{-D}}{G(z) + [1/G(z^{-1})](\sigma_v^2/\sigma_y^2)}$$

which can be computed by assuming suitable values for variances $\sigma_y^2$ and $\sigma_v^2$. Note that if $\sigma_v^2 \ll \sigma_y^2$, that is, for very large SNR, we obtain (6.7.9).

A more interesting case occurs when the optimum inverse filter is FIR, which can be easily implemented. To design this FIR filter, we will need the autocorrelation $r_x(l)$ and the cross-correlation $r_{y_D x}(l)$, where $y_D(n) = y(n-D)$ is the delayed system input sequence. Since

$$R_x(z) = \sigma_y^2\, G(z) G(z^{-1}) + \sigma_v^2$$

and

$$R_{y_D x}(z) = \sigma_y^2\, z^{-D} G(z^{-1})$$

we have (see Section 3.4.1)

$$r_x(l) = g(l) * g(-l) * r_y(l) + r_v(l) = \sigma_y^2 [g(l) * g(-l)] + \sigma_v^2\, \delta(l)$$

and

$$r_{y_D x}(l) = g(-l) * r_y(l - D) = \sigma_y^2\, g(-l + D)$$

respectively. Now we can determine the optimum FIR filter $\mathbf{h}_D$ of length $M$ by constructing an $M \times M$ Toeplitz matrix $\mathbf{R}$ from $r_x(l)$ and an $M \times 1$ vector $\mathbf{d}$ from $r_{y_D x}(l)$ and then solving

$$\mathbf{R}\mathbf{h}_D = \mathbf{d}$$

for various values of $D$. We can then plot the MMSE as a function of $D$ to determine the best value of $D$ (and the corresponding FIR filter) which will give the smallest MMSE. For example, if $\sigma_y^2 = 1$, $\sigma_v^2 = 0.1$, and $M = 10$, the correlation functions are

$$r_x(l) = \left\{ \tfrac{6}{25},\; -\tfrac{7}{5},\; \underset{\uparrow\, l=0}{\tfrac{129}{50}},\; -\tfrac{7}{5},\; \tfrac{6}{25} \right\}$$

and

$$r_{y_D x}(l) = \left\{ -\tfrac{2}{5},\; \underset{\uparrow\, l=D}{\tfrac{7}{5}},\; -\tfrac{3}{5} \right\}$$

The resulting MMSE as a function of $D$ is shown in Figure 6.25, which indicates that the best value of $D$ is approximately $M/2$. Finally, plots of impulse responses of the inverse system are shown in Figure 6.26. The first plot shows the noncausal $h(n)$, the second plot shows the causal FIR system $h_0(n)$ for $D = 0$, and the third plot shows the causal FIR system $h_D(n)$ for $D = 5$. It is clear that the optimum delayed FIR inverse filter for $D \approx M/2$ closely matches the impulse response of the inverse filter $h(n)$.

FIGURE 6.25 The inverse filtering MMSE as a function of delay D. [Plot: MMSE(D) versus delay $D$, with the vertical axis spanning roughly 0.1 to 0.25.]

FIGURE 6.26 Impulse responses of optimum inverse filters. [Three stem plots versus $n$: ideal IIR filter $h_{nc}(n)$, causal inverse system for $D = 0$, and causal inverse system for $D = 5$.]
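The FIR design procedure of this example can be sketched in NumPy as follows. The code builds $r_x(l)$ and $r_{y_D x}(l)$ from the impulse response of $G(z)$, solves $\mathbf{R}\mathbf{h}_D = \mathbf{d}$ for each delay, and records the MMSE $\sigma_y^2 - \mathbf{d}^T \mathbf{h}_D$. The helper names are hypothetical, and the MMSE expression is the FIR formula from Table 6.5 specialized to real-valued signals.

```python
import numpy as np

# Impulse response of G(z) = (1/5)(-3z + 7 - 2z^-1): g(-1), g(0), g(1).
g = {-1: -0.6, 0: 1.4, 1: -0.4}
sigma_y2, sigma_v2, M = 1.0, 0.1, 10

def g_at(n):
    """g(n), zero outside the support {-1, 0, 1}."""
    return g.get(n, 0.0)

def r_x(l):
    """r_x(l) = sigma_y^2 [g(l) * g(-l)] + sigma_v^2 delta(l)."""
    r_g = sum(g_at(n) * g_at(n - l) for n in g)
    return sigma_y2 * r_g + (sigma_v2 if l == 0 else 0.0)

# M x M Toeplitz correlation matrix with entries r_x(m - k).
R = np.array([[r_x(m - k) for k in range(M)] for m in range(M)])

def inverse_fir(D):
    """Optimum length-M FIR inverse filter and its MMSE for delay D."""
    d = np.array([sigma_y2 * g_at(D - l) for l in range(M)])  # r_{y_D x}(l)
    h = np.linalg.solve(R, d)
    return h, sigma_y2 - d @ h

mmse = np.array([inverse_fir(D)[1] for D in range(M)])
best_D = int(np.argmin(mmse))
for D, P in enumerate(mmse):
    print(f"D = {D}   MMSE = {P:.4f}")
print("delay with the smallest MMSE:", best_D)
```

Plotting `mmse` against the delay gives a curve that can be compared with Figure 6.25, and `inverse_fir(5)[0]` can be compared with a shifted copy of the ideal two-sided $h(n)$.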


6.8 CHANNEL EQUALIZATION IN DATA TRANSMISSION SYSTEMS

The performance of data transmission systems through channels that can be approximated by linear systems is limited by factors such as finite bandwidth, intersymbol interference, and thermal noise (see Section 1.4). Typical examples include telephone lines, microwave line-of-sight radio links, satellite channels, and underwater acoustic channels. When the channel frequency response deviates from the ideal of flat magnitude and linear phase, both (left and right) tails of a transmitted pulse will interfere with neighboring pulses. Hence, the value of a sample taken at the center of a pulse will contain components from the tails of the other pulses. The distortion caused by the overlapping tails is known as intersymbol interference (ISI), and it can lead to erroneous decisions that increase the probability of error. For band-limited channels with low background noise (e.g., voice band telephone channel), ISI is the main performance limitation for high-speed data transmission. In radio and undersea channels, ISI is the result of multipath propagation (Siller 1984).

Intersymbol interference occurs in all pulse modulation systems, including frequency-shift keying (FSK), phase-shift keying (PSK), and quadrature amplitude modulation (QAM). However, to simplify the presentation, we consider a baseband pulse amplitude modulation (PAM) system. This does not result in any loss of generality because we can obtain an equivalent baseband model for any linear modulation scheme (Proakis 1996). We consider the $K$-ary ($K = 2^L$) PAM communication system shown in Figure 6.27(a). The binary input sequence is subdivided into $L$-bit blocks, or symbols, and each symbol is mapped to one of the $K$ amplitude levels, as shown in Figure 6.27(b). The interval $T_B$ is called the symbol or baud interval while the interval $T_b$ is called the bit interval. The quantity $R_B = 1/T_B$ is known as the baud rate, and the quantity $R_b = L R_B$ is the bit rate.

FIGURE 6.27 (a) Baseband pulse amplitude modulation data transmission system model and (b) input symbol sequence $a_n$.

The resulting symbol sequence $\{a_n\}$ modulates the transmitted pulse $g_t(t)$. For analysis purposes, the symbol sequence $\{a_n\}$ can be represented by an equivalent continuous-time signal using an impulse train, that is,

$$\{a_n\}_{-\infty}^{\infty} \;\Leftrightarrow\; \sum_{n=-\infty}^{\infty} a_n\,\delta(t - nT_B) \tag{6.8.1}$$
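The bit-to-symbol mapping described above can be sketched in a few lines. This is only an illustration: the level set (odd integers symmetric about zero), the natural binary labelling, and the function names `bits_to_pam` and `rates` are assumptions, since the text only requires $K = 2^L$ distinct amplitude levels.

```python
def bits_to_pam(bits, L):
    """Map a binary sequence to K-ary PAM amplitudes, K = 2**L.

    Assumption (not from the text): the levels are the odd integers
    -(K-1), ..., -1, 1, ..., K-1 with natural binary labelling.
    """
    if len(bits) % L != 0:
        raise ValueError("number of bits must be a multiple of L")
    K = 2 ** L
    symbols = []
    for i in range(0, len(bits), L):
        block = bits[i:i + L]                            # one L-bit symbol
        index = int("".join(str(b) for b in block), 2)   # 0 .. K-1
        symbols.append(2 * index - (K - 1))              # symmetric levels
    return symbols


def rates(T_B, L):
    """Return (baud rate R_B, bit rate R_b) for symbol interval T_B."""
    R_B = 1.0 / T_B
    return R_B, L * R_B


symbols = bits_to_pam([1, 0, 1, 1, 0, 0, 0, 1], L=2)   # four 2-bit symbols
```

With $L = 2$ each pair of bits selects one of four levels, so eight bits produce four symbols and the bit rate is twice the baud rate.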

The modulated pulses are transmitted over the channel represented by the impulse response $h_c(t)$ and the additive noise $v_c(t)$. The received signal is filtered by the receiving filter $g_r(t)$ to obtain $\tilde x(t)$. Using (6.8.1), the signal $\tilde x(t)$ at the output of the receiving filter is given by

$$\begin{aligned}
\tilde x(t) &= \sum_{k=-\infty}^{\infty} a_k\,\{\delta(t - kT_B) * g_t(t) * h_c(t) * g_r(t)\} + v_c(t) * g_r(t) \\
&\triangleq \sum_{k=-\infty}^{\infty} a_k\,\tilde h_r(t - kT_B) + \tilde v(t)
\end{aligned} \tag{6.8.2}$$

where

$$\tilde h_r(t) \triangleq g_t(t) * h_c(t) * g_r(t) \tag{6.8.3}$$

is the impulse response of the combined system of transmitting filter, channel, and receiving filter, and

$$\tilde v(t) \triangleq g_r(t) * v_c(t) \tag{6.8.4}$$

is the additive noise at the output of the receiving filter.

6.8.1 Nyquist's Criterion for Zero ISI

If we sample the received signal $\tilde x(t)$ at the time instant $t_0 + nT_B$, we obtain

$$\begin{aligned}
\tilde x(t_0 + nT_B) &= \sum_{k=-\infty}^{\infty} a_k\,\tilde h_r(t_0 + nT_B - kT_B) + \tilde v(t_0 + nT_B) \\
&= a_n\,\tilde h_r(t_0) + \sum_{\substack{k=-\infty \\ k \ne n}}^{\infty} a_k\,\tilde h_r(t_0 + nT_B - kT_B) + \tilde v(t_0 + nT_B)
\end{aligned} \tag{6.8.5}$$

where $t_0$ accounts for the channel delay and the sampler phase. The first term in (6.8.5) is the desired signal term while the third term is the noise term. The middle term in (6.8.5) represents the ISI, and it will be zero if and only if

$$\tilde h_r(t_0 + nT_B - kT_B) = 0 \qquad n \ne k \tag{6.8.6}$$

As was first shown by Nyquist (Gitlin, Hayes, and Weinstein 1992), a time-domain pulse $\tilde h_r(t)$ will have zero crossings once every $T_B$ s, that is,

$$\tilde h_r(nT_B) = \begin{cases} 1 & n = 0 \\ 0 & n \ne 0 \end{cases} \tag{6.8.7}$$

if its Fourier transform satisfies the condition

$$\sum_{l=-\infty}^{\infty} \tilde H_r\!\left(F + \frac{l}{T_B}\right) = T_B \tag{6.8.8}$$

This condition is known as the Nyquist criterion for zero ISI and its basic meaning is illustrated in Figure 6.28.

[Figure 6.28 panels: baseband spectrum with odd symmetry about $\pm 1/(2T_B)$ and the overlap regions; spectral images due to sampling at rate $1/T_B$; folded spectrum of constant amplitude for $|F| \le 1/(2T_B)$. Horizontal axis: $F$, Hz.]

FIGURE 6.28 Frequency-domain Nyquist criterion for zero ISI.

A pulse shape that satisfies (6.8.8) and that is widely used in practice is of the raised cosine family

$$\tilde h_{rc}(t) = \frac{\sin(\pi t/T_B)}{\pi t/T_B}\;\frac{\cos(\pi\alpha t/T_B)}{1 - 4\alpha^2 t^2/T_B^2} \tag{6.8.9}$$

where $0 \le \alpha \le 1$ is known as the rolloff factor. This pulse and its Fourier transform for $\alpha = 0$, 0.5, and 1 are shown in Figure 6.29. The choice $\alpha = 0$ reduces $\tilde h_{rc}(t)$ to the unrealizable sinc pulse, which supports the symbol rate $R_B = 1/T_B$ in the minimum bandwidth $1/(2T_B)$, whereas for $\alpha = 1$ the occupied bandwidth doubles to $1/T_B$ for the same symbol rate.

FIGURE 6.29 Pulses with a raised cosine spectrum.

In practice, we can see the effect of ISI and the noise if we display the received signal on the vertical axis of an oscilloscope and set the horizontal sweep rate at $1/T_B$. The resulting display is known as an eye pattern because it resembles the human eye. The closing of the eye increases with the increase in ISI.

6.8.2 Equivalent Discrete-Time Channel Model

Referring to Figure 6.27(a), we note that the input to the data transmission system is a discrete-time sequence $\{a_n\}$ at the symbol rate $1/T_B$ symbols per second, and the input to the detector is also a discrete-time sequence $\tilde x(nT_B)$ at the symbol rate. Thus the overall system between the input symbols and the equalizer can be modeled as a discrete-time channel model for further analysis. From (6.8.2), after sampling at the symbol rate, we obtain

$$\tilde x(nT_B) = \sum_{k=-\infty}^{\infty} a_k\,\tilde h_r(nT_B - kT_B) + \tilde v(nT_B) \tag{6.8.10}$$

where $\tilde h_r(t)$ is given in (6.8.3) and $\tilde v(t)$ is given in (6.8.4). The first term in (6.8.10) can be interpreted as a discrete-time IIR filter with impulse response $\tilde h_r(n) \triangleq \tilde h_r(nT_B)$† with input $a_k$.

† Here we have abused the notation to avoid a new symbol.

In a practical data transmission system, it is not unreasonable to assume that $\tilde h_r(n) = 0$ for $|n| > L$, where $L$ is some arbitrary positive integer. Then we obtain

$$\tilde x(n) = \sum_{k=-L}^{L} a_k\,\tilde h_r(n - k) + \tilde v(n) \tag{6.8.11}$$

where $\tilde x(n) \triangleq \tilde x(nT_B)$ and $\tilde v(n) \triangleq \tilde v(nT_B)$, which is an FIR filter of length $2L + 1$, shown in Figure 6.30.

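[Figure 6.30 block diagram: a tapped delay line holding $a_{n+L}, a_{n+L-1}, \ldots, a_n, a_{n-1}, \ldots, a_{n-L}$, weighted by $\tilde h_r(-L), \tilde h_r(-L+1), \ldots, \tilde h_r(0), \tilde h_r(1), \ldots, \tilde h_r(L)$ and summed with the noise $\tilde v(n)$ to give $\tilde x(n)$.]

FIGURE 6.30 Equivalent discrete-time model of data transmission system with ISI.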
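To make the link between the raised cosine pulse (6.8.9), Nyquist's condition (6.8.7), and the FIR model (6.8.11) concrete, the following sketch samples the pulse at $t_0 + nT_B$ and passes a short symbol sequence through the resulting taps. It is a minimal illustration under stated assumptions: the overall response is taken to be exactly a raised cosine, noise is omitted, and the timing offset $0.2\,T_B$, the truncation $L = 3$, and all function names are hypothetical choices.

```python
import math


def raised_cosine(t, T_B=1.0, alpha=0.5):
    """Raised cosine pulse of (6.8.9); removable singularities handled explicitly."""
    x = t / T_B
    if abs(x) < 1e-12:
        return 1.0
    sinc = math.sin(math.pi * x) / (math.pi * x)
    denom = 1.0 - (2.0 * alpha * x) ** 2
    if abs(denom) < 1e-9:
        # Limit of cos(pi*alpha*x) / (1 - 4 alpha^2 x^2) at x = +/- 1/(2 alpha) is pi/4.
        return sinc * math.pi / 4.0
    return sinc * math.cos(math.pi * alpha * x) / denom


def channel_taps(t0, L=3, T_B=1.0, alpha=0.5):
    """Taps h_r(n) = h_rc(t0 + n*T_B) for n = -L..L, as used in (6.8.11)."""
    return [raised_cosine(t0 + n * T_B, T_B, alpha) for n in range(-L, L + 1)]


def isi_channel(a, taps):
    """Noise-free part of (6.8.11): x(n) = sum_k a_k h_r(n - k), n = 0..len(a)-1."""
    L = (len(taps) - 1) // 2
    x = []
    for n in range(len(a)):
        acc = 0.0
        for k in range(len(a)):
            if abs(n - k) <= L:
                acc += a[k] * taps[(n - k) + L]
        x.append(acc)
    return x


a = [1, -1, 1, 1, -1, -1, 1]
ideal = channel_taps(t0=0.0)        # sampling exactly at the pulse centre
offset = channel_taps(t0=0.2)       # hypothetical timing error of 0.2*T_B
x_ideal = isi_channel(a, ideal)     # reproduces a (zero ISI)
x_offset = isi_channel(a, offset)   # neighbouring symbols leak into each sample
isi_power = sum(h * h for n, h in enumerate(offset) if n != 3)
```

With perfect timing only the centre tap is nonzero, so the samples equal the transmitted symbols; with a timing error the off-centre taps become nonzero and the FIR model produces ISI.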

There is one difficulty with this model. If we assume that the additive channel noise $v_c(t)$ is zero-mean white, then the equivalent noise sequence $\tilde v(n)$ is not white. This can be seen from the definition of $\tilde v(t)$ in (6.8.4). Thus the autocorrelation of $\tilde v(n)$ is given by

$$r_{\tilde v}(l) = \sigma_v^2\, r_{g_r}(l) \tag{6.8.12}$$

where $\sigma_v^2$ is the variance of the samples of $v_c(t)$ and $r_{g_r}(l)$ is the sampled autocorrelation of $g_r(t)$. This nonwhiteness of $\tilde v(n)$ poses a problem in the subsequent design and performance evaluation of equalizers. Therefore, in practice, it is necessary to whiten this noise by designing a whitening filter and placing it after the sampler in Figure 6.27(a). The whitening filter is designed by using spectral factorization of $Z[r_{g_r}(l)]$. Let

$$R_{g_r}(z) = Z[r_{g_r}(l)] = R_{g_r}^{+}(z)\,R_{g_r}^{-}(z) \tag{6.8.13}$$

where $R_{g_r}^{+}(z)$ is the minimum-phase factor and $R_{g_r}^{-}(z)$ is the maximum-phase factor. Choosing

$$W(z) \triangleq \frac{1}{R_{g_r}^{+}(z)} \tag{6.8.14}$$

as a causal, stable, and recursive filter and applying the sampled sequence $\tilde x(n)$ to this filter, we obtain

$$x(n) \triangleq w(n) * \tilde x(n) = \sum_{k=0}^{\infty} a_k\,h_r(n - k) + v(n) \tag{6.8.15}$$

where

$$h_r(n) \triangleq \tilde h_r(n) * w(n) \tag{6.8.16}$$

and

$$v(n) \triangleq w(n) * \tilde v(n) \tag{6.8.17}$$

The spectral density of $v(n)$, from (6.8.12), (6.8.13), and (6.8.14), is given by

$$R_v(z) = R_w(z)\,R_{\tilde v}(z) = \frac{1}{R_{g_r}^{+}(z)\,R_{g_r}^{-}(z)}\;\sigma_v^2\,R_{g_r}^{+}(z)\,R_{g_r}^{-}(z) = \sigma_v^2 \tag{6.8.18}$$

which means that $v(n)$ is a white sequence. Once again, assuming that $h_r(n) = 0$, $n > L$, where $L$ is an arbitrary positive integer, we obtain an equivalent discrete-time channel model with white noise

$$x(n) = \sum_{k=0}^{L} a_k\,h_r(n - k) + v(n) \tag{6.8.19}$$

This equivalent model is shown in Figure 6.31. An example to illustrate the use of this model in the design and analysis of an equalizer is given in the next section.

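[Figure 6.31 block diagram: a tapped delay line holding $a_n, a_{n-1}, \ldots, a_{n-L+1}, a_{n-L}$, weighted by $h_r(0), h_r(1), \ldots, h_r(L-1), h_r(L)$ and summed with the white noise $v(n)$ to give $x(n)$.]

FIGURE 6.31 Equivalent discrete-time model of data transmission system with ISI and WGN.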
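The whitening step in (6.8.13) and (6.8.14) can be illustrated numerically. The sketch below is not taken from the text: it assumes a hypothetical two-tap sampled receiving filter `g`, factors its autocorrelation by keeping the zeros inside the unit circle, and applies $W(z) = 1/R_{g_r}^{+}(z)$ as a recursion to simulated noise. The helper names and the sample size are illustrative choices.

```python
import numpy as np

# Hypothetical sampled receiving filter (assumption for illustration only).
g = np.array([1.0, 0.5])
r_g = np.correlate(g, g, mode="full")          # r_g(l) for l = -1, 0, 1

# Spectral factorization (6.8.13): keep the zeros inside the unit circle.
zeros = np.roots(r_g)
min_phase_zeros = zeros[np.abs(zeros) < 1.0]
b = np.real(np.poly(min_phase_zeros))          # monic minimum-phase polynomial
gain = np.sqrt(r_g[len(g) - 1] / np.sum(b ** 2))
b_plus = gain * b                              # coefficients of R_g^+(z)


def whiten(x, b_plus):
    """Apply W(z) = 1 / R_g^+(z) of (6.8.14) as a causal recursion."""
    y = np.zeros(len(x))
    for n in range(len(x)):
        acc = x[n]
        for k in range(1, min(n, len(b_plus) - 1) + 1):
            acc -= b_plus[k] * y[n - k]
        y[n] = acc / b_plus[0]
    return y


def lag1_correlation(x):
    """Normalized sample autocorrelation at lag 1."""
    return float(np.dot(x[1:], x[:-1]) / np.dot(x, x))


rng = np.random.default_rng(0)
v_c = rng.standard_normal(50_000)              # white channel noise samples
v_tilde = np.convolve(v_c, g)[: len(v_c)]      # colored noise after g_r
v = whiten(v_tilde, b_plus)                    # whitened noise, as in (6.8.17)
```

Before whitening the lag-1 correlation of the noise is close to $r_{g_r}(1)/r_{g_r}(0) = 0.4$ for this filter; after whitening it is close to zero, in line with (6.8.18).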

6.8.3 Linear Equalizers

If we know the characteristics of the channel, that is, the magnitude response $|H_c(F)|$ and the phase response $\angle H_c(F)$, we can design optimum transmitting and receiving filters that will maximize the SNR and will result in zero ISI at the sampling instant. However, in practice we have to deal with channels whose characteristics are either unknown (dial-up telephone channels) or time-varying (ionospheric radio channels). In this case, we usually use a receiver that consists of a fixed filter $g_r(t)$ and an adjustable linear equalizer, as shown in Figure 6.32. The response of the fixed filter either is matched to the transmitted pulse or is designed as a compromise equalizer for an "average" channel typical of the given application. In principle, to eliminate the ISI, we should design the equalizer so that the overall pulse shape satisfies Nyquist's criterion (6.8.6) or (6.8.8).

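[Figure 6.32 panels: (a) continuous-time model, in which the output $\tilde x(t)$ of the receiving filter is sampled at $t = nT$, passed through the equalizer, and taken at $t = nT_B$ as $\hat y(nT_B)$ for the detector, which produces $\hat a_n$; (b) discrete-time model for the synchronous equalizer, in which $x(n)$ from the equivalent model drives the equalizer $\{c(n)\}_{-M}^{M}$, whose output $\hat y(n)$ feeds the detector, which produces $\hat a_n$.]

FIGURE 6.32 Equalizer-based receiver model.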

The most widely used equalizers are implemented using digital FIR filters. To this end, as shown in Figure 6.32(a), we sample the received signal $\tilde x(t)$ periodically at times $t = t_0 + nT$, where $t_0$ is the sampling phase and $T$ is the sampling period. The sampling period should be less than or equal to the symbol interval $T_B$ because the output of the equalizer should be sampled once every symbol interval (the case $T > T_B$ creates aliasing). For digital implementation $T$ should be chosen as a rational fraction of the symbol interval, that is, $T = L_1 T_B / L_2$, with $L_1 \le L_2$ (typical choices are $T = T_B$, $T = T_B/2$, or $T = 2T_B/3$). If the sampling interval $T = T_B$, we have a synchronous or symbol equalizer (SE),† and if $T < T_B$ a fractionally spaced equalizer (FSE).‡ The output of the equalizer is quantized to obtain the decision $\hat a_n$.

† Also known as a baud-spaced equalizer (BSE).

‡ The most significant difference between SE and FSE is that by properly choosing $T$ we can completely avoid aliasing at the input of the FSE. Thus, the FSE can provide better compensation for timing phase and asymmetries in the channel response without noise enhancement (Qureshi 1985).

The goal of the equalizer is to determine the coefficients $\{c_k\}_{-M}^{M}$ so as to minimize the ISI according to some criterion of performance. The most meaningful criterion for data transmission is the average probability of error. However, this criterion is a nonlinear function of the equalizer coefficients, and its minimization is extremely difficult. We next discuss two criteria that are used in practical applications. For this discussion we assume a synchronous equalizer, that is, $T = T_B$. The FSE is discussed in Chapter 12. For the synchronous equalizer, the equivalent discrete-time model given in Figure 6.31 is applicable, in which the input is $x(n)$, given by

$$x(n) = \sum_{l=0}^{L} a_l\,h_r(n - l) + v(n) \tag{6.8.20}$$

The output of the equalizer is given by

$$\hat y(n) = \sum_{k=-M}^{M} c^*(k)\,x(n - k) \triangleq \mathbf{c}^H \mathbf{x}(n) \tag{6.8.21}$$

where

$$\mathbf{c} = [c(-M)\ \cdots\ c(0)\ \cdots\ c(M)]^T \tag{6.8.22}$$

$$\mathbf{x}(n) = [x(n + M)\ \cdots\ x(n)\ \cdots\ x(n - M)]^T \tag{6.8.23}$$

This equalizer model is shown in Figure 6.32(b).


6.8.4 Zero-Forcing Equalizers

Zero-forcing (zf) equalization (Lucky, Saltz, and Weldon 1968) requires that the response of the equalizer to the combined pulse $\tilde h_r(t)$ satisfy the Nyquist criterion (6.8.7). For the FIR equalizer in (6.8.21), in the absence of noise we have

$$\sum_{k=-M}^{M} c_{zf}(k)\,h_r(n - k) = \begin{cases} 1 & n = 0 \\ 0 & n = \pm 1, \pm 2, \ldots, \pm M \end{cases} \tag{6.8.24}$$

which is a linear system of equations whose solution provides the required coefficients. The zero-forcing equalizer does not completely eliminate the ISI because it has finite duration. If $M = \infty$, Equation (6.8.24) becomes a convolution equation that can be solved by using the z-transform. The solution is

$$C_{zf}(z) = \frac{1}{H_r(z)} \tag{6.8.25}$$

where $H_r(z)$ is the z-transform of $h_r(n)$. Thus, the zero-forcing equalizer is an inverse filter that inverts the frequency-folded (aliased) response of the overall channel. When $M$ is finite, it is generally impossible to eliminate the ISI at the output of the equalizer because there are only $2M + 1$ adjustable parameters to force zero ISI outside of $[-M, M]$. Then the equalizer design problem reverts to minimizing the peak distortion

$$D \triangleq \sum_{n \ne 0} \left| \sum_{k=-M}^{M} c_{zf}(k)\,h_r(n - k) \right| \tag{6.8.26}$$

This distortion function can be shown to be a convex function (Lucky 1965), and its minimization, in general, is difficult to obtain except when the input ISI is less than 100 percent (i.e., the eye pattern is open). This minimization and the determination of $\{c_{zf}\}$ can be obtained by using the steepest descent algorithm, which is discussed in Chapter 10.

Zero-forcing equalizers have two drawbacks: (1) They ignore the presence of noise and therefore amplify the noise appearing near the spectral nulls of $H_r(e^{j\omega})$, and (2) they minimize the peak distortion or worst-case ISI only when the eye is open. For these reasons they are not currently used for bad channels or high-speed modems (Qureshi 1985). The above two drawbacks are eliminated if the equalizers are designed using the MSE criterion.
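For finite $M$, the linear system (6.8.24) can be set up and solved directly. The following sketch does this for a hypothetical three-tap channel (the tap values, the index convention, and the function name are assumptions for illustration) and then evaluates the residual peak distortion (6.8.26) from the combined response.

```python
import numpy as np


def zero_forcing_taps(h, n0, M):
    """Solve (6.8.24) for c_zf(-M..M).

    h  : channel samples, with h[i] = h_r(i - n0)
    n0 : index of h_r(0) inside h
    """
    def h_r(n):
        i = n + n0
        return h[i] if 0 <= i < len(h) else 0.0

    idx = range(-M, M + 1)
    A = np.array([[h_r(n - k) for k in idx] for n in idx])
    rhs = np.array([1.0 if n == 0 else 0.0 for n in idx])
    return np.linalg.solve(A, rhs)


# Hypothetical three-tap channel: h_r(-1), h_r(0), h_r(1).
h = np.array([0.2, 1.0, 0.2])
n0, M = 1, 3
c_zf = zero_forcing_taps(h, n0, M)

q = np.convolve(c_zf, h)                 # combined response, q[j] = q(j - M - n0)
centre = M + n0
peak_distortion = np.sum(np.abs(q)) - abs(q[centre])   # (6.8.26)
```

The combined response is forced to a unit sample for $|n| \le M$, but samples outside that range remain, which is why the peak distortion is small yet not zero.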

6.8.5 Minimum MSE Equalizers

It has been shown (Saltzberg 1968) that the error rate $\Pr\{\hat a_n \ne a_n\}$ decreases monotonically with the MSE defined by

$$\text{MSE} = E\{|e(n)|^2\} \tag{6.8.27}$$

where

$$e(n) = y(n) - \hat y(n) = a_n - \hat y(n) \tag{6.8.28}$$

is the difference between the desired response $y(n) \triangleq a_n$ and the actual response $\hat y(n)$ given in (6.8.21). Therefore, if we minimize the MSE in (6.8.27), we take into consideration both the ISI and the noise at the output of the equalizer. For $M = \infty$, following arguments similar to those leading to (6.8.25), the minimum MSE equalizer is specified by

$$C_{\text{MSE}}(z) = \frac{H_r^*(1/z^*)}{H_r(z)\,H_r^*(1/z^*) + \sigma_v^2} \tag{6.8.29}$$

where $\sigma_v^2$ is the variance of the sampled channel noise $v_c(kT_B)$. Clearly, (6.8.29) reduces to the zero-forcing equalizer if $\sigma_v^2 = 0$. Also (6.8.29) is the classical Wiener filter. For finite $M$, the minimum MSE equalizer is specified by

$$\mathbf{R}\,\mathbf{c}_o = \mathbf{d} \tag{6.8.30}$$

$$P_o = P_a - \mathbf{c}_o^H \mathbf{d} \tag{6.8.31}$$

where $\mathbf{R} = E\{\mathbf{x}(n)\mathbf{x}^H(n)\}$ and $\mathbf{d} = E\{a_n^* \mathbf{x}(n)\}$. The data sequence $y(n) = a_n$ is assumed to be white with zero mean and power $P_a = E\{|a_n|^2\}$, and uncorrelated with the additive channel noise. Under these assumptions, the elements of the correlation matrix $\mathbf{R}$ and the cross-correlation vector $\mathbf{d}$ are given by

$$r_{ij} \triangleq E\{x(n - i)\,x^*(n - j)\} = P_a \sum_m h_r(m - i)\,h_r^*(m - j) + \sigma_v^2\,\delta_{ij} \qquad -M \le i, j \le M \tag{6.8.32}$$

and

$$d_i \triangleq E\{x(n - i)\,y^*(n)\} = P_a\,h_r(-i) \qquad -M \le i \le M \tag{6.8.33}$$

that is, in terms of the overall (equivalent) channel response $h_r(n)$ and the noise power $\sigma_v^2$. We hasten to stress that matrix $\mathbf{R}$ is Toeplitz if $T = T_B$; otherwise, for $T \ne T_B$, matrix $\mathbf{R}$ is Hermitian but not Toeplitz. Since MSE equalizers, in contrast to zero-forcing equalizers, take into account both the statistical properties of the noise and the ISI, they are more robust to both noise and large amounts of ISI.

EXAMPLE 6.8.1. Consider the model of the data communication system shown in Figure 6.33. The input symbol sequence $\{a(n)\}$ is a Bernoulli sequence $\{\pm 1\}$, with $\Pr\{1\} = \Pr\{-1\} = 0.5$. The channel (including the receiving and whitening filter) is modeled as

$$h(n) = \begin{cases} 0.5\left[1 + \cos\dfrac{2\pi(n - 2)}{W}\right] & n = 1, 2, 3 \\ 0 & \text{otherwise} \end{cases} \tag{6.8.34}$$

where $W$ controls the amount of amplitude distortion introduced by the channel. The channel impulse response values are (the arrow denotes the sample at $n = 0$)

$$h(n) = \left\{ \underset{\uparrow}{0},\ 0.5\left[1 + \cos\frac{2\pi}{W}\right],\ 1,\ 0.5\left[1 + \cos\frac{2\pi}{W}\right],\ 0 \right\} \tag{6.8.35}$$

which is a symmetric channel, and its frequency response is

$$H(e^{j\omega}) = e^{-j2\omega}\left[1 + \left(1 + \cos\frac{2\pi}{W}\right)\cos\omega\right]$$

The channel noise $v(n)$ is modeled as white Gaussian noise (WGN) with zero mean and variance $\sigma_v^2$. The equalizer is an 11-tap FIR filter whose optimum tap weights $\{c(n)\}$ are obtained using either optimum filter theory (nonadaptive approach) or adaptive algorithms that will be described in Chapter 10. The input to the equalizer is

$$x(n) = s(n) + v(n) = h(n) * a(n) + v(n) \tag{6.8.36}$$

where $s(n)$ represents the distorted pulse sequence. The output of the equalizer is $\hat y(n)$, which is an estimate of $a(n)$. In practical modem implementations, the equalizer is initially designed using a training sequence that is known to the receiver. It is shown in Figure 6.33 as the sequence $y(n)$. It is reasonable to introduce a delay $D$ in the training sequence to account for delays introduced in the channel and in the equalizer; that is, $y(n) = a(n - D)$ during the training phase. The error sequence $e(n)$ is further used to design the equalizer $c(n)$. The aim of this example is to study the effect of the delay $D$ and to determine its optimum value for proper operation.

[Figure 6.33 block diagram: $a(n)$ passes through the channel $h(n)$ and is summed with the noise $v(n)$ to give $x(n)$, which drives the equalizer $c(n)$ with output $\hat y(n)$; in parallel, $a(n)$ passes through a delay $D$ to give $y(n)$; the error is $e(n) = y(n) - \hat y(n)$.]

FIGURE 6.33 Data communication model used in Example 6.8.1.

To obtain an optimum equalizer $c(n)$, we will need the autocorrelation matrix $\mathbf{R}_x$ of the input sequence $x(n)$ and the cross-correlation vector $\mathbf{d}$ between $x(n)$ and $y(n)$. Consider the autocorrelation $r_x(l)$ of $x(n)$. From (6.8.36), assuming real-valued quantities, we obtain

$$r_x(l) = E\{[s(n) + v(n)][s(n - l) + v(n - l)]\} = r_s(l) + \sigma_v^2\,\delta(l) \tag{6.8.37}$$

where we have assumed that $s(n)$ and $v(n)$ are uncorrelated. Since $s(n)$ is a convolution between $\{a_n\}$ and $h(n)$, the autocorrelation $r_s(l)$ is given by

$$r_s(l) = r_a(l) * r_h(l) = r_h(l) \tag{6.8.38}$$

where $r_a(l) = \delta(l)$ since $\{a(n)\}$ is a Bernoulli sequence, and $r_h(l)$ is the autocorrelation of the channel response $h(n)$ and is given by

$$r_h(l) = h(l) * h(-l)$$

Using the symmetric channel response values in (6.8.35), we find that the autocorrelation $r_x(l)$ in (6.8.37) is given by

$$\begin{aligned}
r_x(0) &= h^2(1) + h^2(2) + h^2(3) + \sigma_v^2 = 1 + 0.5\left[1 + \cos\frac{2\pi}{W}\right]^2 + \sigma_v^2 \\
r_x(\pm 1) &= h(1)h(2) + h(2)h(3) = 1 + \cos\frac{2\pi}{W} \\
r_x(\pm 2) &= h(1)h(3) = 0.25\left[1 + \cos\frac{2\pi}{W}\right]^2 \\
r_x(l) &= 0 \qquad |l| \ge 3
\end{aligned} \tag{6.8.39}$$
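The closed forms in (6.8.39) can be checked by correlating the channel samples of (6.8.35) directly. The short sketch below does so for $W = 2.9$ and $\sigma_v^2 = 0.001$, the values quoted with Figure 6.34; the helper name `r_h` and the dictionary `r_x` are illustrative.

```python
import math

W, sigma_v2 = 2.9, 0.001                 # values quoted with Figure 6.34
h1 = 0.5 * (1.0 + math.cos(2.0 * math.pi / W))
h = [0.0, h1, 1.0, h1, 0.0]              # h(0), ..., h(4) from (6.8.35)


def r_h(l):
    """Deterministic autocorrelation r_h(l) = sum_n h(n) h(n - l)."""
    l = abs(l)
    return sum(h[n] * h[n - l] for n in range(l, len(h)))


# r_x(l) = r_h(l) + sigma_v^2 delta(l), from (6.8.37) and (6.8.38)
r_x = {l: r_h(l) + (sigma_v2 if l == 0 else 0.0) for l in range(0, 4)}
```

Only lags 0, 1, and 2 are nonzero, which is the reason the correlation matrix of the equalizer input has just five nonzero diagonals.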

Since the equalizer is an 11-tap FIR filter, the autocorrelation matrix $\mathbf{R}_x$ is an $11 \times 11$ matrix. However, owing to the few nonzero values of $r_x(l)$ in (6.8.39), it is also a quintdiagonal matrix with the main diagonal containing $r_x(0)$ and two upper and two lower nonzero diagonals. The cross-correlation between $x(n)$ and $y(n) = a(n - D)$ is given by

$$\begin{aligned}
d(l) &= E\{a(n - D)\,x(n - l)\} = E\{a(n - D)[s(n - l) + v(n - l)]\} \\
&= E\{a(n - D)\,s(n - l)\} + E\{a(n - D)\,v(n - l)\} \\
&= E\{a(n - D)[h(n - l) * a(n - l)]\} \\
&= h(D - l) * r_a(D - l) = h(D - l)
\end{aligned} \tag{6.8.40}$$

where we have used (6.8.36). The last step follows from the fact that $r_a(l) = \delta(l)$. Using the channel impulse response values in (6.8.35), we obtain

$$\begin{array}{lll}
D = 0: & d(l) = h(-l) = 0 & l \ge 0 \\
D = 1: & d(l) = h(1 - l) \;\Rightarrow\; d(0) = h(1) & d(l) = 0,\ l > 0 \\
D = 2: & d(l) = h(2 - l) \;\Rightarrow\; d(0) = h(2),\ d(1) = h(1) & d(l) = 0,\ l > 1 \\
\;\vdots & \;\vdots & \\
D = 7: & d(l) = h(7 - l) \;\Rightarrow\; d(4) = h(3),\ d(5) = h(2),\ d(6) = h(1) & d(l) = 0 \text{ elsewhere}
\end{array} \tag{6.8.41}$$

Remarks. There are some interesting observations that we can make from (6.8.41), in which the delay $D$ turns the estimation problem into a filtering, prediction, or smoothing problem.

1. When $D = 0$, we have a filtering case. The cross-correlation vector $\mathbf{d} = \mathbf{0}$, hence the equalizer taps are all zeros. This means that if we do not provide any delay in the system, the cross-correlation is zero and equalization is not possible because $\mathbf{c}_o = \mathbf{0}$.
2. When $D = 1$, we have a one-step prediction case.
3. When $D \ge 2$, we have a smoothing filter, which provides better performance. When $D = 7$, we note that the vector $\mathbf{d}$ is symmetric [with respect to the middle sample $d(5)$] and hence we should expect the best performance because the channel is also symmetric. We can also show that $D = 7$ is the optimum delay for this example (see Problem 6.40). However, this should not be a surprise since $h(n)$ is symmetric about $n = 2$, and if we make the equalizer symmetric about $n = 5$, then the channel input $a(n)$ is delayed by $D = 5 + 2 = 7$.

Figure 6.34 shows the channel impulse response $h(n)$ and the equalizer $c(n)$ for $D = 7$, $\sigma_v^2 = 0.001$, and $W = 2.9$ and $W = 3.1$.

[Figure 6.34 panels: channel $h(n)$ for $W = 2.9$; equalizer $c(n)$ for $W = 2.9$; channel $h(n)$ for $W = 3.1$; equalizer $c(n)$ for $W = 3.1$. Horizontal axis: $n = 0, 1, \ldots, 10$.]

FIGURE 6.34 Channel impulse response $h(n)$ and the equalizer $c(n)$ for $D = 7$, $\sigma_v^2 = 0.001$, and $W = 2.9$ and $W = 3.1$.
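The design in Example 6.8.1 can be reproduced numerically from (6.8.39), (6.8.41), and the normal equations (6.8.30) and (6.8.31). The sketch below is one possible way to do it, assuming real-valued quantities and $P_a = 1$ for the $\pm 1$ symbols; the function names and the range of delays swept are illustrative choices.

```python
import numpy as np


def channel(W):
    """Channel samples h(0), ..., h(3) from (6.8.34)."""
    h1 = 0.5 * (1.0 + np.cos(2.0 * np.pi / W))
    return np.array([0.0, h1, 1.0, h1])


def mmse_equalizer(W, sigma_v2, D, taps=11):
    """Solve R_x c = d for delay D; return (c, MMSE) with P_a = 1."""
    h = channel(W)
    r_h = np.correlate(h, h, mode="full")        # lags -3..3
    mid = len(h) - 1
    r_x = np.zeros(taps)
    r_x[:len(h)] = r_h[mid:]                     # r_h(0), r_h(1), ...
    r_x[0] += sigma_v2                           # (6.8.37)
    idx = np.arange(taps)
    R = r_x[np.abs(idx[:, None] - idx[None, :])]  # symmetric Toeplitz R_x
    # d(l) = h(D - l) for l = 0..taps-1, from (6.8.40)
    d = np.array([h[D - l] if 0 <= D - l < len(h) else 0.0
                  for l in range(taps)])
    c = np.linalg.solve(R, d)
    return c, 1.0 - c @ d                        # (6.8.31)


delays = range(0, 14)
mmse = [mmse_equalizer(2.9, 0.001, D)[1] for D in delays]
best_D = int(np.argmin(mmse))                    # delay with the smallest MMSE
c7, _ = mmse_equalizer(2.9, 0.001, 7)
```

For $D = 0$ the vector $\mathbf{d}$ is zero, so the equalizer is zero and the MMSE equals $P_a$; for $D = 7$ the solution is symmetric about its middle tap, in agreement with the remarks above.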

6.9 MATCHED FILTERS AND EIGENFILTERS

In this section we discuss the design of optimum filters that maximize the output signal-to-noise power ratio. Such filters are used to detect signals in additive noise in many applications, including digital communications and radar. First we discuss the case of a known deterministic signal in noise, and then we extend the results to the case of a random signal in noise.

Suppose that the observations obtained by sampling the output of a single sensor at $M$ instants, or $M$ sensors at the same instant, are arranged in a vector $\mathbf{x}(n)$. Furthermore, we assume that the available signal $\mathbf{x}(n)$ consists of a desired signal $\mathbf{s}(n)$ plus an additive noise plus interference signal $\mathbf{v}(n)$, that is,

$$\mathbf{x}(n) = \mathbf{s}(n) + \mathbf{v}(n) \tag{6.9.1}$$

where $\mathbf{s}(n)$ can be one of two things. It can be a deterministic signal of the form $\mathbf{s}(n) = \alpha\,\mathbf{s}_0$, where $\mathbf{s}_0$ is the completely known shape of $\mathbf{s}(n)$ and $\alpha$ is a complex random variable with power $P_\alpha = E\{|\alpha|^2\}$. The argument of $\alpha$ provides the unknown initial phase, and the modulus $|\alpha|$ the amplitude, of the signal. It can also be a random signal with known correlation matrix $\mathbf{R}_s(n)$. The signals $\mathbf{s}(n)$ and $\mathbf{v}(n)$ are assumed to be uncorrelated with zero means. The output of a linear processor (combiner or FIR filter) with coefficients $\{c_k^*\}_1^M$ is

$$y(n) = \mathbf{c}^H \mathbf{x}(n) = \mathbf{c}^H \mathbf{s}(n) + \mathbf{c}^H \mathbf{v}(n) \tag{6.9.2}$$

and its power

$$P_y(n) = E\{|y(n)|^2\} = E\{\mathbf{c}^H \mathbf{x}(n)\mathbf{x}^H(n)\mathbf{c}\} = \mathbf{c}^H \mathbf{R}_x(n)\,\mathbf{c} \tag{6.9.3}$$

is a quadratic function of the filter coefficients. The output noise power is

$$P_v(n) = E\{|\mathbf{c}^H \mathbf{v}(n)|^2\} = E\{\mathbf{c}^H \mathbf{v}(n)\mathbf{v}^H(n)\mathbf{c}\} = \mathbf{c}^H \mathbf{R}_v(n)\,\mathbf{c} \tag{6.9.4}$$


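where $\mathbf{R}_v(n)$ is the noise correlation matrix. The determination of the output SNR, and hence the subsequent optimization, depends on the nature of the signal $\mathbf{s}(n)$.

6.9.1 Deterministic Signal in Noise

In the deterministic signal case, the power of the signal is

$$P_s(n) = E\{|\alpha\,\mathbf{c}^H \mathbf{s}_0|^2\} = P_\alpha\,|\mathbf{c}^H \mathbf{s}_0|^2 \tag{6.9.5}$$

and therefore the output SNR can be written as

$$\text{SNR}(\mathbf{c}) = P_\alpha\,\frac{|\mathbf{c}^H \mathbf{s}_0|^2}{\mathbf{c}^H \mathbf{R}_v(n)\,\mathbf{c}} \tag{6.9.6}$$

White noise case. If the correlation matrix of the additive noise is given by $\mathbf{R}_v(n) = P_v \mathbf{I}$, the SNR becomes

$$\text{SNR}(\mathbf{c}) = \frac{P_\alpha}{P_v}\,\frac{|\mathbf{c}^H \mathbf{s}_0|^2}{\mathbf{c}^H \mathbf{c}} \tag{6.9.7}$$

which simplifies the maximization process. Indeed, from the Cauchy-Schwarz inequality

$$|\mathbf{c}^H \mathbf{s}_0| \le (\mathbf{c}^H \mathbf{c})^{1/2}\,(\mathbf{s}_0^H \mathbf{s}_0)^{1/2} \tag{6.9.8}$$

we conclude that the SNR in (6.9.7) attains its maximum value

$$\text{SNR}_{\max} = \frac{P_\alpha}{P_v}\,\mathbf{s}_0^H \mathbf{s}_0 \tag{6.9.9}$$

if the optimum filter $\mathbf{c}_o$ is chosen as

$$\mathbf{c}_o = \kappa\,\mathbf{s}_0 \tag{6.9.10}$$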

that is, when the filter is a scaled replica of the known signal shape. This property resulted in the term matched filter,† which is widely used in communications and radar applications. We note that if a vector $\mathbf{c}_o$ maximizes the SNR (6.9.7), then any constant $\kappa$ times $\mathbf{c}_o$ maximizes the SNR as well. Therefore, we can choose this constant in any way we want. In this section, we choose $\kappa$ so that $\mathbf{c}_o^H \mathbf{s}_0 = 1$.

† We note that the matched filter $\mathbf{c}_o$ in (6.9.10) is not a complex conjugate reversed version of the signal $\mathbf{s}$. This happens when we define the matched filter as a convolution that involves a reversal of the impulse response (Therrien 1992).

Colored noise case. Using the Cholesky decomposition $\mathbf{R}_v = \mathbf{L}_v \mathbf{L}_v^H$ of the noise correlation matrix, we can write the SNR in (6.9.6) as

$$\text{SNR}(\mathbf{c}) = P_\alpha\,\frac{|(\mathbf{L}_v^H \mathbf{c})^H (\mathbf{L}_v^{-1} \mathbf{s}_0)|^2}{(\mathbf{L}_v^H \mathbf{c})^H (\mathbf{L}_v^H \mathbf{c})} \tag{6.9.11}$$

which, according to the Cauchy-Schwarz inequality, attains its maximum

$$\text{SNR}_{\max} = P_\alpha\,\|\mathbf{L}_v^{-1} \mathbf{s}_0\|^2 = P_\alpha\,\mathbf{s}_0^H \mathbf{R}_v^{-1} \mathbf{s}_0 \tag{6.9.12}$$

when the optimum filter satisfies $\mathbf{L}_v^H \mathbf{c}_o = \kappa\,\mathbf{L}_v^{-1} \mathbf{s}_0$, or equivalently

$$\mathbf{c}_o = \kappa\,\mathbf{R}_v^{-1} \mathbf{s}_0 \tag{6.9.13}$$

which provides the optimum matched filter for colored additive noise. Again, the optimum filter can be scaled in any desirable way. We choose $\mathbf{c}_o^H \mathbf{s}_0 = 1$, which implies $\kappa = (\mathbf{s}_0^H \mathbf{R}_v^{-1} \mathbf{s}_0)^{-1}$. If we pass the observed signal through the preprocessor $\mathbf{L}_v^{-1}$, we obtain a signal $\mathbf{L}_v^{-1}\mathbf{s}$ in additive white noise $\tilde{\mathbf{v}} = \mathbf{L}_v^{-1}\mathbf{v}$ because $E\{\tilde{\mathbf{v}}\tilde{\mathbf{v}}^H\} = E\{\mathbf{L}_v^{-1}\mathbf{v}\mathbf{v}^H\mathbf{L}_v^{-H}\} = \mathbf{I}$. Therefore, the optimum matched filter in additive colored noise is the cascade of a whitening filter followed by a matched filter for white noise (compare with a similar decomposition for the optimum Wiener filter in Figure 6.19). The application of the optimum matched filter is discussed in Section 11.3, which provides a more detailed treatment.

EXAMPLE 6.9.1. Consider a finite-duration deterministic signal $s(n) = a^n$, $0 \le n \le M - 1$, corrupted by additive noise $v(n)$ with autocorrelation sequence $r_v(l) = \sigma_0^2\,\rho^{|l|}/(1 - \rho^2)$. We determine and plot the impulse response of an $M$th-order matched filter for $a = 0.6$, $M = 8$, $\sigma_0^2 = 0.25$, and (a) $\rho = 0.1$ and (b) $\rho = -0.8$. We first note that the signal vector is $\mathbf{s}_0 = [1\ a\ a^2\ \cdots\ a^7]^T$ and that the noise correlation matrix $\mathbf{R}_v$ is Toeplitz with first row $[r_v(0)\ r_v(1)\ \cdots\ r_v(7)]$. The optimum matched filters are determined by $\mathbf{c} = \mathbf{R}_v^{-1}\mathbf{s}_0$ and are shown in Figure 6.35. We notice that for $\rho = 0.1$ the matched filter looks like the signal because the correlation between the samples of the interference is very small; that is, the additive noise is close to white. For $\rho = -0.8$ the correlation increases, and the shape of the optimum filter differs more from the shape of the signal. However, as a result of the increased noise correlation, the optimum SNR increases.

[Figure 6.35 panels: signal $s(n)$; matched filter $c_o(n)$ for $\rho = 0.1$ (panel title: SNR = 5.5213 dB); matched filter $c_o(n)$ for $\rho = -0.8$ (panel title: SNR = 14.8115 dB). Horizontal axis: sample index $n = 1, \ldots, 8$.]

FIGURE 6.35 Signal and impulse responses of the optimum matched filter that maximizes the SNR in the presence of additive color noise.
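The computation in Example 6.9.1 amounts to solving one linear system per noise model, following (6.9.12) and (6.9.13). The sketch below uses the parameters of the example; the signal power $P_\alpha = 1$ and the function names are assumptions, and the filter is scaled so that $\mathbf{c}_o^H \mathbf{s}_0 = 1$ as chosen in this section.

```python
import numpy as np


def matched_filter(s0, R_v, P_alpha=1.0):
    """Return (c_o, SNR_max) from (6.9.13) and (6.9.12), with c_o^H s0 = 1."""
    u = np.linalg.solve(R_v, s0)         # R_v^{-1} s0
    q = np.real(np.vdot(s0, u))          # s0^H R_v^{-1} s0
    return u / q, P_alpha * q


a, M, sigma0_2 = 0.6, 8, 0.25            # parameters of Example 6.9.1
s0 = a ** np.arange(M)                   # s0 = [1, a, ..., a^7]


def noise_corr(rho):
    """Toeplitz matrix with entries r_v(l) = sigma0^2 rho^|l| / (1 - rho^2)."""
    lags = np.abs(np.subtract.outer(np.arange(M), np.arange(M)))
    return sigma0_2 * rho ** lags / (1.0 - rho ** 2)


c_a, snr_a = matched_filter(s0, noise_corr(0.1))    # nearly white noise
c_b, snr_b = matched_filter(s0, noise_corr(-0.8))   # strongly correlated noise
```

Solving the system with `np.linalg.solve` avoids forming $\mathbf{R}_v^{-1}$ explicitly; the stronger noise correlation yields the larger maximum SNR, as stated in the example.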

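6.9.2 Random Signal in Noise

In the case of a random signal with known correlation matrix $\mathbf{R}_s$, the SNR is

$$\text{SNR}(\mathbf{c}) = \frac{\mathbf{c}^H \mathbf{R}_s \mathbf{c}}{\mathbf{c}^H \mathbf{R}_v \mathbf{c}} \tag{6.9.14}$$

that is, the ratio of two quadratic forms. We again distinguish two cases.

White noise case. If the correlation matrix of the noise is given by $\mathbf{R}_v = P_v \mathbf{I}$, we have

$$\text{SNR}(\mathbf{c}) = \frac{1}{P_v}\,\frac{\mathbf{c}^H \mathbf{R}_s \mathbf{c}}{\mathbf{c}^H \mathbf{c}} \tag{6.9.15}$$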


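which has the form of Rayleigh's quotient (Strang 1980; Leon 1998). By using the innovations transformation $\tilde{\mathbf{c}} = \mathbf{Q}^H \mathbf{c}$, where the unitary matrix $\mathbf{Q}$ is obtained from the eigendecomposition $\mathbf{R}_s = \mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^H$, the SNR can be expressed as

$$\text{SNR}(\mathbf{c}) = \frac{1}{P_v}\,\frac{\tilde{\mathbf{c}}^H \boldsymbol{\Lambda}\,\tilde{\mathbf{c}}}{\tilde{\mathbf{c}}^H \tilde{\mathbf{c}}} = \frac{1}{P_v}\,\frac{\lambda_1 |\tilde c_1|^2 + \cdots + \lambda_M |\tilde c_M|^2}{|\tilde c_1|^2 + \cdots + |\tilde c_M|^2} \tag{6.9.16}$$

where $0 \le \lambda_1 \le \cdots \le \lambda_M$ are the eigenvalues of the signal correlation matrix. The SNR is maximized if we choose $\tilde c_M = 1$ and $\tilde c_1 = \cdots = \tilde c_{M-1} = 0$ and is minimized if we choose $\tilde c_1 = 1$ and $\tilde c_2 = \cdots = \tilde c_M = 0$. Therefore, for any positive definite matrix $\mathbf{R}_s$, we have

$$\lambda_{\min} \le \frac{\mathbf{c}^H \mathbf{R}_s \mathbf{c}}{\mathbf{c}^H \mathbf{c}} \le \lambda_{\max} \tag{6.9.17}$$

which is known as Rayleigh's quotient (Strang 1980). This implies that the optimum filter $\mathbf{c} = \mathbf{Q}\tilde{\mathbf{c}}$ is the eigenvector corresponding to the maximum eigenvalue of $\mathbf{R}_s$, that is,

$$\mathbf{c} = \mathbf{q}_{\max} \tag{6.9.18}$$

and provides a maximum SNR

$$\text{SNR}_{\max} = \frac{\lambda_{\max}}{P_v} \tag{6.9.19}$$

where $\lambda_{\max} = \lambda_M$. The obtained optimum filter is sometimes known as an eigenfilter (Makhoul 1981). The following example provides a geometric interpretation of these results for a second-order filter.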

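EXAMPLE 6.9.2. Suppose that the signal correlation matrix $\mathbf{R}_s$ is given by (see Example 3.5.1)

$$\mathbf{R} \triangleq \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix} = \mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^H = \frac{1}{\sqrt 2}\begin{bmatrix} 1 & 1 \\ -1 & 1 \end{bmatrix} \begin{bmatrix} 1 - \rho & 0 \\ 0 & 1 + \rho \end{bmatrix} \frac{1}{\sqrt 2}\begin{bmatrix} 1 & 1 \\ -1 & 1 \end{bmatrix}^H$$

where $\rho = 0.81$. To obtain a geometric interpretation, we fix $\mathbf{c}^H \mathbf{c} = 1$ and try to maximize the numerator $\mathbf{c}^H \mathbf{R}\mathbf{c} > 0$ (we assume that $\mathbf{R}$ is positive definite). The relation $c_1^2 + c_2^2 = 1$ represents a circle in the $(c_1, c_2)$ plane. The plot can be easily obtained by using the parametric description $c_1 = \cos\phi$ and $c_2 = \sin\phi$. To obtain the plot of $\mathbf{c}^H \mathbf{R}\mathbf{c} = 1$, we note that

$$\mathbf{c}^H \mathbf{R}\mathbf{c} = \mathbf{c}^H \mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^H \mathbf{c} = \tilde{\mathbf{c}}^H \boldsymbol{\Lambda}\,\tilde{\mathbf{c}} = \lambda_1 \tilde c_1^2 + \lambda_2 \tilde c_2^2 = 1$$

where $\tilde{\mathbf{c}} \triangleq \mathbf{Q}^H \mathbf{c}$. To plot $\lambda_1 \tilde c_1^2 + \lambda_2 \tilde c_2^2 = 1$, we use the parametric description $\tilde c_1 = \cos\phi/\sqrt{\lambda_1}$ and $\tilde c_2 = \sin\phi/\sqrt{\lambda_2}$. The result is an ellipse in the $(\tilde c_1, \tilde c_2)$ plane. For $\tilde c_2 = 0$ we have $\tilde c_1 = 1/\sqrt{\lambda_1}$, and for $\tilde c_1 = 0$ we have $\tilde c_2 = 1/\sqrt{\lambda_2}$. Since $\lambda_1 < \lambda_2$, $2/\sqrt{\lambda_1}$ provides the length of the major axis determined by the eigenvector $\mathbf{q}_1 = [1\ {-1}]^T/\sqrt 2$. Similarly, $2/\sqrt{\lambda_2}$ provides the length of the minor axis determined by the eigenvector $\mathbf{q}_2 = [1\ 1]^T/\sqrt 2$. The coordinates of the ellipse in the $(c_1, c_2)$ plane are obtained by the rotation transformation $\mathbf{c} = \mathbf{Q}\tilde{\mathbf{c}}$. The resulting circle and ellipse are shown in Figure 6.36. The maximum value of $\mathbf{c}^H \mathbf{R}\mathbf{c} = \lambda_1 \tilde c_1^2 + \lambda_2 \tilde c_2^2$ on the circle $\tilde c_1^2 + \tilde c_2^2 = 1$ is obtained for $\tilde c_1 = 0$ and $\tilde c_2 = 1$, that is, at the endpoint of eigenvector $\mathbf{q}_2$, and is equal to the largest eigenvalue $\lambda_2$. Similarly, the minimum is $\lambda_1$ and is obtained at the tip of eigenvector $\mathbf{q}_1$ (see Figure 6.36). Therefore, the optimum filter is $\mathbf{c} = \mathbf{q}_2$ and the maximum SNR is $\lambda_2/P_v$.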
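The result of Example 6.9.2 can be confirmed numerically by comparing the eigendecomposition of $\mathbf{R}$ with a brute-force scan of the quadratic form $\mathbf{c}^H\mathbf{R}\mathbf{c}$ over the unit circle. In this sketch the noise power $P_v = 1$ and the grid resolution are assumptions made for illustration.

```python
import numpy as np

rho, P_v = 0.81, 1.0                      # P_v = 1 is an assumption
R = np.array([[1.0, rho],
              [rho, 1.0]])

lam, Q = np.linalg.eigh(R)                # eigenvalues in ascending order
c_opt = Q[:, -1]                          # eigenfilter (6.9.18)
snr_max = lam[-1] / P_v                   # (6.9.19)

# Brute-force check: evaluate c^H R c for c = [cos(phi), sin(phi)]^T.
phi = np.linspace(0.0, 2.0 * np.pi, 3601)
C = np.stack([np.cos(phi), np.sin(phi)])  # each column is a unit-norm filter
quotient = np.einsum("in,ij,jn->n", C, R, C)
```

The scan peaks at $1 + \rho$ in the direction $[1\ 1]^T/\sqrt 2$ and bottoms out at $1 - \rho$ in the direction $[1\ {-1}]^T/\sqrt 2$, matching the eigenvalues and eigenvectors of the example.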

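Colored noise case. Using the Cholesky decomposition $\mathbf{R}_v = \mathbf{L}_v \mathbf{L}_v^H$ of the noise correlation matrix, we process the observed signal with the transformation $\mathbf{L}_v^{-1}$, that is, we obtain

$$\mathbf{x}_v(n) \triangleq \mathbf{L}_v^{-1}\mathbf{x}(n) = \mathbf{L}_v^{-1}\mathbf{s}(n) + \mathbf{L}_v^{-1}\mathbf{v}(n) = \mathbf{s}_v(n) + \tilde{\mathbf{v}}(n) \tag{6.9.20}$$

where $\tilde{\mathbf{v}}(n)$ is white noise with $E\{\tilde{\mathbf{v}}(n)\tilde{\mathbf{v}}^H(n)\} = \mathbf{I}$ and $E\{\mathbf{s}_v(n)\mathbf{s}_v^H(n)\} = \mathbf{L}_v^{-1}\mathbf{R}_s\mathbf{L}_v^{-H}$. Therefore, the optimum matched filter is determined by the eigenvector corresponding to the maximum eigenvalue of matrix $\mathbf{L}_v^{-1}\mathbf{R}_s\mathbf{L}_v^{-H}$, that is, the correlation matrix of the transformed signal $\mathbf{s}_v(n)$.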
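A minimal numerical sketch of this whitening approach follows. The two correlation matrices are hypothetical examples, and the function names are illustrative. The eigenvector found for the whitened data $\mathbf{x}_v(n)$ is also mapped back to an equivalent filter for the original data, using $\tilde{\mathbf{c}}^H \mathbf{L}_v^{-1}\mathbf{x}(n) = (\mathbf{L}_v^{-H}\tilde{\mathbf{c}})^H \mathbf{x}(n)$.

```python
import numpy as np


def eigenfilter_colored(R_s, R_v):
    """Maximize c^H R_s c / c^H R_v c by whitening the noise first."""
    L_v = np.linalg.cholesky(R_v)             # R_v = L_v L_v^H
    L_inv = np.linalg.inv(L_v)
    R_sv = L_inv @ R_s @ L_inv.conj().T       # L_v^{-1} R_s L_v^{-H}
    lam, U = np.linalg.eigh(R_sv)             # ascending eigenvalues
    c_tilde = U[:, -1]                        # filter for the whitened data x_v(n)
    c = L_inv.conj().T @ c_tilde              # equivalent filter for x(n)
    return c, lam[-1]


def snr(c, R_s, R_v):
    """SNR of (6.9.14) for a real-valued filter c."""
    return (c @ R_s @ c) / (c @ R_v @ c)


# Hypothetical signal and noise correlation matrices.
R_s = np.array([[1.0, 0.81], [0.81, 1.0]])
R_v = np.array([[1.0, 0.5], [0.5, 1.0]])
c, lam_max = eigenfilter_colored(R_s, R_v)
```

For these matrices both share the same eigenvectors, so the maximum SNR is simply the larger ratio of corresponding eigenvalues, $1.81/1.5$, and no other filter exceeds it.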

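[Figure 6.36 plot: the circle $\mathbf{c}^H\mathbf{c} = 1$ and the ellipse $\mathbf{c}^H\mathbf{R}\mathbf{c} = 1$ in the $(c_1, c_2)$ plane, with the rotated axes $(\tilde c_1, \tilde c_2)$ along the eigenvectors $\mathbf{q}_1$ and $\mathbf{q}_2$; the maximum $\lambda_2 = \mathbf{c}^H\mathbf{R}\mathbf{c}$ occurs along $\mathbf{q}_2$, where the ellipse semi-axis is $1/\sqrt{\lambda_2}$, and the minimum $\lambda_1 = \mathbf{c}^H\mathbf{R}\mathbf{c}$ occurs along $\mathbf{q}_1$, where the semi-axis is $1/\sqrt{\lambda_1}$.]

FIGURE 6.36 Geometric interpretation of the optimization process for the derivation of the optimum eigenfilter using isopower contours for $\lambda_1 < \lambda_2$.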

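The problem can also be solved by using the simultaneous diagonalization of the signal and noise correlation matrices $\mathbf{R}_s$ and $\mathbf{R}_v$, respectively. Starting with the decomposition $\mathbf{R}_v = \mathbf{Q}_v \boldsymbol{\Lambda}_v \mathbf{Q}_v^H$, we compute the isotropic transformation

$$\mathbf{x}_v(n) \triangleq \boldsymbol{\Lambda}_v^{-1/2}\mathbf{Q}_v^H \mathbf{x}(n) = \boldsymbol{\Lambda}_v^{-1/2}\mathbf{Q}_v^H \mathbf{s}(n) + \boldsymbol{\Lambda}_v^{-1/2}\mathbf{Q}_v^H \mathbf{v}(n) \triangleq \tilde{\mathbf{s}}(n) + \tilde{\mathbf{v}}(n) \tag{6.9.21}$$

where $E\{\tilde{\mathbf{v}}(n)\tilde{\mathbf{v}}^H(n)\} = \mathbf{I}$ and $E\{\tilde{\mathbf{s}}(n)\tilde{\mathbf{s}}^H(n)\} = \boldsymbol{\Lambda}_v^{-1/2}\mathbf{Q}_v^H \mathbf{R}_s \mathbf{Q}_v \boldsymbol{\Lambda}_v^{-1/2} \triangleq \mathbf{R}_{\tilde s}$. Since the noise vector is white, the optimum matched filter is determined by the eigenvector corresponding to the maximum eigenvalue of matrix $\mathbf{R}_{\tilde s}$. Finally, if $\mathbf{R}_{\tilde s} = \mathbf{Q}_{\tilde s}\boldsymbol{\Lambda}_{\tilde s}\mathbf{Q}_{\tilde s}^H$, the transformation

$$\mathbf{x}_{vs}(n) \triangleq \mathbf{Q}_{\tilde s}^H \mathbf{x}_v(n) = \mathbf{Q}_{\tilde s}^H \tilde{\mathbf{s}}(n) + \mathbf{Q}_{\tilde s}^H \tilde{\mathbf{v}}(n) \triangleq \bar{\mathbf{s}}(n) + \bar{\mathbf{v}}(n) \tag{6.9.22}$$

results in new signal and noise vectors with correlation matrices

$$E\{\bar{\mathbf{s}}(n)\bar{\mathbf{s}}^H(n)\} = \mathbf{Q}_{\tilde s}^H \mathbf{R}_{\tilde s}\mathbf{Q}_{\tilde s} = \boldsymbol{\Lambda}_{\tilde s} \tag{6.9.23}$$

$$E\{\bar{\mathbf{v}}(n)\bar{\mathbf{v}}^H(n)\} = \mathbf{Q}_{\tilde s}^H \mathbf{I}\,\mathbf{Q}_{\tilde s} = \mathbf{I} \tag{6.9.24}$$

Therefore, the transformation matrix

$$\mathbf{Q} \triangleq \mathbf{Q}_{\tilde s}^H \boldsymbol{\Lambda}_v^{-1/2}\mathbf{Q}_v^H \tag{6.9.25}$$

diagonalizes matrices $\mathbf{R}_s$ and $\mathbf{R}_v$ simultaneously (Fukunaga 1990).

The maximization of (6.9.14) can also be obtained by whitening the signal, that is, by using the Cholesky decomposition $\mathbf{R}_s = \mathbf{L}_s \mathbf{L}_s^H$ of the signal correlation matrix. Indeed, using the transformation $\tilde{\mathbf{c}} \triangleq \mathbf{L}_s^H \mathbf{c}$, we have

$$\text{SNR}(\mathbf{c}) = \frac{\mathbf{c}^H \mathbf{R}_s \mathbf{c}}{\mathbf{c}^H \mathbf{R}_v \mathbf{c}} = \frac{\mathbf{c}^H \mathbf{L}_s \mathbf{L}_s^H \mathbf{c}}{\mathbf{c}^H \mathbf{L}_s \mathbf{L}_s^{-1}\mathbf{R}_v \mathbf{L}_s^{-H}\mathbf{L}_s^H \mathbf{c}} = \frac{\tilde{\mathbf{c}}^H \tilde{\mathbf{c}}}{\tilde{\mathbf{c}}^H \mathbf{L}_s^{-1}\mathbf{R}_v \mathbf{L}_s^{-H}\tilde{\mathbf{c}}} \tag{6.9.26}$$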


which attains its maximum when c˜ is equal to the eigenvector corresponding to the minimum −H eigenvalue of matrix L−1 s Rv Ls . This approach has been used to obtain optimum movingtarget indicator filters for radar applications (Hsiao 1974). E XAM PLE 6.9.3. The basic problem in many radar detection systems is the separation of a useful signal from colored noise or interference background. In several cases the signal is a point target (i.e., it can be modeled as a unit impulse) or is random with a flat PSD, that is, Rs = Pa I. 2 Suppose that the background is colored with correlation rv (i, j ) = ρ (i−j ) , 1 ≤ i, j ≤ M, which leads to a Toeplitz correlation matrix Rv . We determine and compare three filters for interference rejection. The first is a matched filter that maximizes the SNR

Pa cH c SNR(c) = H c Rv c

(6.9.27)

by setting $\mathbf{c}$ equal to the eigenvector corresponding to the minimum eigenvalue of $\mathbf{R}_v$. The second approach is based on the method of linear prediction. Indeed, if we assume that the interference $v_k(n)$ is much stronger than the useful signal $s_k(n)$, we can obtain an estimate $\hat{v}_1(n)$ of $v_1(n)$ using the observed samples $\{x_k(n)\}_2^M$ and then subtract $\hat{v}_1(n)$ from $x_1(n)$ to cancel the interference. The Wiener filter with desired response $y(n) = v_1(n)$ and input data $\{x_k(n)\}_2^M$ is

$$\hat{y}(n) = -\sum_{k=1}^{M-1} a_k^* x_{k+1}(n) \triangleq -\mathbf{a}^H \tilde{\mathbf{x}}(n)$$

and is specified by the normal equations

$$\tilde{\mathbf{R}}_x \mathbf{a} = -\tilde{\mathbf{d}}$$

and the MMSE

$$P^f = E\{|v_1|^2\} + \tilde{\mathbf{d}}^H \mathbf{a}$$

where

$$[\tilde{\mathbf{R}}_x]_{ij} = E\{x_{i+1}(n) x_{j+1}^*(n)\} \approx E\{v_{i+1}(n) v_{j+1}^*(n)\}$$

and

$$\tilde{d}_i = E\{v_1(n) x_{i+1}^*(n)\} \approx E\{v_1(n) v_{i+1}^*(n)\}$$

because the interference is assumed much stronger than the signal. Using the last four equations, we obtain

$$\mathbf{R}_v \begin{bmatrix} 1 \\ \mathbf{a} \end{bmatrix} = \begin{bmatrix} E^f \\ \mathbf{0} \end{bmatrix} \tag{6.9.28}$$

which corresponds to the forward linear prediction error (LPE) filter discussed in Section 6.5.2. Finally, for the sake of comparison, we consider the binomial filters $H_M(z) = (1 - z^{-1})^M$ that are widely used in radar systems for the elimination of stationary (i.e., nonmoving) clutter. Figure 6.37 shows the magnitude response of the three filters for $\rho = 0.9$ and $M = 4$.

FIGURE 6.37 Comparison of frequency responses of matched filter, prediction error filter, and binomial interference rejection filter. [Plot: magnitude response (dB) versus frequency (cycles/sampling interval); curves: matched filter, LPE filter, binomial filter.]

We emphasize that the FLP method is suboptimum compared to matched filtering. However, because the frequency response of the FLP filter does not have the deep zero notch, we use it if we do not want to lose useful signals in that band (Chiuppesi et al. 1980).
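The linear prediction approach of Example 6.9.3 can be sketched numerically. The following is a minimal illustration, assuming real-valued data, $P_a = 1$, and the interference correlation $r_v(i,j) = \rho^{(i-j)^2}$ of the example; the helper names `solve`, `interference_matrix`, `lpe_filter`, and `snr` are hypothetical and not part of the text. It builds $\mathbf{R}_v$, solves the normal equations for the predictor $\mathbf{a}$, and evaluates the SNR (6.9.27) of the resulting prediction error filter.

```python
def solve(A, b):
    """Solve the real linear system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(map(float, row)) + [float(b[i])] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= f * aug[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def interference_matrix(rho, M):
    """R_v with entries rho**((i - j)**2), as in Example 6.9.3."""
    return [[rho ** ((i - j) ** 2) for j in range(M)] for i in range(M)]


def lpe_filter(Rv):
    """Prediction error filter [1, a_1, ..., a_{M-1}] satisfying R_v [1; a] = [E; 0]."""
    R_tilde = [row[1:] for row in Rv[1:]]   # correlation matrix of samples 2..M
    d_tilde = [row[0] for row in Rv[1:]]    # correlation of samples 2..M with sample 1
    a = solve(R_tilde, [-d for d in d_tilde])
    return [1.0] + a


def snr(c, Rv, Pa=1.0):
    """Output SNR (6.9.27) for a real-valued filter c."""
    n = len(c)
    noise = sum(c[i] * Rv[i][j] * c[j] for i in range(n) for j in range(n))
    return Pa * sum(ci * ci for ci in c) / noise


rho, M = 0.9, 4
Rv = interference_matrix(rho, M)
c_lpe = lpe_filter(Rv)
print("LPE filter:", c_lpe)
print("SNR of LPE filter (Pa = 1):", snr(c_lpe, Rv))
```

The matched filter would additionally require the eigenvector of $\mathbf{R}_v$ with the smallest eigenvalue, which is omitted here to keep the sketch free of external libraries.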

6.10 SUMMARY

In this chapter, we discussed the theory and application of optimum linear filters designed by minimizing the MSE criterion of performance. Our goal was to explain the characteristics of each criterion, emphasize when its use made sense, and illustrate its meaning in the context of practical applications.

We started with linear processors that formed an estimate of the desired response by combining a set of different signals (data) and showed that the parameters of the optimum processor can be obtained by solving a linear system of equations (normal equations). The matrix and the right-hand side vector of the normal equations are completely specified by the second-order moments of the input data and the desired response. Next, we used the developed theory to design optimum FIR filters, linear signal estimators, and linear predictors.

We emphasized the case of stationary stochastic processes and showed that the resulting optimum estimators are time-invariant. Therefore, we need to design only one optimum filter that can be used to process all realizations of the underlying stochastic processes. Although another filter may perform better for some realizations, that is, the estimated MSE is smaller than the MMSE, on average (i.e., when we consider all possible realizations), the optimum filter is the best.

We showed that the performance of optimum linear filters improves as we increase the number of filter coefficients. Therefore, the noncausal IIR filter provides the best possible performance and can be used as a yardstick to assess other filters. Because IIR filters involve an infinite number of parameters, their design involves linear equations with an infinite number of unknowns. For stationary processes, these equations take the form of a convolution equation that can be solved using z-transform techniques. If we use a pole-zero structure, the normal equations become nonlinear and the design of the optimum filter is complicated by the presence of multiple local minima.

Then we discussed the design of optimum filters for inverse system modeling and blind deconvolution, and we provided a detailed discussion of their use in the important practical application of channel equalization for data transmission systems. Finally, we provided a concise introduction to the design of optimum matched filters and eigenfilters that maximize the output SNR and find applications for the detection of signals in digital communication and radar systems.

PROBLEMS

6.1 Let $\mathbf{x}$ be a random vector with mean $E\{\mathbf{x}\}$. Show that the linear MMSE estimate $\hat{y}$ of a random variable $y$ using the data vector $\mathbf{x}$ is given by $\hat{y} = y_o + \mathbf{c}^H \mathbf{x}$, where $y_o = E\{y\} - \mathbf{c}^H E\{\mathbf{x}\}$, $\mathbf{c} = \mathbf{R}^{-1}\mathbf{d}$, $\mathbf{R} = E\{\mathbf{x}\mathbf{x}^H\}$, and $\mathbf{d} = E\{\mathbf{x}y^*\}$.

6.2 Consider an optimum FIR filter specified by the input correlation matrix $\mathbf{R} = \mathrm{Toeplitz}\{1, \tfrac{1}{4}\}$ and cross-correlation vector $\mathbf{d} = [1 \;\; \tfrac{1}{2}]^T$.
(a) Determine the optimum impulse response $\mathbf{c}_o$ and the MMSE $P_o$.
(b) Express $\mathbf{c}_o$ and $P_o$ in terms of the eigenvalues and eigenvectors of $\mathbf{R}$.

6.3 Repeat Problem 6.2 for a third-order optimum FIR filter.


6.4 A process $y(n)$ with the autocorrelation $r_y(l) = a^{|l|}$, $-1 < a < 1$, is corrupted by additive, uncorrelated white noise $v(n)$ with variance $\sigma_v^2$. To reduce the noise in the observed process $x(n) = y(n) + v(n)$, we use a first-order Wiener filter.
(a) Express the coefficients $c_{o,1}$ and $c_{o,2}$ and the MMSE $P_o$ in terms of parameters $a$ and $\sigma_v^2$.
(b) Compute and plot the PSD of $x(n)$ and the magnitude response $|C_o(e^{j\omega})|$ of the filter when $\sigma_v^2 = 2$, for both $a = 0.8$ and $a = -0.8$, and compare the results.
(c) Compute and plot the processing gain of the filter for $a = -0.9, -0.8, -0.7, \ldots, 0.9$ as a function of $a$ and comment on the results.

6.5 Consider the harmonic process $y(n)$ and its noise observation $x(n)$ given in Example 6.4.1.
(a) Show that $r_y(l) = \tfrac{1}{2} A^2 \cos \omega_0 l$.
(b) Write a Matlab function h = opt_fir(A,f0,var_v,M) to design an Mth-order optimum FIR filter impulse response $h(n)$. Use the toeplitz function from Matlab to generate correlation matrix $\mathbf{R}$.
(c) Determine the impulse response of a 20th-order optimum FIR filter for $A = 0.5$, $f_0 = 0.05$, and $\sigma_v^2 = 0.5$.
(d) Using Matlab, determine and plot the magnitude response of the above-designed filter, and verify your results with those given in Example 6.4.1.

6.6 Consider a "desired" signal $s(n)$ generated by the process $s(n) = -0.8w(n-1) + w(n)$, where $w(n) \sim \mathrm{WN}(0, \sigma_w^2)$. This signal is passed through the causal system $H(z) = 1 - 0.9z^{-1}$ whose output $y(n)$ is corrupted by additive white noise $v(n) \sim \mathrm{WN}(0, \sigma_v^2)$. The processes $w(n)$ and $v(n)$ are uncorrelated with $\sigma_w^2 = 0.3$ and $\sigma_v^2 = 0.1$.
(a) Design a second-order optimum FIR filter that estimates $s(n)$ from the signal $x(n) = y(n) + v(n)$ and determine $\mathbf{c}_o$ and $P_o$.
(b) Plot the error performance surface, and verify that it is quadratic and that the optimum filter points to its minimum.
(c) Repeat part (a) for a third-order filter, and see whether there is any improvement.

6.7 Repeat Problem 6.6, assuming that the desired signal is generated by $s(n) = -0.8s(n-1) + w(n)$.

6.8 Repeat Problem 6.6, assuming that $H(z) = 1$.

6.9 A stationary process $x(n)$ is generated by the difference equation $x(n) = \rho x(n-1) + w(n)$, where $w(n) \sim \mathrm{WN}(0, \sigma_w^2)$.
(a) Show that the correlation matrix of $x(n)$ is given by

$$\mathbf{R}_x = \frac{\sigma_w^2}{1 - \rho^2}\, \mathrm{Toeplitz}\{1, \rho, \rho^2, \ldots, \rho^{M-1}\}$$

(b) Show that the Mth-order FLP is given by $a_1^{(M)} = -\rho$, $a_k^{(M)} = 0$ for $k > 1$ and the MMSE is $P_M^f = \sigma_w^2$.

6.10 Using Parseval's theorem, show that (6.4.18) can be written as (6.4.21) in the frequency domain.

6.11 By differentiating (6.4.21) with respect to $H(e^{j\omega})$, derive the frequency response function $H_o(e^{j\omega})$ of the optimum filter in terms of $R_{yx}(e^{j\omega})$ and $R_x(e^{j\omega})$.

6.12 A conjugate symmetric linear smoother is obtained from (6.5.12) when $M = 2L$ and $i = L$. If the process $x(n)$ is stationary, then, using $\bar{\mathbf{R}}\mathbf{J} = \mathbf{J}\bar{\mathbf{R}}^*$, show that $\bar{\mathbf{c}} = \mathbf{J}\bar{\mathbf{c}}^*$.

6.13 Let $\bar{\mathbf{Q}}$ and $\bar{\boldsymbol{\Lambda}}$ be the matrices from the eigendecomposition of $\bar{\mathbf{R}}$, that is, $\bar{\mathbf{R}} = \bar{\mathbf{Q}}\bar{\boldsymbol{\Lambda}}\bar{\mathbf{Q}}^H$.
(a) Substitute $\bar{\mathbf{R}}$ into (6.5.20) and (6.5.27) to prove (6.5.43) and (6.5.44).
(b) Generalize the above result for a jth-order linear signal estimator $\mathbf{c}^{(j)}(n)$; that is, prove that

$$\mathbf{c}^{(j)}(n) = P_o^{(j)}(n) \sum_{i=1}^{M+1} \frac{1}{\bar{\lambda}_i}\, \bar{\mathbf{q}}_i\, \bar{q}_{i,j}$$

6.14 Let $\tilde{\mathbf{R}}(n)$ be the inverse of the correlation matrix $\bar{\mathbf{R}}(n)$ given in (6.5.11).
(a) Using (6.5.12), show that the diagonal elements of $\tilde{\mathbf{R}}(n)$ are given by

$$[\tilde{\mathbf{R}}(n)]_{i,i} = \frac{1}{P^{(i)}(n)} \qquad 1 \le i \le M + 1$$

(b) Furthermore, show that

$$\mathbf{c}^{(i)}(n) = \frac{\tilde{\mathbf{r}}_i(n)}{[\tilde{\mathbf{R}}(n)]_{i,i}} \qquad 1 \le i \le M + 1$$

where $\tilde{\mathbf{r}}_i(n)$ is the ith column of $\tilde{\mathbf{R}}(n)$.

6.15 The first five samples of the autocorrelation sequence of a signal $x(n)$ are $r(0) = 1$, $r(1) = 0.8$, $r(2) = 0.6$, $r(3) = 0.4$, and $r(4) = 0.3$. Compute the FLP, the BLP, the optimum symmetric smoother, and the corresponding MMSE (a) by using the normal equations method and (b) by using the inverse of the normal equations matrix.

6.16 For the symmetric, Toeplitz autocorrelation matrix $\mathbf{R} = \mathrm{Toeplitz}\{r(0), r(1), r(2)\} = r(0) \times \mathrm{Toeplitz}\{1, \rho_1, \rho_2\}$ with $\mathbf{R} = \mathbf{L}\mathbf{D}\mathbf{L}^H$ and $\mathbf{D} = \mathrm{diag}\{\xi_1, \xi_2, \xi_3\}$, the following conditions are equivalent:
• $\mathbf{R}$ is positive definite.
• $\xi_i > 0$ for $1 \le i \le 3$.
• $|k_i| < 1$ for $1 \le i \le 3$.
Determine the values of $\rho_1$ and $\rho_2$ for which $\mathbf{R}$ is positive definite, and plot the corresponding area in the $(\rho_1, \rho_2)$ plane.

6.17 Prove the first equation in (6.5.45) by rearranging the FLP normal equations in terms of the unknowns $P_o^f(n), a_1(n), \ldots, a_M(n)$ and then solving for $P_o^f(n)$, using Cramer's rule. Repeat the procedure for the second equation.

6.18 Consider the signal $x(n) = y(n) + v(n)$, where $y(n)$ is a useful random signal corrupted by noise $v(n)$. The processes $y(n)$ and $v(n)$ are uncorrelated with PSDs

$$R_y(e^{j\omega}) = \begin{cases} 1 & 0 \le |\omega| \le \dfrac{\pi}{2} \\[4pt] 0 & \dfrac{\pi}{2} < |\omega| \le \pi \end{cases}$$

and

$$R_v(e^{j\omega}) = \begin{cases} 1 & \dfrac{\pi}{4} \le |\omega| \le \dfrac{\pi}{2} \\[4pt] 0 & 0 \le |\omega| < \dfrac{\pi}{4} \text{ and } \dfrac{\pi}{2} < |\omega| \le \pi \end{cases}$$

respectively.
(a) Determine the optimum IIR filter and find the MMSE.
(b) Determine a third-order optimum FIR filter and the corresponding MMSE.
(c) Determine the noncausal optimum FIR filter defined by $\hat{y}(n) = h(-1)x(n+1) + h(0)x(n) + h(1)x(n-1)$ and the corresponding MMSE.

6.19 Consider the ARMA(1, 1) process $x(n) = 0.8x(n-1) + w(n) + 0.5w(n-1)$, where $w(n) \sim \mathrm{WGN}(0, 1)$.
(a) Determine the coefficients and the MMSE of (1) the one-step ahead FLP $\hat{x}(n) = a_1 x(n-1) + a_2 x(n-2)$ and (2) the two-step ahead FLP $\hat{x}(n+1) = a_1 x(n-1) + a_2 x(n-2)$.
(b) Check if the obtained prediction error filters are minimum-phase, and explain your findings.


6.20 Consider a random signal $x(n) = s(n) + v(n)$, where $v(n) \sim \mathrm{WGN}(0, 1)$ and $s(n)$ is the AR(1) process $s(n) = 0.9s(n-1) + w(n)$, where $w(n) \sim \mathrm{WGN}(0, 0.64)$. The signals $s(n)$ and $v(n)$ are uncorrelated.
(a) Determine and plot the autocorrelation $r_s(l)$ and the PSD $R_s(e^{j\omega})$ of $s(n)$.
(b) Design a second-order optimum FIR filter to estimate $s(n)$ from $x(n)$. What is the MMSE?
(c) Design an optimum IIR filter to estimate $s(n)$ from $x(n)$. What is the MMSE?

6.21 A useful signal $s(n)$ with PSD $R_s(z) = [(1 - 0.9z^{-1})(1 - 0.9z)]^{-1}$ is corrupted by additive uncorrelated noise $v(n) \sim \mathrm{WN}(0, \sigma^2)$.
(a) The resulting signal $x(n) = s(n) + v(n)$ is passed through a causal filter with system function $H(z) = (1 - 0.8z^{-1})^{-1}$. Determine (1) the SNR at the input, (2) the SNR at the output, and (3) the processing gain, that is, the improvement in SNR.
(b) Determine the causal optimum filter and compare its performance with that of the filter in (a).

6.22 A useful signal $s(n)$ with PSD $R_s(z) = 0.36[(1 - 0.8z^{-1})(1 - 0.8z)]^{-1}$ is corrupted by additive uncorrelated noise $v(n) \sim \mathrm{WN}(0, 1)$. Determine the optimum noncausal and causal IIR filters, and compare their performance by examining the MMSE and their magnitude response. Hint: Plot the magnitude responses on the same graph with the PSDs of signal and noise.

6.23 Consider a process with PSD $R_x(z) = \sigma^2 H_x(z) H_x(z^{-1})$. Determine the D-step ahead linear predictor, and show that the MMSE is given by $P^{(D)} = \sigma^2 \sum_{n=0}^{D-1} |h_x(n)|^2$. Check your results by using the PSD $R_x(z) = (1 - a^2)[(1 - az^{-1})(1 - az)]^{-1}$.

6.24 Let $x(n) = s(n) + v(n)$ with $R_v(z) = 1$, $R_{sv}(z) = 0$, and

$$R_s(z) = \frac{0.75}{(1 - 0.5z^{-1})(1 - 0.5z)}$$

Determine the optimum filters for the estimation of $s(n)$ and $s(n-2)$ from $\{x(k)\}_{-\infty}^{n}$ and the corresponding MMSEs.

6.25 For the random signal with PSD

$$R_x(z) = \frac{(1 - 0.2z^{-1})(1 - 0.2z)}{(1 - 0.9z^{-1})(1 - 0.9z)}$$

determine the optimum two-step ahead linear predictor and the corresponding MMSE.

6.26 Repeat Problem 6.25 for

$$R_x(z) = \frac{1}{(1 - 0.2z^{-1})(1 - 0.2z)(1 - 0.9z^{-1})(1 - 0.9z)}$$

6.27 Let $x(n) = s(n) + v(n)$ with $v(n) \sim \mathrm{WN}(0, 1)$ and $s(n) = 0.6s(n-1) + w(n)$, where $w(n) \sim \mathrm{WN}(0, 0.82)$. The processes $s(n)$ and $v(n)$ are uncorrelated. Determine the optimum filters for the estimation of $s(n)$, $s(n+2)$, and $s(n-2)$ from $\{x(k)\}_{-\infty}^{n}$ and the corresponding MMSEs.

6.28 Repeat Problem 6.27 for $R_s(z) = [(1 - 0.5z^{-1})(1 - 0.5z)]^{-1}$, $R_v(z) = 5$, and $R_{sv}(z) = 0$.

6.29 Consider the random sequence $x(n)$ generated in Example 6.5.2

$$x(n) = w(n) + \tfrac{1}{2} w(n-1)$$

where $w(n)$ is $\mathrm{WN}(0, 1)$. Generate $K = 100$ sample functions $\{w_k(n)\}_{n=0}^{N}$, $k = 1, \ldots, K$ of $w(n)$, in order to generate $K$ sample functions $\{x_k(n)\}_{n=0}^{N}$, $k = 1, \ldots, K$ of $x(n)$.
(a) Use the second-order FLP $\mathbf{a}_k$ to obtain predictions $\{\hat{x}_k^f(n)\}_{n=2}^{N}$ of $x_k(n)$, for $k = 1, \ldots, K$. Then determine the average error

$$\hat{P}^f = \frac{1}{N-1} \sum_{n=2}^{N} |x_k(n) - \hat{x}_k^f(n)|^2 \qquad k = 1, \ldots, K$$

and plot it as a function of $k$. Compare it with $P_o^f$.
(b) Use the second-order BLP $\mathbf{b}_k$ to obtain predictions $\{\hat{x}_k^b(n)\}_{n=0}^{N-2}$, $k = 1, \ldots, K$ of $x_k(n)$. Then determine the average error

$$\hat{P}^b = \frac{1}{N-1} \sum_{n=0}^{N-2} |x_k(n) - \hat{x}_k^b(n)|^2 \qquad k = 1, \ldots, K$$

and plot it as a function of $k$. Compare it with $P_o^b$.
(c) Use the second-order symmetric linear smoother $\mathbf{c}_k$ to obtain smooth estimates $\{\hat{x}_k^c(n)\}_{n=0}^{N-2}$ of $x_k(n)$ for $k = 1, \ldots, K$. Determine the average error

$$\hat{P}^s = \frac{1}{N-1} \sum_{n=1}^{N-1} |x_k(n) - \hat{x}_k^c(n)|^2 \qquad k = 1, \ldots, K$$

and plot it as a function of $k$. Compare it with $P_o^s$.

6.30 Let $x(n) = y(n) + v(n)$ be a wide-sense stationary process. The linear, symmetric smoothing filter estimator of $y(n)$ is given by

$$\hat{y}(n) = \sum_{k=-L}^{L} c_k^s x(n-k)$$

(a) Determine the normal equations for the optimum MMSE filter.
(b) Show that the smoothing filter $\mathbf{c}_o^s$ has linear phase.
(c) Use the Lagrange multiplier method to determine the MMSE Mth-order estimator $\hat{y}(n) = \mathbf{c}^H \mathbf{x}(n)$, where $M = 2L + 1$, when the filter vector $\mathbf{c}$ is constrained to be conjugate symmetric, that is, $\mathbf{c} = \mathbf{J}\mathbf{c}^*$. Compare the results with those obtained in part (a).

6.31 Consider the causal prediction filter discussed in Example 6.6.1. To determine $H_c^{[D]}(z)$, first compute the causal part of the z-transform $[R_{yw}(z)]_+$. Next compute $H_c^{[D]}(z)$ by using (6.6.21).
(a) Determine $h_c^{[D]}(n)$.
(b) Using the above $h_c^{[D]}(n)$, show that

$$P_c^{[D]} = 1 - \tfrac{5}{8} \left(\tfrac{4}{5}\right)^{2D}$$

6.32 Consider the causal smoothing filter discussed in Example 6.6.1.
(a) Using $[r_{yw}(l)]_+ = r_{yw}(l + D)u(l)$, $D < 0$, show that $[r_{yw}(l)]_+$ can be put in the form

$$[r_{yw}(l)]_+ = \tfrac{3}{5} \left(\tfrac{4}{5}\right)^{l+D} u(l + D) + \tfrac{3}{5} \left(2^{l+D}\right) [u(l) - u(l + D)]$$

(b) Hence, show that $[R_{yw}(z)]_+$ is given by

$$[R_{yw}(z)]_+ = \frac{3}{5}\, \frac{z^D}{1 - \frac{4}{5} z^{-1}} + \frac{3}{5} (2^D) \sum_{l=0}^{-D-1} 2^l z^{-l}$$

(c) Finally using (6.6.21), prove (6.6.54).

6.33 In this problem, we will prove (6.6.57).
(a) Starting with (6.6.42), show that $[R_{yw}(z)]_+$ can also be put in the form

$$[R_{yw}(z)]_+ = \frac{3}{5} \left( \frac{z^D}{1 - \frac{4}{5} z^{-1}} + \frac{2^D - z^D}{1 - 2z^{-1}} \right)$$

(b) Now, using (6.6.21), show that

$$H_c^{[D]}(z) = \frac{3}{8} \left[ \frac{2^D (1 - \frac{4}{5} z^{-1}) + \frac{3}{5} z^{D-1}}{(1 - \frac{4}{5} z^{-1})(1 - 2z^{-1})} \right] \qquad D < 0$$

hence, show that

$$\lim_{D \to -\infty} H_c^{[D]}(z) = \frac{9}{40}\, \frac{z^D}{(1 - \frac{4}{5} z^{-1})(1 - 2z^{-1})} = z^D H_{nc}(z)$$

(c) Finally, show that $\lim_{D \to \infty} P_c^{[D]} = P_{nc}$.

6.34 Consider the block diagram of a simple communication system shown in Figure 6.38. The information resides in the signal $s(n)$ produced by exciting the system $H_1(z) = 1/(1 + 0.95z^{-1})$ with the process $w(n) \sim \mathrm{WGN}(0, 0.3)$. The signal $s(n)$ propagates through the channel $H_2(z) = 1/(1 - 0.85z^{-1})$, and is corrupted by the additive noise process $v(n) \sim \mathrm{WGN}(0, 0.1)$, which is uncorrelated with $w(n)$.
(a) Determine a second-order optimum FIR filter ($M = 2$) that estimates the signal $s(n)$ from the received signal $x(n) = z(n) + v(n)$. What is the corresponding MMSE $P_o$?
(b) Plot the error performance surface and verify that the optimum filter corresponds to the bottom of the "bowl."
(c) Use a Monte Carlo simulation (100 realizations with a 1000-sample length each) to verify the theoretically obtained MMSE in part (a).
(d) Repeat part (a) for $M = 3$ and check if there is any improvement.
Hint: To compute the autocorrelation of $z(n)$, notice that the output of $H_1(z)H_2(z)$ is an AR(2) process.

FIGURE 6.38 Block diagram of simple communication system used in Problem 6.34. [Diagram: $w(n)$ drives $H_1(z)$ to produce $s(n)$; the channel $H_2(z)$ produces $z(n)$; noise $v(n)$ is added to give $x(n)$; the optimum filter output $\hat{y}(n)$ is subtracted from $y(n)$ to form $e(n)$.]

6.35 Write a program to reproduce the results shown in Figure 6.35 of Example 6.9.1.
(a) Produce plots for $\rho = 0.1, -0.8, 0.8$.
(b) Repeat part (a) for $M = 16$. Compare the plots obtained in (a) and (b) and justify any similarities or differences.

6.36 Write a program to reproduce the plot shown in Figure 6.36 of Example 6.9.2. Repeat for $\rho = -0.81$ and explain the similarities and differences between the two plots.

6.37 In this problem we study in greater detail the interference rejection filters discussed in Example 6.9.3.
(a) Show that the SNRs for the matched filter and FLP filter are given by

$$\begin{array}{c|c|c}
M & \text{Matched filter} & \text{FLP filter} \\ \hline
2 & \dfrac{1}{1 - \rho} & \dfrac{1 + \rho^2}{1 - \rho^2} \\[8pt]
3 & \dfrac{2}{2 + \rho^4 \left(1 - \sqrt{1 + 8\rho^{-6}}\right)} & \dfrac{1 + \rho^2 + 3\rho^4 + \rho^6}{(\rho^2 - 1)(\rho^4 - 1)}
\end{array}$$

and check the results numerically.
(b) Compute and plot the SNRs and compare the performance of both filters for $M = 2, 3, 4$ and $\rho = 0.6, 0.8, 0.9, 0.95, 0.99$, and $0.995$. For what values of $\rho$ and $M$ do the two methods give similar results? Explain your conclusions.
(c) Plot the magnitude response of the matched, FLP, and binomial filters for $M = 3$ and $\rho = 0.9$. Why does the optimum matched filter always have some nulls in its frequency response?

6.38 Determine the matched filter for the deterministic pulse $s(n) = \cos \omega_0 n$ for $0 \le n \le M - 1$ and zero elsewhere when the noise is (a) white with variance $\sigma_v^2$ and (b) colored with autocorrelation $r_v(l) = \sigma_v^2 \rho^{|l|}/(1 - \rho^2)$, $-1 < \rho < 1$. Plot the frequency response of the filter and superimpose it on the noise PSD, for $\omega_0 = \pi/6$, $M = 12$, $\sigma_v^2 = 1$, and $\rho = 0.9$. Explain the shape of the obtained response. (c) Study the effect of the SNR in part (a) by varying the value of $\sigma_v^2$. (d) Study the effect of the noise correlation in part (c) by varying the value of $\rho$.

6.39 Consider the equalization experiment in Example 6.8.1 with $M = 11$ and $D = 7$.
(a) Compute and plot the magnitude response $|H(e^{j\omega})|$ of the channel and $|C_o(e^{j\omega})|$ of the optimum equalizer for $W = 2.9, 3.1, 3.3$, and $3.5$ and comment upon the results.
(b) For the same values of $W$, compute the spectral dynamic range $|H(e^{j\omega})|_{\max}/|H(e^{j\omega})|_{\min}$ of the channel and the eigenvalue spread $\lambda_{\max}/\lambda_{\min}$ of the $M \times M$ input correlation matrix. Explain how the variation in one affects the other.

6.40 In this problem we clarify some of the properties of the MSE equalizer discussed in Example 6.8.1.
(a) Compute and plot the MMSE $P_o$ as a function of $M$, and recommend how to choose a "reasonable" value.
(b) Compute and plot $P_o$ as a function of the delay $D$ for $0 \le D \le 11$. What is the best value of $D$?
(c) Study the effect of input SNR upon $P_o$ for $M = 11$ and $D = 7$ by fixing $\sigma_y^2 = 1$ and varying $\sigma_v^2$.

6.41 In this problem we formulate the design of optimum linear signal estimators (LSE) using a constrained optimization framework. To this end we consider the estimator $e(n) = c_0^* x(n) + \cdots + c_M^* x(n - M) \triangleq \mathbf{c}^H \mathbf{x}(n)$ and we wish to minimize the output power $E\{|e(n)|^2\} = \mathbf{c}^H \mathbf{R} \mathbf{c}$. To prevent the trivial solution $\mathbf{c} = \mathbf{0}$ we need to impose some constraint on the filter coefficients and use Lagrange multipliers to determine the minimum. Let $\mathbf{u}_i$ be an $M \times 1$ vector with one at the ith position and zeros elsewhere.
(a) Show that minimizing $\mathbf{c}^H \mathbf{R} \mathbf{c}$ under the linear constraint $\mathbf{u}_i^T \mathbf{c} = 1$ provides the following estimators: FLP if $i = 0$, BLP if $i = M$, and linear smoother if $i \ne 0, M$.
(b) Determine the appropriate set of constraints for the L-steps ahead linear predictor, defined by $c_0 = 1$ and $\{c_k = 0\}_1^{L-1}$, and solve the corresponding constrained optimization problem. Verify your answer by obtaining the normal equations using the orthogonality principle.
(c) Determine the optimum linear estimator by minimizing $\mathbf{c}^H \mathbf{R} \mathbf{c}$ under the quadratic constraints $\mathbf{c}^H \mathbf{c} = 1$ and $\mathbf{c}^H \mathbf{W} \mathbf{c} = 1$ ($\mathbf{W}$ is a positive definite matrix) which impose a constraint on the length of the filter vector.


CHAPTER 7

Algorithms and Structures for Optimum Linear Filters

The design and application of optimum filters involves (1) the solution of the normal equations to determine the optimum set of coefficients, (2) the evaluation of the cost function to determine whether the obtained parameters satisfy the design requirements, and (3) the implementation of the optimum filter, that is, the computation of its output that provides the estimate of the desired response. The normal equations can be solved by using any general-purpose routine for linear simultaneous equations. However, there are several important reasons to study the normal equations in greater detail in order to develop efficient, special-purpose algorithms for their solution. First, the throughput of several real-time applications can only be served with serial or parallel algorithms that are obtained by exploiting the special structure (e.g., Toeplitz) of the correlation matrix. Second, sometimes we can develop order-recursive algorithms that help us to choose the correct filter order or to stop the algorithm before the manifestation of numerical problems. Third, some algorithms lead to intermediate sets of parameters that have physical meaning, provide easy tests for important properties (e.g., minimum phase), or are useful in special applications (e.g., data compression). Finally, sometimes there is a link between the algorithm for the solution of the normal equations and the structure for the implementation of the optimum filter.

In this chapter, we present different algorithms for the solution of the normal equations, the computation of the minimum mean square error (MMSE), and the implementation of the optimum filter. We start in Section 7.1 with a discussion of some results from matrix algebra that are useful for the development of order-recursive algorithms and introduce an algorithm for the order-recursive computation of the $LDL^H$ decomposition, the MMSE, and the optimum estimate in the general case.
In Section 7.2, we present some interesting interpretations for the various introduced algorithmic quantities and procedures that provide additional insight into the optimum filtering problem. The only assumption we have made so far is that we know the required second-order statistics; hence, the results apply to any linear estimation problem: array processing, filtering, and prediction of nonstationary or stationary processes. In the sequel, we impose additional constraints on the input data vector and show how to exploit them in order to simplify the general algorithms and structures or specify new ones. In Section 7.3, we explore the shift invariance of the input data vector to develop a time-varying lattice-ladder structure for the optimum filter. However, to derive an order-recursive algorithm for the computation of either the direct or lattice-ladder structure parameters of the optimum time-varying filter, we need an analytical description of the changing second-order statistics of the nonstationary input process. Recall that in the simplest case of stationary processes, the correlation matrix is constant and Toeplitz. As a result, the optimum FIR filters and predictors are time-invariant, and their direct or lattice-ladder structure parameters can be computed (only once) using efficient, order-recursive algorithms due to Levinson and Durbin (Section 7.4) or Schur (Section 7.6). Section 7.5 provides a derivation of the lattice-ladder structures for optimum filtering and prediction, their structural and statistical properties, and algorithms for transformations between the various sets of parameters. Section 7.7 deals with efficient, order-recursive algorithms for the triangularization and inversion of Toeplitz matrices. The chapter concludes with Section 7.8, which provides a concise introduction to the Kalman filtering algorithm. The Kalman filter provides a recursive solution to the minimum MSE filtering problem when the input stochastic process is described by a known state space model. This is possible because the state space model leads to a recursive formula for the updating of the required second-order moments.

7.1 FUNDAMENTALS OF ORDER-RECURSIVE ALGORITHMS

In Section 6.3, we introduced a method to solve the normal equations and compute the MMSE using the $LDL^H$ decomposition. The optimum estimate is computed as a sum of products using a linear combiner supplied with the optimum coefficients and the input data. The key characteristic of this approach is that the order of the estimator should be fixed initially, and in case we choose a different order, we have to repeat all the computations. Such computational methods are known as fixed-order algorithms.

When the order of the estimator becomes a design variable, we need to modify our notation to take this into account. For example, the mth-order estimator $\mathbf{c}_m(n)$ is obtained by minimizing $E\{|e_m(n)|^2\}$, where

$$e_m(n) \triangleq y(n) - \hat{y}_m(n) \tag{7.1.1}$$

$$\hat{y}_m(n) \triangleq \mathbf{c}_m^H(n)\, \mathbf{x}_m(n) \tag{7.1.2}$$

$$\mathbf{c}_m(n) \triangleq [c_1^{(m)}(n) \;\; c_2^{(m)}(n) \;\; \cdots \;\; c_m^{(m)}(n)]^T \tag{7.1.3}$$

$$\mathbf{x}_m(n) \triangleq [x_1(n) \;\; x_2(n) \;\; \cdots \;\; x_m(n)]^T \tag{7.1.4}$$

In general, we use the subscript m to denote the order of a matrix or vector and the superscript m to emphasize that a scalar is a component of an $m \times 1$ vector. We note that these quantities are functions of time n, but sometimes we do not explicitly show this dependence for the sake of simplicity.

If the mth-order estimator $\mathbf{c}_m(n)$ has been computed by solving the normal equations, it seems to be a waste of computational power to start from scratch to compute the (m + 1)st-order estimator $\mathbf{c}_{m+1}(n)$. Thus, we would like to arrange the computations so that the results for order m, that is, $\mathbf{c}_m(n)$ or $\hat{y}_m(n)$, can be used to compute the estimates for order m + 1, that is, $\mathbf{c}_{m+1}(n)$ or $\hat{y}_{m+1}(n)$. The resulting procedures are called order-recursive algorithms or order-updating relations. Similarly, procedures that compute $\mathbf{c}_m(n + 1)$ from $\mathbf{c}_m(n)$ or $\hat{y}_m(n + 1)$ from $\hat{y}_m(n)$ are called time-recursive algorithms or time-updating relations. Combined order and time updates are also possible. All these updates play a central role in the design and implementation of many optimum and adaptive filters.

In this section, we derive order-recursive algorithms for the computation of the $LDL^H$ decomposition, the MMSE, and the MMSE optimal estimate. We also show that there is no order-recursive algorithm for the computation of the estimator parameters.

7.1.1 Matrix Partitioning and Optimum Nesting

We start by introducing some notation that is useful for the discussion of order-recursive algorithms.† Notice that if the order of the estimator increases from m to m + 1, then the input data vector is augmented with one additional observation $x_{m+1}$.

† All quantities in Sections 7.1 and 7.2 are functions of the time index n. However, for notational simplicity we do not explicitly show this dependence.

We use the notation $\mathbf{x}_{m+1}^{\lceil m \rceil}$ to denote the vector that consists of the first m components and $\mathbf{x}_{m+1}^{\lfloor m \rfloor}$ for the last m components of vector $\mathbf{x}_{m+1}$. The same notation can be generalized to matrices. The $m \times m$ matrix $\mathbf{R}_{m+1}^{\lceil m \rceil}$, obtained by the intersection of the first m rows and columns of $\mathbf{R}_{m+1}$, is known as the mth-order leading principal submatrix of $\mathbf{R}_{m+1}$. In other words, if $r_{ij}$ are the elements of $\mathbf{R}_{m+1}$, then the elements of $\mathbf{R}_{m+1}^{\lceil m \rceil}$ are $r_{ij}$, $1 \le i, j \le m$. Similarly, $\mathbf{R}_{m+1}^{\lfloor m \rfloor}$ denotes the matrix obtained by the intersection of the last m rows and columns of $\mathbf{R}_{m+1}$. For example, if m = 3 we obtain

$$\mathbf{R}_4 = \begin{bmatrix} r_{11} & r_{12} & r_{13} & r_{14} \\ r_{21} & r_{22} & r_{23} & r_{24} \\ r_{31} & r_{32} & r_{33} & r_{34} \\ r_{41} & r_{42} & r_{43} & r_{44} \end{bmatrix}, \quad \mathbf{R}_4^{\lceil 3 \rceil} = \begin{bmatrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \\ r_{31} & r_{32} & r_{33} \end{bmatrix}, \quad \mathbf{R}_4^{\lfloor 3 \rfloor} = \begin{bmatrix} r_{22} & r_{23} & r_{24} \\ r_{32} & r_{33} & r_{34} \\ r_{42} & r_{43} & r_{44} \end{bmatrix} \tag{7.1.5}$$

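which illustrates the upper left corner and lower right corner partitionings of matrix $\mathbf{R}_4$. Since $\mathbf{x}_{m+1}^{\lceil m \rceil} = \mathbf{x}_m$, we can easily see that the correlation matrix can be partitioned as

$$\mathbf{R}_{m+1} = E\left\{ \begin{bmatrix} \mathbf{x}_m \\ x_{m+1} \end{bmatrix} \begin{bmatrix} \mathbf{x}_m^H & x_{m+1}^* \end{bmatrix} \right\} = \begin{bmatrix} \mathbf{R}_m & \mathbf{r}_m^b \\ \mathbf{r}_m^{bH} & \rho_m^b \end{bmatrix} \tag{7.1.6}$$

where

$$\mathbf{r}_m^b \triangleq E\{\mathbf{x}_m x_{m+1}^*\} \tag{7.1.7}$$

and

$$\rho_m^b \triangleq E\{|x_{m+1}|^2\} \tag{7.1.8}$$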

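The result

$$\mathbf{x}_{m+1}^{\lceil m \rceil} = \mathbf{x}_m \;\Rightarrow\; \mathbf{R}_m = \mathbf{R}_{m+1}^{\lceil m \rceil} \tag{7.1.9}$$

is known as the optimum nesting property and is instrumental in the development of order-recursive algorithms. Similarly, we can show that $\mathbf{x}_{m+1}^{\lceil m \rceil} = \mathbf{x}_m$ implies

$$\mathbf{d}_{m+1} = E\{\mathbf{x}_{m+1} y^*\} = E\left\{ \begin{bmatrix} \mathbf{x}_m \\ x_{m+1} \end{bmatrix} y^* \right\} = \begin{bmatrix} \mathbf{d}_m \\ d_{m+1} \end{bmatrix} \tag{7.1.10}$$

or

$$\mathbf{x}_{m+1}^{\lceil m \rceil} = \mathbf{x}_m \;\Rightarrow\; \mathbf{d}_m = \mathbf{d}_{m+1}^{\lceil m \rceil} \tag{7.1.11}$$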

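that is, the right-hand side of the normal equations also has the optimum nesting property. Since (7.1.9) and (7.1.11) hold for all $1 \le m \le M$, the correlation matrix $\mathbf{R}_M$ and the cross-correlation vector $\mathbf{d}_M$ contain the information for the computation of all the optimum estimators $\mathbf{c}_m$ for $1 \le m \le M$.

7.1.2 Inversion of Partitioned Hermitian Matrices

Suppose now that we know the inverse $\mathbf{R}_m^{-1}$ of the leading principal submatrix $\mathbf{R}_{m+1}^{\lceil m \rceil} = \mathbf{R}_m$ of matrix $\mathbf{R}_{m+1}$ and we wish to use it to compute $\mathbf{R}_{m+1}^{-1}$ without having to repeat all the work. Since the inverse $\mathbf{Q}_{m+1}$ of the Hermitian matrix $\mathbf{R}_{m+1}$ is also Hermitian, it can be partitioned as

$$\mathbf{Q}_{m+1} = \begin{bmatrix} \mathbf{Q}_m & \mathbf{q}_m \\ \mathbf{q}_m^H & q_m \end{bmatrix} \tag{7.1.12}$$

Using (7.1.6), we obtain

$$\mathbf{R}_{m+1} \mathbf{Q}_{m+1} = \begin{bmatrix} \mathbf{R}_m & \mathbf{r}_m^b \\ \mathbf{r}_m^{bH} & \rho_m^b \end{bmatrix} \begin{bmatrix} \mathbf{Q}_m & \mathbf{q}_m \\ \mathbf{q}_m^H & q_m \end{bmatrix} = \begin{bmatrix} \mathbf{I}_m & \mathbf{0}_m \\ \mathbf{0}_m^H & 1 \end{bmatrix} \tag{7.1.13}$$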


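After performing the matrix multiplication, we get

$$\mathbf{R}_m \mathbf{Q}_m + \mathbf{r}_m^b \mathbf{q}_m^H = \mathbf{I}_m \tag{7.1.14}$$

$$\mathbf{r}_m^{bH} \mathbf{Q}_m + \rho_m^b \mathbf{q}_m^H = \mathbf{0}_m^H \tag{7.1.15}$$

$$\mathbf{R}_m \mathbf{q}_m + \mathbf{r}_m^b q_m = \mathbf{0}_m \tag{7.1.16}$$

$$\mathbf{r}_m^{bH} \mathbf{q}_m + \rho_m^b q_m = 1 \tag{7.1.17}$$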

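where $\mathbf{0}_m$ is the $m \times 1$ zero vector. If matrix $\mathbf{R}_m$ is invertible, we can solve (7.1.16) for $\mathbf{q}_m$

$$\mathbf{q}_m = -\mathbf{R}_m^{-1} \mathbf{r}_m^b q_m \tag{7.1.18}$$

and then substitute into (7.1.17) to obtain $q_m$ as

$$q_m = \frac{1}{\rho_m^b - \mathbf{r}_m^{bH} \mathbf{R}_m^{-1} \mathbf{r}_m^b} \tag{7.1.19}$$

assuming that the scalar quantity $\rho_m^b - \mathbf{r}_m^{bH} \mathbf{R}_m^{-1} \mathbf{r}_m^b \ne 0$. Substituting (7.1.19) into (7.1.18), we obtain

$$\mathbf{q}_m = \frac{-\mathbf{R}_m^{-1} \mathbf{r}_m^b}{\rho_m^b - \mathbf{r}_m^{bH} \mathbf{R}_m^{-1} \mathbf{r}_m^b} \tag{7.1.20}$$

which, in conjunction with (7.1.14), yields

$$\mathbf{Q}_m = \mathbf{R}_m^{-1} - \mathbf{R}_m^{-1} \mathbf{r}_m^b \mathbf{q}_m^H = \mathbf{R}_m^{-1} + \frac{\mathbf{R}_m^{-1} \mathbf{r}_m^b (\mathbf{R}_m^{-1} \mathbf{r}_m^b)^H}{\rho_m^b - \mathbf{r}_m^{bH} \mathbf{R}_m^{-1} \mathbf{r}_m^b} \tag{7.1.21}$$

We note that (7.1.19) through (7.1.21) express the parts of the inverse matrix $\mathbf{Q}_{m+1}$ in terms of known quantities. For our purposes, we express the above equations in a more convenient form, using the quantities

$$\mathbf{b}_m \triangleq [b_0^{(m)} \;\; b_1^{(m)} \;\; \cdots \;\; b_{m-1}^{(m)}]^T \triangleq -\mathbf{R}_m^{-1} \mathbf{r}_m^b \tag{7.1.22}$$

and

$$\alpha_m^b \triangleq \rho_m^b - \mathbf{r}_m^{bH} \mathbf{R}_m^{-1} \mathbf{r}_m^b = \rho_m^b + \mathbf{r}_m^{bH} \mathbf{b}_m \tag{7.1.23}$$

Thus, if matrix $\mathbf{R}_m$ is invertible and $\alpha_m^b \ne 0$, combining (7.1.13) with (7.1.19) through (7.1.23), we obtain

$$\mathbf{R}_{m+1}^{-1} = \begin{bmatrix} \mathbf{R}_m & \mathbf{r}_m^b \\ \mathbf{r}_m^{bH} & \rho_m^b \end{bmatrix}^{-1} = \begin{bmatrix} \mathbf{R}_m^{-1} & \mathbf{0}_m \\ \mathbf{0}_m^H & 0 \end{bmatrix} + \frac{1}{\alpha_m^b} \begin{bmatrix} \mathbf{b}_m \\ 1 \end{bmatrix} \begin{bmatrix} \mathbf{b}_m^H & 1 \end{bmatrix} \tag{7.1.24}$$

which determines $\mathbf{R}_{m+1}^{-1}$ from $\mathbf{R}_m^{-1}$ by using a simple rank-one modification known as the matrix inversion by partitioning lemma (Noble and Daniel 1988). Another useful expression for $\alpha_m^b$ is

$$\alpha_m^b = \frac{\det \mathbf{R}_{m+1}}{\det \mathbf{R}_m} \tag{7.1.25}$$

which reinforces the importance of the quantity $\alpha_m^b$ for the invertibility of matrix $\mathbf{R}_{m+1}$ (see Problem 7.1).

EXAMPLE 7.1.1.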

which reinforces the importance of the quantity α bm for the invertibility of matrix Rm+1 (see Problem 7.1). E XAM PLE 7.1.1.

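Given the matrix

$$\mathbf{R}_3 = \begin{bmatrix} 1 & \frac{1}{2} & \frac{1}{3} \\ \frac{1}{2} & 1 & \frac{1}{2} \\ \frac{1}{3} & \frac{1}{2} & 1 \end{bmatrix} = \begin{bmatrix} \mathbf{R}_2 & \mathbf{r}_2^b \\ \mathbf{r}_2^{bH} & \rho_2^b \end{bmatrix}$$

and the inverse matrix

$$\mathbf{R}_2^{-1} = \begin{bmatrix} 1 & \frac{1}{2} \\ \frac{1}{2} & 1 \end{bmatrix}^{-1} = \frac{1}{3} \begin{bmatrix} 4 & -2 \\ -2 & 4 \end{bmatrix}$$

compute matrix $\mathbf{R}_3^{-1}$, using the matrix inversion by partitioning lemma.

Solution. To determine $\mathbf{R}_3^{-1}$ from the order-updating formula (7.1.24), we first compute

$$\mathbf{b}_2 = -\mathbf{R}_2^{-1} \mathbf{r}_2^b = -\frac{1}{3} \begin{bmatrix} 4 & -2 \\ -2 & 4 \end{bmatrix} \begin{bmatrix} \frac{1}{3} \\ \frac{1}{2} \end{bmatrix} = -\frac{1}{9} \begin{bmatrix} 1 \\ 4 \end{bmatrix}$$

and

$$\alpha_2^b = \rho_2^b + \mathbf{r}_2^{bH} \mathbf{b}_2 = 1 - \frac{1}{9} \begin{bmatrix} \frac{1}{3} & \frac{1}{2} \end{bmatrix} \begin{bmatrix} 1 \\ 4 \end{bmatrix} = \frac{20}{27}$$

using (7.1.22) and (7.1.23). Then we compute

$$\mathbf{R}_3^{-1} = \frac{1}{3} \begin{bmatrix} 4 & -2 & 0 \\ -2 & 4 & 0 \\ 0 & 0 & 0 \end{bmatrix} + \frac{27}{20} \begin{bmatrix} -\frac{1}{9} \\ -\frac{4}{9} \\ 1 \end{bmatrix} \begin{bmatrix} -\frac{1}{9} & -\frac{4}{9} & 1 \end{bmatrix} = \frac{1}{20} \begin{bmatrix} 27 & -12 & -3 \\ -12 & 32 & -12 \\ -3 & -12 & 27 \end{bmatrix}$$

using (7.1.24). The reader can easily verify the above calculations using Matlab.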
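The calculations of Example 7.1.1 can also be checked with exact rational arithmetic. The following is a minimal sketch, assuming a real symmetric correlation matrix (a Hermitian version would need conjugates); the function name `inverse_order_update` is hypothetical and introduced only for this illustration of (7.1.22) through (7.1.24).

```python
from fractions import Fraction as F


def inverse_order_update(R_inv, r_b, rho_b):
    """Return (R_{m+1}^{-1}, b_m, alpha_m^b) from R_m^{-1}, r_m^b, rho_m^b.

    Real symmetric case only; a Hermitian version would conjugate r_b and b.
    """
    m = len(r_b)
    # (7.1.22): b_m = -R_m^{-1} r_m^b
    b = [-sum(R_inv[i][j] * r_b[j] for j in range(m)) for i in range(m)]
    # (7.1.23): alpha_m^b = rho_m^b + r_m^{bH} b_m
    alpha = rho_b + sum(r * bi for r, bi in zip(r_b, b))
    if alpha == 0:
        raise ZeroDivisionError("alpha_m^b = 0, so R_{m+1} is singular")
    u = b + [F(1)]  # the vector [b_m; 1]
    # (7.1.24): zero-padded R_m^{-1} plus the rank-one term u u^T / alpha
    Q = [[(R_inv[i][j] if i < m and j < m else F(0)) + u[i] * u[j] / alpha
          for j in range(m + 1)] for i in range(m + 1)]
    return Q, b, alpha


R3 = [[F(1), F(1, 2), F(1, 3)],
      [F(1, 2), F(1), F(1, 2)],
      [F(1, 3), F(1, 2), F(1)]]
R2_inv = [[F(4, 3), F(-2, 3)],
          [F(-2, 3), F(4, 3)]]

R3_inv, b2, alpha2 = inverse_order_update(R2_inv, [F(1, 3), F(1, 2)], F(1))

# Check that R3 * R3_inv is the identity.
product = [[sum(R3[i][k] * R3_inv[k][j] for k in range(3)) for j in range(3)]
           for i in range(3)]
print(alpha2)                # 20/27
print([str(v) for v in b2])  # ['-1/9', '-4/9']
print(product == [[1, 0, 0], [0, 1, 0], [0, 0, 1]])  # True
```

Exact fractions are used here so that the result can be compared entry by entry with the matrices of the example; a floating-point version would behave the same up to rounding.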

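Following a similar approach, we can show (see Problem 7.2) that the inverse of the lower right corner partitioned matrix $\mathbf{R}_{m+1}$ can be expressed as

$$\mathbf{R}_{m+1}^{-1} = \begin{bmatrix} \rho_m^f & \mathbf{r}_m^{fH} \\ \mathbf{r}_m^f & \mathbf{R}_m^f \end{bmatrix}^{-1} = \begin{bmatrix} 0 & \mathbf{0}_m^H \\ \mathbf{0}_m & (\mathbf{R}_m^f)^{-1} \end{bmatrix} + \frac{1}{\alpha_m^f} \begin{bmatrix} 1 \\ \mathbf{a}_m \end{bmatrix} \begin{bmatrix} 1 & \mathbf{a}_m^H \end{bmatrix} \tag{7.1.26}$$

where

$$\mathbf{a}_m \triangleq [a_1^{(m)} \;\; a_2^{(m)} \;\; \cdots \;\; a_m^{(m)}]^T \triangleq -(\mathbf{R}_m^f)^{-1} \mathbf{r}_m^f \tag{7.1.27}$$

$$\alpha_m^f \triangleq \rho_m^f - \mathbf{r}_m^{fH} (\mathbf{R}_m^f)^{-1} \mathbf{r}_m^f = \rho_m^f + \mathbf{r}_m^{fH} \mathbf{a}_m = \frac{\det \mathbf{R}_{m+1}}{\det \mathbf{R}_m^f} \tag{7.1.28}$$

and the relationship (7.1.26) exists if matrix $\mathbf{R}_m^f$ is invertible and $\alpha_m^f \ne 0$. A similar set of formulas can be obtained for arbitrary matrices (see Problem 7.3).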

Interpretations. The vector $\mathbf{b}_m$, defined by (7.1.22), is the MMSE estimator of observation $x_{m+1}$ from data vector $\mathbf{x}_m$. Indeed, if

$$e_m^b \triangleq x_{m+1} - \hat{x}_{m+1} = x_{m+1} + \mathbf{b}_m^H\mathbf{x}_m \tag{7.1.29}$$

we can show, using the orthogonality principle $E\{\mathbf{x}_m e_m^{b*}\} = \mathbf{0}$, that $\mathbf{b}_m$ results in the MMSE given by

$$P_m^b = \rho_m^b + \mathbf{b}_m^H\mathbf{r}_m^b = \alpha_m^b \tag{7.1.30}$$

Similarly, we can show that $\mathbf{a}_m$, defined by (7.1.27), is the optimum estimator of $x_1$ based on $\tilde{\mathbf{x}}_m \triangleq [x_2\; x_3\; \cdots\; x_{m+1}]^T$. By using the orthogonality principle, $E\{\tilde{\mathbf{x}}_m e_m^{f*}\} = \mathbf{0}$, the MMSE is

$$P_m^f = \rho_m^f + \mathbf{r}_m^{fH}\mathbf{a}_m = \alpha_m^f \tag{7.1.31}$$

If $\mathbf{x}_{m+1} = [x(n)\; x(n-1)\; \cdots\; x(n-m)]^T$, then $\mathbf{b}_m$ provides the backward linear predictor (BLP) and $\mathbf{a}_m$ the forward linear predictor (FLP) of the process $x(n)$ from Section 6.5. For convenience, we always use this terminology even if, strictly speaking, the linear prediction interpretation is not applicable.

7.1.3 Levinson Recursion for the Optimum Estimator

We now illustrate how to use (7.1.24) to express the optimum estimator $\mathbf{c}_{m+1}$ in terms of the estimator $\mathbf{c}_m$. Indeed, using (7.1.24), (7.1.10), and the normal equations $\mathbf{R}_m\mathbf{c}_m = \mathbf{d}_m$, we have

$$\begin{aligned} \mathbf{c}_{m+1} &= \mathbf{R}_{m+1}^{-1}\mathbf{d}_{m+1} \\ &= \begin{bmatrix} \mathbf{R}_m^{-1} & \mathbf{0}_m \\ \mathbf{0}_m^T & 0 \end{bmatrix}\begin{bmatrix} \mathbf{d}_m \\ d_{m+1} \end{bmatrix} + \frac{1}{\alpha_m^b}\begin{bmatrix} \mathbf{b}_m \\ 1 \end{bmatrix}\begin{bmatrix} \mathbf{b}_m^H & 1 \end{bmatrix}\begin{bmatrix} \mathbf{d}_m \\ d_{m+1} \end{bmatrix} \\ &= \begin{bmatrix} \mathbf{R}_m^{-1}\mathbf{d}_m \\ 0 \end{bmatrix} + \begin{bmatrix} \mathbf{b}_m \\ 1 \end{bmatrix}\frac{\mathbf{b}_m^H\mathbf{d}_m + d_{m+1}}{\alpha_m^b} \end{aligned}$$

or more concisely

$$\mathbf{c}_{m+1} = \begin{bmatrix} \mathbf{c}_m \\ 0 \end{bmatrix} + \begin{bmatrix} \mathbf{b}_m \\ 1 \end{bmatrix} k_m^c \tag{7.1.32}$$

where the quantities

$$k_m^c \triangleq \frac{\beta_m^c}{\alpha_m^b} \tag{7.1.33}$$

and

$$\beta_m^c \triangleq \mathbf{b}_m^H\mathbf{d}_m + d_{m+1} \tag{7.1.34}$$

contain the "new information" $d_{m+1}$ (the new component of $\mathbf{d}_{m+1}$). By using (7.1.22) and $\mathbf{R}_m\mathbf{c}_m = \mathbf{d}_m$, alternatively $\beta_m^c$ can be written as

$$\beta_m^c = -\mathbf{r}_m^{bH}\mathbf{c}_m + d_{m+1} \tag{7.1.35}$$

We will use the term Levinson recursion for the order-updating relation (7.1.32) because a similar recursion was introduced as part of the celebrated algorithm due to Levinson (see Section 7.3). However, we stress that even though (7.1.32) is order-recursive, the parameter vector $\mathbf{c}_{m+1}$ does not have the optimum nesting property, that is, $\mathbf{c}_{m+1}^{\lceil m\rceil} \neq \mathbf{c}_m$. Clearly, if we know the vector $\mathbf{b}_m$, we can determine $\mathbf{c}_{m+1}$, using (7.1.32); however, its practical utility depends on how easily we can obtain the vector $\mathbf{b}_m$. In general, $\mathbf{b}_m$ requires the solution of an $m \times m$ linear system of equations, and the computational savings compared to direct solution of the $(m+1)$st-order normal equations is insignificant. For the Levinson recursion to be useful, we need an order recursion for vector $\mathbf{b}_m$. Since matrix $\mathbf{R}_{m+1}$ has the optimum nesting property, we need to check whether the same is true for the right-hand side vector in $\mathbf{R}_{m+1}\mathbf{b}_{m+1} = -\mathbf{r}_{m+1}^b$. From the definition $\mathbf{r}_m^b \triangleq E\{\mathbf{x}_m x_{m+1}^*\}$, we can easily see that $\mathbf{r}_{m+1}^{b\lceil m\rceil} \neq \mathbf{r}_m^b$ and $\mathbf{r}_{m+1}^{b\lfloor m\rfloor} \neq \mathbf{r}_m^b$. Hence, in general, we cannot find a Levinson recursion for vector $\mathbf{b}_m$. This is possible only in optimum filtering problems in which the input data vector $\mathbf{x}_m(n)$ has a shift-invariance structure (see Section 7.3).

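EXAMPLE 7.1.2. Use the Levinson recursion to determine the optimum linear estimator $\mathbf{c}_3$ specified by the matrix

$$\mathbf{R}_3 = \begin{bmatrix} 1 & \tfrac12 & \tfrac13 \\ \tfrac12 & 1 & \tfrac12 \\ \tfrac13 & \tfrac12 & 1 \end{bmatrix}$$

in Example 7.1.1 and the cross-correlation vector $\mathbf{d}_3 = [1\; 2\; 4]^T$.

Solution. For $m = 1$ we have $r_{11}c_1^{(1)} = d_1$, which gives $c_1^{(1)} = 1$. Also, from (7.1.32) and (7.1.34) we obtain $k_0^c = c_1^{(1)} = 1$ and $\beta_0^c = d_1 = 1$. Finally, from $k_0^c = \beta_0^c/\alpha_0^b$, we get $\alpha_0^b = 1$.

To obtain $\mathbf{c}_2$, we need $b_0^{(1)}$, $k_1^c$, $\beta_1^c$, and $\alpha_1^b$. We have

$$r_{11}b_0^{(1)} = -r_1^b \;\Rightarrow\; b_0^{(1)} = -\frac{1/2}{1} = -\frac12$$

$$\beta_1^c = b_0^{(1)}d_1 + d_2 = -\tfrac12(1) + 2 = \tfrac32$$

$$\alpha_1^b = \rho_1^b + r_1^b b_0^{(1)} = 1 + \tfrac12\left(-\tfrac12\right) = \tfrac34$$

$$k_1^c = \frac{\beta_1^c}{\alpha_1^b} = 2$$

and therefore

$$\mathbf{c}_2 = \begin{bmatrix} \mathbf{c}_1 \\ 0 \end{bmatrix} + \begin{bmatrix} \mathbf{b}_1 \\ 1 \end{bmatrix}k_1^c = \begin{bmatrix} 1 \\ 0 \end{bmatrix} + \begin{bmatrix} -\tfrac12 \\ 1 \end{bmatrix}2 = \begin{bmatrix} 0 \\ 2 \end{bmatrix}$$

To determine $\mathbf{c}_3$, we need $\mathbf{b}_2$, $\beta_2^c$, and $\alpha_2^b$. To obtain $\mathbf{b}_2$, we solve the linear system

$$\mathbf{R}_2\mathbf{b}_2 = -\mathbf{r}_2^b \quad\text{or}\quad \begin{bmatrix} 1 & \tfrac12 \\ \tfrac12 & 1 \end{bmatrix}\begin{bmatrix} b_0^{(2)} \\ b_1^{(2)} \end{bmatrix} = -\begin{bmatrix} \tfrac13 \\ \tfrac12 \end{bmatrix} \;\Rightarrow\; \mathbf{b}_2 = -\frac19\begin{bmatrix} 1 \\ 4 \end{bmatrix}$$

and then compute

$$\beta_2^c = \mathbf{b}_2^T\mathbf{d}_2 + d_3 = -\frac19\begin{bmatrix} 1 & 4 \end{bmatrix}\begin{bmatrix} 1 \\ 2 \end{bmatrix} + 4 = 3$$

$$\alpha_2^b = \rho_2^b + \mathbf{r}_2^{bT}\mathbf{b}_2 = 1 - \frac19\begin{bmatrix} \tfrac13 & \tfrac12 \end{bmatrix}\begin{bmatrix} 1 \\ 4 \end{bmatrix} = \frac{20}{27}$$

$$k_2^c = \frac{\beta_2^c}{\alpha_2^b} = \frac{81}{20}$$

The desired solution $\mathbf{c}_3$ is obtained by using the Levinson recursion

$$\mathbf{c}_3 = \begin{bmatrix} \mathbf{c}_2 \\ 0 \end{bmatrix} + \begin{bmatrix} \mathbf{b}_2 \\ 1 \end{bmatrix}k_2^c \;\Rightarrow\; \mathbf{c}_3 = \begin{bmatrix} 0 \\ 2 \\ 0 \end{bmatrix} + \begin{bmatrix} -\tfrac19 \\ -\tfrac49 \\ 1 \end{bmatrix}\frac{81}{20} = \frac{1}{20}\begin{bmatrix} -9 \\ 4 \\ 81 \end{bmatrix}$$

which agrees with the solution obtained by solving $\mathbf{R}_3\mathbf{c}_3 = \mathbf{d}_3$ using the function c3=R3\d3. We can also solve this linear system by developing an algorithm using the lower partitioning (7.1.26) as discussed in Problem 7.4.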
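The steps of Example 7.1.2 can be sketched in code as follows. This is an illustrative Python sketch, not the book's Matlab; it assumes real-valued data, uses exact fractions, and the helper names `solve` and `levinson_general` are hypothetical. Note that each order still requires a full $m \times m$ solve for $\mathbf{b}_m$, which is exactly the limitation of the general (non-shift-invariant) case pointed out in Section 7.1.3.

```python
from fractions import Fraction as F

def solve(A, y):
    """Solve A x = y by Gauss-Jordan elimination (exact for Fraction entries)."""
    n = len(y)
    M = [list(row) + [rhs] for row, rhs in zip(A, y)]
    for i in range(n):
        p = next(r for r in range(i, n) if M[r][i] != 0)   # pivot row
        M[i], M[p] = M[p], M[i]
        M[i] = [v / M[i][i] for v in M[i]]
        for r in range(n):
            if r != i:
                f = M[r][i]
                M[r] = [vr - f * vi for vr, vi in zip(M[r], M[i])]
    return [row[n] for row in M]

def levinson_general(R, d):
    """Return [c_1, ..., c_M] using (7.1.32)-(7.1.34) for real symmetric R."""
    c = [d[0] / R[0][0]]                       # first-order solution
    solutions = [c]
    for m in range(1, len(d)):
        Rm = [row[:m] for row in R[:m]]
        r_b = [R[i][m] for i in range(m)]
        b = [-v for v in solve(Rm, r_b)]                          # (7.1.22)
        alpha = R[m][m] + sum(r * bi for r, bi in zip(r_b, b))    # (7.1.23)
        beta = sum(bi * di for bi, di in zip(b, d)) + d[m]        # (7.1.34)
        k = beta / alpha                                          # (7.1.33)
        c = [ci + bi * k for ci, bi in zip(c, b)] + [k]           # (7.1.32)
        solutions.append(c)
    return solutions

R3 = [[F(1), F(1, 2), F(1, 3)], [F(1, 2), F(1), F(1, 2)], [F(1, 3), F(1, 2), F(1)]]
d3 = [F(1), F(2), F(4)]
sols = levinson_general(R3, d3)
print(sols[-1])
```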

Matrix inversion and the linear system solution for $m = 1$ are trivial (scalar division only). If $\mathbf{R}_M$ is strictly positive definite, that is, $\mathbf{R}_m = \mathbf{R}_M^{\lceil m\rceil}$ is positive definite for all $1 \le m \le M$, the inverse matrices $\mathbf{R}_m^{-1}$ and the solutions of $\mathbf{R}_m\mathbf{c}_m = \mathbf{d}_m$, $2 \le m \le M$, can be determined using (7.1.22) and the Levinson recursion (7.1.32) for $m = 1, 2, \ldots, M-1$. However, in practice using the $\mathbf{LDL}^H$ decomposition provides a better method for performing these computations.

7.1.4 Order-Recursive Computation of the LDL^H Decomposition

We start by showing that the $\mathbf{LDL}^H$ decomposition can be computed in an order-recursive manner. The procedure is developed as part of a formal proof of the $\mathbf{LDL}^H$ decomposition using induction. For $M = 1$, the matrix $\mathbf{R}_1$ is a positive number $r_{11}$ and can be written uniquely in the form $r_{11} = 1 \cdot \xi_1 \cdot 1 > 0$. As we increment the order $m$, the $(m+1)$st-order principal submatrix of $\mathbf{R}_M$ can be partitioned as in (7.1.6). By the induction hypothesis, there are unique matrices $\mathbf{L}_m$ and $\mathbf{D}_m$ such that

$$\mathbf{R}_m = \mathbf{L}_m\mathbf{D}_m\mathbf{L}_m^H \tag{7.1.36}$$

We next form the matrices

$$\mathbf{L}_{m+1} = \begin{bmatrix} \mathbf{L}_m & \mathbf{0} \\ \mathbf{l}_m^H & 1 \end{bmatrix} \qquad \mathbf{D}_{m+1} = \begin{bmatrix} \mathbf{D}_m & \mathbf{0} \\ \mathbf{0}^H & \xi_{m+1} \end{bmatrix} \tag{7.1.37}$$

and try to determine the vector $\mathbf{l}_m$ and the positive number $\xi_{m+1}$ so that

$$\mathbf{R}_{m+1} = \mathbf{L}_{m+1}\mathbf{D}_{m+1}\mathbf{L}_{m+1}^H \tag{7.1.38}$$

Using (7.1.6) and (7.1.36) through (7.1.38), we see that

$$(\mathbf{L}_m\mathbf{D}_m)\mathbf{l}_m = \mathbf{r}_m^b \tag{7.1.39}$$

$$\rho_m^b = \mathbf{l}_m^H\mathbf{D}_m\mathbf{l}_m + \xi_{m+1}, \qquad \xi_{m+1} > 0 \tag{7.1.40}$$

Since

$$\det \mathbf{R}_m = \det \mathbf{L}_m \det \mathbf{D}_m \det \mathbf{L}_m^H = \xi_1\xi_2\cdots\xi_m > 0 \tag{7.1.41}$$

then $\det \mathbf{L}_m\mathbf{D}_m \neq 0$ and (7.1.39) has a unique solution $\mathbf{l}_m$. Finally, from (7.1.41) we obtain $\xi_{m+1} = \det \mathbf{R}_{m+1}/\det \mathbf{R}_m$, and therefore $\xi_{m+1} > 0$ because $\mathbf{R}_{m+1}$ is positive definite. Hence, $\xi_{m+1}$ is uniquely computed from (7.1.41), which completes the proof.

Because the triangular matrix $\mathbf{L}_m$ is generated row by row using (7.1.39) and because the diagonal elements of matrix $\mathbf{D}_m$ are computed sequentially using (7.1.40), both matrices have the optimum nesting property, that is, $\mathbf{L}_m = \mathbf{L}^{\lceil m\rceil}$, $\mathbf{D}_m = \mathbf{D}^{\lceil m\rceil}$. The optimum filter $\mathbf{c}_m$ is then computed by solving

$$\mathbf{L}_m\mathbf{D}_m\mathbf{k}_m \triangleq \mathbf{d}_m \tag{7.1.42}$$

$$\mathbf{L}_m^H\mathbf{c}_m = \mathbf{k}_m \tag{7.1.43}$$

Using (7.1.42), we can easily see that $\mathbf{k}_m$ has the optimum nesting property, that is, $\mathbf{k}_m = \mathbf{k}^{\lceil m\rceil}$ for $1 \le m \le M$. This is a consequence of the lower triangular form of $\mathbf{L}_m$. The computation of $\mathbf{L}_m$, $\mathbf{D}_m$, and $\mathbf{k}_m$ can be done in a simple, order-recursive manner, which is all that is needed to compute $\mathbf{c}_m$ for $1 \le m \le M$. However, the optimum estimator does not have the optimum nesting property, that is, $\mathbf{c}_{m+1}^{\lceil m\rceil} \neq \mathbf{c}_m$, because of the backward substitution involved in the solution of the upper triangular system (7.1.43) (see Example 6.3.1). Using (7.1.42) and (7.1.43), we can write the MMSE for the $m$th-order linear estimator as

$$P_m = P_y - \mathbf{c}_m^H\mathbf{d}_m = P_y - \mathbf{k}_m^H\mathbf{D}_m\mathbf{k}_m \tag{7.1.44}$$

which, owing to the optimum nesting property of $\mathbf{D}_m$ and $\mathbf{k}_m$, leads to

$$P_m = P_{m-1} - \xi_m|k_m|^2 \tag{7.1.45}$$

which is initialized with $P_0 = P_y$. Equation (7.1.45) provides an order-recursive algorithm for the computation of the MMSE.
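A minimal sketch of this order-recursive procedure is given below, in Python with exact fractions and assuming real-valued data. The function name `ldl_order_recursive` and the value $P_y = 20$ are hypothetical choices made for illustration; the matrix and cross-correlation vector are those of Example 7.1.2.

```python
from fractions import Fraction as F

def ldl_order_recursive(R, d, Py):
    """Order-recursive LDL^H (real symmetric R) together with k_m and the MMSE P_m."""
    L = [[F(1)]]                 # rows of the unit lower triangular factor
    xi = [R[0][0]]               # diagonal of D
    k = [d[0] / xi[0]]
    P = [Py, Py - xi[0] * k[0] ** 2]
    for m in range(1, len(d)):
        # Forward substitution for (L_m D_m) l_m = r_m^b          (7.1.39)
        l = []
        for i in range(m):
            s = R[i][m] - sum(L[i][j] * xi[j] * l[j] for j in range(i))
            l.append(s / xi[i])
        # xi_{m+1} = rho_m^b - l_m^H D_m l_m                       (7.1.40)
        xi_new = R[m][m] - sum(lj * lj * xj for lj, xj in zip(l, xi))
        if xi_new <= 0:
            raise ValueError("R is not positive definite")
        # Last row of (7.1.42): l^H D k + xi_new * k_new = d_{m+1}
        k_new = (d[m] - sum(lj * xj * kj for lj, xj, kj in zip(l, xi, k))) / xi_new
        L.append(l + [F(1)])
        xi.append(xi_new)
        k.append(k_new)
        P.append(P[-1] - xi_new * k_new ** 2)                     # (7.1.45)
    return L, xi, k, P

R3 = [[F(1), F(1, 2), F(1, 3)], [F(1, 2), F(1), F(1, 2)], [F(1, 3), F(1, 2), F(1)]]
d3 = [F(1), F(2), F(4)]
L, xi, k, P = ldl_order_recursive(R3, d3, Py=F(20))   # Py = 20 is a hypothetical value
print(L, xi, k, P)
```

Increasing the order only appends a row to `L` and one entry to `xi`, `k`, and `P`; nothing computed at lower orders is revisited, which is the optimum nesting property in action.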

7.1.5 Order-Recursive Computation of the Optimum Estimate

The computation of the optimum linear estimate $\hat{y}_m = \mathbf{c}_m^H\mathbf{x}_m$, using a linear combiner, requires $m$ multiplications and $m-1$ additions. Therefore, if we want to compute $\hat{y}_m$, for $1 \le m \le M$, we need $M$ linear combiners and hence $M(M+1)/2$ operations. We next provide an alternative, more efficient order-recursive implementation that exploits the triangular decomposition of $\mathbf{R}_{m+1}$. We first notice that using (7.1.43), we obtain

$$\hat{y}_m = \mathbf{c}_m^H\mathbf{x}_m = (\mathbf{k}_m^H\mathbf{L}_m^{-1})\mathbf{x}_m = \mathbf{k}_m^H(\mathbf{L}_m^{-1}\mathbf{x}_m) \tag{7.1.46}$$

Next, we define vector $\mathbf{w}_m$ as

$$\mathbf{L}_m\mathbf{w}_m \triangleq \mathbf{x}_m \tag{7.1.47}$$

which can be found by using forward substitution in order to solve the triangular system. Therefore, we obtain

$$\hat{y}_m = \mathbf{k}_m^H\mathbf{w}_m = \sum_{i=1}^{m} k_i^* w_i \tag{7.1.48}$$

which provides the estimate $\hat{y}_m$ in terms of $\mathbf{k}_m$ and $\mathbf{w}_m$, that is, without using the estimator vector $\mathbf{c}_m$. Hence, if the ultimate goal is the computation of $\hat{y}_m$ we do not need to compute the estimator $\mathbf{c}_m$. For an order-recursive algorithm to be possible, the vector $\mathbf{w}_m$ must have the optimum nesting property, that is, $\mathbf{w}_m = \mathbf{w}_{m+1}^{\lceil m\rceil}$. Indeed, using (7.1.37) and the matrix inversion by partitioning lemma for nonsymmetric matrices (see Problem 7.3), we obtain

$$\mathbf{L}_{m+1}^{-1} = \begin{bmatrix} \mathbf{L}_m & \mathbf{0} \\ \mathbf{l}_m^H & 1 \end{bmatrix}^{-1} = \begin{bmatrix} \mathbf{L}_m^{-1} & \mathbf{0} \\ \mathbf{v}_m^H & 1 \end{bmatrix}$$

where

$$\mathbf{v}_m = -\mathbf{L}_m^{-H}\mathbf{l}_m = -(\mathbf{L}_m^H)^{-1}\mathbf{D}_m^{-1}\mathbf{L}_m^{-1}\mathbf{r}_m^b = -\mathbf{R}_m^{-1}\mathbf{r}_m^b = \mathbf{b}_m$$

due to (7.1.22). Therefore,

$$\mathbf{w}_{m+1} = \mathbf{L}_{m+1}^{-1}\mathbf{x}_{m+1} = \begin{bmatrix} \mathbf{L}_m^{-1} & \mathbf{0} \\ \mathbf{b}_m^H & 1 \end{bmatrix}\begin{bmatrix} \mathbf{x}_m \\ x_{m+1} \end{bmatrix} = \begin{bmatrix} \mathbf{w}_m \\ w_{m+1} \end{bmatrix} \tag{7.1.49}$$

where

$$w_{m+1} = \mathbf{b}_m^H\mathbf{x}_m + x_{m+1} = e_m^b \tag{7.1.50}$$

from (7.1.29). In this case, we can derive order-recursive algorithms for the computation of $\hat{y}_m$ and $e_m$, for all $1 \le m \le M$. Indeed, using (7.1.48) and (7.1.49), we obtain

$$\hat{y}_m = \hat{y}_{m-1} + k_m^* w_m \tag{7.1.51}$$

with $\hat{y}_0 = 0$. From (7.1.51) and $e_m = y - \hat{y}_m$, we have

$$e_m = e_{m-1} - k_m^* w_m \tag{7.1.52}$$

for $m = 1, 2, \ldots, M$ with $e_0 = y$. The quantity $w_m$ can be computed in an order-recursive manner by solving (7.1.47) using forward substitution. Indeed, from the $m$th row of (7.1.47) we obtain

$$w_m = x_m - \sum_{i=1}^{m-1} l_{i-1}^{(m-1)} w_i \tag{7.1.53}$$

which provides a recursive computation of $w_m$ for $m = 1, 2, \ldots, M$. To comply with the order-oriented notation, we use $l_{i-1}^{(m-1)}$ instead of $l_{m-1,i-1}$. Depending on the application, we use either (7.1.51) or (7.1.52). For MMSE estimation, all the quantities are functions of the time index $n$, and therefore, the triangular decomposition of $\mathbf{R}_m$ and the recursions (7.1.51) through (7.1.53) should be repeated for every new set of observations $y(n)$ and $\mathbf{x}(n)$.

EXAMPLE 7.1.3. A linear estimator is specified by the correlation matrix $\mathbf{R}_4$ and the cross-correlation vector $\mathbf{d}_4$ in Example 6.3.2. Compute the estimates $\hat{y}_m$, $1 \le m \le 4$, if the input data vector is given by $\mathbf{x}_4 = [1\; 2\; 1\; -1]^T$.

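Solution. Using the triangular factor $\mathbf{L}_4$ and the vector $\mathbf{k}_4$ found in Example 6.3.2 and (7.1.53), we find $\mathbf{w}_4 = [1\; -1\; 3\; -8]^T$ and

$$\hat{y}_1 = 1 \qquad \hat{y}_2 = \tfrac43 \qquad \hat{y}_3 = 6.6 \qquad \hat{y}_4 = 14.6$$

which the reader can verify by computing $\mathbf{c}_m$ and $\hat{y}_m = \mathbf{c}_m^T\mathbf{x}_m$, $1 \le m \le 4$.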
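The recursions (7.1.51) and (7.1.53) translate almost line by line into code. The sketch below is a Python illustration under stated assumptions: the data are real-valued, the factor `L` and the vector `k` are those belonging to the matrix $\mathbf{R}_3$ and vector $\mathbf{d}_3$ of Example 7.1.2 (not the data of Example 6.3.2), and the observation vector `x` is a hypothetical choice.

```python
from fractions import Fraction as F

def order_recursive_estimates(L, k, x):
    """Return (w, [yhat_1, ..., yhat_M]) for real data, using (7.1.53) and (7.1.51)."""
    w, yhat, acc = [], [], F(0)          # acc holds yhat_{m-1}, with yhat_0 = 0
    for m in range(len(x)):
        # Innovation: row m of L w = x solved by forward substitution   (7.1.53)
        w_m = x[m] - sum(L[m][i] * w[i] for i in range(m))
        w.append(w_m)
        acc = acc + k[m] * w_m                                         # (7.1.51)
        yhat.append(acc)
    return w, yhat

L = [[F(1), F(0), F(0)], [F(1, 2), F(1), F(0)], [F(1, 3), F(4, 9), F(1)]]
k = [F(1), F(2), F(81, 20)]
x = [F(1), F(2), F(3)]                   # hypothetical observation vector
w, yhat = order_recursive_estimates(L, k, x)
print(w, yhat)
```

All estimates $\hat{y}_1, \ldots, \hat{y}_M$ come out of a single pass, and the estimator vectors $\mathbf{c}_m$ are never formed.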

If we compute the matrix

$$\mathbf{B}_{m+1} \triangleq \mathbf{L}_{m+1}^{-1} = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ b_0^{(1)} & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ b_0^{(m)} & b_1^{(m)} & \cdots & 1 \end{bmatrix} \tag{7.1.54}$$

then (7.1.49) can be written as

$$\mathbf{w}_{m+1} = \mathbf{e}_{m+1}^b = \mathbf{B}_{m+1}\mathbf{x}_{m+1} \tag{7.1.55}$$

where

$$\mathbf{e}_{m+1}^b \triangleq [e_0^b\; e_1^b\; \cdots\; e_m^b]^T \tag{7.1.56}$$

is the BLP error vector. From (7.1.22), we can easily see that the rows of $\mathbf{B}_{m+1}$ are formed by the optimum estimators $\mathbf{b}_m$ of $x_{m+1}$ from $\mathbf{x}_m$. Note that the elements of matrix $\mathbf{B}_{m+1}$ are denoted by using the order-oriented notation $b_i^{(m)}$ introduced in Section 7.1 rather than the conventional $b_{mi}$ matrix notation. Equation (7.1.55) provides an alternative computation of $\mathbf{w}_{m+1}$ as a matrix-vector multiplication. Each component of $\mathbf{w}_{m+1}$ can be computed independently, and hence in parallel, by the formula

$$w_j = x_j + \sum_{i=1}^{j-1} b_{i-1}^{(j-1)*} x_i \qquad 1 \le j \le m \tag{7.1.57}$$

which, in contrast to (7.1.53), is nonrecursive. Using (7.1.57) and (7.1.51), we can derive the order-recursive MMSE estimator implementation shown in Figure 7.1.

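[Figure 7.1: block diagram. A triangular decorrelator, built from basic processing elements $y_{\text{out}} = b\,x_{\text{in}} + a_{\text{in}}$ with coefficients $b_i^{(m)*}$, maps the inputs $x_1, \ldots, x_4$ to the innovations $w_1, \ldots, w_4$; a linear combiner with coefficients $k_1^*, \ldots, k_4^*$ forms the outputs $\hat{y}_1, \ldots, \hat{y}_4$. The coefficients follow from the second-order moments $\mathbf{R}$ and $\mathbf{d}$ through $\mathbf{R} = \mathbf{LDL}^H$, $\mathbf{B} = \mathbf{L}^{-1}$, and $\mathbf{k} = \mathbf{D}^{-1}\mathbf{Bd}$.]

FIGURE 7.1 Orthogonal order-recursive structure for linear MMSE estimation.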

Finally, we notice that matrix $\mathbf{B}_m$ provides the $\mathbf{UDU}^H$ decomposition of the inverse of the correlation matrix $\mathbf{R}_m$. Indeed, from (7.1.36) we obtain

$$\mathbf{R}_m^{-1} = (\mathbf{L}_m^H)^{-1}\mathbf{D}_m^{-1}\mathbf{L}_m^{-1} = \mathbf{B}_m^H\mathbf{D}_m^{-1}\mathbf{B}_m \tag{7.1.58}$$

because inversion and transposition are interchangeable and the $\mathbf{UDU}^H$ decomposition is unique. This formula provides a practical method to compute the inverse of the correlation matrix by using the $\mathbf{LDL}^H$ decomposition because computing the inverse of a triangular matrix is simple (see Problem 7.5).

7.2 INTERPRETATIONS OF ALGORITHMIC QUANTITIES

We next show that various intermediate quantities that appear in the linear MMSE estimation algorithms have physical and statistical interpretations that, besides their intellectual value, facilitate better understanding of the operation, performance, and numerical properties of the algorithms.

7.2.1 Innovations and Backward Prediction

The correlation matrix of $\mathbf{w}_m$ is

$$E\{\mathbf{w}_m\mathbf{w}_m^H\} = \mathbf{L}_m^{-1}E\{\mathbf{x}_m\mathbf{x}_m^H\}\mathbf{L}_m^{-H} = \mathbf{D}_m \tag{7.2.1}$$

where we have used (7.1.47) and the triangular decomposition (7.1.36). Therefore, the components of $\mathbf{w}_m$ are uncorrelated random variables with variances

$$\xi_i = E\{|w_i|^2\} \tag{7.2.2}$$

since $\xi_i \ge 0$. Furthermore, the two sets of random variables $\{w_1, w_2, \ldots, w_M\}$ and $\{x_1, x_2, \ldots, x_M\}$ are linearly equivalent because they can be obtained from each other through the linear transformation (7.1.47). This transformation removes all the redundant correlation among the components of $\mathbf{x}$ and is known as a decorrelation or whitening operation (see Section 3.5.2). Because the random variables $w_i$ are uncorrelated, each of them adds "new information" or innovation. In this sense, $\{w_1, w_2, \ldots, w_m\}$ is the innovations representation of the random variables $\{x_1, x_2, \ldots, x_m\}$. Because $\mathbf{x}_m = \mathbf{L}_m\mathbf{w}_m$, the random vector $\mathbf{w}_m = \mathbf{e}_m^b$ is the innovations representation, and $\mathbf{x}_m$ and $\mathbf{w}_m$ are linearly equivalent as well (see Section 3.5). The cross-correlation matrix between $\mathbf{x}_m$ and $\mathbf{w}_m$ is

$$E\{\mathbf{x}_m\mathbf{w}_m^H\} = E\{\mathbf{L}_m\mathbf{w}_m\mathbf{w}_m^H\} = \mathbf{L}_m\mathbf{D}_m \tag{7.2.3}$$

which shows that, owing to the lower triangular form of $\mathbf{L}_m$, $E\{x_i w_j^*\} = 0$ for $j > i$. We will see in Section 7.6 that these factors are related to the gapped functions and the algorithm of Schur.

Furthermore, since $e_m^b = w_{m+1}$, from (7.1.50) we have

$$P_m^b = \xi_{m+1} = E\{|w_{m+1}|^2\}$$

which also can be shown algebraically by using (7.1.41), (7.1.40), and (7.1.30). Indeed, we have

$$\xi_{m+1} = \frac{\det \mathbf{R}_{m+1}}{\det \mathbf{R}_m} = \rho_m^b - \mathbf{l}_m^H\mathbf{D}_m\mathbf{l}_m = \rho_m^b - \mathbf{r}_m^{bH}\mathbf{R}_m^{-1}\mathbf{r}_m^b = P_m^b \tag{7.2.4}$$

and, therefore,

$$\mathbf{D}_m = \mathrm{diag}\{P_0^b, P_1^b, \ldots, P_{m-1}^b\} \tag{7.2.5}$$

7.2.2 Partial Correlation

In general, the random variables $y, x_1, \ldots, x_m, x_{m+1}$ are correlated. The correlation between $y$ and $x_{m+1}$, after the influence from the components of the vector $\mathbf{x}_m$ has been removed, is known as partial correlation. To remove the correlation due to $\mathbf{x}_m$, we extract from $y$ and $x_{m+1}$ the components that can be predicted from $\mathbf{x}_m$. The remaining correlation is from the estimation errors $e_m$ and $e_m^b$, which are both uncorrelated with $\mathbf{x}_m$ because of the orthogonality principle. Therefore, the partial correlation of $y$ and $x_{m+1}$ is

$$\begin{aligned} \mathrm{PARCOR}(y; x_{m+1}) &\triangleq E\{e_m e_m^{b*}\} = E\{(y - \mathbf{c}_m^H\mathbf{x}_m)e_m^{b*}\} \\ &= E\{y e_m^{b*}\} = E\{y(x_{m+1}^* + \mathbf{x}_m^H\mathbf{b}_m)\} \\ &= E\{y x_{m+1}^*\} + E\{y\mathbf{x}_m^H\}\mathbf{b}_m \\ &= d_{m+1}^* + \mathbf{d}_m^H\mathbf{b}_m \triangleq \beta_m^{c*} \end{aligned} \tag{7.2.6}$$

where we have used the orthogonality principle $E\{\mathbf{x}_m e_m^{b*}\} = \mathbf{0}$ and (7.1.10), (7.1.50), and (7.1.34). The partial correlation $\mathrm{PARCOR}(y; x_{m+1})$ is also related to the parameters $k_m$ obtained from the $\mathbf{LDL}^H$ decomposition. Indeed, from (7.1.42) and (7.1.54), we obtain the relation

$$\mathbf{k}_{m+1} = \mathbf{D}_{m+1}^{-1}\mathbf{B}_{m+1}\mathbf{d}_{m+1} \tag{7.2.7}$$

whose last row is

$$k_{m+1} = \frac{\mathbf{b}_m^H\mathbf{d}_m + d_{m+1}}{\xi_{m+1}} = \frac{\beta_m^c}{P_m^b} = k_m^c \tag{7.2.8}$$

owing to (7.2.4) and (7.2.6).

EXAMPLE 7.2.1. The $\mathbf{LDL}^H$ decomposition of matrix $\mathbf{R}_3$ in Example 7.1.2 is given by

$$\mathbf{L} = \begin{bmatrix} 1 & 0 & 0 \\ \tfrac12 & 1 & 0 \\ \tfrac13 & \tfrac49 & 1 \end{bmatrix} \qquad \mathbf{D} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \tfrac34 & 0 \\ 0 & 0 & \tfrac{20}{27} \end{bmatrix}$$

and can be found by using the function [L,D]=ldlt(R). Comparison with the results obtained in Example 7.1.2 shows that the rows of the matrix

$$\mathbf{L}^{-1} = \begin{bmatrix} 1 & 0 & 0 \\ -\tfrac12 & 1 & 0 \\ -\tfrac19 & -\tfrac49 & 1 \end{bmatrix}$$

provide the elements of the backward predictors, whereas the diagonal elements of $\mathbf{D}$ are equal to the scalars $\alpha_m^b$. Using (7.2.7), we obtain $\mathbf{k} = [1\; 2\; \tfrac{81}{20}]^T$ whose elements are the quantities $k_0^c$, $k_1^c$, and $k_2^c$ computed in Example 7.1.2 using the Levinson recursion.
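The relation (7.2.7) can be checked on the numbers of Example 7.2.1 with a short sketch. The code below is Python with exact fractions, assumes real-valued data, and uses a hypothetical helper name `unit_lower_inverse`; the vector $\mathbf{d}_3 = [1\; 2\; 4]^T$ is the one from Example 7.1.2.

```python
from fractions import Fraction as F

def unit_lower_inverse(L):
    """Invert a unit lower triangular matrix row by row (forward substitution)."""
    n = len(L)
    B = [[F(int(i == j)) for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in range(i):
            # (L B)[i][j] = 0 for j < i, and L[i][i] = 1
            B[i][j] = -sum(L[i][p] * B[p][j] for p in range(j, i))
    return B

L = [[F(1), F(0), F(0)], [F(1, 2), F(1), F(0)], [F(1, 3), F(4, 9), F(1)]]
xi = [F(1), F(3, 4), F(20, 27)]          # diagonal of D
d = [F(1), F(2), F(4)]

B = unit_lower_inverse(L)                # rows hold the backward predictors
# k = D^{-1} B d                                                   (7.2.7)
k = [sum(B[i][j] * d[j] for j in range(3)) / xi[i] for i in range(3)]
print(B, k)
```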

7.2.3 Order Decomposition of the Optimum Estimate

The equation $\hat{y}_{m+1} = \hat{y}_m + k_{m+1}^* w_{m+1}$, with $k_{m+1} = \beta_m^c/P_m^b = k_m^c$, shows that the improvement in the estimate when we include one more observation $x_{m+1}$, that is, when we increase the order by 1, is proportional to the innovation $w_{m+1}$ contained in $x_{m+1}$. The innovation is the part of $x_{m+1}$ that cannot be linearly estimated from the already used data $\mathbf{x}_m$. The term $w_{m+1}$ is scaled by the ratio of the partial correlation between $y$ and the "new" observation $x_{m+1}$ and the power of the innovation $P_m^b$.

Thus, the computation of the $(m+1)$st-order estimate of $y$ based on $\mathbf{x}_{m+1} = [\mathbf{x}_m^T\; x_{m+1}]^T$ can be reduced to two $m$th-order estimation problems: the estimation of $y$ based on $\mathbf{x}_m$ and the estimation of the new observation $x_{m+1}$ based on $\mathbf{x}_m$. This decomposition of linear estimation problems into smaller ones has very important applications to the development of efficient algorithms and structures for MMSE estimation.

We use the term direct for the implementation of the MMSE linear combiner as a sum of products, involving the optimum parameters $c_i^{(m)}$, $1 \le i \le m$, to emphasize the direct use of these coefficients. Because the random variables $w_i$ used in the implementation of Figure 7.1 are orthogonal, that is, $\langle w_i, w_j\rangle = 0$ for $i \neq j$, we refer to this implementation as the orthogonal implementation or the orthogonal structure. These two structures appear in every type of linear MMSE estimation problem, and their particular form depends on the specifics of the problem and the associated second-order moments. In this sense, they play a prominent role in linear MMSE estimation in general, and in this book in particular. We conclude our discussion with the following important observations:

1. The direct implementation combines correlated, that is, redundant information, and it is not order-recursive because increasing the order of the estimator destroys the optimality of the existing coefficients. Again, the reason is that the direct-form optimum filter coefficients do not possess the optimal nesting property.
2. The orthogonal implementation consists of a decorrelator and a linear combiner. The estimator combines the innovations of the data (nonredundant information) and is order-recursive because it does not use the optimum coefficient vector. Hence, increasing the order of the estimator preserves the optimality of the existing lower-order part. The resulting structure is modular such that each additional term improves the estimate by an amount proportional to the included innovation $w_m$.
3. Using the vector interpretation of random variables, the transformation $\tilde{\mathbf{x}}_m = \mathbf{F}_m\mathbf{x}_m$ is just a change of basis. The choice $\mathbf{F}_m = \mathbf{L}_m^{-1}$ converts from the oblique set $\{x_1, x_2, \ldots, x_m\}$ to the orthogonal basis $\{w_1, w_2, \ldots, w_m\}$. The advantage of working with orthogonal bases is that adding new components does not affect the optimality of previous ones.
4. The $\mathbf{LDL}^H$ decomposition for random vectors is the matrix equivalent of the spectral factorization theorem for discrete-time, stationary, stochastic processes. Both approaches facilitate the design and implementation of optimum FIR and IIR filters (see Sections 6.4 and 6.6).

7.2.4 Gram-Schmidt Orthogonalization

We next combine the geometric interpretation of the random variables with the Gram-Schmidt procedure used in linear algebra. The Gram-Schmidt procedure produces the innovations $\{w_1, w_2, \ldots, w_m\}$ by orthogonalizing the original set $\{x_1, x_2, \ldots, x_m\}$. We start by choosing $w_1$ to be in the direction of $x_1$, that is,

$$w_1 = x_1$$

The next "vector" $w_2$ should be orthogonal to $w_1$. To determine $w_2$, we subtract from $x_2$ its component along $w_1$ [see Figure 7.2(a)], that is,

$$w_2 = x_2 - l_0^{(1)}w_1$$

where $l_0^{(1)}$ is obtained from the condition $w_2 \perp w_1$ as follows:

$$\langle w_2, w_1\rangle = \langle x_2, w_1\rangle - l_0^{(1)}\langle w_1, w_1\rangle = 0$$

or

$$l_0^{(1)} = \frac{\langle x_2, w_1\rangle}{\langle w_1, w_1\rangle}$$

[Figure 7.2: vector diagrams. Panel (a), $m = 2$: $w_1 = x_1$, and $w_2$ is obtained from $x_2$ by removing its component $l_0^{(1)}w_1$. Panel (b), $m = 3$: $w_3$ is obtained from $x_3$ by removing its components $l_0^{(2)}w_1$ and $l_1^{(2)}w_2$.]

FIGURE 7.2 Illustration of the Gram-Schmidt orthogonalization process.

Similarly, to determine $w_3$, we subtract from $x_3$ its components along $w_1$ and $w_2$, that is,

$$w_3 = x_3 - l_0^{(2)}w_1 - l_1^{(2)}w_2$$

as illustrated in Figure 7.2(b). Using the conditions $w_3 \perp w_1$ and $w_3 \perp w_2$, we can easily see that

$$l_0^{(2)} = \frac{\langle x_3, w_1\rangle}{\langle w_1, w_1\rangle} \qquad l_1^{(2)} = \frac{\langle x_3, w_2\rangle}{\langle w_2, w_2\rangle}$$

This approach leads to the following classical Gram-Schmidt algorithm:

• Define $w_1 = x_1$.
• For $2 \le m \le M$, compute

$$w_m = x_m - l_0^{(m-1)}w_1 - \cdots - l_{m-2}^{(m-1)}w_{m-1} \tag{7.2.9}$$

where

$$l_{i-1}^{(m-1)} = \frac{\langle x_m, w_i\rangle}{\langle w_i, w_i\rangle} \qquad 1 \le i \le m-1 \tag{7.2.10}$$

assuming that $\langle w_i, w_i\rangle \neq 0$.

From the derivation of the algorithm it should be clear that the sets $\{x_1, \ldots, x_m\}$ and $\{w_1, \ldots, w_m\}$ are linearly equivalent for $m = 1, 2, \ldots, M$. Using (7.2.9), we obtain

$$\mathbf{x}_m = \mathbf{L}_m\mathbf{w}_m \tag{7.2.11}$$

where

$$\mathbf{L}_m \triangleq \begin{bmatrix} 1 & 0 & \cdots & 0 \\ l_0^{(1)} & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ l_0^{(m-1)} & l_1^{(m-1)} & \cdots & 1 \end{bmatrix} \tag{7.2.12}$$

is a unit lower triangular matrix. Since, by construction, the components of $\mathbf{w}_m$ are uncorrelated, its correlation matrix $\mathbf{D}_m$ is diagonal with elements $\xi_i = E\{|w_i|^2\}$. Using (7.2.11), we obtain

$$\mathbf{R}_m = E\{\mathbf{x}_m\mathbf{x}_m^H\} = \mathbf{L}_m E\{\mathbf{w}_m\mathbf{w}_m^H\}\mathbf{L}_m^H = \mathbf{L}_m\mathbf{D}_m\mathbf{L}_m^H \tag{7.2.13}$$

which is precisely the unique $\mathbf{LDL}^H$ decomposition of the correlation matrix $\mathbf{R}_m$. Therefore, the Gram-Schmidt orthogonalization of the data vector $\mathbf{x}_m$ provides an alternative approach to obtain the $\mathbf{LDL}^H$ decomposition of its correlation matrix $\mathbf{R}_m = E\{\mathbf{x}_m\mathbf{x}_m^H\}$.
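To make the equivalence concrete, the classical Gram-Schmidt algorithm can be run directly on the correlation matrix, since the inner product of two random variables is $\langle x_i, x_j\rangle = E\{x_i x_j^*\}$, an entry of $\mathbf{R}$. The following is a sketch in Python under stated assumptions: real-valued variables, each $w_m$ stored by its coordinates in the basis $\{x_1, \ldots, x_M\}$, and a hypothetical function name `gram_schmidt_ldl`. The test matrix is $\mathbf{R}_3$ of Example 7.1.1.

```python
from fractions import Fraction as F

def gram_schmidt_ldl(R):
    """Classical Gram-Schmidt on x_1..x_M with <x_i, x_j> = R[i][j] (real case).

    Returns (L, xi, W): the unit lower triangular factor, the variances
    xi_i = <w_i, w_i>, and the coordinates of each w_m in the x basis.
    """
    M = len(R)

    def inner(u, v):
        # <u, v> for u = sum_i u[i] x_i and v = sum_j v[j] x_j
        return sum(u[i] * R[i][j] * v[j] for i in range(M) for j in range(M))

    W, L, xi = [], [], []
    for m in range(M):
        x_m = [F(int(i == m)) for i in range(M)]       # coordinates of x_{m+1}
        # Projection coefficients <x_m, w_i> / <w_i, w_i>            (7.2.10)
        row = [inner(x_m, W[i]) / xi[i] for i in range(m)]
        # Remove the components along the earlier w_i                (7.2.9)
        w_m = [x_m[c] - sum(row[i] * W[i][c] for i in range(m)) for c in range(M)]
        W.append(w_m)
        xi.append(inner(w_m, w_m))
        L.append(row + [F(1)] + [F(0)] * (M - m - 1))
    return L, xi, W

R3 = [[F(1), F(1, 2), F(1, 3)], [F(1, 2), F(1), F(1, 2)], [F(1, 3), F(1, 2), F(1)]]
L, xi, W = gram_schmidt_ldl(R3)
print(L, xi)
```

The coordinate rows in `W` are the rows of $\mathbf{L}^{-1}$, that is, the backward predictors followed by a one.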

7.3 ORDER-RECURSIVE ALGORITHMS FOR OPTIMUM FIR FILTERS

The key difference between a linear combiner and an FIR filter is the nature of the input data vector. The input data vector for FIR filters consists of consecutive samples from the same discrete-time stochastic process, that is,

$$\mathbf{x}_m(n) = [x(n)\; x(n-1)\; \cdots\; x(n-m+1)]^T \tag{7.3.1}$$

instead of samples from $m$ different processes $x_i(n)$. This shift invariance of the input data vector allows for the development of simpler, order-recursive algorithms and structures for optimum FIR filtering and prediction compared to those for general linear estimation. Furthermore, the quest for order-recursive algorithms leads to a natural, elegant, and unavoidable interconnection between optimum filtering and the BLP and FLP problems.

We start with the following upper and lower partitioning of the input data vector

$$\mathbf{x}_{m+1}(n) = \begin{bmatrix} x(n) \\ x(n-1) \\ \vdots \\ x(n-m+1) \\ x(n-m) \end{bmatrix} = \begin{bmatrix} \mathbf{x}_m(n) \\ x(n-m) \end{bmatrix} = \begin{bmatrix} x(n) \\ \mathbf{x}_m(n-1) \end{bmatrix} \tag{7.3.2}$$

which shows that $\mathbf{x}_{m+1}^{\lceil m\rceil}(n)$ and $\mathbf{x}_{m+1}^{\lfloor m\rfloor}(n)$ are simply shifted versions (by one sample delay) of the same vector $\mathbf{x}_m(n)$. The shift invariance of $\mathbf{x}_{m+1}(n)$ results in an analogous shift invariance for the correlation matrix $\mathbf{R}_{m+1}(n) = E\{\mathbf{x}_{m+1}(n)\mathbf{x}_{m+1}^H(n)\}$. Indeed, we can easily show that the upper-lower partitioning of the correlation matrix is

$$\mathbf{R}_{m+1}(n) = \begin{bmatrix} \mathbf{R}_m(n) & \mathbf{r}_m^b(n) \\ \mathbf{r}_m^{bH}(n) & P_x(n-m) \end{bmatrix} \tag{7.3.3}$$

and the lower-upper partitioning is

$$\mathbf{R}_{m+1}(n) = \begin{bmatrix} P_x(n) & \mathbf{r}_m^{fH}(n) \\ \mathbf{r}_m^f(n) & \mathbf{R}_m(n-1) \end{bmatrix} \tag{7.3.4}$$

where

$$\mathbf{r}_m^b(n) = E\{\mathbf{x}_m(n)x^*(n-m)\} \tag{7.3.5}$$

$$\mathbf{r}_m^f(n) = E\{\mathbf{x}_m(n-1)x^*(n)\} \tag{7.3.6}$$

$$P_x(n) = E\{|x(n)|^2\} \tag{7.3.7}$$

We note that, in contrast to the general case (7.1.5) where the matrix $\mathbf{R}_m^f(n) = \mathbf{R}_{m+1}^{\lfloor m\rfloor}(n)$ is unrelated to $\mathbf{R}_m(n)$, here the matrix $\mathbf{R}_{m+1}^{\lfloor m\rfloor}(n) = \mathbf{R}_m(n-1)$. This is a by-product of the shift-invariance property of the input data vector and takes the development of order-recursive algorithms one step further. We begin our pursuit of an order-recursive algorithm with the development of a Levinson order recursion for the optimum FIR filter coefficients.

7.3.1 Order-Recursive Computation of the Optimum Filter

Suppose that at time $n$ we have already computed the optimum FIR filter $\mathbf{c}_m(n)$ specified by

$$\mathbf{c}_m(n) = \mathbf{R}_m^{-1}(n)\mathbf{d}_m(n) \tag{7.3.8}$$

and the MMSE is

$$P_m^c(n) = P_y(n) - \mathbf{d}_m^H(n)\mathbf{c}_m(n) \tag{7.3.9}$$

where

$$\mathbf{d}_m(n) = E\{\mathbf{x}_m(n)y^*(n)\} \tag{7.3.10}$$

We wish to compute the optimum filter

$$\mathbf{c}_{m+1}(n) = \mathbf{R}_{m+1}^{-1}(n)\mathbf{d}_{m+1}(n)$$

by modifying $\mathbf{c}_m(n)$ using an order-recursive algorithm. From (7.3.3), we see that matrix $\mathbf{R}_{m+1}(n)$ has the optimum nesting property. Using the upper partitioning in (7.3.2), we obtain

$$\mathbf{d}_{m+1}(n) = E\left\{\begin{bmatrix} \mathbf{x}_m(n) \\ x(n-m) \end{bmatrix}y^*(n)\right\} = \begin{bmatrix} \mathbf{d}_m(n) \\ d_{m+1}(n) \end{bmatrix} \tag{7.3.11}$$

which shows that $\mathbf{d}_{m+1}(n)$ also has the optimum nesting property. Therefore, we can develop a Levinson order recursion using the upper left matrix inversion by partitioning lemma

$$\mathbf{R}_{m+1}^{-1}(n) = \begin{bmatrix} \mathbf{R}_m^{-1}(n) & \mathbf{0} \\ \mathbf{0}^T & 0 \end{bmatrix} + \frac{1}{P_m^b(n)}\begin{bmatrix} \mathbf{b}_m(n) \\ 1 \end{bmatrix}\begin{bmatrix} \mathbf{b}_m^H(n) & 1 \end{bmatrix} \tag{7.3.12}$$

where

$$\mathbf{b}_m(n) = -\mathbf{R}_m^{-1}(n)\mathbf{r}_m^b(n) \tag{7.3.13}$$

is the optimum BLP, and

$$P_m^b(n) = \frac{\det \mathbf{R}_{m+1}(n)}{\det \mathbf{R}_m(n)} = P_x(n-m) + \mathbf{r}_m^{bH}(n)\mathbf{b}_m(n) \tag{7.3.14}$$

is the corresponding MMSE. Equations (7.3.12) through (7.3.14) follow easily from (7.1.22), (7.1.23), and (7.1.24). It is interesting to note that $\mathbf{b}_m(n)$ is the optimum estimator for the additional observation $x(n-m)$ used by the optimum filter $\mathbf{c}_{m+1}(n)$. Substituting (7.3.11) and (7.3.12) into (7.3.8), we obtain

$$\mathbf{c}_{m+1}(n) = \begin{bmatrix} \mathbf{c}_m(n) \\ 0 \end{bmatrix} + \begin{bmatrix} \mathbf{b}_m(n) \\ 1 \end{bmatrix}k_m^c(n) \tag{7.3.15}$$

where

$$k_m^c(n) \triangleq \frac{\beta_m^c(n)}{P_m^b(n)} \tag{7.3.16}$$

and

$$\beta_m^c(n) \triangleq \mathbf{b}_m^H(n)\mathbf{d}_m(n) + d_{m+1}(n) \tag{7.3.17}$$

Thus, if we know the BLP $\mathbf{b}_m(n)$, we can determine $\mathbf{c}_{m+1}(n)$ by using the Levinson recursion in (7.3.15).

Levinson recursion for the backward predictor. For the order recursion in (7.3.15) to be useful, we need an order recursion for the BLP $\mathbf{b}_m(n)$. This is possible if the linear systems

$$\mathbf{R}_m(n)\mathbf{b}_m(n) = -\mathbf{r}_m^b(n) \qquad \mathbf{R}_{m+1}(n)\mathbf{b}_{m+1}(n) = -\mathbf{r}_{m+1}^b(n) \tag{7.3.18}$$

are nested. Since the matrices are nested [see (7.3.3)], we check whether the right-hand side vectors are nested. We can easily see that no optimum nesting is possible if we use the upper partitioning in (7.3.2). However, if we use the lower-upper partitioning, we obtain

$$\mathbf{r}_{m+1}^b(n) = E\left\{\begin{bmatrix} x(n) \\ \mathbf{x}_m(n-1) \end{bmatrix}x^*(n-m-1)\right\} = \begin{bmatrix} r_{m+1}^b(n) \\ \mathbf{r}_m^b(n-1) \end{bmatrix} \tag{7.3.19}$$

which provides a partitioning that includes the wanted vector $\mathbf{r}_m^b(n)$ delayed by one sample as a result of the shift invariance of $\mathbf{x}_m(n)$. To explore this partitioning, we use the lower-upper corner matrix inversion by partitioning lemma

$$\mathbf{R}_{m+1}^{-1}(n) = \begin{bmatrix} 0 & \mathbf{0}^H \\ \mathbf{0} & \mathbf{R}_m^{-1}(n-1) \end{bmatrix} + \frac{1}{P_m^f(n)}\begin{bmatrix} 1 \\ \mathbf{a}_m(n) \end{bmatrix}\begin{bmatrix} 1 & \mathbf{a}_m^H(n) \end{bmatrix} \tag{7.3.20}$$

where

$$\mathbf{a}_m(n) \triangleq -\mathbf{R}_m^{-1}(n-1)\mathbf{r}_m^f(n) \tag{7.3.21}$$

is the optimum FLP and

$$P_m^f(n) = \frac{\det \mathbf{R}_{m+1}(n)}{\det \mathbf{R}_m(n-1)} = P_x(n) + \mathbf{r}_m^{fH}(n)\mathbf{a}_m(n) \tag{7.3.22}$$

is the forward linear prediction MMSE. Equations (7.3.20) through (7.3.22) follow easily from (7.1.26) through (7.1.28). Substituting (7.3.20) and (7.3.19) into

$$\mathbf{b}_{m+1}(n) = -\mathbf{R}_{m+1}^{-1}(n)\mathbf{r}_{m+1}^b(n)$$

we obtain the recursion

$$\mathbf{b}_{m+1}(n) = \begin{bmatrix} 0 \\ \mathbf{b}_m(n-1) \end{bmatrix} + \begin{bmatrix} 1 \\ \mathbf{a}_m(n) \end{bmatrix}k_m^b(n) \tag{7.3.23}$$

where

$$k_m^b(n) \triangleq -\frac{\beta_m^b(n)}{P_m^f(n)} \tag{7.3.24}$$

and

$$\beta_m^b(n) \triangleq r_{m+1}^b(n) + \mathbf{a}_m^H(n)\mathbf{r}_m^b(n-1) \tag{7.3.25}$$

To proceed with the development of the order-recursive algorithm, we clearly need an order recursion for the optimum FLP $\mathbf{a}_m(n)$.

Levinson recursion for the forward predictor. Following a similar procedure for the Levinson recursion of the BLP, we can derive the Levinson recursion for the FLP. If we use the upper-lower partitioning in (7.3.2), we obtain

$$\mathbf{r}_{m+1}^f(n) = E\{\mathbf{x}_{m+1}(n-1)x^*(n)\} = \begin{bmatrix} \mathbf{r}_m^f(n) \\ r_{m+1}^f(n) \end{bmatrix} \tag{7.3.26}$$

which in conjunction with (7.3.12) and (7.3.21) leads to the following order recursion

$$\mathbf{a}_{m+1}(n) = \begin{bmatrix} \mathbf{a}_m(n) \\ 0 \end{bmatrix} + \begin{bmatrix} \mathbf{b}_m(n-1) \\ 1 \end{bmatrix}k_m^f(n) \tag{7.3.27}$$

where

$$k_m^f(n) \triangleq -\frac{\beta_m^f(n)}{P_m^b(n-1)} \tag{7.3.28}$$

and

$$\beta_m^f(n) \triangleq \mathbf{b}_m^H(n-1)\mathbf{r}_m^f(n) + r_{m+1}^f(n) \tag{7.3.29}$$

Is an order-recursive algorithm feasible? For $m = 1$, we have a scalar equation $r_{11}(n)c_1^{(1)}(n) = d_1(n)$ whose solution is $c_1^{(1)}(n) = d_1(n)/r_{11}(n)$. Using the Levinson order recursions for $m = 1, 2, \ldots, M-1$, we can find $\mathbf{c}_M(n)$ if the quantities $\mathbf{b}_m(n-1)$ and $P_m^b(n-1)$, $1 \le m < M$, required by (7.3.27) and (7.3.28) are known. The lack of this information prevents the development of a complete order-recursive algorithm for the solution of the normal equations for optimum FIR filtering or prediction. The need for time updates arises because each order update requires both the upper left corner and the lower right corner partitionings

$$\mathbf{R}_{m+1}(n) = \begin{bmatrix} \mathbf{R}_m(n) & \times \\ \times & \times \end{bmatrix} = \begin{bmatrix} \times & \times \\ \times & \mathbf{R}_m(n-1) \end{bmatrix}$$

of matrix $\mathbf{R}_{m+1}$. The presence of $\mathbf{R}_m(n-1)$, which is a result of the nonstationarity of the input signal, creates the need for a time updating of $\mathbf{b}_m(n)$. This is possible only for certain types of nonstationarity that can be described by simple relations between $\mathbf{R}_m(n)$ and $\mathbf{R}_m(n-1)$. The simplest case occurs for stationary processes where $\mathbf{R}_m(n) = \mathbf{R}_m(n-1) = \mathbf{R}_m$. Another very useful case occurs for nonstationary processes generated by linear state-space models, which results in the Kalman filtering algorithm (see Section 7.8).

Partial correlation interpretation. The partial correlation between $y(n)$ and $x(n-m)$, after the influence of the intermediate samples $x(n), x(n-1), \ldots, x(n-m+1)$ has been removed, is

$$E\{e_m^b(n)e_m^*(n)\} = \mathbf{b}_m^H(n)\mathbf{d}_m(n) + d_{m+1}(n) = \beta_m^c(n) \tag{7.3.30}$$

which is obtained by working as in the derivation of (7.2.6). It can be shown, following a procedure similar to that leading to (7.2.8), that the $k_m(n)$ parameters in the Levinson recursions can be obtained from

$$\begin{aligned} \mathbf{R}_m(n) &= \mathbf{L}_m(n)\mathbf{D}_m(n)\mathbf{L}_m^H(n) \\ \mathbf{L}_m(n)\mathbf{D}_m(n)\mathbf{k}_m^c(n) &= \mathbf{d}_m(n) \\ \mathbf{L}_m(n)\mathbf{D}_m(n)\mathbf{k}_m^f(n) &= \mathbf{r}_m^b(n) \\ \mathbf{L}_m(n-1)\mathbf{D}_m(n-1)\mathbf{k}_m^b(n) &= \mathbf{r}_m^f(n) \end{aligned} \tag{7.3.31}$$

that is, as a by-product of the $\mathbf{LDL}^H$ decomposition. Similarly, if we consider the sequence $x(n), x(n-1), \ldots, x(n-m), x(n-m-1)$, we can show that the partial correlation between $x(n)$ and $x(n-m-1)$ is given by (see Problem 7.6)

$$E\{e_m^b(n-1)e_m^{f*}(n)\} = r_{m+1}^f(n) + \mathbf{b}_m^H(n-1)\mathbf{r}_m^f(n) = \beta_m^f(n) \tag{7.3.32}$$

Because $r_{m+1}^f(n) = r_{m+1}^{b*}(n)$, we have the following simplification

$$\begin{aligned} \beta_m^f(n) &= \mathbf{b}_m^H(n-1)\mathbf{R}_m(n-1)\mathbf{R}_m^{-1}(n-1)\mathbf{r}_m^f(n) + r_{m+1}^f(n) \\ &= \mathbf{r}_m^{bH}(n-1)\mathbf{a}_m(n) + r_{m+1}^{b*}(n) = \beta_m^{b*}(n) \end{aligned}$$

which is known as Burg's lemma (Burg 1975). In order to simplify the notation, we define

$$\beta_m(n) \triangleq \beta_m^f(n) = \beta_m^{b*}(n) \tag{7.3.33}$$

Using (7.3.24), (7.3.28), and (7.3.30), we obtain

$$k_m^f(n)k_m^b(n) = \frac{|\beta_m(n)|^2}{P_m^f(n)P_m^b(n-1)} = \frac{|E\{e_m^b(n-1)e_m^{f*}(n)\}|^2}{E\{|e_m^f(n)|^2\}\,E\{|e_m^b(n-1)|^2\}} \tag{7.3.34}$$

which implies that

$$0 \le k_m^f(n)k_m^b(n) \le 1 \tag{7.3.35}$$

because the last term in (7.3.34) is the squared magnitude of the correlation coefficient of the random variables $e_m^f(n)$ and $e_m^b(n-1)$.

Order recursions for the MMSEs. Using the Levinson order recursions, we can obtain order-recursive formulas for the computation of $P_m^f(n)$, $P_m^b(n)$, and $P_m^c(n)$. Indeed, using (7.3.26), (7.3.27), and (7.3.29), we have

$$\begin{aligned} P_{m+1}^f(n) &= P_x(n) + \mathbf{r}_{m+1}^{fH}(n)\mathbf{a}_{m+1}(n) \\ &= P_x(n) + \begin{bmatrix} \mathbf{r}_m^{fH}(n) & r_{m+1}^{f*}(n) \end{bmatrix}\left(\begin{bmatrix} \mathbf{a}_m(n) \\ 0 \end{bmatrix} + \begin{bmatrix} \mathbf{b}_m(n-1) \\ 1 \end{bmatrix}k_m^f(n)\right) \\ &= P_x(n) + \mathbf{r}_m^{fH}(n)\mathbf{a}_m(n) + [\mathbf{r}_m^{fH}(n)\mathbf{b}_m(n-1) + r_{m+1}^{f*}(n)]k_m^f(n) \end{aligned}$$

or

$$P_{m+1}^f(n) = P_m^f(n) + \beta_m^*(n)k_m^f(n) = P_m^f(n) - \frac{|\beta_m(n)|^2}{P_m^b(n-1)} \tag{7.3.36}$$

If we work in a similar manner, we obtain

$$P_{m+1}^b(n) = P_m^b(n-1) + \beta_m(n)k_m^b(n) = P_m^b(n-1) - \frac{|\beta_m(n)|^2}{P_m^f(n)} \tag{7.3.37}$$

and

$$P_{m+1}^c(n) = P_m^c(n) - \beta_m^{c*}(n)k_m^c(n) = P_m^c(n) - \frac{|\beta_m^c(n)|^2}{P_m^b(n)} \tag{7.3.38}$$

If the subtrahends in the previous recursions are nonzero, increasing the order of the filter always improves the estimates, that is, $P_{m+1}^c(n) \le P_m^c(n)$. Also, the conditions $P_m^f(n) \neq 0$ and $P_m^b(n) \neq 0$ are critical for the invertibility of $\mathbf{R}_m(n)$ and the computation of the optimum filters. The above relations are special cases of (7.1.45) and can be derived from the $\mathbf{LDL}^H$ decomposition (see Problem 7.7). The presence of vectors with mixed optimum nesting (upper-lower and lower-upper) in the definitions of $\beta_m(n)$ and $\beta_m^c(n)$ does not lead to similar order recursions for these quantities. However, for stationary processes we can break the dot products in (7.3.17) and (7.3.25) into scalar recursions, using an algorithm first introduced by Schur (see Section 7.6).
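For the stationary case, where $\mathbf{R}_m(n) = \mathbf{R}_m(n-1) = \mathbf{R}_m$ and the time index can be dropped, the predictor recursions (7.3.23) and (7.3.27) and the MMSE recursions (7.3.36) and (7.3.37) close on themselves. The sketch below illustrates this special case only; it is written in Python, assumes a real-valued process described by a hypothetical autocorrelation sequence $r(0) = 1$, $r(1) = \tfrac12$, $r(2) = \tfrac13$ (the first row of the matrix $\mathbf{R}_3$ of Example 7.1.1, which happens to be Toeplitz), and the function name is an illustrative choice.

```python
from fractions import Fraction as F

def predictor_order_recursions(r):
    """Stationary, real-valued sketch of (7.3.23)-(7.3.29), (7.3.36), (7.3.37).

    r = [r(0), r(1), ..., r(M)] is the autocorrelation sequence; the time
    index n is dropped because R_m(n) = R_m(n-1) = R_m.
    """
    M = len(r) - 1
    a, b = [], []                  # zeroth-order predictors are empty
    Pf = Pb = r[0]                 # P_0^f = P_0^b = P_x
    history = []
    for m in range(M):
        rf = r[1:m + 1]            # r_m^f = [r(1), ..., r(m)]
        rb = rf[::-1]              # r_m^b = [r(m), ..., r(1)]
        beta_f = sum(bi * ri for bi, ri in zip(b, rf)) + r[m + 1]   # (7.3.29)
        beta_b = r[m + 1] + sum(ai * ri for ai, ri in zip(a, rb))   # (7.3.25)
        kf = -beta_f / Pb                                           # (7.3.28)
        kb = -beta_b / Pf                                           # (7.3.24)
        a, b = ([ai + bi * kf for ai, bi in zip(a, b)] + [kf],      # (7.3.27)
                [kb] + [bi + ai * kb for ai, bi in zip(a, b)])      # (7.3.23)
        Pf, Pb = Pf + beta_f * kf, Pb + beta_b * kb                 # (7.3.36), (7.3.37)
        history.append((beta_f, beta_b, kf, kb, Pf, Pb))
    return a, b, history

a2, b2, history = predictor_order_recursions([F(1), F(1, 2), F(1, 3)])
print(a2, b2, history[-1])
```

In this real-valued run `beta_f` and `beta_b` coincide at every order, as Burg's lemma (7.3.33) requires, and `b2` reproduces the backward predictor found in Example 7.1.1.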

7.3.2 Lattice-Ladder Structure

We saw that the shift invariance of the input data vector made it possible to develop the Levinson recursions for the BLP and the FLP. We next show that these recursions can be used to simplify the triangular order-recursive estimation structure of Figure 7.1 by reducing it to a more efficient (linear instead of triangular) lattice-ladder filter structure that simultaneously provides the FLP, BLP, and FIR filtering estimates. The computation of the estimation errors using direct-form structures is based on the following equations:

$$
\begin{aligned}
e_m^f(n) &= x(n) + \mathbf{a}_m^H(n)\,\mathbf{x}_m(n-1) \\
e_m^b(n) &= x(n-m) + \mathbf{b}_m^H(n)\,\mathbf{x}_m(n) \\
e_m(n) &= y(n) - \mathbf{c}_m^H(n)\,\mathbf{x}_m(n)
\end{aligned}
\tag{7.3.39}
$$

Using (7.3.2), (7.3.27), and (7.3.39), we obtain

$$
\begin{aligned}
e_{m+1}^f(n) &= x(n) + \left\{\begin{bmatrix}\mathbf{a}_m(n)\\ 0\end{bmatrix} + \begin{bmatrix}\mathbf{b}_m(n-1)\\ 1\end{bmatrix}k_m^f(n)\right\}^H \begin{bmatrix}\mathbf{x}_m(n-1)\\ x(n-1-m)\end{bmatrix} \\
&= x(n) + \mathbf{a}_m^H(n)\,\mathbf{x}_m(n-1) + \left[\mathbf{b}_m^H(n-1)\,\mathbf{x}_m(n-1) + x(n-1-m)\right]k_m^{f*}(n)
\end{aligned}
$$

or

$$e_{m+1}^f(n) = e_m^f(n) + k_m^{f*}(n)\,e_m^b(n-1) \tag{7.3.40}$$

In a similar manner, we obtain

$$e_{m+1}^b(n) = e_m^b(n-1) + k_m^{b*}(n)\,e_m^f(n) \tag{7.3.41}$$

using (7.3.2), (7.3.23), and (7.3.39). Relations (7.3.40) and (7.3.41) are executed for $m = 0, 1, \ldots, M-2$, with $e_0^f(n) = e_0^b(n) = x(n)$, and constitute a lattice filter that implements the FLP and the BLP. Using (7.3.2), (7.3.15), and (7.3.39), we can show that the optimum filtering error can be computed by

$$e_{m+1}(n) = e_m(n) - k_m^{c*}(n)\,e_m^b(n) \tag{7.3.42}$$

which is executed for $m = 0, 1, \ldots, M-1$, with $e_0(n) = y(n)$. The last equation provides the ladder part, which is coupled with the lattice predictor to implement the optimum filter. The result is the time-varying lattice-ladder structure shown in Figure 7.3. Notice that a new set of lattice-ladder coefficients has to be computed for every $n$, using $\mathbf{R}_m(n)$ and $\mathbf{d}_m(n)$. The parameters of the lattice-ladder structure can be obtained by $\mathbf{LDL}^H$ decomposition using (7.3.31).

FIGURE 7.3 Lattice-ladder structure for FIR optimum filtering and prediction.

Suppose now that we know $P_0^f(n) = P_0^b(n) = P_x(n)$, $P_0^b(n-1)$, $P_0^c(n) = P_y(n)$, $\{\beta_m(n)\}_0^{M-1}$, and $\{\beta_m^c(n)\}_0^{M}$. Then we can determine $P_m^f(n)$, $P_m^b(n)$, and $P_m^c(n)$ for all $m$, using (7.3.36) through (7.3.38), and all filter coefficients, using (7.3.16), (7.3.24), and (7.3.28). However, to obtain a completely time-recursive updating algorithm, we need time updatings for $\beta_m(n)$ and $\beta_m^c(n)$. As we will see later, this is possible if $\mathbf{R}(n)$ and $\mathbf{d}(n)$ are fixed or are defined by known time-updating formulas.

We recall that the BLP error vector $\mathbf{e}_{m+1}^b(n)$ is the innovations vector of the data $\mathbf{x}_{m+1}(n)$. Notice that as a result of the shift invariance of the input data vector, the triangular decorrelator of the general linear estimator (see Figure 7.1) is replaced by a simpler, "linear" lattice structure. For stationary processes, the lattice-ladder filter is time-invariant, and we need to compute only one set of coefficients that can be used for all signals with the same $\mathbf{R}$ and $\mathbf{d}$ (see Section 7.5).

7.3.3 Simplifications for Stationary Stochastic Processes

When $x(n)$ and $y(n)$ are jointly wide-sense stationary (WSS), the optimum estimators are time-invariant and we have the following simplifications:

• All quantities are independent of $n$; thus we do not need time recursions for the BLP parameters.
• $\mathbf{b}_m = \mathbf{J}\mathbf{a}_m^*$ (see Section 6.5.4), and thus we do not need the Levinson recursion for the BLP $\mathbf{b}_m$.

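Both simplifications are a consequence of the Toeplitz structure of the correlation matrix $\mathbf{R}_m$. Indeed, comparing the partitionings

$$\mathbf{R}_{m+1} = \begin{bmatrix}\mathbf{R}_m & \mathbf{J}\mathbf{r}_m\\ \mathbf{r}_m^H\mathbf{J} & r(0)\end{bmatrix} = \begin{bmatrix} r(0) & \mathbf{r}_m^T\\ \mathbf{r}_m^* & \mathbf{R}_m\end{bmatrix} \tag{7.3.43}$$

where

$$\mathbf{r}_m \triangleq [r(1)\;\; r(2)\;\; \cdots\;\; r(m)]^T \tag{7.3.44}$$

with (7.3.3) and (7.3.4), we have

$$
\begin{aligned}
\mathbf{R}_m(n) &= \mathbf{R}_m(n-1) = \mathbf{R}_m \\
\mathbf{r}_m^f(n) &= \mathbf{r}_m^* \\
\mathbf{r}_m^b(n) &= \mathbf{J}\mathbf{r}_m
\end{aligned}
\tag{7.3.45}
$$

which can be used to simplify the order recursions derived for nonstationary processes. Indeed, we can easily show that

$$\mathbf{a}_{m+1} = \begin{bmatrix}\mathbf{a}_m\\ 0\end{bmatrix} + \begin{bmatrix}\mathbf{b}_m\\ 1\end{bmatrix}k_m \tag{7.3.46}$$

$$\mathbf{b}_m = \mathbf{J}\mathbf{a}_m^* \tag{7.3.47}$$

where

$$k_m \triangleq k_m^f = k_m^{b*} = -\frac{\beta_m}{P_m} \tag{7.3.48}$$

$$\beta_m \triangleq \beta_m^f = \beta_m^{b*} = \mathbf{b}_m^H\mathbf{r}_m^* + r^*(m+1) \tag{7.3.49}$$

$$P_m \triangleq P_m^b = P_m^f = P_{m-1} + \beta_{m-1}^*k_{m-1} = P_{m-1} + \beta_{m-1}k_{m-1}^* \tag{7.3.50}$$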

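This recursion provides a complete order-recursive algorithm for the computation of the FLP $\mathbf{a}_m$ for $1 \le m \le M$ from the autocorrelation sequence $r(l)$ for $0 \le l \le M$. The optimum filters $\mathbf{c}_m$ for $1 \le m \le M$ can be obtained from the quantities $\mathbf{a}_m$ and $P_m$ for $1 \le m \le M-1$ and $\mathbf{d}_M$, using the following Levinson recursion

$$\mathbf{c}_{m+1} = \begin{bmatrix}\mathbf{c}_m\\ 0\end{bmatrix} + \begin{bmatrix}\mathbf{J}\mathbf{a}_m^*\\ 1\end{bmatrix}k_m^c \tag{7.3.51}$$

where

$$k_m^c \triangleq \frac{\beta_m^c}{P_m} \tag{7.3.52}$$

and

$$\beta_m^c = \mathbf{b}_m^H\mathbf{d}_m + d_{m+1} \tag{7.3.53}$$

The MMSE $P_m^c$ is then given by

$$P_m^c = P_{m-1}^c - \beta_{m-1}^c k_{m-1}^{c*} \tag{7.3.54}$$

and, although it is not required by the algorithm, is useful for selecting the order of the optimum filter. Both algorithms are discussed in greater detail in Section 7.4.

7.3.4 Algorithms Based on the UDU^H Decomposition

Hermitian positive definite matrices can also be factorized as

$$\mathbf{R} = \mathbf{U}\bar{\mathbf{D}}\mathbf{U}^H \tag{7.3.55}$$

where $\mathbf{U}$ is a unit upper triangular matrix and $\bar{\mathbf{D}}$ is a diagonal matrix with positive elements $\bar{\xi}_i$, using the function [U,D]=udut(R) (see Problem 7.8). Using the decomposition (7.3.55), we can obtain the solution of the normal equations by solving the triangular systems, first for $\bar{\mathbf{k}}$

$$(\mathbf{U}\bar{\mathbf{D}})\,\bar{\mathbf{k}} \triangleq \mathbf{d} \tag{7.3.56}$$

and then for $\mathbf{c}$

$$\mathbf{U}^H\mathbf{c} = \bar{\mathbf{k}} \tag{7.3.57}$$

by backward and forward substitution, respectively. The MMSE estimate can be computed by

$$\hat{y} = \mathbf{c}^H\mathbf{x} = \bar{\mathbf{k}}^H\bar{\mathbf{w}} \tag{7.3.58}$$

where

$$\bar{\mathbf{w}} \triangleq \mathbf{U}^{-1}\mathbf{x} \tag{7.3.59}$$

is an innovations vector for the data vector $\mathbf{x}$. It can be shown that the rows of $\mathbf{A} \triangleq \mathbf{U}^{-1}$ are the linear MMSE estimator of $x_m$ based on $[x_{m+1}\; x_{m+2}\; \cdots\; x_M]^T$. Furthermore, the $\mathbf{UDU}^H$ factorization (7.3.55) can be obtained by the Gram-Schmidt algorithm, starting with $x_M$ and proceeding "backward" to $x_1$ (see Problem 7.9). The various triangular decompositions of the correlation matrix $\mathbf{R}$ are summarized in Table 7.1.

If we define the reversed vectors $\tilde{\mathbf{x}} \triangleq \mathbf{J}\mathbf{x}$ and $\tilde{\mathbf{w}} \triangleq \mathbf{J}\mathbf{w}$, we obtain

$$\tilde{\mathbf{x}} = \mathbf{J}\mathbf{x} = \mathbf{J}\mathbf{L}\mathbf{J}\mathbf{J}\mathbf{w} = \mathbf{J}\mathbf{L}\mathbf{J}\tilde{\mathbf{w}} \triangleq \tilde{\mathbf{U}}\tilde{\mathbf{w}} \tag{7.3.60}$$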

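because $\mathbf{J}^2 = \mathbf{I}$ and $\tilde{\mathbf{U}} = \mathbf{J}\mathbf{L}\mathbf{J}$ is upper triangular. The correlation matrix of $\tilde{\mathbf{x}}$ is

$$\tilde{\mathbf{R}} = E\{\tilde{\mathbf{x}}\tilde{\mathbf{x}}^H\} = \tilde{\mathbf{U}}\tilde{\mathbf{D}}\tilde{\mathbf{U}}^H \tag{7.3.61}$$

where $\tilde{\mathbf{D}} \triangleq E\{\tilde{\mathbf{w}}\tilde{\mathbf{w}}^H\}$ is the diagonal correlation matrix of $\tilde{\mathbf{w}}$. Equation (7.3.61) provides the $\mathbf{UDU}^H$ decomposition of $\tilde{\mathbf{R}}$.

TABLE 7.1 Summary of the triangular decompositions of the correlation matrix.

| Matrix | LDL^H decomposition | UDU^H decomposition |
|---|---|---|
| $\mathbf{R}$ | $\mathbf{L}\mathbf{D}\mathbf{L}^H$ | $\mathbf{U}\bar{\mathbf{D}}\mathbf{U}^H$ |
| $\mathbf{R}^{-1}$ | $\mathbf{A}^H\bar{\mathbf{D}}^{-1}\mathbf{A}$, with $\mathbf{A} = \mathbf{U}^{-1}$ | $\mathbf{B}^H\mathbf{D}^{-1}\mathbf{B}$, with $\mathbf{B} = \mathbf{L}^{-1}$ |

A natural question arising at this point is whether we can develop order-recursive algorithms and structures for optimum filtering using the $\mathbf{UDU}^H$ instead of the $\mathbf{LDL}^H$ decomposition of the correlation matrix. The $\mathbf{UDU}^H$ decomposition is coupled to a partitioning of $\mathbf{R}_{m+1}(n)$ starting at the lower right corner and moving to the upper left corner that provides the following sequence of submatrices

$$\mathbf{R}_1(n-m) \rightarrow \mathbf{R}_2(n-m+1) \rightarrow \cdots \rightarrow \mathbf{R}_m(n) \tag{7.3.62}$$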

which, in turn, are related to the FLPs

$$\mathbf{a}_1(n-m) \rightarrow \mathbf{a}_2(n-m+1) \rightarrow \cdots \rightarrow \mathbf{a}_m(n) \tag{7.3.63}$$

and the FLP errors

$$e_1^f(n-m+1) \rightarrow e_2^f(n-m+2) \rightarrow \cdots \rightarrow e_m^f(n) \tag{7.3.64}$$

If we define the FLP error vector

$$\mathbf{e}_{m+1}^f(n) = [e_m^f(n)\;\; e_{m-1}^f(n-1)\;\; \cdots\;\; e_0^f(n-m)]^T \tag{7.3.65}$$

we see that

$$\mathbf{e}_{m+1}^f(n) = \mathbf{A}_{m+1}(n)\,\mathbf{x}_{m+1}(n) \tag{7.3.66}$$

where

$$
\mathbf{A}_{m+1}(n) \triangleq
\begin{bmatrix}
1 & a_1^{(m)*}(n) & a_2^{(m)*}(n) & \cdots & a_m^{(m)*}(n) \\
0 & 1 & a_1^{(m-1)*}(n-1) & \cdots & a_{m-1}^{(m-1)*}(n-1) \\
\vdots & \vdots & \ddots & \ddots & \vdots \\
0 & 0 & \cdots & 1 & a_1^{(1)*}(n-m+1) \\
0 & 0 & \cdots & 0 & 1
\end{bmatrix}
\tag{7.3.67}
$$

The elements of the vector $\mathbf{e}_{m+1}^f(n)$ are uncorrelated, and the $\mathbf{LDL}^H$ decomposition of the inverse correlation matrix (see Problem 7.10) is given by

$$\mathbf{R}_{m+1}^{-1}(n) = \mathbf{A}_{m+1}^H(n)\,\bar{\mathbf{D}}_{m+1}^{-1}(n)\,\mathbf{A}_{m+1}(n) \tag{7.3.68}$$

where $\bar{\mathbf{D}}_{m+1}(n)$ is the correlation matrix of $\mathbf{e}_{m+1}^f(n)$. Using $\mathbf{e}_{m+1}^f(n)$ as an orthogonal basis instead of $\mathbf{e}_{m+1}^b(n)$ results in a complicated lattice structure because of the additional delay elements required for the forward prediction errors. Thus, the $\mathbf{LDL}^H$ decomposition is the method of choice in practical applications for linear MMSE estimation.
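As an illustration of the UDU^H route to the normal equations described at the start of this section, the following Python sketch factors a small correlation matrix as in (7.3.55) and then solves (7.3.56) and (7.3.57) by backward and forward substitution. The function names `udu_factor` and `solve_normal_equations` are hypothetical (they are not the book's `udut` function), and the numerical values are simply the correlation data that appear later in Example 7.4.3, used here as test input.

```python
def udu_factor(R):
    """Factor a Hermitian positive definite R (list of lists) as
    R = U diag(D) U^H with U unit upper triangular, cf. (7.3.55).
    Hypothetical helper, not the book's udut()."""
    n = len(R)
    U = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    D = [0.0] * n
    # Work from the lower right corner toward the upper left corner.
    for j in range(n - 1, -1, -1):
        D[j] = (R[j][j] - sum(abs(U[j][k]) ** 2 * D[k]
                              for k in range(j + 1, n))).real
        for i in range(j):
            s = sum(U[i][k] * D[k] * U[j][k].conjugate()
                    for k in range(j + 1, n))
            U[i][j] = (R[i][j] - s) / D[j]
    return U, D


def solve_normal_equations(R, d):
    """Solve R c = d through (U D) kbar = d and U^H c = kbar."""
    U, D = udu_factor(R)
    n = len(d)
    # Backward substitution for v = D kbar in U v = d.
    v = [0.0] * n
    for i in range(n - 1, -1, -1):
        v[i] = d[i] - sum(U[i][k] * v[k] for k in range(i + 1, n))
    kbar = [v[i] / D[i] for i in range(n)]
    # Forward substitution for c in U^H c = kbar (U^H is lower triangular).
    c = [0.0] * n
    for i in range(n):
        c[i] = kbar[i] - sum(U[j][i].conjugate() * c[j] for j in range(i))
    return c, kbar, D


if __name__ == "__main__":
    R = [[3, 2, 1], [2, 3, 2], [1, 2, 3]]
    d = [1, 2, 2.5]
    c, kbar, D = solve_normal_equations(R, d)
    print(c)   # optimum filter coefficients
    print(D)   # diagonal elements of the factorization
```

For this data the sketch gives $\mathbf{c} = [-1/16\;\; 1/4\;\; 11/16]^T$ and diagonal elements $8/5$, $5/3$, $3$, up to floating-point rounding.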

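7.4 ALGORITHMS OF LEVINSON AND LEVINSON-DURBIN

Since the correlation matrix of a stationary stochastic process is Toeplitz, we can exploit its special structure to develop efficient, order-recursive algorithms for the linear system solution, matrix triangularization, and matrix inversion. Although we develop such algorithms in the context of optimum FIR filtering and prediction, the results apply to other applications involving Toeplitz matrices (Golub and van Loan 1996).

Suppose that we know the optimum filter $\mathbf{c}_m$ given by

$$\mathbf{c}_m = \mathbf{R}_m^{-1}\mathbf{d}_m \tag{7.4.1}$$

and we wish to use it to compute the optimum filter $\mathbf{c}_{m+1}$

$$\mathbf{c}_{m+1} = \mathbf{R}_{m+1}^{-1}\mathbf{d}_{m+1} \tag{7.4.2}$$

We first notice that the matrix $\mathbf{R}_{m+1}$ and the vector $\mathbf{d}_{m+1}$ can be partitioned as follows

$$
\mathbf{R}_{m+1} =
\begin{bmatrix}
r(0) & \cdots & r(m-1) & r(m) \\
\vdots & \ddots & \vdots & \vdots \\
r^*(m-1) & \cdots & r(0) & r(1) \\
r^*(m) & \cdots & r^*(1) & r(0)
\end{bmatrix}
=
\begin{bmatrix}
\mathbf{R}_m & \mathbf{J}\mathbf{r}_m \\
\mathbf{r}_m^H\mathbf{J} & r(0)
\end{bmatrix}
\tag{7.4.3}
$$

$$\mathbf{d}_{m+1} = \begin{bmatrix}\mathbf{d}_m\\ d_{m+1}\end{bmatrix} \tag{7.4.4}$$

which shows that both quantities have the optimum nesting property, that is, $\mathbf{R}_{m+1}^{\lceil m\rceil} = \mathbf{R}_m$ and $\mathbf{d}_{m+1}^{\lceil m\rceil} = \mathbf{d}_m$. Using the matrix inversion by partitioning lemma (7.1.24), we obtain

$$\mathbf{R}_{m+1}^{-1} = \begin{bmatrix}\mathbf{R}_m^{-1} & \mathbf{0}\\ \mathbf{0}^H & 0\end{bmatrix} + \frac{1}{P_m^b}\begin{bmatrix}\mathbf{b}_m\\ 1\end{bmatrix}\begin{bmatrix}\mathbf{b}_m^H & 1\end{bmatrix} \tag{7.4.5}$$

where

$$\mathbf{b}_m = -\mathbf{R}_m^{-1}\mathbf{J}\mathbf{r}_m \tag{7.4.6}$$

and

$$P_m^b = r(0) + \mathbf{r}_m^H\mathbf{J}\mathbf{b}_m \tag{7.4.7}$$

Substitution of (7.4.4) and (7.4.5) into (7.4.2) gives

$$\mathbf{c}_{m+1} = \begin{bmatrix}\mathbf{c}_m\\ 0\end{bmatrix} + \begin{bmatrix}\mathbf{b}_m\\ 1\end{bmatrix}k_m^c \tag{7.4.8}$$

where

$$k_m^c \triangleq \frac{\beta_m^c}{P_m^b} \tag{7.4.9}$$

and

$$\beta_m^c \triangleq \mathbf{b}_m^H\mathbf{d}_m + d_{m+1} = -\mathbf{c}_m^H\mathbf{J}\mathbf{r}_m + d_{m+1} \tag{7.4.10}$$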

Equations (7.4.8) through (7.4.10) constitute a Levinson recursion for the optimum filter and have been obtained without making use of the Toeplitz structure of $\mathbf{R}_{m+1}$. The development of a complete order-recursive algorithm is made possible by exploiting the Toeplitz structure. Indeed, when the correlation matrix $\mathbf{R}_m$ is Toeplitz, we have

$$\mathbf{b}_m = \mathbf{J}\mathbf{a}_m^* \tag{7.4.11}$$

and

$$P_m \triangleq P_m^b = P_m^f \tag{7.4.12}$$

as we recall from Section 6.5. Since we can determine $\mathbf{b}_m$ from $\mathbf{a}_m$, we need to perform only one Levinson recursion, either for $\mathbf{b}_m$ or for $\mathbf{a}_m$.

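To avoid the use of the lower right corner partitioning, we develop an order recursion for the FLP $\mathbf{a}_m$. Indeed, to compute $\mathbf{a}_{m+1}$ from $\mathbf{a}_m$, recall that

$$\mathbf{a}_{m+1} = -\mathbf{R}_{m+1}^{-1}\mathbf{r}_{m+1}^* \tag{7.4.13}$$

which, when combined with (7.4.5) and

$$\mathbf{r}_{m+1} = \begin{bmatrix}\mathbf{r}_m\\ r(m+1)\end{bmatrix} \tag{7.4.14}$$

leads to the Levinson recursion

$$\mathbf{a}_{m+1} = \begin{bmatrix}\mathbf{a}_m\\ 0\end{bmatrix} + \begin{bmatrix}\mathbf{b}_m\\ 1\end{bmatrix}k_m \tag{7.4.15}$$

where

$$k_m \triangleq -\frac{\beta_m}{P_m} \tag{7.4.16}$$

$$\beta_m \triangleq \mathbf{b}_m^H\mathbf{r}_m^* + r^*(m+1) = \mathbf{a}_m^T\mathbf{J}\mathbf{r}_m^* + r^*(m+1) \tag{7.4.17}$$

and

$$P_m = r(0) + \mathbf{r}_m^H\mathbf{a}_m^* = r(0) + \mathbf{a}_m^T\mathbf{r}_m \tag{7.4.18}$$

Also, using (7.1.46) and (7.2.6), we can show that

$$P_m = \frac{\det\mathbf{R}_{m+1}}{\det\mathbf{R}_m} \qquad\text{and}\qquad \det\mathbf{R}_m = \prod_{i=0}^{m-1}P_i \quad\text{with } P_0 = r(0) \tag{7.4.19}$$

which emphasizes the importance of $P_m$ for the invertibility of the autocorrelation matrix. The MMSE $P_m$ for either the forward or the backward predictor of order $m$ can be computed recursively as follows:

$$
\begin{aligned}
P_{m+1} &= r(0) + [\mathbf{r}_m^H\;\; r^*(m+1)]\left\{\begin{bmatrix}\mathbf{a}_m\\ 0\end{bmatrix} + \begin{bmatrix}\mathbf{b}_m\\ 1\end{bmatrix}k_m\right\}^* \\
&= r(0) + \mathbf{r}_m^H\mathbf{a}_m^* + [\mathbf{r}_m^H\mathbf{b}_m^* + r^*(m+1)]\,k_m^*
\end{aligned}
\tag{7.4.20}
$$

or

$$P_{m+1} = P_m + \beta_m k_m^* = P_m + \beta_m^* k_m = P_m(1 - |k_m|^2) \tag{7.4.21}$$

The following recursive formula for the computation of the MMSE

$$P_{m+1}^c = P_m^c - \beta_m^c k_m^{c*} = P_m^c - \beta_m^{c*}k_m^c \tag{7.4.22}$$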

can be found by using (7.4.8).

Therefore, the algorithm of Levinson consists of two parts: a set of recursions that compute the optimum FLP or BLP and a set of recursions that use this information to compute the optimum filter. The part that computes the linear predictors is known as the Levinson-Durbin algorithm and was pointed out by Durbin (1960). From a linear system solution point of view, the algorithm of Levinson solves a Hermitian Toeplitz system with arbitrary right-hand side vector $\mathbf{d}$; the Levinson-Durbin algorithm deals with the special case $\mathbf{d} = \mathbf{r}^*$ or $\mathbf{J}\mathbf{r}$. Additional interpretations are discussed in Section 7.7.

Algorithm of Levinson-Durbin

The algorithm of Levinson-Durbin, which takes as input the autocorrelation sequence $r(0), r(1), \ldots, r(M)$ and computes the quantities $\mathbf{a}_m$, $P_m$, and $k_{m-1}$ for $m = 1, 2, \ldots, M$, is illustrated in the following examples.

EXAMPLE 7.4.1. Determine the FLP $\mathbf{a}_2 = [a_1^{(2)}\;\; a_2^{(2)}]^T$ and the MMSE $P_2$ from the autocorrelation values $r(0)$, $r(1)$, and $r(2)$.

Solution. To initialize the algorithm, we determine the first-order predictor by solving the normal equations $r(0)a_1^{(1)} = -r^*(1)$. Indeed, we have

$$a_1^{(1)} = -\frac{r^*(1)}{r(0)} = k_0 = -\frac{\beta_0}{P_0}$$

which implies that

$$P_0 = r(0) \qquad \beta_0 = r^*(1)$$

To update to order 2, we need $k_1$ and hence $\beta_1$ and $P_1$, which can be obtained by

$$\beta_1 = a_1^{(1)}r^*(1) + r^*(2) = \frac{r(0)r^*(2) - [r^*(1)]^2}{r(0)}$$

$$P_1 = P_0 + \beta_0 k_0^* = \frac{r^2(0) - |r(1)|^2}{r(0)}$$

as

$$k_1 = \frac{[r^*(1)]^2 - r(0)r^*(2)}{r^2(0) - |r(1)|^2}$$

Therefore, using Levinson's recursion, we obtain

$$a_1^{(2)} = a_1^{(1)} + a_1^{(1)*}k_1 = \frac{r(1)r^*(2) - r(0)r^*(1)}{r^2(0) - |r(1)|^2}$$

and

$$a_2^{(2)} = k_1$$

which agree with the results obtained in Example 6.5.1. The resulting MMSE can be found by using $P_2 = P_1 + \beta_1 k_1^*$.

EXAMPLE 7.4.2. Use the Levinson-Durbin algorithm to compute the third-order forward predictor for a signal $x(n)$ with autocorrelation sequence $r(0) = 3$, $r(1) = 2$, $r(2) = 1$, and $r(3) = \tfrac{1}{2}$.

Solution. To initialize the algorithm, we notice that the first-order predictor is given by $r(0)a_1^{(1)} = -r(1)$ and that for $m = 0$, (7.4.15) gives $a_1^{(1)} = k_0$. Hence, we have

$$a_1^{(1)} = -\frac{r(1)}{r(0)} = -\frac{2}{3} = k_0 = -\frac{\beta_0}{P_0}$$

which implies

$$P_0 = r(0) = 3 \qquad \beta_0 = r(1) = 2$$

To compute $\mathbf{a}_2$ by (7.4.15), we need $a_1^{(1)}$, $b_1^{(1)} = a_1^{(1)}$, and $k_1 = -\beta_1/P_1$. From (7.4.21), we have

$$P_1 = P_0 + \beta_0 k_0 = 3 + 2\left(-\tfrac{2}{3}\right) = \tfrac{5}{3}$$

and from (7.4.17)

$$\beta_1 = \mathbf{r}_1^T\mathbf{J}\mathbf{a}_1 + r(2) = 2\left(-\tfrac{2}{3}\right) + 1 = -\tfrac{1}{3}$$

Hence,

$$k_1 = -\frac{\beta_1}{P_1} = -\frac{-\tfrac{1}{3}}{\tfrac{5}{3}} = \frac{1}{5}$$

and

$$\mathbf{a}_2 = \begin{bmatrix}-\tfrac{2}{3}\\ 0\end{bmatrix} + \begin{bmatrix}-\tfrac{2}{3}\\ 1\end{bmatrix}\frac{1}{5} = \begin{bmatrix}-\tfrac{4}{5}\\ \tfrac{1}{5}\end{bmatrix}$$

Continuing in the same manner, we obtain

$$P_2 = P_1 + \beta_1 k_1 = \tfrac{5}{3} + \left(-\tfrac{1}{3}\right)\left(\tfrac{1}{5}\right) = \tfrac{8}{5}$$

$$\beta_2 = \mathbf{r}_2^T\mathbf{J}\mathbf{a}_2 + r(3) = [2\;\; 1]\begin{bmatrix}\tfrac{1}{5}\\ -\tfrac{4}{5}\end{bmatrix} + \frac{1}{2} = \frac{1}{10}$$

$$k_2 = -\frac{\beta_2}{P_2} = -\frac{\tfrac{1}{10}}{\tfrac{8}{5}} = -\frac{1}{16}$$

$$\mathbf{a}_3 = \begin{bmatrix}\mathbf{a}_2\\ 0\end{bmatrix} + \begin{bmatrix}\mathbf{J}\mathbf{a}_2\\ 1\end{bmatrix}k_2 = \begin{bmatrix}-\tfrac{4}{5}\\ \tfrac{1}{5}\\ 0\end{bmatrix} + \begin{bmatrix}\tfrac{1}{5}\\ -\tfrac{4}{5}\\ 1\end{bmatrix}\left(-\frac{1}{16}\right) = \begin{bmatrix}-\tfrac{13}{16}\\ \tfrac{1}{4}\\ -\tfrac{1}{16}\end{bmatrix}$$

$$P_3 = P_2 + \beta_2 k_2 = \tfrac{8}{5} + \tfrac{1}{10}\left(-\tfrac{1}{16}\right) = \tfrac{51}{32}$$

The algorithm of Levinson-Durbin, summarized in Table 7.2, requires $M^2$ operations and is implemented by the function [a,k,Po]=durbin(r,M).

TABLE 7.2 Summary of the Levinson-Durbin algorithm.

1. Input: $r(0), r(1), r(2), \ldots, r(M)$
2. Initialization
   (a) $P_0 = r(0)$, $\beta_0 = r^*(1)$
   (b) $k_0 = -r^*(1)/r(0)$, $a_1^{(1)} = k_0$
3. For $m = 1, 2, \ldots, M-1$
   (a) $P_m = P_{m-1} + \beta_{m-1}k_{m-1}^*$
   (b) $\mathbf{r}_m = [r(1)\;\; r(2)\;\; \cdots\;\; r(m)]^T$
   (c) $\beta_m = \mathbf{a}_m^T\mathbf{J}\mathbf{r}_m^* + r^*(m+1)$
   (d) $k_m = -\beta_m/P_m$
   (e) $\mathbf{a}_{m+1} = \begin{bmatrix}\mathbf{a}_m\\ 0\end{bmatrix} + \begin{bmatrix}\mathbf{J}\mathbf{a}_m^*\\ 1\end{bmatrix}k_m$
4. $P_M = P_{M-1} + \beta_{M-1}k_{M-1}^*$
5. Output: $\mathbf{a}_M$, $\{k_m\}_0^{M-1}$, $\{P_m\}_1^M$
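The steps of Table 7.2 can be sketched in a few lines of Python. This is a minimal illustration, not the book's `durbin` function: the name `levinson_durbin` and the return layout are assumptions, and the MMSE update uses the equivalent product form $P_{m+1} = P_m(1 - |k_m|^2)$ from (7.4.21).

```python
def levinson_durbin(r):
    """Levinson-Durbin recursion (cf. Table 7.2).

    r : [r(0), r(1), ..., r(M)], real or complex autocorrelation values.
    Returns (a, k, P) where
      a = [a_1, ..., a_M]      coefficients of the Mth-order FLP,
      k = [k_0, ..., k_{M-1}]  lattice parameters,
      P = [P_0, ..., P_M]      MMSEs for orders 0..M.
    Hypothetical helper name; layout chosen for illustration only.
    """
    M = len(r) - 1
    a, k = [], []
    P = [r[0].real]
    for m in range(M):
        # beta_m = a_m^T J r_m^* + r^*(m+1), see (7.4.17)
        beta = sum(a[i] * r[m - i].conjugate() for i in range(m))
        beta += r[m + 1].conjugate()
        km = -beta / P[m]                      # (7.4.16)
        # a_{m+1} = [a_m; 0] + [J a_m^*; 1] k_m, see (7.4.15) with (7.4.11)
        a = [a[i] + km * a[m - 1 - i].conjugate() for i in range(m)] + [km]
        k.append(km)
        P.append(P[m] * (1 - abs(km) ** 2))    # (7.4.21)
    return a, k, P


if __name__ == "__main__":
    a, k, P = levinson_durbin([3, 2, 1, 0.5])  # data of Example 7.4.2
    print(a)
    print(k)
    print(P)
```

Run on the data of Example 7.4.2, the sketch reproduces $\mathbf{a}_3 = [-13/16\;\; 1/4\;\; -1/16]^T$, $k_0 = -2/3$, $k_1 = 1/5$, $k_2 = -1/16$, and $P_3 = 51/32$, up to floating-point rounding.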

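Algorithm of Levinson

The next example illustrates the algorithm of Levinson that can be used to solve a system of linear equations with a Hermitian Toeplitz matrix and arbitrary right-hand side vector.

EXAMPLE 7.4.3. Consider an optimum filter with input $x(n)$ and desired response $y(n)$. The autocorrelation of the input signal is $r(0) = 3$, $r(1) = 2$, and $r(2) = 1$. The cross-correlation between the desired response and input is $d_1 = 1$, $d_2 = 2$, and $d_3 = \tfrac{5}{2}$; and the power of $y(n)$ is $P_y = 3$. Design a third-order optimum FIR filter, using the algorithm of Levinson.

Solution. We start initializing the algorithm by noticing that for $m = 0$ we have $r(0)a_1^{(1)} = -r(1)$, which gives

$$a_1^{(1)} = k_0 = -\frac{r(1)}{r(0)} = -\frac{2}{3}$$

$$P_0 = r(0) = 3 \qquad \beta_0 = r(1) = 2$$

and

$$P_1 = P_0 + \beta_0 k_0 = 3 + 2\left(-\tfrac{2}{3}\right) = \tfrac{5}{3}$$

Next, we compute the Levinson recursion for the first-order optimum filter

$$P_0^c = P_y = 3 \qquad \beta_0^c = d_1 = 1$$

$$k_0^c = c_1^{(1)} = \frac{d_1}{r(0)} = \frac{1}{3}$$

$$P_1^c = P_0^c - \beta_0^c k_0^c = 3 - 1\left(\tfrac{1}{3}\right) = \tfrac{8}{3}$$

Then we carry the Levinson recursion for $m = 1$ to obtain

$$\beta_1 = \mathbf{r}_1^T\mathbf{J}\mathbf{a}_1 + r(2) = 2\left(-\tfrac{2}{3}\right) + 1 = -\tfrac{1}{3}$$

$$k_1 = -\frac{\beta_1}{P_1} = -\frac{-\tfrac{1}{3}}{\tfrac{5}{3}} = \frac{1}{5}$$

$$\mathbf{a}_2 = \begin{bmatrix}-\tfrac{2}{3}\\ 0\end{bmatrix} + \begin{bmatrix}-\tfrac{2}{3}\\ 1\end{bmatrix}\frac{1}{5} = \begin{bmatrix}-\tfrac{4}{5}\\ \tfrac{1}{5}\end{bmatrix}$$

$$P_2 = P_1 + \beta_1 k_1 = \tfrac{5}{3} + \left(-\tfrac{1}{3}\right)\left(\tfrac{1}{5}\right) = \tfrac{8}{5}$$

for the optimum predictor, and

$$\beta_1^c = \mathbf{a}_1^T\mathbf{J}\mathbf{d}_1 + d_2 = -\tfrac{2}{3}(1) + 2 = \tfrac{4}{3}$$

$$k_1^c = \frac{\beta_1^c}{P_1} = \frac{\tfrac{4}{3}}{\tfrac{5}{3}} = \frac{4}{5}$$

$$\mathbf{c}_2 = \begin{bmatrix}\tfrac{1}{3}\\ 0\end{bmatrix} + \begin{bmatrix}-\tfrac{2}{3}\\ 1\end{bmatrix}\frac{4}{5} = \begin{bmatrix}-\tfrac{1}{5}\\ \tfrac{4}{5}\end{bmatrix}$$

$$P_2^c = P_1^c - \beta_1^c k_1^c = \tfrac{8}{3} - \tfrac{4}{3}\left(\tfrac{4}{5}\right) = \tfrac{8}{5}$$

for the optimum filter. The last recursion ($m = 2$) is carried out only for the optimum filter and gives

$$\beta_2^c = \mathbf{a}_2^T\mathbf{J}\mathbf{d}_2 + d_3 = \left[-\tfrac{4}{5}\;\; \tfrac{1}{5}\right]\begin{bmatrix}2\\ 1\end{bmatrix} + \frac{5}{2} = \frac{11}{10}$$

$$k_2^c = \frac{\beta_2^c}{P_2} = \frac{\tfrac{11}{10}}{\tfrac{8}{5}} = \frac{11}{16}$$

$$\mathbf{c}_3 = \begin{bmatrix}\mathbf{c}_2\\ 0\end{bmatrix} + \begin{bmatrix}\mathbf{J}\mathbf{a}_2\\ 1\end{bmatrix}k_2^c = \begin{bmatrix}-\tfrac{1}{5}\\ \tfrac{4}{5}\\ 0\end{bmatrix} + \begin{bmatrix}\tfrac{1}{5}\\ -\tfrac{4}{5}\\ 1\end{bmatrix}\frac{11}{16} = \begin{bmatrix}-\tfrac{1}{16}\\ \tfrac{1}{4}\\ \tfrac{11}{16}\end{bmatrix}$$

$$P_3^c = P_2^c - \beta_2^c k_2^c = \tfrac{8}{5} - \tfrac{11}{10}\left(\tfrac{11}{16}\right) = \tfrac{27}{32}$$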
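The computations of this example can be sketched in Python as follows. The function name `levinson` and its argument and return layout are assumptions made for illustration (this is not the book's `levins` function). The sketch uses $\beta_m^c = \mathbf{b}_m^H\mathbf{d}_m + d_{m+1}$ with $\mathbf{b}_m = \mathbf{J}\mathbf{a}_m^*$, together with (7.4.8), (7.4.9), and (7.4.22).

```python
def levinson(r, d, Py):
    """Levinson recursion for the optimum FIR filter (illustrative sketch).

    r  : [r(0), ..., r(M-1)]  autocorrelation of the input,
    d  : [d_1, ..., d_M]      cross-correlation values,
    Py : power of the desired response.
    Returns (c, kc, Pc): Mth-order filter, ladder parameters
    [k^c_0, ..., k^c_{M-1}], and the MMSE P^c_M.
    """
    M = len(d)
    a, c, kc_list = [], [], []
    P, Pc = r[0].real, Py
    for m in range(M):
        # beta^c_m = b_m^H d_m + d_{m+1} = a_m^T J d_m + d_{m+1}
        beta_c = sum(a[i] * d[m - 1 - i] for i in range(m)) + d[m]
        kc = beta_c / P                                   # (7.4.9)
        # c_{m+1} = [c_m; 0] + [J a_m^*; 1] k^c_m          (7.4.8)
        c = [c[i] + kc * a[m - 1 - i].conjugate() for i in range(m)] + [kc]
        Pc = Pc - (beta_c * kc.conjugate()).real          # (7.4.22)
        kc_list.append(kc)
        if m < M - 1:
            # Levinson-Durbin update of the predictor to order m + 1
            beta = sum(a[i] * r[m - i].conjugate() for i in range(m))
            beta += r[m + 1].conjugate()
            km = -beta / P
            a = [a[i] + km * a[m - 1 - i].conjugate()
                 for i in range(m)] + [km]
            P = P * (1 - abs(km) ** 2)
    return c, kc_list, Pc


if __name__ == "__main__":
    c, kc, Pc = levinson([3, 2, 1], [1, 2, 2.5], 3)
    print(c, kc, Pc)
```

With the data of this example, the sketch returns $\mathbf{c}_3 = [-1/16\;\; 1/4\;\; 11/16]^T$, ladder parameters $1/3$, $4/5$, $11/16$, and $P_3^c = 27/32$, up to floating-point rounding.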

The algorithm of Levinson, summarized in Table 7.3, is implemented by the Matlab function [c,k,kc,Pc]=levins(R,d,Py,M) and requires $2M^2$ operations because it involves two dot products and two scalar-vector multiplications. A parallel processing implementation of the algorithm is not possible because the dot products involve additions that cannot be executed simultaneously. Notice that adding $M = 2^q$ numbers using $M/2$ adders requires $q = \log_2 M$ steps. This bottleneck can be avoided by using the algorithm of Schur (see Section 7.6).

TABLE 7.3 Summary of the algorithm of Levinson.

1. Input: $\{r(l)\}_0^M$, $\{d_m\}_1^M$, $P_y$
2. Initialization
   (a) $P_0 = r(0)$, $\beta_0 = r^*(1)$, $P_0^c = P_y$
   (b) $k_0 = -\beta_0/P_0$, $a_1^{(1)} = k_0$
   (c) $\beta_0^c = d_1$
   (d) $k_0^c = \beta_0^c/P_0$, $c_1^{(1)} = k_0^c$
   (e) $P_1^c = P_0^c - \beta_0^c k_0^{c*}$
3. For $m = 1, 2, \ldots, M-1$
   (a) $\mathbf{r}_m = [r(1)\;\; r(2)\;\; \cdots\;\; r(m)]^T$
   (b) $\beta_m = \mathbf{a}_m^T\mathbf{J}\mathbf{r}_m^* + r^*(m+1)$
   (c) $P_m = P_{m-1} + \beta_{m-1}k_{m-1}^*$
   (d) $k_m = -\beta_m/P_m$
   (e) $\mathbf{a}_{m+1} = \begin{bmatrix}\mathbf{a}_m\\ 0\end{bmatrix} + \begin{bmatrix}\mathbf{J}\mathbf{a}_m^*\\ 1\end{bmatrix}k_m$
   (f) $\beta_m^c = -\mathbf{c}_m^H\mathbf{J}\mathbf{r}_m + d_{m+1}$
   (g) $k_m^c = \beta_m^c/P_m$
   (h) $\mathbf{c}_{m+1} = \begin{bmatrix}\mathbf{c}_m\\ 0\end{bmatrix} + \begin{bmatrix}\mathbf{J}\mathbf{a}_m^*\\ 1\end{bmatrix}k_m^c$
   (i) $P_{m+1}^c = P_m^c - \beta_m^c k_m^{c*}$
4. Output: $\mathbf{a}_M$, $\mathbf{c}_M$, $\{k_m, k_m^c\}_0^{M-1}$, $\{P_m, P_m^c\}_0^M$

Minimum phase and autocorrelation extension

Using (7.4.16), we can also express the recursion (7.4.21) as

$$P_{m+1} = P_m(1 - |k_m|^2) = P_m - \frac{|\beta_m|^2}{P_m} \tag{7.4.23}$$

which, since $P_m \ge 0$, implies that

$$P_{m+1} \le P_m \tag{7.4.24}$$

and since the matrix $\mathbf{R}_m$ is positive definite, then $P_m > 0$ and (7.4.23) implies that

$$|k_m| \le 1 \tag{7.4.25}$$

for all $1 \le m < M$. If

$$P_0 > \cdots > P_{M-1} > P_M = 0 \tag{7.4.26}$$

then the process $x(n)$ is predictable and (7.4.23) implies that

$$k_M = \pm 1 \qquad\text{and}\qquad |k_m| < 1 \quad 1 \le m < M \tag{7.4.27}$$

(see Section 6.6.4). Also if

$$P_{M-1} > P_M = \cdots = P_\infty = P > 0 \tag{7.4.28}$$

from (7.4.23) we have

$$k_m = 0 \qquad\text{for } m > M \tag{7.4.29}$$

and $e_M^f(n) \sim \mathrm{WN}(0, P_M)$, which implies that the process $x(n)$ is AR($M$) (see Section 4.2.3). Finally, we note that since the sequence $P_0, P_1, P_2, \ldots$ is nonincreasing, its limit as $m \to \infty$ exists and is nonnegative. A regular process must satisfy $|k_m| < 1$ for all $m$, because $|k_m| = 1$ implies that $P_m = 0$, which contradicts the regularity assumption.

For $m = 0$, (7.4.19) gives $P_0 = r(0)$. Carrying out (7.4.23) from $m = 0$ to $m = M$, we obtain

$$P_M = r(0)\prod_{m=1}^{M}(1 - |k_{m-1}|^2) \tag{7.4.30}$$

which converges, as $M \to \infty$, if $|k_m| < 1$.
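As a quick numerical check of (7.4.30), we can use the values obtained in Example 7.4.2, namely $r(0) = 3$, $k_0 = -\tfrac{2}{3}$, $k_1 = \tfrac{1}{5}$, and $k_2 = -\tfrac{1}{16}$:

$$P_3 = 3\left(1 - \tfrac{4}{9}\right)\left(1 - \tfrac{1}{25}\right)\left(1 - \tfrac{1}{256}\right) = 3\cdot\tfrac{5}{9}\cdot\tfrac{24}{25}\cdot\tfrac{255}{256} = \tfrac{51}{32}$$

which agrees with the value of $P_3$ found there. The partial products $\tfrac{5}{3}$ and $\tfrac{8}{5}$ are the intermediate MMSEs $P_1$ and $P_2$.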

7.5 LATTICE STRUCTURES FOR OPTIMUM FIR FILTERS AND PREDICTORS

To compute the forward prediction error of an FLP of order $m$, we use the formula

$$e_m^f(n) = x(n) + \mathbf{a}_m^H\mathbf{x}_m(n-1) = x(n) + \sum_{k=1}^{m}a_k^{(m)*}x(n-k) \tag{7.5.1}$$

Similarly, for the BLP we have

$$e_m^b(n) = x(n-m) + \mathbf{b}_m^H\mathbf{x}_m(n) = x(n-m) + \sum_{k=0}^{m-1}b_k^{(m)*}x(n-k) \tag{7.5.2}$$

Both filters can be implemented using the direct-form filter structure shown in Figure 7.4. Since am and bm do not have the optimum nesting property, we cannot obtain order-recursive direct-form structures for the computation of the prediction errors. However, next we show that we can derive an order-recursive lattice-ladder structure for the implementation of optimum predictors and filters using the algorithm of Levinson.

FIGURE 7.4 Direct-form structure for the computation of the mth-order forward and backward prediction errors.

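7.5.1 Lattice-Ladder Structures

We note that the data vector for the $(m+1)$st-order predictor can be partitioned in the following ways:

$$
\begin{aligned}
\mathbf{x}_{m+1}(n) &= [x(n)\;\; x(n-1)\;\; \cdots\;\; x(n-m+1)\;\; x(n-m)]^T \\
&= [\mathbf{x}_m^T(n)\;\; x(n-m)]^T
\end{aligned}
\tag{7.5.3}
$$

$$\mathbf{x}_{m+1}(n) = [x(n)\;\; \mathbf{x}_m^T(n-1)]^T \tag{7.5.4}$$

Using (7.5.1), (7.5.3), (7.4.15), and (7.5.2), we obtain

$$
\begin{aligned}
e_{m+1}^f(n) &= x(n) + \left\{\begin{bmatrix}\mathbf{a}_m\\ 0\end{bmatrix} + \begin{bmatrix}\mathbf{b}_m\\ 1\end{bmatrix}k_m\right\}^H\begin{bmatrix}\mathbf{x}_m(n-1)\\ x(n-m-1)\end{bmatrix} \\
&= x(n) + \mathbf{a}_m^H\mathbf{x}_m(n-1) + k_m^*\left[\mathbf{b}_m^H\mathbf{x}_m(n-1) + x(n-1-m)\right]
\end{aligned}
$$

or

$$e_{m+1}^f(n) = e_m^f(n) + k_m^*\,e_m^b(n-1) \tag{7.5.5}$$

Using (7.4.11) and (7.4.15), we obtain the following Levinson-type recursion for the backward predictor:

$$\mathbf{b}_{m+1} = \begin{bmatrix}0\\ \mathbf{b}_m\end{bmatrix} + \begin{bmatrix}1\\ \mathbf{a}_m\end{bmatrix}k_m^*$$

The backward prediction error is

$$
\begin{aligned}
e_{m+1}^b(n) &= x(n-m-1) + \left\{\begin{bmatrix}0\\ \mathbf{b}_m\end{bmatrix} + \begin{bmatrix}1\\ \mathbf{a}_m\end{bmatrix}k_m^*\right\}^H\begin{bmatrix}x(n)\\ \mathbf{x}_m(n-1)\end{bmatrix} \\
&= x(n-m-1) + \mathbf{b}_m^H\mathbf{x}_m(n-1) + k_m\left[x(n) + \mathbf{a}_m^H\mathbf{x}_m(n-1)\right]
\end{aligned}
$$

or

$$e_{m+1}^b(n) = e_m^b(n-1) + k_m\,e_m^f(n) \tag{7.5.6}$$

Recursions (7.5.5) and (7.5.6) can be computed for $m = 0, 1, \ldots, M-1$. The initial conditions $e_0^f(n)$ and $e_0^b(n)$ are easily obtained from (7.5.1) and (7.5.2). The recursions also lead to the following all-zero lattice algorithm

$$
\begin{aligned}
e_0^f(n) &= e_0^b(n) = x(n) \\
e_m^f(n) &= e_{m-1}^f(n) + k_{m-1}^*\,e_{m-1}^b(n-1) \qquad m = 1, 2, \ldots, M \\
e_m^b(n) &= k_{m-1}\,e_{m-1}^f(n) + e_{m-1}^b(n-1) \qquad m = 1, 2, \ldots, M \\
e(n) &= e_M^f(n)
\end{aligned}
\tag{7.5.7}
$$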

that is implemented using the structure shown in Figure 7.5. The lattice parameters $k_m$ are known as reflection coefficients in the speech processing and geophysics areas.

FIGURE 7.5 All-zero lattice structure for the implementation of the forward and backward prediction error filters.
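The all-zero lattice algorithm (7.5.7) translates almost line by line into code. The following Python sketch is only an illustration: the function names are hypothetical, zero initial conditions are assumed for all delay elements, and the direct-form routine implements (7.5.1) so that the two outputs can be compared.

```python
def lattice_fir(x, k):
    """All-zero lattice (7.5.7) with zero initial state.

    x : input samples, k : [k_0, ..., k_{M-1}] lattice parameters.
    Returns (ef, eb): the Mth-order forward and backward prediction
    error sequences e^f_M(n) and e^b_M(n).
    """
    M = len(k)
    eb_delayed = [0.0] * M            # holds e^b_m(n-1) for m = 0..M-1
    ef_out, eb_out = [], []
    for xn in x:
        ef = eb = xn                  # stage 0: e^f_0(n) = e^b_0(n) = x(n)
        for m in range(M):
            eb_prev = eb_delayed[m]   # e^b_m(n-1)
            eb_delayed[m] = eb        # store e^b_m(n) for the next sample
            ef_next = ef + k[m].conjugate() * eb_prev
            eb = k[m] * ef + eb_prev  # uses e^f_m(n), not the updated value
            ef = ef_next
        ef_out.append(ef)
        eb_out.append(eb)
    return ef_out, eb_out


def direct_form_error(x, a):
    """Forward prediction error (7.5.1) with zero initial conditions;
    a = [a_1, ..., a_M]."""
    out = []
    for n in range(len(x)):
        acc = x[n]
        for i, ai in enumerate(a, start=1):
            if n - i >= 0:
                acc += ai.conjugate() * x[n - i]
        out.append(acc)
    return out


if __name__ == "__main__":
    x = [1.0, -2.0, 0.5, 3.0, -1.0, 0.25]          # arbitrary test input
    ef, _ = lattice_fir(x, [-2/3, 1/5, -1/16])     # k from Example 7.4.2
    ref = direct_form_error(x, [-13/16, 1/4, -1/16])
    print(max(abs(u - v) for u, v in zip(ef, ref)))
```

With the lattice parameters and predictor coefficients of Example 7.4.2, the lattice and direct-form outputs coincide up to rounding, as expected since both realize the same system function.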

The Levinson recursion for the optimum filter, (7.4.8) through (7.4.10), adds a ladder part to the lattice structure for the forward and backward predictors. Using (7.4.8), (7.5.7), and the partitioning in (7.5.3), we can express the filtering error of order $m+1$ in terms of $e_m(n)$ and $e_m^b(n)$ as follows

$$e_{m+1}(n) = y(n) - \mathbf{c}_{m+1}^H\mathbf{x}_{m+1}(n) = e_m(n) - k_m^{c*}\,e_m^b(n) \tag{7.5.8}$$

for $m = 0, 1, \ldots, M-1$. The resulting lattice-ladder structure is similar to the one shown in Figure 7.3. However, owing to stationarity all coefficients are constant, and $k_m^f(n) = k_m^{b*}(n) = k_m$. We note that the efficient solution of the $M$th-order optimum filtering problem is derived from the solution of the $(M-1)$st-order forward and backward prediction problems of the input process. In fact, the lattice part serves to decorrelate the samples $x(n), x(n-1), \ldots, x(n-M)$, producing the uncorrelated samples $e_0^b(n), e_1^b(n), \ldots, e_M^b(n)$ (innovations), which are then linearly combined ("recorrelated") to obtain the optimum estimate of the desired response.

System functions. We next express the various lattice relations in terms of z-transforms. Taking the z-transform of (7.5.1) and (7.5.2), we obtain

$$E_m^f(z) = \left[1 + \sum_{k=1}^{m}a_k^{(m)*}z^{-k}\right]X(z) \triangleq A_m(z)X(z) \tag{7.5.9}$$

$$E_m^b(z) = \left[z^{-m} + \sum_{k=1}^{m}b_k^{(m)*}z^{-k+1}\right]X(z) \triangleq B_m(z)X(z) \tag{7.5.10}$$

where $A_m(z)$ and $B_m(z)$ are the system functions of the paths from the input to the outputs of the $m$th stage of the lattice. Using the symmetry relation $\mathbf{a}_m = \mathbf{J}\mathbf{b}_m^*$, $1 \le m \le M$, we obtain

$$B_m(z) = z^{-m}A_m^*\!\left(\frac{1}{z^*}\right) \tag{7.5.11}$$

Note that if $z_0$ is a zero of $A_m(z)$, then $1/z_0^*$ is a zero of $B_m(z)$. Therefore, if $A_m(z)$ is minimum-phase, then $B_m(z)$ is maximum-phase. Taking the z-transform of the lattice equations (7.5.7), we have for the $m$th stage

$$E_m^f(z) = E_{m-1}^f(z) + k_{m-1}^*\,z^{-1}E_{m-1}^b(z) \tag{7.5.12}$$

$$E_m^b(z) = k_{m-1}E_{m-1}^f(z) + z^{-1}E_{m-1}^b(z) \tag{7.5.13}$$

Dividing both equations by $X(z)$ and using (7.5.9) and (7.5.10), we have

$$A_m(z) = A_{m-1}(z) + k_{m-1}^*\,z^{-1}B_{m-1}(z) \tag{7.5.14}$$

$$B_m(z) = k_{m-1}A_{m-1}(z) + z^{-1}B_{m-1}(z) \tag{7.5.15}$$

which, when initialized with

$$A_0(z) = B_0(z) = 1 \tag{7.5.16}$$

describe the lattice filter in the z domain. The z-transform of the ladder part (7.5.8) is given by

$$E_{m+1}(z) = E_m(z) - k_m^{c*}E_m^b(z) \tag{7.5.17}$$

where $E_m(z)$ is the z-transform of the error sequence $e_m(n)$.

All-pole or "inverse" lattice structure. If we wish to recover the input $x(n)$ from the prediction error $e(n) = e_M^f(n)$, we can use the following all-pole lattice filter algorithm

$$
\begin{aligned}
e_M^f(n) &= e(n) \\
e_{m-1}^f(n) &= e_m^f(n) - k_{m-1}^*\,e_{m-1}^b(n-1) \qquad m = M, M-1, \ldots, 1 \\
e_m^b(n) &= e_{m-1}^b(n-1) + k_{m-1}\,e_{m-1}^f(n) \qquad m = M, M-1, \ldots, 1 \\
x(n) &= e_0^f(n) = e_0^b(n)
\end{aligned}
\tag{7.5.18}
$$

which is derived as explained in Section 2.5 and is implemented by using the structure in Figure 7.6. Although the system functions of the all-zero lattice in (7.5.7) and the all-pole lattice in (7.5.18) are $H_{AZ}(z) = A(z)$ and $H_{AP}(z) = 1/A(z)$, the two lattice structures are described by the same set of lattice coefficients. The difference is the signal flow (see feedback loops in the all-pole structure). This structure is used in speech processing applications (Rabiner and Schafer 1978).

FIGURE 7.6 All-pole lattice structure for recovering the input signal from the forward prediction error.

7.5.2 Some Properties and Interpretations

Lattice filters have some important properties and interesting interpretations that make them a useful tool in optimum filtering and signal modeling.

Optimal nesting. The all-zero lattice filter has an optimal nesting property when it is used for the implementation of an FLP. Indeed, if we use the lattice parameters obtained via the algorithm of Levinson-Durbin, the all-zero lattice filter driven by the signal $x(n)$ produces prediction errors $e_m^f(n)$ and $e_m^b(n)$ at the output of the $m$th stage for all $1 \le m \le M$. This implies that we can increase the order of the filter by attaching additional stages without destroying the optimality of the previous stages. In contrast, the direct-form filter structure implementation requires the computation of the entire predictor for each stage. However, the nesting property does not hold for the all-pole lattice filter because of the feedback path.

Orthogonality. The backward prediction errors $e_m^b(n)$ for $0 \le m \le M$ are uncorrelated (see Section 7.2), that is,

$$E\{e_m^b(n)\,e_k^{b*}(n)\} = \begin{cases} P_m & k = m \\ 0 & k \ne m \end{cases} \tag{7.5.19}$$

and constitute the innovations representation of the input samples $x(n), x(n-1), \ldots, x(n-m)$. We see that at a given time instant $n$, the backward prediction errors for orders $m = 0, 1, 2, \ldots, M$ are uncorrelated and are part of a nonstationary sequence because the variance $E\{|e_m^b(n)|^2\} = P_m$ depends on $m$. This should be expected because, for a given $n$, each $e_m^b(n)$ is computed using a different set of predictor coefficients. In contrast, for a given $m$, the sequence $e_m^b(n)$ is stationary for $-\infty < n < \infty$.

Reflection coefficients. The all-pole lattice structure is very useful in the modeling of layered media, where each stage of the lattice models one layer or section of the medium. Traveling waves in geophysical layers, in acoustic tubes of varying cross-sections, and in multisectional transmission lines have been modeled in this fashion. The modeling is performed such that the wave travel time through each section is the same, but the sections may have different impedances. The $m$th section is modeled with the signals $e_m^f(n)$ and $e_m^b(n)$ representing the forward and backward traveling waves, respectively. If $Z_m$ and $Z_{m-1}$ are the characteristic impedances at sections $m$ and $m-1$, respectively, then $k_m$ represents the reflection coefficient between the two sections, given by

$$k_m = \frac{Z_m - Z_{m-1}}{Z_m + Z_{m-1}} \tag{7.5.20}$$

For this reason, the lattice parameters $k_m$ are often known as reflection coefficients. As reflection coefficients, it makes good sense that their magnitudes not exceed unity. The termination of the lattice assumes a perfect reflection, and so the reflected wave $e_0^b(n)$ is equal to the transmitted wave $e_0^f(n)$. The result of this specific termination is an overall all-pole model (Rabiner and Schafer 1978).

Partial correlation coefficients. The partial correlation coefficient (PCC) between $x(n)$ and $x(n-m-1)$ (see also Section 7.2.2) is defined as the correlation coefficient between $e_m^f(n)$ and $e_m^b(n-1)$, that is,

$$\mathrm{PCC}\{x(n-m-1); x(n)\} \triangleq \frac{\mathrm{PARCOR}\{x(n-m-1); x(n)\}}{\sqrt{E\{|e_m^b(n-1)|^2\}\,E\{|e_m^f(n)|^2\}}} \tag{7.5.21}$$

and, therefore, it takes values in the range $[-1, 1]$ (Kendall and Stuart 1979). Working as in Section 7.2, we can show that

$$E\{e_m^b(n-1)\,e_m^{f*}(n)\} = \mathbf{b}_m^H\mathbf{r}_m^* + r^*(m+1) = \beta_m \tag{7.5.22}$$

which in conjunction with

$$E\{|e_m^b(n-1)|^2\} = E\{|e_m^f(n)|^2\} = P_m \tag{7.5.23}$$

and (7.4.16), results in

$$k_m = -\frac{\beta_m}{P_m} = -\mathrm{PCC}\{x(n-m-1); x(n)\} \tag{7.5.24}$$

That is, for stationary processes the lattice parameters are the negative of the partial autocorrelation sequence and satisfy the relation

$$|k_m| \le 1 \qquad\text{for all } 0 \le m \le M-1 \tag{7.5.25}$$

derived also for (7.4.25) using an alternate approach.

Minimum phase. According to Theorem 2.3 (Section 2.5), the roots of the polynomial $A(z)$ are inside the unit circle if and only if

$$|k_m| < 1 \qquad\text{for all } 0 \le m \le M-1 \tag{7.5.26}$$

which implies that the filters with system functions $A(z)$ and $1/A(z)$ are minimum-phase. The strict inequalities (7.5.26) are satisfied if the stationary process $x(n)$ is nonpredictable, which is the case when the Toeplitz autocorrelation matrix $\mathbf{R}$ is positive definite.

Lattice-ladder optimization. As we saw in Section 2.5, the output of an FIR lattice filter is a nonlinear function of the lattice parameters. Hence, if we try to design an optimum lattice filter by minimizing the MSE with respect to the lattice parameters, we end up with a nonlinear optimization problem (see Problem 7.11). In contrast, the Levinson algorithm leads to a lattice-ladder realization of the optimum filter through the order-recursive solution of a linear optimization problem. This subject is of interest to signal modeling and adaptive filtering (see Chapters 9 and 10).
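The lattice-ladder realization mentioned above, that is, the lattice recursions (7.5.7) coupled with the ladder update (7.5.8), can be sketched as follows. The function name `lattice_ladder` is hypothetical, zero initial conditions are assumed, and the coefficients in the usage example are those computed in Example 7.4.3.

```python
def lattice_ladder(x, y, k, kc):
    """Time-invariant lattice-ladder filter (illustrative sketch).

    x, y : input and desired-response samples (same length),
    k    : [k_0, ..., k_{M-2}]    lattice parameters,
    kc   : [k^c_0, ..., k^c_{M-1}] ladder parameters.
    Returns (err, est): the Mth-order filtering error e_M(n) and the
    estimate yhat(n) = y(n) - e_M(n).
    """
    M = len(kc)
    eb_delayed = [0.0] * (M - 1)      # e^b_m(n-1) for m = 0..M-2
    err, est = [], []
    for xn, yn in zip(x, y):
        ef = eb = xn                  # e^f_0(n) = e^b_0(n) = x(n)
        e = yn                        # e_0(n) = y(n)
        for m in range(M):
            # Ladder update (7.5.8): uses the backward error e^b_m(n)
            e = e - kc[m].conjugate() * eb
            if m < M - 1:
                # Lattice update (7.5.7) to stage m + 1
                eb_prev = eb_delayed[m]
                eb_delayed[m] = eb
                ef, eb = (ef + k[m].conjugate() * eb_prev,
                          k[m] * ef + eb_prev)
        err.append(e)
        est.append(yn - e)
    return err, est


if __name__ == "__main__":
    x = [1.0, -2.0, 0.5, 3.0, -1.0, 0.25]   # arbitrary test signals
    y = [0.3, -1.2, 2.0, 0.7, -0.4, 1.1]
    err, est = lattice_ladder(x, y, [-2/3, 1/5], [1/3, 4/5, 11/16])
    print(est)
```

For real-valued data the estimate produced by this structure coincides with the direct-form output $\mathbf{c}_3^H\mathbf{x}_3(n)$ with $\mathbf{c}_3 = [-1/16\;\; 1/4\;\; 11/16]^T$, which is one way to check an implementation.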

7.5.3 Parameter Conversions

We have shown that the $M$th-order forward linear predictor of a stationary process $x(n)$ is uniquely specified by a set of linear equations in terms of the autocorrelation sequence and the prediction error filter is minimum-phase. Furthermore, it can be implemented using either a direct-form structure with coefficients $a_1^{(M)}, a_2^{(M)}, \ldots, a_M^{(M)}$ or a lattice structure with parameters $k_0, k_1, \ldots, k_{M-1}$. Next we show how to convert between the following equivalent representations of a linear predictor:

1. Direct-form filter structure: $\{P_M, a_1, a_2, \ldots, a_M\}$.
2. Lattice filter structure: $\{P_M, k_0, k_1, \ldots, k_{M-1}\}$.
3. Autocorrelation sequence: $\{r(0), r(1), \ldots, r(M)\}$.

The transformation between the above representations is performed using the algorithms shown in Figure 7.7.

FIGURE 7.7 Equivalent representations for minimum-phase linear prediction error filters. (The diagram links the three representations $r(0), r(1), \ldots, r(M)$; $P_M, k_0, \ldots, k_{M-1}$; and $P_M, a_1, \ldots, a_M$ through the Levinson-Durbin and inverse Levinson-Durbin algorithms, the Schur and inverse Schur algorithms, and the step-up and step-down recursions.)

Lattice-to-direct (step-up) recursion. Given the lattice parameters $k_0, k_1, \ldots, k_{M-1}$ and the MMSE $P_M$, we can compute the forward predictor $\mathbf{a}_M$ by using the following recursions

$$\mathbf{a}_m = \begin{bmatrix}\mathbf{a}_{m-1}\\ 0\end{bmatrix} + \begin{bmatrix}\mathbf{J}\mathbf{a}_{m-1}^*\\ 1\end{bmatrix}k_{m-1} \tag{7.5.27}$$

$$P_m = P_{m-1}(1 - |k_{m-1}|^2) \tag{7.5.28}$$

for $m = 1, 2, \ldots, M$. This conversion is implemented by the function [a,PM]=stepup(k).

Direct-to-lattice (step-down) recursion. Using the partitioning

$$\bar{\mathbf{a}}_m = [a_1^{(m)}\;\; a_2^{(m)}\;\; \cdots\;\; a_{m-1}^{(m)}]^T \qquad k_{m-1} = a_m^{(m)} \tag{7.5.29}$$

we can write recursion (7.5.27) as

$$\bar{\mathbf{a}}_m = \mathbf{a}_{m-1} + \mathbf{J}\mathbf{a}_{m-1}^*k_{m-1}$$

or by taking the complex conjugate and multiplying both sides by $\mathbf{J}$

$$\mathbf{J}\bar{\mathbf{a}}_m^* = \mathbf{J}\mathbf{a}_{m-1}^* + \mathbf{a}_{m-1}k_{m-1}^*$$

Eliminating $\mathbf{J}\mathbf{a}_{m-1}^*$ from the last two equations and solving for $\mathbf{a}_{m-1}$, we obtain

$$\mathbf{a}_{m-1} = \frac{\bar{\mathbf{a}}_m - \mathbf{J}\bar{\mathbf{a}}_m^*k_{m-1}}{1 - |k_{m-1}|^2} \tag{7.5.30}$$

From (7.5.28), we have

$$P_{m-1} = \frac{P_m}{1 - |k_{m-1}|^2} \tag{7.5.31}$$

Given $\mathbf{a}_M$ and $P_M$, we can obtain $k_m$ and $P_m$ for $0 \le m \le M-1$ by computing the last two recursions for $m = M, M-1, \ldots, 2$. We should stress that both recursions break down if $|k_m| = 1$. The step-down algorithm is implemented by the function [k]=stepdown(a).

EXAMPLE 7.5.1. Given the third-order FLP coefficients $a_1^{(3)}$, $a_2^{(3)}$, $a_3^{(3)}$, compute the lattice parameters $k_0$, $k_1$, $k_2$.

Solution. With the help of (7.5.29) the vector relation (7.5.30) can be written in scalar form as

$$k_{m-1} = a_m^{(m)} \tag{7.5.32}$$

and

$$a_i^{(m-1)} = \frac{a_i^{(m)} - a_{m-i}^{(m)*}k_{m-1}}{1 - |k_{m-1}|^2} \tag{7.5.33}$$

which can be used to implement the step-down algorithm for $m = M, M-1, \ldots, 2$ and $i = 1, 2, \ldots, m-1$. Starting with $m = 3$ and $i = 1, 2$, we have

$$k_2 = a_3^{(3)} \qquad a_1^{(2)} = \frac{a_1^{(3)} - a_2^{(3)*}k_2}{1 - |k_2|^2} \qquad a_2^{(2)} = \frac{a_2^{(3)} - a_1^{(3)*}k_2}{1 - |k_2|^2}$$

Similarly, for $m = 2$ and $i = 1$, we obtain

$$k_1 = a_2^{(2)} \qquad a_1^{(1)} = \frac{a_1^{(2)} - a_1^{(2)*}k_1}{1 - |k_1|^2} = k_0$$

which completes the solution.

The step-up and step-down recursions also can be expressed in polynomial form as

$$A_m(z) = A_{m-1}(z) + k^*_{m-1}\, z^{-m} A^*_{m-1}\!\left(\frac{1}{z^*}\right) \tag{7.5.34}$$

and

$$A_{m-1}(z) = \frac{A_m(z) - k^*_{m-1}\, z^{-m} A^*_m(1/z^*)}{1 - |k_{m-1}|^2} \tag{7.5.35}$$

respectively.

Lattice parameters to autocorrelation. If we know the lattice parameters $k_0, k_1, \ldots, k_{M-1}$ and $P_M$, we can compute the values $r(0), r(1), \ldots, r(M)$ of the autocorrelation sequence using the formula

$$r(m+1) = -k^*_m P_m - \mathbf{a}_m^H \mathbf{J}\mathbf{r}_m \tag{7.5.36}$$

which follows from (7.4.16) and (7.4.17), in conjunction with (7.5.27) and (7.4.21) for $m = 1, 2, \ldots, M$. Equation (7.5.36) is obtained by eliminating $\beta_m$ from (7.4.9) and (7.4.10). This algorithm is used by the function r=k2r(k,PM). Another algorithm that computes the autocorrelation sequence from the lattice coefficients and does not require the intermediate computation of $\mathbf{a}_m$ is provided in Section 7.6.

EXAMPLE 7.5.2. Given $P_0$, $k_0$, $k_1$, and $k_2$, compute the autocorrelation values $r(0)$, $r(1)$, $r(2)$, and $r(3)$.

Solution. Using $r(0) = P_0$ and

$$r(m+1) = -k^*_m P_m - \mathbf{a}_m^H \mathbf{J}\mathbf{r}_m$$

for $m = 0$, we have

$$r(1) = -k_0^* P_0$$

For $m = 1$

$$r(2) = -k_1^* P_1 - a_1^{(1)*} r(1)$$

where

$$P_1 = P_0\,(1 - |k_0|^2)$$

Finally, for $m = 2$ we obtain

$$r(3) = -k_2^* P_2 - [a_1^{(2)*} r(2) + k_1^* r(1)]$$

where

$$P_2 = P_1\,(1 - |k_1|^2)$$

and

$$a_1^{(2)} = a_1^{(1)} + a_1^{(1)*} k_1 = k_0 + k_0^* k_1$$

from the Levinson recursion.

Direct parameters to autocorrelation. Given aM and PM , we can compute the autocorrelation sequence r(0), r(1), . . . , r(M) by using (7.5.29) through (7.5.36). This method is known as the inverse Levinson algorithm and is implemented by the function r=a2r(a,PM).
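The three conversions of this section can be sketched in a few lines of Python with NumPy. This is only an illustrative sketch under stated assumptions, not the book's `stepup`, `stepdown`, or `k2r` functions: the names are reused for readability, the argument `P0` (equal to $r(0)$) is an assumed interface, and $\mathbf{r}_m = [r(1) \cdots r(m)]^T$ is assumed, so the complex-conjugate placement should be checked against the chapter's definitions before use with complex data.

```python
import numpy as np

def stepup(k, P0):
    """Lattice-to-direct recursion, following (7.5.27)-(7.5.28)."""
    a = np.zeros(0, dtype=complex)
    P = float(P0)
    for km in np.asarray(k, dtype=complex):
        # a_m = [a_{m-1}; 0] + [J a*_{m-1}; 1] k_{m-1}
        a = np.concatenate([a + km * np.conj(a[::-1]), [km]])
        P *= 1.0 - abs(km) ** 2
    return a, P

def stepdown(a):
    """Direct-to-lattice recursion, following (7.5.29)-(7.5.30)."""
    a = np.asarray(a, dtype=complex)
    k = np.zeros(len(a), dtype=complex)
    for m in range(len(a), 0, -1):
        k[m - 1] = a[-1]                      # k_{m-1} = a_m^{(m)}
        denom = 1.0 - abs(k[m - 1]) ** 2
        if np.isclose(denom, 0.0):
            raise ValueError("recursion breaks down when |k_m| = 1")
        abar = a[:-1]
        a = (abar - k[m - 1] * np.conj(abar[::-1])) / denom
    return k

def k2r(k, P0):
    """Lattice-to-autocorrelation, following (7.5.36); assumes r(0) = P0."""
    r = [complex(P0)]
    a = np.zeros(0, dtype=complex)
    P = float(P0)
    for km in np.asarray(k, dtype=complex):
        m = len(a)
        # a_m^H J r_m = sum_i conj(a_i^{(m)}) r(m + 1 - i)
        corr = sum(np.conj(a[i]) * r[m - i] for i in range(m))
        r.append(-np.conj(km) * P - corr)
        a = np.concatenate([a + km * np.conj(a[::-1]), [km]])
        P *= 1.0 - abs(km) ** 2
    return np.array(r)
```

For the real-valued lattice parameters $k_0 = -2/3$, $k_1 = 1/5$, $k_2 = -1/16$ with $P_0 = 3$, `k2r` returns $3, 2, 1, 1/2$, and `stepdown(stepup(k, 3.0)[0])` recovers the original lattice parameters.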


7.6 ALGORITHM OF SCHÜR

The algorithm of Schür is an order-recursive procedure for the computation of the lattice parameters $k_1, k_2, \ldots, k_M$ of the optimum forward predictor from the autocorrelation sequence $r(0), r(1), \ldots, r(M)$ without computing the direct-form coefficients $\mathbf{a}_m$, $m = 1, 2, \ldots, M$. The reverse process is known as the inverse Schür algorithm. The algorithm also can be extended to compute the ladder parameters of the optimum filter and the $LDL^H$ decomposition of a Toeplitz matrix. The algorithm has its roots in the original work of Schür (Schür 1917), who developed a procedure to test whether a polynomial is analytic and bounded in the unit disk.

7.6.1 Direct Schür Algorithm

We start by defining the cross-correlation sequences between $e_m^f(n)$, $e_m^b(n)$, and $x(n)$

$$\xi_m^f(l) \triangleq E\{x(n-l)\, e_m^{f*}(n)\} \qquad \text{with } \xi_m^f(l) = 0, \text{ for } 1 \le l \le m \tag{7.6.1}$$

$$\xi_m^b(l) \triangleq E\{x(n-l)\, e_m^{b*}(n)\} \qquad \text{with } \xi_m^b(l) = 0, \text{ for } 0 \le l < m \tag{7.6.2}$$

which are also known as gapped functions because of the regions of zeros created by the orthogonality principle (Robinson and Treitel 1980). Multiplying the direct-form equations (7.5.1) and (7.5.2) by $x^*(n-l)$ and taking the mathematical expectation of both sides, we obtain

$$\xi_m^f(l) = r(l) + \mathbf{a}_m^H\, \tilde{\mathbf{r}}_m(l-1) \tag{7.6.3}$$

and

$$\xi_m^b(l) = r(l-m) + \mathbf{b}_m^H\, \tilde{\mathbf{r}}_m(l) \tag{7.6.4}$$

where

$$\tilde{\mathbf{r}}_m(l) \triangleq [r(l) \;\; r(l-1) \;\; \cdots \;\; r(l-m+1)]^T \tag{7.6.5}$$

We notice that $\xi_m^f(l)$ and $\xi_m^b(l)$ can be interpreted as forward and backward autocorrelation prediction errors, because they occur when we feed the sequence $r(0), r(1), \ldots, r(m+1)$ through the optimum predictors $\mathbf{a}_m$ and $\mathbf{b}_m$ of the process $x(n)$. Using the property $\mathbf{b}_m = \mathbf{J}\mathbf{a}_m^*$, we can show that (see Problem 7.29)

$$\xi_m^b(l) = \xi_m^{f*}(m-l) \tag{7.6.6}$$

If we set $l = m+1$ in (7.6.3) and $l = m$ in (7.6.4), and notice that $\tilde{\mathbf{r}}_m(m) = \mathbf{J}\mathbf{r}_m^*$, then we have

$$\xi_m^f(m+1) = r(m+1) + \mathbf{a}_m^H \mathbf{J}\mathbf{r}_m^* = \beta_m^* \tag{7.6.7}$$

and

$$\xi_m^b(m) = r(0) + \mathbf{r}_m^H \mathbf{J}\mathbf{b}_m = P_m \tag{7.6.8}$$

respectively. Therefore, we have

$$k_m = -\frac{\beta_m}{P_m} = -\frac{\xi_m^f(m+1)}{\xi_m^b(m)} \tag{7.6.9}$$

that is, we can compute $k_m$ in terms of $\xi_m^f(l)$ and $\xi_m^b(l)$. Multiplying the lattice recursions (7.5.7) by $x^*(n-l)$ and taking the mathematical expectation of both sides, we obtain

$$\begin{aligned}
\xi_0^f(l) &= \xi_0^b(l) = r(l) \\
\xi_m^f(l) &= \xi_{m-1}^f(l) + k^*_{m-1}\, \xi_{m-1}^b(l-1) \qquad m = 1, 2, \ldots, M \\
\xi_m^b(l) &= k_{m-1}\, \xi_{m-1}^f(l) + \xi_{m-1}^b(l-1) \qquad m = 1, 2, \ldots, M
\end{aligned} \tag{7.6.10}$$

which provides a lattice structure for the computation of the cross-correlations $\xi_m^f(l)$ and $\xi_m^b(l)$. In contrast, (7.6.7) and (7.6.8) provide a computation using a direct-form structure. In the next example we illustrate how to use the lattice structure (7.6.10) to compute the lattice parameters $k_1, k_2, \ldots, k_M$ from the autocorrelation sequence $r(0), r(1), \ldots, r(M)$ without the intermediate explicit computation of the predictor coefficients $\mathbf{a}_m$.

EXAMPLE 7.6.1. Use the algorithm of Schür to compute the lattice parameters $\{k_0, k_1, k_2\}$ and the MMSE $P_3$ from the autocorrelation sequence coefficients

$$r(0) = 3 \qquad r(1) = 2 \qquad r(2) = 1 \qquad r(3) = \tfrac{1}{2}$$

Solution. Starting with (7.6.9) for $m = 0$, we have

$$k_0 = -\frac{\xi_0^f(1)}{\xi_0^b(0)} = -\frac{r(1)}{r(0)} = -\frac{2}{3}$$

because $\xi_0^f(l) = \xi_0^b(l) = r(l)$. To compute $k_1$, we need $\xi_1^f(2)$ and $\xi_1^b(1)$, which are obtained from (7.6.10) by setting $l = 2$ in the forward recursion and $l = 1$ in the backward recursion. Indeed, we have

$$\xi_1^f(2) = \xi_0^f(2) + k_0\, \xi_0^b(1) = 1 + (-\tfrac{2}{3})\,2 = -\tfrac{1}{3}$$

$$\xi_1^b(1) = \xi_0^b(0) + k_0\, \xi_0^f(1) = 3 + (-\tfrac{2}{3})\,2 = \tfrac{5}{3} = P_1$$

and

$$k_1 = -\frac{\xi_1^f(2)}{\xi_1^b(1)} = -\frac{-\tfrac{1}{3}}{\tfrac{5}{3}} = \frac{1}{5}$$

The computation of $k_2$ requires $\xi_2^f(3)$ and $\xi_2^b(2)$, which in turn need $\xi_1^f(3)$ and $\xi_1^b(2)$. These quantities are computed by

$$\xi_1^f(3) = \xi_0^f(3) + k_0\, \xi_0^b(2) = \tfrac{1}{2} + (-\tfrac{2}{3})\,1 = -\tfrac{1}{6}$$

$$\xi_1^b(2) = \xi_0^b(1) + k_0\, \xi_0^f(2) = 2 + (-\tfrac{2}{3})\,1 = \tfrac{4}{3}$$

$$\xi_2^f(3) = \xi_1^f(3) + k_1\, \xi_1^b(2) = -\tfrac{1}{6} + \tfrac{1}{5}\cdot\tfrac{4}{3} = \tfrac{1}{10}$$

$$\xi_2^b(2) = \xi_1^b(1) + k_1\, \xi_1^f(2) = \tfrac{5}{3} + \tfrac{1}{5}(-\tfrac{1}{3}) = \tfrac{8}{5} = P_2$$

and the lattice coefficient is

$$k_2 = -\frac{\xi_2^f(3)}{\xi_2^b(2)} = -\frac{\tfrac{1}{10}}{\tfrac{8}{5}} = -\frac{1}{16}$$

The final MMSE is computed by

$$P_3 = P_2\,(1 - |k_2|^2) = \tfrac{8}{5}\left(1 - \tfrac{1}{256}\right) = \tfrac{51}{32}$$

although we could use the formula $\xi_m^b(m) = P_m$ as well. Therefore the lattice coefficients and the MMSE are found to be

$$k_0 = -\tfrac{2}{3} \qquad k_1 = \tfrac{1}{5} \qquad k_2 = -\tfrac{1}{16} \qquad P_3 = \tfrac{51}{32}$$

It is worthwhile to notice that the $k_m$ parameters can be obtained by “feeding” the sequence $r(0), r(1), \ldots, r(M)$ through the lattice filter as a signal and switching on the stages one by one after computing the required lattice coefficient. The value of $k_m$ is computed at time $n = m$ from the inputs to stage $m$ (see Problem 7.30). The procedure outlined in the above example is known as the algorithm of Schür and has good numerical properties because the quantities used in the lattice structure (7.6.10) are bounded. Indeed, from (7.6.1) and (7.6.2) we have

$$|\xi_m^f(l)|^2 \le E\{|e_m^f(n)|^2\}\, E\{|x(n-l)|^2\} \le P_m\, r(0) \le r^2(0) \tag{7.6.11}$$

$$|\xi_m^b(l)|^2 \le E\{|e_m^b(n)|^2\}\, E\{|x(n-l)|^2\} \le P_m\, r(0) \le r^2(0) \tag{7.6.12}$$

because Pm ≤ P0 = r(0). As a result of this fixed dynamic range, the algorithm of Schür can be easily implemented with fixed-point arithmetic. The numeric stability of the Schür algorithm provided the motivation for its use in speech processing applications (LeRoux and Gueguen 1977).

7.6.2 Implementation Considerations

Figure 7.8 clarifies the computational steps in Example 7.4.2, using three decomposition trees that indicate the quantities needed to compute $k_0$, $k_1$, and $k_2$ when we use the lattice recursions (7.6.10) for real-valued signals. We can easily see that the computations for $k_0$ are part of those for $k_1$, which in turn are part of the computations for $k_2$. Thus, the tree for $k_2$ includes also the quantities needed to compute $k_0$ and $k_1$. The computations required to compute $k_0$, $k_1$, $k_2$, and $k_3$ are

1. $k_0 = -\xi_0^f(1) / \xi_0^b(0)$
2. $\xi_1^f(4) = \xi_0^f(4) + k_0\, \xi_0^b(3)$
3. $\xi_1^b(3) = \xi_0^b(2) + k_0\, \xi_0^f(3)$
4. $\xi_1^f(3) = \xi_0^f(3) + k_0\, \xi_0^b(2)$
5. $\xi_1^b(2) = \xi_0^b(1) + k_0\, \xi_0^f(2)$
6. $\xi_1^f(2) = \xi_0^f(2) + k_0\, \xi_0^b(1)$
7. $\xi_1^b(1) = \xi_0^b(0) + k_0\, \xi_0^f(1)$
8. $k_1 = -\xi_1^f(2) / \xi_1^b(1)$
9. $\xi_2^f(4) = \xi_1^f(4) + k_1\, \xi_1^b(3)$
10. $\xi_2^b(3) = \xi_1^b(2) + k_1\, \xi_1^f(3)$
11. $\xi_2^f(3) = \xi_1^f(3) + k_1\, \xi_1^b(2)$
12. $\xi_2^b(2) = \xi_1^b(1) + k_1\, \xi_1^f(2)$
13. $k_2 = -\xi_2^f(3) / \xi_2^b(2)$
14. $\xi_3^f(4) = \xi_2^f(4) + k_2\, \xi_2^b(3)$
15. $\xi_3^b(3) = \xi_2^b(2) + k_2\, \xi_2^f(3)$
16. $k_3 = -\xi_3^f(4) / \xi_3^b(3)$

With the help of the corresponding tree decomposition diagram, this can be arranged as shown in Figure 7.9. The obtained computational structure was named the superlattice because it consists of a triangular array of latticelike stages (Carayannis et al. 1985). Note that the superlattice has no redundancy and is characterized by local interconnections; that is, the quantities needed at any given node are available from the immediate neighbors. The two-dimensional layout of the superlattice suggests various algorithms to perform the computations.

1. Parallel algorithm. We first note that all equations involving the coefficient $k_m$ constitute one stage of the superlattice and can be computed in parallel after the computation of $k_m$ because all inputs to the current stage are available from the previous one. This algorithm can be implemented by $2(M-1)$ processors in $M-1$ “parallel” steps (Kung and Hu 1983). Since each step involves one division to compute $k_m$ and then $2(M-m)$ multiplications and additions for the parallel computations, the number of utilized processors decreases from $2(M-1)$ to 1. The algorithm is not order-recursive because the order $M$ must be known before the superlattice structure is set up.

2. Sequential algorithm. A sequential implementation of the parallel algorithm is essentially equivalent to the version introduced for speech processing applications (LeRoux and Gueguen 1977). This algorithm, which is implemented by the function k=schurlg(r,M) and summarized in Table 7.4, starts with Equation (1) and computes sequentially Equations (2), (3), etc.


FIGURE 7.8 Tree decomposition for the computations required by the algorithm of Schür.

3. Sequential order-recursive algorithm. The parallel algorithm starts at the left of the superlattice and performs the computations within the vertical strips in parallel. Clearly, the order $M$ should be fixed before we start, and the algorithm is not order-recursive. Careful inspection of the superlattice reveals that we can obtain an order-recursive algorithm by organizing the computations in terms of the slanted shadowed strips shown in Figure 7.9. Indeed, we start with $k_0$ and then perform the computations in the first slanted strip to determine the quantities $\xi_1^f(2)$ and $\xi_1^b(1)$ needed to compute $k_1$. We proceed with the next slanted strip, compute $k_2$, and conclude with the computation of the last strip and $k_3$. The computations within each slanted strip are performed sequentially.

4. Partitioned-parallel algorithm. Suppose that we have $P$ processors with $P < M$. This algorithm partitions the superlattice into groups of $P$ consecutive slanted strips (partitions) and performs the computations of each partition, in parallel, using the $P$ processors. It turns out that by storing some intermediate quantities, we have everything needed by the superlattice to compute all the partitions, one at a time (Koukoutsis et al. 1991). This algorithm provides a very convenient scheme for the implementation of the superlattice using multiprocessing (see Problem 7.31).


FIGURE 7.9 Superlattice structure organization of the algorithm of Schür. The input is the autocorrelation sequence and the output the lattice parameters.

TABLE 7.4 Summary of the algorithm of Schür.

1. Input: $\{r(l)\}_0^M$

2. Initialization
   (a) For $l = 0, 1, \ldots, M$: $\xi_0^f(l) = \xi_0^b(l) = r(l)$
   (b) $k_0 = -\xi_0^f(1) / \xi_0^b(0)$
   (c) $P_1 = r(0)\,(1 - |k_0|^2)$

3. For $m = 1, 2, \ldots, M-1$
   (a) For $l = m, m+1, \ldots, M$:
       $\xi_m^f(l) = \xi_{m-1}^f(l) + k^*_{m-1}\, \xi_{m-1}^b(l-1)$
       $\xi_m^b(l) = k_{m-1}\, \xi_{m-1}^f(l) + \xi_{m-1}^b(l-1)$
   (b) $k_m = -\xi_m^f(m+1) / \xi_m^b(m)$
   (c) $P_{m+1} = P_m\,(1 - |k_m|^2)$

4. Output: $\{k_m\}_0^{M-1}$, $\{P_m\}_1^M$
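The sequential form of the algorithm summarized in Table 7.4 can be sketched as follows. This is a minimal illustration, not the book's `schurlg` function: the name `schur`, the in-place array layout, and the returned list of $P_m$ values (including $P_0 = r(0)$) are assumptions made for this sketch.

```python
import numpy as np

def schur(r):
    """Lattice parameters k_0..k_{M-1} and errors P_0..P_M from r(0)..r(M)."""
    r = np.asarray(r, dtype=complex)
    M = len(r) - 1
    xf = r.copy()          # xf[l] holds xi^f_m(l) for the current order m
    xb = r.copy()          # xb[l] holds xi^b_m(l) for the current order m
    k = np.zeros(M, dtype=complex)
    P = np.zeros(M + 1)
    P[0] = r[0].real
    for m in range(M):
        k[m] = -xf[m + 1] / xb[m]                    # (7.6.9)
        P[m + 1] = P[m] * (1.0 - abs(k[m]) ** 2)
        # Order update (7.6.10) for l = m+1, ..., M; the right-hand sides
        # are evaluated from the old arrays before either one is overwritten.
        xf[m + 1:], xb[m + 1:] = (
            xf[m + 1:] + np.conj(k[m]) * xb[m:M],
            k[m] * xf[m + 1:] + xb[m:M],
        )
    return k, P
```

With the autocorrelation values $r = \{3, 2, 1, 1/2\}$ the sketch returns $k = \{-2/3, 1/5, -1/16\}$ and $P_3 = 51/32$. The direct-form coefficients are never formed, and every stored quantity stays bounded by $r(0)$ in magnitude, as (7.6.11) and (7.6.12) state.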

Extended Schür algorithm. To extend the Schür algorithm for the computation of the ladder parameters $k_m^c$, we define the cross-correlation sequence

$$\xi_m^c(l) \triangleq E\{x(n-l)\, e_m^*(n)\} \qquad \text{with } \xi_m^c(l) = 0, \text{ for } 0 \le l < m \tag{7.6.13}$$

due to the orthogonality principle. Multiplying (7.5.8) by $x^*(n-l)$ and taking the mathematical expectation, we obtain a direct form

$$\xi_m^c(l) = d_{l+1} - \mathbf{c}_m^T\, \tilde{\mathbf{r}}_m(l) \tag{7.6.14}$$

and a ladder-form equation

$$\xi_{m+1}^c(l) = \xi_m^c(l) - k_m^{c*}\, \xi_m^b(l) \tag{7.6.15}$$

For $l = m$, we have

$$\xi_m^c(m) = d_{m+1} - \mathbf{c}_m^H \mathbf{J}\mathbf{r}_m = \beta_m^c \tag{7.6.16}$$

and

$$k_m^c = \frac{\beta_m^c}{P_m} = \frac{\xi_m^c(m)}{\xi_m^b(m)} \tag{7.6.17}$$

that is, we can compute the sequence $k_m^c$ using a lattice-ladder structure. The computations can be arranged in the form of a superladder structure, shown in Figure 7.10 (Koukoutsis et al. 1991). See also Problem 7.32. In turn, (7.6.17) can be used in conjunction with the superlattice to determine the lattice-ladder parameters of the optimum FIR filter. The superladder structure is illustrated in the following example.

FIGURE 7.10 Graphical illustration of the superladder structure.


EXAMPLE 7.6.2. Determine the lattice-ladder parameters of an optimum FIR filter with input autocorrelation sequence given in Example 7.6.1 and cross-correlation sequence $d_1 = 1$, $d_2 = 2$, and $d_3 = \tfrac{5}{2}$, using the extended Schür algorithm.

Solution. Since the lattice parameters were obtained in Example 7.6.1, we only need to find the ladder parameters. Hence, using (7.6.15), (7.6.17), the initial values $\xi_0^c(l) = d_{l+1}$, and the values of $\xi_m^b(l)$ computed in Example 7.6.1, we have

$$k_0^c = \frac{\xi_0^c(0)}{\xi_0^b(0)} = \frac{d_1}{r(0)} = \frac{1}{3}$$

$$\xi_1^c(1) = \xi_0^c(1) - k_0^c\, \xi_0^b(1) = 2 - \tfrac{1}{3}(2) = \tfrac{4}{3}$$

$$\xi_1^c(2) = \xi_0^c(2) - k_0^c\, \xi_0^b(2) = \tfrac{5}{2} - \tfrac{1}{3}(1) = \tfrac{13}{6}$$

$$k_1^c = \frac{\xi_1^c(1)}{\xi_1^b(1)} = \frac{\tfrac{4}{3}}{\tfrac{5}{3}} = \frac{4}{5}$$

$$\xi_2^c(2) = \xi_1^c(2) - k_1^c\, \xi_1^b(2) = \tfrac{13}{6} - \tfrac{4}{5}\left(\tfrac{4}{3}\right) = \tfrac{11}{10}$$

$$k_2^c = \frac{\xi_2^c(2)}{\xi_2^b(2)} = \frac{\tfrac{11}{10}}{\tfrac{8}{5}} = \frac{11}{16}$$

which provide the values of the ladder parameters. These values are identical to those obtained in Example 7.4.3.
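The ladder computation can be sketched in code for real-valued data, following the sign convention of (7.6.15) and (7.6.17), that is, $k_m^c = \xi_m^c(m)/\xi_m^b(m)$ and $\xi_{m+1}^c(l) = \xi_m^c(l) - k_m^c\,\xi_m^b(l)$. The function name `extended_schur` and its interface are assumptions made for this illustration; it is a sketch, not a reference implementation.

```python
import numpy as np

def extended_schur(r, d):
    """Lattice (k) and ladder (kc) parameters for real-valued data.

    r : r(0), ..., r(M-1)   autocorrelation of the input
    d : d_1, ..., d_M       cross-correlation, with xi^c_0(l) = d_{l+1}
    """
    d = np.asarray(d, dtype=float)
    M = len(d)
    r = np.asarray(r, dtype=float)[:M]
    xf, xb, xc = r.copy(), r.copy(), d.copy()
    k = np.zeros(M - 1)
    kc = np.zeros(M)
    for m in range(M):
        kc[m] = xc[m] / xb[m]                  # (7.6.17)
        xc[m:] = xc[m:] - kc[m] * xb[m:]       # (7.6.15), real-valued case
        if m < M - 1:
            k[m] = -xf[m + 1] / xb[m]          # (7.6.9)
            # Order update (7.6.10) for l = m+1, ..., M-1
            xf[m + 1:], xb[m + 1:] = (
                xf[m + 1:] + k[m] * xb[m:M - 1],
                k[m] * xf[m + 1:] + xb[m:M - 1],
            )
    return k, kc
```

For $r = \{3, 2, 1\}$ and $d = \{1, 2, 5/2\}$ the sketch gives the ladder parameters $1/3$, $4/5$, $11/16$ together with the lattice parameters $-2/3$ and $1/5$.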

7.6.3 Inverse Schür Algorithm

The inverse Schür algorithm computes the autocorrelation sequence coefficients $r(0), r(1), \ldots, r(M)$ from the lattice parameters $k_0, k_1, \ldots, k_{M-1}$ and the MMSE $P_M$ of the linear predictor. The organization of computations is best illustrated by the following example.

EXAMPLE 7.6.3. Given the lattice filter coefficients

$$k_0 = -\tfrac{2}{3} \qquad k_1 = \tfrac{1}{5} \qquad k_2 = -\tfrac{1}{16}$$

and the MMSE $P_3 = 51/32$, compute the autocorrelation samples $r(0)$, $r(1)$, $r(2)$, and $r(3)$, using the inverse Schür algorithm.

Solution. We base our approach on the part of the superlattice structure shown in Figure 7.9 that is enclosed by the nodes $\xi_0^b(0)$, $\xi_0^f(3)$, $\xi_2^f(3)$, and $\xi_2^b(2)$. To start at the lower left corner, we compute $r(0)$, using (7.4.30):

$$r(0) = \frac{P_3}{\prod_{m=0}^{2} (1 - |k_m|^2)} = \frac{\tfrac{51}{32}}{(1 - \tfrac{4}{9})(1 - \tfrac{1}{25})(1 - \tfrac{1}{256})} = 3$$

This also follows from (7.5.31). Then, continuing the computations from the line defined by $r(0)$ and $\xi_2^b(2)$ to the node defined by $\xi_0^f(3) = r(3)$, we have

$$\begin{aligned}
r(1) &= -k_0\, r(0) = -(-\tfrac{2}{3})\,3 = 2 \\
\xi_1^b(1) &= \xi_0^b(0) + k_0\, \xi_0^f(1) = 3 + (-\tfrac{2}{3})\,2 = \tfrac{5}{3} \\
\xi_1^f(2) &= -k_1\, \xi_1^b(1) = -\tfrac{1}{5}\left(\tfrac{5}{3}\right) = -\tfrac{1}{3} \\
r(2) &= \xi_0^f(2) = \xi_1^f(2) - k_0\, \xi_0^b(1) = -\tfrac{1}{3} - (-\tfrac{2}{3})\,2 = 1 \\
\xi_1^b(2) &= \xi_0^b(1) + k_0\, \xi_0^f(2) = 2 + (-\tfrac{2}{3})\,1 = \tfrac{4}{3} \\
\xi_2^b(2) &= \xi_1^b(1) + k_1\, \xi_1^f(2) = \tfrac{5}{3} + \tfrac{1}{5}(-\tfrac{1}{3}) = \tfrac{8}{5} \\
\xi_2^f(3) &= -k_2\, \xi_2^b(2) = -(-\tfrac{1}{16})\left(\tfrac{8}{5}\right) = \tfrac{1}{10} \\
\xi_1^f(3) &= \xi_2^f(3) - k_1\, \xi_1^b(2) = \tfrac{1}{10} - \tfrac{1}{5}\left(\tfrac{4}{3}\right) = -\tfrac{1}{6} \\
r(3) &= \xi_0^f(3) = \xi_1^f(3) - k_0\, \xi_0^b(2) = -\tfrac{1}{6} - (-\tfrac{2}{3})\,1 = \tfrac{1}{2}
\end{aligned}$$

as can be easily verified by the reader. Thus, the autocorrelation sequence is

$$r(0) = 3 \qquad r(1) = 2 \qquad r(2) = 1 \qquad r(3) = \tfrac{1}{2}$$

which agrees with the autocorrelation sequence coefficients used in Example 7.6.1 with the direct Schür algorithm.

The inverse Schür algorithm is implemented by the function r=invschur(k,PM), which follows the same procedure as the previous example.
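One way to organize these computations is sketched below for real-valued lattice parameters. It is not the book's `invschur` function; the name `inv_schur`, the two-dimensional storage of $\xi_m^f(l)$ and $\xi_m^b(l)$, and the loop ordering are assumptions chosen for clarity rather than efficiency.

```python
import numpy as np

def inv_schur(k, PM):
    """Recover r(0)..r(M) from real lattice parameters k_0..k_{M-1} and P_M."""
    k = np.asarray(k, dtype=float)
    M = len(k)
    xf = np.zeros((M + 1, M + 1))     # xf[m, l] = xi^f_m(l)
    xb = np.zeros((M + 1, M + 1))     # xb[m, l] = xi^b_m(l)
    r0 = PM / np.prod(1.0 - k ** 2)   # r(0) = P_M / prod(1 - k_m^2)
    xf[0, 0] = xb[0, 0] = r0
    for n in range(1, M + 1):         # recover r(n)
        # (7.6.9) solved for the forward quantity: xi^f_{n-1}(n) = -k_{n-1} P_{n-1}
        xf[n - 1, n] = -k[n - 1] * xb[n - 1, n - 1]
        # Run the forward recursion of (7.6.10) backward in order
        for m in range(n - 1, 0, -1):
            xf[m - 1, n] = xf[m, n] - k[m - 1] * xb[m - 1, n - 1]
        xb[0, n] = xf[0, n]           # xi^b_0(n) = xi^f_0(n) = r(n)
        # Fill the backward quantities needed at the next lag
        for m in range(1, n + 1):
            xb[m, n] = k[m - 1] * xf[m - 1, n] + xb[m - 1, n - 1]
    return xf[0].copy()
```

Called with $k = \{-2/3, 1/5, -1/16\}$ and $P_3 = 51/32$, the sketch reproduces $r = \{3, 2, 1, 1/2\}$, following the same node order as the example.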

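7.7 TRIANGULARIZATION AND INVERSION OF TOEPLITZ MATRICES

In this section, we develop $LDL^H$ decompositions for both Toeplitz matrices and the inverse of Toeplitz matrices, followed by a recursion for the computation of the inverse of a Toeplitz matrix.

7.7.1 $LDL^H$ Decomposition of Inverse of a Toeplitz Matrix

Since $\mathbf{R}_m$ is a Hermitian Toeplitz matrix that also happens to be persymmetric, that is, $\mathbf{J}\mathbf{R}_m\mathbf{J} = \mathbf{R}_m^*$, taking its inverse, we obtain

$$\mathbf{J}\mathbf{R}_m^{-1}\mathbf{J} = (\mathbf{R}_m^*)^{-1} \tag{7.7.1}$$

The last equation shows that the inverse of a Toeplitz matrix, although not Toeplitz, is persymmetric. From (7.1.58), we recall that the BLP coefficients and the MMSE $P_m^b$ provide the quantities for the $UDU^H$ decomposition of $\mathbf{R}_{m+1}^{-1}$, that is,

$$\mathbf{R}_{m+1}^{-1} = \mathbf{B}_{m+1}^H \mathbf{D}_{m+1}^{-1} \mathbf{B}_{m+1} \tag{7.7.2}$$

where

$$\mathbf{B}_{m+1} = \begin{bmatrix}
1 & 0 & \cdots & 0 & 0 \\
b_0^{(1)} & 1 & \cdots & 0 & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
b_0^{(m-1)} & b_1^{(m-1)} & \cdots & 1 & 0 \\
b_0^{(m)} & b_1^{(m)} & \cdots & b_{m-1}^{(m)} & 1
\end{bmatrix} \tag{7.7.3}$$

and

$$\mathbf{D}_{m+1} = \mathrm{diag}\,\{P_0^b, P_1^b, \ldots, P_m^b\} \tag{7.7.4}$$

For a Toeplitz matrix $\mathbf{R}_{m+1}$, we can obtain the $LDL^H$ decomposition of its inverse by using (7.7.2) and the property $\mathbf{J} = \mathbf{J}^{-1}$ of the exchange matrix. Starting with (7.7.1), we obtain

$$(\mathbf{R}_{m+1}^*)^{-1} = \mathbf{J}\mathbf{R}_{m+1}^{-1}\mathbf{J} = (\mathbf{J}\mathbf{B}_{m+1}^H\mathbf{J})(\mathbf{J}\mathbf{D}_{m+1}^{-1}\mathbf{J})(\mathbf{J}\mathbf{B}_{m+1}\mathbf{J}) \tag{7.7.5}$$

If we define

$$\mathbf{A}_{m+1} \triangleq \mathbf{J}\mathbf{B}_{m+1}^*\mathbf{J} \tag{7.7.6}$$

and

$$\bar{\mathbf{D}}_{m+1} \triangleq \mathbf{J}\mathbf{D}_{m+1}\mathbf{J} = \mathrm{diag}\,\{P_m, P_{m-1}, \ldots, P_0\} \tag{7.7.7}$$

then (7.7.2) gives

$$\mathbf{R}_{m+1}^{-1} = \mathbf{A}_{m+1}^H\, \bar{\mathbf{D}}_{m+1}^{-1}\, \mathbf{A}_{m+1} \tag{7.7.8}$$

which provides the unique $LDL^H$ decomposition of the matrix $\mathbf{R}_{m+1}^{-1}$. Indeed, using the property $\mathbf{a}_j = \mathbf{J}\mathbf{b}_j^*$ for $1 \le j \le m$, or equivalently $a_i^{(j)} = b_{j-i}^{(j)*}$, we can write matrix $\mathbf{A}_{m+1} = \mathbf{J}\mathbf{B}_{m+1}^*\mathbf{J}$ as

$$\mathbf{A}_{m+1} = \begin{bmatrix}
1 & a_1^{(m)*} & a_2^{(m)*} & \cdots & a_m^{(m)*} \\
0 & 1 & a_1^{(m-1)*} & \cdots & a_{m-1}^{(m-1)*} \\
\vdots & \vdots & \ddots & \ddots & \vdots \\
0 & 0 & \cdots & 1 & a_1^{(1)*} \\
0 & 0 & \cdots & 0 & 1
\end{bmatrix} \tag{7.7.9}$$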

which is an upper unit triangular matrix. We stress that the property $\mathbf{J}\mathbf{B}_{m+1}^*\mathbf{J} = \mathbf{A}_{m+1}$ and the above derivation of (7.7.8) hold for Toeplitz matrices only. However, the decomposition in (7.7.2) holds for any Hermitian, positive definite matrix (see Section 7.1.4).

As we saw in Section 6.3, the solution of the normal equations $\mathbf{R}\mathbf{c} = \mathbf{d}$ can be obtained in three steps as

$$\mathbf{R} = \mathbf{L}\mathbf{D}\mathbf{L}^H \;\Rightarrow\; \mathbf{L}\mathbf{D}\mathbf{k}^c = \mathbf{d} \;\Rightarrow\; \mathbf{L}^H\mathbf{c} = \mathbf{k}^c \tag{7.7.10}$$

where the $LDL^H$ decomposition requires about $M^3/6$ flops and the solution of each triangular system $M^2/2$ flops. Since $\mathbf{R}^{-1} = \mathbf{B}^H\mathbf{D}^{-1}\mathbf{B}$, the Levinson-Durbin algorithm performs the $UDU^H$ decomposition of $\mathbf{R}^{-1}$ when $\mathbf{R}$ is Toeplitz, at a cost of $M^2$ flops; that is, it reduces the computational complexity by an order of magnitude. The Levinson recursion for the optimum filter is equivalent to the solution of the two triangular systems and requires $M^2$ operations.

EXAMPLE 7.7.1. Compute the lattice-ladder parameters of an MMSE finite impulse response filter specified by the normal equations

$$\begin{bmatrix} 3 & 2 & 1 \\ 2 & 3 & 2 \\ 1 & 2 & 3 \end{bmatrix}
\begin{bmatrix} h(0) \\ h(1) \\ h(2) \end{bmatrix} =
\begin{bmatrix} 1 \\ 2 \\ \tfrac{5}{2} \end{bmatrix}$$

using two different approaches: the $LDL^H$ decomposition and the algorithm of Levinson.

Solution. The $LDL^H$ decomposition of $\mathbf{R}$ is

$$\mathbf{L} = \begin{bmatrix} 1 & 0 & 0 \\ \tfrac{2}{3} & 1 & 0 \\ \tfrac{1}{3} & \tfrac{4}{5} & 1 \end{bmatrix}
\qquad
\mathbf{D} = \begin{bmatrix} 3 & 0 & 0 \\ 0 & \tfrac{5}{3} & 0 \\ 0 & 0 & \tfrac{8}{5} \end{bmatrix}
\qquad
\mathbf{L}^{-1} = \begin{bmatrix} 1 & 0 & 0 \\ -\tfrac{2}{3} & 1 & 0 \\ \tfrac{1}{5} & -\tfrac{4}{5} & 1 \end{bmatrix}$$

and using (7.3.31), we have

$$\mathbf{k}_3^c = \mathbf{D}^{-1}\mathbf{L}^{-1}\mathbf{d} = [\tfrac{1}{3} \;\; \tfrac{4}{5} \;\; \tfrac{11}{16}]^T$$

which gives the three ladder parameters. The two lattice parameters are obtained by solving the system

$$\mathbf{L}_2\mathbf{D}_2\mathbf{k}_2 = \mathbf{r}_2^b \qquad \text{with } \mathbf{r}_2^b = [1 \;\; 2]^T$$

which gives $k_0 = \tfrac{1}{3}$ and $k_1 = \tfrac{4}{5}$. The results agree with those obtained in Example 7.4.3 using the algorithm of Levinson. We also note that the rows of $\mathbf{L}^{-1}$ provide the first- and second-order forward and backward linear predictors. This is the case because the matrix is Toeplitz. For symmetric matrices the $LDL^H$ decomposition provides the backward predictors only.

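7.7.2 $LDL^H$ Decomposition of a Toeplitz Matrix

The computation of the $LDL^H$ decomposition of a symmetric, positive definite matrix requires on the order of $M^3$ computations. In Section 7.1, we saw that the cross-correlation between $x(n)$ and $e_m^b(n)$ is related to the $LDL^H$ decomposition of the correlation matrix $\mathbf{R}_m$. We next show that we can extend the Schür algorithm to compute the $LDL^H$ decomposition of a Toeplitz matrix with $O(M^2)$ computations using the cross-correlations $\xi_m^b(l)$. To illustrate the basic process, we note that evaluating the product on the left with the help of (7.6.4), we obtain

$$\begin{bmatrix}
r(0) & r(1) & r(2) & r(3) \\
r(1) & r(0) & r(1) & r(2) \\
r(2) & r(1) & r(0) & r(1) \\
r(3) & r(2) & r(1) & r(0)
\end{bmatrix}
\begin{bmatrix}
1 & b_0^{(1)*} & b_0^{(2)*} & b_0^{(3)*} \\
0 & 1 & b_1^{(2)*} & b_1^{(3)*} \\
0 & 0 & 1 & b_2^{(3)*} \\
0 & 0 & 0 & 1
\end{bmatrix} =
\begin{bmatrix}
\xi_0^b(0) & 0 & 0 & 0 \\
\xi_0^b(1) & \xi_1^b(1) & 0 & 0 \\
\xi_0^b(2) & \xi_1^b(2) & \xi_2^b(2) & 0 \\
\xi_0^b(3) & \xi_1^b(3) & \xi_2^b(3) & \xi_3^b(3)
\end{bmatrix}$$

that is, a lower triangular matrix $\tilde{\mathbf{L}}$, which can be written as

$$\tilde{\mathbf{L}} = \begin{bmatrix}
1 & 0 & 0 & 0 \\
\dfrac{\xi_0^b(1)}{P_0} & 1 & 0 & 0 \\
\dfrac{\xi_0^b(2)}{P_0} & \dfrac{\xi_1^b(2)}{P_1} & 1 & 0 \\
\dfrac{\xi_0^b(3)}{P_0} & \dfrac{\xi_1^b(3)}{P_1} & \dfrac{\xi_2^b(3)}{P_2} & 1
\end{bmatrix}
\begin{bmatrix}
P_0 & 0 & 0 & 0 \\
0 & P_1 & 0 & 0 \\
0 & 0 & P_2 & 0 \\
0 & 0 & 0 & P_3
\end{bmatrix} = \mathbf{L}\mathbf{D}$$

because $P_m = \xi_m^b(m) \ge 0$. Therefore, $\mathbf{R}\mathbf{B}^H = \mathbf{L}\mathbf{D}$ and since $\mathbf{R}$ is Hermitian, we have $\mathbf{R} = \mathbf{L}\mathbf{D}\mathbf{B}^{-H} = \mathbf{B}^{-1}\mathbf{D}\mathbf{L}^H$, which implies that $\mathbf{B}^{-1} = \mathbf{L}$. This results in the following $LDL^H$ factorization of the $(M+1) \times (M+1)$ symmetric Toeplitz matrix $\mathbf{R}$

$$\mathbf{R} = \mathbf{L}\mathbf{D}\mathbf{L}^H \tag{7.7.11}$$

where

$$\mathbf{L} = \mathbf{B}^{-1} = \begin{bmatrix}
1 & 0 & \cdots & 0 & 0 \\
\bar{\xi}_0^b(1) & 1 & \cdots & 0 & 0 \\
\bar{\xi}_0^b(2) & \bar{\xi}_1^b(2) & \cdots & 0 & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
\bar{\xi}_0^b(M) & \bar{\xi}_1^b(M) & \cdots & \bar{\xi}_{M-1}^b(M) & 1
\end{bmatrix} \tag{7.7.12}$$

$$\bar{\xi}_m^b(l) = \frac{\xi_m^b(l)}{\xi_m^b(m)} = \frac{\xi_m^b(l)}{P_m} \tag{7.7.13}$$

and

$$\mathbf{D} = \mathrm{diag}\,\{P_0, P_1, \ldots, P_M\} \tag{7.7.14}$$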
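The factorization (7.7.11) through (7.7.14) can be sketched for a real symmetric Toeplitz matrix by collecting the backward gapped functions produced by the Schür recursion (7.6.10). The function name `toeplitz_ldl` and its interface are assumptions made for this illustration.

```python
import numpy as np

def toeplitz_ldl(r):
    """L (unit lower triangular) and diag(D) with R = L D L^T, R = Toeplitz(r)."""
    r = np.asarray(r, dtype=float)
    n = len(r)                     # matrix size, n = M + 1
    M = n - 1
    L = np.eye(n)
    D = np.zeros(n)
    xf, xb = r.copy(), r.copy()    # xi^f_0(l) = xi^b_0(l) = r(l)
    for m in range(n):
        D[m] = xb[m]               # P_m = xi^b_m(m)
        L[m:, m] = xb[m:] / xb[m]  # column m of L, per (7.7.13)
        if m < M:
            km = -xf[m + 1] / xb[m]
            # Order update (7.6.10) for l = m+1, ..., M
            xf[m + 1:], xb[m + 1:] = (
                xf[m + 1:] + km * xb[m:M],
                km * xf[m + 1:] + xb[m:M],
            )
    return L, D
```

Each column costs a number of operations proportional to its length, so the whole factorization takes $O(M^2)$ operations. For $r = \{3, 2, 1\}$ the sketch returns the $\mathbf{L}$ and $\mathbf{D}$ of Example 7.7.1.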

The basic recursion (7.6.10) in the algorithm of Schür can be extended to compute the elements of $\tilde{\mathbf{L}}$ and hence the $LDL^H$ factorization of the Toeplitz matrix $\mathbf{R}$ (see Problem 7.33). Since a Toeplitz matrix is persymmetric, that is, $\mathbf{J}\mathbf{R}\mathbf{J} = \mathbf{R}^*$, we have

$$\mathbf{R} = \mathbf{J}\mathbf{R}^*\mathbf{J} = (\mathbf{J}\mathbf{L}^*\mathbf{J})(\mathbf{J}\mathbf{D}\mathbf{J})(\mathbf{J}\mathbf{L}^T\mathbf{J}) \triangleq \mathbf{U}\bar{\mathbf{D}}\mathbf{U}^H \tag{7.7.15}$$

which provides the $UDU^H$ decomposition of $\mathbf{R}$. Notice that the relation $\mathbf{U} = \mathbf{J}\mathbf{L}^*\mathbf{J}$ also can be obtained from $\mathbf{A} = \mathbf{J}\mathbf{B}^*\mathbf{J}$ [see (7.4.11)], which in turn is a consequence of the symmetry between forward and backward prediction for stationary processes. The validity of (7.6.10) also can be shown by computing the product

$$\mathbf{R}\mathbf{A}^H = \begin{bmatrix}
r(0) & r(1) & r(2) & r(3) \\
r(1) & r(0) & r(1) & r(2) \\
r(2) & r(1) & r(0) & r(1) \\
r(3) & r(2) & r(1) & r(0)
\end{bmatrix}
\begin{bmatrix}
1 & 0 & 0 & 0 \\
a_1^{(3)*} & 1 & 0 & 0 \\
a_2^{(3)*} & a_1^{(2)*} & 1 & 0 \\
a_3^{(3)*} & a_2^{(2)*} & a_1^{(1)*} & 1
\end{bmatrix} \tag{7.7.16}$$

$$= \begin{bmatrix}
\xi_3^f(0) & \xi_2^f(-1) & \xi_1^f(-2) & \xi_0^f(-3) \\
0 & \xi_2^f(0) & \xi_1^f(-1) & \xi_0^f(-2) \\
0 & 0 & \xi_1^f(0) & \xi_0^f(-1) \\
0 & 0 & 0 & \xi_0^f(0)
\end{bmatrix} \tag{7.7.17}$$

with the help of (7.6.3) and $r(-l) = r^*(l)$. The formula $\mathbf{U} = \mathbf{J}\mathbf{L}^*\mathbf{J}$ relates $\xi_m^f(l)$ and $\xi_m^b(l)$, as expected by (7.6.10).

7.7.3 Inversion of Real Toeplitz Matrices

From the discussion in Section 7.1, it follows from (7.1.12) that the inverse $\mathbf{Q}_M$ of a symmetric, positive definite matrix $\mathbf{R}_M$ is given by

$$\mathbf{Q}_M \triangleq \begin{bmatrix} \mathbf{Q} & \mathbf{q} \\ \mathbf{q}^T & q \end{bmatrix} \tag{7.7.18}$$

with

$$\mathbf{q} = \frac{\mathbf{b}}{P} \tag{7.7.19}$$

$$q = \frac{1}{P} \tag{7.7.20}$$

and

$$\mathbf{Q} = \mathbf{R}^{-1} + \frac{1}{P}\,\mathbf{b}\mathbf{b}^T \tag{7.7.21}$$

as given by (7.1.18), (7.1.19), and (7.1.21). The matrix $\mathbf{Q}$ is an $(M-1) \times (M-1)$ matrix, and $\mathbf{b}$ is the $(M-1)$st-order BLP. Next we show that for Toeplitz matrices we can compute $\mathbf{Q}_M$ with $O(M^2)$ computations. First, we note that the last column and the last row of $\mathbf{Q}_M$ can be obtained by solving the Toeplitz system $\mathbf{R}\mathbf{b} = -\mathbf{J}\mathbf{r}$ using the Levinson-Durbin algorithm. Then we show that we can compute the elements of $\mathbf{Q}$ by exploiting the persymmetry property of Toeplitz matrices, moving from the known edges to the interior. Indeed, since $\mathbf{R}$ is persymmetric, that is, $\mathbf{R} = \mathbf{J}\mathbf{R}\mathbf{J}$, we have $\mathbf{R}^{-1} = \mathbf{J}\mathbf{R}^{-1}\mathbf{J}$, that is, $\mathbf{R}^{-1}$ is also persymmetric. From (7.7.21), we have

$$Q_{ij} = [\mathbf{R}^{-1}]_{ij} + P q_i q_j = [\mathbf{R}^{-1}]_{M-j,M-i} + P q_i q_j \tag{7.7.22}$$

because $\mathbf{R}^{-1}$ is persymmetric, and

$$[\mathbf{R}^{-1}]_{M-j,M-i} = Q_{M-j,M-i} - P q_{M-j} q_{M-i} \tag{7.7.23}$$

Combining (7.7.22) and (7.7.23), we obtain

$$Q_{ij} = Q_{M-j,M-i} + P\,(q_i q_j - q_{M-j} q_{M-i}) \tag{7.7.24}$$

which in conjunction with persymmetry makes possible the computation of the elements of $\mathbf{Q}$ from $\mathbf{q}$ and $q$. The process is illustrated for $M = 6$ in the following diagram

$$\mathbf{Q}_6 = \begin{bmatrix}
p_1 & p_1 & p_1 & p_1 & p_1 & k \\
p_1 & p_2 & p_2 & p_2 & u_1 & k \\
p_1 & p_2 & p_3 & u_2 & u_1 & k \\
p_1 & p_2 & u_2 & u_2 & u_1 & k \\
p_1 & u_1 & u_1 & u_1 & u_1 & k \\
k & k & k & k & k & k
\end{bmatrix}$$

where we start with the known elements $k$ and then compute the $u$ elements by using the updating property (7.7.22) and the elements $p$ by using the persymmetry property (7.7.24) in the following order: $k \to p_1 \to u_1 \to p_2 \to u_2 \to p_3$. Clearly, because the matrix $\mathbf{Q}_M = \mathbf{R}_M^{-1}$ is both symmetric and persymmetric, we need to compute only the elements in the following wedge:

$$\begin{matrix}
p_1 & p_1 & p_1 & p_1 & p_1 & k \\
 & p_2 & p_2 & p_2 & u_1 & \\
 & & p_3 & u_2 & &
\end{matrix}$$

which can be easily extended to the general case. This algorithm, which was introduced by Trench (1964), requires $O(M^2)$ operations and is implemented by the function Q=invtoepl(r,M)

The algorithm is generalized for complex Toeplitz matrices in Problem 7.40.
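A compact sketch of this inversion for real symmetric Toeplitz matrices is given below. It is not the book's `invtoepl` function: the helper `levinson_blp`, the interface, and the loop over the full interior (instead of the wedge only) are assumptions made for readability. The interior update is (7.7.24) combined with the persymmetry of $\mathbf{Q}_M$, which in 0-based indexing gives `Q[i+1, j+1] = Q[i, j] - P*(q[i]*q[j] - q[M-2-j]*q[M-2-i])`.

```python
import numpy as np

def levinson_blp(r):
    """Backward linear predictor of order len(r)-1 and its MMSE (real data)."""
    a = np.zeros(0)
    P = float(r[0])
    for m in range(len(r) - 1):
        # k_m = -(r(m+1) + a_m^T J r_m) / P_m
        km = -(r[m + 1] + a @ r[m:0:-1]) / P
        a = np.concatenate([a + km * a[::-1], [km]])
        P *= 1.0 - km ** 2
    return a[::-1], P              # b = J a for real data

def invtoepl(r):
    """Inverse of the M x M real symmetric Toeplitz matrix built from r(0..M-1)."""
    r = np.asarray(r, dtype=float)
    M = len(r)
    b, P = levinson_blp(r)
    q = b / P                      # (7.7.19)
    Q = np.zeros((M, M))
    Q[:M - 1, M - 1] = q           # known last column ...
    Q[M - 1, :M - 1] = q           # ... and last row
    Q[M - 1, M - 1] = 1.0 / P      # (7.7.20)
    Q[0, :] = Q[::-1, M - 1]       # persymmetry: Q[0, j] = Q[M-1-j, M-1]
    Q[:, 0] = Q[0, :]              # symmetry
    for i in range(M - 2):         # move from the known edges to the interior
        for j in range(M - 2):
            Q[i + 1, j + 1] = Q[i, j] - P * (q[i] * q[j]
                                             - q[M - 2 - j] * q[M - 2 - i])
    return Q
```

For $r = \{3, 2, 1\}$ the sketch gives $Q_{11} = 5/8$, $Q_{12} = -1/2$, $Q_{13} = 1/8$, and $Q_{22} = 1$, which matches the inverse computed directly.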

7.8 KALMAN FILTER ALGORITHM

The various optimum linear filter algorithms and structures that we discussed so far in this chapter provide us with the determination of filter coefficients or optimal estimates using some form of recursive update. Some algorithms and structures are order-recursive while others are time-recursive. In effect, they tell us how the past values should be updated to determine the present values. Unfortunately, these techniques do not lend themselves very well to the more complicated nonstationary problems. Readers will note carefully that the only case in which we obtained efficient order-recursive algorithms and structures was in the stationary environment, using the approaches of Levinson and Schür.

In 1960, R. E. Kalman provided an alternative approach to formulating the MMSE linear filtering problem using dynamic models. This “Kalman filter” technique was quickly hailed as a practical solution to a number of problems that were intractable using the more established Wiener methods. As we see in this section, the Kalman filter algorithm is actually a special case of the optimal linear filter algorithms that we have studied. However, it is used in a number of fields such as aerospace and navigation, where a signal trajectory can be well defined. Its use in statistical signal processing is somewhat limited (adaptive filters discussed in Chapter 10 are more appropriate). The two main features of the Kalman filter formulation and solution are the dynamic (or state-space) modeling of the random processes under consideration and the time-recursive processing of the input data.

In this section, we discuss only the discrete-time Kalman filter. The continuous-time version is covered in several texts including Gelb (1977) and Brown and Hwang (1997). As a motivation to this approach, we begin with the following estimation problem.

7.8.1 Preliminary Development

Suppose that we want to obtain a linear MMSE estimate of a random variable $y$ using the related random variables (observations) $\{x_1, x_2, \ldots, x_m\}$, that is,

$$\hat{y}_m \triangleq E\{y \mid x_1, x_2, \ldots, x_m\} \tag{7.8.1}$$

as described in Section 7.1.5. Furthermore, we want to obtain this estimate in an order-recursive fashion, that is, determine $\hat{y}_m$ in terms of $\hat{y}_{m-1}$. We considered and solved this problem in Section 7.1. Our approach, which is somewhat different from that in Section 7.1, is as follows: Assume that we have computed the corresponding estimate $\hat{y}_{m-1}$, we have the observations $\{x_1, x_2, \ldots, x_m\}$, and we wish to determine the estimate $\hat{y}_m$. Then we carry out the following steps:

1. We first determine the optimal one-step prediction of $x_m$, that is,

$$\hat{x}_{m|m-1} \triangleq E\{x_m \mid x_1, x_2, \ldots, x_{m-1}\} = [\mathbf{R}_{m-1}^{-1}\mathbf{r}_{m-1}^b]^H \mathbf{x}_{m-1} = -\mathbf{b}_{m-1}^H \mathbf{x}_{m-1} = -\sum_{k=1}^{m-1} [b_k^{(m-1)}]^* x_k \tag{7.8.2}$$

where the vector and matrix quantities are as defined in Section 7.1.

2. When the new data value $x_m$ is received, we determine the optimal prediction error

$$w_m \triangleq x_m - \hat{x}_{m|m-1} = e_m^b \tag{7.8.3}$$

which is the new information or innovations contained in the new data.

3. Determine a linear MMSE estimate of $y$, given the new information $w_m$:

$$E\{y \mid w_m\} = E\{y w_m^*\}\,(E\{w_m w_m^*\})^{-1}\, w_m \tag{7.8.4}$$

4. Finally, form a linear estimate $\hat{y}_m$ of the form

$$\hat{y}_m = \hat{y}_{m-1} + E\{y \mid w_m\} = \hat{y}_{m-1} + E\{y w_m^*\}\,(E\{w_m w_m^*\})^{-1}\, w_m \tag{7.8.5}$$

The algorithm is initialized with $\hat{y}_0 = 0$. Note that the quantity $E\{y w_m^*\}\,(E\{w_m w_m^*\})^{-1}$ is equal to the coefficient $k_m^*$ and that we have rederived (7.1.51). For the implementation of (7.8.5), see Figure 7.1.
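Steps 1 through 4 can be sketched numerically for real-valued quantities when the second-order moments are known. In this illustration the correlation matrix `R` $= E\{\mathbf{x}\mathbf{x}^T\}$, the cross-correlation vector `d` $= E\{\mathbf{x}y\}$, and the function name `sequential_estimate` are all assumptions; the one-step predictor is obtained by a direct linear solve rather than by an order-recursive algorithm, so the sketch shows the structure of the update, not an efficient implementation.

```python
import numpy as np

def sequential_estimate(R, d, x):
    """Order-recursive linear MMSE estimates y_hat_1, ..., y_hat_M (real data)."""
    R = np.asarray(R, dtype=float)
    d = np.asarray(d, dtype=float)
    x = np.asarray(x, dtype=float)
    y_hat = 0.0                                  # initialization y_hat_0 = 0
    estimates = []
    for m in range(1, len(x) + 1):
        if m == 1:
            b = np.zeros(0)                      # nothing to predict from yet
        else:
            # Step 1: predictor coefficients, R_{m-1} b = -r^b_{m-1}
            b = -np.linalg.solve(R[:m - 1, :m - 1], R[:m - 1, m - 1])
        # Step 2: innovation w_m = x_m - x_hat_{m|m-1}, as in (7.8.3)
        w = x[m - 1] + b @ x[:m - 1]
        # Moments of w_m = g^T [x_1 ... x_m]^T
        g = np.concatenate([b, [1.0]])
        E_yw = g @ d[:m]
        E_ww = g @ R[:m, :m] @ g
        # Steps 3 and 4: update (7.8.5)
        y_hat += (E_yw / E_ww) * w
        estimates.append(y_hat)
    return np.array(estimates)
```

Because the innovations are mutually orthogonal, the $m$th value returned by the sketch equals the estimate obtained by solving the normal equations of order $m$ directly. With the Toeplitz correlation matrix built from $\{3, 2, 1\}$ and $d = \{1, 2, 5/2\}$, the successive gains `E_yw / E_ww` are $1/3$, $4/5$, and $11/16$.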

EXAMPLE 7.8.1. Let the observed random data be obtained from a stationary random process; that is, the data are of the form

$$\{x(1), x(2), \ldots, x(n), \ldots\} \qquad r(n, l) = r(n - l)$$

Also instead of estimating a single random variable, we want to estimate the sample $y(n)$ of a random process $\{y(n)\}$ that is jointly stationary with $x(n)$. Then, following the analysis leading to (7.8.5), we obtain

$$\hat{y}(n) = \hat{y}(n-1) + k_n^*\, w(n) = \hat{y}(n-1) + k_n^* \left[ x(n) + \sum_{k=0}^{n-1} [b_k^{(n-1)}]^* x(k) \right] \tag{7.8.6}$$

It is interesting to note that, because of stationarity, we have a time-recursive algorithm in (7.8.6). The coefficients $\{k_n^*\}$ can be obtained recursively by using the algorithms of Levinson or Schür. However, the data prediction term does require a growing memory. Indeed, if we define the vector

$$\mathbf{x}(n) = [x(1) \;\; x(2) \;\; \cdots \;\; x(n)]^T$$

whose order is equal to time index $n$, we have

$$\hat{y}(n) = \sum_{k=1}^{n} [c_k^{(n)}]^* x(k) \triangleq \mathbf{c}_n^H \mathbf{x}(n)$$

The optimum estimator is given by

$$\mathbf{R}_n \mathbf{c}_n = \mathbf{d}_n$$

where

$$\mathbf{R}_n \triangleq E\{\mathbf{x}(n)\mathbf{x}^H(n)\} \qquad \mathbf{d}_n \triangleq E\{\mathbf{x}(n)\, y^*(n)\}$$

Since, owing to stationarity, the matrix $\mathbf{R}_n$ is Toeplitz, we can derive a lattice-ladder structure $\{k_n, k_n^c\}$ that solves this problem recursively (see Section 7.4). When each new observation $\{y(n+1)\}$ is received, we use the moments $r(n+1)$ and $d(n+1)$ to compute new lattice-ladder parameters $\{k_{n+1}, k_{n+1}^c\}$ and we add a new stage to the “growing-order” (and, therefore, growing-memory) filter.

The above example underscores two problems with our estimation technique if we were to obtain a true time-recursive algorithm with finite memory. The first problem concerns the ∗ term or, in particular, for E{y w ∗ } and (E{w w ∗ })−1 . We time-recursive update for the km m m m m alluded to this problem in Section 7.1. In the example, we solved this problem by assuming a stationary signal environment. The second problem deals with the infinite memory in (7.8.2). This problem can be solved if we are able to compute the data prediction term also in a time-recursive fashion. In the stationary case, this problem can be solved by using the Levinson-Durbin or Schür algorithm. For nonstationary situations, the above two problems are solved by the Kalman filter by assuming appropriate dynamic models for the process to be estimated and for the observation data. Consider the optimal one-step prediction term in (7.8.2), defined as x(n|n ˆ − 1) E{x(n)|x(0), . . . , x(n − 1)}

(7.8.7)

which requires growing memory. If we assume the following linear data relation model

$$x(n) = H(n)y(n) + v(n) \tag{7.8.8}$$

with

$$E\{v(n)y^*(l)\} = 0 \qquad \text{for all } n, l \tag{7.8.9}$$

$$E\{v(n)v^*(l)\} = r_v(n)\,\delta_{n,l} \qquad \text{for all } n, l \tag{7.8.10}$$

then (7.8.7) becomes

$$\hat{x}(n|n-1) = E\{[H(n)y(n) + v(n)]\,|\,x(0), \ldots, x(n-1)\} = H(n)\hat{y}(n|n-1) \tag{7.8.11}$$

where we have used the notation

$$\hat{y}(n|n-1) \triangleq E\{y(n)\,|\,x(0), \ldots, x(n-1)\} \tag{7.8.12}$$

Thus, we will be successful in obtaining a finite-memory computation for $\hat{x}(n|n-1)$ if we can obtain a recursion for $\hat{y}(n|n-1)$ in terms of $\hat{y}(n-1|n-1)$. This is possible if we assume the following linear signal model

$$y(n) = a(n-1)y(n-1) + \eta(n) \tag{7.8.13}$$

with appropriate statistical assumptions on the random process η(n). Thus it is now possible to complete the development of the Kalman filter. The signal model (7.8.13) provides the dynamics of the time evolution of the signal to be estimated while (7.8.8) is known as the observation model, since it relates the signal y(n) with the observation x(n). These models are formally defined in the next section.
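As a short worked step, the finite-memory recursion promised above can be written out for the scalar models (7.8.8) and (7.8.13). This is only a sketch: it assumes, as one possible form of the "appropriate statistical assumptions," that $\eta(n)$ is zero-mean and orthogonal to the past observations $x(0), \ldots, x(n-1)$, so that its conditional expectation vanishes.

$$
\begin{aligned}
\hat{y}(n|n-1) &= E\{a(n-1)y(n-1) + \eta(n)\,|\,x(0), \ldots, x(n-1)\} \\
&= a(n-1)\,\hat{y}(n-1|n-1) + E\{\eta(n)\,|\,x(0), \ldots, x(n-1)\} \\
&= a(n-1)\,\hat{y}(n-1|n-1)
\end{aligned}
$$

Substituting into (7.8.11) gives $\hat{x}(n|n-1) = H(n)\,a(n-1)\,\hat{y}(n-1|n-1)$, which needs only the most recent filtered estimate rather than the whole observation record.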

7.8.2 Development of Kalman Filter

Since the Kalman filter is also well suited for vector processes, we begin by assuming that the random process to be estimated can be modeled in the form

$$\mathbf{y}(n) = \mathbf{A}(n-1)\mathbf{y}(n-1) + \mathbf{B}(n)\boldsymbol{\eta}(n) \tag{7.8.14}$$

which is known as the signal (or state vector) model where

$\mathbf{y}(n)$ = $k \times 1$ signal state vector at time $n$
$\mathbf{A}(n-1)$ = $k \times k$ matrix that relates $\mathbf{y}(n-1)$ to $\mathbf{y}(n)$ in absence of a forcing function
$\boldsymbol{\eta}(n)$ = $k \times 1$ zero-mean white noise sequence with covariance matrix $\mathbf{R}_\eta(n)$
$\mathbf{B}(n)$ = $k \times k$ input matrix (7.8.15)

The matrix $\mathbf{A}(n-1)$ is known as the state-transition matrix while $\boldsymbol{\eta}(n)$ is also known as the modeling error vector. The observation (or measurement) model is described using the linear relationship

$$\mathbf{x}(n) = \mathbf{H}(n)\mathbf{y}(n) + \mathbf{v}(n) \tag{7.8.16}$$

where

$\mathbf{x}(n)$ = $m \times 1$ observation vector at time $n$
$\mathbf{H}(n)$ = $m \times k$ matrix that gives ideal linear relationship between $\mathbf{y}(n)$ and $\mathbf{x}(n)$
$\mathbf{v}(n)$ = $m \times 1$ zero-mean white noise sequence with covariance matrix $\mathbf{R}_v(n)$ (7.8.17)

The matrix $\mathbf{H}(n)$ is known as the output matrix, and the sequence $\mathbf{v}(n)$ is known as the observation error. We further assume the following statistical properties:

$$E\{\mathbf{y}(n)\mathbf{v}^H(l)\} = \mathbf{0} \qquad \text{for all } n, l \tag{7.8.18}$$

$$E\{\boldsymbol{\eta}(n)\mathbf{v}^H(l)\} = \mathbf{0} \qquad \text{for all } n, l \tag{7.8.19}$$

$$E\{\boldsymbol{\eta}(n)\mathbf{y}^H(-1)\} = \mathbf{0} \qquad \text{for all } n \tag{7.8.20}$$

$$E\{\mathbf{y}(-1)\} = \mathbf{0} \tag{7.8.21}$$

$$E\{\mathbf{y}(-1)\mathbf{y}^H(-1)\} = \mathbf{R}_y(-1) \tag{7.8.22}$$

The first three relations, (7.8.18) to (7.8.20), imply orthogonality between respective random variables while the last two, (7.8.21) and (7.8.22), establish the mean and covariance of the initial-condition vector $\mathbf{y}(-1)$. From (7.8.14) and (7.8.21) the mean of $\mathbf{y}(n)$ is zero for all $n$, and the evolution of its correlation matrix is given by

$$\mathbf{R}_y(n) = \mathbf{A}(n-1)\mathbf{R}_y(n-1)\mathbf{A}^H(n-1) + \mathbf{B}(n)\mathbf{R}_\eta(n)\mathbf{B}^H(n) \tag{7.8.23}$$


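From (7.8.16), the mean of $\mathbf{x}(n)$ is zero for all $n$, and from (7.8.23) the evolution of its correlation matrix is given by

$$\mathbf{R}_x(n) = \mathbf{H}(n)[\mathbf{A}(n-1)\mathbf{R}_y(n-1)\mathbf{A}^H(n-1) + \mathbf{B}(n)\mathbf{R}_\eta(n)\mathbf{B}^H(n)]\mathbf{H}^H(n) + \mathbf{R}_v(n) \tag{7.8.24}$$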

Evolution of optimal estimates

We now assume that we have available the MMSE estimate $\hat{\mathbf{y}}(n-1|n-1)$ of $\mathbf{y}(n-1)$ based on the observations up to and including time $n-1$. Using (7.8.14) and (7.8.20), the one-step prediction of $\mathbf{y}(n)$ is given by

$$\hat{\mathbf{y}}(n|n-1) = \mathbf{A}(n-1)\hat{\mathbf{y}}(n-1|n-1) \tag{7.8.25}$$

with initial condition $\hat{\mathbf{y}}(-1|-1) = \mathbf{y}(-1)$. From (7.8.16), the one-step prediction of $\mathbf{x}(n)$ is given by

$$\hat{\mathbf{x}}(n|n-1) = \mathbf{H}(n)\hat{\mathbf{y}}(n|n-1) = \mathbf{H}(n)\mathbf{A}(n-1)\hat{\mathbf{y}}(n-1|n-1) \tag{7.8.26}$$

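Thus we have a recursive formula to compute the predicted observation. The prediction error (7.8.3) from (7.8.16) is now given by

$$
\begin{aligned}
\mathbf{w}(n) &= \mathbf{x}(n) - \hat{\mathbf{x}}(n|n-1) = \mathbf{H}(n)\mathbf{y}(n) + \mathbf{v}(n) - \mathbf{H}(n)\hat{\mathbf{y}}(n|n-1) \\
&= \mathbf{H}(n)\tilde{\mathbf{y}}(n|n-1) + \mathbf{v}(n)
\end{aligned}
\tag{7.8.27}
$$

where we have defined the signal prediction error

$$\tilde{\mathbf{y}}(n|n-1) \triangleq \mathbf{y}(n) - \hat{\mathbf{y}}(n|n-1) \tag{7.8.28}$$

Now the quantity corresponding to $E\{w_m w_m^*\}$ in (7.8.5) is given by

$$\mathbf{R}_w(n) = E\{\mathbf{w}(n)\mathbf{w}^H(n)\} = \mathbf{H}(n)\mathbf{R}_{\tilde{y}}(n|n-1)\mathbf{H}^H(n) + \mathbf{R}_v(n) \tag{7.8.29}$$

where

$$\mathbf{R}_{\tilde{y}}(n|n-1) \triangleq E\{\tilde{\mathbf{y}}(n|n-1)\tilde{\mathbf{y}}^H(n|n-1)\} \tag{7.8.30}$$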

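is called the prediction (a priori) error covariance matrix. Similarly, from (7.8.27) the quantity corresponding to $E\{y_m w_m^*\}$ in (7.8.5) is given by

$$
\begin{aligned}
E\{\mathbf{y}(n)\mathbf{w}^H(n)\} &= E\{\mathbf{y}(n)[\tilde{\mathbf{y}}^H(n|n-1)\mathbf{H}^H(n) + \mathbf{v}^H(n)]\} \\
&= E\{[\tilde{\mathbf{y}}(n|n-1) + \hat{\mathbf{y}}(n|n-1)][\tilde{\mathbf{y}}^H(n|n-1)\mathbf{H}^H(n) + \mathbf{v}^H(n)]\} \\
&= E\{\tilde{\mathbf{y}}(n|n-1)\tilde{\mathbf{y}}^H(n|n-1)\}\mathbf{H}^H(n) \\
&= \mathbf{R}_{\tilde{y}}(n|n-1)\mathbf{H}^H(n)
\end{aligned}
\tag{7.8.31}
$$

since the optimal prediction error $\tilde{\mathbf{y}}(n|n-1)$ is orthogonal to the optimal prediction $\hat{\mathbf{y}}(n|n-1)$. Now the updated MMSE estimate (which is also known as the filtered estimate) corresponding to (7.8.5) is

$$
\begin{aligned}
\hat{\mathbf{y}}(n|n) &= \hat{\mathbf{y}}(n|n-1) + \mathbf{R}_{\tilde{y}}(n|n-1)\mathbf{H}^H(n)\mathbf{R}_w^{-1}(n)\{\mathbf{x}(n) - \hat{\mathbf{x}}(n|n-1)\} \\
&= \hat{\mathbf{y}}(n|n-1) + \mathbf{K}(n)\{\mathbf{x}(n) - \mathbf{H}(n)\hat{\mathbf{y}}(n|n-1)\}
\end{aligned}
\tag{7.8.32}
$$

where we have defined a new quantity

$$\mathbf{K}(n) \triangleq \mathbf{R}_{\tilde{y}}(n|n-1)\mathbf{H}^H(n)\mathbf{R}_w^{-1}(n) \tag{7.8.33}$$

which is known as the Kalman gain matrix and where $\hat{\mathbf{y}}(n|n-1)$ is given in terms of $\hat{\mathbf{y}}(n-1|n-1)$ using (7.8.25). Thus we have

$$
\begin{aligned}
\text{Prediction:} \quad & \hat{\mathbf{y}}(n|n-1) = \mathbf{A}(n-1)\hat{\mathbf{y}}(n-1|n-1) \\
\text{Filter:} \quad & \hat{\mathbf{y}}(n|n) = \hat{\mathbf{y}}(n|n-1) + \mathbf{K}(n)\{\mathbf{x}(n) - \mathbf{H}(n)\hat{\mathbf{y}}(n|n-1)\}
\end{aligned}
\tag{7.8.34}
$$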

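and we have succeeded in obtaining a time-updating algorithm for recursively computing the MMSE estimates. All that remains is a time evolution of the gain matrix $\mathbf{K}(n)$. Since $\mathbf{R}_w(n)$ from (7.8.29) also depends on $\mathbf{R}_{\tilde{y}}(n|n-1)$, what we need is an update equation for the error covariance matrix.

Evolution of error covariance matrices

First we define the filtered error as

$$
\begin{aligned}
\tilde{\mathbf{y}}(n|n) &\triangleq \mathbf{y}(n) - \hat{\mathbf{y}}(n|n) = \mathbf{y}(n) - \hat{\mathbf{y}}(n|n-1) - \mathbf{K}(n)\{\mathbf{x}(n) - \mathbf{H}(n)\hat{\mathbf{y}}(n|n-1)\} \\
&= \tilde{\mathbf{y}}(n|n-1) - \mathbf{K}(n)\mathbf{w}(n)
\end{aligned}
\tag{7.8.35}
$$

where we have used (7.8.27) and (7.8.34). Then the filtered error covariance is given by

$$
\begin{aligned}
\mathbf{R}_{\tilde{y}}(n|n) &\triangleq E\{\tilde{\mathbf{y}}(n|n)\tilde{\mathbf{y}}^H(n|n)\} = \mathbf{R}_{\tilde{y}}(n|n-1) - \mathbf{K}(n)\mathbf{R}_w(n)\mathbf{K}^H(n) \\
&= \mathbf{R}_{\tilde{y}}(n|n-1) - \mathbf{K}(n)\mathbf{R}_w(n)\mathbf{R}_w^{-1}(n)\mathbf{H}(n)\mathbf{R}_{\tilde{y}}(n|n-1) \\
&= [\mathbf{I} - \mathbf{K}(n)\mathbf{H}(n)]\mathbf{R}_{\tilde{y}}(n|n-1)
\end{aligned}
\tag{7.8.36}
$$

where in the second-to-last step we substituted (7.8.33) for $\mathbf{K}^H(n)$. The error covariance $\mathbf{R}_{\tilde{y}}(n|n)$ is also known as the a posteriori error covariance. Finally, we need to determine the a priori prediction error covariance at time $n$ from $\mathbf{R}_{\tilde{y}}(n-1|n-1)$ to complete the recursive calculations. From the prediction equation in (7.8.34), we obtain the prediction error at time $n$ as

$$
\begin{aligned}
\mathbf{y}(n) - \hat{\mathbf{y}}(n|n-1) &= \mathbf{A}(n-1)\mathbf{y}(n-1) + \mathbf{B}(n)\boldsymbol{\eta}(n) - \mathbf{A}(n-1)\hat{\mathbf{y}}(n-1|n-1) \\
\tilde{\mathbf{y}}(n|n-1) &= \mathbf{A}(n-1)\tilde{\mathbf{y}}(n-1|n-1) + \mathbf{B}(n)\boldsymbol{\eta}(n)
\end{aligned}
\tag{7.8.37}
$$

or

$$\mathbf{R}_{\tilde{y}}(n|n-1) = \mathbf{A}(n-1)\mathbf{R}_{\tilde{y}}(n-1|n-1)\mathbf{A}^H(n-1) + \mathbf{B}(n)\mathbf{R}_\eta(n)\mathbf{B}^H(n) \tag{7.8.38}$$

with initial condition $\mathbf{R}_{\tilde{y}}(-1|-1) = \mathbf{R}_y(-1)$. Thus we have

$$
\begin{aligned}
\text{A priori error covariance:} \quad & \mathbf{R}_{\tilde{y}}(n|n-1) = \mathbf{A}(n-1)\mathbf{R}_{\tilde{y}}(n-1|n-1)\mathbf{A}^H(n-1) + \mathbf{B}(n)\mathbf{R}_\eta(n)\mathbf{B}^H(n) \\
\text{Kalman gain:} \quad & \mathbf{K}(n) = \mathbf{R}_{\tilde{y}}(n|n-1)\mathbf{H}^H(n)\mathbf{R}_w^{-1}(n) \\
\text{A posteriori error covariance:} \quad & \mathbf{R}_{\tilde{y}}(n|n) = [\mathbf{I} - \mathbf{K}(n)\mathbf{H}(n)]\mathbf{R}_{\tilde{y}}(n|n-1)
\end{aligned}
\tag{7.8.39}
$$

The complete Kalman filter algorithm is given in Table 7.5, and the block diagram description is provided in Figure 7.11.

EXAMPLE 7.8.2.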

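Let $y(n)$ be an AR(2) process described by

$$y(n) = 1.8y(n-1) - 0.81y(n-2) + 0.1\eta(n) \qquad n \geq 0 \tag{7.8.40}$$

where $\eta(n) \sim \text{WGN}(0, 1)$ and $y(-1) = y(-2) = 0$. We want to determine the linear MMSE estimate of $y(n)$, $n \geq 0$, by observing

$$x(n) = y(n) + \sqrt{10}\,v(n) \qquad n \geq 0 \tag{7.8.41}$$

where $v(n) \sim \text{WGN}(0, 1)$ and orthogonal to $\eta(n)$, so that the observation noise term $\sqrt{10}\,v(n)$ has variance 10.

Solution. From (7.8.40) and (7.8.41), we first formulate the state vector and observation equations:

$$\mathbf{y}(n) \triangleq \begin{bmatrix} y(n) \\ y(n-1) \end{bmatrix} = \begin{bmatrix} 1.8 & -0.81 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} y(n-1) \\ y(n-2) \end{bmatrix} + \begin{bmatrix} 0.1 \\ 0 \end{bmatrix} \eta(n) \tag{7.8.42}$$

and

$$x(n) = [1 \quad 0] \begin{bmatrix} y(n) \\ y(n-1) \end{bmatrix} + \sqrt{10}\,v(n) \tag{7.8.43}$$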

TABLE 7.5
Summary of the Kalman filter algorithm.

1. Input:
   (a) Signal model parameters: $\mathbf{A}(n-1)$, $\mathbf{B}(n)$, $\mathbf{R}_\eta(n)$; $n = 0, 1, 2, \ldots$
   (b) Observation model parameters: $\mathbf{H}(n)$, $\mathbf{R}_v(n)$; $n = 0, 1, 2, \ldots$
   (c) Observation data: $\mathbf{x}(n)$; $n = 0, 1, 2, \ldots$

2. Initialization: $\hat{\mathbf{y}}(0|-1) = \mathbf{y}(-1) = \mathbf{0}$; $\mathbf{R}_{\tilde{y}}(-1|-1) = \mathbf{R}_y(-1)$

3. Time recursion: For $n = 0, 1, 2, \ldots$
   (a) Signal prediction: $\hat{\mathbf{y}}(n|n-1) = \mathbf{A}(n-1)\hat{\mathbf{y}}(n-1|n-1)$
   (b) Data prediction: $\hat{\mathbf{x}}(n|n-1) = \mathbf{H}(n)\hat{\mathbf{y}}(n|n-1)$
   (c) A priori error covariance: $\mathbf{R}_{\tilde{y}}(n|n-1) = \mathbf{A}(n-1)\mathbf{R}_{\tilde{y}}(n-1|n-1)\mathbf{A}^H(n-1) + \mathbf{B}(n)\mathbf{R}_\eta(n)\mathbf{B}^H(n)$
   (d) Kalman gain: $\mathbf{K}(n) = \mathbf{R}_{\tilde{y}}(n|n-1)\mathbf{H}^H(n)\mathbf{R}_w^{-1}(n)$, with $\mathbf{R}_w(n) = \mathbf{H}(n)\mathbf{R}_{\tilde{y}}(n|n-1)\mathbf{H}^H(n) + \mathbf{R}_v(n)$
   (e) Signal update: $\hat{\mathbf{y}}(n|n) = \hat{\mathbf{y}}(n|n-1) + \mathbf{K}(n)[\mathbf{x}(n) - \hat{\mathbf{x}}(n|n-1)]$
   (f) A posteriori error covariance: $\mathbf{R}_{\tilde{y}}(n|n) = [\mathbf{I} - \mathbf{K}(n)\mathbf{H}(n)]\mathbf{R}_{\tilde{y}}(n|n-1)$

4. Output: Filtered estimate $\hat{\mathbf{y}}(n|n)$, $n = 0, 1, 2, \ldots$

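[Figure 7.11: block diagram in two parts. The signal and observation models generate $\mathbf{y}(n)$ and $\mathbf{x}(n)$ from $\boldsymbol{\eta}(n)$ and $\mathbf{v}(n)$ through $\mathbf{B}(n)$, $\mathbf{A}(n-1)$, a unit delay $z^{-1}$, and $\mathbf{H}(n)$. The discrete Kalman filter forms the signal prediction $\hat{\mathbf{y}}(n|n-1)$, the data prediction $\hat{\mathbf{x}}(n|n-1)$, the error, the correction through $\mathbf{K}(n)$, and the update $\hat{\mathbf{y}}(n|n)$.]

FIGURE 7.11 The block diagram of the Kalman filter model and algorithm.

Hence the relevant matrix quantities are

$$\mathbf{A}(n) = \begin{bmatrix} 1.8 & -0.81 \\ 1 & 0 \end{bmatrix} \qquad \mathbf{B}(n) = \begin{bmatrix} 0.1 \\ 0 \end{bmatrix}$$

and

$$\mathbf{H}(n) = [1 \quad 0] \qquad R_v(n) = 10 \qquad R_\eta(n) = 1 \tag{7.8.44}$$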

Now the Kalman filter equations from Table 7.5 can be implemented with zero initial conditions. Note that since the system matrices are constant, the processes $x(n)$ and $y(n)$ are asymptotically stationary. Using (7.8.40) and (7.8.41), we generated 100 samples of $y(n)$ and $x(n)$. The observation $x(n)$ was processed using the Kalman filter equations to obtain $\hat{y}_f(n) = \hat{y}(n|n)$, and the results are shown in Figure 7.12. Owing to a large observation noise variance, the $x(n)$ values are very noisy around the signal $y(n)$ values. However, the Kalman filter was able to track $y(n)$ closely and reduce the degradation caused by the noise $v(n)$. In Figure 7.13 we show the evolution of Kalman filter gain values $K_1(n)$ and $K_2(n)$ along with the estimation error variance. The filter reaches its steady state in about 20 samples and becomes a stationary filter as expected. In such situations, the gain and error covariance equations can be implemented off-line (since these equations are data-independent) to obtain a constant-gain matrix. The data then can be filtered using this constant gain to reduce on-line computational complexity.
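The experiment described above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' simulation code: the function names `simulate` and `kalman_filter` are hypothetical, the random seed is arbitrary, and a zero initial error covariance is assumed because the initial state $y(-1) = y(-2) = 0$ is known exactly. The 2 × 2 matrix products are written out by hand so that only the standard library is needed.

```python
import math
import random

# Constant model quantities of Example 7.8.2, taken from (7.8.44).
A = [[1.8, -0.81], [1.0, 0.0]]  # state-transition matrix
B = [0.1, 0.0]                  # input vector (eta(n) is scalar here)
H = [1.0, 0.0]                  # output row vector
R_ETA = 1.0                     # variance of eta(n)
R_V = 10.0                      # variance of the observation noise term


def simulate(n_samples, seed=0):
    """Generate y(n) from (7.8.40) and x(n) from (7.8.41), zero initial state."""
    rng = random.Random(seed)
    y1 = y2 = 0.0  # y(n-1) and y(n-2)
    ys, xs = [], []
    for _ in range(n_samples):
        y = 1.8 * y1 - 0.81 * y2 + 0.1 * rng.gauss(0.0, 1.0)
        x = y + math.sqrt(R_V) * rng.gauss(0.0, 1.0)
        ys.append(y)
        xs.append(x)
        y1, y2 = y, y1
    return ys, xs


def kalman_filter(xs):
    """Steps (a)-(f) of Table 7.5; returns estimates, gains, and P11(n|n)."""
    y_hat = [0.0, 0.0]            # assumed zero initial estimate
    P = [[0.0, 0.0], [0.0, 0.0]]  # assumed zero initial error covariance
    estimates, gains, mse = [], [], []
    for x in xs:
        # (a) signal prediction: A y_hat
        y_pred = [A[i][0] * y_hat[0] + A[i][1] * y_hat[1] for i in range(2)]
        # (b) data prediction: H y_pred
        x_pred = H[0] * y_pred[0] + H[1] * y_pred[1]
        # (c) a priori error covariance: A P A^T + B R_eta B^T
        AP = [[A[i][0] * P[0][j] + A[i][1] * P[1][j] for j in range(2)]
              for i in range(2)]
        P_pred = [[AP[i][0] * A[j][0] + AP[i][1] * A[j][1] + B[i] * R_ETA * B[j]
                   for j in range(2)] for i in range(2)]
        # (d) Kalman gain; R_w is a scalar because the observation is scalar
        PHt = [P_pred[i][0] * H[0] + P_pred[i][1] * H[1] for i in range(2)]
        r_w = H[0] * PHt[0] + H[1] * PHt[1] + R_V
        K = [PHt[0] / r_w, PHt[1] / r_w]
        # (e) signal update with the innovation x - x_pred
        innovation = x - x_pred
        y_hat = [y_pred[i] + K[i] * innovation for i in range(2)]
        # (f) a posteriori error covariance: (I - K H) P_pred
        HP = [H[0] * P_pred[0][j] + H[1] * P_pred[1][j] for j in range(2)]
        P = [[P_pred[i][j] - K[i] * HP[j] for j in range(2)] for i in range(2)]
        estimates.append(y_hat[0])
        gains.append(K)
        mse.append(P[0][0])
    return estimates, gains, mse


if __name__ == "__main__":
    ys, xs = simulate(100)
    estimates, gains, mse = kalman_filter(xs)
    print("final gain:", gains[-1], "final error variance:", mse[-1])
```

Because steps (c), (d), and (f) never touch the data, the gain sequence could equally be computed once off-line and stored, which is the constant-gain simplification mentioned in the text.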

FIGURE 7.12 Estimation of AR(2) process using Kalman filter in Example 7.8.2. [Plot: amplitude versus $n$, $0 \leq n \leq 100$, showing $y(n)$, $x(n)$, and $\hat{y}(n)$.]

FIGURE 7.13 Kalman filter gains and estimation error covariance in Example 7.8.2. [Two panels versus $n$, $0 \leq n \leq 100$: Kalman gain values $K_1(n)$ and $K_2(n)$; mean square error.]

In the next example, we consider the case of the estimation of position of an object in a linear motion subjected to random acceleration.

EXAMPLE 7.8.3. Consider an object traveling in a straight-line motion that is perturbed by random acceleration. Let $y_p(n) = y_c(nT)$ be the true position of the object at the $n$th sampling instant, where $T$ is the sampling interval in seconds and $y_c(t)$ is the instantaneous position. This position is measured by a sensor that records noisy observations. Let $x(n)$ be the measured position at the $n$th sampling instant. Then we can model the observation as

$$x(n) = y_p(n) + v(n) \qquad n \geq 0 \tag{7.8.45}$$

where $v(n) \sim \text{WGN}(0, \sigma_v^2)$. To derive the state dynamic equation, we assume that the object is in a steady-state motion (except for the random acceleration). Let $y_v(n) = \dot{y}_c(nT)$ be the true velocity at the $n$th sampling instant, where $\dot{y}_c(t)$ is the instantaneous velocity. Then we have the following equations of motion

$$y_v(n) = y_v(n-1) + y_a(n-1)T \tag{7.8.46}$$

$$y_p(n) = y_p(n-1) + y_v(n-1)T + \tfrac{1}{2} y_a(n-1)T^2 \tag{7.8.47}$$

where we have assumed that the acceleration $\ddot{y}_c(t)$ is constant over the sampling interval and that $y_a(n-1)$ is the acceleration over $(n-1)T \leq t < nT$. We now define the state vector as

$$\mathbf{y}(n) \triangleq \begin{bmatrix} y_p(n) \\ y_v(n) \end{bmatrix} \tag{7.8.48}$$

and the modeling error as $\eta(n) \triangleq y_a(n-1)$, which is assumed to be random with $\eta(n) \sim \text{WGN}(0, \sigma_\eta^2)$ and orthogonal to $v(n)$. Thus (7.8.46) and (7.8.47) can be arranged in vector form as

$$\mathbf{y}(n) = \begin{bmatrix} 1 & T \\ 0 & 1 \end{bmatrix} \mathbf{y}(n-1) + \begin{bmatrix} T^2/2 \\ T \end{bmatrix} \eta(n) \qquad n \geq 0 \tag{7.8.49}$$

Thus we have

$$\mathbf{A} = \begin{bmatrix} 1 & T \\ 0 & 1 \end{bmatrix} \qquad \text{and} \qquad \mathbf{B} = \begin{bmatrix} T^2/2 \\ T \end{bmatrix}$$

Similarly, the observation (7.8.45) is given by

$$x(n) = [1 \quad 0]\,\mathbf{y}(n) + v(n) \qquad n \geq 0 \tag{7.8.50}$$

and hence $\mathbf{H} = [1 \quad 0]$. Let the initial conditions be $y_p(-1)$ and $y_v(-1)$. Now given the noisy observations $\{x(n)\}$ and all the necessary information [$T$, $\sigma_v^2$, $\sigma_\eta^2$, $y_p(-1)$, and $y_v(-1)$], we can recursively estimate the position and velocity of the object at each sampling instant. An approach similar to this is used in aircraft navigation systems. Using the following values

$$T = 0.1 \qquad \sigma_v^2 = \sigma_\eta^2 = 0.25 \qquad y_p(-1) = 0 \qquad y_v(-1) = 1$$

we simulated the trajectory of the object over the [0, 10] second interval. The Kalman filter equations were obtained from Table 7.5, and the true positions as well as velocities were estimated using the noisy positions. Figure 7.14 shows the estimation results. The top graph shows the true, noisy, and estimated positions. The bottom graph shows the true and estimated velocities. Due to random acceleration values (which are moderate), the true velocity has small deviations from the constant value of 1 while the true position trajectory is approximately linear. The estimates of the position follow the true values very closely. However, the velocity estimates have more errors around the true velocities. This is because no direct measurements of velocities are available; therefore, the velocity of the object can be inferred only from position measurements.

[Figure 7.14: two panels versus $t$ (s), $0 \leq t \leq 10$. Top: true, noisy, and estimated positions, position (m). Bottom: true and estimated velocities, velocity (m/s).]

FIGURE 7.14 Estimation of positions and velocities using Kalman filter in Example 7.8.3.

In Figure 7.15, we show the trajectories of Kalman gain values and trace of the error covariance matrices. The top graph contains the gain values corresponding to position ($K_p$) and velocity ($K_v$). The steady state of the filter is reached in about 3 s. The bottom left graph contains the a priori and a posteriori error covariances, which also reach the steady-state values in 3 s and which appear to be very close to each other. Therefore, in the bottom right graph we show an exploded view of the steady-state region over a 1-s interval. It is interesting to note that the steady-state error covariances before and after processing an observation are not the same. As a result of making an observation, the a posteriori errors are reduced from the a priori ones. However, owing to random acceleration, the errors increase during the intervals between observations. This is shown as dotted lines in Figure 7.15. The steady state is reached when the decrease in errors achieved by each observation is canceled by the increase between observations.

[Figure 7.15: three panels versus $t$ (s). Top: Kalman gain components $K_p$ and $K_v$, $0 \leq t \leq 10$. Bottom left: trace of the a priori and a posteriori covariance matrices, $0 \leq t \leq 4$. Bottom right: exploded view of the same traces over $6 \leq t \leq 7$.]

FIGURE 7.15 Kalman filter gains and estimation error variances in Example 7.8.3.
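The steady-state behavior discussed above depends only on the covariance and gain equations (7.8.39), which use no data. A minimal Python sketch of that recursion for the model of Example 7.8.3 follows. The function name `covariance_recursion` is hypothetical, and the initial covariance $\mathbf{R}_y(-1) = \mathbf{I}$ used in the usage lines is an assumption, since the text does not state the value used for the figure.

```python
# Model of Example 7.8.3 with the parameter values quoted in the text.
T = 0.1         # sampling interval in seconds
VAR_V = 0.25    # observation noise variance
VAR_ETA = 0.25  # random acceleration variance
A = [[1.0, T], [0.0, 1.0]]
B = [T * T / 2.0, T]


def covariance_recursion(n_steps, P0):
    """Iterate (7.8.39) with H = [1 0].

    Returns a list of (K, trace a priori, trace a posteriori) tuples,
    one per time step. P0 plays the role of R_y(-1).
    """
    P = [row[:] for row in P0]
    history = []
    for _ in range(n_steps):
        # A priori covariance: A P A^T + B var_eta B^T
        AP = [[A[i][0] * P[0][j] + A[i][1] * P[1][j] for j in range(2)]
              for i in range(2)]
        P_pred = [[AP[i][0] * A[j][0] + AP[i][1] * A[j][1] + B[i] * VAR_ETA * B[j]
                   for j in range(2)] for i in range(2)]
        # With H = [1 0], P_pred H^T is the first column of P_pred
        # and R_w is the scalar P_pred[0][0] + var_v.
        r_w = P_pred[0][0] + VAR_V
        K = [P_pred[0][0] / r_w, P_pred[1][0] / r_w]
        # A posteriori covariance: (I - K H) P_pred, where H P_pred is row 0
        P = [[P_pred[i][j] - K[i] * P_pred[0][j] for j in range(2)]
             for i in range(2)]
        history.append((K, P_pred[0][0] + P_pred[1][1], P[0][0] + P[1][1]))
    return history


if __name__ == "__main__":
    # Assumed initial covariance (identity); 100 steps cover the 10-s interval.
    for n, (K, tr_prior, tr_post) in enumerate(
            covariance_recursion(100, [[1.0, 0.0], [0.0, 1.0]])):
        if n % 10 == 0:
            print(n * T, K, tr_prior, tr_post)
```

At every step the a posteriori trace is smaller than the a priori trace, and the two settle to different constants, which is the sawtooth pattern the exploded view in Figure 7.15 illustrates.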

It should be clear from the above two examples that the Kalman filter can recursively estimate signal values because of the assumption of dynamic models (7.8.14) and (7.8.16). Therefore, in this sense, the Kalman filter approach is a special case of the more general Wiener filter problem that we considered earlier. In many signal processing applications (e.g., data communication systems), assumption of such models is difficult to justify, which limits the use of Kalman filters.

7.9 SUMMARY

The application of optimum FIR filters and linear combiners involves the following two steps.

• Design. In this step, we determine the optimum values of the estimator parameters by solving the normal equations formed by using the known second-order moments. For stationary processes the design step is done only once. For nonstationary processes, we repeat the design when the statistics change.

• Implementation. In this step, we use the optimum parameters and the input data to compute the optimum estimate.

The type and complexity of the algorithms and structures available for the design and implementation of linear MMSE estimators depend on two factors:

• The shift invariance of the input data vector.

• The stationarity of the signals that determine the second-order moments in the normal equations.

As we introduce more structure (shift invariance or stationarity), the algorithms and structures become simpler. From a mathematical point of view, this is reflected in the structure of the correlation matrix, which starting from general Hermitian at one end becomes Toeplitz at the other.

Linear combiners

The input vector is not shift-invariant because the optimum estimate is computed by using samples from $M$ different signals. The correlation matrix $\mathbf{R}$ is Hermitian and usually positive definite. The normal equations are solved by using the $\mathbf{LDL}^H$ decomposition, and the optimum estimate is computed by using the obtained parameters. However, in many applications where we need the optimum estimate and not the coefficients of the optimum combiner, we can implement the MMSE linear combiner, using the orthogonal order-recursive structure shown in Figure 7.1. This structure consists of two parts: (1) a triangular decorrelator (orthogonalizer) that decorrelates the input data vector and produces its innovations vector and (2) a linear combiner that combines the uncorrelated innovations to compute the optimum estimates for all orders $1 \leq m \leq M$.

FIR filters and predictors

In this case the input data vector is shift-invariant, which leads to simplifications, whose extent depends on the stationarity of the involved signals.

Nonstationary case. In general, the correlation matrix is Hermitian and positive definite with no additional structure, and the $\mathbf{LDL}^H$ decomposition is the recommended method to solve the normal equations. However, the input shift invariance leads to a remarkable coupling between FLP, BLP, and FIR filtering, resulting in a simplified orthogonal order-recursive structure, which now takes the form of a lattice-ladder filter (see Figure 7.3). The backward prediction errors of all orders $1 \leq m \leq M$ provide the innovations of the input data vector. The parameters of lattice structure (decorrelator) are specified by the components of the $\mathbf{LDL}^H$ decomposition of the input correlation matrix. The coefficients of the ladder part (correlator) depend on both the input correlation matrix and the cross-correlation between the desired response and the input data vector.

Stationary case. In this case, the addition of stationarity to the shift invariance makes the correlation matrix Toeplitz. The presence of the Toeplitz structure has the following consequences:

1. The development of efficient order-recursive algorithms, with computational complexity proportional to $M^2$, for the solution of the normal equations and the triangularization of the correlation matrix.
   a. Levinson algorithm solves $\mathbf{Rc} = \mathbf{d}$ for arbitrary right-hand side vector $\mathbf{d}$ ($2M^2$ operations).
   b. Levinson-Durbin algorithm solves $\mathbf{Ra} = -\mathbf{r}^*$ when the right-hand side has special structure ($M^2$ operations).
   c. Schür algorithm computes directly the lattice-ladder parameters from the autocorrelation and cross-correlation sequences.

2. The MMSE FLP, BLP, and FIR filters are time-invariant; that is, their coefficients (direct-form or lattice-ladder structures) are constant and should be computed only once.

The algorithms for MMSE filtering and prediction of stationary processes are the simplest ones. However, we can also develop efficient algorithms for nonstationary processes that have special structure. There are two cases of interest:

• The Kalman filtering algorithm that can be used for processes generated by a state-space model with known parameters.

• Algorithms for α-stationary processes, that is, processes whose correlation matrix is near to Toeplitz, as measured by a special distance known as the displacement rank (Morf et al. 1977).
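As a reminder of how the Toeplitz structure is exploited, the Levinson-Durbin recursion listed in item 1(b) of the stationary case can be sketched as follows for a real-valued autocorrelation sequence. This is a generic textbook form, not a transcription of the chapter's tables: the function name `levinson_durbin` is hypothetical, and the sign and indexing of the reflection coefficients returned here are an assumption that may differ from the lattice parameters $k_m$ defined in this chapter.

```python
def levinson_durbin(r):
    """Solve R a = -r for a real Toeplitz R built from r = [r(0), ..., r(M)].

    Returns (a, P, ks): the predictor coefficients a_1..a_M, the final
    prediction error power P, and the reflection coefficients of each order.
    """
    M = len(r) - 1
    a = []    # a[j] holds a_{j+1} of the current order
    P = r[0]  # zeroth-order prediction error power
    ks = []
    for m in range(1, M + 1):
        # beta = r(m) + sum_{j=1}^{m-1} a_j r(m - j)
        beta = r[m] + sum(a[j] * r[m - 1 - j] for j in range(m - 1))
        k = -beta / P
        # Order update: a_j <- a_j + k * a_{m-j}, then append a_m = k
        a = [a[j] + k * a[m - 2 - j] for j in range(m - 1)] + [k]
        P *= 1.0 - k * k
        ks.append(k)
    return a, P, ks


if __name__ == "__main__":
    # Illustrative autocorrelation values, chosen only for the demonstration.
    a, P, ks = levinson_durbin([1.0, 0.7, 0.3, 0.1])
    print("a =", a)
    print("P =", P)
    print("reflection coefficients =", ks)
```

Each order update costs on the order of $m$ multiplications, so the total work grows as $M^2$, in line with the operation count quoted in the summary; a general-purpose solver that ignores the Toeplitz structure would need on the order of $M^3$.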

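PROBLEMS

7.1 By first computing the matrix product

$$\begin{bmatrix} \mathbf{R}_m & \mathbf{r}_m^b \\ \mathbf{r}_m^{bH} & \rho_m^b \end{bmatrix} \begin{bmatrix} \mathbf{I}_m & -\mathbf{R}_m^{-1}\mathbf{r}_m^b \\ \mathbf{0}_m^H & 1 \end{bmatrix}$$

and then the determinants of both sides, prove Equation (7.1.25). Another proof, obtained using the $\mathbf{LDL}^H$ decomposition, is given by Equation (7.2.4).

7.2 Prove the matrix inversion lemma for lower right corner partitioned matrices, which is described by Equations (7.1.26) and (7.1.28).

7.3 This problem generalizes the matrix inversion lemmas to nonsymmetric matrices.
(a) Show that if $\mathbf{R}^{-1}$ exists, the inverse of an upper left corner partitioned matrix is given by

$$\begin{bmatrix} \mathbf{R} & \mathbf{r} \\ \tilde{\mathbf{r}}^T & \sigma \end{bmatrix}^{-1} = \frac{1}{\alpha} \begin{bmatrix} \alpha\mathbf{R}^{-1} + \mathbf{w}\mathbf{v}^T & \mathbf{w} \\ \mathbf{v}^T & 1 \end{bmatrix}$$

where

$$\mathbf{R}\mathbf{w} \triangleq -\mathbf{r} \qquad \mathbf{R}^T\mathbf{v} \triangleq -\tilde{\mathbf{r}} \qquad \alpha \triangleq \sigma - \tilde{\mathbf{r}}^T\mathbf{R}^{-1}\mathbf{r} = \sigma + \mathbf{v}^T\mathbf{r} = \sigma + \tilde{\mathbf{r}}^T\mathbf{w}$$

(b) Show that if $\mathbf{R}^{-1}$ exists, the inverse of a lower right corner partitioned matrix is given by

$$\begin{bmatrix} \sigma & \tilde{\mathbf{r}}^T \\ \mathbf{r} & \mathbf{R} \end{bmatrix}^{-1} = \frac{1}{\alpha} \begin{bmatrix} 1 & \mathbf{v}^T \\ \mathbf{w} & \alpha\mathbf{R}^{-1} + \mathbf{w}\mathbf{v}^T \end{bmatrix}$$

where

$$\mathbf{R}\mathbf{w} \triangleq -\mathbf{r} \qquad \mathbf{R}^T\mathbf{v} \triangleq -\tilde{\mathbf{r}} \qquad \alpha \triangleq \sigma - \tilde{\mathbf{r}}^T\mathbf{R}^{-1}\mathbf{r} = \sigma + \mathbf{v}^T\mathbf{r} = \sigma + \tilde{\mathbf{r}}^T\mathbf{w}$$

(c) Check the validity of the lemmas in parts (a) and (b), using Matlab.

7.4 Develop an order-recursive algorithm to solve the linear system in Example 7.1.2, using the lower right corner partitioning lemma (7.1.26).

7.5 In this problem we consider two different approaches for inversion of symmetric and positive definite matrices by constructing an arbitrary fourth-order positive definite correlation matrix $\mathbf{R}$ and comparing their computational complexities.
(a) Given that the inverse of a lower (upper) triangular matrix is itself lower (upper) triangular, develop an algorithm for triangular matrix inversion.
(b) Compute the inverse of $\mathbf{R}$, using the algorithm in part (a) and Equation (7.1.58).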


(c) Build up the inverse of $\mathbf{R}$, using the recursion (7.1.24).
(d) Estimate the number of operations for each method as a function of order $M$, and check their validity for $M = 4$, using Matlab.

7.6 Using the appropriate orthogonality principles and definitions, prove Equation (7.3.32).

7.7 Prove Equations (7.3.36) to (7.3.38), using Equation (7.1.45).

7.8 Working as in Example 6.3.1, develop an algorithm for the upper-lower decomposition of a symmetric positive definite matrix. Then use it to factorize the matrix in Example 6.3.1, and verify your results, using the function [U,D]=udut(R).

7.9 In this problem we explore the meaning of the various quantities in the decomposition $\mathbf{R} = \mathbf{U}\bar{\mathbf{D}}\mathbf{U}^H$ of the correlation matrix.
(a) Show that the rows of $\mathbf{A} = \mathbf{U}^{-1}$ are the MMSE estimator of $x_m$ from $x_{m+1}, x_{m+2}, \ldots, x_M$.
(b) Show that the decomposition $\mathbf{R} = \mathbf{U}\bar{\mathbf{D}}\mathbf{U}^H$ can be obtained by the Gram-Schmidt orthogonalization process, starting with the random variable $x_M$ and ending with $x_1$, that is, proceeding backward.

7.10 In this problem we clarify the various quantities and the form of the partitionings involved in the $\mathbf{UDU}^H$ decomposition, using an $m = 4$ correlation matrix.
(a) Prove that the components of the forward prediction error vector (7.3.65) are uncorrelated.
(b) Writing explicitly the matrix $\mathbf{R}$, identify and express the quantities in Equations (7.3.62) through (7.3.67).
(c) Using the matrix $\mathbf{R}$ in Example 6.3.2, compute the predictors in (7.3.67) by using the corresponding normal equations, verify your results, comparing them with the rows of matrix $\mathbf{A}$ computed directly from the $\mathbf{LDL}^H$ decomposition of $\mathbf{R}^{-1}$ or the $\mathbf{UDU}^H$ decomposition of $\mathbf{R}$ (see Table 7.1).

7.11 Given an all-zero lattice filter with coefficients $k_0$ and $k_1$, determine the MSE $P(k_0, k_1)$ as a function of the required second-order moments, assumed jointly stationary, and plot the error performance surface. Use the statistics in Example 6.2.1.

7.12 Given the autocorrelation $r(0) = 1$, $r(1) = r(2) = \frac{1}{2}$, and $r(3) = \frac{1}{4}$, determine all possible representations for the third-order prediction error filter (see Figure 7.7).

7.13 Repeat Problem 7.12 for $k_0 = k_1 = k_2 = \frac{1}{3}$ and $P_3 = (\frac{2}{3})^3$.

7.14 Use Levinson's algorithm to solve the normal equations $\mathbf{Rc} = \mathbf{d}$ where $\mathbf{R} = \text{Toeplitz}\{3, 2, 1\}$ and $\mathbf{d} = [6\ 6\ 2]^T$.

7.15 Consider a random sequence with autocorrelation $\{r(l)\}_0^3 = \{1, 0.8, 0.6, 0.4\}$.
(a) Determine the FLP $\mathbf{a}_m$ and the corresponding error $P_m^f$ for $m = 1, 2, 3$.
(b) Determine and draw the flow diagram of the third-order lattice prediction error filter.

7.16 Using the Levinson-Durbin algorithm, determine the third-order linear predictor $\mathbf{a}_3$ and the MMSE $P_3$ for a signal with autocorrelation $r(0) = 1$, $r(1) = r(2) = \frac{1}{2}$, and $r(3) = \frac{1}{4}$.

7.17 Given the autocorrelation sequence $r(0) = 1$, $r(1) = r(2) = \frac{1}{2}$, and $r(3) = \frac{1}{4}$, compute the lattice and direct-form coefficients of the prediction error filter, using the algorithm of Schür.

7.18 Determine $\rho_1$ and $\rho_2$ so that the matrix $\mathbf{R} = \text{Toeplitz}\{1, \rho_1, \rho_2\}$ is positive definite.

7.19 Suppose that we want to fit an AR(2) model to a sinusoidal signal with random phase in additive noise. The autocorrelation sequence is given by

$$r(l) = P_0 \cos \omega_0 l + \sigma_v^2 \delta(l)$$

(a) Determine the model parameters $a_1^{(2)}$, $a_2^{(2)}$, and $\sigma_w^2$ in terms of $P_0$, $\omega_0$, and $\sigma_v^2$.
(b) Determine the lattice parameters of the model.
(c) What are the limiting values of the direct and lattice parameters of the model when $\sigma_v^2 \to 0$?

7.20 Given the parameters $r(0) = 1$, $k_0 = k_1 = \frac{1}{2}$, and $k_2 = \frac{1}{4}$, determine all other equivalent representations of the prediction error filter (see Figure 7.7).

7.21 Let $\{r(l)\}_0^P$ be samples of the autocorrelation sequence of a stationary random signal $x(n)$.
(a) Is it possible to extend $r(l)$ for $|l| > P$ so that the PSD

$$R(e^{j\omega}) = \sum_{l=-\infty}^{\infty} r(l)e^{-j\omega l}$$

is valid, that is, $R(e^{j\omega}) \geq 0$?
(b) Using the algorithm of Levinson-Durbin, develop a procedure to check if a given autocorrelation extension is valid.
(c) Use the algorithm in part (b) to find the necessary and sufficient conditions so that $r(0) = 1$, $r(1) = \rho_1$, and $r(2) = \rho_2$ are a valid autocorrelation sequence. Is the resulting extension unique?

7.22 Justify the following statements.
(a) The whitening filter for a stationary process $x(n)$ is time-varying.
(b) The filter in part (a) can be implemented by using a lattice structure and switching its stages on one by one with the arrival of each new sample.
(c) If $x(n)$ is AR($P$), the whitening filter becomes time-invariant $P + 1$ sampling intervals after the first sample is applied.
Note: We assume that the input is applied to the filter at $n = 0$. If the input is applied at $n = -\infty$, the whitening filter of a stationary process is always time-invariant.

7.23 Given the parameters $r(0) = 1$, $k_0 = \frac{1}{2}$, $k_1 = \frac{1}{3}$, and $k_2 = \frac{1}{4}$, compute the determinant of the matrix $\mathbf{R}_4 = \text{Toeplitz}\{r(0), r(1), r(2), r(3)\}$.

7.24 (a) Determine the lattice second-order prediction error filter (PEF) for a sequence $x(n)$ with autocorrelation $r(l) = (\frac{1}{2})^{|l|}$.
(b) Repeat part (a) for the sequence $y(n) = x(n) + v(n)$, where $v(n) \sim \text{WN}(0, 0.2)$ is uncorrelated to $x(n)$.
(c) Explain the change in the lattice parameters using frequency domain reasoning (think of the PEF as a whitening filter).

7.25 Consider a prediction error filter specified by $P_3 = (\frac{15}{16})^2$, $k_0 = \frac{1}{4}$, $k_1 = \frac{1}{2}$, and $k_2 = \frac{1}{4}$.
(a) Determine the direct-form filter coefficients.
(b) Determine the autocorrelation values $r(1)$, $r(2)$, and $r(3)$.
(c) Determine the value $r(4)$ so that the MMSE $P_4$ for the corresponding fourth-order filter is the minimum possible.

7.26 Consider a prediction error filter $A_M(z) = 1 + a_1^{(M)} z^{-1} + \cdots + a_M^{(M)} z^{-M}$ with lattice parameters $k_1, k_2, \ldots, k_M$.
(a) Show that if we set $\hat{k}_m = (-1)^m k_m$, then $\hat{a}_m^{(M)} = (-1)^m a_m^{(M)}$.
(b) What are the new filter coefficients if we set $\hat{k}_m = \rho^m k_m$, where $\rho$ is a complex number with $|\rho| = 1$? What happens if $|\rho| < 1$?

7.27 Suppose that we are given the values $\{r(l)\}_{-m+1}^{m-1}$ of an autocorrelation sequence such that the Toeplitz matrix $\mathbf{R}_m$ is positive definite.
(a) Show that the values of $r(m)$ such that $\mathbf{R}_{m+1}$ is positive definite determine a disk in the complex plane. Find the center $\alpha_m$ and the radius $\zeta_m$ of this disk.
(b) By induction show that there are infinitely many extensions of $\{r(l)\}_{-m+1}^{m-1}$ that make $\{r(l)\}_{-\infty}^{\infty}$ a valid autocorrelation sequence.

7.28 Consider the MA(1) sequence $x(n) = w(n) + d_1 w(n-1)$, $w(n) \sim \text{WN}(0, \sigma_w^2)$.
(a) Show that

$$\det \mathbf{R}_m = r(0) \det \mathbf{R}_{m-1} - |r(1)|^2 \det \mathbf{R}_{m-2}$$

(b) Show that $k_m = -r^m(1)/\det \mathbf{R}_m$ and that

$$\frac{1}{k_m} = -\frac{r(0)}{r(1)}\frac{1}{k_{m-1}} - \frac{r^*(1)}{r(1)}\frac{1}{k_{m-2}} \qquad m \geq 2$$

(c) Determine the initial conditions and solve the recursion in (b) to show that

$$k_m = \frac{(1 - |d_1|^2)(-d_1)^m}{1 - |d_1|^{2m+2}}$$

which tends to zero as $m \to \infty$.

7.29 Prove Equation (7.6.6) by exploiting the symmetry property $\mathbf{b}_m = \mathbf{J}\mathbf{a}_m^*$.

7.30 In this problem we show that the lattice parameters can be obtained by "feeding" the autocorrelation sequence through the lattice filter as a signal and switching on the stages one by one after the required lattice coefficient is computed. The value of $k_m$ is computed at time $n = m$ from the inputs to stage $m$.
(a) Using (7.6.10), draw the flow diagram of a third-order lattice filter that implements this algorithm.
(b) Using the autocorrelation sequence in Example 7.6.1, "feed" the sequence $\{r(n)\}_0^3 = \{3, 2, 1, \frac{1}{2}\}$ through the filter one sample at a time, and compute the lattice parameters. Hint: Use Example 7.6.1 for guidance.

7.31 Draw the superlattice structure for $M = 8$, and show how it can be partitioned to distribute the computations to three processors for parallel execution.

7.32 Derive the superladder structure shown in Figure 7.10.

7.33 Extend the algorithm of Schür to compute the $\mathbf{LDL}^H$ decomposition of a Hermitian Toeplitz matrix, and write a Matlab function for its implementation.

7.34 Given the matrix $\mathbf{R}_3 = \text{Toeplitz}\{1, \frac{1}{2}, \frac{1}{2}\}$, use the appropriate order-recursive algorithms to compute the following: (a) The $\mathbf{LDL}^H$ and $\mathbf{UDU}^H$ decompositions of $\mathbf{R}$, (b) the $\mathbf{LDL}^H$ and $\mathbf{UDU}^H$ decompositions of $\mathbf{R}^{-1}$, and (c) the inverse matrix $\mathbf{R}^{-1}$.

7.35 Consider the AR(1) process $x(n) = \rho x(n-1) + w(n)$, where $w(n) \sim \text{WN}(0, \sigma_w^2)$ and $-1 < \rho < 1$.
(a) Determine the correlation matrix $\mathbf{R}_{M+1}$ of the process.
(b) Determine the $M$th-order FLP, using the algorithm of Levinson-Durbin.
(c) Determine the inverse matrix $\mathbf{R}_{M+1}^{-1}$, using the triangular decomposition discussed in Section 7.7.

7.36 If $r(l) = \cos \omega_0 l$, determine the second-order prediction error filter and check whether it is minimum-phase.

7.37 Show that the MMSE linear predictor of $x(n + D)$ in terms of $x(n), x(n-1), \ldots, x(n-M+1)$ for $D \geq 1$ is given by

$$\mathbf{R}\mathbf{a}^{(D)} = -\mathbf{r}^{(D)}$$

where $\mathbf{r}^{(D)} = [r(D)\ r(D+1)\ \cdots\ r(D+M-1)]^T$. Develop a recursion that computes $\mathbf{a}^{(D+1)}$ from $\mathbf{a}^{(D)}$ by exploiting the shift invariance of the vector $\mathbf{r}^{(D)}$. See Manolakis et al. (1983).

7.38 The normal equations for the optimum symmetric signal smoother (see Section 6.5.1) can be written as

$$\mathbf{R}_{2m+1}\mathbf{c}_{2m+1} = \begin{bmatrix} \mathbf{0} \\ P_{2m+1} \\ \mathbf{0} \end{bmatrix}$$

where $P_{2m+1}$ is the MMSE, $\mathbf{c}_{2m+1} = \mathbf{J}\mathbf{c}_{2m+1}^*$, and $c_m^{(2m+1)} = 1$.
(a) Using a "central" partitioning of $\mathbf{R}_{2m+3}$ and the persymmetry property of Toeplitz matrices, develop a recursion to determine $\mathbf{c}_{2m+3}$ from $\mathbf{c}_{2m+1}$.
(b) Develop a complete order-recursive algorithm for the computation of $\{\mathbf{c}_{2m+1}, P_{2m+1}\}_0^M$ (see Kok et al. 1993).

7.39 Using the triangular decomposition of a Toeplitz correlation matrix, show that (a) the forward prediction errors of various orders and at the same time instant, that is,

$$\mathbf{e}^f(n) = [e_0^f(n)\ e_1^f(n)\ \cdots\ e_m^f(n)]^T$$

are correlated and (b) the forward prediction errors

$$\bar{\mathbf{e}}^f(n) = [e_M^f(n)\ e_{M-1}^f(n-1)\ \cdots\ e_0^f(n-M)]^T$$

are uncorrelated.

7.40 Generalize the inversion algorithm described in Section 7.7.3 to handle Hermitian Toeplitz matrices.

7.41 Consider the estimation of a constant $\alpha$ from its noisy observations. The signal and observation models are given by

$$y(n+1) = y(n) \qquad n > 0 \qquad y(0) = \alpha$$

$$x(n) = y(n) + v(n) \qquad v(n) \sim \text{WGN}(0, \sigma_v^2)$$

(a) Develop scalar Kalman filter equations, assuming the initial condition on the a posteriori error variance $R_{\tilde{y}}(0|0)$ equal to $r_0$.
(b) Show that the a posteriori error variance $R_{\tilde{y}}(n|n)$ is given by

$$R_{\tilde{y}}(n|n) = \frac{r_0}{1 + (r_0/\sigma_v^2)n} \tag{P.1}$$

(c) Show that the optimal filter for the estimation of the constant $\alpha$ is given by

$$\hat{y}(n) = \hat{y}(n-1) + \frac{r_0/\sigma_v^2}{1 + (r_0/\sigma_v^2)n}[x(n) - \hat{y}(n-1)]$$

7.42 Consider a random process with PSD given by
$$R_s(e^{j\omega}) = \frac{4}{2.4661 - 1.629\cos\omega + 0.81\cos 2\omega}$$
(a) Using Matlab, plot the PSD $R_s(e^{j\omega})$ and determine the resonant frequency $\omega_0$. (b) Using spectral factorization, develop a signal model for the process of the form
$$\mathbf{y}(n) = \mathbf{A}\mathbf{y}(n-1) + \mathbf{B}\eta(n) \qquad s(n) = [1\;\; 0]\,\mathbf{y}(n)$$
where $\mathbf{y}(n)$ is a 2 × 1 vector, η(n) ∼ WGN(0, 1), and A and B are matrices with appropriate dimensions. (c) Let x(n) be the observed values of s(n) given by
$$x(n) = s(n) + v(n) \qquad v(n) \sim \text{WGN}(0, 1)$$

Assuming reasonable initial conditions, develop Kalman filter equations and implement them, using Matlab. Study the performance of the filter by simulating a few sample functions of the signal process s(n) and its observation x(n).
7.43 Alternative form of the Kalman filter. A number of different identities and expressions can be obtained for the quantities defining the Kalman filter. (a) By manipulating the last two equations in (7.8.39) show that
$$\mathbf{R}_{\tilde{y}}(n|n) = \mathbf{R}_{\tilde{y}}(n|n-1) - \mathbf{R}_{\tilde{y}}(n|n-1)\mathbf{H}^H(n)\,[\mathbf{H}(n)\mathbf{R}_{\tilde{y}}(n|n-1)\mathbf{H}^H(n) + \mathbf{R}_v(n)]^{-1}\,\mathbf{H}(n)\mathbf{R}_{\tilde{y}}(n|n-1) \tag{P.2}$$
(b) If the inverses of $\mathbf{R}_{\tilde{y}}(n|n)$, $\mathbf{R}_{\tilde{y}}(n|n-1)$, and $\mathbf{R}_v$ exist, then show that
$$\mathbf{R}_{\tilde{y}}^{-1}(n|n) = \mathbf{R}_{\tilde{y}}^{-1}(n|n-1) + \mathbf{H}^H(n)\mathbf{R}_v^{-1}(n)\mathbf{H}(n) \tag{P.3}$$
This shows that the update of the error covariance matrix does not require the Kalman gain matrix (but does require matrix inverses). (c) Finally show that the gain matrix is given by
$$\mathbf{K}(n) = \mathbf{R}_{\tilde{y}}(n|n)\mathbf{H}^H(n)\mathbf{R}_v^{-1}(n) \tag{P.4}$$
which is computed by using the a posteriori error covariance matrix.


7.44 In Example 7.8.3 we assumed that only the position measurements were available for estimation. In this problem we will assume that we also have a noisy sensor to measure velocity. Hence the observation model is
$$\mathbf{x}(n) \triangleq \begin{bmatrix} x_p(n) \\ x_v(n) \end{bmatrix} = \begin{bmatrix} y_p(n) + v_1(n) \\ y_v(n) + v_2(n) \end{bmatrix} \tag{P.5}$$
where $v_1(n)$ and $v_2(n)$ are two independent zero-mean white Gaussian noise sources with variances $\sigma_{v_1}^2$ and $\sigma_{v_2}^2$, respectively. (a) Using the state vector model given in Example 7.8.3 and the observation model in (P.5), develop Kalman filter equations to estimate position and velocity of the object at each n. (b) Using the parameter values
$$T = 0.1 \qquad \sigma_{v_1}^2 = \sigma_{v_2}^2 = \sigma_\eta^2 = 0.25 \qquad y_p(-1) = 0 \qquad y_v(-1) = 1$$

simulate the true and observed positions and velocities of the object. Using your Kalman filter equations, generate plots similar to the ones given in Figures 7.14 and 7.15. (c) Discuss the effects of velocity measurements on the estimates.
7.45 In this problem, we will assume that the acceleration $y_a(n)$ is an AR(1) process rather than a white noise process. Let $y_a(n)$ be given by
$$y_a(n) = \alpha y_a(n-1) + \eta(n) \qquad \eta(n) \sim \text{WGN}(0, \sigma_\eta^2) \qquad y_a(-1) = 0 \tag{P.6}$$
(a) Augment the state vector y(n) in (7.8.48), using variable $y_a(n)$, and develop the state vector as well as the observation model, assuming that only the position is measured. (b) Using the above model and the parameter values
$$T = 0.1 \qquad \alpha = 0.9 \qquad \sigma_v^2 = \sigma_\eta^2 = 0.25 \qquad y_p(-1) = 0 \qquad y_v(-1) = 1 \qquad y_a(-1) = 0$$
simulate the linear motion of the object. Using Kalman filter equations, estimate the position, velocity, and acceleration values of the object at each n. Generate performance plots similar to the ones given in Figures 7.14 and 7.15. (c) Now assume that noisy measurements of $y_v(n)$ and $y_a(n)$ are also available, that is, the observation model is
$$\mathbf{x}(n) \triangleq \begin{bmatrix} x_p(n) \\ x_v(n) \\ x_a(n) \end{bmatrix} = \begin{bmatrix} y_p(n) + v_1(n) \\ y_v(n) + v_2(n) \\ y_a(n) + v_3(n) \end{bmatrix} \tag{P.7}$$
where $v_1(n)$, $v_2(n)$, and $v_3(n)$ are IID zero-mean white Gaussian noise sources with variance $\sigma_v^2$. Repeat parts (a) and (b) above.

CHAPTER 8

Least-Squares Filtering and Prediction

In this chapter, we deal with the design and properties of linear combiners, finite impulse response (FIR) filters, and linear predictors that are optimum in the least-squares error (LSE) sense. The principle of least squares is widely used in practice because second-order moments are rarely known. In the first part of this chapter (Sections 8.1 through 8.4), we concentrate on the design, properties, and applications of least-squares (LS†) estimators. Section 8.1 discusses the principle of LS estimation. The unique aspects of the different implementation structures, starting with the general linear combiner followed by the FIR filter and predictor, are treated in Sections 8.2 to 8.4. In the second part (Sections 8.5 to 8.7), we discuss various numerical algorithms for the solution of the LSE normal equations and the computation of LSE estimates including QR decomposition techniques (Householder reflections, Givens rotations, and modified Gram-Schmidt orthogonalization) and the singular value decomposition (SVD).

8.1 THE PRINCIPLE OF LEAST SQUARES

The principle of least squares was introduced by the German mathematician Carl Friedrich Gauss, who used it to determine the orbit of the asteroid Ceres in 1801 by formulating the estimation problem as an optimization problem. The design of optimum filters in the minimum mean square error (MMSE) sense, discussed in Chapter 6, requires the a priori knowledge of second-order moments. However, such statistical information is simply not available in most practical applications, for which we can only obtain measurements of the input and desired response signals. To avoid this problem, we can (1) estimate the required second-order moments from the available data (see Chapter 5), if possible, to obtain an estimate of the optimum MMSE filter, or (2) design an optimum filter by minimizing a criterion of performance that is a function of the available data.

In this chapter, we use the minimization of the sum of the squares of the estimation error as the criterion of performance for the design of optimum filters. This method, known as least-squares error (LSE) estimation, requires the measurement of both the input signal and the desired response signal. A natural question arising at this point is, What is the purpose of estimating the values of a known, desired response signal? There are several answers:

† A note about abbreviations used throughout the chapter: The two acronyms LSE and LS will be used almost interchangeably. Although LSE is probably the more accurate term, LS has become a standard reference to LSE estimators.


1. In system modeling applications, the goal is to obtain a mathematical model describing the input-output behavior of an actual system. A quality estimator provides a good model for the system. The desired result is the estimator or system model, not the actual estimate.
2. In linear predictive coding, the useful result is the prediction error or the respective predictor coefficients.
3. In many applications, the desired response is not available (e.g., digital communications). Therefore, we do not always have a complete set of data from which to design the LSE estimator. However, if the data do not change significantly over a number of sets, then one special complete set, the training set, is used to design the estimator. The resulting estimator is then applied to the processing of the remaining incomplete sets.

The use of measured signal values to determine the coefficients of the estimator leads to some fundamental differences between MMSE and LSE estimation that are discussed where appropriate. To summarize, depending on the available information, there are two ways to design an optimum estimator: (1) If we know the second-order moments, we use the MMSE criterion and design a filter that is optimum for all possible sets of data with the same statistics. (2) If we only have a block of data, we use the LSE criterion to design an estimator that is optimum for the given block of data.

Optimum MMSE estimators are obtained by using ensemble averages, whereas LSE estimators are obtained by using finite-length time averages. For example, an MMSE estimator, designed using ensemble averages, is optimum for all realizations. In contrast, an LSE estimator, designed using a block of data from a particular realization, depends on the numerical values of samples used in the design. If the processes are ergodic, the LSE estimator approaches the MMSE estimator as the block length of the data increases toward infinity.

8.2 LINEAR LEAST-SQUARES ERROR ESTIMATION

We start with the derivation of general linear LS filters that are implemented using the linear combiner structure described in Section 6.2. A set of measurements of the desired response y(n) and the input signals $x_k(n)$ for 1 ≤ k ≤ M has been taken for 0 ≤ n ≤ N − 1. As in optimum MMSE estimation, the problem is to estimate the desired response y(n) using the linear combination
$$\hat{y}(n) = \sum_{k=1}^{M} c_k^*(n)\,x_k(n) = \mathbf{c}^H(n)\,\mathbf{x}(n) \tag{8.2.1}$$
We define the estimation error as
$$e(n) = y(n) - \hat{y}(n) = y(n) - \mathbf{c}^H(n)\,\mathbf{x}(n) \tag{8.2.2}$$
and the coefficients of the combiner are determined by minimizing the sum of the squared errors
$$E \triangleq \sum_{n=0}^{N-1} |e(n)|^2 \tag{8.2.3}$$

that is, the energy of the error signal. For this minimization to be possible, the coefficient vector c(n) should be held constant over the measurement time interval 0 ≤ n ≤ N − 1. The constant vector cls resulting from this optimization depends on the measurement set and is known as the linear LSE estimator. In the statistical literature, LSE estimation is known as linear regression, where (8.2.2) is called a regression function, e(n) are known as residuals (leftovers), and c(n) is the regression vector (Montgomery and Peck 1982).
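As a minimal sketch of (8.2.2) and (8.2.3), assuming the snapshots are stored as plain Python lists, the following evaluates the error sequence and its energy for one fixed coefficient vector. The function names `estimation_errors` and `error_energy` and the sample data are illustrative assumptions, not part of the text.

```python
def estimation_errors(snapshots, y, c):
    """Return e(n) = y(n) - c^H x(n) for every snapshot x(n), as in (8.2.2)."""
    return [
        yn - sum(ck.conjugate() * xk for ck, xk in zip(c, xn))
        for xn, yn in zip(snapshots, y)
    ]


def error_energy(errors):
    """Return the sum of squared error magnitudes, as in (8.2.3)."""
    return sum(abs(e) ** 2 for e in errors)


# Hypothetical data: N = 4 snapshots of M = 2 inputs, and a trial vector c.
snapshots = [[1, 2], [2, 1], [1, 2], [1, 3]]
y = [1, 2, 3, 2]
errors = estimation_errors(snapshots, y, [1, 0])
print(errors, error_energy(errors))
```

The coefficient vector is held fixed over the whole block, which is the condition the text states for the minimization to be meaningful.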

The system of equations in (8.2.2), or equivalently $e^*(n) = y^*(n) - \mathbf{x}^H(n)\,\mathbf{c}$, can be written in matrix form as
$$\begin{bmatrix} e^*(0) \\ e^*(1) \\ \vdots \\ e^*(N-1) \end{bmatrix} = \begin{bmatrix} y^*(0) \\ y^*(1) \\ \vdots \\ y^*(N-1) \end{bmatrix} - \begin{bmatrix} x_1^*(0) & x_2^*(0) & \cdots & x_M^*(0) \\ x_1^*(1) & x_2^*(1) & \cdots & x_M^*(1) \\ \vdots & \vdots & \ddots & \vdots \\ x_1^*(N-1) & x_2^*(N-1) & \cdots & x_M^*(N-1) \end{bmatrix} \begin{bmatrix} c_1 \\ c_2 \\ \vdots \\ c_M \end{bmatrix} \tag{8.2.4}$$
or more compactly as
$$\mathbf{e} = \mathbf{y} - \mathbf{X}\mathbf{c} \tag{8.2.5}$$
where
$$\begin{aligned} \mathbf{e} &\triangleq [e(0)\; e(1)\; \cdots\; e(N-1)]^H && \text{error data vector } (N \times 1) \\ \mathbf{y} &\triangleq [y(0)\; y(1)\; \cdots\; y(N-1)]^H && \text{desired response vector } (N \times 1) \\ \mathbf{X} &\triangleq [\mathbf{x}(0)\; \mathbf{x}(1)\; \cdots\; \mathbf{x}(N-1)]^H && \text{input data matrix } (N \times M) \\ \mathbf{c} &\triangleq [c_1\; c_2\; \cdots\; c_M]^T && \text{combiner parameter vector } (M \times 1) \end{aligned} \tag{8.2.6}$$
are defined by comparing (8.2.4) to (8.2.5). The input data matrix X can be partitioned either columnwise or rowwise as follows:
$$\mathbf{X} \triangleq [\tilde{\mathbf{x}}_1, \tilde{\mathbf{x}}_2, \ldots, \tilde{\mathbf{x}}_M] = \begin{bmatrix} \mathbf{x}^H(0) \\ \mathbf{x}^H(1) \\ \vdots \\ \mathbf{x}^H(N-1) \end{bmatrix} \tag{8.2.7}$$
where the columns $\tilde{\mathbf{x}}_k$ of X
$$\tilde{\mathbf{x}}_k \triangleq [x_k(0)\; x_k(1)\; \cdots\; x_k(N-1)]^H$$
will be called data records and the rows, formed from the vectors
$$\mathbf{x}(n) \triangleq [x_1(n)\; x_2(n)\; \cdots\; x_M(n)]^T$$
will be called snapshots. Both of these partitionings of the data matrix, which are illustrated in Figure 8.1, are useful in the derivation, interpretation, and computation of LSE estimators.

The LSE estimator operates in a block processing mode; that is, it processes a frame of N snapshots using the steps shown in Figure 8.2. The input signals are blocked into frames of N snapshots with successive frames overlapping by $N_0$ samples. The values of N and $N_0$ depend on the application. The required estimate or residual signals are unblocked at the final stage of the processor.

If we set e = 0, we have a set of N equations with M unknowns. If N = M, then (8.2.4) usually has a unique solution. For N > M, we have an overdetermined system of linear equations that typically has no solution. Conversely, if N < M, we have an underdetermined system that has an infinite number of solutions. However, even if M > N or N > M, the system (8.2.4) has a natural, unique, least-squares solution. We next focus our attention on overdetermined systems since they play a very important role in practical applications. The underdetermined least-squares problem is examined in Section 8.7.2.
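The frame-blocking step described above can be sketched as follows. This is only an illustration under stated assumptions: the helper name `block_frames` is hypothetical, snapshots are held in a Python list, and a trailing partial frame is simply dropped (the text does not say how the last incomplete frame is handled).

```python
def block_frames(snapshots, N, N0):
    """Split a list of snapshots into frames of N snapshots overlapping by N0."""
    if not 0 <= N0 < N:
        raise ValueError("need 0 <= N0 < N")
    hop = N - N0  # number of new snapshots brought in by each frame
    last_start = len(snapshots) - N
    # A trailing partial frame is dropped (an assumption of this sketch).
    return [snapshots[s:s + N] for s in range(0, last_start + 1, hop)]


# Ten scalar "snapshots", frames of N = 4 overlapping by N0 = 2.
frames = block_frames(list(range(10)), N=4, N0=2)
print(frames)
```

Each frame would then supply the rows of one data matrix X, and the LSE estimator is recomputed frame by frame.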

FIGURE 8.1 The columns of the data matrix are the records of data collected at each input (sensor), whereas each row contains the samples from all inputs at the same instant.

FIGURE 8.2 Block processing implementation of a general linear LSE estimator. (Stages: frame blocking of the inputs $x_1(n), \ldots, x_M(n)$ and y(n) into X and y; compute and solve normal equations to obtain $\mathbf{c}_{ls}$; compute estimates or residuals; frame unblocking to obtain $\hat{y}(n)$ and e(n).)

8.2.1 Derivation of the Normal Equations

We provide an algebraic and a geometric solution to the LSE estimation problem; a calculus-based derivation is given in Problem 8.1.

Algebraic derivation. The energy of the error can be written as
$$\begin{aligned} E = \mathbf{e}^H\mathbf{e} &= (\mathbf{y}^H - \mathbf{c}^H\mathbf{X}^H)(\mathbf{y} - \mathbf{X}\mathbf{c}) \\ &= \mathbf{y}^H\mathbf{y} - \mathbf{c}^H\mathbf{X}^H\mathbf{y} - \mathbf{y}^H\mathbf{X}\mathbf{c} + \mathbf{c}^H\mathbf{X}^H\mathbf{X}\mathbf{c} \\ &= E_y - \mathbf{c}^H\hat{\mathbf{d}} - \hat{\mathbf{d}}^H\mathbf{c} + \mathbf{c}^H\hat{\mathbf{R}}\mathbf{c} \end{aligned} \tag{8.2.8}$$
where
$$E_y \triangleq \mathbf{y}^H\mathbf{y} = \sum_{n=0}^{N-1} |y(n)|^2 \tag{8.2.9}$$
$$\hat{\mathbf{R}} \triangleq \mathbf{X}^H\mathbf{X} = \sum_{n=0}^{N-1} \mathbf{x}(n)\,\mathbf{x}^H(n) \tag{8.2.10}$$
$$\hat{\mathbf{d}} \triangleq \mathbf{X}^H\mathbf{y} = \sum_{n=0}^{N-1} \mathbf{x}(n)\,y^*(n) \tag{8.2.11}$$
Note that these quantities can be viewed as time-average estimates of the desired response power, correlation matrix of the input data vector, and the cross-correlation vector between the desired response and the data vector, when these quantities are divided by the number of data samples N.

We emphasize that all formulas derived for the MMSE criterion hold for the LSE criterion if we replace the expectation $E\{(\cdot)\}$ with the time-average operator $(1/N)\sum_{n=0}^{N-1}(\cdot)$. This results from the fact that both criteria are quadratic cost functions. Therefore, working as in Section 6.2.2, we conclude that if the time-average correlation matrix $\hat{\mathbf{R}}$ is positive definite, the LSE estimator $\mathbf{c}_{ls}$ is provided by the solution of the normal equations
$$\hat{\mathbf{R}}\,\mathbf{c}_{ls} = \hat{\mathbf{d}} \tag{8.2.12}$$
and the minimum sum of squared errors is given by
$$E_{ls} = E_y - \hat{\mathbf{d}}^H\hat{\mathbf{R}}^{-1}\hat{\mathbf{d}} = E_y - \hat{\mathbf{d}}^H\mathbf{c}_{ls} \tag{8.2.13}$$
Since $\hat{\mathbf{R}}$ is Hermitian, we only need to compute the elements
$$\hat{r}_{ij} = \tilde{\mathbf{x}}_i^H\tilde{\mathbf{x}}_j \tag{8.2.14}$$
in the upper triangular part, which requires M(M + 1)/2 dot products. The right-hand side requires M dot products
$$\hat{d}_i = \tilde{\mathbf{x}}_i^H\mathbf{y} \tag{8.2.15}$$
Note that each dot product involves N arithmetic operations, each consisting of one multiplication and one addition. Thus, to form the normal equations requires a total of
$$\tfrac{1}{2}M(M+1)N + MN = \tfrac{1}{2}M^2N + \tfrac{3}{2}MN \tag{8.2.16}$$
arithmetic operations. When $\hat{\mathbf{R}}$ is nonsingular, which is the case when $\hat{\mathbf{R}}$ is positive definite, we can solve the normal equations using either the $\mathbf{LDL}^H$ or the Cholesky decomposition (see Section 6.3). However, it should be stressed at this point that most of the computational work lies in forming the normal equations rather than their solution. The formulation of the overdetermined LS equations and the normal equations is illustrated graphically in Figure 8.3. The solution of LS problems has been extensively studied in various application areas and in numerical analysis. The basic methods for the solution of the LS problem, which are discussed in this book, are shown in Figure 8.4. We just stress here that for overdetermined LS problems, well-behaved data, and sufficient numerical precision, all these methods provide comparable results.

Geometric derivation. We may think of the desired response record y and the data records $\tilde{\mathbf{x}}_k$, 1 ≤ k ≤ M, as vectors in an N-dimensional vector space, with the dot product and length defined by
$$\langle \tilde{\mathbf{x}}_i, \tilde{\mathbf{x}}_j \rangle \triangleq \tilde{\mathbf{x}}_i^H\tilde{\mathbf{x}}_j = \sum_{n=0}^{N-1} x_i(n)\,x_j^*(n) \tag{8.2.17}$$
and
$$\|\tilde{\mathbf{x}}\|^2 \triangleq \langle \tilde{\mathbf{x}}, \tilde{\mathbf{x}} \rangle = \sum_{n=0}^{N-1} |x(n)|^2 = E_x \tag{8.2.18}$$
respectively. The estimate of the desired response record can be expressed as
$$\hat{\mathbf{y}} = \mathbf{X}\mathbf{c} = \sum_{k=1}^{M} c_k\,\tilde{\mathbf{x}}_k \tag{8.2.19}$$

that is, as a linear combination of the data records. The M vectors $\tilde{\mathbf{x}}_k$ form an M-dimensional subspace, called the estimation space, which is the column space of data matrix X. Clearly, any estimate $\hat{\mathbf{y}}$ must lie in the estimation space. The desired response record y, in general, lies outside the estimation space. The estimation space for M = 2 and N = 3 is illustrated in Figure 8.5. The error vector e points from the tip of $\hat{\mathbf{y}}$ to the tip of y. The squared length of e is minimum when e is perpendicular to the estimation space, that is, $\mathbf{e} \perp \tilde{\mathbf{x}}_k$ for 1 ≤ k ≤ M. Therefore, we have the orthogonality principle
$$\langle \tilde{\mathbf{x}}_k, \mathbf{e} \rangle = \tilde{\mathbf{x}}_k^H\mathbf{e} = 0 \qquad 1 \le k \le M \tag{8.2.20}$$
or more compactly
$$\mathbf{X}^H\mathbf{e} = \mathbf{X}^H(\mathbf{y} - \mathbf{X}\mathbf{c}_{ls}) = \mathbf{0} \qquad \text{or} \qquad (\mathbf{X}^H\mathbf{X})\,\mathbf{c}_{ls} = \mathbf{X}^H\mathbf{y} \tag{8.2.21}$$
which we recognize as the LSE normal equations from (8.2.12). The LS solution splits the desired response y into two orthogonal components, namely, $\hat{\mathbf{y}}_{ls}$ and $\mathbf{e}_{ls}$. Therefore,
$$\|\mathbf{y}\|^2 = \|\hat{\mathbf{y}}_{ls}\|^2 + \|\mathbf{e}_{ls}\|^2 \tag{8.2.22}$$
and, using (8.2.18) and (8.2.19), we have
$$E_{ls} = E_y - \mathbf{c}_{ls}^H\mathbf{X}^H\mathbf{X}\mathbf{c}_{ls} = E_y - \mathbf{c}_{ls}^H\mathbf{X}^H\mathbf{y} \tag{8.2.23}$$

FIGURE 8.3 The LS problem and computation of the normal equations. (Diagram: the overdetermined least-squares equations $\mathbf{X}\mathbf{c} \approx \mathbf{y}$ with N > M, and the M × M normal equations $\hat{\mathbf{R}}\mathbf{c}_{ls} = \hat{\mathbf{d}}$ with $\hat{\mathbf{R}} = \mathbf{X}^H\mathbf{X}$, $\hat{r}_{ij} = \tilde{\mathbf{x}}_i^H\tilde{\mathbf{x}}_j$, $\hat{\mathbf{d}} = \mathbf{X}^H\mathbf{y}$, and $\hat{d}_i = \tilde{\mathbf{x}}_i^H\mathbf{y}$.)

which is identical to (8.2.13). The normalized total squared error is
$$\mathcal{E} \triangleq \frac{E_{ls}}{E_y} = 1 - \frac{E_{\hat{y}}}{E_y} \tag{8.2.24}$$
which is in the range $0 \le \mathcal{E} \le 1$, with limits of 0 and 1, which correspond to the best and worst cases, respectively.

FIGURE 8.4 Classification of different computational algorithms for the solution of the LS problem. (Diagram: LS computations on the data {X, y} divide into amplitude-domain methods, which work directly with the data {X, y}, namely QR decomposition by Householder reflection, Givens rotation, or Gram-Schmidt orthogonalization, and the singular value decomposition; and power-domain methods, which use the normal equations $(\mathbf{X}^H\mathbf{X})\mathbf{c}_{ls} = \mathbf{X}^H\mathbf{y}$.)

FIGURE 8.5 Vector space interpretation of LSE estimation for N = 3 (dimension of data space) and M = 2 (dimension of estimation subspace). (Diagram: y, its projection $\hat{\mathbf{y}} = c_{ls,1}\tilde{\mathbf{x}}_1 + c_{ls,2}\tilde{\mathbf{x}}_2$ in the plane of $\tilde{\mathbf{x}}_1$ and $\tilde{\mathbf{x}}_2$, and the error $\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}}$.)

Uniqueness. The solution of the LSE normal equations exists and is unique if the time-average correlation matrix $\hat{\mathbf{R}}$ is invertible. We shall prove the following:

THEOREM 8.1. The time-average correlation matrix $\hat{\mathbf{R}} = \mathbf{X}^H\mathbf{X}$ is invertible if and only if the columns $\tilde{\mathbf{x}}_k$ of X are linearly independent, or equivalently if and only if $\hat{\mathbf{R}}$ is positive definite.

Proof. If the columns of X are linearly independent, then for every $\mathbf{z} \ne \mathbf{0}$ we have $\mathbf{X}\mathbf{z} \ne \mathbf{0}$. This implies that for every $\mathbf{z} \ne \mathbf{0}$
$$\mathbf{z}^H(\mathbf{X}^H\mathbf{X})\mathbf{z} = (\mathbf{X}\mathbf{z})^H\mathbf{X}\mathbf{z} = \|\mathbf{X}\mathbf{z}\|^2 > 0 \tag{8.2.25}$$
that is, $\hat{\mathbf{R}}$ is positive definite and hence nonsingular. If the columns of X are linearly dependent, then there is a vector $\mathbf{z}_0 \ne \mathbf{0}$ such that $\mathbf{X}\mathbf{z}_0 = \mathbf{0}$. Therefore, $\mathbf{X}^H\mathbf{X}\mathbf{z}_0 = \mathbf{0}$, which implies that $\hat{\mathbf{R}} = \mathbf{X}^H\mathbf{X}$ is singular.
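Theorem 8.1 can be checked numerically in the simplest case of two real-valued data records, where the determinant of $\mathbf{X}^T\mathbf{X}$ is $\langle\tilde{\mathbf{x}}_1,\tilde{\mathbf{x}}_1\rangle\langle\tilde{\mathbf{x}}_2,\tilde{\mathbf{x}}_2\rangle - \langle\tilde{\mathbf{x}}_1,\tilde{\mathbf{x}}_2\rangle^2$. The sketch below is illustrative only; the helper names `dot` and `gram_determinant` are assumptions, and the first pair of records is borrowed from Example 8.2.1.

```python
def dot(a, b):
    """Real-valued dot product of two equal-length records."""
    return sum(p * q for p, q in zip(a, b))


def gram_determinant(x1, x2):
    """det(X^T X) for a real data matrix X with the two columns x1 and x2."""
    return dot(x1, x1) * dot(x2, x2) - dot(x1, x2) ** 2


# Linearly independent records: the determinant is strictly positive.
independent = gram_determinant([1, 2, 1, 1], [2, 1, 2, 3])

# Second record is twice the first: the determinant vanishes, so X^T X is singular.
dependent = gram_determinant([1, 2, 1, 1], [2, 4, 2, 2])

print(independent, dependent)
```

A zero determinant here means the normal equations no longer determine a unique coefficient vector.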

For a matrix to have linearly independent columns, the number of rows should be equal to or larger than the number of columns; that is, we must have at least as many equations as unknowns. To summarize, the overdetermined (N > M) LS problem has a unique solution provided by the normal equations in (8.2.12) if the time-average correlation matrix $\hat{\mathbf{R}}$ is positive definite, or equivalently if the data matrix X has linearly independent columns. In this case, the LS solution can be expressed as
$$\mathbf{c}_{ls} = \mathbf{X}^+\mathbf{y} \tag{8.2.26}$$
where
$$\mathbf{X}^+ \triangleq (\mathbf{X}^H\mathbf{X})^{-1}\mathbf{X}^H \tag{8.2.27}$$
is an M × N matrix known as the pseudo-inverse or the Moore-Penrose generalized inverse of matrix X (Golub and Van Loan 1996; Strang 1980). The LS estimate $\hat{\mathbf{y}}_{ls}$ of y can be expressed as
$$\hat{\mathbf{y}}_{ls} = \mathbf{P}\mathbf{y} \tag{8.2.28}$$
where
$$\mathbf{P} \triangleq \mathbf{X}(\mathbf{X}^H\mathbf{X})^{-1}\mathbf{X}^H \tag{8.2.29}$$
is known as the projection matrix because it projects the data vector y onto the column space of X to provide the LS estimate $\hat{\mathbf{y}}_{ls}$ of y. Similarly, the LS error vector $\mathbf{e}_{ls}$ can be expressed as
$$\mathbf{e}_{ls} = (\mathbf{I} - \mathbf{P})\mathbf{y} \tag{8.2.30}$$
where I is the N × N identity matrix. The projection matrix P is Hermitian and idempotent, that is,
$$\mathbf{P} = \mathbf{P}^H \tag{8.2.31}$$
and
$$\mathbf{P}^2 = \mathbf{P}^H\mathbf{P} = \mathbf{P} \tag{8.2.32}$$
respectively. When the columns of X are linearly dependent, the LS problem has many solutions. Since all these solutions satisfy the normal equations and the orthogonal projection of y onto the column space of X is unique, all these solutions produce an error vector e of equal length, that is, the same LSE. This subject is discussed in Section 8.7.2 (minimum-norm solution).

EXAMPLE 8.2.1. Suppose that we wish to estimate the sequence $\mathbf{y} = [1\; 2\; 3\; 2]^T$ from the observation vectors $\tilde{\mathbf{x}}_1 = [1\; 2\; 1\; 1]^T$ and $\tilde{\mathbf{x}}_2 = [2\; 1\; 2\; 3]^T$. Determine the optimum filter, the error vector $\mathbf{e}_{ls}$, and the LSE $E_{ls}$.

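Solution. We first compute the quantities
$$\hat{\mathbf{R}} = \mathbf{X}^T\mathbf{X} = \begin{bmatrix} 1 & 2 \\ 2 & 1 \\ 1 & 2 \\ 1 & 3 \end{bmatrix}^T \begin{bmatrix} 1 & 2 \\ 2 & 1 \\ 1 & 2 \\ 1 & 3 \end{bmatrix} = \begin{bmatrix} 7 & 9 \\ 9 & 18 \end{bmatrix} \qquad \hat{\mathbf{d}} = \mathbf{X}^T\mathbf{y} = \begin{bmatrix} 1 & 2 \\ 2 & 1 \\ 1 & 2 \\ 1 & 3 \end{bmatrix}^T \begin{bmatrix} 1 \\ 2 \\ 3 \\ 2 \end{bmatrix} = \begin{bmatrix} 10 \\ 16 \end{bmatrix}$$
and we then solve the normal equations $\hat{\mathbf{R}}\mathbf{c}_{ls} = \hat{\mathbf{d}}$ to obtain the LS estimator
$$\mathbf{c}_{ls} = \hat{\mathbf{R}}^{-1}\hat{\mathbf{d}} = \begin{bmatrix} \tfrac{2}{5} & -\tfrac{1}{5} \\ -\tfrac{1}{5} & \tfrac{7}{45} \end{bmatrix} \begin{bmatrix} 10 \\ 16 \end{bmatrix} = \begin{bmatrix} \tfrac{4}{5} \\ \tfrac{22}{45} \end{bmatrix}$$
and the LSE
$$E_{ls} = E_y - \hat{\mathbf{d}}^T\mathbf{c}_{ls} = 18 - \begin{bmatrix} 10 \\ 16 \end{bmatrix}^T \begin{bmatrix} \tfrac{4}{5} \\ \tfrac{22}{45} \end{bmatrix} = \frac{98}{45}$$
The projection matrix is
$$\mathbf{P} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T = \begin{bmatrix} \tfrac{2}{9} & \tfrac{1}{9} & \tfrac{2}{9} & \tfrac{1}{3} \\ \tfrac{1}{9} & \tfrac{43}{45} & \tfrac{1}{9} & -\tfrac{2}{15} \\ \tfrac{2}{9} & \tfrac{1}{9} & \tfrac{2}{9} & \tfrac{1}{3} \\ \tfrac{1}{3} & -\tfrac{2}{15} & \tfrac{1}{3} & \tfrac{3}{5} \end{bmatrix}$$
which can be used to determine the error vector
$$\mathbf{e}_{ls} = \mathbf{y} - \mathbf{P}\mathbf{y} = \left[-\tfrac{7}{9}\;\; -\tfrac{4}{45}\;\; \tfrac{11}{9}\;\; -\tfrac{4}{15}\right]^T$$
whose squared norm is equal to $\|\mathbf{e}_{ls}\|^2 = \tfrac{98}{45} = E_{ls}$, as expected. We can also easily verify the orthogonality principle $\mathbf{e}_{ls}^T\tilde{\mathbf{x}}_1 = \mathbf{e}_{ls}^T\tilde{\mathbf{x}}_2 = 0$.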
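The numbers in Example 8.2.1 can be reproduced with exact rational arithmetic. The following is a minimal sketch, assuming real-valued data (so the Hermitian transpose reduces to the ordinary transpose) and using Cramer's rule for the 2 × 2 normal equations; the variable names are illustrative and this is not the book's own program.

```python
from fractions import Fraction

# Data records and desired response of Example 8.2.1.
x1 = [Fraction(v) for v in (1, 2, 1, 1)]
x2 = [Fraction(v) for v in (2, 1, 2, 3)]
y = [Fraction(v) for v in (1, 2, 3, 2)]


def dot(a, b):
    """Real-valued dot product."""
    return sum(p * q for p, q in zip(a, b))


# Elements of R = X^T X and d = X^T y, formed by dot products.
r11, r12, r22 = dot(x1, x1), dot(x1, x2), dot(x2, x2)
d1, d2 = dot(x1, y), dot(x2, y)

# Solve the 2 x 2 normal equations R c = d with Cramer's rule.
det = r11 * r22 - r12 * r12
c = [(d1 * r22 - r12 * d2) / det, (r11 * d2 - r12 * d1) / det]

# Error vector e = y - X c and the LSE E_ls = E_y - d^T c.
e = [yn - c[0] * a - c[1] * b for yn, a, b in zip(y, x1, x2)]
E_ls = dot(y, y) - (d1 * c[0] + d2 * c[1])

print(c, e, E_ls, dot(e, x1), dot(e, x2))
```

Working with `Fraction` avoids rounding, so the orthogonality conditions come out as exact zeros rather than small floating-point residues.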

Weighted least-squares estimation. The previous results were derived by using an LS criterion that treats every error e(n) equally. However, based on a priori information, we may wish to place greater importance on some errors than on others, using the weighted LS criterion
$$E_w = \sum_{n=0}^{N-1} w(n)\,|e(n)|^2 = \mathbf{e}^H\mathbf{W}\mathbf{e} \tag{8.2.33}$$
where
$$\mathbf{W} \triangleq \text{diag}\{w(0), w(1), \ldots, w(N-1)\} \tag{8.2.34}$$
is a diagonal weighting matrix with positive elements. Usually, we choose small weights where the errors are expected to be large, and vice versa. Minimization of $E_w$ with respect to c yields the weighted LS (WLS) estimator
$$\mathbf{c}_{wls} = (\mathbf{X}^H\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^H\mathbf{W}\mathbf{y} \tag{8.2.35}$$
assuming that the inverse of the matrix $\mathbf{X}^H\mathbf{W}\mathbf{X}$ exists. We can easily see that when W = I, then $\mathbf{c}_{wls} = \mathbf{c}_{ls}$. The criterion in (8.2.33) can be generalized by choosing W to be any Hermitian, positive definite matrix (see Problem 8.2).
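A hedged sketch of (8.2.35) for real-valued data and a diagonal W follows. The function names `solve` and `wls` are hypothetical, the linear system is solved by a small Gauss-Jordan routine written for this illustration, and the weights in the usage lines are arbitrary choices applied to the data of Example 8.2.1.

```python
from fractions import Fraction


def solve(A, b):
    """Solve A c = b by Gauss-Jordan elimination in exact rational arithmetic."""
    n = len(A)
    aug = [[Fraction(v) for v in row] + [Fraction(bi)] for row, bi in zip(A, b)]
    for col in range(n):
        # Pick a row with a nonzero pivot (raises StopIteration if A is singular).
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        aug[col] = [v / aug[col][col] for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n] for row in aug]


def wls(X, y, w):
    """Weighted LS estimate (X^T W X)^{-1} X^T W y for real data, W = diag(w)."""
    M = len(X[0])
    A = [[sum(wn * row[i] * row[j] for wn, row in zip(w, X)) for j in range(M)]
         for i in range(M)]
    b = [sum(wn * row[i] * yn for wn, row, yn in zip(w, X, y)) for i in range(M)]
    return solve(A, b)


X = [[1, 2], [2, 1], [1, 2], [1, 3]]
y = [1, 2, 3, 2]
c_ls = wls(X, y, [1, 1, 1, 1])   # W = I reduces to the ordinary LS estimator
c_wls = wls(X, y, [1, 1, 1, 2])  # the last error is weighted twice as heavily
print(c_ls, c_wls)
```

With unit weights the result coincides with the estimator of Example 8.2.1, which illustrates the remark that W = I gives $\mathbf{c}_{wls} = \mathbf{c}_{ls}$.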

8.2.2 Statistical Properties of Least-Squares Estimators

A useful approach for evaluating the quality of an LS estimator is to study its statistical properties. Toward this end, we assume that the obtained measurements y actually have been generated by
$$\mathbf{y} = \mathbf{X}\mathbf{c}_o + \mathbf{e}_o \tag{8.2.36}$$
where $\mathbf{e}_o$ is the random measurement error vector. We may think of $\mathbf{c}_o$ as the “true” parameter vector. Using (8.2.36), we see that (8.2.21) gives
$$\mathbf{c}_{ls} = \mathbf{c}_o + (\mathbf{X}^H\mathbf{X})^{-1}\mathbf{X}^H\mathbf{e}_o \tag{8.2.37}$$
We make the following assumptions about the random measurement error vector $\mathbf{e}_o$:

1. The error vector $\mathbf{e}_o$ has zero mean
$$E\{\mathbf{e}_o\} = \mathbf{0} \tag{8.2.38}$$
2. The error vector $\mathbf{e}_o$ has uncorrelated components with constant variance $\sigma_{e_o}^2$; that is, the correlation matrix is given by
$$\mathbf{R}_{e_o} = E\{\mathbf{e}_o\mathbf{e}_o^H\} = \sigma_{e_o}^2\mathbf{I} \tag{8.2.39}$$
3. There is no information about $\mathbf{e}_o$ contained in data matrix X; that is,
$$E\{\mathbf{e}_o|\mathbf{X}\} = E\{\mathbf{e}_o\} = \mathbf{0} \tag{8.2.40}$$
4. If X is a deterministic N × M matrix, then it has rank M. This means that X has full column rank and that $\mathbf{X}^H\mathbf{X}$ is invertible. If X is a stochastic N × M matrix, then $E\{(\mathbf{X}^H\mathbf{X})^{-1}\}$ exists.

In the following analysis, we consider two possibilities: X is deterministic and X is stochastic. Under these conditions, the LS estimator $\mathbf{c}_{ls}$ has several desirable properties.

Deterministic data matrix

In this case, we assume that the LS estimators are obtained from the deterministic data values; that is, the matrix X is treated as a matrix of constants. Then the properties of the LS estimators can be derived from the statistical properties of the random measurement error vector $\mathbf{e}_o$.

PROPERTY 8.2.1.

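The LS estimator $\mathbf{c}_{ls}$ is an unbiased estimator of $\mathbf{c}_o$, that is,
$$E\{\mathbf{c}_{ls}\} = \mathbf{c}_o \tag{8.2.41}$$
Proof. Taking the expectation of both sides of (8.2.37), we have
$$E\{\mathbf{c}_{ls}\} = E\{\mathbf{c}_o\} + (\mathbf{X}^H\mathbf{X})^{-1}\mathbf{X}^H E\{\mathbf{e}_o\} = \mathbf{c}_o$$
because X is deterministic and $E\{\mathbf{e}_o\} = \mathbf{0}$.

PROPERTY 8.2.2. The covariance matrix of $\mathbf{c}_{ls}$ corresponding to the error $\mathbf{c}_{ls} - \mathbf{c}_o$ is
$$\boldsymbol{\Gamma}_{ls} \triangleq E\{(\mathbf{c}_{ls} - \mathbf{c}_o)(\mathbf{c}_{ls} - \mathbf{c}_o)^H\} = \sigma_{e_o}^2(\mathbf{X}^H\mathbf{X})^{-1} = \sigma_{e_o}^2\hat{\mathbf{R}}^{-1} \tag{8.2.42}$$
Proof. Using (8.2.37), (8.2.39), and the definition (8.2.42), we easily obtain
$$\boldsymbol{\Gamma}_{ls} = (\mathbf{X}^H\mathbf{X})^{-1}\mathbf{X}^H E\{\mathbf{e}_o\mathbf{e}_o^H\}\mathbf{X}(\mathbf{X}^H\mathbf{X})^{-1} = \sigma_{e_o}^2(\mathbf{X}^H\mathbf{X})^{-1}$$
Note that the diagonal elements of matrix $\sigma_{e_o}^2\hat{\mathbf{R}}^{-1}$ are equal to the variances of the elements of the LS combiner vector $\mathbf{c}_{ls}$.

PROPERTY 8.2.3. An unbiased estimate of the error variance $\sigma_{e_o}^2$ is given by
$$\hat{\sigma}_{e_o}^2 = \frac{E_{ls}}{N - M} \tag{8.2.43}$$
where N is the number of observations, M is the number of parameters, and $E_{ls}$ is the LS error.

Proof. Using (8.2.30) and (8.2.36), and noting that $(\mathbf{I} - \mathbf{P})\mathbf{X} = \mathbf{0}$, we obtain
$$\mathbf{e}_{ls} = (\mathbf{I} - \mathbf{P})\mathbf{y} = (\mathbf{I} - \mathbf{P})\mathbf{e}_o$$
which results in
$$E_{ls} = \mathbf{e}_{ls}^H\mathbf{e}_{ls} = \mathbf{e}_o^H(\mathbf{I} - \mathbf{P})^H(\mathbf{I} - \mathbf{P})\mathbf{e}_o = \mathbf{e}_o^H(\mathbf{I} - \mathbf{P})\mathbf{e}_o$$
because of (8.2.32). Since $E_{ls}$ depends on $\mathbf{e}_o$, it is a random variable whose expected value is
$$E\{E_{ls}\} = E\{\mathbf{e}_o^H(\mathbf{I} - \mathbf{P})\mathbf{e}_o\} = E\{\text{tr}[(\mathbf{I} - \mathbf{P})\mathbf{e}_o\mathbf{e}_o^H]\} = \text{tr}[(\mathbf{I} - \mathbf{P})E\{\mathbf{e}_o\mathbf{e}_o^H\}] = \sigma_{e_o}^2\,\text{tr}(\mathbf{I} - \mathbf{P})$$
since tr(AB) = tr(BA), where tr is the trace function. However,
$$\text{tr}(\mathbf{I} - \mathbf{P}) = \text{tr}[\mathbf{I} - \mathbf{X}(\mathbf{X}^H\mathbf{X})^{-1}\mathbf{X}^H] = \text{tr}(\mathbf{I}_{N \times N}) - \text{tr}[(\mathbf{X}^H\mathbf{X})^{-1}\mathbf{X}^H\mathbf{X}] = \text{tr}(\mathbf{I}_{N \times N}) - \text{tr}(\mathbf{I}_{M \times M}) = N - M$$
therefore
$$\sigma_{e_o}^2 = \frac{E\{E_{ls}\}}{N - M} \tag{8.2.44}$$
which proves that $\hat{\sigma}_{e_o}^2$ is an unbiased estimate of $\sigma_{e_o}^2$.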
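The trace identity used in the proof, tr(I − P) = N − M, and the estimate (8.2.43) can be illustrated on the data of Example 8.2.1 (N = 4, M = 2). This is a sketch under the assumption of real-valued data; the variable names are illustrative, and the resulting value is the estimate from this single block of data, not the true error variance.

```python
from fractions import Fraction

X = [[1, 2], [2, 1], [1, 2], [1, 3]]
y = [1, 2, 3, 2]
N, M = len(X), len(X[0])

# R = X^T X for M = 2 and its inverse in closed form.
r11 = sum(r[0] * r[0] for r in X)
r12 = sum(r[0] * r[1] for r in X)
r22 = sum(r[1] * r[1] for r in X)
det = Fraction(r11 * r22 - r12 * r12)
R_inv = [[r22 / det, -r12 / det], [-r12 / det, r11 / det]]

# Projection matrix P = X R^{-1} X^T as in (8.2.29).
XR = [[row[0] * R_inv[0][j] + row[1] * R_inv[1][j] for j in range(M)] for row in X]
P = [[a[0] * b[0] + a[1] * b[1] for b in X] for a in XR]

trace_P = sum(P[n][n] for n in range(N))

# LS error vector e = (I - P) y, its energy, and the variance estimate (8.2.43).
e = [y[n] - sum(P[n][k] * y[k] for k in range(N)) for n in range(N)]
E_ls = sum(v * v for v in e)
sigma2_hat = E_ls / (N - M)

print(trace_P, N - trace_P, E_ls, sigma2_hat)
```

Dividing by N − M rather than N accounts for the M degrees of freedom absorbed by fitting the coefficients.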

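Similar to (8.2.41), the mean value of $\mathbf{c}_{wls}$ is
$$E\{\mathbf{c}_{wls}\} = E\{\mathbf{c}_o\} + (\mathbf{X}^H\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^H\mathbf{W}E\{\mathbf{e}_o\} = E\{\mathbf{c}_o\} \tag{8.2.45}$$
that is, the WLS estimator is an unbiased estimate of $\mathbf{c}_o$. The covariance matrix of $\mathbf{c}_{wls}$ is
$$\boldsymbol{\Gamma}_{wls} = (\mathbf{X}^H\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^H\mathbf{W}\mathbf{R}_{e_o}\mathbf{W}\mathbf{X}(\mathbf{X}^H\mathbf{W}\mathbf{X})^{-1} \tag{8.2.46}$$
where $\mathbf{R}_{e_o}$ is the correlation matrix of $\mathbf{e}_o$. It is easy to see that when $\mathbf{R}_{e_o} = \sigma_{e_o}^2\mathbf{I}$ and W = I, we obtain (8.2.42).

PROPERTY 8.2.4. The trace of $\boldsymbol{\Gamma}_{wls}$ attains its minimum when $\mathbf{W} = \mathbf{R}_{e_o}^{-1}$. The resulting estimator
$$\mathbf{c}_{mv} = (\mathbf{X}^H\mathbf{R}_{e_o}^{-1}\mathbf{X})^{-1}\mathbf{X}^H\mathbf{R}_{e_o}^{-1}\mathbf{y} \tag{8.2.47}$$
is known as the minimum variance or Markov estimator and is the best linear unbiased estimator (BLUE).

Proof. The proof is somewhat involved. Interested readers can see Goodwin and Payne (1977) and Scharf (1991).

PROPERTY 8.2.5. If $\mathbf{R}_{e_o} = \sigma_{e_o}^2\mathbf{I}$, the LS estimator $\mathbf{c}_{ls}$ is also the best linear unbiased estimator.

Proof. It follows from (8.2.47) with the substitution $\mathbf{R}_{e_o} = \sigma_{e_o}^2\mathbf{I}$.

PROPERTY 8.2.6. When the random measurement error vector $\mathbf{e}_o$ has a normal distribution with mean zero and correlation matrix $\mathbf{R}_{e_o} = \sigma_{e_o}^2\mathbf{I}$, that is, when its components are uncorrelated, the LS estimator $\mathbf{c}_{ls}$ is also the maximum likelihood estimator.

Proof. Since the components of vector $\mathbf{e}_o$ are uncorrelated and normally distributed with zero mean and variance $\sigma_{e_o}^2$, the likelihood function for real-valued $\mathbf{e}_o$ is given by
$$L(\mathbf{c}) = \prod_{n=0}^{N-1} \frac{1}{\sqrt{2\pi}\,\sigma_{e_o}} \exp\left(-\frac{|e_o(n)|^2}{2\sigma_{e_o}^2}\right) \tag{8.2.48}$$
and its logarithm by
$$\ln L(\mathbf{c}) = -\frac{1}{2\sigma_{e_o}^2}\mathbf{e}_o^H\mathbf{e}_o - \frac{N}{2}\ln(2\pi\sigma_{e_o}^2) = -\frac{1}{2\sigma_{e_o}^2}(\mathbf{y} - \mathbf{X}\mathbf{c})^H(\mathbf{y} - \mathbf{X}\mathbf{c}) + \text{const} \tag{8.2.49}$$
For complex-valued $\mathbf{e}_o$, the terms $\sqrt{2\pi}\,\sigma_{e_o}$ and $2\sigma_{e_o}^2$ in (8.2.48) are replaced by $\pi\sigma_{e_o}^2$ and $\sigma_{e_o}^2$, respectively. Since the logarithm is a monotonic function, maximization of $L(\mathbf{c})$ is equivalent to maximization of $\ln L(\mathbf{c})$, that is, to minimization of $(\mathbf{y} - \mathbf{X}\mathbf{c})^H(\mathbf{y} - \mathbf{X}\mathbf{c})$. It is easy to see, by comparison with (8.2.8), that the LS solution maximizes this likelihood function.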

Stochastic data matrix

We now extend the statistical properties of $\mathbf{c}_{ls}$ from the preceding section to the situation in which the data values in X are obtained from a random source with a known probability distribution. This situation is best handled by first obtaining the desired results conditioned on X, which is equivalent to the deterministic case. We then determine the unconditional results by (statistical) averaging over the conditional distributions using the following properties of the conditional averages.

The conditional mean and the conditional covariance of a random vector x(ζ), given another random vector y(ζ), are defined by
$$\boldsymbol{\mu}_{x|y} \triangleq E\{\mathbf{x}(\zeta)|\mathbf{y}(\zeta)\}$$
and
$$\boldsymbol{\Gamma}_{x|y} \triangleq E\{[\mathbf{x}(\zeta) - \boldsymbol{\mu}_{x|y}][\mathbf{x}(\zeta) - \boldsymbol{\mu}_{x|y}]^H \,|\, \mathbf{y}(\zeta)\}$$
respectively. Since both quantities are random objects, it can be shown that
$$\boldsymbol{\mu}_x = E\{\mathbf{x}(\zeta)\} = E_y\{E\{\mathbf{x}(\zeta)|\mathbf{y}(\zeta)\}\}$$
which is known as the law of iterated expectations and that
$$\boldsymbol{\Gamma}_x = \boldsymbol{\Gamma}^{(y)}_{\mu_{x|y}} + \boldsymbol{\mu}^{(y)}_{\Gamma_{x|y}}$$
which is called the decomposition of the covariance rule. This rule states that the covariance of a random vector x(ζ) decomposes into the covariance of the conditional mean plus the mean of the conditional covariance. The covariance of the conditional mean, $\boldsymbol{\mu}_{x|y}$, is given by
$$\boldsymbol{\Gamma}^{(y)}_{\mu_{x|y}} \triangleq E_y\{[\boldsymbol{\mu}_{x|y} - \boldsymbol{\mu}_x][\boldsymbol{\mu}_{x|y} - \boldsymbol{\mu}_x]^H\}$$
where the notation $\boldsymbol{\Gamma}^{(y)}_{[\cdot]}$ indicates the covariance over the distribution of y(ζ). More details can be found in Greene (1993).

PROPERTY 8.2.7.

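The LS estimator $\mathbf{c}_{ls}$ is an unbiased estimator of $\mathbf{c}_o$.

Proof. Taking the conditional expectation with respect to X of both sides of (8.2.37), we obtain
$$E\{\mathbf{c}_{ls}|\mathbf{X}\} = E\{\mathbf{c}_o|\mathbf{X}\} + (\mathbf{X}^H\mathbf{X})^{-1}\mathbf{X}^H E\{\mathbf{e}_o|\mathbf{X}\} \tag{8.2.50}$$
Now using the law of iterated expectations, we get
$$E\{\mathbf{c}_{ls}\} = E_X\{E\{\mathbf{c}_{ls}|\mathbf{X}\}\} = \mathbf{c}_o + E\{(\mathbf{X}^H\mathbf{X})^{-1}\mathbf{X}^H E\{\mathbf{e}_o|\mathbf{X}\}\}$$
Since $E\{\mathbf{e}_o|\mathbf{X}\} = \mathbf{0}$, from assumption 3, we have $E\{\mathbf{c}_{ls}\} = \mathbf{c}_o$. Thus $\mathbf{c}_{ls}$ is also unconditionally unbiased.

PROPERTY 8.2.8. The covariance matrix of $\mathbf{c}_{ls}$ corresponding to the error $\mathbf{c}_{ls} - \mathbf{c}_o$ is
$$\boldsymbol{\Gamma}_{ls} \triangleq E\{(\mathbf{c}_{ls} - \mathbf{c}_o)(\mathbf{c}_{ls} - \mathbf{c}_o)^H\} = \sigma_{e_o}^2 E\{(\mathbf{X}^H\mathbf{X})^{-1}\} \tag{8.2.51}$$
Proof. From (8.2.42), the conditional covariance matrix of $\mathbf{c}_{ls}$, conditional on X, is
$$E\{(\mathbf{c}_{ls} - \mathbf{c}_o)(\mathbf{c}_{ls} - \mathbf{c}_o)^H|\mathbf{X}\} = \sigma_{e_o}^2(\mathbf{X}^H\mathbf{X})^{-1} \tag{8.2.52}$$
For the unconditional covariance, we use the decomposition of covariance rule to obtain
$$E\{(\mathbf{c}_{ls} - \mathbf{c}_o)(\mathbf{c}_{ls} - \mathbf{c}_o)^H\} = E_X\{E\{(\mathbf{c}_{ls} - \mathbf{c}_o)(\mathbf{c}_{ls} - \mathbf{c}_o)^H|\mathbf{X}\}\} + E_X\{(E\{\mathbf{c}_{ls}|\mathbf{X}\} - \mathbf{c}_o)(E\{\mathbf{c}_{ls}|\mathbf{X}\} - \mathbf{c}_o)^H\}$$
The second term on the right-hand side above is equal to zero since $E\{\mathbf{c}_{ls}|\mathbf{X}\} = \mathbf{c}_o$ and hence
$$E\{(\mathbf{c}_{ls} - \mathbf{c}_o)(\mathbf{c}_{ls} - \mathbf{c}_o)^H\} = E_X\{E\{(\mathbf{c}_{ls} - \mathbf{c}_o)(\mathbf{c}_{ls} - \mathbf{c}_o)^H|\mathbf{X}\}\} = E_X\{\sigma_{e_o}^2(\mathbf{X}^H\mathbf{X})^{-1}\} = \sigma_{e_o}^2 E\{(\mathbf{X}^H\mathbf{X})^{-1}\}$$
Thus the earlier result in (8.2.42) is modified by the expected value (or averaging) of $(\mathbf{X}^H\mathbf{X})^{-1}$.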

One important conclusion about the statistical properties of the LS estimator is that the results obtained for the deterministic data matrix X are also valid for the stochastic case. This conclusion also applies for the Markov estimators and maximum likelihood estimators (Greene 1993).

8.3 LEAST-SQUARES FIR FILTERS

We will now apply the theory of linear LS error estimation to the design of FIR filters. The treatment closely follows the notation and approach in Section 6.4. Recall that the filtering error is

$$e(n) = y(n) - \sum_{k=0}^{M-1} h(k)\,x(n-k) \triangleq y(n) - \mathbf{c}^H \mathbf{x}(n) \tag{8.3.1}$$

where $y(n)$ is the desired response,

$$\mathbf{x}(n) = [x(n)\;\; x(n-1)\;\; \cdots\;\; x(n-M+1)]^T \tag{8.3.2}$$

is the input data vector, and

$$\mathbf{c} = [c_0\;\; c_1\;\; \cdots\;\; c_{M-1}]^T \tag{8.3.3}$$

is the filter coefficient vector, related to the impulse response by $c_k = h^*(k)$.

Suppose that we take measurements of the desired response $y(n)$ and the input signal $x(n)$ over the time interval $0 \le n \le N-1$. We hold the coefficients $\{c_k\}_0^{M-1}$ of the filter constant within this period and set any other required data samples equal to zero. For example, at time $n = 0$, that is, when we take the first measurement $x(0)$, the filter needs the samples $x(0), x(-1), \ldots, x(-M+1)$ to compute the output sample $\hat{y}(0)$. Since the samples $x(-1), \ldots, x(-M+1)$ are not available, to operate the filter, we should replace them with arbitrary values or start the filtering operation at time $n = M-1$. Indeed, for $M-1 \le n \le N-1$, all the input samples of $x(n)$ required by the filter to compute the output $\{\hat{y}(n)\}_{M-1}^{N-1}$ are available. If we want to compute the output while the last sample $x(N-1)$ is still in the filter memory, we must continue the filtering operation until $n = N+M-2$. Again, we need to assign arbitrary values to the unavailable samples $x(N), \ldots, x(N+M-2)$. Most often, we set the unavailable samples equal to zero, which can be thought of as windowing the sequences $x(n)$ and $y(n)$ with a rectangular window.

To simplify the illustration, suppose that $N = 7$ and $M = 3$. Writing (8.3.1) for $n = 0, 1, \ldots, N+M-2$ and arranging in matrix form, we obtain

$$
\begin{array}{r}
0 \rightarrow \\ \\ M-1 \rightarrow \\ \\ \\ \\ N-1 \rightarrow \\ \\ N+M-2 \rightarrow
\end{array}
\begin{bmatrix}
e^*(0)\\ e^*(1)\\ e^*(2)\\ e^*(3)\\ e^*(4)\\ e^*(5)\\ e^*(6)\\ e^*(7)\\ e^*(8)
\end{bmatrix}
=
\begin{bmatrix}
y^*(0)\\ y^*(1)\\ y^*(2)\\ y^*(3)\\ y^*(4)\\ y^*(5)\\ y^*(6)\\ 0\\ 0
\end{bmatrix}
-
\begin{bmatrix}
x^*(0) & 0 & 0\\
x^*(1) & x^*(0) & 0\\
x^*(2) & x^*(1) & x^*(0)\\
x^*(3) & x^*(2) & x^*(1)\\
x^*(4) & x^*(3) & x^*(2)\\
x^*(5) & x^*(4) & x^*(3)\\
x^*(6) & x^*(5) & x^*(4)\\
0 & x^*(6) & x^*(5)\\
0 & 0 & x^*(6)
\end{bmatrix}
\begin{bmatrix}
c_0\\ c_1\\ c_2
\end{bmatrix}
\tag{8.3.4}
$$

or, in general,

$$\mathbf{e} = \mathbf{y} - \mathbf{X}\mathbf{c} \tag{8.3.5}$$

where the exact form of $\mathbf{e}$, $\mathbf{y}$, and $\mathbf{X}$ depends on the range $N_i \le n \le N_f$ of measurements to be used, which in turn determines the range of summation

$$E = \sum_{n=N_i}^{N_f} |e(n)|^2 = \mathbf{e}^H \mathbf{e} \tag{8.3.6}$$

in the LS criterion. The LS FIR filter is found by solving the LS normal equations

$$(\mathbf{X}^H \mathbf{X})\,\mathbf{c}_{ls} = \mathbf{X}^H \mathbf{y} \tag{8.3.7}$$

or

$$\hat{\mathbf{R}}\,\mathbf{c}_{ls} = \hat{\mathbf{d}} \tag{8.3.8}$$

with an LS error of

$$E_{ls} = E_y - \hat{\mathbf{d}}^H \mathbf{c}_{ls} \tag{8.3.9}$$

where $E_y$ is the energy of the desired response signal. The elements of the time-average correlation matrix $\hat{\mathbf{R}}$ are given by

$$\hat{r}_{ij} = \tilde{\mathbf{x}}_i^H \tilde{\mathbf{x}}_j = \sum_{n=N_i}^{N_f} x(n+1-i)\,x^*(n+1-j) \qquad 1 \le i, j \le M \tag{8.3.10}$$

where $\tilde{\mathbf{x}}_i$ are the columns of data matrix $\mathbf{X}$. A simple manipulation of (8.3.10) leads to

$$\hat{r}_{i+1,j+1} = \hat{r}_{ij} + x(N_i - i)\,x^*(N_i - j) - x(N_f + 1 - i)\,x^*(N_f + 1 - j) \qquad 1 \le i, j < M \tag{8.3.11}$$

which relates the elements of matrix $\hat{\mathbf{R}}$ that are located on the same diagonal. This property holds because the columns of $\mathbf{X}$ are obtained by shifting the first column. The recursion in (8.3.11) suggests the following way of efficiently computing $\hat{\mathbf{R}}$:

1. Compute the first row of $\hat{\mathbf{R}}$ by using (8.3.10). This requires $M$ dot products and a total of about $M(N_f - N_i)$ operations.
2. Compute the remaining elements in the upper triangular part of $\hat{\mathbf{R}}$, using (8.3.11). The required number of operations is proportional to $M^2$.
3. Compute the lower triangular part of $\hat{\mathbf{R}}$, using the Hermitian symmetry relation $\hat{r}_{ji} = \hat{r}_{ij}^*$.

Notice that direct computation of the upper triangular part of $\hat{\mathbf{R}}$ using (8.3.10), that is, without the recursion, requires approximately $M^2 N/2$ operations, which increases significantly for moderate or large values of $M$.

There are four ways to select the summation range $N_i \le n \le N_f$ that are used in LS filtering and prediction:

No windowing. If we set $N_i = M-1$ and $N_f = N-1$, we only use the available data and there are no distortions caused by forcing the data at the borders to artificial values.

Prewindowing. This corresponds to $N_i = 0$ and $N_f = N-1$ and is equivalent to setting the samples $x(-1), \ldots, x(-M+1)$ equal to zero. As a result, the term $x(N_i - i)\,x^*(N_i - j)$ does not appear in (8.3.11). This method is widely used in LS adaptive filtering.

Postwindowing. This corresponds to $N_i = M-1$ and $N_f = N+M-2$ and is equivalent to setting the samples $x(N), \ldots, x(N+M-2)$ equal to zero. As a result, the term $x(N_f + 1 - i)\,x^*(N_f + 1 - j)$ does not appear in (8.3.11). This method is not used very often for practical applications without prewindowing.

Full windowing. In this method, we impose both prewindowing and postwindowing (full windowing) to the input data and postwindowing to the desired response. The range of summation is from $N_i = 0$ to $N_f = N+M-2$, and as a result of full windowing, Eq. (8.3.11) becomes $\hat{r}_{i+1,j+1} = \hat{r}_{ij}$. Therefore, the elements $\hat{r}_{ij}$ depend only on $i - j$, and matrix $\hat{\mathbf{R}}$ is Toeplitz. In this case, the normal equations (8.2.12) can be obtained from the Wiener-Hopf equations (6.4.11) by replacing the theoretical autocorrelations with their estimated values (see Section 5.2).

Clearly, as $N \gg M$ the performance difference between the various methods becomes insignificant. The no-windowing and full-windowing methods are known in the signal processing literature as the covariance and autocorrelation methods, respectively (Makhoul 1975b). We avoid these terms because they can lead to misleading statistical interpretations.

We notice that in the LS filtering problem, the data matrix $\mathbf{X}$ is Toeplitz and the normal equations matrix $\hat{\mathbf{R}} = \mathbf{X}^H \mathbf{X}$ is the product of two Toeplitz matrices. However, $\hat{\mathbf{R}}$ is Toeplitz only in the full-windowing case, when $\mathbf{X}$ is banded Toeplitz. In all other cases $\hat{\mathbf{R}}$ is near to Toeplitz, or close to Toeplitz, in a sense made precise in Morf et al. (1977).

The matrix $\hat{\mathbf{R}}$ and vector $\hat{\mathbf{d}}$, for the various windowing methods, are computed by using the Matlab function [R,d]=lsmatvec(x,M,method,y), which is based on (8.3.10) and (8.3.11). Then the LS filter is computed by cls=R\d. Figure 8.6 shows an FIR LSE filter operating in block processing mode.

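[Figure 8.6: block diagram. The signals x(n) and y(n) enter a frame blocking stage (labels N, N0) that produces the vectors x and y; a block labeled "Compute and solve normal equations" (labels Ni, Nf) produces c_ls; an FIR filter produces ŷ and e; a frame unblocking stage returns ŷ(n) and e(n).]

FIGURE 8.6 Block processing implementation of an FIR LSE filter.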
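The four summation ranges can be compared with a short numerical sketch. The code below is not the book's lsmatvec function. It is a minimal pure-Python illustration, under the assumption that the sums in (8.3.10) are formed directly, without the recursion (8.3.11). The names `ls_normal_equations`, `sample`, and `solve`, and the deterministic two-tone test signal, are hypothetical choices made for this illustration only.

```python
import math

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        acc = aug[r][n] - sum(aug[r][k] * z[k] for k in range(r + 1, n))
        z[r] = acc / aug[r][r]
    return z

def sample(sig, n):
    """Return sig[n], or zero outside the measured range (windowing)."""
    return sig[n] if 0 <= n < len(sig) else 0.0

def ls_normal_equations(x, y, M, method="none"):
    """Form R = X^H X and d = X^H y over Ni <= n <= Nf, as in (8.3.10)."""
    N = len(x)
    Ni = 0 if method in ("pre", "full") else M - 1
    Nf = N + M - 2 if method in ("post", "full") else N - 1
    R = [[0.0] * M for _ in range(M)]
    d = [0.0] * M
    for n in range(Ni, Nf + 1):
        xn = [sample(x, n - k) for k in range(M)]   # data vector x(n)
        yn = sample(y, n)
        for i in range(M):
            d[i] += xn[i] * yn.conjugate()
            for j in range(M):
                R[i][j] += xn[i] * xn[j].conjugate()
    return R, d

# Noise-free data generated by y(n) = 0.5 x(n) + 0.5 x(n-1), with x(-1) = 0.
N, M = 50, 2
x = [math.sin(0.7 * n) + 0.3 * math.cos(2.1 * n) for n in range(N)]
y = [0.5 * x[n] + 0.5 * (x[n - 1] if n > 0 else 0.0) for n in range(N)]

results = {}
for method in ("none", "pre", "post", "full"):
    R, d = ls_normal_equations(x, y, M, method)
    results[method] = (R, solve(R, d))
    print(method, [round(c, 4) for c in results[method][1]])
```

With noise-free data the no-windowing and prewindowing solutions reproduce the generating coefficients, whereas the ranges that include postwindowed rows are slightly perturbed because the desired response is forced to zero there. The full-windowing correlation matrix has equal diagonal elements, as expected for a Toeplitz matrix.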

EXAMPLE 8.3.1. To illustrate the design of least-squares FIR filters, suppose that we have a set of measurements of $x(n)$ and $y(n)$ for $0 \le n \le N-1$ with $N = 100$ that have been generated by the difference equation

$$y(n) = 0.5\,x(n) + 0.5\,x(n-1) + v(n)$$

The input $x(n)$ and the additive noise $v(n)$ are uncorrelated processes from a normal (Gaussian) distribution with mean $E\{x(n)\} = E\{v(n)\} = 0$ and variance $\sigma_x^2 = \sigma_v^2 = 1$. Fitting the model

$$\hat{y}(n) = h(0)\,x(n) + h(1)\,x(n-1)$$

to the measurements with the no-windowing LS criterion, we obtain

$$\mathbf{c}_{ls} = \begin{bmatrix} 0.5361 \\ 0.5570 \end{bmatrix} \qquad \hat{\sigma}_e^2 = 1.0419 \qquad \hat{\sigma}_e^2\,\hat{\mathbf{R}}^{-1} = \begin{bmatrix} 0.0073 & -0.0005 \\ -0.0005 & 0.0071 \end{bmatrix}$$

using (8.3.7), (8.3.9), (8.2.44), and (8.2.42). If the mean of the additive noise is nonzero, for example, if $E\{v(n)\} = 1$, we get

$$\mathbf{c}_{ls} = \begin{bmatrix} 0.4889 \\ 0.5258 \end{bmatrix} \qquad \hat{\sigma}_e^2 = 1.8655 \qquad \hat{\sigma}_e^2\,\hat{\mathbf{R}}^{-1} = \begin{bmatrix} 0.0131 & -0.0009 \\ -0.0009 & 0.0127 \end{bmatrix}$$

which shows that the variance of the estimates, that is, the diagonal elements of $\hat{\sigma}_e^2\,\hat{\mathbf{R}}^{-1}$, increases significantly. Suppose now that the recording device introduces an outlier in the input data at $x(30) = 20$. The estimated LS model and its associated statistics are given by

$$\mathbf{c}_{ls} = \begin{bmatrix} 0.1796 \\ 0.1814 \end{bmatrix} \qquad \hat{\sigma}_e^2 = 1.6270 \qquad \hat{\sigma}_e^2\,\hat{\mathbf{R}}^{-1} = \begin{bmatrix} 0.0030 & 0.0000 \\ 0.0000 & 0.0030 \end{bmatrix}$$

Similarly, when an outlier is present in the output data, for example, at $y(30) = 20$, then the LS model and its statistics are

$$\mathbf{c}_{ls} = \begin{bmatrix} 0.6303 \\ 0.4653 \end{bmatrix} \qquad \hat{\sigma}_e^2 = 5.0979 \qquad \hat{\sigma}_e^2\,\hat{\mathbf{R}}^{-1} = \begin{bmatrix} 0.0357 & -0.0025 \\ -0.0025 & 0.0347 \end{bmatrix}$$

In general, LS estimates are very sensitive to colored additive noise and outliers (Ljung 1987). Note that all the LS solutions in this example were produced with one sample realization $x(n)$ and that the results will vary for any other realization.

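LS inverse filters. Given a causal filter with impulse response $g(n)$, its inverse filter $h(n)$ is specified by $g(n) * h(n) = \delta(n - n_0)$, $n_0 \ge 0$. We focus on causal inverse filters, which are often infinite impulse response (IIR), and we wish to approximate them by some FIR filter $c_{ls}(n) = h^*(n)$ that is optimum according to the LS criterion. In this case, the actual impulse response $g(n) * c_{ls}^*(n)$ of the combined system deviates from the desired response $\delta(n - n_0)$, resulting in an error $e(n)$. The convolution equation

$$e(n) = \delta(n - n_0) - \sum_{k=0}^{M} c_{ls}^*(k)\,g(n-k) \tag{8.3.12}$$

can be formulated in matrix form as follows for $M = 2$ and $N = 6$

$$
\begin{bmatrix}
e^*(0)\\ e^*(1)\\ e^*(2)\\ e^*(3)\\ e^*(4)\\ e^*(5)\\ e^*(6)\\ e^*(7)\\ e^*(8)
\end{bmatrix}
=
\begin{bmatrix}
1\\ 0\\ 0\\ 0\\ 0\\ 0\\ 0\\ 0\\ 0
\end{bmatrix}
-
\begin{bmatrix}
g^*(0) & 0 & 0\\
g^*(1) & g^*(0) & 0\\
g^*(2) & g^*(1) & g^*(0)\\
g^*(3) & g^*(2) & g^*(1)\\
g^*(4) & g^*(3) & g^*(2)\\
g^*(5) & g^*(4) & g^*(3)\\
g^*(6) & g^*(5) & g^*(4)\\
0 & g^*(6) & g^*(5)\\
0 & 0 & g^*(6)
\end{bmatrix}
\begin{bmatrix}
c_{ls}(0)\\ c_{ls}(1)\\ c_{ls}(2)
\end{bmatrix}
$$

assuming that $n_0 = 0$. In general,

$$\mathbf{e} = \boldsymbol{\delta}_i - \mathbf{G}\,\mathbf{c}_{ls}^{(i)} \tag{8.3.13}$$

where $\boldsymbol{\delta}_i$ is a vector whose $i$th element is 1 and whose remaining elements are all zero. The LS inverse filter and the corresponding error are given by

$$(\mathbf{G}^H \mathbf{G})\,\mathbf{c}_{ls}^{(i)} = \mathbf{G}^H \boldsymbol{\delta}_i \tag{8.3.14}$$

and

$$E_{ls}^{(i)} = 1 - \boldsymbol{\delta}_i^H \mathbf{G}\,\mathbf{c}_{ls}^{(i)} = 1 - \sum_{k=0}^{M} g^*(i-k)\,c_{ls}^{(i)}(k) \qquad 0 \le i \le M+N \tag{8.3.15}$$

respectively, where the sum is the $i$th element of $\mathbf{G}\,\mathbf{c}_{ls}^{(i)}$, that is, the convolution of $g^*(n)$ with $c_{ls}^{(i)}(n)$ evaluated at $n = i$. Using the projection operators (8.2.29) and (8.2.30), we can express the LS error as

$$E_{ls}^{(i)} = \boldsymbol{\delta}_i^H (\mathbf{P} - \mathbf{I})^H (\mathbf{P} - \mathbf{I})\,\boldsymbol{\delta}_i \tag{8.3.16}$$

where

$$\mathbf{P} = \mathbf{G}(\mathbf{G}^H \mathbf{G})^{-1} \mathbf{G}^H \tag{8.3.17}$$

The total error for all possible delays $0 \le i \le N+M$ can be written as

$$E_{total} = \sum_{i=0}^{N+M} E_{ls}^{(i)} = \mathrm{tr}[\mathbf{D}^H (\mathbf{P} - \mathbf{I})^H (\mathbf{P} - \mathbf{I})\,\mathbf{D}] \tag{8.3.18}$$

where

$$\mathbf{D} \triangleq [\boldsymbol{\delta}_0\;\; \boldsymbol{\delta}_1\;\; \boldsymbol{\delta}_2\;\; \cdots\;\; \boldsymbol{\delta}_{N+M}] = \mathbf{I}$$

is the $(N+M+1) \times (N+M+1)$ identity matrix. Since $\mathbf{D} = \mathbf{I}$, $\mathbf{P} = \mathbf{P}^H$, and $\mathbf{P}^2 = \mathbf{P}$, we obtain

$$E_{total} = \mathrm{tr}[\mathbf{D}^H (\mathbf{P} - \mathbf{I})^H (\mathbf{P} - \mathbf{I})\,\mathbf{D}] = \mathrm{tr}(\mathbf{I} - \mathbf{P}) = \mathrm{tr}(\mathbf{I}) - \mathrm{tr}(\mathbf{P})$$

or

$$E_{total} = N \tag{8.3.19}$$

because $\mathrm{tr}(\mathbf{I}) = N+M+1$ and

$$\mathrm{tr}(\mathbf{P}) = \mathrm{tr}[\mathbf{G}(\mathbf{G}^H \mathbf{G})^{-1} \mathbf{G}^H] = \mathrm{tr}[\mathbf{G}^H \mathbf{G}(\mathbf{G}^H \mathbf{G})^{-1}] = M+1 \tag{8.3.20}$$

Hence, $E_{total}$ depends on the length $N+1$ of the filter $g(n)$ and is independent of the length $M+1$ of the inverse filter $c_{ls}^{(i)}(n)$. If the minimum $E_{ls}^{(i)}$, for a given $N$, occurs at delay $i = i_0$, we have

$$E_{ls}^{(i_0)} \le \frac{N}{N+M+1} \tag{8.3.21}$$

which shows that $E_{ls}^{(i_0)} \to 0$ as $M \to \infty$ (Claerbout and Robinson 1963).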

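EXAMPLE 8.3.2. Suppose that $g(n) = \delta(n) - \alpha\,\delta(n-1)$, where $\alpha$ is a real constant. The exact inverse filter is

$$H(z) = \frac{1}{1 - \alpha z^{-1}} \quad\Rightarrow\quad h(n) = \alpha^n u(n)$$

and is minimum-phase only if $-1 < \alpha < 1$. The inverse LS filter for M = 1 and N ≥ 2 is obtained by applying (8.3.14) with

$$\mathbf{G} = \begin{bmatrix} 1 & 0 \\ -\alpha & 1 \\ 0 & -\alpha \end{bmatrix} \qquad \text{and} \qquad \boldsymbol{\delta} = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}$$

The normal equations are

$$\begin{bmatrix} 1 + \alpha^2 & -\alpha \\ -\alpha & 1 + \alpha^2 \end{bmatrix} \begin{bmatrix} c_{ls}(0) \\ c_{ls}(1) \end{bmatrix} = \begin{bmatrix} 1 \\ 0 \end{bmatrix} \tag{8.3.22}$$

leading to the LS inverse filter

$$c_{ls}(0) = \frac{1 + \alpha^2}{1 + \alpha^2 + \alpha^4} \qquad c_{ls}(1) = \frac{\alpha}{1 + \alpha^2 + \alpha^4}$$

with LS error

$$E_{ls} = 1 - c_{ls}(0) = \frac{\alpha^4}{1 + \alpha^2 + \alpha^4}$$

The system function of the LS inverse filter is

$$H_{ls}(z) = \frac{1 + \alpha^2}{1 + \alpha^2 + \alpha^4} \left( 1 + \frac{\alpha}{1 + \alpha^2}\,z^{-1} \right)$$

and has a zero at $z_1 = -\alpha/(1 + \alpha^2) = -1/(\alpha + \alpha^{-1})$. Since $|z_1| < 1$ for any value of $\alpha$, the LS inverse filter is minimum-phase even if $g(n)$ is not. This stems from the fact that the normal equations (8.3.22) specify a one-step forward linear predictor with a correlation matrix that is Toeplitz and positive definite for any value of $\alpha$ (see Section 7.4).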
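The closed-form results of Example 8.3.2 and the trace identity (8.3.19) can be checked numerically. The following sketch assumes real-valued $g(n)$, so all conjugates drop out. The function names `ls_inverse` and `solve` are hypothetical helpers written for this illustration, and the value $\alpha = 2$ is an arbitrary choice for which $g(n)$ is not minimum-phase.

```python
def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        acc = aug[r][n] - sum(aug[r][k] * z[k] for k in range(r + 1, n))
        z[r] = acc / aug[r][r]
    return z

def ls_inverse(g, M, i):
    """LS inverse filter with M+1 taps for real g(n) and delay i, per (8.3.14)-(8.3.15)."""
    N = len(g) - 1
    rows = N + M + 1
    # Convolution matrix: G[n][k] = g(n - k), zero outside 0 <= n - k <= N.
    G = [[g[n - k] if 0 <= n - k <= N else 0.0 for k in range(M + 1)]
         for n in range(rows)]
    GtG = [[sum(G[n][a] * G[n][b] for n in range(rows)) for b in range(M + 1)]
           for a in range(M + 1)]
    rhs = [G[i][a] for a in range(M + 1)]        # G^T delta_i is the i-th row of G
    c = solve(GtG, rhs)
    err = 1.0 - sum(G[i][k] * c[k] for k in range(M + 1))
    return c, err

alpha = 2.0                      # |alpha| > 1: g(n) is not minimum-phase
g = [1.0, -alpha]
M = 1
c, err = ls_inverse(g, M, 0)
zero = -c[1] / c[0]              # zero of the LS inverse filter
total = sum(ls_inverse(g, M, i)[1] for i in range(len(g) + M))
print(c, err, zero, total)
```

For this choice the zero of the LS inverse filter lies inside the unit circle even though the zero of $g(n)$ does not, and the errors summed over all delays equal $N = 1$ regardless of $M$.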

8.4 LINEAR LEAST-SQUARES SIGNAL ESTIMATION

We now discuss the application of the LS method to general signal estimation, FLP, BLP, and combined forward and backward linear prediction. The reader is advised to review Section 6.5, which provides a detailed discussion of the same problems for the MMSE criterion. The presentation in this section closely follows the viewpoint and notation in Section 6.5.

8.4.1 Signal Estimation and Linear Prediction

Suppose that we wish to compute the linear LS signal estimator $c_k^{(i)}$ defined by

$$e^{(i)}(n) = \sum_{k=0}^{M} c_k^{(i)*}\,x(n-k) = \mathbf{c}^{(i)H}\,\bar{\mathbf{x}}(n) \qquad \text{with } c_i^{(i)} \triangleq 1 \tag{8.4.1}$$

from the data $x(n)$, $0 \le n \le N-1$. Using (8.4.1) and following the process that led to (8.3.4), we obtain

$$\mathbf{e}^{(i)} = \bar{\mathbf{X}}\,\mathbf{c}^{(i)} \tag{8.4.2}$$

where

$$\bar{\mathbf{X}} = \begin{bmatrix}
x^*(0) & 0 & \cdots & 0\\
x^*(1) & x^*(0) & \cdots & 0\\
\vdots & \vdots & \ddots & \vdots\\
x^*(M) & x^*(M-1) & \cdots & x^*(0)\\
\vdots & \vdots & & \vdots\\
x^*(N-1) & x^*(N-2) & \cdots & x^*(N-M-1)\\
0 & x^*(N-1) & \cdots & x^*(N-M)\\
\vdots & \vdots & \ddots & \vdots\\
0 & 0 & \cdots & x^*(N-1)
\end{bmatrix} \tag{8.4.3}$$

is the combined data and desired response matrix with all the unavailable samples set equal to zero (full windowing). Matrix $\bar{\mathbf{X}}$ can be partitioned columnwise as

$$\bar{\mathbf{X}} = [\mathbf{X}_1\;\; \mathbf{y}\;\; \mathbf{X}_2] \tag{8.4.4}$$

where $\mathbf{y}$, the desired response, is the $i$th column of $\bar{\mathbf{X}}$. Using (8.4.4), we can easily show that the LS signal estimator $\mathbf{c}_{ls}^{(i)}$ and the associated LS error $E_{ls}^{(i)}$ are determined by

$$(\bar{\mathbf{X}}^H \bar{\mathbf{X}})\,\mathbf{c}_{ls}^{(i)} = \begin{bmatrix} \mathbf{0} \\ E_{ls}^{(i)} \\ \mathbf{0} \end{bmatrix} \tag{8.4.5}$$

where $E_{ls}^{(i)}$ is the $i$th element of the right-hand side vector (see Problem 8.3). If we define the time-average correlation matrix

$$\bar{\mathbf{R}} \triangleq \bar{\mathbf{X}}^H \bar{\mathbf{X}} \tag{8.4.6}$$

and use the augmented normal equations in (8.4.5), we obtain a set of equations that have the same form as (6.5.12), the equations for the MMSE signal estimator. Therefore, after we have computed $\bar{\mathbf{R}}$, using the command Rbar=lsmatvec(x,M+1,method), we can use the steps in Table 6.3 to compute the LS forward linear predictor (FLP), the backward linear predictor (BLP), the symmetric smoother, or any other signal estimator with delay $i$. Again, we use the standard notation $E_{ls}^{(0)} = E^f$ and $\mathbf{c}_{ls}^{(0)} = \mathbf{a}$ for the FLP and $E_{ls}^{(M)} = E^b$ and $\mathbf{c}_{ls}^{(M)} = \mathbf{b}$ for the BLP.

All formulas given in Section 6.5 hold for LS signal estimators if the matrix $\mathbf{R}(n)$ is replaced by $\bar{\mathbf{R}}$. However, we stress that although the optimum MMSE signal estimator $\mathbf{c}_o^{(i)}(n)$ is a deterministic vector, the LS signal estimator $\mathbf{c}_{ls}^{(i)}$ is a random vector that is a function of the random measurements $x(n)$, $0 \le n \le N-1$. In the full-windowing case, matrix $\bar{\mathbf{R}}$ is Toeplitz; if it is also positive definite, then the FLP is minimum-phase. Although the use of full windowing leads to these nice properties, it also creates some "edge effects" and bias in the estimates because we try to estimate some signal values using values that are not part of the signal, by forcing the samples leading and lagging the available data measurements to zero.

EXAMPLE 8.4.1. Suppose that we are given the signal segment $x(n) = \alpha^n$, $0 \le n \le N$, where $\alpha$ is an arbitrary complex-valued constant. Determine the first-order one-step forward linear predictor, using the full-windowing and no-windowing methods.

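Solution. We start by forming the combined desired response and data matrix

$$\bar{\mathbf{X}}^H = \begin{bmatrix} x(0) & x(1) & \cdots & x(N) & 0 \\ 0 & x(0) & \cdots & x(N-1) & x(N) \end{bmatrix}$$

For the full-windowing method, the matrix

$$\bar{\mathbf{R}} = \bar{\mathbf{X}}^H \bar{\mathbf{X}} = \begin{bmatrix} \hat{r}_x(0) & \hat{r}_x(1) \\ \hat{r}_x^*(1) & \hat{r}_x(0) \end{bmatrix}$$

is Toeplitz with elements

$$\hat{r}_x(0) = \sum_{n=0}^{N} |x(n)|^2 = \sum_{n=0}^{N} |\alpha|^{2n} = \frac{1 - |\alpha|^{2(N+1)}}{1 - |\alpha|^2}$$

and

$$\hat{r}_x(1) = \sum_{n=1}^{N} x(n)\,x^*(n-1) = \sum_{n=1}^{N} \alpha^n (\alpha^*)^{n-1} = \alpha\,\frac{1 - |\alpha|^{2N}}{1 - |\alpha|^2}$$

where the last step uses $\alpha^n (\alpha^*)^{n-1} = \alpha\,|\alpha|^{2(n-1)}$. Therefore, we have

$$\begin{bmatrix} \hat{r}_x(0) & \hat{r}_x(1) \\ \hat{r}_x^*(1) & \hat{r}_x(0) \end{bmatrix} \begin{bmatrix} 1 \\ a_1^{(1)} \end{bmatrix} = \begin{bmatrix} E_1^f \\ 0 \end{bmatrix}$$

whose solution gives

$$a_1^{(1)} = -\frac{\hat{r}_x^*(1)}{\hat{r}_x(0)} = -\alpha^*\,\frac{1 - |\alpha|^{2N}}{1 - |\alpha|^{2(N+1)}}$$

and

$$E_1^f = \hat{r}_x(0) + \hat{r}_x(1)\,a_1^{(1)} = \frac{1 - |\alpha|^{2(2N+1)}}{1 - |\alpha|^{2(N+1)}}$$

Since for every sequence $|\hat{r}_x(l)| \le |\hat{r}_x(0)|$, we have $|a_1^{(1)}| \le 1$; that is, the obtained prediction error filter always is minimum-phase. Furthermore, if $|\alpha| < 1$, then $\lim_{N \to \infty} a_1^{(1)} = -\alpha^*$ and $\lim_{N \to \infty} E_1^f = 1 = |x(0)|^2$. For real $\alpha$ the conjugates can be dropped, and the predictor coefficient reduces to the familiar value $-\alpha$ in the limit.

In the no-windowing case, the matrix

$$\bar{\mathbf{R}} = \bar{\mathbf{X}}^H \bar{\mathbf{X}} = \begin{bmatrix} \hat{r}_{11} & \hat{r}_{12} \\ \hat{r}_{12}^* & \hat{r}_{22} \end{bmatrix}$$

is Hermitian but not Toeplitz with elements

$$\hat{r}_{11} = \sum_{n=1}^{N} |x(n)|^2 = |\alpha|^2\,\frac{1 - |\alpha|^{2N}}{1 - |\alpha|^2}$$

$$\hat{r}_{12} = \sum_{n=1}^{N} x(n)\,x^*(n-1) = \alpha\,\frac{1 - |\alpha|^{2N}}{1 - |\alpha|^2}$$

$$\hat{r}_{22} = \sum_{n=0}^{N-1} |x(n)|^2 = \frac{1 - |\alpha|^{2N}}{1 - |\alpha|^2}$$

Solving the linear system

$$\begin{bmatrix} \hat{r}_{11} & \hat{r}_{12} \\ \hat{r}_{12}^* & \hat{r}_{22} \end{bmatrix} \begin{bmatrix} 1 \\ \bar{a}_1^{(1)} \end{bmatrix} = \begin{bmatrix} \bar{E}_1^f \\ 0 \end{bmatrix}$$

we obtain

$$\bar{a}_1^{(1)} = -\frac{\hat{r}_{12}^*}{\hat{r}_{22}} = -\alpha^*$$

and

$$\bar{E}_1^f = \hat{r}_{11} + \hat{r}_{12}\,\bar{a}_1^{(1)} = 0$$

We see that the no-windowing method provides a perfect linear predictor because there is no distortion due to windowing. However, the obtained prediction error filter is minimum-phase only when $|\alpha| < 1$.

EXAMPLE 8.4.2. To illustrate the statistical properties of least-squares FLP, we generate $K = 500$ realizations of the MA(1) process $x(n) = w(n) + \tfrac{1}{2} w(n-1)$, where $w(n) \sim \mathrm{WN}(0, 1)$ (see Example 6.5.2). Each realization $x(\zeta_i, n)$ has duration $N = 100$ samples. We use these data to design an $M = 2$ order FLP, using the no-windowing LS method. The estimated mean and variance of the obtained $K$ FLP vectors are

$$\mathrm{Mean}\{\mathbf{a}(\zeta_i)\} = \begin{bmatrix} -0.4695 \\ 0.1889 \end{bmatrix} \qquad \text{and} \qquad \mathrm{var}\{\mathbf{a}(\zeta_i)\} = \begin{bmatrix} 0.0086 \\ 0.0092 \end{bmatrix}$$

whereas the average of the variances $\hat{\sigma}_e^2$ is 0.9848. We notice that both means are close to the theoretical values obtained in Example 6.5.2. The covariance matrix of a given LS estimate $\mathbf{a}_{ls}$ was found to be

$$\hat{\sigma}_e^2\,\hat{\mathbf{R}}^{-1} = \begin{bmatrix} 0.0099 & -0.0043 \\ -0.0043 & 0.0099 \end{bmatrix}$$

whose diagonal elements are close to the components of $\mathrm{var}\{\mathbf{a}\}$, as expected. The bias in the estimate $\mathbf{a}_{ls}$ results from the fact that the residuals in the LS equations are correlated with each other (see Problem 8.4).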
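The closed-form expressions of Example 8.4.1 can be checked numerically for a real-valued constant. The sketch below assumes real $\alpha$ (so no conjugation is involved) and forms the three sums that make up the 2 × 2 correlation matrix directly. The function name `first_order_flp` and the values $\alpha = 0.8$, $N = 10$ are hypothetical choices made for this illustration.

```python
def first_order_flp(x, method):
    """First-order LS forward predictor a1 and error E for real x(0), ..., x(N)."""
    N = len(x) - 1

    def s(n):
        # Samples outside 0..N are forced to zero (windowing).
        return x[n] if 0 <= n <= N else 0.0

    if method == "full":
        n_range = range(0, N + 2)      # Ni = 0, Nf = N + 1
    else:
        n_range = range(1, N + 1)      # no windowing: Ni = 1, Nf = N
    r11 = sum(s(n) * s(n) for n in n_range)
    r12 = sum(s(n) * s(n - 1) for n in n_range)
    r22 = sum(s(n - 1) * s(n - 1) for n in n_range)
    a1 = -r12 / r22                    # second row of the augmented normal equations
    return a1, r11 + r12 * a1          # first row gives the LS error

alpha, N = 0.8, 10
x = [alpha ** n for n in range(N + 1)]
a_full, E_full = first_order_flp(x, "full")
a_none, E_none = first_order_flp(x, "none")
print(a_full, E_full)
print(a_none, E_none)
```

The full-windowing coefficient is smaller in magnitude than $\alpha$ because of the zeros appended at both ends, whereas the no-windowing solution predicts the exponential exactly and leaves an error that is zero to within rounding.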

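8.4.2 Combined Forward and Backward Linear Prediction (FBLP)

For stationary stochastic processes, the optimum MMSE forward and backward linear predictors have even conjugate symmetry, that is,

$$\mathbf{a}_o = \mathbf{J}\,\mathbf{b}_o^* \tag{8.4.7}$$

because both directions of time have the same second-order statistics. Formally, this property stems from the Toeplitz structure of the autocorrelation matrix (see Section 6.5). However, we could possibly improve performance by minimizing the total forward and backward squared error

$$E^{fb} = \sum_{n=N_i}^{N_f} \{ |e^f(n)|^2 + |e^b(n)|^2 \} = (\mathbf{e}^f)^H \mathbf{e}^f + (\mathbf{e}^b)^H \mathbf{e}^b \tag{8.4.8}$$

under the constraint

$$\mathbf{a}^{fb} \triangleq \mathbf{a} = \mathbf{J}\,\mathbf{b}^* \tag{8.4.9}$$

The FLP and BLP overdetermined sets of equations are

$$\mathbf{e}^f = \bar{\mathbf{X}} \begin{bmatrix} 1 \\ \mathbf{a} \end{bmatrix} \qquad \text{and} \qquad \mathbf{e}^b = \bar{\mathbf{X}} \begin{bmatrix} \mathbf{b} \\ 1 \end{bmatrix} \tag{8.4.10}$$

or

$$\mathbf{e}^f = \bar{\mathbf{X}} \begin{bmatrix} 1 \\ \mathbf{a}^{fb} \end{bmatrix} \qquad \text{and} \qquad \mathbf{e}^{b*} = \bar{\mathbf{X}}^* \begin{bmatrix} \mathbf{b}^* \\ 1 \end{bmatrix} = \bar{\mathbf{X}}^* \mathbf{J} \begin{bmatrix} 1 \\ \mathbf{a}^{fb} \end{bmatrix} \tag{8.4.11}$$

where we have used (8.4.9) and the property $\mathbf{J}\mathbf{J} = \mathbf{I}$ of the exchange matrix. If we combine the above two equations as

$$\begin{bmatrix} \mathbf{e}^f \\ \mathbf{e}^{b*} \end{bmatrix} = \begin{bmatrix} \bar{\mathbf{X}} \\ \bar{\mathbf{X}}^* \mathbf{J} \end{bmatrix} \begin{bmatrix} 1 \\ \mathbf{a}^{fb} \end{bmatrix} \tag{8.4.12}$$

then the forward-backward linear predictor that minimizes $E^{fb}$ is given by (see Problem 8.5)

$$\begin{bmatrix} \bar{\mathbf{X}} \\ \bar{\mathbf{X}}^* \mathbf{J} \end{bmatrix}^H \begin{bmatrix} \bar{\mathbf{X}} \\ \bar{\mathbf{X}}^* \mathbf{J} \end{bmatrix} \begin{bmatrix} 1 \\ \mathbf{a}_{ls}^{fb} \end{bmatrix} = \begin{bmatrix} E_{ls}^{fb} \\ \mathbf{0} \end{bmatrix}$$

or

$$(\bar{\mathbf{X}}^H \bar{\mathbf{X}} + \mathbf{J}\,\bar{\mathbf{X}}^T \bar{\mathbf{X}}^* \mathbf{J}) \begin{bmatrix} 1 \\ \mathbf{a}_{ls}^{fb} \end{bmatrix} = \begin{bmatrix} E_{ls}^{fb} \\ \mathbf{0} \end{bmatrix} \tag{8.4.13}$$

which can be solved by using the steps described in Table 6.3. The time-average forward-backward correlation matrix

$$\hat{\mathbf{R}}^{fb} \triangleq \bar{\mathbf{X}}^H \bar{\mathbf{X}} + \mathbf{J}\,\bar{\mathbf{X}}^T \bar{\mathbf{X}}^* \mathbf{J} \tag{8.4.14}$$

with elements

$$\hat{r}_{ij}^{fb} = \hat{r}_{ij} + \hat{r}_{M-i,M-j}^* \qquad 0 \le i, j \le M \tag{8.4.15}$$

is persymmetric; that is, $\mathbf{J}\,\hat{\mathbf{R}}^{fb}\,\mathbf{J} = \hat{\mathbf{R}}^{fb*}$, and its elements are conjugate symmetric about both main diagonals. In Matlab we compute $\hat{\mathbf{R}}^{fb}$ by these commands:

    Rbar=lsmatvec(x,M+1,method)
    Rfb=Rbar+flipud(fliplr(conj(Rbar)))

The FBLP method is used with no windowing and was originally introduced independently by Ulrych and Clayton (1976) and Nuttall (1976) as a spectral estimation technique under the name modified covariance method (see Section 9.2). If we use full windowing, then $\mathbf{a}^{fb} = (\mathbf{a} + \mathbf{J}\mathbf{b}^*)/2$ (see Problem 8.6).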
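The forward-backward correlation matrix can be illustrated with a small sketch. This is not the book's lsmatvec function. It is a minimal pure-Python version, assuming the no-windowing range $M \le n \le N-1$ and using the elementwise form (8.4.15) in place of the flipud/fliplr commands. The function names `correlation` and `forward_backward` and the single complex exponential used as test data are hypothetical.

```python
import cmath

def correlation(x, M, n_start, n_stop):
    """Rbar[i][j] = sum over n of x(n - i) x*(n - j), for n_start <= n <= n_stop."""
    return [[sum(x[n - i] * x[n - j].conjugate()
                 for n in range(n_start, n_stop + 1))
             for j in range(M + 1)] for i in range(M + 1)]

def forward_backward(Rbar):
    """Rfb = Rbar + J conj(Rbar) J, written elementwise as in (8.4.15)."""
    M = len(Rbar) - 1
    return [[Rbar[i][j] + Rbar[M - i][M - j].conjugate()
             for j in range(M + 1)] for i in range(M + 1)]

# A single complex exponential is perfectly predictable in both directions.
N, M = 16, 1
omega = 0.9
x = [cmath.exp(1j * omega * n) for n in range(N)]

Rbar = correlation(x, M, M, N - 1)            # no windowing: M <= n <= N - 1
Rfb = forward_backward(Rbar)

# First-order case of (8.4.13): the second row gives a1, the first row the error.
a1 = -Rfb[1][0] / Rfb[1][1]
Efb = (Rfb[0][0] + Rfb[0][1] * a1).real
print(a1, Efb)
```

For this signal the first-order FBLP coefficient equals $-e^{-j\omega}$ and the combined error vanishes to within rounding, which is consistent with the convention $e^f(n) = x(n) + a_1^* x(n-1)$ of (8.4.1).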

8.4.3 Narrowband Interference Cancelation

Several practical applications require the removal of narrowband interference (NBI) from a wideband desired signal corrupted by additive white noise. For example, ground and foliage-penetrating radars operate from 0.01 to 1 GHz and use either an impulse or a chirp waveform. To achieve high resolution, these waveforms are extremely wideband, occupying at least 100 MHz within the range of 0.01 to 1 GHz. However, these frequency ranges are extensively used by TV and FM stations, cellular phones, and other relatively narrowband (less than 1 MHz) radio-frequency (RF) sources. Clearly, these sources spoil the radar returns with narrowband RF interference (Miller et al. 1997). Since the additive noise is often due to the sensor circuitry, it will be referred to as sensor thermal noise. Next we provide a practical solution to this problem, using an LS linear predictor.

Suppose that the corrupted signal $x(n)$ is given by

$$x(n) = s(n) + y(n) + v(n) \tag{8.4.16}$$

where

$$\begin{aligned} s(n) &= \text{signal of interest} \\ y(n) &= \text{narrowband interference} \\ v(n) &= \text{thermal (white) noise} \end{aligned} \tag{8.4.17}$$

are the individual components, assumed to be stationary stochastic processes. We wish to design an NBI canceler that estimates and rejects the interference signal $y(n)$ from the signal $x(n)$, while preserving the signal of interest $s(n)$. Since signals $y(n)$ and $x(n)$ are correlated, we can form an estimate of the NBI using the optimum linear estimator

$$\hat{y}(n) = \mathbf{c}_o^H\,\mathbf{x}(n - D) \tag{8.4.18}$$

where

$$\mathbf{R}\,\mathbf{c}_o = \mathbf{d} \tag{8.4.19}$$

$$\mathbf{R} = E\{\mathbf{x}(n - D)\,\mathbf{x}^H(n - D)\} \tag{8.4.20}$$

$$\mathbf{d} = E\{\mathbf{x}(n - D)\,y^*(n)\} \tag{8.4.21}$$

and $D$ is an integer delay whose use will be justified shortly. Note that if $D = 1$, then (8.4.18) is the LS forward linear predictor. If $\hat{y}(n) = y(n)$, the output of the canceler is $x(n) - \hat{y}(n) = s(n) + v(n)$; that is, the NBI is completely excised, and the desired signal is corrupted by white noise only and is said to be thermal noise–limited.

Since, in practice, the required second-order moments are not available, we need to use an LS estimator instead. However, the quantity $\mathbf{X}^H \mathbf{y}$ in (8.2.21) requires the NBI signal $y(n)$, which is also not available. To overcome this obstacle, consider the optimum MMSE $D$-step forward linear predictor

$$e^f(n) = x(n) + \mathbf{a}^H\,\mathbf{x}(n - D) \tag{8.4.22}$$

$$\mathbf{R}\,\mathbf{a} = -\mathbf{r}^f \tag{8.4.23}$$

where $\mathbf{R}$ is given by (8.4.20) and

$$\mathbf{r}^f = E\{\mathbf{x}(n - D)\,x^*(n)\} \tag{8.4.24}$$

In many NBI cancelation applications, the components of the observed signal have the following properties:

1. The desired signal $s(n)$, the NBI $y(n)$, and the thermal noise $v(n)$ are mutually uncorrelated.
2. The thermal noise $v(n)$ is white; that is, $r_v(l) = \sigma_v^2\,\delta(l)$.
3. The desired signal $s(n)$ is wideband and therefore has a short correlation length; that is, $r_s(l) = 0$ for $|l| \ge D$.
4. The NBI has a long correlation length; that is, its autocorrelation takes significant values over the range $0 \le |l| \le M$ for $M > D$.

In practice, the second and third properties mean that the desired signal and the thermal noise are approximately uncorrelated after a certain small lag. These are precisely the properties exploited by the canceler to separate the NBI from the desired signal and the background noise.

As a result of the first assumption, we have

$$E\{x(n-k)\,y^*(n)\} = E\{y(n-k)\,y^*(n)\} = r_y(k) \qquad \text{for all } k \tag{8.4.25}$$

and

$$r_x(l) = r_s(l) + r_y(l) + r_v(l) \tag{8.4.26}$$

Making use of the second and third assumptions, we have

$$r_x(l) = r_y(l) \qquad \text{for } l \ne 0, 1, \ldots, D-1 \tag{8.4.27}$$

The exclusion of the lags for $l = 0, 1, \ldots, D-1$ in $\mathbf{r}$ and $\mathbf{d}$ is critical, and we have arranged for that by forcing the filter and the predictor to form their estimates using the delayed data vector $\mathbf{x}(n - D)$. From (8.4.21), (8.4.24), and (8.4.27), we conclude that $\mathbf{d} = \mathbf{r}^f$ and therefore, comparing (8.4.19) with (8.4.23), $\mathbf{c}_o = -\mathbf{a}_o$. Thus, apart from a sign change, the optimum NBI estimator $\mathbf{c}_o$ is equal to the $D$-step linear predictor $\mathbf{a}_o$, which can be determined exclusively from the input signal $x(n)$. The cleaned signal is

$$x(n) - \hat{y}(n) = x(n) + \mathbf{a}_o^H\,\mathbf{x}(n - D) = e^f(n) \tag{8.4.28}$$

which is identical to the $D$-step forward prediction error. This leads to the linear prediction NBI canceler shown in Figure 8.7.

[Figure 8.7: block diagram. The corrupted signal x(n) passes through a delay z^{-D} into a forward linear predictor, whose output is subtracted from x(n) to give the cleaned signal e^f(n).]

FIGURE 8.7 Block diagram of linear prediction NBI canceler.

To illustrate the performance of the linear prediction NBI canceler, we consider an impulse radar operating in a location with commercial radio and TV stations. The desired signal is a short-duration impulse corrupted by additive thermal noise and NBI (see Figure 8.8). The spectrum of the NBI is shown in Figure 8.9. We use a block of data ($N = 4096$) to design an FBLP with $D = 1$ and $M = 100$ coefficients, using the LS criterion with no windowing. Then we compute the cleaned signal, using (8.4.28). The cleaned signal, its spectrum, and the magnitude response of the NBI canceler are shown in Figures 8.8 and 8.9. We see that the canceler acts as a notch filter that optimally puts notches at the peaks of the NBI. A detailed description of the design of optimum least-squares NBI cancelers is given in Problem 8.27.
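The way a $D$-step predictor nulls a narrowband component can be seen in a deliberately small sketch. The code below is not the $M = 100$ design of the radar example. It assumes a single real sinusoid as the only signal component, a forward-only predictor with $M = 2$ coefficients, no windowing, and $D = 3$. The names `d_step_predictor`, `cancel`, and `solve` are hypothetical helpers written for this illustration.

```python
import math

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        acc = aug[r][n] - sum(aug[r][k] * z[k] for k in range(r + 1, n))
        z[r] = acc / aug[r][r]
    return z

def d_step_predictor(x, M, D):
    """LS D-step forward predictor for real x: solve R a = -r (no windowing)."""
    R = [[0.0] * M for _ in range(M)]
    r = [0.0] * M
    for n in range(D + M - 1, len(x)):            # all required samples available
        past = [x[n - D - k] for k in range(M)]   # delayed data vector x(n - D)
        for i in range(M):
            r[i] += past[i] * x[n]
            for j in range(M):
                R[i][j] += past[i] * past[j]
    return solve(R, [-v for v in r])

def cancel(x, a, D):
    """Cleaned signal e(n) = x(n) + a^T x(n - D), as in (8.4.28)."""
    M = len(a)
    return [x[n] + sum(a[k] * x[n - D - k] for k in range(M))
            for n in range(D + M - 1, len(x))]

N, M, D = 200, 2, 3
omega = 0.3
tone = [math.cos(omega * n + 0.5) for n in range(N)]   # stand-in for the NBI
a = d_step_predictor(tone, M, D)
cleaned = cancel(tone, a, D)
residual = max(abs(e) for e in cleaned)
print(a, residual)
```

Two coefficients suffice here because a real sinusoid satisfies an exact second-order recurrence for any delay $D$, so the prediction error vanishes to within rounding. With so few coefficients the canceler is not expected to leave a wideband component undistorted; that would call for a much longer predictor, as in the example above.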

8.5 LS COMPUTATIONS USING THE NORMAL EQUATIONS

The solution of the normal equations for both MMSE and LSE estimation problems is computed by using the same algorithms. The key difference is that in MMSE estimation $\mathbf{R}$ and $\mathbf{d}$ are known, whereas in LSE estimation they need to be computed from the observed input and desired response signal samples. Therefore, it is natural to want to take advantage of the same algorithms developed for MMSE estimation in Chapter 7, whenever possible. However, keep in mind that despite algorithmic similarities, there are fundamental differences between the two classes of estimators that are dictated by the different nature of the criteria of performance (see Section 8.1). In this section, we show how the computational algorithms and structures developed for linear MMSE estimation can be applied to linear LSE estimation, relying heavily on the material presented in Chapter 7.

8.5.1 Linear LSE Estimation

The computation of a general linear LSE estimator requires the solution of a linear system

$$\hat{\mathbf{R}}\,\mathbf{c}_{ls} = \hat{\mathbf{d}} \tag{8.5.1}$$

where the time-average correlation matrix $\hat{\mathbf{R}}$ is Hermitian and positive definite [see (8.2.25)]. We can solve (8.5.1) by using the $LDL^H$ or the Cholesky decomposition introduced in Section 6.3. The computation of linear LSE estimators involves the steps summarized in Table 8.1. We again stress that the major computational effort is involved in the computation of $\hat{\mathbf{R}}$ and $\hat{\mathbf{d}}$.

Steps 2 and 3 in (6.3.16) can be facilitated by a single extended $LDL^H$ decomposition. To this end, we form the augmented data matrix

$$\bar{\mathbf{X}} = [\mathbf{X}\;\; \mathbf{y}] \tag{8.5.2}$$

and compute its time-average correlation matrix

$$\bar{\mathbf{R}} = \bar{\mathbf{X}}^H \bar{\mathbf{X}} = \begin{bmatrix} \mathbf{X}^H \mathbf{X} & \mathbf{X}^H \mathbf{y} \\ \mathbf{y}^H \mathbf{X} & \mathbf{y}^H \mathbf{y} \end{bmatrix} = \begin{bmatrix} \hat{\mathbf{R}} & \hat{\mathbf{d}} \\ \hat{\mathbf{d}}^H & E_y \end{bmatrix} \tag{8.5.3}$$

[Figure 8.8: four time-domain panels, amplitude versus time (ns) over 950 to 1050 ns: "Impulse"; "Impulse + White noise"; "Impulse + White noise + NBI"; "After NBI excision: M = 100".]

FIGURE 8.8 NBI cancelation: time-domain results.

[Figure 8.9: three panels, power (dB) versus frequency (GHz) from 0.1 to 1 GHz: "Observed signal spectrum"; "NBI canceler response: M = 100"; "Cleaned signal spectrum".]

FIGURE 8.9 NBI cancelation: frequency-domain results.

TABLE 8.1 Comparison between the $LDL^H$ and Cholesky decomposition methods for the solution of normal equations.

| Step | $LDL^H$ decomposition | Cholesky decomposition | Description |
|---|---|---|---|
| 1 | $\hat{\mathbf{R}} = \mathbf{X}^H\mathbf{X}$, $\hat{\mathbf{d}} = \mathbf{X}^H\mathbf{y}$ | $\hat{\mathbf{R}} = \mathbf{X}^H\mathbf{X}$, $\hat{\mathbf{d}} = \mathbf{X}^H\mathbf{y}$ | Normal equations $\hat{\mathbf{R}}\mathbf{c}_{ls} = \hat{\mathbf{d}}$ |
| 2 | $\hat{\mathbf{R}} = \mathbf{L}\mathbf{D}\mathbf{L}^H$ | $\hat{\mathbf{R}} = \mathcal{L}\mathcal{L}^H$ | Triangular decomposition |
| 3 | $\mathbf{L}\mathbf{D}\mathbf{k} = \hat{\mathbf{d}}$ | $\mathcal{L}\tilde{\mathbf{k}} = \hat{\mathbf{d}}$ | Forward substitution → $\mathbf{k}$ or $\tilde{\mathbf{k}}$ |
| 4 | $\mathbf{L}^H\mathbf{c}_{ls} = \mathbf{k}$ | $\mathcal{L}^H\mathbf{c}_{ls} = \tilde{\mathbf{k}}$ | Backward substitution → $\mathbf{c}_{ls}$ |
| 5 | $E_{ls} = E_y - \mathbf{k}^H\mathbf{D}\mathbf{k}$ | $E_{ls} = E_y - \tilde{\mathbf{k}}^H\tilde{\mathbf{k}}$ | LSE computation |
| 6 | $\mathbf{e}_{ls} = \mathbf{y} - \mathbf{X}\mathbf{c}_{ls}$ | $\mathbf{e}_{ls} = \mathbf{y} - \mathbf{X}\mathbf{c}_{ls}$ | Computation of residuals |

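We then can show (see Problem 8.9) that the $LDL^H$ decomposition of $\bar{\mathbf{R}}$ is given by

$$\bar{\mathbf{R}} = \begin{bmatrix} \mathbf{L} & \mathbf{0} \\ \mathbf{k}^H & 1 \end{bmatrix} \begin{bmatrix} \mathbf{D} & \mathbf{0} \\ \mathbf{0}^H & E_{ls} \end{bmatrix} \begin{bmatrix} \mathbf{L}^H & \mathbf{k} \\ \mathbf{0}^H & 1 \end{bmatrix} \tag{8.5.4}$$

and thus provides the vector $\mathbf{k}$ and the LSE $E_{ls}$. Therefore, we can solve the normal equations (8.5.1), using the $LDL^H$ decomposition of $\bar{\mathbf{R}}$ to compute $\mathbf{L}$ and $\mathbf{k}$ and then solving $\mathbf{L}^H \mathbf{c}_{ls} = \mathbf{k}$ to compute $\mathbf{c}_{ls}$.

TABLE 8.2 Comparison between the MMSE and LSE normal equations for general linear estimation.

| | MMSE | LSE |
|---|---|---|
| Available information | $\mathbf{R}_m(n)$, $\mathbf{d}_m(n)$ | $\{\mathbf{x}_m(n),\, y(n),\; n_i \le n \le n_f\}$ |
| Normal equations | $\mathbf{R}_m(n)\,\mathbf{c}_m(n) = \mathbf{d}_m(n)$ | $\hat{\mathbf{R}}_m \mathbf{c}_m = \hat{\mathbf{d}}_m$ |
| Minimum error | $P_m(n) = P_y(n) - \mathbf{d}_m^H(n)\,\mathbf{c}_m(n)$ | $E_m = E_y - \hat{\mathbf{d}}_m^H \mathbf{c}_m$ |
| Correlation matrix | $\mathbf{R}_m(n) \triangleq E\{\mathbf{x}_m(n)\,\mathbf{x}_m^H(n)\}$ | $\hat{\mathbf{R}}_m = \mathbf{X}_m^H \mathbf{X}_m = \sum_{n=0}^{N-1} \mathbf{x}_m(n)\,\mathbf{x}_m^H(n)$ |
| Cross-correlation vector | $\mathbf{d}_m(n) \triangleq E\{\mathbf{x}_m(n)\,y^*(n)\}$ | $\hat{\mathbf{d}}_m = \mathbf{X}_m^H \mathbf{y} = \sum_{n=0}^{N-1} \mathbf{x}_m(n)\,y^*(n)$ |
| Power | $P_y(n) = E\{\lvert y(n)\rvert^2\}$ | $E_y = \mathbf{y}^H \mathbf{y} = \sum_{n=0}^{N-1} \lvert y(n)\rvert^2$ |

A careful inspection of the design equations for the general, $m$th-order, MMSE and LSE estimators, derived in Chapter 6 and summarized in Table 8.2, shows that the LSE equations can be obtained from the MMSE equations by replacing the linear operator $E\{\cdot\}$ by the linear operator $\sum_n (\cdot)$. As a result, all algorithms developed in Sections 7.1 and 7.2 can be used for linear LSE estimation problems.

For example, we can easily see that $\hat{\mathbf{R}}_M$, $\hat{\mathbf{d}}_M$, $\mathbf{L}_M$, $\mathbf{D}_M$, and $\mathbf{k}_M$ have the optimum nesting property described in Section 7.1.1, that is, $\hat{\mathbf{R}}_m = \hat{\mathbf{R}}_M^{\lceil m \rceil}$ (the leading $m \times m$ submatrix of $\hat{\mathbf{R}}_M$) and so on. As a result, the factors of the $LDL^H$ decomposition have the optimum nesting property, and we can obtain an order-recursive structure for the computation of the LSE estimate $\hat{y}_m(n)$. Indeed, if we define

$$\mathbf{w}_m(n) = \mathbf{L}_m^{-1}\,\mathbf{x}_m(n) \qquad 0 \le n \le N-1 \tag{8.5.5}$$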

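then

$$\hat{\mathbf{R}}_m = \sum_{n=0}^{N-1} \mathbf{x}_m(n)\,\mathbf{x}_m^H(n) = \mathbf{L}_m \left[ \sum_{n=0}^{N-1} \mathbf{w}_m(n)\,\mathbf{w}_m^H(n) \right] \mathbf{L}_m^H \triangleq \mathbf{L}_m \mathbf{D}_m \mathbf{L}_m^H \tag{8.5.6}$$

where the matrix $\mathbf{D}_m$ is diagonal because the $LDL^H$ decomposition is unique. If we define the record vectors

$$\tilde{\mathbf{w}}_j \triangleq [w_j(0)\;\; w_j(1)\;\; \cdots\;\; w_j(N-1)]^H \tag{8.5.7}$$

and the data matrix

$$\mathbf{W}_m \triangleq [\tilde{\mathbf{w}}_1\;\; \tilde{\mathbf{w}}_2\;\; \cdots\;\; \tilde{\mathbf{w}}_m] \tag{8.5.8}$$

then

$$\mathbf{D}_m = \mathbf{W}_m^H \mathbf{W}_m = \mathrm{diag}\{\xi_1, \xi_2, \ldots, \xi_m\} \tag{8.5.9}$$

where

$$\xi_i = \sum_{n=0}^{N-1} |w_i(n)|^2 = \tilde{\mathbf{w}}_i^H \tilde{\mathbf{w}}_i \tag{8.5.10}$$

From (8.5.9), we have

$$\tilde{\mathbf{w}}_i^H \tilde{\mathbf{w}}_j = 0 \qquad \text{for } i \ne j \tag{8.5.11}$$

that is, the columns of $\mathbf{W}_m$ are orthogonal and, in this sense, are the innovation vectors of the columns of data matrix $\mathbf{X}_m$, according to the LS interpretation of orthogonality introduced in Section 8.2. Following the approach in Section 7.1.5, we can show that the following order-recursive algorithm

$$w_m(n) = x_m(n) - \sum_{i=1}^{m-1} l_{i-1}^{(m-1)*}\,w_i(n) \tag{8.5.12}$$

$$\hat{y}_m(n) = \hat{y}_{m-1}(n) + k_m^*\,w_m(n)$$

or

$$e_m(n) = e_{m-1}(n) - k_m^*\,w_m(n)$$

computed for $n = 0, 1, \ldots, N-1$ and $m = 1, 2, \ldots, M$, provides the LSE estimates for orders $1 \le m \le M$. The statistical interpretations of innovation and partial correlation for $w_m(n)$ and $k_{m+1}$ hold now in a deterministic LSE sense. For example, the partial correlation between $\tilde{\mathbf{y}}$ and $\tilde{\mathbf{x}}_{m+1}$ is defined by using the residual records $\tilde{\mathbf{e}}_m = \tilde{\mathbf{y}} - \mathbf{X}_m \mathbf{c}_m$ and $\tilde{\mathbf{e}}_m^b = \tilde{\mathbf{x}}_{m+1} + \mathbf{X}_m \mathbf{b}_m$, where $\mathbf{b}_m$ is the least-squares error BLP. Indeed, if $\beta_{m+1} \triangleq \tilde{\mathbf{e}}_m^{bH}\,\tilde{\mathbf{e}}_m$, we can show that $k_{m+1} = \beta_{m+1}/\xi_{m+1}$ (see Problem 8.11).

EXAMPLE 8.5.1. Solve the LS problem with the following data matrix and desired response signal:

$$\mathbf{X} = \begin{bmatrix} 1 & 1 & 1 \\ 2 & 2 & 1 \\ 3 & 1 & 3 \\ 1 & 0 & 1 \end{bmatrix} \qquad \mathbf{y} = \begin{bmatrix} 1 \\ 2 \\ 4 \\ 3 \end{bmatrix}$$

Solution. We start by computing the time-average correlation matrix and cross-correlation vector

$$\hat{\mathbf{R}} = \begin{bmatrix} 15 & 8 & 13 \\ 8 & 6 & 6 \\ 13 & 6 & 12 \end{bmatrix} \qquad \hat{\mathbf{d}} = \begin{bmatrix} 20 \\ 9 \\ 18 \end{bmatrix}$$

followed by the $LDL^H$ decomposition of $\hat{\mathbf{R}}$ using the Matlab function [L,D]=ldlt(X). This gives

$$\mathbf{L} = \begin{bmatrix} 1 & 0 & 0 \\ 0.5333 & 1 & 0 \\ 0.8667 & -0.5385 & 1 \end{bmatrix} \qquad \mathbf{D} = \begin{bmatrix} 15 & 0 & 0 \\ 0 & 1.7333 & 0 \\ 0 & 0 & 0.2308 \end{bmatrix}$$

and working through the steps in Table 8.1, we find the LS solution and LSE to be

$$\mathbf{c}_{ls} = [3.0\;\; -1.5\;\; -1.0]^T \qquad E_{ls} = 1.5$$

using the following sequence of Matlab commands, in which the forward substitution solves $\mathbf{L}\mathbf{D}\mathbf{k} = \hat{\mathbf{d}}$ (step 3 of Table 8.1) and X is the 4 × 3 data matrix shown above:

    k=(L*D)\dhat; cls=L'\k; Els=sum((y-X*cls).^2);

These results can be verified by using the command cls=Rhat\dhat.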
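The steps of Table 8.1 for Example 8.5.1 can also be traced in a short stand-alone sketch. The code below is not the book's ldlt function. It is a minimal pure-Python $LDL^T$ factorization, assuming real data so that Hermitian transposes reduce to ordinary transposes, and the function name `ldl` is a hypothetical choice.

```python
def ldl(R):
    """LDL^T factors (unit lower-triangular L, diagonal D) of a real symmetric
    positive definite matrix R."""
    n = len(R)
    L = [[float(i == j) for j in range(n)] for i in range(n)]
    D = [0.0] * n
    for j in range(n):
        D[j] = R[j][j] - sum(L[j][k] ** 2 * D[k] for k in range(j))
        for i in range(j + 1, n):
            acc = sum(L[i][k] * L[j][k] * D[k] for k in range(j))
            L[i][j] = (R[i][j] - acc) / D[j]
    return L, D

# Data of Example 8.5.1 (rows of X are the time samples).
X = [[1, 1, 1], [2, 2, 1], [3, 1, 3], [1, 0, 1]]
y = [1, 2, 4, 3]
M = 3

# Step 1: time-average correlation matrix and cross-correlation vector.
Rhat = [[sum(row[i] * row[j] for row in X) for j in range(M)] for i in range(M)]
dhat = [sum(row[i] * yn for row, yn in zip(X, y)) for i in range(M)]

# Step 2: triangular decomposition Rhat = L D L^T.
L, D = ldl(Rhat)

# Step 3: forward substitution, L D k = dhat.
k = [0.0] * M
for i in range(M):
    k[i] = (dhat[i] - sum(L[i][j] * D[j] * k[j] for j in range(i))) / D[i]

# Step 4: backward substitution, L^T cls = k.
cls = [0.0] * M
for i in range(M - 1, -1, -1):
    cls[i] = k[i] - sum(L[j][i] * cls[j] for j in range(i + 1, M))

# Step 5: LS error from the factors; step 6: residuals as a cross-check.
Ey = sum(v * v for v in y)
Els = Ey - sum(D[i] * k[i] ** 2 for i in range(M))
residuals = [yn - sum(row[i] * cls[i] for i in range(M)) for row, yn in zip(X, y)]
print(cls, Els, sum(e * e for e in residuals))
```

The two ways of obtaining the LS error, from $E_y - \mathbf{k}^H\mathbf{D}\mathbf{k}$ and from the squared residuals, agree, which is a convenient consistency check when the factorization is coded by hand.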

8.5.2 LSE FIR Filtering and Prediction

As we stressed in Section 7.3, the fundamental difference between general linear estimation and FIR filtering and prediction, which is the key to the development of efficient order-recursive algorithms, is the shift invariance of the input data vector

$$\mathbf{x}_{m+1}(n) = [x(n)\;\; x(n-1)\;\; \cdots\;\; x(n-m+1)\;\; x(n-m)]^T \tag{8.5.13}$$

The input data vector can be partitioned as

$$\mathbf{x}_{m+1}(n) = \begin{bmatrix} \mathbf{x}_m(n) \\ x(n-m) \end{bmatrix} = \begin{bmatrix} x(n) \\ \mathbf{x}_m(n-1) \end{bmatrix} \tag{8.5.14}$$

which shows that samples from different times are incorporated as the order is increased. This creates a coupling between order and time updatings that has significant implications in the development of efficient algorithms. Indeed, we can easily see that the matrix

$$\hat{\mathbf{R}}_{m+1} = \sum_{n=N_i}^{N_f} \mathbf{x}_{m+1}(n)\,\mathbf{x}_{m+1}^H(n) \tag{8.5.15}$$

can be partitioned as

$$\hat{\mathbf{R}}_{m+1} = \begin{bmatrix} \hat{\mathbf{R}}_m & \hat{\mathbf{r}}_m^b \\ \hat{\mathbf{r}}_m^{bH} & E_m^b \end{bmatrix} = \begin{bmatrix} E_m^f & \hat{\mathbf{r}}_m^{fH} \\ \hat{\mathbf{r}}_m^f & \hat{\mathbf{R}}_m^f \end{bmatrix} \tag{8.5.16}$$

where

$$\hat{\mathbf{R}}_m^f = \hat{\mathbf{R}}_m + \mathbf{x}_m(N_i - 1)\,\mathbf{x}_m^H(N_i - 1) - \mathbf{x}_m(N_f)\,\mathbf{x}_m^H(N_f) \tag{8.5.17}$$

is the matrix equivalent of (8.2.28). We notice that the relationship between $\hat{\mathbf{R}}_m$ and $\hat{\mathbf{R}}_m^f$, which allows for the development of a complete set of order-recursive algorithms for FIR filtering and prediction, depends on the choice of $N_i$ and $N_f$, that is, the windowing method selected. As we discussed in Section 8.3, there are four cases of interest.

In the full-windowing case ($N_i = 0$, $N_f = N+M-2$), we have $\hat{\mathbf{R}}_m^f = \hat{\mathbf{R}}_m$ and $\hat{\mathbf{R}}_m$ is Toeplitz. Therefore, all the algorithms and structures developed in Chapter 7 for Toeplitz matrices can be utilized.

In the prewindowing case ($N_i = 0$, $N_f = N-1$), Equation (8.5.17) becomes

$$\hat{\mathbf{R}}_m^f = \hat{\mathbf{R}}_m - \mathbf{x}_m(N-1)\,\mathbf{x}_m^H(N-1) \tag{8.5.18}$$

Since $\mathbf{x}_m(n) = \mathbf{0}$ for $n < 0$ (prewindowing), $\hat{\mathbf{R}}_m$ is a function of $N$. If we use the definition

$$\hat{\mathbf{R}}_m(N) \triangleq \sum_{n=0}^{N-1} \mathbf{x}_m(n)\,\mathbf{x}_m^H(n) \tag{8.5.19}$$

then the time-updating (8.5.18) can be written as

$$\hat{\mathbf{R}}_m^f = \hat{\mathbf{R}}_m(N-1) = \hat{\mathbf{R}}_m(N) - \mathbf{x}_m(N-1)\,\mathbf{x}_m^H(N-1) \tag{8.5.20}$$

and the order-updating (8.5.16) as

$$\hat{\mathbf{R}}_{m+1}(N) = \begin{bmatrix} \hat{\mathbf{R}}_m(N) & \hat{\mathbf{r}}_m^b(N) \\ \hat{\mathbf{r}}_m^{bH}(N) & E_m^b(N) \end{bmatrix} = \begin{bmatrix} E_m^f(N) & \hat{\mathbf{r}}_m^{fH}(N) \\ \hat{\mathbf{r}}_m^f(N) & \hat{\mathbf{R}}_m(N-1) \end{bmatrix} \tag{8.5.21}$$

which has the same form as (7.3.3). Therefore, all order recursions developed in Section 7.3 can be applied in the prewindowing case. However, to get a complete algorithm, we need recursions for the time updatings of the BLP, $\mathbf{b}_m(N-1) \to \mathbf{b}_m(N)$ and $E_m^b(N-1) \to E_m^b(N)$, which can be developed by using the time-recursive algorithms developed in Chapter 10 for LS adaptive filters. The postwindowing case can be developed in a similar fashion, but it is of no particular practical interest.

In the no-windowing case ($N_i = M-1$, $N_f = N-1$), matrices $\hat{\mathbf{R}}_m$ and $\hat{\mathbf{R}}_m^f$ depend on both $M$ and $N$. Thus, although the development of order recursions can be done as in the prewindowing case, the time updatings are more complicated due to (8.5.17) (Morf et al. 1977). Setting the lower limit to $N_i = M-1$ means that all filters $\mathbf{c}_m$, $1 \le m \le M$, are optimized over the interval $M-1 \le n \le N-1$, which makes the optimum nesting property possible. If we set $N_i = m-1$, each filter $\mathbf{c}_m$ is optimized over the interval $m-1 \le n \le N-1$; that is, it utilizes all the available data. However, in this case, the optimum nesting property $\hat{\mathbf{R}}_m = \hat{\mathbf{R}}_M^{\lceil m \rceil}$ does not hold, and the resulting order-recursive algorithms are slightly more complicated (Kalouptsidis et al. 1984).

The development of order-recursive algorithms for FBLP least-squares filters and predictors with linear phase constraints, for example, $\mathbf{c}_m = \pm\mathbf{J}\mathbf{c}_m^*$, is more complicated, in general. A review of existing algorithms and more references can be found in Theodoridis and Kalouptsidis (1993).

In conclusion, we notice that order-recursive algorithms are more efficient than the $LDL^H$ decomposition–based solutions only if $N$ is much larger than $M$. Furthermore, their numerical properties are inferior to those of the $LDL^H$ decomposition methods; therefore, a bit of extra caution needs to be exercised when order-recursive algorithms are employed.
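The dependence of the shifted correlation matrix on the summation range, expressed by (8.5.17), can be verified numerically for the four windowing methods. The sketch below assumes real data, order $m = M - 1$, and direct evaluation of the sums. The helper names `data_vector`, `corr`, `outer`, and `max_diff`, and the short data record, are hypothetical choices for this illustration.

```python
def data_vector(x, n, m):
    """x_m(n) = [x(n), x(n-1), ..., x(n-m+1)], with zeros outside 0..N-1."""
    return [x[n - k] if 0 <= n - k < len(x) else 0.0 for k in range(m)]

def corr(x, m, n_start, n_stop, shift=0):
    """Sum of x_m(n - shift) x_m^T(n - shift) over n_start <= n <= n_stop."""
    R = [[0.0] * m for _ in range(m)]
    for n in range(n_start, n_stop + 1):
        v = data_vector(x, n - shift, m)
        for i in range(m):
            for j in range(m):
                R[i][j] += v[i] * v[j]
    return R

def outer(v):
    return [[vi * vj for vj in v] for vi in v]

def max_diff(A, B):
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))

x = [0.9, -0.4, 1.3, 0.2, -1.1, 0.7, 0.5, -0.8, 1.0, 0.3, -0.6, 0.1]
N, M = len(x), 4
m = M - 1
ranges = {"none": (M - 1, N - 1), "pre": (0, N - 1),
          "post": (M - 1, N + M - 2), "full": (0, N + M - 2)}

gaps = {}
for name, (Ni, Nf) in ranges.items():
    R = corr(x, m, Ni, Nf)                  # R_m
    Rf = corr(x, m, Ni, Nf, shift=1)        # R_m^f, lower-right block of R_{m+1}
    first = outer(data_vector(x, Ni - 1, m))
    last = outer(data_vector(x, Nf, m))
    rhs = [[R[i][j] + first[i][j] - last[i][j] for j in range(m)]
           for i in range(m)]
    # (mismatch in (8.5.17), distance between R_m^f and R_m)
    gaps[name] = (max_diff(Rf, rhs), max_diff(Rf, R))
    print(name, gaps[name])
```

The first number printed for each method is the mismatch in (8.5.17) and should be at rounding level. The second number shows that $\hat{\mathbf{R}}_m^f$ coincides with $\hat{\mathbf{R}}_m$ only under full windowing, which is the shift invariance that the Toeplitz algorithms of Chapter 7 rely on.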

section 8.5 LS Computations Using the Normal Equations

422 chapter 8 Least-Squares Filtering and Prediction

8.6 LS COMPUTATIONS USING ORTHOGONALIZATION TECHNIQUES

When we use the $\mathbf{LDL}^H$ or Cholesky decomposition for the computation of LSE filters, we first must compute the time-average correlation matrix $\hat{\mathbf{R}} = \mathbf{X}^H\mathbf{X}$ and the time-average cross-correlation vector $\hat{\mathbf{d}} = \mathbf{X}^H\mathbf{y}$ from the data $\mathbf{X}$ and $\mathbf{y}$. Although this approach is widely used in practice, there are certain applications that require methods with better numerical properties. When numerical considerations are a major concern, the orthogonalization techniques, discussed in this section, and the singular value decomposition, discussed in Section 8.7, are the methods of choice for the solution of LS problems.

Orthogonal transformations are linear changes of variables that preserve length. In matrix notation

$$
\mathbf{y} = \mathbf{Q}^H\mathbf{x} \tag{8.6.1}
$$

where $\mathbf{Q}$ is an orthogonal matrix,† that is,

$$
\mathbf{Q}^{-1} = \mathbf{Q}^H \quad\Rightarrow\quad \mathbf{Q}\mathbf{Q}^H = \mathbf{I} \tag{8.6.2}
$$

From this property, we can easily see that

$$
\|\mathbf{y}\|^2 = \mathbf{y}^H\mathbf{y} = \mathbf{x}^H\mathbf{Q}\mathbf{Q}^H\mathbf{x} = \mathbf{x}^H\mathbf{x} = \|\mathbf{x}\|^2 \tag{8.6.3}
$$

that is, multiplying a vector by an orthogonal matrix does not change the length of the vector. As a result, algorithms that use orthogonal transformations do not amplify roundoff errors, resulting in more accurate numerical algorithms. There are two ways to look at the solution of LS problems using orthogonalization techniques:

• Use orthogonal matrices to transform the data matrix $\mathbf{X}$ to a form that simplifies the solution of the normal equations without affecting the time-average correlation matrix $\hat{\mathbf{R}} = \mathbf{X}^H\mathbf{X}$. For any orthogonal matrix $\mathbf{Q}$, we have

$$
\hat{\mathbf{R}} = \mathbf{X}^H\mathbf{X} = \mathbf{X}^H\mathbf{Q}\mathbf{Q}^H\mathbf{X} = (\mathbf{Q}^H\mathbf{X})^H\,\mathbf{Q}^H\mathbf{X} \tag{8.6.4}
$$

  Clearly, we can repeat this process as many times as we wish until the matrix $\mathbf{X}^H\mathbf{Q}_1\mathbf{Q}_2\cdots$ is in a form that simplifies the solution of the LS problem.

• Since orthogonal transformations preserve the length of a vector, multiplying the residual $\mathbf{e} = \mathbf{y} - \mathbf{X}\mathbf{c}$ by an orthogonal matrix does not change the total squared error. Hence, multiplying the residuals by $\mathbf{Q}^H$ gives

$$
\min_{\mathbf{c}} \|\mathbf{e}\| = \min_{\mathbf{c}} \|\mathbf{y} - \mathbf{X}\mathbf{c}\| = \min_{\mathbf{c}} \|\mathbf{Q}^H(\mathbf{y} - \mathbf{X}\mathbf{c})\| \tag{8.6.5}
$$
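The invariance expressed by (8.6.3) and (8.6.5) is easy to check numerically. The following Python sketch applies a 2 × 2 rotation to a residual vector and compares the squared lengths before and after. The rotation angle, the test vector, and the helper names `matvec`, `transpose`, and `sqnorm` are assumptions chosen only for illustration; they do not appear in the text.

```python
import math

def matvec(A, x):
    """Return the matrix-vector product A x for a list-of-rows matrix."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def transpose(A):
    """Return the transpose of a list-of-rows matrix."""
    return [list(col) for col in zip(*A)]

def sqnorm(x):
    """Squared Euclidean length of a real vector."""
    return sum(v * v for v in x)

theta = 0.7  # arbitrary angle, illustration only
Q = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta),  math.cos(theta)]]   # real orthogonal matrix

e = [3.0, -4.0]                    # a "residual" vector with squared length 25
e_rot = matvec(transpose(Q), e)    # Q^H e, as in (8.6.5)
print(sqnorm(e), sqnorm(e_rot))    # the two values agree up to roundoff
```

Any other orthogonal matrix could be substituted for the rotation; the squared length of the transformed residual stays the same up to roundoff.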

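Thus, the goal is to find a matrix $\mathbf{Q}$ that simplifies the solution of the LS problem. Suppose that we have already found an $N \times N$ orthogonal matrix $\mathbf{Q}$ such that

$$
\mathbf{X} = \mathbf{Q}\begin{bmatrix} \mathcal{R} \\ \mathbf{O} \end{bmatrix} \tag{8.6.6}
$$

where, in practice, $\mathbf{Q}$ is constructed to make the $M \times M$ matrix $\mathcal{R}$‡ upper triangular. Using (8.6.5), we have

$$
\|\mathbf{e}\| = \|\mathbf{Q}^H\mathbf{e}\| = \|\mathbf{Q}^H\mathbf{y} - \mathbf{Q}^H\mathbf{X}\mathbf{c}\| \tag{8.6.7}
$$

Using the partitioning

$$
\mathbf{Q} \triangleq [\mathbf{Q}_1 \;\; \mathbf{Q}_2] \tag{8.6.8}
$$

where $\mathbf{Q}_1$ has $M$ columns, we obtain

$$
\mathbf{X} = \mathbf{Q}_1\mathcal{R} \tag{8.6.9}
$$

which is known as the "thin" QR decomposition. Similarly,

$$
\mathbf{z} \triangleq \mathbf{Q}^H\mathbf{y} = \begin{bmatrix} \mathbf{Q}_1^H\mathbf{y} \\ \mathbf{Q}_2^H\mathbf{y} \end{bmatrix} \triangleq \begin{bmatrix} \mathbf{z}_1 \\ \mathbf{z}_2 \end{bmatrix} \tag{8.6.10}
$$

where $\mathbf{z}_1$ has $M$ components and $\mathbf{z}_2$ has $N - M$ components. Substitution of (8.6.9) and (8.6.10) into (8.6.7) gives

$$
\|\mathbf{e}\| = \left\| \begin{bmatrix} \mathcal{R}\mathbf{c} \\ \mathbf{0} \end{bmatrix} - \begin{bmatrix} \mathbf{Q}_1^H\mathbf{y} \\ \mathbf{Q}_2^H\mathbf{y} \end{bmatrix} \right\| = \left\| \begin{bmatrix} \mathcal{R}\mathbf{c} - \mathbf{z}_1 \\ -\mathbf{z}_2 \end{bmatrix} \right\| \tag{8.6.11}
$$

Since the term $\mathbf{z}_2 = \mathbf{Q}_2^H\mathbf{y}$ does not depend on the parameter vector $\mathbf{c}$, the length of $\mathbf{e}$ becomes minimum if we set $\mathbf{c} = \mathbf{c}_{ls}$, that is,

$$
\mathcal{R}\mathbf{c}_{ls} = \mathbf{z}_1 \tag{8.6.12}
$$

and

$$
E_{ls} = \|\mathbf{Q}_2^H\mathbf{y}\|^2 = \|\mathbf{z}_2\|^2 \tag{8.6.13}
$$

† Matrix $\mathbf{Q}$ is an arbitrary unitary matrix and should not be confused with the eigenvector matrix of $\mathbf{R}$.

‡ The symbol $\mathbf{U}$ would be more appropriate for the upper triangular matrix $\mathcal{R}$, which can also be mistaken for the correlation matrix $\mathbf{R}$. However, we chose $\mathcal{R}$ because, otherwise, it would be difficult to use the well-established term QR factorization.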

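where the upper triangular system in (8.6.12) can be solved for $\mathbf{c}_{ls}$ by back substitution. The steps for the solution of the LS problem using the QR decomposition are summarized in Table 8.3.

TABLE 8.3
Solution of the LS problem using the QR decomposition method.

Step   Computation                                                                   Description
1      $\mathbf{X} = \mathbf{Q}\begin{bmatrix}\mathcal{R}\\ \mathbf{0}\end{bmatrix}$             QR decomposition
2      $\mathbf{z} = \mathbf{Q}^H\mathbf{y} = \begin{bmatrix}\mathbf{z}_1\\ \mathbf{z}_2\end{bmatrix}$   Transformation and partitioning of $\mathbf{y}$
3      $\mathcal{R}\mathbf{c}_{ls} = \mathbf{z}_1$                                               Backward substitution → $\mathbf{c}_{ls}$
4      $E_{ls} = \|\mathbf{z}_2\|^2$                                                           Computation of LS error
5      $\mathbf{e}_{ls} = \mathbf{Q}\begin{bmatrix}\mathbf{0}\\ \mathbf{z}_2\end{bmatrix}$       Back transformation of residuals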

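Using the QR decomposition (8.6.6), we have

$$
\hat{\mathbf{R}} = \mathbf{X}^H\mathbf{X} = \mathcal{R}^H\mathcal{R} \tag{8.6.14}
$$

which, in conjunction with the unique Cholesky decomposition $\hat{\mathbf{R}} = \mathcal{L}\mathcal{L}^H$, gives

$$
\mathcal{R} = \mathcal{L}^H \tag{8.6.15}
$$

that is, the QR factorization computes the Cholesky factor $\mathcal{R}$ directly from data matrix $\mathbf{X}$. Also, since $\mathcal{L}^H\mathbf{c}_{ls} = \tilde{\mathbf{k}}$, we have

$$
\tilde{\mathbf{k}} = \mathbf{z}_1 \tag{8.6.16}
$$

which, owing to the Cholesky decomposition, leads to

$$
E_{ls} = E_y - \tilde{\mathbf{k}}^H\tilde{\mathbf{k}} = \|\mathbf{z}_2\|^2 \tag{8.6.17}
$$

because $\|\mathbf{y}\|^2 = \|\mathbf{Q}^H\mathbf{y}\|^2 = \|\mathbf{z}_1\|^2 + \|\mathbf{z}_2\|^2$. If we form the augmented matrix

$$
\bar{\mathbf{X}} = [\mathbf{X} \;\; \mathbf{y}] \tag{8.6.18}
$$

the QR decomposition of $\bar{\mathbf{X}}$ provides the triangular factor

$$
\bar{\mathcal{R}} = \begin{bmatrix} \mathcal{R} & \tilde{\mathbf{k}} \\ \mathbf{0}^H & \tilde{\xi} \end{bmatrix} \tag{8.6.19}
$$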

which is identical to the one obtained from the Cholesky decomposition of $\bar{\mathbf{R}} = \bar{\mathbf{X}}^H\bar{\mathbf{X}}$ with $\mathcal{R} = \mathcal{L}^H$ and $\tilde{\xi}^2 = E_{ls}$ (see Problem 8.14).

EXAMPLE 8.6.1. Solve the LS problem in Example 8.5.1

$$
\mathbf{X} = \begin{bmatrix} 1 & 1 & 1 \\ 2 & 2 & 1 \\ 3 & 1 & 3 \\ 1 & 0 & 1 \end{bmatrix} \qquad \mathbf{y} = \begin{bmatrix} 1 \\ 2 \\ 4 \\ 3 \end{bmatrix}
$$

using the QR decomposition approach.

Solution. Using the Matlab function [Q,R]=qr(X), we obtain

$$
\mathbf{Q} = \begin{bmatrix} -0.2582 & -0.3545 & 0.8006 & 0.4082 \\ -0.5164 & -0.7089 & -0.4804 & 0.0000 \\ -0.7746 & 0.4557 & 0.1601 & -0.4082 \\ -0.2582 & 0.4051 & -0.3203 & 0.8165 \end{bmatrix} \qquad \mathcal{R} = \begin{bmatrix} -3.8730 & -2.0656 & -3.3566 \\ 0 & -1.3166 & 0.7089 \\ 0 & 0 & 0.4804 \\ 0 & 0 & 0 \end{bmatrix}
$$

and following the steps in Table 8.3, we find the LS solution and the LSE to be

$$
\mathbf{c}_{ls} = [3.0 \;\; -1.5 \;\; -1.0]^T \qquad E_{ls} = 1.5
$$

using the sequence of Matlab commands

    z=Q'*y; cls=R(1:3,1:3)\z(1:3); Els=sum(z(4).^2);

In applications that require only the error (or residual) vector $\mathbf{e}_{ls}$, we do not need to solve the triangular system $\mathcal{R}\mathbf{c}_{ls} = \mathbf{z}_1$. Instead, we can compute the error directly by $\mathbf{e}_{ls} = \mathbf{Q}[\mathbf{0}^T \; \mathbf{z}_2^T]^T$ or the Matlab command e=Q*[zeros(1,M) z2']'. This approach is known as direct error (or residual) extraction and plays an important role in LS adaptive filtering algorithms and architectures (see Chapter 10).

It is generally agreed in numerical analysis that orthogonal decomposition methods applied directly to data matrix $\mathbf{X}$ are preferable to the computation and solution of the normal equations whenever numerical stability is important (Hager 1988; Golub and Van Loan 1996). The sensitivity of the solution $\mathbf{c}_{ls}$ to perturbations in the data $\mathbf{X}$ and $\mathbf{y}$ depends on the ratio of the largest to the smallest eigenvalues of $\hat{\mathbf{R}}$ and does not depend on the algorithm used to compute the solution. Furthermore, the numerical accuracy required to compute $\mathcal{L}$ directly from $\mathbf{X}$ is one-half of that required to compute $\mathcal{L}$ from $\hat{\mathbf{R}}$. The "squaring" $\hat{\mathbf{R}} = \mathbf{X}^H\mathbf{X}$ of the data to form the time-average correlation matrix results in a loss of information and should be avoided if the numerical precision is not deemed sufficient. Algorithms that compute $\mathcal{L}$ directly from $\mathbf{X}$ are known as square root methods. However, by paraphrasing Rader (1996), we use the terms amplitude-domain techniques for methods that compute $\mathcal{L}$ directly from $\mathbf{X}$ and power-domain techniques for methods that compute $\mathcal{L}$ indirectly from $\hat{\mathbf{R}} = \mathbf{X}^H\mathbf{X}$. These ideas are illustrated in the following example.

EXAMPLE 8.6.2. Let

$$
\mathbf{X} = \begin{bmatrix} 1 & 1 \\ \epsilon & 0 \\ 0 & \epsilon \end{bmatrix} \qquad \hat{\mathbf{R}} = \mathbf{X}^T\mathbf{X} = \begin{bmatrix} 1 + \epsilon^2 & 1 \\ 1 & 1 + \epsilon^2 \end{bmatrix}
$$

where $\mathbf{X}^T\mathbf{X}$ is clearly positive definite and nonsingular. Let the desired signal be $\mathbf{y} = [2 \;\; \epsilon \;\; \epsilon]^T$ so that $\hat{\mathbf{d}} = [2 + \epsilon^2 \;\; 2 + \epsilon^2]^T$. If $\epsilon$ is such that $1 + \epsilon^2 = 1$, due to limited numerical precision, the matrix $\mathbf{X}^T\mathbf{X}$ becomes singular. If we set $\epsilon = 10^{-8}$, solving the LS equations for $\mathbf{c}_{ls}$ using the Matlab command cls=Rhat\dhat is not possible since $\hat{\mathbf{R}}$ is singular to the working precision of Matlab. However, if the problem is solved using the QR decomposition as shown in Example 8.6.1, we find $\mathbf{c}_{ls} = [1 \;\; 1]^T$. Note that even for slightly larger values of $\epsilon$ the Matlab command cls=Rhat\dhat returns a solution, but one that differs from the true LS solution because $\hat{\mathbf{R}}$ is ill conditioned.
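The loss of information caused by squaring can be reproduced in any IEEE double-precision environment. The Python sketch below is an illustration under stated assumptions, not the Matlab session of the example: it forms $\hat{\mathbf{R}}$ explicitly for $\epsilon = 10^{-8}$ and shows that its determinant evaluates to zero, then solves the same problem in the amplitude domain. For brevity the orthogonalization is done with the modified Gram-Schmidt procedure of Section 8.6.3 applied to $[\mathbf{X}\;\mathbf{y}]$, which is a substitute for the Householder-based qr call of Example 8.6.1.

```python
import math

eps = 1e-8
X = [[1.0, 1.0], [eps, 0.0], [0.0, eps]]
y = [2.0, eps, eps]

# Power-domain route: form R_hat = X^T X explicitly.
cols = [list(c) for c in zip(*X)]
R_hat = [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]
det = R_hat[0][0] * R_hat[1][1] - R_hat[0][1] * R_hat[1][0]
print("1 + eps^2 == 1:", 1.0 + eps ** 2 == 1.0, " det(R_hat) =", det)

# Amplitude-domain route: modified Gram-Schmidt on the augmented matrix [X y].
vecs = cols + [list(y)]
M = len(cols)
R = [[0.0] * (M + 1) for _ in range(M)]
for m in range(M):
    R[m][m] = math.sqrt(sum(v * v for v in vecs[m]))
    q = [v / R[m][m] for v in vecs[m]]
    for i in range(m + 1, M + 1):
        R[m][i] = sum(a * b for a, b in zip(q, vecs[i]))
        vecs[i] = [b - R[m][i] * a for a, b in zip(q, vecs[i])]

# Back substitution on R c = z, where z is the last column of R.
c = [0.0] * M
for i in range(M - 1, -1, -1):
    c[i] = (R[i][M] - sum(R[i][j] * c[j] for j in range(i + 1, M))) / R[i][i]
print("c_ls from MGS:", c)
```

The first route has nothing left to invert, whereas the second never forms $1 + \epsilon^2$ in a position where the $\epsilon^2$ term matters and recovers a solution close to $[1\;\;1]^T$.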

There are two classes of orthogonal decomposition algorithms:

1. Methods that compute the orthogonal matrix $\mathbf{Q}$: Householder reflections and Givens rotations
2. Methods that compute $\mathbf{Q}_1$: classical and modified Gram-Schmidt orthogonalizations

These decompositions are illustrated in Figure 8.10. The cost of the QR decomposition using the Givens rotations is twice the cost of using Householder reflections or the Gram-Schmidt orthogonalization. The standard method for the computation of the QR decomposition and the solution of LS problems employs the Householder transformation. The Givens rotations are preferred for the implementation of adaptive LS filters (see Chapter 10).

FIGURE 8.10 Pictorial illustration of the differences between thin and full QR decompositions. [Thin QR decomposition: $\mathbf{X}$ ($N \times M$) = $\mathbf{Q}_1$ ($N \times M$) times $\mathcal{R}$ ($M \times M$). Full QR decomposition: $\mathbf{X}$ ($N \times M$) = $\mathbf{Q} = [\mathbf{Q}_1 \; \mathbf{Q}_2]$ ($N \times N$) times $\mathcal{R}$ stacked on a zero block ($N \times M$).]

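8.6.1 Householder Reflections

Consider a vector $\mathbf{x}$ and a fixed line $l$ in the plane (see Figure 8.11). If we reflect $\mathbf{x}$ about the line $l$, we obtain a vector $\mathbf{y}$ that is the mirror image of $\mathbf{x}$. Clearly, the vector $\mathbf{x}$ and its reflection $\mathbf{y}$ have the same length. We define a unit vector $\mathbf{w}$ in the direction of $\mathbf{x} - \mathbf{y}$ as

$$
\mathbf{w} \triangleq \frac{1}{\|\mathbf{x} - \mathbf{y}\|}(\mathbf{x} - \mathbf{y}) \tag{8.6.20}
$$

assuming that $\mathbf{x}$ and $\mathbf{y}$ are nonzero vectors. Since the projection of $\mathbf{x}$ on $\mathbf{w}$ is $(\mathbf{w}^H\mathbf{x})\mathbf{w}$, simple inspection of Figure 8.11 gives

$$
\mathbf{y} = \mathbf{x} - 2(\mathbf{w}^H\mathbf{x})\mathbf{w} = \mathbf{x} - 2(\mathbf{w}\mathbf{w}^H)\mathbf{x} = (\mathbf{I} - 2\mathbf{w}\mathbf{w}^H)\mathbf{x} \triangleq \mathbf{H}\mathbf{x}
$$

where

$$
\mathbf{H} \triangleq \mathbf{I} - 2\mathbf{w}\mathbf{w}^H \tag{8.6.21}
$$

In general, any matrix $\mathbf{H}$ of the form (8.6.21) with $\|\mathbf{w}\| = 1$ is known as a Householder reflection or Householder transformation (Householder 1958) and has the following properties

$$
\mathbf{H}^H = \mathbf{H} \qquad \mathbf{H}^H\mathbf{H} = \mathbf{I} \qquad \mathbf{H}^{-1} = \mathbf{H}^H \tag{8.6.22}
$$

that is, the matrix $\mathbf{H}$ is unitary.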

FIGURE 8.11 The Householder reflection vector. [The figure shows the vector $\mathbf{x}$, its mirror image $\mathbf{y}$ about the line $l$, the difference $\mathbf{y} - \mathbf{x}$, the unit vector $\mathbf{w}$, and the projection of $\mathbf{x}$ on $\mathbf{w}$.]

We can build a Householder matrix $\mathbf{H}_k$ that leaves intact the first $k - 1$ components of a given vector $\mathbf{x}$, changes the $k$th component, and annihilates (zeros out) the remaining components, that is,

$$
y_i = (\mathbf{H}\mathbf{x})_i = \begin{cases} x_i & i = 1, 2, \ldots, k-1 \\ y_k & i = k \\ 0 & i = k+1, \ldots, N \end{cases} \tag{8.6.23}
$$

where $y_k$ is to be determined. If we set

$$
y_k = \pm\left(\sum_{i=k}^{N} |x_i|^2\right)^{1/2} e^{j\theta_k} \tag{8.6.24}
$$

where $\theta_k$ is the angle part of $x_k$ (if complex-valued), then both $\mathbf{x}$ and $\mathbf{y}$ have the same length. There are two choices for the sign of $y_k$. Since the computation of $\mathbf{w}$ by (8.6.20) involves subtraction (which can lead to severe numerical problems when two numbers are nearly equal), we choose the negative sign so that $y_k$ and $x_k$ have opposite signs. Hence, $y_k - x_k$ is never the difference between nearly equal numbers. Therefore, using (8.6.20), we find that $\mathbf{w}$ is given by

$$
\mathbf{w} = \frac{1}{\sqrt{2 s_k (s_k + |x_k|)}} \begin{bmatrix} 0 \\ \vdots \\ 0 \\ (|x_k| + s_k)\,e^{j\theta_k} \\ x_{k+1} \\ \vdots \\ x_N \end{bmatrix} \tag{8.6.25}
$$

where

$$
s_k \triangleq \left(\sum_{i=k}^{N} |x_i|^2\right)^{1/2} \tag{8.6.26}
$$

In general, an $N \times M$ matrix $\mathbf{X}$ with $N > M$ can be triangularized with a sequence of $M$ Householder transformations

$$
\mathbf{H}_M \cdots \mathbf{H}_2\mathbf{H}_1\mathbf{X} = \mathcal{R} \tag{8.6.27}
$$

or

$$
\mathbf{X} = \mathbf{Q}\mathcal{R} \tag{8.6.28}
$$

where

$$
\mathbf{Q} \triangleq \mathbf{H}_1\mathbf{H}_2 \cdots \mathbf{H}_M \tag{8.6.29}
$$

Note that for $M = N$ we need only $M - 1$ reflections. We next illustrate by an example how to compute the QR decomposition of a rectangular matrix by using a sequence of Householder transformations.

EXAMPLE 8.6.3. Find the QR decomposition of the data matrix

$$
\mathbf{X} = \begin{bmatrix} 1 & 2 \\ 2 & 3 \\ 6 & 7 \end{bmatrix}
$$

using Householder reflections.

Solution. Using (8.6.25), we compute the vector $\mathbf{w}_1 = [0.7603 \;\; 0.2054 \;\; 0.6162]^T$ and the Householder reflection matrix $\mathbf{H}_1$ for the first column of $\mathbf{X}$. The modified data matrix is

$$
\mathbf{H}_1\mathbf{X} = \begin{bmatrix} -6.4031 & -7.8087 \\ 0 & 0.3501 \\ 0 & -0.9496 \end{bmatrix}
$$

Similarly, we compute the vector $\mathbf{w}_2 = [0 \;\; 0.8203 \;\; -0.5719]^T$ and matrix $\mathbf{H}_2$ for the second column of $\mathbf{H}_1\mathbf{X}$, which results in the desired QR decomposition

$$
\mathbf{H}_2\mathbf{H}_1\mathbf{X} = \mathcal{R} = \begin{bmatrix} -6.4031 & -7.8087 \\ 0 & -1.0121 \\ 0 & 0 \end{bmatrix} \qquad \mathbf{Q} = \mathbf{H}_1\mathbf{H}_2 = \begin{bmatrix} -0.1562 & -0.7711 & -0.6172 \\ -0.3123 & -0.5543 & 0.7715 \\ -0.9370 & 0.3133 & -0.1543 \end{bmatrix}
$$

This result can be verified by using the Matlab function [Q,R]=qr(X), which implements the Householder transformation.
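The computation in Example 8.6.3 can be retraced step by step. The Python sketch below evaluates (8.6.25) and (8.6.26) for real-valued data, where the factor $e^{j\theta_k}$ reduces to the sign of $x_k$, and applies each reflection as $\mathbf{x} - 2(\mathbf{w}^T\mathbf{x})\mathbf{w}$ without forming $\mathbf{H}$. The helper names `householder_vector` and `reflect` are hypothetical, and the code assumes that the pivot column is not already zero below the diagonal.

```python
import math

def householder_vector(x, k):
    """Unit vector w of (8.6.25) for real x; k is the 1-based pivot index."""
    N = len(x)
    s_k = math.sqrt(sum(x[i] ** 2 for i in range(k - 1, N)))      # (8.6.26)
    sign = 1.0 if x[k - 1] >= 0 else -1.0   # e^{j theta_k} reduces to the sign of x_k
    scale = math.sqrt(2.0 * s_k * (s_k + abs(x[k - 1])))
    w = [0.0] * N
    w[k - 1] = sign * (abs(x[k - 1]) + s_k) / scale
    for i in range(k, N):
        w[i] = x[i] / scale
    return w

def reflect(w, x):
    """Apply H = I - 2 w w^T to x without forming H."""
    proj = 2.0 * sum(a * b for a, b in zip(w, x))
    return [b - proj * a for a, b in zip(w, x)]

X = [[1.0, 2.0], [2.0, 3.0], [6.0, 7.0]]
cols = [list(c) for c in zip(*X)]
w1 = householder_vector(cols[0], 1)
cols = [reflect(w1, c) for c in cols]           # columns of H1 X
w2 = householder_vector(cols[1], 2)
cols = [reflect(w2, c) for c in cols]           # columns of H2 H1 X = R
R = [list(r) for r in zip(*cols)]
print([round(v, 4) for v in w1])
print([round(v, 4) for v in w2])
print([[round(v, 4) for v in row] for row in R])
```

Applying each reflection as a rank-one update costs on the order of $N$ operations per column, which is why $\mathbf{H}_k$ is rarely formed as an explicit matrix.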

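8.6.2 The Givens Rotations

The second elementary transformation that does not change the length of a vector is a rotation about an axis (see Figure 8.12). To describe the method of Givens, we assume for simplicity that the vectors are real-valued. The components of the rotated vector $\mathbf{y}$ in terms of the components of the original vector $\mathbf{x}$ are

$$
\begin{aligned}
y_1 &= r\cos(\phi + \theta) = x_1\cos\theta - x_2\sin\theta \\
y_2 &= r\sin(\phi + \theta) = x_1\sin\theta + x_2\cos\theta
\end{aligned}
$$

or in matrix form

$$
\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \triangleq \mathbf{G}(\theta)\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \tag{8.6.30}
$$

FIGURE 8.12 The Givens rotation. [The figure shows the vector $\mathbf{x}$ with components $(x_1, x_2)$ at angle $\phi$, rotated through the angle $\theta$ to the vector $\mathbf{y}$ with components $(y_1, y_2)$.]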


where $\theta$ is the angle of rotation. We can easily show that the rotation matrix $\mathbf{G}(\theta)$ in (8.6.30) is orthogonal and has a determinant $\det\mathbf{G}(\theta) = 1$. Any matrix of the form

$$
\mathbf{G}_{ij}(\theta) \triangleq \begin{bmatrix}
1 & \cdots & 0 & \cdots & 0 & \cdots & 0 \\
\vdots & \ddots & \vdots & & \vdots & & \vdots \\
0 & \cdots & c & \cdots & -s & \cdots & 0 \\
\vdots & & \vdots & \ddots & \vdots & & \vdots \\
0 & \cdots & s & \cdots & c & \cdots & 0 \\
\vdots & & \vdots & & \vdots & \ddots & \vdots \\
0 & \cdots & 0 & \cdots & 0 & \cdots & 1
\end{bmatrix}
\begin{matrix} \\ \\ \leftarrow i \\ \\ \leftarrow j \\ \\ \\ \end{matrix} \tag{8.6.31}
$$

in which the entries $c$ and $\pm s$ occupy rows and columns $i$ and $j$, with

$$
c^2 + s^2 = 1 \tag{8.6.32}
$$

is known as a Givens rotation. When this matrix is applied to a vector $\mathbf{x}$, it rotates the components $x_i$ and $x_j$ through an angle $\theta = \arctan(s/c)$ while leaving all other components intact (Givens 1958). Because of (8.6.32), we can write $c = \cos\theta$ and $s = \sin\theta$ for some angle $\theta$. It can easily be shown that the matrix $\mathbf{G}_{ij}(\theta)$ is orthogonal.

The Givens rotations have two attractive features. First, performing the rotation $\mathbf{y} = \mathbf{G}_{ij}(\theta)\mathbf{x}$ as

$$
\begin{aligned}
y_i &= c\,x_i - s\,x_j \\
y_j &= s\,x_i + c\,x_j \\
y_k &= x_k \qquad k \ne i, j
\end{aligned} \tag{8.6.33}
$$

requires only four multiplications and two additions. Second, we can choose $c$ and $s$ to annihilate the $j$th component of a vector. Indeed, if we set

$$
c = \frac{x_i}{\sqrt{x_i^2 + x_j^2}} \qquad s = -\frac{x_j}{\sqrt{x_i^2 + x_j^2}} \tag{8.6.34}
$$

in (8.6.31), then

$$
y_i = \sqrt{x_i^2 + x_j^2} \qquad \text{and} \qquad y_j = 0 \tag{8.6.35}
$$

Using a sequence of Givens rotations, we can annihilate (zero out) all elements of a matrix $\mathbf{X}$ below the main diagonal to obtain the upper triangular matrix of the QR decomposition. The product of all the Givens rotation matrices provides matrix $\mathbf{Q}$. We stress that the order of rotations cannot be arbitrary because later rotations can destroy zeros introduced earlier. A version of the Givens algorithm without square roots, which is known as the fast Givens QR, is discussed in Golub and Van Loan (1996). We illustrate this procedure with the next example.

EXAMPLE 8.6.4. We use Givens rotations to find the QR decomposition needed for the LS solution. Given the same data matrix $\mathbf{X}$ as in Example 8.6.3

$$
\mathbf{X} = \begin{bmatrix} 1 & 2 \\ 2 & 3 \\ 6 & 7 \end{bmatrix}
$$

we first zero the last element of the first column, that is, element (3, 1), using the Givens rotation matrix $\mathbf{G}_{31}$ with $c = -0.1644$ and $s = 0.9864$. Indeed, using (8.6.34), we have

$$
\mathbf{G}_{31}\mathbf{X} = \begin{bmatrix} -6.0828 & -7.2336 \\ 2 & 3 \\ 0 & 0.8220 \end{bmatrix}
$$

Then the element (2, 1) is eliminated by using the Givens rotation matrix $\mathbf{G}_{21}$ with $c = 0.9500$ and $s = 0.3123$, resulting in

$$
\mathbf{G}_{21}\mathbf{G}_{31}\mathbf{X} = \begin{bmatrix} -6.4031 & -7.8087 \\ 0 & 0.5905 \\ 0 & 0.8220 \end{bmatrix}
$$

Finally, the QR factorization is found after applying the Givens rotation matrix $\mathbf{G}_{32}$ with $c = -0.5834$ and $s = 0.8122$:

$$
\mathcal{R} = \mathbf{G}_{32}\mathbf{G}_{21}\mathbf{G}_{31}\mathbf{X} = \begin{bmatrix} -6.4031 & -7.8087 \\ 0 & -1.0121 \\ 0 & 0 \end{bmatrix} \qquad \mathbf{Q} = \mathbf{G}_{31}^T\mathbf{G}_{21}^T\mathbf{G}_{32}^T = \begin{bmatrix} -0.1562 & -0.7711 & -0.6172 \\ -0.3123 & -0.5543 & 0.7715 \\ -0.9370 & 0.3133 & -0.1543 \end{bmatrix}
$$

which, as expected, agrees with the QR decomposition found in Example 8.6.3.
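The elimination order used in Example 8.6.4 can be sketched in a few lines of Python. The code below is an illustration for real-valued data only; the names `givens` and `givens_qr` are hypothetical. It takes $c$ and $s$ exactly as in (8.6.34), so the diagonal of the computed triangular factor is nonnegative. The example instead uses the opposite signs for $c$ and $s$, so the two results differ by the signs of corresponding rows of $\mathcal{R}$ and columns of $\mathbf{Q}$; both are valid QR factorizations of the same matrix.

```python
import math

def givens(xi, xj):
    """Return (c, s) of (8.6.34), which zero x_j against x_i."""
    r = math.hypot(xi, xj)
    if r == 0.0:
        return 1.0, 0.0
    return xi / r, -xj / r

def givens_qr(X):
    """Triangularize a real N x M matrix with Givens rotations.

    Returns (Q, R) with X = Q R, Q of size N x N and R of size N x M.
    """
    N, M = len(X), len(X[0])
    R = [list(map(float, row)) for row in X]
    Qt = [[float(i == j) for j in range(N)] for i in range(N)]   # accumulates Q^T
    for col in range(M):
        for j in range(N - 1, col, -1):          # zero element (j, col) from the bottom up
            c, s = givens(R[col][col], R[j][col])
            for A in (R, Qt):
                ri, rj = A[col], A[j]
                # Rows col and j are mixed as in (8.6.33).
                A[col] = [c * a - s * b for a, b in zip(ri, rj)]
                A[j] = [s * a + c * b for a, b in zip(ri, rj)]
    Q = [list(r) for r in zip(*Qt)]
    return Q, R

X = [[1, 2], [2, 3], [6, 7]]
Q, R = givens_qr(X)
print([[round(v, 4) for v in row] for row in R])
```

Each rotation touches only two rows, which is the property that makes Givens rotations attractive for the adaptive implementations mentioned earlier in this section.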

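In the case of complex-valued vectors, the components of rotated vector $\mathbf{y}$ in (8.6.30) are given by

$$
\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} \cos\theta & -e^{-j\psi}\sin\theta \\ e^{j\psi}\sin\theta & \cos\theta \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \tag{8.6.36}
$$

where $c \triangleq \cos\theta$ and $s \triangleq e^{j\psi}\sin\theta$. The element $-s$ of the rotation matrix $\mathbf{G}_{ij}(\theta)$ is replaced by $-s^*$, where $c^2 + |s|^2 = 1$ instead of (8.6.32).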

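8.6.3 Gram-Schmidt Orthogonalization

If we are given a set of $M$ linearly independent vectors $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_M$, we can create an orthonormal basis $\mathbf{q}_1, \mathbf{q}_2, \ldots, \mathbf{q}_M$ that spans the same space by using a systematic procedure known as the classical Gram-Schmidt (GS) orthogonalization method (see also Section 7.2.4). The GS method starts by choosing

$$
\mathbf{q}_1 = \frac{\mathbf{x}_1}{\|\mathbf{x}_1\|} \tag{8.6.37}
$$

as the first basis vector. To obtain $\mathbf{q}_2$, we express $\mathbf{x}_2$ as the sum of two components: its projection $(\mathbf{q}_1^H\mathbf{x}_2)\mathbf{q}_1$ onto $\mathbf{q}_1$ and a vector $\mathbf{p}_2$ that is perpendicular to $\mathbf{q}_1$. Hence,

$$
\mathbf{p}_2 = \mathbf{x}_2 - (\mathbf{q}_1^H\mathbf{x}_2)\mathbf{q}_1 \tag{8.6.38}
$$

and $\mathbf{q}_2$ is obtained by normalizing $\mathbf{p}_2$, that is,

$$
\mathbf{q}_2 = \frac{\mathbf{p}_2}{\|\mathbf{p}_2\|} \tag{8.6.39}
$$

The vectors $\mathbf{q}_1$ and $\mathbf{q}_2$ have unit length, are mutually orthogonal, and span the same space as $\mathbf{x}_1$ and $\mathbf{x}_2$. In general, the orthogonal basis vector $\mathbf{q}_j$ is obtained by removing from $\mathbf{x}_j$ its projections onto the already computed vectors $\mathbf{q}_1$ to $\mathbf{q}_{j-1}$. Therefore, we have

$$
\mathbf{p}_j = \mathbf{x}_j - \sum_{i=1}^{j-1} (\mathbf{q}_i^H\mathbf{x}_j)\mathbf{q}_i \qquad \text{and} \qquad \mathbf{q}_j = \frac{\mathbf{p}_j}{\|\mathbf{p}_j\|} \tag{8.6.40}
$$

for all $1 \le j \le M$. The GS algorithm can be used to obtain the "thin" $\mathbf{Q}_1\mathcal{R}$ factorization. Indeed, if we define

$$
r_{ij} \triangleq \mathbf{q}_i^H\mathbf{x}_j \qquad r_{jj} \triangleq \|\mathbf{p}_j\| \tag{8.6.41}
$$

we have

$$
\mathbf{p}_j = r_{jj}\mathbf{q}_j = \mathbf{x}_j - \sum_{i=1}^{j-1} r_{ij}\mathbf{q}_i \tag{8.6.42}
$$

or by solving for $\mathbf{x}_j$

$$
\mathbf{x}_j = \sum_{i=1}^{j} \mathbf{q}_i r_{ij} \qquad j = 1, 2, \ldots, M \tag{8.6.43}
$$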

Using matrix notation, we can express this relation as $\mathbf{X} = \mathbf{Q}_1\mathcal{R}$, which is exactly the thin $\mathbf{Q}_1\mathcal{R}$ factorization in (8.6.9). Major drawbacks of the GS procedure are that it does not produce accurate results and that the resulting basis may not be orthogonal when implemented using finite-precision arithmetic. However, we can achieve better numerical behavior if we reorganize the computations in a form known as the modified Gram-Schmidt (MGS) algorithm (Björck 1967). We start the first step by defining $\mathbf{q}_1$ as before

$$
\mathbf{q}_1 = \frac{\mathbf{x}_1}{\|\mathbf{x}_1\|} \tag{8.6.44}
$$

However, all the remaining vectors $\mathbf{x}_2, \ldots, \mathbf{x}_M$ are modified to be orthogonal to $\mathbf{q}_1$ by subtracting from each vector its projection onto $\mathbf{q}_1$, that is,

$$
\mathbf{x}_i^{(1)} = \mathbf{x}_i - (\mathbf{q}_1^H\mathbf{x}_i)\mathbf{q}_1 \qquad i = 2, \ldots, M \tag{8.6.45}
$$

At the second step, we define the vector

$$
\mathbf{q}_2 = \frac{\mathbf{x}_2^{(1)}}{\|\mathbf{x}_2^{(1)}\|} \tag{8.6.46}
$$

which is already orthogonal to $\mathbf{q}_1$. Then we modify the remaining vectors to make them orthogonal to $\mathbf{q}_2$

$$
\mathbf{x}_i^{(2)} = \mathbf{x}_i^{(1)} - (\mathbf{q}_2^H\mathbf{x}_i^{(1)})\mathbf{q}_2 \qquad i = 3, \ldots, M \tag{8.6.47}
$$

Continuing in a similar manner, we compute $\mathbf{q}_m$ and the updated vectors $\mathbf{x}_i^{(m)}$ by

$$
\mathbf{q}_m = \frac{\mathbf{x}_m^{(m-1)}}{\|\mathbf{x}_m^{(m-1)}\|} \tag{8.6.48}
$$

and

$$
\mathbf{x}_i^{(m)} = \mathbf{x}_i^{(m-1)} - (\mathbf{q}_m^H\mathbf{x}_i^{(m-1)})\mathbf{q}_m \qquad i = m+1, \ldots, M \tag{8.6.49}
$$

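The MGS algorithm involves the steps outlined in Table 8.4 and is implemented by the function Q=mgs(X). The superior numerical properties of the modified algorithm stem from the fact that successive $\mathbf{x}_i^{(m)}$ generated by (8.6.49) decrease in size and that the dot product $\mathbf{q}_m^H\mathbf{x}_i^{(m-1)}$ can be computed more accurately than the dot product $\mathbf{q}_m^H\mathbf{x}_i$.

TABLE 8.4
Orthogonalization of a set of vectors using the modified Gram-Schmidt algorithm.

Modified GS Algorithm
    For m = 1 to M
        $r_{mm} = \|\mathbf{x}_m\|$
        $\mathbf{q}_m = \mathbf{x}_m / r_{mm}$
        For i = m + 1 to M
            $r_{mi} = \mathbf{q}_m^H\mathbf{x}_i$
            $\mathbf{x}_i \leftarrow \mathbf{x}_i - r_{mi}\mathbf{q}_m$
        next i
    next m

EXAMPLE 8.6.5. Consider an LS problem (Dahlquist and Björck 1974) with

$$
\mathbf{X} = \begin{bmatrix} 1 & 1 & 1 \\ \epsilon & 0 & 0 \\ 0 & \epsilon & 0 \\ 0 & 0 & \epsilon \end{bmatrix} \qquad \mathbf{y} = \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \end{bmatrix}
$$

where $\epsilon^2 \ll 1$, that is, $\epsilon^2$ can be neglected compared to 1. We first compute $\mathbf{X}^T\mathbf{X}$ and $\mathbf{X}^T\mathbf{y}$ to determine the normal equations

$$
\begin{bmatrix} 1 + \epsilon^2 & 1 & 1 \\ 1 & 1 + \epsilon^2 & 1 \\ 1 & 1 & 1 + \epsilon^2 \end{bmatrix}\mathbf{c}_{ls} = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}
$$

which provide the exact solution $\mathbf{c}_{ls} = [1 \;\; 1 \;\; 1]^T/(3 + \epsilon^2)$. Numerically, the matrix $\mathbf{X}^T\mathbf{X}$ is singular on any computer with accuracy such that $1 + \epsilon^2$ is rounded to 1. Applying the MGS algorithm to the column vectors of the augmented matrix $[\mathbf{X} \;\; \mathbf{y}]$, and taking into consideration that $1 + \epsilon^2$ is rounded to 1, we obtain

$$
\mathbf{Q} = \begin{bmatrix} 1 & 0 & 0 \\ \epsilon & -\epsilon & -\epsilon/2 \\ 0 & \epsilon & -\epsilon/2 \\ 0 & 0 & \epsilon \end{bmatrix} \qquad \mathbf{e} = -\frac{\epsilon}{3}\begin{bmatrix} 0 \\ 1 \\ 1 \\ 1 \end{bmatrix}
$$

$$
\mathcal{R} = \begin{bmatrix} 1 & 1 & 1 \\ 0 & 1 & \tfrac{1}{2} \\ 0 & 0 & 1 \end{bmatrix} \qquad \mathbf{z} = \begin{bmatrix} 1 \\ \tfrac{1}{2} \\ \tfrac{1}{3} \end{bmatrix}
$$

which corresponds to the thin QR decomposition. Solving $\mathcal{R}\mathbf{c}_{ls} = \mathbf{z}$, we obtain $\mathbf{c}_{ls} = [1 \;\; 1 \;\; 1]^T/3$, which agrees with the exact solution under the assumption that $1 + \epsilon^2$ is rounded to 1.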
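The difference between the classical and modified procedures shows up clearly on the data matrix of Example 8.6.5. The Python sketch below is a minimal illustration, assuming real-valued, linearly independent columns; the names `gram_schmidt` and `max_off_orthogonality` are hypothetical and do not refer to the mgs function mentioned in the text. It orthonormalizes the three columns with both variants for $\epsilon = 10^{-8}$ and reports the largest inner product between distinct basis vectors, which would be zero in exact arithmetic.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def gram_schmidt(columns, modified=True):
    """Orthonormalize a list of column vectors; returns the list of q vectors.

    modified=True follows Table 8.4 (projections use the updated vectors);
    modified=False is classical GS (projections use the original vectors).
    """
    work = [list(c) for c in columns]
    qs = []
    for m in range(len(work)):
        if not modified:
            # Classical GS: subtract projections of the original column in one go.
            coeffs = [dot(q, columns[m]) for q in qs]
            work[m] = [v - sum(r * q[k] for r, q in zip(coeffs, qs))
                       for k, v in enumerate(columns[m])]
        norm = math.sqrt(dot(work[m], work[m]))
        q = [v / norm for v in work[m]]
        qs.append(q)
        if modified:
            # Modified GS: immediately remove q from every remaining vector.
            for i in range(m + 1, len(work)):
                r = dot(q, work[i])
                work[i] = [v - r * a for v, a in zip(work[i], q)]
    return qs

def max_off_orthogonality(qs):
    """Largest |q_i^T q_j| over i != j; zero for a perfectly orthogonal set."""
    return max(abs(dot(qs[i], qs[j]))
               for i in range(len(qs)) for j in range(i + 1, len(qs)))

eps = 1e-8
cols = [[1.0, eps, 0.0, 0.0], [1.0, 0.0, eps, 0.0], [1.0, 0.0, 0.0, eps]]
print("classical GS:", max_off_orthogonality(gram_schmidt(cols, modified=False)))
print("modified GS: ", max_off_orthogonality(gram_schmidt(cols, modified=True)))
```

In double precision the classical variant leaves two basis vectors far from orthogonal, whereas the modified variant keeps the departure from orthogonality on the order of $\epsilon$.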

8.7 LS COMPUTATIONS USING THE SINGULAR VALUE DECOMPOSITION

The singular value decomposition (SVD) plays a prominent role in the theoretical analysis and practical solution of LS problems because (1) it provides a unified framework for the solution of overdetermined and underdetermined LS problems, whether full rank or rank-deficient, and (2) it is the best numerical method to solve LS problems in practice. In this section, we discuss the existence and fundamental properties of the SVD, show how to use it for solving the LS problem, and apply it to determine the numerical rank of a matrix. More details are given in Golub and Van Loan (1996), Leon (1990), Stewart (1973), Watkins (1991), and Klema and Laub (1980).

8.7.1 Singular Value Decomposition

The eigenvalue decomposition reduces a Hermitian matrix to a diagonal matrix by premultiplying and postmultiplying it by a single unitary matrix. The singular value decomposition, introduced in the next theorem, reduces a general matrix to a diagonal one by premultiplying and postmultiplying it by two different unitary matrices.

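THEOREM 8.2. Any real $N \times M$ matrix $\mathbf{X}$ with rank $r$ (recall that $r$ is defined as the number of linearly independent columns of a matrix) can be written as

$$
\mathbf{X} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^H \tag{8.7.1}
$$

where $\mathbf{U}$ is an $N \times N$ unitary matrix, $\mathbf{V}$ is an $M \times M$ unitary matrix, and $\boldsymbol{\Sigma}$ is an $N \times M$ matrix with $\langle\boldsymbol{\Sigma}\rangle_{ij} = 0$, $i \ne j$, and $\langle\boldsymbol{\Sigma}\rangle_{ii} = \sigma_i > 0$, $i = 1, 2, \ldots, r$. The numbers $\sigma_i$ are known as the singular values of $\mathbf{X}$ and are usually arranged in decreasing order as $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r > 0$.

Proof. We follow the derivation given in Stewart (1973). Since the matrix $\mathbf{X}^H\mathbf{X}$ is positive semidefinite, it has nonnegative eigenvalues $\sigma_1^2, \sigma_2^2, \ldots, \sigma_M^2$ such that $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r > 0 = \sigma_{r+1} = \cdots = \sigma_M$ for $0 \le r \le M$. Let $\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_M$ be the eigenvectors corresponding to the eigenvalues $\sigma_1^2, \sigma_2^2, \ldots, \sigma_M^2$. Consider the partitioning $\mathbf{V} = [\mathbf{V}_1 \;\; \mathbf{V}_2]$, where $\mathbf{V}_1$ consists of the first $r$ columns of $\mathbf{V}$. If $\boldsymbol{\Sigma}_r = \mathrm{diag}\{\sigma_1, \sigma_2, \ldots, \sigma_r\}$, then we obtain $\mathbf{V}_1^H\mathbf{X}^H\mathbf{X}\mathbf{V}_1 = \boldsymbol{\Sigma}_r^2$ and

$$
\boldsymbol{\Sigma}_r^{-1}\mathbf{V}_1^H\mathbf{X}^H\mathbf{X}\mathbf{V}_1\boldsymbol{\Sigma}_r^{-1} = \mathbf{I} \tag{8.7.2}
$$

Since $\mathbf{V}_2^H\mathbf{X}^H\mathbf{X}\mathbf{V}_2 = \mathbf{0}$, we have

$$
\mathbf{X}\mathbf{V}_2 = \mathbf{0} \tag{8.7.3}
$$

If we define

$$
\mathbf{U}_1 \triangleq \mathbf{X}\mathbf{V}_1\boldsymbol{\Sigma}_r^{-1} \tag{8.7.4}
$$

then (8.7.2) gives $\mathbf{U}_1^H\mathbf{U}_1 = \mathbf{I}$; that is, the columns of $\mathbf{U}_1$ are orthonormal. A unitary matrix $\mathbf{U} \triangleq [\mathbf{U}_1 \;\; \mathbf{U}_2]$ is found by properly choosing the components of $\mathbf{U}_2$, that is, $\mathbf{U}_2^H\mathbf{U}_1 = \mathbf{0}$ and $\mathbf{U}_2^H\mathbf{U}_2 = \mathbf{I}$. Then

$$
\mathbf{U}^H\mathbf{X}\mathbf{V} = \begin{bmatrix} \mathbf{U}_1^H \\ \mathbf{U}_2^H \end{bmatrix}\mathbf{X}\,[\mathbf{V}_1 \;\; \mathbf{V}_2] = \begin{bmatrix} \mathbf{U}_1^H\mathbf{X}\mathbf{V}_1 & \mathbf{U}_1^H(\mathbf{X}\mathbf{V}_2) \\ \mathbf{U}_2^H\mathbf{X}\mathbf{V}_1 & \mathbf{U}_2^H(\mathbf{X}\mathbf{V}_2) \end{bmatrix} = \begin{bmatrix} \boldsymbol{\Sigma}_r & \mathbf{0} \\ \mathbf{0} & \mathbf{0} \end{bmatrix} = \boldsymbol{\Sigma} \tag{8.7.5}
$$

because of (8.7.2), (8.7.3), and $\mathbf{U}_2^H\mathbf{X}\mathbf{V}_1 = (\mathbf{U}_2^H\mathbf{U}_1)\boldsymbol{\Sigma}_r = \mathbf{0}$.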

The SVD of a matrix, which is illustrated in Figure 8.13, provides a wealth of information about the structure of the matrix. Figure 8.14 provides a geometric interpretation of the SVD of a 2 × 2 matrix $\mathbf{X}$ (see Problem 8.23 for details).

FIGURE 8.13 Pictorial representation of the singular value decomposition of a matrix. [The figure shows the product $\mathbf{U}^T$ ($N \times N$ orthogonal matrix) × $\mathbf{X}$ ($N \times M$ data matrix) × $\mathbf{V}$ ($M \times M$ orthogonal matrix) = $\boldsymbol{\Sigma}$ ($N \times M$), whose only nonzero entries are the $r$ singular values on the diagonal.]

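Properties and interpretations. We next provide a summary of interpretations and properties whose proofs are given in the references and the problems.

1. Postmultiplying (8.7.1) by $\mathbf{V}$ and equating columns, we obtain

$$
\mathbf{X}\mathbf{v}_i = \begin{cases} \sigma_i\mathbf{u}_i & i = 1, 2, \ldots, r \\ \mathbf{0} & i = r+1, \ldots, M \end{cases} \tag{8.7.6}
$$

   that is, $\mathbf{v}_i$ (columns of $\mathbf{V}$) are the right singular vectors of $\mathbf{X}$.

2. Premultiplying (8.7.1) by $\mathbf{U}^H$ and equating rows, we obtain

$$
\mathbf{u}_i^H\mathbf{X} = \begin{cases} \sigma_i\mathbf{v}_i^H & i = 1, 2, \ldots, r \\ \mathbf{0} & i = r+1, \ldots, N \end{cases} \tag{8.7.7}
$$

   that is, $\mathbf{u}_i$ (columns of $\mathbf{U}$) are the left singular vectors of $\mathbf{X}$.

3. Let $\lambda_i(\cdot)$ and $\sigma_i(\cdot)$ denote the $i$th largest eigenvalue and singular value of a given matrix, respectively. The vectors $\mathbf{v}_1, \ldots, \mathbf{v}_M$ are eigenvectors of $\mathbf{X}^H\mathbf{X}$; $\mathbf{u}_1, \ldots, \mathbf{u}_N$ are eigenvectors of $\mathbf{X}\mathbf{X}^H$, for which the squares of the singular values $\sigma_1^2, \ldots, \sigma_r^2$ of $\mathbf{X}$ are the first $r$ nonzero eigenvalues of $\mathbf{X}^H\mathbf{X}$ and $\mathbf{X}\mathbf{X}^H$, that is,

$$
\lambda_i(\mathbf{X}^H\mathbf{X}) = \lambda_i(\mathbf{X}\mathbf{X}^H) = \sigma_i^2(\mathbf{X}) \tag{8.7.8}
$$

4. In the product $\mathbf{X} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^H$, the last $N - r$ columns of $\mathbf{U}$ and $M - r$ columns of $\mathbf{V}$ are superfluous because they interact only with blocks of zeros in $\boldsymbol{\Sigma}$. This leads to the following thin SVD representation of $\mathbf{X}$

$$
\mathbf{X} = \mathbf{U}_r\boldsymbol{\Sigma}_r\mathbf{V}_r^H \tag{8.7.9}
$$

   where $\mathbf{U}_r$ and $\mathbf{V}_r$ consist of the first $r$ columns of $\mathbf{U}$ and $\mathbf{V}$, respectively, and $\boldsymbol{\Sigma}_r = \mathrm{diag}\{\sigma_1, \sigma_2, \ldots, \sigma_r\}$.

5. The SVD can be expressed as

$$
\mathbf{X} = \sum_{i=1}^{r} \sigma_i\mathbf{u}_i\mathbf{v}_i^H \tag{8.7.10}
$$

   that is, as a sum of cross products weighted by the singular values.

6. If the matrix $\mathbf{X}$ has rank $r$, then:
   a. The first $r$ columns of $\mathbf{U}$ form an orthonormal basis for the space spanned by the columns of $\mathbf{X}$ (range space or column space of $\mathbf{X}$).
   b. The first $r$ columns of $\mathbf{V}$ form an orthonormal basis for the space spanned by the rows of $\mathbf{X}$ (range space of $\mathbf{X}^H$ or row space of $\mathbf{X}$).
   c. The last $M - r$ columns of $\mathbf{V}$ form an orthonormal basis for the space of vectors orthogonal to the rows of $\mathbf{X}$ (null space of $\mathbf{X}$).
   d. The last $N - r$ columns of $\mathbf{U}$ form an orthonormal basis for the null space of $\mathbf{X}^H$.

7. The Euclidean norm of $\mathbf{X}$ is

$$
\|\mathbf{X}\| = \sigma_1 \tag{8.7.11}
$$

8. The Frobenius norm of $\mathbf{X}$, that is, the square root of the sum of the squares of its elements, is

$$
\|\mathbf{X}\|_F \triangleq \sqrt{\sum_{i=1}^{N}\sum_{j=1}^{M} |x_{ij}|^2} = \sqrt{\sigma_1^2 + \sigma_2^2 + \cdots + \sigma_r^2} \tag{8.7.12}
$$

FIGURE 8.14 The SVD of a 2 × 2 matrix maps the unit circle into an ellipse whose semimajor and semiminor axes are equal to the singular values of the matrix. [The figure decomposes $\mathbf{X} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^H$ into a rotation ($\mathbf{V}^H = \mathbf{V}^{-1}$), a stretching by $\boldsymbol{\Sigma} = \mathrm{diag}\{\sigma_1, \sigma_2\}$, and a second rotation ($\mathbf{U}^H = \mathbf{U}^{-1}$).]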


9. The difference between the transformations implied by eigenvalue and SVD transformations can be summarized as follows:

   Eigenvalue decomposition $\mathbf{R} = \mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^H$:

$$
\mathbf{q}_i \;\xrightarrow{\;\mathbf{R}\;}\; \lambda_i\mathbf{q}_i \qquad i = 1, 2, \ldots, M
$$

   SVD $\mathbf{X} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^H$:

$$
\mathbf{v}_i \;\xrightarrow{\;\mathbf{X}\;}\; \begin{cases} \sigma_i\mathbf{u}_i & i = 1, \ldots, r \\ \mathbf{0} & i = r+1, \ldots, M \end{cases}
\qquad
\mathbf{u}_i \;\xrightarrow{\;\mathbf{X}^H\;}\; \begin{cases} \sigma_i\mathbf{v}_i & i = 1, \ldots, r \\ \mathbf{0} & i = r+1, \ldots, N \end{cases}
$$

This illustrates the need for left and right singular values and vectors. We can compute the SVD of a matrix $\mathbf{X}$ by forming the matrices $\mathbf{X}^H\mathbf{X}$ and $\mathbf{X}\mathbf{X}^H$ and computing their eigenvalues and eigenvectors (see Problem 8.21). However, we should avoid this approach because the "squaring" of $\mathbf{X}$ to form these correlation matrices results in a loss of information (see Example 8.6.2). In practice, the SVD is computed by using the algorithm of Golub and Reinsch (1970) or the R-SVD algorithm described in Chan (1982), which for $N \gg M$ is twice as fast. The state of the art in SVD research is provided in Golub and Van Loan (1996), whereas reliable numerical algorithms and code are given in LAPACK, LINPACK, and Numerical Recipes in C (Press et al. 1992).
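The warning against computing singular values from $\mathbf{X}^H\mathbf{X}$ can be illustrated with the matrix of Example 8.6.2. The Python sketch below is a small illustration under stated assumptions: it uses the closed-form eigenvalues of a symmetric 2 × 2 matrix (a shortcut that works only for $M = 2$) and compares the resulting singular values with the exact ones, which for this particular $\mathbf{X}$ are $\sqrt{2 + \epsilon^2}$ and $\epsilon$.

```python
import math

eps = 1e-8
X = [[1.0, 1.0], [eps, 0.0], [0.0, eps]]

# "Squared" route: eigenvalues of the 2 x 2 matrix X^T X in closed form.
cols = [list(c) for c in zip(*X)]
a = sum(v * v for v in cols[0])
d = sum(v * v for v in cols[1])
b = sum(u * v for u, v in zip(cols[0], cols[1]))
mean, radius = 0.5 * (a + d), math.hypot(0.5 * (a - d), b)
lam_max, lam_min = mean + radius, mean - radius
sigma_from_eig = [math.sqrt(max(lam_max, 0.0)), math.sqrt(max(lam_min, 0.0))]

# For this X the exact singular values are sqrt(2 + eps^2) and eps.
sigma_exact = [math.sqrt(2.0 + eps ** 2), eps]
print(sigma_from_eig, sigma_exact)
```

In double precision the entries $1 + \epsilon^2$ round to 1, so the smaller singular value obtained through $\mathbf{X}^T\mathbf{X}$ collapses to zero even though its true value $\epsilon$ is representable without difficulty.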

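8.7.2 Solution of the LS Problem

So far, we have discussed the solution of the overdetermined ($N > M$) LS problem with full-rank ($r = M$) data matrices using the normal equations and the QR decomposition techniques. We next show how the SVD can be used to solve the LS problem without making any assumptions about the dimensions $N$ and $M$ or the rank $r$ of data matrix $\mathbf{X}$. Suppose that we know the exact SVD of data matrix $\mathbf{X} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^H$. Since $\mathbf{U}$ is orthogonal,

$$
\|\mathbf{y} - \mathbf{X}\mathbf{c}\| = \|\mathbf{y} - \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^H\mathbf{c}\| = \|\mathbf{U}^H\mathbf{y} - \boldsymbol{\Sigma}\mathbf{V}^H\mathbf{c}\|
$$

If we define

$$
\mathbf{y}' \triangleq \mathbf{U}^H\mathbf{y} \qquad \mathbf{c}' \triangleq \mathbf{V}^H\mathbf{c} \tag{8.7.13}
$$

we obtain the LSE

$$
\|\mathbf{y} - \mathbf{X}\mathbf{c}\|^2 = \|\mathbf{y}' - \boldsymbol{\Sigma}\mathbf{c}'\|^2 = \sum_{i=1}^{r} |y_i' - \sigma_i c_i'|^2 + \sum_{i=r+1}^{N} |y_i'|^2 \tag{8.7.14}
$$

which is minimized if and only if $c_i' = y_i'/\sigma_i$ for $i = 1, 2, \ldots, r$. We notice that when $r < M$, the terms $c_{r+1}', \ldots, c_M'$ do not appear in (8.7.14). Therefore, they have no effect on the residual and can be chosen arbitrarily. To illustrate this point, consider the geometric interpretation in Figure 8.5. There is only one linear combination of the linearly independent vectors $\tilde{\mathbf{x}}_1$ and $\tilde{\mathbf{x}}_2$ that determines the optimum LS estimate. If the data matrix has one more column $\tilde{\mathbf{x}}_3$ that lies in the same plane, then there are an infinite number of linear combinations $c_1\tilde{\mathbf{x}}_1 + c_2\tilde{\mathbf{x}}_2 + c_3\tilde{\mathbf{x}}_3$ that satisfy the LSE criterion. To obtain a unique LS solution from all solutions $\mathbf{c}$ that minimize $\|\mathbf{y} - \mathbf{X}\mathbf{c}\|$, we choose the one with the minimum length $\|\mathbf{c}\|$. Since the matrix $\mathbf{V}$ is orthogonal, we have $\|\mathbf{c}'\| = \|\mathbf{V}^H\mathbf{c}\| = \|\mathbf{c}\|$, and the norm $\|\mathbf{c}\|$ is minimized when the norm $\|\mathbf{c}'\|$ is minimized. Hence, choosing $c_{r+1}' = \cdots = c_M' = 0$ provides the minimum-norm solution to the LS problem. In summary, the unique, minimum-norm solution to the LS problem is

$$
\mathbf{c}_{ls} = \sum_{i=1}^{r} \frac{\mathbf{u}_i^H\mathbf{y}}{\sigma_i}\,\mathbf{v}_i \tag{8.7.15}
$$

where

$$
c_i' = \begin{cases} \dfrac{y_i'}{\sigma_i} = \dfrac{\mathbf{u}_i^H\mathbf{y}}{\sigma_i} & i = 1, \ldots, r \\[2mm] 0 & i = r+1, \ldots, M \end{cases} \tag{8.7.16}
$$

and

$$
E_{ls} = \|\mathbf{y} - \mathbf{X}\mathbf{c}_{ls}\|^2 = \sum_{i=r+1}^{N} |y_i'|^2 = \sum_{i=r+1}^{N} |\mathbf{u}_i^H\mathbf{y}|^2 \tag{8.7.17}
$$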

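is the corresponding LS error. We next express the unique minimum-norm solution to the LS problem in terms of the pseudoinverse of data matrix $\mathbf{X}$ using the SVD. To this end, we note that (8.7.16) can be written in matrix form

$$
\mathbf{c}' = \boldsymbol{\Sigma}^{+}\mathbf{y}' \tag{8.7.18}
$$

where

$$
\boldsymbol{\Sigma}^{+} \triangleq \begin{bmatrix} \boldsymbol{\Sigma}_r^{-1} & \mathbf{0} \\ \mathbf{0} & \mathbf{0} \end{bmatrix} \tag{8.7.19}
$$

is an $M \times N$ matrix with $\boldsymbol{\Sigma}_r^{-1} = \mathrm{diag}\{1/\sigma_1, \ldots, 1/\sigma_r\}$. Therefore, using (8.7.15) and (8.7.19), we obtain

$$
\mathbf{c}_{ls} = \mathbf{V}\boldsymbol{\Sigma}^{+}\mathbf{U}^H\mathbf{y} = \mathbf{X}^{+}\mathbf{y} \tag{8.7.20}
$$

where

$$
\mathbf{X}^{+} \triangleq \mathbf{V}\boldsymbol{\Sigma}^{+}\mathbf{U}^H = \sum_{i=1}^{r} \frac{1}{\sigma_i}\mathbf{v}_i\mathbf{u}_i^H \tag{8.7.21}
$$

is the pseudoinverse of matrix $\mathbf{X}$. For full-rank matrices, the pseudoinverse is defined as $\mathbf{X}^{+} = (\mathbf{X}^H\mathbf{X})^{-1}\mathbf{X}^H$ (Golub and Van Loan 1996), so that using (8.7.21) leads to the LS solution in (8.2.21). If $N = M = \mathrm{rank}(\mathbf{X})$, then $\mathbf{X}^{+} = \mathbf{X}^{-1}$. Therefore, (8.7.21) holds for any rectangular or square matrix that is either full rank or rank-deficient. Formally, $\mathbf{X}^{+}$ can be defined independently of the LS problem as the unique $M \times N$ matrix $\mathbf{A}$ that satisfies the four Moore-Penrose conditions

$$
\begin{aligned}
\mathbf{X}\mathbf{A}\mathbf{X} &= \mathbf{X} & (\mathbf{X}\mathbf{A})^H &= \mathbf{X}\mathbf{A} \\
\mathbf{A}\mathbf{X}\mathbf{A} &= \mathbf{A} & (\mathbf{A}\mathbf{X})^H &= \mathbf{A}\mathbf{X}
\end{aligned} \tag{8.7.22}
$$

which implies that $\mathbf{X}\mathbf{X}^{+}$ and $\mathbf{X}^{+}\mathbf{X}$ are orthogonal projections onto the range space of $\mathbf{X}$ and $\mathbf{X}^H$ (see Problem 8.25). However, we stress that the pseudoinverse is, for the most part, a theoretical tool, and there is seldom any reason for its use in practice.

In summary, the computation of the LS estimator using the SVD involves the steps shown in Table 8.5. The vector $\mathbf{c}_{ls}$ is unique and satisfies two requirements: (1) It minimizes the sum of the squared errors, and (2) it has the smallest Euclidean norm. The following example illustrates the use of the SVD for the computation of the LS estimator.

EXAMPLE 8.7.1.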

Solve the LS problem with the following data matrix and desired response signal:

$$
\mathbf{X} = \begin{bmatrix} 1 & 1 & 1 \\ 2 & 2 & 1 \\ 3 & 1 & 3 \\ 1 & 0 & 1 \end{bmatrix} \qquad \mathbf{y} = \begin{bmatrix} 1 \\ 2 \\ 4 \\ 3 \end{bmatrix}
$$

Solution. We start by computing the SVD of $\mathbf{X} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T$ by using the Matlab function [U,S,V]=svd(X). This gives

$$
\mathbf{U} = \begin{bmatrix} 0.3041 & 0.2170 & 0.8329 & 0.4082 \\ 0.4983 & 0.7771 & -0.3844 & 0.0000 \\ 0.7768 & -0.4778 & 0.0409 & -0.4082 \\ 0.2363 & -0.3474 & -0.3960 & 0.8165 \end{bmatrix}
$$

$$
\boldsymbol{\Sigma} = \begin{bmatrix} 5.5338 & 0 & 0 \\ 0 & 1.5139 & 0 \\ 0 & 0 & 0.2924 \\ 0 & 0 & 0 \end{bmatrix} \qquad \mathbf{V} = \begin{bmatrix} 0.6989 & 0.3754 & 0.6088 \\ -0.0063 & 0.8544 & -0.5196 \\ -0.7152 & 0.3593 & 0.5994 \end{bmatrix}^T
$$

which implies that the data matrix has rank $r = 3$. Next we compute

$$
\mathbf{y}' = \mathbf{U}^T\mathbf{y} = \begin{bmatrix} 5.1167 \\ -1.1821 \\ -0.9602 \\ 1.2247 \end{bmatrix} \qquad \mathbf{c}_{ls} = \begin{bmatrix} 3.0 \\ -1.5 \\ -1.0 \end{bmatrix} \qquad E_{ls} = 1.5
$$

by the Matlab commands

    yp=U'*y; cls=V*(yp(1:r)./diag(S)); Els=sum(yp(r+1:N).^2);

which implement steps 3, 4, and 5 in Table 8.5. The LS solution also can be obtained from cls=X\y. If we set $x_{23} = 2$, the first and last columns of $\mathbf{X}$ become linearly dependent, the SVD has only two nonzero singular values, and the svd function warns that $\mathbf{X}$ is rank-deficient.

TABLE 8.5
Solution of the LS problem using the SVD method.

Step   Description
1      Compute the SVD $\mathbf{X} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^H$
2      Determine the rank $r$ of $\mathbf{X}$
3      Compute $y_i' = \mathbf{u}_i^H\mathbf{y}$, $i = 1, \ldots, N$
4      Compute $\mathbf{c}_{ls} = \sum_{i=1}^{r} (y_i'/\sigma_i)\,\mathbf{v}_i$
5      Compute $E_{ls} = \sum_{i=r+1}^{N} |y_i'|^2$
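The steps of Table 8.5 can be retraced without a library SVD routine. The Python sketch below is an illustration only, restricted to real-valued data. Instead of the Golub-Reinsch or R-SVD algorithms cited in Section 8.7.1, it uses a one-sided Jacobi (Hestenes) iteration, chosen here purely because it fits in a few lines: pairs of columns of $\mathbf{X}$ are rotated until they are mutually orthogonal, at which point the columns equal $\sigma_i\mathbf{u}_i$ and the accumulated rotations form $\mathbf{V}$. The function name `jacobi_svd_ls`, the tolerances, and the sweep limit are assumptions. Because only the first $r$ left singular vectors are available, the LS error is evaluated from the residual, which equals (8.7.17).

```python
import math

def jacobi_svd_ls(X, y, rel_tol=1e-10, sweeps=60):
    """Minimum-norm LS solution via a one-sided Jacobi SVD (real data).

    Returns (c_ls, E_ls, sigma, r), where sigma lists the singular values in
    decreasing order and r is the numerical rank.
    """
    N, M = len(X), len(X[0])
    A = [list(map(float, row)) for row in X]                      # becomes U * Sigma
    V = [[float(i == j) for j in range(M)] for i in range(M)]
    for _ in range(sweeps):
        rotated = False
        for p in range(M - 1):
            for q in range(p + 1, M):
                alpha = sum(A[k][p] ** 2 for k in range(N))
                beta = sum(A[k][q] ** 2 for k in range(N))
                gamma = sum(A[k][p] * A[k][q] for k in range(N))
                if abs(gamma) <= 1e-14 * math.sqrt(alpha * beta):
                    continue                                      # columns already orthogonal
                rotated = True
                zeta = (beta - alpha) / (2.0 * gamma)
                t = math.copysign(1.0, zeta) / (abs(zeta) + math.sqrt(1.0 + zeta * zeta))
                c = 1.0 / math.sqrt(1.0 + t * t)
                s = c * t
                for B in (A, V):                                  # rotate columns p and q
                    for row in B:
                        bp, bq = row[p], row[q]
                        row[p], row[q] = c * bp - s * bq, s * bp + c * bq
        if not rotated:
            break
    # Column i of A is sigma_i u_i and column i of V is v_i.
    sig = [math.sqrt(sum(A[k][i] ** 2 for k in range(N))) for i in range(M)]
    keep = [i for i in range(M) if sig[i] > rel_tol * max(sig)]
    c_ls = [0.0] * M
    y_fit = [0.0] * N
    for i in keep:
        yp = sum(A[k][i] * y[k] for k in range(N)) / sig[i]       # y'_i = u_i^T y
        for j in range(M):
            c_ls[j] += (yp / sig[i]) * V[j][i]                    # (8.7.15)
        for k in range(N):
            y_fit[k] += yp * A[k][i] / sig[i]
    E_ls = sum((yk - fk) ** 2 for yk, fk in zip(y, y_fit))        # equals (8.7.17)
    return c_ls, E_ls, sorted(sig, reverse=True), len(keep)

X = [[1, 1, 1], [2, 2, 1], [3, 1, 3], [1, 0, 1]]
y = [1, 2, 4, 3]
c_ls, E_ls, sigma, r = jacobi_svd_ls(X, y)
print(r, [round(s, 4) for s in sigma], [round(c, 4) for c in c_ls], round(E_ls, 4))

X[1][2] = 2            # columns 1 and 3 now coincide, so X is rank-deficient
c_mn, E_mn, sigma_mn, r_mn = jacobi_svd_ls(X, y)
print(r_mn, [round(c, 4) for c in c_mn])
```

For the original data the sketch should reproduce the singular values, $\mathbf{c}_{ls}$, and $E_{ls}$ of Example 8.7.1 up to rounding. After the change $x_{23} = 2$ it reports rank 2 and returns the minimum-norm solution, whose first and third components are equal because the corresponding columns of $\mathbf{X}$ coincide.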

Table 8.6 shows the numerical operations required by the various LS solution methods (Golub and Van Loan 1996). For full-rank (nonsingular) data matrices, all other methods are simpler than the SVD. However, these methods are inaccurate when $\mathbf{X}$ is rank-deficient (nearly singular). In such cases, the SVD reveals the near singularity of the data matrix and is the method of choice because it provides a reliable computation of the numerical rank (see the next section).

TABLE 8.6
Computational complexity of LS computation algorithms.

LS Algorithm                      FLOPS (floating point operations)
Normal equations                  $NM^2 + M^3/3$
Householder orthogonalization     $2NM^2 - 2M^3/3$
Givens orthogonalization          $3NM^2 - M^3$
Modified Gram-Schmidt             $2NM^2$
Golub-Reinsch SVD                 $4NM^2 + 8M^3$
R-SVD                             $2NM^2 + 11M^3$

Normal equations versus QR decomposition. The squaring of $\mathbf{X}$ to form the time-average correlation matrix $\hat{\mathbf{R}} = \mathbf{X}^H\mathbf{X}$ results in a loss of information and should be avoided. Since $\|\mathbf{X}^{-1}\| = 1/\sigma_{\min}$, the condition number of $\mathbf{X}$ is

$$
\kappa(\mathbf{X}) \triangleq \|\mathbf{X}\|\,\|\mathbf{X}^{-1}\| = \frac{\sigma_{\max}}{\sigma_{\min}} \tag{8.7.23}
$$

which is analogous to the eigenvalue ratio for square Hermitian matrices. Hence,

$$
\kappa(\mathbf{X}^H\mathbf{X}) = \frac{\lambda_{\max}}{\lambda_{\min}} = \frac{\sigma_{\max}^2}{\sigma_{\min}^2} = \kappa^2(\mathbf{X}) \tag{8.7.24}
$$

which shows that squaring a matrix can only worsen its condition. The study of the sensitivity of the LS problem is complicated. However, the following conclusions (Golub and Van Loan 1996; Van Loan 1997) can be drawn:

1. The sensitivity of the LS solution is roughly proportional to the quantity $\kappa(\mathbf{X}) + \sqrt{E_{ls}}\,\kappa^2(\mathbf{X})$. Hence, any method produces inaccurate results when applied to ill-conditioned problems with large $E_{ls}$.
2. The method of normal equations produces a solution $\mathbf{c}_{ls}$ whose relative error is approximately $\text{eps}\cdot\kappa^2(\mathbf{X})$, where eps is the machine precision.
3. The QR method (Householder, Givens, MGS) produces a solution $\mathbf{c}_{ls}$ whose relative error is approximately $\text{eps}\cdot[\kappa(\mathbf{X}) + \sqrt{E_{ls}}\,\kappa^2(\mathbf{X})]$.

In general, QR methods are more accurate than the normal equations approach and can be used for a wider class of data matrices, even if the latter is about twice as fast.

In many practical applications, we need to update the Cholesky or QR decomposition after the original data matrix has been modified by the addition or deletion of a row or column (rank 1 modifications). Techniques for the efficient computation of these decompositions by updating the existing ones can be found in Golub and Van Loan (1996) and Gill et al. (1974).

8.7.3 Rank-Deficient LS Problems

In theory, it is relatively easy to determine the rank of a matrix or that a matrix is rank-deficient. However, both tasks become complicated in practice when the elements of the matrix are specified with inadequate accuracy or the matrix is near singular. The SVD provides the means of determining how close a matrix is to being rank-deficient, which in turn leads to the concept of numerical rank. To this end, suppose that the elements of matrix X are known with an accuracy of order $\epsilon$, and its computed singular values $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_M$ are such that

$$\sigma_{r+1}^2 + \sigma_{r+2}^2 + \cdots + \sigma_M^2 < \epsilon^2 \qquad (8.7.25)$$

Then if we set

$$\boldsymbol{\Sigma}_r \triangleq \text{diag}\{\sigma_1, \ldots, \sigma_r, 0, \ldots, 0\} \quad \text{and} \quad \mathbf{X}_r \triangleq \mathbf{U}\begin{bmatrix}\boldsymbol{\Sigma}_r \\ \mathbf{0}\end{bmatrix}\mathbf{V}^H \qquad (8.7.26)$$

we have

$$\|\mathbf{X} - \mathbf{X}_r\|_F = \sqrt{\sigma_{r+1}^2 + \sigma_{r+2}^2 + \cdots + \sigma_M^2} < \epsilon \qquad (8.7.27)$$

and matrix X is said to be near a matrix of rank r or X has numerical rank r. It can be shown that Xr is the matrix of rank r that is nearest to X in the Frobenius norm sense (Leon


1990; Stewart 1973). This result has important applications in signal modeling and data compression.

Computing the LS solution for rank-deficient data matrices requires extra care. When a singular value is equal to a very small number, its reciprocal, which is a singular value of the pseudoinverse $\mathbf{X}^+$, is a very large number. As a result, the LS solution deviates substantially from the “true” solution. One way to handle this problem is to replace each singular value below a certain cutoff value (thresholding) with zero. A typical threshold is a fraction of $\sigma_1$ determined by either the machine precision available or the accuracy of the elements in the data matrix (measurement accuracy). For example, if the data matrix is accurate to six decimal places, we set the threshold at $10^{-6}\sigma_1$ (Golub and Van Loan 1996). Another way is to replace the LS criterion (8.7.14) by

$$E(\mathbf{c}, \psi) = \|\mathbf{y} - \mathbf{X}\mathbf{c}\|^2 + \psi\|\mathbf{c}\|^2 \qquad (8.7.28)$$

where the constant $\psi > 0$ reflects the importance of the norm of the solution vector. The term $\|\mathbf{c}\|^2$ acts as a stabilizer, that is, it prevents the solution $\mathbf{c}_\psi$ from becoming too large (regularization). Indeed, using the method of Lagrange multipliers, we can show that

$$\mathbf{c}_\psi = \sum_{i=1}^{r} \frac{\sigma_i}{\sigma_i^2 + \psi}\,(\mathbf{u}_i^H\mathbf{y})\,\mathbf{v}_i \qquad (8.7.29)$$

which is known as the regularized solution. We note that $\mathbf{c}_\psi = \mathbf{c}_{ls}$ when $\psi = 0$. However, when $\psi > 0$, as $\sigma_i \to 0$ the term $\sigma_i/(\sigma_i^2 + \psi)$ in (8.7.29) tends to zero while the term $1/\sigma_i$ in (8.7.15) tends to infinity. Furthermore, it can be shown that $\|\mathbf{c}_{ls}\| \le \|\mathbf{y}\|/\sigma_r$ and $\|\mathbf{c}_\psi\| \le \|\mathbf{y}\|/\sqrt{\psi}$ (Hager 1988).

Since the minimum-norm LS solution requires only the first r columns of U, where r is the numerical rank of X, we can use the thin SVD. If $N \gg M$, the computation of either $\mathbf{U}_r$ or U is expensive. However, in practical SVD algorithms, U is computed as the product of many reflections and rotations. Hence, we can compute $\mathbf{y}' = \mathbf{U}^H\mathbf{y}$ by updating y at each step i with each orthogonal transformation, that is, $\mathbf{U}_i^H\mathbf{y} \to \mathbf{y}$.
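As a rough illustration of the two ideas in this section, the following Python sketch determines the numerical rank from a list of singular values using the tail-energy condition (8.7.25) and compares the plain reciprocal $1/\sigma_i$ with the regularized factor $\sigma_i/(\sigma_i^2 + \psi)$ of (8.7.29). The function names and the numerical values of the singular values, the accuracy, and $\psi$ are hypothetical choices made only for this example.

```python
import math

def numerical_rank(singular_values, eps):
    """Smallest r such that sigma_{r+1}^2 + ... + sigma_M^2 < eps^2, cf. (8.7.25)."""
    s = sorted(singular_values, reverse=True)
    for r in range(len(s) + 1):
        tail_energy = sum(v * v for v in s[r:])
        if tail_energy < eps ** 2:
            return r
    return len(s)

def inverse_factors(singular_values, psi):
    """Pairs (1/sigma_i, sigma_i/(sigma_i^2 + psi)) for each singular value."""
    return [(1.0 / s, s / (s * s + psi)) for s in singular_values]

# Hypothetical values chosen only for illustration.
sigmas = [3.0, 1.0, 1e-7]   # singular values of some data matrix
eps = 1e-6                  # assumed accuracy of the matrix elements
psi = 1e-4                  # assumed regularization constant

rank = numerical_rank(sigmas, eps)
factors = inverse_factors(sigmas, psi)

for s, (plain, regularized) in zip(sigmas, factors):
    print(f"sigma = {s:g}: 1/sigma = {plain:g}, regularized factor = {regularized:g}")
print("numerical rank:", rank)
```

The smallest singular value contributes a huge factor to the plain LS solution, whereas its regularized factor stays small, which is the stabilizing effect described above.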

8.8 SUMMARY

In this chapter we discussed the theory, implementation, and application of linear estimators (combiners, filters, and predictors) that are optimum according to the LSE criterion of performance. The fundamental differences between linear MMSE and LSE estimators are as follows:

• MMSE estimators are designed using ensemble average second-order moments R and d; they can be designed prior to operation, and during their normal operation they need only the input signals.
• LSE estimators are designed using time-average estimates $\hat{\mathbf{R}}$ and $\hat{\mathbf{d}}$ of the second-order moments or data matrix X and the desired response vector y. For this reason LSE estimators are sometimes said to be data-adaptive. The design and operation of LSE estimators are coupled and are usually accomplished by using either of the following approaches:
  – Collect a block of training data $\mathbf{X}_{tr}$ and $\mathbf{y}_{tr}$ and use them to design an LSE estimator; use it to process subsequent blocks. Clearly, this approach is meaningful if all blocks have statistically similar characteristics.
  – For each collected block of data X and y, compute the LSE filter $\mathbf{c}_{ls}$ or the LSE estimate $\hat{\mathbf{y}}$ (whatever is needed).

There are various numerical algorithms designed to compute LSE estimators and estimates. For well-behaved data and sufficient numerical precision, all these methods produce the same results and therefore provide the same LSE performance, that is, the same total squared error. However, when ill-conditioned data, finite precision, or computational complexity is a concern, the choice of the LS computational algorithm is very important. We saw that there are two major families of numerical algorithms for dealing with LS problems:

• Power-domain techniques solve LS estimation problems using the time-average moments $\hat{\mathbf{R}} = \mathbf{X}^H\mathbf{X}$ and $\hat{\mathbf{d}} = \mathbf{X}^H\mathbf{y}$. The most widely used methods are the $\mathbf{LDL}^H$ and Cholesky decompositions.
• Amplitude-domain techniques operate directly on data matrix X and the desired response vector. In general, they require more computations and have better numerical properties than power-domain methods. This group includes the QR orthogonalization methods (Householder, Givens, and modified Gram-Schmidt) and the SVD method. The QR decomposition methods apply a unitary transformation to the data matrix to reduce it to an upper triangular one, whereas the GS methods apply an upper triangular matrix transformation to orthogonalize the columns of the data matrix.

In conclusion, we emphasize that there are various ways to compute the coefficients of an optimum estimator and the value of the optimum estimate. We stress that the performance of any optimum estimator, as measured by the MMSE or LSE, does not depend on the particular implementation as long as we have sufficient numerical precision. Therefore, if we want to investigate how well an optimum estimator performs in a certain application, we can use any implementation, as long as computational complexity is not a consideration.

PROBLEMS

8.1 By differentiating (8.2.8) with respect to the vector c, show that the LSE estimator $\mathbf{c}_{ls}$ is given by the solution of the normal equations (8.2.12).

8.2 Let the weighted LSE be given by $E_w = \mathbf{e}^H\mathbf{W}\mathbf{e}$, where W is a Hermitian positive definite matrix.
(a) By minimizing $E_w$ with respect to the vector c, show that the weighted LSE estimator is given by (8.2.35).
(b) Using the $\mathbf{LDL}^H$ decomposition $\mathbf{W} = \mathbf{LDL}^H$, show that the weighted LS criterion corresponds to prefiltering the error or the data.

8.3 Using direct substitution of (8.4.4) into (8.4.5), show that the LS estimator $\mathbf{c}_{ls}^{(i)}$ and the associated LS error $E_{ls}^{(i)}$ are determined by (8.4.5).

8.4 Consider a linear system described by the difference equation y(n) = 0.9y(n − 1) + 0.1x(n − 1) + v(n), where x(n) is the input signal, y(n) is the output signal, and v(n) is an output disturbance. Suppose that we have collected N = 1000 samples of input-output data and that we wish to estimate the system coefficients, using the LS criterion with no windowing. Determine the coefficients of the model y(n) = ay(n − 1) + dx(n − 1) and their estimated covariance matrix $\hat\sigma_e^2\hat{\mathbf{R}}^{-1}$ when
(a) x(n) ∼ WGN(0, 1) and v(n) ∼ WGN(0, 1) and
(b) x(n) ∼ WGN(0, 1) and v(n) = 0.8v(n − 1) + w(n) is an AR(1) process with w(n) ∼ WGN(0, 1).
Comment upon the quality of the obtained estimates by comparing the matrices $\hat\sigma_e^2\hat{\mathbf{R}}^{-1}$ obtained in each case.

8.5 Use Lagrange multipliers to show that Equation (8.4.13) provides the minimum of (8.4.8) under the constraint (8.4.9).


8.6 If full windowing is used in LS, then the autocorrelation matrix is Toeplitz. Using this fact, show that in the combined FBLP the predictor is given by

$$\mathbf{a}^{fb} = \tfrac{1}{2}(\mathbf{a} + \mathbf{J}\mathbf{b}^*)$$

8.7 Consider the noncausal “middle” sample linear signal estimator specified by (8.4.1) with M = 2L and i = L.
(a) Show that if we apply full windowing to the data matrix, the resulting signal estimator is conjugate symmetric, that is, $\mathbf{c}^{(L)} = \mathbf{J}\mathbf{c}^{(L)*}$. This property does not hold for any other windowing method.
(b) Derive the normal equations for the signal estimator that minimizes the total squared error $E^{(L)} = \|\mathbf{e}^{(L)}\|^2$ under the constraint $\mathbf{c}^{(L)} = \mathbf{J}\mathbf{c}^{(L)*}$.
(c) Show that if we enforce the normal equation matrix to be centro-Hermitian, that is, we use the normal equations

$$(\bar{\mathbf{X}}^H\bar{\mathbf{X}} + \mathbf{J}\bar{\mathbf{X}}^T\bar{\mathbf{X}}^*\mathbf{J})\,\mathbf{c}^{(L)} = \begin{bmatrix}\mathbf{0} \\ E^{(L)} \\ \mathbf{0}\end{bmatrix}$$

then the resulting signal smoother is conjugate symmetric.
(d) Illustrate parts (a) to (c), using the data matrix

$$\mathbf{X} = \begin{bmatrix} 1 & 1 & 1 \\ 2 & 2 & 1 \\ 3 & 1 & 3 \\ 1 & 0 & 1 \\ 1 & 2 & 1 \end{bmatrix}$$

and check which smoother provides the smallest total squared error. Try to justify the obtained answer.

8.8 A useful impulse response for some geophysical signal processing applications is the Mexican hat wavelet

$$g(t) = \frac{2}{\sqrt{3}}\,\pi^{-1/4}(1 - t^2)\,e^{-t^2/2}$$

which is the second derivative of a Gaussian pulse.
(a) Plot the wavelet g(t) and the magnitude and phase of its Fourier transform.
(b) By examining the spectrum of the wavelet, determine a reasonable sampling frequency $F_s$.
(c) Design an optimum LS inverse FIR filter for the discrete-time wavelet g(nT), where $T = 1/F_s$. Determine a reasonable value for M by plotting the LSE $E_M$ as a function of order M. Investigate whether we can improve the inverse filter by introducing some delay $n_0$. Determine the best value of $n_0$ and plot the impulse response of the resulting filter and the combined impulse response $g(n) * h(n - n_0)$, which should resemble an impulse.
(d) Repeat part (c) by increasing the sampling frequency by a factor of 2 and comparing with the results obtained in part (c).

8.9 (a) Prove Equation (8.5.4) regarding the $\mathbf{LDL}^H$ decomposition of the augmented matrix $\bar{\mathbf{R}}$.
(b) Solve the LS estimation problem in Example 8.5.1, using the $\mathbf{LDL}^H$ decomposition of $\bar{\mathbf{R}}$ and the partitionings in (8.5.4).

8.10 Prove the order-recursive algorithm described by the relations given in (8.5.12). Demonstrate the validity of this approach, using the data in Example 8.5.1.

8.11 In this problem, we wish to show that the statistical interpretations of innovation and partial correlation for $w_m(n)$ and $k_{m+1}$ in (8.5.12) hold in a deterministic LSE sense. To this end, suppose that the “partial correlation” between $\tilde{\mathbf{y}}$ and $\tilde{\mathbf{x}}_{m+1}$ is defined using the residual records $\tilde{\mathbf{e}}_m = \tilde{\mathbf{y}} - \mathbf{X}_m\mathbf{c}_m$ and $\tilde{\mathbf{e}}_m^b = \tilde{\mathbf{x}}_{m+1} + \mathbf{X}_m\mathbf{b}_m$, where $\mathbf{b}_m$ is the LSE BLP. Show that $k_{m+1} = \beta_{m+1}/\xi_{m+1}^b$, where $\beta_{m+1} \triangleq \tilde{\mathbf{e}}_m^H\tilde{\mathbf{e}}_m^b$ and $\xi_{m+1}^b = \tilde{\mathbf{e}}_m^{bH}\tilde{\mathbf{e}}_m^b$. Demonstrate the validity of these formulas using the data in Example 8.5.1.

8.12 Show that the Cholesky decomposition of a Hermitian positive definite matrix R can be computed by using the following algorithm

    for j = 1 to M
        $l_{jj} = \left(r_{jj} - \sum_{k=1}^{j-1} |l_{jk}|^2\right)^{1/2}$
        for i = j + 1 to M
            $l_{ij} = \left(r_{ij} - \sum_{k=1}^{j-1} l_{ik}\,l_{jk}^*\right) / l_{jj}$
        end i
    end j

and write a Matlab function for its implementation. Test your code using the built-in Matlab function chol.

8.13 Compute the $\mathbf{LDL}^T$ and Cholesky decompositions of the following matrices:

$$\mathbf{X}_1 = \begin{bmatrix} 9 & 3 & -6 \\ 3 & 4 & 1 \\ -6 & 1 & 9 \end{bmatrix} \quad \text{and} \quad \mathbf{X}_2 = \begin{bmatrix} 6 & 4 & -2 \\ 4 & 5 & 3 \\ -2 & 3 & 6 \end{bmatrix}$$

8.14 Solve the LS problem in Example 8.6.1,
(a) using the QR decomposition of the augmented data matrix $\bar{\mathbf{X}} = [\mathbf{X}\ \mathbf{y}]$ and
(b) using the Cholesky decomposition of the matrix $\bar{\mathbf{R}} = \bar{\mathbf{X}}^H\bar{\mathbf{X}}$.
Note: Use Matlab built-in functions for the QR and Cholesky decompositions.

8.15 (a) Show that a unit vector w is an eigenvector of the matrix $\mathbf{H} = \mathbf{I} - 2\mathbf{w}\mathbf{w}^H$. What is the corresponding eigenvalue?
(b) If a vector z is orthogonal to w, show that z is an eigenvector of H. What is the corresponding eigenvalue?

8.16 Solve the LS problem

$$\mathbf{X} = \begin{bmatrix} 1 & 2 \\ 1 & 3 \\ 1 & 2 \\ 1 & -1 \end{bmatrix} \qquad \mathbf{y} = \begin{bmatrix} -3 \\ 10 \\ 3 \\ 6 \end{bmatrix}$$

using the Householder transformation.

8.17 Solve Problem 8.16 by using the Givens transformation.

8.18 Compute the QR decomposition of the data matrix

$$\mathbf{X} = \begin{bmatrix} 4 & 2 & 1 \\ 2 & 0 & 1 \\ 2 & 0 & -1 \\ 1 & 2 & 1 \end{bmatrix}$$

using the GS and MGS methods, and compare the obtained results.

8.19 Solve the following LS problem

$$\mathbf{X} = \begin{bmatrix} 1 & -2 & -1 \\ 2 & 0 & 1 \\ 2 & -4 & 2 \\ 4 & 0 & 0 \end{bmatrix} \qquad \mathbf{y} = \begin{bmatrix} -1 \\ 1 \\ 1 \\ -2 \end{bmatrix}$$

by computing the QR decomposition using the GS algorithm.


8.20 Show that the computational organization of the MGS algorithm shown in Table 8.4 can be used to compute the GS algorithm if we replace the step $r_{im} = \mathbf{q}_i^H\mathbf{x}_m$ by $r_{im} = \mathbf{q}_i^H\mathbf{q}_m$.

8.21 Compute the SVD of

$$\mathbf{X} = \begin{bmatrix} 1 & 1 \\ 1 & 1 \\ 0 & 0 \end{bmatrix}$$

by computing the eigenvalues and eigenvectors of $\mathbf{X}^H\mathbf{X}$ and $\mathbf{X}\mathbf{X}^H$. Check with the results obtained using the svd function.

8.22 Repeat Problem 8.21 for

(a) $\mathbf{X} = \begin{bmatrix} 6 & 2 \\ -7 & 6 \end{bmatrix}$ and (b) $\mathbf{X} = \begin{bmatrix} 0 & 1 & 1 \\ 1 & 1 & 0 \end{bmatrix}$.

8.23 Write a Matlab program to produce the plots in Figure 8.14, using the matrix $\mathbf{X} = \begin{bmatrix} 6 & 2 \\ -7 & 6 \end{bmatrix}$. Hint: Use a parametric description of the circle in polar coordinates.

8.24 For the matrix $\mathbf{X} = \begin{bmatrix} 0 & 1 & 1 \\ 1 & 1 & 0 \end{bmatrix}^T$ determine $\mathbf{X}^+$ and verify that X and $\mathbf{X}^+$ satisfy the four Moore-Penrose conditions (8.7.22).

8.25 Prove the four Moore-Penrose conditions in (8.7.22) and explain why $\mathbf{X}\mathbf{X}^+$ and $\mathbf{X}^+\mathbf{X}$ are orthogonal projections onto the range space of X and $\mathbf{X}^H$.

8.26 In this problem we examine in greater detail the radio-frequency interference cancelation experiment discussed in Section 8.4.3. We first explain the generation of the various signals and then proceed with the design and evaluation of the LS interference canceler.
(a) The useful signal is a pointlike target defined by

$$s(t) = \frac{dg(t)}{dt} = \frac{d}{dt}\left[\frac{1}{e^{-\alpha t/t_r} + e^{\alpha t/t_f}}\right]$$

where $\alpha = 2.3$, $t_r = 0.4$, and $t_f = 2$. Given that $F_s = 2$ GHz, determine s(n) by computing the samples g(n) = g(nT) in the interval −2 ≤ nT ≤ 6 ns and then computing the first difference s(n) = g(n) − g(n − 1). Plot the signal s(n) and its Fourier transform (magnitude and phase), and check whether the pointlike and wideband assumptions are justified.
(b) Generate N = 4096 samples of the narrowband interference using the formula

$$z(n) = \sum_{i=1}^{L} A_i \sin(\omega_i n + \phi_i)$$

and the following information:

    Fs=2; % All frequencies are measured in GHz.
    F=0.1*[0.6 1 1.8 2.1 3 4.8 5.2 5.7 6.1 6.4 6.7 7 7.8 9.3]';
    L=length(F); om=2*pi*F/Fs;
    A=[0.5 1 1 0.5 0.1 0.3 0.5 1 1 1 0.5 0.3 1.5 0.5]';
    rand('seed',1954); phi=2*pi*rand(L,1);

(c) Compute and plot the periodogram of z(n) to check the correctness of your code.
(d) Generate N samples of white Gaussian noise v(n) ∼ WGN(0, 0.1) and create the observed signal $x(n) = 5s(n - n_0) + z(n) + v(n)$, where $n_0 = 1000$. Compute and plot the periodogram of x(n).
(e) Design a one-step ahead (D = 1) linear predictor with M = 100 coefficients using the FBLP method with no windowing. Then use the obtained FBLP to clean the corrupted signal x(n) as shown in Figure 8.7. To evaluate the performance of the canceler, generate the plots shown in Figures 8.8 and 8.9.

8.27 Careful inspection of Figure 8.9 indicates that the D-step prediction error filter, that is, the system with input x(n) and output $e^f(n)$, acts as a whitening filter. In this problem, we try to solve Problem 8.26 by designing a practical whitening filter using a power spectral density (PSD) estimate of the corrupted signal x(n).
(a) Estimate the PSD $\hat{R}_x^{(PA)}(e^{j\omega_k})$, $\omega_k = 2\pi k/N_{FFT}$, of the signal x(n), using the method of averaged periodograms. Use a segment length of L = 256 samples, 50 percent overlap, and $N_{FFT} = 512$.
(b) Since the PSD does not provide any phase information, we shall design a whitening FIR filter with linear phase by

$$\tilde{H}(k) = \frac{1}{\sqrt{\hat{R}_x^{(PA)}(e^{j\omega_k})}}\; e^{-j\frac{2\pi}{N_{FFT}}\frac{N_{FFT}-1}{2}k}$$

where $\tilde{H}(k)$ is the DFT of the impulse response of the filter, that is,

$$\tilde{H}(k) = \sum_{n=0}^{N_{FFT}-1} h(n)\, e^{-j\frac{2\pi}{N_{FFT}}nk}$$

with $0 \le k \le N_{FFT} - 1$.
(c) Use the obtained whitening filter to clean the corrupted signal x(n), and compare its performance with the FBLP canceler by generating plots similar to those shown in Figures 8.8 and 8.9.
(d) Repeat part (c) with L = 128, $N_{FFT} = 512$ and L = 512, $N_{FFT} = 1024$ and check whether spectral resolution has any effect upon the performance.
Note: Information about the design and implementation of FIR filters using the DFT can be found in Proakis and Manolakis (1996).

8.28 Repeat Problem 8.27, using the multitaper method of PSD estimation.

8.29 In this problem we develop an RFI canceler using a symmetric linear smoother with guard samples defined by

$$e(n) = x(n) - \hat{x}(n) \triangleq x(n) + \sum_{k=D}^{M} c_k\,[x(n-k) + x(n+k)]$$

where 1 ≤ D < M prevents the use of the D adjacent samples to the estimation of x(n).
(a) Following the approach used in Section 8.4.3, demonstrate whether such a canceler can be used to mitigate RFI and under what conditions.
(b) If there is theoretical justification for such a canceler, estimate its coefficients, using the method of LS with no windowing for M = 50 and D = 1 for the situation described in Problem 8.26.
(c) Use the obtained filter to clean the corrupted signal x(n), and compare its performance with the FBLP canceler by generating plots similar to those shown in Figures 8.8 and 8.9.
(d) Repeat part (c) for D = 2.

8.30 In Example 6.7.1 we studied the design and performance of an optimum FIR inverse system. In this problem, we design and analyze the performance of a similar FIR LS inverse filter, using training input-output data.
(a) First, we generate N = 100 observations of the input signal y(n) and the noisy output signal x(n). We assume that y(n) ∼ WGN(0, 1) and v(n) ∼ WGN(0, 0.1). To avoid transient effects, we generate 200 samples and retain the last 100 samples to generate the required data records.
(b) Design an LS inverse filter with M = 10 for 0 ≤ D < 10, using no windowing, and choose the best value of delay D.
(c) Repeat part (b) using full windowing.
(d) Compare the LS filters obtained in parts (b) and (c) with the optimum filter designed in Example 6.7.1. What are your conclusions?


8.31 In this problem we estimate the equalizer discussed in Example 6.8.1, using input-output training data, and we evaluate its performance using Monte Carlo simulation.
(a) Generate N = 1000 samples of input-desired response data $\{x(n), a(n)\}_0^{N-1}$ and use them to estimate the correlation matrix $\hat{\mathbf{R}}_x$ and the cross-correlation vector $\hat{\mathbf{d}}$ between x(n) and y(n − D). Use D = 7, M = 11, and W = 2.9. Solve the normal equations to determine the LS FIR equalizer and the corresponding LSE.
(b) Repeat part (a) 500 times; by changing the seed of the random number generators, compute the average (over the realizations) coefficient vector and average LSE, and compare with the optimum MSE equalizer obtained in Example 6.8.1. What are your conclusions?
(c) Repeat parts (a) and (b) by setting W = 3.1.

CHAPTER 9

Signal Modeling and Parametric Spectral Estimation

This chapter is a transition from theory to practice. It focuses on the selection of an appropriate model for a given set of data, the estimation of the model parameters, and how well the model actually “fits the data.” Although the development of parameter estimation techniques requires a strong theoretical background, the selection of a good model and its subsequent evaluation require the user to have sufficient practical experience and a familiarity with the intended application. We provide complete, detailed algorithms for fitting pole-zero models to data using least-squares techniques. The estimation of all-pole model parameters involves the solution of a linear system of equations, whereas pole-zero modeling requires nonlinear least-squares optimization. The chapter is roughly organized into two separate but related parts. In the first part, we begin in Section 9.1 by explaining the steps that are required in the model-building process. Then, in Section 9.2, we introduce various least-squares algorithms for the estimation of parameters of direct and lattice all-pole models, provide different interpretations, and discuss some order selection criteria. For pole-zero models we provide, in Section 9.3, a nonlinear optimization algorithm that estimates the parameters of the model by minimizing the least-squares criterion. We conclude this part with Section 9.4 in which we discuss the applications of pole-zero models to spectral estimation and speech processing. In the second part, we begin with the method of minimum-variance spectral estimation (Capon’s method). Then we describe frequency estimation methods based on the harmonic model: the Pisarenko harmonic decomposition and the MUSIC, minimum-norm, and ESPRIT algorithms. These methods are suitable for applications in which the signals of interest can be represented by complex exponential or harmonic models. 
Signals consisting of complex exponentials are found in a variety of applications including as formant frequencies in speech processing, moving targets in radar, and spatially propagating signals in array processing.

9.1 THE MODELING PROCESS: THEORY AND PRACTICE

In this section, we discuss the modeling of real-world signals using parametric pole-zero (PZ) signal models, whose theoretical properties were discussed in Chapter 4. We focus on PZ(P, Q) models with white input sequences, which are also known as ARMA(P, Q) random signal models. These models are defined by the linear constant-coefficient difference equation

$$x(n) = -\sum_{k=1}^{P} a_k\,x(n-k) + w(n) + \sum_{k=1}^{Q} d_k\,w(n-k) \qquad (9.1.1)$$

where $w(n) \sim \text{WN}(0, \sigma_w^2)$ with $\sigma_w^2 < \infty$. The power spectral density (PSD) of the output signal is

$$R(e^{j\omega}) = \sigma_w^2\,\frac{\left|1 + \sum_{k=1}^{Q} d_k e^{-j\omega k}\right|^2}{\left|1 + \sum_{k=1}^{P} a_k e^{-j\omega k}\right|^2} = \sigma_w^2\,\frac{|D(e^{-j\omega})|^2}{|A(e^{-j\omega})|^2} \qquad (9.1.2)$$

which is a rational function completely specified by the parameters $\{a_1, a_2, \ldots, a_P\}$, $\{d_1, \ldots, d_Q\}$, and $\sigma_w^2$. We stress that since these models are linear, time-invariant (LTI), the resulting process x(n) is stationary, which is ensured if the corresponding systems are BIBO stable.

The essence of signal modeling and of the resulting parametric spectrum estimation is the following: Given finite-length data $\{x(n)\}_{n=0}^{N-1}$, which can be regarded as a sample sequence of the signal under consideration, we want to estimate signal model parameters $\{\hat{a}_k\}_1^P$, $\{\hat{d}_k\}_1^Q$, and $\hat\sigma_w^2$ to satisfy a prescribed criterion. Furthermore, if the parameter estimates are sufficiently accurate, then the following formula

$$\hat{R}(e^{j\omega}) = \hat\sigma_w^2\,\frac{\left|1 + \sum_{k=1}^{Q} \hat{d}_k e^{-j\omega k}\right|^2}{\left|1 + \sum_{k=1}^{P} \hat{a}_k e^{-j\omega k}\right|^2} = \hat\sigma_w^2\,\frac{|\hat{D}(e^{-j\omega})|^2}{|\hat{A}(e^{-j\omega})|^2} \qquad (9.1.3)$$

should provide a reasonable estimate of the signal PSD. A similar argument applies to harmonic signal models and harmonic spectrum estimation in which the model parameters are the amplitudes and frequencies of complex exponentials (see Section 3.3.6).

The development of such models involves the steps shown in Figure 9.1. In this chapter, we assume that we have removed trends, seasonal variations, and other nonstationarities from the data. We further assume that unit poles have been removed from the data by using the differencing approach discussed in Box et al. (1994).

Model selection

In this step, we basically select the structure of the model (direct or lattice), and we make a preliminary decision on the orders P and Q of the model. The most important aid to model selection is the insight and understanding of the signal and the physical mechanism that generates it. Hence, in some applications (e.g., speech processing) physical considerations point to the type and order of the model; when we lack a priori information or we have insufficient knowledge of the mechanism generating the signal, we resort to data analysis methods. In general, to select a candidate model, we estimate the autocorrelation, partial autocorrelation, and power spectrum from the available data, and we compare them to the corresponding quantities obtained from the theoretical models (see Table 4.1). This preliminary data analysis provides sufficient information to choose a PZ model and some initial estimate for P and Q to start a model building process. Several order selection criteria have been developed that penalize both model misfit and a large number of parameters. Although theoretically interesting and appealing, these criteria are of limited value when we deal with actual signals.
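To make the comparison with theoretical models concrete, the rational PSD of (9.1.2) can be evaluated directly from a set of model parameters. The following Python sketch does this for real coefficients; the function name `pz_psd` and the ARMA(1, 1) coefficients are hypothetical and chosen only for illustration.

```python
import cmath
import math

def pz_psd(a, d, sigma2_w, omega):
    """Evaluate the PZ(P, Q) model PSD of (9.1.2) at frequency omega (rad/sample).

    a = [a_1, ..., a_P] and d = [d_1, ..., d_Q] are real model coefficients.
    """
    numerator = 1 + sum(dk * cmath.exp(-1j * omega * k)
                        for k, dk in enumerate(d, start=1))
    denominator = 1 + sum(ak * cmath.exp(-1j * omega * k)
                          for k, ak in enumerate(a, start=1))
    return sigma2_w * abs(numerator) ** 2 / abs(denominator) ** 2

# Hypothetical ARMA(1, 1) model: x(n) = 0.9 x(n-1) + w(n) + 0.5 w(n-1)
a = [-0.9]
d = [0.5]
sigma2_w = 1.0

psd_dc = pz_psd(a, d, sigma2_w, 0.0)       # lowpass model: large value at omega = 0
psd_pi = pz_psd(a, d, sigma2_w, math.pi)   # small value at omega = pi
print(psd_dc, psd_pi)
```

Replacing the coefficients by their estimates gives the parametric PSD estimate of (9.1.3) with the same function.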

FIGURE 9.1 Steps in the signal model building process. (Flowchart. Stage 1, model selection: choose model structure and order. Stage 2, model estimation: estimate model parameters. Stage 3, model validation: check the candidate model for performance. Decision: Is the model satisfactory? If yes, use the model for your application; if no, repeat the process.)

The model structure influences (1) the complexity of the algorithm that estimates the model parameters and (2) the shape of the criterion function (quadratic or nonquadratic). Therefore, the structure (direct or lattice) is not critical to the performance of the model, and its choice is not as crucial as the choice of the order of the model.

Model estimation

In this step, also known as model fitting, we use the available data $\{x(n)\}_0^{N-1}$ to estimate the parameters of the selected model, using optimization of some criterion. Although there are several criteria (e.g., maximum likelihood, spectral matching) that can be used to measure the performance or quality of a PZ model, we concentrate on the least-squares (LS) error criterion. As we shall see, the estimation of all-pole (AP) models leads to linear optimization problems whereas the estimation of all-zero (AZ) and PZ models requires the solution of nonlinear optimization problems. Parameter estimation for PZ models using other criteria can be found in Kay (1988), Box et al. (1994), Porat (1994), and Ljung (1987).

Model validation

Here we investigate how well the obtained model captures the key features of the data. We then take corrective actions, if necessary, by modifying the order of the model, and repeat the process until we get an acceptable model. The goal of the model validation process is to find out whether the model

• Agrees sufficiently with the observed data
• Describes the “true” signal generation system
• Solves the problem that initiated the design process

Of course, the ultimate test is whether the model satisfies the requirements of the intended application, that is, the objective and subjective criteria that specify the performance of the model, computational complexity, cost, etc. In this discussion, we concentrate on how well the model fits the observed data in an LS error statistical sense. The existence of any structure in the residual or prediction error signal indicates a misfit between the model and the data. Hence, a key validation technique is to check whether the residual process, which is generated by the inverse of the fitted model, is a realization of white noise. This can be checked by using, among others, the following statistical techniques (Brockwell and Davis 1991; Bendat and Piersol 1986):


Autocorrelation test. It can be shown (Kendall and Stuart 1983) that when N is sufficiently large, the distribution of the estimated autocorrelation coefficients $\hat\rho(l) = \hat{r}(l)/\hat{r}(0)$ is approximately Gaussian with zero mean and variance of 1/N. The approximate 95 percent confidence limits are $\pm 1.96/\sqrt{N}$. Any estimated values of $\hat\rho(l)$ that fall outside these limits are “significantly” different from zero with 95 percent confidence. Values well beyond these limits indicate nonwhiteness of the residual signal.

Power spectrum density test. Given a set of data $\{x(n)\}_{n=0}^{N-1}$, the standardized cumulative periodogram is defined by

$$\tilde{I}(k) \triangleq \begin{cases} 0 & k < 1 \\[4pt] \dfrac{\sum_{i=1}^{k} \hat{R}(e^{j2\pi i/N})}{\sum_{i=1}^{K} \hat{R}(e^{j2\pi i/N})} & 1 \le k \le K \\[4pt] 1 & k > K \end{cases} \qquad (9.1.4)$$

where K is the integer part of N/2. If the process x(n) is white Gaussian noise (WGN), then the random variables $\tilde{I}(k)$, k = 1, 2, . . . , K, are independently and uniformly distributed in the interval (0, 1), and the plot of $\tilde{I}(k)$ should be approximately linear with respect to k (Jenkins and Watts 1968). The hypothesis is rejected at level 0.05 if $\tilde{I}(k)$ exits the boundaries specified by

$$\tilde{I}^{(b)}(k) = \frac{k-1}{K-1} \pm 1.36\,(K-1)^{-1/2} \qquad 1 \le k \le K \qquad (9.1.5)$$
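The power spectrum density test can be sketched in a few lines of Python. The code below assumes that $\hat{R}$ in (9.1.4) is the ordinary periodogram evaluated by a direct DFT sum; the function names and the test signals are hypothetical and serve only as an illustration.

```python
import cmath
import math
import random

def periodogram(x, i):
    """Periodogram of x at frequency 2*pi*i/N, computed by a direct DFT sum."""
    N = len(x)
    X = sum(x[n] * cmath.exp(-2j * math.pi * i * n / N) for n in range(N))
    return abs(X) ** 2 / N

def cumulative_periodogram(x):
    """Return [I(1), ..., I(K)] of (9.1.4), with K the integer part of N/2."""
    K = len(x) // 2
    R = [periodogram(x, i) for i in range(1, K + 1)]
    total = sum(R)
    values, running = [], 0.0
    for r in R:
        running += r
        values.append(running / total)
    return values

def passes_psd_test(x):
    """True if I(k) stays inside the boundaries (9.1.5) for all 1 <= k <= K."""
    I = cumulative_periodogram(x)
    K = len(I)
    half_width = 1.36 / math.sqrt(K - 1)
    return all(abs(I[k - 1] - (k - 1) / (K - 1)) <= half_width
               for k in range(1, K + 1))

random.seed(0)
x = [random.gauss(0.0, 1.0) for _ in range(100)]        # WGN record
tone = [math.cos(2 * math.pi * 3 * n / 100) for n in range(100)]  # clearly nonwhite

print("WGN passes:", passes_psd_test(x))
print("sinusoid passes:", passes_psd_test(tone))
```

A pure sinusoid concentrates all of its power in one frequency bin, so its cumulative periodogram jumps from 0 to 1 and leaves the boundaries, whereas a white record is expected to stay inside them in about 95 percent of the realizations.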

Partial autocorrelation test. This test is similar to the autocorrelation test. Given the residual process x(n), it can be shown (Kendall and Stuart 1983) that when N is sufficiently large, the partial autocorrelation sequence (PACS) values $\{k_l\}$ for lag l [defined in (4.2.44)] are approximately independent with distribution WN(0, 1/N). This means that roughly 95 percent of the PACS values fall within the bounds $\pm 1.96/\sqrt{N}$. If we observe values consistently well beyond this range for N sufficiently large, it may indicate nonwhiteness of the signal.

EXAMPLE 9.1.1. To apply the above tests and interpret their results, we consider a WGN sequence x(n). By using the randn function, 100 samples of x(n) with zero mean and unit variance were generated. These samples are shown in Figure 9.2. From these samples, the autocorrelation estimates up to lag 40, denoted by $\{\hat{r}(l)\}_{l=0}^{40}$, were computed using the autoc function, from which the correlation coefficients $\hat\rho(l)$ were obtained. The first 10 coefficients are shown in Figure 9.2 along with the appropriate confidence limits. As expected, the first coefficient at lag 0 is unity while the remaining coefficients are within the limits. Next, using the psd function, a periodogram based on 100 samples was computed, from which the cumulative periodogram $\tilde{I}(k)$ was obtained and plotted as a function of the normalized frequency, as shown in Figure 9.2. The confidence limits are also shown. The computed cumulative periodogram is a monotonic increasing function lying within the limits. Finally, using the durbin function, the PACS sequence $\{k_l\}_{l=1}^{40}$ was computed from the estimated correlations and plotted in Figure 9.2. Again all the values for lags l ≥ 1 are within the confidence limits. Thus all three tests suggest that the 100-point data are almost surely from a white noise sequence.
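The autocorrelation test used in the example can be sketched as follows. This Python fragment is not the book's autoc function; it assumes a zero-mean record and the biased estimate $\hat{r}(l) = \frac{1}{N}\sum_n x(n)x(n-l)$, and all names are hypothetical.

```python
import math
import random

def autocorr_coeffs(x, max_lag):
    """Correlation coefficients rho(l) = r(l)/r(0), l = 0..max_lag (zero-mean data assumed)."""
    N = len(x)
    r0 = sum(v * v for v in x) / N
    return [sum(x[n] * x[n - l] for n in range(l, N)) / N / r0
            for l in range(max_lag + 1)]

def lags_outside_limits(rho, N):
    """Lags l >= 1 whose coefficient exceeds the 95 percent limits +-1.96/sqrt(N)."""
    limit = 1.96 / math.sqrt(N)
    return [l for l in range(1, len(rho)) if abs(rho[l]) > limit]

random.seed(0)
x = [random.gauss(0.0, 1.0) for _ in range(100)]   # WGN record as in the example
rho = autocorr_coeffs(x, 10)
print("lags outside the limits:", lags_outside_limits(rho, len(x)))

# A strongly correlated (alternating) sequence violates the limits at every lag.
alternating = [(-1.0) ** n for n in range(100)]
print(lags_outside_limits(autocorr_coeffs(alternating, 5), 100))
```

For a white record, about 1 lag in 20 is expected to fall outside the limits by chance, so an occasional marginal excursion does not by itself indicate nonwhiteness.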

Although the whiteness of the residuals is a good test for model fitting, it does not provide a definite answer to the problem. Some additional procedures include checking whether

• The criterion of performance decreases (fast enough) as we increase the order of the model.
• The estimate of the variance of the residual decreases as the number N of observations increases.
• Some estimated parameters that have physical meaning (e.g., reflection coefficients) assume values that make sense.
• The estimated parameters have sufficient accuracy for the intended application.

FIGURE 9.2 Validation tests on white Gaussian noise in Example 9.1.1. (Four panels: WGN samples x(n) versus n; autocorrelation test, correlation coefficients versus lag l; PSD test, cumulative periodogram versus frequency in cycles per sample; partial autocorrelation test, PACS versus lag l.)

Finally, to demonstrate that the model is sufficiently accurate for the purpose for which it was designed, we can use a method known as cross-validation. Basically, in cross-validation we use one set of data to fit the model and another, statistically independent set of data to test it. Cross-validation is of paramount importance when we build models for control, forecasting, and pattern recognition (Ljung 1987). However, in signal processing applications, such as spectral estimation and signal compression, where the goal is to provide a good fit of the model to the analyzed data, cross-validation is not as useful.
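A minimal sketch of cross-validation is given below, assuming a first-order all-pole model $x(n) = -a\,x(n-1) + w(n)$ and two independently generated records, one for fitting and one for testing. The helper names, the true coefficient, and the record lengths are hypothetical.

```python
import random

def simulate_ar1(a, N, rng):
    """Generate N samples of x(n) = -a x(n-1) + w(n) with unit-variance Gaussian w(n)."""
    x = [0.0]
    for _ in range(N - 1):
        x.append(-a * x[-1] + rng.gauss(0.0, 1.0))
    return x

def fit_ar1(x):
    """LS estimate of a, minimizing the sum of |x(n) + a x(n-1)|^2 over n = 1..N-1."""
    num = sum(x[n] * x[n - 1] for n in range(1, len(x)))
    den = sum(x[n - 1] ** 2 for n in range(1, len(x)))
    return -num / den

def prediction_mse(x, a):
    """Average squared one-step prediction error of the model on the record x."""
    errors = [x[n] + a * x[n - 1] for n in range(1, len(x))]
    return sum(e * e for e in errors) / len(errors)

a_true = -0.8                                        # hypothetical true coefficient
fit_data = simulate_ar1(a_true, 1000, random.Random(1))
test_data = simulate_ar1(a_true, 1000, random.Random(2))   # independent record

a_hat = fit_ar1(fit_data)
mse_fit = prediction_mse(fit_data, a_hat)
mse_test = prediction_mse(test_data, a_hat)
print(a_hat, mse_fit, mse_test)
```

If the model is adequate, the error on the independent record should be comparable to the error on the fitting record; a much larger test error would suggest that the model has been fitted to features specific to the first record.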

9.2 ESTIMATION OF ALL-POLE MODELS

We next use the principle of least squares to estimate parameters of all-pole signal models assuming both white and periodic excitations. We also discuss criteria for model order selection, techniques for estimation of all-pole lattice parameters, and the relationship between all-pole estimation methods using the methods of least squares and maximum entropy. The relationship between all-pole model estimation and minimum-variance spectral estimation is explored in Section 9.5.

9.2.1 Direct Structures

Consider the AR($P_0$) model

\[ x(n) = -\sum_{k=1}^{P_0} a_k^* x(n-k) + w(n) \tag{9.2.1} \]

where $w(n) \sim \mathrm{WN}(0, \sigma_w^2)$ and where we use $a_k^*$ instead of $a_k$ to comply with Chapter 8 notation. The $P$th-order linear predictor of $x(n)$ is given by

\[ \hat{x}(n) = -\sum_{k=1}^{P} \hat{a}_k^* x(n-k) \tag{9.2.2} \]

and the corresponding prediction error sequence is

\[ e(n) = x(n) - \hat{x}(n) = x(n) + \sum_{k=1}^{P} \hat{a}_k^* x(n-k) \tag{9.2.3} \]

\[ \phantom{e(n)} = \hat{\mathbf{a}}^H \mathbf{x}(n) \tag{9.2.4} \]

where $\hat{a}_0 = 1$ and

\[ \hat{\mathbf{a}} = [1 \;\; \hat{a}_1 \;\; \cdots \;\; \hat{a}_P]^T \tag{9.2.5} \]

\[ \mathbf{x}(n) = [x(n) \;\; x(n-1) \;\; \cdots \;\; x(n-P)]^T \tag{9.2.6} \]

Thus the error over the range $N_i \le n \le N_f$ can be expressed as a vector

\[ \mathbf{e} = \bar{\mathbf{X}} \hat{\mathbf{a}} \tag{9.2.7} \]

where $\bar{\mathbf{X}}$ is the data matrix defined in (8.4.3). For the full-windowing case, the data matrix $\bar{\mathbf{X}}$ is given by

\[ \bar{\mathbf{X}}^H = \begin{bmatrix}
x(0) & x(1) & \cdots & x(P) & \cdots & 0 & \cdots & 0 \\
0 & x(0) & \cdots & x(P-1) & \cdots & x(N-1) & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots & \ddots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & x(0) & \cdots & x(N-P) & \cdots & x(N-1)
\end{bmatrix} \tag{9.2.8} \]

while for the no-windowing case the data matrix $\bar{\mathbf{X}}$ is

\[ \bar{\mathbf{X}}^H = \begin{bmatrix}
x(P) & x(P+1) & \cdots & x(N-2) & x(N-1) \\
x(P-1) & x(P) & \cdots & x(N-3) & x(N-2) \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
x(0) & x(1) & \cdots & x(N-P-2) & x(N-P-1)
\end{bmatrix} \tag{9.2.9} \]

Notice that if $P = P_0$ and $\hat{a}_k = a_k$, the prediction error $e(n)$ is identical to the white noise excitation $w(n)$. Furthermore, if AR($P_0$) is minimum-phase, then $w(n)$ is the innovation process of $x(n)$ and $\hat{x}(n)$ is the MMSE prediction of $x(n)$. Thus, we can obtain a good estimate of the model parameters by minimizing some function of the prediction error. In theory, we minimize the MSE $E\{|e(n)|^2\}$. In practice, since this is not possible, we estimate $\{a_k\}_1^P$ for a given $P$ by minimizing the total squared error

\[ E_P = \sum_{n=N_i}^{N_f} |e(n)|^2 = \sum_{n=N_i}^{N_f} \left| x(n) + \sum_{k=1}^{P} \hat{a}_k^* x(n-k) \right|^2 \tag{9.2.10} \]

\[ \phantom{E_P} = \sum_{n=N_i}^{N_f} |\hat{\mathbf{a}}^H \mathbf{x}(n)|^2 = \hat{\mathbf{a}}^H \bar{\mathbf{X}}^H \bar{\mathbf{X}} \hat{\mathbf{a}} \tag{9.2.11} \]

over the range $N_i \le n \le N_f$. Hence, we can use the methods discussed in Section 8.4 for the computation of LS linear predictors. In particular, the forward linear predictor coefficients $\{\hat{a}_k\}_{k=1}^{P}$ and the associated LS error $\hat{E}_P$ are obtained by solving the normal equations

\[ (\bar{\mathbf{X}}^H \bar{\mathbf{X}}) \hat{\mathbf{a}} = \begin{bmatrix} \hat{E}_P \\ \mathbf{0} \end{bmatrix} \tag{9.2.12} \]

The solution of (9.2.12) is discussed extensively in Chapter 8.
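The no-windowing fit can be sketched in a few lines. The following is only an illustration under stated assumptions, not the book's `arls` function: the data are taken to be real-valued, the function name `ar_ls_nowindow` is hypothetical, and the normal equations are solved implicitly through NumPy's least-squares routine rather than by the Chapter 8 algorithms.

```python
import numpy as np

def ar_ls_nowindow(x, P):
    """Least-squares AR(P) fit over P <= n <= N-1 (no windowing), real data.

    Hypothetical helper for illustration; returns (a, e, var_w) where
    a = [a_1, ..., a_P] in the convention x(n) = -sum_k a_k x(n-k) + w(n).
    """
    x = np.asarray(x, dtype=float)
    N = len(x)
    # Row for time n holds the past samples [x(n-1), ..., x(n-P)].
    past = np.column_stack([x[P - k:N - k] for k in range(1, P + 1)])
    target = x[P:]
    # Minimize sum_n |x(n) + sum_k a_k x(n-k)|^2, i.e. solve past @ a ~ -target.
    a, *_ = np.linalg.lstsq(past, -target, rcond=None)
    e = target + past @ a               # prediction error e(n), n = P..N-1
    var_w = np.sum(e ** 2) / (N - P)    # no-windowing variance estimate
    return a, e, var_w

# Illustrative data: a simulated AR(2) process (coefficients chosen arbitrarily).
rng = np.random.default_rng(0)
w = rng.standard_normal(500)
x = np.zeros(500)
for n in range(2, 500):
    x[n] = 1.318 * x[n - 1] - 0.634 * x[n - 2] + w[n]
a_hat, e, var_w = ar_ls_nowindow(x, 2)
print(a_hat, var_w)
```

Solving the overdetermined system directly is numerically equivalent to forming and solving the normal equations, and it avoids squaring the condition number of the data matrix.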

The least-squares AP($P$) parameter estimates have properties similar to those of linear prediction. For example, if the process $w(n)$ is Gaussian, the least-squares no-windowing estimates are also maximum-likelihood estimates (Jenkins and Watts 1968). The variance of the excitation process can be obtained from the LS error $\hat{E}_P$ by

\[ \hat{\sigma}_w^2 = \frac{1}{N+P} \hat{E}_P = \frac{1}{N+P} \sum_{n=0}^{N+P-1} |e(n)|^2 \qquad \text{full windowing} \tag{9.2.13} \]

or

\[ \hat{\sigma}_w^2 = \frac{1}{N-P} \hat{E}_P = \frac{1}{N-P} \sum_{n=P}^{N-1} |e(n)|^2 \qquad \text{no windowing} \tag{9.2.14} \]

for the full-windowing or no-windowing methods, respectively. Furthermore, in the full-windowing case, if the Toeplitz correlation matrix is positive definite, the obtained model is guaranteed to be minimum-phase (see Section 7.4). Matlab functions

[ahat,e,V] = arwin(x,P) and [ahat,e,V] = arls(x,P)

are provided that compute the model parameters, the error sequence, and the modeling error using the full-windowing and no-windowing methods, respectively.

We present three examples below to illustrate the all-pole model determination and its use in PSD estimation. The first example uses real data consisting of water-level measurements of Lake Huron from 1875 to 1972. The second example also uses real data containing sunspot numbers for 1770 through 1869. These sunspot numbers have an approximate cycle of period around 10 to 12 years. The Lake Huron and sunspot data are shown in Figure 9.3. The third example generates simulated AR(4) data to estimate model parameters and through them the PSD values. In each case, the mean was computed and removed from the data prior to processing.

FIGURE 9.3 The Lake Huron and sunspot data used in Examples 9.2.1 and 9.2.2. (Panels: Lake Huron data, level in feet versus year; sunspot data, numbers versus year.)

EXAMPLE 9.2.1. A careful examination of Lake Huron water-level measurement data indicates that a low-order all-pole model might be a suitable representation of the data. To test this hypothesis, first- and second-order models were considered. Using the full-windowing method, model parameters were computed:

First-order: $\hat{a}_1 = -0.791$, $\hat{\sigma}_w^2 = 0.5024$

Second-order: $\hat{a}_1 = -1.002$, $\hat{a}_2 = 0.2832$, $\hat{\sigma}_w^2 = 0.4460$

Using these model parameters, the data were filtered and the residuals were computed. Three tests for checking the whiteness of the residuals as described in Section 9.1 were performed to ascertain the validity of models. In Figure 9.4, we show the residuals, the autocorrelation test, the PSD test, and the partial correlation test for the first-order model. The partial correlation test indicates that the PACS coefficient at lag 1 is outside the confidence limits and thus the first-order model is a poor fit. In Figure 9.5 we show the same plots for the second-order model. Clearly, these tests show that the residuals are approximately white. Therefore, the AR(2) model appears to be a good match to the data.

FIGURE 9.4 Validation tests on the first-order model fit to the Lake Huron water-level measurement data in Example 9.2.1. (Panels: residual samples $e(n)$ versus $n$; autocorrelation test, $r(l)$ versus lag $l$; PSD test, $\tilde{I}(f)$ versus frequency in cycles/sample; partial autocorrelation test, PACS versus lag $l$.)

EXAMPLE 9.2.2. Figure 9.6 shows the PACS coefficients of the sunspot numbers along with the 95 percent confidence limits. Since all PACS values beyond lag 2 fall well inside the limits, a second-order model is a possible candidate for the data. Therefore, the second-order model parameters were estimated from the data to obtain the model

\[ x(n) = 1.318\,x(n-1) - 0.634\,x(n-2) + w(n) \qquad \hat{\sigma}_w^2 = 289.2 \]
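As a quick check of what this AR(2) model implies in the frequency domain, the sketch below evaluates its PSD, $\hat{\sigma}_w^2 / |A(e^{j2\pi f})|^2$, on a frequency grid and locates the peak. The helper name `ar_psd` and the grid spacing are assumptions made for illustration; only the two coefficients and the variance come from the example.

```python
import cmath
import math

def ar_psd(a, var_w, f):
    """PSD var_w / |A(e^{j 2 pi f})|^2 with A(z) = 1 + sum_k a_k z^{-k}."""
    A = 1 + sum(ak * cmath.exp(-2j * math.pi * f * (k + 1))
                for k, ak in enumerate(a))
    return var_w / abs(A) ** 2

# x(n) = 1.318 x(n-1) - 0.634 x(n-2) + w(n)  ->  a_1 = -1.318, a_2 = 0.634
a = [-1.318, 0.634]
var_w = 289.2

freqs = [i / 2000 for i in range(1001)]      # 0 to 0.5 cycles per year
psd = [ar_psd(a, var_w, f) for f in freqs]
f_peak = freqs[max(range(len(psd)), key=psd.__getitem__)]
print("peak frequency (cycles/year):", f_peak)
print("period (years):", 1 / f_peak)
```

The peak lands a little below 0.1 cycle per sampling interval, that is, a period of roughly 11 years, which is consistent with the 10- to 12-year sunspot cycle mentioned earlier.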

In Figure 9.7 we show the residuals obtained by filtering the data along with three tests for their whiteness. The plots show that the estimated model is a reasonable fit to the data. Finally, in Figure 9.8 we show the PSD estimated from the AR(2) model as well as from the periodogram. The periodogram is very noisy and is devoid of any structure. The AR(2) spectrum is smoother and distinctly shows a peak at 0.1 cycle per sampling interval. Since the sampling interval is 1 year, the peak corresponds to 10 years per cycle, which agrees with the observations. Thus the parametric approach to PSD estimation was appropriate.

EXAMPLE 9.2.3. We illustrate the least-squares algorithms described above, using the AR(4) process $x(n)$ introduced in Example 5.3.2. The system function of the model is given by

\[ H(z) = \frac{1}{1 - 2.7607z^{-1} + 3.8106z^{-2} - 2.6535z^{-3} + 0.9238z^{-4}} \]

and the excitation is a zero-mean Gaussian white noise with unit variance. Suppose that we are given the $N = 250$ samples of $x(n)$ shown in Figure 9.9 and we wish to model the underlying process by using an all-pole model. To identify a candidate model, we compute the autocorrelation, partial autocorrelation, and periodogram, using the available data. Careful inspection of Figure 9.9 and the signal model characteristics given in Table 4.1 suggests an AR model. Since the PACS plot cuts off around $P = 5$, we choose $P = 4$ and fit an AR(4) model to the data, using both the full-windowing and no-windowing methods. Figure 9.10 shows the actual spectrum of the process, the spectra of the estimated models, and the periodogram. Clearly, the no-windowing estimate provides a better fit because it does not impose any windowing on the data. Figure 9.11 shows the residual, autocorrelation, partial autocorrelation, and periodogram for the no-windowing-based model. We see that the residuals can be assumed uncorrelated with reasonable confidence, which implies that the model captures the second-order statistics of the data.

FIGURE 9.5 Validation tests on the second-order model fit to the Lake Huron water-level measurement data in Example 9.2.1. (Panels: residual samples $e(n)$ versus $n$; autocorrelation test, $r(l)$ versus lag $l$; PSD test, $\tilde{I}(f)$ versus frequency in cycles/sample; partial autocorrelation test, PACS versus lag $l$.)

FIGURE 9.6 The PACS values of the sunspot numbers in Example 9.2.2. (Partial autocorrelation test: PACS versus lag $l$.)

Modified covariance method. The LS method described above to estimate model parameters uses the forward linear predictor and prediction error. There is also another approach that is based on the backward linear predictor. Recall that the backward linear predictor derived from the known correlations is the complex conjugate of the forward predictor (and likewise, the corresponding errors are identical). However, the LS estimators and errors based on the actual data are different because the data read in each direction are different from a statistical viewpoint. Hence, it is much more reasonable to consider both forward and backward predictors and to minimize the combined error

\[ E_P^{fb} = \sum_{n=N_i}^{N_f} \left[ |e^f(n)|^2 + |e^b(n)|^2 \right] = \sum_{n=N_i}^{N_f} \left[ |\hat{\mathbf{a}}^H \mathbf{x}(n)|^2 + |\hat{\mathbf{a}}^T \mathbf{x}^*(n)|^2 \right] = \hat{\mathbf{a}}^H \bar{\mathbf{X}}^H \bar{\mathbf{X}} \hat{\mathbf{a}} + \hat{\mathbf{a}}^H \bar{\mathbf{X}}^T \bar{\mathbf{X}}^* \hat{\mathbf{a}} \tag{9.2.15} \]

subject to the constraint that the first component of $\hat{\mathbf{a}}$ is 1. The minimization of $E_P^{fb}$ leads to the set of normal equations

\[ (\bar{\mathbf{X}}^H \bar{\mathbf{X}} + \bar{\mathbf{X}}^T \bar{\mathbf{X}}^*) \hat{\mathbf{a}} = \begin{bmatrix} \hat{E}_P^{fb} \\ \mathbf{0} \end{bmatrix} \tag{9.2.16} \]

which can be solved efficiently to obtain the model parameters (see Section 8.4.2). This method of using the forward-backward predictors is called the modified covariance method. Not only does it have the advantage of minimizing the combined global error, but also since it uses more data in (9.2.16), it gives better estimates and lower error. A similar minimization approach, but implemented at each local stage, is used in Burg's method, which is discussed in Section 9.2.2.
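For real-valued data, the forward-backward idea can be sketched by stacking the forward and the backward prediction equations into one least-squares problem. This is a minimal sketch under assumptions: the function name `ar_modcov` is hypothetical, the data are real, the range is the no-windowing range $P \le n \le N-1$, and the backward error is written in the common real-data form $x(n-P) + \sum_k a_k x(n-P+k)$. It is not the efficient algorithm of Section 8.4.2.

```python
import numpy as np

def ar_modcov(x, P):
    """Forward-backward least-squares AR(P) fit for real data (illustrative)."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    # Forward equations:  x(n)   + sum_k a_k x(n-k)   ~ 0,  n = P..N-1
    fwd = np.column_stack([x[P - k:N - k] for k in range(1, P + 1)])
    fwd_target = x[P:]
    # Backward equations: x(n-P) + sum_k a_k x(n-P+k) ~ 0,  n = P..N-1
    bwd = np.column_stack([x[k:N - P + k] for k in range(1, P + 1)])
    bwd_target = x[:N - P]
    A = np.vstack([fwd, bwd])                       # 2(N-P) equations
    b = -np.concatenate([fwd_target, bwd_target])
    a, *_ = np.linalg.lstsq(A, b, rcond=None)
    E_fb = float(np.sum((A @ a - b) ** 2))          # combined squared error
    return a, E_fb

# A noise-free sinusoid obeys x(n) - 2 cos(w0) x(n-1) + x(n-2) = 0 in both
# directions, so the fit should return a_1 = -2 cos(w0) and a_2 = 1.
w0 = 0.3
a_hat, E_fb = ar_modcov(np.cos(w0 * np.arange(40)), 2)
print(a_hat, E_fb)
```

Stacking the two sets of equations doubles the number of rows for the same number of unknowns, which is the sense in which the method uses more data than the forward-only fit.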

FIGURE 9.7 Validation tests on the second-order model fit to the sunspot numbers in Example 9.2.2. (Panels: residual samples $e(n)$ versus $n$; autocorrelation test, $r(l)$ versus lag $l$; PSD test, $\tilde{I}(f)$ versus frequency in cycles/sampling interval; partial autocorrelation test, PACS versus lag $l$.)

FIGURE 9.8 Comparison of the periodogram and the AR(2) spectrum in Example 9.2.2. (Power in dB versus frequency in cycles/sampling interval.)

FIGURE 9.9 Data segment from an AR(4) process, and the corresponding autocorrelation, partial autocorrelation, and periodogram. (Panels: AR(4) data, amplitude versus sample number; autocorrelation, $r(l)$ versus lag $l$; PACS, $k_m$ versus $m$; periodogram, power in dB versus frequency in cycles/sampling interval.)

FIGURE 9.10 Periodogram, theoretical AR(4) spectrum, and AR(4) model spectra using full windowing, Hamming windowing, and no windowing. (Curves: true PSD, no window, rectangular window, Hamming window, periodogram; power in dB versus frequency in cycles/sampling interval.)

FIGURE 9.11 Residual sequence for the AR(4) data, and the corresponding autocorrelation, partial autocorrelation, and periodogram. (Panels: residuals, amplitude versus sample number; autocorrelation, $r(l)$ versus lag $l$; PACS, $k_m$ versus $m$; periodogram, power in dB versus frequency in cycles/sampling interval.)

Frequency-domain interpretation. In the case of full windowing, by using Parseval's theorem, the error energy can be written as

\[ E = \sum_{n=-\infty}^{\infty} |e(n)|^2 = \frac{1}{2\pi} \int_{-\pi}^{\pi} \frac{|X(e^{j\omega})|^2}{|\hat{H}(e^{j\omega})|^2} \, d\omega \tag{9.2.17} \]

where $|X(e^{j\omega})|^2$ is the spectrum of the modeled windowed signal segment and $\hat{H}(e^{j\omega})$ is the frequency response of the estimated all-pole model [or estimated spectrum of $x(n)$]. This expression is a good approximation for the other windowing methods if $N \gg P$. Since the integrand in (9.2.17) is positive, minimizing the error $E$ is equivalent to minimizing the integrated ratio of the energy spectrum of the modeled signal segment to its all-pole-based spectrum. The presence of this ratio in (9.2.17) has three additional consequences. (1) The quality of the spectral matching is uniform over the whole frequency range, irrespective of the shape of the spectrum. (2) Since regions where $|X(e^{j\omega})| > |\hat{H}(e^{j\omega})|$ contribute more to the total error than regions where $|X(e^{j\omega})| < |\hat{H}(e^{j\omega})|$ do, the match is better near spectral peaks than near spectral valleys. (3) The all-pole model provides a good estimate of the envelope of the signal spectrum $|X(e^{j\omega})|^2$. These properties are apparent in Figure 9.12, which shows a comparison between $20 \log |X(e^{j\omega})|$ (obtained using the periodogram) and $20 \log |\hat{H}(e^{j\omega})|$ [obtained by an AP(28) model fitted using full windowing] for a 20-ms, Hamming windowed, speech signal sampled at 20 kHz. Note that the slope of $|\hat{H}(e^{j\omega})|$ is always zero at frequencies $\omega = 0$ and $\omega = \pi$, as expected. More details on these issues can be found in Makhoul (1975b).

FIGURE 9.12 Illustration of the spectral envelope matching property of all-pole models. (Curves: AP(28) model spectrum and periodogram; power in dB versus frequency in kHz.)

The error energy (9.2.17) is also related to the Itakura-Saito (IS) distortion measure, which is given by

\[ d_{IS}(R_1, R_2) \triangleq \frac{1}{2\pi} \int_{-\pi}^{\pi} \left[ \exp V(e^{j\omega}) - V(e^{j\omega}) - 1 \right] d\omega \tag{9.2.18} \]

where $R_1(e^{j\omega})$ and $R_2(e^{j\omega})$ are two spectra, and

\[ V(e^{j\omega}) \triangleq \log R_1(e^{j\omega}) - \log R_2(e^{j\omega}) \tag{9.2.19} \]

Indeed, we can show that

\[ d_{IS}(R_1, R_2) = \frac{1}{2\pi} \int_{-\pi}^{\pi} \frac{R_1(e^{j\omega})}{R_2(e^{j\omega})} \, d\omega - \log \frac{\sigma_1^2}{\sigma_2^2} - 1 \tag{9.2.20} \]

where $\sigma_1^2$ and $\sigma_2^2$ are the variances of the innovation sequences corresponding to the factorization of spectra $R_1(e^{j\omega})$ and $R_2(e^{j\omega})$, respectively. More details can be found in Rabiner and Juang (1993).

Order selection criteria. The order of an all-pole signal model plays an important role in the modeling problem. It determines the number of parameters to be estimated and hence the computational complexity of the algorithm. But more importantly, it affects the quality of the spectrum estimates. If a much lower order is selected, then the resulting spectrum will be smooth and will display poor resolution. If a much larger order is used, then the spectrum may contain spurious peaks at best and a phenomenon called spectrum splitting at worst, in which a single peak is split into two separate and distinct peaks (Hayes 1996). Several criteria have been proposed over the years for model order selection; however, in practice nothing surpasses the graphical approach outlined in Examples 9.2.1 and 9.2.2 combined with the experience of the user. Therefore, we only provide a brief summary of some well-known criteria and refer the interested reader to Kay (1988), Porat (1994), and Ljung (1987) for more details. The simplest approach would be to monitor the modeling error and then select the order at which this error enters a steady state. However, for all-pole models, the modeling error is monotonically decreasing, which makes this approach all but impossible. The general idea behind the suggested criteria is to introduce a penalty function in the modeling error that increases with the model order $P$. We present the following four criteria that are based on the above general idea.

FPE criterion. The final prediction error (FPE) criterion, proposed by Akaike (1970), is based on the function

\[ \mathrm{FPE}(P) = \frac{N+P}{N-P} \, \hat{\sigma}_P^2 \tag{9.2.21} \]

where $\hat{\sigma}_P^2$ is the modeling error [or variance of the residual of the estimated AP($P$) model]. We note that the term $\hat{\sigma}_P^2$ decreases or remains the same with increasing $P$, whereas the term $(N+P)/(N-P)$ accounts for the increase in $\hat{\sigma}_P^2$ due to inaccuracies in the estimated parameters and increases with $P$. Clearly, FPE($P$) is an inflated version of $\hat{\sigma}_P^2$. The FPE order selection criterion is to choose $P$ that will minimize the function in (9.2.21).

AIC. The Akaike information criterion (AIC), also introduced by Akaike (1974), is based on the function

\[ \mathrm{AIC}(P) = N \log \hat{\sigma}_P^2 + 2P \tag{9.2.22} \]

It is a very general criterion that provides an estimate of the Kullback-Leibler distance (Kullback 1959) between an assumed and the true probability density function of the data. The performances of the FPE criterion and the AIC are quite similar.

MDL criterion. The minimum description length (MDL) criterion was proposed by Rissanen (1978) and uses the function

\[ \mathrm{MDL}(P) = N \log \hat{\sigma}_P^2 + P \log N \tag{9.2.23} \]

The first term in (9.2.23) decreases with $P$, but the second penalty term increases. It has been shown (Rissanen 1978) that this criterion provides a consistent order estimate in that the probability that the estimated order is equal to the true order approaches 1 as the data length $N$ tends to infinity.

CAT. This criterion is based on Parzen's criterion autoregressive transfer (CAT) function (Parzen 1977), which is given by

\[ \mathrm{CAT}(P) = \frac{1}{N} \sum_{k=1}^{P} \frac{N-k}{N \hat{\sigma}_k^2} - \frac{N-P}{N \hat{\sigma}_P^2} \tag{9.2.24} \]
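The FPE, AIC, and MDL functions are simple enough to tabulate directly once the residual variances of the candidate models are available. The sketch below is illustrative only: the helper names are hypothetical, the first two variances are the Lake Huron values from Example 9.2.1, the third and fourth are made-up placeholders (no such fits are reported in the text), and $N = 98$ assumes one sample per year from 1875 to 1972.

```python
import math

def order_criteria(variances, N):
    """Return rows (P, FPE, AIC, MDL); variances[P-1] is the AP(P) residual variance."""
    rows = []
    for P, var in enumerate(variances, start=1):
        fpe = (N + P) / (N - P) * var                 # (9.2.21)
        aic = N * math.log(var) + 2 * P               # (9.2.22)
        mdl = N * math.log(var) + P * math.log(N)     # (9.2.23)
        rows.append((P, fpe, aic, mdl))
    return rows

def best_order(rows, column):
    """Order P that minimizes the chosen column (1 = FPE, 2 = AIC, 3 = MDL)."""
    return min(rows, key=lambda r: r[column])[0]

# Orders 1 and 2: values from Example 9.2.1. Orders 3 and 4: hypothetical.
variances = [0.5024, 0.4460, 0.4455, 0.4450]
rows = order_criteria(variances, N=98)
for P, fpe, aic, mdl in rows:
    print(P, round(fpe, 4), round(aic, 2), round(mdl, 2))
print("selected orders:", [best_order(rows, c) for c in (1, 2, 3)])
```

With these inputs the penalty terms outweigh the small variance reductions beyond the second order, so all three criteria point to the same order as the residual tests of Example 9.2.1.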

The CAT criterion is asymptotically equivalent to the AIC and the MDL criteria. Basically, all order selection criteria add to the variance of the residuals a term that grows with the order of the model and estimate the order of the model by minimizing the resulting criterion. However, when $P \ll N$, which is the case in many practical applications, the criteria do not exhibit a clear minimum, which makes the order selection process difficult (see Problem 9.1).

9.2.2 Lattice Structures

We noted in Section 7.5 that a prediction error filter, and hence the AP model, can also be implemented by using a lattice structure. The $P$th-order forward prediction error $e(n) = e_P^f(n)$ and the total squared error

\[ E_P = \sum_{n=N_i}^{N_f} |e(n)|^2 \tag{9.2.25} \]

are nonlinear functions of the lattice parameters $k_m$, $0 \le m \le P-1$. For example, if $P = 2$, we have

\[ e_2^f(n) = x(n) + (k_0^* + k_0 k_1^*) x(n-1) + k_1^* x(n-2) \]

which shows that $e_2^f(n)$ depends on the product $k_0 k_1^*$. Thus, fitting an all-pole lattice model by minimizing $E_P$ with respect to $k_m$, $0 \le m \le P-1$, leads to a difficult nonlinear optimization problem. We can avoid this problem by replacing the above "global" optimization with $P$ "local" optimizations from $m = 1$ to $P$, one for each stage of the lattice. From the lattice equations

\[ e_m^f(n) = e_{m-1}^f(n) + k_{m-1}^* e_{m-1}^b(n-1) \tag{9.2.26} \]

\[ e_m^b(n) = e_{m-1}^b(n-1) + k_{m-1} e_{m-1}^f(n) \tag{9.2.27} \]

we see that the $m$th-order prediction errors depend on the coefficient $k_{m-1}$ only. Furthermore, the values of $e_{m-1}^f(n)$ and $e_{m-1}^b(n)$ have been computed by using $k_{m-2}$, which has been determined from the optimization step at the previous stage. Hence, to minimize the forward prediction error

\[ E_m^f = \sum_{n=N_i}^{N_f} |e_m^f(n)|^2 \tag{9.2.28} \]

we substitute (9.2.26) into (9.2.28) and differentiate with respect to $k_{m-1}^*$.† This leads to the following optimum value of $k_{m-1}$

\[ k_{m-1}^{FP} = -\frac{\beta_{m-1}^{fb}}{E_{m-1}^b} \tag{9.2.29} \]

where

\[ \beta_m^{fb} = \sum_{n=N_i}^{N_f} [e_m^f(n)]^* e_m^b(n-1) \tag{9.2.30} \]

and

\[ E_m^b = \sum_{n=N_i}^{N_f} |e_m^b(n-1)|^2 \tag{9.2.31} \]

Similarly, minimization of the backward prediction error (9.2.31) gives

\[ k_{m-1}^{BP} = -\frac{\beta_{m-1}^{fb}}{E_{m-1}^f} \tag{9.2.32} \]

Burg (1967) suggested the estimation of $k_{m-1}$ by minimizing

\[ E_m^{fb} = \sum_{n=N_i}^{N_f} \left\{ |e_m^f(n)|^2 + |e_m^b(n)|^2 \right\} \tag{9.2.33} \]

at each stage of the lattice.‡ Indeed, substituting (9.2.26) and (9.2.27) in the last equation, we obtain the relationship

\[ E_m^{fb} = (1 + |k_{m-1}|^2) E_{m-1}^f + 4 \operatorname{Re}(k_{m-1}^* \beta_{m-1}^{fb}) + (1 + |k_{m-1}|^2) E_{m-1}^b \tag{9.2.34} \]

† See Appendix B for a discussion of how to find an optimum of a real-valued function of a complex variable and its conjugate.

‡ This approach should not be confused with the maximum entropy method introduced also by Burg and discussed later.

If we set $\partial E_m^{fb} / \partial k_{m-1}^* = 0$, we obtain the following estimate of $k_{m-1}$:

\[ k_{m-1}^{B} = -\frac{\beta_{m-1}^{fb}}{\frac{1}{2}(E_{m-1}^f + E_{m-1}^b)} = \frac{2 k_{m-1}^{FP} k_{m-1}^{BP}}{k_{m-1}^{FP} + k_{m-1}^{BP}} \tag{9.2.35} \]

We note that $k_{m-1}^{B}$ is the harmonic mean of $k_{m-1}^{FP}$ and $k_{m-1}^{BP}$. We also stress that the obtained model is different from the one resulting from the forward-backward least-squares (FBLS) method through global optimization [see (9.2.16)].

Itakura and Saito (1971) proposed an estimate of $k_{m-1}$ based on replacing the theoretical ensemble averages in (7.5.24) by time averages. Their estimate is given by

\[ k_{m-1}^{IS} = -\frac{\beta_{m-1}^{fb}}{\sqrt{E_{m-1}^f E_{m-1}^b}} = \operatorname{sign}(k_{m-1}^{FP} \text{ or } k_{m-1}^{BP}) \sqrt{k_{m-1}^{FP} k_{m-1}^{BP}} \tag{9.2.36} \]

and is also known as the geometric mean method. Since it can be shown that

\[ |k_{m-1}^{B}| \le |k_{m-1}^{IS}| \le 1 \tag{9.2.37} \]

both estimates result in minimum-phase models (see Problem 9.2). From (9.2.36) and (9.2.37) we have $|k_{m-1}^{FP}| \, |k_{m-1}^{BP}| \le 1$, so we conclude that if $|k_{m-1}^{FP}| > 1$, then $|k_{m-1}^{BP}| < 1$ and vice versa; that is, the forward and backward estimates cannot both exceed unity in magnitude. Several other estimates are discussed in Makhoul (1977) and Viswanathan and Makhoul (1975). In all previous methods, we use no windowing; that is, we set $N_i = m$ and $N_f = N - 1$. If we use data windowing, all the above estimates are identical to the data windowing estimates obtained using the algorithm of Levinson-Durbin (see Problem 9.3). The variance of the residuals can be estimated by

\[ \hat{\sigma}_m^2 = \frac{1}{2N - m} E_m^{fb} \tag{9.2.38} \]

which for large values of $N$ (see Problem 9.12) can be approximated by

\[ \hat{\sigma}_m^2 = \hat{\sigma}_{m-1}^2 (1 - |k_{m-1}|^2) \tag{9.2.39} \]

where

\[ \hat{\sigma}_0^2 = \frac{1}{N} \sum_{n=0}^{N-1} |x(n)|^2 \tag{9.2.40} \]

The computations for the lattice estimation methods are summarized in Table 9.1, and the algorithms are implemented by the function [k,var] = aplatest(x,P).
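The stage-by-stage procedure with Burg's choice of reflection coefficient can be sketched for real-valued data as follows. This is an illustrative sketch, not the book's `aplatest` function: the name `burg_lattice` is hypothetical, real data and no windowing ($N_i = m$, $N_f = N-1$) are assumed, the variance is propagated with the large-$N$ recursion, and there is no guard against an all-zero input.

```python
import math

def burg_lattice(x, P):
    """Burg estimates of k_0, ..., k_{P-1} for real data (illustrative sketch)."""
    N = len(x)
    f = list(x)                       # f[n] = e^f_{m-1}(n), starts as x(n)
    b = list(x)                       # b[n] = e^b_{m-1}(n), starts as x(n)
    k = []
    var = [sum(v * v for v in x) / N]             # sigma_0^2
    for m in range(1, P + 1):
        rng_n = range(m, N)                       # no windowing: n = m..N-1
        beta = sum(f[n] * b[n - 1] for n in rng_n)
        Ef = sum(f[n] ** 2 for n in rng_n)
        Eb = sum(b[n - 1] ** 2 for n in rng_n)
        km = -beta / (0.5 * (Ef + Eb))            # Burg's estimate
        f_new, b_new = f[:], b[:]
        for n in rng_n:                           # lattice stage m
            f_new[n] = f[n] + km * b[n - 1]
            b_new[n] = b[n - 1] + km * f[n]
        f, b = f_new, b_new
        k.append(km)
        var.append(var[-1] * (1 - km * km))       # large-N variance recursion
    return k, var

# A pure sinusoid is (nearly) perfectly predictable with two stages, so the
# second reflection coefficient should come out close to 1.
x = [math.cos(0.2 * math.pi * n) for n in range(200)]
k, var = burg_lattice(x, 2)
print(k, var)
```

Because each coefficient is a ratio of a cross term to the average of two energies, its magnitude cannot exceed 1, which is why the resulting lattice corresponds to a minimum-phase model.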

TABLE 9.1 Algorithm for estimation of AP lattice parameters.

1. Input: $x(n)$ for $N_i \le n \le N_f$
2. Initialization
   a. $e_0^f(n) = e_0^b(n) = x(n)$.
   b. Compute $\beta_0^{fb}$, $E_0^f$, and $E_0^b$ from $x(n)$.
   c. Compute $k_0^{FP}$ and $k_0^{BP}$.
   d. Compute either $k_0^{IS}$ or $k_0^{B}$ from $k_0^{FP}$ and $k_0^{BP}$.
   e. Apply the first stage of the lattice to $x(n)$ using either $k_0^{IS}$ or $k_0^{B}$ to obtain $e_1^f(n)$ and $e_1^b(n)$.
3. For $m = 2, 3, \ldots, P$
   a. Compute $\beta_{m-1}^{fb}$, $E_{m-1}^f$, and $E_{m-1}^b$ from $e_{m-1}^f(n)$ and $e_{m-1}^b(n)$.
   b. Compute $k_{m-1}^{FP}$ and $k_{m-1}^{BP}$.
   c. Compute either $k_{m-1}^{IS}$ or $k_{m-1}^{B}$ from $k_{m-1}^{FP}$ and $k_{m-1}^{BP}$.
   d. Apply the $m$th stage of the lattice to $e_{m-1}^f(n)$ and $e_{m-1}^b(n)$ using either $k_{m-1}^{IS}$ or $k_{m-1}^{B}$ to obtain $e_m^f(n)$ and $e_m^b(n)$.
4. Output: Either $k_m^{IS}$ or $k_m^{B}$ for $m = 1, 2, \ldots, P$ and $e_m^f(n)$ and $e_m^b(n)$.

9.2.3 Maximum Entropy Method

We next show how LS all-pole modeling is related to Burg's method of maximum entropy. To this end, suppose that $x(n)$ is a normal, stationary process with zero mean. The $M$-dimensional complex-valued vector $\mathbf{x} \triangleq \mathbf{x}_M(n)$ obeys a normal distribution

\[ p(\mathbf{x}) = \frac{1}{\pi^M \det \mathbf{R}} \exp(-\mathbf{x}^H \mathbf{R}^{-1} \mathbf{x}) \tag{9.2.41} \]

where $\mathbf{R}$ is a Toeplitz correlation matrix. By definition, its entropy is given by

\[ H(\mathbf{x}) \triangleq -E\{\log p(\mathbf{x})\} = M \log \pi + \log(\det \mathbf{R}) + M \tag{9.2.42} \]

because $E\{\mathbf{x}^H \mathbf{R}^{-1} \mathbf{x}\} = M$. If the process $x(n)$ is regular, that is, $|k_m| < 1$ for all $m$, we have

\[ \det \mathbf{R} = \prod_{m=0}^{M-1} P_m \qquad \text{and} \qquad P_m = r(0) \prod_{j=1}^{m} (1 - |k_j|^2) \tag{9.2.43} \]

where $P_m = P_m^f = P_m^b$ (see Section 7.4). If we substitute (9.2.43) into (9.2.42), we obtain

\[ H(\mathbf{x}) = M \log \pi + M + M \log r(0) + \sum_{m=1}^{M-1} (M - m) \log(1 - |k_m|^2) \tag{9.2.44} \]

which expresses the entropy in terms of $r(0)$ and the PACS $k_m$, $1 \le m \le M \le \infty$ [recall that any parametric model can be specified by $r(0)$ and the PACS]. Suppose now that we are given the first $P + 1$ values $r(0), r(1), \ldots, r(P)$ of the autocorrelation sequence and we wish to find a model, by choosing the remaining values $r(l)$, $l > P$, so that the entropy is maximized. From (9.2.44), we see that the entropy is maximized if we choose $k_m = 0$ for $m > P$, that is, by modeling the process $x(n)$ by an AR($P$) model. In conclusion, among all regular Gaussian processes with the same first $P + 1$ autocorrelation values, the AR($P$) process has the maximum entropy. Any other choices for $k_m$, $m > P$, that satisfy the condition $|k_m| < 1$ lead to a valid extension of the autocorrelation sequence. The "extended" values $r(l)$, $l > P$, can be obtained by using the inverse Levinson-Durbin or the inverse Schur algorithm (see Chapter 7).

The relation between autoregressive modeling and the principle of maximum entropy, known as the maximum entropy method, was introduced by Burg (1967, 1975). We note that the above proof, given in Porat (1994), is different from the original proof provided by Burg (Burg 1975; Therrien 1992). An interesting discussion of various arguments in favor of and against the maximum entropy method can be found in Makhoul (1986).
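The maximization argument can be checked numerically from (9.2.44): every term $(M-m)\log(1-|k_m|^2)$ is nonpositive and vanishes only for $k_m = 0$. The sketch below is a small illustration; the function name `gaussian_entropy` and all numerical values are assumptions chosen for the demonstration.

```python
import math

def gaussian_entropy(r0, k, M):
    """Entropy (9.2.44); k[m-1] holds the PACS value k_m for m = 1..M-1."""
    H = M * math.log(math.pi) + M + M * math.log(r0)
    for m in range(1, M):
        H += (M - m) * math.log(1 - abs(k[m - 1]) ** 2)
    return H

M, r0 = 6, 2.0
k_given = [0.7, -0.4]                     # fixed by r(0), r(1), r(2), so P = 2

# Extension 1: k_m = 0 for m > P (the AR(2) extension).
H_ar = gaussian_entropy(r0, k_given + [0.0, 0.0, 0.0], M)
# Extension 2: another valid choice with |k_m| < 1 for m > P.
H_other = gaussian_entropy(r0, k_given + [0.3, -0.2, 0.1], M)

print(H_ar, H_other, H_ar > H_other)
```

Any nonzero value among the free PACS coefficients lowers the sum, so the AR($P$) extension gives the largest entropy among the valid extensions.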

9.2.4 Excitations with Line Spectra

When the excitation of a parametric model has a spectrum with lines at $L$ frequencies $\omega_m$, the spectrum of the output signal provides information about the frequency response of the model at these frequencies only. For simplicity, assume equidistant samples at frequencies $\omega_m = 2\pi m / L$, $0 \le m \le L - 1$. Given a set of values $R_x(e^{j\omega_m}) = |X(e^{j\omega_m})|^2$, we wish to find an AP($P$) model whose spectrum $\hat{R}_h(e^{j\omega})$ matches $R_x(e^{j\omega_m})$ at the given frequencies, by minimizing the criterion

\[ \tilde{E} = \frac{d_0}{L} \sum_{m=1}^{L} \frac{R_x(e^{j\omega_m})}{\hat{R}_h(e^{j\omega_m})} \tag{9.2.45} \]

which is the discrete version of (9.2.17) and $d_0$ is the gain of the model (see Section 4.2). The minimization of (9.2.45) with respect to the model parameters $\{a_k\}$ results in the Yule-Walker equations

\[ \sum_{k=0}^{P} a_k^* \tilde{r}(i - k) = \begin{cases} \tilde{E} & i = 0 \\ 0 & 1 \le i \le P \end{cases} \tag{9.2.46} \]

where

\[ \tilde{r}(l) = \frac{1}{L} \sum_{m=1}^{L} R_x(e^{j\omega_m}) e^{j\omega_m l} \tag{9.2.47} \]

For continuous spectra, linear prediction uses the autocorrelation

\[ r(l) = \frac{1}{2\pi} \int_{-\pi}^{\pi} R_x(e^{j\omega}) e^{j\omega l} \, d\omega \tag{9.2.48} \]

which is related to $\tilde{r}(l)$ by

\[ \tilde{r}(l) = \sum_{m=-\infty}^{\infty} r(l - Lm) \tag{9.2.49} \]

that is, $\tilde{r}(l)$ is an aliased version of $r(l)$. We have seen that linear prediction equates the autocorrelation of the AP($P$) model to the autocorrelation of the modeled signal for the first $P + 1$ lags. Hence, when we use linear prediction for a signal with line spectra, the autocorrelation of the all-pole model will be matched to $\tilde{r}(l) \ne r(l)$ and will always result in a model different from the original. Clearly, the correlation matching condition cannot compensate for the autocorrelation aliasing, which becomes more pronounced as $L$ decreases. This phenomenon, which is severe for voiced sounds with high pitch, is illustrated in Problem 9.13. A method that provides better estimates, by minimizing a discrete version of the Itakura-Saito error measure, has been developed for both AP and PZ models by El-Jaroudi and Makhoul (1991, 1989).
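The aliasing relation can be verified numerically for a spectrum whose autocorrelation is known in closed form. The sketch below uses an AR(1) spectrum as a stand-in; the pole value, the number of lines $L$, the truncation length, and all function names are illustrative assumptions rather than anything taken from the text.

```python
import cmath
import math

a, L = 0.5, 8      # AR(1) pole and number of spectral lines (illustrative)

def R(w):
    """Continuous AR(1) spectrum for x(n) = a x(n-1) + w(n), unit-variance w(n)."""
    return 1.0 / abs(1 - a * cmath.exp(-1j * w)) ** 2

def r(l):
    """Autocorrelation of the same AR(1) process."""
    return a ** abs(l) / (1 - a * a)

def r_tilde(l):
    """Autocorrelation computed from L spectral samples, as in (9.2.47)."""
    total = sum(R(2 * math.pi * m / L) * cmath.exp(2j * math.pi * m * l / L)
                for m in range(1, L + 1))
    return total.real / L

def r_aliased(l, terms=60):
    """Truncated version of the aliasing sum (9.2.49)."""
    return sum(r(l - L * m) for m in range(-terms, terms + 1))

for l in range(4):
    print(l, r(l), r_tilde(l), r_aliased(l))
```

The second and third columns differ by the aliasing terms, and the gap widens as `L` is reduced, which mirrors the remark that the effect becomes more pronounced for fewer spectral lines.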

9.3 ESTIMATION OF POLE-ZERO MODELS

The estimation of PZ($P$, $Q$) model parameters for $Q \ne 0$ leads to a nonlinear LS optimization problem. As a result, a vast number of suboptimum methods, with reduced computational complexity, have been developed to avoid this problem. For example, some techniques estimate the AP($P$) and AZ($Q$) parameters separately. However, today the availability of high-speed computers has made exact least-squares the method of choice. Since the nonlinear LS optimization with respect to a complex vector and its conjugate is inherently difficult, and since this optimization does not provide any additional insight into the solution technique, we assume, in this section, that the quantities are real-valued. Furthermore, real-world applications of pole-zero models almost always involve real-valued signals and systems. The extension to the complex-valued case is straightforward.

Consider the PZ($P$, $Q$) model

\[ x(n) = -\sum_{k=1}^{P} a_k x(n-k) + w(n) + \sum_{k=1}^{Q} d_k w(n-k) \tag{9.3.1} \]

where $w(n) \sim \mathrm{WN}(0, \sigma_w^2)$. Using vector notation, we can express (9.3.1) as

\[ x(n) = \mathbf{z}^T(n-1) \mathbf{c}_{pz} + w(n) \tag{9.3.2} \]

where

\[ \mathbf{z}(n) \triangleq [-x(n) \;\; \cdots \;\; -x(n-P+1) \;\; w(n) \;\; \cdots \;\; w(n-Q+1)]^T \tag{9.3.3} \]

and

\[ \mathbf{c}_{pz} = [\mathbf{a}^T \;\; \mathbf{d}^T]^T = [a_1 \;\; \cdots \;\; a_P \;\; d_1 \;\; \cdots \;\; d_Q]^T \tag{9.3.4} \]

9.3.1 Known Excitation

Assume for a moment that the excitation $w(n)$ is known. Then we can predict $x(n)$ from past values, using the following linear predictor

\[ \hat{x}(n) = \mathbf{z}^T(n-1) \mathbf{c} \tag{9.3.5} \]

where

\[ \mathbf{c} = [\hat{a}_1 \;\; \cdots \;\; \hat{a}_P \;\; \hat{d}_1 \;\; \cdots \;\; \hat{d}_Q]^T \tag{9.3.6} \]

are the predictor parameters. The prediction error

\[ e(n) = x(n) - \hat{x}(n) = x(n) - \mathbf{z}^T(n-1) \mathbf{c} \tag{9.3.7} \]

equals $w(n)$ if $\mathbf{c} = \mathbf{c}_{pz}$. Minimization of the total squared error

\[ E(\mathbf{c}) \triangleq \sum_{n=N_i}^{N_f} e^2(n) \tag{9.3.8} \]

leads to the following linear system of equations

\[ \hat{\mathbf{R}}_z \mathbf{c} = \hat{\mathbf{r}}_z \tag{9.3.9} \]

where

\[ \hat{\mathbf{R}}_z = \sum_{n=N_i}^{N_f} \mathbf{z}(n-1) \mathbf{z}^T(n-1) \tag{9.3.10} \]

and

\[ \hat{\mathbf{r}}_z = \sum_{n=N_i}^{N_f} \mathbf{z}(n-1) x(n) \tag{9.3.11} \]

Usually, we use residual windowing, which implies that $N_i = \max(P, Q)$ and $N_f = N - 1$. Since the matrix $\hat{\mathbf{R}}_z$ is symmetric and positive semidefinite, we can solve (9.3.9) using $LDL^H$ decomposition. Thus, if we know the excitation $w(n)$, the least-squares estimation of the PZ($P$, $Q$) model parameters reduces to the solution of a linear system of equations. An estimate of the input variance is given by

\[ \hat{\sigma}_w^2 = \frac{1}{N - \max(P, Q)} \sum_{n=\max(P,Q)}^{N-1} e^2(n) \tag{9.3.12} \]

This method, which is implemented by the function pzls.m, is known as the equation-error method and can be used to identify a system from input-output data (Ljung 1987) (see Problem 9.14).
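When both the output and the excitation are available, the equation-error fit is an ordinary linear least-squares problem. The sketch below illustrates it under assumptions: the function name `pz_equation_error` is hypothetical (it is not the book's `pzls.m`), the data are real-valued, residual windowing is used, the test signal is a simulated PZ(1, 1) process with arbitrarily chosen coefficients, and the system is solved with NumPy's least-squares routine instead of an $LDL^H$ factorization.

```python
import numpy as np

def pz_equation_error(x, w, P, Q):
    """Equation-error LS fit of a PZ(P, Q) model with known excitation w(n)."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    N = len(x)
    Ni = max(P, Q)                                  # residual windowing
    # Row for time n is z(n-1) = [-x(n-1) ... -x(n-P)  w(n-1) ... w(n-Q)].
    Z = np.column_stack(
        [-x[Ni - k:N - k] for k in range(1, P + 1)] +
        [w[Ni - k:N - k] for k in range(1, Q + 1)])
    target = x[Ni:]
    c, *_ = np.linalg.lstsq(Z, target, rcond=None)  # minimizes sum e^2(n)
    e = target - Z @ c
    var_w = np.sum(e ** 2) / (N - Ni)               # input variance estimate
    return c[:P], c[P:], var_w

# Simulated PZ(1, 1) data: x(n) = 0.8 x(n-1) + w(n) + 0.5 w(n-1),
# that is, a_1 = -0.8 and d_1 = 0.5 (illustrative values).
rng = np.random.default_rng(0)
N = 4000
w = rng.standard_normal(N)
x = np.zeros(N)
for n in range(1, N):
    x[n] = 0.8 * x[n - 1] + w[n] + 0.5 * w[n - 1]
a_hat, d_hat, var_hat = pz_equation_error(x, w, 1, 1)
print(a_hat, d_hat, var_hat)
```

The estimates are close to, but not exactly, the true parameters because the current excitation sample $w(n)$ plays the role of the regression noise.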

9.3.2 Unknown Excitation

In most applications, the excitation $w(n)$ is never known. However, we can obtain a good estimate of $x(n)$ by replacing $w(n)$ by $e(n)$ in (9.3.3). This makes a natural choice if the model used to obtain $e(n)$ is reasonably accurate. The prediction error is then given by

\[ e(n) = x(n) - \hat{x}(n) = x(n) - \hat{\mathbf{z}}^T(n-1) \mathbf{c} \tag{9.3.13} \]

where

\[ \hat{\mathbf{z}}(n) \triangleq [-x(n) \;\; \cdots \;\; -x(n-P+1) \;\; e(n) \;\; \cdots \;\; e(n-Q+1)]^T \tag{9.3.14} \]

If we write (9.3.13) explicitly

\[ e(n) = -\sum_{k=1}^{Q} \hat{d}_k e(n-k) + x(n) + \sum_{k=1}^{P} \hat{a}_k x(n-k) \tag{9.3.15} \]

we see that the prediction error is obtained by exciting the inverse model with the signal $x(n)$. Hence, the inverse model has to be stable. To satisfy this condition, we require the estimated model to be minimum-phase.

The recursive computation of $e(n)$ by (9.3.15) makes the prediction error a nonlinear function of the model parameters. To illustrate this, consider the prediction error for a first-order model, that is, for $P = Q = 1$

\[ e(n) = x(n) + \hat{a}_1 x(n-1) - \hat{d}_1 e(n-1) \]

Assuming $e(0) = 0$, we have for $n = 1, 2, 3$

\[ e(1) = x(1) + \hat{a}_1 x(0) \]

\[ e(2) = x(2) + \hat{a}_1 x(1) - \hat{d}_1 e(1) = x(2) + (\hat{a}_1 - \hat{d}_1) x(1) - \hat{a}_1 \hat{d}_1 x(0) \]

\[ e(3) = x(3) + \hat{a}_1 x(2) - \hat{d}_1 e(2) = x(3) + (\hat{a}_1 - \hat{d}_1) x(2) - (\hat{a}_1 - \hat{d}_1) \hat{d}_1 x(1) + \hat{a}_1 \hat{d}_1^2 x(0) \]

which shows that $e(n)$ is a nonlinear function of the model parameters if $Q \ne 0$. Thus, the total squared error

\[ E(\mathbf{c}) = \sum_{n=N_i}^{N_f} e^2(n) \tag{9.3.16} \]

expressed in terms of the signal values $x(0), x(1), \ldots, x(N-1)$, is a nonquadratic function of the model parameters. Sometimes, $E(\mathbf{c})$ has several local minima. The model parameters can be obtained by minimizing the total squared error using nonlinear optimization techniques.
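The first-order recursion and its closed-form expansion are easy to compare numerically. The sketch below is illustrative: the function names and the sample values are assumptions, and $e(0) = 0$ is imposed as in the text.

```python
def pz11_error(x, a1, d1):
    """Prediction error for P = Q = 1 via e(n) = x(n) + a1 x(n-1) - d1 e(n-1)."""
    e = [0.0] * len(x)                    # e(0) = 0
    for n in range(1, len(x)):
        e[n] = x[n] + a1 * x[n - 1] - d1 * e[n - 1]
    return e

def total_squared_error(x, a1, d1):
    """E(c) summed over n = 1..N-1, with c = [a1, d1]."""
    return sum(v * v for v in pz11_error(x, a1, d1)[1:])

x = [1.0, -0.5, 0.25, 2.0]                # arbitrary sample values
a1, d1 = 0.3, 0.7

e = pz11_error(x, a1, d1)
# Closed-form expansion of e(3) in terms of x(0), ..., x(3):
e3_closed = (x[3] + (a1 - d1) * x[2]
             - (a1 - d1) * d1 * x[1] + a1 * d1 ** 2 * x[0])
print(e[3], e3_closed)

# E(c) along d1 for fixed a1: the values do not follow a parabola in d1.
print([round(total_squared_error(x, a1, d), 4) for d in (0.0, 0.3, 0.6, 0.9)])
```

The products and powers of $\hat{d}_1$ in the expansion are what make $E(\mathbf{c})$ nonquadratic, which is why an iterative optimizer is needed as soon as $Q \ne 0$.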

9.3.3 Nonlinear Least-Squares Optimization We next outline such a technique that is based on the method of Gauss-Newton. More details can be found in Scales (1985); Luenberger (1984); and Gill, Murray, and Wright (1981). To this end, we expand the function E(c) as a Taylor series E(c0 + -c) = E(c0 ) + (-c)T ∇E(c0 ) + 12 (-c)T [∇ 2 E(c0 )](-c) + · · ·

where

∇E(c)

∂E ∂E ∂E ··· ∂c1 ∂c2 ∂cp+q

(9.3.17)

T (9.3.18)

is the vector of the first partial derivatives or gradient vector and ∇ 2 E(c), whose (i, j )th element is ∂ 2 E/(∂ci ∂cj ), is the (symmetric) matrix of second partial derivatives (Hessian matrix). The Taylor expansion of a quadratic function has only the first three terms. Indeed, for the known excitation case we have ∇E(c) = 2

Nf

ˆ z c) z(n − 1)e(n) = 2(rz − R

(9.3.19)

n=Ni

and

∇ 2 E(c) = 2

Nf

ˆz z(n − 1)zT (n − 1) = 2R

(9.3.20)

n=Ni

Higher-order terms are zero, and if c0 is the minimum, then ∇E(c0 ) = 0. In this case, (9.3.17) becomes ˆ z (-c) E(c0 + -c) = E(c0 ) + (-c)T R

which shows that if $\hat{\mathbf R}_z$ is positive definite, that is, $(\Delta\mathbf c)^T \hat{\mathbf R}_z (\Delta\mathbf c) \ge 0$, then any deviation from the minimum results in an increase in the total squared error. This relationship holds approximately for nonquadratic functions, as long as $\mathbf c_0$ is close to a minimum. Thus, if we are at a point $\mathbf c_i$ with total squared error $E(\mathbf c_i)$, we can move to a point $\mathbf c_{i+1}$ with total squared error $E(\mathbf c_{i+1}) \le E(\mathbf c_i)$ by moving in the direction of $-\nabla E(\mathbf c_i)$. This suggests the following iterative procedure

$$\mathbf c_{i+1} = \mathbf c_i - \mu_i \mathbf G_i \nabla E(\mathbf c_i) \tag{9.3.21}$$

where the positive scalar $\mu_i$ controls the length of the descent and matrix $\mathbf G_i$ modifies the direction of the descent, as is specified by the gradient vector. Various choices for these quantities lead to various optimization algorithms. For quadratic functions, choosing $\mathbf c_0 = \mathbf 0$, $\mu_0 = 1$, and $\mathbf G_0 = (2\hat{\mathbf R}_z)^{-1}$ (inverse of the Hessian matrix) gives $\mathbf c_1 = \hat{\mathbf R}_z^{-1} \hat{\mathbf r}_z$; that is, we find the unique minimum in one step. This provides the motivation for modifying the direction of the gradient using the inverse of the Hessian matrix, even for nonquadratic functions. This choice is justified as long as we are close to a minimum. Using (9.3.13), we compute the Hessian as follows

$$\nabla^2 E(\mathbf c) = \nabla[\nabla E(\mathbf c)]^T = 2 \sum_{n=N_i}^{N_f} \boldsymbol\psi(n) \boldsymbol\psi^T(n) + 2 \sum_{n=N_i}^{N_f} [\nabla \boldsymbol\psi^T(n)]\, e(n) \tag{9.3.22}$$

where

$$\boldsymbol\psi(n) \triangleq \nabla e(n) = \left[ \frac{\partial e(n)}{\partial \hat a_1} \;\cdots\; \frac{\partial e(n)}{\partial \hat a_P} \;\; \frac{\partial e(n)}{\partial \hat d_1} \;\cdots\; \frac{\partial e(n)}{\partial \hat d_Q} \right]^T \tag{9.3.23}$$

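We usually approximate the Hessian with the first summation in (9.3.22), that is,

$$\mathbf H = 2 \sum_{n=N_i}^{N_f} \boldsymbol\psi(n) \boldsymbol\psi^T(n) \tag{9.3.24}$$

Similarly, the gradient is given by

$$\nabla E(\mathbf c) \triangleq \mathbf v = 2 \sum_{n=N_i}^{N_f} \boldsymbol\psi(n)\, e(n) \tag{9.3.25}$$

If we set $\mathbf G = \mathbf H^{-1}$, the direction vector $\mathbf g = \mathbf G \mathbf v = \mathbf H^{-1} \mathbf v$ can be obtained by solving the following linear system of equations:

$$\mathbf H \mathbf g = \mathbf v \tag{9.3.26}$$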

Clearly, the factor 2 in the definitions of H and v does not affect the solution g, and can be dropped. Although the matrix H is guaranteed by (9.3.24) to be positive semidefinite, in practice it may be singular or close to singular. To avoid such problems in solving (9.3.26), we regularize the matrix by adding a small positive constant δ to its diagonal; that is, we approximate the Hessian by H + δI, where I is the identity matrix. This approach is known as the Levenberg-Marquardt regularization (Dennis and Schnabel 1983; Ljung 1987). We next compute the gradient ψ(n) = ∇e(n), using (9.3.23) and (9.3.15). Indeed, we have

$$\frac{\partial e(n)}{\partial \hat a_j} = x(n-j) - \sum_{k=1}^{Q} \hat d_k \frac{\partial e(n-k)}{\partial \hat a_j} \qquad j = 1, 2, \ldots, P \tag{9.3.27}$$

and

$$\frac{\partial e(n)}{\partial \hat d_j} = -e(n-j) - \sum_{k=1}^{Q} \hat d_k \frac{\partial e(n-k)}{\partial \hat d_j} \qquad j = 1, 2, \ldots, Q \tag{9.3.28}$$

FIGURE 9.13 Illustration of the capability of a PZ(4, 2) and AP(10) model to estimate the PSD of an ARMA(4, 2) process from a 300-sample segment. [Plot: power (dB) versus frequency (cycles/sampling interval) from 0 to 0.5; curves: Actual, AP(10), PZ(4, 2).]

Thus, the components of the gradient vector are obtained by driving the all-pole filter

$$H_\psi(z) = \frac{1}{D(z)} = \frac{1}{1 + \sum_{k=1}^{Q} \hat d_k z^{-k}} \tag{9.3.29}$$

with the signals x(n) and −e(n), respectively. This filter is stable if the estimated model is minimum-phase.

The above development leads to the following iterative algorithm, implemented in the Matlab function armals.m, which computes the parameters of a PZ(P, Q) model from the data x(0), x(1), . . . , x(N − 1) by minimizing the LS error. The LS pole-zero modeling algorithm consists of the following steps:

1. Fit an AP(P + Q) model to the data, using the no-windowing LS method, and compute the prediction error e(n) (see Section 9.2).
2. Fit a PZ(P, Q) model to the data {x(n), e(n), 0 ≤ n ≤ N − 1}, using the known excitation method. Convert the model to minimum-phase, if necessary. Use Equations (9.3.9) to (9.3.11).
3. Start the iterative minimization procedure, which involves the following steps:
   a. Compute the gradient ψ(n), using (9.3.27) and (9.3.28).
   b. Compute the Hessian H and the gradient v, using (9.3.24) and (9.3.25).
   c. Solve (9.3.26) to compute the search vector g. If necessary, use the Levenberg-Marquardt regularization technique.
   d. For µ = 1, 1/2, . . . , 1/10, compute c ← c − µg, which is the update (9.3.21) with G = H^{-1}; convert the model to minimum-phase, if necessary; and compute the corresponding value of E(c). Choose the value of c that gives the smallest total squared error.†
   e. Stop if E(c) does not change significantly or if a certain number of iterations has been exceeded.
4. Compute the estimate of the input variance, using (9.3.15) and (9.3.12).

† This approach was suggested in Ljung (1987), problem 10S.1.

The application of the LS PZ(P, Q) model estimation algorithm is illustrated in Figure 9.13, which shows the actual PSD of a PZ(4, 2) model and the estimated PSDs, using an LS PZ(4, 2) and an AP(10) model fitted to a 300-sample segment of the output process. We notice that, in contrast to the PZ model, the AP model does not provide a good match at the spectral zero. More details are provided in Problem 9.15.
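The iterative part of the algorithm (step 3) can be sketched in Python for the simplest case P = Q = 1. This is not the armals.m code. The function names, the regularization constant, the stopping tolerance, and the synthetic test signal are all assumptions made for illustration, and the initial guess is a plain first-order all-pole fit rather than the output of steps 1 and 2. The step is taken against the gradient, as in (9.3.21).

```python
import random


def run_filters(x, a, d):
    """Residual e(n) and gradient signals for P = Q = 1, zero initial conditions.

    e(n)     = x(n) + a*x(n-1) - d*e(n-1)
    psi_a(n) = x(n-1)  - d*psi_a(n-1)      (9.3.27)
    psi_d(n) = -e(n-1) - d*psi_d(n-1)      (9.3.28)
    """
    N = len(x)
    e, psi_a, psi_d = [0.0] * N, [0.0] * N, [0.0] * N
    for n in range(1, N):
        e[n] = x[n] + a * x[n - 1] - d * e[n - 1]
        psi_a[n] = x[n - 1] - d * psi_a[n - 1]
        psi_d[n] = -e[n - 1] - d * psi_d[n - 1]
    return e, psi_a, psi_d


def total_squared_error(x, a, d):
    e, _, _ = run_filters(x, a, d)
    return sum(v * v for v in e)


def gauss_newton_pz11(x, a, d, max_iter=25, delta=1e-6, tol=1e-10):
    E = total_squared_error(x, a, d)
    for _ in range(max_iter):
        e, pa, pd = run_filters(x, a, d)
        # H and v of (9.3.24)-(9.3.25) with the common factor 2 dropped;
        # delta on the diagonal is the Levenberg-Marquardt regularization.
        haa = sum(p * p for p in pa) + delta
        hdd = sum(p * p for p in pd) + delta
        had = sum(p * q for p, q in zip(pa, pd))
        va = sum(p * r for p, r in zip(pa, e))
        vd = sum(p * r for p, r in zip(pd, e))
        det = haa * hdd - had * had
        ga = (hdd * va - had * vd) / det      # solve H g = v (2 x 2 case)
        gd = (haa * vd - had * va) / det
        best = (E, a, d)
        for k in range(1, 11):                # mu = 1, 1/2, ..., 1/10
            a_try, d_try = a - ga / k, d - gd / k
            if abs(d_try) > 1.0:              # reflect the zero inside the unit circle
                d_try = 1.0 / d_try
            E_try = total_squared_error(x, a_try, d_try)
            if E_try < best[0]:
                best = (E_try, a_try, d_try)
        improvement = E - best[0]
        E, a, d = best
        if improvement <= tol * E:            # no significant change: stop
            break
    return a, d, E


# Synthetic PZ(1, 1) data: x(n) = 0.8 x(n-1) + w(n) + 0.5 w(n-1),
# that is, a1 = -0.8 and d1 = 0.5 in the sign convention of the text.
rng = random.Random(0)
N = 2000
w = [rng.gauss(0.0, 1.0) for _ in range(N)]
x = [w[0]] + [0.0] * (N - 1)
for n in range(1, N):
    x[n] = 0.8 * x[n - 1] + w[n] + 0.5 * w[n - 1]

# Initial guess: first-order all-pole fit, no zero.
a0 = -sum(x[n] * x[n - 1] for n in range(1, N)) / sum(v * v for v in x)
E0 = total_squared_error(x, a0, 0.0)
a_hat, d_hat, E_hat = gauss_newton_pz11(x, a0, 0.0)
print(a_hat, d_hat, E0, E_hat)
```

Because each pass keeps the current point unless one of the trial steps lowers E(c), the total squared error never increases from one iteration to the next.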

9.4 APPLICATIONS

Pole-zero modeling has many applications in such fields as spectral estimation, speech processing, geophysics, biomedical signal processing, and general time series analysis and forecasting (Marple 1987; Kay 1988; Robinson and Treitel 1980; Box, Jenkins, and Reinsel 1994). In this section, we discuss the application of pole-zero models to spectral estimation and speech processing.

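9.4.1 Spectral Estimation

After we have estimated the parameters of a PZ model, we can compute the PSD of the analyzed process by

$$\hat R(e^{j\omega}) = \hat\sigma_w^2\, \frac{\left| 1 + \sum_{k=1}^{Q} \hat d_k e^{-j\omega k} \right|^2}{\left| 1 + \sum_{k=1}^{P} \hat a_k e^{-j\omega k} \right|^2} \tag{9.4.1}$$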
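As an illustration of (9.4.1), the following Python sketch evaluates the PSD of a pole-zero model on a frequency grid. The function name and the parameter values are hypothetical. The polynomials are evaluated at z = e^{jω}, consistent with D(z) in (9.3.29); an all-pole model is obtained by passing an empty list of zero parameters.

```python
import cmath
import math


def pz_psd(a, d, var_w, omega):
    """Evaluate var_w * |D(e^{jw})|^2 / |A(e^{jw})|^2 at one frequency omega (rad).

    a = [a_1, ..., a_P] and d = [d_1, ..., d_Q] are the estimated parameters.
    """
    num = 1 + sum(dk * cmath.exp(-1j * omega * (k + 1)) for k, dk in enumerate(d))
    den = 1 + sum(ak * cmath.exp(-1j * omega * (k + 1)) for k, ak in enumerate(a))
    return var_w * abs(num) ** 2 / abs(den) ** 2


# Hypothetical PZ(1, 1) model: A(z) = 1 - 0.8 z^{-1}, D(z) = 1 + 0.5 z^{-1}.
a, d, var_w = [-0.8], [0.5], 1.0
omegas = [math.pi * k / 8 for k in range(9)]          # 0, pi/8, ..., pi
psd_db = [10 * math.log10(pz_psd(a, d, var_w, w)) for w in omegas]
print(psd_db)

psd_at_zero = pz_psd(a, d, var_w, 0.0)       # (1.5 / 0.2)^2 = 56.25
psd_at_pi = pz_psd(a, d, var_w, math.pi)     # (0.5 / 1.8)^2
```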

In practice, we mainly use AP models because (1) the all-zero PSD estimator is essentially identical to the Blackman-Tukey one (see Problem 9.16) and (2) the application of pole-zero PSD estimators is limited by computational and other practical difficulties. Also, any continuous PSD can be approximated arbitrarily well by the PSD of an AP(P) model if P is chosen large enough (Anderson 1971). However, in practice, the value of P is limited by the amount of available data (usually P < N/3). The statistical properties of all-pole PSD estimators are difficult to obtain; however, it has been shown that the estimator is consistent only if the analyzed process is AR(P0) with P0 ≤ P. Furthermore, the quality of the estimator degrades if the process is contaminated by noise. More details about pole-zero PSD estimation can be found in Kay (1988), Porat (1994), and Percival and Walden (1993).

The performance of all-pole PSD estimators depends on the method used to estimate the model parameters, the order of the model, and the presence of noise. The effect of model mismatch is shown in Figure 9.13 and is further investigated in Problem 9.17. Order selection in all-pole PSD estimation is critical: If P is too large, the obtained PSD exhibits spurious peaks; if P is too small, the structure of the PSD is smoothed over.

The increased resolution of the parametric techniques, compared to the nonparametric PSD estimation methods, is basically the result of imposing structure on the data (i.e., a model). The model makes possible the extrapolation of the ACS, which in turn leads to better resolution. However, if the adopted model is inaccurate, that is, if it does not match the data, then the “gained” resolution reflects the model and not the data! As a result, despite their popularity and their “success” with simulated signals, the application of parametric PSD estimation techniques to actual experimental data is rather limited.
Figure 9.14 shows the results of a Monte Carlo simulation of various all-pole PSD estimation techniques. We see that, except for the windowing approach that results in a significant loss of resolution, all other techniques have similar performance. However, we should mention that the forward/backward LS all-pole modeling method is considered to provide the best results (Marple 1987).

FIGURE 9.14 Monte Carlo simulation for the comparison of all-pole PSD estimation techniques, using 50 realizations of a 50-sample segment from an AR(4) process using fourth-order AP models. [Panels: True AR(4) PSD, LS, Yule-Walker, Forward/backward LS, Burg, Itakura-Saito; each panel plots power (dB) versus frequency (cycles/sampling interval) from 0 to 0.5.]

In practice, it is our experience that the best way to estimate the PSD of an actual signal is to combine parametric prewhitening with nonparametric PSD estimation methods. The process is illustrated in Figure 9.15 and involves the following steps:

1. Fit an AP(P) model to the data using the forward LS, forward/backward LS, or Burg’s method with no windowing.
2. Compute the residual (prediction error)

$$e(n) = x(n) + \sum_{k=1}^{P} a_k^* x(n-k) \qquad P \le n \le N-1 \tag{9.4.2}$$

   and then compute and plot its ACS, PACS, and cumulative periodogram (see Figure 9.2) to see if it is reasonably white. The goal is not to completely whiten the residual but to reduce its spectral dynamic range, that is, to increase its spectral flatness to avoid spectral leakage.
3. Compute the PSD $\hat R_e(e^{j\omega_k})$, using one of the nonparametric techniques discussed in Chapter 5.
4. Compute the PSD of x(n) by

$$\hat R_x(e^{j\omega_k}) = \frac{\hat R_e(e^{j\omega_k})}{|A(e^{j\omega_k})|^2} \tag{9.4.3}$$

   that is, by applying postcoloring to “undo” the prewhitening.

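FIGURE 9.15 Block diagram of nonparametric PSD estimation using linear prediction prewhitening. [Blocks: frame blocking of x(n) into N samples; computation of the AP model of order P; prediction error filter A(z) producing e(n); nonparametric PSD estimation producing $\hat R_e(e^{j\omega_k})$; postcoloring by $1/|A(e^{j\omega_k})|^2$ producing $\hat R_x(e^{j\omega_k})$.]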
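The four prewhitening and postcoloring steps can be sketched in Python as follows. This is a minimal illustration for real-valued data under stated assumptions: all function names are hypothetical, the model is fitted by forward LS without windowing (solving the normal equations directly), the plain periodogram stands in for the nonparametric estimator of step 3, and the AR(2) test signal is an arbitrary choice.

```python
import cmath
import math
import random


def solve_linear(A, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    M = [list(row) + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    sol = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = M[r][n] - sum(M[r][c] * sol[c] for c in range(r + 1, n))
        sol[r] = s / M[r][r]
    return sol


def fit_ap_forward_ls(x, P):
    """Step 1: minimize the sum of e(n)^2 over P <= n <= N-1 (no windowing)."""
    N = len(x)
    R = [[sum(x[n - i] * x[n - j] for n in range(P, N)) for j in range(1, P + 1)]
         for i in range(1, P + 1)]
    r = [sum(x[n] * x[n - i] for n in range(P, N)) for i in range(1, P + 1)]
    return solve_linear(R, [-v for v in r])       # normal equations R a = -r


def prediction_error(x, a):
    """Step 2: e(n) = x(n) + sum_k a_k x(n-k) for P <= n <= N-1, as in (9.4.2)."""
    P = len(a)
    return [x[n] + sum(a[k] * x[n - 1 - k] for k in range(P))
            for n in range(P, len(x))]


def periodogram(e, omegas):
    """Step 3: a simple nonparametric PSD estimate of the residual."""
    L = len(e)
    return [abs(sum(e[n] * cmath.exp(-1j * w * n) for n in range(L))) ** 2 / L
            for w in omegas]


def postcolor(Re, a, omegas):
    """Step 4: divide by |A(e^{jw})|^2, as in (9.4.3)."""
    out = []
    for Rek, w in zip(Re, omegas):
        A = 1 + sum(ak * cmath.exp(-1j * w * (k + 1)) for k, ak in enumerate(a))
        out.append(Rek / abs(A) ** 2)
    return out


# Test signal: x(n) = 1.5 x(n-1) - 0.8 x(n-2) + w(n), so a = [-1.5, 0.8].
rng = random.Random(1)
N = 1000
x = [0.0, 0.0]
for n in range(2, N):
    x.append(1.5 * x[n - 1] - 0.8 * x[n - 2] + rng.gauss(0.0, 1.0))

a_hat = fit_ap_forward_ls(x, 2)
e = prediction_error(x, a_hat)
omegas = [math.pi * k / 32 for k in range(33)]
Re = periodogram(e, omegas)
Rx = postcolor(Re, a_hat, omegas)
var_e = sum(v * v for v in e) / len(e)
print(a_hat, var_e)
```

In practice the periodogram would be replaced by an averaged or smoothed estimator, since the point of prewhitening is to make such estimators less prone to leakage.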

The main goal of AP modeling here is to reduce the spectral dynamic range to avoid leakage. In other words, we need a good linear predictor regardless of whether the process is true AR(P). Therefore, very accurate order selection and model fit are not critical, because all spectral structure not captured by the model is still in the residuals. Needless to say, if the periodogram of x(n) has a small dynamic range, we do not need prewhitening. Another interesting application of prewhitening is for the detection of outliers in practical data (Martin and Thomson 1982).

EXAMPLE 9.4.1. To illustrate the effectiveness of the above prewhitening and postcoloring method, consider the AR(4) process x(n) used in Example 9.2.3. This process has a large dynamic range, and hence the nonparametric methods such as Welch’s periodogram averaging method will suffer from leakage problems. Using the system function of the model

$$H(z) = \frac{1}{A(z)} = \frac{1}{1 - 2.7607z^{-1} + 3.8106z^{-2} - 2.6535z^{-3} + 0.9238z^{-4}}$$

and WGN(0, 1) input sequence, we generated 256 samples of x(n). These samples were then used to obtain the all-pole LS predictor coefficients using the arwin function. The spectrum $|A(e^{j\omega})|^{-2}$ corresponding to this estimated model is shown in Figure 9.16 as a dashed curve. The signal samples were prewhitened using the model to obtain the residuals e(n). The nonparametric PSD estimate $\hat R_e(e^{j\omega})$ of e(n) was computed by using Welch’s method with L = 64 and 50 percent overlap. Finally, $\hat R_e(e^{j\omega})$ was postcolored using the spectrum $|A(e^{j\omega})|^{-2}$ to obtain $\hat R_x(e^{j\omega})$, which is shown in Figure 9.16 as a solid line. For comparison purposes, the Welch PSD estimate of x(n) is also shown as a dotted line. As expected, the nonparametric estimate does not resolve the two peaks in the true spectrum and suffers from leakage at high frequencies. However, the combined nonparametric and parametric estimate resolves two peaks with ease and also follows the true spectrum quite well. Therefore, the use of the parametric method as a preprocessor is highly recommended, especially in large-dynamic-range situations.

FIGURE 9.16 Spectral estimation of AR(4) process using prewhitening and postcoloring method in Example 9.4.1. [Plot titled “PSD estimation of AR(4) signal”: power (dB) versus frequency (cycles/sampling interval); curves: Prewhiten/postcolor, AR(4) PSD, Welch PSD.]

9.4.2 Speech Modeling

All-pole modeling using LS linear prediction is widely employed in speech processing applications because (1) it provides a good approximation to the vocal tract for voiced sounds and adequate approximation for unvoiced and transient sounds, (2) it results in a good separation between source (fine spectral structure) and vocal tract (spectral envelope), and (3) it is analytically tractable and leads to efficient software and hardware implementations. Figure 9.17 shows a typical AP modeling system, also known as the linear predictive coding (LPC) processor, that is used in speech synthesis, coding, and recognition applications.

FIGURE 9.17 Block diagram of an AP modeling processor for speech coding and recognition. [Blocks: preemphasis of x(n); frame blocking (N, N0); windowing by w(n); autocorrelation computation producing r(l); Levinson-Durbin or Schur algorithm (order P) producing {a_k} and {k_m}; LPC parameter conversion.]

The processor operates in a block processing mode; that is, it processes a frame of N samples and computes a vector of model parameters using the following basic steps:

1. Preemphasis. The digitized speech signal is filtered by the high-pass filter

$$H_1(z) = 1 - \alpha z^{-1} \qquad 0.9 \le \alpha \le 1 \tag{9.4.4}$$

   to reduce the dynamic range of the spectrum, that is, to flatten the spectral envelope, and make subsequent processing less sensitive to numerical problems (Makhoul 1975a). Usually α = 0.95, which results in about a 32 dB boost in the spectrum at ω = π over that at ω = 0. The preemphasizer can be made adaptive by setting α = ρ(1), where ρ(l) is the normalized autocorrelation of the frame, which corresponds to a first-order optimum prediction error filter.
2. Frame blocking. Here the preemphasized signal is blocked into frames of N samples with successive frames overlapping by N0 ≈ N/3 samples. In speech recognition N = 300 with a sampling rate Fs = 6.67 kHz, which corresponds to 45-ms frames overlapping by 15 ms.
3. Windowing. Each frame is multiplied by an N-sample window (usually Hamming) to smooth the discontinuities at the beginning and the end of the frame.
4. Autocorrelation computation. Here the LPC processor computes the first P + 1 values of the autocorrelation sequence. Usually, P = 8 in speech recognition and P = 12 in speech coding applications. The value of r(0) provides the energy of the frame, which is useful for speech detection.
5. LPC analysis. In this step the processor uses the P + 1 autocorrelations to compute an LPC parameter set for each speech frame. Depending on the required parameters, we can use the algorithm of Levinson-Durbin or the algorithm of Schur. The most widely used parameters are

   $a_m = a_m^{(P)}$                                                      LPC coefficients
   $k_m$                                                                  PACS
   $g_m = \frac{1}{2}\log\frac{1 - k_m}{1 + k_m} = -\tanh^{-1} k_m$       log area ratio coefficients
   $c(m)$                                                                 cepstral coefficients
   $\omega_m$                                                             line spectrum pairs

where 1 ≤ m ≤ P, except for the cepstrum, which is computed up to about 3P/2. The line spectrum pair parameters, which are pole angles of the singular filters, were discussed in Section 2.5.8, and their application to speech processing is considered in Furui (1989). The log area ratio and the line spectrum pair coefficients have good quantization properties and are used for speech coding (Rabiner and Schafer 1978; Furui 1989); the cepstral coefficients provide an excellent discriminant for speech and speaker recognition applications (Rabiner and Juang 1993; Mammone et al. 1996).

AP models are extensively used for the modeling of speech sounds. However, the AP model does not provide an accurate description of the speech spectral envelope when the speech production process resembles a PZ system (Atal and Schroeder 1978). This can happen when (1) the nasal tract is coupled to the main vocal tract through the velar opening, for example, during the generation of nasals and nasalized sounds, (2) the source of excitation is not at the glottis but is in the interior of the vocal tract (Flanagan 1972), and (3) the transmission or recording channel has zeros in its response. Although a zero can be approximated with arbitrary precision by a number of poles, this approximation is usually inefficient and leads to spectral distortion and other problems. These problems can be avoided by using pole-zero modeling, as illustrated in the following example. More details about pole-zero speech modeling can be found in Atal and Schroeder (1978).

Figure 9.18(a) shows a Hamming-windowed segment from an artificial nasal speech signal sampled at Fs = 10 kHz. According to acoustic theory, such sounds require both poles and zeros in the vocal tract system function. Before the fitting of the model, the data are passed through a preemphasis filter with α = 0.95.
Figure 9.18(b) shows the periodogram of the speech segment, the spectrum of an AP(16) model using data windowing, and the spectrum of a PZ(12, 6) model using the least-squares algorithm described in Section 9.3.3 (see Problem 9.18 for details). We see that the pole-zero model matches zeros (“valleys”) in the periodogram of the data better than other models do.
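Returning to the LPC processor of this section, the LPC analysis step can be sketched in Python. The code below is an illustration for real-valued autocorrelations, not the processor's actual implementation: the function names are hypothetical, and it assumes the convention A(z) = 1 + a_1 z^{-1} + · · · + a_P z^{-P} with the reflection coefficient of stage m taken as the last coefficient of the mth-order predictor, which may differ in sign or indexing from the convention used elsewhere in the book. It also evaluates the preemphasis boost at ω = π relative to ω = 0 for the filter in (9.4.4).

```python
import math


def levinson_durbin(r, P):
    """Levinson-Durbin recursion on autocorrelations r[0], ..., r[P].

    Returns the order-P coefficients [a_1, ..., a_P], the reflection
    coefficients [k_1, ..., k_P], and the final prediction error power.
    """
    a, ks, E = [], [], r[0]
    for m in range(1, P + 1):
        acc = r[m] + sum(a[i] * r[m - 1 - i] for i in range(m - 1))
        k = -acc / E
        # Order update: a_i <- a_i + k * a_{m-i}, then append a_m = k.
        a = [a[i] + k * a[m - 2 - i] for i in range(m - 1)] + [k]
        E *= 1.0 - k * k
        ks.append(k)
    return a, ks, E


def log_area_ratios(ks):
    """g_m = (1/2) log((1 - k_m) / (1 + k_m)) = -atanh(k_m), for |k_m| < 1."""
    return [0.5 * math.log((1.0 - k) / (1.0 + k)) for k in ks]


def preemphasis_boost_db(alpha):
    """Gain of 1 - alpha z^{-1} at omega = pi relative to omega = 0, in dB."""
    return 20.0 * math.log10((1.0 + alpha) / (1.0 - alpha))


# Autocorrelation of a first-order process, r(l) = 0.9^l: a single pole suffices,
# so only the first reflection coefficient is nonzero.
r = [0.9 ** l for l in range(4)]
a, ks, E = levinson_durbin(r, 3)
g = log_area_ratios(ks)
boost = preemphasis_boost_db(0.95)
print(a, ks, g, E, boost)
```

The preemphasis figure printed for α = 0.95 is close to 32 dB, in agreement with the value quoted in step 1.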

FIGURE 9.18 (a) Speech segment and (b) periodogram, spectrum of a data windowing-based AP(16) model, and spectrum of a residual windowing-based PZ(12, 6) model. [Panel (a): amplitude versus time (ms); panel (b): power (dB) versus frequency (kHz) from 0 to 5; curves: Periodogram, AP(16), PZ(12, 6).]

9.5 MINIMUM-VARIANCE SPECTRUM ESTIMATION

The spectral estimation methods discussed in Chapter 5 are based on the discrete Fourier transform (DFT) and are data-independent; that is, the processing does not depend on the actual values of the samples to be analyzed. Window functions can be employed to cut down on sidelobe leakage, at the expense of resolution. These methods have, as a rule of thumb, an approximate resolution of Δf ≈ 1/N cycles per sampling interval. Thus, for all these methods, resolution performance is limited by the number of available data samples N. This problem is only accentuated when the data must be subdivided into segments to reduce the variance of the spectrum estimate by averaging periodograms. The effective resolution is then on the order of 1/M, where M is the window length of the segments. For many applications the amount of data available for spectrum estimation may be limited either because the signal may only be considered stationary over limited intervals of time or may only be collected over a short finite interval. Many times, it may be necessary to resolve spectral peaks that are spaced closer than the 1/M limit imposed by the amount of data available.

All the DFT-based methods use a predetermined, fixed processing that is independent of the values of the data. However, there are methods, termed data-adaptive spectrum estimation (Lacoss 1971), that can exploit actual characteristics of the data to offer significant improvements over the data-independent, DFT-based methods, particularly in the case of limited data samples. Minimum-variance spectral estimation is one such technique (Capon 1969). Like the methods from Chapter 5, the minimum-variance spectral estimator is nonparametric; that is, it does not assume an underlying model for the data. However, the spectral estimator adapts itself to the characteristics of the data in order to reject as much out-of-band energy, that is, leakage, as possible. In addition, minimum-variance spectral estimation provides improved resolution, better than the Δf ≈ 1/N associated with the DFT-based methods. As a result, the minimum-variance method is commonly referred to as a high-resolution spectral estimator. Note that model-based data-adaptive methods, such as the LS all-pole method, also have high resolving capabilities when the model adequately represents the data.

Theory

We derive the minimum-variance spectral estimator by using a filter bank structure in which each of the filters adapts its response to the data. Recall that the goal of a power spectrum estimator is to determine the power content of a signal at a certain frequency. To this end, we would like to measure $R(e^{j2\pi f})$ at the frequency of interest only and not have our estimate influenced by energy present at other frequencies.
Thus, we might interpret spectral estimation as a methodology for determining the ideal, frequency-selective filter for each frequency. Recall the filter bank interpretation of a power spectral estimator from Chapter 5. This ideal filter for $f_k$ should pass energy within its bandwidth Δf but reject all other energy, that is,

$$|H_k(e^{j2\pi f})|^2 = \begin{cases} \dfrac{1}{\Delta f} & |f - f_k| \le \dfrac{\Delta f}{2} \\[1ex] 0 & \text{otherwise} \end{cases} \tag{9.5.1}$$

where the factor Δf ∼ 1/M accounts for the filter bandwidth.† Therefore, the filter does not impart a gain across the bandwidth of the filter, and the output of the filter is a measure of power in the frequency band around $f_k$. However, since such an ideal filter does not exist in practice, we need to design one that passes energy at the center frequency while rejecting as much out-of-band energy as possible.

† A similar normalization was performed for all the DFT-based methods. Note that the same is not true of a sinusoidal signal that has zero bandwidth. See Example 9.5.2.

A filter bank-based spectral estimator should have filters at all frequencies of interest. The filters have equal spacing in frequency, spanning the fundamental frequency range −1/2 ≤ f < 1/2. Let us denote the total number of frequencies by K and the center frequency of the kth filter as

$$f_k = \frac{k-1}{K} - \frac{1}{2} \tag{9.5.2}$$

for k = 1, 2, . . . , K. The output of the kth filter is the convolution of the signal x(n) with the impulse response of the filter $h_k(n)$, which can also be expressed in vector form as

$$y_k(n) = h_k(n) * x(n) = \sum_{m=0}^{M-1} h_k(m)\, x(n-m) = \mathbf c_k^H \mathbf x(n) \tag{9.5.3}$$

where

$$\mathbf c_k = [h_k^*(0) \;\; h_k^*(1) \;\; \cdots \;\; h_k^*(M-1)]^T \tag{9.5.4}$$

is the impulse response of the kth filter, and

$$\mathbf x(n) = [x(n) \;\; x(n-1) \;\; \cdots \;\; x(n-M+1)]^T \tag{9.5.5}$$

is the input data vector. In addition, we define the frequency vector v(f) as a vector of complex exponentials at frequency f within the time-window vector from (9.5.5)

$$\mathbf v(f) = [1 \;\; e^{-j2\pi f} \;\; \cdots \;\; e^{-j2\pi f(M-1)}]^T \tag{9.5.6}$$

When the frequency vector v(f) is chosen as the filter weight vector in (9.5.4), then the filter will pass signals at frequency f. Note that if we have $\mathbf c_k = \mathbf v(f_k)$, then the resulting filter bank performs a DFT since v(f) is a column vector in the DFT matrix. Thus, all the DFT-based methods, when interpreted using a filter bank structure, use a form of v(f), possibly with a window, as filter weights. See Chapter 5 for the filter bank interpretation of the DFT. The output $y_k(n)$ of the kth filter should ideally give an estimate of the power spectrum at $f_k$. The output power of the kth filter is

$$E\{|y_k(n)|^2\} = \mathbf c_k^H \mathbf R_x \mathbf c_k \tag{9.5.7}$$

where $\mathbf R_x = E\{\mathbf x(n)\mathbf x^H(n)\}$ is the correlation matrix of the input data vector from (9.5.5). Since the ideal filter response from (9.5.1) cannot be realized, we instead constrain our filter $\mathbf c_k$ to have a response at the center frequency $f_k$ of

$$H_k(f_k) = |\mathbf c_k^H \mathbf v(f_k)|^2 = M \tag{9.5.8}$$

This constraint ensures that the center frequency of our bandpass filter is at the frequency $f_k$. To eliminate as much out-of-band energy as possible, the filter is formulated as the filter that minimizes its output power subject to the center frequency constraint in (9.5.8), that is,

$$\min_{\mathbf c_k} \; \mathbf c_k^H \mathbf R_x \mathbf c_k \quad \text{subject to} \quad \mathbf c_k^H \mathbf v(f_k) = \sqrt{M} \tag{9.5.9}$$

This constraint requires the filter to have a response of $\sqrt{M}$ to a frequency vector from (9.5.6) at the frequency of interest while rejecting (minimizing) energy from all other frequencies. The solution to this constrained optimization problem can be found via Lagrange multipliers (see Appendix B and Problem 9.22) to be

$$\mathbf c_k = \frac{\sqrt{M}\, \mathbf R_x^{-1} \mathbf v(f_k)}{\mathbf v^H(f_k) \mathbf R_x^{-1} \mathbf v(f_k)} \tag{9.5.10}$$

By substituting (9.5.10) into (9.5.3), we obtain the output of the kth filter. The power of this signal, from (9.5.7), is the minimum-variance spectral estimate

$$\hat R_M^{(mv)}(e^{j2\pi f_k}) = E\{|y_k(n)|^2\} = \frac{M}{\mathbf v^H(f_k) \mathbf R_x^{-1} \mathbf v(f_k)} \tag{9.5.11}$$

where the subscript M denotes the length of the data vector used to compute the spectral estimate. Note that in order to compute the minimum-variance spectral estimate, we need to find the inverse of the correlation matrix, which is a Toeplitz matrix since x(n) is stationary. Efficient techniques for computing the inverse of a Toeplitz matrix were discussed in Chapter 7.

Implementation

A spectral estimator attempts to determine the power of a random process as a function of frequency based on a finite set of observations. Since the minimum-variance estimate of the spectrum involves the correlation matrix of the input data vector, which is unknown in practice, the correlation matrix must be estimated from the data. An estimate of the M × M correlation matrix, known as the sample correlation matrix, is given by†

$$\hat{\mathbf R}_x = \frac{1}{N-M+1} \mathbf X^H \mathbf X \tag{9.5.12}$$

where

$$\mathbf X^H = [\mathbf x(M) \;\; \mathbf x(M+1) \;\; \cdots \;\; \mathbf x(N)] = \begin{bmatrix} x(M) & x(M+1) & \cdots & x(N) \\ x(M-1) & x(M) & \cdots & x(N-1) \\ \vdots & \vdots & \ddots & \vdots \\ x(1) & x(2) & \cdots & x(N-M+1) \end{bmatrix} \tag{9.5.13}$$

is the data matrix formed from x(n) for 0 ≤ n ≤ N − 1. Any of the other methods of forming a data matrix discussed in Chapter 8 can also be employed. Note that the data matrix in (9.5.13) does not produce a Toeplitz matrix $\hat{\mathbf R}_x$ in (9.5.12), though other methods from Chapter 8 will produce a Toeplitz sample correlation matrix.

† We have normalized by N − M + 1, the number of realizations of the time-window vector x(n) in the data matrix X. This normalization is necessary so that the output of the filter bank corresponds to an estimate of power.

An estimate of the spectrum based on the sample correlation matrix is found by substituting $\hat{\mathbf R}_x$ for the true correlation matrix $\mathbf R_x$ in (9.5.11). Note that, in practice, the sample correlation matrix is not actually computed. The form of the sample correlation matrix resembles the product of the data matrices in the least-squares (LS) problem that is addressed in Chapter 8. Therefore, we might compute the upper triangular factor of the data matrix X by using one of the techniques discussed in Chapter 8, such as a QR factorization. Indeed, if we compute the QR factorization made up of the orthonormal matrix $\mathbf Q_x$ and the upper triangular factor $\mathcal R_x$

$$\mathbf X = \mathbf Q_x \mathcal R_x \tag{9.5.14}$$

then the minimum-variance spectrum estimator based on the sample correlation matrix is

$$\hat R_M^{(scmv)}(e^{j2\pi f_k}) = \frac{M}{N-M+1}\, \frac{1}{|\mathbf v^H(f_k) \mathcal R_x^{-H}|^2} \tag{9.5.15}$$

Note that the conjugation of the upper triangular matrix comes about through the formulation of the data matrix in (9.5.13).

We have not addressed the issue of choosing the filter length M. Ideally, M is chosen to be as large as possible in order to maximize the rejection of out-of-band energy. However, from a practical point of view, we must place a limit on the filter length. As the filter length increases, the size of the data matrix grows, which increases the amount of computation necessary. In addition, since we are inherently estimating the correlation matrix, reducing the variance of this estimator requires averaging over a set of realizations of the input data vector x(n). Thus, for a fixed data record size of N, we must balance the length of the time window M against the number of realizations of the input data vector N − M + 1.

As we will demonstrate in the following example, the minimum-variance spectrum estimator provides a means of achieving high resolution, certainly better than the Δf ∼ 1/M limit of the DFT-based methods. High resolving capability essentially means that the minimum-variance spectrum estimator can better distinguish complex exponential signals closely spaced in frequency. This topic is explored further in Section 9.6. However, high resolution does not come without a cost. In practice, the spectrum cannot be estimated over a continuous frequency interval and must be computed at a finite set of discrete frequency points. Since the minimum-variance estimator is based on $\mathbf R_x^{-1}$, it is very sensitive to the exact frequency points at which the spectrum is estimated. Therefore, the minimum-variance spectrum needs to be computed at a very fine frequency spacing in order to accurately measure the power of such a complex exponential. In some applications where computational cost is a concern, the DFT-based methods are probably preferred, as long as they provide the necessary resolution and sidelobe leakage is properly controlled.
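The estimator can be sketched directly from (9.5.11) and (9.5.12). The Python code below is a minimal illustration, not an efficient implementation: the function names are hypothetical, the sample correlation matrix is formed explicitly from the time-window vectors (rather than through a QR factorization), each frequency is handled by solving one small linear system, and the test signal (one complex exponential in complex white noise) is an arbitrary choice.

```python
import cmath
import math
import random


def solve_linear(A, b):
    """Gaussian elimination with partial pivoting (works for complex entries)."""
    n = len(b)
    M = [list(row) + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    sol = [0j] * n
    for r in range(n - 1, -1, -1):
        s = M[r][n] - sum(M[r][c] * sol[c] for c in range(r + 1, n))
        sol[r] = s / M[r][r]
    return sol


def sample_correlation(x, M):
    """Average of x(n) x^H(n) over the N - M + 1 available time-window vectors."""
    N = len(x)
    R = [[0j] * M for _ in range(M)]
    for n in range(M - 1, N):
        xv = [x[n - i] for i in range(M)]        # [x(n), x(n-1), ..., x(n-M+1)]
        for i in range(M):
            for j in range(M):
                R[i][j] += xv[i] * xv[j].conjugate()
    return [[val / (N - M + 1) for val in row] for row in R]


def frequency_vector(f, M):
    """v(f) of (9.5.6)."""
    return [cmath.exp(-2j * math.pi * f * m) for m in range(M)]


def mv_spectrum(x, M, freqs):
    """M / (v^H R^{-1} v) at each frequency, with R the sample correlation matrix."""
    R = sample_correlation(x, M)
    est = []
    for f in freqs:
        v = frequency_vector(f, M)
        u = solve_linear(R, v)                   # u = R^{-1} v
        quad = sum(vi.conjugate() * ui for vi, ui in zip(v, u)).real
        est.append(M / quad)
    return est


# One complex exponential (power 10) at f1 = 0.1 in unit-power complex white noise.
rng = random.Random(7)
N, M, f1 = 200, 8, 0.1
phase = rng.uniform(0.0, 2.0 * math.pi)
x = [math.sqrt(10.0) * cmath.exp(1j * (2.0 * math.pi * f1 * n + phase))
     + complex(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) / math.sqrt(2.0)
     for n in range(N)]

freqs = [-0.5 + k / 100 for k in range(100)]
spec = mv_spectrum(x, M, freqs)
peak_index = max(range(len(spec)), key=lambda k: spec[k])
f_peak, peak_value = freqs[peak_index], spec[peak_index]
print(f_peak, peak_value)
```

The peak value is much larger than the signal power of 10 because of the bandwidth normalization built into the constraint (9.5.8).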
EXAMPLE 9.5.1. In this example, we explore the resolving capability of the minimum-variance spectrum estimator and compare its performance to that of a DFT-based method (Bartlett) and the all-pole method. Two closely spaced complex exponentials, both with an amplitude of $\sqrt{10}$, at discrete-time frequencies of f = 0.1 and f = 0.12 are contained in noise with unit power $\sigma_w^2 = 1$. We apply the spectrum estimators with time-window lengths (or order) M = 16, 32, 64, and 128 to signals consisting of 500 time samples. The estimated spectra were then averaged over 100 realizations. The resulting average spectrum estimates are shown in Figure 9.19. Note that the frequency spacing of the two complex exponentials is Δf = 0.02, suggesting a time-window length of at least M = 50 to resolve them with a DFT-based method. The minimum-variance spectrum estimator, however, is able to resolve them at the M = 32 window length, for which they are clearly not distinguishable using the DFT-based method. On the other hand, the all-pole spectrum estimate is able to resolve the two complex exponentials even for as low an order as M = 16, for which the minimum-variance spectrum was not successful. In general, the superior resolving capability of the all-pole model over the minimum-variance spectrum estimator is due to an averaging effect that comes about through the nonparametric nature of the minimum-variance method. This subject is explored following the next example. Note that the estimated noise level is most accurately measured by the minimum-variance method in all cases. Recall that the signal amplitude was $\sqrt{10}$, yet the estimated power at the frequencies of the complex exponentials increases as the window length M increases. In the filter bank interpretation of the minimum-variance spectrum estimator, the normalization assumed a constant signal power level across the bandwidth of the frequency-selective filter. However, the complex exponential is actually an impulse in frequency and has zero bandwidth. Therefore, the estimated power will grow with the length of the time window used for the spectrum estimator as a result of this bandwidth normalization. The gain imparted on a complex exponential signal is explored in Example 9.5.2.

FIGURE 9.19 Comparison of the minimum-variance (solid line), all-pole (large dashed line), and Fourier-based (small dashed line) spectrum estimators for different time window lengths M. [Panels: (a) M = 16, (b) M = 32, (c) M = 64, (d) M = 128; each panel plots power (dB) versus normalized frequency.]

EXAMPLE 9.5.2. Consider the complex exponential signal with frequency $f_1$ contained in noise

$$x(n) = \alpha_1 e^{j2\pi f_1 n} + w(n)$$

where $\alpha_1 = |\alpha_1| e^{j\psi_1}$ is a complex number with constant amplitude $|\alpha_1|$ and random phase $\psi_1$ with uniform distribution over [0, 2π]. The correlation matrix of x(n) is

$$\mathbf R_x = |\alpha_1|^2\, \mathbf v(f_1) \mathbf v^H(f_1) + \sigma_w^2 \mathbf I$$

Using the matrix inversion lemma from Appendix B, we can write the inverse of the correlation matrix as

$$\mathbf{R}_x^{-1} = \frac{1}{\sigma_w^2}\mathbf{I} - \frac{|\alpha_1|^2\, \mathbf{v}(f_1)\mathbf{v}^H(f_1)}{\sigma_w^2\,[\sigma_w^2 + |\alpha_1|^2\, \mathbf{v}^H(f_1)\mathbf{v}(f_1)]} = \frac{1}{\sigma_w^2}\left[\mathbf{I} - \frac{|\alpha_1|^2}{\sigma_w^2 + M|\alpha_1|^2}\, \mathbf{v}(f_1)\mathbf{v}^H(f_1)\right]$$

Substituting this expression for the inverse of the correlation matrix into (9.5.11) for the minimum-variance spectrum estimate, we have

$$\hat{R}_M^{(mv)}(e^{j2\pi f_1}) = \frac{M}{\mathbf{v}^H(f_1)\,\mathbf{R}_x^{-1}\,\mathbf{v}(f_1)} = \frac{\sigma_w^2}{1 - \dfrac{|\alpha_1|^2/M}{\sigma_w^2 + M|\alpha_1|^2}\,|\mathbf{v}^H(f_1)\mathbf{v}(f_1)|^2}$$

Recall that the squared norm of the frequency vector $\mathbf{v}(f)$ from (9.5.6) is $\mathbf{v}^H(f_1)\mathbf{v}(f_1) = M$. Therefore, the minimum-variance power spectrum estimate at $f = f_1$ is

$$\hat{R}_M^{(mv)}(e^{j2\pi f_1}) = \sigma_w^2 + M|\alpha_1|^2$$

that is, the sum of the noise power and the signal power times the time-window length. This gain of $M$ on the signal power comes about through the normalization we imposed on our filter in (9.5.8). This normalization assumed the signal had equal amplitude across the passband of the filter. A complex exponential, on the other hand, has no bandwidth and thus this normalization imparts a gain of $M$ on the signal. Therefore, if an estimate of the amplitude of a complex exponential is desired, this gain must be accounted for.

Last, let us examine the behavior of the minimum-variance spectrum estimator at the other frequencies that contain only noise. In the case of $M \gg 1$, $\mathbf{v}^H(f)\mathbf{v}(f_1) \approx 0$ and

$$\hat{R}_M^{(mv)}(e^{j2\pi f}) \approx \sigma_w^2$$
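The gain of $M$ derived in Example 9.5.2 can be checked numerically. The following MATLAB sketch builds the theoretical correlation matrix for one complex exponential in white noise and evaluates $M/(\mathbf{v}^H \mathbf{R}_x^{-1} \mathbf{v})$ at the signal frequency and at one noise-only frequency. The values $M = 16$, $|\alpha_1|^2 = 10$, and the test frequency $f = 0.3$ are illustrative assumptions, as are all variable names.

```matlab
% Minimum-variance estimate for one complex exponential in white noise,
% evaluated from the theoretical correlation matrix (assumed parameters).
M = 16; f1 = 0.1; sigma2 = 1; pw = 10;       % pw stands for |alpha_1|^2
v1 = exp(1j*2*pi*f1*(0:M-1).');              % frequency vector at f1
Rx = pw*(v1*v1') + sigma2*eye(M);            % Rx = |alpha_1|^2 v v^H + sigma_w^2 I

Rmv_f1 = M / real(v1' * (Rx \ v1));          % estimate at the signal frequency
expected = sigma2 + M*pw;                    % predicted value sigma_w^2 + M |alpha_1|^2

f = 0.3;                                     % a frequency that contains only noise
vf = exp(1j*2*pi*f*(0:M-1).');
Rmv_f = M / real(vf' * (Rx \ vf));           % should be close to sigma_w^2
```

Dividing `Rmv_f1 - sigma2` by `M` recovers the signal power, which is the correction the example says must be applied when the amplitude of a complex exponential is wanted.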

Relationship between the minimum-variance and all-pole spectrum estimation methods

The minimum-variance spectrum estimator has an interesting relation to the all-pole spectrum estimator discussed in Section 9.4. Recall from (9.5.11) that the minimum-variance spectrum estimate is a function of $\mathbf{R}_x^{-1}$. The inverse of a Toeplitz correlation matrix was studied in Chapter 7 and from (7.7.8) can be written as an $\mathbf{LDL}^H$ decomposition

$$\mathbf{R}_x^{-1} = \mathbf{A}^H \bar{\mathbf{D}}^{-1} \mathbf{A} \tag{9.5.16}$$

where the upper triangular matrix $\mathbf{A}$ from (7.7.9) is given by

$$\mathbf{A} = \begin{bmatrix}
1 & a_1^{(M)*} & a_2^{(M)*} & \cdots & a_{M-1}^{(M)*} \\
0 & 1 & a_1^{(M-1)*} & \cdots & a_{M-2}^{(M-1)*} \\
\vdots & \ddots & \ddots & \ddots & \vdots \\
0 & 0 & \cdots & 1 & a_1^{(2)*} \\
0 & 0 & \cdots & 0 & 1
\end{bmatrix} \tag{9.5.17}$$

and the diagonal matrix $\bar{\mathbf{D}}$ is

$$\bar{\mathbf{D}} = \mathrm{diag}\{P_M, P_{M-1}, \ldots, P_1\} \tag{9.5.18}$$

Recall from Chapter 7 that the columns of the lower triangular factor $\mathbf{L} = \mathbf{A}^H$ are the coefficients of the forward linear predictors of orders $m = 1, 2, \ldots, M-1$ for the signal $x(n)$ with correlation matrix $\mathbf{R}_x$. $P_m$ is the residual output power resulting from the application of this $m$th-order forward linear predictor to the signal $x(n)$. In turn, the forward linear predictor coefficients form the $m$th-order all-pole model. The model orders are found in descending order as the column index increases. Let us denote the column vector of coefficients for the $m$th-order all-pole model as

$$\mathbf{a}_m = [1 \;\; a_1^{(m)} \;\; a_2^{(m)} \;\; \cdots \;\; a_m^{(m)}]^T \tag{9.5.19}$$

We can write the estimate of the spectrum derived from an $m$th-order all-pole model in vector notation as

$$\hat{R}_m^{(ap)}(e^{j2\pi f}) = \frac{P_m}{|\mathbf{v}_m^H(f)\,\mathbf{a}_m|^2} \tag{9.5.20}$$

where $\mathbf{v}_m(f)$ is the frequency vector from (9.5.6) of order $M = m$. Then we can substitute (9.5.16) into the minimum-variance spectrum estimator from (9.5.11) to obtain

$$\hat{R}_M^{(mv)}(e^{j2\pi f}) = \frac{M}{\mathbf{v}_M^H(f)\,\mathbf{R}_x^{-1}\,\mathbf{v}_M(f)} = \frac{M}{\mathbf{v}_M^H(f)\,\mathbf{A}^H \bar{\mathbf{D}}^{-1} \mathbf{A}\,\mathbf{v}_M(f)} \tag{9.5.21}$$

Therefore, we can write the following relationship between the reciprocals of the minimum-variance and all-pole model spectrum estimators

$$\frac{1}{\hat{R}_M^{(mv)}(e^{j2\pi f})} = \sum_{m=1}^{M} \frac{|\mathbf{v}_m^H(f)\,\mathbf{a}_m|^2}{M P_m} = \frac{1}{M} \sum_{m=1}^{M} \frac{1}{\hat{R}_m^{(ap)}(e^{j2\pi f})} \tag{9.5.22}$$

where the subscripts denote the order of the respective spectrum estimators. Thus, the minimum-variance spectrum estimator for a filter of length $M$ is formed by averaging, in the reciprocal (that is, as a harmonic mean), the spectrum estimates from all-pole models of orders 1 through $M$. Note that the resolving capabilities of the all-pole model improve with increasing model order. As a result, the resolution of the minimum-variance spectrum estimator must be worse than that of the $M$th-order all-pole model, as we observed in Example 9.5.1. On the other hand, this averaging of all-pole model spectra suggests a lower variance for the minimum-variance spectrum estimator.
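The relationship in (9.5.22) can be verified numerically for a theoretical Toeplitz correlation matrix. The MATLAB sketch below is one way to do this under stated assumptions: the coefficient vector of each length-$m$ model is obtained by solving $\mathbf{R}_m \mathbf{a}_m = P_m \boldsymbol{\delta}_1$ on the leading $m \times m$ block of $\mathbf{R}_x$ (a convention chosen here for illustration, which may differ from the book's by a conjugation or reversal), and the parameter values and variable names are hypothetical.

```matlab
% Numerical check of (9.5.22) for a theoretical Toeplitz correlation matrix.
M = 8; sigma2 = 1; pw = 10;                  % window length, noise power, |alpha|^2 (assumed)
f1 = 0.1; f2 = 0.12; f = 0.07;               % two signal frequencies and a test frequency
v = @(fr, m) exp(1j*2*pi*fr*(0:m-1).');      % length-m frequency vector, as in (9.6.6)

R = pw*(v(f1,M)*v(f1,M)' + v(f2,M)*v(f2,M)') + sigma2*eye(M);

% Left side: reciprocal of the minimum-variance estimate M/(v^H R^-1 v)
lhs = real(v(f,M)' * (R \ v(f,M))) / M;

% Right side: average of the reciprocal all-pole estimates for m = 1..M
rhs = 0;
for m = 1:M
    t   = R(1:m,1:m) \ [1; zeros(m-1,1)];    % solves R_m t = delta_1
    Pm  = 1 / real(t(1));                    % residual power P_m
    am  = t * Pm;                            % coefficient vector with leading 1
    Rap = Pm / abs(v(f,m)' * am)^2;          % all-pole estimate as in (9.5.20)
    rhs = rhs + 1 / (M * Rap);
end
```

The two sides agree to numerical precision for any test frequency `f`, which illustrates why the minimum-variance estimate cannot resolve better than the highest-order all-pole term in the average.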

9.6 HARMONIC MODELS AND FREQUENCY ESTIMATION TECHNIQUES

The pole-zero models we have discussed so far assume a linear time-invariant system that is excited by white noise. However, in many applications, the signals of interest are complex exponentials contained in white noise for which a sinusoidal or harmonic model is more appropriate. Signals consisting of complex exponentials are found as formant frequencies in speech processing, moving targets in radar, and spatially propagating signals in array processing.† For real signals, complex exponentials make up a complex conjugate pair (sinusoids), whereas for complex signals, they may occur at a single frequency.

For complex exponentials found in noise, the parameters of interest are the frequencies of the signals. Therefore, our goal is to estimate these frequencies from the data. One might consider estimating the power spectrum by using the nonparametric methods discussed in Chapter 5 or the minimum-variance spectral estimate from Section 9.5. The frequency estimates of the complex exponentials are then the frequencies at which peaks occur in the spectrum. Certainly, the use of these nonparametric methods seems appropriate for complex exponential signals since they make no assumptions about the underlying process. We might also consider making use of an all-pole model for the purposes of spectrum estimation as discussed in Section 9.4.1, also known as the maximum entropy method (MEM) spectral estimation technique. Even though some of these methods can achieve very fine resolution, none of them accounts for the underlying model of complex exponentials in noise. As in all modeling problems, the use of the appropriate model is desirable from an intuitive point of view and advantageous in terms of performance.

We begin by describing the harmonic signal model, deriving the model in a vector notation, and looking at the eigendecomposition of the correlation matrix of complex exponentials in noise. Then we describe frequency estimation methods based on the harmonic model: the Pisarenko harmonic decomposition, and the MUSIC, minimum-norm, and ESPRIT algorithms. These methods have the ability to resolve complex exponentials closely spaced in frequency, which has led to the name superresolution commonly being associated with them.

However, a word of caution is in order on the use of these harmonic models. The high level of performance in terms of resolution is achieved by assuming an underlying model of the data. As with all other parametric methods, the performance of these techniques depends upon how closely this mathematical model matches the actual physical process that produced the signals. Deviations from this assumption result in model mismatch: the methods will still produce frequency estimates, even for a signal that was not produced by complex exponentials. In this case, the frequency estimates have little meaning.

9.6.1 Harmonic Model

Consider the signal model that consists of $P$ complex exponentials in noise

$$x(n) = \sum_{p=1}^{P} \alpha_p e^{j2\pi n f_p} + w(n) \tag{9.6.1}$$

† In array processing, a spatially propagating wave produces a complex exponential signal as measured across uniformly spaced sensors in an array. The frequency of the complex exponential is determined by the angle of arrival of the impinging, spatially propagating signal. Thus, in array processing the frequency estimation problem is known as angle-of-arrival (AOA) or direction-of-arrival (DOA) estimation. This topic is discussed in Section 11.7.

The normalized, discrete-time frequency of the $p$th component is

$$f_p = \frac{\omega_p}{2\pi} = \frac{F_p}{F_s} \tag{9.6.2}$$

where $\omega_p$ is the discrete-time frequency in radians, $F_p$ is the actual frequency of the $p$th complex exponential, and $F_s$ is the sampling frequency. The complex exponentials may occur either individually or in complex conjugate pairs, as in the case of real signals. In general, we want to estimate the frequencies and possibly also the amplitudes of these signals. Note that the phase of each complex exponential is contained in the amplitude, that is,

$$\alpha_p = |\alpha_p| e^{j\psi_p} \tag{9.6.3}$$

where the phases $\psi_p$ are uncorrelated random variables uniformly distributed over $[0, 2\pi]$. The magnitude $|\alpha_p|$ and the frequency $f_p$ are deterministic quantities. If we consider the spectrum of a harmonic process, we note that it consists of a set of impulses with a constant background level at the power of the white noise $\sigma_w^2 = E\{|w(n)|^2\}$. As a result, the power spectrum of complex exponentials is commonly referred to as a line spectrum, as illustrated in Figure 9.20.

FIGURE 9.20 The spectrum of complex exponentials in noise: impulses in $R(e^{j2\pi f})$ above a constant noise level, plotted versus frequency.

Since we will make use of matrix methods based on a certain time window of length $M$, it is useful to characterize the signal model in the form of a vector over this time window consisting of the sample delays of the signal. Consider the signal $x(n)$ from (9.6.1) at its current and future $M - 1$ values. This time window can be written as

$$\mathbf{x}(n) = [x(n) \;\; x(n+1) \;\; \cdots \;\; x(n+M-1)]^T \tag{9.6.4}$$

We can then write the signal model consisting of complex exponentials in noise from (9.6.1) for a length-$M$ time-window vector as

$$\mathbf{x}(n) = \sum_{p=1}^{P} \alpha_p \mathbf{v}(f_p) e^{j2\pi n f_p} + \mathbf{w}(n) = \mathbf{s}(n) + \mathbf{w}(n) \tag{9.6.5}$$

where $\mathbf{w}(n) = [w(n) \;\; w(n+1) \;\; \cdots \;\; w(n+M-1)]^T$ is the time-window vector of white noise and

$$\mathbf{v}(f) = [1 \;\; e^{j2\pi f} \;\; \cdots \;\; e^{j2\pi (M-1) f}]^T \tag{9.6.6}$$

is the time-window frequency vector. Note that $\mathbf{v}(f)$ is simply a length-$M$ DFT vector at frequency $f$. We differentiate here between the signal $\mathbf{s}(n)$, consisting of the sum of complex exponentials, and the noise component $\mathbf{w}(n)$.

Consider the time-window vector model consisting of a sum of complex exponentials in noise from (9.6.5). The autocorrelation matrix of this model can be written as the sum of signal and noise autocorrelation matrices

$$\mathbf{R}_x = E\{\mathbf{x}(n)\mathbf{x}^H(n)\} = \mathbf{R}_s + \mathbf{R}_w = \sum_{p=1}^{P} |\alpha_p|^2 \mathbf{v}(f_p)\mathbf{v}^H(f_p) + \sigma_w^2 \mathbf{I} = \mathbf{V}\mathbf{A}\mathbf{V}^H + \sigma_w^2 \mathbf{I} \tag{9.6.7}$$

where

$$\mathbf{V} = [\mathbf{v}(f_1) \;\; \mathbf{v}(f_2) \;\; \cdots \;\; \mathbf{v}(f_P)] \tag{9.6.8}$$

is an $M \times P$ matrix whose columns are the time-window frequency vectors from (9.6.6) at frequencies $f_p$ of the complex exponentials and

$$\mathbf{A} = \begin{bmatrix}
|\alpha_1|^2 & 0 & \cdots & 0 \\
0 & |\alpha_2|^2 & \ddots & \vdots \\
\vdots & \ddots & \ddots & 0 \\
0 & \cdots & 0 & |\alpha_P|^2
\end{bmatrix} \tag{9.6.9}$$

is a diagonal matrix of the powers of each of the respective complex exponentials. The autocorrelation matrix of the white noise is

$$\mathbf{R}_w = \sigma_w^2 \mathbf{I} \tag{9.6.10}$$

which is full rank, as opposed to $\mathbf{R}_s$, which is rank-deficient for $P < M$. In general, we will always choose the length of our time window $M$ to be greater than the number of complex exponentials $P$. The autocorrelation matrix can also be written in terms of its eigendecomposition

$$\mathbf{R}_x = \sum_{m=1}^{M} \lambda_m \mathbf{q}_m \mathbf{q}_m^H = \mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^H \tag{9.6.11}$$

where $\lambda_m$ are the eigenvalues in descending order, that is, $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_M$, and $\mathbf{q}_m$ are their corresponding eigenvectors. Here $\boldsymbol{\Lambda}$ is a diagonal matrix made up of the eigenvalues found in descending order on the diagonal, while the columns of $\mathbf{Q}$ are the corresponding eigenvectors. The eigenvalues due to the signals can be written as the sum of the signal power in the time window and the noise:

$$\lambda_m = M|\alpha_m|^2 + \sigma_w^2 \qquad \text{for } m \le P \tag{9.6.12}$$

The remaining eigenvalues are due to the noise only, that is,

$$\lambda_m = \sigma_w^2 \qquad \text{for } m > P \tag{9.6.13}$$

Therefore, the $P$ largest eigenvalues correspond to the signal made up of complex exponentials and the remaining eigenvalues have equal value and correspond to the noise. Thus, we can partition the correlation matrix into portions due to the signal and noise eigenvectors

$$\mathbf{R}_x = \sum_{m=1}^{P} (M|\alpha_m|^2 + \sigma_w^2)\,\mathbf{q}_m \mathbf{q}_m^H + \sum_{m=P+1}^{M} \sigma_w^2\,\mathbf{q}_m \mathbf{q}_m^H = \mathbf{Q}_s \boldsymbol{\Lambda}_s \mathbf{Q}_s^H + \sigma_w^2 \mathbf{Q}_w \mathbf{Q}_w^H \tag{9.6.14}$$

where

$$\mathbf{Q}_s = [\mathbf{q}_1 \;\; \mathbf{q}_2 \;\; \cdots \;\; \mathbf{q}_P] \qquad \mathbf{Q}_w = [\mathbf{q}_{P+1} \;\; \cdots \;\; \mathbf{q}_M] \tag{9.6.15}$$

are matrices whose columns consist of the signal and noise eigenvectors, respectively. The matrix $\boldsymbol{\Lambda}_s$ is a $P \times P$ diagonal matrix containing the signal eigenvalues from (9.6.12). Thus, the $M$-dimensional subspace that contains the observations of the time-window signal vector from (9.6.5) can be split into two subspaces spanned by the signal and noise eigenvectors, respectively. These two subspaces, known as the signal subspace and the noise subspace, are orthogonal to each other since the correlation matrix is Hermitian symmetric.† All the subspace methods discussed later in this section rely on the partitioning of the vector space into signal and noise subspaces. Recall from Chapter 8 in (8.2.29) that the projection matrix from an $M$-dimensional space onto an $L$-dimensional subspace ($L < M$) spanned by a set of vectors $\mathbf{Z} = [\mathbf{z}_1 \;\; \mathbf{z}_2 \;\; \cdots \;\; \mathbf{z}_L]$ is

$$\mathbf{P} = \mathbf{Z}(\mathbf{Z}^H \mathbf{Z})^{-1} \mathbf{Z}^H \tag{9.6.16}$$

† The eigenvectors of a Hermitian symmetric matrix are orthogonal.

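Therefore, we can write the matrices that project an arbitrary vector onto the signal and noise subspaces as

$$\mathbf{P}_s = \mathbf{Q}_s \mathbf{Q}_s^H \qquad \mathbf{P}_w = \mathbf{Q}_w \mathbf{Q}_w^H \tag{9.6.17}$$

since the eigenvectors of the correlation matrix are orthonormal ($\mathbf{Q}_s^H \mathbf{Q}_s = \mathbf{I}$ and $\mathbf{Q}_w^H \mathbf{Q}_w = \mathbf{I}$). Since the two subspaces are orthogonal,

$$\mathbf{P}_w \mathbf{Q}_s = \mathbf{0} \qquad \mathbf{P}_s \mathbf{Q}_w = \mathbf{0} \tag{9.6.18}$$

all the time-window frequency vectors from (9.6.5) must lie completely in the signal subspace, that is,

$$\mathbf{P}_s \mathbf{v}(f_p) = \mathbf{v}(f_p) \qquad \mathbf{P}_w \mathbf{v}(f_p) = \mathbf{0} \tag{9.6.19}$$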
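The eigenvalue split in (9.6.12) and (9.6.13) and the projection properties in (9.6.19) can be illustrated with a theoretical correlation matrix. In the MATLAB sketch below, the frequencies $1/8$ and $3/8$ are assumptions chosen so that the two frequency vectors are exactly orthogonal for $M = 8$, in which case the signal eigenvalues equal $M|\alpha_p|^2 + \sigma_w^2$ exactly; the powers and variable names are likewise hypothetical.

```matlab
% Signal/noise subspace partition of a theoretical harmonic correlation matrix.
M = 8; sigma2 = 1;
fp = [1/8 3/8];                              % assumed frequencies (orthogonal v's for M = 8)
pw = [4 2];                                  % assumed powers |alpha_p|^2, in descending order
P  = numel(fp);

V  = exp(1j*2*pi*(0:M-1).'*fp);              % M x P matrix of frequency vectors, (9.6.8)
Rx = V*diag(pw)*V' + sigma2*eye(M);          % Rx = V A V^H + sigma_w^2 I, (9.6.7)
Rx = (Rx + Rx')/2;                           % enforce exact Hermitian symmetry

[Q0, D] = eig(Rx);
[lambda, idx] = sort(real(diag(D)), 'descend');
Q  = Q0(:, idx);
Qs = Q(:, 1:P);   Qw = Q(:, P+1:M);          % signal and noise eigenvectors, (9.6.15)
Ps = Qs*Qs';      Pw = Qw*Qw';               % projection matrices, (9.6.17)

err_sig   = norm(Ps*V - V);                  % Ps v(fp) = v(fp)
err_noise = norm(Pw*V);                      % Pw v(fp) = 0
```

Here `lambda(1:P)` comes out as 33 and 17, and the remaining six eigenvalues equal the noise power, so the threshold between signal and noise eigenvalues is unambiguous for the true correlation matrix.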

These concepts are central to the subspace-based frequency estimation methods discussed in Sections 9.6.2 through 9.6.5.

Note that in our analysis, we are considering the theoretical or true correlation matrix $\mathbf{R}_x$. In practice, the correlation matrix is not known and must be estimated from the measured data samples. If we have a time-window signal vector from (9.6.4), then we can form the data matrix by stacking the rows with measurements of the time-window data vector at each time $n$

$$\mathbf{X} = \begin{bmatrix}
\mathbf{x}^T(0) \\ \mathbf{x}^T(1) \\ \vdots \\ \mathbf{x}^T(n) \\ \vdots \\ \mathbf{x}^T(N-2) \\ \mathbf{x}^T(N-1)
\end{bmatrix} = \begin{bmatrix}
x(0) & x(1) & \cdots & x(M-1) \\
x(1) & x(2) & \cdots & x(M) \\
\vdots & \vdots & \ddots & \vdots \\
x(n) & x(n+1) & \cdots & x(n+M-1) \\
\vdots & \vdots & \ddots & \vdots \\
x(N-2) & x(N-1) & \cdots & x(N+M-3) \\
x(N-1) & x(N) & \cdots & x(N+M-2)
\end{bmatrix} \tag{9.6.20}$$

which has dimensions of $N \times M$, where $N$ is the number of data records or snapshots and $M$ is the time-window length. From this matrix, we can form an estimate of the correlation matrix, referred to as the sample correlation matrix

$$\hat{\mathbf{R}}_x = \frac{1}{N} \mathbf{X}^H \mathbf{X} \tag{9.6.21}$$

In the case of an estimated sample correlation matrix, the noise eigenvalues are no longer equal because of the finite number of samples used to compute $\hat{\mathbf{R}}_x$. Therefore, the nice, clean threshold between signal and noise eigenvalues, as described in (9.6.12) and (9.6.13), no longer exists. The model order estimation techniques discussed in Section 9.2 can be employed to attempt to determine the number of complex exponentials $P$ present. In practice, these methods are best used as rough estimates, as their performance is not very accurate, especially for short data records.

For several of the frequency estimation techniques described in this section, the analysis considers the use of eigenvalues and eigenvectors of the correlation matrix for the purposes of defining signal and noise subspaces.† In practice, we estimate the signal and noise subspaces by using the eigenvectors and eigenvalues of the sample correlation matrix. Note that for notational expedience we will not differentiate between eigenvectors and eigenvalues of the true and sample correlation matrices. However, the reader should always keep in mind that the sample correlation matrix eigendecomposition is what must be used for implementation. We note that use of an estimate rather than the true correlation matrix will result in a degradation in performance, the analysis of which is beyond the scope of this book.

† The ESPRIT method uses a singular value decomposition of data matrix $\mathbf{X}$.
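As a concrete illustration of (9.6.20) and (9.6.21), the following MATLAB sketch generates a harmonic signal according to (9.6.1), stacks the time-window vectors into the data matrix, and forms the sample correlation matrix. The frequencies, magnitudes, sizes, and variable names are assumptions made for illustration only.

```matlab
% Data matrix and sample correlation matrix for a harmonic signal in noise.
N = 64; M = 4; sigma2 = 1;                   % snapshots, window length, noise power (assumed)
fp = [0.1 0.2]; amp = [1 1];                 % assumed frequencies and magnitudes |alpha_p|
L = N + M - 1;                               % number of samples needed by (9.6.20)
n = (0:L-1).';

psi = 2*pi*rand(1, numel(fp));               % random phases, uniform on [0, 2*pi]
s = exp(1j*2*pi*n*fp) * (amp .* exp(1j*psi)).';          % sum of complex exponentials
w = sqrt(sigma2/2) * (randn(L,1) + 1j*randn(L,1));       % complex white noise
x = s + w;

X = zeros(N, M);
for k = 1:N
    X(k,:) = x(k:k+M-1).';                   % row k holds x^T(k-1) from (9.6.4)
end
Rhat = (X' * X) / N;                         % sample correlation matrix, (9.6.21)
```

Calling `eig(Rhat)` on this estimate shows noise eigenvalues that are close to, but no longer exactly equal to, the noise power, which is the effect described in the text.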

9.6.2 Pisarenko Harmonic Decomposition

The Pisarenko harmonic decomposition (PHD) was the first frequency estimation method proposed that was based on the eigendecomposition of the correlation matrix and its partitioning into signal and noise subspaces (Pisarenko 1973). This method uses the eigenvector associated with the smallest eigenvalue to estimate the frequencies of the complex exponentials. Although this method has limited practical use owing to its sensitivity to noise, it is of great theoretical interest because it was the first method based on signal and noise subspace principles and it helped to fuel the development of many well-known subspace methods, such as MUSIC and ESPRIT.

Consider the model of complex exponentials contained in noise in (9.6.5) and the eigendecomposition of its correlation matrix in (9.6.14). The eigenvector corresponding to the minimum eigenvalue must be orthogonal to all the eigenvectors in the signal subspace. Thus, we choose the time window to be of length

$$M = P + 1 \tag{9.6.22}$$

that is, 1 greater than the number of complex exponentials. Therefore, the noise subspace consists of a single eigenvector

$$\mathbf{Q}_w = \mathbf{q}_M \tag{9.6.23}$$

corresponding to the minimum eigenvalue $\lambda_M$. By virtue of the orthogonality between the signal and noise subspaces, each of the $P$ complex exponentials in the time-window signal vector model in (9.6.5) is orthogonal to this eigenvector

$$\mathbf{v}^H(f_p)\,\mathbf{q}_M = \sum_{k=1}^{M} q_M(k)\, e^{-j2\pi f_p (k-1)} = 0 \qquad \text{for } 1 \le p \le P \tag{9.6.24}$$

Making use of this property, we can compute

$$\bar{R}_{phd}(e^{j2\pi f}) = \frac{1}{|\mathbf{v}^H(f)\,\mathbf{q}_M|^2} = \frac{1}{|Q_M(e^{j2\pi f})|^2} \tag{9.6.25}$$

which is commonly referred to as a pseudospectrum. The frequencies are then estimated by observing the $P$ peaks in $\bar{R}_{phd}(e^{j2\pi f})$. Note that since (9.6.25) requires a search of all frequencies $-0.5 \le f \le 0.5$, in practice a dense sampling of the frequencies is generally necessary. The quantity

$$Q_M(e^{j2\pi f}) = \mathbf{v}^H(f)\,\mathbf{q}_M = \sum_{k=1}^{M} q_M(k)\, e^{-j2\pi f (k-1)} \tag{9.6.26}$$

is simply the Fourier transform of the $M$th eigenvector corresponding to the minimum eigenvalue. Thus, the pseudospectrum for the Pisarenko harmonic decomposition $\bar{R}_{phd}(e^{j2\pi f})$ can be efficiently implemented by computing the FFT of $\mathbf{q}_M$ with sufficient zero padding to provide the necessary frequency resolution. Then $\bar{R}_{phd}(e^{j2\pi f})$ is simply the reciprocal of the spectrum of the noise eigenvector, where that spectrum is the squared magnitude of its Fourier transform. Note that $\bar{R}_{phd}(e^{j2\pi f})$ is not an estimate of the true power spectrum since it contains no information about the powers of the complex exponentials $|\alpha_p|^2$ or the background noise level $\sigma_w^2$. However, these quantities can be found by using the estimated frequencies and the corresponding time-window frequency vectors along with the relationship of eigenvalues and eigenvectors. See Problem 9.24 for details.

Alternatively, the frequencies of the complex exponentials can be found by computing the zeros of the Fourier transform of the $M$th eigenvector in (9.6.23). The z-transform of this eigenvector is

$$Q_M(z) = \sum_{k=1}^{M} q_M(k)\, z^{-k} = \prod_{k=1}^{M-1} (1 - e^{j2\pi f_k} z^{-1}) \tag{9.6.27}$$

where the phases of the $P = M - 1$ roots of this polynomial, divided by $2\pi$, are the frequencies $f_k$ of the $P = M - 1$ complex exponentials.

As we stated up front, the significance of the Pisarenko harmonic decomposition is seen mostly from a theoretical perspective. The limitations of its practical use stem from the fact that it uses a single noise eigenvector and, as a result, lacks the robustness needed for most applications. Since the correlation matrix is not known and must be estimated from data, the resulting noise eigenvector of the estimated correlation matrix is only an estimate of the actual noise eigenvector. Because we only use one noise eigenvector, this method is very sensitive to any errors in the estimation of the noise eigenvector.

EXAMPLE 9.6.1. We demonstrate the use of the Pisarenko harmonic decomposition with a sinusoid in noise. The amplitude and frequency of the sinusoid are $\alpha = 1$ and $f = 0.2$, respectively. The additive noise has unit power ($\sigma_w^2 = 1$). Using Matlab, this signal is generated:

    x = sin(2*pi*f*[0:(N-1)]') + (randn(N,1)+j*randn(N,1))/sqrt(2);

Since the number of complex exponentials is equal to $P = 2$ (a complex conjugate pair for a sinusoid), the time-window length is chosen to be $M = 3$. After forming the $N \times M$ data matrix $\mathbf{X}$ and computing the sample correlation matrix $\hat{\mathbf{R}}_x$, we can compute the pseudospectrum as follows:

    [Q0,D] = eig(R);                            % eigendecomposition
    [lambda,index] = sort(abs(diag(D)));        % order by eigenvalue magnitude
    lambda = lambda(M:-1:1); Q = Q0(:,index(M:-1:1));   % descending order
    Rbar = 1./abs(fftshift(fft(Q(:,M),Nfft))).^2;       % pseudospectrum

Figure 9.21 shows the pseudospectrum of the Pisarenko harmonic decomposition for a single realization with an FFT size of 1024. Note the two peaks near $f = \pm 0.2$. Recall that this is a pseudospectrum, so that the actual values do not correspond to an estimate of power. A Matlab routine for estimating frequencies using the Pisarenko harmonic decomposition is provided in phd.m.

FIGURE 9.21 Pseudospectrum for the Pisarenko harmonic decomposition of a sinusoid in noise with frequency f = 0.2. The plot shows pseudospectrum (dB) versus normalized frequency over −0.5 to 0.5.
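The rooting alternative in (9.6.27) can be sketched in a few lines. The MATLAB fragment below uses the theoretical correlation matrix of a unit-amplitude real sinusoid at $f = 0.2$ (two complex exponentials of power $1/4$ each) rather than an estimate from data, so the recovered frequencies are exact up to rounding; this setup and the variable names are assumptions for illustration and are not the contents of phd.m.

```matlab
% Pisarenko frequency estimates by rooting the noise eigenvector (theoretical Rx).
M = 3; sigma2 = 1;                           % M = P + 1 with P = 2
fp = [0.2 -0.2]; pw = [0.25 0.25];           % sinusoid = conjugate pair, power 1/4 each
V  = exp(1j*2*pi*(0:M-1).'*fp);              % frequency vectors, (9.6.6)
Rx = V*diag(pw)*V' + sigma2*eye(M);
Rx = (Rx + Rx')/2;                           % enforce exact Hermitian symmetry

[Q0, D] = eig(Rx);
[~, imin] = min(real(diag(D)));              % index of the minimum eigenvalue
qM = Q0(:, imin);                            % single noise eigenvector, (9.6.23)

z = roots(qM);                               % zeros of q(1) z^(M-1) + ... + q(M)
fhat = sort(angle(z)/(2*pi));                % root phases divided by 2*pi
```

With the true correlation matrix, `fhat` returns −0.2 and 0.2. With a sample correlation matrix the roots move off the unit circle, and the estimates inherit the sensitivity to errors in the single noise eigenvector noted above.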

9.6.3 MUSIC Algorithm

The multiple signal classification (MUSIC) frequency estimation method was proposed as an improvement on the Pisarenko harmonic decomposition (Bienvenu and Kopp 1983; Schmidt 1986). As in the Pisarenko harmonic decomposition, the $M$-dimensional space is split into signal and noise components using the eigenvectors of the correlation matrix from (9.6.15). However, rather than limit the length of the time window to $M = P + 1$, that is, 1 greater than the number of complex exponentials, we allow the size of the time window to be $M > P + 1$. Therefore, the noise subspace has a dimension greater than 1. Using this larger dimension allows for averaging over the noise subspace, providing an improved, more robust frequency estimation method than Pisarenko harmonic decomposition.

Because of the orthogonality between the noise and signal subspaces, all the time-window frequency vectors of the complex exponentials are orthogonal to the noise subspace from (9.6.19). Thus, for each eigenvector ($P < m \le M$)

$$\mathbf{v}^H(f_p)\,\mathbf{q}_m = \sum_{k=1}^{M} q_m(k)\, e^{-j2\pi f_p (k-1)} = 0 \tag{9.6.28}$$

for all the $P$ frequencies $f_p$ of the complex exponentials. Therefore, if we compute a pseudospectrum for each noise eigenvector as

$$\bar{R}_m(e^{j2\pi f}) = \frac{1}{|\mathbf{v}^H(f)\,\mathbf{q}_m|^2} = \frac{1}{|Q_m(e^{j2\pi f})|^2} \tag{9.6.29}$$

the polynomial $Q_m(e^{j2\pi f})$ has $M - 1$ roots, $P$ of which correspond to the frequencies of the complex exponentials. These roots produce $P$ peaks in the pseudospectrum from (9.6.29). Note that the pseudospectra of all $M - P$ noise eigenvectors share these roots that are due to the signal subspace. The remaining roots of the noise eigenvectors, however, occur at different frequencies. There are no constraints on the location of these roots, so that some may be close to the unit circle and produce extra peaks in the pseudospectrum. A means of reducing the levels of these spurious peaks in the pseudospectrum is to average the $M - P$ pseudospectra of the individual noise eigenvectors

$$\bar{R}_{music}(e^{j2\pi f}) = \frac{1}{\displaystyle\sum_{m=P+1}^{M} |\mathbf{v}^H(f)\,\mathbf{q}_m|^2} = \frac{1}{\displaystyle\sum_{m=P+1}^{M} |Q_m(e^{j2\pi f})|^2} \tag{9.6.30}$$

which is known as the MUSIC pseudospectrum. The frequency estimates of the $P$ complex exponentials are then taken as the $P$ peaks in this pseudospectrum. Again, the term pseudospectrum is used because the quantity in (9.6.30) does not contain information about the powers of the complex exponentials or the background noise level. Note that for $M = P + 1$, the MUSIC method is equivalent to Pisarenko harmonic decomposition.

The implicit assumption in the MUSIC pseudospectrum is that the noise eigenvalues all have equal power $\lambda_m = \sigma_w^2$, that is, the noise is white. However, in practice, when an estimate is used in place of the actual correlation matrix, the noise eigenvalues will not be equal. The differences become more pronounced when the correlation matrix is estimated from a small number of data samples. Thus, a slight variation on the MUSIC algorithm, known as the eigenvector (ev) method, was proposed to account for the potentially different

noise eigenvalues (Johnson and DeGraaf 1982). For this method, the pseudospectrum is

$$\bar{R}_{ev}(e^{j2\pi f}) = \frac{1}{\displaystyle\sum_{m=P+1}^{M} \frac{1}{\lambda_m} |\mathbf{v}^H(f)\,\mathbf{q}_m|^2} = \frac{1}{\displaystyle\sum_{m=P+1}^{M} \frac{1}{\lambda_m} |Q_m(e^{j2\pi f})|^2} \tag{9.6.31}$$

where $\lambda_m$ is the eigenvalue corresponding to the eigenvector $\mathbf{q}_m$. The pseudospectrum of each eigenvector is normalized by its corresponding eigenvalue. In the case of equal noise eigenvalues ($\lambda_m = \sigma_w^2$ for $P + 1 \le m \le M$), the eigenvector and MUSIC methods are identical to within the constant scale factor $\sigma_w^2$.

The peaks in the MUSIC pseudospectrum correspond to the frequencies at which the denominator in (9.6.30), $\sum_{m=P+1}^{M} |Q_m(e^{j2\pi f})|^2$, approaches zero. Therefore, we might want to consider the z-transform of this denominator

$$\bar{P}_{music}(z) = \sum_{m=P+1}^{M} Q_m(z)\, Q_m^*\!\left(\frac{1}{z^*}\right) \tag{9.6.32}$$

which is the sum of the z-transforms of the pseudospectrum denominators due to each noise eigenvector. This polynomial has $2M - 1$ coefficients, that is, order $2(M - 1)$, and therefore $M - 1$ pairs of roots with one inside and one outside the unit circle. Since we assume that the complex exponentials are not damped, their corresponding roots must lie on the unit circle. Thus, of the $M - 1$ root pairs of (9.6.32), the $P$ closest to the unit circle will correspond to the complex exponentials. The phases of these roots then give the frequency estimates. This method of rooting the polynomial corresponding to the MUSIC pseudospectrum is known as root-MUSIC (Barabell 1983). Note that in many cases, a rooting method is more efficient than computing a pseudospectrum at a very fine frequency resolution that may require a very large FFT.

Statistical performance analyses of the MUSIC algorithm can be found in Kaveh and Barabell (1986) and Stoica and Nehorai (1989). For the performance of the root-MUSIC method see Rao and Hari (1989). A routine for the MUSIC algorithm is provided in music.m and a routine for the root-MUSIC algorithm is provided in rootmusic.m.

EXAMPLE 9.6.2. In this example, we demonstrate the use of the MUSIC algorithm and examine its performance in terms of resolution with respect to that of the minimum-variance spectral estimator. Consider the following scenario: two complex exponentials in unit power noise ($\sigma_w^2 = 1$) with normalized frequencies $f = 0.1, 0.2$, both with amplitudes of $\alpha = 1$. We generate $N = 128$ samples of the signal and use a frequency vector of length $M = 8$. Proceeding as we did in Example 9.6.1, we compute the eigendecomposition and partition it into signal and noise subspaces. The MUSIC pseudospectrum is computed as

    Qbar = zeros(Nfft,1);
    for n = 1:(M-P)
        Qbar = Qbar + abs(fftshift(fft(Q(:,M-(n-1)),Nfft))).^2;   % sum over noise eigenvectors
    end
    Rbar = 1./Qbar;

The minimum-variance spectral estimate and the MUSIC pseudospectrum are computed and averaged over 1000 realizations using an FFT size of 1024. The result is shown in Figure 9.22. The two exponentials have been clearly resolved using the MUSIC algorithm, whereas they are not very clear using the minimum-variance spectral estimate. Since the minimum-variance spectral estimator is nonparametric and makes no assumptions about the underlying model, it cannot achieve the resolution of the MUSIC algorithm.
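The eigenvalue weighting in (9.6.31) changes only one line relative to the MUSIC computation. The MATLAB sketch below evaluates both denominators from a theoretical correlation matrix, for which the noise eigenvalues are all equal, so the two pseudospectra differ only by the scale factor $\sigma_w^2$. The choice $\sigma_w^2 = 2$, the use of the true rather than the sample correlation matrix, and the variable names are assumptions made to keep the check deterministic; this is not the contents of music.m or of the eigenvector-method routine.

```matlab
% MUSIC (9.6.30) and eigenvector-method (9.6.31) pseudospectra from a theoretical Rx.
M = 8; P = 2; sigma2 = 2; Nfft = 1024;       % assumed sizes and noise power
fp = [0.1 0.2];                              % unit-amplitude complex exponentials
V  = exp(1j*2*pi*(0:M-1).'*fp);
Rx = V*V' + sigma2*eye(M);
Rx = (Rx + Rx')/2;                           % enforce exact Hermitian symmetry

[Q0, D] = eig(Rx);
[lambda, idx] = sort(real(diag(D)), 'descend');
Q = Q0(:, idx);

Dmusic = zeros(Nfft,1);  Dev = zeros(Nfft,1);
for m = (P+1):M
    Qm2 = abs(fftshift(fft(Q(:,m), Nfft))).^2;   % |Q_m(e^{j 2 pi f})|^2 on the FFT grid
    Dmusic = Dmusic + Qm2;                       % MUSIC denominator
    Dev    = Dev + Qm2/lambda(m);                % eigenvalue-weighted denominator
end
Rmusic = 1./Dmusic;   Rev = 1./Dev;

fax = (-Nfft/2:Nfft/2-1).'/Nfft;             % frequency axis matching fftshift
[~, imax] = max(Rmusic);
fpeak = fax(imax);                           % location of the largest peak
```

With a sample correlation matrix the values `lambda(P+1:M)` are no longer equal, and the two pseudospectra then differ in shape as well as in scale.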

FIGURE 9.22 Comparison of the minimum-variance spectral estimate (dashed line) and the MUSIC pseudospectrum (solid line) for two complex exponentials in noise. The plot shows power (dB) versus normalized frequency over −0.5 to 0.5.

9.6.4 Minimum-Norm Method

The minimum-norm method (Kumaresan and Tufts 1983), like the MUSIC algorithm, uses a time-window vector of length $M > P + 1$ for the purposes of frequency estimation. For MUSIC, a larger time window is used than for Pisarenko harmonic decomposition, resulting in a larger noise subspace. The use of a larger subspace provides the necessary robustness for frequency estimation when an estimated correlation matrix is used. The same principle is applied in the minimum-norm frequency estimation method. However, rather than average the pseudospectra of all the noise subspace eigenvectors to reduce spurious peaks, as in the case of the MUSIC algorithm, a different approach is taken. Consider a single vector $\mathbf{u}$ contained in the noise subspace. The pseudospectrum of this vector is given by

$$\bar{R}(e^{j2\pi f}) = \frac{1}{|\mathbf{v}^H(f)\,\mathbf{u}|^2} \tag{9.6.33}$$

Since the vector $\mathbf{u}$ lies in the noise subspace, its pseudospectrum in (9.6.33) has $P$ peaks corresponding to the complex exponentials in the signal subspace. However, $\mathbf{u}$ is length $M$ so that its pseudospectrum may exhibit an additional $M - P - 1$ peaks that do not correspond to the frequencies of the complex exponentials. These spurious peaks lead to frequency estimation errors. In the case of Pisarenko harmonic decomposition, spurious peaks were not a concern since $M = P + 1$ and therefore its pseudospectrum in (9.6.25) only had $P$ peaks. On the other hand, the MUSIC algorithm diluted the strength of these spurious peaks since its pseudospectrum in (9.6.30) is produced by averaging the pseudospectra of the $M - P$ noise eigenvectors.

Recall the projection onto the noise subspace from (9.6.17) is

$$\mathbf{P}_w = \mathbf{Q}_w \mathbf{Q}_w^H \tag{9.6.34}$$

where $\mathbf{Q}_w$ is the matrix of noise eigenvectors. Therefore, for any vector $\mathbf{u}$ that lies in the noise subspace

$$\mathbf{P}_w \mathbf{u} = \mathbf{u} \qquad \mathbf{P}_s \mathbf{u} = \mathbf{0} \tag{9.6.35}$$

where $\mathbf{P}_s$ is the signal subspace projection matrix and $\mathbf{0}$ is the length-$M$ zero vector. Now let us consider the z-transform of the coefficients of $\mathbf{u} = [u(1) \;\; u(2) \;\; \cdots \;\; u(M)]^T$

$$U(z) = \sum_{k=0}^{M-1} u(k+1)\, z^{-k} = \prod_{k=1}^{P} (1 - e^{j2\pi f_k} z^{-1}) \prod_{k=P+1}^{M-1} (1 - z_k z^{-1}) \tag{9.6.36}$$

This polynomial is the product of the $P$ roots corresponding to complex exponentials that lie on the unit circle and the $M - P - 1$ roots that in general do not lie directly on the unit circle but can potentially produce spurious peaks in the pseudospectrum of $\mathbf{u}$. Therefore, we want to choose $\mathbf{u}$ so that it minimizes the spurious peaks due to these other roots of its associated polynomial $U(z)$.

The minimum-norm method, as its name implies, seeks to minimize the norm of $\mathbf{u}$ in order to avoid spurious peaks in its pseudospectrum. Using (9.6.35), the norm of a vector $\mathbf{u}$ contained in the noise subspace is

$$\|\mathbf{u}\|^2 = \mathbf{u}^H \mathbf{u} = \mathbf{u}^H \mathbf{P}_w \mathbf{u} \tag{9.6.37}$$

However, an unconstrained minimization of this norm will produce the zero vector. Therefore, we place the constraint that the first element of $\mathbf{u}$ must equal 1.† This constraint can be expressed as

$$\boldsymbol{\delta}_1^H \mathbf{u} = 1 \tag{9.6.38}$$

where $\boldsymbol{\delta}_1 = [1 \;\; 0 \;\; \cdots \;\; 0]^T$. Then the determination of the minimum-norm vector comes down to solving the following constrained minimization problem:

$$\min \|\mathbf{u}\|^2 = \mathbf{u}^H \mathbf{P}_w \mathbf{u} \qquad \text{subject to} \qquad \boldsymbol{\delta}_1^H \mathbf{u} = 1 \tag{9.6.39}$$

The solution can be found by using Lagrange multipliers (see Appendix B) and is given by

$$\mathbf{u}_{mn} = \frac{\mathbf{P}_w \boldsymbol{\delta}_1}{\boldsymbol{\delta}_1^H \mathbf{P}_w \boldsymbol{\delta}_1} \tag{9.6.40}$$

The frequency estimates are then obtained from the peaks in the pseudospectrum of the minimum-norm (mn) vector $\mathbf{u}_{mn}$

$$\bar{R}_{mn}(e^{j2\pi f}) = \frac{1}{|\mathbf{v}^H(f)\,\mathbf{u}_{mn}|^2} \tag{9.6.41}$$
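The Lagrange-multiplier step behind (9.6.40) can be sketched as follows. This is one possible derivation and not necessarily the one in Appendix B. It assumes that $\mathbf{u}$ is kept in the noise subspace by writing $\mathbf{u} = \mathbf{P}_w \mathbf{c}$ for an arbitrary vector $\mathbf{c}$; the vector $\mathbf{c}$ and the multiplier $\mu$ are symbols introduced here for illustration.

```latex
\begin{aligned}
&\text{With } \mathbf{u} = \mathbf{P}_w \mathbf{c}:\quad
  \|\mathbf{u}\|^2 = \mathbf{c}^H \mathbf{P}_w^H \mathbf{P}_w \mathbf{c}
  = \mathbf{c}^H \mathbf{P}_w \mathbf{c}
  \qquad (\mathbf{P}_w^H = \mathbf{P}_w,\ \mathbf{P}_w^2 = \mathbf{P}_w) \\[4pt]
&\mathcal{L}(\mathbf{c}, \mu) = \mathbf{c}^H \mathbf{P}_w \mathbf{c}
  + \mu\,\bigl(1 - \boldsymbol{\delta}_1^H \mathbf{P}_w \mathbf{c}\bigr)
  + \mu^*\bigl(1 - \mathbf{c}^H \mathbf{P}_w \boldsymbol{\delta}_1\bigr) \\[4pt]
&\nabla_{\mathbf{c}^*} \mathcal{L}
  = \mathbf{P}_w \mathbf{c} - \mu^* \mathbf{P}_w \boldsymbol{\delta}_1 = \mathbf{0}
  \;\;\Longrightarrow\;\;
  \mathbf{u} = \mathbf{P}_w \mathbf{c} = \mu^* \mathbf{P}_w \boldsymbol{\delta}_1 \\[4pt]
&\boldsymbol{\delta}_1^H \mathbf{u} = 1
  \;\;\Longrightarrow\;\;
  \mu^* = \frac{1}{\boldsymbol{\delta}_1^H \mathbf{P}_w \boldsymbol{\delta}_1}
  \;\;\Longrightarrow\;\;
  \mathbf{u}_{mn} = \frac{\mathbf{P}_w \boldsymbol{\delta}_1}
                         {\boldsymbol{\delta}_1^H \mathbf{P}_w \boldsymbol{\delta}_1},
  \qquad
  \|\mathbf{u}_{mn}\|^2 = \frac{1}{\boldsymbol{\delta}_1^H \mathbf{P}_w \boldsymbol{\delta}_1}
\end{aligned}
```

The same minimum follows from the Cauchy-Schwarz inequality: for any $\mathbf{u}$ in the noise subspace, $1 = |(\mathbf{P}_w \boldsymbol{\delta}_1)^H \mathbf{u}| \le \|\mathbf{P}_w \boldsymbol{\delta}_1\|\,\|\mathbf{u}\|$, with equality when $\mathbf{u}$ is proportional to $\mathbf{P}_w \boldsymbol{\delta}_1$.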

The performance of the minimum-norm frequency estimation method is similar to that of MUSIC. For a performance comparison see Kaveh and Barabell (1986). Note that it is also possible to implement the minimum-norm method by rooting a polynomial rather than computing a psuedospectrum (see Problem 9.25). EXAMPLE 9.6.3. In this example, we illustrate the use of the minimum-norm method and compare its performance to that of the other three frequency estimation methods discussed in this chapter: Pisarenko harmonic decomposition, the MUSIC algorithm, and the eigenvector method. The pseudospectrum of the minimum-norm method is found by first computing the minimum-norm vector umn and then finding its pseudospectrum, that is,

delta1 = zeros(M,1); delta1(1) = 1; Pn=Q(:,(P+1):M)*Q(:,(P+1):M)’; % noise subspace projection matrix u = (Pn*e1)/(e1’*Pn*e1); % minimum-norm vector Rbar = 1./abs(fftshift(fft(u,Nfft))).ˆ2; % pseudospectrum Consider the case of P = 4 complex exponentials in noise with frequencies f = 0.1, 0.25, 0.4, and −0.1, all with an amplitude of α = 1. The power of the noise is set to α 2w = 1 with 100 realizations. The time-window length used was M = 8 for all the methods except Pisarenko harmonic decomposition, which is constrained to use M = P + 1 = 5. The pseudospectra are shown in Figure 9.23 with an FFT size of 1024, where we have not averaged in order to demonstrate the variance of the various methods. Here we see the large variance in the frequency estimates that is produced by Pisarenko harmonic decomposition compared to the other methods, which is a direct result of using a one-dimensional noise subspace. The other methods all perform comparably in terms of estimating the frequencies of the complex exponentials. Note the fluctuations in the pseudospectrum of the eigenvector method that result from the normalization †

by the eigenvalues. Since these eigenvalues vary over realizations, the pseudospectra will also reflect a similar variation. Routines for the eigenvector method and the minimum-norm method are provided in ev_method.m and minnorm.m, respectively.

† The choice of a value of 1 is somewhat arbitrary, since any nonzero constant will result in a similar solution.

FIGURE 9.23 Comparison of the eigendecomposition-based frequency estimation methods: (a) Pisarenko harmonic decomposition, (b) MUSIC, (c) eigenvector method, and (d) minimum-norm method. [Four panels, each plotting pseudospectrum (dB) versus normalized frequency from −0.5 to 0.5.]
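As a self-contained illustration of the steps in Example 9.6.3, the following Matlab sketch generates one realization, estimates the correlation matrix, and evaluates the minimum-norm pseudospectrum. It is not the book's minnorm.m routine: the data-matrix layout (rows equal to $\mathbf{x}^H(n)$), the names K, Rhat, and fgrid, and the simple peak picker are assumptions made for this sketch.

```matlab
% Minimum-norm pseudospectrum for one realization (illustrative sketch only)
N = 128; M = 8; P = 4; Nfft = 1024;
f = [0.1 0.25 0.4 -0.1];                         % true normalized frequencies
n = (0:N-1).';
s = sum(exp(1j*2*pi*n*f), 2);                    % unit-amplitude complex exponentials
x = s + (randn(N,1) + 1j*randn(N,1))/sqrt(2);    % unit-power complex white noise

% Data matrix: row k is the conjugate transpose of the window [x(k) ... x(k+M-1)]
% (assumed layout; K is the number of time windows)
K = N - M + 1;
X = zeros(K, M);
for k = 1:K
    X(k,:) = x(k:k+M-1)';
end
Rhat = (X'*X)/K;                                 % estimated correlation matrix

% Eigendecomposition, eigenvectors ordered by descending eigenvalue
[Q, D] = eig((Rhat + Rhat')/2);                  % enforce exact Hermitian symmetry
[~, idx] = sort(real(diag(D)), 'descend');
Q = Q(:, idx);

delta1 = zeros(M,1); delta1(1) = 1;
Pn = Q(:,(P+1):M)*Q(:,(P+1):M)';                 % noise subspace projection matrix
u = (Pn*delta1)/(delta1'*Pn*delta1);             % minimum-norm vector, u(1) = 1
Rbar = 1./abs(fftshift(fft(u, Nfft))).^2;        % pseudospectrum on the FFT grid
fgrid = (-Nfft/2:Nfft/2-1)/Nfft;                 % matching normalized frequencies

% Simple peak picker: the P largest local maxima of the pseudospectrum
isPeak = [false; Rbar(2:end-1) > Rbar(1:end-2) & Rbar(2:end-1) > Rbar(3:end); false];
pk = find(isPeak);
[~, order] = sort(Rbar(pk), 'descend');
nPk = min(P, numel(pk));
fhat = sort(fgrid(pk(order(1:nPk))));            % frequency estimates
```

Because a single noisy realization is used, the peak locations change from run to run, which is the variance that the example sets out to show.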

9.6.5 ESPRIT Algorithm

A frequency estimation technique that is built upon the same principles as other subspace methods but further exploits a deterministic relationship between subspaces is the estimation of signal parameters via rotational invariance techniques (ESPRIT) algorithm. This method differs from the other subspace methods discussed so far in this chapter in that the signal subspace is estimated from the data matrix $\mathbf{X}$ rather than the estimated correlation matrix $\hat{\mathbf{R}}_x$. The essence of ESPRIT lies in the rotational property between staggered subspaces that is invoked to produce the frequency estimates. In the case of a discrete-time signal or time series, this property relies on observations of the signal over two identical intervals staggered in time. This condition arises naturally for discrete-time signals, provided that the sampling is performed uniformly in time.† Extensions of the ESPRIT method to a spatial array of sensors, the application for which it was originally proposed, will be discussed in Chapter 11 in Section 11.7. We first describe the original, least-squares version of the algorithm (Roy et al. 1986) and then extend the derivation to total least-squares ESPRIT (Roy and Kailath 1989), which is the preferred method for use. Since the derivation of the algorithm requires an extensive amount of formulation and matrix manipulations, we have included a block diagram in Figure 9.24 to be used as a guide through this process.

† This condition is violated in the case of a nonuniformly sampled time series.

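FIGURE 9.24 Block diagram demonstrating the flow of the ESPRIT algorithm starting from the data matrix through the frequency estimates. [Blocks: the unknown signal model $s(n) = \sum_{p=1}^{P} \alpha_p e^{j2\pi f_p n}$ and its time-window signal vector model, with $\mathbf{V}_2 = \mathbf{V}_1\boldsymbol{\Phi}$; the $N \times M$ data matrix $\mathbf{X}$; the SVD $\mathbf{X} = \mathbf{L}\boldsymbol{\Sigma}\mathbf{U}^H$; separation into signal and noise subspaces $\mathbf{U}_s$ and $\mathbf{U}_n$; partition into staggered subspaces $\mathbf{U}_1$ and $\mathbf{U}_2$; computation of $\boldsymbol{\Psi}$ (LS or TLS) from $\mathbf{U}_2 = \mathbf{U}_1\boldsymbol{\Psi}$; matching signal subspace $\boldsymbol{\Psi} = \mathbf{T}\boldsymbol{\Phi}\mathbf{T}^{-1}$; frequency estimates $\hat{f}_p = \angle\phi_p/2\pi$, $1 \le p \le P$, where the $\phi_p$ are the eigenvalues of $\boldsymbol{\Psi}$.]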

Consider a single complex exponential $s_0(n) = \alpha e^{j2\pi f n}$ with complex amplitude $\alpha$ and frequency $f$. This signal has the following property

$$s_0(n+1) = \alpha e^{j2\pi f(n+1)} = s_0(n)\,e^{j2\pi f} \tag{9.6.42}$$

that is, the next sample value is a phase-shifted version of the current value. This phase shift can be represented as a rotation by $e^{j2\pi f}$ on the unit circle. Recall the time-window vector model from (9.6.4) consisting of a signal $\mathbf{s}(n)$, made up of complex exponentials, and the noise component $\mathbf{w}(n)$

$$\mathbf{x}(n) = \sum_{p=1}^{P} \alpha_p \mathbf{v}(f_p)\,e^{j2\pi n f_p} + \mathbf{w}(n) = \mathbf{V}\boldsymbol{\Phi}^n\boldsymbol{\alpha} + \mathbf{w}(n) = \mathbf{s}(n) + \mathbf{w}(n) \tag{9.6.43}$$

where the P columns of matrix $\mathbf{V}$ are length-M time-window frequency vectors of the complex exponentials

$$\mathbf{V} = [\mathbf{v}(f_1)\;\; \mathbf{v}(f_2)\;\; \cdots\;\; \mathbf{v}(f_P)] \tag{9.6.44}$$

The vector $\boldsymbol{\alpha}$ consists of the amplitudes of the complex exponentials $\alpha_p$. On the other hand, matrix $\boldsymbol{\Phi}$ is the diagonal matrix of phase shifts between neighboring time samples of the individual, complex exponential components of $s(n)$

$$\boldsymbol{\Phi} = \mathrm{diag}\{\phi_1, \phi_2, \ldots, \phi_P\} = \begin{bmatrix} e^{j2\pi f_1} & 0 & \cdots & 0 \\ 0 & e^{j2\pi f_2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & e^{j2\pi f_P} \end{bmatrix} \tag{9.6.45}$$

where $\phi_p = e^{j2\pi f_p}$ for p = 1, 2, . . . , P. Since the frequencies of the complex exponentials $f_p$ completely describe this rotation matrix, frequency estimates can be obtained by finding $\boldsymbol{\Phi}$. Let us consider two overlapping subwindows of length M − 1 within the length-M time-window vector. This subwindowing operation is illustrated in Figure 9.25. Consider the signal consisting of the sum of complex exponentials

$$\mathbf{s}(n) = \begin{bmatrix} \mathbf{s}_{M-1}(n) \\ s(n+M-1) \end{bmatrix} = \begin{bmatrix} s(n) \\ \mathbf{s}_{M-1}(n+1) \end{bmatrix} \tag{9.6.46}$$

where $\mathbf{s}_{M-1}(n)$ is the length-(M − 1) subwindow of $\mathbf{s}(n)$, that is,

$$\mathbf{s}_{M-1}(n) = \mathbf{V}_{M-1}\boldsymbol{\Phi}^n\boldsymbol{\alpha} \tag{9.6.47}$$
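The relations (9.6.42), (9.6.43), and (9.6.45) can be checked numerically in the noise-free case. In the sketch below, the frequencies, amplitudes, window length, and time index are hypothetical values chosen only for illustration, and the helper sig is not from the text.

```matlab
% Noise-free check of the time-window model s(n) = V*Phi^n*alpha (illustrative values)
M = 8;                                   % time-window length
f = [0.1 0.25 -0.1];                     % hypothetical normalized frequencies
alpha = [1; 0.5*exp(1j*pi/4); 2];        % hypothetical complex amplitudes
V = exp(1j*2*pi*(0:M-1).'*f);            % M x P matrix, columns v(f_p)
Phi = diag(exp(1j*2*pi*f));              % P x P diagonal rotation matrix

% Scalar signal s(k) = sum_p alpha_p*exp(j*2*pi*f_p*k), evaluated at times k
sig = @(k) exp(1j*2*pi*k(:)*f)*alpha;

n = 5;                                   % arbitrary time index
s_direct = sig(n:(n+M-1));               % window [s(n) ... s(n+M-1)].'
s_model  = V*Phi^n*alpha;                % same window from the matrix model
err_model = norm(s_direct - s_model);    % should be at round-off level

% Single exponential: successive samples differ only by the phase factor exp(j*2*pi*f)
s0 = alpha(1)*exp(1j*2*pi*f(1)*(0:9).');
f_from_ratio = angle(s0(2:end)./s0(1:end-1))/(2*pi);   % every entry equals f(1)
```

The second check is the scalar version of the idea ESPRIT exploits: the phase of the ratio of neighboring samples reveals the frequency.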

FIGURE 9.25 Time-staggered, overlapping windows used by the ESPRIT algorithm. [Time-axis labels n, n + 1, n + M − 1, and n + M; the windows shown are $\mathbf{x}(n)$, $\mathbf{x}_{M-1}(n)$, and $\mathbf{x}_{M-1}(n+1)$.]

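Matrix $\mathbf{V}_{M-1}$ is constructed in the same manner as $\mathbf{V}$ except its time-window frequency vectors are of length M − 1, denoted as $\mathbf{v}_{M-1}(f)$,

$$\mathbf{V}_{M-1} = [\mathbf{v}_{M-1}(f_1)\;\; \mathbf{v}_{M-1}(f_2)\;\; \cdots\;\; \mathbf{v}_{M-1}(f_P)] \tag{9.6.48}$$

Recall that $s(n)$ is the scalar signal made up of the sum of complex exponentials at time n. Using the relation in (9.6.47), we can define the matrices

$$\mathbf{V}_1 = \mathbf{V}_{M-1}\boldsymbol{\Phi}^n \qquad \text{and} \qquad \mathbf{V}_2 = \mathbf{V}_{M-1}\boldsymbol{\Phi}^{n+1} \tag{9.6.49}$$

where $\mathbf{V}_1$ and $\mathbf{V}_2$ correspond to the unstaggered and staggered windows, that is,

$$\mathbf{V}\boldsymbol{\Phi}^n = \begin{bmatrix} \mathbf{V}_1 \\ \ast\;\; \ast\;\; \cdots\;\; \ast \end{bmatrix} = \begin{bmatrix} \ast\;\; \ast\;\; \cdots\;\; \ast \\ \mathbf{V}_2 \end{bmatrix} \tag{9.6.50}$$

Clearly, by examining (9.6.49), these two matrices of time-window frequency vectors are related as

$$\mathbf{V}_2 = \mathbf{V}_1\boldsymbol{\Phi} \tag{9.6.51}$$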
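A short numerical check of the shift relation (9.6.51) may help here. The frequencies, window length, and time index below are hypothetical; the final two lines show that, with exact frequency vectors, solving $\mathbf{V}_2 = \mathbf{V}_1\boldsymbol{\Phi}$ in the least-squares sense returns the frequencies.

```matlab
% Check V2 = V1*Phi for staggered subwindows (noise-free, illustrative values)
M = 8;
f = [0.1 0.25 0.4 -0.1];                 % hypothetical normalized frequencies
V   = exp(1j*2*pi*(0:M-1).'*f);          % M x P time-window frequency vectors
Phi = diag(exp(1j*2*pi*f));              % rotation matrix of (9.6.45)

n = 3;                                   % arbitrary time index
VPn = V*Phi^n;
V1 = VPn(1:M-1, :);                      % unstaggered subwindow (first M-1 rows)
V2 = VPn(2:M, :);                        % staggered subwindow (last M-1 rows)
err_shift = norm(V2 - V1*Phi);           % round-off level if (9.6.51) holds

Phi_rec = V1\V2;                         % least-squares solution of V2 = V1*Phi
fhat = angle(diag(Phi_rec))/(2*pi);      % phases of the diagonal give the frequencies
```

In practice $\mathbf{V}_1$ and $\mathbf{V}_2$ are unknown, which is why the derivation that follows works with the estimated subspaces instead.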

Note that each of these two matrices spans a different, though related, (M − 1)-dimensional subspace. Now suppose that we have a data matrix $\mathbf{X}$ from (9.6.20) with N data records of the length-M time-window vector signal $\mathbf{x}(n)$. Using the singular value decomposition (SVD)† discussed in Chapter 8, we can write the data matrix as

$$\mathbf{X} = \mathbf{L}\boldsymbol{\Sigma}\mathbf{U}^H \tag{9.6.52}$$

where $\mathbf{L}$ is an N × N matrix of left singular vectors and $\mathbf{U}$ is an M × M matrix of right singular vectors. Both of these matrices are unitary; that is, $\mathbf{L}^H\mathbf{L} = \mathbf{I}$ and $\mathbf{U}^H\mathbf{U} = \mathbf{I}$. The matrix $\boldsymbol{\Sigma}$ has dimensions N × M and consists of singular values on the main diagonal ordered in descending magnitude. The squared magnitudes of the singular values are equal to the eigenvalues of $\hat{\mathbf{R}}$ scaled by a factor of N from (9.6.21), and the columns of $\mathbf{U}$ are their corresponding eigenvectors. Thus, $\mathbf{U}$ forms an orthonormal basis for the underlying M-dimensional vector space. This subspace can be partitioned into signal and noise subspaces as

$$\mathbf{U} = [\mathbf{U}_s \,|\, \mathbf{U}_n] \tag{9.6.53}$$

where $\mathbf{U}_s$ is the matrix of right-hand singular vectors corresponding to the singular values with the P largest magnitudes. Note that since the signal portion consists of the sum of complex exponentials modeled as time-window frequency vectors $\mathbf{v}(f)$, all these frequency vectors, for $f = f_1, f_2, \ldots, f_P$, must also lie in the signal subspace. As a result, the matrices $\mathbf{V}$ and $\mathbf{U}_s$ span the same subspace. Therefore, there exists an invertible transformation $\mathbf{T}$ that maps $\mathbf{U}_s$ into $\mathbf{V}$, that is,

$$\mathbf{V} = \mathbf{U}_s\mathbf{T} \tag{9.6.54}$$

The transformation $\mathbf{T}$ is never solved for in this derivation, but instead is only formulated as a mapping between these two matrices within the signal subspace. Proceeding as we did with the matrix $\mathbf{V}$ in (9.6.50), we can partition the signal subspace into two smaller (M − 1)-dimensional subspaces as

$$\mathbf{U}_s = \begin{bmatrix} \mathbf{U}_1 \\ \ast\;\; \ast\;\; \cdots\;\; \ast \end{bmatrix} = \begin{bmatrix} \ast\;\; \ast\;\; \cdots\;\; \ast \\ \mathbf{U}_2 \end{bmatrix} \tag{9.6.55}$$

where $\mathbf{U}_1$ and $\mathbf{U}_2$ correspond to the unstaggered and staggered subspaces, respectively. Since $\mathbf{V}_1$ and $\mathbf{V}_2$ correspond to the same subspaces, the relation from (9.6.54) must also hold for these subspaces

$$\mathbf{V}_1 = \mathbf{U}_1\mathbf{T} \qquad \mathbf{V}_2 = \mathbf{U}_2\mathbf{T} \tag{9.6.56}$$
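The SVD relations in (9.6.52) through (9.6.54) can be illustrated with a small simulation. This sketch assumes that each row of the data matrix is $\mathbf{x}^H(n)$ and that $\hat{\mathbf{R}} = \mathbf{X}^H\mathbf{X}/K$ for K data records, which is our reading of (9.6.20) and (9.6.21); the signal parameters and the low noise level are hypothetical.

```matlab
% Squared singular values of X versus eigenvalues of Rhat, and V versus Us (sketch)
N = 128; M = 8; P = 2;
f = [0.1 0.3];                                   % hypothetical frequencies
n = (0:N-1).';
x = sum(exp(1j*2*pi*n*f), 2) ...
    + 0.1*(randn(N,1) + 1j*randn(N,1))/sqrt(2);  % weak complex white noise

K = N - M + 1;                                   % number of data records (windows)
X = zeros(K, M);
for k = 1:K
    X(k,:) = x(k:k+M-1)';                        % row k is x^H(k), assumed layout
end

[L, S, U] = svd(X);                              % X = L*S*U'
sv = diag(S);                                    % M singular values, descending
Rhat = (X'*X)/K;
lam = sort(real(eig((Rhat + Rhat')/2)), 'descend');
rel_err = norm(sv.^2/K - lam)/norm(lam);         % squared singular values = K * eigenvalues

Us = U(:, 1:P);                                  % signal subspace
Un = U(:, (P+1):M);                              % noise subspace

% V should lie (approximately) in the span of Us, so V = Us*T has a good fit
V = exp(1j*2*pi*(0:M-1).'*f);
T = Us\V;                                        % mapping of (9.6.54), least squares
fit_err = norm(V - Us*T)/norm(V);                % small at this noise level
```

The fit error is not exactly zero because $\mathbf{U}_s$ is estimated from noisy data; it shrinks as the noise level decreases or the number of records grows.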

The staggered and unstaggered components of the matrix $\mathbf{V}$ in (9.6.50) are related through the subspace rotation $\boldsymbol{\Phi}$ in (9.6.51). Since the matrices $\mathbf{U}_1$ and $\mathbf{U}_2$ also span these respective, related subspaces, a similar, though different, rotation must exist that relates (rotates) $\mathbf{U}_1$ to $\mathbf{U}_2$

$$\mathbf{U}_2 = \mathbf{U}_1\boldsymbol{\Psi} \tag{9.6.57}$$

where $\boldsymbol{\Psi}$ is this rotation matrix. Recall that frequency estimation comes down to solving for the subspace rotation matrix $\boldsymbol{\Phi}$. We can estimate $\boldsymbol{\Phi}$ by making use of the relations in (9.6.56) together with the rotations between the staggered signal subspaces in (9.6.51) and (9.6.57). In this process, the matrices $\mathbf{U}_1$ and $\mathbf{U}_2$ are known from the SVD on data matrix $\mathbf{X}$. First, we solve for $\boldsymbol{\Psi}$ from the relation in (9.6.57), using the method of least-squares (LS) from Chapter 8

$$\boldsymbol{\Psi} = (\mathbf{U}_1^H\mathbf{U}_1)^{-1}\mathbf{U}_1^H\mathbf{U}_2 \tag{9.6.58}$$

† Our notation differs slightly from that introduced in Chapter 8 in order to avoid confusion with the matrix of time-window frequency vectors $\mathbf{V}$.

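Substituting (9.6.57) into (9.6.56), we have

$$\mathbf{V}_2 = \mathbf{U}_2\mathbf{T} = \mathbf{U}_1\boldsymbol{\Psi}\mathbf{T} \tag{9.6.59}$$

Similarly, we can also solve for $\mathbf{V}_2$, using the relation in (9.6.51) and substituting (9.6.56) for $\mathbf{V}_1$

$$\mathbf{V}_2 = \mathbf{V}_1\boldsymbol{\Phi} = \mathbf{U}_1\mathbf{T}\boldsymbol{\Phi} \tag{9.6.60}$$

Thus, equating the two right-hand sides of (9.6.59) and (9.6.60), we have the following relation between the two subspace rotations

$$\boldsymbol{\Psi}\mathbf{T} = \mathbf{T}\boldsymbol{\Phi} \tag{9.6.61}$$

or equivalently

$$\boldsymbol{\Psi} = \mathbf{T}\boldsymbol{\Phi}\mathbf{T}^{-1} \tag{9.6.62}$$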
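The consequence of (9.6.62) is that $\boldsymbol{\Psi}$ and $\boldsymbol{\Phi}$ share eigenvalues, whatever the invertible matrix $\mathbf{T}$ happens to be. The sketch below illustrates this with a hypothetical $\mathbf{T}$ (a scaled DFT matrix, chosen only because it is known to be invertible) and hypothetical frequencies.

```matlab
% Eigenvalues of Psi = T*Phi*inv(T) are the diagonal entries of Phi (sketch)
f = [0.1 0.15 0.4 -0.15];                % hypothetical normalized frequencies
P = numel(f);
Phi = diag(exp(1j*2*pi*f));              % diagonal rotation matrix
T = fft(eye(P))*diag(1:P);               % hypothetical invertible transformation
Psi = T*Phi/T;                           % Psi = T*Phi*inv(T), as in (9.6.62)

res = norm(Psi*T - T*Phi);               % (9.6.61) holds to round-off
phi = eig(Psi);                          % eigenvalues of Psi
fhat = sort(angle(phi)/(2*pi));          % phases divided by 2*pi
err = max(abs(fhat - sort(f(:))));       % matches the frequencies up to ordering
```

Note that eig returns the eigenvalues in no particular order, so the estimates are sorted before comparison.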

Equations (9.6.61) and (9.6.62) should be recognized as the relationship between eigenvectors and eigenvalues of the matrix $\boldsymbol{\Psi}$ (Golub and Van Loan 1996). Therefore, the diagonal elements of $\boldsymbol{\Phi}$, $\phi_p$ for p = 1, 2, . . . , P, are simply the eigenvalues of $\boldsymbol{\Psi}$. As a result, the estimates of the frequencies are

$$\hat{f}_p = \frac{\angle\phi_p}{2\pi} \tag{9.6.63}$$

where $\angle\phi_p$ is the phase of $\phi_p$.

Although the principle behind the ESPRIT algorithm, namely, the use of subspace rotations, is quite simple, one can easily get lost in the details of the derivation of the algorithm. Note that we have only used simple matrix relationships. An illustrative example of the implementation of ESPRIT in Matlab is given in Example 9.6.4 to help clarify the details of the algorithm. However, first we give a total least-squares version of the algorithm, which is the preferred method for use. Note that the subspaces $\mathbf{U}_1$ and $\mathbf{U}_2$ are both only estimates of the true subspaces that correspond to $\mathbf{V}_1$ and $\mathbf{V}_2$, respectively, obtained from the data matrix $\mathbf{X}$. The estimate of the subspace rotation was obtained by solving (9.6.57) using the LS criterion

$$\boldsymbol{\Psi}_{ls} = (\mathbf{U}_1^H\mathbf{U}_1)^{-1}\mathbf{U}_1^H\mathbf{U}_2 \tag{9.6.64}$$

This LS solution is obtained by minimizing the errors in an LS sense from the following formulation

$$\mathbf{U}_2 + \mathbf{E}_2 = \mathbf{U}_1\boldsymbol{\Psi} \tag{9.6.65}$$

where $\mathbf{E}_2$ is a matrix consisting of errors between $\mathbf{U}_2$ and the true subspace corresponding to $\mathbf{V}_2$. Note that this LS formulation assumes errors only on the estimation of $\mathbf{U}_2$ and no errors between $\mathbf{U}_1$ and the true subspace that it is attempting to estimate corresponding to $\mathbf{V}_1$. Therefore, since $\mathbf{U}_1$ is also an estimated subspace, a more appropriate formulation is

$$\mathbf{U}_2 + \mathbf{E}_2 = (\mathbf{U}_1 + \mathbf{E}_1)\boldsymbol{\Psi} \tag{9.6.66}$$

where $\mathbf{E}_1$ is the matrix representing the errors between $\mathbf{U}_1$ and the true subspace corresponding to $\mathbf{V}_1$. A solution to this problem, known as total least squares (TLS), is obtained by minimizing the Frobenius norm of the two error matrices

$$\left\| [\mathbf{E}_1\;\; \mathbf{E}_2] \right\|_F \tag{9.6.67}$$

Since the principles of TLS are beyond the scope of this book, we simply give the procedure to obtain the TLS solution of $\boldsymbol{\Psi}$ and refer the interested reader to Golub and Van Loan (1996).

First, form a matrix made up of the staggered signal subspace matrices $\mathbf{U}_1$ and $\mathbf{U}_2$ placed side by side,† and perform an SVD

$$[\mathbf{U}_1\;\; \mathbf{U}_2] = \tilde{\mathbf{L}}\tilde{\boldsymbol{\Sigma}}\tilde{\mathbf{U}}^H \tag{9.6.68}$$

We then operate on the 2P × 2P matrix $\tilde{\mathbf{U}}$ of right singular vectors. This matrix is partitioned into P × P quadrants

$$\tilde{\mathbf{U}} = \begin{bmatrix} \tilde{\mathbf{U}}_{11} & \tilde{\mathbf{U}}_{12} \\ \tilde{\mathbf{U}}_{21} & \tilde{\mathbf{U}}_{22} \end{bmatrix} \tag{9.6.69}$$

The TLS solution for the subspace rotation matrix is then

$$\boldsymbol{\Psi}_{tls} = -\tilde{\mathbf{U}}_{12}\tilde{\mathbf{U}}_{22}^{-1} \tag{9.6.70}$$

The frequency estimates are then obtained from (9.6.62) and (9.6.63) by using $\boldsymbol{\Psi}_{tls}$ from (9.6.70). Although the TLS version of ESPRIT involves slightly more computations, it is generally preferred over the LS version because of the formulation in (9.6.66), which allows for errors in both subspace estimates. A statistical analysis of the performance of the ESPRIT algorithms is given in Ottersten et al. (1991).

EXAMPLE 9.6.4. In this illustrative example, we demonstrate the use of both the LS and TLS versions of the ESPRIT algorithm on a set of complex exponentials in white noise using Matlab. First, generate a signal s(n) of length N = 128 consisting of complex exponential signals at normalized frequencies f = 0.1, 0.15, 0.4, and −0.15, all with amplitude α = 1. Each of the complex exponentials is generated by exp(j*2*pi*f*[0:(N-1)]');. The overall signal in white noise with unit power ($\sigma_w^2 = 1$) is then

    x = s + (randn(N,1)+j*randn(N,1))/sqrt(2);

We form the data matrix corresponding to (9.6.20) for a time window of length M = 8. The least-squares ESPRIT algorithm is then performed as follows:

    [L,S,U] = svd(X);
    Us = U(:,1:P);                        % signal subspace
    U1 = Us(1:(M-1),:); U2 = Us(2:M,:);   % staggered signal subspaces
    Psi = U1\U2;                          % LS solution for Psi

If we are using the TLS version of ESPRIT, then solve for

    [LL,SS,UU] = svd([U1 U2]);
    UU12 = UU(1:P,(P+1):(2*P));
    UU22 = UU((P+1):(2*P),(P+1):(2*P));
    Psi = -UU12*inv(UU22);                % TLS solution for Psi

The frequencies are found by computing the phases of the eigenvalues of $\boldsymbol{\Psi}$, that is,

    phi = eig(Psi);                       % vector of eigenvalues of Psi
    fhat = angle(phi)/(2*pi);             % frequency estimates

In both cases, we average over 1000 realizations and obtain average estimated frequencies very close to the true values f = 0.1, 0.15, 0.4, and −0.15 used to generate the signals. Routines for both the LS and TLS versions of ESPRIT are provided in esprit_ls.m and esprit_tls.m.
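The fragments of Example 9.6.4 can be assembled into one runnable script. The sketch below is not the book's esprit_ls.m or esprit_tls.m: the construction of the data matrix (rows equal to $\mathbf{x}^H(n)$, K = N − M + 1 windows) is an assumption about (9.6.20), and only a single realization is processed rather than the 1000 used for the averages quoted above.

```matlab
% LS and TLS ESPRIT on one realization (illustrative sketch, assumed data layout)
N = 128; M = 8; P = 4;
f = [0.1 0.15 0.4 -0.15];                        % true normalized frequencies
n = (0:N-1).';
s = sum(exp(1j*2*pi*n*f), 2);                    % sum of unit-amplitude exponentials
x = s + (randn(N,1) + 1j*randn(N,1))/sqrt(2);    % unit-power complex white noise

% Data matrix: row k is the conjugate transpose of [x(k) ... x(k+M-1)]
K = N - M + 1;
X = zeros(K, M);
for k = 1:K
    X(k,:) = x(k:k+M-1)';
end

[L, S, U] = svd(X);
Us = U(:, 1:P);                                  % signal subspace
U1 = Us(1:(M-1), :);                             % unstaggered subspace
U2 = Us(2:M, :);                                 % staggered subspace

% LS solution of U2 = U1*Psi, equivalent to (9.6.58)
Psi_ls = U1\U2;

% TLS solution from the right singular vectors of [U1 U2], as in (9.6.70)
[LL, SS, UU] = svd([U1 U2]);
UU12 = UU(1:P, (P+1):(2*P));
UU22 = UU((P+1):(2*P), (P+1):(2*P));
Psi_tls = -UU12/UU22;                            % same as -UU12*inv(UU22)

% Frequencies from the phases of the eigenvalues, sorted for comparison
fhat_ls  = sort(angle(eig(Psi_ls))/(2*pi));
fhat_tls = sort(angle(eig(Psi_tls))/(2*pi));
```

With unit noise power and two closely spaced components at 0.1 and 0.15, a single run can deviate noticeably from the true frequencies; wrapping the script in a loop and averaging the sorted estimates mirrors the averaging described in the example.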

† Note that this matrix $[\mathbf{U}_1\;\; \mathbf{U}_2] \neq \mathbf{U}_s = [\mathbf{U}_1^T\;\; \mathbf{U}_2^T]^T$ from (9.6.55).

9.7 SUMMARY

In this chapter, we have examined the modeling process for both pole-zero and harmonic signal models. As for all signal modeling problems, the procedure begins with the selection of the appropriate model for the signal under consideration. Then the signal model is applied by estimating the model parameters from a collection of data samples. However, as we have stressed throughout this chapter, nothing is more valuable in the modeling process than specific knowledge of the signal and its underlying process in order to assess the validity of the model for a particular signal. For this reason, we began the chapter with a discussion of a model building procedure, starting with the choice of the appropriate model and the estimation of its parameters, and concluding with the validation of the model. Clearly, if the model is not well-suited for the signal, the application of the model becomes meaningless.

In the first part of the chapter, we considered the application of the parametric signal models that were discussed in Chapter 4. The estimation of all-pole models was presented for both direct and lattice structures. Within this context, we used various model order selection criteria to determine the order of the all-pole model. However, these criteria are not necessarily limited to all-pole models. In addition, the relationship was given between the all-pole model and Burg's method of maximum entropy. Next, we considered pole-zero modeling. Using a nonlinear least-squares technique, a method was presented for estimating the parameters of the pole-zero model. The use of pole-zero models for the purposes of spectral estimation along with their application to speech modeling was also considered.

The latter part of the chapter focused on harmonic signal models, that is, modeling signals using the sum of complex exponentials. The harmonic modeling problem becomes one of estimating the frequencies of the complex exponentials. As a bridge between these pole-zero and harmonic models, we discussed the topic of minimum-variance spectral estimation. As will be explored in the problems that follow, there are several interesting relations between the minimum-variance spectrum and the harmonic models. In addition, a relationship between the minimum-variance spectral estimator and the all-pole model was established. Then, we discussed some of the more popular harmonic modeling methods. Starting with the Pisarenko harmonic decomposition, the first such model, we discussed the MUSIC, eigenvector, root-MUSIC, and minimum-norm methods for frequency estimation. All of these methods are based on computing a pseudospectrum or rooting a polynomial obtained from an estimated correlation matrix. Finally, we gave a brief derivation of the ESPRIT algorithm, both in its original LS form and the more commonly used TLS form.

PROBLEMS

9.1 Consider the random process x(n) described in Example 9.2.3 that is simulated by exciting the system function

$$H(z) = \frac{1}{1 - 2.7607z^{-1} + 3.8108z^{-2} - 2.6535z^{-3} + 0.9238z^{-4}}$$

using a WGN(0, 1) process. Generate N = 250 samples of the process x(n).
(a) Write a Matlab function that implements the modified covariance method to obtain AR(P) model coefficients and the modeling error variance $\hat{\sigma}_P^2$ as a function of P, using N samples of x(n).
(b) Compute and plot the variance $\hat{\sigma}_P^2$, FPE(P), AIC(P), MDL(P), and CAT(P) for P = 1, 2, . . . , 15.
(c) Comment on your results and the usefulness of model selection criteria for the process x(n).

9.2 Consider the Burg approach of minimizing forward-backward LS error $E_m^{fb}$ in (9.2.33).
(a) Show that by using (9.2.26) and (9.2.27), $E_m^{fb}$ can be put in the form of (9.2.34).
(b) By minimizing $E_m^{fb}$ with respect to $k_{m-1}$, show that the expression for the optimum $k_{m-1}^{B}$ is given by (9.2.35).
(c) Show that $|k_{m-1}^{B}| < 1$.
(d) Show that $|k_{m-1}^{B}| < |k_{m-1}^{IS}| \le 1$ where $k_{m-1}^{IS}$ is defined in (9.2.36).

9.3 Generate an AR(2) process using the system function

$$H(z) = \frac{1}{1 - 0.9z^{-1} + 0.81z^{-2}}$$

excited by a WGN(0, 1) process. Illustrate numerically that if we use the full-windowing method, that is, the matrix $\bar{\mathbf{X}}$ in (9.2.8), then the PACS estimates $\{k_m^{FP}\}_{m=0}^{1}$, $\{k_m^{BP}\}_{m=0}^{1}$, and $\{k_m^{B}\}_{m=0}^{1}$ of Section 9.2 are identical and hence can be obtained by using the Levinson-Durbin algorithm.

9.4 Generate sample sequences of an AR(2) process x(n) = w(n) − 1.5857x(n − 1) − 0.9604x(n − 2) where w(n) ∼ WGN(0, 1). Choose N = 256 samples for each realization.
(a) Design a first-order optimum linear predictor, and compute the prediction error $e_1(n)$. Test the whiteness of the error sequence $e_1(n)$ using the autocorrelation, PSD, and partial correlation methods, discussed in Section 9.1. Show your results as an overlay plot using 20 realizations.
(b) Repeat the above part, using second- and third-order linear predictors.
(c) Comment on your plots.

9.5 Generate sample functions of the process x(n) = 0.5w(n) + 0.5w(n − 1) where w(n) ∼ WGN(0, 1). Choose N = 256 samples for each realization.
(a) Test the whiteness of x(n) and show your results, using overlay plots based on 10 realizations.
(b) Process x(n) through the AR(1) filter

$$H(z) = \frac{1}{1 + 0.95z^{-1}}$$

to obtain y(n). Test the whiteness of y(n) and show your results, using overlay plots based on 10 realizations.

9.6 The process x(n) contains a complex exponential in white noise, that is, $x(n) = Ae^{j(\omega_0 n + \theta)} + w(n)$ where A is a real positive constant, θ is a random variable uniformly distributed over [0, 2π], $\omega_0$ is a constant between 0 and π, and $w(n) \sim \mathrm{WGN}(0, \sigma_w^2)$. The purpose of this problem is to analytically obtain a maximum entropy method (MEM) estimate by fitting an AR(P) model and then evaluating the $\{a_k\}_0^P$ model coefficients.
(a) Show that the (P + 1) × (P + 1) autocorrelation matrix of x(n) is given by $\mathbf{R}_x = A^2\mathbf{e}\mathbf{e}^H + \sigma_w^2\mathbf{I}$ where $\mathbf{e} = [1\;\; e^{-j\omega_0}\;\; \cdots\;\; e^{-jP\omega_0}]^T$.
(b) By solving autocorrelation normal equations, show that

$$\mathbf{a}_P \triangleq [1\;\; a_1\;\; \cdots\;\; a_P]^T = \left(1 + \frac{A^2}{\sigma_w^2 + A^2 P}\right)\left\{[1\;\; 0\;\; \cdots\;\; 0]^T - \frac{A^2}{\sigma_w^2 + (P+1)A^2}\,\mathbf{e}\right\}$$

(c) Show that the MEM estimate based on the above coefficients is given by

$$\hat{R}_x(e^{j\omega}) = \frac{\sigma_w^2\left[1 - \dfrac{A^2}{\sigma_w^2 + (P+1)A^2}\right]}{\left|1 - \dfrac{A^2}{\sigma_w^2 + (P+1)A^2}\,W_R(e^{j(\omega-\omega_0)})\right|^2}$$

where $W_R(e^{j\omega})$ is the DTFT of the (P + 1) length rectangular window.

9.7 An AR(2) process y(n) is observed in noise v(n) to obtain x(n), that is,

$$x(n) = y(n) + v(n) \qquad v(n) \sim \mathrm{WGN}(0, \sigma_v^2)$$

where v(n) is uncorrelated with y(n) and

$$y(n) = 1.27y(n-1) - 0.81y(n-2) + w(n) \qquad w(n) \sim \mathrm{WGN}(0, 1)$$

(a) Determine and plot the true power spectrum $R_x(e^{j\omega})$.
(b) Generate 10 realizations of x(n), each with N = 256 samples. Using the LS approach with forward-backward linear predictor, estimate the power spectrum for P = 2 and $\sigma_v^2 = 1$. Obtain an overlay plot of this estimate, and compare it with the true spectrum.
(c) Repeat part (b), using $\sigma_v^2 = 10$. Comment on the effect of increasing noise variance on spectrum estimates.
(d) Since the noise variance $\sigma_v^2$ affects only $r_x(0)$, investigate the effect of subtracting a small amount from $r_x(0)$ on the spectrum estimates in part (c).

9.8 Let x(n) be a random process whose correlation is estimated. The values for the first five lags are $r_x(0) = 1$, $r_x(1) = 0.7$, $r_x(2) = 0.5$, $r_x(3) = 0.3$, and $r_x(4) = 0$.
(a) Determine and plot the Blackman-Tukey power spectrum estimate.
(b) Assume that x(n) is modeled by an AP(2) model. Determine and plot its spectrum estimate.
(c) Now repeat (b) assuming that AP(4) is an appropriate model for x(n). Determine and plot the spectrum estimate.

9.9 The narrowband process x(n) is generated using the AP(4) model

$$H(z) = \frac{1}{1 + 0.98z^{-1} + 1.92z^{-2} + 0.94z^{-3} + 0.92z^{-4}}$$

driven by WGN(0, 0.001). Generate 10 realizations, each with N = 256 samples, of this process.
(a) Determine and plot the true power spectrum $R_x(e^{j\omega})$.
(b) Using the LS approach with forward linear predictor, estimate the power spectrum for P = 4. Obtain an overlay plot of this estimate, and compare it with the true spectrum.
(c) Repeat part (b) with P = 8 and 12. Provide a qualitative description of your results with respect to model order size.
(d) Using the LS approach with forward-backward linear predictor, estimate the power spectrum for P = 4. Obtain an overlay plot of this estimate. Compare it with the plot in part (b).

9.10 Consider the following PZ(4, 2) model

$$H(z) = \frac{1 - z^{-2}}{1 + 0.41z^{-4}}$$

driven by WGN(0, 1) to obtain a broadband ARMA process x(n). Generate 10 realizations, each with N = 256 samples, of this process.
(a) Determine and plot the true power spectrum $R_x(e^{j\omega})$.
(b) Using the LS approach with forward-backward linear predictor, estimate the power spectrum for P = 12. Obtain an overlay plot of this estimate, and compare it with the true spectrum.
(c) Using the nonlinear LS pole-zero modeling algorithm of Section 9.3.3, estimate the power spectrum for P = 4 and Q = 2. Obtain an overlay plot of this estimate, and compare it with the plot in part (b).

9.11 A random process x(n) is given by

$$x(n) = \cos\left(\frac{\pi n}{3} + \theta_1\right) + w(n) - w(n-2) + \cos\left(\frac{2\pi n}{3} + \theta_2\right)$$

where w(n) ∼ WGN(0, 1) and $\theta_1$ and $\theta_2$ are IID random variables uniformly distributed between 0 and 2π. Generate a sample sequence with N = 256 samples.
(a) Determine and plot the true spectrum $R_x(e^{j\omega})$.
(b) Using the LS approach with forward-backward linear predictor, estimate the power spectrum for P = 10, 20, and 40 from the generated sample sequence. Compare it with the true spectrum.

(c) Using the nonlinear LS pole-zero modeling algorithm of Section 9.3.3, estimate the power spectrum for P = 4 and Q = 2. Compare it with the true spectrum and with the plot in part (b).

9.12 Show that, for large values of N, the modeling error variance estimate given by Equation (9.2.38) can be approximated by the estimate given by Equation (9.2.39).

9.13 This problem investigates the effect of correlation aliasing observed in LS estimation of model parameters when the AP model is excited by discrete spectra. Consider an AP(1) model with pole at z = α excited by a periodic sequence of period N. Let x(n) be the output sequence.
(a) Show that the correlation at lag 1 satisfies

$$r_x(1) = \frac{\alpha^{N-1} + \alpha}{1 + \alpha^N}\,r_x(0) \tag{P.1}$$

(b) Using the LS approach, determine the estimate $\hat{\alpha}$ as a function of α and N. Compute $\hat{\alpha}$ for α = 0.9 and N = 10.
(c) Generate x(n), using α = 0.95 and the periodic impulse train with N = 10. Compute and plot the correlation sequence $r_x(l)$, 0 ≤ l ≤ N − 1, of x(n). Compare your plot with the AP(1) model correlation for α = 0.95. Comment on your observations and discuss why they explain the discrepancy between α and $\hat{\alpha}$.
(d) Repeat part (c) for N = 100 and 1000. Show analytically and numerically that $\hat{\alpha} \to \alpha$ as N → ∞.

9.14 In this problem, we investigate the equation error method of Section 9.3.1. Consider the PZ(2, 2) model

$$x(n) = 0.3x(n-1) + 0.4x(n-2) + w(n) + 0.25w(n-2)$$

Generate N = 200 samples of x(n), using $w(n) \sim \mathrm{WGN}(0, \sqrt{10})$. Record values of both x(n) and w(n).
(a) Using the residual windowing method, that is, $N_i = \max(P, Q)$ and $N_f = N - 1$, compute the estimates of the above model parameters.
(b) Compute the input variance estimate $\hat{\sigma}_w^2$ from your estimated values in part (a). Compare it with the actual value $\sigma_w^2$ and with (9.3.12).

9.15 Consider the following PZ(4, 2) model

$$x(n) = 1.8766x(n-1) - 2.6192x(n-2) + 1.6936x(n-3) - 0.8145x(n-4) + w(n) + 0.05w(n-1) - 0.855w(n-2)$$

excited by $w(n) \sim \mathrm{WGN}(0, \sqrt{10})$. Generate 300 samples of x(n).
(a) Using the nonlinear LS pole-zero modeling algorithm of Section 9.3.3, estimate the parameters of the above model from the x(n) data segment.
(b) Assuming the AP(10) model for the data segment, estimate its parameters by using the LS approach described in Section 9.2.
(c) Generate a plot similar to Figure 9.13 by computing spectra corresponding to the true PZ(4, 2), estimated PZ(4, 2), and estimated AP(10) models. Compare and comment on your results.

9.16 Using matrix notation, show that AZ power spectrum estimation is equivalent to the Blackman-Tukey method discussed in Chapter 5.

9.17 Consider the PZ(4, 2) model given in Problem 9.15. Generate 300 samples of x(n).
(a) Fit an AP(5) model to the data and plot the resulting spectrum.
(b) Fit an AP(10) model to the data and plot the resulting spectrum.
(c) Fit an AP(50) model to the data and plot the resulting spectrum.
(d) Compare your plots with the true spectrum, and discuss the effect of model mismatch on the quality of the spectrum.

9.18 Use the supplied (about 50-ms) segment of a speech signal sampled at 8192 samples per second.
(a) Compute a periodogram of the speech signal (see Chapter 5).
(b) Using data windowing, fit an AP(16) model to the speech data and compute the spectrum.
(c) Using the residual windowing, fit a PZ(12, 6) model to the speech data and compute the spectrum.
(d) Plot the above three spectra on one graph, and comment on the performance of each method.

9.19 One practical approach to spectrum estimation discussed in Section 9.4 is the prewhitening and postcoloring method.
(a) Develop a Matlab function to implement this method. Use the forward/backward LS method to determine AP(P) parameters and the Welch method for nonparametric spectrum estimation.
(b) Verify your function on the short segment of the speech signal from Problem 9.18.
(c) Compare your results with those obtained in Problem 9.18.

9.20 Consider a white noise process with variance $\sigma_w^2$. Find its minimum-variance power spectral estimate.

9.21 Find the minimum-variance spectrum of a first-order all-pole model, that is, $x(n) = -a_1x(n-1) + w(n)$.

9.22 The filter coefficient vector for the minimum-variance spectrum estimator is given in (9.5.10). Using Lagrange multipliers, discussed in Appendix B, solve this constrained optimization to find this weight vector.

9.23 Using the relationship between the minimum-variance and the all-pole model spectrum estimators in (9.5.22), generate a recursive relationship for the minimum-variance spectrum estimators of increasing window length. In other words, write $\hat{R}_{M+1}^{(mv)}(e^{j2\pi f})$ in terms of $\hat{R}_{M}^{(mv)}(e^{j2\pi f})$ and the all-pole model spectrum estimator $\hat{R}_{M}^{(ap)}(e^{j2\pi f})$ in (9.5.20).

9.24 In Pisarenko harmonic decomposition, discussed in Section 9.6.2, we determine the frequencies of the complex exponentials in white noise through the use of the pseudospectrum. The word pseudospectrum was used because its value does not correspond to an estimated power. Find a set of linear equations that can be solved to find the powers of the complex exponentials. Hint: Use the relationship of eigenvalues and eigenvectors $\mathbf{R}_x\mathbf{q}_m = \lambda_m\mathbf{q}_m$ for m = 1, 2, . . . , M.

9.25 For the MUSIC algorithm, we showed a means of using the MUSIC pseudospectrum to derive a polynomial that could be rooted to obtain frequency estimates, which is known as root-MUSIC. Find a similar rooting method for the minimum-norm frequency estimation procedure.

9.26 The Pisarenko harmonic decomposition, MUSIC, and minimum-norm algorithms yield frequency estimates by computing a pseudospectrum using the Fourier transforms of the eigenvectors. However, these pseudospectra do not actually estimate a power. Derive the minimum-variance spectral estimator in terms of the Fourier transforms of the eigenvectors and the associated eigenvalues. Relate this result to the MUSIC and eigenvector method pseudospectra.

9.27 Show that the pseudospectrum for the MUSIC algorithm is equivalent to the minimum-variance spectrum in the case of an infinite signal-to-noise ratio.

9.28 Find a relationship between the minimum-norm pseudospectrum and the all-pole model spectrum in the case of an infinite signal-to-noise ratio.

9.29 In (9.5.22), we derived a relationship between the minimum-variance spectral estimator and spectrum estimators derived from all-pole models of orders 1 to M. Find a similar relationship between the pseudospectra of the MUSIC and minimum-norm algorithms that shows that the MUSIC pseudospectrum is a weighted average of minimum-norm pseudospectra.

CHAPTER 10

Adaptive Filters

In Chapter 1, we discussed different practical applications that demonstrated the need for adaptive filters, pointed out the key aspects of the underlying signal operating environment (SOE), and illustrated the key features and types of adaptive filters. The defining characteristic of an adaptive filter is its ability to operate satisfactorily, according to a criterion of performance acceptable to the user, in an unknown and possibly time-varying environment without the intervention of the designer. In Chapter 6, we developed the theory of optimum filters under the assumption that the filter designer has complete knowledge of the statistical properties (usually second-order moments) of the SOE. However, in real-world applications such information is seldom available, and the most practical solution is to use an adaptive filter. Adaptive filters can improve their performance, during normal operation, by learning the statistical characteristics through processing current signal observations. In this chapter, we develop a mathematical framework for the design and performance evaluation of adaptive filters, both theoretically and by simulation. The goal of an adaptive filter is to “find and track” the optimum filter corresponding to the same signal operating environment with complete knowledge of the required statistics. In this context, optimum filters provide both guidance for the development of adaptive algorithms and a yardstick for evaluating the theoretical performance of adaptive filters. We start in Section 10.1 with discussion of a few typical application problems that can be effectively solved by using an adaptive filter. The performance of adaptive filters is evaluated using the concepts of stability, speed of adaptation, quality of adaptation, and tracking capabilities. These issues and the key features of an adaptive filter are discussed in Section 10.2. 
Since most adaptive algorithms originate from deterministic optimization methods, in Section 10.3 we introduce the family of steepest-descent algorithms and study their properties. Sections 10.4 and 10.5 provide a detailed discussion of the derivation, properties, and applications of the two most important adaptive filtering algorithms: the least mean square (LMS) and the recursive least-squares (RLS) algorithms. The conventional RLS algorithm, introduced in Section 10.5, can be used for either array processing (multiple-sensor or general input data vector) applications or FIR filtering (single-sensor or shift-invariant input data vector) applications. Section 10.6 deals with different implementations of the RLS algorithm for array processing applications, whereas Section 10.7 provides fast implementations of the RLS algorithm for the FIR filtering case. The development of the latter algorithms is a result of the shift invariance of the data stored in the memory of the FIR filter. Finally, in Section 10.8 we provide a concise introduction to the tracking properties of the LMS and the RLS algorithms.

10.1 TYPICAL APPLICATIONS OF ADAPTIVE FILTERS

As we have already seen in Chapter 1, many practical applications cannot be successfully solved by using fixed digital filters because either we do not have sufficient information to design a digital filter with fixed coefficients or the design criteria change during the normal operation of the filter. Most of these applications can be successfully solved by using a special type of “smart” filters known collectively as adaptive filters. The distinguishing feature of adaptive filters is that they can modify their response to improve their performance during operation without any intervention from the user. The best way to introduce adaptive filters is with some applications for which they are well suited. These and other applications are discussed in greater detail in the sequel as we develop the necessary background and tools.

10.1.1 Echo Cancelation in Communications

An echo is the delayed and distorted version of an original signal that returns to its source. In some applications (radar, sonar, or ultrasound), the echo is the wanted signal; however, in communication applications, the echo is an unwanted signal that must be eliminated. There are two types of echoes in communication systems: (1) electrical or line echoes, which are generated electrically due to impedance mismatches at points along the transmission medium, and (2) acoustic echoes, which result from the reflection of sound waves and acoustic coupling between a microphone and a loudspeaker. Here we focus on electrical echoes in voice communications; electrical echoes in data communications are discussed in Section 10.4.4, and acoustic echoes in teleconferencing and hands-free telephony were discussed in Section 1.4.1. Electrical echoes are observed on long-distance telephone circuits. A simplified form of such a circuit, which is sufficient for the present discussion, is shown in Figure 10.1.
The local links from the customer to the telephone office consist of bidirectional two-wire connections, whereas the connection between the telephone offices is a four-wire carrier facility that may include a satellite link. The conversion between two-wire and four-wire links is done by special devices known as hybrids. An ideal hybrid should pass (1) the incoming signal to the two-wire output without any leakage into its output port and (2) the signal from the two-wire circuit to its output port without reflecting any energy back to the two-wire line (Sondhi and Berkley 1980). In practice, due to impedance mismatches, the hybrids do not operate perfectly. As a result, some energy on the incoming branch of the four-wire circuit leaks into the outgoing branch and returns to the source as an echo (see Figure 10.1). This echo, which is usually 11 dB down from the original signal, makes it difficult to carry on a conversation if the round-trip delay is larger than 40 ms. Satellite links, as a consequence of high altitude, involve round-trip delays of 500 to 600 ms.

FIGURE 10.1 Echo generation in a long-distance telephone network. [Diagram labels: talker A, hybrid A, four-wire connection, hybrid B, talker B, two-wire connection, speech from A, speech from B, echo of A's speech.]

The first devices used by telephone companies to control voice echoes were echo suppressors. Basically, an echo suppressor is a voice-activated switch that attempts to impose an open circuit on the return path from listener to talker when the listener is silent (see Figure 10.2). The main problems with these devices are speech clipping during doubletalking and the inability to effectively deal with round-trip delays longer than 100 ms (Weinstein 1977).

FIGURE 10.2 Principle of echo suppression. [Diagram labels: echo suppressor, control, loss, hybrid B, speech from B, talker B.]

The problems associated with echo suppressors could be largely avoided if we could estimate the transmission path from point C to point D (see Figure 10.3), which is known as the echo path. If we knew the echo path, we could design a filter that produced a copy or replica of the echo signal when driven by the signal at point C. Subtraction of the echo replica from the signal at point D will eliminate the echo without distorting the speech of the second talker that may be present at point D. The resulting device, shown in Figure 10.3, is known as an echo canceler.

FIGURE 10.3 Principle of echo cancelation. [Diagram labels: echo canceler, adaptive filter, echo path, points C and D, echo replica, hybrid B, talker B, speech from B, echo from A, from talker A, to talker A.]

In practice, the channel characteristics are generally not known. For dial-up telephone lines, the channel differs from call to call, and the characteristics of radio and microwave channels (phase perturbations, fading, etc.) change significantly with time. Therefore, we cannot design and use a fixed echo canceler with satisfactory performance for all possible connections. There are two possible ways around this problem:


1. Design a compromise fixed echo canceler based on some “average” echo path, assuming that we have sufficient information about the connections to be seen by the canceler.
2. Design an adaptive echo canceler that can “learn” the echo path when it is first turned on and afterward “tracks” its variations without any intervention from the designer.

Since an adaptive canceler matches the echo path for any given connection, it performs better than a compromise fixed canceler. We stress that the main task of the canceler is to estimate the echo signal with sufficient accuracy; the estimation of the echo path is simply the means of achieving this goal. The performance of the canceler is measured by the attenuation, in decibels, of the echo, which is known as echo return loss enhancement. The adaptive echo canceler achieves this goal by modifying its response, using the residual echo signal in an as yet unspecified way. Adaptive echo cancelers are widely used in voice telecommunications, and the international standards organization CCITT has issued a set of recommendations (CCITT G.165) that outlines the basic requirements for echo cancelers. More details can be found in Weinstein (1977) and Murano et al. (1990).
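The role of the echo replica and of the echo return loss enhancement can be illustrated numerically. The following is a minimal sketch, not a canceler design: the echo path `echo_path`, the signal levels, and the two path estimates (a "compromise" estimate with a 20 percent gain error and a nearly matched one) are hypothetical values chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical echo path from point C to point D: a short FIR response whose
# total gain is roughly 11 dB below the far-end signal (illustrative values).
echo_path = np.array([0.0, 0.2, -0.15, 0.1, -0.05])

far_end = rng.standard_normal(8000)           # signal from talker A at point C
near_end = 0.01 * rng.standard_normal(8000)   # weak signal from talker B
echo = np.convolve(far_end, echo_path)[:far_end.size]
at_d = echo + near_end                        # signal observed at point D

def erle_db(path_estimate):
    """Echo return loss enhancement (dB) obtained with a given echo-path estimate."""
    replica = np.convolve(far_end, path_estimate)[:far_end.size]
    output = at_d - replica                   # signal returned to talker A
    residual_echo = output - near_end         # echo left after the subtraction
    return 10 * np.log10(np.mean(echo**2) / np.mean(residual_echo**2))

erle_fixed = erle_db(0.8 * echo_path)         # compromise estimate, 20% gain error
erle_matched = erle_db(0.99 * echo_path)      # estimate within 1% of the true path
print(f"ERLE, compromise canceler: {erle_fixed:.1f} dB")
print(f"ERLE, matched canceler:    {erle_matched:.1f} dB")
```

In this toy setting the attenuation of the echo depends only on how closely the replica matches the true echo, which is why a canceler that learns the path of each connection can outperform a fixed compromise design.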

10.1.2 Equalization of Data Communication Channels

Channel equalization, which is probably the most widely employed technique in practical data transmission systems, was first introduced in Section 1.4.1. In Section 6.8 we discussed the design of symbol rate zero-forcing and optimum MSE equalizers. As we recall, every pulse propagating through the channel suffers a certain amount of time dispersion because the frequency response of the channel deviates from the ideal one of constant magnitude and linear phase. Some typical sources of dispersion for practical communication channels are summarized in Table 10.1. As a result, the tails of adjacent pulses interfere with the measurement of the current pulse (intersymbol interference) and can lead to an incorrect decision.

TABLE 10.1 Summary of causes of dispersion in various communications systems.

Transmission system    Causes of dispersion
Cable TV               Transmitter filtering; coaxial-cable dispersion; cable amplifiers; reflections from impedance mismatches; bandpass filters
Microwave radio        Transmitter filtering; reflections from impedance mismatches; multipath propagation; scattering; input bandpass filtering
Voiceband modems       Digital-to-analog image suppression; channel filtering; twisted-pair transmission line; multiplexing and demultiplexing filters; hybrids; antialias lowpass filters
Troposcatter radio     Transmitter filtering; atmospheric dispersion; scattering at interface between troposphere and stratosphere; receiver bandpass filtering; input amplifiers

Source: From Treichler et al. 1996.

Since the channel can be modeled as a linear system, assuming that the receiver and the transmitter do not include any nonlinear operations, we can compensate for its distortion by using a linear equalizer. The goal of the equalizer is to restore the received pulse, as closely as possible, to its original shape. The equalizer transforms the channel to a near-ideal one if its response resembles the inverse of the channel. Since the channel is unknown and possibly time-varying, there are two ways to approach the problem: (1) Design a compromise fixed equalizer to obtain satisfactory performance over a broad range of channels, or (2) design an equalizer that can learn the inverse of the particular channel and then track its variation in real time.

The characteristics of the equalizer are adjusted by some algorithm that attempts to attain the best possible performance. The most appropriate criterion of performance for data transmission systems is the probability of error. However, it cannot be used for two reasons: (1) the “correct” symbol is unknown to the receiver (otherwise there would be no reason to communicate), and (2) the number of decisions needed to estimate the low probabilities of error is extremely large. Thus, practical equalizers assess their performance by using some function of the difference between the correct symbol and their output. The operation of practical equalizers involves three modes of operation, dependent on how we substitute for the unavailable correct symbol sequence.

Training mode: A known training sequence is transmitted, and the equalizer attempts to improve its performance by comparing its output to a synchronized replica of the training sequence stored at the receiver. Usually this mode is used when the equalizer starts a transmission session.

Decision-directed mode: At the end of the training session, when the equalizer starts making reliable decisions, we can replace the training sequence with the equalizer’s own decisions.

“Blind” or self-recovering mode: There are several applications in which the use of a training sequence is not desired or feasible. This may occur in multipoint networks for computer communications or in wideband digital systems over coaxial facilities during rerouting (Godard 1980; Sato 1975). Also, when the decision-directed mode of a microwave channel equalizer fails, after deep fades, we do not have a reverse channel to call for retraining (Foschini 1985). In such cases, where the equalizer should be able to learn or recover the characteristics of the channel without the benefit of a training sequence, we say that the equalizer operates in blind or self-recovering mode.
Adaptive equalization is a mature technology that has had the greatest impact on digital communications systems, including voiceband, microwave and troposcatter radio, and cable TV modems (Qureshi 1985; Lee and Messerschmitt 1994; Gitlin et al. 1992; Bingham 1988; Treichler et al. 1996, 1998).

10.1.3 Linear Predictive Coding

The efficient storage and transmission of analog signals using digital systems requires the minimization of the number of bits necessary to represent the signal while maintaining the quality to an acceptable level according to a certain criterion of performance. The conversion of an analog (continuous-time, continuous-amplitude) signal to a digital (discrete-time, discrete-amplitude) signal involves two processes: sampling and quantization. Sampling converts a continuous-time signal to a discrete-time signal by measuring its amplitude at equidistant intervals of time. Quantization involves the representation of the measured continuous amplitude by using a finite number of symbols. Therefore, a small range of amplitudes will use the same symbol (see Figure 10.4). A code word is assigned to each symbol by the coder. When the digital representation is used for digital signal processing, the quantization levels and the corresponding code words are uniformly distributed. However, for coding applications, levels may be nonuniformly distributed to match the distribution of the signal amplitudes. For all practical purposes, the range of a quantizer is equal to $R_Q = \Delta \cdot 2^B$, where $\Delta$ is the quantization step size and $B$ is the number of bits, and should cover the dynamic range of the signal. The difference between the unquantized sample $x(n)$ and the quantized sample $\hat{x}(n)$, that is,

$$e(n) \triangleq \hat{x}(n) - x(n) \tag{10.1.1}$$

FIGURE 10.4 Partitioning of the range of a 3-bit (eight-level) uniform quantizer. [Diagram labels: quantization step Δ, quantization level, decision level, input x(n), range R_Q.]

is known as the quantization error and is always in the range $-\Delta/2 \le e(n) \le \Delta/2$. If we define the signal-to-noise ratio by

$$\mathrm{SNR} \triangleq \frac{E\{x^2(n)\}}{E\{e^2(n)\}} \tag{10.1.2}$$

it can be shown (Rabiner and Schafer 1978; Jayant and Noll 1984) that

$$\mathrm{SNR(dB)} \approx 6B \tag{10.1.3}$$

which states that each added binary digit increases the SNR by 6 dB. For a fixed number of bits, decreasing the dynamic range of the signal (and therefore the range of the quantizer) decreases the required quantization step and therefore the average quantization error power. Therefore, we can increase the SNR by reducing the dynamic range, or equivalently the variance of the signal. If the signal samples are significantly correlated, the variance of the difference between adjacent samples is smaller than the variance of the original signal. Thus, we can improve the SNR by quantizing this difference instead of the original signal. The differential quantization concept is exploited by the linear predictive coding (LPC) system illustrated in Figure 10.5. The quantized signal is the difference

$$d(n) = x(n) - \tilde{x}(n) \tag{10.1.4}$$

where $\tilde{x}(n)$ is an estimate or prediction of the signal $x(n)$ obtained by the predictor using a quantized version

$$\hat{x}(n) = \tilde{x}(n) + \hat{d}(n) \tag{10.1.5}$$

of the original signal (see Figure 10.5). If the quantization error of the difference signal is

$$e_d(n) = \hat{d}(n) - d(n) \tag{10.1.6}$$

we obtain

$$\hat{x}(n) = x(n) + e_d(n) \tag{10.1.7}$$

using (10.1.4) and (10.1.5). The significance of (10.1.7) is that the quantization error of the original signal is equal to the quantization error of the difference signal, independently of the properties of the predictor. Note that if $c'(n) = c(n)$, that is, there are no transmission or storage errors, then the signal reconstructed by the decoder is $\hat{x}'(n) = \hat{x}(n)$. If the prediction is good, the dynamic range of $d(n)$ should be smaller than the dynamic range of $x(n)$, resulting in a smaller quantization noise for the same number of bits or the same quantization noise with a smaller number of bits. The performance of the LPC system depends on the accuracy of the predictor. In most practical applications, we use a linear predictor that forms an estimate (prediction) $\tilde{x}(n)$ of the present sample $x(n)$ as a linear combination of the $M$ past samples, that is,

$$\tilde{x}(n) = \sum_{k=1}^{M} a_k \hat{x}(n-k) \tag{10.1.8}$$

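FIGURE 10.5 Block diagram of a linear predictive coding system: (a) coder and (b) decoder. [Diagram labels: (a) input x(n), difference d(n), quantizer Q[·] with step Δ, quantized difference d̂(n), encoder output c(n) to the communication or storage channel, predictor with input x̂(n) and output x̃(n); (b) received c'(n), decoder output d̂'(n), predictor output x̃'(n), reconstructed x̂'(n).]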

The coefficients $\{a_k\}_1^M$ of the linear predictor are determined by exploiting the correlation between adjacent samples of the input signal with the objective of making the prediction error as small as possible. Since the statistical properties of the signal $x(n)$ are unknown and change with time, we cannot design an optimum fixed predictor. The established practical solution uses an adaptive linear predictor that automatically adjusts its coefficients to compute a “good” prediction at each time instant. A detailed discussion of adaptive linear prediction and its application to audio, speech, and video signal coding is provided in Jayant and Noll (1984).
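The differential quantization relations (10.1.4) through (10.1.7) can be checked numerically. The sketch below is only an illustration under stated assumptions: it uses a fixed (not adaptive) first-order predictor with a hypothetical coefficient `a`, a synthetic first-order autoregressive test signal, and quantizer ranges set to about four standard deviations on each side; none of these choices come from the text.

```python
import numpy as np

def quantize(v, step, bits):
    """Uniform quantizer with 2**bits levels covering a range of step * 2**bits."""
    half = 2 ** (bits - 1)
    idx = np.clip(np.floor(v / step), -half, half - 1)
    return (idx + 0.5) * step

def dpcm(x, a, step, bits):
    """Differential quantization with the fixed predictor x~(n) = a * x^(n-1)."""
    x_hat = np.zeros_like(x)
    e_d = np.zeros_like(x)
    prev = 0.0
    for n, xn in enumerate(x):
        x_tilde = a * prev                 # prediction from the past quantized sample
        d = xn - x_tilde                   # difference signal, (10.1.4)
        d_hat = quantize(d, step, bits)    # quantized difference
        e_d[n] = d_hat - d                 # quantization error of d(n), (10.1.6)
        x_hat[n] = x_tilde + d_hat         # quantized version of x(n), (10.1.5)
        prev = x_hat[n]
    return x_hat, e_d

def snr_db(x, x_hat):
    """Signal-to-quantization-noise ratio in decibels, as in (10.1.2)."""
    return 10 * np.log10(np.mean(x**2) / np.mean((x_hat - x) ** 2))

# Synthetic, strongly correlated test signal (assumed model): x(n) = a x(n-1) + w(n).
rng = np.random.default_rng(0)
a = 0.95
w = rng.standard_normal(20000)
x = np.zeros_like(w)
for n in range(1, x.size):
    x[n] = a * x[n - 1] + w[n]

bits = 4
step_pcm = 8 * np.std(x) / 2**bits     # range of about +/- 4 sigma of x(n)
step_dpcm = 8 * np.std(w) / 2**bits    # range of about +/- 4 sigma of d(n)

x_pcm = quantize(x, step_pcm, bits)           # direct quantization of x(n)
x_hat, e_d = dpcm(x, a, step_dpcm, bits)      # quantization of the difference
print(f"SNR, direct quantization:       {snr_db(x, x_pcm):.1f} dB")
print(f"SNR, differential quantization: {snr_db(x, x_hat):.1f} dB")
```

Because the reconstruction error `x_hat - x` equals `e_d` sample by sample, the gain in SNR comes entirely from the smaller dynamic range of the difference signal, which permits a smaller step for the same number of bits.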

10.1.4 Noise Cancelation

In Section 1.4.1 we discussed the concept of active noise control using adaptive filters. We now provide a theoretical explanation for the general problem of noise canceling using multiple sensors. The principle of general noise cancelation is illustrated in Figure 10.6. The signal of interest $s(n)$ is corrupted by uncorrelated additive noise $v_1(n)$, and the combined signal $s(n) + v_1(n)$ provides what is known as primary input. A second sensor, located at a different point, acquires a noise $v_2(n)$ (reference input) that is uncorrelated with the signal $s(n)$ but correlated with the noise $v_1(n)$. If we can design a filter that provides a good estimate $\hat{y}(n)$ of the noise $v_1(n)$, by exploiting the correlation between $v_1(n)$ and $v_2(n)$, then we could recover the desired signal by subtracting $\hat{y}(n) \approx v_1(n)$ from the primary input.

FIGURE 10.6 Principle of adaptive noise cancelation using a reference input. [Diagram labels: signal source, noise source, primary input s(n) + v1(n), reference input v2(n), filter, filter output ŷ(n), output e(n).]

Let us assume that the signals $s(n)$, $v_1(n)$, and $v_2(n)$ are jointly wide-sense stationary with zero mean values. The “clean” signal is given by the error

$$e(n) = s(n) + [v_1(n) - \hat{y}(n)]$$

where $\hat{y}(n)$ depends on the filter structure and parameters. The MSE is given by

$$E\{|e(n)|^2\} = E\{|s(n)|^2\} + E\{|v_1(n) - \hat{y}(n)|^2\}$$

because the signals $s(n)$ and $v_1(n) - \hat{y}(n)$ are uncorrelated. Since the signal power is not influenced by the filter, if we design a filter that minimizes the total output power $E\{|e(n)|^2\}$, then that filter will minimize the output noise power $E\{|v_1(n) - \hat{y}(n)|^2\}$. Therefore, $\hat{y}(n)$ will be the MMSE estimate of the noise $v_1(n)$, and the canceler maximizes the output signal-to-noise ratio. If we know the second-order moments of the primary and reference inputs, we can design an optimum linear canceler using the techniques discussed in Chapter 6. However, in practice, the design of an optimum canceler is not feasible because the required statistical moments are either unknown or time-varying. Once again, a successful solution can be obtained by using an adaptive filter that automatically adjusts its parameters to obtain the best possible estimate of the interfering noise (Widrow et al. 1975).
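The decomposition of the output power used above can be verified in a few lines. The following is a sketch that assumes a linear filter acting on the reference input, with a hypothetical impulse response $h(k)$ introduced only for this illustration.

```latex
% Assumption (for illustration): \hat{y}(n) = \sum_k h(k)\, v_2(n-k), a linear function of v_2 only.
\begin{align*}
E\{|e(n)|^2\}
  &= E\{|s(n)|^2\} + E\{|v_1(n)-\hat{y}(n)|^2\}
     + 2\,\mathrm{Re}\, E\{ s(n)\,[v_1(n)-\hat{y}(n)]^{*} \} \\
E\{ s(n)\,[v_1(n)-\hat{y}(n)]^{*} \}
  &= E\{ s(n)\, v_1^{*}(n) \} - \sum_k h^{*}(k)\, E\{ s(n)\, v_2^{*}(n-k) \} = 0
\end{align*}
```

Both expectations in the second line vanish because $s(n)$ has zero mean and is uncorrelated with $v_1(n)$ and $v_2(n)$, so only the filter-dependent term $E\{|v_1(n)-\hat{y}(n)|^2\}$ changes when the filter is adjusted.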

10.2 PRINCIPLES OF ADAPTIVE FILTERS

In this section, we discuss a mathematical framework for the analysis and performance evaluation of adaptive algorithms. The goal is to develop design guidelines for the application of adaptive algorithms to practical problems. The need for adaptive filters and representative applications that can benefit from their use have been discussed in Sections 1.4.1 and 10.1.

10.2.1 Features of Adaptive Filters

The applications we have discussed are only a sample from a multitude of practical problems that can be successfully solved by using adaptive filters, that is, filters that automatically change their characteristics to attain the right response at the right time. Every adaptive filtering application involves one or more input signals and a desired response signal that may or may not be accessible to the adaptive filter. We collectively refer to these signals as the signal operating environment (SOE) of the adaptive filter. Every adaptive filter consists of three modules (see Figure 10.7):

Filtering structure. This module forms the output of the filter using measurements of the input signal or signals. The filtering structure is linear if the output is obtained as a linear combination of the input measurements; otherwise it is said to be nonlinear. For example, the filtering module can be an adjustable finite impulse response (FIR) digital filter implemented with a direct or lattice structure or a recursive filter implemented using a cascade structure. The structure is fixed by the designer, and its parameters are adjusted by the adaptive algorithm.

Criterion of performance (COP). The output of the adaptive filter and the desired response (when available) are processed by the COP module to assess its quality with respect to the requirements of the particular application. The choice of the criterion is a balanced compromise between what is acceptable to the user of the application and what is mathematically tractable; that is, it can be manipulated to derive an adaptive algorithm. Most adaptive filters use some average form of the square error because it is mathematically tractable and leads to the design of useful practical systems.

Adaptation algorithm. The adaptive algorithm uses the value of the criterion of performance, or some function of it, and the measurements of the input and desired response (when available) to decide how to modify the parameters of the filter to improve its performance. The complexity and the characteristics of the adaptive algorithm are functions of the filtering structure and the criterion of performance.

FIGURE 10.7 Basic elements of a general adaptive filter. [Diagram labels: input signal, filtering structure, output signal, filter parameters, adaptation algorithm, performance evaluation.]

The design of any adaptive filter requires some generic a priori information about the SOE and a deep understanding of the particular application. This information is needed by the designer to choose the criterion of performance and the filtering structure. Clearly, unreliable a priori information and/or incorrect assumptions about the SOE can lead to serious performance degradations or even unsuccessful adaptive filter applications. The conversion of the performance assessment to a successful parameter adjustment strategy, that is, the design of an adaptive algorithm, is the most difficult step in the design and application of adaptive filters. If the characteristics of the SOE are constant, the goal of the adaptive filter is to find the parameters that give the best performance and then stop the adjustment. The initial period, from the time the filter starts its operation until the time it gets reasonably close to its best performance, is known as the acquisition or convergence mode. However, when the characteristics of the SOE change with time, the adaptive filter should first find and then continuously readjust its parameters to track these changes.
In this case, the filter starts with an acquisition phase that is followed by a tracking mode. A very influential factor in the design of adaptive algorithms is the availability of a desired response signal. We have seen that for certain applications, the desired response may not be available for use by the adaptive filter. Therefore, the adaptation must be performed in one of two ways:

Supervised adaptation. At each time instant, the adaptive filter knows in advance the desired response, computes the error (i.e., the difference between the desired and actual response), evaluates the criterion of performance, and uses it to adjust its coefficients. In this case, the structure in Figure 10.7 is simplified to that of Figure 10.8.

FIGURE 10.8 Basic elements of a supervised adaptive filter. [Diagram labels: input signal x(n), filtering structure, output signal ŷ(n), desired response y(n), error e(n), filter coefficients, adaptation algorithm, performance evaluation.]

Unsupervised adaptation. When the desired response is unavailable, the adaptive filter cannot explicitly form and use the error to improve its behavior. In some applications, the input signal has some measurable property (e.g., constant envelope) that is lost by the time it reaches the adaptive filter. The adaptive filter adjusts its parameters in such a way as to restore the lost property of the input signal. The property restoral approach to adaptive filtering was introduced in Treichler et al. (1987). In some other applications (e.g., digital communications) the basic task of the adaptive filter is to classify each received pulse to one of a finite set of symbols. In this case we basically have a problem of unsupervised classification (Fukunaga 1990).

In this chapter we focus our discussion on supervised adaptive filters, that is, filters that have access to a desired response signal; unsupervised adaptive filters, which operate without the benefit of a desired response, are discussed in Section 12.3, in the context of blind equalization.

10.2.2 Optimum versus Adaptive Filters

We have mentioned several times that the theory of stochastic processes provides the mathematical framework for the design and analysis of optimum filters. In Chapter 6, we introduced filters that are optimum according to the MSE criterion of performance; and in Chapter 7, we developed algorithms and structures for their efficient design and implementation. However, optimum filters are a theoretical tool and cannot be used in practical applications because we do not know the statistical quantities (e.g., second-order moments) that are required for their design. Adaptive filters can be thought of as the practical counterpart of optimum filters: They try to reach the performance of optimum filters by processing measurements of the SOE in real time, which makes up for the lack of a priori statistics. For this analysis, we consider the general case of a linear combiner that includes filtering and prediction as special cases. However, for convenience we use the terms filters and filtering. We remind the reader that, from a mathematical point of view, the key difference between a linear combiner and an FIR filter or predictor is the shift invariance (temporal ordering) of the input data vector. This difference, which is illustrated in Figure 10.9, also has important implications in the implementation of adaptive filters. To this end, suppose that the SOE is comprised of $M$ input signals $x_k(n, \zeta)$ and a desired response signal $y(n, \zeta)$,† which are sample realizations of random sequences. Then the estimate of $y(n, \zeta)$ is computed by using the linear combiner

$$\hat{y}(n,\zeta) = \sum_{k=1}^{M} c_k^{*}(n)\, x_k(n,\zeta) \triangleq \mathbf{c}^H(n)\,\mathbf{x}(n,\zeta) \tag{10.2.1}$$

where

$$\mathbf{c}(n) = [c_1(n)\;\; c_2(n)\;\; \cdots\;\; c_M(n)]^T \tag{10.2.2}$$

† For clarity, in this section only, we include the dependence on ζ to denote random variables.

FIGURE 10.9 Illustration of the difference of the input signal between (a) a multiple-input linear combiner and (b) a single-input FIR filter. [Diagram labels: (a) input signals x1(n), x2(n), ..., xM(n) weighted by c1*, c2*, ..., cM*; (b) input signal x(n) passed through unit delays z^-1 to give x(n − 1), ..., x(n − M + 1), weighted by c1*, c2*, ..., cM*; in both panels, actual response ŷ(n), desired response y(n), error e(n).]

is the coefficient vector and

$$\mathbf{x}(n,\zeta) = [x_1(n,\zeta)\;\; x_2(n,\zeta)\;\; \cdots\;\; x_M(n,\zeta)]^T \tag{10.2.3}$$

is the input data vector. For single-sensor applications, the input data vector is shift-invariant

$$\mathbf{x}(n,\zeta) = [x(n,\zeta)\;\; x(n-1,\zeta)\;\; \cdots\;\; x(n-M+1,\zeta)]^T \tag{10.2.4}$$

and the linear combiner takes the form of the FIR filter

$$\hat{y}(n,\zeta) = \sum_{k=0}^{M-1} h(n,k)\, x(n-k,\zeta) \triangleq \mathbf{c}^H(n)\,\mathbf{x}(n,\zeta) \tag{10.2.5}$$

where $c_{k+1}(n) = h^{*}(n,k)$, $0 \le k \le M-1$, are the samples of the impulse response at time $n$.

Optimum filters. If we know the second-order moments of the SOE, we can design an optimum filter $\mathbf{c}_o(n)$ by solving the normal equations

$$\mathbf{R}(n)\,\mathbf{c}_o(n) = \mathbf{d}(n) \tag{10.2.6}$$

where

$$\mathbf{R}(n) = E\{\mathbf{x}(n,\zeta)\,\mathbf{x}^H(n,\zeta)\} \tag{10.2.7}$$

and

$$\mathbf{d}(n) = E\{\mathbf{x}(n,\zeta)\, y^{*}(n,\zeta)\} \tag{10.2.8}$$

are the correlation matrix of the input data vector and the cross-correlation between the input data vector and the desired response, respectively. During its normal operation, the optimum filter works with specific realizations of the SOE, that is,

$$\hat{y}_o(n,\zeta) = \mathbf{c}_o^H(n)\,\mathbf{x}(n,\zeta) \tag{10.2.9}$$

$$\varepsilon_o(n,\zeta) = y(n,\zeta) - \hat{y}_o(n,\zeta) \tag{10.2.10}$$

where $\hat{y}_o(n,\zeta)$ is the optimum estimate and $\varepsilon_o(n,\zeta)$ is the optimum instantaneous error [see Figure 10.10(a)]. However, the filter is optimized with respect to its average performance across all possible realizations of the SOE, and the MMSE

$$P_o(n) = E\{|\varepsilon_o(n,\zeta)|^2\} = P_y(n) - \mathbf{d}^H(n)\,\mathbf{c}_o(n) \tag{10.2.11}$$

shows how well the filter performs on average. Also, we emphasize that the optimum coefficient vector is a nonrandom quantity and that the desired response is not essential for the operation of the optimum filter [see Equation (10.2.9)].

FIGURE 10.10 Illustration of the difference in operation between (a) optimum filters and (b) adaptive filters. [Diagram labels: (a) input signal x(n, ζ), filter co(n) obtained by solving R(n)co(n) = d(n), output ŷ(n, ζ); (b) input signal x(n, ζ), filter c(n − 1, ζ), output ŷ(n, ζ), desired response y(n, ζ), error e(n, ζ), adaptive algorithm.]

If the SOE is stationary, the optimum filter is computed once and is used with all realizations $\{\mathbf{x}(n,\zeta), y(n,\zeta)\}$. For nonstationary environments, the optimum filter design is repeated at every time instant $n$ because the optimum filter is time-varying.

Adaptive filters. In most practical applications, where the second-order moments $\mathbf{R}(n)$ and $\mathbf{d}(n)$ are unknown, the use of an adaptive filter is the best solution. If the SOE is ergodic, we have

$$\mathbf{R} = \lim_{N\to\infty} \frac{1}{2N+1} \sum_{n=-N}^{N} \mathbf{x}(n,\zeta)\,\mathbf{x}^H(n,\zeta) \tag{10.2.12}$$

$$\mathbf{d} = \lim_{N\to\infty} \frac{1}{2N+1} \sum_{n=-N}^{N} \mathbf{x}(n,\zeta)\, y^{*}(n,\zeta) \tag{10.2.13}$$

because ensemble averages are equal to time averages (see Section 3.3). If we collect a sufficient amount of data $\{\mathbf{x}(n,\zeta), y(n,\zeta)\}_0^{N-1}$, we can obtain an acceptable estimate of the optimum filter by computing the estimates

$$\hat{\mathbf{R}}_N(\zeta) = \frac{1}{N} \sum_{n=0}^{N-1} \mathbf{x}(n,\zeta)\,\mathbf{x}^H(n,\zeta) \tag{10.2.14}$$

$$\hat{\mathbf{d}}_N(\zeta) = \frac{1}{N} \sum_{n=0}^{N-1} \mathbf{x}(n,\zeta)\, y^{*}(n,\zeta) \tag{10.2.15}$$

by time-averaging and then solving the linear system

$$\hat{\mathbf{R}}_N(\zeta)\,\mathbf{c}_N(\zeta) = \hat{\mathbf{d}}_N(\zeta) \tag{10.2.16}$$
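The time-averaged estimates (10.2.14) and (10.2.15) and the solution of (10.2.16) can be sketched in a few lines. The example below is an illustration only: it assumes real-valued data, white input signals, and a desired response generated from a hypothetical coefficient vector `c_true` plus noise, so that the optimum filter of this synthetic SOE is `c_true`.

```python
import numpy as np

rng = np.random.default_rng(1)
M, N = 4, 5000

# Hypothetical stationary SOE (assumed for illustration): white inputs and a
# desired response formed by a known coefficient vector plus additive noise.
c_true = np.array([0.5, -0.3, 0.2, 0.1])
X = rng.standard_normal((N, M))                  # row n holds x(n)^T
y = X @ c_true + 0.1 * rng.standard_normal(N)    # real data, so c^H x = c^T x

R_hat = X.T @ X / N                   # (10.2.14): time average of x(n) x^H(n)
d_hat = X.T @ y / N                   # (10.2.15): time average of x(n) y*(n)
c_N = np.linalg.solve(R_hat, d_hat)   # (10.2.16): solve R_hat c_N = d_hat

print("estimated coefficients:", np.round(c_N, 3))
```

For complex-valued data the conjugates in (10.2.14) and (10.2.15) would have to be applied explicitly; they are omitted here because the data are real.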

The obtained coefficients can be used to filter the data in the interval $0 \le n \le N-1$ or to start filtering the data for $n \ge N$, on a sample-by-sample basis, in real time. This procedure, which we called block adaptive filtering in Chapter 8, should be repeated each time the properties of the SOE change significantly. Clearly, block adaptive filters cannot track statistical variations within the operating block and cannot be used in all applications. Indeed, there are applications, for example, adaptive equalization, in which each input sample should be processed immediately after its observation and before the arrival of the next sample. In such cases, we should use a sample-by-sample adaptive filter that starts filtering immediately after the observation of the pair $\{\mathbf{x}(0), y(0)\}$ using a “guess” $\mathbf{c}(-1)$ for the adaptive filter coefficients. Usually, the initial guess $\mathbf{c}(-1)$ is a very poor estimate of the optimum filter $\mathbf{c}_o$. However, this estimate is improved with time as the filter processes additional pairs of observations.

As we discussed in Section 10.2.1, an adaptive filter consists of three key modules: an adjustable filtering structure that uses input samples to compute the output, the criterion of performance that monitors the performance of the filter, and the adaptive algorithm that updates the filter coefficients. The key component of any adaptive filter is the adaptive algorithm, which is a rule to determine the filter coefficients from the available data $\mathbf{x}(n,\zeta)$ and $y(n,\zeta)$ [see Figure 10.10(b)]. The dependence of $\mathbf{c}(n,\zeta)$ on the input signal makes the adaptive filter a nonlinear and time-varying stochastic system. The data available to the adaptive filter at time $n$ are the input data vector $\mathbf{x}(n,\zeta)$, the desired response $y(n,\zeta)$, and the most recent update $\mathbf{c}(n-1,\zeta)$ of the coefficient vector. The adaptive filter, at each time $n$, performs the following computations:

1. Filtering:

$$\hat{y}(n,\zeta) = \mathbf{c}^H(n-1,\zeta)\,\mathbf{x}(n,\zeta) \tag{10.2.17}$$

2. Error formation:

$$e(n,\zeta) = y(n,\zeta) - \hat{y}(n,\zeta) \tag{10.2.18}$$

3. Adaptive algorithm:

$$\mathbf{c}(n,\zeta) = \mathbf{c}(n-1,\zeta) + \Delta\mathbf{c}\{\mathbf{x}(n,\zeta), e(n,\zeta)\} \tag{10.2.19}$$

where the increment or correction term $\Delta\mathbf{c}(n,\zeta)$ is chosen to bring $\mathbf{c}(n,\zeta)$ close to $\mathbf{c}_o$, with the passage of time. If we can successively determine the corrections $\Delta\mathbf{c}(n,\zeta)$ so that $\mathbf{c}(n,\zeta) \simeq \mathbf{c}_o$, that is, $\|\mathbf{c}(n,\zeta) - \mathbf{c}_o\| < \delta$, for some $n > N_\delta$, we obtain a good approximation for $\mathbf{c}_o$ by avoiding the explicit averagings (10.2.14), (10.2.15), and the solution of the normal equations (10.2.16). A key requirement is that $\Delta\mathbf{c}(n,\zeta)$ must vanish if the error $e(n,\zeta)$ vanishes. Hence, $e(n,\zeta)$ plays a major role in determining the increment $\Delta\mathbf{c}(n,\zeta)$.

We notice that the estimate $\hat{y}(n,\zeta)$ of the desired response $y(n,\zeta)$ is evaluated using the current input vector $\mathbf{x}(n,\zeta)$ and the past filter coefficients $\mathbf{c}(n-1,\zeta)$. The estimate $\hat{y}(n,\zeta)$ and the corresponding error $e(n,\zeta)$ can be considered as predicted estimates compared to the actual estimates that would be evaluated using the current coefficient vector $\mathbf{c}(n,\zeta)$. Coefficient updating methods that use the predicted error $e(n,\zeta)$ are known as a priori type adaptive algorithms. If we use the actual estimates, obtained using the current estimate $\mathbf{c}(n,\zeta)$ of the adaptive filter coefficients, we have

1. Filtering:

$$\hat{y}_a(n,\zeta) = \mathbf{c}^H(n,\zeta)\,\mathbf{x}(n,\zeta) \tag{10.2.20}$$

2. Error formation:

$$\varepsilon(n,\zeta) = y(n,\zeta) - \hat{y}_a(n,\zeta) \tag{10.2.21}$$

3. Adaptive algorithm:

$$\mathbf{c}(n,\zeta) = \mathbf{c}(n-1,\zeta) + \Delta\mathbf{c}\{\mathbf{x}(n,\zeta), \varepsilon(n,\zeta)\} \tag{10.2.22}$$

which are known as a posteriori type adaptive algorithms. The terms a priori and a posteriori were introduced in Carayannis et al. (1983) to emphasize the use of estimates evaluated before or after the updating of the filter coefficients. The difference between a priori and a posteriori errors and their meanings will be further clarified when we discuss adaptive least-squares filters in Section 10.5. The timing diagram for the above two algorithms is shown in Figure 10.11.

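FIGURE 10.11 Timing diagrams for a priori and a posteriori adaptive algorithms. [Time axis: at time nT the filter holds c(n − 1) and receives x(n) and y(n); it forms e(n), then computes c(n) and ε(n), before x(n + 1) and y(n + 1) arrive at time (n + 1)T.]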

In conclusion, the objective of an adaptive filter is to use the available data at time n, namely, {x(n, ζ), y(n, ζ), c(n − 1, ζ)}, to update the “old” coefficient vector c(n − 1, ζ) to a “new” estimate c(n, ζ) so that c(n, ζ) is closer to the optimum filter vector $\mathbf{c}_o(n)$ and the output $\hat{y}(n)$ is a better estimate of the desired response y(n). Most adaptive algorithms have the following form:

$$\begin{pmatrix}\text{new}\\ \text{coefficient}\\ \text{vector}\end{pmatrix} = \begin{pmatrix}\text{old}\\ \text{coefficient}\\ \text{vector}\end{pmatrix} + \begin{pmatrix}\text{adaptation}\\ \text{gain}\\ \text{vector}\end{pmatrix} \cdot \begin{pmatrix}\text{error}\\ \text{signal}\end{pmatrix} \tag{10.2.23}$$

where the error signal is the difference between the desired response and the predicted or actual outputs of the adaptive filter. One of the fundamental differences among the various algorithms is the optimality of the adaptation gain vector used and the amount of computation required for its evaluation.
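The a priori and a posteriori quantities of (10.2.17) through (10.2.23) can be made concrete with a small sketch of one time update for real-valued data. The gain vector µx(n) used here is only an assumed, LMS-style choice made for illustration, and the names `time_update` and `dot` are hypothetical; other algorithms differ precisely in how this gain is computed.

```python
# One time update of a sample-by-sample adaptive filter (real-valued data).
# The gain mu * x(n) is an assumed illustrative choice, not prescribed by the text.

def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))

def time_update(c_old, x, y, mu):
    """Return (c_new, e, eps): new coefficients, a priori error, a posteriori error."""
    y_hat = dot(c_old, x)                      # filtering with the old coefficients, (10.2.17)
    e = y - y_hat                              # a priori error, (10.2.18)
    gain = [mu * xi for xi in x]               # hypothetical adaptation gain vector
    c_new = [ci + gi * e for ci, gi in zip(c_old, gain)]  # update of the form (10.2.23)
    eps = y - dot(c_new, x)                    # a posteriori error, (10.2.20)-(10.2.21)
    return c_new, e, eps

c, e, eps = time_update(c_old=[0.0, 0.0], x=[1.0, 0.5], y=2.0, mu=0.1)
print("c(n) =", c, " e(n) =", e, " eps(n) =", eps)
```

For this particular gain the two errors are related by ε(n) = e(n)[1 − µ xᵀ(n)x(n)], so the a posteriori error is smaller in magnitude than the a priori error whenever 0 < µ xᵀ(n)x(n) < 2.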

10.2.3 Stability and Steady-State Performance of Adaptive Filters

We now address the issues of stability and performance of adaptive filters. Since the goal of an adaptive filter c(n, ζ) is first to find and then track the optimum filter $\mathbf{c}_o(n)$ as quickly and accurately as possible, we can evaluate its performance by measuring some function of its deviation

$$\tilde{\mathbf{c}}(n,\zeta) \triangleq \mathbf{c}(n,\zeta) - \mathbf{c}_o(n) \tag{10.2.24}$$

from the corresponding optimum filter. Clearly, an acceptable adaptive filter should be stable in the bounded-input bounded-output (BIBO) sense, and its performance should be close to that of the associated optimum filter. The analysis of BIBO stability is extremely difficult because adaptive filters are nonlinear, time-varying systems working in a random SOE. The performance of adaptive filters is primarily measured by investigating the value of the MSE as a function of time. To discuss these problems, first we consider an adaptive filter working in a stationary SOE, and then we extend our discussion to a nonstationary SOE.

Stability

The adaptive filter starts its operation at time, say, n = 0, and by processing the observations $\{\mathbf{x}(n,\zeta), y(n,\zeta)\}_0^\infty$ generates a sequence of vectors $\{\mathbf{c}(n,\zeta)\}_0^\infty$ using the adaptive algorithm. Since the FIR filtering structure is always stable, the output or the error of the adaptive filter will be bounded if its coefficients are always kept close to the coefficients of the associated optimum filter. However, the presence of the feedback loop in every adaptive filter (see Figure 10.10) raises the issue of stability. In a stationary SOE, where the optimum filter $\mathbf{c}_o$ is constant, convergence of c(n, ζ) to $\mathbf{c}_o$ as n → ∞ will guarantee the BIBO stability of the adaptive filter. For a specific realization ζ, the kth component $c_k(n,\zeta)$ or the norm $\|\mathbf{c}(n,\zeta)\|$ of the vector c(n, ζ) is a sequence of numbers that might or might not converge.† Since the coefficients $c_k(n,\zeta)$ are random, we must use the concept of stochastic convergence (Papoulis 1991). We say that a random sequence converges everywhere if the sequence $c_k(n,\zeta)$ converges for every ζ, that is,

$$\lim_{n\to\infty} c_k(n,\zeta) = c_{o,k}(\zeta) \tag{10.2.25}$$

where the limit $c_{o,k}(\zeta)$ depends, in general, on ζ. Requiring the adaptive filter to converge to $\mathbf{c}_o$ for every possible realization of the SOE is both hard to guarantee and not necessary, because some realizations may have very small or zero probability of occurrence. If we wish to ensure that the adaptive filter converges for the realizations of the SOE that may actually occur, we can use the concept of convergence almost everywhere. We say that the random sequence $c_k(n,\zeta)$ converges almost everywhere or with probability 1 if

$$P\Big\{\lim_{n\to\infty} |c_k(n,\zeta) - c_{o,k}(\zeta)| = 0\Big\} = 1 \tag{10.2.26}$$

which implies that there can be some sample sequences that do not converge, but these must occur with probability zero. Another type of stochastic convergence that is used in adaptive filtering is defined by

$$\lim_{n\to\infty} E\{|c_k(n,\zeta) - c_{o,k}|^2\} = \lim_{n\to\infty} E\{|\tilde{c}_k(n,\zeta)|^2\} = 0 \tag{10.2.27}$$

and is known as convergence in the MS sense. The primary reason for the use of mean square (MS) convergence is that, unlike the almost-everywhere convergence, it uses only one sequence of numbers that takes into account the averaging effect of all sample sequences. Furthermore, it uses second-order moments for verification and has an interpretation in terms of power. Convergence in MS does not imply (nor is it implied by) convergence with probability 1. Since

$$\frac{E\{|\tilde{c}_k(n,\zeta)|^2\}}{\delta^2} = \frac{\operatorname{var}\{\tilde{c}_k(n,\zeta)\}}{\delta^2} + \frac{|E\{\tilde{c}_k(n,\zeta)\}|^2}{\delta^2} \tag{10.2.28}$$

if we can show that $E\{\tilde{c}_k(n)\} \to 0$ as n → ∞ and $\operatorname{var}\{\tilde{c}_k(n,\zeta)\}$ is bounded for all n, we can ensure convergence in MS. In this case, we can say that an adaptive filter that operates in a stationary SOE is an asymptotically stable filter.

Performance measures

In theoretical investigations, any quantity that measures the deviation of an adaptive filter from the corresponding optimum filter can be used to evaluate its performance. The mean square deviation (MSD)

$$D(n) \triangleq E\{\|\mathbf{c}(n,\zeta) - \mathbf{c}_o(n)\|^2\} = E\{\|\tilde{\mathbf{c}}(n,\zeta)\|^2\} \tag{10.2.29}$$

† We recall that a sequence of real nonrandom numbers $a_0, a_1, a_2, \ldots$ converges to a number a if and only if for every positive number δ there exists a positive integer $N_\delta$ such that for all $n > N_\delta$, we have $|a_n - a| < \delta$. This is abbreviated by $\lim_{n\to\infty} a_n = a$.

measures the average distance between the coefficient vectors of the adaptive and optimum filters. Although the MSD is not measurable in practice, it is useful in analytical studies. Adaptive algorithms that minimize D(n) for each value of n are known as algorithms with optimum learning.

In Section 6.2.2 we showed that if the input correlation matrix is positive definite, any deviation, say, $\tilde{\mathbf{c}}(n)$, of the filter coefficients from their optimum setting increases the mean square error (MSE) by an amount equal to $\tilde{\mathbf{c}}^H(n)\mathbf{R}\tilde{\mathbf{c}}(n)$, known as excess MSE (EMSE). In adaptive filters, the random deviation $\tilde{\mathbf{c}}(n,\zeta)$ from the optimum results in an EMSE, which is measured by the ensemble average of $\tilde{\mathbf{c}}^H(n,\zeta)\mathbf{R}\tilde{\mathbf{c}}(n,\zeta)$. For a posteriori adaptive filters, the MSE can be decomposed as

$$P(n) \triangleq E\{|\varepsilon(n,\zeta)|^2\} = P_o(n) + P_{\text{ex}}(n) \tag{10.2.30}$$

where $P_{\text{ex}}(n)$ is the EMSE and $P_o(n)$ is the MMSE given by

$$P_o(n) \triangleq E\{|\varepsilon_o(n,\zeta)|^2\} \tag{10.2.31}$$

with

$$\varepsilon_o(n,\zeta) \triangleq y(n,\zeta) - \mathbf{c}_o^H(n)\,\mathbf{x}(n,\zeta) \tag{10.2.32}$$

as the a posteriori optimum filtering error. Clearly, the a posteriori EMSE $P_{\text{ex}}(n)$ is given by

$$P_{\text{ex}}(n) \triangleq P(n) - P_o(n) \tag{10.2.33}$$

For a priori adaptive algorithms, where we use the “old” coefficient vector c(n − 1, ζ), it is more appropriate to use the a priori EMSE given by

$$P_{\text{ex}}(n) \triangleq P(n) - P_o(n) \tag{10.2.34}$$

where

$$P(n) \triangleq E\{|e(n,\zeta)|^2\} \tag{10.2.35}$$

and

$$P_o(n) \triangleq E\{|e_o(n,\zeta)|^2\} \tag{10.2.36}$$

with

$$e_o(n,\zeta) \triangleq y(n,\zeta) - \mathbf{c}_o^H(n-1)\,\mathbf{x}(n,\zeta) \tag{10.2.37}$$

as the a priori optimum filtering error. If the SOE is stationary, we have $\varepsilon_o(n,\zeta) = e_o(n,\zeta)$; that is, the optimum a priori and a posteriori errors are identical. The dimensionless ratio

$$M(n) \triangleq \frac{P_{\text{ex}}(n)}{P_o(n)} \tag{10.2.38}$$

formed from either the a posteriori quantities (10.2.30) through (10.2.33) or the a priori quantities (10.2.34) through (10.2.37), is known as misadjustment and is a useful measure of the quality of adaptation. Since the EMSE is always positive, there is no adaptive filter that can perform (on the average) better than the corresponding optimum filter. In this sense, we can say that the excess MSE or the misadjustment measures the cost of adaptation.

Acquisition and tracking

Plots of the MSD, MSE, or M(n) as a function of n, which are known as learning curves, characterize the performance of an adaptive filter and are widely used in theoretical and experimental studies. When the adaptive filter starts its operation, its coefficients provide a poor estimate of the optimum filter and the MSD or the MSE is very large. As the number of observations processed by the adaptive filter increases with time, we expect the quality of the estimate c(n, ζ) to improve, and therefore the MSD and the MSE to decrease. The property of an adaptive filter to bring the coefficient vector c(n, ζ) close to the optimum filter $\mathbf{c}_o$, independently of the initial condition c(−1) and the statistical properties of the SOE, is called acquisition. During the acquisition phase, we say that the adaptive filter is in a transient mode of operation. A natural requirement for any adaptive algorithm is that adaptation stops after the algorithm has found the optimum filter $\mathbf{c}_o$. However, owing to the randomness of the SOE

and the finite amount of data used by the adaptive filter, its coefficients continuously fluctuate about their optimum settings, that is, about the coefficients of the optimum filter, in a random manner. As a result, the adaptive filter reaches a steady-state mode of operation, after a certain time, and its performance stops improving. The transient and steady-state modes of operation in a stationary SOE are illustrated in Figure 10.12(a). The duration of the acquisition phase characterizes the speed of adaptation or rate of convergence of the adaptive filter, whereas the steady-state EMSE or misadjustment characterizes the quality of adaptation. These properties depend on the SOE, the filtering structure, and the adaptive algorithm.

FIGURE 10.12 Modes of operation in a stationary and nonstationary SOE. [Two plots of c(n) versus n, each with a transient (acquisition) phase followed by a steady-state (tracking) phase: (a) stationary SOE, where c(n) approaches the constant $\mathbf{c}_o$; (b) nonstationary SOE, where c(n) follows the time-varying $\mathbf{c}_o(n)$.]

At each time n, any adaptive filter computes an estimate of the optimum filter using a finite amount of data. The error resulting from the finite amount of data is known as estimation error. An additional error, known as the lag error, results when the adaptive filter attempts to track a time-varying optimum filter co (n) in a nonstationary SOE. The modes of operation of an adaptive filter in a nonstationary SOE are illustrated in Figure 10.12(b). The SOE of the adaptive filter becomes nonstationary if x(n, ζ ) or y(n, ζ ) or both are nonstationary. The nonstationarity of the input is more severe than that of the desired response because it may affect the invertibility of R(n). Since the adaptive filter has to first acquire and then track the optimum filter, tracking is a steady-state property. Therefore, in general, the speed of adaptation (a transient-phase property) and the tracking capability (a steady-state property) are two different characteristics of the adaptive filter. Clearly, tracking is feasible only if the statistics of the SOE change “slowly” compared to the speed of tracking of the adaptive filter. These concepts will become more precise in Section 10.8, where we discuss the tracking properties of adaptive filters.

10.2.4 Some Practical Considerations

The complexity of the hardware or software implementation of an adaptive filter is basically determined by the following factors: (1) the number of instructions per time update or computing time required to complete one time updating; (2) the number of memory locations required to store the data and the program instructions; (3) the structure of information flow in the algorithm, which is very important for implementations using parallel processing, systolic arrays, or VLSI chips; and (4) the investment in hardware design tools and software development. We focus on implementations for general-purpose computers or special-purpose digital signal processors that basically involve programming in a high-level or assembly language. More details about DSP software development can be found in Embree and Kimble (1991) and in Lapsley et al. (1997).


The digital implementation of adaptive filters implies the use of finite-word-length arithmetic. As a result, the performance of the practical (finite-precision) adaptive filters deviates from the performance of ideal (infinite-precision) adaptive filters. Finite-precision implementation affects the performance of adaptive filters in several complicated ways. The major factors are (1) the quantization of the input signal(s) and the desired response, (2) the quantization of filter coefficients, and (3) the roundoff error in the arithmetic operations used to implement the adaptive filter. The nonlinear nature of adaptive filters coupled with the nonlinearities introduced by the finite-word-length arithmetic makes the performance evaluation of practical adaptive filters extremely difficult. Although theoretical analysis provides insight and helps to clarify the behavior of adaptive filters, the most effective way is to simulate the filter and measure its performance. Finite precision affects two important properties of adaptive filters, which, although related, are not equivalent. Let us denote by cip (n) and cfp (n) the coefficient vectors of the filter implemented using infinite- and finite-precision arithmetic, respectively. An adaptive filter is said to be numerically stable if the difference vector cip (n) − cfp (n) remains always bounded, that is, the roundoff error propagation system is stable. Numerical stability is an inherent property of the adaptive algorithm and cannot be altered by increasing the numerical precision. Indeed, increasing the word length or reorganizing the computations will simply delay the divergence of an adaptive filter; only actual change of the algorithm can stabilize an adaptive filter by improving the properties of the roundoff error propagation system (Ljung and Ljung 1985; Cioffi 1987). 
The numerical accuracy of an adaptive filter measures the deviation, at steady state, of any obtained estimates from theoretically expected values, due to roundoff errors. Numerical inaccuracy results in an increase of the output error without catastrophic problems and can be reduced by increasing the word length. In contrast, lack of numerical stability leads to catastrophic overflow (divergence or blowup of the algorithm) as a result of roundoff error accumulation. Numerically unstable algorithms that converge before “explosion” may provide good numerical accuracy. Therefore, although the two properties are related, one does not imply the other. Two other important issues are the sensitivity of an algorithm to bad or abnormal input data (e.g., poorly exciting input) and its sensitivity to initialization. All these issues are very important for the application of adaptive algorithms to real-world problems and are further discussed in the context of specific algorithms.

10.3 METHOD OF STEEPEST DESCENT

Most adaptive filtering algorithms are obtained by simple modifications of iterative methods for solving deterministic optimization problems. Studying these techniques helps one to understand several aspects of the operation of adaptive filters. In this section we discuss gradient-based optimization methods because they provide the ground for the development of the most widely used adaptive filtering algorithms. As we discussed in Section 6.2.1, the error performance surface of an optimum filter, in a stationary SOE, is given by

$$P(\mathbf{c}) = P_y - \mathbf{c}^H\mathbf{d} - \mathbf{d}^H\mathbf{c} + \mathbf{c}^H\mathbf{R}\mathbf{c} \tag{10.3.1}$$

where $P_y = E\{|y(n)|^2\}$. Equation (10.3.1) is a quadratic function of the coefficients; when R is positive definite, it represents a bowl-shaped surface with a unique minimum at $\mathbf{c}_o$ (optimum filter). There are two distinct ways to find the minimum of (10.3.1):

1. Solve the normal equations Rc = d, using a direct linear system solution method.
2. Find the minimum of P(c), using an iterative minimization algorithm.

Although direct methods provide the solution in a finite number of steps, sometimes we prefer iterative methods because they require less numerical precision, are computationally less expensive, work when R is not invertible, and are the only choice for nonquadratic performance functions. In all iterative methods, we start with an approximate solution (a guess), which we keep changing until we reach the minimum. Thus, to find the optimum $\mathbf{c}_o$, we start at some arbitrary point $\mathbf{c}_0$, usually the null vector $\mathbf{c}_0 = \mathbf{0}$, and then start a search for the “bottom of the bowl.” The key is to choose the steps in a systematic way so that each step takes us to a lower point until finally we reach the bottom. What differentiates various optimization algorithms is how we choose the direction and the size of each step.

Steepest-descent algorithm (SDA)

If the function P(c) has continuous derivatives, it is possible to approximate its value at an arbitrary neighboring point c + Δc by using the Taylor expansion

$$P(\mathbf{c} + \Delta\mathbf{c}) = P(\mathbf{c}) + \sum_{i=1}^{M} \frac{\partial P(\mathbf{c})}{\partial c_i}\,\Delta c_i + \frac{1}{2}\sum_{i=1}^{M}\sum_{j=1}^{M} \Delta c_i\,\frac{\partial^2 P(\mathbf{c})}{\partial c_i\,\partial c_j}\,\Delta c_j + \cdots \tag{10.3.2}$$

or more compactly

$$P(\mathbf{c} + \Delta\mathbf{c}) = P(\mathbf{c}) + (\Delta\mathbf{c})^T\nabla P(\mathbf{c}) + \tfrac{1}{2}(\Delta\mathbf{c})^T[\nabla^2 P(\mathbf{c})](\Delta\mathbf{c}) + \cdots \tag{10.3.3}$$

where ∇P(c) is the gradient vector, with elements $\partial P(\mathbf{c})/\partial c_i$, and ∇²P(c) is the Hessian matrix, with elements $\partial^2 P(\mathbf{c})/(\partial c_i\,\partial c_j)$. For simplicity we consider filters with real coefficients, but the conclusions apply when the coefficients are complex. For the quadratic function (10.3.1), we have

$$\nabla P(\mathbf{c}) = 2(\mathbf{R}\mathbf{c} - \mathbf{d}) \tag{10.3.4}$$

$$\nabla^2 P(\mathbf{c}) = 2\mathbf{R} \tag{10.3.5}$$

and the higher-order terms are zero. For nonquadratic functions, higher-order terms are nonzero, but if Δc is small, we can use a quadratic approximation. We note that if $\nabla P(\mathbf{c}_o) = \mathbf{0}$ and R is positive definite, then $\mathbf{c}_o$ is the minimum because $(\Delta\mathbf{c})^T[\nabla^2 P(\mathbf{c}_o)](\Delta\mathbf{c}) > 0$ for any nonzero Δc. Hence, if we choose the step Δc so that $(\Delta\mathbf{c})^T\nabla P(\mathbf{c}) < 0$, we will have P(c + Δc) < P(c), that is, we make a step to a point closer to the minimum. Since $(\Delta\mathbf{c})^T\nabla P(\mathbf{c}) = \|\Delta\mathbf{c}\|\,\|\nabla P(\mathbf{c})\|\cos\theta$, the reduction in MSE is maximum when Δc = −∇P(c). For this reason, the direction of the negative gradient is known as the direction of steepest descent. This leads to the following iterative minimization algorithm

$$\mathbf{c}_k = \mathbf{c}_{k-1} + \mu[-\nabla P(\mathbf{c}_{k-1})] \qquad k \ge 0 \tag{10.3.6}$$

which is known as the method of steepest descent (Scales 1985). The positive constant µ, known as the step-size parameter, controls the size of the descent in the direction of the negative gradient. The algorithm is usually initialized with $\mathbf{c}_0 = \mathbf{0}$. The steepest-descent algorithm (SDA) is illustrated in Figure 10.13 for a single-parameter case. For the cost function in (10.3.1), the SDA becomes

$$\mathbf{c}_k = \mathbf{c}_{k-1} + 2\mu(\mathbf{d} - \mathbf{R}\mathbf{c}_{k-1}) = (\mathbf{I} - 2\mu\mathbf{R})\mathbf{c}_{k-1} + 2\mu\mathbf{d} \tag{10.3.7}$$

which is a recursive difference equation. Note that k denotes an iteration in the SDA and has nothing to do with time. However, this iterative optimization can be combined with filtering to obtain a type of “asymptotically” optimum filter defined by

$$e(n,\zeta) = y(n,\zeta) - \mathbf{c}_{n-1}^H\,\mathbf{x}(n,\zeta) \tag{10.3.8}$$

$$\mathbf{c}_n = \mathbf{c}_{n-1} + 2\mu(\mathbf{d} - \mathbf{R}\mathbf{c}_{n-1}) \tag{10.3.9}$$

and is further discussed in Problem 10.2.
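The recursion (10.3.7) is short enough to sketch directly. The following is a minimal illustration for real-valued R and d, assuming a small 2 × 2 problem whose numbers are made up for the purpose of the example (they do not come from the text); the function names are likewise hypothetical.

```python
# Steepest-descent recursion c_k = c_{k-1} + 2*mu*(d - R c_{k-1}), real-valued case.
# R and d below are assumed illustrative values, not data from the text.

def matvec(A, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a * vj for a, vj in zip(row, v)) for row in A]

def steepest_descent(R, d, mu, num_iter):
    """Iterate (10.3.7) from the null vector and return the final coefficient vector."""
    c = [0.0] * len(d)
    for _ in range(num_iter):
        Rc = matvec(R, c)
        c = [ci + 2.0 * mu * (di - rci) for ci, di, rci in zip(c, d, Rc)]
    return c

R = [[1.0, 0.5], [0.5, 1.0]]   # eigenvalues 1.5 and 0.5
d = [0.5, 0.25]                # chosen so that the solution of R c = d is [0.5, 0.0]
c = steepest_descent(R, d, mu=0.25, num_iter=200)
print("c after 200 iterations:", c)
```

With these values the iterates approach the solution of the normal equations Rc = d without ever inverting R, which is the point of the iterative approach.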

FIGURE 10.13 Illustration of gradient search of the MSE surface for the minimum error point. [Plot of P(c) versus c for a single parameter: successive iterates $c_{k-1}$, $c_k$, $c_{k+1}$ approach $c_o$, with the slope dP/dc evaluated at $c = c_k$ and the MSE values $P_{k-1}$, $P_k$, $P_{k+1}$ decreasing toward $P_o$.]

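There are two key performance factors in the design of iterative optimization algorithms: stability and rate of convergence.

Stability

An algorithm is said to be stable if it converges to the minimum regardless of the starting point. To investigate the stability of the SDA, we rewrite (10.3.7) in terms of the coefficient error vector

$$\tilde{\mathbf{c}}_k \triangleq \mathbf{c}_k - \mathbf{c}_o \qquad k \ge 0 \tag{10.3.10}$$

as

$$\tilde{\mathbf{c}}_k = (\mathbf{I} - 2\mu\mathbf{R})\tilde{\mathbf{c}}_{k-1} \qquad k \ge 0 \tag{10.3.11}$$

which is a homogeneous difference equation. [Equation (10.3.11) follows by subtracting $\mathbf{c}_o$ from both sides of (10.3.7) and using $\mathbf{d} = \mathbf{R}\mathbf{c}_o$.] Using the principal-components transformation $\mathbf{R} = \mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^H$ (see Section 3.5), we can write (10.3.11) as

$$\tilde{\mathbf{c}}'_k = (\mathbf{I} - 2\mu\boldsymbol{\Lambda})\tilde{\mathbf{c}}'_{k-1} \qquad k \ge 0 \tag{10.3.12}$$

where

$$\tilde{\mathbf{c}}'_k \triangleq \mathbf{Q}^H\tilde{\mathbf{c}}_k \qquad k \ge 0 \tag{10.3.13}$$

is the transformed coefficient error vector. Since Λ is diagonal, (10.3.12) consists of a set of M decoupled first-order difference equations

$$\tilde{c}'_{k,i} = (1 - 2\mu\lambda_i)\,\tilde{c}'_{k-1,i} \qquad i = 1, 2, \ldots, M,\; k \ge 0 \tag{10.3.14}$$

with each describing a natural mode of the SDA. The solutions of (10.3.12) are given by

$$\tilde{c}'_{k,i} = (1 - 2\mu\lambda_i)^k\,\tilde{c}'_{0,i} \qquad k \ge 0 \tag{10.3.15}$$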

If

$$-1 < 1 - 2\mu\lambda_i < 1 \qquad \text{for all } 1 \le i \le M \tag{10.3.16}$$

or equivalently

$$0 < \mu < \frac{1}{\lambda_i} \qquad \text{for all } 1 \le i \le M \tag{10.3.17}$$

then $\tilde{c}'_{k,i}$, 1 ≤ i ≤ M, tends to zero as k → ∞. This implies that $\mathbf{c}_k$ converges exponentially to $\mathbf{c}_o$ as k → ∞ because $\|\tilde{\mathbf{c}}'_k\| = \|\mathbf{Q}^T\tilde{\mathbf{c}}_k\| = \|\tilde{\mathbf{c}}_k\|$. If R is positive definite, its eigenvalues are positive and

$$0 < \mu < \frac{1}{\lambda_{\max}} \tag{10.3.18}$$

provides a necessary and sufficient condition for the convergence of the SDA. To investigate the transient behavior of the SDA as a function of k, we note that using (10.3.10), (10.3.11), and (10.3.14), we have

$$c_{k,i} = c_{o,i} + \sum_{j=1}^{M} q_{ij}\,\tilde{c}'_{0,j}\,(1 - 2\mu\lambda_j)^k \tag{10.3.19}$$

where $c_{o,i}$ are the optimum coefficients and $q_{ij}$ the elements of the eigenvector matrix Q. The MSE at step k is

$$P_k = P_o + \sum_{i=1}^{M} \lambda_i\,(1 - 2\mu\lambda_i)^{2k}\,|\tilde{c}'_{0,i}|^2 \tag{10.3.20}$$

and can be obtained by substituting (10.3.19) in (10.3.1). If µ satisfies (10.3.18), we have $\lim_{k\to\infty} P_k = P_o$ and the MSE converges exponentially to the optimum value. The curve obtained by plotting the MSE $P_k$ as a function of the number of iterations k is known as the learning curve.

Rate of convergence

The rate (or speed) of convergence depends upon the algorithm and the nature of the performance surface. The most influential factor is the condition number of the Hessian matrix, which determines the shape of the contours of P(c). When P(c) is quadratic, it can be shown (Luenberger 1984) that

$$P(\mathbf{c}_k) \le \left(\frac{\mathcal{X}(\mathbf{R}) - 1}{\mathcal{X}(\mathbf{R}) + 1}\right)^2 P(\mathbf{c}_{k-1}) \tag{10.3.21}$$

where $\mathcal{X}(\mathbf{R}) = \lambda_{\max}/\lambda_{\min}$ is the condition number of R. If we recall that the eigenvectors corresponding to $\lambda_{\min}$ and $\lambda_{\max}$ point to the directions of minimum and maximum curvature, respectively, we see that the convergence slows down as the contours become more eccentric (flattened). For circular contours, that is, when $\mathcal{X}(\mathbf{R}) = 1$, the algorithm converges in one step. We stress that even if M − 1 eigenvalues of R are equal and the remaining one is far away, the convergence of the SDA is still very slow. The rate of convergence can be characterized by using the time constant $\tau_i$ defined by

$$1 - 2\mu\lambda_i = \exp\left(-\frac{1}{\tau_i}\right) \simeq 1 - \frac{1}{\tau_i} \tag{10.3.22}$$

which provides the time (or number of iterations) it takes for the ith mode $\tilde{c}'_{k,i}$ of (10.3.19) to decay to 1/e of its initial value $\tilde{c}'_{0,i}$. When µ ≪ 1, we obtain

$$\tau_i \simeq \frac{1}{2\mu\lambda_i} \tag{10.3.23}$$

In a similar fashion, the time constant $\tau_{i,\text{mse}}$ for the MSE $P_k$ can be shown to be

$$\tau_{i,\text{mse}} \simeq \frac{1}{4\mu\lambda_i} \tag{10.3.24}$$

by using (10.3.20) and (10.3.22). Thus, for all practical purposes, the time constant (for coefficient $\mathbf{c}_k$ or for MSE $P_k$) of the SDA is $\tau \simeq 1/(\mu\lambda_{\min})$, which in conjunction with $\mu < 1/\lambda_{\max}$ results in $\tau > \lambda_{\max}/\lambda_{\min}$. Hence, the larger the eigenvalue spread of the input correlation matrix R, the longer it takes for the SDA to converge. In the following example, we illustrate the above-discussed properties of the SDA by using it to compute the parameters of a second-order forward linear predictor.

EXAMPLE 10.3.1. Consider a signal generated by the second-order autoregressive AR(2) process

$$x(n) + a_1 x(n-1) + a_2 x(n-2) = w(n) \tag{10.3.25}$$

where $w(n) \sim \text{WGN}(0, \sigma_w^2)$. Parameters $a_1$ and $a_2$ are chosen so that the system (10.3.25) is minimum-phase. We want to design an adaptive filter that uses the samples x(n − 1) and x(n − 2) to predict the value x(n) (desired response). If we multiply (10.3.25) by x(n − k), for k = 0, 1, 2, and take the mathematical expectation of both sides, we obtain a set of linear equations

$$r(0) + a_1 r(1) + a_2 r(2) = \sigma_w^2 \tag{10.3.26}$$

$$r(1) + a_1 r(0) + a_2 r(1) = 0 \tag{10.3.27}$$

$$r(2) + a_1 r(1) + a_2 r(0) = 0 \tag{10.3.28}$$

which can be used to express the autocorrelation of x(n) in terms of model parameters $a_1$, $a_2$, and $\sigma_w^2$. Indeed, solving (10.3.26) through (10.3.28), we obtain

$$r(0) = \sigma_x^2 = \frac{1 + a_2}{1 - a_2}\,\frac{\sigma_w^2}{(1 + a_2)^2 - a_1^2}$$

$$r(1) = \frac{-a_1}{1 + a_2}\,r(0) \tag{10.3.29}$$

$$r(2) = \left(-a_2 + \frac{a_1^2}{1 + a_2}\right) r(0)$$

We choose $\sigma_x^2 = 1$, so that

$$\sigma_w^2 = \frac{(1 - a_2)[(1 + a_2)^2 - a_1^2]}{1 + a_2}\,\sigma_x^2 \tag{10.3.30}$$
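As a quick numerical check of (10.3.29) and (10.3.30), the following sketch computes r(0), r(1), r(2), and σ_w² from the model parameters and substitutes them back into (10.3.26) through (10.3.28). The function name `ar2_autocorrelation` is hypothetical, and the parameter pair used is the small-spread case that appears later in the example.

```python
# Autocorrelation of an AR(2) process from its parameters, per (10.3.29)-(10.3.30).

def ar2_autocorrelation(a1, a2, var_x=1.0):
    """Return (r0, r1, r2, var_w) for x(n) + a1 x(n-1) + a2 x(n-2) = w(n)."""
    var_w = (1.0 - a2) * ((1.0 + a2) ** 2 - a1 ** 2) / (1.0 + a2) * var_x  # (10.3.30)
    r0 = var_x
    r1 = -a1 / (1.0 + a2) * r0
    r2 = (-a2 + a1 ** 2 / (1.0 + a2)) * r0
    return r0, r1, r2, var_w

a1, a2 = -0.1950, 0.95
r0, r1, r2, var_w = ar2_autocorrelation(a1, a2)

# Residuals of the three linear equations (10.3.26)-(10.3.28); all should be ~0.
res0 = r0 + a1 * r1 + a2 * r2 - var_w
res1 = r1 + a1 * r0 + a2 * r1
res2 = r2 + a1 * r1 + a2 * r0
print("r =", (r0, r1, r2), " var_w =", var_w)
print("residuals =", (res0, res1, res2))
```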

The coefficients of the optimum predictor

$$\hat{y}(n) = \hat{x}(n) = c_{o,1}\,x(n-1) + c_{o,2}\,x(n-2) \tag{10.3.31}$$

are given by (see Section 6.5)

$$r(0)c_{o,1} + r(1)c_{o,2} = r(1) \tag{10.3.32}$$

$$r(1)c_{o,1} + r(0)c_{o,2} = r(2) \tag{10.3.33}$$

with

$$P_o^f = r(0) - r(1)c_{o,1} - r(2)c_{o,2} \tag{10.3.34}$$

whose comparison with (10.3.26) through (10.3.28) shows that $c_{o,1} = -a_1$, $c_{o,2} = -a_2$, and $P_o^f = \sigma_w^2$, as expected. The eigenvalues of the input correlation matrix

$$\mathbf{R} = \begin{bmatrix} r(0) & r(1) \\ r(1) & r(0) \end{bmatrix} \tag{10.3.35}$$

are

$$\lambda_{1,2} = \left(1 \mp \frac{a_1}{1 + a_2}\right)\sigma_x^2 \tag{10.3.36}$$

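from which the eigenvalue spread is

$$\mathcal{X}(\mathbf{R}) = \frac{\lambda_1}{\lambda_2} = \frac{1 - a_1 + a_2}{1 + a_1 + a_2} \tag{10.3.37}$$

which, if $a_2 > 0$ and $a_1 < 0$, is larger than 1.

Now we perform Matlab experiments with varying eigenvalue spread X(R) and step-size parameter µ. In these experiments, we choose $\sigma_w^2$ so that $\sigma_x^2 = 1$. The SDA is given by

$$\mathbf{c}_k \triangleq [c_{k,1}\;\; c_{k,2}]^T = \mathbf{c}_{k-1} + 2\mu(\mathbf{d} - \mathbf{R}\mathbf{c}_{k-1})$$

where

$$\mathbf{d} = [r(1)\;\; r(2)]^T \qquad \text{and} \qquad \mathbf{c}_0 = [0\;\; 0]^T$$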

We choose two different sets of values for $a_1$ and $a_2$, one for a small and the other for a large eigenvalue spread. These values are shown in Table 10.2 along with the corresponding eigenvalue spread X(R) and the MMSE $\sigma_w^2$.

TABLE 10.2 Parameter values used in the SDA for the second-order forward prediction problem.

Eigenvalue spread    a1         a2      λ1       λ2       X(R)    σ_w²
Small                −0.1950    0.95    1.1      0.9      1.22    0.0965
Large                −1.5955    0.95    1.818    0.182    9.99    0.0322
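The entries of Table 10.2 can be reproduced from (10.3.30), (10.3.36), and (10.3.37) with σ_x² = 1. The sketch below does so; the function name `table_row` is hypothetical.

```python
# Reproduce the eigenvalues, eigenvalue spread, and MMSE of Table 10.2 (sigma_x^2 = 1).

def table_row(a1, a2, var_x=1.0):
    """Return (lam1, lam2, spread, var_w) from (10.3.36), (10.3.37), and (10.3.30)."""
    rho = a1 / (1.0 + a2)
    lam1 = (1.0 - rho) * var_x
    lam2 = (1.0 + rho) * var_x
    spread = (1.0 - a1 + a2) / (1.0 + a1 + a2)
    var_w = (1.0 - a2) * ((1.0 + a2) ** 2 - a1 ** 2) / (1.0 + a2) * var_x
    return lam1, lam2, spread, var_w

small = table_row(-0.1950, 0.95)
large = table_row(-1.5955, 0.95)
for label, row in (("Small", small), ("Large", large)):
    print(label, ["%.4f" % v for v in row])
```

For the large-spread row, (10.3.37) evaluates to approximately 10.00 rather than 9.99; the tabulated 9.99 is consistent with taking the ratio of the rounded eigenvalues 1.818 and 0.182.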

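Using each set of parameter values, the SDA is implemented starting with the null coefficient vector $\mathbf{c}_0$ with two values of step-size parameters. To describe the transient behavior of the algorithm, it is informative to plot the trajectory of $c_{k,1}$ versus $c_{k,2}$ as a function of the iteration index k along with the contours of the error surface $P(\mathbf{c}_k)$. The trajectory of $\mathbf{c}_k$ begins at the origin $\mathbf{c}_0 = \mathbf{0}$ and ends at the optimum value $\mathbf{c}_o = -[a_1\;\; a_2]^T$. This illustration of the transient behavior can also be obtained in the domain of the transformed error coefficients $\tilde{\mathbf{c}}'_k$. Using (10.3.15), we see these coefficients are given by

$$\tilde{\mathbf{c}}'_k = \begin{bmatrix} \tilde{c}'_{k,1} \\ \tilde{c}'_{k,2} \end{bmatrix} = \begin{bmatrix} (1 - 2\mu\lambda_1)^k\,\tilde{c}'_{0,1} \\ (1 - 2\mu\lambda_2)^k\,\tilde{c}'_{0,2} \end{bmatrix} \tag{10.3.38}$$

where $\tilde{\mathbf{c}}'_0$ from (10.3.10) and (10.3.13) is given by

$$\tilde{\mathbf{c}}'_0 = \begin{bmatrix} \tilde{c}'_{0,1} \\ \tilde{c}'_{0,2} \end{bmatrix} = \mathbf{Q}^T\tilde{\mathbf{c}}_0 = \mathbf{Q}^T(\mathbf{c}_0 - \mathbf{c}_o) = -\mathbf{Q}^T\mathbf{c}_o = \mathbf{Q}^T\begin{bmatrix} a_1 \\ a_2 \end{bmatrix} \tag{10.3.39}$$

Thus the trajectory of $\tilde{\mathbf{c}}'_k$ begins at $\tilde{\mathbf{c}}'_0$ and ends at the origin $\tilde{\mathbf{c}}'_k = \mathbf{0}$. The contours of the MSE function in the transformed domain are given by $P_k - P_o$. From (10.3.20), these contours are given by

$$P_k - P_o = \sum_{i=1}^{2} \lambda_i\,(\tilde{c}'_{k,i})^2 = \lambda_1(\tilde{c}'_{k,1})^2 + \lambda_2(\tilde{c}'_{k,2})^2 \tag{10.3.40}$$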

Small eigenvalue spread and overdamped response. For this experiment, the parameter values were selected to obtain an eigenvalue spread approximately equal to 1 [X(R) = 1.22]. The step size selected was µ = 0.15, which is less than $1/\lambda_{\max} = 1/1.1 \approx 0.91$, as required for convergence. For this value of µ, the transient response is overdamped. Figure 10.14 shows four graphs indicating the behavior of the algorithm. In graph (a), the trajectory of $\tilde{\mathbf{c}}'_k$ is shown for 0 ≤ k ≤ 15 along with the corresponding loci of $\tilde{\mathbf{c}}'_k$ for a fixed value of $P_k - P_o$. The first two loci for k = 0 and 1 are numbered to show the direction of the trajectory. Graph (b) shows the corresponding trajectory and the contours for $\mathbf{c}_k$. Graph (c) shows plots of $c_{k,1}$ and $c_{k,2}$ as a function of iteration step k, while graph (d) shows a similar learning curve for the MSE $P_k$. Several observations can be made about these plots. The contours in the $\tilde{\mathbf{c}}'_k$ domain are almost circular since the spread is approximately 1, while those of $\mathbf{c}_k$ are somewhat elliptical, which is to be expected. The trajectories of $\tilde{\mathbf{c}}'_k$ and $\mathbf{c}_k$ as a function of k are normal to the contours. The coefficients converge to their optimum values in a monotonic fashion, which confirms the overdamped nature of the response. Also this convergence is rapid, in about 15 steps, which is to be expected for a small eigenvalue spread.

FIGURE 10.14 Performance curves for the steepest-descent algorithm used in the linear prediction problem with step-size parameter µ = 0.15 and eigenvalue spread X(R) = 1.22. [Panels: (a) locus of $\tilde{c}'_{k,1}$ versus $\tilde{c}'_{k,2}$; (b) locus of $c_{k,1}$ versus $c_{k,2}$; (c) $\mathbf{c}_k$ learning curve, with the coefficients approaching $c_{o,1} = 0.195$ and $c_{o,2} = -0.95$ over 0 ≤ k ≤ 15; (d) MSE $P_k$ learning curve, decreasing from 1.0 toward 0.0965.]
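The small-spread experiment can be approximated with a few lines of code. The sketch below is not the Matlab script used for the figures; it is a plain re-implementation of the recursion for real coefficients, with hypothetical function and variable names, that records the coefficient and MSE histories for the first row of Table 10.2.

```python
# SDA for the AR(2) forward-prediction example (real-valued, sigma_x^2 = 1).
# A re-implementation for illustration; names are not from the text.

def sda_prediction(a1, a2, mu, num_iter):
    """Run c_k = c_{k-1} + 2*mu*(d - R c_{k-1}) from c_0 = 0; return (coeffs, errors)."""
    r0 = 1.0
    r1 = -a1 / (1.0 + a2) * r0
    r2 = (-a2 + a1 ** 2 / (1.0 + a2)) * r0
    R = [[r0, r1], [r1, r0]]
    d = [r1, r2]

    def apply_R(c):
        return [R[0][0] * c[0] + R[0][1] * c[1], R[1][0] * c[0] + R[1][1] * c[1]]

    def mse(c):
        # P(c) = P_y - 2 c^T d + c^T R c for real coefficients, with P_y = r(0)
        Rc = apply_R(c)
        return r0 - 2.0 * (c[0] * d[0] + c[1] * d[1]) + (c[0] * Rc[0] + c[1] * Rc[1])

    c = [0.0, 0.0]
    coeffs, errors = [list(c)], [mse(c)]
    for _ in range(num_iter):
        Rc = apply_R(c)
        c = [c[i] + 2.0 * mu * (d[i] - Rc[i]) for i in range(2)]
        coeffs.append(list(c))
        errors.append(mse(c))
    return coeffs, errors

coeffs, errors = sda_prediction(a1=-0.1950, a2=0.95, mu=0.15, num_iter=15)
print("c_15 =", coeffs[-1])   # expected to be close to [-a1, -a2]
print("P_15 =", errors[-1])   # expected to be close to sigma_w^2
```

Plotting `coeffs` and `errors` against the iteration index gives curves of the kind described for graphs (c) and (d).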

Large eigenvalue spread and overdamped response. For this experiment, the parameter values were selected so that the eigenvalue spread was approximately equal to 10 [X(R) = 9.99]. The step size was again selected as µ = 0.15. Figure 10.15 shows the performance plots for this experiment, which are similar to those of Figure 10.14. The observations are also similar except for those due to the larger spread. First, the contours, even in the transformed domain, are elliptical; second, the convergence is slow, requiring about 60 steps in the algorithm. The transient response is once again overdamped.

FIGURE 10.15 Performance curves for the steepest-descent algorithm used in the linear prediction problem with step-size parameter µ = 0.15 and eigenvalue spread X(R) = 10. [Panels: (a) locus of $\tilde{c}'_{k,1}$ versus $\tilde{c}'_{k,2}$; (b) locus of $c_{k,1}$ versus $c_{k,2}$; (c) $\mathbf{c}_k$ learning curve, with the coefficients approaching $c_{o,1} = 1.5955$ and $c_{o,2} = -0.95$ over 0 ≤ k ≤ 60; (d) MSE $P_k$ learning curve, decreasing from 1.0 toward 0.0322.]

Large eigenvalue spread and underdamped response. Finally, in the third experiment, we consider the model parameters of the above case and increase the step size to µ = 0.5 (< $1/\lambda_{\max}$ = 0.55) so that the transient response is underdamped. Figure 10.16 shows the corresponding plots.

FIGURE 10.16 Performance curves for the steepest-descent algorithm used in the linear prediction problem with eigenvalue spread X(R) = 10 and varying step-size parameters µ = 0.15 and µ = 0.5. [Panels: (a) locus of $\tilde{c}'_{k,1}$ versus $\tilde{c}'_{k,2}$; (b) locus of $c_{k,1}$ versus $c_{k,2}$; (c) $\mathbf{c}_k$ learning curve, with the coefficients approaching $c_{o,1} = 1.5955$ and $c_{o,2} = -0.95$ over 0 ≤ k ≤ 60; (d) MSE $P_k$ learning curve, decreasing from 1.0 toward 0.0322.]

Note that the coefficients converge in an oscillatory fashion; however, the convergence is fairly rapid compared to that of the overdamped case. Thus the selection of the step size is an important design issue.
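The difference between the overdamped and underdamped cases can be read directly from the mode factors 1 − 2µλ_i of (10.3.15). The sketch below evaluates them for the large-spread eigenvalues of Table 10.2 and counts the iterations needed for the slowest mode to fall below a tolerance; the tolerance of 0.01 and the function names are assumptions made only for this illustration.

```python
import math

# Mode factors (1 - 2*mu*lambda_i) for the large-spread case of Table 10.2.
lam = (1.818, 0.182)

def mode_factors(mu):
    """Per-mode geometric ratios of the transformed coefficient errors, (10.3.14)."""
    return [1.0 - 2.0 * mu * l for l in lam]

def iterations_to_decay(mu, tol=0.01):
    """Smallest k such that the slowest mode satisfies |1 - 2*mu*lambda|^k <= tol."""
    rho = max(abs(f) for f in mode_factors(mu))
    return math.ceil(math.log(tol) / math.log(rho))

for mu in (0.15, 0.5):
    print("mu =", mu, " factors =", mode_factors(mu), " iterations =", iterations_to_decay(mu))

k_slow = iterations_to_decay(0.15)
k_fast = iterations_to_decay(0.5)
```

For µ = 0.15 both factors are positive, so each mode decays monotonically, but the factor of the small eigenvalue is close to 1 and dominates the convergence time. For µ = 0.5 the factor of the large eigenvalue is negative, so that mode alternates in sign (the oscillation seen in the plots), while both factors have a smaller magnitude than the slowest factor at µ = 0.15.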

Newton’s type of algorithms

Another family of algorithms with a faster rate of convergence includes Newton’s method and its modifications. The basic idea of Newton’s method is to achieve convergence in one step when P(c) is quadratic. Thus, if $\mathbf{c}_k$ is to be the minimum of P(c), the gradient $\nabla P(\mathbf{c}_k)$ of P(c) evaluated at $\mathbf{c}_k$ (10.2.19) should be zero. From (10.2.19), we can write

$$\nabla P(\mathbf{c}_k) = \nabla P(\mathbf{c}_{k-1}) + \nabla^2 P(\mathbf{c}_{k-1})\,\Delta\mathbf{c}_k = \mathbf{0} \tag{10.3.41}$$

Thus $\nabla P(\mathbf{c}_k) = \mathbf{0}$ leads to the step increment

$$\Delta\mathbf{c}_k = -[\nabla^2 P(\mathbf{c}_{k-1})]^{-1}\,\nabla P(\mathbf{c}_{k-1}) \tag{10.3.42}$$

and hence the adaptive algorithm is given by

$$\mathbf{c}_k = \mathbf{c}_{k-1} - \mu[\nabla^2 P(\mathbf{c}_{k-1})]^{-1}\,\nabla P(\mathbf{c}_{k-1}) \tag{10.3.43}$$

where µ > 0 is the step size. For quadratic error surfaces, from (10.3.4) and (10.3.5), we obtain with µ = 1

$$\mathbf{c}_k = \mathbf{c}_{k-1} - [\nabla^2 P(\mathbf{c}_{k-1})]^{-1}\,\nabla P(\mathbf{c}_{k-1}) = \mathbf{c}_{k-1} - (\mathbf{c}_{k-1} - \mathbf{R}^{-1}\mathbf{d}) = \mathbf{c}_o \tag{10.3.44}$$

which shows that indeed the algorithm converges in one step.

For the quadratic case, since $\nabla^2 P(\mathbf{c}_{k-1}) = 2\mathbf{R}$ from (10.3.1), we can express Newton’s algorithm as

$$\mathbf{c}_k = \mathbf{c}_{k-1} - \mu\mathbf{R}^{-1}\,\nabla P(\mathbf{c}_{k-1}) \tag{10.3.45}$$

where µ is the step size that regulates the convergence rate. Other modified Newton methods replace the Hessian matrix ∇ 2 P (ck−1 ) with another matrix, which is guaranteed to be positive definite and, in some way, close to the Hessian. These Newton-type algorithms generally provide faster convergence. However, in practice, the inversion of R is numerically intensive and can lead to a numerically unstable solution if special care is not taken. Therefore, the SDA is more popular in adaptive filtering applications. When the function P (c) is nonquadratic, it is approximated locally by a quadratic function that is minimized exactly. However, the step obtained in (10.3.42) does not lead to the minimum of P (c), and the iteration should be repeated several times. A more detailed treatment of linear and nonlinear optimization techniques can be found in Scales (1985) and in Luenberger (1984).
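The one-step property in (10.3.44) can be checked numerically. The following is a minimal sketch, assuming a real-valued 2 × 2 problem; the matrix `R`, the vector `d`, and all function names are hypothetical values chosen for illustration and do not come from the text. It applies one Newton step (10.3.43) with $\mu = 1$ and, for contrast, ten steepest-descent steps with $\mu = 0.15$.

```python
# Minimal sketch: Newton's method versus steepest descent on a quadratic
# error surface with gradient 2(Rc - d). R and d are hypothetical values.

def mat_vec(A, v):
    """Multiply matrix A (list of rows) by vector v."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def inv_2x2(A):
    """Inverse of a 2x2 matrix."""
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def gradient(R, d, c):
    """Gradient of the quadratic performance function: 2(Rc - d)."""
    Rc = mat_vec(R, c)
    return [2.0 * (rc - di) for rc, di in zip(Rc, d)]

def newton_step(R, d, c, mu):
    """One step of (10.3.43) with Hessian 2R."""
    hessian_inv = inv_2x2([[2.0 * r for r in row] for row in R])
    step = mat_vec(hessian_inv, gradient(R, d, c))
    return [ci - mu * si for ci, si in zip(c, step)]

def steepest_descent_step(R, d, c, mu):
    """One step of c_k = c_{k-1} - mu * grad P(c_{k-1})."""
    g = gradient(R, d, c)
    return [ci - mu * gi for ci, gi in zip(c, g)]

R = [[1.0, 0.8], [0.8, 1.0]]      # hypothetical correlation matrix
d = [0.5, -0.3]                   # hypothetical cross-correlation vector
c_opt = mat_vec(inv_2x2(R), d)    # optimum filter R^{-1} d

c_newton = newton_step(R, d, [0.0, 0.0], mu=1.0)

c_sd = [0.0, 0.0]
for _ in range(10):
    c_sd = steepest_descent_step(R, d, c_sd, mu=0.15)

err_newton = max(abs(a - b) for a, b in zip(c_newton, c_opt))
err_sd = max(abs(a - b) for a, b in zip(c_sd, c_opt))
print("Newton error after 1 step:", err_newton)
print("Steepest-descent error after 10 steps:", err_sd)
```

The sketch inverts the 2 × 2 Hessian in closed form; for larger filters that inversion is the costly and numerically delicate step.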

10.4 LEAST-MEAN-SQUARE ADAPTIVE FILTERS

In this section, we derive the least-mean-square (LMS) adaptive algorithm, analyze its performance, and present some of its practical applications. The LMS algorithm, introduced by Widrow and Hoff (1960), is widely used in practice due to its simplicity, computational efficiency, and good performance under a variety of operating conditions.

10.4.1 Derivation

We first present two approaches to the derivation of the LMS algorithm that will help the reader to understand its operation. The first approach uses an approximation to the gradient function, while the second approach uses geometric arguments.

Optimization approach. The SDA uses the second-order moments $\mathbf{R}$ and $\mathbf{d}$ to iteratively compute the optimum filter $\mathbf{c}_o = \mathbf{R}^{-1}\mathbf{d}$, starting with an initial guess, usually $\mathbf{c}_0 = \mathbf{0}$, and then obtaining better approximations by taking steps in the direction of the negative gradient, that is,

$$\mathbf{c}_k = \mathbf{c}_{k-1} + \mu[-\nabla P(\mathbf{c}_{k-1})] \tag{10.4.1}$$

where

$$\nabla P(\mathbf{c}_{k-1}) = 2(\mathbf{R}\mathbf{c}_{k-1} - \mathbf{d}) \tag{10.4.2}$$

is the gradient of the performance function (10.3.1). In practice, where only the input $\{\mathbf{x}(j)\}_0^n$ and the desired response $\{y(j)\}_0^n$ are known, we can only compute an estimate of the “true” or exact gradient (10.4.2) using the available data. To develop an adaptive algorithm from (10.4.1), we take the following steps: (1) replace the iteration subscript $k$ by the time index $n$; and (2) replace $\mathbf{R}$ and $\mathbf{d}$ by their instantaneous estimates $\mathbf{x}(n)\mathbf{x}^H(n)$ and $\mathbf{x}(n)y^*(n)$, respectively. The instantaneous estimate of the gradient (10.4.2) becomes

$$\nabla P(\mathbf{c}_{k-1}) = 2\mathbf{R}\mathbf{c}_{k-1} - 2\mathbf{d} \simeq 2\mathbf{x}(n)\mathbf{x}^H(n)\mathbf{c}(n-1) - 2\mathbf{x}(n)y^*(n) = -2\mathbf{x}(n)e^*(n) \tag{10.4.3}$$

where

$$e(n) = y(n) - \mathbf{c}^H(n-1)\mathbf{x}(n) \tag{10.4.4}$$

is the a priori filtering error. The estimate (10.4.3) also can be obtained by starting with the approximation $P(\mathbf{c}) \simeq |e(n)|^2$ and taking its gradient. The coefficient adaptation algorithm is

$$\mathbf{c}(n) = \mathbf{c}(n-1) + 2\mu\mathbf{x}(n)e^*(n) \tag{10.4.5}$$

which is obtained by substituting (10.4.3) and (10.4.4) in (10.4.1). The step-size parameter $2\mu$ is also known as the adaptation gain. The LMS algorithm, specified by (10.4.5) and (10.4.4), has both important similarities to and important differences from the SDA (10.3.7). The SDA contains deterministic quantities while the LMS operates on random quantities. The SDA is not an adaptive algorithm because it depends only on the second-order moments $\mathbf{R}$ and $\mathbf{d}$ and not on the SOE $\{\mathbf{x}(n,\zeta), y(n,\zeta)\}$. Also, the iteration index $k$ has nothing to do with time. Simply stated, the SDA provides an iterative solution to the linear system $\mathbf{R}\mathbf{c} = \mathbf{d}$.

Geometric approach. Suppose that an adaptive filter operates in a stationary signal environment seeking the optimum filter $\mathbf{c}_o$. At time $n$ the filter has access to the input vector $\mathbf{x}(n)$, the desired response $y(n)$, and the previous or old coefficient estimate $\mathbf{c}(n-1)$. Its goal is to use this information to determine a new estimate $\mathbf{c}(n)$ that is closer to the optimum vector $\mathbf{c}_o$ or, equivalently, to choose $\mathbf{c}(n)$ so that $\|\tilde{\mathbf{c}}(n)\| < \|\tilde{\mathbf{c}}(n-1)\|$, where $\tilde{\mathbf{c}}(n) = \mathbf{c}(n) - \mathbf{c}_o$ is the coefficient error vector given by (10.2.24). Eventually, we want $\|\tilde{\mathbf{c}}(n)\|$ to become negligible as $n \to \infty$. The vector $\tilde{\mathbf{c}}(n-1)$ can be decomposed into two orthogonal components

$$\tilde{\mathbf{c}}(n-1) = \tilde{\mathbf{c}}_x(n-1) + \tilde{\mathbf{c}}_x^{\perp}(n-1) \tag{10.4.6}$$

one parallel and one orthogonal to the input vector $\mathbf{x}(n)$, as shown in Figure 10.17(a). The response of the error filter $\tilde{\mathbf{c}}(n-1)$ to the input $\mathbf{x}(n)$ is

$$\tilde{y}(n) \triangleq \tilde{\mathbf{c}}^H(n-1)\mathbf{x}(n) = \tilde{\mathbf{c}}_x^H(n-1)\mathbf{x}(n) \tag{10.4.7}$$

which implies that

$$\tilde{\mathbf{c}}_x(n-1) = \frac{\tilde{y}^*(n)}{\|\mathbf{x}(n)\|^2}\,\mathbf{x}(n) \tag{10.4.8}$$

which can be verified by direct substitution in (10.4.7). Note that $\mathbf{x}(n)/\|\mathbf{x}(n)\|$ is a unit vector along the direction of $\mathbf{x}(n)$.

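FIGURE 10.17 The geometric approach for the derivation of the LMS algorithm. Panels: (a) decomposition of $\tilde{\mathbf{c}}(n-1)$ into $\tilde{\mathbf{c}}_x(n-1)$ and $\tilde{\mathbf{c}}_x^{\perp}(n-1)$ relative to $\mathbf{x}(n)$; (b) the new error vector $\tilde{\mathbf{c}}(n)$ obtained by adding a correction along $\tilde{\mathbf{c}}_x(n-1)$.

If we only know $\mathbf{x}(n)$ and $\tilde{y}(n)$, the best strategy to decrease $\|\tilde{\mathbf{c}}(n)\|$ is to choose $\tilde{\mathbf{c}}(n) = \tilde{\mathbf{c}}_x^{\perp}(n-1)$, or equivalently to subtract $\tilde{\mathbf{c}}_x(n-1)$ from $\tilde{\mathbf{c}}(n-1)$. From Figure 10.17(a) note that as long as $\tilde{\mathbf{c}}_x(n-1) \neq \mathbf{0}$, we have $\|\tilde{\mathbf{c}}(n)\| = \|\tilde{\mathbf{c}}_x^{\perp}(n-1)\| < \|\tilde{\mathbf{c}}(n-1)\|$. This suggests the following adaptation algorithm

$$\tilde{\mathbf{c}}(n) = \tilde{\mathbf{c}}(n-1) - \tilde{\mu}\,\frac{\tilde{y}^*(n)}{\|\mathbf{x}(n)\|^2}\,\mathbf{x}(n) \tag{10.4.9}$$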


which guarantees that $\|\tilde{\mathbf{c}}(n)\| < \|\tilde{\mathbf{c}}(n-1)\|$ as long as $0 < \tilde{\mu} < 2$ and $\tilde{y}(n) \neq 0$, as shown in Figure 10.17(b). The best choice clearly is $\tilde{\mu} = 1$. Unfortunately, the signal $\tilde{y}(n)$ is not available, and we have to replace it with some reasonable approximation. From (10.2.18) and (10.2.10) we obtain

$$\begin{aligned} \tilde{e}(n) &\triangleq e(n) - e_o(n) = y(n) - \hat{y}(n) - y(n) + \hat{y}_o(n) = \hat{y}_o(n) - \hat{y}(n) \\ &= [\mathbf{c}_o^H - \mathbf{c}^H(n-1)]\mathbf{x}(n) = -\tilde{\mathbf{c}}^H(n-1)\mathbf{x}(n) = -\tilde{y}(n) \end{aligned} \tag{10.4.10}$$

where we have used (10.4.7). Using the approximation $\tilde{e}(n) = e(n) - e_o(n) \simeq e(n)$, we combine it with (10.4.10) to get

$$\mathbf{c}(n) = \mathbf{c}(n-1) + \tilde{\mu}\,\frac{e^*(n)}{\|\mathbf{x}(n)\|^2}\,\mathbf{x}(n) \tag{10.4.11}$$

which is known as the normalized LMS algorithm. Note that the effective step size $\tilde{\mu}/\|\mathbf{x}(n)\|^2$ is time-varying. The LMS algorithm in (10.4.5) follows if we set $\|\mathbf{x}(n)\| = 1$ and choose $\tilde{\mu} = 2\mu$.
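The geometric argument can be illustrated with a short numerical sketch. It assumes real-valued data and a noise-free desired response, so that $e_o(n) = 0$ and $e(n) = -\tilde{y}(n)$ exactly; the optimum filter `c_opt` and the function names are hypothetical. With $\tilde{\mu} = 1$, each update (10.4.11) removes the component of the coefficient error along $\mathbf{x}(n)$, so the error norm should never increase.

```python
import random

def nlms_step(c, x, y, mu_tilde):
    """One normalized LMS update (10.4.11) for real-valued data."""
    e = y - sum(ci * xi for ci, xi in zip(c, x))    # a priori error e(n)
    energy = sum(xi * xi for xi in x)               # squared norm of x(n)
    return [ci + mu_tilde * e * xi / energy for ci, xi in zip(c, x)]

def error_norm(c, c_opt):
    """Euclidean norm of the coefficient error vector c - c_opt."""
    return sum((ci - oi) ** 2 for ci, oi in zip(c, c_opt)) ** 0.5

random.seed(0)
c_opt = [0.7, -0.4, 0.2]     # hypothetical optimum filter
c = [0.0, 0.0, 0.0]          # initial coefficients
norms = [error_norm(c, c_opt)]

for _ in range(50):
    x = [random.gauss(0.0, 1.0) for _ in c_opt]
    y = sum(oi * xi for oi, xi in zip(c_opt, x))    # noise-free: e_o(n) = 0
    c = nlms_step(c, x, y, mu_tilde=1.0)
    norms.append(error_norm(c, c_opt))

print("initial error norm:", norms[0])
print("final error norm:  ", norms[-1])
```

When $e_o(n) \neq 0$ the monotone decrease is no longer guaranteed, which is why the approximation $\tilde{e}(n) \simeq e(n)$ matters in practice.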

LMS algorithm. The LMS algorithm can be summarized as

$$\begin{aligned} \hat{y}(n) &= \mathbf{c}^H(n-1)\mathbf{x}(n) && \text{filtering} \\ e(n) &= y(n) - \hat{y}(n) && \text{error formation} \\ \mathbf{c}(n) &= \mathbf{c}(n-1) + 2\mu\mathbf{x}(n)e^*(n) && \text{coefficient updating} \end{aligned} \tag{10.4.12}$$

where $\mu$ is the adaptation step size. The algorithm requires $2M + 1$ complex multiplications and $2M$ complex additions per time update. Figure 10.18 shows an implementation of an FIR adaptive filter using the LMS algorithm, which is implemented in Matlab using the function [yhat,c]=firlms(x,y,M,mu). The a posteriori form of the LMS algorithm is developed in Problem 10.9.

FIGURE 10.18 An FIR adaptive filter realization using the LMS algorithm.
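The three steps in (10.4.12) translate almost line by line into code. The sketch below is not the book's Matlab routine; `fir_lms` is a hypothetical Python function written for illustration, assuming zero initial conditions $\mathbf{c}(-1) = \mathbf{0}$ and $x(n) = 0$ for $n < 0$. The usage part identifies a hypothetical three-tap FIR system `h` from noisy observations.

```python
import random

def fir_lms(x, y, M, mu):
    """LMS FIR adaptive filter following (10.4.12).

    x, y : input and desired-response sequences (real or complex)
    M    : number of coefficients
    mu   : adaptation step size
    Returns (yhat, c): the filter output sequence and final coefficients.
    """
    c = [0j] * M          # c(-1) = 0
    buf = [0j] * M        # data vector [x(n), x(n-1), ..., x(n-M+1)]
    yhat = []
    for xn, yn in zip(x, y):
        buf = [complex(xn)] + buf[:-1]
        # filtering: yhat(n) = c^H(n-1) x(n)
        y_est = sum(ck.conjugate() * xk for ck, xk in zip(c, buf))
        # error formation: e(n) = y(n) - yhat(n)
        e = yn - y_est
        # coefficient updating: c(n) = c(n-1) + 2 mu x(n) e*(n)
        c = [ck + 2.0 * mu * xk * e.conjugate() for ck, xk in zip(c, buf)]
        yhat.append(y_est)
    return yhat, c

# Usage: identify a hypothetical FIR system from noisy observations.
random.seed(3)
h = [1.0, 0.5, -0.25]
x = [random.gauss(0.0, 1.0) for _ in range(3000)]
y = [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
     + random.gauss(0.0, 0.01) for n in range(len(x))]
yhat, c_hat = fir_lms(x, y, M=3, mu=0.01)
print([round(ck.real, 3) for ck in c_hat])
```

For real-valued data the conjugates have no effect, and the final coefficients should settle close to `h` with small residual fluctuations caused by the observation noise.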

10.4.2 Adaptation in a Stationary SOE

In the sequel, we study the stability and steady-state performance of the LMS algorithm in a stationary SOE; that is, we assume that the input and the desired response processes are jointly stationary. In theory, the goal of the LMS adaptive filter is to identify the optimum filter $\mathbf{c}_o = \mathbf{R}^{-1}\mathbf{d}$ from observations of the input $\mathbf{x}(n)$ and the desired response

$$y(n) = \mathbf{c}_o^H\mathbf{x}(n) + e_o(n) \tag{10.4.13}$$

The optimum error $e_o(n)$ is orthogonal to the vector $\mathbf{x}(n)$, that is, $E\{\mathbf{x}(n)e_o^*(n)\} = \mathbf{0}$, and acts as measurement or output noise, as shown in Figure 10.19.

FIGURE 10.19 LMS algorithm in a stationary SOE.

The first step in the statistical analysis of the LMS algorithm is to determine a difference equation for the coefficient error vector $\tilde{\mathbf{c}}(n)$. To this end, we subtract $\mathbf{c}_o$ from both sides of (10.4.5), to obtain

$$\tilde{\mathbf{c}}(n) = \tilde{\mathbf{c}}(n-1) + 2\mu\mathbf{x}(n)e^*(n) \tag{10.4.14}$$

which expresses the LMS algorithm in terms of the coefficient error vector. We next use (10.4.12) and (10.4.13) in (10.4.14) to eliminate $e(n)$ by expressing it in terms of $\tilde{\mathbf{c}}(n-1)$ and $e_o(n)$. The result is

$$\tilde{\mathbf{c}}(n) = [\mathbf{I} - 2\mu\mathbf{x}(n)\mathbf{x}^H(n)]\tilde{\mathbf{c}}(n-1) + 2\mu\mathbf{x}(n)e_o^*(n) \tag{10.4.15}$$

which is a time-varying forced or nonhomogeneous stochastic difference equation. The irreducible error $e_o(n)$ accounts for measurement noise, modeling errors, unmodeled dynamics, quantization effects, and other disturbances. The presence of $e_o(n)$ prevents convergence because it forces $\tilde{\mathbf{c}}(n)$ to fluctuate around zero. Therefore, the important issue is the BIBO stability of the system (10.4.15). From (10.2.28), we see that $\|\tilde{\mathbf{c}}(n)\|$ is bounded in mean square if we can show that $E\{\tilde{\mathbf{c}}(n)\} \to \mathbf{0}$ as $n \to \infty$ and $\operatorname{var}\{\tilde{c}_k(n)\}$ is bounded for all $n$. To this end, we develop difference equations for the mean value $E\{\tilde{\mathbf{c}}(n)\}$ and the correlation matrix

$$\boldsymbol{\Phi}(n) \triangleq E\{\tilde{\mathbf{c}}(n)\tilde{\mathbf{c}}^H(n)\} \tag{10.4.16}$$

of the coefficient error vector $\tilde{\mathbf{c}}(n)$. As we shall see, the MSD and the EMSE can be expressed in terms of matrices $\boldsymbol{\Phi}(n)$ and $\mathbf{R}$. The time evolution of these quantities provides sufficient information to evaluate the stability and steady-state performance of the LMS algorithm.

Convergence of the mean coefficient vector

If we take the expectation of (10.4.15), we have

$$E\{\tilde{\mathbf{c}}(n)\} = E\{\tilde{\mathbf{c}}(n-1)\} - 2\mu E\{\mathbf{x}(n)\mathbf{x}^H(n)\tilde{\mathbf{c}}(n-1)\} \tag{10.4.17}$$

because $E\{\mathbf{x}(n)e_o^*(n)\} = \mathbf{0}$ owing to the orthogonality principle. The computation of the second term in (10.4.17) requires the correlation between the input signal and the coefficient error vector. If we assume that $\mathbf{x}(n)$ and $\tilde{\mathbf{c}}(n-1)$ are statistically independent, (10.4.17) simplifies to

$$E\{\tilde{\mathbf{c}}(n)\} = (\mathbf{I} - 2\mu\mathbf{R})E\{\tilde{\mathbf{c}}(n-1)\} \tag{10.4.18}$$

which has the same form as (10.3.11) for the SDA. Therefore, $\tilde{\mathbf{c}}(n)$ converges in the MS sense, that is, $\lim_{n\to\infty} E\{\tilde{\mathbf{c}}(n)\} = \mathbf{0}$, if the eigenvalues of the system matrix $(\mathbf{I} - 2\mu\mathbf{R})$ are less than 1 in magnitude. Hence, if $\mathbf{R}$ is positive definite and $\lambda_{\max}$ is its maximum eigenvalue, the condition

$$0 < 2\mu < \frac{1}{\lambda_{\max}} \tag{10.4.19}$$

ensures that the LMS algorithm converges in the MS sense [see the discussion following (10.2.27)].
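The mean recursion (10.4.18) is deterministic, so its behavior is easy to examine directly. The following sketch assumes a real-valued 2 × 2 case; the matrix `R` (eigenvalues of about 1.82 and 0.18), the initial mean error, and the function names are hypothetical choices. It iterates (10.4.18) once with a step size satisfying (10.4.19) and once with a step size large enough to make one eigenvalue of $\mathbf{I} - 2\mu\mathbf{R}$ exceed 1 in magnitude.

```python
def mat_vec(A, v):
    """Multiply matrix A (list of rows) by vector v."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def mean_error_trajectory(R, c_err0, mu, n_steps):
    """Iterate E{c~(n)} = (I - 2 mu R) E{c~(n-1)}, equation (10.4.18)."""
    M = len(R)
    A = [[(1.0 if i == j else 0.0) - 2.0 * mu * R[i][j] for j in range(M)]
         for i in range(M)]
    traj = [list(c_err0)]
    for _ in range(n_steps):
        traj.append(mat_vec(A, traj[-1]))
    return traj

def norm(v):
    """Euclidean norm of a vector."""
    return sum(x * x for x in v) ** 0.5

R = [[1.0, 0.8182], [0.8182, 1.0]]   # hypothetical; eigenvalues 1.8182 and 0.1818
c_err0 = [-1.5955, 0.95]             # hypothetical initial mean error

# 2*mu = 0.4 < 1/lambda_max = 0.55, so (10.4.19) holds
stable = mean_error_trajectory(R, c_err0, mu=0.2, n_steps=200)

# 2*mu = 1.2, so |1 - 2*mu*lambda_max| = 1.18 > 1 and the mean diverges
unstable = mean_error_trajectory(R, c_err0, mu=0.6, n_steps=200)

print("stable run, final norm:  ", norm(stable[-1]))
print("unstable run, final norm:", norm(unstable[-1]))
```

The stable run decays at a rate set by the smallest eigenvalue, which is why a large eigenvalue spread slows convergence of the mean.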


Independence assumption. The independence assumption between $\mathbf{x}(n)$ and $\tilde{\mathbf{c}}(n-1)$ was critical to the derivation of (10.4.18). To simplify the analysis, we make the following independence assumptions (Gardner 1984):

A1 The sequence of input data vectors $\mathbf{x}(n)$ is independently and identically distributed with zero mean and correlation matrix $\mathbf{R}$.
A2 The sequences $\mathbf{x}(n)$ and $e_o(n)$ are independent for all $n$.

From (10.4.15), we see that $\tilde{\mathbf{c}}(n-1)$ depends on $\tilde{\mathbf{c}}(0)$, $\{\mathbf{x}(k)\}_0^{n-1}$, and $\{e_o(k)\}_0^{n-1}$. Since the sequence $\mathbf{x}(n)$ is IID and the quantities $\mathbf{x}(n)$ and $e_o(n)$ are independent, we conclude that $\mathbf{x}(n)$, $e_o(n)$, and $\tilde{\mathbf{c}}(n-1)$ are mutually independent. This result will be used several times to simplify the analysis of the LMS algorithm. The independence assumption A1, first introduced in Widrow et al. (1976) and in Mazo (1979), ignores the statistical dependence among successive input data vectors; however, it preserves sufficient statistical information about the adaptation process to lead to useful design guidelines. Clearly, for FIR filtering applications, the independence assumption is violated because two successive input data vectors $\mathbf{x}(n)$ and $\mathbf{x}(n+1)$ have $M - 1$ common elements (shift-invariance property).

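Evolution of the coefficient error correlation matrix †

The MSD can be expressed in terms of the trace of the correlation matrix $\boldsymbol{\Phi}(n)$, that is,

$$D(n) = \operatorname{tr}[\boldsymbol{\Phi}(n)] \tag{10.4.20}$$

which can be easily seen by using (10.2.29) and the definition of trace. If we postmultiply both sides of (10.4.15) by their respective Hermitian transposes and take the mathematical expectation, we obtain

$$\begin{aligned} \boldsymbol{\Phi}(n) = E\{\tilde{\mathbf{c}}(n)\tilde{\mathbf{c}}^H(n)\} &= E\{[\mathbf{I} - 2\mu\mathbf{x}(n)\mathbf{x}^H(n)]\tilde{\mathbf{c}}(n-1)\tilde{\mathbf{c}}^H(n-1)[\mathbf{I} - 2\mu\mathbf{x}(n)\mathbf{x}^H(n)]^H\} \\ &\quad + 2\mu E\{[\mathbf{I} - 2\mu\mathbf{x}(n)\mathbf{x}^H(n)]\tilde{\mathbf{c}}(n-1)e_o(n)\mathbf{x}^H(n)\} \\ &\quad + 2\mu E\{\mathbf{x}(n)e_o^*(n)\tilde{\mathbf{c}}^H(n-1)[\mathbf{I} - 2\mu\mathbf{x}(n)\mathbf{x}^H(n)]^H\} \\ &\quad + 4\mu^2 E\{\mathbf{x}(n)e_o^*(n)e_o(n)\mathbf{x}^H(n)\} \end{aligned} \tag{10.4.21}$$

From the independence assumptions, $e_o(n)$ is independent of $\tilde{\mathbf{c}}(n-1)$ and $\mathbf{x}(n)$. Therefore, the second and third terms in (10.4.21) vanish, and the fourth term is equal to $4\mu^2 P_o\mathbf{R}$. If we expand the first term, we obtain

$$\boldsymbol{\Phi}(n) = \boldsymbol{\Phi}(n-1) - 2\mu[\mathbf{R}\boldsymbol{\Phi}(n-1) + \boldsymbol{\Phi}(n-1)\mathbf{R}] + 4\mu^2\mathbf{A} + 4\mu^2 P_o\mathbf{R} \tag{10.4.22}$$

where

$$\mathbf{A} \triangleq E\{\mathbf{x}(n)\mathbf{x}^H(n)\tilde{\mathbf{c}}(n-1)\tilde{\mathbf{c}}^H(n-1)\mathbf{x}(n)\mathbf{x}^H(n)\} \tag{10.4.23}$$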

and the terms $\mathbf{R}\boldsymbol{\Phi}(n-1)$ and $\boldsymbol{\Phi}(n-1)\mathbf{R}$ have been computed by using the mutual independence of $\mathbf{x}(n)$, $\tilde{\mathbf{c}}(n-1)$, and $e_o(n)$. The computation of matrix $\mathbf{A}$ can be simplified if we make additional assumptions about the statistical properties of $\mathbf{x}(n)$. As shown in Gardner (1984), development of a recursive relation for the elements of $\boldsymbol{\Phi}(n)$ using only the independence assumptions requires products with, and the inversion of, an $M^2 \times M^2$ matrix, where $M$ is the size of $\mathbf{x}(n)$. The evaluation of this term when $x(n) \sim$ IID, an assumption that is more appropriate for data transmission applications, is discussed in Gardner (1984). The computation for $x(n)$ being a spherically invariant random process (SIRP) is discussed in Rupp (1993). SIRP models, which include the Gaussian distribution as a special case, provide a good characterization of speech signals. However, independently of the assumption used, the basic conclusions remain the same.

† Note that when (10.4.19) holds, $\lim_{n\to\infty} E\{\tilde{\mathbf{c}}(n)\} = \mathbf{0}$, and therefore $\boldsymbol{\Phi}(n)$ provides asymptotically the covariance of $\tilde{\mathbf{c}}(n)$.

Assuming that $\mathbf{x}(n)$ is normally distributed, that is, $\mathbf{x}(n) \sim \mathcal{N}(\mathbf{0}, \mathbf{R})$, a significant amount of simplification can be obtained. Indeed, in this case we can use the moment factorization property for normal random variables to express fourth-order moments in terms of second-order moments (Papoulis 1991). As we showed in Section 3.2.3, if $z_1$, $z_2$, $z_3$, and $z_4$ are complex-valued, zero-mean, and jointly distributed normal random variables, then

$$E\{z_1 z_2^* z_3 z_4^*\} = E\{z_1 z_2^*\}E\{z_3 z_4^*\} + E\{z_1 z_4^*\}E\{z_2^* z_3\} \tag{10.4.24}$$

or if they are real-valued, then

$$E\{z_1 z_2 z_3 z_4\} = E\{z_1 z_2\}E\{z_3 z_4\} + E\{z_1 z_3\}E\{z_2 z_4\} + E\{z_1 z_4\}E\{z_2 z_3\} \tag{10.4.25}$$

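Using direct substitution of (10.4.24) or (10.4.25) in (10.4.23), we can show that

$$\mathbf{A} = \begin{cases} \mathbf{R}\boldsymbol{\Phi}(n-1)\mathbf{R} + \mathbf{R}\operatorname{tr}[\mathbf{R}\boldsymbol{\Phi}(n-1)] & \text{complex case} \\ 2\mathbf{R}\boldsymbol{\Phi}(n-1)\mathbf{R} + \mathbf{R}\operatorname{tr}[\mathbf{R}\boldsymbol{\Phi}(n-1)] & \text{real case} \end{cases} \tag{10.4.26}$$

Finally, substituting (10.4.26) in (10.4.22), we obtain a difference equation for $\boldsymbol{\Phi}(n)$. This is summarized in the following property:

PROPERTY 10.4.1. Using the independence assumptions A1 and A2, and the normal distribution assumption of $\mathbf{x}(n)$, the correlation matrix of the coefficient error vector $\tilde{\mathbf{c}}(n)$ satisfies the difference equation

$$\boldsymbol{\Phi}(n) = \boldsymbol{\Phi}(n-1) - 2\mu[\mathbf{R}\boldsymbol{\Phi}(n-1) + \boldsymbol{\Phi}(n-1)\mathbf{R}] + 4\mu^2\mathbf{R}\boldsymbol{\Phi}(n-1)\mathbf{R} + 4\mu^2\mathbf{R}\operatorname{tr}[\mathbf{R}\boldsymbol{\Phi}(n-1)] + 4\mu^2 P_o\mathbf{R} \tag{10.4.27}$$

in the complex case and

$$\boldsymbol{\Phi}(n) = \boldsymbol{\Phi}(n-1) - 2\mu[\mathbf{R}\boldsymbol{\Phi}(n-1) + \boldsymbol{\Phi}(n-1)\mathbf{R}] + 8\mu^2\mathbf{R}\boldsymbol{\Phi}(n-1)\mathbf{R} + 4\mu^2\mathbf{R}\operatorname{tr}[\mathbf{R}\boldsymbol{\Phi}(n-1)] + 4\mu^2 P_o\mathbf{R} \tag{10.4.28}$$

in the real case. Both relations are matrix difference equations driven by the constant term $4\mu^2 P_o\mathbf{R}$.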
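The real-case recursion (10.4.28) can be iterated directly to see the effect of the driving term $4\mu^2 P_o\mathbf{R}$. The sketch below assumes a 2 × 2 real-valued problem; `R`, `mu`, `Po`, the initial matrix, and the helper names are hypothetical. It runs the recursion until it settles and then evaluates the trace of the product of the input correlation matrix and the settled coefficient error correlation matrix, which stays strictly positive because of the driving term.

```python
def mat_mul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def trace(A):
    """Sum of the diagonal elements."""
    return sum(A[i][i] for i in range(len(A)))

def phi_update_real(Phi, R, mu, Po):
    """One step of the real-case recursion (10.4.28)."""
    RP = mat_mul(R, Phi)
    PR = mat_mul(Phi, R)
    RPR = mat_mul(RP, R)
    t = trace(RP)
    n = len(R)
    return [[Phi[i][j]
             - 2.0 * mu * (RP[i][j] + PR[i][j])
             + 8.0 * mu ** 2 * RPR[i][j]
             + 4.0 * mu ** 2 * R[i][j] * t
             + 4.0 * mu ** 2 * Po * R[i][j]
             for j in range(n)] for i in range(n)]

R = [[1.0, 0.5], [0.5, 1.0]]     # hypothetical; eigenvalues 1.5 and 0.5
mu, Po = 0.02, 0.1               # hypothetical step size and optimum MSE
Phi = [[1.0, 0.0], [0.0, 1.0]]   # hypothetical initial correlation matrix

for _ in range(5000):
    Phi = phi_update_real(Phi, R, mu, Po)

excess_mse = trace(mat_mul(R, Phi))
print("steady-state Phi:", Phi)
print("tr[R Phi] at steady state:", excess_mse)
```

Setting `Po = 0.0` in the same sketch removes the driving term, and the iterates then decay toward the zero matrix.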

The presence of the term $4\mu^2 P_o\mathbf{R}$ in (10.4.27) or (10.4.28) implies that $\boldsymbol{\Phi}(n)$ will never become zero, and as a result the coefficients of the LMS adaptive filter will always fluctuate about their optimum settings, which prevents convergence. It has been shown (Bucklew et al. 1993) that asymptotically $\tilde{\mathbf{c}}(n)$ follows a zero-mean normal distribution. The amount of fluctuation is measured by matrix $\boldsymbol{\Phi}(n)$. In contrast, the absence of a driving term in (10.4.18) allows the convergence of $E\{\mathbf{c}(n)\}$ to the optimum vector $\mathbf{c}_o$.

Since there are two distinct forms for the difference equation of $\boldsymbol{\Phi}(n)$, we will consider the real case (10.4.28) for further discussion. Similar analysis can be done for the complex case (10.4.27), which is undertaken in Problem 10.11. To further simplify the analysis, we transform $\boldsymbol{\Phi}(n)$ to the principal coordinate space of $\mathbf{R}$ using the spectral decomposition $\mathbf{Q}^T\mathbf{R}\mathbf{Q} = \boldsymbol{\Lambda}$ by defining the matrix

$$\boldsymbol{\Theta}(n) \triangleq \mathbf{Q}^T\boldsymbol{\Phi}(n)\mathbf{Q} \tag{10.4.29}$$

which is symmetric and positive definite [when $\boldsymbol{\Phi}(n)$ is positive definite]. If we pre- and postmultiply (10.4.28) by $\mathbf{Q}^T$ and $\mathbf{Q}$ and use $\mathbf{Q}^T\mathbf{Q} = \mathbf{Q}\mathbf{Q}^T = \mathbf{I}$, we obtain

$$\boldsymbol{\Theta}(n) = \boldsymbol{\Theta}(n-1) - 2\mu[\boldsymbol{\Lambda}\boldsymbol{\Theta}(n-1) + \boldsymbol{\Theta}(n-1)\boldsymbol{\Lambda}] + 8\mu^2\boldsymbol{\Lambda}\boldsymbol{\Theta}(n-1)\boldsymbol{\Lambda} + 4\mu^2\boldsymbol{\Lambda}\operatorname{tr}[\boldsymbol{\Lambda}\boldsymbol{\Theta}(n-1)] + 4\mu^2 P_o\boldsymbol{\Lambda} \tag{10.4.30}$$

which is easier to work with because of the diagonal nature of $\boldsymbol{\Lambda}$. For any symmetric and positive definite matrix $\boldsymbol{\Theta}$, we have $|\theta_{ij}(n)|^2 \le \theta_{ii}(n)\theta_{jj}(n)$. Hence, the convergence of the diagonal elements ensures the convergence of the off-diagonal elements. This observation and (10.4.30) suggest that to analyze the LMS algorithm, we should extract from (10.4.30) the equations for the diagonal elements

$$\boldsymbol{\theta}(n) \triangleq [\theta_1(n)\ \theta_2(n)\ \cdots\ \theta_M(n)]^T \tag{10.4.31}$$

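of $\boldsymbol{\Theta}(n)$ and form a difference equation for the vector $\boldsymbol{\theta}(n)$. Indeed, we can easily show that

$$\boldsymbol{\theta}(n) = \mathbf{B}\boldsymbol{\theta}(n-1) + 4\mu^2 P_o\boldsymbol{\lambda} \tag{10.4.32}$$

where

$$\mathbf{B} \triangleq \boldsymbol{\Lambda}(\rho) + 4\mu^2\boldsymbol{\lambda}\boldsymbol{\lambda}^T \tag{10.4.33}$$

$$\boldsymbol{\lambda} \triangleq [\lambda_1\ \lambda_2\ \cdots\ \lambda_M]^T \tag{10.4.34}$$

$$\boldsymbol{\Lambda}(\rho) \triangleq \operatorname{diag}\{\rho_1, \rho_2, \ldots, \rho_M\} \tag{10.4.35}$$

$$\rho_k = 1 - 4\mu\lambda_k + 8\mu^2\lambda_k^2 = (1 - 2\mu\lambda_k)^2 + 4\mu^2\lambda_k^2 > 0 \qquad 1 \le k \le M \tag{10.4.36}$$

and $\lambda_k$ are the eigenvalues of $\mathbf{R}$. The solution of the vector difference equation (10.4.32) is

$$\boldsymbol{\theta}(n) = \mathbf{B}^n\boldsymbol{\theta}(0) + 4\mu^2 P_o\sum_{j=0}^{n-1}\mathbf{B}^j\boldsymbol{\lambda} \tag{10.4.37}$$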

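and can be easily found by recursion. The stability of the linear system (10.4.32) is determined by the eigenvalues of the symmetric matrix $\mathbf{B}$. Using (10.4.33) and (10.4.35), for an arbitrary vector $\mathbf{z}$, we obtain

$$\mathbf{z}^T\mathbf{B}\mathbf{z} = \mathbf{z}^T\boldsymbol{\Lambda}(\rho)\mathbf{z} + 4\mu^2(\boldsymbol{\lambda}^T\mathbf{z})^2 = \sum_{k=1}^{M}\rho_k z_k^2 + 4\mu^2(\boldsymbol{\lambda}^T\mathbf{z})^2 \tag{10.4.38}$$

where we have used (10.4.36). Hence (10.4.38), for $\mathbf{z} \neq \mathbf{0}$, implies that $\mathbf{z}^T\mathbf{B}\mathbf{z} > 0$, that is, the matrix $\mathbf{B}$ is positive definite. Since matrix $\mathbf{B}$ is symmetric and positive definite, its eigenvalues $\lambda_k(\mathbf{B})$ are real and positive. The system (10.4.37) will be BIBO stable if and only if

$$0 < \lambda_k(\mathbf{B}) < 1 \qquad 1 \le k \le M \tag{10.4.39}$$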

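To find the range of $\mu$ that ensures (10.4.39), we use the Gerschgorin circles theorem (Noble and Daniel 1988), which states that each eigenvalue of an $M \times M$ matrix $\mathbf{B}$ lies in at least one of the disks with center at the diagonal element $b_{kk}$ and radius equal to the sum of absolute values $|b_{kj}|$, $j \neq k$, of the remaining elements of the row. Since the elements of $\mathbf{B}$ are positive, we can easily see that

$$\lambda_k(\mathbf{B}) - b_{kk} < \sum_{\substack{j=1 \\ j \neq k}}^{M} b_{kj} \qquad\text{or}\qquad \lambda_k(\mathbf{B}) < \rho_k + 4\mu^2\lambda_k\sum_{j=1}^{M}\lambda_j$$

using (10.4.33). Hence using (10.4.36), we see the eigenvalues of $\mathbf{B}$ satisfy (10.4.39) if

$$1 - 4\mu\lambda_k + 8\mu^2\lambda_k^2 + 4\mu^2\lambda_k\operatorname{tr}\mathbf{R} < 1 \qquad\text{or}\qquad -\mu\lambda_k + 2\mu^2\lambda_k^2 + \mu^2\lambda_k\operatorname{tr}\mathbf{R} < 0$$

which implies that $\mu > 0$ and

$$2\mu < \frac{1}{\lambda_k + \operatorname{tr}\mathbf{R}} < \frac{1}{\operatorname{tr}\mathbf{R}}$$

because $\lambda_k > 0$ for all $k$. In conclusion, if the adaptation step $\mu$ satisfies the condition

$$0 < 2\mu < \frac{1}{\operatorname{tr}\mathbf{R}} \tag{10.4.40}$$

then the system (10.4.37) is stable and therefore the sequence $\boldsymbol{\theta}(n)$ converges.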

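PROPERTY 10.4.2. When the stability condition (10.4.40) holds, the solution (10.4.37) of the difference equation (10.4.32) can be written as

$$\boldsymbol{\theta}(n) = \mathbf{B}^n[\boldsymbol{\theta}(0) - \boldsymbol{\theta}(\infty)] + \boldsymbol{\theta}(\infty) \tag{10.4.41}$$

where $\boldsymbol{\theta}(0)$ is the initial value and $\boldsymbol{\theta}(\infty)$ is the steady-state value of $\boldsymbol{\theta}(n)$.

Proof. Using the identity

$$\sum_{j=0}^{n-1}\mathbf{B}^j = (\mathbf{I} - \mathbf{B}^n)(\mathbf{I} - \mathbf{B})^{-1} = (\mathbf{I} - \mathbf{B})^{-1} - \mathbf{B}^n(\mathbf{I} - \mathbf{B})^{-1}$$

the solution (10.4.37) can be written as

$$\boldsymbol{\theta}(n) = \mathbf{B}^n[\boldsymbol{\theta}(0) - 4\mu^2 P_o(\mathbf{I} - \mathbf{B})^{-1}\boldsymbol{\lambda}] + 4\mu^2 P_o(\mathbf{I} - \mathbf{B})^{-1}\boldsymbol{\lambda} \tag{10.4.42}$$

When the eigenvalues of $\mathbf{B}$ are inside the unit circle, we have

$$\lim_{n\to\infty}\boldsymbol{\theta}(n) \triangleq \boldsymbol{\theta}(\infty) = 4\mu^2 P_o(\mathbf{I} - \mathbf{B})^{-1}\boldsymbol{\lambda} \tag{10.4.43}$$

because the first term converges to zero. Substituting (10.4.43) in (10.4.42), we obtain (10.4.41).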
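Property 10.4.2 can be checked numerically for a small case. The sketch below assumes $M = 2$ and hypothetical values for the eigenvalues `lam`, the step size `mu`, the optimum MSE `Po`, and the initial vector `theta0`; the helper names are illustrative only. It builds $\mathbf{B}$ from (10.4.33), computes the steady-state vector from (10.4.43), and compares the direct recursion (10.4.32) with the closed form (10.4.41) at every step.

```python
def build_B(lam, mu):
    """B = diag(rho) + 4 mu^2 lam lam^T, from (10.4.33) and (10.4.36)."""
    M = len(lam)
    rho = [1.0 - 4.0 * mu * l + 8.0 * mu ** 2 * l ** 2 for l in lam]
    return [[(rho[i] if i == j else 0.0) + 4.0 * mu ** 2 * lam[i] * lam[j]
             for j in range(M)] for i in range(M)]

def mat_vec(A, v):
    """Multiply matrix A (list of rows) by vector v."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def solve_2x2(A, b):
    """Solve the 2x2 linear system A z = b by Cramer's rule."""
    (a11, a12), (a21, a22) = A
    det = a11 * a22 - a12 * a21
    return [(a22 * b[0] - a12 * b[1]) / det,
            (a11 * b[1] - a21 * b[0]) / det]

lam = [1.5, 0.5]          # hypothetical eigenvalues of R, so tr R = 2
mu, Po = 0.02, 0.1        # 2*mu = 0.04 < 1/tr R = 0.5, so (10.4.40) holds
B = build_B(lam, mu)
forcing = [4.0 * mu ** 2 * Po * l for l in lam]

# theta(inf) = 4 mu^2 Po (I - B)^{-1} lam, equation (10.4.43)
I_minus_B = [[(1.0 if i == j else 0.0) - B[i][j] for j in range(2)]
             for i in range(2)]
theta_inf = solve_2x2(I_minus_B, forcing)

theta0 = [1.0, 1.0]                                     # hypothetical theta(0)
theta = list(theta0)                                    # recursion (10.4.32)
dev = [t0 - ti for t0, ti in zip(theta0, theta_inf)]    # B^n [theta(0) - theta(inf)]
max_gap = 0.0
for n in range(1, 301):
    theta = [bt + f for bt, f in zip(mat_vec(B, theta), forcing)]
    dev = mat_vec(B, dev)
    closed = [dv + ti for dv, ti in zip(dev, theta_inf)]
    max_gap = max(max_gap, max(abs(a - b) for a, b in zip(theta, closed)))

print("theta(inf):", theta_inf)
print("largest gap between recursion and closed form:", max_gap)
```

The two computations should agree to within rounding error, since (10.4.41) is an exact rewriting of (10.4.37).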

Evolution of the mean square error

We next express the MSE as a function of $\boldsymbol{\lambda}$ and $\boldsymbol{\theta}$. Using (10.2.10) and (10.2.18), we have

$$e(n) = y(n) - \mathbf{c}^H(n-1)\mathbf{x}(n) = e_o(n) - \tilde{\mathbf{c}}^H(n-1)\mathbf{x}(n) \tag{10.4.44}$$

where $e_o(n)$ is the optimum filtering error and $\tilde{\mathbf{c}}(n)$ is the coefficient error vector. The (a priori) MSE of the adaptive filter at time $n$ is

$$\begin{aligned} P(n) \triangleq E\{|e(n)|^2\} &= E\{|e_o(n)|^2\} - E\{\tilde{\mathbf{c}}^H(n-1)\mathbf{x}(n)e_o^*(n)\} - E\{e_o(n)\mathbf{x}^H(n)\tilde{\mathbf{c}}(n-1)\} \\ &\quad + E\{\tilde{\mathbf{c}}^H(n-1)\mathbf{x}(n)\mathbf{x}^H(n)\tilde{\mathbf{c}}(n-1)\} \end{aligned} \tag{10.4.45}$$

Since $\tilde{\mathbf{c}}(n)$ is a random vector, the evaluation of the MSE (10.4.45) requires the correlation between $\mathbf{x}(n)$ and $\tilde{\mathbf{c}}(n-1)$. Using the independence assumptions A1 and A2, we see that the second and third terms in (10.4.45) become zero, as explained before, and the excess MSE is given by the last term

$$P_{\mathrm{ex}}(n) = E\{\tilde{\mathbf{c}}^H(n-1)\mathbf{x}(n)\mathbf{x}^H(n)\tilde{\mathbf{c}}(n-1)\} \tag{10.4.46}$$

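If we define the quantities

$$\mathbf{A} \triangleq \tilde{\mathbf{c}}^H(n-1) \qquad\text{and}\qquad \mathbf{B} \triangleq \mathbf{x}(n)\mathbf{x}^H(n)\tilde{\mathbf{c}}(n-1) \tag{10.4.47}$$

and notice that $\mathbf{A}\mathbf{B} = \operatorname{tr}(\mathbf{A}\mathbf{B})$ (because $\mathbf{A}\mathbf{B}$ is a scalar) and $\operatorname{tr}(\mathbf{A}\mathbf{B}) = \operatorname{tr}(\mathbf{B}\mathbf{A})$, we obtain

$$P_{\mathrm{ex}}(n) = E\{\operatorname{tr}(\mathbf{A}\mathbf{B})\} = E\{\operatorname{tr}(\mathbf{B}\mathbf{A})\} = \operatorname{tr}(E\{\mathbf{B}\mathbf{A}\}) = \operatorname{tr}\big(E\{\mathbf{x}(n)\mathbf{x}^H(n)\}E\{\tilde{\mathbf{c}}(n-1)\tilde{\mathbf{c}}^H(n-1)\}\big)$$

because expectation is a linear operation and $\mathbf{x}(n)$ and $\tilde{\mathbf{c}}(n-1)$ have been assumed statistically independent. Therefore, the excess MSE can be expressed as

$$P_{\mathrm{ex}}(n) = \operatorname{tr}[\mathbf{R}\boldsymbol{\Phi}(n-1)] \tag{10.4.48}$$

where $\boldsymbol{\Phi}(n) = E\{\tilde{\mathbf{c}}(n)\tilde{\mathbf{c}}^H(n)\}$ is the correlation matrix of the coefficient error vector. This expression simplifies to

$$P_{\mathrm{ex}}(n) = M\sigma_x^2\sigma_c^2 \tag{10.4.49}$$

if $\mathbf{R} = \sigma_x^2\mathbf{I}$ and $\boldsymbol{\Phi}(n) = \sigma_c^2\mathbf{I}$.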

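If $\mathbf{R}$ and $\boldsymbol{\Phi}(n)$ are both positive definite, relation (10.4.48) shows that $P_{\mathrm{ex}}(n) > 0$, that is, the MSE attained by the adaptive filter is larger than the optimum MSE $P_o$ of the optimum filter (cost of adaptation). Next we develop a difference equation for $P_{\mathrm{ex}}(n)$, using, for convenience, the principal coordinate system of the input correlation matrix $\mathbf{R}$. Since the trace of a matrix remains invariant under an orthogonal transformation, we have

$$P_{\mathrm{ex}}(n) = \operatorname{tr}[\mathbf{R}\boldsymbol{\Phi}(n)] = \operatorname{tr}[\boldsymbol{\Lambda}\boldsymbol{\Theta}(n)] = \boldsymbol{\lambda}^T\boldsymbol{\theta}(n) \tag{10.4.50}$$

where the elements of $\boldsymbol{\lambda}$ are the eigenvalues of $\mathbf{R}$ and the elements of $\boldsymbol{\theta}(n)$ are the diagonal elements of $\boldsymbol{\Theta}(n)$. Since the most often observable and important quantity for the operation of an adaptive filter is the MSE, we use our previous results to determine the value of MSE as a function of $n$, that is, the learning curve of the LMS adaptive filter. To this end, we use the orthogonal decomposition $\mathbf{B} = \mathbf{Q}(\mathbf{B})\boldsymbol{\Lambda}(\mathbf{B})\mathbf{Q}^H(\mathbf{B})$ to express $\mathbf{B}^n$ as

$$\mathbf{B}^n = \mathbf{Q}(\mathbf{B})\boldsymbol{\Lambda}^n(\mathbf{B})\mathbf{Q}^H(\mathbf{B}) = \sum_{k=1}^{M}\lambda_k^n(\mathbf{B})\,\mathbf{q}_k(\mathbf{B})\mathbf{q}_k^H(\mathbf{B}) \tag{10.4.51}$$

where $\lambda_k(\mathbf{B})$ are the eigenvalues and $\mathbf{q}_k(\mathbf{B})$ are the eigenvectors of matrix $\mathbf{B}$. Substituting (10.4.41) and (10.4.51) into (10.4.50) and recalling that $P(n) = P_o + P_{\mathrm{ex}}(n)$, we obtain

$$P(n) = P_o + P_{\mathrm{tr}}(n) + P_{\mathrm{ex}}(\infty) \tag{10.4.52}$$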

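where $P_{\mathrm{ex}}(\infty)$ is termed the steady-state excess MSE and

$$P_{\mathrm{tr}}(n) \triangleq \sum_{k=1}^{M}\gamma_k(\mathbf{R}, \mathbf{B})\,\lambda_k^n(\mathbf{B}) \tag{10.4.53}$$

is termed the transient MSE because it dies out exponentially when $0 < \lambda_k(\mathbf{B}) < 1$, $1 \le k \le M$. The constants

$$\gamma_k(\mathbf{R}, \mathbf{B}) \triangleq \boldsymbol{\lambda}^T(\mathbf{R})\,\mathbf{q}_k(\mathbf{B})\mathbf{q}_k^H(\mathbf{B})[\boldsymbol{\theta}(0) - \boldsymbol{\theta}(\infty)] \tag{10.4.54}$$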

are determined by the eigenvalues $\lambda_k(\mathbf{R})$ of matrix $\mathbf{R}$ and the eigenvectors $\mathbf{q}_k(\mathbf{B})$ of matrix $\mathbf{B}$. Since the minimum MSE $P_o$ is available, we need to determine the steady-state excess MSE $P_{\mathrm{ex}}(\infty)$.

PROPERTY 10.4.3. When the LMS adaptive algorithm converges, the steady-state excess MSE is given by

$$P_{\mathrm{ex}}(\infty) = P_o\,\frac{C(\mu)}{1 - C(\mu)} \tag{10.4.55}$$

where

$$C(\mu) \triangleq \sum_{k=1}^{M}\frac{\mu\lambda_k}{1 - 2\mu\lambda_k} \tag{10.4.56}$$

and $\lambda_k$ are the eigenvalues of the input correlation matrix.

Proof. Using (10.4.32) and (10.4.35), we obtain the difference equation

$$\theta_k(n) = \rho_k\theta_k(n-1) + 4\mu^2\lambda_k P_{\mathrm{ex}}(n-1) + 4\mu^2 P_o\lambda_k \tag{10.4.57}$$

When (10.4.40) holds, (10.4.57) attains the following steady-state form

$$\theta_k(\infty) = \rho_k\theta_k(\infty) + 4\mu^2\lambda_k P_{\mathrm{ex}}(\infty) + 4\mu^2 P_o\lambda_k$$

whose solution, in conjunction with (10.4.36), gives

$$\theta_k(\infty) = \mu\,\frac{P_o + P_{\mathrm{ex}}(\infty)}{1 - 2\mu\lambda_k}$$

and

$$P_{\mathrm{ex}}(\infty) = \sum_{k=1}^{M}\lambda_k\theta_k(\infty) = [P_o + P_{\mathrm{ex}}(\infty)]\sum_{k=1}^{M}\frac{\mu\lambda_k}{1 - 2\mu\lambda_k}$$

Solving the last equation for $P_{\mathrm{ex}}(\infty)$, we obtain (10.4.55) and (10.4.56).
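As a numerical cross-check of Property 10.4.3, the closed form (10.4.55) and (10.4.56) can be compared with the value reached by iterating the scalar recursions (10.4.57). The sketch below uses hypothetical eigenvalues, step size, optimum MSE, and initial values; the function names are illustrative only.

```python
def steady_state_excess_mse(lam, mu, Po):
    """Closed form (10.4.55) with C(mu) from (10.4.56)."""
    C = sum(mu * l / (1.0 - 2.0 * mu * l) for l in lam)
    if not 0.0 < C < 1.0:
        raise ValueError("C(mu) must lie in (0, 1) for a finite steady state")
    return Po * C / (1.0 - C)

def iterate_excess_mse(lam, mu, Po, theta0, n_steps):
    """Iterate (10.4.57), using Pex(n-1) = sum_k lam_k theta_k(n-1)."""
    theta = list(theta0)
    for _ in range(n_steps):
        pex = sum(l * t for l, t in zip(lam, theta))
        theta = [(1.0 - 4.0 * mu * l + 8.0 * mu ** 2 * l ** 2) * t
                 + 4.0 * mu ** 2 * l * (pex + Po)
                 for l, t in zip(lam, theta)]
    return sum(l * t for l, t in zip(lam, theta))

lam = [1.8182, 0.1818]     # hypothetical eigenvalues (spread of about 10)
mu, Po = 0.04, 0.0322      # hypothetical step size and optimum MSE

pex_formula = steady_state_excess_mse(lam, mu, Po)
pex_iterated = iterate_excess_mse(lam, mu, Po, theta0=[1.0, 1.0], n_steps=20000)

print("closed form:", pex_formula)
print("iterated:   ", pex_iterated)
print("small-step approximation mu*Po*trR:", mu * Po * sum(lam))
```

The last line prints the small-step approximation $\mu P_o\operatorname{tr}\mathbf{R}$ for comparison; it slightly underestimates the exact value because it drops the denominators in (10.4.55) and (10.4.56).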

Solving (10.4.55) for $C(\mu)$ gives

$$C(\mu) = \frac{P_{\mathrm{ex}}(\infty)}{P_o + P_{\mathrm{ex}}(\infty)} \tag{10.4.58}$$

which implies that

$$0 < C(\mu) < 1 \tag{10.4.59}$$

because $P_o$ and $P_{\mathrm{ex}}(\infty)$ are positive quantities. It has been shown that (10.4.59) leads to the tighter bound $0 < 2\mu < 2/(3\operatorname{tr}\mathbf{R})$ for the adaptation step $\mu$ (Horowitz and Senne 1981; Feuer and Weinstein 1985). Therefore, convergence in the MSE imposes a stronger constraint on the step size $\mu$ than does (10.4.40), which ensures convergence in the mean.

10.4.3 Summary and Design Guidelines

There are many theoretical and simulation analyses of the LMS adaptive algorithm under a variety of assumptions. In this book, we have focused on results that help us to understand its operation and performance and to develop design guidelines for its practical application. The operation and performance of the LMS adaptive filter are determined by its stability and the properties of its learning curve, which shows the evolution of the MSE as a function of time. The MSE produced by the LMS adaptive algorithm consists of three components [see (10.4.52)]

$$P(n) = P_o + P_{\mathrm{tr}}(n) + P_{\mathrm{ex}}(\infty)$$

where $P_o$ is the optimum MSE, $P_{\mathrm{tr}}(n)$ is the transient MSE, and $P_{\mathrm{ex}}(\infty)$ is the steady-state excess MSE. This equation provides the basis for understanding and evaluating the operation of the LMS adaptive algorithm in a stationary SOE. For convenience, the LMS adaptive filtering algorithm is summarized in Table 10.3.

TABLE 10.3 Summary of the LMS algorithm.

Design parameters
    $\mathbf{x}(n)$ = input data vector at time $n$
    $y(n)$ = desired response at time $n$
    $\mathbf{c}(n)$ = filter coefficient vector at time $n$
    $M$ = number of coefficients
    $\mu$ = step-size parameter, $0 < \mu \ll 1 \big/ \sum_{k=1}^{M} E\{|x_k(n)|^2\}$

Initialization
    $\mathbf{c}(-1) = \mathbf{x}(-1) = \mathbf{0}$

Computation
    For $n = 0, 1, 2, \ldots$, compute
    $\hat{y}(n) = \mathbf{c}^H(n-1)\mathbf{x}(n)$
    $e(n) = y(n) - \hat{y}(n)$
    $\mathbf{c}(n) = \mathbf{c}(n-1) + 2\mu\mathbf{x}(n)e^*(n)$

Stability. The LMS adaptive filter converges in the mean-square sense, that is, the transient MSE dies out, if the adaptation step $\mu$ satisfies the condition

$$0 < 2\mu < \frac{K}{\operatorname{tr}\mathbf{R}} \tag{10.4.60}$$

where $\operatorname{tr}\mathbf{R}$ is the trace of the input correlation matrix and $K$ is a constant that depends weakly on the statistics of the input data vector. For example, when $\mathbf{x}(n) \sim \mathcal{N}(\mathbf{0}, \mathbf{R})$, we proved that $K = 1$ or $\tfrac{2}{3}$. In addition, this condition ensures that on average the LMS adaptive filter converges to the optimum filter. We stress that in most practical applications, where the independence assumption does not hold, the step size $\mu$ should be much smaller than $K/\operatorname{tr}\mathbf{R}$. Therefore, the exact value of $K$ is not important in practice.

Rate of convergence. The transient MSE dies out exponentially without exhibiting any oscillations. This follows from (10.4.53) because when $\mu$ satisfies (10.4.40), the eigenvalues of matrix $\mathbf{B}$ are positive and less than 1. The settling time, that is, the time taken for the transients to die out, is proportional to the average time constant

$$\tau_{\mathrm{lms,av}} = \frac{1}{\mu\lambda_{\mathrm{av}}} \tag{10.4.61}$$

where $\lambda_{\mathrm{av}} = \big(\sum_{k=1}^{M}\lambda_k\big)/M$ is the average eigenvalue of $\mathbf{R}$ (Widrow et al. 1976). The quantity $P_{\mathrm{tr}}^{\mathrm{total}} = \sum_{n=0}^{\infty} P_{\mathrm{tr}}(n)$, which provides the total transient MSE, can be used as a measure for the speed of adaptation. When $\mu\lambda_k \ll 1$ (see Problem 10.12), we have

$$P_{\mathrm{tr}}^{\mathrm{total}} = \sum_{n=0}^{\infty} P_{\mathrm{tr}}(n) \simeq \frac{1}{4\mu}\sum_{k=1}^{M}\theta_k(0) \tag{10.4.62}$$

where $\theta_k(0)$ is the initial distance of a coefficient from its optimum setting measured in principal coordinates. As is intuitively expected, the smaller the step size and the farther the initial coefficients are from their optimum settings, the more iterations it takes for the LMS algorithm to converge. Furthermore, from the discussion in Section 10.3, it follows that the LMS algorithm will converge faster if the contours of the error surface are circles, that is, when the input correlation matrix is $\mathbf{R} = \sigma_x^2\mathbf{I}$.

Steady-state excess MSE. The excess MSE after the adaptation has been completed (i.e., the steady-state value) is given by (10.4.55). When $\mu\lambda_k \ll 1$, we may approximate (10.4.55) as follows

$$P_{\mathrm{ex}}(\infty) \simeq P_o\,\frac{\mu\operatorname{tr}\mathbf{R}}{1 - \mu\operatorname{tr}\mathbf{R}}$$

which allows a much easier interpretation. Solving for $\mu\operatorname{tr}\mathbf{R}$, we obtain $\mu\operatorname{tr}\mathbf{R} \simeq P_{\mathrm{ex}}(\infty)/[P_{\mathrm{ex}}(\infty) + P_o]$, which implies that $0 < \mu\operatorname{tr}\mathbf{R} < 1$. Since $\mu\operatorname{tr}\mathbf{R} \ll 1$, we often use the approximation

$$P_{\mathrm{ex}}(\infty) \simeq \mu P_o\operatorname{tr}\mathbf{R} \tag{10.4.63}$$

which implies that $P_{\mathrm{ex}}(\infty) \ll P_o$, that is, for small values of the step size the excess MSE is much smaller than the optimum MSE. Note that the presence of the irreducible error $e_o(n)$ prevents perfect adaptation as $n \to \infty$ because $P_o > 0$.

Speed versus quality of adaptation. From the previous discussion we see that there is a tradeoff between rate of convergence (speed of adaptation) and steady-state excess MSE (quality of adaptation, or accuracy of the adaptive filter). The first requirement for an adaptive filter is stability, which is ensured by choosing $\mu$ to satisfy (10.4.60). Within this range, decreasing $\mu$ to reduce the misadjustment to a desired level, according to (10.4.63), decreases the speed of convergence; see (10.4.62). Conversely, increasing $\mu$ to increase the speed of convergence results in an increase in misadjustment. This tradeoff between speed of convergence and misadjustment is a fundamental feature of the LMS algorithm.

FIR filters. In this case, the input is a stationary process $x(n)$ with a Toeplitz correlation matrix $\mathbf{R}$. Therefore, we have

$$\operatorname{tr}\mathbf{R} = Mr(0) = ME\{|x(n)|^2\} = MP_x \tag{10.4.64}$$

where $MP_x$ is called the tap input power. Substituting (10.4.64) into (10.4.40), we obtain

$$0 < 2\mu < \frac{1}{MP_x} = \frac{1}{\text{tap input power}} \tag{10.4.65}$$

which shows that the selection of the step size depends on the input power. Using (10.4.63) and (10.4.64), we see that the misadjustment $\mathcal{M}$ is given by

$$\mathcal{M} \triangleq \frac{P_{\mathrm{ex}}(\infty)}{P_o} \simeq \mu MP_x \tag{10.4.66}$$

which shows that for given $M$ and $P_x$ the value of misadjustment is proportional to $\mu$. We emphasize that the misadjustment provides a measure of how close an LMS adaptive filter is to the corresponding optimum filter. The statistical properties of the SOE, that is, the correlation of the input signal and the cross-correlation between input and desired response signals, play a key role in the performance of the LMS adaptive filter.

• First, we should make sure that the relation between $x(n)$ and $y(n)$ can be accurately modeled by a linear FIR filter with $M$ coefficients. Inadequacy of the FIR structure, output observation noise, or lack of correlation between $x(n)$ and $y(n)$ increases the magnitude of the irreducible error. If $M$ is very large, we may want to use a pole-zero IIR filter (Shynk 1989; Treichler et al. 1987). If the relationship between $x(n)$ and $y(n)$ is nonlinear, we certainly need a nonlinear filtering structure (Mathews 1991).
• The LMS algorithm uses a “noisy” instantaneous estimate of the gradient vector. When the correlation between input and desired response is weak, the algorithm should take more cautious steps (“wait and average”). Such algorithms update their coefficients every $L$ samples, using all samples between successive updates to determine the gradient (gradient averaging).
• The eigenvalue structure of $\mathbf{R}$, as measured by its eigenvalue spread ($\lambda_{\max}/\lambda_{\min}$) or equivalently by the spectral flatness measure (SFM) (see Section 4.1), has a strong effect on the rate of convergence of the LMS algorithm. In general, the rate of convergence decreases as the eigenvalue spread increases, that is, as the contours of the cost function become more elliptical, or equivalently as the input spectrum becomes more nonwhite.
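The FIR design relations (10.4.61) and (10.4.64) to (10.4.66) can be combined into a small step-size calculator that makes the speed versus quality tradeoff concrete. This is a sketch under the small-step approximations of this section; the function name `lms_design` and the numerical values are hypothetical.

```python
def lms_design(M, Px, misadjustment):
    """Choose mu for an M-tap FIR LMS filter from a target misadjustment.

    Uses misadjustment ~ mu * M * Px from (10.4.66), checks the stability
    bound 0 < 2 mu < 1/(M Px) from (10.4.65), and returns the average time
    constant 1/(mu * lambda_av) from (10.4.61), where lambda_av = tr R / M = Px.
    """
    mu = misadjustment / (M * Px)
    bound = 1.0 / (M * Px)
    if not 0.0 < 2.0 * mu < bound:
        raise ValueError("target misadjustment violates 0 < 2*mu < 1/(M*Px)")
    tau_av = 1.0 / (mu * Px)
    return mu, tau_av

# Hypothetical design: 16 taps, input power 2.0
mu_10, tau_10 = lms_design(M=16, Px=2.0, misadjustment=0.10)
mu_05, tau_05 = lms_design(M=16, Px=2.0, misadjustment=0.05)

print("10% misadjustment: mu =", mu_10, " time constant =", tau_10)
print(" 5% misadjustment: mu =", mu_05, " time constant =", tau_05)
```

Halving the target misadjustment halves the step size and doubles the average time constant, which is the tradeoff described above. Under this approximation the stability check reduces to requiring a target misadjustment below 0.5.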

Normalized LMS algorithm. According to (10.4.60), the selection of $\mu$ in practical applications is complicated because the power of the input signal either is unknown or varies with time. This problem can be addressed by using the normalized LMS (NLMS) algorithm [see (10.4.11)]

$$\mathbf{c}(n) = \mathbf{c}(n-1) + \frac{\tilde{\mu}}{E_M(n)}\,\mathbf{x}(n)e^*(n) \tag{10.4.67}$$

where $E_M(n) = \|\mathbf{x}(n)\|^2$ and $0 < \tilde{\mu} < 1$. It can be shown that the NLMS algorithm converges in the mean square if $0 < \tilde{\mu} < 1$ (Rupp 1993; Slock 1993), which makes the selection of the step size $\tilde{\mu}$ much easier than the selection of $\mu$ in the LMS algorithm. For FIR filters, the quantity $E_M(n)$ provides an estimate of $ME\{|x(n)|^2\}$ and can be computed recursively by using the sliding-window formula

$$E_M(n) = E_M(n-1) + |x(n)|^2 - |x(n-M)|^2 \tag{10.4.68}$$

where $E_M(-1) = 0$, or by using a first-order recursive filter estimator. In practice, to avoid division by zero when $\mathbf{x}(n) = \mathbf{0}$, we set $E_M(n) = \delta + \|\mathbf{x}(n)\|^2$, where $\delta$ is a small positive constant.

Other approaches and analyses. The analysis of the LMS algorithm presented in this section is simple, clarifies its performance, and provides useful design guidelines. However, there are many other approaches, which are beyond the scope of this book, that differ in terms of complexity, accuracy, and objectives. Major efforts to remove the independence assumption and replace it with the more realistic statistically dependent input assumption are documented in Macchi (1995), Solo (1997), and Butterweck (1995) and the references therein. Convergence analysis of the LMS algorithm using the stochastic approximation approach and a deterministic approach using the method of ordinary differential equations are discussed in Solo and Kong (1995), Sethares (1993), and Benveniste et al. (1987). Other types of analyses deal with the determination of the probability densities and the probability of large excursions of the adaptive filter coefficients for various types of input signals (Rupp 1995). The analysis of the convergence properties of the LMS algorithm and its variations is still an active area of research, and new results appear continuously.

10.4.4 Applications of the LMS Algorithm

We now discuss three practical applications in which the LMS algorithm has made a significant impact. In the first case, we consider the previously discussed linear prediction problem and compare the performance of the LMS algorithm with that of the SDA. Table 10.4 provides a summary of the key differences between the SDA and the LMS algorithms. In the second case, we study echo cancelation in full-duplex data transmission, which employs the LMS algorithm in its implementation.
In the third case, we discuss the application of adaptive equalization, which is used to minimize intersymbol interference (ISI) in a dispersive channel environment.

TABLE 10.4 Comparison between the SDA and LMS algorithms.

SDA                                                     LMS
Deterministic algorithm:                                Stochastic algorithm:
  $\lim_{n\to\infty} \mathbf{c}(n) = \mathbf{c}_o$        $\lim_{n\to\infty} E\{\mathbf{c}(n)\} = \mathbf{c}_o$
If it converges, it terminates at $\mathbf{c}_o$        If it converges, it fluctuates about $\mathbf{c}_o$;
                                                          the size of the fluctuations is proportional to µ
Noiseless gradient estimate                             Noisy gradient estimate
Deterministic steps                                     Random steps

We can only compare the ensemble average behavior of LMS with the SDA.
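The contrast summarized in Table 10.4 can be illustrated with a minimal sketch for a single real coefficient. Everything in it is an assumption made for illustration rather than part of the text: the function names, the scalar model y(n) = c_o x(n) + v(n) with unit-variance white x(n), the numerical values, and the SDA recursion written as c ← c + 2µ(d − rc) so that its step matches the 2µ convention of the LMS update.

```python
import random


def sda_scalar(r, d, mu, n_iter, c0=0.0):
    """Steepest descent with the exact gradient: c <- c + 2*mu*(d - r*c)."""
    c = c0
    path = [c]
    for _ in range(n_iter):
        c += 2.0 * mu * (d - r * c)   # deterministic step
        path.append(c)
    return path


def lms_scalar(x, y, mu, c0=0.0):
    """LMS with the instantaneous gradient estimate: c <- c + 2*mu*e(n)*x(n)."""
    c = c0
    path = [c]
    for xn, yn in zip(x, y):
        e = yn - c * xn               # a priori error
        c += 2.0 * mu * e * xn        # random step driven by the data
        path.append(c)
    return path


def tail_variance(path, skip):
    """Sample variance of the coefficient after discarding the first `skip` values."""
    tail = path[skip:]
    mean = sum(tail) / len(tail)
    return sum((v - mean) ** 2 for v in tail) / len(tail)


# Hypothetical test signal: y(n) = C_O*x(n) + v(n), x(n) white with unit variance.
C_O = 0.8
rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(6000)]
y = [C_O * xn + rng.gauss(0.0, 0.5) for xn in x]

# For this model r = E{x^2} = 1 and d = E{x*y} = C_O, so the SDA needs no data.
sda_path = sda_scalar(r=1.0, d=C_O, mu=0.05, n_iter=500)
lms_fast = lms_scalar(x, y, mu=0.05)
lms_slow = lms_scalar(x, y, mu=0.005)

print("SDA final value:       ", sda_path[-1])
print("LMS tail var, mu=0.05: ", tail_variance(lms_fast, 1000))
print("LMS tail var, mu=0.005:", tail_variance(lms_slow, 1000))
```

Repeating the SDA run reproduces the same path exactly, whereas each LMS path depends on the data it sees. In this sketch the tail variance of the LMS coefficient should be noticeably larger for the larger step size, in line with the fluctuation entry of the table.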

Linear prediction

In Example 10.3.1, the AR(2) model given in (10.3.25) was considered, and the SDA was used to determine the corresponding linear predictor coefficients. We also analyzed the performance of the SDA. In the following example, we perform a similar acquisition of predictor coefficients using the LMS algorithm, and we study the effects of the eigenvalue spread of the input correlation matrix on the convergence of the LMS adaptive algorithm when it is used to update the coefficients.

EXAMPLE 10.4.1. The second-order system in (10.3.25), which generates the signal x(n), is repeated here:

$$x(n) + a_1 x(n-1) + a_2 x(n-2) = w(n)$$

where $w(n) \sim \mathrm{WGN}(0, \sigma_w^2)$ and where the coefficients are selected from Table 10.2 for two different eigenvalue spreads. A Gaussian pseudorandom number generator was used to obtain 1000 realizations of x(n) using each set of parameter values given in Table 10.2. These sample realizations were used for statistical analysis. The second-order LMS adaptive predictor with coefficients $\mathbf{c}(n) = [c_1(n)\ c_2(n)]^T$ is given by [see (10.4.12)]

$$e(n) = x(n) - c_1(n-1)\,x(n-1) - c_2(n-1)\,x(n-2) \qquad n \ge 0$$
$$c_1(n) = c_1(n-1) + 2\mu\, e(n)\, x(n-1)$$
$$c_2(n) = c_2(n-1) + 2\mu\, e(n)\, x(n-2)$$

where µ is the step-size parameter. The adaptive predictor was initialized by setting x(−1) = x(−2) = 0 and c1(−1) = c2(−1) = 0. The above adaptive predictor was implemented with µ = 0.04, and the predictor coefficients as well as the MSE were recorded for each realization. These quantities were averaged to study the behavior of the LMS algorithm. These calculations were repeated for µ = 0.01.

In Figure 10.20 we show several plots obtained for X(R) = 1.22. In plot (a) we show the ensemble averaged trajectory $\{\mathbf{c}(n)\}_{n=0}^{150}$ superimposed on the MSE contours. A trajectory of a single realization is also shown to illustrate its randomness. In plot (b) the c(n) learning curve for the averaged value as well as for one single realization is shown. In plot (c) the corresponding learning curves for the MSE are depicted. Finally, in plot (d) we show the effect of step size µ on the MSE learning curve. Similar plots are shown in Figure 10.21 for X(R) = 10.

FIGURE 10.20 Performance curves for the LMS used in the linear prediction problem with step-size parameter µ = 0.04 and eigenvalue spread X(R) = 1.22: (a) averaged trajectory in the (c1, c2) plane; (b) c(n) learning curve (coefficients versus number of iterations n); (c) MSE P(n) learning curve; (d) step-size effect on MSE (µ = 0.01 and µ = 0.04). The coefficient axis marks c_o,1 = 0.195 and c_o,2 = −0.95; the P(n) axis marks the level 0.0965.


FIGURE 10.21 Performance curves for the LMS used in the linear prediction problem with step-size parameter µ = 0.04 and eigenvalue spread X(R) = 10: (a) averaged trajectory in the (c1, c2) plane; (b) c(n) learning curve (coefficients versus number of iterations n); (c) MSE P(n) learning curve; (d) step-size effect on MSE (µ = 0.01 and µ = 0.04). The coefficient axis marks c_o,1 = 1.5955 and c_o,2 = −0.95; the P(n) axis marks the level 0.0322.

Several observations can be made from these plots:

• The trajectories and the learning curves for a single realization are clearly random or “noisy,” while the averaging over the ensemble clearly has a smoothing effect.
• The averaged quantities (coefficients and the MSE) converge to the true values, and this convergence rate is in accordance with theory.
• The rate of convergence of the LMS algorithm depends on the step size µ. The smaller the step size, the slower the rate.
• The rate of convergence also depends on the eigenvalue spread X(R). The larger the spread, the slower the rate. For X(R) = 1.22, the algorithm converges in about 150 steps, while for X(R) = 10 it requires about 500 steps.

Clearly these observations compare well with the theory.
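The dependence on the step size seen in these observations is the practical difficulty that the NLMS recursion (10.4.67) is meant to ease. The following is a minimal sketch for real-valued signals, combining (10.4.67) with the sliding-window energy (10.4.68) and the δ safeguard against division by zero. The function name `nlms`, the toy system-identification setup (a hypothetical three-tap system whose input power jumps by a factor of 100 halfway through), and all numerical values are assumptions for illustration; they do not come from Example 10.4.1.

```python
import random
from collections import deque


def nlms(x, y, num_taps, mu_tilde, delta=1e-6):
    """Real-valued NLMS: c(n) = c(n-1) + mu_tilde / (delta + E_M(n)) * x(n) * e(n)."""
    c = [0.0] * num_taps
    # Data vector [x(n), x(n-1), ..., x(n-M+1)], with zeros before n = 0.
    window = deque([0.0] * num_taps, maxlen=num_taps)
    energy = 0.0                          # E_M(-1) = 0
    errors = []
    for xn, yn in zip(x, y):
        oldest = window[-1]               # x(n-M), about to leave the window
        window.appendleft(xn)             # maxlen discards the oldest sample
        # Sliding-window update (10.4.68); the clamp guards against rounding
        # error and is an addition of this sketch.
        energy = max(energy + xn * xn - oldest * oldest, 0.0)
        e = yn - sum(ck * xk for ck, xk in zip(c, window))
        step = mu_tilde / (delta + energy)
        c = [ck + step * e * xk for ck, xk in zip(c, window)]
        errors.append(e)                  # for real data e*(n) = e(n)
    return c, errors


# Hypothetical setup: the input standard deviation jumps from 1 to 10 at n = 1000.
rng = random.Random(2)
H_TRUE = [0.5, -0.3, 0.2]
x = [rng.gauss(0.0, 1.0 if n < 1000 else 10.0) for n in range(2000)]
y = []
for n in range(len(x)):
    clean = sum(h * x[n - k] for k, h in enumerate(H_TRUE) if n - k >= 0)
    y.append(clean + rng.gauss(0.0, 0.01))

c_hat, errors = nlms(x, y, num_taps=3, mu_tilde=0.5)
print("estimated taps:", c_hat)
```

Because the step is divided by the current input energy, the same value of `mu_tilde` is used before and after the power change. An unnormalized LMS step tuned for the first half would have to be retuned for the second.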

Echo cancelation in full-duplex data transmission

Figure 10.22 illustrates a system that achieves simultaneous data transmission in both directions (full-duplex) over two-wire circuits using the special two-wire to four-wire interfaces (called hybrid couplers) that exist in any telephone set. Although the hybrid couplers are designed to provide perfect isolation between transmitters and receivers, this is not the case in practical systems. As a result, (1) one part of the transmitted signal leaks through the near-end hybrid to its own receiver (near-end echo), and (2) another part is reflected by the far-end hybrid and ends up at its own receiver (far-end echo). The combined echo signal, which can be 30 dB stronger than the signal received from the other end, increases the number of errors. We note that in cont