
EE731 Lecture Notes: Matrix Computations for Signal Processing

INFO

The original document is EE731 Lecture Notes: Matrix Computations for Signal Processing.

James P. Reilly (c)

Department of Electrical and Computer Engineering, McMaster University

September 13, 2004

0 Preface

This collection of ten chapters of notes will give the reader an introduction to the fundamental principles of linear algebra for application in many disciplines of modern engineering and science, including signal processing, control theory, process control, applied statistics, robotics, etc. We assume the reader has an equivalent background to a freshman course in linear algebra, some introduction to probability and statistics, and a basic knowledge of the Fourier transform.

In the first chapter, some fundamental ideas required for the remaining portion of the course are established. First, we look at some fundamental ideas of linear algebra such as linear independence, subspaces, rank, nullspace, range, etc., and how these concepts are interrelated. The idea of autocorrelation, and the covariance matrix of a signal, are then discussed and interpreted.

In chapter 2, the most basic matrix decomposition, the so-called eigendecomposition, is presented. The focus of the presentation is to give an intuitive insight into what this decomposition accomplishes. We illustrate how the eigendecomposition can be applied through the Karhunen-Loeve transform. In this way, the reader is made familiar with the important properties of this decomposition. The Karhunen-Loeve transform is then generalized to the broader idea of transform coding.

In chapter 3, we develop the singular value decomposition (SVD), which is closely related to the eigendecomposition of a matrix. We develop the relationships between these two decompositions and explore various properties of the SVD.

Chapter 4 deals with the quadratic form and its relation to the eigendecomposition, and also gives an introduction to error mechanisms in floating point number systems. The condition number of a matrix, which is a critical part in determining a lower bound on the relative error in the solution of a system of linear equations, is also developed.

Chapters 5 and 6 deal with solving linear systems of equations by Gaussian elimination. The Gaussian elimination process is described through a bigger-block matrix approach, that leads to other useful decompositions, such as the Cholesky decomposition of a square symmetric matrix.

Chapters 7-10 deal with solving least-squares problems. The standard least squares problem and its solution are developed in Chapter 7. In Chapter 8, we develop a generalized "pseudoinverse" approach to solving the least-squares problem. The QR decomposition is developed in Chapter 9, and its application to the solution of linear least squares problems is discussed in Chapter 10.

Finally, in Chapter 11, the solution of Toeplitz systems of equations and its underlying theory is developed.

1 Fundamental Concepts

The purpose of this lecture is to review important fundamental concepts in linear algebra, as a foundation for the rest of the course. We first discuss the fundamental building blocks, such as an overview of matrix multiplication from a “big block” perspective, linear independence, subspaces and related ideas, rank, etc., upon which the rigor of linear algebra rests. We then discuss vector norms, and various interpretations of the matrix multiplication operation. We close the chapter with a discussion on determinants.

1.1 Notation

Throughout this course, we shall indicate that a matrix $A$ is of dimension $m \times n$, with elements taken from the set of real numbers, by the notation $A \in \mathbb{R}^{m \times n}$. This means that the matrix $A$ belongs to the Cartesian product of the real numbers, taken $m \times n$ times, one for each element of $A$. In a similar way, the notation $A \in \mathbb{C}^{m \times n}$ means the matrix is of dimension $m \times n$, and the elements are taken from the set of complex numbers. By the matrix dimension "$m \times n$", we mean $A$ consists of $m$ rows and $n$ columns.

Similarly, the notation $a \in \mathbb{R}^m$ ($\mathbb{C}^m$) implies a vector of dimension $m$ whose elements are taken from the set of real (complex) numbers. By "dimension of a vector", we mean its length, i.e., that it consists of $m$ elements.

Also, we shall indicate that a scalar $a$ is from the set of real (complex) numbers by the notation $a \in \mathbb{R}$ ($\mathbb{C}$). Thus, an upper case bold character denotes a matrix, a lower case bold character denotes a vector, and a lower case non-bold character denotes a scalar.

By convention, a vector by default is taken to be a column vector. Further, for a matrix $A$, we denote its $i$th column as $a_i$. We also imply that its $j$th row is $a_j^T$, even though this notation may be ambiguous, since it may also be taken to mean the transpose of the $j$th column. The context of the discussion will help to resolve the ambiguity.

1.2 “Bigger-Block” Interpretations of Matrix Multiplication

Let us define the matrix product C as

(1) $C_{m \times n} = A_{m \times k}\, B_{k \times n}$

The three interpretations of this operation now follow:

1.2.1 Inner-Product Representation

If $a$ and $b$ are column vectors of the same length, then the scalar quantity $a^T b$ is referred to as the inner product of $a$ and $b$. If we define $a_i^T \in \mathbb{R}^k$ as the $i$th row of $A$ and $b_j \in \mathbb{R}^k$ as the $j$th column of $B$, then the element $c_{ij}$ of $C$ is defined as the inner product $a_i^T b_j$. This is the conventional small-block representation of matrix multiplication.

1.2.2 Column Representation

This is the next bigger–block view of matrix multiplication. Here we look at forming the product one column at a time. The $j$th column $c_j$ of $C$ may be expressed as a linear combination of the columns $a_i$ of $A$, with coefficients which are the elements of the $j$th column of $B$. Thus,

(2) $c_j = \sum_{i=1}^{k} a_i b_{ij}, \qquad j = 1, \ldots, n.$

This operation is identical to the inner–product representation above, except we form the product one column at a time. For example, if we evaluate only the $p$th element of the $j$th column $c_j$, we see that (2) degenerates into $\sum_{i=1}^{k} a_{pi} b_{ij}$. This is the inner product of the $p$th row of $A$ and the $j$th column of $B$, which is the required expression for the $(p,j)$th element of $C$.

1.2.3 Outer–Product Representation

This is the largest–block representation. Let us define a column vector $a \in \mathbb{R}^m$ and a row vector $b^T \in \mathbb{R}^n$. Then the outer product of $a$ and $b$ is an $m \times n$ matrix of rank one, defined as $ab^T$.

Now let $a_i$ and $b_i^T$ be the $i$th column of $A$ and the $i$th row of $B$ respectively. Then the product $C$ may also be expressed as

(3) $C = \sum_{i=1}^{k} a_i b_i^T.$

By looking at this operation one column at a time, we see this form of matrix multiplication performs exactly the same operations as the column representation above. For example, the $j$th column $c_j$ of the product is determined from (3) to be $c_j = \sum_{i=1}^{k} a_i b_{ij}$, which is identical to (2) above.
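As a quick numerical check (this sketch is not part of the original notes), the following minimal numpy example forms the same product $C = AB$ three ways, mirroring the inner-product, column, and outer-product representations above; the random matrices are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(0)
m, k, n = 4, 3, 5
A = rng.standard_normal((m, k))
B = rng.standard_normal((k, n))

# Inner-product form: c_ij = (i-th row of A) . (j-th column of B)
C_inner = np.array([[A[i, :] @ B[:, j] for j in range(n)] for i in range(m)])

# Column form (eq. (2)): the j-th column of C is a linear combination of the
# columns of A, with coefficients taken from the j-th column of B
C_col = np.column_stack([sum(A[:, i] * B[i, j] for i in range(k)) for j in range(n)])

# Outer-product form (eq. (3)): C is a sum of k rank-one matrices
C_outer = sum(np.outer(A[:, i], B[i, :]) for i in range(k))

assert np.allclose(C_inner, A @ B)
assert np.allclose(C_col, A @ B)
assert np.allclose(C_outer, A @ B)
```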

1.2.4 Matrix Pre– and Post–Multiplication

Let us now look at some fundamental ideas distinguishing matrix pre– and post–multiplication. In this respect, consider a matrix $A$ pre–multiplied by $B$ to give $Y = BA$. (All matrices are assumed to have conformable dimensions.) Then we can interpret this multiplication as $B$ operating on the columns of $A$ to give the columns of the product. This follows because each column $y_i$ of the product is a transformed version of the corresponding column of $A$; i.e., $y_i = B a_i$, $i = 1, \ldots, n$. Likewise, let us consider $A$ post–multiplied by a matrix $C$ to give $X = AC$. Then, we interpret this multiplication as $C$ operating on the rows of $A$, because each row $x_j^T$ of the product is a transformed version of the corresponding row of $A$; i.e., $x_j^T = a_j^T C$, $j = 1, \ldots, m$, where we define $a_j^T$ as the $j$th row of $A$.

Example:

  • Consider an orthonormal matrix Q of appropriate dimension. We know that multiplication by an orthonormal matrix results in a rotation operation. The operation QA rotates each column of A. The operation AQ rotates each row.

There is another way to interpret pre– and post–multiplication. Again consider the matrix $A$ pre–multiplied by $B$ to give $Y = BA$. Then according to (2), the $j$th column $y_j$ of $Y$ is a linear combination of the columns of $B$, whose coefficients are the $j$th column of $A$. Likewise, for $X = AB$, we can say that the $i$th row $x_i^T$ of $X$ is a linear combination of the rows of $B$, whose coefficients are the $i$th row of $A$.

Either of these interpretations is equally valid. Being comfortable with the representations of this section is a big step in mastering the field of linear algebra.

1.3 Fundamental Linear Algebra

1.3.1 Linear Independence

Suppose we have a set of $n$ $m$-dimensional vectors $\{a_1, \ldots, a_n\}$, where $a_i \in \mathbb{R}^m$, $i = 1, \ldots, n$. This set is linearly independent under the conditions [1]

(4) $\sum_{j=1}^{n} c_j a_j = 0 \quad \text{if and only if} \quad c_1 = \cdots = c_n = 0$

In words:

Eq. (4) means that a set of vectors is linearly independent if and only if the only linear combination of the vectors equal to zero is the one whose coefficients are all zero.

A set of n vectors is linearly independent if an n–dimensional space may be formed by taking all possible linear combinations of the vectors. If the dimension of the space is less than n, then the vectors are linearly dependent. The concept of a vector space and the dimension of a vector space is made more precise later.

Note that a set of vectors $\{a_1, \ldots, a_n\}$, where $n > m$, cannot be linearly independent.

Example 1

(5) $A = [a_1\ a_2\ a_3] = \begin{bmatrix} 1 & 2 & 1 \\ 0 & 3 & 1 \\ 0 & 0 & 1 \end{bmatrix}$

This set is linearly independent. On the other hand, the set

(6) $B = [b_1\ b_2\ b_3] = \begin{bmatrix} 1 & 2 & 3 \\ 0 & 3 & 3 \\ 0 & 0 & 0 \end{bmatrix}$

is not. This follows because the third column is a linear combination of the first two: 1 times the first column plus 1 times the second equals the third column. Thus, the coefficients $c_j$ in (4) resulting in zero are any scalar multiple of $(1, 1, -1)$.
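A minimal numpy sketch (added here for illustration, not from the original notes) that verifies the claims of Example 1 numerically:

```python
import numpy as np

A = np.array([[1, 2, 1],
              [0, 3, 1],
              [0, 0, 1]])
B = np.array([[1, 2, 3],
              [0, 3, 3],
              [0, 0, 0]])

print(np.linalg.matrix_rank(A))   # 3: the columns of A are linearly independent
print(np.linalg.matrix_rank(B))   # 2: the columns of B are linearly dependent

# The dependence b1 + b2 - b3 = 0, i.e. coefficients proportional to (1, 1, -1)
c = np.array([1, 1, -1])
print(B @ c)                      # [0 0 0]
```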

1.3.2 Span, Range, and Subspaces

In this section, we explore these three closely-related ideas. In fact, their mathematical definitions are almost the same, but the interpretation is different for each case.

Span: The span of a vector set $[a_1, \ldots, a_n]$, written as $\operatorname{span}[a_1, \ldots, a_n]$, where $a_i \in \mathbb{R}^m$, is the set of points mapped by

(7) $\operatorname{span}[a_1, \ldots, a_n] = \left\{ y \in \mathbb{R}^m \mid y = \sum_{j=1}^{n} c_j a_j,\ c_j \in \mathbb{R} \right\}.$

In other words, $\operatorname{span}[a_1, \ldots, a_n]$ is the set of all possible linear combinations of the vectors $a_i$. If the vectors are linearly independent, then the dimension of this set of linear combinations is $n$. If the vectors are linearly dependent, then the dimension is less.

The set of vectors in a span is referred to as a vector space. The dimension of a vector space is the number of linearly independent vectors in the linear combination which forms the space. Note that the vector space dimension is not the dimension (length) of the vectors forming the linear combinations.

Example 2: Consider the following 2 vectors in Fig. 1: The span of these vectors is the (infinite extension of the) plane of the paper.

Subspaces: Given a set (space) of vectors $[a_1, \ldots, a_n]$, $a_i \in \mathbb{R}^m$, $m \ge n$, a subspace $S$ is a vector subset that satisfies two requirements:

  1. If x and y are in the subspace, then x+y is still in the subspace.
  2. If we multiply any vector x in the subspace by a scalar c, then cx is still in the subspace.

These two requirements imply that for a subspace, any linear combination of vectors which are in the subspace is itself in the subspace. Comparing this idea with that of span, we see a subspace defined by the vectors $[a_1, \ldots, a_n]$ is identical to $\operatorname{span}[a_1, \ldots, a_n]$. However, a subspace has the interpretation that the set of vectors comprising the subspace must be a subset of a larger space. For example, the vectors $[a_1, a_2]$ in Fig. 1 define a subspace (the plane of the paper) which is a subset of the three–dimensional universe $\mathbb{R}^3$.

Hence formally, a $k$–dimensional subspace $S$ of $\operatorname{span}[a_1, \ldots, a_n]$ is determined by $\operatorname{span}[a_{i_1}, \ldots, a_{i_k}]$, where the distinct indices satisfy $\{i_1, \ldots, i_k\} \subseteq \{1, \ldots, n\}$; that is, the vector space $S = \operatorname{span}[a_{i_1}, \ldots, a_{i_k}]$ is a subset of $\operatorname{span}[a_1, \ldots, a_n]$.

Note that $[a_{i_1}, \ldots, a_{i_k}]$ is not necessarily a basis for the subspace $S$. This set is a basis only if it is a maximally independent set. This idea is discussed shortly. The set $\{a_i\}$ need not be linearly independent to define the span or subspace.

∗ What is the span of the vectors $[b_1, b_2, b_3]$ in Example 1?

Range: The range of a matrix $A \in \mathbb{R}^{m \times n}$, denoted $R(A)$, is a subspace (set of vectors) satisfying

(8) $R(A) = \{ y \in \mathbb{R}^m \mid y = Ax, \text{ for } x \in \mathbb{R}^n \}.$

We can interpret the matrix–vector multiplication $y = Ax$ above according to the column representation for matrix multiplication (2), where the product $C$ has only one column. Thus, we see that $y$ is a linear combination of the columns $a_i$ of $A$, whose coefficients are the elements $x_i$ of $x$. Therefore, (8) is equivalent to (7), and $R(A)$ is thus the span of the columns of $A$. The distinction between range and span is that the argument of range is a matrix, while for span it is a set of vectors. If the columns of $A$ are (not) linearly independent, then $R(A)$ will (not) span $n$ dimensions. Thus, the dimension of the vector space $R(A)$ is less than or equal to $n$. Any vector $y \in R(A)$ is of dimension (length) $m$.

Example 3:

(9) $A = \begin{bmatrix} 1 & 5 & 3 \\ 2 & 4 & 3 \\ 3 & 3 & 3 \end{bmatrix} \qquad \text{(the last column is the average of the first two)}$

R(A) is the set of all linear combinations of any two columns of A.

In the case when $n < m$ (i.e., $A$ is a tall matrix), it is important to note that $R(A)$ is indeed a subspace of the $m$-dimensional "universe" $\mathbb{R}^m$. In this case, the dimension of $R(A)$ is less than or equal to $n$. Thus, $R(A)$ does not span the whole universe, and therefore is a subspace of it.

1.3.3 Maximally Independent Set

This is a vector set which cannot be made larger without losing independence, nor smaller while remaining maximal; i.e., it is a set containing the maximum number of independent vectors spanning the space.

1.3.4 A Basis

A basis for a subspace is any maximally independent set within the subspace. It is not unique.

Example 4. A basis for the subspace $S$ spanned by the first 2 columns of

$A = \begin{bmatrix} 1 & 2 & 3 \\ 0 & 3 & 3 \\ 0 & 0 & 3 \end{bmatrix}, \quad \text{i.e., } S = \operatorname{span}\left\{ \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 2 \\ 3 \\ 0 \end{bmatrix} \right\}$

is

$e_1 = (1, 0, 0)^T, \qquad e_2 = (0, 1, 0)^T.$

[2] or any other linearly independent set in $\operatorname{span}[e_1, e_2]$.

Any vector in S is uniquely represented as a linear combination of the basis vectors.

1.3.5 Orthogonal Complement Subspace

If we have a subspace $S$ of dimension $n$ consisting of vectors $[a_1, \ldots, a_n]$, $a_i \in \mathbb{R}^m$, $i = 1, \ldots, n$, for $n \le m$, the orthogonal complement subspace $S^\perp$ of $S$, of dimension $m - n$, is defined as

(10) $S^\perp = \{ y \in \mathbb{R}^m \mid y^T x = 0 \text{ for all } x \in S \}$

i.e., any vector in $S^\perp$ is orthogonal to any vector in $S$. The quantity $S^\perp$ is pronounced "S–perp".

Example 5: Take the vector set defining S from Example 4:

(11) $S = \operatorname{span}\begin{bmatrix} 1 & 2 \\ 0 & 3 \\ 0 & 0 \end{bmatrix}$

then, a basis for $S^\perp$ is

(12) $\begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}$

1.3.6 Rank

Rank is an important concept which we will use frequently throughout this course. We briefly describe only a few basic features of rank here. The idea is expanded more fully in the following sections.

  1. The rank of a matrix is the maximum number of linearly independent rows or columns. Thus, it is the dimension of a basis for the columns (rows) of a matrix.
  2. The rank of $A$ (denoted $\operatorname{rank}(A)$) is the dimension of $R(A)$.
  3. If $A = BC$, and $r_1 = \operatorname{rank}(B)$, $r_2 = \operatorname{rank}(C)$, then $\operatorname{rank}(A) \le \min(r_1, r_2)$.
  4. A matrix $A \in \mathbb{R}^{m \times n}$ is said to be rank deficient if its rank is less than $\min(m, n)$. Otherwise, it is said to be full rank.
  5. If $A$ is square and rank deficient, then $\det(A) = 0$.
  6. It can be shown that $\operatorname{rank}(A) = \operatorname{rank}(A^T)$. More is said on this point later.

A matrix is said to be full column (row) rank if its rank is equal to the number of columns (rows).

Example: The rank of A in Example 4 is 3, whereas the rank of A in Example 3 is 2.

1.3.7 Null Space of A

The null space N(A) of A is defined as

(13) $N(A) = \{ x \neq 0 \in \mathbb{R}^n \mid Ax = 0 \}.$

From previous discussions, the product $Ax$ is a linear combination of the columns $a_i$ of $A$, where the elements $x_i$ of $x$ are the corresponding coefficients. Thus, from (13), $N(A)$ is the set of non–zero coefficients of all zero linear combinations of the columns of $A$. If the columns of $A$ are linearly independent, then $N(A) = \emptyset$ by definition, because there can be no coefficients except zero which result in a zero linear combination. In this case, the dimension of the null space is zero, and $A$ is full column rank. The null space is empty if and only if $A$ is full column rank, and is non–empty when $A$ is column rank deficient.[3] Note that any vector in $N(A)$ is of dimension $n$. Any vector in $N(A)$ is orthogonal to the rows of $A$, and is thus in the orthogonal complement of the span of the rows of $A$.

Example 6: Let $A$ be as before in Example 3. Then $N(A) = c\,(1, 1, -2)^T$, where $c$ is a real constant.

A further example is as follows. Take 3 vectors $[a_1, a_2, a_3]$, where $a_i \in \mathbb{R}^3$, $i = 1, \ldots, 3$, that are constrained to lie in a 2–dimensional plane. Then there exists a zero linear combination of these vectors. The coefficients of this linear combination define a vector $x$ which is in the nullspace of $A = [a_1, a_2, a_3]$. In this case, we see that $A$ is rank deficient.

Another important characterization of a matrix is its nullity. The nullity of A is the dimension of the nullspace of A. In Example 6 above, the nullity of A is one. We then have the following interesting property:

(14) $\operatorname{rank}(A) + \operatorname{nullity}(A) = n.$
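As an illustration (not part of the original notes), the following numpy sketch checks the rank, the nullspace, and the rank-plus-nullity relation (14) for the matrix of Examples 3 and 6; the nullspace is extracted from the singular vectors associated with the (numerically) zero singular values.

```python
import numpy as np

A = np.array([[1, 5, 3],
              [2, 4, 3],
              [3, 3, 3]], dtype=float)

r = np.linalg.matrix_rank(A)          # 2
# Rows of Vt associated with zero singular values span N(A)
_, s, Vt = np.linalg.svd(A)
null_basis = Vt[s < 1e-10]
nullity = null_basis.shape[0]         # 1

print(r + nullity)                    # 3 = n  (rank-nullity, eq. (14))
print(null_basis / null_basis[0, 0])  # proportional to (1, 1, -2)
```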

1.4 Four Fundamental Subspaces of a Matrix

The four matrix subspaces of concern are: the column space, the row space, and their respective orthogonal complements. The development of these four subspaces is closely linked to $N(A)$ and $R(A)$. We assume for this section that $A \in \mathbb{R}^{m \times n}$ and $r \le \min(m, n)$, where $r = \operatorname{rank}(A)$.

1.4.1 The Column Space

This is simply R(A). Its dimension is r. It is the set of all linear combinations of the columns of A.

1.4.2 The Orthogonal Complement of the Column Space

This may be expressed as $R(A)^\perp$, with dimension $m - r$. It may be shown to be equivalent to $N(A^T)$, as follows. By definition, $N(A^T)$ is the set of vectors $x$ satisfying:

(15) $A^T \begin{bmatrix} x_1 \\ \vdots \\ x_m \end{bmatrix} = 0,$

where the columns of $A$ are the rows of $A^T$. From (15), we see that $N(A^T)$ is the set of $x \in \mathbb{R}^m$ which is orthogonal to all columns of $A$ (rows of $A^T$). This by definition is the orthogonal complement of $R(A)$.

1.4.3 The Row Space

The row space is defined simply as $R(A^T)$, with dimension $r$. The row space is the range of the rows of $A$, or the subspace spanned by the rows, or the set of all possible linear combinations of the rows of $A$.

1.4.4 The Orthogonal Complement of the Row Space

This may be denoted as $R(A^T)^\perp$. Its dimension is $n - r$. This set must be that which is orthogonal to all rows of $A$; i.e., for $x$ to be in this space, $x$ must satisfy

(16) $\underbrace{A}_{\text{rows of } A} \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix} = 0.$

Thus, the set of $x$ satisfying (16), which is the orthogonal complement of the row space, is simply $N(A)$.

We have noted before that $\operatorname{rank}(A) = \operatorname{rank}(A^T)$. Thus, the dimensions of the row and column subspaces are equal. This is surprising, because it implies the number of linearly independent rows of a matrix is the same as the number of linearly independent columns. This holds regardless of the size or rank of the matrix. It is not an intuitively obvious fact, and there is no immediately obvious reason why this should be so. Nevertheless, the rank of a matrix is the number of independent rows or columns.

1.5 Vector Norms

A vector norm is a means of expressing the length or distance associated with a vector. A norm on a vector space $\mathbb{R}^n$ is a function $f$ which maps a point in $\mathbb{R}^n$ into a point in $\mathbb{R}$. Formally, this is stated mathematically as $f: \mathbb{R}^n \to \mathbb{R}$. The norm has the following properties:

  1. $f(x) \ge 0$ for all $x \in \mathbb{R}^n$.
  2. $f(x) = 0$ if and only if $x = 0$.
  3. $f(x + y) \le f(x) + f(y)$ for $x, y \in \mathbb{R}^n$.
  4. $f(ax) = |a| f(x)$ for $a \in \mathbb{R}$, $x \in \mathbb{R}^n$.

We denote the function $f(x)$ as $\|x\|$.

The p-norms: This is a useful class of norms, generalizing on the idea of the Euclidean norm. They are defined by

(17) $\|x\|_p = \left( |x_1|^p + |x_2|^p + \cdots + |x_n|^p \right)^{1/p}.$

If $p = 1$:

$\|x\|_1 = \sum_i |x_i|$

which is simply the sum of the absolute values of the elements.

If $p = 2$:

$\|x\|_2 = \left( \sum_i x_i^2 \right)^{1/2} = (x^T x)^{1/2}$

which is the familiar Euclidean norm.

If $p = \infty$:

$\|x\|_\infty = \max_i |x_i|$

which is the element of $x$ with largest absolute value. This may be shown in the following way. As $p \to \infty$, the largest term within the brackets in (17) dominates all the others. Therefore (17) may be written as

(18) $\|x\|_\infty = \lim_{p \to \infty} \left[ \sum_{i=1}^{n} |x_i|^p \right]^{1/p} = \lim_{p \to \infty} \left[ |x_k|^p \right]^{1/p} = |x_k|$

where $k$ is the index corresponding to the element $x_i$ of largest absolute value.

Note that the $p = 2$ norm has many useful properties, but is expensive to compute. Obviously, the 1– and $\infty$–norms are easier to compute, but are more difficult to deal with algebraically. All the p–norms obey all the properties of a vector norm.
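A small numpy illustration (added here, not from the notes) of the 1-, 2-, and $\infty$-norms, and of the limiting behaviour in (18); the example vector is arbitrary.

```python
import numpy as np

x = np.array([3.0, -4.0, 1.0])

print(np.linalg.norm(x, 1))       # 8.0   : sum of absolute values
print(np.linalg.norm(x, 2))       # ~5.099: Euclidean norm, sqrt(x^T x)
print(np.linalg.norm(x, np.inf))  # 4.0   : largest absolute element

# The infinity-norm as the limit of p-norms (eq. (18))
for p in (1, 2, 10, 100):
    print(p, np.linalg.norm(x, p))
```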

1.6 Determinants

Consider a square matrix $A \in \mathbb{R}^{m \times m}$. We can define the matrix $A_{ij}$ as the submatrix obtained from $A$ by deleting the $i$th row and $j$th column of $A$. The scalar number $\det(A_{ij})$ (where $\det(\cdot)$ denotes determinant) is called the minor associated with the element $a_{ij}$ of $A$. The signed minor $c_{ij} \triangleq (-1)^{i+j} \det(A_{ij})$ is called the cofactor of $a_{ij}$.

The determinant of A is the m-dimensional volume contained within the columns (rows) of A. This interpretation of determinant is very useful as we see shortly. The determinant of a matrix may be evaluated by the expression

(19) $\det(A) = \sum_{j=1}^{m} a_{ij} c_{ij}, \qquad i \in (1, \ldots, m),$

or

(20) $\det(A) = \sum_{i=1}^{m} a_{ij} c_{ij}, \qquad j \in (1, \ldots, m).$

Both the above are referred to as the cofactor expansion of the determinant. Eq. (19) is along the ith row of A, whereas (20) is along the jth column. It is indeed interesting to note that both versions above give exactly the same number, regardless of the value of i or j.

Eqs. (19) and (20) express the $m \times m$ determinant $\det A$ in terms of the cofactors $c_{ij}$ of $A$, which are themselves $(m-1) \times (m-1)$ determinants. Thus, $m - 1$ recursions of (19) or (20) will finally yield the determinant of the $m \times m$ matrix $A$.

From (19) it is evident that if A is triangular, then det(A) is the product of the main diagonal elements. Since diagonal matrices are in the upper triangular set, then the determinant of a diagonal matrix is also the product of its diagonal elements.

Properties of Determinants: Before we begin this discussion, let us define the volume of the parallelepiped defined by the set of column vectors comprising a matrix as the principal volume of that matrix.

We have the following properties of determinants, which are stated without proof:

  1. $\det(AB) = \det(A)\det(B)$, $A, B \in \mathbb{R}^{m \times m}$. The principal volume of the product of matrices is the product of the principal volumes of each matrix.
  2. $\det(A) = \det(A^T)$. This property shows that the characteristic polynomials[4] of $A$ and $A^T$ are identical. Consequently, as we see later, the eigenvalues of $A^T$ and $A$ are identical.
  3. $\det(cA) = c^m \det(A)$, $c \in \mathbb{R}$, $A \in \mathbb{R}^{m \times m}$. This is a reflection of the fact that if each vector defining the principal volume is multiplied by $c$, then the resulting volume is multiplied by $c^m$.
  4. $\det(A) = 0$ if and only if $A$ is singular. This implies that at least one dimension of the principal volume of the corresponding matrix has collapsed to zero length.
  5. $\det(A) = \prod_{i=1}^{m} \lambda_i$, where the $\lambda_i$ are the eigen (singular) values of $A$. This means the parallelepiped defined by the column or row vectors of a matrix may be transformed into a regular rectangular solid of the same $m$–dimensional volume whose edges have lengths corresponding to the eigen (singular) values of the matrix.
  6. The determinant of an orthonormal[5] matrix is $\pm 1$. This is easy to see, because the vectors of an orthonormal matrix are all of unit length and mutually orthogonal. Therefore the corresponding principal volume is $\pm 1$.
  7. If $A$ is nonsingular, then $\det(A^{-1}) = [\det(A)]^{-1}$.
  8. If $B$ is nonsingular, then $\det(B^{-1}AB) = \det(A)$.
  9. If $B$ is obtained from $A$ by interchanging any two rows (or columns), then $\det(B) = -\det(A)$.
  10. If $B$ is obtained from $A$ by adding a scalar multiple of one row to another (or a scalar multiple of one column to another), then $\det(B) = \det(A)$.

A further property of determinants allows us to compute the inverse of $A$. Define the matrix $\tilde{A}$ as the adjoint of $A$:

(21) $\tilde{A} = \begin{bmatrix} c_{11} & \cdots & c_{1m} \\ \vdots & & \vdots \\ c_{m1} & \cdots & c_{mm} \end{bmatrix}^T$

where the $c_{ij}$ are the cofactors of $A$. According to (19) or (20), the $i$th row $\tilde{a}_i^T$ of $\tilde{A}$ times the $i$th column $a_i$ of $A$ is $\det(A)$; i.e.,

(22) $\tilde{a}_i^T a_i = \det(A), \qquad i = 1, \ldots, m.$

It can also be shown that

(23) $\tilde{a}_i^T a_j = 0, \qquad i \neq j.$

Then, combining (22) and (23) for $i, j \in \{1, \ldots, m\}$, we have the following interesting property:

(24) $\tilde{A} A = \det(A)\, I,$

where $I$ is the $m \times m$ identity matrix. It then follows from (24) that the inverse $A^{-1}$ of $A$ is given as

(25) $A^{-1} = [\det(A)]^{-1} \tilde{A}.$

Neither (19) nor (25) is a computationally efficient way of calculating a determinant or an inverse, respectively. Better methods which exploit the properties of various matrix decompositions are made evident later in the course.
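For illustration only (this sketch is not from the notes and, as just stated, is far from an efficient method), the cofactor expansion (19) and the adjugate formula (25) can be coded directly; the test matrix below is an arbitrary example.

```python
import numpy as np

def minor(A, i, j):
    """Submatrix of A with row i and column j deleted."""
    return np.delete(np.delete(A, i, axis=0), j, axis=1)

def det_cofactor(A):
    """Determinant by cofactor expansion along the first row (eq. (19)).
    Factorial cost -- illustrative only, not a practical algorithm."""
    m = A.shape[0]
    if m == 1:
        return A[0, 0]
    return sum((-1) ** j * A[0, j] * det_cofactor(minor(A, 0, j)) for j in range(m))

def inverse_adjugate(A):
    """Inverse via the adjoint (adjugate): A^{-1} = [det(A)]^{-1} A~ (eq. (25))."""
    m = A.shape[0]
    C = np.array([[(-1) ** (i + j) * det_cofactor(minor(A, i, j))
                   for j in range(m)] for i in range(m)])
    return C.T / det_cofactor(A)

A = np.array([[4.0, 1.0, 2.0],
              [1.0, 3.0, 0.0],
              [2.0, 0.0, 5.0]])
assert np.isclose(det_cofactor(A), np.linalg.det(A))
assert np.allclose(inverse_adjugate(A), np.linalg.inv(A))
```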

2 Lecture 2

This lecture discusses eigenvalues and eigenvectors in the context of the Karhunen–Loeve (KL) expansion of a random process. First, we discuss the fundamentals of eigenvalues and eigenvectors, then go on to covariance matrices. These two topics are then combined into the K-L expansion. An example from the field of array signal processing is given as an application of algebraic ideas.

A major aim of this presentation is an attempt to de-mystify the concepts of eigenvalues and eigenvectors by showing a very important application in the field of signal processing.

2.1 Eigenvalues and Eigenvectors

Suppose we have a matrix A:

(1) $A = \begin{bmatrix} 4 & 1 \\ 1 & 4 \end{bmatrix}$

We investigate its eigenvalues and eigenvectors.

Suppose we take the product $A x_1$, where $x_1 = (1, 0)^T$, as shown in Fig. 1. Then,

(2) $A x_1 = \begin{bmatrix} 4 \\ 1 \end{bmatrix}.$

By comparing the vectors x1 and Ax1 we see that the product vector is scaled and rotated counter–clockwise with respect to x1.

Now consider the case where $x_2 = (0, 1)^T$. Then $A x_2 = (1, 4)^T$. Here, we note a clockwise rotation of $A x_2$ with respect to $x_2$.

Now let's consider a more interesting case. Suppose $x_3 = (1, 1)^T$. Then $A x_3 = (5, 5)^T$. Now the product vector points in the same direction as $x_3$. The vector $A x_3$ is a scaled version of the vector $x_3$. Because of this property, $x_3 = (1, 1)^T$ is an eigenvector of $A$. The scale factor (which in this case is 5) is given the symbol $\lambda$ and is referred to as an eigenvalue.

Note that $x = [1, -1]^T$ is also an eigenvector, because in this case, $Ax = [3, -3]^T = 3x$. The corresponding eigenvalue is 3.

Thus we have, if $x$ is an eigenvector of $A \in \mathbb{R}^{n \times n}$,

(3) $Ax = \lambda x, \qquad \lambda:\ \text{scalar multiple (eigenvalue)}$

i.e., the vector Ax is in the same direction as x but scaled by a factor λ.

Now that we have an understanding of the fundamental idea of an eigenvector, we proceed to develop the idea further. Eq. (3) may be written in the form

(4) $(A - \lambda I)x = 0$

where I is the n×n identity matrix. Eq. (4) is a homogeneous system of equations, and from fundamental linear algebra, we know that a nontrivial solution to (4) exists if and only if

(5) $\det(A - \lambda I) = 0$

where $\det(\cdot)$ denotes determinant. Eq. (5), when evaluated, becomes a polynomial in $\lambda$ of degree $n$. For example, for the matrix $A$ above we have

$\det\left( \begin{bmatrix} 4 & 1 \\ 1 & 4 \end{bmatrix} - \lambda \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \right) = 0$

(6) $\det \begin{bmatrix} 4 - \lambda & 1 \\ 1 & 4 - \lambda \end{bmatrix} = (4 - \lambda)^2 - 1 = \lambda^2 - 8\lambda + 15 = 0.$

It is easily verified that the roots of this polynomial are (5,3), which correspond to the eigenvalues indicated above.
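A quick numpy confirmation (added for illustration, not part of the notes) of the eigenvalues and eigenvectors of this example matrix:

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [1.0, 4.0]])

# Roots of the characteristic polynomial lambda^2 - 8*lambda + 15
print(np.roots([1.0, -8.0, 15.0]))     # the roots are 5 and 3

evals, evecs = np.linalg.eigh(A)        # eigh: routine for symmetric matrices
print(evals)                            # [3. 5.]
print(evecs)                            # columns proportional to (1,-1) and (1,1), up to sign

# An eigenvector is simply scaled by its eigenvalue
v = np.array([1.0, 1.0]) / np.sqrt(2)
print(A @ v, 5 * v)                     # identical
```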

Eq. (5) is referred to as the characteristic equation of A, and the corresponding polynomial is the characteristic polynomial. The characteristic polynomial is of degree n.

More generally, if A is n×n, then there are n solutions of (5), or n roots of the characteristic polynomial. Thus there are n eigenvalues of A satisfying (3); i.e.,

(7) $A x_i = \lambda_i x_i, \qquad i = 1, \ldots, n.$

If the eigenvalues are all distinct, there are n associated linearly–independent eigenvectors, whose directions are unique, which span an n–dimensional Euclidean space.

Repeated Eigenvalues: In the case where there are, e.g., $r$ repeated eigenvalues, a linearly independent set of $n$ eigenvectors exists, provided the rank of the matrix $(A - \lambda I)$ in (5) is $n - r$. In this case, the directions of the $r$ eigenvectors associated with the repeated eigenvalue are not unique. In fact, consider a set of $r$ linearly independent eigenvectors $v_1, \ldots, v_r$ associated with the $r$ repeated eigenvalues. Then, it may be shown that any vector in $\operatorname{span}[v_1, \ldots, v_r]$ is also an eigenvector. This emphasizes the fact that the eigenvectors are not unique in this case.

Example 1: Consider the matrix given by

$\begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}$

It may be easily verified that any vector in $\operatorname{span}[e_2, e_3]$ is an eigenvector associated with the repeated zero eigenvalue.

Example 2: Consider the n×n identity matrix. It has n repeated eigenvalues equal to one. In this case, any n–dimensional vector is an eigenvector, and the eigenvectors span an n–dimensional space.


Eq. (5) gives us a clue how to compute eigenvalues. We can formulate the characteristic polynomial and evaluate its roots to give the $\lambda_i$. Once the eigenvalues are available, it is possible to compute the corresponding eigenvectors $v_i$ by evaluating the nullspace of the quantity $A - \lambda_i I$, for $i = 1, \ldots, n$. This approach is adequate for small systems, but for those of appreciable size, this method is prone to appreciable numerical error. Later, we consider various orthogonal transformations which lead to much more effective techniques for finding the eigenvalues.

We now present some very interesting properties of eigenvalues and eigenvectors, to aid in our understanding.

Property 1 If the eigenvalues of a (Hermitian)[6] symmetric matrix are distinct, then the eigenvectors are orthogonal.

Proof. Let $\{v_i\}$ and $\{\lambda_i\}$, $i = 1, \ldots, n$, be the eigenvectors and corresponding eigenvalues, respectively, of $A \in \mathbb{R}^{n \times n}$. Choose any $i, j \in [1, \ldots, n]$, $i \neq j$. Then

(8) $A v_i = \lambda_i v_i$

and

(9) $A v_j = \lambda_j v_j.$

Premultiply (8) by $v_j^T$ and (9) by $v_i^T$:

(10) $v_j^T A v_i = \lambda_i v_j^T v_i$

(11) $v_i^T A v_j = \lambda_j v_i^T v_j$

The quantities on the left are equal when $A$ is symmetric. We show this as follows. Since the left-hand side of (10) is a scalar, its transpose is equal to itself. Therefore, we get $v_j^T A v_i = v_i^T A^T v_j$.[7] But, since $A$ is symmetric, $A^T = A$. Thus, $v_j^T A v_i = v_i^T A^T v_j = v_i^T A v_j$, which was to be shown. Subtracting (11) from (10), we have

(12) $(\lambda_i - \lambda_j)\, v_j^T v_i = 0$

where we have used the fact that $v_j^T v_i = v_i^T v_j$. But by hypothesis, $\lambda_i - \lambda_j \neq 0$. Therefore, (12) is satisfied only if $v_j^T v_i = 0$, which means the vectors are orthogonal.

Here we have considered only the case where the eigenvalues are distinct. If an eigenvalue $\tilde{\lambda}$ is repeated $r$ times, and $\operatorname{rank}(A - \tilde{\lambda} I) = n - r$, then a mutually orthogonal set of $n$ eigenvectors can still be found.

Another useful property of eigenvalues of symmetric matrices is as follows:

Property 2 The eigenvalues of a (Hermitian) symmetric matrix are real.

Proof:[8] (By contradiction): First, we consider the case where $A$ is real. Let $\lambda$ be a non–real complex eigenvalue of a symmetric matrix $A$. Then, since the elements of $A$ are real, $\bar{\lambda}$, the complex conjugate of $\lambda$, must also be an eigenvalue of $A$, because the roots of the characteristic polynomial must occur in complex conjugate pairs. Also, if $v$ is a nonzero eigenvector corresponding to $\lambda$, then an eigenvector corresponding to $\bar{\lambda}$ must be $\bar{v}$, the complex conjugate of $v$. But Property 1 requires that the eigenvectors be orthogonal; therefore, $v^T \bar{v} = 0$. But $v^T \bar{v} = (v^H v)^*$, which is by definition the complex conjugate of the squared norm of $v$. But the norm of a vector is a pure real number; hence, $v^T \bar{v}$ must be greater than zero, since $v$ is by hypothesis nonzero. We therefore have a contradiction. It follows that the eigenvalues of a symmetric matrix cannot be complex; i.e., they are real.

While this proof considers only the real symmetric case, it is easily extended to the case where A is Hermitian symmetric.

Property 3 Let $A$ be a matrix with eigenvalues $\lambda_i$, $i = 1, \ldots, n$, and eigenvectors $v_i$. Then the eigenvalues of the matrix $A + sI$ are $\lambda_i + s$, with corresponding eigenvectors $v_i$, where $s$ is any real number.

Proof: From the definition of an eigenvector, we have $Av = \lambda v$. Further, we have $sIv = sv$. Adding, we have $(A + sI)v = (\lambda + s)v$. This new eigenvector relation on the matrix $(A + sI)$ shows the eigenvectors are unchanged, while the eigenvalues are displaced by $s$.

Property 4 Let $A$ be an $n \times n$ matrix with eigenvalues $\lambda_i$, $i = 1, \ldots, n$. Then

  • The determinant $\det(A) = \prod_{i=1}^{n} \lambda_i$.
  • The trace[9] $\operatorname{tr}(A) = \sum_{i=1}^{n} \lambda_i$. The proof is straightforward, but because it is easier using concepts presented later in the course, it is not given here.

Property 5 If $v$ is an eigenvector of a matrix $A$, then $cv$ is also an eigenvector, where $c$ is any real or complex constant. The proof follows directly by substituting $cv$ for $v$ in $Av = \lambda v$. This means that only the direction of an eigenvector can be unique; its norm is not unique.

2.1.1 Orthonormal Matrices

Before proceeding with the eigendecomposition of a matrix, we must develop the concept of an orthonormal matrix. This form of matrix has mutually orthogonal columns, each of unit norm. This implies that

(13) $q_i^T q_j = \delta_{ij},$

where δij is the Kronecker delta, and qi and qj are columns of the orthonormal matrix Q. With (13) in mind, we now consider the product QTQ. The result may be visualized with the aid of the diagram below:

(14) $Q^T Q = \begin{bmatrix} q_1^T \\ q_2^T \\ \vdots \\ q_N^T \end{bmatrix} \begin{bmatrix} q_1 & q_2 & \cdots & q_N \end{bmatrix} = I.$

(When $i = j$, the quantity $q_i^T q_i$ defines the squared 2–norm of $q_i$, which has been defined as unity. When $i \neq j$, $q_i^T q_j = 0$, due to the orthogonality of the $q_i$.) Eq. (14) is a fundamental property of an orthonormal matrix.

Thus, for an orthonormal matrix, (14) implies the inverse may be computed simply by taking the transpose of the matrix, an operation which requires almost no computational effort.

Eq. (14) follows directly from the fact that $Q$ has orthonormal columns. It is not so clear that the quantity $QQ^T$ should also equal the identity. We can resolve this question in the following way. Suppose that $A$ and $B$ are any two square invertible matrices such that $AB = I$. Then, $BAB = B$. By parsing this last expression, we have

(15) $(BA)B = B.$

Clearly, if (15) is to hold, then the quantity $BA$ must be the identity[10]; hence, if $AB = I$, then $BA = I$. Therefore, if $Q^T Q = I$, then also $QQ^T = I$. From this fact, it follows that if a matrix has orthonormal columns, then it also must have orthonormal rows. We now develop a further useful property of orthonormal matrices:

Property 6 The vector 2-norm is invariant under an orthonormal transformation. If Q is orthonormal, then

$\|Qx\|_2^2 = x^T Q^T Q x = x^T x = \|x\|_2^2.$

Thus, because the norm does not change, an orthonormal transformation performs a rotation operation on a vector. We use this norm–invariance property later in our study of the least–squares problem.

Suppose we have a matrix $U \in \mathbb{R}^{m \times n}$, where $m > n$, whose columns are orthonormal. We see in this case that $U$ is a tall matrix, which can be formed by extracting only the first $n$ columns of an arbitrary orthonormal matrix. (We reserve the term orthonormal matrix to refer to a complete $m \times m$ matrix.) Because $U$ has orthonormal columns, it follows that the quantity $U^T U = I_{n \times n}$. However, it is important to realize that the quantity $UU^T \neq I_{m \times m}$ in this case, in contrast to the situation when $m = n$. The latter relation follows from the fact that the $m$ column vectors of $U^T$, each of length $n$, $n < m$, cannot all be mutually orthogonal. In fact, we see later that $UU^T$ is a projector onto the subspace $R(U)$.

Suppose we have a vector $b \in \mathbb{R}^m$. Because it is easiest, by convention we represent $b$ using the basis $[e_1, \ldots, e_m]$, where the $e_i$ are the elementary vectors (all zeros except for a one in the $i$th position). However, it is often convenient to represent $b$ in a basis formed from the columns of an orthonormal matrix $Q$. In this case, the elements of the vector $c = Q^T b$ are the coefficients of $b$ in the basis $Q$. The orthonormal basis is convenient because we can restore $b$ from $c$ simply by taking $b = Qc$.

An orthonormal matrix is sometimes referred to as a unitary matrix. This follows because the determinant of an orthonormal matrix is ±1.
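The following numpy sketch (not part of the original notes) illustrates the properties above: $Q^T Q = QQ^T = I$, the change of basis $c = Q^T b$, and 2-norm invariance. Generating $Q$ from the QR factorization of a random matrix is just one convenient way to obtain an orthonormal matrix for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(1)
# One convenient way to generate an orthonormal matrix: the Q factor of a QR decomposition
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))

print(np.allclose(Q.T @ Q, np.eye(4)))   # True: orthonormal columns (eq. (14))
print(np.allclose(Q @ Q.T, np.eye(4)))   # True: hence orthonormal rows as well

b = rng.standard_normal(4)
c = Q.T @ b                               # coefficients of b in the basis Q
print(np.allclose(Q @ c, b))              # True: b is restored from its coefficients
print(np.isclose(np.linalg.norm(Q @ b), np.linalg.norm(b)))  # True: 2-norm invariance (Property 6)
```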

2.1.2 The Eigendecomposition (ED) of a Square Symmetric Matrix

Almost all matrices on which ED’s are performed (at least in signal processing) are symmetric. A good example are covariance matrices, which are discussed in some detail in the next section.

Let $A \in \mathbb{R}^{n \times n}$ be symmetric. Then, for eigenvalues $\lambda_i$ and eigenvectors $v_i$, we have

(16) $A v_i = \lambda_i v_i, \qquad i = 1, \ldots, n.$

Let the eigenvectors be normalized to unit 2–norm. Then these n equations can be combined, or stacked side–by–side together, and represented in the following compact form:

(17) $AV = V\Lambda$

where $V = [v_1, v_2, \ldots, v_n]$ (i.e., each column of $V$ is an eigenvector), and

(18) $\Lambda = \begin{bmatrix} \lambda_1 & & & 0 \\ & \lambda_2 & & \\ & & \ddots & \\ 0 & & & \lambda_n \end{bmatrix} = \operatorname{diag}(\lambda_1, \ldots, \lambda_n).$

Corresponding columns from each side of (17) represent one specific value of the index $i$ in (16). Because we have assumed $A$ is symmetric, from Property 1, the $v_i$ are orthogonal. Furthermore, since we have assumed $\|v_i\|_2 = 1$, $V$ is an orthonormal matrix. Thus, post-multiplying both sides of (17) by $V^T$ and using $VV^T = I$, we get

(19) $A = V \Lambda V^T.$

Eq. (19) is called the eigendecomposition (ED) of A. The columns of V are eigenvectors of A, and the diagonal elements of Λ are the corresponding eigenvalues. Any symmetric matrix may be decomposed in this way. This form of decomposition, with Λ being diagonal, is of extreme interest and has many interesting consequences. It is this decomposition which leads directly to the Karhunen-Loeve expansion which we discuss shortly.

Note that from (19), knowledge of the eigenvalues and eigenvectors of $A$ is sufficient to completely specify $A$. Note further that if the eigenvalues are distinct, then the ED is unique: there is only one orthonormal $V$ and one diagonal $\Lambda$ which satisfy (19).

Eq. (19) can also be written as

(20) $V^T A V = \Lambda.$

Since $\Lambda$ is diagonal, we say that the unitary (orthonormal) matrix $V$ of eigenvectors diagonalizes $A$. No other orthonormal matrix can diagonalize $A$. The fact that only $V$ diagonalizes $A$ is the fundamental property of eigenvectors. If you understand that the eigenvectors of a symmetric matrix diagonalize it, then you understand the "mystery" behind eigenvalues and eigenvectors. That's all there is to it. We look at the K–L expansion later in this lecture in order to solidify this interpretation, and to show some very important signal processing concepts which fall out of the K–L idea. But the K–L analysis is just a direct consequence of the fact that only the eigenvectors of a symmetric matrix diagonalize it.
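A minimal numpy sketch (added here, not from the notes) confirming (19) and (20) for an arbitrary symmetric matrix; numpy's eigh routine is used since it is intended for symmetric matrices.

```python
import numpy as np

rng = np.random.default_rng(2)
B = rng.standard_normal((5, 5))
A = B + B.T                              # an arbitrary symmetric matrix

lam, V = np.linalg.eigh(A)               # eigenvalues (ascending) and orthonormal eigenvectors
Lam = np.diag(lam)

print(np.allclose(V @ Lam @ V.T, A))     # eq. (19): A = V Lambda V^T
print(np.allclose(V.T @ A @ V, Lam))     # eq. (20): V diagonalizes A
print(np.allclose(V.T @ V, np.eye(5)))   # V is orthonormal
```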

2.1.3 Conventional Notation on Eigenvalue Indexing

Let $A \in \mathbb{R}^{n \times n}$ have rank $r \le n$. Also assume $A$ is positive semi–definite; i.e., all its eigenvalues are $\ge 0$. This is a not too restrictive assumption, because most of the matrices on which the eigendecomposition is relevant are positive semi–definite. Then, as we see in the next section, we have $r$ non-zero eigenvalues and $n - r$ zero eigenvalues. It is common convention to order the eigenvalues so that

(21) $\underbrace{\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_r}_{r \text{ nonzero eigenvalues}} > \underbrace{\lambda_{r+1} = \cdots = \lambda_n}_{n-r \text{ zero eigenvalues}} = 0$

i.e., we order the columns of eq. (17) so that $\lambda_1$ is the largest, with the remaining nonzero eigenvalues arranged in descending order, followed by $n - r$ zero eigenvalues. Note that if $A$ is full rank, then $r = n$ and there are no zero eigenvalues. The quantity $\lambda_n$ is the eigenvalue with the lowest value.

The eigenvectors are reordered to correspond with the ordering of the eigenvalues. For notational convenience, we refer to the eigenvector corresponding to the largest eigenvalue as the “largest eigenvector”. The “smallest eigenvector” is then the eigenvector corresponding to the smallest eigenvalue.

2.2 The Eigendecomposition in Relation to the Fundamental Matrix Subspaces

In this section, we develop relationships between the eigendecomposition of a matrix and its range, null space and rank.

Here, we consider square symmetric positive semi–definite matrices $A \in \mathbb{R}^{n \times n}$ of rank $r \le n$. Let us partition the eigendecomposition of $A$ in the following form:

(22) $A = V \Lambda V^T = \begin{bmatrix} V_1 & V_2 \end{bmatrix} \begin{bmatrix} \Lambda_1 & 0 \\ 0 & \Lambda_2 \end{bmatrix} \begin{bmatrix} V_1^T \\ V_2^T \end{bmatrix}$

where

(23) $V_1 = [v_1, v_2, \ldots, v_r] \in \mathbb{R}^{n \times r}, \qquad V_2 = [v_{r+1}, \ldots, v_n] \in \mathbb{R}^{n \times (n-r)}.$

The columns of $V_1$ are the eigenvectors corresponding to the first $r$ eigenvalues of $A$, and the columns of $V_2$ correspond to the $n - r$ smallest eigenvalues. We also have

(24) $\Lambda_1 = \operatorname{diag}[\lambda_1, \ldots, \lambda_r] = \begin{bmatrix} \lambda_1 & & \\ & \ddots & \\ & & \lambda_r \end{bmatrix} \in \mathbb{R}^{r \times r},$

and

(25) $\Lambda_2 = \begin{bmatrix} \lambda_{r+1} & & \\ & \ddots & \\ & & \lambda_n \end{bmatrix} \in \mathbb{R}^{(n-r) \times (n-r)}.$

In the notation used above, the explicit absence of a matrix element in an off-diagonal position implies that element is zero. We now show that the partition (22) reveals a great deal about the structure of A.

2.2.1 Nullspace

In this section, we explore the relationship between the partition of (22) and the nullspace of A. Recall that the nullspace N(A) of A is defined as

(26) $N(A) = \{ x \neq 0 \in \mathbb{R}^n \mid Ax = 0 \}.$

From (22), we have

(27) $Ax = \begin{bmatrix} V_1 & V_2 \end{bmatrix} \begin{bmatrix} \Lambda_1 & 0 \\ 0 & \Lambda_2 \end{bmatrix} \begin{bmatrix} V_1^T \\ V_2^T \end{bmatrix} x.$

We now choose $x$ so that $x \in \operatorname{span}(V_2)$. Then $x = V_2 c_2$, where $c_2$ is any vector in $\mathbb{R}^{n-r}$. Then, since $V_1 \perp V_2$, we have

(28) $Ax = \begin{bmatrix} V_1 & V_2 \end{bmatrix} \begin{bmatrix} \Lambda_1 & 0 \\ 0 & \Lambda_2 \end{bmatrix} \begin{bmatrix} 0 \\ c_2 \end{bmatrix} = \begin{bmatrix} V_1 & V_2 \end{bmatrix} \begin{bmatrix} 0 \\ \Lambda_2 c_2 \end{bmatrix}.$

From (28), it is clear we can find a non–trivial $x$ such that $Ax = 0$ if and only if $\Lambda_2 = 0$. Thus, a non–empty nullspace can exist only if $\Lambda_2 = 0$.

Since $\Lambda_2 \in \mathbb{R}^{(n-r) \times (n-r)}$, a square symmetric matrix of rank $r \le n$ must have $n - r$ zero eigenvalues.

Moreover, from (28) we see that the condition $x \in \operatorname{span}(V_2)$ is also necessary for $Ax = 0$. This implies that an orthonormal basis for the nullspace of $A$ is $V_2$. Since $V_2 \in \mathbb{R}^{n \times (n-r)}$, the nullity of $A$ is $n - r$, corresponding to the number of zero eigenvalues.

Thus, we have the important result that if the dimension of $N(A)$ is $d = n - r$, then $A$ must have $d$ zero eigenvalues. The matrix $V_2 \in \mathbb{R}^{n \times (n-r)}$ is an orthonormal basis for $N(A)$.

2.2.2 Range

Let us look at R(A) in the light of the decomposition of (22), where we have seen that Λ2=0 if A is rank deficient. The definition of R(A), repeated here for convenience, is

(29) $R(A) = \{ y \mid y = Ax,\ x \in \mathbb{R}^n \}.$

The vector quantity Ax is therefore given as

(30) $Ax = \begin{bmatrix} V_1 & V_2 \end{bmatrix} \begin{bmatrix} \Lambda_1 & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} V_1^T \\ V_2^T \end{bmatrix} x.$

In the above, it is understood that if A is full rank, then the lower right block of zeros in Λ vanishes and Λ becomes equivalent to Λ1.

Let us define c as

(31) $c = \begin{bmatrix} c_1 \\ c_2 \end{bmatrix} = \begin{bmatrix} V_1^T \\ V_2^T \end{bmatrix} x,$

where $c_1 \in \mathbb{R}^r$ and $c_2 \in \mathbb{R}^{n-r}$. Then,

(32) $y = Ax = \begin{bmatrix} V_1 & V_2 \end{bmatrix} \begin{bmatrix} \Lambda_1 & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} c_1 \\ c_2 \end{bmatrix} = \begin{bmatrix} V_1 & V_2 \end{bmatrix} \begin{bmatrix} \Lambda_1 c_1 \\ 0 \end{bmatrix} = V_1 (\Lambda_1 c_1).$

From (31), we see that as $x$ ranges over $\mathbb{R}^n$, so does $c$, and therefore $\Lambda_1 c_1$ ranges over all of $\mathbb{R}^r$. Thus, the vector $y$ in (32) consists of all possible linear combinations of the columns of $V_1$, and $R(A) = R(V_1)$. Therefore we have the important result that $V_1$ is an orthonormal basis for $R(A)$.
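The following illustrative sketch (not from the notes) builds an arbitrary rank-deficient positive semi-definite matrix and checks that the partition (22)-(23) yields $V_1$ as an orthonormal basis for $R(A)$ and $V_2$ as an orthonormal basis for $N(A)$.

```python
import numpy as np

rng = np.random.default_rng(3)
# An arbitrary symmetric positive semi-definite matrix of rank r = 3 in R^{5x5}
X = rng.standard_normal((5, 3))
A = X @ X.T

lam, V = np.linalg.eigh(A)               # eigenvalues in ascending order
idx = np.argsort(lam)[::-1]              # reorder: largest eigenvalue first (eq. (21))
lam, V = lam[idx], V[:, idx]

r = int(np.sum(lam > 1e-10))             # numerical rank (here 3)
V1, V2 = V[:, :r], V[:, r:]              # eq. (23)

# V1 is an orthonormal basis for R(A): projecting the columns of A onto span(V1) leaves A unchanged
print(np.allclose(V1 @ (V1.T @ A), A))
# V2 is an orthonormal basis for N(A): A maps span(V2) to zero
print(np.allclose(A @ V2, 0))
```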

2.3 Matrix Norms

Now that we have some understanding of eigenvectors and eigenvalues, we can present the matrix norm. The matrix norm is related to the vector norm: it is a function which maps $\mathbb{R}^{m \times n}$ into $\mathbb{R}$. A matrix norm must obey the same properties as a vector norm. Since a norm is only strictly defined for a vector quantity, a matrix norm is defined by mapping a matrix into a vector. This is accomplished by post-multiplying the matrix by a suitable vector. Some useful matrix norms are now presented:

Matrix p-Norms: A matrix p-norm is defined in terms of a vector p-norm. The matrix p-norm of an arbitrary matrix $A$, denoted $\|A\|_p$, is defined as

(33) $\|A\|_p = \sup_{x \neq 0} \frac{\|Ax\|_p}{\|x\|_p}$

where "sup" means supremum; i.e., the largest value of the argument over all values of $x \neq 0$. Since a property of a vector norm is $\|cx\|_p = |c| \|x\|_p$ for any scalar $c$, we can choose $c$ in (33) so that $\|x\|_p = 1$. Then, an equivalent statement to (33) is

(34) $\|A\|_p = \max_{\|x\|_p = 1} \|Ax\|_p.$

We now provide some interpretation of the above definition for the specific case where $p = 2$ and $A$ is square and symmetric, in terms of the eigendecomposition of $A$. To find the matrix 2–norm, we differentiate (34) and set the result to zero. Differentiating $\|Ax\|_2$ directly is difficult. However, we note that finding the $x$ which maximizes $\|Ax\|_2$ is equivalent to finding the $x$ which maximizes $\|Ax\|_2^2$, and the differentiation of the latter is much easier. In this case, we have $\|Ax\|_2^2 = x^T A^T A x$. To find the maximum, we use the method of Lagrange multipliers, since $x$ is constrained by (34). Therefore we differentiate the quantity

(35) $x^T A^T A x + \gamma (1 - x^T x)$

and set the result to zero. The quantity γ above is the Lagrange multiplier. The details of the differentiation are omitted here, since they will be covered in a later lecture. The interesting result of this process is that x must satisfy

(36) $A^T A x = \gamma x, \qquad \|x\|_2 = 1.$

Therefore the stationary points of (34) are the eigenvectors of $A^T A$. When $A$ is square and symmetric, the eigenvectors of $A^T A$ are equivalent to those of $A$.[11] Therefore the stationary points of (34) are also the eigenvectors of $A$. By substituting $x = v_1$ into (34) we find that $\|Ax\|_2 = \lambda_1$.

It then follows that the solution to (34) is given by the eigenvector corresponding to the largest eigenvalue of $A$, and $\|A\|_2$ is equal to the largest eigenvalue of $A$.

More generally, it is shown in the next lecture for an arbitrary matrix A that

(37) $\|A\|_2 = \sigma_1$

where σ1 is the largest singular value of A. This quantity results from the singular value decomposition, to be discussed next lecture.

Matrix norms for other values of p, for arbitrary A, are given as

(38) $\|A\|_1 = \max_{1 \le j \le n} \sum_{i=1}^{m} |a_{ij}| \qquad \text{(maximum column sum)}$

and

(39) $\|A\|_\infty = \max_{1 \le i \le m} \sum_{j=1}^{n} |a_{ij}| \qquad \text{(maximum row sum)}.$

Frobenius Norm: The Frobenius norm is the 2-norm of the vector consisting of the 2-norms of the rows (or columns) of the matrix $A$:

$\|A\|_F = \left[ \sum_{i=1}^{m} \sum_{j=1}^{n} |a_{ij}|^2 \right]^{1/2}$
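A short numpy illustration (added, not from the notes) of the matrix norms (37)-(39) and the Frobenius norm, using an arbitrary example matrix.

```python
import numpy as np

A = np.array([[1.0, -2.0],
              [3.0,  4.0]])

print(np.linalg.norm(A, 1))       # 6.0 : maximum column sum (eq. (38))
print(np.linalg.norm(A, np.inf))  # 7.0 : maximum row sum (eq. (39))
print(np.linalg.norm(A, 'fro'))   # sqrt(1+4+9+16) = sqrt(30): Frobenius norm
print(np.linalg.norm(A, 2))       # largest singular value (eq. (37))

# For a symmetric matrix, the 2-norm equals the largest eigenvalue magnitude
S = A + A.T
print(np.isclose(np.linalg.norm(S, 2), np.abs(np.linalg.eigvalsh(S)).max()))
```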

2.3.1 Properties of Matrix Norms

  1. Consider the matrix $A \in \mathbb{R}^{m \times n}$ and the vector $x \in \mathbb{R}^n$. Then,
(40) $\|Ax\|_p \le \|A\|_p \|x\|_p$

This property follows by dividing both sides of the above by $\|x\|_p$, and applying (33).

  2. If $Q$ and $Z$ are orthonormal matrices of appropriate size, then
(41) $\|QAZ\|_2 = \|A\|_2$

and

(42) $\|QAZ\|_F = \|A\|_F$

Thus, we see that the matrix 2–norm and Frobenius norm are invariant to pre– and post– multiplication by an orthonormal matrix.

  3. Further,
(43) $\|A\|_F^2 = \operatorname{tr}(A^T A)$

where $\operatorname{tr}(\cdot)$ denotes the trace of a matrix, which is the sum of its diagonal elements.

2.4 Covariance Matrices

Here, we investigate the concepts and properties of the covariance matrix $R_{xx}$ corresponding to a stationary, discrete-time random process $x[n]$. We break the infinite sequence $x[n]$ into windows of length $m$, as shown in Fig. 2. The windows generally overlap; in fact, they are typically displaced from one another by only one sample. The samples within the $i$th window become an $m$-length vector $x_i$, $i = 1, 2, 3, \ldots$. Hence, the vector corresponding to each window is a vector sample from the random process $x[n]$. Processing random signals in this way is the fundamental first step in many forms of electronic systems which deal with real signals, such as process identification, control, or any form of communication system including telephones, radio, radar, sonar, etc.

The word stationary as used above means the random process is one for which the corresponding joint $m$–dimensional probability density function describing the distribution of the vector sample $x$ does not change with time. This means that all moments of the distribution (i.e., quantities such as the mean, the variance, and all cross–correlations, as well as all other higher–order statistical characterizations) are invariant with time. Here, however, we deal with a weaker form of stationarity referred to as wide–sense stationarity (WSS). With these processes, only the first two moments (mean, variances and covariances) need be invariant with time. Strictly, the idea of a covariance matrix is only relevant for stationary or WSS processes, since expectations only have meaning if the underlying process is stationary.

The covariance matrix RxxRm×m corresponding to a stationary or WSS process x[n] is defined as

(44) $R_{xx} \triangleq E\left[ (x - \mu)(x - \mu)^T \right]$

where $\mu$ is the vector mean of the process and $E(\cdot)$ denotes the expectation operator over all possible windows of index $i$ of length $m$ in Fig. 2. Often we deal with zero-mean processes, in which case we have

(45) $R_{xx} = E[x_i x_i^T] = E\left[ \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_m \end{pmatrix} \begin{pmatrix} x_1 & x_2 & \cdots & x_m \end{pmatrix} \right] = E\begin{bmatrix} x_1 x_1 & x_1 x_2 & \cdots & x_1 x_m \\ x_2 x_1 & x_2 x_2 & \cdots & x_2 x_m \\ \vdots & \vdots & & \vdots \\ x_m x_1 & x_m x_2 & \cdots & x_m x_m \end{bmatrix},$

where $(x_1, x_2, \ldots, x_m)^T = x_i$. Taking the expectation over all windows, eq. (45) tells us that the element $r(1,1)$ of $R_{xx}$ is by definition $E(x_1^2)$, which is the mean-square value (the preferred term is variance, whose symbol is $\sigma^2$) of the first element $x_1$ of all possible vector samples $x_i$ of the process. But because of stationarity, $r(1,1) = r(2,2) = \cdots = r(m,m)$, which are all equal to $\sigma^2$. Thus all main diagonal elements of $R_{xx}$ are equal to the variance of the process. The element $r(1,2) = E(x_1 x_2)$ is the cross–correlation between the first element of $x_i$ and the second element. Taken over all possible windows, we see this quantity is the cross–correlation of the process and itself delayed by one sample. Because of stationarity, $r(1,2) = r(2,3) = \cdots = r(m-1,m)$, and hence all elements on the first upper diagonal are equal to the cross-correlation for a time-lag of one sample. Since multiplication is commutative, $r(2,1) = r(1,2)$, and therefore all elements on the first lower diagonal are also equal to this same cross-correlation value. Using similar reasoning, all elements on the $j$th upper or lower diagonal are equal to the cross-correlation value of the process for a time lag of $j$ samples. Thus we see that the matrix $R_{xx}$ is highly structured.

Let us compare the process shown in Fig. 2 with that shown in Fig. 3. In the former case, we see that the process is relatively slowly varying. Because we have assumed $x[n]$ to be zero mean, adjacent samples of the process in Fig. 2 will have the same sign most of the time, and hence $E(x_i x_{i+1})$ will be a positive number, coming close to the value $E(x_i^2)$. The same can be said for $E(x_i x_{i+2})$, except it is not so close to $E(x_i^2)$. Thus, we see that for the process of Fig. 2, the diagonals decay fairly slowly away from the main diagonal value.

However, for the process shown in Fig. 3, adjacent samples are uncorrelated with each other. This means that adjacent samples are just as likely to have opposite signs as they are to have the same signs. On average, the terms with positive values have the same magnitude as those with negative values. Thus, when the expectations $E(x_i x_{i+1}), E(x_i x_{i+2}), \ldots$ are taken, the resulting averages approach zero. In this case then, we see the covariance matrix concentrates around the main diagonal, and becomes equal to $\sigma^2 I$. We note that all the eigenvalues of $R_{xx}$ are then equal to the value $\sigma^2$. Because of this property, such processes are referred to as "white", in analogy to white light, whose spectral components are all of equal magnitude.

The sequence $\{r(1,1), r(1,2), \ldots, r(1,m)\}$ is equivalent to the autocorrelation function of the process, for lags 0 to $m-1$. The autocorrelation function characterizes the random process $x[n]$ in terms of its variance, and how quickly the process varies over time. In fact, it may be shown[12] that the Fourier transform of the autocorrelation function is the power spectral density of the process. Further discussion on this aspect of random processes is beyond the scope of this treatment; the interested reader is referred to the reference.

In practice, it is impossible to evaluate the covariance matrix $R_{xx}$ using expectations as in (44). Expectations cannot be evaluated in practice: they require an infinite amount of data, which is never available, and furthermore, the data must be stationary over the observation interval, which is rarely the case. In practice, we evaluate an estimate $\hat{R}_{xx}$ of $R_{xx}$, based on an observation of finite length $N$ of the process $x[n]$, by replacing the ensemble average (expectation) with a finite temporal average over the $N$ available data points as follows[13]:

(46) $\hat{R}_{xx} = \frac{1}{N-m+1} \sum_{i=1}^{N-m+1} x_i x_i^T.$

If (46) is used to evaluate $\hat{R}_{xx}$, then the process need only be stationary over the observation length. Thus, by using the covariance estimate given by (46), we can track slow changes in the true covariance matrix of the process with time, provided the change in the process is small over the observation interval $N$. Further properties and discussion of covariance matrices are given in Haykin.[14]

It is interesting to note that $\hat{R}_{xx}$ can be formed in an alternate way from (46). Let $X \in \mathbb{R}^{m \times (N-m+1)}$ be a matrix whose $i$th column is the vector sample $x_i$, $i = 1, \ldots, N-m+1$, of $x[n]$. Then $\hat{R}_{xx}$ is also given as

(47) $\hat{R}_{xx} = \frac{1}{N-m+1} X X^T.$

The proof of this statement is left as an exercise.
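As a numerical illustration (not part of the notes), the sketch below generates a simple correlated process (a hypothetical moving average of white noise, chosen only for demonstration), stacks the windowed vector samples into $X$, and forms $\hat{R}_{xx}$ via (47); the result is approximately Toeplitz with diagonals that decay with lag, as described above.

```python
import numpy as np

rng = np.random.default_rng(4)
N, m = 10000, 4
# A correlated (slowly varying) zero-mean process: a moving average of white noise
w = rng.standard_normal(N + 3)
x = (w[3:] + w[2:-1] + w[1:-2] + w[:-3]) / 4.0

# Stack the windowed vector samples x_i as columns of X (length-m windows, unit shift)
X = np.column_stack([x[i:i + m] for i in range(N - m + 1)])

R_hat = (X @ X.T) / (N - m + 1)          # eq. (47), identical to the sum in eq. (46)
print(np.round(R_hat, 2))                # approximately Toeplitz, diagonals decaying with lag
```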

Some Properties of Rxx:

  1. $R_{xx}$ is (Hermitian) symmetric, i.e., $r_{ij} = r_{ji}^*$, where $*$ denotes complex conjugation.
  2. If the process $x[n]$ is stationary or wide-sense stationary, then $R_{xx}$ is Toeplitz. This means that all the elements on a given diagonal of the matrix are equal. If you understand this property, then you have a good understanding of the nature of covariance matrices.
  3. If $R_{xx}$ is diagonal, then the elements of $x$ are uncorrelated. If the magnitudes of the off-diagonal elements of $R_{xx}$ are significant with respect to those on the main diagonal, the process is said to be highly correlated.
  4. $R_{xx}$ is positive semi–definite. This implies that all the eigenvalues are greater than or equal to zero. We will discuss positive definiteness and positive semi–definiteness later.
  5. If the stationary or WSS random process $x$ has a Gaussian probability distribution, then the vector mean and the covariance matrix $R_{xx}$ are enough to completely specify the statistical characteristics of the process.

2.5 The Karhunen-Loeve Expansion of a Random Process

In this section we combine what we have learned about eigenvalues and eigenvectors, and covariance matrices, into the K-L orthonormal expansion of a random process. The KL expansion is extremely useful in compression of images and speech signals.

An orthonormal expansion of a vector $x \in \mathbb{R}^m$ involves expressing $x$ as a linear combination of orthonormal basis vectors or functions as follows:

(48) $x = Qa$

where $a = [a_1, \ldots, a_m]$ contains the coefficients or weights of the expansion, and $Q = [q_1, \ldots, q_m]$ is an $m \times m$ orthonormal matrix.[15] Because $Q$ is orthonormal, we can write

(49) $a = Q^T x.$

The coefficients $a$ represent $x$ in a coordinate system whose axes are the basis $[q_1, \ldots, q_m]$, instead of the conventional basis $[e_1, \ldots, e_m]$. By using different basis functions $Q$, we can generate sets of coefficients with different properties. For example, we can express the discrete Fourier transform (DFT) in the form of (49), where the columns of $Q$ are harmonically–related rotating exponentials. With this basis, the coefficients $a$ tell us how much of the frequency corresponding to $q_i$ is contained in $x$.

For each vector observation $x_i$, the matrix $Q$ remains constant but a new vector $a_i$ of coefficients is generated. To emphasize this point, we re-write (48) as

(50) $x_i = Q a_i, \qquad i = 1, \ldots, N$

where i is the vector sample index (corresponding to the window position in Fig. 2) and N is the number of vector observations.

2.5.1 Development of the K–L Expansion

Figure 4 shows a scatterplot corresponding to a slowly–varying random process, of the type shown in Figure 2. A scatterplot is a collection of dots, where the $i$th dot is the point on the $m$–dimensional plane corresponding to the vector $x_i$. Because of obvious restrictions in drawing, we are limited here to the value $m = 2$. Because the process we have chosen in this case is slowly varying, the elements of each $x_i$ are highly correlated; i.e., knowledge of one element implies a great deal about the value of the other. This forces the scatterplot to be elliptical in shape (ellipsoidal in higher dimensions), concentrating along the principal diagonal in the $x_1$–$x_2$ plane. Let the quantities $a_1, a_2, \ldots, a_m$ be the lengths of the $m$ principal axes of the scatterplot ellipse. With highly correlated processes we find that $a_1 > a_2 > \cdots > a_m$. Typically, we find that the values $a_i$ diminish quickly with increasing $i$ in larger dimensional systems, when the process is highly correlated.

For the sake of contrast, Figure 5 shows a similar scatterplot, except the underlying random process is white. Here there is no correlation between adjacent samples of the process, so there is no diagonal concentration of the scatterplot in this case. This scatterplot is an $m$–dimensional spheroid.

As we see later in this section, if we wish to store or transmit such a random process, it is wasteful to do so using the conventional coordinate system $[e_1, e_2, \ldots, e_m]$ when the process is highly correlated. (Transmission using the conventional coordinate system is equivalent to transmitting the elements $x_1, x_2, \ldots, x_m$ of $x_i$ in sequence.) The inefficiency is a result of the fact that most of the information contained in a given sample $x_j$ of $x$ must be re–transmitted in adjacent and subsequent samples. In this section, we seek a transformed coordinate system which is more efficient in this respect. The motivation will become clearer towards the end of the section.

The proposed method of finding an optimum coordinate system in which to represent our random process is to find a basis vector $q_1 \in \mathbb{R}^m$ such that the corresponding coefficient $a_1 = q_1^T x$ has the maximum possible mean–squared value (variance). Then, we find a second basis vector $q_2$, constrained to be orthogonal to $q_1$, such that the variance of the coefficient $a_2 = q_2^T x$ is maximum. We continue in this way until we obtain a complete orthonormal basis $Q = [q_1, \ldots, q_m]$. Heuristically, we see from Figure 8 that the desired basis is the set of principal axes of the scatterplot ellipse. The benefits of this procedure will become clearer when we apply this technique to the compression of random processes.

The procedure to determine the $q_i$ is straightforward. The basis vector $q_1$ is given as the solution to the following problem:

(51) $q_1 = \arg\max_{\|q\|_2 = 1} E\left[ |q^T x_i|^2 \right]$

where the expectation is over all values of i. The constraint on the 2–norm of q is to prevent the solution from going to infinity. Eq. (51) can be written as

(52) $q_1 = \arg\max_{\|q\|_2 = 1} E\left[ q^T x x^T q \right] = \arg\max_{\|q\|_2 = 1} q^T E\left[ x x^T \right] q = \arg\max_{\|q\|_2 = 1} q^T R_{xx} q,$

where we have assumed a zero–mean process. The optimization problem above is precisely the same as that for the matrix norm of Section 2.3, where it is shown that the stationary points of the argument in (52) are the eigenvectors of $R_{xx}$. Therefore, the solution to (52) is $q_1 = v_1$, the largest eigenvector of $R_{xx}$. Similarly, $q_2, \ldots, q_m$ are the remaining successively decreasing eigenvectors of $R_{xx}$. Thus, the desired orthonormal matrix is the eigenvector matrix $V$ corresponding to the covariance matrix of the random process. The decomposition of the vector $x$ in this way is called the Karhunen-Loeve (KL) expansion of a random process.

In the sequel, the KL expansion is written using the following notation:

(53)xi=Vθi

and

(54)θi=VTxi,

where VRm×m is the orthonormal matrix of eigenvectors, which is the basis of the KL expansion, and θiRm is the vector of KL coefficients.

Thus, the coefficient θ1 of θ on average contains the most energy (variance) of all the coefficients in θ; θ2 is the coefficient which contains the next–highest variance, etc. The coefficient θm contains the least variance. This is in contrast to the conventional coordinate system, in which all axes have equal variances.
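
The computation of the KL basis and coefficients in (53)–(54) is easy to sketch numerically. The following Python/NumPy fragment is a minimal illustration, assuming zero-mean, real-valued vector samples stacked as the columns of a matrix X; the random-walk construction of the correlated sequence is purely illustrative and not taken from the notes.

```python
import numpy as np

# Minimal sketch of the KL expansion (53)-(54), assuming zero-mean, real-valued
# vector samples x_i stored as the columns of X (shape m x N).
def kl_basis(X):
    m, N = X.shape
    Rxx = (X @ X.T) / N                    # sample covariance estimate of E[x x^T]
    lam, V = np.linalg.eigh(Rxx)           # eigh returns ascending eigenvalues
    order = np.argsort(lam)[::-1]          # reorder so lambda_1 >= ... >= lambda_m
    return lam[order], V[:, order]

# Example: a crude, highly correlated (slowly varying) sequence
rng = np.random.default_rng(0)
x = np.cumsum(rng.standard_normal(1000))
X = np.lib.stride_tricks.sliding_window_view(x, 2).T   # m = 2 samples per vector
X = X - X.mean(axis=1, keepdims=True)
lam, V = kl_basis(X)
theta = V.T @ X                            # KL coefficients, eq. (54)
print(lam)                                 # variances of the KL coefficients (Property 8)
```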

By lecture 4, we will have sufficient knowledge to prove that the eigenvectors align themselves along the principal axes of the scatterplot ellipsoid of Figure 4. In highly correlated systems, due to the fact that the principal axes of the scatterplot ellipse have decreasing magnitudes (as shown in Figure 4) the variance of the smallest coefficients is typically much smaller than that of the larger coefficients.

Question: Suppose the process x is white, so that Rxx=E(xxT) is already diagonal, with equal diagonal elements; i.e., Rxx=σ2I, as in Figure 5. What is the K-L basis in this case?

To answer this, we see that all the eigenvalues of Rxx are repeated. Therefore, the eigenvector basis is not unique. In fact, in this case, any vector in Rm is an eigenvector of the matrix σ2I (the eigenvalue is σ2). Therefore, any orthonormal basis is a K-L basis for a white process. This concept is evident from the circular scatterplot of figure 5.

2.5.2 Properties of the KL Expansion

Property 7 The coefficients θ of the KL expansion are uncorrelated.

To prove this, we evaluate the covariance matrix Rθθ of θ, using the definition (54) as follows:

(55) $R_{\theta\theta} = E\left[\theta\theta^T\right] = E\left[V^T x x^T V\right] = V^T R_{xx} V = \Lambda.$

Since Rθθ is equal to the diagonal eigenvalue matrix Λ of Rxx, the KL coefficients are uncorrelated.

Property 8 The variance of the ith K–L coefficient θi is equal to the ith eigenvalue λi of Rxx.

The proof follows directly from (55); Rθθ=Λ.

Property 9 The variance of a highly correlated random process x concentrates in the first few KL coefficients.

This property may be justified intuitively from the scatterplot of Figure 4, due to the fact that the length of the first principal axis is greater than that of the second. (This effect becomes more pronounced in higher dimensions.) However here we wish to formally prove this property.

Let us denote the covariance matrix of the process shown in Fig. 2 as $R_2$, and that shown in Fig. 3 as $R_3$. We assume both processes are stationary with equal powers. Let $\alpha_i$ be the eigenvalues of $R_2$ and $\beta_i$ be the eigenvalues of $R_3$. Because $R_3$ is diagonal with equal diagonal elements, all the $\beta_i$ are equal. Our assumptions imply that the main diagonal elements of $R_2$ are equal to the main diagonal elements of $R_3$, and hence from Property 4, the traces, and therefore the eigenvalue sums, of the two covariance matrices are equal.

To obtain further insight into the behavior of the two sets of eigenvalues, we consider Hadamard’s inequality[16] which may be stated as:

Consider a square matrix $A \in \mathbb{R}^{m\times m}$. Then $\det A \le \prod_{i=1}^{m} a_{ii}$, with equality if and only if A is diagonal.

From Hadamard’s inequality, $\det R_2 < \det R_3$, and so also from Property 4, $\prod_{i=1}^{n}\alpha_i < \prod_{i=1}^{n}\beta_i$. Under the constraint $\sum_{i=1}^{n}\alpha_i = \sum_{i=1}^{n}\beta_i$, it follows that $\alpha_1 > \alpha_n$; i.e., the eigenvalues of $R_2$ are not equal. (We say the eigenvalues become disparate.) Thus, the variance in the first K-L coefficients of a correlated process is larger than that in the later K-L coefficients. Typically in a highly correlated system, only the first few coefficients have significant variance.

To illustrate this phenomenon further, consider the extreme case where the process becomes so correlated that all elements of its covariance matrix approach the same value. (This will happen if the process x[n] does not vary with time). Then, all columns of the covariance matrix are equal, the rank of $R_{xx}$ in this case becomes equal to one, and therefore only one eigenvalue is nonzero. Then all the energy of the process is concentrated into only the first K-L coefficient. In contrast, when the process is white and stationary, all the eigenvalues of $R_{xx}$ are equal, and the variance of the process is equally distributed amongst all the K–L coefficients. The point of this discussion is to indicate a general behavior of random processes, which is that as they become more highly correlated, the variance in the K-L coefficients concentrates in the first few elements. The variance in the remaining coefficients becomes negligible.

2.5.3 Applications of the K-L Expansion

Suppose a communications system transmits a stationary, zero–mean highly–correlated sequence x. This means that to transmit the elements of x directly, one sends a particular element xi of x using as many bits as is necessary to convey the information with the required fidelity. However, in sending the next element xi+1, almost all of the same information is sent over again, due to the fact that xi+1 is highly correlated with xi and its previous few samples. That is, xi+1 contains very little new information relative to xi. It is therefore seen that if x is highly correlated, transmitting the samples directly (i.e., using the conventional coordinate system) is very wasteful in terms of the number of required bits to transmit.

But if x is stationary and Rxx is known at the receiver[17], then it is possible for both the transmitter and receiver to “know” the eigenvectors of Rxx, the basis set. If the process is sufficiently highly correlated, then, because of the concentration properties of the K–L transform, the variance of the first few coefficients θ dominates that of the remaining ones. The later coefficients on average typically have a small variance and are not required to accurately represent the signal.

To implement this form of signal compression, let us say that an acceptable level of distortion is obtained by retaining only the first j significant coefficients. We form a truncated K-L coefficient vector θ^ in a similar manner to (54) as

(56) $\hat{\theta} = \begin{bmatrix} \theta_1 \\ \vdots \\ \theta_j \\ 0 \\ \vdots \\ 0 \end{bmatrix} = \begin{bmatrix} v_1^T \\ \vdots \\ v_j^T \\ 0^T \\ \vdots \\ 0^T \end{bmatrix} x,$

where coefficients θj+1,,θm are set to zero and therefore need not be transmitted. This means we can represent xi more compactly without sacrificing significant loss of quality; i.e., we have achieved signal compression.

An approximation x^ to the original signal can be reconstructed by:

(57)x^=Vθ^.

From Property 8, the mean–squared error ϵj2 in the KL reconstruction x^ is given as

(58)ϵj2=i=j+1mλi,

which corresponds to the sum of the truncated (smallest) eigenvalues. It is easy to prove that no other basis results in a smaller error. The error ϵj2 in the reconstructed x^ using any basis [q1,,qm] is given by

(59) $\epsilon_j^2 = \sum_{i=j+1}^{m} E\left[\,|q_i^T x|^2\,\right] = \sum_{i=j+1}^{m} q_i^T R_{xx}\, q_i,$

where the last line uses (51) and (52). We have seen previously that the eigenvectors are the stationary points of each term in the sum above. Since each term in the sum is non-negative, $\epsilon_j^2$ is minimized by minimizing each term individually. Therefore, the minimum of (59) is obtained when the $q_i$ are assigned the eigenvectors associated with the $m-j$ smallest eigenvalues. Since $v_i^T R_{xx} v_i = \lambda_i$ when $\|v_i\|_2 = 1$, the minimum of (59) coincides with (58) only when $q_i = v_i$. This completes the proof.

In speech applications for example, fewer than one tenth of the coefficients are needed for reconstruction with imperceptible degradation. Note that since R^xx is positive semi–definite, all eigenvalues are non–negative. Hence, the energy measure (58) is always non–negative for any value of j. This type of signal compression is the ultimate form of a type of coding known as transform coding.
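
The truncation-based compression of (56)–(58) can be sketched in the same way. The fragment below is illustrative only; the data matrix X (m × N, zero mean) and the number j of retained coefficients are assumed given, and the empirical reconstruction error is compared against the eigenvalue sum of (58).

```python
import numpy as np

# A sketch of the compression scheme of (56)-(58): keep only the first j KL
# coefficients of each vector sample and reconstruct.
def kl_compress_reconstruct(X, j):
    m, N = X.shape
    lam, V = np.linalg.eigh(X @ X.T / N)            # sample covariance eigendecomposition
    order = np.argsort(lam)[::-1]                   # descending eigenvalue order
    lam, V = lam[order], V[:, order]
    theta = V.T @ X                                 # KL coefficients, eq. (54)
    theta[j:, :] = 0.0                              # coefficients j+1,...,m are not transmitted
    X_hat = V @ theta                               # reconstruction, eq. (57)
    mse_empirical = np.mean(np.sum((X - X_hat) ** 2, axis=0))
    mse_predicted = lam[j:].sum()                   # eq. (58): sum of the truncated eigenvalues
    return X_hat, mse_empirical, mse_predicted      # the two errors agree (up to roundoff)
```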

Transform coding is now illustrated by an example. A process x[n] was generated by passing a unit-variance zero–mean white noise sequence w[n] through a 3rd-order digital lowpass Butterworth filter with a relatively low normalized cutoff frequency of 0.1, as shown in Fig. 6. Vector samples xi are extracted from the sequence x[n] as shown in Fig. 2. The filter removes the high-frequency components from the input, and so the resulting output process x[n] must vary slowly in time. Thus, the K–L expansion is expected to require only a few principal eigenvector components, and significant compression gains can be achieved.

We show this example for m=10. Listed below are the 10 eigenvalues corresponding to R^xx, the covariance matrix of x, generated from the output of the lowpass filter:

Eigenvalues: 0.5468, 0.1975, 0.1243$\times 10^{-1}$, 0.5112$\times 10^{-3}$, 0.2617$\times 10^{-4}$, 0.1077$\times 10^{-5}$, 0.6437$\times 10^{-7}$, 0.3895$\times 10^{-8}$, 0.2069$\times 10^{-9}$, 0.5761$\times 10^{-11}$

The error $\epsilon_j^2$ for j=2 is thus evaluated from the above data as 0.0130, which may be compared to the value 0.7573, which is the total eigenvalue sum. The normalized error is $0.0130/0.7573 \approx 0.0171$. Because this error may be considered low enough, only the first j=2 K-L components need be considered significant. In this case, we have a compression gain of 10/2=5; i.e., the KL expansion requires only one fifth of the bits relative to representing the signal directly.

The corresponding two principal eigenvectors are plotted in Fig. 7. These plots show the value of the kth element vk of the eigenvector, plotted against its index k for k=1,,m. These waveforms may be interpreted as functions of time.

In this case, we would expect that any observation xi can be expressed accurately as a linear combination of only the first two eigenvector waveforms shown in Fig. 7, whose coefficients θ^ are given by (56). In Fig. 8 we show samples of the true observation x shown as a waveform in time, compared with the reconstruction x^i formed from (57) using only the first j=2 eigenvectors. It is seen that the difference between the true and reconstructed vector samples is small, as expected.

One of the practical difficulties in using the K–L expansion for coding is that the eigenvector set V is not usually known at the receiver in practical cases when the observed signal is mildly or severely nonstationary (e.g. speech or video signals). In this case, the covariance matrix estimate $\hat{R}_{xx}$ is changing with time; hence so are the eigenvectors. Transmission of the eigenvector set to the receiver is expensive in terms of information and so is undesirable. This fact limits the explicit use of the K–L expansion for coding. However, it has been shown[18] that the discrete cosine transform (DCT), which is another form of orthonormal expansion whose basis consists of cosine–related functions, closely approximates the eigenvector basis for a certain wide class of signals. The DCT uses a fixed basis, independent of the signal, and hence is always known at the receiver. Transform coding using the DCT enjoys widespread practical use and is the fundamental idea behind the so–called JPEG and MPEG international standards for image and video coding. The search for other bases, particularly wavelet functions, to replace the eigenvector basis is a subject of ongoing research. Thus, even though the K–L expansion by itself is not of much practical value, the theoretical ideas behind it are of significant worth.

2.6 Example: Array Processing

Here, we present a further example of the concepts we have developed so far. This example is concerned with direction of arrival estimation using arrays of sensors.

Consider an array of M sensors (e.g., antennas) as shown in Fig. 9. Let there be K<M plane waves incident onto the array as shown. Assume the amplitudes of the incident waves do not change during the time taken for the wave to traverse the array. Also assume for the moment that the amplitude of the first incident wave at the first sensor is unity. Then, from the physics shown in Fig. 9, the signal vector x received by sampling each element of the array simultaneously, from the first incident wave alone, may be described in vector format by x=[1,ejϕ,ej2ϕ,,ej(M1)ϕ]T, where ϕ is the electrical phase–shift between adjacent elements of the array, due to the first incident wave.[19] When there are K incident signals, with corresponding amplitudes ak,k=1,,K, the effects of the K incident signals each add linearly together, each weighted by the corresponding amplitude ak, to form the received signal vector x. The resulting received signal vector, including the noise can then be written in the form

(60) $x_n = S\, a_n + w_n, \qquad n = 1,\ldots,N, \qquad x_n, w_n \in \mathbb{C}^{M\times 1},\; S \in \mathbb{C}^{M\times K},\; a_n \in \mathbb{C}^{K\times 1},$

where $w_n$ is an M-length noise vector at time n whose elements are independent random variables with zero mean and variance $\sigma^2$, i.e., $E(w_i^2) = \sigma^2$; the noise is assumed uncorrelated with the signal. The columns of $S = [s_1,\ldots,s_K]$, where $s_k = [1, e^{j\phi_k}, e^{j2\phi_k}, \ldots, e^{j(M-1)\phi_k}]^T$, are referred to as steering vectors, and $\phi_k,\ k=1,\ldots,K$, are the electrical phase–shift angles corresponding to the incident signals; the $\phi_k$ are assumed to be distinct. Finally, $a_n = [a_1,\ldots,a_K]_n^T$ is a vector of independent random variables, describing the amplitudes of each of the incident signals at time n.

In (60) we obtain N vector samples xnCM×1, n=1,,N by simultaneously sampling all array elements at N distinct points in time. Our objective is to estimate the directions of arrival ϕk of the plane waves relative to the array, by observing only the received signal.

Note K<M. Let us form the covariance matrix R of the received signal x:

(61) $R = E(xx^H) = E\left[(Sa + w)(a^H S^H + w^H)\right] = S\,E(aa^H)\,S^H + \sigma^2 I$

The last line follows because the noise is uncorrelated with the signal, thus forcing the cross–terms to zero. In the last line of (61) we have also used that fact that the covariance matrix of the noise contribution (second term) is σ2I. This follows because the elements of the noise vector w are independent with equal power. The first term of (61) we call Ro, which is the contribution to the covariance matrix due only to the signal.

Let's look at the structure of $R_o$:

$R_o = S \underbrace{E(aa^H)}_{\text{non-singular}} S^H$

From this structure, we may conclude that $R_o$ is rank K. This may be seen as follows. Let us define $A \equiv E(aa^H)$ and $B \equiv A S^H$. Because the $\phi_k$ are distinct, S is full rank (rank K), and because the $a_k$ are independent, A is full rank (K). Therefore the matrix $B \in \mathbb{C}^{K\times M}$ is of full rank K. Then, $R_o = SB$. From this last relation, we see that the ith ($i = 1,\ldots,M$) column of $R_o$ is a linear combination of the K columns of S, whose coefficients are the ith column of B. Because B is full rank, K linearly independent linear combinations of the K columns of S are used to form $R_o$. Thus $R_o$ is rank K. Because $K < M$, $R_o$ is rank deficient.
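
A quick numerical check of this rank argument is easy to set up. The sketch below builds $R_o = S\,E(aa^H)\,S^H$ for illustrative values of M, K and the electrical angles (these particular numbers are not from the notes) and confirms that the result has rank K.

```python
import numpy as np

# Sketch checking that R_o = S E(aa^H) S^H has rank K (noise-free case).
M, K = 8, 2
phi = np.array([0.3, 1.1])                          # distinct electrical angles (radians)
S = np.exp(1j * np.outer(np.arange(M), phi))        # steering matrix, columns s(phi_k)
A = np.diag([2.0, 1.0])                             # E(aa^H): independent signals, full rank
Ro = S @ A @ S.conj().T
print(np.linalg.matrix_rank(Ro))                    # -> 2 (= K), so Ro is rank deficient
```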

Let us now investigate the eigendecomposition on Ro, where λk are the eigenvalues of Ro:

(62)Ro=VΛVH

or

(63) $R_o = V\,\mathrm{diag}(\lambda_1,\ldots,\lambda_K,0,\ldots,0)\,V^H.$

Because $R_o \in \mathbb{C}^{M\times M}$ is rank K, it has K non-zero eigenvalues and $M-K$ zero eigenvalues. We enumerate the eigenvectors $v_1,\ldots,v_K$ as those associated with the largest K eigenvalues, and $v_{K+1},\ldots,v_M$ as those associated with the zero eigenvalues.[20] [21]

From the definition of an eigenvector, we have

(68)Rovi=0

or

(69)SASHvi=0,i=K+1,,M.

Since A=E(aaH) and S are full rank, the only way (69) can be satisfied is if the vi,i=K+1,,M are orthogonal to all columns of S=[s(ϕ1),,s(ϕK)]. Therefore we have

(70)skHvi=0,k=1,,K,i=K+1,,M,

We define the matrix $V_N \equiv [v_{K+1},\ldots,v_M]$. Therefore (70) may be written as

(71)SHVN=0.

We also have

(72)[1,ejϕk,ej2ϕk,,ej(M1)ϕk]HVN=0.

Up to now, we have considered only the noise–free case. What happens when the noise component σ2I is added to Ro to give Rxx in (61)? From Property 3, Lecture 1, we see that if the eigenvalues of Ro are λi, then those of Rxx are λi+σ2. The eigenvectors remain unchanged with the noise contribution, and (70) still holds when noise is present. Note these properties only apply to the true covariance matrix formed using expectations, rather than the estimated covariance matrix formed using time averages.

With this background in place we can now discuss the MUSIC[22] algorithm for estimating directions of arrival of plane waves incident onto arrays of sensors.

2.6.1 The MUSIC Algorithm[23]

We wish to estimate the unknown values $[\phi_1,\ldots,\phi_K]$ which comprise $S = [s(\phi_1),\ldots,s(\phi_K)]$. The MUSIC algorithm assumes the quantity K is known. In the practical case, where expectations cannot be evaluated because they require infinite data, we form an estimate $\hat{R}$ of R based on a finite number N of observations as follows:

$\hat{R} = \frac{1}{N}\sum_{n=1}^{N} x_n x_n^H.$

Only as $N \to \infty$ does $\hat{R} \to R$.

An estimate $\hat{V}_N$ of $V_N$ may be formed from the eigenvectors associated with the smallest $M-K$ eigenvalues of $\hat{R}$. Because of the finite N and the presence of noise, (72) only holds approximately when $\hat{V}_N$ is used in place of $V_N$. Thus, a reasonable estimate of the desired directions of arrival may be obtained by finding values of the variable ϕ for which the expression on the left of (72) is small instead of exactly zero. Thus, we determine K estimates $\hat{\phi}$ which locally satisfy

(73) $\hat{\phi} = \arg\min_{\phi}\, \left\| s^H(\phi)\, \hat{V}_N \right\|$

By convention, it is desirable to express (73) as a spectrum–like function, where a peak instead of a null represents a desired signal. It is also convenient to use the squared-norm instead of the norm itself. Thus, the MUSIC “spectrum” P(ϕ) is defined as:

$P(\phi) = \frac{1}{s(\phi)^H \hat{V}_N \hat{V}_N^H s(\phi)}$

It will look something like what is shown in Fig. 10 when there are K=2 incident signals.
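
A minimal simulation of the whole MUSIC procedure might look as follows. Everything in the scenario (number of sensors, angles, noise level, snapshot count, the peak-picking at the end) is an illustrative assumption; only the model (60), the sample covariance, and the spectrum P(φ) follow the development above.

```python
import numpy as np

# A minimal MUSIC sketch following (60)-(73); scenario values are illustrative.
rng = np.random.default_rng(1)
M, K, N = 8, 2, 200
phi_true = np.array([0.5, 1.4])                         # electrical angles of arrival
S = np.exp(1j * np.outer(np.arange(M), phi_true))       # M x K steering matrix
a = (rng.standard_normal((K, N)) + 1j * rng.standard_normal((K, N))) / np.sqrt(2)
w = 0.1 * (rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N))) / np.sqrt(2)
X = S @ a + w                                           # received snapshots, eq. (60)

R_hat = X @ X.conj().T / N                              # sample covariance estimate
lam, V = np.linalg.eigh(R_hat)                          # ascending eigenvalues
Vn = V[:, :M - K]                                       # noise subspace: M-K smallest

phis = np.linspace(0, np.pi, 1000)
s = np.exp(1j * np.outer(np.arange(M), phis))           # candidate steering vectors
P = 1.0 / np.sum(np.abs(Vn.conj().T @ s) ** 2, axis=0)  # MUSIC "spectrum"

# crude peak picking: the K largest local maxima of P
peaks = [i for i in range(1, len(P) - 1) if P[i] > P[i - 1] and P[i] > P[i + 1]]
top = sorted(peaks, key=lambda i: P[i])[-K:]
print(sorted(phis[top]))                                # estimates near phi_true
```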

2.7 TO SUMMARIZE

  • An eigenvector x of a matrix A is such that Ax points in the same direction as x.
  • The covariance matrix Rxx of a random process x is defined as E(xxH). For stationary processes, Rxx completely characterizes the process, and is closely related to its covariance function. In practice, the expectation operation is replaced by a time-average.
  • The eigenvectors of Rxx form a natural basis to represent x, since it is only the eigenvectors which diagonalize Rxx. This leads to the coefficients a of the corresponding expansion x=Va being uncorrelated. This has significant application in speech/video encoding.
  • The expectations of the squares of the coefficients above are the eigenvalues of Rxx. This gives an idea of the relative power present along each eigenvector.
  • If the variables x are Gaussian, then the K-L coefficients are independent. This greatly simplifies receiver design and analysis.

Many of these points are a direct consequence of the fact that it is only the eigenvectors which can diagonalize a matrix. That is basically the only reason why eigenvalues/eigenvectors are so useful. I hope this serves to demystify this subject. Once you see that it is only the eigenvectors which diagonalize, the property that they are a natural basis for the process x becomes easy to understand.

An interpretation of an eigenvalue is that it represents the average energy in each coefficient of the K–L expansion.

3 The Singular Value Decomposition (SVD)

In this lecture we learn about one of the most fundamental and important matrix decompositions of linear algebra: the SVD. It bears some similarity with the eigendecomposition (ED), but is more general. Usually, the ED is of interest only on symmetric square matrices, but the SVD may be applied to any matrix. The SVD gives us important information about the rank, the column and row spaces of the matrix, and leads to very useful solutions and interpretations of least squares problems. We also discuss the concept of matrix projectors, and their relationship with the SVD.

3.1 The Singular Value Decomposition (SVD)

We have found so far that the eigendecomposition is a useful analytic tool. However, it is only applicable on square symmetric matrices. We now consider the SVD, which may be considered a generalization of the ED to arbitrary matrices. Thus, with the SVD, all the analytical uses of the ED which before were restricted to symmetric matrices may now be applied to any form of matrix, regardless of size, whether it is symmetric or nonsymmetric, rank deficient, etc.

Theorem 1 Let ARm×n. Then A can be decomposed according to the singular value decomposition as

(1)A=UΣVT

where U and V are orthonormal and

URm×m,VRn×n

and

$\Sigma = \mathrm{diag}(\sigma_1,\sigma_2,\ldots,\sigma_p) \in \mathbb{R}^{m\times n}, \qquad p = \min(m,n)$

where

$\sigma_1 \ge \sigma_2 \ge \sigma_3 \ge \cdots \ge \sigma_p \ge 0.$

The matrix Σ must be of dimension Rm×n (i.e., the same size as A), to maintain dimensional consistency of the product in (1). It is therefore padded with zeros either on the bottom or to the right of the diagonal block, depending on whether m>n or m<n, respectively. We denote the square p×p diagonal matrix as Σ~; the m×n diagonal matrix containing the zero blocks is denoted as Σ.

Since U and V are orthonormal, we may also write (1) in the form:

(2) $\underbrace{U^T}_{m\times m}\,\underbrace{A}_{m\times n}\,\underbrace{V}_{n\times n} = \underbrace{\Sigma}_{m\times n}$

where Σ is a diagonal matrix. The values σi which are defined to be positive, are referred to as the singular values of A. The columns ui and vi of U and V are respectively called the left and right singular vectors of A.

The SVD corresponding to (1) may be shown diagramatically in the following way:

(3) $\underset{m\times n}{A} = \underset{m\times m}{\begin{bmatrix} u_1 & \cdots & u_m \end{bmatrix}}\; \underset{m\times n}{\begin{bmatrix} \sigma_1 & & \\ & \ddots & \\ & & \sigma_p \\ & 0 & \end{bmatrix}}\; \underset{n\times n}{\begin{bmatrix} v_1^T \\ \vdots \\ v_n^T \end{bmatrix}}$

Each $u_i$ above represents a column of U, and each $v_i^T$ a row of $V^T$ (i.e., a column of V).
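
Numerically, the decomposition of (1)–(3) can be obtained directly; for instance, NumPy's np.linalg.svd returns U, the singular values, and $V^T$. The small sketch below simply verifies the factorization for an illustrative 2 × 3 matrix.

```python
import numpy as np

# Sketch of (1)-(3): full_matrices=True gives U (m x m), V^T (n x n), and the
# p = min(m, n) singular values in descending order.
A = np.array([[1.0, 2.0, 0.0],
              [3.0, 1.0, 1.0]])                 # m = 2, n = 3
U, s, Vt = np.linalg.svd(A, full_matrices=True)
Sigma = np.zeros(A.shape)
Sigma[:len(s), :len(s)] = np.diag(s)            # pad to m x n as in (3)
print(np.allclose(A, U @ Sigma @ Vt))           # True: A = U Sigma V^T
print(s)                                        # sigma_1 >= sigma_2 >= 0
```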

3.2 Existence Proof of the SVD

Consider two vectors x and y where $\|x\|_2 = \|y\|_2 = 1$, such that $Ax = \sigma y$, where $\sigma = \|A\|_2$. The fact that such vectors x and y can exist follows from the definition of the matrix 2-norm. We define orthonormal matrices U and V so that x and y form their first columns, as follows:

$U = [y,\ U_1] \qquad V = [x,\ V_1]$

That is, $U_1$ consists of a set of non–unique orthonormal columns which are mutually orthogonal and orthogonal to y; similarly for $V_1$.

We then define a matrix A1 as

(4)UTAV=A1=[yTU1T]A[x,V1]

The matrix A1 has the following structure:

(5) $\underbrace{\begin{bmatrix} y^T \\ U_1^T \end{bmatrix}}_{\text{orthonormal}} A \underbrace{\begin{bmatrix} x & V_1 \end{bmatrix}}_{\text{orthonormal}} = \begin{bmatrix} y^T \\ U_1^T \end{bmatrix}\begin{bmatrix} \sigma y & A V_1 \end{bmatrix} = \begin{bmatrix} \sigma & w^T \\ 0 & B \end{bmatrix} \equiv A_1$ (with block rows of sizes 1 and $m-1$, and block columns of sizes 1 and $n-1$),

where $B \equiv U_1^T A V_1$. The 0 in the (2,1) block above follows from the fact that $U_1 \perp y$, because U is orthonormal.

Now, we post-multiply both sides of (5) by the vector $\begin{bmatrix}\sigma \\ w\end{bmatrix}$ and take 2-norms:

(6) $\left\| A_1 \begin{bmatrix}\sigma \\ w\end{bmatrix} \right\|_2^2 = \left\| \begin{bmatrix} \sigma & w^T \\ 0 & B \end{bmatrix}\begin{bmatrix}\sigma \\ w\end{bmatrix} \right\|_2^2 \ge (\sigma^2 + w^T w)^2.$

This follows because the term on the extreme right is only the first element of the vector product of the middle term. But, as we have seen, matrix p-norms obey the following property:

(7) $\|Ax\|_2 \le \|A\|_2\, \|x\|_2.$

Therefore using (6) and (7), we have

(8) $\|A_1\|_2^2 \left\|\begin{bmatrix}\sigma \\ w\end{bmatrix}\right\|_2^2 \;\ge\; \left\| A_1 \begin{bmatrix}\sigma \\ w\end{bmatrix} \right\|_2^2 \;\ge\; (\sigma^2 + w^T w)^2.$

Note that $\left\|\begin{bmatrix}\sigma \\ w\end{bmatrix}\right\|_2^2 = \sigma^2 + w^T w$. Dividing (8) by this quantity, we obtain

(9) $\|A_1\|_2^2 \ge \sigma^2 + w^T w.$

But, we defined $\sigma = \|A\|_2$. Therefore, the following must hold:

(10) $\sigma = \|A\|_2 = \|U^T A V\|_2 = \|A_1\|_2$

where the equality on the right follows because the matrix 2-norm is invariant to matrix pre- and post-multiplication by an orthonormal matrix. By comparing (9) and (10), we have the result w=0.

Substituting this result back into (5), we now have

(11) $A_1 = \begin{bmatrix} \sigma & 0 \\ 0 & B \end{bmatrix}.$

The whole process repeats using only the component B, until An becomes diagonal.

It is instructive to consider an alternative proof for the SVD. The following is useful because it is a constructive proof, which shows us how to form the components of the SVD.

Theorem 2 Let $A \in \mathbb{R}^{m\times n}$ be a rank r matrix ($r \le p = \min(m,n)$). Then there exist orthonormal matrices U and V such that

(12) $U^T A V = \begin{bmatrix} \tilde{\Sigma} & 0 \\ 0 & 0 \end{bmatrix}$

where

(13) $\tilde{\Sigma} = \mathrm{diag}(\sigma_1,\ldots,\sigma_r), \qquad \sigma_i > 0.$

Proof: Consider the square symmetric positive semi–definite matrix ATA[24]. Let the eigenvalues greater than zero be σ12,σ22,,σr2. Then, from our knowledge of the eigendecomposition, there exists an orthonormal matrix VRn×n such that

(14) $V^T A^T A V = \begin{bmatrix} \tilde{\Sigma}^2 & 0 \\ 0 & 0 \end{bmatrix},$

where $\tilde{\Sigma}^2 = \mathrm{diag}(\sigma_1^2,\ldots,\sigma_r^2)$. We now partition V as $[V_1\ V_2]$, where $V_1 \in \mathbb{R}^{n\times r}$. Then (14) has the form

(15) $\begin{bmatrix} V_1^T \\ V_2^T \end{bmatrix} A^T A\, \begin{bmatrix} \underbrace{V_1}_{r} & \underbrace{V_2}_{n-r} \end{bmatrix} = \begin{bmatrix} \tilde{\Sigma}^2 & 0 \\ 0 & 0 \end{bmatrix}.$

Then by equating corresponding blocks in (15) we have

(16) $V_1^T A^T A V_1 = \tilde{\Sigma}^2 \qquad (r\times r)$
(17) $V_2^T A^T A V_2 = 0 \qquad ((n-r)\times(n-r))$

From (16), we can write

(18) $\tilde{\Sigma}^{-1} V_1^T A^T A V_1 \tilde{\Sigma}^{-1} = I.$

Then, we define the matrix U1Rm×r from (18) as

(19) $U_1 = A V_1 \tilde{\Sigma}^{-1}.$

Then from (18) we have U1TU1=I and it follows that

(20)U1TAV1=Σ~.

From (17) we also have

(21)AV2=0.

We now choose a matrix $U_2 \in \mathbb{R}^{m\times(m-r)}$ so that $U = [U_1\ U_2]$ is orthonormal. Then, from (19) and because $U_1 \perp U_2$, we have

(22) $U_2^T U_1 = U_2^T A V_1 \tilde{\Sigma}^{-1} = 0.$

Therefore

(23)U2TAV1=0.

Combining (20), (21) and (23), we have

(24) $U^T A V = \begin{bmatrix} U_1^T A V_1 & U_1^T A V_2 \\ U_2^T A V_1 & U_2^T A V_2 \end{bmatrix} = \begin{bmatrix} \tilde{\Sigma} & 0 \\ 0 & 0 \end{bmatrix}$

The proof can be repeated using an eigendecomposition on the matrix AATRm×m instead of on ATA. In this case, the roles of the orthonormal matrices V and U are interchanged.
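
The constructive proof just given translates directly into a computation. The following sketch follows (14)–(20) for an illustrative full-column-rank matrix (so that $V_1 = V$ and no zero eigenvalues need special handling); that simplification is an assumption made here for brevity.

```python
import numpy as np

# Build V1, Sigma-tilde and U1 from the eigendecomposition of A^T A, as in (14)-(20).
A = np.array([[1.0, 2.0],
              [0.0, 1.0],
              [2.0, 0.0]])                      # m = 3, n = 2, rank r = 2
lam, V = np.linalg.eigh(A.T @ A)                # ascending eigenvalues of A^T A
order = np.argsort(lam)[::-1]
lam, V1 = lam[order], V[:, order]
Sig = np.diag(np.sqrt(lam))                     # Sigma-tilde = diag(sigma_1, ..., sigma_r)
U1 = A @ V1 @ np.linalg.inv(Sig)                # eq. (19): U1 = A V1 Sigma-tilde^{-1}
print(np.allclose(U1.T @ U1, np.eye(2)))        # U1 has orthonormal columns
print(np.allclose(U1 @ Sig @ V1.T, A))          # A = U1 Sigma-tilde V1^T
```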

The above proof is useful for several reasons:

  • It is short and elegant.
  • We can also identify which part of the SVD is not unique. Here, we assume that $A^TA$ has no repeated non–zero eigenvalues. Because $V_2$ consists of the eigenvectors corresponding to the zero eigenvalues of $A^TA$, $V_2$ is not unique when there are repeated zero eigenvalues. This happens when $m < n-1$ (i.e., A is sufficiently short), or when the nullity of A is 2 or greater, or a combination of these conditions.
  • By its construction, the matrix $U_2 \in \mathbb{R}^{m\times(m-r)}$ is not unique whenever it consists of two or more columns. This happens when $m - r \ge 2$. It is left as an exercise to show that similar conclusions on the uniqueness of U and V can be made when the proof is developed using the matrix $AA^T$.

3.3 Partitioning the SVD

Here we assume that A has $r \le p$ non-zero singular values (and $p-r$ zero singular values). Later, we see that r=rank(A). For convenience of notation, we arrange the singular values as:

$\underbrace{\sigma_1 \ge \cdots \ge \sigma_r}_{r\ \text{non-zero s.v.'s}} \;>\; \underbrace{\sigma_{r+1} = \cdots = \sigma_p = 0}_{p-r\ \text{zero s.v.'s}}$

In the remainder of this lecture, we use the SVD partitioned in both U and V. We can write the SVD of A in the form

(25) $A = \begin{bmatrix} U_1 & U_2 \end{bmatrix}\begin{bmatrix} \tilde{\Sigma} & 0 \\ 0 & 0 \end{bmatrix}\begin{bmatrix} V_1^T \\ V_2^T \end{bmatrix}$

where $\tilde{\Sigma} = \mathrm{diag}(\sigma_1,\ldots,\sigma_r) \in \mathbb{R}^{r\times r}$, and U is partitioned as

(26) $U = \begin{bmatrix} \underbrace{U_1}_{r} & \underbrace{U_2}_{m-r} \end{bmatrix} \in \mathbb{R}^{m\times m}$

The columns of U1 are the left singular vectors associated with the r nonzero singular values, and the columns of U2 are the left singular vectors associated with the zero singular values. V is partitioned in an analogous manner:

(27) $V = \begin{bmatrix} \underbrace{V_1}_{r} & \underbrace{V_2}_{n-r} \end{bmatrix} \in \mathbb{R}^{n\times n}$

3.4 Interesting Properties and Interpretations of the SVD

The above partition reveals many interesting properties of the SVD:

3.4.1 rank(A) = r

Using (25), we can write A as

(28) $A = \begin{bmatrix} U_1 & U_2 \end{bmatrix}\begin{bmatrix} \tilde{\Sigma} V_1^T \\ 0 \end{bmatrix} = U_1 \tilde{\Sigma} V_1^T = U_1 B$

where $B \equiv \tilde{\Sigma} V_1^T \in \mathbb{R}^{r\times n}$. From (28) it is clear that each column of A is a linear combination of the columns of $U_1$, whose coefficients are given by the corresponding column of B. But since $U_1$ has only r columns ($r \le n$), there can be at most r linearly independent columns in A. It follows from the definition of rank that rank(A)=r.

This point is analogous to the case previously considered in Lecture 2, where we saw rank is equal to the number of non-zero eigenvalues, when A is a square symmetric matrix. In this case however, the result applies to any matrix. This is another example of how the SVD is a generalization of the eigendecomposition.

Determination of rank when $\sigma_1,\ldots,\sigma_r$ are distinctly greater than zero, and when $\sigma_{r+1},\ldots,\sigma_p$ are exactly zero, is easy. But often in practice, due to finite precision arithmetic and fuzzy data, $\sigma_r$ may be very small, and $\sigma_{r+1}$ may be not quite zero. Hence, in practice, determination of rank is not so easy. A common method is to declare rank(A)=r if $\sigma_{r+1} \le \epsilon$, where ε is a small number specific to the problem considered.
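
A sketch of this thresholding idea is shown below; the matrix, the perturbation level, and the tolerance ε are all illustrative choices, not values from the notes.

```python
import numpy as np

# Numerical rank: count singular values above a problem-dependent tolerance eps.
def numerical_rank(A, eps):
    s = np.linalg.svd(A, compute_uv=False)
    return int(np.sum(s > eps))

A = np.outer([1.0, 2.0, 3.0], [1.0, 0.0, 1.0])                    # exactly rank 1
A += 1e-10 * np.random.default_rng(0).standard_normal((3, 3))     # "fuzzy" data
print(numerical_rank(A, eps=1e-6))    # -> 1: the perturbation is below the threshold
print(np.linalg.matrix_rank(A))       # -> 3 with NumPy's default (much tighter) tolerance
```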

3.4.2 N(A)=R(V2)

Recall the nullspace $N(A) = \{x \neq 0 \mid Ax = 0\}$. So, we investigate the set {x} such that Ax=0. Let $x \in \mathrm{span}(V_2)$; i.e., $x = V_2 c$, where $c \in \mathbb{R}^{n-r}$. By substituting (25) for A, and by noting that $V_1 \perp V_2$ and that $V_1^T V_1 = I$, we have

(29) $Ax = \begin{bmatrix} U_1 & U_2 \end{bmatrix}\begin{bmatrix} \tilde{\Sigma} & 0 \\ 0 & 0 \end{bmatrix}\begin{bmatrix} 0 \\ c \end{bmatrix} = 0.$

Thus, span(V2) is at least a subspace of N(A). However, if x contains any components of V1, then (29) will not be zero. But since V=[V1V2] is a complete basis in Rn, we see that V2 alone is a basis for the nullspace of A.

3.4.3 R(A)=R(U1)

Recall that the definition of the range R(A) is $\{y \mid y = Ax,\ x \in \mathbb{R}^n\}$. From (25),

(30) $Ax = \begin{bmatrix} U_1 & U_2 \end{bmatrix}\begin{bmatrix} \tilde{\Sigma} & 0 \\ 0 & 0 \end{bmatrix}\begin{bmatrix} V_1^T \\ V_2^T \end{bmatrix} x = \begin{bmatrix} U_1 & U_2 \end{bmatrix}\begin{bmatrix} \tilde{\Sigma} & 0 \\ 0 & 0 \end{bmatrix}\begin{bmatrix} d_1 \\ d_2 \end{bmatrix}$

where

(31) $\begin{bmatrix} d_1 \\ d_2 \end{bmatrix} = \begin{bmatrix} V_1^T \\ V_2^T \end{bmatrix} x, \qquad d_1 \in \mathbb{R}^{r},\ d_2 \in \mathbb{R}^{n-r}.$

From the above we have

(32) $Ax = \begin{bmatrix} U_1 & U_2 \end{bmatrix}\begin{bmatrix} \tilde{\Sigma} d_1 \\ 0 \end{bmatrix} = U_1(\tilde{\Sigma} d_1)$

We see that as x moves throughout Rn, the quantity Σ~d1 moves throughout Rr. Thus, the quantity y=Ax in this context consists of all linear combinations of the columns of U1. Thus, an orthonormal basis for R(A) is U1.

3.4.4 R(AT)=R(V1)

Recall that R(AT) is the set of all linear combinations of rows of A. Our property can be seen using a transposed version of the argument in Section 3.4.3 above. Thus, V1 is an orthonormal basis for the rows of A.

3.4.5 R(A)⊥=R(U2)

From Sect. 3.4.3, we see that R(A)=R(U1). Since from (25) $U_1 \perp U_2$, $U_2$ is a basis for the orthogonal complement of R(A). Hence the result.

3.4.6 ‖A‖2=σ1=σmax

This is easy to see from the definition of the 2-norm and the ellipsoid example of section 3.6.

3.4.7 Inverse of A

If the svd of a square matrix A is given, it is easy to find the inverse. Of course, we must assume A is full rank, (which means σi>0) for the inverse to exist. The inverse of A is given from the svd, using the familiar rules, as

(33)A1=VΣ1UT.

The evaluation of $\Sigma^{-1}$ is easy because Σ is square and diagonal. Note that this treatment indicates that the singular values of $A^{-1}$ are $[\sigma_n^{-1}, \sigma_{n-1}^{-1}, \ldots, \sigma_1^{-1}]$, in descending order.

3.4.8 The SVD diagonalizes any system of equations

Consider the system of equations Ax=b, for an arbitrary matrix A. Using the SVD of A, we have

(34)UΣVTx=b.

Let us now represent b in the basis U, and x in the basis V, in the same way as in Sect. 3.6. We therefore have

(35) $c = \begin{bmatrix} c_1 \\ c_2 \end{bmatrix} = \begin{bmatrix} U_1^T \\ U_2^T \end{bmatrix} b, \qquad c_1 \in \mathbb{R}^{r},\ c_2 \in \mathbb{R}^{m-r},$

and

(36) $d = \begin{bmatrix} d_1 \\ d_2 \end{bmatrix} = \begin{bmatrix} V_1^T \\ V_2^T \end{bmatrix} x, \qquad d_1 \in \mathbb{R}^{r},\ d_2 \in \mathbb{R}^{n-r}.$

Substituting the above into (34), the system of equations becomes

(37)Σd=c.

This shows that as long as we choose the correct bases, any system of equations can become diagonal. This property represents the power of the SVD; it allows us to transform arbitrary algebraic structures into their simplest forms.

If m>n or if rank $r < \min(m,n)$, then the system of equations Ax=b can only be satisfied if $b \in R(U_1)$. To see this, note that Σ above has an $(m-r)\times n$ block of zeros below the diagonal block of nonzero singular values. Thus, the lower $m-r$ elements of the left-hand side of (37) are all zero. Then if the equality of (37) is to be satisfied, $c_2$ must also be zero. This means that $U_2^T b = 0$, or that $b \in R(U_1)$.

Further, if n>m, or if $r < \min(m,n)$, then, if $x_o$ is a solution to Ax=b, $x_o + V_2 z$ is also a solution, where $z \in \mathbb{R}^{n-r}$. This follows because, as we have seen, $V_2$ is a basis for N(A); thus $A V_2 z = 0$, and $A(x_o + V_2 z) = A x_o = b$.
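
These observations can be checked numerically. The sketch below transforms an illustrative consistent system into the diagonal form (37), verifies that $c_2 = 0$, and solves for x in the basis V.

```python
import numpy as np

# Sketch of Sect. 3.4.8: change bases with U and V so that A x = b becomes Sigma d = c.
A = np.array([[1.0, 0.0],
              [2.0, 0.0],
              [0.0, 3.0]])                        # m = 3, n = 2, rank r = 2
U, s, Vt = np.linalg.svd(A, full_matrices=True)
r = int(np.sum(s > 1e-12))

b = A @ np.array([1.0, -2.0])                     # a consistent right-hand side
c = U.T @ b
print(np.allclose(c[r:], 0.0))                    # True: b lies in R(U1)
d1 = c[:r] / s[:r]                                # solve the diagonal system Sigma~ d1 = c1
x = Vt.T[:, :r] @ d1                              # x = V1 d1 (plus any V2 z if n > r)
print(np.allclose(A @ x, b))                      # True
```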

3.4.9 The “rotation” interpretation of the SVD

From the SVD relation A=UΣVT, we have

(38)AV=UΣ.

Note that since Σ is diagonal, the matrix UΣ on the right has orthogonal columns, whose 2–norm’s are equal to the corresponding singular value. We can therefore interpret the matrix V as an orthonormal matrix which rotates the rows of A so that the result is a matrix with orthogonal columns. Likewise, we have

(39)UTA=ΣVT.

The matrix ΣVT on the right has orthogonal rows with 2–norm equal to the corresponding singular value. Thus, the orthonormal matrix UT operates (rotates) the columns of A to produce a matrix with orthogonal rows.

In the case where m>n, (A is tall), then the matrix Σ is also tall, with zeros in the bottom mn rows. Then, only the first n columns of U are relevant in (38), and only the first n rows of UT are relevant in (39). When m<n, a corresponding transposed statement replacing U with V can be made.

3.5 Relationship between SVD and ED

It is clear that the eigendecomposition and the singular value decomposition share many properties in common. The price we pay for being able to perform a diagonal decomposition on an arbitrary matrix is that we need two orthonormal matrices instead of just one, as is the case for square symmetric matrices. In this section, we explore further relationships between the ED and the SVD.

Using (25), we can write

(40) $A^T A = V\begin{bmatrix} \tilde{\Sigma} & 0 \\ 0 & 0 \end{bmatrix} U^T U \begin{bmatrix} \tilde{\Sigma} & 0 \\ 0 & 0 \end{bmatrix} V^T = V\begin{bmatrix} \tilde{\Sigma}^2 & 0 \\ 0 & 0 \end{bmatrix} V^T.$

Thus it is apparent that the eigenvectors V of the matrix $A^TA$ are the right singular vectors of A, and that the squares of the singular values of A are the corresponding nonzero eigenvalues. Note that if A is short (m<n) and full rank, the matrix $A^TA$ will contain $n-m$ additional zero eigenvalues that are not included as singular values of A. This follows because the rank of the matrix $A^TA$ is m when A is full rank, yet the size of $A^TA$ is n×n.

As discussed in Golub and Van Loan, the SVD is numerically more stable to compute than the ED. However, in the case where $n \gg m$, the matrix V of the SVD of A becomes large, which means the SVD of A becomes more costly to compute relative to the eigendecomposition of the smaller matrix $AA^T$.

Further, we can also say, using the form AAT, that

(41) $AA^T = U\begin{bmatrix} \tilde{\Sigma} & 0 \\ 0 & 0 \end{bmatrix} V^T V \begin{bmatrix} \tilde{\Sigma} & 0 \\ 0 & 0 \end{bmatrix} U^T = U\begin{bmatrix} \tilde{\Sigma}^2 & 0 \\ 0 & 0 \end{bmatrix} U^T$

which indicates that the eigenvectors of $AA^T$ are the left singular vectors U of A, and the squares of the singular values of A are the nonzero eigenvalues of $AA^T$. Notice that in this case, if A is tall and full rank, the matrix $AA^T$ will contain $m-n$ additional zero eigenvalues that are not included as singular values of A.

We now compare the fundamental defining relationships for the ED and the SVD: For the ED, if A is symmetric, we have:

$A = Q\Lambda Q^T \quad\Longleftrightarrow\quad AQ = Q\Lambda,$

where Q is the matrix of eigenvectors, and Λ is the diagonal matrix of eigenvalues. Writing this relation column-by-column, we have the familiar eigenvector/eigenvalue relationship:

(42)Aqi=λiqii=1,,n.

For the SVD, we have

$A = U\Sigma V^T \quad\Longleftrightarrow\quad AV = U\Sigma$

or

(43)Avi=σiuii=1,,p,

where p=min(m,n). Also, since $A^T = V\Sigma^T U^T$ implies $A^T U = V\Sigma^T$, we have

(44)ATui=σivii=1,,p.

Thus, by comparing (42), (43), and (44), we see that the singular vectors and singular values obey a relation which is similar to that which defines the eigenvectors and eigenvalues. However, we note that in the SVD case, the fundamental relationship expresses the left singular vectors in terms of the right singular vectors, and vice-versa, whereas the eigenvectors are expressed in terms of themselves.

Exercise: compare the ED and the SVD on a square symmetric matrix, when i) A is positive definite, and ii) when A has some positive and some negative eigenvalues.

3.6 Ellipsoidal Interpretation of the SVD

The singular values of $A \in \mathbb{R}^{m\times n}$ are the lengths of the semi-axes of the hyperellipsoid E given by:

$E = \{\, y \mid y = Ax,\ \|x\|_2 = 1 \,\}.$

That is, E is the set of points mapped out as x takes on all possible values such that $\|x\|_2 = 1$, as shown in Fig. 1. To appreciate this point, let us look at the set of y corresponding to $\{x \mid \|x\|_2 = 1\}$. We take

(45)y=Ax=UΣVTx.

Let us change bases for both x and y. Define

(46)c=UTyd=VTx.

Then (45) becomes

(47)c=Σd.

We note that d2=1 if x2=1. Thus, our problem is transformed into observing the set {c} corresponding to the set {dd2=1}. The set {c} can be determined by evaluating 2-norms on each side of (47):

(48) $\sum_{i=1}^{p}\left(\frac{c_i}{\sigma_i}\right)^2 = \sum_{i=1}^{p} d_i^2 = 1.$

We see that the set {c} defined by (48) is indeed the canonical form of an ellipse in the basis U. Thus, the principal axes of the ellipse are aligned along the columns ui of U, with lengths equal to the corresponding singular value σi. This interpretation of the SVD is useful later in our study of condition numbers.
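
This ellipsoidal picture is easy to reproduce numerically: push a dense sampling of the unit circle through an illustrative matrix A and compare the longest and shortest image vectors with the singular values.

```python
import numpy as np

# Sketch of the ellipsoid interpretation: the image of the unit circle under A
# is an ellipse whose semi-axis lengths are the singular values of A.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
t = np.linspace(0, 2 * np.pi, 2000)
X = np.vstack([np.cos(t), np.sin(t)])      # points with ||x||_2 = 1
Y = A @ X                                  # the ellipse E = {Ax : ||x||_2 = 1}
lengths = np.linalg.norm(Y, axis=0)
print(lengths.max(), lengths.min())        # approx sigma_1 and sigma_2
print(np.linalg.svd(A, compute_uv=False))  # the singular values themselves
```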

3.7 An Interesting Theorem

First, we realize that the SVD of A provides a “sum of outer-products” representation:

(49)A=UΣVT=i=1pσiuiviT,p=min(m,n).

Given ARm×n with rank r, then what is the matrix BRm×n with rank k<r closest to A in 2-norm? What is this 2-norm distance? This question is answered in the following theorem:

Theorem 3 Define

(50)Ak=i=1kσiuiviT,kr,

then

$\min_{\mathrm{rank}(B)=k} \|A - B\|_2 = \|A - A_k\|_2 = \sigma_{k+1}.$

In words, this says that the closest rank-k (k<r) matrix B to A in the 2–norm sense is given by $A_k$. $A_k$ is formed from A by excluding contributions in (49) associated with the smallest singular values.

Proof: Since UTAkV=diag(σ1,,σk,0,,0) it follows that rank(Ak)=k, and that

(51) $\|A - A_k\|_2 = \|U^T(A - A_k)V\|_2 = \|\mathrm{diag}(0,\ldots,0,\sigma_{k+1},\ldots,\sigma_r,0,\ldots,0)\|_2 = \sigma_{k+1},$

where the first line follows from the fact that the 2-norm of a matrix is invariant to pre– and post–multiplication by an orthonormal matrix (properties of matrix p-norms, Lecture 2). Further, it may be shown that, for any matrix $B \in \mathbb{R}^{m\times n}$ of rank k<r,[25]

(52)AB2σk+1

Comparing (51) and (52), we see the closest rank k matrix to A is Ak given by (50).
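
The theorem is straightforward to verify numerically. In the sketch below, the test matrix and the choice k = 2 are illustrative; the truncated sum (50) is formed from the SVD and its 2-norm distance to A is compared with $\sigma_{k+1}$.

```python
import numpy as np

# Sketch of Theorem 3: the truncated SVD A_k of (50) achieves ||A - A_k||_2 = sigma_{k+1}.
rng = np.random.default_rng(2)
A = rng.standard_normal((6, 4))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]          # sum of the k largest outer products
print(np.linalg.matrix_rank(Ak))                    # -> k
print(np.linalg.norm(A - Ak, 2), s[k])              # equal: ||A - A_k||_2 = sigma_{k+1}
```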

This result is very useful when we wish to approximate a matrix by another of lower rank. For example, let us look at the Karhunen-Loeve expansion as discussed in Lecture 1. For a sample xn of a random process xRm, we express x as

(53)xi=Vθi

where the columns of V are the eigenvectors of the covariance matrix R. We saw in Lecture 2 that we may represent xi with relatively few coefficients by setting the elements of θ associated with the smallest eigenvalues of R to zero. The idea was that the resulting distortion in x would have minimum energy.

This fact may now be seen in a different light with the aid of this theorem. Suppose we retain the j=r elements of a given θ associated with the largest r eigenvalues. Let θ~[θ1,θ2,,θr,0,,0]T and x~=Vθ~. Then

(54) $\tilde{R} = E(\tilde{x}\tilde{x}^T) = E(V\tilde{\theta}\tilde{\theta}^T V^T) = V\,\mathrm{diag}\!\left(E|\theta_1|^2,\ldots,E|\theta_r|^2,0,\ldots,0\right)V^T = V\tilde{\Lambda}V^T,$

where $\tilde{\Lambda} = \mathrm{diag}(\lambda_1,\ldots,\lambda_r,0,\ldots,0)$. Since $\tilde{R}$ is positive semi-definite, square and symmetric, its eigendecomposition and singular value decomposition are identical; hence $\lambda_i = \sigma_i,\ i=1,\ldots,r$. Thus, from this theorem and (54), we know that the covariance matrix $\tilde{R}$ formed from truncating the K-L coefficients is the closest rank–r matrix to the true covariance matrix R in the 2–norm sense.

4 Orthogonal Projections

4.1 Sufficient Conditions for a Projector

Suppose we have a subspace S=R(X), where $X = [x_1,\ldots,x_n] \in \mathbb{R}^{m\times n}$ is full rank, m>n, and an arbitrary vector $y \in \mathbb{R}^m$. How do we find a matrix $P \in \mathbb{R}^{m\times m}$ so that the product $Py \in S$?

The matrix P is referred to as a projector. That is, we can project an arbitrary vector y onto the subspace S, by premultiplying y by P. Note that this projection has non-trivial meaning only when m>n. Otherwise, yS already for arbitrary y.

A matrix P is a projection matrix onto S if:

  1. R(P)=S
  2. P2=P
  3. PT=P

A matrix satisfying condition (2) is called an idempotent matrix. This is the fundamental property of a projector.

We now show that these three conditions are sufficient for P to be a projector. An arbitrary vector y can be expressed as

(55)y=ys+yc

where ysS and ycS (the orthogonal complement subspace of S). We see that ys is the desired projection of y onto S. Thus, in mathematical terms, our objective is to show that

(56)Py=ys.

Because of condition 2, P2=P, hence

(57)Ppi=pii=1,,m

where pi is a column of P. Because ysS, and also (p1pm)S (condition 1), then ys can be expressed as a linear combination of the pi’s:

(58)ys=i=1mcipi,ciR.

Combining (57) and (58), we have

(59)Pys=i=1mciPpi=i=1mcipi=ys.

If R(P)=S (condition 1), then Pyc=0. Hence,

(60)Py=P(ys+yc)=Pys=ys.

i.e., P projects y onto S, if P obeys conditions 1 and 2. Furthermore, by repeating the above proof, and using condition 3, we have

yTPS

i.e., P projects both column- and row–vectors onto S, by pre- and post-multiplying, respectively. Because this property is a direct consequence of the three conditions above, then these conditions are sufficient for P to be a projector.

4.2 A Definition for P

Let $X = [x_1,\ldots,x_n]$, $x_i \in \mathbb{R}^m$, n<m, be full rank. Then the matrix P where

(61) $P = X(X^T X)^{-1} X^T$

is a projector onto S=R(X). Other definitions of P equivalent to (61) will follow later after we discuss pseudo inverses.

Note that when X has orthonormal columns, the projector becomes $XX^T \in \mathbb{R}^{m\times m}$, which according to our previous discussion on orthonormal matrices in Chapter 2, is not the m×m identity.
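
A small numerical sketch of (61) is given below for an illustrative random X; it checks the idempotency and symmetry conditions, and that the residual (I − P)y is orthogonal to R(X).

```python
import numpy as np

# Sketch of (61): form P = X (X^T X)^{-1} X^T for a full-rank X, m > n.
rng = np.random.default_rng(3)
X = rng.standard_normal((5, 2))                     # m = 5 > n = 2, full rank w.p. 1
P = X @ np.linalg.inv(X.T @ X) @ X.T

print(np.allclose(P @ P, P), np.allclose(P, P.T))   # idempotent and symmetric
y = rng.standard_normal(5)
ys = P @ y                                          # projection onto S = R(X)
yc = (np.eye(5) - P) @ y                            # component in the orthogonal complement
print(np.allclose(X.T @ yc, 0.0))                   # y_c is orthogonal to R(X)
```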

Exercises:

  • prove (61).
  • How is P in (61) formed if r=rank(X)<n?

Theorem 4 The projector onto S defined by (61) is unique.

Proof: Let Y be any other m×n full rank matrix such that R(Y)=S. Since X and Y are both in S, each column of Y must be a linear combination of the columns of X. Therefore, there exists a full-rank matrix CRn×n so that

(62)Y=XC.

The projector P1 formed from Y is therefore

(63) $P_1 = Y(Y^T Y)^{-1} Y^T = XC\left((XC)^T XC\right)^{-1}(XC)^T = XC\left(C^T X^T X C\right)^{-1}C^T X^T = XC\,C^{-1}(X^T X)^{-1}(C^T)^{-1}C^T X^T = X(X^T X)^{-1}X^T = P.$

Thus, the projector formed from (61) onto S is unique, regardless of the set of vectors used to form X, provided the corresponding matrix X is full rank and that R(X)=S.

In Section 4.1 we discussed sufficient conditions for a projector. This means that while these conditions are enough to specify a projector, there may be other conditions which also specify a projector. But since we have now proved the projector is unique, the conditions in Section 4.1 are also necessary.

4.3 The Orthogonal Complement Projector

Consider the vector y, and let ys be the projection of y onto our subspace S, and yc be the projection onto the orthogonal complement subspace S. Thus,

(64)y=ys+yc=Py+yc.

Therefore we have

(65) $y - Py = y_c \quad\Longleftrightarrow\quad (I - P)y = y_c.$

It follows that if P is a projector onto S, then the matrix $(I - P)$ is a projector onto $S^\perp$. It is easily verified that this matrix satisfies all the required properties of a projector.

4.4 Orthogonal Projections and the SVD

Suppose we have a matrix $A \in \mathbb{R}^{m\times n}$ of rank r. Then, using the SVD partitions of (25)–(27), we have these useful relations:

  1. $V_1 V_1^T$ is the orthogonal projector onto $[N(A)]^\perp = R(A^T)$.
  2. $V_2 V_2^T$ is the orthogonal projector onto $N(A)$.
  3. $U_1 U_1^T$ is the orthogonal projector onto $R(A)$.
  4. $U_2 U_2^T$ is the orthogonal projector onto $[R(A)]^\perp = N(A^T)$.

To justify these results, we show each projector listed above satisfies the three conditions for a projector:

  1. First, we must show that each projector above is in the range of the corresponding subspace (condition 1). In Sects. 3.4.2 and 3.4.3, we have already verified that V2 is a basis for N(A), and that U1 is a basis for R(A), as required. It is easy to verify that the remaining two projectors above (no.’s 1 and 4 respectively) also have the appropriate ranges.
  2. From the orthonormality property of each of the matrix partitions above, it is easy to see condition 2 (idempotency) holds in each case.
  3. Finally, each matrix above is symmetric (condition 3). Therefore, each matrix above is a projector onto the corresponding subspace.
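
These four projectors can be verified numerically from the SVD partition; the rank-deficient matrix below is an illustrative choice.

```python
import numpy as np

# Sketch of Sect. 4.4: build the projectors from U = [U1 U2], V = [V1 V2].
A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],
              [1.0, 0.0, 1.0]])                     # rank r = 2
U, s, Vt = np.linalg.svd(A)
r = int(np.sum(s > 1e-12))
U1, U2 = U[:, :r], U[:, r:]
V1, V2 = Vt[:r, :].T, Vt[r:, :].T

print(np.allclose(A @ (V2 @ V2.T), 0.0))            # V2 V2^T projects onto N(A)
print(np.allclose((U1 @ U1.T) @ A, A))              # U1 U1^T leaves R(A) untouched
print(np.allclose(U2.T @ A, 0.0))                   # R(A)-perp is orthogonal to the columns of A
```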

5 The Quadratic Form

We introduce the quadratic form by considering the idea of positive definiteness. A square matrix $A \in \mathbb{R}^{n\times n}$ is positive definite if and only if, for any $x \neq 0 \in \mathbb{R}^n$,

(1)xTAx>0.

The matrix A is positive semi-definite if and only if, for any x0 we have

(2)xTAx0,

which includes the possibility that A is rank deficient. The quantity on the left in (1) is referred to as a quadratic form, and is a matrix equivalent to the scalar quantity ax2.

It is only the symmetric part of A which is relevant in a quadratic form. This may be seen as follows. The symmetric part T of A is defined as $T = \frac{1}{2}(A + A^T)$, whereas the asymmetric part S of A is defined as $S = \frac{1}{2}(A - A^T)$. Then A = T + S. It may be verified by direct multiplication that the quadratic form can also be expressed in the form

(3)xTAx=i=1nj=1naijxixj.

Because $S^T = -S$, the (i,j)th term corresponding to the asymmetric part of (3) exactly cancels that corresponding to the (j,i)th term. Further, the terms corresponding to i=j are zero for the asymmetric part. Thus the part of the quadratic form corresponding to the asymmetric part S is zero. Therefore, when considering quadratic forms, it suffices to consider only the symmetric part T of a matrix. Quadratic forms on positive definite matrices are used very frequently in least-squares and adaptive filtering applications.

Theorem 1 A matrix A is positive definite if and only if all eigenvalues of the symmetric part of A are positive.

Proof: Since only the symmetric part of A is relevant, the quadratic form on A may be expressed as xTAx=xTVΛVTx where an eigendecomposition has been performed on the symmetric part of A. Let us define z=VTx. Thus we have

(4) $x^T A x = z^T \Lambda z = \sum_{i=1}^{n} \lambda_i z_i^2.$

Thus (4) is greater than zero for arbitrary x if and only if λi>0, i=1,,n.

From (4), it is easy to verify that the equation $k = \sum_{i=1}^{n}\lambda_i z_i^2$, where k is a constant, defines a multi-dimensional ellipse where $\sqrt{k/\lambda_i}$ is the length of the ith principal axis. Since $z = V^T x$ where V is orthonormal, z is a rotation transformation of x, and the equation $k = x^T A x$ is a rotated version of (4). Thus $k = x^T A x$ is also an ellipse with principal axes given by $\sqrt{k/\lambda_i}$. In this case, the ith principal axis of the ellipse lines up along the ith eigenvector $v_i$ of A.

Positive definiteness of A in the quadratic form xTAx is the matrix analog to the scalar a being positive in the scalar expression ax2. The scalar equation y=ax2 is a parabola which faces upwards if a is positive. Likewise, the equation y=xTAx is a multi-dimensional parabola which faces upwards in all directions if A is positive definite.
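
Theorem 1 suggests a simple numerical test for positive definiteness, sketched below; the example matrices are illustrative.

```python
import numpy as np

# Sketch of Theorem 1: A is positive definite iff all eigenvalues of its
# symmetric part T = (A + A^T)/2 are positive; the asymmetric part is irrelevant.
def is_positive_definite(A):
    T = 0.5 * (A + A.T)                     # only the symmetric part matters
    return bool(np.all(np.linalg.eigvalsh(T) > 0))

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
print(is_positive_definite(A))              # True (eigenvalues 3 and 1)
print(is_positive_definite(A + np.array([[0.0, 5.0], [-5.0, 0.0]])))  # unchanged: True
```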

Example: We now discuss an example to illustrate the above discussion. A three-dimensional plot of y=xTAx is shown plotted in Fig. 1 for A given by

(5) $A = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}.$

The corresponding contour plot is plotted in Fig. 2. Note that this curve is elliptical in cross-section in a plane y=k, as discussed above. It may be readily verified that the eigenvalues of A are 3 and 1, with corresponding eigenvectors $[1, 1]^T$ and $[1, -1]^T$. For y=k=1, the lengths of the principal axes of the ellipse are then $\sqrt{1/3}$ and 1. It is seen from the figure that these principal axes are indeed the lengths indicated, and are lined up along the directions of the eigenvectors as required.

We write the ellipse in the form

(6) $y = x^T A x = z^T \Lambda z = \sum_{i=1}^{n} \lambda_i z_i^2$

where $z = V^T x$ as before. It is seen from Fig. 1 and (6) that, since A is positive definite, the curve defined by $y = \lambda_i z_i^2$, with all $z_k,\ k \neq i$, held constant, is an upward-facing parabola for all i=1,…,n. (To observe the behaviour of y vs. $z_i$ in this case, we use the vertical axis y and the appropriate eigenvector direction, instead of the usual x-axis direction.)

Theorem 2 A symmetric matrix A can be decomposed into the form A=BBT if and only if A is positive definite or positive semi-definite.

Proof: (Necessary condition) Let us define z as BTx. Then

(7) $x^T A x = x^T B B^T x = z^T z \ge 0.$

Conversely (sufficient condition), without loss of generality we take an eigendecomposition of the symmetric part of A as $A = V\Lambda V^T$. Since A is positive definite by hypothesis, we can write $A = (V\Lambda^{1/2})(V\Lambda^{1/2})^T$. Let us define $B = V\Lambda^{1/2}Q^T$, where Q is an arbitrary orthonormal matrix of appropriate size. Then $BB^T = V\Lambda^{1/2}Q^T Q(\Lambda^{1/2})^T V^T = V\Lambda V^T = A$.

Note that A in this case can only be positive semi-definite if A has a non-empty null space. Otherwise, it is strictly positive definite.

The fact that A can be decomposed into two symmetric factors in this way is the fundamental idea behind the Cholesky factorization, which is a major topic of the following chapter.
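
A brief numerical sketch of Theorem 2: build one factor $B = V\Lambda^{1/2}$ from the eigendecomposition, and compare with the (lower-triangular) Cholesky factor, which is another valid choice of B. The example matrix is illustrative.

```python
import numpy as np

# Two factorizations A = B B^T of a positive definite matrix.
A = np.array([[4.0, 2.0],
              [2.0, 3.0]])
lam, V = np.linalg.eigh(A)
B = V @ np.diag(np.sqrt(lam))               # B = V Lambda^{1/2}
L = np.linalg.cholesky(A)                   # lower triangular factor, A = L L^T
print(np.allclose(B @ B.T, A))              # True
print(np.allclose(L @ L.T, A))              # True
```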

5.1 The Gaussian Multi-Variate Probability Density Function

Here, we very briefly introduce this topic so we can use this material for an example of the application of the Cholesky decomposition later in this course, and also in least-squares analysis to follow shortly. This topic is a good application of quadratic forms. More detail is provided in several books.[26]

First we consider the uni-variate case of the Gaussian probability distribution function (pdf). The pdf p(x) of a Gaussian-distributed random variable x with mean μ and variance σ2 is given as

(8) $p(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left[-\frac{1}{2\sigma^2}(x-\mu)^2\right].$

This is the familiar bell-shaped curve. It is completely specified by two parameters- the mean μ which determines the position of the peak, and the variance σ2 which determines the width or spread of the curve.

We now consider the more interesting multi-dimensional case. Consider a Gaussian-distributed vector xRn with mean μ and covariance Σ. The multivariate pdf describing the variation of x is

(9) $p(x) = (2\pi)^{-\frac{n}{2}}\,|\Sigma|^{-\frac{1}{2}}\exp\left[-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right].$

We can see that the multi-variate case collapses to the uni-variate case when the number of variables becomes one. A plot of p(x) vs. x is shown in Fig. 3, for μ=0 and Σ defined as

(10) $\Sigma = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}.$

Because the exponent in (9) is a quadratic form, the set of points satisfying the equation $\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu) = k$, where k is a constant, is an ellipse. Therefore this ellipse defines a contour of equal probability density. The interior of this ellipse defines a region into which an observation will fall with a specified probability α which is dependent on k. This probability level α is given as

(11) $\alpha = \int_{R}(2\pi)^{-\frac{n}{2}}\,|\Sigma|^{-\frac{1}{2}}\exp\left[-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right]dx,$

where R is the interior of the ellipse. Stated another way, an ellipse is the region in which any observation governed by the probability distribution (9) will fall with a specified probability level α. As k increases, the ellipse gets larger, and α increases. These ellipses are referred to as joint confidence regions at probability level α.

The covariance matrix Σ controls the shape of the ellipse. Because the quadratic form in this case involves $\Sigma^{-1}$, the length of the ith principal axis is $2\sqrt{k\lambda_i}$ instead of $2\sqrt{k/\lambda_i}$ as it would be if the quadratic form were in Σ. Therefore as the eigenvalues of Σ increase, the size of the joint confidence region increases (i.e., the spread of the distribution increases) for a given value of k.

Now suppose we let Σ become poorly conditioned in such a way that the variances (main diagonal elements of Σ) remain constant. Then the ratio of the largest to smallest principal axes become large, and the ellipse becomes elongated. In this case, the pdf takes on more of the shape shown in Fig. 4, which shows a multi-variate Gaussian pdf for μ=0 for a relatively poorly conditioned Σ given as

(12) $\Sigma = \begin{bmatrix} 2 & 1.9 \\ 1.9 & 2 \end{bmatrix}.$

Here, because the ellipse describing the joint confidence region is elongated, we see that if one of the variables is known, the distribution of the other variable becomes more concentrated around the value of the first; i.e., knowledge of one variable tells us relatively more about the other. This implies the variables are more highly correlated with one another. But we have seen previously that if the variables in a vector random process are highly correlated, then the off-diagonal elements of the covariance matrix become larger, which leads to their eigenvalues becoming more disparate; i.e., the condition number of the covariance matrix becomes worse. It is precisely this poorer condition number that causes the ellipse in Fig. 4 to become elongated.
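
The connection between the eigenvalues of Σ and the shape of the joint confidence region can be sketched as follows; the level constant c and the two example covariance matrices (taken from (10) and (12)) are used only for illustration, and the level set is written here as $(x-\mu)^T\Sigma^{-1}(x-\mu) = c$.

```python
import numpy as np

# Semi-axis lengths of the equal-probability-density ellipse are proportional to
# sqrt(lambda_i), so a poorly conditioned Sigma gives an elongated confidence region.
for Sigma in (np.array([[2.0, 1.0], [1.0, 2.0]]),
              np.array([[2.0, 1.9], [1.9, 2.0]])):
    lam = np.linalg.eigvalsh(Sigma)[::-1]            # descending eigenvalues
    c = 2.0                                          # level: (x-mu)^T Sigma^{-1} (x-mu) = c
    axes = np.sqrt(c * lam)                          # semi-axis lengths of the ellipse
    print(axes, "condition number:", lam[0] / lam[1])
```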

With this discussion, we have now gone full circle: a highly correlated system has large off-diagonal elements in its covariance matrix. This leads to a poorly conditioned covariance matrix. But a Gaussian-distributed process with a poorly-conditioned covariance matrix has a joint confidence region that is elongated. In turn, an elongated joint confidence region means the system is highly correlated, which takes us back to the beginning.

Understanding these relationships is a key element of the rigor underlying signal processing.

5.2 The Rayleigh Quotient

The Rayleigh quotient is a simple mathematical structure that has a great deal of interesting uses. The Rayleigh quotient r(x) is defined as

(13) $r(x) = \frac{x^T A x}{x^T x}.$

It is easily verified that if x is the ith eigenvector $v_i$ of A (not necessarily normalized to unit norm), then $r(x) = \lambda_i$:

(14) $\frac{v_i^T A v_i}{v_i^T v_i} = \frac{\lambda_i\, v_i^T v_i}{v_i^T v_i} = \lambda_i.$

In fact, it is easily shown by differentiating r(x) with respect to x, that x=vi is a stationary point of r(x).

Further along this line of reasoning, let us define a subspace Sk as Sk=span{v1,,vk}, k=1,,n, where vi is the ith eigenvector of ARn×n, where A is symmetric. Then, a variation of the Courant Fischer minimax theorem[27] says that

(15) $\lambda_k = \min_{x \in S_k,\ x \neq 0} \frac{x^T A x}{x^T x}.$

Question: It is easily shown by differentiation that, for a given x, the choice $\mu = r(x)$ from (13) minimizes $\|(A - \mu I)x\|_2$. The perturbation theory of Golub and Van Loan says that if x in (13) is a good approximation to an eigenvector, then r(x) is a good approximation to the corresponding eigenvalue, and vice versa. Starting with an initial estimate $x_0$ with unit 2-norm, suggest an iteration using (13) which gives an improved estimate of the eigenvector. How can the eigenvalue be found?

This technique is referred to as the Rayleigh quotient iteration for computing an eigenvalue and eigenvector. In fact, this iteration is remarkably effective; it can be shown to have cubic convergence.
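
One possible realization of this iteration is sketched below, under the assumption that A is symmetric; the starting vector and iteration count are arbitrary illustrative choices.

```python
import numpy as np

# A sketch of the Rayleigh quotient iteration: alternate between the eigenvalue
# estimate mu = r(x) from (13) and an inverse-iteration step with shift mu.
def rayleigh_quotient_iteration(A, x0, iters=10):
    x = x0 / np.linalg.norm(x0)
    for _ in range(iters):
        mu = x @ A @ x                          # Rayleigh quotient r(x) for unit-norm x
        if np.linalg.norm(A @ x - mu * x) < 1e-12:
            break                               # already an eigenpair to working precision
        try:
            y = np.linalg.solve(A - mu * np.eye(len(x)), x)
        except np.linalg.LinAlgError:
            break                               # shift hit an eigenvalue exactly
        x = y / np.linalg.norm(y)
    return x @ A @ x, x                         # eigenvalue and eigenvector estimates

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
lam, v = rayleigh_quotient_iteration(A, np.array([1.0, 0.3]))
print(lam, v)                                   # converges (cubically) to the eigenpair near 3
```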


  1. Eq. (4) is called a linear combination of the vectors aj. Each vector is multiplied by a weight (or coefficient) cj, and the result summed. ↩︎

  2. A vector ei is referred to as an elementary vector, and has zeros everywhere except for a 1 in the ith position. ↩︎

  3. Column rank deficient is when the rank of the matrix is less than the number of columns. ↩︎

  4. The characteristic polynomial of a matrix is defined in Chapter 2. ↩︎

  5. An orthonormal matrix is defined in Chapter 2. ↩︎

  6. A symmetric matrix is one where A=AT, where the superscript T means transpose, i.e, for a symmetric matrix, an element aij=aji. A Hermitian symmetric (or just Hermitian) matrix is relevant only for the complex case, and is one where A=AH, where superscript H denotes the Hermitian transpose. This means the matrix is transposed and complex conjugated. Thus for a Hermitian matrix, an element aij=aji. In this course we will generally consider only real matrices. However, when complex matrices are considered, Hermitian symmetric is implied instead of symmetric. ↩︎

  7. Here, we have used the property that for matrices or vectors A and B of conformable size, (AB)T=BTAT. ↩︎

  8. From Lastman and Sinha, Microcomputer–based Numerical Methods for Science and Engineering. ↩︎

  9. The trace denoted tr() of a square matrix is the sum of its elements on the main diagonal (also called the “diagonal” elements). ↩︎

  10. This only holds if A and B are square invertible. ↩︎

  11. This proof is left as an exercise. ↩︎

  12. A. Papoulis, Probability, Random Variables, and Stochastic Processes, McGraw Hill, 3rd Ed. ↩︎

  13. Processes with this property are referred to as ergodic processes. ↩︎

  14. Haykin, “Adaptive Filter Theory”, Prentice Hall, 3rd. ed. ↩︎

  15. An expansion of x usually requires the basis vectors to be only linearly independent–not necessarily orthonormal. But orthonormal basis vectors are most commonly used because they can be inverted using the very simple form of (49). ↩︎

  16. For a proof, refer to Cover and Thomas, Elements of Information Theory ↩︎

  17. This is not necessarily a valid assumption. We discuss this point further, later in the section. ↩︎

  18. K.R. Rao and P. Yip, “Discrete Cosine Transform– Algorithms, Advantages, Applications”. ↩︎

  19. It may be shown that if dλ/2, then there is a one–to–one relationship between the electrical angle ϕ and the corresponding physical angle θ. In fact, ϕ=2πdλsinθ. We can only observe the electrical angle ϕ, not the desired physical angle θ. Thus, we deduce the desired physical angle from the observed electrical angle from this mathematical relationship. ↩︎

  20. Note that the eigenvalue zero has multiplicity MK. Therefore, the eigenvectors vK+1,,vM are not unique. However, a set of orthonormal eigenvectors which are orthogonal to the remaining eigenvectors exist. Thus we can treat the zero eigenvectors as if they were distinct. ↩︎

  21. Let us define the so–called signal subspace SS as SS=span[v1,,vK] (64) and the noise subspace SN as SN=span[vK+1,,vM]. (65) We now digress briefly to discuss these two subspaces further. From our discussion above, all columns of Ro are linear combinations of the columns of S. Therefore span[Ro]=span[S]. (66) But it is also easy to verify that span[Ro]SS (67) Comparing (66) and (67), we see that SSS. From (60) we see that any received signal vector x, in the absence of noise, is a linear combination of the columns of S. Thus, any noise–free signal resides completely in SS. This is the origin of the term “signal subspace”. Further, any component of the received signal residing in SN must be entirely due to the noise. This is the origin of the term “noise subspace”. We note that the signal and noise subspaces are orthogonal complement subspaces of each other. ↩︎

  22. This word is an acronym for MUltiple SIgnal Classification. ↩︎

  23. R.O. Schmidt, “Multiple emitter location and parameter estimation”, IEEE Trans. Antennas and Propag., vol AP-34, Mar. 1986, pp 276-280. ↩︎

  24. The concept of positive definiteness is discussed next lecture. It means all the eigenvalues are greater than or equal to zero. ↩︎

  25. Golub and van Loan pg. 73. ↩︎

  26. e.g., H. Van Trees, "Detection, Estimation and Modulation Theory", Part 1. L.L. Scharf, "Statistical Signal Processing: Detection, Estimation, and Time Series Analysis," pg. 55. ↩︎

  27. See Wilkinson, "The Algebraic Eigenvalue Problem", pp. 100-101. ↩︎
