EE731 Lecture Notes: Matrix Computations for Signal Processing
James P. Reilly (c)
Department of Electrical and Computer Engineering, McMaster University
September 13, 2004
0 Preface
This collection of eleven chapters of notes will give the reader an introduction to the fundamental principles of linear algebra for application in many disciplines of modern engineering and science, including signal processing, control theory, process control, applied statistics, robotics, etc. We assume the reader has a background equivalent to a freshman course in linear algebra, some introduction to probability and statistics, and a basic knowledge of the Fourier transform.
In the first chapter, some fundamental ideas required for the remainder of the course are established. First, we look at some fundamental ideas of linear algebra such as linear independence, subspaces, rank, nullspace, range, etc., and how these concepts are interrelated. The ideas of autocorrelation and the covariance matrix of a signal are then discussed and interpreted.
In chapter 2, the most basic matrix decomposition, the so-called eigendecomposition, is presented. The focus of the presentation is to give an intuitive insight into what this decomposition accomplishes. We illustrate how the eigendecomposition can be applied through the Karhunen-Loeve transform. In this way, the reader is made familiar with the important properties of this decomposition. The Karhunen-Loeve transform is then generalized to the broader idea of transform coding.
In chapter 3, we develop the singular value decomposition (SVD), which is closely related to the eigendecomposition of a matrix. We develop the relationships between these two decompositions and explore various properties of the SVD.
Chapter 4 deals with the quadratic form and its relation to the eigendecomposition, and also gives an introduction to error mechanisms in floating point number systems. The condition number of a matrix, which is critical in determining a lower bound on the relative error in the solution of a system of linear equations, is also developed.
Chapters 5 and 6 deal with solving linear systems of equations by Gaussian elimination. The Gaussian elimination process is described through a bigger-block matrix approach, which leads to other useful decompositions, such as the Cholesky decomposition of a square symmetric matrix.
Chapters 7-10 deal with solving least-squares problems. The standard least-squares problem and its solution are developed in Chapter 7. In Chapter 8, we develop a generalized "pseudoinverse" approach to solving the least-squares problem. The QR decomposition is developed in Chapter 9, and its application to the solution of linear least-squares problems is discussed in Chapter 10.
Finally, in Chapter 11, the solution of Toeplitz systems of equations and its underlying theory is developed.
1 Fundamental Concepts
The purpose of this lecture is to review important fundamental concepts in linear algebra, as a foundation for the rest of the course. We first discuss the fundamental building blocks, such as an overview of matrix multiplication from a “big block” perspective, linear independence, subspaces and related ideas, rank, etc., upon which the rigor of linear algebra rests. We then discuss vector norms, and various interpretations of the matrix multiplication operation. We close the chapter with a discussion on determinants.
1.1 Notation
Throughout this course, we shall indicate that a matrix
Similarly, the notation
Also, we shall indicate that a scalar
By convention, a vector by default is taken to be a column vector. Further, for a matrix
1.2 “Bigger-Block” Interpretations of Matrix Multiplication
Let us define the matrix product
The three interpretations of this operation now follow:
1.2.1 Inner-Product Representation
If
1.2.2 Column Representation
This is the next bigger–block view of matrix multiplication. Here we look at forming the product one column at a time. The
This operation is identical to the inner–product representation above, except we form the product one column at a time. For example, if we evaluate only the
1.2.3 Outer–Product Representation
This is the largest–block representation. Let us define a column vector
Now let
By looking at this operation one column at a time, we see this form of matrix multiplication performs exactly the same operations as the column representation above. For example, the
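As a concrete check on these three views, the following short NumPy sketch (not part of the original notes; the matrix sizes and random data are arbitrary) forms the same product by the inner-product, column, and outer-product representations and verifies that they agree.

```python
import numpy as np

# Arbitrary sizes chosen only for illustration.
m, k, n = 4, 3, 2
rng = np.random.default_rng(0)
A = rng.standard_normal((m, k))
B = rng.standard_normal((k, n))

C_ref = A @ B   # reference product

# Inner-product view: element (i, j) is the inner product of row i of A with column j of B.
C_inner = np.array([[A[i, :] @ B[:, j] for j in range(n)] for i in range(m)])

# Column view: column j of the product is A times column j of B.
C_col = np.column_stack([A @ B[:, j] for j in range(n)])

# Outer-product view: the product is the sum of outer products of columns of A with rows of B.
C_outer = sum(np.outer(A[:, p], B[p, :]) for p in range(k))

assert np.allclose(C_ref, C_inner)
assert np.allclose(C_ref, C_col)
assert np.allclose(C_ref, C_outer)
```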
1.2.4 Matrix Pre– and Post–Multiplication
Let us now look at some fundamental ideas distinguishing matrix pre– and post–multiplication. In this respect, consider a matrix
Example:
- Consider an orthonormal matrix Q of appropriate dimension. We know that multiplication by an orthonormal matrix results in a rotation operation. Pre-multiplication by Q rotates each column of the matrix; post-multiplication by Q rotates each row.
There is another way to interpret pre– and post–multiplication. Again consider the matrix
Either of these interpretations is equally valid. Being comfortable with the representations of this section is a big step in mastering the field of linear algebra.
1.3 Fundamental Linear Algebra
1.3.1 Linear Independence
Suppose we have a set of
In words:
Eq. (4) means that a set of vectors is linearly independent if and only if the only linear combination of the vectors that equals zero is the one whose coefficients are all zero.
A set of
Note that a set of vectors
Example 1
This set is linearly independent. On the other hand, the set
is not. This follows because the third column is a linear combination of the first two. (
1.3.2 Span, Range, and Subspaces
In this section, we explore these three closely-related ideas. In fact, their mathematical definitions are almost the same, but the interpretation is different for each case.
Span: The span of a vector set
In other words,
The set of vectors in a span is referred to as a vector space. The dimension of a vector space is the number of linearly independent vectors in the linear combination which forms the space. Note that the vector space dimension is not the dimension (length) of the vectors forming the linear combinations.
Example 2: Consider the following 2 vectors in Fig. 1: The span of these vectors is the (infinite extension of the) plane of the paper.
Subspaces Given a set (space) of vectors
- If x and y are in the subspace, then x + y is still in the subspace.
- If we multiply any vector x in the subspace by a scalar c, then cx is still in the subspace.
These two requirements imply that for a subspace, any linear combination of vectors which are in the subspace is itself in the subspace. Comparing this idea with that of span, we see a subspace defined by the vectors
Hence formally, a
Note that
∗ What is the span of the vectors
Range: The range of a matrix
We can interpret the matrix–vector multiplication
Example 3:
In the case when
1.3.3 Maximally Independent Set
This is a vector set which cannot be made larger without losing independence, nor made smaller and still remain maximal; i.e., it is a set containing the maximum number of independent vectors spanning the space.
1.3.4 A Basis
A basis for a subspace is any maximally independent set within the subspace. It is not unique.
Example 4. A basis for the subspace
is
[2]or any other linearly independent set in
Any vector in
1.3.5 Orthogonal Complement Subspace
If we have a subspace
i.e., any vector in
Example 5: Take the vector set defining
then, a basis for
1.3.6 Rank
Rank is an important concept which we will use frequently throughout this course. We briefly describe only a few basic features of rank here. The idea is expanded more fully in the following sections.
- The rank of a matrix is the maximum number of linearly independent rows or columns. Thus, it is the dimension of a basis for the columns (rows) of a matrix.
- Rank of A (denoted rank(A)) is the dimension of R(A).
- If A is m × n and rank(A) = r, then r ≤ min(m, n).
- A matrix A is said to be rank deficient if its rank is less than min(m, n). Otherwise, it is said to be full rank.
- If A is square and rank deficient, then det(A) = 0.
- It can be shown that rank(A) = rank(A^T). More is said on this point later.
A matrix is said to be full column (row) rank if its rank is equal to the number of columns (rows).
Example: The rank of
1.3.7 Null Space of A
The null space
From previous discussions, the product
Example 6: Let
A further example is as follows. Take 3 vectors
Another important characterization of a matrix is its nullity. The nullity of
1.4 Four Fundamental Subspaces of a Matrix
The four matrix subspaces of concern are: the column space, the row space, and their respective orthogonal complements. The development of these four subspaces is closely linked to
1.4.1 The Column Space
This is simply
1.4.2 The Orthogonal Complement of the Column Space
This may be expressed as
where columns of
1.4.3 The Row Space
The row space is defined simply as
1.4.4 The Orthogonal Complement of the Row Space
This may be denoted as
Thus, the set
We have noted before that
1.5 Vector Norms
A vector norm is a means of expressing the length or distance associated with a vector. A norm on a vector space is a function f which maps a vector into a real number and satisfies the following three properties, for any vectors x and y and any scalar c:
- f(x) ≥ 0, with f(x) = 0 if and only if x = 0.
- f(x + y) ≤ f(x) + f(y).
- f(cx) = |c| f(x).
We denote the function
The p-norms: This is a useful class of norms, generalizing the idea of the Euclidean norm. They are defined by
||x||_p = ( |x_1|^p + |x_2|^p + ... + |x_n|^p )^(1/p),  p ≥ 1.
If p = 1, then
||x||_1 = |x_1| + |x_2| + ... + |x_n|,
which is simply the sum of absolute values of the elements.
If p = 2, then
||x||_2 = ( x_1^2 + x_2^2 + ... + x_n^2 )^(1/2),
which is the familiar Euclidean norm.
If p = ∞, then
||x||_∞ = max_i |x_i|,
which is the largest element of x in absolute value, where the maximum is taken over the indices i = 1, ..., n.
Note that the
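The following minimal sketch (an illustration added here, not from the notes; the example vector is arbitrary) evaluates the 1-, 2-, and infinity-norms directly from their definitions and cross-checks them against NumPy's built-in vector norms.

```python
import numpy as np

x = np.array([3.0, -4.0, 1.0])          # arbitrary example vector

one_norm = np.sum(np.abs(x))            # sum of absolute values
two_norm = np.sqrt(np.sum(x ** 2))      # Euclidean length
inf_norm = np.max(np.abs(x))            # largest element in magnitude

# Cross-check against NumPy's built-in vector norms.
assert np.isclose(one_norm, np.linalg.norm(x, 1))
assert np.isclose(two_norm, np.linalg.norm(x, 2))
assert np.isclose(inf_norm, np.linalg.norm(x, np.inf))
print(one_norm, two_norm, inf_norm)     # 8.0, 5.099..., 4.0
```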
1.6 Determinants
Consider a square matrix
The determinant of
or
Both the above are referred to as the cofactor expansion of the determinant. Eq. (19) is along the
Eqs. (19) and (20) express the
From (19) it is evident that if
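As an illustration of the cofactor expansion, the sketch below (added here; the expansion is taken along the first row, and the test matrix is arbitrary) computes a determinant recursively and compares it with np.linalg.det. It is intended only to make the definition concrete; as noted at the end of this section, it is not an efficient algorithm.

```python
import numpy as np

def det_cofactor(A):
    """Determinant by cofactor expansion along the first row. Illustrative only:
    the cost grows factorially with the matrix size."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    if n == 1:
        return A[0, 0]
    total = 0.0
    for j in range(n):
        minor = np.delete(np.delete(A, 0, axis=0), j, axis=1)   # remove row 0 and column j
        total += (-1) ** j * A[0, j] * det_cofactor(minor)
    return total

A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
print(det_cofactor(A), np.linalg.det(A))     # both are 8 (to rounding)
assert np.isclose(det_cofactor(A), np.linalg.det(A))
```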
Properties of Determinants Before we begin this discussion, let us define the volume of a parallelepiped defined by the set of column vectors comprising a matrix as the principal volume of that matrix.
We have the following properties of determinants, which are stated without proof:
- det(AB) = det(A) det(B). The principal volume of the product of matrices is the product of the principal volumes of each matrix. This property shows that the characteristic polynomials[4] of AB and BA are identical. Consequently, as we see later, the eigenvalues of AB and BA are identical.
- det(cA) = c^m det(A), for an m × m matrix A and a scalar c. This is a reflection of the fact that if each vector defining the principal volume is multiplied by c, then the resulting volume is multiplied by c^m.
- det(A) = 0 if and only if A is singular. This implies that at least one dimension of the principal volume of the corresponding matrix has collapsed to zero length.
- det(A) = λ_1 λ_2 ... λ_m, where the λ_i are the eigen (singular) values of A. This means the parallelepiped defined by the column or row vectors of a matrix may be transformed into a regular rectangular solid of the same m-dimensional volume whose edges have lengths corresponding to the eigen (singular) values of the matrix.
- The determinant of an orthonormal[5] matrix is ±1. This is easy to see, because the vectors of an orthonormal matrix are all unit length and mutually orthogonal. Therefore the corresponding principal volume is 1.
- If A is nonsingular, then det(A^{-1}) = 1/det(A).
- If B is nonsingular, then det(B^{-1} A B) = det(A).
- If B is obtained from A by interchanging any two rows (or columns), then det(B) = -det(A).
- If B is obtained from A by adding a scalar multiple of one row to another (or a scalar multiple of one column to another), then det(B) = det(A).
A further property of determinants allows us to compute the inverse of
where the
It can also be shown that
Then, combining (22) and (23) for
where
Neither (19) nor (25) is a computationally efficient way of calculating a determinant or an inverse, respectively. Better methods, which exploit the properties of various matrix decompositions, are made evident later in the course.
2 Lecture 2
This lecture discusses eigenvalues and eigenvectors in the context of the Karhunen–Loeve (KL) expansion of a random process. First, we discuss the fundamentals of eigenvalues and eigenvectors, then go on to covariance matrices. These two topics are then combined into the K-L expansion. An example from the field of array signal processing is given as an application of algebraic ideas.
A major aim of this presentation is an attempt to de-mystify the concepts of eigenvalues and eigenvectors by showing a very important application in the field of signal processing.
2.1 Eigenvalues and Eigenvectors
Suppose we have a matrix
We investigate its eigenvalues and eigenvectors.
Suppose we take the product
By comparing the vectors
Now consider the case where
Now let's consider a more interesting case. Suppose
Note that
Thus we have, if
i.e., the vector
Now that we have an understanding of the fundamental idea of an eigenvector, we proceed to develop the idea further. Eq. (3) may be written in the form
where
where
It is easily verified that the roots of this polynomial are (5,3), which correspond to the eigenvalues indicated above.
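Since the example matrix itself is not reproduced here, the sketch below uses a hypothetical 2 × 2 symmetric stand-in chosen so that its eigenvalues are 5 and 3, matching the values quoted above. It forms the characteristic polynomial from the trace and determinant and confirms that its roots agree with the eigenvalues returned by np.linalg.eig.

```python
import numpy as np

# Hypothetical 2 x 2 symmetric matrix with eigenvalues 5 and 3 (a stand-in for the example).
A = np.array([[4.0, 1.0],
              [1.0, 4.0]])

# For a 2 x 2 matrix the characteristic polynomial is lambda^2 - trace(A)*lambda + det(A).
coeffs = [1.0, -np.trace(A), np.linalg.det(A)]
print("roots of characteristic polynomial:", np.sort(np.roots(coeffs)))   # [3. 5.]

# Direct eigendecomposition gives the same values.
evals, evecs = np.linalg.eig(A)
print("eigenvalues from np.linalg.eig   :", np.sort(evals))               # [3. 5.]

# Check the defining relation A v = lambda v for each eigenpair.
for lam, v in zip(evals, evecs.T):
    assert np.allclose(A @ v, lam * v)
```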
Eq. (5) is referred to as the characteristic equation of
More generally, if
If the eigenvalues are all distinct, there are
Repeated Eigenvalues: In the case where there are e.g.,
Example 1: Consider the matrix given by
It may be easily verified that any vector in
Example 2: Consider the
Eq. (5) gives us a clue how to compute eigenvalues. We can formulate the characteristic polynomial and evaluate its roots to give the
We now present some very interesting properties of eigenvalues and eigenvectors, to aid in our understanding.
Property 1 If the eigenvalues of a (Hermitian)[6] symmetric matrix are distinct, then the eigenvectors are orthogonal.
Proof. Let
and
Premultiply (8) by
The quantities on the left are equal when
where we have used the fact
Here we have considered only the case where the eigenvalues are distinct. If an eigenvalue
Another useful property of eigenvalues of symmetric matrices is as follows:
Property 2 The eigenvalues of a (Hermitian) symmetric matrix are real.
Proof:[8] (By contradiction): First, we consider the case where
While this proof considers only the real symmetric case, it is easily extended to the case where
Property 3 Let
Proof: From the definition of an eigenvector, we have
Property 4 Let A be a square matrix with eigenvalues λ_1, ..., λ_n. Then:
- The determinant satisfies det(A) = λ_1 λ_2 ... λ_n.
- The trace[9] satisfies tr(A) = λ_1 + λ_2 + ... + λ_n.
The proof is straightforward, but because it is easier using concepts presented later in the course, it is not given here.
Property 5 If
2.1.1 Orthonormal Matrices
Before proceeding with the eigendecomposition of a matrix, we must develop the concept of an orthonormal matrix. This form of matrix has mutually orthogonal columns, each of unit norm. This implies that
where
(When
Thus, for an orthonormal matrix, (14) implies the inverse may be computed simply by taking the transpose of the matrix, an operation which requires almost no computational effort.
Eq. (14) follows directly from the fact
Clearly, if (15) is to hold, then the quantity
Property 6 The vector 2-norm is invariant under an orthonormal transformation. If
Thus, because the norm does not change, an orthonormal transformation performs a rotation operation on a vector. We use this norm–invariance property later in our study of the least–squares problem.
Suppose we have a matrix
Suppose we have a vector
An orthonormal matrix is sometimes referred to as a unitary matrix. This follows because the determinant of an orthonormal matrix is ±1, i.e., of unit magnitude.
2.1.2 The Eigendecomposition (ED) of a Square Symmetric Matrix
Almost all matrices on which EDs are performed (at least in signal processing) are symmetric. A good example is the class of covariance matrices, which are discussed in some detail in the next section.
Let
Let the eigenvectors be normalized to unit 2–norm. Then these
where
Corresponding columns from each side of (17) represent one specific value of the index
Eq. (19) is called the eigendecomposition (ED) of
Note that from (19), knowledge of the eigenvalues and eigenvectors of
Eq. (19) can also be written as
Since
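A short numerical sketch of the eigendecomposition of a square symmetric matrix follows (added for illustration; the test matrix is an arbitrary symmetric matrix). It verifies that the eigenvector matrix is orthonormal, that the decomposition reconstructs the matrix, and that the eigenvectors diagonalize it.

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((4, 4))
A = B + B.T                        # arbitrary symmetric test matrix

evals, V = np.linalg.eigh(A)       # eigh is intended for symmetric/Hermitian matrices
Lam = np.diag(evals)

# The eigenvector matrix is orthonormal, so its inverse is its transpose.
assert np.allclose(V.T @ V, np.eye(4))

# The eigendecomposition reconstructs A, and the eigenvectors diagonalize it.
assert np.allclose(V @ Lam @ V.T, A)
assert np.allclose(V.T @ A @ V, Lam)
```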
2.1.3 Conventional Notation on Eigenvalue Indexing
Let
i.e., we order the columns of eq. (17) so that
The eigenvectors are reordered to correspond with the ordering of the eigenvalues. For notational convenience, we refer to the eigenvector corresponding to the largest eigenvalue as the “largest eigenvector”. The “smallest eigenvector” is then the eigenvector corresponding to the smallest eigenvalue.
2.2 The Eigendecomposition in Relation to the Fundamental Matrix Subspaces
In this section, we develop relationships between the eigendecomposition of a matrix and its range, null space and rank.
Here, we consider square symmetric positive semi–definite matrices
where
The columns of
and
In the notation used above, the explicit absence of a matrix element in an off-diagonal position implies that element is zero. We now show that the partition (22) reveals a great deal about the structure of
2.2.1 Nullspace
In this section, we explore the relationship between the partition of (22) and the nullspace of
From (22), we have
We now choose
From (28), it is clear we can find a non–trivial
Since
Moreover from (28) we see that the condition
Thus, we have the important result that if the dimension of
2.2.2 Range
Let us look at
The vector quantity
In the above, it is understood that if
Let us define
where
From (31), we see that
2.3 Matrix Norms
Now that we have some understanding of eigenvectors and eigenvalues, we can present the matrix norm. The matrix norm is related to the vector norm: it is a function which maps
Matrix p-Norms: A matrix p-norm is defined in terms of a vector p-norm. The matrix p-norm of an arbitrary matrix
where “sup” means supremum; i.e., the largest value of the argument over all values of
We now provide some interpretation for the above definition for the specific case where
and set the result to zero. The quantity
Therefore the stationary points of (34) are the eigenvectors of
It then follows that the solution to (34) is given by the eigenvector corresponding to the largest eigenvalue of
More generally, it is shown in the next lecture for an arbitrary matrix
where
Matrix norms for other values of
and
Frobenius Norm: The Frobenius norm is the 2-norm of the vector consisting of the 2- norms of the rows (or columns) of the matrix
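The sketch below (an added illustration with an arbitrary test matrix) evaluates the matrix 2-norm as the square root of the largest eigenvalue of A^T A, evaluates the Frobenius norm directly from the elements, and checks the orthonormal-invariance property stated in the next subsection.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 3))

# Matrix 2-norm: square root of the largest eigenvalue of A^T A
# (equivalently, the largest singular value of A).
two_norm = np.sqrt(np.max(np.linalg.eigvalsh(A.T @ A)))
assert np.isclose(two_norm, np.linalg.norm(A, 2))

# Frobenius norm: 2-norm of the vector of all elements of A.
fro_norm = np.sqrt(np.sum(A ** 2))
assert np.isclose(fro_norm, np.linalg.norm(A, 'fro'))

# Invariance under an orthonormal transformation (Q built via a QR factorization).
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))
assert np.isclose(np.linalg.norm(Q @ A, 2), two_norm)
assert np.isclose(np.linalg.norm(Q @ A, 'fro'), fro_norm)
```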
2.3.1 Properties of Matrix Norms
- Consider the matrix A and a vector x of compatible dimension. Then ||Ax||_p ≤ ||A||_p ||x||_p. This property follows by dividing both sides of the above by ||x||_p and comparing the result with the definition of the matrix p-norm.
- If Q and Z are orthonormal matrices of appropriate size, then ||QAZ||_2 = ||A||_2 and ||QAZ||_F = ||A||_F.
Thus, we see that the matrix 2–norm and Frobenius norm are invariant to pre– and post– multiplication by an orthonormal matrix.
- Further,
where
2.4 Covariance Matrices
Here, we investigate the concepts and properties of the covariance matrix
The word stationary as used above means the random process is one for which the corresponding joint m–dimensional probability density function describing the distribution of the vector sample
The covariance matrix
where
where
Let us compare the process shown in Fig. 2 with that shown in Fig. 3. In the former case, we see that the process is relatively slowly varying. Because we have assumed
However, for the process shown in Fig. 3, adjacent samples are uncorrelated with each other. This means that adjacent samples are just as likely to have opposite signs as they are to have the same signs. On average, the terms with positive values have the same magnitude as those with negative values. Thus, when the expectations
The sequence
In practice, it is impossible to evaluate the covariance matrix
If (46) is used to evaluate
It is interesting to note that
The proof of this statement is left as an exercise.
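To make the time-average estimate concrete, the following sketch (added here; the first-order autoregressive process and its parameters are arbitrary stand-ins for the slowly varying process of Fig. 2) estimates a small covariance matrix by averaging outer products of successive length-m windows. The estimate comes out approximately Toeplitz, with large off-diagonal terms, as discussed above.

```python
import numpy as np

rng = np.random.default_rng(3)

# A slowly varying (highly correlated) stand-in process: a first-order
# autoregressive recursion x[n] = a*x[n-1] + w[n] driven by white noise.
a, n_samples = 0.95, 50_000
w = rng.standard_normal(n_samples)
x = np.zeros(n_samples)
for n in range(1, n_samples):
    x[n] = a * x[n - 1] + w[n]

# Estimate the m x m covariance matrix by replacing the expectation with a
# time-average of outer products of length-m windows of the process.
m = 4
windows = np.array([x[i:i + m] for i in range(n_samples - m)])
R_hat = windows.T @ windows / windows.shape[0]

# The estimate is approximately Toeplitz (near-constant along each diagonal),
# with large off-diagonal terms because adjacent samples are strongly correlated.
print(np.round(R_hat, 2))
```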
Some Properties of the Covariance Matrix
- The covariance matrix is (Hermitian) symmetric; i.e., its (i, j)th element is the complex conjugate of its (j, i)th element.
- If the process is stationary or wide-sense stationary, then the covariance matrix is Toeplitz. This means that all the elements on a given diagonal of the matrix are equal. If you understand this property, then you have a good understanding of the nature of covariance matrices.
- If the covariance matrix is diagonal, then the elements of the process are uncorrelated. If the magnitudes of the off-diagonal elements are significant with respect to those on the main diagonal, the process is said to be highly correlated.
- The covariance matrix is positive semi-definite. This implies that all the eigenvalues are greater than or equal to zero. We will discuss positive definiteness and positive semi-definiteness later.
- If the stationary or WSS random process has a Gaussian probability distribution, then the vector mean and the covariance matrix are enough to completely specify the statistical characteristics of the process.
2.5 The Karhunen-Loeve Expansion of a Random Process
In this section we combine what we have learned about eigenvalues and eigenvectors, and covariance matrices, into the K-L orthonormal expansion of a random process. The KL expansion is extremely useful in compression of images and speech signals.
An orthonormal expansion of a vector
where
The coefficients
For each vector observation
where
2.5.1 Development of the K–L Expansion
Figure 4 shows a scatterplot corresponding to a slowly–varying random process, of the type shown in Figure 2. A scatterplot is a collection of dots, where the
For the sake of contrast, Figure 5 shows a similar scatterplot, except the underlying random process is white. Here there is no correlation between adjacent samples of the process, so there is no diagonal concentration of the scatterplot in this case. This scatterplot is an
As we see later in this section, if we wish to store or transmit such a random process, it is wasteful to do so using the conventional coordinate system
The proposed method of finding an optimum coordinate system in which to represent our random process is to find a basis vector
The procedure to determine the
where the expectation is over all values of
where we have assumed a zero–mean process. The optimization problem above is precisely the same as that for the matrix norm of section 2.3, where it is shown that the stationary points of the argument in (52) are the eigenvectors of
In the sequel, the KL expansion is written using the following notation:
and
where
Thus, the coefficient
By lecture 4, we will have sufficient knowledge to prove that the eigenvectors align themselves along the principal axes of the scatterplot ellipsoid of Figure 4. In highly correlated systems, due to the fact that the principal axes of the scatterplot ellipse have decreasing magnitudes (as shown in Figure 4) the variance of the smallest coefficients is typically much smaller than that of the larger coefficients.
Question: Suppose the process
To answer this, we see that all the eigenvalues of
2.5.2 Properties of the KL Expansion
Property 7 The coefficients
To prove this, we evaluate the covariance matrix
Since
Property 8 The variance of the
The proof follows directly from (55);
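Properties 7 and 8 can be checked numerically. In the sketch below (added for illustration; the Toeplitz covariance matrix is a hypothetical example), realizations of a correlated Gaussian vector are generated, the KL coefficients are formed with the eigenvectors of the covariance matrix, and the sample covariance of the coefficients comes out approximately diagonal with the eigenvalues on its diagonal.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical covariance matrix of a correlated, zero-mean, 3-element process.
R = np.array([[1.0, 0.9, 0.7],
              [0.9, 1.0, 0.9],
              [0.7, 0.9, 1.0]])

# The KL basis is the set of eigenvectors of R.
evals, V = np.linalg.eigh(R)

# Draw many realizations x ~ N(0, R) and form the KL coefficients theta = V^T x.
L = np.linalg.cholesky(R)
X = L @ rng.standard_normal((3, 100_000))      # each column is one realization of x
Theta = V.T @ X

# The sample covariance of the coefficients is approximately diagonal (Property 7),
# with the eigenvalues of R on its diagonal (Property 8).
C_theta = Theta @ Theta.T / Theta.shape[1]
print(np.round(C_theta, 3))
print("eigenvalues of R:", np.round(evals, 3))
```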
Property 9 The variance of a highly correlated random process
This property may be justified intuitively from the scatterplot of Figure 4, due to the fact that the length of the first principal axis is greater than that of the second. (This effect becomes more pronounced in higher dimensions.) However here we wish to formally prove this property.
Let us denote the covariance matrix of the process shown in Fig. 2 as
To obtain further insight into the behavior of the two sets of eigenvalues, we consider Hadamard's inequality,[16] which may be stated as follows:
Consider an m × m positive semi-definite matrix R with elements r_ij. Then
det(R) ≤ r_11 r_22 ... r_mm,
with equality if and only if R is diagonal.
From Hadamard’s inequality,
To illustrate this phenomenon further, consider the extreme case where the process becomes so correlated that all elements of its covariance matrix approach the same value. (This will happen if the process
2.5.3 Applications of the K-L Expansion
Suppose a communications system transmits a stationary, zero–mean highly–correlated sequence
But if
To implement this form of signal compression, let us say that an acceptable level of distortion is obtained by retaining only the first
where coefficients
An approximation
From Property 8, the mean–squared error
which corresponds to the sum of the truncated (smallest) eigenvalues. It is easy to prove that no other basis results in a smaller error. The error
where the last line uses (51) and (52). We have seen previously that the eigenvectors are the stationary points of each term in the sum above. Since each term in the sum is positive semi-definite,
In speech applications for example, fewer than one tenth of the coefficients are needed for reconstruction with imperceptible degradation. Note that since
Transform coding is now illustrated by an example. A process
We show this example for
Eigenvalues: 0.5468 0.1975 0.1243
The error
The corresponding two principal eigenvectors are plotted in Fig. 7. These plots show the value of the
In this case, we would expect that any observation
One of the practical difficulties in using the K–L expansion for coding is that the eigenvector set
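The following sketch (added here; the covariance model r_ij = rho^|i-j| and the choice of k are arbitrary) illustrates the truncated KL expansion used in transform coding: only the k largest eigenvectors are retained, and the measured reconstruction error agrees with the sum of the discarded eigenvalues, as stated above.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical covariance of a highly correlated zero-mean process: r_ij = rho^|i-j|.
m, rho = 8, 0.95
R = rho ** np.abs(np.subtract.outer(np.arange(m), np.arange(m)))

evals, V = np.linalg.eigh(R)
order = np.argsort(evals)[::-1]            # sort eigenpairs, largest eigenvalue first
evals, V = evals[order], V[:, order]

# Retain only the k largest ("principal") eigenvectors.
k = 2
Vk = V[:, :k]

# Generate realizations, encode each with k coefficients, and reconstruct.
L = np.linalg.cholesky(R)
X = L @ rng.standard_normal((m, 200_000))
X_hat = Vk @ (Vk.T @ X)

mse = np.mean(np.sum((X - X_hat) ** 2, axis=0))
print("measured reconstruction MSE :", round(float(mse), 3))
print("sum of discarded eigenvalues:", round(float(np.sum(evals[k:])), 3))
```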
2.6 Example: Array Processing
Here, we present a further example of the concepts we have developed so far. This example is concerned with direction of arrival estimation using arrays of sensors.
Consider an array of
where
In (60) we obtain
Note
The last line follows because the noise is uncorrelated with the signal, thus forcing the cross-terms to zero. In the last line of (61) we have also used the fact that the covariance matrix of the noise contribution (second term) is
Let's look at the structure of
From this structure, we may conclude that
Let us now investigate the eigendecomposition on
or
Because
From the definition of an eigenvector, we have
or
Since
We define the matrix
We also have
Up to now, we have considered only the noise–free case. What happens when the noise component
With this background in place we can now discuss the MUSIC[22] algorithm for estimating directions of arrival of plane waves incident onto arrays of sensors.
2.6.1 The MUSIC Algorithm[23]
We wish to estimate the unknown values
Only if
An estimate
By convention, it is desirable to express (73) as a spectrum–like function, where a peak instead of a null represents a desired signal. It is also convenient to use the squared-norm instead of the norm itself. Thus, the MUSIC “spectrum”
It will look something like what is shown in Fig. 10, when
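A compact simulation of the MUSIC procedure is sketched below. It is an added illustration, not the notes' own code: it assumes a uniform linear array with half-wavelength spacing, a particular steering-vector phase convention, two uncorrelated unit-power sources at hypothetical angles of -20 and 35 degrees, and white noise. The noise subspace is taken from the eigenvectors of the sample covariance matrix associated with its smallest eigenvalues, and the MUSIC spectrum described above is evaluated on a grid of angles.

```python
import numpy as np

rng = np.random.default_rng(6)

M, K, N = 8, 2, 1000                       # sensors, sources, snapshots
true_angles = np.deg2rad([-20.0, 35.0])    # hypothetical directions of arrival
noise_var = 0.1

def steering(theta, M):
    """Steering vector of a uniform linear array with half-wavelength spacing
    (one common phase convention; the notes' exact convention may differ)."""
    return np.exp(-1j * np.pi * np.arange(M) * np.sin(theta))

# Simulated snapshots: x = A s + n, with uncorrelated unit-power sources and white noise.
A = np.column_stack([steering(t, M) for t in true_angles])
S = (rng.standard_normal((K, N)) + 1j * rng.standard_normal((K, N))) / np.sqrt(2)
noise = np.sqrt(noise_var / 2) * (rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N)))
X = A @ S + noise

# Sample covariance; the eigenvectors of the M-K smallest eigenvalues span the noise subspace.
R_hat = X @ X.conj().T / N
_, V = np.linalg.eigh(R_hat)               # eigenvalues returned in ascending order
Vn = V[:, :M - K]

# MUSIC "spectrum": large where the steering vector is (nearly) orthogonal to the noise subspace.
grid = np.deg2rad(np.linspace(-90.0, 90.0, 1801))
spectrum = np.array([1.0 / np.linalg.norm(Vn.conj().T @ steering(t, M)) ** 2 for t in grid])

# Crude peak picking: local maxima of the spectrum; keep the K largest peaks.
is_peak = (spectrum[1:-1] > spectrum[:-2]) & (spectrum[1:-1] > spectrum[2:])
peak_idx = np.where(is_peak)[0] + 1
top = peak_idx[np.argsort(spectrum[peak_idx])[-K:]]
print("estimated angles (deg):", np.sort(np.round(np.rad2deg(grid[top]), 1)))
```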
2.7 TO SUMMARIZE
- An eigenvector v of a matrix A is such that Av points in the same direction as v.
- The covariance matrix R of a zero-mean random process x is defined as R = E(xx^T). For stationary processes, R completely characterizes the process, and is closely related to its covariance function. In practice, the expectation operation is replaced by a time-average.
- The eigenvectors of R form a natural basis to represent x, since it is only the eigenvectors which diagonalize R. This leads to the coefficients of the corresponding expansion being uncorrelated. This has significant application in speech/video encoding.
- The expectation of the square of each of the coefficients above is the corresponding eigenvalue of R. This gives an idea of the relative power present along each eigenvector.
- If the variables of the process are Gaussian, then the K-L coefficients are independent. This greatly simplifies receiver design and analysis.
Many of these points are a direct consequence of the fact that it is only the eigenvectors which can diagonalize a matrix. That is basically the only reason why eigenvalues/eigenvectors are so useful. I hope this serves to demystify this subject. Once you see that it is only the eigenvectors which diagonalize, the property that they form a natural basis for the process follows directly.
An interpretation of an eigenvalue is that it represents the average energy in each coefficient of the K–L expansion.
3 The Singular Value Decomposition (SVD)
In this lecture we learn about one of the most fundamental and important matrix decompositions of linear algebra: the SVD. It bears some similarity with the eigendecomposition (ED), but is more general. Usually, the ED is of interest only on symmetric square matrices, but the SVD may be applied to any matrix. The SVD gives us important information about the rank, the column and row spaces of the matrix, and leads to very useful solutions and interpretations of least squares problems. We also discuss the concept of matrix projectors, and their relationship with the SVD.
3.1 The Singular Value Decomposition (SVD)
We have found so far that the eigendecomposition is a useful analytic tool. However, it is only applicable to square symmetric matrices. We now consider the SVD, which may be considered a generalization of the ED to arbitrary matrices. Thus, with the SVD, all the analytical uses of the ED which before were restricted to symmetric matrices may now be applied to any form of matrix, regardless of size, whether it is symmetric or nonsymmetric, rank deficient, etc.
Theorem 1 Let
where
and
where
The matrix
Since
where
The SVD corresponding to (1) may be shown diagrammatically in the following way:
Each line above represents a column of either
3.2 Existence Proof of the SVD
Consider two vectors
That is,
We then define a matrix
The matrix
where
Now, we post-multiply both sides of (5) by the vector
This follows because the term on the extreme right is only the first element of the vector product of the middle term. But, as we have seen, matrix p-norms obey the following property:
Therefore using (6) and (7), we have
Note that
But, we defined
where the equality on the right follows because the matrix 2-norm is invariant to matrix pre- and post-multiplication by an orthonormal matrix. By comparing (9) and (10), we have the result
Substituting this result back into (5), we now have
The whole process repeats using only the component
It is instructive to consider an alternative proof for the SVD. The following is useful because it is a constructive proof, which shows us how to form the components of the SVD.
Theorem 2 Let
where
Proof: Consider the square symmetric positive semi–definite matrix
where
Then by equating corresponding blocks in (15) we have
From (16), we can write
Then, we define the matrix
Then from (18) we have
From (17) we also have
We now choose a matrix
Therefore
Combining (20), (21) and (23), we have
The above proof is useful for several reasons:
- It is short and elegant.
- We can also identify which part of the SVD is not unique. Here, we assume that
has no repeated non–zero eigenvalues. Because are the eigenvectors corresponding to the zero eigenvalues of , is not unique when there are repeated zero eigenvalues. This happens when , (i.e., is sufficiently short) or when the nullity of , or a combination of these conditions. - By its construction, the matrix
is not unique whenever it consists of two or more columns. This happens when . It is left as an exercise to show that similar conclusions on the uniqueness of and can be made when the proof is developed using the matrix .
3.3 Partitioning the SVD
Here we assume that
In the remainder of this lecture, we use the SVD partitioned in both
where
The columns of
3.4 Interesting Properties and Interpretations of the SVD
The above partition reveals many interesting properties of the SVD:
3.4.1 rank(A) = r
Using (25), we can write
where
This point is analogous to the case previously considered in Lecture 2, where we saw rank is equal to the number of non-zero eigenvalues, when
Determination of rank when
3.4.2
Recall the nullspace
Thus,
3.4.3
Recall that the definition of range
where
From the above we have
We see that as
3.4.4
Recall that
3.4.5
From Sect. 3.4.3, we see that
3.4.6
This is easy to see from the definition of the 2-norm and the ellipsoid example of section 3.6.
3.4.7 Inverse of A
If the SVD of a square matrix
The evaluation of
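The properties of Sections 3.4.1-3.4.7 can be verified numerically. The sketch below (added for illustration; the rank-2 test matrix is built from arbitrary random factors) reads off the rank, bases for the nullspace and for the orthogonal complement of the range, the 2-norm and Frobenius norm, and, for a square nonsingular matrix, the inverse, all from np.linalg.svd.

```python
import numpy as np

rng = np.random.default_rng(7)

# A 6 x 4 matrix of rank 2, built from thin random factors.
A = rng.standard_normal((6, 2)) @ rng.standard_normal((2, 4))

U, s, Vt = np.linalg.svd(A)
tol = 1e-10 * s[0]                                  # crude threshold for "zero" singular values
r = int(np.sum(s > tol))
print("numerical rank:", r)                         # 2

U1, U2 = U[:, :r], U[:, r:]                         # orthonormal bases for R(A) and its complement
V1, V2 = Vt[:r, :].T, Vt[r:, :].T                   # orthonormal bases for R(A^T) and N(A)

assert np.allclose(A @ V2, 0)                       # V2 spans the nullspace of A
assert np.allclose(U2.T @ A, 0)                     # U2 is orthogonal to the range of A
assert np.isclose(np.linalg.norm(A, 2), s[0])       # 2-norm equals the largest singular value
assert np.isclose(np.linalg.norm(A, 'fro'), np.linalg.norm(s))   # Frobenius norm

# For a square nonsingular matrix, the inverse follows directly from the SVD.
B = rng.standard_normal((4, 4))
Ub, sb, Vbt = np.linalg.svd(B)
B_inv = Vbt.T @ np.diag(1.0 / sb) @ Ub.T
assert np.allclose(B_inv, np.linalg.inv(B))
```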
3.4.8 The SVD diagonalizes any system of equations
Consider the system of equations
Let us now represent
and
Substituting the above into (34), the system of equations becomes
This shows that as long as we choose the correct bases, any system of equations can become diagonal. This property represents the power of the SVD; it allows us to transform arbitrary algebraic structures into their simplest forms.
If
Further, if
3.4.9 The “rotation” interpretation of the SVD
From the SVD relation
Note that since
The matrix
In the case where
3.5 Relationship between SVD and ED
It is clear that the eigendecomposition and the singular value decomposition share many properties in common. The price we pay for being able to perform a diagonal decomposition on an arbitrary matrix is that we need two orthonormal matrices instead of just one, as is the case for square symmetric matrices. In this section, we explore further relationships between the ED and the SVD.
Using (25), we can write
Thus it is apparent that the eigenvectors
As discussed in Golub and van Loan, the SVD is numerically more stable to compute than the ED. However, in the case where
Further, we can also say, using the form
which indicates that the eigenvectors of
We now compare the fundamental defining relationships for the ED and the SVD: For the ED, if
where
For the SVD, we have
or
where
Thus, by comparing (42), (43), and (44), we see the singular vectors and singular values obey a relation which is similar to that which defines the eigenvectors and eigenvalues. However, we note that in the SVD case, the fundamental relationship expresses the left singular vectors in terms of the right singular vectors, and vice-versa, whereas the eigenvectors are expressed in terms of themselves.
Exercise: compare the ED and the SVD on a square symmetric matrix, when i)
3.6 Ellipsoidal Interpretation of the SVD
The singular values of
That is, E is the set of points mapped out as
Let us change bases for both
Then (45) becomes
We note that
We see that the set
3.7 An Interesting Theorem
First, we realize that the SVD of
Given
Theorem 3 Define
then
In words, this says the closest rank
Proof: Since
where the first line follows from the fact that the 2-norm of a matrix is invariant to pre- and post-multiplication by an orthonormal matrix (properties of matrix p-norms, Lecture 2). Further, it may be shown that, for any matrix
Comparing (51) and (52), we see the closest rank
This result is very useful when we wish to approximate a matrix by another of lower rank. For example, let us look at the Karhunen-Loeve expansion as discussed in Lecture 2. For a sample
where the columns of
This fact may now be seen in a different light with the aid of this theorem. Suppose we retain the
where
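A numerical sketch of this low-rank approximation result follows (added here; the test matrix and the competing rank-k matrix are arbitrary random examples). The truncated SVD attains an error equal to the (k+1)st singular value, and a randomly chosen rank-k matrix does no better.

```python
import numpy as np

rng = np.random.default_rng(8)
A = rng.standard_normal((8, 6))

U, s, Vt = np.linalg.svd(A)

k = 3
# Best rank-k approximation: keep only the k largest singular values and vectors.
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# The 2-norm of the approximation error equals the (k+1)st singular value.
err = np.linalg.norm(A - A_k, 2)
print("error:", err, "  (k+1)st singular value:", s[k])
assert np.isclose(err, s[k])

# Any other rank-k matrix (here, an arbitrary random one) does at least as badly.
B = rng.standard_normal((8, k)) @ rng.standard_normal((k, 6))
assert np.linalg.norm(A - B, 2) >= err - 1e-12
```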
4 Orthogonal Projections
4.1 Sufficient Conditions for a Projector
Suppose we have a subspace
The matrix
A matrix
A matrix satisfying condition (2) is called an idempotent matrix. This is the fundamental property of a projector.
We now show that these three conditions are sufficient for
where
Because of condition 2,
where
Combining (57) and (58), we have
If
i.e.,
i.e.,
4.2 A Definition for P
Let
is a projector onto
Note that when
Exercises:
- prove (61).
- How is
in (61) formed if ?
Theorem 4 The projector onto
Proof: Let
The projector
Thus, the projector formed from (61) onto
In Section 4.1 we discussed sufficient conditions for a projector. This means that while these conditions are enough to specify a projector, there may be other conditions which also specify a projector. But since we have now proved the projector is unique, the conditions in Section 4.1 are also necessary.
4.3 The Orthogonal Complement Projector
Consider the vector
Therefore we have
It follows that if
4.4 Orthogonal Projections and the SVD
Suppose we have a matrix A with the partitioned SVD of (25). Then:
- V_1 V_1^T is the orthogonal projector onto R(A^T).
- V_2 V_2^T is the orthogonal projector onto N(A).
- U_1 U_1^T is the orthogonal projector onto R(A).
- U_2 U_2^T is the orthogonal projector onto R(A)^⊥.
To justify these results, we show each projector listed above satisfies the three conditions for a projector:
- First, we must show that each projector above has the corresponding subspace as its range (condition 1). In Sects. 3.4.2 and 3.4.3, we have already verified that V_2 is a basis for N(A) and that U_1 is a basis for R(A), as required. It is easy to verify that the remaining two projectors above (no.'s 1 and 4 respectively) also have the appropriate ranges.
- From the orthonormality property of each of the matrix partitions above, it is easy to see that condition 2 (idempotency) holds in each case.
- Finally, each matrix above is symmetric (condition 3). Therefore, each matrix above is a projector onto the corresponding subspace. These conditions are checked numerically in the sketch below.
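The following sketch (an added illustration; the rank-deficient test matrix is arbitrary) builds the four projectors from the SVD partition and checks the idempotency and symmetry conditions, together with the action of the column-space projector and its orthogonal complement on a vector lying in R(A).

```python
import numpy as np

rng = np.random.default_rng(9)

# A rank-deficient test matrix: 6 x 4 with rank 2.
A = rng.standard_normal((6, 2)) @ rng.standard_normal((2, 4))
U, s, Vt = np.linalg.svd(A)
r = int(np.sum(s > 1e-10 * s[0]))

U1, U2 = U[:, :r], U[:, r:]
V1, V2 = Vt[:r, :].T, Vt[r:, :].T

projectors = {
    "onto R(A^T)":    V1 @ V1.T,
    "onto N(A)":      V2 @ V2.T,
    "onto R(A)":      U1 @ U1.T,
    "onto R(A)-perp": U2 @ U2.T,
}

for name, P in projectors.items():
    assert np.allclose(P @ P, P), name     # condition 2: idempotent
    assert np.allclose(P, P.T), name       # condition 3: symmetric

# The column-space projector leaves a vector already in R(A) unchanged,
# while the complementary projector annihilates it.
y = A @ rng.standard_normal(4)             # y lies in R(A)
assert np.allclose(projectors["onto R(A)"] @ y, y)
assert np.allclose(projectors["onto R(A)-perp"] @ y, 0)
```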
5 The Quadratic Form
We introduce the quadratic form by considering the idea of positive definiteness. A square matrix
The matrix
which includes the possibility that
It is only the symmetric part of
Because
Theorem 1 A matrix
Proof: Since only the symmetric part of
Thus (4) is greater than zero for arbitrary
From (4), it is easy to verify that the equation
Positive definiteness of
Example: We now discuss an example to illustrate the above discussion. A three-dimensional plot of
The corresponding contour plot is plotted in Fig. 2. Note that this curve is elliptical in cross-section in a plane
We write the ellipse in the form
where
Theorem 2 A symmetric matrix
Proof: (Necessary condition) Let us define
Conversely (sufficient condition) without loss of generality we take an eigendecomposition on the symmetric part of
Note that
The fact that
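Theorem 2 suggests a simple numerical test, sketched below (added for illustration; the two 2 × 2 test matrices are arbitrary): form the symmetric part of the matrix and check the signs of its eigenvalues, then spot-check the quadratic form with random vectors.

```python
import numpy as np

def is_positive_definite(A):
    """Test x^T A x > 0 for all nonzero x by checking the eigenvalues of the
    symmetric part of A (only the symmetric part contributes to the quadratic form)."""
    A = np.asarray(A, dtype=float)
    A_sym = 0.5 * (A + A.T)
    return bool(np.all(np.linalg.eigvalsh(A_sym) > 0))

A = np.array([[2.0, -1.0],
              [-1.0, 2.0]])        # eigenvalues 1 and 3: positive definite
B = np.array([[1.0, 2.0],
              [2.0, 1.0]])         # eigenvalues -1 and 3: indefinite

print(is_positive_definite(A))     # True
print(is_positive_definite(B))     # False

# Spot-check: the quadratic form for A is positive for random nonzero vectors.
rng = np.random.default_rng(10)
for _ in range(1000):
    x = rng.standard_normal(2)
    assert x @ A @ x > 0
```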
5.1 The Gaussian Multi-Variate Probability Density Function
Here, we very briefly introduce this topic so we can use this material for an example of the application of the Cholesky decomposition later in this course, and also in least-squares analysis to follow shortly. This topic is a good application of quadratic forms. More detail is provided in several books.[26]
First we consider the uni-variate case of the Gaussian probability distribution function (pdf). The pdf
This is the familiar bell-shaped curve. It is completely specified by two parameters: the mean
We now consider the more interesting multi-dimensional case. Consider a Gaussian-distributed vector
We can see that the multi-variate case collapses to the uni-variate case when the number of variables becomes one. A plot of
Because the exponent in (9) is a quadratic form, the set of points satisfied by the equation
where
The covariance matrix
Now suppose we let
Here, because the ellipse describing the joint confidence region is elongated, we see that if one of the variables is known, the distribution of the other variable becomes more concentrated around the value of the first; i.e., knowledge of one variable tells us relatively more about the other. This implies the variables are more highly correlated with one another. But we have seen previously that if the variables in a vector random process are highly correlated, then the off-diagonal elements of the covariance matrix become larger, which leads to their eigenvalues becoming more disparate; i.e., the condition number of the covariance matrix becomes worse. It is precisely this poorer condition number that causes the ellipse in Fig. 4 to become elongated.
With this discussion, we have now gone full circle: a highly correlated system has large off-diagonal elements in its covariance matrix. This leads to a poorly conditioned covariance matrix. But a Gaussian-distributed process with a poorly-conditioned covariance matrix has a joint confidence region that is elongated. In turn, an elongated joint confidence region means the system is highly correlated, which takes us back to the beginning.
Understanding these relationships is a key element of rigor in signal processing.
5.2 The Rayleigh Quotient
The Rayleigh quotient is a simple mathematical structure that has a great deal of interesting uses. The Rayleigh quotient
It is easily verified that if
In fact, it is easily shown by differentiating
Further along this line of reasoning, let us define a subspace
Question: It is easily shown by differentiation that (13) for
This technique is referred to as the Rayleigh quotient iteration for computing an eigenvalue and eigenvector. In fact, this iteration is remarkably effective; it can be shown to have cubic convergence.
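A minimal sketch of the Rayleigh quotient iteration follows (added here; the symmetric test matrix, starting vector, iteration count, and stopping tolerance are arbitrary choices). Each step updates the shift with the current Rayleigh quotient and solves a shifted system; a few iterations suffice to reach an eigenpair to machine precision.

```python
import numpy as np

def rayleigh_quotient_iteration(A, x0, iters=8, tol=1e-12):
    """Rayleigh quotient iteration for a symmetric matrix A (illustrative sketch)."""
    x = x0 / np.linalg.norm(x0)
    for _ in range(iters):
        mu = x @ A @ x                                        # Rayleigh quotient: eigenvalue estimate
        if np.linalg.norm(A @ x - mu * x) < tol:
            break                                             # residual negligible: converged
        y = np.linalg.solve(A - mu * np.eye(A.shape[0]), x)   # shifted solve
        x = y / np.linalg.norm(y)
    return x @ A @ x, x

rng = np.random.default_rng(11)
B = rng.standard_normal((5, 5))
A = B + B.T                                                   # arbitrary symmetric test matrix

lam, v = rayleigh_quotient_iteration(A, rng.standard_normal(5))
print("converged eigenvalue :", lam)
print("all eigenvalues of A :", np.sort(np.linalg.eigvalsh(A)))
assert np.allclose(A @ v, lam * v, atol=1e-6)
```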
Eq. (4) is called a linear combination of the vectors: each vector is multiplied by a weight (or coefficient), and the results are summed. ↩︎
A vector e_i is referred to as an elementary vector; it has zeros everywhere except for a 1 in the ith position. ↩︎
Column rank deficient means the rank of the matrix is less than the number of columns. ↩︎
The characteristic polynomial of a matrix is defined in Chapter 2. ↩︎
An orthonormal matrix is defined in Chapter 2. ↩︎
A symmetric matrix is one where A = A^T, where the superscript T means transpose; i.e., for a symmetric matrix, an element a_ij = a_ji. A Hermitian symmetric (or just Hermitian) matrix is relevant only for the complex case, and is one where A = A^H, where the superscript H denotes the Hermitian transpose. This means the matrix is transposed and complex conjugated. Thus for a Hermitian matrix, an element a_ij = a*_ji. In this course we will generally consider only real matrices. However, when complex matrices are considered, Hermitian symmetric is implied instead of symmetric. ↩︎
Here, we have used the property that for matrices or vectors A and B of conformable size, (AB)^T = B^T A^T. ↩︎
From Lastman and Sinha, Microcomputer-based Numerical Methods for Science and Engineering. ↩︎
The trace of a square matrix, denoted tr(·), is the sum of its elements on the main diagonal (also called the "diagonal" elements). ↩︎
This only holds if the matrices involved are square and invertible. ↩︎
This proof is left as an exercise. ↩︎
A. Papoulis, Probability, Random Variables, and Stochastic Processes, McGraw Hill, 3rd Ed. ↩︎
Process with this property are referred to as ergodic processes. ↩︎
Haykin, “Adaptive Filter Theory”, Prentice Hall, 3rd. ed. ↩︎
An expansion of a vector usually requires the basis vectors to be only linearly independent, not necessarily orthonormal. But orthonormal basis vectors are most commonly used because they can be inverted using the very simple form of (49). ↩︎
For a proof, refer to Cover and Thomas, Elements of Information Theory. ↩︎
This is not necessarily a valid assumption. We discuss this point further, later in the section. ↩︎
K.R. Rao and P. Yip, “Discrete Cosine Transform– Algorithms, Advantages, Applications”. ↩︎
It may be shown that if the element spacing is sufficiently small, then there is a one-to-one relationship between the electrical angle and the corresponding physical angle. We can only observe the electrical angle, not the desired physical angle. Thus, we deduce the desired physical angle from the observed electrical angle through this mathematical relationship. ↩︎
Note that the eigenvalue zero has multiplicity greater than one. Therefore, the corresponding eigenvectors are not unique. However, a set of orthonormal eigenvectors which are orthogonal to the remaining eigenvectors exists. Thus we can treat the zero eigenvectors as if they were distinct. ↩︎
Let us define the so-called signal subspace as in (64) and the noise subspace as in (65). We now digress briefly to discuss these two subspaces further. From our discussion above, all columns of the noise-free covariance matrix are linear combinations of the columns of the steering matrix; this gives (66). But it is also easy to verify (67). Comparing (66) and (67), we see that the two subspaces coincide. From (60) we see that any received signal vector, in the absence of noise, is a linear combination of the columns of the steering matrix. Thus, any noise-free signal resides completely in the signal subspace. This is the origin of the term "signal subspace". Further, any component of the received signal residing in the noise subspace must be entirely due to the noise. This is the origin of the term "noise subspace". We note that the signal and noise subspaces are orthogonal complement subspaces of each other. ↩︎
This word is an acronym for MUltiple SIgnal Classification. ↩︎
R.O. Schmidt, “Multiple emitter location and parameter estimation”, IEEE Trans. Antennas and Propag., vol AP-34, Mar. 1986, pp 276-280. ↩︎
The concept of positive definiteness is discussed in the next lecture. Positive semi-definiteness means all the eigenvalues are greater than or equal to zero. ↩︎
Golub and van Loan pg. 73. ↩︎
e.g., H. Van Trees, "Detection, Estimation and Modulation Theory", Part 1. L.L. Scharf, "Statistical Signal Processing: Detection, Estimation, and Time Series Analysis," pg. 55. ↩︎
See Wilkinson, "The Algebraic Eigenvalue Problem", pp. 100-101. ↩︎