
EE731 Lecture Notes: Matrix Computations for Signal Processing

INFO

The original document is EE731 Lecture Notes: Matrix Computations for Signal Processing.

James P. Reilly (c)

Department of Electrical and Computer Engineering, McMaster University

September 13, 2004

0 Preface

This collection of ten chapters of notes will give the reader an introduction to the fundamental principles of linear algebra for application in many disciplines of modern engineering and science, including signal processing, control theory, process control, applied statistics, robotics, etc. We assume the reader has an equivalent background to a freshman course in linear algebra, some introduction to probability and statistics, and a basic knowledge of the Fourier transform.

In the first chapter, some fundamental ideas required for the remaining portion of the course are established. First, we look at some fundamental ideas of linear algebra such as linear independence, subspaces, rank, nullspace, range, etc., and how these concepts are interrelated. The idea of autocorrelation, and the covariance matrix of a signal, are then discussed and interpreted.

In chapter 2, the most basic matrix decomposition, the so-called eigendecomposition, is presented. The focus of the presentation is to give an intuitive insight into what this decomposition accomplishes. We illustrate how the eigendecomposition can be applied through the Karhunen-Loeve transform. In this way, the reader is made familiar with the important properties of this decomposition. The Karhunen-Loeve transform is then generalized to the broader idea of transform coding.

In chapter 3, we develop the singular value decomposition (SVD), which is closely related to the eigendecomposition of a matrix. We develop the relationships between these two decompositions and explore various properties of the SVD.

Chapter 4 deals with the quadratic form and its relation to the eigendecomposition, and also gives an introduction to error mechanisms in floating point number systems. The condition number of a matrix, which is a critical part in determining a lower bound on the relative error in the solution of a system of linear equations, is also developed.

Chapters 5 and 6 deal with solving linear systems of equations by Gaussian elimination. The Gaussian elimination process is described through a bigger-block matrix approach, that leads to other useful decompositions, such as the Cholesky decomposition of a square symmetric matrix.

Chapters 7-10 deal with solving least-squares problems. The standard least squares problem and its solution are developed in Chapter 7. In Chapter 8, we develop a generalized "pseudoinverse" approach to solving the least-squares problem. The QR decomposition is developed in Chapter 9, and its application to the solution of linear least squares problems is discussed in Chapter 10.

Finally, in Chapter 11, the solution of Toeplitz systems of equations and its underlying theory is developed.

1 Fundamental Concepts

The purpose of this lecture is to review important fundamental concepts in linear algebra, as a foundation for the rest of the course. We first discuss the fundamental building blocks, such as an overview of matrix multiplication from a “big block” perspective, linear independence, subspaces and related ideas, rank, etc., upon which the rigor of linear algebra rests. We then discuss vector norms, and various interpretations of the matrix multiplication operation. We close the chapter with a discussion on determinants.

1.1 Notation

Throughout this course, we shall indicate that a matrix $A$ is of dimension $m \times n$, with elements taken from the set of real numbers, by the notation $A \in \mathbb{R}^{m \times n}$. This means that the matrix $A$ belongs to the Cartesian product of the real numbers, taken $m \times n$ times, one for each element of $A$. In a similar way, the notation $A \in \mathbb{C}^{m \times n}$ means the matrix is of dimension $m \times n$, and the elements are taken from the set of complex numbers. By the matrix dimension "$m \times n$", we mean $A$ consists of $m$ rows and $n$ columns.

Similarly, the notation $a \in \mathbb{R}^m$ ($\mathbb{C}^m$) implies a vector of dimension $m$ whose elements are taken from the set of real (complex) numbers. By "dimension of a vector", we mean its length, i.e., that it consists of $m$ elements.

Also, we shall indicate that a scalar $a$ is from the set of real (complex) numbers by the notation $a \in \mathbb{R}$ ($\mathbb{C}$). Thus, an upper case bold character denotes a matrix, a lower case bold character denotes a vector, and a lower case non-bold character denotes a scalar.

By convention, a vector by default is taken to be a column vector. Further, for a matrix $A$, we denote its $i$th column as $a_i$. We also imply that its $j$th row is $a_j^T$, even though this notation may be ambiguous, since it may also be taken to mean the transpose of the $j$th column. The context of the discussion will help to resolve the ambiguity.

1.2 “Bigger-Block” Interpretations of Matrix Multiplication

Let us define the matrix product C as

(1) $C_{m \times n} = A_{m \times k}\, B_{k \times n}$

The three interpretations of this operation now follow:

1.2.1 Inner-Product Representation

If $a$ and $b$ are column vectors of the same length, then the scalar quantity $a^T b$ is referred to as the inner product of $a$ and $b$. If we define $a_i^T \in \mathbb{R}^k$ as the $i$th row of $A$ and $b_j \in \mathbb{R}^k$ as the $j$th column of $B$, then the element $c_{ij}$ of $C$ is defined as the inner product $a_i^T b_j$. This is the conventional small-block representation of matrix multiplication.

1.2.2 Column Representation

This is the next bigger–block view of matrix multiplication. Here we look at forming the product one column at a time. The $j$th column $c_j$ of $C$ may be expressed as a linear combination of the columns $a_i$ of $A$, with coefficients which are the elements of the $j$th column of $B$. Thus,

(2) $c_j = \sum_{i=1}^{k} a_i b_{ij}, \qquad j = 1, \ldots, n.$

This operation is identical to the inner–product representation above, except we form the product one column at a time. For example, if we evaluate only the $p$th element of the $j$th column $c_j$, we see that (2) degenerates into $\sum_{i=1}^{k} a_{pi} b_{ij}$. This is the inner product of the $p$th row of $A$ and the $j$th column of $B$, which is the required expression for the $(p,j)$th element of $C$.

1.2.3 Outer–Product Representation

This is the largest–block representation. Let us define a column vector $a \in \mathbb{R}^m$ and a row vector $b^T \in \mathbb{R}^n$. Then the outer product of $a$ and $b$ is an $m \times n$ matrix of rank one, defined as $ab^T$.

Now let $a_i$ and $b_i^T$ be the $i$th column of $A$ and the $i$th row of $B$ respectively. Then the product $C$ may also be expressed as

(3) $C = \sum_{i=1}^{k} a_i b_i^T.$

By looking at this operation one column at a time, we see this form of matrix multiplication performs exactly the same operations as the column representation above. For example, the $j$th column $c_j$ of the product is determined from (3) to be $c_j = \sum_{i=1}^{k} a_i b_{ij}$, which is identical to (2) above.
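As a quick numerical check (this sketch is not part of the original notes), the following minimal numpy example forms the same product $C = AB$ three ways, mirroring the inner-product, column, and outer-product representations above; the random matrices are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(0)
m, k, n = 4, 3, 5
A = rng.standard_normal((m, k))
B = rng.standard_normal((k, n))

# Inner-product form: c_ij = (i-th row of A) . (j-th column of B)
C_inner = np.array([[A[i, :] @ B[:, j] for j in range(n)] for i in range(m)])

# Column form (eq. (2)): the j-th column of C is a linear combination of the
# columns of A, with coefficients taken from the j-th column of B
C_col = np.column_stack([sum(A[:, i] * B[i, j] for i in range(k)) for j in range(n)])

# Outer-product form (eq. (3)): C is a sum of k rank-one matrices
C_outer = sum(np.outer(A[:, i], B[i, :]) for i in range(k))

assert np.allclose(C_inner, A @ B)
assert np.allclose(C_col, A @ B)
assert np.allclose(C_outer, A @ B)
```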

1.2.4 Matrix Pre– and Post–Multiplication

Let us now look at some fundamental ideas distinguishing matrix pre– and post–multiplication. In this respect, consider a matrix $A$ pre–multiplied by $B$ to give $Y = BA$. (All matrices are assumed to have conformable dimensions.) Then we can interpret this multiplication as $B$ operating on the columns of $A$ to give the columns of the product. This follows because each column $y_i$ of the product is a transformed version of the corresponding column of $A$; i.e., $y_i = B a_i$, $i = 1, \ldots, n$. Likewise, let us consider $A$ post–multiplied by a matrix $C$ to give $X = AC$. Then, we interpret this multiplication as $C$ operating on the rows of $A$, because each row $x_j^T$ of the product is a transformed version of the corresponding row of $A$; i.e., $x_j^T = a_j^T C$, $j = 1, \ldots, m$, where we define $a_j^T$ as the $j$th row of $A$.

Example:

  • Consider an orthonormal matrix Q of appropriate dimension. We know that multiplication by an orthonormal matrix results in a rotation operation. The operation QA rotates each column of A. The operation AQ rotates each row.

There is another way to interpret pre– and post–multiplication. Again consider the matrix $A$ pre–multiplied by $B$ to give $Y = BA$. Then according to (2), the $j$th column $y_j$ of $Y$ is a linear combination of the columns of $B$, whose coefficients are the $j$th column of $A$. Likewise, for $X = AB$, we can say that the $i$th row $x_i^T$ of $X$ is a linear combination of the rows of $B$, whose coefficients are the $i$th row of $A$.

Either of these interpretations is equally valid. Being comfortable with the representations of this section is a big step in mastering the field of linear algebra.

1.3 Fundamental Linear Algebra

1.3.1 Linear Independence

Suppose we have a set of $n$ $m$-dimensional vectors $\{a_1, \ldots, a_n\}$, where $a_i \in \mathbb{R}^m$, $i = 1, \ldots, n$. This set is linearly independent under the conditions [1]

(4) $\sum_{j=1}^{n} c_j a_j = 0 \quad \text{if and only if} \quad c_1 = \cdots = c_n = 0$

In words:

Eq. (4) means that a set of vectors is linearly independent if and only if the only linear combination of the vectors equal to zero is the one whose coefficients are all zero.

A set of n vectors is linearly independent if an n–dimensional space may be formed by taking all possible linear combinations of the vectors. If the dimension of the space is less than n, then the vectors are linearly dependent. The concept of a vector space and the dimension of a vector space is made more precise later.

Note that a set of vectors $\{a_1, \ldots, a_n\}$, where $n > m$, cannot be linearly independent.

Example 1

(5) $A = [a_1\ a_2\ a_3] = \begin{bmatrix} 1 & 2 & 1 \\ 0 & 3 & 1 \\ 0 & 0 & 1 \end{bmatrix}$

This set is linearly independent. On the other hand, the set

(6) $B = [b_1\ b_2\ b_3] = \begin{bmatrix} 1 & 2 & 3 \\ 0 & 3 & 3 \\ 0 & 0 & 0 \end{bmatrix}$

is not. This follows because the third column is a linear combination of the first two: 1 times the first column plus 1 times the second equals the third column. Thus, the coefficients $c_j$ in (4) resulting in zero are any scalar multiple of $(1, 1, -1)$.
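A minimal numpy sketch (added here for illustration, not from the original notes) that verifies the claims of Example 1 numerically:

```python
import numpy as np

A = np.array([[1, 2, 1],
              [0, 3, 1],
              [0, 0, 1]])
B = np.array([[1, 2, 3],
              [0, 3, 3],
              [0, 0, 0]])

print(np.linalg.matrix_rank(A))   # 3: the columns of A are linearly independent
print(np.linalg.matrix_rank(B))   # 2: the columns of B are linearly dependent

# The dependence b1 + b2 - b3 = 0, i.e. coefficients proportional to (1, 1, -1)
c = np.array([1, 1, -1])
print(B @ c)                      # [0 0 0]
```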

1.3.2 Span, Range, and Subspaces

In this section, we explore these three closely-related ideas. In fact, their mathematical definitions are almost the same, but the interpretation is different for each case.

Span: The span of a vector set $[a_1, \ldots, a_n]$, written as $\operatorname{span}[a_1, \ldots, a_n]$, where $a_i \in \mathbb{R}^m$, is the set of points mapped by

(7) $\operatorname{span}[a_1, \ldots, a_n] = \left\{ y \in \mathbb{R}^m \mid y = \sum_{j=1}^{n} c_j a_j,\ c_j \in \mathbb{R} \right\}.$

In other words, $\operatorname{span}[a_1, \ldots, a_n]$ is the set of all possible linear combinations of the vectors $a_i$. If the vectors are linearly independent, then the dimension of this set of linear combinations is $n$. If the vectors are linearly dependent, then the dimension is less.

The set of vectors in a span is referred to as a vector space. The dimension of a vector space is the number of linearly independent vectors in the linear combination which forms the space. Note that the vector space dimension is not the dimension (length) of the vectors forming the linear combinations.

Example 2: Consider the following 2 vectors in Fig. 1: The span of these vectors is the (infinite extension of the) plane of the paper.

Subspaces: Given a set (space) of vectors $[a_1, \ldots, a_n]$, $a_i \in \mathbb{R}^m$, $m \ge n$, a subspace $S$ is a vector subset that satisfies two requirements:

  1. If x and y are in the subspace, then x+y is still in the subspace.
  2. If we multiply any vector x in the subspace by a scalar c, then cx is still in the subspace.

These two requirements imply that for a subspace, any linear combination of vectors which are in the subspace is itself in the subspace. Comparing this idea with that of span, we see a subspace defined by the vectors $[a_1, \ldots, a_n]$ is identical to $\operatorname{span}[a_1, \ldots, a_n]$. However, a subspace has the interpretation that the set of vectors comprising the subspace must be a subset of a larger space. For example, the vectors $[a_1, a_2]$ in Fig. 1 define a subspace (the plane of the paper) which is a subset of the three–dimensional universe $\mathbb{R}^3$.

Hence formally, a $k$–dimensional subspace $S$ of $\operatorname{span}[a_1, \ldots, a_n]$ is determined by $\operatorname{span}[a_{i_1}, \ldots, a_{i_k}]$, where the distinct indices satisfy $\{i_1, \ldots, i_k\} \subseteq \{1, \ldots, n\}$; that is, the vector space $S = \operatorname{span}[a_{i_1}, \ldots, a_{i_k}]$ is a subset of $\operatorname{span}[a_1, \ldots, a_n]$.

Note that $[a_{i_1}, \ldots, a_{i_k}]$ is not necessarily a basis for the subspace $S$. This set is a basis only if it is a maximally independent set. This idea is discussed shortly. The set $\{a_i\}$ need not be linearly independent to define the span or subspace.

∗ What is the span of the vectors $[b_1, b_2, b_3]$ in Example 1?

Range: The range of a matrix $A \in \mathbb{R}^{m \times n}$, denoted $R(A)$, is a subspace (set of vectors) satisfying

(8) $R(A) = \{ y \in \mathbb{R}^m \mid y = Ax, \text{ for } x \in \mathbb{R}^n \}.$

We can interpret the matrix–vector multiplication $y = Ax$ above according to the column representation for matrix multiplication (2), where the product $C$ has only one column. Thus, we see that $y$ is a linear combination of the columns $a_i$ of $A$, whose coefficients are the elements $x_i$ of $x$. Therefore, (8) is equivalent to (7), and $R(A)$ is thus the span of the columns of $A$. The distinction between range and span is that the argument of range is a matrix, while for span it is a set of vectors. If the columns of $A$ are (not) linearly independent, then $R(A)$ will (not) span $n$ dimensions. Thus, the dimension of the vector space $R(A)$ is less than or equal to $n$. Any vector $y \in R(A)$ is of dimension (length) $m$.

Example 3:

(9) $A = \begin{bmatrix} 1 & 5 & 3 \\ 2 & 4 & 3 \\ 3 & 3 & 3 \end{bmatrix} \qquad \text{(the last column is the average of the first two)}$

R(A) is the set of all linear combinations of any two columns of A.

In the case when $n < m$ (i.e., $A$ is a tall matrix), it is important to note that $R(A)$ is indeed a subspace of the $m$-dimensional "universe" $\mathbb{R}^m$. In this case, the dimension of $R(A)$ is less than or equal to $n$. Thus, $R(A)$ does not span the whole universe, and therefore is a subspace of it.

1.3.3 Maximally Independent Set

This is a vector set which cannot be made larger without losing independence, nor smaller while remaining maximal; i.e., it is a set containing the maximum number of independent vectors spanning the space.

1.3.4 A Basis

A basis for a subspace is any maximally independent set within the subspace. It is not unique.

Example 4. A basis for the subspace $S$ spanned by the first 2 columns of

$A = \begin{bmatrix} 1 & 2 & 3 \\ 0 & 3 & 3 \\ 0 & 0 & 3 \end{bmatrix}, \quad \text{i.e., } S = \operatorname{span}\left\{ \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 2 \\ 3 \\ 0 \end{bmatrix} \right\}$

is

$e_1 = (1, 0, 0)^T, \qquad e_2 = (0, 1, 0)^T.$

[2] or any other linearly independent set in $\operatorname{span}[e_1, e_2]$.

Any vector in S is uniquely represented as a linear combination of the basis vectors.

1.3.5 Orthogonal Complement Subspace

If we have a subspace $S$ of dimension $n$ consisting of vectors $[a_1, \ldots, a_n]$, $a_i \in \mathbb{R}^m$, $i = 1, \ldots, n$, for $n \le m$, the orthogonal complement subspace $S^\perp$ of $S$, of dimension $m - n$, is defined as

(10) $S^\perp = \{ y \in \mathbb{R}^m \mid y^T x = 0 \text{ for all } x \in S \}$

i.e., any vector in $S^\perp$ is orthogonal to any vector in $S$. The quantity $S^\perp$ is pronounced "S–perp".

Example 5: Take the vector set defining S from Example 4:

(11) $S = \operatorname{span}\begin{bmatrix} 1 & 2 \\ 0 & 3 \\ 0 & 0 \end{bmatrix}$

then, a basis for $S^\perp$ is

(12) $\begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}$

1.3.6 Rank

Rank is an important concept which we will use frequently throughout this course. We briefly describe only a few basic features of rank here. The idea is expanded more fully in the following sections.

  1. The rank of a matrix is the maximum number of linearly independent rows or columns. Thus, it is the dimension of a basis for the columns (rows) of a matrix.
  2. The rank of $A$ (denoted $\operatorname{rank}(A)$) is the dimension of $R(A)$.
  3. If $A = BC$, and $r_1 = \operatorname{rank}(B)$, $r_2 = \operatorname{rank}(C)$, then $\operatorname{rank}(A) \le \min(r_1, r_2)$.
  4. A matrix $A \in \mathbb{R}^{m \times n}$ is said to be rank deficient if its rank is less than $\min(m, n)$. Otherwise, it is said to be full rank.
  5. If $A$ is square and rank deficient, then $\det(A) = 0$.
  6. It can be shown that $\operatorname{rank}(A) = \operatorname{rank}(A^T)$. More is said on this point later.

A matrix is said to be full column (row) rank if its rank is equal to the number of columns (rows).

Example: The rank of A in Example 4 is 3, whereas the rank of A in Example 3 is 2.

1.3.7 Null Space of A

The null space N(A) of A is defined as

(13) $N(A) = \{ x \neq 0 \in \mathbb{R}^n \mid Ax = 0 \}.$

From previous discussions, the product $Ax$ is a linear combination of the columns $a_i$ of $A$, where the elements $x_i$ of $x$ are the corresponding coefficients. Thus, from (13), $N(A)$ is the set of non–zero coefficients of all zero linear combinations of the columns of $A$. If the columns of $A$ are linearly independent, then $N(A) = \emptyset$ by definition, because there can be no coefficients except zero which result in a zero linear combination. In this case, the dimension of the null space is zero, and $A$ is full column rank. The null space is empty if and only if $A$ is full column rank, and is non–empty when $A$ is column rank deficient.[3] Note that any vector in $N(A)$ is of dimension $n$. Any vector in $N(A)$ is orthogonal to the rows of $A$, and is thus in the orthogonal complement of the span of the rows of $A$.

Example 6: Let $A$ be as before in Example 3. Then $N(A) = c\,(1, 1, -2)^T$, where $c$ is a real constant.

A further example is as follows. Take 3 vectors $[a_1, a_2, a_3]$, where $a_i \in \mathbb{R}^3$, $i = 1, \ldots, 3$, that are constrained to lie in a 2–dimensional plane. Then there exists a zero linear combination of these vectors. The coefficients of this linear combination define a vector $x$ which is in the nullspace of $A = [a_1, a_2, a_3]$. In this case, we see that $A$ is rank deficient.

Another important characterization of a matrix is its nullity. The nullity of A is the dimension of the nullspace of A. In Example 6 above, the nullity of A is one. We then have the following interesting property:

(14) $\operatorname{rank}(A) + \operatorname{nullity}(A) = n.$
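As an illustration (not part of the original notes), the following numpy sketch checks the rank, the nullspace, and the rank-plus-nullity relation (14) for the matrix of Examples 3 and 6; the nullspace is extracted from the singular vectors associated with the (numerically) zero singular values.

```python
import numpy as np

A = np.array([[1, 5, 3],
              [2, 4, 3],
              [3, 3, 3]], dtype=float)

r = np.linalg.matrix_rank(A)          # 2
# Rows of Vt associated with zero singular values span N(A)
_, s, Vt = np.linalg.svd(A)
null_basis = Vt[s < 1e-10]
nullity = null_basis.shape[0]         # 1

print(r + nullity)                    # 3 = n  (rank-nullity, eq. (14))
print(null_basis / null_basis[0, 0])  # proportional to (1, 1, -2)
```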

1.4 Four Fundamental Subspaces of a Matrix

The four matrix subspaces of concern are: the column space, the row space, and their respective orthogonal complements. The development of these four subspaces is closely linked to $N(A)$ and $R(A)$. We assume for this section that $A \in \mathbb{R}^{m \times n}$ and $r \le \min(m, n)$, where $r = \operatorname{rank}(A)$.

1.4.1 The Column Space

This is simply R(A). Its dimension is r. It is the set of all linear combinations of the columns of A.

1.4.2 The Orthogonal Complement of the Column Space

This may be expressed as $R(A)^\perp$, with dimension $m - r$. It may be shown to be equivalent to $N(A^T)$, as follows. By definition, $N(A^T)$ is the set of vectors $x$ satisfying:

(15) $A^T \begin{bmatrix} x_1 \\ \vdots \\ x_m \end{bmatrix} = 0,$

where the columns of $A$ are the rows of $A^T$. From (15), we see that $N(A^T)$ is the set of $x \in \mathbb{R}^m$ which is orthogonal to all columns of $A$ (rows of $A^T$). This by definition is the orthogonal complement of $R(A)$.

1.4.3 The Row Space

The row space is defined simply as $R(A^T)$, with dimension $r$. The row space is the range of the rows of $A$, or the subspace spanned by the rows, or the set of all possible linear combinations of the rows of $A$.

1.4.4 The Orthogonal Complement of the Row Space

This may be denoted as $R(A^T)^\perp$. Its dimension is $n - r$. This set must be that which is orthogonal to all rows of $A$; i.e., for $x$ to be in this space, $x$ must satisfy

(16) $\underbrace{A}_{\text{rows of } A} \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix} = 0.$

Thus, the set of $x$ satisfying (16), which is the orthogonal complement of the row space, is simply $N(A)$.

We have noted before that $\operatorname{rank}(A) = \operatorname{rank}(A^T)$. Thus, the dimensions of the row and column subspaces are equal. This is surprising, because it implies the number of linearly independent rows of a matrix is the same as the number of linearly independent columns. This holds regardless of the size or rank of the matrix. It is not an intuitively obvious fact, and there is no immediately obvious reason why this should be so. Nevertheless, the rank of a matrix is the number of independent rows or columns.

1.5 Vector Norms

A vector norm is a means of expressing the length or distance associated with a vector. A norm on a vector space $\mathbb{R}^n$ is a function $f$ which maps a point in $\mathbb{R}^n$ into a point in $\mathbb{R}$. Formally, this is stated mathematically as $f: \mathbb{R}^n \to \mathbb{R}$. The norm has the following properties:

  1. $f(x) \ge 0$ for all $x \in \mathbb{R}^n$.
  2. $f(x) = 0$ if and only if $x = 0$.
  3. $f(x + y) \le f(x) + f(y)$ for $x, y \in \mathbb{R}^n$.
  4. $f(ax) = |a| f(x)$ for $a \in \mathbb{R}$, $x \in \mathbb{R}^n$.

We denote the function $f(x)$ as $\|x\|$.

The p-norms: This is a useful class of norms, generalizing on the idea of the Euclidean norm. They are defined by

(17) $\|x\|_p = \left( |x_1|^p + |x_2|^p + \cdots + |x_n|^p \right)^{1/p}.$

If $p = 1$:

$\|x\|_1 = \sum_i |x_i|$

which is simply the sum of the absolute values of the elements.

If $p = 2$:

$\|x\|_2 = \left( \sum_i x_i^2 \right)^{1/2} = (x^T x)^{1/2}$

which is the familiar Euclidean norm.

If $p = \infty$:

$\|x\|_\infty = \max_i |x_i|$

which is the element of $x$ with largest absolute value. This may be shown in the following way. As $p \to \infty$, the largest term within the brackets in (17) dominates all the others. Therefore (17) may be written as

(18) $\|x\|_\infty = \lim_{p \to \infty} \left[ \sum_{i=1}^{n} |x_i|^p \right]^{1/p} = \lim_{p \to \infty} \left[ |x_k|^p \right]^{1/p} = |x_k|$

where $k$ is the index corresponding to the element $x_i$ of largest absolute value.

Note that the $p = 2$ norm has many useful properties, but is expensive to compute. Obviously, the 1– and $\infty$–norms are easier to compute, but are more difficult to deal with algebraically. All the p–norms obey all the properties of a vector norm.
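A small numpy illustration (added here, not from the notes) of the 1-, 2-, and $\infty$-norms, and of the limiting behaviour in (18); the example vector is arbitrary.

```python
import numpy as np

x = np.array([3.0, -4.0, 1.0])

print(np.linalg.norm(x, 1))       # 8.0   : sum of absolute values
print(np.linalg.norm(x, 2))       # ~5.099: Euclidean norm, sqrt(x^T x)
print(np.linalg.norm(x, np.inf))  # 4.0   : largest absolute element

# The infinity-norm as the limit of p-norms (eq. (18))
for p in (1, 2, 10, 100):
    print(p, np.linalg.norm(x, p))
```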

1.6 Determinants

Consider a square matrix $A \in \mathbb{R}^{m \times m}$. We can define the matrix $A_{ij}$ as the submatrix obtained from $A$ by deleting the $i$th row and $j$th column of $A$. The scalar number $\det(A_{ij})$ (where $\det(\cdot)$ denotes determinant) is called the minor associated with the element $a_{ij}$ of $A$. The signed minor $c_{ij} \triangleq (-1)^{i+j} \det(A_{ij})$ is called the cofactor of $a_{ij}$.

The determinant of A is the m-dimensional volume contained within the columns (rows) of A. This interpretation of determinant is very useful as we see shortly. The determinant of a matrix may be evaluated by the expression

(19) $\det(A) = \sum_{j=1}^{m} a_{ij} c_{ij}, \qquad i \in (1, \ldots, m),$

or

(20) $\det(A) = \sum_{i=1}^{m} a_{ij} c_{ij}, \qquad j \in (1, \ldots, m).$

Both the above are referred to as the cofactor expansion of the determinant. Eq. (19) is along the ith row of A, whereas (20) is along the jth column. It is indeed interesting to note that both versions above give exactly the same number, regardless of the value of i or j.

Eqs. (19) and (20) express the $m \times m$ determinant $\det A$ in terms of the cofactors $c_{ij}$ of $A$, which are themselves $(m-1) \times (m-1)$ determinants. Thus, $m - 1$ recursions of (19) or (20) will finally yield the determinant of the $m \times m$ matrix $A$.

From (19) it is evident that if A is triangular, then det(A) is the product of the main diagonal elements. Since diagonal matrices are in the upper triangular set, then the determinant of a diagonal matrix is also the product of its diagonal elements.

Properties of Determinants: Before we begin this discussion, let us define the volume of the parallelepiped defined by the set of column vectors comprising a matrix as the principal volume of that matrix.

We have the following properties of determinants, which are stated without proof:

  1. $\det(AB) = \det(A)\det(B)$, $A, B \in \mathbb{R}^{m \times m}$. The principal volume of the product of matrices is the product of the principal volumes of each matrix.
  2. $\det(A) = \det(A^T)$. This property shows that the characteristic polynomials[4] of $A$ and $A^T$ are identical. Consequently, as we see later, the eigenvalues of $A^T$ and $A$ are identical.
  3. $\det(cA) = c^m \det(A)$, $c \in \mathbb{R}$, $A \in \mathbb{R}^{m \times m}$. This is a reflection of the fact that if each vector defining the principal volume is multiplied by $c$, then the resulting volume is multiplied by $c^m$.
  4. $\det(A) = 0$ if and only if $A$ is singular. This implies that at least one dimension of the principal volume of the corresponding matrix has collapsed to zero length.
  5. $\det(A) = \prod_{i=1}^{m} \lambda_i$, where the $\lambda_i$ are the eigen (singular) values of $A$. This means the parallelepiped defined by the column or row vectors of a matrix may be transformed into a regular rectangular solid of the same $m$–dimensional volume whose edges have lengths corresponding to the eigen (singular) values of the matrix.
  6. The determinant of an orthonormal[5] matrix is $\pm 1$. This is easy to see, because the vectors of an orthonormal matrix are all of unit length and mutually orthogonal. Therefore the corresponding principal volume is $\pm 1$.
  7. If $A$ is nonsingular, then $\det(A^{-1}) = [\det(A)]^{-1}$.
  8. If $B$ is nonsingular, then $\det(B^{-1}AB) = \det(A)$.
  9. If $B$ is obtained from $A$ by interchanging any two rows (or columns), then $\det(B) = -\det(A)$.
  10. If $B$ is obtained from $A$ by adding a scalar multiple of one row to another (or a scalar multiple of one column to another), then $\det(B) = \det(A)$.

A further property of determinants allows us to compute the inverse of $A$. Define the matrix $\tilde{A}$ as the adjoint of $A$:

(21) $\tilde{A} = \begin{bmatrix} c_{11} & \cdots & c_{1m} \\ \vdots & & \vdots \\ c_{m1} & \cdots & c_{mm} \end{bmatrix}^T$

where the $c_{ij}$ are the cofactors of $A$. According to (19) or (20), the $i$th row $\tilde{a}_i^T$ of $\tilde{A}$ times the $i$th column $a_i$ of $A$ is $\det(A)$; i.e.,

(22) $\tilde{a}_i^T a_i = \det(A), \qquad i = 1, \ldots, m.$

It can also be shown that

(23) $\tilde{a}_i^T a_j = 0, \qquad i \neq j.$

Then, combining (22) and (23) for $i, j \in \{1, \ldots, m\}$, we have the following interesting property:

(24) $\tilde{A} A = \det(A)\, I,$

where $I$ is the $m \times m$ identity matrix. It then follows from (24) that the inverse $A^{-1}$ of $A$ is given as

(25) $A^{-1} = [\det(A)]^{-1} \tilde{A}.$

Neither (19) nor (25) is a computationally efficient way of calculating a determinant or an inverse, respectively. Better methods which exploit the properties of various matrix decompositions are made evident later in the course.
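For illustration only (this sketch is not from the notes and, as just stated, is far from an efficient method), the cofactor expansion (19) and the adjugate formula (25) can be coded directly; the test matrix below is an arbitrary example.

```python
import numpy as np

def minor(A, i, j):
    """Submatrix of A with row i and column j deleted."""
    return np.delete(np.delete(A, i, axis=0), j, axis=1)

def det_cofactor(A):
    """Determinant by cofactor expansion along the first row (eq. (19)).
    Factorial cost -- illustrative only, not a practical algorithm."""
    m = A.shape[0]
    if m == 1:
        return A[0, 0]
    return sum((-1) ** j * A[0, j] * det_cofactor(minor(A, 0, j)) for j in range(m))

def inverse_adjugate(A):
    """Inverse via the adjoint (adjugate): A^{-1} = [det(A)]^{-1} A~ (eq. (25))."""
    m = A.shape[0]
    C = np.array([[(-1) ** (i + j) * det_cofactor(minor(A, i, j))
                   for j in range(m)] for i in range(m)])
    return C.T / det_cofactor(A)

A = np.array([[4.0, 1.0, 2.0],
              [1.0, 3.0, 0.0],
              [2.0, 0.0, 5.0]])
assert np.isclose(det_cofactor(A), np.linalg.det(A))
assert np.allclose(inverse_adjugate(A), np.linalg.inv(A))
```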

2 Lecture 2

This lecture discusses eigenvalues and eigenvectors in the context of the Karhunen–Loeve (KL) expansion of a random process. First, we discuss the fundamentals of eigenvalues and eigenvectors, then go on to covariance matrices. These two topics are then combined into the K-L expansion. An example from the field of array signal processing is given as an application of algebraic ideas.

A major aim of this presentation is an attempt to de-mystify the concepts of eigenvalues and eigenvectors by showing a very important application in the field of signal processing.

2.1 Eigenvalues and Eigenvectors

Suppose we have a matrix A:

(1) $A = \begin{bmatrix} 4 & 1 \\ 1 & 4 \end{bmatrix}$

We investigate its eigenvalues and eigenvectors.

Suppose we take the product $A x_1$, where $x_1 = (1, 0)^T$, as shown in Fig. 1. Then,

(2) $A x_1 = \begin{bmatrix} 4 \\ 1 \end{bmatrix}.$

By comparing the vectors x1 and Ax1 we see that the product vector is scaled and rotated counter–clockwise with respect to x1.

Now consider the case where $x_2 = (0, 1)^T$. Then $A x_2 = (1, 4)^T$. Here, we note a clockwise rotation of $A x_2$ with respect to $x_2$.

Now let's consider a more interesting case. Suppose $x_3 = (1, 1)^T$. Then $A x_3 = (5, 5)^T$. Now the product vector points in the same direction as $x_3$. The vector $A x_3$ is a scaled version of the vector $x_3$. Because of this property, $x_3 = (1, 1)^T$ is an eigenvector of $A$. The scale factor (which in this case is 5) is given the symbol $\lambda$ and is referred to as an eigenvalue.

Note that $x = [1, -1]^T$ is also an eigenvector, because in this case, $Ax = [3, -3]^T = 3x$. The corresponding eigenvalue is 3.

Thus we have, if $x$ is an eigenvector of $A \in \mathbb{R}^{n \times n}$,

(3) $Ax = \lambda x, \qquad \lambda:\ \text{scalar multiple (eigenvalue)}$

i.e., the vector Ax is in the same direction as x but scaled by a factor λ.

Now that we have an understanding of the fundamental idea of an eigenvector, we proceed to develop the idea further. Eq. (3) may be written in the form

(4) $(A - \lambda I)x = 0$

where I is the n×n identity matrix. Eq. (4) is a homogeneous system of equations, and from fundamental linear algebra, we know that a nontrivial solution to (4) exists if and only if

(5) $\det(A - \lambda I) = 0$

where $\det(\cdot)$ denotes determinant. Eq. (5), when evaluated, becomes a polynomial in $\lambda$ of degree $n$. For example, for the matrix $A$ above we have

$\det\left( \begin{bmatrix} 4 & 1 \\ 1 & 4 \end{bmatrix} - \lambda \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \right) = 0$

(6) $\det \begin{bmatrix} 4 - \lambda & 1 \\ 1 & 4 - \lambda \end{bmatrix} = (4 - \lambda)^2 - 1 = \lambda^2 - 8\lambda + 15 = 0.$

It is easily verified that the roots of this polynomial are (5,3), which correspond to the eigenvalues indicated above.
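A quick numpy confirmation (added for illustration, not part of the notes) of the eigenvalues and eigenvectors of this example matrix:

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [1.0, 4.0]])

# Roots of the characteristic polynomial lambda^2 - 8*lambda + 15
print(np.roots([1.0, -8.0, 15.0]))     # the roots are 5 and 3

evals, evecs = np.linalg.eigh(A)        # eigh: routine for symmetric matrices
print(evals)                            # [3. 5.]
print(evecs)                            # columns proportional to (1,-1) and (1,1), up to sign

# An eigenvector is simply scaled by its eigenvalue
v = np.array([1.0, 1.0]) / np.sqrt(2)
print(A @ v, 5 * v)                     # identical
```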

Eq. (5) is referred to as the characteristic equation of A, and the corresponding polynomial is the characteristic polynomial. The characteristic polynomial is of degree n.

More generally, if A is n×n, then there are n solutions of (5), or n roots of the characteristic polynomial. Thus there are n eigenvalues of A satisfying (3); i.e.,

(7) $A x_i = \lambda_i x_i, \qquad i = 1, \ldots, n.$

If the eigenvalues are all distinct, there are n associated linearly–independent eigenvectors, whose directions are unique, which span an n–dimensional Euclidean space.

Repeated Eigenvalues: In the case where there are, e.g., $r$ repeated eigenvalues, a linearly independent set of $n$ eigenvectors exists, provided the rank of the matrix $(A - \lambda I)$ in (5) is $n - r$. In this case, the directions of the $r$ eigenvectors associated with the repeated eigenvalue are not unique. In fact, consider a set of $r$ linearly independent eigenvectors $v_1, \ldots, v_r$ associated with the $r$ repeated eigenvalues. Then, it may be shown that any vector in $\operatorname{span}[v_1, \ldots, v_r]$ is also an eigenvector. This emphasizes the fact that the eigenvectors are not unique in this case.

Example 1: Consider the matrix given by

$\begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}$

It may be easily verified that any vector in $\operatorname{span}[e_2, e_3]$ is an eigenvector associated with the repeated zero eigenvalue.

Example 2: Consider the n×n identity matrix. It has n repeated eigenvalues equal to one. In this case, any n–dimensional vector is an eigenvector, and the eigenvectors span an n–dimensional space.


Eq. (5) gives us a clue how to compute eigenvalues. We can formulate the characteristic polynomial and evaluate its roots to give the $\lambda_i$. Once the eigenvalues are available, it is possible to compute the corresponding eigenvectors $v_i$ by evaluating the nullspace of the quantity $A - \lambda_i I$, for $i = 1, \ldots, n$. This approach is adequate for small systems, but for those of appreciable size, this method is prone to appreciable numerical error. Later, we consider various orthogonal transformations which lead to much more effective techniques for finding the eigenvalues.

We now present some very interesting properties of eigenvalues and eigenvectors, to aid in our understanding.

Property 1 If the eigenvalues of a (Hermitian)[6] symmetric matrix are distinct, then the eigenvectors are orthogonal.

Proof. Let $\{v_i\}$ and $\{\lambda_i\}$, $i = 1, \ldots, n$, be the eigenvectors and corresponding eigenvalues, respectively, of $A \in \mathbb{R}^{n \times n}$. Choose any $i, j \in [1, \ldots, n]$, $i \neq j$. Then

(8) $A v_i = \lambda_i v_i$

and

(9) $A v_j = \lambda_j v_j.$

Premultiply (8) by $v_j^T$ and (9) by $v_i^T$:

(10) $v_j^T A v_i = \lambda_i v_j^T v_i$

(11) $v_i^T A v_j = \lambda_j v_i^T v_j$

The quantities on the left are equal when $A$ is symmetric. We show this as follows. Since the left-hand side of (10) is a scalar, its transpose is equal to itself. Therefore, we get $v_j^T A v_i = v_i^T A^T v_j$.[7] But, since $A$ is symmetric, $A^T = A$. Thus, $v_j^T A v_i = v_i^T A^T v_j = v_i^T A v_j$, which was to be shown. Subtracting (11) from (10), we have

(12) $(\lambda_i - \lambda_j)\, v_j^T v_i = 0$

where we have used the fact that $v_j^T v_i = v_i^T v_j$. But by hypothesis, $\lambda_i - \lambda_j \neq 0$. Therefore, (12) is satisfied only if $v_j^T v_i = 0$, which means the vectors are orthogonal.

Here we have considered only the case where the eigenvalues are distinct. If an eigenvalue $\tilde{\lambda}$ is repeated $r$ times, and $\operatorname{rank}(A - \tilde{\lambda} I) = n - r$, then a mutually orthogonal set of $n$ eigenvectors can still be found.

Another useful property of eigenvalues of symmetric matrices is as follows:

Property 2 The eigenvalues of a (Hermitian) symmetric matrix are real.

Proof:[8] (By contradiction): First, we consider the case where $A$ is real. Let $\lambda$ be a non–real complex eigenvalue of a symmetric matrix $A$. Then, since the elements of $A$ are real, $\bar{\lambda}$, the complex conjugate of $\lambda$, must also be an eigenvalue of $A$, because the roots of the characteristic polynomial must occur in complex conjugate pairs. Also, if $v$ is a nonzero eigenvector corresponding to $\lambda$, then an eigenvector corresponding to $\bar{\lambda}$ must be $\bar{v}$, the complex conjugate of $v$. But Property 1 requires that the eigenvectors be orthogonal; therefore, $v^T \bar{v} = 0$. But $v^T \bar{v} = (v^H v)^*$, which is by definition the complex conjugate of the squared norm of $v$. But the norm of a vector is a pure real number; hence, $v^T \bar{v}$ must be greater than zero, since $v$ is by hypothesis nonzero. We therefore have a contradiction. It follows that the eigenvalues of a symmetric matrix cannot be complex; i.e., they are real.

While this proof considers only the real symmetric case, it is easily extended to the case where A is Hermitian symmetric.

Property 3 Let $A$ be a matrix with eigenvalues $\lambda_i$, $i = 1, \ldots, n$, and eigenvectors $v_i$. Then the eigenvalues of the matrix $A + sI$ are $\lambda_i + s$, with corresponding eigenvectors $v_i$, where $s$ is any real number.

Proof: From the definition of an eigenvector, we have $Av = \lambda v$. Further, we have $sIv = sv$. Adding, we have $(A + sI)v = (\lambda + s)v$. This new eigenvector relation on the matrix $(A + sI)$ shows the eigenvectors are unchanged, while the eigenvalues are displaced by $s$.

Property 4 Let $A$ be an $n \times n$ matrix with eigenvalues $\lambda_i$, $i = 1, \ldots, n$. Then

  • The determinant $\det(A) = \prod_{i=1}^{n} \lambda_i$.
  • The trace[9] $\operatorname{tr}(A) = \sum_{i=1}^{n} \lambda_i$. The proof is straightforward, but because it is easier using concepts presented later in the course, it is not given here.

Property 5 If $v$ is an eigenvector of a matrix $A$, then $cv$ is also an eigenvector, where $c$ is any real or complex constant. The proof follows directly by substituting $cv$ for $v$ in $Av = \lambda v$. This means that only the direction of an eigenvector can be unique; its norm is not unique.

2.1.1 Orthonormal Matrices

Before proceeding with the eigendecomposition of a matrix, we must develop the concept of an orthonormal matrix. This form of matrix has mutually orthogonal columns, each of unit norm. This implies that

(13) $q_i^T q_j = \delta_{ij},$

where δij is the Kronecker delta, and qi and qj are columns of the orthonormal matrix Q. With (13) in mind, we now consider the product QTQ. The result may be visualized with the aid of the diagram below:

(14) $Q^T Q = \begin{bmatrix} q_1^T \\ q_2^T \\ \vdots \\ q_N^T \end{bmatrix} \begin{bmatrix} q_1 & q_2 & \cdots & q_N \end{bmatrix} = I.$

(When $i = j$, the quantity $q_i^T q_i$ defines the squared 2–norm of $q_i$, which has been defined as unity. When $i \neq j$, $q_i^T q_j = 0$, due to the orthogonality of the $q_i$.) Eq. (14) is a fundamental property of an orthonormal matrix.

Thus, for an orthonormal matrix, (14) implies the inverse may be computed simply by taking the transpose of the matrix, an operation which requires almost no computational effort.

Eq. (14) follows directly from the fact that $Q$ has orthonormal columns. It is not so clear that the quantity $QQ^T$ should also equal the identity. We can resolve this question in the following way. Suppose that $A$ and $B$ are any two square invertible matrices such that $AB = I$. Then, $BAB = B$. By parsing this last expression, we have

(15) $(BA)B = B.$

Clearly, if (15) is to hold, then the quantity $BA$ must be the identity[10]; hence, if $AB = I$, then $BA = I$. Therefore, if $Q^T Q = I$, then also $QQ^T = I$. From this fact, it follows that if a matrix has orthonormal columns, then it also must have orthonormal rows. We now develop a further useful property of orthonormal matrices:

Property 6 The vector 2-norm is invariant under an orthonormal transformation. If Q is orthonormal, then

$\|Qx\|_2^2 = x^T Q^T Q x = x^T x = \|x\|_2^2.$

Thus, because the norm does not change, an orthonormal transformation performs a rotation operation on a vector. We use this norm–invariance property later in our study of the least–squares problem.

Suppose we have a matrix $U \in \mathbb{R}^{m \times n}$, where $m > n$, whose columns are orthonormal. We see in this case that $U$ is a tall matrix, which can be formed by extracting only the first $n$ columns of an arbitrary orthonormal matrix. (We reserve the term orthonormal matrix to refer to a complete $m \times m$ matrix.) Because $U$ has orthonormal columns, it follows that the quantity $U^T U = I_{n \times n}$. However, it is important to realize that the quantity $UU^T \neq I_{m \times m}$ in this case, in contrast to the situation when $m = n$. The latter relation follows from the fact that the $m$ column vectors of $U^T$, each of length $n$, $n < m$, cannot all be mutually orthogonal. In fact, we see later that $UU^T$ is a projector onto the subspace $R(U)$.

Suppose we have a vector $b \in \mathbb{R}^m$. Because it is easiest, by convention we represent $b$ using the basis $[e_1, \ldots, e_m]$, where the $e_i$ are the elementary vectors (all zeros except for a one in the $i$th position). However, it is often convenient to represent $b$ in a basis formed from the columns of an orthonormal matrix $Q$. In this case, the elements of the vector $c = Q^T b$ are the coefficients of $b$ in the basis $Q$. The orthonormal basis is convenient because we can restore $b$ from $c$ simply by taking $b = Qc$.

An orthonormal matrix is sometimes referred to as a unitary matrix. This follows because the determinant of an orthonormal matrix is ±1.
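The following numpy sketch (not part of the original notes) illustrates the properties above: $Q^T Q = QQ^T = I$, the change of basis $c = Q^T b$, and 2-norm invariance. Generating $Q$ from the QR factorization of a random matrix is just one convenient way to obtain an orthonormal matrix for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(1)
# One convenient way to generate an orthonormal matrix: the Q factor of a QR decomposition
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))

print(np.allclose(Q.T @ Q, np.eye(4)))   # True: orthonormal columns (eq. (14))
print(np.allclose(Q @ Q.T, np.eye(4)))   # True: hence orthonormal rows as well

b = rng.standard_normal(4)
c = Q.T @ b                               # coefficients of b in the basis Q
print(np.allclose(Q @ c, b))              # True: b is restored from its coefficients
print(np.isclose(np.linalg.norm(Q @ b), np.linalg.norm(b)))  # True: 2-norm invariance (Property 6)
```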

2.1.2 The Eigendecomposition (ED) of a Square Symmetric Matrix

Almost all matrices on which ED’s are performed (at least in signal processing) are symmetric. A good example are covariance matrices, which are discussed in some detail in the next section.

Let $A \in \mathbb{R}^{n \times n}$ be symmetric. Then, for eigenvalues $\lambda_i$ and eigenvectors $v_i$, we have

(16) $A v_i = \lambda_i v_i, \qquad i = 1, \ldots, n.$

Let the eigenvectors be normalized to unit 2–norm. Then these n equations can be combined, or stacked side–by–side together, and represented in the following compact form:

(17) $AV = V\Lambda$

where $V = [v_1, v_2, \ldots, v_n]$ (i.e., each column of $V$ is an eigenvector), and

(18) $\Lambda = \begin{bmatrix} \lambda_1 & & & 0 \\ & \lambda_2 & & \\ & & \ddots & \\ 0 & & & \lambda_n \end{bmatrix} = \operatorname{diag}(\lambda_1, \ldots, \lambda_n).$

Corresponding columns from each side of (17) represent one specific value of the index $i$ in (16). Because we have assumed $A$ is symmetric, from Property 1, the $v_i$ are orthogonal. Furthermore, since we have assumed $\|v_i\|_2 = 1$, $V$ is an orthonormal matrix. Thus, post-multiplying both sides of (17) by $V^T$ and using $VV^T = I$, we get

(19) $A = V \Lambda V^T.$

Eq. (19) is called the eigendecomposition (ED) of A. The columns of V are eigenvectors of A, and the diagonal elements of Λ are the corresponding eigenvalues. Any symmetric matrix may be decomposed in this way. This form of decomposition, with Λ being diagonal, is of extreme interest and has many interesting consequences. It is this decomposition which leads directly to the Karhunen-Loeve expansion which we discuss shortly.

Note that from (19), knowledge of the eigenvalues and eigenvectors of $A$ is sufficient to completely specify $A$. Note further that if the eigenvalues are distinct, then the ED is unique: there is only one orthonormal $V$ and one diagonal $\Lambda$ which satisfy (19).

Eq. (19) can also be written as

(20) $V^T A V = \Lambda.$

Since $\Lambda$ is diagonal, we say that the unitary (orthonormal) matrix $V$ of eigenvectors diagonalizes $A$. No other orthonormal matrix can diagonalize $A$. The fact that only $V$ diagonalizes $A$ is the fundamental property of eigenvectors. If you understand that the eigenvectors of a symmetric matrix diagonalize it, then you understand the "mystery" behind eigenvalues and eigenvectors. That's all there is to it. We look at the K–L expansion later in this lecture in order to solidify this interpretation, and to show some very important signal processing concepts which fall out of the K–L idea. But the K–L analysis is just a direct consequence of the fact that only the eigenvectors of a symmetric matrix diagonalize it.
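A minimal numpy sketch (added here, not from the notes) confirming (19) and (20) for an arbitrary symmetric matrix; numpy's eigh routine is used since it is intended for symmetric matrices.

```python
import numpy as np

rng = np.random.default_rng(2)
B = rng.standard_normal((5, 5))
A = B + B.T                              # an arbitrary symmetric matrix

lam, V = np.linalg.eigh(A)               # eigenvalues (ascending) and orthonormal eigenvectors
Lam = np.diag(lam)

print(np.allclose(V @ Lam @ V.T, A))     # eq. (19): A = V Lambda V^T
print(np.allclose(V.T @ A @ V, Lam))     # eq. (20): V diagonalizes A
print(np.allclose(V.T @ V, np.eye(5)))   # V is orthonormal
```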

2.1.3 Conventional Notation on Eigenvalue Indexing

Let $A \in \mathbb{R}^{n \times n}$ have rank $r \le n$. Also assume $A$ is positive semi–definite; i.e., all its eigenvalues are $\ge 0$. This is a not too restrictive assumption, because most of the matrices on which the eigendecomposition is relevant are positive semi–definite. Then, as we see in the next section, we have $r$ non-zero eigenvalues and $n - r$ zero eigenvalues. It is common convention to order the eigenvalues so that

(21) $\underbrace{\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_r}_{r \text{ nonzero eigenvalues}} > \underbrace{\lambda_{r+1} = \cdots = \lambda_n}_{n-r \text{ zero eigenvalues}} = 0$

i.e., we order the columns of eq. (17) so that $\lambda_1$ is the largest, with the remaining nonzero eigenvalues arranged in descending order, followed by $n - r$ zero eigenvalues. Note that if $A$ is full rank, then $r = n$ and there are no zero eigenvalues. The quantity $\lambda_n$ is the eigenvalue with the lowest value.

The eigenvectors are reordered to correspond with the ordering of the eigenvalues. For notational convenience, we refer to the eigenvector corresponding to the largest eigenvalue as the “largest eigenvector”. The “smallest eigenvector” is then the eigenvector corresponding to the smallest eigenvalue.

2.2 The Eigendecomposition in Relation to the Fundamental Matrix Subspaces

In this section, we develop relationships between the eigendecomposition of a matrix and its range, null space and rank.

Here, we consider square symmetric positive semi–definite matrices $A \in \mathbb{R}^{n \times n}$ of rank $r \le n$. Let us partition the eigendecomposition of $A$ in the following form:

(22) $A = V \Lambda V^T = \begin{bmatrix} V_1 & V_2 \end{bmatrix} \begin{bmatrix} \Lambda_1 & 0 \\ 0 & \Lambda_2 \end{bmatrix} \begin{bmatrix} V_1^T \\ V_2^T \end{bmatrix}$

where

(23) $V_1 = [v_1, v_2, \ldots, v_r] \in \mathbb{R}^{n \times r}, \qquad V_2 = [v_{r+1}, \ldots, v_n] \in \mathbb{R}^{n \times (n-r)}.$

The columns of $V_1$ are the eigenvectors corresponding to the first $r$ eigenvalues of $A$, and the columns of $V_2$ correspond to the $n - r$ smallest eigenvalues. We also have

(24) $\Lambda_1 = \operatorname{diag}[\lambda_1, \ldots, \lambda_r] = \begin{bmatrix} \lambda_1 & & \\ & \ddots & \\ & & \lambda_r \end{bmatrix} \in \mathbb{R}^{r \times r},$

and

(25) $\Lambda_2 = \begin{bmatrix} \lambda_{r+1} & & \\ & \ddots & \\ & & \lambda_n \end{bmatrix} \in \mathbb{R}^{(n-r) \times (n-r)}.$

In the notation used above, the explicit absence of a matrix element in an off-diagonal position implies that element is zero. We now show that the partition (22) reveals a great deal about the structure of A.

2.2.1 Nullspace

In this section, we explore the relationship between the partition of (22) and the nullspace of A. Recall that the nullspace N(A) of A is defined as

(26) $N(A) = \{ x \neq 0 \in \mathbb{R}^n \mid Ax = 0 \}.$

From (22), we have

(27) $Ax = \begin{bmatrix} V_1 & V_2 \end{bmatrix} \begin{bmatrix} \Lambda_1 & 0 \\ 0 & \Lambda_2 \end{bmatrix} \begin{bmatrix} V_1^T \\ V_2^T \end{bmatrix} x.$

We now choose $x$ so that $x \in \operatorname{span}(V_2)$. Then $x = V_2 c_2$, where $c_2$ is any vector in $\mathbb{R}^{n-r}$. Then, since $V_1 \perp V_2$, we have

(28) $Ax = \begin{bmatrix} V_1 & V_2 \end{bmatrix} \begin{bmatrix} \Lambda_1 & 0 \\ 0 & \Lambda_2 \end{bmatrix} \begin{bmatrix} 0 \\ c_2 \end{bmatrix} = \begin{bmatrix} V_1 & V_2 \end{bmatrix} \begin{bmatrix} 0 \\ \Lambda_2 c_2 \end{bmatrix}.$

From (28), it is clear we can find a non–trivial $x$ such that $Ax = 0$ if and only if $\Lambda_2 = 0$. Thus, a non–empty nullspace can exist only if $\Lambda_2 = 0$.

Since $\Lambda_2 \in \mathbb{R}^{(n-r) \times (n-r)}$, a square symmetric matrix of rank $r \le n$ must have $n - r$ zero eigenvalues.

Moreover, from (28) we see that the condition $x \in \operatorname{span}(V_2)$ is also necessary for $Ax = 0$. This implies that an orthonormal basis for the nullspace of $A$ is $V_2$. Since $V_2 \in \mathbb{R}^{n \times (n-r)}$, the nullity of $A$ is $n - r$, corresponding to the number of zero eigenvalues.

Thus, we have the important result that if the dimension of $N(A)$ is $d = n - r$, then $A$ must have $d$ zero eigenvalues. The matrix $V_2 \in \mathbb{R}^{n \times (n-r)}$ is an orthonormal basis for $N(A)$.

2.2.2 Range

Let us look at R(A) in the light of the decomposition of (22), where we have seen that Λ2=0 if A is rank deficient. The definition of R(A), repeated here for convenience, is

(29) $R(A) = \{ y \mid y = Ax,\ x \in \mathbb{R}^n \}.$

The vector quantity Ax is therefore given as

(30) $Ax = \begin{bmatrix} V_1 & V_2 \end{bmatrix} \begin{bmatrix} \Lambda_1 & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} V_1^T \\ V_2^T \end{bmatrix} x.$

In the above, it is understood that if A is full rank, then the lower right block of zeros in Λ vanishes and Λ becomes equivalent to Λ1.

Let us define c as

(31) $c = \begin{bmatrix} c_1 \\ c_2 \end{bmatrix} = \begin{bmatrix} V_1^T \\ V_2^T \end{bmatrix} x,$

where $c_1 \in \mathbb{R}^r$ and $c_2 \in \mathbb{R}^{n-r}$. Then,

(32) $y = Ax = \begin{bmatrix} V_1 & V_2 \end{bmatrix} \begin{bmatrix} \Lambda_1 & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} c_1 \\ c_2 \end{bmatrix} = \begin{bmatrix} V_1 & V_2 \end{bmatrix} \begin{bmatrix} \Lambda_1 c_1 \\ 0 \end{bmatrix} = V_1 (\Lambda_1 c_1).$

From (31), we see that as $x$ ranges over $\mathbb{R}^n$, so does $c$, and therefore $\Lambda_1 c_1$ ranges over all of $\mathbb{R}^r$. Thus, the vector $y$ in (32) consists of all possible linear combinations of the columns of $V_1$, and $R(A) = R(V_1)$. Therefore we have the important result that $V_1$ is an orthonormal basis for $R(A)$.
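The following illustrative sketch (not from the notes) builds an arbitrary rank-deficient positive semi-definite matrix and checks that the partition (22)-(23) yields $V_1$ as an orthonormal basis for $R(A)$ and $V_2$ as an orthonormal basis for $N(A)$.

```python
import numpy as np

rng = np.random.default_rng(3)
# An arbitrary symmetric positive semi-definite matrix of rank r = 3 in R^{5x5}
X = rng.standard_normal((5, 3))
A = X @ X.T

lam, V = np.linalg.eigh(A)               # eigenvalues in ascending order
idx = np.argsort(lam)[::-1]              # reorder: largest eigenvalue first (eq. (21))
lam, V = lam[idx], V[:, idx]

r = int(np.sum(lam > 1e-10))             # numerical rank (here 3)
V1, V2 = V[:, :r], V[:, r:]              # eq. (23)

# V1 is an orthonormal basis for R(A): projecting the columns of A onto span(V1) leaves A unchanged
print(np.allclose(V1 @ (V1.T @ A), A))
# V2 is an orthonormal basis for N(A): A maps span(V2) to zero
print(np.allclose(A @ V2, 0))
```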

2.3 Matrix Norms

Now that we have some understanding of eigenvectors and eigenvalues, we can present the matrix norm. The matrix norm is related to the vector norm: it is a function which maps $\mathbb{R}^{m \times n}$ into $\mathbb{R}$. A matrix norm must obey the same properties as a vector norm. Since a norm is only strictly defined for a vector quantity, a matrix norm is defined by mapping a matrix into a vector. This is accomplished by post-multiplying the matrix by a suitable vector. Some useful matrix norms are now presented:

Matrix p-Norms: A matrix p-norm is defined in terms of a vector p-norm. The matrix p-norm of an arbitrary matrix $A$, denoted $\|A\|_p$, is defined as

(33) $\|A\|_p = \sup_{x \neq 0} \frac{\|Ax\|_p}{\|x\|_p}$

where "sup" means supremum; i.e., the largest value of the argument over all values of $x \neq 0$. Since a property of a vector norm is $\|cx\|_p = |c| \|x\|_p$ for any scalar $c$, we can choose $c$ in (33) so that $\|x\|_p = 1$. Then, an equivalent statement to (33) is

(34) $\|A\|_p = \max_{\|x\|_p = 1} \|Ax\|_p.$

We now provide some interpretation of the above definition for the specific case where $p = 2$ and $A$ is square and symmetric, in terms of the eigendecomposition of $A$. To find the matrix 2–norm, we differentiate (34) and set the result to zero. Differentiating $\|Ax\|_2$ directly is difficult. However, we note that finding the $x$ which maximizes $\|Ax\|_2$ is equivalent to finding the $x$ which maximizes $\|Ax\|_2^2$, and the differentiation of the latter is much easier. In this case, we have $\|Ax\|_2^2 = x^T A^T A x$. To find the maximum, we use the method of Lagrange multipliers, since $x$ is constrained by (34). Therefore we differentiate the quantity

(35) $x^T A^T A x + \gamma (1 - x^T x)$

and set the result to zero. The quantity γ above is the Lagrange multiplier. The details of the differentiation are omitted here, since they will be covered in a later lecture. The interesting result of this process is that x must satisfy

(36) $A^T A x = \gamma x, \qquad \|x\|_2 = 1.$

Therefore the stationary points of (34) are the eigenvectors of $A^T A$. When $A$ is square and symmetric, the eigenvectors of $A^T A$ are equivalent to those of $A$.[11] Therefore the stationary points of (34) are also the eigenvectors of $A$. By substituting $x = v_1$ into (34) we find that $\|Ax\|_2 = \lambda_1$.

It then follows that the solution to (34) is given by the eigenvector corresponding to the largest eigenvalue of $A$, and $\|A\|_2$ is equal to the largest eigenvalue of $A$.

More generally, it is shown in the next lecture for an arbitrary matrix A that

(37) $\|A\|_2 = \sigma_1$

where σ1 is the largest singular value of A. This quantity results from the singular value decomposition, to be discussed next lecture.

Matrix norms for other values of p, for arbitrary A, are given as

(38) $\|A\|_1 = \max_{1 \le j \le n} \sum_{i=1}^{m} |a_{ij}| \qquad \text{(maximum column sum)}$

and

(39) $\|A\|_\infty = \max_{1 \le i \le m} \sum_{j=1}^{n} |a_{ij}| \qquad \text{(maximum row sum)}.$

Frobenius Norm: The Frobenius norm is the 2-norm of the vector consisting of the 2-norms of the rows (or columns) of the matrix $A$:

$\|A\|_F = \left[ \sum_{i=1}^{m} \sum_{j=1}^{n} |a_{ij}|^2 \right]^{1/2}$
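A short numpy illustration (added, not from the notes) of the matrix norms (37)-(39) and the Frobenius norm, using an arbitrary example matrix.

```python
import numpy as np

A = np.array([[1.0, -2.0],
              [3.0,  4.0]])

print(np.linalg.norm(A, 1))       # 6.0 : maximum column sum (eq. (38))
print(np.linalg.norm(A, np.inf))  # 7.0 : maximum row sum (eq. (39))
print(np.linalg.norm(A, 'fro'))   # sqrt(1+4+9+16) = sqrt(30): Frobenius norm
print(np.linalg.norm(A, 2))       # largest singular value (eq. (37))

# For a symmetric matrix, the 2-norm equals the largest eigenvalue magnitude
S = A + A.T
print(np.isclose(np.linalg.norm(S, 2), np.abs(np.linalg.eigvalsh(S)).max()))
```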

2.3.1 Properties of Matrix Norms

  1. Consider the matrix $A \in \mathbb{R}^{m \times n}$ and the vector $x \in \mathbb{R}^n$. Then,
(40) $\|Ax\|_p \le \|A\|_p \|x\|_p$

This property follows by dividing both sides of the above by $\|x\|_p$, and applying (33).

  2. If $Q$ and $Z$ are orthonormal matrices of appropriate size, then
(41) $\|QAZ\|_2 = \|A\|_2$

and

(42) $\|QAZ\|_F = \|A\|_F$

Thus, we see that the matrix 2–norm and Frobenius norm are invariant to pre– and post– multiplication by an orthonormal matrix.

  3. Further,
(43) $\|A\|_F^2 = \operatorname{tr}(A^T A)$

where $\operatorname{tr}(\cdot)$ denotes the trace of a matrix, which is the sum of its diagonal elements.

2.4 Covariance Matrices

Here, we investigate the concepts and properties of the covariance matrix $R_{xx}$ corresponding to a stationary, discrete-time random process $x[n]$. We break the infinite sequence $x[n]$ into windows of length $m$, as shown in Fig. 2. The windows generally overlap; in fact, they are typically displaced from one another by only one sample. The samples within the $i$th window become an $m$-length vector $x_i$, $i = 1, 2, 3, \ldots$. Hence, the vector corresponding to each window is a vector sample from the random process $x[n]$. Processing random signals in this way is the fundamental first step in many forms of electronic systems which deal with real signals, such as process identification, control, or any form of communication system including telephones, radio, radar, sonar, etc.

The word stationary as used above means the random process is one for which the corresponding joint $m$–dimensional probability density function describing the distribution of the vector sample $x$ does not change with time. This means that all moments of the distribution (i.e., quantities such as the mean, the variance, and all cross–correlations, as well as all other higher–order statistical characterizations) are invariant with time. Here, however, we deal with a weaker form of stationarity referred to as wide–sense stationarity (WSS). With these processes, only the first two moments (mean, variances and covariances) need be invariant with time. Strictly, the idea of a covariance matrix is only relevant for stationary or WSS processes, since expectations only have meaning if the underlying process is stationary.

The covariance matrix RxxRm×m corresponding to a stationary or WSS process x[n] is defined as

(44) $R_{xx} \triangleq E\left[ (x - \mu)(x - \mu)^T \right]$

where $\mu$ is the vector mean of the process and $E(\cdot)$ denotes the expectation operator over all possible windows of index $i$ of length $m$ in Fig. 2. Often we deal with zero-mean processes, in which case we have

(45) $R_{xx} = E[x_i x_i^T] = E\left[ \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_m \end{pmatrix} \begin{pmatrix} x_1 & x_2 & \cdots & x_m \end{pmatrix} \right] = E\begin{bmatrix} x_1 x_1 & x_1 x_2 & \cdots & x_1 x_m \\ x_2 x_1 & x_2 x_2 & \cdots & x_2 x_m \\ \vdots & \vdots & & \vdots \\ x_m x_1 & x_m x_2 & \cdots & x_m x_m \end{bmatrix},$

where $(x_1, x_2, \ldots, x_m)^T = x_i$. Taking the expectation over all windows, eq. (45) tells us that the element $r(1,1)$ of $R_{xx}$ is by definition $E(x_1^2)$, which is the mean-square value (the preferred term is variance, whose symbol is $\sigma^2$) of the first element $x_1$ of all possible vector samples $x_i$ of the process. But because of stationarity, $r(1,1) = r(2,2) = \cdots = r(m,m)$, which are all equal to $\sigma^2$. Thus all main diagonal elements of $R_{xx}$ are equal to the variance of the process. The element $r(1,2) = E(x_1 x_2)$ is the cross–correlation between the first element of $x_i$ and the second element. Taken over all possible windows, we see this quantity is the cross–correlation of the process and itself delayed by one sample. Because of stationarity, $r(1,2) = r(2,3) = \cdots = r(m-1,m)$, and hence all elements on the first upper diagonal are equal to the cross-correlation for a time-lag of one sample. Since multiplication is commutative, $r(2,1) = r(1,2)$, and therefore all elements on the first lower diagonal are also equal to this same cross-correlation value. Using similar reasoning, all elements on the $j$th upper or lower diagonal are equal to the cross-correlation value of the process for a time lag of $j$ samples. Thus we see that the matrix $R_{xx}$ is highly structured.

Let us compare the process shown in Fig. 2 with that shown in Fig. 3. In the former case, we see that the process is relatively slowly varying. Because we have assumed $x[n]$ to be zero mean, adjacent samples of the process in Fig. 2 will have the same sign most of the time, and hence $E(x_i x_{i+1})$ will be a positive number, coming close to the value $E(x_i^2)$. The same can be said for $E(x_i x_{i+2})$, except it is not so close to $E(x_i^2)$. Thus, we see that for the process of Fig. 2, the diagonals decay fairly slowly away from the main diagonal value.

However, for the process shown in Fig. 3, adjacent samples are uncorrelated with each other. This means that adjacent samples are just as likely to have opposite signs as they are to have the same signs. On average, the terms with positive values have the same magnitude as those with negative values. Thus, when the expectations $E(x_i x_{i+1}), E(x_i x_{i+2}), \ldots$ are taken, the resulting averages approach zero. In this case then, we see the covariance matrix concentrates around the main diagonal, and becomes equal to $\sigma^2 I$. We note that all the eigenvalues of $R_{xx}$ are then equal to the value $\sigma^2$. Because of this property, such processes are referred to as "white", in analogy to white light, whose spectral components are all of equal magnitude.

The sequence $\{r(1,1), r(1,2), \ldots, r(1,m)\}$ is equivalent to the autocorrelation function of the process, for lags 0 to $m-1$. The autocorrelation function characterizes the random process $x[n]$ in terms of its variance, and how quickly the process varies over time. In fact, it may be shown[12] that the Fourier transform of the autocorrelation function is the power spectral density of the process. Further discussion on this aspect of random processes is beyond the scope of this treatment; the interested reader is referred to the reference.

In practice, it is impossible to evaluate the covariance matrix $R_{xx}$ using expectations as in (44). Expectations cannot be evaluated in practice: they require an infinite amount of data, which is never available, and furthermore, the data must be stationary over the observation interval, which is rarely the case. In practice, we evaluate an estimate $\hat{R}_{xx}$ of $R_{xx}$, based on an observation of finite length $N$ of the process $x[n]$, by replacing the ensemble average (expectation) with a finite temporal average over the $N$ available data points as follows[13]:

(46) $\hat{R}_{xx} = \frac{1}{N-m+1} \sum_{i=1}^{N-m+1} x_i x_i^T.$

If (46) is used to evaluate $\hat{R}_{xx}$, then the process need only be stationary over the observation length. Thus, by using the covariance estimate given by (46), we can track slow changes in the true covariance matrix of the process with time, provided the change in the process is small over the observation interval $N$. Further properties and discussion of covariance matrices are given in Haykin.[14]

It is interesting to note that $\hat{R}_{xx}$ can be formed in an alternate way from (46). Let $X \in \mathbb{R}^{m \times (N-m+1)}$ be a matrix whose $i$th column is the vector sample $x_i$, $i = 1, \ldots, N-m+1$, of $x[n]$. Then $\hat{R}_{xx}$ is also given as

(47) $\hat{R}_{xx} = \frac{1}{N-m+1} X X^T.$

The proof of this statement is left as an exercise.
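As a numerical illustration (not part of the notes), the sketch below generates a simple correlated process (a hypothetical moving average of white noise, chosen only for demonstration), stacks the windowed vector samples into $X$, and forms $\hat{R}_{xx}$ via (47); the result is approximately Toeplitz with diagonals that decay with lag, as described above.

```python
import numpy as np

rng = np.random.default_rng(4)
N, m = 10000, 4
# A correlated (slowly varying) zero-mean process: a moving average of white noise
w = rng.standard_normal(N + 3)
x = (w[3:] + w[2:-1] + w[1:-2] + w[:-3]) / 4.0

# Stack the windowed vector samples x_i as columns of X (length-m windows, unit shift)
X = np.column_stack([x[i:i + m] for i in range(N - m + 1)])

R_hat = (X @ X.T) / (N - m + 1)          # eq. (47), identical to the sum in eq. (46)
print(np.round(R_hat, 2))                # approximately Toeplitz, diagonals decaying with lag
```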

Some Properties of Rxx:

  1. $R_{xx}$ is (Hermitian) symmetric, i.e., $r_{ij} = r_{ji}^*$, where $*$ denotes complex conjugation.
  2. If the process $x[n]$ is stationary or wide-sense stationary, then $R_{xx}$ is Toeplitz. This means that all the elements on a given diagonal of the matrix are equal. If you understand this property, then you have a good understanding of the nature of covariance matrices.
  3. If $R_{xx}$ is diagonal, then the elements of $x$ are uncorrelated. If the magnitudes of the off-diagonal elements of $R_{xx}$ are significant with respect to those on the main diagonal, the process is said to be highly correlated.
  4. $R_{xx}$ is positive semi–definite. This implies that all the eigenvalues are greater than or equal to zero. We will discuss positive definiteness and positive semi–definiteness later.
  5. If the stationary or WSS random process $x$ has a Gaussian probability distribution, then the vector mean and the covariance matrix $R_{xx}$ are enough to completely specify the statistical characteristics of the process.

2.5 The Karhunen-Loeve Expansion of a Random Process

In this section we combine what we have learned about eigenvalues and eigenvectors, and covariance matrices, into the K-L orthonormal expansion of a random process. The KL expansion is extremely useful in compression of images and speech signals.

An orthonormal expansion of a vector $x \in \mathbb{R}^m$ involves expressing $x$ as a linear combination of orthonormal basis vectors or functions as follows:

(48) $x = Qa$

where $a = [a_1, \ldots, a_m]$ contains the coefficients or weights of the expansion, and $Q = [q_1, \ldots, q_m]$ is an $m \times m$ orthonormal matrix.[15] Because $Q$ is orthonormal, we can write

(49) $a = Q^T x.$

The coefficients $a$ represent $x$ in a coordinate system whose axes are the basis $[q_1, \ldots, q_m]$, instead of the conventional basis $[e_1, \ldots, e_m]$. By using different basis functions $Q$, we can generate sets of coefficients with different properties. For example, we can express the discrete Fourier transform (DFT) in the form of (49), where the columns of $Q$ are harmonically–related rotating exponentials. With this basis, the coefficients $a$ tell us how much of the frequency corresponding to $q_i$ is contained in $x$.

For each vector observation $x_i$, the matrix $Q$ remains constant but a new vector $a_i$ of coefficients is generated. To emphasize this point, we re-write (48) as

(50) $x_i = Q a_i, \qquad i = 1, \ldots, N$

where i is the vector sample index (corresponding to the window position in Fig. 2) and N is the number of vector observations.

2.5.1 Development of the K–L Expansion

Figure 4 shows a scatterplot corresponding to a slowly–varying random process, of the type shown in Figure 2. A scatterplot is a collection of dots, where the $i$th dot is the point on the $m$–dimensional plane corresponding to the vector $x_i$. Because of obvious restrictions in drawing, we are limited here to the value $m = 2$. Because the process we have chosen in this case is slowly varying, the elements of each $x_i$ are highly correlated; i.e., knowledge of one element implies a great deal about the value of the other. This forces the scatterplot to be elliptical in shape (ellipsoidal in higher dimensions), concentrating along the principal diagonal in the $x_1$–$x_2$ plane. Let the quantities $a_1, a_2, \ldots, a_m$ be the lengths of the $m$ principal axes of the scatterplot ellipse. With highly correlated processes we find that $a_1 > a_2 > \cdots > a_m$. Typically, we find that the values $a_i$ diminish quickly with increasing $i$ in larger dimensional systems, when the process is highly correlated.

For the sake of contrast, Figure 5 shows a similar scatterplot, except the underlying random process is white. Here there is no correlation between adjacent samples of the process, so there is no diagonal concentration of the scatterplot in this case. This scatterplot is an $m$–dimensional spheroid.

As we see later in this section, if we wish to store or transmit such a random process, it is wasteful to do so using the conventional coordinate system $[e_1, e_2, \ldots, e_m]$ when the process is highly correlated. (Transmission using the conventional coordinate system is equivalent to transmitting the elements $x_1, x_2, \ldots, x_m$ of $x_i$ in sequence.) The inefficiency is a result of the fact that most of the information contained in a given sample $x_j$ of $x$ must be re–transmitted in adjacent and subsequent samples. In this section, we seek a transformed coordinate system which is more efficient in this respect. The motivation will become clearer towards the end of the section.

The proposed method of finding an optimum coordinate system in which to represent our random process is to find a basis vector $q_1 \in \mathbb{R}^m$ such that the corresponding coefficient $a_1 = q_1^T x$ has the maximum possible mean–squared value (variance). Then, we find a second basis vector $q_2$, constrained to be orthogonal to $q_1$, such that the variance of the coefficient $a_2 = q_2^T x$ is maximum. We continue in this way until we obtain a complete orthonormal basis $Q = [q_1, \ldots, q_m]$. Heuristically, we see from Figure 8 that the desired basis is the set of principal axes of the scatterplot ellipse. The benefits of this procedure will become clearer when we apply this technique to the compression of random processes.

The procedure to determine the $q_i$ is straightforward. The basis vector $q_1$ is given as the solution to the following problem:

(51) $q_1 = \arg\max_{\|q\|_2 = 1} E\left[ |q^T x_i|^2 \right]$

where the expectation is over all values of i. The constraint on the 2–norm of q is to prevent the solution from going to infinity. Eq. (51) can be written as

(52) $q_1 = \arg\max_{\|q\|_2 = 1} E\left[ q^T x x^T q \right] = \arg\max_{\|q\|_2 = 1} q^T E\left[ x x^T \right] q = \arg\max_{\|q\|_2 = 1} q^T R_{xx} q,$

where we have assumed a zero–mean process. The optimization problem above is precisely the same as that for the matrix norm of Section 2.3, where it is shown that the stationary points of the argument in (52) are the eigenvectors of $R_{xx}$. Therefore, the solution to (52) is $q_1 = v_1$, the largest eigenvector of $R_{xx}$. Similarly, $q_2, \ldots, q_m$ are the remaining successively decreasing eigenvectors of $R_{xx}$. Thus, the desired orthonormal matrix is the eigenvector matrix $V$ corresponding to the covariance matrix of the random process. The decomposition of the vector $x$ in this way is called the Karhunen-Loeve (KL) expansion of a random process.

In the sequel, the KL expansion is written using the following notation:

(53)xi=Vθi

and

(54)θi=VTxi,

where VRm×m is the orthonormal matrix of eigenvectors, which is the basis of the KL expansion, and θiRm is the vector of KL coefficients.

Thus, the coefficient θ1 of θ on average contains the most energy (variance) of all the coefficients in θ; θ2 is the coefficient which contains the next–highest variance, etc. The coefficient θm contains the least variance. This is in contrast to the conventional coordinate system, in which all axes have equal variances.
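
The computation of the KL basis and coefficients in (53)–(54) is easy to sketch numerically. The following Python/NumPy fragment is a minimal illustration, assuming zero-mean, real-valued vector samples stacked as the columns of a matrix X; the random-walk construction of the correlated sequence is purely illustrative and not taken from the notes.

```python
import numpy as np

# Minimal sketch of the KL expansion (53)-(54), assuming zero-mean, real-valued
# vector samples x_i stored as the columns of X (shape m x N).
def kl_basis(X):
    m, N = X.shape
    Rxx = (X @ X.T) / N                    # sample covariance estimate of E[x x^T]
    lam, V = np.linalg.eigh(Rxx)           # eigh returns ascending eigenvalues
    order = np.argsort(lam)[::-1]          # reorder so lambda_1 >= ... >= lambda_m
    return lam[order], V[:, order]

# Example: a crude, highly correlated (slowly varying) sequence
rng = np.random.default_rng(0)
x = np.cumsum(rng.standard_normal(1000))
X = np.lib.stride_tricks.sliding_window_view(x, 2).T   # m = 2 samples per vector
X = X - X.mean(axis=1, keepdims=True)
lam, V = kl_basis(X)
theta = V.T @ X                            # KL coefficients, eq. (54)
print(lam)                                 # variances of the KL coefficients (Property 8)
```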

By lecture 4, we will have sufficient knowledge to prove that the eigenvectors align themselves along the principal axes of the scatterplot ellipsoid of Figure 4. In highly correlated systems, due to the fact that the principal axes of the scatterplot ellipse have decreasing magnitudes (as shown in Figure 4) the variance of the smallest coefficients is typically much smaller than that of the larger coefficients.

Question: Suppose the process x is white, so that Rxx=E(xxT) is already diagonal, with equal diagonal elements; i.e., Rxx=σ2I, as in Figure 5. What is the K-L basis in this case?

To answer this, we see that all the eigenvalues of Rxx are repeated. Therefore, the eigenvector basis is not unique. In fact, in this case, any vector in Rm is an eigenvector of the matrix σ2I (the eigenvalue is σ2). Therefore, any orthonormal basis is a K-L basis for a white process. This concept is evident from the circular scatterplot of figure 5.

2.5.2 Properties of the KL Expansion

Property 7 The coefficients θ of the KL expansion are uncorrelated.

To prove this, we evaluate the covariance matrix Rθθ of θ, using the definition (54) as follows:

(55) $R_{\theta\theta} = E\left[\theta\theta^T\right] = E\left[V^T x x^T V\right] = V^T R_{xx} V = \Lambda.$

Since Rθθ is equal to the diagonal eigenvalue matrix Λ of Rxx, the KL coefficients are uncorrelated.

Property 8 The variance of the ith K–L coefficient θi is equal to the ith eigenvalue λi of Rxx.

The proof follows directly from (55); Rθθ=Λ.

Property 9 The variance of a highly correlated random process x concentrates in the first few KL coefficients.

This property may be justified intuitively from the scatterplot of Figure 4, due to the fact that the length of the first principal axis is greater than that of the second. (This effect becomes more pronounced in higher dimensions.) However here we wish to formally prove this property.

Let us denote the covariance matrix of the process shown in Fig. 2 as $R_2$, and that shown in Fig. 3 as $R_3$. We assume both processes are stationary with equal powers. Let $\alpha_i$ be the eigenvalues of $R_2$ and $\beta_i$ be the eigenvalues of $R_3$. Because $R_3$ is diagonal with equal diagonal elements, all the $\beta_i$ are equal. Our assumptions imply that the main diagonal elements of $R_2$ are equal to the main diagonal elements of $R_3$, and hence from Property 4, the traces, and therefore the eigenvalue sums, of the two covariance matrices are equal.

To obtain further insight into the behavior of the two sets of eigenvalues, we consider Hadamard’s inequality[16] which may be stated as:

Consider a square matrix $A \in \mathbb{R}^{m\times m}$. Then $\det A \le \prod_{i=1}^{m} a_{ii}$, with equality if and only if A is diagonal.

From Hadamard’s inequality, $\det R_2 < \det R_3$, and so also from Property 4, $\prod_{i=1}^{n}\alpha_i < \prod_{i=1}^{n}\beta_i$. Under the constraint $\sum_{i=1}^{n}\alpha_i = \sum_{i=1}^{n}\beta_i$, it follows that $\alpha_1 > \alpha_n$; i.e., the eigenvalues of $R_2$ are not equal. (We say the eigenvalues become disparate.) Thus, the variance in the first K-L coefficients of a correlated process is larger than that in the later K-L coefficients. Typically in a highly correlated system, only the first few coefficients have significant variance.

To illustrate this phenomenon further, consider the extreme case where the process becomes so correlated that all elements of its covariance matrix approach the same value. (This will happen if the process x[n] does not vary with time). Then, all columns of the covariance matrix are equal, the rank of $R_{xx}$ in this case becomes equal to one, and therefore only one eigenvalue is nonzero. Then all the energy of the process is concentrated into only the first K-L coefficient. In contrast, when the process is white and stationary, all the eigenvalues of $R_{xx}$ are equal, and the variance of the process is equally distributed amongst all the K–L coefficients. The point of this discussion is to indicate a general behavior of random processes, which is that as they become more highly correlated, the variance in the K-L coefficients concentrates in the first few elements. The variance in the remaining coefficients becomes negligible.

2.5.3 Applications of the K-L Expansion

Suppose a communications system transmits a stationary, zero–mean highly–correlated sequence x. This means that to transmit the elements of x directly, one sends a particular element xi of x using as many bits as is necessary to convey the information with the required fidelity. However, in sending the next element xi+1, almost all of the same information is sent over again, due to the fact that xi+1 is highly correlated with xi and its previous few samples. That is, xi+1 contains very little new information relative to xi. It is therefore seen that if x is highly correlated, transmitting the samples directly (i.e., using the conventional coordinate system) is very wasteful in terms of the number of required bits to transmit.

But if x is stationary and Rxx is known at the receiver[17], then it is possible for both the transmitter and receiver to “know” the eigenvectors of Rxx, the basis set. If the process is sufficiently highly correlated, then, because of the concentration properties of the K–L transform, the variance of the first few coefficients θ dominates that of the remaining ones. The later coefficients on average typically have a small variance and are not required to accurately represent the signal.

To implement this form of signal compression, let us say that an acceptable level of distortion is obtained by retaining only the first j significant coefficients. We form a truncated K-L coefficient vector θ^ in a similar manner to (54) as

(56) $\hat{\theta} = \begin{bmatrix} \theta_1 \\ \vdots \\ \theta_j \\ 0 \\ \vdots \\ 0 \end{bmatrix} = \begin{bmatrix} v_1^T \\ \vdots \\ v_j^T \\ 0^T \\ \vdots \\ 0^T \end{bmatrix} x,$

where coefficients θj+1,,θm are set to zero and therefore need not be transmitted. This means we can represent xi more compactly without sacrificing significant loss of quality; i.e., we have achieved signal compression.

An approximation x^ to the original signal can be reconstructed by:

(57)x^=Vθ^.

From Property 8, the mean–squared error ϵj2 in the KL reconstruction x^ is given as

(58)ϵj2=i=j+1mλi,

which corresponds to the sum of the truncated (smallest) eigenvalues. It is easy to prove that no other basis results in a smaller error. The error ϵj2 in the reconstructed x^ using any basis [q1,,qm] is given by

(59) $\epsilon_j^2 = \sum_{i=j+1}^{m} E\left[\,|q_i^T x|^2\,\right] = \sum_{i=j+1}^{m} q_i^T R_{xx}\, q_i,$

where the last line uses (51) and (52). We have seen previously that the eigenvectors are the stationary points of each term in the sum above. Since each term in the sum is non-negative, $\epsilon_j^2$ is minimized by minimizing each term individually. Therefore, the minimum of (59) is obtained when the $q_i$ are assigned the eigenvectors associated with the $m-j$ smallest eigenvalues. Since $v_i^T R_{xx} v_i = \lambda_i$ when $\|v_i\|_2 = 1$, the minimum of (59) coincides with (58) only when $q_i = v_i$. This completes the proof.

In speech applications for example, fewer than one tenth of the coefficients are needed for reconstruction with imperceptible degradation. Note that since R^xx is positive semi–definite, all eigenvalues are non–negative. Hence, the energy measure (58) is always non–negative for any value of j. This type of signal compression is the ultimate form of a type of coding known as transform coding.
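
The truncation-based compression of (56)–(58) can be sketched in the same way. The fragment below is illustrative only; the data matrix X (m × N, zero mean) and the number j of retained coefficients are assumed given, and the empirical reconstruction error is compared against the eigenvalue sum of (58).

```python
import numpy as np

# A sketch of the compression scheme of (56)-(58): keep only the first j KL
# coefficients of each vector sample and reconstruct.
def kl_compress_reconstruct(X, j):
    m, N = X.shape
    lam, V = np.linalg.eigh(X @ X.T / N)            # sample covariance eigendecomposition
    order = np.argsort(lam)[::-1]                   # descending eigenvalue order
    lam, V = lam[order], V[:, order]
    theta = V.T @ X                                 # KL coefficients, eq. (54)
    theta[j:, :] = 0.0                              # coefficients j+1,...,m are not transmitted
    X_hat = V @ theta                               # reconstruction, eq. (57)
    mse_empirical = np.mean(np.sum((X - X_hat) ** 2, axis=0))
    mse_predicted = lam[j:].sum()                   # eq. (58): sum of the truncated eigenvalues
    return X_hat, mse_empirical, mse_predicted      # the two errors agree (up to roundoff)
```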

Transform coding is now illustrated by an example. A process x[n] was generated by passing a unit-variance zero–mean white noise sequence w[n] through a 3rd-order digital lowpass Butterworth filter with a relatively low normalized cutoff frequency of 0.1, as shown in Fig. 6. Vector samples xi are extracted from the sequence x[n] as shown in Fig. 2. The filter removes the high-frequency components from the input, and so the resulting output process x[n] must vary slowly in time. Thus, the K–L expansion is expected to require only a few principal eigenvector components, and significant compression gains can be achieved.

We show this example for m=10. Listed below are the 10 eigenvalues corresponding to R^xx, the covariance matrix of x, generated from the output of the lowpass filter:

Eigenvalues: 0.5468, 0.1975, 0.1243$\times 10^{-1}$, 0.5112$\times 10^{-3}$, 0.2617$\times 10^{-4}$, 0.1077$\times 10^{-5}$, 0.6437$\times 10^{-7}$, 0.3895$\times 10^{-8}$, 0.2069$\times 10^{-9}$, 0.5761$\times 10^{-11}$

The error $\epsilon_j^2$ for j=2 is thus evaluated from the above data as 0.0130, which may be compared to the value 0.7573, which is the total eigenvalue sum. The normalized error is $0.0130/0.7573 \approx 0.0171$. Because this error may be considered low enough, only the first j=2 K-L components need be considered significant. In this case, we have a compression gain of 10/2=5; i.e., the KL expansion requires only one fifth of the bits relative to representing the signal directly.

The corresponding two principal eigenvectors are plotted in Fig. 7. These plots show the value of the kth element vk of the eigenvector, plotted against its index k for k=1,,m. These waveforms may be interpreted as functions of time.

In this case, we would expect that any observation xi can be expressed accurately as a linear combination of only the first two eigenvector waveforms shown in Fig. 7, whose coefficients θ^ are given by (56). In Fig. 8 we show samples of the true observation x shown as a waveform in time, compared with the reconstruction x^i formed from (57) using only the first j=2 eigenvectors. It is seen that the difference between the true and reconstructed vector samples is small, as expected.

One of the practical difficulties in using the K–L expansion for coding is that the eigenvector set V is not usually known at the receiver in practical cases when the observed signal is mildly or severely nonstationary (e.g. speech or video signals). In this case, the covariance matrix estimate $\hat{R}_{xx}$ is changing with time; hence so are the eigenvectors. Transmission of the eigenvector set to the receiver is expensive in terms of information and so is undesirable. This fact limits the explicit use of the K–L expansion for coding. However, it has been shown[18] that the discrete cosine transform (DCT), which is another form of orthonormal expansion whose basis consists of cosine–related functions, closely approximates the eigenvector basis for a certain wide class of signals. The DCT uses a fixed basis, independent of the signal, and hence is always known at the receiver. Transform coding using the DCT enjoys widespread practical use and is the fundamental idea behind the so–called JPEG and MPEG international standards for image and video coding. The search for other bases, particularly wavelet functions, to replace the eigenvector basis is a subject of ongoing research. Thus, even though the K–L expansion by itself is not of much practical value, the theoretical ideas behind it are of significant worth.

2.6 Example: Array Processing

Here, we present a further example of the concepts we have developed so far. This example is concerned with direction of arrival estimation using arrays of sensors.

Consider an array of M sensors (e.g., antennas) as shown in Fig. 9. Let there be K<M plane waves incident onto the array as shown. Assume the amplitudes of the incident waves do not change during the time taken for the wave to traverse the array. Also assume for the moment that the amplitude of the first incident wave at the first sensor is unity. Then, from the physics shown in Fig. 9, the signal vector x received by sampling each element of the array simultaneously, from the first incident wave alone, may be described in vector format by x=[1,ejϕ,ej2ϕ,,ej(M1)ϕ]T, where ϕ is the electrical phase–shift between adjacent elements of the array, due to the first incident wave.[19] When there are K incident signals, with corresponding amplitudes ak,k=1,,K, the effects of the K incident signals each add linearly together, each weighted by the corresponding amplitude ak, to form the received signal vector x. The resulting received signal vector, including the noise can then be written in the form

(60) $x_n = S\, a_n + w_n, \qquad n = 1,\ldots,N, \qquad x_n, w_n \in \mathbb{C}^{M\times 1},\; S \in \mathbb{C}^{M\times K},\; a_n \in \mathbb{C}^{K\times 1},$

where $w_n$ is an M-length noise vector at time n whose elements are independent random variables with zero mean and variance $\sigma^2$, i.e., $E(w_i^2) = \sigma^2$; the noise is assumed uncorrelated with the signal. The columns of $S = [s_1,\ldots,s_K]$, where $s_k = [1, e^{j\phi_k}, e^{j2\phi_k}, \ldots, e^{j(M-1)\phi_k}]^T$, are referred to as steering vectors, and $\phi_k,\ k=1,\ldots,K$, are the electrical phase–shift angles corresponding to the incident signals; the $\phi_k$ are assumed to be distinct. Finally, $a_n = [a_1,\ldots,a_K]_n^T$ is a vector of independent random variables, describing the amplitudes of each of the incident signals at time n.

In (60) we obtain N vector samples xnCM×1, n=1,,N by simultaneously sampling all array elements at N distinct points in time. Our objective is to estimate the directions of arrival ϕk of the plane waves relative to the array, by observing only the received signal.

Note K<M. Let us form the covariance matrix R of the received signal x:

(61) $R = E(xx^H) = E\left[(Sa + w)(a^H S^H + w^H)\right] = S\,E(aa^H)\,S^H + \sigma^2 I$

The last line follows because the noise is uncorrelated with the signal, thus forcing the cross–terms to zero. In the last line of (61) we have also used that fact that the covariance matrix of the noise contribution (second term) is σ2I. This follows because the elements of the noise vector w are independent with equal power. The first term of (61) we call Ro, which is the contribution to the covariance matrix due only to the signal.

Let's look at the structure of $R_o$:

$R_o = S \underbrace{E(aa^H)}_{\text{non-singular}} S^H$

From this structure, we may conclude that $R_o$ is rank K. This may be seen as follows. Let us define $A \equiv E(aa^H)$ and $B \equiv A S^H$. Because the $\phi_k$ are distinct, S is full rank (rank K), and because the $a_k$ are independent, A is full rank (K). Therefore the matrix $B \in \mathbb{C}^{K\times M}$ is of full rank K. Then, $R_o = SB$. From this last relation, we see that the ith ($i = 1,\ldots,M$) column of $R_o$ is a linear combination of the K columns of S, whose coefficients are the ith column of B. Because B is full rank, K linearly independent linear combinations of the K columns of S are used to form $R_o$. Thus $R_o$ is rank K. Because $K < M$, $R_o$ is rank deficient.
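
A quick numerical check of this rank argument is easy to set up. The sketch below builds $R_o = S\,E(aa^H)\,S^H$ for illustrative values of M, K and the electrical angles (these particular numbers are not from the notes) and confirms that the result has rank K.

```python
import numpy as np

# Sketch checking that R_o = S E(aa^H) S^H has rank K (noise-free case).
M, K = 8, 2
phi = np.array([0.3, 1.1])                          # distinct electrical angles (radians)
S = np.exp(1j * np.outer(np.arange(M), phi))        # steering matrix, columns s(phi_k)
A = np.diag([2.0, 1.0])                             # E(aa^H): independent signals, full rank
Ro = S @ A @ S.conj().T
print(np.linalg.matrix_rank(Ro))                    # -> 2 (= K), so Ro is rank deficient
```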

Let us now investigate the eigendecomposition on Ro, where λk are the eigenvalues of Ro:

(62)Ro=VΛVH

or

(63) $R_o = V\,\mathrm{diag}(\lambda_1,\ldots,\lambda_K,0,\ldots,0)\,V^H.$

Because $R_o \in \mathbb{C}^{M\times M}$ is rank K, it has K non-zero eigenvalues and $M-K$ zero eigenvalues. We enumerate the eigenvectors $v_1,\ldots,v_K$ as those associated with the largest K eigenvalues, and $v_{K+1},\ldots,v_M$ as those associated with the zero eigenvalues.[20] [21]

From the definition of an eigenvector, we have

(68)Rovi=0

or

(69)SASHvi=0,i=K+1,,M.

Since A=E(aaH) and S are full rank, the only way (69) can be satisfied is if the vi,i=K+1,,M are orthogonal to all columns of S=[s(ϕ1),,s(ϕK)]. Therefore we have

(70)skHvi=0,k=1,,K,i=K+1,,M,

We define the matrix $V_N \equiv [v_{K+1},\ldots,v_M]$. Therefore (70) may be written as

(71)SHVN=0.

We also have

(72)[1,ejϕk,ej2ϕk,,ej(M1)ϕk]HVN=0.

Up to now, we have considered only the noise–free case. What happens when the noise component σ2I is added to Ro to give Rxx in (61)? From Property 3, Lecture 1, we see that if the eigenvalues of Ro are λi, then those of Rxx are λi+σ2. The eigenvectors remain unchanged with the noise contribution, and (70) still holds when noise is present. Note these properties only apply to the true covariance matrix formed using expectations, rather than the estimated covariance matrix formed using time averages.

With this background in place we can now discuss the MUSIC[22] algorithm for estimating directions of arrival of plane waves incident onto arrays of sensors.

2.6.1 The MUSIC Algorithm[23]

We wish to estimate the unknown values $[\phi_1,\ldots,\phi_K]$ which comprise $S = [s(\phi_1),\ldots,s(\phi_K)]$. The MUSIC algorithm assumes the quantity K is known. In the practical case, where expectations cannot be evaluated because they require infinite data, we form an estimate $\hat{R}$ of R based on a finite number N of observations as follows:

$\hat{R} = \frac{1}{N}\sum_{n=1}^{N} x_n x_n^H.$

Only as $N \to \infty$ does $\hat{R} \to R$.

An estimate $\hat{V}_N$ of $V_N$ may be formed from the eigenvectors associated with the smallest $M-K$ eigenvalues of $\hat{R}$. Because of the finite N and the presence of noise, (72) only holds approximately when $\hat{V}_N$ is used in place of $V_N$. Thus, a reasonable estimate of the desired directions of arrival may be obtained by finding values of the variable ϕ for which the expression on the left of (72) is small instead of exactly zero. Thus, we determine K estimates $\hat{\phi}$ which locally satisfy

(73) $\hat{\phi} = \arg\min_{\phi}\, \left\| s^H(\phi)\, \hat{V}_N \right\|$

By convention, it is desirable to express (73) as a spectrum–like function, where a peak instead of a null represents a desired signal. It is also convenient to use the squared-norm instead of the norm itself. Thus, the MUSIC “spectrum” P(ϕ) is defined as:

$P(\phi) = \frac{1}{s(\phi)^H \hat{V}_N \hat{V}_N^H s(\phi)}$

It will look something like what is shown in Fig. 10 when there are K=2 incident signals.
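
A minimal simulation of the whole MUSIC procedure might look as follows. Everything in the scenario (number of sensors, angles, noise level, snapshot count, the peak-picking at the end) is an illustrative assumption; only the model (60), the sample covariance, and the spectrum P(φ) follow the development above.

```python
import numpy as np

# A minimal MUSIC sketch following (60)-(73); scenario values are illustrative.
rng = np.random.default_rng(1)
M, K, N = 8, 2, 200
phi_true = np.array([0.5, 1.4])                         # electrical angles of arrival
S = np.exp(1j * np.outer(np.arange(M), phi_true))       # M x K steering matrix
a = (rng.standard_normal((K, N)) + 1j * rng.standard_normal((K, N))) / np.sqrt(2)
w = 0.1 * (rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N))) / np.sqrt(2)
X = S @ a + w                                           # received snapshots, eq. (60)

R_hat = X @ X.conj().T / N                              # sample covariance estimate
lam, V = np.linalg.eigh(R_hat)                          # ascending eigenvalues
Vn = V[:, :M - K]                                       # noise subspace: M-K smallest

phis = np.linspace(0, np.pi, 1000)
s = np.exp(1j * np.outer(np.arange(M), phis))           # candidate steering vectors
P = 1.0 / np.sum(np.abs(Vn.conj().T @ s) ** 2, axis=0)  # MUSIC "spectrum"

# crude peak picking: the K largest local maxima of P
peaks = [i for i in range(1, len(P) - 1) if P[i] > P[i - 1] and P[i] > P[i + 1]]
top = sorted(peaks, key=lambda i: P[i])[-K:]
print(sorted(phis[top]))                                # estimates near phi_true
```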

2.7 TO SUMMARIZE

  • An eigenvector x of a matrix A is such that Ax points in the same direction as x.
  • The covariance matrix Rxx of a random process x is defined as E(xxH). For stationary processes, Rxx completely characterizes the process, and is closely related to its covariance function. In practice, the expectation operation is replaced by a time-average.
  • The eigenvectors of Rxx form a natural basis to represent x, since it is only the eigenvectors which diagonalize Rxx. This leads to the coefficients a of the corresponding expansion x=Va being uncorrelated. This has significant application in speech/video encoding.
  • The expectations of the squares of the coefficients above are the eigenvalues of Rxx. This gives an idea of the relative power present along each eigenvector.
  • If the variables x are Gaussian, then the K-L coefficients are independent. This greatly simplifies receiver design and analysis.

Many of these points are a direct consequence of the fact that it is only the eigenvectors which can diagonalize a matrix. That is basically the only reason why eigenvalues/eigenvectors are so useful. I hope this serves to demystify this subject. Once you see that it is only the eigenvectors which diagonalize, the property that they are a natural basis for the process x becomes easy to understand.

An interpretation of an eigenvalue is that it represents the average energy in each coefficient of the K–L expansion.

3 The Singular Value Decomposition (SVD)

In this lecture we learn about one of the most fundamental and important matrix decompositions of linear algebra: the SVD. It bears some similarity with the eigendecomposition (ED), but is more general. Usually, the ED is of interest only on symmetric square matrices, but the SVD may be applied to any matrix. The SVD gives us important information about the rank, the column and row spaces of the matrix, and leads to very useful solutions and interpretations of least squares problems. We also discuss the concept of matrix projectors, and their relationship with the SVD.

3.1 The Singular Value Decomposition (SVD)

We have found so far that the eigendecomposition is a useful analytic tool. However, it is only applicable on square symmetric matrices. We now consider the SVD, which may be considered a generalization of the ED to arbitrary matrices. Thus, with the SVD, all the analytical uses of the ED which before were restricted to symmetric matrices may now be applied to any form of matrix, regardless of size, whether it is symmetric or nonsymmetric, rank deficient, etc.

Theorem 1 Let ARm×n. Then A can be decomposed according to the singular value decomposition as

(1)A=UΣVT

where U and V are orthonormal and

URm×m,VRn×n

and

$\Sigma = \mathrm{diag}(\sigma_1,\sigma_2,\ldots,\sigma_p) \in \mathbb{R}^{m\times n}, \qquad p = \min(m,n)$

where

$\sigma_1 \ge \sigma_2 \ge \sigma_3 \ge \cdots \ge \sigma_p \ge 0.$

The matrix Σ must be of dimension Rm×n (i.e., the same size as A), to maintain dimensional consistency of the product in (1). It is therefore padded with zeros either on the bottom or to the right of the diagonal block, depending on whether m>n or m<n, respectively. We denote the square p×p diagonal matrix as Σ~; the m×n diagonal matrix containing the zero blocks is denoted as Σ.

Since U and V are orthonormal, we may also write (1) in the form:

(2) $\underbrace{U^T}_{m\times m}\,\underbrace{A}_{m\times n}\,\underbrace{V}_{n\times n} = \underbrace{\Sigma}_{m\times n}$

where Σ is a diagonal matrix. The values σi which are defined to be positive, are referred to as the singular values of A. The columns ui and vi of U and V are respectively called the left and right singular vectors of A.

The SVD corresponding to (1) may be shown diagramatically in the following way:

(3) $\underset{m\times n}{A} = \underset{m\times m}{\begin{bmatrix} u_1 & \cdots & u_m \end{bmatrix}}\; \underset{m\times n}{\begin{bmatrix} \sigma_1 & & \\ & \ddots & \\ & & \sigma_p \\ & 0 & \end{bmatrix}}\; \underset{n\times n}{\begin{bmatrix} v_1^T \\ \vdots \\ v_n^T \end{bmatrix}}$

Each $u_i$ above represents a column of U, and each $v_i^T$ a row of $V^T$ (i.e., a column of V).
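
Numerically, the decomposition of (1)–(3) can be obtained directly; for instance, NumPy's np.linalg.svd returns U, the singular values, and $V^T$. The small sketch below simply verifies the factorization for an illustrative 2 × 3 matrix.

```python
import numpy as np

# Sketch of (1)-(3): full_matrices=True gives U (m x m), V^T (n x n), and the
# p = min(m, n) singular values in descending order.
A = np.array([[1.0, 2.0, 0.0],
              [3.0, 1.0, 1.0]])                 # m = 2, n = 3
U, s, Vt = np.linalg.svd(A, full_matrices=True)
Sigma = np.zeros(A.shape)
Sigma[:len(s), :len(s)] = np.diag(s)            # pad to m x n as in (3)
print(np.allclose(A, U @ Sigma @ Vt))           # True: A = U Sigma V^T
print(s)                                        # sigma_1 >= sigma_2 >= 0
```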

3.2 Existence Proof of the SVD

Consider two vectors x and y where $\|x\|_2 = \|y\|_2 = 1$, such that $Ax = \sigma y$, where $\sigma = \|A\|_2$. The fact that such vectors x and y can exist follows from the definition of the matrix 2-norm. We define orthonormal matrices U and V so that x and y form their first columns, as follows:

$U = [y,\ U_1] \qquad V = [x,\ V_1]$

That is, $U_1$ consists of a set of non–unique orthonormal columns which are mutually orthogonal and orthogonal to y; similarly for $V_1$.

We then define a matrix A1 as

(4)UTAV=A1=[yTU1T]A[x,V1]

The matrix A1 has the following structure:

(5) $\underbrace{\begin{bmatrix} y^T \\ U_1^T \end{bmatrix}}_{\text{orthonormal}} A \underbrace{\begin{bmatrix} x & V_1 \end{bmatrix}}_{\text{orthonormal}} = \begin{bmatrix} y^T \\ U_1^T \end{bmatrix}\begin{bmatrix} \sigma y & A V_1 \end{bmatrix} = \begin{bmatrix} \sigma & w^T \\ 0 & B \end{bmatrix} \equiv A_1$ (with block rows of sizes 1 and $m-1$, and block columns of sizes 1 and $n-1$),

where $B \equiv U_1^T A V_1$. The 0 in the (2,1) block above follows from the fact that $U_1 \perp y$, because U is orthonormal.

Now, we post-multiply both sides of (5) by the vector $\begin{bmatrix}\sigma \\ w\end{bmatrix}$ and take 2-norms:

(6) $\left\| A_1 \begin{bmatrix}\sigma \\ w\end{bmatrix} \right\|_2^2 = \left\| \begin{bmatrix} \sigma & w^T \\ 0 & B \end{bmatrix}\begin{bmatrix}\sigma \\ w\end{bmatrix} \right\|_2^2 \ge (\sigma^2 + w^T w)^2.$

This follows because the term on the extreme right is only the first element of the vector product of the middle term. But, as we have seen, matrix p-norms obey the following property:

(7) $\|Ax\|_2 \le \|A\|_2\, \|x\|_2.$

Therefore using (6) and (7), we have

(8) $\|A_1\|_2^2 \left\|\begin{bmatrix}\sigma \\ w\end{bmatrix}\right\|_2^2 \;\ge\; \left\| A_1 \begin{bmatrix}\sigma \\ w\end{bmatrix} \right\|_2^2 \;\ge\; (\sigma^2 + w^T w)^2.$

Note that $\left\|\begin{bmatrix}\sigma \\ w\end{bmatrix}\right\|_2^2 = \sigma^2 + w^T w$. Dividing (8) by this quantity, we obtain

(9) $\|A_1\|_2^2 \ge \sigma^2 + w^T w.$

But, we defined $\sigma = \|A\|_2$. Therefore, the following must hold:

(10) $\sigma = \|A\|_2 = \|U^T A V\|_2 = \|A_1\|_2$

where the equality on the right follows because the matrix 2-norm is invariant to matrix pre- and post-multiplication by an orthonormal matrix. By comparing (9) and (10), we have the result w=0.

Substituting this result back into (5), we now have

(11) $A_1 = \begin{bmatrix} \sigma & 0 \\ 0 & B \end{bmatrix}.$

The whole process repeats using only the component B, until An becomes diagonal.

It is instructive to consider an alternative proof for the SVD. The following is useful because it is a constructive proof, which shows us how to form the components of the SVD.

Theorem 2 Let $A \in \mathbb{R}^{m\times n}$ be a rank r matrix ($r \le p = \min(m,n)$). Then there exist orthonormal matrices U and V such that

(12) $U^T A V = \begin{bmatrix} \tilde{\Sigma} & 0 \\ 0 & 0 \end{bmatrix}$

where

(13) $\tilde{\Sigma} = \mathrm{diag}(\sigma_1,\ldots,\sigma_r), \qquad \sigma_i > 0.$

Proof: Consider the square symmetric positive semi–definite matrix ATA[24]. Let the eigenvalues greater than zero be σ12,σ22,,σr2. Then, from our knowledge of the eigendecomposition, there exists an orthonormal matrix VRn×n such that

(14) $V^T A^T A V = \begin{bmatrix} \tilde{\Sigma}^2 & 0 \\ 0 & 0 \end{bmatrix},$

where $\tilde{\Sigma}^2 = \mathrm{diag}(\sigma_1^2,\ldots,\sigma_r^2)$. We now partition V as $[V_1\ V_2]$, where $V_1 \in \mathbb{R}^{n\times r}$. Then (14) has the form

(15) $\begin{bmatrix} V_1^T \\ V_2^T \end{bmatrix} A^T A\, \begin{bmatrix} \underbrace{V_1}_{r} & \underbrace{V_2}_{n-r} \end{bmatrix} = \begin{bmatrix} \tilde{\Sigma}^2 & 0 \\ 0 & 0 \end{bmatrix}.$

Then by equating corresponding blocks in (15) we have

(16) $V_1^T A^T A V_1 = \tilde{\Sigma}^2 \qquad (r\times r)$
(17) $V_2^T A^T A V_2 = 0 \qquad ((n-r)\times(n-r))$

From (16), we can write

(18) $\tilde{\Sigma}^{-1} V_1^T A^T A V_1 \tilde{\Sigma}^{-1} = I.$

Then, we define the matrix U1Rm×r from (18) as

(19) $U_1 = A V_1 \tilde{\Sigma}^{-1}.$

Then from (18) we have U1TU1=I and it follows that

(20)U1TAV1=Σ~.

From (17) we also have

(21)AV2=0.

We now choose a matrix $U_2 \in \mathbb{R}^{m\times(m-r)}$ so that $U = [U_1\ U_2]$ is orthonormal. Then, from (19) and because $U_1 \perp U_2$, we have

(22) $U_2^T U_1 = U_2^T A V_1 \tilde{\Sigma}^{-1} = 0.$

Therefore

(23)U2TAV1=0.

Combining (20), (21) and (23), we have

(24) $U^T A V = \begin{bmatrix} U_1^T A V_1 & U_1^T A V_2 \\ U_2^T A V_1 & U_2^T A V_2 \end{bmatrix} = \begin{bmatrix} \tilde{\Sigma} & 0 \\ 0 & 0 \end{bmatrix}$

The proof can be repeated using an eigendecomposition on the matrix AATRm×m instead of on ATA. In this case, the roles of the orthonormal matrices V and U are interchanged.
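
The constructive proof just given translates directly into a computation. The following sketch follows (14)–(20) for an illustrative full-column-rank matrix (so that $V_1 = V$ and no zero eigenvalues need special handling); that simplification is an assumption made here for brevity.

```python
import numpy as np

# Build V1, Sigma-tilde and U1 from the eigendecomposition of A^T A, as in (14)-(20).
A = np.array([[1.0, 2.0],
              [0.0, 1.0],
              [2.0, 0.0]])                      # m = 3, n = 2, rank r = 2
lam, V = np.linalg.eigh(A.T @ A)                # ascending eigenvalues of A^T A
order = np.argsort(lam)[::-1]
lam, V1 = lam[order], V[:, order]
Sig = np.diag(np.sqrt(lam))                     # Sigma-tilde = diag(sigma_1, ..., sigma_r)
U1 = A @ V1 @ np.linalg.inv(Sig)                # eq. (19): U1 = A V1 Sigma-tilde^{-1}
print(np.allclose(U1.T @ U1, np.eye(2)))        # U1 has orthonormal columns
print(np.allclose(U1 @ Sig @ V1.T, A))          # A = U1 Sigma-tilde V1^T
```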

The above proof is useful for several reasons:

  • It is short and elegant.
  • We can also identify which part of the SVD is not unique. Here, we assume that $A^TA$ has no repeated non–zero eigenvalues. Because $V_2$ consists of the eigenvectors corresponding to the zero eigenvalues of $A^TA$, $V_2$ is not unique when there are repeated zero eigenvalues. This happens when $m < n-1$ (i.e., A is sufficiently short), or when the nullity of A is 2 or greater, or a combination of these conditions.
  • By its construction, the matrix $U_2 \in \mathbb{R}^{m\times(m-r)}$ is not unique whenever it consists of two or more columns. This happens when $m - r \ge 2$. It is left as an exercise to show that similar conclusions on the uniqueness of U and V can be made when the proof is developed using the matrix $AA^T$.

3.3 Partitioning the SVD

Here we assume that A has $r \le p$ non-zero singular values (and $p-r$ zero singular values). Later, we see that r=rank(A). For convenience of notation, we arrange the singular values as:

$\underbrace{\sigma_1 \ge \cdots \ge \sigma_r}_{r\ \text{non-zero s.v.'s}} \;>\; \underbrace{\sigma_{r+1} = \cdots = \sigma_p = 0}_{p-r\ \text{zero s.v.'s}}$

In the remainder of this lecture, we use the SVD partitioned in both U and V. We can write the SVD of A in the form

(25) $A = \begin{bmatrix} U_1 & U_2 \end{bmatrix}\begin{bmatrix} \tilde{\Sigma} & 0 \\ 0 & 0 \end{bmatrix}\begin{bmatrix} V_1^T \\ V_2^T \end{bmatrix}$

where $\tilde{\Sigma} = \mathrm{diag}(\sigma_1,\ldots,\sigma_r) \in \mathbb{R}^{r\times r}$, and U is partitioned as

(26) $U = \begin{bmatrix} \underbrace{U_1}_{r} & \underbrace{U_2}_{m-r} \end{bmatrix} \in \mathbb{R}^{m\times m}$

The columns of U1 are the left singular vectors associated with the r nonzero singular values, and the columns of U2 are the left singular vectors associated with the zero singular values. V is partitioned in an analogous manner:

(27) $V = \begin{bmatrix} \underbrace{V_1}_{r} & \underbrace{V_2}_{n-r} \end{bmatrix} \in \mathbb{R}^{n\times n}$

3.4 Interesting Properties and Interpretations of the SVD

The above partition reveals many interesting properties of the SVD:

3.4.1 rank(A) = r

Using (25), we can write A as

(28) $A = \begin{bmatrix} U_1 & U_2 \end{bmatrix}\begin{bmatrix} \tilde{\Sigma} V_1^T \\ 0 \end{bmatrix} = U_1 \tilde{\Sigma} V_1^T = U_1 B$

where $B \equiv \tilde{\Sigma} V_1^T \in \mathbb{R}^{r\times n}$. From (28) it is clear that each column of A is a linear combination of the columns of $U_1$, whose coefficients are given by the corresponding column of B. But since $U_1$ has only r columns ($r \le n$), there can be at most r linearly independent columns in A. It follows from the definition of rank that rank(A)=r.

This point is analogous to the case previously considered in Lecture 2, where we saw rank is equal to the number of non-zero eigenvalues, when A is a square symmetric matrix. In this case however, the result applies to any matrix. This is another example of how the SVD is a generalization of the eigendecomposition.

Determination of rank when $\sigma_1,\ldots,\sigma_r$ are distinctly greater than zero, and when $\sigma_{r+1},\ldots,\sigma_p$ are exactly zero, is easy. But often in practice, due to finite precision arithmetic and fuzzy data, $\sigma_r$ may be very small, and $\sigma_{r+1}$ may be not quite zero. Hence, in practice, determination of rank is not so easy. A common method is to declare rank(A)=r if $\sigma_{r+1} \le \epsilon$, where ε is a small number specific to the problem considered.
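
A sketch of this thresholding idea is shown below; the matrix, the perturbation level, and the tolerance ε are all illustrative choices, not values from the notes.

```python
import numpy as np

# Numerical rank: count singular values above a problem-dependent tolerance eps.
def numerical_rank(A, eps):
    s = np.linalg.svd(A, compute_uv=False)
    return int(np.sum(s > eps))

A = np.outer([1.0, 2.0, 3.0], [1.0, 0.0, 1.0])                    # exactly rank 1
A += 1e-10 * np.random.default_rng(0).standard_normal((3, 3))     # "fuzzy" data
print(numerical_rank(A, eps=1e-6))    # -> 1: the perturbation is below the threshold
print(np.linalg.matrix_rank(A))       # -> 3 with NumPy's default (much tighter) tolerance
```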

3.4.2 N(A)=R(V2)

Recall the nullspace $N(A) = \{x \neq 0 \mid Ax = 0\}$. So, we investigate the set {x} such that Ax=0. Let $x \in \mathrm{span}(V_2)$; i.e., $x = V_2 c$, where $c \in \mathbb{R}^{n-r}$. By substituting (25) for A, and by noting that $V_1 \perp V_2$ and that $V_1^T V_1 = I$, we have

(29) $Ax = \begin{bmatrix} U_1 & U_2 \end{bmatrix}\begin{bmatrix} \tilde{\Sigma} & 0 \\ 0 & 0 \end{bmatrix}\begin{bmatrix} 0 \\ c \end{bmatrix} = 0.$

Thus, span(V2) is at least a subspace of N(A). However, if x contains any components of V1, then (29) will not be zero. But since V=[V1V2] is a complete basis in Rn, we see that V2 alone is a basis for the nullspace of A.

3.4.3 R(A)=R(U1)

Recall that the definition of the range R(A) is $\{y \mid y = Ax,\ x \in \mathbb{R}^n\}$. From (25),

(30) $Ax = \begin{bmatrix} U_1 & U_2 \end{bmatrix}\begin{bmatrix} \tilde{\Sigma} & 0 \\ 0 & 0 \end{bmatrix}\begin{bmatrix} V_1^T \\ V_2^T \end{bmatrix} x = \begin{bmatrix} U_1 & U_2 \end{bmatrix}\begin{bmatrix} \tilde{\Sigma} & 0 \\ 0 & 0 \end{bmatrix}\begin{bmatrix} d_1 \\ d_2 \end{bmatrix}$

where

(31) $\begin{bmatrix} d_1 \\ d_2 \end{bmatrix} = \begin{bmatrix} V_1^T \\ V_2^T \end{bmatrix} x, \qquad d_1 \in \mathbb{R}^{r},\ d_2 \in \mathbb{R}^{n-r}.$

From the above we have

(32) $Ax = \begin{bmatrix} U_1 & U_2 \end{bmatrix}\begin{bmatrix} \tilde{\Sigma} d_1 \\ 0 \end{bmatrix} = U_1(\tilde{\Sigma} d_1)$

We see that as x moves throughout Rn, the quantity Σ~d1 moves throughout Rr. Thus, the quantity y=Ax in this context consists of all linear combinations of the columns of U1. Thus, an orthonormal basis for R(A) is U1.

3.4.4 R(AT)=R(V1)

Recall that R(AT) is the set of all linear combinations of rows of A. Our property can be seen using a transposed version of the argument in Section 3.4.3 above. Thus, V1 is an orthonormal basis for the rows of A.

3.4.5 R(A)⊥=R(U2)

From Sect. 3.4.3, we see that R(A)=R(U1). Since from (25) $U_1 \perp U_2$, $U_2$ is a basis for the orthogonal complement of R(A). Hence the result.

3.4.6 ‖A‖2=σ1=σmax

This is easy to see from the definition of the 2-norm and the ellipsoid example of section 3.6.

3.4.7 Inverse of A

If the svd of a square matrix A is given, it is easy to find the inverse. Of course, we must assume A is full rank, (which means σi>0) for the inverse to exist. The inverse of A is given from the svd, using the familiar rules, as

(33)A1=VΣ1UT.

The evaluation of $\Sigma^{-1}$ is easy because Σ is square and diagonal. Note that this treatment indicates that the singular values of $A^{-1}$ are $[\sigma_n^{-1}, \sigma_{n-1}^{-1}, \ldots, \sigma_1^{-1}]$, in descending order.

3.4.8 The SVD diagonalizes any system of equations

Consider the system of equations Ax=b, for an arbitrary matrix A. Using the SVD of A, we have

(34)UΣVTx=b.

Let us now represent b in the basis U, and x in the basis V, in the same way as in Sect. 3.6. We therefore have

(35) $c = \begin{bmatrix} c_1 \\ c_2 \end{bmatrix} = \begin{bmatrix} U_1^T \\ U_2^T \end{bmatrix} b, \qquad c_1 \in \mathbb{R}^{r},\ c_2 \in \mathbb{R}^{m-r},$

and

(36) $d = \begin{bmatrix} d_1 \\ d_2 \end{bmatrix} = \begin{bmatrix} V_1^T \\ V_2^T \end{bmatrix} x, \qquad d_1 \in \mathbb{R}^{r},\ d_2 \in \mathbb{R}^{n-r}.$

Substituting the above into (34), the system of equations becomes

(37)Σd=c.

This shows that as long as we choose the correct bases, any system of equations can become diagonal. This property represents the power of the SVD; it allows us to transform arbitrary algebraic structures into their simplest forms.

If m>n or if rank $r < \min(m,n)$, then the system of equations Ax=b can only be satisfied if $b \in R(U_1)$. To see this, note that Σ above has an $(m-r)\times n$ block of zeros below the diagonal block of nonzero singular values. Thus, the lower $m-r$ elements of the left-hand side of (37) are all zero. Then if the equality of (37) is to be satisfied, $c_2$ must also be zero. This means that $U_2^T b = 0$, or that $b \in R(U_1)$.

Further, if n>m, or if $r < \min(m,n)$, then, if $x_o$ is a solution to Ax=b, $x_o + V_2 z$ is also a solution, where $z \in \mathbb{R}^{n-r}$. This follows because, as we have seen, $V_2$ is a basis for N(A); thus $A V_2 z = 0$, and $A(x_o + V_2 z) = A x_o = b$.
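
These observations can be checked numerically. The sketch below transforms an illustrative consistent system into the diagonal form (37), verifies that $c_2 = 0$, and solves for x in the basis V.

```python
import numpy as np

# Sketch of Sect. 3.4.8: change bases with U and V so that A x = b becomes Sigma d = c.
A = np.array([[1.0, 0.0],
              [2.0, 0.0],
              [0.0, 3.0]])                        # m = 3, n = 2, rank r = 2
U, s, Vt = np.linalg.svd(A, full_matrices=True)
r = int(np.sum(s > 1e-12))

b = A @ np.array([1.0, -2.0])                     # a consistent right-hand side
c = U.T @ b
print(np.allclose(c[r:], 0.0))                    # True: b lies in R(U1)
d1 = c[:r] / s[:r]                                # solve the diagonal system Sigma~ d1 = c1
x = Vt.T[:, :r] @ d1                              # x = V1 d1 (plus any V2 z if n > r)
print(np.allclose(A @ x, b))                      # True
```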

3.4.9 The “rotation” interpretation of the SVD

From the SVD relation A=UΣVT, we have

(38)AV=UΣ.

Note that since Σ is diagonal, the matrix UΣ on the right has orthogonal columns, whose 2–norm’s are equal to the corresponding singular value. We can therefore interpret the matrix V as an orthonormal matrix which rotates the rows of A so that the result is a matrix with orthogonal columns. Likewise, we have

(39)UTA=ΣVT.

The matrix ΣVT on the right has orthogonal rows with 2–norm equal to the corresponding singular value. Thus, the orthonormal matrix UT operates (rotates) the columns of A to produce a matrix with orthogonal rows.

In the case where m>n, (A is tall), then the matrix Σ is also tall, with zeros in the bottom mn rows. Then, only the first n columns of U are relevant in (38), and only the first n rows of UT are relevant in (39). When m<n, a corresponding transposed statement replacing U with V can be made.

3.5 Relationship between SVD and ED

It is clear that the eigendecomposition and the singular value decomposition share many properties in common. The price we pay for being able to perform a diagonal decomposition on an arbitrary matrix is that we need two orthonormal matrices instead of just one, as is the case for square symmetric matrices. In this section, we explore further relationships between the ED and the SVD.

Using (25), we can write

(40) $A^T A = V\begin{bmatrix} \tilde{\Sigma} & 0 \\ 0 & 0 \end{bmatrix} U^T U \begin{bmatrix} \tilde{\Sigma} & 0 \\ 0 & 0 \end{bmatrix} V^T = V\begin{bmatrix} \tilde{\Sigma}^2 & 0 \\ 0 & 0 \end{bmatrix} V^T.$

Thus it is apparent that the eigenvectors V of the matrix $A^TA$ are the right singular vectors of A, and that the squares of the singular values of A are the corresponding nonzero eigenvalues. Note that if A is short (m<n) and full rank, the matrix $A^TA$ will contain $n-m$ additional zero eigenvalues that are not included as singular values of A. This follows because the rank of the matrix $A^TA$ is m when A is full rank, yet the size of $A^TA$ is n×n.

As discussed in Golub and Van Loan, the SVD is numerically more stable to compute than the ED. However, in the case where $n \gg m$, the matrix V of the SVD of A becomes large, which means the SVD of A becomes more costly to compute relative to the eigendecomposition of the smaller matrix $AA^T$.

Further, we can also say, using the form AAT, that

(41) $AA^T = U\begin{bmatrix} \tilde{\Sigma} & 0 \\ 0 & 0 \end{bmatrix} V^T V \begin{bmatrix} \tilde{\Sigma} & 0 \\ 0 & 0 \end{bmatrix} U^T = U\begin{bmatrix} \tilde{\Sigma}^2 & 0 \\ 0 & 0 \end{bmatrix} U^T$

which indicates that the eigenvectors of $AA^T$ are the left singular vectors U of A, and the squares of the singular values of A are the nonzero eigenvalues of $AA^T$. Notice that in this case, if A is tall and full rank, the matrix $AA^T$ will contain $m-n$ additional zero eigenvalues that are not included as singular values of A.

We now compare the fundamental defining relationships for the ED and the SVD: For the ED, if A is symmetric, we have:

$A = Q\Lambda Q^T \quad\Longleftrightarrow\quad AQ = Q\Lambda,$

where Q is the matrix of eigenvectors, and Λ is the diagonal matrix of eigenvalues. Writing this relation column-by-column, we have the familiar eigenvector/eigenvalue relationship:

(42)Aqi=λiqii=1,,n.

For the SVD, we have

$A = U\Sigma V^T \quad\Longleftrightarrow\quad AV = U\Sigma$

or

(43)Avi=σiuii=1,,p,

where p=min(m,n). Also, since $A^T = V\Sigma^T U^T$ implies $A^T U = V\Sigma^T$, we have

(44)ATui=σivii=1,,p.

Thus, by comparing (42), (43), and (44), we see that the singular vectors and singular values obey a relation which is similar to that which defines the eigenvectors and eigenvalues. However, we note that in the SVD case, the fundamental relationship expresses the left singular vectors in terms of the right singular vectors, and vice-versa, whereas the eigenvectors are expressed in terms of themselves.

Exercise: compare the ED and the SVD on a square symmetric matrix, when i) A is positive definite, and ii) when A has some positive and some negative eigenvalues.

3.6 Ellipsoidal Interpretation of the SVD

The singular values of $A \in \mathbb{R}^{m\times n}$ are the lengths of the semi-axes of the hyperellipsoid E given by:

$E = \{\, y \mid y = Ax,\ \|x\|_2 = 1 \,\}.$

That is, E is the set of points mapped out as x takes on all possible values such that $\|x\|_2 = 1$, as shown in Fig. 1. To appreciate this point, let us look at the set of y corresponding to $\{x \mid \|x\|_2 = 1\}$. We take

(45)y=Ax=UΣVTx.

Let us change bases for both x and y. Define

(46)c=UTyd=VTx.

Then (45) becomes

(47)c=Σd.

We note that d2=1 if x2=1. Thus, our problem is transformed into observing the set {c} corresponding to the set {dd2=1}. The set {c} can be determined by evaluating 2-norms on each side of (47):

(48) $\sum_{i=1}^{p}\left(\frac{c_i}{\sigma_i}\right)^2 = \sum_{i=1}^{p} d_i^2 = 1.$

We see that the set {c} defined by (48) is indeed the canonical form of an ellipse in the basis U. Thus, the principal axes of the ellipse are aligned along the columns ui of U, with lengths equal to the corresponding singular value σi. This interpretation of the SVD is useful later in our study of condition numbers.
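
This ellipsoidal picture is easy to reproduce numerically: push a dense sampling of the unit circle through an illustrative matrix A and compare the longest and shortest image vectors with the singular values.

```python
import numpy as np

# Sketch of the ellipsoid interpretation: the image of the unit circle under A
# is an ellipse whose semi-axis lengths are the singular values of A.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
t = np.linspace(0, 2 * np.pi, 2000)
X = np.vstack([np.cos(t), np.sin(t)])      # points with ||x||_2 = 1
Y = A @ X                                  # the ellipse E = {Ax : ||x||_2 = 1}
lengths = np.linalg.norm(Y, axis=0)
print(lengths.max(), lengths.min())        # approx sigma_1 and sigma_2
print(np.linalg.svd(A, compute_uv=False))  # the singular values themselves
```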

3.7 An Interesting Theorem

First, we realize that the SVD of A provides a “sum of outer-products” representation:

(49)A=UΣVT=i=1pσiuiviT,p=min(m,n).

Given ARm×n with rank r, then what is the matrix BRm×n with rank k<r closest to A in 2-norm? What is this 2-norm distance? This question is answered in the following theorem:

Theorem 3 Define

(50)Ak=i=1kσiuiviT,kr,

then

$\min_{\mathrm{rank}(B)=k} \|A - B\|_2 = \|A - A_k\|_2 = \sigma_{k+1}.$

In words, this says that the closest rank-k (k<r) matrix B to A in the 2–norm sense is given by $A_k$. $A_k$ is formed from A by excluding contributions in (49) associated with the smallest singular values.

Proof: Since UTAkV=diag(σ1,,σk,0,,0) it follows that rank(Ak)=k, and that

(51) $\|A - A_k\|_2 = \|U^T(A - A_k)V\|_2 = \|\mathrm{diag}(0,\ldots,0,\sigma_{k+1},\ldots,\sigma_r,0,\ldots,0)\|_2 = \sigma_{k+1},$

where the first line follows from the fact that the 2-norm of a matrix is invariant to pre– and post–multiplication by an orthonormal matrix (properties of matrix p-norms, Lecture 2). Further, it may be shown that, for any matrix $B \in \mathbb{R}^{m\times n}$ of rank k<r,[25]

(52)AB2σk+1

Comparing (51) and (52), we see the closest rank k matrix to A is Ak given by (50).
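
The theorem is straightforward to verify numerically. In the sketch below, the test matrix and the choice k = 2 are illustrative; the truncated sum (50) is formed from the SVD and its 2-norm distance to A is compared with $\sigma_{k+1}$.

```python
import numpy as np

# Sketch of Theorem 3: the truncated SVD A_k of (50) achieves ||A - A_k||_2 = sigma_{k+1}.
rng = np.random.default_rng(2)
A = rng.standard_normal((6, 4))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]          # sum of the k largest outer products
print(np.linalg.matrix_rank(Ak))                    # -> k
print(np.linalg.norm(A - Ak, 2), s[k])              # equal: ||A - A_k||_2 = sigma_{k+1}
```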

This result is very useful when we wish to approximate a matrix by another of lower rank. For example, let us look at the Karhunen-Loeve expansion as discussed in Lecture 1. For a sample xn of a random process xRm, we express x as

(53)xi=Vθi

where the columns of V are the eigenvectors of the covariance matrix R. We saw in Lecture 2 that we may represent xi with relatively few coefficients by setting the elements of θ associated with the smallest eigenvalues of R to zero. The idea was that the resulting distortion in x would have minimum energy.

This fact may now be seen in a different light with the aid of this theorem. Suppose we retain the j=r elements of a given θ associated with the largest r eigenvalues. Let θ~[θ1,θ2,,θr,0,,0]T and x~=Vθ~. Then

(54) $\tilde{R} = E(\tilde{x}\tilde{x}^T) = E(V\tilde{\theta}\tilde{\theta}^T V^T) = V\,\mathrm{diag}\!\left(E|\theta_1|^2,\ldots,E|\theta_r|^2,0,\ldots,0\right)V^T = V\tilde{\Lambda}V^T,$

where $\tilde{\Lambda} = \mathrm{diag}(\lambda_1,\ldots,\lambda_r,0,\ldots,0)$. Since $\tilde{R}$ is positive semi-definite, square and symmetric, its eigendecomposition and singular value decomposition are identical; hence $\lambda_i = \sigma_i,\ i=1,\ldots,r$. Thus, from this theorem and (54), we know that the covariance matrix $\tilde{R}$ formed from truncating the K-L coefficients is the closest rank–r matrix to the true covariance matrix R in the 2–norm sense.

4 Orthogonal Projections

4.1 Sufficient Conditions for a Projector

Suppose we have a subspace S=R(X), where $X = [x_1,\ldots,x_n] \in \mathbb{R}^{m\times n}$ is full rank, m>n, and an arbitrary vector $y \in \mathbb{R}^m$. How do we find a matrix $P \in \mathbb{R}^{m\times m}$ so that the product $Py \in S$?

The matrix P is referred to as a projector. That is, we can project an arbitrary vector y onto the subspace S, by premultiplying y by P. Note that this projection has non-trivial meaning only when m>n. Otherwise, yS already for arbitrary y.

A matrix P is a projection matrix onto S if:

  1. R(P)=S
  2. P2=P
  3. PT=P

A matrix satisfying condition (2) is called an idempotent matrix. This is the fundamental property of a projector.

We now show that these three conditions are sufficient for P to be a projector. An arbitrary vector y can be expressed as

(55)y=ys+yc

where ysS and ycS (the orthogonal complement subspace of S). We see that ys is the desired projection of y onto S. Thus, in mathematical terms, our objective is to show that

(56)Py=ys.

Because of condition 2, P2=P, hence

(57)Ppi=pii=1,,m

where pi is a column of P. Because ysS, and also (p1pm)S (condition 1), then ys can be expressed as a linear combination of the pi’s:

(58)ys=i=1mcipi,ciR.

Combining (57) and (58), we have

(59)Pys=i=1mciPpi=i=1mcipi=ys.

If R(P)=S (condition 1), then Pyc=0. Hence,

(60)Py=P(ys+yc)=Pys=ys.

i.e., P projects y onto S, if P obeys conditions 1 and 2. Furthermore, by repeating the above proof, and using condition 3, we have

yTPS

i.e., P projects both column- and row–vectors onto S, by pre- and post-multiplying, respectively. Because this property is a direct consequence of the three conditions above, then these conditions are sufficient for P to be a projector.

4.2 A Definition for P

Let $X = [x_1,\ldots,x_n]$, $x_i \in \mathbb{R}^m$, n<m, be full rank. Then the matrix P where

(61) $P = X(X^T X)^{-1} X^T$

is a projector onto S=R(X). Other definitions of P equivalent to (61) will follow later after we discuss pseudo inverses.

Note that when X has orthonormal columns, the projector becomes $XX^T \in \mathbb{R}^{m\times m}$, which according to our previous discussion on orthonormal matrices in Chapter 2, is not the m×m identity.
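
A small numerical sketch of (61) is given below for an illustrative random X; it checks the idempotency and symmetry conditions, and that the residual (I − P)y is orthogonal to R(X).

```python
import numpy as np

# Sketch of (61): form P = X (X^T X)^{-1} X^T for a full-rank X, m > n.
rng = np.random.default_rng(3)
X = rng.standard_normal((5, 2))                     # m = 5 > n = 2, full rank w.p. 1
P = X @ np.linalg.inv(X.T @ X) @ X.T

print(np.allclose(P @ P, P), np.allclose(P, P.T))   # idempotent and symmetric
y = rng.standard_normal(5)
ys = P @ y                                          # projection onto S = R(X)
yc = (np.eye(5) - P) @ y                            # component in the orthogonal complement
print(np.allclose(X.T @ yc, 0.0))                   # y_c is orthogonal to R(X)
```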

Exercises:

  • prove (61).
  • How is P in (61) formed if r=rank(X)<n?

Theorem 4 The projector onto S defined by (61) is unique.

Proof: Let Y be any other m×n full rank matrix such that R(Y)=S. Since X and Y are both in S, each column of Y must be a linear combination of the columns of X. Therefore, there exists a full-rank matrix CRn×n so that

(62)Y=XC.

The projector P1 formed from Y is therefore

(63) $P_1 = Y(Y^T Y)^{-1} Y^T = XC\left((XC)^T XC\right)^{-1}(XC)^T = XC\left(C^T X^T X C\right)^{-1}C^T X^T = XC\,C^{-1}(X^T X)^{-1}(C^T)^{-1}C^T X^T = X(X^T X)^{-1}X^T = P.$

Thus, the projector formed from (61) onto S is unique, regardless of the set of vectors used to form X, provided the corresponding matrix X is full rank and that R(X)=S.

In Section 4.1 we discussed sufficient conditions for a projector. This means that while these conditions are enough to specify a projector, there may be other conditions which also specify a projector. But since we have now proved the projector is unique, the conditions in Section 4.1 are also necessary.

4.3 The Orthogonal Complement Projector

Consider the vector y, and let ys be the projection of y onto our subspace S, and yc be the projection onto the orthogonal complement subspace S. Thus,

(64)y=ys+yc=Py+yc.

Therefore we have

(65) $y - Py = y_c \quad\Longleftrightarrow\quad (I - P)y = y_c.$

It follows that if P is a projector onto S, then the matrix $(I - P)$ is a projector onto $S^\perp$. It is easily verified that this matrix satisfies all the required properties of a projector.

4.4 Orthogonal Projections and the SVD

Suppose we have a matrix $A \in \mathbb{R}^{m\times n}$ of rank r. Then, using the SVD partitions of (25)–(27), we have these useful relations:

  1. $V_1 V_1^T$ is the orthogonal projector onto $[N(A)]^\perp = R(A^T)$.
  2. $V_2 V_2^T$ is the orthogonal projector onto $N(A)$.
  3. $U_1 U_1^T$ is the orthogonal projector onto $R(A)$.
  4. $U_2 U_2^T$ is the orthogonal projector onto $[R(A)]^\perp = N(A^T)$.

To justify these results, we show each projector listed above satisfies the three conditions for a projector:

  1. First, we must show that each projector above is in the range of the corresponding subspace (condition 1). In Sects. 3.4.2 and 3.4.3, we have already verified that V2 is a basis for N(A), and that U1 is a basis for R(A), as required. It is easy to verify that the remaining two projectors above (no.’s 1 and 4 respectively) also have the appropriate ranges.
  2. From the orthonormality property of each of the matrix partitions above, it is easy to see condition 2 (idempotency) holds in each case.
  3. Finally, each matrix above is symmetric (condition 3). Therefore, each matrix above is a projector onto the corresponding subspace.
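
These four projectors can be verified numerically from the SVD partition; the rank-deficient matrix below is an illustrative choice.

```python
import numpy as np

# Sketch of Sect. 4.4: build the projectors from U = [U1 U2], V = [V1 V2].
A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],
              [1.0, 0.0, 1.0]])                     # rank r = 2
U, s, Vt = np.linalg.svd(A)
r = int(np.sum(s > 1e-12))
U1, U2 = U[:, :r], U[:, r:]
V1, V2 = Vt[:r, :].T, Vt[r:, :].T

print(np.allclose(A @ (V2 @ V2.T), 0.0))            # V2 V2^T projects onto N(A)
print(np.allclose((U1 @ U1.T) @ A, A))              # U1 U1^T leaves R(A) untouched
print(np.allclose(U2.T @ A, 0.0))                   # R(A)-perp is orthogonal to the columns of A
```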

5 The Quadratic Form

We introduce the quadratic form by considering the idea of positive definiteness. A square matrix $A \in \mathbb{R}^{n\times n}$ is positive definite if and only if, for any $x \neq 0 \in \mathbb{R}^n$,

(1)xTAx>0.

The matrix A is positive semi-definite if and only if, for any x0 we have

(2)xTAx0,

which includes the possibility that A is rank deficient. The quantity on the left in (1) is referred to as a quadratic form, and is a matrix equivalent to the scalar quantity ax2.

It is only the symmetric part of A which is relevant in a quadratic form. This may be seen as follows. The symmetric part T of A is defined as $T = \frac{1}{2}(A + A^T)$, whereas the asymmetric part S of A is defined as $S = \frac{1}{2}(A - A^T)$. Then A = T + S. It may be verified by direct multiplication that the quadratic form can also be expressed in the form

(3)xTAx=i=1nj=1naijxixj.

Because $S^T = -S$, the (i,j)th term corresponding to the asymmetric part of (3) exactly cancels that corresponding to the (j,i)th term. Further, the terms corresponding to i=j are zero for the asymmetric part. Thus the part of the quadratic form corresponding to the asymmetric part S is zero. Therefore, when considering quadratic forms, it suffices to consider only the symmetric part T of a matrix. Quadratic forms on positive definite matrices are used very frequently in least-squares and adaptive filtering applications.

Theorem 1 A matrix A is positive definite if and only if all eigenvalues of the symmetric part of A are positive.

Proof: Since only the symmetric part of A is relevant, the quadratic form on A may be expressed as xTAx=xTVΛVTx where an eigendecomposition has been performed on the symmetric part of A. Let us define z=VTx. Thus we have

(4) $x^T A x = z^T \Lambda z = \sum_{i=1}^{n} \lambda_i z_i^2.$

Thus (4) is greater than zero for arbitrary x if and only if λi>0, i=1,,n.

From (4), it is easy to verify that the equation $k = \sum_{i=1}^{n}\lambda_i z_i^2$, where k is a constant, defines a multi-dimensional ellipse where $\sqrt{k/\lambda_i}$ is the length of the ith principal axis. Since $z = V^T x$ where V is orthonormal, z is a rotation transformation of x, and the equation $k = x^T A x$ is a rotated version of (4). Thus $k = x^T A x$ is also an ellipse with principal axes given by $\sqrt{k/\lambda_i}$. In this case, the ith principal axis of the ellipse lines up along the ith eigenvector $v_i$ of A.

Positive definiteness of A in the quadratic form xTAx is the matrix analog to the scalar a being positive in the scalar expression ax2. The scalar equation y=ax2 is a parabola which faces upwards if a is positive. Likewise, the equation y=xTAx is a multi-dimensional parabola which faces upwards in all directions if A is positive definite.
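
Theorem 1 suggests a simple numerical test for positive definiteness, sketched below; the example matrices are illustrative.

```python
import numpy as np

# Sketch of Theorem 1: A is positive definite iff all eigenvalues of its
# symmetric part T = (A + A^T)/2 are positive; the asymmetric part is irrelevant.
def is_positive_definite(A):
    T = 0.5 * (A + A.T)                     # only the symmetric part matters
    return bool(np.all(np.linalg.eigvalsh(T) > 0))

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
print(is_positive_definite(A))              # True (eigenvalues 3 and 1)
print(is_positive_definite(A + np.array([[0.0, 5.0], [-5.0, 0.0]])))  # unchanged: True
```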

Example: We now discuss an example to illustrate the above discussion. A three-dimensional plot of y=xTAx is shown plotted in Fig. 1 for A given by

(5) $A = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}.$

The corresponding contour plot is plotted in Fig. 2. Note that this curve is elliptical in cross-section in a plane y=k, as discussed above. It may be readily verified that the eigenvalues of A are 3 and 1, with corresponding eigenvectors $[1, 1]^T$ and $[1, -1]^T$. For y=k=1, the lengths of the principal axes of the ellipse are then $\sqrt{1/3}$ and 1. It is seen from the figure that these principal axes are indeed the lengths indicated, and are lined up along the directions of the eigenvectors as required.

We write the ellipse in the form

(6) $y = x^T A x = z^T \Lambda z = \sum_{i=1}^{n} \lambda_i z_i^2$

where $z = V^T x$ as before. It is seen from Fig. 1 and (6) that, since A is positive definite, the curve defined by $y = \lambda_i z_i^2$, with all $z_k,\ k \neq i$, held constant, is an upward-facing parabola for all i=1,…,n. (To observe the behaviour of y vs. $z_i$ in this case, we use the vertical axis y and the appropriate eigenvector direction, instead of the usual x-axis direction.)

Theorem 2 A symmetric matrix A can be decomposed into the form A=BBT if and only if A is positive definite or positive semi-definite.

Proof: (Necessary condition) Let us define z as BTx. Then

(7) $x^T A x = x^T B B^T x = z^T z \ge 0.$

Conversely (sufficient condition), without loss of generality we take an eigendecomposition of the symmetric part of A as $A = V\Lambda V^T$. Since A is positive definite by hypothesis, we can write $A = (V\Lambda^{1/2})(V\Lambda^{1/2})^T$. Let us define $B = V\Lambda^{1/2}Q^T$, where Q is an arbitrary orthonormal matrix of appropriate size. Then $BB^T = V\Lambda^{1/2}Q^T Q(\Lambda^{1/2})^T V^T = V\Lambda V^T = A$.

Note that A in this case can only be positive semi-definite if A has a non-empty null space. Otherwise, it is strictly positive definite.

The fact that A can be decomposed into two symmetric factors in this way is the fundamental idea behind the Cholesky factorization, which is a major topic of the following chapter.
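
A brief numerical sketch of Theorem 2: build one factor $B = V\Lambda^{1/2}$ from the eigendecomposition, and compare with the (lower-triangular) Cholesky factor, which is another valid choice of B. The example matrix is illustrative.

```python
import numpy as np

# Two factorizations A = B B^T of a positive definite matrix.
A = np.array([[4.0, 2.0],
              [2.0, 3.0]])
lam, V = np.linalg.eigh(A)
B = V @ np.diag(np.sqrt(lam))               # B = V Lambda^{1/2}
L = np.linalg.cholesky(A)                   # lower triangular factor, A = L L^T
print(np.allclose(B @ B.T, A))              # True
print(np.allclose(L @ L.T, A))              # True
```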

5.1 The Gaussian Multi-Variate Probability Density Function

Here, we very briefly introduce this topic so we can use this material for an example of the application of the Cholesky decomposition later in this course, and also in least-squares analysis to follow shortly. This topic is a good application of quadratic forms. More detail is provided in several books.[26]

First we consider the uni-variate case of the Gaussian probability distribution function (pdf). The pdf p(x) of a Gaussian-distributed random variable x with mean μ and variance σ2 is given as

(8) $p(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left[-\frac{1}{2\sigma^2}(x-\mu)^2\right].$

This is the familiar bell-shaped curve. It is completely specified by two parameters- the mean μ which determines the position of the peak, and the variance σ2 which determines the width or spread of the curve.

We now consider the more interesting multi-dimensional case. Consider a Gaussian-distributed vector xRn with mean μ and covariance Σ. The multivariate pdf describing the variation of x is

(9) $p(x) = (2\pi)^{-\frac{n}{2}}\,|\Sigma|^{-\frac{1}{2}}\exp\left[-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right].$

We can see that the multi-variate case collapses to the uni-variate case when the number of variables becomes one. A plot of p(x) vs. x is shown in Fig. 3, for μ=0 and Σ defined as

(10) $\Sigma = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}.$

Because the exponent in (9) is a quadratic form, the set of points satisfying the equation $\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu) = k$, where k is a constant, is an ellipse. Therefore this ellipse defines a contour of equal probability density. The interior of this ellipse defines a region into which an observation will fall with a specified probability α which is dependent on k. This probability level α is given as

(11) $\alpha = \int_{R}(2\pi)^{-\frac{n}{2}}\,|\Sigma|^{-\frac{1}{2}}\exp\left[-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right]dx,$

where R is the interior of the ellipse. Stated another way, an ellipse is the region in which any observation governed by the probability distribution (9) will fall with a specified probability level α. As k increases, the ellipse gets larger, and α increases. These ellipses are referred to as joint confidence regions at probability level α.

The covariance matrix Σ controls the shape of the ellipse. Because the quadratic form in this case involves $\Sigma^{-1}$, the length of the ith principal axis is $2\sqrt{k\lambda_i}$ instead of $2\sqrt{k/\lambda_i}$ as it would be if the quadratic form were in Σ. Therefore as the eigenvalues of Σ increase, the size of the joint confidence region increases (i.e., the spread of the distribution increases) for a given value of k.

Now suppose we let Σ become poorly conditioned in such a way that the variances (main diagonal elements of Σ) remain constant. Then the ratio of the largest to smallest principal axes become large, and the ellipse becomes elongated. In this case, the pdf takes on more of the shape shown in Fig. 4, which shows a multi-variate Gaussian pdf for μ=0 for a relatively poorly conditioned Σ given as

(12) $\Sigma = \begin{bmatrix} 2 & 1.9 \\ 1.9 & 2 \end{bmatrix}.$

Here, because the ellipse describing the joint confidence region is elongated, we see that if one of the variables is known, the distribution of the other variable becomes more concentrated around the value of the first; i.e., knowledge of one variable tells us relatively more about the other. This implies the variables are more highly correlated with one another. But we have seen previously that if the variables in a vector random process are highly correlated, then the off-diagonal elements of the covariance matrix become larger, which leads to their eigenvalues becoming more disparate; i.e., the condition number of the covariance matrix becomes worse. It is precisely this poorer condition number that causes the ellipse in Fig. 4 to become elongated.
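
The connection between the eigenvalues of Σ and the shape of the joint confidence region can be sketched as follows; the level constant c and the two example covariance matrices (taken from (10) and (12)) are used only for illustration, and the level set is written here as $(x-\mu)^T\Sigma^{-1}(x-\mu) = c$.

```python
import numpy as np

# Semi-axis lengths of the equal-probability-density ellipse are proportional to
# sqrt(lambda_i), so a poorly conditioned Sigma gives an elongated confidence region.
for Sigma in (np.array([[2.0, 1.0], [1.0, 2.0]]),
              np.array([[2.0, 1.9], [1.9, 2.0]])):
    lam = np.linalg.eigvalsh(Sigma)[::-1]            # descending eigenvalues
    c = 2.0                                          # level: (x-mu)^T Sigma^{-1} (x-mu) = c
    axes = np.sqrt(c * lam)                          # semi-axis lengths of the ellipse
    print(axes, "condition number:", lam[0] / lam[1])
```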

With this discussion, we have now gone full circle: a highly correlated system has large off-diagonal elements in its covariance matrix. This leads to a poorly conditioned covariance matrix. But a Gaussian-distributed process with a poorly-conditioned covariance matrix has a joint confidence region that is elongated. In turn, an elongated joint confidence region means the system is highly correlated, which takes us back to the beginning.

Understanding these relationships is a key element of the rigor underlying signal processing.

5.2 The Rayleigh Quotient

The Rayleigh quotient is a simple mathematical structure that has a great deal of interesting uses. The Rayleigh quotient r(x) is defined as

(13) $r(x) = \frac{x^T A x}{x^T x}.$

It is easily verified that if x is the ith eigenvector $v_i$ of A (not necessarily normalized to unit norm), then $r(x) = \lambda_i$:

(14) $\frac{v_i^T A v_i}{v_i^T v_i} = \frac{\lambda_i\, v_i^T v_i}{v_i^T v_i} = \lambda_i.$

In fact, it is easily shown by differentiating r(x) with respect to x, that x=vi is a stationary point of r(x).

Further along this line of reasoning, let us define a subspace Sk as Sk=span{v1,,vk}, k=1,,n, where vi is the ith eigenvector of ARn×n, where A is symmetric. Then, a variation of the Courant Fischer minimax theorem[27] says that

(15) $\lambda_k = \min_{x \in S_k,\ x \neq 0} \frac{x^T A x}{x^T x}.$

Question: It is easily shown by differentiation that, for a given x, the choice $\mu = r(x)$ from (13) minimizes $\|(A - \mu I)x\|_2$. The perturbation theory of Golub and Van Loan says that if x in (13) is a good approximation to an eigenvector, then r(x) is a good approximation to the corresponding eigenvalue, and vice versa. Starting with an initial estimate $x_0$ with unit 2-norm, suggest an iteration using (13) which gives an improved estimate of the eigenvector. How can the eigenvalue be found?

This technique is referred to as the Rayleigh quotient iteration for computing an eigenvalue and eigenvector. In fact, this iteration is remarkably effective; it can be shown to have cubic convergence.
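
One possible realization of this iteration is sketched below, under the assumption that A is symmetric; the starting vector and iteration count are arbitrary illustrative choices.

```python
import numpy as np

# A sketch of the Rayleigh quotient iteration: alternate between the eigenvalue
# estimate mu = r(x) from (13) and an inverse-iteration step with shift mu.
def rayleigh_quotient_iteration(A, x0, iters=10):
    x = x0 / np.linalg.norm(x0)
    for _ in range(iters):
        mu = x @ A @ x                          # Rayleigh quotient r(x) for unit-norm x
        if np.linalg.norm(A @ x - mu * x) < 1e-12:
            break                               # already an eigenpair to working precision
        try:
            y = np.linalg.solve(A - mu * np.eye(len(x)), x)
        except np.linalg.LinAlgError:
            break                               # shift hit an eigenvalue exactly
        x = y / np.linalg.norm(y)
    return x @ A @ x, x                         # eigenvalue and eigenvector estimates

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
lam, v = rayleigh_quotient_iteration(A, np.array([1.0, 0.3]))
print(lam, v)                                   # converges (cubically) to the eigenpair near 3
```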


  1. Eq. (4) is called a linear combination of the vectors aj. Each vector is multiplied by a weight (or coefficient) cj, and the result summed. ↩︎

  2. A vector ei is referred to as an elementary vector, and has zeros everywhere except for a 1 in the ith position. ↩︎

  3. Column rank deficient is when the rank of the matrix is less than the number of columns. ↩︎

  4. The characteristic polynomial of a matrix is defined in Chapter 2. ↩︎

  5. An orthonormal matrix is defined in Chapter 2. ↩︎

  6. A symmetric matrix is one where A=AT, where the superscript T means transpose, i.e, for a symmetric matrix, an element aij=aji. A Hermitian symmetric (or just Hermitian) matrix is relevant only for the complex case, and is one where A=AH, where superscript H denotes the Hermitian transpose. This means the matrix is transposed and complex conjugated. Thus for a Hermitian matrix, an element aij=aji. In this course we will generally consider only real matrices. However, when complex matrices are considered, Hermitian symmetric is implied instead of symmetric. ↩︎

  7. Here, we have used the property that for matrices or vectors A and B of conformable size, (AB)T=BTAT. ↩︎

  8. From Lastman and Sinha, Microcomputer–based Numerical Methods for Science and Engineering. ↩︎

  9. The trace denoted tr() of a square matrix is the sum of its elements on the main diagonal (also called the “diagonal” elements). ↩︎

  10. This only holds if A and B are square invertible. ↩︎

  11. This proof is left as an exercise. ↩︎

  12. A. Papoulis, Probability, Random Variables, and Stochastic Processes, McGraw Hill, 3rd Ed. ↩︎

  13. Processes with this property are referred to as ergodic processes. ↩︎

  14. Haykin, “Adaptive Filter Theory”, Prentice Hall, 3rd. ed. ↩︎

  15. An expansion of x usually requires the basis vectors to be only linearly independent–not necessarily orthonormal. But orthonormal basis vectors are most commonly used because they can be inverted using the very simple form of (49). ↩︎

  16. For a proof, refer to Cover and Thomas, Elements of Information Theory ↩︎

  17. This is not necessarily a valid assumption. We discuss this point further, later in the section. ↩︎

  18. K.R. Rao and P. Yip, “Discrete Cosine Transform– Algorithms, Advantages, Applications”. ↩︎

  19. It may be shown that if dλ/2, then there is a one–to–one relationship between the electrical angle ϕ and the corresponding physical angle θ. In fact, ϕ=2πdλsinθ. We can only observe the electrical angle ϕ, not the desired physical angle θ. Thus, we deduce the desired physical angle from the observed electrical angle from this mathematical relationship. ↩︎

  20. Note that the eigenvalue zero has multiplicity MK. Therefore, the eigenvectors vK+1,,vM are not unique. However, a set of orthonormal eigenvectors which are orthogonal to the remaining eigenvectors exist. Thus we can treat the zero eigenvectors as if they were distinct. ↩︎

  21. Let us define the so–called signal subspace SS as SS=span[v1,,vK] (64) and the noise subspace SN as SN=span[vK+1,,vM]. (65) We now digress briefly to discuss these two subspaces further. From our discussion above, all columns of Ro are linear combinations of the columns of S. Therefore span[Ro]=span[S]. (66) But it is also easy to verify that span[Ro]SS (67) Comparing (66) and (67), we see that SSS. From (60) we see that any received signal vector x, in the absence of noise, is a linear combination of the columns of S. Thus, any noise–free signal resides completely in SS. This is the origin of the term “signal subspace”. Further, any component of the received signal residing in SN must be entirely due to the noise. This is the origin of the term “noise subspace”. We note that the signal and noise subspaces are orthogonal complement subspaces of each other. ↩︎

  22. This word is an acronym for MUltiple SIgnal Classification. ↩︎

  23. R.O. Schmidt, “Multiple emitter location and parameter estimation”, IEEE Trans. Antennas and Propag., vol AP-34, Mar. 1986, pp 276-280. ↩︎

  24. The concept of positive definiteness is discussed next lecture. It means all the eigenvalues are greater than or equal to zero. ↩︎

  25. Golub and van Loan pg. 73. ↩︎

  26. e.g., H. Van Trees, "Detection, Estimation and Modulation Theory", Part 1. L.L. Scharf, "Statistical Signal Processing: Detection, Estimation, and Time Series Analysis," pg. 55. ↩︎

  27. See Wilkinson, "The Algebraic Eigenvalue Problem", pp. 100-101. ↩︎
