This page contains study notes for the course Matrix Methods in Data Analysis, Signal Processing, and Machine Learning by Professor Gilbert Strang.
Column Space & Rank
- Column space
\(C(A) = span(\overrightarrow{a_1},\overrightarrow{a_2},\overrightarrow{a_3},\overrightarrow{a_4},\overrightarrow{a_5})\)
- Basis for \(C(A)\):
linearly independent vectors that span \(C(A)\)
\[A =
\begin{bmatrix}
1 & 0 & -1 & 0 & 4 \\
2 & 1 & 0 & 0 & 9 \\
-1 & 2 & 5 & 1& -5\\
1 & -1 & -3 & -2 & 9 \\
\end{bmatrix}
\]
Reduced row echelon form obtained by row operations: \[rref(A) = R =
\begin{bmatrix}
1 & 0 & -1 & 0 & 4 \\
0 & 1 & 2 & 0 & 1 \\
0 & 0 & 0 & 1& -3\\
0 & 0 & 0 & 0 & 0 \\
\end{bmatrix}
\]
The pivot columns \(\overrightarrow{r_1}\), \(\overrightarrow{r_2}\), and \(\overrightarrow{r_4}\) of matrix \(R\) are linearly independent
\(\Rightarrow\) \(\overrightarrow{a_1}\), \(\overrightarrow{a_2}\), and \(\overrightarrow{a_4}\) are also linearly independent
\(\Rightarrow\) \(\overrightarrow{a_1}\), \(\overrightarrow{a_2}\), and \(\overrightarrow{a_4}\) form a basis for \(C(A)\)
\(\Rightarrow\) \(dim(C(A)) = 3\) and \(rank(A) = 3\)
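As a quick numerical check of this example (my own minimal sketch, assuming NumPy and SymPy are available; the matrix values are copied from above):

```python
import numpy as np
import sympy as sp

# Matrix A from the example above
rows = [[ 1,  0, -1,  0,  4],
        [ 2,  1,  0,  0,  9],
        [-1,  2,  5,  1, -5],
        [ 1, -1, -3, -2,  9]]

R, pivots = sp.Matrix(rows).rref()             # reduced row echelon form, pivot columns
print(pivots)                                  # (0, 1, 3): columns a1, a2, a4 form a basis
print(np.linalg.matrix_rank(np.array(rows)))   # 3
```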
Understanding matrix multiplication
\[Ax =
\begin{bmatrix}
2 & 1 & 3\\
3 & 1 & 4\\
5 & 7 & 12\\
\end{bmatrix}
\begin{bmatrix}
x_1 \\
x_2 \\
x_3 \\
\end{bmatrix}
= x_1 \begin{bmatrix}
2 \\
3 \\
5 \\
\end{bmatrix}
+x_2 \begin{bmatrix}
1 \\
1 \\
7 \\
\end{bmatrix}
+ x_3 \begin{bmatrix}
3 \\
4 \\
12 \\
\end{bmatrix}
\]
\(A_{3\times 3} = C_{3\times 2} \ R_{2\times 3}\), where the columns of \(C\) form a basis for the column space of \(A\) and the rows of \(R\) form a basis for the row space of \(A\)
\[
\begin{bmatrix}
2 & 1 & 3\\
3 & 1 & 4\\
5 & 7 & 12\\
\end{bmatrix}
=
\begin{bmatrix}
2 & 1\\
3 & 1\\
5 & 7\\
\end{bmatrix}
\begin{bmatrix}
1 & 0 & 1\\
0 & 1 & 1\\
\end{bmatrix}
\]
column rank = row rank = \(r=2\)
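A small check of the column picture of \(Ax\) and of this \(A = CR\) factorization (my own sketch, assuming NumPy; \(x\) is an arbitrary vector I made up):

```python
import numpy as np

A = np.array([[2, 1,  3],
              [3, 1,  4],
              [5, 7, 12]])
x = np.array([2, -1, 3])                 # arbitrary x
# Ax is a combination of the columns of A
print(np.allclose(A @ x, x[0]*A[:, 0] + x[1]*A[:, 1] + x[2]*A[:, 2]))   # True

C = np.array([[2, 1],
              [3, 1],
              [5, 7]])          # independent columns of A
R = np.array([[1, 0, 1],
              [0, 1, 1]])       # expresses each column of A in terms of C
print(np.allclose(C @ R, A))             # True
print(np.linalg.matrix_rank(A))          # 2: column rank = row rank
```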
Matrix Factorizations
- \(A = LU\): from elimination we get the upper triangular matrix \(U\) (a numerical sketch of these factorizations follows this list) \[
\begin{bmatrix}
2 & 3\\
4 & 7\\
\end{bmatrix}
\rightarrow
\begin{bmatrix}
2 & 3\\
0 & 1\\
\end{bmatrix}
\] then \(A = LU\)
\[
\begin{bmatrix}
2 & 3\\
4 & 7\\
\end{bmatrix}
=
\begin{bmatrix}
1 & 0\\
2 & 1\\
\end{bmatrix}
\begin{bmatrix}
2 & 3\\
0 & 1\\
\end{bmatrix}
\]
- \(A = QR\) from orthogonalization
- \(S = Q\Lambda Q^{T}\) from eigenvectors of a symmetric matrix \(S\)
- \(A = X\Lambda X^{-1}\) diagonalizes \(A\) by the eigenvector matrix \(X\); \(A\) need not be symmetric
- \(A = U\Sigma V^{T}\) (orthogonal)(diagonal)(orthogonal) = Singular Value Decomposition
4 Fundamental Subspaces of \(A\) (\(m\) by \(n\), rank \(r\))
- Column space \(C(A)\): dim = r
- Row space \(C(A^T)\): dim = r (same as dim of column space)
- Null space of the matrix \(N(A)\): dim = n-r
- Null space of transpose of A \(N(A^T)\): dim = m-r
- \(C(A^T)\) is orthogonal to \(N(A)\) (if \(Ax = 0\), the 1st row of \(A\) is orthogonal to \(x\), the 2nd row of \(A\) is orthogonal to \(x\), …)
- \(C(A)\) is orthogonal to \(N(A^T)\)
The null space \(N(A)\) is the set of all solutions of \(Ax = 0\)
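The dimensions and the orthogonality can be checked numerically (my own sketch reusing the earlier 4 by 5 matrix; assumes NumPy and `scipy.linalg.null_space`):

```python
import numpy as np
from scipy.linalg import null_space

A = np.array([[ 1,  0, -1,  0,  4],
              [ 2,  1,  0,  0,  9],
              [-1,  2,  5,  1, -5],
              [ 1, -1, -3, -2,  9]], dtype=float)
m, n = A.shape
r = np.linalg.matrix_rank(A)

N  = null_space(A)       # basis for N(A),   n - r columns
NT = null_space(A.T)     # basis for N(A^T), m - r columns
print(r, N.shape[1], NT.shape[1])    # 3 2 1

# Row space is orthogonal to N(A); column space is orthogonal to N(A^T)
print(np.allclose(A @ N, 0), np.allclose(A.T @ NT, 0))   # True True
```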
Orthonormal & Orthogonal
Orthonormal Columns \(Q\):
- \(Q^TQ = I\)
- \(QQ^T = I\)? Sometimes yes, sometimes no. Yes when \(Q\) is a square matrix, and then \(Q\) is called an orthogonal matrix (see the numerical check after the examples)
- example 1: rotation matrix \[Q =
\begin{bmatrix}
\cos \theta & -\sin \theta\\
\sin \theta & \cos \theta\\
\end{bmatrix}
\]
- example 2: reflection matrix \[Q =
\begin{bmatrix}
\cos \theta & \sin \theta\\
\sin \theta & -\cos \theta\\
\end{bmatrix}
\]
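A quick check of \(Q^TQ = I\) and \(QQ^T = I\) for these two examples (my own sketch, assuming NumPy; \(\theta\) is an arbitrary angle):

```python
import numpy as np

theta = 0.7                       # arbitrary angle
Q_rot = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])    # rotation
Q_ref = np.array([[np.cos(theta),  np.sin(theta)],
                  [np.sin(theta), -np.cos(theta)]])    # reflection

for Q in (Q_rot, Q_ref):
    # orthonormal columns, and square -> also an orthogonal matrix
    print(np.allclose(Q.T @ Q, np.eye(2)), np.allclose(Q @ Q.T, np.eye(2)))
```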
Eigenvalues & Eigenvectors
\[Ax_i = \lambda_i x_i\]
where \(i = 1,...,n\), \(A\) is an \(n\) by \(n\) matrix, the \(\lambda_i\) are the eigenvalues of \(A\), and the \(x_i\) are the eigenvectors of \(A\)
Why do we want this?
- \(A^2x = A(Ax) = A(\lambda x) = \lambda^2 x\) \(\Rightarrow\) the eigenvectors \(x\) of \(A\) are also eigenvectors of \(A^2\), and the eigenvalues of \(A^2\) are \(\lambda^2\), where \(\lambda\) are the eigenvalues of \(A\). We can then extend this (checked numerically after this list) to
- \(A^kx = \lambda^kx\)
- \(A^{-1}x = \frac{1}{\lambda}x\), valid whenever \(A\) is invertible (if some \(\lambda=0\), then \(A\) is not invertible, so division by zero never occurs)
- \(e^{At}x = e^{\lambda t}x\)
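These properties can be verified numerically (a minimal sketch with a made-up symmetric matrix, assuming NumPy and SciPy):

```python
import numpy as np
from scipy.linalg import expm

A = np.array([[2., 1.],
              [1., 3.]])
lam, X = np.linalg.eig(A)
x = X[:, 0]                                   # eigenvector with eigenvalue lam[0]

print(np.allclose(A @ A @ x, lam[0]**2 * x))                        # A^2 x = lambda^2 x
print(np.allclose(np.linalg.matrix_power(A, 5) @ x, lam[0]**5 * x)) # A^k x = lambda^k x
print(np.allclose(np.linalg.inv(A) @ x, x / lam[0]))                # A^{-1} x = x / lambda
t = 0.3
print(np.allclose(expm(A * t) @ x, np.exp(lam[0] * t) * x))         # e^{At} x = e^{lambda t} x
```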
Matrix \(B\) is Similar to \(A\) if
\[B = M^{-1} A M\] where \(M\) can be any invertible matrix. Then \(B\) and \(A\) have the same eigenvalues. Eigenvectors are not the same.
proof: \[
\begin{aligned}
By &= \lambda y \\
M^{-1}AMy &= \lambda y \\
AMy &= M\lambda y \\
A(My) &= \lambda(My)
\end{aligned}
\] The eigenvalues are still \(\lambda\) and the eigenvectors change to \(My\)
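A numerical illustration of similarity (my own small example, assuming NumPy; \(M\) is an arbitrary invertible matrix):

```python
import numpy as np

A = np.array([[2., 1.],
              [0., 3.]])
M = np.array([[1., 2.],
              [1., 1.]])                 # any invertible matrix
B = np.linalg.inv(M) @ A @ M             # B = M^{-1} A M

print(np.sort(np.linalg.eigvals(A)))     # eigenvalues 2 and 3
print(np.sort(np.linalg.eigvals(B)))     # same eigenvalues, 2 and 3
```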
Some other properties
- Eigenvalues(A) + Eigenvalues(B) \(\neq\) Eigenvalues(A+B) in general
- \(\sum \lambda = trace(A)\) (the trace is the sum of the diagonal entries)
- \(\prod \lambda = \text{det}(A)\)
- \(AX = X\Lambda\) \(\Rightarrow\) \(A = X\Lambda X^{-1}\) where \(\Lambda\) is the diagonal matrix with the eigenvalues \(\lambda_i\) on its diagonal.
Symmetric matrix:
- Eigenvalues are real
- Eigenvectors are orthogonal: \(S = Q\Lambda Q^{-1} = Q\Lambda Q^{T}\) (since \(Q^{-1} = Q^{T}\))
Positive Definite Matrices
Positive Definite Matrix: a symmetric matrix with positive eigenvalues
The following 5 tests can be used to determine whether a matrix is symmetric positive definite. All of them test the same property from different points of view, so fulfilling one test means all the others are also fulfilled.
- All \(\lambda_i > 0\)
- Energy \(x^TSx > 0\) for all \(x \neq 0\)
- \(S = A^TA\) (independent columns in \(A\))
- All leading determinants > 0
- All pivots in elimination > 0
The following walks through these 5 tests for the matrix \(S\) below (a numerical check of all 5 tests appears after the walkthrough)
\[ S =
\begin{bmatrix}
3 & 4\\
4 & 6\\
\end{bmatrix}
\]
- skip
- given that the matrix is 2 by 2, we can write \[x=\begin{bmatrix}
x\\
y\\
\end{bmatrix}
\] \[
\begin{bmatrix}
x & y\\
\end{bmatrix}
\begin{bmatrix}
3 & 4\\
4 & 6\\
\end{bmatrix}
\begin{bmatrix}
x \\
y \\
\end{bmatrix} = 3x^2+6y^2+8xy
\]
- This can be a loss function \(f(x,y)\), and we want \(S\) to be positive definite so that \(f(x,y)>0\) for all \((x,y) \neq (0,0)\) (convex, bowl-shaped).
- Gradient descent can be used to find the minimum. \[\nabla f =
\begin{bmatrix}
\frac{\partial f}{\partial x} \\
\frac{\partial f}{\partial y} \\
\end{bmatrix}\]
- If one eigenvalue is very large and the other is very small, gradient descent zig-zags down the valley instead of heading straight to the minimum.
- skip
- 1st leading determinant = 3, 2nd leading determinant = \((3 \cdot 6)-(4 \cdot 4) = 2\)
- The pivots are \(3\) and \(\frac{2}{3}\) \[ S =
\begin{bmatrix}
3 & 4\\
4 & 6\\
\end{bmatrix} \sim \begin{bmatrix}
3 & 4\\
0 & \frac{2}{3}\\
\end{bmatrix}
\]
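All 5 tests for this \(S\), checked numerically (a minimal sketch assuming NumPy; the Cholesky factorization stands in for the \(S = A^TA\) test, since it succeeds exactly when \(S\) is positive definite):

```python
import numpy as np

S = np.array([[3., 4.],
              [4., 6.]])

print(np.linalg.eigvalsh(S))                 # test 1: both eigenvalues > 0
x = np.random.randn(2)
print(x @ S @ x > 0)                         # test 2: energy > 0 for a random x != 0
L = np.linalg.cholesky(S)                    # test 3: succeeds only if S is positive definite
A = L.T                                      #         then S = A^T A with independent columns
print(np.allclose(A.T @ A, S))
print(S[0, 0], np.linalg.det(S))             # test 4: leading determinants 3 and 2
pivot2 = S[1, 1] - (S[1, 0] / S[0, 0]) * S[0, 1]
print(S[0, 0], pivot2)                       # test 5: pivots 3 and 2/3
```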
Positive Semidefinite Matrices
- All \(\lambda_i \geq 0\)
- Energy \(x^TSx \geq 0\) for all \(x\) (zero energy is allowed for some \(x \neq 0\))
- \(S = A^TA\) (dependent columns in \(A\) are allowed)
- All leading determinants \(\geq 0\)
- \(r\) pivots in elimination > 0, \(r \leq n\)
example 1:
\[ S' =
\begin{bmatrix}
3 & 4\\
4 & \frac{16}{3}\\
\end{bmatrix}
\]
- The determinant \(det(S') = 0\) \(\Rightarrow\) it has a 0 eigenvalue
- But how do we find the other eigenvalue? By the trace: the sum of the \(\lambda\)s must be \(3+\frac{16}{3} = \frac{25}{3}\). Therefore the \(\lambda\)s must be \(0\) and \(\frac{25}{3}\)
example 2:
\[
\begin{bmatrix}
1 & 1 & 1 \\
1 & 1 & 1 \\
1 & 1 & 1 \\
\end{bmatrix}
\]
It is a positive semidefinite matrix, with the eigenvalues 3,0,0 because:
- It is a singular matrix
- with rank=1 \(\Rightarrow\) there is only 1 non-zero eigenvalue
- The trace is 3 \(\Rightarrow\) the only non-zero eigenvalue is 3
Now, we decompose the matrix as a sum of rank-one pieces \(\lambda_i q_i q_i^T\) (the spectral decomposition \(Q\Lambda Q^T\))
\[
\begin{aligned}
\begin{bmatrix}
1 & 1 & 1 \\
1 & 1 & 1 \\
1 & 1 & 1 \\
\end{bmatrix} &= \lambda_1q_1q_1^T+\lambda_2q_2q_2^T+\lambda_3q_3q_3^T = Q\Lambda Q^T \\
&= 3 \begin{bmatrix}
1/\sqrt{3} \\
1/\sqrt{3} \\
1/\sqrt{3} \\
\end{bmatrix}
\begin{bmatrix}
\frac{1}{\sqrt{3}} & \frac{1}{\sqrt{3}} & \frac{1}{\sqrt{3}}\\
\end{bmatrix} + 0 + 0
\end{aligned}
\]
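A quick numerical confirmation of the eigenvalues and of the single rank-one piece (my own sketch, assuming NumPy):

```python
import numpy as np

S = np.ones((3, 3))
print(np.round(np.linalg.eigvalsh(S), 10))     # eigenvalues 0, 0, 3 (up to round-off)

q = np.ones(3) / np.sqrt(3)                    # unit eigenvector for lambda = 3
print(np.allclose(3 * np.outer(q, q), S))      # True: one rank-one piece rebuilds S
```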
Since a covariance matrix is derived as \(A^TA\), it is by definition a positive semidefinite matrix. However, there are circumstances where the “covariance/correlation matrix” is not positive semidefinite because it was not derived as \(A^TA\), as in the cases below (a small numerical illustration follows the list)
- pairwise estimation: when there are missing values in the sample data, it is common for analysts to estimate the covariance matrix by pairwise estimation, which is easy to do in R.
- pairwise forecast: an analyst can also forecast pairwise correlations among more than 2 variables; the assembled matrix may then fail to be positive semidefinite and is not technically a covariance/correlation matrix.
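As a small illustration of the problem (a made-up example, not from the course): a symmetric matrix with unit diagonal and entries in \([-1, 1]\) can still fail to be positive semidefinite, so it cannot be a true correlation matrix.

```python
import numpy as np

# Pairwise-plausible "correlations" that are jointly impossible
C = np.array([[ 1.0,  0.9, -0.9],
              [ 0.9,  1.0,  0.9],
              [-0.9,  0.9,  1.0]])

print(np.linalg.eigvalsh(C))        # one eigenvalue is negative -> not PSD
v = np.array([1., -1., 1.])
print(v @ C @ v)                    # negative "energy": -2.4
```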
Singular Value Decomposition
Compare with \(S = Q\Lambda Q^T\).
Now \(A = U\Sigma V^T\). \(Ax = U\Sigma V^Tx\) can be viewed as rotation, stretch, and then rotation of vector \(x\).
Now the matrix \(A\) is rectangular
\(A^TA\) is \(n\) by \(n\) positive semidefinite \(\rightarrow A^TA = V\Lambda V^T\)
\(AA^T\) is \(m\) by \(m\) positive semidefinite \(\rightarrow AA^T = U\Lambda U^T\)
- Recall when looking for eigenvalue \(Ax = \lambda x\)
- Now for the SVD, we look for a set of orthogonal vectors \(v_i\) such that multiplying by \(A\) gives a set of orthogonal vectors \(u_i\) times singular values \(\sigma_i\): \(Av_i = \sigma_i u_i\) for \(i = 1,...,r\) where \(r\) is the rank
\[
\begin{aligned}
\rightarrow AV &= U\Sigma \\
A &= U\Sigma V^T \\
A^TA &= V\Sigma^TU^TU\Sigma V^T \\
A^TA &= V(\Sigma^T\Sigma)V^T
\end{aligned}
\]
where
- \(U\) is the eigenvectors of \(AA^T\) and
- \(V\) is the eigenvectors of \(A^TA\)
We know that the vectors \(v_i\) (columns of \(V\)) are orthogonal eigenvectors of \(A^TA\). We want to show \(u_1^Tu_2 = 0\)
\[
\begin{aligned}
u_1^Tu_2 &= (\frac{Av_1}{\sigma_1})^T(\frac{Av_2}{\sigma_2}) \\
&= \frac{v_1^TA^TAv_2}{\sigma_1\sigma_2}\\
&= \frac{v_1^T\sigma_2^2v_2}{\sigma_1\sigma_2}\\
&= 0 \\
&\rightarrow u_1,u_2 \text{ are also orthogonal}
\end{aligned}
\]
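A numerical check of \(Av_i = \sigma_i u_i\) and of \(V\) holding eigenvectors of \(A^TA\) (my own rectangular example, assuming NumPy):

```python
import numpy as np

A = np.array([[3., 0., 2.],
              [1., 4., 1.]])                       # rectangular, rank 2
U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
V = Vt.T

for i in range(len(sigma)):
    print(np.allclose(A @ V[:, i], sigma[i] * U[:, i]))      # A v_i = sigma_i u_i

# Columns of V are eigenvectors of A^T A with eigenvalues sigma_i^2
print(np.allclose(A.T @ A @ V, V @ np.diag(sigma**2)))        # True
```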
Eckart-Young
\[A = U\Sigma V^T = \sigma_1u_1v_1^T + \sigma_2u_2v_2^T + ... + \sigma_ru_rv_r^T\] \[A_k = U_k\Sigma_k V_k^T = \sigma_1u_1v_1^T + \sigma_2u_2v_2^T + ... + \sigma_ku_kv_k^T\]
Eckart-Young statement: if \(B\) has rank \(k\), then \(||A-B|| \geq ||A-A_k||\)
Norms of Vectors and Matrices
About norms for vectors:
- \(L^2\) norm: \(||v||_2 = \sqrt{|v_1|^2+ ...+|v_n|^2}\)
- \(L^1\) norm: \(||v||_1 = |v_1| +...+ |v_n|\)
- \(L^{\infty}\) norm: \(||v||_{\infty} = \text{max}|v_i|\)
- \(||cv|| = |c| \ ||v||\)
- \(||v+w|| \leq ||v||+||w||\)
The following norms for a matrix \(A\) satisfy the Eckart-Young statement (a low-rank approximation check appears after this list):
- \(||A||_2 = \sigma_1\)
- Frobenius norm: \(||A||_F = \sqrt{|a_{11}|^2+ ...+|a_{1n}|^2+|a_{21}|^2+...+|a_{mn}|^2}\)
- Nuclear norm: \(||A||_N = \sigma_1+\sigma_2+...+\sigma_r\)
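A small Eckart-Young check (my own random example, assuming NumPy): the truncated SVD \(A_k\) beats another rank-\(k\) matrix \(B\) in both the 2-norm and the Frobenius norm.

```python
import numpy as np

np.random.seed(0)
A = np.random.randn(6, 5)
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]          # best rank-k approximation
B = np.random.randn(6, k) @ np.random.randn(k, 5)    # some other rank-k matrix

print(np.linalg.norm(A - A_k, 2)     <= np.linalg.norm(A - B, 2))       # True
print(np.linalg.norm(A - A_k, 'fro') <= np.linalg.norm(A - B, 'fro'))   # True
```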
4 Ways to Solve Least Squares
\(Ax=b\)
where \(A\) is \(m\) by \(n\) and rank \(r\)
- Solve the normal equations \(A^TA\hat{x} = A^Tb\) to minimize \(||Ax-b||^2\); if \(A\) has independent columns, then \(A^TA\) is invertible
- Gram-Schmidt \(A = QR\) leads to \(\hat{x} = R^{-1}Q^Tb\)
- The pseudoinverse directly multiplies \(b\) to give \(\hat{x} = A^{+}b\). With \(A = U\Sigma V^T\): if \(A\) is invertible, then \(A^{-1} = V\Sigma^{-1}U^T\); if \(A\) is not invertible, \(A^{+} = V\Sigma^{+}U^T\)
- The best \(\hat{x}\) is the limit of \((A^TA + \delta I)^{-1}A^Tb\) as \(\delta \to 0\)
These four approaches are compared numerically in the sketch below.
Claim: \(A^{+}b = (A^TA)^{-1}A^Tb\) when \(N(A) = 0\), \(r=n\)
\(A^{+} = V\Sigma^{+}U^T = (A^TA)^{-1}A^T\)
Notice that
- \((A^TA)^{-1}A^TA=I\)
- while \(A(A^TA)^{-1}A^T \neq I\) in general
For \((A^TA)^{-1}A^T\) to be a two-sided inverse of \(A\), we would need both \((A^TA)^{-1}A^TA=I\) and \(A(A^TA)^{-1}A^T = I\)
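The four approaches agree when \(A\) has independent columns (a minimal sketch with a random tall matrix, assuming NumPy):

```python
import numpy as np

np.random.seed(1)
A = np.random.randn(8, 3)                  # m > n, independent columns
b = np.random.randn(8)

x1 = np.linalg.solve(A.T @ A, A.T @ b)     # 1) normal equations
Q, R = np.linalg.qr(A)
x2 = np.linalg.solve(R, Q.T @ b)           # 2) Gram-Schmidt / QR
x3 = np.linalg.pinv(A) @ b                 # 3) pseudoinverse (via SVD)
delta = 1e-10
x4 = np.linalg.solve(A.T @ A + delta * np.eye(3), A.T @ b)   # 4) penalized limit

print(np.allclose(x1, x2), np.allclose(x1, x3), np.allclose(x1, x4))   # True True True
```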
Difficulties with Ax = b
\(Ax=b\)
- \(x = A^{+}b\) The pseudoinverse
- Size ok, condition number \(\frac{\sigma_1}{\sigma_n}\) ok: x = A\b (MATLAB command for elimination)
- \(m>n=\text{rank}\): solve \(A^TA\hat{x}=A^Tb\) (least squares / linear regression), where \(m\) is the number of equations
- \(m<n\): many solutions, so minimize the norm (take the minimum-norm solution)
- Columns in bad condition -> orthogonalize the columns by Gram-Schmidt, or use column pivoting
- Near singular -> inverse problem -> add a penalty (a small numerical illustration follows this list):
\(\text{min} \ ||Ax-b||^2+\delta^2||x||^2\)
- Too big -> iterative method
- Way too big -> randomized linear algebra
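A small illustration of the penalty idea for a singular \(A\) (my own example, assuming NumPy): as \(\delta \to 0\), the penalized solution approaches the pseudoinverse (minimum-norm) solution.

```python
import numpy as np

A = np.array([[1., 2.],
              [2., 4.]])                       # rank 1, so A is singular
b = np.array([1., 2.])

x_pinv = np.linalg.pinv(A) @ b                 # minimum-norm least-squares solution
for delta in (1e-1, 1e-3, 1e-5):
    x_pen = np.linalg.solve(A.T @ A + delta**2 * np.eye(2), A.T @ b)
    print(delta, np.linalg.norm(x_pen - x_pinv))   # distance shrinks toward 0
```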