This page contains my study notes for the course Matrix Methods in Data Analysis, Signal Processing, and Machine Learning by Professor Gilbert Strang.

Column Space & Rank

  • Column space
    \(C(A) = span(\overrightarrow{a_1},\overrightarrow{a_2},\overrightarrow{a_3},\overrightarrow{a_4},\overrightarrow{a_5})\)
  • Basis for \(C(A)\):
    linearly independent vectors that span \(C(A)\)

\[A = \begin{bmatrix} 1 & 0 & -1 & 0 & 4 \\ 2 & 1 & 0 & 0 & 9 \\ -1 & 2 & 5 & 1& -5\\ 1 & -1 & -3 & -2 & 9 \\ \end{bmatrix} \]

Reduced row echelon form by row operation: \[rref(A) = R = \begin{bmatrix} 1 & 0 & -1 & 0 & 4 \\ 0 & 1 & 2 & 0 & 1 \\ 0 & 0 & 0 & 1& -3\\ 0 & 0 & 0 & 0 & 0 \\ \end{bmatrix} \]

The pivot columns \(\overrightarrow{r_1}\), \(\overrightarrow{r_2}\), and \(\overrightarrow{r_4}\) of matrix \(R\) are linearly independent
\(\Rightarrow\) \(\overrightarrow{a_1}\), \(\overrightarrow{a_2}\), and \(\overrightarrow{a_4}\) are also linearly independent
\(\Rightarrow\) \(\overrightarrow{a_1}\), \(\overrightarrow{a_2}\), and \(\overrightarrow{a_4}\) form a basis for \(C(A)\)
\(\Rightarrow\) \(dim(C(A)) = 3\) and \(rank(A) = 3\)
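
This can be checked numerically. A minimal sketch in Python with SymPy, using the matrix \(A\) above (`Matrix.rref()` returns the reduced form together with the pivot-column indices):

```python
from sympy import Matrix

# The 4x5 matrix A from this section
A = Matrix([
    [ 1,  0, -1,  0,  4],
    [ 2,  1,  0,  0,  9],
    [-1,  2,  5,  1, -5],
    [ 1, -1, -3, -2,  9],
])

R, pivot_cols = A.rref()   # reduced row echelon form and pivot-column indices
print(R)                   # matches rref(A) above
print(pivot_cols)          # (0, 1, 3) -> a1, a2, a4 form a basis for C(A)
print(A.rank())            # 3 = dim C(A) = rank(A)
```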

Understanding Matrix Multiplication

\[Ax = \begin{bmatrix} 2 & 1 & 3\\ 3 & 1 & 4\\ 5 & 7 & 12\\ \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ \end{bmatrix} = x_1 \begin{bmatrix} 2 \\ 3 \\ 5 \\ \end{bmatrix} +x_2 \begin{bmatrix} 1 \\ 1 \\ 7 \\ \end{bmatrix} + x_3 \begin{bmatrix} 3 \\ 4 \\ 12 \\ \end{bmatrix} \]

\(A_{3\times3} = C_{3\times2} \, R_{2\times3}\), where the columns of \(C\) are a basis for the column space of \(A\) and the rows of \(R\) are a basis for the row space of \(A\)

\[ \begin{bmatrix} 2 & 1 & 3\\ 3 & 1 & 4\\ 5 & 7 & 12\\ \end{bmatrix} = \begin{bmatrix} 2 & 1\\ 3 & 1\\ 5 & 7\\ \end{bmatrix} \begin{bmatrix} 1 & 0 & 1\\ 0 & 1 & 1\\ \end{bmatrix} \]

column rank = row rank = \(r = 2\)
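
A minimal NumPy check of this \(A = CR\) factorization (the matrices are the ones shown above):

```python
import numpy as np

A = np.array([[2, 1,  3],
              [3, 1,  4],
              [5, 7, 12]])

C = np.array([[2, 1],
              [3, 1],
              [5, 7]])        # independent columns of A

R = np.array([[1, 0, 1],
              [0, 1, 1]])     # expresses every column of A as a combination of C's columns

print(np.allclose(C @ R, A))        # True: A = CR
print(np.linalg.matrix_rank(A))     # 2 = column rank = row rank
```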

Matrix Factorizations

  1. \(A = LU\): from elimination we get the upper triangular matrix \(U\) \[ \begin{bmatrix} 2 & 3\\ 4 & 7\\ \end{bmatrix} \rightarrow \begin{bmatrix} 2 & 3\\ 0 & 1\\ \end{bmatrix} \] then \(A = LU\)
    \[ \begin{bmatrix} 2 & 3\\ 4 & 7\\ \end{bmatrix} = \begin{bmatrix} 1 & 0\\ 2 & 1\\ \end{bmatrix} \begin{bmatrix} 2 & 3\\ 0 & 1\\ \end{bmatrix} \]
  2. \(A = QR\) from orthogonalization
  3. \(S = Q\Lambda Q^{T}\) from eigenvectors of a symmetric matrix \(S\)
  4. \(A = X\Lambda X^{-1}\) diagonalizes \(A\) by the eigenvector matrix \(X\), for \(A\) not necessarily symmetric
  5. \(A = U\Sigma V^{T}\) (orthogonal)(diagonal)(orthogonal) = Singular Value Decomposition
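
All five factorizations can be computed numerically. A sketch in Python with NumPy/SciPy, using small example matrices of my own (note that SciPy's `lu` includes a permutation \(P\), so in general \(A = PLU\)):

```python
import numpy as np
from scipy.linalg import lu, qr, eig, eigh, svd

A = np.array([[2.0, 3.0],
              [4.0, 7.0]])

# 1. A = LU (SciPy returns P, L, U with A = P L U)
P, L, U = lu(A)
print(np.allclose(P @ L @ U, A))

# 2. A = QR from orthogonalization
Q, R = qr(A)
print(np.allclose(Q @ R, A))

# 3. S = Q Λ Q^T for a symmetric matrix S
S = np.array([[3.0, 4.0],
              [4.0, 6.0]])
lam, Qs = eigh(S)
print(np.allclose(Qs @ np.diag(lam) @ Qs.T, S))

# 4. A = X Λ X^{-1} by the eigenvector matrix X
lam, X = eig(A)
print(np.allclose(X @ np.diag(lam) @ np.linalg.inv(X), A))

# 5. A = U Σ V^T, the singular value decomposition
Uu, sigma, Vt = svd(A)
print(np.allclose(Uu @ np.diag(sigma) @ Vt, A))
```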

4 Fundamental Subspaces (A is m by n with rank r)

  • Column space \(C(A)\): dim = r
  • Row space \(C(A^T)\): dim = r (same as dim of column space)
  • Null space of the matrix \(N(A)\): dim = n-r
  • Null space of transpose of A \(N(A^T)\): dim = m-r
  • \(C(A^T)\) is orthogonal to \(N(A)\) (from \(Ax = 0\): the 1st row of \(A\) is orthogonal to \(x\), the 2nd row of \(A\) is orthogonal to \(x\), …)
  • \(C(A)\) is orthogonal to \(N(A^T)\)

The null space \(N(A)\) is the set of all solutions of \(Ax = 0\)

Orthonormal & Orthogonal

Orthonormal Columns \(Q\):

  • \(Q^TQ = I\)
  • \(QQ^T = I\)? Sometimes yes, sometimes no. Yes when \(Q\) is a square matrix, and then \(Q\) is called an orthogonal matrix
    • example 1: rotation matrix \[Q = \begin{bmatrix} \cos \theta & -\sin \theta\\ \sin \theta & \cos \theta\\ \end{bmatrix} \]
    • example 2: reflection matrix \[Q = \begin{bmatrix} \cos \theta & \sin \theta\\ \sin \theta & -\cos \theta\\ \end{bmatrix} \]
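
A quick NumPy check that the rotation matrix has orthonormal columns, so \(Q^TQ = I\) and, because \(Q\) is square, \(QQ^T = I\) as well (the angle here is arbitrary):

```python
import numpy as np

theta = 0.7                                           # any angle
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])       # rotation matrix

print(np.allclose(Q.T @ Q, np.eye(2)))   # True: orthonormal columns
print(np.allclose(Q @ Q.T, np.eye(2)))   # True too, since Q is square (orthogonal matrix)
```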

Eigenvalues & Eigenvectors

\[Ax_i = \lambda_i x_i\]

where \(i = 1,...,n\), \(A\) is an \(n\) by \(n\) matrix, the \(\lambda_i\) are the eigenvalues of \(A\), and the \(x_i\) are the eigenvectors of \(A\)

Why do we want this?

  • \(A^2x = A(Ax) = A(\lambda x) = \lambda^2 x\) \(\Rightarrow\) the eigenvectors \(x\) of \(A\) are also eigenvectors of \(A^2\), and the eigenvalues of \(A^2\) are \(\lambda^2\), where \(\lambda\) are the eigenvalues of \(A\). We can then extend this to
  • \(A^kx = \lambda^kx\)
  • \(A^{-1}x = \frac{1}{\lambda}x\). This holds whenever \(A\) is invertible; if some \(\lambda=0\), then \(A\) is not invertible in the first place.
  • \(e^{At}x = e^{\lambda t}x\)
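
These facts are easy to verify numerically. A sketch in NumPy/SciPy with a small symmetric example matrix of my own (eigenvalues 3 and 1):

```python
import numpy as np
from scipy.linalg import expm

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])               # example matrix with eigenvalues 3 and 1

lam, X = np.linalg.eig(A)
x = X[:, 0]                              # eigenvector with A x = lam[0] x

k = 5
print(np.allclose(np.linalg.matrix_power(A, k) @ x, lam[0]**k * x))   # A^k x = λ^k x
print(np.allclose(np.linalg.inv(A) @ x, x / lam[0]))                  # A^{-1} x = x/λ
print(np.allclose(expm(A) @ x, np.exp(lam[0]) * x))                   # e^{At} x = e^{λt} x at t = 1
```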

Matrix \(B\) is Similar to \(A\) if

\[B = M^{-1} A M\] where \(M\) can be any invertible matrix. Then \(B\) and \(A\) have the same eigenvalues. Eigenvectors are not the same.

proof: \[ \begin{aligned} By &= \lambda y \\ M^{-1}AMy &= \lambda y \\ AMy &= M\lambda y \\ A(My) &= \lambda(My) \end{aligned} \] The eigenvalues are still \(\lambda\) and the eigenvectors change to \(My\)
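
A quick numerical check of this statement: build \(B = M^{-1}AM\) for an arbitrary invertible \(M\) and compare the spectra (a sketch in NumPy with my own example matrices):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [0.0, 3.0]])               # eigenvalues 2 and 3
M = np.array([[1.0, 2.0],
              [1.0, 1.0]])               # any invertible matrix

B = np.linalg.inv(M) @ A @ M             # B is similar to A

# Same eigenvalues (compare the sorted spectra)
print(np.allclose(np.sort(np.linalg.eigvals(A)), np.sort(np.linalg.eigvals(B))))
```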

Some other properties

  • Eigenvalues of \(A\) + eigenvalues of \(B\) \(\neq\) eigenvalues of \(A+B\) in general
  • \(\sum \lambda = trace(A)\) (trace is adding the diagonal values)
  • \(\prod \lambda = \text{det}(A)\)
  • \(AX = X\Lambda\) \(\Rightarrow\) \(A = X\Lambda X^{-1}\) where \(\Lambda\) is the matrix with the eigenvalues \(\lambda\) on the diagonal.
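
A short NumPy check of the trace, determinant, and diagonalization properties (the example matrix is my own):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])

lam, X = np.linalg.eig(A)

print(np.isclose(lam.sum(), np.trace(A)))                     # sum of eigenvalues = trace
print(np.isclose(lam.prod(), np.linalg.det(A)))               # product of eigenvalues = det
print(np.allclose(X @ np.diag(lam) @ np.linalg.inv(X), A))    # A = X Λ X^{-1}
```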

Symmetric matrix:

  • Eigenvalues are real
  • Eigenvectors are orthogonal: \(S = Q\Lambda Q^{T}\) (and \(Q^{-1} = Q^{T}\))

Positive Definite Matrices

Positive definite matrix: a symmetric matrix whose eigenvalues are all positive

The following 5 tests can be used to determine whether a symmetric matrix is positive definite. They all test the same property from different points of view, so passing one test means all the others pass as well.

  1. All \(\lambda_i > 0\)
  2. Energy \(x^TSx > 0\) (for all \(x \neq 0\))
  3. \(S = A^TA\) (independent columns in \(A\))
  4. All leading determinants > 0
  5. All pivots in elimination > 0

The following works through these 5 tests for the matrix \(S\)

\[ S = \begin{bmatrix} 3 & 4\\ 4 & 6\\ \end{bmatrix} \]

  1. skip
  2. given the matrix is 2 by 2, we can write \[x=\begin{bmatrix} x\\ y\\ \end{bmatrix} \] \[ \begin{bmatrix} x & y\\ \end{bmatrix} \begin{bmatrix} 3 & 4\\ 4 & 6\\ \end{bmatrix} \begin{bmatrix} x \\ y \\ \end{bmatrix} = 3x^2+6y^2+8xy \]
    • This can be a loss function \(f(x,y)\), and we need \(S\) to be positive definite so that \(f(x,y)>0\) (convex, or bowl shape).
    • Gradient descent can be used for finding the minimum. \[\bigtriangledown f = \begin{bmatrix} \frac{\partial f}{\partial x} \\ \frac{\partial f}{\partial y} \\ \end{bmatrix}\]
    • If one eigenvalue is very large and the other is very small, gradient descent zig-zags down the valley instead of heading straight to the minimum.
  3. skip
  4. 1st leading determinant = 3, 2nd leading determinant = \((3 \cdot 6)-(4 \cdot 4) = 2\)
  5. The pivots are \(3\) and \(\frac{2}{3}\) \[ S = \begin{bmatrix} 3 & 4\\ 4 & 6\\ \end{bmatrix} \sim \begin{bmatrix} 3 & 4\\ 0 & \frac{2}{3}\\ \end{bmatrix} \]
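
All 5 tests can be checked numerically for this \(S\). A sketch in NumPy (the Cholesky factorization stands in for test 3, since it produces an \(A\) with \(S = A^TA\) exactly when \(S\) is positive definite; the pivots in test 5 come from elimination without row exchanges):

```python
import numpy as np

S = np.array([[3.0, 4.0],
              [4.0, 6.0]])

# 1. All eigenvalues > 0
print(np.linalg.eigvalsh(S))                 # both positive

# 2. Energy x^T S x > 0 (spot-check a random nonzero x)
x = np.random.randn(2)
print(x @ S @ x > 0)

# 3. S = A^T A with independent columns: Cholesky gives S = L L^T, so A = L^T works
L = np.linalg.cholesky(S)
print(np.allclose(L @ L.T, S))

# 4. Leading determinants > 0
print(S[0, 0], np.linalg.det(S))             # 3 and 2

# 5. Pivots > 0 (elimination without row exchanges)
U = S.copy()
U[1] -= (U[1, 0] / U[0, 0]) * U[0]
print(np.diag(U))                            # [3, 2/3]
```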

Positive Semidefinite Matrices

  1. All \(\lambda_i \geq 0\)
  2. Energy \(x^TSx \geq 0\) (for all \(x\))
  3. \(S = A^TA\) (dependent columns in \(A\) are allowed)
  4. All leading determinants \(\geq 0\)
  5. \(r\) pivots in elimination > 0, \(r \leq n\)

example 1:

\[ S' = \begin{bmatrix} 3 & 4\\ 4 & \frac{16}{3}\\ \end{bmatrix} \]

  • The determinant \(det(S') = 0\) \(\Rightarrow\) it has a 0 eigenvalue
  • How do we find the other eigenvalue? By the trace: the sum of the \(\lambda\)s must be \(3+\frac{16}{3} = \frac{25}{3}\). Therefore the eigenvalues must be \(0\) and \(\frac{25}{3}\)

example 2:

\[ \begin{bmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \\ \end{bmatrix} \]

It is a positive semidefinite matrix, with the eigenvalues 3,0,0 because:

  • It is a singular matrix
  • with rank=1 \(\Rightarrow\) there is only 1 non-zero eigenvalue
  • The trace is 3 \(\Rightarrow\) the only non-zero eigenvalue is 3

Now, we decompose the matrix into \(Q\Lambda Q^T\), a sum of rank-one pieces \(\lambda_i q_i q_i^T\)

\[ \begin{aligned} \begin{bmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \\ \end{bmatrix} &= \lambda_1q_1q_1^T+\lambda_2q_2q_2^T+\lambda_3q_3q_3^T = Q\Lambda Q^T \\ &= 3 \begin{bmatrix} 1/\sqrt{3} \\ 1/\sqrt{3} \\ 1/\sqrt{3} \\ \end{bmatrix} \begin{bmatrix} \frac{1}{\sqrt{3}} & \frac{1}{\sqrt{3}} & \frac{1}{\sqrt{3}}\\ \end{bmatrix} + 0 + 0 \end{aligned} \]

Since a covariance matrix is derived as \(A^TA\), it is by definition a positive semidefinite matrix. However, there are circumstances where the “covariance/correlation matrix” is not positive semidefinite, because it was not derived as \(A^TA\):

  • Pairwise estimation: when there are missing values in the sample data, analysts commonly estimate the covariance matrix by pairwise estimation, which is easy to do in R.
  • Pairwise forecast: an analyst can also forecast the pairwise correlations of more than 2 variables separately; in that case the resulting matrix may not be positive semidefinite and is not technically a covariance/correlation matrix (see the sketch below).
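
As a small illustration (my own example, not from the lecture): each entry below is a legitimate pairwise correlation on its own, but together they are inconsistent, and the matrix has a negative eigenvalue, so no data set could ever produce it as \(A^TA\):

```python
import numpy as np

# Pairwise-specified "correlation matrix": each off-diagonal entry is fine by itself
C = np.array([[ 1.0,  0.9, -0.9],
              [ 0.9,  1.0,  0.9],
              [-0.9,  0.9,  1.0]])

print(np.linalg.eigvalsh(C))   # smallest eigenvalue is negative -> not positive semidefinite
```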

Singular Value Decomposition

Compared with \(S = Q\Lambda Q^T\).
Now \(A = U\Sigma V^T\). \(Ax = U\Sigma V^Tx\) can be viewed as rotation, stretch, and then rotation of vector \(x\).

Now the matrix \(A\) is rectangular

  • \(A^TA\) is \(n\) by \(n\) positive semidefinite \(\rightarrow A^TA = V\Lambda V^T\)
  • \(AA^T\) is \(m\) by \(m\) positive semidefinite \(\rightarrow AA^T = U\Lambda U^T\)

  • Recall when looking for eigenvalue \(Ax = \lambda x\)
  • Now for the SVD, we look for a set of orthogonal vectors \(v\) such that multiplying by \(A\) gives a set of orthogonal vectors \(u\) scaled by the singular values \(\sigma\): \(Av_i = \sigma_i u_i\) for \(i = 1,...,r\) where \(r\) is the rank
    \[ \begin{aligned} \rightarrow AV &= U\Sigma \\ A &= U\Sigma V^T \\ A^TA &= V\Sigma^TU^TU\Sigma V^T \\ A^TA &= V(\Sigma^T\Sigma)V^T \end{aligned} \]

where

  • the columns of \(U\) are the eigenvectors of \(AA^T\) and
  • the columns of \(V\) are the eigenvectors of \(A^TA\)

We know that the vectors \(v_i\) are orthogonal eigenvectors of \(A^TA\). We want to show \(u_1^Tu_2 = 0\):

\[ \begin{aligned} u_1^Tu_2 &= (\frac{Av_1}{\sigma_1})^T(\frac{Av_2}{\sigma_2}) \\ &= \frac{v_1^TA^TAv_2}{\sigma_1\sigma_2}\\ &= \frac{v_1^T\sigma_2^2v_2}{\sigma_1\sigma_2}\\ &= 0 \\ &\rightarrow u_1,u_2 \text{ are also orthogonal} \end{aligned} \]
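
A sketch verifying these relations in NumPy: for an arbitrary rectangular \(A\), the columns of \(V\) are eigenvectors of \(A^TA\), the \(\sigma_i^2\) are the shared eigenvalues, and the columns of \(U\) and \(V\) are orthonormal.

```python
import numpy as np

A = np.random.randn(4, 3)                     # any rectangular matrix

U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
V = Vt.T

# A V = U Σ, and A = U Σ V^T
print(np.allclose(A @ V, U * sigma))          # column i of U scaled by sigma_i
print(np.allclose(U @ np.diag(sigma) @ Vt, A))

# σ_i^2 are the eigenvalues of A^T A (and the nonzero eigenvalues of A A^T)
print(np.allclose(np.sort(sigma**2), np.sort(np.linalg.eigvalsh(A.T @ A))))

# Columns of U and V are orthonormal
print(np.allclose(U.T @ U, np.eye(3)), np.allclose(V.T @ V, np.eye(3)))
```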

Eckart-Young

\[A = U\Sigma V^T = \sigma_1u_1v_1^T + \sigma_2u_2v_2^T + ... + \sigma_ru_rv_r^T\] \[A_k = U_k\Sigma_k V_k^T = \sigma_1u_1v_1^T + \sigma_2u_2v_2^T + ... + \sigma_ku_kv_k^T\]

Eckart-Young statement: if \(B\) has rank \(k\), then \(||A-B|| \geq ||A-A_k||\)

About norms for vectors:

  • \(L^2\) norm: \(||v||_2 = \sqrt{|v_1|^2+ ...+|v_n|^2}\)
  • \(L^1\) norm: \(||v||_1 = |v_1| +...+ |v_n|\)
  • \(L^{\infty}\) norm: \(||v||_{\infty} = \text{max}|v_i|\)
  • \(||cv|| = |c| \ ||v||\)
  • \(||v+w|| \leq ||v||+||w||\)

The following norms for a matrix \(A\) satisfy the Eckart-Young statement:

  • \(||A||_2 = \sigma_1\)
  • Frobenius norm: \(||A||_F = \sqrt{|a_{11}|^2+ ...+|a_{1n}|^2+|a_{21}|^2+...+|a_{mn}|^2}\)
  • Nuclear norm: \(||A||_N = \sigma_1+\sigma_2+...+\sigma_r\)
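
A sketch of Eckart-Young in NumPy: the truncated SVD \(A_k\) is compared against some other rank-\(k\) matrix \(B\) (here a random one), and \(A_k\) gives the smaller error in both the 2-norm and the Frobenius norm; in the 2-norm the error equals the first dropped singular value.

```python
import numpy as np

np.random.seed(0)
A = np.random.randn(6, 5)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]       # best rank-k approximation of A

# Some other rank-k matrix B
B = np.random.randn(6, k) @ np.random.randn(k, 5)

for norm in (2, 'fro'):
    err_Ak = np.linalg.norm(A - A_k, ord=norm)
    err_B  = np.linalg.norm(A - B,  ord=norm)
    print(norm, err_Ak <= err_B)                  # Eckart-Young: A_k is at least as close

print(np.isclose(np.linalg.norm(A - A_k, 2), s[k]))   # ||A - A_k||_2 = σ_{k+1}
```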


4 Ways to Solve Least Squares

\(Ax=b\)

where \(A\) is \(m\) by \(n\) and rank \(r\)

  1. Solve \(A^TA\hat{x} = A^Tb\) to minimize \(||Ax-b||^2\). If \(A\) has independent columns, then \(A^TA\) is invertible
  2. Gram-Schmidt \(A = QR\) leads to \(x = R^{-1}Q^Tb\)

  3. The pseudoinverse directly multiplies \(b\) to give \(x = A^{+}b\)
    \(A = U\Sigma V^T\), if A is invertible, then \(A^{-1} = V\Sigma^{-1}U^T\). If A is not invertible: \(A^{+} = V\Sigma^{+}U^T\)

  4. The best \(x\) is the limit of \((A^TA + \delta I)^{-1}A^Tb\) as \(\delta \to 0\)
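
A sketch comparing the four approaches in NumPy for a tall random matrix with independent columns; all four give (essentially) the same \(\hat{x}\).

```python
import numpy as np

np.random.seed(1)
A = np.random.randn(8, 3)            # m > n with independent columns
b = np.random.randn(8)

# 1. Normal equations: A^T A x = A^T b
x1 = np.linalg.solve(A.T @ A, A.T @ b)

# 2. Gram-Schmidt / QR: x = R^{-1} Q^T b
Q, R = np.linalg.qr(A)
x2 = np.linalg.solve(R, Q.T @ b)

# 3. Pseudoinverse: x = A^+ b
x3 = np.linalg.pinv(A) @ b

# 4. Penalized limit: (A^T A + δI)^{-1} A^T b with a tiny δ
delta = 1e-10
x4 = np.linalg.solve(A.T @ A + delta * np.eye(3), A.T @ b)

print(np.allclose(x1, x2), np.allclose(x1, x3), np.allclose(x1, x4))
```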

Claim: \(A^{+}b = (A^TA)^{-1}A^Tb\) when \(N(A) = 0\), \(r=n\)

\(A^{+} = V\Sigma^{+}U^T = (A^TA)^{-1}A^T\)

Notice that

  • \((A^TA)^{-1}A^TA=I\)
  • while \(A(A^TA)^{-1}A^T \neq I\). Note that for \((A^TA)^{-1}A^T\) to be a true inverse of \(A\), we would need both \((A^TA)^{-1}A^TA=I\) and \(A(A^TA)^{-1}A^T = I\)

Difficulties with Ax = b

\(Ax=b\)

  1. \(x = A^{+}b\) The pseudoinverse
  2. Size OK, condition number \(\frac{\sigma_1}{\sigma_n}\) OK: x = A\b (MATLAB's elimination command)
  3. \(m>n=\text{rank}\): solve \(A^TA \hat{x}=A^Tb\) (linear regression), where \(m\) is the number of equations
  4. \(m<n\): many solutions, so minimize the norm (choose the minimum-norm solution)
  5. Columns in bad condition -> orthogonalize by Gram-Schmidt, possibly with column pivoting
  6. Near singular -> inverse problem -> add a penalty:
    \(\text{min} ||Ax-b||^2+\delta^2||x||^2\)
  7. Too big -> iterative method
  8. Way too big -> randomized linear algebra