The SVD allows us to decompose a data matrix $X$ as the product of three matrices $U$, $\Sigma$, and $V^T$: essentially, $U$ contains information about the column space of $X$, $V$ contains information about the row space of $X$, and $\Sigma$ is a hierarchically ordered diagonal matrix that tells us how important the various columns of $U$ and $V$ are.
In fact, since the data matrix $X$ has only $m$ columns, there are at most $m$ linearly independent columns in this $n$-dimensional vector space. So the first $m$ columns of $U$ are the important ones for representing this data matrix. To explain this fact more deeply, we write out the expansion as a sum of rank-1 matrices.
Based on the singular value decomposition, we can write the following equation:
$$X = U\Sigma V^T.$$
Since $\Sigma$ is a diagonal matrix, we can expand the above equation as follows:
$$X = \sigma_1 u_1 v_1^T + \sigma_2 u_2 v_2^T + \cdots + \sigma_m u_m v_m^T,$$
where $u_k$ and $v_k$ denote the $k$-th columns of $U$ and $V$, and $\sigma_k$ denotes the $k$-th singular value.
Even though $U$ is a massive $n\times n$ matrix, there are at most $m$ nonzero singular values in $\Sigma$, which means the rank of the data matrix satisfies $\operatorname{rank}(X)\leq m$. In fact, we can select the first $m$ columns of $U$, the leading $m\times m$ block of $\Sigma$, and the original $V$ to represent the data matrix $X$ exactly. We call $X=\hat{U}\hat{\Sigma}V^T$ the economy SVD and $X=U\Sigma V^T$ the full SVD. Since we consider the case $n\gg m$, storing $\hat{U}\in \mathbb{R}^{n\times m}$ and $\hat{\Sigma}\in \mathbb{R}^{m\times m}$ requires far less memory.
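The difference between the full and economy SVD can be checked numerically. The sketch below uses a hypothetical random tall-skinny matrix and NumPy's `np.linalg.svd`, whose `full_matrices` flag switches between the two factorizations; the specific sizes are assumptions for illustration.

```python
import numpy as np

# Hypothetical tall-skinny data matrix with n >> m (assumed sizes)
n, m = 1000, 10
rng = np.random.default_rng(0)
X = rng.standard_normal((n, m))

# Full SVD: U is n x n, even though only m singular values are nonzero
U, s, Vt = np.linalg.svd(X, full_matrices=True)

# Economy SVD: U_hat keeps only the first m columns of U
U_hat, s_hat, Vt_hat = np.linalg.svd(X, full_matrices=False)

print(U.shape, U_hat.shape)    # (1000, 1000) (1000, 10)

# Both factorizations reconstruct X; the economy one stores far less
X_econ = U_hat @ np.diag(s_hat) @ Vt_hat
print(np.allclose(X, X_econ))  # True
```

Note that NumPy returns $V^T$ directly (here `Vt`), not $V$, and the singular values as a 1-D array rather than a diagonal matrix.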
Therefore, we can give another interpretation of the SVD: it decomposes the data matrix $X$ over the orthogonal bases $U$ and $V$, so that $X$ can be rewritten as a sum of rank-1 matrices that progressively improve the approximation of $X$. This interpretation yields some interesting results: the best rank-1 approximation of $X$ is $\sigma_1u_1v_1^T$; the best rank-2 approximation is $\sigma_1u_1v_1^T+\sigma_2u_2v_2^T$; and so on.
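The claim that partial sums of rank-1 matrices approximate $X$ increasingly well can be verified directly. In this sketch (random matrix and sizes are assumptions), each added term $\sigma_k u_k v_k^T$ reduces the Frobenius error, and the full sum recovers $X$ exactly:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 5))
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Accumulate the rank-1 terms sigma_k u_k v_k^T and track the error
errors = []
approx = np.zeros_like(X)
for k in range(len(s)):
    approx += s[k] * np.outer(U[:, k], Vt[k, :])
    errors.append(np.linalg.norm(X - approx))  # Frobenius norm

# The error decreases monotonically and reaches ~0 after all m terms
print(errors)
```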
Since in practice we hope to approximate the real data as closely as possible while using less storage, we often truncate at rank $r$. Oftentimes we have many negligibly small singular values $\sigma_{r+1},\ \sigma_{r+2},\cdots,\sigma_m$, which means that most of the information in $X$ is captured by the first $r$ singular values and the first $r$ singular vectors. So we can discard the small singular values and their singular vectors, keeping only the first $r$ columns of $U$ and $V$ and the leading $r\times r$ block of $\Sigma$. We define this truncated SVD as follows:
$$\tilde{X}=\tilde{U}\tilde{\Sigma}\tilde{V}^T.$$
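A minimal sketch of rank-$r$ truncation, with assumed sizes: slicing the economy SVD factors gives $\tilde{U}$, $\tilde{\Sigma}$, $\tilde{V}$, and the storage drops from $nm$ numbers to $r(n+m+1)$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, r = 500, 20, 5  # assumed sizes for illustration
X = rng.standard_normal((n, m))
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Keep only the first r columns/rows: the rank-r truncated SVD
U_r, s_r, Vt_r = U[:, :r], s[:r], Vt[:r, :]
X_r = U_r @ np.diag(s_r) @ Vt_r  # rank-r approximation of X

# Storage: r*(n + m + 1) numbers instead of n*m
storage_truncated = U_r.size + s_r.size + Vt_r.size
storage_full = X.size
print(storage_truncated, storage_full)  # 2605 10000
```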
Here we find that the truncated SVD $\tilde{U}\tilde{\Sigma}\tilde{V}^T$ is the best rank-$r$ approximation of the data matrix $X$. Thus, high-dimensional data may be well described by a few dominant patterns given by the columns of $\tilde{U}$ and $\tilde{V}$. As mentioned in the first section, the truncated singular vectors $\tilde{U}$ provide a coordinate transformation from the high-dimensional measurement space into a low-dimensional pattern space.
The Eckart–Young theorem makes precise the sense in which the rank-$r$ truncation is the absolute best rank-$r$ approximation to the matrix $X$.
Theorem (Eckart–Young). The optimal rank-$r$ approximation to $X$, in a least-squares sense, is given by the rank-$r$ SVD truncation $\tilde{X}$:
$$\operatorname*{argmin}_{\tilde{X},\ \text{s.t.}\ \operatorname{rank}(\tilde{X})=r}\ \|X-\tilde{X}\|_F=\tilde{U}\tilde{\Sigma}\tilde{V}^T.$$
The Eckart–Young theorem guarantees that the best possible rank-$r$ matrix approximation to $X$ is given by the first $r$ terms of the SVD, i.e., the rank-$r$ SVD truncation.
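Two consequences of the theorem can be checked numerically (random matrices and sizes below are assumptions): the Frobenius error of the rank-$r$ truncation equals $\sqrt{\sigma_{r+1}^2+\cdots+\sigma_m^2}$, and no other rank-$r$ matrix does better:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((100, 10))
U, s, Vt = np.linalg.svd(X, full_matrices=False)
r = 3

# Rank-r SVD truncation
X_r = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]

# Its Frobenius error equals the norm of the discarded singular values
err_svd = np.linalg.norm(X - X_r, 'fro')
print(np.isclose(err_svd, np.sqrt(np.sum(s[r:] ** 2))))  # True

# Any other rank-r matrix does at least as badly; try a random one
B = rng.standard_normal((100, r)) @ rng.standard_normal((r, 10))
err_rand = np.linalg.norm(X - B, 'fro')
print(err_svd <= err_rand)  # True
```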
Finally, we should mention an important point. At the beginning of our discussion of the SVD, we required the matrices $U$ and $V$ to be unitary. However, if we truncate at rank $r$, the truncated matrices $\tilde{U}$ and $\tilde{V}$ are no longer square, so they are no longer unitary. They only satisfy $\tilde{U}^T\tilde{U}=\tilde{V}^T\tilde{V}=I$, while $\tilde{U}\tilde{U}^T\not=I$.
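This loss of unitarity is easy to verify (sizes below are assumptions): the truncated factors have orthonormal columns, so $\tilde{U}^T\tilde{U}=I_r$, but $\tilde{U}\tilde{U}^T$ is only an $n\times n$ orthogonal projection, not the identity:

```python
import numpy as np

rng = np.random.default_rng(4)
n, m, r = 200, 10, 4  # assumed sizes for illustration
X = rng.standard_normal((n, m))
U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_r, V_r = U[:, :r], Vt[:r, :].T

# Columns remain orthonormal: U_r^T U_r = I_r and V_r^T V_r = I_r
print(np.allclose(U_r.T @ U_r, np.eye(r)))  # True
print(np.allclose(V_r.T @ V_r, np.eye(r)))  # True

# But U_r U_r^T is an n x n projection matrix, not the identity
print(np.allclose(U_r @ U_r.T, np.eye(n)))  # False
```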