What norm of the reconstruction error is minimized by the low-rank approximation matrix obtained with PCA?

Given a PCA (or SVD) approximation of a matrix $X$ by a matrix $\hat X$, we know that $\hat X$ is the best low-rank approximation of $X$.

Is this according to the induced norm $\parallel \cdot \parallel_2$ (i.e. the largest eigenvalue norm) or according to the Frobenius norm $\parallel \cdot \parallel_F$?

pca svd matrix-decomposition

— Donbeo

One-word answer: Both.

Let's start by defining the norms. For a matrix $X$, the operator $2$-norm is defined as

$\|X\|_2 = \mathrm{sup}\frac{\|Xv\|_2}{\|v\|_2} = \mathrm{max}(s_i)$

and the Frobenius norm as

$\|X\|_F = \sqrt {\sum_{ij} X_{ij}^2} = \sqrt{\mathrm{tr}(X^\top X)} = \sqrt{\sum s_i^2},$

where $s_i$ are the singular values of $X$, i.e. the diagonal elements of $S$ in the singular value decomposition $X = USV^\top$.
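As a quick numerical sanity check of these two definitions, here is a minimal NumPy sketch. The random matrix, the seed, and the variable names are assumptions made for illustration; they are not part of the original answer.

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed seed; the data are purely illustrative
X = rng.normal(size=(7, 4))

# Singular values of X, returned in decreasing order
s = np.linalg.svd(X, compute_uv=False)

op_norm = np.linalg.norm(X, 2)          # operator 2-norm
fro_norm = np.linalg.norm(X, 'fro')     # Frobenius norm

# The supremum definition: no ratio ||Xv|| / ||v|| exceeds the operator norm
vs = rng.normal(size=(4, 1000))
ratios = np.linalg.norm(X @ vs, axis=0) / np.linalg.norm(vs, axis=0)

print(op_norm, s.max())
print(fro_norm, np.sqrt(np.sum(s**2)), np.sqrt(np.trace(X.T @ X)))
```

The operator norm only sees the largest singular value, while the Frobenius norm aggregates all of them, so the first is never larger than the second.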

PCA is given by the same singular value decomposition when the data are centered. $US$ are the principal components, $V$ are the principal axes, i.e. eigenvectors of the covariance matrix, and the reconstruction of $X$ with only the $k$ principal components corresponding to the $k$ largest singular values is given by $X_k = U_k S_k V_k^\top$.
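The rank-$k$ reconstruction can be sketched in NumPy as follows. The data, the seed, and the choice $k=2$ are assumptions for illustration only. The sketch also reports both error norms of the truncated SVD, which come out as $s_{k+1}$ for the operator $2$-norm and $\sqrt{\sum_{i>k} s_i^2}$ for the Frobenius norm.

```python
import numpy as np

rng = np.random.default_rng(0)            # fixed seed; illustrative data only
X = rng.normal(size=(50, 6))
X = X - X.mean(axis=0)                    # center the columns, as PCA requires

# Thin SVD: X = U @ diag(s) @ Vt, singular values in decreasing order
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]  # X_k = U_k S_k V_k^T

# The same matrix, obtained by projecting X onto the first k principal axes
X_k_proj = X @ Vt[:k].T @ Vt[:k]

err_2 = np.linalg.norm(X - X_k, 2)        # operator 2-norm of the error
err_F = np.linalg.norm(X - X_k, 'fro')    # Frobenius norm of the error

print(err_2, s[k])                        # s[k] is s_{k+1} in 1-based notation
print(err_F, np.sqrt(np.sum(s[k:]**2)))
```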

The Eckart-Young theorem says that $X_k$ is the matrix minimizing the norm of the reconstruction error $\|X-A\|$ among all matrices $A$ of rank $k$. This is true for both the Frobenius norm and the operator $2$-norm. As pointed out by @cardinal in the comments, it was first proved by Schmidt (of Gram-Schmidt fame) in 1907 for the Frobenius case. It was later rediscovered by Eckart and Young in 1936 and is now mostly associated with their names. Mirsky generalized the theorem in 1958 to all norms that are invariant under unitary transformations, and this includes the operator $2$-norm.

This theorem is sometimes called the Eckart-Young-Mirsky theorem. Stewart (1993) calls it the Schmidt approximation theorem. I have even seen it called the Schmidt-Eckart-Young-Mirsky theorem.

Eckart and Young, 1936, The approximation of one matrix by another of lower rank
Mirsky, 1958, Symmetric gauge functions and unitarily invariant norms
Stewart, 1993, On the early history of the singular value decomposition

Proof for the operator $2$-norm

Let $X$ be of full rank $n$. As $A$ is of rank $k$, its null space has $n-k$ dimensions. The space spanned by the $k+1$ right singular vectors of $X$ corresponding to the largest singular values has $k+1$ dimensions. So these two spaces must intersect. Let $w$ be a unit vector from the intersection. Then we get:

$\|X-A\|^2_2 \ge \|(X-A)w\|^2_2 = \|Xw\|^2_2 = \sum_{i=1}^{k+1}s_i^2(v_i^\top w)^2 \ge s_{k+1}^2 = \|X-X_k\|_2^2,$

QED. Here the first equality uses $Aw=0$, and the last inequality uses $\sum_{i=1}^{k+1}(v_i^\top w)^2=1$.
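The construction in this proof can be replayed numerically. The following is a minimal sketch under stated assumptions: the random $X$, the random rank-$k$ competitor `A`, the seed, and all variable names are hypothetical, and `n_cols` plays the role of $n$ in the proof. It finds a unit vector `w` lying both in the null space of `A` and in the span of the top $k+1$ right singular vectors of `X`, then evaluates each term of the inequality chain.

```python
import numpy as np

rng = np.random.default_rng(1)            # fixed seed; illustrative data only
n_rows, n_cols, k = 8, 5, 2
X = rng.normal(size=(n_rows, n_cols))
# An arbitrary competitor of rank k (product of two thin random factors)
A = rng.normal(size=(n_rows, k)) @ rng.normal(size=(k, n_cols))

U, s, Vt = np.linalg.svd(X, full_matrices=False)
V_top = Vt[:k + 1].T                      # top k+1 right singular vectors of X

# Look for w = V_top @ c with A @ w = 0. The matrix A @ V_top has k+1 columns
# but rank at most k, so its smallest singular value is (numerically) zero.
_, sv, Ct = np.linalg.svd(A @ V_top)
c = Ct[-1]                                # unit coefficient vector, (A @ V_top) @ c ~ 0
w = V_top @ c                             # unit vector in the intersection

lhs = np.linalg.norm(X - A, 2)**2         # ||X - A||_2^2
mid = np.linalg.norm((X - A) @ w)**2      # ||(X - A) w||^2, equal to ||X w||^2
weighted = np.sum(s[:k + 1]**2 * c**2)    # sum_i s_i^2 (v_i^T w)^2, since v_i^T w = c_i
rhs = s[k]**2                             # s_{k+1}^2 = ||X - X_k||_2^2

print(lhs, mid, weighted, rhs)
```

Any other rank-$k$ matrix could be substituted for `A`; the chain `lhs >= mid == weighted >= rhs` should still hold up to floating-point tolerance.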

Proof for the Frobenius norm

We want to find a matrix $A$ of rank $k$ that minimizes $\|X-A\|^2_F$. We can factorize $A=BW^\top$, where $W$ has $k$ orthonormal columns. Minimizing $\|X-BW^\top\|^2$ for fixed $W$ is a regression problem with solution $B=XW$. Plugging it in, we see that we now need to minimize

$\|X-XWW^\top\|^2=\|X\|^2-\|XWW^\top\|^2=\mathrm{const}-\mathrm{tr}(WW^\top X^\top XWW^\top)\\=\mathrm{const}-\mathrm{const}\cdot\mathrm{tr}(W^\top\Sigma W),$

where $\Sigma$ is the covariance matrix of $X$, i.e. $\Sigma=X^\top X/(n-1)$. This means that the reconstruction error is minimized by taking as columns of $W$ some $k$ orthonormal vectors maximizing the total variance of the projection.

These are the first $k$ eigenvectors of the covariance matrix. Indeed, if $X=USV^\top$, then $\Sigma=VS^2V^\top/(n-1)=V\Lambda V^\top$. Writing $R=V^\top W$, which also has orthonormal columns, we get

$\mathrm{tr}(W^\top\Sigma W)=\mathrm{tr}(R^\top\Lambda R)=\sum_i \lambda_i \sum_j R_{ij}^2 \le \sum_{i=1}^k \lambda_i,$

with maximum achieved when $W=V_k$. The inequality holds because the weights $\sum_j R_{ij}^2$ lie between $0$ and $1$ and sum to $k$, so the weighted sum is largest when all the weight sits on the $k$ largest eigenvalues. The theorem then follows immediately.
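The trace bound and its link to the reconstruction error can be checked numerically. This is a minimal sketch, not a proof: the data, the seed, the helper names `explained` and `recon_error`, and the use of QR factorization to draw random orthonormal $W$ are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)            # fixed seed; illustrative data only
n, p, k = 100, 6, 3
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))   # correlated columns
X = X - X.mean(axis=0)                    # center the data

Sigma = X.T @ X / (n - 1)                 # covariance matrix
_, s, Vt = np.linalg.svd(X, full_matrices=False)
lam = s**2 / (n - 1)                      # eigenvalues of Sigma, decreasing
V_k = Vt[:k].T                            # first k principal axes

def explained(W):
    """Total variance of the projection, tr(W^T Sigma W)."""
    return np.trace(W.T @ Sigma @ W)

def recon_error(W):
    """Squared Frobenius reconstruction error ||X - X W W^T||^2."""
    return np.linalg.norm(X - X @ W @ W.T, 'fro')**2

best_var = explained(V_k)
best_err = recon_error(V_k)

# Compare against random matrices W with k orthonormal columns
trials = []
for _ in range(200):
    W, _ = np.linalg.qr(rng.normal(size=(p, k)))
    trials.append((explained(W), recon_error(W)))

print(best_var, lam[:k].sum())
print(best_err, min(err for _, err in trials))
```

Random search cannot certify optimality, but none of the sampled $W$ should beat $V_k$ on either quantity.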

See the following three related threads:

Earlier attempt of a proof for Frobenius norm

This proof I found somewhere online but it is wrong (contains a gap), as explained by @cardinal in the comments.

Frobenius norm is invariant under unitary transformations, because they do not change the singular values. So we get:

$\|X-A\|_F=\|USV^\top - A\| = \|S - U^\top A V\| = \|S-B\|,$

where $B=U^\top A V$. Continuing:

$\|X-A\|_F^2 = \sum_{ij}(S_{ij}-B_{ij})^2 = \sum_i (s_i-B_{ii})^2 + \sum_{i\ne j}B_{ij}^2.$

This is minimized when all off-diagonal elements of $B$ are zero and all $k$ diagonal terms cancel out the $k$ largest singular values $s_i$ [gap here: this is not obvious], i.e. $B_\mathrm{optimal}=S_k$ and hence $A_\mathrm{optimal} = U_k S_k V_k^\top$.

— amoeba says Reinstate Monica

The proof in the case of the Frobenius norm is not correct (or at least complete) since the argument here does not preclude the possibility that a matrix of the same rank could cancel out some of the other diagonal terms while having "small" off-diagonals. To see the gap more clearly note that holding the diagonals constant and "zeroing" the off-diagonals can often increase the rank of the matrix in question!

— cardinal

Note also that the SVD was known to Beltrami (at least in a quite general, though special case) and Jordan as early as 1874.

— cardinal

@cardinal: Hmmmm, I am not sure I see the gap. If $B$ cancels out some other diagonal terms in $S$ instead of $k$ largest ones and has some nonzero off-diagonal terms instead, then both sums, $\sum_{i}(s_i-B_{ii})^2$ and $\sum_{i\ne j}B_{ij}^2$, are going to increase. So it will only increase the reconstruction error. No? Still, I tried to find another proof for Frobenius norm in the literature, and have read that it should somehow follow easily from the operator norm case. But so far I don't see how it should follow...

— amoeba says Reinstate Monica

I do like G. W. Stewart (1993), On the early history of the singular value decomposition, SIAM Review, vol. 35, no. 4, 551-566 and, given your prior demonstrated interest in historical matters, I think you will too. Unfortunately, I think Stewart is unintentionally overly dismissive of the elegance of Schmidt's 1907 proof. Hidden within it is a regression interpretation that Stewart overlooks and which is really quite pretty. There is another proof that follows the initial diagonalization approach you take, but which requires some extra work to fill the gap. (cont.)

— cardinal

@cardinal: Yes, you are right, now I see the gap too. Thanks a lot for the Stewart paper, that was a very interesting read. I see that Stewart presents Schmidt's and Weyl's proofs, but both of them look more complicated than what I would like to copy here (and so far I have not had the time to study them carefully). I am surprised: I expected this to be a very simple result, but it seems it is less trivial than I thought. In particular, I would not have expected that the Frobenius case is so much more complicated than the operator norm one. I will edit the post now. Happy New Year!

— amoeba says Reinstate Monica
