(Moved from the comments to an answer as requested by @Greenparker)
Part 1)
The $\log p$ term comes from (Gaussian) concentration of measure. In particular, if you have $p$ IID Gaussian random variables [F1], their maximum is on the order of $\sigma\sqrt{\log p}$ with high probability.
The $n^{-1}$ factor just comes from the fact that you are looking at the average prediction error, i.e., it matches the $n^{-1}$ on the other side; if you looked at the total error, it wouldn't be there.
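As a quick numeric sketch of that concentration claim (mine, not part of the original argument): the maximum of $p$ IID standard Gaussians sits near $\sqrt{2\log p}$, so even a huge increase in the number of (pure noise) features barely moves it.
## Sketch: max of p IID standard Gaussians vs. the sqrt(2 * log(p)) prediction
set.seed(42)
p_grid <- 10^(2:6)
observed_max <- sapply(p_grid, function(p) max(rnorm(p)))
round(data.frame(p = p_grid,
                 observed_max = observed_max,
                 sqrt_2_log_p = sqrt(2 * log(p_grid))), 2)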
Part 2)
Essentially, you have two forces you need to control:
- i) the good properties of having more data (so we want $n$ to be big);
- ii) the difficulties of having more (irrelevant) features (so we want $p$ to be small).
In classical statistics, we typically fix $p$ and let $n$ go to infinity: this regime is not terribly useful for high-dimensional theory because it is (asymptotically) in the low-dimensional regime by construction.
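To make that classical regime concrete, here's a tiny sketch (my own, not from the original answer): with $p$ fixed and $n$ growing, the in-sample prediction error of OLS behaves like $\sigma^2 p / n$ and so vanishes, no fancy theory required.
## Sketch: classical asymptotics, p fixed, n -> infinity.
## The in-sample prediction error of OLS is about sigma^2 * p / n, so it -> 0.
set.seed(1)
p <- 10
for (n in c(100, 1000, 10000)) {
    X <- matrix(rnorm(n * p), ncol = p)
    beta <- runif(p, 2, 3)
    y <- as.numeric(X %*% beta + rnorm(n))
    fit <- lm(y ~ X - 1)
    cat(sprintf("n = %5d: mean prediction error = %.4f (p/n = %.4f)\n",
                n, mean((as.numeric(X %*% beta) - fitted(fit))^2), p / n))
}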
Alternatively, we could let $p$ go to infinity with $n$ held fixed, but then our error just blows up as the problem becomes essentially impossible. Depending on the problem, the error may go to infinity or stop at some natural upper bound (e.g., 100% classification error).
Since both of these cases are somewhat useless, we instead consider $n, p$ both going to infinity, so that our theory is relevant (stays high-dimensional) without being apocalyptic (infinitely many features, finite data).
Having two "knobs" is generally harder than having a single knob, so we fix $p = f(n)$ for some fixed $f$ and let $n$ go to infinity (so $p$ goes to infinity indirectly). [F2] The choice of $f$ determines the behavior of the problem. For the reasons in my answer to Part 1, it turns out that the "badness" of the extra features only grows as $\log p$, while the "goodness" of the extra data grows as $n$.
- If $\log p / n$ stays constant (equivalently, $p = f(n) = \Theta(C^n)$ for some $C$), we tread water and the problem is a wash (the error stays fixed asymptotically);
- if $\log p / n \to 0$ ($p = o(C^n)$), we asymptotically achieve zero error;
- and if $\log p / n \to \infty$ ($p = \omega(C^n)$), the error eventually goes to infinity.
This last regime is sometimes called the "ultra-high-dimensional" regime in the literature. The term "ultra-high-dimensional" does not have a rigorous definition as far as I know, but informally it is just "the regime that breaks the lasso and similar estimators."
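To make the three growth rates above concrete, here's a small numeric sketch (mine; the particular growth functions and the base C = 1.05 are arbitrary choices). Everything is computed on the log scale so nothing overflows:
## Sketch: behaviour of log(p)/n under the three regimes
n <- c(100, 1000, 10000, 100000)
C <- 1.05
data.frame(n          = n,
           Theta_C_n  = n * log(C) / n,       # p = C^n      : stays at log(C)
           o_C_n      = 2 * log(n) / n,       # p = n^2      : -> 0
           omega_C_n  = n^1.5 * log(C) / n)   # p = C^(n^1.5): -> infinity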
We can see these regimes in action with a small simulation study: in each case below, I generate data from a sparse linear model (10 non-zero coefficients) and fit the lasso with $\lambda = 3\sqrt{\log(p)/n}$.
First consider a case where $p = f(n) = 3n$. This is in the 'tractable' high-dimensional regime described above and, as theory predicts, we see the prediction error converge to zero:
Code to reproduce:
library(glmnet)
library(ggplot2)
# Standard High-Dimensional Asymptotics: log(p) / n -> 0
N <- c(50, 100, 200, 400, 600, 800, 1000, 1100, 1200, 1300)
P <- 3 * N
ERROR_HD <- data.frame()
for(ix in seq_along(N)){
    n <- N[ix]
    p <- P[ix]
    PMSE <- replicate(20, {
        X <- matrix(rnorm(n * p), ncol=p)
        beta <- rep(0, p)
        beta[1:10] <- runif(10, 2, 3)
        y <- X %*% beta + rnorm(n)
        g <- glmnet(X, y)
        ## Cf. Theorem 7.2 of Bickel et al., AOS 37(4), pp. 1705-1732, 2009 [BRT]:
        ## lambda ~ 2*sqrt(2) * sqrt(log(p)/n)
        ## is a good scaling for controlling the prediction error of the lasso
        err <- X %*% beta - predict(g, newx=X, s=3 * sqrt(log(p)/n))
        mean(err^2)
    })
    ERROR_HD <- rbind(ERROR_HD, data.frame(PMSE=PMSE, n=n, p=p))
}

ggplot(ERROR_HD, aes(x=n, y=PMSE)) + geom_point() + theme_bw() +
    xlab("Number of Samples (n)") +
    ylab("Mean Prediction Error (at observed design points)") +
    ggtitle("Prediction Error Converging to 0 under High-Dim Asymptotics") +
    scale_x_continuous(sec.axis = sec_axis(~ 3 * ., name="Number of Features (p)")) +
    scale_y_log10()
We can compare this to the case where $\log p / n$ stays approximately constant: I call this the "borderline" ultra-high-dimensional regime, but that's not a standard term:
P <- 10 + ceiling(exp(N/120))
Here we see that the prediction error (using the same design as above) levels off instead of continuing to zero.
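As a quick check (mine, not part of the original simulation), this choice of P makes $\log p / n$ settle at roughly $1/120 \approx 0.0083$ as $n$ grows (the +10 only matters at small $n$), rather than heading to zero:
## Check: log(p)/n is asymptotically constant (about 1/120) for this P
N <- c(50, 100, 200, 400, 600, 800, 1000, 1100, 1200, 1300)
P_borderline <- 10 + ceiling(exp(N / 120))
round(log(P_borderline) / N, 4)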
If we set P to grow faster than $e^n$ (e.g., $e^{n^2}$), the prediction error increases without bound. This $e^{n^2}$ growth is ludicrously fast and leads to enormous computational and numerical problems, so here's a slightly slower, but still UHD, example:
P <- 10 + ceiling(exp(N^(1.03)/120))
(I used a sparse random X for speed, so don't try to compare the numbers with the other plots directly.) It's hard to see any uptick in this graph, perhaps because we kept the UHD growth from being too "ultra" in the name of computational time. Using a larger exponent (like $e^{n^{1.5}}$) would make the asymptotic growth a bit clearer.
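For completeness, here's a rough sketch of how the sparse-design version might look. The rsparsematrix() call, the density, the shortened N grid, and the reduced number of replications are my guesses to keep the run time manageable; this is not the author's exact code.
## Rough sketch of the UHD experiment with a sparse random design
## (assumed parameters, chosen for tractable run time)
library(glmnet)
library(Matrix)

N <- c(200, 400, 600, 800, 1000)
P <- 10 + ceiling(exp(N^(1.03) / 120))

ERROR_UHD <- data.frame()
for (ix in seq_along(N)) {
    n <- N[ix]
    p <- P[ix]
    PMSE <- replicate(5, {
        X <- rsparsematrix(n, p, density = 0.1)   # sparse design, for speed
        beta <- rep(0, p)
        beta[1:10] <- runif(10, 2, 3)
        mu <- as.numeric(X %*% beta)              # true mean at the design points
        y <- mu + rnorm(n)
        g <- glmnet(X, y)
        err <- mu - predict(g, newx = X, s = 3 * sqrt(log(p) / n))
        mean(err^2)
    })
    ERROR_UHD <- rbind(ERROR_UHD, data.frame(PMSE = PMSE, n = n, p = p))
}
Plotting PMSE against n as before should show the error flattening out and, with a larger exponent in P, eventually creeping upward.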
Despite what I said above and how it may appear, the ultra-high-dimensional regime is not actually completely hopeless (though it's close), but it requires much more sophisticated techniques than just a simple max of Gaussian random variables to control the error. The need to use these complex techniques is the ultimate source of the complexity you note.
There's no particular reason to think that $p, n$ should grow "together" in any way (i.e., there's no obvious "real-world" reason to fix $p = f(n)$), but math generally lacks the language and tools for discussing limits with two "degrees of freedom," so it's the best we can do (for now!).
Part 3)
I'm afraid I don't know of any books in the statistical literature that really focus on the growth of $\log p$ vs. $n$ explicitly. (There may be something in the compressive sensing literature.)
My current favorite reference for this kind of theory is Chapters 10 and 11 of Statistical Learning with Sparsity [F3], but it generally takes the approach of considering $n, p$ fixed and giving finite-sample (non-asymptotic) guarantees of getting a "good" result. This is actually a more powerful approach: once you have the result for any $n, p$, it's easy to consider the asymptotics. But these results are generally harder to derive, so we currently only have them for lasso-type estimators, as far as I know.
If you're comfortable and willing to delve into research literature, I'd look at works by Jianqing Fan and Jinchi Lv, who have done most of the foundational work on ultra-high-dimensional problems. ("Screening" is a good term to search on)
[F1] Actually, any subgaussian random variable, but this doesn't add that much to this discussion.
[F2] We might also let the "true" sparsity $s$ depend on $n$ ($s = g(n)$), but that doesn't change things too much.
[F3] T. Hastie, R. Tibshirani, and M. Wainwright. Statistical Learning with Sparsity. Monographs on Statistics and Applied Probability 143. CRC Press, 2015. Available for free download at https://web.stanford.edu/~hastie/StatLearnSparsity_files/SLS.pdf
[BRT] Peter J. Bickel, Ya'acov Ritov, and Alexandre B. Tsybakov. "Simultaneous Analysis of Lasso and Dantzig Selector." Annals of Statistics 37(4), p. 1705-1732, 2009. http://dx.doi.org/10.1214/08-AOS620