How to do piecewise linear regression with multiple unknown knots?

14

Are there any packages for piecewise linear regression that can detect multiple knots automatically? Thanks. When I use the strucchange package, I cannot detect the change points, and I do not understand how it detects them. From the plots I can see several change points that I would like it to pick out for me. Could anyone give an example here?

regression change-point

— Honglang Wang

1

This appears to be the same question as stats.stackexchange.com/questions/5700/… . If it differs in some substantial way, please let us know by editing your question to reflect the differences; otherwise, we will close it as a duplicate.

— whuber

1

I have edited the question.

— Honglang Wang

1

I think you can do this as a non-linear optimisation problem. Just write the equation of the function to be fitted, with the coefficients and knot locations as parameters.

— mark999
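A minimal sketch of the idea in the comment above, in base R (this is an illustration, not mark999's own code). It assumes the number of knots is fixed at two and uses simulated data; `hinge_design` and `sse_for_knots` are hypothetical helper names. For fixed knots the coefficients are ordinary least squares, so only the knot locations are searched with `optim`, which may settle in a local minimum if the starting values are poor.

```r
set.seed(1)
x <- seq(0, 10, length.out = 200)
# Simulated continuous piecewise-linear signal with true knots at 3 and 7
y <- 1 + 0.5 * x - 1.5 * pmax(x - 3, 0) + 2 * pmax(x - 7, 0) + rnorm(200, sd = 0.2)

# Design matrix: intercept, slope, and one hinge term per knot
hinge_design <- function(x, knots) {
  cbind(1, x, sapply(knots, function(k) pmax(x - k, 0)))
}

# Sum of squared errors as a function of the knot locations only;
# the linear coefficients are profiled out by least squares
sse_for_knots <- function(knots, x, y) {
  if (any(knots <= min(x)) || any(knots >= max(x))) return(1e10)  # keep knots inside the data
  sum(lm.fit(hinge_design(x, knots), y)$residuals^2)
}

# Nelder-Mead search over the two knot locations, started from a rough guess
opt <- optim(par = c(2, 8), fn = sse_for_knots, x = x, y = y)
knots_hat <- sort(opt$par)
coef_hat <- lm.fit(hinge_design(x, knots_hat), y)$coefficients
knots_hat
```

Trying several starting values and keeping the fit with the smallest SSE is a cheap safeguard against local minima.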

1

I think the segmented package is what you are looking for.

— AlefSin

1

I had the same problem and solved it with the segmented R package: stackoverflow.com/a/18715116/857416

— a different ben
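For readers who want to try the segmented package mentioned in the two comments above, here is a minimal sketch on simulated data (the data and variable names are made up for illustration). Note that `segmented()` is given starting guesses for the breakpoints through `psi`, so the number of breakpoints is assumed known here; it refines their locations rather than discovering how many there are.

```r
library(segmented)

set.seed(1)
d <- data.frame(x = seq(0, 10, length.out = 200))
# Simulated signal with true breakpoints at x = 3 and x = 7
d$y <- 1 + 0.5 * d$x - 1.5 * pmax(d$x - 3, 0) + 2 * pmax(d$x - 7, 0) +
  rnorm(200, sd = 0.2)

# Start from an ordinary linear model, then let segmented() add breakpoints in x
base_fit <- lm(y ~ x, data = d)
seg_fit <- segmented(base_fit, seg.Z = ~x, psi = c(2, 8))

# Estimated breakpoint locations and the slope of each segment
est <- seg_fit$psi[, "Est."]
est
slope(seg_fit)
```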

8

Would MARS be applicable? R has the package earth that implements it.

— Wayne
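As a rough sketch of what that looks like in practice (simulated data, illustrative variable names), the earth package fits a MARS model built from hinge functions and chooses the knots itself. Reading the knot locations from the `cuts` and `selected.terms` components of the fitted object is my understanding of the object layout, so treat that line as an assumption and check it against the package documentation.

```r
library(earth)

set.seed(1)
d <- data.frame(x = seq(0, 10, length.out = 200))
# Simulated signal with true knots at x = 3 and x = 7
d$y <- 1 + 0.5 * d$x - 1.5 * pmax(d$x - 3, 0) + 2 * pmax(d$x - 7, 0) +
  rnorm(200, sd = 0.2)

# MARS fit: knots are added in a forward pass, then pruned back using GCV
mars_fit <- earth(y ~ x, data = d)
summary(mars_fit)  # lists the retained hinge terms

# Knot locations of the retained terms (dropping the intercept term)
knots <- mars_fit$cuts[mars_fit$selected.terms[-1], 1]
sort(unique(knots))
```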

8

In general, it's a bit odd to want to fit something as piecewise linear. However, if you really wish to do so, the MARS algorithm is the most direct route. It builds up a function one knot at a time and then usually prunes back the number of knots to combat over-fitting, à la decision trees. You can access the MARS algorithm in R via earth or mda. In general it is fitted with GCV, which is not so far removed from the other information criteria (AIC, BIC, etc.).

MARS won't really give you an "optimal" fit, since the knots are grown one at a time. It would be rather difficult to fit a truly "optimal" number of knots, because the number of possible permutations of knot placements quickly explodes.

Generally, this is why people turn to smoothing splines. Most smoothing splines are cubic just so you can fool the human eye into missing the discontinuities. It would be quite possible to do a linear smoothing spline, however. The big advantage of smoothing splines is that they have a single parameter to optimize. That allows you to quickly reach a truly "optimal" solution without having to search through gobs of permutations. However, if you really want to seek inflection points, and you have enough data to do so, then something like MARS would probably be your best bet.

Here is some example code for penalized linear smoothing splines in R:

library(mgcv)
data(iris)
# bs = 'ps' with m = 0 requests a piecewise-linear P-spline basis
gam.test <- gam(Sepal.Length ~ s(Petal.Width, k = 6, bs = 'ps', m = 0), data = iris)
summary(gam.test)
plot(gam.test)

The actual knots chosen would not necessarily correlate with any true inflection points, however.

— Shea Parkes

3

I programmed this from scratch a few years ago, and I have a Matlab file for doing piecewise linear regression on my computer. About 1 to 4 break points is computationally feasible for about 20 measurement points or so. 5 or 7 break points starts to be too much.

The pure mathematical approach, as I see it, is to try all possible combinations, as suggested by user mbq in the question linked in the comment below your question.

Since the fitted lines are all consecutive and adjacent (no overlaps), the combinatorics will follow Pascal's triangle. If there were overlaps between the data points used by the line segments, I believe the combinatorics would follow Stirling numbers of the second kind instead.
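To put numbers on the Pascal's-triangle remark, here is a small base-R sketch. It assumes a break point may fall in any of the n - 1 gaps between consecutive data points, so that k break points can be placed in choose(n - 1, k) ways; requiring a minimum number of points per segment would reduce these counts.

```r
n_points <- 20
n_breaks <- 1:7

# Number of ways to place k break points in the n - 1 gaps between points
combos <- data.frame(
  breakpoints = n_breaks,
  combinations = choose(n_points - 1, n_breaks)
)
combos
```

For 20 points this gives 19 placements for one break point, 3876 for four and 50388 for seven, each of which needs its own set of line fits.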

The best solution, in my mind, is to choose the combination of fitted lines that has the lowest standard deviation of the R^2 correlation values of the fitted lines. I will try to explain with an example. Keep in mind, though, that asking how many break points one should find in the data is similar to asking "How long is the coast of Britain?", as in one of the papers on fractals by Benoit Mandelbrot (a mathematician). And there is a trade-off between the number of break points and regression depth.

Now to the example.

Suppose we have the following idealized data, with $y$ given as a function of $x$:

$\begin{array}{|c|c|c|c|c|c|}
\hline
x & y & R^2 \text{ line 1} & R^2 \text{ line 2} & \text{sum of } R^2 \text{ values} & \text{standard deviation of } R^2 \\
\hline
1 & 1 & 1,000 & 0,0400 & 1,0400 & 0,6788 \\
2 & 2 & 1,000 & 0,0118 & 1,0118 & 0,6987 \\
3 & 3 & 1,000 & 0,0004 & 1,0004 & 0,7067 \\
4 & 4 & 1,000 & 0,0031 & 1,0031 & 0,7048 \\
5 & 5 & 1,000 & 0,0135 & 1,0135 & 0,6974 \\
6 & 6 & 1,000 & 0,0238 & 1,0238 & 0,6902 \\
7 & 7 & 1,000 & 0,0277 & 1,0277 & 0,6874 \\
8 & 8 & 1,000 & 0,0222 & 1,0222 & 0,6913 \\
9 & 9 & 1,000 & 0,0093 & 1,0093 & 0,7004 \\
10 & 10 & 1,000 & -1,978 & 1,000 & 0,7071 \\
11 & 9 & 0,9709 & 0,0271 & 0,9980 & 0,6673 \\
12 & 8 & 0,8951 & 0,1139 & 1,0090 & 0,5523 \\
13 & 7 & 0,7734 & 0,2558 & 1,0292 & 0,3659 \\
14 & 6 & 0,6134 & 0,4321 & 1,0455 & 0,1281 \\
15 & 5 & 0,4321 & 0,6134 & 1,0455 & 0,1282 \\
16 & 4 & 0,2558 & 0,7733 & 1,0291 & 0,3659 \\
17 & 3 & 0,1139 & 0,8951 & 1,0090 & 0,5523 \\
18 & 2 & 0,0272 & 0,9708 & 0,9980 & 0,6672 \\
19 & 1 & 0 & 1,000 & 1,000 & 0,7071 \\
20 & 2 & 0,0094 & 1,000 & 1,0094 & 0,7004 \\
21 & 3 & 0,0222 & 1,000 & 1,0222 & 0,6914 \\
22 & 4 & 0,0278 & 1,000 & 1,0278 & 0,6874 \\
23 & 5 & 0,0239 & 1,000 & 1,0239 & 0,6902 \\
24 & 6 & 0,0136 & 1,000 & 1,0136 & 0,6974 \\
25 & 7 & 0,0032 & 1,000 & 1,0032 & 0,7048 \\
26 & 8 & 0,0004 & 1,000 & 1,0004 & 0,7068 \\
27 & 9 & 0,0118 & 1,000 & 1,0118 & 0,6987 \\
28 & 10 & 0,04 & 1,000 & 1,04 & 0,6788 \\
\hline
\end{array}$

These y values have the graph:

[Figure: idealized data]

Which clearly has two break points. For the sake of argument we will calculate the R^2 correlation values (with the Excel cell formulas (European dot-comma style)):

=INDEX(LINEST(B1:$B$1;A1:$A$1;TRUE;TRUE);3;1)
=INDEX(LINEST(B1:$B$28;A1:$A$28;TRUE;TRUE);3;1)

for all possible non-overlapping combinations of two fitted lines. All the possible pairs of R^2 values have the graph:

[Figure: R^2 values]

The question is which pair of R^2 values should we choose, and how do we generalize to multiple break points as asked in the title? One choice is to pick the combination for which the sum of the R-square correlation is the highest. Plotting this we get the upper blue curve below:

[Figure: sum of R squared and standard deviation of R squared]

The blue curve, the sum of the R-squared values, is the highest in the middle. This is more clearly visible from the table with the value $1,0455$ as the highest value. However it is my opinion that the minimum of the red curve is more accurate. That is, the minimum of the standard deviation of the R^2 values of the fitted regression lines should be the best choice.
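The two-line calculation can be reproduced in a few lines of base R. This is a sketch rather than the Matlab or Excel code the answer refers to: it assumes, following the Excel formulas, that line 1 uses rows 1 to i and line 2 uses rows i to 28, and it skips the two end rows where one of the lines would have a single point and R^2 is undefined. The names `r2` and `res` are made up for the illustration.

```r
x <- 1:28
y <- c(1:10, 9:1, 2:10)  # the idealized data from the table

# R^2 of a straight-line fit equals the squared correlation of x and y
r2 <- function(idx) cor(x[idx], y[idx])^2

splits <- 2:27  # every line keeps at least two points
res <- data.frame(
  split = splits,
  r2_line1 = sapply(splits, function(i) r2(1:i)),
  r2_line2 = sapply(splits, function(i) r2(i:28))
)
res$sum_r2 <- res$r2_line1 + res$r2_line2
res$sd_r2 <- apply(res[, c("r2_line1", "r2_line2")], 1, sd)

# Split chosen by each of the two criteria discussed in the text
best_by_sum <- res$split[which.max(res$sum_r2)]
best_by_sd <- res$split[which.min(res$sd_r2)]
c(best_by_sum = best_by_sum, best_by_sd = best_by_sd)
```

With only two lines fitted to this data, both criteria land on the rows x = 14 or 15, which matches the highest sum and lowest standard deviation in the table.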

Piece wise linear regression - Matlab - multiple break points

— Mats Granvik

1

There is a pretty nice algorithm described in Tomé and Miranda (1984).

The proposed methodology uses a least-squares approach to compute the best continuous set of straight lines that fit a given time series, subject to a number of constraints on the minimum distance between breakpoints and on the minimum trend change at each breakpoint.

The code and a GUI are available in both Fortran and IDL from their website: http://www.dfisica.ubi.pt/~artome/linearstep.html

— arkaia

0

First of all, you must do this iteratively and under some information criterion such as AIC, AICc, BIC or Cp, because you can always get an "ideal" fit if the number of knots K equals the number of data points N.

1. Put K = 0: estimate a single regression (K + 1 = 1 segment) and calculate, for instance, AICc. Then fix a minimum number of data points per segment, say L = 3 or L = 4.
2. Put K = 1: start with the L-th data point as the first candidate knot and calculate SS or MLE; then move the knot forward one data point at a time, up to the (N - L)-th point. Choose the arrangement with the best fit (SS or MLE) and calculate AICc.
3. Put K = 2: reuse all previous regressions (that is, their SS or MLE), but split a single segment, step by step, into all possible parts. Choose the arrangement with the best fit (SS or MLE) and calculate AICc.
4. Continue in the same way; if the latest AICc is greater than the previous one, stop the iterations.

This is an optimal solution under the AICc criterion.
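The procedure described above can be sketched in base R roughly as follows. This is one possible reading, not Maciek's own code: segments are fitted independently (no continuity constraint), a break is stored as the index of the last point of a segment, and the AICc parameter count (two coefficients per segment, one per break point, plus the error variance) is an assumed convention. All function names are hypothetical.

```r
# Residual sum of squares of one straight line fitted to a segment
seg_rss <- function(x, y) sum(lm.fit(cbind(1, x), y)$residuals^2)

# Total RSS for a set of breaks (index of the last point of each segment)
total_rss <- function(x, y, breaks) {
  starts <- c(1, breaks + 1)
  ends <- c(breaks, length(x))
  sum(mapply(function(s, e) seg_rss(x[s:e], y[s:e]), starts, ends))
}

# Gaussian AICc with p estimated parameters (assumed counting convention)
aicc <- function(rss, n, p) {
  n * log(rss / n) + 2 * p + 2 * p * (p + 1) / (n - p - 1)
}

greedy_breaks <- function(x, y, min_len = 3, max_k = 10) {
  n <- length(x)
  breaks <- integer(0)
  best <- aicc(total_rss(x, y, breaks), n, p = 3)  # K = 0: one line
  for (k in seq_len(max_k)) {
    # Candidate breaks that keep every segment at least min_len points long
    cand <- setdiff(seq_len(n - 1), breaks)
    ok <- vapply(cand, function(b) {
      all(diff(sort(c(0, breaks, b, n))) >= min_len)
    }, logical(1))
    cand <- cand[ok]
    if (length(cand) == 0) break
    rss <- vapply(cand, function(b) total_rss(x, y, sort(c(breaks, b))),
                  numeric(1))
    score <- aicc(min(rss), n, p = 3 * k + 3)
    if (score >= best) break  # AICc got worse: stop
    best <- score
    breaks <- sort(c(breaks, cand[which.min(rss)]))
  }
  breaks
}

# Simulated example with one true break after x = 30
set.seed(42)
x <- 1:60
y <- ifelse(x <= 30, x, 30 - 2 * (x - 30)) + rnorm(60, sd = 0.5)
breaks <- greedy_breaks(x, y)
breaks
```

Because each step only splits an existing segment, an early break that is slightly misplaced is never revisited, so the result is optimal only in the stepwise sense.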

— Maciek

AIC and BIC can't be used because they penalise for extra parameters, which is clearly not the case here.

— HelloWorld

0

I once came across a program called Joinpoint. On their website they say it fits a joinpoint model where "several different lines are connected together at the 'joinpoints'". And further: "The user supplies the minimum and maximum number of joinpoints. The program starts with the minimum number of joinpoint (e.g. 0 joinpoints, which is a straight line) and tests whether more joinpoints are statistically significant and must be added to the model (up to that maximum number)."

The NCI uses it for trend modelling of cancer rates, maybe it fits your needs as well.

— psj

0

In order to fit a piecewise function to data:
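Judging from the eight parameters named just below, the function is presumably made of three straight-line segments with two break points. The following form is a reconstruction from that parameter list, not a formula copied from the referenced paper, and the handling of the boundaries is an assumption:

```latex
y(x) =
\begin{cases}
p_1 x + q_1, & x < a_1 \\
p_2 x + q_2, & a_1 \le x < a_2 \\
p_3 x + q_3, & x \ge a_2
\end{cases}
```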

where $a_1, a_2, p_1, q_1, p_2, q_2, p_3, q_3$ are unknown parameters to be approximately computed, there is a very simple method (not iterative, no initial guess, easy to code in any math computer language). The theory is given on page 29 of the paper https://fr.scribd.com/document/380941024/Regression-par-morceaux-Piecewise-Regression-pdf and from page 30:

For example, with the exact data provided by Mats Granvik the result is :

Without scattered data, this example is not very significant. Other examples with scattered data are shown in the referenced paper.

— JJacquelin

0

You can use the mcp package if you know the number of change points to infer. It gives you great modeling flexibility and a lot of information about the change points and regression parameters, but at the cost of speed.

The mcp website contains many applied examples, e.g.,

library(mcp)

# Define the model
model = list(
  response ~ 1,  # plateau (int_1)
  ~ 0 + time,    # joined slope (time_2) at cp_1
  ~ 1 + time     # disjoined slope (int_3, time_3) at cp_2
)

# Fit it. The `ex_demo` dataset is included in mcp
fit = mcp(model, data = ex_demo)

Then you can visualize:

plot(fit)

Or summarise:

summary(fit)

Family: gaussian(link = 'identity')
Iterations: 9000 from 3 chains.
Segments:
  1: response ~ 1
  2: response ~ 1 ~ 0 + time
  3: response ~ 1 ~ 1 + time

Population-level parameters:
    name match  sim  mean lower  upper Rhat n.eff
    cp_1    OK 30.0 30.27 23.19 38.760    1   384
    cp_2    OK 70.0 69.78 69.27 70.238    1  5792
   int_1    OK 10.0 10.26  8.82 11.768    1  1480
   int_3    OK  0.0  0.44 -2.49  3.428    1   810
 sigma_1    OK  4.0  4.01  3.43  4.591    1  3852
  time_2    OK  0.5  0.53  0.40  0.662    1   437
  time_3    OK -0.2 -0.22 -0.38 -0.035    1   834

Disclaimer: I am the developer of mcp.

— Jonas Lindeløv

The use of "detect" in the question indicates the number--and even the existence--of changepoints are not known beforehand.

— whuber