Coefficient of Determination (r^2)

I want to fully understand the idea of describing the amount of variation between variables. Every explanation on the web is a bit mechanical and blunt. I want to "get" the concept, not just use the number mechanically.

Misalnya: Jam belajar vs skor tes

r = .8

r^2 = .64

  • So, what does this mean?
  • 64% of the variability of test scores can be explained by hours?
  • How do we know that just by squaring?

Your question is not about r vs. r^2 (you understand that 0.8^2 = 0.64); it is about the interpretation of r^2. Please reformulate the title.
robin girard


@amoeba agreed, I pulled the tag.
Brett

You need n to determine the significance. Also see stats.stackexchange.com/a/265924/99274.
Carl

Answers:



Start with the basic idea of variation. Your baseline model is just the mean of Y, so the total variation is the sum of the squared deviations from that mean. The R^2 value is the proportion of that variation that is accounted for by using an alternative model. For example, R^2 tells you how much of the variation in Y you can get rid of by summing up the squared distances from a regression line rather than from the mean.

I think this is made perfectly clear if we think about the simple regression problem plotted out. Consider a typical scatterplot where you have a predictor X along the horizontal axis and a response Y along the vertical axis.

The mean is a horizontal line on the plot where Y is constant. The total variation in Y is the sum of squared differences between the mean of Y and each individual data point. It's the distance between the mean line and every individual point squared and added up.

You can also calculate another measure of variability after you have the regression line from the model. This is the difference between each Y point and the regression line. Rather than each (Y - the mean) squared we get (Y - the point on the regression line) squared.

If the regression line is anything but horizontal, we're going to get less total distance when we use this fitted regression line rather than the mean; that is, there is less unexplained variation. The ratio of the variation explained (the reduction in squared distance) to the original variation is your R^2. It's the proportion of the original variation in your response that is explained by fitting that regression line.
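That ratio can be computed directly by hand. A minimal sketch in plain Python, using made-up hours/scores numbers (not from the question), no libraries assumed:

```python
# Made-up illustrative data: hours studied vs. test score
hours = [1, 2, 3, 4, 5]
scores = [52, 60, 61, 68, 74]

n = len(hours)
mean_x = sum(hours) / n
mean_y = sum(scores) / n

# Least-squares slope and intercept for the simple regression line
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, scores)) / \
        sum((x - mean_x) ** 2 for x in hours)
intercept = mean_y - slope * mean_x

# Original variation: squared distances from the mean line (SST)
sst = sum((y - mean_y) ** 2 for y in scores)

# Leftover variation: squared distances from the regression line (SSE)
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(hours, scores))

# R^2 is the proportion of the original variation that the line removes
r_squared = 1 - sse / sst
print(round(r_squared, 3))  # prints 0.966
```

Here the mean line leaves 280 units of squared distance, the fitted line leaves only 9.6, so about 97% of the variation in scores is explained by hours.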

*(figure: scatterplot with the mean line, the fitted regression line, and segments from the line to each point; original image link broken)*

Here is some R code for a graph with the mean, the regression line, and segments from the regression line to each point to help visualize:

library(ggplot2)
data(faithful)

# Average eruption length at each waiting time, so each x has one point
plotdata <- aggregate(eruptions ~ waiting, data = faithful, FUN = mean)

# Fit the simple regression line
linefit1 <- lm(eruptions ~ waiting, data = plotdata)

plotdata$expected <- predict(linefit1)        # fitted value at each point
plotdata$sign <- residuals(linefit1) > 0      # colour segments above/below the line

p <- ggplot(plotdata, aes(y = eruptions, x = waiting, xend = waiting, yend = expected))

p + geom_point(shape = 1, size = 3) +
    geom_smooth(method = lm, se = FALSE) +               # the regression line
    geom_segment(aes(colour = sign), data = plotdata) +  # residual segments
    theme(legend.position = "none") +
    geom_hline(yintercept = mean(plotdata$eruptions), size = 1)  # the mean line

> The ratio between the variation explained and the original variation is your R^2

Let's see if I got this. If the original variation from the mean totals 100, and the regression variation totals 20, then the ratio = 20/100 = .2. You're saying R^2 = .2 because 20% of the mean variation (red) is accounted for by the explained variation (green). (In the case of r = 1:) If the original variation totals 50, and the regression variation totals 0, then the ratio = 0/50 = 0, i.e. 0% of the variation from the mean (red) is accounted for by the explained variation (green). I'd expect R^2 to be 1, not 0.
JackOfAll

R^2 = 1 - (SSR/SST), or equivalently (SST - SSR)/SST. So, in your examples, R^2 = .80 and 1.00. The difference between the regression line and each point is what is left UNexplained by the fit. The rest is the proportion explained. Otherwise, that's exactly right.
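That correction can be checked directly against the numbers in the comment above (a trivial sketch in plain Python; SSR here denotes the residual sum of squares, following the comment's notation):

```python
def r_squared(sst, ssr):
    """R^2 = 1 - SSR/SST: the share of the total variation NOT left unexplained."""
    return 1 - ssr / sst

# The commenter's two scenarios: the ratio SSR/SST is the UNexplained share,
# so R^2 is one minus that ratio, not the ratio itself.
print(r_squared(100, 20))  # prints 0.8
print(r_squared(50, 0))    # prints 1.0
```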
Brett

I edited that last paragraph to try to make it a bit clearer. Conceptually (and computationally) all you need is there. It might be clearer to actually add the formula and refer to SST, SSE, and SSR, but I was trying to get at it conceptually.
Brett

i.e.: R^2 is the proportion of the total variation from the mean (SST) that is the difference between the expected regression value and the mean value (SSE). In my example of hours vs. score, the regression value would be the expected test score based on the correlation with hours studied. Any additional variation from that is attributed to SSR. For a given point, the hours-studied variable/regression explained x% of the total variation from the mean (SST). With a high r value, "explained" is a big percentage of SST compared to SSR. With a low r value, "explained" is a lower percentage of SST compared to SSR.
JackOfAll

@BrettMagill, I think the link to the image is broken...
Garrett


A mathematical demonstration of the relationship between the two is here: Pearson's correlation and least squares regression analysis.

I am not sure if there is a geometric or any other intuition that can be offered apart from the math but if I can think of one I will update this answer.

Update: Geometric Intuition

Here is a geometric intuition I came up with. Suppose that you have two variables x and y which are mean centered. (Assuming mean-centered variables lets us ignore the intercept, which simplifies the geometric intuition a bit.) Let us first consider the geometry of linear regression. In linear regression, we model y as follows:

y = xβ + ϵ.

Consider the situation when we have two observations from the above data generating process given by the pairs (y1,y2) and (x1,x2). We can view them as vectors in two-dimensional space as shown in the figure below:

*(figure: y and x drawn as vectors in two-dimensional space; original image link broken)*

Thus, in terms of the above geometry, our goal is to find a β such that the vector xβ is as close as possible to the vector y. Note that different choices of β scale x appropriately. Let β^ be the value of β that gives our best possible approximation of y, and denote y^ = xβ^. Thus,

y = y^ + ϵ^

From a geometrical perspective we have three vectors: y, y^, and ϵ^. A little thought suggests that we must choose β^ such that the three vectors look like the ones below:

*(figure: the right triangle formed by y, y^ = xβ^, and ϵ^; original image link broken)*

In other words, we need to choose β such that the angle between xβ and ϵ^ is 90°.

So, how much variation in y have we explained with this projection of y onto the vector x? Since the data are mean centered, the variance of y equals (y1² + y2²), which is the square of the distance between the point y and the origin. The variation in y^ is similarly the squared distance from the point y^ to the origin, and so on.

By the Pythagorean theorem, we have:

||y||² = ||y^||² + ||ϵ^||²

Therefore, the proportion of the variance explained by x is ||y^||² / ||y||². Notice also that cos(θ) = ||y^|| / ||y||, and the wiki tells us that the geometrical interpretation of correlation is that it equals the cosine of the angle between the mean-centered vectors.

Therefore, we have the required relationship:

(Correlation)² = proportion of variation in y explained by x.

Hope that helps.
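The Pythagorean identity above can also be checked numerically. A small sketch in plain Python with made-up, mean-centered data: project y onto x, then compare ||y^||²/||y||² with the squared correlation.

```python
import math

# Made-up data, mean-centered so the intercept drops out
raw_x = [1.0, 2.0, 3.0, 6.0]
raw_y = [2.0, 1.0, 4.0, 5.0]
x = [v - sum(raw_x) / len(raw_x) for v in raw_x]
y = [v - sum(raw_y) / len(raw_y) for v in raw_y]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

# Least-squares projection of y onto x: y^ = beta^ * x
beta_hat = dot(x, y) / dot(x, x)
y_hat = [beta_hat * v for v in x]

# cos(theta) = ||y^|| / ||y|| -- the length ratio from the right triangle
cos_theta = math.sqrt(dot(y_hat, y_hat)) / math.sqrt(dot(y, y))

# Correlation computed the usual way on the centered data
corr = dot(x, y) / math.sqrt(dot(x, x) * dot(y, y))

# R^2 = ||y^||^2 / ||y||^2 equals the squared correlation
r2 = dot(y_hat, y_hat) / dot(y, y)
print(round(r2, 4), round(corr ** 2, 4))  # prints 0.7143 0.7143
```

Note that cos(θ) equals |corr| (the sign of the correlation is lost in the length ratio), but squaring removes the ambiguity, which is why (correlation)² and R² agree.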


I do appreciate your attempt at helping, but unfortunately, this just made things 10x worse. Are you really introducing trigonometry to explain r^2? You're way too smart to be a good teacher!
JackOfAll

I thought that you wanted to know why correlation^2 = R^2. In any case, different ways of understanding the same concept helps or at least that is my perspective.


The Regression By Eye applet could be of use if you're trying to develop some intuition.

It lets you generate data then guess a value for R, which you can then compare with the actual value.

Licensed under cc by-sa 3.0 with attribution required.