Can the standard deviation of non-negative data exceed the mean?


15

I have some triangulated 3D meshes. The statistics for the triangle areas are:

  • Min 0.000
  • Max 2341.141
  • Mean 56.317
  • Std dev 98.720

So, does it say anything particularly useful about the standard deviation, or suggest a bug in computing it, when the numbers look like the above? The areas are certainly far from normally distributed.

And as someone mentioned in one of the responses below, what really surprised me is that it takes only one SD below the mean for the numbers to go negative and thus fall outside the legal domain.

Thanks


4
In the dataset {2, 2, 2, 202} the sample standard deviation is 100 while the mean is 52, close to what you observed.
whuber

5
For an example familiar to some, one person's average hourly result playing blackjack might be −$25, but with a standard deviation of, say, $100 (numbers for illustration). This large coefficient of variation makes it easy for someone to be fooled into thinking they are better at the game than they really are.
Michael McGowan

The follow-up question is also quite informative: it establishes bounds on the SD of a set of (non-negative) data, given its mean.
whuber

Answers:


9

Nothing states that the standard deviation has to be less than or greater than the mean. Given a set of data, you can keep the mean the same but change the standard deviation to an arbitrary degree by suitably adding/subtracting positive numbers.

Using @whuber's example dataset from his comment to the question: {2, 2, 2, 202}. As @whuber stated, the mean is 52 and the standard deviation is 100.

Now, perturb each element as follows: {22, 22, 22, 142}. The mean is still 52, but the standard deviation is 60.
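A quick check of the two datasets in base R:

```r
x <- c(2, 2, 2, 202)      # whuber's original dataset
y <- c(22, 22, 22, 142)   # each element nudged toward the mean
mean(x)  # 52
sd(x)    # 100
mean(y)  # 52, unchanged
sd(y)    # 60, reduced
```

The perturbation preserves the sum (and hence the mean) while pulling every point closer to 52, which is why the SD shrinks.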


1
If you add a constant to every element, you change the location parameter, i.e., the mean. You change the dispersion (i.e., the standard deviation) by multiplying by a scale factor (provided your mean is zero).
Dirk Eddelbuettel

@DirkEddelbuettel You are right. I have fixed the answer and given an example for clarity.
varty

2
I don't follow the example. The new dataset clearly is not derived from the original by "adding or subtracting a positive number" from each original value.
whuber

3
I can't edit it because I don't know what you want to say. If you can arbitrarily add separate values to each of the n numbers in a dataset, you merely change one set of n values into a completely different set of n values. I don't see how that is relevant to the question or even to your opening paragraph. I think anyone would concede that such changes can alter the mean and SD, but that doesn't tell us why the SD of a set of non-negative data can be a large positive multiple of its mean.
whuber

2
You are right: the quoted assertion is mine and it does not appear in your reply. (It happens to be correct and relevant, though. :-) One point I'm trying to get across is that the mere ability to change the SD while keeping the mean the same doesn't answer the question. How much can the SD be changed (while keeping all data non-negative)? The other point I have tried to make is that your example does not illustrate a general, predictable process of making such alterations to data. This makes it appear arbitrary, which is not of much help.
whuber

9

Of course, these are independent parameters. You can set up simple explorations in R (or another tool you may prefer).

R> set.seed(42)     # fix RNG
R> x <- rnorm(1000) # one thousand N(0,1)
R> mean(x)          # and mean is near zero
[1] -0.0258244
R> sd(x)            # sd is near one
[1] 1.00252
R> sd(x * 100)      # scale to std.dev of 100
[1] 100.252
R> 

Similarly, you can standardize the data you are looking at by subtracting the mean and dividing by the standard deviation.

Edit: And following @whuber's idea, here is one of an infinity of data sets which come close to your four measurements:

R> data <- c(0, 2341.141, rep(52, 545))
R> data.frame(min=min(data), max=max(data), sd=sd(data), mean=mean(data))
  min     max      sd    mean
1   0 2341.14 97.9059 56.0898
R> 

I am not sure I understand your point. They are not exactly independent as one could change the mean by perturbing one data point and thereby change the standard deviation as well. Did I misinterpret something?
varty

Noting that triangle areas cannot be negative (as confirmed by the minimum value quoted in the question), one would hope for an example consisting solely of non-negative numbers.
whuber

(+1) Re the edit: Try using 536 replications of 52.15 :-).
whuber

Nice one re 536 reps. Should have done a binary search :)
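whuber's refinement is easy to check directly; a quick sketch using his suggested 536 replications of 52.15:

```r
data <- c(0, 2341.141, rep(52.15, 536))
data.frame(min = min(data), max = max(data), sd = sd(data), mean = mean(data))
# min and max match the question exactly; the sd and mean land within
# about 0.01 of the reported 98.720 and 56.317
```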
Dirk Eddelbuettel

For a Bernoulli variable X, var(X) = p(1 − p); rescaled so that 0 ≤ X ≤ 100, the variance can reach (50)² = 2500. For instance, a variable equal to 100 with probability 1/100 and 0 otherwise has mean 1 and variance (99/100)(1)² + (1/100)(99)² = 99. Aren't there more examples of bounded variables in nature than Gaussians?
robin girard

7

I am not sure why @Andy is surprised at this result, but I know he is not alone. Nor am I sure what the normality of the data has to do with the fact that the sd is higher than the mean. It is quite simple to generate a data set that is normally distributed where this is the case; indeed, the standard normal has a mean of 0 and an sd of 1. It would be hard to get a normally distributed data set of all positive values with sd > mean; indeed, it ought not be possible (but it depends on the sample size and what test of normality you use... with a very small sample, odd things happen).

However, once you remove the stipulation of normality, as @Andy did, there's no reason why the sd should be larger or smaller than the mean, even for all positive values. A single outlier will do this, e.g.:

x <- runif(100, 1, 200)
x <- c(x, 2000)

gives a mean of 113 and an sd of 198 (depending on seed, of course).

But a bigger question is why this surprises people.

I don't teach statistics, but I wonder what about the way statistics is taught makes this notion common.


I have never studied statistics, just a couple of units of engineering math and that was thirty years ago. Other people at work, who I thought understood the domain better, have been talking about representing bad data by "number of std devs away from the mean". So, it's more about "how std dev is commonly mentioned" than "taught" :-)
Andy Dent

@Andy being a large number of std devs away from the mean simply means that the variable is not significantly different from zero. Then it depends on the context (what is the meaning of the random variable?), but in some cases you might want to remove those?
robin girard

@Peter see my comment to Dirk; this might explain the "surprise" in some contexts. Actually, I have taught statistics for some time and I have never seen the surprise you are talking about. Anyway, I prefer students who are surprised by everything; I am pretty sure this is a good epistemological position (better than feigning the absolutely-no-surprise position :)).
robin girard

@AndyDent "bad" data, to me, means data that is incorrectly recorded. Data that are far from the mean are outliers. For example, suppose you are measuring people's heights. If you measure me and record my height as 7'5" instead of 5'7", that's bad data. If you measure Yao Ming and record his height as 7'5", that's an outlier but not bad data, regardless of the fact that it is very far from the mean (something like 6 sds).
Peter Flom - Reinstate Monica

@PeterFlom In our case, we have outliers which we want to get rid of because they represent triangles that will cause algorithmic problems processing the mesh. They may even be "bad data" in your sense if they were created by faulty scanning devices or conversion from other formats :-) Other shapes may have outliers which are legitimately a long way from the mean but don't represent a problem. One of the more interesting things about this data is we have "bad data" at both ends, but the small ones are not far from the mean.
Andy Dent

6

Just adding a generic point that, from a calculus perspective,
$$\int x f(x)\,dx \quad\text{and}\quad \int x^2 f(x)\,dx$$
are related by Jensen's inequality, assuming both integrals exist:
$$\int x^2 f(x)\,dx \ge \left\{\int x f(x)\,dx\right\}^2.$$
Given this general inequality, nothing prevents the variance from getting arbitrarily large. Witness Student's t distribution with ν degrees of freedom,
$$X \sim \mathcal{T}(\nu, \mu, \sigma),$$
and take Y = |X|, whose second moment is the same as the second moment of X:
$$\mathbb{E}[|X|^2] = \frac{\nu}{\nu-2}\,\sigma^2 + \mu^2$$
when ν > 2. So it goes to infinity as ν goes down to 2, while the mean of Y remains finite as long as ν > 1.
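A numerical sketch of this behaviour in R (the degrees-of-freedom values are arbitrary, and sample moments converge slowly near ν = 2, so exact values will vary):

```r
set.seed(2)
for (nu in c(10, 3, 2.1)) {
  y <- abs(rt(1e6, df = nu))   # Y = |X| for X ~ Student's t with nu d.f.
  cat(sprintf("nu = %4.1f  mean = %5.2f  sd = %6.2f\n", nu, mean(y), sd(y)))
}
# the mean of |X| stays modest while its sd blows up as nu approaches 2
```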

1
Please note the explicit restriction to nonnegative values in the question.
whuber

The Student example gets easily translated into the absolute-value-of-a-Student's-t-distribution example...
Xi'an

1
But that changes the mean, of course :-). The question concerns the relationship between the SD and the mean (see its title). I am not saying you're wrong; I'm just (implicitly) suggesting that your reply could, with little work, more directly address the question.
whuber

@whuber: ok, I edited the above to consider the absolute value (I also derived the mean of the absolute value, but it is rather ungainly: ceremade.dauphine.fr/~xian/meanabs.pdf ...)
Xi'an

3

Perhaps the OP is surprised that the mean - 1 S.D. is a negative number (especially where the minimum is 0).

Here are two examples that may clarify.

Suppose you have a class of 20 first graders, where 18 are 6 years old, 1 is 5, and 1 is 7. Now add in the 49-year-old teacher. The average age is 8.05, while the standard deviation is 9.39.

You might be thinking: the one-standard-deviation range for this class runs from −1.34 to 17.44 years. You might be surprised that it includes a negative age, which seems unreasonable.

You don't have to worry about the negative age (or the 3D areas extending below the minimum of 0.0). Intuitively, you still have about two-thirds of the data within 1 S.D. of the mean. (You actually have 95% of the data within 2 S.D. of the mean.)
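The classroom numbers are easy to reproduce in R (up to rounding):

```r
ages <- c(rep(6, 18), 5, 7, 49)   # 20 first graders plus the teacher
mean(ages)               # about 8.05
sd(ages)                 # about 9.39
mean(ages) - sd(ages)    # about -1.34: negative, although every age is positive
```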

When the data takes on a non-normal distribution, you will see surprising results like this.

Second example. In his book, Fooled by Randomness, Nassim Taleb sets up the thought experiment of a blindfolded archer shooting at a wall of infinite length. The archer can shoot between +90 degrees and −90 degrees.

Every once in a while, the archer will shoot the arrow parallel to the wall, and it will never hit. Consider how far the arrow misses the target as the distribution of numbers. The standard deviation for this scenario would be infinite.
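The miss distance here is tan(angle) for a uniform angle, which follows a Cauchy distribution; a small R sketch (illustrative only, not from the book) shows how unruly the sample SD is:

```r
set.seed(1)
angle <- runif(1e5, -pi/2, pi/2)  # blindfolded archer's firing angle
miss  <- tan(angle)               # where the arrow lands relative to the target
mean(miss)                        # unstable: the theoretical mean is undefined
sd(miss)                          # huge: the theoretical sd is infinite
```

Rerunning with different seeds or larger samples gives wildly different values, because the sample SD of a Cauchy variable never settles down.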


The rule about 2/3 of the data within 1 SD of the mean is for normal data. But the classroom data is clearly non-normal (even if it passes some test for normality because of small sample size). Taleb's example is terrible. It's an example of poor operationalization of a variable. Taken as is, both the mean and the SD would be infinite. But that's nonsense. "How far the arrow misses" - to me, that's a distance. The arrow, no matter how it is fired, will land somewhere. Measure the distance from there to the target. No more infinity.
Peter Flom - Reinstate Monica

1
Yup, the OP was sufficiently surprised the first time I saw mean - 1 SD went negative that I wrote a whole new set of unit tests using data from Excel to confirm at least my algorithm was calculating the same values. Because Excel just has to be an authoritative source, right?
Andy Dent

@Peter The 2/3 rule (part of a 68-95-99.7% rule) is good for a huge variety of datasets, many of them non-normal and even for moderately skewed ones. (The rule is quite good for symmetric datasets.) The non-finiteness of the SD and mean are not "nonsense." Taleb's example is one of the few non-contrived situations where the Cauchy distribution clearly governs the data-generation process. The infiniteness of the SD does not derive from the possibility of missing the wall but from the distribution of actual hits.
whuber

1
@whuber I was aware of your first point, which is a good one. I disagree about your second point re Taleb. It seems to me like another contrived example.
Peter Flom - Reinstate Monica

3

A gamma random variable X with density

$$f_X(x) = \frac{\beta^\alpha}{\Gamma(\alpha)}\, x^{\alpha-1} e^{-\beta x}\, I_{(0,\infty)}(x),$$
with α, β > 0, is almost surely positive. Choose any mean m > 0 and any standard deviation s > 0. As long as they are positive, it does not matter whether m > s or m < s. Putting α = m²/s² and β = m/s², the mean and variance of X are E[X] = α/β = m and Var[X] = α/β² = s². With a big enough sample from the distribution of X, by the SLLN, the sample mean and sample standard deviation will be close to m and s. You can play with R to get a feeling for this. Here are examples with m > s and m < s.
> m <- 10
> s <- 1
> x <- rgamma(10000, shape = m^2/s^2, rate = m/s^2)
> mean(x)
[1] 10.01113
> sd(x)
[1] 1.002632

> m <- 1
> s <- 10
> x <- rgamma(10000, shape = m^2/s^2, rate = m/s^2)
> mean(x)
[1] 1.050675
> sd(x)
[1] 10.1139

1

As pointed out in the other answers, the mean x̄ and standard deviation σ_x are essentially unrelated in that it is not necessary for the standard deviation to be smaller than the mean. However, if the data are nonnegative, taking on values in [0, c], say, then, for large data sets (where the distinction between dividing by n or by n−1 does not matter very much), the following inequality holds:

$$\sigma_x \le \sqrt{\bar{x}\,(c - \bar{x})} \le \frac{c}{2},$$

and so if x̄ > c/2, we can be sure that σ_x will be smaller. Indeed, since σ_x = c/2 only for an extremal distribution (half the data have value 0 and the other half value c), σ_x < x̄ can hold in some cases when x̄ < c/2 as well. If the data are measurements of some physical quantity that is nonnegative (e.g. area) and have an empirical distribution that is a good fit to a normal distribution, then σ_x will be considerably smaller than min{x̄, c − x̄}, since the fitted normal distribution should assign negligibly small probability to the events {X < 0} and {X > c}.
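The inequality is easy to probe numerically; a sketch in R, using the population SD (divide by n) and an arbitrary bound c = 1:

```r
set.seed(3)
c_max <- 1
x <- runif(500, 0, c_max)              # nonnegative data bounded by c
m <- mean(x)
pop_sd <- sqrt(mean((x - m)^2))        # dividing by n, as in the large-data argument
c(pop_sd, sqrt(m * (c_max - m)), c_max / 2)
# each value is no larger than the next, as the inequality promises
```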

4
I don't think the question is whether the dataset is normal; its non-normality is stipulated. The question concerns whether there might have been some error made in computing the standard deviation, because the OP is surprised that even in this obviously non-normal dataset the SD is much larger than the mean. If an error was not made, what can one conclude from such a large coefficient of variation?
whuber

9
Any answer or comment that claims the mean and sd of a dataset are unrelated is plainly incorrect, because both are functions of the same data and both will change whenever a single one of the data values is changed. This remark does bear some echoes of a similar sounding statement that is true (but not terribly relevant to the current question); namely, that the sample mean and sample sd of data drawn independently from a normal distribution are independent (in the probabilistic sense).
whuber

1

What you seem to have in mind implicitly is a prediction interval that would bound the occurrence of new observations. The catch is: you must postulate a statistical distribution compliant with the fact that your observations (triangle areas) must remain non-negative. Normal won't help, but log-normal might be just fine. In practical terms, take the log of observed areas, calculate the mean and standard deviation, form a prediction interval using the normal distribution, and finally evaluate the exponential for the lower and upper limits -- the transformed prediction interval won't be symmetric around the mean, and is guaranteed to not go below zero. This is what I think the OP actually had in mind.
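A sketch of that recipe in R (the data here are simulated stand-ins for the observed areas, and the 95% level is an arbitrary choice):

```r
set.seed(7)
areas <- rlnorm(1000, meanlog = 3, sdlog = 1)  # hypothetical triangle areas

logs <- log(areas)                  # work on the log scale, where normality is plausible
m <- mean(logs); s <- sd(logs)
pi95 <- exp(m + c(-1.96, 1.96) * s) # back-transform the normal prediction interval
pi95  # asymmetric around exp(m), and the lower limit can never go below zero
```

With real data, the zero-area triangles from the question would need handling (e.g. dropping them) before taking logs.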


0

Felipe Nievinski points to a real issue here. It makes no sense to talk in normal distribution terms when the distribution is clearly not a normal distribution. All-positive values with a relatively small mean and relatively large standard deviation cannot have a normal distribution. So, the task is to figure out what sort of distribution fits the situation. The original post suggests that a normal distribution (or some such) was clearly in mind. Otherwise negative numbers would not come up. Log normal, Rayleigh, Weibull come to mind ... I don't know but wonder what might be best in a case like this?

Licensed under cc by-sa 3.0 with attribution required.