Why are lower p-values not more evidence against the null? Arguments from Johansson 2011

31

Johansson (2011) in "Hail the impossible: p-values, evidence, and likelihood" (here is also a link to the journal) states that lower $p$-values are often considered stronger evidence against the null. Johansson implies that people would consider the evidence against the null to be stronger if their statistical test returned a $p$-value of $0.01$ than if it returned a $p$-value of $0.45$. Johansson lists four reasons why the $p$-value cannot be used as evidence against the null:

1. $p$ is uniformly distributed under the null hypothesis and can therefore never indicate evidence for the null.

2. $p$ is conditioned solely on the null hypothesis and is therefore unsuited to quantify evidence, because evidence is always relative in the sense of being evidence for or against a hypothesis relative to another hypothesis.

3. $p$ designates probability of obtaining evidence (given the null), rather than strength of evidence.

4. $p$ depends on unobserved data and subjective intentions and therefore implies, given the evidential interpretation, that the evidential strength of observed data depends on things that did not happen and subjective intentions.
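The first point (uniformity of $p$ under the null) can be checked with a small simulation. This is only a minimal sketch under assumptions that are not in Johansson's article: a one-sided z-test with a standard normal test statistic, and hypothetical helper names (`one_sided_p`, `simulate_null_p_values`, `decile_fractions`).

```python
import math
import random


def one_sided_p(z):
    """Upper-tail p-value for a standard normal test statistic (assumed test)."""
    return 0.5 * math.erfc(z / math.sqrt(2))


def simulate_null_p_values(n_sims=20000, seed=0):
    """Draw test statistics under the null and convert each one to a p-value."""
    rng = random.Random(seed)
    return [one_sided_p(rng.gauss(0.0, 1.0)) for _ in range(n_sims)]


def decile_fractions(p_values):
    """Fraction of p-values falling in each of the ten bins [0, 0.1), ..., [0.9, 1]."""
    counts = [0] * 10
    for p in p_values:
        counts[min(int(p * 10), 9)] += 1
    return [c / len(p_values) for c in counts]


fractions = decile_fractions(simulate_null_p_values())
for i, frac in enumerate(fractions):
    print(f"p in [{i / 10:.1f}, {(i + 1) / 10:.1f}): {frac:.3f}")
```

Each bin should hold roughly a tenth of the simulated p-values, which is what "uniformly distributed under the null" means in practice.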

Unfortunately I cannot get an intuitive understanding from Johansson's article. To me a $p$-value of $0.01$ indicates there is less chance the null is true than a $p$-value of $0.45$. Why are lower $p$-values not stronger evidence against the null?

— luciano

Hello, @luciano! I see that you have not accepted any answer in this thread. What kind of answer are you looking for? Is your question primarily about Johansson's arguments specifically, or about lower p-values in general?

— amoeba says Reinstate Monica

This is all about the Fisher vs Neyman-Pearson frequentist frameworks. See more in this answer by @gung.

— Firebug

21

My personal appraisal of his arguments:

1. Here he talks about using $p$ as evidence for the Null, whereas his thesis is that $p$ can't be used as evidence against the Null. So, I think this argument is largely irrelevant.
2. I think this is a misunderstanding. Fisherian $p$ testing follows strongly in the idea of Popper's Critical Rationalism, which states that you cannot support a theory but only criticize it. So in that sense there is only a single hypothesis (the Null) and you simply check if your data are in accordance with it.
3. I disagree here. It depends on the test statistic, but $p$ is usually a transformation of an effect size that speaks against the Null. So the higher the effect, the lower the $p$-value, all other things equal. Of course, for different data sets or hypotheses this is no longer valid.
4. I am not sure I completely understand this statement, but from what I can gather this is less a problem of $p$ than of people using it wrongly. $p$ was intended to have the long-run frequency interpretation and that is a feature, not a bug. But you can't blame $p$ for people taking a single $p$-value as proof for their hypothesis or people publishing only $p<.05$.

His suggestion of using the likelihood ratio as a measure of evidence is in my opinion a good one (though the idea of a Bayes factor is more general), but the context in which he raises it is a bit peculiar. First, he leaves the grounds of Fisherian testing, where there is no alternative hypothesis from which to calculate the likelihood ratio. But $p$ as evidence against the Null is Fisherian. Hence he confounds Fisher and Neyman-Pearson. Second, most test statistics that we use are (functions of) the likelihood ratio, and in that case $p$ is a transformation of the likelihood ratio. As Cosma Shalizi puts it:

among all tests of a given size $s$, the one with the smallest miss probability, or highest power, has the form "say 'signal' if $q(x)/p(x) > t(s)$, otherwise say 'noise'," and that the threshold $t$ varies inversely with $s$. The quantity $q(x)/p(x)$ is the likelihood ratio; the Neyman-Pearson lemma says that to maximize power, we should say "signal" if it is sufficiently more likely than noise.

Here $q(x)$ is the density under state "signal" and $p(x)$ the density under state "noise". The measure for "sufficiently likely" would here be $P(q(X)/p(X) > t_{obs} \mid H_0)$, where $t_{obs} = q(x_{obs})/p(x_{obs})$ is the observed likelihood ratio; this probability is the $p$-value. Note that in correct Neyman-Pearson testing $t_{obs}$ is substituted by a fixed $t(s)$ such that $P(q(X)/p(X) > t(s) \mid H_0)=\alpha$.
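To make the link between the likelihood ratio and $p$ concrete, here is a minimal sketch for one specific simple-vs-simple case. The choice of densities is an assumption for illustration only: "noise" is $N(0,1)$, "signal" is $N(1,1)$, and there is a single observation. In this setup $q(x)/p(x) = \exp(x - 1/2)$ is increasing in $x$, so $P(q(X)/p(X) > t_{obs} \mid H_0)$ reduces to the upper tail $P(X > x_{obs} \mid H_0)$.

```python
import math


def normal_pdf(x, mean):
    """Density of a normal distribution with the given mean and unit variance."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2 * math.pi)


def likelihood_ratio(x, mu_signal=1.0):
    """q(x)/p(x): assumed signal density N(mu_signal, 1) over noise density N(0, 1)."""
    return normal_pdf(x, mu_signal) / normal_pdf(x, 0.0)


def p_value(x_obs):
    """P(q(X)/p(X) > t_obs | H0).

    For mu_signal > 0 the likelihood ratio is increasing in x, so this event
    is the same as X > x_obs under the N(0, 1) null.
    """
    return 0.5 * math.erfc(x_obs / math.sqrt(2))


observations = [0.0, 0.5, 1.0, 2.0, 3.0]
ratios = [likelihood_ratio(x) for x in observations]
p_values = [p_value(x) for x in observations]

for x, lr, p in zip(observations, ratios, p_values):
    print(f"x = {x:.1f}  likelihood ratio = {lr:8.3f}  p = {p:.5f}")
```

In this toy case a larger observed likelihood ratio always goes with a smaller $p$, which is the sense in which $p$ is a monotone transformation of the likelihood ratio. The one-to-one mapping relies on the assumed model; it is not claimed for arbitrary test statistics.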

— Momo

6

+1 for point 3 alone. Cox describes the p-value as a calibration of the likelihood ratio (or other test statistic) & it's a point of view that's often forgotten.

— Scortchi - Reinstate Monica

(+1) Nice answer, @Momo. I am wondering if it could be improved by adding something like "But they are!" in a large font as the header of your response, because this seems to be your answer to OP's title question "Why are lower p-values not more evidence against the null?". You debunk all the given arguments, but do not explicitly provide an answer to the title question.

— amoeba says Reinstate Monica

1

I'd be a bit hesitant to do that; it is all very subtle and very dependent on assumptions, contexts, etc. For example, you may flat out deny that probabilistic statements can be used as "evidence", and then the statement is correct. From a Fisherian point of view it is not. Also, I wouldn't say I debunk (all) the arguments; I think I only provide a different perspective and point out some logical flaws in the argument. The author argues his point well and tries to provide a solution to a pertinent approach that by itself may be seen as equally problematic.

— Momo

9

The reason that arguments like Johansson's are recycled so often seems to be related to the fact that P-values are indices of the evidence against the null but are not measures of the evidence. The evidence has more dimensions than any single number can measure, and so there are always aspects of the relationship between P-values and evidence that people find difficult.

I have reviewed many of the arguments used by Johansson in a paper that shows the relationship between P-values and likelihood functions, and thus evidence: http://arxiv.org/abs/1311.0081 Unfortunately that paper has now been rejected three times, although its arguments and the evidence for them have not been refuted. (It seems that referees who hold opinions like Johansson's find it distasteful rather than wrong.)

— Michael Lew

+1 @Michael Lew, what about changing the title? To P(ee) or not to P(ee) ... doesn't sound like a dilemma. We all know what to do in that situation. =D Joking aside, what were the reasons for your paper to be rejected?

— An old man in the sea.

4

Adding to @Momo's nice answer:

Do not forget multiplicity. Given many independent p-values, and sparse non-trivial effect sizes, the smallest p-values are from the null, with probability tending to $1$ as the number of hypotheses increases. So if you tell me you have a small p-value, the first thing I want to know is how many hypotheses you have been testing.
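A rough simulation of this point, as a sketch only. The specifics are assumptions chosen for illustration and are not from the post: one-sided z-tests, exactly 5 real effects with a mean z-score of 2, and the function names below. Since a one-sided p-value is decreasing in z, the smallest p-value belongs to the largest z-score, so the code compares maxima directly.

```python
import random


def smallest_p_is_null(n_tests, n_effects, effect_z, rng):
    """True if the smallest one-sided p-value (largest z) comes from a true null."""
    null_max = max(rng.gauss(0.0, 1.0) for _ in range(n_tests - n_effects))
    effect_max = max(rng.gauss(effect_z, 1.0) for _ in range(n_effects))
    return null_max > effect_max


def fraction_null_wins(n_tests, n_effects=5, effect_z=2.0, reps=300, seed=1):
    """Share of simulated studies whose smallest p-value is a false lead."""
    rng = random.Random(seed)
    wins = sum(
        smallest_p_is_null(n_tests, n_effects, effect_z, rng) for _ in range(reps)
    )
    return wins / reps


results = {m: fraction_null_wins(m) for m in (10, 100, 2000)}
for m, frac in results.items():
    print(f"{m:5d} hypotheses: smallest p from a null in {frac:.0%} of runs")
```

With the number of real effects held fixed, the share of runs in which the smallest p-value comes from a null grows as more hypotheses are tested. How fast it grows depends on the assumed effect size and sparsity.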

— JohnRos

2

It is worth noting that the evidence itself is not affected by multiplicity of testing, even if your response to the evidence might be altered. The evidence in the data is the evidence in the data and it is not affected by any calculations that you may perform in your computer. The typical 'correction' of p-values for multiplicity of testing has to do with preserving false positive error rates, not correcting the relationship between the p-value and the experimental evidence.

— Michael Lew

1

Is Johansson talking about p-values from two different experiments? If so, comparing p-values may be like comparing apples to lamb chops. If experiment "A" involves a huge number of samples, even a small inconsequential difference may be statistically significant. If experiment "B" involves only a few samples, an important difference may be statistically insignificant. Even worse (that's why I said lamb chops and not oranges), the scales may be totally incomparable (psi in one and kwh in the other).
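A small numerical sketch of the sample-size point. The setup is assumed for illustration: a two-sided, two-sample z-test with known unit standard deviation and equal group sizes, with made-up differences and sample sizes for experiments "A" and "B".

```python
import math


def two_sample_z_p(mean_diff, n_per_group, sd=1.0):
    """Two-sided p-value for a two-sample z-test with known, equal sd (assumed)."""
    standard_error = sd * math.sqrt(2.0 / n_per_group)
    z = mean_diff / standard_error
    return math.erfc(abs(z) / math.sqrt(2))


# Experiment "A": tiny difference, huge sample (hypothetical numbers).
p_a = two_sample_z_p(mean_diff=0.02, n_per_group=100_000)

# Experiment "B": large difference, very small sample (hypothetical numbers).
p_b = two_sample_z_p(mean_diff=1.0, n_per_group=5)

print(f"A: difference 0.02 sd, n = 100000 per group, p = {p_a:.2g}")
print(f"B: difference 1.00 sd, n = 5 per group,      p = {p_b:.2g}")
```

Under these assumptions the negligible difference in "A" falls below the conventional 0.05 threshold while the much larger difference in "B" does not, so the two p-values cannot be ranked as evidence without also looking at the sample sizes.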

— Emil Friedman

3

My impression is that Johansson is not talking about comparing p-values from different experiments. In light of that & @Glen_b's comment, would you mind clarifying your post, Emil? It's fine to raise a related point ('I think J's wrong in context A, but it would have some merit in context B'), but it needs to be clear that that's what you are doing. If you are asking a question or commenting, please delete this post & make it a comment.

— gung - Reinstate Monica