It might be worth adding another, perhaps more straightforward example to Stephen's excellent answer.
Let's consider a medical test, the result of which is normally distributed, both in sick and in healthy people, with different parameters of course (but for simplicity, let's assume homoscedasticity, i.e., that the variance is the same):
$$T\mid D^\ominus\sim\mathcal{N}\left(\mu_-,\sigma^2\right),\qquad T\mid D^\oplus\sim\mathcal{N}\left(\mu_+,\sigma^2\right).$$
Let's denote the prevalence of the disease with $p$ (i.e., $D^\oplus\sim\mathrm{Bern}\left(p\right)$), so this, together with the above, which are essentially conditional distributions, fully specifies the joint distribution.
Thus the confusion matrix with threshold $b$ (i.e., those with test results above $b$ are classified as sick) is
$$\begin{pmatrix} & D^\oplus & D^\ominus \\ T^\oplus & p\left(1-\Phi_+(b)\right) & (1-p)\left(1-\Phi_-(b)\right) \\ T^\ominus & p\,\Phi_+(b) & (1-p)\,\Phi_-(b) \end{pmatrix},$$
where $\Phi_+$ and $\Phi_-$ denote the cumulative distribution functions of the test result in sick and healthy people, respectively.
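To make this concrete in code, here is a minimal R sketch that fills in these cell probabilities for a given threshold (the function name ConfusionMatrix is my own, and the parameter values are the illustrative ones used in the demonstration below):
ConfusionMatrix <- function( b, p, muminus, muplus, sigma ) {
  ## joint cell probabilities: rows = classification (T+/T-), columns = true status (D+/D-)
  matrix( c( p*( 1-pnorm( b, muplus, sigma ) ),        # T+, D+
             p*pnorm( b, muplus, sigma ),              # T-, D+
             ( 1-p )*( 1-pnorm( b, muminus, sigma ) ), # T+, D-
             ( 1-p )*pnorm( b, muminus, sigma ) ),     # T-, D-
          nrow = 2, dimnames = list( c( "T+", "T-" ), c( "D+", "D-" ) ) )
}
ConfusionMatrix( b = 10, p = 0.2, muminus = 9.5, muplus = 10.5, sigma = 1 )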
Accuracy-based approach
The accuracy is
$$p\left(1-\Phi_+(b)\right)+(1-p)\,\Phi_-(b);$$
we take its derivative w.r.t. $b$, set it equal to $0$, multiply by $\sqrt{2\pi\sigma^2}$ and rearrange a bit:
$$-p\,\varphi_+(b)+(1-p)\,\varphi_-(b)=0$$
$$e^{-\frac{\left(b-\mu_-\right)^2}{2\sigma^2}}\left[(1-p)-p\,e^{-\frac{2b\left(\mu_--\mu_+\right)+\left(\mu_+^2-\mu_-^2\right)}{2\sigma^2}}\right]=0$$
The first term can't be zero, so the only way the product can be zero is if the second term is zero:
$$(1-p)-p\,e^{-\frac{2b\left(\mu_--\mu_+\right)+\left(\mu_+^2-\mu_-^2\right)}{2\sigma^2}}=0$$
$$-\frac{2b\left(\mu_--\mu_+\right)+\left(\mu_+^2-\mu_-^2\right)}{2\sigma^2}=\log\frac{1-p}{p}$$
$$2b\left(\mu_+-\mu_-\right)+\left(\mu_-^2-\mu_+^2\right)=2\sigma^2\log\frac{1-p}{p}$$
So the solution is
$$b^*=\frac{\left(\mu_+^2-\mu_-^2\right)+2\sigma^2\log\frac{1-p}{p}}{2\left(\mu_+-\mu_-\right)}=\frac{\mu_++\mu_-}{2}+\frac{\sigma^2}{\mu_+-\mu_-}\log\frac{1-p}{p}.$$
Note that this - of course - doesn't depend on the costs.
If the classes are balanced, the optimum is the average of the mean test values in sick and healthy people, otherwise it is displaced based on the imbalance.
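As a quick numerical sanity check (a sketch only, using the illustrative parameter values of the demonstration below), we can compare this closed form with a direct numerical maximization of the accuracy:
p <- 0.2; muminus <- 9.5; muplus <- 10.5; sigma <- 1
## accuracy as a function of the threshold b
Accuracy <- function( b ) p*( 1-pnorm( b, muplus, sigma ) ) + ( 1-p )*pnorm( b, muminus, sigma )
## numerical maximization over a wide range of thresholds
optimize( Accuracy, interval = c( 0, 20 ), maximum = TRUE )$maximum
## closed-form optimum derived above
( muplus+muminus )/2 + sigma^2/( muplus-muminus )*log( ( 1-p )/p )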
Cost-based approach
Using Stephen's notation, the expected overall cost is
$$c^+_+\,p\left(1-\Phi_+(b)\right)+c^-_+\,(1-p)\left(1-\Phi_-(b)\right)+c^+_-\,p\,\Phi_+(b)+c^-_-\,(1-p)\,\Phi_-(b).$$
Take its derivative w.r.t. $b$ and set it equal to zero:
$$-c^+_+\,p\,\varphi_+(b)-c^-_+\,(1-p)\,\varphi_-(b)+c^+_-\,p\,\varphi_+(b)+c^-_-\,(1-p)\,\varphi_-(b)=$$
$$=\varphi_+(b)\,p\left(c^+_--c^+_+\right)+\varphi_-(b)\,(1-p)\left(c^-_--c^-_+\right)=$$
$$=\varphi_+(b)\,p\,c^+_d-\varphi_-(b)\,(1-p)\,c^-_d=0,$$
using the notation I introduced in my comments below Stephen's answer, i.e., $c^+_d=c^+_--c^+_+$ and $c^-_d=c^-_+-c^-_-$.
The optimal threshold is therefore given by the solution of the equation
$$\frac{\varphi_+(b)}{\varphi_-(b)}=\frac{(1-p)\,c^-_d}{p\,c^+_d}.$$
Two things should be noted here:
- This result is completely generic and works for any distribution of the test results, not only the normal. ($\varphi$ in that case of course means the probability density function of the distribution in question, not the normal density.)
- Whatever the solution for $b$ is, it is surely a function of $\frac{(1-p)\,c^-_d}{p\,c^+_d}$. (I.e., we immediately see how costs matter, in addition to class imbalance!)
I'd be really interested to see if this equation has a generic solution for $b$ (parametrized by the $\varphi$s), but I would be surprised.
Nevertheless, we can work it out for the normal case! The $\sqrt{2\pi\sigma^2}$s cancel on the left hand side, so we have
$$e^{-\frac{1}{2}\left(\frac{\left(b-\mu_+\right)^2}{\sigma^2}-\frac{\left(b-\mu_-\right)^2}{\sigma^2}\right)}=\frac{(1-p)\,c^-_d}{p\,c^+_d}$$
$$\left(b-\mu_-\right)^2-\left(b-\mu_+\right)^2=2\sigma^2\log\frac{(1-p)\,c^-_d}{p\,c^+_d}$$
$$2b\left(\mu_+-\mu_-\right)+\left(\mu_-^2-\mu_+^2\right)=2\sigma^2\log\frac{(1-p)\,c^-_d}{p\,c^+_d}$$
therefore the solution is
$$b^*=\frac{\left(\mu_+^2-\mu_-^2\right)+2\sigma^2\log\frac{(1-p)\,c^-_d}{p\,c^+_d}}{2\left(\mu_+-\mu_-\right)}=\frac{\mu_++\mu_-}{2}+\frac{\sigma^2}{\mu_+-\mu_-}\log\frac{(1-p)\,c^-_d}{p\,c^+_d}.$$
(Compare it to the previous result! We see that they are equal if and only if $c^-_d=c^+_d$, i.e., the difference in misclassification cost compared to the cost of correct classification is the same in sick and healthy people.)
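As a small sketch (the differential cost values cplusd and cminusd are made up just for this check), we can also verify that a numerical solution of the generic equation agrees with this closed form in the normal case:
p <- 0.2; muminus <- 9.5; muplus <- 10.5; sigma <- 1
cplusd <- 3   # c+_d, the differential cost in sick people (illustrative value)
cminusd <- 1  # c-_d, the differential cost in healthy people (illustrative value)
## the generic optimality condition phi+(b)/phi-(b) = (1-p) c-_d / (p c+_d), on the log scale
f <- function( b ) dnorm( b, muplus, sigma, log = TRUE ) - dnorm( b, muminus, sigma, log = TRUE ) -
  log( ( 1-p )*cminusd/( p*cplusd ) )
uniroot( f, interval = c( 0, 20 ) )$root
## closed-form solution for the normal case
( muplus+muminus )/2 + sigma^2/( muplus-muminus )*log( ( 1-p )*cminusd/( p*cplusd ) )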
A short demonstration
Let's say $c^-_-=0$ (this is quite natural medically) and $c^+_+=1$ (we can always achieve this by dividing every cost by $c^+_+$, i.e., by measuring every cost in $c^+_+$ units). Let's say that the prevalence is $p=0.2$. Also, let's say that $\mu_-=9.5$, $\mu_+=10.5$ and $\sigma=1$.
In this case:
library( data.table )
library( lattice )

## parameters of the demonstration (see above)
cminusminus <- 0
cplusplus <- 1
p <- 0.2
muminus <- 9.5
muplus <- 10.5
sigma <- 1

## grid of thresholds (b) and of the two misclassification costs:
##   cplusminus: cost of classifying a healthy person as sick (c^-_+ in the notation above)
##   cminusplus: cost of classifying a sick person as healthy (c^+_- in the notation above)
res <- data.table( expand.grid( b = seq( 6, 17, 0.1 ),
                                cplusminus = c( 1, 5, 10, 50, 100 ),
                                cminusplus = c( 2, 5, 10, 50, 100 ) ) )

## expected overall cost at each threshold, for each cost combination
res$cost <- cplusplus*p*( 1-pnorm( res$b, muplus, sigma ) ) +
  res$cplusminus*(1-p)*(1-pnorm( res$b, muminus, sigma ) ) +
  res$cminusplus*p*pnorm( res$b, muplus, sigma ) +
  cminusminus*(1-p)*pnorm( res$b, muminus, sigma )

## cost curves by cost structure; the vertical line is the accuracy-based optimal threshold
xyplot( cost ~ b | factor( cminusplus ), groups = cplusminus, ylim = c( -1, 22 ),
        data = res, type = "l", xlab = "Threshold",
        ylab = "Expected overall cost", as.table = TRUE,
        abline = list( v = (muplus+muminus)/2+
                         sigma^2/(muplus-muminus)*log((1-p)/p) ),
        strip = strip.custom( var.name = expression( {"c"^{"+"}}["-"] ),
                              strip.names = c( TRUE, TRUE ) ),
        auto.key = list( space = "right", points = FALSE, lines = TRUE,
                         title = expression( {"c"^{"-"}}["+"] ) ),
        panel = panel.superpose, panel.groups = function( x, y, col.line, ... ) {
          panel.xyplot( x, y, col.line = col.line, ... )
          panel.points( x[ which.min( y ) ], min( y ), pch = 19, col = col.line )
        } )
The result is (points depict the minimum cost, and the vertical line shows the optimal threshold with the accuracy-based approach):
We can very nicely see how the cost-based optimum can differ from the accuracy-based optimum. It is instructive to think over why: if it is more costly to classify a sick person as erroneously healthy than the other way around ($c^+_-$ is high, $c^-_+$ is low), then the threshold goes down, as we prefer to classify more easily into the category sick; on the other hand, if it is more costly to classify a healthy person as erroneously sick than the other way around ($c^+_-$ is low, $c^-_+$ is high), then the threshold goes up, as we prefer to classify more easily into the category healthy. (Check these on the figure!)
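A quick numerical illustration of this intuition (a sketch reusing the closed-form threshold derived above, with two made-up cost structures):
p <- 0.2; muminus <- 9.5; muplus <- 10.5; sigma <- 1
## closed-form cost-based optimal threshold as a function of the differential costs
OptimalThreshold <- function( cplusd, cminusd )
  ( muplus+muminus )/2 + sigma^2/( muplus-muminus )*log( ( 1-p )*cminusd/( p*cplusd ) )
OptimalThreshold( cplusd = 50, cminusd = 1 )  # classifying sick people as healthy is expensive: threshold goes down
OptimalThreshold( cplusd = 1, cminusd = 50 )  # classifying healthy people as sick is expensive: threshold goes up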
A real-life example
Let's have a look at an empirical example, instead of a theoretical derivation. This example will differ in basically two aspects:
- Instead of assuming normality, we will simply use the empirical data without any such assumption.
- Instead of using one single test, and its results in its own units, we will use several tests (and combine them with a logistic regression). The threshold will be applied to the final predicted probability. This is actually the preferred approach; see Chapter 19 (Diagnosis) of Frank Harrell's BBR.
The dataset (acath from the package Hmisc) is from the Duke University Cardiovascular Disease Databank, and contains whether the patient had significant coronary disease, as assessed by cardiac catheterization. This will be our gold standard, i.e., the true disease status, and the "test" will be the combination of the subject's age, sex, cholesterol level and duration of symptoms:
library( rms )
library( lattice )
library( latticeExtra )
library( data.table )
getHdata( "acath" )
acath <- acath[ !is.na( acath$choleste ), ]
dd <- datadist( acath )
options( datadist = "dd" )
fit <- lrm( sigdz ~ rcs( age )*sex + rcs( choleste ) + cad.dur, data = acath )
It is worth plotting the predicted risks on the logit scale, to see how normal they are (essentially, that is what we assumed previously, with the single test!):
densityplot( ~predict( fit ), groups = acath$sigdz, plot.points = FALSE, ref = TRUE,
auto.key = list( columns = 2 ) )
Well, they're hardly normal...
Let's go on and calculate the expected overall cost:
ExpectedOverallCost <- function( b, p, y, cplusminus, cminusplus,
                                 cplusplus = 1, cminusminus = 0 ) {
  ## cross-tabulate classification (p > b) against true status (y),
  ## multiply the cell counts by the corresponding costs and sum up
  sum( table( factor( p>b, levels = c( FALSE, TRUE ) ), y )*matrix(
    c( cminusminus, cplusminus, cminusplus, cplusplus ), nc = 2 ) )
}
table( predict( fit, type = "fitted" )>0.5, acath$sigdz )
ExpectedOverallCost( 0.5, predict( fit, type = "fitted" ), acath$sigdz, 2, 4 )
And let's plot it for all possible thresholds (a computational note: we don't need to mindlessly iterate through numbers from 0 to 1; we can perfectly reconstruct the curve by calculating it at every unique value of the predicted probabilities):
ps <- sort( unique( c( 0, 1, predict( fit, type = "fitted" ) ) ) )
xyplot( sapply( ps, ExpectedOverallCost,
p = predict( fit, type = "fitted" ), y = acath$sigdz,
cplusminus = 2, cminusplus = 4 ) ~ ps, type = "l", xlab = "Threshold",
ylab = "Expected overall cost", panel = function( x, y, ... ) {
panel.xyplot( x, y, ... )
panel.points( x[ which.min( y ) ], min( y ), pch = 19, cex = 1.1 )
panel.text( x[ which.min( y ) ], min( y ), round( x[ which.min( y ) ], 3 ),
pos = 3 )
} )
We can very well see where we should put the threshold to optimize the expected overall cost (without using sensitivity, specificity or predictive values anywhere!). This is the correct approach.
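If we just want the optimal cutoff itself, rather than the whole curve, we can of course extract it directly (costs here is simply a helper object of my own, reusing the function and data from above):
costs <- sapply( ps, ExpectedOverallCost, p = predict( fit, type = "fitted" ),
                 y = acath$sigdz, cplusminus = 2, cminusplus = 4 )
ps[ which.min( costs ) ]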
It is especially instructive to contrast these metrics:
ExpectedOverallCost2 <- function( b, p, y, cplusminus, cminusplus,
                                  cplusplus = 1, cminusminus = 0 ) {
  tab <- table( factor( p>b, levels = c( FALSE, TRUE ) ), y )
  sens <- tab[ 2, 2 ] / sum( tab[ , 2 ] )
  spec <- tab[ 1, 1 ] / sum( tab[ , 1 ] )
  c( `Expected overall cost` = sum( tab*matrix( c( cminusminus, cplusminus, cminusplus,
                                                   cplusplus ), nc = 2 ) ),
     Sensitivity = sens,
     Specificity = spec,
     PPV = tab[ 2, 2 ] / sum( tab[ 2, ] ),
     NPV = tab[ 1, 1 ] / sum( tab[ 1, ] ),
     ## note: 1 - accuracy and 1 - Youden index, so that every metric is to be minimized (see below)
     Accuracy = 1 - ( tab[ 1, 1 ] + tab[ 2, 2 ] )/sum( tab ),
     Youden = 1 - ( sens + spec - 1 ),
     Topleft = ( 1-sens )^2 + ( 1-spec )^2
  )
}
ExpectedOverallCost2( 0.5, predict( fit, type = "fitted" ), acath$sigdz, 2, 4 )
res <- melt( data.table( ps, t( sapply( ps, ExpectedOverallCost2,
p = predict( fit, type = "fitted" ),
y = acath$sigdz,
cplusminus = 2, cminusplus = 4 ) ) ),
id.vars = "ps" )
p1 <- xyplot( value ~ ps, data = res, subset = variable=="Expected overall cost",
type = "l", xlab = "Threshold", ylab = "Expected overall cost",
panel=function( x, y, ... ) {
panel.xyplot( x, y, ... )
panel.abline( v = x[ which.min( y ) ],
col = trellis.par.get()$plot.line$col )
panel.points( x[ which.min( y ) ], min( y ), pch = 19 )
} )
p2 <- xyplot( value ~ ps, groups = variable,
data = droplevels( res[ variable%in%c( "Expected overall cost",
"Sensitivity",
"Specificity", "PPV", "NPV" ) ] ),
subset = variable%in%c( "Sensitivity", "Specificity", "PPV", "NPV" ),
type = "l", xlab = "Threshold", ylab = "Sensitivity/Specificity/PPV/NPV",
auto.key = list( columns = 3, points = FALSE, lines = TRUE ) )
doubleYScale( p1, p2, use.style = FALSE, add.ylab2 = TRUE )
We can now analyze those metrics that are sometimes specifically advertised as being able to come up with an optimal cutoff without costs, and contrast them with our cost-based approach! Let's use the three most often used metrics:
- Accuracy (maximize accuracy)
- Youden rule (maximize $\mathrm{Sens}+\mathrm{Spec}-1$)
- Topleft rule (minimize $\left(1-\mathrm{Sens}\right)^2+\left(1-\mathrm{Spec}\right)^2$)
(For simplicity, we will subtract the above values from 1 for the Youden and the Accuracy rule so that we have a minimization problem everywhere.)
Let's see the results:
p3 <- xyplot( value ~ ps, groups = variable,
data = droplevels( res[ variable%in%c( "Expected overall cost", "Accuracy",
"Youden", "Topleft" ) ] ),
subset = variable%in%c( "Accuracy", "Youden", "Topleft" ),
type = "l", xlab = "Threshold", ylab = "Accuracy/Youden/Topleft",
auto.key = list( columns = 3, points = FALSE, lines = TRUE ),
panel = panel.superpose, panel.groups = function( x, y, col.line, ... ) {
panel.xyplot( x, y, col.line = col.line, ... )
panel.abline( v = x[ which.min( y ) ], col = col.line )
panel.points( x[ which.min( y ) ], min( y ), pch = 19, col = col.line )
} )
doubleYScale( p1, p3, use.style = FALSE, add.ylab2 = TRUE )
This of course pertains to one specific cost structure, $c^-_-=0$, $c^+_+=1$, $c^-_+=2$, $c^+_-=4$ (this obviously matters only for the optimal cost decision). To investigate the effect of the cost structure, let's pick just the optimal threshold (instead of tracing the whole curve), but plot it as a function of costs. More specifically, as we have already seen, the optimal threshold depends on the four costs only through the $c^-_d/c^+_d$ ratio, so let's plot the optimal cutoff as a function of this ratio, along with the typically used metrics that don't use costs:
res2 <- data.frame( rat = 10^( seq( log10( 0.02 ), log10( 50 ), length.out = 500 ) ) )
res2$OptThreshold <- sapply( res2$rat,
function( rat ) ps[ which.min(
sapply( ps, Vectorize( ExpectedOverallCost, "b" ),
p = predict( fit, type = "fitted" ),
y = acath$sigdz,
cplusminus = rat,
cminusplus = 1,
cplusplus = 0 ) ) ] )
xyplot( OptThreshold ~ rat, data = res2, type = "l", ylim = c( -0.1, 1.1 ),
xlab = expression( {"c"^{"-"}}["d"]/{"c"^{"+"}}["d"] ), ylab = "Optimal threshold",
scales = list( x = list( log = 10, at = c( 0.02, 0.05, 0.1, 0.2, 0.5, 1,
2, 5, 10, 20, 50 ) ) ),
panel = function( x, y, resin = res[ ,.( ps[ which.min( value ) ] ),
.( variable ) ], ... ) {
panel.xyplot( x, y, ... )
panel.abline( h = resin[variable=="Youden"] )
panel.text( log10( 0.02 ), resin[variable=="Youden"], "Y", pos = 3 )
panel.abline( h = resin[variable=="Accuracy"] )
panel.text( log10( 0.02 ), resin[variable=="Accuracy"], "A", pos = 3 )
panel.abline( h = resin[variable=="Topleft"] )
panel.text( log10( 0.02 ), resin[variable=="Topleft"], "TL", pos = 1 )
} )
Horizontal lines indicate the approaches that don't use costs (and are therefore constant).
Again, we nicely see that as the additional cost of misclassification in the healthy group rises compared to that of the diseased group, the optimal threshold increases: if we really don't want healthy people to be classified as sick, we will use a higher cutoff (and the other way around, of course!).
And, finally, we yet again see why those methods that don't use costs are not (and cannot!) always be optimal.