It might be worth adding another, perhaps more straightforward example to Stephen's excellent answer.
Let's consider a medical test, the result of which is normally distributed, both in sick and in healthy people, with different parameters of course (but for simplicity, let's assume homoscedasticity, i.e., that the variance is the same):
$$T\mid D^\ominus\sim\mathcal{N}\left(\mu_-,\sigma^2\right),\qquad T\mid D^\oplus\sim\mathcal{N}\left(\mu_+,\sigma^2\right).$$
Let's denote the prevalence of the disease with $p$ (i.e., $D^\oplus\sim\mathrm{Bern}\left(p\right)$), so this, together with the above, which are essentially conditional distributions, fully specifies the joint distribution.
Thus the confusion matrix with threshold $b$ (i.e., those with test results above $b$ are classified as sick) is
$$\begin{pmatrix} & D^\oplus & D^\ominus \\ T^\oplus & p\left(1-\Phi_+(b)\right) & (1-p)\left(1-\Phi_-(b)\right) \\ T^\ominus & p\,\Phi_+(b) & (1-p)\,\Phi_-(b) \end{pmatrix},$$
where $\Phi_+$ and $\Phi_-$ denote the cumulative distribution functions of the test result in sick and healthy people, respectively.
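To make this concrete in code, here is a minimal R sketch that fills in these cell probabilities for a given threshold (the function name ConfusionMatrix is my own, and the parameter values are the illustrative ones used in the demonstration below):
ConfusionMatrix <- function( b, p, muminus, muplus, sigma ) {
  ## joint cell probabilities: rows = classification (T+/T-), columns = true status (D+/D-)
  matrix( c( p*( 1-pnorm( b, muplus, sigma ) ),        # T+, D+
             p*pnorm( b, muplus, sigma ),              # T-, D+
             ( 1-p )*( 1-pnorm( b, muminus, sigma ) ), # T+, D-
             ( 1-p )*pnorm( b, muminus, sigma ) ),     # T-, D-
          nrow = 2, dimnames = list( c( "T+", "T-" ), c( "D+", "D-" ) ) )
}
ConfusionMatrix( b = 10, p = 0.2, muminus = 9.5, muplus = 10.5, sigma = 1 )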
Accuracy-based approach
The accuracy is
$$p\left(1-\Phi_+(b)\right)+(1-p)\,\Phi_-(b);$$
we take its derivative w.r.t. $b$, set it equal to $0$, multiply by $\sqrt{2\pi\sigma^2}$ and rearrange a bit:
$$-p\,\varphi_+(b)+(1-p)\,\varphi_-(b)=0$$
$$e^{-\frac{\left(b-\mu_-\right)^2}{2\sigma^2}}\left[(1-p)-p\,e^{-\frac{2b\left(\mu_--\mu_+\right)+\left(\mu_+^2-\mu_-^2\right)}{2\sigma^2}}\right]=0$$
The first term can't be zero, so the only way the product can be zero is if the second term is zero:
$$(1-p)-p\,e^{-\frac{2b\left(\mu_--\mu_+\right)+\left(\mu_+^2-\mu_-^2\right)}{2\sigma^2}}=0$$
$$-\frac{2b\left(\mu_--\mu_+\right)+\left(\mu_+^2-\mu_-^2\right)}{2\sigma^2}=\log\frac{1-p}{p}$$
$$2b\left(\mu_+-\mu_-\right)+\left(\mu_-^2-\mu_+^2\right)=2\sigma^2\log\frac{1-p}{p}$$
So the solution is
$$b^*=\frac{\left(\mu_+^2-\mu_-^2\right)+2\sigma^2\log\frac{1-p}{p}}{2\left(\mu_+-\mu_-\right)}=\frac{\mu_++\mu_-}{2}+\frac{\sigma^2}{\mu_+-\mu_-}\log\frac{1-p}{p}.$$
Note that this - of course - doesn't depend on the costs.
If the classes are balanced, the optimum is the average of the mean test values in sick and healthy people, otherwise it is displaced based on the imbalance.
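As a quick numerical sanity check (a sketch only, using the illustrative parameter values of the demonstration below), we can compare this closed form with a direct numerical maximization of the accuracy:
p <- 0.2; muminus <- 9.5; muplus <- 10.5; sigma <- 1
## accuracy as a function of the threshold b
Accuracy <- function( b ) p*( 1-pnorm( b, muplus, sigma ) ) + ( 1-p )*pnorm( b, muminus, sigma )
## numerical maximization over a wide range of thresholds
optimize( Accuracy, interval = c( 0, 20 ), maximum = TRUE )$maximum
## closed-form optimum derived above
( muplus+muminus )/2 + sigma^2/( muplus-muminus )*log( ( 1-p )/p )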
Cost-based approach
Using Stephen's notation, the expected overall cost is
$$c^+_+\,p\left(1-\Phi_+(b)\right)+c^-_+\,(1-p)\left(1-\Phi_-(b)\right)+c^+_-\,p\,\Phi_+(b)+c^-_-\,(1-p)\,\Phi_-(b).$$
Take its derivative w.r.t. $b$ and set it equal to zero:
$$-c^+_+\,p\,\varphi_+(b)-c^-_+\,(1-p)\,\varphi_-(b)+c^+_-\,p\,\varphi_+(b)+c^-_-\,(1-p)\,\varphi_-(b)=$$
$$=\varphi_+(b)\,p\left(c^+_--c^+_+\right)+\varphi_-(b)\,(1-p)\left(c^-_--c^-_+\right)=$$
$$=\varphi_+(b)\,p\,c^+_d-\varphi_-(b)\,(1-p)\,c^-_d=0,$$
using the notation I introduced in my comments below Stephen's answer, i.e., $c^+_d=c^+_--c^+_+$ and $c^-_d=c^-_+-c^-_-$.
The optimal threshold is therefore given by the solution of the equation
$$\frac{\varphi_+(b)}{\varphi_-(b)}=\frac{(1-p)\,c^-_d}{p\,c^+_d}.$$
Two things should be noted here:
- This result is completely generic and works for any distribution of the test results, not only the normal. ($\varphi$ in that case of course means the probability density function of the distribution in question, not the normal density.)
- Whatever the solution for $b$ is, it is surely a function of $\frac{(1-p)\,c^-_d}{p\,c^+_d}$. (I.e., we immediately see how costs matter, in addition to class imbalance!)
I'd be really interested to see if this equation has a generic solution for $b$ (parametrized by the $\varphi$s), but I would be surprised.
Nevertheless, we can work it out for the normal case! The $\sqrt{2\pi\sigma^2}$s cancel on the left hand side, so we have
$$e^{-\frac{1}{2}\left(\frac{\left(b-\mu_+\right)^2}{\sigma^2}-\frac{\left(b-\mu_-\right)^2}{\sigma^2}\right)}=\frac{(1-p)\,c^-_d}{p\,c^+_d}$$
$$\left(b-\mu_-\right)^2-\left(b-\mu_+\right)^2=2\sigma^2\log\frac{(1-p)\,c^-_d}{p\,c^+_d}$$
$$2b\left(\mu_+-\mu_-\right)+\left(\mu_-^2-\mu_+^2\right)=2\sigma^2\log\frac{(1-p)\,c^-_d}{p\,c^+_d}$$
therefore the solution is
$$b^*=\frac{\left(\mu_+^2-\mu_-^2\right)+2\sigma^2\log\frac{(1-p)\,c^-_d}{p\,c^+_d}}{2\left(\mu_+-\mu_-\right)}=\frac{\mu_++\mu_-}{2}+\frac{\sigma^2}{\mu_+-\mu_-}\log\frac{(1-p)\,c^-_d}{p\,c^+_d}.$$
(Compare it to the previous result! We see that they are equal if and only if $c^-_d=c^+_d$, i.e., the difference in misclassification cost compared to the cost of correct classification is the same in sick and healthy people.)
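As a small sketch (the differential cost values cplusd and cminusd are made up just for this check), we can also verify that a numerical solution of the generic equation agrees with this closed form in the normal case:
p <- 0.2; muminus <- 9.5; muplus <- 10.5; sigma <- 1
cplusd <- 3   # c+_d, the differential cost in sick people (illustrative value)
cminusd <- 1  # c-_d, the differential cost in healthy people (illustrative value)
## the generic optimality condition phi+(b)/phi-(b) = (1-p) c-_d / (p c+_d), on the log scale
f <- function( b ) dnorm( b, muplus, sigma, log = TRUE ) - dnorm( b, muminus, sigma, log = TRUE ) -
  log( ( 1-p )*cminusd/( p*cplusd ) )
uniroot( f, interval = c( 0, 20 ) )$root
## closed-form solution for the normal case
( muplus+muminus )/2 + sigma^2/( muplus-muminus )*log( ( 1-p )*cminusd/( p*cplusd ) )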
A short demonstration
Let's say $c^-_-=0$ (this is quite natural medically) and $c^+_+=1$ (we can always achieve this by dividing every cost by $c^+_+$, i.e., by measuring every cost in $c^+_+$ units). Let's say that the prevalence is $p=0.2$. Also, let's say that $\mu_-=9.5$, $\mu_+=10.5$ and $\sigma=1$.
In this case:
library( data.table )
library( lattice )

## parameters of the demonstration (see above)
cminusminus <- 0
cplusplus <- 1
p <- 0.2
muminus <- 9.5
muplus <- 10.5
sigma <- 1

## grid of thresholds (b) and of the two misclassification costs:
##   cplusminus: cost of classifying a healthy person as sick (c^-_+ in the notation above)
##   cminusplus: cost of classifying a sick person as healthy (c^+_- in the notation above)
res <- data.table( expand.grid( b = seq( 6, 17, 0.1 ),
                                cplusminus = c( 1, 5, 10, 50, 100 ),
                                cminusplus = c( 2, 5, 10, 50, 100 ) ) )

## expected overall cost at each threshold, for each cost combination
res$cost <- cplusplus*p*( 1-pnorm( res$b, muplus, sigma ) ) +
  res$cplusminus*(1-p)*(1-pnorm( res$b, muminus, sigma ) ) +
  res$cminusplus*p*pnorm( res$b, muplus, sigma ) +
  cminusminus*(1-p)*pnorm( res$b, muminus, sigma )

## cost curves by cost structure; the vertical line is the accuracy-based optimal threshold
xyplot( cost ~ b | factor( cminusplus ), groups = cplusminus, ylim = c( -1, 22 ),
        data = res, type = "l", xlab = "Threshold",
        ylab = "Expected overall cost", as.table = TRUE,
        abline = list( v = (muplus+muminus)/2+
                         sigma^2/(muplus-muminus)*log((1-p)/p) ),
        strip = strip.custom( var.name = expression( {"c"^{"+"}}["-"] ),
                              strip.names = c( TRUE, TRUE ) ),
        auto.key = list( space = "right", points = FALSE, lines = TRUE,
                         title = expression( {"c"^{"-"}}["+"] ) ),
        panel = panel.superpose, panel.groups = function( x, y, col.line, ... ) {
          panel.xyplot( x, y, col.line = col.line, ... )
          panel.points( x[ which.min( y ) ], min( y ), pch = 19, col = col.line )
        } )
The result is (points depict the minimum cost, and the vertical line shows the optimal threshold with the accuracy-based approach):
We can very nicely see how the cost-based optimum can differ from the accuracy-based optimum. It is instructive to think over why: if it is more costly to classify a sick person as erroneously healthy than the other way around ($c^+_-$ is high, $c^-_+$ is low), then the threshold goes down, as we prefer to classify more easily into the category sick; on the other hand, if it is more costly to classify a healthy person as erroneously sick than the other way around ($c^+_-$ is low, $c^-_+$ is high), then the threshold goes up, as we prefer to classify more easily into the category healthy. (Check these on the figure!)
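A quick numerical illustration of this intuition (a sketch reusing the closed-form threshold derived above, with two made-up cost structures):
p <- 0.2; muminus <- 9.5; muplus <- 10.5; sigma <- 1
## closed-form cost-based optimal threshold as a function of the differential costs
OptimalThreshold <- function( cplusd, cminusd )
  ( muplus+muminus )/2 + sigma^2/( muplus-muminus )*log( ( 1-p )*cminusd/( p*cplusd ) )
OptimalThreshold( cplusd = 50, cminusd = 1 )  # classifying sick people as healthy is expensive: threshold goes down
OptimalThreshold( cplusd = 1, cminusd = 50 )  # classifying healthy people as sick is expensive: threshold goes up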
A real-life example
Let's have a look at an empirical example, instead of a theoretical derivation. This example will differ in basically two aspects:
- Instead of assuming normality, we will simply use the empirical data without any such assumption.
- Instead of using one single test, and its results in its own units, we will use several tests (and combine them with a logistic regression). The threshold will be applied to the final predicted probability. This is actually the preferred approach; see Chapter 19 (Diagnosis) of Frank Harrell's BBR.
The dataset (acath from the package Hmisc) is from the Duke University Cardiovascular Disease Databank, and contains whether the patient had significant coronary disease, as assessed by cardiac catheterization. This will be our gold standard, i.e., the true disease status, and the "test" will be the combination of the subject's age, sex, cholesterol level and duration of symptoms:
library( rms )
library( lattice )
library( latticeExtra )
library( data.table )
getHdata( "acath" )
acath <- acath[ !is.na( acath$choleste ), ]
dd <- datadist( acath )
options( datadist = "dd" )
fit <- lrm( sigdz ~ rcs( age )*sex + rcs( choleste ) + cad.dur, data = acath )
It is worth plotting the predicted risks on the logit scale, to see how normal they are (essentially, that is what we assumed previously, with the single test!):
densityplot( ~predict( fit ), groups = acath$sigdz, plot.points = FALSE, ref = TRUE,
auto.key = list( columns = 2 ) )
Well, they're hardly normal...
Let's go on and calculate the expected overall cost:
ExpectedOverallCost <- function( b, p, y, cplusminus, cminusplus,
                                 cplusplus = 1, cminusminus = 0 ) {
  ## cross-tabulate classification (p > b) against true status (y),
  ## multiply the cell counts by the corresponding costs and sum up
  sum( table( factor( p>b, levels = c( FALSE, TRUE ) ), y )*matrix(
    c( cminusminus, cplusminus, cminusplus, cplusplus ), nc = 2 ) )
}
table( predict( fit, type = "fitted" )>0.5, acath$sigdz )
ExpectedOverallCost( 0.5, predict( fit, type = "fitted" ), acath$sigdz, 2, 4 )
And let's plot it for all possible thresholds (a computational note: we don't need to mindlessly iterate through numbers from 0 to 1; we can perfectly reconstruct the curve by calculating it at every unique value of the predicted probabilities):
ps <- sort( unique( c( 0, 1, predict( fit, type = "fitted" ) ) ) )
xyplot( sapply( ps, ExpectedOverallCost,
p = predict( fit, type = "fitted" ), y = acath$sigdz,
cplusminus = 2, cminusplus = 4 ) ~ ps, type = "l", xlab = "Threshold",
ylab = "Expected overall cost", panel = function( x, y, ... ) {
panel.xyplot( x, y, ... )
panel.points( x[ which.min( y ) ], min( y ), pch = 19, cex = 1.1 )
panel.text( x[ which.min( y ) ], min( y ), round( x[ which.min( y ) ], 3 ),
pos = 3 )
} )
We can very well see where we should put the threshold to optimize the expected overall cost (without using sensitivity, specificity or predictive values anywhere!). This is the correct approach.
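If we just want the optimal cutoff itself, rather than the whole curve, we can of course extract it directly (costs here is simply a helper object of my own, reusing the function and data from above):
costs <- sapply( ps, ExpectedOverallCost, p = predict( fit, type = "fitted" ),
                 y = acath$sigdz, cplusminus = 2, cminusplus = 4 )
ps[ which.min( costs ) ]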
It is especially instructive to contrast these metrics:
ExpectedOverallCost2 <- function( b, p, y, cplusminus, cminusplus,
                                  cplusplus = 1, cminusminus = 0 ) {
  tab <- table( factor( p>b, levels = c( FALSE, TRUE ) ), y )
  sens <- tab[ 2, 2 ] / sum( tab[ , 2 ] )
  spec <- tab[ 1, 1 ] / sum( tab[ , 1 ] )
  c( `Expected overall cost` = sum( tab*matrix( c( cminusminus, cplusminus, cminusplus,
                                                   cplusplus ), nc = 2 ) ),
     Sensitivity = sens,
     Specificity = spec,
     PPV = tab[ 2, 2 ] / sum( tab[ 2, ] ),
     NPV = tab[ 1, 1 ] / sum( tab[ 1, ] ),
     ## note: 1 - accuracy and 1 - Youden index, so that every metric is to be minimized (see below)
     Accuracy = 1 - ( tab[ 1, 1 ] + tab[ 2, 2 ] )/sum( tab ),
     Youden = 1 - ( sens + spec - 1 ),
     Topleft = ( 1-sens )^2 + ( 1-spec )^2
  )
}
ExpectedOverallCost2( 0.5, predict( fit, type = "fitted" ), acath$sigdz, 2, 4 )
res <- melt( data.table( ps, t( sapply( ps, ExpectedOverallCost2,
p = predict( fit, type = "fitted" ),
y = acath$sigdz,
cplusminus = 2, cminusplus = 4 ) ) ),
id.vars = "ps" )
p1 <- xyplot( value ~ ps, data = res, subset = variable=="Expected overall cost",
type = "l", xlab = "Threshold", ylab = "Expected overall cost",
panel=function( x, y, ... ) {
panel.xyplot( x, y, ... )
panel.abline( v = x[ which.min( y ) ],
col = trellis.par.get()$plot.line$col )
panel.points( x[ which.min( y ) ], min( y ), pch = 19 )
} )
p2 <- xyplot( value ~ ps, groups = variable,
data = droplevels( res[ variable%in%c( "Expected overall cost",
"Sensitivity",
"Specificity", "PPV", "NPV" ) ] ),
subset = variable%in%c( "Sensitivity", "Specificity", "PPV", "NPV" ),
type = "l", xlab = "Threshold", ylab = "Sensitivity/Specificity/PPV/NPV",
auto.key = list( columns = 3, points = FALSE, lines = TRUE ) )
doubleYScale( p1, p2, use.style = FALSE, add.ylab2 = TRUE )
We can now analyze those metrics that are sometimes specifically advertised as being able to come up with an optimal cutoff without costs, and contrast them with our cost-based approach! Let's use the three most often used metrics:
- Accuracy (maximize accuracy)
- Youden rule (maximize $\mathrm{Sens}+\mathrm{Spec}-1$)
- Topleft rule (minimize $\left(1-\mathrm{Sens}\right)^2+\left(1-\mathrm{Spec}\right)^2$)
(For simplicity, we will subtract the above values from 1 for the Youden and the Accuracy rule so that we have a minimization problem everywhere.)
Let's see the results:
p3 <- xyplot( value ~ ps, groups = variable,
data = droplevels( res[ variable%in%c( "Expected overall cost", "Accuracy",
"Youden", "Topleft" ) ] ),
subset = variable%in%c( "Accuracy", "Youden", "Topleft" ),
type = "l", xlab = "Threshold", ylab = "Accuracy/Youden/Topleft",
auto.key = list( columns = 3, points = FALSE, lines = TRUE ),
panel = panel.superpose, panel.groups = function( x, y, col.line, ... ) {
panel.xyplot( x, y, col.line = col.line, ... )
panel.abline( v = x[ which.min( y ) ], col = col.line )
panel.points( x[ which.min( y ) ], min( y ), pch = 19, col = col.line )
} )
doubleYScale( p1, p3, use.style = FALSE, add.ylab2 = TRUE )
This of course pertains to one specific cost structure, $c^-_-=0$, $c^+_+=1$, $c^-_+=2$, $c^+_-=4$ (this obviously matters only for the optimal cost decision). To investigate the effect of the cost structure, let's pick just the optimal threshold (instead of tracing the whole curve), but plot it as a function of costs. More specifically, as we have already seen, the optimal threshold depends on the four costs only through the $c^-_d/c^+_d$ ratio, so let's plot the optimal cutoff as a function of this ratio, along with the typically used metrics that don't use costs:
res2 <- data.frame( rat = 10^( seq( log10( 0.02 ), log10( 50 ), length.out = 500 ) ) )
res2$OptThreshold <- sapply( res2$rat,
function( rat ) ps[ which.min(
sapply( ps, Vectorize( ExpectedOverallCost, "b" ),
p = predict( fit, type = "fitted" ),
y = acath$sigdz,
cplusminus = rat,
cminusplus = 1,
cplusplus = 0 ) ) ] )
xyplot( OptThreshold ~ rat, data = res2, type = "l", ylim = c( -0.1, 1.1 ),
xlab = expression( {"c"^{"-"}}["d"]/{"c"^{"+"}}["d"] ), ylab = "Optimal threshold",
scales = list( x = list( log = 10, at = c( 0.02, 0.05, 0.1, 0.2, 0.5, 1,
2, 5, 10, 20, 50 ) ) ),
panel = function( x, y, resin = res[ ,.( ps[ which.min( value ) ] ),
.( variable ) ], ... ) {
panel.xyplot( x, y, ... )
panel.abline( h = resin[variable=="Youden"] )
panel.text( log10( 0.02 ), resin[variable=="Youden"], "Y", pos = 3 )
panel.abline( h = resin[variable=="Accuracy"] )
panel.text( log10( 0.02 ), resin[variable=="Accuracy"], "A", pos = 3 )
panel.abline( h = resin[variable=="Topleft"] )
panel.text( log10( 0.02 ), resin[variable=="Topleft"], "TL", pos = 1 )
} )
Horizontal lines indicate the approaches that don't use costs (and are therefore constant).
Again, we nicely see that as the additional cost of misclassification in the healthy group rises compared to that of the diseased group, the optimal threshold increases: if we really don't want healthy people to be classified as sick, we will use a higher cutoff (and the other way around, of course!).
And, finally, we yet again see why those methods that don't use costs are not (and cannot!) always be optimal.