Friday, January 29, 2016

The remains of R. A. Fisher are visited by Bayesians again

Thanks to Phillip Alday, the second edition of Doing Bayesian Data Analysis has visited the remains of R. A. Fisher at St. Peter's Cathedral in Adelaide. Here is a photo that Phillip snapped: 
In a previous post, the first edition visited Fisher's remains. The book has also visited Bayes tomb, Laplace's memorial, and Jacob Bernoulli's grave. The goal here is to have some fun and pique some historical interest, not to show any (serious) disrespect. If you pose the book at sites of other relevant people, dead or not-quite-yet-dead, let me know!

Thursday, December 31, 2015

(A pointer to) bayes.js: A Small Library for Doing MCMC in the Browser

Screen shot from bayes.js blog post.
This is just a pointer to a new post at Rasmus Bååth's Blog that might interest readers of DBDA2E.

bayes.js: A small JavaScript library that implements an adaptive MCMC sampler and a couple of probability distributions, and that makes it relatively easy to implement simple Bayesian models in JavaScript.

See the post at http://www.sumsar.net/blog/2015/12/bayes-js-a-small-library-for-doing-mcmc-in-the-browser/

Sunday, December 27, 2015

Lessons from Bayesian disease diagnosis: Don't over-interpret the Bayes factor, VERSION 2


[This is a revised version of a previous post which is superseded by this post. 
This revision has corrected derivations, new R/JAGS code, and new diagrams.] 

Overview
"Captain, the prior probability of this
character dying and leaving the show
is infinitesimal."
A primary example of Bayes' rule is for disease diagnosis (or illicit drug screening). The example is invoked routinely to explain the importance of prior probabilities. Here's one version of it: Suppose a diagnostic test has a 97% detection rate and a 5% false alarm rate. Suppose a person selected at random tests positive. What is the probability that the person has the disease? It might seem that the odds of having the disease are 0.97/0.05 (i.e., the detection rate over the false alarm rate), which corresponds to a probability of about 95%. But Bayesians know that such an answer is not appropriate because it ignores the prior probability of the disease, which presumably is very rare. Suppose the prior probability of having the disease is 1%. Then Bayes' rule implies that the posterior probability of having the disease is only about 16%, even after testing positive! This type of example is presented over and over again in introductory expositions (e.g., pp. 103-105 of DBDA2E), emphatically saying not to use only the detection and false alarm rates, and always to incorporate the prior probabilities of the conditions.


The ratio of the detection and false-alarm rates is the Bayes factor, hence the lesson of the primary example is: Do not use the Bayes factor alone to make decisions; instead always incorporate the prior probabilities of the options. Unfortunately, this lesson is often forgotten or at least de-emphasized in general model-selection settings. In model-selection settings, some people compute the Bayes factor and use it alone as the basis for model selection. The tacit assumption is that the models all have equal prior probabilities. But this assumption of equal prior probabilities seems implausible in most situations, and should be explicitly justified. If we should not rely on the Bayes factor alone to make decisions in disease diagnosis or drug testing, why should we rely on the Bayes factor alone to make decisions in other model-selection applications?

In this post
  • I explain the role of the Bayes factor in the simple disease-diagnosis situation. (This also applies to random drug screening.)
  • Then I extend to the situation in which the prior probabilities are not certain. I show that in the limit, with utter uncertainty in the prior probabilities, the posterior odds fade to match the Bayes factor. But even with some modest knowledge of the base rates, the posterior odds can be far from the Bayes factor. 
  • Hierarchical diagrams are provided to illustrate the model comparisons.
  • JAGS code is provided for hands-on experience.
Conclusion: When deciding between models, the decision should not be based on the Bayes factor alone.

Bayes' rule applied to disease diagnosis

A person is randomly selected from a population in which the prior probability of having a particular disease is 1%. Denote the condition of the person by the variable \(M\), which can have two values, either \(M\!\!=\!\!1\) for having the disease or \(M\!\!=\!\!2\) for not having the disease. The prior probability can be written as \(p(M\!\!=\!\!1) = 0.01\), and hence \(p(M\!\!=\!\!2) = 0.99\).

A diagnostic test for the disease can show a "positive" result suggesting that the disease is present, or a "negative" result suggesting that the disease is absent. The result of the test is the datum, denoted \(D\), which can have two values: \(D\!\!=\!\!1\) for "positive" or \(D\!\!=\!\!0\) for "negative". The correct-detection rate (a.k.a. sensitivity) of the test is 97%, meaning \(p( D\!\!=\!\!1 | M\!\!=\!\!1 ) = 0.97\). The false-alarm rate of the test is 5%, meaning \(p( D\!\!=\!\!1 | M\!\!=\!\!2 ) = 0.05\). The complement of the false alarm rate is called the specificity of the test, which in this example is 95%.

Applying Bayes' rule yields: \[ p( M\!\!=\!\!1| D\!\!=\!\!1 ) = p( D\!\!=\!\!1 | M\!\!=\!\!1 ) p( M\!\!=\!\!1 ) / p( D\!\!=\!\!1 ) \] and \[ p( M\!\!=\!\!2| D\!\!=\!\!1 ) = p( D\!\!=\!\!1 | M\!\!=\!\!2 ) p( M\!\!=\!\!2 ) / p( D\!\!=\!\!1 ) \] In the above two equations, the denominator \( p( D\!\!=\!\!1 ) \) is the same, computed as \( p( D\!\!=\!\!1 ) = \sum_m p( D\!\!=\!\!1 | M\!\!=\!\!m ) p( M\!\!=\!\!m ) \). The ratio of the two equations is \begin{align} \underbrace{\frac{p( M\!\!=\!\!1 | D\!\!=\!\!1 )}{p( M\!\!=\!\!2 | D\!\!=\!\!1 ) } }_{\mbox{posterior odds}} &= \underbrace{\frac{p( D\!\!=\!\!1 | M\!\!=\!\!1 )}{p( D\!\!=\!\!1 | M\!\!=\!\!2 )} }_{\mbox{Bayes factor}} \times \underbrace{\frac{p( M\!\!=\!\!1 ) }{ p( M\!\!=\!\!2 )}}_{\mbox{prior odds}} \label{eq:BF}\end{align} Notice from Equation \ref{eq:BF} that the Bayes factor is the ratio of the detection rate to the false alarm rate (when the datum is \(D\!\!=\!\!1\)). The Bayes factor converts the prior odds (i.e., the ratio of prior probabilities) to the posterior odds (i.e., the ratio of posterior probabilities).

When we plug in the numerical values of the example, we get \[ \underbrace{\frac{0.164}{0.836}}_{\mbox{posterior odds}} = \underbrace{\frac{0.97}{0.05}}_{\mbox{Bayes factor}} \times \underbrace{\frac{0.01}{0.99}}_{\mbox{prior odds}} \] or \[  \underbrace{0.196}_{\mbox{posterior odds}} = \underbrace{19.4}_{\mbox{Bayes factor}} \times \underbrace{0.0102}_{\mbox{prior odds}}\] (the displayed values are rounded, whereas the underlying calculations retained many decimal places). Thus, even though the Bayes factor is 19.4 in favor of having the disease, the posterior probability of having the disease is only 0.164 because the prior probability was 0.01.

The real-life lesson of these sorts of examples is that when doing random disease screening or random drug testing, it is not appropriate to use the Bayes factor alone to interpret the result of the diagnostic test. The prior probabilities of the conditions (disease or healthy, using drug or not) must be factored in, and the decision should be based on the posterior probability. 


Disease diagnosis is model comparison

The case of disease diagnosis is just a case of model comparison. In model comparison, we have some data and we infer the posterior probabilities of the models given the data. For disease diagnosis, the data consist of a dichotomous value, namely \(D\!\!=\!\!1\) or \(D\!\!=\!\!0\), and the two models are merely two possible underlying states of the world, namely \(M\!\!=\!\!1\) or \(M\!\!=\!\!2\). Disease diagnosis is simply asking which model, disease or healthy, is a better model of the outcome of the diagnostic test.

The moral of the disease-diagnosis example ---that the Bayes factor alone is not appropriate for selecting the model--- applies to model comparison generally. Model comparison should be based on the posterior probabilities of the models.

Figure 1: Disease diagnosis as model comparison. At bottom, D is the datum from the diagnostic test, with D=1 denoting "positive" result and D=0 denoting negative result. At left, model M=1 represents disease present. The right bar of its Bernoulli distribution over D=1 has height labeled by p(D=1|M=1), which is the correct-detection rate. At right, model M=2 represents disease absent.The right bar of its Bernoulli distribution over D=1 has height labeled by p(D=1|M=2) which is the false-alarm rate. At the top of the diagram is the base rate of the diseases. The prior probability of having the disease, M=1, is \(\pi_1\).
Figure 1 shows disease diagnosis as model comparison analogous to the diagrams in Figures 10.1 and 10.2 of DBDA2E (pp. 267 and 269). The caption provides a description. Essentially, the model for "has disease" has likelihood function \(D \sim \mbox{dbern}( p(D\!\!=\!\!1|M\!\!=\!\!1) ) \) and the model for "no disease" has likelihood function \(D \sim \mbox{dbern}( p(D\!\!=\!\!1|M\!\!=\!\!2) ) \). The prior probability of the model indices is \(M \sim \mbox{dcat}( \pi_1 , 1\!-\!\pi_1 )\).

Decision rule in model comparison sometimes uses only the Bayes factor, but should not

One practice in model comparison is to base decisions on the Bayes factor, forgetting about or at least de-emphasizing the prior probabilities (and hence the posterior probabilities) of the models. A typical decision rule is,

If the Bayes factor is >3 in favor of one model relative to the other, accept the better model and reject the other. 
Applying the Bayes-factor-only decision rule to disease diagnosis would lead us to decide that a person has the disease whenever the diagnostic test is positive because the Bayes factor is 19.4 which is much greater than 3. Using only the Bayes factor for decision making, instead of the posterior probabilities of the models, ignores Bayes' rule and is not Bayesian despite the terminology. A Bayes factor is only Bayesian in the full context of Bayes' rule, which converts prior probabilities to posterior probabilities. (Note that I am not concerned with the value 3 in the decision rule. The decision threshold could be 3 or 10 or whatever. Rather, the problem is not taking into account the prior probabilities of the models.)

The usual justification for using the Bayes factor alone is two fold. First, if the prior probabilities of the models are all equal, then the posterior odds algebraically equal the Bayes factor. (You can see that for yourself from Equation \ref{eq:BF}, above.) Setting the prior probabilities to be equal seems to be an innocuous default for many model-comparison settings, even if it's an absurd assumption for random screening of diseases or drug usage. Second, even if unequal prior probabilities for the models are assumed, the Bayes factor tells us how to modify those prior odds to arrive at the corresponding posterior odds. Therefore it is informative to report the Bayes factor.

The two justifications must be used carefully. First, I doubt that most model-comparison settings really are well represented by assuming equal prior probabilities of the models. Setting the prior probabilities equal to each other is an assertion of equality, not an expression of uncertainty; the next section discusses uncertainty. Setting the prior probabilities to 0.50/0.50 should be based on as much prior information as setting the prior probabilities to 0.01/0.99. Second, even if the Bayes factor is reported to summarize the analysis, the Bayes factor should not be the basis for decision making. The decision should be based on the posterior probabilities of the models.

Modeling uncertainty in the prior probabilities of the models

The usual formal set-up, as in the equations above, assumes specific exact values for the prior probabilities of the models. That is to say, \(p(M\!\!=\!\!m)\) has a specific exact value such as
\(p(M\!\!=\!\!1) = 0.01\) for the "has disease" model. But what if we are uncertain about the prior probabilities of the models? Modeling uncertainty is exactly what Bayesian analysis is all about.

In a previous version of this post, I made a mathematical derivation that seemed reasonable but which did not match my intuition for how the answer should play out. I asked readers to let me know if they could point out errors in the derivation. In a series of emails, Tomasz Miąsko (tomasz.miasko@gmail.com) patiently helped me to reconceptualize the derivation, and he also provided me with a simple JAGS script to verify the result. Thank you, Tomasz! I finally have clarity and here I present an elaborated derivation and JAGS script.

Denote \(p(M\!\!=\!\!1)\) by the parameter \(\pi_1\), which can take on any value from 0 to 1. Of course, \(p(M\!\!=\!\!2) = 1-\pi_1\). We express our uncertainty in \(\pi_1\) as a probability distribution \(p(\pi_1)\). For example, \(p(\pi_1)\) could be a beta distribution that has mode \(\omega\) and concentration \(\kappa\). (Hence the shape parameters for the beta distribution are \(a=\omega (\kappa-2) + 1\) and \(b= (1-\omega) (\kappa-2) + 1\).) If we are extremely certain about the prior probability of the model, then \(\kappa\) has a very large value, meaning that the prior distribution on \(\pi_1\) is tightly concentrated at \(\omega\), and the analysis converges to the point value assumed in the equations above. If we are uncertain about the prior probability of the model, then \(\kappa\) is a small-ish value, with \(\kappa \ge 2\). When \(\kappa=2\) the prior is a uniform distribution and we have "utter" uncertainty in the value of \(\pi_2\). Figure 2 shows a diagram of the model.

Figure 2: Disease diagnosis with uncertainty in the base rates of the diseases. See Figure 1 for description of the lower part of the diagram. Near the top of the diagram is the base rate of the diseases, where the prior probability of having the disease is denoted \(\pi_1\). There is uncertainty in the value of \(\pi_1\) represented by the beta distribution at the top of the diagram. The beta distribution has mode \(\omega\) and concentration \(\kappa\).

Notice that the overall model involves two parameters: \(M\) and \(\pi_1\). Consider Bayes' rule for the two-parameter space (as pointed out to me by Tomasz Miasko):
\begin{align}
p( M , \pi_1 | D )
&= p( D | M , \pi_1 ) p( M , \pi_1 ) / p^*(D)
\notag
\\
&= p( D | M ) p( M | \pi_1 ) p( \pi_1 ) / p^*(D)
\label{eq:pMpi1gD}
\end{align}
where, and this is crucial, \(p^*(D)\) is the double integral over the two-parameter space:
\begin{align}
p^*(D)
&=
\sum_m \int_{\pi_1=0}^{\pi_1=1} d\pi_1 p( D | M ) p( M | \pi_1 ) p( \pi_1 )
\label{eq:pD2param}
\end{align}
Notice that \(p^*(D)\) is a constant, not a function of \(\pi_1\).

To compute the posterior probability of \(M\) given data \(D\), we marginalize the two-parameter posterior of Equation \ref{eq:pMpi1gD}:
\begin{align}
p(M|D)
&=
\int_{\pi_1=0}^{\pi_1=1} d\pi_1 p( M , \pi_1 | D )
\notag
\\
&=
\int_{\pi_1=0}^{\pi_1=1} d\pi_1 p( D | M ) p( M | \pi_1 ) p( \pi_1 ) / p^*(D)
\notag
\\
&=
\frac{ p( D | M )}{ p^*(D) } \int_{\pi_1=0}^{\pi_1=1} d\pi_1  p( M | \pi_1 ) p( \pi_1 )
\label{eq:pMgD2param}
\end{align}
Notice in Equation \ref{eq:pMgD2param} (above) that \(p^*(D)\) was pulled in front of the integral sign, which is valid because \(p^*(D)\) is a constant. For the two models, Equation \ref{eq:pMgD2param} becomes
\begin{align}
p(M=1|D)
&=
\frac{ p( D | M=1 )}{ p^*(D) }
\int_{\pi_1=0}^{\pi_1=1} d\pi_1  \pi_1 p( \pi_1 )
\notag
\\
&\mbox{and}
\notag
\\
p(M=2|D)
&=
\frac{ p( D | M=2 )}{ p^*(D) }
\int_{\pi_1=0}^{\pi_1=1} d\pi_1  (1-\pi_1) p( \pi_1 )
\notag
\end{align}
The posterior odds are therefore
\begin{align}
\frac{ p(M=1|D) }{ p(M=2|D) }
&=
\frac{ p( D | M=1 ) }{ p( D | M=2 ) }
\times
\frac{ \int_{\pi_1=0}^{\pi_1=1} d\pi_1  \pi_1 p( \pi_1 ) }{ \int_{\pi_1=0}^{\pi_1=1} d\pi_1  (1-\pi_1) p( \pi_1 )}
\label{eq:postOdds2param}
\end{align}

In particular, when the distribution \(p( \pi_1 )\) is uniform with \(p( \pi_1 )=1.0\), the prior odds in Equation \ref{eq:postOdds2param} are 1/1. Thus, when there is complete uncertainty on \(\pi_1\) as expressed by a uniform distribution, then the posterior odds match the Bayes factor.

[What was wrong with the derivation of the previous post superseded by this post? The previous derivation began with the wrong conditional, but I didn't realize it because the notation suppressed the conditioned variable and hid the problem. The previous derivation started with \(p(M=1|D)\) without explicitly noting that it was conditional on \(\pi_1\). Therefore it was easy to think, in fact difficult not to think, that the derivation began with \(p(M=1,\pi_1|D)\) when in fact it began with \(p(M=1|D,\pi_1)\). Thus, when integrating over \(\pi_1\), the result is \(\int d\pi_1 \, p(M=1|D,\pi_1) \) and not the desired \(\int d\pi_1 \, p(M=1,\pi_1|D) \). Thus, the derivation of the previous post derived some strange quantity and not the posterior probability of \(M\).]


Numerical results from R and JAGS

Here are some results from the R script for JAGS appended below. In all cases, the mode of the prior on \(\pi_1\) is \(\omega=0.01\).

When the prior on \(\pi_1\) is extremely concentrated around \(\omega=0.01\), with \(\kappa=20000\), a run of JAGS produces
P(M=1|D)=0.165; P(M=2|D)=0.835; P(M=1|D)/P(M=2|D)=0.198
This result essentially matches the numerical result in the opening paragraphs of this post, which assumed a singular value for \(\pi_1\) at 0.01.

When the prior on \(\pi_1\) is only modestly concentrated around \(\omega=0.01\), with \(\kappa=20\), a run of JAGS produces
P(M=1|D)=0.547; P(M=2|D)=0.453; P(M=1|D)/P(M=2|D)=1.21
Notice that the posterior odds are now slightly in favor of having the disease, but are far from the Bayes factor of 19.4.


When the prior on \(\pi_1\) is weakly concentrated around \(\omega=0.01\), with \(\kappa=10\), a run of JAGS produces
P(M=1|D)=0.701; P(M=2|D)=0.299; P(M=1|D)/P(M=2|D)=2.34
Notice that the posterior odds are leaning in favor of having the disease, but have still not exceeded a ratio of 3, and are still far from the Bayes factor of 19.4.


When the prior on \(\pi_1\) is completely uncertain around \(\omega=0.01\), that is, a uniform distribution with \(\kappa=2\), a run of JAGS produces
P(M=1|D)=0.951; P(M=2|D)=0.0493; P(M=1|D)/P(M=2|D)=19.3
Here, with utter uncertainty in the prior probabilities of the models, the posterior odds match the Bayes factor, as was commented just after Equation \ref{eq:postOdds2param}.

Recap: The posterior odds of the models are not the same thing as the Bayes factor of the models. Decisions should be based on the posterior probabilities. The Bayes factor does not take into account the prior probabilities of the models (and despite its name the Bayes factor by itself does not even use Bayes' rule). The derivations and numerical examples above have shown that even with a lot of uncertainty in the prior probabilities on models, the posterior odds can be quite different than the Bayes factor. The posterior odds numerically equal the Bayes factor when (i) there is utter uniform uncertainty in the prior probabilities of the models or (ii) when there is complete prior knowledge that the models have exactly equal prior probability. I suspect that neither of those conditions occurs in reality. [Revised 28-Dec-2015, 9:35am] The posterior odds numerically equal the Bayes factor when (as a broad sufficient but not necessary condition) the uncertainty of the model prior probability is symmetric and known to have a mode at exactly 0.5, which I suspect rarely occurs in reality.

The R and JAGS script:

# Original script by Tomasz Miasko 24-Dec-2015 tomasz.miasko@gmail.com
# Elaborated by John Kruschke 26-Dec-2015 johnkruschke@gmail.com

rm(list=ls())

library(rjags)

# JAGS model for disease diagnosis example, 27-December-2015,
# http://doingbayesiandataanalysis.blogspot.com/

# Datum D is outcome of diagnostic test(s) for a single individual.
# D=1 suggests disease is present, D=0 suggests disease is absent.
# pD1gM[m] is probability of D=1 given model index m.
# m=1 means disease is truly present, m=2 means disease is truly absent.

modelString = "
model {
  for ( i in 1:length(D) ) {
    D[i] ~ dbern( pD1gM[m] ) # pD1gM[1:2] are constants provided in data list.
  }
  m ~ dcat( pM[1:2] )   # model index probability.
  pM[1] ~ dbeta( omega*(kappa-2)+1 ,
                (1-omega)*(kappa-2)+1 ) # prior probability for m=1
  pM[2] <- 1-pM[1]      # prior probability for m=2
  # omega and kappa are constants provided in data list.
}
"

# Specify constants:
# The result of the diagnostic test(s) for a single individual:
D = c(rep(1,1),rep(0,0)) # use D=NA for prior
# Probability D=1 for m=1,2; i.e.,
# correct detection rate and false alarm rate:
pD1gM = c(0.97,0.05) # high value for disease present, low value for absent
# Modal base rate of disease present:
omega = 0.01
# Certainty of modal base rate (kappa >= 2):
kappa = c(2,20,20000)[3]

model = jags.model(
  file=textConnection(modelString),
  data=list( D=D , pD1gM=pD1gM , omega=omega , kappa=kappa ),
  n.chains=4)
update( object=model, n.iter=1000 )
samples = coda.samples(
  model=model,
  variable.names=c("m","pM"),
  n.iter=125000) # total MCMC steps = n.chains*n.iter
chain = as.matrix(samples)

pM1gD = sum(chain[,"m"]==1)/length(chain[,"m"])
pM2gD = sum(chain[,"m"]==2)/length(chain[,"m"])

print( paste0( "For D = ", sum(D) , " 1's and " , length(D)-sum(D) , " 0's" ,
              ": P(M=1|D)=" , signif(pM1gD,3) ,
              "; P(M=2|D)=" , signif(pM2gD,3) ,
              "; P(M=1|D)/P(M=2|D)=", signif(pM1gD/pM2gD,3) ) )
hist( chain[,"pM[1]"] , breaks=51 ,
      main=bquote("p( pM[1] | D=" *.(sum(D))* " 1's and " 
                    *.(length(D)-sum(D))* " 0's )" ) ,
      col="grey" , border="grey" , xlim=c(0,1) )

Tuesday, December 22, 2015

Bayesian Item Response Theory in JAGS: A Hierarchical Two Parameter Logistic Model

I recently created a hierarchical two-parameter logistic model for item response theory (IRT). The JAGS script is now in the folder of scripts that accompany the book (available at the book's web site; click book cover at right). 

Below are slides that accompany my presentation of the material. I hope the slides are self-explanatory for those of you who are already familiar with IRT, and maybe even for those of you who are not. Maybe one day I'll record a narration over the slides and post a video. Meanwhile, I hope the slides below are useful.

By the way, if you find this program to be useful and adapt it to your own real data, please let me know about the results. (And if you write up the results, it's okay to cite this blog post and the book :-)

All analyses begin with the data. The data are correct/wrong answers for items on a test. The JAGS program will use long format for the data:
(Click on any image to enlarge.)

Each item is modeled as producing a 1/0 response with probability that depends on the item's difficulty and the subject's ability:




There is indeterminacy in the difficulty/ability scale, and two points must be arbitrarily pinned down. I chose the following: The most difficult not-impossible item is given a difficulty of 100, and the easiest non-trivial item is given a difficulty of 0. Then I opted to put the subject abilities and item discriminations under higher-level distributions, to get the benefits of shrinkage in a hierarchical model:

Here is the JAGS script itself (again, see the program folder that accompanies the book):





Now for the data used to demonstrate the model:

Results from the JAGS run. First, the MCMC diagnostics all look good:
(Click on any image to the images enlarged.)


Here are the estimates of individual items, listed in order of estimated difficulty. Notice that the easiest has difficulty fixed at 0, and the most difficult has  difficulty fixed at 100.













Subject abilities:


 

Two of the (simulated) subjects have data from only a single item. But this does not leave the estimated ability unbounded because of shrinkage from the hierarchical model, and the Bayesian estimation provides exact degrees of uncertainty on the estimates:

And in Bayesian estimation it is easy to make comparisons of item difficulties, or of item discriminations, or of subject abilities:

Saturday, December 12, 2015

Lessons from Bayesian disease diagnosis: Don't over-interpret the Bayes factor

This post has been revised and is now superseded by the new post linked >HERE<.

Overview
"Captain, the prior probability of this
character dying is extremely small."
A primary example of Bayes' rule is for disease diagnosis (or illicit drug screening). The example is invoked routinely to explain the importance of prior probabilities. Here's one version of it: Suppose a diagnostic test has a 97% detection rate and a 5% false alarm rate. Suppose a person selected at random tests positive. What is the probability that the person has the disease? It might seem that the odds of having the disease are 0.97/0.05 (i.e., the detection rate over the false alarm rate), which corresponds to a probability of about 95%. But Bayesians know that such an answer is not appropriate because it ignores the prior probability of the disease, which presumably is very rare. Suppose the prior probability of having the disease is 1%. Then Bayes' rule implies that the posterior probability of having the disease is only about 16%, even after testing positive! This type of example is presented over and over again in introductory expositions (e.g., pp. 103-105 of DBDA2E), emphatically saying not to use only the detection and false alarm rates, and always to incorporate the prior probabilities of the conditions.


The ratio of the detection and false-alarm rates is the Bayes factor, hence the lesson of the primary example is: Do not use the Bayes factor alone to make decisions; instead always incorporate the prior probabilities of the options. Unfortunately, this lesson is often forgotten or at least de-emphasized in general model-selection settings. In model-selection settings, some people compute the Bayes factor and use it alone as the basis for model selection. The tacit assumption is that the models all have equal prior probabilities. But this assumption of equal prior probabilities seems implausible in most situations, and should be explicitly justified.

If we should not rely on the Bayes factor alone to make decisions in disease diagnosis or drug testing, why should we rely on the Bayes factor alone to make decisions in other model-selection applications?

In this post I first explain the role of the Bayes factor in the usual simple disease-diagnosis or drug-screening situation. Then I extend to a situation in which the prior probabilities are not certain. I show that in the limit, with utter uncertainty in the prior probabilities, the posterior odds fade to match the Bayes factor. But the decision should not be based on the Bayes factor alone.

This post has been revised and is now superseded by the new post linked >HERE<.

Bayes' rule applied to disease diagnosis

A person is randomly selected from a population in which the prior probability of having a particular disease is 1%. Denote the condition of the person by the variable \(M\), which can have two values, either \(M=\mbox{healthy}\) or \(M=\mbox{disease}\). The prior probability can be written as \(p(M=\mbox{disease}) = 0.01\), and hence \(p(M=\mbox{healthy}) = 0.99\).

A diagnostic test for the disease can show a "positive" result, suggesting the disease is present, or a "negative" result, suggesting the disease is absent. The result of the test is the datum, denoted \(D\), which can have two values \(D=\mbox{pos}\) or \(D=\mbox{neg}\). The detection rate of the test is 97%, meaning \(p( D=\mbox{pos} | M=\mbox{disease} ) = 0.97\). This is also called the "sensitivity" of the test. The false-alarm rate of the test is 5%, meaning \(p( D=\mbox{pos} | M=\mbox{healthy} ) = 0.05\). The complement of the false alarm rate is called the "specificity" of the test, which in this example is 95%.

Applying Bayes' rule yields: \[
p(  M=\mbox{disease}| D=\mbox{pos} ) = p( D=\mbox{pos} | M=\mbox{disease} ) p( M=\mbox{disease} ) / p( D=\mbox{pos} )
\] and \[
p(  M=\mbox{healthy}| D=\mbox{pos} ) = p( D=\mbox{pos} | M=\mbox{healthy} ) p( M=\mbox{healthy} ) / p( D=\mbox{pos} )
\] In the above two equations, the denominator \( p( D=\mbox{pos} ) \) is the same, computed as \( \sum_m p( D=\mbox{pos} | M=\mbox{m} ) p( M=\mbox{m} ) \). The ratio of the two equations is
\[
\underbrace{\frac{p(  M=\mbox{disease}| D=\mbox{pos} )}{p(  M=\mbox{healthy}| D=\mbox{pos} ) } }_{\mbox{posterior odds}}
=
\underbrace{\frac{p( D=\mbox{pos} | M=\mbox{disease} )}{p( D=\mbox{pos} | M=\mbox{healthy} )} }_{\mbox{Bayes factor}} \times
\underbrace{\frac{p( M=\mbox{disease} ) }{ p( M=\mbox{healthy} )}}_{\mbox{prior odds}}
\] Notice from the above equation that the Bayes factor is the ratio of the detection rate to the false alarm rate (when the datum is \(D=\mbox{pos}\)). The Bayes factor converts the prior odds (i.e., the ratio of prior probabilities) to the posterior odds (i.e., the ratio of posterior probabilities).

When we plug in the numerical values of the example, we get
\[
\underbrace{\frac{0.164}{0.836}}_{\mbox{posterior odds}}
=
\underbrace{\frac{0.97}{0.05}}_{\mbox{Bayes factor}} \times
\underbrace{\frac{0.01}{0.99}}_{\mbox{prior odds}}
\]
or
\[
\underbrace{0.196}_{\mbox{posterior odds}}
=
\underbrace{19.4}_{\mbox{Bayes factor}} \times
\underbrace{0.0102}_{\mbox{prior odds}}
\]
(the displayed values are rounded, whereas the underlying calculations retained many decimal places). Thus, even though the Bayes factor is 19.4 in favor of having the disease, the posterior probability of having the disease is only 0.164 because the prior probability was 0.01.

The real-life lesson of these sorts of examples is that when doing random disease screening or random drug testing, it is not appropriate to use the Bayes factor alone to interpret the result of the diagnostic test. The prior probabilities of the conditions (disease or healthy, using drug or not) must be factored in, and the decision should be based on the posterior probability.

Disease diagnosis is model comparison

The case of disease diagnosis is just a case of model comparison. In model comparison, we have some data and we infer the posterior probabilities of the models given the data. For disease diagnosis, the data consist of a single dichotomous value, namely \(D=\mbox{pos}\) or \(D=\mbox{neg}\), and the two models are merely two possible underlying states of the world, namely \(M=\mbox{disease}\) or \(M=\mbox{healthy}\). Disease diagnosis is simply asking which model, disease or healthy, is a better model of the outcome of the diagnostic test.

The moral of the disease-diagnosis example ---that the Bayes factor alone is not appropriate for selecting the model--- applies to model comparison generally. Model comparison should be based on the posterior probabilities of the models.

Decision rule in model comparison sometimes uses only the Bayes factor --- but should not

One practice in model comparison is to base decisions on the Bayes factor, forgetting about or at least de-emphasizing the prior probabilities (and hence the posterior probabilities) of the models. A typical decision rule is,
If the Bayes factor is >3 in favor of one model relative to the other, accept the better model and reject the other. 
Applying the Bayes-factor-only decision rule to disease diagnosis would lead us to decide that a person has the disease whenever the diagnostic test is positive because the Bayes factor is 19.4 which is much greater than 3. Using only the Bayes factor for decision making, instead of the posterior probabilities of the models, ignores Bayes' rule and is not Bayesian despite the terminology. A Bayes factor is only Bayesian in the full context of Bayes' rule, which converts prior probabilities to posterior probabilities.


The usual justification for using the Bayes factor alone is two fold. First, if the prior probabilities of the models are all equal, then the posterior odds algebraically equal the Bayes factor. (You can see that for yourself from the equations above.) Setting the prior probabilities to be equal seems to be an innocuous default for many model-comparison settings, even if it's an absurd assumption for random screening of diseases or drug usage. Second, even if unequal prior probabilities for the models are assumed, the Bayes factor tells us how to modify those prior odds to arrive at the corresponding posterior odds. Therefore it is informative to report the Bayes factor.

The two justifications must be used carefully. First, I doubt that most model-comparison settings really are well represented by assuming equal prior probabilities of the models. If this were true, it would be asserting that the researcher is utterly ignorant of the models and has no prior data or theory or knowledge of any sort to suggest that one model is more or less plausible than the other.[replaced 2015-Dec-13 9:04am with the following sentence] Setting the prior probabilities equal to each other is an assertion of equality, not an expression of uncertainty; the next section discusses uncertainty. Second, even if the Bayes factor is reported to summarize the analysis, the Bayes factor should not be the basis for decision making. The decision should be based on the posterior probabilities of the models.

Modeling uncertainty in the prior probabilities of the models

The usual formal set-up, as in the equations above, assumes specific values for the prior probabilities of the models. That is to say, \(p(M=m)\) has a specific value, such as 0.01 for the "disease" model. But what if we are uncertain about the prior probabilities of the models?

This post has been revised and is now superseded by the new post linked >HERE<.

Modeling uncertainty is exactly what Bayesian analysis is all about. It is straight forward to model uncertainty in the prior probabilities of the models. (I do not recall seeing this analysis in the literature, but I would be quite surprised if the following analysis has not appeared somewhere before. Maybe even in this blog.) Denote \(p(M=1)\) by the parameter \(\theta\), which can take on any value from 0 to 1. We will express our uncertainty in \(\theta\) as a beta distribution that has mode \(\omega\) and concentration \(\kappa\). (Hence the shape parameters for the beta distribution are \(a=\omega (\kappa-2) + 1\) and \(b= (1-\omega) (\kappa-2) + 1\).) That is, \(p(\theta)=dbeta(\theta\,|\,\omega,\kappa)\). If we are extremely certain about the prior probability of the model, then \(\kappa\) has a very large value, meaning that the prior distribution on \(\theta\) is tightly concentrated at \(\omega\), and the analysis converges to the point value assumed in the equations above. If we are uncertain about the prior probability of the model, then \(\kappa\) is a small-ish value, with \(\kappa \ge 2\) and \(\kappa=2\) indicating a uniform distribution.

For the two-model case, the posterior probabilities of the models are
\[
p(M=1|D) =
\int_0^1 d\theta \, p(\theta) \, p(D|M=1) \, \theta / [ p(D|M=1) \, \theta + p(D|M=2) \, (1-\theta) ] 
\]
and
\[
p(M=2|D) =
\int_0^1 d\theta \, p(\theta) \, p(D|M=2) \, (1-\theta) / [ p(D|M=1) \, \theta + p(D|M=2) \, (1-\theta) ] 
\]
The ratio of those equations becomes
\[
\underbrace{\frac{p(M=1|D)}{p(M=2|D)}}_{\mbox{Posterior odds}}
=
\underbrace{\frac{p(D|M=1)}{p(D|M=2)}}_{\mbox{Bayes factor}}
\frac{\int_0^1 d\theta \, p(\theta) \, \theta / [ p(D|M=1) \, \theta + p(D|M=2) \, (1-\theta) ]}{\int_0^1 d\theta \, p(\theta) \, (1-\theta) / [ p(D|M=1) \, \theta + p(D|M=2) \, (1-\theta) ]}

\]
Heartier souls should feel free to derive an analytical simplification of the integrals above, but I'll just perform a numerical approximation in R. The code is appended below.

Here are some numerical results for disease diagnosis when the test result is positive. In all cases, the mode of the prior on \(M=1\) is \(\omega=0.01\), reflecting the assumed rarity of the disease. When \(\kappa\) is very large at 10002, then the posterior odds of having the disease are 0.197, essentially matching the analysis presented earlier when using \(p(M=\mbox{dis})=0.01\) exactly. When \(\kappa\) is set to a moderate value of 102, then the posterior odds of having the disease are 0.345. When \(\kappa\) is set to a value of 12 that expresses great uncertainty, then the posterior odds of having the disease are 1.23. Finally, when \(\kappa\) is set to a value of 2 that expresses essentially total ignorance about the prior probability of the disease, which means that the prior probability of having the disease could be 0.01 or 0.5 or 0.99 uniformly, then the posterior odds of having the disease are 7.65, which is getting close to the Bayes factor of 19.4. (Intuition tells me the result should equal the Bayes factor, which means either my intuition is wrong or the mathematical derivation is wrong or my R code is wrong. At this point I trust the numerical result and suspect my intuition. Perhaps a kind soul will gently enlighten me. UPDATE 25-Dec-2015: I am re-thinking this, and now believe that something is wrong with the derivation, and that the intuition is right after all. Stay tuned...!)

[Paragraph added 2015-Dec-13 12:50pm:] I think the math and numbers are correct. The prior ratio involves \(\int_0^1 d\theta \, p(\theta) \, \theta / [ p(D|M=1) \, \theta + p(D|M=2) \, (1-\theta) ]\) divided by 1 minus that integral. The integral involves \(p(D|M=1)\) and \(p(D|M=2)\), so the prior ratio will not generally become 1 when \(p(\theta)\) is uniform.

All of which is to say that if the prior probabilities of the models are uncertain, then the uncertainty can be incorporated into the analysis, but you must still specify a particular degree of uncertainty. The more uncertain you are about the prior probabilities of the models, the stronger is the influence of the Bayes factor. But notice that even with fairly high uncertainty, when \(\kappa=12\), the posterior odds were quite different than the Bayes factor. [Next sentence added 2015-Dec-13 9:12am] Moreover, even when there is complete uncertainty about the prior probabilities of the models, when \(\kappa=2\), the posterior odds are not as extreme as the Bayes factor.

R code:
# Specify distribution on prior probability of Model 1, having the disease:
omega=0.01 # Modal prior probability of disease.
kappa=c(2,12,102,10002)[4] # >= 2.0.
# Specify false alarm rate and hit rate of diagnostic test:
FA = 0.05  # p( Test=+ | healthy )
Hit = 0.97 # p( Test=+ | disease )
BF = Hit/FA # Bayes factor
# Create grid on theta, p(disease):
dTheta = 0.0001 # differential interval on theta.
Theta = seq(0,1,dTheta) # grid on theta for approximating the integral.
# Create prior distribution on theta, as a beta distribution:
pTheta = dbeta( Theta , shape1=omega*(kappa-2)+1 , shape2=(1-omega)*(kappa-2)+1 )
pTheta = (pTheta / sum(pTheta)) / dTheta # make sure that densities sum to 1.0
# Compute the prior integrals:
integTheta = sum( dTheta * pTheta * Theta/(FA+(Hit-FA)*Theta) )
integOneMinusTheta = sum( dTheta * pTheta * (1-Theta)/(FA+(Hit-FA)*Theta) )
# Compute posterior probability of disease, given Test=+
priorOdds = integTheta / integOneMinusTheta
pDisGivenTestPos = Hit * integTheta
pHealthGivenTestPos = FA * integOneMinusTheta
show( c( pDisGivenTestPos , pHealthGivenTestPos ,
         pDisGivenTestPos / pHealthGivenTestPos ) )
show( c( priorOdds , BF , priorOdds * BF )  )





Wednesday, December 2, 2015

Prior on df (normality) parameter in t distribution

Many models in DBDA use a t distribution for "robust" estimation of central tendency parameters. The idea is that the heavy tails of a t distribution can accommodate data points that would be outliers relative to a normal distribution, and therefore the estimates of the mean and standard deviation are not distorted by the outliers. Thus the estimates of the mean and standard deviation in the t distribution are "robust against outliers" relative to a normal distribution.

The parameter that controls the heaviness of the tails in a t distribution is called the degrees-of-freedom parameter, ν (nu). I prefer to call it the normality parameter because larger values make the distribution more normal, and the distribution is normal when ν is infinity. Values of ν larger than about 30.0 make the distribution roughly normal, and values of ν less than about 30.0 give the distribution noticeably heavier tails than a normal.

Here's the news: Past Self had learned, sometime long ago, that ν must be 1.0 or larger as an algebraic constraint in the definition of the t distribution. Past Self was wrong. In fact, ν must be 0.0 or larger. This issue was brought to my attention by Osvaldo Martin who asked me, in comments on a previous blog post, why I thought ν had to be greater than 1.0. Perhaps other people have asked me this question in the past, but somehow this is the first time I was prompted to double check my answer.

So what? Well, all the models involving a t distribution in the 1st and 2nd editions of DBDA had priors that restricted ν to be 1.0 or greater (see figure below). Fortunately, there is essentially no bad consequence whatsoever, because real data virtually never are distributed with such extreme outliers that an estimated value of ν<1 is needed. This lack of consequence is presumably why the error of Past Self has gone uncorrected so long.

The prior used in DBDA2E. There's no need to shift it +1.


Despite the error being virtually harmless, I've changed all the programs so that the priors go from ν=0 up. This change simplifies the model specifications, too. Now the JAGS models simply specify

nu ~ dexp(1/30.0)

instead of

nu <- nuMinusOne + 1
nuMinusOne ~ dexp(1/29.0)

I've made this change in umpteen programs, including the Stan programs too. I've checked that all the JAGS programs still run fine, but I have not yet checked the Stan programs. Let me know if you encounter any problems. The programs are available at the book's web site.

The change also means that the diagrams of the models can be simplified, such as Figure 16.11 on p. 468. I'll leave that to your imagination.