Tuesday, November 25, 2014

Warning message in runjags turned off in new version of DBDA2E utilities

A reader pointed out that programs that accompany DBDA2E and use runjags produce a disconcerting warning message about "starting parallel chains without setting different PRNG for each chain. Different .RNG.name values have been added to each set of initial values." This is merely runjags reporting that it is doing the right thing, and there is nothing to worry about. But it seems to cause worry anyway. Therefore, in a new version of the DBDA2E utilities program I have set the runjags options so that the warning message is not produced. To get the new version, go to the software web page and scroll down to the bottom, where you will find the programs for the book. Click the download arrow on the right-hand side. Be sure to unzip/extract the file.
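
If you prefer to suppress the message in your own scripts instead of downloading the new utilities, a minimal sketch is below. It assumes the relevant runjags options are named inits.warning and rng.warning, as in the version of runjags I have been using; check ?runjags.options in your installed version.

library(runjags)
# Turn off the warnings about initial values and per-chain random-number generators:
runjags.options( inits.warning=FALSE , rng.warning=FALSE )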

Monday, November 24, 2014

eBook and Kindle of DBDA2E now available

The eBook (PDF) and Kindle editions of Doing Bayesian Data Analysis, 2nd Edition, are now available. The book web page has relevant links.

Friday, November 21, 2014

Ordinal probit regression: Transforming polr() parameter values to make them more intuitive


In R, the polr function in the MASS package does ordinal probit regression (and ordinal logistic regression, but I focus here on probit). The polr function yields parameter estimates that are difficult to interpret intuitively because they assume a baseline intercept of 0 and a noise standard deviation of 1, which produces slopes and thresholds that are relative to an underlying scale with little direct relation to the ordinal values in the data. It can be more intuitive to fix the low and high thresholds at values that make sense on the ordinal data scale, and to estimate the other parameters relative to those low and high thresholds. This intuitive parameterization is used in Chapter 23 of DBDA2E. This blog post presents a function that transforms the polr parameter values to the intuitive parameterization.

Background: Ordinal probit regression.

See Chapter 23 of DBDA2E for a complete introduction to and explanation of ordinal probit regression. The figure below provides a quick reminder of the model.
The data to be predicted are ordinal, shown on the left side of the figure. The predictor in this case is metric, shown at the lower-right of the figure. The model assumes there exists an underlying metric scale for the ordinal outcome, and this underlying scale is chopped into subintervals whose boundaries are thresholds denoted theta (see the diagram). The model assumes there is some noisy mapping from the predictor to the underlying metric response scale, as in ordinary linear regression, and the probability of each discrete ordinal response is the area under the normal curve between the thresholds for that response category.

From the data we estimate the intercept, the slope, the standard deviation, and the thresholds. But the underlying scale has arbitrary location and magnification, so two of the parameters can be arbitrarily fixed. In polr, the intercept is fixed at 0 and the standard deviation is fixed at 1. But that yields thresholds and slopes on a scale with no intuitive relation to the response values, which go from 1 to K, where K is the maximum ordinal response level.

A more intuitive parameterization fixes the low threshold at 1.5 (halfway between 1 and 2) and the high threshold at K-0.5 (halfway between K-1 and K). Then the thresholds are directly interpretable with respect to those anchors, and the standard deviation can also be interpreted relative to the distance between those anchors.
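
To make the mapping concrete, here is a minimal sketch in R of how an underlying normal distribution, together with the thresholds, yields the ordinal response probabilities. The parameter values below are made up purely for illustration.

K      = 5                            # number of ordinal levels
thresh = c( 1.5 , 2.2 , 3.1 , 4.5 )   # K-1 thresholds; outer ones anchored at 1.5 and K-0.5
mu     = 3.0                          # mean of underlying normal (intercept + slope * x)
sigma  = 0.85                         # standard deviation of underlying normal
# Probability of each response 1..K is the area under the normal between its thresholds:
cumProb = c( 0 , pnorm( (thresh - mu)/sigma ) , 1 )
ordProb = diff( cumProb )
names(ordProb) = 1:K
round( ordProb , 3 )   # the K probabilities sum to 1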

Example: Happiness and financial assets.

Consider subjective ratings of happiness on a 1 to 5 ordinal scale, where 1 is bummed out and 5 is ecstatic (I forget the exact wording used to label the response items). For each respondent we also get their total assets. Some real data are shown in the top panel of the figure below:
See the figure caption for details. The lower panels of the figure show aspects of the Bayesian posterior distribution. Notice that the lowest threshold is fixed at 1.5, and the highest threshold is fixed at 4.5. This means that the underlying metric output scale is aligned with the ordinal responses, and the other parameters are relative to the low and high anchors. In particular, the intercept is at about 3.16, which suggests a happiness rating of about 3 even with zero assets. The slope is about 4.14e-6 happiness units per yuan. The standard deviation of the noise is about 0.854 happiness units, suggesting large variability on a 5-point scale.

The output of polr, however, looks like this (where Happiness is Yord and Assets is X):


> polrInfo = polr( Yord ~ X , method="probit" )
> print( polrInfo )
Coefficients:
           X 
4.025474e-06 

Intercepts:
      1|2       2|3       3|4       4|5 
-2.048053 -1.180089 -0.127664  1.544575 


What do those parameter estimates mean? The coefficient on X is the slope, and the intercepts are the thresholds. While we can use those estimates to derive predictions of outcome probabilities, they aren't very directly interpretable, at least not for me!

The polr estimates can be transformed into the parameterization I explained above, and then we get the following:


> polrInfoOrd = polrToOrdScale( polrInfo )
> print( polrInfoOrd )
 
$sigma
[1] 0.8350432

$b0
[1] 3.210213

$coefficients
           X 
3.361444e-06 

$zeta
     1|2      2|3      3|4      4|5 
1.500000 2.224788 3.103608 4.500000 


The thresholds are the zeta values; notice that the lowest threshold is 1.5 and the highest threshold is 4.5. The sigma, intercept (i.e., b0), and slope (i.e., coefficient) are very close to the modal posterior values of the Bayesian estimation.

The function:

Here is the specification of the function in R:

polrToOrdScale = function( polrObject ) {
  # Transform polr's parameterization (intercept 0, noise sd 1) to the
  # parameterization in which the lowest threshold is 1.5 and the highest
  # threshold is K-0.5, where K is the number of ordinal levels.
  polrThresh = polrObject$zeta          # thresholds estimated by polr
  polrSlopes = polrObject$coefficients  # regression coefficients estimated by polr
  polrInter  = 0.0                      # intercept is fixed at 0 in polr
  polrSigma  = 1.0                      # noise sd is fixed at 1 in polr (probit)
  K = length(polrThresh) + 1  # K is number of ordinal levels
  # Rescaling factor that maps the span between the outer polr thresholds
  # onto the span from 1.5 to K-0.5, which has length K-2:
  sigmaMult = unname( (K-2)/(polrThresh[K-1]-polrThresh[1]) )
  # Shift that puts the lowest threshold at 1.5:
  inter = unname( 1.5 - ( sigmaMult*polrThresh[1] ) )
  respThresh = sigmaMult*polrThresh + inter
  respSigma  = sigmaMult*polrSigma
  respB0     = sigmaMult*polrInter + inter
  respSlopes = sigmaMult*polrSlopes
  return( list( sigma=respSigma ,
                b0=respB0 ,
                coefficients=respSlopes ,
                zeta=respThresh ) )
}


I have only tested the function on a few examples; there are no guarantees! Feel free to reply with refinements.
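
As a quick sanity check, the transformation should leave the model's predicted response probabilities unchanged: with polr's parameterization the cumulative probabilities are pnorm( zeta - x*slope ), and with the transformed parameterization they are pnorm( (zeta - mu)/sigma ) with mu = b0 + x*slope. Here is a minimal sketch of that check, assuming the same Yord and X variables as in the example above.

library(MASS)
polrInfo    = polr( Yord ~ X , method="probit" )
polrInfoOrd = polrToOrdScale( polrInfo )
xProbe  = median( X )   # an arbitrary probe value of the predictor
cumPolr = pnorm( polrInfo$zeta - xProbe * polrInfo$coefficients )
mu      = polrInfoOrd$b0 + xProbe * polrInfoOrd$coefficients
cumOrd  = pnorm( ( polrInfoOrd$zeta - mu ) / polrInfoOrd$sigma )
round( cumPolr - cumOrd , 12 )   # differences should be essentially zero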

Monday, November 10, 2014

2nd Edition has shipped (Doing Bayesian Data Analysis)

I am told by some readers that they have received a physical copy of the 2nd Edition of Doing Bayesian Data Analysis, but I have yet to see it myself. I hope the paper is heavy and opaque, but the book lifts the spirits and the concepts are transparent! Info about the 2nd Edition is here.

Wednesday, October 8, 2014

2nd Edition of Doing Bayesian Data Analysis available November 2014


The 2nd Edition of Doing Bayesian Data Analysis will be available in early or mid November, 2014. The book is completely re-written from cover to cover, with loads of new material, and all new programs. For complete information, please see the book's new web site.

Here's a partial list of what's new in the Second Edition:
  • There are all new programs in JAGS and Stan. The new programs are designed to be much easier to use than the scripts in the first edition. In particular, there are now compact high-level scripts that make it easy to run the programs on your own data sets. This new programming was a major undertaking by itself.
  • The introductory Chapter 2, regarding the basic ideas of how Bayesian inference re-allocates credibility across possibilities, is completely rewritten and greatly expanded.
  • There are completely new chapters on the programming languages R (Ch. 3), JAGS (Ch. 8), and Stan (Ch. 14). The lengthy new chapter on R includes explanations of data files and structures such as lists and data frames, along with several utility functions. (It also has a new poem that I am particularly pleased with.) The new chapter on JAGS includes explanation of the RunJAGS package, which executes JAGS on parallel computer cores. The new chapter on Stan provides a novel explanation of the concepts of Hamiltonian Monte Carlo. The chapter on Stan also explains conceptual differences in program flow between Stan and JAGS.
  • Chapter 5 on Bayes’ rule is greatly revised, with a new emphasis on how Bayes’ rule re-allocates credibility across parameter values from prior to posterior. The material on model comparison has been removed from all the early chapters and integrated into a compact presentation in Chapter 10.
  • What were two separate chapters on the Metropolis algorithm and Gibbs sampling have been consolidated into a single chapter on MCMC methods (as Chapter 7).
  • There is extensive new material on MCMC convergence diagnostics in Chapters 7 and 8. There are explanations of autocorrelation and effective sample size. There is also exploration of the stability of the estimates of the HDI limits. New computer programs display the diagnostics, as well.
  • Chapter 9 on hierarchical models includes extensive new and unique material on the crucial concept of shrinkage, along with new examples.
  • All the material on model comparison, which was spread across various chapters in the first edition, is now consolidated into a single focused chapter (Ch. 10) that emphasizes its conceptualization as a case of hierarchical modeling.
  • Chapter 11 on null hypothesis significance testing is extensively revised. It has new material for introducing the concept of sampling distribution. It has new illustrations of sampling distributions for various stopping rules, and for multiple tests.
  • Chapter 12, regarding Bayesian approaches to null value assessment, has new material about the region of practical equivalence (ROPE), new examples of accepting the null value by Bayes factors, and new explanation of the Bayes factor in terms of the Savage-Dickey method.
  • Chapter 13, regarding statistical power and sample size, has an extensive new section on sequential testing, and on making the goal of research the precision of estimation rather than the rejection or acceptance of a particular value.
  • Chapter 15, which introduces the generalized linear model, is fully revised, with more complete tables showing combinations of predicted and predictor variable types.
  • Chapter 16, regarding estimation of means, now includes extensive discussion of comparing two groups, along with explicit estimates of effect size.
  • Chapter 17, regarding regression on a single metric predictor, now includes extensive examples of robust regression in JAGS and Stan. New examples of hierarchical regression, including quadratic trend, graphically illustrate shrinkage in estimates of individual slopes and curvatures. The use of weighted data is also illustrated.
  • Chapter 18, on multiple linear regression, includes a new section on Bayesian variable selection, in which various candidate predictors are probabilistically included in the regression model.
  • Chapter 19, on one-factor ANOVA-like analysis, has all new examples, including a completely worked out example analogous to analysis of covariance (ANCOVA), and a new example involving heterogeneous variances.
  • Chapter 20, on multi-factor ANOVA-like analysis, has all new examples, including a completely worked out example of a split-plot design that involves a combination of a within-subjects factor and a between-subjects factor.
  • Chapter 21, on logistic regression, is expanded to include examples of robust logistic regression, and examples with nominal predictors.
  • There is a completely new chapter (Ch. 22) on multinomial logistic regression. This chapter fills in a case of the generalized linear model (namely, a nominal predicted variable) that was missing from the first edition.
  • Chapter 23, regarding ordinal data, is greatly expanded. New examples illustrate single-group and two-group analyses, and demonstrate how interpretations differ from treating ordinal data as if they were metric.
  • There is a new section (25.4) that explains how to model censored data in JAGS.
  • Many exercises are new or revised.
And the cover is different! The correspondence of the doggies to Bayes’ rule is now made explicit: The half-up ears of the posterior doggie are a compromise between the perky ears and floppy ears of the likelihood and prior doggies. The marginal likelihood is not usually computed in MCMC methods, so the doggie in the denominator gets sleepy with nothing much to do. Let's hope that what’s between the covers of the book is as friendly and engaging as the doggies on the cover.

Again, for complete information about the second edition, please see the book's new web site.

Monday, September 29, 2014

Doing Bayesian Data Analysis in Python

Do you prefer Python? Some readers have undertaken to translate the computer programs from Doing Bayesian Data Analysis into Python, including Osvaldo Martin, who has this GitHub site for his ongoing project. If you are interested in what he has done, or if you are interested in contributing, please contact him. If you have your own Python-translation project going, please add a comment to this post. Thank you all for your contributions to make Bayesian methods available in all flavors.

Thursday, September 18, 2014

p value from likelihood ratio test is STILL not the same as p value from maximum likelihood estimate

In yesterday's post, I described two ways of finding a p value for a parameter, and pointed out that the two ways lead to different p values. As an example, I considered the slope parameter in simple linear regression. One way to get a p value for the slope is with a standard likelihood ratio test, comparing the linear model with a free slope against the intercept-only model. The other way is to generate a sampling distribution of the MLE slope parameter itself, from a null hypothesis that consists of the MLE intercept-only model. Through remarks from commenters, especially Arash Khodadadi and Noah Silbert, I've been prodded to clarify the differences between these approaches. This post points out:
  • The likelihood ratio test compares model fits relative to the variance, while the sampling distribution of the MLE slope parameter is on the absolute scale of the data.
  • An example in which the slope parameter has p < .05 (two-tailed) in the sampling distribution of the MLE slope parameter but has p > .05 (one-tailed) in the sampling distribution of the likelihood ratio.
First, the data (a little different from the previous posts):
The header of the plot, above, indicates the MLE of the linear model, and -2log(LR), a.k.a. G2, rounded to three significant digits.

The null hypothesis is an intercept-only model (beta1=0) with beta0 and sigma set to their MLE values when beta1 is fixed at zero. (Which means, in this case, that beta0 is the mean of y and sigma is the sd of y, using N, not N-1, in the denominator.) I generated sample data from the null hypothesis using the x values in the actual data. For each sample I computed the MLE of the full model and -2log(LR). The resulting marginal sampling distributions are shown here:
In the left panel, above, the one-tailed p value of MLE beta1 is displayed; multiply it by 2 to get the two-tailed p value. Notice it is different from the p value of the likelihood ratio.

Below is the joint sampling distribution, where each point is a sample from the null hypothesis. There is a new twist to this figure: Each point is color coded for the magnitude of MLE sigma, where blue is the largest MLE sigma in the distribution and red is the smallest MLE sigma in the distribution.
The joint sampling distribution also shows the thresholds for p=.05 (one-tailed or two-tailed), and the actual data statistics are plotted as a "+".

You can see from the joint sampling distribution that MLE beta1 can be large-ish even when -2log(LR) is small-ish, provided the sample's MLE sigma is large-ish (blue points). The opposite can happen when the sample's MLE sigma is small-ish (red points). Thus, a key difference between the measures of the slope parameter is how they deal with the variance. The likelihood ratio compares the free-slope model against the intercept-only model relative to the variance, while MLE beta1 considers the slope on the absolute scale of the data, not relative to the variance.
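
For readers who want to play with the comparison, here is a minimal sketch of the simulation described above. The data vectors x and y, the number of simulated samples, and the actual-data statistics b1Actual and G2Actual are placeholders, not the exact quantities behind the figures above.

# Null hypothesis: intercept-only model with beta0 and sigma at their MLE values.
N         = length(y)
b0Null    = mean(y)
sigmaNull = sqrt( mean( (y - b0Null)^2 ) )   # MLE of sigma uses N, not N-1
nSim  = 10000
mleB1 = rep( NA , nSim )
G2    = rep( NA , nSim )
for ( i in 1:nSim ) {
  ySim     = b0Null + rnorm( N , 0 , sigmaNull )  # sample data from the null hypothesis
  fitFull  = lm( ySim ~ x )                       # full model: free slope
  ssFull   = sum( residuals(fitFull)^2 )
  ssNull   = sum( ( ySim - mean(ySim) )^2 )       # intercept-only model
  mleB1[i] = coef(fitFull)[2]                     # MLE of the slope
  G2[i]    = N * log( ssNull / ssFull )           # -2 log(LR) for normal models with MLE sigma
}
# p values relative to the actual-data statistics:
# mean( mleB1 >= b1Actual )   # one-tailed for the slope; double it for two-tailed
# mean( G2 >= G2Actual )      # one-tailed for the likelihood ratio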

As discussed in yesterday's post, I don't think either test is inherently better than the other. They just ask the question about the slope parameter in different ways. As mentioned yesterday, posing the question in terms of absolute MLE beta1 has direct intuitive interpretation. It's also much easier to use when defining confidence intervals as the range of parameter values not rejected by p<alpha (which is, for me, the most coherent way to define confidence intervals). But that's a topic for another day!