Wednesday, April 18, 2012

Improved Programs for Hierarchical Bayesian ANOVA

I've made four important changes in the programs for hierarchical Bayesian ANOVA.

First, I changed the hyperprior on the variance (standard deviation) parameters. I was never pleased with the "+.1" kludge that I used in the book, explained on p. 496: "There is one other trick in the BUGS model specification that is not in the hierarchical diagram of Figure 18.1. One line of the BUGS model specifies that the standard deviation of the group effects, denoted aSDunabs, comes from a t distribution: aSDunabs ~ dt(0,0.001,2). Another line takes the absolute value to “fold” the t distribution onto the nonnegative numbers: aSD <- abs(aSDunabs) + .1. But that line also mysteriously adds a small constant, namely 0.1. This constant keeps aSD from venturing extremely close to zero. The reason for keeping aSD away from zero is that shrinkage can become overwhelmingly strong when there are many groups with few data points per group. This becomes especially problematic in the next chapter when we consider interaction of factors." A problem with that kludge is that the aSD parameter can never be less than 0.1, which is unlikely to reflect a real prior belief. It can produce strangely truncated posterior distributions. Therefore I simply abandoned the folded-t prior (recommended by Gelman, 2006) and used a reasonable gamma instead. The gamma prior, for each SD parameter, has a mode at 0.1 and a standard deviation of 10, which seems to be an appropriate scaling when the data are standardized (as indeed they are in the programs). Here's a picture of this prior:
[Figure: the gamma prior density, with mode at 0.1 and standard deviation 10.]
The neat thing about this prior is that it smoothly drops down to zero as the parameter value gets close to zero. (For how to specify a gamma prior with a desired mode and standard deviation, see this previous blog post.) The prior on the within-cell SD remains uniform, as in the original programs and recommended by Gelman (2006).
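For readers who don't want to chase the link, here is a minimal sketch of the conversion (the helper name is mine, not from the programs; it uses the facts that a gamma with shape > 1 has mode (shape-1)/rate and standard deviation sqrt(shape)/rate):
  gammaShRaFromModeSD = function( mode , sd ) {
    # Solve mode = (shape-1)/rate and sd = sqrt(shape)/rate for shape and rate:
    rate = ( mode + sqrt( mode^2 + 4*sd^2 ) ) / ( 2 * sd^2 )
    shape = 1 + mode * rate
    return( list( shape=shape , rate=rate ) )
  }
  sr = gammaShRaFromModeSD( mode=0.1 , sd=10 )  # shape ~ 1.01005 , rate ~ 0.1005
  curve( dgamma( x , shape=sr$shape , rate=sr$rate ) , from=0 , to=2 ,
         xlab="aSD (standardized scale)" , ylab="prior density" )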

Second, I changed the plots of the variance parameters so that they are on the original data scale, not the standardized scale. This was achieved merely by multiplying by ySDorig.
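In code, the rescaling amounts to something like this (a sketch; the sample and variable names are illustrative, not necessarily the ones in the programs):
  ySDorig = sd( yOrig )                  # SD of the raw, unstandardized data
  aSDorigSample = aSDSample * ySDorig    # rescaled MCMC samples, for plotting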

Third, I changed how contrasts are specified. The original method was problematic because the contrasts were specified as vectors of contrast coefficients, which (a) lacked explicit meaning and (b) only made sense if the order of the levels of the factors was exactly as assumed by the contrast coefficients. Unfortunately, it turns out that different settings of R's "locale" can produce different orderings of the factor levels, rendering the contrast coefficients nonsensical! (Thanks to Damion Junk for pointing out this problem to me.) The new method uses logical indexing of named levels, so that the contrast specification is meaningful and independent of local ordering conventions. For example, for the mussel muscle data, the old contrast specification was
  contrastList = list( ...
    ORE1vORE2 = c(1,-1,0,0,0) ,
    ALAvORE = c(-1/2,-1/2,1,0,0) ,
    ... )
but the new specification is
  normalize = function( v ){ return( v / sum(v) ) }
  contrastList = list( ...
    ORE1vORE2 = (xnames=="OregonN")-(xnames=="OregonT") ,
    ALAvORE = (xnames=="Alaska")-normalize(xnames=="OregonN"|xnames=="OregonT") ,
    ...  )
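To see why the logical indexing is robust, here is a self-contained toy check (the two non-Oregon, non-Alaska level names are made up for illustration):
  normalize = function( v ){ return( v / sum(v) ) }
  xnames = c( "OregonN" , "OregonT" , "Alaska" , "SiteD" , "SiteE" )
  ( xnames=="OregonN" ) - ( xnames=="OregonT" )    # 1 -1 0 0 0
  ( xnames=="Alaska" ) - normalize( xnames=="OregonN" | xnames=="OregonT" )
                                                   # -0.5 -0.5 1.0 0.0 0.0
Reorder xnames however the locale likes and re-evaluate: each named level still gets the correct coefficient, which is exactly what the old coefficient vectors could not guarantee.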

Fourth, I improved the program for displaying posterior histograms, so that it now defaults to better breaks based on the HDI. Therefore it is best to use plotPost without a breaks argument unless the (new) default looks bad.
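In practice that just means calling plotPost without breaks and overriding only if needed (the chain and parameter names here are illustrative):
  source("plotPost.R")
  plotPost( mcmcChain[,"b[1]"] , xlab="b[1]" )    # new default breaks from the HDI
  # plotPost( mcmcChain[,"b[1]"] , breaks=30 )    # manual override if it looks bad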

Where are the new programs? Get the updated files, ANOVAonewayJagsSTZ.R, ANOVAtwowayJagsSTZ.R, plotPost.R, and McDonaldSK1991data.txt, from the usual place (click the column labeled "Last Modified" to get the most recent listed first). ** Updated April 25, 2012. Also with the program PoissonExponentialJagsSTZ.R **

12 comments:

  1. I am a software architect involved with performance modeling. Do you have recommendations for books dealing with generating distributions of data and fitting data to distributions? What is the best practical book?

    Thanks.

  2. Perhaps I do not understand your question, but isn't Bayesian parameter estimation fitting of distributions to data? And isn't posterior prediction the generation of data distributions?

  3. Hi Dr. Kruschke,

    I've been reading your book "Doing Bayesian Data Analysis" and working through the examples.

    I have a question regarding the within-subjects Two-Way ANOVA covered in the book. Conventional ANOVAs in the NHST tradition can only accommodate one random effect. However, many studies have more than one random effect. For example, an experiment I did had both within-subjects and within-items random effects.

    I don't recall you discussing multiple random effects in your book, so I wonder if it is possible at all to include two (or even more) random effects using BRugs. I attempted to modify the code in ANOVAtwowayBRugsWithinSubj.R, but it hasn't been successful.


    Thanks!
    Best,
    Xiao

  4. Xiao:

    Thanks for your comment.

    First, I strongly recommend using JAGS instead of BUGS/BRugs. For reasons why, see this post. For installation instructions, see this post.

    Regarding random effects, the answer in general is yes: JAGS/BUGS is extremely flexible for describing multi-level random-effects models. You might want to take a look at Gelman & Hill's (2007) book, Data Analysis Using Regression and Multilevel/Hierarchical Models, pp. 245-246. They say: "Our advice ... is to always use multilevel modeling ("random effects"). ... we avoid the terms "fixed" and "random" entirely, and focus on the description of the model itself."
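    For flavor, here is a bare-bones sketch of crossed subject and item effects in JAGS; all of the names and priors below are illustrative, not code from the book:

    modelString = "
    model {
      for ( i in 1:Ntotal ) {
        y[i] ~ dnorm( mu[i] , 1/ySigma^2 )
        mu[i] <- b0 + bSubj[ subj[i] ] + bItem[ item[i] ]  # crossed effects
      }
      for ( s in 1:Nsubj ) { bSubj[s] ~ dnorm( 0 , 1/sdSubj^2 ) }
      for ( j in 1:Nitem ) { bItem[j] ~ dnorm( 0 , 1/sdItem^2 ) }
      b0 ~ dnorm( 0 , 1.0E-6 )
      ySigma ~ dunif( 0 , 10 )
      sdSubj ~ dgamma( 1.01005 , 0.1005 )  # mode 0.1, SD 10, as in the post
      sdItem ~ dgamma( 1.01005 , 0.1005 )
    }
    "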

  5. I have some suggestions and questions concerning ANOVAonewayJagsSTZ.R. I was not satisfied with the convergence analysis, so I made some changes:

    # EXAMINE THE RESULTS
    checkConvergence = TRUE
    if ( checkConvergence ) {
      openGraph(width=7,height=7)
      # Autocorrelation plots for all the parameters, not only the b's:
      autocorr.plot( codaSamples[[1]][,c("b0","b[1]","b[2]")] )
      autocorr.plot( codaSamples[[1]][,c("a0","a[1]","a[2]")] )
      autocorr.plot( codaSamples[[1]][,c("sigma","aSD")] )
      # Univariate Gelman-Rubin diagnostic, one parameter at a time:
      for ( v in 1:nvar(codaSamples) ) {
        print( dimnames(codaSamples[[1]])[[2]][v] )
        show( gelman.diag( codaSamples[,v] ) )
      }
      # show( gelman.diag( codaSamples ) )  # multivariate version; see below
      effectiveChainLength = effectiveSize( codaSamples )
      show( effectiveChainLength )
    }

    The gelman.diag is changed from multivariate to univariate, because I repeatedly got error messages:

    Error in show(gelman.diag(codaSamples)) :
    error in evaluating the argument 'object' in selecting a method for function 'show': Error in chol.default(W) :
    the leading minor of order 6 is not positive definite

    Furthermore, I added convergence graphics for the other parameters. The results show that aSD, a0, a[1], and a[2] are autocorrelated, while b0, b[1], and b[2] seem fine. Can we trust the latter, given that they are computed from a0, a[1], and a[2]?
    Furthermore, can we trust aSD? How do we know, given that it is clearly "clumpy" near zero?

  6. Dear Hans:

    Thanks very much for the thoughtful comment and suggestions for improving the code.

    I agree that convergence diagnostics deserve more discussion in DBDA and the blog.

    Some thoughts about gelman.diag:
    * Many practitioners think it's useful only when the chains begin at overdispersed positions. I have found that chains can get orphaned in complex models even when not overdispersed at the beginning, so the gelman.diag can still be useful.
    * I too have found that gelman.diag produces errors sometimes. Therefore I have lately been embedding it within try() in R, as try(gelman.diag(...)), so that the script does not crash if gelman.diag fails.

    I have found that convergence can also be usefully examined by viewing superimposed density plots of the separate chains. It's pretty obvious when their profiles do not match, and it's better than a trace plot.
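    In code, those two habits look something like this (a sketch; codaSamples is an mcmc.list from the coda package, and the parameter name is illustrative):

    library(coda)
    # Wrap gelman.diag in try() so a failure does not stop the script:
    gd = try( gelman.diag( codaSamples ) )
    if ( !inherits( gd , "try-error" ) ) { show( gd ) }
    # Superimpose the density of each chain for one parameter; mismatched
    # profiles are easy to spot:
    parName = "b[1]"
    plot( density( as.numeric( codaSamples[[1]][,parName] ) ) ,
          main=paste( "Chain densities of" , parName ) , xlab=parName )
    for ( cIdx in 2:length(codaSamples) ) {
      lines( density( as.numeric( codaSamples[[cIdx]][,parName] ) ) , lty=cIdx )
    }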

    I think that the goal (once convergence is checked) is to have an effective sample size of about 10,000 --- because that makes the HDI limits reasonably stable.

    Regarding the parameters in this specific model: We don't care about a0, a[1], etc. They are irrelevant, and in some sense don't even exist. We do care about b0, b[1], etc. The higher-level variance parameters in models like these are often strongly autocorrelated. Fortunately their specific values tend to have little influence on downstream parameters, and we usually care mostly about the downstream parameters. If, however, your research specifically targets the higher-level variance parameters, and you want stable estimates of those parameters, then you want a larger effective sample size for them.
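    For reference, the recalculation from the a's to the sum-to-zero b's is just a recentering, along these lines (a sketch with assumed names, applied at every step in the chain):

    m = a0 + a      # a = vector of group effects at one MCMC step
    b0 = mean( m )  # the grand mean becomes the new baseline
    b = m - b0      # deflections; these sum to zero by construction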

  7. Thanks for your answer; it certainly gave food for thought. Of course you are completely right in saying that a0, a[1], etc. are irrelevant. One qualification: the STZ b[1], b[2], etc. are easy to interpret, and the improvement in autocorrelation is great. The recalculation from a to b seems to work fine.

    However, I find the recalculations a drag when interactions are concerned, and for that reason I prefer STZ dummy coding without them. This reduces the BUGS model code considerably.

    I often find relatively high autocorrelation for variance/precision estimates. A distribution such as dgamma( shape=1.01005, rate=0.1005 ) has a high density close to zero, and I often see some clumpiness near zero. Is it possible that there is a relation between this kind of prior distribution and autocorrelation of the estimate?

  8. This comment has been removed by the author. (Comments 9 through 12 were likewise removed.)