Advanced Modeling in R
Non-linear, Bayesian, and mixed effect methods
R. Condit
Smithsonian Tropical Research Institute, 7-9 May 2012
1 General organization
The course will cover several advanced statistical modeling methods using the programming
language R, including maximum-likelihood, non-linear, Bayesian, and multi-level (hierarchical)
methods as well as techniques for using data simulation to test models. The R function lmer, an
accessible yet complex tool for advanced modeling, will be covered in detail. To establish a base for
understanding multi-level models, some review of standard regression will be included, plus a
session on fitting non-linear models with maximum likelihood.
During the first half of each session, I will explain methods and present examples of their use; in
the second half, students will work on assignments using the same methods. Datasets will be
provided, but students are encouraged to bring their own data as well. A course web site will provide
sample code, data, and a list of key R functions. Students should be familiar with R: manipulating
dataframes, graphing, and linear regression.
1.1 Applying
- Apply: Contact Liliana Londoño, Center for Paleobiology, STRI
1.2 Schedule
- When: Three sessions, 8:30-4:30, 7-9 May 2012
- Where: Tupper Training Room (Next to Small Meeting Room, below cafeteria)
2 Software requirements
I assume you will have a laptop running R, that you know how to manipulate dataframes in R, and
that you have some experience with graphing and simple summary statistics. I suspect you have
already used the functions lm and (perhaps) glm, but if you haven’t, you will learn them quickly.
The course will begin with those functions as a baseline before moving on to more advanced
methods for fitting models. Please have the packages listed below installed and running
beforehand; I also encourage you to have a programming editor installed before we
start.
3 Course web site
4 Sources
5 Contents
- Modeling with standard regression and maximum likelihood [morning 1]
- Linear regression with lm (review)
- Gaussian error
- Residuals and statistics (coef, summary)
- Data treemass: log(agb) vs. log(dbh)
- Centering x in linear regression!
Use xCenter = x - mean(x)
- Numerical estimation with optim
- maximize likelihood vs. minimize sum of squares
- alternate methods in optim (Nelder-Mead etc.)
- comparing models with AIC
- Non-linear models
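The first topic (lm versus numerical maximum likelihood with optim) can be sketched as follows. This is only an illustration: the data are simulated as a stand-in for the treemass data, and the names logdbh, logagb, and negLogLik and all parameter values are assumptions made for the example, not course material.

```r
set.seed(1)
# Simulated stand-in for treemass: log(agb) vs. log(dbh); values are invented
logdbh <- log(runif(200, 1, 100))
logagb <- -2 + 2.5 * logdbh + rnorm(200, sd = 0.4)
xCenter <- logdbh - mean(logdbh)          # centering x, as in the outline

# Standard regression
fit.lm <- lm(logagb ~ xCenter)

# Negative log-likelihood with Gaussian error.
# par = intercept, slope, log(SD); the log scale keeps the SD positive.
negLogLik <- function(par, x, y) {
  mu <- par[1] + par[2] * x
  -sum(dnorm(y, mean = mu, sd = exp(par[3]), log = TRUE))
}

# Minimizing the negative log-likelihood maximizes the likelihood
fit.ml <- optim(par = c(mean(logagb), 0, 0), fn = negLogLik,
                x = xCenter, y = logagb, method = "Nelder-Mead",
                control = list(maxit = 2000, reltol = 1e-10))

ml.coef <- fit.ml$par[1:2]                # compare with coef(fit.lm)
aic.ml <- 2 * fit.ml$value + 2 * length(fit.ml$par)   # compare with AIC(fit.lm)
```

With Gaussian error, maximizing the likelihood and minimizing the sum of squares give the same intercept and slope, so ml.coef should agree with coef(fit.lm) to within optimizer tolerance. Other methods (e.g. method = "BFGS") can be swapped in through the same call.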
- Bayesian methods [afternoon 1, day 2]
- Bayes rule and the posterior distribution
- Metropolis, the Gibbs sampler (MCMC)
- Another method for fitting parameters
- Automatically provides confidence limits (credible intervals) for every parameter
- Much more flexible modeling options (e.g., non-linear with many parameters)
- Any error distribution
- Latent states or latent data
- Hierarchical modeling
- Limitations: long run time, complicated program
- Keys to your own program
- Getting the likelihood functions correct, which can be difficult in complex
models
- Preparing data structures to save all the data and likelihood
- Looping through all the parameters and hyperparameters
- Returning results
- Details
- Parameter correlation, autocorrelation and poor convergence
- Diagnostics (see coda package)
- Fitting the covariance
- Special cases where Metropolis is not needed
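The "keys to your own program" listed above (likelihood function, storage for the chain, looping through the parameters, returning results) can be sketched with a hand-written Metropolis sampler for a simple regression. This is a minimal sketch under stated assumptions: it does not use metrop1step from the CTFSRPackage, the data are simulated, the priors are flat, and the step sizes and burn-in length are arbitrary choices for this example.

```r
set.seed(2)
# Simulated data: y = 1 + 0.5 x + Gaussian error (values are invented)
x <- runif(100, -2, 2)
y <- 1 + 0.5 * x + rnorm(100, sd = 0.3)

# Log-posterior: Gaussian likelihood with flat priors (an assumption here)
logPost <- function(par) {
  if (par[3] <= 0) return(-Inf)            # the SD must be positive
  sum(dnorm(y, mean = par[1] + par[2] * x, sd = par[3], log = TRUE))
}

nStep <- 5000
# Data structure to save every step of the chain
chain <- matrix(NA, nrow = nStep, ncol = 3,
                dimnames = list(NULL, c("intercept", "slope", "sd")))
current <- c(0, 0, 1)                      # starting values
stepSize <- c(0.05, 0.05, 0.03)            # proposal SDs, chosen by hand

for (i in 1:nStep) {
  for (j in 1:3) {                         # update one parameter at a time
    proposal <- current
    proposal[j] <- current[j] + rnorm(1, sd = stepSize[j])
    logRatio <- logPost(proposal) - logPost(current)
    if (log(runif(1)) < logRatio) current <- proposal   # Metropolis acceptance
  }
  chain[i, ] <- current
}

post <- chain[-(1:1000), ]                 # discard burn-in
est <- colMeans(post)                      # posterior means
cred <- apply(post, 2, quantile, probs = c(0.025, 0.975))  # 95% credible intervals
```

The credible intervals come straight from quantiles of the saved chain. Plotting each column of chain against step number is a quick first check on convergence before turning to the diagnostics in the coda package.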
- Data simulation [not covered]
- Two purposes of simulation
- Understand connection from Process –> Data
- Test whether models work
- R’s probability distribution functions
- Regression with error
- Multi-level regression
- Extra: Survival
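The second purpose of simulation, testing whether a model works, can be illustrated with regression with error. The sketch below generates data from a known process many times and checks how often the fitted confidence interval contains the true slope. The function name simulateOnce and all parameter values are hypothetical choices for this example.

```r
set.seed(3)
trueSlope <- 2

# One simulated data set: process (a line) plus Gaussian error, then a fit
simulateOnce <- function(n = 50) {
  x <- runif(n, 0, 10)
  y <- 1 + trueSlope * x + rnorm(n, sd = 2)
  fit <- lm(y ~ x)
  ci <- confint(fit)["x", ]                # 95% confidence limits for the slope
  c(estimate = unname(coef(fit)["x"]),
    covered = unname(ci[1] <= trueSlope && trueSlope <= ci[2]))
}

sims <- t(replicate(500, simulateOnce()))  # one row per simulated data set
meanSlope <- mean(sims[, "estimate"])      # should be close to trueSlope
coverage <- mean(sims[, "covered"])        # should be close to 0.95
```

If the model is working, the mean estimate sits near the true value and the intervals cover it about 95% of the time. The same pattern applies to more complex models by swapping in a different process and a different fitting function.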
- Multi-level models (mixed effect, hierarchical, random vs. fixed effects) [day 3]
- Why multi-level modeling?
- Limitation: linear (or transformed linear) with normal error
- Multi-level vs. standard regression
Bates Chap 4, Section 4.4; Gelman & Hill pp. 251-259
- Regression with one group using lmer
- output of display
- graphs using the coefficients
- variable intercept, slope, or both
- Regression with two groups or two predictors x using lmer
- output of display
- models with or without covariance
- group level predictor (see Gelman&Hill p. 265)
- graphs using the coefficients
- Random or fixed?
- Traditional
- Random: nuisance effects, unrepeatable (batch, plot)
- Fixed: permanent group, repeatable (sex)
- Gray area: year? site?
- Recent issues favoring multi-level approach
(e.g., Gelman, who replaces ’random’ with ’grouping’)
- Is group-level variation an explicit research topic?
- Can different groups be thought of as similar?
- Can information on one group support other groups?
- Are some groups rare and thus needing support?
- Are there enough groups? (too few -> little evidence on group-level
variation)
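A regression with one grouping factor in lmer, as in the multi-level outline above, might look like the following. This is a hedged sketch: the data are simulated, the names dat, group, and groupIntercept and all values are invented for illustration, and the code is wrapped in a check so that it runs only when the lme4 package is installed.

```r
if (requireNamespace("lme4", quietly = TRUE)) {
  library(lme4)
  set.seed(4)
  nGroup <- 20
  nPer <- 15
  group <- factor(rep(1:nGroup, each = nPer))
  groupIntercept <- rnorm(nGroup, mean = 0, sd = 1)    # group-level variation
  x <- rnorm(nGroup * nPer)
  y <- 3 + groupIntercept[as.integer(group)] + 1.5 * x +
    rnorm(nGroup * nPer, sd = 0.5)
  dat <- data.frame(y, x, group)

  # Variable intercept by group, common slope
  fit <- lmer(y ~ x + (1 | group), data = dat)

  fixedPart <- fixef(fit)                  # overall intercept and slope
  groupPart <- ranef(fit)$group            # each group's deviation in intercept

  # Variable intercept and slope would be written:
  # lmer(y ~ x + (1 + x | group), data = dat)
}
```

Adding groupPart to the overall intercept in fixedPart gives each group's own intercept, which is what is needed for graphs using the coefficients. The display function from the arm package prints a compact summary of such a fit.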
6 Error functions
- dnorm is the standard
- dbinom is the standard for survival or occurrence (or similar)
- dlnorm
- for abundances, whether integer or not (but usually passed over in favor of
log-transformation)
- good match for tree growth rates
- but cannot handle zeroes
- dgamma is similar to log-normal
- dpois including zeroes (but does not handle much ecological data well)
- for integer abundances
- handles zeroes
- however, close to Gaussian at large means, with variance fixed equal to the mean, so not
appropriate for much (overdispersed) ecological data
- dnbinom
- for integer abundances that are highly skewed
- very common in ecology
- R: prob=dnbinom(count,size=k,mu=mu)
- size is ’clumping parameter’; mu is mean
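The contrast between dpois and dnbinom for skewed counts can be seen by fitting both to the same data and comparing AIC. The sketch below uses simulated counts. The function names nllPois and nllNB and the parameter values are assumptions for this example, and both parameters are fitted on the log scale to keep them positive.

```r
set.seed(5)
# Simulated clumped counts: mean 10, clumping parameter k = 0.8 (invented values)
counts <- rnbinom(500, size = 0.8, mu = 10)

# Negative log-likelihoods for the two error functions
nllPois <- function(logMu) {
  -sum(dpois(counts, lambda = exp(logMu), log = TRUE))
}
nllNB <- function(par) {
  -sum(dnbinom(counts, size = exp(par[1]), mu = exp(par[2]), log = TRUE))
}

fitPois <- optimize(nllPois, interval = c(-5, 10))     # one parameter
fitNB <- optim(c(0, log(mean(counts))), nllNB)         # two parameters

aicPois <- 2 * fitPois$objective + 2 * 1
aicNB <- 2 * fitNB$value + 2 * 2
kHat <- exp(fitNB$par[1])                  # estimated clumping parameter
muHat <- exp(fitNB$par[2])                 # estimated mean
```

For counts this skewed, aicNB should fall far below aicPois, because the Poisson cannot produce a variance larger than its mean.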
7 R functions
- Data extraction
- subset
- apply
- tapply
- cut
- dim
- str
- names
- ifelse [R base package]
- IfElse [CTFSRPackage version]
- Graphics
- hist
- plot
- points
- lines
- curve
- abline
- box
- axis
- X11
- dev.set
- Modeling
- summary
- mean
- median
- sd
- var
- cor
- CI [CTFSRPackage]
- model
- lm
- glm
- lmer [lme4 package]
- coef
- summary
- fixef [lme4 package]
- ranef [lme4 package]
- display [arm package]
- dotplot [lattice package]
- xyplot [lattice package]
- Probability distributions
- PDFs
- dnorm, rnorm, pnorm, qnorm
- dbinom, rbinom, pbinom, qbinom
- dlnorm etc.
- dnbinom etc.
- Likelihood
- optimize
- optim
- metrop1step [in CTFSRPackage]