Advanced Modeling in R
Non-linear, Bayesian, and mixed effect methods
R. Condit, M. Ferrari
Cenpat, Patagonia
October 2012
1 Course overview
The course will cover several advanced statistical modeling methods using the programming
language R, including maximum-likelihood, non-linear, Bayesian, and multi-level (hierarchical)
methods as well as techniques for using data simulation to test models. The R function lmer, an
accessible yet complex tool for advanced modeling, will be covered in detail. To establish a base for
understanding multi-level models, some review of standard regression will be included, plus a
session on fitting non-linear models with maximum likelihood.
During the first half of each session, I will explain methods and present examples of their use; in
the second half, students will work on assignments using the same methods. Datasets will be
provided, but students are encouraged to bring their own data as well. A course web site will provide
sample code, data, and a list of key R functions. Students should be familiar with R: manipulating
dataframes, graphing, and linear regression.
1.1 To apply
- To join, contact Alexandra Sapoznikow, Oficina de Vinculación Tecnológica, Centro
Nacional Patagónico (CONICET)
1.2 Schedule
- When: Five Sessions, 9:00-18:00, 9-13 Oct 2012
- Where: Salon Península, Cenpat, Puerto Madryn
2 Software required
3 Course web site
4 Books and other background material
5 Contents and approximate scheduling (daily progress will depend on the experience of the
students)
- Modeling with standard regression and maximum likelihood [day 1]
- Linear regression with lm (review)
- Gaussian error
- Residuals and statistics (coef, summary)
- Data treemass: log(agb) vs. log(dbh)
- Centering x in linear regression: use xCenter = x - mean(x)
- Numerical estimation with optim (see the sketch after this list)
- maximize likelihood vs. minimize sum of squares
- alternate methods in optim (Nelder-Mead etc.)
- comparing models with AIC
- Non-linear models
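A minimal sketch of this day-1 sequence, assuming a small simulated stand-in for the treemass data (hypothetical columns dbh and agb): fit the centered regression with lm, refit it by maximizing a Gaussian likelihood with optim, and compare the two by AIC.

    ## Hypothetical data in the spirit of the treemass example: log(agb) vs. log(dbh)
    set.seed(1)
    dbh <- runif(200, 5, 80)
    treemass <- data.frame(dbh = dbh, agb = exp(-2 + 2.4 * log(dbh) + rnorm(200, sd = 0.3)))

    ## Linear regression with lm, after centering x
    x <- log(treemass$dbh)
    y <- log(treemass$agb)
    xCenter <- x - mean(x)
    fit.lm <- lm(y ~ xCenter)
    coef(fit.lm)
    summary(fit.lm)

    ## The same model by maximizing a Gaussian log-likelihood with optim
    negloglik <- function(p, x, y) {
        mu <- p[1] + p[2] * x
        -sum(dnorm(y, mean = mu, sd = exp(p[3]), log = TRUE))  # exp() keeps sd positive
    }
    fit.ml <- optim(c(0, 1, 0), negloglik, x = xCenter, y = y, method = "Nelder-Mead")
    fit.ml$par                               # intercept, slope, log(sd)

    ## Compare models with AIC = 2*(number of parameters) + 2*(negative log-likelihood)
    2 * 3 + 2 * fit.ml$value                 # should be very close to AIC(fit.lm)
    AIC(fit.lm)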
- Survival models with maximum likelihood [day 2 morning] (see the sketch below)
- binomial error instead of Gaussian error
- logistic function to describe data
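A hedged sketch of such a survival model, assuming hypothetical vectors size (a predictor) and surv (0/1 outcomes): the logistic function gives the survival probability, and the binomial (Bernoulli) likelihood replaces the Gaussian error.

    ## Simulated survival data: probability of survival follows a logistic curve in size
    set.seed(2)
    size <- runif(300, 0, 10)
    surv <- rbinom(300, size = 1, prob = 1 / (1 + exp(-(-2 + 0.6 * size))))

    ## Binomial negative log-likelihood with a logistic survival function
    negloglik <- function(p, x, y) {
        prob <- 1 / (1 + exp(-(p[1] + p[2] * x)))
        -sum(dbinom(y, size = 1, prob = prob, log = TRUE))
    }
    fit <- optim(c(0, 0), negloglik, x = size, y = surv)
    fit$par                                    # compare with the true values -2 and 0.6
    coef(glm(surv ~ size, family = binomial))  # the equivalent glm fit, as a check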
- Data simulation [day 2 afternoon]
- Two purposes of simulation
- Understand the connection from process -> data
- Test whether models work (see the sketch after this list)
- R’s probability distribution functions
- density and random draws (e.g., dnorm and rnorm)
- important distributions: normal, binomial, Poisson, negative binomial
- Regression with error
- Multi-level regression
- Extra: Survival
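A minimal simulation sketch under assumed parameter values: R's distribution functions generate data from a known process (process -> data), and refitting many simulated datasets tests whether the model recovers the truth.

    ## Density vs. random draws: dnorm gives the density, rnorm the random numbers
    dnorm(0)
    rnorm(5)

    ## Simulate regression data with Gaussian error from known parameters, then refit
    set.seed(3)
    oneSim <- function(n = 100, a = 1, b = 0.5, sigma = 2) {
        x <- runif(n, 0, 20)
        y <- rnorm(n, mean = a + b * x, sd = sigma)   # process -> data
        coef(lm(y ~ x))                               # data -> fitted parameters
    }
    est <- replicate(1000, oneSim())
    rowMeans(est)       # should be close to the true a = 1, b = 0.5
    apply(est, 1, sd)   # sampling variation of the estimates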
- Multi-level models (mixed effect, hierarchical, random vs. fixed effects) [day 2-3]
- Why multi-level modeling?
- Limitation: linear (or transformed linear) with normal error
- Multi-level vs. standard regression
Bates Chap 4, Section 4.4; Gelman & Hill pp. 251-259
- Regression with one group using lmer (see the lmer sketch after this list)
- output of display
- graphs using the coefficients
- variable intercept, slope, or both
- Regression with two groups or two predictors x using lmer
- output of display
- models with or without covariance
- group-level predictor (see Gelman & Hill, p. 265)
- graphs using the coefficients
- Random or fixed?
- Traditional
- Random: nuisance effects, unrepeatable (batch, plot)
- Fixed: permanent group, repeatable (sex)
- Gray area: year? site?
- Recent issues favoring multi-level approach
(i.e., Gelman, who replaces ’random’ with ’grouping’)
- Is group-level variation an explicit research topic?
- Can different groups be thought of as similar?
- Can information on one group support other groups?
- Are some groups rare and thus needing support?
- Are there enough groups? (too few -> little evidence on group-level
variation)
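A hedged lmer sketch with a single grouping factor, using simulated data with hypothetical names; the varying-intercept form is shown, and the varying-slope variant is noted in a comment.

    library(lme4)
    library(arm)    # for display()

    ## Simulated multi-level data: 20 groups, each with its own intercept
    set.seed(4)
    group <- factor(rep(1:20, each = 15))
    a <- rnorm(20, mean = 5, sd = 2)               # group-level intercepts
    x <- runif(300, 0, 10)
    y <- rnorm(300, mean = a[group] + 0.7 * x, sd = 1)

    ## Varying intercept: (1 | group); varying intercept and slope: (1 + x | group)
    fit <- lmer(y ~ x + (1 | group))
    display(fit)    # concise summary (arm package)
    fixef(fit)      # overall intercept and slope
    ranef(fit)      # group-level departures from the overall intercept
    coef(fit)       # fixef plus ranef, one line per group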
- Bayesian methods [day 4-5]
- Bayes rule and the posterior distribution
- Metropolis and the Gibbs sampler (MCMC); a minimal Metropolis sketch follows this list
- Another method for fitting parameters
- Automatically provides fully accurate confidence limits (credible intervals)
- Much more flexible modeling options (e.g., non-linear models with many parameters)
- Any error distribution
- Latent states or latent data
- Hierarchical modeling
- Limitations: long run time, complicated program
- Keys to your own program
- Getting the correct likelihood functions (this can be difficult in complex models)
- Preparing data structures to save all the data and likelihood
- Looping through all the parameters and hyperparameters
- Returning results
- Details
- Parameter correlation, autocorrelation and poor convergence
- Diagnostics (see coda package)
- Fitting the covariance
- Special cases where Metropolis not needed
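A bare-bones Metropolis sketch for a single parameter (the mean of Gaussian data with known sd), meant only to illustrate the accept/reject step; the course's metrop1step in the CTFSRPackage is more general.

    ## Minimal Metropolis sampler for the mean of Gaussian data with sd fixed at 1
    set.seed(5)
    y <- rnorm(50, mean = 3, sd = 1)
    logpost <- function(mu, y) sum(dnorm(y, mean = mu, sd = 1, log = TRUE))  # flat prior

    nstep <- 5000
    chain <- numeric(nstep)
    mu <- 0                                      # starting value
    for (i in 1:nstep) {
        prop <- mu + rnorm(1, sd = 0.3)          # propose a step
        ratio <- exp(logpost(prop, y) - logpost(mu, y))
        if (runif(1) < ratio) mu <- prop         # accept with probability min(1, ratio)
        chain[i] <- mu
    }
    mean(chain[-(1:1000)])                       # posterior mean after burn-in
    quantile(chain[-(1:1000)], c(0.025, 0.975))  # 95% credible interval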
6 Key R functions
- Data extraction
- subset
- apply
- tapply
- cut
- dim
- str
- names
- ifelse [R base package]
- IfElse [CTFSRPackage version]
- Graphics
- hist
- plot
- points
- lines
- curve
- abline
- box
- axis
- X11
- dev.set
- Modeling
- summary
- mean
- median
- sd
- var
- cor
- CI [CTFSRPackage]
- model
- lm
- glm
- lmer [lme4 package]
- coef
- summary
- fixef [lme4 package]
- ranef [lme4 package]
- display [arm package]
- dotplot [lattice package]
- xyplot [lattice package]
- Likelihood
- optimize
- optim
- metrop1step [in CTFSRPackage]
- Error functions and probability distributions
- dnorm is the standard (Gaussian error)
- dbinom is the standard for survival or occurrence (or similar)
- dlnorm
- for abundances, whether integer or not (though a log-transformation with Gaussian error is usually used instead)
- good match for tree growth rates
- but cannot handle zeroes
- dgamma is similar to the log-normal
- dpois
- for integer abundances
- handles zeroes
- however, close to Gaussian so not appropriate for much ecological data
- dnbinom
- for integer abundances that are highly skewed
- very common in ecology
- R: prob = dnbinom(count, size = k, mu = mu)
- size is the ’clumping parameter’; mu is the mean (see the example below)
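A short illustration with simulated counts (assumed parameter values) of why the negative binomial suits skewed integer abundances better than the Poisson.

    ## Skewed count data: the negative binomial fits, the Poisson is far too narrow
    set.seed(6)
    count <- rnbinom(500, size = 0.8, mu = 4)    # size is the clumping parameter k
    mean(count)
    var(count)                                   # variance far exceeds the mean

    ## Profile the negative binomial log-likelihood over k, with mu at the sample mean
    k <- seq(0.2, 3, by = 0.05)
    loglik <- sapply(k, function(kk) sum(dnbinom(count, size = kk, mu = mean(count), log = TRUE)))
    k[which.max(loglik)]                         # should be near the true k = 0.8
    sum(dpois(count, lambda = mean(count), log = TRUE))  # much lower log-likelihood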