Tutorial: Analysing large data with many models
A common pattern in the analysis of large data is to perform the same analysis on many (many many) smaller subsets of data. In this tutorial you will learn some basic statistical and computational strategies for solving this type of problem. During the course of the tutorial we will fit tens of thousands of linear models to a variety of datasets and explore how we can summarise the results to gain insight into our data.
The basic steps of large data analysis that we will follow are:
* Identify and fit an appropriate model for a single subset of the data
* Fit the model to every subset
* Examine model fit statistics to identify subsets that don't follow the same pattern, and modify and refit the model if necessary
* Look at coefficients and other summary statistics across all subsets
* Create a single model that summarises the many smaller models
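The first two steps above can be sketched in base R with `split()` and `lapply()`. This is a minimal illustration, not the tutorial's own data: it uses the built-in `mtcars` dataset, split by number of cylinders, as a stand-in for the many subsets we will work with.

```r
# Break the data into subsets (here: one piece per cylinder count)
pieces <- split(mtcars, mtcars$cyl)

# Fit the same linear model to every subset
models <- lapply(pieces, function(df) lm(mpg ~ wt, data = df))

# Model fit statistics for each subset, to spot ones that don't fit well
rsq <- sapply(models, function(m) summary(m)$r.squared)

# Coefficients across all subsets, collected into one matrix
coefs <- t(sapply(models, coef))
```

The same pattern scales from three subsets to tens of thousands: only the contents of `pieces` change.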
You will learn how to tackle these problems using commands purely in base R, but we will also use the plyr package extensively. The plyr package provides a toolbox for a common set of problems: you need to break a big problem down into manageable pieces, operate on each piece, and then put all the pieces back together. It's already possible to do this with split and the apply functions, but plyr just makes it all a bit easier with consistent names, arguments and outputs; input from and output to data.frames, matrices and lists; progress bars to keep track of long-running operations; and built-in error recovery.
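As a taste of how plyr expresses this workflow, here is a brief sketch (again using `mtcars` as a stand-in dataset): `dlply` takes a data frame in and returns a list of results, and `ldply` takes a list in and returns a data frame, so the function names themselves record the input and output types.

```r
library(plyr)

# Split by cylinder count, fit a model to each piece;
# .progress = "text" shows a progress bar for long runs
models <- dlply(mtcars, "cyl", function(df) lm(mpg ~ wt, data = df),
                .progress = "text")

# Combine the per-subset coefficients back into a single data frame
coefs <- ldply(models, coef)
```

Compare this with the equivalent base-R `split()`/`lapply()`/`do.call(rbind, ...)` dance: the plyr version states the split variable and the two data types directly.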
We will touch on issues of computational efficiency, including approaches for caching your work so that in the event of an unanticipated error or machine failure you lose little time, but the main emphasis will be on learning the most about your data, not on fitting data into memory or similar problems.
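One simple caching approach, sketched below under illustrative choices of file names and cache location, is to save each fitted model to disk as it completes: if the run dies part-way through, rerunning it only refits the pieces that are missing.

```r
# Hypothetical cache directory; in real work you would pick a
# persistent location rather than tempdir()
cache_dir <- tempdir()

fit_cached <- function(name, df) {
  path <- file.path(cache_dir, paste0(name, ".rds"))
  # If this piece was already fitted and saved, reuse it
  if (file.exists(path)) return(readRDS(path))
  model <- lm(mpg ~ wt, data = df)
  saveRDS(model, path)
  model
}

pieces <- split(mtcars, mtcars$cyl)
models <- Map(fit_cached, names(pieces), pieces)
```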
Participants should be familiar with the basic tools of linear models in R, and should have struggled with analysing large data in the past.