Tutorial: Regression on large data sets: big(g)lm and other approaches
 


Thomas Lumley,  Dept of Biostatistics, University of Washington, Seattle

Abstract

Everyone knows that R can handle only small data sets. This tutorial will look at ways to show that 'everyone' is wrong.  There are three main approaches. For data sets with up to a few hundred thousand rows it is possible to perform the regressions in R if only the necessary variables are loaded.  For larger data sets we can use incremental updates of bounded-memory computations, as in the biglm package, or perform the large-data computations directly in a database.
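
To give the flavour of the bounded-memory approach, here is a minimal sketch with the biglm package: fit the model on a first chunk of data, then fold in each remaining chunk with update().  Splitting the built-in mtcars data into four pieces stands in for reading a large file chunk by chunk; the chunk size and model are illustrative only.

library(biglm)

## Stand-in for reading a large file in pieces: split a small data frame
## into four chunks of rows.
chunks <- split(mtcars, rep(1:4, length.out = nrow(mtcars)))

## Fit on the first chunk, then update the fit with each remaining chunk,
## so only one chunk needs to be held in memory at a time.
fit <- biglm(mpg ~ wt + hp, data = chunks[[1]])
for (chunk in chunks[-1])
  fit <- update(fit, chunk)

## The coefficients agree with lm(mpg ~ wt + hp, data = mtcars).
summary(fit)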

Outline 

1) Why does lm() use a lot of memory?
2) Data examples
3) A little SQL: Load-on-demand regression (see the sketch after this outline)
4) Bounded-memory algorithms
5) One pass: biglm
6) Iterative: bigglm
7) More SQL: pushing computations to the database
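
To give the flavour of the load-on-demand approach in item 3, here is a minimal sketch using the DBI and RSQLite packages: only the variables that appear in the model are pulled out of the database, and lm() is then used in the ordinary way.  The database file, table, and column names are made up for illustration.

library(DBI)
library(RSQLite)

## Connect to an (illustrative) SQLite database of trip records.
con <- dbConnect(SQLite(), "trips.db")

## Select only the columns the model needs, not the whole table.
dat <- dbGetQuery(con, "SELECT fare, distance, passengers FROM trips")

fit <- lm(fare ~ distance + passengers, data = dat)
summary(fit)

dbDisconnect(con)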

Who is this for?

Users of R who want to analyse data sets that do not fit conveniently in memory.  The focus will be on linear and generalized linear models, but the techniques are relevant to other computations.