Tutorial: Applied Statistical Genetics with R

Andrea S. Foulkes, University of Massachusetts School of Public Health and Health Sciences


Recent technological advancements, coupled with extensive genetic sequencing efforts, have led to an explosion in the availability of molecular-level data for the study of complex diseases, such as cardiovascular disease and cancer. At the same time, the ongoing success of large-scale genome-wide association studies has fueled interest in, and expanded our knowledge base for, the conduct of improved candidate-gene studies. This tutorial introduces fundamental concepts and R tools for population-based investigations of genotype-trait associations that are useful to medical and public health investigators.


   1. Unit 1: Genetic data concepts and tests. We begin by providing a general overview of genetic association studies and related genetic concepts. Topics covered include types of investigations, linkage disequilibrium and Hardy-Weinberg equilibrium, as well as the effect of population substructure on associated measures and tests.

   2. Unit 2: Multiple testing adjustments. The second portion of the tutorial focuses on applications of several multiple comparison procedures that address the multiplicity problem inherent in most genotype-trait association studies. These include both single-step and step-down adjustments (e.g. Bonferroni and false discovery rate control) as well as resampling-based methods and the q-value.

   3. Unit 3: Accounting for ambiguity in phase. Haplotype reconstruction techniques are typically applied to population-level association data, in which allelic phase is generally unobservable. In the third section of this tutorial, an expectation-maximization type algorithm and a Gibbs sampling algorithm are described and demonstrated for this setting.

   4. Unit 4: Topics in high-dimensional data analysis. Finally, in the fourth part of the tutorial, tree-based approaches, including classification and regression trees (CART), random forests (RFs) and logic regression, are presented and illustrated.
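
As a flavor of the kind of analysis covered in Units 1 and 2, the following minimal sketch uses base R only. It is not taken from the tutorial materials. The genotype counts and p-values are hypothetical values chosen for illustration, and the variable names are assumptions. The sketch computes a Pearson chi-squared test of Hardy-Weinberg equilibrium at a single biallelic marker and then applies Bonferroni and Benjamini-Hochberg adjustments to a small set of unadjusted p-values.

```r
# Hypothetical genotype counts at one biallelic SNP (illustrative, not real data)
n_AA <- 298
n_Aa <- 489
n_aa <- 213
n <- n_AA + n_Aa + n_aa

# Estimated frequency of allele A from the genotype counts
p <- (2 * n_AA + n_Aa) / (2 * n)

# Expected genotype counts under Hardy-Weinberg equilibrium
observed <- c(n_AA, n_Aa, n_aa)
expected <- n * c(p^2, 2 * p * (1 - p), (1 - p)^2)

# Pearson chi-squared statistic with 1 degree of freedom
# (3 genotype classes - 1 - 1 estimated allele frequency)
hwe_stat <- sum((observed - expected)^2 / expected)
hwe_p <- pchisq(hwe_stat, df = 1, lower.tail = FALSE)

# Hypothetical unadjusted p-values from five genotype-trait tests
p_raw <- c(0.001, 0.012, 0.030, 0.045, 0.200)

# Bonferroni adjustment and Benjamini-Hochberg false discovery rate adjustment
p_bonf <- p.adjust(p_raw, method = "bonferroni")
p_bh <- p.adjust(p_raw, method = "BH")

print(c(statistic = hwe_stat, p_value = hwe_p))
print(data.frame(p_raw, p_bonf, p_bh))
```

The Bonferroni-adjusted values are never smaller than the Benjamini-Hochberg values, which reflects the more conservative control of the family-wise error rate compared with control of the false discovery rate.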

Publicly available data sets are used to illustrate the analytic tools. Applications focus on the rapidly expanding field of genetic epidemiology, and specifically on the study of complex disease genotype-trait associations among unrelated individuals. Particular attention is given to the appropriate handling of population-level environmental and demographic factors that confound or modify associations of interest. This tutorial will offer participants a basic introduction to genetic association studies and fundamental knowledge of the broad and powerful spectrum of tools that R offers for addressing the array of analytical challenges inherent in these investigations. Content is based on A.S. Foulkes (2009), Applied Statistical Genetics with R: For Population-Based Association Studies, Springer (http://www.springer.com/978-0-89553-6).


Elementary knowledge of statistical concepts at the level of a first course in biostatistics is assumed. This tutorial is intended to appeal to public health and medical researchers involved in genetic investigations, as well as biologists, statisticians and computer scientists with interests in bioinformatic tools. Topics will extend the coverage of the UseR! 2008 tutorial.