useR! 2007

August 8–10. Iowa State University, Ames, Iowa

Sponsored by:

XLSolutions

ASA Sections on Statistical Graphics and Computing

Insightful

Machine Learning Tools for Model Building and Inference:
Partying with Trees, Forests and Boosting Methods

Torsten Hothorn, Torsten.Hothorn@R-project.org, and Carolin Strobl; FAU Erlangen-Nuernberg, Germany

When constructing a statistical model for the functional relationship between a response variable and a possibly huge number of covariates, we aim to interpret the regression relationship, to predict the response, or both. Prediction may rely on rather complex models, such as a random forest, whereas simple models are usually preferred when interpretation is more important.

In this tutorial we will show how to build and compare models for both interpretation and prediction in continuous regression, classification and survival analysis. After starting with simple tree-based regression models and efficient visualization techniques for such models, we will move to forests of trees and discuss some properties of variable importance measures. Later, we introduce boosting methods for fitting generalized linear and generalized additive models in possibly high-dimensional situations. Finally, we focus on non-parametric regression models fitted by boosting stumps or larger regression trees.
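To make the last topic concrete, the following is a minimal sketch of boosting with regression stumps under squared-error loss. It is written in plain Python for illustration only and is not the interface of any package used in the tutorial; the function names, the step size nu = 0.1, the number of iterations and the toy data are all assumptions.

```python
# Minimal sketch of L2 boosting with regression stumps on one covariate.
# All names and settings are hypothetical; this is not the mboost API.

def fit_stump(x, residuals):
    """Return (threshold, left_mean, right_mean) minimising squared error."""
    best = None
    values = sorted(set(x))
    # Candidate thresholds are midpoints between neighbouring unique values.
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2.0
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, xi):
    threshold, left_mean, right_mean = stump
    return left_mean if xi <= threshold else right_mean


def boost_stumps(x, y, n_iter=100, nu=0.1):
    """Repeatedly fit a stump to the current residuals, adding a shrunken copy."""
    offset = sum(y) / len(y)          # start from the mean response
    fitted = [offset] * len(y)
    stumps = []
    for _ in range(n_iter):
        residuals = [yi - fi for yi, fi in zip(y, fitted)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        fitted = [fi + nu * predict_stump(stump, xi)
                  for xi, fi in zip(x, fitted)]
    return {"offset": offset, "nu": nu, "stumps": stumps}


def predict(model, xi):
    return model["offset"] + model["nu"] * sum(
        predict_stump(s, xi) for s in model["stumps"])


def mse(y, predictions):
    return sum((yi - pi) ** 2 for yi, pi in zip(y, predictions)) / len(y)


# Toy data (assumed): a smooth, non-linear relationship without noise.
x = [i / 10.0 for i in range(20)]
y = [(xi - 1.0) ** 2 for xi in x]

model = boost_stumps(x, y)
baseline_mse = mse(y, [model["offset"]] * len(y))
boosted_mse = mse(y, [predict(model, xi) for xi in x])
print("baseline MSE:", baseline_mse)
print("boosted MSE: ", boosted_mse)
```

The small step size means each stump contributes only a fraction of its fit, so the number of iterations acts as the main tuning parameter; in practice it would be chosen on data not used for fitting rather than on the training error shown here.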

The basic principles of the design and analysis of benchmark experiments, used for the performance-based comparison of multiple candidate models or for tuning hyper-parameters, will be introduced and illustrated throughout the course. The procedures covered are implemented in the R packages `randomForest', `party' and `mboost'. We will use these packages to construct regression models for predicting total body fat, glaucomatous damage of the optic nerve head, and breast cancer survival, among many other examples shown during the course.
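The idea of a benchmark experiment can be sketched as follows: every candidate model is fitted and evaluated on the same resampled splits of the data, so that performance differences are paired. The sketch below is plain Python and only illustrative. It uses k-fold cross-validation as the resampling scheme, which is an assumption made here for simplicity (the course may rely on other schemes, such as the bootstrap), and the two candidate models, the toy data and all function names are hypothetical.

```python
# Sketch of a benchmark experiment: two candidate models are compared on
# identical cross-validation folds. Data and names are hypothetical.

def fit_mean(x, y):
    """Candidate 1: ignore the covariate and predict the mean response."""
    m = sum(y) / len(y)
    return lambda xi: m


def fit_linear(x, y):
    """Candidate 2: simple least-squares linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return lambda xi: intercept + slope * xi


def benchmark(x, y, learners, k=5):
    """Return the per-fold test MSE of every learner, using the same folds."""
    n = len(x)
    results = {name: [] for name in learners}
    for fold in range(k):
        test_idx = [i for i in range(n) if i % k == fold]
        train_idx = [i for i in range(n) if i % k != fold]
        x_train = [x[i] for i in train_idx]
        y_train = [y[i] for i in train_idx]
        for name, fit in learners.items():
            model = fit(x_train, y_train)
            fold_mse = sum((y[i] - model(x[i])) ** 2
                           for i in test_idx) / len(test_idx)
            results[name].append(fold_mse)
    return results


# Toy data (assumed): a linear trend with a small deterministic disturbance.
x = [float(i) for i in range(30)]
y = [2.0 + 0.5 * xi + (0.3 if i % 2 == 0 else -0.3)
     for i, xi in enumerate(x)]

results = benchmark(x, y, {"mean": fit_mean, "linear": fit_linear})
mean_mse = {name: sum(v) / len(v) for name, v in results.items()}
# Paired differences: both models were evaluated on exactly the same folds.
differences = [a - b for a, b in zip(results["mean"], results["linear"])]
print(mean_mse)
print(differences)
```

Because both candidates see identical training and test sets in every fold, the per-fold differences can be analysed directly, which is what makes a formal comparison of the candidates possible.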