useR

Tutorial: Classification with Individual and Ensemble Trees


Esteban Alfaro, Matías Gámez and Noelia García (Faculty of Economics and Business Sciences of Albacete, University of Castilla‐La Mancha. Plaza de la Universidad, s/n. 02071 Albacete (Spain). e‐mail: {Esteban.Alfaro, Matias.Gamez, Noelia.Garcia}@uclm.es)

Overview

Classification trees are a powerful alternative to more traditional statistical models. They can detect non-linear relationships and perform well in the presence of qualitative information, which is common in many real problems. As a result, they are widely used as base classifiers for ensemble methods. AdaBoost constructs its base classifiers in sequence, updating a distribution over the training examples before building each new base classifier. Bagging combines individual classifiers built on bootstrap replicates of the training set. Random Forest is a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. In this tutorial, we compare the prediction accuracy of these techniques on several UCI data sets.
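The two resampling ideas mentioned in the overview, AdaBoost's updating of a distribution over the training examples and Bagging's bootstrap replicates, can be sketched in a few lines. The following is only an illustrative sketch in plain Python, not the code of the R packages used in the tutorial. The function names are hypothetical, and the vote weight alpha = ln((1 - err) / err) is one common form of the AdaBoost coefficient, assumed here for illustration.

```python
import math
import random
from collections import Counter


def adaboost_reweight(weights, misclassified):
    """One AdaBoost reweighting step over the training examples.

    weights       -- current distribution over the examples (sums to 1)
    misclassified -- True where the latest base classifier was wrong
    Returns the updated distribution and the classifier's vote weight.
    """
    err = sum(w for w, wrong in zip(weights, misclassified) if wrong)
    if not 0.0 < err < 0.5:
        raise ValueError("weighted error must lie strictly between 0 and 0.5")
    alpha = math.log((1.0 - err) / err)  # assumed form of the vote weight
    # Raise the weight of misclassified examples, then renormalise.
    raised = [w * math.exp(alpha) if wrong else w
              for w, wrong in zip(weights, misclassified)]
    total = sum(raised)
    return [w / total for w in raised], alpha


def bootstrap_replicate(n, rng):
    """Indices of one bootstrap replicate: n draws with replacement."""
    return [rng.randrange(n) for _ in range(n)]


def majority_vote(labels):
    """Most frequent label among the base classifiers' predictions."""
    return Counter(labels).most_common(1)[0][0]


# Five equally weighted examples; the base classifier misses the last one.
weights, alpha = adaboost_reweight([0.2] * 5, [False, False, False, False, True])
print([round(w, 3) for w in weights])  # [0.125, 0.125, 0.125, 0.125, 0.5]

rng = random.Random(0)  # fixed seed so the replicate is reproducible
print(bootstrap_replicate(5, rng))     # indices may repeat
print(majority_vote(["a", "b", "b"]))  # b
```

In the toy run, the single misclassified example ends up carrying half of the total weight, so the next base classifier concentrates on it. Bagging, by contrast, gives every example the same chance of being drawn in each replicate and combines the resulting classifiers by a plain majority vote.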

Goals

The first goal of this tutorial is to introduce a less expert audience to classification with individual and ensemble trees through several R packages such as rpart, adabag and randomForest. The second goal is for participants to bring their own data and apply these methods to it.

Outline

  1. Introduction
  2. Individual classification trees
  3. Ensemble trees
  4. Examples

Prerequisites

Participants should have some basic knowledge of data manipulation and standard functions in R.

Intended Audience

Researchers, students and professionals interested in classification.

Workshop Materials

The slides used in the tutorial will be made available to participants. Participants are welcome to bring their own laptops and datasets, whose analysis can be discussed in the last part of the tutorial.

Related Links

ALFARO, E., GAMEZ, M., GARCIA, N. (2012). adabag: Applies multiclass AdaBoost.M1, AdaBoost-SAMME and Bagging. R package version 3.1. http://CRAN.R-project.org/package=adabag.

FAN, Y., MURPHY, T.B., WATSON, R.W.G. (2012). digeR: GUI tool for analyzing 2D DIGE data. R package version 1.3. http://CRAN.R-project.org/package=digeR.

KUHN, M. (2012). caret: Classification and Regression Training. R package version 5.15-023. Contributions from WING, J., WESTON, S., WILLIAMS, A., KEEFER, C., ENGELHARDT, A. http://CRAN.R-project.org/package=caret.

LIAW, A., WIENER, M. (2002). Classification and Regression by randomForest. R News 2(3), 18-22. http://cran.r-project.org/web/packages/randomForest/.

THERNEAU, T.M., ATKINSON, B., RIPLEY, B. (2012). rpart: Recursive Partitioning. R package version 3.1-55. http://CRAN.R-project.org/package=rpart.

References

ALFARO, E., GARCIA, N., GAMEZ, M. AND ELIZONDO, D. (2008): “Bankruptcy forecasting: An empirical comparison of AdaBoost and neural networks”. Decision Support Systems, 45, 110–122.

BREIMAN, L. (1996): “Bagging predictors”. Machine Learning, 24(2), 123–140.

BREIMAN, L. (2001): “Random Forests”. Machine Learning, 45, 5–32.

BREIMAN, L., FRIEDMAN, J.H., OLSHEN, R. AND STONE, C.J. (1984): Classification and Regression Trees. Belmont, Wadsworth International Group.

FREUND, Y. AND SCHAPIRE, R.E. (1996): “Experiments with a new boosting algorithm”. In Proceedings of the Thirteenth International Conference on Machine Learning, 148–156, Morgan Kaufmann.

ZHU, J., ZOU, H., ROSSET, S. AND HASTIE, T. (2009): “Multi-class AdaBoost”. Statistics and Its Interface, 2, 349–360.