Tutorial: An introduction to data cleaning with R

Mark van der Loo, Statistics Netherlands
Edwin de Jonge, Statistics Netherlands


Raw statistical data is rarely of sufficient quality to allow for immediate statistical analyses. In many cases, one has to take care of invalid or missing values in a dataset before one can start analysing the data. Such a process becomes even more complex when variables in a record set are interrelated by consistency rules. For example, in a survey where one asks for gender (male, female) and pregnant (yes, no), the combination (gender=male, pregnant=yes) is invalid for obvious reasons.
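To make the consistency rule concrete, here is a minimal sketch in base R that flags records violating the (gender=male, pregnant=yes) rule. The data frame, its column names and its values are made up for illustration and are not taken from the tutorial materials. The sketch uses base R only, not the packages covered in the tutorial.

```r
# Hypothetical survey records; column names and values are illustrative only.
survey <- data.frame(
  gender   = c("male", "female", "male", "female"),
  pregnant = c("no", "yes", "yes", NA),
  stringsAsFactors = FALSE
)

# Consistency rule: a record is invalid when gender is male and pregnant is yes.
# %in% yields FALSE for NA, so records with missing values are not flagged here;
# missing values would need separate treatment.
violates <- survey$gender %in% "male" & survey$pregnant %in% "yes"

# Records that break the rule
invalid_records <- survey[violates, ]
print(invalid_records)
```

Keeping the rule as a single logical expression makes the check reproducible: the same line can be rerun whenever the data are updated.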



This tutorial will cover a number of tools and methods that allow for automated and reproducible data cleaning. Topics discussed include

Some, but not all, elements of the tutorial will make use of R packages to which the authors have contributed, most notably tabplot, editrules, deducorrect and rspa.


The tutorial will include practical exercises, so bring your laptop!

Intended Audience

Workshop Materials

Related Links