useR

Tutorial: An introduction to data cleaning with R


Mark van der Loo, Statistics Netherlands - m.vanderloo@cbs.nl
Edwin de Jonge, Statistics Netherlands - e.dejonge@cbs.nl

Overview

Raw statistical data is rarely of sufficient quality to allow for immediate statistical analysis. In many cases, one has to deal with invalid or missing values in a dataset before one can start analysing the data. This process becomes even more complex when the variables in a record set are interrelated by consistency rules. For example, in a survey that asks for gender (male, female) and pregnancy status (yes, no), the combination (gender=male, pregnant=yes) is invalid for obvious reasons.
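As a minimal sketch of the consistency rule in this example, the check can be written in base R. The data frame `survey`, its column names, and its values are hypothetical and only serve as illustration; this is not the approach taken by the packages covered in the tutorial.

```r
# Hypothetical survey records; column names and values are assumptions.
survey <- data.frame(
  id       = 1:4,
  gender   = c("male", "female", "male", "female"),
  pregnant = c("no", "yes", "yes", "no"),
  stringsAsFactors = FALSE
)

# Consistency rule: a record may not combine
# gender == "male" with pregnant == "yes".
violates_rule <- survey$gender == "male" & survey$pregnant == "yes"

# Select the records that break the rule and need further treatment.
invalid_records <- survey[violates_rule, ]
print(invalid_records)
```

With missing values in either column the comparison can yield `NA`, so a real check would have to decide explicitly how to treat incomplete records.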

Goals

Outline

This tutorial will cover a number of tools and methods that allow for automated and reproducible data cleaning. Topics discussed include

Some, but not all, elements of the tutorial will make use of R packages to which the authors have contributed, most notably tabplot, editrules, deducorrect and rspa.

Prerequisites

The tutorial will include practical exercises, so bring your laptop!

Intended Audience

Workshop Materials

Related Links