Tutorials

The tutorials will take place on 10-11 July 2018. Click the tutorial for more information and register here.

There is a late-breaking change. Heather Turner will not be able to make it to Australia. Her tutorial notes are available at https://github.com/hturner/gnm-day-course. We are lucky that Max Kuhn from RStudio has stepped in to provide an alternative tutorial for that time slot. Details are below.

Information on handling preferences: Thank you if you entered your tutorial preferences in our form. This has helped us keep tabs on numbers for each tutorial and allocate presenters to rooms based on these numbers. We are now quite sure that we can handle all preferences with our room sizes. You can change your mind now! We are not checking the records of your preferences, so you can simply go to the tutorial of your choice in the session. There is no need to enter any new preferences.

You will need your badge to get into the tutorial. It will be colour-coded depending on what session you have registered for.

Presenter Title Venue Target audience What to bring
Tuesday Morning 9:00-12:30 Break 10:30-11:00
Presenter: Paula Moraga
Title: Disease risk modeling and visualization using R
Venue: P8
Target audience: People who are interested in health surveillance or any subject that deals with spatially referenced data
What to bring: Please come to the tutorial with R and RStudio installed, and ensure you have installed the following packages: "dplyr", "ggplot2", "leaflet", "geoR", "rgdal", "raster", "sp", "spdep", "SpatialEpi", "SpatialEpiApp". The package rgdal may take a long time to install depending on the system, so this is best done ahead of time. We will also need the R package INLA, which can be installed by typing:

install.packages("INLA", repos = c(getOption("repos"), INLA = "https://inla.r-inla-download.org/R/stable"), dependencies = TRUE)

In this tutorial we will learn how to estimate disease risk and quantify risk factors using areal and geostatistical data. We will also create interactive maps of disease risk and risk factors, and introduce presentation options such as interactive dashboards and shiny apps. We will work through two disease mapping examples using data on malaria in The Gambia and cancer in Pennsylvania, United States. We will focus on disease risk, but the approaches covered are also applicable to other fields such as climate, ecology or crime. We will cover the following topics:

  • Model disease risk in different settings,
  • Manipulate and transform point, areal and raster data using the spatial packages `sp`, `spdep`, `raster`, `rgdal`, `geoR`, `SpatialEpi` and `SpatialEpiApp`,
  • Retrieve high resolution spatially referenced environmental data using `raster`,
  • Fit and interpret spatial models using `INLA`,
  • Map disease risk and risk factors using `leaflet` and `ggplot2`,
  • Communicate results with interactive dashboards using `flexdashboard` and shiny apps using `shiny`.

Presenter: Simon Jackson
Title: Wrangling data in the Tidyverse
Venue: P11
Target audience: Beginner-to-intermediate R users who want to improve the day-to-day quality and efficiency of their data wrangling skills
What to bring: Please remember to bring a laptop with R and RStudio installed. To speed things up, please also install the tidyverse package and familiarise yourself with RStudio projects. Data files will be made available online at the workshop.

This hands-on tutorial will help beginner-to-intermediate R users take their data wrangling skills to the next level with an introduction to the Tidyverse: a collection of data science packages including dplyr, tidyr, purrr, ggplot2, and more. Using provided data sets and practical examples, you will learn how to efficiently import, tidy, and transform data to more quickly focus on tasks like visualization and modeling.
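As a rough illustration of the tidy-then-transform workflow described above (not taken from the tutorial materials), the sketch below reshapes a small made-up table with tidyr and summarises it with dplyr. The data frame `scores` and its column names are hypothetical, and `pivot_longer()` assumes tidyr 1.0.0 or later.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide-format data: one row per student, one column per test.
scores <- data.frame(
  student = c("A", "B", "C"),
  test1 = c(80, 65, 90),
  test2 = c(85, 70, NA)
)

# Tidy: reshape to one row per student-test pair.
long <- pivot_longer(scores, cols = c(test1, test2),
                     names_to = "test", values_to = "score")

# Transform: mean score per student, ignoring missing values.
summary_tbl <- long %>%
  group_by(student) %>%
  summarise(mean_score = mean(score, na.rm = TRUE))

summary_tbl
```

Keeping the data in long form means the same `group_by()` and `summarise()` call works unchanged if more test columns are added later.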

Presenter: Elizabeth Stark
Title: Production-ready R: Getting started with R and docker
Venue: P9
Target audience: Experience with using the command line and running basic scripts is helpful but not necessary. Some prior exposure to Docker, git and cloud computing is helpful but no in-depth knowledge is required.
What to bring: The tutorial will have hands-on components, so it would be great if you can pre-install Docker on your laptop using the instructions here https://docs.docker.com/install/ . Once you have it installed you can type `docker run hello-world` at a console prompt to test it (try `sudo` if it complains). We will be testing out the RStudio containers, so when you install Docker please run `docker pull rocker/rstudio` and `docker pull rocker/geospatial` to download those images beforehand. And don't worry if you don't have a laptop or can't get things installed beforehand - we can assist or you can work alongside someone else for the exercises.

We will present some real-world data science scenarios and use these as a basis to walk participants through the process of building and deploying R-docker apps. Participants will gain experience in writing R scripts to run as stand-alone docker applications through examples, discussion and activities. We will provide code that can be used as a basis for participants' own projects.

Presenter: Scott Came
Title: Applications with R and Docker
Venue: P6
Target audience: Attendees with some exposure to Docker interested in building multi-container networked applications using Docker and R
What to bring: Attendees should plan to complete Part 1 of the Docker Getting Started orientation at https://docs.docker.com/get-started/ prior to the tutorial. And no worries if you are not planning to bring a laptop to the tutorial, as you can just pair up with someone else for the exercises.

In this tutorial we will explore several "advanced" scenarios of using Docker and R together to ease deployment of R applications. Attendees will gain hands-on experience building and deploying docker images for Shiny, databases, plumber, and keras. We will also look at cloud deployment and scaling applications with Kubernetes.

Presenter: Przemyslaw Biecek
Title: DALEX: Descriptive mAchine Learning EXplanations. Tools for exploration, validation and explanation of complex machine learning models
Venue: P10
Target audience: Applied data scientists, analysts interested in machine learning models.
What to bring: Please bring a laptop with R and the following libraries installed from CRAN: install.packages(c("DALEX", "breakDown", "live", "auditor", "randomForest", "ceterisParibus"))

Complex machine learning models are frequently used in predictive modelling. There are many examples of random-forest-like or boosting-like models in medicine, finance, agriculture, etc. In this workshop we will show why and how one would analyse the structure of a black-box model.

This will be a hands-on workshop with four parts. In each part there will be a short lecture (around 20-25 minutes) and then time for practice and discussion (around 20-25 min).

* Introduction

Here we will show what problems may arise from blind application of black-box models. Also we will show situations in which the understanding of a model structure leads to model improvements, model stability and larger trust in the model.

During the hands-on part we will fit a few complex models (like xgboost, randomForest) with the mlr package and discuss basic diagnostic tools for these models.

* Conditional Explainers

In this part we will introduce techniques for understanding the marginal/conditional response of a model given one or two variables. We will cover PDP (Partial Dependence Plots) and ICE (Individual Conditional Expectations) packages for continuous variables and MPP (Merging Path Plot from the factorMerger package) for categorical variables.

* Local Explainers

In this part we will introduce techniques that explain key factors that drive single model predictions. This covers Break Down plots for linear models (lm / glm) and tree-based models (randomForestExplainer, xgboostExplainer) along with model agnostic approaches implemented in the live package (an extension of the LIME method).

* Global Explainers

In this part we will introduce tools for global analysis of the black-box model, like variable importance plots, interaction importance plots and tools for model diagnostics.

* Literature

Staniak, Mateusz, and Przemysław Biecek. 2017. Live: Local Interpretable (Model-Agnostic) Visual Explanations.

Sitko, Agnieszka, and Przemyslaw Biecek. 2017. FactorMerger: Hierarchical Algorithm for Post-Hoc Testing. https://github.com/MI2DataLab/factorMerger.

Greenwell, Brandon M. 2017. “Pdp: An R Package for Constructing Partial Dependence Plots.” The R Journal 9 (1): 421–36. https://journal.r-project.org/archive/2017/RJ-2017-016/index.html.

Goldstein, Alex, Adam Kapelner, Justin Bleich, and Emil Pitkin. 2015. “Peeking Inside the Black Box: Visualizing Statistical Learning with Plots of Individual Conditional Expectation.” Journal of Computational and Graphical Statistics 24 (1): 44–65. doi:10.1080/10618600.2014.907095.

Apley, Dan. 2017. ALEPlot: Accumulated Local Effects (Ale) Plots and Partial Dependence (Pd) Plots. https://CRAN.R-project.org/package=ALEPlot.

Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. 2016. “‘Why Should I Trust You?’: Explaining the Predictions of Any Classifier.” In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1135–44. ACM Press. doi:10.1145/2939672.2939778.

Biecek, Przemyslaw. 2017. BreakDown: BreakDown Plots. https://CRAN.R-project.org/package=breakDown.

Presenter: Hanjo Odendaal
Title: The ultimate online collection toolbox: Combining RSelenium and Rvest
Venue: P7
Target audience: Intermediate R users looking to explore online data collection
What to bring: Installation details are at https://bit.ly/2Moe5nH

The rvest package from Hadley Wickham has become the go-to package for all online collection or website interaction (web-scraping) tasks in R. Although the package is amazing, it is not able to interact with a webpage when the page is dynamically loaded through JavaScript. For the latter, we need a browser that we 'drive' around the website to collect, load and interact with objects. Welcome to RSelenium from John Harrison. The package provides the necessary tools that allow the user to drive a web browser from R using script commands. In this tutorial, we will look at installing RSelenium, learning basic commands, JavaScript tips and how to play well with others like rvest.
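To make the static case concrete, here is a minimal sketch of what rvest does well: parsing HTML that is already present in the page source. The HTML snippet, class name and link targets are made up for illustration, and `html_elements()` assumes rvest 1.0.0 or later. The dynamic case that RSelenium handles needs a running browser and is not shown.

```r
library(rvest)

# A hypothetical static page, supplied inline so no network access is needed.
html <- '<html><body>
  <h1>Results</h1>
  <ul>
    <li class="item"><a href="/a">Alpha</a></li>
    <li class="item"><a href="/b">Beta</a></li>
  </ul>
</body></html>'

page <- read_html(html)

# Select the links inside list items with a CSS selector.
links <- html_elements(page, "li.item a")

# Collect the link text and target into a data frame.
link_table <- data.frame(
  text = html_text(links),
  href = html_attr(links, "href")
)

link_table
```

If the list items were instead inserted by JavaScript after page load, `read_html()` would not see them, which is the gap a driven browser fills.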

Tuesday Afternoon 1:30-5:00 Break 3:00-3:30
Presenter: Thomas Lumley
Title: fasteR: ways to speed up R code
Venue: P10
Target audience: Intermediate R programmers interested in speeding up their code
What to bring: See instructions at https://github.com/tslumley/useRfasteR

This workshop will cover some intermediate and advanced techniques for optimising R code. We will look at both processing speed and memory use, but will not cover converting your code into other languages (e.g. C).
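As a small illustration of the kind of speed and memory issue such techniques address (this example is not taken from the workshop materials), the sketch below compares growing a vector inside a loop, filling a preallocated vector, and a vectorised call. The function names are hypothetical.

```r
# Hypothetical helper names; not from the workshop materials.

# Slow: the result vector is grown on every iteration, forcing repeated copies.
squares_grow <- function(n) {
  out <- c()
  for (i in seq_len(n)) out <- c(out, i^2)
  out
}

# Better: allocate the full result once, then fill it in place.
squares_prealloc <- function(n) {
  out <- numeric(n)
  for (i in seq_len(n)) out[i] <- i^2
  out
}

# Usually fastest: let R's vectorised arithmetic do the looping internally.
squares_vec <- function(n) seq_len(n)^2

n <- 1e4

# All three approaches should agree before we compare their cost.
stopifnot(isTRUE(all.equal(squares_grow(n), squares_prealloc(n))),
          isTRUE(all.equal(squares_prealloc(n), squares_vec(n))))

# Timings vary by machine, so none are quoted here.
system.time(squares_grow(n))
system.time(squares_prealloc(n))
system.time(squares_vec(n))
```

Checking that the variants return the same values before timing them is a cheap safeguard against optimising code into a different answer.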

Presenter: Max Kuhn
Title: Recipes for Data Processing
Venue: P9
Target audience: The tutorial is for people who do feature engineering or need to include preprocessing with their models. From a technical standpoint, some experience in modeling and R is a good idea. Basic tidyverse syntax will be reviewed.
What to bring: The materials will be added a few days before the tutorial. Please install the required packages: 'AmesHousing', 'broom', 'kknn', 'recipes', 'rsample', 'tidyverse', 'yardstick', 'caret'. Notes are available at https://github.com/topepo/user2018

R has an excellent framework for specifying models using formulas. While elegant and useful, it was designed in a time when models had small numbers of terms and complex preprocessing of data was not commonplace. As such, it has some limitations. In this tutorial, a new package called `recipes` is shown where the specification of model terms and preprocessing steps can be enumerated sequentially. The recipe can be estimated and applied to any dataset. Current options include simple transformations (log, Box-Cox, interactions, dummy variables, ...), signal extraction (PCA, PLS, ICA, MDS, ...), basis functions (splines, polynomials, ...), imputation methods, and others. An example is used to demonstrate the functionality.
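The specify, estimate, apply sequence described above can be sketched roughly as follows. This is an illustrative example on the built-in `mtcars` data, not the example used in the tutorial; the choice of steps and the train/test split are assumptions, and the `new_data` argument assumes a reasonably recent version of recipes.

```r
library(recipes)

# Declare the outcome and predictors; nothing is computed yet.
rec <- recipe(mpg ~ disp + hp + wt, data = mtcars)

# Enumerate preprocessing steps in the order they should be applied.
rec <- step_log(rec, disp, hp)
rec <- step_center(rec, all_predictors())
rec <- step_scale(rec, all_predictors())
rec <- step_pca(rec, all_predictors(), num_comp = 2)

# Estimate the steps (means, standard deviations, loadings) from training rows.
train_rows <- 1:24
rec_trained <- prep(rec, training = mtcars[train_rows, ])

# Apply the trained recipe to rows that were not used for estimation.
processed <- bake(rec_trained, new_data = mtcars[-train_rows, ])

names(processed)
```

Because the statistics are estimated once in `prep()` and reused in `bake()`, the held-out rows are transformed with the training-set means, scales and loadings rather than their own.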

Presenter: Kevin Kuo
Title: Deep learning with TensorFlow and Keras
Venue: P11
Target audience: Anyone interested in deep learning
What to bring: TBA

We begin with a quick introduction of deep learning concepts, just enough to have a working vocabulary to facilitate construction of neural networks during the tutorial. The TensorFlow suite of R packages will be covered, including keras, tfestimators, and tfdatasets. Together with the participants, we build end-to-end workflows to perform classification and regression tasks using neural networks. We discuss the data pre-processing needs specific to neural network models, architectural choices, and best practices. Examples will be chosen to span a wide range of interests, including learning on structured data, time series, and unstructured text and image data.

Presenter: Johann Gagnon-Bartsch
Title: Looking to clean your data? Learn how to Remove Unwanted Variation with R
Venue: P8
Target audience: Data analysis, Statistics, Bioinformatics and Computational Biology
What to bring: TBA

High-dimensional data often suffer from unwanted variation; for example, gene expression data commonly contain batch effects, and fMRI data commonly suffer from various systematic errors as well. Removing this unwanted variation while preserving the true signal in the data is essential to deriving the right scientific conclusions. A major complication, however, is that the factors causing the unwanted variation are often unknown and must be inferred from the data. In this tutorial we present the RUV (remove unwanted variation) package. RUV methods cover a range of approaches for removing unwanted variation depending on the purpose of the study: differential expression analysis, global data normalisation and visualisation, or classification. We also demonstrate an R shiny application that provides an overview of the methods, along with interactive options for data visualisation and method diagnostics.

Presenter: Stephanie Kovalchik
Title: Statistical Models for Sport in R
Venue: P6
Target audience: Beginner to intermediate R users with an interest in sports
What to bring: Please bring a laptop with R installed and install the following libraries via CRAN: dplyr, tidyr, ggplot2, rvest, jsonlite, stringr, mgcv, rjags, BradleyTerry2, lubridate, pitchRx. These additional libraries should be installed from GitHub using devtools: RSelenium (https://github.com/ropensci/RSelenium), deuce (https://github.com/skoval/deuce). Part of the web scraping material will also require Docker, which you can install here: https://docs.docker.com/install/.

The workshop will cover a number of skills and statistical models that are common in sports statistics and show how each can be implemented in R. The workshop will introduce participants to a range of R packages and real sports examples.

Presenter: Dale Bryan-Brown and Brett Parker
Title: Spatial modelling using ‘raster’ package
Venue: P7
Target audience: People interested in spatial modelling using satellite images, or other raster data sets. Worked examples will revolve around environmental modelling.
What to bring: TBA

The topics covered in this workshop are outlined below. Times are rough estimates.

  • Introduction to using R as a GIS (5 minutes),
  • Introduction to point-type data (5 minutes),
  • Introduction to raster data (5 minutes),
  • Creating raster data (10 minutes),
  • Exploring the basic features of raster data (10 minutes),
  • Importing raster data into R (10 minutes),
  • Manipulating the extent, resolution, projection and values of raster data (45 minutes),
  • Visualising the data (15 minutes),
  • Summarising raster data using point data (45 minutes),
  • Using the summarised data in a simple generalised linear model (45 minutes).

Wednesday Morning 9:00-12:30 Break 10:30-11:00
Presenter: Julie Josse and Nick Tierney
Title: Missing values imputation
Venue: P10
Target audience: People who want to know more about how to deal with missing values in their analysis and which methods are available and implemented. Basic knowledge of PCA and linear models is required.
What to bring: For this tutorial, remember to come with your laptop, RStudio and the following packages installed: "VIM", "naniar", "missMDA", "Amelia", "mice", "missForest", "FactoMineR", "tidyverse". Slides, course notes, data sets, and Rmarkdown analyses will be available on my web page: http://juliejosse.com/teaching/

The ability to easily collect and gather a large amount of data from different sources can be seen as an opportunity to better understand many processes. It has already led to breakthroughs in several application areas. However, due to the wide heterogeneity of measurements and objectives, these large databases often exhibit an extraordinarily high number of missing values. Hence, in addition to scientific questions, such data also present some important methodological and technical challenges for data analysts. In this tutorial, we give an overview of the missing values literature as well as the recent improvements that caught the attention of the community due to their ability to handle large matrices with a large number of missing entries. We will illustrate the methods on medical, environmental and survey data.

Presenter: Carson Sievert
Title: Interactive data visualization on the web with R
Venue: Auditorium
Target audience: Anyone interested in interactive data visualization
What to bring: TBA

This tutorial teaches practical workflows for creating interactive web graphics which support common data analysis tasks. Through a series of examples, demos, exercises, and lecture, attendees will gain a foundation for navigating through common barriers of productivity associated with both the creation (e.g. start-up cost, iteration cost, dead-end cost) and distribution (e.g., deployment cost, scaling cost, latency cost) of interactive web graphics.

Presenter: Matteo Fasiolo
Title: Quantile Generalized Additive Models: moving beyond Gaussianity
Venue: P6
Target audience: The attendees should have a basic understanding of regression models and of the basic concepts underlying statistics and machine learning (e.g. probability densities, quantiles, etc).
What to bring: Please bring your own laptop, with either R version 3.4.4 or 3.5 installed. On macOS you will also need to install XQuartz. Please install the mgcViz package from CRAN and use devtools::install_github("mfasiolo/mgcFam") to install mgcFam. For some of the exercises you might also need the following packages from CRAN: languageR, gamair and e1071.

Generalized Additive Models (GAMs) are an extension of traditional parametric regression models, which have proved highly useful for both predictive and inferential purposes in a wide variety of scientific and commercial applications. One reason behind the popularity of GAMs is that they strike an interesting balance between flexibility and interpretability, while being able to handle large data sets. The mgcv R package is arguably the state-of-the-art tool for fitting such models, hence the first half of this tutorial will introduce GAMs and mgcv, in the context of electricity demand forecasting. The second part of the tutorial will show how traditional GAMs can be extended to quantile GAMs, and how the latter can be fitted using the qgam R package. By the end of the tutorial the attendees should be able to build, fit and visualize traditional or quantile GAMs, using a combination of the mgcv, qgam and mgcViz R packages. This tutorial is aimed at a broad audience of statistical modellers interested in using GAMs for predictive or inferential purposes. The models which will be presented in the tutorial have a very wide range of applicability, hence they should be of interest to practitioners in business intelligence, ecology, linguistics, epidemiology and geoscience, to name a few.
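For readers who have not fitted a GAM before, the following is a minimal sketch using mgcv on simulated data. The simulated "demand" series, its variable names and the choice of a cyclic smooth are assumptions made for illustration; this is not the electricity data set used in the tutorial, and the quantile extension with qgam is not shown.

```r
library(mgcv)  # ships with standard R distributions as a recommended package

set.seed(1)

# Simulated stand-in for a demand series: a smooth daily cycle plus noise.
n <- 500
hour <- runif(n, 0, 24)
demand <- 50 + 10 * sin(2 * pi * hour / 24) + rnorm(n, sd = 2)
dat <- data.frame(hour = hour, demand = demand)

# Fit a GAM with a cyclic cubic spline so the effect matches at 0h and 24h.
fit <- gam(demand ~ s(hour, bs = "cc"), data = dat,
           knots = list(hour = c(0, 24)), method = "REML")

summary(fit)

# Predict mean demand at 6:00 and 18:00 (the simulated peak and trough).
pred <- predict(fit, newdata = data.frame(hour = c(6, 18)))
pred
```

The smooth term `s(hour, bs = "cc")` lets the data determine the shape of the daily cycle, which is the flexibility and interpretability trade-off the abstract refers to.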

Presenter: Maria Prokofieva
Title: Follow Me: Introduction to social media analysis in R
Venue: P7
Target audience: The tutorial is aimed at a broad range of participants from various backgrounds (business, academics, etc.)
What to bring: TBA

The tutorial will review a range of R packages for social media analysis and will aim at teaching general principles of working with social media platforms and analysing the information there. The social media platforms covered are Facebook, Twitter, Instagram and YouTube. Topics covered during the tutorial include:

1. Structure of social media data (e.g. user-related data, posting-related data, hashtags)
2. Benefits and challenges of working with social media data (textual/non-textual information, large data volume, API limitations)
3. Connecting to a social media platform (e.g. authentication) and downloading data
4. Data analysis of the profile information (e.g. followers, likes, dislikes, favorites - platform dependent)
5. Data analysis of textual information (e.g. user posts, comments, dynamics, sentiment analysis, word clouds, etc.)
6. Visualisation of social media data

Presenter: Tong He
Title: xgboost and MXNet
Venue: P11
Target audience: TBA
What to bring: A laptop running OSX/Windows/Linux with a recent R release, preferably the latest R 3.5.0, and the R packages xgboost and mxnet. mxnet installation is a bit tricky; see details for [linux](https://mxnet.incubator.apache.org/install/index.html?platform=Linux&language=R&processor=CPU), [OSX](https://mxnet.incubator.apache.org/install/index.html?platform=MacOS&language=R&processor=CPU), [Windows](https://mxnet.incubator.apache.org/install/index.html?platform=Windows&language=R&processor=CPU)

TBA

Presenter: Dirk Eddelbuettel
Title: Extending R with C++: Motivation, Introduction and Examples
Venue: P8
Target audience: Beginning to intermediate users of R who want to go further and farther
What to bring: Now, Rcpp is a fairly big topic, and it requires a working compiler setup. This tends to be somewhere between easier-on-some and more-tedious-on-other systems, with Windows arguably the most difficult. We say a bit more about this in the Rcpp FAQ [1] -- and we do not need more than R itself needs when compiling packages with C/C++/Fortran code. I have found in the past that I cannot simply assume /everybody/ gets this without help, so I tend to do a bit of 'lecture' and hands-on exercise. We will see if I manage to shift the balance a little. So if you feel adventurous and want to take this on, I recommend:

(1) a decent editor and environment; RStudio fits the bill for most people
(2) a working compiler: Linux and macOS generally have it (though macOS keeps changing, and I don't use it myself so reach out to resources such as
(3) CRAN packages Rcpp (of course) and RcppArmadillo

A simple test to see if you are good is to use Rcpp::evalCpp() on an expression:

R> Rcpp::evalCpp("6 * 7")
[1] 42

This actually creates a minuscule C++ routine around the expression, and will only show the expected result if the setup is working. When things fail, the RStudio IDE tends to show a few hints, so try that. If all this sounds insurmountable, do not despair. I still recommend the tutorial, as I think we should find time to bend your laptop to do all this during conference breaks.

Rcpp has become the principal venue for extending R with compiled code. It makes it easy to extend R with C or C++, spanning the range from simple one-liners to larger routines and bindings of entire external libraries. We will motivate and introduce Rcpp as a natural extension to R that provides an easy-to-use and powerful interface. Helper functions and tools including RStudio will be used to ease the creation of R extensions. Several examples will introduce basic use cases, including writing code with RcppArmadillo, which is the most widely-used package on top of Rcpp. This provides a natural bridge to the more recent RcppMLPACK package (which combines the MLPACK machine learning library with the Armadillo linear algebra library), from which we will study one or two examples.

Presenter: Charles Gray
Title: Are you R Curious? (** This is a FREE tutorial.)
Venue: P9
Target audience: Beginning users
What to bring: Your laptop and enthusiasm (installation guide is on https://github.com/softloud/rcurious/blob/master/explore/onboarding.Rmd )

Been meaning to quit Excel and learn R for ages but not managed to find the time? Or maybe you are just not quite sure what this R thing that everyone is talking about is? This is the workshop for you. Or, have you had someone say, “It’s easy to use R. Just type R.” or “Oh, I just use the lm() function.” and thought, huh? Sometimes R users can trivialise the process of getting started. In this workshop, we aim to equip new R users with the confidence to problem-solve their way through getting set up with R and RStudio, and importing and exploring data in R. Developed through discussions amongst RLadies who also teach and communicate, the workshop visits the biggest potential pitfalls of your first data analysis in R, from installation issues with packages, to different data structures, to beginning exploratory data analysis and visualisation in R.