Talk Schedule

The talks will take place on 11-13 July 2018 (click on a talk to see its abstract). A datatable version is provided here if you're looking for an easier-to-search, R-oriented format. Information for presenters is here.

Time Session Presenter Venue Title Keywords Chair Slides
13:45 Keynote Steph de Silva AUD Beyond syntax, towards culture: the potentiality of deep open source communities NA Di Cook click here
For over twenty years, R has been a programming language under development. In that time a collection of open source communities have sprung up around it. These communities have commonalities that are developing into a distinct programming subculture. The existence of a common subculture connecting these communities is important for two reasons: the power to create value and the potential to champion values.
15:30 Applications in society Richard Layton P8 Data, methods, and metrics for studying student persistence applications, community/education, persistence metrics, intersectionality, longitudinal Jessie Roberts click here
This paper introduces R users to data and tools for investigating undergraduate persistence metrics using the midfieldr package and its associated data packages. The data are the student records (registrar's data) of approximately 200,000 undergraduates at US institutions from 1990 to 2016. midfieldr provides functions for determining persistence metrics such as graduation rates and for grouping and displaying findings by program, institution, race/ethnicity, and sex. These packages provide an entry to this type of intersectional research for anyone with basic proficiency in R and familiarity with packages from the tidyverse. The goal of the paper is to introduce the packages and to share our data, methods, and metrics for intersectional research in student persistence.
15:50 Applications in society Maria Holcekova P8 The dynamic approach to inequality: Using longitudinal trajectories of young women and their parents in determining their socio-economic positions within the contemporary Western society visualisation, clustering, imputation, longitudinal data analysis Jessie Roberts NA
Intensified globalisation and the ensuing increased affluence of Western populations have changed the composition of the traditional social class system in England. This does not imply the disappearance of socio-economic (SE) classes and inequalities, but rather their redefinition. Unfortunately, limited research has considered the dynamic nature of SE positions, especially in understanding youth transitions from parental to personal SE classes. I address this problem using nationally representative longitudinal data from the Next Steps 1990 youth cohort study in England. Firstly, I explore parental transition patterns using longCatPlot. Secondly, I visualise missing data with the missmap function in Amelia and impute these values using random forests in missForest. Thirdly, I employ the daisy function in the cluster package to create SE groups based on Gower distance, partitioning around medoids, and silhouette width. Finally, I visualise these results using ggplot2. In doing so, I establish five distinct SE groups of young women, which contributes to the understanding of new forms of inequality, and I discuss the implications in terms of access to educational and labour market resources.
16:10 Applications in society Frank C.S. Liu P8 The Second Wing of Polls: How Multiple Correspondence Analysis using R Advances Exploring Associated Attitudes in Smaller-Data applications Jessie Roberts NA
Polls and surveys have been used to better forecast voter preferences and to understand consumer behavior. Academically, we employ the strength of these smaller but representative data to confirm theory, including identifying associations between theoretically identified variables. However, researchers who want to explore new patterns for better understanding voters' behavior and attitudes are hardly satisfied by the current practice of survey data analysis. While we turn to bigger data, little attention has been paid to the value of such smaller data and their potential to achieve the same goal. This talk will demonstrate how the "FactoMineR" package for R assists the exploration of associated concepts, attitudes and patterns that could not be identified by theories in the first place. Implications for the practice of survey data collection and MCA's connection to association rules mining will be discussed.
16:30 Applications in society Meryam Krit P8 Modelling Rift Valley Fever models, applications, community/education Jessie Roberts click here
Rift Valley Fever (RVF) is one of the major viral zoonoses in Africa, affecting man and several domestic animal species. The epidemics generally involve a 5-15 year cycle marked by abnormally high rainfall (the El Niño/Southern Oscillation (ENSO) phenomenon), but there is more and more evidence of inter-epidemic transmission. A flexible model describing RVF transmission dynamics in six species (human, domestic animal, four vectors) in three different areas will be presented. The model allows for migration, flooding, variation in climate, seasonal effects on vector egg hatching, transhumance, alternative wildlife hosts and increased susceptibility of animals. A user-friendly shiny interface and an optimized Rcpp implementation allow epidemiological researchers to study different scenarios and adapt the model to other situations. Application of the model to the specific situations in Tanzania and Algeria will be discussed.
15:30 Big data Miguel Gonzalez-Fierro P9 Spark on demand with AZTK big data Max Kuhn NA
Apache Spark has become the technology of choice for big data engineering. However, provisioning and maintaining Spark clusters can be challenging and expensive. To address this issue, Microsoft has developed the Azure Distributed Data Engineering Toolkit (AZTK). This talk describes how AZTK Spark clusters can be provisioned in the cloud from a local machine with just a few commands. The clusters are ready to use in under 5 minutes and come with R and RStudio Server pre-installed, allowing R users to start developing Spark applications immediately. Users can apply their own Docker image to customize the Spark environment. AZTK clusters, composed of low-priority Azure virtual machines, can be created on demand and run only as needed, allowing for large cost savings. We will show a short demo of how the pre-installed sparklyr package can be used to perform data engineering tasks using dplyr syntax, and machine learning using the Spark MLlib library.
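To make this concrete, here is a minimal sketch (not the presenters' code) of driving a Spark cluster from R with sparklyr and dplyr once it has been provisioned; the master URL is a placeholder for whatever AZTK reports.

```r
# Minimal sketch, assuming an AZTK-provisioned cluster is reachable at the
# placeholder master URL below and that sparklyr is installed locally.
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "spark://<aztk-master-host>:7077")

# Data engineering with dplyr verbs, executed by Spark
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)
mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg), n = n()) %>%
  collect()

# Machine learning via Spark MLlib
fit <- ml_linear_regression(mtcars_tbl, mpg ~ wt + cyl)
summary(fit)

spark_disconnect(sc)
```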
15:50 Big data Benjamin Ortiz Ulloa P9 Graphs: Datastructures to Query algorithms, models, databases, networks, text analysis/NLP, big data Max Kuhn NA
When people think of graphs, they often think about mapping out social media connections. While graphs are indeed useful for mapping out social networks, they have many other practical applications. Data in the real world resemble vertices and edges more than they resemble rows and columns. This allows researchers to intuitively grasp the data modeled and stored within a graph. Graph exploration -- also known as graph traversal -- is traditionally done with a traversal language such as Gremlin or Cypher. The functionality of these traversal languages can be duplicated by combining the igraph and magrittr packages. Traversing a graph in R gives useRs access to a myriad of simple, but powerful algorithms to explore their data sets. This talk will show why data should be explored as a graph as well as show how a graph can be traversed in R. I will do this by going through a survey of different graph traversal techniques and by showing the code patterns necessary for each of those techniques.
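For flavour, here is a minimal sketch on a toy edge list (not the speaker's data) of a Gremlin-style, pipeable traversal built from igraph and magrittr.

```r
# Toy example: emulate "start at alice, walk outward two hops" with igraph
library(igraph)
library(magrittr)

edges <- data.frame(
  from = c("alice", "alice", "bob",  "carol"),
  to   = c("bob",   "carol", "dave", "dave")
)
g <- graph_from_data_frame(edges, directed = TRUE)

reached <- g %>%
  ego(order = 2, nodes = "alice", mode = "out") %>%  # two-hop neighbourhood
  .[[1]]                                             # vertex sequence

reached$name  # names of the vertices reached from "alice"
```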
16:10 Big data Amy Stringer P9 Automated Visualisations for Big Data visualisation, reproducibility, big data Max Kuhn NA
The Catlin Seaview project is a large scale reef survey for estimating coral cover at various locations around the world. Upon re-surveying, it is possible to track changes in, and predict the future condition of, these reefs over time. The survey collects hundreds of thousands of images from 2km transects of reef, which are then sent to a neural network for automatic annotation of reef communities. Annotations are completed in such a way that the resulting data have hierarchical spatial scales; going up from image, to transect, to reef, to subregion, to region. Here, we present an efficient method for extracting, summarising and visualising the big and complex data with Rmarkdown, dplyr and ggplot2. The use of Rmarkdown for report generation allows for the introduction of parameters into the construction of the document, allowing for entirely unique reports to be developed from the one source script. This approach has resulted in a system for compiling 22 reproducible reports, extracting, summarising and visualising data at multiple spatial scales, from over 600 000 images, in a matter of minutes; leaving machines to do the work so that people have time to think.
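The parameterised-report pattern mentioned above can be sketched as follows; report.Rmd and the reef IDs are hypothetical stand-ins for the project's real source script and data.

```r
# One parameterised source document, many reports.
# report.Rmd (hypothetical) would declare the parameter in its YAML header:
#   params:
#     reef: "default-reef"
# and filter/summarise its dplyr + ggplot2 code by params$reef.
library(rmarkdown)

reefs <- c("reef-01", "reef-02", "reef-03")   # placeholder reef identifiers
for (r in reefs) {
  render("report.Rmd",
         params      = list(reef = r),
         output_file = paste0("report-", r, ".html"))
}
```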
16:30 Big data Snehalata Huzurbazar P9 Visualizations to guide dimension reduction for sparse high-dimensional data visualisation Max Kuhn NA
Dimension reduction for high-dimensional data is necessary for descriptive data analysis. Most researchers restrict themselves to visualizing 2 or 3 dimensions; however, to understand relationships between many variables in high-dimensional data, more dimensions are needed. This talk presents several new options for visualizing beyond 3D, illustrated using 16S rRNA microbiome data. We will show intensity plots developed to highlight the changing contributions of taxa (or subjects) as the number of principal components of the dimension reduction or ordination method is changed. We also revive Andrews curves, connected with a tour algorithm for viewing 1D projections of multiple principal components, to study group behavior in the high-dimensional data. The plots provide a quick visualization of taxa/subjects that are close to the 'center' or that contribute to dissimilarity. They also allow for exploration of patterns among related subjects or taxa not seen in other visualizations. All code is written in R and available on GitHub.
15:30 Stat methods for high-dim biology Florian Rohart P10 mixOmics: An R package for 'omics feature selection and multiple data integration data mining, applications, bioinformatics, multivariate, big data Julie Josse NA
The mixOmics R package contains a suite of multivariate methods that model molecular features holistically and statistically integrate diverse types of data (e.g. 'omics data such as transcriptomics, proteomics, metabolomics) to offer an insightful picture of a biological system. Our two latest frameworks for data integration are N-integration with DIABLO, which combines different 'omics datasets measured on the same N samples or individuals, and P-integration with MINT, which combines studies measured on the same P features (e.g., genes) but from independent cohorts of individuals. Both frameworks are introduced in a discriminative context for the identification of relevant and robust molecular signatures across multiple data sets. mixOmics is a well-designed, user-friendly package with attractive graphical outputs. It represents a significant contribution to the field of computational biology, which has a strong need for such toolkits to mine and integrate datasets.
15:50 Stat methods for high-dim biology Claus Ekstrøm P10 Using mommix for fast, large-scale genome-studies in the presence of gene-environment and gene-gene interaction algorithms, models, bioinformatics, big data Julie Josse NA
The majority of disorders and outcomes analysed in genome-wide association studies are believed to be multi-factorial and influenced by gene-environment (GxE) interactions, gene-gene (GxG) interactions, or both. However, including GxE or GxG increases the computational burden by several orders of magnitude, which makes the inclusion of interactions prohibitively cumbersome. Finite mixtures of regression models provide a flexible modeling framework for many phenomena. Using moment-based estimation of the regression parameters, we develop unbiased estimators with a minimum of assumptions on the mixture components. In particular, only the average regression model for one of the components in the mixture model is needed, and no requirements are placed on any of the distributions. We present a new R package, mommix, for moment-based mixtures of regression models, which implements this new approach for regression mixtures. We illustrate the use of the moment-based mixture of regression models with an application to genome-wide association analysis, and show that the implementation is fast, which makes large-scale genetic analysis with gene-environment and gene-gene interactions feasible.
16:10 Stat methods for high-dim biology Jacob Bergstedt P10 Quantifying the immune system with the MMI package models, data mining, applications, reproducibility, bioinformatics, interfaces Julie Josse NA
The blood composition of immune cells provides a key indicator of human health and disease. To identify the sources of variation in this composition, we combined standardized flow cytometry and a questionnaire investigating demographical factors in 816 French individuals. The study is published in the Nature Immunology article "Natural variation in innate immune cell parameters is preferentially driven by genetic factors". To facilitate the study, we developed the R package MMI (https://github.com/jacobbergstedt/mmi), which defines a framework to specify a family of models. Operations are implemented for models in the family, such as performing tests, computing confidence intervals or AIC measures and investigating residuals, the results of which are collected in a MapReduce-like pattern. The software keeps track of variables, parameter transformations, multiple testing and selective inference adjustments. With the package we release the dataset of 816 observations of 166 immune cell parameters and 44 demographical variables. We hope that this resource can be used to generate hypotheses in immunology, but also be of benefit to the broader community, in education and benchmarking.
16:30 Stat methods for high-dim biology Rudradev Sengupta P10 High Performance Computing Using R for High Dimensional Surrogacy Applications in Drug Development models, data mining, applications, bioinformatics, performance, big data Julie Josse NA
Identification of genetic biomarkers is a primary data analysis task in the context of drug discovery experiments. These experiments consist of several high dimensional datasets which contain information about a set of new drugs under development. This type of data structure introduces the challenge of multi-source data integration, which is needed in order to identify the biological pathways related to the new set of drugs under development. In order to process all the information contained in the datasets, high performance computing techniques are required. Currently available R packages for parallel computing are not optimized for this specific setting and data structure. We propose a new "master-slave" framework for parallel data analysis with R on a computer cluster. The proposed data analysis workflow is applied to a multi-source high dimensional drug discovery dataset, and a performance comparison is made between the new framework and existing R packages for parallel computing. Different configuration settings for parallel programming in R are presented to show that the computation time, for the specific application under consideration, can be reduced by 534.62%.
15:30 Robust methods Kasey Jones P6 rollmatch: An R Package for Rolling Entry Matching algorithms, models Adam Sparks NA
The gold standard of experimental research is the randomized controlled trial. However, many healthcare interventions are implemented without a randomized control group for practical or ethical reasons. Propensity score matching (PSM) is a popular method for approximating a randomized experiment from observational data by matching members of a treatment group to similar candidates of a control group that did not receive the intervention. However, traditional PSM is not designed for studies that enroll participants on a rolling basis, a common practice in healthcare interventions where delaying treatment may impact patient health. Rolling Entry Matching (REM) is a new matching method that addresses the rolling entry problem by selecting comparison group members who are similar to intervention members with respect to both static, unchanging characteristics (e.g., race, DOB) and dynamic characteristics that change over time (e.g., health conditions, health care use). This presentation will introduce both REM and rollmatch, an R package for performing REM to assess rolling entry interventions.
15:50 Robust methods Charles T. Gray P6 varameta': Meta-analysis of medians algorithms, models, applications, reproducibility Adam Sparks NA
Meta-analyses bring together summary statistics from multiple sources, which are reported in various ways. In this talk I will introduce the `varameta` package, which will provide an underlying (and reproducible) framework for understanding skewed meta-analysis data and reporting. The `varameta` package accompanies a couple of theoretical papers I am working on about the meta-analysis of medians. This package is also designed to be an adjunct to the well-established conventional `metafor` package. In this package I have collated the existing techniques for meta-analysing skewed data reported as medians and interquartile ranges (or ranges). The `varameta` package will also include reproducible simulation documentation (in .Rmd) of existing methods in meta-analysis, benchmarked against our proposed estimator for the standard error of the sample median. In this talk I will demonstrate the package and the web interface for clinicians, as well as how it can be implemented in everyday systematic reviews.
16:10 Robust methods Sevvandi Kandanaarachchi P6 Does normalizing your data affect outlier detection? algorithms, Data pre-processing Adam Sparks click here
It is common practice to normalize data before using an outlier detection method. But which method should we use to normalize the data? Does it matter? The short answer is yes, it does. The choice of normalization method may increase or decrease the effectiveness of an outlier detection method on a given dataset. In this talk we investigate this triangular relationship between datasets, normalization methods and outlier detection methods.
16:30 Robust methods Priyanga Dilini Talagala P6 oddstream and stray: Anomaly Detection in Streaming Temporal Data with R algorithms, space/time, multivariate, streaming data, outlier detection Adam Sparks NA
This work introduces two R packages, oddstream and stray, for detecting anomalous series within a large collection of time series in the context of non-stationary streaming data. In `oddstream` we define an anomaly as an observation that is very unlikely given the recent distribution of a given system. This package provides a framework for early detection of anomalous behaviour within a large collection of streaming time series, including a novel approach that adapts to non-stationarity. In `stray` we define an anomaly as an observation that deviates markedly from the majority with a large distance gap. This package provides a framework to detect anomalies in high dimensional data; the framework is then extended to identify anomalies in streaming temporal data. The proposed algorithms use time series features as inputs, and approaches based on extreme value theory for the model building process. Using various synthetic and real datasets, we demonstrate the wide applicability and usefulness of our proposed frameworks. We show that the proposed algorithms can work well in the presence of noisy non-stationary data within multiple classes of time series.
15:30 Reproducibility John Blischak AUD The workflowr R package: a framework for reproducible and collaborative data science reproducibility Scott Came click here
The workflowr R package helps scientists organize their research in a way that promotes effective project management, reproducibility, collaboration, and sharing of results. workflowr combines literate programming (knitr and rmarkdown) and version control (Git, via git2r) to generate a website containing time-stamped, versioned, and documented results. Any R user can quickly and easily adopt workflowr, which includes four key features: (1) workflowr automatically creates a directory structure for organizing data, code, and results; (2) workflowr uses the version control system Git to track different versions of the code and results without the user needing to understand Git syntax; (3) to support reproducibility, workflowr automatically includes code version information in webpages displaying results; and (4) workflowr facilitates online web hosting (e.g. GitHub Pages) to share results. Our goal is that workflowr will make it easier for scientists to organize and communicate reproducible research results. Documentation and source code are available at https://github.com/jdblischak/workflowr.
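A minimal sketch of that workflow using the package's core commands (paths and file names are illustrative):

```r
# Sketch of a typical workflowr session; see the package docs for details.
library(workflowr)

wflow_start("myproject")  # 1. directory structure + Git repo + site skeleton

wflow_build()             # 2. render the R Markdown analyses into the local site

# 3. commit code and results, and regenerate versioned, time-stamped pages
wflow_publish(c("analysis/index.Rmd", "analysis/first-analysis.Rmd"),
              message = "Add first analysis")
```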
15:50 Reproducibility Peter Baker AUD Efficient data analysis and reporting: DRY workflows in R applications, reproducibility Scott Came NA
When analysing data for different projects, do you often find yourself repeating the same steps? Typically, these steps follow a familiar pattern of reading, cleaning, summarising, plotting and analysing data then producing a report. To aid reproducibility, naive examples using Rmarkdown are often presented. However, I routinely employ a modular approach combining GNU Make, R, Rmarkdown and/or Sweave files tracked under git. This system helps to implement a don't repeat yourself (DRY) approach and scales up well as projects become more complex. To aid automation, I have developed generic R, Rmarkdown, STATA, SAS and other pattern rules for GNU Make as well as R packages to generate a project skeleton consisting of initial directories, Makefiles, R syntax files for basic data cleaning and summaries; move data files and documents to standard directories; use codebook information to specify factors and check data; and finally initialise and add these to a local git repository. Comparisons will be made with alternate approaches such as ProjectTemplate and drake. GNU Make pattern rules and R software are available at https://github.com/petebaker.
16:10 Reproducibility Filip Krikava AUD Automated unit test generation using genthat reproducibility, testing Scott Came click here
Your package has examples and vignettes of its overall functionality but no unit tests for individual functions. Writing those is no fun. Yet, when something goes wrong, unit tests are your best tool to quickly pinpoint errors. The genthat package can generate unit tests for you in the popular testthat format. Moreover, it can also be used to create reproductions when you find a bug in someone else's code: there, instead of generating passing test cases, it will generate the smallest, purposefully failing, one. Genthat does not magically create new tests out of the blue; instead it simply extracts the smallest possible test fragments from existing code. It does that by recording the input arguments and return values of all functions called by clients of your package. The generated tests concentrate on single functions and test them independently of each other. Therefore a failing test usually locates the error more precisely than a failing chunk of application code. Trying it out on a random set of 1500 CRAN packages, genthat managed to reproduce 80% of all function calls, increasing the unit test coverage from 19% to 54%. In this talk we present genthat and discuss testing R code.
16:30 Reproducibility Dan Wilson AUD Practical R Workflows reproducibility, workflow Scott Came click here
Learn how R can be used to create reproducible workflows for practical use in business. As analysts and data scientists we often need to repeat our work time and time again. Sometimes this will be the exact same task, other times it may be a slight variation for another client or stakeholder. This talk will demonstrate a real-world set of workflows established at The Data Collective designed to reduce the amount of copy/paste type actions to a few function calls that get the repetitive actions out of the way, so you can focus on the important parts of your job. Find out how to overcome the challenges of a repeatable workflow and make your life easier.
15:30 Spatial data and modeling Matt Moores P7 bayesImageS: an R package for Bayesian image analysis algorithms, applications, space/time Dale Bryan-Brown NA
There are many approaches to Bayesian computation with intractable likelihoods, including the exchange algorithm, approximate Bayesian computation (ABC), thermodynamic integration, and composite likelihood. These approaches vary in accuracy as well as scalability for datasets of significant size. The Potts model is an example where such methods are required, due to its intractable normalising constant. This model is a type of Markov random field, which is commonly used for image segmentation. The dimension of its parameter space increases linearly with the number of pixels in the image, making this a challenging application for scalable Bayesian computation. My talk will introduce various algorithms in the context of the Potts model and describe their implementation in C++, using OpenMP for parallelism. I will also discuss the process of releasing this software as an open source R package on the CRAN repository.
15:50 Spatial data and modeling Jin Li P7 A new R package for spatial predictive modelling: spm models, data mining, reproducibility, space/time, performance, spatial predictive models; hybrid methods of geostatistics and machine learning; model selection and validation; predictive accuracy Dale Bryan-Brown click here
Accuracy of spatial predictions is crucial for evidence-informed environmental management and conservation. Improving the accuracy by identifying the most accurate predictive model is essential, but also challenging as the accuracy is affected by multiple factors. Recently developed hybrid methods of machine learning methods and geostatistics have shown their advantages in spatial predictive modelling in environmental sciences, with significantly improved predictive accuracy. An R package, ‘spm: Spatial Predictive Modelling’, has been developed to introduce these methods and recently released for R users. This presentation will briefly introduce spm, including: 1) spatial predictive methods, 2) new hybrid methods of geostatistical and machine learning methods, 3) assessment of predictive accuracy, 4) applications of spatial predictive models, and 5) relevant functions in spm. It will then demonstrate how to apply some functions in spm to relevant datasets and to show the resultant improvements in predictive accuracy and modelling efficiency. Although in this presentation, spm is applied to data in environmental sciences, it can also be applied to data in other relevant disciplines.
16:10 Spatial data and modeling Daniel Fryer P7 rcosmo: Statistical Analysis of the Cosmic Microwave Background visualisation, databases, space/time, big data, new R package Dale Bryan-Brown NA
The Cosmic Microwave Background (CMB) is remnant electromagnetic radiation from the epoch of recombination. It is the most ancient and important source of data about the early universe and the key to unlocking the mysteries of the Big Bang and the structure of time and space. Spurred on by a wealth of satellite data, intensive investigations in the past few years have resulted in many physical and mathematical results to characterise CMB radiation. It can be modelled as a realisation of a homogeneous Gaussian random field on the sphere. But what does any of this matter for statisticians if they cannot play with the CMB data in their favourite programming language? A new R package, rcosmo, provides easy access to the CMB data and various tools for exploring geometric and statistical properties of the CMB. This talk will be a quick introduction to rcosmo by one of its developers, followed by an invitation for discussions and suggestions. This research was supported under the Australian Research Council's Discovery Project DP160101366.
16:30 Spatial data and modeling Marek Rogala P7 Using deep learning on Satellite imagery to get a business edge visualisation, algorithms, models, applications, web app, Satellite data Dale Bryan-Brown NA
The talk is about new possibilities arising from applying deep learning to satellite imagery. Satellite data changes the game: it gives businesses access to information they cannot otherwise reach today and lets them travel back in time. Combined with deep learning techniques, it delivers unique insights that have never been available before. Using deep learning on satellite data can deliver insights no human can. Satellite data is huge and non-obvious. By being able to go back to an arbitrary time in history we can prevent fraud. We can build forecasts and observe events we wouldn't have access to otherwise. We'll explore a number of emerging use cases and the common traits behind them. I will show how our R department works with satellite data and how we use Shiny to build decision support systems for business. As an example of my previous talks, here is a link to my talk at useR! Brussels 2017: https://channel9.msdn.com/Events/useR-international-R-User-conferences/useR-International-R-User-2017-Conference/shinycollections-Google-Docs-like-live-collaboration-in-Shiny
Time Session Presenter Venue Title Keywords Chair Slides
9:05 Keynote Thomas Lin Pedersen AUD The Grammar of Animation NA Rob J Hyndman click here
In the world of data visualisation, much work has been put into defining a grammar for both static and interactive graphics. These efforts have often been coupled to the development of visualisation frameworks where the grammar has been reflected in the API design. Less attention has been devoted to a grammar of animation, and subsequently animation frameworks have often missed the breadth and composability that are the hallmark of grammar-driven visualisation frameworks. In this talk I will justify and present a grammar of animation and position it in relation to the grammars of graphics and interactivity, thus creating a clear division of responsibility between the three domains. I will present an R implementation of the grammar of animation which builds on top of the ggplot2 framework and is made available as the gganimate package. Using examples with gganimate, I'll show how the proposed grammar can be used to break down, and reason about, animated data visualisation, and how the grammar can succinctly describe very diverse animation operations.
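For flavour, a minimal sketch (a stock mtcars example, not from the talk) of how gganimate's grammar verbs layer onto a ggplot2 specification:

```r
# Toy example: an animation described as ggplot2 layers plus gganimate verbs
library(ggplot2)
library(gganimate)

p <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  transition_states(cyl, transition_length = 2, state_length = 1) +
  ease_aes("cubic-in-out") +
  labs(title = "Cylinders: {closest_state}")

animate(p)  # render the animation
```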
10:30 Applications in health and environment Mark Padgham P8 tRansport tools for the World (Health Organization) applications, reproducibility, community/education, space/time, big data Paula Andrea NA
The World Health Organization (WHO) contracted us to provide actionable evidence for the redesign of urban transport policies to help rather than hinder human health. That means more active transport. Designing cost-effective policies to get people walking and cycling requires insight into where, when, how, and why people currently travel. This is challenging, especially in cities with limited resources, data, or analysis capabilities. We briefly describe some technical details of our 'Active Transport Toolkit' (ATT), but the primary focus will be the context that led to the WHO contract and where we plan to go next. We argue that useRs are well-placed to provide openly available, global-scale, transparent tools for policy making. It was the flexibility of the R language and the supportiveness of its community - notably including rOpenSci, which hosts two of our packages - that enabled us to develop the ATT in a way that makes it flexible enough to capture cities' unique characteristics while providing a consistent user interface. The talk will conclude with an outline of lessons learned from the perspective of others wanting to create R tools to inform policy.
10:50 Applications in health and environment Philip Dyer P8 Models of global marine biodiversity: an exercise in mixing R markdown, parallel processing and caching on supercomputers models, applications, reproducibility, performance, big data Paula Andrea click here
R has become the standard language in ecology for statistics and modelling. If a technique has been published in mathematical ecology, it has an R package. Even the data sets have an R package! The size of data sets in ecology has been growing to the point where global analysis of ecological data can be considered. At the same time, powerful statistical techniques that rely on randomly permuting the data, such as bootstrapping, have become more popular. These are exciting times, but how do we get R to process our large data sets with computationally expensive algorithms without waiting forever for results? For those new to R, or at least new to big data in R, I have some tips, techniques and packages to help you get going. I have benefited from using R Markdown and knitr to make short transcript files. I have also made use of caching to avoid recalculating big models, and of parallel processing to calculate the models faster in the first place.
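Two of those tips in a minimal, illustrative sketch: chunk-level caching in R Markdown and parallelising resampling with the parallel package (the slow model call in (a) is hypothetical).

```r
# (a) In an R Markdown chunk header, cache a slow fit so it is not re-run
#     on every knit:
#       {r big_model, cache = TRUE}
#       fit <- slow_model_fit(big_data)   # hypothetical expensive call
#
# (b) Run bootstrap replicates in parallel (mclapply forks; on Windows use
#     parLapply with a cluster instead):
library(parallel)

x <- rnorm(1e4)
boot_median <- function(i, x) median(sample(x, replace = TRUE))

res <- mclapply(seq_len(1000), boot_median, x = x,
                mc.cores = max(1L, detectCores() - 1L))
quantile(unlist(res), c(0.025, 0.975))
```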
11:10 Applications in health and environment Chris Hansen P8 Enabling Analysts: Embracing R in a National Statistics Office Official Statistics Paula Andrea NA
Stats NZ has recently adopted R as an approved analytical tool, and more recently for use in production of official outputs. Since adoption, R has had significant uptake, and has been a great enabler for analysts. R is more expressive and flexible than the existing tools, allowing them to more easily solve a variety of problems. R is deployed on powerful servers, so users have a generous supply of memory and cores, meaning large datasets can be handled and long-running computations parallelised. Analysts access R using RStudio Server, and this IDE itself has had a number of positive impacts--the use of RStudio projects and R Markdown documents in particular helps analysts work in a more organised way, and ensures work is reproducible. Our statistical platforms can now also use R. This is done via OpenCPU, which enables remote execution of functions via an HTTP API. That is, OpenCPU can be used to call functions in internally developed packages as web services. This has proven useful as we transition to a more service-oriented architecture. In this talk we describe the R environment at Stats NZ, its implications for analysts, and provide examples of its use in practice.
11:30 Applications in health and environment Tracy Huang P8 Developing an Uncertainty Toolbox for Agriculture: a closer look at Sensitivity Analysis visualisation, applications, web app, space/time, big data, R6 and Reference Classes Paula Andrea NA
Digiscape is one of 8 Future Science Platforms in CSIRO focussed on delivering new analytics in the digital age to better inform agricultural systems in the face of uncertainty. The Uncertainty Toolbox is one of 15 projects within Digiscape trying to make a difference to the way models are interpreted, reported and communicated in practice for decision-making. Uncertainty is front and centre of every modelling problem but it is sometimes difficult to quantify and challenging to communicate. The Sensitivity Analysis workflow focuses on developing a general framework for sensitivity analysis to inform the modeller about key parameters of interest and refine the model so it can be used in a robust way to make predictions and forecasts with uncertainties. We focus on methods applicable to large-scale, non-monotonic problems, developing variance-based approaches to sensitivity analysis using emulators. As such, the framework for developing this workflow in R becomes important for transparency and usability. We will outline the design steps for constructing this workflow using the latest object-oriented systems available in R and give a demonstration of the tool using Shiny.
10:30 Models and methods for biology and beyond Zachary Foster P9 Taxa and metacoder: R packages for parsing, visualization, and manipulation of taxonomic data visualisation, data mining, applications, databases, bioinformatics, Taxonomy Anna Quaglieri NA
Modern microbiome research is producing datasets that are difficult to manipulate and visualize due to the hierarchical nature of taxonomic classifications. The “taxa” package provides a set of classes for the storage and manipulation of taxonomic data. Classes range from simple building blocks to project-level objects storing multiple user-defined datasets mapped to a taxonomy. It includes parsers that can read in taxonomic information in nearly any form. It also provides functions modeled after dplyr for manipulating a taxonomy and associated datasets such that hierarchical relationships between taxa as well as mappings between taxa and data are preserved. We hope taxa will provide a basis for an ecosystem of compatible packages. We have also developed the metacoder package for visualizing hierarchical data. Metacoder implements a novel visualization called heat trees that use the color and size of nodes and edges on a taxonomic tree to quantitatively depict up to 4 statistics. This allows for rapid exploration of data and information-dense, publication-quality graphics. This is an alternative to the stacked barcharts typically used in microbiome research.
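A minimal sketch (tiny made-up classifications, nothing like microbiome scale) of parsing a taxonomy and drawing a heat tree; argument details may differ across package versions.

```r
# Toy example: parse classification strings, then draw a heat tree whose
# node size and colour encode how many observations sit under each taxon.
library(metacoder)   # builds on the taxa classes and parsers

x <- data.frame(
  classification = c("Bacteria;Proteobacteria;Escherichia",
                     "Bacteria;Proteobacteria;Salmonella",
                     "Bacteria;Firmicutes;Bacillus"),
  abundance = c(120, 40, 75),
  stringsAsFactors = FALSE
)

obj <- parse_tax_data(x, class_cols = "classification", class_sep = ";")

heat_tree(obj,
          node_size  = n_obs,
          node_color = n_obs,
          node_label = taxon_names)
```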
10:50 Models and methods for biology and beyond Saswati Saha P9 Multiple testing approaches for evaluating the effectiveness of a drug combination in a multiple-dose factorial design. applications, multivariate, Factorial Design, Drug Combination Anna Quaglieri NA
Drug combination trials are often motivated by the fact that using existing drugs in combination might prove to be more productive than the existing drugs alone and less expensive than producing an entirely new drug. Several approaches have been explored for developing statistical methods that compare fixed (single) dose combinations to their components. However, the extension of these approaches to a multiple-dose combination clinical trial is not always so simple. Considering these facts, we have proposed three approaches by which we can provide confirmatory assurance that a combination of two or more drugs is more effective than the component drugs alone. These approaches involve multiple comparisons in a multilevel factorial design where the type 1 error is controlled by a Bonferroni test, a bootstrap test, and a union-intersection test in which the least favourable null configuration has been considered. We have also built an R package implementing the above approaches, and in this presentation we would like to demonstrate how this R package can be used in a drug combination trial. We will also demonstrate how these three approaches perform when benchmarked against an existing approach.
11:10 Models and methods for biology and beyond Bill Lattner P9 Modeling Heterogeneous Treatment Effects with R models, applications Anna Quaglieri click here
Randomized experiments have become ubiquitous in many fields. Traditionally, we have focused on reporting the average treatment effect (ATE) from such experiments. With recent advances in machine learning, and the overall scale at which experiments are now conducted, we can broaden our analysis to include heterogeneous treatment effects. This provides a more nuanced view of the effect of a treatment or change on the outcome of interest. Going one step further, we can use models of heterogeneous treatment effects to optimally allocate treatment. In this talk I will provide a brief overview of heterogeneous treatment effect modeling. We will show how to apply some recently proposed methods using R, and compare the results of each using a question wording experiment from the General Social Survey. Finally, we will conclude with some practical issues in modeling heterogeneous treatment effects, including model selection and obtaining valid confidence intervals.
11:30 Models and methods for biology and beyond Shian Su P9 Glimma: interactive graphics for gene expression analysis visualisation, applications, bioinformatics Anna Quaglieri NA
Modern RNA sequencing produces large amounts of data containing tens of thousands of genes. Exploratory and statistical analysis of these genes produces plots or tables with many data points. Glimma is a Bioconductor package that provides interactive versions of common plots from limma, a widely used gene expression analysis package. It allows researchers to explore the statistical summary of their data, with cross-chart interactions providing greater insight into the behaviours of specific genes. Interactivity allows genes of interest to be quickly interrogated on the summary graphic, which provides better context than searching through spreadsheets, and cross-chart interactions display useful additional content that would otherwise require manual querying. Glimma produces HTML pages with custom D3 JavaScript that handles interactions completely independently of R, allowing the resulting plots to be easily shared with researchers without the need for software dependencies beyond a modern browser.
10:30 Learning and teaching François Michonneau P7 Lessons learned from developing R-based curricula across disciplines community/education Sam Clifford click here
The Carpentries is a non-profit volunteer organization that teaches scientists with little or no programming experience foundational skills in coding, data science, and best practices for reproducible research. We offer 2-day workshops for a variety of disciplines including Ecology, Genomics, Geospatial analysis, and Social Sciences. With 1300+ instructors who have taught 500+ workshops on all continents, we worked with our community of instructors to assemble evidence-based curricula using results from research on teaching and learning. We have developed detailed short- and long-term assessments to evaluate the effectiveness and level of satisfaction of our learners after attending a workshop, as well as the impact on their research and careers 6 months or more afterwards. We find that workshop participants program more often, are more confident, and use programming practices that they report make them more efficient and reproducible. Here, we will present the lessons we learned about developing curricula based on teaching R to novices across diverse disciplines, and the strategies we use to instill the desire to continue learning after attending our workshops.
10:50 Learning and teaching Matthias Gehrke P7 Student Performance and Acceptance of Technology in a Statistics Course Based on R mosaic - Results from a Pre- and Post-Test Survey community/education, teaching Sam Clifford click here
In recent years there has been a movement towards simulation-based inference (e.g., bootstrapping and randomization tests) in order to improve students' understanding of statistical reasoning (see e.g. Chance et al. 2016). The R package mosaic was developed with a "minimal R" approach to simplify the introduction of these concepts (Pruim et al. 2017). With a pre- and post-test survey, we analysed whether students improved in understanding as well as in acceptance of R during a one-semester statistics course in economically related Bachelor and Master programs. These courses were held by different lecturers at multiple locations in Germany. At our private university of applied sciences for professionals studying while working, the use of R is compulsory in all statistical courses. While conceptual understanding was evaluated by a subset of the modified CAOS inventory (as in Chance et al. (2016)), the acceptance and use of technology was assessed using an adapted version of UTAUT2 (Venkatesh et al. (2012)).
11:10 Learning and teaching Mette Langaas P7 Teaching statistics - with all learning resources written in R Markdown community/education, teaching Sam Clifford click here
In applied courses in statistics it is important for the student to see a mix of theory, practical examples and data analyses. Being able to study the R code used to produce the data analyses, and to run and modify that code, gives the student hands-on experience, which in turn may lead to increased theoretical understanding. I will describe my experiences with producing and using learning material written in R Markdown in two statistics courses at the Norwegian University of Science and Technology. One course is at the master level (Generalized linear models) with few students (35) and a mix of plenary and interactive lectures. The other course is at the bachelor level (Statistical learning) with more students (70).
10:30 Data handling Chester Ismay AUD Statistical Inference: A Tidy Approach using R visualisation, community/education, statistical inference, tidyverse community Jenny Bryan click here
How do you code-up a permutation test in R? What about an ANOVA or a chi-square test? Have you ever been uncertain as to exactly which type of test you should run given the data and questions asked? The `infer` R package was created to unite common statistical inference tasks into an expressive and intuitive framework to alleviate some of these struggles and make inference more intuitive. This talk will focus on developing an understanding of the design principles of the package, which are firmly motivated by Hadley Wickham's tidy tools manifesto. It will also discuss the implementation, centered on the common conceptual threads that link a surprising range of hypothesis tests and confidence intervals. Lastly, we'll dive into some examples of how to implement the code of the `infer` package via different data sets and variable scenarios. The package is aimed to be useful to new students of statistics as well as seasoned practitioners.
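A minimal sketch of the infer verbs on a stock dataset (not from the talk): a permutation test for a difference in means. Function names follow recent releases of infer.

```r
# Toy example: is mean mpg independent of transmission type in mtcars?
library(infer)
library(dplyr)

cars <- mutate(mtcars, am = factor(am))

obs_diff <- cars %>%
  specify(mpg ~ am) %>%
  calculate(stat = "diff in means", order = c("1", "0"))

null_dist <- cars %>%
  specify(mpg ~ am) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in means", order = c("1", "0"))

get_p_value(null_dist, obs_stat = obs_diff, direction = "two-sided")
```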
10:50 Data handling Thomas Lumley AUD Subsampling and one-step polishing for generalised linear models algorithms, models, databases, big data Jenny Bryan NA
Using only a commodity laptop it's possible to fit a generalised linear model to a dataset from about a million to a billion rows by first fitting to a subset and then doing a one-step update. The method depends on a bit of asymptotic theory, some sampling, the Fisher scoring algorithm, efficient R-database interfaces, and a little of the tidyverse.
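A rough illustration of the idea (not the speaker's implementation): fit a logistic regression to a subsample of simulated data, then take a single Fisher-scoring step using the score and information computed on the full data.

```r
# Sketch only: subsample fit + one-step update for logistic regression.
set.seed(1)
n <- 1e6
X <- cbind(1, matrix(rnorm(n * 2), n, 2))
beta_true <- c(-1, 0.5, -0.25)
y <- rbinom(n, 1, plogis(X %*% beta_true))

# 1. Fit to a random subsample
idx <- sample(n, 1e4)
b0  <- glm.fit(X[idx, ], y[idx], family = binomial())$coefficients

# 2. One Fisher-scoring step using the full data
mu    <- plogis(drop(X %*% b0))
w     <- mu * (1 - mu)               # IRLS weights for the logit link
score <- crossprod(X, y - mu)        # U(b0)
info  <- crossprod(X * w, X)         # I(b0)
b1    <- b0 + drop(solve(info, score))

cbind(subsample = b0, one_step = b1, truth = beta_true)
```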
11:10 Data handling James Hester AUD Glue strings to data in R Package development Jenny Bryan NA
String interpolation, evaluating a variable name to a value within a string, is a feature of many programming languages including Python, Julia, Javascript, Rust, and most Unix shells. R's `sprintf()` and `paste()` functions provide some of this functionality, but have limitations which make them cumbersome to use. There are also some existing add-on packages with similar functionality, however each has drawbacks. The glue package performs robust string interpolation for R. This includes evaluation of variables and arbitrary R code, with a clean and simple syntax. Because it is dependency-free, it is easy to incorporate into packages. In addition, glue provides an extensible interface to perform more complex transformations; such as `glue_sql()` to construct SQL queries with automatically quoted variables. This talk will show how to utilize glue to write beautiful code which is easy to read, write and maintain. We will also discuss ways to best use glue when performance is a concern. Finally we will create custom glue functions tailored towards specific use cases, such as JSON construction, colored messages, emoji interpolation and more.
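A minimal sketch of the interface with toy values; the glue_sql() example assumes the DBI and RSQLite packages are available.

```r
library(glue)

name  <- "useR! 2018"
n_pkg <- 3
glue("Welcome to {name}: this abstract mentions {n_pkg} packages, ",
     "and 2 + 2 is {2 + 2}.")

# glue_sql() interpolates *and* quotes values safely for SQL
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
glue_sql("SELECT * FROM talks WHERE chair = {chair}",
         chair = "Jenny Bryan", .con = con)
```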
11:30 Data handling Max Kuhn AUD Data Preprocessing using Recipes algorithms, models Jenny Bryan click here
The recipes package can be used as a replacement for model.matrix as well as a general feature engineering tool. The package uses a dplyr-like syntax where a specification for a sequence of data preprocessing steps is created, with the execution of these steps deferred until later. Data processing recipes can be created sequentially and intermediate results can be cached. An example is used to illustrate the basic recipe functionality and philosophy.
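A minimal sketch of that deferred-execution design (argument names follow recent versions of recipes):

```r
library(recipes)
library(dplyr)   # for %>%

rec <- recipe(mpg ~ ., data = mtcars) %>%   # declare outcome/predictor roles
  step_log(disp) %>%                        # specify steps now ...
  step_center(all_predictors()) %>%
  step_scale(all_predictors())

prepped <- prep(rec, training = mtcars)     # ... estimate them later
baked   <- bake(prepped, new_data = mtcars) # apply to (new) data
head(baked)
```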
10:30 Statistical modeling John Fox P10 New Features in the car and effects Packages visualisation, models Matteo Fasiolo click here
The widely used car and effects packages are associated with Fox and Weisberg, An R Companion to Applied Regression, the third edition of which will be published this year. In preparation, we have released the substantially revised version 3.0-0 of the car package and version 4.0-1 of the effects package. The car package focuses on tools, many of them graphical, that are useful for applied regression analysis (linear, generalized linear, mixed-effects models, etc.), including tools for preparing, examining, and transforming data prior to specification of a regression model, and tools that are useful for assessing regression models that have been fit to data. The effects package focuses on graphical methods for interpreting regression models that have been fit to data. Among the many changes and improvements to the packages are a reconceptualization of effect displays, which we call "predictor effects"; the ability to add partial residuals to effect plots of arbitrary complexity; simplification of the arguments of plotting functions; new and improved functions for summarizing and testing statistical models; and improved methods for selecting variable transformations.
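A minimal sketch of a predictor effect display with partial residuals, using a standard example dataset; argument names may differ slightly across package versions.

```r
library(car)      # regression tools; attaches the carData example data
library(effects)  # predictorEffects() and its plot methods

m <- lm(prestige ~ income + education + type, data = Prestige)

Anova(m)  # type-II tests from car

# Predictor effect plots with partial residuals overlaid
plot(predictorEffects(m, residuals = TRUE))
```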
10:50 Statistical modeling Rainer Hirk P10 mvord: An R Package for Fitting Multivariate Ordinal Regression Models algorithms, models, applications, multivariate Matteo Fasiolo NA
The R package mvord implements composite likelihood estimation in the class of multivariate ordinal regression models with probit and logit links. A flexible modeling framework for multiple ordinal measurements on the same subject is set up, which takes into consideration the dependence among the multiple observations by employing different error structures. Heterogeneity in the error structure across the subjects can be accounted for by the package, which allows for covariate-dependent error structures. In addition, regression coefficients and threshold parameters vary across the multiple response dimensions in the default implementation. However, constraints can be defined by the user if a reduction of the parameter space is desired. The proposed multivariate framework is illustrated by means of a credit risk application.
11:10 Statistical modeling Joachim Schwarz P10 Partial Least Squares with formative constructs and a binary target variable PLS, plspm package, formative constructs, binary target variable Matteo Fasiolo NA
In recent years, the use of PLS has become more and more important for modelling dependencies between latent variables, as an alternative to classical structural equation modelling. However, a non-metric target variable in combination with formatively measured constructs is still a particular challenge for the PLS approach. Using the plspm package (Sanchez/Trinchera/Russolillo 2017), we tested a model from the field of human resources management. The main goal of this model is to examine the moderating and mediating role of meaning at work for the relationship between several social, personal, environmental and motivational job characteristics and the intention to quit as a manifest binary target variable. Coping with the complexity of the model, which consists of more than 70 latent variables, all formatively measured and many of them single-indicator constructs, reveals some pitfalls in the application of the plspm package; but thanks to the flexibility of R, it is possible to evaluate even such a complex model.
11:30 Statistical modeling Murray Cameron P10 Exceeding the designer's expectation algorithms, models, applications Matteo Fasiolo click here
Statistical methods and their software implementations are generally designed for a particular class of applications. However, the nature of data, analysis and statisticians is such that uses of the methods are envisaged that extend the application. Sometimes the reason is the nature of the data, sometimes it is a new type of model and sometimes it is the limitations of the available software. Software for regression and for generalised linear models has regularly been used in 'non-standard' ways. We will discuss some examples, considering some changepoint models in particular, and emphasise some old lessons for software developers.
10:30 Better data performance David Cooley P6 Starting with geospatial data in Shiny, and knowing when to stop visualisation, databases, web app, performance, spatial David Smith NA
Theme: coupling R with geospatial databases to reduce the calculations and data held in R and improve Shiny app speed.
Like any web page, Shiny apps need to be quick and responsive for a better user experience. Doing complex calculations and storing large data objects will slow the app. Therefore, it's often desirable to remove as much of this as possible from the app. The talk will demonstrate:
- using MongoDB as a geospatial database
- querying and returning geospatial data to R from MongoDB
- comparison and benchmarking of geospatial operations in R vs on the database server
- applying this to a Shiny app with a demonstration, highlighting the pros and cons
- introducing the latest updates to the `googleway` package for displaying data and using Google Maps tools through R
- using Google Maps to trigger database queries and operations
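A minimal sketch (connection string, collection and field names are placeholders) of pushing a geospatial query down to MongoDB from R with mongolite, rather than computing it in the Shiny session:

```r
library(mongolite)

m <- mongo(collection = "properties", db = "geo",
           url = "mongodb://localhost")

# Let the database do the spatial work: index, then query near a point
m$index(add = '{"location": "2dsphere"}')

nearby <- m$find(query = '{
  "location": {
    "$near": {
      "$geometry": { "type": "Point", "coordinates": [144.9631, -37.8136] },
      "$maxDistance": 500
    }
  }
}')
head(nearby)
```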
10:50 Better data performance Jeffrey O. Hanson P6 prioritizr: Systematic conservation prioritization in R reproducibility, space/time, performance, conservation David Smith NA
Biodiversity is in crisis. To prevent further declines, protected areas need to be established in places that will achieve conservation objectives for minimal cost. However, existing decision support tools tend to offer limited customizability and can take a long time to deliver solutions. To overcome these limitations and help prioritize conservation efforts in a transparent and reproducible manner, here we present the prioritizr R package. Inspired by the tidyverse principles, this R package provides a flexible interface for articulating, building and solving conservation planning problems. In contrast to existing tools, the prioritizr R package uses integer linear programming (ILP) techniques to mathematically formulate and solve conservation problems. As a consequence, the prioritizr R package can find solutions that are guaranteed to be optimal and in record time. By finding solutions to problems that are relevant to the species, ecosystems, and economic factors in areas of interest, conservation scientists, planners, and decision makers stand a far greater chance at enhancing biodiversity. For more information, visit https://github.com/prioritizr/prioritizr.
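A minimal sketch of the problem-building interface, using the package's simulated example data (object names as in 2018-era releases); which solver is used depends on what is installed.

```r
library(prioritizr)

data(sim_pu_raster, sim_features)     # simulated planning units and features

p <- problem(sim_pu_raster, sim_features) %>%
  add_min_set_objective() %>%         # minimise total cost ...
  add_relative_targets(0.1) %>%       # ... while representing 10% of each feature
  add_binary_decisions() %>%
  add_default_solver()

s <- solve(p)                         # exact ILP solution
plot(s)
```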
11:10 Better data performance Remy Gavard P6 Using R to pre-process ultra-high-resolution mass spectrometry data of complex mixtures. algorithms, applications David Smith NA
Scientists are able to determine hundreds of thousands of components in crude oil using Fourier transform ion cyclotron resonance mass spectrometry (FTICR-MS). The statistical tools required to analyse the mass spectra struggle to keep pace with advancing instrument capabilities and increasing quantities of data. Today most ultrahigh resolution analyses for complex mixture samples are based on single, labour-intensive experiments. We present a new algorithm developed in R, named Themis, to jointly pre-process replicate measurements of a complex sample analysed using FTICR-MS. This improves consistency as a preliminary step to assigning chemical compositions, and the algorithm has a quality control criterion. Through the use of peak alignment and an adaptive mixture model-based strategy, it is possible to distinguish true peaks from noise. Themis demonstrated a more effective removal of noise-related peaks and the preservation and improvement of the chemical composition profile. Themis enabled the isolation of peaks that would otherwise have been discarded using traditional peak picking (based upon signal-to-noise ratio alone) for a single spectrum.
11:30 Better data performance Joshua Bon P6 Semi-infinite programming in R algorithms, models David Smith NA
Semi-infinite programming (SIP) is an optimisation problem where, generally, there are a finite number of variables but an infinite number of (parametrised) constraints. We show how to optimise simple SIP problems in R, in particular SIP for shape-constrained regression. The package sipr (under development) will be presented and collaboration sought from those in attendance.
13:00 Keynote Bill Venables AUD Adventures with R: Two stories of analyses and a new perspective on data NA Paul Murrell click here
I will discuss two recent analyses, one from psycholinguistics and the other from fisheries, that show the versatility of R to tackle the wide range of challenges facing the statistician/modeller adventurer. I will conclude with a more generic discussion of the status and role of data in our contemporary analytical disciplines and offer an alternative perspective from the current orthodoxy.
14:00 R in the community Simon Jackson P9 R from academia to commercial business applications, community/education, big data, industry, skill development Rhydwn McGuire NA
A 2017 report by StackOverflow showed that the use of R is greatest and growing fastest in academia. Commercial industries like tech, media, and finance, however, show the smallest usage and lowest adoption rates of the language. Yet lessons about the use of R and data science in academia and commercial settings complement each other. This presentation will share my experience as an R user moving from academia into commercial business: the transition from cognitive scientist at an Australian university to data scientist at one of the world's largest travel e-commerce sites, Booking.com. I'll discuss how the cutting-edge R skills used in academia can improve commercial product development. I will also identify the knowledge gaps I had moving into commercial business. This will be relevant to academics looking to move into industry, and business employers looking to hire data scientists from academia.
14:20 R in the community Joseph Rickert P9 Connecting R to the "Good Stuff" algorithms, models, applications, big data, interfaces Rhydwn McGuire NA
In his book, Extending R, John Chambers writes: "One of the attractions of R has always been the ability to compute an interesting result quickly. A key motivation for the original S remains as important now: to give easy access to the best computations for understanding data." R developers have taken the challenge implied in John's statement to heart, and have integrated R with some really "good stuff" while providing easy access that conforms to natural R workflows. Rcpp and Shiny, for example, are both spectacularly successful projects in which R developers expanded the reach of R by connecting to external resources. In this talk, I will survey the ongoing work to connect R to "good stuff" such as the CVX optimization software, the Stan Bayesian engine, Spark, Keras and TensorFlow; and provide some code examples, including using the sparklyr package to run machine learning models on Spark and the keras package to run deep learning and other models on TensorFlow.
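A rough sketch of the sparklyr workflow mentioned above, assuming a local Spark installation (the dataset and model are purely illustrative):

```r
library(sparklyr)
library(dplyr)

# Connect to a local Spark instance (spark_install() can set one up first)
sc <- spark_connect(master = "local")

# Copy an R data frame to Spark and fit the model on the Spark side
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)
fit <- ml_linear_regression(mtcars_tbl, mpg ~ wt + cyl)

summary(fit)
spark_disconnect(sc)
```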
14:40 R in the community Lisa Chen P9 Using R to help industry clients – The benefits and Opportunities visualisation, algorithms, models, data mining, applications, web app, reproducibility, multivariate, networks, performance, text analysis/NLP, big data Rhydwn McGuire NA
Dr Lisa Chen is Chief Analytics Officer for Harmonic Analytics. She is a highly qualified and experienced data scientist, with a PhD in Statistics and a Bachelor of Science in Computer Science and Statistics. Lisa has extensive experience using R, including designing solution-based models for complex optimisation problems and analysing large-scale datasets. Harmonic has helped customers globally to address business challenges across sectors including agriculture, aviation, banking, energy, government, health, telecommunications and utilities. We use R in our daily project work and also help clients with data science team development and R training. We will outline how we have used R and Shiny, and the benefits realised. We will discuss our journey, data-driven approach, workflow and industry observations. We will discuss our learning with R, e.g. observations regarding big data with R, version control, and some of the pain points and work-arounds. We will share our observations on how clients are starting to adopt open source and R for their analytical work, plus the trends and opportunities. Lisa will demonstrate examples of our interactive client dashboards.
14:00 Community and education Jonathan Carroll P6 Volunteer Vignettes; A Case-Study in Enhancing Documentation applications, reproducibility, community/education, documentation Kim Fitter click here
Vignettes: long-form documentation for a package, often a use-case, discussion, or scientific article. These are incredibly useful to both users and developers. In 2017, Julia Silge scraped CRAN and found that most packages don't have one [1]. At the start of 2018, I decided to give back to the community by 'being the change I wanted to see in the world' and writing a Volunteer Vignette a month, for the entire year. Yet all the new and interesting packages I could think to write something for already had vignettes. The solution came to me in February: have the community nominate packages. I made the call via Twitter [2] and received an encouraging response. I set about writing the first Volunteer Vignette and immediately discovered bugs and other issues, all of which have led to positive discussions with the author and updates to the package. In this talk I will present my first six months of the Volunteer Vignettes Project. I will demonstrate why vignettes are an invaluable step in making a robust R package. [1] https://juliasilge.com/blog/mining-cran-description/ [2] https://twitter.com/carroll_jono/status/961139524901527552
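For package authors who want to follow suit, a minimal sketch of adding a vignette scaffold to an existing package with usethis (the vignette name is illustrative):

```r
# Run inside the package's project directory
library(usethis)

# Creates vignettes/getting-started.Rmd, adds knitr/rmarkdown to Suggests
# and sets the VignetteBuilder field in DESCRIPTION
use_vignette("getting-started", title = "Getting started")

# Build and preview the vignettes locally before release
devtools::build_vignettes()
```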
14:20 Community and education Robin Hankin P6 Special and general relativity in R visualisation, community/education, space/time Kim Fitter NA
Although mostly used for statistics, R is a general purpose tool, and here I discuss how the R programming language can be used in the context of physics education. I introduce two R packages that have been used in the teaching of Einstein's theories of special and general relativity. The 'gyrogroup' package implements the Lorentz boosts for relativistic velocity addition. It provides dramatic visualization of the little-known fact that relativistic Lorentzian velocity addition is neither commutative nor associative. The 'schwarzschild' package presents visualizations of black hole physics and gravitational waves. In this presentation I discuss these two packages and the more general issue of R as a teaching tool in physics.
14:40 Community and education Sam Clifford P6 Classes without dependencies community/education Kim Fitter click here
Although important, learning statistics isn't generally why students choose to study science. To engage a cohort of first year Bachelor of Science students with diverse backgrounds and interests, we decided to design their core first year quantitative methods unit (with no math or programming prerequisites) around R. The course is designed to be practical; using RStudio and tidyverse packages rather than statistical tables, students can quickly engage in visualisation, data wrangling, writing functions, and modelling as part of a coherent workflow for scientific inquiry. In this talk, we discuss the learning and teaching principles and activities, outlining the use of blended and problem-based learning to teach both the quantitative topics and the use of R, developing students' data analysis skills and confidence. We discuss how workshop activities, quizzes, problem solving tasks, and the final project (a collaborative scientific article) not only assess students' skills but also prepare them for work as professional scientists. We will discuss students' feedback on their experience in their journey from novice student to young scientist.
14:00 Scalable R Le Zhang P7 Build scalable Shiny applications for employee attrition prediction on Azure cloud visualisation, models, data mining, applications, web app, reproducibility, performance Michael Lawrence NA
Voluntary employee attrition may negatively affect a company in various aspects. Identifying employees with an inclination to leave is therefore pivotal to avoiding potential losses. Data-driven techniques, assisted by a machine learning model, exhibit high accuracy in predicting employee attrition and offer company executives insightful information for decision making. The talk will cover a step-by-step tutorial on how to build a model for employee attrition prediction and deploy such an analytical solution as a Shiny-based web service on Azure cloud. R is used as the primary programming language for the development. Novel R packages such as AzureSMR and AzureDSVM, which allow data scientists and developers to programmatically operate cloud resources and seamlessly operationalize the analytics within an R session, will also be introduced in the talk. The Shiny application for the analytics, including interactive data visualization and model creation, is designed and deployed on Docker containers orchestrated by Kubernetes. Parameters of the deployment environment are carefully tuned to favor scalability of the application.
14:20 Scalable R Bryan Galvin P7 Moving from Prototype to Production in R: A Look Inside the Machine Learning Infrastructure at Netflix data mining, reproducibility, performance, big data, interfaces Michael Lawrence NA
Machine learning helps inform decision making on just about every aspect of the business at Netflix, so it is important to empower our data scientists with tooling that makes them more effective. To accomplish this, we developed Metaflow, a platform written in Python for data scientists to develop, run, and deploy projects without getting in their way. Some key design features include:
* Ability to work with the R packages we all know and love with no restrictions
* Scale up seamlessly from local development to the almost infinite resources in the cloud
* Automatic checkpointing of data and code with immutable snapshots created at each step of the modeling pipeline
* Deployment made easy with built-in hosting service and scheduling
In this talk, I will present an overview of some of the best practices that are baked into Metaflow, focusing especially on those that can be applied effectively at organizations that are not at Netflix scale. Additionally, I will cover some of the lessons learned from using reticulate to interface R with a large Python project.
14:40 Scalable R Jason Gasper P7 Integrating R into a production data environment: A case example of using Oracle database services and R for fisheries management in Alaska. applications, databases, reproducibility Michael Lawrence NA
Catch and economic information from fisheries off Alaska are critical for the management and conservation of marine resources. The National Marine Fisheries Service, Alaska Regional Office, uses an Oracle database to monitor and store federal fishery catch data off Alaska. Annually, the system processes over 2 million fishery catch transactions, and it currently houses over 25 years of historical fishery data. Information in the database includes details on harvested fish, estimates of bycatch, at-sea observations of discards, electronic monitoring of catch (video-derived estimates), geospatial information, and complex business rules to monitor catch allocations to ensure overfishing does not occur. Our paper provides a high-level overview of the system architecture, with a focus on our use of R for both development (e.g., simulation and testing) and production (e.g., statistical features) within our Oracle database.
14:00 Visualisation Paul Murrell P10 The Minard Paradox visualisation Carson Sievert click here
Charles Joseph Minard's depiction of Napoleon's 1812 Russian campaign might be described as the best statistical graphic ever drawn ... by hand. Minard did not have the benefit of modern computer technology to help with his drawing; he did not have the option of importing a Google map tile; and he probably did not even consider the possibility of interactive tooltips. However, there are aspects of what Minard produced by hand that are very challenging for modern graphical software, particularly the thick bands that represent the size of Napoleon's army over time. This talk will describe the 'vwline' package for R and explore some of the interesting challenges that arise when attempting to render variable-width lines with software.
14:20 Visualisation Natalia da Silva P10 Interactive Graphics for Visually Diagnosing Forest Classifiers in R visualisation, data mining, web app Carson Sievert NA
This paper describes structuring data and constructing plots to explore forest classification models interactively. A forest classifier is an example of an ensemble since it is produced by bagging multiple trees. The process of bagging and combining results from multiple trees produces numerous diagnostics which, with interactive graphics, can provide a lot of insight into class structure in high dimensions. Various aspects are explored in this paper, to assess model complexity, individual model contributions, variable importance and dimension reduction, and uncertainty in prediction associated with individual observations. The ideas are applied to the random forest algorithm (Breiman, 2001) and projection pursuit forest (da Silva et al., 2017), but could be more broadly applied to other bagged ensembles. Interactive graphics are built in R (R Core Team, 2016) using the ggplot2 (Wickham, 2016), plotly (Sievert et al., 2017), and shiny (Chang et al., 2015) packages.
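Not the authors' own tooling, but a minimal sketch of the kind of interactive diagnostic this builds on: variable importance from a random forest drawn with ggplot2 and made interactive with plotly:

```r
library(randomForest)
library(ggplot2)
library(plotly)

# Fit a small random forest and extract variable importance
rf  <- randomForest(Species ~ ., data = iris, importance = TRUE)
imp <- data.frame(variable = rownames(importance(rf)),
                  gini = importance(rf)[, "MeanDecreaseGini"])

# Static ggplot2 chart, wrapped by plotly for tooltips and zooming
p <- ggplot(imp, aes(x = reorder(variable, gini), y = gini)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Mean decrease in Gini")

ggplotly(p)
```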
14:40 Visualisation Chun Fung (Jackson) Kwok P10 Rjs: Going hand in hand with Javascript visualisation, interfaces, JavaScript Carson Sievert NA
Many of the popular data visualisation packages in R, e.g. Plotly, Leaflet and DiagrammeR, are powered by JavaScript. I will demonstrate how far a little JavaScript can go towards creating animated and interactive visualisations from within R. This is done with the package, Rjs, which provides a simple interface between R and JavaScript. It allows you to seamlessly combine R modelling packages with JavaScript interactive visualisation libraries. This talk is for researchers, data analysts, and intermediate R users looking to extend their skills in interactive data visualisation.
14:00 Complex models and performance Hong Ooi P8 SAR: a practical, rating-free hybrid recommender for large data algorithms, models, applications, big data Kelly O'Briant NA
SAR (Smart Adaptive Recommendations) is a fast, scalable, adaptive algorithm for personalised recommendations, based on user transaction history and item descriptions. From an end-user's point of view, SAR has the following benefits. First, it is relatively easy to explain to a nontechnical audience, compared to algorithms that rely on matrix factorisation. Second, it doesn't use subjective ratings, which can be unreliable given the pervasive influence of social media: a product that gets review-bombed after going viral will have meaningless ratings. Third, it takes event times into account, thus allowing recommendations to evolve with changing trends. Finally, it does well in recommending cold items, by building a regression model on item data. In this talk I'll discuss two separate implementations of SAR: a standalone one in base R, and an interface to an Azure web service. The former allows easy experimentation and evaluation, while the latter provides more options and is scalable to production-scale datasets.
14:20 Complex models and performance Fang Zhou P8 Jumpstart Machine Learning with Pre-Trained Models algorithms, models, reproducibility, interfaces Kelly O'Briant NA
As a community many of us are building models (statistical and machine learning) that address various scenarios. At conferences like useR!, but also across many academic conferences, researchers publish papers that introduce new algorithms, with implementations available on GitHub in R, Python and other frameworks. The community also makes available pre-trained models, especially deep learning models, to demonstrate or highlight the capabilities of the algorithm. To foster healthy collaboration and for the reproducibility of key results, it is important that fellow data scientists can read about a new algorithm or approach and be able to try it out very quickly to see whether it meets their needs. While pre-trained machine learning models are available, they are often difficult to set up and evaluate. We are exploring a framework to make this process simpler by making it easy for any data scientist to investigate and evaluate pre-trained models. We will share our learnings and our proposal to enable data scientists to quickly discover pre-trained models that will help them get from zero to hero in short order.
14:40 Complex models and performance Stepan Sindelar P8 FastR: an alternative R language implementation applications, performance, R implementations Kelly O'Briant NA
R is a highly dynamic language that employs a unique combination of data type immutability, lazy evaluation, argument matching, a large amount of built-in functionality, and interaction with C and Fortran code. It is therefore a challenging task to develop an alternative R runtime that is both compatible with GNU R and can provide performance of R code comparable to static programming languages like C. FastR is an open source alternative R implementation that is trying to achieve this. The talk will introduce FastR and demonstrate the performance improvements it can offer, its compatibility with GNU R by being able to run unmodified popular complex CRAN packages like ggplot2 or Shiny, and FastR's unique features, for example in-process multi-threaded execution, and tools like a CPU sampler or viewing R memory dumps with VisualVM.
15:30 Genomics, signatures to single cells Momeneh (Sepideh) Foroutan P10 Singscore: a single-sample gene-set scoring method for analysing molecular signatures visualisation, applications, bioinformatics Peter Hickey NA
Several single-sample gene-set enrichment analysis methods have been introduced to score samples against gene expression signatures, such as ssGSEA, GSVA, PLAGE and combining z-scores. Although these methods have been proposed to generate single-sample scores, they use information from all samples in a dataset to calculate scores for individual samples. This leads to unstable scores which are influenced by sample size and composition in datasets. We have proposed singscore, a rank-based and truly single-sample scoring method implemented in the R/Bioconductor package singscore. We compare singscore to other methods and show that our approach performs as well as other methods for large datasets in terms of stability, while outperforming them in small datasets. Singscore is fast and generates easily-interpretable scores. We show the application of this method in cancer biology, where the dependence between distinct molecular signatures can be investigated across samples. Singscore has potential applications in personalised medicine, as it calculates replicable scores for individual samples regardless of the sample size or composition in the data.
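A rough sketch of the package's two-step scoring workflow (the expression matrix and signature gene sets here are placeholder objects; the singscore vignette ships worked examples):

```r
library(singscore)

# eset: a genes-by-samples expression matrix; up_sig / down_sig: character
# vectors of signature gene IDs (placeholder objects for illustration)
ranked <- rankGenes(eset)                  # rank genes within each sample

scores <- simpleScore(ranked,
                      upSet   = up_sig,
                      downSet = down_sig)  # one score (and dispersion) per sample

head(scores)
```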
15:50 Genomics, signatures to single cells Liam Crowhurst P10 scIVA: Single Cell Interactive Visualisation and Analysis visualisation, data mining, web app, reproducibility, bioinformatics, big data Peter Hickey NA
Technological advances enable measurements of gene expression at single cell resolution, creating datasets for investigating biological processes in life science research. Gene expression data is commonly represented as a matrix of tens of thousands of genes and up to millions of cells, which has created a demand amongst biologists for quick visualisation and analysis. We developed scIVA, a Shiny web app that is designed to be used as an interactive visualisation tool for gene expression datasets, intended for those with little R experience and for users to gain preliminary insights into datasets for further exploration and analysis. The web app will also be available for download as a standalone R package. The web app performs various visualisations, all of which are interactive and downloadable, using Plotly integrated with d3 JavaScript as graphing tools. Moreover, scIVA allows users to search for specific genes, subset by clusters and subpopulations, generate heatmaps and perform statistical analyses. The presentation will include a demonstration of the web app's key features.
16:10 Genomics, signatures to single cells Sarah Williams P10 Celaref: Annotating single-cell RNAseq clusters by similarity to reference datasets applications, bioinformatics Peter Hickey click here
Single-cell RNA sequencing (scRNAseq) is a way of measuring gene expression of many individual cells simultaneously, and is often used on samples which contain a mix of different cell types. In an scRNAseq analysis individual cells are typically clustered to group them by cell type. After clustering, identifying what type of cell is in each cluster (e.g. neurons) usually needs domain-specific knowledge of marker genes and function. The celaref package accepts pre-computed cell-clusters and aims to suggest cell-types for each cluster via similarity to reference datasets (scRNAseq experiments or microarrays) from similar samples. Briefly, within-dataset differential expression is calculated to identify the most enriched genes for each cluster, then their rankings are examined in reference datasets. Kolmogorov–Smirnov tests are used to decide if multiple matches should be reported. Initial experiments on brain, lacrimal gland and blood PBMC samples show sensible matching between similar cell types without overreaching on dissimilar cells. Celaref will be submitted to Bioconductor and is available at https://github.com/MonashBioinformaticsPlatform/celaref
16:30 Genomics, signatures to single cells Luke Zappia P10 clustree: a package for producing clustering trees using ggraph visualisation, algorithms, data mining, bioinformatics Peter Hickey click here
Clustering analysis is commonly used in many fields to group together similar samples. Many clustering algorithms exist, but all of them require some sort of user input to set parameters that affect the number of clusters produced. Deciding on the correct number of clusters for a given dataset is a difficult problem that can be tackled by looking at the relationships between samples at different resolutions. Here I will present clustree, an R package for producing clustering tree visualisations. These visualisations combine information from multiple clusterings with different resolutions, showing where new clusters come from and how samples change clusters as the number of clusters increases. Summarised information describing the samples in each cluster can be overlaid on the tree to give additional insight. I will also describe my experience developing clustree, particularly how I have made use of the ggraph package. The clustree package is available at https://github.com/lazappi/clustree and a preprint describing clustering trees can be read at https://www.biorxiv.org/content/early/2018/03/02/274035.
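A minimal sketch of building a clustering tree from k-means runs at several resolutions (the data and column prefix are illustrative; clustree expects one clustering column per resolution, named with a shared prefix):

```r
library(clustree)

# Cluster the iris measurements at k = 1, ..., 5
dat <- scale(iris[, 1:4])
clusterings <- sapply(1:5, function(k) kmeans(dat, centers = k)$cluster)
colnames(clusterings) <- paste0("K", 1:5)

# clustree() picks up columns sharing the prefix followed by the resolution
clustree(data.frame(clusterings), prefix = "K")
```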
15:30 Data mining Ilia Karmanov P9 Teach yourself deep-learning with R visualisation, algorithms, models, Deep Neural Nets, CNNs, MLPs, Machine Learning Kevin Kuo NA
R's concise matrix algebra and calculus functionality makes it easy to create machine learning models from scratch. Creating models from scratch is a great way to learn how they actually work. We show how R can be used to create a linear regression, MLP and CNN from scratch (see blog: http://blog.revolutionanalytics.com/2017/07/nnets-from-scratch.html) and thus how one may go about teaching oneself about DNNs. We believe this "hands-on" approach to learning is more effective because it exposes the user to all the "leaky abstractions" that modern frameworks hide and helps them understand what makes the models fragile. R's simple interface lets us easily "play" with the created models to understand further (potentially abstract) topics, e.g.: (i) visualise the classification boundary and thus investigate what effect the number of neurons (and layers) has; (ii) visualise different CNN filter-maps; (iii) solve a neural net deterministically through linear programming (without SGD) by working through "Proof of Theorem 1" in "Understanding deep learning requires re-thinking generalization" by Zhang (2017) (as a mirror to solving linear regression with SGD).
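In the same from-scratch spirit (though not the speaker's own code), a minimal linear regression trained by gradient descent in base R:

```r
# Linear regression by batch gradient descent, base R only
set.seed(42)
n <- 200
X <- cbind(1, matrix(rnorm(n * 2), n, 2))   # intercept + two features
w_true <- c(1, 2, -3)
y <- drop(X %*% w_true) + rnorm(n, sd = 0.5)

w  <- rep(0, ncol(X))                       # initialise weights
lr <- 0.05                                  # learning rate
for (i in 1:500) {
  resid <- drop(X %*% w) - y                # predictions minus targets
  grad  <- drop(crossprod(X, resid)) / n    # gradient of the mean squared error
  w     <- w - lr * grad                    # gradient descent step
}

round(cbind(gradient_descent = w, lm = coef(lm(y ~ X[, -1]))), 3)
```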
15:50 Data mining Angus Taylor P9 Deep learning at scale with Azure Batch AI algorithms, models, Deep learning Kevin Kuo NA
In recent years, R users have been increasingly exploring the use of deep learning methods to solve difficult problems from computer vision to natural language processing. However, developing deep learning models is a time-consuming and compute-intensive task. To obtain good performance on many datasets, it is necessary to test many combinations of network structures and hyperparameters. In this talk, we will discuss how Microsoft Azure Batch AI can be used to perform this tuning task at scale on clusters of GPU-enabled virtual machines in the cloud. Developers create a single R script to define tests of multiple different network configurations, using the popular deep learning frameworks mxnet or Keras. We explain how to build a simple Docker image that can be deployed across multiple machines and defines the necessary installation dependencies. Batch AI will scale VM clusters as necessary to parallelize the tasks and obtain the optimal network configuration efficiently, saving hours or even days of the developer’s time. We will demonstrate the value of Batch AI with a live demo of training a deep learning model, implemented in R, on the classic MNIST computer vision dataset.
16:10 Data mining Timothy Wong P9 Modelling Field Operation Capacity using Generalised Additive Model and Random Forest algorithms, models, multivariate, big data Kevin Kuo click here
In any customer-facing business, accurately predicting demand ahead of time is of paramount importance. Workforce capacity can then be flexibly scheduled in each local area accordingly. In this way, we can ensure having sufficient workforce to meet volatile demand. In this case study, we focus on the gas boiler repair field operation in the UK. We have developed a prototype capacity forecasting procedure which uses a mixture of machine learning techniques to achieve its goal. Firstly, it uses a Generalised Additive Model approach to estimate the number of incoming work requests, taking into account the non-linear effects of multiple predictor variables. The next stage uses a large random forest to estimate the expected number of appointments for each work request by feeding in various ordinal and categorical inputs. At this stage, the size of the training set is considerably large and does not fully fit in memory. In light of this, the random forest model was trained in chunks and in parallel to enhance computational performance. Once all previous steps have been completed, probabilistic inputs such as the ECMWF ensemble weather forecast are used to give a view of all predicted scenarios.
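A minimal sketch of the first modelling stage described above, a Poisson GAM for daily request counts with smooth non-linear effects (the variable names and simulated data are illustrative):

```r
library(mgcv)

# Illustrative daily data: request counts driven by temperature and seasonality
set.seed(1)
days <- data.frame(temperature = rnorm(730, mean = 10, sd = 6),
                   day_of_year = rep(1:365, 2))
days$requests <- rpois(730, lambda = exp(3 - 0.05 * days$temperature +
                                           0.3 * sin(2 * pi * days$day_of_year / 365)))

# Smooth terms for each predictor; a cyclic basis suits day-of-year seasonality
fit <- gam(requests ~ s(temperature) + s(day_of_year, bs = "cc"),
           family = poisson, data = days)

summary(fit)
plot(fit, pages = 1)
```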
16:30 Data mining Bernd Bischl P9 iml: A new Package for Model-Agnostic Interpretable Machine Learning algorithms, models, machine learning Kevin Kuo NA
iml implements model-agnostic interpretability methods to explain the functional behavior and individual predictions of machine learning models. A large advantage of model-agnostic interpretability methods over model-specific ones is their flexibility, as often not one but many types of machine learning models are evaluated for solving a task. Anything that is built on top of an interpretation, e.g. a visualization or graphical user interface, also becomes independent of the underlying model. Currently implemented are: feature importance, partial dependence plots, individual conditional expectation plots (ICE), tree surrogates, LocalModel (Local Interpretable Model-agnostic Explanations), and Shapley values for explaining single predictions. The talk will cover the basic concepts behind model-agnostic interpretations, and demonstrate the functionality of the package through applied examples in R. Link to CRAN release: https://cran.r-project.org/web/packages/iml/index.html Link to GitHub page: https://github.com/christophM/iml
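A minimal sketch of the package's R6 workflow: wrap any fitted model in a Predictor, then ask for a model-agnostic explainer such as permutation feature importance (the random forest and loss choice are illustrative):

```r
library(iml)
library(randomForest)

# Any fitted model can be wrapped; a random forest is used here for illustration
rf <- randomForest(medv ~ ., data = MASS::Boston)

X <- MASS::Boston[, setdiff(names(MASS::Boston), "medv")]
predictor <- Predictor$new(rf, data = X, y = MASS::Boston$medv)

# Permutation feature importance, measured by mean absolute error
imp <- FeatureImp$new(predictor, loss = "mae")
plot(imp)
```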
15:30 Simulation and modeling focus on surv anal Bachmann Patrick P6 Estimating individual Customer Lifetime Values with R: The CLVTools Package models Marie Trussart NA
Valuing customers is key to any firm. Customer lifetime value (CLV) is the central metric for valuing customers. It describes the long-term economic value of customers and gives managers an idea of how customers will evolve over time. To model CLVs in continuous non-contractual business settings such as retailers, probabilistic customer attrition models are the preferred choice in literature and practice. Our R package CLVTools provides an efficient and easy-to-use implementation framework for probabilistic customer attrition models. Building on the learnings of other implementations, we adopt S4 classes to allow constructing rich and rather complex models that are nevertheless easy to apply for the end user. In addition, the package includes recent model extensions, such as the option to consider contextual factors, that are not available in other packages. This article will focus on both the theory of the underlying statistical framework and the practical application using real-world data.
15:50 Simulation and modeling focus on surv anal Sam Brilleman P6 simsurv: A Package for Simulating Simple or Complex Survival Data models, simulation; survival analysis Marie Trussart click here
The simsurv package allows users to simulate simple or complex survival data. Survival data refers to a variable corresponding to the time from a defined baseline until occurrence of an event of interest. Depending on the field, the analysis of survival data can be known as survival, duration, reliability, or event history analysis. It has been common to make simplifying parametric assumptions when simulating survival data, e.g. assuming survival times follow an exponential or Weibull distribution. However, such assumptions are unrealistic in many settings. The simsurv package provides additional flexibility by allowing users to simulate survival times from 2-component mixture distributions or a user-defined hazard function. The mixture distributions allow for a variety of flexible baseline hazard functions. Moreover, a user-defined hazard function can provide even greater flexibility since the cumulative hazard does not require a closed-form solution. This means it is possible to simulate survival times under complex statistical models such as those for joint longitudinal-survival data. The package is modelled on the survsim package in Stata (Crowther and Lambert, 2012, Stata J).
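A minimal sketch of simulating Weibull survival times with a treatment effect, following the package's basic interface (parameter values are illustrative):

```r
library(simsurv)

# Covariate data: a binary treatment indicator for 500 subjects
set.seed(123)
covs <- data.frame(id = 1:500, trt = rbinom(500, 1, 0.5))

# Weibull baseline hazard, log hazard ratio of -0.5 for treatment,
# administrative censoring at 5 years
simdat <- simsurv(dist = "weibull", lambdas = 0.1, gammas = 1.5,
                  betas = c(trt = -0.5), x = covs, maxt = 5)

head(merge(covs, simdat, by = "id"))
```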
16:10 Simulation and modeling focus on surv anal Raju Rimal P6 R-package for simulating linear model data (simrel) models, applications, web app, multivariate, interfaces, Simulation Marie Trussart click here
Data science is generating enormous amounts of data, and new and advanced analytical methods are constantly being developed to cope with the challenge of extracting information from such "big data". Researchers often use simulated data to assess and document the properties of these new methods. Here we present an R package, `simrel`, which is a versatile and transparent tool for simulating linear model data with an extensive range of adjustable properties. The method is based on the concept of relevant components and a reduction of the regression model. The concept was first implemented in an earlier version of `simrel`, but only for the single-response case. In this version we introduce random rotations of latent components spanning a response space in order to obtain a multivariate response matrix Y. The properties of the linear relation between predictors and responses are defined by a small set of input parameters which allow versatile and adjustable simulations. In addition to the R package, a user-friendly shiny application with elaborate documentation and an RStudio gadget provide an easy interface to the package.
16:30 Simulation and modeling focus on surv anal Andrés Villegas P6 StMoMo: An R Package for Stochastic Mortality Modelling models, applications Marie Trussart NA
In this talk we use the framework of generalised (non-)linear models to define the family of generalised Age-Period-Cohort stochastic mortality models which encompasses the vast majority of stochastic mortality projection models proposed to date, including the well-known Lee-Carter and Cairns-Blake-Dowd models. We also introduce the R package StMoMo which exploits the unifying framework of the generalised Age-Period-Cohort family to provide tools for fitting stochastic mortality models, assessing their goodness of fit and performing mortality projections. We illustrate some of the capabilities of the package by performing a comparison of several stochastic mortality models applied to the Australian mortality experience. The R package StMoMo is available at http://CRAN.R-project.org/package=StMoMo.
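A rough sketch of fitting and forecasting the Lee-Carter model with StMoMo, using the England and Wales male mortality data bundled with the package (age range and horizon are illustrative):

```r
library(StMoMo)

# Lee-Carter model under a log link, fitted to the bundled England & Wales data
LC    <- lc(link = "log")
LCfit <- fit(LC, data = EWMaleData, ages.fit = 55:89)

# Project mortality 30 years ahead and plot the result
LCfor <- forecast(LCfit, h = 30)
plot(LCfor)
```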
15:30 Improving performance Helena Kotthaus AUD Optimizing Parallel R Programs via Dynamic Scheduling Strategies models, performance Earo Wang NA
We present scheduling strategies for optimizing the overall runtime of parallel R programs. Our proposal improves upon the existing mclapply function of the parallel package, which already offers a load balancing option that dynamically allocates tasks to worker processes. However, this mechanism has shortcomings when used on heterogeneous hardware architectures, where different CPU cores might have vastly different performance characteristics. We thus propose to enhance mclapply with a new parameter that allows mapping tasks to specific CPUs. The new affinity.list parameter, already available on the R-devel branch, allows setting a so-called CPU affinity mask that specifies on which CPU a given task is allowed to run. We demonstrate the benefits of the new mclapply version by showing how it can speed up parallel applications like parameter tuning. In this case study, we develop a regression model that guides the scheduling by estimating the runtime of a task for each processor type based on previous executions. In a series of code examples, we explain how this approach can be generalized to develop efficient scheduling strategies for parallel R programs.
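A rough sketch of the affinity-based interface (the affinity.list argument is available in recent versions of the parallel package on Unix-alikes; the CPU numbers and task costs are illustrative):

```r
library(parallel)

# Ten tasks with very different costs
tasks <- c(rep(2, 5), rep(0.2, 5))
slow_task <- function(s) { Sys.sleep(s); s }

# Pin the expensive tasks to CPUs 1-2 and the cheap ones to CPUs 3-4,
# e.g. because those cores are known to be faster
affinities <- c(rep(list(1:2), 5), rep(list(3:4), 5))

# Prescheduling is disabled so each task is dispatched with its own affinity mask
res <- mclapply(tasks, slow_task,
                mc.preschedule = FALSE, affinity.list = affinities)
unlist(res)
```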
15:50 Improving performance Stepan Sindelar AUD Combining R and Python with GraalVM applications, performance, programming languages interoperability, debugging Earo Wang NA
GraalVM is a multi-language runtime that allows running and combining multiple programming languages in one process, operating on the same data without the need to copy the data when crossing language boundaries. Moreover, the dynamic just-in-time compiler included in GraalVM is capable of applying optimizations across language boundaries. The languages implemented on top of GraalVM include FastR, an alternative R implementation, as well as C, Ruby, JavaScript, and the recently added GraalPython. The talk will present interesting ways in which R and Python can be combined into a polyglot application running on GraalVM, for example using an R package from Python or vice versa, and briefly explain how this interoperability works at the technical level. One of the most important parts of a language ecosystem is tooling, especially an interactive debugger. The talk will also present how one can debug multiple GraalVM languages at the same time in the Google Chrome Dev Tools, for instance stepping from R into C code.
16:10 Improving performance David Smith AUD Speeding up computations in R with parallel programming in the cloud models, performance, parallel programming Earo Wang click here
There are many common workloads in R that are "embarrassingly parallel": group-by analyses, simulations, and grid-based computations are just a few examples. In this talk, I'll provide a review of tools for implementing embarrassingly parallel computations in R, including the built-in "parallel" package and extensions such as the "foreach" package. I'll also demonstrate how you can dramatically reduce the time for a complex computation -- optimizing hyperparameters for a predictive model with the "caret" package -- by using a cluster of parallel R sessions in the cloud. With the "doAzureParallel" package, I'll show how you can create a cluster of virtual machines running R in Azure, parallelize the problem by registering the backend with "foreach", and shut down the cluster when the computation is complete, all with just a few lines of R code.
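A minimal sketch of the embarrassingly parallel pattern with foreach and a local backend; with doAzureParallel the same loop is pointed at a cloud cluster simply by registering a different backend (cluster creation and configuration omitted here):

```r
library(foreach)
library(doParallel)

# Register a local backend with four workers; doAzureParallel's
# registerDoAzureParallel() would slot in here to target an Azure cluster instead
cl <- makeCluster(4)
registerDoParallel(cl)

# An embarrassingly parallel simulation: every iteration is independent
res <- foreach(i = 1:100, .combine = c) %dopar% {
  mean(rnorm(1e5, mean = i))
}

stopCluster(cl)
head(res)
```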
16:30 Improving performance Romain François AUD rrrow: an R front end to Apache Arrow algorithms, performance, big data, streaming data, interfaces Earo Wang NA
Apache Arrow is a cross-language development platform for in-memory data. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. It also provides computational libraries and zero-copy streaming messaging and interprocess communication. Languages currently supported include C, C++, Java, JavaScript, Python, and Ruby. R support is currently being implemented, and in this talk we will discuss the various challenges and our short-, medium- and long-term vision for the connection between R and Apache Arrow.
15:30 Sports analytics Robert Nguyen P7 Using Australian Rules Football to Broaden the Appeal of R and Statistics Among Youth and Public Without a STEM Background visualisation, models, applications, reproducibility, community/education, interfaces Alex Whan NA
Our talk explores how sports analytics can be used to encourage those without a STEM background into the application of statistics and programming in a real world environment. Through the use of an R package (fitzRoy) related to AFL, we aim to both lower the barrier to entry for data access and increase analytical fan engagement in AFL. We will also talk about common issues that arise for the first-time R package builder. A key barrier to entry for the growth of the AFL community is data access, which not only prevents people from having a go at writing, but also prevents current media from producing reproducible work. An R package with online lessons on creating common fan rating systems like Elo, Pythagorean and Massey will engage people who otherwise might have put learning statistical modelling and R into their personal *this is too hard* bucket. Commonly, users are taught from a cleaned dataset and jump straight into modelling. This misses a key step: cleaning. With our package, we aim to use tangible examples of raw AFL data scraped from afltables and footywire to teach users how to clean scraped data themselves and get it into a tidy format for modelling.
15:50 Sports analytics Alex Fun P7 Using TMB (Template Model Builder) to predict the winner of a ping pong match algorithms, models, applications Alex Whan NA
In a recent and popular stats.stackexchange post, the following question was asked: “I bet with my colleague that I will beat him in fifty consecutive ping pong games. So far I have won 15, what are my chances of winning the next 35 games?” (from https://stats.stackexchange.com/questions/329521/). To answer this question, I propose the following data generation process for the score-line in each game: the OP (original poster) is a far superior player who still wishes to make the game fun for their opponent (they are colleagues after all). This leads to a regression problem for the OP’s probability of winning a point that cannot be fitted using standard regression packages. This introductory talk will demonstrate how to use the TMB (Template Model Builder) package with an optimisation algorithm to find maximum likelihood estimates for the regression coefficients. This will show that TMB is a very useful and efficient tool that allows the practitioner a lot of flexibility in exploring novel data generation processes and objective functions. I will also briefly touch upon using C++ from R, and automatic differentiation, which is great for those who dislike multivariate calculus.
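Not the speaker's model, but a minimal end-to-end TMB sketch (the negative log-likelihood of a normal mean) showing the compile / MakeADFun / optimise workflow the talk builds on:

```r
library(TMB)

# Write a tiny TMB model to a C++ file: negative log-likelihood of N(mu, 1) data
writeLines('
#include <TMB.hpp>
template<class Type>
Type objective_function<Type>::operator() () {
  DATA_VECTOR(y);
  PARAMETER(mu);
  return -sum(dnorm(y, mu, Type(1.0), true));
}', "normal_mean.cpp")

compile("normal_mean.cpp")        # build the shared library (needs a C++ compiler)
dyn.load(dynlib("normal_mean"))

obj <- MakeADFun(data       = list(y = rnorm(100, mean = 2)),
                 parameters = list(mu = 0),
                 DLL = "normal_mean", silent = TRUE)

# Maximum likelihood with a gradient-based optimiser; the gradient obj$gr
# comes from TMB's automatic differentiation
opt <- nlminb(obj$par, obj$fn, obj$gr)
opt$par
```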
16:10 Sports analytics Andrew Simpkin P7 A Shiny app used to predict training load in professional sports visualisation, algorithms, models, applications, databases, web app, multivariate, performance, streaming data Alex Whan NA
We have developed a Shiny dashboard web application used in professional sports to predict player load while planning a training session. This app allows coaches to better plan, prescribe and tailor training drills in advance. The Shiny dashboard app is deployed on Shiny Server Pro and connects to an SQL database of GPS data across multiple teams and sports. Teams can plan, save, edit and delete planned sessions to and from the GPS database. Based on retrospectively collected GPS and accelerometer data, we have developed a statistical learning algorithm to cluster similar drills and predict training load. The model achieves correlations over 0.95 in out-of-sample testing, with median differences of below 1% of GPS outcomes.
16:30 Sports analytics Sayani Gupta P7 CricketData: An R package for international cricket data visualisation, data mining, applications, web app, reproducibility Alex Whan NA
The CricketData package provides convenient scraper functions for downloading data from ESPNCricinfo into tibbles. Functions are provided for obtaining data on the performance of male and female players across Test, One Day International and Twenty20 formats, and for batting, bowling and fielding. Tidyverse packages can then be used to explore, visualise and analyse the data. The package enables a user to answer simple questions such as:
- What is the highest number of catches taken by a wicket keeper?
- What is the maximum number of catches taken by a fielder in a particular innings?
- How many batsmen have scored consecutive 100s in two matches or more?
- What is the maximum number of maiden overs by a bowler in a specific innings?
It will also allow deeper questions to be addressed, such as:
- Do batsmen tend to get run out more frequently when they are about to score a century?
- How does the performance of cricketers change in the 12 months before they retire?
- When is the period of peak performance during a cricketer's career?
Finally, it makes it easy to produce visual comparisons of player performance across different statistics.
15:30 Leveraging web apps Katie Sasso P8 Shiny meets Electron: Turn your Shiny app into a standalone desktop app in no time applications, databases, web app, reproducibility, interfaces, Automation Johnathan Carroll NA
Using Shiny in consulting can be challenging, as all deployment options involve either sending intellectual property and data to the cloud or IT involvement. When providing consulting-style services to extremely large, risk-averse enterprises this can greatly restrict one's ability to quickly get Shiny apps into users' hands, as engagement of IT can take months, if approved at all. We'll share how the Columbus Collaboratory team overcame these barriers to rapid deployment by coupling R Portable and Electron, a framework for creating native applications with a variety of web technologies. All the tools needed to use Electron for desktop deployment of Shiny apps will be reviewed. We'll highlight a specific example in which these technologies were used within a large enterprise to completely automate a weekly report. We'll also share how the app used R packages such as openxlsx, shinydashboard, RODBC, and zoo to query an internal database, cleanse data, calculate key metrics, and create a downloadable Excel file for dissemination. The best part? This Shiny app was delivered to the end business user as a stand-alone executable. https://github.com/ksasso/Electron_ShinyApp_Deployment
15:50 Leveraging web apps Adrian Barnett P8 Saving time for researchers by creating publication lists using shiny applications, databases, web app, open access Johnathan Carroll click here
Researchers are often asked by funders or employers to list their publications, but funders often have different requirements (e.g., all papers versus only those in the last five years) and researchers waste a lot of time formatting papers. To save time for researchers I made a shiny application (https://aushsi.shinyapps.io/orcid/) that takes a researcher's ORCID ID and outputs their papers in alternative formats. It uses crossref and pubmed (rentrez) to supplement the ORCID data. The app was included in the Australian Research Council's instructions to applicants and has been well used, with many good suggestions for improvements. However, the ORCID data is relatively messy and papers can be in multiple formats, making it difficult to create a standardised record for each paper that can be flexibly manipulated. For example, the publication's author data are in different fields and formats. Google Scholar publications are nicely standardised, but there are authentication issues when using shiny. I will describe how the app has developed and canvass how it could be improved, including adding the percentage of publications that are open access or other alternative research metrics.
16:10 Leveraging web apps Gergely Daroczi P8 Managing database credentials and connections: an easy and secure approach applications, databases, web app, interfaces, business Johnathan Carroll click here
The `DBI` R package family already provides a standardized way of opening connections to various databases and querying data, and the `config` package allows storing default database connection parameters in a central file, with some of the sensitive fields optionally encrypted via the `keyring` or `secret` packages -- but there is no convenient and secure wrapper around these for the actual R end-users. This talk introduces a new package that takes care of opening connections in the background to the databases specified in a secured and encrypted YAML file, so that the R user can simply specify the SQL command without the need to think about which DB backend and credentials are used.
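Not the new package itself, but a rough sketch of the building blocks it wraps: connection defaults read from a config file, the password fetched from the system keyring, and a DBI connection (the config entries, keyring service name and Postgres backend are all illustrative):

```r
library(DBI)

# config.yml (illustrative):
# default:
#   warehouse:
#     host: db.example.com
#     dbname: sales
#     user: analyst
db <- config::get("warehouse")

# Password stored once beforehand with keyring::key_set("warehouse", "analyst")
pwd <- keyring::key_get("warehouse", username = db$user)

con <- dbConnect(RPostgres::Postgres(),
                 host = db$host, dbname = db$dbname,
                 user = db$user, password = pwd)

dbGetQuery(con, "SELECT count(*) FROM orders")
dbDisconnect(con)
```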
16:30 Leveraging web apps Ian Hansel P8 Large Scale Data Visualisation with Deck.gl and Shiny visualisation, web app, space/time Johnathan Carroll click here
'deck.gl is a WebGL-powered framework for visual exploratory data analysis of large datasets' (https://uber.github.io/deck.gl/#/). Combining deck.gl and shiny allows for rich interactive graphics of large datasets, in particular visualising geospatial data. We will review how to integrate deck.gl with shiny using the upcoming R package 'deck.gl'. The talk will:
- Review the underlying technologies: WebGL, Mapbox and React.js
- Dive into an example exploring the latest Census from the Australian Bureau of Statistics
- Compare to existing visualisation capabilities in the 'rthreejs' and 'leaflet' packages
- Discuss how further integrations with React.js can enable more browser-based interfaces to data and analytics
After the talk the attendees should:
- Know how deck.gl works
- Understand how to visualise data in deck.gl from R using the 'deck.gl' package
- Want to use deck.gl in their own work :)
The talk is aimed at those with some experience (or interest) in geospatial analysis.
Time Session Presenter Venue Title Keywords Chair Slides
9:05 Keynote Roger Peng AUD Teaching R to New Users: From tapply to Tidyverse NA Nick Tierney click here
The intentional ambiguity of the R language, inherited from the S language, is one of its defining features. Is it an interactive system for data analysis or is it a sophisticated programming language for software developers? The ability of R to cater to users who do not see themselves as programmers, but then allow them to slide gradually into programming, is an enduring quality of the language and is what has allowed it to gain significance over time. As the R community has grown in size and diversity, R’s ability to match the needs of the community has similarly grown. However, this growth has raised interesting questions about R’s value proposition today and how new users to R should be introduced to the system. I will discuss some lessons learned from my experience teaching R to new users and from observing the evolution of the language over the past 20 years.
10:30 Optimisation for model fitting Anqi Fu P6 Disciplined Convex Optimization with CVXR models, data mining, applications, multivariate, big data, interfaces, optimization Martin Maechler click here
CVXR is an R package that provides an object-oriented modeling language for convex optimization, similar to CVX, CVXPY, YALMIP, and Convex.jl. It allows the user to formulate convex optimization problems in a natural mathematical syntax rather than the restrictive standard form required by most solvers. The user specifies an objective and set of constraints by combining constants, variables, and parameters using a library of functions with known mathematical properties. CVXR then applies signed disciplined convex programming (DCP) to verify the problem's convexity. Once verified, the problem is converted into standard conic form using graph implementations and passed to a cone solver such as ECOS or SCS. We demonstrate CVXR's modeling framework with applications in engineering, statistical estimation, and machine learning.
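A minimal sketch of CVXR's natural mathematical syntax, here for a non-negative least squares problem (the data are simulated purely for illustration):

```r
library(CVXR)

# Simulated regression data
set.seed(1)
n <- 100; p <- 5
X <- matrix(rnorm(n * p), n, p)
y <- X %*% c(2, 0, 1, 0, 3) + rnorm(n)

# Declare the variable, objective and constraints in mathematical form
beta       <- Variable(p)
objective  <- Minimize(sum_squares(y - X %*% beta))
constraint <- list(beta >= 0)            # non-negativity

prob   <- Problem(objective, constraint)
result <- solve(prob)                    # DCP check, conversion, cone solver

round(result$getValue(beta), 3)
```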
10:50 Optimisation for model fitting Giuseppe Bruno P6 Stochastic Gradient Descent: boosting its performances in R data mining, big data Martin Maechler NA
Despite the tremendous improvements in hardware and software technologies, the requirements for training machine learning models keep growing. With standard loss functions, gradient descent (GD) provides a simple approach. For a least-squares loss, the whole gradient is the sum of the gradients of each component function: ∇F(w) = Σ_i (x_i^T w - y_i) x_i. The complexity per iteration is O(n d). Here we gauge stochastic gradient descent (SGD), where the gradient is approximated with a single observation. When the stopping criterion is |w_{k+1} - w_k| < ε, GD requires O(log(1/ε)) iterations while SGD needs O(1/ε). Although the iterative nature of SGD prevents its straightforward parallelization, a few alternatives have been proposed in the literature for carrying out its parallel implementation. In this paper we provide different benchmark examples of parallel implementations, through standard shared memory and the Spark distributed computing framework, to boost SGD performance. Preliminary results, under given conditions, show significant performance improvements. The possibility to take advantage of these speed-ups is open to practitioners and not just computer specialists.
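A minimal base-R sketch contrasting the two updates on a least-squares problem: GD uses the full gradient at each step, while SGD uses the gradient of a single randomly chosen observation with a decaying step size:

```r
set.seed(7)
n <- 1000; d <- 5
X <- matrix(rnorm(n * d), n, d)
w_true <- rnorm(d)
y <- drop(X %*% w_true) + rnorm(n, sd = 0.1)

# Full-batch gradient descent: O(n d) work per iteration
w_gd <- rep(0, d)
for (k in 1:200) {
  grad <- drop(crossprod(X, drop(X %*% w_gd) - y)) / n
  w_gd <- w_gd - 0.1 * grad
}

# Stochastic gradient descent: O(d) work per iteration, one observation at a time
w_sgd <- rep(0, d)
for (k in 1:(200 * n)) {
  i      <- sample(n, 1)
  grad_i <- (sum(X[i, ] * w_sgd) - y[i]) * X[i, ]
  w_sgd  <- w_sgd - (0.1 / sqrt(k)) * grad_i
}

round(rbind(truth = w_true, gd = w_gd, sgd = w_sgd), 3)
```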
11:10 Optimisation for model fitting Melina Ribaud P6 Robustness criterion for the derivative kriging-based optimization algorithms, models, applications Martin Maechler NA
In the context of robust shape optimization, the estimation cost of some numerical models is reduced with a kriging metamodel. The function and its first and second derivatives are provided by the majority of industrial codes. We propose a robust optimization procedure that leans on the predictions of the function and its derivatives, as given by the kriging. The use of the derivatives improves the metamodel quality. We rely on the Rcpp and nloptr packages of R to estimate, predict and simulate the kriging with derivatives. Taylor's theorem, computed from the predictions of the function and its derivatives, is applied at each evaluated point to approximate the variation of the function. This cheap criterion is used as a replacement for a full computation of the second moment from the model. A Pareto front of robust solutions (minimization of the function and the robustness criterion) is then generated by the NSGA-II genetic algorithm through the nsga2R package of R. This algorithm efficiently produces a Pareto front regardless of the model complexity.
11:30 Optimisation for model fitting Andrew Locke P6 Augmented Lagrangian for constrained optimizations in Empirical Likelihood estimations algorithms, models Martin Maechler NA
Empirical Likelihood is a useful tool for inference as it does not require knowledge about where the data comes from. It can be extended in many ways, including regression or adding constraints using estimating equations. The positivity constraint has often been overlooked or ignored, but this means existing methods may not be applicable for some data. We look at enforcing this constraint by applying the Karush-Kuhn-Tucker conditions and using a multiplicative iterative optimization method of updating parameters which ensures movement towards the maximum. We have programmed this method in R and use simulations to demonstrate that the method works.
10:30 Infrastructure and tools for genomic analysis Ido Bar P7 Shinotate: an R-based shiny server for annotation and analysis of RNA-Seq transcriptome assemblies visualisation, web app, bioinformatics Matt Ritchie NA
Assembly of transcriptome data in non-model species has become common practice in the last decade thanks to the advent of high-throughput RNA-sequencing platforms and accompanying bioinformatics tools. Trinity is one of the most commonly used tools for transcriptome assembly from Illumina RNA-Seq data, and its accompanying functional annotation framework, Trinotate, offers a pipeline for running the various annotation tools and consolidating the results into a single database. Trinotate also includes a web-based graphical user interface for querying the annotations and provides basic visualisation, but its Perl implementation makes it difficult to customise and deploy. Shinotate was developed to provide a modern graphical interface for the analysis of transcriptome annotations, utilising the Trinotate annotation framework to deliver summarised results and insights to users of all skill levels. Shinotate is written in R and uses the `tidyverse` approach to summarise and visualise the data stored in Trinotate, and thus can be easily adapted to accommodate custom annotation tables. It serves interactive annotation tables and plots, with search, selection and data export functions.
10:50 Infrastructure and tools for genomic analysis Peter Hickey P7 DelayedArray: A tibble for arrays bioinformatics, performance, big data Matt Ritchie click here
High-throughput genomics data are commonly summarised in a feature-by-sample matrix or higher-dimensional array. In R, these have traditionally been stored in-memory, but this is becoming prohibitive for large, contemporary datasets, such as those generated using new genomics technologies like single-cell RNA-seq. Instead, these arrays may be stored on-disk, using the Hierarchical Data Format 5 (HDF5), for example. The Bioconductor project has developed the DelayedArray, which supports different 'backends' to wrap around an in-memory, on-disk, or remotely served representation of an array, providing a unified interface to the data that is familiar to users of ordinary R arrays. In this sense, a DelayedArray is to an array as a tibble is to a data frame. I will provide an overview of the DelayedArray framework, explain the requirements for developing a new backend for a DelayedArray, and highlight example backends for on-disk and remotely served data. I will also demonstrate how user-created packages can extend the capabilities of the DelayedArray and how this has enabled us to analyse large genomics datasets in R that were previously infeasible.
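A rough sketch of wrapping an on-disk HDF5 dataset as a DelayedArray and working with it through ordinary array syntax (the HDF5Array package provides the on-disk backend; the file written here is just for illustration):

```r
library(HDF5Array)   # pulls in DelayedArray as a dependency

# Write a matrix to disk as HDF5, then point an HDF5Array at it
m <- matrix(rnorm(1e6), nrow = 1000)
writeHDF5Array(m, filepath = "counts.h5", name = "counts")

x <- HDF5Array("counts.h5", "counts")   # the data stay on disk
class(x)                                # an HDF5Matrix, one DelayedArray backend

# Ordinary operations are recorded as delayed operations and only
# realised block by block when results are needed
y <- log2(x + 1)
colMeans(y[, 1:10])
```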
11:10 Infrastructure and tools for genomic analysis Ramyar Molania P7 Improved normalization of the NanoString nCounter gene expression data bioinformatics Matt Ritchie NA
The NanoString nCounter gene expression assay uses molecular barcodes and single molecule imaging to detect and count hundreds of unique transcripts in a single reaction. These counts need to be normalized to adjust for variations in assay efficiency, the amount of sample, and other factors. Most users adopt one of the options described in the nSolver analysis software, which involve background correction based on the observed values for 8 negative control probes, a within sample normalization using the observed values for 6 positive control probes, and normalization across samples using reference (“housekeeping”) genes. Including technical replicates is not recommended by the assay developers, but some users do so anyway. Here we present a new normalization called RUV3 which makes vital use of technical replicates and suitable control genes. We illustrate its effectiveness on four quite different datasets, and offer suggestions on the design and analysis of studies involving this technology.
11:30 Infrastructure and tools for genomic analysis Abdul Abdulmonem A. Alsaleh P7 Identifying methylation biomarkers for childhood leukaemia from human 450k DNA methylation array data using ABC.RAP R package data mining, bioinformatics, data analysis Matt Ritchie NA
To date, the majority of the available 450k DNA methylation analysis tools focus on single-CpG methylation differences. The array-based CpG region analysis pipeline (ABC.RAP) R package was developed to analyse normalised human 450k DNA methylation array datasets and applies Student's t-test and delta beta analysis to identify candidate genes containing multiple differentially methylated CpG sites. In addition, ABC.RAP can profile DNA methylation for any gene of interest, providing a powerful feature for comparison between datasets. We analysed nine publicly available acute leukaemia datasets and identified a panel of 11 genes that were consistently methylated across different cohorts. We used targeted DNA methylation sequencing (MiSeq; Illumina) to sequence blood samples from healthy adults and newborns, as well as leukaemia xenograft samples and cell lines. The selected panel of genes showed dense DNA methylation in leukaemia samples compared to low-level methylation in control samples, consistent with the publicly available 450k array data. ABC.RAP has been accepted on CRAN and is available at https://cran.r-project.org/package=ABC.RAP
10:30 Classification and data mining Christoph Bergmeir P9 ssc: An R Package for Semi-Supervised Classification algorithms, models, data mining Charles Gray NA
Semi-supervised classification has become a popular area of machine learning, where both labeled and unlabeled data are used to train a classifier. This learning paradigm has obtained promising results, specifically in the presence of a reduced set of labeled examples. We present the R package ssc (https://cran.r-project.org/package=ssc) that implements a collection of self-labeled techniques to construct a classification model. This family of techniques enlarges the original labeled set using the most confident predictions to classify unlabeled data. The techniques implemented in the ssc package can be applied to classification problems in several domains by the specification of a suitable learning scheme. At low ratios of labeled data, it can be shown to perform better than classical supervised classifiers.
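As a concrete illustration of the self-labeling workflow, here is a hedged sketch using ssc's `selfTraining()` with a k-nearest-neighbour base learner; unlabeled instances are marked with NA in the class vector. The argument names follow my reading of the ssc documentation and should be treated as assumptions to check against the package itself.

```r
library(ssc)
library(caret)   # provides knn3, used here as the base classifier

data(iris)
x <- as.matrix(iris[, -5])
y <- iris$Species
y[sample(length(y), 120)] <- NA          # keep only ~30 labeled examples

m <- selfTraining(x = x, y = y,
                  learner = knn3,                 # base supervised learner
                  learner.pars = list(k = 3),
                  pred = "predict")               # prediction fn returning class probabilities

predict(m, x[1:5, ])                              # classify new (or unlabeled) instances
```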
10:50 Classification and data mining Przemyslaw Biecek P9 DALEX will help you to understand this complex predictive model visualisation, algorithms, models, data mining, bioinformatics Charles Gray click here
Complex machine learning models (random forests/gradient boosting machines/other ensembles) are frequently used in predictive modeling and have many successful applications in predictive and prognostic modeling. Yet in many cases these models are perceived as "black boxes" with good accuracy but a very complex, hard-to-understand structure. In this talk I will present a methodology for exploration, validation and explanation of complex machine learning models. The methodology is implemented in the DALEX library for R (Descriptive mAchine Learning EXplanations). It contains three sets of explainers: explainers for individual model predictions, which may be used to better understand the key variables that drive model predictions; explainers for individual variables, which may be used to better understand how model predictions relate to the values of a selected feature; and explainers for global model structure, which may be used to assess globally important variables or important structures in the model. Find more about DALEX here: https://pbiecek.github.io/DALEX/. I will also give a workshop about this package.
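The typical DALEX workflow is to wrap any fitted model in an explainer object and then request model-level or prediction-level explanations. The sketch below uses function names from the current DALEX API (`explain()`, `model_parts()`, `predict_parts()`); the 2018 release exposed the same ideas under different explainer names, so treat the exact calls as version-dependent.

```r
library(DALEX)
library(ranger)

titanic <- titanic_imputed                         # example data shipped with DALEX
titanic$survived <- as.factor(titanic$survived)
rf <- ranger(survived ~ ., data = titanic, probability = TRUE)   # the "black box"

predictors <- setdiff(names(titanic_imputed), "survived")
explainer <- explain(rf,
                     data  = titanic_imputed[, predictors],
                     y     = titanic_imputed$survived,
                     label = "random forest")

model_parts(explainer)                                    # global variable importance
predict_parts(explainer, titanic_imputed[1, predictors])  # explanation of one prediction
```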
11:10 Classification and data mining Roel Henckaerts P9 Tree-Based Machine Learning for Insurance Pricing visualisation, models, Tree-based machine learning Charles Gray NA
The goal of this paper is to apply machine learning techniques to insurance pricing, thereby leaving the actuarial comfort zone of generalized linear models (GLMs) and generalized additive models (GAMs). We focus on developing full tariff plans, built from both the frequency and severity of claims. We adapt the cost functions and performance measures used in the algorithms such that the specific characteristics of insurance data are carefully incorporated: highly unbalanced count data with excess zeros on the frequency side and scarce, but potentially heavy-tailed and right-censored data on the severity side. One of the key requirements is the need for transparent, interpretable pricing models which are easily explainable to all stakeholders. We therefore shy away from black box models such as neural networks and rather focus on tree-based machine learning models. Starting from single recursive trees we work towards more advanced ensembles such as bagged trees, random forests and boosted trees. We also present visualization tools to obtain insights from the models by assessing the importance of the different risk factors and their impact on the price of an insurance contract.
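As a flavour of the tree-based approach on the frequency side, here is an illustrative (not the authors') Poisson regression tree in rpart, where exposure enters as the time component of the response; MASS's aggregated `Insurance` data stand in for real policy-level data.

```r
library(rpart)
library(MASS)

data(Insurance)   # District, Group, Age, Holders (exposure), Claims

freq_tree <- rpart(cbind(Holders, Claims) ~ District + Group + Age,
                   data = Insurance,
                   method = "poisson",                       # split on Poisson deviance
                   control = rpart.control(cp = 0.001, minsplit = 10))

freq_tree   # fitted values are claim rates per unit of exposure
```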
11:30 Classification and data mining Lubomír Štěpánek P9 Classification and evaluation of facial attractiveness and emotions for purposes of plastic surgery using machine-learning methods and R algorithms, models, multivariate, machine learning Charles Gray click here
Current plastic surgery deals with aesthetic indications such as improving the attractiveness of a smile or of other facial emotions. In this work, we applied machine-learning methods and R to explore how accurately photographed faces can be classified into sets of facial emotions (based on the Ekman-Friesen FACS scale), and furthermore which facial emotions are associated with the highest facial attractiveness, measured on a Likert scale by a board of observers. Facial image data, collected for each patient exposed to an emotion incentive, were processed, landmarked and analyzed using R. Neural networks (neuralnet package) showed the highest predictive accuracy for categorizing new faces into facial emotions, compared with naive Bayes classifiers (e1071 package) and regression trees (rpart package). Decision trees identified that the geometry of the mouth, eyebrows and eyes, in that descending order, affects the intensity of a classified emotion. We performed machine-learning analyses using R to point out which facial emotions and their geometry affect facial attractiveness the most, and therefore should preferentially be addressed within plastic surgeries.
10:30 Modeling and algorithms with a health focus Mauricio Sarrias P8 Comparing Implementations of Logit Models with Individual Heterogeneity models, applications, reproducibility, performance Saskia Freytag NA
This paper discusses different specifications of multinomial logit models that include individual heterogeneity (mixed logit, latent class MNL, GMNL, MM-MNL). Because of their ability to include unobserved heterogeneity, these models have become quite popular for the empirical analysis of choice decisions. However, because they are estimated by simulated maximum likelihood (SML), it is quite difficult to compare or even replicate results from different software implementations, even using the same database. For example, the estimates depend on the optimization algorithm, the way random numbers are generated, the prime numbers used if Halton draws are selected, and so on. Despite the now widespread use of these methods and these considerations, there appears to be no systematic investigation of the accuracy of these models or a comparison of the performance of the SML estimation routines that now exist in several software packages. In this article, I compare different implementations (R, Stata, Matlab) of these models by focusing on their ability to retrieve the true parameters from different data generating processes and different default setups.
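For reference, this is roughly what one R implementation looks like: a mixed logit fitted with the mlogit package on its bundled Electricity data. The draw type (Halton), the number of draws and the optimizer are exactly the defaults the talk argues can drive differences across implementations; other packages such as gmnl expose similar but not identical options. A hedged sketch:

```r
library(mlogit)

data("Electricity", package = "mlogit")
Electr <- mlogit.data(Electricity, id.var = "id", choice = "choice",
                      varying = 3:26, shape = "wide", sep = "")

mixl <- mlogit(choice ~ pf + cl + loc + wk | 0, data = Electr,
               rpar = c(cl = "n", loc = "n", wk = "n"),  # normally distributed coefficients
               R = 100,          # number of simulation draws
               halton = NA,      # default Halton sequences
               panel = TRUE)     # repeated choices per individual
summary(mixl)
```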
10:50 Modeling and algorithms with a health focus Daniel Putler P8 Optimally Locating Opioid Treatment Centers in Under Served Areas Using R and Alteryx applications, web app, Public Health, Optimization Saskia Freytag NA
In 2016 there were nearly 64,000 drug overdose deaths in the US, most of them due to opioids. Treatment of opioid addiction is one of the primary tools for addressing this situation. However, many of the areas hardest hit by opioid use are underserved from a treatment perspective. An issue currently impeding the location of treatment facilities is the lack of fine-grained location data on opioid abusers. We present a web application that assists decision makers in locating opioid treatment facilities in underserved areas. To do this, estimates of the number of adult opioid abusers at the census tract level are developed using R, based on data from the National Survey on Drug Use and Health and both census tract and microdata sample data from the American Community Survey. The census tract estimates of adult opioid abusers are used, along with data on the locations of existing opioid treatment facilities, to locate new facilities in areas that are further than ten miles from existing facilities, maximizing the estimated number of abusers within a ten-mile radius of the new facilities. The optimization is done using an evolutionary algorithm that is implemented in Alteryx.
11:10 Modeling and algorithms with a health focus Nicholas Tierney P8 Maxcovr: Find the best locations for facilities using the maximal covering location problem visualisation, algorithms, models, applications, space/time, interfaces Saskia Freytag NA
Want better wifi at the office? Improved access to healthcare? The maximal covering location problem (MCLP) can help! The MCLP finds optimal locations of facilities to improve their coverage of a set of targets. This means better placed wifi routers and healthcare facilities. Although the MCLP was described in the 1970s, it can be daunting to actually implement, as you need to know how to: (1) formulate an optimisation problem; (2) make it talk to a solver engine; (3) get the data into the appropriate format for the solver to recognise; and (4) translate the model output into a usable format. It is challenging, particularly if you are not familiar with optimisation or techniques such as linear programming. It is, however, a great use case for an R package to abstract away detail you don't need to worry about. The R package maxcovr provides a set of tools to perform, summarise, and visualise the MCLP, so that you can move on with your analysis, place better cellphone towers, and create better access to health facilities. In this talk, I describe why the MCLP is useful, where it can be applied, and demonstrate the use of maxcovr, before finally discussing future directions.
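The user-facing API is essentially a single call. The sketch below follows the maxcovr README example as I recall it, treating grade I listed buildings in the bundled `york` data as existing facilities and crime locations (`york_crime`) as the targets to cover; treat the data set and column names as assumptions worth checking.

```r
library(maxcovr)
library(dplyr)

york_selected   <- york %>% filter(grade == "I")   # treat these as existing facilities
york_unselected <- york %>% filter(grade != "I")   # candidate locations for new facilities

mc_20 <- max_coverage(existing_facility = york_selected,
                      proposed_facility = york_unselected,
                      user = york_crime,
                      distance_cutoff = 100,   # coverage radius in metres
                      n_added = 20)            # number of new facilities to place

mc_20   # summarises coverage before and after adding the 20 facilities
```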
11:30 Modeling and algorithms with a health focus Brianna Hitt P8 Optimal group testing algorithms for infectious disease detection with the binGroup package algorithms, applications, binary response; infectious disease testing; pooled testing; screening; sensitivity; specificity Saskia Freytag NA
Group testing is the process of amalgamating clinical specimens from individuals (e.g., blood, urine, or saliva) into groups to test for an infectious disease. When disease prevalence is small, the majority of these groups will test negatively. For positive testing groups, there are many algorithmic retesting procedures available to differentiate positive individuals from negative ones. The appeal of group testing to laboratories is that the number of tests needed is significantly less than testing each individual separately. Both estimating the probability of disease infection and identifying positive/negative individuals are goals of group testing. Unfortunately, no package has been available to address the identification goal for the most common group testing algorithms. We present the first functions for identification and make these available in the binGroup package to complement its large set of estimation functions. Our new functions calculate operating characteristics for algorithms and choose the optimal set of group sizes for user specified settings. These new functions allow laboratories to understand how well an algorithm is expected to perform before implementation.
10:30 Working with text Aneesha Bakharia P10 Topic Modeling with LDA and NMF from a Qualitative Content Analysis Perspective visualisation, text analysis/NLP, interfaces Jim Hester NA
The Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorisation (NMF) algorithms are able to find the latent topics within a document collection. Although LDA is specifically designed as a topic modeling algorithm, NMF is able to produce more coherent topics for smaller, domain-specific document collections. Both algorithms map documents to topics and topics to words and perform soft clustering (i.e., documents and words can belong to multiple topics), making them particularly suitable as qualitative content analysis aids. In this presentation the mathematical underpinning of both algorithms, along with their relevant R packages (topicmodels and NMF), will briefly be introduced. The main focus of the presentation, however, will be on using R to address issues that qualitative researchers encounter when using topic modeling algorithms, which include trust, topic quality/coherence, topic interpretation, evidence gathering and model parameter selection. Various tools to visualise the output of topic models will be discussed (e.g., LDAvis), and an intuitive user interface to explore topic models and gather evidence will be built using Shiny.
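Both algorithms can be fitted on the same document-term matrix. A minimal sketch using the `AssociatedPress` data shipped with topicmodels, trimming documents and vocabulary purely to keep the example quick:

```r
library(topicmodels)
library(NMF)
library(tm)

data("AssociatedPress", package = "topicmodels")
dtm <- AssociatedPress[1:200, ]                      # small slice for illustration

lda_fit <- LDA(dtm, k = 5, control = list(seed = 1))
terms(lda_fit, 8)                                    # top words per LDA topic

tfidf <- as.matrix(weightTfIdf(dtm))                 # NMF needs a non-negative matrix
keep  <- order(colSums(tfidf), decreasing = TRUE)[1:500]   # trim vocabulary for speed
mat   <- tfidf[, keep]
mat   <- mat[rowSums(mat) > 0, ]                     # drop empty documents

nmf_fit <- nmf(mat, rank = 5, seed = 1)
dim(basis(nmf_fit))   # document-topic weights
dim(coef(nmf_fit))    # topic-term weights
```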
10:50 Working with text JeongMin Kwon P10 From humble data to training data data mining, applications, text analysis/NLP Jim Hester click here
There is a lot of imperfect data out there. User feedback is not trustworthy, and implicit data is unlabelled and hard to wrangle, which makes it difficult to use for machine learning and in many other ways. But we can make use of such data by changing how we think about it and how we wrangle it. In this presentation, I suggest some new ideas for wrangling data for use in machine learning and show our case studies.
11:10 Working with text Talia Beech P10 Strategic Capability Analysis for CANSOFCOM visualisation, text analysis/NLP Jim Hester NA
To enhance the Canadian Special Operations Forces Command's competitive advantage in deterring and defeating adversaries, as well as collaborating with allies, a strategic capability assessment was conducted to identify current and future capability gaps using concepts from the forecasted future operating environment. Military capability implications are identified and assessed using a wargame-based survey approach across a range of units within the Command. Data collected included ordinal data as well as supplementary comments. Ordinal data is analyzed using the likert package, with an emphasis on the visualization of the data using stacked bar plots. Comment data is evaluated using R text mining packages, with some emphasis on preprocessing steps to simplify text mining tasks. Results are used as a foundation for implementing constructive institutional change across the Command.
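For the ordinal items, the likert package does most of the heavy lifting: give it a data frame of factors sharing a common set of response levels and plot the result. The survey items below are simulated stand-ins, not the CANSOFCOM data.

```r
library(likert)

lv <- c("Strongly disagree", "Disagree", "Neutral", "Agree", "Strongly agree")
items <- data.frame(
  "Capability gap is critical" = factor(sample(lv, 80, replace = TRUE), levels = lv),
  "Allied interoperability"    = factor(sample(lv, 80, replace = TRUE), levels = lv),
  check.names = FALSE
)

l <- likert(items)   # tabulates response percentages per item
plot(l)              # centred stacked bar plot, one bar per item
```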
11:30 Working with text Thomas Klebel P10 jstor: An R package for Analysing Scientific Articles data mining, text analysis/NLP Jim Hester NA
The interest in the (quantitative) analysis of textual data has increased considerably over the last few years. For researchers investigating the scholarly literature, the full-text archive of JSTOR (http://www.jstor.org) offers a rich and diverse set of journal articles and other texts. Through its service Data for Research (http://www.jstor.org/dfr/), JSTOR gives researchers the opportunity to analyse this data by delivering metadata, n-grams and, upon special request, full-text materials. jstor (https://tklebel.github.io/jstor/) enables researchers to easily import the supplied metadata into R. These metadata can either be analysed on their own, or be used in conjunction with n-grams or full-text data. The presentation will show how jstor supports investigations of scholarly literature, covering the analysis of n-grams and citation analysis. Besides introducing possible applications, the paper will also discuss limitations regarding data quality and how they might be addressed.
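A hedged sketch of the intended workflow: point jstor's parsers at the XML metadata files delivered by Data for Research and collect the results into tidy tibbles. The directory path is a placeholder, and the column used for counting is an assumption to check against the package documentation.

```r
library(jstor)
library(purrr)
library(dplyr)

xml_files <- list.files("dfr_download/metadata", pattern = "\\.xml$", full.names = TRUE)

articles <- map_dfr(xml_files, jst_get_article)      # one row of metadata per article
refs     <- map_dfr(xml_files, jst_get_references)   # cited references, where supplied

articles %>% count(pub_year, sort = TRUE)            # column name is an assumption
```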
10:30 New dev for temporal data Earo Wang AUD tsibble: The 15th time series standard visualisation, space/time Natalia da Silva NA
The conventional matrix structure that underlies time series models in R does not easily accommodate a number of complications, such as multiple variables, heterogeneous data types, low time resolutions, implicit missing values, and multilevel structures. This work addresses the broader issues of better data structures and modern data pipelines for analysing and visualising temporal-context data. We extend the tidy data concept to temporal data, and note that the "molten" data structure is flexible enough to handle heterogeneity, low time resolutions, and implicit missing values. Two constraints are required to turn "molten" data into valid temporal data: (1) an explicitly declared index variable containing timestamps; (2) a "key" that uniquely identifies the units of measurement. A syntactic approach that employs the "key" is introduced to describe nested or crossed data structures. Based on the tidy temporal data, a data pipeline is discussed and formulated to facilitate time-based transformation and visualisation. A case study is included to demonstrate the tidy structure and the data pipeline ideas and usage.
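In practice, declaring a tsibble amounts to naming the index and the key. A minimal sketch on simulated data (note that the `key` syntax has changed across tsibble versions; early releases wrapped the key columns in `id()`):

```r
library(tsibble)

sales_ts <- tsibble(
  month   = rep(yearmonth("2018-01") + 0:5, times = 2),  # the index: monthly timestamps
  store   = rep(c("North", "South"), each = 6),          # the key: unit of measurement
  revenue = rpois(12, lambda = 100),
  key = store, index = month
)

has_gaps(sales_ts)   # reports whether any series has implicit missing time points
fill_gaps(sales_ts)  # makes any implicit gaps explicit rows
```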
10:50 New dev for temporal data Rob Hyndman AUD Tidy forecasting in R space/time Natalia da Silva click here
The forecast package in R is widely used and provides good tools for monthly, quarterly and annual time series. But it is not so well-developed for daily and sub-daily data, and it does not interact easily with modern tidy packages such as dplyr, purrr and tidyr. I will describe our plans and progress in developing a collection of packages to provide tidy tools for time series and forecasting, which will interact seamlessly with tidyverse packages, and provide functions to handle time series at any frequency.
11:10 New dev for temporal data Mitchell O'Hara-Wild AUD fasster: Forecasting multiple seasonality with state switching algorithms, models, streaming data, timeseries Natalia da Silva NA
Forecasting time-series which contain multiple seasonal patterns requires flexible modelling approaches, and the need for continuously updating models emphasises the importance of fast model estimation. In response to shortcomings in current models, a new model is proposed which brings the desirable qualities of speed, flexibility and support for exogenous regressors into a state space model. This proposed model also introduces state switching, which captures groups of irregular multiple seasonality by switching between states. The functionality of the proposed model extends beyond forecasting, by allowing for model based time-series decomposition, imputation of missing values, and support for streaming data. This model is available as an R package (mitchelloharawild/fasster), which provides formula based model specification, and uses tidy data structures (tsibble) and APIs which will later become familiar in forecast's next iteration: tidyforecast.
11:30 New dev for temporal data Thiyanga Talagala AUD seer: R package for feature-based forecast-model selection algorithms, models, time series Natalia da Silva NA
The seer package provides a novel framework for forecast model selection using time series features. We call this framework FFORMS (Feature-based FORecast Model Selection). The underlying approach involves computing a vector of features from the time series which are then used to select the forecasting model. The model selection process is carried out using a classification algorithm -- we use the time series features as inputs, and the best forecasting algorithm as the output. The classification algorithm can be built in advance of the forecasting exercise (so it is an “offline” procedure). Then, when we have a new time series to forecast, we can quickly compute its features, use the pre-trained classification algorithm to identify the best forecasting model, and produce the required forecasts. Thus, the “online” part of our algorithm requires only feature computation, and the application of a single forecasting model, with no need to estimate large numbers of models within a class, or to carry out a computationally-intensive cross-validation procedure. This framework is compared against several benchmarks and other commonly used forecasting methods.
13:00 Keynote Danielle Navarro AUD R for Psychological Science? NA Danielle Navarro click here
Traditionally, R has been viewed as a language for data science and statistics. In the social sciences it has been extremely popular with researchers at the more quantitative end of the spectrum - but uptake has been less widespread outside of the more statistically inclined. I don't think the R language needs to be limited in this way. Since 2011 I've been teaching introductory research methods classes for undergraduates using R, running programming classes for R with postgraduate students, doing my own data analysis with R, implementing cognitive models with R and occasionally even running behavioural experiments in R. In this talk I reflect on some of these experiences - the good, the bad and the ugly - and discuss prospects and challenges for wider adoption of R as a tool within the psychological sciences.
14:00 Applications in big data Ansgar Wenzel P10 A closer look at UK MOT results - Why does my car always fail? applications Adam Gruer NA
We present an analysis of the last 10 years of MOT results in the UK, with a particular focus on when cars fail and why. This is based on an open data set provided by the UK Government on the MOT, an annual car check mandatory for all vehicles older than three years. We hope that the results can inform customer choice when purchasing used (or new) vehicles, as well as provide some interesting findings. In particular, we consider geographical, vehicle and owner data, and interactions between these groups, to identify the main drivers for a car failing an MOT. We also consider the severity of a fail, e.g. a non-working number plate light versus unsafe brake discs. We use these results to inform the design of a model that we train to predict failure (or passing) of a given vehicle. Additionally, we present results that were found using some less common techniques. We also consider whether there are significant regional differences in pass or fail rates for different car brands or models.
14:20 Applications in big data Kevin Kuo P10 Claims reserving in general insurance with R and Keras algorithms, models, data mining, applications, insurance Adam Gruer NA
In loss reserving, actuaries are concerned with estimating liabilities from current and future, yet to be reported, claims. In this session, we first provide an overview of the loss reserving problem and current techniques. We then frame the loss reserving problem as a predictive modeling problem, and propose a deep learning approach to solve it. We benchmark the model against existing techniques, then discuss applications of deep learning to other problems in actuarial science and insurance.
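By way of illustration (not the authors' model), framing reserving as prediction means something as simple as a small feed-forward network mapping claim features to expected payments; the architecture, features and Poisson loss below are placeholders.

```r
library(keras)

model <- keras_model_sequential() %>%
  layer_dense(units = 32, activation = "relu",
              input_shape = 4) %>%           # e.g. accident year, development lag, line of business, prior paid
  layer_dense(units = 16, activation = "relu") %>%
  layer_dense(units = 1, activation = "exponential")   # keeps expected payments positive

model %>% compile(optimizer = "adam", loss = "poisson")

# model %>% fit(x_train, y_train, epochs = 50, batch_size = 256)   # hypothetical training data
```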
14:40 Applications in big data Sangeeta Bhatia P10 Big Brother is Watching - Using Digital Disease Surveillance Tools for Near Real-Time Forecasting models, applications, Epidemiology Adam Gruer NA
In our increasingly interconnected world, it is crucial to understand the risk of an outbreak originating in one country/region and spreading to the rest of the world. Digital disease surveillance tools such as ProMed, HealthMap etc. have the potential to serve as important early warning systems as well as to complement field surveillance data during an ongoing outbreak. While there are a number of systems that carry out digital disease surveillance, there is as yet a lack of tools that can compile and analyse the generated data to produce easily understood, actionable reports. I will present a flexible statistical model that uses different streams of data (such as disease surveillance data, mobility data etc.) for short-term incidence trend forecasting. I will also highlight an example of disaggregating aggregated data to obtain incidence information at a fine spatial scale. This could be particularly important in instances where information at sub-national levels is lacking or incomplete. The model has been developed in R and will be made available as an R package as well as through a website for use by non-technical stakeholders.
14:00 R Consortium projects David Smith P9 The voice of the R community community/education NA click here
The R Consortium commissioned a broad survey of the R community in 2017, with over 3500 respondents around the world giving their insights and feedback on their usage of R. With such a large sample size, the data gave us some quantitative insights that can help continue the massive growth of R and establish sustainability in the ecosystem. I'll discuss some of the results from that survey, and other ways that the R Consortium helps to amplify the voice of the R community.
14:20 R Consortium projects Joseph Rickert P9 Sustainable community investment in action - a look at some of the R Consortium Funded Grant Projects and Working Groups. community/education NA NA
The R Consortium has awarded over USD 500,000 in funding to R community members improving the community and technical infrastructure for the benefit of the R ecosystem. In addition, the working groups program has worked to drive discussion and alignment on key areas such as industry adoption, package health, and educational standards. In this talk, we will showcase several of the working groups and funded projects stewarded by the R Consortium. This will give the audience an opportunity to understand the work being done, and how they could take part.
14:40 R Consortium projects Various funded researchers P9 What we are doing with the R Consortium funds community/education NA NA
In this last session, we will tell you about the work that we are doing or about to do with our grant. We will also take questions about applying for project funding.
14:00 Community development Peter Dalgaard P6 What's in a name? 20 years of R release management community/education, History of R Miles McBain NA
In this talk, I will go through the history of R releases since 1997. I will discuss the role of the R Core Team with special emphasis on development principles and release management issues. A few "war stories" will also be included. Some light will be thrown on the choice of release names since 2011.
14:20 Community development Dennis Irorere P6 R labs Africa community/education Miles McBain NA
In this century there is a pent-up demand for "the next big thing", and R labs Africa is at the right time to lead what is important to many in the areas of big data, data science, machine learning and artificial intelligence. I will reference the quote, "Talent is everywhere, it only needs opportunity to emerge", and this is what R labs Africa will be about: providing opportunity to marginalized and at-risk communities all over Africa to learn about data science with R through group mentorship, real-world challenges and regular meetups. There will be an annual R converge where everyone meets to talk about the future of R, share ideas and motivate one another. When access to knowledge is democratized, we see meaningful social development, because it takes a brilliant mind from a disadvantaged background to create transformative solutions that solve problems within his or her domains of life experience. When brilliance and naïve context find a nexus, true local solutions are created. These are the opportunities R labs Africa will bring to Africa. Already, the Akure R user group has reached out to about 62 young minds in the span of 3 months.
14:40 Community development Joe Kliegman P6 nextjournal community/education, reproducibility Miles McBain NA
Nextjournal is a web-based application for creating living documents with executable code. It is a notebook-style multi-language programming and presentation environment focused on long-term reproducibility. It is designed so that code the articles contain can be run and reused years later by automatically versioning the full software stack without requiring any knowledge of version control software. It was created to facilitate seamless collaborative development and strives to make research easier to reproduce, reuse, and trust.
14:00 Modeling and algorithms Teck Kiang Tan P7 Doubly Classified Models with R models, applications, community/education Simone Blomberg NA
When we look at a cross tabulation, can we see any pattern in it? When the table is big, it is extremely hard to discover patterns by examining the cell frequencies. Doubly classified models are a set of statistical models that aim to reveal patterns in a cross-classification table. There are substantial applications of these models. For instance, social researchers interested in intergenerational social mobility will find this way of analysing refreshing. These models are not new-fangled, but standard textbooks cover only a few of them, and journal articles are usually too technical to grasp the idea behind the model; for those with little background in mathematical statistics, they can be very difficult to understand. The talk focuses on conveying these models using a new graphical table tool called symbolic tables to give the basic idea behind the models. Using a few standard R functions, mainly generalized linear models, doubly classified models can be set up easily. Real-life examples will be illustrated, extracted mainly from the book titled "Doubly Classified Model with R", written by the author.
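To make the "standard R functions" point concrete, here is a sketch of one classic doubly classified model, quasi-independence, fitted as a Poisson log-linear model with glm(): the diagonal cells get their own parameters and independence is assessed on the off-diagonal pattern. The mobility table is simulated for illustration, not taken from the book.

```r
# A 3x3 origin-by-destination mobility table, flattened to one row per cell.
tab <- as.data.frame(as.table(matrix(
  c(50, 10,  5,
    12, 60,  8,
     4,  9, 40),
  nrow = 3, byrow = TRUE,
  dimnames = list(origin = c("low", "mid", "high"),
                  destination = c("low", "mid", "high")))))

# One parameter per diagonal cell (non-movers), a single level for all off-diagonal cells.
tab$diag <- ifelse(tab$origin == tab$destination, as.character(tab$origin), "off")

fit <- glm(Freq ~ origin + destination + diag, family = poisson, data = tab)
summary(fit)
```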
14:20 Modeling and algorithms Sourav Das P7 A routine for measuring the nonstationarity of a time series space/time, streaming data Simone Blomberg NA
Since the 1960s, nonstationary time series have been investigated extensively. Methodology and theory have evolved rapidly since Dahlhaus' construction of locally stationary processes in the 1990s. However, much of the theory in the above constructions relies on smoothness assumptions on the time-varying transfer function, and when modelling real data, tools for assessing such regularity conditions are yet to be developed. We have proposed a methodology that allows a domain expert to measure the nonstationarity of a time series using principles of nonparametric regression. In this talk we present an R routine that can be used to easily compute the proposed nonstationarity index.
14:40 Modeling and algorithms Aya Alwan P7 Observation driven Conway-Maxwell Poisson count data models algorithms, models Simone Blomberg NA
The Conway-Maxwell-Poisson (CMP) distribution is a flexible generalisation of the Poisson distribution that has gained recent attention due to its ability to model both overdispersed and underdispersed count data. The main hindrance to its wider use in practice seems to be the inability to directly model the mean of the counts, making CMP models neither compatible with nor comparable to competing count regression models, such as log-linear Poisson, negative binomial or generalized Poisson regression models. In this talk, we will review how the CMP distribution can be parametrized via the mean, so that simpler and more easily interpretable mean models can be used, such as a log-linear model. A newly developed R package to fit the model to data will be discussed. Some simulated and real datasets will be used as demonstration.
14:00 Programming, performance and productivity Martin Maechler P8 Helping R to be (even more) accurate algorithms, reproducibility, performance, numerical accuracy Roger Peng NA
R has originally and primarily been **the** _"super calculator"_ for all applied statisticians and data scientists. Notably, at the heart of many statistical modelling algorithms are computations with probabilities, risk measures or densities for maximum likelihood. In some cases these can go wrong _"without notice"_ because of inherent limitations in computer arithmetic, such as cancellation and underflow. 1. See how useRs can use R smartly in order to not lose precision unnecessarily: why R's reliable distribution-related functions `[dpq]*()` *all* have arguments such as `log.p` and `lower.tail`, e.g., `dnorm(x, .., log = FALSE)` and `pnorm(q, .., lower.tail = TRUE, log.p = FALSE)`, and further, why you should know about `log1p()` and `expm1()`. 2. The CRAN package `Rmpfr` provides an (S4-classed) interface to the GNU MPFR library for arbitrarily precise computation when needed, e.g., for determining numerically reliable computations in R itself or in our CRAN package `copula`.
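Two of the points above in code: use the distribution functions' `lower.tail`/`log.p` arguments rather than composing `log(1 - pnorm(...))`, use `log1p()` where `log(1 + x)` would cancel, and reach for Rmpfr when double precision genuinely is not enough.

```r
x <- 40
log(1 - pnorm(x))                           # underflows: returns -Inf
pnorm(x, lower.tail = FALSE, log.p = TRUE)  # accurate upper-tail log-probability (about -804.6)

log(1 + 1e-18)                              # cancellation: 1 + 1e-18 == 1 in double, so this is 0
log1p(1e-18)                                # accurate: 1e-18

library(Rmpfr)                              # arbitrary-precision cross-check of the same tail area
x_mp <- mpfr(40, precBits = 200)
log(erfc(x_mp / sqrt(mpfr(2, precBits = 200))) / 2)
```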
14:20 Programming, performance and productivity Tomas Kalibera P8 Preventing and Detecting Memory Protection Bugs in Packages programming packages, finding bugs in packages Roger Peng NA
R's garbage collector (GC) ensures that memory used for R values is automatically reclaimed when they become unreachable via pointers and hence no longer needed. R code is handled automatically, but C code must protect from the GC the R values it needs and unprotect them afterwards. Forgetting to protect and/or to unprotect (a protect bug) often makes R crash, but can also lead to incorrect results. It is not uncommon for old protect bugs to be uncovered much later by inconsequential code changes. These bugs are common and hard to find, and thus R offers tools to detect them. `gctorture` helps testing by increasing the chance that a protect bug will crash R, and will do so sooner after the code with the bug executes. `rchk` is a static analysis tool that identifies potential protect bugs in C code without executing it; it is used regularly to check incoming CRAN packages. Finally, protect bugs can be prevented by following several simple programming rules. The talk is intended for package developers and everyone who writes C code to work with R.
14:40 Programming, performance and productivity Kelly O'Briant P8 How to Play with and Integrate DevOps Technologies in an R Data Science Workflow Cloud computing Roger Peng NA
Over the last year I’ve become obsessed with trying to encourage the data science community to explore and exploit DevOps and cloud computing technology. This isn’t often an easy undertaking. Most people (data scientists or not) are skeptical of deviating from the tools and workflows they’ve come to rely on. This talk will feature case studies in developing data science products and workflows in the cloud, and how working with these tools can open up a world of new possibilities within the intersection of DevOps and data analytics. Key topics to discuss: how DataOps can address the growing scope of data science tasking; where to start when you begin exploring cloud services; how to work through functionality/engineering challenges in a cloud environment; and case studies in data science product engineering and deployment.
15:30 Keynote Jenny Bryan AUD Code smells and feels NA Thomas Lumley click here
"Code smell" is an evocative term for that vague feeling of unease we get when reading certain bits of code. It's not necessarily wrong, but neither is it obviously correct. We may be reluctant to work on such code, because past experience suggests it's going to be fiddly and bug-prone. In contrast, there's another type of code that just feels good to read and work on. What's the difference? If we can be more precise about code smells and feels, we can be intentional about writing code that is easier and more pleasant to work on. I've been fortunate to spend the last couple years embedded in a group of developers working on the tidyverse and r-lib packages. Based on this experience, I'll talk about specific code smells and deodorizing strategies for R.