Poster Schedule

The talks will take place on 11-13 July 2018 (click a talk of interest to see its abstract). Poster presenters have the option of giving a 30-second lightning talk after the first keynote to advertise their poster. Information for presenters is here.

Time Session Presenter Venue Title Keywords Chair
14:40 Poster lightning Shinichi Takayanagi AUD How LINE Corp Use R to Compete in a Data-Driven World applications, reproducibility, Practical use of R language in business Simon Jackson
LINE is one of the most popular messaging applications in Asia, developed by LINE Corp. We continuously collect and use logs of user behavior to improve our services. In our data science team, R plays an important role at every stage of data analysis, from exploratory data analysis and modeling to sharing results with colleagues using Shiny. In our poster presentation, we explain how we use R to solve our business problems and share some practical insights for others who want to use R in their business. As our data science team grew, different people developed their own R scripts to solve similar tasks such as database connection. To solve this problem, we started working on an internal R package called "liner", shared through GitHub Enterprise. Now, multiple members often collaborate on this package to improve it and fix bugs. We also give an overview of our R-related data analysis platform, along with some useful OSS tools:
- yanagishima: a tool developed at LINE Corp that helps people run queries easily
- Drone: a Docker-based continuous integration tool. We use it for automated unit tests and to deploy the "liner" package to users
14:40 Poster lightning Mahmoud Ahmed AUD pcr: an R package for quality assessment, analysis and testing of qPCR data bioinformatics, data-access Simon Jackson
Real-time quantitative PCR (qPCR) is a broadly used technique in biomedical research. We developed an R package that implements methods for quality assessment, analysis and testing of qPCR data for statistical significance. The Double Delta CT and standard curve models were implemented to quantify the relative expression of target genes from CT values in standard qPCR. In addition, calculations of amplification efficiency and curves from serial dilution qPCR experiments are used to assess data quality. Finally, two-group testing and linear models are used to test for statistical significance among conditions. Using two datasets from qPCR experiments, we applied the different quality assessment, analysis and statistical testing methods in the pcr package and compared the results to the original published articles. The pcr package provides an intuitive and unified interface for its main functions, allowing biologists to perform all necessary steps of qPCR analysis and produce graphs in a uniform way.
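As a rough illustration of the Double Delta CT model mentioned in this abstract, the sketch below computes a relative expression value from replicate CT values. It is written in Python and does not reproduce the pcr package's interface; the function name and the example CT values are hypothetical, and the calculation assumes an amplification efficiency close to 100% (one doubling per cycle).

```python
from statistics import mean

def relative_expression(target_treated, ref_treated, target_control, ref_control):
    """Relative expression of a target gene via the Double Delta CT model.

    Each argument is a list of replicate CT values. Assumes the amount of
    product doubles every cycle (about 100% amplification efficiency).
    """
    # Normalise the target gene to the reference gene within each condition.
    delta_ct_treated = mean(target_treated) - mean(ref_treated)
    delta_ct_control = mean(target_control) - mean(ref_control)
    # Compare the treated condition against the control condition.
    delta_delta_ct = delta_ct_treated - delta_ct_control
    # A lower CT means more template, hence the negative exponent.
    return 2.0 ** (-delta_delta_ct)

# Hypothetical replicate CT values, for illustration only.
fold_change = relative_expression(
    target_treated=[24.1, 23.9, 24.0],
    ref_treated=[18.0, 18.1, 17.9],
    target_control=[26.0, 26.1, 25.9],
    ref_control=[18.0, 17.9, 18.1],
)
print(round(fold_change, 2))
```

A value above 1 indicates higher expression of the target gene in the treated condition than in the control, and a value below 1 indicates lower expression.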
14:40 Poster lightning Miles McBain AUD An R pipeline for creating and hosting collaborative Web VR environments. visualisation, space/time Simon Jackson
We are continuing the rich history of R as a generative tool for data-driven documents by developing a capability to generate Virtual Reality (VR) scenes. A VR scene object is provided that can create, configure, and serve multi-user Web VR scenes. Development so far includes tools that harness the R spatial ecosystem to create textured 3D meshes from contour and raster data. Several other VR primitives are provided, including 360 photo 'portals'. The immediate application for this work has been the creation of environments for the calibration of geo-spatial statistical models by expert elicitation. However, we see the ability to generate and host multi-user scenes, complete with a facility for activity logging, as a compelling new platform for collecting and visualising experimental data. Given this work is evolving rapidly, a poster would be the perfect tool to engage attendees in a variety of conversations about the API, the technology stack, and applications. If given a poster slot at useR! 2018, we would like to bring some VR hardware to the session, to give attendees a chance to experience some of the virtual environments our tools have generated.
14:40 Poster lightning Kim Ki-Yeol AUD Squamous cell carcinoma analysis with R bioinformatics Simon Jackson
Squamous cell carcinoma (SCC) is the most common histological type of head and neck cancer and cervical cancer. Carcinogenesis in these two types of cancer demonstrates similar multistep progression. The purpose of the study is to identify the significant consensus gene modules of these two cancers. We used a publicly available expression dataset for the study. The dataset included head and neck cancer (42 cancer samples and 14 normal samples) and cervical cancer (20 cancer samples and 8 normal samples). We used only human papillomavirus (HPV) 16 positive samples to exclude bias arising from different HPV types. We identified consensus modules of the two types of cancer and explored the biological functions of each module using an annotation tool. We identified 8 consensus gene modules of head and neck cancer and cervical cancer. Each module was well preserved between the two types of cancer. The modules included significant biological functions, including ATP binding and extracellular exosome. Consensus gene module identification is expected to contribute to more personalized management of multiple cancer types.
14:40 Poster lightning Erika Siregar AUD AnalevR: An Interactive R-Based Analysis Environment for Utilizing BPS-Statistics Indonesia Data visualisation, web app, community/education, networks, big data, analysis, collaboration, cloud computing, analysis as a service, bps, statistics Indonesia, analysis environment, remote data analysis Simon Jackson
As Indonesia’s national statistical agency, BPS-Statistics Indonesia produces a massive amount of strategic data every year. However, these data are still underutilized by other parties (governments, researchers, etc.) due to technical limitations and the exclusivity and locality of the raw data. Many people outside BPS are capable of conducting analysis but are unable to access the data. To increase the usefulness of the data, we introduce AnalevR, an online R-based analysis environment that allows anyone to perform analyses and create visualizations without having to own the original raw data. It uses a notebook-like interface where users type commands and the output appears below them. BPS provides the data and the analysis service (including the R modules), which are held in cloud storage and can be explored via a helper function. Users remotely execute R commands and perform analysis inside workspaces. A user can create up to 10 workspaces, each representing a different session. Each saved session preserves the user-defined variables and functions for future use. This breakthrough will raise users’ involvement in employing BPS’ data and increase statistical quality in Indonesia.
14:40 Poster lightning Yan Holtz AUD From Data to Viz visualisation, community/education Simon Jackson
Selecting the right graphic type is a common task for a data scientist. On a daily basis, an R user deals with a data frame and must decide which visualization is the most appropriate to represent it. The task is not easy. The data scientist must:
- Know the broad spectrum of visualization types
- Figure out which dataviz is doable given the dataset
- Try several (or all) of them
- Find the code to create the charts
- Avoid the common caveats associated with the selected option
Data-to-viz.com is a new website designed to meet these needs. It displays an interactive decision tree. The user describes their dataset, which leads them to a set of appropriate graphic types. A description is provided for each, explaining its pros and cons. Links to the R and Python graph galleries are provided, which allows users to get the corresponding code in seconds. The complete decision tree is also available in a static version as a poster. The project has not been released yet due to its potential announcement at the useR conference.
14:40 Poster lightning Goknur Giner AUD Pathway-VisualiseR visualisation, web app, community/education, bioinformatics, networks, Bioconductor Simon Jackson
Statistical modelling in any genomic research produces sets of genes and individual biomarkers which require investigation. Further exploration of those sets of biomarkers is a prominent step towards discovering the source of a biological problem. Furthermore, understanding the collective behaviour of genes has been shown to provide valuable insights into the triggers of many human diseases. We have developed an RShiny application that provides an interface for researchers, enabling them to discover the interactions between their genes and biological pathways. This application allows users to inquire into the details of their research through Gene Ontology (GO) analysis with interactive network visualisations and links to related web sites.
14:40 Poster lightning Edgar Santos-Fernandez AUD ActisoftR: a toolbox for processing and visualizing scored actigraphy data. visualisation, applications, space/time Simon Jackson
Actigraphy is a cost-effective and convenient tool for activity-based monitoring. It allows studying sleep/wake patterns and identifying disorders in sleep research. ActisoftR was designed to parse actigraphy outputs and summarise scored data across user-defined intervals. It consists of several functions for importing data, generating reports and statistics, and visualizing data.
14:40 Poster lightning Hannah Coughlan AUD Integration and visualisation of high throughput genomics data with R visualisation, bioinformatics Simon Jackson
As sequencing of DNA becomes a more affordable option for studying genomics research questions, more data types are becoming available. One interesting and complex data type, chromosome conformation capture, aims to interrogate the 3D structure of chromosomes. The same 2 metres of DNA is compacted into the nucleus of every human cell regardless of the cell function, and the genes that determine the cell function are controlled by strict regulation. Chromosome structure can be a mechanism of gene regulation; more specifically, DNA loops form to spatially associate genes with regulators that are not adjacent in the linear genome. However, the techniques to study chromosome structure (Hi-C) are often limited by spatial resolution and can be difficult to interpret. Other genomics techniques that study gene expression (RNA-seq) and regulation (ChIP-seq) cannot discover distant regulators. Here we will show how different types of genomics data can be integrated to investigate long-distance gene regulation. Using R and Bioconductor packages (edgeR, Sushi, GenomicRanges and limma), we can integrate data into a common framework that can be visualised to allow for biological interpretation.
14:40 Poster lightning Vipavee Trivittayasil AUD MovingBubbles : Animated d3 bubble chart visualisation Simon Jackson
Webpage: https://github.com/chengvt/MovingBubbles
A line graph is usually used to portray time-series data. However, when there are many time series, the graph can become cluttered and thus difficult to read. To portray time-series data with many samples in a more intuitive way, a package for plotting an animated bubble chart was developed. A bubble chart here refers to a chart which represents one quantity, with the bubbles packed close together to use the space efficiently. The quantity each bubble represents is proportional to the bubble area. There is already a package to plot a static bubble chart in R (Joe Chang et al., n.d.). The MovingBubbles package provides a method to add second and third information dimensions to the chart by means of animation and color. The animation portrays changes in the data over time and also helps make the plot more engaging for viewers. The plotting and the transitions between frames are handled by the d3 library (Bostock et al., 2011). The package uses the htmlwidgets framework (Vaidyanathan et al., 2017) to bridge JavaScript and R.
14:40 Poster lightning Nicholas Spyrison AUD spinifex: visualizing local structure of higher dimensions visualisation, community/education Simon Jackson
Visualizing data in higher dimensions (more than p = 3 numeric dimensions) can be messy and unintuitive. Here we explain the methodology and explore the functionality of tourr. We offer a vignette for its use and contrast it with other higher-dimensional visualization methods. The R package tourr (Wickham et al., 2011) gives us the means to animate the projection as we rotate through p dimensions. This is achieved by varying the contributions from each dimension via a random walk, a predefined path, or the optimization of an index.
Wickham, H., D. Cook, and H. Hofmann (2015). Visualising statistical models: Removing the blindfold (with discussion). Statistical Analysis and Data Mining 8(4), 203–225.
Wickham, H., D. Cook, H. Hofmann, and A. Buja (2011). tourr: An R package for exploring multivariate data with projections. Journal of Statistical Software 40(2), http://www.jstatsoft.org/v40.
Asimov, D. (1985). The Grand Tour: A Tool for Viewing Multidimensional Data. SIAM Journal on Scientific and Statistical Computing 6(1), 128–143.
14:40 Poster lightning Motoyuki Oki AUD Time Series Digger : Automatic time series analysis for data science in R visualisation, data mining, space/time Simon Jackson
Exploratory Data Analysis (EDA) is an essential process for understanding time series and conducting useful feature extraction. We introduce "Time Series Digger", which provides automatic and programmable EDA in R to accelerate time series analysis for data scientists. Time Series Digger is now deployed on the data science platform at NTT Communications, which is one of the largest Internet service providers in Japan. We show its effectiveness with real use cases. Time Series Digger consists of three parts. First, it provides automatic and comprehensive time series visualization at various time intervals to help understand the time series. Second, it provides basic and programmable feature extraction from univariate or multivariate time series. Third, it applies the features to multiple time series anomaly detection methods. Many R packages exist for handling time series, including forecasting and anomaly detection methods. To the best of our knowledge, no package has focused on an efficient and comprehensive analysis process, especially for multiple time series. Our package and contributions should work effectively for R users who face similar problems.
14:40 Poster lightning Volha Tryputsen AUD Antibody characterization with next generation sequencing using Group My Abs shiny app visualisation, algorithms, models, data mining, applications, web app, reproducibility, bioinformatics Simon Jackson
Next-generation sequencing (NGS), phage display technology and high-throughput capacities enable biologists in drug discovery to characterize antibodies (Abs) based on their HCDR3 sequences and further group them into families before moving to the hit-to-lead stage of drug discovery and development. This enables diversification of the Ab portfolio and ensures backup options if an Ab candidate fails. However, there was no method or software available in-house to support Ab discovery with the capacity to apply biophysical rules to classify the sequences. The Shiny app "Group My Abs" was developed to apply biophysical properties for Ab characterization to the NGS data. Several Multiple Sequence Alignment algorithms implemented in the app enable sequence comparability. A method was developed to evaluate differences between comparable sequences and subsequently classify sequences into families. The app provides custom-made and interactive data visualization, enables refined Ab classification in a mathematical manner, considerably increases efficiency and ensures reproducibility. All of this decreases bias and enables informed decision making during the hit-to-lead stage in biologics drug discovery.
14:40 Poster lightning Gabriel Domingo AUD Use of R in Antitrust: The case of the Philippine Competition Commission visualisation, models, applications, reproducibility, community/education, Antitrust Simon Jackson
The use of quantitative analysis of economic data in antitrust is well established. For a new competition agency such as the Philippine Competition Commission, the challenges of adopting these analyses are daunting. From competition enforcement to merger control, the R language empowers our teams of economists in their work. We use R's flexibility and power to clean and model price and demand data when investigating anti-competitive agreements or mergers, and abuses of dominant position. R allows our analysts to rule out various theories of harm in the market, while re-focusing our efforts on specific areas of concern. We use several packages for data modeling, but dplyr and antitrust are particularly useful. When defining the geographic market in merger control, we use R's mapping and plotting packages ggmap, ggplot, leaflet and osrm. These tools determine the scope of a market by pinpointing supplier and consumer locations, illustrating routes, and computing distances and travel times. Finally, we will discuss our efforts to expose more of our economists to R via hands-on training sessions in small teams, and our considerations in using Rmarkdown to standardize our reports.
14:40 Poster lightning Adam Gruer AUD Using R and Process Control Charts to Help Hospital Management See The Woods For The Trees visualisation Simon Jackson
The poster describes a project undertaken with the Head of Surgery to introduce Process Control tools and methods to a broader population of hospital managers and executives. It was observed that existing reporting of hospital KPIs was encouraging management and other staff to waste time, energy and analytic resources on variances that were not outside the range of random variance. This is an inefficient use of limited resources. The project involved developing RMarkdown reports and flexdashboards to visualise the variance in the processes being monitored. CRAN packages such as qicharts2 and the tidyverse were selected as useful tools for completing the project, which also involved consulting literature on Process Control and Lean methodology and contacting other R users in health systems such as the NHS in the UK. Important topics such as user interface (UI), user experience (UX), and communication, promotion and education programmes also needed to be considered, and the poster will highlight how other departments in the hospital with experience in these areas were consulted. This poster discusses the technical and cultural challenges faced and the solutions that were developed.
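To make the idea of variance "outside the range of random variance" concrete, here is a minimal sketch of one common Process Control tool, the individuals (XmR) chart. It is written in Python and is not the code used in the project; the function names and the KPI values are hypothetical. The constant 2.66 is the standard XmR scaling factor applied to the mean moving range.

```python
def xmr_limits(baseline):
    """Return (lower, centre, upper) limits for an individuals (XmR) chart."""
    if len(baseline) < 2:
        raise ValueError("need at least two baseline observations")
    centre = sum(baseline) / len(baseline)
    # Moving range: absolute difference between consecutive observations.
    moving_ranges = [abs(b - a) for a, b in zip(baseline, baseline[1:])]
    mean_moving_range = sum(moving_ranges) / len(moving_ranges)
    # 2.66 scales the mean moving range to approximate three-sigma limits.
    spread = 2.66 * mean_moving_range
    return centre - spread, centre, centre + spread

def signals(values, lower, upper):
    """Return the observations that fall outside the control limits."""
    return [v for v in values if v < lower or v > upper]

# Hypothetical weekly KPI values, for illustration only.
baseline = [10, 12, 11, 13, 12]
lower, centre, upper = xmr_limits(baseline)
flagged = signals([12, 16, 9], lower, upper)
print(flagged)
```

Observations inside the limits are treated as routine variation that does not warrant investigation; only the flagged points would be escalated.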
14:40 Poster lightning Sharon Lee AUD Shiny EMMIXskew for symmetric and asymmetric mixture modelling visualisation, algorithms, models, data mining, applications, web app, multivariate Simon Jackson
EMMIXskew allows users to easily fit univariate and multivariate mixture models and perform inference. Designed with a focus on analyzing data that exhibit non-normal features such as asymmetry and heavy tails, EMMIXskew offers the option of fitting mixtures of skew normal and skew t-distributions in addition to traditional normal and t-mixture models. These models have received increasing attention in recent years due to their power and flexibility, as witnessed by many applications in fields ranging from biomedicine, imaging and the social sciences to finance. In this talk, we introduce the EMMIXskew package and its accompanying Shiny app. Its main functionalities will be demonstrated with real-life applications. We will also cover various useful tools included in the package, such as density calculation, mode calculation, random sample generation, error rate calculation, and contour visualization. With the Shiny interface, analysis using these models will become much more accessible for all practitioners and R users.
14:40 Poster lightning Alan Pearse AUD SSNDesign -- An R Package for Optimal Designs on Spatial Stream Networks models, experimental design; optimal design; spatial stream networks Simon Jackson
Optimal experimental designs maximise the information gained from limited samples. Optimal designs are paramount when precise predictions or parameter estimates are required but data collection is resource intensive. R packages exist to find optimal designs for a few settings, e.g. AlgDesign and OPDOE. However, to our knowledge, there are no R packages for optimal design problems on stream and river networks. Stream networks pose a unique design challenge due to their branching structure and the accumulation of flow as water moves downstream. Given these statistical challenges and the importance of healthy freshwater ecosystems, computational tools for designing effective monitoring programs on streams, with minimal cost for maximum impact, are sorely needed. Here, we present SSNdesign, an R package for finding optimal designs on stream networks. This package relies on the S4 SpatialStreamNetwork object and the models implemented in the package SSN. It has functionality for finding optimal designs for estimating model parameters and making predictions on stream networks. Users can also define utility functions for their own design problems.
14:40 Poster lightning Tatiana Marci AUD Using Factor Mixture Analysis in Developmental Psychology: An Application to Research on Parent-child Attachment applications, Factor Mixture Analysis, heterogeneous populations Simon Jackson
Factor Mixture Analysis (FMA) is a useful tool for exploring data from potentially heterogeneous populations using a hybrid of categorical and continuous latent variables. Briefly, this approach allows researchers to explore the underlying factorial structure of a theoretical construct while simultaneously detecting unobserved subgroups in the study population. Thus, FMA becomes particularly useful for investigating psychological phenomena assumed to be categorical and continuous at the same time, and when the source of heterogeneity in the population under consideration may not be directly observed. Despite these advantages, its application within the psychological sciences remains limited. The current study aims to illustrate the utility of FMA within the context of attachment research in developmental psychology. By presenting a real data example concerning the latent structure of attachment in middle childhood, this work provides a practical example of FMA application using the FactMixtAnalysis package (Viroli, 2011). Furthermore, we will describe ad hoc R functions to assist in the interpretation of results. Benefits and drawbacks of applying FMA to this research area will be discussed.
14:40 Poster lightning Gi-Seop Lee AUD Evaluations of the machine learning models in the coastal habitat classification applications, multivariate, performance Simon Jackson
The ‘short-neck clam’ (Ruditapes philippinarum) is one of the most important commercial shellfish. Production of this shellfish has been severely reduced by the unexpected invasion of the ‘Japanese mud shrimp’ (Upogebia major) in some Korean tidal flats. It is therefore important to know the habitat suitability for both organisms. In this study, diverse simulations of habitat classification were carried out using the available habitat data for U. major and R. philippinarum. Supervised learning methods such as decision tree, k-Nearest Neighbor (kNN), Support Vector Machine (SVM) and Artificial Neural Network (ANN) were used with the three optimal clusters defined by the R package ‘NbClust’. The ‘bagging’ and ‘adaboost’ algorithms were applied to the decision trees. Based on the simulation results, the prediction accuracy of each model on the test data is estimated to be about 55-65%. This is considered to be due to outlier effects and to overfitting caused by the relatively small number of samples. In many biological datasets, these remain challenging problems.
14:40 Poster lightning Koji Makiyama AUD Magic Functions to Obtain Results from 'for' Loops in R interfaces Simon Jackson
The function 'for' is one of the most popular functions in R. As you know, it is used to create loops. We think 'for' loops in R have one inconvenience: the results they produce are not kept. So we have created a package to store the results automatically. To do this, you only need to cast a one-line spell, 'magic_for'. For instance, calculating the squared values of 1 to 3 using a 'for' loop and the 'print' function is very easy. However, changing such code to store the displayed results is too much hassle. You must prepare containers of the correct length for storing the results and change the 'print' calls to assignment statements. Moreover, in these or more troublesome situations, such as when you have to store many variables, the code grows more complex. The 'magicfor' package resolves this problem while keeping the code readable. You just add one line, 'magic_for()', before the 'for' loop. Once you have called 'magic_for', you can execute 'for' as usual, and the results will be stored in memory automatically. You can obtain the results using 'magic_result'. We introduce how to use the magic.
14:40 Poster lightning Yuya Matsumura AUD Easy Writing of Bayesian Optimization for Machine Learning models, performance, big data Simon Jackson
In many machine learning algorithms, tuning hyperparameters is one of the most important points. Bayesian optimization (Shahriari et al., 2015) is a method for tuning hyperparameters faster and more efficiently than grid search, which evaluates every grid point in the parameter space. In R, combining the rBayesianOptimization package with machine learning packages such as e1071 or ranger enables Bayesian optimization for hyperparameter tuning. However, writing code for Bayesian optimization with those packages was troublesome, because we had to build a complicated function to maximize and then write the code to execute the Bayesian optimization. This is very confusing and makes trial and error difficult. The MlBayesOpt package (Matsumura, 2017, https://cran.r-project.org/web/packages/MlBayesOpt/index.html) makes this work very convenient to write. To execute Bayesian optimization, this package requires only a data frame, the column name of the label to classify (or regress), and the column names of the feature vectors. For example, a script that takes 32 lines using the combination of packages takes 5 lines using the MlBayesOpt package.
14:40 Poster lightning Jeremy Forbes AUD Using Australian census data to describe electorates' socio-economic profiles at the time of a federal election. models, reproducibility, community/education Simon Jackson
In Australia, the House of Representatives is divided into 150 seats, each representing an electoral division, and each division's boundaries are revised periodically. Federal elections generally occur every three years, but electorate boundaries can change in between elections. The Australian Bureau of Statistics conducts a Census of Population and Housing every five years, and updates its record of electorate boundaries in July each year, in accordance with the official electoral commission's boundaries. This research looks at matching and estimating the socio-economic profile of each electorate at the time of a federal election. To accurately estimate profiles, each election is initially paired with the Census data taken closest to the election date. Many elections do not occur in the same year as a census, and are matched with data from nearby years. Differences between these dates are adjusted for using spatial analysis and time-series forecasts. This work is an update for the eechidna package, which contains Australian census and election data, and tools for visualisation and analysis. PS. An update can be provided closer to the date. Research has only recently commenced.
14:40 Poster lightning Jessica Bagnall AUD Analysing the voting patterns of the Senate of the 45th Australian Parliament via fully-visible Boltzmann machines algorithms, models, applications, networks Simon Jackson
The 45th Australian Senate, formed following the 2016 federal election, contains the largest crossbench since the expansion of the Senate in 1950. The crossbench is made up of 20 Senators drawn from 7 minor parties. We analyse the party-level voting patterns of the parties of the Senate of the 45th Australian parliament by modelling the crossbench via a fully-visible Boltzmann machine (FVBM), a probabilistic graphical network that arises from the neural networks literature, in order to determine the influence that the parties have on one another, and to evaluate the relative pro- (or anti-) government stances of those parties. We describe the required estimations and computations, which are performed via our R package BoltzMM, available at github.com/andrewthomasjones/BoltzMM. The package implements the MM algorithm for maximum pseudolikelihood estimation of FVBM models of Nguyen and Wood (Neural Computation, 2016), and uses the asymptotic normality results of Nguyen and Wood (IEEE T Neural Networks and Learning Systems, 2016) for inferential computations.
14:40 Poster lightning Florian Schwendinger AUD Readability Prediction in R models, applications, text analysis/NLP Simon Jackson
Readability prediction is commonly used to assess the comprehensibility of a given text. Early approaches focus on the development of readability scores (e.g. Fog-Index, Dale-Chall, Flesch Reading Ease). Most of these readability scores are based on the number of words, number of sentences, number of syllables and number of words which are not present in a predefined list. Current research in the field of linguistics suggests that these scores are often misleading and that models which combine Natural Language Processing (NLP) and statistical learning should be used instead. This research presents how state-of-the-art readability prediction can be implemented in R by utilizing the tools available from the StanfordCoreNLP package. The StanfordCoreNLP package and its companions can be installed from https://datacube.wu.ac.at/.
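For readers unfamiliar with the classic scores the abstract contrasts with NLP-based models, the sketch below computes the Flesch Reading Ease score, 206.835 − 1.015 × (words per sentence) − 84.6 × (syllables per word). It is written in Python and is unrelated to the StanfordCoreNLP tooling described in the abstract; the function names are hypothetical, and the syllable counter is a crude vowel-group heuristic assumed purely for illustration.

```python
import re

def count_syllables(word):
    """Approximate syllables as runs of vowels (a rough heuristic)."""
    vowel_groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(vowel_groups))

def flesch_reading_ease(text):
    """Flesch Reading Ease: higher scores indicate easier text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word and one sentence")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word

score = flesch_reading_ease("The cat sat on the mat.")
print(round(score, 3))
```

Because the score depends only on surface counts, two texts with the same word and syllable counts receive the same score regardless of vocabulary or syntax, which is one reason such scores can mislead.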
14:40 Poster lightning Aswi Aswi AUD Comparison of different Bayesian spatio-temporal models using R packages models, applications, space/time, CARBayesST Simon Jackson
There is a growing number of packages in R for modelling spatio-temporal data. In this presentation, we will review and compare a number of spatio-temporal Bayesian models using R. We will focus on two R packages, namely R-INLA (Integrated Nested Laplace Approximation) and CARBayesST, and describe the different spatio-temporal models available. We examine six and five Bayesian spatio-temporal models using CARBayesST and R-INLA, respectively. We will illustrate the application of these models and packages through a case study on dengue cases in Makassar, Indonesia. Model performance will be compared using goodness-of-fit measures such as the Deviance Information Criterion (DIC). The computational speed and ease of use of these packages make them a very attractive option for Bayesian spatio-temporal modelling.
14:40 Poster lightning Janek Thomas AUD Automatic gradient boosting algorithms, models, data mining, Automatic Machine Learning Simon Jackson
Well-qualified data scientists are not a dime a dozen. Instead, employees who are not very familiar with data analysis are often called on to do the job. Automatic machine learning can help those people perform predictive modeling with high-performing machine learning tools without having much experience. This is achieved by making those applications parameter-free, i.e. only the data is required as input. Projects like Auto-WEKA or auto-sklearn aim to solve the Combined Algorithm Selection and Hyperparameter optimization (CASH) problem, resulting in a huge optimization space. However, for most real-world applications, only a few different learning algorithms are required to deliver superior performance. autoxgboost takes this idea one step further and reduces the CASH problem to using Gradient Boosting as a single learning algorithm, combined with intelligent model-based hyperparameter tuning. It is based on the R packages mlr, mlrMBO and XGBoost. It also supports categorical variables thanks to a special built-in factor feature encoding. Even though autoxgboost uses only one learner instead of a whole library, it provides comparable or even better performance.
14:40 Poster lightning Awdhesh Yadav AUD Household and Community factor on under-five mortality in India: An application of multilevel cox proportional hazard model multivariate, big data Simon Jackson
The objective of this paper is to determine the importance of community, household and individual level effects on under-five mortality in India. Using data from the latest round of the Demographic Health Survey (DHS) 2005-06, a multilevel Cox proportional hazard analysis was performed on a nationally representative sample. The results indicate that patterns of under-five mortality were clustered within mothers and communities. Community level variables such as region, place of residence, community poverty level, community education level and ethnic fractionalization index significantly determined under-five mortality in India. The risk of under-five deaths was significantly higher for children residing in the North, East and West regions compared to the South region. In addition, communities with a higher proportion of women completing secondary school were significantly more likely to have higher child survival. Household level variables such as religion, caste and wealth index also significantly determined under-five mortality. The results suggest that contextual level factors need to be addressed to tackle under-five mortality in India.
14:40 Poster lightning Jean-Michel Perraud AUD A suite of R packages for hydrological ensemble forecasting using Rcpp space/time, performance, big data, Hydrology, forecast Simon Jackson
Ensemble prediction techniques have been shown to produce more accurate predictions as well as formally quantify prediction uncertainty in a range of scientific applications. We present a suite of libraries for hydrological ensemble forecasts designed for use both in research and operations. The features of the C++ libraries are available from several high-level interactive languages including R. The suite currently comprises three main R packages for rainfall forecast post processing (RPP), semi-distributed ensemble hydrological modelling (SWIFT2) and multi-dimensional ensemble time series (uchronia). The packages are designed to offer concise commands for handling ensemble time series, input/output, model parameterisation and simulation execution. The native libraries purposely have a C API to maximise interoperability and foster a consistent user experience across high-level languages. Rcpp is used for surfacing the features in the R packages. Bespoke code for marshaling data, managing object lifetimes and generating glue code for Rcpp is already open source and suitable for reuse in similar technical contexts.
14:40 Poster lightning Susanna Cramb AUD Bayesian disease mapping in R models, space/time Simon Jackson
Bayesian methods feature prominently in disease mapping, and R has multiple packages designed to enable efficient computation of assorted Bayesian spatial models. Here we examine seven Bayesian models suitable for disease mapping and implement them using the R packages R-INLA, CARBayes, R2WinBUGS and R2jags. Models considered included common approaches such as the BYM model, which smooths estimates over all adjacent areas, through to more recently introduced models that allow for discontinuities between adjacent areas, as well as spline models. Simulated incidence data designed to represent a rare cancer (liver) and a more common cancer (lung) were examined across 2153 areas in Australia. Model performance was compared on goodness of fit measures (WAIC, Moran’s I on residuals), computational time and convergence (Geweke). The packages themselves are also compared in terms of computational time and model flexibility. It is useful to consider several different models to understand the robustness of results when disease mapping. R has the capacity to enable a wide range of models to be considered, with the additional advantage of high quality visualisation of results.
14:40 Poster lightning Rosemary Putler AUD Analysis of EHR Data and Circulating Inflammatory Mediators: Association with Severe Clostridium difficile Infection models, applications, bioinformatics Simon Jackson
Clostridium difficile infection (CDI) is a major healthcare-associated infection, and severe CDI often leads to subsequent recurrence or death. We hypothesized that circulating inflammatory mediators would associate with severity in a prospective cohort of inpatients diagnosed with CDI. An inflammatory mediator panel was performed on collected sera and merged with electronic health record (EHR) data. With these data we show that circulating biomarkers associate not only with severity of the CDI episode, but also with subsequent mortality. Because of the large number of potentially correlated predictors, our data presented a challenge when we set out to identify features to incorporate into an accurate predictive model. We explore the steps performed in this analysis, discussing methodology and decision points, including the management and analysis of EHR data, utilization of dimensionality reduction techniques, and use of existing packages such as vegan, glmnet, and pROC. Through this analysis, we demonstrate how a diverse array of R packages and statistical methodologies, which function in a wide array of use cases, can also be used to answer a complex disease-related question.
14:40 Poster lightning Stuart Davie AUD A data driven approach to generating and scoring B2B leads visualisation, models, applications Simon Jackson
In many industries, companies rely on a sales team to source and qualify leads. Unfortunately, this limits a company's potential leads to those that can be manually processed, while lead qualification is limited by the quality of ad hoc scoring systems. To find leads faster, companies might engage in cold-calling, or blanket email campaigns, both of which are known for their low conversion rates. Here, a data-driven B2B lead generation and qualification solution for the UK market is presented, based on open source data and XGBoost. Our models take into account both general and company specific features, and allow an approximation of the size of market opportunity. Lead reports containing pertinent conversion information are automatically generated using xgboostExplainer and R Markdown. Considerations on feature engineering, and difficulties associated with overfitting, are also discussed.
18:00 Poster Shinichi Takayanagi Foyer How LINE Corp Use R to Compete in a Data-Driven World applications, reproducibility, Practical use of R language in business -
LINE, developed by LINE Corp, is one of the most popular messaging applications in Asia. We continually collect and use logs of user behavior to improve our services. In our data science team, R plays an important role in all stages of data analysis and data science, ranging from exploratory data analysis and modeling to sharing results with colleagues using Shiny. In our poster presentation, we explain how we use R to solve our business problems and share some practical insights for others who want to use R in their business. At first, as our data science team grew, different people developed their own R scripts to solve similar tasks such as database connection. To solve this problem, we started working on an internal R package called "liner", shared through GitHub Enterprise. Now, multiple members often collaborate to improve this package and fix bugs in it. We also give an overview of our R-related data analysis platform along with some useful OSS tools: yanagishima, a tool developed at LINE Corp that helps people run queries easily, and Drone, a Docker-based continuous integration tool that we use for automated unit testing and for deploying the "liner" package to users.
18:00 Poster Mahmoud Ahmed Foyer pcr: an R package for quality assessment, analysis and testing of qPCR data bioinformatics, data-access -
Real-time quantitative PCR (qPCR) is a widely used technique in biomedical research. We developed an R package that implements methods for quality assessment, analysis and statistical significance testing of qPCR data. The Double Delta CT and standard curve models were implemented to quantify the relative expression of target genes from CT values in standard qPCR. In addition, amplification efficiency and curves calculated from serial dilution qPCR experiments are used to assess data quality. Finally, two-group testing and linear models are used to test for statistical significance among conditions. Using two datasets from qPCR experiments, we applied the different quality assessment, analysis and statistical testing methods in the pcr package and compared the results to the original published articles. The pcr package provides an intuitive and unified interface for its main functions, allowing biologists to perform all necessary steps of qPCR analysis and produce graphs in a uniform way.
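The Double Delta CT calculation mentioned in the abstract can be sketched as follows. This is a minimal Python illustration of the standard formula, not the pcr package's interface: the function name, argument names and CT values are hypothetical, and the sketch assumes 100% amplification efficiency (a doubling per cycle).

```python
def relative_expression(ct_target_test, ct_ref_test, ct_target_ctrl, ct_ref_ctrl):
    """Relative expression of a target gene by the Double Delta CT method.

    Assumes perfect amplification efficiency, i.e. the amount of product
    doubles every cycle. All argument names are illustrative.
    """
    # Normalise the target gene to the reference gene within each condition.
    delta_ct_test = ct_target_test - ct_ref_test
    delta_ct_ctrl = ct_target_ctrl - ct_ref_ctrl
    # Compare the test condition with the control condition.
    delta_delta_ct = delta_ct_test - delta_ct_ctrl
    # A lower CT means more starting template, hence the negative exponent.
    return 2.0 ** (-delta_delta_ct)


# Made-up CT values: the target crosses the threshold two cycles earlier
# (relative to the reference gene) in the test condition than in the control.
fold_change = relative_expression(24.0, 18.0, 26.0, 18.0)
print(fold_change)
```

A value above 1 indicates higher relative expression in the test condition than in the control, and a value below 1 indicates lower relative expression.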
18:00 Poster Miles McBain Foyer An R pipeline for creating and hosting collaborative Web VR environments. visualisation, space/time -
We are continuing the rich history of R as a generative tool for data driven documents by developing a capability to generate Virtual Reality (VR) scenes. A VR scene object is provided that can create, configure, and serve multi-user Web VR scenes. Development so far includes tools that harness the R spatial ecosystem to create textured 3D meshes from contour and raster data. Several other VR primitives are provided, including 360 photo 'portals'. The immediate application for this work has been the creation of environments for the calibration of geo-spatial statistical models by expert elicitation. However, we see the ability to generate and host multi-user scenes, complete with facility for activity logging, as a compelling new platform for collecting and visualising experimental data. Given this work is evolving rapidly, a poster would be the perfect tool to engage attendees in a variety of conversations about the API, the technology stack, and applications. If given a poster slot at useR!2018, we would like to bring some VR hardware to the session, to give attendees a chance to experience some of the virtual environments our tools have generated.
18:00 Poster Kim Ki-Yeol Foyer Squamous cell carcinoma analysis with R bioinformatics -
Squamous cell carcinoma (SCC) is the most common histological type of head and neck cancer and cervical cancer. Carcinogenesis in these two types of cancer demonstrates similar multistep progression. The purpose of the study is to identify the significant consensus gene modules of these two cancers. We used a publicly available expression dataset for the study. The dataset included head and neck cancer (42 cancer samples and 14 normal samples) and cervical cancer (20 cancer samples and 8 normal samples). We used only human papilloma virus (HPV) 16 positive samples to exclude bias arising from different types of HPV. We identified consensus modules of the two types of cancer and explored the biological functions of each module using an annotation tool. We identified 8 consensus gene modules of head and neck cancer and cervical cancer. Each module was well preserved between the two types of cancer. The modules included significant biological functions, including ATP binding and extracellular exosome. Consensus gene module identification is expected to contribute to more personalized management of multiple cancer types.
18:00 Poster Erika Siregar Foyer AnalevR: An Interactive R-Based Analysis Environment for Utilizing BPS-Statistics Indonesia Data visualisation, web app, community/education, networks, big data, analysis, collaboration, cloud computing, analysis as a service, bps, statistics Indonesia, analysis environment, remote data analysis -
As Indonesia’s national statistical agency, BPS-Statistics Indonesia produces a massive amount of strategic data every year. However, these data are still underutilized by other parties (governments, researchers, etc.) due to technical limitations and the exclusivity and locality of the raw data. Numerous people outside BPS are capable of conducting analysis but are unable to access the data. To increase the usefulness of the data, we introduce AnalevR, an online R-based analysis environment that allows anyone to perform analyses and create visualizations without having to own the original raw data. It uses a notebook-like interface where users can type commands and the output appears below them. BPS provides the data and analysis service (including the R modules), which are held in cloud storage and can be explored via the helper function. Users remotely execute R commands and perform analysis inside workspaces. A user can create up to 10 workspaces, each representing a different session. Each saved session preserves the user-defined variables and functions for future use. This breakthrough will raise users’ involvement in employing BPS’ data and increase statistical quality in Indonesia.
18:00 Poster Yan Holtz Foyer From Data to Viz visualisation, community/education -
Selecting the right graphic type is a common task for a data scientist. On a daily basis, an R user deals with a data frame and must decide which visualization is the most appropriate to represent it. The task is not easy. The data scientist must: know the broad spectrum of visualization types; figure out which dataviz is doable given the dataset; try several (or all) of them; find the code to create the charts; and avoid the common caveats associated with the selected option. Data-to-viz.com is a new website that aims to meet these needs. It displays an interactive decision tree. The user describes their dataset, which leads them to a set of appropriate graphic types. A description is provided for each, explaining its pros and cons. Links to the R and the Python graph galleries are provided, so the corresponding code can be obtained in seconds. The complete decision tree is also available in a static version as a poster. The project has not been released yet due to its potential announcement at the useR conference.
18:00 Poster Goknur Giner Foyer Pathway-VisualiseR visualisation, web app, community/education, bioinformatics, networks, Bioconductor -
Statistical modelling in any genomic research produces sets of genes and individual biomarkers which require investigation. Further exploration of those sets of biomarkers is a prominent step towards discovering the source of a biological problem. Furthermore, understanding the collective behaviour of genes has been shown to provide valuable insights into the triggers of many human diseases. We have developed an RShiny application that provides an interface for researchers, enabling them to discover the interaction between their genes and biological pathways. This application allows users to inquire into details of their research through Gene Ontology (GO) analysis with interactive network visualisations and links to related web sites.
18:00 Poster Edgar Santos-Fernandez Foyer ActisoftR: a toolbox for processing and visualizing scored actigraphy data. visualisation, applications, space/time -
Actigraphy is a cost-effective and convenient tool for activity-based monitoring. It allows studying sleep/wake patterns and identifying disorders in sleep research. ActisoftR was designed for parsing actigraphy outputs and summarising scored data across user-defined intervals. It consists of several functions for importing data, generating reports and statistics, and visualizing data.
18:00 Poster Hannah Coughlan Foyer Integration and visualisation of high throughput genomics data with R visualisation, bioinformatics -
As sequencing of DNA becomes a more affordable option for studying genomics research questions, more data types are becoming available. One interesting and complex data type, chromosome conformation capture, aims to interrogate the 3D structure of chromosomes. The same 2 metres of DNA is compacted into the nucleus of every human cell regardless of the cell function, and the genes that determine the cell function are controlled by strict regulation. Chromosome structure can be a mechanism of gene regulation; more specifically, DNA loops form to spatially associate genes with regulators that are not adjacent in the linear genome. However, the techniques to study chromosome structure (Hi-C) are often limited by spatial resolution and can be difficult to interpret. Other genomics techniques that study gene expression (RNA-seq) and regulation (ChIP-seq) cannot discover distant regulators. Here we will show how different types of genomics data can be integrated to investigate long distance gene regulation. Using R and Bioconductor packages (edgeR, Sushi, GenomicRanges and limma) we can integrate data into a common framework that can be visualised to allow for biological interpretation.
18:00 Poster Vipavee Trivittayasil Foyer MovingBubbles : Animated d3 bubble chart visualisation -
Webpage: https://github.com/chengvt/MovingBubbles A line graph is usually used to portray time-series data. However, when there are many time series, the graph can become cluttered and thus difficult to read. In order to portray time-series data with many samples in a more intuitive way, a package for plotting an animated bubble chart was developed. A bubble chart here refers to a chart that represents one quantity, arranged so that the bubbles are packed closely together to use space efficiently. The quantity each bubble represents is proportional to the bubble area. There is already a package to plot a static bubble chart in R (Joe Chang et al., n.d.). The MovingBubbles package provides a method to add second and third information dimensions to the chart by means of animation and color. The animation portrays changes in data with time and also helps make the plot more engaging to viewers. The plotting and transitions between frames are handled by the d3 library (Bostock et al., 2011). The package uses the htmlwidgets framework (Vaidyanathan et al., 2017) to bridge JavaScript and R.
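Because each bubble's area, not its radius, is proportional to the quantity it represents, radii have to scale with the square root of the values. The Python sketch below illustrates that scaling only. It is not MovingBubbles code, and the function name and the `max_radius` parameter are hypothetical.

```python
import math


def bubble_radii(values, max_radius=50.0):
    """Return radii such that each bubble's area is proportional to its value.

    The largest value gets radius `max_radius`; values must be non-negative.
    """
    largest = max(values)
    if largest <= 0:
        # Nothing to draw: every bubble collapses to a point.
        return [0.0 for _ in values]
    # Area grows with radius squared, so the radius grows with sqrt(value).
    return [max_radius * math.sqrt(v / largest) for v in values]


# A value four times smaller yields a radius only two times smaller.
radii = bubble_radii([100, 25, 4])
print(radii)
```

Scaling the radius directly with the value would exaggerate large quantities, since the perceived size of a bubble follows its area.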
18:00 Poster Nicholas Spyrison Foyer spinifex: visualizing local structure of higher dimensions visualisation, community/education -
Visualizing data in higher dimensions (more than p = 3 numeric dimensions) can be messy and unintuitive. Here we explain the methodology and explore the functionality of tourr. We offer a vignette for use and contrast it with other higher dimensional visualization methods. The R package tourr (Wickham, Cook, Hofmann, and Buja, 2011) gives us the means to animate the projection as we rotate through p dimensions. This is achieved by varying the contributions from each dimension, via a random walk, a predefined path, or optimization of an index. Wickham, H., D. Cook, and H. Hofmann (2015). Visualising statistical models: Removing the blindfold (with discussion). Statistical Analysis and Data Mining 8(4), 203–225. Wickham, H., D. Cook, H. Hofmann, and A. Buja (2011). tourr: An R package for exploring multivariate data with projections. Journal of Statistical Software 40(2), http://www.jstatsoft.org/v40. Asimov, D. (1985). “The Grand Tour: A Tool for Viewing Multidimensional Data.” SIAM Journal on Scientific and Statistical Computing, 6(1), 128–143.
18:00 Poster Motoyuki Oki Foyer Time Series Digger : Automatic time series analysis for data science in R visualisation, data mining, space/time -
Exploratory Data Analysis (EDA) is an essential process for understanding time series and conducting useful feature extraction. We introduce "Time Series Digger", which provides automatic and programmable EDA in R to accelerate time series analysis for data scientists. Time Series Digger is now deployed on the data science platform at NTT Communications, which is one of the largest Internet service providers in Japan. We show its effectiveness with real use cases. Time Series Digger consists of three parts. First, it provides automatic and comprehensive time series visualization over various time intervals to help understand the time series. Second, it provides basic and programmable feature extraction from univariate or multivariate time series. Third, it applies the features to multiple time series anomaly detection methods. A number of R packages for handling time series exist, including forecasting and anomaly detection methods. To the best of our knowledge, no packages have focused on an efficient and comprehensive analysis process, especially for multiple time series. Our package and contributions should work effectively for R users who face similar problems.
18:00 Poster Volha Tryputsen Foyer Antibody characterization with next generation sequencing using Group My Abs shiny app visualisation, algorithms, models, data mining, applications, web app, reproducibility, bioinformatics -
Next-generation sequencing (NGS), phage display technology and high throughput capacities enable biologists in drug discovery to characterize antibodies (Abs) based on their HCDR3 sequences and further group them into families before moving to the hit-to-lead stage of drug discovery and development. This enables diversification of the Ab portfolio and ensures backup options if an Ab candidate fails. However, there was no method or software available in-house to support Ab discovery with the capacity to apply biophysical rules to classify the sequences. The Shiny app "Group My Abs" was developed to apply biophysical properties for Ab characterization to the NGS data. Several Multiple Sequence Alignment algorithms implemented in the app enable sequence comparability. A method was developed to evaluate differences between comparable sequences and subsequently classify sequences into families. The app provides custom-made and interactive data visualization, enables refined Ab classification in a mathematical manner, considerably increases efficiency and ensures reproducibility. All of this decreases bias and enables informed decision making during the hit-to-lead stage in biologics drug discovery.
18:00 Poster Gabriel Domingo Foyer Use of R in Antitrust: The case of the Philippine Competition Commission visualisation, models, applications, reproducibility, community/education, Antitrust -
The use of quantitative analysis of economic data in antitrust is well established. For a new competition agency such as the Philippine Competition Commission, the challenges of adopting these analyses are daunting. From competitive enforcement to merger control, the R language empowers our teams of economists in their work. We use R's flexibility and power to clean and model price and demand data when investigating anti-competitive agreements or mergers, and abuses of dominant position. R allows our analysts to rule out various theories of harm in the market, while re-focusing our efforts on specific areas of concern. We use several packages for data modeling, but dplyr and antitrust are particularly useful. When defining the geographic market in merger control, we use R's mapping and plotting packages ggmap, ggplot, leaflet and osrm. These tools determine the scope of a market by pinpointing supplier and consumer locations, illustrating routes, and computing distances and travel times. Finally, we will discuss our efforts to expose more of our economists to R via hands-on training sessions in small teams, and our considerations in using R Markdown to standardize our reports.
18:00 Poster Adam Gruer Foyer Using R and Process Control Charts to Help Hospital Management See The Woods For The Trees visualisation -
The poster describes a project undertaken with the Head of Surgery to introduce Process Control tools and methods to a broader population of hospital managers and executives. It was observed that existing reporting of hospital KPIs was encouraging management and other staff to waste time, energy and analytic resources on variances that were not outside the range of random variance. This is an inefficient use of limited resources. The project involved developing RMarkdown reports and flexdashboards to visualise the variance in the processes being monitored. CRAN packages such as qicharts2 and the tidyverse were selected as useful tools for completing the project, alongside consulting literature on Process Control and Lean methodology and contacting other R users in health systems such as the NHS in the UK. Important topics such as user interface (UI), user experience (UX) and communication, promotion and education programmes also needed to be considered, and the poster will highlight how other departments in the hospital with experience in these areas were consulted. This poster discusses the technical and cultural challenges faced and the solutions that were developed.
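The distinction between random variance and variance worth investigating can be sketched with a simplified control chart rule. The Python example below is an illustration under stated assumptions, not the project's code and not the qicharts2 interface: it places control limits three sample standard deviations either side of the baseline mean (production charts often estimate the spread from moving ranges instead), and all function names and KPI values are made up.

```python
from statistics import mean, stdev


def control_limits(baseline, sigmas=3.0):
    """Return (lower, centre, upper) control limits from baseline observations."""
    centre = mean(baseline)
    spread = stdev(baseline)  # sample standard deviation of the baseline
    return centre - sigmas * spread, centre, centre + sigmas * spread


def special_cause_points(baseline, new_values, sigmas=3.0):
    """Return (index, value) pairs of new observations outside the limits."""
    lower, _, upper = control_limits(baseline, sigmas)
    return [(i, v) for i, v in enumerate(new_values) if v < lower or v > upper]


# Hypothetical weekly KPI values, e.g. minutes of theatre delay.
baseline = [52, 48, 50, 51, 49, 50, 47, 53]
signals = special_cause_points(baseline, [51, 46, 62, 49])
print(control_limits(baseline))
print(signals)
```

Only points returned by `special_cause_points` would merit investigation under this rule; the remaining week-to-week movement is treated as common-cause variation.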
18:00 Poster Sharon Lee Foyer Shiny EMMIXskew for symmetric and asymmetric mixture modelling visualisation, algorithms, models, data mining, applications, web app, multivariate -
EMMIXskew allows users to easily fit univariate and multivariate mixture models and perform inference. Designed with a focus on analyzing data that exhibit non-normal features such as asymmetry and heavy tails, EMMIXskew offers the option of fitting mixtures of skew normal and skew t-distributions in addition to traditional normal and t-mixture models. These models have received increasing attention in recent years due to their power and flexibility, as witnessed by many applications in fields ranging from biomedicine, imaging and social sciences to finance. In this talk, we introduce the EMMIXskew package and its accompanying Shiny app. Its main functionalities will be demonstrated with real-life applications. We will also cover various useful tools included in the package, such as density calculation, mode calculation, random sample generation, error rate calculation, and contour visualization. With the Shiny interface, analysis using these models will become much more accessible for all practitioners and R users.
Time Session Presenter Venue Title Keywords Chair
12:00 Poster Alan Pearse Foyer SSNdesign -- An R Package for Optimal Designs on Spatial Stream Networks models, experimental design; optimal design; spatial stream networks -
Optimal experimental designs maximise the information gained from limited samples. Optimal designs are paramount when precise predictions or parameter estimates are required but data collection is resource intensive. R packages exist to find optimal designs for a few settings, e.g. AlgDesign and OPDOE. However, to our knowledge, there are no R packages for optimal design problems on stream and river networks. Stream networks provide a unique design challenge due to their branching structure and flow accumulation as water moves downstream. Given these statistical challenges and the importance of healthy freshwater ecosystems, computational tools for designing effective monitoring programs on streams with minimal cost for maximum impact are sorely needed. Here, we present SSNdesign, an R package for finding optimal designs on stream networks. This package relies on the S4 SpatialStreamNetwork object and models implemented in the package SSN. It has functionality for finding optimal designs for estimating model parameters and making predictions on stream networks. Users can also define utility functions for their own design problems.
12:00 Poster Tatiana Marci Foyer Using Factor Mixture Analysis in Developmental Psychology: An Application to Research on Parent-child Attachment applications, Factor Mixture Analysis, heterogeneous populations -
Factor Mixture Analysis (FMA) is a useful tool for exploring data from potentially heterogeneous populations using a hybrid of categorical and continuous latent variables. Briefly, this approach allows one to explore the underlying factorial structure of a theoretical construct while simultaneously detecting unobserved subgroups in the study population. Thus, FMA becomes particularly useful for investigating psychological phenomena assumed to be categorical and continuous at the same time, and when the source of heterogeneity in the considered population may not be directly observed. Despite these advantages, its application within the psychological sciences remains limited. The current study aims to illustrate the utility of FMA within the context of attachment research in developmental psychology. By presenting a real data example concerning the latent structure of attachment in middle childhood, this work provides a practical example of FMA application using the FactMixtAnalysis package (Viroli, 2011). Furthermore, we will describe ad hoc R functions to assist in the interpretation of results. Benefits and drawbacks of applying FMA to this research area will be discussed.
12:00 Poster Gi-Seop Lee Foyer Evaluations of the machine learning models in the coastal habitat classification applications, multivariate, performance -
The ‘short-neck clam’ (Ruditapes philippinarum) is one of the most important commercial shellfish. Production of this shellfish has been severely reduced due to the unexpected invasion of the ‘Japanese mud shrimp’ (Upogebia major) in some Korean tidal flats. Thus, it is important to understand the habitat suitability for both organisms. In this study, diverse simulations of habitat classification were carried out using the available habitat data for U. major and R. philippinarum. Supervised learning methods such as decision tree, k-Nearest Neighbor (kNN), Support Vector Machine (SVM) and Artificial Neural Network (ANN) were used with the three optimal clusters defined by the R package ‘NbClust’. The ‘bagging’ and ‘adaboost’ algorithms were applied to the decision trees. Based on the simulation results, the prediction accuracy of each model on the test data is estimated to be about 55-65%. This is considered to be due to outlier effects and to overfitting caused by the relatively small number of samples. For many biological datasets, these remain challenging problems.
12:00 Poster Koji Makiyama Foyer Magic Functions to Obtain Results from 'for' Loops in R interfaces -
The function 'for' is one of the most popular functions in R. As you know, it is used to create loops. We think 'for' loops in R have one inconvenience: the results they produce are not kept. So we have created a package to store the results automatically. To do so, you only need to cast a one-line spell, 'magic_for.' For instance, calculating squared values for 1 to 3 using a 'for' loop and the 'print' function is very easy. However, it becomes too much hassle to change such code to store the displayed results. You must prepare containers of the correct length for storing results and change the 'print' calls to assignment statements. Moreover, in such situations, or in more troublesome ones where you have to store many variables, the code grows more complex. The 'magicfor' package resolves this problem while keeping the code readable. You just add one line, 'magic_for()', before the 'for' loop. Once you call 'magic_for,' you can execute 'for' as usual and the results will be stored in memory automatically. You can obtain the results using 'magic_result.' We introduce how to use the magic.
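The hassle described in the abstract can be sketched as follows. This is a Python analogue written purely for illustration, using the abstract's own example of squaring 1 to 3. It does not use or reproduce the magicfor package, and it shows only the manual rewrite that the package is meant to make unnecessary.

```python
# A loop that only displays its results: nothing is kept once it finishes.
for x in range(1, 4):
    print(x ** 2)

# The manual rewrite needed to keep the results: prepare a container of the
# correct length and turn each print call into an assignment.
inputs = list(range(1, 4))
squares = [None] * len(inputs)
for i, x in enumerate(inputs):
    squares[i] = x ** 2

print(squares)
```

Every additional variable to be stored needs its own container and its own assignment, which is the growth in complexity the abstract refers to.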
12:00 Poster Yuya Matsumura Foyer Easy Writing of Bayesian Optimization for Machine Learning models, performance, big data -
In many machine learning algorithms, tuning hyperparameters is one of the most important points. Bayesian optimization (Shahriari et al., 2015) is a method for tuning hyperparameters that is faster and more efficient than grid search, which evaluates every point on a grid in parameter space. In R, combining the rBayesianOptimization package with machine learning packages such as e1071 or ranger enables Bayesian optimization for hyperparameter tuning. However, writing code for Bayesian optimization using those packages was troublesome, because we must first write a complicated function to maximize and then write the code that executes the Bayesian optimization. This is confusing and makes trial and error difficult. The MlBayesOpt package (Matsumura, 2017, https://cran.r-project.org/web/packages/MlBayesOpt/index.html) makes this work much more convenient to write. To execute Bayesian optimization, the package requires only a data frame, the column name of the label to classify (or regress), and the column names of the feature vectors. For example, code that takes 32 lines using a combination of packages takes 5 lines using the MlBayesOpt package.
12:00 Poster Jeremy Forbes Foyer Using Australian census data to describe electorates' socio-economic profiles at the time of a federal election. models, reproducibility, community/education -
In Australia, the House of Representatives is divided into 150 seats, each representing an electoral division, and each division's boundaries are revised periodically. Federal elections generally occur every three years, but electorate boundaries can change between elections. The Australian Bureau of Statistics conducts a Census of Population and Housing every five years, and updates its record of electorate boundaries in July each year, in accordance with the official electoral commission's boundaries. This research looks at matching and estimating the socio-economic profile of each electorate at the time of a federal election. To accurately estimate profiles, each election is initially paired with the Census data taken closest to the election date. Many elections do not occur in the same year as a census, and are matched with data from nearby years. Differences between these dates are adjusted for using spatial analysis and time-series forecasts. This work is an update for the eechidna package, which contains Australian census and election data, and tools for visualisation and analysis. PS. An update can be provided closer to the date. Research has only recently commenced.
12:00 Poster Jessica Bagnall Foyer Analysing the voting patterns of the Senate of the 45th Australian Parliament via fully-visible Boltzmann machines algorithms, models, applications, networks -
The 45th Australian Senate, following the 2016 federal election, contains the largest crossbench since the expansion of the Senate in 1950. The 20 Senators who make up the crossbench represent 7 minor parties. We analyse the party-level voting patterns of the Senate of the 45th Australian Parliament by modelling the crossbench via a fully-visible Boltzmann machine (FVBM), a probabilistic graphical network that arises from the neural networks literature, in order to determine the influence that each party has on the others, and to evaluate the relative pro- (or anti-) government stances of these parties. We describe the required estimations and computations, which are performed via our R package BoltzMM, available at github.com/andrewthomasjones/BoltzMM. The package implements the MM algorithm for maximum pseudolikelihood estimation of FVBM models of Nguyen and Wood (Neural Computation, 2016), and uses the asymptotic normality results of Nguyen and Wood (IEEE T Neural Networks and Learning Systems, 2016) for inferential computations.
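As a rough illustration of what a fully-visible Boltzmann machine describes, the sketch below enumerates its probability mass function for a tiny example. It assumes the common parameterisation with states in {-1, +1}, a bias vector, and a symmetric interaction matrix with zero diagonal, so that the probability of a state x is proportional to exp(b'x + x'Mx/2). It is written in Python for illustration only and is not the BoltzMM API. The parameter values are made up and are not results from the Senate analysis.

```python
from itertools import product
from math import exp


def fvbm_pmf(bias, interactions):
    """Return {state: probability} for a fully-visible Boltzmann machine.

    States are tuples of -1/+1 values. `interactions` is assumed to be
    symmetric with a zero diagonal.
    """
    n = len(bias)
    weights = {}
    for state in product((-1, 1), repeat=n):
        linear = sum(bias[i] * state[i] for i in range(n))
        pairwise = 0.5 * sum(
            interactions[i][j] * state[i] * state[j]
            for i in range(n)
            for j in range(n)
        )
        weights[state] = exp(linear + pairwise)
    normaliser = sum(weights.values())  # sums over all 2**n states
    return {state: w / normaliser for state, w in weights.items()}


# Made-up parameters for three hypothetical voting blocs (+1 = aye, -1 = no).
bias = [0.0, 0.0, 0.0]
interactions = [
    [0.0, 1.0, -0.5],
    [1.0, 0.0, 0.0],
    [-0.5, 0.0, 0.0],
]
pmf = fvbm_pmf(bias, interactions)
```

A positive interaction makes two units more likely to agree, and a negative one makes them more likely to disagree. The normaliser requires a sum over 2^n states, which becomes expensive as n grows. Pseudolikelihood estimation works with conditional distributions and so avoids computing it.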
12:00 Poster Florian Schwendinger Foyer Readability Prediction in R models, applications, text analysis/NLP -
Readability prediction is commonly used to assess the comprehensibility of a given text. Early approaches focus on the development of readability scores (e.g. Fog-Index, Dale-Chall, Flesch Reading Ease). Most of these readability scores are based on the number of words, number of sentences, number of syllables and number of words which are not present in a predefined list. Current research in the field of linguistics suggests that these scores are often misleading and models which combine Natural Language Processing (NLP) and statistical learning should be used instead. This research presents how a state-of-the-art readability prediction can be implemented in R by utilizing the tools available from the StanfordCoreNLP package. The StanfordCoreNLP package and its companions can be installed from https://datacube.wu.ac.at/.
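For context, the classic scores mentioned above are simple functions of surface counts. The sketch below computes the Flesch Reading Ease score from counts supplied by the caller. It is written in Python for illustration only and is unrelated to the StanfordCoreNLP package. The function name and the example counts are hypothetical, and syllable counting is deliberately left out because it needs a dictionary or a heuristic.

```python
def flesch_reading_ease(n_words, n_sentences, n_syllables):
    """Flesch Reading Ease score; higher values indicate easier text."""
    if n_words <= 0 or n_sentences <= 0:
        raise ValueError("need at least one word and one sentence")
    words_per_sentence = n_words / n_sentences
    syllables_per_word = n_syllables / n_words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


# Hypothetical counts for a short passage.
score = flesch_reading_ease(n_words=100, n_sentences=5, n_syllables=150)
```

Because the score depends only on sentence length and syllables per word, it ignores syntax and meaning, which is the kind of limitation that NLP-based models aim to address.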
12:00 Poster Aswi Aswi Foyer Comparison of different Bayesian spatio-temporal models using R packages models, applications, space/time, CARBayesST -
There is a growing number of packages in R for modelling spatio-temporal data. In this presentation, we will review and compare a number of spatio-temporal Bayesian models using R. We will focus on two R packages, namely R-INLA (Integrated Nested Laplace Approximation) and CARBayesST, and describe the different spatio-temporal models available. We examine six and five Bayesian spatio-temporal models using CARBayesST and R-INLA, respectively. We will illustrate the application of these models and packages through a case study on dengue cases in Makassar, Indonesia. Model performance will be compared using goodness-of-fit measures such as the Deviance Information Criterion (DIC). The computational speed and ease of use of these packages make them a very attractive option for Bayesian spatio-temporal modelling.
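For reference, DIC in its usual form combines the posterior mean deviance with a penalty for the effective number of parameters. The notation below is an assumption for illustration, with y the data, \theta the model parameters and \bar{\theta} their posterior mean. It is not taken from the poster.

```latex
D(\theta) = -2 \log p(y \mid \theta), \qquad
\bar{D} = \mathbb{E}_{\theta \mid y}\!\left[ D(\theta) \right], \qquad
p_D = \bar{D} - D(\bar{\theta}), \qquad
\mathrm{DIC} = \bar{D} + p_D
```

When candidate models are fitted to the same data, the one with the lower DIC is preferred.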
12:00 Poster Janek Thomas Foyer Automatic gradient boosting algorithms, models, data mining, Automatic Machine Learning -
Well-qualified data scientists are not a dime a dozen. Instead, employees who are not very familiar with data analysis are often called on to do the job. Automatic machine learning can help them perform predictive modelling with high-performing machine learning tools without much experience. This is achieved by making those applications parameter-free, i.e. only the data is required as input. Projects like Auto-WEKA or auto-sklearn aim to solve the Combined Algorithm Selection and Hyperparameter optimization (CASH) problem, which results in a huge optimization space. However, for most real-world applications, only a few different learning algorithms are required to deliver superior performance. autoxgboost takes this idea one step further and reduces the CASH problem to using gradient boosting as a single learning algorithm in combination with intelligent model-based hyperparameter tuning. It is based on the R packages mlr, mlrMBO and XGBoost. It also supports categorical variables through a special built-in encoding of factor features. Even though autoxgboost uses only one learner instead of a whole library, it provides comparable or even better performance.
12:00 Poster Awdhesh Yadav Foyer Household and Community factor on under-five mortality in India: An application of multilevel cox proportional hazard model multivariate, big data -
The objective of this paper is to determine the importance of community-, household- and individual-level effects on under-five mortality in India. Using data from the latest round of the Demographic Health Survey (DHS), 2005-06, a multilevel Cox proportional hazards analysis was performed on a nationally representative sample. The results indicate that patterns of under-five mortality were clustered within mothers and communities. Community-level variables such as region, place of residence, community poverty level, community education level and ethnic fractionalization index were significant determinants of under-five mortality in India. The risk of under-five death was significantly higher for children residing in the North, East and West regions compared to the South region. In addition, a higher proportion of women in the community completing secondary school was significantly associated with increased child survival. Household-level variables such as religion, caste and wealth index were also significant determinants of under-five mortality. The results suggest that contextual-level factors need to be addressed in order to reduce under-five mortality in India.
12:00 Poster Jean-Michel Perraud Foyer A suite of R packages for hydrological ensemble forecasting using Rcpp space/time, performance, big data, Hydrology, forecast -
Ensemble prediction techniques have been shown to produce more accurate predictions as well as formally quantify prediction uncertainty in a range of scientific applications. We present a suite of libraries for hydrological ensemble forecasts designed for use both in research and operations. The features of the C++ libraries are available from several high-level interactive languages including R. The suite currently comprises three main R packages for rainfall forecast post-processing (RPP), semi-distributed ensemble hydrological modelling (SWIFT2) and multi-dimensional ensemble time series (uchronia). The packages are designed to offer concise commands for handling ensemble time series, input/output, model parameterisation and simulation execution. The native libraries purposely have a C API to maximise interoperability and foster a consistent user experience across high-level languages. Rcpp is used for surfacing the features in the R packages. Bespoke code for marshaling data, managing object lifetimes and generating glue code for Rcpp is already open source and suitable for reuse in similar technical contexts.
12:00 Poster Susanna Cramb Foyer Bayesian disease mapping in R models, space/time -
Bayesian methods feature prominently in disease mapping, and R has multiple packages designed to enable efficient computation of assorted Bayesian spatial models. Here we examine seven Bayesian models suitable for disease mapping and implement them using the R packages R-INLA, CARBayes, R2WinBUGS and R2jags. The models considered include common approaches such as the BYM model, which smooths estimates over all adjacent areas, through to more recently introduced models that allow for discontinuities between adjacent areas, as well as spline models. Simulated incidence data designed to represent a rare cancer (liver) and a more common cancer (lung) were examined across 2153 areas in Australia. Model performance was compared on goodness-of-fit measures (WAIC, Moran’s I on residuals), computational time and convergence (Geweke). The packages themselves are also compared in terms of computational time and model flexibility. It is useful to consider several different models to understand the robustness of results when disease mapping. R has the capacity to enable a wide range of models to be considered, with the additional advantage of high-quality visualisation of results.
12:00 Poster Rosemary Putler Foyer Analysis of EHR Data and Circulating Inflammatory Mediators: Association with Severe Clostridium difficile Infection models, applications, bioinformatics -
Clostridium difficile infection (CDI) is a major healthcare-associated infection, and severe CDI often leads to subsequent recurrence or death. We hypothesized that circulating inflammatory mediators would associate with severity in a prospective cohort of inpatients diagnosed with CDI. An inflammatory mediator panel was performed on collected sera and merged with electronic health record (EHR) data. With these data we show that circulating biomarkers associate not only with severity of the CDI episode, but also with subsequent mortality. Because of the large number of potentially correlated predictors, our data presented a challenge when we set out to identify features to incorporate into an accurate predictive model. We explore the steps performed in this analysis, discussing methodology and decision points, including the management and analysis of EHR data, utilization of dimensional reduction techniques, and use of existing packages such as vegan, glmnet, and pROC. Through this analysis, we demonstrate how a diverse array of R packages and statistical methodologies, which function in a wide array of use cases, can also be used to answer a complex disease-related question.
12:00 Poster Stuart Davie Foyer A data driven approach to generating and scoring B2B leads visualisation, models, applications -
In many industries, companies rely on a sales team to source and qualify leads. Unfortunately, this limits a company's potential leads to those that can be manually processed, while lead qualification is limited by the quality of ad hoc scoring systems. To find leads faster, companies might engage in cold-calling, or blanket email campaigns, both of which are known for their low conversion rates. Here, a data-driven B2B lead generation and qualification solution for the UK market is presented, based on open source data and XGBoost. Our models take into account both general and company specific features, and allow an approximation of the size of market opportunity. Lead reports containing pertinent conversion information are automatically generated using xgboostExplainer and R Markdown. Considerations on feature engineering, and difficulties associated with overfitting, are also discussed. **As there are several components to this presentation, a lightning talk would be preferred over a poster**
Time Session Presenter Venue Title Keywords Chair