Saturday June 27, 2015

9:00   General assembly of The R Foundation (closed meeting, D2.20)
11:30   Registration (CBS entrance hall)
12:00   Lunch (FUHU Faculty Club)
13:00   Welcome
    Peter Dalgaard, Copenhagen Business School
    Session I: R Internals and Efficiency (Chair: Claus Ekstrøm)
    Luke Tierney, University of Iowa
    Some Directions for the R Engine
    This talk will outline some possible improvements to the R engine I hope to explore in the next year, including compiler enhancements to improve scalar performance, improvements to function call performance, and changes to vector computations and representations.
    Tomas Kalibera, Northeastern University
    R Performance Optimizations & Hunting Memory Protection Bugs
    The talk will report on recent performance optimizations in R-devel, including S3/S4 method dispatch, symbol lookup within method dispatch, type table operations, and invocation of external functions in the stats package. The talk will also report on a hunt for PROTECT bugs; a number of them have been found using an automated tool and fixed in R-devel. The tool was implemented specifically for this purpose and may be of interest to package developers, because it can also find PROTECT bugs in the C code of R packages.
    Lukas Stadler, Oracle Labs
    FastR: Challenges, Progress, and Future Plans
    FastR is a JVM-based implementation of the R language. This talk will provide insights into the technical foundations of the FastR project, along with an overview of its current state and the main challenges we are facing. I will also present some specific language and runtime features that proved to be hard to implement in a system backed by an optimizing compiler, and the solutions and heuristics we applied to overcome them. Feedback on whether our approaches seem sound will be more than welcome.
    Maarten-Jan Kallen, BeDataDriven B.V. and Hannes Mühleisen, CWI
    Latest Developments around Renjin
    Renjin is a JVM-based interpreter for the R programming language. Like GNU R, it is open source, and it relates to GNU R much like Jython relates to (C)Python and JRuby to (Matz's) Ruby. In this talk we give a short update on our progress since the 2013 UseR! conference in Spain and we expand on our future plans, which will mostly cover S4 support and further performance improvements. We will demonstrate some of these improvements, including those that are the result of Hannes' research project "R as a Query Language". Using deferred computation, we apply (a) simple optimizations in the vector pipeliner such as identity optimizations (e.g. mean(rep(x, y)) = mean(x) and x*1 = x), (implicit) parallelization of computations and just-in-time computation of specialized computations and (b) optimizers inspired by relational query optimization such as selection pushdown, function de-virtualization, common expression elimination and caching sub-expressions. We showcase our optimizations on both micro-benchmarks and a real-world use case, namely a large-scale survey analysis on American Community Survey (ACS) data.
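The identity optimizations named in the Renjin abstract, mean(rep(x, y)) = mean(x) and x*1 = x, can be illustrated with a toy deferred-computation tree. The following is only a minimal Python sketch, not Renjin's implementation: the classes Vec, Rep, Mul and Mean and the functions simplify and evaluate are hypothetical names introduced here, and rep is assumed to repeat the whole vector a scalar number of times.

```python
from dataclasses import dataclass


# Hypothetical node types for a deferred expression tree (not Renjin's API).
@dataclass(frozen=True)
class Vec:
    """A materialized numeric vector (leaf node)."""
    values: tuple


@dataclass(frozen=True)
class Rep:
    """Deferred rep(x, times): the whole vector repeated `times` times."""
    x: object
    times: int


@dataclass(frozen=True)
class Mul:
    """Deferred x * k for a scalar k."""
    x: object
    k: float


@dataclass(frozen=True)
class Mean:
    """Deferred mean(x)."""
    x: object


def simplify(e):
    """Rewrite the tree bottom-up using the two identities from the abstract."""
    if isinstance(e, Vec):
        return e
    if isinstance(e, Rep):
        return Rep(simplify(e.x), e.times)
    if isinstance(e, Mul):
        inner = simplify(e.x)
        return inner if e.k == 1 else Mul(inner, e.k)  # x*1 = x
    if isinstance(e, Mean):
        inner = simplify(e.x)
        if isinstance(inner, Rep) and inner.times >= 1:
            return simplify(Mean(inner.x))  # mean(rep(x, y)) = mean(x)
        return Mean(inner)
    raise TypeError(f"unknown node: {e!r}")


def evaluate(e):
    """Force the deferred computation; vectors become lists, means become floats."""
    if isinstance(e, Vec):
        return list(e.values)
    if isinstance(e, Rep):
        return evaluate(e.x) * e.times
    if isinstance(e, Mul):
        return [v * e.k for v in evaluate(e.x)]
    if isinstance(e, Mean):
        vals = evaluate(e.x)
        return sum(vals) / len(vals)
    raise TypeError(f"unknown node: {e!r}")


x = Vec((1.0, 2.0, 3.0, 6.0))
expr = Mean(Rep(Mul(x, 1), 1000))   # mean(rep(x * 1, 1000)), nothing computed yet
optimized = simplify(expr)          # collapses to Mean(x) before any work is done
print(evaluate(optimized))          # 3.0
```

Because nothing is computed until evaluate is called, the rewrite avoids materializing the 4000-element repeated vector at all; the unoptimized tree yields the same mean.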
    Karl Millar, Google
    CXXR Performance Improvements and Future Directions
    The goal of the CXXR project is to produce a fully-compatible, maintainable, high-performance R implementation using modern software engineering techniques. I'll discuss the general approach that we're taking to achieving high performance, before delving into the details of some of the techniques used in the CXXR runtime library part of our approach. In the memory management system, these include using conservative stack scanning to largely eliminate the need for the PROTECT() macro and using deferred reference counting to both reclaim unused memory and identify objects that can be modified in-place or whose allocations can be reused. The memory manager also integrates with address sanitizer to easily find errors. Elsewhere in the runtime library, CXXR uses an optimized calling convention for calling builtin functions, more efficient representations for environments and the global cache, and (soon) a special representation for scalar objects that eliminates the need for object allocation in many cases.
15:15   Coffee break
15:45   Session I (continued)
    Radford Neal, University of Toronto
    Can Interpreting be as Fast as Byte Compiling? + Other Developments in pqR
    I will discuss recent developments and future directions in the pqR project, focusing mostly on ways in which pqR's interpreter has been made faster for scalar computations. Some highlights are a scheme for returning scalar results from arithmetic operations in "static boxes", direct assignment to variables in update operations, fast lookup of variables using cached bindings, and a new scheme for subset assignment, which resolves various semantic anomalies as well as being faster. The "variant result" mechanism in pqR plays a role in many of these improvements, as well as in task merging and parallel computation in helper threads. The present lack of this mechanism in byte-compiled code is one motivation for trying to make the interpreter fast enough that the byte compiler can be abandoned. I will also briefly mention future plans for extending use of task merging and helper threads, for automatic compact storage of large vectors, and for implementing exactly-rounded computation of sums and means.
    Richard J. Cotton, Weill Cornell Medical College in Qatar
    Assorted Rants: Things that Annoy Me about R, and Suggestions for Fixes
    I feel like R is 99% life-changingly awesome, and 1% utterly exasperating. This talk is about that 1%, and what I think some solutions are. The topics are loosely organised around eliminating quirks, helping beginners, and gamifying package development.
    Discussion of session I
    Session II: Technical Infrastructure and Applications (Chair: Stefan Bache)
    Matt Dowle, H2O
    H2O Design and Infrastructure
    Matt will give a demo of H2O and discuss its design and infrastructure. H2O is for data scientists and business analysts who need scalable and fast machine learning. H2O is an open source predictive analytics platform. Unlike traditional analytics tools, H2O provides a combination of extraordinary math and high performance parallel processing with unrivaled ease of use. H2O speaks the language of data science with support for R, Python, Scala, Java and a robust REST API.
    Indrajit Roy, HP Labs and Michael Lawrence, Genentech
    Progress on Distributed Data Structures in R
    Data sizes continue to increase, while single core performance has stagnated. To scale our computations, we need to distribute datasets across multiple machines. Thus, R needs standardized, idiomatic abstractions for computing on distributed data structures. R has many packages that provide parallelism constructs as well as bridges to distributed systems such as Hadoop. Unfortunately, each interface has its own syntax, parallelism techniques, and supported platform(s). As a consequence, contributors are forced to learn multiple idiosyncratic interfaces, and to restrict each implementation to a particular interface, thus limiting the applicability and adoption of their research and hampering interoperability. Our proposal is to create a unified API for distributed computing. The API supports three shapes of data (lists, arrays and data frames) and enables the loading and basic manipulation of distributed data, including multiple modes of functional iteration (e.g., apply() operations). In this talk we will discuss the proposed API, and how it can be implemented on top of existing distributed backends.
    Ryan Hafen, Tessera
    Tessera: Analysis of Large Complex Data in R
    Tessera is a project that provides a simple interface that enables all of the statistical, machine learning, and visualization methods in R to be used with large complex data. Tessera is built on the Divide and Recombine (D&R) analysis paradigm. Tessera can be used with data in memory, on disk, or using a distributed storage and computing back end like Hadoop or Spark. Tessera is designed to be back end agnostic and extensible so that a single simple D&R-based interface can be used regardless of the back end and so that new technology can be leveraged as it comes along. This talk will cover the basics of the design of Tessera.
17:30   Closed meeting with video conference (D2.20)
19:00   Dinner (FUHU Faculty Club)

Sunday June 28, 2015

9:00   Closed meeting of R Foundation (D2.20)
12:00   Lunch (FUHU Faculty Club)
13:00   Session II (continued)
    Andrie de Vries, Microsoft
    R and Reproducibility: An Update
    During 2014, Revolution Analytics released Revolution R Open (RRO), a downstream distribution of R that includes the Intel MKL for faster matrix computation. RRO also includes two components that aim to make it easier to write reproducible code. Specifically, RRO makes progress in resolving the problem of package reproducibility. The components are:
  • MRAN, a server-side solution that makes snapshots of CRAN every day, i.e. a CRAN time-machine
  • checkpoint, a package on CRAN that allows the user to easily access MRAN snapshots and create a local library to install packages as they existed on a given snapshot date.
    In this presentation, I provide an update on changes to RRO and MRAN during 2015. For example, we now distribute RRO as two downloads (separating R and Intel MKL) and we made many improvements to checkpoint, based on community feedback. Finally, I share some thoughts about our experience with miniCRAN, a package that allows you to download only a subset of CRAN packages to a local server, a feature that is appreciated especially by enterprise customers.
    Jeroen Ooms, UCLA
    The curl Package: A Modern, Flexible http/ftp Client for R
    Gabor Csardi, Harvard University
    The METACRAN Experiment
    METACRAN is an experiment to provide additional services on top of the CRAN infrastructure:
  • A searchable code mirror, with diffs between package versions, and the ability to create personalized versions of R packages.
  • A database and API of CRAN metadata.
  • A package search engine.
  • A package manager.
  • A continuous integration wrapper to build and check R packages with various R versions.
  • A database and API of CRAN package downloads from the cloud CRAN mirror.
  • A web site for convenient access to these services.
    Discussion of session II
    Session III: Community Infrastructure (Chair: Kenneth Rose)
    Peter Dalgaard, Copenhagen Business School
    R development dynamics. Current conventions and future directions
    Since R v.1.0.0, the R Core Team has taken a very conservative approach to extending and modifying R, essentially leaving all such changes to take place in "the CRAN marketplace". This may or may not be a tenable position in the longer perspective. In this talk, I intend to describe the rationale(s) for the current situation, but also what I perceive as potential risks associated with it. I do not necessarily hold strong opinions on the matter, but would like to open a discussion about the potential for different directions and their implications.
    Joseph B. Rickert, Microsoft
    News from the R Community
    By almost any measure, the R community is thriving and growing. In this talk, I will present a few ideas about the structure of the R Community and highlight the respective roles of R User Groups and the anticipated R Consortium in helping the community prosper.
15:15   Coffee break
15:45   Session III (continued)
    Bettina Grün, Johannes Kepler Universität
    Current Status and Future Directions for The R Journal
    The R Journal started in 2009 as the successor of the R Newsletter and it is the open access, refereed journal of the R project for statistical computing, with two issues published each year. The main scope of the R Journal is to inform the R community about new developments and give insights into exciting applications which can be performed using R. We will give an overview of the current state regarding submissions, in particular with respect to the different topics of interest covered by the R Journal, handling of manuscripts and published articles. We will aim at pointing out current strengths and weaknesses and try to trigger some discussion on future developments.
    Mine Çetinkaya-Rundel, Duke University
    Expanding R Exposure through Early Introduction in the Undergraduate Curriculum
    In this talk we discuss approaches for effectively integrating R into an introductory statistics curriculum. R is attractive because, unlike software designed specifically for courses at this level, it is relevant beyond the introductory statistics classroom, and is more powerful and flexible. The main obstacle to the adoption and use of R in an introductory setting is the perceived challenge of teaching programming in addition to teaching statistical concepts. Furthermore, working at a command line tends to be more intimidating to students and teachers than GUI-based tools. Many of these challenges can be overcome with the right tools: a user-friendly IDE like RStudio, which is invaluable for resolving some of the initial hurdles novice students experience with the bare-bones R interface, and labs and activities that use the right balance of standard and custom R functions. We will present examples from labs that have been developed with the goal of helping students synthesize concepts while learning R and do so in a completely reproducible framework using R Markdown. We will discuss benefits of this approach, not only with respect to creating opportunities for discussing the importance of reproducible research, but also for learning syntax, avoiding common novice pitfalls, and organizing and unifying output and write-ups. Additionally, we will share approaches for introducing students to R outside of the classroom as well as student experiences and feedback.
    Jennifer Bryan, University of British Columbia
    Engaging New UseRs with R Markdown, Git, and GitHub
    Every year I teach graduate courses in exploratory data analysis and statistics for high-dimensional biology. For each run, we attract ~40 motivated grad students from a variety of programs and backgrounds. Recently, we've had a very rewarding experience integrating R Markdown, Git and GitHub into these courses. The benefits include
  • reduced webmaster burdens for the instructor
  • easier dissemination of code and reports
  • enhanced code-focused interactions, both instructor <--> student and student <--> student
    I'll discuss how we use these tools to create a lively R-focused community in the classroom.
    Karthik Ram, University of California, Berkeley
    The Role of R in Growing a Diverse and Open Community of Scientific Software Developers
    Over the past few years, the rOpenSci project has helped create, grow, and support a community and practice of software development among full-time researchers in various domains. Our success in this area has been somewhat unique since most practicing researchers lack either the necessary programming skills or the appropriate incentives to engage in software development. We attribute our success in this area to 1) widespread familiarity with R among various disciplines, 2) our intense and sustained community building efforts, and 3) strong ties to the domains we support. We believe that our unique approach and experience have the potential to transform the culture of science, particularly within the context of research software development. In this talk I describe the origins of the rOpenSci project and our efforts in building such a community.
    Discussion of session III
17:30   Closing

Last updated by: Ida Willumsen 26/06/2015