Apache Spark

  • Apache Spark: "A fast and general engine for large-scale data processing"; "A fast and general-purpose cluster computing system"
  • Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel
  • Distributed DataFrame API based on R and pandas data frames.
  • Open source and supported by many vendors, including Microsoft, IBM, Intel, Google, Cloudera, Hortonworks, and Databricks.
  • IBM announced last year that it would dedicate 3,500 people to Spark-related projects (IBM is now the #2 committer to Spark, after Databricks)

Spark APIs

  • Scala/Java API (traditional object-oriented API)

  • pyspark: "The Spark Python API (PySpark) exposes the Spark programming model to Python. To learn the basics of Spark, we recommend reading through the Scala programming guide first; it should be easy to follow even if you don’t know Scala. This guide will show how to use the Spark features described there in Python."

  • What should the R interface look like?

R as Interface Language

  • Optimized for interactive exploration—many language features aimed at productive REPL usage.
  • Generic functions—S3 dispatch provides uniform interfaces to all objects for inspection, plotting, etc.
  • Functional language w/ immutable data—promotes trustworthy computing.
  • Non-standard evaluation for meta-programming—ideal for creating DSLs like the R formula interface, dplyr, etc.

The R interface should provide high-level facades for the tasks users want to undertake with Spark that are consistent with base R semantics and take advantage of its strengths as an interface language.
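As a small illustration of the S3 dispatch mentioned above: one generic can serve both local and remote data, so callers use the same verb regardless of where the data lives. (The names `describe` and `remote_table` are hypothetical, for illustration only.)

```r
# A hypothetical generic with methods for local and remote data
describe <- function(x, ...) UseMethod("describe")

describe.data.frame <- function(x, ...) {
  cat("local data frame:", nrow(x), "rows\n")
}

describe.remote_table <- function(x, ...) {
  cat("remote table:", x$name, "\n")
}

# S3 dispatch picks the right method from the object's class
describe(mtcars)
describe(structure(list(name = "logs"), class = "remote_table"))
```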



  • Data Frame Interfaces:
    • R data frame API, dplyr, data.table
    • We want to use these interfaces for remote data and local data (i.e. please don't mask our local interface in the service of providing a remote interface!)
  • Distributed Machine Learning:
    • High level functional interfaces to distributed machine learning that play well with R generics like print, predict, summary, residuals, fitted, etc.
  • Distributed Parallel Execution:

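A sketch of how the data frame and machine learning interfaces fit together in sparklyr, assuming a local Spark installation (dplyr verbs are translated to Spark SQL, and the fitted model works with the usual R generics):

```r
library(sparklyr)
library(dplyr)

# Connect to a local Spark instance (requires Spark to be installed)
sc <- spark_connect(master = "local")

# Copy a local data frame to Spark and manipulate it with dplyr verbs
mtcars_tbl <- copy_to(sc, mtcars)
mtcars_tbl %>%
  filter(cyl >= 6) %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg))

# Fit an MLlib model within the same pipeline; the result plays well
# with R generics such as print(), summary(), and predict()
fit <- mtcars_tbl %>% ml_linear_regression(mpg ~ wt + cyl)
summary(fit)

spark_disconnect(sc)
```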
Evolving SparkR

  • Break core RPC layer into new package: sparkapi
    • Exposes the core R-to-Java RPC bridge publicly, making it possible to write extension packages that call arbitrary Spark APIs
  • New package that provides a dplyr-interface to Spark DataFrames: sparklyr
    • Also provides Spark MLlib interface that works within dplyr pipelines
    • Use of dplyr is optional so extensions that provide alternate data frame interfaces can still call MLlib functions.
  • Extensions are also compatible with SparkR, provided sparkapi support is added to SparkR (this is not in our control).


sparkapi Package

Function           Description
spark_connection   Get the Spark connection associated with an object (S3)
spark_jobj         Get the Spark jobj associated with an object (S3)
spark_dataframe    Get the Spark DataFrame associated with an object (S3)
spark_context      Get the SparkContext for a spark_connection
hive_context       Get the HiveContext for a spark_connection
invoke             Call a method on an object
invoke_new         Create a new object by invoking a constructor
invoke_static      Call a static method on an object
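A short sketch of the invoke functions, assuming `sc` is a spark_connection obtained from a sparkapi-based front end such as sparklyr:

```r
# Call a static Java method over the RPC bridge: java.lang.Math.abs(-10)
invoke_static(sc, "java.lang.Math", "abs", -10L)

# Construct a Java object, then call an instance method on it
big <- invoke_new(sc, "java.math.BigInteger", "1000000000")
invoke(big, "longValue")
```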

Distributed Parallel Execution

  • Nothing (yet) in sparkapi for distributing R computations to cluster nodes

  • Need to ascertain what common infrastructure is required for various projects (ddR, Tessera, hmr, etc.)

  • Need help to define and build these interfaces

Next Steps

  • Community review of sparkapi package: is it possible to write the extensions we'd like to?

  • Apache Spark review of sparkapi: can we agree on a common extension API?

  • CRAN submissions of sparkapi and sparklyr