Fei Chen and Brian D. Ripley
Statistical Computing and Databases: Distributed Computing Near the Data
************************************************************************
This paper addresses the following question: "how do we fit statistical models
efficiently with very large data sets that reside in databases?"
Nowadays it is quite common to encounter a situation where a very large data
set is stored in a database, yet the statistical analysis is performed with a
separate piece of software such as R. Usually it does not make much sense, and
in some cases it may not even be possible, to move the data from the database
manager into the statistical software in order to complete a statistical
calculation.
For this reason we discuss and implement the concept of "computing near the data".
To reduce the amount of data that needs to be transferred, we should perform as
many operations as possible at the location where the data resides, and only
communicate the reduced summary information across to the statistical system.
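
As a minimal sketch of this idea (not the implementation described in this
paper), the example below fits a straight line y = a + b*x by asking the
database for five aggregate values and solving the normal equations on the
client. It assumes Python with an in-memory SQLite database standing in for
R and MySQL; the table name obs, its columns x and y, and the function
fit_line_near_data are hypothetical names chosen for illustration.

```python
import sqlite3


def fit_line_near_data(conn, table="obs"):
    """Fit y = a + b*x using only aggregates computed inside the database.

    Hypothetical helper: the table is assumed to have numeric columns x and y.
    """
    # The database scans every row; only five numbers cross to the client.
    n, sx, sy, sxx, sxy = conn.execute(
        f"SELECT COUNT(*), SUM(x), SUM(y), SUM(x*x), SUM(x*y) FROM {table}"
    ).fetchone()
    # Closed-form least-squares solution from the sufficient statistics.
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope


# Illustrative data only: 1000 points lying exactly on y = 2 + 3x.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obs (x REAL, y REAL)")
conn.executemany(
    "INSERT INTO obs VALUES (?, ?)",
    [(float(i), 2.0 + 3.0 * i) for i in range(1000)],
)
intercept, slope = fit_line_near_data(conn)
```

The amount of data returned to the client is the same whether the table holds
a thousand rows or a billion, which is the point of reducing the data where
it resides.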
We present implementation details for embedding an R process inside a database
(MySQL) and controlling this process remotely from another R session through
communication interfaces provided by CORBA. In order to make such distributed
computing transparent to the end user, we discuss modifying the R engine to allow
computation with external pointers such that an expression involving an external
pointer reference is evaluated remotely at the location where the object pointed
to resides. In addition, we implement distributed R data frames to give R the
ability to handle very large data sets with PVM and ScaLAPACK.
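
The remote evaluation of expressions that involve an external pointer can be
illustrated with a small sketch. It is only an analogy under stated
assumptions: it is written in Python rather than R, both sides run in one
process instead of communicating over CORBA, and the class names DataSite and
RemoteRef are hypothetical. The client holds a handle, and each operation on
that handle is evaluated where the object is stored, so only the result
travels back.

```python
class DataSite:
    """Stands in for the process that lives next to the data (hypothetical)."""

    def __init__(self):
        self._objects = {}
        self._next_id = 0

    def store(self, values):
        """Keep the data on this side and hand back a reference to it."""
        handle = self._next_id
        self._objects[handle] = list(values)
        self._next_id += 1
        return RemoteRef(self, handle)

    def evaluate(self, op, handle):
        """Run a named operation where the data resides; return the result."""
        data = self._objects[handle]
        ops = {
            "len": len,
            "sum": sum,
            "mean": lambda v: sum(v) / len(v),
        }
        return ops[op](data)


class RemoteRef:
    """Client-side reference: holds a handle, never the data itself."""

    def __init__(self, site, handle):
        self._site = site
        self._handle = handle

    def __len__(self):
        return self._site.evaluate("len", self._handle)

    def sum(self):
        return self._site.evaluate("sum", self._handle)

    def mean(self):
        return self._site.evaluate("mean", self._handle)


site = DataSite()
big = site.store(range(1, 101))  # the values stay inside the DataSite
total = big.sum()                # evaluated by the DataSite
average = big.mean()             # evaluated by the DataSite
```

In a real deployment the call to evaluate would be a network request to the
process embedded in the database, but the user-facing code would look the
same, which is what makes the distribution transparent.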