Fei Chen and Brian D. Ripley

Statistical Computing and Databases: Distributed Computing Near the Data

This paper addresses the following question: how do we fit statistical models efficiently when very large data sets reside in databases? It is now quite common to encounter a situation in which a very large data set is stored in a database, yet the statistical analysis is performed with a separate piece of software such as R. Usually it makes little sense, and in some cases it is not even possible, to move the data from the database manager into the statistical software in order to complete a statistical calculation. For this reason we discuss and implement the concept of "computing near the data": to reduce the amount of data that must be transferred, we perform as many operations as possible at the location where the data reside, and communicate only the reduced summary information to the statistical system. We present implementation details for embedding an R process inside a database (MySQL) and remotely controlling this process from another R session via communication interfaces provided by CORBA. To make such distributed computing transparent to the end user, we discuss modifying the R engine to allow computation with external pointers, so that an expression involving an external-pointer reference is evaluated remotely, at the location where the object pointed to resides. In addition, we implement distributed R data frames, using PVM and ScaLAPACK, to give R the ability to handle very large data sets.
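The core idea of "computing near the data" can be sketched independently of the paper's R/MySQL/CORBA implementation. The minimal example below, in Python with an in-memory SQLite database (both are illustrative assumptions, not part of the paper's system), pushes the aggregation into the database so that only five sufficient statistics, rather than every raw row, cross the interface to the analysis code; a simple linear regression is then completed locally from those summaries.

```python
# Hedged sketch of "computing near the data": the database computes the
# sufficient statistics for a simple linear regression, and only those
# five numbers are transferred to the statistical client.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE measurements (x REAL, y REAL)")
# Synthetic data lying exactly on the line y = 2x + 1.
cur.executemany("INSERT INTO measurements VALUES (?, ?)",
                [(float(i), 2.0 * i + 1.0) for i in range(1000)])

# Aggregation happens inside the database, near the data.
n, sx, sy, sxx, sxy = cur.execute(
    "SELECT COUNT(*), SUM(x), SUM(y), SUM(x*x), SUM(x*y) FROM measurements"
).fetchone()

# Only the summaries crossed the interface; the fit is finished locally.
slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
intercept = (sy - slope * sx) / n
print(slope, intercept)  # recovers slope 2, intercept 1
```

The same division of labour is what the paper automates: the embedded R process plays the role of the in-database aggregation step, and the controlling R session receives only the reduced results.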