Backwards-compatible customization of atomic vectors in R

Gabriel Becker - Genentech, Inc.

Created: 2016-07-11 Mon 11:08

1 Vectors

1.1 Now

#define DATAPTR(x)      (((SEXPREC_ALIGN *) (x)) + 1)

2 Customization approaches

2.1 S4 dispatch

  • The "everything should be a generic" method
    • Used to great effect in Bioconductor
  • Pros
    • No touching the header or internals
    • Object-oriented
  • Cons
    • Performance

2.2 Rcpp

  • The "let's just build something entirely new" approach
    • Used to great effect in many CRAN/Bioc packages
  • Pros
    • Performant, clean API, many conveniences, behave like atomic vectors
  • Cons
    • Not actually atomic vectors, not compatible directly with internals

3 Proposal

3.1 The header has a spare bit

//...
    unsigned int debug :  1;
    unsigned int trace :  1;  /* functions and memory tracing */
    unsigned int spare :  1;  /* currently unused */
    unsigned int gcgen :  1;  /* old generation number */
    unsigned int gccls :  3;  /* node class */
}

3.2 Use it to offer customization interface for SEXP data storage

/* if the custom vector bit is set, use its accessor. Otherwise, vector data
   is contiguous with header, same as it always was */
#define DATAPTR(x)      ((void *) (IS_CUSTVEC(x) ? \
                         (CUSTAPI_PTR(x)->dataptr(x)) : \
                         ((SEXPREC_ALIGN *) (x)) + 1))
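
A minimal, self-contained sketch of the same dispatch idea, outside of R. Everything here (mock_vec, mock_api, mock_dataptr, external_api) is a hypothetical stand-in invented for illustration, not R's real header or the proposed API: one flag bit decides whether the payload sits right after the header or is fetched through a function pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for R's internals; none of these are real R types. */
typedef struct mock_vec mock_vec;

typedef struct {
    void *(*dataptr)(mock_vec *);  /* mirrors the dataptr slot in the slides */
} mock_api;

struct mock_vec {
    unsigned int custom : 1;  /* plays the role of the spare header bit */
    const mock_api *api;      /* only consulted when custom == 1 */
    void *external;           /* storage owned by the custom implementation */
    size_t length;
};

/* Traditional layout: the payload starts immediately after the header. */
static void *mock_trad_dataptr(mock_vec *x) {
    return (void *)(x + 1);
}

/* Function form of the dispatching DATAPTR macro. */
static void *mock_dataptr(mock_vec *x) {
    assert(x != NULL);
    return x->custom ? x->api->dataptr(x) : mock_trad_dataptr(x);
}

/* One possible custom implementation: the data lives somewhere else. */
static void *external_dataptr(mock_vec *x) {
    return x->external;
}

static const mock_api external_api = { external_dataptr };
```

Callers only ever use mock_dataptr(); they cannot tell which branch served the pointer, which is the backwards-compatibility argument in miniature.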

3.3 A note about dataptr

  • Custom vectors don't need to have their data contiguous in memory anywhere when created
  • Only need a pointer when dataptr is actually called
    • e.g., store vector as Rle, unpack only when it hits DATAPTR
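
A rough sketch of the "unpack only when it hits DATAPTR" idea for a run-length encoding. The rle_vec type and the rle_length and rle_dataptr functions are hypothetical names made up for this illustration; the flat array is built on first request and cached afterwards.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical run-length encoded integer vector; not an R type. */
typedef struct {
    const int *values;      /* value of each run */
    const size_t *lengths;  /* length of each run */
    size_t nruns;
    int *expanded;          /* NULL until a flat C array is requested */
} rle_vec;

static size_t rle_length(const rle_vec *v) {
    size_t n = 0;
    for (size_t i = 0; i < v->nruns; i++)
        n += v->lengths[i];
    return n;
}

/* Plays the role of the dataptr method: expand on first use, then cache. */
static int *rle_dataptr(rle_vec *v) {
    assert(v != NULL);
    if (v->expanded == NULL) {
        size_t n = rle_length(v);
        int *out = malloc((n ? n : 1) * sizeof *out);
        if (out == NULL)
            return NULL;    /* the accessor itself can now fail */
        size_t k = 0;
        for (size_t i = 0; i < v->nruns; i++)
            for (size_t j = 0; j < v->lengths[i]; j++)
                out[k++] = v->values[i];
        v->expanded = out;
    }
    return v->expanded;
}
```

Until rle_dataptr() is called, the vector costs only its runs; code that never asks for a raw pointer never pays for the expansion.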

3.4 We need one more thing

  • Duplication code needs to handle these new vector implementations
#define DUPLICATE_ATOMIC_VECTOR(type, fun, to, from, deep) do { \
      if( CUSTAPI_PTR(from) ) \
          PROTECT(to = CUSTAPI_PTR(from)->dup_vector( to, from, deep)); \
      else DUPLICATE_TRAD_ATOMIC_VECTOR(type, fun, to, from, deep);     \
} while (0)

4 The vector implementation API

4.1 Creating the vector

typedef struct api_impl {
    R_allocator_t *allocator;
    void *(*dataptr)(SEXP);
    void (*set_dataptr)(SEXP, void*);
    SEXP (*dup_vector)(SEXP, SEXP, Rboolean);

4.2 dataptr

  • can retrieve or create a pointer to a C-array
  • Share memory with
    • Other systems - MonetDB, Apache Arrow
    • Other R vectors - windowing operations without copying
  • Store data in different form with an escape valve
    • intervals of integers stored as just start and length
    • Rles, other forms of compressed representation

4.3 Accessing the vector

SEXP (*subvector)(SEXP, R_xlen_t, R_xlen_t); 
SEXP (*set_subvector) (SEXP, SEXP, int);
SEXP (*get)(SEXP, int);
SEXP (*set)(SEXP, SEXP, int);

4.4 Access benefits

Allow C code to use custom vectors without needing to create the C array representation

4.5 That means

  • Only need two values for contiguous intervals
  • Let databases retrieve and set values however they want
  • Rles and other compressed vector types
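
To make the two-values point concrete, here is a small sketch of element and subvector access on an interval that is never materialized. The int_interval type and both functions are hypothetical illustrations, not part of the proposed API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical compact form of start, start+1, ..., start+length-1. */
typedef struct {
    int start;
    size_t length;
} int_interval;

/* Plays the role of the get method: computes the element, no array needed. */
static int interval_get(const int_interval *v, size_t i) {
    assert(i < v->length);
    return v->start + (int)i;
}

/* Plays the role of subvector: a sub-interval is again just two numbers. */
static int_interval interval_subvector(const int_interval *v,
                                       size_t from, size_t n) {
    assert(from <= v->length && n <= v->length - from);
    int_interval out = { v->start + (int)from, n };
    return out;
}
```

A million-element interval and a three-element one occupy the same space, and both operations run in constant time.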

4.6 Querying the vector for properties

Rboolean (*is_sorted)(SEXP);
void (*set_sorted)(SEXP, Rboolean);
Rboolean (*is_contiguous)(SEXP); // is it a contiguous sequence of integers?
void (*set_contiguous)(SEXP, Rboolean);
Rboolean (*contains)(SEXP, SEXP);
Rboolean (*noNAs)(SEXP); //does the vector "know" that it has no NAs?
SEXP (*set_noNAs)(SEXP, Rboolean);

4.7 So

  • sort a no-op after the first time
  • if we know a vector is a contiguous interval of integers, we can create a
    window of the parent when it is used to subset

  • Matching can be silently faster if sortedness known
    • set operations, merging data, grouping
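
A sketch of how a cached sortedness flag turns repeated sorts into no-ops. The flagged_vec type and flagged_sort function are hypothetical; a real implementation would also have to clear the flag on any write that could break the ordering.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical vector that remembers whether it is known to be ascending. */
typedef struct {
    int *data;
    size_t length;
    int sorted;   /* 1 once the data is known to be in ascending order */
} flagged_vec;

static int cmp_int(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Returns 1 if sorting work was done, 0 if the flag let us skip it. */
static int flagged_sort(flagged_vec *v) {
    assert(v != NULL);
    if (v->sorted)
        return 0;
    qsort(v->data, v->length, sizeof *v->data, cmp_int);
    v->sorted = 1;
    return 1;
}
```

The first call pays the usual O(n log n); every later call is a single flag check.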

4.8 specialized lookup and matching

Rboolean (*contains)(SEXP, SEXP);
R_xlen_t (*index_of)(SEXP, SEXP);

4.9 Benefits

  • search sorted vectors in O(log n)
  • Let databases, Rles, etc do matching without full scan
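
A sketch of what an index_of and contains pair could look like for a plain C array already known to be ascending. The function names and the -1 "not found" convention are assumptions for illustration; the slides do not specify return conventions.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the 0-based position of the first occurrence of key in the
   ascending array data[0..n-1], or -1 if key is absent. */
static ptrdiff_t sorted_index_of(const int *data, size_t n, int key) {
    assert(data != NULL || n == 0);
    size_t lo = 0, hi = n;            /* search the half-open range [lo, hi) */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (data[mid] < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return (lo < n && data[lo] == key) ? (ptrdiff_t)lo : -1;
}

static int sorted_contains(const int *data, size_t n, int key) {
    return sorted_index_of(data, n, key) >= 0;
}
```

Each lookup halves the remaining range, so it needs about log2(n) comparisons instead of a full scan.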

4.10 miscellaneous other stuff

void *(*scratch)(SEXP); // retrieve scratch space for SEXP, if any
/* init is called after the SEXP is allocated via the allocator
   It can be a no-op, or can allocate the actual data storage, or
   set the data storage to an existing pointer, etc. 
   3rd argument can be a SEXP, but doesn't need to be (e.g. when
   constructing a custom vector directly from C). */
SEXP (*init)(SEXP, size_t, void*); 
const char *impl_descr;

5 My current plan is to offer 3 custom vector impls

5.1 Basic data decoupling impl

  • Data stored in C-array, but not contiguous with SEXP header
    • Shallow duplication of atomic vectors!
  • Has sortedness, noNA, etc support.
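
One way shallow duplication could work once data is decoupled from the header, sketched with a reference-counted buffer. This is an assumption about a possible design, not the planned implementation; shared_buf, decoupled_vec, decoupled_dup and decoupled_release are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical storage block that several vector headers may share. */
typedef struct {
    int *data;
    size_t length;
    size_t refs;    /* number of headers pointing at this buffer */
} shared_buf;

typedef struct {
    shared_buf *buf;
} decoupled_vec;

/* Mirrors dup_vector(to, from, deep); returns 1 on success, 0 on failure. */
static int decoupled_dup(decoupled_vec *to, const decoupled_vec *from, int deep) {
    assert(to != NULL && from != NULL && from->buf != NULL);
    if (!deep) {                       /* shallow: share the same buffer */
        from->buf->refs++;
        to->buf = from->buf;
        return 1;
    }
    shared_buf *b = malloc(sizeof *b); /* deep: copy the payload */
    if (b == NULL)
        return 0;
    size_t n = from->buf->length;
    b->data = malloc((n ? n : 1) * sizeof *b->data);
    if (b->data == NULL) {
        free(b);
        return 0;
    }
    memcpy(b->data, from->buf->data, n * sizeof *b->data);
    b->length = n;
    b->refs = 1;
    to->buf = b;
    return 1;
}

/* Drops one reference; frees the buffer when the last header lets go. */
static void decoupled_release(decoupled_vec *v) {
    if (--v->buf->refs == 0) {
        free(v->buf->data);
        free(v->buf);
    }
    v->buf = NULL;
}
```

A shallow duplicate costs one counter increment regardless of vector length; a write to a shared buffer would first need a deep copy, which this sketch leaves out.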

5.2 Vector view impl

  • Copyless window into existing R vector
  • Inherit sortedness, etc info from parent
  • Keep parent as attribute to protect from GC
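
A sketch of a copyless window, assuming a parent that exposes a flat int array. The parent_vec and vec_view types and the view_* functions are hypothetical; in R the parent would be kept reachable (for example as an attribute) so the garbage collector cannot free it while the view lives.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical parent vector with a cached sortedness flag. */
typedef struct {
    int *data;
    size_t length;
    int sorted;
} parent_vec;

/* Hypothetical view: no data of its own, only a window into the parent. */
typedef struct {
    parent_vec *parent;  /* must outlive the view */
    size_t offset;
    size_t length;
} vec_view;

static vec_view view_make(parent_vec *p, size_t offset, size_t length) {
    assert(p != NULL);
    assert(offset <= p->length && length <= p->length - offset);
    vec_view v = { p, offset, length };
    return v;
}

/* Plays the role of dataptr: points into the parent's storage, no copy. */
static int *view_dataptr(const vec_view *v) {
    return v->parent->data + v->offset;
}

/* A contiguous slice of an ascending vector is itself ascending. */
static int view_is_sorted(const vec_view *v) {
    return v->parent->sorted;
}
```

Creating the view is constant time and allocates nothing beyond the three fields, whatever the window length.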

5.3 R functions impl

  • Allow users to use R closures in the VFT
  • Shoutout to DTL's RGraphicsDevice
  • Poor performance but great for prototyping/less advanced users

6 Challenges

6.1 LONGJMPs everywhere

  • Data access macros can now cause allocation, which means they can fail.
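
A toy illustration of why this matters, using plain setjmp/longjmp rather than R's error machinery. All names here (error_ctx, failing_dataptr, sum_first_two) are hypothetical. When the accessor jumps out, the rest of the calling function never runs, including any cleanup it intended to do.

```c
#include <assert.h>
#include <setjmp.h>

/* Hypothetical error context; stands in for a top-level error handler. */
static jmp_buf error_ctx;

/* An accessor that may fail, e.g. because it had to allocate and could not. */
static int *failing_dataptr(int should_fail, int *storage) {
    if (should_fail)
        longjmp(error_ctx, 1);   /* control leaves without returning */
    return storage;
}

/* Code written under the old assumption that data access cannot fail. */
static int sum_first_two(int should_fail, int *storage,
                         volatile int *cleanup_ran) {
    int *p = failing_dataptr(should_fail, storage);
    int s = p[0] + p[1];
    *cleanup_ran = 1;            /* skipped entirely if the accessor jumps */
    return s;
}
```

Calling sum_first_two() with should_fail set, from inside a setjmp(error_ctx) block, lands back at the setjmp with cleanup_ran still 0. Any resource the caller held at that point is leaked unless it was registered for release beforehand.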

6.2 Some people are bad programmers

  • If exposed to package authors, easy for them to wreak havoc
    • This is true of packages with C code anyway…