How Open Source is Failing R for Big Data
At this year’s UseR! conference in Los Angeles, John Chambers, the father of the S language and thus the grandfather of R, noted that R was designed as an interface to the best algorithms. And indeed R excels as an interface to algorithms, which is why so many professional and budding data scientists choose it to drive their analytic workflows.
On the other hand, R was not designed as an interface to massive-scale data: base R data objects are in-memory only. The irony is that the community characteristics that have made R so successful at developing and maintaining a superb interface to algorithms are poorly suited to the very different task of developing and maintaining massive-scale data objects. R finds itself in need of such objects, but without the right open-source community design to respond to that urgent need.
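To make the constraint concrete, here is a minimal sketch (the file name and sizes are hypothetical) of what “in-memory only” means in base R: an object must fit entirely in RAM before any algorithm can touch it.

    # A base R data frame is held entirely in RAM.
    df <- data.frame(id = 1:1e6, value = rnorm(1e6))   # one million rows, all resident in memory
    print(object.size(df), units = "Mb")                # the object's full memory footprint

    # read.csv() likewise materializes the whole table in memory, so a file
    # larger than available RAM cannot be loaded this way at all:
    # df_big <- read.csv("transactions_500GB.csv")      # hypothetical file; would exhaust memory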
There are thousands of R packages and hundreds of thousands of community-contributed R functions, and those numbers keep growing. This is the result of a deliberate community design decision: encourage everyone in the R community to write their own functions and publish them openly via a CRAN package. The strategy has succeeded, attracting the creativity and investment of the open source community that grows and tends R’s algorithmic interface.
Of course, many of those projects – while freely available to the public on CRAN, R’s comprehensive package repository – are never downloaded or used by anyone besides their authors. R is among the least aggressively moderated of the open source programming projects. Its lower threshold for contribution lets many more flowers bloom than, say, the Apache Hadoop project. Where moderation on other projects ensures consistency and reliability, the relative lack of moderation in the R world means higher variance in the value of R packages: a small number are superstars; the vast majority are duds.
Designing the R community for high variance has succeeded brilliantly at encouraging algorithm development. A key ingredient in this success is that the building blocks for algorithms are relatively simple. With a few core data structures and some scripting logic, both classic and contemporary algorithms are readily expressible in R. Until just the past few years, the R community has largely not needed to worry about data and compute engines – in most cases all the data to be analyzed fit into memory, and the base R in-memory data objects of vector, data frame, matrix, and list were perfectly adequate for the vast majority of analytic use cases R users faced.
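As a small illustration of how little machinery those building blocks require, the following base R snippet assembles each of the core in-memory objects and runs a classic algorithm directly on them; nothing here is specific to any package or platform.

    v  <- rnorm(100)                                   # vector
    m  <- matrix(v, nrow = 20)                         # matrix
    df <- data.frame(x = v, y = 2 * v + rnorm(100))    # data frame
    fit <- lm(y ~ x, data = df)                        # a classic algorithm run on in-memory data
    l  <- list(data = df, model = fit)                 # list bundling data and results
    summary(l$model)

All of these objects live in memory, which is exactly the assumption that breaks down at massive scale.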
However, with the advent of big data, the R community can no longer rely on the basic in-memory data structures that brought it this far. And the creative strengths of the open-source R developer community, together with its relative lack of moderation, are not well suited to developing the new data structures needed to connect the community’s algorithmic creativity to the massive, diverse data residing across a cluster.
A different sort of community is needed to develop these new R data structures. And indeed, the “communities” that have been rising to the challenge are frequently corporate engineering teams rather than open-source volunteer communities. Rather than letting a thousand flowers bloom, these teams are focused on creating a small number of hardened data structures that can serve as the backend interface between data of all sizes and types and the algorithms that were designed with nary a thought to the complications involved with massive, diverse data.
The R community continues to create a swirl of diverse and powerful technical tools. But organizations and analysts faced with massive data need not only R but a complementary platform, one that lets them wrap their data in new, massive-scale data structures that can connect with and take advantage of the algorithmic creativity of the open source R developer community. Such a platform should enable R integration and be guided by a similar design goal: find the strength in the diversity, identify what each tool does well, and make the tools play well together in a robust, reliable, usable, industry-hardened platform.
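As a rough sketch of the design pattern being described (not any vendor’s actual API; all names are purely illustrative), the idea is an R object that merely refers to data held in an external engine, with familiar R generics dispatched to methods that push the work to that engine rather than pulling rows into R’s memory. The backend here is simulated so the example runs on its own.

    remote_table <- function(name) structure(list(name = name), class = "remote_table")

    # dim() is an internal generic, so nrow() and ncol() work through it
    # without moving any data into R.
    dim.remote_table <- function(x) {
      message(sprintf("backend computes dimensions of '%s' in place", x$name))
      c(2.5e9, 12)                                  # pretend: 2.5 billion rows, 12 columns
    }

    # head() is an S3 generic in utils; only the requested sample crosses the wire.
    head.remote_table <- function(x, n = 6L, ...) {
      message(sprintf("backend streams %d sample rows of '%s' back to R", n, x$name))
      data.frame(id = seq_len(n), value = rnorm(n))
    }

    tbl <- remote_table("transactions")
    nrow(tbl)      # 2.5e9, computed by the (simulated) backend
    head(tbl, 3)   # only a tiny sample ever lands in R's memory

The payoff of this pattern is that existing R idioms, and by extension much of the community’s algorithmic work, can operate against data that never fully enters R.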
About the author: Nick Switanek is a senior consultant at Teradata Aster. As a data scientist, he’s actively involved with R education and analytic modeling and discovery. Nick also works closely with the Teradata R&D organization to provide product requirements to help drive R implementation and strategy.