A little-known initiative called XDATA has spawned a collection of technologies that could bring advances to the world of Big Data. The program is funded by the sometimes mysterious, sometimes fanciful DARPA (Defense Advanced Research Projects Agency). In this case, it looks like DARPA’s imaginings could bring some tangible benefits to the real world.
A few days ago, XDATA put up a catalog of open source software that has come out of the effort. Though many of the applications are research-grade works still in development, some are used in production at Big Data cornerstones of the web like Twitter and Facebook, and some are serious enough to have formed the core technology of promising startups (Trifacta and Databricks). The projects each address key challenges in Big Data and are likely to bring substantial improvements in our ability to make sense of massive amounts of data. I’ll highlight some of the major software packages in terms of a few overarching themes.
New platforms for distributed computing.
Big Data involves volumes of data that exceed the capacity of any one machine to store and process. Systems for managing Terabytes and Petabytes of information therefore have to coordinate data processing across multiple machines – sometimes thousands of them. Hadoop was one of the first widely available platforms for getting computers to collectively process data at this scale.
Mesos is an XDATA-funded platform that could be a successor to Hadoop. Roughly speaking, it extends Hadoop by enabling a more diverse set of systems to participate in data processing. Where Hadoop is set up to support one kind of processing (called MapReduce), Mesos lets a collection of computers adopt different patterns of processing, a departure from the “one size fits all” approach. Mesos was initially developed by graduate students at Berkeley interested in expanding Hadoop’s capabilities.
Twitter’s backend engineering group was among the first to grasp the potential of this system, and Mesos is now widely used for Big Data inside Twitter. The Mesos team eventually spawned another effort called Spark, which is also being adopted across the Big Data community. Spark allows Big Data software to trade off memory and disk in ways that increase flexibility and performance, in some cases allowing workloads to complete 100 times faster than Hadoop MapReduce.
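To make the MapReduce pattern mentioned above concrete, here is a toy sketch in plain Python – an illustration of the idea, not Hadoop’s actual API. It shows the three phases (map, shuffle, reduce) on word counting, the canonical MapReduce example; in a real cluster, the map calls would run on different machines in parallel.

```python
from collections import defaultdict

# Toy input: each string stands in for a document stored on some machine.
documents = [
    "big data needs big machines",
    "hadoop splits work across machines",
]

# Map phase: each document is turned into (word, 1) pairs independently,
# so documents can be processed on different machines at the same time.
def map_phase(doc):
    return [(word, 1) for word in doc.split()]

# Shuffle phase: group all emitted pairs by their key (the word).
def shuffle(mapped):
    groups = defaultdict(list)
    for pairs in mapped:
        for word, count in pairs:
            groups[word].append(count)
    return groups

# Reduce phase: combine the grouped counts for each word.
def reduce_phase(groups):
    return {word: sum(counts) for word, counts in groups.items()}

counts = reduce_phase(shuffle(map(map_phase, documents)))
print(counts["big"])       # "big" appears twice in the first document -> 2
print(counts["machines"])  # once in each document -> 2
```

The key property is that the map phase has no shared state, which is what lets a framework like Hadoop scale it across thousands of machines.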
Dealing with real-time data.
BlinkDB is a new kind of database that allows queries to balance accuracy against speed. BlinkDB leverages the idea behind opinion polling. A pollster doesn’t have to ask the entire population of a state how they’re likely to vote to get a sense of the way an election is going to go: if they can get a representative sample of the population, they can give a prediction with a margin of error based on the responses of people in the sample.
Similarly, BlinkDB builds small samples of a given database and can respond with query results that carry a stated margin of error. In cases where there’s historical data to base the sampling on, BlinkDB can give responses 200 times faster than traditional querying. Where this really helps is in real-time applications, where getting a “close enough” sense of what a population is doing can feed a real-time response. Suppose that you have an online store and you’d like to tie promotions to real-time measures of customer sentiment, quickly enough to impact purchasing trends. BlinkDB could save you the effort of having to wait overnight for a full analysis to run across all your data. If it could assess sentiment quickly to within 10% accuracy, that might be good enough to help with hour-to-hour marketing decisions.
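The statistics behind this can be sketched in a few lines of Python. This is not BlinkDB itself, just an illustration of the polling idea it relies on: estimate an aggregate from a random sample and report an approximate 95% margin of error using the standard error of the mean. The data here is synthetic.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Stand-in for a table too large to scan quickly: 100,000 order values.
orders = [random.uniform(5.0, 105.0) for _ in range(100_000)]

def approximate_mean(rows, sample_size):
    """Estimate the mean from a random sample, with a ~95% margin of error."""
    sample = random.sample(rows, sample_size)
    mean = statistics.fmean(sample)
    stdev = statistics.stdev(sample)
    # Standard error of the mean; 1.96 is the 95% z-score under normal theory.
    margin = 1.96 * stdev / sample_size ** 0.5
    return mean, margin

estimate, margin = approximate_mean(orders, sample_size=5_000)
true_mean = statistics.fmean(orders)
print(f"estimate: {estimate:.2f} ± {margin:.2f} (true mean {true_mean:.2f})")
```

Scanning 5% of the rows gives an answer with a sub-1% margin of error here, which is the trade BlinkDB makes: a bounded loss of accuracy for a large gain in speed.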
Beyond storing the data and running queries, the goal of data science is to extract meaning from Big Data. To return to the store example, there are complex algorithms that could be used to predict customer behavior or to understand their feelings about the things they’ve already purchased. Data scientists typically rely on special statistical packages like R and SAS for designing these algorithms. These tools tend to be the platforms of record for data science – most researchers use them to develop state-of-the-art algorithms, they provide a breadth of coverage that is unparalleled, and the implementations have been validated by the world’s statistical community.
The problem is that these tools tend to be very slow, primarily because of design decisions around memory management: a skilled programmer could probably get a 10- to 100-fold improvement for a given R algorithm by rewriting it in a compiled language like C++. Further, these tools are ill-equipped to deal with the huge data sets that are now critically relevant to business. The R installation documentation recommends data sizes no larger than 20% of available memory.
XDATA awardees have developed new modules for R that allow processing of large data sets using a variety of techniques, including integration with Hadoop (the RHIPE project) and memory-management modules that support large data sets (bigmemory). Another XDATA recipient is the Julia statistical language project. Julia is a new computing tool that is fast – for some calculations 100 times faster than R – and has the statistical libraries needed to support serious data science.
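The out-of-core idea behind tools like bigmemory can be illustrated with a small Python sketch – an analogy, not the R package’s API. A statistic is computed over a stream in fixed-size chunks, so only the running totals and the current chunk ever sit in memory, regardless of how large the data set is on disk.

```python
import io

def running_mean(lines, chunk_size=1000):
    """Mean of one number per line, reading chunk_size lines at a time."""
    total, count = 0.0, 0
    chunk = []
    for line in lines:
        chunk.append(float(line))
        if len(chunk) == chunk_size:
            total += sum(chunk)   # fold the chunk into running totals
            count += len(chunk)
            chunk = []            # free the chunk before reading more
    total += sum(chunk)           # leftover partial chunk
    count += len(chunk)
    return total / count

# Simulate a large file with an in-memory stream of 10,001 values (0..10000);
# a real use would pass an open file handle instead.
data = io.StringIO("\n".join(str(i) for i in range(10_001)))
result = running_mean(data)
print(result)  # mean of 0..10000 -> 5000.0
```

Packages like bigmemory apply the same principle with memory-mapped files, so R code can address data far larger than RAM.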
Perhaps the most important outcome of the XDATA effort will be the development of new tools to “see” the data. Bokeh is a graphics library being developed by Continuum Analytics that allows data scientists and BI professionals to deliver stunning graphical displays of data without much low-level coding. Vega is another tool that provides a high-level visualization language for laying out complex visualizations, and imMens supports visualization of very large data sets.
The great thing about all these tools is that they are free, open source efforts that will no doubt support the development of the next generation of data science tools. In posts to come, I’ll step through usage of some of these platforms.
Written by: Dr. Charles Earl
Charles Earl is an Atlanta native with a BS in Math from Morehouse, a Masters in Electrical Engineering from Georgia Tech, and a PhD in Computer Science from the University of Chicago. For the past ten years he has worked on applying machine learning to solve challenges in science, music recommendation, defense, and education, and he is now one of Charter Global’s top Data Scientists working with some of the largest telecom companies in the world.