Gartner Exclaims “uRiKA!”
Hadoop is synonymous with big data, but perhaps, according to Carl Claunch of Gartner, it should not be. Instead, he suggests that an in-memory system like YarcData’s uRiKA might be better suited to handling big data graph problems.
According to Claunch, graph problems represent the epitome of big data analysis. The issue is that graph problems are incredibly unpredictable by nature. In principle, graph problems can be parallelized just like any other problem.
For example, if one were modeling global wind patterns, one could assign each cube of the globe to a particular compute node. The nodes would then exchange data with one another based on which cubes were adjacent to which other cubes in the model.
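To make the partitioning idea concrete, here is a minimal Python sketch of that kind of spatial decomposition. The grid dimensions, node count, and the owner_node mapping are invented for illustration; they are not drawn from Claunch's report.

```python
# Toy sketch of spatial partitioning for a wind model: the globe is split
# into cubes, and each cube is assigned to a compute node. The grid size
# and node count below are illustrative, not from the article.
GRID = (36, 18, 4)        # cubes along longitude, latitude, altitude
NUM_NODES = 16            # hypothetical cluster size

def owner_node(x, y, z):
    """Map a cube's (x, y, z) grid coordinates to the node that simulates it."""
    cube_index = (z * GRID[1] + y) * GRID[0] + x
    return cube_index % NUM_NODES

def neighbors(x, y, z):
    """Cubes adjacent along each axis; these are the only cubes this one
    must exchange boundary data with, so communication is predictable."""
    for dx, dy, dz in [(1, 0, 0), (-1, 0, 0), (0, 1, 0),
                       (0, -1, 0), (0, 0, 1), (0, 0, -1)]:
        nx, ny, nz = x + dx, y + dy, z + dz
        if 0 <= nx < GRID[0] and 0 <= ny < GRID[1] and 0 <= nz < GRID[2]:
            yield nx, ny, nz

# A node only needs to talk to the owners of adjacent cubes -- the hallmark
# of a problem that partitions cleanly, unlike the graph workloads below.
print(sorted({owner_node(*cube) for cube in neighbors(10, 5, 1)}))
```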
Unfortunately, says Claunch, this approach is problematic for several reasons. One of those reasons is that the nodes that take the most time do not necessarily correspond to the more complicated or interesting regions of the graph, and cannot be identified ahead of time. “The region of the graph in which the search spends most time,” Claunch wrote, “could concentrate in unknown spots or spread across the full graph. A DBMS designed for known relationships and anticipated requests runs badly if the relationships actually discovered are different, and if requests are continually adapted to what is learned.”
Essentially, for a Hadoop-like parallelization of a graph problem to be effective, the relationships it should be picking out have to be known in advance. But the whole point of graph problems is to recognize points of interest that were previously unknown. As Claunch put it, “When the relationships among data are mysterious, and the nature of the inquiries unknown, no meaningful scheme for partitioning the data is possible.”
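A small sketch can show why that matters. The snippet below hash-partitions a graph whose structure is not known in advance and counts how many edges end up straddling partitions; the random graph, cluster size, and partition function are hypothetical stand-ins, not anything from Claunch or YarcData.

```python
# Sketch: hash-partition a graph whose structure is not known in advance
# and count how many edges end up crossing partition boundaries.
import random

NUM_NODES = 16            # hypothetical cluster size
NUM_VERTICES = 10_000
NUM_EDGES = 50_000

random.seed(42)
edges = [(random.randrange(NUM_VERTICES), random.randrange(NUM_VERTICES))
         for _ in range(NUM_EDGES)]

def partition(vertex):
    # With no knowledge of the relationships, hashing is about the best
    # a partitioner can do.
    return hash(vertex) % NUM_NODES

cross = sum(1 for u, v in edges if partition(u) != partition(v))
print(f"{cross / NUM_EDGES:.0%} of edges cross partitions")
# Expect roughly (NUM_NODES - 1) / NUM_NODES, i.e. ~94%: nearly every hop
# in a traversal becomes a round trip between cluster machines.
```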
A practical application of this involves analyzing people’s interactions and actions across a wide network. Taken by itself, no particular action or interaction is suspicious. Added together, however, they may indicate a potential terrorist cell.
For obvious reasons, the US government is interested in big data analysis, and in Hadoop in particular, for solving problems like the one above. However, uRiKA may be more efficient. According to Claunch, uRiKA possesses three technologies that help it rise above the challenges presented by graph problems.
“YarcData’s Threadstorm chip shows no slowdown under the characteristic zigs and zags of graph-oriented processing. Second, the data is held in-memory in very large system memory configurations, slashing the rate of file accesses. Finally, a global shared memory architecture provides every server in the uRiKA system access to all data.”
As noted before, efficient graph processing requires handling unexpected jumps from certain regions of the graph to others. This calls for intense parallelization: “The Threadstorm processor runs 128 threads simultaneously, so that individual threads may wait a long time for memory access from RAM, but enough threads are active so at least one completes an instruction in each cycle.”
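A toy simulation, sketched below, illustrates the latency-hiding idea: the more threads are in flight, the more likely it is that at least one of them completes an instruction in any given cycle. The latency figure, cycle count, and single-issue assumption are invented for illustration and are not Threadstorm specifications.

```python
# Toy simulation of latency hiding: each thread issues a memory request that
# takes LATENCY cycles, and we count the cycles in which at least one thread
# completes an instruction.
import random

LATENCY = 100      # cycles a "RAM" access takes in this toy model
CYCLES = 10_000

def utilization(num_threads):
    random.seed(0)
    # Each thread becomes ready again at some future cycle after its access.
    ready_at = [random.randrange(LATENCY) for _ in range(num_threads)]
    productive = 0
    for cycle in range(CYCLES):
        for t in range(num_threads):
            if ready_at[t] <= cycle:
                # The thread completes one instruction, then stalls on its
                # next memory reference for LATENCY cycles.
                ready_at[t] = cycle + LATENCY
                productive += 1
                break      # one instruction issues per cycle in this model
    return productive / CYCLES

for threads in (1, 8, 32, 128):
    print(f"{threads:>3} threads: {utilization(threads):.0%} of cycles do useful work")
```

With one thread, almost every cycle is spent waiting; with 128 threads and a 100-cycle latency, there is nearly always at least one thread ready to run.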
Running 128 threads simultaneously is clearly an advantage. According to Claunch, other chips would keep only a few out of 128 threads active during a given cycle, making the Threadstorm chip a true, well, storm of threads.
But there is nothing to say that chip cannot be made available to other systems. So why does that level of parallelization work here when other systems, whose essential purpose is to partition and parallelize, come up short? It has a great deal to do with the third technology Claunch listed, the global shared memory architecture. Here, the data is not actually partitioned but shared.
“Employing a single systemwide memory space means data does not need to be partitioned,” Claunch wrote, “as it must be on MapReduce-based systems like Hadoop. Any thread can dart to any location, following its path through the graph, since all threads can see all data elements. This greatly reduces the imbalances in time that plague graph-oriented processing on Hadoop clusters.”
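The snippet below sketches that idea in miniature: several worker threads traverse one graph held in a single shared address space, so any worker can follow any edge without data being shipped between partitions. Python threads are nothing like Threadstorm’s hardware threads, and the graph, worker count, and shutdown scheme are made up; the point is only that a single shared structure removes the partitioning step.

```python
# Sketch of traversal over a single shared, in-memory graph: every worker
# thread can follow any edge because all of them see the same adjacency map.
import threading
from queue import Queue, Empty

graph = {                      # tiny illustrative adjacency map
    "a": ["b", "c"], "b": ["d"], "c": ["d", "e"],
    "d": ["f"], "e": ["f"], "f": [],
}

visited = {"a"}
visited_lock = threading.Lock()
frontier = Queue()
frontier.put("a")

def worker():
    while True:
        try:
            # Timeout-based shutdown is good enough for this toy example.
            vertex = frontier.get(timeout=0.1)
        except Empty:
            return
        for neighbor in graph[vertex]:   # any vertex is reachable directly
            with visited_lock:
                if neighbor in visited:
                    continue
                visited.add(neighbor)
            frontier.put(neighbor)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(sorted(visited))       # every vertex reachable from "a"
```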
Conceptually, it is easy to see why a model that can freely interact with itself, where regions are not limited by their proximity to one another, would be ideal. Frequently, however, solutions that are easy to conceptualize are difficult to actually achieve. Through massively parallel threads that are not subject to the limitations of partitioning, Claunch suggests, YarcData may have actually achieved it.
Finally, the in-memory design of uRiKA addresses the inefficiencies caused by constantly re-reading data from files and by memory references that fall far outside the processor’s fast caches.
“The performance of almost all modern processors is dependent on locality of reference, to exploit very fast but expensive cache memories. When a series of requests to memory are near to one another, the cache may return all but the first of the requests. The first request is slow, as RAM memory is glacially slow in comparison with cache memory.”
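A rough micro-benchmark along those lines is sketched below: it walks the same in-memory array once in sequential order and once in a shuffled order. The array size is arbitrary, and Python’s interpreter overhead masks much of the gap, but the shuffled walk typically comes out measurably slower because it defeats the caches and the prefetcher.

```python
# Rough micro-benchmark of locality of reference: summing the same array in
# sequential order versus a shuffled order.
import array
import random
import time

N = 5_000_000
data = array.array("q", range(N))          # contiguous 64-bit integers

sequential_order = list(range(N))
random_order = sequential_order[:]
random.seed(0)
random.shuffle(random_order)

def walk(order):
    """Sum the array elements in the given visiting order and time it."""
    start = time.perf_counter()
    total = 0
    for i in order:
        total += data[i]
    return time.perf_counter() - start

seq_time = walk(sequential_order)
rnd_time = walk(random_order)
print(f"sequential: {seq_time:.2f}s  shuffled: {rnd_time:.2f}s")
```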
The main argument being made here is that traditional divide-and-conquer methods in computer science are insufficient for solving vast modeling problems. The notion that data can forgo partitioning and go straight into the model is a nice one, but incredibly difficult to achieve. Claunch is perhaps implying that this kind of innovation is hard to come by, as people are more content hammering away at an existing process to make it faster than coming up with a whole new one.
Whatever the case, that is hardly the important point. What is important is that, for Claunch, uRiKA is a significant first step toward solving difficult-to-model graph problems efficiently.