Follow BigDATAwire:

June 28, 2013

Yahoo! Spinning Continuous Computing with YARN

Isaac Lopez

YARN was the big news this week, with the announcement that the Hadoop resource manager is finally hitting the streets as part of the Hortonworks Data Platform (HDP) “Community Preview.”

According to Bruno Fernandez-Ruiz, who spoke at Hadoop Summit this week, Yahoo! has been able to leverage YARN to transform the processing in their Hadoop cluster from simple, stodgy MapReduce, to a nimble micro-batch engine processing machine – a change which they refer to as “continuous computing.”

The problem with MapReduce, explained Fernandez-Ruiz, VP of Platforms at Yahoo!, is that it once the processing ship has launched, new data is left at the dock. This creates a problem in the age of connected devices and real time ad serving, especially when the network your running is trying to make sense of 21 billion events per day, and you’ve got users and advertisers counting on you to be right.   

Having MapReduce batch jobs that take anywhere between two and six hours, Yahoo! wasn’t getting the fidelity that they needed in their moment-to-moment information for such things as their Right Media services, their personalization algorithms, and their machine learning models. Problems such as this became an impetus for their getting involved with Hortonworks and the Apache Hadoop community to develop YARN, said Fernandez-Ruiz.

“We figured out that we had to change the model, and this is what we’re calling ‘continuous computing,’” he explained. “How do you take MapReduce and change this notion of [going from] big long windows of several hours to running in small continuous incremental iterations – whether that is 5 minutes, or 5 seconds, or a half a second. How do you move that batch job from being long running, to being micro-batch?”

One of the solutions they’ve turned to, says Fernando-Ruiz, is to leverage YARN to move certain processes from big batch jobs to streaming.  To accomplish this they turned to the open source distributed real-time computation system, Storm.

Pivoting off of YARN, Yahoo! was able to use Storm to reduce a process window that was previously 45 minutes (and as long as an hour and a half) long, to sub-5 seconds, correcting a problem that they had with unintentional client over-spend on their Right Media exchange.

Fernando-Ruiz says that they currently have Storm running in this implementation on 320 nodes of their Hadoop cluster, processing 133,000 events per second, corresponding roughly to 500 processes that are running with 12,000 threads. (Fernando-Ruiz says that their port of Storm into YARN has been submitted into the Storm distribution, and is now available for anyone to use).

He explained that they are also using YARN to run UC Berkely AMPLab’s data analytics cluster computing framework, Spark, in conjunction with MapReduce to help with personalization on their network. According to Fernando-Ruiz, Yahoo! is using long-running MapReduce to calculate the probabilities of an individual users interests (he used the example of a user having a fashion emphasis).  While the batch runs, Spark is continuing to score the user.

“If you actually send an email, perform a search query, or you have been clicking on a number of articles, we can infer really quickly if you happen to be a fashion emphasis today,” he said. They then take the data and feed it into a scoring function, which flags the individual according to their interests for the next few minutes, or even 12 hours, delivering personalized content to them on the Yahoo! network.

According to Fernando-Ruiz, this deployment with Spark is currently running as a much smaller deployment of 40 nodes, and that the continuous training of these personalization algorithms has experienced a 3 times speed up. (Again, Spark has been ported to YARN, and is currently available for download).

The final use case he discussed uses HBase, Spark, and Storm running on 19,000 nodes for machine learning to train their personalization models. With data constantly accumulating, Yahoo! looks towards using the flood to calibrate and train their models and classifiers.

“The problem is that every time you load all that data…it’s a long running MapReduce job – by the time you finish it, you’re basically changing maybe 1% of the data. Ninety-nine percent of the data is that same, so why are you running the MapReduce job again on the same amount of data? It would be better to actually do it iteratively and incrementally, so we started to use Spark to train those models at the same time we’re using Storm.”

He says that Hadoop 2.0 is changing the way that they view Hadoop altogether. “It’s no longer just these long-running MapReduce jobs – it’s actually the MapReduce jobs together with an ability to process in very low latency the streaming signals that we get in. To not have that window of processing , but actually have a very small window of processing together with the ability to not have to reload all the data set in memory…to go and iteratively and incrementally do those micro-batch jobs.”

All that, he says, is possible thanks to YARN, and the new development of splitting the resource manager.

Related Items:

Hortonworks Previews Future After Massive Funding Haul

The Art of Scheduling in Big Data Infrastructures Today

Gartner’s Adrian Raps on Big Data’s Present and Future

BigDATAwire