Pinterest Shoots ‘Pinball’ Into Open Source
Pinterest announced yesterday that it has released Pinball, the workflow management software it developed to run its big data pipelines, as open source. Now anybody can use the same technology that Pinterest uses to manage the flow of work on Hadoop and other cluster resources.
Pinterest came onto the social media scene in 2010 with a relatively simple Web and mobile app that lets users share pictures and videos by “pinning” them to their pinboards. Today, Pinterest has become the 32nd most popular website in the world, according to Alexa, and manages tens of billions of pins per day, or 120,000 per second.
Hadoop plays a central role for Pinterest, crunching nearly 3 petabytes of data every day for the company. But there are other components too, and that’s what sent the Pinterest engineers in search of a good workflow engine to manage their entire big data pipeline.
“After experimenting with a few open-source workflow managers we found none of them to be flexible enough to accommodate the ever-changing landscape of our data processing solutions,” the company wrote on its engineering blog. “In particular, current available solutions are either scoped to support a specific type of job [e.g. Apache Oozie optimized for Hadoop computations] or abstractly broad and hard to extend [e.g. monolithic Azkaban].”
After finding none of the open source workflow managers to its liking, the company’s engineers decided to create their own. Flexibility was a big requirement, as the workflow engine had to be able to handle everything from basic shell commands “to elaborate ETL-style computations on top of Hadoop, Hive and Spark.”
The company created Pinball to oversee the data pipeline within a master-worker architecture, where the master acts as the single source of truth about the current system state, and workers only communicate with the master. Pinball’s architecture enables Pinterest engineers to easily upgrade or replace a given computational node without fear of upsetting the whole apple cart.
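To make that division of labor concrete, here is a minimal sketch of the master-worker pattern described above, written in plain Python with threads standing in for networked workers. This is not Pinball’s actual code; the Master class and its claim_job and report methods are invented for illustration only.

```python
import queue
import threading

class Master:
    """Single source of truth for workflow state; workers never talk to each other."""

    def __init__(self, jobs):
        self._lock = threading.Lock()
        self._pending = queue.Queue()
        self._state = {}
        for job in jobs:
            self._state[job] = "PENDING"
            self._pending.put(job)

    def claim_job(self):
        """Hand the next runnable job to a worker, or None if nothing is left."""
        try:
            job = self._pending.get_nowait()
        except queue.Empty:
            return None
        with self._lock:
            self._state[job] = "RUNNING"
        return job

    def report(self, job, status):
        """Workers report outcomes here; only the master ever mutates state."""
        with self._lock:
            self._state[job] = status

    def snapshot(self):
        with self._lock:
            return dict(self._state)

def worker(master):
    # Each worker loops: ask the master for work, run it, report back.
    while (job := master.claim_job()) is not None:
        # The actual work (a shell command, a Hadoop step, etc.) would run here.
        master.report(job, "SUCCEEDED")

master = Master(["extract", "transform", "load"])
threads = [threading.Thread(target=worker, args=(master,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(master.snapshot())  # every job ends up SUCCEEDED
```

Because the workers hold no authoritative state and only ever talk to the master, any one of them can be stopped, upgraded, and restarted without the rest of the system noticing.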
Pinball handles hundreds of workflows with thousands of jobs that process almost 3 petabytes of data every day on Pinterest’s Hadoop clusters, the company says. The workflows automate a large volume and variety of jobs, everything from generating analytics reports and building search indices to training machine learning models, “and a myriad of other tasks.” The largest workflow Pinball manages has more than 500 jobs, the company says.
While the Python-based Pinball (which is available for download at Pinterest’s GitHub site) is highly customizable, the software comes with a default implementation of clients that allow users to define, run, and monitor workflows, the company says. Users can define their own workflows through a UI workflow builder, or import them from other systems. The software supports a Python-based workflow configuration syntax, as well as a number of job templates for configuring simple shell scripts.
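Pinball’s real configuration syntax is documented in the GitHub repository; purely as an illustration of what a Python-based workflow definition of this kind can look like, consider the sketch below, where every name (SHELL_JOB_TEMPLATE, WORKFLOWS, depends_on, and so on) is hypothetical rather than Pinball’s actual API.

```python
# Hypothetical illustration of a Python-based workflow configuration.
# None of these names come from Pinball itself; consult the project's
# GitHub repository for the real configuration syntax.

SHELL_JOB_TEMPLATE = "bash -c '{command}'"  # template for a simple shell-script job

WORKFLOWS = {
    "daily_metrics": {
        "schedule": "00:30",  # run once a day at 00:30 UTC
        "jobs": {
            "extract_events": {
                "template": SHELL_JOB_TEMPLATE,
                "command": "python extract_events.py",
                "depends_on": [],
            },
            "build_report": {
                "template": SHELL_JOB_TEMPLATE,
                "command": "python build_report.py",
                "depends_on": ["extract_events"],  # runs only after extraction succeeds
            },
        },
    },
}
```

Expressing workflows as ordinary Python rather than a fixed XML or GUI-only format is what lets a single configuration range from one shell command to a long chain of Hadoop, Hive, and Spark jobs.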
Last year, Pinterest engineer Pawel Garbacki lamented the lack of a standard for workflow management. “Hadoop is the technology of choice for large scale distributed data processing, while Redis does the same for in-memory key-value stores, and Zookeeper handles synchronization of distributed processes,” he said in a January 2014 blog post. “So why isn’t there a standard for workflow management?”
Maybe now we’ll have one.