Intel Goes Graph with Hadoop Distro
Intel will be targeting big retail operations with a new graph database that it unveiled today as part of its Intel Distribution for Apache Hadoop version 3 announcement. The graph engine will enable customers to make product or customer recommendations in real time, a la Netflix or Amazon, based on existing data. The chip giant also fleshed out its Hadoop distro with a 20x speedup in encryption functions, a data tokenization option, and a handful of new machine learning algorithms aimed at solving common problems.
Intel, like most Hadoop distributors, starts with the open source Apache Hadoop distribution and other products in the Apache Hadoop ecosystem, and then adds capabilities to differentiate Intel Distribution for Apache Hadoop in the market. The version 3 offering that Intel announced today builds on the Apache Hadoop version 2.x codebase, which gives it a lot of new capabilities out of the gate, including the YARN resource manager.
But supporting YARN only gets Intel so far. Today’s big data customers demand multiple ways to slice their data, and they want it to be fast and simple to use. Delivering on such monumental demands is not easy, but Intel is making strides in creating an extensible set of Hadoop “building blocks” aimed at solving big data challenges for customers in multiple industries.
Intel got its feet wet with graph analytics a year ago when it released Graph Builder as open source. Graph Builder is a set of libraries designed to help developers create graphs based on real world models. Since that first alpha release, Intel developers have streamlined the software and made it easier for users to import, clean, and transform large amounts of data sitting in the graph database. These enhancements will ship in early 2014 as Intel Graph Builder for Apache Hadoop software version 2.
Intel Graph Builder is based on the open source Titan distributed graph database, and uses Pig scripts to trigger queries on top of the graph, says Ritu Kama, director of product management in Intel’s Big Data group. The graph engine adds another analytical option for Intel Hadoop customers, in addition to MapReduce, HBase, Hive, and Mahout, which are all bundled with the distribution.
Example of a graph. Source: https://wiki.digitalmethods.net
“We’re giving customers another way to model their data,” Kama says. “We’re making it all available in an end-to-end package they can deploy. They can pull in data from any existing data, whether it’s a database, CSV files, or text files. We’re giving them a way to load that data into a graph data structure, and then giving them the capability to query that data to understand the linkages between the data sets.”
Netflix, Amazon, and Facebook all use graph engines to perform social analytics for people or to figure out how products or features are connected and inter-related. And Intel plans to help its customers do the same thing.
“The idea here is to put your data in a different model, and then run analytics queries to figure out how one element of a data set are related to other elements of the data set, what are the shortest path between two end points, and what properties are more correlated for various entities that are in your system,” she says. “Those are the kinds of analytics that customers routinely do to better understand the products, better understand customers, and better understand preferences.”
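To make the kinds of queries Kama describes more concrete, here is a minimal sketch in plain Python. It does not use Intel Graph Builder, Titan, or Pig; the sample purchase data, the vertex naming scheme, and the function names `load_graph`, `shortest_path`, and `recommend` are all assumptions made for illustration. It loads CSV rows into a graph structure, finds the shortest path between two end points with a breadth-first search, and ranks products that neighboring customers bought.

```python
import csv
import io
from collections import defaultdict, deque

# Hypothetical purchase records; each row links one customer to one product.
SAMPLE_CSV = """customer,product
alice,camera
alice,tripod
bob,tripod
bob,lens
carol,lens
"""


def load_graph(csv_text):
    """Build an undirected adjacency map from customer,product rows."""
    graph = defaultdict(set)
    for row in csv.DictReader(io.StringIO(csv_text)):
        customer = "customer:" + row["customer"]
        product = "product:" + row["product"]
        graph[customer].add(product)
        graph[product].add(customer)
    return graph


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the list of vertices on a shortest path, or None."""
    if start not in graph or goal not in graph:
        return None
    previous = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        vertex = queue.popleft()
        if vertex == goal:
            path = []
            while vertex is not None:
                path.append(vertex)
                vertex = previous[vertex]
            return path[::-1]
        for neighbor in sorted(graph[vertex]):
            if neighbor not in previous:
                previous[neighbor] = vertex
                queue.append(neighbor)
    return None


def recommend(graph, customer):
    """Rank products bought by customers who share a purchase with this one."""
    owned = graph.get(customer, set())
    scores = defaultdict(int)
    for product in owned:
        for other in graph[product]:
            if other == customer:
                continue
            for candidate in graph[other]:
                if candidate not in owned:
                    scores[candidate] += 1
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda p: (-scores[p], p))


if __name__ == "__main__":
    g = load_graph(SAMPLE_CSV)
    print(shortest_path(g, "customer:alice", "customer:carol"))
    print(recommend(g, "customer:alice"))
```

A production graph engine would distribute this work across a cluster and persist the graph, but the shape of the questions (how are two entities linked, and what do an entity's neighbors have in common) is the same.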
The Graph Builder will be heavily used by customers in the retail industry, says Jason Fedder, general manager of channels, marketing and business operations for Intel’s Datacenter Software Division.
“It’s typically used, at least commercially, for cross-sell and up-sell decision making trees,” he tells Datanami. “So if you’re looking to determine consumer behavior or if you’re looking to determine the likelihood of fraud or risk outcomes, predicated on previously defined behavioral conditions, then that’s the kind of thing that graph analytics helps you do.”
The company is currently working on additional engines, including a stream analytics engine. “We’ll integrate the stream analytics in our next version, which will come out in the early part of next year,” Kama says.
Intel Distribution for Apache Hadoop software 3.0 also brings a new collection of algorithms, dubbed the Advanced Analytics Toolkit. This toolkit introduces several “advanced state” algorithms for doing machine learning on top of customer data, and basically serves as a set of “building blocks” for creating advanced analytic solutions that address common big data use cases, Kama says.
“Instead of starting from scratch, they get prepackaged algorithms,” she says. “A majority of the algorithm work is already done in the toolkit. They can put these building blocks in any sequence they want to deliver a quick analytic solution for any of problem statements–personalization, recommendation, and customer segmentation.”
The third and final new capability in Intel Distribution for Apache Hadoop software 3.0 is in the management and security space. On the security front, Intel says its management software layer, called Intel Manager for Apache Hadoop, can now speed encryption and decryption routines for HBase, MapReduce, Hive and Pig by up to a factor of 20. That is big news for customers in regulated industries that need to encrypt personally identifiable information. Intel is also giving customers additional security with Intel Expressway Tokenization Broker, a “token vault” that removes data from compliance scope.
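The general idea behind a token vault can be sketched in a few lines of Python. This is an illustration of tokenization as a technique, not a description of how Intel Expressway Tokenization Broker is built; the `TokenVault` class, its methods, and the `tok_` prefix are hypothetical names chosen for the example. Sensitive values are swapped for random tokens, and only the vault holds the mapping back to the originals.

```python
import secrets


class TokenVault:
    """Hypothetical in-memory vault mapping sensitive values to random tokens."""

    def __init__(self):
        self._token_by_value = {}
        self._value_by_token = {}

    def tokenize(self, value):
        """Return a stable random token for the value, creating one if needed."""
        if value in self._token_by_value:
            return self._token_by_value[value]
        token = "tok_" + secrets.token_hex(8)
        # Regenerate in the unlikely event of a collision with an existing token.
        while token in self._value_by_token:
            token = "tok_" + secrets.token_hex(8)
        self._token_by_value[value] = token
        self._value_by_token[token] = value
        return token

    def detokenize(self, token):
        """Look up the original value; raises KeyError for unknown tokens."""
        return self._value_by_token[token]


if __name__ == "__main__":
    vault = TokenVault()
    token = vault.tokenize("4111-1111-1111-1111")
    print(token)                    # random token, safe to store downstream
    print(vault.detokenize(token))  # original value, available only via the vault
```

Because the token is random rather than derived from the value, systems that store only tokens hold nothing that can be reversed without access to the vault, which is the usual argument for treating them as outside compliance scope.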
Intel’s proprietary Hadoop management layer (which provides deployment, configuration, monitoring, and alerts) also now gets support for high availability features in HDFS that remove the NameNode as the single point of failure. That is a critical point for customers as they move their Hadoop clusters from proof of concept into production. But Intel’s Hadoop customers aren’t limited to using HDFS: they can also use YARN, Lustre, and GlusterFS as Hadoop-compatible file systems, it says.
Intel Distribution for Apache Hadoop software 3.0 is a big part of Intel’s big data strategy going forward into 2014. But you can also expect Intel to work with additional partners and business intelligence vendors, because, as Fedder likes to say, Intel doesn’t really care about end-user customers.
“We are probably the only vendor in the big data space right now that has no interest in trying to solve a customer’s business problem,” Fedder says. “In a sense, we are trying to create and define software-based building blocks, software components if you like…Fundamentally we’ll always be deployed in conjunction with third-party analytical tools or application. And that’s quite different than the approach that other vendors are taking.”