Picking the Right SQL-on-Hadoop Tool for the Job
SQL is, arguably, the biggest workload many organizations run on their Hadoop clusters. And there’s good reason why: The combination of a familiar interface (SQL) along with a modern computing architecture (Hadoop) enables people to manipulate and query data in new and powerful ways. But not all SQL-on-Hadoop tools are equal, and that makes picking the right tool a challenge.
There’s no shortage of SQL on Hadoop offerings, and each Hadoop distributor seems to have its preferred flavor. The list begins with the original SQL-on-Hadoop engine, Apache Hive, which was created at Facebook and is now backed by Hortonworks with its Stinger initiative. In 2012, Cloudera rolled out the first release of Apache Impala to great fanfare, while lately MapR has been pushing the schema-less bounds of SQL querying with Apache Drill, which is based on Google‘s Dremel.
But the prospective SQL-on-Hadoop user may also want to check out other offerings, including those from mega-vendors like BigSQL (backed by IBM), Big Data SQL (backed by Oracle), and Vertica SQL on Hadoop (backed by Hewlett-Packard). Others in the race include Presto (created by Facebook, now backed by Teradata), Vortex (backed by Actian), Apache HAWQ (backed by Pivotal) and Apache Phoenix. Last but not least is Spark SQL, the SQL-enabling component of Apache Spark that is quickly gaining momentum.
Buckets of SQL on Hadoop
However, most Hadoop distributors include multiple SQL engines with their core offerings, which should tell you something: namely, that three is a very broad spectrum of capabilities within the SQL-on-Hadoop category. In fact, there are several sub-categories of SQL on Hadoop tools. It’s helpful to think of three main buckets: batch-oriented, interactive, and in-memory.
Apache Hive is the best example of the batch-oriented SQL on Hadoop tool. The software essentially recreates a relational-style database atop HDFS, and then uses MapReduce (or more recently, Apache Tez, thanks to Hortonworks’ Stinger efforts) as an intermediate processing layer. It’s most often used for running big and complex jobs, including ETL and production data “pipelines,” against massive data sets.
The interactive SQL-on-Hadoop category is more crowded. Impala was the first tool to attempt to deliver interactive-like response to SQL queries running over data on HDFS. But Cloudera didn’t have this category to itself for very long, and soon had a host of competitors, like Drill, HAWQ, Presto, Vortex, and Vertica SQL on Hadoop, not to mention the IBM and Oracle products.
These tools excel at executing ad-hoc SQL queries and performing self-service data exploration, and are often used directly by data analysts or execute the machine-generated SQL code from BI tools like Qlik and Tableau. Latency is usually measured in seconds to minutes.
The third category is defined by an in-memory computing architecture, and Apache Spark SQL pretty much owns this category, although Apache Flink could provide Spark SQL with competition in this category. Data scientists who are doing self-service data exploration are the biggest users of Spark SQL, which delivers results very quickly (in milliseconds to seconds).
Consolidation Looms
Ashish Thusoo, who led the development Apache Hive while working at Facebook from 2007 to 2011, agrees that the SQL-on-Hadoop tool market is a pretty topsy-turvy place, with many vendors making performance claims that are tough to be substantiated. But over time, Thusoo, who founded and is CEO at Hadoop-as-a-service provider Qubole, says the market will eventually settle down.
“Some of these products will die out. In each of these architectures, there will be two or three winners,” Thusoo says. “I don’t think there’s an Uber architecture that has all of these things together [in the short term]. That convergence is still a few years out where you have a single SQL engine that does all of them together.”
The reliability and performance of Hive-on-MapReduce and Hive-on-Tez will ensure that it lives on huge batch jobs, he says, while Spark SQL appears headed to the winners circle in the in-memory category. But it’s too soon to call the winner for that second category revolving around interactive SQL access by data analysts.
“I think it will basically boil down to who owns the architectural advantage and which one gets the most mindshare,” Thusoo tells Datanami. “There’s two or three things that you do in the SQL engine to make it fast, and I think pretty much everybody knows what those two or three things are, so I don’t think there’s a huge advantage that one tool has over the other.”
One of the key differences that Thusoo sees among the interactive SQL-on-Hadoop tools is how they were built. Some of the tools, such as Impala and Drill, were developed from the beginning to run on Hadoop clusters, while others are essentially ports of existing SQL engines that previously ran on vendors’ massively parallel processing (MPP) databases.
“To run some of these engines on Hadoop, I think that’s a much tougher bridge,” he says. “It’s much better to write something from the ground up, which is what some of these other tools are doing, because some of the fundamental infrastructure assumptions have changed with Hadoop. There’s a lot more distributed processing. There’s a lot more scale. There’s a lot more check-pointing capabilities. There’s a lot more building to an architecture that has built-in fault tolerance, and the emphasis is not there so much in the MPP architecture.”
Other Considerations
There are other differences, of course, including the types of file formats that are supported, and, adherence to ANSI SQL standards, and the capability to run on non-Hadoop clusters. Impala, for example, accesses HDFS data that’s encoded in the Parquet format, while Hive and others support optimized row column (ORC) files, sequence files, or plain text. The different encoding standards result in different block sizes, and that can impact performance.
The file format question factored into YP.com‘s SQL-on-Hadoop decision. The Georgia-based company, which runs a 1,000 node Hadoop cluster based on Cloudera’s CDH5 distribution, was leaning toward HP’s Vertica SQL on Hadoop offering in September when Datanami spoke with Bill Theisinger, vice president of engineering for YP.com’s Platform Data Services Group. The fact that ORC files compress smaller than Parquet files was a big factor, he said.
“We’re capturing a lot of data in Hadoop and HBase, so a lot of data resides in HDFS,” Theisinger explained. “And we can store that data in ORC far more performant than any other formats we’ve used in the past, like sequence files for example.”
Different SQL-on-Hadoop tools support different source files too. It used to be that Hadoop users would use a tool like MapReduce to pound unstructured data into a more structured or relational format. But lately, tools like Drill have opened up SQL-based access directly to semi-structured data, such as JSON, which is a common format found on NoSQL databases like MongoDB‘s, and increasingly relational databases too. Cloudera recently added support for JSON in Impala, which shows you how the state-of-the-art is evolving.
SQL-on-Hadoop tool vendors are prone to making big and bold claims about their tools’ performance, but it’s best to take those with a grain of salt. The fact that vendors have been known to cherry pick how their products run on older decision support benchmarks (therefore giving skewed comparisons) led the Transaction Processing Performance Council (TPC) to unveil a new benchmark specifically designed to account for distributed computing platforms, such as Hadoop. It’s best to wait until the vendors actually run their products through the new TPC-DS 2.0 benchmark to make accurate comparisons.
Until the day there’s one Uber SQL-on-Hadoop tool that does everything well, Hadoop customers are advised to use multiple tools to complete their data analytics tasks. You should keep in mind that that day may never come.
“One tool is never going to do everything well,” Cloudera product marketing manager Alexandra Gutow told Datanami in November, when Cloudera added Spark SQL to its distribution, which already includes Impala and Hive. “Make sure you have the right tools for the right job for the right user.”
Related Items:
New TPC Benchmark Puts an End to Tall SQL-on-Hadoop Tales
Inside Yellow Pages’ SQL-on-Hadoop Journey
How Advances in SQL on Hadoop Are Democratizing Big Data—Part 1
These SQL-on-Hadoop solutions follow a similar architecture and are not really suited for BI use with large data sets. Jethro (http://jethro.io) is a SQL-on-Hadoop solution that was specifically developed for BI use with a different architectural approach than Hive, Spark, Drill, etc.