Cloudera Releases Impala Into the Wild
On Tuesday, Cloudera announced that Impala, their SQL on Hadoop initiative, is moving out of public beta and into general availability.
Purpose built for low latency querying, Impala is a native MPP query engine that’s designed to exist beside MapReduce as another tool in the data analyst’s toolkit. Impala shares all the same integration features common to a Hadoop implementation including metadata, storage, resource management, etc., but provides real-time SQL (ANSI-92) querying. The result is an evolution away from Hadoop being merely a platform for batch processing workloads, and towards a multi-purpose, interactive BI and analytics machine.
“I often use the analogy that MapReduce is like a dump truck – great at moving huge volumes of things around, and Impala is like your Ferrari – very quick at getting nimbly to the answer that you’re looking for,” said Justin Erickson, Senior Product Manager at Cloudera.
In our conversation, Erickson said that Cloudera sees the core benefits of Impala as providing more and faster value from big data, noting that Impala has access to the raw, full granular data of Hadoop, which is often not in traditional storage engines.
“Whereas traditional data management architectures would typically be in the tens of thousands of dollars per terabyte, with Hadoop you’re looking at hundreds of dollars per terabyte which makes it much more cost efficient to store big data scale data sets within Hadoop,” commented Erickson. “And what that often means is that you get full fidelity analysis. [Users] can store the raw and full data sets, not just the computed aggregations, and then of course, due to the flexibility benefits, I do not need to deal with fixed schemas to load the data in. I have the ease of Hadoop to bring the data in their native format, and then apply schema on read time.”
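The schema-on-read idea Erickson describes can be sketched in a few lines of Python. This is an illustrative sketch only, not Impala code: the sample records, the `EVENT_SCHEMA` column names, and the `read_with_schema` helper are all assumptions made up for the example. The point is that the raw data is stored untouched, and column names and types are applied only when a query reads it.

```python
import csv
import io

# Hypothetical raw records, stored exactly as they arrived (no load-time schema).
RAW_DATA = """2013-04-30,alice,login,12
2013-04-30,bob,purchase,250
2013-05-01,alice,purchase,75
"""

# A schema here is just (column name, converter) pairs chosen by the reader.
EVENT_SCHEMA = [("day", str), ("user", str), ("action", str), ("amount", int)]


def read_with_schema(raw_text, schema):
    """Apply a schema to raw delimited text at read time, yielding dict rows."""
    for fields in csv.reader(io.StringIO(raw_text)):
        yield {name: convert(value)
               for (name, convert), value in zip(schema, fields)}


def total_purchases(raw_text):
    """Sum the amount column over purchase events, parsing on the fly."""
    rows = read_with_schema(raw_text, EVENT_SCHEMA)
    return sum(row["amount"] for row in rows if row["action"] == "purchase")


if __name__ == "__main__":
    print(total_purchases(RAW_DATA))
```

Because the stored text never changes, a different reader could apply a different schema to the same records without reloading anything.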
Erickson also noted that due to the commonality of SQL, Impala (and SQL on Hadoop paradigms like it) opens up the Hadoop stores to more users. “Today, it’s not uncommon to see petabyte clusters with hundred or even thousand node clusters with maybe tens of users that are able to interact with that,” commented Erickson, noting that these users had to have MapReduce development skill sets – something that is not in great supply out in the talent pool.
“But now with Impala, you get more users – and for the existing users, they no longer have the current working paradigm… Now I can actually iterate in real time with the system. It is perfectly reasonable for me to execute a query, wait for a response, get it in seconds, look at the results of that, and figure out my next query.”
Impala, which launched into private beta last May, and then into public beta in October, was at the front end of a movement toward interactive SQL on Hadoop that has started to really blossom with several recent SQL on Hadoop announcements. Following the line of logic that if you’re not catching flak, you’re not over the target, Erickson noted that Cloudera is definitely catching flak from competitors in the space – an observation we noted to be generally true.
EMC has been particularly vocal about the target that they’ve painted on Cloudera’s back, through their Greenplum arm that has been recently spun out as Pivotal. The company has made no bones about going head to head with Cloudera for the heart of Hadoop, and has gone so far as to publish benchmarks comparing their HAWQ offering with Impala, using what they called an industry standard data set.
Erickson dismissed the results as flattery, saying that EMC’s aggressive stance is great validation of their direction and the level of concern that EMC has for how the market is reacting to their respective offerings.
He also noted that there may have been some technical sleight-of-hand in the numbers (if you can believe it). “When they do their comparisons, they were comparing on the data after it’s been loaded into their silo database,” explained Erickson. “And they sort of hand-picked a set of queries that worked very well on columnar data structures. If you ask them for the performance of the same data in the native Hadoop, the performance numbers are going to be very, very different.”
If you didn’t guess, Cloudera says that they are in the lead in the SQL on Hadoop race, noting that their design strategy is what places them firmly in the catbird seat over competitive offerings. The company says that the current offerings fall into one of three different implementation types, each with their own issues.
- Batch MapReduce – This SQL on Hadoop design aims to make MapReduce faster. While there are a lot of opportunities to make MapReduce faster, at the end of the day, it’s still going to be a batch processing engine. “Trying to get it to go faster and achieve interactive latency is increasingly going to be a pain point as you’re designing against the grain,” says Erickson.
He notes that although Google has done a lot of work with Tenzing to make MapReduce much faster for batch SQL, when the company designed Dremel, its interactive SQL engine, it did so outside of MapReduce. “We believe that is the right architecture, and we believe that as fast as MapReduce can be, it’s not going to be able to achieve the low latency of a purpose built query engine.”
- Remote Query – The next approach that Cloudera observes in the SQL on Hadoop design contest is remote query. Erickson notes that some vendors actually started with batch MapReduce approaches, and switched tracks when faced with the realization that MapReduce is their chief bottleneck. Erickson says that these vendors, having already built a distributed query engine, don’t want to reinvest in all of that engineering, and instead move towards using HDFS as an external file system.
“The most notable example of this is the Aster approach with SQL-H, where they have a set of distributed query engine nodes and a set of Hadoop nodes, and when they do queries, they will have the data get pulled from the HDFS file system over the network and then begin the processing in their distributed query engine,” says Erickson. He notes that while this will work great for a proof of concept, things get dicey if a process requires bringing big data for a query across the network before it can begin processing. “The network is increasingly going to become a bottleneck as the size of data grows, and so you’ll get slowness from the network bottleneck, and will now have to spend for a pretty expensive traditional query engine at a different price point than what you traditionally see in Hadoop without getting the primary benefits.”
- Siloed DBMS – Erickson claims that vendors like EMC actually started with remote query and saw the bottlenecks around the network and started moving towards the siloed DBMS following what he says was Hadapt’s lead within the space. This approach, says Erickson, takes the distributed query engine and parks it on each one of the nodes that are running Hadoop, and runs that side by side with a Hadoop cluster. Erickson charges that in the EMC approach, they’ll not only have that live side by side, they’ll store their proprietary files within HDFS.
“The downside of this is, even though it’s technically storing its data in HDFS, users have to still load the data into their engine to get any of the good performance benefits, and once I do, that file is locked out from MapReduce, Pig, Hive, and the rest of the ecosystem as it’s not a native Hadoop file format that’s available for the other engines to operate against.” Erickson says that in that instance, users end up with the same ETL, same rigidity of a traditional data warehouse, and end up with siloed data.
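The network bottleneck Erickson attributes to the remote query approach can be made concrete with back-of-the-envelope arithmetic. The sketch below is not taken from any vendor’s figures: the data sizes and the 10 Gb/s link speed are assumed purely for illustration, and the calculation ignores protocol overhead, compression, and parallel links.

```python
def transfer_seconds(data_terabytes, link_gigabits_per_second):
    """Seconds needed to pull the data across a link before processing starts.

    Uses decimal units (1 TB = 10**12 bytes, 1 Gb/s = 10**9 bits per second).
    """
    data_bits = data_terabytes * 1e12 * 8
    link_bits_per_second = link_gigabits_per_second * 1e9
    return data_bits / link_bits_per_second


if __name__ == "__main__":
    # Assumed example sizes; the wait grows linearly with the data pulled.
    for terabytes in (1, 10, 100):
        seconds = transfer_seconds(terabytes, link_gigabits_per_second=10)
        print(f"{terabytes} TB over 10 Gb/s: about {seconds / 60:.0f} minutes")
```

Under these assumptions, a query that must first pull tens of terabytes waits on the network for hours, whatever the speed of the engine behind it.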
Cloudera’s approach, says Erickson, differs from these design strategies due to the fact that they are taking an MPP query engine that is purpose-built to be a part of the Hadoop ecosystem where the data lives in Hadoop in its native formats. Erickson claims that users don’t have to load it into any special structure, and they get the query engine directly against the raw data so that they can run MapReduce, they can run Hive, they can write machine learning jobs, data processing jobs, and have them all running on the same data at the same time.
Specific customer use case is still going to be the ultimate factor, commented Curt Monash of Monash Research. “Cloudera Impala is not and will not soon be a fast or richly-featured analytic RDBMS,” he told us. “It’s slow and limited when compared with stand-alone analytic RDBMS products. Impala should be considered as a limited DBMS integrated into a Hadoop cluster you’d have anyway for other reasons. In that context, it could be very useful.”
Erickson’s own statements validate this. “We don’t see this as a replacement to a data warehouse,” he commented. “I don’t see customers that are throwing away their Teradata systems to go and move it onto Impala. Those systems have advanced SQL functionality that you want to be able to do for your gold standard of data. I see this as much better for the exploratory queries when I have a bunch of joins, a bunch of aggregations, and I might not be taking advantage of advanced analytic functions like window and time series functions.”
While Cloudera has other vendor offerings in the space to contend with, there is research coming out of the UC Berkeley AMP Lab, where Shark (Hive on Spark) is starting to generate some buzz for its supposedly fast and fault-tolerant parallel execution engine. In particular, researchers from this space have noted that their project is ahead of Impala where fault tolerance and in-memory data processing to speed up queries are concerned.
“We’ve actually looked at what Spark, in particular, has been able to do with how they keep data sets in-memory, and we see the main benefits of pinning data sets in-memory,” said Erickson. “Impala has been built to be able to take advantage of whatever the raw underlying hardware is with minimal overhead, and you’ll see things like pinning data sets in-memory come about.”
Addressing fault tolerance, Erickson admitted that in its current implementation, if a node fails mid-query, that query would get aborted, and the user would need to reissue that query. “If you’re looking at queries that are coming back an order of magnitude or more faster than what you can achieve today on Hive, running a query twice in the event of a failure on a query that runs a couple seconds is still dramatically faster than waiting for the fault tolerance mechanisms of existing systems.” Even still, Erickson says that it’s not something that is designed in a way that they can’t go and add it in.
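The reissue-on-failure behavior Erickson describes can be sketched from the client side. This is a hypothetical illustration, not Impala’s client API: `NodeFailure`, `run_with_reissue`, and the `make_flaky_query` stand-in are invented names, and the sketch simply reruns the whole query from scratch when an attempt is aborted.

```python
class NodeFailure(Exception):
    """Hypothetical error standing in for a query aborted by a node failure."""


def run_with_reissue(run_query, sql, max_attempts=2):
    """Run a query; if it is aborted mid-flight, reissue it from scratch.

    Returns (result, attempts_used). Re-raises after max_attempts failures.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return run_query(sql), attempt
        except NodeFailure:
            if attempt == max_attempts:
                raise


def make_flaky_query(failures_before_success):
    """Build a stand-in query function that fails a fixed number of times."""
    state = {"calls": 0}

    def run_query(sql):
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise NodeFailure("node lost mid-query")
        return [("ok",)]

    return run_query


if __name__ == "__main__":
    result, attempts = run_with_reissue(make_flaky_query(1), "SELECT 1")
    print(result, attempts)
```

The trade-off in the quote is that no intermediate state is saved, so a retry costs a full rerun, which is acceptable only when each run takes seconds.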
Erickson noted that Cloudera has taken note of the work that is happening in the AMP Lab and commented that we may see more of it integrated into the Hadoop sphere. “We’ve actually hired some people from Berkeley who are familiar with the technology to take the great research that they’ve done and productionize that within the Hadoop ecosystem.”
Looking towards the future, Erickson noted that there are two dimensions along which he sees SQL on Hadoop efforts growing. The first, he says, is that it will be a commoditizing force on existing workloads that are in the traditional SQL world. “You’ll see more advanced SQL functionality that doesn’t exist today in Impala will come in there,” he said, noting such things as window functions and other features.
Secondly, he says that there is another dimension that will happen around taking better advantage of the flexibility that Hadoop natively offers. “The fact that you have SQL on Hadoop solutions that can query across multiple file formats in their native structures means that I can have data come in raw data formats and convert them on the fly to a more optimal format – all the while having all of your data available across multiple frameworks. That’s something that is unique and different from anything you could have done with existing technologies.”
Erickson says that Cloudera expects there to be more peers of Impala joining the fray. “There will be other processing frameworks that will do different types of processing beyond what you can do in SQL, and beyond what you’d want to write in MapReduce, and I think the story is going to start growing horizontally there as well.”