Hortonworks Dishes on Distro Differences
The Hadoop hoopla heated up this month following the deluge of announcements that flooded out of the Strata HadoopWorld event. From kicking the platform into new real-time territory to partnerships and integrations galore, it was easy to lose sight of the ever-expanding ecosystem map.
To make sense of the activity, we sat down for an hour with Ari Zilka, Chief Product Officer at Hadoop distro vendor Hortonworks, to outline the company’s strategy in the face of some drastic changes to the core Hadoop platform (which Hortonworks helps steward within Apache) as well as to discuss its own role in the industry shakeup.
In addition to providing some nice contrast to our conversation last week with Cloudera CEO Mike Olson, we were able to take the pulse of the decidedly open source, Apache-rooted side of the fence.
For those who follow industry news, Ari Zilka’s name might be familiar; he was co-founder and former CTO at Terracotta, which focuses on big data management via an in-memory, performance-focused approach. Prior to that, he was Chief Architect for Walmart.com, in addition to his entrepreneurial activities at big data investment powerhouse Accel Partners. He’s one of the big data world’s go-to guys for perspective on the branching vendor ecosystem, and has been responsible for everything from funding decisions to helping Walmart.com build an enterprise data warehouse so robust that Walmart almost got into the data warehousing business.
Creating new uses (and breaking points) for Hadoop is nothing new to Zilka, who says he visualizes what’s happening around Hadoop in terms of slices in a steadily growing pie. “Hadoop took one slice out of that that nothing else could handle,” he said. This was the multi-hundred-terabyte slice, the petabyte slice, the $100k slice versus the $10 million EDW slice.
From the time Hortonworks started until now in 2012, that metaphorical pie has been carved up to dedicate 99 percent to the EDW and just 1 percent to Hadoop. He predicts that over time, however, the pie itself will simply continue to grow in overall size as decisions are made to offload things that never belonged in the EDW in the first place. This will carve away ten or twenty percent of the overall workload, the same share that is making EDW customers scream because of cost and workability.
To highlight this, Zilka says that they have customers trying to do ETL between Informatica and Teradata who find that their query workload on Teradata eats up 20 percent of CPU utilization while the ETL job alone, just to load the data, takes 80 percent. When they ran the same ETL on Hadoop, he said, it took less than 20 percent of a Hadoop cluster that costs just under $200k. The customer can then let that data be inserted into Teradata in Teradata-native form because Hadoop already prepared and refined it. This is the real value of openness, he said, and of what they’ve thrown into the ingredient list for that growing pie.
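To make the refinery pattern concrete, here is a minimal, hypothetical sketch of the kind of cleansing step Zilka describes, written as a Hadoop Streaming mapper in Python. The field layout, cleansing rules and file paths are illustrative assumptions, not details of Hortonworks’ platform or the customer’s actual pipeline.

```python
#!/usr/bin/env python
# Hypothetical Hadoop Streaming mapper: cleanses raw pipe-delimited records
# so the output can be bulk-loaded into Teradata in native form.
# The field layout and rules are illustrative assumptions, not a real schema.
import sys

EXPECTED_FIELDS = 5  # assumed layout: store_id|sku|quantity|unit_price|sale_date

def clean(line):
    fields = line.rstrip("\n").split("|")
    if len(fields) != EXPECTED_FIELDS:
        return None                      # drop malformed rows
    store_id, sku, qty, price, sale_date = fields
    try:
        qty = int(qty)
        price = float(price)
    except ValueError:
        return None                      # drop rows with unparseable numbers
    if qty <= 0:
        return None                      # drop cancelled or zero-quantity lines
    # Emit a normalized, load-ready record (tab-delimited for the loader).
    return "\t".join([store_id.strip(), sku.strip().upper(),
                      str(qty), "%.2f" % price, sale_date.strip()])

if __name__ == "__main__":
    for raw in sys.stdin:
        record = clean(raw)
        if record is not None:
            print(record)
```

A job along these lines would typically run across the cluster via the standard streaming jar (something like hadoop jar hadoop-streaming.jar -input /raw/sales -output /refined/sales -mapper refine.py -file refine.py, with placeholder paths), after which the refined output can be handed off to Teradata’s own loader.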
The above point goes beyond practicality and utilization of resources, however, says Zilka. It’s actually a core part of how their overall partnership strategy works. “With that refinery and preparation, we’ve offloaded Teradata but we haven’t taken away its purpose. The CFO, COO, SAP and other areas are still talking to Teradata.” It is that cooperation-not-cannibalization approach that is key to the company’s business and partnership strategy.
While admittedly we’ve skipped over many of the basics about Hortonworks as a company, if this is somewhat new to you, it’s easiest to think of them as the gatekeeper company of the native Apache Hadoop code, a partnership-driven distro vendor, and one whose sole goal is to package a reliable, stable and pure open source strategy around Hadoop. Zilka reminded us that while Cloudera may have been first to the commercialization table, his own company is the one and only “true spinout” from the Apache Hadoop project, employing the core engineering talent required to continue tweaking and refining the platform.
As the company sees it, they’re also the only ones able to package together an offering that works for the three classes of Hadoop users: developers, operations and data scientists. As Zilka told us, “we’ve built on 10-to-15-year-old operations infrastructure, all open source—we’re not built from scratch, we’re assembled with tested parts.” This is, he says, the foundation of their claim to the greatest degree of reliability. He often reiterated the three-customer mantra, saying that to be useful, Hadoop has to work for the golden triad of users: dev, ops and analysts.
When it comes to those three classes of Hadoop users, Hortonworks thinks it is the only company with the trinity covered. “If you think customers need Hadoop to be great for operations, great for Java developers and then also great for scientists, quants and the SQL guys, we have all of that in-house,” Zilka explained. He compared that approach to Cloudera’s, noting that Hortonworks’ starting point was dealing with Hadoop’s constantly expanding scaling needs, finding ways around them and then building solutions to extend capabilities from there. On the other hand, says the CPO, “Those are four years of development Cloudera was blind to. While Cloudera was out knocking on doors asking users what they wanted Hadoop to do, we were anticipating what they needed.”
One key non-technological differentiator that sets Hortonworks apart is its partnership strategy, which the company says marries a “top down” and a “bottom up” approach. According to the company, they take their product management and sales teams out to see what tools and services enterprise customers are using to develop a “heat map” of important partner possibilities. They then go “bottom up” from that point to find out how they can work with those partners in a way that works for both parties (doesn’t compete with them or take over their importance) and for users.
As Zilka told us, when the company goes bottom up with its partnership approach, they find out, for example, what a user does with Informatica, what its use case is for a given architecture, and then they come up with a plan to help the Informatica approach in a way that doesn’t undercut what Informatica is doing on the ETL side.
He put this in context with an example. “Instead of competing, co-opting or eroding their revenue streams, we look to work together. However, if you look at something like Cloudera’s Impala announcement, they’re cannibalizing a traditional Teradata revenue stream. So if Teradata partners with that, they’re basically saying I can sell you old Teradata or I can sell you Impala OEM’d, but it’s the same use case.” He continued, claiming that what Hortonworks is doing differently is selling MapReduce bolted onto Teradata. “Whatever MapReduce was good at, we’re elevating that with Teradata, saying you need to do some things in MR and some in SQL. Our partnership strategy is based on integrations; the other side is to then go find what solutions have the most traction—to find what the teams are asking for Hadoop to augment.”
As you’ll note above, Zilka pointed to the recent Impala news from Cloudera as an example of cannibalization. That’s certainly not all he had to say about his main competitor’s approach to pushing real-time Hadoop via Impala. We spent some time discussing the technology and what it could signal for Hadoop users, for a growing ecosystem that might find more high-performance use cases, and of course, for Hortonworks’ own bottom line in the face of attractive features like Impala.
Hunting a Wild Impala
While Zilka consistently expressed respect for Cloudera and what it’s been doing in the marketplace, he said that Impala in its current incarnation is going to have a hard time finding a lot of use cases out of the gate due to its immaturity. Even so, he says it poses a credible threat, even if that threat is years down the road. Impala is not mature and, as Zilka told us, it’s limited in its scope. As he said:
“Enterprises are confused – they’re trying to put workloads on Hadoop that don’t go, they’re trying to make sense of HBase, and there’s huge interest there largely because they’re simply trying to get low latency on their Hadoop clusters. Now, Impala is coming along saying that HBase and other standard tools are unreliable. The problem I have with a pure SQL MPP is that it’s already being done by others at higher price points but with much more maturity in the implementation. Do we need faster Hadoop? Yes. But do we need faster Hadoop only dedicated to SQL? Probably not.”
While Zilka agrees that Impala could change the playing field, “it’s still got a lot to learn at a product management level; it’s going to get thrown a lot of things that are going to break it because MPP has a lot of pathological workloads that break MPP—otherwise Hadoop wouldn’t exist in the first place.” He admits they’ll get it figured out, but it’s going to be a multi-year slog versus something that matures over the course of a few months.
Zilka was fired up on the Impala topic—he went on, saying that he had already written and implemented the exact same thing as Impala three years ago, inspired by some of the challenges from his Walmart.com days. He noted that at the time, there were simply questions they couldn’t ask of the data because of limitations. However, the workaround—the same one at the heart of Impala—wasn’t that difficult. “You basically take a SQL parser and then have an indexed data set, where you keep statistics about the minimum and maximum values, ranges, deviations and distributions of value sets. Someone launches a query, then you look at a query predicate and determine which blocks contain what data—then you just send a query to those blocks and they come back.”
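The mechanism he describes, keeping per-block statistics and dispatching a query only to the blocks whose value ranges could satisfy the predicate, can be illustrated with a small, self-contained Python sketch. The block layout, column names and predicate form below are assumptions made for illustration; this is not code from his Walmart.com system or from Impala.

```python
# Minimal sketch of predicate-based block pruning: each data block keeps
# min/max statistics for a column, and a range query is only sent to blocks
# whose statistics say a match is possible. Purely illustrative.

class Block:
    def __init__(self, rows, column):
        self.rows = rows                              # list of dict records
        values = [r[column] for r in rows]
        self.col_min, self.col_max = min(values), max(values)

def prune_and_scan(blocks, column, low, high):
    """Return rows where low <= row[column] <= high, skipping blocks
    whose [min, max] range cannot overlap the predicate."""
    hits = []
    for block in blocks:
        if block.col_max < low or block.col_min > high:
            continue                                  # pruned: no possible match
        hits.extend(r for r in block.rows
                    if low <= r[column] <= high)      # scan surviving blocks only
    return hits

# Toy usage: three blocks of order amounts; querying 50-100 prunes the first
# two blocks (ranges 12-40 and 300-750) and scans only the third.
blocks = [
    Block([{"order_id": 1, "amount": 12}, {"order_id": 2, "amount": 40}], "amount"),
    Block([{"order_id": 3, "amount": 300}, {"order_id": 4, "amount": 750}], "amount"),
    Block([{"order_id": 5, "amount": 55}, {"order_id": 6, "amount": 90}], "amount"),
]
print(prune_and_scan(blocks, "amount", 50, 100))
```

The pruning step is where the speedup comes from: the query never touches blocks whose statistics rule them out, which is exactly the shortcut Zilka says he used to answer questions the full-scan path could not handle.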
When it comes to competitor MapR, Zilka says his company is no longer worried. He notes that when MapR first launched its proprietary replacement layer for HDFS, there was a performance increase, but over time the others have caught up and MapR no longer enjoys the coveted performance position. Just as with Cloudera, he noted that he has great respect for the team, but “generic Hadoop has time and again been benchmarked above it.” Still, MapR is not without its strengths, he said, pointing to the additional functionality of its file system over HDFS for users who want their system to look like NetApp (be able to run it from Windows, for instance, without porting or integration). While this is a stellar feature, it fits only a certain number of use cases. “You can map a MapR cluster to your Windows laptop like a Z drive and drag stuff onto it—you can’t do that with Hadoop, and while that’s great, it doesn’t add enough value since the real value of Hadoop is in the analytics.” Without talking MapR down too much, he said that the killer app for the company right now is as a NetApp competitor, or in other words, for those who want to build storage clusters for teams to share data with.
The thing is, if you go to MapR or Cloudera, they might assert that they’re using the more modern, updated and robust version of Hadoop (2.0). They will claim that this version gives them unparalleled strength against anyone leveraging “old” 1.0 and that statements about its reliability problems are overstated. Since Hortonworks’ platform is based on Hadoop 1.0, this is an important point of contention—but one that Zilka shrugged off as rather unfounded.
According to the CPO, while it’s true they’ve built off the proven reliability of 1.0, this does not mean there are sacrifices in terms of features or capabilities. In fact, he says that the platform maturity, combined with the features of Ambari (their answer to Cloudera Manager), makes it just as robust as the offerings built on the “less stable” 2.0 Hadoop releases that Cloudera and MapR have capitalized on.
“High availability, metadata management and ops tooling are all part of the 1.0 release. The difference between that and what others are shipping on 2.0 is negligible. They’re ignoring the extra investment we put into 1.0 when they say it’s not as feature rich,” said Zilka.
Where the difference is most keenly felt is at the all-important management layer, which, again, he says has to suit the needs of the big three (dev, ops and analysts). We asked him how the company’s own management suite compares to, say, Cloudera Manager.
Even though he admitted it rather begrudgingly, Zilka told us that a deep dive into the capabilities of Cloudera Manager drove his thinking about how to architect Ambari. As he looked closer at the backbone, however, he says there were some missing pieces that are of vast importance to Hadoop’s classes of users. These boil down to root cause analysis, something that he says is indeed built into Cloudera Manager, but only on a surface level.
“When I looked at the tools there in the context of my operational background with large-scale production analytics and transactional systems, I saw that it’s not actually root cause analysis since there is no context there. I like the idea of root cause analysis, but all Cloudera Manager gives you are some operational tools you could get elsewhere for the basic CPU, network and disk utilization while running MapReduce across a cluster via Ganglia, BMC, Tivoli, NewRelic and others.” Root cause is not about monitoring, he says—it’s about finer points. The ability to see how many HDFS blocks have moved, how long particular tasks ran, how many threads a task forked, and visual maps of the chain of dependencies as jobs hop nodes are all critical, he argues.
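To illustrate the kind of task-level context Zilka is pointing at, here is a small, purely hypothetical Python sketch that rolls per-task metrics (runtime, HDFS blocks moved, threads forked) up by node and flags hotspots. The record fields, sample numbers and threshold are assumptions for illustration and do not reflect Ambari’s actual data model or APIs.

```python
# Hypothetical sketch: aggregate per-task metrics by node to flag hosts whose
# tasks run unusually long, and report how much data they moved. The field
# names, sample values and threshold are illustrative assumptions only.
from collections import defaultdict

# Each record: which node ran the task, how long it took, blocks moved, threads forked.
tasks = [
    {"node": "worker-01", "task": "map_0001", "runtime_s": 42,  "blocks_moved": 3,  "threads": 4},
    {"node": "worker-01", "task": "map_0002", "runtime_s": 39,  "blocks_moved": 2,  "threads": 4},
    {"node": "worker-07", "task": "map_0003", "runtime_s": 310, "blocks_moved": 41, "threads": 22},
    {"node": "worker-07", "task": "red_0001", "runtime_s": 280, "blocks_moved": 35, "threads": 19},
]

def hotspots(task_records, runtime_factor=1.5):
    """Flag nodes whose average task runtime exceeds the cluster-wide
    average by runtime_factor, and summarize their block movement."""
    by_node = defaultdict(list)
    for t in task_records:
        by_node[t["node"]].append(t)
    cluster_avg = sum(t["runtime_s"] for t in task_records) / len(task_records)
    flagged = {}
    for node, recs in by_node.items():
        node_avg = sum(t["runtime_s"] for t in recs) / len(recs)
        if node_avg > runtime_factor * cluster_avg:   # crude hotspot heuristic
            flagged[node] = {
                "avg_runtime_s": node_avg,
                "blocks_moved": sum(t["blocks_moved"] for t in recs),
                "max_threads": max(t["threads"] for t in recs),
            }
    return flagged

print(hotspots(tasks))  # worker-07 is flagged as the node burning up the cluster
```

The point of the sketch is the context: a generic CPU or disk graph cannot tell you which job, task or node is responsible, whereas task-level records tied back to nodes can.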
Hortonworks plowed full speed ahead with Ambari, with a large portion of its development dollars going toward making root cause analysis a top tool for ops. Zilka said that it’s a big step for users to be able to see jobs move visually around a cluster, see them as they relate to the hardware, and watch carefully to see which jobs are burning up the cluster. Relating those issues back to the node in question is the core value they’re pushing with Ambari—and is, in addition to the company’s open source and partnership strategy, what really sets their enterprise offering apart.
The company remains the sole gatekeeper at the switch that flips Hadoop 1.0 over to 2.0. He said that they’re still working on it, but there are 45 or more bugs they’re trying to exterminate. It should be noted that he said this with emphasis, pointing to a potentially hefty set of problems the 2.0 builders are going to face, if they aren’t facing them already.
As we look down that road to 2.0, it will be interesting to see what kind of shakeup Impala really causes (which depends on how important SQL will be to these users), what fixes will be required on the other platform distros to keep pace with this addition, and most importantly, how the benchmarks shake out once 2.0 is released. Stay tuned.
Related Articles
Six Super-Scale Hadoop Deployments
Cloudera CTO Reflects on Hadoop Underpinnings
Cloudera Runs Real-Time with Impala
MapR Traces New Routes for HBase