Syncsort Aims to Bridge Hadoop’s ETL Gap
When Cloudera took to the virtual airwaves last week with a press event proclaiming Hadoop as the center of gravity in the data warehouse, not everyone agreed. While some would say that Cloudera was overstating Hadoop’s current position, there seems to be little controversy around the idea of Hadoop’s increasing role as a data warehouse offload tool.
We spoke with Josh Rogers, Senior Vice President at Syncsort, about this particular use case. He tells us that while it’s not necessarily in line with the sexy big data promises of predictive and sentiment analysis, data warehouse offloading is becoming an increasingly important on-ramp for Hadoop usage.
“If you look today, most large enterprises have somewhere between 35 and 65 percent of their warehouse capacity dedicated to ELT,” Rogers told us, commenting on his encounters with organizations wrestling with these issues. This is a challenge, says Rogers, especially as data volumes and workloads grow, bringing with them rising costs and missed SLAs.
This, says Rogers, is a driving force behind the growing popularity of Hadoop as a data warehouse offload. “If I can free up 30 percent of my data warehouse and put off additional upgrades to an incredibly expensive but powerful data store, that creates real savings in my organization,” he explains. Aside from the immediate, measurable ROI, he adds, this use case delivers another, less measurable but still valuable return: organizational experience in implementing Hadoop, the lack of which he says is still preventing organizations from getting the most out of their Hadoop installations.
“Organizations need to have extremely talented Java developers to be able to be productive and create data flows or business logic that is going to execute in their clusters,” he explains. Rogers says that while the talent needed to maximize Hadoop’s potential is in short supply, organizations can use ELT process optimization and data warehouse offloading as an opportunity to level up their skill sets. “If you can take a tooling environment that allows people to use their existing set of skills to contribute in this new architecture, that’s very powerful.”
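To make that point concrete, the sketch below shows the kind of hand-written MapReduce job that sits behind even a simple ELT-style data flow, in this case summing transaction amounts per customer from delimited text. It is an illustrative example only; the class names, field layout, and input format are assumptions, not anything drawn from Syncsort’s products.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical ELT-style aggregation: total transaction amount per customer.
public class CustomerTotals {

  public static class ParseMapper
      extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // Input lines assumed to look like: customerId|date|amount
      String[] fields = value.toString().split("\\|");
      if (fields.length == 3) {
        context.write(new Text(fields[0]),
                      new DoubleWritable(Double.parseDouble(fields[2])));
      }
    }
  }

  public static class SumReducer
      extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    protected void reduce(Text key, Iterable<DoubleWritable> values, Context context)
        throws IOException, InterruptedException {
      double total = 0;
      for (DoubleWritable v : values) {
        total += v.get();
      }
      context.write(key, new DoubleWritable(total));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "customer totals");
    job.setJarByClass(CustomerTotals.class);
    job.setMapperClass(ParseMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(DoubleWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Every such flow means Java code to write, test, and tune, which is precisely the skills bottleneck Rogers describes.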
However, says Rogers, there are still gaps in making Hadoop a complete ETL solution, including what he refers to as a connectivity gap. Rogers explains that while there are many mechanisms for moving data into Hadoop, they’re not particularly consistent or coordinated. “It’s a bunch of one-off connections that I have to manually create and feed, and we think that limits people’s ability to move all the data they want onto the platform on a repetitive basis, consistently and reliably.”
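The connectivity gap Rogers describes tends to surface as small, hand-rolled loaders like the hypothetical sketch below, which copies a nightly warehouse extract into HDFS with the Hadoop FileSystem API. The hostname and paths are invented; the point is that each source typically gets its own variation, scheduled and monitored separately.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical one-off loader: copies a nightly warehouse extract into HDFS.
// Each new source tends to get its own variant of this, with its own
// scheduling, error handling, and monitoring.
public class NightlyExtractLoader {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Assumed HDFS location; in practice this comes from core-site.xml.
    conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

    FileSystem fs = FileSystem.get(conf);
    Path localExtract = new Path("/data/exports/orders_20130501.csv");
    Path hdfsTarget = new Path("/warehouse/offload/orders/2013-05-01/");

    fs.mkdirs(hdfsTarget);
    fs.copyFromLocalFile(localExtract, hdfsTarget);
    fs.close();
  }
}
```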
Last month, Syncsort released new data integration tools geared for Hadoop that attempt to address this issue. As part of its Spring ’13 release, Syncsort announced two new products that extend its DMX data integration offering. Dubbed DMX-h, the Hadoop-centered tools aim, Rogers says, to close key gaps in the data warehouse offload and ETL use case for organizations using Hadoop.
The new tools, says Rogers, give users an ETL application that runs natively on Hadoop, along with a drag-and-drop interface that is familiar to ETL developers. Explaining the DMX-h ETL edition, Rogers says it is a native Hadoop application that interacts with the MapReduce compute framework through a contribution the company made to open source Apache Hadoop this past January. Through this approach, Rogers says, organizations gain full connectivity to all their data sources, including mainframe data.
Rogers says that by leveraging this contribution, with their tools running natively within MapReduce on every node in the cluster, they’re able to achieve performance benefits over custom-coded solutions. “We’re the only data integration vendor in the marketplace that can claim that,” claims Rogers. “Everyone else is taking essentially a code-generation approach – which is, you can use my UI, and on the back-end I’ll generate some MapReduce code, or Pig, or HiveQL. That generally is going to be slower and less efficient from an execution perspective than if you custom coded it in MapReduce.”
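For contrast, the code-generated pattern Rogers mentions can be pictured roughly as the tool’s UI emitting a HiveQL (or Pig) script and submitting it to the cluster, where it is compiled into MapReduce jobs. The sketch below, which submits a generated query over Hive’s JDBC interface, is a hypothetical illustration of that general approach, not any specific vendor’s implementation; the query, connection string, and credentials are invented.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Hypothetical sketch of the "code generation" pattern: the tool emits a
// HiveQL string and submits it to the cluster, where Hive compiles it into
// MapReduce jobs before anything runs.
public class GeneratedHiveJob {
  public static void main(String[] args) throws Exception {
    // Load the HiveServer2 JDBC driver.
    Class.forName("org.apache.hive.jdbc.HiveDriver");

    // Query assumed to be generated from a drag-and-drop data flow.
    String generatedQuery =
        "INSERT OVERWRITE TABLE customer_totals "
      + "SELECT customer_id, SUM(amount) FROM transactions GROUP BY customer_id";

    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://hiveserver.example.com:10000/default", "etl_user", "");
         Statement stmt = conn.createStatement()) {
      // Hive plans and runs this as one or more MapReduce jobs on the cluster.
      stmt.execute(generatedQuery);
    }
  }
}
```

Whether that extra compilation step matters in practice depends on the workload, but it is the layer Rogers argues a natively running engine avoids.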
While Hadoop’s use is by no means limited to data warehouse offloading, Rogers says it’s the most common use case he’s seeing in the market today, one that he expects to grow as people face scaling problems with their traditional systems.
“You will not be able to develop the appropriate level of competency in terms of big data analytics on existing relational architectures – the traditional [systems] are just not going to get you there,” he says, adding that he expects ETL applications to be to Hadoop what Excel was to Windows 95.
“We believe that while it’s lower level, ETL applications are going to be the killer application that drives adoption of Hadoop because it’s a very logical place to start.”
Related items:
Hadoop Sharks Smell Blood; Take Aim at Status Quo
DataStax Takes Aim at Oracle as Cassandra Summit Kicks Off
The Transformational Role of the CIO in the New Era of Analytics