Big Data’s Relentless Pace Exposes Old Tensions and New Risks in the Enterprise
Over the past two weeks, we’ve explored some of the difficulties that enterprises have experienced in trying to adopt the Hadoop stack of big data technologies. One area that demands further attention is how the rapid pace of development of open source data science technology in general, and the new business opportunities it unlocks, is simultaneously exposing old fault lines between business and IT while opening them to new risks.
Events like Cloudera and O’Reilly’s recent Strata + Hadoop World conference and Hortonworks’ upcoming DataWorks Summit 2017 are showcases for the burgeoning market for big data technology. While Hadoop itself may not be the center of gravity that it once was, there is no doubt that we’re in the midst of a booming marketplace for distributed computing technologies and data science techniques, and it’s not going to let up anytime soon.
The rapid pace of technological evolution has plusses and minuses. On the plus side, users are getting new technologies to play with all the time. Apache Spark has captured people’s imaginations, but already a replacement is on the horizon for those who think Spark is too slow. Enter Ray, a new technology that RISELab director Michael Jordan discussed during a keynote at last week’s Strata (and which we’ll cover here at Datanami).
Data scientists and developers are having a veritable field day with new software. Meanwhile, new hardware innovations from Intel, IBM, Nvidia, and ARM promise to unleash another round of disruptive innovation just in time for the IoT revolution.
This is a great time to be a data scientist or a big data developer. Like kids in a candy store with $100 to spend — and no parents to tell them what to do — it’s a technological dream come true in many respects.
Too Much, Too Fast?
And therein lies the rub: the kid in the candy store with eyes as big as dinner plates will invariably have a stomach ache of similar proportion.
“We’ve never seen technology change so rapidly,” says Bill Schmarzo, the chief technology officer of the big data practice at Dell EMC and the Dean of Big Data. “I don’t think we know what we’re doing with it yet.”
CIOs are struggling to keep up with the pace of change while retaining the order and organizational structure that their bosses demand, Schmarzo says. “They’ve got the hardest job in the world because the world around them has changed so dramatically from what they were used to,” he says. “Only the most agile and the most business-centric companies are the ones who are going to survive.”
How exactly we got to this point in business technology will be fodder for history books. Suffice it to say, the key driver today is the open source development method, which allows visionaries like Doug Cutting, Jay Kreps, Matei Zaharia and others to share their creations en masse, creating a ripple effect of faster and faster innovation cycles.
As you ogle this technological bounty that seemingly came out of nowhere, keep this key point in mind: All this awesome new open source big data technology was designed by developers for other developers to use.
This is perhaps the main reason why regular companies (the ones in non-tech fields like manufacturing, distribution, and retail that are accustomed to buying their technology as shrink-wrapped products, fully backed and supported by a vendor) are having so much difficulty using it effectively.
So, where are the software vendors? While some are working to create end-to-end applications that mask the complexity, many of the players in big data are hawking tools, such as libraries or frameworks that help developers become more productive. We’re not seeing a mad rush of fully shrink-wrapped products, in large part because software vendors are hesitant to get off the merry-go-round and plant a stake in the ground to make the tech palatable to Average Joe, for fear of being left behind by what’s coming next.
The result is we have today’s culture of roll-your-own big data tech. Instead of buying big data applications, companies hire data scientists, analysts, and data engineers to stitch together various frameworks and use the open source tools to build one-off big data analytics products that are highly tailored to the needs of the business itself.
This is by far the most popular approach, although there are a few exceptions. We’re seeing Hortonworks building Hadoop bundles to solve specific tasks, like data warehousing, cybersecurity, and IoT, while Cloudera is going upstream and competing with the data science platform vendors with its new Data Science Workbench. But homegrown big data analytics is the norm today.
Don’t Lock Me In
While this open source approach works with enough time and money (and blood, sweat, and tears), it’s generally at odds with traditional IT organizations that value things like stability and predictability and 24/7 tech hotlines.
All this new big data technology sold under the “Hadoop” banner has run headlong into IT’s sensibility and organizational momentum, says Peter Wang, the CTO and co-founder of Continuum Analytics.
“One of the points of open source tools is to provide innovation and to avoid vendor lock-in, and part of that innovation is agility,” he tells Datanami. “When new innovation comes out, you consume it. What enterprise IT has tended to do, once it deploys some of these open source things, is lock them down and make them less agile.”
Some CIOs gravitated toward Hadoop because they didn’t want to go through a six-month data migration for some classic data warehouse, Wang says. “Now they’re finding that the IT teams make them go through the same [six-month] process for their Hadoop data lake,” he says.
That’s the source of some of the Hadoop pain enterprises are feeling. They were essentially expecting to get something for nothing with Hadoop and friends, which can be downloaded and used without paying any licensing fees. Even if they understood that building data applications with this new class of tools would require investing in people with the right skills, they vastly underestimated the DevOps costs of creating and operating those applications.
In the wider data science world, a central tenet holds that data scientists must be free to seek out and discover new data sources that are of value, and find new ways to extract additional value from existing sources. But even getting that level of agility is anathema to traditional IT’s approach, Wang says.
“All of data science is about being fast, both with the algorithms as well as new kinds of data sets and being able to explore ideas quickly and get them into production quickly,” Wang explains. “There’s a fundamental tension there.”
This tension surprised enterprises looking to adopt Hadoop, which, in its raw Apache form, is largely unworkable for companies that just want to use the product rather than hire a team of developers to learn how to use it. Over the past few years, the Hadoop distributors have worked out the major kinks and filled in the functionality gaps, and now have something resembling a working platform. It wasn’t easy (don’t forget the battles fought over Hortonworks’ attempts to standardize the stack with its Open Data Platform Initiative), but today you can buy a functioning stack.
The problem is, just as Hadoop started to harden, the market shifted, and new technology emerged that wasn’t tied to Hadoop (although much of it was shipped in Hadoop distributions). Companies today are hearing about things like deep learning and wondering whether they should be using Google’s TensorFlow, which has no dependencies on Hadoop, although an organization may still use Hadoop to store the huge amounts of training data needed for the neural networks its data scientists will build with TensorFlow.
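For the technically inclined, here is a minimal sketch of what that split can look like. It assumes a TensorFlow build with HDFS support configured (Hadoop client libraries available to the process), plus a hypothetical hdfs:// path and TFRecord feature layout; with those in place, a training job can stream data straight out of the Hadoop data lake without taking on any other Hadoop dependency.

```python
import tensorflow as tf

# Hypothetical location of TFRecord training data in the Hadoop data lake.
# Reading hdfs:// URIs assumes TensorFlow's HDFS support is configured
# (Hadoop client libraries reachable from this process).
TRAIN_FILES = "hdfs://namenode:8020/data/training/*.tfrecord"

def parse_example(record):
    # Assumed feature layout, for illustration only.
    feature_spec = {
        "features": tf.io.FixedLenFeature([20], tf.float32),
        "label": tf.io.FixedLenFeature([], tf.int64),
    }
    parsed = tf.io.parse_single_example(record, feature_spec)
    return parsed["features"], parsed["label"]

# Stream records out of HDFS, then shuffle and batch them for training.
dataset = (
    tf.data.TFRecordDataset(tf.io.gfile.glob(TRAIN_FILES))
    .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
    .shuffle(10_000)
    .batch(256)
    .prefetch(tf.data.AUTOTUNE)
)

# A deliberately simple model; the point here is the data path, not the network.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(dataset, epochs=5)
```

Nothing in a job like this touches MapReduce or YARN; Hadoop is reduced to the role of a very large file system.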
Necessary Vs. Unnecessary Complexity
The complexity of big data tech will increase, Wang says. And while software vendors may eventually take all of this technology and deliver shrink-wrapped products that hide the developer-level complexity, any company that wants to take advantage of the current data science movement will need to steel itself, accept the daunting level of complexity, and just try to make the most of it.
“People are going to have to hire very talented individuals who can draw from this giant pile of parts and build extremely vertically integrated, targeted apps or cloud services or whatever, and have to own, soup-to-nuts, the whole thing,” Wang says. “Before you could rely on Red Hat or Microsoft to provide you an operating system. You could get a database from some vendor or get a Java runtime and Java tooling from somebody else.
“At the end of the day,” Wang says, “you now have six or seven layers of an enterprise software development stack, and then you hire some software developers to sprinkle some magic design pattern stuff and write some things, and you’ve got an app.”
Not all complexity is evil, according to Wang, who differentiates between necessary complexity and unnecessary complexity.
“There’s a central opportunity available in this space right now, and that essential opportunity is ultimately the oxygen that’s driving all these different kinds of innovation,” Wang says. “The insight that’s available with the data we have – that is the oxygen causing everything to catch fire.”
We’re experiencing a Gold Rush mentality at the moment with regard to data and the myriad ways organizations can monetize it or otherwise do something productive with it. If you can get over the complexity and get going with the data, you have the potential to shake up an industry and get rich in the process, which is ultimately what’s driving the boom.
“There’s a concept of the unreasonable effectiveness of data, where you just have a [big] ton of data in every category,” Wang says. “You don’t have to be really smart, but if you can get the right data and harness it and do some fairly standard thing with it, you are way ahead of the competition.”
Hedging Tech Dynamism
There is a lot of uncertainty around what technologies will emerge and become popular, and companies don’t want to make bad bets on losing tech. One must have the stomach to accept relentless technological change, which Hadoop creator Doug Cutting likened to Darwinian evolution through random digital mutations.
One hedge against technology irrelevancy is flexibility, and that’s generally what open source provides, Schmarzo says.
“We think we have the right architecture, but we really don’t know what will change,” he says. “So how do I give myself an architecture that gives me as much agility and flexibility as possible, so when things change I haven’t locked myself in?”
Adopting an open source platform theoretically gives you the most flexible environment, he says, even if it runs counter to the prevailing desire in organizations to rely on outside vendors for technology needs. Investing in open source also makes you more attractive to prospective data scientists who are eager to use the latest and greatest tools.
“Our approach so far has been, on the data science side, to let them use every tool they want to do their exploration and discovery work,” Schmarzo says. “So if they come out of university with experience in R or Python, we let them use that.”
Organizations may want the best of all worlds, but they will be forced to make tradeoffs at some point. “There is no silver bullet. Everything’s a trade off in life,” Schmarzo says. “You’ve got to build on something. You’ve got to pick something.”
The key is to try to retain that flexibility as much as possible so you’re able to adapt to the new opportunities that data provides. The fact that open source is both the source of the flexibility and the source of the complexity is something technology leaders will simply have to deal with.
“The IT guys want everything locked down. Meanwhile the business opportunity is passing you by,” he adds. “I would hate to be a CIO today. It was easy when you had to buy SAP and Oracle [ERP systems]. You bought them and it took you 10 years to put the stupid things in but it didn’t matter because it’s going to last 20 years. Now we’re worried if it doesn’t go in in a couple of months because in two months, it may be obsolete.”
While there’s a risk in betting on the wrong big data technology, getting flummoxed by Hadoop, or making poor hiring decisions, the cost of not even trying is potentially even bigger.
“Enterprises really need to understand the business risks around that,” Wang says. “I think most of them are not cognizant yet of what that means. You’re going to tell your data scientists ‘No you can’t look at those five data sets together, just because.’ Because the CIO or the CDO making that decision or that call does not recognize the upside for them. There’s only risk.”
Related Items:
Hadoop Has Failed Us, Tech Experts Say
Hadoop at Strata: Not Exactly ‘Failure,’ But It Is Complicated
Anatomy of a Hadoop Project Failure
Cutting On Random Digital Mutations and Peak Hadoop