10 Key Big Data Trends That Drove 2017
2017 has come and (almost) gone. It was a memorable year, to be sure, with plenty of drama and unexpected happenings in the technology, the players, and the applications of big data and data science. As we gear up for 2018, we think it’s worth taking some time to ponder what happened in 2017 and put it in some kind of order.
Here are 10 of the biggest takeaways for the big data year that was 2017.
AI Takes Over
2016 closed with keen interest in AI, and that momentum surged in 2017, thanks to emerging deep learning technologies and techniques that deliver better and faster results for some machine learning tasks. Teradata, for instance, found that 80% of enterprises are already investing in AI, a figure that echoed similar findings from IDC.
Nevertheless, the same old challenges that kept big data off Easy Street also emerged to cool some of the heat around AI. Over the summer, Databricks’ CEO, Ali Ghodsi, warned about “AI’s 1% problem.” “There are only about five companies who are truly conducting AI today,” Ghodsi said.
The sudden re-emergence of AI also rekindled an old debate about what the term actually means, as well as the differences between AI, machine learning, and deep learning. While some purists consider AI to refer to a human-like replacement, the software community generally takes a much broader view of AI today. On the hype meter, AI has replaced big data as the new “it” term.
By year’s end, we noticed a dip in excitement around neural networks, which don’t really replicate the functioning of the human brain. We also heard how “cracking the brain code” is our best chance at achieving “true” AI, a view backed by Geoffrey Hinton, the father of deep learning.
Hadoop Falls Off
One of the biggest big data stories of 2017 was the shine coming off Apache Hadoop. We detailed the struggles some customers have had getting Hadoop software up and running. Common complaints about Hadoop relayed in our March story included too much complexity, incompatible components, poor support for small files, poor support for non-batch workloads, and overall limited usefulness for non-developers.
There was some pushback from technologists. Sure, Hadoop is complicated, they said. It’s not perfect. It’s not “set it and forget it.” But until a better general-purpose distributed computing platform comes along, they argued, Hadoop remains the best option for organizations that need to store and process very large data sets.
There are different ways to interpret this. One view holds that Hadoop’s time in the sun has passed. Another holds that its difficulties are the temporary setbacks all emerging technologies face before eventually finding their footing as core components of the enterprise stack. Interestingly, there seemed to be very little interest in Hadoop 3.0, which we first wrote about in 2016 and covered again in mid-2017. With Hadoop 3.0 nearing GA this month, interest seems to have perked up a bit, but it’s nothing like the frenzy that accompanied the release of Hadoop 2.0 four years ago.
We also saw Hadoop removed from the names of two prominent industry conferences: Cloudera and O’Reilly’s Strata + Hadoop World (now called the Strata Data Conference) and Hortonworks’ Hadoop Summit (now called the DataWorks Summit).
Graph Picks Up
Graph databases continued to gather momentum in 2017, thanks to their status as the preferred technology for a specific set of use cases revolving around connected data. In January, we saw JanusGraph emerge to continue open source TitanDB development, while in February we described how Qualia uses Neo4j to allow ad tech firms to quickly detect trends in user behavior.
In April we saw IBM, which debuted a Titan-based graph database service in 2016, touting the advantages of the graph approach with its “State of Graph Databases” report. In September, we told you about TigerGraph, a startup that claims to have developed the industry’s first native parallel graph database.
But Amazon Web Services took the (graph) cake in November with the debut of Neptune, a new hosted graph database that, when it hits GA, will support both major graph models: property graphs and RDF graphs.
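For readers new to the property graph model behind Neo4j, JanusGraph, TigerGraph, and Neptune, a toy sketch helps show why connected data is its sweet spot. The example below uses the open source networkx library rather than any of the databases named above, and the user/ad schema is entirely made up for illustration.

```python
# A toy property graph: users, the ads they clicked, and who follows whom.
# Built with networkx purely for illustration; real graph databases
# (Neo4j, JanusGraph, TigerGraph, Neptune) would use Cypher, Gremlin,
# or SPARQL instead of hand-rolled Python traversals.
import networkx as nx

g = nx.DiGraph()

# Vertices carry arbitrary key/value properties.
g.add_node("alice", label="user", region="US")
g.add_node("bob", label="user", region="US")
g.add_node("ad42", label="ad", campaign="spring_sale")

# Edges are first-class and typed, which is what makes
# relationship-heavy queries cheap compared to multi-way SQL joins.
g.add_edge("alice", "bob", type="follows")
g.add_edge("alice", "ad42", type="clicked")
g.add_edge("bob", "ad42", type="clicked")

# "Which ads were clicked by people Alice follows?" is a two-hop walk.
ads_via_friends = {
    ad
    for _, friend, e in g.out_edges("alice", data=True) if e["type"] == "follows"
    for _, ad, e2 in g.out_edges(friend, data=True) if e2["type"] == "clicked"
}
print(ads_via_friends)  # {'ad42'}
```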
Spark Keeps Rolling
Apache Spark roared onto the big data stage in 2014 as a better and faster data processing engine than MapReduce for Hadoop clusters. Considering how quickly big data tech was evolving at that time, it seemed plausible that Spark itself would be displaced by a newer, better technology. At Strata + Hadoop World in March, we saw a compelling demo of Ray, a new framework touted as a Spark replacement by RISELab, the successor to AMPLab.
But Spark has maintained its momentum, and today it would appear that Spark has a permanent chair at the big data table. One likely reason is that Spark has simply continued to evolve. The framework is being infused with deep learning capabilities, including the new deep learning pipelines project unveiled in June. And work done in Spark version 2.2 and subsequent releases is prepping the open source framework to utilize specialized hardware like GPUs and FPGAs in future releases.
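The deep learning pipelines work builds on the same Pipeline/Transformer abstraction Spark ML already exposes. Here’s a minimal sketch of that abstraction using the Spark 2.2-era Python API; the toy data and column names are made up for illustration.

```python
# Minimal Spark ML pipeline sketch. The deep learning pipelines project
# mentioned above plugs image- and model-based stages into this same
# Pipeline/Transformer abstraction.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("pipeline-sketch").getOrCreate()

df = spark.createDataFrame(
    [(0.0, 1.1, 0.0), (2.0, 1.0, 1.0), (2.0, 1.3, 1.0), (0.0, 1.2, 0.0)],
    ["clicks", "dwell_time", "label"],
)

# Assemble raw columns into the single feature vector Spark ML expects,
# then chain a classifier behind it.
assembler = VectorAssembler(inputCols=["clicks", "dwell_time"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")

model = Pipeline(stages=[assembler, lr]).fit(df)
model.transform(df).select("clicks", "dwell_time", "prediction").show()
```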
Cloud Rules All
As some questioned whether Hadoop was the right platform to store data, the barriers to using the cloud seemed to shrink by the day. Indeed, Amazon’s S3 API has emerged as a de facto storage standard in its own right, challenging HDFS for big data dominance and prompting the Apache Hadoop community to improve S3 support in the forthcoming Hadoop 3.0. Public clouds are growing fast: Gartner and Forrester peg annual growth at 18% and 19%, respectively, while Synergy Research Group found the cloud growing in excess of 40% annually.
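The practical upshot of S3 becoming a storage standard is that analytics code increasingly just points at an object store path instead of HDFS. A hedged sketch, assuming a hypothetical bucket, the Hadoop s3a connector on the classpath, and AWS credentials already configured on the cluster:

```python
# Reading the same dataset from HDFS or S3 with Spark. The bucket name and
# paths are hypothetical; the s3a connector (hadoop-aws) and credentials
# are assumed to be set up on the cluster.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-vs-hdfs").getOrCreate()

# Classic on-cluster storage.
events_hdfs = spark.read.parquet("hdfs:///data/events/2017/")

# Same API, object storage backend; only the URI scheme changes.
events_s3 = spark.read.parquet("s3a://example-analytics-bucket/events/2017/")

print(events_hdfs.count(), events_s3.count())
```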
All the major data management vendors – Hadoop folks like Cloudera, MapR, and Hortonworks, and NoSQL folks like MongoDB, Couchbase, and DataStax – made major cloud-related announcements. And at AWS’s annual re:Invent extravaganza, the cloud giant rolled out dozens of new data services – including new machine learning and graph database services – on top of the nearly 1,300 new or improved services it rolled out this year.
Data Fabrics Emerge
Hadoop had a rough go of it in 2017. But one area of the big data ecosystem that looked particularly promising was big data fabrics. The concept, which originated with Forrester analyst Noel Yuhanna, refers to a holistic approach to streamlining a variety of data management tasks – accessing, discovering, transforming, integrating, securing, governing, tracking lineage, and orchestrating data – across a variety of silos, including Hadoop, Spark, NoSQL databases, and analytic relational databases.
Some organizations are using big data fabrics to bypass the software integration challenges faced by early Hadoop adopters. That’s the lesson learned by Vizient, a Texas-based group purchasing organization that recently implemented the Hortonworks Data Platform (HDP). “The beauty of the fabric is that not everybody needs to know all that [Hadoop] stuff,” the company’s VP of enterprise architecture and strategic technology told Datanami. Expect the data fabric trend to continue into 2018 as it addresses some of the challenges of working with data that lives in multiple silos.
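To make the “not everybody needs to know all that stuff” point concrete, here is a deliberately toy illustration of the architectural idea: a thin access layer that hides which silo actually holds a data set. Nothing below corresponds to any vendor’s real API; it is only a sketch of the pattern.

```python
# Toy illustration of the data-fabric idea: one access layer that hides
# which silo (Hadoop, NoSQL, relational, ...) actually holds a data set.
# Purely hypothetical; no real product's API is being modeled here.
from typing import Callable, Dict

class DataFabric:
    def __init__(self) -> None:
        # Maps a logical data set name to the function that knows how to fetch it.
        self._catalog: Dict[str, Callable[[], list]] = {}

    def register(self, name: str, loader: Callable[[], list]) -> None:
        self._catalog[name] = loader

    def read(self, name: str) -> list:
        # Consumers ask for "claims" or "members"; the fabric worries about
        # whether that means HDFS, Cassandra, or a warehouse underneath.
        return self._catalog[name]()

fabric = DataFabric()
fabric.register("claims", lambda: ["...rows pulled from a Hadoop/Spark job..."])
fabric.register("members", lambda: ["...rows pulled from a relational database..."])

print(fabric.read("claims"))
```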
Fighting Swamp Monsters
Just because you have a lot of data doesn’t mean you can do anything with it. That was the lesson relayed by Wolf Ruzicka of EastBanc Technologies, who in June told Datanami about a client that built a 50 PB data lake, only to let it turn into a data swamp.
Keeping Hadoop data lakes on the straight and narrow has become a cottage industry, providing work for many third-party technologists and software vendors. But they’re fighting a losing battle, according to Gartner, which predicted that 90% of data lakes will be “overwhelmed with information assets captured for uncertain use cases.”
Big Data IPOs
After years of spearheading startups and raising venture capital, the big data ecosystem was due for a little consolidation and payback. Some of the top firms raised their profiles with initial public offerings (IPOs) of stock, including Cloudera, MongoDB, and Alteryx.
Cloudera raised $228 million with its April IPO on the New York Stock Exchange (NYSE: CLDR). After peaking near $23 per share in June, CLDR has lost about 26% in value. MongoDB raised $192 million when it went public on the NASDAQ in October; since that first closing day, MDB has lost about 5% of its value. Alteryx, meanwhile, went public on the NYSE in March, raising $126 million; since then AYX is up about 74%.
Looking forward, 2018 could be another big year for big data IPOs. Executives at other big data software vendors, including Couchbase, MapR, and Anaconda, have publicly discussed possible IPOs. We may also see a VC-funded graph database software vendor go public. And there are other data management firms that could look to the public markets to provide an exit for private investors.
Data Science Platforms
In 2017, we saw how better software is helping to take the data scientist out of data science (and fight back swamp monsters at the same time). An emerging group of data science platform vendors, along with existing analytic tool providers, is finding traction helping customers with a variety of tasks: picking the right algorithm for a specific data set; managing the creation, maintenance, and testing of machine learning models; deploying models into production; and enabling data scientists to collaborate.
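The “picking the right algorithm” piece is exactly the kind of grunt work these platforms automate. Here is a minimal scikit-learn sketch of the idea, not any particular vendor’s product; the data set and candidate models are chosen only for illustration.

```python
# The kind of grunt work data science platforms automate: trying several
# candidate algorithms on a data set and keeping the best cross-validated
# one. A minimal scikit-learn sketch, not any vendor's actual product.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=5000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Score each candidate with 5-fold cross-validation and keep the winner.
scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in candidates.items()
}
best = max(scores, key=scores.get)
print(scores, "->", best)
```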
These platforms won’t eliminate the need for data scientists, which some believe are overrated anyway. But they can take some of the grunt work off the backs of existing data scientists, and help organizations get more data science work done with fewer high-skilled unicorn types. This trend shows no sign of abating in 2018, especially in light of the growing popularity of newer deep learning technologies and techniques, and the greater complexity that neural networks can bring.
GDPR Looms
We’ve known for over a year that the General Data Protection Regulation will go into full effect on May 25, 2018. But for some reason, GDPR’s May Day started feeling more real this year. For American CDOs, there’s no government regulation that instills as much fear and loathing as the European Union’s GDPR, which will likely put an end to big data’s Wild West ways.
In addition to mandating good data practices, the GDPR will require companies to honor individuals’ “right to be forgotten.” But data privacy is a tricky topic that will take some time to sort out. The inestimable Michael Stonebraker told us that, with the 3Vs of big data basically solved, privacy and integration would emerge as the big stumbling blocks to getting value out of big data.
Bonus: Case Studies
Did we mention that we love big data case studies? We had just a few in 2017. In fact, we did case studies on:
Pandora’s use of Kafka, CapitalOne’s “Purple Rain” security framework, Blackboard’s implementation of a Snowflake data warehouse, Dollar Shave Club’s Spark implementation, HSBC’s use of MongoDB, the tech inside JetBlue’s customer 360 initiative, Mashable’s data-infused CMS, how Rabobank uses Kafka, ride-hailing vendor Grab’s adoption of Iguazio, Nextdoor’s “serverless ETL,” how the sports analytics firm STATS is using deep learning, the Mormon Church’s Cassandra database, how Operation Red Alert uses big data to fight human trafficking in India, how African hospitals are using Couchbase, TDBank’s meta-data lake, golf-swing optimization at GOLFTEC, and finally, the data science inside the Bloomberg terminal.
Got a good case study for 2018? Drop us a line at [email protected].