To Centralize or Not to Centralize Your Data–That Is the Question
Should you strive to centralize your data, or leave it scattered about? It seems like a simple question, but it’s a tough one to answer, because it has so many ramifications for how data systems are architected, especially with the rise of cloud data lakes.
In the old days, data was a relatively scarce commodity, and so it made sense to invest the time and money to centralize it. Companies paid millions of dollars to ensure their data warehouses were filled with the cleanest and freshest data possible, for historical reporting and analytics use cases.
As the big data boom unfolded, companies gained more options. Open systems like Hadoop let companies store petabytes of data on commodity hardware, spurring the creation of massive data lakes that held the less structured “data exhaust” at the heart of so many big data initiatives. Database innovation has also been steady, particularly around NoSQL databases that relax the rigid constraints of relational databases to simplify life for developers, as well as NewSQL databases that replicate the time-tested reliability and durability of relational databases in the distributed realm.
The public cloud has been a force for both centralization and decentralization. On the decentralization side of the ledger are the multitude of software as a service (SaaS) applications, like Salesforce, NetSuite, and ServiceNow, which are extremely popular among Internet natives and brick and mortar companies alike. Since SaaS applications come with their own databases, they eliminate the need for customers to own and manage their own data stores (which can be either good or bad).
On the other side of the coin are object storage systems like Amazon Web Services’ S3 and Microsoft Azure Data Lake Storage (ADLS), which provide very scalable and very inexpensive places to park vast amounts of data. Moving up the cloud stack, we see databases like Amazon Redshift, Google Cloud’s BigQuery, and Snowflake, which essentially recreate high-performance SQL data warehouses in the cloud, but with the added advantage of separating compute from storage. That separation lets users scale each independently, and it is a product of today’s containerized cloud architectures (thank you, Kubernetes).
Companies also have the option of building cloud-like systems on prem or in hybrid configurations, such as the Cloudera Data Platform (CDP) products, which provide many of the architectural advantages of cloud-based systems, but without the lock-in that comes with exclusive use of the big cloud providers’ data storage and analytic offerings.
In other words, companies have a plethora of options today when it comes to managing their data, and it’s not always clear how to sort through them. So what should a data-driven company do? While every situation is different, there are some advantages and disadvantages to keep in mind for advanced analytics and AI use cases when choosing whether to invest in data centralization or to leave the data where it lies.
AI on the Edge
Sri Ambati, the CEO of H2O.ai, has watched data ebb and flow in the corporate data center over time. “Going back to the 1990s, companies with the best warehouses were able to mine it and get the most insights,” he says. “I think there’s always going to be value for centralization.”
Having all the data in one place can make it easier for a company to explore their data, to create features out of it, and to add more dimensions, Ambati says. But the track record in building modern centralized data systems is mixed. “The data lakes of the bygone big data era, if you will, have become data swamps, because people have not been able to use that effectively,” he says.
For many of today’s AI use cases, companies just won’t get a good return on their centralization investment, he says. “The reason for centralizing data is to run compliance processes, to run analytics, to understand what to do better, what to do next,” he says. “But if you can do that at the edge, if you can learn where data is being created – that’s kind of where the convergence of data, analytics, and applications [is headed].”
Organizations in some jurisdictions and industries are forbidden from moving data, which limits their ability to benefit from centralized data. In these situations, federated learning can help fill in the gaps left by incomplete data.
For example, Ambati says over 100 hospitals in the US are sharing machine learning models that individual hospitals are training on local data. This gives them the benefit of sharing the learnings contained in the data, but without violating HIPAA.
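To make the pattern concrete, here is a minimal Python sketch of federated averaging, the general technique behind this kind of model sharing. It is an illustration, not H2O.ai’s or any hospital’s actual implementation: the toy logistic-regression model and the synthetic per-site data are purely hypothetical. Each site trains on its own private data, and only model weights, never raw records, leave the premises.

```python
import numpy as np

def local_update(weights, X, y, lr=0.01, epochs=5):
    """Train a simple logistic-regression model on one site's private data."""
    w = weights.copy()
    for _ in range(epochs):
        preds = 1.0 / (1.0 + np.exp(-X @ w))   # sigmoid predictions
        grad = X.T @ (preds - y) / len(y)      # gradient of the log loss
        w -= lr * grad
    return w                                   # only the weights leave the site

def federated_average(global_weights, sites):
    """One round of federated averaging across participating sites."""
    updates = [local_update(global_weights, X, y) for X, y in sites]
    return np.mean(updates, axis=0)            # aggregate without sharing raw data

# Hypothetical example: three sites, each with its own (features, labels) data.
rng = np.random.default_rng(0)
sites = [(rng.normal(size=(100, 4)), rng.integers(0, 2, 100)) for _ in range(3)]

weights = np.zeros(4)
for _ in range(10):                            # ten rounds of coordination
    weights = federated_average(weights, sites)
```

Production systems layer on secure aggregation, differential privacy, and model validation, but the core exchange is the same: models travel, data does not.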
Federated Data Analytics
Federated techniques are also becoming more popular with SQL analytics, in addition to ML. According to Justin Borgman, the CEO of Starburst, federated analytic systems like Presto allow analysts to query data without worrying about where it’s actually located.
“For them, there’s no perception of the data silo problem,” he says. “Is the data physically in different systems? Yes….But from the analyst’s perspective, it all looks like one system. They don’t know that that table is in Teradata and that table is in Hadoop. They just know I want to join these two tables.”
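As a rough sketch of what that looks like in practice, the query below uses the open source Trino/Presto Python client to join a table that physically lives in Teradata with one that lives in Hadoop/Hive. The coordinator host, catalog names, and table names are hypothetical; the point is that the analyst writes one SQL statement against catalog-qualified names and never has to think about which system holds which table.

```python
from trino.dbapi import connect  # open source Trino/Presto Python client

# Hypothetical coordinator and user; adjust to your own deployment.
conn = connect(host="presto.example.com", port=8080, user="analyst")
cur = conn.cursor()

# One federated query: the customers table sits in Teradata, the orders
# table sits in Hadoop/Hive, but both appear as ordinary qualified names.
cur.execute("""
    SELECT c.customer_id, c.region, SUM(o.amount) AS total_spend
    FROM teradata.sales.customers AS c
    JOIN hive.weblogs.orders AS o
      ON c.customer_id = o.customer_id
    GROUP BY c.customer_id, c.region
""")

for row in cur.fetchall():
    print(row)
```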
That’s not to say there aren’t benefits to building a massive data warehouse or data lake that contains all of the data a company wants to analyze. Borgman expects companies to continue building these systems when they’re necessary.
“The biggest benefit of getting all your data into one high performance data system like Teradata and Snowflake is you can maximize performance that way,” he says. “There’s no doubt that that is going to give you the max possible performance. However the tradeoff there is you have to be comfortable with getting all your data into one system, and I think what we found over the years was that’s just sort of impractical.”
Despite various attempts to fight the data silo problem, it just never seems to go away, he says. The growing popularity of Presto is a product of all those failed attempts to bring all the data together, according to Borgman.
“I think we’re just sort of trying to be honest about that, to recognize that, and basically say to customers ‘That’s OK. We can still give you access to all that data through one interface, and deliver fast query results,’ as opposed to what a traditional data warehouse vendor would say, which is load all that data into this one single source of the truth, and then analyze it there.”
Dreaming of Data Virtualization
Companies have already voted with their wallets and are investing in cloud-based data repositories, according to Tomer Shiran, the chief product officer at Dremio. So why not give them the ability to work with data stored in those data lakes, no matter which lake it’s in?
“Most new applications are built in the cloud and store their data in these cloud object stores. And organizations are moving quickly to ETL data from other applications such as operational databases into the central cloud data lake,” he tells Datanami. “While some may take the position that these application-specific data sets are ‘silos’ of some sort, they are still effectively centralized in a common, open storage platform. This means that many different data processing services, including those available today and those that will come in the future, are able to work with the data.”
There are data virtualization technologies that attempt to create an abstraction layer that provides data access across a wide range of data silos. But those are “inherently fragile and prone to performance inconsistencies,” Shiran says.
Dremio has built an independent SQL query engine based on Apache Arrow, the open source columnar in-memory data format that its engineers helped develop. This approach lets customers take advantage of data centralization in public clouds without the lock-in.
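For a sense of what working directly against an open, Arrow-based columnar layer looks like (this is a generic illustration, not Dremio’s engine), the pyarrow library can scan Parquet files sitting in a cloud object store straight into Arrow’s in-memory format. The bucket, path, and column names below are hypothetical, and the group-by aggregation assumes a reasonably recent pyarrow release.

```python
import pyarrow.dataset as ds
import pyarrow.fs as fs

# Hypothetical S3 bucket and prefix; credentials come from the environment.
s3 = fs.S3FileSystem(region="us-east-1")
dataset = ds.dataset("example-data-lake/events/2020/",
                     filesystem=s3, format="parquet")

# Read only the columns needed, directly into Arrow's columnar memory layout.
table = dataset.to_table(columns=["event_type", "duration_ms"])

# Aggregate in memory; any Arrow-aware engine could consume the same data.
print(table.group_by("event_type").aggregate([("duration_ms", "mean")]))
```

Because the data stays in an open format in the object store, the same files remain queryable by other Arrow-aware or Parquet-aware engines, which is the anti-lock-in argument in a nutshell.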
Centralization in the Cloud
We’ll likely never completely move away from the need to build and manage ETL pipelines. But the less data engineering work that has to be done and the more centralization we have, the better off organizations will be, according to Matei Zaharia, the chief technology officer at Databricks and the creator of Apache Spark.
“It’s hard to imagine centralizing absolutely everything from the beginning,” Zaharia says. “I do see that especially for downstream analytics, you do have a lot of great tools for moving things into a data lake, things like CDC [change data capture] and so on.”
But on the other hand, the scalability and cost advantages of data lakes, either object stores in the cloud or Hadoop on prem, cannot be denied. Amazon S3, in particular, is quite reliable and affordable. “It basically never goes down, so it makes a lot of sense to build things based on that,” he says.
However, having much of one’s data centralized in S3 doesn’t mean it’s readily available to all the other cloud-based systems that a company might want to use to process data, he says. In fact, ETL on the cloud, through products like Glue or homegrown Spark scripts, consumes a lot of data engineers’ time.
“We’ve been seeing, in a lot of cases, people had a data lake but then after that they had pipelines to load the data into many downstream systems, such as a data warehouse or some special system for machine learning or data science or whatever,” Zaharia says. “We’re trying to allow them to do the analysis directly on the data lake, so it removes the complexity of creating new things downstream from that.”
That’s the gist behind Databricks’ Delta Lake offering, which provides a pre-built mechanism for turning raw data into structured data that’s more useful downstream. Once it’s processed in Delta Lake, Databricks users can bring other Spark-powered capabilities, including machine learning and SQL analytics, to bear on the data.
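A minimal PySpark sketch of that pattern might look like the following. The S3 paths and column names are hypothetical, and the session configuration assumes the open source Delta Lake package is available to Spark; the idea is simply that raw files are landed once as a transactional Delta table, and SQL analytics and ML then read that same table rather than a separate downstream copy.

```python
from pyspark.sql import SparkSession

# Assumes the open source Delta Lake package (delta-spark) is on the classpath.
spark = (SparkSession.builder
         .appName("delta-lake-sketch")
         .config("spark.sql.extensions",
                 "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

# Land raw JSON from the data lake as a structured, transactional Delta table.
raw = spark.read.json("s3a://example-lake/raw/events/")        # hypothetical path
(raw.dropDuplicates(["event_id"])
    .write.format("delta")
    .mode("overwrite")
    .save("s3a://example-lake/delta/events/"))

# Downstream SQL analytics (or ML feature pipelines) read the same table.
events = spark.read.format("delta").load("s3a://example-lake/delta/events/")
events.createOrReplaceTempView("events")
spark.sql("SELECT event_type, COUNT(*) AS n FROM events GROUP BY event_type").show()
```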
“I think in terms of the source data, it’s definitely going to come from many places,” Zaharia says. “In terms of analytics, you can actually get good performance and manageability on the data lake, if you use something like Delta and the lake house pattern. So that’s kind of what we’re pushing for, to make that very easy so you don’t have to worry about it once the data gets there.”
In the final analysis, there is no one-size-fits-all solution, and data will continue to be both centralized and decentralized, depending on the use case. The growth of data, 5G networks, and the Internet of Things (IoT) will increase the need to get more work done on the edge, including machine learning work that would normally be done in a central location. Federated techniques for analytics and AI will also continue to bring value in situations where data cannot be centralized.
The growth of cloud data lakes presents an interesting twist by enabling many of the advantages of centralization. However, depending on how customers approach cloud data lakes, they could be setting themselves up for the same lock-in that chased them out of on-prem systems in the first place.
Related Items:
Three Privacy Enhancing Techniques That Can Bolster the COVID-19 Response
Databricks Cranks Delta Lake Performance, Nabs Redash for SQL Viz
Machine Learning on the Edge, Hold the Code