Data is Cheap, Information is Expensive
In this age of information, to say that the volume of data is exploding is an understatement. This big bang of big data is estimated to grow from 33 zettabytes in 2018 to 175 zettabytes by 2025, according to a recent IDC and Seagate report. Why? It’s simple: The ability to generate data with cheap compute has grown in step with the ability to save it on cheap storage. That’s why, as it has become popular to say, “data is the new oil” that, when refined, fuels today’s information economy.
Any business not partaking in this data revolution, let alone fully immersing itself in it, is at a severe disadvantage. But doing so in a way that enhances the business is much easier said than done.
The Issue
With the advent of the Internet, cloud, and all things connected, data has become the lifeblood of companies’ external communications, internal operations, and overall business value. These veins of data stream throughout a company. Each intersection can change and/or add to this flow of data. And to keep everything running smoothly, data is stored and analyzed to promote the good health and growth of the business.
However, it is not easy to take advantage of this deluge of business data. The issue is not in the ability to create mountains of data, or to stockpile it. The problem is in transforming raw data into valuable information. Data becomes information when questions can be asked of it and decisions and/or insights can be derived from it. And here lies the dilemma: the more data there is, the harder and more expensive it is to refine and structure it for quick, intelligent access.
The Reason
But why is this? What makes it so complicated and expensive? The reasoning, again, is simple.
It is by far (think orders of magnitude) cheaper to generate and store data than it is to transform it into actionable information. The refining/structuring of data involves much more compute than it takes to generate the data in the first place. As data increases in volume, variety, and velocity (the 3Vs of Big Data), so does the amount of processing it requires. And often, producing quick/intelligent access involves even more storage than the original data source.
However, there is some salvation with respect to how much additional cost is required to derive information from data. Instead of using brute force parsing through each aspect of this raw data, computer science algorithms/structures have been utilized to implement a variety of database solutions. Each solution has different benefits and requirements, but in the end, all pretty much do the same thing: store data in a representation such that intelligent access can be performed more efficiently than manually analyzing the raw source.
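As a rough illustration of that trade-off, the sketch below contrasts a brute-force scan with a small inverted index, one of the classic structures behind search databases. The log lines and function names are hypothetical and chosen only for the example; it is a minimal sketch of the idea, not how any particular database implements it.

```python
from collections import defaultdict

# Hypothetical raw log lines, standing in for "raw data".
RAW_LINES = [
    "error disk full on node-1",
    "user login ok",
    "error timeout on node-2",
    "user logout ok",
]

def brute_force_search(lines, term):
    """Scan every line for every question: no up-front cost, O(n) per query."""
    return [i for i, line in enumerate(lines) if term in line.split()]

def build_inverted_index(lines):
    """Pay compute and storage once: map each word to the lines containing it."""
    index = defaultdict(set)
    for i, line in enumerate(lines):
        for word in line.split():
            index[word].add(i)
    return index

def indexed_search(index, term):
    """Answer a question with a single dictionary lookup."""
    return sorted(index.get(term, set()))

index = build_inverted_index(RAW_LINES)
print(brute_force_search(RAW_LINES, "error"))
print(indexed_search(index, "error"))
```

Both calls return the same line numbers. The difference is where the cost lands: the scan pays on every question, while the index pays once in compute and then keeps paying in storage, since it holds an entry for every distinct word in the source.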
Although not as intense and expensive as using brute force, the compute associated with database solutions can still be quite costly. These solutions can be seen as the combination of compute and storage where data is moved into these systems to be algorithmically refined/structured for optimized access. Depending on the type of information needed, specific types of databases are used. For instance, when hunting (i.e., search) for a needle in a haystack, search databases are utilized. If there is a need for correlating (i.e., joins) data relationships, relational databases are employed. And yet, these are just a few of many use case “specific” solutions. Often there is a need for several of these databases within a company, where a variety of solutions are used in concert, compounding the need for additional storage and compute.
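The correlation case can be sketched in the same spirit. The example below joins two tiny, hypothetical tables (users and orders, invented for illustration) with a hash join, one common way relational engines match rows. It assumes everything fits in memory, which is exactly the assumption that stops holding as data grows.

```python
from collections import defaultdict

# Hypothetical tables: (user_id, name) and (order_id, user_id, amount).
USERS = [(1, "ann"), (2, "bob"), (3, "cy")]
ORDERS = [(101, 1, 25.0), (102, 3, 10.0), (103, 1, 5.0)]

def hash_join(users, orders):
    """Join orders to users on user_id using an in-memory hash table."""
    # Build phase: extra storage, keyed by the join column.
    orders_by_user = defaultdict(list)
    for order_id, user_id, amount in orders:
        orders_by_user[user_id].append((order_id, amount))

    # Probe phase: one lookup per user instead of rescanning all orders.
    joined = []
    for user_id, name in users:
        for order_id, amount in orders_by_user.get(user_id, []):
            joined.append((name, order_id, amount))
    return joined

print(hash_join(USERS, ORDERS))
```

Here again the faster answer is bought with an additional structure that has to be built, stored, and kept current alongside the raw data.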
One would think today’s technology and associated databases would address the cost of data-to-information translation. And for decades they did. But with the growth in the 3Vs of big data, these solutions are teetering. While there have been introductions of new styles of databases to reduce the cost/complexity, they haven’t solved the problem because the philosophy of refining/structuring data into information has not changed, nor has the underlying science. And if it’s not obvious yet, the amount of compute to generate data will always outpace the capacity to analyze it. In other words, the “cost of a question” will always go up as data grows. To truly wrangle the cost of information, innovation is needed in new computer science and architecture.
The Answer
The answer to this problem is not simple, but the reason it is not simple is: the problem is really many related issues. Solutions that address one or even two of them eventually hit the proverbial wall. The issues can be distilled into three primary categories: time, cost, and complexity. Each increases as data increases, and each affects the others. Often, solving for only one makes the others worse. The answer must therefore address all three in a holistic manner.
To understand this, let’s dig a bit deeper and start with complexity, since this drives most of the increases in time and cost. There are many aspects within the vector of complexity that have to be addressed, but from a database perspective, this can be distilled down into configuring and managing the underlying resources to run a database: storage, compute, and network. These resources are connected and manipulated based on the type of database: Relational, Search, Graph, etc.
Within each, data is refined/structured into Row, Column, and Text representations via traditional computer science structures (B-Tree, LSM-Tree, Inverted-Index, etc.). And here lies the issue. All of this is hard (though not NP-hard), and as data grows, every one of these aspects needs expert design and support, in concert with the others. Traditional solutions were designed, built, and managed in a static manner. Scaling such systems is time-consuming and expensive, and, as you can see, it gets complex.
So what can be done? The answer lies in the ability to quickly and elastically provision and connect storage, compute, and network resources. The best place to provision “dynamic” resources is in the cloud. In other words, the solution has to be cloud architected first and foremost. Next, the solution needs to leverage data where it is being stockpiled – cloud storage (e.g., AWS, Google, Azure). And “leverage” here does not mean as temporary or archival storage, but as the “only” storage medium. No data movement! Storage has to be seen as a true service: cheap, elastic, secure, durable. This removes at least 50% of the time, cost, and complexity of a database solution.
From here, utilizing elastic compute and network over “Storage as a Service” is the next major focus. The ability to dynamically execute the refining/structuring of information as data grows is paramount. Like cloud storage, this should be completely automated and seamless. And this applies not just to the refining/structuring (i.e., indexing), but also to the execution of intelligent access (i.e., search/queries); each should be able to allocate resources on the fly to resolve a question (think cloud fabric).
The final piece of the puzzle is the ability to truly connect these two aspects: cloud storage and elastic compute/network. Databases today, with their traditional thinking, were never designed to leverage cloud storage or elastic compute and network. They were designed to be static and standalone solutions, since storage and compute traditionally were static and fixed. There has been some architectural innovation in separating storage from compute. However, cloud storage is still not a first-class citizen. And this is the problem. Databases are using the same computer science algorithms/structures, which were designed for block storage and not cloud storage. Only once this connection is truly solved will we tame the exponential growth in data and ultimately make access to data, and to the information that lies within it, both fast and affordable.
About the author: Thomas Hazel is Founder, CTO, and Chief Scientist of CHAOSSEARCH. He is a serial entrepreneur at the forefront of communication, virtualization, and database technology, and the inventor of CHAOSSEARCH’s patent pending IP. Thomas has also patented several other technologies in the areas of distributed algorithms, virtualization, and database science. He holds a Bachelor of Science in Computer Science from University of New Hampshire, and founded both student and professional chapters of the Association for Computing Machinery (ACM).