Striking a Balance: Big Data Debt and Big Data Value
In technology, “technical debt” describes the obligation side of a trade-off, usually incurred when something is built “fast” rather than “right.”
Just as with personal or corporate debt, a certain amount of technical debt can be very useful. For example, it might be smart to introduce a new feature with minimal effort to first establish customer interest before building an optimal solution. And, like personal and corporate debt, too much technical debt can be very bad, eliminating your ability to add new features because you’re too busy putting out fires. The same idea of debt applies to Big Data.
How We Got Here
These days every company is a data company. We largely frame our data strategies in terms of two distinct areas: applications and analytics. Applications are where data is created, and analytics is where we make sense of data, across all our applications and their history. While both are centered on data, their technology needs are quite different.
For decades these two areas shared a common technology foundation: the relational model. A company might, for example, run its business on SAP and assess its performance through Teradata, but at every point along the way the data lived in tables, rows, and columns, was queried with SQL, and was managed by a relational database. The same technologies, skills, tooling, and integrations could be applied to both applications and analytics, and the associated costs could be amortized across many projects.
But starting about 10 years ago, companies began to consider alternatives to relational databases for their applications. New demands made the traditional approach impractical: always-connected users, explosive data volumes, richer data structures, and faster application release cycles, to name a few.
The individuals closest to these new challenges were developers. They looked beyond centrally approved options from IT to alternatives from open source projects, AWS, and SaaS products. These options gave application teams better agility, lower cost of ownership, improved scalability, and greater reliability. In short, they all exist for good reasons, and they enhance a company’s ability to grow and flourish in competitive markets.
Developers claimed newfound independence and autonomy, and in return delivered faster release cycles and a better user experience. But while developers were providing exactly what the business wanted, companies were making an important trade-off. Because these new alternatives were fundamentally incompatible with existing analytical processes, companies were unwittingly creating a problem whose solution carries a non-trivial cost.
We call this Big Data Debt, and fast forward to today: this form of technical debt is absolutely massive in many companies. What’s more concerning, most companies have no real sense of just how large their obligations are.
Why Analytics on Modern Data Is So Hard
The “last mile” in analytics consists of the tools millions of analysts and data scientists use from their devices: BI products (Tableau, Microsoft Power BI, MicroStrategy), data science tools (Python, R, and SAS), and, most popular of all, Excel. One thing these tools all have in common is that they work best when all the data is stored in a high-performance relational database.
But companies don’t have all their data in a single relational database. Instead, their application data is stored in four key areas across hundreds or thousands of different systems:
| Technology | Examples | Data stored as | Data access |
| --- | --- | --- | --- |
| Relational databases | Oracle, SQL Server | Tables | SQL |
| SaaS applications | Salesforce, Workday | Proprietary | Proprietary API |
| NoSQL databases | MongoDB, Elasticsearch, Cassandra | JSON, key/value | Proprietary API |
| Distributed file systems | Amazon S3, Hadoop HDFS | Files | Proprietary API |
For application data in relational databases, companies are well equipped to move data from these systems into their analytical environments. They build data pipelines with ETL tools, data warehouses, data marts, cubes, and BI extracts. While these undertakings are substantial, they are well understood, well governed, and follow a pattern that has been in use for several decades.
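For illustration, here is a minimal sketch of such a relational-to-relational pipeline step. SQLite stands in for both the source application database and the warehouse, and the table and column names (orders, fact_sales) are hypothetical; the point is that when both ends are relational, the whole move can be expressed with familiar SQL.

```python
# A minimal, hypothetical ETL step: source and warehouse schemas are both
# relational, so the whole move can be expressed in SQL. SQLite stands in
# for both systems; table and column names are made up for illustration.
import sqlite3

source = sqlite3.connect("app.db")       # operational application database
warehouse = sqlite3.connect("dw.db")     # analytical data warehouse

# Seed a tiny source table so the sketch runs end to end.
source.execute("""CREATE TABLE IF NOT EXISTS orders (
    order_id INTEGER PRIMARY KEY, customer_id INTEGER,
    order_date TEXT, amount REAL)""")
source.execute("INSERT OR REPLACE INTO orders VALUES (1, 42, '2023-01-15', 99.50)")

warehouse.execute("""CREATE TABLE IF NOT EXISTS fact_sales (
    order_id INTEGER PRIMARY KEY, customer_id INTEGER,
    order_date TEXT, amount REAL)""")

# Extract: the data is already tabular, so a query is all it takes.
rows = source.execute(
    "SELECT order_id, customer_id, order_date, amount FROM orders").fetchall()

# Transform and load: real pipelines add cleansing, conformed dimensions,
# and surrogate keys here, but the work stays relational throughout.
warehouse.executemany("INSERT OR REPLACE INTO fact_sales VALUES (?, ?, ?, ?)", rows)
warehouse.commit()
```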
In contrast, application data in SaaS applications, NoSQL databases, and distributed file systems is fundamentally incompatible with the existing approach to data pipelines. ETL and BI teams find themselves ill-equipped in terms of tools, best practices, and skills to handle these sources. Furthermore, these tend to be the more strategic applications, and their data is in higher demand for analysis. As a result, IT finds itself under enormous pressure.
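To see why, consider a single nested document of the kind a NoSQL store might hold. The sketch below (plain Python, with hypothetical field names) shows the decomposition an ETL team has to design and maintain just to make one such document queryable by SQL-based tools.

```python
# A hedged illustration of why document data resists the traditional pipeline:
# one nested JSON document (a MongoDB-style record, shown here as a plain dict)
# must be decomposed into several flat, relational rows before any SQL-based
# tool can use it. Field names are hypothetical.
order_doc = {
    "_id": "o-1001",
    "customer": {"id": 42, "name": "Acme Corp"},
    "items": [
        {"sku": "A-7", "qty": 2, "price": 19.99},
        {"sku": "B-3", "qty": 1, "price": 5.00},
    ],
}

# Flatten the parent document into an "orders" row...
order_row = {
    "order_id": order_doc["_id"],
    "customer_id": order_doc["customer"]["id"],
    "customer_name": order_doc["customer"]["name"],
}

# ...and explode the nested array into child "order_items" rows with a foreign
# key back to the parent. Every new or renamed field in the source application
# forces this mapping (and the downstream schema) to change.
item_rows = [{"order_id": order_doc["_id"], **item} for item in order_doc["items"]]

print(order_row)
print(item_rows)
```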
Many companies have turned to Hadoop or S3. But even companies that have managed to consolidate their application data into a data lake find that it doesn’t meet the performance needs of their analysts and data scientists. So they extract summarized data into a relational database and point their “last mile” tools at these environments, or they create BI extracts for each of the different tools. The end-to-end solution is incredibly complex, expensive, fragile, and slow to adapt to new application data. This is what it costs to pay down Big Data Debt.
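As a rough illustration of that extra hop, the sketch below aggregates raw JSON event files (a local directory standing in for S3 or HDFS) and loads the much smaller summary into a relational table that “last mile” tools can query. The paths, field names, and the daily_revenue table are all hypothetical.

```python
# A simplified sketch of the "extract a summary into a relational database"
# pattern: raw JSON event files in the lake are aggregated, and only the small
# result lands in a relational table for BI tools. Paths and fields are made up.
import glob, json, sqlite3
from collections import Counter

daily_totals = Counter()
for path in glob.glob("lake/events/*.json"):      # raw files in the "data lake"
    with open(path) as f:
        for line in f:                            # one JSON event per line
            event = json.loads(line)
            daily_totals[event["date"]] += event["amount"]

mart = sqlite3.connect("mart.db")                 # relational "last mile" store
mart.execute("CREATE TABLE IF NOT EXISTS daily_revenue (date TEXT PRIMARY KEY, amount REAL)")
mart.executemany("INSERT OR REPLACE INTO daily_revenue VALUES (?, ?)", daily_totals.items())
mart.commit()
```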
Understanding Your Big Data Debt
To better understand your debt, it helps to break the costs down into different areas. Here are the four we believe are most common; you may find others in your organization:
- Technology costs. To make application data available to BI tools and platforms for data science, it must be transformed into relational structures and stored in a relational database. Because newer application data is incompatible with traditional tools, skills, and processes, new technologies must be adopted that meet these needs. These may be provided by vendors, or they may be custom built by IT. Technology costs include the infrastructure necessary for making application data compatible with the tools used by analysts and data scientists.
- People costs. Each application presents specific requirements for accessing data in terms of APIs, security, performance, and scalability. In addition, cleansing, transforming, summarizing, and governing data as it is moved from these systems into relational structures requires specialized skills (e.g., data engineers) and the construction of data pipelines that reliably move data. People costs include the engineers, analysts, data stewards, security experts, and system administrators required to make data compatible with the tools used by analysts and data scientists.
- Opportunity costs. Time to insight is essential in today’s fast-moving business climate. The value of application data varies over time by source; for example, sales history may be valuable for many years, whereas some sensor data starts to lose value in seconds. Moving application data into analytical environments can take significant time, and reducing that time can be very expensive. Opportunity costs include the unrealized value that results from prolonged time to insight as data moves through pipelines to reach the tools used by analysts and data scientists.
- Liability costs. Data is an enormous asset, but it is also a liability in terms of security threats, compliance, errors, and data breaches. Most companies keep 6-10 copies of their data across different environments, and many users download data into spreadsheets stored in Dropbox or on laptops outside the four walls of the organization. With so much data spread across so many systems, companies face a larger attack surface for internal and external threats, inconsistent security measures, and uneven governance controls. Big data also involves tools and protocols that are less mature than traditional approaches, and these systems pose a greater liability risk that must be considered in understanding your total debt. Liability costs include the potential losses that can result from unsecured or ungoverned data moving through pipelines to make it compatible with the tools used by analysts and data scientists.
As with personal, corporate, and technical debt, Big Data Debt can be advantageous when used carefully. For example, time to market may be a more important consideration for a new application than making its data available for analysis. If the application is successful and its data turns out to be very valuable, you can build a data pipeline that makes this data compatible with the tools used by your analysts and data scientists. The total cost might be higher, but getting the application to market was the priority. The real problem arises when this choice is made without an understanding of the trade-offs: the company has no sense of what the costs are and no tactics for paying down the Big Data Debt it creates.
The costs associated with Big Data Debt can be surprisingly high. As with any area of technology, where there are high costs there is opportunity for innovation. We believe that the next major advances in the data technology space will come from products that help companies to more effectively align the data generated from their diverse application portfolio with the tools used by their analysts and data scientists.
About the Author: Tomer Shiran is the co-founder and CEO of Dremio. Prior to that, he headed the product management team at MapR and was responsible for product strategy, roadmap and requirements. Prior to MapR, Tomer held numerous product management and engineering roles at IBM and Microsoft, most recently as the product manager for Microsoft Internet Security & Acceleration Server (now Microsoft Forefront). He is the founder of two websites that have served tens of millions of users, and received coverage in prestigious publications such as The New York Times, USA Today and The Times of London. He holds an MS in Computer Engineering from Carnegie Mellon University and a BS in Computer Science from Technion – Israel Institute of Technology and is the author of five US patents.