Demystifying DataOps: What We Need to Know to Leverage It
The term “DataOps” has picked up momentum and is quickly becoming the new buzzword. But we want it to be more than just a buzzword for your company. After reading this article, you will have the knowledge to leverage the best of DataOps for your organization.
Let’s start by looking at where DataOps stands in the zoo of current IT methodologies. If you are familiar with ETL (extract, transform, and load) and MDM (master data management), think about DataOps as the next level in organizing data and the processes around it. You can also think about it as a methodology that brings DevOps and Agile together in the field of data science: like both of them, DataOps is about changing people’s minds and the way they approach everyday challenges.
Now we need to look at what issues DataOps is trying to tackle. Perhaps one of the biggest problems, and one that creates the most confusion, is data ownership. This is particularly common in legacy enterprise systems where each department has its own pipeline, analysis, and methodologies for procuring its datasets. Such department-level ownership of data processes, which often lack transparency, is one of the main sources of data silos. Complicating these data silos even further, each department interprets each dataset, and the results based on it, in its own way. These decisions are not centralized and there is no unified way to share them across the organization, which leaves departments disjointed and collaboration scarce.
Let’s examine what this looks like for a retail business. Many of the stores we know today have a membership program. Imagine that you can attribute purchases made in different stores on different days to one member. This purchase grouping enables a large part of the complex analytics that the vast majority of retailers around the world perform. However, defining “member” is not always easy or straightforward, and the definition has a profound effect on the interpretation of the results and on subsequent business decisions. Is it safe to assume that one membership number is one person? If yes, how do you advertise to couples that use one membership? If online purchases do not require the use of memberships, how do you join online and in-store purchases?
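To make the stakes concrete, here is a minimal sketch in Python of how two reasonable definitions of “member” give different answers from the same purchases. Everything in it is an assumption made for illustration: the records, the field names (membership_no, payer, channel, amount) and the helper spend_by are hypothetical and do not come from any real retail system.

```python
from collections import defaultdict

# Hypothetical purchase records; all field names and values are illustrative.
purchases = [
    {"membership_no": "M1", "payer": "alice", "channel": "store", "amount": 40.0},
    {"membership_no": "M1", "payer": "bob", "channel": "store", "amount": 25.0},
    {"membership_no": None, "payer": "alice", "channel": "online", "amount": 15.0},
]


def spend_by(records, key_fn):
    """Total spend per member, where key_fn encodes what a 'member' is."""
    totals = defaultdict(float)
    for record in records:
        key = key_fn(record)
        if key is not None:  # purchases without a usable key are left out
            totals[key] += record["amount"]
    return dict(totals)


# Definition A: one membership number is one member.
by_card = spend_by(purchases, lambda r: r["membership_no"])

# Definition B: one paying person is one member.
by_person = spend_by(purchases, lambda r: r["payer"])

print(by_card)    # one member; the online purchase is not counted
print(by_person)  # two members; the online purchase is joined to a person
```

Under definition A the couple sharing a card is a single member and the online purchase disappears from the analysis; under definition B there are two members and the online purchase is counted. Neither is wrong, but two departments using different definitions will report different numbers from the same data.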
Decisions like these have a critical effect on an organization’s marketing strategy and its plans for in-store purchases, online purchases and marketing campaigns. In a pre-DataOps organization, these important decisions are made by independent departments without the tools or procedures for collaboration. The result is a lack of data transparency and, therefore, unnecessary barriers to effective and well-timed strategic business decisions.
In an environment where strategic, enterprise-level thinking and the sharing of knowledge and discoveries are impossible, DataOps suggests treating data as a standalone resource, an asset of the entire organization. Each department will have access to this resource, share tools and storage, and, perhaps more importantly, share results, discoveries or needs in a unified way on a platform known to the whole organization. Of course, such a reality requires commitments and agreements across the organization.
Depending on the age of your business and data environment, the required changes might be massive and painful. In my experience, one of the most difficult steps, both to recognize and to implement, is the creation of a common data schema. In such a schema, each entity of enterprise data has an agreed-upon definition and an identified method to locate it and work with it.
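As a rough illustration of what an entry in such a schema might record, consider the Python sketch below. It is not a prescribed DataOps format: the classes EntityDefinition and CommonSchema, their fields, and the example values (such as the table name crm.members) are all hypothetical and chosen only to show the idea of one agreed definition plus a known location per entity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntityDefinition:
    """One agreed-upon entity in the common schema (hypothetical structure)."""
    name: str
    definition: str  # the business meaning everyone has agreed on
    location: str    # where to find the data (illustrative table name)
    key: str         # field that identifies one instance of the entity
    owner: str       # team that coordinates changes to the definition


class CommonSchema:
    """A minimal in-memory registry; a real one would be shared and versioned."""

    def __init__(self):
        self._entities = {}

    def register(self, entity):
        # A second definition under the same name is exactly how silos start.
        if entity.name in self._entities:
            raise ValueError(f"'{entity.name}' is already defined")
        self._entities[entity.name] = entity

    def lookup(self, name):
        return self._entities[name]


schema = CommonSchema()
schema.register(
    EntityDefinition(
        name="member",
        definition="One membership number, regardless of who pays",
        location="crm.members",
        key="membership_no",
        owner="marketing-analytics",
    )
)

print(schema.lookup("member").definition)
```

The point of the registry is less the code than the rule it enforces: any department can look up what “member” means and where it lives, and nobody can quietly register a competing definition under the same name.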
The schema development process requires a lot of collaboration, especially in an enterprise-size organization. Moreover, once developed, the schema is never “set in stone”; it stays fluid for as long as the business develops and changes. One widely recognized development practice that allows intense collaboration and quick changes is Agile. New business challenges bring new questions to each department, and departments can choose and maintain their own pipelines and entity definitions to extend the core schema. However, there should be a process in place to decide which of these entities and pipelines should become a part of the common data schema and when. Agile provides control over this process, short development cycles, and quick implementation of ideas.
As we previously mentioned, DataOps brings DevOps and Agile together, and Agile plays a vital role in managing the required intense collaboration and quick changes. So what is the role of DevOps in this process? DevOps satisfies the need for a centralized data archive with variable access points, allowing individual departments to plug in custom solutions that support a high volume of requests. Another valuable aspect of DevOps is its ability to build and manage a system of solutions in which technical specialists (IT engineers), nontechnical users (e.g. managers, business leaders) and those in between (for example, data analysts and data scientists who both produce and consume data) can collaborate. Because of its flexibility, DevOps is the choice for building a modern data management system that is ready to deal with complex, high-volume, high-velocity data.
It is important to mention that implementing the DataOps methodology, or at least some part of it, is an absolute requirement for successfully and widely employing ML algorithms and AI systems dedicated to helping customers and the business. The amount and diversity of data of every kind is growing every day, and the ability of a business not only to manage this data, but also to integrate it successfully and seamlessly into everyday business operations, is an essential skill to survive and prosper.
The value of the DataOps methodology is clear, so why isn’t it catching on more quickly and being implemented more widely?
To summarize, DataOps is about changing mindsets, which can be challenging, especially within a huge enterprise organization. The legacy of existing data management practices may be overwhelming. At the same time, a small organization may view the DataOps ideology as overkill and be content with appointing an owner to each piece of data or pipeline and holding regular meetings to facilitate collaboration. However, projects grow fast, staff may change responsibilities or leave the company, and critical data may be lost or overlooked. It is never too early to put a plan in place to embrace the DataOps ideology. In the end, the size of your organization does not matter when the ultimate goal is to be prepared for growing volumes and complexity of data and to avoid drowning in data silos and practice debt.
About the author: Polina Reshetova, PhD, is a senior data scientist with EastBanc Technologies. Polina earned her PhD in complex systems data analysis. Over the past five years, she has been developing machine learning algorithms and predictive analytical techniques. Today, Polina focuses on implementing iterative approaches (e.g. Minimal Viable Prediction) that break complex, challenging assumptions into small, digestible chunks that can be tested in weeks, not months. These methods minimize the time needed to garner actionable insights while maximizing user feedback loops, which progressively increase algorithm effectiveness.