Realizing the Promise of Data Lakes
Like their real-life counterparts, data lakes are subject to pollution, flooding, and a host of other ills. But they are also becoming indispensable to enterprises that want to take advantage of the vast quantities of raw data generated by initiatives like the Internet of Things (IoT), data that can be managed cost-effectively in low-cost computing environments.
Because the term is relatively new, a data lake is defined in various ways. It can be viewed as a storage repository that holds huge amounts of raw data made available to the enterprise through technologies like Hadoop, or as a technology-agnostic methodology that facilitates the capture, refinement, archiving, and exploration of raw data within the enterprise. Simply stated, a data lake is a centralized approach to capturing, refining, storing, and exploring any form of raw data at scale, enabled by low-cost technologies.
Data lakes allow data scientists to cost-effectively explore data sets of unknown, underappreciated, or unrecognized value. They also relieve IT architectures strained by the task of filtering and transforming exponentially growing data volumes, particularly as the IoT movement gains momentum. Data lakes have also proven to be an economical way to create an online archive that expands corporate knowledge.
But the implementation, care, and feeding of a data lake can be problematic.
Data Lake Challenges
Extracting the signal from the noise is a major data lake challenge, because all data is accepted without oversight or governance. As Gartner commented in a press release last year, “Without descriptive metadata and a mechanism to maintain it, the data lake risks turning into a data swamp.”
Although the usability of Hadoop is improving as SQL on Hadoop matures, in many cases data still needs more than minimal transformation before it is usable. This may require proficiency in low-level coding languages, a skill that many business users understandably lack.
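To make the point concrete, here is a minimal sketch of the kind of hand-written parsing a raw feed can demand before it is queryable. The pipe-delimited log format and field names are hypothetical, chosen only to illustrate why business users without coding skills get stuck at this step.

```python
import csv
import io

# Hypothetical raw, semi-structured sensor log lines as they might
# land in a data lake, before any transformation.
raw_lines = [
    "2016-03-01T09:15:00|device-42|temp=21.5",
    "2016-03-01T09:16:00|device-42|temp=21.7",
]

def to_row(line):
    """Split one pipe-delimited log line into (timestamp, device, reading)."""
    ts, device, payload = line.split("|")
    _, value = payload.split("=")
    return ts, device, float(value)

rows = [to_row(line) for line in raw_lines]

# Write the structured result as CSV with a header row, so downstream
# SQL-on-Hadoop tools can treat it as a plain table.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["timestamp", "device_id", "temperature"])
writer.writerows(rows)
print(buf.getvalue())
```

Even this trivial transformation requires code; real feeds with malformed lines, shifting schemas, and nested fields require far more, which is the gap self-service wrangling tools aim to close.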
In addition, data can flow into the data lake with no oversight of its contents, creating privacy and regulatory compliance risks.
Toward a Solution
Teradata directly addresses these and other challenges associated with deploying and maintaining data lakes through Teradata Loom. This solution provides integrated metadata management, data lineage and data wrangling for enterprise Hadoop.
Teradata Loom’s framework of sources, datasets, transforms, and jobs gives data scientists and analysts an integrated view of the workflow. It captures metadata about data and processes, providing a single platform for data management, lineage tracking, and data preparation.
Teradata Loom’s ActiveScan feature automatically collects metadata about each job, removing the burden of generating metadata by hand. It also maintains the relationships between original data sets and the data sets derived from them through transformations.
Working within an easy-to-use, browser-based, self-service environment, users can access, maintain, and manage data in the data lake throughout its lifecycle. ActiveScan automates the collection of data lineage and statistics for all data in the cluster, and Teradata Loom’s built-in “data wrangling” capabilities simplify data preparation with self-service UI features for analysts, minimizing IT involvement.
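The lineage idea behind the sources/datasets/transforms/jobs model can be illustrated generically: each derived dataset records the transform that produced it and its parent datasets, so any table can be traced back to its raw sources. This is a conceptual sketch only; the class and field names are hypothetical and do not reflect Teradata Loom’s actual API or internals.

```python
from dataclasses import dataclass, field

@dataclass
class Dataset:
    """A dataset plus the lineage metadata linking it to its parents."""
    name: str
    transform: str = "raw ingest"
    parents: list = field(default_factory=list)

def lineage(ds):
    """Walk back from a dataset to the raw sources it was derived from."""
    chain = [ds.name]
    for parent in ds.parents:
        chain.extend(lineage(parent))
    return chain

# A small derivation chain: raw events -> cleaned events -> daily averages.
raw = Dataset("sensor_events_raw")
clean = Dataset("sensor_events_clean", "drop malformed rows", [raw])
daily = Dataset("sensor_daily_avg", "aggregate by day", [clean])

print(lineage(daily))
# ['sensor_daily_avg', 'sensor_events_clean', 'sensor_events_raw']
```

Capturing this chain automatically, rather than relying on analysts to document it, is what keeps derived data traceable as the lake grows.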
Teradata Loom reads and parses a variety of file types, including JSON (JavaScript Object Notation), the data-interchange format that powers much of the IoT, representing this deluge of data in simple tables for easy manipulation and preparation by enterprise end users. In addition, Teradata Loom is compatible with the leading Hadoop distributions, including the Hortonworks Data Platform, Cloudera CDH, and MapR.
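Representing nested JSON as a simple table typically means flattening nested objects into dot-separated column names. The sketch below shows the general technique on a hypothetical IoT reading; it is not Teradata Loom’s implementation, just an illustration of the JSON-to-table step.

```python
import json

# A hypothetical IoT reading with nested structure, as it might
# arrive in the data lake.
record = json.loads("""
{"device": {"id": "d-7", "site": "plant-2"},
 "reading": {"temp_c": 40.1, "humidity": 0.31},
 "ts": "2016-03-01T09:15:00"}
""")

def flatten(obj, prefix=""):
    """Flatten nested dicts into one row with dot-separated column names."""
    row = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            row.update(flatten(value, name + "."))
        else:
            row[name] = value
    return row

row = flatten(record)
print(row)
# {'device.id': 'd-7', 'device.site': 'plant-2',
#  'reading.temp_c': 40.1, 'reading.humidity': 0.31,
#  'ts': '2016-03-01T09:15:00'}
```

Once records are flat rows with stable column names, ordinary tabular tools and SQL engines can query them directly.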
Help from Think Big
To implement advanced data lakes, Think Big, a Teradata company, provides data science and engineering services that enable organizations to accelerate their time to value from big data. These business-driven solutions include analytics in areas such as manufacturing device events, omni-channel consumer behavior, and customer service behavior, as well as well-governed data lakes that deliver business value.
As the first and only pure-play big data services firm, Think Big employs data scientists, data engineers, and project managers with deep expertise in helping the world’s most innovative companies strategize, architect, implement, analyze data on, and manage open source big data solutions.
In most cases, Think Big uses a blended delivery model for client projects. This means that field consulting teams work at customer locations while also leveraging Teradata’s onshore Solution Center in the United States for much of the coding and QA work needed to implement big data applications, including data lakes.
Think Big provides all the expertise needed to help companies deploy best-in-class data lakes. This includes hardening the data lake to meet enterprise-grade security requirements for large organizations; helping companies select and deploy the appropriate Hadoop distributions; and implementing big data and data lake best practices.
Realizing the Benefits
Teradata Loom allows enterprise data analysts and data scientists to easily work with Hadoop data and accelerate the time from data acquisition to business insights. The solution provides a single unified platform that handles everything from data discovery to metadata management and data preparation. The data lake can become an excellent adjunct to your data warehouse, allowing the use and capture of critical information that otherwise might have been overlooked.
The solution democratizes accessibility and data understanding across the entire organization and boosts the productivity of analysts and data scientists.
Teradata Loom ensures that your data lake doesn’t turn into a data swamp overflowing with unused and unusable information. The Teradata solution is a major contribution to realizing the full potential of big data.
Download the free Teradata Loom Community Edition here and learn firsthand how to turn a data lake into an invaluable enterprise asset.