How TD Bank Made Its Data Lake More Usable
One of the ways to make big data useful is to collect it all in a Hadoop-based repository and then give users self-service tools to access it. But there are a lot of details that must be hammered out before the Hadoop dream becomes a reality. For TD Bank, the route to data lake efficiency currently passes through a suite of end-to-end tools that address some of the biggest stumbling blocks customers encounter when working with new technologies.
Toronto-Dominion Bank (TD Bank) is one of the largest banks in North America, with 85,000 employees, more than 2,400 locations between Canada and the United States, and assets nearing $1 trillion. Like most large financial services firms, the Toronto, Ontario-based firm has a bit of everything when it comes to its IT operations, ranging from open systems like Linux to proprietary ones like IBM System z mainframes, long the favored transaction processing platform of big banks.
About three years ago, the company decided to standardize how it warehouses data for various business intelligence and regulatory reporting functions. The company purchased Cloudera’s enterprise Hadoop distribution and set off to build a large cluster that could function as a centralized lake to store data originating from a variety of departments.
However, the bank discovered that the process of ingesting data into the Hadoop cluster was not as simple as it had hoped. The company’s reliance on hand-coded ETL scripts meant that it could take up to six months for the IT department to deliver the data that analysts needed to explore how the business was running, what customers were doing, and how it might evolve its business. That kind of delay was simply unacceptable for a $30-billion bank.
The company turned to the aftermarket of Hadoop automation tools, and invited several vendors to build proofs of concept (POCs) to showcase how they could simplify a variety of data management tasks. In the fall of 2015, TD Bank selected Podium Data, which develops a suite of tools for managing Hadoop clusters and the data held in them, as well as Teradata’s Think Big consultancy, which handled the implementation that started in 2016.
Meta-Data Lake
The key to Podium Data’s solution is its reliance on metadata to identify individual pieces of data and how they’re moving and evolving throughout the enterprise, says CEO Paul Barth.
“It’s really looking at an end-to-end data management process, what I call from raw to ready, where we build in and drive off of metadata and configuration, data collection, preparation, delivery, quality, governance, and security,” Barth told Datanami. “We have a turnkey solution that does every step of the puzzle.”
The Podium platform runs on a node in the Hadoop cluster and automatically handles many of the tasks required to make data visible and usable to end users. “It’s like setting up a store in Amazon – you have everything you need to create a data marketplace for your user base,” Barth continues. “We are one of the few integrated environments, so when you turn it on, you have the entire marketplace of data.”
Podium also handles security tasks, such as encryption, for TD Bank. As an international bank, TD Bank must follow a variety of regulations governing how it handles data security. “We are identifying the personally identifiable information [PII]. We’re protecting it,” Barth says.
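To make the PII-protection step concrete, here is a minimal, purely illustrative sketch of one common approach: deterministic tokenization of identifying fields before records land in the lake. The field names, key handling, and hashing choice are assumptions for illustration, not a description of Podium’s actual mechanism, which the article does not detail.

```python
# Illustrative sketch only -- not Podium's actual implementation.
# Shows one common way to protect PII before it lands in a data lake:
# deterministic, keyed hashing (tokenization) of identifying fields.
import hashlib
import hmac

SECRET_KEY = b"rotate-me-via-a-real-kms"  # assumption: key managed externally

def tokenize(value: str) -> str:
    """Replace a PII value with a stable, non-reversible token."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()

def protect_record(record: dict, pii_fields: set) -> dict:
    """Return a copy of the record with PII fields tokenized."""
    return {k: tokenize(v) if k in pii_fields else v for k, v in record.items()}

# Hypothetical example record and PII field list
raw = {"account_id": "1234567", "customer_name": "Jane Doe", "balance": "5021.33"}
print(protect_record(raw, {"account_id", "customer_name"}))
```

Because the tokens are deterministic, the same customer still joins consistently across datasets, while the raw identifier never reaches analysts.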
TD Bank is also making use of Podium’s automated data ingestion capability to streamline the landing of data into its Hadoop cluster and to provide more access to data for analysts, says Upal Hossain, who heads up TD Bank’s data ingestion innovation team.
“We’re trying to automate ingestion,” Hossain told Datanami at the recent Strata Data Conference in New York. “We want to make it super easy to get data into our lake. It should take two days. Right now it takes two months.”
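The general pattern behind this kind of automation is to register a new feed once, declaratively, and let the framework derive landing paths, schedules, and schemas from that metadata instead of hand-coding a new ETL script per source. The sketch below is a hypothetical illustration of such a source specification; all names, paths, and fields are invented and do not reflect Podium’s actual configuration format.

```python
# Hypothetical metadata-driven source registration (all names invented).
# Describe the feed once; the ingestion framework derives the rest from it.
SOURCE_SPEC = {
    "source_name": "retail_deposits",
    "format": "delimited",           # e.g. delimited, xml, mainframe_ebcdic
    "delimiter": "|",
    "arrival_schedule": "daily",
    "landing_path": "/data/raw/retail_deposits/{yyyy}/{mm}/{dd}/",
    "fields": [
        {"name": "account_id", "type": "string", "pii": True},
        {"name": "txn_amount", "type": "decimal(18,2)", "pii": False},
        {"name": "txn_date",   "type": "date", "pii": False},
    ],
}

def landing_path(spec: dict, year: int, month: int, day: int) -> str:
    """Resolve the dated landing directory for one day's files."""
    return spec["landing_path"].format(yyyy=year, mm=f"{month:02d}", dd=f"{day:02d}")

print(landing_path(SOURCE_SPEC, 2017, 9, 28))
```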
Enter the EDPP
The Podium work targets TD Bank’s enterprise data provisioning platform, or EDPP, which collects up to 100,000 files per night representing millions of fields from all of the bank’s business units, and makes them available to analysts and other users. The analytical workloads themselves run elsewhere.
The diversity of data types in TD Bank’s environment contributes to the challenge. In addition to relational database and XML files, the company must deal with flat files and mainframe files. Mainframes, of course, use the EBCDIC data format as opposed to the ASCII format used by the rest of the world, which further adds to the difficulty.
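As a rough illustration of what that mainframe translation step involves, the sketch below decodes an EBCDIC byte stream into UTF-8 using Python’s built-in code pages. The cp037 code page is an assumption for illustration; real mainframe feeds also require copybook-driven handling of packed-decimal (COMP-3) and binary fields, which plain character decoding does not cover.

```python
# Minimal sketch of an EBCDIC-to-UTF-8 conversion step for a mainframe feed.
# Code page cp037 (EBCDIC US/Canada) is an assumption for this example.
ebcdic_bytes = "HELLO FROM THE MAINFRAME".encode("cp037")  # simulate a mainframe record

decoded = ebcdic_bytes.decode("cp037")   # EBCDIC -> Python str
utf8_bytes = decoded.encode("utf-8")     # re-encode for downstream Hadoop tools

print(decoded)
```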
Under the old approach, TD Bank’s analysts would spend up to half their time defining data sources to ensure that transformations were done correctly and accurately. The goal with Podium is to automate those tasks so the analysts can reclaim that time for more value-added activities.
TD Bank still must define the transformations in Podium, which then generates the Hive tables. But once those transformations are defined, the company can build on that work, rolling out transformations for new but similar data sources much faster than before.
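To show the "define once, reuse for similar feeds" idea in code, here is a small illustrative generator that turns field metadata into Hive external-table DDL. This is not Podium’s actual output; the table, columns, delimiter, and location are hypothetical and reuse the invented field list from the earlier source-specification sketch.

```python
# Illustrative only: generate Hive external-table DDL from field metadata.
# Not Podium's actual output; names and paths are hypothetical.
def hive_ddl(table: str, fields: list, location: str) -> str:
    cols = ",\n  ".join(f"`{f['name']}` {f['type']}" for f in fields)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'\n"
        "STORED AS TEXTFILE\n"
        f"LOCATION '{location}'"
    )

fields = [
    {"name": "account_id", "type": "STRING"},
    {"name": "txn_amount", "type": "DECIMAL(18,2)"},
    {"name": "txn_date",   "type": "DATE"},
]
print(hive_ddl("raw.retail_deposits", fields, "/data/raw/retail_deposits/"))
```

Once a pattern like this exists for one feed, onboarding a similar feed largely reduces to supplying a new field list rather than writing a new ETL job from scratch.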
“Until now, basically any file that comes in, we have to analyze how we’re going to ingest that data,” Hossain said. “One of our main goals is to achieve high re-usability. One of the ways we’re measured is how often can a new project come in and reuse the same data that we already have, so instead of starting a new ETL from scratch, we already have that data and they can leverage that.”
TD Bank has cut its IT costs related to data preparation by 40%, according to Podium Data’s case study on the implementation. “It saves us a lot of money,” Hossain said.
TD Bank isn’t yet using all of Podium’s capabilities. Its next step, for example, is to have data stewards describe the data within the tool. But the relationship between the two companies is a good one.
“We like working with Podium. Whenever there’s a need, Podium is quick to respond,” Hossain said. “The thing we like about Podium is it’s geared to big data environments. So the guys we’re working with are very passionate, which we really liked.”
Related Items:
Why Integration and Governance Are Critical for Data Lake Success
Data Catalogs Emerge as Strategic Requirement for Data Lakes