September 25, 2015

How Graph-Based Smart Data Lakes Will Democratize Value Extraction from Big Data

Alok Prasad

Data lakes risk becoming dumping grounds without proper governance

The prevalence of big data and the value it generates have greatly, and perhaps indelibly, altered contemporary business practices in two pivotal ways.

Firstly, the value of just-in-time analytics means it is no longer feasible to wait for scarce data scientists to compile and prepare data for end users. Widespread end-user adoption hinges on a simplification and democratization of big data that transcends, expedites, and even automates aspects of data science.

Secondly, big data has resulted in a situation in which enterprises must account for copious amounts of external data that are largely unstructured. The true value in accessing and analyzing these data lies in their integration with traditionally structured internal data for comprehensive views. Historically, integration between external and internal data has been hampered by the inordinate time spent on security and data governance concerns. Gartner has written at length about governance issues of non-semantic data lakes.

What’s needed is a way to expedite access, analytics, and integration of big data while preserving the trust and security on which prized enterprise data depends.

One approach for doing so is to utilize Smart Data Lakes based on in-memory, high-speed graph processing technologies with semantic models that business users can understand. This method addresses the aforementioned concerns by providing rapid time to insight and action, as well as semantic consistency for sustainable governance and security.

Demystifying Big Data with Semantics

The speed of data access and analysis is one of the most tangible wins for Smart Data Lakes and semantic graph technology, especially when applied to big data. Many organizations believe that big data initiatives require the lengthy services of data scientists and long periods of data modeling. But one of the benefits of semantic graphs is that organizations can start with whatever data they have and expand as rapidly as they like as their business needs change. Semantic graph models eliminate the need to settle on the right model upfront at design time. Enterprises can start with a model based on current needs, then extend and evolve it, along with its associated analytic capabilities, as needs change. Graphs also provide a visual representation of data objects, which offers end users greater insight into relationships and context, as well as self-service visualization and analytics capabilities. Thus, the more data in such graphs, the more useful they become.
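To make the idea of an evolving model concrete, here is a minimal sketch in Python. It is a toy in-memory store of subject, predicate, object triples and not the API of any particular semantic graph product; the class name `TripleGraph` and every identifier in the sample data are hypothetical. The point it illustrates is only that a new kind of relationship can be added later without redesigning a schema first.

```python
from collections import defaultdict


class TripleGraph:
    """Toy in-memory graph of (subject, predicate, object) triples.

    Hypothetical illustration only; real semantic graph stores add
    ontologies, inference, and a query language such as SPARQL.
    """

    def __init__(self):
        # subject -> set of (predicate, object) pairs
        self._edges = defaultdict(set)

    def add(self, subject, predicate, obj):
        """Record one relationship; no predicate needs to be declared first."""
        self._edges[subject].add((predicate, obj))

    def describe(self, subject):
        """Return every (predicate, object) pair known for a subject."""
        return sorted(self._edges.get(subject, set()))

    def predicates(self):
        """Return the relationship types currently present in the graph."""
        return sorted({p for edges in self._edges.values() for p, _ in edges})


# Start with the data and relationships available today.
graph = TripleGraph()
graph.add("patient:1", "hasCondition", "condition:example")
graph.add("patient:1", "prescribed", "drug:a")

# Later, a new business question introduces a new relationship type.
# Nothing about the existing data has to be remodeled or migrated.
graph.add("drug:a", "interactsWith", "drug:b")
```

After the last call, `graph.predicates()` lists three relationship types, the third of which did not exist when the graph was first populated.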

The speed with which one can run queries, and the reduction of the data discovery process to a four-step method of look, hypothesize, analyze, and act, are beneficial across vertical industries. In the health care space, graphs enable clinicians to determine relationships between disparate health factors, which can yield links to possible cures and treatment methods. The fact that end users can run these queries so quickly increases the number of questions they can ask, which improves their overall efficacy. In the pharmaceutical industry, research and development departments can drastically reduce their expenses and the time to market of products by utilizing this graph-based data discovery process.

Integrating External and Internal Data

The near-real-time ad hoc querying that a Smart Data Lake supports is substantially enhanced by the technology's ability to integrate external big data with internal data. The power of semantics enables such integration and expeditious data discovery without lengthy data preparation processes. The combination of real-time big data with internal data gives users an additional degree of context relevant to real-world developments. In e-commerce, graphs can help recommender engines tailor products to customer needs based on customers' online profiles and up-to-the-minute online activity. In the finance and credit card industry, graphs can integrate external data with internal data to help detect and prevent fraudulent activity. They can also be leveraged to identify potential investments and hedge fund opportunities for customers based on real-world opportunities as they occur.
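The fraud example above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a real fraud model: the identifiers, the predicate names, and the rule itself (flag a transaction whose merchant country differs from the cardholder's home country) are all hypothetical. What it shows is that once internal and external facts share identifiers, a single query can traverse both without a separate data preparation step.

```python
def match(triples, s=None, p=None, o=None):
    """Return triples matching the given subject, predicate, and object.

    A value of None acts as a wildcard for that position.
    """
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# Internal, structured data (hypothetical): who owns which card.
internal = {
    ("card:42", "ownedBy", "customer:7"),
    ("customer:7", "homeCountry", "US"),
}

# External data (hypothetical): a transaction feed keyed by the same card id.
external = {
    ("txn:901", "usedCard", "card:42"),
    ("txn:901", "merchantCountry", "RO"),
    ("txn:902", "usedCard", "card:42"),
    ("txn:902", "merchantCountry", "US"),
}

# Integration is a set union because both sources use shared identifiers.
lake = internal | external


def suspicious_transactions(triples):
    """Flag transactions made outside the cardholder's home country."""
    flagged = []
    for txn, _, card in match(triples, p="usedCard"):
        for _, _, customer in match(triples, s=card, p="ownedBy"):
            home = {o for _, _, o in match(triples, s=customer, p="homeCountry")}
            seen = {o for _, _, o in match(triples, s=txn, p="merchantCountry")}
            # Only flag when both facts are known and they disagree.
            if home and seen and not (home & seen):
                flagged.append(txn)
    return sorted(flagged)


flagged = suspicious_transactions(lake)
```

Neither source can answer the question alone: the transaction feed does not know the cardholder's home country, and the internal records do not contain the transactions.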

The governance and security capabilities of Smart Data Lakes are just as valuable. Since graphs can ensure semantic consistency, critical facets of proper governance (such as role-based access to data, data provenance, and metadata management) can all be linked to semantic models and requisite business glossaries. Such uniform governance mechanisms are all but essential in the heavily regulated finance and health care industries. The provenance and traceability of Smart Data Lakes allow companies to determine what potentially sensitive customer data is accessed or deployed in a way that does not comply with regulations, and to do so comprehensively, across multiple forms of communication and data exchange.

In other scenarios, the viability of a Smart Data Lake hinges on the governance and security it can provide. Consulting companies, for example, work with multiple customers, many of whom might be competitors with one another. Consultants will not put sensitive and personally identifiable information about clients or companies in a data lake unless there are stringent measures to prevent colleagues working on different accounts from accessing it. The governance capabilities of Smart Data Lakes address such concerns through role-based access, while still allowing the consulting company to benefit from the data discovery capabilities described earlier.
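A minimal Python sketch of the consulting scenario follows. It assumes, purely for illustration, that every fact in the lake carries the client account it belongs to and the source it came from, and that each role maps to a set of permitted accounts. The names `Fact`, `ROLE_ACCOUNTS`, and `visible_facts` are hypothetical and do not describe any vendor's access-control mechanism.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    """One statement in the lake, tagged with governance metadata."""
    subject: str
    predicate: str
    obj: str
    account: str  # client engagement the fact belongs to
    source: str   # provenance: where the fact was loaded from


# Hypothetical role-to-account mapping maintained alongside the model.
ROLE_ACCOUNTS = {
    "consultant_a": {"client_x"},
    "consultant_b": {"client_y"},
}


def visible_facts(facts, role):
    """Return only the facts whose account the given role may see.

    Unknown roles are mapped to no accounts, so they see nothing.
    """
    allowed = ROLE_ACCOUNTS.get(role, set())
    return [f for f in facts if f.account in allowed]


facts = [
    Fact("client_x", "hasForecast", "forecast:1", "client_x", "crm_export"),
    Fact("client_y", "hasForecast", "forecast:2", "client_y", "email_archive"),
]

for_a = visible_facts(facts, "consultant_a")
```

Because the filter is applied to the facts themselves rather than to individual reports, the same rule holds for any query a consultant runs, and the `source` field stays attached to each fact for later audits.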

Taming Big Data

Smart Data Lakes based on semantic graphs provide a visual view of data elements that accelerates time to insight and action for both business users and data scientists. The latter can build on this advantage to maximize big data utility by incorporating analytics results into workflows for action. Additionally, leveraging semantic graphs for integration reinforces governance and security protocols by keeping data consistent with metadata and business glossary definitions. The result is role-based access and data provenance, allowing enterprises to more readily manage regulatory requirements. Thus, Smart Data Lakes based on in-memory, high-speed graph processing technologies with semantic models that business users can understand make big data more accessible, manageable, and viable to the enterprise.

About the author: Alok Prasad is president of Cambridge Semantics, the leading provider of smart data solutions driven by semantic web technology. He brings 25-plus years of technology and business experience. In 2000, he co-founded Beacon Photonics, a venture capital fund with Boston University and Globalvest focused on IT infrastructure and life sciences. Prior to Beacon, he was a principal at PRTM, an operations management consulting company subsequently acquired by PwC. He also previously served as vice president of COBA-M.I.D, a strategy and operations consulting company later acquired by Renaissance Worldwide and renamed Adventis. Prasad received his MBA from The Wharton School at the University of Pennsylvania, a master’s degree from Auburn University, and a B.Tech from the Indian Institute of Technology, Kharagpur, India.

Applications: Data Mining, Enterprise Analytics, Predictive Analytics

Technologies: Middleware

Sectors: Financial Services, Manufacturing, Retail

Vendors: Cambridge Semantics

Tags: Cambridge Semantics, data lake, Hadoop, semantics