BI on Hadoop: What Are Your Options?
In the era of RDBMS and modern data warehouses, business intelligence was mostly a solved problem. Any reasonably advanced tool would work with any reasonable database, and the only real work was deciding what to collect and how to present it. However, the rise of big data and its associated technologies has forced the market to solve all these old problems all over again, and we're now left with a proliferation of software that can be difficult to differentiate.
In my Strata + Hadoop World presentation, "BI on Hadoop: What are your options?", we'll look at three primary categories of solutions to the 'BI-on-big-data' problem.
The first category is 'ETL to RDBMS,' in which pre-packaged or custom software is used to populate a relational database with information extracted from a big data source. This approach essentially reduces the contemporary problem to the earlier and better-understood problem of 'BI on RDBMS.' In this section of the talk, we'll name popular ETL tools and walk through an example flow for building an RDBMS from big data.
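To make the ETL-to-RDBMS idea concrete, here is a minimal sketch of the pattern: semi-structured records (as they might look coming off HDFS) are parsed and loaded into a relational table. The field names are hypothetical, and SQLite stands in for whatever relational warehouse you'd actually target.

```python
import json
import sqlite3

# Hypothetical semi-structured "big data" records, e.g. JSON lines
# exported from a Hadoop job. Field names are made up for illustration.
raw_records = [
    '{"business_id": "b1", "name": "Cafe One", "stars": 4.5}',
    '{"business_id": "b2", "name": "Diner Two", "stars": 3.0}',
]

# Target RDBMS -- an in-memory SQLite database stands in for any
# relational warehouse a real ETL tool would load.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE business (business_id TEXT PRIMARY KEY, name TEXT, stars REAL)"
)

# Extract-transform-load: parse each record and insert only the
# fields the BI layer needs.
for line in raw_records:
    rec = json.loads(line)
    conn.execute(
        "INSERT INTO business (business_id, name, stars) VALUES (?, ?, ?)",
        (rec["business_id"], rec["name"], rec["stars"]),
    )
conn.commit()

# From here, any SQL-enabled BI tool can query the relational copy.
top_rated = conn.execute(
    "SELECT name FROM business WHERE stars >= 4 ORDER BY stars DESC"
).fetchall()
print(top_rated)  # -> [('Cafe One',)]
```

The trade-off this sketch makes visible: the BI side becomes trivially easy, but you now own an extraction pipeline that must be kept in sync with the source data.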
The second category is a class of software that could be described as 'monolithic solutions.' This software takes an all-in-one approach, solving the problem of querying and visualizing big data within a single package. We'll discuss the architecture of three of these tools (Platfora, Datameer, and Zoomdata) and point out how their design choices influence the experience of using each one.
The final category is SQL-on-Big-Data solutions, which comprises three important sub-categories: native SQL (Drill, Impala, Presto), batch SQL (Hive and Spark SQL), and OLAP cubes. Fundamentally, these solutions provide a query engine layer on top of big data that exposes a standard SQL interface to any SQL-enabled BI tool. We'll take some time to compare and contrast these tools, and attendees curious about SQL-on-Big-Data will leave with a strong sense of what defines each sub-category.
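As a small illustration of what the query-engine layer buys you, here is the kind of query a SQL-enabled BI tool could send through Drill, which reads self-describing files such as JSON in place, with no load step. The workspace and file path below are hypothetical.

```sql
-- Drill queries the file directly through its dfs storage plugin;
-- the path shown here is a hypothetical example.
SELECT name, stars
FROM dfs.`/data/yelp/business.json`
WHERE stars >= 4
ORDER BY stars DESC;
```

The point is that the tool on top speaks ordinary SQL and never needs to know the data isn't sitting in a relational database.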
Following the SQL-on-Big-Data section there will be a brief demo, in which Yelp datasets stored in both MongoDB and Hadoop are accessed from Tableau via Drill. The wrap-up for the talk will consist of a summary of the properties of these solutions and a heuristic for guiding enterprise adopters to the BI solution that might work best for them.
My session takes place Thursday, March 31, from 2:40pm to 3:30pm in room LL20 D.
About the author: Jacques Nadeau is cofounder and CTO of Dremio, a big data software startup focused on the development of Apache Arrow, a new columnar in-memory data format for analytics. Prior to Dremio, Jacques led the Apache Drill development efforts at MapR Technologies.