
March 4, 2022

Meet 2022 Datanami Person to Watch Ryan Blue

As the co-creator of Apache Iceberg, Ryan Blue played a central role in establishing the table format as a new standard in the open data ecosystem. As the CEO of Tabular, Blue is also building a commercial entity around Iceberg. We recently caught up with Blue, who is one of our Datanami People to Watch for 2022.

Datanami: Apache Iceberg has filled a need for an open table format for a variety of computational frameworks, including Hive, Spark, Flink, PrestoDB, and Trino. What spurred you to develop it?

Ryan Blue: Before joining Netflix, I had a lot of conversations about fixing tables—it was a well-known problem and it seemed like each company I talked with had different approaches to making pipelines reliable. At Netflix, the problems were more urgent because we were working with data in S3 rather than HDFS. Directory listing couldn’t be trusted, latency was higher, and Netflix scale meant hitting “rare” problems all the time.

We started keeping track of just how many problems were caused by the simplicity of the Hive format and found that we could solve many pressing issues: the need to scale the Hive metastore, S3 latency, number of S3 operations, and S3 eventual consistency.

In the end, I think what pushed us to actually build it, rather than maintaining work-arounds, was that it was so painful for our data engineering partners. They'd regularly use a type that worked in only one engine, or drop a column and corrupt a table, or not realize that Spark would automatically overwrite rather than insert in order to guarantee correctness. It was so painful to work with our platform that we had to do something.

The key was recognizing that our infrastructure problems and our customers’ pain had the same cause: a table format that wasn’t up to the task for data warehouse workloads.

Datanami: What do you really like about the open source community? Why is this the right way to develop software for enterprises?

Blue: The Iceberg community is full of amazing engineers and it’s been great to see the project grow far beyond what we would have been able to accomplish at Netflix alone. The list of contributions is really amazing. Things like SQL extensions to make it easy to run maintenance tasks or to configure a table’s sort order would never have happened, not to mention the integrations with all of the processing engines.
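For readers unfamiliar with the extensions Blue mentions, here is a rough sketch of what running a maintenance task and configuring a sort order can look like, assuming a Spark session already configured with the Iceberg runtime and its SQL extensions; the catalog name "prod" and table "db.events" are hypothetical, used only for illustration:

# Illustrative sketch only: requires a Spark session configured with the
# Iceberg runtime jar and IcebergSparkSessionExtensions. The catalog "prod"
# and table "db.events" are hypothetical names.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-maintenance").getOrCreate()

# Compact small data files with a stored procedure exposed via the SQL extensions.
spark.sql("CALL prod.system.rewrite_data_files(table => 'db.events')")

# Declare the table's write sort order as a table-level setting.
spark.sql("ALTER TABLE prod.db.events WRITE ORDERED BY event_date, account_id")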

Of course, this was the goal of donating the project to the ASF. But it’s one thing to put a project out there and another to see people actually adopt it, and then to invest so heavily in improving it.

I’m glad to see it because this is what the larger big data community needs: a standard for cloud-native analytic tables that works across all the engines we already use. The only way to do that is through a healthy community that wants to welcome new people and use cases, and is neutral so everyone can confidently invest in support for the standard.

Datanami: What do you hope to see from the big data community in the coming year?

Blue: I’m excited to have more people using Tabular’s data platform, of course. But that aside, there are some things I think are set to make significant progress this year. The first is making data engineering more declarative. Even though we use SQL-like systems, people spend too much time worrying about how something is done instead of telling their tools what to do. I think this is one of the design principles that makes dbt so successful. This has been improving as SQL-like engines mature and I hope to see more improvements over the next year.

We’ve been working toward declarative data engineering in the Iceberg community for a long time with things like table-level configuration and hidden partitioning, but some features we added to Spark 3.2 make it more possible, like clustering and sorting as table attributes. It will be good to see people picking up those features and no longer worrying about rebuilding and testing jobs just to tweak the output clustering.
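To make the declarative idea concrete, here is a minimal sketch of defining behavior on the table rather than in job code, again assuming a Spark session with an Iceberg catalog named "prod"; the schema and property values are hypothetical:

# Minimal sketch, not a definitive recipe: hidden partitioning and table-level
# configuration declared once on the table, so individual jobs don't carry
# that logic. Catalog, table, and property values are example assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-declarative").getOrCreate()

# Hidden partitioning: partition by a transform of a column, so writers and
# readers never manage a separate partition column themselves.
spark.sql("""
    CREATE TABLE prod.db.events (
        account_id BIGINT,
        payload    STRING,
        ts         TIMESTAMP
    ) USING iceberg
    PARTITIONED BY (days(ts))
    TBLPROPERTIES ('write.target-file-size-bytes' = '536870912')
""")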

Along the same lines, there are some exciting developments in the view space. I’m hearing a lot more about materialized views lately. And there are some promising projects to be able to share views across database engines, like Substrait, which is a shared representation aimed at making it possible to exchange logical SQL plans. Having one definition work across Spark and Trino, for example, is a big win.

And the last thing is that I’m hoping to see more companies adopt Iceberg as the standard for analytic tables. In the last few months, Starburst, Dremio, Athena, EMR, and Snowflake have all announced support and I’m excited to see that momentum continue!

Datanami: Outside of the professional sphere, what can you share about yourself that your colleagues might be surprised to learn – any unique hobbies or stories?

Blue: A few weeks into the pandemic, I started running every day to make sure I got out of the house, and it turned into something I've kept doing every day. I'm at 650 days now, and I'm going to try to make it until the "end". That's hopefully soon, since we're close to vaccines for kids under 5.

You can read the interview with Blue and other 2022 Datanami People to Watch winners at this link.
