AWS Unveils Hosted Apache Iceberg Service on S3, New Metadata Management Layer
AWS today unveiled a new S3 bucket type that's optimized for storing data in Apache Iceberg, which has become the de facto standard for open table formats. The new bucket type will not only automate the "undifferentiated heavy lifting" of table maintenance, but also deliver a significant speedup for analytics on Iceberg tables. The company also launched a new metadata service aimed at helping customers wrangle the technical metadata generated in Iceberg environments.
The events of this June, when Databricks acquired Tabular and Snowflake launched the Polaris metadata catalog for Iceberg, are still reverberating around the big data community. Customers who previously might have been hesitant to invest in building a data lakehouse out of fear of choosing the wrong table format were given the green light as the industry settled on Iceberg.
As the largest cloud provider, AWS stood to benefit from the accelerating growth of customer data lakehouses managed by big data heavyweights like Snowflake and Databricks, as well as scrappier upstarts like Starburst and Dremio. Many of the world's new Iceberg tables (essentially metadata that organizes Parquet files in ways that enable the transactionality and consistency missing from earlier data lakes) were likely to live in S3 anyway, so why not cut out the middleman?
That's essentially what AWS is doing with today's launch of Amazon S3 Tables. AWS says the new bucket type optimizes the storage and querying of tabular data as Iceberg tables, where the data can be consumed by multiple query engines, including AWS services such as Amazon Athena, EMR, Redshift, and QuickSight, as well as open source engines like Apache Spark. Storing data this way gives customers row-level transaction support, queryable snapshots via time-travel functionality, schema evolution, and other Iceberg capabilities.
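To make the multi-engine access pattern concrete, the sketch below shows how a Spark-style query engine might be pointed at a table bucket. The catalog alias, the bucket ARN, and the `org.apache.iceberg.spark.SparkCatalog` wiring are illustrative assumptions drawn from standard Iceberg conventions, not confirmed details of the AWS integration.

```python
# Sketch: assembling the Spark configuration entries that would map a
# catalog alias to an S3 table bucket. The catalog class, bucket ARN,
# and alias below are placeholder assumptions for illustration.

def s3_tables_spark_conf(table_bucket_arn: str, catalog: str = "s3tables") -> dict:
    """Return Spark config entries pointing a catalog alias at a table bucket."""
    return {
        # Standard Iceberg Spark catalog plug-in (assumed wiring).
        f"spark.sql.catalog.{catalog}": "org.apache.iceberg.spark.SparkCatalog",
        # Hypothetical: use the table bucket ARN as the catalog warehouse.
        f"spark.sql.catalog.{catalog}.warehouse": table_bucket_arn,
    }

conf = s3_tables_spark_conf(
    "arn:aws:s3tables:us-east-1:111122223333:bucket/analytics-bucket"
)

# An engine configured this way could then run plain SQL against the
# Iceberg table, e.g. (namespace and table names are hypothetical):
query = "SELECT * FROM s3tables.sales.orders LIMIT 10"
```

The point of the sketch is the shape of the integration: the engine sees an ordinary Iceberg catalog, while S3 Tables handles the storage and maintenance underneath.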
Parquet and Iceberg are designed for large-scale big data analytic environments, and AWS says it’s upping the performance with Amazon S3 Tables. The company claims its new Iceberg service delivers up to 3x faster query performance and up to 10x higher transactions per second (TPS) compared to plain vanilla Parquet files stored on standard S3 buckets.
Perhaps more importantly, the new service also automates previously manual tasks such as table maintenance, file compaction, snapshot management, and access control. As Iceberg environments scale up, those tasks can require a dedicated technical team, which becomes a costly burden, or, as AWS sees it, an opportunity.
“Iceberg is really challenging to manage at scale,” AWS CEO Matt Garman said during today’s keynote address at the re:Invent 2024 conference in Las Vegas. “It’s hard to manage the scalability. It’s hard to manage the security.”
One of the AWS customers planning to use S3 Tables is Genesys, a provider of AI orchestration tools. The company says using S3 Tables will enable it to offer a materialized view layer for its diverse data analysis needs.
“S3 is completely reinventing object storage for the data lake world,” Garman said. “I think this is a game changer for data lake performance.”
In addition to the managed Iceberg service, AWS took the next step and launched a metadata service to help manage the morass of data stored in Iceberg environments. The company says the new offering, dubbed S3 Metadata, “automatically generates queryable object metadata in near real-time to help accelerate data discovery and improve data understanding, eliminating the need for customers to build and maintain their own complex metadata systems.”
Customers can add their own custom metadata to S3 Metadata using object tags, such as SKUs or content ratings, which enables them to better manage data in their own businesses, AWS says. The metadata can be queried using basic SQL, which helps to prepare the data for analytics or for use in generative AI.
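As an illustration of the "basic SQL" access pattern described above, the sketch below assembles the kind of query a customer might submit (for example, through Athena) against a metadata table to find objects carrying a custom tag. The metadata table name and the column names (`object_key`, `size`, `last_modified_date`, `tags`) are assumptions for illustration, not the documented S3 Metadata schema.

```python
# Sketch: composing a SQL query over an S3 Metadata table to locate
# objects by a custom object tag (e.g. a SKU). Table and column names
# here are illustrative assumptions.

def find_objects_by_tag(metadata_table: str, tag_key: str, tag_value: str) -> str:
    """Build a SQL statement filtering objects by a custom object tag."""
    return (
        f"SELECT object_key, size, last_modified_date "
        f"FROM {metadata_table} "
        f"WHERE tags['{tag_key}'] = '{tag_value}'"
    )

# Hypothetical usage: find every object tagged with a given SKU.
sql = find_objects_by_tag("my_metadata_table", "sku", "ABC-123")
```

A query like this is the "data discovery" step AWS describes: instead of listing and inspecting objects directly, the customer filters the automatically maintained metadata table.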
S3 Metadata takes aim at so-called metadata catalogs, such as the Apache Polaris offering that Snowflake launched earlier this year. Other technical metadata catalogs include Databricks Unity Catalog and Dremio’s Project Nessie, both of which are in the process of becoming compatible with Polaris.
The automation of metadata management will be particularly beneficial in large environments, such as those exceeding 1PB of data, Garman said.
“We think customers are just going to love this capability, and it’s really a step change in how you can use your S3 data,” he said. “We think that this materially changes how you can use your data for analytics, as well as really large AI modeling use cases.”
S3 Tables are generally available now. S3 Metadata is available as a preview. For more information on S3 Tables, read this AWS blog. For more information on S3 Metadata, read this AWS blog.
Related Items:
How Apache Iceberg Won the Open Table Wars
Databricks Nabs Iceberg-Maker Tabular to Spawn Table Uniformity