AWS Unveils Hosted Apache Iceberg Service on S3, New Metadata Management Layer
AWS today unveiled a new S3 bucket type that's optimized for storing data in Apache Iceberg, which has become the de facto standard for open table formats. The new bucket type will not only automate the "undifferentiated heavy lifting" of table maintenance, but also deliver a significant speedup for analytics on Iceberg tables. The company also launched a new metadata service aimed at helping customers wrangle the technical metadata generated in Iceberg environments.
The events of this June, when Databricks acquired Tabular and Snowflake launched the Polaris metadata catalog for Iceberg, are still reverberating around the big data community. Customers who previously might have been hesitant to invest in building a data lakehouse out of fear of choosing the wrong table format were given the green light as the industry settled on Iceberg.
As the largest cloud provider, AWS stood to benefit from the accelerating growth of customer data lakehouses managed by big data heavyweights like Snowflake and Databricks, as well as scrappier upstarts like Starburst and Dremio. Many of the world's new Iceberg tables (essentially metadata that organizes Parquet files in ways that enable the transactionality and consistency missing from earlier data lakes) were likely to live in S3 anyway, so why not cut out the middleman?
That's essentially what AWS is doing with today's launch of Amazon S3 Tables. AWS says the new bucket type optimizes the storage and querying of tabular data as Iceberg tables, where the data can be consumed by multiple query engines, including AWS services such as Amazon Athena, EMR, Redshift, and QuickSight, as well as open source engines like Apache Spark. Storing data this way gives customers row-level transaction support, queryable snapshots via time-travel functionality, schema evolution, and other Iceberg capabilities.
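To make the multi-engine access pattern concrete, the sketch below shows how a Spark-style query engine might be pointed at a table bucket. The catalog alias, the bucket ARN, and the `org.apache.iceberg.spark.SparkCatalog` wiring are illustrative assumptions drawn from standard Iceberg conventions, not confirmed details of the AWS integration.

```python
# Sketch: assembling the Spark configuration entries that would map a
# catalog alias to an S3 table bucket. The catalog class, bucket ARN,
# and alias below are placeholder assumptions for illustration.

def s3_tables_spark_conf(table_bucket_arn: str, catalog: str = "s3tables") -> dict:
    """Return Spark config entries pointing a catalog alias at a table bucket."""
    return {
        # Standard Iceberg Spark catalog plug-in (assumed wiring).
        f"spark.sql.catalog.{catalog}": "org.apache.iceberg.spark.SparkCatalog",
        # Hypothetical: use the table bucket ARN as the catalog warehouse.
        f"spark.sql.catalog.{catalog}.warehouse": table_bucket_arn,
    }

conf = s3_tables_spark_conf(
    "arn:aws:s3tables:us-east-1:111122223333:bucket/analytics-bucket"
)

# An engine configured this way could then run plain SQL against the
# Iceberg table, e.g. (namespace and table names are hypothetical):
query = "SELECT * FROM s3tables.sales.orders LIMIT 10"
```

The point of the sketch is the shape of the integration: the engine sees an ordinary Iceberg catalog, while S3 Tables handles the storage and maintenance underneath.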
Parquet and Iceberg are designed for large-scale big data analytic environments, and AWS says it’s upping the performance with Amazon S3 Tables. The company claims its new Iceberg service delivers up to 3x faster query performance and up to 10x higher transactions per second (TPS) compared to plain vanilla Parquet files stored on standard S3 buckets.
Perhaps more importantly, the new service also automates previously manual tasks such as table maintenance, file compaction, snapshot management, and access control. As Iceberg environments scale up, those tasks can require a dedicated technical team, which becomes a costly burden, or, as AWS sees it, an opportunity.
“Iceberg is really challenging to manage at scale,” AWS CEO Matt Garman said during today’s keynote address at the re:Invent 2024 conference in Las Vegas. “It’s hard to manage the scalability. It’s hard to manage the security.”
One of the AWS customers planning to use S3 Tables is Genesys, a provider of AI orchestration tools. The company says using S3 Tables will enable it to offer a materialized view layer for its diverse data analysis needs.
“S3 is completely reinventing object storage for the data lake world,” Garman said. “I think this is a game changer for data lake performance.”
In addition to the managed Iceberg service, AWS took the next step and launched a metadata service to help manage the morass of data stored in Iceberg environments. The company says the new offering, dubbed S3 Metadata, “automatically generates queryable object metadata in near real-time to help accelerate data discovery and improve data understanding, eliminating the need for customers to build and maintain their own complex metadata systems.”
Customers can add their own custom metadata to S3 Metadata using object tags, such as SKUs or content ratings, which enables them to better manage data in their own businesses, AWS says. The metadata can be queried using basic SQL, which helps to prepare the data for analytics or for use in generative AI.
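As an illustration of the "basic SQL" access pattern described above, the sketch below assembles the kind of query a customer might submit (for example, through Athena) against a metadata table to find objects carrying a custom tag. The metadata table name and the column names (`object_key`, `size`, `last_modified_date`, `tags`) are assumptions for illustration, not the documented S3 Metadata schema.

```python
# Sketch: composing a SQL query over an S3 Metadata table to locate
# objects by a custom object tag (e.g. a SKU). Table and column names
# here are illustrative assumptions.

def find_objects_by_tag(metadata_table: str, tag_key: str, tag_value: str) -> str:
    """Build a SQL statement filtering objects by a custom object tag."""
    return (
        f"SELECT object_key, size, last_modified_date "
        f"FROM {metadata_table} "
        f"WHERE tags['{tag_key}'] = '{tag_value}'"
    )

# Hypothetical usage: find every object tagged with a given SKU.
sql = find_objects_by_tag("my_metadata_table", "sku", "ABC-123")
```

A query like this is the "data discovery" step AWS describes: instead of listing and inspecting objects directly, the customer filters the automatically maintained metadata table.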
S3 Metadata takes aim at so-called metadata catalogs, such as the Apache Polaris offering that Snowflake launched earlier this year. Other technical metadata catalogs include Databricks Unity Catalog and Dremio’s Project Nessie, both of which are in the process of becoming compatible with Polaris.
The automation of metadata management will be particularly beneficial in large environments, such as those exceeding 1PB of data, Garman said.
“We think customers are just going to love this capability, and it’s really a step change in how you can use your S3 data,” he said. “We think that this materially changes how you can use your data for analytics, as well as really large AI modeling use cases.”
S3 Tables are generally available now. S3 Metadata is available as a preview. For more information on S3 Tables, read this AWS blog. For more information on S3 Metadata, read this AWS blog.
Related Items:
How Apache Iceberg Won the Open Table Wars
Databricks Nabs Iceberg-Maker Tabular to Spawn Table Uniformity