Jun 24, 2026

Why Open Table Formats Are the Real Battlefield in the Data Platform Wars

Open table formats were supposed to be the happy ending. Apache Iceberg and Delta Lake arrived and promised to end storage lock-in forever: write your data once, query it with any engine, never rewrite a petabyte just because you changed platforms. Everyone exhaled. And then Databricks spent $2 billion on Tabular, a startup with roughly $1 million in annual recurring revenue.

That number only makes sense if the format war was never really about formats. A multiple of roughly 2,000x revenue is a price for control of the path to the catalog, not for the format itself. This article traces how lock-in moved from storage to governance, and what that means for any enterprise signing a data platform contract in 2026. For the broader strategic context, the data platform wars have reached an AI inflection point that is reshaping competitive dynamics across every layer of the stack.

What Are Apache Iceberg and Delta Lake and Why Do They Matter for Vendor Lock-In?

Apache Iceberg and Delta Lake are open table formats. They sit on top of Apache Parquet files in your object storage (S3, Azure Blob, GCS) and organise those files into structured, queryable tables with database-like features: ACID transactions, schema evolution, time travel, and partitioning.

The architectural innovation is the separation of metadata from data. A table’s metadata (which files belong to the table, what their schema is, how they are partitioned, and a history of every change) lives in a metadata layer that any compatible engine can read. This means Spark, Trino, and Snowflake, among more than 30 other tools, can all query the same table on the same object storage without format conversion.
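To make the metadata/data split concrete, here is a minimal sketch in Python. It is a toy model, not the Iceberg or Delta Lake metadata specification: the `Snapshot`, `TableMetadata`, and `plan_scan` names and the `s3://bucket/...` paths are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Snapshot:
    """One committed version of the table: an immutable list of data files."""
    snapshot_id: int
    data_files: tuple


@dataclass
class TableMetadata:
    """Hypothetical, simplified stand-in for a table format's metadata layer."""
    schema: dict
    snapshots: list = field(default_factory=list)

    def commit(self, added_files):
        # A commit never edits old snapshots; it appends a new one that
        # references the previous files plus the newly written ones.
        previous = self.snapshots[-1].data_files if self.snapshots else ()
        snap = Snapshot(len(self.snapshots) + 1, previous + tuple(added_files))
        self.snapshots.append(snap)
        return snap

    def files_as_of(self, snapshot_id: Optional[int] = None):
        # Time travel: resolve a historical snapshot, or the latest by default.
        if not self.snapshots:
            return ()
        if snapshot_id is None:
            return self.snapshots[-1].data_files
        for snap in self.snapshots:
            if snap.snapshot_id == snapshot_id:
                return snap.data_files
        raise KeyError(f"unknown snapshot {snapshot_id}")


def plan_scan(engine, table):
    # Any engine that understands the metadata gets the same file list;
    # nothing here depends on which engine is asking.
    return {"engine": engine, "files": table.files_as_of()}


orders = TableMetadata(schema={"order_id": "long", "amount": "double"})
orders.commit(["s3://bucket/orders/f1.parquet"])
orders.commit(["s3://bucket/orders/f2.parquet"])

spark_plan = plan_scan("spark", orders)
trino_plan = plan_scan("trino", orders)
```

In this sketch both plans list the same two Parquet files, and `orders.files_as_of(1)` still resolves the table as it looked after the first commit. Real formats add manifests, statistics, and partition specs on top of the same basic idea.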

Before open formats, migrating between platforms meant rewriting all your data, a process measured in weeks or months for large estates. With Iceberg or Delta Lake, migration means pointing a new engine at the same files and registering the table in a new catalog. It is a metadata operation rather than a data operation.

Iceberg was created at Netflix around 2017 and donated to the Apache Software Foundation, giving it multi-vendor governance. Delta Lake was created at Databricks, open-sourced in 2019, and donated to the Linux Foundation, with Databricks retaining primary development control. Both typically store data as Parquet files (Iceberg can also use ORC and Avro); the main difference is the metadata layer.

Here is the thing to carry forward: open formats solve storage lock-in. Your data is no longer trapped in a proprietary format. But they leave compute lock-in and governance lock-in as separate problems. “Open format” addresses one layer of lock-in, not all of them.

Why Are Open Table Formats a Competitive Weapon Rather Than a Gift from Vendors?

Databricks open-sourced Delta Lake in 2019 for a reason, and that reason was Snowflake. If customers could store data in an open format readable by any engine, Snowflake’s model (where all your data must live inside Snowflake) became a competitive disadvantage. Databricks made this move to weaken a competitor’s lock-in, not out of altruism.

Snowflake responded by adopting Apache Iceberg as its standard, then launched Apache Polaris, an Iceberg-native open-source catalog, and donated it to the Apache Software Foundation. The aim was to make Databricks’ governance layer the only remaining lock-in surface. The logic was straightforward: if the catalog is also open, Databricks loses its last proprietary advantage.

Then came Tabular. In June 2024, Databricks acquired the startup founded by Iceberg’s original creators for a reported $2 billion against roughly $1 million in ARR. The acquisition brought Iceberg expertise in-house to accelerate Delta-Iceberg interoperability. It was also defensive: it kept Snowflake from acquiring Tabular and from positioning Iceberg, rather than Delta Lake, as the lakehouse standard.

A vendor’s “open format” announcement is simultaneously genuine and strategic. The format is open-source. And it serves the vendor’s competitive interest. Buyers should interpret these announcements as competitive chess moves, not acts of charity.

The format war is settling. Roughly 78% of data professionals use Iceberg, every major cloud provider supports it natively, and format convergence tools are shipping. Both format creators have publicly stated the format choice should be irrelevant to enterprises. The strategic question has shifted to the catalog.

What Is the Catalog War Between Polaris and Unity Catalog?

A data catalog is the system of record that maps table names to metadata file locations, mediates access control, and handles concurrency. When a query engine wants to read or write, it asks the catalog two questions: where is this table, and am I allowed to do what I am about to do. The catalog is the chokepoint through which query engines must pass. Whoever controls it controls governance, engine compatibility, and semantic definitions.
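The three catalog duties described above (name lookup, access control, and concurrency) can be sketched in a few lines of Python. This is an illustrative toy under stated assumptions, not Polaris, Unity Catalog, or the Iceberg REST API: `ToyCatalog`, `CommitConflict`, the principals, and the metadata paths are all hypothetical.

```python
from dataclasses import dataclass, field


class CommitConflict(Exception):
    """Raised when another writer has already moved the table pointer."""


@dataclass
class ToyCatalog:
    # table name -> location of the table's current metadata file
    pointers: dict = field(default_factory=dict)
    # (principal, table name) -> set of allowed actions
    grants: dict = field(default_factory=dict)

    def _authorize(self, principal, table, action):
        # "Am I allowed to do what I am about to do?"
        if action not in self.grants.get((principal, table), set()):
            raise PermissionError(f"{principal} may not {action} {table}")

    def load_table(self, principal, table):
        # "Where is this table?"
        self._authorize(principal, table, "read")
        return self.pointers[table]

    def commit(self, principal, table, expected, new):
        # Concurrency: compare-and-swap on the metadata pointer, so a writer
        # holding a stale view of the table cannot overwrite a newer commit.
        self._authorize(principal, table, "write")
        if self.pointers.get(table) != expected:
            raise CommitConflict(f"{table} changed since {expected}")
        self.pointers[table] = new


v1 = "s3://bucket/orders/metadata/v1.json"
v2 = "s3://bucket/orders/metadata/v2.json"

catalog = ToyCatalog(
    pointers={"sales.orders": v1},
    grants={
        ("etl", "sales.orders"): {"read", "write"},
        ("analyst", "sales.orders"): {"read"},
    },
)
catalog.commit("etl", "sales.orders", expected=v1, new=v2)
current = catalog.load_table("analyst", "sales.orders")
```

Every read and write in the sketch goes through `_authorize` and the pointer map, which is why the party operating the catalog ends up holding the permissions and the commit history, regardless of which engine issues the query.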

Unity Catalog originated with Databricks and was open-sourced to the Linux Foundation in June 2024. It governs Delta Lake and Iceberg tables, plus ML models, functions, and AI tool catalogs, using Databricks-native access controls inherited from the platform’s identity layer. Its strategic value to Databricks is governance lock-in: even if your data is in open Iceberg format, your permissions, lineage, and AI governance live in a catalog closely integrated with the Databricks platform.

Apache Polaris originated with Snowflake and was donated to the Apache Software Foundation. It is an Iceberg-native open-source catalog implementing the Iceberg REST Catalog API specification with vendor-neutral governance. Polaris also supports Delta Lake tables, having reached general availability for format-neutral operation in early 2026, meaning it can serve as the catalog for both Iceberg and Delta Lake tables without requiring a separate catalog for each. Snowflake’s bet: if Polaris becomes the standard open catalog, Databricks’ governance lock-in dissolves.

The Iceberg REST Catalog API is the open standard attempting to prevent catalog lock-in. Any compliant engine can work with any compliant catalog. But security, credential vending, and semantic layers remain proprietary battlegrounds where vendors can differentiate and create dependency.

The winner of the catalog war determines whose platform becomes the default governance plane for all lakehouse deployments. The data catalog market is projected to reach $5-10 billion by 2032-2035. The catalog layer is where the next decade of vendor lock-in is being constructed, and the way AI governance layers are reshaping platform competition is already adding a new dimension to the catalog war.

If the catalog is the real battlefield, the format question still needs an answer. Here is why it is simpler than it looks.

Apache Iceberg vs Delta Lake: Which Open Table Format Should Enterprises Standardise On?

Standardise on either. The format decision is increasingly tactical. Both are production-hardened at massive scale, and the gaps are narrowing through format convergence.

Iceberg has broader native engine support: 30-plus engines including Spark, Trino, Flink, DuckDB, Snowflake, BigQuery, and Athena. It was designed from the ground up for engine neutrality, with features like hidden partitioning and partition evolution that work identically everywhere. Delta Lake was originally built with Spark as the primary engine, though Delta Kernel and Delta UniForm are closing the multi-engine gap.

On features, the comparison is nuanced. Delta Lake offers deletion vectors, column mapping, liquid clustering, and predictive optimisation through Databricks. Iceberg offers partition evolution, hidden partitioning, and the Iceberg REST Catalog API as an open governance standard. But field IDs from Iceberg have been adopted into Delta for schema evolution, and deletion vectors now use identical binary encodings across both formats. Format convergence is shipped code, not hand-waving.

The governance difference matters more. Iceberg is Apache-governed with multi-vendor input: Netflix, Apple, AWS, Snowflake, Databricks, and dozens more. Delta Lake is Linux Foundation-governed, but Databricks retains primary development control. If you prioritise long-term vendor neutrality, Apache governance is the stronger guarantee.

The practical recommendation is straightforward: standardise on either Iceberg or Delta Lake, but standardise. And ensure your catalog does not come from the same vendor as your compute engine. Mixing formats per table or team is complexity in disguise. Because format decisions shape every dimension of platform evaluation, the standardisation choice you make here will ripple through your cost architecture, AI strategy, and governance model.

Polaris vs Unity Catalog: Which Open-Source Catalog Should Govern Your Lakehouse?

The catalog decision carries more weight than the format decision. Whoever governs your data controls your exit options.

Apache Polaris is Apache-governed, Iceberg-native, and designed for vendor neutrality. It implements the Iceberg REST Catalog API as the reference implementation and supports both Iceberg and Delta Lake tables. Unity Catalog is Linux Foundation-governed with Databricks retaining primary control. It governs a broader scope (Iceberg, Delta Lake, Hive, ML models, and AI tool catalogs) but is most tightly integrated with the Databricks platform.

Both catalogs are open-source and can be self-hosted. Neither is fully platform-neutral in practice, because credential vending, identity integration, and semantic layers tie each catalog to its parent platform’s IAM and governance primitives.

The AI governance dimension is becoming the key differentiator. Unity Catalog already governs ML models and AI tools as first-class catalog objects. Polaris is catching up. For organisations building AI agents that need governed context (tool catalogs, model lineage, training data provenance), catalog AI governance capability is gaining weight.

The recommendation follows your platform strategy. If you are building a multi-engine Iceberg architecture with Snowflake as a primary engine, lean Polaris. If you are Databricks-native with Iceberg optionality, lean Unity Catalog. In both cases, make sure your contract guarantees that your catalog’s governance layer (permissions, lineage, and metadata definitions) can be exported in a machine-readable format if you change platforms. Vendors may support Iceberg while adding proprietary extensions that only work inside their own compute environment. That is where lock-in starts.

What Questions Should a Buyer Ask About Vendor Lock-In Before Signing a Data Platform Contract?

Here are five questions to bring into vendor negotiations, plus one meta-question that separates openness from marketing claims.

Are my tables stored in an open format readable by engines outside your platform, without proprietary extensions or required translation layers? If the answer requires a vendor-managed bridge service rather than direct engine access, the format is not open in practice.

Can I export the governance metadata from your catalog (table definitions, lineage records, and access policies) and load them into another vendor’s governance layer without manual reconstruction? Most vendors will say their catalog is open-source but cannot guarantee metadata portability because credential vending and identity federation are platform-specific. Demand this in writing.

Are my AI models, training data provenance, and agent tool catalogs governed by the same catalog, and can they be exported if I migrate platforms? As covered in the catalog comparison above, AI governance is the next surface where dependency builds quietly. If your models and agent context are inseparable from the platform’s governance layer, you have traded storage lock-in for a newer kind.

If I terminate my contract, does my data remain readable by open-source engines without your platform running? This is the “stop paying” test. It reveals whether the platform uses open formats or merely open-format-compatible storage that requires vendor services to function.

What is the exit cost (in time, engineering effort, and dollars) of migrating 100TB of governed data to another platform, including catalog metadata, permissions, and lineage? Require an estimate in writing. The answer reveals whether the vendor has designed for portability or for stickiness.

The meta-question: Which of your competitors’ engines can read my tables today, without your platform as an intermediary? If the answer is fewer than three engines, the openness claim is aspirational, not operational.
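The exit-cost question above lends itself to a back-of-the-envelope model. The sketch below is one hypothetical way to structure the estimate you ask a vendor to put in writing. Every number in it (egress rate, object count, hours per object, hourly rate, team size) is a placeholder assumption, not a benchmark or a quote from any vendor.

```python
from dataclasses import dataclass


@dataclass
class ExitScenario:
    """All inputs are hypothetical placeholders; replace them with real quotes."""
    data_tb: float                # volume of governed data
    rewrite_required: bool        # True if data must be exported and rewritten
    egress_usd_per_tb: float      # only incurred when data is rewritten
    governed_objects: int         # tables, policies, lineage records to re-create
    hours_per_object: float       # engineering effort per governed object
    engineer_usd_per_hour: float
    engineers: int

    def engineering_hours(self) -> float:
        return self.governed_objects * self.hours_per_object

    def transfer_usd(self) -> float:
        # With open formats in your own object storage the files stay put,
        # so the transfer term drops to zero.
        if not self.rewrite_required:
            return 0.0
        return self.data_tb * self.egress_usd_per_tb

    def total_usd(self) -> float:
        return self.transfer_usd() + self.engineering_hours() * self.engineer_usd_per_hour

    def calendar_weeks(self, hours_per_week: float = 40.0) -> float:
        return self.engineering_hours() / (self.engineers * hours_per_week)


# Scenario A: open format, governance metadata exports cleanly.
portable = ExitScenario(
    data_tb=100, rewrite_required=False, egress_usd_per_tb=20.0,
    governed_objects=400, hours_per_object=2.0,
    engineer_usd_per_hour=120.0, engineers=2,
)

# Scenario B: data must be rewritten and permissions rebuilt by hand.
sticky = ExitScenario(
    data_tb=100, rewrite_required=True, egress_usd_per_tb=20.0,
    governed_objects=400, hours_per_object=6.0,
    engineer_usd_per_hour=120.0, engineers=2,
)
```

Under these made-up inputs the engineering term dominates the data-transfer term in both scenarios, which is consistent with the article's argument that governance metadata, not storage, is where exit cost accumulates. Swap in your own figures before drawing any conclusion.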

Open table formats won the storage war. The catalog war is being fought right now, and its outcome will determine whether the next decade of data platform economics runs on interoperability or relabelled lock-in, a question made more urgent by the multi-cloud economics that make format portability matter.

The data catalog market is heading toward $5-10 billion, and that projection is not about metadata storage. It is about who controls the governance plane that every query, every permission check, and every AI agent context lookup must pass through. The five questions above capture a moving frontline. Revisit them each contract cycle: lock-in has direct dollar consequences in consumption-priced platforms, and each renewal is a chance to test whether your vendor’s grip is tightening or loosening.

Frequently Asked Questions

Is Apache Parquet itself an open table format, or do I need Iceberg or Delta Lake on top?

Parquet is an open file format for columnar data storage, not a table format. It stores data efficiently but has no concept of tables, transactions, schema history, or partitioning metadata. Iceberg and Delta Lake add the table abstraction on top of Parquet files: they track which files belong to a table, manage schema changes over time, provide ACID guarantees, and enable time travel queries. Without a table format, a directory of Parquet files is just files.

Where does Apache Hudi fit into the open table format landscape?

Apache Hudi is the third major open table format, created at Uber for low-latency streaming ingestion and upserts. It pioneered features like record-level indexing and incremental queries that Iceberg and Delta Lake later adopted. Hudi has strong adoption in streaming-heavy and CDC workloads, though its multi-engine support is narrower than Iceberg’s. The format convergence trend applies to Hudi too: Apache XTable translates between all three formats, and Hudi’s community participates in the shared metadata alignment efforts.

Can I use Apache Iceberg and Delta Lake together in the same organisation?

Yes, and many large organisations already do. Teams operating Databricks for ETL may write Delta Lake tables while analytics teams query Iceberg tables through Snowflake or Trino. The practical challenge is catalog sprawl: managing two sets of governance, lineage, and permissions across formats. Format convergence tools (Delta UniForm, Apache XTable) let you expose Delta tables as Iceberg-compatible metadata and vice versa, making multi-format estates increasingly manageable without duplication.

Do open table formats affect query performance compared to proprietary warehouse storage?

They can, though the gap is narrowing. Proprietary formats often bundle storage with compute optimisations (Databricks’ Photon engine, Snowflake’s micro-partitions) that open formats accessed through a generic engine cannot match. However, Iceberg and Delta Lake support file statistics, bloom filters, data skipping, and compaction strategies that bring open-format performance close to proprietary systems. The trade-off is portability versus a moderate performance premium on vendor-optimised query paths.

What happens to my data if the vendor behind my chosen format abandons it?

Your data remains safe because Iceberg and Delta Lake store data as standard Parquet files in your own object storage, not in a vendor-controlled format. Any tool that reads Parquet can access the raw data. The risk is to the metadata layer: if a format’s development stagnates, you may lose access to newer engine integrations and query optimisations. Format convergence tools provide an escape path by translating metadata between formats without rewriting data.

Do I still need a data catalog if I am only using a single query engine?

Technically no: a single engine can read Iceberg or Delta Lake tables directly from object storage using the Hadoop catalog or a file-path reference. But you will quickly need one. Without a catalog, you manage table locations manually, cannot enforce access controls, and lose lineage tracking. Even single-engine deployments benefit from a catalog for governance, discovery, and the optionality to add engines later without retrofitting an access layer.

Is the Iceberg REST Catalog API mature enough for production deployments?

Yes. The Iceberg REST Catalog API reached 1.0 in 2024 and has production implementations from Snowflake (Polaris), AWS (Glue), Dremio, Tabular, and several open-source projects. It is the standard interface for engine-to-catalog communication in the Iceberg ecosystem. The maturity concern is less about the API specification itself and more about whether each vendor’s implementation handles your scale of credential vending, concurrency, and multi-region failover reliably.

Can open table formats handle streaming and real-time data, or are they batch-only?

They handle both, though differently. Iceberg supports streaming writes via Flink and Spark Structured Streaming, with commit-based atomicity suited to micro-batch ingestion (sub-minute latency). Delta Lake supports streaming reads and writes through the same engines plus Databricks’ proprietary optimisations. Neither format targets sub-second latency the way Apache Kafka or Apache Hudi’s record-level upserts do, but both are production-proven for near-real-time lakehouse pipelines where latency tolerances are in the seconds-to-minutes range.

What does Databricks owning Tabular actually mean for Apache Iceberg’s independence?

Tabular’s acquisition puts the creators of Iceberg inside Databricks, which accelerates format convergence but raises concerns about Iceberg’s multi-vendor neutrality. The Apache Software Foundation governance still requires consensus across many contributors (Netflix, Apple, AWS, Snowflake, and more), so Databricks cannot unilaterally steer the project. The practical impact is that Delta-Iceberg interoperability will improve faster, while Iceberg’s roadmap may increasingly reflect Databricks’ platform priorities alongside community input.

How do open table formats handle data deletion for compliance with regulations like GDPR?

Iceberg and Delta Lake both support row-level deletes through delete markers (soft deletes) and file compaction (physical removal). Iceberg uses delete files that track removed rows without rewriting data files; Delta Lake uses deletion vectors that serve the same purpose. For GDPR right-to-erasure requests, you can issue point deletes against specific rows, then run compaction to physically remove the data. Both formats support time travel, which means deleted data may persist in historical snapshots until those snapshots expire per retention policy.
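The erasure sequence described above (soft delete, compaction, snapshot expiry) is easier to follow as a toy model. The Python sketch below is a simplified illustration, not the Iceberg delete-file or Delta Lake deletion-vector implementation: `ToyTable`, its method names, and the file names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    files: tuple                      # data files visible in this version
    deleted: frozenset = frozenset()  # (file, row position) pairs masked out


@dataclass
class ToyTable:
    storage: dict = field(default_factory=dict)   # file name -> rows (stands in for object storage)
    snapshots: list = field(default_factory=list)

    def _current(self):
        return self.snapshots[-1] if self.snapshots else Snapshot(files=())

    def append(self, name, rows):
        self.storage[name] = list(rows)
        cur = self._current()
        self.snapshots.append(Snapshot(cur.files + (name,), cur.deleted))

    def delete_where(self, predicate):
        # Soft delete: record which positions to mask; no data file is rewritten.
        cur = self._current()
        marks = {(f, i) for f in cur.files
                 for i, row in enumerate(self.storage[f]) if predicate(row)}
        self.snapshots.append(Snapshot(cur.files, cur.deleted | marks))

    def scan(self, version=-1):
        snap = self.snapshots[version]
        return [row for f in snap.files
                for i, row in enumerate(self.storage[f])
                if (f, i) not in snap.deleted]

    def compact(self, new_name):
        # Rewrite only live rows into a fresh file. Old files stay in storage
        # because earlier snapshots still reference them.
        self.storage[new_name] = self.scan()
        self.snapshots.append(Snapshot((new_name,)))

    def expire_snapshots(self, keep_last=1):
        # Physical removal happens here: drop old versions, then delete any
        # file that no retained snapshot references.
        if keep_last < 1:
            raise ValueError("keep_last must be at least 1")
        self.snapshots = self.snapshots[-keep_last:]
        referenced = {f for s in self.snapshots for f in s.files}
        for name in list(self.storage):
            if name not in referenced:
                del self.storage[name]


table = ToyTable()
table.append("f1.parquet", [
    {"user_id": 42, "email": "a@example.com"},
    {"user_id": 7, "email": "b@example.com"},
])
table.delete_where(lambda row: row["user_id"] == 42)  # right-to-erasure request
table.compact("f2.parquet")

rows_via_time_travel = table.scan(version=0)    # the erased user is still here
old_file_present = "f1.parquet" in table.storage

table.expire_snapshots(keep_last=1)
old_file_gone = "f1.parquet" not in table.storage
```

The point of the sketch is the ordering: after the delete and the compaction, the erased row is hidden from current queries but still reachable through the first snapshot and still present in the old file. Only the expiry step removes it from storage, so for erasure requests the snapshot retention window matters as much as the delete itself.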
