Table Format Maintenance & Operations Last updated: May 29, 2026

Iceberg Sort-Based Compaction

A compaction strategy in Apache Iceberg that combines small data files and sorts the records globally by specified columns to improve query data skipping.

sort based compaction, iceberg sort compaction, global sort order compaction

Iceberg Sort-Based Compaction is a table optimization strategy that combines small data files while sorting the records globally by specified columns. Unlike bin-packing, which only merges files without reorganizing their rows, sort-based compaction restructures the data layout, grouping similar column values together within the resulting files. This layout enhances the effectiveness of min/max statistics for data skipping, speeding up range queries.
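In Spark, sort-based compaction is typically triggered through Iceberg's `rewrite_data_files` procedure with `strategy => 'sort'`. The sketch below only builds the SQL text for such a call. The helper name `build_sort_compaction_call` and the catalog, table, and column names in the usage note are hypothetical placeholders, not names taken from this article.

```python
def build_sort_compaction_call(catalog, table, sort_order, target_file_size_bytes=None):
    """Return a Spark SQL CALL statement for Iceberg's rewrite_data_files procedure.

    catalog, table, and sort_order are caller-supplied placeholders; this helper
    only formats the statement and does not contact Spark or Iceberg.
    """
    args = [
        f"table => '{table}'",
        "strategy => 'sort'",             # request sort-based compaction
        f"sort_order => '{sort_order}'",  # e.g. "event_time ASC NULLS LAST"
    ]
    if target_file_size_bytes is not None:
        # Optional: desired size of the rewritten files, passed as a string option.
        args.append(
            f"options => map('target-file-size-bytes', '{target_file_size_bytes}')"
        )
    return f"CALL {catalog}.system.rewrite_data_files({', '.join(args)})"
```

Assuming a Spark session already configured with an Iceberg catalog named `my_catalog` (an assumption), the statement could be run with `spark.sql(build_sort_compaction_call("my_catalog", "db.events", "event_time ASC NULLS LAST"))`.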

How Sort Compaction Optimizes Queries

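When query engines plan a scan of an Iceberg table, they read the lower and upper bounds of each column stored in the manifest entries and skip any file whose bounds cannot match the query filter. If a column's values are randomly distributed across files, nearly every file's min/max range spans most of the value domain and overlaps the filter, forcing the engine to read every file.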

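Sort-based compaction resolves this overlap: after a global sort, each rewritten file covers a narrow, largely non-overlapping range of the sort column, so a filter on that column matches the bounds of only a few files and the rest are skipped.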
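The effect on min/max pruning can be illustrated with a small simulation. This is a toy model, not Iceberg code: "files" are fixed-size chunks of an integer column, and the data size, file size, and filter range are arbitrary assumptions chosen for illustration.

```python
import random


def file_bounds(values, rows_per_file):
    """Split values into consecutive 'files' and return each file's (min, max)."""
    chunks = (values[i:i + rows_per_file] for i in range(0, len(values), rows_per_file))
    return [(min(chunk), max(chunk)) for chunk in chunks]


def files_to_read(bounds, lo, hi):
    """Count files whose [min, max] range intersects the filter lo <= x <= hi."""
    return sum(1 for fmin, fmax in bounds if fmax >= lo and fmin <= hi)


random.seed(0)
values = list(range(10_000))
random.shuffle(values)  # randomly distributed column values

unsorted_bounds = file_bounds(values, 1_000)         # layout before sorting
sorted_bounds = file_bounds(sorted(values), 1_000)   # layout after a global sort

# Filter: 2500 <= x <= 2600
unsorted_hits = files_to_read(unsorted_bounds, 2_500, 2_600)
sorted_hits = files_to_read(sorted_bounds, 2_500, 2_600)

print(f"files read before sort: {unsorted_hits} of {len(unsorted_bounds)}")
print(f"files read after sort:  {sorted_hits} of {len(sorted_bounds)}")
```

In the sorted layout the filter range falls entirely inside one file's bounds, so only that file needs to be read. In the shuffled layout each file's bounds cover almost the whole value domain, so few or no files can be skipped.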

Resource Trade-Offs

Because sorting requires a global shuffle of data across executors, sort-based compaction is more resource-intensive than bin-packing, which merges files without reordering their rows.
