Iceberg Orphan Files Penalty
The Iceberg Orphan Files Penalty refers to the storage cost and operational overhead that accumulates when unreferenced physical files are left behind in a table's storage directory. Because Apache Iceberg defines table state strictly through metadata pointers, any file written during a failed transaction, aborted compaction, or crashed stream remains in storage but is not tracked by the catalog.
Without maintenance, these orphan files remain in the storage bucket indefinitely, incurring ongoing costs and degrading operational tooling performance.
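Conceptually, an orphan is any file that appears in a storage listing but is not reachable from the table's metadata. The sketch below illustrates that set difference in plain Python. The function name, the input shapes, and the three-day minimum age are assumptions made for illustration; the age guard is there so that files from a write that has not committed yet are not misclassified.

```python
from datetime import datetime, timedelta, timezone


def find_orphans(physical_files, referenced_paths, now, min_age=timedelta(days=3)):
    """Return paths present in storage but absent from table metadata.

    physical_files:   dict of path -> last-modified datetime (from a storage listing)
    referenced_paths: set of paths reachable from the table's metadata
    min_age:          assumed safety margin; newer files may belong to an in-flight write
    """
    cutoff = now - min_age
    return sorted(
        path
        for path, modified in physical_files.items()
        if path not in referenced_paths and modified < cutoff
    )


# Hypothetical listing: one tracked file, one old untracked file, one recent untracked file.
now = datetime(2024, 1, 10, tzinfo=timezone.utc)
physical = {
    "data/a.parquet": datetime(2024, 1, 1, tzinfo=timezone.utc),
    "data/b.parquet": datetime(2024, 1, 2, tzinfo=timezone.utc),
    "data/c.parquet": datetime(2024, 1, 9, 12, tzinfo=timezone.utc),
}
referenced = {"data/a.parquet"}

orphans = find_orphans(physical, referenced, now)
print(orphans)  # ['data/b.parquet']
```

Here data/c.parquet is untracked but too recent to be treated as an orphan, so only data/b.parquet is reported.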
Causes of Orphan Files
Orphan files are commonly generated by the following events:
- Failed Write Commits: A Spark or Flink job successfully writes Parquet data files to S3, but the final catalog commit fails or times out. The data files already written are orphaned.
- Aborted Compaction Jobs: If a file rewrite job crashes midway, the partially written consolidated files are abandoned in storage.
- Improper Deletions: Manually deleting table files directly from storage instead of using the Iceberg API can leave behind files that no remaining metadata references.
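The failed-commit case can be illustrated with a toy model of the write-then-commit ordering. Everything here (ToyCatalog, write_and_commit, the file names) is hypothetical and is not an Iceberg API. It only shows that files written in the first step stay in storage when the second step fails.

```python
class ToyCatalog:
    """Hypothetical in-memory stand-in for a catalog; not an Iceberg API."""

    def __init__(self):
        self.referenced = set()

    def commit(self, paths, fail=False):
        if fail:
            raise TimeoutError("catalog commit timed out")
        self.referenced.update(paths)


def write_and_commit(storage, catalog, paths, fail_commit=False):
    # Step 1: data files land in storage before the commit is attempted.
    storage.update(paths)
    # Step 2: the commit makes them part of the table. If it fails,
    # this toy model performs no cleanup, so the files stay behind.
    try:
        catalog.commit(paths, fail=fail_commit)
        return True
    except TimeoutError:
        return False


storage = set()
catalog = ToyCatalog()

write_and_commit(storage, catalog, {"data/ok-1.parquet"})
write_and_commit(
    storage, catalog, {"data/lost-1.parquet", "data/lost-2.parquet"}, fail_commit=True
)

orphaned = sorted(storage - catalog.referenced)
print(orphaned)  # ['data/lost-1.parquet', 'data/lost-2.parquet']
```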
The Penalties Incurred
- Storage Costs: Since data files can be large, hundreds of failed commits can write terabytes of untracked data, inflating monthly cloud storage bills.
- Tooling Degradation: Backup utilities, replication scripts, or security scanners that traverse the physical directory must process millions of unreferenced files, which slows their run times.
- Audit Discrepancies: Physical directory size scans will show different storage numbers than those reported by the .files metadata table, complicating audit reports.
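The storage and audit penalties reduce to simple arithmetic once both totals are known. The sketch below compares a physical byte count with a tracked byte count and prices the difference. The function name, the byte totals, and the price of 0.023 per GiB-month are placeholder assumptions, not figures from any provider or from this article. The gap is best read as an upper bound on orphan size, since the physical total also includes metadata files and data files referenced only by older snapshots.

```python
def audit_gap(physical_bytes, tracked_bytes, price_per_gb_month):
    """Return (untracked bytes, estimated monthly cost of those bytes)."""
    gap_bytes = physical_bytes - tracked_bytes
    gap_gib = gap_bytes / 1024**3
    return gap_bytes, gap_gib * price_per_gb_month


# Hypothetical totals: 12 TiB on disk versus 10 TiB reported by table metadata.
physical_bytes = 12 * 1024**4
tracked_bytes = 10 * 1024**4

gap_bytes, monthly_cost = audit_gap(
    physical_bytes, tracked_bytes, price_per_gb_month=0.023  # assumed price
)
print(gap_bytes // 1024**3, "GiB untracked")
print(round(monthly_cost, 2), "per month at the assumed price")
```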
Data teams avoid this penalty by scheduling the remove_orphan_files procedure weekly to purge unreferenced files from storage, with an age threshold long enough that files from in-progress writes are not deleted.
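As a rough sketch of what such a scheduled job might issue, the helper below builds a Spark SQL CALL statement for remove_orphan_files using the table, older_than, and dry_run arguments. The helper name, the catalog name my_catalog, the table db.events, and the seven-day threshold are assumptions for illustration. Defaulting dry_run to true lists candidate files without deleting them, which seems a sensible first step before a real purge.

```python
from datetime import datetime, timedelta, timezone


def remove_orphans_sql(catalog, table, older_than, dry_run=True):
    """Build a CALL statement for Iceberg's remove_orphan_files Spark procedure."""
    ts = older_than.strftime("%Y-%m-%d %H:%M:%S.000")
    return (
        f"CALL {catalog}.system.remove_orphan_files("
        f"table => '{table}', "
        f"older_than => TIMESTAMP '{ts}', "
        f"dry_run => {'true' if dry_run else 'false'})"
    )


# Hypothetical catalog and table names; only files older than 7 days are considered.
now = datetime(2024, 1, 10, tzinfo=timezone.utc)
stmt = remove_orphans_sql("my_catalog", "db.events", now - timedelta(days=7))
print(stmt)

# With a Spark session configured for Iceberg, the statement could be run as:
#   spark.sql(stmt).show(truncate=False)
```

Once the dry-run output looks right, the same statement with dry_run=False would perform the deletion.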