feat: delete orphaned files #1958

jayceslesar · 2025-04-29T22:42:05Z

Closes #1200

Rationale for this change

Ability to do more table maintenance from pyiceberg (iceberg-python?)

Are these changes tested?

Added a test!

Are there any user-facing changes?

Yes, this is a new method on the Table class.

pyiceberg/table/__init__.py

jayceslesar · 2025-04-29T22:46:43Z

pyiceberg/table/inspect.py

-    def all_manifests(self) -> "pa.Table":
+    def all_manifests(self, snapshots: Optional[list[Snapshot]] = None) -> "pa.Table":
        import pyarrow as pa

-        snapshots = self.tbl.snapshots()
+        snapshots = snapshots or self.tbl.snapshots()
        if not snapshots:


Another case of me treating snapshots and snapshot_ids the same... happy to enforce this being snapshot_ids instead

Let's save that for another PR. I don't think we can just change this API since folks might be using it. We could allow for a Union[list[Snapshot], Iterable[int]]?

Modified this, let me know what you think
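
A minimal sketch of how both input forms could be accepted, by normalizing everything to snapshot IDs up front. `Snapshot` here is a stand-in dataclass rather than pyiceberg's class, and `to_snapshot_ids` is a hypothetical helper name, not something from this PR.

```python
from dataclasses import dataclass
from typing import Iterable, List, Union


@dataclass(frozen=True)
class Snapshot:
    """Stand-in for illustration; carries only the field used below."""

    snapshot_id: int


def to_snapshot_ids(snapshots: Iterable[Union[Snapshot, int]]) -> List[int]:
    """Accept Snapshot objects or raw IDs and return plain IDs in input order."""
    return [s.snapshot_id if isinstance(s, Snapshot) else s for s in snapshots]


ids = to_snapshot_ids([Snapshot(1), 2, Snapshot(3)])
```

Normalizing once at the boundary would let the rest of the method work with a single type, which keeps the existing call sites working.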

jayceslesar · 2025-04-29T22:50:32Z

pyiceberg/table/__init__.py

+        if orphaned_files:
+            deletes = executor.map(self.io.delete, orphaned_files)


Unsure if this should be a new executor, but it looks like it's a singleton so it shouldn't matter.

This looks fine, we can just re-use the executor 👍

When one of the deletes would throw an error (maybe some other process had already cleaned up the file), then the whole execution would terminate. Should we add a try block to swallow any related exception? Would be good to also add a test for this 👍

This has been done and a test was added.
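
As a sketch of the pattern discussed in this thread (not the code that was merged), a small wrapper can swallow per-file errors so one failed delete does not abort the rest. `flaky_delete` is a hypothetical stand-in for `self.io.delete`, and `safe_delete` is an assumed name.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List

deleted: List[str] = []


def flaky_delete(path: str) -> None:
    """Hypothetical stand-in for self.io.delete; fails for a file that is already gone."""
    if path.endswith("gone.parquet"):
        raise FileNotFoundError(path)
    deleted.append(path)


def safe_delete(path: str) -> None:
    """Swallow per-file errors so one failure does not abort the whole run."""
    try:
        flaky_delete(path)
    except Exception as exc:
        print(f"Failed to delete {path}: {exc}")


paths = ["a.parquet", "gone.parquet", "b.parquet"]
with ThreadPoolExecutor(max_workers=2) as executor:
    # Exceptions from map() surface only when results are consumed,
    # so the iterator is exhausted explicitly.
    list(executor.map(safe_delete, paths))
```

Without the wrapper, iterating the results of `executor.map` would re-raise the first `FileNotFoundError` and the caller would never see the remaining results.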

Fokko

Thanks for working on this @jayceslesar, sorry for the late review.

I think this is a great start, I left some comments, let me know what you think!

Fokko · 2025-05-02T05:31:26Z

pyiceberg/table/__init__.py

+
+        location = self.location()
+
+        all_known_files = []


Why not make this a set right away?

Suggested change

-        all_known_files = []
+        all_known_files = set()

Fokko · 2025-05-02T05:43:44Z

pyiceberg/table/__init__.py

+
+        from pyiceberg.io.pyarrow import _fs_from_file_path
+
+        location = self.location()


Nit, should we move this variable assignment downward, where we start using it?

done (also this was refactored up a little bit)

Fokko · 2025-05-02T05:53:47Z

pyiceberg/table/__init__.py

+        files_by_snapshots: Iterator["pa.Table"] = executor.map(lambda snapshot_id: self.inspect.files(snapshot_id), snapshot_ids)
+        all_known_files.extend(pa.concat_tables(files_by_snapshots)["file_path"].to_pylist())


How about just returning a set of paths? This way we can nicely union all of them into a set:

Suggested change

-        files_by_snapshots: Iterator["pa.Table"] = executor.map(lambda snapshot_id: self.inspect.files(snapshot_id), snapshot_ids)
-        all_known_files.extend(pa.concat_tables(files_by_snapshots)["file_path"].to_pylist())
+        files_by_snapshots: Iterator[Set[str]] = executor.map(lambda snapshot_id: set(self.inspect.files(snapshot_id)["file_path"].to_pylist()), snapshot_ids)
+        datafile_paths = reduce(set.union, files_by_snapshots)
+        all_known_files.extend(datafile_paths)

There will probably be quite a bit of overlap between the snapshots in terms of data files
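
To illustrate the overlap point, here is a small sketch that unions per-snapshot path sets. The `FILES_BY_SNAPSHOT` mapping and `paths_for` are hypothetical stand-ins for calling `inspect.files(snapshot_id)`; only the `reduce(set.union, ...)` idea comes from the suggestion above.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
from typing import Dict, List, Set

# Hypothetical data: snapshot ID -> data file paths visible in that snapshot.
FILES_BY_SNAPSHOT: Dict[int, List[str]] = {
    1: ["data/a.parquet"],
    2: ["data/a.parquet", "data/b.parquet"],
    3: ["data/a.parquet", "data/b.parquet", "data/c.parquet"],
}


def paths_for(snapshot_id: int) -> Set[str]:
    """Return the set of data file paths for one snapshot."""
    return set(FILES_BY_SNAPSHOT[snapshot_id])


with ThreadPoolExecutor() as executor:
    per_snapshot = executor.map(paths_for, FILES_BY_SNAPSHOT)
    # The set() initializer keeps reduce from raising on an empty iterator.
    datafile_paths: Set[str] = reduce(set.union, per_snapshot, set())
```

Six path entries across three snapshots collapse to three unique paths, which is why working with sets from the start avoids carrying duplicates around.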

Fokko · 2025-05-02T05:54:25Z

pyiceberg/table/__init__.py

+        if orphaned_files:
+            deletes = executor.map(self.io.delete, orphaned_files)


This looks fine, we can just re-use the executor 👍

Fokko · 2025-05-02T05:55:59Z

pyiceberg/table/__init__.py

+        if orphaned_files:
+            deletes = executor.map(self.io.delete, orphaned_files)


When one of the deletes would throw an error (maybe some other process had already cleaned up the file), then the whole execution would terminate. Should we add a try block to swallow any related exception? Would be good to also add a test for this 👍

Fokko · 2025-05-02T08:16:54Z

pyiceberg/table/__init__.py

@@ -1371,6 +1375,45 @@ def to_polars(self) -> pl.LazyFrame:

        return pl.scan_iceberg(self)

+    def delete_orphaned_files(self) -> None:


I think it would be good to add some options that we also have on the Java side, at a minimum:

older_than: Remove orphan files created before this timestamp (Defaults to 3 days). It can be that some process is writing to the table, and has some files staged to be added to the metadata tree. If we don't take this into account, it might be that these files are removed in the period between writing and committing.

dry_run: When true, don't actually remove files (defaults to false). I think it would be nice to return the set of files that were removed:

Suggested change

-    def delete_orphaned_files(self) -> None:
+    def delete_orphaned_files(self) -> Set[str]:

Is there a reason that older_than is not a table property?
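
A rough sketch of how `older_than` and `dry_run` could interact, using the local filesystem instead of a table's FileIO. `remove_orphans` and its `known` parameter are hypothetical names for illustration; only the option names and the 3-day default come from the comment above.

```python
import os
import tempfile
import time
from datetime import datetime, timedelta, timezone
from pathlib import Path
from typing import Set


def remove_orphans(
    root: Path, known: Set[str], older_than: timedelta = timedelta(days=3), dry_run: bool = False
) -> Set[str]:
    """Return unreferenced files older than the cutoff; delete them unless dry_run."""
    cutoff = (datetime.now(timezone.utc) - older_than).timestamp()
    orphans = {
        str(p)
        for p in root.rglob("*")
        if p.is_file() and str(p) not in known and p.stat().st_mtime < cutoff
    }
    if not dry_run:
        for path in orphans:
            os.remove(path)
    return orphans


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    kept, stale, fresh = root / "kept.parquet", root / "stale.parquet", root / "fresh.parquet"
    for f in (kept, stale, fresh):
        f.write_text("x")
    # Backdate two files by a week; "fresh" simulates a file staged by a writer.
    week_ago = time.time() - 7 * 24 * 3600
    os.utime(kept, (week_ago, week_ago))
    os.utime(stale, (week_ago, week_ago))

    planned = remove_orphans(root, known={str(kept)}, dry_run=True)
    stale_exists_after_dry_run = stale.exists()
    removed = remove_orphans(root, known={str(kept)})
    remaining = sorted(p.name for p in root.iterdir())
```

The unreferenced but recent file survives because it is newer than the cutoff, which is the protection for in-flight writes that the comment describes.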

Fokko · 2025-05-02T08:20:34Z

pyiceberg/table/inspect.py

-    def all_manifests(self) -> "pa.Table":
+    def all_manifests(self, snapshots: Optional[list[Snapshot]] = None) -> "pa.Table":
        import pyarrow as pa

-        snapshots = self.tbl.snapshots()
+        snapshots = snapshots or self.tbl.snapshots()
        if not snapshots:


Let's save that for another PR. I don't think we can just change this API since folks might be using it. We could allow for a Union[list[Snapshot], Iterable[int]]?

smaheshwar-pltr

Thanks for the PR @jayceslesar, using InspectTable to get orphaned files to submit to the executor pool is a nice idea! Just some concerns / suggestions / debugging help 😄

smaheshwar-pltr · 2025-05-03T16:43:37Z

pyiceberg/table/inspect.py

+        files_by_snapshots: Iterator[Set[str]] = executor.map(
+            lambda snapshot_id: set(self.files(snapshot_id)["file_path"].to_pylist())
+        )
+        datafile_paths: set[str] = reduce(set.union, files_by_snapshots, set())


Won't this always be empty? I don't see any Iterable submitted to the executor pool above

fixed, lost this in a little refactor
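
For reference, a tiny standalone demonstration of the behaviour flagged here: `Executor.map` called without an iterable submits nothing and yields nothing. The `snapshot_ids` list is made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor

snapshot_ids = [1, 2, 3]

with ThreadPoolExecutor() as executor:
    # No iterable passed: nothing is submitted, so the result is empty.
    without_iterable = list(executor.map(lambda snapshot_id: {snapshot_id}))
    # Iterable passed: the function runs once per snapshot ID, in order.
    with_iterable = list(executor.map(lambda snapshot_id: {snapshot_id}, snapshot_ids))
```

No error is raised in the first case, which is what makes this kind of slip easy to miss without a test.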

smaheshwar-pltr · 2025-05-03T16:44:25Z

pyiceberg/table/inspect.py

+
+        from pyiceberg.io.pyarrow import _fs_from_file_path
+
+        all_known_files = set()


We also want to have manifest list files here (I don't see them now). Otherwise, they'll be removed by the procedure and the table will be "corrupted".

(Related: when looking at Java tests, I noticed apache/iceberg#12957)

The same goes for the current metadata JSON file, and I think to match Java behaviour we want to include all files in the metadata log of the current metadata file too.

I think there are more files we might be missing - I think tests would be nice to make sure we're not missing something! (Perhaps inspiration can be taken from the Java ones)

I see! I just pushed a change that will capture those, as well as the statistic file paths
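
A simplified sketch of the metadata-side files listed in this thread: the current metadata JSON, the metadata log, manifest lists, and statistics files. The stub classes and field names below are hypothetical and are not pyiceberg's `TableMetadata`; manifest files and data files would still need to be added on top.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class SnapshotStub:
    """Hypothetical stand-in holding only the manifest list location."""

    manifest_list: str


@dataclass
class TableMetadataStub:
    """Hypothetical, simplified stand-in for table metadata."""

    current_metadata_file: str
    metadata_log: List[str] = field(default_factory=list)
    snapshots: List[SnapshotStub] = field(default_factory=list)
    statistics_paths: List[str] = field(default_factory=list)


def referenced_metadata_files(meta: TableMetadataStub) -> Set[str]:
    """Collect metadata-side files that must never be treated as orphans."""
    files = {meta.current_metadata_file}
    files.update(meta.metadata_log)
    files.update(s.manifest_list for s in meta.snapshots)
    files.update(meta.statistics_paths)
    return files


meta = TableMetadataStub(
    current_metadata_file="metadata/v2.metadata.json",
    metadata_log=["metadata/v1.metadata.json"],
    snapshots=[SnapshotStub("metadata/snap-1.avro"), SnapshotStub("metadata/snap-2.avro")],
    statistics_paths=["metadata/stats-2.puffin"],
)
known = referenced_metadata_files(meta)
```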

smaheshwar-pltr · 2025-05-03T17:00:38Z

pyiceberg/table/inspect.py

+        as_of = datetime.now(timezone.utc) - older_than if older_than else None
+        all_files = [f for f in fs.get_file_info(selector) if f.type == FileType.File and (as_of is None or (f.mtime < as_of))]
+
+        orphaned_files = set(all_files).difference(all_known_files)


I think we need to be careful here. all_files is a list of these FileInfo objects I think but all_known_files is a set of strs. So the set difference here won't do anything because a FileInfo object won't be in a str set.

ah good catch this happened in a little refactor, just need to call f.path
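
The type mismatch can be reproduced in isolation. `FileInfo` below is a stand-in dataclass with only a `path` attribute, not the pyarrow class, and the file names are made up.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class FileInfo:
    """Stand-in for illustration with just a path attribute."""

    path: str


all_files: List[FileInfo] = [FileInfo("data/a.parquet"), FileInfo("data/orphan.parquet")]
all_known_files: Set[str] = {"data/a.parquet"}

# Comparing objects against strings removes nothing: every file looks orphaned.
buggy = set(all_files).difference(all_known_files)

# Comparing paths against paths gives the intended result.
orphaned = {f.path for f in all_files}.difference(all_known_files)
```

In the buggy form a referenced data file ends up in the orphan set, so this is the kind of mistake that would delete live table data.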

smaheshwar-pltr · 2025-05-03T17:06:10Z

pyiceberg/table/inspect.py

+
+        from pyiceberg.io.pyarrow import _fs_from_file_path
+
+        all_known_files = set()


Part of me wonders whether we could expose this as a method: a public, documented inspect utility that returns all files referenced by a table. Curious what others think about whether this would be useful, I'm not fully convinced myself. (We could also then restructure orphaned file detection to use that)

I think it would likely make things simpler, inspect could use a little beefing up IMO, I came across #1626 which is a good start

Yeah, I am going to play around with this. It makes testing a lot easier

Okay, let me know what you think about the change I just pushed -- see all_known_files. @Fokko vis as well -- this should make testing a lot easier (if I have both of your blessings here I will add tests for this function) and allow us to modify smarter going forward

kevinjqliu

Thanks for the PR! I added a few comments. ptal :)

kevinjqliu · 2025-05-04T00:40:39Z

pyiceberg/table/__init__.py

+                # exhaust
+                list(deletes)
+                logger.info(f"Deleted {len(orphaned_files)} orphaned files at {location}!")
+


nit: log an else case

kevinjqliu · 2025-05-04T00:43:20Z

pyiceberg/table/__init__.py

+                deletes = executor.map(_delete, orphaned_files)
+                # exhaust
+                list(deletes)
+                logger.info(f"Deleted {len(orphaned_files)} orphaned files at {location}!")


nit: this might not necessarily always be true, especially when _delete errors are suppressed.

What if we count the number of successful deletes here? Maybe _delete can return True/False for whether the delete was successful.

the spark procedure outputs the orphan_file_location which are all the files set to be deleted. this is pretty useful for logging
https://iceberg.apache.org/docs/nightly/spark-procedures/#output_7
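
One possible shape for this suggestion, as a sketch only: have the delete helper return a boolean and derive the log message from the results. `try_delete` and the file names are hypothetical; the failure is simulated rather than coming from a real FileIO.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Set


def try_delete(path: str) -> bool:
    """Hypothetical delete that reports success instead of raising."""
    try:
        if "missing" in path:
            # Simulate a file that another process already removed.
            raise FileNotFoundError(path)
        return True
    except OSError:
        return False


orphaned_files = ["data/x.parquet", "data/missing.parquet", "data/y.parquet"]
with ThreadPoolExecutor() as executor:
    # map() yields results in input order, so they can be zipped with the paths.
    results = list(executor.map(try_delete, orphaned_files))

deleted: Set[str] = {p for p, ok in zip(orphaned_files, results) if ok}
failed: Set[str] = set(orphaned_files) - deleted
summary = f"Deleted {len(deleted)} of {len(orphaned_files)} orphaned files"
```

Keeping both sets around would also make it straightforward to return or log the affected paths, similar to the Spark procedure output mentioned above.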

kevinjqliu · 2025-05-04T00:45:42Z

pyiceberg/table/inspect.py

+
+    def orphaned_files(self, location: str, older_than: Optional[timedelta] = timedelta(days=3)) -> Set[str]:
+        """Get all the orphaned files in the table.
+


nit: add a sentence explaining what orphaned files mean, maybe copy/paste from https://iceberg.apache.org/docs/nightly/spark-procedures/#remove_orphan_files

kevinjqliu · 2025-05-04T00:49:17Z

pyiceberg/table/__init__.py

@@ -1371,6 +1376,28 @@ def to_polars(self) -> pl.LazyFrame:

        return pl.scan_iceberg(self)

+    def delete_orphaned_files(self, older_than: Optional[timedelta] = timedelta(days=3), dry_run: bool = False) -> None:


nit: we should always provide an older_than arg. This protects the orphan file deletion job from deleting recently created files that are currently waiting to be committed.

kevinjqliu · 2025-05-04T00:51:06Z

pyiceberg/table/inspect.py

+
+        return _all_known_files
+
+    def orphaned_files(self, location: str, older_than: Optional[timedelta] = timedelta(days=3)) -> Set[str]:


nit: should we expose this as a public function given that there's no equivalent from java/spark side? we modeled the inspect tables based on java's metadata tables.
maybe we can change this to _orphaned_files for now

kevinjqliu · 2025-05-04T00:51:33Z

pyiceberg/table/inspect.py

+        _, _, path = _parse_location(location)
+        selector = FileSelector(path, recursive=True)
+        # filter to just files as it may return directories, and filter on time
+        as_of = datetime.now(timezone.utc) - older_than if older_than else None


older_than should always be present, see the above comment

kevinjqliu · 2025-05-04T01:21:41Z

A meta question: wdyt of moving the orphan file function to its own file/namespace, similar to how we use .inspect?

i like the idea of having all the table maintenance functions together, similar to delta table's optimize

jayceslesar and others added 3 commits April 29, 2025 16:58

feat: delete orphaned files

9dcb580

simpler and a test

e43505c

remove

eed5ea8

Fokko reviewed May 2, 2025

jayceslesar added 3 commits May 2, 2025 17:22

updates from review!

8cca600

include dry run and older than

75b1240

add case for dry run

6379480

smaheshwar-pltr suggested changes May 3, 2025

jayceslesar added 7 commits May 3, 2025 14:16

use .path so we get paths pack

0c2822e

actually pass in iterable

aaf8fc2

capture manifest_list files

b09641b

refactor into all_known_files

beec233

fix type in docstring

b888c56

mildly more readable

ff461ed

beef up tests

3b3b10e

kevinjqliu reviewed May 4, 2025
