-
Notifications
You must be signed in to change notification settings - Fork 302
Added ExpireSnapshots Feature #1880
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…h a new Expired Snapshot class. updated tests.
ValueError: Cannot expire snapshot IDs {3051729675574597004} as they are currently referenced by table refs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @ForeverAngry for raising this PR! I'll go into details tomorrow morning (UTC+2 here). Could you resolve the conflicts to get the CI running?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Applied intial suggestions so that CI can run on the PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rebuilt the poetry lock file.
Moved expiration-related methods from `ExpireSnapshots` to `ManageSnapshots` for improved organization and clarity. Updated corresponding pytest tests to reflect these changes.
Re-ran the `poetry run pre-commit run --all-files` command on the project.
Re-ran the `poetry run pre-commit run --all-files` command on the project.
After looking at the way the action here was implemented, I refined the changes. Let me know if these make sense :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kevinjqliu thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rebased from main
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there's a merge conflict somewhere, there are a lot of code here from other PRs. Perhaps you need to update your fork's main branch.
pyiceberg/table/update/snapshot.py
Outdated
Returns: | ||
This for method chaining. | ||
""" | ||
# Collect IDs of snapshots to be expired |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunally, it is not that simple to just look at the time alone. Instead, there are some rules, for example:
- We want to make sure that we don't remove the head snapshot of each of the branches: https://github.com/apache/iceberg/blob/3f661d5c6657542538a1e944db57405efdefea29/core/src/main/java/org/apache/iceberg/RemoveSnapshots.java#L297-L310
- Don't remove any snapshots that are tagged: https://github.com/apache/iceberg/blob/3f661d5c6657542538a1e944db57405efdefea29/core/src/main/java/org/apache/iceberg/RemoveSnapshots.java#L255-L278
The easiest way of going through the logic is following this method: https://github.com/apache/iceberg/blob/3f661d5c6657542538a1e944db57405efdefea29/core/src/main/java/org/apache/iceberg/RemoveSnapshots.java#L179
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might just pull this out into another issue, separate from this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm hestitant to do that, because when folks would run it, it might break their tables 😱
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant, for now, ill remove the expire_snapshots_older_than
method, for now, and contribute that in another PR.
…ng it in a separate issue. Fixed: unrelated changes caused by afork/branch sync issues.
Co-authored-by: Fokko Driesprong <fokko@apache.org>
Implemented logic to protect the HEAD branches or Tagged branches from being expired by the `expire_snapshot_by_id` method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ForeverAngry Could you see if you can get the linters/tests passing? Thanks!
`make lint` now passes
I was able to get the linting figured out. I dont have docker on this machine, so im going to give the integration tests a try in codespaces to see if i can get away doing the checks there. |
@ForeverAngry Sorry for the late reply, it looks like that there is a test failing now 👀 |
I think this commit should fix the test error, i also added additional tests. All passed - and appear to be in-good-order. 🤞 this time is the charm. |
Updated test_expire_snapshots.py to use model_copy(update={...}) for modifying the refs attribute of the table metadata, ensuring compatibility with Pydantic's frozen models. Fixed all snapshot expiration tests to avoid direct assignment to frozen attributes. All tests now pass, confirming correct behavior for protected and unprotected snapshot expiration.
this would be great - whats the status of this? |
I see that in #1958, for orphaned file removal, we decided to have a |
Well, right now this pr doesn't do anything with the newly orphaned files. It just handles the metadata operation. |
@@ -739,6 +741,7 @@ class ManageSnapshots(UpdateTableMetadata["ManageSnapshots"]): | |||
ms.create_tag(snapshot_id1, "Tag_A").create_tag(snapshot_id2, "Tag_B") | |||
""" | |||
|
|||
_snapshot_ids_to_expire: Set[int] = set() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this one was added by mistake?
Use table.expire_snapshots().<operation>().commit() to run a specific operation. | ||
Use table.expire_snapshots().<operation-one>().<operation-two>().commit() to run multiple operations. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this isn't true, since the method returns ExpireSnapshots
, meaning that you cannot chain the operations (which is fine if you call it on the table).
Use table.expire_snapshots().<operation>().commit() to run a specific operation. | |
Use table.expire_snapshots().<operation-one>().<operation-two>().commit() to run multiple operations. |
Yes, I agree with you there. Before doing the 0.10.0 release, we need to ensure we align on this and make proper docs. I have a slight preference towards |
I'm happy to follow the '.maintenance' api design if there is a strong preference toward it. |
Summary
This PR Closes issue #516 by implementing support for the
ExpireSnapshot
table metadata action.Rationale
The
ExpireSnapshot
action is a core part of Iceberg’s table maintenance APIs. Adding support for this action in PyIceberg helps ensure feature parity with other language implementations (e.g., Java) and supports users who want to programmatically manage snapshot retention using PyIceberg’s public API.Testing
User-facing changes
ExpireSnapshot
.