-
Notifications
You must be signed in to change notification settings - Fork 302
validate added data files for snapshot compatibility #2050
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
validate added data files for snapshot compatibility #2050
Conversation
validateAddedDataFiles
validateAddedDataFiles
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @kaushiksrini thank you for working on this PR!
The logic looks sound to me, I have some nit picks regarding doc strings and reducing duplication of code.
Please let me know what you think about the feedback!
pyiceberg/table/update/validate.py
Outdated
@@ -24,7 +24,8 @@ | |||
from pyiceberg.table.snapshots import Operation, Snapshot, ancestors_between | |||
from pyiceberg.typedef import Record | |||
|
|||
VALIDATE_DATA_FILES_EXIST_OPERATIONS = {Operation.OVERWRITE, Operation.REPLACE, Operation.DELETE} | |||
VALIDATE_DATA_FILES_EXIST_OPERATIONS: Set[Operation] = {Operation.OVERWRITE, Operation.REPLACE, Operation.DELETE} | |||
VALIDATE_ADDED_FILES_OPERATIONS: Set[Operation] = {Operation.APPEND, Operation.OVERWRITE} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know Java uses this name, but I'd like to make an argument for us to be more specific about the files we are referring to since we are naming this constant for the first time in PyIceberg:
VALIDATE_ADDED_FILES_OPERATIONS: Set[Operation] = {Operation.APPEND, Operation.OVERWRITE} | |
VALIDATE_ADDED_DATA_FILES_OPERATIONS: Set[Operation] = {Operation.APPEND, Operation.OVERWRITE} |
partition_set: Optional[dict[int, set[Record]]], | ||
parent_snapshot: Optional[Snapshot], | ||
) -> Iterator[ManifestEntry]: | ||
if parent_snapshot is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we introduce a doc string for each function?
pyiceberg/table/update/validate.py
Outdated
if entry.snapshot_id not in snapshot_ids: | ||
continue | ||
|
||
if data_filter and evaluator.eval(entry.data_file) is ROWS_CANNOT_MATCH: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: there's a lot of duplicated code between _deleted_data_files
and _added_data_files
. Is there a way we could refactor the code block here to prevent that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point! made some changes @sungwy
@kaushiksrini could you resolve the conflicts? Thanks! |
if parent_snapshot is None: | ||
return | ||
|
||
manifests, snapshot_ids = validation_history( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
manifests, snapshot_ids = validation_history( | |
manifests, snapshot_ids = _validation_history( |
This was made private here #2054
Closes #1929
Rationale for this change
Are these changes tested?
Yes
Are there any user-facing changes?
No