Skip to content

validate added data files for snapshot compatibility #2050

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

kaushiksrini
Copy link
Contributor

Closes #1929

Rationale for this change

Are these changes tested?

Yes

Are there any user-facing changes?

No

@kaushiksrini kaushiksrini changed the title Feat/validate added data files validate added data files by supporting validateAddedDataFiles May 28, 2025
@kaushiksrini kaushiksrini changed the title validate added data files by supporting validateAddedDataFiles validate added data files for snapshot compatibility May 28, 2025
Copy link
Collaborator

@sungwy sungwy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @kaushiksrini thank you for working on this PR!

The logic looks sound to me, I have some nit picks regarding doc strings and reducing duplication of code.

Please let me know what you think about the feedback!

@@ -24,7 +24,8 @@
from pyiceberg.table.snapshots import Operation, Snapshot, ancestors_between
from pyiceberg.typedef import Record

VALIDATE_DATA_FILES_EXIST_OPERATIONS = {Operation.OVERWRITE, Operation.REPLACE, Operation.DELETE}
VALIDATE_DATA_FILES_EXIST_OPERATIONS: Set[Operation] = {Operation.OVERWRITE, Operation.REPLACE, Operation.DELETE}
VALIDATE_ADDED_FILES_OPERATIONS: Set[Operation] = {Operation.APPEND, Operation.OVERWRITE}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know Java uses this name, but I'd like to make an argument for us to be more specific about the files we are referring to since we are naming this constant for the first time in PyIceberg:

Suggested change
VALIDATE_ADDED_FILES_OPERATIONS: Set[Operation] = {Operation.APPEND, Operation.OVERWRITE}
VALIDATE_ADDED_DATA_FILES_OPERATIONS: Set[Operation] = {Operation.APPEND, Operation.OVERWRITE}

partition_set: Optional[dict[int, set[Record]]],
parent_snapshot: Optional[Snapshot],
) -> Iterator[ManifestEntry]:
if parent_snapshot is None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we introduce a doc string for each function?

if entry.snapshot_id not in snapshot_ids:
continue

if data_filter and evaluator.eval(entry.data_file) is ROWS_CANNOT_MATCH:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: there's a lot of duplicated code between _deleted_data_files and _added_data_files. Is there a way we could refactor the code block here to prevent that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point! made some changes @sungwy

@Fokko
Copy link
Contributor

Fokko commented Jun 2, 2025

@kaushiksrini could you resolve the conflicts? Thanks!

if parent_snapshot is None:
return

manifests, snapshot_ids = validation_history(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
manifests, snapshot_ids = validation_history(
manifests, snapshot_ids = _validation_history(

This was made private here #2054

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support Concurrency Safety Validation: Implement validateAddedDataFiles
4 participants