Skip to content

feat: validate_deleted_data_files #1938

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 28 commits into from
May 20, 2025

Conversation

jayceslesar
Copy link
Contributor

@jayceslesar jayceslesar commented Apr 19, 2025

Comment on lines 118 to 128
if entry.snapshot_id not in new_snapshot_ids:
continue

if entry.status != ManifestEntryStatus.DELETED:
continue

if data_filter is not None and not evaluator(entry.data_file):
continue

if partition_set is not None and (entry.data_file.spec_id, entry.data_file.partition) not in partition_set:
continue
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably cleaner to put these each on a line and wrap in an any call

@jayceslesar jayceslesar changed the title [wip] feat: validate_deleted_data_files feat: validate_deleted_data_files May 1, 2025
@jayceslesar jayceslesar marked this pull request as ready for review May 2, 2025 00:40
Comment on lines 84 to 85
parent_snapshot: Optional[Snapshot],
partition_set: Optional[set[Record]],
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks out of order:

Suggested change
parent_snapshot: Optional[Snapshot],
partition_set: Optional[set[Record]],
partition_set: Optional[set[Record]],
parent_snapshot: Optional[Snapshot],

It looks like the function call in validate_deleted_data_files is assuming that parent_snapshot is the last argument

    conflicting_entries = deleted_data_files(table, starting_snapshot, data_filter, None, parent_snapshot)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this needs some more work. Next to @sungwy's comment, we in the code below:

(entry.data_file.spec_id, entry.data_file.partition) not in partition_set

Where it checks if a tuple is in a partition_set, but the partition_set only contains the Record according to the signature.

This triggered me, because if you do:

ALTER TABLE prod.db.taxis REPLACE PARTITION FIELD pickup_timestamp WITH day(pickup_timestamp);

-- and then
ALTER TABLE prod.db.taxis REPLACE PARTITION FIELD pickup_timestamp WITH day(dropoff_timestamp);

Both of the partitioning strategies will produce a Record[int] because it will contain the number of days since epoch. But the meaning is completely different.

Comment on lines 105 to 106
starting_snapshot,
parent_snapshot,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks out of order to, because validation_history has to_snapshot as the second argument, and from_snapshot as the third argument.

Suggested change
starting_snapshot,
parent_snapshot,
starting_snapshot,
parent_snapshot,

Do you think it would be better to update validation_history function to use the following function signature instead? I think it's a lot more expected to have from_snapshot then to_snapshot

def validation_history(
    table: Table,
    from_snapshot: Snapshot,
    to_snapshot: Snapshot,
    matching_operations: set[Operation],
    manifest_content_filter: ManifestContent,
)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you clarify a little more? Are you saying we should move to the terms starting_snapshot (which is from_snapshot) and parent_snapshot (to_snapshot)

)

if data_filter is not None:
evaluator = _StrictMetricsEvaluator(table.schema(), data_filter).eval
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not too sure if this is correct, because ManifestGroup.entries seems to be using inclusive projection.

Should we be using inclusive_projection here instead?

Summoning @Fokko for a second review

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree that this should be inclusive projection, since we want to know if there are any matches. Inclusive projection returns rows_might_match and rows_cannot_match. If they cannot be matched, then we can skip it :)

Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for following up on the previous PR @jayceslesar. This looks like a great start, but there are some missing elements. It would be good to add some more tests as well, since this is pretty critical code since it might affect correctness.

Comment on lines 84 to 85
parent_snapshot: Optional[Snapshot],
partition_set: Optional[set[Record]],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this needs some more work. Next to @sungwy's comment, we in the code below:

(entry.data_file.spec_id, entry.data_file.partition) not in partition_set

Where it checks if a tuple is in a partition_set, but the partition_set only contains the Record according to the signature.

This triggered me, because if you do:

ALTER TABLE prod.db.taxis REPLACE PARTITION FIELD pickup_timestamp WITH day(pickup_timestamp);

-- and then
ALTER TABLE prod.db.taxis REPLACE PARTITION FIELD pickup_timestamp WITH day(dropoff_timestamp);

Both of the partitioning strategies will produce a Record[int] because it will contain the number of days since epoch. But the meaning is completely different.

)

if data_filter is not None:
evaluator = _StrictMetricsEvaluator(table.schema(), data_filter).eval
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree that this should be inclusive projection, since we want to know if there are any matches. Inclusive projection returns rows_might_match and rows_cannot_match. If they cannot be matched, then we can skip it :)

@jayceslesar
Copy link
Contributor Author

@sungwy @Fokko thank you for reviews! Late responses here are my bad -- was focusing on the orphaned files PR

Copy link
Collaborator

@sungwy sungwy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @jayceslesar - thanks for adopting all of the review feedback!

I think this is good to merge, but I'd prefer to make validate_deleted_data_files a private function

@sungwy
Copy link
Collaborator

sungwy commented May 15, 2025

Waiting for the CI to pass

@jayceslesar
Copy link
Contributor Author

@sungwy @Fokko thank you for the review! Learned a bunch and hopefully I can get the other 3 sub tasks in #819 along quickly!

parent_snapshot: Ending snapshot on the branch being validated

"""
conflicting_entries = _deleted_data_files(table, starting_snapshot, data_filter, None, parent_snapshot)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, I think style wise it is more elegant to switch over to keyword-arguments:

Suggested change
conflicting_entries = _deleted_data_files(table, starting_snapshot, data_filter, None, parent_snapshot)
conflicting_entries = _deleted_data_files(table, starting_snapshot, data_filter, parent_snapshot=parent_snapshot)

Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks @jayceslesar for working on this, and thanks @sungwy for the review 🙌

@Fokko Fokko merged commit e98f685 into apache:main May 20, 2025
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support Concurrency Safety Validation: Implement validateDeletedDataFiles
3 participants