pre-insert schema validation #500
base: main
PR Overview
This PR introduces schema validation for online ingestion by implementing a base DataFrameValidator along with specialized validators for Pandas, Polars, and PySpark dataframes. It also integrates the new validation mechanism into the feature group engine, ensuring that the schema is validated before saving metadata or inserting data when online mode is enabled.
Reviewed Changes
| File | Description |
|---|---|
| python/hsfs/core/schema_validation.py | Adds a base validator and specific implementations for different DF types. |
| python/hsfs/core/feature_group_engine.py | Integrates schema validation before saving feature group metadata and during data insertion. |
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
Comments suppressed due to low confidence (1)
python/hsfs/core/schema_validation.py:98
- Consider adding a check to ensure that extract_numbers returns a non-empty list to avoid a potential IndexError when accessing the first element.
return int(self.extract_numbers(feature.online_type)[0])
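A minimal sketch of the guard suggested above. The `extract_numbers` below is a hypothetical stand-in written for illustration (the real helper in `schema_validation.py` may behave differently), and `varchar_length` is an assumed name, not a function from the PR.

```python
import re
from typing import List, Optional


def extract_numbers(type_string: str) -> List[str]:
    """Hypothetical stand-in: return every run of digits in the type string."""
    return re.findall(r"\d+", type_string)


def varchar_length(online_type: str) -> Optional[int]:
    """Return the first declared length, or None when the type has no number.

    Checking for an empty list first avoids the IndexError that
    `extract_numbers(online_type)[0]` would raise for a type such as "text".
    """
    numbers = extract_numbers(online_type)
    if not numbers:
        return None
    return int(numbers[0])
```

Returning `None` is only one option; raising a descriptive error for an unexpected `online_type` may suit the validator better.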
PR Overview
This PR introduces pre-insert schema validation to ensure that incoming data conforms to the expected schema before insertion. Key changes include:
- Implementation of a generic base DataFrameValidator along with Pandas, Polars, and PySpark-specific validators.
- Addition of unit tests that cover various schema validation scenarios.
- Integration of schema validation checks in the feature group engine when saving or inserting data.
Reviewed Changes
| File | Description |
|---|---|
| python/hsfs/core/schema_validation.py | Introduces DataFrameValidator and its implementations for Pandas, Polars, and PySpark. |
| python/tests/test_schema_validator.py | Adds comprehensive unit tests for validating schema rules under different scenarios. |
| python/hsfs/core/feature_group_engine.py | Integrates schema validation into the feature group save/insert workflows when enabled. |
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
Comments suppressed due to low confidence (2)
python/hsfs/core/schema_validation.py:107
- [nitpick] Consider renaming 'i_feature' to a more descriptive variable name such as 'feature' to improve clarity.
for i_feature in dataframe_features:
python/tests/test_schema_validator.py:177
- [nitpick] It might be more robust to verify the updated feature by iterating over the features and matching by feature name rather than relying on a fixed index.
assert df_features[2].online_type == "varchar(101)"
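A sketch of the name-based lookup suggested above, so the assertion does not depend on feature order. The `Feature` dataclass, the `feature_by_name` helper, and the feature names are hypothetical and stand in for whatever the test fixtures actually provide.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Feature:
    """Hypothetical stand-in for a feature; only the fields the test needs."""
    name: str
    online_type: str


def feature_by_name(features: Iterable[Feature], name: str) -> Feature:
    """Return the single feature with the given name, failing loudly otherwise."""
    matches = [f for f in features if f.name == name]
    if len(matches) != 1:
        raise AssertionError(f"expected exactly one feature named {name!r}, found {len(matches)}")
    return matches[0]


# Hypothetical feature list; the names are illustrative only.
df_features = [
    Feature("id", "bigint"),
    Feature("price", "double"),
    Feature("comment", "varchar(101)"),
]
assert feature_by_name(df_features, "comment").online_type == "varchar(101)"
```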
Pull Request Overview
This PR introduces pre-insert schema validation to ensure that the input data conforms to the feature group schema before ingestion. Key changes include:
- Implementing schema validation in DataFrameValidator and its subclasses (PandasValidator, PolarsValidator, PySparkValidator).
- Adding unit tests for schema validation in python/tests/test_schema_validator.py.
- Integrating pre-insert schema validations in feature group engine methods and updating related documentation.
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| python/hsfs/core/schema_validation.py | New validators for different dataframe types with schema checks. |
| python/tests/test_schema_validator.py | Added unit tests covering various schema validation scenarios. |
| python/hsfs/core/feature_group_engine.py | Integrated pre-insert schema validation in save/insert workflows. |
| python/hsfs/feature_group.py | Updated docstring to indicate pre-insert schema validations. |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Please add unit tests for Spark and Polars DataFrames, and add a check to ensure that validation is not triggered when passing an offline FG or an embedded FG.
# Check string lengths
for col in df.select(pl.col(pl.Utf8)).columns:
    currentmax = df[col].str.len_chars().max()
Can you confirm with the RonDB team whether we should use len_bytes or len_chars here? I am not sure whether VARCHAR(100) counts bytes or characters in RonDB for the online feature store table.
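To illustrate why the choice matters, here is a small sketch in plain Python (no Polars required) that computes both measures for the same values. Polars exposes the same distinction as `str.len_chars()` and `str.len_bytes()`. The helper name and the sample strings are made up for illustration, and the sketch assumes UTF-8 encoding; it says nothing about which unit RonDB actually uses.

```python
from typing import Iterable, Tuple


def max_char_and_byte_lengths(values: Iterable[str]) -> Tuple[int, int]:
    """Return (max length in characters, max length in UTF-8 bytes)."""
    values = list(values)
    max_chars = max((len(v) for v in values), default=0)
    max_bytes = max((len(v.encode("utf-8")) for v in values), default=0)
    return max_chars, max_bytes


# Escapes are used so the code points are unambiguous:
# "h\u00e9llo" is 5 characters but 6 bytes; the last string is 3 characters but 9 bytes.
sample = ["abc", "h\u00e9llo", "\u65e5\u672c\u8a9e"]
max_chars, max_bytes = max_char_and_byte_lengths(sample)
```

For this sample the two maxima differ (5 characters versus 9 bytes), so a column that passes a character-based check could still exceed a byte-based limit.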
    feature_group_data, pandas_df, feature_group_data.features
)

def test_offline_fg(self, pandas_df, feature_group_data, caplog):
What is this test doing? Offline FG should never be validated by the validator, no?
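One way to make that guarantee explicit is to guard the call site and test the guard rather than the validator. The sketch below is an assumption about how this could look: `FeatureGroupStub`, `RecordingValidator`, `maybe_validate`, and the `embedding_index` field are hypothetical names for illustration, not the hsfs API.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class FeatureGroupStub:
    """Hypothetical stand-in for a feature group; not the hsfs class."""
    online_enabled: bool
    embedding_index: Optional[Any] = None  # assumed marker for an embedded FG


class RecordingValidator:
    """Hypothetical validator that only records whether it was called."""

    def __init__(self) -> None:
        self.calls: List[Any] = []

    def validate_schema(self, feature_group, dataframe, features):
        self.calls.append(dataframe)
        return features


def maybe_validate(validator, feature_group, dataframe, features):
    """Run validation only for online-enabled, non-embedded feature groups."""
    if not feature_group.online_enabled or feature_group.embedding_index is not None:
        return features  # offline or embedded FG: skip validation entirely
    return validator.validate_schema(feature_group, dataframe, features)
```

A test can then assert that the recording validator saw no calls for an offline or embedded feature group and exactly one call for an online one.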
    feature_group_data, pandas_df, feature_group_data.features
)

assert df_features == feature_group_data.features
Where are the tests for the Spark and Polars DataFrames?