pre-insert schema validation #500

dhananjay-mk · 2025-03-04T14:41:10Z

This PR adds/fixes/changes...

please summarize your changes to the code
and make sure to include all changes to user-facing APIs

JIRA Issue: -

Priority for Review: -

Related PRs: -

How Has This Been Tested?

Unit Tests
Integration Tests
Manual Tests on VM

Checklist For The Assigned Reviewer:

- [ ] Checked if merge conflicts with master exist
- [ ] Checked if stylechecks for Java and Python pass
- [ ] Checked if all docstrings were added and/or updated appropriately
- [ ] Ran spellcheck on docstring
- [ ] Checked if guides & concepts need to be updated
- [ ] Checked if naming conventions for parameters and variables were followed
- [ ] Checked if private methods are properly declared and used
- [ ] Checked if hard-to-understand areas of code are commented
- [ ] Checked if tests are effective
- [ ] Built and deployed changes on dev VM and tested manually
- [x] (Checked if all type annotations were added and/or updated appropriately)

PR Overview

This PR introduces schema validation for online ingestion by implementing a base DataFrameValidator along with specialized validators for Pandas, Polars, and PySpark dataframes. It also integrates the new validation mechanism into the feature group engine, ensuring that the schema is validated before saving metadata or inserting data when online mode is enabled.

Reviewed Changes

| File | Description |
| --- | --- |
| python/hsfs/core/schema_validation.py | Adds a base validator and specific implementations for different DF types. |
| python/hsfs/core/feature_group_engine.py | Integrates schema validation before saving feature group metadata and during data insertion. |

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

Comments suppressed due to low confidence (1)

python/hsfs/core/schema_validation.py:98

Consider adding a check to ensure that extract_numbers returns a non-empty list to avoid a potential IndexError when accessing the first element.

return int(self.extract_numbers(feature.online_type)[0])
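A minimal sketch of the guard suggested above. The real `extract_numbers` helper is not shown in this conversation, so the version here is a hypothetical stand-in that pulls digit runs out of a type string; `varchar_length` is likewise an illustrative name, not one taken from the PR.

```python
import re
from typing import List


def extract_numbers(online_type: str) -> List[str]:
    # Hypothetical stand-in for the PR's helper: return every run of digits.
    return re.findall(r"\d+", online_type)


def varchar_length(online_type: str) -> int:
    """Return the declared length of a type such as 'varchar(100)'."""
    numbers = extract_numbers(online_type)
    if not numbers:
        # Fail with a descriptive error instead of an IndexError on numbers[0].
        raise ValueError(f"No length found in online type {online_type!r}")
    return int(numbers[0])
```

Raising a `ValueError` is one option; returning `None` and letting the caller skip the length check would also avoid the `IndexError`.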

PR Overview

This PR introduces pre-insert schema validation to ensure that incoming data conforms to the expected schema before insertion. Key changes include:

- Implementation of a generic base DataFrameValidator along with Pandas, Polars, and PySpark-specific validators.
- Addition of unit tests that cover various schema validation scenarios.
- Integration of schema validation checks in the feature group engine when saving or inserting data.

Reviewed Changes

| File | Description |
| --- | --- |
| python/hsfs/core/schema_validation.py | Introduces DataFrameValidator and its implementations for Pandas, Polars, and PySpark. |
| python/tests/test_schema_validator.py | Adds comprehensive unit tests for validating schema rules under different scenarios. |
| python/hsfs/core/feature_group_engine.py | Integrates schema validation into the feature group save/insert workflows when enabled. |

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

Comments suppressed due to low confidence (2)

python/hsfs/core/schema_validation.py:107

[nitpick] Consider renaming 'i_feature' to a more descriptive variable name such as 'feature' to improve clarity.

for i_feature in dataframe_features:

python/tests/test_schema_validator.py:177

[nitpick] It might be more robust to verify the updated feature by iterating over the features and matching by feature name rather than relying on a fixed index.

assert df_features[2].online_type == "varchar(101)"
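The name-based lookup suggested above could look roughly like this. The `Feature` dataclass is a simplified stand-in for the hsfs feature object, and the feature names (`id`, `price`, `comment`) are made up for illustration; only the `online_type` value comes from the assertion under review.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Feature:
    # Simplified stand-in: only the two attributes the assertion needs.
    name: str
    online_type: str


def feature_by_name(features: List[Feature], name: str) -> Feature:
    """Return the single feature called `name`, or raise if it is missing or duplicated."""
    matches = [f for f in features if f.name == name]
    if len(matches) != 1:
        raise LookupError(f"Expected exactly one feature named {name!r}, found {len(matches)}")
    return matches[0]


# Hypothetical feature list; the order no longer matters for the assertion.
df_features = [
    Feature("id", "bigint"),
    Feature("price", "double"),
    Feature("comment", "varchar(101)"),
]

assert feature_by_name(df_features, "comment").online_type == "varchar(101)"
```

With this shape the test keeps passing if the validator returns the features in a different order.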

Copilot

Pull Request Overview

This PR introduces pre-insert schema validation to ensure that the input data conforms to the feature group schema before ingestion. Key changes include:

- Implementing schema validation in DataFrameValidator and its subclasses (PandasValidator, PolarsValidator, PySparkValidator).
- Adding unit tests for schema validation in python/tests/test_schema_validator.py.
- Integrating pre-insert schema validations in feature group engine methods and updating related documentation.

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| python/hsfs/core/schema_validation.py | New validators for different dataframe types with schema checks. |
| python/tests/test_schema_validator.py | Added unit tests covering various schema validation scenarios. |
| python/hsfs/core/feature_group_engine.py | Integrated pre-insert schema validation in save/insert workflows. |
| python/hsfs/feature_group.py | Updated docstring to indicate pre-insert schema validations. |


vatj

Please add unit tests for Spark and Polars DataFrames. Also add a check to ensure that validation is not triggered when an offline FG or embedded FG is passed.

vatj · 2025-03-19T13:23:46Z

python/hsfs/core/schema_validation.py

+
+        # Check string lengths
+        for col in df.select(pl.col(pl.Utf8)).columns:
+            currentmax = df[col].str.len_chars().max()


Can you confirm with the RonDB team whether we should use len_bytes or len_chars here? I am not sure how VARCHAR(100) is set up in RonDB for the online feature store tables.
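To make the question concrete: Polars exposes both `str.len_chars()` and `str.len_bytes()`, and the two differ for any non-ASCII text. The plain-Python sketch below mirrors that distinction without needing Polars; it assumes UTF-8 encoding and does not settle which measure RonDB applies to VARCHAR columns.

```python
def char_length(value: str) -> int:
    # Number of Unicode code points, the analogue of a character count.
    return len(value)


def byte_length(value: str, encoding: str = "utf-8") -> int:
    # Number of bytes after encoding (UTF-8 is an assumption here).
    return len(value.encode(encoding))


for sample in ["hello", "héllo", "日本語"]:
    print(sample, char_length(sample), byte_length(sample))
# hello 5 5
# héllo 5 6
# 日本語 3 9
```

If the column limit is measured in bytes, a string that passes a character-based check could still be rejected on insert, so the choice affects which rows the validator flags.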

vatj · 2025-03-19T13:26:38Z

python/tests/test_schema_validator.py

+                feature_group_data, pandas_df, feature_group_data.features
+            )
+
+    def test_offline_fg(self, pandas_df, feature_group_data, caplog):


What is this test doing? Offline FG should never be validated by the validator, no?
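One way to express the expectation behind this comment is a guard that decides whether validation runs at all, plus a test that the validator is never called for an offline FG. Everything below is a hypothetical sketch: `FeatureGroupStub`, its `has_embedding` flag, `should_validate`, and `RecordingValidator` are illustrative names, not the hsfs API.

```python
from dataclasses import dataclass


@dataclass
class FeatureGroupStub:
    # Assumed flags; the real feature group object may expose these differently.
    online_enabled: bool
    has_embedding: bool = False


def should_validate(fg: FeatureGroupStub) -> bool:
    """Validate only online-enabled feature groups without an embedding."""
    return fg.online_enabled and not fg.has_embedding


class RecordingValidator:
    """Test double that counts how often validation was requested."""

    def __init__(self) -> None:
        self.calls = 0

    def validate(self, fg: FeatureGroupStub, df: object) -> None:
        self.calls += 1


def maybe_validate(validator: RecordingValidator, fg: FeatureGroupStub, df: object) -> bool:
    # Skip the validator entirely when the guard says no.
    if not should_validate(fg):
        return False
    validator.validate(fg, df)
    return True


validator = RecordingValidator()
maybe_validate(validator, FeatureGroupStub(online_enabled=False), df=None)
assert validator.calls == 0  # offline FG: validator never invoked
```

A test written this way asserts on the call count rather than on the validator's output, which matches the point that an offline FG should never reach the validator.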

vatj · 2025-03-19T13:27:57Z

python/tests/test_schema_validator.py

+            feature_group_data, pandas_df, feature_group_data.features
+        )
+
+        assert df_features == feature_group_data.features


Where are the tests for the Spark and Polars dataframes?

dhananjay-mk added 12 commits February 10, 2025 12:21

init

c6902b9

init

35b59d0

refactor common methods to utils

eee0bcf

Merge remote-tracking branch 'upstream/main' into schemaval

876fac0

modify raising error conditions

b4d42c0

major refactor-switching to class

3a40b8e

Merge remote-tracking branch 'upstream/main' into schemaval

405fdda

revert engine changes

34a53ee

rminor cleanup

90f8d51

minor cleanup

53621ca

refactor and cleanup

3748ee4

Merge remote-tracking branch 'upstream/main' into schemaval

9370856

dhananjay-mk requested review from vatj and Copilot March 4, 2025 14:41

dhananjay-mk marked this pull request as draft March 4, 2025 14:41

Copilot AI reviewed Mar 4, 2025

add tests

26d388e

dhananjay-mk requested a review from Copilot March 5, 2025 18:26

Copilot AI reviewed Mar 5, 2025

dhananjay-mk added 4 commits March 11, 2025 11:22

Merge remote-tracking branch 'upstream/main' into schemaval

790c2c6

update docs

1ccee82

remove evt time and minor updates

90b12a0

update to handle explicit features

c078bcc

dhananjay-mk requested a review from Copilot March 13, 2025 15:08

dhananjay-mk marked this pull request as ready for review March 13, 2025 15:08

Copilot AI reviewed Mar 13, 2025

python/hsfs/core/schema_validation.py (outdated, resolved)

python/hsfs/core/schema_validation.py (outdated, resolved)

dhananjay-mk and others added 2 commits March 14, 2025 17:59

Update python/hsfs/core/schema_validation.py

ba62e49

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Update python/hsfs/core/schema_validation.py

599c47c

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

vatj requested changes Mar 19, 2025
