[FSTORE-612] Add docs for feature monitoring #347

Merged · 19 commits · Feb 15, 2024

Commits
4acb90c  [FSTORE-612] Add docs for feature monitoring (javierdlrm, Feb 14, 2024)
8445157  Update docs/concepts/fs/feature_group/feature_monitoring.md (javierdlrm, Feb 14, 2024)
1f5ccfc  Update docs/user_guides/fs/feature_group/data_validation_advanced.md (javierdlrm, Feb 14, 2024)
3a6d893  Update docs/user_guides/fs/feature_group/data_validation_best_practic… (javierdlrm, Feb 14, 2024)
ecbba0a  Update docs/user_guides/fs/feature_group/data_validation_best_practic… (javierdlrm, Feb 14, 2024)
42d66cc  Update docs/user_guides/fs/feature_group/feature_monitoring.md (javierdlrm, Feb 14, 2024)
0919618  Update docs/user_guides/fs/feature_group/feature_monitoring.md (javierdlrm, Feb 14, 2024)
a3f15d4  Update docs/user_guides/fs/feature_group/feature_monitoring.md (javierdlrm, Feb 14, 2024)
b27da27  Update docs/user_guides/fs/feature_monitoring/feature_monitoring_adva… (javierdlrm, Feb 14, 2024)
eab5f32  Update docs/user_guides/fs/feature_view/feature_monitoring.md (javierdlrm, Feb 14, 2024)
cac541d  Update docs/user_guides/fs/feature_view/feature_monitoring.md (javierdlrm, Feb 14, 2024)
085ffca  Update docs/user_guides/fs/feature_monitoring/statistics_comparison.md (javierdlrm, Feb 14, 2024)
27a5939  Update docs/user_guides/fs/feature_monitoring/feature_monitoring_adva… (javierdlrm, Feb 14, 2024)
a757039  Update docs/user_guides/fs/feature_monitoring/index.md (javierdlrm, Feb 14, 2024)
a31a05d  Update docs/user_guides/fs/feature_monitoring/interactive_graph.md (javierdlrm, Feb 14, 2024)
9d2ca02  Update docs/user_guides/fs/feature_monitoring/statistics_comparison.md (javierdlrm, Feb 14, 2024)
24ee6e6  Update docs/user_guides/fs/feature_monitoring/statistics_comparison.md (javierdlrm, Feb 14, 2024)
371bc37  Update docs/user_guides/fs/feature_monitoring/statistics_comparison.md (javierdlrm, Feb 14, 2024)
77081a1  Address comments (javierdlrm, Feb 14, 2024)
6 changes: 3 additions & 3 deletions docs/admin/alert.md
@@ -34,7 +34,7 @@
button on the left side of the **email** row and fill out the form that pops up.
CRAM-MD5, LOGIN or PLAIN.

Optionally cluster wide Email alert receivers can be added in _Default receiver emails_.
-These receivers will be available to all users when they create event triggered [alerts](../../user_guides/fs/feature_group/advanced_data_validation/#setup-alerts).
+These receivers will be available to all users when they create event triggered [alerts](../../user_guides/fs/feature_group/data_validation_best_practices#setup-alerts).

### Step 3: Configure Slack Alerts
Alerts can also be sent via Slack messages. To be able to send Slack messages you first need to configure
@@ -47,7 +47,7 @@
a Slack webhook. Click on the _Configure_ button on the left side of the **slack
</figure>

Optionally cluster wide Slack alert receivers can be added in _Slack channel/user_.
-These receivers will be available to all users when they create event triggered [alerts](../../user_guides/fs/feature_group/advanced_data_validation/#setup-alerts).
+These receivers will be available to all users when they create event triggered [alerts](../../user_guides/fs/feature_group/data_validation_best_practices/#setup-alerts).

### Step 4: Configure Pagerduty Alerts
Pagerduty is another way you can send alerts from Hopsworks. Click on the _Configure_ button on the left side of
@@ -93,7 +93,7 @@
global:
...
```

-To test the alerts by creating triggers from Jobs and Feature group validations see [Alerts](../../user_guides/fs/feature_group/advanced_data_validation/#setup-alerts).
+To test the alerts by creating triggers from Jobs and Feature group validations see [Alerts](../../user_guides/fs/feature_group/data_validation_best_practices/#setup-alerts).

The yaml syntax in the UI is slightly different in that it does not allow double quotes (it will ignore the values but give no error).
Below is an example configuration, that can be used in the UI, with both email and slack receivers configured for system alerts.
20 changes: 20 additions & 0 deletions docs/concepts/fs/feature_group/feature_monitoring.md
@@ -0,0 +1,20 @@
Feature Monitoring complements data validation capabilities by allowing you to monitor your feature data after it has been ingested into the Feature Store.

HSFS supports monitoring features on your Feature Group by:

- transparently **computing statistics** on the whole or a subset of feature data defined by a detection window.
- **comparing statistics** against a reference window of feature data, and **configuring thresholds** to identify anomalous data.
- **configuring alerts** based on the statistics comparison results.

## Scheduled Statistics

After creating a Feature Group in HSFS, you can set up statistics monitoring to compute statistics over one or more features on a scheduled basis. Statistics are computed on the whole or a subset of feature data (i.e., detection window) already inserted into the Feature Group.

## Statistics Comparison

In addition to scheduled statistics, you can enable the comparison of statistics against a reference subset of feature data (i.e., reference window), and define the criteria for this comparison, including the statistics metric to compare and a threshold to identify anomalous values.

!!! info "Feature Monitoring Guide"
More information can be found in the [Feature monitoring guide](../../../user_guides/fs/feature_monitoring/index.md).
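To make the two setups above concrete, here is a minimal sketch of what enabling them from HSFS can look like. The method names and parameters (`create_statistics_monitoring`, `with_detection_window`, `compare_on`, ...) are assumptions drawn from the guide linked above and should be verified against it:

```python3
# Sketch, assumed API: compute statistics on one feature every day at 12:00.
fg.create_statistics_monitoring(
    name="amount_stats",
    feature_name="amount",             # omit to monitor all features
    cron_expression="0 0 12 ? * * *",  # daily at 12:00
).save()

# Sketch, assumed API: compare a detection window against a reference window.
fg.create_feature_monitoring(
    name="amount_drift",
    feature_name="amount",
    cron_expression="0 0 12 ? * * *",
).with_detection_window(
    time_offset="1d",    # statistics over the last day of data
).with_reference_window(
    time_offset="1w",    # compared against one-week-old data
    window_length="1d",
).compare_on(
    metric="mean",
    threshold=0.2,       # differences above the threshold are flagged as anomalous
).save()
```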


20 changes: 20 additions & 0 deletions docs/concepts/fs/feature_view/feature_monitoring.md
@@ -0,0 +1,20 @@
Feature Monitoring complements data validation capabilities by allowing you to monitor your feature data once it has been ingested into the Feature Store.

HSFS supports monitoring features on your Feature View by:

- transparently **computing statistics** on the whole or a subset of feature data defined by a detection window.
- **comparing statistics** against a reference window of feature data (e.g., training dataset), and **configuring thresholds** to identify anomalous data.
- **configuring alerts** based on the statistics comparison results.

## Scheduled Statistics

After creating a Feature View in HSFS, you can set up statistics monitoring to compute statistics over one or more features on a scheduled basis. Statistics are computed on the whole or a subset of feature data (i.e., detection window) using the Feature View query.

## Statistics Comparison

In addition to scheduled statistics, you can enable the comparison of statistics against a reference subset of feature data (i.e., reference window), typically a training dataset, and define the criteria for this comparison, including the statistics metric to compare and a threshold to identify anomalous values.

!!! info "Feature Monitoring Guide"
More information can be found in the [Feature monitoring guide](../../../user_guides/fs/feature_monitoring/index.md).
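The pattern is the same for a Feature View, except that the reference is typically a training dataset. Again a sketch, with `with_reference_training_dataset` as an assumed method name to verify against the guide:

```python3
# Sketch, assumed API: compare recent feature data against the training dataset.
fv.create_feature_monitoring(
    name="amount_vs_training",
    feature_name="amount",
    cron_expression="0 0 12 ? * * *",
).with_detection_window(
    time_offset="1d",
).with_reference_training_dataset(
    training_dataset_version=1,
).compare_on(
    metric="mean",
    threshold=0.2,
).save()
```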


63 changes: 35 additions & 28 deletions docs/css/custom.css
@@ -1,58 +1,67 @@
:root {
-  --md-primary-fg-color: #1EB382;
+  --md-primary-fg-color: #1eb382;
  --md-secondary-fg-color: #188a64;
  --md-tertiary-fg-color: #0d493550;
  --md-quaternary-fg-color: #fdfdfd;
  --border-radius-variable: 5px;
}

.md-footer__inner:not([hidden]) {
-  display: none
+  display: none;
}

/* Lex did stuff here */
-.svg_topnav{
+.svg_topnav {
  width: 12px;
  filter: invert(100);
}
-.svg_topnav:hover{
+.svg_topnav:hover {
  width: 12px;
  filter: invert(10);
}

-.md-header[data-md-state=shadow] {
+.md-header[data-md-state="shadow"] {
  box-shadow: 0 0 0 0;
}

.md-tabs__item:hover {
  background-color: var(--md-tertiary-fg-color);
  transition: background-color 450ms;
-
}

-.md-sidebar__scrollwrap{
+.md-sidebar__scrollwrap {
  background-color: var(--md-quaternary-fg-color);
  padding: 15px 5px 5px 5px;
  border-radius: var(--border-radius-variable);
}

-
-.image_logo_02{
-  width:450px;
+.image_logo_02 {
+  width: 450px;
}

/* End of Lex did stuff here */

+/* no-icon style for admonitions */
+.md-typeset .no-icon > .admonition-title::before,
+.md-typeset .no-icon > summary::before {
+  display: none;
+}
+.md-typeset .no-icon > :is(.admonition-title, summary) {
+  padding-left: 1rem;
+}
+/* end of no-icon style */
+
.md-header__button.md-logo {
-  margin: .1rem;
-  padding: .1rem;
+  margin: 0.1rem;
+  padding: 0.1rem;
}

-.md-header__button.md-logo img, .md-header__button.md-logo svg {
+.md-header__button.md-logo img,
+.md-header__button.md-logo svg {
  display: block;
  width: 1.8rem;
  height: 1.8rem;
-  fill: currentColor;
+  fill: rgba(43, 155, 70, 0.1);
}

.md-tabs {
@@ -63,7 +72,6 @@
  transition: background-color 250ms;
}

-
.wrapper {
  display: grid;
  grid-template-columns: repeat(4, 1fr);
@@ -72,9 +80,9 @@
}

.wrapper * {
-    border: 2px solid green;
-    text-align: center;
-    padding: 70px 0;
+  border: 2px solid green;
+  text-align: center;
+  padding: 70px 0;
}

.one {
@@ -107,13 +115,12 @@
  display: none !important;
}

-
-@media screen and (max-width: 479px){
-  .md-sidebar--primary, .md-sidebar {
-    z-index: 50 !important;
-  }
-  .md-logo {
-    visibility: hidden;
-  }
-
-}
+@media screen and (max-width: 479px) {
+  .md-sidebar--primary,
+  .md-sidebar {
+    z-index: 50 !important;
+  }
+  .md-logo {
+    visibility: hidden;
+  }
+}
6 changes: 3 additions & 3 deletions docs/user_guides/fs/feature_group/data_validation.md
@@ -64,7 +64,7 @@
In order to define and validate an expectation when writing to a Feature Group,

- A Hopsworks project. If you don't have a project yet you can go to [managed.hopsworks.ai](https://managed.hopsworks.ai), signup with your email and create your first project.
- An API key, you can get one by following the instructions [here](../../../setup_installation/common/api_key.md)
-- The [hopsworks python library](../../client_installation/index.md) installed in your client
+- The [Hopsworks Python library](https://pypi.org/project/hopsworks) installed in your client. See the [installation guide](../../client_installation/index.md).

#### Connect your notebook to Hopsworks
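The connection code itself is collapsed in this diff; for context, a minimal sketch of the usual login flow with the `hopsworks` client:

```python3
import hopsworks

project = hopsworks.login()  # prompts for the API key created above
fs = project.get_feature_store()
```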

@@ -174,7 +174,7 @@
That is all there is to it. Hopsworks will now automatically use your suite to v
```python3
job, validation_report = fg.insert(df.head(5))
```

-As you can see, Hopsworks runs the validation in the client before attempting to insert the data. By default, Hopsworks will try to insert the data even if validation fails to prevent data loss. However it can be configured for production setup to be more restrictive, checkout the [data validation advanced guide](advanced_data_validation.md).
+As you can see, Hopsworks runs the validation in the client before attempting to insert the data. By default, Hopsworks will try to insert the data even if validation fails to prevent data loss. However it can be configured for production setup to be more restrictive, checkout the [data validation advanced guide](data_validation_advanced.md).

!!!info
Note that once the Expectation Suite is attached to the Feature Group, any subsequent attempt to insert to this Feature Group will apply the Data Validation step even from a different client or in a scheduled job.
@@ -214,4 +214,4 @@
The integration between Hopsworks and Great Expectations makes it simple to add

## Going further

-If you wish to find out more about how to use the data validation API or best practices for development or production pipelines in Hopsworks, checkout the [advanced guide](advanced_data_validation.md).
+If you wish to find out more about how to use the data validation API or best practices for development or production pipelines in Hopsworks, checkout the [advanced guide](data_validation_advanced.md) and [best practices guide](data_validation_best_practices.md).
@@ -1,6 +1,6 @@
# Advanced Data Validation Options and Best Practices

-The introduction to data vaildation guide can be found [here](data_validation.md). The notebook example to get started with Data Validation in Hopsworks can be found [here](https://colab.research.google.com/github/logicalclocks/hopsworks-tutorials/blob/master/integrations/great_expectations/fraud_batch_data_validation.ipynb).
+The introduction to the data validation guide can be found [here](data_validation.md). The notebook example to get started with Data Validation in Hopsworks can be found [here](https://colab.research.google.com/github/logicalclocks/hopsworks-tutorials/blob/master/integrations/great_expectations/fraud_batch_data_validation.ipynb).

## Data Validation Configuration Options in Hopsworks

@@ -55,7 +55,7 @@
The one constant in life is change. If you need to add, remove or edit an expect

Go to the Feature Group edit page, in the expectation section. You can click on the expectation you want to edit and edit the json configuration. Check out Great Expectations documentation if you need more information on a particular expectation.

-### In Hopsworks Python Client
+#### In Hopsworks Python Client

There are several ways to edit an Expectation in the python client. You can use the Great Expectations API or go directly through Hopsworks. In the latter case, if you want to edit or remove an expectation, you will need the Hopsworks expectation ID. It can be found in the UI or in the meta field of an expectation. Note that you must have inserted data in the FG and attached the expectation suite to enable the Expectation API. A minimal sketch of that flow follows.
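The accessor and method names used here (`get_expectation_suite(ge_type=False)`, `replace_expectation`, `remove_expectation`) are assumptions to verify against the hsfs reference:

```python3
# Fetch the suite attached to the Feature Group as Hopsworks objects (assumed API)
suite = fg.get_expectation_suite(ge_type=False)

# The Hopsworks expectation ID lives in the meta field of each expectation
expectation = suite.expectations[0]
expectation_id = expectation.meta["expectationId"]

# Edit in place, or remove by ID (assumed API)
expectation.kwargs["min_value"] = 0
suite.replace_expectation(expectation)
suite.remove_expectation(expectation_id=expectation_id)
```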

@@ -122,7 +122,7 @@
The boilerplate of uploading report on insertion is taken care of by hopsworks,
```python3
fg.save_validation_report(ge_report)
```

-#### Monitor and Fetch Validation Reports
+### Monitor and Fetch Validation Reports

A summary of uploaded reports will then be available via an API call or in the Hopsworks UI, enabling easy monitoring. For in-depth analysis, it is possible to download the complete report from the UI.
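For instance, fetching report summaries from the client might look like the following; the helper names are assumptions to verify against the hsfs reference:

```python3
# Assumed API: latest report summary, or the full history of uploaded reports
latest_report = fg.get_latest_validation_report()
all_reports = fg.get_all_validation_reports()
```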

@@ -173,116 +173,3 @@
ge_report = ge_df.validate()
```

Note that you should always use an expectation suite that has been saved to Hopsworks if you intend to upload the associated validation report.

## Best Practices

Below is a set of recommendations and code snippets to help you follow best practices when integrating a data validation step into your feature engineering pipelines. Rather than being prescriptive, we want to showcase how the API and configuration options can help adapt validation to your use-case.

### Development

Data validation is generally considered to be a production-only feature and as such is often only set up once a project has reached the end of the development phase. At Hopsworks, we think there is a lot of value in setting up validation during early development. That's why we made it quick to get started and ensured that, by default, data validation is never an obstacle to inserting data.

#### Validate Early

As often with data validation, the best piece of advice is to set it up early in your development process. Use this phase to build a history you can draw on when the time comes to set quality requirements for a project in production. We made a code snippet to help you get started quickly:

```python3
import great_expectations as ge
import pandas as pd

# Load sample data. Replace it with your own!
my_data_df = pd.read_csv("https://repo.hops.works/master/hopsworks-tutorials/data/card_fraud_data/credit_cards.csv")

# Use Great Expectations profiler (ignore deprecation warning)
expectation_suite_profiled, validation_report = ge.from_pandas(my_data_df).profile(profiler=ge.profile.BasicSuiteBuilderProfiler)

# Create a Feature Group on Hopsworks with an expectation suite attached. Don't forget to change the primary key!
my_validated_data_fg = fs.get_or_create_feature_group(
    name="my_validated_data_fg",
    version=1,
    description="My data",
    primary_key=["cc_num"],
    expectation_suite=expectation_suite_profiled,
)
```

Any data you insert into the Feature Group from now on will be validated, and a report will be uploaded to Hopsworks.

```python3
# Insert and validate your data
insert_job, validation_report = my_validated_data_fg.insert(my_data_df)
```

The Great Expectations profiler can inspect your data to build a standard Expectation Suite. You can attach this Expectation Suite directly when creating your Feature Group to make sure every piece of data finding its way into Hopsworks gets validated. Hopsworks will default to its `"ALWAYS"` ingestion policy, meaning data are ingested whether validation succeeds or not. This way data validation is not a barrier, just a monitoring tool.
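The `"ALWAYS"` policy can also be set explicitly when saving the suite, mirroring the `"STRICT"` example in the production section below:

```python3
# Ingest regardless of the validation outcome; reports are still uploaded
my_validated_data_fg.save_expectation_suite(
    expectation_suite_profiled,
    validation_ingestion_policy="ALWAYS",
)
```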

#### Identify Unreliable Features

Once you set up data validation, every insertion will upload a validation report to Hopsworks. Identifying Features which often have null values or wild statistical variations can help detect unreliable Features that need refinement or should be avoided. Here are a few expectations you might find useful (a sketch of registering them follows the list):

- `expect_column_values_to_not_be_null`
- `expect_column_(min/max/mean/stdev)_to_be_between`
- `expect_column_values_to_be_unique`
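As a sketch, registering a couple of these on the profiled suite from above (the column names are placeholders for your own features):

```python3
from great_expectations.core import ExpectationConfiguration

expectation_suite_profiled.add_expectation(
    ExpectationConfiguration(
        expectation_type="expect_column_values_to_not_be_null",
        kwargs={"column": "cc_num"},
    )
)
expectation_suite_profiled.add_expectation(
    ExpectationConfiguration(
        expectation_type="expect_column_mean_to_be_between",
        kwargs={"column": "amount", "min_value": 0, "max_value": 10000},
    )
)
my_validated_data_fg.save_expectation_suite(expectation_suite_profiled)
```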

#### Get the stakeholders involved

Hopsworks UI helps involve every project stakeholder by enabling both setting and monitoring of data quality requirements. No coding skills needed! You can monitor data quality requirements by checking out the validation reports and results on the Feature Group page.

If you need to set or edit the existing requirements, you can go to the Feature Group edit page. The Expectation Suite section allows you to edit individual expectations and set success parameters that match ever-changing business requirements.

### Production

Models in production require high-quality data to make accurate predictions for your customers. Hopsworks can use your Expectation Suite as a gatekeeper, making it simple to prevent low-quality data from making its way into production. Below are some simple tips and snippets to make the most of your data validation when your project is ready to enter its production phase.

#### Be Strict in Production

Whether you reuse an existing Feature Group or create a new one for production (recommended), we recommend you set the validation ingestion policy of your Expectation Suite to `"STRICT"`.

```python3
fg_prod.save_expectation_suite(
    my_suite,
    validation_ingestion_policy="STRICT",
)
```

In this setup, Hopsworks will abort inserting a DataFrame that does not successfully fulfill all expectations in the attached Expectation Suite. This ensures data quality standards are upheld for every insertion and provides downstream users with strong guarantees.

#### Avoid Data Loss on Materialization Jobs

Aborting insertions of DataFrames which do not satisfy the data quality standards can lead to data loss in your materialization job. To avoid such loss, we recommend creating a duplicate Feature Group with the same Expectation Suite in `"ALWAYS"` mode to hold the rejected data.

```python3
job, report = fg_prod.insert(df)

if report["success"] is False:
job, report = fg_rejected.insert(df)
```

#### Take Advantage of the Validation History

You can easily retrieve the validation history of a specific expectation and export it to your favourite visualisation tool. You can filter on time and on whether the insertion was successful or not:

```python3
validation_history = fg.get_validation_history(
    expectation_id=my_id,
    filters=["REJECTED", "UNKNOWN"],
    ge_type=False,
)

timeseries = pd.DataFrame(
    {
        "observed_value": [res.result["observed_value"] for res in validation_history],
        "validation_time": [res.validation_time for res in validation_history],
    }
)

# export to your preferred Dashboard
```

#### Setup Alerts

While checking that your feature engineering pipeline executed properly in the morning can be good enough in the development phase, it won't make the cut for demanding production use-cases. In Hopsworks, you can set up alerts for when ingestion fails or succeeds.

First you will need to configure your preferred communication endpoint: Slack, email or Pagerduty. Check out [this page](../../../admin/alert.md) for more information on how to set it up. A typical use-case would be to add an alert on ingestion success to a Feature Group you created to hold data that failed validation. Here is a quick walkthrough:

1. Go to the Feature Group page in the UI
2. Scroll down and click on the `Add an alert` button.
3. Choose the trigger, receiver and severity and click save.

## Conclusion

Hopsworks complements Great Expectations by automatically running the validation, persisting the reports alongside your data, and allowing you to monitor data quality in its UI. How you decide to make use of these tools depends on your application and requirements. Whether in development or in production, real-time or batch, we think there is a configuration that will work for your team. Check out our [quick hands-on tutorial](https://colab.research.google.com/github/logicalclocks/hopsworks-tutorials/blob/master/integrations/great_expectations/fraud_batch_data_validation.ipynb) to start applying what you learned so far.