Commit 300a328

committed
[FSTORE-612] Add docs for feature monitoring
1 parent 7522cbe commit 300a328

35 files changed

+950
-156
lines changed

docs/admin/alert.md

+3-3
Original file line numberDiff line numberDiff line change
@@ -34,7 +34,7 @@ button on the left side of the **email** row and fill out the form that pops up.
3434
CRAM-MD5, LOGIN or PLAIN.
3535

3636
Optionally cluster wide Email alert receivers can be added in _Default receiver emails_.
37-
These receivers will be available to all users when they create event triggered [alerts](../../user_guides/fs/feature_group/advanced_data_validation/#setup-alerts).
37+
These receivers will be available to all users when they create event triggered [alerts](../../user_guides/fs/feature_group/data_validation_best_practices#setup-alerts).
3838

3939
### Step 3: Configure Slack Alerts
4040
Alerts can also be sent via Slack messages. To be able to send Slack messages you first need to configure
@@ -47,7 +47,7 @@ a Slack webhook. Click on the _Configure_ button on the left side of the **slack
4747
</figure>
4848

4949
Optionally cluster wide Slack alert receivers can be added in _Slack channel/user_.
50-
These receivers will be available to all users when they create event triggered [alerts](../../user_guides/fs/feature_group/advanced_data_validation/#setup-alerts).
50+
These receivers will be available to all users when they create event triggered [alerts](../../user_guides/fs/feature_group/data_validation_best_practices/#setup-alerts).
5151

5252
### Step 4: Configure Pagerduty Alerts
5353
Pagerduty is another way you can send alerts from Hopsworks. Click on the _Configure_ button on the left side of
@@ -93,7 +93,7 @@ global:
9393
...
9494
```
9595

96-
To test the alerts by creating triggers from Jobs and Feature group validations see [Alerts](../../user_guides/fs/feature_group/advanced_data_validation/#setup-alerts).
96+
To test the alerts by creating triggers from Jobs and Feature group validations see [Alerts](../../user_guides/fs/feature_group/data_validation_best_practices/#setup-alerts).
9797

9898
The yaml syntax in the UI is slightly different in that it does not allow double quotes (it will ignore the values but give no error).
9999
Below is an example configuration, that can be used in the UI, with both email and slack receivers configured for system alerts.
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,20 @@
1+
Feature Monitoring complements data validation capabilities by allowing you to monitor your feature data after it has been ingested into the Feature Store.
2+
3+
HSFS supports monitoring features on your Feature Group by:
4+
5+
- transparently **computing statistics** on the whole or a subset of feature data defined by a detection window.
6+
- **comparing statistics** against a reference window of feature data, and **configuring thresholds** to identify anomalous data.
7+
- **configuring alerts** based on the statistics comparison results.
8+
9+
## Scheduled Statistics
10+
11+
After creating a Feature Group in HSFS, you can set up statistics monitoring to compute statistics over one or more features on a scheduled basis. Statistics are computed on the whole or a subset of feature data (i.e., detection window) already inserted into the Feature Group.
12+
13+
## Statistics Comparison
14+
15+
In addition to scheduled statistics, you can enable the comparison of statistics against a reference subset of feature data (i.e., reference window) and define the criteria for this comparison including the statistics metric to compare and a threshold to identify anomalous values.
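The comparison described above can be illustrated with a minimal, library-free sketch. It does not use the HSFS API: the function name `compare_windows`, its parameters, and the sample values are hypothetical and exist only to show how a statistic computed on a detection window might be checked against a reference window with a threshold.

```python3
from statistics import mean


def compare_windows(detection, reference, threshold, relative=True):
    """Compare the mean of a detection window against a reference window.

    Hypothetical helper for illustration only; not part of HSFS.
    """
    detection_mean = mean(detection)
    reference_mean = mean(reference)
    difference = abs(detection_mean - reference_mean)
    # Use a relative difference when possible; fall back to the absolute
    # difference if the reference mean is zero to avoid dividing by zero.
    if relative and reference_mean != 0:
        difference = difference / abs(reference_mean)
    return {
        "detection_mean": detection_mean,
        "reference_mean": reference_mean,
        "difference": difference,
        "shift_detected": difference > threshold,
    }


# Made-up sample data: the reference window could be a training dataset.
reference_window = [10.0, 11.0, 9.0, 10.0]
detection_window = [12.0, 13.0, 11.0, 12.0]

result = compare_windows(detection_window, reference_window, threshold=0.1)
print(result["shift_detected"])
```

Here the chosen metric is the mean and the threshold is applied to the relative difference; in practice the metric and threshold are part of the monitoring configuration.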
16+
17+
!!! info "Feature Monitoring Guide"
18+
More information can be found in the [Feature monitoring guide](../../../user_guides/fs/feature_monitoring/index.md).
19+
20+
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,20 @@
1+
Feature Monitoring complements data validation capabilities by allowing you to monitor your feature data once they have been ingested into the Feature Store.
2+
3+
HSFS supports monitoring features on your Feature View by:
4+
5+
- transparently **computing statistics** on the whole or a subset of feature data defined by a detection window.
6+
- **comparing statistics** against a reference window of feature data (e.g., training dataset), and **configuring thresholds** to identify anomalous data.
7+
- **configuring alerts** based on the statistics comparison results.
8+
9+
## Scheduled Statistics
10+
11+
After creating a Feature View in HSFS, you can set up statistics monitoring to compute statistics over one or more features on a scheduled basis. Statistics are computed on the whole or a subset of feature data (i.e., detection window) using the Feature View query.
12+
13+
## Statistics Comparison
14+
15+
In addition to scheduled statistics, you can enable the comparison of statistics against a reference subset of feature data (i.e., reference window), typically a training dataset, and define the criteria for this comparison including the statistics metric to compare and a threshold to identify anomalous values.
16+
17+
!!! info "Feature Monitoring Guide"
18+
More information can be found in the [Feature monitoring guide](../../../user_guides/fs/feature_monitoring/index.md).
19+
20+

docs/css/custom.css

+35-28
Original file line numberDiff line numberDiff line change
@@ -1,58 +1,67 @@
11
:root {
2-
--md-primary-fg-color: #1EB382;
2+
--md-primary-fg-color: #1eb382;
33
--md-secondary-fg-color: #188a64;
44
--md-tertiary-fg-color: #0d493550;
55
--md-quaternary-fg-color: #fdfdfd;
66
--border-radius-variable: 5px;
77
}
88

99
.md-footer__inner:not([hidden]) {
10-
display: none
10+
display: none;
1111
}
1212

1313
/* Lex did stuff here */
14-
.svg_topnav{
14+
.svg_topnav {
1515
width: 12px;
1616
filter: invert(100);
1717
}
18-
.svg_topnav:hover{
18+
.svg_topnav:hover {
1919
width: 12px;
2020
filter: invert(10);
2121
}
2222

23-
.md-header[data-md-state=shadow] {
23+
.md-header[data-md-state="shadow"] {
2424
box-shadow: 0 0 0 0;
2525
}
2626

2727
.md-tabs__item:hover {
2828
background-color: var(--md-tertiary-fg-color);
2929
transition: background-color 450ms;
30-
3130
}
3231

33-
.md-sidebar__scrollwrap{
32+
.md-sidebar__scrollwrap {
3433
background-color: var(--md-quaternary-fg-color);
3534
padding: 15px 5px 5px 5px;
3635
border-radius: var(--border-radius-variable);
3736
}
3837

39-
40-
.image_logo_02{
41-
width:450px;
38+
.image_logo_02 {
39+
width: 450px;
4240
}
4341

4442
/* End of Lex did stuff here */
4543

44+
/* no-icon style for admonitions */
45+
.md-typeset .no-icon > .admonition-title::before,
46+
.md-typeset .no-icon > summary::before {
47+
display: none;
48+
}
49+
.md-typeset .no-icon > :is(.admonition-title, summary) {
50+
padding-left: 1rem;
51+
}
52+
/* end of no-icon style */
53+
4654
.md-header__button.md-logo {
47-
margin: .1rem;
48-
padding: .1rem;
55+
margin: 0.1rem;
56+
padding: 0.1rem;
4957
}
5058

51-
.md-header__button.md-logo img, .md-header__button.md-logo svg {
59+
.md-header__button.md-logo img,
60+
.md-header__button.md-logo svg {
5261
display: block;
5362
width: 1.8rem;
5463
height: 1.8rem;
55-
fill: currentColor;
64+
fill: rgba(43, 155, 70, 0.1);
5665
}
5766

5867
.md-tabs {
@@ -63,7 +72,6 @@
6372
transition: background-color 250ms;
6473
}
6574

66-
6775
.wrapper {
6876
display: grid;
6977
grid-template-columns: repeat(4, 1fr);
@@ -72,9 +80,9 @@
7280
}
7381

7482
.wrapper * {
75-
border: 2px solid green;
76-
text-align: center;
77-
padding: 70px 0;
83+
border: 2px solid green;
84+
text-align: center;
85+
padding: 70px 0;
7886
}
7987

8088
.one {
@@ -107,13 +115,12 @@
107115
display: none !important;
108116
}
109117

110-
111-
@media screen and (max-width: 479px){
112-
.md-sidebar--primary, .md-sidebar {
113-
z-index: 50 !important;
114-
}
115-
.md-logo {
116-
visibility: hidden;
117-
}
118-
119-
}
118+
@media screen and (max-width: 479px) {
119+
.md-sidebar--primary,
120+
.md-sidebar {
121+
z-index: 50 !important;
122+
}
123+
.md-logo {
124+
visibility: hidden;
125+
}
126+
}

docs/user_guides/fs/feature_group/data_validation.md

+2-2
Original file line numberDiff line numberDiff line change
@@ -174,7 +174,7 @@ That is all there is to it. Hopsworks will now automatically use your suite to v
174174
job, validation_report = fg.insert(df.head(5))
175175
```
176176

177-
As you can see, Hopsworks runs the validation in the client before attempting to insert the data. By default, Hopsworks will try to insert the data even if validation fails to prevent data loss. However it can be configured for production setup to be more restrictive, checkout the [data validation advanced guide](advanced_data_validation.md).
177+
As you can see, Hopsworks runs the validation in the client before attempting to insert the data. By default, Hopsworks will try to insert the data even if validation fails, to prevent data loss. However, it can be configured to be more restrictive for production setups; check out the [data validation advanced guide](data_validation_advanced.md).
178178

179179
!!!info
180180
Note that once the Expectation Suite is attached to the Feature Group, any subsequent attempt to insert to this Feature Group will apply the Data Validation step even from a different client or in a scheduled job.
@@ -214,4 +214,4 @@ The integration between Hopsworks and Great Expectations makes it simple to add
214214

215215
## Going further
216216

217-
If you wish to find out more about how to use the data validation API or best practices for development or production pipelines in Hopsworks, checkout the [advanced guide](advanced_data_validation.md).
217+
If you wish to find out more about how to use the data validation API or best practices for development or production pipelines in Hopsworks, check out the [advanced guide](data_validation_advanced.md) and [best practices guide](data_validation_best_practices.md).

docs/user_guides/fs/feature_group/advanced_data_validation.md docs/user_guides/fs/feature_group/data_validation_advanced.md

+3-116
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
# Advanced Data Validation Options and Best Practices
22

3-
The introduction to data vaildation guide can be found [here](data_validation.md). The notebook example to get started with Data Validation in Hopsworks can be found [here](https://colab.research.google.com/github/logicalclocks/hopsworks-tutorials/blob/master/integrations/great_expectations/fraud_batch_data_validation.ipynb).
3+
The introduction to data validation guide can be found [here](data_validation.md). The notebook example to get started with Data Validation in Hopsworks can be found [here](https://colab.research.google.com/github/logicalclocks/hopsworks-tutorials/blob/master/integrations/great_expectations/fraud_batch_data_validation.ipynb).
44

55
## Data Validation Configuration Options in Hopsworks
66

@@ -55,7 +55,7 @@ The one constant in life is change. If you need to add, remove or edit an expect
5555

5656
Go to the Feature Group edit page, in the expectation section. You can click on the expectation you want to edit and edit the json configuration. Check out Great Expectations documentation if you need more information on a particular expectation.
5757

58-
### In Hopsworks Python Client
58+
#### In Hopsworks Python Client
5959

6060
There are several ways to edit an Expectation in the Python client. You can use the Great Expectations API or go directly through Hopsworks. In the latter case, if you want to edit or remove an expectation, you will need the Hopsworks expectation ID. It can be found in the UI or in the meta field of an expectation. Note that you must have inserted data in the FG and attached the expectation suite to enable the Expectation API.
6161

@@ -122,7 +122,7 @@ The boilerplate of uploading report on insertion is taken care of by hopsworks,
122122
fg.save_validation_report(ge_report)
123123
```
124124

125-
#### Monitor and Fetch Validation Reports
125+
### Monitor and Fetch Validation Reports
126126

127127
A summary of uploaded reports will then be available via an API call or in the Hopsworks UI enabling easy monitoring. For in-depth analysis, it is possible to download the complete report from the UI.
128128

@@ -173,116 +173,3 @@ ge_report = ge_df.validate()
173173
```
174174

175175
Note that you should always use an expectation suite that has been saved to Hopsworks if you intend to upload the associated validation report.
176-
177-
## Best Practices
178-
179-
Below is a set of recommendations and code snippets to help our users follow best practices when it comes to integrating a data validation step in your feature engineering pipelines. Rather than being prescriptive, we want to showcase how the API and configuration options can help adapt validation to your use-case.
180-
181-
### Development
182-
183-
Data validation is generally considered to be a production-only feature and as such is often only setup once a project has reached the end of the development phase. At Hopsworks, we think there is a lot of value in setting up validation during early development. That's why we made it quick to get started and ensured that by default data validation is never an obstacle to inserting data.
184-
185-
#### Validate Early
186-
187-
As often with data validation, the best piece of advice is to set it up early in your development process. Use this phase to build a history you can then use when it becomes time to set quality requirements for a project in production. We made a code snippet to help you get started quickly:
188-
189-
```python3
190-
# Load sample data. Replace it with your own!
191-
my_data_df = pd.read_csv("https://repo.hops.works/master/hopsworks-tutorials/data/card_fraud_data/credit_cards.csv")
192-
193-
# Use Great Expectation profiler (ignore deprecation warning)
194-
expectation_suite_profiled, validation_report = ge.from_pandas(my_data_df).profile(profiler=ge.profile.BasicSuiteBuilderProfiler)
195-
196-
# Create a Feature Group on hopsworks with an expectation suite attached. Don't forget to change the primary key!
197-
my_validated_data_fg = fs.get_or_create_feature_group(
198-
name="my_validated_data_fg",
199-
version=1,
200-
description="My data",
201-
primary_key=['cc_num'],
202-
expectation_suite=expectation_suite_profiled)
203-
```
204-
205-
Any data you insert in the Feature Group from now will be validated and a report will be uploaded to Hopsworks.
206-
207-
```python3
208-
# Insert and validate your data
209-
insert_job, validation_report = my_validated_data_fg.insert(my_data_df)
210-
```
211-
212-
Great Expectations profiler can inspect your data to build a standard Expectation Suite. You can attach this Expectation Suite directly when creating your Feature Group to make sure every piece of data finding its way in Hopsworks gets validated. Hopsworks will default to its `"ALWAYS"` ingestion policy, meaning data are ingested whether validation succeeds or not. This way data validation is not a barrier, just a monitoring tool.
213-
214-
#### Identify Unreliable Features
215-
216-
Once you set up data validation, every insertion will upload a validation report to Hopsworks. Identifying Features which often have null values or wild statistical variations can help detect unreliable Features that need refinement or should be avoided. Here are a few expectations you might find useful:
217-
218-
- `expect_column_values_to_not_be_null`
219-
- `expect_column_(min/max/mean/stdev)_to_be_between`
220-
- `expect_column_values_to_be_unique`
221-
222-
#### Get the stakeholders involved
223-
224-
The Hopsworks UI helps involve every project stakeholder by enabling both setting and monitoring of data quality requirements. No coding skills needed! You can monitor data quality requirements by checking out the validation reports and results on the Feature Group page.
225-
226-
If you need to set or edit the existing requirements, you can go on the Feature Group edit page. The Expectation suite section allows you to edit individual expectations and set success parameters that match ever changing business requirements.
227-
228-
### Production
229-
230-
Models in production require high-quality data to make accurate predictions for your customers. Hopsworks can use your Expectation Suite as a gatekeeper to prevent low-quality data from making its way into production. Below are some simple tips and snippets to make the most of your data validation when your project is ready to enter its production phase.
231-
232-
#### Be Strict in Production
233-
234-
Whether you use an existing or create a new (recommended) Feature Group for production, we recommend you set the validation ingestion policy of your Expectation Suite to `"STRICT"`.
235-
236-
```python3
237-
fg_prod.save_expectation_suite(
238-
my_suite,
239-
validation_ingestion_policy="STRICT")
240-
```
241-
242-
In this setup, Hopsworks will abort inserting a DataFrame that does not successfully fulfill all expectations in the attached Expectation Suite. This ensures data quality standards are upheld for every insertion and provides downstream users with strong guarantees.
243-
244-
#### Avoid Data Loss on materialization jobs
245-
246-
Aborting insertions of DataFrames which do not satisfy the data quality standards can lead to data loss in your materialization job. To avoid such loss we recommend creating a duplicate Feature Group with the same Expectation Suite in `"ALWAYS"` mode which will hold the rejected data.
247-
248-
```python3
249-
job, report = fg_prod.insert(df)
250-
251-
if report["success"] is False:
252-
job, report = fg_rejected.insert(df)
253-
```
254-
255-
#### Take Advantage of the Validation History
256-
257-
You can easily retrieve the validation history of a specific expectation to export it to your favourite visualisation tool. You can filter on time and on whether insertion was successful or not.
258-
259-
```python3
260-
validation_history = fg.get_validation_history(
261-
expectation_id=my_id,
262-
filters=["REJECTED", "UNKNOWN"],
263-
ge_type=False
264-
)
265-
266-
timeseries = pd.DataFrame(
267-
{
268-
"observed_value": [res.result["observed_value"] for res in validation_history],
269-
"validation_time": [res.validation_time for res in validation_history]
270-
}
271-
)
272-
273-
# export to your preferred Dashboard
274-
```
275-
276-
#### Setup Alerts
277-
278-
While checking your feature engineering pipeline executed properly in the morning can be good enough in the development phase, it won't make the cut for demanding production use-cases. In Hopsworks, you can setup alerts if ingestion fails or succeeds.
279-
280-
First you will need to configure your preferred communication endpoint: slack, email or pagerduty. Check out [this page](../../../admin/alert.md) for more information on how to set it up. A typical use-case would be to add an alert on ingestion success to a Feature Group you created to hold data that failed validation. Here is a quick walkthrough:
281-
282-
1. Go to the Feature Group page in the UI.
283-
2. Scroll down and click on the `Add an alert` button.
284-
3. Choose the trigger, receiver and severity and click save.
285-
286-
## Conclusion
287-
288-
Hopsworks complements Great Expectations by automatically running the validation, persisting the reports along with your data and allowing you to monitor data quality in its UI. How you decide to make use of these tools depends on your application and requirements. Whether in development or in production, real-time or batch, we think there is a configuration that will work for your team. Check out our [quick hands-on tutorial](https://colab.research.google.com/github/logicalclocks/hopsworks-tutorials/blob/master/integrations/great_expectations/fraud_batch_data_validation.ipynb) to start applying what you learned so far.
