|
1 | 1 | # Advanced Data Validation Options and Best Practices
|
2 | 2 |
|
3 |
| -The introduction to data vaildation guide can be found [here](data_validation.md). The notebook example to get started with Data Validation in Hopsworks can be found [here](https://colab.research.google.com/github/logicalclocks/hopsworks-tutorials/blob/master/integrations/great_expectations/fraud_batch_data_validation.ipynb). |
| 3 | +The introduction to data validation guide can be found [here](data_validation.md). The notebook example to get started with Data Validation in Hopsworks can be found [here](https://colab.research.google.com/github/logicalclocks/hopsworks-tutorials/blob/master/integrations/great_expectations/fraud_batch_data_validation.ipynb). |
4 | 4 |
|
5 | 5 | ## Data Validation Configuration Options in Hopsworks
|
6 | 6 |
|
@@ -55,7 +55,7 @@ The one constant in life is change. If you need to add, remove or edit an expect
|
55 | 55 |
|
56 | 56 | Go to the Feature Group edit page, in the expectation section. You can click on the expectation you want to edit and edit the json configuration. Check out Great Expectations documentation if you need more information on a particular expectation.
|
57 | 57 |
|
58 |
| -### In Hopsworks Python Client |
| 58 | +#### In Hopsworks Python Client |
59 | 59 |
|
60 | 60 | There are several way to edit an Expectation in the python client. You can use Great Expectations API or directly go through Hopsworks. In the latter case, if you want to edit or remove an expectation, you will need the Hopsworks expectation ID. It can be found in the UI or in the meta field of an expectation. Note that you must have inserted data in the FG and attached the expectation suite to enable the Expectation API.
|
61 | 61 |
|
@@ -122,7 +122,7 @@ The boilerplate of uploading report on insertion is taken care of by hopsworks,
|
122 | 122 | fg.save_validation_report(ge_report)
|
123 | 123 | ```
|
124 | 124 |
|
125 |
| -#### Monitor and Fetch Validation Reports |
| 125 | +### Monitor and Fetch Validation Reports |
126 | 126 |
|
127 | 127 | A summary of uploaded reports will then be available via an API call or in the Hopsworks UI enabling easy monitoring. For in-depth analysis, it is possible to download the complete report from the UI.
|
128 | 128 |
|
@@ -173,116 +173,3 @@ ge_report = ge_df.validate()
|
173 | 173 | ```
|
174 | 174 |
|
175 | 175 | Note that you should always use an expectation suite that has been saved to Hopsworks if you intend to upload the associated validation report.
|
176 |
| - |
177 |
| -## Best Practices |
178 |
| - |
179 |
| -Below is a set of recommendations and code snippets to help our users follow best practices when it comes to integrating a data validation step in your feature engineering pipelines. Rather than being prescriptive, we want to showcase how the API and configuration options can help adapt validation to your use-case. |
180 |
| - |
181 |
| -### Development |
182 |
| - |
183 |
| -Data validation is generally considered to be a production-only feature and as such is often only setup once a project has reached the end of the development phase. At Hopsworks, we think there is a lot of value in setting up validation during early development. That's why we made it quick to get started and ensured that by default data validation is never an obstacle to inserting data. |
184 |
| - |
185 |
| -#### Validate Early |
186 |
| - |
187 |
| -As often with data validation, the best piece of advice is to set it up early in your development process. Use this phase to build a history you can then use when it becomes time to set quality requirements for a project in production. We made a code snippet to help you get started quickly: |
188 |
| - |
189 |
| -```python3 |
190 |
| -# Load sample data. Replace it with your own! |
191 |
| -my_data_df = pd.read_csv("https://repo.hops.works/master/hopsworks-tutorials/data/card_fraud_data/credit_cards.csv") |
192 |
| - |
193 |
| -# Use Great Expectation profiler (ignore deprecation warning) |
194 |
| -expectation_suite_profiled, validation_report = ge.from_pandas(my_data_df).profile(profiler=ge.profile.BasicSuiteBuilderProfiler) |
195 |
| - |
196 |
| -# Create a Feature Group on hopsworks with an expectation suite attached. Don't forget to change the primary key! |
197 |
| -my_validated_data_fg = fs.get_or_create_feature_group( |
198 |
| - name="my_validated_data_fg", |
199 |
| - version=1, |
200 |
| - description="My data", |
201 |
| - primary_key=['cc_num'], |
202 |
| - expectation_suite=expectation_suite_profiled) |
203 |
| -``` |
204 |
| - |
205 |
| -Any data you insert in the Feature Group from now will be validated and a report will be uploaded to Hopsworks. |
206 |
| - |
207 |
| -```python3 |
208 |
| -# Insert and validate your data |
209 |
| -insert_job, validation_report = my_validated_data_fg.insert(my_data_df) |
210 |
| -``` |
211 |
| - |
212 |
| -Great Expectations profiler can inspect your data to build a standard Expectation Suite. You can attach this Expectation Suite directly when creating your Feature Group to make sure every piece of data finding its way in Hopsworks gets validated. Hopsworks will default to its `"ALWAYS"` ingestion policy, meaning data are ingested whether validation succeeds or not. This way data validation is not a barrier, just a monitoring tool. |
213 |
| - |
214 |
| -#### Identify Unreliable Features |
215 |
| - |
216 |
| -Once you setup data validation, every insertion will upload a validation report to Hopsworks. Identifying Features which often have null values or wild statistical variations can help detecting unreliable Features that need refinements or should be avoided. Here are a few expectations you might find useful: |
217 |
| - |
218 |
| -- `expect_column_values_to_not_be_null` |
219 |
| -- `expect_column_(min/max/mean/stdev)_to_be_between` |
220 |
| -- `expect_column_values_to_be_unique` |
221 |
| - |
222 |
| -#### Get the stakeholders involved |
223 |
| - |
224 |
| -Hopsworks UI helps involve every project stakeholder by enabling both setting and monitoring of data quality requirements. No coding skills needed! You can monitor data quality requirements by checkint out the validation reports and results on the Feature Group page. |
225 |
| - |
226 |
| -If you need to set or edit the existing requirements, you can go on the Feature Group edit page. The Expectation suite section allows you to edit individual expectations and set success parameters that match ever changing business requirements. |
227 |
| - |
228 |
| -### Production |
229 |
| - |
230 |
| -Models in production require high-quality data to make accurate predictions for your customers. Hopsworks can use your Expectation Suite as a gatekeeper to make it simple to prevent low-quality data to make its way into production. Below are some simple tips and snippets to make the most of your data validation when your project is ready to enter its production phase. |
231 |
| - |
232 |
| -#### Be Strict in Production |
233 |
| - |
234 |
| -Whether you use an existing or create a new (recommended) Feature Group for production, we recommend you set the validation ingestion policy of your Expectation Suite to `"STRICT"`. |
235 |
| - |
236 |
| -```python3 |
237 |
| -fg_prod.save_expectation_suite( |
238 |
| - my_suite, |
239 |
| - validation_ingestion_policy="STRICT") |
240 |
| -``` |
241 |
| - |
242 |
| -In this setup, Hopsworks will abort inserting a DataFrame that does not successfully fullfill all expectations in the attached Expectation Suite. This ensures data quality standards are upheld for every insertion and provide downstream users with strong guarantees. |
243 |
| - |
244 |
| -#### Avoid Data Loss on materialization jobs |
245 |
| - |
246 |
| -Aborting insertions of DataFrames which do not satisfy the data quality standards can lead to data loss in your materialization job. To avoid such loss we recommend creating a duplicate Feature Group with the same Expectation Suite in `"ALWAYS"` mode which will hold the rejected data. |
247 |
| - |
248 |
| -```python3 |
249 |
| -job, report = fg_prod.insert(df) |
250 |
| - |
251 |
| -if report["success"] is False: |
252 |
| - job, report = fg_rejected.insert(df) |
253 |
| -``` |
254 |
| - |
255 |
| -#### Take Advantage of the Validation History |
256 |
| - |
257 |
| -You can easily retrieve the validation history of a specific expectation to export it to your favourite visualisation tool. You can filter on time and on whether insertion was successful or not |
258 |
| - |
259 |
| -```python3 |
260 |
| -validation_history = fg.get_validation_history( |
261 |
| - expectation_id=my_id, |
262 |
| - filters=["REJECTED", "UNKNOWN"], |
263 |
| - ge_type=False |
264 |
| -) |
265 |
| - |
266 |
| -timeseries = pd.DataFrame( |
267 |
| - { |
268 |
| - "observed_value": [res.result["observed_value"] for res in validation_histoy]], |
269 |
| - "validation_time": [res.validation_time for res in validation_history] |
270 |
| - } |
271 |
| -) |
272 |
| - |
273 |
| -# export to your preferred Dashboard |
274 |
| -``` |
275 |
| - |
276 |
| -#### Setup Alerts |
277 |
| - |
278 |
| -While checking your feature engineering pipeline executed properly in the morning can be good enough in the development phase, it won't make the cut for demanding production use-cases. In Hopsworks, you can setup alerts if ingestion fails or succeeds. |
279 |
| - |
280 |
| -First you will need to configure your preferred communication endpoint: slack, email or pagerduty. Check out [this page](../../../admin/alert.md) for more information on how to set it up. A typical use-case would be to add an alert on ingestion success to a Feature Group you created to hold data that failed validation. Here is a quick walkthrough: |
281 |
| - |
282 |
| -1. Go the Feature Group page in the UI |
283 |
| -2. Scroll down and click on the `Add an alert` button. |
284 |
| -3. Choose the trigger, receiver and severity and click save. |
285 |
| - |
286 |
| -## Conclusion |
287 |
| - |
288 |
| -Hopsworks completes Great Expectation by automatically running the validation, persisting the reports along your data and allowing you to monitor data quality in its UI. How you decide to make use of these tools depends on your application and requirements. Whether in development or in production, real-time or batch, we think there is configuration that will work for your team. Check out our [quick hands-on tutorial](https://colab.research.google.com/github/logicalclocks/hopsworks-tutorials/blob/master/integrations/great_expectations/fraud_batch_data_validation.ipynb) to start applying what you learned so far. |