Commit 7ce6da4

o-alex authored and SirOibaf committed
Model provenance - including init feature vector (#403)
1 parent 99ed691 commit 7ce6da4

File tree

5 files changed

+175
-16
lines changed

docs/user_guides/fs/provenance/provenance.md

+65-16
@@ -1,10 +1,28 @@
1-
# Provenance
1+
# Provenance
22

3-
## Introduction
3+
## Introduction
44

5-
Hopsworks feature store allows users to track provenance (lineage) between storage connectors, feature groups, feature views, training datasets and models. Tracking lineage allows users to determine where/if a feature group is being used. You can track if feature groups are being used to create additional (derived) feature groups or feature views.
5+
Hopsworks allows users to track provenance (lineage) between:
66

7-
You can interact with the provenance graph using the UI and the APIs.
7+
- storage connectors
8+
- feature groups
9+
- feature views
10+
- training datasets
11+
- models
12+
13+
In the provenance pages, we refer to any of the five entities above as a provenance artifact, or artifact for short.
14+
15+
With the following provenance graph:
16+
17+
```
18+
storage connector -> feature group -> feature group -> feature view -> training dataset -> model
19+
```
20+
21+
we call the artifact to the left the parent and the artifact to the right the child. So a feature view has a number of feature groups as parents and can have a number of training datasets as children.
22+
23+
Tracking provenance allows users to determine where and if an artifact is being used. You can track, for example, if feature groups are being used to create additional (derived) feature groups or feature views, or if their data is eventually used to train models.
24+
25+
You can interact with the provenance graph using the UI or the APIs.
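
The parent/child terminology can be illustrated with a small, self-contained sketch. It does not call the Hopsworks API: the artifact names and the `EDGES` list are hypothetical and only mirror the example chain shown earlier.

```python
from collections import defaultdict

# Hypothetical artifacts mirroring the example chain; not real Hopsworks objects.
EDGES = [
    ("storage connector", "feature group A"),
    ("feature group A", "feature group B"),
    ("feature group B", "feature view"),
    ("feature view", "training dataset"),
    ("training dataset", "model"),
]


def build_index(edges):
    """Index each artifact's direct parents (upstream) and children (downstream)."""
    parents, children = defaultdict(list), defaultdict(list)
    for parent, child in edges:
        parents[child].append(parent)
        children[parent].append(child)
    return parents, children


def upstream(artifact, parents):
    """Return every ancestor of an artifact, nearest first."""
    found, frontier = [], list(parents[artifact])
    while frontier:
        node = frontier.pop(0)
        if node not in found:
            found.append(node)
            frontier.extend(parents[node])
    return found


parents, children = build_index(EDGES)
print(parents["feature view"])     # direct parents of the feature view
print(children["feature view"])    # direct children of the feature view
print(upstream("model", parents))  # full upstream lineage of the model
```

In this toy graph the feature view has one parent and one child. In practice a feature view can have several feature groups as parents and several training datasets as children.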
826

927
## Step 1: Storage connector lineage
1028

@@ -28,7 +46,7 @@ The relationship between storage connectors and feature groups is captured autom
2846

2947
### Using the APIs
3048

31-
Starting from a feature group metadata object, you can traverse upstream the provenance graph to retrieve the metadata objects of the storage connectors that are part of the feature group. To do so, you can use the [get_storage_connector_provenance](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/feature_group_api/#get_storage_connector_provenance) method.
49+
Starting from a feature group metadata object, you can traverse upstream the provenance graph to retrieve the metadata objects of the storage connectors that are part of the feature group. To do so, you can use the [get_storage_connector_provenance](https://docs.hopsworks.ai/hopsworks-api/{{{ hopsworks_version }}}/generated/api/feature_group_api/#get_storage_connector_provenance) method.
3250

3351
=== "Python"
3452

@@ -53,7 +71,7 @@ Starting from a feature group metadata object, you can traverse upstream the pro
5371
user_profiles_fg.get_storage_connector()
5472
```
5573

56-
To traverse the provenance graph in the opposite direction (i.e. from the storage connector to the feature group), you can use the [get_feature_groups_provenance](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/storage_connector_api/#get_feature_groups_provenance) method. When navigating the provenance graph downstream, deleted feature groups are not tracked by provenance, so the `deleted` property always returns an empty list.
74+
To traverse the provenance graph in the opposite direction (i.e. from the storage connector to the feature group), you can use the [get_feature_groups_provenance](https://docs.hopsworks.ai/hopsworks-api/{{{ hopsworks_version }}}/generated/api/storage_connector_api/#get_feature_groups_provenance) method. When navigating the provenance graph downstream, deleted feature groups are not tracked by provenance, so the `deleted` property always returns an empty list.
5775

5876
=== "Python"
5977

@@ -79,15 +97,15 @@ To traverse the provenance graph in the opposite direction (i.e. from the storag
7997

8098
### Assign parents to a feature group
8199

82-
When creating a feature group, it is possible to specify a list of feature groups used to create the derived features. For example, you could have an external feature group defined over a Snowflake or Redshift table, which you use to compute the features and save them in a feature group. You can mark the external feature group as parent of the feature group you are creating by using the `parents` parameter in the [get_or_create_feature_group](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/feature_group_api/#get_or_create_feature_group) or [create_feature_group](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/feature_group_api/#create_feature_group) methods:
100+
When creating a feature group, it is possible to specify a list of feature groups used to create the derived features. For example, you could have an external feature group defined over a Snowflake or Redshift table, which you use to compute the features and save them in a feature group. You can mark the external feature group as parent of the feature group you are creating by using the `parents` parameter in the [get_or_create_feature_group](https://docs.hopsworks.ai/hopsworks-api/{{{ hopsworks_version }}}/generated/api/feature_group_api/#get_or_create_feature_group) or [create_feature_group](https://docs.hopsworks.ai/hopsworks-api/{{{ hopsworks_version }}}/generated/api/feature_group_api/#create_feature_group) methods:
83101

84102
=== "Python"
85103

86104
```python
87105
# Retrieve the feature group
88106
profiles_fg = fs.get_external_feature_group("user_profiles", version=1)
89107

90-
# Do feature engineering
108+
# Do feature engineering
91109
age_df = transaction_df.merge(profiles_fg.read(), on="cc_num", how="left")
92110
transaction_df["age_at_transaction"] = (age_df["datetime"] - age_df["birthdate"]) / np.timedelta64(1, "Y")
93111

@@ -103,7 +121,7 @@ When creating a feature group, it is possible to specify a list of feature group
103121
transaction_fg.insert(transaction_df)
104122
```
105123

106-
Another example use case for derived feature group is if you have a feature group containing features with daily resolution and you are using the content of that feature group to populate a second feature group with monthly resolution:
124+
Another example use case for derived feature group is if you have a feature group containing features with daily resolution and you are using the content of that feature group to populate a second feature group with monthly resolution:
107125

108126
=== "Python"
109127

@@ -112,7 +130,7 @@ Another example use case for derived feature group is if you have a feature grou
112130
daily_transaction_fg = fs.get_feature_group("daily_transaction", version=1)
113131
daily_transaction_df = daily_transaction_fg.read()
114132

115-
# Do feature engineering
133+
# Do feature engineering
116134
cc_group = daily_transaction_df[["cc_num", "amount", "datetime"]] \
117135
.groupby("cc_num") \
118136
.rolling("1M", on="datetime")
@@ -132,7 +150,7 @@ Another example use case for derived feature group is if you have a feature grou
132150

133151
### List feature group parents
134152

135-
You can query the provenance graph of a feature group using the UI and the APIs. From the APIs you can list the parent feature groups by calling the method [get_parent_feature_groups](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/feature_group_api/#get_parent_feature_groups)
153+
You can query the provenance graph of a feature group using the UI and the APIs. From the APIs you can list the parent feature groups by calling the method [get_parent_feature_groups](https://docs.hopsworks.ai/hopsworks-api/{{{ hopsworks_version }}}/generated/api/feature_group_api/#get_parent_feature_groups)
136154

137155
=== "Python"
138156

@@ -151,7 +169,7 @@ You can query the provenance graph of a feature group using the UI and the APIs.
151169

152170
A parent is marked as `deleted` (and added to the deleted list) if the parent feature group was deleted, and as `inaccessible` if you no longer have access to it (e.g. the parent feature group belongs to a project you no longer have access to).
153171

154-
To traverse the provenance graph in the opposite direction (i.e. from the parent feature group to the child), you can use the [get_generated_feature_groups](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/feature_group_api/#get_generated_feature_groups) method. When navigating the provenance graph downstream, deleted feature groups are not tracked by provenance, so the `deleted` property always returns an empty list.
172+
To traverse the provenance graph in the opposite direction (i.e. from the parent feature group to the child), you can use the [get_generated_feature_groups](https://docs.hopsworks.ai/hopsworks-api/{{{ hopsworks_version }}}/generated/api/feature_group_api/#get_generated_feature_groups) method. When navigating the provenance graph downstream, deleted feature groups are not tracked by provenance, so the `deleted` property always returns an empty list.
155173

156174
=== "Python"
157175

@@ -180,7 +198,7 @@ The relationship between feature groups and feature views is captured automatica
180198

181199
### Using the APIs
182200

183-
Starting from a feature view metadata object, you can traverse upstream the provenance graph to retrieve the metadata objects of the feature groups that are part of the feature view. To do so, you can use the [get_parent_feature_groups](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/feature_view_api/#get_parent_feature_groups) method.
201+
Starting from a feature view metadata object, you can traverse upstream the provenance graph to retrieve the metadata objects of the feature groups that are part of the feature view. To do so, you can use the [get_parent_feature_groups](https://docs.hopsworks.ai/hopsworks-api/{{{ hopsworks_version }}}/generated/api/feature_view_api/#get_parent_feature_groups) method.
184202

185203
=== "Python"
186204

@@ -204,14 +222,37 @@ You can also traverse the provenance graph in the opposite direction. Starting f
204222
```python
205223
lineage = transaction_fg.get_generated_feature_views()
206224

207-
# List all accessible downstream feature views
225+
# List all accessible downstream feature views
208226
lineage.accessible
209227

210-
# List all the inaccessible downstream feature views
228+
# List all the inaccessible downstream feature views
211229
lineage.inaccessible
212230
```
213231

214-
### Using the UI
232+
Users can call the [get_models_provenance](https://docs.hopsworks.ai/hopsworks-api/{{{ hopsworks_version }}}/generated/api/feature_view_api/#get_models_provenance) method which will return a [Link](#provenance-links) object.
233+
234+
You can also retrieve the accessible models directly, without extracting them from the provenance links object:
235+
=== "Python"
236+
237+
```python
238+
#List all accessible models
239+
models = fraud_fv.get_models()
240+
241+
#List accessible models trained from a specific training dataset version
242+
models = fraud_fv.get_models(training_dataset_version=1)
243+
```
244+
245+
A utility method is also available to retrieve the most recently trained model among the user's accessible models. The most recent model is determined by the timestamp at which it was saved to the model registry.
246+
=== "Python"
247+
248+
```python
249+
#Retrieve newest model from all user's accessible models based on this feature view
250+
model = fraud_fv.get_newest_model()
251+
#Retrieve newest model from all user's accessible models based on this training dataset version
252+
model = fraud_fv.get_newest_model(training_dataset_version=1)
253+
```
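
As a rough illustration of selecting the model with the latest save timestamp, the sketch below works on a list of plain records. The `ModelRecord` class and its `created` field are hypothetical stand-ins, not the model registry's actual metadata objects.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class ModelRecord:
    # Hypothetical stand-in for a model's registry metadata.
    name: str
    version: int
    training_dataset_version: int
    created: datetime  # assumed: time the model was saved to the registry


def newest_model(
    models: List[ModelRecord],
    training_dataset_version: Optional[int] = None,
) -> Optional[ModelRecord]:
    """Return the most recently saved model, optionally limited to one training dataset version."""
    if training_dataset_version is not None:
        models = [
            m for m in models
            if m.training_dataset_version == training_dataset_version
        ]
    # `default=None` covers the case where no model matches.
    return max(models, key=lambda m: m.created, default=None)


records = [
    ModelRecord("fraud", 1, 1, datetime(2024, 1, 10)),
    ModelRecord("fraud", 2, 1, datetime(2024, 3, 5)),
    ModelRecord("fraud", 3, 2, datetime(2024, 2, 20)),
]
print(newest_model(records).version)
print(newest_model(records, training_dataset_version=2).version)
```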
254+
255+
### Using the UI
215256

216257
In the feature view overview UI you can explore the provenance graph of the feature view:
217258

@@ -221,3 +262,11 @@ In the feature view overview UI you can explore the provenance graph of the feat
221262
<figcaption>Feature view provenance graph</figcaption>
222263
</figure>
223264
</p>
265+
266+
## Provenance Links
267+
268+
All the `_provenance` methods return a `Link` dictionary object that contains `accessible`, `inaccessible`, and `deleted` lists.
269+
270+
- `accessible` - contains any artifact from the result that the user has access to.
271+
- `inaccessible` - contains any artifacts that might have been shared at some point in the past, but where this sharing was retracted. Since the relation between artifacts is still maintained in the provenance, the user will only have access to limited metadata and the artifacts will be included in this `inaccessible` list.
272+
- `deleted` - contains artifacts that have been deleted but whose children are still present in the system. Only a minimal amount of metadata is kept for deleted artifacts, allowing limited human-readable identification.
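
To make the shape of such a result concrete, here is a minimal sketch using a stand-in class. `LinkStub` is hypothetical and only mirrors the three lists described above; it is not the library's `Link` implementation, and the artifact names are made up.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class LinkStub:
    # Hypothetical stand-in mirroring the three lists of a provenance link.
    accessible: List[Any] = field(default_factory=list)
    inaccessible: List[Any] = field(default_factory=list)
    deleted: List[Any] = field(default_factory=list)


def summarize(link: LinkStub) -> Dict[str, int]:
    """Count the artifacts in each list of a provenance link."""
    return {
        "accessible": len(link.accessible),
        "inaccessible": len(link.inaccessible),
        "deleted": len(link.deleted),
    }


link = LinkStub(accessible=["fraud_fv_v1"], inaccessible=["shared_fv_v2"])
print(summarize(link))

# Only the accessible artifacts are expected to be fully usable objects.
for artifact in link.accessible:
    print("usable:", artifact)
```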
@@ -0,0 +1,109 @@
1+
# Provenance
2+
3+
## Introduction
4+
5+
Hopsworks allows users to track provenance (lineage) between:
6+
7+
- storage connectors
8+
- feature groups
9+
- feature views
10+
- training datasets
11+
- models
12+
13+
In the provenance pages, we refer to any of the five entities above as a provenance artifact, or artifact for short.
14+
15+
With the following provenance graph:
16+
17+
```
18+
storage connector -> feature group -> feature group -> feature view -> training dataset -> model
19+
```
20+
21+
we call the artifact to the left the parent and the artifact to the right the child. So a feature view has a number of feature groups as parents and can have a number of training datasets as children.
22+
23+
Tracking provenance allows users to determine where and if an artifact is being used. You can track, for example, if feature groups are being used to create additional (derived) feature groups or feature views, or if their data is eventually used to train models.
24+
25+
You can interact with the provenance graph using the UI or the APIs.
26+
27+
## Model provenance
28+
29+
The relationship between feature views and models is captured in the model [constructor](https://docs.hopsworks.ai/hopsworks-api/{{{ hopsworks_version }}}/generated/model_registry/model_api/#create_model). If you do not provide at least the feature view object to the constructor, the provenance will not capture this relation and you will not be able to navigate from model to the feature view it used or from the feature view to this model.
30+
31+
You can provide the feature view object and have the training dataset version be inferred.
32+
33+
=== "Python"
34+
35+
```python
36+
# this fv object will be provided to the model constructor
37+
fv = hsfs.get_feature_view(...)
38+
39+
# when calling training data related methods on the feature view, the training dataset version is cached in the feature view and is implicitly provided to the model constructor
40+
X_train, X_test, y_train, y_test = fv.train_test_split(...)
41+
42+
# provide the feature_view object in the model constructor
43+
hsml.model_registry.ModelRegistry.python.create_model(
44+
...
45+
feature_view = fv,
46+
...)
47+
```
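
The inference described in the comments above can be sketched with stand-in objects. Everything here is hypothetical (`FeatureViewStub`, `resolve_training_dataset_version`, the version counter). The sketch also assumes that an explicitly passed version takes precedence over the cached one.

```python
from typing import Optional


class FeatureViewStub:
    """Hypothetical stand-in that remembers the last training dataset version it produced."""

    def __init__(self) -> None:
        self.last_training_dataset_version: Optional[int] = None
        self._next_version = 1

    def train_test_split(self) -> int:
        # Assumed behavior: creating training data records its version on the object.
        version = self._next_version
        self._next_version += 1
        self.last_training_dataset_version = version
        return version


def resolve_training_dataset_version(
    feature_view: FeatureViewStub,
    training_dataset_version: Optional[int] = None,
) -> Optional[int]:
    """Prefer an explicit version; otherwise fall back to the one cached on the feature view."""
    if training_dataset_version is not None:
        return training_dataset_version
    return feature_view.last_training_dataset_version


fv_stub = FeatureViewStub()
fv_stub.train_test_split()
print(resolve_training_dataset_version(fv_stub))     # inferred from the cached value
print(resolve_training_dataset_version(fv_stub, 7))  # explicit version (assumed to win)
```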
48+
49+
You can of course explicitly provide the training dataset version.
50+
=== "Python"
51+
52+
```python
53+
# this object will be provided to the model constructor
54+
fv = hsfs.get_feature_view(...)
55+
56+
# this training dataset version will be provided to the model constructor
57+
X_train, X_test, y_train, y_test = fv.get_train_test_split(training_dataset_version=1)
58+
59+
# provide the feature_view object in the model constructor
60+
hsml.model_registry.ModelRegistry.python.create_model(
61+
...
62+
feature_view = fv,
63+
training_dataset_version = 1,
64+
...)
65+
```
66+
67+
Once the relation is stored in the provenance graph, you can navigate the graph from model to feature view or training dataset and the other way around.
68+
69+
Users can call the [get_feature_view_provenance](https://docs.hopsworks.ai/hopsworks-api/{{{ hopsworks_version }}}/generated/model_registry/model_api/#get_feature_view_provenance) method or the [get_training_dataset_provenance](https://docs.hopsworks.ai/hopsworks-api/{{{ hopsworks_version }}}/generated/model_registry/model_api/#get_training_dataset_provenance) method, each of which returns a [Link](#provenance-links) object.
70+
71+
You can also retrieve the parent feature view object directly, without extracting it from the provenance links object, using the [get_feature_view](https://docs.hopsworks.ai/hopsworks-api/{{{ hopsworks_version }}}/generated/model_registry/model_api/#get_feature_view) method:
72+
73+
=== "Python"
74+
75+
```python
76+
feature_view = model.get_feature_view()
77+
```
78+
79+
This utility method can also initialize the components required for batch or online retrieval of feature vectors.
80+
81+
=== "Python"
82+
83+
```python
84+
model.get_feature_view(init: bool = True, online: Optional[bool] = None)
85+
```
86+
87+
By default, the base initialization for feature vector retrieval is enabled. If your workflow requires more specific options, you can disable it by setting `init` to `False`.
88+
The method detects whether it is running within a deployment and, if so, initializes feature vector retrieval for serving.
89+
If the `online` argument is provided and is `True`, it initializes online feature vector retrieval.
90+
If the `online` argument is provided and is `False`, it initializes feature vector retrieval for batch scoring.
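
The rules above can be summarized as a small decision function. This is only a sketch under stated assumptions: `resolve_init_mode` and its `in_deployment` flag are hypothetical, and it assumes that an explicit `online` value takes precedence over deployment detection. The case where `online` is not provided outside a deployment is not described here, so the sketch reports it as unspecified.

```python
from typing import Optional


def resolve_init_mode(
    init: bool = True,
    online: Optional[bool] = None,
    in_deployment: bool = False,
) -> str:
    """Return which feature vector retrieval mode would be initialized."""
    if not init:
        return "none"         # base init disabled; the caller initializes manually
    if online is True:
        return "online"       # explicit request for online retrieval
    if online is False:
        return "batch"        # explicit request for batch scoring
    if in_deployment:
        return "serving"      # deployment detected and no explicit `online` value
    return "unspecified"      # not covered by the description above


print(resolve_init_mode(init=False))
print(resolve_init_mode(online=True))
print(resolve_init_mode(online=False))
print(resolve_init_mode(in_deployment=True))
```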
91+
92+
### Using the UI
93+
94+
In the model overview UI you can explore the provenance graph of the model:
95+
96+
<p align="center">
97+
<figure>
98+
<img src="../../../../assets/images/guides/mlops/provenance/provenance_model.png" alt="Model provenance graph">
99+
<figcaption>Model provenance graph</figcaption>
100+
</figure>
101+
</p>
102+
103+
## Provenance Links
104+
105+
All the `_provenance` methods return a `Link` dictionary object that contains `accessible`, `inaccessible`, and `deleted` lists.
106+
107+
- `accessible` - contains any artifact from the result that the user has access to.
108+
- `inaccessible` - contains any artifacts that might have been shared at some point in the past, but where this sharing was retracted. Since the relation between artifacts is still maintained in the provenance, the user will only have access to limited metadata and the artifacts will be included in this `inaccessible` list.
109+
- `deleted` - contains artifacts that have been deleted but whose children are still present in the system. Only a minimal amount of metadata is kept for deleted artifacts, allowing limited human-readable identification.

mkdocs.yml

+1
@@ -195,6 +195,7 @@ nav:
195195
- API Protocol: user_guides/mlops/serving/api-protocol.md
196196
- Troubleshooting: user_guides/mlops/serving/troubleshooting.md
197197
- Vector Database: user_guides/mlops/vector_database/index.md
198+
- Provenance: user_guides/mlops/provenance/provenance.md
198199
- Migration:
199200
- 3.X to 4.0: user_guides/migration/40_migration.md
200201
- Setup and Administration:
