|
| 1 | +# Provenance |
| 2 | + |
| 3 | +## Introduction |
| 4 | + |
| 5 | +The Hopsworks feature store allows users to track provenance (lineage) between storage connectors, feature groups, feature views, training datasets and models. Tracking lineage lets you determine whether and where a feature group is being used: you can see if it is used to create additional (derived) feature groups or feature views, or to train models. |
| 6 | + |
| 7 | +You can interact with the provenance graph using the UI and the APIs. |
| 8 | + |
| 9 | +## Model provenance |
| 10 | + |
| 11 | +The relationship between feature views and models is captured when you create a model. If you do not provide at least the feature view object to the constructor, provenance will not capture this relation, and you will not be able to navigate from the model to the feature view it used, or from the feature view to the models created from it. |
| 12 | + |
| 13 | +You can provide the feature view object and have the training dataset version be inferred. |
| 14 | +=== "Python" |
| 15 | + ```python |
| 16 | + # this object will be provided to the model constructor |
| 17 | + feature_view = hsfs.get_feature_view(...) |
| 18 | + |
| 19 | + # when calling this method, the training dataset version is cached in the feature view and is implicitly provided to the model constructor |
| 20 | + X_train, X_test, y_train, y_test = feature_view.train_test_split(...) |
| 21 | + |
| 22 | + # provide the feature_view object in the model constructor |
| 23 | + hsml.model_registry.ModelRegistry.python.create_model(..., feature_view = feature_view) |
| 24 | + ``` |
| 25 | + |
| 26 | +You can also provide the training dataset version explicitly. |
| 27 | +=== "Python" |
| 28 | + ```python |
| 29 | + # this object will be provided to the model constructor |
| 30 | + feature_view = hsfs.get_feature_view(...) |
| 31 | + |
| 32 | + # this training dataset version will be provided to the model constructor |
| 33 | + X_train, X_test, y_train, y_test = feature_view.get_train_test_split(training_dataset_version=1) |
| 34 | + |
| 35 | + # provide both the feature_view object and the training dataset version (1, as used above) in the model constructor |
| 36 | + hsml.model_registry.ModelRegistry.python.create_model(..., feature_view = feature_view, training_dataset_version = 1) |
| 37 | + ``` |
| 38 | + |
| 39 | +Once the relation is stored in the provenance graph, you can navigate the graph from model to feature view and the other way around. |
| 40 | + |
| 41 | +Users can call the provenance method, which returns a Link object containing the parent feature view in one of the `accessible`, `deleted` or `inaccessible` lists. |
| 42 | +* If the user has access to both the model and the feature view (including through shared feature stores), the feature view will be present in the `accessible` list. |
| 43 | +* If the user had access to the feature view through a shared feature store and used it to generate the model, but access to the shared feature store was later restricted, the relation is still maintained in the provenance. The user then has access only to limited metadata for the feature view, and the provenance method will return it in the `inaccessible` list. |
| 44 | +* If the feature view was deleted after the model was created, the provenance retains the relation with a minimal amount of metadata for the feature view, and the provenance method will return the feature view in the `deleted` list. |
| 45 | + |
| 46 | +=== "Python" |
| 47 | + ```python |
| 48 | + lineage = model.get_feature_view_provenance() |
| 49 | + |
| 50 | + # List accessible parent feature view |
| 51 | + lineage.accessible |
| 52 | + |
| 53 | + # List deleted parent feature view |
| 54 | + lineage.deleted |
| 55 | + |
| 56 | + # List inaccessible parent feature view |
| 57 | + lineage.inaccessible |
| 58 | + ``` |
| 59 | + |
| 60 | +You can also retrieve the training dataset provenance object. |
| 61 | +=== "Python" |
| 62 | + |
| 63 | + ```python |
| 64 | + lineage = model.get_training_dataset_provenance() |
| 65 | + |
| 66 | + # List accessible parent training dataset |
| 67 | + lineage.accessible |
| 68 | + |
| 69 | + # List deleted parent training dataset |
| 70 | + lineage.deleted |
| 71 | + |
| 72 | + # List inaccessible parent training dataset |
| 73 | + lineage.inaccessible |
| 74 | + ``` |
| 75 | + |
| 76 | +You can also retrieve the parent feature view object directly, without extracting it from the provenance links object. |
| 77 | +=== "Python" |
| 78 | + |
| 79 | + ```python |
| 80 | + feature_view = model.get_feature_view() |
| 81 | + ``` |
| 82 | +This utility method also has options to initialize the components required for batch or online retrieval of feature vectors. |
| 83 | +=== "Python" |
| 84 | + |
| 85 | + ```python |
| 86 | + model.get_feature_view(init: bool = True, online: Optional[bool] = None) |
| 87 | + ``` |
| 88 | + |
| 89 | +By default, the base initialization for feature vector retrieval is enabled. If your workflow requires more specific options, you can disable it by setting `init` to `False`. |
| 90 | +The method detects whether it is running within a deployment and, if so, initializes feature vector retrieval for serving. |
| 91 | +If the `online` argument is provided and is `True`, it initializes online feature vector retrieval. |
| 92 | +If the `online` argument is provided and is `False`, it initializes feature vector retrieval for batch scoring. |
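
If you need to react differently to each of the three provenance lists, rather than calling `model.get_feature_view()`, a small helper can classify the parent of a model. The following is a minimal sketch and not part of the library's API: `LinksStandIn` and `resolve_parent` are hypothetical names introduced for illustration, and the stand-in only mirrors the `accessible`, `deleted` and `inaccessible` list attributes described earlier. The order in which the lists are checked is also an assumption.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional, Tuple


@dataclass
class LinksStandIn:
    """Hypothetical stand-in for the provenance links object (not the real class)."""

    accessible: List[Any] = field(default_factory=list)
    deleted: List[Any] = field(default_factory=list)
    inaccessible: List[Any] = field(default_factory=list)


def resolve_parent(links: Any) -> Tuple[str, Optional[Any]]:
    """Return (status, parent) for the first non-empty list.

    Assumed check order: accessible, then inaccessible, then deleted.
    Returns ("missing", None) when no relation was captured.
    """
    if links.accessible:
        return "accessible", links.accessible[0]
    if links.inaccessible:
        # Only limited metadata is available for this parent.
        return "inaccessible", links.inaccessible[0]
    if links.deleted:
        # The parent was deleted; only minimal metadata is retained.
        return "deleted", links.deleted[0]
    return "missing", None


if __name__ == "__main__":
    status, parent = resolve_parent(LinksStandIn(deleted=["old_feature_view"]))
    print(status, parent)
```

With a real model, you would pass the object returned by `model.get_feature_view_provenance()` in place of the stand-in, assuming it exposes the same three list attributes.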