[FSTORE-1672] Allow multiple on-demand features to be returned from an on-demand transformation function and allow passing of local variables to a transformation function #439

Merged: 1 commit, Feb 3, 2025
12 changes: 6 additions & 6 deletions docs/user_guides/fs/feature_group/on_demand_transformations.md
@@ -5,17 +5,13 @@
## On Demand Transformation Function Creation


An on-demand transformation function may be created by associating a [transformation function](../transformation_functions.md) with a feature group. Each on-demand transformation function generates a single on-demand feature, which, by default, is assigned the same name as the associated transformation function. For instance, in the example below, the on-demand transformation function `transaction_age` produces an on-demand feature named transaction_age. Alternatively, the name of the resulting on-demand feature can be explicitly defined using the [`alias`](../transformation_functions.md#specifying-output-features–names-for-transformation-functions) function.

It is important to note that only one-to-one or many-to-one transformation functions are compatible with the creation of on-demand transformation functions.
An on-demand transformation function may be created by associating a [transformation function](../transformation_functions.md) with a feature group. Each on-demand transformation function can generate one or multiple on-demand features. If the on-demand transformation function returns a single feature, it is automatically assigned the same name as the transformation function. However, if it returns multiple features, they are by default named using the format `functionName_outputColumnNumber`. For instance, in the example below, the on-demand transformation function `transaction_age` produces an on-demand feature named `transaction_age`, and the on-demand transformation function `stripped_strings` produces the on-demand features named `stripped_strings_0` and `stripped_strings_1`. Alternatively, the name of the resulting on-demand feature can be explicitly defined using the [`alias`](../transformation_functions.md#specifying-output-features–names-for-transformation-functions) function.
Contributor
wouldn't it be functionName_outputTupleElementIndex instead of functionName_outputColumnNumber (maybe i am misunderstanding)

Contributor Author

Basically they mean the same thing; there might be a better way of phrasing it. I went with outputColumnNumber because a Pandas DataFrame can also be returned to produce multiple features from a transformation function, and in that case the naming depends on the column number.

In general, each output feature is named based on its position in the output tuple or DataFrame.
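To make the naming rule discussed in this thread concrete, here is a plain-Python sketch (`default_output_names` is a hypothetical helper for illustration, not part of the Hopsworks codebase):

```python
def default_output_names(function_name, n_outputs):
    """Sketch of the default on-demand feature naming rule: a single
    output keeps the transformation function's name, while multiple
    outputs get positional suffixes based on the tuple element index
    or DataFrame column number."""
    if n_outputs == 1:
        return [function_name]
    return [f"{function_name}_{i}" for i in range(n_outputs)]

print(default_output_names("transaction_age", 1))   # ['transaction_age']
print(default_output_names("stripped_strings", 2))  # ['stripped_strings_0', 'stripped_strings_1']
```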


!!! warning "On-demand transformation"
All on-demand transformation functions attached to a feature group must have unique names and, in contrast to model-dependent transformations, they do not have access to training dataset statistics.

Each on-demand transformation function can map specific features to its arguments by explicitly providing their names as arguments to the transformation function. If no feature names are provided, the transformation function will default to using features that match the name of the transformation function's argument.



=== "Python"
!!! example "Creating on-demand transformation functions."
```python
@@ -24,14 +20,18 @@ Each on-demand transformation function can map specific features to its argument
def transaction_age(transaction_date, current_date):
return (current_date - transaction_date).dt.days

@hopsworks.udf(return_type=[str, str], drop=["current_date"])
def stripped_strings(country, city):
return country.strip(), city.strip()

# Attach transformation function to feature group to create on-demand transformation function.
fg = feature_store.create_feature_group(name="fg_transactions",
version=1,
description="Transaction Features",
online_enabled=True,
primary_key=['id'],
event_time='event_time',
transformation_functions=[transaction_age]
transformation_functions=[transaction_age, stripped_strings]
)
```
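The argument-matching rule described above (explicitly passed feature names win; otherwise each UDF argument defaults to the feature with the same name) can be sketched in plain Python. `resolve_input_features` is a hypothetical helper for illustration, not part of the Hopsworks API:

```python
def resolve_input_features(udf_arg_names, available_features, explicit_names=None):
    """Sketch of the mapping rule: if feature names are passed
    explicitly they are used in order; otherwise each UDF argument
    is matched to the feature with the same name."""
    names = explicit_names if explicit_names is not None else list(udf_arg_names)
    missing = [n for n in names if n not in available_features]
    if missing:
        raise ValueError(f"features not found in feature group: {missing}")
    return dict(zip(udf_arg_names, names))

# Defaults to matching the UDF's argument names against feature names.
mapping = resolve_input_features(["transaction_date", "current_date"],
                                 {"transaction_date", "current_date", "amount"})
```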

15 changes: 14 additions & 1 deletion docs/user_guides/fs/feature_view/batch-data.md
@@ -53,4 +53,17 @@ If you have specified transformation functions when creating a feature view, you
feature_view.init_batch_scoring(training_dataset_version=1)
```

It is important to note that, in addition to the filters defined in the feature view, [extra filters](./training-data.md#Extra-filters) will be applied if they are defined in the given training dataset version.


## Passing Context Variables to Transformation Functions
After [defining a transformation function using a context variable](../transformation_functions.md#passing-context-variables-to-transformation-function), you can pass the necessary context variables through the `transformation_context` parameter when fetching batch data.


=== "Python"
!!! example "Passing context variables while fetching batch data."
```python
# Passing context variables when fetching batch data.
batch_data = feature_view.get_batch_data(transformation_context={"context_parameter":10})

```
13 changes: 13 additions & 0 deletions docs/user_guides/fs/feature_view/feature-vectors.md
@@ -191,6 +191,19 @@ You can also use the parameter to provide values for all the features which are
)
```

## Passing Context Variables to Transformation Functions
After [defining a transformation function using a context variable](../transformation_functions.md#passing-context-variables-to-transformation-function), you can pass the required context variables using the `transformation_context` parameter when fetching the feature vectors.

=== "Python"
!!! example "Passing context variables while fetching batch data."
```python
# Passing context variable to IN-MEMORY Training Dataset.
batch_data = feature_view.get_feature_vectors(
entry = [{ "pk1": 1 }],
transformation_context={"context_parameter":10}
)
```

## Choose the right Client

The Online Store can be accessed via the **Python** or **Java** client allowing you to use your language of choice to connect to the Online Store. Additionally, the Python client provides two different implementations to fetch data: **SQL** or **REST**. The SQL client is the default implementation. It requires a direct SQL connection to your RonDB cluster and uses python asyncio to offer high performance even when your Feature View rows involve querying multiple different tables. The REST client is an alternative implementation connecting to [RonDB Feature Vector Server](./feature-server.md). Perfect if you want to avoid exposing ports of your database cluster directly to clients. This implementation is available as of Hopsworks 3.7.
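The decision between the clients described above can be summarized as a small sketch (`choose_online_client` is a hypothetical helper for illustration, not a Hopsworks API):

```python
def choose_online_client(language, can_reach_rondb_ports, hopsworks_version=(3, 7)):
    """Sketch of the client choice described above: the Java client for
    Java applications; for Python, the default SQL client when a direct
    RonDB connection is possible, otherwise the REST client (available
    as of Hopsworks 3.7)."""
    if language == "java":
        return "java"
    if can_reach_rondb_ports:
        return "python-sql"
    if hopsworks_version >= (3, 7):
        return "python-rest"
    raise RuntimeError("The REST client requires Hopsworks 3.7 or later")
```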
@@ -93,14 +93,14 @@ To attach built-in transformation functions from the `hopsworks` module they can

!!! example "Creating model-dependent transformation using built-in transformation functions imported from hopsworks"
```python
from hopsworks.builtin_transformations import min_max_scaler, label_encoder, robust_scaler, standard_scaler
from hopsworks.hsfs.builtin_transformations import min_max_scaler, label_encoder, robust_scaler, standard_scaler

feature_view = fs.create_feature_view(
name='transactions_view',
query=query,
labels=["fraud_label"],
transformation_functions = [
label_encoder("category": ),
label_encoder("category"),
robust_scaler("amount"),
min_max_scaler("loc_delta"),
standard_scaler("age_at_transaction")
24 changes: 24 additions & 0 deletions docs/user_guides/fs/feature_view/training-data.md
@@ -94,6 +94,30 @@ X_train, X_test, y_train, y_test = feature_view.get_train_test_split(training_da
X_train, X_val, X_test, y_train, y_val, y_test = feature_view.get_train_validation_test_split(training_dataset_version=1)
```

## Passing Context Variables to Transformation Functions
Once you have [defined a transformation function using a context variable](../transformation_functions.md#passing-context-variables-to-transformation-function), you can pass the required context variables using the `transformation_context` parameter when generating IN-MEMORY training data or materializing a training dataset.

!!! note
Passing context variables for materializing a training dataset is only supported in the PySpark Kernel.


=== "Python"
!!! example "Passing context variables while creating training data."
```python
# Passing context variable to IN-MEMORY Training Dataset.
X_train, X_test, y_train, y_test = feature_view.get_train_test_split(training_dataset_version=1,
primary_key=True,
event_time=True,
transformation_context={"context_parameter":10})

# Passing context variable to Materialized Training Dataset.
version, job = feature_view.create_train_test_split(test_size=0.2,
                                                    transformation_context={"context_parameter":10})

```

## Read training data with primary key(s) and event time
For certain use cases, e.g. time series models, the input data needs to be sorted according to the primary key(s) and event time combination.
Primary key(s) and event time are not usually included in the feature view query as they are not features used for training.
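The ordering described above can be sketched in plain Python; the rows and column values below are purely illustrative:

```python
# Each row: (primary_key, event_time, feature_value); values are illustrative.
rows = [
    (2, "2024-01-02", 10.0),
    (1, "2024-01-03", 20.0),
    (1, "2024-01-01", 30.0),
    (2, "2024-01-01", 40.0),
]

# Sort by primary key first, then event time
# (ISO 8601 date strings sort correctly lexicographically).
ordered = sorted(rows, key=lambda r: (r[0], r[1]))
```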
15 changes: 15 additions & 0 deletions docs/user_guides/fs/transformation_functions.md
@@ -228,6 +228,21 @@ The `TransformationStatistics` instance contains separate objects with the sam
return argument + argument2 + argument3 + statistics.argument1.mean + statistics.argument2.mean + statistics.argument3.mean
```

### Passing context variables to transformation function

The `context` keyword argument can be defined in a transformation function to access shared context variables. These variables contain common data used across transformation functions. By including the `context` argument, you can pass the necessary data as a dictionary into the `context` argument of the transformation function during [training dataset creation](feature_view/training-data.md#passing-context-variables-to-transformation-functions), [feature vector retrieval](feature_view/feature-vectors.md#passing-context-variables-to-transformation-functions), or [batch data retrieval](feature_view/batch-data.md#passing-context-variables-to-transformation-functions).


=== "Python"
!!! example "Creation of a transformation function in Hopsworks that accepts context variables"
```python
from hopsworks import udf

@udf(int)
def add_features(argument1, context):
    return argument1 + context["value_to_add"]
```


## Saving to the Feature Store
