- Python is the most widely used framework for feature engineering due to its extensive library support for aggregations (Pandas), data validation (Great Expectations), and dimensionality reduction (embeddings, PCA), and transformations (in Scikit-Learn, TensorFlow, PyTorch). Python also supports open-source feature engineering frameworks used for automated feature engineering, such as [featuretools](https://www.featuretools.com/) that supports relational and temporal sources.
+ Python is the most widely used language for feature engineering due to its extensive library support for aggregations (Pandas/Polars), data validation (Great Expectations), dimensionality reduction (embeddings, PCA), and transformations (Scikit-Learn, TensorFlow, PyTorch). Python also supports open-source frameworks for automated feature engineering, such as [featuretools](https://www.featuretools.com/), which supports relational and temporal sources.
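For orientation, here is a minimal sketch of the kind of aggregation the paragraph refers to, written in both Pandas and Polars; the column names are invented for illustration:

```python
import pandas as pd
import polars as pl

data = {"customer_id": [1, 1, 2], "amount": [10.0, 20.0, 5.0]}

# Pandas: sum of amounts per customer
pd_agg = pd.DataFrame(data).groupby("customer_id", as_index=False)["amount"].sum()

# Polars: the same aggregation using its expression API
pl_agg = pl.DataFrame(data).group_by("customer_id").agg(pl.col("amount").sum())
```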
docs/concepts/fs/feature_view/offline_api.md

@@ -7,7 +7,7 @@ The feature view provides an *Offline API* for
  Training data is created using a feature view. You can create training data as either:
- - in-memory Pandas DataFrames, useful when you have a small amount of training data;
+ - in-memory Pandas/Polars DataFrames, useful when you have a small amount of training data;
  - materialized training data in files, in a file format of your choice (such as .tfrecord, .csv, or .parquet).

  You can apply filters when creating training data from a feature view:
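The filter examples themselves fall outside this hunk. As a hedged sketch of both paths through the Offline API — in-memory DataFrames and materialized files — assuming a project with an existing feature view named `transactions_fv` (an invented name):

```python
import hopsworks

project = hopsworks.login()
fs = project.get_feature_store()
fv = fs.get_feature_view(name="transactions_fv", version=1)

# In-memory training data, returned as DataFrames (features X, labels y)
X, y = fv.training_data()

# Materialized training data, written to files in a format of your choice
td_version, job = fv.create_training_data(data_format="csv")
```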
@@ -46,7 +46,7 @@ Test data can also be split into evaluation sets to help evaluate a model for po
  Batch data for scoring models is created using a feature view. Similar to training data, you can create batch data as either:
- - in-memory Pandas DataFrames, useful when you have a small amount of data to score;
+ - in-memory Pandas/Polars DataFrames, useful when you have a small amount of data to score;
  - materialized data in files, in a file format of your choice (such as .tfrecord, .csv, or .parquet)

  Batch data requires specification of a `start_time` for the start of the batch scoring data. You can also specify the `end_time` (default is the current date).
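A rough sketch of the batch path, reusing the `fv` handle from the sketch above (dates are placeholders, and `model` is assumed to be loaded already):

```python
# Batch (inference) data for a time window; end_time defaults to the current date
batch_df = fv.get_batch_data(start_time="2024-01-01", end_time="2024-01-31")
predictions = model.predict(batch_df)
```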
docs/concepts/fs/feature_view/training_inference_pipelines.md

@@ -1,4 +1,4 @@
- A *training pipeline* is a program that orchestrates the training of a machine learning model. For supervised machine learning, a training pipeline requires both features and labels, and these can typically be retrieved from the feature store as either in-memory Pandas DataFrames or read as training data files, created from the feature store. An *inference pipeline* is a program that takes user input, optionally enriches it with features from the feature store, and builds a feature vector (or batch of feature vectors) with with it uses a model to make a prediction.
+ A *training pipeline* is a program that orchestrates the training of a machine learning model. For supervised machine learning, a training pipeline requires both features and labels, and these can typically be retrieved from the feature store as in-memory Pandas/Polars DataFrames or read from training data files created by the feature store. An *inference pipeline* is a program that takes user input, optionally enriches it with features from the feature store, and builds a feature vector (or batch of feature vectors) with which it uses a model to make a prediction.
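A hedged sketch of the online half of that flow — enriching one request with precomputed features and handing the assembled vector to a model. The feature view name and key are invented, and `model` is assumed to be loaded already:

```python
import hopsworks

project = hopsworks.login()
fs = project.get_feature_store()
fv = fs.get_feature_view(name="transactions_fv", version=1)

# Look up precomputed features for one entity, then predict
feature_vector = fv.get_feature_vector(entry={"customer_id": 42})
prediction = model.predict([feature_vector])
```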
docs/user_guides/fs/compute_engines.md

@@ -7,7 +7,7 @@ as a Dataframe.
  As such, Hopsworks supports the following computational engines:

  1. [Apache Spark](https://spark.apache.org): Spark Dataframes and Spark Structured Streaming Dataframes are supported, both from Python environments (PySpark) and from Scala environments.
- 2. [Pandas](https://pandas.pydata.org/): For pure Python environments without dependencies on Spark, Hopsworks supports [Pandas Dataframes](https://pandas.pydata.org/).
+ 2. [Python](https://www.python.org/): For pure Python environments without dependencies on Spark, Hopsworks supports [Pandas Dataframes](https://pandas.pydata.org/) and [Polars Dataframes](https://pola.rs/).
  3. [Apache Flink](https://flink.apache.org): Flink Data Streams are currently supported as an experimental feature from Java/Scala environments.
  4. [Apache Beam](https://beam.apache.org/) *experimental*: Beam Data Streams are currently supported as an experimental feature from Java/Scala environments.
@@ -23,11 +23,12 @@ Hopsworks is aiming to provide functional parity between the computational engines
  | Feature Group Creation from dataframes |[`FeatureGroup.create_feature_group()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/feature_group_api/#create_feature_group) |:white_check_mark:|:white_check_mark:| - | - | Currently Flink/Beam doesn't support registering feature group metadata. Thus it needs to be pre-registered before you can write real time features computed by Flink/Beam.|
  | Training Dataset Creation from dataframes |[`TrainingDataset.save()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/training_dataset_api/#save) |:white_check_mark:| - | - | - | Functionality was deprecated in version 3.0 |
  | Data validation using Great Expectations for streaming dataframes |[`FeatureGroup.validate()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/feature_group_api/#validate) [`FeatureGroup.insert_stream()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/feature_group_api/#insert_stream) | - | - | - | - |`insert_stream` does not perform any data validation even when an expectation suite is attached. |
- | Stream ingestion |[`FeatureGroup.insert_stream()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/feature_group_api/#insert_stream) |:white_check_mark:| - |:white_check_mark:|:white_check_mark:| Python/Pandas has currently no notion of streaming. |
- | Reading from Streaming Storage Connectors |[`KafkaConnector.read_stream()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/storage_connector_api/#read_stream) |:white_check_mark:| - | - | - | Python/Pandas has currently no notion of streaming. For Flink/Beam only write operations are supported |
+ | Stream ingestion |[`FeatureGroup.insert_stream()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/feature_group_api/#insert_stream) |:white_check_mark:| - |:white_check_mark:|:white_check_mark:| Python/Pandas/Polars has currently no notion of streaming. |
+ | Reading from Streaming Storage Connectors |[`KafkaConnector.read_stream()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/storage_connector_api/#read_stream) |:white_check_mark:| - | - | - | Python/Pandas/Polars has currently no notion of streaming. For Flink/Beam only write operations are supported |
  | Reading training data from external storage other than S3 |[`FeatureView.get_training_data()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/feature_view_api/#get_training_data) |:white_check_mark:| - | - | - | Reading training data that was written to external storage using a Storage Connector other than S3 can currently not be read using HSFS APIs, instead you will have to use the storage's native client. |
- | Reading External Feature Groups into Dataframe |[`ExternalFeatureGroup.read()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/external_feature_group_api/#read) |:white_check_mark:| - | - | - | Reading an External Feature Group directly into a Pandas Dataframe is not supported, however, you can use the [Query API](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/query_api/) to create Feature Views/Training Data containing External Feature Groups. |
- | Read Queries containing External Feature Groups into Dataframe |[`Query.read()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/query_api/#read) |:white_check_mark:| - | - | - | Reading a Query containing an External Feature Group directly into a Pandas Dataframe is not supported, however, you can use the Query to create Feature Views/Training Data and write the data to a Storage Connector, from where you can read up the data into a Pandas Dataframe. |
+ | Reading External Feature Groups into Dataframe |[`ExternalFeatureGroup.read()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/external_feature_group_api/#read) |:white_check_mark:| - | - | - | Reading an External Feature Group directly into a Pandas/Polars Dataframe is not supported, however, you can use the [Query API](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/query_api/) to create Feature Views/Training Data containing External Feature Groups. |
+ | Read Queries containing External Feature Groups into Dataframe |[`Query.read()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/query_api/#read) |:white_check_mark:| - | - | - | Reading a Query containing an External Feature Group directly into a Pandas/Polars Dataframe is not supported, however, you can use the Query to create Feature Views/Training Data and write the data to a Storage Connector, from where you can read up the data into a Pandas/Polars Dataframe. |
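The workaround named in the last two rows can be sketched roughly as follows, reusing the `fs` handle from the earlier sketches and assuming an external feature group named `weather` is already registered (the names are illustrative):

```python
# The Python engine cannot read an external feature group directly into a
# Pandas/Polars DataFrame, but the Query API can still build a feature view on it
ext_fg = fs.get_external_feature_group("weather", version=1)
query = ext_fg.select_all()
fv = fs.create_feature_view(name="weather_fv", version=1, query=query)
```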
docs/user_guides/fs/feature_group/create.md

@@ -14,7 +14,7 @@ Before you begin this guide we suggest you read the [Feature Group](../../../con
  ## Create using the HSFS APIs

- To create a feature group using the HSFS APIs, you need to provide a Pandas or Spark DataFrame. The DataFrame will contain all the features you want to register within the feature group, as well as the primary key, event time and partition key.
+ To create a feature group using the HSFS APIs, you need to provide a Pandas, Polars, or Spark DataFrame. The DataFrame will contain all the features you want to register within the feature group, as well as the primary key, event time and partition key.

  ### Create a Feature Group
@@ -272,7 +272,7 @@ The snippet above only created the metadata object on the Python interpreter run
  fg.insert(df)

- The save method takes in input a Pandas or Spark DataFrame. HSFS will use the DataFrame columns and types to determine the name and types of features, primary key, partition key and event time.
+ The save method takes as input a Pandas, Polars, or Spark DataFrame. HSFS will use the DataFrame columns and types to determine the names and types of features, primary key, partition key and event time.

  The DataFrame *must* contain the columns specified as primary keys, partition key and event time in the `create_feature_group` call.
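To make the create-and-insert flow concrete, a hedged sketch; the feature names, keys, and values are invented, and `fs` is a feature store handle as in the earlier sketches:

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2],
    "amount": [12.5, 7.0],
    "ts": pd.to_datetime(["2024-01-01", "2024-01-02"]),
})

# Metadata object: schema, keys, and event time are inferred from df on insert
fg = fs.create_feature_group(
    name="transactions",
    version=1,
    primary_key=["customer_id"],
    event_time="ts",
    description="Toy example; column names are assumptions",
)
fg.insert(df)  # a Polars or Spark DataFrame would work here too
```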
- | MapType | - | MAP<String,TYPE>| Only when time_travel_type!="HUDI"; Only string keys permitted |
+ or a [Pandas](https://pandas.pydata.org/) DataFrame, or a [Polars](https://pola.rs/) DataFrame in a Python-only environment (P), the following default mapping to offline feature types applies:
+
+ | Spark Type (S) | Pandas Type (P) | Polars Type (P) | Offline Feature Type | Remarks |
+ | MapType | - | - | MAP<String,TYPE> | Only when time_travel_type!="HUDI"; Only string keys permitted |

  When registering a Pandas DataFrame in a PySpark environment (S) the Pandas DataFrame is first converted to a Spark DataFrame, using Spark's [default conversion](https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.SparkSession.createDataFrame.html).
  It results in a less fine-grained mapping between Python and Spark types:
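That linked default conversion is easy to observe directly; a tiny sketch (column names invented) of what Spark does when handed a Pandas DataFrame:

```python
from pyspark.sql import SparkSession
import pandas as pd

spark = SparkSession.builder.getOrCreate()
pdf = pd.DataFrame({"id": [1, 2], "amount": [0.5, 1.5]})

# Spark's default Pandas -> Spark conversion
sdf = spark.createDataFrame(pdf)
sdf.printSchema()  # shows the coarser Spark types, e.g. id: bigint, amount: double
```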
docs/user_guides/integrations/python.md

@@ -53,7 +53,7 @@ fs = conn.get_feature_store() # Get the project's default feature store
  !!! note "Engine"

- `HSFS` uses either Apache Spark or Pandas on Python as an execution engine to perform queries against the feature store. The `engine` option of the connection let's you overwrite the default behaviour by setting it to `"python"` or `"spark"`. By default, `HSFS` will try to use Spark as engine if PySpark is available. So if you have PySpark installed in your local Python environment, but you have not configured Spark, you will have to set `engine='python'`. Please refer to the [Spark integration guide](spark.md) to configure your local Spark cluster to be able to connect to the Hopsworks Feature Store.
+ `HSFS` uses either Apache Spark or Pandas/Polars on Python as an execution engine to perform queries against the feature store. The `engine` option of the connection lets you override the default behaviour by setting it to `"python"` or `"spark"`. By default, `HSFS` will try to use Spark as the engine if PySpark is available. So if you have PySpark installed in your local Python environment but have not configured Spark, you will have to set `engine='python'`. Please refer to the [Spark integration guide](spark.md) to configure your local Spark cluster to be able to connect to the Hopsworks Feature Store.
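A hedged sketch of pinning the engine at connection time; the host, project, and API key values are placeholders:

```python
import hsfs

# Force the pure-Python engine even if PySpark happens to be installed
conn = hsfs.connection(
    host="my-instance.cloud.hopsworks.ai",
    project="my_project",
    api_key_value="...",
    engine="python",
)
fs = conn.get_feature_store()
```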