Adding documentation for Polars integration #351

Merged
merged 1 commit on Mar 14, 2024

2 changes: 1 addition & 1 deletion docs/concepts/fs/feature_group/feature_pipelines.md
@@ -21,7 +21,7 @@ Transformations are covered in more detail in [training/inference pipelines](../
<img src="../../../../assets/images/concepts/fs/feature-pipelines-with-transformations.svg">

### Feature Engineering in Python
Python is the most widely used language for feature engineering due to its extensive library support for aggregations (Pandas), data validation (Great Expectations), dimensionality reduction (embeddings, PCA), and transformations (Scikit-Learn, TensorFlow, PyTorch). Python also supports open-source frameworks for automated feature engineering, such as [featuretools](https://www.featuretools.com/), which supports relational and temporal sources.
Python is the most widely used language for feature engineering due to its extensive library support for aggregations (Pandas/Polars), data validation (Great Expectations), dimensionality reduction (embeddings, PCA), and transformations (Scikit-Learn, TensorFlow, PyTorch). Python also supports open-source frameworks for automated feature engineering, such as [featuretools](https://www.featuretools.com/), which supports relational and temporal sources.
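
As a brief, hedged illustration of the kind of aggregation Polars makes convenient, the snippet below computes per-card transaction features; the column names (`cc_num`, `amount`, `event_time`) are made up for this example and are not part of any Hopsworks API:

```python
from datetime import datetime

import polars as pl

# Illustrative raw events; in practice these would be read from files or a source table.
df = pl.DataFrame({
    "cc_num": [1, 1, 2, 2],
    "amount": [10.0, 25.5, 7.2, 99.9],
    "event_time": [datetime(2024, 3, 1, h) for h in range(4)],
})

# Simple per-card aggregations that could be registered as features.
features = df.group_by("cc_num").agg(
    pl.col("amount").mean().alias("avg_amount"),
    pl.col("amount").max().alias("max_amount"),
    pl.col("amount").count().alias("num_transactions"),
)
print(features)
```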


### Feature Engineering in Spark/PySpark
4 changes: 2 additions & 2 deletions docs/concepts/fs/feature_view/offline_api.md
@@ -7,7 +7,7 @@ The feature view provides an *Offline API* for

Training data is created using a feature view (a minimal sketch follows the list below). You can create training data as either:

- in-memory Pandas DataFrames, useful when you have a small amount of training data;
- in-memory Pandas/Polars DataFrames, useful when you have a small amount of training data;
- materialized training data in files, in a file format of your choice (such as .tfrecord, .csv, or .parquet).
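
A minimal sketch of both options, assuming `fs` is a feature store handle obtained from the connection; the feature view name, version, and `create_training_data` arguments are placeholders rather than a definitive recipe:

```python
# Assumes `fs` was obtained via conn.get_feature_store(); names are placeholders.
fv = fs.get_feature_view(name="transactions_view", version=1)

# Option 1: read training data directly into in-memory DataFrames (features X, labels y).
X, y = fv.training_data()

# Option 2: materialize training data as files in a format of your choice.
td_version, job = fv.create_training_data(
    description="transactions training data",
    data_format="csv",
)
```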

You can apply filters when creating training data from a feature view:
@@ -46,7 +46,7 @@ Test data can also be split into evaluation sets to help evaluate a model for po

Batch data for scoring models is created using a feature view. Similar to training data, you can create batch data as either:

- in-memory Pandas DataFrames, useful when you have a small amount of data to score;
- in-memory Pandas/Polars DataFrames, useful when you have a small amount of data to score;
- materialized data in files, in a file format of your choice (such as .tfrecord, .csv, or .parquet)

Batch data requires specification of a `start_time` for the start of the batch scoring data. You can also specify the `end_time` (default is the current date).
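
A hedged sketch of fetching batch scoring data, assuming the same placeholder feature view as above; the exact parameters accepted by `get_batch_data` should be checked against the Feature View API reference:

```python
from datetime import datetime

fv = fs.get_feature_view(name="transactions_view", version=1)

# Batch data from start_time up to end_time (end_time defaults to the current date).
batch_df = fv.get_batch_data(
    start_time=datetime(2024, 3, 1),
    end_time=datetime(2024, 3, 14),
)
```
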
@@ -1,4 +1,4 @@
A *training pipeline* is a program that orchestrates the training of a machine learning model. For supervised machine learning, a training pipeline requires both features and labels, and these can typically be retrieved from the feature store as either in-memory Pandas DataFrames or read from training data files created from the feature store. An *inference pipeline* is a program that takes user input, optionally enriches it with features from the feature store, and builds a feature vector (or batch of feature vectors) with which it uses a model to make a prediction.
A *training pipeline* is a program that orchestrates the training of a machine learning model. For supervised machine learning, a training pipeline requires both features and labels, and these can typically be retrieved from the feature store as either in-memory Pandas/Polars DataFrames or read from training data files created from the feature store. An *inference pipeline* is a program that takes user input, optionally enriches it with features from the feature store, and builds a feature vector (or batch of feature vectors) with which it uses a model to make a prediction.
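
As an illustrative sketch only, an online inference lookup could look roughly like the following; the feature view name, the entity key `cc_num`, and the `model` object are placeholders, not part of the text above:

```python
# Assumes `fs` is a feature store handle and `model` is a trained model loaded elsewhere.
fv = fs.get_feature_view(name="transactions_view", version=1)
fv.init_serving()  # prepares the feature view for online feature vector retrieval

# Enrich the user input (the primary key value) with precomputed features.
feature_vector = fv.get_feature_vector(entry={"cc_num": 4473593503484549})
prediction = model.predict([feature_vector])
```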


## Transformations
11 changes: 6 additions & 5 deletions docs/user_guides/fs/compute_engines.md
@@ -7,7 +7,7 @@ as a Dataframe.
As such, Hopsworks supports four computational engines:

1. [Apache Spark](https://spark.apache.org): Spark Dataframes and Spark Structured Streaming Dataframes are supported, both from Python environments (PySpark) and from Scala environments.
2. [Pandas](https://pandas.pydata.org/): For pure Python environments without dependencies on Spark, Hopsworks supports [Pandas Dataframes](https://pandas.pydata.org/).
2. [Python](https://www.python.org/): For pure Python environments without dependencies on Spark, Hopsworks supports [Pandas Dataframes](https://pandas.pydata.org/) and [Polars Dataframes](https://pola.rs/).
3. [Apache Flink](https://flink.apache.org): Flink Data Streams are currently supported as an experimental feature from Java/Scala environments.
4. [Apache Beam](https://beam.apache.org/) *experimental*: Beam Data Streams are currently supported as an experimental feature from Java/Scala environments.

@@ -23,11 +23,12 @@ Hopsworks is aiming to provide functional parity between the computational engine
| Feature Group Creation from dataframes | [`FeatureGroup.create_feature_group()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/feature_group_api/#create_feature_group) | :white_check_mark: | :white_check_mark: | - | - | Currently, Flink/Beam do not support registering feature group metadata, so the feature group needs to be pre-registered before you can write real-time features computed by Flink/Beam. |
| Training Dataset Creation from dataframes | [`TrainingDataset.save()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/training_dataset_api/#save) | :white_check_mark: | - | - | - | Functionality was deprecated in version 3.0 |
| Data validation using Great Expectations for streaming dataframes | [`FeatureGroup.validate()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/feature_group_api/#validate) [`FeatureGroup.insert_stream()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/feature_group_api/#insert_stream) | - | - | - | - | `insert_stream` does not perform any data validation even when an expectation suite is attached. |
| Stream ingestion | [`FeatureGroup.insert_stream()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/feature_group_api/#insert_stream) | :white_check_mark: | - | :white_check_mark: | :white_check_mark: | Python/Pandas currently has no notion of streaming. |
| Reading from Streaming Storage Connectors | [`KafkaConnector.read_stream()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/storage_connector_api/#read_stream) | :white_check_mark: | - | - | - | Python/Pandas currently has no notion of streaming. For Flink/Beam only write operations are supported. |
| Stream ingestion | [`FeatureGroup.insert_stream()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/feature_group_api/#insert_stream) | :white_check_mark: | - | :white_check_mark: | :white_check_mark: | Python/Pandas/Polars currently has no notion of streaming. |
| Reading from Streaming Storage Connectors | [`KafkaConnector.read_stream()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/storage_connector_api/#read_stream) | :white_check_mark: | - | - | - | Python/Pandas/Polars currently has no notion of streaming. For Flink/Beam only write operations are supported. |
| Reading training data from external storage other than S3 | [`FeatureView.get_training_data()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/feature_view_api/#get_training_data) | :white_check_mark: | - | - | - | Training data that was written to external storage using a Storage Connector other than S3 cannot currently be read using the HSFS APIs; instead, you will have to use the storage's native client. |
| Reading External Feature Groups into Dataframe | [`ExternalFeatureGroup.read()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/external_feature_group_api/#read) | :white_check_mark: | - | - | - | Reading an External Feature Group directly into a Pandas Dataframe is not supported; however, you can use the [Query API](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/query_api/) to create Feature Views/Training Data containing External Feature Groups. |
| Read Queries containing External Feature Groups into Dataframe | [`Query.read()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/query_api/#read) | :white_check_mark: | - | - | - | Reading a Query containing an External Feature Group directly into a Pandas Dataframe is not supported; however, you can use the Query to create Feature Views/Training Data and write the data to a Storage Connector, from where you can read the data into a Pandas Dataframe. |
| Reading External Feature Groups into Dataframe | [`ExternalFeatureGroup.read()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/external_feature_group_api/#read) | :white_check_mark: | - | - | - | Reading an External Feature Group directly into a Pandas/Polars Dataframe is not supported; however, you can use the [Query API](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/query_api/) to create Feature Views/Training Data containing External Feature Groups. |
| Read Queries containing External Feature Groups into Dataframe | [`Query.read()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/query_api/#read) | :white_check_mark: | - | - | - | Reading a Query containing an External Feature Group directly into a Pandas/Polars Dataframe is not supported; however, you can use the Query to create Feature Views/Training Data and write the data to a Storage Connector, from where you can read the data into a Pandas/Polars Dataframe. |

## Python

4 changes: 2 additions & 2 deletions docs/user_guides/fs/feature_group/create.md
@@ -14,7 +14,7 @@ Before you begin this guide we suggest you read the [Feature Group](../../../con

## Create using the HSFS APIs

To create a feature group using the HSFS APIs, you need to provide a Pandas or Spark DataFrame. The DataFrame will contain all the features you want to register within the feature group, as well as the primary key, event time and partition key.
To create a feature group using the HSFS APIs, you need to provide a Pandas, Polars or Spark DataFrame. The DataFrame will contain all the features you want to register within the feature group, as well as the primary key, event time and partition key.

### Create a Feature Group

@@ -272,7 +272,7 @@ The snippet above only created the metadata object on the Python interpreter run
fg.insert(df)
```

The save method takes as input a Pandas or Spark DataFrame. HSFS will use the DataFrame columns and types to determine the names and types of the features, primary key, partition key and event time.
The save method takes as input a Pandas, Polars or Spark DataFrame. HSFS will use the DataFrame columns and types to determine the names and types of the features, primary key, partition key and event time.

The DataFrame *must* contain the columns specified as primary keys, partition key and event time in the `create_feature_group` call.
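
As an illustration (not taken from the guide itself), a Polars DataFrame carrying the declared primary key and event time columns could be registered like this; the feature group and column names are placeholders:

```python
from datetime import datetime

import polars as pl

df = pl.DataFrame({
    "cc_num": [1, 2],                                             # primary key
    "avg_amount": [17.75, 53.55],                                 # feature
    "event_time": [datetime(2024, 3, 1), datetime(2024, 3, 2)],  # event time
})

fg = fs.create_feature_group(
    name="transactions_aggs",
    version=1,
    primary_key=["cc_num"],
    event_time="event_time",
)
fg.insert(df)
```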

38 changes: 19 additions & 19 deletions docs/user_guides/fs/feature_group/data_types.md
@@ -26,25 +26,25 @@ The default mapping, however, can be overwritten by using an [explicit schema de
### Offline data types

When registering a [Spark](https://spark.apache.org/docs/latest/sql-ref-datatypes.html) DataFrame in a PySpark environment (S),
or a [Pandas](https://pandas.pydata.org/) DataFrame in a Python-only environment (P), the following default mapping to offline feature types applies:

| Spark Type (S) | Pandas Type (P) | Offline Feature Type | Remarks |
|----------------|------------------------------------|-------------------------------|----------------------------------------------------------------|
| BooleanType | bool, object(bool) | BOOLEAN | |
| ByteType | int8, Int8 | TINYINT or INT | INT when time_travel_type="HUDI" |
| ShortType | uint8, int16, Int16 | SMALLINT or INT | INT when time_travel_type="HUDI" |
| IntegerType | uint16, int32, Int32 | INT | |
| LongType | int, uint32, int64, Int64 | BIGINT | |
| FloatType | float, float16, float32 | FLOAT | |
| DoubleType | float64 | DOUBLE | |
| DecimalType | decimal.Decimal | DECIMAL(PREC, SCALE) | Not supported in PO env. when time_travel_type="HUDI" |
| TimestampType | datetime64[ns], datetime64[ns, tz] | TIMESTAMP | s. [Timestamps and Timezones](#timestamps-and-timezones) |
| DateType | object (datetime.date) | DATE | |
| StringType | object (str), object(np.unicode) | STRING | |
| ArrayType | object (list), object (np.ndarray) | ARRAY&lt;TYPE&gt; | |
| StructType | object (dict) | STRUCT&lt;NAME: TYPE, ...&gt; | |
| BinaryType | object (binary) | BINARY | |
| MapType | - | MAP&lt;String,TYPE&gt; | Only when time_travel_type!="HUDI"; Only string keys permitted |
or a [Pandas](https://pandas.pydata.org/) or [Polars](https://pola.rs/) DataFrame in a Python-only environment (P), the following default mapping to offline feature types applies (a short Polars casting sketch follows the table):

| Spark Type (S) | Pandas Type (P) |Polars Type (P) | Offline Feature Type | Remarks |
|----------------|------------------------------------|-----------------------------------|-------------------------------|----------------------------------------------------------------|
| BooleanType | bool, object(bool) |Boolean | BOOLEAN | |
| ByteType | int8, Int8 |Int8 | TINYINT or INT | INT when time_travel_type="HUDI" |
| ShortType | uint8, int16, Int16 |UInt8, Int16 | SMALLINT or INT | INT when time_travel_type="HUDI" |
| IntegerType | uint16, int32, Int32 |UInt16, Int32 | INT | |
| LongType | int, uint32, int64, Int64 |UInt32, Int64 | BIGINT | |
| FloatType | float, float16, float32 |Float32 | FLOAT | |
| DoubleType | float64 |Float64 | DOUBLE | |
| DecimalType | decimal.Decimal |Decimal | DECIMAL(PREC, SCALE) | Not supported in PO env. when time_travel_type="HUDI" |
| TimestampType | datetime64[ns], datetime64[ns, tz] |Datetime | TIMESTAMP | s. [Timestamps and Timezones](#timestamps-and-timezones) |
| DateType | object (datetime.date) |Date | DATE | |
| StringType | object (str), object(np.unicode) |String, Utf8 | STRING | |
| ArrayType | object (list), object (np.ndarray) |List | ARRAY&lt;TYPE&gt; | |
| StructType | object (dict) |Struct | STRUCT&lt;NAME: TYPE, ...&gt; | |
| BinaryType | object (binary) |Binary | BINARY | |
| MapType | - |- | MAP&lt;String,TYPE&gt; | Only when time_travel_type!="HUDI"; Only string keys permitted |
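
As a hedged aside, one way to steer which offline type from the table is chosen is to cast columns to explicit Polars dtypes before ingestion; the column names below are made up:

```python
import polars as pl

df = pl.DataFrame({
    "id": [1, 2, 3],
    "age": [21, 34, 45],
    "score": [0.5, 0.7, 0.9],
})

# Cast to narrower dtypes so they map to TINYINT/SMALLINT/FLOAT
# (subject to the time_travel_type remarks above) instead of BIGINT/DOUBLE.
df = df.with_columns(
    pl.col("id").cast(pl.Int8),
    pl.col("age").cast(pl.Int16),
    pl.col("score").cast(pl.Float32),
)
print(df.schema)
```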

When registering a Pandas DataFrame in a PySpark environment (S), the Pandas DataFrame is first converted to a Spark DataFrame using Spark's [default conversion](https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.SparkSession.createDataFrame.html).
It results in a less fine-grained mapping between Python and Spark types:
2 changes: 1 addition & 1 deletion docs/user_guides/integrations/python.md
@@ -53,7 +53,7 @@ fs = conn.get_feature_store() # Get the project's default feature stor

!!! note "Engine"

`HSFS` uses either Apache Spark or Pandas on Python as an execution engine to perform queries against the feature store. The `engine` option of the connection lets you override the default behaviour by setting it to `"python"` or `"spark"`. By default, `HSFS` will try to use Spark as the engine if PySpark is available. So if you have PySpark installed in your local Python environment, but you have not configured Spark, you will have to set `engine='python'`. Please refer to the [Spark integration guide](spark.md) to configure your local Spark cluster to be able to connect to the Hopsworks Feature Store.
`HSFS` uses either Apache Spark or Pandas/Polars on Python as an execution engine to perform queries against the feature store. The `engine` option of the connection lets you override the default behaviour by setting it to `"python"` or `"spark"`. By default, `HSFS` will try to use Spark as the engine if PySpark is available. So if you have PySpark installed in your local Python environment, but you have not configured Spark, you will have to set `engine='python'`. Please refer to the [Spark integration guide](spark.md) to configure your local Spark cluster to be able to connect to the Hopsworks Feature Store.
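
A hedged sketch of forcing the Python engine when connecting; the host, project name, and API key below are placeholders:

```python
import hsfs

conn = hsfs.connection(
    host="my-instance.cloud.hopsworks.ai",  # placeholder host
    project="my_project",                   # placeholder project name
    api_key_value="<API_KEY>",
    engine="python",                        # use the Pandas/Polars execution engine
)
fs = conn.get_feature_store()
```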

!!! info "Ports"
