- Python is the most widely used framework for feature engineering due to its extensive library support for aggregations (Pandas), data validation (Great Expectations), and dimensionality reduction (embeddings, PCA), and transformations (in Scikit-Learn, TensorFlow, PyTorch). Python also supports open-source feature engineering frameworks used for automated feature engineering, such as [featuretools](https://www.featuretools.com/) that supports relational and temporal sources.
+ Python is the most widely used language for feature engineering due to its extensive library support for aggregations (Pandas/Polars), data validation (Great Expectations), dimensionality reduction (embeddings, PCA), and transformations (Scikit-Learn, TensorFlow, PyTorch). Python also supports open-source frameworks for automated feature engineering, such as [featuretools](https://www.featuretools.com/), which supports relational and temporal sources.
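For orientation, here is a minimal sketch of the kind of aggregation the paragraph refers to, written in both Pandas and Polars; the column names are invented for illustration:

```python
import pandas as pd
import polars as pl

data = {"customer_id": [1, 1, 2], "amount": [10.0, 20.0, 5.0]}

# Pandas: sum of amounts per customer
pd_agg = pd.DataFrame(data).groupby("customer_id", as_index=False)["amount"].sum()

# Polars: the same aggregation using its expression API
pl_agg = pl.DataFrame(data).group_by("customer_id").agg(pl.col("amount").sum())
```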
docs/concepts/fs/feature_view/offline_api.md

@@ -7,7 +7,7 @@ The feature view provides an *Offline API* for
  Training data is created using a feature view. You can create training data as either:
- - in-memory Pandas DataFrames, useful when you have a small amount of training data;
+ - in-memory Pandas/Polars DataFrames, useful when you have a small amount of training data;
  - materialized training data in files, in a file format of your choice (such as .tfrecord, .csv, or .parquet).

  You can apply filters when creating training data from a feature view:
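The filter examples themselves fall outside this hunk. As a hedged sketch of both paths through the Offline API — in-memory DataFrames and materialized files — assuming a project with an existing feature view named `transactions_fv` (an invented name):

```python
import hopsworks

project = hopsworks.login()
fs = project.get_feature_store()
fv = fs.get_feature_view(name="transactions_fv", version=1)

# In-memory training data, returned as DataFrames (features X, labels y)
X, y = fv.training_data()

# Materialized training data, written to files in a format of your choice
td_version, job = fv.create_training_data(data_format="csv")
```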
@@ -46,7 +46,7 @@ Test data can also be split into evaluation sets to help evaluate a model for po
  Batch data for scoring models is created using a feature view. Similar to training data, you can create batch data as either:
- - in-memory Pandas DataFrames, useful when you have a small amount of data to score;
+ - in-memory Pandas/Polars DataFrames, useful when you have a small amount of data to score;
  - materialized data in files, in a file format of your choice (such as .tfrecord, .csv, or .parquet)

  Batch data requires specification of a `start_time` for the start of the batch scoring data. You can also specify the `end_time` (default is the current date).
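A rough sketch of the batch path, reusing the `fv` handle from the sketch above (dates are placeholders, and `model` is assumed to be loaded already):

```python
# Batch (inference) data for a time window; end_time defaults to the current date
batch_df = fv.get_batch_data(start_time="2024-01-01", end_time="2024-01-31")
predictions = model.predict(batch_df)
```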
docs/concepts/fs/feature_view/training_inference_pipelines.md

@@ -1,4 +1,4 @@
- A *training pipeline* is a program that orchestrates the training of a machine learning model. For supervised machine learning, a training pipeline requires both features and labels, and these can typically be retrieved from the feature store as either in-memory Pandas DataFrames or read as training data files, created from the feature store. An *inference pipeline* is a program that takes user input, optionally enriches it with features from the feature store, and builds a feature vector (or batch of feature vectors) with with it uses a model to make a prediction.
+ A *training pipeline* is a program that orchestrates the training of a machine learning model. For supervised machine learning, a training pipeline requires both features and labels, and these can typically be retrieved from the feature store as in-memory Pandas/Polars DataFrames or read from training data files created by the feature store. An *inference pipeline* is a program that takes user input, optionally enriches it with features from the feature store, and builds a feature vector (or batch of feature vectors) with which it uses a model to make a prediction.
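A hedged sketch of the online half of that flow — enriching one request with precomputed features and handing the assembled vector to a model. The feature view name and key are invented, and `model` is assumed to be loaded already:

```python
import hopsworks

project = hopsworks.login()
fs = project.get_feature_store()
fv = fs.get_feature_view(name="transactions_fv", version=1)

# Look up precomputed features for one entity, then predict
feature_vector = fv.get_feature_vector(entry={"customer_id": 42})
prediction = model.predict([feature_vector])
```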
docs/user_guides/fs/compute_engines.md

@@ -7,7 +7,7 @@ as a Dataframe.
  As such, Hopsworks supports the following computational engines:

  1. [Apache Spark](https://spark.apache.org): Spark Dataframes and Spark Structured Streaming Dataframes are supported, both from Python environments (PySpark) and from Scala environments.
- 2. [Pandas](https://pandas.pydata.org/): For pure Python environments without dependencies on Spark, Hopsworks supports [Pandas Dataframes](https://pandas.pydata.org/).
+ 2. [Python](https://www.python.org/): For pure Python environments without dependencies on Spark, Hopsworks supports [Pandas Dataframes](https://pandas.pydata.org/) and [Polars Dataframes](https://pola.rs/).
  3. [Apache Flink](https://flink.apache.org): Flink Data Streams are currently supported as an experimental feature from Java/Scala environments.
  4. [Apache Beam](https://beam.apache.org/) *experimental*: Beam Data Streams are currently supported as an experimental feature from Java/Scala environments.
@@ -23,11 +23,12 @@ Hopsworks is aiming to provide functional parity between the computational engines
  | Feature Group Creation from dataframes |[`FeatureGroup.create_feature_group()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/feature_group_api/#create_feature_group) |:white_check_mark:|:white_check_mark:| - | - | Currently Flink/Beam doesn't support registering feature group metadata. Thus it needs to be pre-registered before you can write real time features computed by Flink/Beam.|
  | Training Dataset Creation from dataframes |[`TrainingDataset.save()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/training_dataset_api/#save) |:white_check_mark:| - | - | - | Functionality was deprecated in version 3.0 |
  | Data validation using Great Expectations for streaming dataframes |[`FeatureGroup.validate()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/feature_group_api/#validate) [`FeatureGroup.insert_stream()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/feature_group_api/#insert_stream) | - | - | - | - |`insert_stream` does not perform any data validation even when an expectation suite is attached. |
- | Stream ingestion |[`FeatureGroup.insert_stream()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/feature_group_api/#insert_stream) |:white_check_mark:| - |:white_check_mark:|:white_check_mark:| Python/Pandas has currently no notion of streaming. |
- | Reading from Streaming Storage Connectors |[`KafkaConnector.read_stream()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/storage_connector_api/#read_stream) |:white_check_mark:| - | - | - | Python/Pandas has currently no notion of streaming. For Flink/Beam only write operations are supported |
+ | Stream ingestion |[`FeatureGroup.insert_stream()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/feature_group_api/#insert_stream) |:white_check_mark:| - |:white_check_mark:|:white_check_mark:| Python/Pandas/Polars has currently no notion of streaming. |
+ | Reading from Streaming Storage Connectors |[`KafkaConnector.read_stream()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/storage_connector_api/#read_stream) |:white_check_mark:| - | - | - | Python/Pandas/Polars has currently no notion of streaming. For Flink/Beam only write operations are supported |
  | Reading training data from external storage other than S3 |[`FeatureView.get_training_data()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/feature_view_api/#get_training_data) |:white_check_mark:| - | - | - | Reading training data that was written to external storage using a Storage Connector other than S3 can currently not be read using HSFS APIs, instead you will have to use the storage's native client. |
- | Reading External Feature Groups into Dataframe |[`ExternalFeatureGroup.read()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/external_feature_group_api/#read) |:white_check_mark:| - | - | - | Reading an External Feature Group directly into a Pandas Dataframe is not supported, however, you can use the [Query API](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/query_api/) to create Feature Views/Training Data containing External Feature Groups. |
- | Read Queries containing External Feature Groups into Dataframe |[`Query.read()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/query_api/#read) |:white_check_mark:| - | - | - | Reading a Query containing an External Feature Group directly into a Pandas Dataframe is not supported, however, you can use the Query to create Feature Views/Training Data and write the data to a Storage Connector, from where you can read up the data into a Pandas Dataframe. |
+ | Reading External Feature Groups into Dataframe |[`ExternalFeatureGroup.read()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/external_feature_group_api/#read) |:white_check_mark:| - | - | - | Reading an External Feature Group directly into a Pandas/Polars Dataframe is not supported, however, you can use the [Query API](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/query_api/) to create Feature Views/Training Data containing External Feature Groups. |
+ | Read Queries containing External Feature Groups into Dataframe |[`Query.read()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/query_api/#read) |:white_check_mark:| - | - | - | Reading a Query containing an External Feature Group directly into a Pandas/Polars Dataframe is not supported, however, you can use the Query to create Feature Views/Training Data and write the data to a Storage Connector, from where you can read up the data into a Pandas/Polars Dataframe. |
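The workaround named in the last two rows can be sketched roughly as follows, reusing the `fs` handle from the earlier sketches and assuming an external feature group named `weather` is already registered (the names are illustrative):

```python
# The Python engine cannot read an external feature group directly into a
# Pandas/Polars DataFrame, but the Query API can still build a feature view on it
ext_fg = fs.get_external_feature_group("weather", version=1)
query = ext_fg.select_all()
fv = fs.create_feature_view(name="weather_fv", version=1, query=query)
```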
docs/user_guides/fs/feature_group/create.md

@@ -14,7 +14,7 @@ Before you begin this guide we suggest you read the [Feature Group](../../../con
  ## Create using the HSFS APIs

- To create a feature group using the HSFS APIs, you need to provide a Pandas or Spark DataFrame. The DataFrame will contain all the features you want to register within the feature group, as well as the primary key, event time and partition key.
+ To create a feature group using the HSFS APIs, you need to provide a Pandas, Polars, or Spark DataFrame. The DataFrame will contain all the features you want to register within the feature group, as well as the primary key, event time and partition key.

  ### Create a Feature Group
@@ -272,7 +272,7 @@ The snippet above only created the metadata object on the Python interpreter run
  fg.insert(df)

- The save method takes in input a Pandas or Spark DataFrame. HSFS will use the DataFrame columns and types to determine the name and types of features, primary key, partition key and event time.
+ The save method takes as input a Pandas, Polars, or Spark DataFrame. HSFS will use the DataFrame columns and types to determine the names and types of features, primary key, partition key and event time.

  The DataFrame *must* contain the columns specified as primary keys, partition key and event time in the `create_feature_group` call.
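To make the create-and-insert flow concrete, a hedged sketch; the feature names, keys, and values are invented, and `fs` is a feature store handle as in the earlier sketches:

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2],
    "amount": [12.5, 7.0],
    "ts": pd.to_datetime(["2024-01-01", "2024-01-02"]),
})

# Metadata object: schema, keys, and event time are inferred from df on insert
fg = fs.create_feature_group(
    name="transactions",
    version=1,
    primary_key=["customer_id"],
    event_time="ts",
    description="Toy example; column names are assumptions",
)
fg.insert(df)  # a Polars or Spark DataFrame would work here too
```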
- | MapType | - | MAP<String,TYPE>| Only when time_travel_type!="HUDI"; Only string keys permitted |
+ or a [Pandas](https://pandas.pydata.org/) DataFrame, or a [Polars](https://pola.rs/) DataFrame in a Python-only environment (P), the following default mapping to offline feature types applies:
+
+ | Spark Type (S) | Pandas Type (P) | Polars Type (P) | Offline Feature Type | Remarks |
+ | MapType | - | - | MAP<String,TYPE> | Only when time_travel_type!="HUDI"; Only string keys permitted |

  When registering a Pandas DataFrame in a PySpark environment (S) the Pandas DataFrame is first converted to a Spark DataFrame, using Spark's [default conversion](https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.SparkSession.createDataFrame.html).
  It results in a less fine-grained mapping between Python and Spark types:
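That linked default conversion is easy to observe directly; a tiny sketch (column names invented) of what Spark does when handed a Pandas DataFrame:

```python
from pyspark.sql import SparkSession
import pandas as pd

spark = SparkSession.builder.getOrCreate()
pdf = pd.DataFrame({"id": [1, 2], "amount": [0.5, 1.5]})

# Spark's default Pandas -> Spark conversion
sdf = spark.createDataFrame(pdf)
sdf.printSchema()  # shows the coarser Spark types, e.g. id: bigint, amount: double
```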
docs/user_guides/integrations/python.md

@@ -53,7 +53,7 @@ fs = conn.get_feature_store() # Get the project's default feature store
  !!! note "Engine"

- `HSFS` uses either Apache Spark or Pandas on Python as an execution engine to perform queries against the feature store. The `engine` option of the connection let's you overwrite the default behaviour by setting it to `"python"` or `"spark"`. By default, `HSFS` will try to use Spark as engine if PySpark is available. So if you have PySpark installed in your local Python environment, but you have not configured Spark, you will have to set `engine='python'`. Please refer to the [Spark integration guide](spark.md) to configure your local Spark cluster to be able to connect to the Hopsworks Feature Store.
+ `HSFS` uses either Apache Spark or Pandas/Polars on Python as an execution engine to perform queries against the feature store. The `engine` option of the connection lets you override the default behaviour by setting it to `"python"` or `"spark"`. By default, `HSFS` will try to use Spark as the engine if PySpark is available. So if you have PySpark installed in your local Python environment but have not configured Spark, you will have to set `engine='python'`. Please refer to the [Spark integration guide](spark.md) to configure your local Spark cluster to be able to connect to the Hopsworks Feature Store.
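A hedged sketch of pinning the engine at connection time; the host, project, and API key values are placeholders:

```python
import hsfs

# Force the pure-Python engine even if PySpark happens to be installed
conn = hsfs.connection(
    host="my-instance.cloud.hopsworks.ai",
    project="my_project",
    api_key_value="...",
    engine="python",
)
fs = conn.get_feature_store()
```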