diff --git a/docs/user_guides/fs/compute_engines.md b/docs/user_guides/fs/compute_engines.md index da655cf5b..6adb0c9d4 100644 --- a/docs/user_guides/fs/compute_engines.md +++ b/docs/user_guides/fs/compute_engines.md @@ -4,31 +4,31 @@ In order to execute a feature pipeline to write to the Feature Store, as well as Hopsworks Feature Store APIs are built around dataframes, that means feature data is inserted into the Feature Store from a Dataframe and likewise when reading data from the Feature Store, it is returned as a Dataframe. -As such, Hopsworks supports three computational engines: +As such, Hopsworks supports five computational engines: 1. [Apache Spark](https://spark.apache.org): Spark Dataframes and Spark Structured Streaming Dataframes are supported, both from Python environments (PySpark) and from Scala environments. 2. [Python](https://www.python.org/): For pure Python environments without dependencies on Spark, Hopsworks supports [Pandas Dataframes](https://pandas.pydata.org/) and [Polars Dataframes](https://pola.rs/). 3. [Apache Flink](https://flink.apache.org): Flink Data Streams are currently supported as an experimental feature from Java/Scala environments. -3. [Apache Beam](https://beam.apache.org/) *experimental*: Beam Data Streams are currently supported as an experimental feature from Java/Scala environments. +4. [Apache Beam](https://beam.apache.org/) *experimental*: Beam Data Streams are currently supported as an experimental feature from Java/Scala environments. +5. [Java](https://www.java.com): For pure Java environments without dependencies on Spark, Hopsworks supports writing using List of POJO Objects. Hopsworks supports running [compute on the platform itself](../../concepts/dev/inside.md) in the form of [Jobs](../projects/jobs/pyspark_job.md) or in [Jupyter Notebooks](../projects/jupyter/python_notebook.md). Alternatlively, you can also connect to Hopsworks using Python or Spark from [external environments](../../concepts/dev/outside.md), given that there is network connectivity. ## Functionality Support -Hopsworks is aiming to provide funtional parity between the computational engines, however, there are certain Hopsworks functionalities which are exclusive to the engines. +Hopsworks is aiming to provide functional parity between the computational engines, however, there are certain Hopsworks functionalities which are exclusive to the engines. -| Functionality | Method | Spark | Python | Flink | Beam | Comment | -| ----------------------------------------------------------------- | ------ | ----- | ------ | ------ | ------ | ------- | -| Feature Group Creation from dataframes | [`FeatureGroup.create_feature_group()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/feature_group_api/#create_feature_group) | :white_check_mark: | :white_check_mark: | - | - | Currently Flink/Beam doesn't support registering feature group metadata. Thus it needs to be pre-registered before you can write real time features computed by Flink/Beam.| -| Training Dataset Creation from dataframes | [`TrainingDataset.save()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/training_dataset_api/#save) | :white_check_mark: | - | - | - | Functionality was deprecated in version 3.0 | -| Data validation using Great Expectations for streaming dataframes | [`FeatureGroup.validate()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/feature_group_api/#validate) [`FeatureGroup.insert_stream()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/feature_group_api/#insert_stream) | - | - | - | - | `insert_stream` does not perform any data validation even when a expectation suite is attached. | -| Stream ingestion | [`FeatureGroup.insert_stream()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/feature_group_api/#insert_stream) | :white_check_mark: | - | :white_check_mark: | :white_check_mark: | Python/Pandas/Polars has currently no notion of streaming. | -| Stream ingestion | [`FeatureGroup.insert_stream()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/feature_group_api/#insert_stream) | :white_check_mark: | - | :white_check_mark: | :white_check_mark: | Python/Pandas/Polars has currently no notion of streaming. | -| Reading from Streaming Storage Connectors | [`KafkaConnector.read_stream()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/storage_connector_api/#read_stream) | :white_check_mark: | - | - | - | Python/Pandas/Polars has currently no notion of streaming. For Flink/Beam only write operations are supported | -| Reading training data from external storage other than S3 | [`FeatureView.get_training_data()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/feature_view_api/#get_training_data) | :white_check_mark: | - | - | - | Reading training data that was written to external storage using a Storage Connector other than S3 can currently not be read using HSFS APIs, instead you will have to use the storage's native client. | -| Reading External Feature Groups into Dataframe | [`ExternalFeatureGroup.read()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/external_feature_group_api/#read) | :white_check_mark: | - | - | - | Reading an External Feature Group directly into a Pandas/Polars Dataframe is not supported, however, you can use the [Query API](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/query_api/) to create Feature Views/Training Data containing External Feature Groups. | -| Read Queries containing External Feature Groups into Dataframe | [`Query.read()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/query_api/#read) | :white_check_mark: | - | - | - | Reading a Query containing an External Feature Group directly into a Pandas/Polars Dataframe is not supported, however, you can use the Query to create Feature Views/Training Data and write the data to a Storage Connector, from where you can read up the data into a Pandas/Polars Dataframe. | +| Functionality | Method | Spark | Python | Flink | Beam | Java | Comment | +| ----------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------ | ------------------ | ---------------------- | ------------------ | ------------------ |------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| Feature Group Creation from dataframes | [`FeatureGroup.create_feature_group()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/feature_group_api/#create_feature_group) | :white_check_mark: | :white_check_mark: | - | - | - | Currently Flink/Beam/Java doesn't support registering feature group metadata. Thus it needs to be pre-registered before you can write real time features computed by Flink/Beam. | +| Training Dataset Creation from dataframes | [`TrainingDataset.save()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/training_dataset_api/#save) | :white_check_mark: | - | - | - | - | Functionality was deprecated in version 3.0 | +| Data validation using Great Expectations for streaming dataframes | [`FeatureGroup.validate()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/feature_group_api/#validate)
[`FeatureGroup.insert_stream()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/feature_group_api/#insert_stream) | - | - | - | - | - | `insert_stream` does not perform any data validation even when a expectation suite is attached. | +| Stream ingestion | [`FeatureGroup.insert_stream()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/feature_group_api/#insert_stream) | :white_check_mark: | - | :white_check_mark: | :white_check_mark: | :white_check_mark: | Python/Pandas/Polars has currently no notion of streaming. | +| Reading from Streaming Storage Connectors | [`KafkaConnector.read_stream()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/storage_connector_api/#read_stream) | :white_check_mark: | - | - | - | - | Python/Pandas/Polars has currently no notion of streaming. For Flink/Beam/Java only write operations are supported | +| Reading training data from external storage other than S3 | [`FeatureView.get_training_data()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/feature_view_api/#get_training_data) | :white_check_mark: | - | - | - | - | Reading training data that was written to external storage using a Storage Connector other than S3 can currently not be read using HSFS APIs, instead you will have to use the storage's native client. | +| Reading External Feature Groups into Dataframe | [`ExternalFeatureGroup.read()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/external_feature_group_api/#read) | :white_check_mark: | - | - | - | - | Reading an External Feature Group directly into a Pandas/Polars Dataframe is not supported, however, you can use the [Query API](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/query_api/) to create Feature Views/Training Data containing External Feature Groups. | +| Read Queries containing External Feature Groups into Dataframe | [`Query.read()`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/query_api/#read) | :white_check_mark: | - | - | - | - | Reading a Query containing an External Feature Group directly into a Pandas/Polars Dataframe is not supported, however, you can use the Query to create Feature Views/Training Data and write the data to a Storage Connector, from where you can read up the data into a Pandas/Polars Dataframe. | ## Python @@ -77,3 +77,7 @@ Apache Beam integration with Hopsworks feature store was only tested using Dataf For more details head over to the [Getting Started Guide](https://github.com/logicalclocks/hopsworks-tutorials/tree/master/integrations/java/beam). +## Java +It is also possible to interact to Hopsworks feature store using pure Java environments without dependencies on Spark, Flink or Beam. + +For more details head over to the [Getting Started Guide](https://github.com/logicalclocks/hopsworks-tutorials/tree/master/java). \ No newline at end of file diff --git a/docs/user_guides/integrations/index.md b/docs/user_guides/integrations/index.md index db8d38b1d..bf52b139b 100644 --- a/docs/user_guides/integrations/index.md +++ b/docs/user_guides/integrations/index.md @@ -3,9 +3,10 @@ Hopsworks is an open platform aiming to be accessible from a variety of tools. Learn in this section how to connect to Hopsworks from - [Python](python) +- [Java](java) - [Databricks](databricks/networking) - [AWS SageMaker](sagemaker) - [AWS EMR](emr/networking) - [Azure HDInsight](hdinsight) - [Azure Machine Learning](mlstudio_designer) -- [Apache Spark](spark) +- [Apache Spark](spark) \ No newline at end of file diff --git a/docs/user_guides/integrations/java.md b/docs/user_guides/integrations/java.md new file mode 100644 index 000000000..4e730000e --- /dev/null +++ b/docs/user_guides/integrations/java.md @@ -0,0 +1,100 @@ +--- +description: Documentation on how to connect to Hopsworks from a Java client. +--- + +# Java client + +Starting from version 3.9.0-RC13, HSFS provides a pure Java client. This guide explains how to use the client to connect to Hopsworks and read or write feature data. + +## Generate an API key + +For instructions on how to generate an API key follow this [user guide](../projects/api_key/create_api_key.md). For the Java client to work correctly make sure you add the following scopes to your API key: + + 1. featurestore + 2. project + 3. job + 4. kafka + +## Add the HSFS dependency to your project: + +The HSFS library is available on the Hopsworks' Maven repository. If you are using Maven as build tool, you can add the following in your pom.xml file: + +```xml + + + Hops + Hops Repository + https://archiva.hops.works/repository/Hops/ + + true + + + true + + + +``` + +The artifactId for the HSFS Java build is hsfs, if you are using Maven as build tool, you can add the following dependency: + +```xml + + com.logicalclocks + hsfs + ${hsfs.version} + +``` + +!!!note "Java Version" + + Please note that the Java client has been tested with Java versions up to Java 17 + +## Connecting to the Feature Store + +You are now ready to connect to the Hopsworks Feature Store from a Java client: + +```Java +//Import necessary classes +import com.logicalclocks.hsfs.FeatureStore; +import com.logicalclocks.hsfs.FeatureView; +import com.logicalclocks.hsfs.HopsworksConnection; + +//Establish connection with Hopsworks. +HopsworksConnection hopsworksConnection = HopsworksConnection.builder() + .host("my_instance") // DNS of your Feature Store instance + .port(443) // Port to reach your Hopsworks instance, defaults to 443 + .project("my_project") // Name of your Hopsworks Feature Store project + .apiKeyValue("api_key") // The API key to authenticate with the feature store + .hostnameVerification(false) // Disable for self-signed certificates + .build(); + +//get feature store handle +FeatureStore fs = hopsworksConnection.getFeatureStore(); + +//get feature view handle +FeatureView fv = fs.getFeatureView(fvName, fvVersion); + +// get feature vector +List singleVector = fv.getFeatureVector(new HashMap() {{ + put("id", 100); + }}); +``` + +### Update feature data + +The Java client allows you to update data on existing feature groups using the streaming interface. You can provide a list of POJO objects to the [insertStream](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/javadoc/com/logicalclocks/hsfs/StreamFeatureGroup.html#insertStream-java.util.List-) method. + +The feature group should exists already (can be created using the Python client) and the POJO objects should be serializable with the feature group's AVRO schema. + +Please see the [tutorial](https://github.com/logicalclocks/hopsworks-tutorials/tree/master/integrations/java/java) for a code example on how to write data. + +### Limitations + +Currently using the Java client to retrieve feature vectors have the following limitations: + +* Only the SQL interface is supported. It is not possible to retrieve feature vectors using the REST API Interface +* Feature Views with model dependent transformations attached are not applied. If your feature view has model dependent transformations, please use the Python client. + +## Next Steps + +You can find more information on how to interact from Java client in the [JavaDoc](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/javadoc/) or this [tutorial](https://github.com/logicalclocks/hopsworks-tutorials/tree/master/integrations/java/java) \ No newline at end of file diff --git a/mkdocs.yml b/mkdocs.yml index c93cac646..2210024f6 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -119,6 +119,7 @@ nav: - Apache Spark: user_guides/integrations/spark.md - Apache Flink: user_guides/integrations/flink.md - Apache Beam: user_guides/integrations/beam.md + - Java: user_guides/integrations/java.md - Sharing: user_guides/fs/sharing/sharing.md - Tags: user_guides/fs/tags/tags.md - Provenance: user_guides/fs/provenance/provenance.md