- Remove references to `pip install hsfs` and `hsfs.connection()`
- Improve the documentation for the installation of the Python library (including profiles)
- Add documentation for the installation of the Java library
`docs/user_guides/client_installation/index.md` (+79 −29)
@@ -1,56 +1,106 @@
---
-description: Documentation on how to install the Hopsworks and HSFS Python libraries, including the specific requirements for Mac OSX and Windows.
+description: Documentation on how to install the Hopsworks Python and Java library.
---
# Client Installation Guide

-## Hopsworks (including Feature Store and MLOps)
-The Hopsworks client library is required to connect to the Hopsworks Feature Store and MLOps services from your local machine or any other Python environment such as Google Colab or AWS Sagemaker. Execute the following command to install the full Hopsworks client library in your Python environment:
+## Hopsworks Python library
+
+The Hopsworks Python client library is required to connect to the Hopsworks Feature Store and MLOps services from your local machine or any other Python environment such as Google Colab or AWS Sagemaker. Execute the following command to install the Hopsworks client library in your Python environment:

!!! note "Virtual environment"
    It is recommended to use a virtual python environment instead of the system environment used by your operating system, in order to avoid any side effects regarding interfering dependencies.
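
The install command itself is truncated out of the diff at this point. As a minimal sketch of what the added documentation most likely shows (assuming the package is published on PyPI as `hopsworks`, which the surrounding text implies):

```bash
# Install the Hopsworks client library into the active (ideally virtual) environment
pip install hopsworks
```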
Hopsworks latest version should work on OSX systems without any additional requirements. However, if installing an older version of the Hopsworks SDK you might need to install `librdkafka` manually. Check out the documentation for the specific version you are installing.
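
For the older-SDK case mentioned above, `librdkafka` on macOS is typically installed with Homebrew; a hedged example (not taken from the source, and not needed for current releases):

```bash
# Only needed for older SDK versions on macOS that do not bundle librdkafka
brew install librdkafka
```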

+### Profiles

-!!! attention "Windows/Conda Installation"
+The Hopsworks library has several profiles that bring additional dependencies and enable additional functionalities:

-    On Windows systems you might need to install twofish manually before installing hopsworks, if you don't have the Microsoft Visual C++ Build Tools installed. In that case, it is recommended to use a conda environment and run the following commands:
-
-    ```bash
-    conda install twofish
-    pip install hopsworks
-    ```
+| Profile Name | Description |
+| ------------------ | ------------- |
+| No Profile | This is the base installation. Supports interacting with the feature store metadata, model registry and deployments. It also supports reading and writing from the feature store from PySpark environments. |
+| `python` | This profile enables reading and writing from/to the feature store from a Python environment |
+| `great-expectations` | This profile installs the [Great Expectations](https://greatexpectations.io/) Python library and enables data validation on feature pipelines |
+| `polars` | This profile installs the [Polars](https://pola.rs/) library and enables reading and writing Polars DataFrames |

-## Feature Store only
-To only install the Hopsworks Feature Store client library, execute the following command:
+You can install all the above profiles with the following command:
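
The command itself is cut off here. Based on the removed `pip install hsfs[python]` line further down and the profile names above, it is presumably the standard pip "extras" syntax; the exact extras string below is an assumption:

```bash
# Install the Hopsworks library together with all optional profiles (assumed extras names)
pip install "hopsworks[python,great-expectations,polars]"

# Or pick a single profile, e.g. only the Python engine
pip install "hopsworks[python]"
```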

+If you want to interact with the Hopsworks Feature Store from environments such as Spark, Flink or Beam, you can use the Hopsworks Feature Store (HSFS) Java library.
+
+!!! note "Feature Store Only"
+
+    The Java library only allows interaction with the Feature Store component of the Hopsworks platform. Additionally, each environment might restrict the supported API operations. You can see which API operations are supported by which environment [here](../fs/compute_engines).
+
+The HSFS library is available in the Hopsworks Maven repository. If you are using Maven as your build tool, you can add the following in your `pom.xml` file:
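
The `pom.xml` repository snippet itself is missing from the diff. A sketch of the general shape follows; the `id`, `name`, and URL below are placeholders rather than values from the source, so substitute the ones published in the Hopsworks documentation:

```xml
<repositories>
    <repository>
        <!-- Placeholder values: use the repository id/name/URL from the Hopsworks docs -->
        <id>Hops</id>
        <name>Hops Repository</name>
        <url>https://REPLACE-WITH-HOPSWORKS-MAVEN-REPOSITORY-URL</url>
    </repository>
</repositories>
```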

+The library has different builds targeting different environments:

-!!! attention "Windows/Conda Installation"
-
-    On Windows systems you might need to install twofish manually before installing hsfs, if you don't have the Microsoft Visual C++ Build Tools installed. In that case, it is recommended to use a conda environment and run the following commands:
-
-    ```bash
-    conda install twofish
-    pip install hsfs[python]
-    ```
+### Spark
+
+The `artifactId` for the Spark build is `hsfs-spark-spark{spark.version}`. If you are using Maven as your build tool, you can add the following dependency:
+
+```
+<dependency>
+    <groupId>com.logicalclocks</groupId>
+    <artifactId>hsfs-spark-spark3.1</artifactId>
+    <version>${hsfs.version}</version>
+</dependency>
+```
+
+Hopsworks provides builds for Spark 3.1, 3.3 and 3.5. The builds are also provided as JAR files which can be downloaded from the [Hopsworks repository](https://repo.hops.works/master/hsfs).
+
+### Flink
+
+The `artifactId` for the Flink build is `hsfs-flink`. If you are using Maven as your build tool, you can add the following dependency:
+
+```
+<dependency>
+    <groupId>com.logicalclocks</groupId>
+    <artifactId>hsfs-flink</artifactId>
+    <version>${hsfs.version}</version>
+</dependency>
+```
+
+### Beam
+
+The `artifactId` for the Beam build is `hsfs-beam`. If you are using Maven as your build tool, you can add the following dependency:
+
+```
+<dependency>
+    <groupId>com.logicalclocks</groupId>
+    <artifactId>hsfs-beam</artifactId>
+    <version>${hsfs.version}</version>
+</dependency>
+```

## Next Steps

-If you are using a local python environment and want to connect to the Hopsworks Feature Store, you can follow the [Python Guide](../integrations/python.md#generate-an-api-key) section to create an API Key and to get started.
+If you are using a local python environment and want to connect to Hopsworks, you can follow the [Python Guide](../integrations/python.md#generate-an-api-key) section to create an API Key and to get started.

-In order for the Databricks cluster to be able to communicate with the Hopsworks Feature Store, the clients running on Databricks need to be able to access a Hopsworks API key.
+In order for the Databricks cluster to be able to communicate with Hopsworks, clients running on Databricks need to be able to access a Hopsworks API key.

## Generate an API key

@@ -15,127 +15,20 @@ For instructions on how to generate an API key follow this [user guide](../../pr
!!! hint "API key as Argument"
    To get started quickly, without saving the Hopsworks API key in a secret storage, you can simply supply it as an argument when instantiating a connection:
-    ```python hl_lines="6"
-    import hsfs
-    conn = hsfs.connection(
-        host='my_instance',                 # DNS of your Feature Store instance
-        port=443,                           # Port to reach your Hopsworks instance, defaults to 443
-        project='my_project',               # Name of your Hopsworks Feature Store project
-        api_key_value='apikey',             # The API key to authenticate with Hopsworks
-        hostname_verification=True          # Disable for self-signed certificates
-    )
-    fs = conn.get_feature_store()           # Get the project's default feature store
-    ```

-## Store the API key
-
-### AWS
-
-#### Step 1: Create an instance profile to attach to your Databricks clusters
-
-Go to the *AWS IAM* choose *Roles* and click on *Create Role*. Select *AWS Service* as the type of trusted entity and *EC2* as the use case as shown below:
-
-<p align="center">
-  <figure>
-    <img src="../../../../assets/images/guides/integrations/create-instance-profile.png" alt="Create an instance profile">
-    <figcaption>Create an instance profile</figcaption>
-  </figure>
-</p>
-
-Click on *Next: Permissions*, *Next: Tags*, and then *Next: Review*. Name the instance profile role and then click *Create role*.
-
-#### Step 2: Storing the API Key
-
-**Option 1: Using the AWS Systems Manager Parameter Store**
-
-In the AWS Management Console, ensure that your active region is the region you use for Databricks.
-Go to the *AWS Systems Manager* choose *Parameter Store* and select *Create Parameter*.
-As name enter `/hopsworks/role/[MY_DATABRICKS_ROLE]/type/api-key` replacing `[MY_DATABRICKS_ROLE]` with the name of the AWS role you have created in [Step 1](#step-1-create-an-instance-profile-to-attach-to-your-databricks-clusters). Select *Secure String* as type and create the parameter.
-
-<p align="center">
-  <figure>
-    <img src="../../../../assets/images/guides/integrations/databricks/aws/databricks_parameter_store.png" alt="Storing the Feature Store API key in the Parameter Store">
-    <figcaption>Storing the Feature Store API key in the Parameter Store</figcaption>
-  </figure>
-</p>
-
-Once the API Key is stored, you need to grant access to it from the AWS role that you have created in [Step 1](#step-1-create-an-instance-profile-to-attach-to-your-databricks-clusters).
-In the AWS Management Console, go to *IAM*, select *Roles* and then search for the role that you have created in [Step 1](#step-1-create-an-instance-profile-to-attach-to-your-databricks-clusters).
-Select *Add inline policy*. Choose *Systems Manager* as service, expand the *Read* access level and check *GetParameter*.
-Expand Resources and select *Add ARN*.
-Enter the region of the *Systems Manager* as well as the name of the parameter **WITHOUT the leading slash** e.g. *hopsworks/role/[MY_DATABRICKS_ROLE]/type/api-key* and click *Add*.
-Click on *Review*, give the policy a name and click on *Create policy*.
-
-<p align="center">
-  <figure>
-    <img src="../../../../assets/images/guides/integrations/databricks/aws/databricks_parameter_store_policy.png" alt="Configuring the access policy for the Parameter Store">
-    <figcaption>Configuring the access policy for the Parameter Store</figcaption>
-  </figure>
-</p>
-
-**Option 2: Using the AWS Secrets Manager**
-
-In the AWS management console ensure that your active region is the region you use for Databricks.
-Go to the *AWS Secrets Manager* and select *Store new secret*. Select *Other type of secrets* and add *api-key*
-as the key and paste the API key created in the previous step as the value. Click next.
-
-<p align="center">
-  <figure>
-    <img src="../../../../assets/images/guides/integrations/databricks/aws/databricks_secrets_manager_step_1.png" alt="Storing a Feature Store API key in the Secrets Manager Step 1">
-    <figcaption>Storing a Feature Store API key in the Secrets Manager Step 1</figcaption>
-  </figure>
-</p>
-
-As secret name, enter *hopsworks/role/[MY_DATABRICKS_ROLE]* replacing [MY_DATABRICKS_ROLE] with the AWS role you have created in [Step 1](#step-1-create-an-instance-profile-to-attach-to-your-databricks-clusters). Select next twice and finally store the secret.
-Then click on the secret in the secrets list and take note of the *Secret ARN*.
-
-<p align="center">
-  <figure>
-    <img src="../../../../assets/images/guides/integrations/databricks/aws/databricks_secrets_manager_step_2.png" alt="Storing a Feature Store API key in the Secrets Manager Step 2">
-    <figcaption>Storing a Feature Store API key in the Secrets Manager Step 2</figcaption>
-  </figure>
-</p>
-
-Once the API Key is stored, you need to grant access to it from the AWS role that you have created in [Step 1](#step-1-create-an-instance-profile-to-attach-to-your-databricks-clusters).
-In the AWS Management Console, go to *IAM*, select *Roles* and then the role that you have created in [Step 1](#step-1-create-an-instance-profile-to-attach-to-your-databricks-clusters).
-Select *Add inline policy*. Choose *Secrets Manager* as service, expand the *Read* access level and check *GetSecretValue*.
-Expand Resources and select *Add ARN*. Paste the ARN of the secret created in the previous step.
-Click on *Review*, give the policy a name and click on *Create policy*.
-
-<p align="center">
-  <figure>
-    <img src="../../../../assets/images/guides/integrations/databricks/aws/databricks_secrets_manager_policy.png" alt="Configuring the access policy for the Secrets Manager">
-    <figcaption>Configuring the access policy for the Secrets Manager</figcaption>
-  </figure>
-</p>
-
-#### Step 3: Allow Databricks to use the AWS role created in Step 1
-
-First you need to get the AWS role used by Databricks for deployments as described in [this step](https://docs.databricks.com/administration-guide/cloud-configurations/aws/instance-profiles.html#step-3-note-the-iam-role-used-to-create-the-databricks-deployment). Once you get the role name, go to *AWS IAM*, search for the role, and click on it. Then, select the *Permissions* tab, click on *Add inline policy*, select the *JSON* tab, and paste the following snippet. Replace *[ACCOUNT_ID]* with your AWS account id, and *[MY_DATABRICKS_ROLE]* with the AWS role name created in [Step 1](#step-1-create-an-instance-profile-to-attach-to-your-databricks-clusters).
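
The JSON snippet referenced here is truncated out of the diff. For a Databricks instance profile this is normally an `iam:PassRole` grant, so the following is a hedged reconstruction under that assumption rather than the exact removed snippet:

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ReconstructedPassRolePolicyExample",
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": "arn:aws:iam::[ACCOUNT_ID]:role/[MY_DATABRICKS_ROLE]"
        }
    ]
}
```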

+        host='my_instance',                 # DNS of your Feature Store instance
+        port=443,                           # Port to reach your Hopsworks instance, defaults to 443
+        project='my_project',               # Name of your Hopsworks Feature Store project
+        api_key_value='apikey',             # The API key to authenticate with Hopsworks
+        hostname_verification=True          # Disable for self-signed certificates
+    )
+    fs = project.get_feature_store()        # Get the project's default feature store
     ```
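
The opening lines of this new snippet are cut off in the diff. Given the keyword arguments shown and the `project.get_feature_store()` call, the block presumably starts with a `hopsworks.login(...)` call; a hedged reconstruction (the function name and surrounding lines are inferred, not visible in the diff):

```python
import hopsworks

# Reconstructed example; the hopsworks.login(...) call is inferred from the visible arguments
project = hopsworks.login(
    host='my_instance',                 # DNS of your Feature Store instance
    port=443,                           # Port to reach your Hopsworks instance, defaults to 443
    project='my_project',               # Name of your Hopsworks Feature Store project
    api_key_value='apikey',             # The API key to authenticate with Hopsworks
    hostname_verification=True          # Disable for self-signed certificates
)
fs = project.get_feature_store()        # Get the project's default feature store
```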

-Click *Review Policy*, name the policy, and click *Create Policy*. Then, go to your Databricks workspace and follow [this step](https://docs.databricks.com/administration-guide/cloud-configurations/aws/instance-profiles.html#step-5-add-the-instance-profile-to-databricks) to add the instance profile to your workspace. Finally, when launching Databricks clusters, select *Advanced* settings and choose the instance profile you have just added.
-
-### Azure
-
-On Azure we currently do not support storing the API key in a secret storage. Instead just store the API key in a file in your Databricks workspace so you can access it when connecting to the Feature Store.
-
## Next Steps

Continue with the [configuration guide](configuration.md) to finalize the configuration of the Databricks Cluster to communicate with the Hopsworks Feature Store.

Don't forget to replace X.X.0 with the major and minor version of your Hopsworks deployment.
+!!! attention "Matching Hopsworks version"

-<p align="center">
-    <figure>
-        <img src="../../../../assets/images/hopsworks-version.png" alt="HSFS version needs to match the major version of Hopsworks">
-        <figcaption>To find your Hopsworks version, enter any of your projects and go to the settings tab inside your project.</figcaption>
-    </figure>
-</p>
+    We recommend that the major and minor version of the Python library match the major and minor version of the Hopsworks deployment.
+
+<p align="center">
+    <figure>
+        <img src="../../../../assets/images/hopsworks-version.png" alt="The library version needs to match the major version of Hopsworks">
+        <figcaption>You find the Hopsworks version inside any of your Project's settings tab on Hopsworks</figcaption>
+    </figure>
+</p>

Add the bootstrap actions when configuring your EMR cluster. Provide 3 arguments to the bootstrap action: the name of the API key secret, e.g. `hopsworks/featurestore`,
the public DNS name of your Hopsworks cluster, such as `ad005770-33b5-11eb-b5a7-bfabd757769f.cloud.hopsworks.ai`, and the name of your Hopsworks project, e.g. `demo_fs_meb10179`.
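
Purely as an illustration of how those three arguments line up (the bootstrap script location below is a placeholder, not from the source), an AWS CLI sketch:

```bash
# Hypothetical example; replace the S3 path with the actual Hopsworks bootstrap script location
# and keep the other create-cluster options you normally use in place of "..."
aws emr create-cluster ... \
    --bootstrap-actions 'Path=s3://my-bucket/hopsworks-bootstrap.sh,Args=[hopsworks/featurestore,ad005770-33b5-11eb-b5a7-bfabd757769f.cloud.hopsworks.ai,demo_fs_meb10179]'
```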