Commit b4976ec

Alexandru Ormenisan authored and committed
Merge remote-tracking branch 'origin' into modelProvenance
2 parents b8931e0 + 0581f14 commit b4976ec

169 files changed, +1330 -6444 lines changed

docs/admin/services.md

-49
This file was deleted.

docs/concepts/dev/inside.md

+13-7
@@ -1,4 +1,4 @@
-Hopsworks provides a complete self-service development environment for feature engineering and model training. You can develop programs as Jupyter notebooks or jobs, you can manage the Python libraries in a project using its conda environment, you can manage your source code with Git, and you can orchestrate jobs with Airflow.
+Hopsworks provides a complete self-service development environment for feature engineering and model training. You can develop programs as Jupyter notebooks or jobs, customize the bundled FTI (feature, training and inference pipeline) python environments, you can manage your source code with Git, and you can orchestrate jobs with Airflow.

 <img src="../../../assets/images/concepts/dev/dev-inside.svg">

@@ -10,18 +10,24 @@ Hopsworks provides a Jupyter notebook development environment for programs writt

 Hopsworks provides source code control support using Git (GitHub, GitLab or BitBucket). You can securely checkout code into your project and commit and push updates to your code to your source code repository.

-### Conda Environment per Project
+### FTI Pipeline Environments

-Hopsworks supports the self-service installation of Python libraries using PyPi, Conda, Wheel files, or GitHub URLs. The Python libraries are installed in a Conda environment linked with your project. Each project has a base Docker image and its custom conda environment. Jobs are run as Docker images, but they are compiled transparently for you when you update your Conda environment. That is, there is no need to write a Dockerfile, users install Python libraries in their project. You can setup custom development and production environments by creating new projects, each with their own conda environment.
+Hopsworks postulates that building ML systems following the FTI pipeline architecture is best practice. This architecture consists of three independently developed and operated ML pipelines:
+
+* Feature pipeline: takes as input raw data that it transforms into features (and labels)
+* Training pipeline: takes as input features (and labels) and outputs a trained model
+* Inference pipeline: takes new feature data and a trained model and makes predictions
+
+In order to facilitate the development of these pipelines Hopsworks bundles several python environments containing necessary dependencies. Each of these environments may then also be customized further by cloning it and installing additional dependencies from PyPi, Conda channels, Wheel files, GitHub repos or a custom Dockerfile. Internal compute such as Jobs and Jupyter is run in one of these environments and changes are applied transparently when you install new libraries using our APIs. That is, there is no need to write a Dockerfile, users install libraries directly in one or more of the environments. You can setup custom development and production environments by creating separate projects or creating multiple clones of an environment within the same project.

 ### Jobs

 In Hopsworks, a Job is a schedulable program that is allocated compute and memory resources. You can run a Job in Hopsworks:

-* from the UI;
-* programmatically with the Hopsworks SDK (Python, Java) or REST API;
-* from Airflow programs (either inside our outside Hopsworks);
-* from your IDE using a plugin ([PyCharm/IntelliJ plugin](https://plugins.jetbrains.com/plugin/15537-hopsworks));
+* From the UI
+* Programmatically with the Hopsworks SDK (Python, Java) or REST API
+* From Airflow programs (either inside our outside Hopsworks)
+* From your IDE using a plugin ([PyCharm/IntelliJ plugin](https://plugins.jetbrains.com/plugin/15537-hopsworks))

 ### Orchestration
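
Reviewer note on docs/concepts/dev/inside.md: the new "FTI Pipeline Environments" text describes three independent pipelines (feature, training, inference). A minimal sketch of that separation is given here in plain Python, with no Hopsworks APIs involved. All function names, field names, and the toy threshold model are hypothetical and chosen only for illustration; in Hopsworks each stage would presumably be its own job running in one of the bundled environments.

```python
from statistics import mean


def feature_pipeline(raw_rows):
    """Feature pipeline: raw data in, (feature, label) pairs out."""
    # Hypothetical raw schema: an integer amount in cents and a 0/1 label.
    return [(row["amount"] / 100.0, row["is_fraud"]) for row in raw_rows]


def training_pipeline(features_and_labels):
    """Training pipeline: features and labels in, a trained model out."""
    positives = [x for x, y in features_and_labels if y == 1]
    negatives = [x for x, y in features_and_labels if y == 0]
    # Toy model: a threshold halfway between the two class means.
    return {"threshold": (mean(positives) + mean(negatives)) / 2.0}


def inference_pipeline(model, new_features):
    """Inference pipeline: new feature data plus a model in, predictions out."""
    return [1 if x > model["threshold"] else 0 for x in new_features]


raw = [
    {"amount": 1000, "is_fraud": 0},
    {"amount": 2000, "is_fraud": 0},
    {"amount": 9000, "is_fraud": 1},
    {"amount": 8000, "is_fraud": 1},
]

features = feature_pipeline(raw)
model = training_pipeline(features)
predictions = inference_pipeline(model, [12.0, 75.0])
print(model, predictions)
```

Each stage only depends on the output of the previous one (features, then a model), which is what lets the three pipelines be developed, scheduled, and given different Python environments independently.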

docs/concepts/fs/feature_group/external_fg.md

+1-1
@@ -1,4 +1,4 @@
-External feature groups are offline feature groups where their data is stored in an external table. An external table requires a storage connector, defined with the Connector API (or more typically in the user interface), to enable HSFS to retrieve data from the external table. An external table includes a user-defined SQL string for retrieving data, but you also perform SQL operations, including projections, aggregations, and so on. The SQL query is executed on-demand when HSFS retrieves data from the external Feature Group, for example, when creating training data using features in the external table.
+External feature groups are offline feature groups where their data is stored in an external table. An external table requires a storage connector, defined with the Connector API (or more typically in the user interface), to enable HSFS to retrieve data from the external table. An external feature group doesn't allow for offline data ingestion or modification; instead, it includes a user-defined SQL string for retrieving data. You can also perform SQL operations, including projections, aggregations, and so on. The SQL query is executed on-demand when HSFS retrieves data from the external Feature Group, for example, when creating training data using features in the external table.

 In the image below, we can see that HSFS currently supports a large number of data sources, including any JDBC-enabled source, Snowflake, Data Lake, Redshift, BigQuery, S3, ADLS, GCS, and Kafka

docs/concepts/fs/feature_group/fg_overview.md

+11-1
@@ -7,6 +7,16 @@ A feature group is a table of features, where each feature group has a primary k

 ### Online and offline Storage

-Feature groups can be stored in a low-latency "online" database and/or in low cost, high throughput "offline" storage, typically a data lake or data warehouse. The online store stores only the latest values of features for a feature group. It is used to serve pre-computed features to models at runtime. The offline store stores the historical values of features for a feature group, so it may store many times more data than the online store. Offline feature groups are used, typically, to create training data for models, but also to retrieve data for batch scoring of models:
+Feature groups can be stored in a low-latency "online" database and/or in low cost, high throughput "offline" storage, typically a data lake or data warehouse.

 <img src="../../../../assets/images/concepts/fs/feature-storage.svg">
+
+#### Online Storage
+
+The online store stores only the latest values of features for a feature group. It is used to serve pre-computed features to models at runtime.
+
+#### Offline Storage
+
+The offline store stores the historical values of features for a feature group so that it may store much more data than the online store. Offline feature groups are used, typically, to create training data for models, but also to retrieve data for batch scoring of models.
+
+In most cases, offline data is stored in Hopsworks, but through the implementation of storage connectors, it can reside in an external file system. The externally stored data can be managed by Hopsworks by defining ordinary feature groups or it can be used for reading only by defining [External Feature Group](external_fg.md).

docs/concepts/fs/feature_group/fg_statistics.md

+1-1
@@ -6,7 +6,7 @@ HSFS supports monitoring, validation, and alerting for features:

 ### Statistics

-When you create a Feature Group in HSFS, you can configure it to compute statistics over the features inserted into the fFeature Group by setting the `statistics_config` dict parameter, see [Feature Group Statistics](../../../../user_guides/fs/feature_group/statistics/) for details. Every time you write to the Feature Group, new statistics will be computed over all of the data in the Feature Group.
+When you create a Feature Group in HSFS, you can configure it to compute statistics over the features inserted into the Feature Group by setting the `statistics_config` dict parameter, see [Feature Group Statistics](../../../../user_guides/fs/feature_group/statistics/) for details. Every time you write to the Feature Group, new statistics will be computed over all of the data in the Feature Group.


 ### Data Validation

docs/concepts/fs/index.md

+1-1
@@ -9,7 +9,7 @@ Hopsworks and its Feature Store are an open source data-intensive AI platform us
 ##HSFS API


-The HSFS (HopsworkS Feature Store) API is how you, as a developer, will use the feature store.
+The HSFS (Hopsworks Feature Store) API is how you, as a developer, will use the feature store.
 The HSFS API helps simplify some of the problems that feature stores address including:

 - consistent features for training and serving

docs/concepts/hopsworks.md

+1-1
@@ -20,5 +20,5 @@ Hopsworks provides a vector database (or embedding store) based on [OpenSearch k
 Hopsworks provides a data-mesh architecture for managing ML assets and teams, with multi-tenant projects. Not unlike a GitHub repository, a project is a sandbox containing team members, data, and ML assets. In Hopsworks, all ML assets (features, models, training data) are versioned, taggable, lineage-tracked, and support free-text search. Data can be also be securely shared between projects.

 ## Data Science Platform
-You can develop feature engineering pipelines and training pipelines in Hopsworks. There is support for version control (GitHub, GitLab, BitBucket), Jupyter notebooks, a shared distributed file system, per project conda environments for managing python dependencies without needing to write Dockerfiles, jobs (Python, Spark, Flink), and workflow orchestration with Airflow.
+You can develop feature engineering, model training and inference pipelines in Hopsworks. There is support for version control (GitHub, GitLab, BitBucket), Jupyter notebooks, a shared distributed file system, many bundled modular project python environments for managing python dependencies without needing to write Dockerfiles, jobs (Python, Spark, Flink), and workflow orchestration with Airflow.

docs/index.md

+5-5
@@ -185,7 +185,7 @@ pointer-events: initial;
 </a>
 </div>
 <div id="w-node-_4a479fbb-90c7-9f47-d439-20aa6a224339-46672785" class="infra">
-<a href="./setup_installation/on_prem/hopsworks_installer/">
+<a href="./setup_installation/on_prem/contact_hopsworks/">
 <img src="images/icons8-database.svg" loading="lazy" alt="" class="infra-icon">
 <div class="name_item small">On-premise</div>
 </a>
@@ -247,7 +247,7 @@ pointer-events: initial;

 <img src="images/hopsworks-logo-2022.svg" loading="lazy" alt="" class="image_logo_02">

-Hopsworks is a data platform for ML with a Python-centric Feature Store and MLOps capabilities. Hopsworks is a modular platform. You can use it as a standalone Feature Store, you can use it to manage, govern, and serve your models, and you can even use it to develop and operate feature pipelines and training pipelines. Hopsworks brings collaboration for ML teams, providing a secure, governed platform for developing, managing, and sharing ML assets - features, models, training data, batch scoring data, logs, and more.
+Hopsworks is a data platform for ML with a Python-centric Feature Store and MLOps capabilities. Hopsworks is a modular platform. You can use it as a standalone Feature Store, you can use it to manage, govern, and serve your models, and you can even use it to develop and operate feature, training and inference pipelines. Hopsworks brings collaboration for ML teams, providing a secure, governed platform for developing, managing, and sharing ML assets - features, models, training data, batch scoring data, logs, and more.

 ## Python-Centric Feature Store
 Hopsworks is widely used as a standalone Feature Store. Hopsworks breaks the monolithic model development pipeline into separate feature and training pipelines, enabling both feature reuse and better tested ML assets. You can develop features by building feature pipelines in any Python (or Spark or Flink) environment, either inside or outside Hopsworks. You can use the Python frameworks you are familiar with to build production feature pipelines. You can compute aggregations in Pandas, validate feature data with Great Expectations, reduce your data dimensionality with embeddings and PCA, test your feature logic and features end-to-end with PyTest, and transform your categorical and numerical features with Scikit-Learn, TensorFlow, and PyTorch. You can orchestrate your feature pipelines with your Python framework of choice, including Hopsworks' own Airflow support.
@@ -262,10 +262,10 @@ Hopsworks provides model serving capabilities through KServe, with additional su
 Hopsworks provides projects as a secure sandbox in which teams can collaborate and share ML assets. Hopsworks' unique multi-tenant project model even enables sensitive data to be stored in a shared cluster, while still providing fine-grained sharing capabilities for ML assets across project boundaries. Projects can be used to structure teams so that they have end-to-end responsibility from raw data to managed features and models. Projects can also be used to create development, staging, and production environments for data teams. All ML assets support versioning, lineage, and provenance provide all Hopsworks users with a complete view of the MLOps life cycle, from feature engineering through model serving.

 ## Development and Operations
-Hopsworks provides development tools for Data Science, including conda environments for Python, Jupyter notebooks, jobs, or even notebooks as jobs. You can build production pipelines with the bundled Airflow, and even run ML training pipelines with GPUs in notebooks on Airflow. You can train models on as many GPUs as are installed in a Hopsworks cluster and easily share them among users. You can also run Spark, Spark Streaming, or Flink programs on Hopsworks, with support for elastic workers in the cloud (add/remove workers dynamically).
+Hopsworks provides a FTI (feature/training/inference) pipeline architecture for ML systems. Each part of the pipeline is defined in a Hopsworks job which corresponds to a Jupyter notebook, a python script or a jar. The production pipelines are then orchestrated with Airflow which is bundled in Hopsworks. Hopsworks provides several python environments that can be used and customized for each part of the FTI pipeline, for example switching between using PyTorch or TensorFlow in the training pipeline. You can train models on as many GPUs as are installed in a Hopsworks cluster and easily share them among users. You can also run Spark, Spark Streaming, or Flink programs on Hopsworks. JupyterLab is also bundled which can be used to run Python and Spark interactively.

 ## Available on any Platform
-Hopsworks is available as a both managed platform in the cloud on AWS, Azure, and GCP, and can be installed on any Linux-based virtual machines (Ubuntu/Redhat compatible), even in air-gapped data centers. Hopsworks is also available as a serverless platform that manages and serves both your features and models.
+Hopsworks is available to be installed on a kubernetes cluster in the cloud on AWS, Azure, and GCP, and On-Prem (Ubuntu/Redhat compatible), even in air-gapped data centers. Hopsworks is also available as a serverless platform that manages and serves both your features and models.

 ## Join the community
 - Ask questions and give us feedback in the [Hopsworks Community](https://community.hopsworks.ai/)
@@ -274,7 +274,7 @@ Hopsworks is available as a both managed platform in the cloud on AWS, Azure, an
 - Join our public [slack-channel](https://join.slack.com/t/public-hopsworks/shared_invite/zt-24fc3hhyq-VBEiN8UZlKsDrrLvtU4NaA )

 ## Contribute
-We are building the most complete and modular ML platform available in the market, and we count on your support to continuously improve Hopsworks. Feel free to [give us suggestions](https://github.com/logicalclocks/hopsworks), [report bugs](https://github.com/logicalclocks/hopsworks/issues) and [add features to our library](https://github.com/logicalclocks/feature-store-api) anytime.
+We are building the most complete and modular ML platform available in the market, and we count on your support to continuously improve Hopsworks. Feel free to [give us suggestions](https://github.com/logicalclocks/hopsworks), [report bugs](https://github.com/logicalclocks/hopsworks/issues) and [add features to our library](https://github.com/logicalclocks/hopsworks-api) anytime.

 ## Open-Source
 Hopsworks is available under the AGPL-V3 license. In plain English this means that you are free to use Hopsworks and even build paid services on it, but if you modify the source code, you should also release back your changes and any systems built around it as AGPL-V3.

docs/js/dropdown.js

+2-2
@@ -1,3 +1,3 @@
-document.getElementsByClassName("md-tabs__link")[7].style.display = "none";
-document.getElementsByClassName("md-tabs__link")[9].style.display = "none";
+document.getElementsByClassName("md-tabs__link")[6].style.display = "none";
+document.getElementsByClassName("md-tabs__link")[8].style.display = "none";
