# ArrowFlight Server with DuckDB
By default, Hopsworks uses big data technologies (Spark or Hive) to create training data and read data for Python clients.
This is great for large datasets, but for small or moderately sized datasets (think of the size of data that would fit in a Pandas
DataFrame in your local Python environment), the overhead of starting a Spark or Hive job and doing distributed data processing can be significant.

ArrowFlight Server with DuckDB significantly reduces the time that Python clients need to read feature groups
and batch inference data from the Feature Store, as well as the time needed to create moderately sized in-memory training datasets.

When the service is enabled, clients will automatically use it for the following operations:

- [reading Feature Groups](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/feature_group_api/#read)
- [reading Queries](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/query_api/#read)
- [reading Training Datasets](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/feature_view_api/#get_training_data)
- [creating In-Memory Training Datasets](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/feature_view_api/#training_data)
- [reading Batch Inference Data](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/feature_view_api/#get_batch_data)

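The operations above require no code changes on the client side; a sketch of the corresponding Python client calls might look as follows (the feature store handle `fs` and the feature group and feature view names are illustrative assumptions, not taken from this page):

```python
# Illustrative sketch only: `fs` is an hsfs feature store handle obtained from
# an established Hopsworks connection; names like "transactions" are made up.
def read_via_arrowflight(fs):
    fg = fs.get_feature_group("transactions", version=1)
    fg_df = fg.read()                     # reading a Feature Group
    query_df = fg.select_all().read()     # reading a Query
    fv = fs.get_feature_view("fraud_model", version=1)
    X, y = fv.training_data()             # creating an in-memory training dataset
    batch_df = fv.get_batch_data()        # reading Batch Inference Data
    return fg_df, query_df, X, y, batch_df
```

When the service is enabled, each of these reads is served by ArrowFlight Server with DuckDB instead of a Spark/Hive job.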
For larger datasets, clients can still make use of the Spark/Hive backend by explicitly setting
`read_options={"use_hive": True}`.

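As a minimal sketch of that override, the helper below wraps a feature group read with the Spark/Hive fallback; the function name and the `feature_group` handle are hypothetical, only the `read_options={"use_hive": True}` argument comes from this page:

```python
# Hypothetical helper: force a read through the Spark/Hive backend instead of
# ArrowFlight Server, e.g. for datasets too large for in-memory processing.
def read_with_hive_fallback(feature_group):
    # "use_hive": True is the documented switch to bypass ArrowFlight.
    read_options = {"use_hive": True}
    return feature_group.read(read_options=read_options)
```

The same `read_options` dictionary can be passed to the other read calls listed above.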
## Service configuration

!!! note
    Supported only on AWS at the moment.

!!! note
    Make sure that your cross-account role has the load balancer permissions described [here](../../aws/restrictive_permissions/#load-balancers-permissions-for-external-access); otherwise you have to create and manage the load balancer yourself.

The ArrowFlight Server is co-located with RonDB in the Hopsworks cluster.
If the ArrowFlight Server is activated, RonDB and ArrowFlight Server can each use up to 50%
of the available resources on the node, so they can co-exist without impacting each other.
Just like RonDB, the ArrowFlight Server can be replicated across multiple nodes to serve more clients at lower latency.
To guarantee high performance, each individual ArrowFlight Server instance processes client requests sequentially.
Requests will be queued for up to 10 minutes before they are rejected.

<p align="center">
  <figure>
    <img style="border: 1px solid #000" src="../../../assets/images/setup_installation/managed/common/arrowflight_rondb.png" alt="Configure RonDB">
    <figcaption>Activate ArrowFlight Server with DuckDB on a RonDB cluster</figcaption>
  </figure>
</p>

To deploy ArrowFlight Server on a cluster:

1. Select "RonDB cluster"
2. Select an instance type with at least 16GB of memory and 4 cores. (*)
3. Tick the checkbox `Enable ArrowFlight Server`.

(*) The service should have at least 2x the amount of memory available that a typical Python client would have.
    Because RonDB and ArrowFlight Server share the same node, we recommend selecting an instance type with at least 4x the
    client memory. For example, if the service serves Python clients with typically 4GB of memory,
    an instance with at least 16GB of memory should be selected.
    An instance with 16GB of memory will be able to read feature groups and training datasets of up to 10-100M rows,
    depending on the number of columns and the size of the features (~2GB in Parquet). The same instance will be able to create
    point-in-time correct training datasets with 1-10M rows, also depending on the number and the size of the features.
    Larger instances are able to handle larger datasets; the numbers scale roughly linearly with the instance size.

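The sizing rule above can be captured in a tiny helper; this is purely an illustrative worked example of the arithmetic, not part of any Hopsworks API:

```python
# Rule of thumb from the sizing note: ArrowFlight Server should get at least
# 2x the typical client memory, and it can use at most 50% of the node
# (the other half goes to RonDB), so the node needs at least 4x client memory.
def recommended_instance_memory_gb(client_memory_gb: float) -> float:
    return 4 * client_memory_gb
```

For example, clients with 4GB of memory yield a recommended instance size of at least 16GB, matching the example in the note above.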