
# How To Configure Inference Batcher

## Introduction

Inference batching can be enabled to increase inference request throughput at the cost of higher latency. The configuration of the inference batcher depends on the serving tool and the model server used in the deployment. See the compatibility matrix at the bottom of this page.

## GUI

### Step 1: Create new deployment

If you have at least one model already trained and saved in the Model Registry, navigate to the deployments page by clicking on the Deployments tab in the navigation menu on the left.

Deployments navigation tab

Once on the deployments page, you can create a new deployment either by clicking on New deployment (if there are no existing deployments) or on Create new deployment in the top-right corner. Both options open the deployment creation form.

### Step 2: Go to advanced options

A simplified creation form will appear, including the most common deployment fields from all available configurations. Inference batching is part of the advanced options of a deployment. To open the advanced creation form, click on Advanced options.

Advanced options. Go to the advanced deployment creation form

### Step 3: Configure inference batching

To enable inference batching, click on the Request batching checkbox.

Inference batching configuration (default values)

If your deployment uses KServe, you can optionally set three additional parameters for the inference batcher: maximum batch size, maximum latency (ms), and timeout (s). These correspond to the max_batch_size, max_latency and timeout parameters shown in the Code section below.

Once you are done with the changes, click on Create new deployment at the bottom of the page to create the deployment for your model.

## Code

### Step 1: Connect to Hopsworks

=== "Python"

    ```python
    import hopsworks

    project = hopsworks.login()

    # get Hopsworks Model Registry handle
    mr = project.get_model_registry()

    # get Hopsworks Model Serving handle
    ms = project.get_model_serving()
    ```

### Step 2: Define an inference batcher

=== "Python"

    ```python
    from hsml.inference_batcher import InferenceBatcher

    my_batcher = InferenceBatcher(enabled=True,
                                  # optional
                                  max_batch_size=32,
                                  max_latency=5000,  # milliseconds
                                  timeout=5,  # seconds
                                  )
    ```
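
The fine-grained parameters only take effect on KServe deployments (see the compatibility matrix below). If you only need to toggle batching on, for example on a Docker or Kubernetes deployment, a minimal sketch relying on the default values is:

```python
# minimal batcher: only enable batching and keep the defaults;
# per the compatibility matrix, fine-grained settings are KServe-only
my_batcher = InferenceBatcher(enabled=True)
```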

### Step 3: Create a deployment with the inference batcher

=== "Python"

    ```python
    my_model = mr.get_model("my_model", version=1)

    my_predictor = ms.create_predictor(my_model,
                                       inference_batcher=my_batcher
                                       )
    my_predictor.deploy()

    # or

    my_deployment = ms.create_deployment(my_predictor)
    my_deployment.save()
    ```
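
Batching only pays off when several requests reach the deployment around the same time. The sketch below is a quick way to check this; it assumes you followed the `my_deployment` path above, that the deployment is running, and that the model accepts a list of feature vectors through `predict(inputs=...)` (the payload is hypothetical, adapt it to your model's signature). It sends a handful of concurrent requests so the batcher has something to group:

```python
from concurrent.futures import ThreadPoolExecutor

# hypothetical payload; replace with inputs matching your model's signature
sample_input = [1.0, 2.0, 3.0, 4.0]

def send_request(_):
    # each call is an independent inference request; with batching enabled,
    # concurrent requests can be grouped into a single batch server-side
    return my_deployment.predict(inputs=[sample_input])

# fire 16 requests concurrently so the inference batcher can form batches
with ThreadPoolExecutor(max_workers=8) as pool:
    predictions = list(pool.map(send_request, range(16)))

print(predictions[0])
```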

## API Reference

[Inference Batcher](https://docs.hopsworks.ai/hopsworks-api/{{{ hopsworks_version }}}/generated/model-serving/inference_batcher_api/)

## Compatibility matrix

??? info "Show supported inference batcher configuration"

    | Serving tool | Model server       | Inference batching | Fine-grained configuration |
    | ------------ | ------------------ | ------------------ | -------------------------- |
    | Docker       | Flask              | ❌                 | -                          |
    |              | TensorFlow Serving | ✅                 | ❌                         |
    | Kubernetes   | Flask              | ❌                 | -                          |
    |              | TensorFlow Serving | ✅                 | ❌                         |
    | KServe       | Flask              | ✅                 | ✅                         |
    |              | TensorFlow Serving | ✅                 | ✅                         |
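
To check which row of the matrix applies to an existing deployment, you can inspect its predictor. A minimal sketch, assuming the predictor object in your hsml version exposes `serving_tool`, `model_server` and `inference_batcher` attributes (attribute names may differ across versions):

```python
# my_predictor as created in the Code section above
print("Serving tool:", my_predictor.serving_tool)
print("Model server:", my_predictor.model_server)
print("Batching enabled:", my_predictor.inference_batcher.enabled)
```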