
Enable vllm for DocSum #1716


Merged
17 commits merged on Mar 28, 2025
37 changes: 35 additions & 2 deletions DocSum/docker_compose/intel/cpu/xeon/README.md
@@ -2,6 +2,8 @@

This document outlines the deployment process for a Document Summarization application utilizing the [GenAIComps](https://github.com/opea-project/GenAIComps.git) microservice pipeline on an Intel Xeon server. The steps include Docker image creation, container deployment via Docker Compose, and service execution to integrate microservices such as `llm`. We will publish the Docker images to Docker Hub soon, which will simplify the deployment process for this service.

The default pipeline deploys with vLLM as the LLM serving component. It also provides the option of using TGI as the backend for the LLM microservice; refer to the [start-microservice-docker-containers](#start-microservice-docker-containers) section of this page.

## 🚀 Apply Intel Xeon Server on AWS

To apply for an Intel Xeon server on AWS, start by creating an AWS account if you don't already have one. Then, head to the [EC2 Console](https://console.aws.amazon.com/ec2/v2/home) to begin the process. Within the EC2 service, select the Amazon EC2 M7i or M7i-flex instance type to leverage 4th Generation Intel Xeon Scalable processors. These instances are optimized for high-performance computing and demanding workloads.
@@ -116,9 +118,20 @@ To set up environment variables for deploying Document Summarization services, f

```bash
cd GenAIExamples/DocSum/docker_compose/intel/cpu/xeon
```
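The compose files below are configured through environment variables (for example via the repository's environment setup script, which is not shown in this diff). As a rough, non-authoritative sketch of the kind of values they expect, with placeholder values:

```bash
# Illustrative placeholders only -- consult the repository's setup script for
# the authoritative variable list and defaults.
export host_ip=$(hostname -I | awk '{print $1}')
export HUGGINGFACEHUB_API_TOKEN="your_hf_token"
export LLM_MODEL_ID="meta-llama/Meta-Llama-3-8B-Instruct"   # example model id
export MAX_INPUT_TOKENS=1024
export MAX_TOTAL_TOKENS=2048
export LLM_ENDPOINT_PORT=8008
export LLM_ENDPOINT="http://${host_ip}:${LLM_ENDPOINT_PORT}"
```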

If using vLLM as the LLM serving backend:

```bash
docker compose -f compose.yaml up -d
```

If using TGI as the LLM serving backend:

```bash
docker compose -f compose_tgi.yaml up -d
```
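Either way, you can confirm that the containers came up before moving on (a generic Docker check, not specific to this project):

```bash
# List the DocSum-related containers and their current status.
docker ps --format 'table {{.Names}}\t{{.Status}}' | grep -i docsum
```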

You will have the following Docker Images:

1. `opea/docsum-ui:latest`
@@ -128,10 +141,30 @@ You will have the following Docker Images:

### Validate Microservices

1. TGI Service
1. LLM backend Service

During the first startup, this service takes extra time to download, load, and warm up the model. Once that is finished, the service will be ready.
Try the command below to check whether the LLM serving backend is ready.

```bash
# vLLM service
docker logs docsum-xeon-vllm-service 2>&1 | grep complete
# If the service is ready, you will see a response like the one below.
INFO: Application startup complete.
```

```bash
# TGI service
docker logs docsum-xeon-tgi-server | grep Connected
# If the service is ready, you will see a response like the one below.
2024-09-03T02:47:53.402023Z INFO text_generation_router::server: router/src/server.rs:2311: Connected
```
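Alternatively, since both compose files publish the serving container on host port 8008 and both vLLM and TGI expose a `/health` endpoint, a small polling loop can wait for readiness (a sketch):

```bash
# Poll the LLM serving backend until its health endpoint responds.
until curl -sf "http://${host_ip}:8008/health" > /dev/null; do
  echo "Waiting for the LLM serving backend..."
  sleep 10
done
echo "LLM serving backend is ready."
```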

Then try the `cURL` command below to validate services.

```bash
curl http://${host_ip}:8008/generate \
# either vLLM or TGI service
curl http://${host_ip}:8008/v1/chat/completions \
-X POST \
-d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":17, "do_sample": true}}' \
-H 'Content-Type: application/json'
```
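Note that `/v1/chat/completions` is an OpenAI-compatible endpoint; if the TGI-style `inputs`/`parameters` payload above is not accepted by your backend, an OpenAI-style request along the following lines may work instead (a sketch; the model name is assumed to match the `LLM_MODEL_ID` used when starting the serving container):

```bash
# OpenAI-style request against the same endpoint (sketch).
curl http://${host_ip}:8008/v1/chat/completions \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{
          "model": "'"${LLM_MODEL_ID}"'",
          "messages": [{"role": "user", "content": "What is Deep Learning?"}],
          "max_tokens": 17
        }'
```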
45 changes: 22 additions & 23 deletions DocSum/docker_compose/intel/cpu/xeon/compose.yaml
@@ -2,54 +2,53 @@
# SPDX-License-Identifier: Apache-2.0

services:
tgi-server:
image: ghcr.io/huggingface/text-generation-inference:2.4.0-intel-cpu
container_name: tgi-server
vllm-service:
image: ${REGISTRY:-opea}/vllm:${TAG:-latest}
container_name: docsum-xeon-vllm-service
ports:
- ${LLM_ENDPOINT_PORT:-8008}:80
- "8008:80"
volumes:
- "${MODEL_CACHE:-./data}:/root/.cache/huggingface/hub"
shm_size: 1g
environment:
no_proxy: ${no_proxy}
http_proxy: ${http_proxy}
https_proxy: ${https_proxy}
TGI_LLM_ENDPOINT: ${TGI_LLM_ENDPOINT}
HUGGINGFACEHUB_API_TOKEN: ${HUGGINGFACEHUB_API_TOKEN}
host_ip: ${host_ip}
LLM_ENDPOINT_PORT: ${LLM_ENDPOINT_PORT}
HF_TOKEN: ${HUGGINGFACEHUB_API_TOKEN}
LLM_MODEL_ID: ${LLM_MODEL_ID}
VLLM_TORCH_PROFILER_DIR: "/mnt"
healthcheck:
test: ["CMD-SHELL", "curl -f http://${host_ip}:${LLM_ENDPOINT_PORT}/health || exit 1"]
test: ["CMD-SHELL", "curl -f http://localhost:80/health || exit 1"]
interval: 10s
timeout: 10s
retries: 100
volumes:
- "${MODEL_CACHE:-./data}:/data"
shm_size: 1g
command: --model-id ${LLM_MODEL_ID} --cuda-graphs 0 --max-input-length ${MAX_INPUT_TOKENS} --max-total-tokens ${MAX_TOTAL_TOKENS}
command: --model $LLM_MODEL_ID --host 0.0.0.0 --port 80

llm-docsum-tgi:
llm-docsum-vllm:
image: ${REGISTRY:-opea}/llm-docsum:${TAG:-latest}
container_name: llm-docsum-server
container_name: docsum-xeon-llm-server
depends_on:
tgi-server:
vllm-service:
condition: service_healthy
ports:
- ${DOCSUM_PORT:-9000}:9000
- ${LLM_PORT:-9000}:9000
ipc: host
environment:
no_proxy: ${no_proxy}
http_proxy: ${http_proxy}
https_proxy: ${https_proxy}
LLM_ENDPOINT: ${LLM_ENDPOINT}
LLM_MODEL_ID: ${LLM_MODEL_ID}
HUGGINGFACEHUB_API_TOKEN: ${HUGGINGFACEHUB_API_TOKEN}
MAX_INPUT_TOKENS: ${MAX_INPUT_TOKENS}
MAX_TOTAL_TOKENS: ${MAX_TOTAL_TOKENS}
LLM_MODEL_ID: ${LLM_MODEL_ID}
DocSum_COMPONENT_NAME: ${DocSum_COMPONENT_NAME}
LOGFLAG: ${LOGFLAG:-False}
restart: unless-stopped

whisper:
image: ${REGISTRY:-opea}/whisper:${TAG:-latest}
container_name: whisper-server
container_name: docsum-xeon-whisper-server
ports:
- "7066:7066"
ipc: host
@@ -63,10 +62,10 @@ services:
image: ${REGISTRY:-opea}/docsum:${TAG:-latest}
container_name: docsum-xeon-backend-server
depends_on:
- tgi-server
- llm-docsum-tgi
- vllm-service
- llm-docsum-vllm
ports:
- "8888:8888"
- "${BACKEND_SERVICE_PORT:-8888}:8888"
environment:
- no_proxy=${no_proxy}
- https_proxy=${https_proxy}
@@ -83,7 +82,7 @@ services:
depends_on:
- docsum-xeon-backend-server
ports:
- "5173:5173"
- "${FRONTEND_SERVICE_PORT:-5173}:5173"
environment:
- no_proxy=${no_proxy}
- https_proxy=${https_proxy}
97 changes: 97 additions & 0 deletions DocSum/docker_compose/intel/cpu/xeon/compose_tgi.yaml
@@ -0,0 +1,97 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

services:
tgi-server:
image: ghcr.io/huggingface/text-generation-inference:2.4.0-intel-cpu
container_name: docsum-xeon-tgi-server
ports:
- ${LLM_ENDPOINT_PORT:-8008}:80
volumes:
- "${MODEL_CACHE:-./data}:/data"
environment:
no_proxy: ${no_proxy}
http_proxy: ${http_proxy}
https_proxy: ${https_proxy}
TGI_LLM_ENDPOINT: ${TGI_LLM_ENDPOINT}
HUGGINGFACEHUB_API_TOKEN: ${HUGGINGFACEHUB_API_TOKEN}
host_ip: ${host_ip}
healthcheck:
test: ["CMD-SHELL", "curl -f http://localhost:80/health || exit 1"]
interval: 10s
timeout: 10s
retries: 100
shm_size: 1g
command: --model-id ${LLM_MODEL_ID} --cuda-graphs 0 --max-input-length ${MAX_INPUT_TOKENS} --max-total-tokens ${MAX_TOTAL_TOKENS}

llm-docsum-tgi:
image: ${REGISTRY:-opea}/llm-docsum:${TAG:-latest}
container_name: docsum-xeon-llm-server
depends_on:
tgi-server:
condition: service_healthy
ports:
- ${LLM_PORT:-9000}:9000
ipc: host
environment:
no_proxy: ${no_proxy}
http_proxy: ${http_proxy}
https_proxy: ${https_proxy}
LLM_ENDPOINT: ${LLM_ENDPOINT}
LLM_MODEL_ID: ${LLM_MODEL_ID}
HUGGINGFACEHUB_API_TOKEN: ${HUGGINGFACEHUB_API_TOKEN}
MAX_INPUT_TOKENS: ${MAX_INPUT_TOKENS}
MAX_TOTAL_TOKENS: ${MAX_TOTAL_TOKENS}
DocSum_COMPONENT_NAME: ${DocSum_COMPONENT_NAME}
LOGFLAG: ${LOGFLAG:-False}
restart: unless-stopped

whisper:
image: ${REGISTRY:-opea}/whisper:${TAG:-latest}
container_name: docsum-xeon-whisper-server
ports:
- "7066:7066"
ipc: host
environment:
no_proxy: ${no_proxy}
http_proxy: ${http_proxy}
https_proxy: ${https_proxy}
restart: unless-stopped

docsum-xeon-backend-server:
image: ${REGISTRY:-opea}/docsum:${TAG:-latest}
container_name: docsum-xeon-backend-server
depends_on:
- tgi-server
- llm-docsum-tgi
ports:
- "${BACKEND_SERVICE_PORT:-8888}:8888"
environment:
- no_proxy=${no_proxy}
- https_proxy=${https_proxy}
- http_proxy=${http_proxy}
- MEGA_SERVICE_HOST_IP=${MEGA_SERVICE_HOST_IP}
- LLM_SERVICE_HOST_IP=${LLM_SERVICE_HOST_IP}
- ASR_SERVICE_HOST_IP=${ASR_SERVICE_HOST_IP}
ipc: host
restart: always

docsum-gradio-ui:
image: ${REGISTRY:-opea}/docsum-gradio-ui:${TAG:-latest}
container_name: docsum-xeon-ui-server
depends_on:
- docsum-xeon-backend-server
ports:
- "${FRONTEND_SERVICE_PORT:-5173}:5173"
environment:
- no_proxy=${no_proxy}
- https_proxy=${https_proxy}
- http_proxy=${http_proxy}
- BACKEND_SERVICE_ENDPOINT=${BACKEND_SERVICE_ENDPOINT}
- DOC_BASE_URL=${BACKEND_SERVICE_ENDPOINT}
ipc: host
restart: always

networks:
default:
driver: bridge
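Since the `tgi-server` service above defines a healthcheck, its readiness can also be read directly from Docker (a sketch using the container name defined in this compose file):

```bash
# Prints "healthy" once the compose healthcheck has passed.
docker inspect --format '{{.State.Health.Status}}' docsum-xeon-tgi-server
```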
37 changes: 35 additions & 2 deletions DocSum/docker_compose/intel/hpu/gaudi/README.md
@@ -2,6 +2,8 @@

This document outlines the deployment process for a Document Summarization application utilizing the [GenAIComps](https://github.com/opea-project/GenAIComps.git) microservice pipeline on an Intel Gaudi server. The steps include Docker image creation, container deployment via Docker Compose, and service execution to integrate microservices such as `llm`. We will publish the Docker images to Docker Hub soon, which will simplify the deployment process for this service.

The default pipeline deploys with vLLM as the LLM serving component. It also provides the option of using TGI as the backend for the LLM microservice; refer to the [start-microservice-docker-containers](#start-microservice-docker-containers) section of this page.

## 🚀 Build Docker Images

### 1. Build MicroService Docker Image
@@ -108,9 +110,20 @@ To set up environment variables for deploying Document Summarization services, f

```bash
cd GenAIExamples/DocSum/docker_compose/intel/hpu/gaudi
```

If using vLLM as the LLM serving backend:

```bash
docker compose -f compose.yaml up -d
```

If using TGI as the LLM serving backend:

```bash
docker compose -f compose_tgi.yaml up -d
```
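Optionally, before bringing the stack up, confirm that the Gaudi accelerators are visible on the host (a generic check using Habana's `hl-smi` tool; not part of the original steps):

```bash
# Lists the available Gaudi devices and driver status.
hl-smi
```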

You will have the following Docker Images:

1. `opea/docsum-ui:latest`
@@ -120,10 +133,30 @@ You will have the following Docker Images:

### Validate Microservices

1. TGI Service
1. LLM backend Service

During the first startup, this service takes extra time to download, load, and warm up the model. Once that is finished, the service will be ready.
Try the command below to check whether the LLM serving backend is ready.

```bash
# vLLM service
docker logs docsum-xeon-vllm-service 2>&1 | grep complete
# If the service is ready, you will see a response like the one below.
INFO: Application startup complete.
```

```bash
# TGI service
docker logs docsum-xeon-tgi-service | grep Connected
# If the service is ready, you will see a response like the one below.
2024-09-03T02:47:53.402023Z INFO text_generation_router::server: router/src/server.rs:2311: Connected
```

Then try the `cURL` command below to validate services.

```bash
curl http://${host_ip}:8008/generate \
# either vLLM or TGI service
curl http://${host_ip}:8008/v1/chat/completions \
-X POST \
-d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":17, "do_sample": true}}' \
-H 'Content-Type: application/json'
```
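If the request is rejected, the model name expected by the OpenAI-compatible endpoint can usually be listed first (a sketch; vLLM exposes `/v1/models`, while TGI's endpoint layout may differ):

```bash
# List the model id(s) reported by the serving backend (OpenAI-compatible API).
curl http://${host_ip}:8008/v1/models
```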