
Commit 50dd959

Authored by XinyaoWa, pre-commit-ci[bot], and lkk12014402
Support Long context for DocSum (#1255)
Signed-off-by: Xinyao Wang <xinyao.wang@intel.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: lkk <33276950+lkk12014402@users.noreply.github.com>
1 parent 05365b6 commit 50dd959

15 files changed (+809 / -215 lines changed)


DocSum/docker_compose/amd/gpu/rocm/compose.yaml

Lines changed: 4 additions & 1 deletion
@@ -27,7 +27,7 @@ services:
     security_opt:
       - seccomp:unconfined
     ipc: host
-    command: --model-id ${DOCSUM_LLM_MODEL_ID}
+    command: --model-id ${DOCSUM_LLM_MODEL_ID} --max-input-length ${MAX_INPUT_TOKENS} --max-total-tokens ${MAX_TOTAL_TOKENS}
 
   docsum-llm-server:
     image: ${REGISTRY:-opea}/llm-docsum-tgi:${TAG:-latest}
@@ -53,6 +53,9 @@ services:
       https_proxy: ${https_proxy}
       TGI_LLM_ENDPOINT: "http://${HOST_IP}:${DOCSUM_TGI_SERVICE_PORT}"
       HUGGINGFACEHUB_API_TOKEN: ${DOCSUM_HUGGINGFACEHUB_API_TOKEN}
+      MAX_INPUT_TOKENS: ${MAX_INPUT_TOKENS}
+      MAX_TOTAL_TOKENS: ${MAX_TOTAL_TOKENS}
+      LLM_MODEL_ID: ${DOCSUM_LLM_MODEL_ID}
     restart: unless-stopped
 
   whisper:
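
Note: the new TGI flags only take effect if `MAX_INPUT_TOKENS` and `MAX_TOTAL_TOKENS` are exported before the stack is launched. A minimal sanity-check sketch (not part of this commit): TGI exposes an `/info` endpoint that reports the limits the server actually loaded; the variables below are the ones used in this compose file and `set_env.sh`, and the grep filter is only illustrative.

```bash
# Sketch: after the ROCm stack is up, ask TGI which token limits it loaded.
# HOST_IP and DOCSUM_TGI_SERVICE_PORT come from set_env.sh; the grep pulls the
# max_* fields out of the JSON /info response.
curl -s "http://${HOST_IP}:${DOCSUM_TGI_SERVICE_PORT}/info" | grep -o '"max_[a-z_]*":[0-9]*'
```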

DocSum/docker_compose/amd/gpu/rocm/set_env.sh

Lines changed: 2 additions & 0 deletions
@@ -3,6 +3,8 @@
 # Copyright (C) 2024 Advanced Micro Devices, Inc.
 # SPDX-License-Identifier: Apache-2.0
 
+export MAX_INPUT_TOKENS=2048
+export MAX_TOTAL_TOKENS=4096
 export DOCSUM_TGI_IMAGE="ghcr.io/huggingface/text-generation-inference:2.3.1-rocm"
 export DOCSUM_LLM_MODEL_ID="Intel/neural-chat-7b-v3-3"
 export HOST_IP=${host_ip}
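
With the two new variables in place, the usual launch flow still applies. A minimal sketch (it assumes the working directory is `GenAIExamples/DocSum/docker_compose/amd/gpu/rocm` and that `host_ip` is already exported, as in the existing DocSum instructions):

```bash
# Sketch: pick up the new token limits and (re)launch the ROCm DocSum stack.
source set_env.sh          # now also exports MAX_INPUT_TOKENS=2048 and MAX_TOTAL_TOKENS=4096
docker compose up -d       # TGI starts with --max-input-length / --max-total-tokens set
```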

DocSum/docker_compose/intel/cpu/xeon/README.md

Lines changed: 89 additions & 1 deletion
@@ -223,11 +223,12 @@ You will have the following Docker Images:
 Text:
 
 ```bash
+## json input
 curl -X POST http://${host_ip}:8888/v1/docsum \
   -H "Content-Type: application/json" \
   -d '{"type": "text", "messages": "Text Embeddings Inference (TEI) is a toolkit for deploying and serving open source text embeddings and sequence classification models. TEI enables high-performance extraction for the most popular models, including FlagEmbedding, Ember, GTE and E5."}'
 
-# Use English mode (default).
+# form input, use English mode (default).
 curl http://${host_ip}:8888/v1/docsum \
   -H "Content-Type: multipart/form-data" \
   -F "type=text" \
@@ -290,6 +291,93 @@ You will have the following Docker Images:
   -F "stream=true"
 ```
 
+7. MegaService with long context
+
+To handle long documents, set the following parameters and select a suitable summary type.
+
+- "summary_type": one of "auto", "stuff", "truncate", "map_reduce", or "refine"; default is "auto"
+- "chunk_size": maximum token length of each chunk; its default value depends on "summary_type"
+- "chunk_overlap": overlap token length between chunks; default is 0.1 \* chunk_size
+
+**summary_type=auto**
+
+"summary_type" is "auto" by default. In this mode the input token length is checked: if it exceeds `MAX_INPUT_TOKENS`, `summary_type` is automatically set to `refine`; otherwise it is set to `stuff`.
+
+```bash
+curl http://${host_ip}:8888/v1/docsum \
+  -H "Content-Type: multipart/form-data" \
+  -F "type=text" \
+  -F "messages=" \
+  -F "max_tokens=32" \
+  -F "files=@/path to your file (.txt, .docx, .pdf)" \
+  -F "language=en" \
+  -F "summary_type=auto"
+```
+
+**summary_type=stuff**
+
+In this mode the LLM generates the summary from the complete input text, so set `MAX_INPUT_TOKENS` and `MAX_TOTAL_TOKENS` carefully according to your model and device memory; otherwise a long input may exceed the LLM context limit and raise an error.
+
+```bash
+curl http://${host_ip}:8888/v1/docsum \
+  -H "Content-Type: multipart/form-data" \
+  -F "type=text" \
+  -F "messages=" \
+  -F "max_tokens=32" \
+  -F "files=@/path to your file (.txt, .docx, .pdf)" \
+  -F "language=en" \
+  -F "summary_type=stuff"
+```
+
+**summary_type=truncate**
+
+Truncate mode truncates the input text and keeps only the first chunk, whose length is `min(MAX_TOTAL_TOKENS - input.max_tokens - 50, MAX_INPUT_TOKENS)`.
+
+```bash
+curl http://${host_ip}:8888/v1/docsum \
+  -H "Content-Type: multipart/form-data" \
+  -F "type=text" \
+  -F "messages=" \
+  -F "max_tokens=32" \
+  -F "files=@/path to your file (.txt, .docx, .pdf)" \
+  -F "language=en" \
+  -F "summary_type=truncate"
+```
+
+**summary_type=map_reduce**
+
+Map_reduce mode splits the input into multiple chunks, maps each chunk to an individual summary, then consolidates those summaries into a single global summary. `streaming=True` is not allowed here.
+
+In this mode the default `chunk_size` is `min(MAX_TOTAL_TOKENS - input.max_tokens - 50, MAX_INPUT_TOKENS)`.
+
+```bash
+curl http://${host_ip}:8888/v1/docsum \
+  -H "Content-Type: multipart/form-data" \
+  -F "type=text" \
+  -F "messages=" \
+  -F "max_tokens=32" \
+  -F "files=@/path to your file (.txt, .docx, .pdf)" \
+  -F "language=en" \
+  -F "summary_type=map_reduce"
+```
+
+**summary_type=refine**
+
+Refine mode splits the input into multiple chunks, generates a summary for the first chunk, combines it with the second chunk, and then loops over every remaining chunk to produce the final summary.
+
+In this mode the default `chunk_size` is `min(MAX_TOTAL_TOKENS - 2 * input.max_tokens - 128, MAX_INPUT_TOKENS)`.
+
+```bash
+curl http://${host_ip}:8888/v1/docsum \
+  -H "Content-Type: multipart/form-data" \
+  -F "type=text" \
+  -F "messages=" \
+  -F "max_tokens=32" \
+  -F "files=@/path to your file (.txt, .docx, .pdf)" \
+  -F "language=en" \
+  -F "summary_type=refine"
+```
+
 ## 🚀 Launch the UI
 
 Several UI options are provided. If you need to work with multimedia documents, .doc, or .pdf files, suggested to use Gradio UI.
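
To make the chunk-size defaults quoted in the README hunk above concrete, here is a small worked sketch using the sample limits from `set_env.sh` (`MAX_INPUT_TOKENS=2048`, `MAX_TOTAL_TOKENS=4096`) and the `max_tokens=32` used in the curl examples; the `min` helper is illustrative, the formulas are the ones in the README. With these limits both defaults end up capped by `MAX_INPUT_TOKENS`.

```bash
# Sketch: evaluate the default chunk_size formulas with the sample limits.
MAX_INPUT_TOKENS=2048
MAX_TOTAL_TOKENS=4096
max_tokens=32   # summary length requested in the curl examples

min() { (( $1 < $2 )) && echo "$1" || echo "$2"; }

# truncate / map_reduce default: min(MAX_TOTAL_TOKENS - max_tokens - 50, MAX_INPUT_TOKENS)
min $(( MAX_TOTAL_TOKENS - max_tokens - 50 )) "$MAX_INPUT_TOKENS"        # prints 2048

# refine default: min(MAX_TOTAL_TOKENS - 2*max_tokens - 128, MAX_INPUT_TOKENS)
min $(( MAX_TOTAL_TOKENS - 2*max_tokens - 128 )) "$MAX_INPUT_TOKENS"     # prints 2048
```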

DocSum/docker_compose/intel/cpu/xeon/compose.yaml

Lines changed: 12 additions & 8 deletions
@@ -2,9 +2,9 @@
 # SPDX-License-Identifier: Apache-2.0
 
 services:
-  tgi-service:
+  tgi-server:
     image: ghcr.io/huggingface/text-generation-inference:2.4.0-intel-cpu
-    container_name: tgi-service
+    container_name: tgi-server
     ports:
       - "8008:80"
     environment:
@@ -16,13 +16,13 @@ services:
     volumes:
       - "./data:/data"
     shm_size: 1g
-    command: --model-id ${LLM_MODEL_ID} --cuda-graphs 0
+    command: --model-id ${LLM_MODEL_ID} --cuda-graphs 0 --max-input-length ${MAX_INPUT_TOKENS} --max-total-tokens ${MAX_TOTAL_TOKENS}
 
   llm-docsum-tgi:
     image: ${REGISTRY:-opea}/llm-docsum-tgi:${TAG:-latest}
     container_name: llm-docsum-server
     depends_on:
-      - tgi-service
+      - tgi-server
     ports:
       - "9000:9000"
     ipc: host
@@ -32,11 +32,15 @@ services:
       https_proxy: ${https_proxy}
       TGI_LLM_ENDPOINT: ${TGI_LLM_ENDPOINT}
       HUGGINGFACEHUB_API_TOKEN: ${HUGGINGFACEHUB_API_TOKEN}
+      MAX_INPUT_TOKENS: ${MAX_INPUT_TOKENS}
+      MAX_TOTAL_TOKENS: ${MAX_TOTAL_TOKENS}
+      LLM_MODEL_ID: ${LLM_MODEL_ID}
+      LOGFLAG: True
     restart: unless-stopped
 
   whisper:
     image: ${REGISTRY:-opea}/whisper:${TAG:-latest}
-    container_name: whisper-service
+    container_name: whisper-server
     ports:
       - "7066:7066"
     ipc: host
@@ -48,7 +52,7 @@ services:
 
   dataprep-audio2text:
     image: ${REGISTRY:-opea}/dataprep-audio2text:${TAG:-latest}
-    container_name: dataprep-audio2text-service
+    container_name: dataprep-audio2text-server
     ports:
       - "9099:9099"
     ipc: host
@@ -57,7 +61,7 @@ services:
 
   dataprep-video2audio:
     image: ${REGISTRY:-opea}/dataprep-video2audio:${TAG:-latest}
-    container_name: dataprep-video2audio-service
+    container_name: dataprep-video2audio-server
     ports:
       - "7078:7078"
     ipc: host
@@ -78,7 +82,7 @@ services:
     image: ${REGISTRY:-opea}/docsum:${TAG:-latest}
     container_name: docsum-xeon-backend-server
     depends_on:
-      - tgi-service
+      - tgi-server
      - llm-docsum-tgi
      - dataprep-multimedia2text
      - dataprep-video2audio
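
Because this file also renames several containers (`tgi-service` → `tgi-server`, `whisper-service` → `whisper-server`, and the dataprep services), any script that addresses containers by name needs the new names. A quick check after `docker compose up -d` (a sketch; the filter pattern is only illustrative):

```bash
# Sketch: confirm the renamed containers from this compose file are running.
docker ps --format '{{.Names}}\t{{.Status}}' \
  | grep -E 'tgi-server|whisper-server|dataprep-(audio2text|video2audio)-server'
```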

DocSum/docker_compose/intel/hpu/gaudi/README.md

Lines changed: 93 additions & 6 deletions
@@ -207,18 +207,19 @@ You will have the following Docker Images:
 Text:
 
 ```bash
+## json input
 curl -X POST http://${host_ip}:8888/v1/docsum \
   -H "Content-Type: application/json" \
   -d '{"type": "text", "messages": "Text Embeddings Inference (TEI) is a toolkit for deploying and serving open source text embeddings and sequence classification models. TEI enables high-performance extraction for the most popular models, including FlagEmbedding, Ember, GTE and E5."}'
 
-# Use English mode (default).
+# form input. Use English mode (default).
 curl http://${host_ip}:8888/v1/docsum \
   -H "Content-Type: multipart/form-data" \
   -F "type=text" \
   -F "messages=Text Embeddings Inference (TEI) is a toolkit for deploying and serving open source text embeddings and sequence classification models. TEI enables high-performance extraction for the most popular models, including FlagEmbedding, Ember, GTE and E5." \
   -F "max_tokens=32" \
   -F "language=en" \
-  -F "stream=true"
+  -F "stream=True"
 
 # Use Chinese mode.
 curl http://${host_ip}:8888/v1/docsum \
@@ -227,7 +228,7 @@ You will have the following Docker Images:
   -F "messages=2024年9月26日,北京——今日,英特尔正式发布英特尔® 至强® 6性能核处理器(代号Granite Rapids),为AI、数据分析、科学计算等计算密集型业务提供卓越性能。" \
   -F "max_tokens=32" \
   -F "language=zh" \
-  -F "stream=true"
+  -F "stream=True"
 
 # Upload file
 curl http://${host_ip}:8888/v1/docsum \
@@ -237,7 +238,6 @@ You will have the following Docker Images:
   -F "type=text" \
   -F "files=@/path to your file (.txt, .docx, .pdf)" \
   -F "max_tokens=32" \
   -F "language=en" \
-  -F "stream=true"
 ```
 
 > Audio and Video file uploads are not supported in docsum with curl request, please use the Gradio-UI.
@@ -255,7 +255,7 @@ You will have the following Docker Images:
   -F "messages=UklGRigAAABXQVZFZm10IBIAAAABAAEARKwAAIhYAQACABAAAABkYXRhAgAAAAEA" \
   -F "max_tokens=32" \
   -F "language=en" \
-  -F "stream=true"
+  -F "stream=True"
 ```
 
 Video:
@@ -271,7 +271,94 @@ You will have the following Docker Images:
   -F "messages=convert your video to base64 data type" \
   -F "max_tokens=32" \
   -F "language=en" \
-  -F "stream=true"
+  -F "stream=True"
+```
+
+7. MegaService with long context
+
+To handle long documents, set the following parameters and select a suitable summary type.
+
+- "summary_type": one of "auto", "stuff", "truncate", "map_reduce", or "refine"; default is "auto"
+- "chunk_size": maximum token length of each chunk; its default value depends on "summary_type"
+- "chunk_overlap": overlap token length between chunks; default is 0.1 \* chunk_size
+
+**summary_type=auto**
+
+"summary_type" is "auto" by default. In this mode the input token length is checked: if it exceeds `MAX_INPUT_TOKENS`, `summary_type` is automatically set to `refine`; otherwise it is set to `stuff`.
+
+```bash
+curl http://${host_ip}:8888/v1/docsum \
+  -H "Content-Type: multipart/form-data" \
+  -F "type=text" \
+  -F "messages=" \
+  -F "max_tokens=32" \
+  -F "files=@/path to your file (.txt, .docx, .pdf)" \
+  -F "language=en" \
+  -F "summary_type=auto"
+```
+
+**summary_type=stuff**
+
+In this mode the LLM generates the summary from the complete input text, so set `MAX_INPUT_TOKENS` and `MAX_TOTAL_TOKENS` carefully according to your model and device memory; otherwise a long input may exceed the LLM context limit and raise an error.
+
+```bash
+curl http://${host_ip}:8888/v1/docsum \
+  -H "Content-Type: multipart/form-data" \
+  -F "type=text" \
+  -F "messages=" \
+  -F "max_tokens=32" \
+  -F "files=@/path to your file (.txt, .docx, .pdf)" \
+  -F "language=en" \
+  -F "summary_type=stuff"
+```
+
+**summary_type=truncate**
+
+Truncate mode truncates the input text and keeps only the first chunk, whose length is `min(MAX_TOTAL_TOKENS - input.max_tokens - 50, MAX_INPUT_TOKENS)`.
+
+```bash
+curl http://${host_ip}:8888/v1/docsum \
+  -H "Content-Type: multipart/form-data" \
+  -F "type=text" \
+  -F "messages=" \
+  -F "max_tokens=32" \
+  -F "files=@/path to your file (.txt, .docx, .pdf)" \
+  -F "language=en" \
+  -F "summary_type=truncate"
+```
+
+**summary_type=map_reduce**
+
+Map_reduce mode splits the input into multiple chunks, maps each chunk to an individual summary, then consolidates those summaries into a single global summary. `streaming=True` is not allowed here.
+
+In this mode the default `chunk_size` is `min(MAX_TOTAL_TOKENS - input.max_tokens - 50, MAX_INPUT_TOKENS)`.
+
+```bash
+curl http://${host_ip}:8888/v1/docsum \
+  -H "Content-Type: multipart/form-data" \
+  -F "type=text" \
+  -F "messages=" \
+  -F "max_tokens=32" \
+  -F "files=@/path to your file (.txt, .docx, .pdf)" \
+  -F "language=en" \
+  -F "summary_type=map_reduce"
+```
+
+**summary_type=refine**
+
+Refine mode splits the input into multiple chunks, generates a summary for the first chunk, combines it with the second chunk, and then loops over every remaining chunk to produce the final summary.
+
+In this mode the default `chunk_size` is `min(MAX_TOTAL_TOKENS - 2 * input.max_tokens - 128, MAX_INPUT_TOKENS)`.
+
+```bash
+curl http://${host_ip}:8888/v1/docsum \
+  -H "Content-Type: multipart/form-data" \
+  -F "type=text" \
+  -F "messages=" \
+  -F "max_tokens=32" \
+  -F "files=@/path to your file (.txt, .docx, .pdf)" \
+  -F "language=en" \
+  -F "summary_type=refine"
 ```
 
 > More detailed tests can be found here `cd GenAIExamples/DocSum/test`
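
The curl examples above rely on the default `chunk_size` and `chunk_overlap`. Since the README lists both as user-settable parameters alongside `summary_type`, a hedged sketch of overriding them explicitly follows; passing them as form fields and the specific values are assumptions, not something this commit demonstrates.

```bash
# Sketch: long-context request with explicit chunking (field names follow the
# README's parameter list; values are illustrative and should fit your
# MAX_INPUT_TOKENS / MAX_TOTAL_TOKENS budget).
curl http://${host_ip}:8888/v1/docsum \
  -H "Content-Type: multipart/form-data" \
  -F "type=text" \
  -F "messages=" \
  -F "max_tokens=32" \
  -F "files=@/path to your file (.txt, .docx, .pdf)" \
  -F "language=en" \
  -F "summary_type=refine" \
  -F "chunk_size=2000" \
  -F "chunk_overlap=200"
```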
