Commit 6d42f1a: Update README.md
1 parent 138ba3f

1 file changed (+39 / -1 lines changed)

Diff for: README.md

<div align="center">

<h1>OpenAI Compatible TensorRT-LLM Worker</h1>

A high-performance inference server that combines TensorRT-LLM's optimized model inference with RunPod's serverless infrastructure. This implementation provides an OpenAI-compatible API for easy integration with existing applications.

## Features

- TensorRT-LLM optimization for faster inference
- OpenAI-compatible API endpoints (example below)
- Flexible configuration through environment variables
- Support for model parallelism (tensor and pipeline)
- Hugging Face model integration
- Streaming response support
- RunPod serverless deployment ready
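
For illustration, a deployed worker can usually be reached with the standard `openai` Python client. The base URL pattern, endpoint ID, API key, and model name below are placeholders for your own deployment, not values defined by this repository:

```python
from openai import OpenAI

# Hypothetical endpoint values; substitute your own RunPod endpoint ID,
# API key, and served model name.
client = OpenAI(
    base_url="https://api.runpod.ai/v2/<endpoint_id>/openai/v1",
    api_key="<runpod_api_key>",
)

# Standard OpenAI-style chat completion; stream=True exercises the worker's
# streaming response support.
stream = client.chat.completions.create(
    model="<model_name>",
    messages=[{"role": "user", "content": "Summarize TensorRT-LLM in one sentence."}],
    max_tokens=128,
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```
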
### Runtime Constraints

- Batch size and sequence length must be fixed at engine build time
- Dynamic shape support is limited and may impact performance
- KV-cache size is fixed at build time and affects memory usage
- Changing model parameters requires rebuilding the TensorRT engine
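
Because these limits are baked into the engine, a request handler typically has to validate incoming parameters against them. The sketch below is purely illustrative (the function is not part of this repository); it reuses the documented `TRTLLM_*` variable names, with arbitrary fallback values, to show the kind of guard involved:

```python
import os

# Build-time limits; in this worker they would come from the TRTLLM_* variables
# documented below. The fallback numbers here are arbitrary, for the sketch only.
MAX_SEQ_LEN = int(os.environ.get("TRTLLM_MAX_SEQ_LEN", "4096"))
MAX_BATCH_SIZE = int(os.environ.get("TRTLLM_MAX_BATCH_SIZE", "8"))

def clamp_request(prompt_tokens: int, max_new_tokens: int, batch_size: int = 1) -> int:
    """Check a request against limits fixed when the engine was built."""
    if batch_size > MAX_BATCH_SIZE:
        raise ValueError(f"batch size {batch_size} exceeds engine limit {MAX_BATCH_SIZE}")
    # Prompt plus generated tokens cannot exceed the engine's max sequence
    # length; serving longer sequences means rebuilding the engine.
    remaining = MAX_SEQ_LEN - prompt_tokens
    if remaining <= 0:
        raise ValueError(f"prompt of {prompt_tokens} tokens exceeds max_seq_len {MAX_SEQ_LEN}")
    return min(max_new_tokens, remaining)
```
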
### Build Time Impact

- Engine building can take significant time (hours for large models)
- Each combination of parameters requires a separate engine
- Changes to maximum sequence length or batch size require rebuilding
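
One common way to cope with this is to cache built engines keyed by their build parameters, so an existing engine is reused whenever the same configuration comes around again. This is a generic sketch of that idea, not code from this worker; the paths and parameter names are hypothetical:

```python
import hashlib
import json
from pathlib import Path

def engine_dir(cache_root: Path, model: str, **build_params) -> Path:
    """Derive a per-configuration engine directory: every distinct combination
    of build parameters (batch size, sequence length, parallelism, ...) gets
    its own engine build."""
    key = json.dumps({"model": model, **build_params}, sort_keys=True)
    digest = hashlib.sha256(key.encode()).hexdigest()[:12]
    return cache_root / f"{Path(model).name}-{digest}"

# Two different max_seq_len values map to two separate engine directories,
# reflecting that each requires its own (potentially hours-long) build.
root = Path("/runpod-volume/engines")  # hypothetical cache location
print(engine_dir(root, "meta-llama/Llama-3.1-8B-Instruct", max_batch_size=8, max_seq_len=4096))
print(engine_dir(root, "meta-llama/Llama-3.1-8B-Instruct", max_batch_size=8, max_seq_len=8192))
```
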
## Environment Variables

The server can be configured using the following environment variables:

```plaintext
TRTLLM_MODEL                               # Required: Path or name of the model to load
TRTLLM_TOKENIZER                           # Optional: Path or name of the tokenizer (defaults to model path)
TRTLLM_MAX_BEAM_WIDTH                      # Optional: Maximum beam width for beam search
TRTLLM_MAX_BATCH_SIZE                      # Optional: Maximum batch size for inference
TRTLLM_MAX_NUM_TOKENS                      # Optional: Maximum number of tokens to generate
TRTLLM_MAX_SEQ_LEN                         # Optional: Maximum sequence length
TRTLLM_TP_SIZE                             # Optional: Tensor parallelism size (default: 1)
TRTLLM_PP_SIZE                             # Optional: Pipeline parallelism size (default: 1)
TRTLLM_KV_CACHE_FREE_GPU_MEMORY_FRACTION   # Optional: GPU memory fraction for KV cache (default: 0.9)
TRTLLM_TRUST_REMOTE_CODE                   # Optional: Whether to trust remote code (default: false)
HF_TOKEN                                   # Optional: Hugging Face API token for protected models
```
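
As a reference for how these variables might be consumed, here is a hedged sketch of a loader that turns them into a typed configuration. It uses only the names and defaults documented above; the worker's actual parsing code may differ (requires Python 3.10+ for the `int | None` annotations):

```python
import os
from dataclasses import dataclass

def _env_bool(name: str, default: bool = False) -> bool:
    # Accept common truthy spellings; anything else counts as false.
    return os.environ.get(name, str(default)).strip().lower() in ("1", "true", "yes")

def _env_opt_int(name: str) -> int | None:
    value = os.environ.get(name)
    return int(value) if value else None

@dataclass
class TrtllmConfig:
    model: str
    tokenizer: str
    max_beam_width: int | None
    max_batch_size: int | None
    max_num_tokens: int | None
    max_seq_len: int | None
    tp_size: int
    pp_size: int
    kv_cache_free_gpu_memory_fraction: float
    trust_remote_code: bool

def load_config() -> TrtllmConfig:
    model = os.environ["TRTLLM_MODEL"]  # required; raises KeyError if missing
    return TrtllmConfig(
        model=model,
        tokenizer=os.environ.get("TRTLLM_TOKENIZER", model),  # defaults to model path
        max_beam_width=_env_opt_int("TRTLLM_MAX_BEAM_WIDTH"),
        max_batch_size=_env_opt_int("TRTLLM_MAX_BATCH_SIZE"),
        max_num_tokens=_env_opt_int("TRTLLM_MAX_NUM_TOKENS"),
        max_seq_len=_env_opt_int("TRTLLM_MAX_SEQ_LEN"),
        tp_size=int(os.environ.get("TRTLLM_TP_SIZE", "1")),
        pp_size=int(os.environ.get("TRTLLM_PP_SIZE", "1")),
        kv_cache_free_gpu_memory_fraction=float(
            os.environ.get("TRTLLM_KV_CACHE_FREE_GPU_MEMORY_FRACTION", "0.9")
        ),
        trust_remote_code=_env_bool("TRTLLM_TRUST_REMOTE_CODE", False),
    )
```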
