<h1>OpenAI Compatible TensorRT-LLM Worker</h1>

A high-performance inference server that combines the power of TensorRT-LLM for optimized model inference with RunPod's serverless infrastructure. This implementation provides an OpenAI-compatible API interface for easy integration with existing applications.
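
Because the interface mirrors the OpenAI API, existing clients typically need only a new base URL. The sketch below uses the official `openai` Python client pointed at a deployed endpoint; the URL shape, the `RUNPOD_ENDPOINT_ID`/`RUNPOD_API_KEY` names, and the model name are placeholder assumptions to substitute with your own deployment's values.

```python
import os

from openai import OpenAI

# Assumed endpoint shape for RunPod's OpenAI-compatibility route; replace
# the endpoint ID, API key, and model name with your deployment's values.
client = OpenAI(
    api_key=os.environ["RUNPOD_API_KEY"],  # RunPod key, not an OpenAI key
    base_url=f"https://api.runpod.ai/v2/{os.environ['RUNPOD_ENDPOINT_ID']}/openai/v1",
)

# Streaming chat completion using the standard OpenAI request shape.
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model name
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```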

## Features

- TensorRT-LLM optimization for faster inference
- OpenAI-compatible API endpoints
- Flexible configuration through environment variables
- Support for model parallelism (tensor and pipeline)
- Hugging Face model integration
- Streaming response support
- RunPod serverless deployment ready

### Runtime Constraints
- Batch size and sequence length must be fixed at engine build time (see the sketch after this list)
- Dynamic shape support is limited and may impact performance
- KV-cache size is fixed at build time and affects memory usage
- Changing model parameters requires rebuilding the TensorRT engine
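
To make the build-time freeze concrete, here is a minimal sketch assuming the worker drives TensorRT-LLM through its high-level `LLM` API (the model name and numeric limits are placeholders). Every value handed to `BuildConfig` is compiled into the engine:

```python
from tensorrt_llm import LLM, BuildConfig

# These limits are baked into the compiled engine; requests beyond them
# cannot be served, and raising any of them means rebuilding the engine.
build_config = BuildConfig(
    max_batch_size=64,    # largest batch the engine will ever run
    max_seq_len=4096,     # prompt + generated tokens per request
    max_num_tokens=8192,  # tokens scheduled across a batch per step
    max_beam_width=1,     # beam search width is also frozen here
)

# The engine is built (or loaded from cache) once, before serving starts.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", build_config=build_config)
```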

### Build Time Impact
- Engine building can take significant time (hours for large models)
- Each combination of parameters requires a separate engine
- Changes to maximum sequence length or batch size require rebuilding

## Environment Variables

The server can be configured using the following environment variables:

```plaintext
TRTLLM_MODEL                              # Required: Path or name of the model to load
TRTLLM_TOKENIZER                          # Optional: Path or name of the tokenizer (defaults to model path)
TRTLLM_MAX_BEAM_WIDTH                     # Optional: Maximum beam width for beam search
TRTLLM_MAX_BATCH_SIZE                     # Optional: Maximum batch size for inference
TRTLLM_MAX_NUM_TOKENS                     # Optional: Maximum number of tokens to generate
TRTLLM_MAX_SEQ_LEN                        # Optional: Maximum sequence length
TRTLLM_TP_SIZE                            # Optional: Tensor parallelism size (default: 1)
TRTLLM_PP_SIZE                            # Optional: Pipeline parallelism size (default: 1)
TRTLLM_KV_CACHE_FREE_GPU_MEMORY_FRACTION  # Optional: GPU memory fraction for KV cache (default: 0.9)
TRTLLM_TRUST_REMOTE_CODE                  # Optional: Whether to trust remote code (default: false)
HF_TOKEN                                  # Optional: Hugging Face API token for protected models
```
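
The variable names map closely onto TensorRT-LLM's high-level `LLM` constructor, which suggests how the worker plausibly wires them up. The sketch below is an assumption about that wiring rather than the worker's actual code; the `_int` helper and the fallback numbers are illustrative, and `HF_TOKEN` is picked up automatically by `huggingface_hub` when downloading models.

```python
import os

from tensorrt_llm import LLM, BuildConfig
from tensorrt_llm.llmapi import KvCacheConfig

def _int(name: str, default: int) -> int:
    """Hypothetical helper: read an integer environment variable."""
    return int(os.environ.get(name, default))

# Assumed mapping from the environment variables above to the LLM API.
llm = LLM(
    model=os.environ["TRTLLM_MODEL"],              # required
    tokenizer=os.environ.get("TRTLLM_TOKENIZER"),  # None falls back to the model path
    tensor_parallel_size=_int("TRTLLM_TP_SIZE", 1),
    pipeline_parallel_size=_int("TRTLLM_PP_SIZE", 1),
    trust_remote_code=os.environ.get("TRTLLM_TRUST_REMOTE_CODE", "false").lower() == "true",
    build_config=BuildConfig(
        max_beam_width=_int("TRTLLM_MAX_BEAM_WIDTH", 1),
        max_batch_size=_int("TRTLLM_MAX_BATCH_SIZE", 8),
        max_num_tokens=_int("TRTLLM_MAX_NUM_TOKENS", 8192),
        max_seq_len=_int("TRTLLM_MAX_SEQ_LEN", 4096),
    ),
    kv_cache_config=KvCacheConfig(
        free_gpu_memory_fraction=float(
            os.environ.get("TRTLLM_KV_CACHE_FREE_GPU_MEMORY_FRACTION", 0.9)
        )
    ),
)
```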