Cortex.cpp: Local Engines and Dependencies #1117
Replies: 10 comments
---
From my perspective, we should download the CUDA toolkit separately. We support multiple engines, cortex.llamacpp and cortex.tensorrt-llm, and both need the CUDA toolkit to run. CUDA is backward compatible, so we only need the latest CUDA toolkit version supported by the nvidia-driver version.

Edit: I just checked the CUDA compatibility matrix, and it is incorrect that CUDA is always backward compatible. Related ticket: #1047

Edit 2: The image above shows forward compatibility between CUDA and the nvidia-driver version. So yes, CUDA is backward compatible within a CUDA major release.
---
I'm referring to this table to check the compatibility between the driver and the toolkit.
---
Can I verify my understanding of the issue?

**Decision**

**My initial thoughts**
This will be disk-space inefficient. However, the alternative seems to be dependency hell, which I think is even worse.

**Folder Structure**

That said, I am open to all ideas, especially @vansangpfiev's.
---
If disk-space inefficiency is acceptable, I think we can go with option 1.
---
Thanks @vansangpfiev and @dan-homebrew. I'm confirming that we agree on:

- Question 2: Storing CUDA dependencies under the corresponding engines.

Caveats:

Additional thought:
---
```
/.cortex
  /deps
    /cuda
      cuda-11.5   # or whatever versioning
  /engines
    /cortex.llamacpp
      /bin
    /cortex.tensorrt-llm
      /bin
```
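If we go with a layout like this, resolving paths at runtime stays mechanical. A minimal C++17 sketch (the root location and helper names are my assumptions, not the actual Cortex code):

```cpp
#include <cstdlib>
#include <filesystem>
#include <string>

namespace fs = std::filesystem;

// Assumption: ~/.cortex is the data root (Linux/macOS; Windows would differ).
fs::path CortexRoot() {
  const char* home = std::getenv("HOME");
  return fs::path(home ? home : ".") / ".cortex";
}

// e.g. EngineBinDir("cortex.llamacpp") -> ~/.cortex/engines/cortex.llamacpp/bin
fs::path EngineBinDir(const std::string& engine) {
  return CortexRoot() / "engines" / engine / "bin";
}

// e.g. CudaDepsDir("11.5") -> ~/.cortex/deps/cuda/cuda-11.5
fs::path CudaDepsDir(const std::string& cuda_version) {
  return CortexRoot() / "deps" / "cuda" / ("cuda-" + cuda_version);
}
```

The loader would then prepend the CUDA deps directory to the library search path (e.g. `LD_LIBRARY_PATH` on Linux) before loading the engine.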
---
@0xSage, here's my thought. Please correct me if I'm wrong, @nguyenhoangthuan99 @vansangpfiev.
---
For 3, I think we can handle maintenance and updates by versioning: generate a file (for example, version.txt) for each release that carries metadata for the engine version and the CUDA version. We will update the CUDA dependencies if needed.
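For illustration, a minimal sketch of what such a file might contain (the field names and placeholder values are assumptions, not a spec):

```
# version.txt (hypothetical contents, shipped with each release)
engine=cortex.llamacpp
engine_version=<engine release version>
cuda_version=<CUDA runtime version the engine was built against>
```

On update, the installer could compare the `cuda_version` in the new release's metadata against what is already on disk, and re-download the CUDA dependencies only when it has changed.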
---
@vansangpfiev @namchuai @0xSage Quick responses:

**Per-Engine Dependencies**
I also agree with @vansangpfiev: let's co-locate all CUDA dependencies with the engine folder. Simple > complex, especially since model files are >4 GB.

**Updating Engines**
I also think we need to think through the CLI and API commands (a possible shape is sketched after this comment):

**Naming**
I wonder whether it is better for us to have clearer naming for Cortex engines:

This articulates the concept of Cortex engines more clearly. Hopefully, with a clear API, the community can also step in to help build backends. We would need to reason through:
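To make the CLI/API discussion concrete, here is one possible shape for the commands (all names are assumptions on my part, not a spec; the actual surface is what #1072 should decide):

- `cortex engines list` — show installed engines and their dependency versions
- `cortex engines install cortex.llamacpp` — fetch the engine binary plus matching CUDA dependencies
- `cortex engines update cortex.llamacpp` — bump the engine and refresh dependencies per its version metadata
- `cortex engines uninstall cortex.llamacpp` — remove the engine folder and its co-located dependencies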
---
**Motivation**
Do we package the CUDA toolkit with the engine?

- Yes? Then we will have to do the same for `llamacpp`, `tensorrt-llm`, and `onnx`?
- No? Then we will download it separately.

Folder structures (e.g. if a user has `llamacpp` and `tensorrt-llm` installed at the same time)?
**Resources**

- Llamacpp release
- Currently we are downloading the toolkit dependency via `https://catalog.jan.ai/dist/cuda-dependencies/<version>/<platform>/cuda.tar.gz`
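For reference, a small sketch of how a client could fill in that template (the helper name is hypothetical; the assumption is that `<version>` is the CUDA version and `<platform>` an OS/arch tag):

```cpp
#include <string>

// Builds the CUDA dependency download URL from the template above:
//   https://catalog.jan.ai/dist/cuda-dependencies/<version>/<platform>/cuda.tar.gz
std::string CudaDepsUrl(const std::string& version, const std::string& platform) {
  return "https://catalog.jan.ai/dist/cuda-dependencies/" + version + "/" +
         platform + "/cuda.tar.gz";
}
```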
cc @vansangpfiev @nguyenhoangthuan99 @dan-homebrew
Update sub-tasks:
**Related**

- `cortex engines` commands #1072