Serve multiple models with llamacpp server #10431
-
Or just start multiple instances lol
-
I start each model in a separate server instance, using a different port for each model. Then I make an API call to whichever server is running the model I want, providing the array of messages intended for that model. I can also use the very nice webUI that @ngxson made for any running server!
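A minimal sketch of that setup, assuming two llama-server instances were already started on ports 8080 and 8081 (the ports and the model-name keys below are placeholders). Each instance exposes an OpenAI-compatible `/v1/chat/completions` endpoint, so it can be queried independently:

```python
# Query two separate llama-server instances, one per model.
# Assumes the servers were started beforehand on ports 8080 and 8081.
import json
import urllib.request

ENDPOINTS = {
    "llama-8b": "http://127.0.0.1:8080/v1/chat/completions",  # placeholder names
    "qwen-7b": "http://127.0.0.1:8081/v1/chat/completions",
}

def chat(model: str, messages: list[dict]) -> str:
    """Send a chat request to the server instance that hosts `model`."""
    payload = json.dumps({"messages": messages}).encode("utf-8")
    req = urllib.request.Request(
        ENDPOINTS[model],
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(chat("llama-8b", [{"role": "user", "content": "Hello!"}]))
    print(chat("qwen-7b", [{"role": "user", "content": "Bonjour!"}]))
```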
-
I saw this project on r/LocalLLaMA as a workaround: https://github.com/mostlygeek/llama-swap
-
I would say either put a proxy server in front of it or use a project like llama-swap; that way you keep all the config options per server. The proxy would only need some simple routing, simple per-server queuing, and options such as starting a new model as a second server (if you have the memory available), and then you can reuse all the existing server options.
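As a rough illustration of that idea (not something llama.cpp itself provides), here is a minimal routing proxy that reads the `model` field from the JSON body and forwards the request to the matching backend. The backend names and ports are placeholders, and per-backend queuing and on-demand start-up are left out of the sketch:

```python
# Minimal routing proxy in front of several llama-server instances:
# the request is forwarded to the backend chosen by its "model" field.
import json
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

BACKENDS = {
    "model-a": "http://127.0.0.1:8080",  # placeholder backends
    "model-b": "http://127.0.0.1:8081",
}

class ProxyHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        model = json.loads(body).get("model", "")
        backend = BACKENDS.get(model)
        if backend is None:
            self.send_error(404, f"unknown model: {model}")
            return
        # Forward the request unchanged to the selected backend.
        req = urllib.request.Request(
            backend + self.path,
            data=body,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            payload = resp.read()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8000), ProxyHandler).serve_forever()
```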
-
I haven't tried using it, but there is also this: https://github.com/perk11/large-model-proxy It looks like you can even run multiple backends and have them unload automatically on a least-recently-used basis, etc.
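For illustration only, a sketch of that least-recently-used idea under a few assumptions: at most a fixed number of llama-server processes stay alive, and the one used longest ago is stopped before another is started. The model paths and port map below are made up; `-m` and `--port` are the usual llama-server flags.

```python
# LRU-style management of llama-server processes (sketch).
import subprocess
from collections import OrderedDict

MAX_LOADED = 2  # how many model servers may run at once
MODELS = {      # placeholder model paths and ports
    "model-a": ("models/model-a.gguf", 8080),
    "model-b": ("models/model-b.gguf", 8081),
    "model-c": ("models/model-c.gguf", 8082),
}
running = OrderedDict()  # name -> subprocess.Popen, ordered by last use

def ensure_running(name: str) -> int:
    """Start (or reuse) a llama-server for `name`, evicting the LRU one if needed."""
    path, port = MODELS[name]
    if name in running:
        running.move_to_end(name)  # mark as most recently used
        return port
    if len(running) >= MAX_LOADED:
        _, oldest = running.popitem(last=False)  # least recently used
        oldest.terminate()                       # unload that model
        oldest.wait()
    running[name] = subprocess.Popen(
        ["llama-server", "-m", path, "--port", str(port)]
    )
    return port
```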
-
Still seems like a nice feature, if the will to implement it is there.
-
I'm learning to use llama.cpp and have a similar need. Basically, I would like to set up rules for different prompt types and use a different model for each on one server. I'm still not sure whether the stock llama-server application supports loading/swapping models via an API call that specifies a model path. For example, I cannot change the model via the call below, even though I get a success response; maybe my usage is not correct.
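If your build of llama-server can't swap models through the API, one workaround is to run one server per model and pick the endpoint on the client side with whatever prompt-type rules you like. A tiny hypothetical sketch (the heuristics and endpoints are placeholders):

```python
# Placeholder endpoints for two separately running llama-server instances.
CODE_SERVER = "http://127.0.0.1:8080/v1/chat/completions"  # e.g. a coding model
CHAT_SERVER = "http://127.0.0.1:8081/v1/chat/completions"  # e.g. a general chat model

def pick_endpoint(prompt: str) -> str:
    """Crude rule: prompts that look like code go to the coding model."""
    if any(marker in prompt for marker in ("def ", "class ", "#include", "import ")):
        return CODE_SERVER
    return CHAT_SERVER
```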
-
Could you please share a link that explains how to load multiple models? I went through some articles but did not figure out how to pass multiple models. 😢 I also tried the commands below; none of them worked.
-
To serve multiple models using the llama.cpp server, you need to run the server in a way that supports multiple models. Here's how you can do it:
-
Is there a way to serve more than one model at the same time? (I don't think so).
Are there any plans to add this feature?