Concurrent requests throw a memory exception #1033
-
Creating a new context doesn't load the entire model again.
Re-using the same context for multiple conversations is slightly better (there is some overhead per-context). The problem with your implementation is that you're never clearing the context, so everything that has previously been said is still in the KV cache. Adding in …
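A minimal sketch of that approach, assuming LLamaSharp's `LLamaWeights`/`LLamaContext` API and the `NativeHandle.KvCacheClear()` call mentioned later in this thread; the model path and parameter values are placeholders:

```csharp
using System;
using LLama;
using LLama.Common;

// Placeholder model path and parameters (assumptions, adjust as needed).
var parameters = new ModelParams("phi-3.5-mini.gguf")
{
    ContextSize = 4096,
    GpuLayerCount = 32,
};

// Load the weights once and create a single context to reuse.
using var weights = LLamaWeights.LoadFromFile(parameters);
using var context = weights.CreateContext(parameters);

foreach (var prompt in new[] { "First conversation ...", "Second conversation ..." })
{
    var executor = new InteractiveExecutor(context);
    await foreach (var token in executor.InferAsync(prompt, new InferenceParams { MaxTokens = 256 }))
        Console.Write(token);

    // Without this, everything said so far stays in the KV cache and
    // leaks into (and eventually crowds out) the next conversation.
    context.NativeHandle.KvCacheClear();
}
```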
-
Thanks for the clarification, so I have two options:
For the second case, I have made the necessary changes: I keep the weights static and create a new context each time. But I observed the following: the server starts with 0.0 GB of GPU memory and 17.0 GB of RAM in use. Of course, when a request finishes, the GPU and RAM reserved for it are released, by calling context_llama.NativeHandle.KvCacheClear() at the end as you said. According to my results the overhead is too big, which means we can serve only very limited parallel requests given the available GPU and RAM, unless I am doing something wrong.
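For reference, a minimal sketch of this second option (static weights, per-request context); the `HandleRequestAsync` entry point and the model path are assumptions:

```csharp
using System.Text;
using System.Threading.Tasks;
using LLama;
using LLama.Common;

public static class ModelHost
{
    // Placeholder path and parameters (assumptions).
    private static readonly ModelParams Params = new("phi-3.5-mini.gguf")
    {
        ContextSize = 4096,
        GpuLayerCount = 32,
    };

    // Loaded once and shared by all requests: the expensive part.
    private static readonly LLamaWeights Weights = LLamaWeights.LoadFromFile(Params);

    // Hypothetical per-request entry point.
    public static async Task<string> HandleRequestAsync(string prompt)
    {
        // Per-request context; its KV cache is the per-context overhead
        // discussed above, and it is released again on dispose.
        using var context = Weights.CreateContext(Params);
        var executor = new InteractiveExecutor(context);

        var response = new StringBuilder();
        await foreach (var token in executor.InferAsync(prompt, new InferenceParams { MaxTokens = 256 }))
            response.Append(token);

        return response.ToString();
    }
}
```

One relevant knob here: the per-context overhead is dominated by the KV cache, whose size scales with `ContextSize`, so a smaller per-request `ContextSize` should shrink the GPU/RAM reservation each parallel request needs.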
-
I have a Blazor Hosted WebAssembly app in .NET 8 and I am testing LLamaSharp with Phi-3.5-mini. The first results seem fine: a question gets its answer. But then some problems occur in more complex scenarios:
Attempted to read or write protected memory. This is often an indication that other memory is corrupt
I use the same static context, because each new context seems to load the model again and multiplies the consumption of RAM and GPU memory. Below is a sample of the setup:
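(A minimal sketch of the shared-static-context pattern described, not the original sample; the class and method names are assumptions.)

```csharp
using System.Text;
using System.Threading.Tasks;
using LLama;
using LLama.Common;

public static class SharedModel
{
    // Placeholder path (assumption).
    private static readonly ModelParams Params = new("phi-3.5-mini.gguf");
    private static readonly LLamaWeights Weights = LLamaWeights.LoadFromFile(Params);

    // One static context reused by every request, so the model is only
    // loaded once; but nothing ever clears it, so each request's tokens
    // accumulate in the shared KV cache.
    public static readonly LLamaContext Context = Weights.CreateContext(Params);
}

public class ChatService
{
    // Called per incoming request, e.g. from a Blazor API endpoint.
    public async Task<string> AskAsync(string question)
    {
        var executor = new InteractiveExecutor(SharedModel.Context);
        var sb = new StringBuilder();
        await foreach (var token in executor.InferAsync(question, new InferenceParams { MaxTokens = 256 }))
            sb.Append(token);
        return sb.ToString();
    }
}
```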