Parallel sampling in processing the batch of tokens? #11882
Unanswered
whitezhang asked this question in Q&A
Replies: 1 comment 2 replies
-
The requests are already processed in parallel - there is nothing extra necessary to enable this.
-
When I use the following command to start the server:
I found that the latency of each query is quite high. I checked the code and found that it processes each slot serially. Is it possible to make this parallel? I can work on this if it is feasible. Or is there anything else I haven't considered?
Here's a simplified conceptual code snippet of the part I want to change to be parallel: