How to Enable Fully Greedy Decoding Sample Parameters? #3005

PenutChen · 2023-09-04T09:02:52Z

PenutChen
Sep 4, 2023

Hi,

I'm planning to employ in-context learning in my project and have chosen to use greedy decoding. Unlike in HuggingFace, it seems there is no do_sample=False parameter. Based on my research, I've established the following parameters for greedy decoding:

param = {
    "n_predict": 256,
    "stop": ["\n\n"],
    "prompt": prompt,
    "temperature": 0.0,
    "top_k": 0,
    "top_p": 0.0,
    "repeat_last_n": 0,
    "repeat_penalty": 1.0,
    "penalize_nl": False,
    "tfs_z": 1.0,
    "presence_penalty": 0.0,
    "frequency_penalty": 0.0,
    "mirostat": 0
}

I've chosen these parameters based on the server documentation since I'm using the server to serve as an LLM backend. Can anyone provide feedback on this? Specifically, I'm wondering if I've missed something or if there are better values for certain parameters given my intended use-case.

Thank you in advance!

Answered by ggerganov

Nov 5, 2024

Setting temp = 0 will no longer be equivalent to greedy decoding (see #9897). To enable it, configure a single top_k sampler and set k = 1. For example, with llama-cli this can be done with the following CLI args:

--sampling-seq k --top-k 1

View full answer

KerfuffleV2 · 2023-09-04T13:13:02Z

KerfuffleV2
Sep 4, 2023
Collaborator

At least in the main example, setting temperature to a negative value enables greedy sampling. edit: I checked, seems the same in examples/server.

3 replies

PenutChen Sep 5, 2023
Author

My apologies for overlooking the details about the temperature parameter. I had subjectively assumed that setting the temperature to 0 would simply disable temperature sampling, while other methods like top_k and top_p would continue to function as normal. To my understanding, these sampling methods should operate independently, correct? Thank you once again for your response!

KerfuffleV2 Sep 5, 2023
Collaborator

No problem. I'm not sure I 100% understand your question. For greedy sampling, stuff like top_k won't apply. The code looks like this: https://github.com/ggerganov/llama.cpp/blob/2ba85c8609309a59d49c45ab43c31800b7ba141c/common/common.cpp#L927-L946

So if temperature < 0 then you only get greedy sampling. If mirostat 1 or 2 is enabled, you only get temperature + mirostat, otherwise the main set of samplers like top-k, tail free, etc will apply. (Not included in the code fragment I linked is the penalty stuff like repeat penalty which applies in all cases when enabled.)

The way the temperature sampler works in llama.cpp doesn't actually select a token or anything, it just divides the logits by the temperature: https://github.com/ggerganov/llama.cpp/blob/2ba85c8609309a59d49c45ab43c31800b7ba141c/llama.cpp#L4065-L4070

PenutChen Sep 5, 2023
Author

You've completely addressed my blind spots concerning the sampling methods and resolved my questions. I now understand that the temperature parameter is actually a kind of logits augmentation. Thank you for your thorough explanation!

jackfsuia · 2024-07-11T10:46:46Z

jackfsuia
Jul 11, 2024

hi. can you ask questions continued on what you've discussed? I use
llama-server.exe -m "D:\my repo\qwen2\ggml-model-bf16.gguf --temp 0.0
to be my server. and Use

import openai
client = openai.OpenAI(
    base_url="http://localhost:8080/v1", 
    api_key = "sk-no-key-required"
)
completion = client.chat.completions.create(
    model="qwen",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "write a letter about love to a cat"}
    ],
)
print(completion.choices[0].message.content)

to be my request. but every time the response turn out different, which might suggest that the server was not doing greedy decoding. What did i do wrong? Thank you~

6 replies

jackfsuia Jul 11, 2024

Thank you for your help! I tried this, but once i passed temperature=0.0 in there, the server seemed to be stuck somewhere forever, and nothing happen, and i don't know why.

PenutChen Jul 11, 2024
Author

Maybe you can try using streaming output to see what happens:

completion = client.chat.completions.create(
    model="qwen",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "write a letter about love to a cat"}
    ],
    temperature=0.0,
    stream=True,
)

for chunk in completion:
    try:
        chunk_message = chunk.choices[0].delta.content
        print(end=chunk_message)
    except Exception as e:
        print(e)

jackfsuia Jul 11, 2024

Thank you for your brilliant solution, it turns out the server was producing the same sentences non-stop. The problem is clear now. Can I ask one more question? When the llama.cpp runs the non-quantized .gguf model file, is it supposed to produce the same results as the original huggingface model for the same input? But I haven't succeed at making it do so. Is it possible to do so? Thank you again!

PenutChen Jul 11, 2024
Author

This is pretty difficult to align the responses of these backends. In my experience, not only does the temperature need to be set to 0.0, but also frequency_penalty, presence_penalty, or repeat-penalty (if they exist) need to be set properly. By the way, the most greedy decode of llama.cpp is example/simple.

Remember to check the generation config of HF Transformers. They also need to be set to greedy decode.

jackfsuia Jul 11, 2024

You are right, it is difficult. I try llama-simple on the model QWEN2-0.5B, both f16.gguf and bf16.gguf failed to be aligned with hf. I give up. Thank you again.

Zuo-Peng · 2024-11-02T03:35:40Z

Zuo-Peng
Nov 2, 2024

This means that if I want to use greedy decoding, I just need to set temp=0.

0 replies

ggerganov · 2024-11-05T07:54:07Z

ggerganov
Nov 5, 2024
Maintainer

Setting temp = 0 will no longer be equivalent to greedy decoding (see #9897). To enable it, configure a single top_k sampler and set k = 1. For example, with llama-cli this can be done with the following CLI args:

--sampling-seq k --top-k 1

2 replies

jtyska Jan 7, 2025

How do I do this using the API?

Zuo-Peng Feb 27, 2025

use 'llama-server' cmd

Goutamxd · 2025-02-27T17:50:21Z

Goutamxd
Feb 27, 2025

Hey 😊@PenutChen! Your setup looks solid for fully greedy decoding! Setting temperature=0.0, top_k=0, and top_p=0.0 ensures full determinism, so you’re on the right track. repeat_last_n=0 makes sure the model strictly follows the prompt, which is great. One small tweak—you might try repeat_penalty=1.1 instead of 1.0 to help avoid repetitive loops. If things feel off, playing around with presence_penalty or frequency_penalty a little could help. But overall, this should work well for in-context learning. 😊

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

How to Enable Fully Greedy Decoding Sample Parameters? #3005

{{title}}

Replies: 5 comments 11 replies

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

Select a reply

How to Enable Fully Greedy Decoding Sample Parameters? #3005

Replies: 5 comments · 11 replies

KerfuffleV2 Sep 4, 2023 Collaborator

PenutChen Sep 5, 2023 Author

KerfuffleV2 Sep 5, 2023 Collaborator

PenutChen Sep 5, 2023 Author

PenutChen Jul 11, 2024 Author

PenutChen Jul 11, 2024 Author

ggerganov Nov 5, 2024 Maintainer

Replies: 5 comments 11 replies

KerfuffleV2
Sep 4, 2023
Collaborator

PenutChen Sep 5, 2023
Author

KerfuffleV2 Sep 5, 2023
Collaborator

PenutChen Sep 5, 2023
Author

PenutChen Jul 11, 2024
Author

PenutChen Jul 11, 2024
Author

ggerganov
Nov 5, 2024
Maintainer