
Commit a329db0: Merge pull request oobabooga#5452 from oobabooga/dev ("Merge dev branch")
2 parents: 4f3fdf1 + acfbe6b
27 files changed: +387 -214 lines

docs/03 - Parameters Tab.md (+4 -2)

@@ -55,9 +55,10 @@ For more information about the parameters, the [transformers documentation](http
 * **mirostat_tau**: No idea, see the paper for details. According to the Preset Arena, 8 is a good value.
 * **mirostat_eta**: No idea, see the paper for details. According to the Preset Arena, 0.1 is a good value.
 * **dynamic_temperature**: Activates Dynamic Temperature. This modifies temperature to range between "dynatemp_low" (minimum) and "dynatemp_high" (maximum), with an entropy-based scaling. The steepness of the curve is controlled by "dynatemp_exponent".
-* **temperature_last**: Makes temperature the last sampler instead of the first. With this, you can remove low probability tokens with a sampler like min_p and then use a high temperature to make the model creative without losing coherency.
+* **smoothing_factor**: Activates Quadratic Sampling. When `0 < smoothing_factor < 1`, the logits distribution becomes flatter. When `smoothing_factor > 1`, it becomes more peaked.
+* **temperature_last**: Makes temperature the last sampler instead of the first. With this, you can remove low probability tokens with a sampler like min_p and then use a high temperature to make the model creative without losing coherency. Note: this parameter takes precedence over "Sampler priority". That means that `temperature`/`dynamic_temperature`/`quadratic_sampling` will be removed from wherever they are and moved to the end of the stack.
 * **do_sample**: When unchecked, sampling is entirely disabled, and greedy decoding is used instead (the most likely token is always picked).
-* **Seed**: Set the Pytorch seed to this number. Note that some loaders do not use Pytorch (notably llama.cpp), and others are not deterministic (notably ExLlama v1 and v2). For these loaders, the seed has no effect.
+* **Seed**: Set the Pytorch seed to this number. Note that some loaders do not use Pytorch (notably llama.cpp), and others are not deterministic (ExLlamaV2). For these loaders, the seed has no effect.
 * **encoder_repetition_penalty**: Also known as the "Hallucinations filter". Used to penalize tokens that are *not* in the prior text. Higher value = more likely to stay in context, lower value = more likely to diverge.
 * **no_repeat_ngram_size**: If not set to 0, specifies the length of token sets that are completely blocked from repeating at all. Higher values = blocks larger phrases, lower values = blocks words or letters from repeating. Only 0 or high values are a good idea in most cases.
 * **min_length**: Minimum generation length in tokens. This is a built-in parameter in the transformers library that has never been very useful. Typically you want to check "Ban the eos_token" instead.

@@ -76,6 +77,7 @@ To the right (or below if you are on mobile), the following parameters are prese
 * **Add the bos_token to the beginning of prompts**: By default, the tokenizer will add a BOS (Beginning of Sequence) token to your prompt. During training, BOS tokens are used to separate different documents. If unchecked, no BOS token will be added, and the model will interpret your prompt as being in the middle of a document instead of at the start of one. This significantly changes the output and can make it more creative.
 * **Skip special tokens**: When decoding the generated tokens, skip special tokens from being converted to their text representation. Otherwise, BOS appears as `<s>`, EOS as `</s>`, etc.
 * **Activate text streaming**: When unchecked, the full response is outputted at once, without streaming the words one at a time. I recommend unchecking this parameter on high latency networks like running the webui on Google Colab or using `--share`.
+* **Sampler priority**: Allows you to customize the order in which the different samplers are applied. The first sampler on the list gets applied first. With this, custom orders like `top_p -> temperature -> top_k` can be defined.
 * **Load grammar from file**: Loads a GBNF grammar from a file under `text-generation-webui/grammars`. The output is written to the "Grammar" box below. You can also save and delete custom grammars using this menu.
 * **Grammar**: Allows you to constrain the model output to a particular format. For instance, you can make the model generate lists, JSON, specific words, etc. Grammar is extremely powerful and I highly recommend it. The syntax looks a bit daunting at first sight, but it gets very easy once you understand it. See the [GBNF Guide](https://github.com/ggerganov/llama.cpp/blob/master/grammars/README.md) for details.
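The new `smoothing_factor` entry above describes Quadratic Sampling only qualitatively. Below is a minimal sketch of the idea, assuming the commonly used formulation that penalizes each logit by its squared distance from the maximum logit; the exact warper added by this commit may differ in details.

```python
import torch

def quadratic_smoothing(logits: torch.Tensor, smoothing_factor: float) -> torch.Tensor:
    # Penalize each logit by its squared distance from the best logit.
    # Small factors shrink the gaps (flatter distribution); large factors
    # blow them up (more peaked distribution).
    max_logit = logits.max()
    return max_logit - smoothing_factor * (logits - max_logit) ** 2

logits = torch.tensor([4.0, 2.0, 1.0, 0.5])
print(torch.softmax(logits, dim=-1))                            # baseline
print(torch.softmax(quadratic_smoothing(logits, 0.3), dim=-1))  # flatter overall
print(torch.softmax(quadratic_smoothing(logits, 3.0), dim=-1))  # more peaked
```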

docs/04 - Model Tab.md (+3 -3)

@@ -42,7 +42,7 @@ Examples:
 * https://huggingface.co/TheBloke/Llama-2-13B-chat-GPTQ
 
 * **gpu-split**: If you have multiple GPUs, the amount of memory to allocate per GPU should be set in this field. Make sure to set a lower value for the first GPU, as that's where the cache is allocated.
-* **max_seq_len**: The maximum sequence length for the model. In ExLlama, the cache is preallocated, so the higher this value, the higher the VRAM. It is automatically set to the maximum sequence length for the model based on its metadata, but you may need to lower this value be able to fit the model into your GPU. After loading the model, the "Truncate the prompt up to this length" parameter under "Parameters" > "Generation" is automatically set to your chosen "max_seq_len" so that you don't have to set the same thing twice.
+* **max_seq_len**: The maximum sequence length for the model. In ExLlamaV2, the cache is preallocated, so the higher this value, the higher the VRAM. It is automatically set to the maximum sequence length for the model based on its metadata, but you may need to lower this value be able to fit the model into your GPU. After loading the model, the "Truncate the prompt up to this length" parameter under "Parameters" > "Generation" is automatically set to your chosen "max_seq_len" so that you don't have to set the same thing twice.
 * **cfg-cache**: Creates a second cache to hold the CFG negative prompts. You need to set this if and only if you intend to use CFG in the "Parameters" > "Generation" tab. Checking this parameter doubles the cache VRAM usage.
 * **no_flash_attn**: Disables flash attention. Otherwise, it is automatically used as long as the library is installed.
 * **cache_8bit**: Create a 8-bit precision cache instead of a 16-bit one. This saves VRAM but increases perplexity (I don't know by how much).

@@ -57,7 +57,7 @@ Loads: GPTQ models.
 
 * **wbits**: For ancient models without proper metadata, sets the model precision in bits manually. Can usually be ignored.
 * **groupsize**: For ancient models without proper metadata, sets the model group size manually. Can usually be ignored.
-* **triton**: Only available on Linux. Necessary to use models with both act-order and groupsize simultaneously. Note that ExLlama can load these same models on Windows without triton.
+* **triton**: Only available on Linux. Necessary to use models with both act-order and groupsize simultaneously. Note that ExLlamaV2 can load these same models on Windows without triton.
 * **no_inject_fused_attention**: Improves performance while increasing the VRAM usage.
 * **no_inject_fused_mlp**: Similar to the previous parameter but for Triton only.
 * **no_use_cuda_fp16**: On some systems, the performance can be very bad with this unset. Can usually be ignored.

@@ -67,7 +67,7 @@ Loads: GPTQ models.
 
 Loads: GPTQ models.
 
-Ancient loader, the first one to implement 4-bit quantization. It works on older GPUs for which ExLlama and AutoGPTQ do not work, and it doesn't work with "act-order", so you should use it with simple 4-bit-128g models.
+Ancient loader, the first one to implement 4-bit quantization. It works on older GPUs for which ExLlamaV2 and AutoGPTQ do not work, and it doesn't work with "act-order", so you should use it with simple 4-bit-128g models.
 
 * **pre_layer**: Used for CPU offloading. The higher the number, the more layers will be sent to the GPU. GPTQ-for-LLaMa CPU offloading was faster than the one implemented in AutoGPTQ the last time I checked.

docs/What Works.md (+8 -6)

@@ -2,15 +2,17 @@
 
 | Loader | Loading 1 LoRA | Loading 2 or more LoRAs | Training LoRAs | Multimodal extension | Perplexity evaluation |
 |----------------|----------------|-------------------------|----------------|----------------------|-----------------------|
-| Transformers ||*** |* |||
+| Transformers ||\*\*\* |\* |||
+| llama.cpp ||||| use llamacpp_HF |
+| llamacpp_HF ||||||
 | ExLlamav2_HF ||||||
-| ExLlamav2 ||||| use ExLlamav2_HF |
+| ExLlamav2 ||||| use ExLlamav2_HF |
 | AutoGPTQ ||||||
-| GPTQ-for-LLaMa |** |*** ||||
-| llama.cpp ||||| use llamacpp_HF |
-| llamacpp_HF ||||||
+| AutoAWQ | ? || ? | ? ||
+| GPTQ-for-LLaMa |\*\* |\*\*\* ||||
 | ctransformers ||||||
-| AutoAWQ | ? || ? | ? ||
+| QuIP# | ? | ? | ? | ? ||
+| HQQ | ? | ? | ? | ? ||
 
 ❌ = not implemented

extensions/openai/typing.py (+2)

@@ -12,6 +12,7 @@ class GenerationOptions(BaseModel):
     dynatemp_low: float = 1
     dynatemp_high: float = 1
     dynatemp_exponent: float = 1
+    smoothing_factor: float = 0
     top_k: int = 0
     repetition_penalty: float = 1
     repetition_penalty_range: int = 1024

@@ -39,6 +40,7 @@ class GenerationOptions(BaseModel):
     max_tokens_second: int = 0
     prompt_lookup_num_tokens: int = 0
    custom_token_bans: str = ""
+    sampler_priority: List[str] | str | None = Field(default=None, description="List of samplers where the first items will appear first in the stack. Example: [\"top_k\", \"temperature\", \"top_p\"].")
     auto_max_new_tokens: bool = False
     ban_eos_token: bool = False
     add_bos_token: bool = True
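Both new fields become part of the generation options accepted by the OpenAI-compatible API. A hedged usage sketch follows, assuming a local instance listening on its default address; the endpoint, host, and values are assumptions for illustration, not part of this diff.

```python
import requests

# Illustrative request only; adjust the host/port to your own setup.
url = "http://127.0.0.1:5000/v1/completions"
payload = {
    "prompt": "Once upon a time",
    "max_tokens": 64,
    "temperature": 1.3,
    "min_p": 0.05,
    "smoothing_factor": 0.3,                               # new: Quadratic Sampling
    "sampler_priority": ["top_k", "temperature", "top_p"]  # new: custom sampler order
}

response = requests.post(url, json=payload, timeout=60)
print(response.json()["choices"][0]["text"])
```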

instruction-templates/ChatML.yaml (+2 -5)

@@ -5,15 +5,12 @@ instruction_template: |-
 {%- set ns.found = true -%}
 {%- endif -%}
 {%- endfor -%}
-{%- if not ns.found -%}
-{{- '<|im_start|>system\n' + '' + '<|im_end|>\n' -}}
-{%- endif %}
 {%- for message in messages %}
 {%- if message['role'] == 'system' -%}
-{{- '<|im_start|>system\n' + message['content'] + '<|im_end|>\n' -}}
+{{- '<|im_start|>system\n' + message['content'].rstrip() + '<|im_end|>\n' -}}
 {%- else -%}
 {%- if message['role'] == 'user' -%}
-{{-'<|im_start|>user\n' + message['content'] + '<|im_end|>\n'-}}
+{{-'<|im_start|>user\n' + message['content'].rstrip() + '<|im_end|>\n'-}}
 {%- else -%}
 {{-'<|im_start|>assistant\n' + message['content'] + '<|im_end|>\n' -}}
 {%- endif -%}
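The template change is easier to see by rendering a trimmed-down version of the same ChatML loop with Jinja2. This is an approximation for illustration, not the full instruction template; it shows the effect of the new `.rstrip()` calls on trailing whitespace in message content.

```python
from jinja2 import Template

# A reduced ChatML-style loop (not the full template from this commit).
template = Template(
    "{%- for message in messages -%}"
    "{{- '<|im_start|>' + message['role'] + '\\n' + message['content'].rstrip() + '<|im_end|>\\n' -}}"
    "{%- endfor -%}"
)

messages = [
    {"role": "system", "content": "You are a helpful assistant.\n\n"},
    {"role": "user", "content": "Hello!   "},
]

# Trailing whitespace in each message is stripped before the <|im_end|> tag.
print(template.render(messages=messages))
```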

modules/LoRA.py (+1 -1)

@@ -12,7 +12,7 @@
 def add_lora_to_model(lora_names):
     if 'GPTQForCausalLM' in shared.model.__class__.__name__ or shared.args.loader == 'AutoGPTQ':
         add_lora_autogptq(lora_names)
-    elif shared.model.__class__.__name__ in ['Exllamav2Model', 'Exllamav2HF'] or shared.args.loader == ['ExLlamav2', 'ExLlamav2_HF']:
+    elif shared.model.__class__.__name__ in ['Exllamav2Model', 'Exllamav2HF'] or shared.args.loader in ['ExLlamav2', 'ExLlamav2_HF']:
         add_lora_exllamav2(lora_names)
     else:
         add_lora_transformers(lora_names)
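The one-character fix matters because comparing a string against a list with `==` is always False in Python, so the loader check on the right-hand side of the `or` could never match. A quick illustration:

```python
loader = "ExLlamav2_HF"

print(loader == ["ExLlamav2", "ExLlamav2_HF"])  # False: a str never equals a list
print(loader in ["ExLlamav2", "ExLlamav2_HF"])  # True: membership test, as intended
```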

modules/chat.py (+41 -6)

@@ -166,18 +166,53 @@ def make_prompt(messages):
         prompt = remove_extra_bos(prompt)
         return prompt
 
-    prompt = make_prompt(messages)
-
     # Handle truncation
     max_length = get_max_prompt_length(state)
-    while len(messages) > 0 and get_encoded_length(prompt) > max_length:
-        # Try to save the system message
-        if len(messages) > 1 and messages[0]['role'] == 'system':
+    prompt = make_prompt(messages)
+    encoded_length = get_encoded_length(prompt)
+
+    while len(messages) > 0 and encoded_length > max_length:
+
+        # Remove old message, save system message
+        if len(messages) > 2 and messages[0]['role'] == 'system':
             messages.pop(1)
-        else:
+
+        # Remove old message when no system message is present
+        elif len(messages) > 1 and messages[0]['role'] != 'system':
             messages.pop(0)
 
+        # Resort to truncating the user input
+        else:
+
+            user_message = messages[-1]['content']
+
+            # Bisect the truncation point
+            left, right = 0, len(user_message) - 1
+
+            while right - left > 1:
+                mid = (left + right) // 2
+
+                messages[-1]['content'] = user_message[mid:]
+                prompt = make_prompt(messages)
+                encoded_length = get_encoded_length(prompt)
+
+                if encoded_length <= max_length:
+                    right = mid
+                else:
+                    left = mid
+
+            messages[-1]['content'] = user_message[right:]
+            prompt = make_prompt(messages)
+            encoded_length = get_encoded_length(prompt)
+            if encoded_length > max_length:
+                logger.error(f"Failed to build the chat prompt. The input is too long for the available context length.\n\nTruncation length: {state['truncation_length']}\nmax_new_tokens: {state['max_new_tokens']} (is it too high?)\nAvailable context length: {max_length}\n")
+                raise ValueError
+            else:
+                logger.warning(f"The input has been truncated. Context length: {state['truncation_length']}, max_new_tokens: {state['max_new_tokens']}, available context length: {max_length}.")
+                break
+
         prompt = make_prompt(messages)
+        encoded_length = get_encoded_length(prompt)
 
     if also_return_rows:
         return prompt, [message['content'] for message in messages]
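The new fallback path binary-searches for the longest suffix of the last user message that still fits the prompt budget, instead of failing outright. Below is a self-contained sketch of the same pattern, with a hypothetical `fits` predicate standing in for the tokenizer-based length check used in chat.py.

```python
def truncate_to_fit(text: str, fits) -> str:
    """Return the longest suffix of `text` for which fits(suffix) is True.

    Assumes fits() is monotone (shorter suffixes are at least as likely to fit),
    mirroring the bisection used in the chat.py change above.
    """
    if fits(text):          # nothing to do if the whole text already fits
        return text
    left, right = 0, len(text) - 1
    while right - left > 1:
        mid = (left + right) // 2
        if fits(text[mid:]):
            right = mid     # this suffix fits; try keeping more of the text
        else:
            left = mid      # still too long; cut deeper into the message
    return text[right:]     # assumes at least the final character fits

# Toy usage: a 10-character budget stands in for the token budget.
print(truncate_to_fit("The quick brown fox jumps over the lazy dog",
                      lambda s: len(s) <= 10))  # -> "e lazy dog"
```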

modules/llamacpp_hf.py (+2 -1)

@@ -216,7 +216,8 @@ def from_pretrained(cls, pretrained_model_name_or_path: Optional[Union[str, os.P
             'tensor_split': tensor_split_list,
             'rope_freq_scale': 1.0 / shared.args.compress_pos_emb,
             'logits_all': shared.args.logits_all,
-            'offload_kqv': not shared.args.no_offload_kqv
+            'offload_kqv': not shared.args.no_offload_kqv,
+            'split_mode': 1 if not shared.args.row_split else 2
         }
 
         Llama = llama_cpp_lib().Llama

modules/llamacpp_model.py (+2 -1)

@@ -95,7 +95,8 @@ def from_pretrained(self, path):
             'rope_freq_base': RoPE.get_rope_freq_base(shared.args.alpha_value, shared.args.rope_freq_base),
             'tensor_split': tensor_split_list,
             'rope_freq_scale': 1.0 / shared.args.compress_pos_emb,
-            'offload_kqv': not shared.args.no_offload_kqv
+            'offload_kqv': not shared.args.no_offload_kqv,
+            'split_mode': 1 if not shared.args.row_split else 2
         }
 
         result.model = Llama(**params)
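The `split_mode` integers passed in both files above are easy to misread. As far as I can tell they correspond to llama.cpp's layer-wise versus row-wise multi-GPU split modes; the sketch below mirrors the added expression, with locally defined constants rather than anything imported from llama-cpp-python (the names are illustrative assumptions).

```python
# Assumed meaning of the integer split modes accepted by llama.cpp:
#   1 -> split the model layer by layer across GPUs (previous default behavior)
#   2 -> split each tensor's rows across GPUs (enabled by the new row_split flag)
SPLIT_MODE_LAYER = 1
SPLIT_MODE_ROW = 2

def pick_split_mode(row_split: bool) -> int:
    # Mirrors: 'split_mode': 1 if not shared.args.row_split else 2
    return SPLIT_MODE_ROW if row_split else SPLIT_MODE_LAYER

print(pick_split_mode(False))  # 1
print(pick_split_mode(True))   # 2
```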

modules/loaders.py (+9 -1)

@@ -26,7 +26,7 @@
         'compress_pos_emb',
         'disable_exllama',
         'disable_exllamav2',
-        'transformers_info'
+        'transformers_info',
     ],
     'llama.cpp': [
         'n_ctx',

@@ -44,6 +44,7 @@
         'cpu',
         'numa',
         'no_offload_kqv',
+        'row_split',
         'tensorcores',
     ],
     'llamacpp_HF': [

@@ -66,6 +67,7 @@
         'no_use_fast',
         'logits_all',
         'no_offload_kqv',
+        'row_split',
         'tensorcores',
         'llamacpp_HF_info',
     ],

@@ -159,6 +161,7 @@ def transformers_samplers():
         'dynatemp_low',
         'dynatemp_high',
         'dynatemp_exponent',
+        'smoothing_factor',
         'top_p',
         'min_p',
         'top_k',

@@ -189,6 +192,7 @@ def transformers_samplers():
         'negative_prompt',
         'ban_eos_token',
         'custom_token_bans',
+        'sampler_priority',
         'add_bos_token',
         'skip_special_tokens',
         'auto_max_new_tokens',

@@ -233,6 +237,7 @@ def transformers_samplers():
         'dynatemp_low',
         'dynatemp_high',
         'dynatemp_exponent',
+        'smoothing_factor',
         'top_p',
         'min_p',
         'top_k',

@@ -259,6 +264,7 @@ def transformers_samplers():
         'negative_prompt',
         'ban_eos_token',
         'custom_token_bans',
+        'sampler_priority',
         'add_bos_token',
         'skip_special_tokens',
         'auto_max_new_tokens',

@@ -289,6 +295,7 @@ def transformers_samplers():
         'dynatemp_low',
         'dynatemp_high',
         'dynatemp_exponent',
+        'smoothing_factor',
         'top_p',
         'min_p',
         'top_k',

@@ -315,6 +322,7 @@ def transformers_samplers():
         'negative_prompt',
         'ban_eos_token',
         'custom_token_bans',
+        'sampler_priority',
         'add_bos_token',
         'skip_special_tokens',
         'auto_max_new_tokens',

modules/models.py (+2 -2)

@@ -100,9 +100,9 @@ def load_model(model_name, loader=None):
     elif loader in ['llama.cpp', 'llamacpp_HF', 'ctransformers']:
         shared.settings['truncation_length'] = shared.args.n_ctx
 
-    logger.info(f"LOADER: {loader}")
+    logger.info(f"LOADER: \"{loader}\"")
     logger.info(f"TRUNCATION LENGTH: {shared.settings['truncation_length']}")
-    logger.info(f"INSTRUCTION TEMPLATE: {metadata['instruction_template']}")
+    logger.info(f"INSTRUCTION TEMPLATE: \"{metadata['instruction_template']}\"")
     logger.info(f"Loaded the model in {(time.time()-t0):.2f} seconds.")
     return model, tokenizer

modules/presets.py (+2)

@@ -17,6 +17,7 @@ def default_preset():
         'dynatemp_low': 1,
         'dynatemp_high': 1,
         'dynatemp_exponent': 1,
+        'smoothing_factor': 0,
         'top_p': 1,
         'min_p': 0,
         'top_k': 0,

@@ -41,6 +42,7 @@ def default_preset():
         'num_beams': 1,
         'length_penalty': 1,
         'early_stopping': False,
+        'sampler_priority': 'temperature\ndynamic_temperature\nquadratic_sampling\ntop_k\ntop_p\ntypical_p\nepsilon_cutoff\neta_cutoff\ntfs\ntop_a\nmin_p\nmirostat'
     }
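The preset stores `sampler_priority` as a newline-separated string. Presumably it is split into an ordered list before the samplers are applied; the snippet below is a hedged illustration of that interpretation, not code from this commit.

```python
default_priority = (
    'temperature\ndynamic_temperature\nquadratic_sampling\ntop_k\ntop_p\n'
    'typical_p\nepsilon_cutoff\neta_cutoff\ntfs\ntop_a\nmin_p\nmirostat'
)

# Split into an ordered list: earlier entries are applied earlier in the stack.
priority = [name.strip() for name in default_priority.split('\n') if name.strip()]
print(priority[:3])  # ['temperature', 'dynamic_temperature', 'quadratic_sampling']
```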
