
Commit 073694b

Merge pull request oobabooga#6336 from oobabooga/dev
Merge dev branch
2 parents: d011040 + 9d99156

14 files changed: +131 −118 lines

README.md (+24 −24)

@@ -10,27 +10,29 @@ Its goal is to become the [AUTOMATIC1111/stable-diffusion-webui](https://github.
 
 ## Features
 
-* 3 interface modes: default (two columns), notebook, and chat.
-* Multiple model backends: [Transformers](https://github.com/huggingface/transformers), [llama.cpp](https://github.com/ggerganov/llama.cpp) (through [llama-cpp-python](https://github.com/abetlen/llama-cpp-python)), [ExLlamaV2](https://github.com/turboderp/exllamav2), [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ), [AutoAWQ](https://github.com/casper-hansen/AutoAWQ), [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM).
-* Dropdown menu for quickly switching between different models.
-* Large number of extensions (built-in and user-contributed), including Coqui TTS for realistic voice outputs, Whisper STT for voice inputs, translation, [multimodal pipelines](https://github.com/oobabooga/text-generation-webui/tree/main/extensions/multimodal), vector databases, Stable Diffusion integration, and a lot more. See [the wiki](https://github.com/oobabooga/text-generation-webui/wiki/07-%E2%80%90-Extensions) and [the extensions directory](https://github.com/oobabooga/text-generation-webui-extensions) for details.
-* [Chat with custom characters](https://github.com/oobabooga/text-generation-webui/wiki/03-%E2%80%90-Parameters-Tab#character).
-* Precise chat templates for instruction-following models, including Llama-2-chat, Alpaca, Vicuna, Mistral.
-* LoRA: train new LoRAs with your own data, load/unload LoRAs on the fly for generation.
-* Transformers library integration: load models in 4-bit or 8-bit precision through bitsandbytes, use llama.cpp with transformers samplers (`llamacpp_HF` loader), CPU inference in 32-bit precision using PyTorch.
-* OpenAI-compatible API server with Chat and Completions endpoints -- see the [examples](https://github.com/oobabooga/text-generation-webui/wiki/12-%E2%80%90-OpenAI-API#examples).
+* Multiple backends for text generation in a single UI and API, including [Transformers](https://github.com/huggingface/transformers), [llama.cpp](https://github.com/ggerganov/llama.cpp) (through [llama-cpp-python](https://github.com/abetlen/llama-cpp-python)), [ExLlamaV2](https://github.com/turboderp/exllamav2), [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ), and [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM). [AutoAWQ](https://github.com/casper-hansen/AutoAWQ), [HQQ](https://github.com/mobiusml/hqq), and [AQLM](https://github.com/Vahe1994/AQLM) are also supported through the Transformers loader.
+* OpenAI-compatible API server with Chat and Completions endpoints – see the [examples](https://github.com/oobabooga/text-generation-webui/wiki/12-%E2%80%90-OpenAI-API#examples).
+* Automatic prompt formatting for each model using the Jinja2 template in its metadata, ensuring high-quality outputs without manual setup.
+* Three chat modes: `instruct`, `chat-instruct`, and `chat`, allowing for both task-based interactions and casual conversations with characters. `chat-instruct` mode automatically applies the model's template to the chat's prompt, leading to higher quality outputs.
+* Easy switching between conversations and starting new ones through the "Past chats" menu in the main interface tab.
+* Flexible text generation through autocompletion in the Default/Notebook tabs without being limited to chat turns. Send formatted chat conversations from the Chat tab to these tabs.
+* Multiple sampling parameters and options for sophisticated text generation control.
+* Quick downloading and loading of new models through the interface without restarting, using the "Model" tab.
+* Simple LoRA fine-tuning tool to customize models with your data.
+* Self-contained dependencies in the `installer_files` folder, avoiding interference with the system's Python environment. Precompiled Python wheels for the backends are in the `requirements.txt` and are transparently compiled using GitHub Actions.
+* Extensions support, including numerous built-in and user-contributed extensions. See [the wiki](https://github.com/oobabooga/text-generation-webui/wiki/07-%E2%80%90-Extensions) and [the extensions directory](https://github.com/oobabooga/text-generation-webui-extensions) for details.
 
 ## How to install
 
 1) Clone or [download](https://github.com/oobabooga/text-generation-webui/archive/refs/heads/main.zip) the repository.
 2) Run the `start_linux.sh`, `start_windows.bat`, `start_macos.sh`, or `start_wsl.bat` script depending on your OS.
 3) Select your GPU vendor when asked.
-4) Once the installation ends, browse to `http://localhost:7860/?__theme=dark`.
+4) Once the installation ends, browse to `http://localhost:7860`.
 5) Have fun!
 
-To restart the web UI in the future, just run the `start_` script again. This script creates an `installer_files` folder where it sets up the project's requirements. In case you need to reinstall the requirements, you can simply delete that folder and start the web UI again.
+To restart the web UI in the future, just run the `start_` script again. This script creates an `installer_files` folder where it sets up the project's requirements. If you need to reinstall the requirements, you can simply delete that folder and start the web UI again.
 
-The script accepts command-line flags. Alternatively, you can edit the `CMD_FLAGS.txt` file with a text editor and add your flags there.
+The script accepts command-line flags. Alternatively, you can edit the `CMD_FLAGS.txt` file with a text editor and add your flags there, such as `--api` in case you need to use the API.
 
 To get updates in the future, run `update_wizard_linux.sh`, `update_wizard_windows.bat`, `update_wizard_macos.sh`, or `update_wizard_wsl.bat`.
 
@@ -207,13 +209,13 @@ usage: server.py [-h] [--multi-user] [--character CHARACTER] [--model MODEL] [--
                  [--force-safetensors] [--no_use_fast] [--use_flash_attention_2] [--use_eager_attention] [--load-in-4bit] [--use_double_quant] [--compute_dtype COMPUTE_DTYPE] [--quant_type QUANT_TYPE]
                  [--flash-attn] [--tensorcores] [--n_ctx N_CTX] [--threads THREADS] [--threads-batch THREADS_BATCH] [--no_mul_mat_q] [--n_batch N_BATCH] [--no-mmap] [--mlock]
                  [--n-gpu-layers N_GPU_LAYERS] [--tensor_split TENSOR_SPLIT] [--numa] [--logits_all] [--no_offload_kqv] [--cache-capacity CACHE_CAPACITY] [--row_split] [--streaming-llm]
-                 [--attention-sink-size ATTENTION_SINK_SIZE] [--gpu-split GPU_SPLIT] [--autosplit] [--max_seq_len MAX_SEQ_LEN] [--cfg-cache] [--no_flash_attn] [--no_xformers] [--no_sdpa]
-                 [--cache_8bit] [--cache_4bit] [--num_experts_per_token NUM_EXPERTS_PER_TOKEN] [--triton] [--no_inject_fused_mlp] [--no_use_cuda_fp16] [--desc_act] [--disable_exllama]
-                 [--disable_exllamav2] [--wbits WBITS] [--groupsize GROUPSIZE] [--no_inject_fused_attention] [--hqq-backend HQQ_BACKEND] [--cpp-runner] [--deepspeed]
-                 [--nvme-offload-dir NVME_OFFLOAD_DIR] [--local_rank LOCAL_RANK] [--alpha_value ALPHA_VALUE] [--rope_freq_base ROPE_FREQ_BASE] [--compress_pos_emb COMPRESS_POS_EMB] [--listen]
-                 [--listen-port LISTEN_PORT] [--listen-host LISTEN_HOST] [--share] [--auto-launch] [--gradio-auth GRADIO_AUTH] [--gradio-auth-path GRADIO_AUTH_PATH] [--ssl-keyfile SSL_KEYFILE]
-                 [--ssl-certfile SSL_CERTFILE] [--subpath SUBPATH] [--api] [--public-api] [--public-api-id PUBLIC_API_ID] [--api-port API_PORT] [--api-key API_KEY] [--admin-key ADMIN_KEY] [--nowebui]
-                 [--multimodal-pipeline MULTIMODAL_PIPELINE] [--model_type MODEL_TYPE] [--pre_layer PRE_LAYER [PRE_LAYER ...]] [--checkpoint CHECKPOINT] [--monkey-patch]
+                 [--attention-sink-size ATTENTION_SINK_SIZE] [--tokenizer-dir TOKENIZER_DIR] [--gpu-split GPU_SPLIT] [--autosplit] [--max_seq_len MAX_SEQ_LEN] [--cfg-cache] [--no_flash_attn]
+                 [--no_xformers] [--no_sdpa] [--cache_8bit] [--cache_4bit] [--num_experts_per_token NUM_EXPERTS_PER_TOKEN] [--triton] [--no_inject_fused_mlp] [--no_use_cuda_fp16] [--desc_act]
+                 [--disable_exllama] [--disable_exllamav2] [--wbits WBITS] [--groupsize GROUPSIZE] [--hqq-backend HQQ_BACKEND] [--cpp-runner] [--deepspeed] [--nvme-offload-dir NVME_OFFLOAD_DIR]
+                 [--local_rank LOCAL_RANK] [--alpha_value ALPHA_VALUE] [--rope_freq_base ROPE_FREQ_BASE] [--compress_pos_emb COMPRESS_POS_EMB] [--listen] [--listen-port LISTEN_PORT]
+                 [--listen-host LISTEN_HOST] [--share] [--auto-launch] [--gradio-auth GRADIO_AUTH] [--gradio-auth-path GRADIO_AUTH_PATH] [--ssl-keyfile SSL_KEYFILE] [--ssl-certfile SSL_CERTFILE]
+                 [--subpath SUBPATH] [--api] [--public-api] [--public-api-id PUBLIC_API_ID] [--api-port API_PORT] [--api-key API_KEY] [--admin-key ADMIN_KEY] [--nowebui]
+                 [--multimodal-pipeline MULTIMODAL_PIPELINE] [--model_type MODEL_TYPE] [--pre_layer PRE_LAYER [PRE_LAYER ...]] [--checkpoint CHECKPOINT] [--monkey-patch] [--no_inject_fused_attention]
 
 Text generation web UI
 
@@ -237,7 +239,7 @@ Basic settings:
 
 Model loader:
   --loader LOADER       Choose the model loader manually, otherwise, it will get autodetected. Valid options: Transformers, llama.cpp, llamacpp_HF, ExLlamav2_HF, ExLlamav2,
-                        AutoGPTQ, AutoAWQ.
+                        AutoGPTQ.
 
 Transformers/Accelerate:
   --cpu                 Use the CPU to generate text. Warning: Training on CPU is extremely slow.
@@ -281,6 +283,7 @@ llama.cpp:
   --row_split                                 Split the model by rows across GPUs. This may improve multi-gpu performance.
   --streaming-llm                             Activate StreamingLLM to avoid re-evaluating the entire prompt when old messages are removed.
   --attention-sink-size ATTENTION_SINK_SIZE   StreamingLLM: number of sink tokens. Only used if the trimmed prompt does not share a prefix with the old prompt.
+  --tokenizer-dir TOKENIZER_DIR               Load the tokenizer from this folder. Meant to be used with llamacpp_HF through the command-line.
 
 ExLlamaV2:
   --gpu-split GPU_SPLIT                       Comma-separated list of VRAM (in GB) to use per GPU device for model layers. Example: 20,7,7.
@@ -304,9 +307,6 @@ AutoGPTQ:
   --wbits WBITS                 Load a pre-quantized model with specified precision in bits. 2, 3, 4 and 8 are supported.
   --groupsize GROUPSIZE         Group size.
 
-AutoAWQ:
-  --no_inject_fused_attention   Disable the use of fused attention, which will use less VRAM at the cost of slower inference.
-
 HQQ:
   --hqq-backend HQQ_BACKEND     Backend for the HQQ loader. Valid options: PYTORCH, PYTORCH_COMPILE, ATEN.
 
@@ -401,7 +401,7 @@ https://colab.research.google.com/github/oobabooga/text-generation-webui/blob/ma
 
 ## Community
 
-* Subreddit: https://www.reddit.com/r/oobabooga/
+* Subreddit: https://www.reddit.com/r/Oobabooga/
 * Discord: https://discord.gg/jwZCF2dPQN
 
 ## Acknowledgment
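The README changes above highlight the OpenAI-compatible API and suggest adding `--api` to `CMD_FLAGS.txt`. As a rough illustration (not part of this commit), a Chat Completions request against a locally running instance might look like the sketch below; it assumes `--api` is enabled and the default API port of 5000 (adjustable with `--api-port`).

```python
# Minimal sketch of a request to the OpenAI-compatible Chat Completions endpoint.
# Assumptions: the web UI was started with --api and listens on the default port 5000.
import requests

url = "http://127.0.0.1:5000/v1/chat/completions"
payload = {
    "messages": [{"role": "user", "content": "Write a haiku about GPUs."}],
    "max_tokens": 128,
}

response = requests.post(url, json=payload, timeout=120)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```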

download-model.py (+3 −2)

@@ -29,6 +29,7 @@
 class ModelDownloader:
     def __init__(self, max_retries=5):
         self.max_retries = max_retries
+        self.session = self.get_session()
 
     def get_session(self):
         session = requests.Session()
@@ -72,7 +73,7 @@ def sanitize_model_and_branch_names(self, model, branch):
         return model, branch
 
     def get_download_links_from_huggingface(self, model, branch, text_only=False, specific_file=None):
-        session = self.get_session()
+        session = self.session
         page = f"/api/models/{model}/tree/{branch}"
         cursor = b""
 
@@ -192,7 +193,7 @@ def get_single_file(self, url, output_folder, start_from_scratch=False):
         attempt = 0
         while attempt < max_retries:
             attempt += 1
-            session = self.get_session()
+            session = self.session
             headers = {}
             mode = 'wb'
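The change in `download-model.py` creates one `requests.Session` in `__init__` and reuses it everywhere instead of building a fresh session per call, so HTTP connections and retry adapters are pooled across requests. A standalone sketch of the same pattern, with illustrative names rather than the project's code:

```python
# One shared requests.Session reuses TCP/TLS connections across calls,
# instead of rebuilding a session and its retry adapter for every request.
import requests
from requests.adapters import HTTPAdapter, Retry

class Downloader:
    def __init__(self, max_retries=5):
        self.session = requests.Session()
        retry = Retry(total=max_retries, backoff_factor=1)
        self.session.mount("https://", HTTPAdapter(max_retries=retry))

    def fetch(self, url):
        # Every call goes through the same pooled session.
        return self.session.get(url, timeout=30)

dl = Downloader()
print(dl.fetch("https://huggingface.co/api/models").status_code)
```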

modules/models.py (+22 −11)

@@ -98,7 +98,7 @@ def load_model(model_name, loader=None):
     if model is None:
         return None, None
     else:
-        tokenizer = load_tokenizer(model_name, model)
+        tokenizer = load_tokenizer(model_name)
 
     shared.settings.update({k: v for k, v in metadata.items() if k in shared.settings})
     if loader.lower().startswith('exllama') or loader.lower().startswith('tensorrt'):
@@ -113,9 +113,13 @@ def load_model(model_name, loader=None):
     return model, tokenizer
 
 
-def load_tokenizer(model_name, model):
+def load_tokenizer(model_name, tokenizer_dir=None):
+    if tokenizer_dir:
+        path_to_model = Path(tokenizer_dir)
+    else:
+        path_to_model = Path(f"{shared.args.model_dir}/{model_name}/")
+
     tokenizer = None
-    path_to_model = Path(f"{shared.args.model_dir}/{model_name}/")
     if path_to_model.exists():
         if shared.args.no_use_fast:
             logger.info('Loading the tokenizer with use_fast=False.')
@@ -278,17 +282,24 @@ def llamacpp_loader(model_name):
 def llamacpp_HF_loader(model_name):
     from modules.llamacpp_hf import LlamacppHF
 
-    path = Path(f'{shared.args.model_dir}/{model_name}')
-
-    # Check if a HF tokenizer is available for the model
-    if all((path / file).exists() for file in ['tokenizer_config.json']):
-        logger.info(f'Using tokenizer from: \"{path}\"')
+    if shared.args.tokenizer_dir:
+        logger.info(f'Using tokenizer from: \"{shared.args.tokenizer_dir}\"')
     else:
-        logger.error("Could not load the model because a tokenizer in Transformers format was not found.")
-        return None, None
+        path = Path(f'{shared.args.model_dir}/{model_name}')
+        # Check if a HF tokenizer is available for the model
+        if all((path / file).exists() for file in ['tokenizer_config.json']):
+            logger.info(f'Using tokenizer from: \"{path}\"')
+        else:
+            logger.error("Could not load the model because a tokenizer in Transformers format was not found.")
+            return None, None
 
     model = LlamacppHF.from_pretrained(model_name)
-    return model
+
+    if shared.args.tokenizer_dir:
+        tokenizer = load_tokenizer(model_name, tokenizer_dir=shared.args.tokenizer_dir)
+        return model, tokenizer
+    else:
+        return model
 
 
 def AutoGPTQ_loader(model_name):
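With these changes, `llamacpp_HF` can take its Transformers-format tokenizer either from the model folder or from an explicit `--tokenizer-dir`. A condensed, standalone sketch of that lookup order (simplified for illustration; the real logic lives in `load_tokenizer` and `llamacpp_HF_loader` above):

```python
# Simplified illustration of the tokenizer lookup order: prefer an explicit
# --tokenizer-dir, otherwise fall back to the model's own folder.
from pathlib import Path
from typing import Optional

def resolve_tokenizer_path(model_name: str, model_dir: str = "models",
                           tokenizer_dir: Optional[str] = None) -> Optional[Path]:
    path = Path(tokenizer_dir) if tokenizer_dir else Path(model_dir) / model_name
    # llamacpp_HF needs a Transformers-format tokenizer (tokenizer_config.json).
    if (path / "tokenizer_config.json").exists():
        return path
    return None

# Hypothetical example: a GGUF model paired with an external tokenizer folder.
print(resolve_tokenizer_path("Meta-Llama-3-8B-Instruct-Q4_K_M",
                             tokenizer_dir="models/Meta-Llama-3-8B-Instruct"))
```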

modules/shared.py (+1)

@@ -132,6 +132,7 @@
 group.add_argument('--row_split', action='store_true', help='Split the model by rows across GPUs. This may improve multi-gpu performance.')
 group.add_argument('--streaming-llm', action='store_true', help='Activate StreamingLLM to avoid re-evaluating the entire prompt when old messages are removed.')
 group.add_argument('--attention-sink-size', type=int, default=5, help='StreamingLLM: number of sink tokens. Only used if the trimmed prompt does not share a prefix with the old prompt.')
+group.add_argument('--tokenizer-dir', type=str, help='Load the tokenizer from this folder. Meant to be used with llamacpp_HF through the command-line.')
 
 # ExLlamaV2
 group = parser.add_argument_group('ExLlamaV2')
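For reference, argparse maps the new flag's dashes to underscores on the parsed namespace, which is why the loader code reads `shared.args.tokenizer_dir`. A tiny standalone sketch of that behavior (illustrative, not the repository's parser):

```python
# argparse converts --tokenizer-dir into args.tokenizer_dir.
import argparse

parser = argparse.ArgumentParser()
group = parser.add_argument_group('llama.cpp')
group.add_argument('--tokenizer-dir', type=str, help='Load the tokenizer from this folder.')

args = parser.parse_args(['--tokenizer-dir', 'models/Meta-Llama-3-8B-Instruct'])
print(args.tokenizer_dir)  # -> models/Meta-Llama-3-8B-Instruct
```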

modules/ui_parameters.py (+2 −2)

@@ -40,9 +40,9 @@ def create_ui(default_preset):
 shared.gradio['do_sample'] = gr.Checkbox(value=generate_params['do_sample'], label='do_sample')
 
 with gr.Blocks():
-    shared.gradio['dry_multiplier'] = gr.Slider(0, 5, value=generate_params['dry_multiplier'], step=0.01, label='dry_multiplier', info='Set to value > 0 to enable DRY. Controls the magnitude of the penalty for the shortest penalized sequences.')
-    shared.gradio['dry_base'] = gr.Slider(1, 4, value=generate_params['dry_base'], step=0.01, label='dry_base', info='Controls how fast the penalty grows with increasing sequence length.')
+    shared.gradio['dry_multiplier'] = gr.Slider(0, 5, value=generate_params['dry_multiplier'], step=0.01, label='dry_multiplier', info='Set to greater than 0 to enable DRY. Recommended value: 0.8.')
     shared.gradio['dry_allowed_length'] = gr.Slider(1, 20, value=generate_params['dry_allowed_length'], step=1, label='dry_allowed_length', info='Longest sequence that can be repeated without being penalized.')
+    shared.gradio['dry_base'] = gr.Slider(1, 4, value=generate_params['dry_base'], step=0.01, label='dry_base', info='Controls how fast the penalty grows with increasing sequence length.')
     shared.gradio['dry_sequence_breakers'] = gr.Textbox(value=generate_params['dry_sequence_breakers'], label='dry_sequence_breakers', info='Tokens across which sequence matching is not continued. Specified as a comma-separated list of quoted strings.')
 
 gr.Markdown("[Learn more](https://github.com/oobabooga/text-generation-webui/wiki/03-%E2%80%90-Parameters-Tab)")
