Commit e6f230b: Update README.md
1 parent f6e8495

3 files changed (+72 -36 lines)

README.md (+55 -19)
@@ -3,17 +3,57 @@
 ExLlamaV2 is an inference library for running local LLMs on modern consumer GPUs.
 
 
-## Overview of differences compared to V1
+## New in v0.1.0:
+
+- ExLlamaV2 now supports paged attention via [Flash Attention](https://github.com/Dao-AILab/flash-attention) 2.5.7+
+- New generator with dynamic batching, smart prompt caching, K/V cache deduplication and a simplified API
+
+![alt_text](doc/dynamic_gen.gif)
+
+## Dynamic generator examples
+
+The dynamic generator supports all inference, sampling and speculative decoding features of the previous two
+generators, consolidated into one API (with the exception of the FP8 cache; the Q4 cache mode is supported and
+performs better anyway, see [here](doc/qcache_eval.md)).
+
+- Single generation:
+```python
+output = generator.generate(prompt = "Hello, my name is", max_new_tokens = 200)
+```
+- Batched generation:
+```python
+outputs = generator.generate(
+    prompt = [
+        "Hello, my name is",
+        "Once upon a time,",
+        "Large language models are",
+    ],
+    max_new_tokens = 200
+)
+```
+- Streamed generation with `asyncio`:
+```python
+job = ExLlamaV2DynamicJobAsync(
+    generator,
+    input_ids = tokenizer.encode("You can lead a horse to water"),
+    banned_strings = ["make it drink"],
+    gen_settings = ExLlamaV2Sampler.Settings.greedy(),
+    max_new_tokens = 200
+)
+async for result in job:
+    text = result.get("text", "")
+    print(text, end = "")
+```
+See the full, updated examples [here](https://github.com/turboderp/exllamav2/tree/master/examples).
+
+
 
-- Faster, better kernels
-- Cleaner and more versatile codebase
-- Support for a new quant format (see below)
 
 
 ## Performance
 
-Some quick tests to compare performance with V1. There may be more performance optimizations in the future, and
-speeds will vary across GPUs, with slow CPUs still being a potential bottleneck:
+Some quick tests to compare performance with ExLlama V1. There may be more performance optimizations in the future,
+and speeds will vary across GPUs, with slow CPUs still being a potential bottleneck:
 
 | Model      | Mode         | Size  | grpsz | act | 3090Ti  | 4090        |
 |------------|--------------|-------|-------|-----|---------|-------------|
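The `generate()` snippets added in this hunk assume an already-built `generator` and `tokenizer`. For context, a minimal setup sketch based on the repo's own examples (the model directory is a placeholder, and `ExLlamaV2Cache_Q4` is the Q4 cache mode mentioned above):

```python
# Minimal setup sketch for the dynamic generator (follows the pattern in the
# repo's examples; the model directory below is a placeholder).
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Tokenizer
from exllamav2 import ExLlamaV2Cache_Q4  # the Q4 cache mode referenced above
from exllamav2.generator import ExLlamaV2DynamicGenerator

config = ExLlamaV2Config("/path/to/model")     # directory with an EXL2 model
model = ExLlamaV2(config)
cache = ExLlamaV2Cache_Q4(model, lazy = True)  # quantized K/V cache
model.load_autosplit(cache)                    # split weights across GPUs
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2DynamicGenerator(
    model = model,
    cache = cache,
    tokenizer = tokenizer,
)
```

For the `asyncio` snippet, the repo's examples build the generator as `ExLlamaV2DynamicGeneratorAsync` instead and pass that to `ExLlamaV2DynamicJobAsync`.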
@@ -33,13 +73,11 @@ speeds will vary across GPUs, with slow CPUs still being a potential bottleneck:
 ## How to
 
 To install from the repo you'll need the CUDA Toolkit and either gcc on Linux or (Build Tools for) Visual Studio
-on Windows). Also make sure you have an appropriate version of [PyTorch](https://pytorch.org/get-started/locally/),
-then run:
+on Windows. Also make sure you have an appropriate version of [PyTorch](https://pytorch.org/get-started/locally/), then run:
 
 ```
 git clone https://github.com/turboderp/exllamav2
 cd exllamav2
-# Optionally, create and activate a new conda environment
 pip install -r requirements.txt
 pip install .
 
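Before running the build, it can save a failed compile to confirm that the installed PyTorch is a CUDA build that can actually see a GPU. A small sanity-check sketch, using only standard `torch` introspection (nothing exllamav2-specific):

```python
# Sanity check before building the C++/CUDA extension: PyTorch must be a
# CUDA build with a visible device, or the extension build/load will fail.
import torch

print(torch.__version__)          # e.g. "2.3.0+cu121"
print(torch.version.cuda)         # CUDA version PyTorch was compiled against
print(torch.cuda.is_available())  # should be True on a working setup
```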
@@ -50,13 +88,11 @@ python test_inference.py -m <path_to_model> -p "Once upon a time,"
 A simple console chatbot is included. Run it with:
 
 ```
-python examples/chat.py -m <path_to_model> -mode llama
-# Append the '--gpu_split auto' flag for multi-GPU inference
+python examples/chat.py -m <path_to_model> -mode llama -gs auto
 ```
 
 
-The `-mode` argument chooses the prompt format to use. `llama` is for the Llama(2)-chat finetunes, while `codellama`
-probably works better for CodeLlama-instruct. `raw` will produce a simple chatlog-style chat that works with base
+The `-mode` argument chooses the prompt format to use. `raw` will produce a simple chatlog-style chat that works with base
 models and various other finetunes. Run with `-modes` for a list of all available prompt formats. You can also provide
 a custom system prompt with `-sp`.
 
@@ -100,8 +136,11 @@ C++ extension in the process. Instead, the extension will be built the first tim
 
 ### Method 2: Install from release (with prebuilt extension)
 
-Releases are available [here](https://github.com/turboderp/exllamav2/releases), with prebuilt wheels that contain the
-extension binaries. Make sure to grab the right version, matching your platform, Python version (`cp`) and CUDA version.
+Releases are available [here](https://github.com/turboderp/exllamav2/releases), with prebuilt wheels that contain the extension binaries. Make sure to grab
+the right version, matching your platform, Python version (`cp`) and CUDA version. Crucially, you must also match
+the prebuilt wheel with your PyTorch version, since the Torch C++ extension ABI breaks with every new version of
+PyTorch.
+
 Either download an appropriate wheel or install directly from the appropriate URL:
 
 ```
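To read off the values a release wheel has to match (platform, Python `cp` tag, CUDA version and exact PyTorch version), something like this sketch works; again, none of it is exllamav2-specific:

```python
# Print the values to match against a release wheel: platform, Python tag
# (cp), the CUDA version and the exact PyTorch version.
import sys, platform
import torch

print(platform.system(), platform.machine())                  # e.g. Linux x86_64
print(f"cp{sys.version_info.major}{sys.version_info.minor}")  # e.g. cp311
print(torch.version.cuda)                                     # e.g. 12.1
print(torch.__version__)                                      # wheel must match this
```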
@@ -113,15 +152,12 @@ can also be installed this way, and it will build the extension while installing
 
 ### Method 3: Install from PyPI
 
-A PyPI package is available as well. It can be installed with:
+A PyPI package is available as well. This is the same as the JIT version (see above). It can be installed with:
 
 ```
 pip install exllamav2
 ```
 
-The version available through PyPI is the JIT version (see above). Still working on a solution for distributing
-prebuilt wheels via PyPI.
-
 
 ## EXL2 quantization
 
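Since the PyPI package is the JIT version, the C++ extension is compiled on first use rather than at install time. A quick smoke test, assuming the build is triggered on first import as with Torch JIT extensions generally:

```python
# With the JIT package, the first import compiles the C++/CUDA extension via
# torch.utils.cpp_extension, so expect a one-time delay and compiler output.
import exllamav2
print(exllamav2.__version__)
```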

doc/dynamic_gen.gif (binary file, 16.1 MB)

examples/dynamic_gen.py (+17 -17)
@@ -77,10 +77,10 @@
     "Please guess the next 20 numbers in this sequence: " + ", ".join(str(n) for n in range(700)),
     "Write a short essay about cell membranes.",
     "What's up?",
-    "How do I open a can of beans?",
-    "How do I open a can of soup?",
-    "How do I open a can of strawberry jam?",
-    "How do I open a can of raspberry jam?",
+    # "How do I open a can of beans?",
+    # "How do I open a can of soup?",
+    # "How do I open a can of strawberry jam?",
+    # "How do I open a can of raspberry jam?",
     "What's the tallest building in Paris?",
     "What's the most populous nation on Earth?",
     "What's the most populous nation on Mars?",
@@ -90,25 +90,25 @@
     "Who is Waldo?",
     "Why is Waldo?",
     "Is it legal to base jump off the Eiffel Tower?",
-    "Is it legal to base jump into a volcano?",
-    "Why are cats better than dogs?",
+    # "Is it legal to base jump into a volcano?",
+    # "Why are cats better than dogs?",
     "Why is the Hulk so angry all the time?",
     "How do I build a time machine?",
     "What seems out of place in this sequence: " + ", ".join(str(n if n != 123 else 69) for n in range(200)),
     "Is it legal to grow your own catnip?",
     "What seems out of place in this sequence: " + ", ".join(str(n if n != 160 else 420) for n in range(400)),
     "What seems out of place in this sequence: " + ", ".join(str(n if n != 161 else 421) for n in range(400)),
-    "What's inside a black hole?",
-    "What do the numbers 2, 4, 8, 16, 32 and 64 have in common?",
-    "What do the numbers 2, 3, 5, 7, 11 and 13 have in common?",
-    "Is there life on Mars?",
-    "Hello!",
-    "Hi!",
-    "Boop!",
-    "Why are cats better than dogs?",
-    "Why are cats better than dogs?",
-    "Why are cats better than dogs?",
-    "Why are cats better than dogs?",
+    # "What's inside a black hole?",
+    # "What do the numbers 2, 4, 8, 16, 32 and 64 have in common?",
+    # "What do the numbers 2, 3, 5, 7, 11 and 13 have in common?",
+    # "Is there life on Mars?",
+    # "Hello!",
+    # "Hi!",
+    # "Boop!",
+    # "Why are cats better than dogs?",
+    # "Why are cats better than dogs?",
+    # "Why are cats better than dogs?",
+    # "Why are cats better than dogs?",
 ]
 
 term = Terminal()
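For context, the `prompts` list being trimmed above feeds the dynamic-batching demo, which queues each prompt as a separate job. An equivalent minimal sketch using the batched `generate` API from the README diff above (assuming a `generator` built as in the setup sketch there):

```python
# Sketch: run the remaining (uncommented) prompts as one batched request.
# The dynamic generator schedules them together and deduplicates shared
# K/V cache prefixes.
outputs = generator.generate(prompt = prompts, max_new_tokens = 200)
for output in outputs:
    print(output)
```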
