Commit e6f230b: Update README.md
1 parent f6e8495

3 files changed (+72 -36 lines)

README.md (+55 -19)
@@ -3,17 +3,57 @@
 ExLlamaV2 is an inference library for running local LLMs on modern consumer GPUs.
 
 
-## Overview of differences compared to V1
+## New in v0.1.0:
+
+- ExLlamaV2 now supports paged attention via [Flash Attention](https://github.com/Dao-AILab/flash-attention) 2.5.7+
+- New generator with dynamic batching, smart prompt caching, K/V cache deduplication and a simplified API
+
+![alt_text](doc/dynamic_gen.gif)
+
+## Dynamic generator examples
+
+The dynamic generator supports all inference, sampling and speculative decoding features of the previous two
+generators, consolidated into one API (with the exception of the FP8 cache; the Q4 cache mode is supported and
+performs better anyway, see [here](doc/qcache_eval.md)).
+
+- Single generation:
+```python
+output = generator.generate(prompt = "Hello, my name is", max_new_tokens = 200)
+```
+- Batched generation:
+```python
+outputs = generator.generate(
+    prompt = [
+        "Hello, my name is",
+        "Once upon a time,",
+        "Large language models are",
+    ],
+    max_new_tokens = 200
+)
+```
+- Streamed generation with `asyncio`:
+```python
+job = ExLlamaV2DynamicJobAsync(
+    generator,
+    input_ids = tokenizer.encode("You can lead a horse to water"),
+    banned_strings = ["make it drink"],
+    gen_settings = ExLlamaV2Sampler.Settings.greedy(),
+    max_new_tokens = 200
+)
+async for result in job:
+    text = result.get("text", "")
+    print(text, end = "")
+```
+See the full, updated examples [here](https://github.com/turboderp/exllamav2/tree/master/examples).
+
+
 
-- Faster, better kernels
-- Cleaner and more versatile codebase
-- Support for a new quant format (see below)
 
 
 ## Performance
 
-Some quick tests to compare performance with V1. There may be more performance optimizations in the future, and
-speeds will vary across GPUs, with slow CPUs still being a potential bottleneck:
+Some quick tests to compare performance with ExLlama V1. There may be more performance optimizations in the future,
+and speeds will vary across GPUs, with slow CPUs still being a potential bottleneck:
 
 | Model      | Mode         | Size  | grpsz | act | 3090Ti  | 4090        |
 |------------|--------------|-------|-------|-----|---------|-------------|
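The `generate()` snippets added in this hunk assume an already-built `generator` and `tokenizer`. For context, a minimal setup sketch based on the repo's own examples (the model directory is a placeholder, and `ExLlamaV2Cache_Q4` is the Q4 cache mode mentioned above):

```python
# Minimal setup sketch for the dynamic generator (follows the pattern in the
# repo's examples; the model directory below is a placeholder).
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Tokenizer
from exllamav2 import ExLlamaV2Cache_Q4  # the Q4 cache mode referenced above
from exllamav2.generator import ExLlamaV2DynamicGenerator

config = ExLlamaV2Config("/path/to/model")     # directory with an EXL2 model
model = ExLlamaV2(config)
cache = ExLlamaV2Cache_Q4(model, lazy = True)  # quantized K/V cache
model.load_autosplit(cache)                    # split weights across GPUs
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2DynamicGenerator(
    model = model,
    cache = cache,
    tokenizer = tokenizer,
)
```

For the `asyncio` snippet, the repo's examples build the generator as `ExLlamaV2DynamicGeneratorAsync` instead and pass that to `ExLlamaV2DynamicJobAsync`.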
@@ -33,13 +73,11 @@ speeds will vary across GPUs, with slow CPUs still being a potential bottleneck:
 ## How to
 
 To install from the repo you'll need the CUDA Toolkit and either gcc on Linux or (Build Tools for) Visual Studio
-on Windows). Also make sure you have an appropriate version of [PyTorch](https://pytorch.org/get-started/locally/),
-then run:
+on Windows. Also make sure you have an appropriate version of [PyTorch](https://pytorch.org/get-started/locally/), then run:
 
 ```
 git clone https://github.com/turboderp/exllamav2
 cd exllamav2
-# Optionally, create and activate a new conda environment
 pip install -r requirements.txt
 pip install .
 
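Before running the build, it can save a failed compile to confirm that the installed PyTorch is a CUDA build that can actually see a GPU. A small sanity-check sketch, using only standard `torch` introspection (nothing exllamav2-specific):

```python
# Sanity check before building the C++/CUDA extension: PyTorch must be a
# CUDA build with a visible device, or the extension build/load will fail.
import torch

print(torch.__version__)          # e.g. "2.3.0+cu121"
print(torch.version.cuda)         # CUDA version PyTorch was compiled against
print(torch.cuda.is_available())  # should be True on a working setup
```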
@@ -50,13 +88,11 @@ python test_inference.py -m <path_to_model> -p "Once upon a time,"
 A simple console chatbot is included. Run it with:
 
 ```
-python examples/chat.py -m <path_to_model> -mode llama
-# Append the '--gpu_split auto' flag for multi-GPU inference
+python examples/chat.py -m <path_to_model> -mode llama -gs auto
 ```
 
 
-The `-mode` argument chooses the prompt format to use. `llama` is for the Llama(2)-chat finetunes, while `codellama`
-probably works better for CodeLlama-instruct. `raw` will produce a simple chatlog-style chat that works with base
+The `-mode` argument chooses the prompt format to use. `raw` will produce a simple chatlog-style chat that works with base
 models and various other finetunes. Run with `-modes` for a list of all available prompt formats. You can also provide
 a custom system prompt with `-sp`.
 
@@ -100,8 +136,11 @@ C++ extension in the process. Instead, the extension will be built the first tim
 
 ### Method 2: Install from release (with prebuilt extension)
 
-Releases are available [here](https://github.com/turboderp/exllamav2/releases), with prebuilt wheels that contain the
-extension binaries. Make sure to grab the right version, matching your platform, Python version (`cp`) and CUDA version.
+Releases are available [here](https://github.com/turboderp/exllamav2/releases), with prebuilt wheels that contain the extension binaries. Make sure to grab
+the right version, matching your platform, Python version (`cp`) and CUDA version. Crucially, you must also match
+the prebuilt wheel with your PyTorch version, since the Torch C++ extension ABI breaks with every new version of
+PyTorch.
+
 Either download an appropriate wheel or install directly from the appropriate URL:
 
 ```
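To read off the values a release wheel has to match (platform, Python `cp` tag, CUDA version and exact PyTorch version), something like this sketch works; again, none of it is exllamav2-specific:

```python
# Print the values to match against a release wheel: platform, Python tag
# (cp), the CUDA version and the exact PyTorch version.
import sys, platform
import torch

print(platform.system(), platform.machine())                  # e.g. Linux x86_64
print(f"cp{sys.version_info.major}{sys.version_info.minor}")  # e.g. cp311
print(torch.version.cuda)                                     # e.g. 12.1
print(torch.__version__)                                      # wheel must match this
```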
@@ -113,15 +152,12 @@ can also be installed this way, and it will build the extension while installing
 
 ### Method 3: Install from PyPI
 
-A PyPI package is available as well. It can be installed with:
+A PyPI package is available as well. This is the same as the JIT version (see above). It can be installed with:
 
 ```
 pip install exllamav2
 ```
 
-The version available through PyPI is the JIT version (see above). Still working on a solution for distributing
-prebuilt wheels via PyPI.
-
 
 ## EXL2 quantization
 
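Since the PyPI package is the JIT version, the C++ extension is compiled on first use rather than at install time. A quick smoke test, assuming the build is triggered on first import as with Torch JIT extensions generally:

```python
# With the JIT package, the first import compiles the C++/CUDA extension via
# torch.utils.cpp_extension, so expect a one-time delay and compiler output.
import exllamav2
print(exllamav2.__version__)
```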

doc/dynamic_gen.gif (binary file, 16.1 MB)

examples/dynamic_gen.py (+17 -17)
@@ -77,10 +77,10 @@
     "Please guess the next 20 numbers in this sequence: " + ", ".join(str(n) for n in range(700)),
     "Write a short essay about cell membranes.",
     "What's up?",
-    "How do I open a can of beans?",
-    "How do I open a can of soup?",
-    "How do I open a can of strawberry jam?",
-    "How do I open a can of raspberry jam?",
+    # "How do I open a can of beans?",
+    # "How do I open a can of soup?",
+    # "How do I open a can of strawberry jam?",
+    # "How do I open a can of raspberry jam?",
     "What's the tallest building in Paris?",
     "What's the most populous nation on Earth?",
     "What's the most populous nation on Mars?",
@@ -90,25 +90,25 @@
     "Who is Waldo?",
     "Why is Waldo?",
     "Is it legal to base jump off the Eiffel Tower?",
-    "Is it legal to base jump into a volcano?",
-    "Why are cats better than dogs?",
+    # "Is it legal to base jump into a volcano?",
+    # "Why are cats better than dogs?",
     "Why is the Hulk so angry all the time?",
     "How do I build a time machine?",
     "What seems out of place in this sequence: " + ", ".join(str(n if n != 123 else 69) for n in range(200)),
     "Is it legal to grow your own catnip?",
     "What seems out of place in this sequence: " + ", ".join(str(n if n != 160 else 420) for n in range(400)),
     "What seems out of place in this sequence: " + ", ".join(str(n if n != 161 else 421) for n in range(400)),
-    "What's inside a black hole?",
-    "What do the numbers 2, 4, 8, 16, 32 and 64 have in common?",
-    "What do the numbers 2, 3, 5, 7, 11 and 13 have in common?",
-    "Is there life on Mars?",
-    "Hello!",
-    "Hi!",
-    "Boop!",
-    "Why are cats better than dogs?",
-    "Why are cats better than dogs?",
-    "Why are cats better than dogs?",
-    "Why are cats better than dogs?",
+    # "What's inside a black hole?",
+    # "What do the numbers 2, 4, 8, 16, 32 and 64 have in common?",
+    # "What do the numbers 2, 3, 5, 7, 11 and 13 have in common?",
+    # "Is there life on Mars?",
+    # "Hello!",
+    # "Hi!",
+    # "Boop!",
+    # "Why are cats better than dogs?",
+    # "Why are cats better than dogs?",
+    # "Why are cats better than dogs?",
+    # "Why are cats better than dogs?",
 ]
 
 term = Terminal()
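For context, the `prompts` list being trimmed above feeds the dynamic-batching demo, which queues each prompt as a separate job. An equivalent minimal sketch using the batched `generate` API from the README diff above (assuming a `generator` built as in the setup sketch there):

```python
# Sketch: run the remaining (uncommented) prompts as one batched request.
# The dynamic generator schedules them together and deduplicates shared
# K/V cache prefixes.
outputs = generator.generate(prompt = prompts, max_new_tokens = 200)
for output in outputs:
    print(output)
```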
