
Commit ae8322d

Merge remote-tracking branch 'turboderp/master'
2 parents: 8eb2694 + ee0e84b


96 files changed: +7444 −1595 lines

README.md (+55 −19)

@@ -3,17 +3,57 @@
 ExLlamaV2 is an inference library for running local LLMs on modern consumer GPUs.
 
 
-## Overview of differences compared to V1
+## New in v0.1.0:
+
+- ExLlamaV2 now supports paged attention via [Flash Attention](https://github.com/Dao-AILab/flash-attention) 2.5.7+
+- New generator with dynamic batching, smart prompt caching, K/V cache deduplication and simplified API
+
+![alt_text](doc/dynamic_gen.gif)
+
+## Dynamic generator examples
+
+The dynamic generator supports all inference, sampling and speculative decoding features of the previous two
+generators, consolidated into one API (with the exception of FP8 cache, though the Q4 cache mode is supported and
+performs better anyway, see [here](doc/qcache_eval.md).)
+
+- Single generation:
+```python
+output = generator.generate(prompt = "Hello, my name is", max_new_tokens = 200)
+```
+- Batched generation:
+```python
+outputs = generator.generate(
+    prompt = [
+        "Hello, my name is",
+        "Once upon a time,",
+        "Large language models are",
+    ],
+    max_new_tokens = 200
+)
+```
+- Streamed generation with `asyncio`:
+```python
+job = ExLlamaV2DynamicJobAsync(
+    generator,
+    input_ids = tokenizer.encode("You can lead a horse to water"),
+    banned_strings = ["make it drink"],
+    gen_settings = ExLlamaV2Sampler.Settings.greedy(),
+    max_new_tokens = 200
+)
+async for result in job:
+    text = result.get("text", "")
+    print(text, end = "")
+```
+See the full, updated examples [here](https://github.com/turboderp/exllamav2/tree/master/examples).
+
 
-- Faster, better kernels
-- Cleaner and more versatile codebase
-- Support for a new quant format (see below)
 
 
 ## Performance
 
-Some quick tests to compare performance with V1. There may be more performance optimizations in the future, and
-speeds will vary across GPUs, with slow CPUs still being a potential bottleneck:
+Some quick tests to compare performance with ExLlama V1. There may be more performance optimizations in the future,
+and speeds will vary across GPUs, with slow CPUs still being a potential bottleneck:
 
 | Model      | Mode         | Size  | grpsz | act | 3090Ti  | 4090        |
 |------------|--------------|-------|-------|-----|---------|-------------|
@@ -33,13 +73,11 @@ speeds will vary across GPUs, with slow CPUs still being a potential bottleneck:
 ## How to
 
 To install from the repo you'll need the CUDA Toolkit and either gcc on Linux or (Build Tools for) Visual Studio
-on Windows). Also make sure you have an appropriate version of [PyTorch](https://pytorch.org/get-started/locally/),
-then run:
+on Windows). Also make sure you have an appropriate version of [PyTorch](https://pytorch.org/get-started/locally/), then run:
 
 ```
 git clone https://github.com/turboderp/exllamav2
 cd exllamav2
-# Optionally, create and activate a new conda environment
 pip install -r requirements.txt
 pip install .
 
@@ -50,13 +88,11 @@ python test_inference.py -m <path_to_model> -p "Once upon a time,"
 A simple console chatbot is included. Run it with:
 
 ```
-python examples/chat.py -m <path_to_model> -mode llama
-# Append the '--gpu_split auto' flag for multi-GPU inference
+python examples/chat.py -m <path_to_model> -mode llama -gs auto
 ```
 
 
-The `-mode` argument chooses the prompt format to use. `llama` is for the Llama(2)-chat finetunes, while `codellama`
-probably works better for CodeLlama-instruct. `raw` will produce a simple chatlog-style chat that works with base
+The `-mode` argument chooses the prompt format to use. `raw` will produce a simple chatlog-style chat that works with base
 models and various other finetunes. Run with `-modes` for a list of all available prompt formats. You can also provide
 a custom system prompt with `-sp`.
 
@@ -100,8 +136,11 @@ C++ extension in the process. Instead, the extension will be built the first tim
 
 ### Method 2: Install from release (with prebuilt extension)
 
-Releases are available [here](https://github.com/turboderp/exllamav2/releases), with prebuilt wheels that contain the
-extension binaries. Make sure to grab the right version, matching your platform, Python version (`cp`) and CUDA version.
+Releases are available [here](https://github.com/turboderp/exllamav2/releases), with prebuilt wheels that contain the extension binaries. Make sure to grab
+the right version, matching your platform, Python version (`cp`) and CUDA version. Crucially, you must also match
+the prebuilt wheel with your PyTorch version, since the Torch C++ extension ABI breaks with every new version of
+PyTorch.
+
 Either download an appropriate wheel or install directly from the appropriate URL:
 
 ```
@@ -113,15 +152,12 @@ can also be installed this way, and it will build the extension while installing
 
 ### Method 3: Install from PyPI
 
-A PyPI package is available as well. It can be installed with:
+A PyPI package is available as well. This is the same as the JIT version (see above). It can be installed with:
 
 ```
 pip install exllamav2
 ```
 
-The version available through PyPI is the JIT version (see above). Still working on a solution for distributing
-prebuilt wheels via PyPI.
-
 
 ## EXL2 quantization
 
conversion/adaptivegptq.py (−1)

@@ -631,7 +631,6 @@ def pack(self, key, qparams):
 
         qst_packed = torch.zeros((qst.shape[0], qst.shape[1] * qparams.scale_bits // 32), dtype = torch.int32, device = self.device)
         if qparams.scale_bits == 4: ext_c.pack_rows_4(qst, qst_packed)
-        # if qparams.scale_bits == 6: ext_c.pack_rows_6(qst, qst_packed) # TODO:
         output[key + ".q_scale"] = qst_packed
 
         qwt_packed = []

conversion/bot_status.py (+17)

@@ -0,0 +1,17 @@
+
+import json
+
+def print_stage(
+    job: dict,
+    stage: str,
+    progress: int,
+    max_progress: int,
+):
+    if not job["status_output"]: return
+
+    status = {
+        "stage": stage,
+        "completion": round(progress / max_progress, 4)
+    }
+
+    print("[STATUS]" + json.dumps(status) + "[/STATUS]")
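
Editor's note: the new `print_stage` helper wraps a small JSON payload in `[STATUS]...[/STATUS]` markers on stdout so that a supervising process can track conversion progress. A minimal, hypothetical consumer (not part of this commit; the `convert.py` invocation below is a placeholder) could parse those markers like this:

```python
# Hypothetical consumer sketch: parse [STATUS]...[/STATUS] lines emitted by print_stage.
import json, re, subprocess

STATUS_RE = re.compile(r"\[STATUS\](\{.*?\})\[/STATUS\]")

# Placeholder command: however convert.py is launched with status output enabled.
cmd = ["python", "convert.py", "-i", "/path/to/model", "-o", "/path/to/workdir"]
proc = subprocess.Popen(cmd, stdout = subprocess.PIPE, text = True)

for line in proc.stdout:
    m = STATUS_RE.search(line)
    if m:
        status = json.loads(m.group(1))
        # e.g. {"stage": "Measuring", "completion": 0.3125}
        print(f"{status['stage']}: {status['completion'] * 100:.1f}%")
```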

conversion/compile.py (+10)

@@ -1,6 +1,7 @@
 from exllamav2.model import \
 (
     ExLlamaV2Embedding,
+    ExLlamaV2PosEmbedding,
     ExLlamaV2Attention,
     ExLlamaV2MLP,
     ExLlamaV2MoEMLP,
@@ -16,6 +17,7 @@
 import os, glob, shutil, json
 from safetensors import safe_open
 from safetensors.torch import save_file
+from conversion.bot_status import print_stage
 
 def _tsize(t):
 
@@ -69,6 +71,10 @@ def compile_model(job, save_fn, model):
 
             d = get_f_module(job, module); out_dict.update(d); current_size += _dsize(d)
 
+        if isinstance(module, ExLlamaV2PosEmbedding):
+
+            d = get_f_module(job, module); out_dict.update(d); current_size += _dsize(d)
+
         if isinstance(module, ExLlamaV2Attention):
 
             d = get_f_module(job, module.input_layernorm); out_dict.update(d); current_size += _dsize(d)
@@ -126,6 +132,8 @@ def compile_model(job, save_fn, model):
 
         if current_size > shard_bytes or index == len(model.modules):
 
+            print_stage(job, "Compiling", index, len(model.modules))
+
             save_dict = {}
             dont_save_dict = {}
             this_shard_size = 0
@@ -237,3 +245,5 @@ def compile_model(job, save_fn, model):
 
     with open(config_json, "w") as f:
         f.write(json.dumps(config_dict, indent = 4))
+
+    print_stage(job, "Compiling", len(model.modules), len(model.modules))

conversion/measure.py (+30 −12)

@@ -1,6 +1,7 @@
 from exllamav2.model import \
 (
     ExLlamaV2Embedding,
+    ExLlamaV2PosEmbedding,
     ExLlamaV2Attention,
     ExLlamaV2MLP,
     ExLlamaV2MoEMLP,
@@ -19,6 +20,7 @@
 import os, time, math, json
 import torch.nn.functional as F
 import gc
+from conversion.bot_status import print_stage
 
 # graceful exiting
 import signal
@@ -68,6 +70,8 @@ def list_live_tensors():
 
 def embeddings(job, save_fn, model, measure = False):
 
+    print_stage(job, "Embeddings", 0, 1)
+
     module = model.modules[0]
     assert isinstance(module, ExLlamaV2Embedding)
 
@@ -82,6 +86,8 @@ def embeddings(job, save_fn, model, measure = False):
     embeddings_dict = { f"row.{i:05}": hidden_state[i:i+1, :, :].contiguous() for i in range(hidden_state.shape[0]) }
     save_file(embeddings_dict, os.path.join(job["out_dir"], "hidden_states.safetensors"))
 
+    print_stage(job, "Embeddings", 1, 1)
+
 
 # Test quantization options
 
@@ -119,18 +125,18 @@ def test_quant(source: ExLlamaV2Linear,
 
 def test_error(module, hidden_states, target_states, cache, attn_params):
 
-    rfn_sum = 0
+    rfn_sum = torch.tensor(0.0).cuda()
     rfn_count = 0
     for x, xref in zip(hidden_states, target_states):
         x = x.cuda()
         xref = xref.cuda()
         xtest = module.forward(x, cache, attn_params)
         xtest = xtest[0].float()
         xref = xref[0].float()
-        rfn_sum += (torch.linalg.norm(xtest - xref, 'fro') / torch.linalg.norm(xref, 'fro')).item()
+        rfn_sum += torch.linalg.norm(xtest - xref, 'fro') / torch.linalg.norm(xref, 'fro')
         rfn_count += 1
 
-    return max(1e-6, 1 - (rfn_sum / rfn_count))
+    return max(1e-6, 1 - (rfn_sum.item() / rfn_count))
 
 
 def measure_attn(module, hidden_states, target_states, quantizers, cache, attn_params, keep_q = False):
@@ -376,7 +382,7 @@ def print_status_box(*content_lines):
     print('-' * box_width)
 
 @torch.inference_mode()
-def measure_quant(job, save_fn, model):
+def measure_quant(job, save_fn, model, hidden_state_offload_layers):
 
     # vars for status box
     time_spent_list = []
@@ -412,12 +418,15 @@ def measure_quant(job, save_fn, model):
 
     hidden_states = []
    with safe_open(states_filename, framework = "pt", device = "cpu") as f:
-        for k in sorted(f.keys()):
-            hidden_states.append(f.get_tensor(k))
+        for i, k in enumerate(sorted(f.keys())):
+            t = f.get_tensor(k)
+            hidden_states.append(t.to("cuda:0") if i < hidden_state_offload_layers else t)
 
     index = job["last_module_idx"]
     while True:
 
+        print_stage(job, "Measuring", index, len(model.modules))
+
         # sig handler should catch it faster in most cases
         if interrupted:
             print("Measurement process was interrupted. Please decide:")
@@ -487,6 +496,9 @@ def measure_quant(job, save_fn, model):
         elif isinstance(module, ExLlamaV2RMSNorm) or isinstance(module, ExLlamaV2LayerNorm):
             mode = "norm"
 
+        elif isinstance(module, ExLlamaV2PosEmbedding):
+            mode = "pos_emb"
+
         # Reference forward pass
 
         cache = None
@@ -504,18 +516,19 @@
 
             x = hidden_states[i].to("cuda:0")
             outputs = module.forward(x, cache, attn_params, intermediates = True)
+            target_device = "cuda:0" if i < hidden_state_offload_layers else "cpu"
 
             # Hessians
 
             if mode == "self_attn":
                 quantizers["q_proj"].add_batch(outputs["post_norm"]) # Reuse H for K and V
                 quantizers["o_proj"].add_batch(outputs["attn_output"])
-                target_states.append(outputs["hidden_states"].to("cpu"))
+                target_states.append(outputs["hidden_states"].to(target_device))
 
             if mode == "mlp":
                 quantizers["up_proj"].add_batch(outputs["post_norm"]) # Reuse H for gate_proj
                 quantizers["down_proj"].add_batch(outputs["pre_down"])
-                target_states.append(outputs["hidden_states"].to("cpu"))
+                target_states.append(outputs["hidden_states"].to(target_device))
 
             if mode == "block_sparse_moe":
                 for j in range(model.config.num_experts):
@@ -526,16 +539,19 @@
                             uncalibrated_experts[j] += 1
                     else:
                         uncalibrated_experts[j] += 1
-                target_states.append(outputs["hidden_states"].to("cpu"))
+                target_states.append(outputs["hidden_states"].to(target_device))
 
             if mode == "parallel_decoder":
                 quantizers["q_proj"].add_batch(outputs["post_norm"]) # Reuse H for K, V, up_proj and gate_proj
                 quantizers["o_proj"].add_batch(outputs["attn_output"])
                 quantizers["down_proj"].add_batch(outputs["pre_down"])
                 hidden_states[i] = outputs["post_norm"]
-                target_states_attn.append(outputs["hidden_states_attn"].to("cpu"))
-                target_states_mlp.append(outputs["hidden_states_mlp"].to("cpu"))
-                target_states.append(outputs["hidden_states"].to("cpu"))
+                target_states_attn.append(outputs["hidden_states_attn"].to(target_device))
+                target_states_mlp.append(outputs["hidden_states_mlp"].to(target_device))
+                target_states.append(outputs["hidden_states"].to(target_device))
+
+            if mode == "pos_emb":
+                target_states.append(outputs["hidden_states"].to(target_device))
 
             # For MoE layers, warn if any layer received less than 10% of a calibration batch
 
@@ -647,6 +663,8 @@
 
             last_snapshot_time = time.time()
 
+    print_stage(job, "Measuring", len(model.modules), len(model.modules))
+
     # Export measurement
 
     exp_measurement = { "measurement": job["measurement"],
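
Editor's note: two of the measure.py changes above share one goal, cutting host-device traffic during measurement. `test_error` now accumulates the relative Frobenius-norm error in a GPU tensor and synchronizes once via `.item()` at the end instead of once per layer, and the new `hidden_state_offload_layers` argument keeps the first N hidden states resident on `cuda:0` rather than round-tripping them through system RAM. A standalone, illustrative sketch of the deferred-sync accumulation pattern (not the project's code) is shown below:

```python
# Illustrative sketch of the deferred-sync accumulation used in test_error:
# summing on the GPU and calling .item() once avoids a host sync per iteration.
import torch

def mean_relative_error(pairs):
    # pairs: iterable of (test, reference) 2-D activation matrices already on the GPU
    err_sum = torch.tensor(0.0, device = "cuda")
    count = 0
    for xtest, xref in pairs:
        # relative Frobenius-norm error, kept on-device
        err_sum += torch.linalg.norm(xtest - xref, "fro") / torch.linalg.norm(xref, "fro")
        count += 1
    return err_sum.item() / count  # single device-to-host sync here
```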
