Summary:
Addresses feedback on the `quantize` API from #391 (comment).

`quantize` modifies the model in place, so we rename it to `quantize_` to reflect that, following PyTorch's trailing-underscore convention for in-place operations. In-place quantization is especially important for LLMs, where the full-precision model can be too large to fit in memory: we typically instantiate the model on the meta device and then load the quantized weights into it.
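For illustration, a minimal sketch of the renamed API on a toy module. `ToyModel` and the checkpoint-loading comment are hypothetical; `quantize_` and `int4_weight_only` are the APIs this PR touches.

```python
import torch
import torch.nn as nn
from torchao.quantization.quant_api import quantize_, int4_weight_only

# Hypothetical stand-in for a large model; any module with nn.Linear works.
class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(1024, 1024)

    def forward(self, x):
        return self.linear(x)

# int4 weight-only kernels generally expect bfloat16 weights on a CUDA device.
model = ToyModel().to(torch.bfloat16).cuda()

# New API: mutates `model` in place and returns None, unlike the old
# `m = quantize(m, ...)`, which returned the model.
quantize_(model, int4_weight_only())

# For models too large to materialize, the in-place API supports building the
# model under torch.device("meta") and then loading pre-quantized weights,
# e.g. model.load_state_dict(sd, assign=True)  # sketch only; sd is hypothetical
```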
Test Plan:
python test/quantization/test_quant_api.py
python test/integration/test_integration.py
Reviewers:
Subscribers:
Tasks:
Tags:
README.md
+3 −3
@@ -19,8 +19,8 @@ All with no intrusive code changes and minimal accuracy degradation.
 Quantizing your models is a 1 liner that should work on any model with an `nn.Linear`, including your favorite HuggingFace model. You can find more comprehensive usage instructions [here](torchao/quantization/) and a HuggingFace inference example [here](scripts/hf_eval.py)
 
 ```python
-from torchao.quantization.quant_api import quantize, int4_weight_only
-m = quantize(m, int4_weight_only())
+from torchao.quantization.quant_api import quantize_, int4_weight_only
+quantize_(m, int4_weight_only())
 ```
 
 Benchmarks are run on a machine with a single A100 GPU using the script in `_models/llama`, which generates text in a latency-optimized way (batchsize=1)
...
 * [MX](torchao/prototype/mx_formats) implementing training and inference support with tensors using the [OCP MX spec](https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf) data types, which can be described as groupwise scaled float8/float6/float4/int8, with the scales being constrained to powers of two. This work is a prototype as the hardware support is not available yet.
 * [nf4](torchao/dtypes/nf4tensor.py) which was used to [implement QLoRA](https://github.com/pytorch/torchtune/blob/main/docs/source/tutorials/qlora_finetune.rst), one of the most popular finetuning algorithms, without writing custom Triton or CUDA code. Accessible talk [here](https://x.com/HamelHusain/status/1800315287574847701)
-* [fp6](torchao/prototype/quant_llm/) for 2x faster inference over fp16 with an easy-to-use API `quantize(model, fp6_llm_weight_only())`
+* [fp6](torchao/prototype/quant_llm/) for 2x faster inference over fp16 with an easy-to-use API `quantize_(model, fp6_llm_weight_only())`