
Commit 1a1cfe3

fix INT8 prepare function (huggingface#389)
* fix INT8 prepare function
* remove unused function args
* fix related tests, examples and docs
1 parent 8e53e16 commit 1a1cfe3

7 files changed: +16 -36 lines

docs/source/task_guides/int8-asr.mdx (+2 -3)

@@ -178,15 +178,14 @@ model.config.suppress_tokens = []
 
 To get the model ready for `int8` quantization, use the utility function [`prepare_model_for_int8_training`](https://github.com/huggingface/peft/blob/34027fe813756897767b9a6f19ae7f1c4c7b418c/src/peft/utils/other.py#L35) to handle the following:
 
-- casts the `LayerNorm` to full precision (`fp32`) for stability
+- casts all the non `int8` modules to full precision (`fp32`) for stability
 - adds a forward hook to the input embedding layer to calculate the gradients of the input hidden states
 - enables gradient checkpointing for more memory-efficient training
-- casts the output logits to `fp32` for smoother sampling
 
 ```py
 from peft import prepare_model_for_int8_training
 
-model = prepare_model_for_int8_training(model, output_embedding_layer_name="proj_out")
+model = prepare_model_for_int8_training(model)
 ```
 
 Let's also apply LoRA to the training to make it even more efficient. Load a [`~peft.LoraConfig`] and configure the following parameters:
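To make the updated call concrete, here is a minimal end-to-end sketch in the spirit of this task guide. The checkpoint name and the `load_in_8bit`/`device_map` loading flags are illustrative assumptions, and the LoRA hyperparameters simply mirror the updated GPU test in this commit; none of this is part of the diff itself.

```py
# Hypothetical sketch: 8-bit load -> prepare -> LoRA (assumed checkpoint and loading flags).
from transformers import WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training

model = WhisperForConditionalGeneration.from_pretrained(
    "openai/whisper-large-v2", load_in_8bit=True, device_map="auto"
)

# updated signature: no output_embedding_layer_name argument anymore
model = prepare_model_for_int8_training(model)

# LoRA hyperparameters mirroring tests/test_gpu_examples.py in this commit
config = LoraConfig(r=32, lora_alpha=64, target_modules=["q_proj", "v_proj"], lora_dropout=0.05, bias="none")
model = get_peft_model(model, config)
model.print_trainable_parameters()
```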

examples/int8_training/Finetune_flan_t5_large_bnb_peft.ipynb (+2 -3)

@@ -328,10 +328,9 @@
 },
 "source": [
 "Some pre-processing needs to be done before training such an int8 model using `peft`, therefore let's import a utility function `prepare_model_for_int8_training` that will: \n",
-"- Cast the layer norm in `float32` for stability purposes\n",
+"- Cast all the non `int8` modules to full precision (`fp32`) for stability\n",
 "- Add a `forward_hook` to the input embedding layer to enable gradient computation of the input hidden states\n",
-"- Enable gradient checkpointing for more memory-efficient training\n",
-"- Cast the output logits in `float32` for smoother sampling during the sampling procedure"
+"- Enable gradient checkpointing for more memory-efficient training"
 ]
 },
 {
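As a sanity check on the behavior described in that cell, a hedged inspection sketch; it assumes `model` was loaded with `load_in_8bit=True` and has already passed through `prepare_model_for_int8_training`, and the assertions are illustrative rather than part of the notebook.

```py
import torch

# After prepare_model_for_int8_training(model):
# - every parameter is frozen,
# - fp16/bf16 parameters have been upcast to fp32,
# - the quantized int8 linear weights themselves are left untouched.
for name, param in model.named_parameters():
    assert not param.requires_grad, f"{name} should be frozen"
    assert param.dtype not in (torch.float16, torch.bfloat16), f"{name} should have been upcast to fp32"
```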

examples/int8_training/Finetune_opt_bnb_peft.ipynb (+2 -3)

@@ -377,10 +377,9 @@
 "### Prepare model for training\n",
 "\n",
 "Some pre-processing needs to be done before training such an int8 model using `peft`, therefore let's import a utility function `prepare_model_for_int8_training` that will: \n",
-"- Cast the layer norm in `float32` for stability purposes\n",
+"- Cast all the non `int8` modules to full precision (`fp32`) for stability\n",
 "- Add a `forward_hook` to the input embedding layer to enable gradient computation of the input hidden states\n",
-"- Enable gradient checkpointing for more memory-efficient training\n",
-"- Cast the output logits in `float32` for smoother sampling during the sampling procedure"
+"- Enable gradient checkpointing for more memory-efficient training"
 ]
 },
 {

examples/int8_training/peft_adalora_whisper_large_training.py (+1 -1)

@@ -561,7 +561,7 @@ def main():
     if args.use_peft:
         from peft import prepare_model_for_int8_training
 
-        model = prepare_model_for_int8_training(model, output_embedding_layer_name="proj_out")
+        model = prepare_model_for_int8_training(model)
 
         # as Whisper model uses Conv layer in encoder, checkpointing disables grad computation
         # to avoid this, make the inputs trainable
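The "make the inputs trainable" comment above usually resolves to a forward hook like the sketch below. The hooked module (`model.model.encoder.conv1`) is an assumption based on the Whisper encoder layout and is not something this diff touches; the same hook pattern appears in the `prepare_model_for_int8_training` hunk further down.

```py
def make_inputs_require_grad(module, input, output):
    # force the conv output to carry gradients so gradient checkpointing
    # can backpropagate through the otherwise frozen encoder input
    output.requires_grad_(True)

# assumed hook target for Whisper; adapt for other architectures
model.model.encoder.conv1.register_forward_hook(make_inputs_require_grad)
```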

examples/int8_training/peft_bnb_whisper_large_v2_training.ipynb (+3 -2)

@@ -1133,6 +1133,7 @@
 ]
 },
 {
+"attachments": {},
 "cell_type": "markdown",
 "id": "bR-_yaEOPsfQ",
 "metadata": {
@@ -1141,7 +1142,7 @@
 "source": [
 "### Post-processing on the model\n",
 "\n",
-"Finally, we need to apply some post-processing on the 8-bit model to enable training, let's freeze all our layers, and cast the layer-norm in `float32` for stability. We also cast the output of the last layer in `float32` for the same reasons."
+"Finally, we need to apply some post-processing on the 8-bit model to enable training: let's freeze all our layers and cast all non `int8` layers in `float32` for stability."
 ]
 },
 {
@@ -1155,7 +1156,7 @@
 "source": [
 "from peft import prepare_model_for_int8_training\n",
 "\n",
-"model = prepare_model_for_int8_training(model, output_embedding_layer_name=\"proj_out\")"
+"model = prepare_model_for_int8_training(model)"
 ]
 },
 {
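With the simplified call above, the only remaining option is `use_gradient_checkpointing`, visible in the new signature in `src/peft/utils/other.py` below. A small hedged sketch of opting out of it; the manual re-enable uses the standard Transformers API and is illustrative, not part of the notebook.

```py
from peft import prepare_model_for_int8_training

# gradient checkpointing is on by default; it can be disabled at prepare time
model = prepare_model_for_int8_training(model, use_gradient_checkpointing=False)

# ...and, if needed, enabled later via the regular Transformers API
# (the input-grad hook shown in the Whisper example is then your responsibility)
model.gradient_checkpointing_enable()
```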

src/peft/utils/other.py (+5 -23)

@@ -32,9 +32,7 @@ def bloom_model_postprocess_past_key_value(past_key_values):
     return tuple(zip(keys, values))
 
 
-def prepare_model_for_int8_training(
-    model, output_embedding_layer_name="lm_head", use_gradient_checkpointing=True, layer_norm_names=["layer_norm"]
-):
+def prepare_model_for_int8_training(model, use_gradient_checkpointing=True):
     r"""
     This method wraps the entire protocol for preparing a model before running a training. This includes:
         1- Cast the layernorm in fp32 2- making output embedding layer require grads 3- Add the upcasting of the lm
@@ -50,10 +48,10 @@ def prepare_model_for_int8_training(
         # freeze base model's layers
         param.requires_grad = False
 
-        if loaded_in_8bit:
-            # cast layer norm in fp32 for stability for 8bit models
-            if param.ndim == 1 and any(layer_norm_name in name for layer_norm_name in layer_norm_names):
-                param.data = param.data.to(torch.float32)
+    # cast all non INT8 parameters to fp32
+    for param in model.parameters():
+        if (param.dtype == torch.float16) or (param.dtype == torch.bfloat16):
+            param.data = param.data.to(torch.float32)
 
     if loaded_in_8bit and use_gradient_checkpointing:
         # For backward compatibility
@@ -69,22 +67,6 @@ def make_inputs_require_grad(module, input, output):
         # enable gradient checkpointing for memory efficiency
         model.gradient_checkpointing_enable()
 
-    if hasattr(model, output_embedding_layer_name):
-        output_embedding_layer = getattr(model, output_embedding_layer_name)
-        input_dtype = output_embedding_layer.weight.dtype
-
-        class CastOutputToFloat(torch.nn.Sequential):
-            r"""
-            Manually cast to the expected dtype of the lm_head as sometimes there is a final layer norm that is casted
-            in fp32
-
-            """
-
-            def forward(self, x):
-                return super().forward(x.to(input_dtype)).to(torch.float32)
-
-        setattr(model, output_embedding_layer_name, CastOutputToFloat(output_embedding_layer))
-
     return model
 
 
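Read as a whole, the post-change function boils down to the condensed sketch below. It paraphrases the hunks above for readability; the `is_loaded_in_8bit` attribute check and the `enable_input_require_grads` fallback are assumptions about the unchanged parts of the file, which the diff only shows in fragments.

```py
import torch

def prepare_model_for_int8_training(model, use_gradient_checkpointing=True):
    # condensed sketch of the updated logic, not the verbatim file contents
    loaded_in_8bit = getattr(model, "is_loaded_in_8bit", False)  # assumed flag set by 8-bit loading

    # freeze the base model's layers
    for param in model.parameters():
        param.requires_grad = False

    # cast all non-INT8 parameters to fp32 (replaces the old layer-norm-only cast)
    for param in model.parameters():
        if param.dtype in (torch.float16, torch.bfloat16):
            param.data = param.data.to(torch.float32)

    if loaded_in_8bit and use_gradient_checkpointing:
        # make the inputs require grads so checkpointing can backprop through frozen layers
        if hasattr(model, "enable_input_require_grads"):
            model.enable_input_require_grads()
        else:

            def make_inputs_require_grad(module, input, output):
                output.requires_grad_(True)

            model.get_input_embeddings().register_forward_hook(make_inputs_require_grad)

        # enable gradient checkpointing for memory efficiency
        model.gradient_checkpointing_enable()

    return model
```

Note that the removed `CastOutputToFloat` wrapper has no replacement: the output embedding layer is no longer wrapped, so logits are no longer force-cast to `fp32`.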

tests/test_gpu_examples.py (+1 -1)

@@ -402,7 +402,7 @@ def prepare_dataset(batch):
         model.config.forced_decoder_ids = None
         model.config.suppress_tokens = []
 
-        model = prepare_model_for_int8_training(model, output_embedding_layer_name="proj_out")
+        model = prepare_model_for_int8_training(model)
 
         config = LoraConfig(
             r=32, lora_alpha=64, target_modules=["q_proj", "v_proj"], lora_dropout=0.05, bias="none"
