
loftQ can not use multi gpu to train #17

Open · WanBenLe opened this issue Feb 4, 2024 · 9 comments

WanBenLe commented Feb 4, 2024

When I set

    import os
    os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,2,3'

the following error is raised:

    ../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [42,0,0], thread: [64,0,0] Assertion `srcIndex < srcSelectDimSize` failed.

        return (element == self).any().item()  # type: ignore[union-attr]
    RuntimeError: CUDA error: device-side assert triggered

How can I fix this?

yxli2123 (Owner) commented Feb 4, 2024

Which script are you running?

WanBenLe (Author) commented Feb 5, 2024

    CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch --multi_gpu --num_processes=4 --debug './~.py'

train_gsm8k.py raises the same error.
[screenshots of the error attached]

yxli2123 (Owner) commented Feb 5, 2024

Could you provide the full training command? Unfortunately, multi-GPU training for quantized models is not supported yet, because we use bitsandbytes quantization, which doesn't support it. So you can only train a full-precision model on multiple GPUs; to do that, it is important to enable --full_precision. (I have corrected the explanation of this argument; it was wrong.)

We provide example training scripts here.
For your case,

# train 4-bit 64-rank llama-2-7b with LoftQ on GSM8K using 8 A100s
accelerate launch train_gsm8k.py \
  --full_precision \
  --model_name_or_path LoftQ/Llama-2-7b-hf-4bit-64rank \
  --learning_rate 3e-4 \
  --seed 11 \
  --expt_name gsm8k_llama2_7b_4bit_64rank_loftq_fake \
  --output_dir exp_results/ \
  --num_train_epochs 6 \
  --per_device_train_batch_size 2 \
  --gradient_accumulation_steps 1 \
  --evaluation_strategy "no" \
  --save_strategy "epoch" \
  --weight_decay 0.1 \
  --warmup_ratio 0.03 \
  --lr_scheduler_type "cosine" \
  --logging_steps 10 \
  --do_train \
  --report_to tensorboard

WanBenLe (Author) commented Feb 5, 2024

Well, thanks for your help.
With my best wishes.

WanBenLe closed this as completed Feb 5, 2024

skyshine102 commented May 29, 2024

Now that QLoRA can be used with FSDP/DeepSpeed ZeRO, I was wondering if LoftQ can be used in combination.

I set the BnB config as recommended by https://huggingface.co/docs/peft/main/en/accelerate/deepspeed#use-peft-qlora-and-deepspeed-with-zero3-for-finetuning-large-models-on-multiple-gpus, but the program hangs:

    import torch
    from transformers import BitsAndBytesConfig, LlamaForCausalLM
    from peft import LoftQConfig, LoraConfig, get_peft_model

    # cfg, config (the model config) and attn_implementation are defined elsewhere in my script.
    quantization_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        # Notice that torch_dtype for AutoModelForCausalLM is the same as the
        # bnb_4bit_quant_storage data type, for FSDP/DeepSpeed ZeRO.
        bnb_4bit_quant_storage=torch.bfloat16,
    )
    model = LlamaForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b-chat-hf",
        quantization_config=quantization_config,
        torch_dtype=torch.bfloat16,
        config=config,
        attn_implementation=attn_implementation,
    )
    config = LoraConfig(
        r=cfg.training.lora_config.lora_r,
        lora_alpha=cfg.training.lora_config.lora_alpha,
        target_modules=cfg.training.lora_config.lora_target_modules,
        lora_dropout=cfg.training.lora_config.lora_dropout,
        bias="none",
        task_type="CAUSAL_LM",
        init_lora_weights="loftq",
        loftq_config=LoftQConfig(loftq_bits=4, loftq_iter=1),
    )
    model = get_peft_model(model, config)  # hangs here

Log:

Weight: (4194304, 1)  | Rank: 64 | Number Iter: 1 |  Num Bits: 4
....
(Then stuck at initializing peft model...)
  • I'm using peft==0.11.1, bnb==0.43.1.
  • I'm not sure if the weight shape is expected.
  • I was wondering if this is due to the bnb_4bit_quant_storage=torch.bfloat16 and bnb_4bit_use_double_quant=True args, but even after turning off these two args I still cannot make it work.

If you have any feedback, please let me know :(

yxli2123 (Owner) commented May 29, 2024

Could you provide the value of cfg.base_model?

If it is a model from the LoftQ Hugging Face repo, the problem could be in the way they implement QLoRA with FSDP. Chances are they shard the weights and then quantize the sharded weights; however, the checkpoints on the LoftQ Hugging Face repo are already quantized, so they may fail to shard the quantized weights.

If it is a model you obtained with quantize_save.py in this repo, it follows the same logic as QLoRA and there shouldn't be any problem.

Please let me know which case you are in.

yxli2123 reopened this May 29, 2024

skyshine102 commented:

Thank you for your prompt reply.
Sorry, it is neither of these two cases. I was trying to initialize the LoRA weights with LoftQ for the original Llama 2 base model, and I would like to do it on the fly if possible. I have updated my previous post with the full code snippet showing where I get stuck.
(I know this is not the recommended flow, but I don't understand why, other than the latency problem.)

yxli2123 (Owner) commented:

LoftQ obtains the quantized weight $Q$ and LoRA adapters $A, B$ by minimizing $||W - Q - AB^{\top}||$, where $W$ is the full-precision weight. When you call model = get_peft_model(model, config), we require the model to be full precision, but the model in your code is actually already quantized. The algorithm treats the quantized weight as the full-precision weight $W$ and therefore fails.
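
For intuition, here is a minimal sketch of that alternating minimization (not the repo's actual implementation; quantize() below is a placeholder for an NF4 quantize-dequantize step):

    import torch

    def loftq_init(W, rank, num_iter, quantize):
        # Alternately refine Q and (A, B) to reduce ||W - Q - A @ B.T||_F.
        A = W.new_zeros(W.shape[0], rank)
        B = W.new_zeros(W.shape[1], rank)
        Q = torch.zeros_like(W)
        for _ in range(num_iter):
            # With A, B fixed, quantize the current residual (first pass: Q = quantize(W)).
            Q = quantize(W - A @ B.T)
            # With Q fixed, fit rank-r adapters to the new residual via truncated SVD.
            U, S, Vh = torch.linalg.svd(W - Q, full_matrices=False)
            A = U[:, :rank] * S[:rank]
            B = Vh[:rank, :].T
        return Q, A, B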

It is also worth noting that even if you change the model to full precision, unfortunately, you still can't do it on the fly, because get_peft_model(model, config) returns a quantization-equivalent full-precision model (a.k.a. a fake quantized model). That's why we recommend applying LoftQ first and then loading the fake quantized model with bnb to turn it into a real quantized model.
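
For reference, a sketch of that recommended flow, assuming a checkpoint prepared ahead of time (e.g. the published LoftQ/Llama-2-7b-hf-4bit-64rank, whose LoRA adapters are assumed to sit in a loftq_init subfolder; adjust the names for your own quantize_save.py output):

    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import PeftModel

    MODEL_ID = "LoftQ/Llama-2-7b-hf-4bit-64rank"  # or your own LoftQ output directory

    # bnb turns the fake-quantized backbone into a real 4-bit model at load time.
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    base_model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        quantization_config=bnb_config,
        torch_dtype=torch.bfloat16,
    )

    # Attach the LoRA adapters that LoftQ initialized offline.
    model = PeftModel.from_pretrained(
        base_model,
        MODEL_ID,
        subfolder="loftq_init",
        is_trainable=True,
    )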

skyshine102 commented May 30, 2024

Thanks! I will change my current flow and give it a try.
(Sorry for hijacking the multi-GPU thread... anyways)
