
loftQ can not use multi gpu to train #17

Open · WanBenLe opened this issue Feb 4, 2024 · 9 comments

WanBenLe commented Feb 4, 2024

When I set

    import os
    os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,2,3'

the following error is raised:

    ../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [42,0,0], thread: [64,0,0] Assertion `srcIndex < srcSelectDimSize` failed.

        return (element == self).any().item()  # type: ignore[union-attr]
    RuntimeError: CUDA error: device-side assert triggered

How can I fix this?

yxli2123 (Owner) commented Feb 4, 2024

Which script are you running?

WanBenLe (Author) commented Feb 5, 2024

    CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch --multi_gpu --num_processes=4 --debug './~.py'

train_gsm8k.py raises the same error.
[screenshots of the error attached]

yxli2123 (Owner) commented Feb 5, 2024

Could you provide the full training command? Unfortunately, multi-GPU training for quantized models is not supported yet, because we use bitsandbytes quantization, which doesn't support it. So you can only train a full-precision model on multiple GPUs; to do that, it is important to enable --full_precision. (I have corrected the explanation of this argument; it was wrong.)

We provide example training scripts here.
For your case,

# train 4-bit 64-rank llama-2-7b with LoftQ on GSM8K using 8 A100s
accelerate launch train_gsm8k.py \
  --full_precision \
  --model_name_or_path LoftQ/Llama-2-7b-hf-4bit-64rank \
  --learning_rate 3e-4 \
  --seed 11 \
  --expt_name gsm8k_llama2_7b_4bit_64rank_loftq_fake \
  --output_dir exp_results/ \
  --num_train_epochs 6 \
  --per_device_train_batch_size 2 \
  --gradient_accumulation_steps 1 \
  --evaluation_strategy "no" \
  --save_strategy "epoch" \
  --weight_decay 0.1 \
  --warmup_ratio 0.03 \
  --lr_scheduler_type "cosine" \
  --logging_steps 10 \
  --do_train \
  --report_to tensorboard

WanBenLe (Author) commented Feb 5, 2024

Well, thanks for your help.
With my best wishes.

WanBenLe closed this as completed Feb 5, 2024

skyshine102 commented May 29, 2024

Now that QLoRA can be used with FSDP/DeepSpeed ZeRO, I was wondering if LoftQ can be used in combination.

I set the BnB config as recommended by https://huggingface.co/docs/peft/main/en/accelerate/deepspeed#use-peft-qlora-and-deepspeed-with-zero3-for-finetuning-large-models-on-multiple-gpus, but the program hangs:

    import torch
    from transformers import BitsAndBytesConfig, LlamaForCausalLM
    from peft import LoftQConfig, LoraConfig, get_peft_model

    # cfg, config (the model config) and attn_implementation are defined elsewhere in my script.
    quantization_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        # Notice that torch_dtype for AutoModelForCausalLM is the same as the
        # bnb_4bit_quant_storage data type, for FSDP/DeepSpeed ZeRO.
        bnb_4bit_quant_storage=torch.bfloat16,
    )
    model = LlamaForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b-chat-hf",
        quantization_config=quantization_config,
        torch_dtype=torch.bfloat16,
        config=config,
        attn_implementation=attn_implementation,
    )
    config = LoraConfig(
        r=cfg.training.lora_config.lora_r,
        lora_alpha=cfg.training.lora_config.lora_alpha,
        target_modules=cfg.training.lora_config.lora_target_modules,
        lora_dropout=cfg.training.lora_config.lora_dropout,
        bias="none",
        task_type="CAUSAL_LM",
        init_lora_weights="loftq",
        loftq_config=LoftQConfig(loftq_bits=4, loftq_iter=1),
    )
    model = get_peft_model(model, config)  # hangs here

Log:

Weight: (4194304, 1)  | Rank: 64 | Number Iter: 1 |  Num Bits: 4
....
(Then stuck at initializing peft model...)
  • I'm using peft==0.11.1, bnb==0.43.1.
  • I'm not sure if the weight shape is expected.
  • I was wondering if this is due to the bnb_4bit_quant_storage=torch.bfloat16 and bnb_4bit_use_double_quant=True args, but even after turning off these two args I still cannot make it work.

If you have any feedback, please let me know :(

yxli2123 (Owner) commented May 29, 2024

Could you provide the value of cfg.base_model?

If it is a model from the LoftQ Hugging Face repo, the problem could be in the way they implement QLoRA with FSDP. Chances are they shard the weights and then quantize the sharded weights; however, the checkpoints on the LoftQ Hugging Face repo are already quantized, so they may fail to shard the quantized weights.

If it is a model you obtained with quantize_save.py in this repo, it follows the same logic as QLoRA and there shouldn't be any problem.

Please let me know which case you are in.

yxli2123 reopened this May 29, 2024

skyshine102 commented:

Thank you for your prompt reply.
Sorry, it is neither of these two cases. I was trying to initialize the LoRA weights with LoftQ for the original Llama 2 base model, and I would like to do it on the fly if possible. I have updated my previous post with the full code snippet showing where I get stuck.
(I know this is not the recommended flow, but I don't understand why, other than the latency problem.)

yxli2123 (Owner) commented:

LoftQ obtains the quantized weight $Q$ and LoRA adapters $A, B$ by minimizing $||W - Q - AB^{\top}||$, where $W$ is the full-precision weight. When you call model = get_peft_model(model, config), we require the model to be full precision, but the model in your code is actually already quantized. The algorithm treats the quantized weight as the full-precision weight $W$ and therefore fails.
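
For intuition, here is a minimal sketch of that alternating minimization (not the repo's actual implementation; quantize() below is a placeholder for an NF4 quantize-dequantize step):

    import torch

    def loftq_init(W, rank, num_iter, quantize):
        # Alternately refine Q and (A, B) to reduce ||W - Q - A @ B.T||_F.
        A = W.new_zeros(W.shape[0], rank)
        B = W.new_zeros(W.shape[1], rank)
        Q = torch.zeros_like(W)
        for _ in range(num_iter):
            # With A, B fixed, quantize the current residual (first pass: Q = quantize(W)).
            Q = quantize(W - A @ B.T)
            # With Q fixed, fit rank-r adapters to the new residual via truncated SVD.
            U, S, Vh = torch.linalg.svd(W - Q, full_matrices=False)
            A = U[:, :rank] * S[:rank]
            B = Vh[:rank, :].T
        return Q, A, B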

It is also worth noting that even if you change the model to full precision, unfortunately, you still can't do it on the fly, because get_peft_model(model, config) returns a quantization-equivalent full-precision model (a.k.a. a fake quantized model). That's why we recommend applying LoftQ first and then loading the fake quantized model with bnb to turn it into a real quantized model.
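
For reference, a sketch of that recommended flow, assuming a checkpoint prepared ahead of time (e.g. the published LoftQ/Llama-2-7b-hf-4bit-64rank, whose LoRA adapters are assumed to sit in a loftq_init subfolder; adjust the names for your own quantize_save.py output):

    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import PeftModel

    MODEL_ID = "LoftQ/Llama-2-7b-hf-4bit-64rank"  # or your own LoftQ output directory

    # bnb turns the fake-quantized backbone into a real 4-bit model at load time.
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    base_model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        quantization_config=bnb_config,
        torch_dtype=torch.bfloat16,
    )

    # Attach the LoRA adapters that LoftQ initialized offline.
    model = PeftModel.from_pretrained(
        base_model,
        MODEL_ID,
        subfolder="loftq_init",
        is_trainable=True,
    )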

skyshine102 commented May 30, 2024

Thanks! I will change my current flow and give it a try.
(Sorry for hijacking the multi-GPU thread... anyways)
