LoftQ cannot use multi-GPU training #17
Which script are you running?
Running

CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch --multi_gpu --num_processes=4 --debug './~.py' train_gsm8k.py

raises the same error.
Could you provide the full training command? Multi-GPU training for quantized models is unfortunately not supported yet. This is because we use [...]. We provide example training scripts here.

# train 4-bit 64-rank llama-2-7b with LoftQ on GSM8K using 8 A100s
accelerate launch train_gsm8k.py \
--full_precision \
--model_name_or_path LoftQ/Llama-2-7b-hf-4bit-64rank \
--learning_rate 3e-4 \
--seed 11 \
--expt_name gsm8k_llama2_7b_4bit_64rank_loftq_fake \
--output_dir exp_results/ \
--num_train_epochs 6 \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 1 \
--evaluation_strategy "no" \
--save_strategy "epoch" \
--weight_decay 0.1 \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 10 \
--do_train \
--report_to tensorboard
Well, thanks for your help.
Now that QLoRA can be used with FSDP/DeepSpeed ZeRO, I was wondering whether LoftQ can be used in that combination. I set the BnB config as recommended by https://huggingface.co/docs/peft/main/en/accelerate/deepspeed#use-peft-qlora-and-deepspeed-with-zero3-for-finetuning-large-models-on-multiple-gpus, but the program hangs:

import torch
from transformers import BitsAndBytesConfig, LlamaForCausalLM
from peft import LoraConfig, LoftQConfig, get_peft_model

# `config`, `attn_implementation`, and `cfg` come from my own setup (omitted here).
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    # Notice that torch_dtype for the model is the same as the bnb_4bit_quant_storage data type.
    # Required for FSDP / DeepSpeed ZeRO
    bnb_4bit_quant_storage=torch.bfloat16,
)

model = LlamaForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    quantization_config=quantization_config,
    torch_dtype=torch.bfloat16,
    config=config,
    attn_implementation=attn_implementation,
)

lora_config = LoraConfig(
    r=cfg.training.lora_config.lora_r,
    lora_alpha=cfg.training.lora_config.lora_alpha,
    target_modules=cfg.training.lora_config.lora_target_modules,
    lora_dropout=cfg.training.lora_config.lora_dropout,
    bias="none",
    task_type="CAUSAL_LM",
    init_lora_weights="loftq",
    loftq_config=LoftQConfig(loftq_bits=4, loftq_iter=1),
)

model = get_peft_model(model, lora_config)  # hangs here

Log:
If you have any feedback, please let me know :(
Could you share the value of the model path you load?

If it is a model from the LoftQ HuggingFace repo, the problem could be the way they implement QLoRA with FSDP: chances are they shard the weight and then quantize the sharded weight. However, the checkpoints on the LoftQ HuggingFace repo are already quantized, so they may fail to shard the quantized weight. If it is a model you obtained by [...], that is a different case. Please let me know which case you are in.
Thank you for your prompt reply.
LoftQ obtains the quantized weight [...]. It is also worth noting that even if you change the [...]
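For reference, a minimal sketch of that flow: apply LoftQ initialization through PEFT to a full-precision base model instead of loading an already-quantized checkpoint. The model id, rank, and target modules below are placeholders, not values confirmed in this thread.

import torch
from transformers import AutoModelForCausalLM
from peft import LoftQConfig, LoraConfig, get_peft_model

# Load the base model in full precision (no BitsAndBytesConfig here);
# LoftQ initialization quantizes the backbone itself.
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",   # placeholder model id
    torch_dtype=torch.bfloat16,
)

loftq_config = LoftQConfig(loftq_bits=4, loftq_iter=1)
lora_config = LoraConfig(
    r=64,                          # placeholder rank
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM",
    init_lora_weights="loftq",     # LoftQ initialization of the LoRA adapters
    loftq_config=loftq_config,
)

model = get_peft_model(base_model, lora_config)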
Thanks! I will change my current flow and give it a try. |
When I set

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,2,3'

it raises the error:

../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [42,0,0], thread: [64,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
    return (element == self).any().item()  # type: ignore[union-attr]
RuntimeError: CUDA error: device-side assert triggered

How can I fix this?
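A general debugging note, not specific to LoftQ: CUDA_VISIBLE_DEVICES only takes effect if it is set before the CUDA context is initialized, and device-side asserts are easier to localize with CUDA_LAUNCH_BLOCKING=1; this particular assertion often points to an out-of-range index, e.g. a token id beyond the embedding size. A minimal sketch, with placeholder values:

# Set the environment before anything touches CUDA (safest: before importing torch),
# or export the variables in the shell before launching the script.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"  # must be set before CUDA initializes
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"        # report the failing kernel at its call site

import torch

print(torch.cuda.device_count())  # expect 4 if the mask took effect

# If the assert persists, check that all token ids fall within the embedding range,
# e.g. after adding special tokens:
# model.resize_token_embeddings(len(tokenizer))  # hypothetical fix, depends on your setup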