REFACTOR TO THE MAX #7

Merged · 1 commit, merged on Jan 24, 2025
2 changes: 1 addition & 1 deletion Makefile
@@ -3,7 +3,7 @@
# make sure to test the local checkout in scripts and not the pre-installed one (don't use quotes!)
export PYTHONPATH = src

-check_dirs := src scripts
+check_dirs := src

style:
black --line-length 119 --target-version py310 $(check_dirs) setup.py
4 changes: 4 additions & 0 deletions README.md
@@ -55,6 +55,10 @@ If it isn't installed, run:
sudo apt-get install git-lfs
```

+## Training models
+
+
+
## Evaluating models

For small models, use `--data_parallel=$NUM_GPUS`; for large models, shard with `--tensor_parallel=$NUM_GPUS`.
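As a rough sketch of how a launcher might choose between those two flags (the 13B cutoff and the model size below are arbitrary placeholders for illustration, not project-defined values):

```shell
NUM_GPUS=8
MODEL_SIZE_B=70   # billions of parameters; illustrative, not a real config value

# Small models: replicate the whole model on each GPU (data parallel).
# Large models: shard each layer across GPUs (tensor parallel).
if [ "$MODEL_SIZE_B" -le 13 ]; then
  PARALLEL_FLAG="--data_parallel=$NUM_GPUS"
else
  PARALLEL_FLAG="--tensor_parallel=$NUM_GPUS"
fi
echo "$PARALLEL_FLAG"
```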
20 changes: 2 additions & 18 deletions recipes/launch.slurm → launch.slurm
@@ -24,29 +24,13 @@ echo "PYTHON ENV: $(which python)"
MODEL=Qwen2.5-1.5B-Instruct
TASK=sft
PRECISION=v00.00
-ACCELERATOR=deepspeed_zero3
+ACCELERATOR=zero3

# Training setup
NUM_NODES=$SLURM_NNODES
GPUS_PER_NODE=8
WORLD_SIZE=$(($NUM_NODES*$GPUS_PER_NODE))
-# Due to conflicts between Accelerate's DeepSpeed configs and Transformers' TrainingArguments, we need to parse the gradient accumulation steps from the config file to ensure they match
-CONFIG_FILE=recipes/$MODEL/$TASK/config_$PRECISION.yaml
-
-echo "CONFIG_FILE: $CONFIG_FILE"
-GRAD_ACC_STEPS=$(grep 'gradient_accumulation_steps' $CONFIG_FILE | awk '{print $2}')
-
-# Loop through the arguments and find the one with "--gradient_accumulation_steps"
-for arg in "${ARGS[@]}"; do
-  if [[ "$arg" == "--gradient_accumulation_steps="* ]]; then
-    # Extract the value after the equals sign
-    GRAD_ACC_STEPS="${arg#*=}"
-    break # Exit the loop once we find the desired argument
-  fi
-done
-
-echo "Gradient accumulation steps: $GRAD_ACC_STEPS"
# so processes know who to talk to
MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
MASTER_PORT=6000
@@ -56,7 +40,7 @@ export CMD=" \
"

export LAUNCHER="HF_HUB_ENABLE_HF_TRANSFER=1 ACCELERATE_LOG_LEVEL=info TRANSFORMERS_VERBOSITY=info accelerate launch \
-  --config_file recipes/accelerate_configs/$ACCELERATOR.yaml \
+  --config_file accelerate_configs/$ACCELERATOR.yaml \
--gradient_accumulation_steps $GRAD_ACC_STEPS \
--num_machines $NUM_NODES \
--num_processes $WORLD_SIZE \
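For context, the launch.slurm hunk above covers two small shell idioms: the world-size arithmetic and the gradient-accumulation lookup that the PR strips out. A standalone sketch of both, using illustrative node counts and a throwaway config file (`/tmp/demo_config.yaml` and its values are made up for this demo):

```shell
#!/usr/bin/env bash
set -euo pipefail

# World size: one rank per GPU across all nodes (values are illustrative;
# the real script takes NUM_NODES from $SLURM_NNODES)
NUM_NODES=2
GPUS_PER_NODE=8
WORLD_SIZE=$(($NUM_NODES*$GPUS_PER_NODE))
echo "WORLD_SIZE: $WORLD_SIZE"

# The removed logic read gradient_accumulation_steps out of the YAML recipe
# so that Accelerate and TrainingArguments agree on the same value.
cat > /tmp/demo_config.yaml <<'EOF'
learning_rate: 2.0e-5
gradient_accumulation_steps: 4
EOF
GRAD_ACC_STEPS=$(grep 'gradient_accumulation_steps' /tmp/demo_config.yaml | awk '{print $2}')

# A --gradient_accumulation_steps=N CLI argument overrides the config value.
ARGS=(--seed=42 --gradient_accumulation_steps=8)
for arg in "${ARGS[@]}"; do
  if [[ "$arg" == "--gradient_accumulation_steps="* ]]; then
    GRAD_ACC_STEPS="${arg#*=}"   # keep everything after the equals sign
    break
  fi
done
echo "Gradient accumulation steps: $GRAD_ACC_STEPS"
```

Note the grep/awk pair is a deliberately minimal YAML read: it assumes the key appears exactly once at the top level, as it does in these recipe files.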
46 changes: 0 additions & 46 deletions recipes/Qwen2.5-1.5B-Instruct/sft/config_v00.00.yaml

This file was deleted.

26 changes: 0 additions & 26 deletions recipes/accelerate_configs/fsdp.yaml

This file was deleted.

25 changes: 0 additions & 25 deletions recipes/accelerate_configs/fsdp_qlora.yaml

This file was deleted.

16 changes: 0 additions & 16 deletions recipes/accelerate_configs/multi_gpu.yaml

This file was deleted.

1 change: 0 additions & 1 deletion scripts/training/README.md

This file was deleted.
