How to change auto-requeue hpc.ckpt path
#20357
Closed
arijit-hub started this conversation in General
Replies: 1 comment
-
I figured it out. One needs to specifically set the |
-
Hi,
I am using a SLURM environment and requeue jobs with Lightning's automatic SLURM handler. It works flawlessly, but I have one small issue: the temporary checkpoints `hpc_ckpt_*.ckpt` are saved in the current working directory instead of the directory I specified for model checkpoint saving. This breaks my experiments when I start a new job while an earlier job is waiting to be requeued. What I mean is this:
(1) An old job hit the wall-time, saved a temporary checkpoint, and was requeued. It will use the `hpc_ckpt_*.ckpt` file to resume training.
(2) A new experiment launched with the same `.sh` file will not start from scratch, because it assumes the existing `hpc_ckpt_*.ckpt` is meant for it.
Is there any fix for this?
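
For readers who want to see the collision concretely, here is a minimal, standard-library-only sketch. It does not call Lightning; `latest_hpc_ckpt` is a hypothetical helper that only mimics "pick up the newest `hpc_ckpt_<n>.ckpt` in a folder", and the directory names (`shared_cwd`, `runs/exp_a`, `runs/exp_b`) are made up for illustration. It shows that two jobs sharing one folder see each other's temporary checkpoint, while jobs with separate folders do not.

```python
import re
import tempfile
from pathlib import Path
from typing import Optional


def latest_hpc_ckpt(folder: Path) -> Optional[Path]:
    """Hypothetical helper: return the highest-numbered hpc_ckpt_<n>.ckpt, or None."""
    pattern = re.compile(r"hpc_ckpt_(\d+)\.ckpt$")
    found = []
    for path in folder.glob("hpc_ckpt_*.ckpt"):
        match = pattern.match(path.name)
        if match:
            found.append((int(match.group(1)), path))
    if not found:
        return None
    return max(found, key=lambda item: item[0])[1]


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    shared = root / "shared_cwd"      # both jobs launched from the same folder
    run_a = root / "runs" / "exp_a"   # per-experiment folders (illustrative names)
    run_b = root / "runs" / "exp_b"
    for folder in (shared, run_a, run_b):
        folder.mkdir(parents=True)

    # Old job hits the wall-time and leaves a temporary checkpoint behind.
    (shared / "hpc_ckpt_1.ckpt").touch()
    # A new job started from the same folder finds the old job's checkpoint.
    shared_hit = latest_hpc_ckpt(shared)
    shared_hit_name = shared_hit.name if shared_hit else None

    # With one folder per experiment, the new job finds nothing and starts fresh.
    (run_a / "hpc_ckpt_1.ckpt").touch()
    isolated_hit = latest_hpc_ckpt(run_b)
```

In Lightning itself, the `Trainer` accepts a `default_root_dir` argument that defaults to the current working directory. Assuming the SLURM handler writes and looks up `hpc_ckpt_*.ckpt` under that directory, passing a distinct `default_root_dir` per experiment would give the isolation sketched above; please verify this against the Lightning version you are running.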