Lightning Fabric fails in AWS sagemaker jupyter notebook #20178
Unanswered
lrnilingy asked this question in DDP / multi-GPU / multi-node
Replies: 0 comments
Dear all, I am a new user of Fabric, and I am trying to make it work in an AWS SageMaker Jupyter notebook.
The instance type is `ml.g5.12xlarge`, which has 4 GPUs (NVIDIA A10G). It comes with a preinstalled torch conda environment with CUDA drivers, torch 2.2, etc. However, after I made the changes described in the documentation and ran all the cells of the Jupyter notebook, I got errors from `fabric.launch(train)`.
I am confused, since CUDA, the GPUs, and the instance itself should not be the problem. In fact, I was previously using Hugging Face Accelerate, and that package worked on this instance.
Am I missing something? Thank you very much.
Some environment info:
Yesterday I was using `lightning v2.3.3`, and it failed as well...
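Since the environment details matter for this kind of launch failure, here is a minimal sketch of how they could be collected in a notebook cell before calling `fabric.launch(train)`. The helper name `collect_env_info` and the default package list are hypothetical and chosen only for illustration. The sketch relies on the standard-library `importlib.metadata` and, if torch can be imported, on `torch.version.cuda` and `torch.cuda.is_initialized()`.

```python
import platform
from importlib import metadata


def collect_env_info(packages=("lightning", "torch")):
    """Return version strings for Python and the given packages (hypothetical helper)."""
    info = {"python": platform.python_version()}
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            info[name] = "not installed"

    # torch is optional here; only query it if it can be imported.
    try:
        import torch
    except ImportError:
        return info

    # CUDA version torch was built against (None for CPU-only builds).
    info["torch.version.cuda"] = str(torch.version.cuda)
    # is_initialized() only reads a flag; it does not create a CUDA context.
    info["cuda_initialized_in_this_process"] = str(torch.cuda.is_initialized())
    return info


for key, value in collect_env_info().items():
    print(f"{key}: {value}")
```

If `cuda_initialized_in_this_process` is already `True` before the launch cell runs, that may be worth looking at, because a notebook launch has to start its worker processes from the running kernel. This is an assumption to check, not a confirmed cause of the errors above.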