Issues: aws-samples/awsome-distributed-training
efa-versions.py file has incorrect path for OFI-NCCL on SMHP Slurm
bug
Something isn't working
#662
opened Apr 30, 2025 by
amanshanbhag
SMHP Slurm clusters with OZFS file system not able to ssh into instance
bug
Something isn't working
#658
opened Apr 30, 2025 by
amanshanbhag
Rename CPU-DDP Kubernetes manifest from fsdp.yaml to ddp.yaml for clarity
#649
opened Apr 22, 2025 by
kjrstory
Change docker to rootless docker
enhancement
New feature or request
#646
opened Apr 18, 2025 by
mhuguesaws
Change slurm exporter to prometheus slurm exporter
enhancement
New feature or request
#644
opened Apr 16, 2025 by
mhuguesaws
add command examples for picotron SmolLM test case
bug
Something isn't working
#625
opened Mar 31, 2025 by
KeitaW
Conda environment creation script uses proprietary Anaconda channels
#582
opened Mar 11, 2025 by
jrandall
Change Amazon FSx for Lustre from Auto IOPS to user provisioned
#572
opened Feb 28, 2025 by
mhuguesaws
Add container version in 10.FSDP test case
enhancement
New feature or request
#564
opened Feb 21, 2025 by
KeitaW