podman usage #28

Open
mjkpolo opened this issue Aug 23, 2024 · 5 comments

Comments

@mjkpolo

mjkpolo commented Aug 23, 2024

Hello,

I'm struggling to use podman. I keep seeing either

Error: creating build container: writing blob: adding layer with blob "sha256:560c024910bebac6b404791af28ebd48a8289303b8377d17b67ffdfe52754f2a": processing tar file(potentially insufficient UIDs or GIDs available in user namespace (requested 0:42 for /etc/gshadow): Check /etc/subuid and /etc/subgid if configured locally and run "podman system migrate": lchown /etc/gshadow: invalid argument): exit status 1

or

WARN[0000] Failed to get rootless runtime dir for DefaultAPIAddress: lstat /run/user/2069: no such file or directory
WARN[0000] RunRoot is pointing to a path (/run/user/2069/containers) which is not writable. Most likely podman will fail.
Error: creating events dirs: mkdir /run/user/2069: permission denied

Thanks!
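For reference, the first error asks for GID 42 inside the user namespace and points at /etc/subuid and /etc/subgid. A minimal Python sketch of that check, assuming the usual `name:start:count` line format and podman's default rootless mapping (container ID 0 is the user itself, IDs 1 and up come from the subordinate ranges). The helper names `subid_ranges` and `covers_id` are made up for illustration:

```python
def subid_ranges(text, user):
    """Parse /etc/subuid- or /etc/subgid-style text into (start, count) pairs for user."""
    ranges = []
    for line in text.splitlines():
        parts = line.strip().split(":")
        # Skip blank or malformed lines and entries for other users.
        if len(parts) != 3 or parts[0] != user:
            continue
        ranges.append((int(parts[1]), int(parts[2])))
    return ranges


def covers_id(ranges, container_id):
    """True if container_id can be mapped under the assumed default rootless mapping.

    Assumption: ID 0 maps to the user's own ID, and IDs 1..N are taken
    in order from the subordinate ranges, where N is the total count.
    """
    total = sum(count for _, count in ranges)
    return container_id == 0 or container_id <= total
```

Reading the real file would look like `subid_ranges(open("/etc/subgid").read(), getpass.getuser())`. An empty result would be consistent with the `lchown /etc/gshadow: invalid argument` failure above, although on a cluster the ranges may be managed somewhere other than these local files.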

@koomie
Collaborator

koomie commented Aug 23, 2024

I would suggest using apptainer for containerization instead of podman.

@mjkpolo
Author

mjkpolo commented Aug 23, 2024

@koomie I'm trying to use vllm, but they don't have a docker image I can pull, so I need to build it myself. Does that mean I need to convert their docker files to apptainer definition files? link

(building from source without docker/podman wasn't working for me)

@coleramos425

Probably a lower priority ask, but it would be nice to have a section in HPCFund docs for "Containers on HPCFund", similar to Containers on Frontier. Some topics that may be useful to cover and/or share links to:

  • Converting Docker containers to Singularity for use locally
  • Mention that the cluster-wide mount point defaults to $WORK
  • Any other gotchas that have come up from Singularity users on the system

@coleramos425

EDIT: Oops just saw #10 covers this 😋

@mjkpolo
Author

mjkpolo commented Jan 28, 2025

I have been using apptainer now, but I am running into issues building AMD's megatron-lm container. I based the container on this docker file, but I am getting "Resource temporarily unavailable" when building transformer engine:

OpenBLAS blas_thread_init: RLIMIT_NPROC 1536 current, 2060805 max
OpenBLAS blas_thread_init: pthread_create failed for thread 56 of 64: Resource temporarily unavailable
OpenBLAS blas_thread_init: ensure that your address space and process count limits are big enough (ulimit -a)
OpenBLAS blas_thread_init: or set a smaller OPENBLAS_NUM_THREADS to fit into what you have available

Decreasing OPENBLAS_NUM_THREADS makes the build take more hours than I have allocated on the node I am using (I can't build on the login node instead, since the build runs out of memory there).

Have others encountered this issue and is there a workaround? I have been using apptainer to build NeMo and transformer engine on an NVIDIA cluster and it doesn't seem to have this issue

Thanks!

P.S. The megatron docker image from AMD works if I pull it, but it is built for gfx942. There is only one mi300x node available, and I have to wait too long to get access each day, so it would be ideal to build the same docker image for gfx90a to speed up development.
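
One thing that may be worth trying, going by the OpenBLAS log above: it reports a soft RLIMIT_NPROC of 1536 against a hard maximum of 2060805, so the soft limit could in principle be raised instead of lowering OPENBLAS_NUM_THREADS. A minimal Python sketch, assuming the build is launched from a wrapper script. The helper names are made up, and whether the raised limit carries into the apptainer build on this cluster is untested:

```python
import resource
import subprocess


def raised_nproc(soft, hard):
    """Return the (soft, hard) pair with the soft limit lifted to the hard limit."""
    # A soft limit that is already unlimited is left alone.
    if soft == resource.RLIM_INFINITY:
        return soft, hard
    return hard, hard


def run_with_raised_nproc(cmd):
    """Raise the soft process limit for this process, then run cmd.

    Child processes inherit the limit, so the command started here
    sees the raised value. Returns the command's exit code.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
    resource.setrlimit(resource.RLIMIT_NPROC, raised_nproc(soft, hard))
    return subprocess.run(cmd, check=False).returncode
```

Usage would be something like `run_with_raised_nproc(["apptainer", "build", "megatron.sif", "megatron.def"])`, where both file names are placeholders. An unprivileged process may raise its soft limit up to the hard limit, so this should not need root. The shell equivalent is `ulimit -u` before starting the build.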


3 participants