ROCm/hipBLASLt@7f76af3 failing with OSError: Failed to locate rocm-smi #359

Open
marbre opened this issue Apr 7, 2025 · 5 comments
Labels
bug (Something isn't working), upstream-cleanup (Cleanup needed in upstream components)

Comments

@marbre
Member

marbre commented Apr 7, 2025

Issue

When trying to bump the hipBLASLt submodule to ROCm/hipBLASLt@7f76af3, building for gfx94X-dcgpu fails with

[hipBLASLt] [5/88] Generating Tensile Libraries
[hipBLASLt] FAILED: Tensile/library /__w/TheRock-gfx94X/build/math-libs/BLAS/hipBLASLt/build/Tensile/library 
[hipBLASLt] cd /__w/TheRock-gfx94X/build/math-libs/BLAS/hipBLASLt/build/library && /usr/local/cmake-3.30.3-linux-x86_64/bin/cmake -E env 'PATH=/__w/TheRock-gfx94X/build/core/clr/dist/bin:/__w/TheRock-gfx94X/build/core/clr/dist/lib/llvm/bin:/__w/TheRock-gfx94X/venv/bin:/__w/TheRock/venv/bin:/__w/.local/bin:/__w/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin' -- /__w/TheRock-gfx94X/build/math-libs/BLAS/hipBLASLt/build/virtualenv/bin/python3.12 /__w/TheRock-gfx94X/build/math-libs/BLAS/hipBLASLt/build/virtualenv/lib/python3.12/site-packages/Tensile/bin/TensileCreateLibrary --code-object-version=4 --cxx-compiler=amdclang++ --library-format=msgpack --architecture=gfx942 --build-id=sha1 /__w/TheRock-gfx94X/math-libs/BLAS/hipBLASLt/library/src/amd_detail/rocblaslt/src/Tensile/Logic/ /__w/TheRock-gfx94X/build/math-libs/BLAS/hipBLASLt/build/Tensile HIP
[hipBLASLt] 
[hipBLASLt] ################################################################################
[hipBLASLt] # Tensile Create Library
[hipBLASLt] Traceback (most recent call last):
[hipBLASLt]   File "/__w/TheRock-gfx94X/build/math-libs/BLAS/hipBLASLt/build/virtualenv/lib/python3.12/site-packages/Tensile/bin/TensileCreateLibrary", line 44, in <module>
[hipBLASLt]     TensileCreateLibrary.run()
[hipBLASLt]   File "/__w/TheRock-gfx94X/build/math-libs/BLAS/hipBLASLt/build/virtualenv/lib/python3.12/site-packages/Tensile/TensileCreateLibrary/Run.py", line 564, in run
[hipBLASLt]     assignGlobalParameters(arguments, isaInfoMap)
[hipBLASLt]   File "/__w/TheRock-gfx94X/build/math-libs/BLAS/hipBLASLt/build/virtualenv/lib/python3.12/site-packages/Tensile/Common/GlobalParameters.py", line 551, in assignGlobalParameters
[hipBLASLt]     globalParameters["ROCmSMIPath"] = locateExe(globalParameters["ROCmBinPath"], "rocm-smi")
[hipBLASLt]                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[hipBLASLt]   File "/__w/TheRock-gfx94X/build/math-libs/BLAS/hipBLASLt/build/virtualenv/lib/python3.12/site-packages/Tensile/Common/Utilities.py", line 103, in locateExe
[hipBLASLt]     raise OSError(f"Failed to locate {exeName}")
[hipBLASLt] OSError: Failed to locate rocm-smi

The line raising the OSError https://github.com/ROCm/hipBLASLt/blob/e9fa8851fbbb1441b67ef0f9c42bdcae8318a7f7/tensilelite/Tensile/Common/Utilities.py#L104 was introduced as part of commit ROCm/hipBLASLt@422087b.

Changes to reproduce are on branch users/marbre/bump-20250704-blas.

Steps to Reproduce

# Checkout sources 
git clone git@github.com:ROCm/TheRock.git
cd TheRock

# Install Python dependencies, see https://github.com/ROCm/TheRock?tab=readme-ov-file#common
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

# Install system dependencies, see https://github.com/ROCm/TheRock?tab=readme-ov-file#on-ubuntu
sudo apt install gfortran git-lfs ninja-build cmake g++ pkg-config xxd libgmock-dev libgtest-dev patchelf automake


# Checkout branch and fetch sources
git checkout users/marbre/bump-20250704-blas
./build_tools/fetch_sources.py

# Configure the build
cmake -B build -GNinja . -DCMAKE_C_COMPILER_LAUNCHER=ccache -DCMAKE_CXX_COMPILER_LAUNCHER=ccache -DTHEROCK_AMDGPU_FAMILIES=gfx94X-dcgpu -DTHEROCK_PACKAGE_VERSION=ADHOCBUILD -DTHEROCK_VERBOSE=ON -DBUILD_TESTING=OFF -DTHEROCK_ENABLE_ALL=OFF -DTHEROCK_ENABLE_BLAS=ON

# Try to build the target
cmake --build build --target hipBLASLt

Hints

  • Alternatively, the sub-project can be configured with
    cmake --build build --target hipBLASLt+configure
    This allows you to cd into the build directory, build/math-libs/BLAS/hipBLASLt/build, and trigger a build there.
  • If rebuilding, make sure to delete the Tensile build
    rm -r math-libs/BLAS/hipBLASLt/tensilelite/build
    rm -r math-libs/BLAS/hipBLASLt/tensilelite/Tensile.egg-info/
marbre added the bug and upstream-cleanup labels Apr 7, 2025
@bstefanuk
Contributor

bstefanuk commented Apr 9, 2025

Running a build with the above instructions, I couldn't reproduce the issue.

$ cmake --build build --target hipBLASLt
...
[hipBLASLt] [37/37] Creating library symlink library/libhipblaslt.so.0 library/libhipblaslt.so
[hipBLASLt completed in 6583 seconds]
[114/114] Merging sub-project dist directory for hipBLASLt

However, I notice that rocm-smi is on my PATH:

$ whereis rocm-smi
rocm-smi: /usr/bin/rocm-smi /opt/rocm-6.4.0/bin/rocm-smi

Not that I'm an advocate for the code, but if we look at how this key is set in hipBLASLt:

globalParameters["ROCmPath"] = "/opt/rocm"
if "ROCM_PATH" in os.environ:
    globalParameters["ROCmPath"] = os.environ.get("ROCM_PATH")
...
globalParameters["ROCmBinPath"] = os.path.join(globalParameters["ROCmPath"], "bin")
globalParameters["ROCmSMIPath"] = locateExe(globalParameters["ROCmBinPath"], "rocm-smi")

Then in locateExe:

for path in os.environ["PATH"].split(os.pathsep):
    exePath = os.path.join(path, exeName)

It looks like we're hitting an untested edge case where rocm-smi isn't found in any of ROCM_PATH, PATH, or /opt/rocm.

I can clean this code up to search for rocm-smi slightly more idiomatically, but in the meantime, I'm curious what your environment looks like?
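
For illustration, a more idiomatic lookup could lean on shutil.which, which accepts an explicit search path. This is only a sketch, not the actual Tensile code: the name locate_exe and the default_dir parameter are hypothetical, and the search order (preferred directory first, then PATH) is an assumption based on the snippets above.

```python
import os
import shutil
from typing import Optional


def locate_exe(exe_name: str, default_dir: Optional[str] = None) -> str:
    """Return the full path to exe_name, or raise OSError listing what was searched.

    Hypothetical helper: checks default_dir first (e.g. $ROCM_PATH/bin), then PATH.
    """
    search_dirs = []
    if default_dir:
        search_dirs.append(default_dir)
    # Skip empty PATH entries so they are not treated as the current directory.
    search_dirs.extend(p for p in os.environ.get("PATH", "").split(os.pathsep) if p)

    # shutil.which checks the executable bit and takes an explicit search path.
    found = shutil.which(exe_name, path=os.pathsep.join(search_dirs))
    if found is None:
        raise OSError(f"Failed to locate {exe_name}; searched: {search_dirs}")
    return found
```

Including the searched directories in the error message would have made the failure in the original report easier to diagnose, since the traceback alone does not show where Tensile looked.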


Details:

ROCm stack: 6.4.43481-46320a638

$ cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.5 LTS"

Commit

$ git show
commit 08dfb3c2c1b291e5cebaa3b624375e054720a794 (HEAD -> users/marbre/bump-20250704-blas, origin/users/marbre/bump-20250704-blas)
Author: Marius Brehler <marius.brehler@amd.com>
Date:   Mon Apr 7 22:05:05 2025 +0000

    Bump BLAS submodules 20250407

@stellaraccident
Collaborator

Thanks - the reporter claims that their WSL system is missing it. I haven't seen this myself.

@marbre
Member Author

marbre commented Apr 9, 2025

Thanks - the reporter claims that their WSL system is missing it. I haven't seen this myself.

I think this is only partly related. A WSL system missing rocm-smi is one case, but we hit the same failure whenever we try to build in an environment without a pre-installed ROCm, which shouldn't be a hard requirement. In such an environment /opt/rocm/ does not exist, and if rocm-smi cannot be found in that specific location, the build fails. Among other places, this happens in the CI. If the build wants to use rocm-smi, it should use the one built with TheRock, not one in /opt/rocm. In any case, a build must be possible in an environment without a GPU, like in our CI.

@ellosel
Contributor

ellosel commented Apr 9, 2025

@marbre it seems like we have two issues here:

  1. We should be able to build without needing rocm-smi (we are cross-compiling after all).
  2. We should be able to support rocm installs outside of the conventional location.

If we found a way to build without using rocm-smi, would that solve the issue? I don't think rocm-smi is required to build hipBLASLt, but because Tensile and TensileCreateLibrary are lumped together we unconditionally check for the existence of rocm-smi.
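
One way to decouple the two, sketched here under assumptions, is to resolve rocm-smi lazily, so that library generation never triggers the lookup and only code paths that actually query a GPU do. The class and attribute names (ToolPaths, rocm_smi) are hypothetical and are not taken from Tensile or from any patch discussed in this thread.

```python
import shutil
from functools import cached_property


class ToolPaths:
    """Hypothetical holder for tool paths; lookups happen on first access only."""

    def __init__(self, rocm_bin_dir: str):
        self.rocm_bin_dir = rocm_bin_dir

    @cached_property
    def rocm_smi(self) -> str:
        # Only evaluated when something actually asks for rocm-smi,
        # e.g. a benchmarking path that needs to query the GPU.
        found = shutil.which("rocm-smi", path=self.rocm_bin_dir) or shutil.which("rocm-smi")
        if found is None:
            raise OSError("Failed to locate rocm-smi")
        return found


# Constructing the object does not require rocm-smi to exist, so a
# build-only step on a machine without ROCm installed would not fail here.
tools = ToolPaths("/opt/rocm/bin")
```

With this shape, the OSError from the original report would surface only if a caller reads tools.rocm_smi, not during argument setup.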

Also, the way that Tensile currently accommodates ROCm installations outside of "conventional" locations is through the ROCM_PATH or HIP_PATH environment variables. Can TheRock use these variables? If not, can you provide details how we might discover your ROCm installation?
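
As a rough sketch of that kind of discovery: the function name, the order in which the two variables are consulted, and the /opt/rocm fallback (taken from the snippet earlier in this thread) are assumptions, not the actual Tensile logic.

```python
import os
from typing import Mapping


def resolve_rocm_path(env: Mapping[str, str] = os.environ) -> str:
    """Pick a ROCm root: ROCM_PATH, then HIP_PATH, then the conventional default."""
    for var in ("ROCM_PATH", "HIP_PATH"):
        value = env.get(var)
        if value:  # ignore unset and empty values
            return value
    return "/opt/rocm"
```

A caller could then set ROCM_PATH to point at a non-default install before invoking the build, and pass a plain dict as env when testing the function in isolation.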

I would like to move in a direction where we forward toolchain information detected by CMake and remove all related logic in Tensile, but it will take some time to get there.

@marbre
Member Author

marbre commented Apr 11, 2025

@marbre it seems like we have two issues here:

  1. We should be able to build without needing rocm-smi (we are cross-compiling after all).
  2. We should be able to support rocm installs outside of the conventional location.

If we found a way to build without using rocm-smi, would that solve the issue? I don't think rocm-smi is required to build hipBLASLt, but because Tensile and TensileCreateLibrary are lumped together we unconditionally check for the existence of rocm-smi.

With the patch tracked in #380, this is actually resolved. It would be nice to get this into hipBLASLt instead of carrying the patch as part of TheRock.

Also, the way that Tensile currently accommodates ROCm installations outside of "conventional" locations is through the ROCM_PATH or HIP_PATH environment variables. Can TheRock use these variables? If not, can you provide details how we might discover your ROCm installation?

For Tensile, we have at minimum https://github.com/ROCm/TheRock/blob/main/patches/amd-mainline/hipBLASLt/0002-Do-not-hard-code-hipBLASLt-to-find-tools-in-opt-rocm.patch (tracked in #262). Issues tracking all hipBLASLt related patches are here: https://github.com/ROCm/TheRock/issues?q=is%3Aissue%20state%3Aopen%20hipBLASLt%20label%3Apatch


4 participants