XLA unsupported on GFX1100 #63
There's not really much we can do here :( We're at the mercy of XLA upstream. This may also be a ROCm thing. When I used ROCm years ago, there were a number of GPUs it did not support, IIRC. |
You could try adding gfx1100 at Line 83 in 63f8d3a and then building from source by setting the build environment variable. |
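A sketch of what that setup could look like (the dependency list is illustrative; XLA_BUILD and XLA_TARGET are the environment variables the xla package reads):

Mix.install(
  [
    {:exla, "~> 0.6"}
  ],
  system_env: %{
    # build the XLA extension from source instead of fetching a precompiled archive
    "XLA_BUILD" => "true",
    "XLA_TARGET" => "rocm"
  }
)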
I made that change and also added gfx1100 to https://github.com/openxla/xla/blob/a01af1af923cf66271c0f03b2962dff58068af5e/xla/stream_executor/device_description.h#L174

08:20:12.567 [info] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
08:20:12.567 [info] XLA service 0x7f437c6272e0 initialized for platform ROCM (this does not guarantee that XLA will be used). Devices:
08:20:12.567 [info] StreamExecutor device (0): Radeon RX 7900 XTX, AMDGPU ISA version: gfx1100
08:20:12.567 [info] Using BFC allocator.
08:20:12.567 [info] XLA backend allocating 23177723904 bytes on device 0 for BFCAllocator.
08:20:12.567 [error] INTERNAL: RET_CHECK failure (xla/pjrt/gpu/se_gpu_pjrt_client.cc:960) options.num_nodes == 1 || kv_get != nullptr
*** Begin stack trace ***
tsl::CurrentStackTrace[abi:cxx11]()
xla::status_macros::MakeErrorStream::Impl::GetStatus()
xla::GetStreamExecutorGpuClient(xla::GpuClientOptions const&)
xla::GetStreamExecutorGpuClient(bool, xla::GpuAllocatorConfig const&, int, int, std::optional<std::set<int, std::less<int>, std::allocator<int> > > const&, std::optional<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, bool, std::function<absl::lts_20230802::StatusOr<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > (std::basic_string_view<char, std::char_traits<char> >, absl::lts_20230802::Duration)>, std::function<absl::lts_20230802::Status (std::basic_string_view<char, std::char_traits<char> >, std::basic_string_view<char, std::char_traits<char> >)>, bool)
exla::GetGpuClient(double, bool, xla::GpuAllocatorConfig::Kind)
get_gpu_client(enif_environment_t*, int, unsigned long const*)
beam_jit_call_nif(process*, void const*, unsigned long*, unsigned long (*)(enif_environment_t*, int, unsigned long*), erl_module_nif*)
*** End stack trace ***
08:20:12.570 [error] GenServer EXLA.Client terminating
** (RuntimeError) RET_CHECK failure (xla/pjrt/gpu/se_gpu_pjrt_client.cc:960) options.num_nodes == 1 || kv_get != nullptr
(exla 0.7.0-dev) lib/exla/client.ex:196: EXLA.Client.unwrap!/1
(exla 0.7.0-dev) lib/exla/client.ex:173: EXLA.Client.build_client/2
(exla 0.7.0-dev) lib/exla/client.ex:136: EXLA.Client.handle_call/3
(stdlib 5.1.1) gen_server.erl:1113: :gen_server.try_handle_call/4
(stdlib 5.1.1) gen_server.erl:1142: :gen_server.handle_msg/6
(stdlib 5.1.1) proc_lib.erl:241: :proc_lib.init_p_do_apply/3
Last message (from #PID<0.313.0>): {:client, :rocm, [platform: :rocm]}
State: :unused_state
Client #PID<0.313.0> is alive
(stdlib 5.1.1) gen.erl:240: :gen.do_call/4
(elixir 1.15.7) lib/gen_server.ex:1071: GenServer.call/3
(exla 0.7.0-dev) lib/exla/defn.ex:268: EXLA.Defn.__compile__/4
(nx 0.7.0-dev) lib/nx/defn.ex:305: Nx.Defn.compile/3
(bumblebee 0.4.2) lib/bumblebee/text/fill_mask.ex:65: anonymous fn/7 in Bumblebee.Text.FillMask.fill_mask/3
(nx 0.7.0-dev) lib/nx/serving.ex:1810: anonymous fn/3 in Nx.Serving.Default.init/3
(elixir 1.15.7) lib/enum.ex:1693: Enum."-map/2-lists^map/1-1-"/2
(nx 0.7.0-dev) lib/nx/serving.ex:1808: anonymous fn/3 in Nx.Serving.Default.init/3
08:20:12.576 [error] Kino.listen with #Function<42.105768164/1 in :erl_eval.expr/6> failed with reason:
** (exit) exited in: GenServer.call(EXLA.Client, {:client, :rocm, [platform: :rocm]}, :infinity)
** (EXIT) an exception was raised:
** (RuntimeError) RET_CHECK failure (xla/pjrt/gpu/se_gpu_pjrt_client.cc:960) options.num_nodes == 1 || kv_get != nullptr
(exla 0.7.0-dev) lib/exla/client.ex:196: EXLA.Client.unwrap!/1
(exla 0.7.0-dev) lib/exla/client.ex:173: EXLA.Client.build_client/2
(exla 0.7.0-dev) lib/exla/client.ex:136: EXLA.Client.handle_call/3
(stdlib 5.1.1) gen_server.erl:1113: :gen_server.try_handle_call/4
(stdlib 5.1.1) gen_server.erl:1142: :gen_server.handle_msg/6
(stdlib 5.1.1) proc_lib.erl:241: :proc_lib.init_p_do_apply/3
(elixir 1.15.7) lib/gen_server.ex:1074: GenServer.call/3
(exla 0.7.0-dev) lib/exla/defn.ex:268: EXLA.Defn.__compile__/4
(nx 0.7.0-dev) lib/nx/defn.ex:305: Nx.Defn.compile/3
(bumblebee 0.4.2) lib/bumblebee/text/fill_mask.ex:65: anonymous fn/7 in Bumblebee.Text.FillMask.fill_mask/3
(nx 0.7.0-dev) lib/nx/serving.ex:1810: anonymous fn/3 in Nx.Serving.Default.init/3
(elixir 1.15.7) lib/enum.ex:1693: Enum."-map/2-lists^map/1-1-"/2
(nx 0.7.0-dev) lib/nx/serving.ex:1808: anonymous fn/3 in Nx.Serving.Default.init/3

EDIT: Hmm, this is failing on a check of how many devices should be used (which should just be one). Even when I set XLA_TARGET=rocm or cpu, it still fails. Just to double-check, I reverted to this commit to make sure there weren't any recent breaking changes in XLA, and I still get the same error. |
EXLA preallocates ~90% of GPU memory as a bit of an optimization: https://jax.readthedocs.io/en/latest/gpu_memory_allocation.html You can disable it in your client configuration by setting preallocate: false. It's possible, in the case where your memory never frees up, that there is a leak somewhere. |
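For reference, a sketch of where that option lives, expressed as application config (mirroring the Mix.install config shown later in this thread):

# in config/config.exs (or the config: option of Mix.install)
config :exla,
  clients: [
    host: [platform: :host],
    rocm: [platform: :rocm, preallocate: false]
  ]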
I'm not able to reproduce the GPU lockup issue. I'm not entirely sure whether the memory not getting cleared is an XLA or an Elixir issue, as I've occasionally run into the same thing with other applications; I think this is a ROCm driver thing. The memory was released after a few minutes, and now every time I try image generation it initially fills up memory, crashes, and then memory/GPU utilization goes back to 0%. For preallocate, where do I set that? |
Screen.Recording.2023-11-19.at.12.19.27.PM.mov

Here's a screen recording of what happens. It seems there's a bug in EXLA when you change the default config options?

Mix.install(
[
{:nx, github: "elixir-nx/nx", sparse: "nx", override: true},
{:exla, github: "elixir-nx/nx", sparse: "exla", override: true},
{:kino, "~> 0.11.2"},
{:bumblebee, "~> 0.4.2"},
{:kino_bumblebee, "~> 0.4.0"}
],
config: [
nx: [default_backend: {EXLA.Backend, client: :rocm}],
exla: [
clients: [
host: [platform: :host],
rocm: [platform: :rocm, preallocate: false],
]
]
],
system_env: %{
"XLA_ARCHIVE_URL" =>
"file:///home/clay/rocm_builds/xla_extension-x86_64-linux-gnu-rocm2.tar.gz",
"XLA_TARGET" => "rocm"
},
force: true
)

Trying to turn off preallocate, I get:

12:46:48.388 [error] Kino.listen with #Function<42.105768164/1 in :erl_eval.expr/6> failed with reason:
** (RuntimeError) unknown client :cuda given as :preferred_clients. If you plan to use :cuda or :rocm, make sure the XLA_TARGET environment variable is appropriately set. Currently it is set to "rocm"
(exla 0.7.0-dev) lib/exla/client.ex:34: anonymous fn/3 in EXLA.Client.default_name/0
(elixir 1.15.7) lib/enum.ex:4279: Enum.find_list/3
(exla 0.7.0-dev) lib/exla/client.ex:31: EXLA.Client.default_name/0
(elixir 1.15.7) lib/keyword.ex:1383: Keyword.pop_lazy/3
(exla 0.7.0-dev) lib/exla/defn.ex:266: EXLA.Defn.__compile__/4
(nx 0.7.0-dev) lib/nx/defn.ex:305: Nx.Defn.compile/3
(bumblebee 0.4.2) lib/bumblebee/text/fill_mask.ex:65: anonymous fn/7 in Bumblebee.Text.FillMask.fill_mask/3

Mix.install(
[
{:nx, github: "elixir-nx/nx", sparse: "nx", override: true},
{:exla, github: "elixir-nx/nx", sparse: "exla", override: true},
{:kino, "~> 0.11.2"},
{:bumblebee, "~> 0.4.2"},
{:kino_bumblebee, "~> 0.4.0"}
],
config: [
nx: [default_backend: {EXLA.Backend, client: :rocm}],
exla: [
clients: [
rocm: [platform: :rocm, preallocate: false],
],
preferred_clients: [:rocm]
]
],
system_env: %{
"XLA_ARCHIVE_URL" =>
"file:///home/clay/rocm_builds/xla_extension-x86_64-linux-gnu-rocm2.tar.gz",
"XLA_TARGET" => "rocm"
},
force: true
)

This seems to work to disable preallocation. However, GPU util is pinned at 100% and it still crashes. |
So in theory GFX1100 should be supported now as of openxla/xla#7197; however, the latest upstream code fails with the error above. |
How much VRAM do you have? Do other models work? For image generation, try lowering num_images_per_prompt. |
24 GB. It seems something has changed between XLA versions. On the latest version I have to set https://github.com/elixir-nx/nx/blob/d15acedb63ec083736c40d1ac67f805a3c101b7c/exla/c_src/exla/exla_client.cc#L498 to

xla::GetStreamExecutorGpuClient(false, allocator_config, 0, 1));

and then it initializes successfully. I'm able to run pretty much everything dealing with text. However, it crashes immediately when I try Whisper or Stable Diffusion. Setting num_images_per_prompt makes no difference. In the logs I see this message on startup:
|
Ah, if it SEGFAULTs on Whisper and Stable Diffusion then it matches #58 (comment). That's an upstream issue, though. You can try putting it in an Elixir script and getting a core dump (some tips: #58 (comment)).
That's usually the case; they change things around quite a lot :) |
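For reference, a sketch of that script approach (the file name and the failing operation are placeholders; it assumes nx/exla are available via a Mix project or Mix.install, and that core dumps are enabled in the shell first, e.g. with ulimit -c unlimited):

# repro.exs — run with: elixir repro.exs
Nx.global_default_backend({EXLA.Backend, client: :rocm})
Nx.Defn.global_default_options(compiler: EXLA)

# the crashing operation goes here; for example, the conv call
# minimised later in this thread:
left = Nx.reshape(Nx.iota({9}), {1, 1, 3, 3})
right = Nx.reshape(Nx.iota({4}), {4, 1, 1, 1})
Nx.Defn.jit_apply(&Nx.conv/3, [left, right, []])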
Ah, seems to be the case! So it looks like the Whisper and Stable Diffusion segfaults are unresolved? That's unfortunate, as these models run fine with PyTorch. I tried using the Torch backend with Nx without much luck. Anyway, thanks for your help! |
@jonatanklosko following the instructions from the other issue, here's the backtrace. I can provide the core dump as well, although it's quite large (1.9 GB). |
Awesome, perhaps we can minimise the example; here's the one from the other thread:

left = Nx.reshape(Nx.iota({9}), {1, 1, 3, 3})
right = Nx.reshape(Nx.iota({4}), {4, 1, 1, 1})
Nx.Defn.jit_apply(&Nx.conv/3, [left, right, []])

@seanmor5 is that stacktrace enough information for an upstream issue? |
I think so, yeah. It may be worth trying to increase the dirty NIF stack size first, though, unless you tried that already. We have to do this for all conv calls on NVIDIA GPUs. |
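(For reference: the dirty scheduler stack sizes are VM boot flags, so they have to be set before the VM starts, e.g. ELIXIR_ERL_OPTIONS="+sssdcpu 128 +sssdio 128" elixir repro.exs. The +sssdcpu and +sssdio flags take the suggested stack size in kilowords for dirty CPU and IO scheduler threads; 128 here is only an illustrative value, and repro.exs is a placeholder script name.)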
Perfect!
Yeah, there are many optimisations we are yet to do for Stable Diffusion, tracked by elixir-nx/bumblebee#147. |
May I ask which XLA version/rev works now with gfx1100 and Elixir xla/exla? I'm on 0eace6346026b51f8e069a0d670c49b3d4d23a79. |
I had to patch EXLA to make it work. Specifically, I had to change https://github.com/elixir-nx/nx/blob/d94aa08c8cc5c926a8c08ef2832f8526fbf80cb2/exla/c_src/exla/exla_client.cc#L498 to

EXLA_ASSIGN_OR_RETURN(std::unique_ptr<xla::PjRtClient> client,
                      xla::GetStreamExecutorGpuClient(false, allocator_config, 0, 1));

Looks like I'm on revision b0ec7bafd525948f34804d5ad1d9c5939d1b0562. I think that's all I did to make it work. |
Thanks @clayscode, I sent a PR for this here: elixir-nx/nx#1407 |
@clayscode sorry for bothering you, could you provide the binary you built? I am unable to reproduce the build. |
I believe I built using Docker and ROCm 5.6. Unsure if 6.0 will work. I'll try to dig up the files at some point this weekend. |
Using the above setup, I'm unable to get nearly any of the "Neural Net Smartcell" tasks in Livebook working on my 7900 XTX.
With Stable Diffusion, I get the following errors:
With Whisper I get:
I'm unsure if this is an upstream problem with XLA, but I can't find any issues referencing GFX1100 in the XLA repo; however, there is this pull request: openxla/xla#2937