Make OpenCL on Intel compile without opening device, working for A770 and Max 1100 #9462

Open · wants to merge 51 commits into master
Conversation

@amarmemic

amarmemic commented Mar 15, 2025

WIP:

Decided to integrate the Intel OpenCL offline compiler (ocloc):

   https://github.com/intel/compute-runtime/tree/master/shared/offline_compiler

which uses the Intel Graphics Compiler (IGC) as its backend (maybe use IGC directly?):

   https://github.com/intel/intel-graphics-compiler/tree/master

There is also a multi-vendor alternative based on clang:

   https://github.com/KhronosGroup/OpenCL-Guide/blob/main/chapters/os_tooling.md

Contribution:

  • created Python bindings for libocloc.so (see the ctypes sketch after this list)

  • implemented an IntelOfflineCompiler class that uses ocloc to compile without opening a device via the OpenCL API

  • installed libocloc.so + its header API on the target devices that run the unit tests, so they succeed (pinned package version)

    • fixed: Error! Loading of FCL library has failed! Filename: libigdfcl.so.2
      Error! FCL initialization failure. Error code = -6
      Build failed with error code: -6

      • intel-opencl-icd (the Intel OpenCL GPU API implementation plus several libraries) was not installed; we cannot test on AMD machines, since we need to run the compiled binary to check the result.

  • added unit tests

  • cleaned up code + naming + moved files to the correct location
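
A minimal sketch of what the ctypes binding might look like. The oclocInvoke signature follows ocloc_api.h in intel/compute-runtime; the kernel source, the "acm-g10" device name (Arc A770), and the in-memory source passing are illustrative assumptions, not the PR's exact code:

    import ctypes

    ocloc = ctypes.CDLL("libocloc.so")

    src = b"__kernel void add(__global float *a) { a[get_global_id(0)] += 1.0f; }"
    # ocloc command line; -file names the in-memory source below, -device picks the target.
    args = [b"ocloc", b"compile", b"-file", b"kernel.cl", b"-device", b"acm-g10"]
    argv = (ctypes.c_char_p * len(args))(*args)

    # Pass the kernel source from memory instead of from disk.
    buf = ctypes.create_string_buffer(src)
    sources = (ctypes.POINTER(ctypes.c_ubyte) * 1)(ctypes.cast(buf, ctypes.POINTER(ctypes.c_ubyte)))
    src_lens = (ctypes.c_uint64 * 1)(len(src) + 1)  # length incl. NUL terminator (assumption)
    src_names = (ctypes.c_char_p * 1)(b"kernel.cl")

    num_out = ctypes.c_uint32()
    data_out = ctypes.POINTER(ctypes.POINTER(ctypes.c_ubyte))()
    len_out = ctypes.POINTER(ctypes.c_uint64)()
    name_out = ctypes.POINTER(ctypes.c_char_p)()

    ret = ocloc.oclocInvoke(
        len(args), argv,
        1, sources, src_lens, src_names,   # in-memory sources
        0, None, None, None,               # no input headers
        ctypes.byref(num_out), ctypes.byref(data_out),
        ctypes.byref(len_out), ctypes.byref(name_out))
    assert ret == 0, f"ocloc failed with error code {ret}"
    # data_out / name_out now hold the compiled outputs (device binary, build log, ...).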

Open questions:

  1. IntelOfflineCompiler is currently implemented in ops_gpu.py; is that the correct location?
    • move it into a separate file and let reviewers decide what to do with it.
  2. How should linter errors be handled in auto-generated files (produced with clang2py) like intel_ocloc.py?
    • noqa them and see if the reviewer screams.
  3. It is hard to find the device ID of a specific Intel GPU via the OpenCL API (e.g. Device ID 0x46A6 in https://www.intel.com/content/www/us/en/products/sku/96158/intel-core-i512600he-processor-18m-cache-up-to-4-50-ghz/specifications.html). ocloc expects such a device ID in order to compile for that device (OpenCL has something with a similar name, but it is different).
    • According to ChatGPT, Device ID 0x46A6 is the PCI device ID, which may be recoverable from CL_DEVICE_UUID_KHR (the Intel driver embeds the device ID there), though UUID fields are implementation-defined; see the query sketch after this list.
  • Run the unit tests on A770 and Max 1100 (I cannot do it, no hardware access).
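
Regarding question 3, a minimal sketch of reading the UUID — assuming libOpenCL.so.1 is loadable and taking the first GPU of the first platform; CL_DEVICE_UUID_KHR (0x106A) comes from the cl_khr_device_uuid extension, and where the PCI device ID sits inside the 16 bytes is implementation-defined, so it must be verified against real hardware:

    import ctypes

    CL_DEVICE_UUID_KHR = 0x106A  # from the cl_khr_device_uuid extension
    CL_DEVICE_TYPE_GPU = 1 << 2

    cl = ctypes.CDLL("libOpenCL.so.1")
    cl.clGetPlatformIDs.argtypes = [ctypes.c_uint32, ctypes.POINTER(ctypes.c_void_p),
                                    ctypes.POINTER(ctypes.c_uint32)]
    cl.clGetDeviceIDs.argtypes = [ctypes.c_void_p, ctypes.c_uint64, ctypes.c_uint32,
                                  ctypes.POINTER(ctypes.c_void_p), ctypes.POINTER(ctypes.c_uint32)]
    cl.clGetDeviceInfo.argtypes = [ctypes.c_void_p, ctypes.c_uint32, ctypes.c_size_t,
                                   ctypes.c_void_p, ctypes.POINTER(ctypes.c_size_t)]

    nplat = ctypes.c_uint32()
    cl.clGetPlatformIDs(0, None, ctypes.byref(nplat))
    plats = (ctypes.c_void_p * nplat.value)()
    cl.clGetPlatformIDs(nplat.value, plats, None)

    ndev = ctypes.c_uint32()
    cl.clGetDeviceIDs(plats[0], CL_DEVICE_TYPE_GPU, 0, None, ctypes.byref(ndev))
    devs = (ctypes.c_void_p * ndev.value)()
    cl.clGetDeviceIDs(plats[0], CL_DEVICE_TYPE_GPU, ndev.value, devs, None)

    uuid = (ctypes.c_ubyte * 16)()
    cl.clGetDeviceInfo(devs[0], CL_DEVICE_UUID_KHR, ctypes.sizeof(uuid), uuid, None)
    print("device UUID:", bytes(uuid).hex())
    # On Intel's driver the PCI device ID (e.g. 0x46A6) is expected somewhere in
    # these 16 bytes; the exact offset and endianness are an assumption to verify.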

@geohot
Collaborator

geohot commented Mar 17, 2025

I'll leave this open, but it's clear to me this PR wasn't carefully read 3 times

@amarmemic
Author

amarmemic commented Mar 17, 2025

I'll leave this open, but it's clear to me this PR wasn't carefully read 3 times

What exactly do you mean? I am trying to contribute for the first time, so it is a bit tricky to understand all the requirements and the context of the task.

All I know is that the compilation with ocloc works.

My problem is answering the open questions. I would appreciate it if you could give some requirements and hints on that bounty.

Since you did not close the PR, the approach cannot be totally wrong.

amarmemic marked this pull request as draft March 18, 2025 11:01
geohot added the bounty locked label Mar 19, 2025
@geohot
Collaborator

geohot commented Mar 19, 2025

Bounty locked, but this code needs to sparkle before we can merge it. It's the right approach, yet things aren't in the right places, there are whitespace changes, there's a random , etc.

Contributor

Changes

Name                                            Lines    Diff    Tokens/Line    Diff
--------------------------------------------  -------  ------  -------------  ------
tinygrad/runtime/ops_gpu.py                       108      +4           21.1    +0.5
tinygrad/runtime/support/compiler_clintel.py       27     +27           14.8   +14.8


total line changes: +31

amarmemic marked this pull request as ready for review March 22, 2025 08:47
@deftdawg

deftdawg commented Mar 27, 2025

@amarmemic, some silly questions (from a fellow A770 owner):

  1. What is the approximate performance difference from using ocloc vs the current opencl/gpu method?
  2. Does one need to do anything special to activate ocloc? Do we still need to provide INTEL=1 as required by this line (see below), or will ocloc work without it?

renderer = IntelRenderer() if "cl_intel_subgroup_matrix_multiply_accumulate" in self.device_exts and getenv("INTEL") else OpenCLRenderer()

@amarmemic
Author

amarmemic commented Mar 27, 2025

@amarmemic, some silly questions (from a fellow A770 owner):

  1. What is the approximate performance difference from using ocloc vs the current opencl/gpu method?
  2. Does one need to do anything special to activate ocloc? Do we still need to provide INTEL=1 as required by this line (see below), or will ocloc work without it?

@deftdawg

I did not run any benchmarks, but I think there should not be a performance difference, because both approaches use the same IGC backend (see the second link above).

No, you don't need to set anything; just use the IntelOfflineCompiler class (see the sketch below). But anyway, this PR will not be merged, according to George in the last meeting.
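
For illustration, a hypothetical usage sketch, assuming the class follows tinygrad's usual Compiler interface where compile(src) returns the device binary; the import path is taken from this PR's diff, while the constructor argument and device name are assumptions:

    from tinygrad.runtime.support.compiler_clintel import IntelOfflineCompiler

    # "acm-g10" (Arc A770) is an assumed ocloc device name; pick the one matching your GPU.
    compiler = IntelOfflineCompiler("acm-g10")
    binary = compiler.compile('__kernel void add(__global float *a) { a[get_global_id(0)] += 1.0f; }')
    assert isinstance(binary, bytes)  # device binary, produced without opening an OpenCL device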

@deftdawg

deftdawg commented Apr 1, 2025

I did not run any benchmarks, but I think there should not be a performance difference, because both approaches use the same IGC backend (see the second link above).

Ok, thanks... I thought I had it configured; however, I hadn't noticed any performance difference, so I was wondering about that. I recently fell down a bit of a rabbit hole trying to figure out why inference performance seems slow using tinygrad via exo.
