SASS assembler [wip] #9521

Closed
wants to merge 4 commits into from


mesozoic-egg
Contributor

I integrated CuAssembler into tinygrad and added support for the 4090 (SM_89). Right now all it does is reverse nvdisasm's output and reproduce the cubin file.

Run extra/sass/demo/add.py with NV=1 SASS=1 SM=89 python add.py (or rand.py in the demo folder) to see that the disassembled code is re-assembled into a valid cubin and produces the desired output.
[Screenshot 2025-03-21 at 04 30 19]

The solver parses the output of nvdisasm and formulates a linear system that can be used to encode future instructions. For example, if MOV R1, R2 produces 0x0001 and MOV R1, R3 produces 0x0002, then when we come across MOV R1, R4 we can deduce that its binary code is 0x0003. In the ideal case, given enough disassembled code, we can figure out the exact encoding for MOV R{number}. More explanation can be found in CuAssembler's repo linked above.
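As a rough illustration of the idea (not CuAssembler's actual solver), the toy numbers above can be fitted with a one-operand linear model, code = base + weight * reg. The function names fit_single_operand and predict are made up for this sketch, and the model assumes only the source register varies between samples.

```python
from fractions import Fraction

def fit_single_operand(samples):
    """Fit code = base + weight * reg from exactly two (reg, code) samples."""
    (r0, c0), (r1, c1) = samples
    weight = Fraction(c1 - c0, r1 - r0)   # slope between the two samples
    base = c0 - weight * r0               # intercept
    return base, weight

def predict(base, weight, reg):
    """Evaluate the fitted model for an unseen register number."""
    value = base + weight * reg
    if value.denominator != 1:
        raise ValueError("model does not give an integer encoding")
    return int(value)

# Toy samples from the text: MOV R1, R2 -> 0x0001 and MOV R1, R3 -> 0x0002.
base, weight = fit_single_operand([(2, 0x0001), (3, 0x0002)])

# Unseen instruction MOV R1, R4.
code_r4 = predict(base, weight, 4)
print(hex(code_r4))
```

With more operands and modifiers the same idea becomes a matrix equation with one unknown weight per column, which is what the ValMatrix / PSol / Rhs fields in the solution files record.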

I have included the sm_80.txt and sm_89.txt solution files inside extra/sass/assembler/CuInsRepos/. For example, the sm_89 solution for the above instruction MOV is:

CuInsAssembler("", {"InsKey" : 'MOV_R_R', 
  "InsRepos" : [([7, 0, 255], ['0_MOV'], 70835497244139894895106),
([7, 0, 6], ['0_MOV'], 70835497243070448038402),
([7, 6, 10], ['0_MOV'], 70835497243087628300802),
([7, 6, 4], ['0_MOV', '2_reuse'], 10633823966279397818727699544101253634),
([8, 2, 255], ['0_MOV'], 70835497244139895030274)], 
  "InsModiSet" : {'0_MOV': 0, '2_reuse': 1}, 
  "ValMatrix" : Matrix([
[1, 0, 7, 0, 255],
[1, 0, 7, 0,   6],
[1, 0, 7, 6,  10],
[1, 1, 7, 6,   4],
[1, 0, 8, 2, 255]]), 
  "PSol" : Matrix([
[             0xf000000000000000202], # 0_MOV
[ 0x8000000000000000000000000000000], # 2_reuse
[                            0x1000], # Pred
[                           0x10000], # V1
[                       0x100000000], # V2
]), 
  "PSolFac" : 1, 
  "ValNullMat" : None, 
  "InsRecords" : [(0x0000d0, 0x0000000000000f00000000ff00007202, "MOV R0, RZ ;"),
(0x000240, 0x0000000000000f000000000600007202, "MOV R0, R6 ;"),
(0x000540, 0x0000000000000f000000000a00067202, "MOV R6, R10 ;"),
(0x000100, 0x0800000000000f000000000400067202, "MOV R6, R4.reuse ;"),
(0x000100, 0x0000000000000f00000000ff00028202, "@!P0 MOV R2, RZ ;"),
], 
  "ErrRecords" : {},   "Rhs" : Matrix([
[            0xf00000000ff00007202],
[            0xf000000000600007202],
[            0xf000000000a00067202],
[0x800000000000f000000000400067202],
[            0xf00000000ff00028202],
]), 
  "Arch" : CuSMVersion(80) })
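One way to read the dump above: each ValMatrix row, multiplied by the PSol column, should reproduce the corresponding encoding. The sketch below checks that with plain Python integers, using numbers copied from the dump. The helper name encode is made up for this example, and the column meanings (Pred, V1 as destination, V2 as source) are read off the InsRecords rather than taken from CuAssembler's documentation.

```python
# Weights copied from "PSol"; column order follows its comments:
# [0_MOV, 2_reuse, Pred, V1, V2].
PSOL = [
    0xf_000000_000000_000202,                # 0_MOV
    0x8_000000_000000_000000_000000_000000,  # 2_reuse
    0x1000,                                  # Pred
    0x10000,                                 # V1 (destination register, as read off InsRecords)
    0x100000000,                             # V2 (source register, 255 appears for RZ)
]

# Rows copied from "ValMatrix".
VAL_MATRIX = [
    [1, 0, 7, 0, 255],   # MOV R0, RZ
    [1, 0, 7, 0, 6],     # MOV R0, R6
    [1, 0, 7, 6, 10],    # MOV R6, R10
    [1, 1, 7, 6, 4],     # MOV R6, R4.reuse
    [1, 0, 8, 2, 255],   # @!P0 MOV R2, RZ
]

# Encodings copied from the third element of each "InsRepos" tuple.
EXPECTED = [
    70835497244139894895106,
    70835497243070448038402,
    70835497243087628300802,
    10633823966279397818727699544101253634,
    70835497244139895030274,
]

def encode(row, psol=PSOL):
    """Dot product of one value row with the solved weights."""
    return sum(v * w for v, w in zip(row, psol))

# Indices of rows whose dot product does not match the recorded encoding.
mismatches = [i for i, (row, want) in enumerate(zip(VAL_MATRIX, EXPECTED))
              if encode(row) != want]

# Hypothetical unseen instruction "MOV R3, R5 ;" (Pred value 7, as in the
# unpredicated records). This is what the linear model predicts, not
# something checked against nvdisasm here.
predicted = encode([1, 0, 7, 3, 5])
print(mismatches, f"{predicted:#034x}")
```

An empty mismatches list means the five recorded instructions are consistent with the solved weights; the predicted value is only as trustworthy as the assumption that MOV stays linear in these fields.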

To reproduce the solution, check out the README.md in extra/sass/solver/

The diff is not reviewable yet, but the only interesting part is below: it takes ptxas's output, disassembles it (CubinFile() uses nvdisasm), and compiles the disassembly back into a cubin:

class SASSCompiler(Compiler):
...
      subprocess.run(["nvcc", "-arch", self.arch, "--ptx", "-x", "cu", "-o", ptx_file.name, cuda_file.name], check=True)
      subprocess.run(["ptxas", "-arch", self.arch, "-m64", "-o", ptxas_cubin_file.name, ptx_file.name], check=True)
      cf = CubinFile(ptxas_cubin_file.name)
      cf.saveAsCuAsm(cuasm_file.name)
      parser = CuAsmParser()
      parser.parse(cuasm_file.name)
      parser.saveAsCubin(cuasm_cubin_file.name)
      cubin = cuasm_cubin_file.read()
      return cubin

I think the first step is to get the assembler to work on all the tests before adding a renderer and tuning for speed. Would like to hear some thoughts!

use nv hcq for cubin file

integrate sass compiler

example for zero

r prefix for regex

wip

temp: remove prefix, always re-compile

solver

dedup dumped cuda file

handle a lot of cu files

solver wip

remove dedup

save cu

wip

wip

wip

wip

sm_80 solutions'

fix imports

add sm_89 solution

example kernel that have problems

wip

wip

rand

wip
Contributor

This branch is currently behind tinygrad/master. The line count difference bot is disabled.

@geohot
Collaborator

geohot commented Mar 21, 2025

But this doesn't go from PTX -> SASS, does it?

@mesozoic-egg
Contributor Author

No, it's currently CUDA --> PTX --> cubin --> disassembled SASS --> re-assembled cubin. The first two steps are handled by nvcc and ptxas, the third by nvdisasm, and the last by CuAssembler.

The goal is to have SASS rendered directly and assembled into a cubin for ops_nv to load and run.

Might make more sense when I implement the renderer...

@geohot
Collaborator

geohot commented Mar 28, 2025

The goal of a SASS assembler would be to go directly from PTX; I don't understand what value disassembling and reassembling adds.

@geohot geohot closed this Mar 28, 2025