I integrated CuAssembler into tinygrad and added support for the RTX 4090 (SM_89). Right now all it does is reverse `nvdisasm`: it takes the disassembly and reproduces the cubin file. Run `extra/sass/demo/add.py` with `NV=1 SASS=1 SM=89 python add.py` (or `rand.py` in the demo folder) to see that the disassembled code is re-assembled back into a valid cubin that produces the desired output.

The solver parses the output of `nvdisasm` and formulates a linear system that can be used to encode future instructions. For example, if `MOV R1, R2` produces 0x0001 and `MOV R1, R3` produces 0x0002, then when we come across `MOV R1, R4` we can deduce that its binary code is 0x0003. In the ideal case, given enough disassembled code, we can figure out the exact encoding for `MOV` and for `R{number}` operands. More explanation can be found in CuAssembler's repo linked above.

I have included the `sm_80.txt` and `sm_89.txt` solution files inside `extra/sass/assembler/CuInsRepos/`. For example, the sm_89 solution for the above instruction `MOV` is:

To reproduce the solution, check out the `README.md` in `extra/sass/solver/`.
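The linear-system idea from the `MOV` example can be illustrated with a toy sketch. This is not CuAssembler's actual solver: it assumes the encoding of one instruction form is an affine function of its operand values, uses a hypothetical featurization (`operand_vector`: a constant 1 for the opcode bits plus each register index), and reuses the made-up codes from the example above rather than real SASS encodings. `predict_encoding` is likewise a hypothetical name; it does exact Gaussian elimination over `Fraction`s and only answers when the query lies in the span of the samples seen so far.

```python
import re
from fractions import Fraction


def operand_vector(ins):
    # Hypothetical featurization: "MOV R1, R2" -> [1, 1, 2].
    # The leading 1 is a constant term that absorbs the fixed opcode bits;
    # each register operand contributes its index. Real SASS operands
    # (predicates, immediates, modifiers) would need more columns.
    return [1] + [int(n) for n in re.findall(r"\bR(\d+)\b", ins)]


def predict_encoding(samples, query):
    """Predict the code for `query` from (instruction, code) samples of one form.

    Returns None when the samples do not pin the query down yet.
    Assumes every instruction shares the same operand layout.
    """
    rows = []  # echelon rows: (reduced vector, reduced code, pivot column)
    for ins, code in samples:
        v = [Fraction(x) for x in operand_vector(ins)]
        c = Fraction(code)
        # Forward elimination against the rows found so far.
        for pv, pc, p in rows:
            f = v[p] / pv[p]
            if f:
                v = [a - f * b for a, b in zip(v, pv)]
                c -= f * pc
        pivot = next((i for i, a in enumerate(v) if a), None)
        if pivot is None:
            # The sample is a combination of earlier ones; its code must agree.
            if c:
                raise ValueError(f"{ins!r} contradicts a linear encoding")
            continue
        rows.append((v, c, pivot))

    # Express the query as a combination of the sample rows and apply the
    # same combination to their codes.
    q = [Fraction(x) for x in operand_vector(query)]
    acc = Fraction(0)
    for pv, pc, p in rows:
        f = q[p] / pv[p]
        q = [a - f * b for a, b in zip(q, pv)]
        acc += f * pc
    if any(q) or acc.denominator != 1:
        return None  # not in the span of the samples: need more disassembly
    return int(acc)


samples = [("MOV R1, R2", 0x0001), ("MOV R1, R3", 0x0002)]
print(hex(predict_encoding(samples, "MOV R1, R4")))  # 0x3
print(predict_encoding(samples, "MOV R2, R4"))  # None: dst was never varied
```

The second query returns `None` because every sample uses `R1` as the destination, so the system says nothing about the destination field yet; this is the "given enough disassembled code" caveat in miniature. A sample whose code contradicts the earlier ones raises `ValueError`, which in this toy model signals that the encoding is not linear in the chosen columns.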
The diff is not reviewable yet, but the only interesting part is below, where it takes ptxas's output, runs the disassembler (`CubinFile()` uses `nvdisasm`), and compiles the disassembly back into a cubin.

I think the first step is to get the assembler working on all the tests before adding a renderer and tuning for speed. Would like to hear some thoughts!