What is the use of some constant in the cuda code of gptq dequantization? #207

frankxyy · 2023-12-04T13:27:33Z

I found some calculations involving constants in the implementation of GPTQ dequantization, and I'm really confused about what these operations mean. Why not simply use w * scale + zero_point? What are y1y16 and z1z16?

Can anybody help? Thanks a lot.

turboderp · 2023-12-04T16:37:12Z
Maintainer

This is the magic number trick which I think traces back to FasterTransformer? The idea is just that it's faster to convert two 4-bit integers at once to a half2 value. With enough threads to hide the latency you might get the same performance without this trick, but I find it does make a difference still even on the 4090, and it especially matters on the 3090 where I'm currently getting 187 tokens/second for 7B 128g.

The trick is that the FP16 value 0x6400 | x breaks down as:

sign bit: 0 (positive)
exponent: 25 - 15 (bias) = 10
mantissa: x (if 0 <= x < 1024)
floating point value: (+1) * 2^10 * (1 + x / 1024) = 1024 + x

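So in other words, reinterpreting the bit pattern 0x6400 | x as a half (a bit cast such as __ushort_as_half(0x6400 | x), not a numeric cast) gives the same value as __int2half_rn(x + 1024) for 0 <= x < 1024. The offset of 1024 cancels out when you subtract the zero offset (z1z16), which itself carries an offset of 1024 (first two weights) or 64 (third and fourth, since they're pre-shifted by 4 bits and then multiplied by 1/16).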
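
The identity above can be checked on the host without a GPU. This is a minimal sketch in Python rather than CUDA; it assumes only the standard `struct` module, whose `"e"` format is IEEE 754 half precision, and the helper name `bits_to_half` is made up for illustration.

```python
import struct

def bits_to_half(bits):
    """Reinterpret a 16-bit pattern as an IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]

# 0x6400: sign 0, exponent field 25, mantissa 0, which is exactly 1024.0.
# OR-ing x into the 10 mantissa bits adds 2^10 * x / 1024 = x.
for x in (0, 1, 15, 240, 1023):
    assert bits_to_half(0x6400 | x) == 1024.0 + x
```

The loop covers both cases used in the thread: a plain 4-bit value (0..15) and a 4-bit value pre-shifted by 4 bits (multiples of 16 up to 240).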

y1y16 is just an array containing [(1, 1), (1/16, 1/16)], used to compensate for shifting some of the weights by 4 bits. It's sort of vestigial, but I trust the compiler to reduce it to a single constant and not waste any registers on it.

3 replies

frankxyy Dec 4, 2023
Author

@turboderp Hi, thank you for the clear explanation. I understand the 0x6400 | x bit operation now. One thing I still don't understand is why the weights are shifted. Maybe I need to read the code to find out...

turboderp Dec 4, 2023
Maintainer

It's just to avoid two shift operations. The weights are packed like so:

hhhhffff ddddbbbb ggggeeee ccccaaaa  <-- q_weights

00000000 00001111 00000000 00001111  <-- mask_1
00000000 0000bbbb 00000000 0000aaaa  <-- q_weights & mask_1
01100100 0000bbbb 01100100 0000aaaa  <-- (q_weights & mask_1) | 0x64006400 = half2(a+1024, b+1024)

00000000 11110000 00000000 11110000  <-- mask_2
00000000 dddd0000 00000000 cccc0000  <-- q_weights & mask_2
01100100 dddd0000 01100100 cccc0000  <-- (q_weights & mask_2) | 0x64006400 = half2(c*16+1024, d*16+1024)

The hfma2s afterwards add (-1024 - zero) to the first value, and the second is first multiplied by 1/16 before (-64 - zero) is added. In either case the result is w - zero.
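
The mask, OR, and fused multiply-add steps above can be emulated on the host. This is a rough Python sketch, not the actual kernel: the function names (`pack`, `as_half2`, `dequant_first_four`) are hypothetical, the fma is done in ordinary Python floats (exact here, because every intermediate is a small integer), and only the first four weights a, b, c, d from the diagram are handled.

```python
import struct

def bits_to_half(bits):
    """Reinterpret a 16-bit pattern as an IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]

def as_half2(word):
    """Reinterpret a 32-bit word as a (low half, high half) pair."""
    return bits_to_half(word & 0xFFFF), bits_to_half(word >> 16)

def pack(a, b, c, d, e, f, g, h):
    """Pack eight 4-bit values in the order shown in the diagram."""
    return (a | c << 4 | e << 8 | g << 12 |
            b << 16 | d << 20 | f << 24 | h << 28)

def dequant_first_four(q_weights, zero):
    """Return [a - zero, b - zero, c - zero, d - zero]."""
    lo = as_half2((q_weights & 0x000F000F) | 0x64006400)  # (a + 1024, b + 1024)
    hi = as_half2((q_weights & 0x00F000F0) | 0x64006400)  # (16c + 1024, 16d + 1024)
    # fma with y = 1 and z = -1024 - zero
    a, b = (v * 1.0 + (-1024.0 - zero) for v in lo)
    # fma with y = 1/16 and z = -64 - zero
    c, d = (v * (1.0 / 16.0) + (-64.0 - zero) for v in hi)
    return [a, b, c, d]

q = pack(3, 7, 12, 15, 9, 1, 2, 5)
assert dequant_first_four(q, zero=8) == [-5.0, -1.0, 4.0, 7.0]
```

Note that c and d never need a right shift: the 1/16 factor in the multiply-add does that work, which is the point of keeping them pre-shifted.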

Since scale is constant for the group, it is applied to the dot product of the group instead of the individual weights.
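
Why this is valid: with one scale per group, sum(scale * (w - zero) * x) equals scale * sum((w - zero) * x), so a single multiply per group replaces one per weight. A small Python sketch with made-up values and hypothetical function names:

```python
import math

def dot_scale_per_weight(q, x, zero, scale):
    """Scale every dequantized weight before accumulating."""
    return sum(scale * (w - zero) * xi for w, xi in zip(q, x))

def dot_scale_per_group(q, x, zero, scale):
    """Accumulate (w - zero) * x first, then apply the scale once."""
    return scale * sum((w - zero) * xi for w, xi in zip(q, x))

q = [3, 7, 12, 15]          # quantized weights of one group (illustrative)
x = [1.0, -2.0, 0.5, 4.0]   # matching activations (illustrative)
assert math.isclose(dot_scale_per_weight(q, x, zero=8, scale=0.25),
                    dot_scale_per_group(q, x, zero=8, scale=0.25))
```

In half precision the two orderings can round slightly differently, so the equality is exact only in the mathematical sense.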

frankxyy Dec 5, 2023
Author

@turboderp Got it. So the point is to operate directly on the packed weight word instead of unpacking each value separately.
