-
This is the magic-number trick, which I think traces back to FasterTransformer. The idea is simply that it's faster to convert two 4-bit integers to a half2 value at once. With enough threads to hide the latency you might get the same performance without the trick, but I find it still makes a difference even on the 4090, and it matters especially on the 3090, where I'm currently getting 187 tokens/second for 7B 128g. The trick is that the FP16 value
So in other words
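To make the arithmetic concrete, here is a minimal Python sketch of the magic-number conversion for a single byte holding two 4-bit weights. It is an illustration under stated assumptions, not the kernel code: the function names are hypothetical, the dequantization convention is assumed to be `(q - zero) * scale`, and plain Python floats stand in for the half2 FMA a CUDA kernel would use. The constants 1, 1/16, -1024 - zero and -64 - zero are my reading of what names like y1y16 and z1z16 refer to, which is also an assumption.

```python
import struct

# Bit pattern of 1024.0 in FP16. At this exponent one mantissa step equals 1.0,
# so small integers OR-ed into the low mantissa bits are added exactly.
MAGIC = 0x6400


def bits_to_half(bits: int) -> float:
    """Reinterpret a 16-bit pattern as an IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


def dequant_low_nibble(byte: int, zero: int, scale: float) -> float:
    # The low nibble lands in mantissa bits 0..3: the half holds 1024 + q.
    h = bits_to_half(MAGIC | (byte & 0x0F))
    # (1024 + q) * 1 + (-1024 - zero) = q - zero
    return (h * 1.0 + (-1024.0 - zero)) * scale


def dequant_high_nibble(byte: int, zero: int, scale: float) -> float:
    # The high nibble stays in bits 4..7 (no shift): the half holds 1024 + 16*q.
    h = bits_to_half(MAGIC | (byte & 0xF0))
    # (1024 + 16*q) / 16 + (-64 - zero) = q - zero
    return (h * (1.0 / 16.0) + (-64.0 - zero)) * scale


def dequant_reference(q: int, zero: int, scale: float) -> float:
    # Straightforward version, assuming the (q - zero) * scale convention.
    return (q - zero) * scale


if __name__ == "__main__":
    byte, zero, scale = 0xB3, 8, 0.5  # low nibble 3, high nibble 11
    print(dequant_low_nibble(byte, zero, scale), dequant_reference(3, zero, scale))
    print(dequant_high_nibble(byte, zero, scale), dequant_reference(11, zero, scale))
```

Both paths need only a mask, an OR and one multiply-add per value, with no integer-to-float conversion instruction. Avoiding the shift for the high nibble is why a 1/16 factor and a -64 offset appear next to the 1 and -1024 pair.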
-
I found some calculations involving constants in the implementation of GPTQ dequantization and I'm really confused about what this operation means. Why not simply use w * scale + zero_point? What are y1y16 and z1z16?
Can anybody help? Thanks a lot.