-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
QOI_DELTA Proof of Concept #23
Comments
New iteration that's more competitive wrt compression. The code is a spiders web of ifdef's for prototyping but it's attached anyway incase anyone wants to bench this snapshot, WIP. A highlight is a GDIFF_16 definition with a 2 bit tag, the extra bits are spent on red and blue which seems to be strong and it makes the encoding neater. DIFF_24 looks like the next opcode for the chopping block.
op count: 265215890, 0x00=6.46%, 0x10=1.82%, 0x20=20.50%, 0x3B=3.47%, 0x3C=0.13%, 0x3D=0.32%, 0x3E=0.60%, 0x3F=0.03%, 0x40=36.44%, 0x80=30.22%,
|
what is QOI_8_LINEAR_RGB_16? I don't understand the description. |
When the diff of R G and B are the same this tag can be used followed by the diff byte that's applied to all three channels, maybe lockstep would be a better name. Helps with long-range matches, may particularly help with black and white images stored as RGB (TBD), less effective when used alongside GDIFF_16 because GDIFF_16 is pretty good at longer-range matches itself. The first number of the above definitions is the length of the tag, the second is the length of the encoding. |
Interesting experiment! A couple of (not necessarily good) ideas:
|
I had a similar idea, but with those values (plus center zero) used for permutations of (±2, 0, 0). If it's reusing, better to reuse values from larger diff which coincide with smaller. |
That is interesting, basically a conditional >8 bit tag when needed (the condition would be expensive particularly when switching but it's worth exploring). One massive positive would be to RGB input when one op deals with alpha and the other doesn't (ie for the cost of one byte you free up bits otherwise wasted on handling alpha that will never come, no need to handle RGB/RGBA differently it'll sort itself out).
Changing QOI_GDIFF_16 to QOI_GDIFF2_16 (r3g8b3) with no other changes does this: op count: 265215890, total size: 417281211, 0x00=6.46%, 0x10=6.81%, 0x20=20.50%, 0x3B=3.64%, 0x3C=0.13%, 0x3E=0.60%, 0x3F=0.03%, 0x40=31.60%, 0x80=30.22%,
It eliminates the need for QOI_8_LINEAR_RGB_16 and makes the encoding neater which should make an optimised version faster than QOI_GDIFF_16 all other things equal. However because compression takes a hit so do encode/decode times.
If I'm understanding correctly this means that for example 5 of the 9 GB states could have an R of +2. Is there a way to detect those 5 cases without tanking encode time? |
I mean, in addition to 26 {−1..1} states (excluding full zero), there is also 6 {±2,0,0}, {0, ±2,0}, {0,0,±2}. So if any two components is the same, third can differ by 2. It can be detected with |
I implemented your version as QOI_3_DELTA2_8 (shown in build qoi-d2run5). The delta2 build uses an 8 value RUN_8 because space was needed for the RUN_16 etc flags that normally reside in delta1. Two delta1 builds are used for comparison, one also with an 8 value RUN_8 which delta2 should strictly beat (qoi-d2run5, 9 unused ops), and one with a 16 value RUN_8 that takes advantage of the compactness of delta1 (qoi-d1run4, 1 unused op because the code has been reworked since delta4 and it's no longer convenient to have 8 bit tags increase RUN_8's range). tl;dr d2run5 marginally beats d1run4, but in general it depends on the strength and packing of the rest of the opcodes.
|
So my intuition about value distribution is right. Maybe even better would be to use original 64 values and distribute them more symmetrically. |
Even better to take a leaf from GDIFF's green-based delta's by creating a variant that fits in 8 bits (see delta7). delta6 stores vr vg vb in a distribution skewed towards green (vr=-2..1, vg=-3..3, vb=-2..1), delta7 is similar except it stores r and b's difference from green giving green much more weight again:
delta6/7 have been mildly optimised by re-ordering the if else chains. delta7 takes notably longer than delta6 I think purely because GDELTA needs to generate vg_r andvg_b, not just LUMA (aka the compiler must have automatically moved vg_r and vg_b generation into LUMA for delta6). Averaging less than 1 byte per pixel is a neat trick. Thank you RLE ;) |
For delta7's QOI_OP_GDELTA encoding:
Changing that encoding line to:
would make it easier to visually inspect the bytecode (in hex or binary): the 8-bit op becomes
|
You're right, because vg_r and vg_b are powers of two it could be done with shifts, I discounted it because RUN8 takes up the last 16 slots and I thought not using ANS would scatter them (making RUN8 handling 16 special cases instead of one MASK_4). I was wrong, by putting green first RUN8 is not scattered. One small refinement is to do the following to put RUN8 first, this way RUN8 doesn't bisect GDELTA.
The encoding should be a nice speedup, I'm going to try it today. Divide by 4 is already a right shift (which is probably why delta7 still has decent speed), but doing grb allows the divide by 7 to be eliminated which may be big. |
The results are in. Tried two variants of delta8 out of curiosity with slightly different encode logic (same format):
Surely the compiler is smart enough to range check and see they're equivalent, especially when using -O3 as always?
Apparently not, and not having to sum the discrete equations unnecessarily helps the decoder big time. Thanks for pointing out the optimisation, it's unlikely I'd ever have noticed. For reference all my testing is done on a skylake laptop, Linux 64 bit, gcc 9.3.0. |
This is a fresh attempt at a variant that focuses on compression but not to the complete detriment of performance. The goal is to apply LUMA's modelling to a range of output sizes, LUMA is a good encoding so lets see how far it can go. LUMA has been renamed LUMA464 to reflect the bit distribution between vg_r, vg, and vg_b, other LUMA ops do the same. Every delta9 build is an iteration of the previous build, the baseline has enough free opcodes to make iteration easy. Builds
Results
Notes
delta9h config
|
Here's demo9h, a little-endian (i.e. faster) version of delta9h. It's not as fast as demo10, but it's in a plausible spot on the simplicity / speed / compression trade-off. master is commit 2ee2169 (2021-12-11).
|
Thank you for working your black magic, taming that decompression performance in particular is awesome. |
The results in the last table look actually very promising. Maybe QOI 2.0 😉? |
Things have progressed a little, I've found a single combination that averages 401 on images and that can probably be optimised to below 400 by exhaustively trying different LUMA ops. Effort level 0 of the QOIP thread shows the best single combination I can find to date. |
QOI_DELTA might be a viable replacement for QOI_DIFF_8. It can store -1..1 for each of rgb compared to QOI_DIFF_8 which stores -2..1. Narrowing the range allows it to be packed into 5 bits with a 3 bit tag. There's 5 unused values which can be used as 8 bit tags. Assuming special-casing RGB and RGBA as a last-resort encoding instead of the wasteful QOI_COLOR op is the way forwards, we will probably need some space for 8 bit tags so this is convenient. In the example below they've been filled with QOI_COLOR_RGB, QOI_COLOR_RGBA, QOI_COLOR_A, QOI_RUN_16, QOI_RUN_24 (QOI_COLOR_A doesn't seem very useful). The example hasn't been tuned for efficiency or compression and some of the other opcode choices are questionable, this is a WIP that is being iterated on and only meant as an example of QOI_DELTA.
op count: 265215890
QOI_DIFF_16=30.78%, QOI_INDEX=30.22%, QOI_DELTA=20.50%, QOI_COLOR_RGB=6.53%, QOI_RUN_8=6.46%, QOI_DIFF_24=4.74%, QOI_RUN_16=0.60%, QOI_COLOR_RGBA=0.12%, QOI_COLOR_24=0.03%, QOI_COLOR_A=0.01%
The text was updated successfully, but these errors were encountered: