QOIP format #31
A good chunk of things have been implemented today, including qoipbench being able to benchmark crunching, fastpath encode/decode of the default combination (currently propA+RUN2+OP_A), and some optimisation of the generic encode/decode path.
qoip(-1)slow is the -1 effort level with fastpath disabled for comparison; eventually level 0 may similarly get quicker. qoip(0) currently tries 3 combinations: default, QOI, delta9h (QOI and delta9h fastpaths TODO). Which ops are tried at which level will likely be tweaked a lot; level 1 in particular needs work. Level 0 lives and dies by the strength of handpicked combinations. Finding and fastpath-implementing a strong RGB combination and a few RGBA targets is TODO, maybe including one that targets images that respond well to indexing. Level >0 lives and dies by the strength of individual ops. Hopefully a 2-byte index and ops from demo 20+ help a good chunk; this is an area that needs a lot of work. When many more ops are added, crunch paths will also need better customisation options.
How do I interpret something like an opstring? How do I know the bits that each op gets? Does a 2-bit tag just claim the next 64 free opcodes? Do 8-bit tags like OP_RGB just get assigned out of the end of the RUN opcode block (if any), wherever that is?
That's the correct interpretation of an opstring. The opcodes are assigned deterministically by starting at zero and assigning codes to the larger blocks first (Mask1..Mask8, qoip_op_comp_mask comparator). If there are still 8-bit tags to assign after using all opcodes, they're assigned to the end of the RUN1 block, eating away at RUN1 as usual. See qoip_expand_opcodes; it's not particularly robust yet and can fail on specially-crafted strings. qoipcrunch makes sure a string is good before qoip_encode sees it, but it needs work. To make the generic path more performant and implementation easier, there are a few basic rules a valid opstring must follow:
edit: Now that you mention it, there may be a better way of generating opcodes. We could do as above except RUN1 is no longer an explicit block with a mask; instead it consumes all remaining values after the last explicit tag. This would mean that all valid opstrings use all of the opcode space automatically, which would be nice for many reasons. Testing for it is still cheap with b1>some_value; alternatively, now that we're guaranteed to be using every op, it could be the last else in a chain with no condition.
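As a concrete sketch of the revised scheme (the op names, block sizes and `demo_` helpers here are illustrative, not the real qoip tables): blocks are assigned largest-first starting at opcode 0, and whatever is left over belongs to the implicit RUN1 block.

```c
#include <stddef.h>

typedef struct {
    const char *name;
    int codes;   /* how many opcode values this op consumes (power of two) */
    int opcode;  /* assigned start of this op's opcode block */
} demo_op;

/* Assign opcode blocks in descending size order starting at 0.
   Returns the number of opcodes left over, which RUN1 consumes
   implicitly, so every valid opstring fills the whole space. */
static int demo_assign_opcodes(demo_op *ops, size_t cnt) {
    int next = 0;
    for (int size = 128; size >= 1; size >>= 1) {   /* Mask1..Mask8, then tags */
        for (size_t i = 0; i < cnt; ++i) {
            if (ops[i].codes == size) {
                ops[i].opcode = next;
                next += size;
            }
        }
    }
    return 256 - next;  /* remaining values = implicit RUN1 block */
}
```

For example, blocks of sizes 128, 64, 32, 1 and 1 land at opcodes 0, 128, 192, 224 and 225, leaving 30 values for RUN1.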
As described in the edit above there's been a breaking change to opcode generation:
This makes pretty much everything slightly cleaner and less error-prone, code- and usage-wise. Fastpath is disabled until the new format has been implemented; compared to the previous slow path, generic decode has improved thanks to inlining mandatory ops. All levels have been tweaked a little. Encode is a little slower thanks to index8.
Here's the current state of opcodes. It looks like there are fewer than before, but there's actually one more.
Fixed silly bug in level 0 where it tried 3 combinations but only ever returned the delta9 result. Added OP_DELTA op and replaced QOI combo in level 0 with a combo that uses OP_DELTA and INDEX7:
Any chance you could also slot in speed & size numbers for vanilla phoboslab/master QOI as a common reference point? Absolute decode/encode speeds obviously depend on hardware. Picking phoboslab/master commit c04a975 would line up with #28 (comment)
There's a quick c04; I'll have it as reference from now on. For levels 0..3 the best encoding is encoded twice when it's not the last one tried, which particularly inflates level 0. Levels 1..3 may also redo combinations from the common list, inflating their runtime a little. There may be a benefit to simulating an encode instead of actually encoding; worth trying.
Effort has been changed to allow multiple comma-delimited user-defined opstrings instead of just -1..3. Effort levels 0..3 still exist; level -1 does not, as it can be specified with an empty string.
0004080a0d, 0001020a0e, 0004090a0b0c are the current combinations tried at level 0, so effort "0" and "0004080a0d,0001020a0e,0004090a0b0c" are equivalent. The benefit of adapting the combination to the input is apparent when comparing level 0 to the three combinations tested solo. Only 0004080a0d currently has a fastpath.
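An opstring like those above is just a concatenation of two-hex-digit op ids. A minimal parser sketch (the `demo_` names are hypothetical; real qoip validation is stricter):

```c
#include <string.h>

static int demo_hex_nibble(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* Split an opstring like "0004080a0d" into op ids.
   Returns the number of ids written, or -1 on malformed input. */
static int demo_parse_opstring(const char *s, int *ids, int max) {
    size_t len = strlen(s);
    if (len % 2 != 0 || (int)(len / 2) > max) return -1;
    for (size_t i = 0; i < len; i += 2) {
        int hi = demo_hex_nibble(s[i]), lo = demo_hex_nibble(s[i + 1]);
        if (hi < 0 || lo < 0) return -1;
        ids[i / 2] = hi * 16 + lo;
    }
    return (int)(len / 2);
}
```

So "0004080a0d" decodes to the five op ids 0x00, 0x04, 0x08, 0x0a, 0x0d.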
Added many luma variants and created a combination for level 0 that focuses heavily on alpha. The new combination has no fastpath yet:
Things that need work:
As images and images-lance were used to inform the combinations selected in qoipcrunch-list.h, it's likely that the general case is slightly worse than these figures suggest. However, assuming images and images-lance are reasonably representative of the general case, it shouldn't be much worse.
So, https://github.com/chocolate42/qoipond/blob/1dd061c7bf88a28a128be01b1bdfac106e4751c8/qoipcrunch.h#L74-L75 configures:

```c
int rgb_levels[]  = {1, 2, 4, 6, 8, qoipcrunch_rgb_cnt};
int rgba_levels[] = {1, 2, 4, 6, 8, qoipcrunch_rgba_cnt};
```

Combine that with https://github.com/chocolate42/qoipond/blob/1dd061c7bf88a28a128be01b1bdfac106e4751c8/qoipcrunch-list.h which says:

```c
char *qoipcrunch_rgb[] = {
    "00030a0c14",
    "0001060a0b",
    "000106090c",
    "0002090b14",
    "0004090d14",
    "000107090c",
    "00010609",
    "0001060a0c",
    "00030a0b14",
    "000208090c",
    "000207090d",
    "0002090c14",
    "000107090b",
    "00020914",
    "000106090b",
    "0003080a0c",
};
int qoipcrunch_rgb_cnt = 16;

char *qoipcrunch_rgba[] = {
    "0002090b0f1114",
    "000307090b0e0f1113",
    "00040a0c0f1114",
    "000106090c0f",
    "000407090b0e101113",
    "000307090d0f1113",
    "0005090e0f1114",
    "00040a0b0f111314",
    "00030a0c0f14",
    "000207090d0f",
    "0003090b0f111314",
    "000208090b0f",
    "000107090b0f",
    "000307090e1011",
    "0004090d0f14",
    "000307090e0f1113",
};
int qoipcrunch_rgba_cnt = 16;
```

So, going by the MASK_X values, that takes 1 + 32 + 64 + 16 + 128 = 241 out of the 256 possible opcodes. There are also 3 implicit opcode assignments (run2, rgb, rgba), which leaves 12 run1 opcodes? Let me know if I've got that right.
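That arithmetic can be double-checked with a trivial helper; `demo_run1_budget` is a hypothetical name, not part of qoip:

```c
/* Sanity-check the opcode budget: explicit blocks of the given sizes
   plus the implicit 8-bit tags (run2, rgb, rgba) are subtracted from
   the 256 possible opcodes; the remainder is available to run1. */
static int demo_run1_budget(const int *block_sizes, int cnt, int implicit_tags) {
    int used = implicit_tags;
    for (int i = 0; i < cnt; ++i)
        used += block_sizes[i];
    return 256 - used;
}
```

With block sizes {1, 32, 64, 16, 128} and 3 implicit tags this returns 12, matching the count above.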
I think you're completely right. The current state I have simplifies things by unifying qoipcrunch_rgb and qoipcrunch_rgba into a single list, so 3/4-channel input is no longer handled differently (only a slight compression hit, and behaviour is no longer at the mercy of whether a user presents rgb input as 3 or 4 channel), and each level doubles the number of combinations tried from the previous. I went with lists like this for simplicity and to make multi-threading with OpenMP simple, which might get pushed to the repo later today. The list is generated by testing against images and images_lance and greedily picking the next best combination to add until the list is at some arbitrary size. Not every combination is tested; the old iteration code has been repurposed but isn't exhaustive (for example OP_INDEX8 and OP_A are always present). That's partially so training doesn't take too long, and partially because an exhaustive list actually makes the result less efficient. For example, a combination without OP_A and one with OP_A will pretty much always have RGB input pick the former and RGBA the latter, bloating the list size (making each effort level worse) for just a tiny compression gain.
Here's results using an 8 core Zen2 4700u:
For QOIP, it might be worth throwing an index-free combo into the mix. A fast-path could avoid updating the index altogether, whether hashed or FIFO. |
The good thing about INDEX8 is that it only takes up one opcode, making it hard to beat on efficiency. In theory, 95% of the time INDEX8 is more beneficial than the RUN1 value it probably replaces, so a non-index8 variant should be out-competed by the index8 variant almost every time. It's retraining now to test the theory; instead of the 511 combinations from before there are 1022, half of which don't have index8. The top 64 are chosen for qoipcrunch-list.h. It's possible that some index8-free combinations make the cut; if they do, I'm guessing it won't be at the top but near the bottom of the 64. If anything, INDEX8 could be a good candidate for making mandatory; that way the generic path could be made more efficient, which is probably more of a performance win than sometimes having a non-index8 fastpath. BTW I was wrong about always using OP_A, apparently that changed when the many other alpha ops were added. This 1022 set is still not exhaustive (it always leaves space for at least 4 RUN1 values and adds as big a 1-byte index op as it can after being exhaustive for everything else), but it's close. Probably easier to just show the current list generator: Grouping OP_A with OP_LUMA2_3433 may be a mistake; they are both 2-byte rgba ops but target different things.
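For reference, a hash-indexed cache in the spirit of INDEX8 might look like the sketch below. The hash shown is an assumption (QOI's multipliers widened to a 256-entry table); the real qoip hash and helper names may differ.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct { uint8_t r, g, b, a; } demo_px;

/* Assumed hash: QOI's (r*3 + g*5 + b*7 + a*11) % 64, widened to 256 slots. */
static unsigned demo_index8_hash(demo_px p) {
    return (p.r * 3 + p.g * 5 + p.b * 7 + p.a * 11) & 255;
}

/* Returns true and sets *pos if px is already cached (emit OP_INDEX8),
   otherwise inserts px and returns false (encode px some other way). */
static bool demo_index8_lookup(demo_px cache[256], demo_px px, unsigned *pos) {
    unsigned h = demo_index8_hash(px);
    *pos = h;
    if (cache[h].r == px.r && cache[h].g == px.g &&
        cache[h].b == px.b && cache[h].a == px.a)
        return true;
    cache[h] = px;
    return false;
}
```

Because the whole 256-entry cache is addressed through one opcode plus a one-byte index, any pixel that repeats within the cache's lifetime costs two bytes, which is what makes it hard to out-compete.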
Here's the result:
Non-index8 combinations made it to slots 17, 36, 38, 50, 55, 57:
So the set of 17 including 0106090f saves 177088 bytes compared to the set of 16 without it, where the input is images and images_lance combined at ~1.55GB encoded, so the saving is small. Even at non-index8's best showing we're in extreme territory (something like a 0.01% saving for 6.25% extra effort going from set size 16 to 17, excuse the rough percentage maths). Training from scratch would double the training time to ~7 hours (SMT not implemented yet so it can probably get down to ~2.0-2.5); IMO not worth it for a micro-optimisation to the extreme settings, but YMMV.
Early LZ4 entropy-coding results. LZ4 is not in the repo yet as support is only implemented for qoipbench, but the results are right here so why not post them. An entropy setting of 0 means raw, 1=lz4; zstd is coming and lzma might be, but at first glance lzma seems more complex to implement:
The crunch encoder looks for the best QOIP representation then entropy-codes the resulting bitstream; it doesn't entropy-code every combination (it could potentially find slightly smaller total sizes if it did, might roll that into an "extreme" setting down the line). The entropy-coding implementation is actually a fair way off being ready, so these are very early results.
Sorry for not being clearer. I wasn't suggesting opstrings just without Your
Tangential thought (you don't have to implement this, but I'll write it down before I forget it). The opstring codes are currently dense (e.g.
Having the opstring codes occur in sorted order then might make it easier to assign bit patterns (opcodes) to ops. It's kind of like how a canonical Huffman code's compact representation assigns bit patterns according to ascending code length. It would also mean substituting one |
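The canonicalisation step being suggested is cheap: sorting the two-character op codes of an opstring is a one-line qsort. A sketch (helper names are hypothetical):

```c
#include <stdlib.h>
#include <string.h>

/* Lexicographic compare of two-character "xx" hex pairs. */
static int demo_cmp_pair(const void *a, const void *b) {
    return memcmp(a, b, 2);
}

/* Sort an opstring's op codes in place so equivalent op sets always
   produce the same canonical string, analogous to how a canonical
   Huffman code orders its assignments. */
static void demo_sort_opstring(char *s) {
    qsort(s, strlen(s) / 2, 2, demo_cmp_pair);
}
```

For example "0d000a0804" canonicalises to "0004080a0d", so two opstrings naming the same ops in different orders compare equal afterwards.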
Generation currently uses as big a 1-byte index op as it can fit between the explicit ops chosen from a group of sets, and RUN1, which has at least 4 values (maybe this should be 8?). Not doing so would just make RUN1 excessively large in most combinations, which is a waste; it could be set to also skip the index op when that wouldn't make RUN1 too large (32, say?). Effectively that would duplicate all existing combinations that contain INDEX4 and INDEX3 with those ops removed, 108 if also excluding INDEX8, 216 otherwise. Might try it later. The main reason combinations without index ops aren't represented in the list is just that there aren't enough explicit ops to fill the opcode space enough to not trigger at least INDEX3.
That particular combination wouldn't make sense as OP_DELTA and OP_LUMA1_232 are both 1 byte RGB encodings with a large overlap, all OP_DELTA brings to the party is encoding a few alpha values. OP_INDEX7 is expensive, but it's also the reason that combination performed as well as it did.
It wouldn't really; a one-line qsort on a small string taking practically no runtime is all that would be saved. The main win would be in cleaning up the opcode handling, particularly the qoip_ops array. Currently the density is used: opstring codes are used directly as an index into qoip_ops, which is a bit messy. Cleaning that up has been on the TODO list for a while; sparse handling as described could do that and potentially eliminate the explicit mask, so I'll keep it in mind for the eventual cleanup. Here's more preliminary entropy-coding data, including ZSTD -19:
Ah, OK. Still, as I said, qoi-lz4-d28 gets pretty good compression and it has no index ops at all. Its ops are:
Something like that might still be worth putting into the mix. Perhaps subbing what you call OP_LUMA3_686 and OP_LUMA3_5654 for what I call QOI_OP_RGB7. You might also want to consider adding QOI_OP_RGBA2 and QOI_OP_RGBA4 to your possible ops. |
I looked into adding d28 ops, but that was back when things were being fiddled with manually and not much benefit was noticed. Things are different now, so they should be in the mix; RGB7 already has close representation either side, but RGB7 should probably be there too. RGBA4 and RGB7 should probably be LUMA variants with RGBA2 non-luma; DELTA/DIFF are the only non-LUMA variants because non-LUMA performs better when the bitcount is low, LUMA starts seeing benefit at 232. Saying that though, the split was calculated based on having fixed opcodes. It may be worthwhile having both LUMA and non-LUMA versions available in qoip, at least for 222 and 232, which are at the boundary of where LUMA becomes viable. Here are the sets the generator picks from; one is picked from each set, where OP_END means "no op picked":
The sets represent, in order, 1/2/3-byte RGB encodings and 1/2/3/4-byte RGBA encodings. So RGB7 would fit in the third set as OP_LUMA3_777, RGBA2 would fit in the fifth set as OP_DIFF1_2222, and RGBA4 in the sixth set as OP_LUMA3_4444. It's probably sensible to have 8-bit tag opcodes present in every set (and ops taking at least 8, 16, 32 codes for that matter). The training computer is going to love it ;) edit: If we're getting to the point of fleshing out all the sets to represent everything, it may be time to overhaul qoipcrunch_encode entirely. Instead of trying a selection of combinations, a smart qoipcrunch_encode would do a stat-gathering pass on the image, analyse the stats to find the best opstring, then encode with it.
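Picking one op from each set is a mixed-radix enumeration; a minimal sketch of walking every combination (set sizes are illustrative, and index 0 stands in for OP_END, i.e. "no op picked"):

```c
/* Enumerate all combinations of one choice per set by treating the
   pick[] array as a mixed-radix counter. Returns how many combinations
   were visited; a real generator would test/emit pick[] at each step. */
static long demo_enumerate(const int *set_sizes, int nsets) {
    int pick[8] = {0};  /* current choice per set; supports up to 8 sets */
    long count = 0;
    for (;;) {
        count++;  /* pick[i] == 0 would mean OP_END for set i */
        int i = 0;
        while (i < nsets && ++pick[i] == set_sizes[i])
            pick[i++] = 0;
        if (i == nsets)
            return count;  /* counter wrapped: all combinations visited */
    }
}
```

The combination count is just the product of the set sizes, which is why fleshing out every set makes the training run balloon.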
Implemented the first iteration of a smart crunch encode path:
The current state of the smart function is much improved:
I'm trying to keep spam in this thread to a minimum while still using it as a log. More details and open questions are in issues in the repo. |
Out of curiosity, what ops are in the "strong combination"?
Forgive my naivety. What are "stat integers" and "log tables"? And if they're both "for speed", which is speedier? Why use one or the other? Something something memory use?? |
tl;dr the stat integer method is a way to test a small set of combinations quicker than brute force, but it doesn't scale up to thousands of combinations well. The log method can test thousands or hundreds of thousands of combinations by taking advantage of most ops being luma variants, but it has overhead and is a WIP so doesn't currently scale down well. The combination in level 0 and the combinations in levels 1..3 are the same as they were before (the 64 combinations in
The stat integer method is in The log tables method is a recent WIP that may change quite a bit, and the code needs a good kicking (good luck attempting to read it in its current state); its implementation is in In reality the lumalog implementation is a little more complicated because there are a bunch of non-luma ops; currently there is a lumalog table and a baseline encoding size for every combination of 1-byte-index/1-byte-diff/2-byte-index op. I'm also taking advantage of some properties of the luma ops we're interested in to reduce the size of the lumalog tables:
Hopefully some of that makes sense.
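A very loose sketch of the stat-gathering idea described above: one pass over the pixel deltas tallies which class of op could encode each pixel, then a combination's output size is estimated from the tallies instead of actually encoding. The op set, ranges and names here are invented for illustration, not qoip's actual ops.

```c
#include <stdlib.h>

typedef struct { long fits_1byte, fits_2byte, fallback; } demo_stats;

/* Classify one pixel's RGB delta by the smallest hypothetical op that
   could encode it, and bump the matching tally. */
static void demo_tally(demo_stats *st, int dr, int dg, int db) {
    if (abs(dr) <= 1 && abs(dg) <= 1 && abs(db) <= 1)
        st->fits_1byte++;   /* e.g. a small DIFF-style 1-byte op */
    else if (abs(dg) <= 31 && abs(dr - dg) <= 7 && abs(db - dg) <= 7)
        st->fits_2byte++;   /* e.g. a LUMA-style 2-byte op */
    else
        st->fallback++;     /* verbatim OP_RGB, 4 bytes with tag */
}

/* Estimated stream size for a combination that has both small ops. */
static long demo_estimate(const demo_stats *st) {
    return st->fits_1byte * 1 + st->fits_2byte * 2 + st->fallback * 4;
}
```

The point is that the tallies are gathered once, while the cheap `demo_estimate` step can be repeated per candidate combination, which is what makes testing many combinations quicker than brute-force encoding each one.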
…size. Removed smart function as smarter function is now good enough. nigeltao/qoi2-bikeshed#31 (comment)
Tried using a LUT for encode fastpaths but couldn't create something faster than an else chain. Even effort level -1 (which can encode everything except run ops in a single LUT) couldn't beat an else chain. A LUT was also slower for decode fastpaths.
FIFO caches might not be the way to go. In the presence of entropy coding they actually perform worse, because they decorrelate the binary representation of same-index matches (even LZ4, which as a weaker codec might be more concerned with input size):
Combination choice happens before entropy coding so effort level 6 chooses poorly. In qoipcrunch it might be sensible to always use hash cache when entropy-coding and FIFO cache when not. Don't cross the streams. |
After doing some deeper investigation, it looks like for large images with ZSTD coding it makes sense to avoid index ops entirely. Even hash indexing appears to be detrimental to ZSTD when the input is big enough to let the entropy coder stretch its legs. When input is smaller there are some gains to hash caching; FIFO caching doesn't appear good at all with entropy coding (LZ4 or ZSTD). LZ4 doesn't react the same way to large input as ZSTD; it appears to be a weak enough codec that simple hash caching is a benefit. Determining how the smart qoipcrunch function should proceed remains an open question and is more complicated than first thought. My current leaning is this priority list:
As described above, there's now a size threshold on the input when ZSTD encoding: if the combination from effort level -1 is >256KiB raw then use that combination, otherwise do what the user selected. It stops images-lance regressing as effort increases without causing images to regress:
I've made some progress on a QOI-like format that stores the opcodes used in the header (it's early days, if it breaks let me know). This allows a bitstream and opcodes to be tailored to the input which has a range of mostly positive consequences. The main negative consequence is that the generic encode/decode path is necessarily slower than a fixed-opcode format (mainly because of function pointers and non-optimal opcode ordering), this can be mitigated for non-exhaustive searches if common combinations get "fast-path" implementations (TODO).
https://github.com/chocolate42/qoipond
Currently you don't see the potential compression benefit of the format using qoipbench because crunch-like functionality hasn't been implemented there (TODO within a few days). qoipcrunch is where you can see the benefit, first create a qoip file with qoipconv then call
qoipcrunch -in file.qoip -out file2.qoip -level 3
to do a full search of opcode combinations. The level parameter can accept 5 values:
Currently implemented ops; the id can be used to decode the above opstrings.