
QOIP format #31

Open
chocolate42 opened this issue Dec 28, 2021 · 35 comments

Comments

@chocolate42

I've made some progress on a QOI-like format that stores the opcodes used in the header (it's early days, so if it breaks let me know). This allows the bitstream and opcodes to be tailored to the input, which has a range of mostly positive consequences. The main negative consequence is that the generic encode/decode path is necessarily slower than a fixed-opcode format (mainly because of function pointers and non-optimal opcode ordering); this can be mitigated for non-exhaustive searches if common combinations get "fast-path" implementations (TODO).

https://github.com/chocolate42/qoipond

Currently you don't see the potential compression benefit of the format using qoipbench, because crunch-like functionality hasn't been implemented there yet (TODO within a few days). qoipcrunch is where you can see the benefit: first create a qoip file with qoipconv, then call qoipcrunch -in file.qoip -out file2.qoip -level 3 to do a full search of opcode combinations. The level parameter accepts 5 values:

  • -1: Just try the default combination which is currently an approximation of propA. This is what qoipbench currently does
  • 0: Try a small selection of efficient combinations. This makes level 0 a good middle ground between performance and compression, and these combinations are a good target for fast-path implementations to make it even better. Currently the combinations are approximations of propA, propA+OP_A, qoi, delta9h
  • 1: Try the above plus some other combinations that use most of the opcode space
  • 2: Try the above plus more combinations that use most of the opcode space
  • 3: Try all implemented combinations that use most of the opcode space. Right now that means a few hundred combinations, but as more ops get implemented the number of combinations at this level may grow very large. Analogues of demo* ops are next on the implementation list

Example

./qoipcrunch -in test.qoip -level 3 -out test2.qoip
Input has length   233488
New best length    188776 with opstring 000102060e121314
New best length    188544 with opstring 0001070e1213140203
New best length    188184 with opstring 0001070e1513160203
New best length    175544 with opstring 0001070e1513170203
Tried 144 combinations, reduced to 75.183307% of input

Currently implemented ops; the id can be used to decode the above opstrings:

./qoipcrunch -list
id=00, OP_RGB:  4 byte, store RGB  verbatim
id=01, OP_RGBA: 5 byte, store RGBA verbatim
id=02, OP_A:    2 byte, store    A verbatim
id=03, OP_RUN2:   2 byte, 256 value RLE
id=04, OP_RUN1_7: 1 byte, 128 value RLE
id=05, OP_RUN1_6: 1 byte,  64 value RLE
id=06, OP_RUN1_5: 1 byte,  32 value RLE
id=07, OP_RUN1_4: 1 byte,  16 value RLE
id=08, OP_RUN1_3: 1 byte,   8 value RLE
id=09, OP_RUN1_2: 1 byte,   4 value RLE
id=0a, OP_RUN1_1: 1 byte,   2 value RLE
id=0b, OP_RUN1_0: 1 byte,   1 value RLE
id=0c, OP_INDEX7: 1 byte, 128 value index cache
id=0d, OP_INDEX6: 1 byte,  64 value index cache
id=0e, OP_INDEX5: 1 byte,  32 value index cache
id=0f, OP_INDEX4: 1 byte,  16 value index cache
id=10, OP_INDEX3: 1 byte,   8 value index cache
id=11, OP_INDEX2: 1 byte,   4 value index cache
id=12, OP_DIFF:       1 byte delta,   vr  -2..1,  vg  -2..1,    vb  -2..1
id=13, OP_LUMA2_464:  2 byte delta, vg_r  -8..7,  vg -32..31, vg_b  -8..7
id=14, OP_RGB3:       3 byte delta,   vr -64..63,  g,           vb -64..63
id=15, OP_LUMA1_232:  1 byte delta, vg_r  -2..1,  vg  -4..3,  vg_b  -2..1
id=16, OP_LUMA3_676:  3 byte delta, vg_r -32..31, vg -64..63, vg_b -32..31
id=17, OP_LUMA3_4645: 3 byte delta, vg_r  -8..7,  vg -32..31, vg_b  -8..7  va -16..15
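
An opstring is just a hex string of one-byte op ids. As a sketch (hypothetical helper name, not from the repo; qoipond's actual parsing lives in qoip_expand_opcodes), parsing one might look like:

```c
#include <string.h>

/* Parse a hex opstring like "000102060e121314" into op ids.
   Returns the number of ids parsed, or -1 on malformed input.
   Illustrative sketch only. */
int opstring_parse(const char *s, unsigned char *ids, int max) {
	int n = 0;
	size_t len = strlen(s);
	if (len % 2) return -1; /* each id is exactly two hex digits */
	for (size_t i = 0; i < len; i += 2) {
		int v = 0;
		for (int j = 0; j < 2; ++j) {
			char c = s[i + j];
			v <<= 4;
			if (c >= '0' && c <= '9') v |= c - '0';
			else if (c >= 'a' && c <= 'f') v |= c - 'a' + 10;
			else return -1;
		}
		if (n == max) return -1;
		ids[n++] = (unsigned char)v;
	}
	return n;
}
```

So "000102060e121314" decodes to ids 00, 01, 02, 06, 0e, 12, 13, 14, which the table above maps to OP_RGB, OP_RGBA, OP_A, OP_RUN1_5, OP_INDEX5, OP_DIFF, OP_LUMA2_464, OP_RGB3.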
@chocolate42
Author

A good chunk of things have been implemented today, including qoipbench being able to benchmark crunching, fastpath encode/decode of the default combination (currently propA+RUN2+OP_A), and some optimisation of the generic encode/decode path.

# Grand total for images
             decode_ms     encode_ms   decode_mpps   encode_mpps   size_kb    rate
qoip(-1)slow:    3.870         5.081        119.93         91.36       443   27.0%
qoip(-1):        2.824         3.796        164.37        122.28       443   27.0%
qoip( 0):        2.800        14.470        165.78         32.08       425   25.9%
qoip( 1):        3.027        34.955        153.34         13.28       425   25.9%
qoip( 2):        2.813       246.130        165.00          1.89       418   25.5%

qoip(-1)slow is the -1 effort level with fastpath disabled, for comparison; eventually level 0 may similarly get quicker. qoip(0) currently tries 3 combinations: default, QOI, delta9h (QOI and delta9h fastpaths TODO). Which ops are tried at which level will likely be tweaked a lot; level 1 in particular needs work.

Level 0 lives and dies by the strength of handpicked combinations. Finding and fastpath-implementing a strong RGB combination and a few RGBA targets is TODO, maybe including one that targets images that respond well to indexing.

Level >0 lives and dies by the strength of individual ops. Hopefully a 2 byte index and ops from demo 20+ help a good chunk. This is an area that needs a lot of work. When many more ops are added, the crunch paths will also need better customisation options.

@nigeltao
Owner

How do I interpret something like opstring 000102060e121314? Does it mean that there's an 00 op (OP_RGB), an 01 op (OP_RGBA), etc, up to a 14 op (OP_RGB3)?

How do I know the bits that each op gets? Does a 2-bit tag just claim the next 64 free opcodes? Do 8-bit tags like OP_RGB just get assigned out of the end of the RUN opcode block (if any), wherever that is?

@chocolate42
Author

chocolate42 commented Dec 30, 2021

That's the correct interpretation of an opstring. The opcodes are assigned deterministically by starting at zero and assigning the codes to the larger blocks first (Mask1..Mask8, qoip_op_comp_mask comparator). If there are still 8 bit tags to assign after all masked blocks have been placed, they're assigned to the end of the RUN1 block, eating away at RUN1 as usual. See qoip_expand_opcodes; it's not particularly robust yet and can fail on specially-crafted strings. qoipcrunch makes sure a string is good before qoip_encode sees it, but it needs work.

To make the generic path more performant and to make implementation easier, there are a few basic rules a valid opstring must follow:

  • A RUN1 op must exist. This allows runs to be handled directly instead of with slower function pointers
  • The only mask overlap possible is 8 bit tags at the end of RUN1
  • No actual opcode overlap is possible
  • While not tested for yet, RGB input must have OP_RGB, RGBA input must have OP_RGBA, to be able to encode the worst-case

edit: Now that you mention it, there may be a better way of generating opcodes. We could do as above, except RUN1 is no longer an explicit block with a mask; instead it consumes all remaining values after the last explicit tag. This would mean that all valid opstrings use all of the opcode space automatically, which would be nice for many reasons. Testing for it is still cheap with b1>some_value; alternatively, now that we're guaranteed to be using every op, it could be the last else in a chain with no condition.
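
The assignment scheme in the edit can be sketched as follows (hypothetical, simplified; the real logic is in qoip_expand_opcodes): each explicit op claims a block of opcodes, blocks are placed largest-first from opcode 0, and whatever is left over becomes the implicit RUN1 block.

```c
#include <stdlib.h>

/* One op: its id, how many of the 256 opcodes its block covers,
   and the start opcode it is assigned. Sketch only. */
typedef struct { int id; int block; int start; } op_t;

static int cmp_block_desc(const void *a, const void *b) {
	return ((const op_t *)b)->block - ((const op_t *)a)->block;
}

/* Deterministically assign opcodes largest-block-first from zero.
   Returns the number of opcodes left over for the implicit RUN1
   block, or -1 if the opstring overflows the 256-value space. */
int assign_opcodes(op_t *ops, int cnt) {
	int next = 0;
	qsort(ops, cnt, sizeof *ops, cmp_block_desc);
	for (int i = 0; i < cnt; ++i) {
		ops[i].start = next;
		next += ops[i].block;
		if (next > 256) return -1;
	}
	return 256 - next; /* RUN1 consumes everything remaining */
}
```

For example, blocks of 128+64+32+16 plus four single-opcode 8-bit tags leave 12 opcodes for RUN1.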

@chocolate42
Author

As described in the edit above there's been a breaking change to opcode generation:

  • OP_RGB, OP_RGBA and OP_RUN2 are implicit and mandatory, taking up the next available opcodes after the explicit ops have been handled
  • The RUN1 op is implicit, taking up any remaining opcodes. It's not technically mandatory if RUN2 takes up the last opcode

This makes pretty much everything slightly cleaner and less error prone, code- and usage-wise. Fastpath is disabled until the new format has been implemented; compared to the previous slow path, generic decode has improved thanks to inlining the mandatory ops. All levels have been tweaked a little. Encode is a little slower thanks to index8.

# Grand total for images
             decode_ms     encode_ms   decode_mpps   encode_mpps   size_kb    rate
qoip(-1)slow:    3.352         5.572        138.49         83.31       443   27.0%
qoip( 0):        3.440        16.791        134.93         27.64       428   26.1%
qoip( 1):        3.478        55.371        133.45          8.38       419   25.5%
qoip( 2):        3.366       293.129        137.90          1.58       417   25.4%

Here's the current state of opcodes; it looks like there are fewer than before, but there's actually one more.

./qoipcrunch -list
id=00, OP_A:    2 byte, store    A verbatim
id=01, OP_INDEX8: 2 byte, 256 value index cache
id=02, OP_INDEX7: 1 byte, 128 value index cache
id=03, OP_INDEX6: 1 byte,  64 value index cache
id=04, OP_INDEX5: 1 byte,  32 value index cache
id=05, OP_INDEX4: 1 byte,  16 value index cache
id=06, OP_INDEX3: 1 byte,   8 value index cache
id=07, OP_INDEX2: 1 byte,   4 value index cache
id=08, OP_DIFF:       1 byte delta,   vr  -2..1,  vg  -2..1,    vb  -2..1
id=09, OP_LUMA1_232:  1 byte delta, vg_r  -2..1,  vg  -4..3,  vg_b  -2..1
id=0a, OP_LUMA2_464:  2 byte delta, vg_r  -8..7,  vg -32..31, vg_b  -8..7
id=0b, OP_LUMA3_676:  3 byte delta, vg_r -32..31, vg -64..63, vg_b -32..31
id=0c, OP_LUMA3_4645: 3 byte delta, vg_r  -8..7,  vg -32..31, vg_b  -8..7  va -16..15
id=0d, OP_RGB3:       3 byte delta,   vr -64..63,  g,           vb -64..63

@chocolate42
Author

Fixed a silly bug in level 0 where it tried 3 combinations but only ever returned the delta9 result. Added the OP_DELTA op and replaced the QOI combo in level 0 with a combo that uses OP_DELTA and INDEX7:

# Grand total for images
         decode_ms     encode_ms   decode_mpps   encode_mpps   size_kb    rate
qoip( 0):    3.393        18.342        136.82         25.31       416   25.3%
qoip( 1):    3.406        54.194        136.28          8.57       415   25.3%


@nigeltao
Owner

Any chance you could also slot in speed&size numbers for vanilla phoboslab/master QOI as a common reference point? Absolute decode/encode speeds obviously depend on hardware.

Picking phoboslab/master commit c04a975 would line up with #28 (comment)

@chocolate42
Author

        decode ms   encode ms   decode mpps   encode mpps   size kb    rate
qoi:          2.6         3.3        177.46        140.59       463   28.2%

There's a quick c04 run; I'll have it as a reference from now on. For levels 0..3 the best encoding is encoded twice when it's not the last one tried, which particularly inflates level 0. Levels 1..3 may also redo combinations from the common list, inflating their runtime a little.

There may be a benefit to simulating encode instead of actually encoding, worth trying.

@chocolate42
Author

chocolate42 commented Jan 1, 2022

Effort has been changed to allow multiple comma-delimited user-defined opstrings instead of just -1..3. Effort levels 0..3 still exist; level -1 does not, as that can be specified with an empty string.

decode_ms  encode_ms  decode_mpps  encode_mpps  size kb  rate
    2.683      3.378       172.98       137.43      463  28.2%: qoi-c04a975
    3.030      4.184       153.19       110.93      443  27.0%: qoip(0004080a0d)
    3.080      6.178       150.72        75.14      456  27.8%: qoip(0001020a0e)
    3.000      5.849       154.74        79.36      428  26.1%: qoip(0004090a0b0c)
    3.071     17.876       151.13        25.97      416  25.3%: qoip(0004080a0d,0001020a0e,0004090a0b0c)
    3.016     17.462       153.93        26.58      416  25.3%: qoip(0)
    3.017     53.884       153.87         8.61      415  25.3%: qoip(1)

0004080a0d, 0001020a0e, 0004090a0b0c are the current combinations tried at level 0 so effort "0" and "0004080a0d,0001020a0e,0004090a0b0c" are equivalent. The benefit of adapting the combination used to the input is apparent when comparing level 0 to the three combinations tested solo. Only 0004080a0d currently has a fastpath.

edit:
New normal: deltax is the default, deltax and propc have fastpaths, and another silly bug has been fixed where decode was not properly benchmarked:

decode_ms  encode_ms  decode_mpps  encode_mpps  size kb  rate
    2.762      3.489       168.03       133.06      463  28.2%: qoi-c04a975
    3.492      4.725       132.93        98.23      422  25.7%: qoip()
    3.290     15.093       141.08        30.76      415  25.3%: qoip(0)
    3.310     49.029       140.22         9.47      415  25.3%: qoip(1)

edit:
New new normal. Fastpath implemented for the last common combo, now called idelta. Removed propc from the common list so only deltax and idelta remain; propc is better solo than idelta but is rarely used in the presence of deltax. idelta has a 7 bit index which pairs nicely with deltax, which is more of a general-purpose combination. It might be better to remove LUMA3_4645 from deltax to make it even more RGB-focused, especially if a third common combination is found that is RGBA-focused.


decode_ms  encode_ms  decode_mpps  encode_mpps  size_kb  rate
    2.778      3.505       167.12       132.43      463  28.2%: qoi-c04a975
    3.329      4.555       139.44       101.91      422  25.7%: qoip()
    3.229      9.051       143.75        51.29      416  25.3%: qoip(0)
# Grand total for images-lance
decode_ms  encode_ms  decode_mpps  encode_mpps  size_kb  rate
   20.627     46.155       196.88        87.99     1744  11.2%: qoip(0)

@chocolate42
Author

Added many luma variants and created a combination for level 0 that heavily focuses on alpha. The new combination has no fastpath yet:

# Grand total for images
decode_ms  encode_ms  decode_mpps  encode_mpps  size_kb  rate
    2.529      3.196       183.53       145.24      463  28.2%: qoi-c04a975
    3.333      5.692       139.27        81.55      422  25.7%: qoip()
    3.324     17.958       139.64        25.85      415  25.3%: qoip(0)

# Grand total for images-lance
decode_ms  encode_ms  decode_mpps  encode_mpps  size_kb  rate
   19.803     17.157       205.07       236.70     2109  13.6%: qoi-c04a975
   20.515     30.604       197.96       132.70     1764  11.3%: qoip()
   22.613     94.401       179.59        43.02     1580  10.2%: qoip(0)

@chocolate42
Author

  • Reworked qoipcrunch to remove the iterator and instead use a curated list of opstrings (one for rgb, one for rgba), the idea being that opstrings higher up the list compress better. Currently the ordering is done by greedily picking the next opstring by the amount of space it saves when added to the previously chosen opstrings, trained against images and images_lance. It still needs some work, as there is an rgb/rgba split that isn't properly accounted for, but the current state is an improvement on the iterator approach
  • There's now levels 0 to 5 inclusive that try increasing amounts of opstrings from the lists
  • All previously supported opstrings by qoip_encode still are, they just aren't represented in qoipcrunch.h
	int rgb_levels[]  = {1, 2, 6, 10, 15, 23};
	int rgba_levels[] = {1, 2, 6, 10, 15, 32};
  • rgb only has 23 opstrings at level 5 because rgb only has 23 opstring structures; rgba has ~430 structures but has been arbitrarily limited to 32 (~290 were never used for images or images-lance, and at 32 each additional opstring only saves low-single-digit kilobytes total from the entire ~984MB encoded rgba corpus, <0.001%). Judging by the numbers below we could limit rgb and rgba to 16 opstrings and still lose very little on average.
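
The greedy ordering described above can be sketched like this (hypothetical toy, not the repo's training code): given sizes[image][candidate], repeatedly add the candidate that most reduces the corpus total, where each image uses the best opstring from the chosen set.

```c
#include <limits.h>

#define IMGS 3
#define CANDS 4

/* Corpus total when each image picks its best opstring from the
   chosen set. sizes[i][c] = compressed size of image i under
   candidate c. Sketch only. */
long total_with(long sizes[IMGS][CANDS], const int *chosen) {
	long total = 0;
	for (int i = 0; i < IMGS; ++i) {
		long best = LONG_MAX;
		for (int c = 0; c < CANDS; ++c)
			if (chosen[c] && sizes[i][c] < best) best = sizes[i][c];
		total += best;
	}
	return total;
}

/* Greedily add the candidate that most reduces the corpus total.
   Returns the candidate added, or -1 if none remain. */
int greedy_pick_next(long sizes[IMGS][CANDS], int *chosen) {
	int best_c = -1;
	long best_total = LONG_MAX;
	for (int c = 0; c < CANDS; ++c) {
		if (chosen[c]) continue;
		chosen[c] = 1; /* trial add */
		long t = total_with(sizes, chosen);
		chosen[c] = 0;
		if (t < best_total) { best_total = t; best_c = c; }
	}
	if (best_c >= 0) chosen[best_c] = 1;
	return best_c;
}
```

Each call prints out as one "Adding X to make set size of N" line in the training output later in this thread.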

Things that need work:

  • A large percentage of rgba input should be treated as rgb, but the hard training split between rgb/rgba means it isn't trained that way. It meant that 0003080a0c0f was something like 10th down the rgba list when manually putting it first was an improvement on average (presumably because the proper-rgba training overpowered the rgba-but-should-treat-as-rgb training). Maybe the rgba-as-rgb input should be filtered in training by testing against the top rgb opstrings, or rgba should be tested against the top rgb at runtime, or something else.
  • Fastpaths have been disabled again (they work, but the opstrings are no longer represented in the qoipcrunch lists)
# Grand total for images
decode_ms  encode_ms  decode_mpps  encode_mpps  size kb  rate
    2.587      3.243       179.45       143.14      463  28.2%: qoi-c04a975
    3.530      5.799       131.50        80.04      421  25.7%: qoip(0)
    3.484     11.715       133.24        39.62      414  25.2%: qoip(1)
    3.420     34.864       135.72        13.31      411  25.1%: qoip(2)
    3.374     57.140       137.57         8.12      410  25.0%: qoip(3)
    3.369     85.583       137.76         5.42      410  25.0%: qoip(4)
    3.353    155.575       138.43         2.98      409  24.9%: qoip(5)


# Grand total for images-lance
decode_ms  encode_ms  decode_mpps  encode_mpps  size kb  rate
   19.581     16.980       207.40       239.16     2109  13.6%: qoi-c04a975
   20.779     28.943       195.44       140.31     1807  11.6%: qoip(0)
   21.279     58.700       190.85        69.18     1591  10.2%: qoip(1)
   21.833    177.022       186.01        22.94     1556  10.0%: qoip(2)
   21.602    292.896       187.99        13.87     1552  10.0%: qoip(3)
   21.642    440.122       187.65         9.23     1550  10.0%: qoip(4)
   21.550    908.971       188.45         4.47     1550  10.0%: qoip(5)

@chocolate42
Author

  • Implemented luma232 bias from here in addition to the basic version that already existed: Experiments with luma232, luma686 and gray tags #33
  • Implemented delta from average (of previous and above pixel) from here: Diff from average instead of previous pixel #34
  • The average implementation limits averaging to the first 8192 pixels in every row; pixels beyond that, or in the first row, fall back to using just the previous pixel. This is mainly done to limit the working memory usage of a streaming/re-entrant implementation. 8192 seems like a good compromise between expected input, efficiency and memory (24 KiB) but is otherwise arbitrary
  • Retrained the crunch list with all the new additions. There's still an rgb/rgba split which may not be optimal (4 channel input is currently only tested against the rgba list, but as the rgba list is trained against a lot of rgba input that could be treated as 99% rgb, it still works well). I'm leaning towards future training that filters rgba-as-rgb input and having rgba input tested against both lists at runtime; that might yield worthwhile gains on average
  • Changed effort level boundaries to reflect common CPU core counts
	int rgb_levels[]  = {1, 2, 4, 6, 8,  qoipcrunch_rgb_cnt};
	int rgba_levels[] = {1, 2, 4, 6, 8, qoipcrunch_rgba_cnt};
  • Fastpaths are completely disabled for now, as the average implementation has broken them. They'll get fixed once other breaking changes have made their way in and the best opstrings have been nailed down. A FIFO index cache, probably replacing the hash cache, is the known biggie. Hopefully there are other biggies yet to be found
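The delta-from-average prediction described above can be sketched per channel as follows (hypothetical function, not the repo's code; the truncating average is an assumption):

```c
/* Prediction base for the delta ops (sketch, one channel): the
   average of the previous pixel and the pixel above, falling back
   to just the previous pixel in the first row or beyond column
   8192. The column limit caps the row buffer a streaming
   implementation must keep at 8192 pixels (24 KiB for RGB). */
unsigned char predict(unsigned char prev, unsigned char above,
                      unsigned int row, unsigned int col) {
	if (row == 0 || col >= 8192)
		return prev; /* no usable pixel above */
	/* truncating average: an assumption about rounding */
	return (unsigned char)(((unsigned)prev + (unsigned)above) / 2);
}
```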
# Grand total for images
decode_ms  encode_ms  decode_mpps  encode_mpps  size kb  rate
    6.717     78.417        69.11         5.92      398  24.2%: libpng
    6.638     55.068        69.93         8.43      561  34.2%: stbi
    1.975      2.361       235.00       196.57      463  28.2%: qoi-c04a975
    4.016      6.454       115.59        71.92      400  24.4%: qoip(0)
    3.985     12.729       116.47        36.46      395  24.1%: qoip(1)
    3.939     25.183       117.84        18.43      392  23.9%: qoip(2)
    3.927     37.758       118.21        12.29      391  23.8%: qoip(3)
    3.956     50.741       117.33         9.15      391  23.8%: qoip(4)
    3.937    101.207       117.89         4.59      390  23.8%: qoip(5)

# Grand total for images-lance
decode_ms  encode_ms  decode_mpps  encode_mpps  size kb  rate
   52.313    450.099        77.63         9.02     1395   9.0%: libpng
   41.832    371.103        97.08        10.94     2119  13.6%: stbi
   15.410     13.746       263.53       295.43     2109  13.6%: qoi-c04a975
   24.064     32.243       168.76       125.95     1657  10.7%: qoip(0)
   24.046     64.588       168.89        62.88     1545   9.9%: qoip(1)
   23.913    127.686       169.83        31.81     1541   9.9%: qoip(2)
   23.983    191.586       169.33        21.20     1535   9.9%: qoip(3)
   24.187    256.918       167.90        15.81     1530   9.8%: qoip(4)
   24.285    513.158       167.23         7.91     1528   9.8%: qoip(5)

As images and images-lance were used to inform the combinations selected in qoipcrunch-list.h, it's likely that the general case is slightly worse than these figures suggest. However, assuming images and images-lance are reasonably representative of the general case it shouldn't be much worse.

@nigeltao
Owner

nigeltao commented Jan 9, 2022

So, https://github.com/chocolate42/qoipond/blob/1dd061c7bf88a28a128be01b1bdfac106e4751c8/qoipcrunch.h#L74-L75 configures qoip(0) through qoip(5):

int rgb_levels[]  = {1, 2, 4, 6, 8,  qoipcrunch_rgb_cnt};
int rgba_levels[] = {1, 2, 4, 6, 8, qoipcrunch_rgba_cnt};

Combine that with https://github.com/chocolate42/qoipond/blob/1dd061c7bf88a28a128be01b1bdfac106e4751c8/qoipcrunch-list.h which says:

char *qoipcrunch_rgb[] = {
	"00030a0c14",
	"0001060a0b",
	"000106090c",
	"0002090b14",
	"0004090d14",
	"000107090c",
	"00010609",
	"0001060a0c",
	"00030a0b14",
	"000208090c",
	"000207090d",
	"0002090c14",
	"000107090b",
	"00020914",
	"000106090b",
	"0003080a0c",
};
int qoipcrunch_rgb_cnt = 16;

char *qoipcrunch_rgba[] = {
	"0002090b0f1114",
	"000307090b0e0f1113",
	"00040a0c0f1114",
	"000106090c0f",
	"000407090b0e101113",
	"000307090d0f1113",
	"0005090e0f1114",
	"00040a0b0f111314",
	"00030a0c0f14",
	"000207090d0f",
	"0003090b0f111314",
	"000208090b0f",
	"000107090b0f",
	"000307090e1011",
	"0004090d0f14",
	"000307090e0f1113",
};
int qoipcrunch_rgba_cnt = 16;

So, qoip(0) for RGB (no A) images uses only the "00030a0c14" op string. Cross-referencing that with https://github.com/chocolate42/qoipond/blob/1dd061c7bf88a28a128be01b1bdfac106e4751c8/qoip.h#L209-L241 gives:

00: OP_INDEX8, MASK_8, etc
03: OP_INDEX5, MASK_3, etc
0a: OP_LUMA2_464, MASK_2, etc
0c: OP_LUMA3_686, MASK_4, etc
14: OP_LUMA1_232_BIAS, MASK_1, etc

Going by the MASK_X values, that takes 1 + 32 + 64 + 16 + 128 = 241 out of the 256 possible opcodes. There's also 3 implicit opcode assignments (run2, rgb, rgba), which leaves 12 run1 opcodes?
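
The budget arithmetic above, written out mechanically (block sizes taken from the MASK_X values quoted in this comment):

```c
/* Opcode budget for opstring "00030a0c14": explicit blocks
   INDEX8(1) + INDEX5(32) + LUMA2_464(64) + LUMA3_686(16) +
   LUMA1_232_BIAS(128), plus 3 implicit 8-bit ops (RGB, RGBA,
   RUN2); everything left over is the implicit RUN1 block. */
int run1_opcodes(void) {
	int explicit_blocks = 1 + 32 + 64 + 16 + 128; /* = 241 */
	int implicit_tags = 3;                        /* RGB, RGBA, RUN2 */
	return 256 - explicit_blocks - implicit_tags; /* = 12 */
}
```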

Let me know if I've got that right.

@chocolate42
Author

I think you're completely right. The current state I have simplifies things by unifying qoipcrunch_rgb and qoipcrunch_rgba into a single list, so 3/4 channel input is no longer handled differently (only a slight compression hit, and behaviour is no longer at the mercy of whether a user presents rgb input as 3 or 4 channels), and each level doubles the number of combinations tried from the previous. I went with lists like this for simplicity and to make multi-threading with OpenMP simple; that might get pushed to the repo later today.

The list is generated by testing against images and images_lance and greedily picking the next best combination to add until the list reaches some arbitrary size. Not every combination is tested; the old iteration code has been repurposed but isn't exhaustive (for example OP_INDEX8 and OP_A are always present). Partially so training doesn't take too long, and partially because an exhaustive list actually makes the result less efficient. For example, given a combination without OP_A and the same combination with it, RGB input will pretty much always pick the former and RGBA input the latter, bloating the list size (making each effort level worse) for just a tiny compression gain.

@chocolate42
Author

  • Unified the crunch lists into one as described above. As well as making behaviour more predictable it'll make fastpath implementations more worthwhile to implement as they'll be used more often
  • Each effort level doubles the number of combinations tried from the previous level (1, 2, 4, 8, 16, 32, 64)
  • Implemented multi-threading with OpenMP for qoipcrunch_encode() (effort levels only; custom combination lists are a maybe-TODO, might not bother as custom strings are more for testing convenience). Each combination is tested single-threaded, but when trying multiple combinations they may be tried concurrently (-threads num argument, default 1).
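
The per-combination threading scheme can be sketched as below (hypothetical names; qoipond's real entry point is qoipcrunch_encode()). Each candidate is encoded into its own working buffer, then a serial reduction picks the smallest result. The pragma parallelises the loop when built with OpenMP and is ignored otherwise, so serial behaviour is identical.

```c
#include <stddef.h>

/* Hypothetical encoder signature: returns the encoded size. */
typedef size_t (*encode_fn)(const void *pixels, const char *opstring,
                            unsigned char *out);

/* Try each opstring concurrently, each thread writing to its own
   buffer, then pick the index of the smallest encoding. Sketch. */
int crunch(const void *pixels, const char **opstrings, int n,
           unsigned char **bufs, encode_fn encode, size_t *sizes) {
	int best = 0;
	#pragma omp parallel for
	for (int i = 0; i < n; ++i)
		sizes[i] = encode(pixels, opstrings[i], bufs[i]); /* own memory */
	for (int i = 1; i < n; ++i) /* serial reduction */
		if (sizes[i] < sizes[best]) best = i;
	return best;
}

/* Toy stand-in encoder for the sketch: "compressed size" is just
   the opstring length, so the shortest opstring wins. */
size_t toy_encode(const void *pixels, const char *opstring,
                  unsigned char *out) {
	(void)pixels; (void)out;
	size_t n = 0;
	while (opstring[n]) n++;
	return n;
}
```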

Here's results using an 8 core Zen2 4700u:

# Grand total for images
decode_ms  encode_ms  decode_mpps  encode_mpps  size kb  rate
    1.948      2.341       238.25       198.32      463  28.2%: qoi-c04a975
    3.874      6.242       119.80        74.36      401  24.4%: qoip(0).threads(8)
    3.901      7.397       118.99        62.75      398  24.2%: qoip(1).threads(8)
    4.313     10.098       107.63        45.97      393  24.0%: qoip(2).threads(8)
    4.617     22.999       100.54        20.18      392  23.9%: qoip(3).threads(8)
    4.643     38.586        99.96        12.03      390  23.8%: qoip(4).threads(8)
    3.888     56.479       119.38         8.22      392  23.9%: qoip(3).threads(1)

# Grand total for images-lance
decode_ms  encode_ms  decode_mpps  encode_mpps  size kb  rate
   15.345     13.833       264.65       293.58     2109  13.6%: qoi-c04a975
   24.337     32.528       166.87       124.85     1664  10.7%: qoip(0).threads(8)
   24.842     37.112       163.47       109.43     1566  10.1%: qoip(1).threads(8)
   25.101     46.341       161.79        87.64     1541   9.9%: qoip(2).threads(8)
   25.888     78.690       156.87        51.61     1531   9.8%: qoip(3).threads(8)
   25.646    150.737       158.35        26.94     1528   9.8%: qoip(4).threads(8)
   24.273    290.353       167.31        13.99     1531   9.8%: qoip(3).threads(1)
  • Each thread is given its own working memory
  • Single-threaded at effort level 3 for comparison
  • Decode taking longer as multi-threaded effort levels increase is a flaw in the benchmark: not enough time is taken to let the CPU cool down from the multi-threaded encode tests, so the clocks are lower than they could be for the decode test
  • Above results were roughly running at 2.8GHz 8 threaded, 4.0GHz single threaded which checks out (laptop part with weedy cooler, 45W TDP but it may thermal throttle before that at sustained SMT load)
  • Effort levels 0/1/2 were not actually using 8 threads as there are not enough combinations to try (1/2/4 combinations/threads used respectively). Comparing levels 0-3 you can see in encode and decode timings how much clocks were reduced for thermal/power limits as thread count increased

@nigeltao
Owner

nigeltao commented Jan 9, 2022

OP_INDEX8... is always present

qoi-lz4-d28 from #28 (comment) (there's a list of its opcodes at the bottom of the comment) gets pretty good compression (admittedly when combined with LZ4) and it has no index ops at all, let alone a 2-byte index one.

For QOIP, it might be worth throwing an index-free combo into the mix. A fast-path could avoid updating the index altogether, whether hashed or FIFO.

@chocolate42
Author

The good thing about INDEX8 is that it only takes up one opcode, making it hard to beat on efficiency. In theory, 95% of the time INDEX8 is more beneficial than the RUN1 value it probably replaces, so a non-index8 variant should be out-competed by the index8 variant almost every time.
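For context, an INDEX8-style cache hit costs 2 bytes versus 4/5 bytes for a verbatim pixel. A minimal sketch of the lookup, assuming QOI's hash constants reduced modulo a 256-entry table rather than QOI's 64 (whether qoipond uses exactly this hash is an assumption; the thread also discusses replacing the hash cache with a FIFO cache):

```c
typedef struct { unsigned char r, g, b, a; } px_t;

/* QOI-style hash, widened to a 256-entry table (assumed). */
unsigned qoip_hash256(px_t p) {
	return (p.r * 3u + p.g * 5u + p.b * 7u + p.a * 11u) % 256u;
}

/* Encoder side: emit a 2-byte INDEX8 op on a hit, otherwise
   update the slot and fall through to other ops. Sketch. */
int index8_try(px_t *cache, px_t p, unsigned char *idx_out) {
	unsigned h = qoip_hash256(p);
	if (cache[h].r == p.r && cache[h].g == p.g &&
	    cache[h].b == p.b && cache[h].a == p.a) {
		*idx_out = (unsigned char)h;
		return 1; /* hit: 2 bytes beat verbatim RGB/RGBA */
	}
	cache[h] = p;
	return 0;
}
```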

It's retraining now to test the theory; instead of the 511 combinations from before there are 1022, half of which don't have index8. The top 64 are chosen for qoipcrunch-list.h. It's possible that some index8-free combinations make the cut; if they do, I'm guessing it won't be at the top but near the bottom of the 64. If anything, INDEX8 could be a good candidate for making mandatory; that way the generic path could be made more efficient, which is probably more of a performance win than sometimes having a non-index8 fastpath.

BTW I was wrong about always using OP_A; apparently that changed when the many other alpha ops were added. This 1022 set is still not exhaustive (it always leaves space for at least 4 RUN1 values, and adds as big a 1-byte index op as it can after being exhaustive for everything else), but it's close. Probably easier to just show the current list generator:
qoipcrunchlistgen.c.txt

Grouping OP_A with OP_LUMA2_3433 may be a mistake, they are both 2 byte rgba ops but target different things.

@chocolate42
Author

Here's the result:

Adding 00030a0b0f1114 to make set size of 1 brings total size to      1618345376
Adding 000206090b0e0f1113 to make set size of 2 brings total size to  1582208112
Adding 000106090c0f to make set size of 3 brings total size to        1568051720
Adding 0005090e0f1114 to make set size of 4 brings total size to      1563279664
Adding 000407090b0e101113 to make set size of 5 brings total size to  1560966752
Adding 00030a0c0f14 to make set size of 6 brings total size to        1559050672
Adding 000307090d0f1113 to make set size of 7 brings total size to    1557204656
Adding 0002090b0f14 to make set size of 8 brings total size to        1555964088
Adding 0001060a0c0f to make set size of 9 brings total size to        1554931440
Adding 00040a0b0f111314 to make set size of 10 brings total size to   1553976352
Adding 000107090b0f to make set size of 11 brings total size to       1553608072
Adding 0003090b0f111314 to make set size of 12 brings total size to   1553257688
Adding 0004090d0f14 to make set size of 13 brings total size to       1552914848
Adding 000307090e0f1113 to make set size of 14 brings total size to   1552600128
Adding 000208090b0f to make set size of 15 brings total size to       1552352192
Adding 000207090d0f to make set size of 16 brings total size to       1552110104
Adding 0106090f to make set size of 17 brings total size to           1551933016
Adding 000307090e1011 to make set size of 18 brings total size to     1551778224
Adding 0003090c0f1114 to make set size of 19 brings total size to     1551644664
Adding 0004070a0d0f1213 to make set size of 20 brings total size to   1551534184
Adding 00030a0b0f14 to make set size of 21 brings total size to       1551433144
Adding 000207090c0f11 to make set size of 22 brings total size to     1551359848
Adding 0004090b10111314 to make set size of 23 brings total size to   1551290768
Adding 00050a0c0f111314 to make set size of 24 brings total size to   1551230816
Adding 0003090c0f111314 to make set size of 25 brings total size to   1551175344
Adding 000106090b0f1113 to make set size of 26 brings total size to   1551121384
Adding 0002070a0c0f to make set size of 27 brings total size to       1551070536
Adding 0002090c0f14 to make set size of 28 brings total size to       1551022272
Adding 000306090b0e101113 to make set size of 29 brings total size to 1550986320
Adding 000206090e0f11 to make set size of 30 brings total size to     1550955672
Adding 0002090b0f1114 to make set size of 31 brings total size to     1550925560
Adding 00090b0e0f1114 to make set size of 32 brings total size to     1550897832
Adding 000107090c0f to make set size of 33 brings total size to       1550870856
Adding 00040a0c0f1114 to make set size of 34 brings total size to     1550844624
Adding 000307090b0e0f1113 to make set size of 35 brings total size to 1550820672
Adding 0106090c0f13 to make set size of 36 brings total size to       1550798640
Adding 000106090b0f to make set size of 37 brings total size to       1550778504
Adding 04060a0d101213 to make set size of 38 brings total size to     1550759568
Adding 0003070a0d0f to make set size of 39 brings total size to       1550744360
Adding 0003070a0c101113 to make set size of 40 brings total size to   1550731336
Adding 00030910111314 to make set size of 41 brings total size to     1550718808
Adding 0003080a0c0f to make set size of 42 brings total size to       1550706640
Adding 0002060a0d0f to make set size of 43 brings total size to       1550695184
Adding 00090d0f1314 to make set size of 44 brings total size to       1550684680
Adding 000106090f to make set size of 45 brings total size to         1550674456
Adding 0002060a0c0f1113 to make set size of 46 brings total size to   1550667608
Adding 000206090d0f to make set size of 47 brings total size to       1550661120
Adding 00040a0c0f1314 to make set size of 48 brings total size to     1550654680
Adding 00010609101113 to make set size of 49 brings total size to     1550648504
Adding 01060a0b0f to make set size of 50 brings total size to         1550642336
Adding 0001060a0b0f to make set size of 51 brings total size to       1550636304
Adding 0003060a0d0f1113 to make set size of 52 brings total size to   1550630544
Adding 000207090b0f1113 to make set size of 53 brings total size to   1550624904
Adding 000106090c0f13 to make set size of 54 brings total size to     1550620488
Adding 05090e0f1114 to make set size of 55 brings total size to       1550617024
Adding 0003090b0f1314 to make set size of 56 brings total size to     1550613600
Adding 0106090b0f13 to make set size of 57 brings total size to       1550610368
Adding 000207090b101113 to make set size of 58 brings total size to   1550607224
Adding 0003080a0b0f to make set size of 59 brings total size to       1550604160
Adding 0004070a0b0e0f1113 to make set size of 60 brings total size to 1550601360
Adding 000307090c0e0f11 to make set size of 61 brings total size to   1550598688
Adding 0002090f14 to make set size of 62 brings total size to         1550596056
Adding 000408090d0f to make set size of 63 brings total size to       1550594016
Adding 000106090b0f13

Non-index8 combinations made it to slots 17, 36, 38, 50, 55, 57:

0106090f
0106090c0f13
04060a0d101213
01060a0b0f
05090e0f1114
0106090b0f13

So the set of 17 including 0106090f saves 177088 bytes compared to the set of 16 without it; the input is images and images_lance combined at ~1.55GB encoded, so the saving is small. Even at non-index8's best showing we're in extreme territory (roughly a 0.01% saving for 6.25% extra effort going from set size 16 to 17). Training from scratch would double the training time to ~7 hours (SMT not implemented yet, so it can probably get down to ~2.0-2.5 hours); IMO not worth it for a micro-optimisation to the extreme settings, but YMMV.

@chocolate42
Author

Early LZ4 entropy-coding results. LZ4 is not in the repo yet as support is only implemented for qoipbench, but the results are right here so why not post them. An entropy setting of 0 means raw and 1 means LZ4; zstd is coming, and LZMA might be but at first glance seems more complex to implement:

# Grand total for images
decode_ms  encode_ms  decode_mpps  encode_mpps  size_kb  rate
    4.042      6.376       114.84        72.80      383  23.3%: qoip(0).threads(8).entropy(1)
    4.088      7.658       113.54        60.61      381  23.2%: qoip(1).threads(8).entropy(1)
    4.478     10.487       103.66        44.26      377  23.0%: qoip(2).threads(8).entropy(1)
    4.804     24.729        96.62        18.77      376  22.9%: qoip(3).threads(8).entropy(1)
    4.878     39.109        95.16        11.87      375  22.8%: qoip(4).threads(8).entropy(1)


# Grand total for images-lance
decode_ms  encode_ms  decode_mpps  encode_mpps  size_kb  rate
   26.003     34.750       156.18       116.87     1569  10.1%: qoip(0).threads(8).entropy(1)
   26.204     39.235       154.98       103.51     1492   9.6%: qoip(1).threads(8).entropy(1)
   26.843     49.052       151.29        82.79     1464   9.4%: qoip(2).threads(8).entropy(1)
   27.461     83.002       147.88        48.93     1450   9.3%: qoip(3).threads(8).entropy(1)
   27.202    157.797       149.29        25.74     1449   9.3%: qoip(4).threads(8).entropy(1)

The crunch encoder looks for the best QOIP representation, then entropy-codes the resulting bitstream; it doesn't entropy-code every combination (it could potentially find slightly smaller total sizes if it did, so that might get rolled into an "extreme" setting down the line). The entropy-coding implementation is actually a fair way off being ready, so these are very early results.

@nigeltao
Owner

nigeltao commented Jan 11, 2022

it might be worth throwing an index-free combo into the mix.

Sorry for not being clearer. I wasn't suggesting opstrings just without OP_INDEX8. I was suggesting opstrings without any OP_INDEXN for all of the Ns.

Your 0106090f doesn't have 00=OP_INDEX8 but it still has 01=OP_INDEX7, taking a relatively expensive 128 opcodes. I'm suggesting trying something like 0608090f instead, subbing in 08=OP_LUMA1_232 for 01=OP_INDEX7.

@nigeltao
Owner

Tangential thought (you don't have to implement this, but I'll write it down before I forget it). The opstring codes are currently dense (e.g. 00=OP_INDEX8, 01=OP_INDEX7, etc). You could make them sparse, where the high 3 bits encode the MASK_X value:

  • High 3 bits are 0 means MASK_1
  • High 3 bits are 1 means MASK_2
  • High 3 bits are 2 means MASK_3
  • etc
  • High 3 bits are 7 means MASK_8

Having the opstring codes occur in sorted order then might make it easier to assign bit patterns (opcodes) to ops. It's kind of like how a canonical Huffman code's compact representation assigns bit patterns according to ascending code length.

It would also mean substituting one MASK_1 op for another (e.g. OP_LUMA1_232 for OP_INDEX7) is just a one-byte change to the opstring.
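A minimal sketch of that sparse scheme (helper names and the exact bit layout are illustrative, not the actual QOIP implementation): with the MASK_X value in the high 3 bits, sorting the opstring bytes groups ops by tag length, and opcode ranges can then be handed out in ascending order, canonical-Huffman style.

```c
#include <stdlib.h>

/* Hypothetical sparse opstring codes: high 3 bits give MASK_1..MASK_8
 * (the tag length), low 5 bits pick an op within that mask class. */
static int mask_bits(unsigned char code) { return (code >> 5) + 1; }

/* A MASK_n op consumes 256 >> n of the 256 opcode values. */
static int opcode_count(unsigned char code) { return 256 >> mask_bits(code); }

static int cmp_code(const void *a, const void *b) {
    return *(const unsigned char *)a - *(const unsigned char *)b;
}

/* Sort the opstring, then assign opcode ranges in ascending order,
 * much like a canonical Huffman code assigns bit patterns by length.
 * Returns the total opcode values used (must not exceed 256). */
static int assign_opcodes(unsigned char *ops, int n, int *base) {
    int next = 0, i;
    qsort(ops, (size_t)n, 1, cmp_code);
    for (i = 0; i < n; ++i) {
        base[i] = next;
        next += opcode_count(ops[i]);
    }
    return next;
}
```

Under this layout, substituting one MASK_1 op for another really is a one-byte change to the opstring, since both consume the same 128-value range.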

@chocolate42
Author

Sorry for not being clearer. I wasn't suggesting opstrings just without OP_INDEX8. I was suggesting opstrings without any OP_INDEXN for all of the Ns.

Generation currently uses as big a 1-byte index op as it can fit between the explicit ops chosen from a group of sets, plus RUN1 which has at least 4 values (maybe this should be 8?). Not doing so would just make RUN1 excessively large in most combinations, which is a waste; it could be set to also not add an index op when that wouldn't make RUN1 too large (32, say?). Effectively that would duplicate all existing combinations that contain INDEX4 and INDEX3, with those ops removed: 108 combinations if also excluding INDEX8, 216 otherwise. Might try it later. The main reason combinations without index ops aren't represented in the list is simply that there aren't enough explicit ops to fill the opcode space without triggering at least INDEX3.
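The rule reads something like this sketch (pick_index_op and run1_min are illustrative names, and treating INDEX3..INDEX7 as the candidate range is an assumption for illustration):

```c
/* Sketch of the generation rule: after the explicit ops claim `used` of
 * the 256 opcode values, pick the largest 1-byte OP_INDEXn (2^n codes)
 * that still leaves at least run1_min values for RUN1. */
static int pick_index_op(int used, int run1_min) {
    int n;
    for (n = 7; n >= 3; --n) {
        int remaining = 256 - used - (1 << n);
        if (remaining >= run1_min)
            return n; /* use OP_INDEXn; RUN1 gets `remaining` values */
    }
    return 0; /* no index op fits without starving RUN1 */
}
```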

Your 0106090f doesn't have 00=OP_INDEX8 but it still has 01=OP_INDEX7, taking a relatively expensive 128 opcodes. I'm suggesting trying something like 0608090f instead, subbing in 08=OP_LUMA1_232 for 01=OP_INDEX7.

That particular combination wouldn't make sense, as OP_DELTA and OP_LUMA1_232 are both 1-byte RGB encodings with a large overlap; all OP_DELTA brings to the party is encoding a few alpha values. OP_INDEX7 is expensive, but it's also the reason that combination performed as well as it did.

Having the opstring codes occur in sorted order then might make it easier to assign bit patterns (opcodes) to ops.

It wouldn't really; a one-line qsort on a small string, taking practically no runtime, is all that would be saved. The main win would be in cleaning up the opcode handling, particularly the qoip_ops array. Currently the density is relied upon: opstring codes are used directly as an index into qoip_ops, which is a bit messy. Cleaning that up has been on the TODO list for a while; sparse handling as described could do that and potentially eliminate the explicit mask, so I'll keep it in mind for the eventual cleanup.

Here's more preliminary entropy-coding data including ZSTD -19:

# Grand total for images
decode_ms  encode_ms  decode_mpps  encode_mpps  size kb  rate
    6.693     78.273        69.35         5.93      398  24.2%: libpng
    6.628     55.387        70.03         8.38      561  34.2%: stbi
    1.972      2.352       235.36       197.39      463  28.2%: qoi-c04a975
    4.123      6.440       112.58        72.08      401  24.4%: qoip(0).threads(1).entropy(RAW)
    4.117     18.149       112.73        25.58      398  24.2%: qoip(1).threads(1).entropy(RAW)
    4.065     32.324       114.20        14.36      393  24.0%: qoip(2).threads(1).entropy(RAW)
    4.054     56.828       114.51         8.17      392  23.9%: qoip(3).threads(1).entropy(RAW)
    4.051    107.580       114.60         4.31      390  23.8%: qoip(4).threads(1).entropy(RAW)
    4.185      6.656       110.92        69.74      383  23.3%: qoip(0).threads(1).entropy(LZ4)
    4.164     18.360       111.49        25.28      381  23.2%: qoip(1).threads(1).entropy(LZ4)
    4.124     32.754       112.54        14.17      377  23.0%: qoip(2).threads(1).entropy(LZ4)
    4.100     57.169       113.22         8.12      376  22.9%: qoip(3).threads(1).entropy(LZ4)
    4.099    108.446       113.23         4.28      375  22.8%: qoip(4).threads(1).entropy(LZ4)
    4.734     80.941        98.05         5.73      323  19.7%: qoip(0).threads(1).entropy(ZSTD19)
    4.713     91.863        98.48         5.05      323  19.7%: qoip(1).threads(1).entropy(ZSTD19)
    4.657    103.918        99.67         4.47      322  19.6%: qoip(2).threads(1).entropy(ZSTD19)
    4.650    127.798        99.83         3.63      322  19.6%: qoip(3).threads(1).entropy(ZSTD19)
    4.644    178.223        99.94         2.60      321  19.6%: qoip(4).threads(1).entropy(ZSTD19)
  ~libpng  very_long                                351 ~21.4%: zopflipng -m
                long                                249 ~15.2%: JpegXL (cjxl -e 9 -q 100)

# Grand total for images-lance
decode_ms  encode_ms  decode_mpps  encode_mpps  size kb  rate
   52.063    447.712        78.00         9.07     1395   9.0%: libpng
   41.683    365.202        97.43        11.12     2119  13.6%: stbi
   15.406     13.575       263.60       299.16     2109  13.6%: qoi-c04a975
   25.265     32.567       160.74       124.70     1664  10.7%: qoip(0).threads(1).entropy(RAW)
   25.509     72.390       159.20        56.10     1566  10.1%: qoip(1).threads(1).entropy(RAW)
   25.283    145.892       160.62        27.84     1541   9.9%: qoip(2).threads(1).entropy(RAW)
   25.235    289.706       160.93        14.02     1531   9.8%: qoip(3).threads(1).entropy(RAW)
   25.307    549.688       160.47         7.39     1528   9.8%: qoip(4).threads(1).entropy(RAW)
   25.814     34.653       157.32       117.19     1569  10.1%: qoip(0).threads(1).entropy(LZ4)
   25.745     74.010       157.74        54.87     1492   9.6%: qoip(1).threads(1).entropy(LZ4)
   25.695    148.291       158.05        27.39     1464   9.4%: qoip(2).threads(1).entropy(LZ4)
   25.875    293.236       156.95        13.85     1450   9.3%: qoip(3).threads(1).entropy(LZ4)
   25.804    551.072       157.38         7.37     1449   9.3%: qoip(4).threads(1).entropy(LZ4)
   28.908    649.023       140.48         6.26     1092   7.0%: qoip(0).threads(1).entropy(ZSTD19)
   29.056    613.478       139.77         6.62     1073   6.9%: qoip(1).threads(1).entropy(ZSTD19)
   28.423    696.585       142.88         5.83     1037   6.7%: qoip(2).threads(1).entropy(ZSTD19)
   28.569    853.861       142.15         4.76     1027   6.6%: qoip(3).threads(1).entropy(ZSTD19)

# Grand total for images
decode_ms  encode_ms  decode_mpps  encode_mpps  size kb  rate
    4.090      6.383       113.49        72.72      401  24.4%: qoip(0).threads(8).entropy(RAW)
    4.239      7.821       109.51        59.35      398  24.2%: qoip(1).threads(8).entropy(RAW)
    4.519     10.468       102.73        44.34      393  24.0%: qoip(2).threads(8).entropy(RAW)
    4.817     25.099        96.37        18.49      392  23.9%: qoip(3).threads(8).entropy(RAW)
    4.874     40.680        95.23        11.41      390  23.8%: qoip(4).threads(8).entropy(RAW)
    4.167      6.682       111.38        69.47      383  23.3%: qoip(0).threads(8).entropy(LZ4)
    4.319      8.125       107.46        57.13      381  23.2%: qoip(1).threads(8).entropy(LZ4)
    4.565     10.556       101.69        43.97      377  23.0%: qoip(2).threads(8).entropy(LZ4)
    4.833     25.196        96.03        18.42      376  22.9%: qoip(3).threads(8).entropy(LZ4)
    4.848     40.968        95.75        11.33      375  22.8%: qoip(4).threads(8).entropy(LZ4)
    4.748     81.029        97.77         5.73      323  19.7%: qoip(0).threads(8).entropy(ZSTD19)
    4.731     86.386        98.11         5.37      323  19.7%: qoip(1).threads(8).entropy(ZSTD19)
    4.726     87.980        98.21         5.28      322  19.6%: qoip(2).threads(8).entropy(ZSTD19)
    4.764    101.994        97.42         4.55      322  19.6%: qoip(3).threads(8).entropy(ZSTD19)
    4.776    116.973        97.20         3.97      321  19.6%: qoip(4).threads(8).entropy(ZSTD19)

# Grand total for images-lance
decode_ms  encode_ms  decode_mpps  encode_mpps  size kb  rate
   25.527     32.848       159.09       123.63     1664  10.7%: qoip(0).threads(8).entropy(RAW)
   26.044     37.188       155.93       109.20     1566  10.1%: qoip(1).threads(8).entropy(RAW)
   26.536     46.847       153.04        86.69     1541   9.9%: qoip(2).threads(8).entropy(RAW)
   27.169     79.733       149.48        50.93     1531   9.8%: qoip(3).threads(8).entropy(RAW)
   26.812    154.451       151.46        26.29     1528   9.8%: qoip(4).threads(8).entropy(RAW)
   25.430     34.127       159.69       119.00     1569  10.1%: qoip(0).threads(8).entropy(LZ4)
   25.831     38.749       157.22       104.81     1492   9.6%: qoip(1).threads(8).entropy(LZ4)
   26.424     49.128       153.69        82.66     1464   9.4%: qoip(2).threads(8).entropy(LZ4)
   26.953     81.469       150.67        49.85     1450   9.3%: qoip(3).threads(8).entropy(LZ4)
   26.681    155.137       152.21        26.18     1449   9.3%: qoip(4).threads(8).entropy(LZ4)
   29.222    642.450       138.97         6.32     1092   7.0%: qoip(0).threads(8).entropy(ZSTD19)
   28.804    581.748       140.99         6.98     1073   6.9%: qoip(1).threads(8).entropy(ZSTD19)
   28.879    606.481       140.62         6.70     1037   6.7%: qoip(2).threads(8).entropy(ZSTD19)
   28.956    651.906       140.25         6.23     1027   6.6%: qoip(3).threads(8).entropy(ZSTD19)
   28.899    723.684       140.52         5.61     1030   6.6%: qoip(4).threads(8).entropy(ZSTD19)
  • Entropy coding is single-threaded
  • Raw and LZ4 decode times scaling with thread count is a benchmark anomaly; they run slower because the benchmark doesn't wait for the CPU to recover from the SMT encode. ZSTD19 is unaffected as the CPU recovers during the ZSTD encode
  • ZSTD encode takes up the majority of the encode time
  • images-lance averages larger filesizes which ZSTD should prefer for better compression
  • qoip(1).entropy(ZSTD19) taking less time to encode than qoip(0).entropy(ZSTD19) shows there is benefit to using QOIP as a preprocessor; the ~100KB less data fed to ZSTD is apparent
  • If images is anything to go by, lossless JpegXL wins hands-down when compression is the only consideration; as expected we can't come close. Strong LZMA compression may put qoip(4).entropy(LZMA) for images at ~300 if we're lucky, but LZMA may take an excessive amount of encode time
  • images-lance compression regressing between qoip(3) and qoip(4) is interesting, it shows there would be a mild benefit to an extreme setting that entropy-codes every combination
  • Beating crunched PNG by every metric simultaneously when using zstd -19 is nice

@nigeltao
Owner

That particular combination wouldn't make sense as OP_DELTA and OP_LUMA1_232 are both 1 byte RGB encodings with a large overlap... OP_INDEX7 is expensive, but it's also the reason that combination performed as well as it did.

Ah, OK. Still, as I said, qoi-lz4-d28 gets pretty good compression and it has no index ops at all. Its ops are:

/* n=run_length, r=red, g=green, b=blue, a=alpha, s=r-g, c=b-g */
#define QOI_OP_LUMA3  0x00 /* .......0  +0 bytes  g3s2c2   */
#define QOI_OP_LUMA6  0x01 /* ......01  +1 bytes  g6s4c4   */
#define QOI_OP_RGB7   0x03 /* .....011  +2 bytes  r7g7b7   */
#define QOI_OP_RUNS   0x07 /* .....111  +0 bytes  n5       */
#define QOI_OP_RUNL   0xD7 /* 11010111  +1 bytes  n8       */
#define QOI_OP_RGBA2  0xDF /* 11011111  +1 bytes  r2g2b2a2 */
#define QOI_OP_RGBA4  0xE7 /* 11100111  +2 bytes  r4g4b4a4 */
#define QOI_OP_RGBA8  0xEF /* 11101111  +4 bytes  r8g8b8a8 */
#define QOI_OP_RGB8   0xF7 /* 11110111  +3 bytes  r8g8b8   */
#define QOI_OP_A8     0xFF /* 11111111  +1 bytes  a8       */

Something like that might still be worth putting into the mix. Perhaps subbing what you call OP_LUMA3_686 and OP_LUMA3_5654 for what I call QOI_OP_RGB7. You might also want to consider adding QOI_OP_RGBA2 and QOI_OP_RGBA4 to your possible ops.

@chocolate42
Author

chocolate42 commented Jan 11, 2022

I looked into adding d28 ops but that was back when things were being fiddled with manually, and not much benefit was noticed. Things are different now so they should be in the mix; RGBA7 already has close representation either side, but it should probably be there too. RGBA4 and RGB7 should probably be LUMA variants, with RGBA2 non-luma; DELTA/DIFF are the only non-LUMA variants because non-LUMA performs better when the bitcount is low, and LUMA starts seeing benefit at 232. Saying that though, the split was calculated based on having fixed opcodes. It may be worthwhile having both LUMA and non-LUMA versions available in qoip, at least for 222 and 232 which are at the boundary of where LUMA becomes viable.
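To illustrate why the split falls where it does, here's a hedged sketch of the two fit tests (helper names are made up, and the real QOIP luma ops use averaged predictions like avg_gr, which this sketch ignores): DIFF stores each channel delta directly, while LUMA stores R/B relative to G, which only pays off once the G field has enough bits.

```c
/* A signed value v fits in b bits if -(2^(b-1)) <= v < 2^(b-1). */
static int fits(int v, int bits) {
    return v >= -(1 << (bits - 1)) && v < (1 << (bits - 1));
}

/* DIFF-style (e.g. DIFF1_222): each channel delta stored directly. */
static int fits_diff222(int dr, int dg, int db) {
    return fits(dr, 2) && fits(dg, 2) && fits(db, 2);
}

/* LUMA-style (e.g. LUMA1_232): green delta stored directly; red/blue
 * stored relative to green, exploiting channel correlation. */
static int fits_luma232(int dr, int dg, int db) {
    return fits(dg, 3) && fits(dr - dg, 2) && fits(db - dg, 2);
}
```

A correlated delta like (3, 3, 2) fails the 222 test outright but passes 232, which is roughly where LUMA starts pulling ahead.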

Here's the sets the generator picks from, one is picked from each set where OP_END means "no op picked":

qoip_set_t qoipcrunch_set[] = {
	{ {OP_DIFF1_222, OP_LUMA1_232, OP_DELTA, OP_LUMA1_232_BIAS}, {64,128,32,128} },
	{ {OP_LUMA2_464, OP_LUMA2_454}, {64,32} },
	{ {OP_END, OP_LUMA3_686, OP_LUMA3_676, OP_LUMA3_787}, {0,16,8,64} },
	{ {OP_END, OP_DELTAA}, {0,64} },
	{ {OP_A, OP_LUMA2_3433}, {1,32} },
	{ {OP_END, OP_LUMA3_4645, OP_LUMA3_5654}, {0,8,16} },
	{ {OP_END, OP_LUMA4_7777}, {0,16} },
};

The sets represent in order 1/2/3 byte RGB encodings and 1/2/3/4 byte RGBA encodings. So RGB7 would fit in the third set as OP_LUMA3_777, RGBA2 would fit in the fifth set as OP_DIFF1_2222, and RGBA4 in the sixth set as OP_LUMA3_4444. It's probably sensible to have 8 bit tag opcodes present in every set (and ops taking at least 8, 16, 32 codes for that matter). The training computer is going to love it ;)
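The pick-one-from-each-set search is just a mixed-radix counter over the set sizes; a sketch (count_combinations and nth_combination are illustrative names), using the set sizes from the snippet above:

```c
/* One op is chosen from each set, so the total number of combinations
 * is the product of the set sizes. */
static long count_combinations(const int *set_sizes, int nsets) {
    long total = 1;
    int i;
    for (i = 0; i < nsets; ++i)
        total *= set_sizes[i];
    return total;
}

/* Decode combination index k into per-set choices (mixed-radix digits);
 * choice[i] is the index of the op picked from set i. */
static void nth_combination(long k, const int *set_sizes, int nsets, int *choice) {
    int i;
    for (i = 0; i < nsets; ++i) {
        choice[i] = (int)(k % set_sizes[i]);
        k /= set_sizes[i];
    }
}
```

With the sizes above ({4,2,4,2,2,3,2}) that's 768 combinations before any new ops are added, which is why fleshing out the sets explodes the search space.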

edit: If we're getting to the point of fleshing out all the sets to represent everything, it may be time to overhaul qoipcrunch_encode entirely. Instead of trying a selection of combinations, a smart qoipcrunch_encode would do a stat-gathering pass on the image, analyse the stats to find the best opstring, then encode it.

@chocolate42
Author

Implemented the first iteration of a smart crunch encode path:

  • Use with -smart flag
  • Currently incompatible with -custom argument
  • Three phases: a stat pass, x combination passes for x combinations (using the same combinations at the same effort levels as the dumb path, for comparison), and an encode pass which does the actual encode
  • No ops have been added yet so we can directly compare with the dumb path. A future commit will flesh out the LUMA ops, probably to every viable data size with RGB and RGBA options, potentially giving us many orders of magnitude more viable combinations to search through
  • The smart path is slower for effort levels 0 and 1 in the data below; those levels can get an escape hatch to use the non-smart path if they remain slower
# Grand total for images (dumb)
decode_ms  encode_ms  decode_mpps  encode_mpps  size_kb  rate
    4.053      6.221       114.52        74.61      401  24.4%: qoip(0).threads(1).entropy(0).smart(0)
    4.164     18.137       111.46        25.59      398  24.2%: qoip(1).threads(1).entropy(0).smart(0)
    4.106     32.747       113.05        14.17      393  24.0%: qoip(2).threads(1).entropy(0).smart(0)
    4.105     57.542       113.07         8.07      392  23.9%: qoip(3).threads(1).entropy(0).smart(0)
    4.088    107.643       113.55         4.31      390  23.8%: qoip(4).threads(1).entropy(0).smart(0)
    4.093    212.393       113.40         2.19      390  23.8%: qoip(5).threads(1).entropy(0).smart(0)
    4.079    409.721       113.80         1.13      390  23.8%: qoip(6).threads(1).entropy(0).smart(0)

# Grand total for images (smart)
decode_ms  encode_ms  decode_mpps  encode_mpps  size_kb  rate
    4.006     24.224       115.87        19.16      401  24.4%: qoip(0).threads(1).entropy(0).smart(1)
    4.084     25.670       113.67        18.08      398  24.2%: qoip(1).threads(1).entropy(0).smart(1)
    4.053     26.903       114.54        17.25      393  24.0%: qoip(2).threads(1).entropy(0).smart(1)
    4.042     29.296       114.85        15.84      392  23.9%: qoip(3).threads(1).entropy(0).smart(1)
    4.048     33.813       114.65        13.73      390  23.8%: qoip(4).threads(1).entropy(0).smart(1)
    4.025     42.536       115.33        10.91      390  23.8%: qoip(5).threads(1).entropy(0).smart(1)
    4.017     60.857       115.56         7.63      390  23.8%: qoip(6).threads(1).entropy(0).smart(1)

# Grand total for images-lance (dumb)
decode_ms  encode_ms  decode_mpps  encode_mpps  size_kb  rate
   25.365     31.705       160.10       128.09     1664  10.7%: qoip(0).threads(1).entropy(0).smart(0)
   25.698     71.780       158.03        56.58     1566  10.1%: qoip(1).threads(1).entropy(0).smart(0)
   25.604    143.126       158.61        28.37     1541   9.9%: qoip(2).threads(1).entropy(0).smart(0)
   25.436    284.000       159.66        14.30     1531   9.8%: qoip(3).threads(1).entropy(0).smart(0)
   25.564    538.957       158.86         7.54     1528   9.8%: qoip(4).threads(1).entropy(0).smart(0)
   25.362   1042.374       160.13         3.90     1527   9.8%: qoip(5).threads(1).entropy(0).smart(0)
   25.252   2051.279       160.82         1.98     1527   9.8%: qoip(6).threads(1).entropy(0).smart(0)

# Grand total for images-lance (smart)
decode_ms  encode_ms  decode_mpps  encode_mpps  size_kb  rate
   25.372    120.906       160.06        33.59     1664  10.7%: qoip(0).threads(1).entropy(0).smart(1)
   25.833    123.804       157.21        32.80     1566  10.1%: qoip(1).threads(1).entropy(0).smart(1)
   25.575    125.917       158.79        32.25     1541   9.9%: qoip(2).threads(1).entropy(0).smart(1)
   25.382    132.474       160.00        30.66     1531   9.8%: qoip(3).threads(1).entropy(0).smart(1)
   25.262    144.818       160.76        28.04     1528   9.8%: qoip(4).threads(1).entropy(0).smart(1)
   25.171    168.852       161.34        24.05     1527   9.8%: qoip(5).threads(1).entropy(0).smart(1)
   25.105    218.135       161.76        18.62     1527   9.8%: qoip(6).threads(1).entropy(0).smart(1)
  • Much better scaling will allow us to implement more ops for more combinations
  • Single threaded for now, OpenMP concurrency for the combination passes is trivial but until more ops are added (exploding combination count) it's low priority
  • The stat pass takes much longer than a combination pass and is not necessarily concurrency-friendly (some of it could be split into a concurrent block). Thankfully by design it's only done once per image and optimisation has barely been explored
  • This smart-combination-function that does x combination passes is somewhere between the existing dumb trial-function and an ideal smartest-function that spits out the optimal encoding out of all possibilities
  • Even sticking to this smart-combination-function there are many unexplored areas: an intermediate pass over the stat array to merge identical data (like we do with short runs), smartly handling LUMA ops instead of using function pointers, SIMD, and concurrency

@chocolate42
Author

The current state of the smart function is much improved:

# Grand total for images
decode_ms  encode_ms  decode_mpps  encode_mpps  size kb  rate
    1.945      2.335       238.71       198.81      463  28.2%: qoi-c04a975
    4.017      6.304       115.54        73.63      401  24.4%: qoip(0).threads(1).entropy(0).smart(1)
    4.002     12.927       115.99        35.91      398  24.2%: qoip(1).threads(1).entropy(0).smart(1)
    4.027     13.268       115.27        34.98      393  24.0%: qoip(2).threads(1).entropy(0).smart(1)
    4.112     14.187       112.88        32.72      392  23.9%: qoip(3).threads(1).entropy(0).smart(1)
    4.065     15.308       114.20        30.32      391  23.8%: qoip(4).threads(1).entropy(0).smart(1)
    4.077     17.892       113.87        25.94      390  23.8%: qoip(5).threads(1).entropy(0).smart(1)
    4.100     23.038       113.20        20.15      390  23.8%: qoip(6).threads(1).entropy(0).smart(1)

I'm trying to keep spam in this thread to a minimum while still using it as a log. More details and open questions are in issues in the repo.

@chocolate42
Author

  • Implemented smarter-crunch-function which can reasonably quickly test thousands of combinations
  • Add qoipcrunch controller, which picks which crunch function to use depending on the effort level
    • Level 0 directly encodes using a strong combination
    • Levels 1..3 pick the best from a set of strong combinations (4/16/64 respectively using stat integers for speed)
    • Levels 4..6 try increasingly exhaustive sets of combinations (using log tables for speed)
  • -smart option removed and ui generally tidied by using controller function
  • Custom string handling spun off into its own function
  • Dumb path removed entirely as stat integer path is faster in all cases
# Grand total for images
decode_ms  encode_ms  decode_mpps  encode_mpps  size kb  rate
    6.782     78.896        68.44         5.88      398  24.2%: libpng
    6.687     55.248        69.41         8.40      561  34.2%: stbi
    1.991      2.370       233.12       195.82      463  28.2%: qoi-c04a975
    4.160      6.454       111.58        71.92      401  24.4%: qoip(0).threads(1).entropy(0)
    4.079     13.582       113.79        34.18      393  24.0%: qoip(1).threads(1).entropy(0)
    4.060     15.408       114.32        30.12      391  23.8%: qoip(2).threads(1).entropy(0)
    4.083     23.129       113.70        20.07      391  23.8%: qoip(3).threads(1).entropy(0)
    4.061     25.304       114.30        18.34      388  23.6%: qoip(4).threads(1).entropy(0)
    4.023     40.715       115.38        11.40      387  23.6%: qoip(5).threads(1).entropy(0)
    4.021    124.926       115.43         3.72      387  23.6%: qoip(6).threads(1).entropy(0)

# Grand total for images-lance
decode_ms  encode_ms  decode_mpps  encode_mpps  size kb  rate
   52.720    450.777        77.03         9.01     1395   9.0%: libpng
   42.738    365.697        95.02        11.11     2119  13.6%: stbi
   15.883     13.750       255.69       295.36     2109  13.6%: qoi-c04a975
   25.281     31.916       160.64       127.24     1664  10.7%: qoip(0).threads(1).entropy(0)
   25.417     69.647       159.78        58.31     1545   9.9%: qoip(1).threads(1).entropy(0)
   25.400     79.300       159.89        51.21     1532   9.9%: qoip(2).threads(1).entropy(0)
   25.278    114.082       160.66        35.60     1534   9.9%: qoip(3).threads(1).entropy(0)
   25.405    126.329       159.85        32.15     1516   9.8%: qoip(4).threads(1).entropy(0)
   25.434    185.798       159.67        21.86     1505   9.7%: qoip(5).threads(1).entropy(0)
   25.354    486.395       160.17         8.35     1500   9.7%: qoip(6).threads(1).entropy(0)

@nigeltao
Owner

nigeltao commented Feb 4, 2022

Level 0 directly encodes using a strong combination

Out of curiosity, what ops are in the "strong combination"?

Levels 1..3 pick the best from a set of strong combinations (4/16/64 respectively using stat integers for speed)
Levels 4..6 try increasingly exhaustive sets of combinations (using log tables for speed)

Forgive my naivety. What are "stat integers" and "log tables"? And if they're both "for speed", which is speedier? Why use one or the other? Something something memory use??

@chocolate42
Author

tl;dr: the stat integer method is a way to test a small set of combinations quicker than brute force, but it doesn't scale up well to thousands of combinations. The log method can test thousands or hundreds of thousands of combinations by taking advantage of most ops being luma variants, but it has overhead and is a WIP, so it doesn't currently scale down well.

The combination in level 0 and the combinations in levels 1..3 are the same as they were before (the 64 combinations in qoipcrunch-list.h). They're mostly the combinations you may remember (only one op has changed, from LUMA4_7777 to LUMA4_7876), though the binary values of the opcodes have changed to implement your sparse suggestion from some posts up. Here's the strongest combination, nothing exciting:

OP_LUMA1_232B: 1 byte delta, OP_LUMA1_232 but R/B are biased depending on direction of G
OP_LUMA2_464:  2 byte delta, ( avg_gr - 8.. 7, avg_g  -32..31, avg_gb - 8.. 7, a        0 )
OP_INDEX5:     1 byte,  32 value index cache                                               
OP_LUMA3_676:  3 byte delta, ( avg_gr -32..31, avg_g  -64..63, avg_gb -32..31, a        0 )
OP_LUMA3_4645: 3 byte delta, ( avg_gr - 8.. 7, avg_g  -32..31, avg_gb - 8.. 7, va -16..15 )
OP_INDEX8:     2 byte, 256 value index cache                                               
OP_A:          2 byte delta, ( r            0, g            0, b            0, a          )
OP_RGB
OP_RGBA
OP_RUN2
OP_RUN1
RUN1 range: 1..11
RUN2 range: 12..267

The stat integer method, implemented in qoipcrunch_encode_smart and explained in detail above, was a first attempt at a function that can try many combinations quicker than brute force. Every non-run pixel is tried against a set of non-run ops; each non-run pixel gets a stat integer with flags defining whether an op can store that pixel. Testing a combination involves iterating over the stat integers with a combination mask and adding the result to the RLE length, which is also calculated from stats collected in the stat pass. The current level 3 is the old level 6 from a few posts up, so what took 409.721 ms to brute-force 64 combinations only takes 23.129 ms now (the space-efficiency regression is due to changing the 7777 op to 7876 without retraining). The function doesn't use any of the new ops.
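The stat-integer test can be sketched like this (the op set, byte costs, and names are illustrative, and the real function also folds in the RLE length from the stat pass):

```c
#include <stddef.h>

/* Each non-run pixel gets a bitmask of which candidate ops can store it
 * (flags assigned once, during the stat pass). Testing a combination
 * then only touches the masks, not the pixels. Ops are ordered
 * cheapest-first; a 5-byte RGBA literal is the fallback. */
#define NOPS 4
static const int op_bytes[NOPS] = {1, 2, 3, 4}; /* cost if op i is used */

/* combo_mask: bit i set if op i is in the combination being tested. */
static size_t combo_size(const unsigned *stats, size_t n, unsigned combo_mask) {
    size_t total = 0, p;
    int i;
    for (p = 0; p < n; ++p) {
        unsigned usable = stats[p] & combo_mask;
        for (i = 0; i < NOPS; ++i)
            if (usable & (1u << i))
                break;                      /* cheapest usable op wins */
        total += (i < NOPS) ? (size_t)op_bytes[i] : 5;
    }
    return total;
}
```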

The log tables method is a recent WIP that may change quite a bit, and the code needs a good kicking (good luck attempting to read it in its current state); its implementation is in qoipcrunch_encode_smarter. Roughly speaking, it condenses pixels that can be handled by the same luma op into a count in a lumalog table, so that instead of iterating over all non-run pixels for every combination pass we only need to iterate over a small table. For example, say we have a pixel difference of (gr=2, g=9, gb=2); the minimum luma op required to store it is 353, so we increment the corresponding index in the lumalog table.

In reality the lumalog implementation is a little more complicated because there are a bunch of non-luma ops, currently there is a lumalog table and a baseline encoding size for every combination of 1-byte-index/1-byte-diff/2-byte-index op. I'm also taking advantage of some properties of the luma ops we're interested in to reduce the size of the lumalog tables:

  • Every op has same-sized r and b fields, so we only need to store the maximum of log_r and log_b of a pixel (named log_rb)
  • We are only interested in log_g in the range 3..8 as all luma ops >1 byte store at minimum log_g of 3. So all log_g 0..3 are clamped to 3, allowing 0..2 to not exist (removing 1/3 of the table)
  • Similarly we are only interested in log_rb in the range 2..8
  • The next thing on the TODO list is to figure out if we can take advantage of log_rb being within 2 of log_g, and something similar for log_a

Hopefully some of that makes sense.

@chocolate42

chocolate42 commented Feb 7, 2022

  • Added INDEX9 and INDEX10 as 512/1024 value caches
  • Manually tested some combinations to find a better one for effort level 0. It's not used for effort levels 1..3 as it uses new ops that are unsupported by the old smart function.
  • There is a bug somewhere in the old smart function causing effort level 3 to be worse than level 2 for "images-lance", even though level 3 tests a superset of level 2's combinations, which should make that impossible. The resulting encoding is valid but there's an error somewhere in choosing the best combination. Instead of fixing the bug I'll likely optimise the smarter function to beat the smart function in all cases and then remove the smart function entirely.
# Grand total for images
decode_ms  encode_ms  decode_mpps  encode_mpps  size_kb  rate
    4.134      6.322       112.29        73.43      398  24.3%: qoip(0).threads(1).entropy(0)

# Grand total for images-lance
decode_ms  encode_ms  decode_mpps  encode_mpps  size_kb  rate
   25.313     30.991       160.43       131.04     1635  10.5%: qoip(0).threads(1).entropy(0)
New best combination:
OP_LUMA1_232B: 1 byte delta, OP_LUMA1_232 but R/B are biased depending on direction of G
OP_LUMA2_464:  2 byte delta, ( avg_gr - 8.. 7, avg_g  -32..31, avg_gb - 8.. 7, a        0 )
OP_INDEX5:     1 byte,   32 value index cache                                               
OP_LUMA3_676:  3 byte delta, ( avg_gr -32..31, avg_g  -64..63, avg_gb -32..31, a        0 )
OP_INDEX10:    2 byte, 1024 value index cache                                               
OP_LUMA4_6866: 4 byte delta, ( avg_gr -32..31, g             , avg_gb -32..31, va -32..31 )
OP_LUMA2_2322: 2 byte delta, ( avg_gr - 2.. 1, avg_g  - 4.. 3, avg_gb - 2.. 1, va - 2.. 1 )
OP_LUMA3_4544: 3 byte delta, ( avg_gr - 8.. 7, avg_g  -16..15, avg_gb - 8.. 7, va - 8.. 7 )
OP_A:          2 byte delta, ( r            0, g            0, b            0, a          )
OP_RGB
OP_RGBA
OP_RUN2
OP_RUN1
RUN1 range: 1..8
RUN2 range: 9..264

@chocolate42

  • Optimised smarter function to the point that it can replace the smart function at lower combination counts
  • Removed smart function and the list it used
  • Updated and moved test string to qoipcrunch.h
  • Fixed iteration bug present since last commit
  • Check that the size of an encode is what we expect in smarter function
# Grand total for images
decode_ms  encode_ms  decode_mpps  encode_mpps  size_kb  rate
    6.703     78.367        69.25         5.92      398  24.2%: libpng
    6.631     54.778        70.00         8.47      561  34.2%: stbi
    1.970      2.359       235.61       196.76      463  28.2%: qoi-c04a975
    4.268      6.530       108.76        71.08      398  24.3%: qoip(0).threads(1).entropy(0)
    4.241     16.472       109.45        28.18      392  23.9%: qoip(1).threads(1).entropy(0)
    4.178     23.103       111.10        20.09      386  23.5%: qoip(2).threads(1).entropy(0)
    4.167     37.755       111.39        12.29      385  23.5%: qoip(3).threads(1).entropy(0)
    4.181     47.128       111.01         9.85      385  23.5%: qoip(4).threads(1).entropy(0)
    4.184    103.750       110.94         4.47      385  23.4%: qoip(5).threads(1).entropy(0)
    4.146    146.032       111.96         3.18      384  23.4%: qoip(6).threads(1).entropy(0)

# Grand total for images-lance
decode_ms  encode_ms  decode_mpps  encode_mpps  size_kb  rate
   52.210    447.593        77.78         9.07     1395   9.0%: libpng
   42.043    367.887        96.59        11.04     2119  13.6%: stbi
   15.565     13.665       260.91       297.20     2109  13.6%: qoi-c04a975
   25.263     31.419       160.75       129.25     1635  10.5%: qoip(0).threads(1).entropy(0)
   24.942     84.407       162.82        48.11     1543   9.9%: qoip(1).threads(1).entropy(0)
   25.111    113.510       161.72        35.78     1514   9.7%: qoip(2).threads(1).entropy(0)
   25.854    183.341       157.08        22.15     1499   9.6%: qoip(3).threads(1).entropy(0)
   25.832    215.016       157.21        18.89     1498   9.6%: qoip(4).threads(1).entropy(0)
   25.929    438.435       156.62         9.26     1494   9.6%: qoip(5).threads(1).entropy(0)
   25.892    590.570       156.85         6.88     1494   9.6%: qoip(6).threads(1).entropy(0)

chocolate42 pushed a commit to chocolate42/qoipond that referenced this issue Feb 8, 2022
…size. Removed smart function as smarter function is now good enough. nigeltao/qoi2-bikeshed#31 (comment)
@chocolate42

  • Rework qoip_encode_run()
  • Rework luma1_232b encode
  • Reintroduce -effort -1 as a single fast combination, using only luma ops and OP_RGB/OP_RGBA/OP_RUN1/OP_RUN2
  • Implement encode/decode fastpaths for effort levels -1/0
  • const/restrict keyword pass over qoip.h which might help compiler optimisation, but probably not
# Grand total for images
decode_ms  encode_ms  decode_mpps  encode_mpps  size_kb  rate
    3.780      4.501       122.81       103.14      452  27.5%: qoi-nolz4-d28
    1.951      2.343       237.90       198.14      463  28.2%: qoi-c04a975
    2.740      4.087       169.43       113.57      442  26.9%: qoip(-1).threads(1).entropy(0)
    3.357      4.873       138.26        95.25      398  24.3%: qoip(0).threads(1).entropy(0)

# Grand total for images-lance
decode_ms  encode_ms  decode_mpps  encode_mpps  size_kb  rate
   26.012     25.549       156.12       158.95     1777  11.4%: qoi-nolz4-d28
   15.209     13.565       267.01       299.39     2109  13.6%: qoi-c04a975
   18.131     23.195       223.99       175.08     1765  11.4%: qoip(-1).threads(1).entropy(0)
   22.760     27.320       178.43       148.65     1635  10.5%: qoip(0).threads(1).entropy(0)

Tried using a LUT for the encode fastpaths but couldn't create something faster than an else chain. Even effort level -1 (which can encode everything except run ops via a single LUT) couldn't beat an else chain. A LUT was also slower for the decode fastpaths.

@chocolate42

  • Finally implemented FIFO caches, only for the 1 byte indexes
  • Hacked in FIFO caches for effort level 6. Being able to choose between hash cache and FIFO cache is a minor space saving for double the time cost, so it'll probably stay an effort level 6 feature only
  • Slight generic-path decode regression thanks to the refactoring needed to get FIFO working; fixing that regression is a TODO
# Grand total for images
decode_ms  encode_ms  decode_mpps  encode_mpps  size_kb  rate
    6.707     78.375        69.20         5.92      398  24.2%: libpng
    6.649     56.018        69.81         8.29      561  34.2%: stbi
    1.978      2.357       234.66       196.89      463  28.2%: qoi-c04a975
    2.770      4.184       167.56       110.93      442  26.9%: qoip(-1).threads(1).entropy(0)
    3.409      4.915       136.16        94.44      398  24.3%: qoip(0).threads(1).entropy(0)
    4.269     16.752       108.73        27.71      392  23.9%: qoip(1).threads(1).entropy(0)
    4.201     24.055       110.50        19.30      386  23.5%: qoip(2).threads(1).entropy(0)
    4.202     39.186       110.46        11.85      385  23.5%: qoip(3).threads(1).entropy(0)
    4.197     48.207       110.59         9.63      385  23.5%: qoip(4).threads(1).entropy(0)
    4.205    104.881       110.40         4.43      385  23.4%: qoip(5).threads(1).entropy(0)
    4.056    283.498       114.43         1.64      383  23.3%: qoip(6).threads(1).entropy(0)

# Grand total for images-lance
decode_ms  encode_ms  decode_mpps  encode_mpps  size_kb  rate
   52.151    447.870        77.87         9.07     1395   9.0%: libpng
   41.809    363.202        97.13        11.18     2119  13.6%: stbi
   15.432     13.658       263.15       297.34     2109  13.6%: qoi-c04a975
   17.309     21.823       234.63       186.10     1765  11.4%: qoip(-1).threads(1).entropy(0)
   22.057     26.365       184.12       154.03     1635  10.5%: qoip(0).threads(1).entropy(0)
   26.065     88.521       155.81        45.88     1543   9.9%: qoip(1).threads(1).entropy(0)
   26.197    121.529       155.02        33.42     1514   9.7%: qoip(2).threads(1).entropy(0)
   26.339    188.106       154.19        21.59     1499   9.6%: qoip(3).threads(1).entropy(0)
   26.452    223.389       153.53        18.18     1498   9.6%: qoip(4).threads(1).entropy(0)
   26.204    438.621       154.98         9.26     1494   9.6%: qoip(5).threads(1).entropy(0)
   25.026   1121.820       162.28         3.62     1492   9.6%: qoip(6).threads(1).entropy(0)
  • How we're going to make a dent in libpng's RGBA lead I do not know
  • Code tidying TODO, it's starting to resemble spaghetti with all the additions being bolted on
  • Will probably switch the hash cache and FIFO cache 1-byte ops in a future commit, so FIFO is standard and the hash cache is used only in effort level 6. FIFO has quicker decode and smaller files on average for only a minor encode cost
  • Might not implement a FIFO cache for the 2-byte ops. The 1-byte index ops were easier to implement as they're the first op in line; the 2-byte index ops come after the 1-byte rgb/rgba delta ops, which would make a generic implementation (one that doesn't wastefully compute both hash and FIFO indexes only to discard one of the results) painful

@chocolate42

FIFO caches might not be the way to go. In the presence of entropy coding they actually perform worse, because they decorrelate the bytes of same-index matches (this holds even for LZ4, which as a weaker codec you might expect to care more about raw input size):

# Grand total for images-lance
decode_ms  encode_ms  decode_mpps  encode_mpps  size_kb  rate
   25.297    212.432       160.53        19.12     1426   9.2%: qoip(4).threads(1).entropy(1)
   25.333    419.222       160.31         9.69     1423   9.2%: qoip(5).threads(1).entropy(1)
   24.797   1073.486       163.77         3.78     1432   9.2%: qoip(6).threads(1).entropy(1)
   28.679    779.268       141.60         5.21     1027   6.6%: qoip(4).threads(1).entropy(2)
   28.547    980.602       142.26         4.14     1027   6.6%: qoip(5).threads(1).entropy(2)
   27.940   1622.503       145.35         2.50     1040   6.7%: qoip(6).threads(1).entropy(2)

# Grand total for images
decode_ms  encode_ms  decode_mpps  encode_mpps  size_kb  rate
    4.102     46.688       113.15         9.94      370  22.5%: qoip(4).threads(1).entropy(1)
    4.194    100.328       110.69         4.63      370  22.5%: qoip(5).threads(1).entropy(1)
    4.029    273.973       115.20         1.69      371  22.6%: qoip(6).threads(1).entropy(1)
    4.687    117.043        99.04         3.97      320  19.5%: qoip(4).threads(1).entropy(2)
    4.673    170.746        99.34         2.72      320  19.5%: qoip(5).threads(1).entropy(2)
    4.557    339.445       101.85         1.37      325  19.8%: qoip(6).threads(1).entropy(2)

Combination choice happens before entropy coding, so effort level 6 chooses poorly. In qoipcrunch it might be sensible to always use the hash cache when entropy coding and the FIFO cache when not. Don't cross the streams.

@chocolate42

After doing some deeper investigation it looks like, for large images with ZSTD coding, it makes sense to avoid index ops entirely. Even hash indexing appears to be detrimental to ZSTD once the input is big enough to let the entropy coder stretch its legs. When the input is smaller there are some gains from hash caching; FIFO caching doesn't appear good at all with entropy coding (LZ4 or ZSTD). LZ4 doesn't react to large input the same way ZSTD does; it appears to be a weak enough codec that simple hash caching remains a benefit.

Determining how the smart qoipcrunch function should proceed remains an open question and is more complicated than first thought. My current leaning is this priority list:

  • Combination from effort level -1 if requested
  • If input is above some size threshold and ZSTD coding is requested, use the non-index combination in effort -1
  • Combination from effort level 0 if requested
  • For effort levels 1..6, if entropy coding is not used test combinations with 1 byte FIFO caching, otherwise test with 1 byte HASH caching

@chocolate42

As described above there's now a size threshold on the input when ZSTD encoding: if the raw input is >256KiB, the combination from effort level -1 is used, otherwise the encoder does what the user selected. This stops images-lance regressing as effort increases without causing images to regress:

# Grand total for images
decode_ms  encode_ms  decode_mpps  encode_mpps  size_kb  rate
    3.964    102.146       117.11         4.54      330  20.1%: qoip(-1).threads(1).entropy(2)
    4.009     99.680       115.78         4.66      321  19.6%: qoip(0).threads(1).entropy(2)
    4.033    100.456       115.10         4.62      321  19.5%: qoip(1).threads(1).entropy(2)
    3.977     99.741       116.71         4.65      320  19.5%: qoip(2).threads(1).entropy(2)
    4.021    105.009       115.44         4.42      320  19.5%: qoip(3).threads(1).entropy(2)
    4.021    111.280       115.45         4.17      320  19.5%: qoip(4).threads(1).entropy(2)
    3.998    146.351       116.11         3.17      320  19.5%: qoip(5).threads(1).entropy(2)
    4.008    178.058       115.81         2.61      320  19.5%: qoip(6).threads(1).entropy(2)

# Grand total for images-lance
decode_ms  encode_ms  decode_mpps  encode_mpps  size_kb  rate
   23.964    871.160       169.47         4.66      989   6.4%: qoip(-1).threads(1).entropy(2)
   24.093    869.158       168.56         4.67      991   6.4%: qoip(0).threads(1).entropy(2)
   24.051    867.812       168.85         4.68      990   6.4%: qoip(1).threads(1).entropy(2)
   24.168    866.102       168.04         4.69      990   6.4%: qoip(2).threads(1).entropy(2)
   24.053    873.980       168.84         4.65      990   6.4%: qoip(3).threads(1).entropy(2)
   24.011    894.303       169.14         4.54      990   6.4%: qoip(4).threads(1).entropy(2)
   24.032    968.066       168.98         4.20      989   6.4%: qoip(5).threads(1).entropy(2)
   23.952   1033.282       169.55         3.93      989   6.4%: qoip(6).threads(1).entropy(2)

# Grand total for images
decode_ms  encode_ms  decode_mpps  encode_mpps  size_kb  rate
    2.725      5.903       170.31        78.63      442  26.9%: qoip(-1).threads(1).entropy(0)
    3.352      4.831       138.49        96.08      398  24.3%: qoip(0).threads(1).entropy(0)
    4.066     16.858       114.17        27.53      391  23.8%: qoip(1).threads(1).entropy(0)
    3.961     23.353       117.18        19.88      384  23.4%: qoip(2).threads(1).entropy(0)
    3.987     39.045       116.42        11.89      384  23.4%: qoip(3).threads(1).entropy(0)
    4.007     48.123       115.83         9.65      383  23.4%: qoip(4).threads(1).entropy(0)
    3.993    105.667       116.26         4.39      383  23.4%: qoip(5).threads(1).entropy(0)
    4.010    145.145       115.75         3.20      383  23.3%: qoip(6).threads(1).entropy(0)

# Grand total for images-lance
decode_ms  encode_ms  decode_mpps  encode_mpps  size_kb  rate
   17.859     22.434       227.39       181.02     1765  11.4%: qoip(-1).threads(1).entropy(0)
   22.582     26.834       179.84       151.34     1635  10.5%: qoip(0).threads(1).entropy(0)
   24.559     89.560       165.36        45.34     1541   9.9%: qoip(1).threads(1).entropy(0)
   24.166    119.442       168.05        34.00     1512   9.7%: qoip(2).threads(1).entropy(0)
   24.382    187.421       166.56        21.67     1497   9.6%: qoip(3).threads(1).entropy(0)
   24.453    220.878       166.07        18.39     1496   9.6%: qoip(4).threads(1).entropy(0)
   24.516    436.698       165.65         9.30     1492   9.6%: qoip(5).threads(1).entropy(0)
   24.667    584.959       164.64         6.94     1492   9.6%: qoip(6).threads(1).entropy(0)

# Grand total for images
decode_ms  encode_ms  decode_mpps  encode_mpps  size_kb  rate
    3.043      4.415       152.54       105.13      409  24.9%: qoip(-1).threads(1).entropy(1)
    3.686      5.071       125.93        91.53      381  23.2%: qoip(0).threads(1).entropy(1)
    4.208     16.849       110.32        27.55      376  22.9%: qoip(1).threads(1).entropy(1)
    4.139     24.264       112.15        19.13      371  22.6%: qoip(2).threads(1).entropy(1)
    4.138     38.988       112.18        11.91      370  22.5%: qoip(3).threads(1).entropy(1)
    4.145     48.437       111.99         9.58      370  22.5%: qoip(4).threads(1).entropy(1)
    4.127    103.612       112.48         4.48      370  22.5%: qoip(5).threads(1).entropy(1)
    4.129    144.209       112.42         3.22      369  22.5%: qoip(6).threads(1).entropy(1)

# Grand total for images-lance
decode_ms  encode_ms  decode_mpps  encode_mpps  size_kb  rate
   20.901     24.947       194.30       162.79     1522   9.8%: qoip(-1).threads(1).entropy(1)
   25.662     28.625       158.26       141.87     1551  10.0%: qoip(0).threads(1).entropy(1)
   25.667     89.689       158.22        45.28     1472   9.5%: qoip(1).threads(1).entropy(1)
   25.546    124.362       158.97        32.66     1444   9.3%: qoip(2).threads(1).entropy(1)
   25.800    186.556       157.40        21.77     1427   9.2%: qoip(3).threads(1).entropy(1)
   25.857    219.596       157.06        18.49     1426   9.2%: qoip(4).threads(1).entropy(1)
   25.819    434.337       157.29         9.35     1423   9.2%: qoip(5).threads(1).entropy(1)
   25.787    584.645       157.48         6.95     1423   9.2%: qoip(6).threads(1).entropy(1)
