Provide an efficient way to decompress a sequence of chunks compressed with ZstdCompressionChunker #259


Open
jbosboom opened this issue Apr 5, 2025 · 0 comments


My program wants to compress some large cached strings and decompress them later. I have no particular requirements on the form of the compressed data, so I used ZstdCompressionChunker to do the compression, avoiding repeated reallocation of the output buffer (roughly the sketch after this list). I would like to process the decompressed data in chunks to reduce peak memory usage. However, there is no obvious efficient way to decompress chunks to chunks:

  • The ZstdCompressionChunker round-trip tests all concatenate the chunks with bytes.join for one-shot decompression. (Fine, they're tests.)

  • I tried itertools.chain.from_iterable(dctx.read_to_iter(c) for c in chunks). This doesn't work because each read_to_iter iterator expects to process a full stream. (I expected it to hold state in the ZstdDecompressor it was obtained from.)

  • ZstdDecompressionObj's documentation says it isn't efficient (a sketch of this route appears after the feature request below):

    Because calls to decompress() may need to perform multiple memory (re)allocations, this streaming decompression API isn’t as efficient as other APIs.

  • read_to_iter's documentation says

    read_to_iter() accepts an object with a read(size) method that will return compressed bytes or an object conforming to the buffer protocol.

    so I wrote a class whose read method returns memoryviews over the chunks (to avoid copying slices). The documentation is grammatically ambiguous about whether the buffer-protocol alternative applies to read()'s return value or to the argument itself; in practice, read_to_iter segfaults (!) when given an object whose read method returns a buffer-protocol object that is not exactly bytes (reduced test case below).
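
For reference, the compression side of my program looks roughly like this (a minimal sketch, not the literal code; the chunk_size value is arbitrary):

import zstandard as zstd

data = b'AB' * 1000

# ZstdCompressionChunker buffers input and emits fixed-size compressed chunks,
# avoiding repeated reallocation of one big output buffer.
cctx = zstd.ZstdCompressor()
chunker = cctx.chunker(chunk_size=16384)
chunks = list(chunker.compress(data))
chunks.extend(chunker.finish())  # flush whatever is still buffered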

My feature request is to provide an efficient way to decompress a sequence of chunks compressed with ZstdCompressionChunker (or to document an existing method as the efficient way, if there is one).
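
For completeness, here is the ZstdDecompressionObj route from the list above; it does keep state across chunks, it is just the path the documentation warns about. A minimal sketch, reusing data and chunks (and the zstandard import) from the compression sketch:

dctx = zstd.ZstdDecompressor()
# decompressobj() keeps decompression state across calls, so it can consume
# the chunker's output chunk by chunk; the docs warn that its internal
# (re)allocations make it slower than the other APIs.
dobj = dctx.decompressobj()
out = b''.join(dobj.decompress(c) for c in chunks)
assert out == data

The reduced test case for the segfault follows.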


import zstandard as zstd

b = b'AB' * 1000
d = zstd.compress(b)
assert zstd.decompress(memoryview(d)) == b  # passes: the one-shot API accepts buffer objects

class Whatever:
    def __init__(self, data):
        self.data = data

    def read(self, size):
        # Hand back the whole compressed payload in one call, as a memoryview
        # (a buffer-protocol object, not bytes).
        assert len(self.data) <= size
        return memoryview(self.data)

dctx = zstd.ZstdDecompressor()
assert b''.join(dctx.read_to_iter(Whatever(d))) == b  # segfault

Segfaults using Arch Linux's python 3.13.2-1 and python-zstandard 0.23.0-2.
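
For what it's worth, a wrapper whose read() returns actual bytes presumably avoids the crash, since bytes is the return type the documentation most plainly promises, but the bytes() copy is exactly what I was trying to avoid. A sketch (untested beyond that reasoning):

class BytesWhatever:
    def __init__(self, data):
        self.data = data
        self.done = False

    def read(self, size):
        # Return the payload once as bytes (copying!), then signal EOF.
        # Assumes the payload fits in a single read, as in the test case above.
        if self.done:
            return b''
        assert len(self.data) <= size
        self.done = True
        return bytes(self.data)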
