Provide an efficient way to decompress a sequence of chunks compressed with ZstdCompressionChunker #259


Open
jbosboom opened this issue Apr 5, 2025 · 0 comments


My program wants to compress some large cached strings and decompress them later. I have no particular requirements on the form of the compressed data, so I used ZstdCompressionChunker to do the compression, avoiding repeated reallocation of the output buffer (roughly the sketch after this list). I would like to process the decompressed data in chunks to reduce peak memory usage. However, there is no obvious efficient way to decompress chunks to chunks:

  • The ZstdCompressionChunker round-trip tests all concatenate the chunks with bytes.join for one-shot decompression. (Fine, they're tests.)

  • I tried itertools.chain.from_iterable(dctx.read_to_iter(c) for c in chunks). This doesn't work because each read_to_iter iterator expects to process a full stream. (I expected it to hold state in the ZstdDecompressor it was obtained from.)

  • ZstdDecompressionObj's documentation says it isn't efficient (a sketch of this route appears after the feature request below):

    Because calls to decompress() may need to perform multiple memory (re)allocations, this streaming decompression API isn’t as efficient as other APIs.

  • read_to_iter's documentation says

    read_to_iter() accepts an object with a read(size) method that will return compressed bytes or an object conforming to the buffer protocol.

    so I wrote a class whose read method returns memoryviews over the chunks (to avoid copying slices). The documentation is grammatically ambiguous about whether the buffer-protocol alternative applies to read()'s return value or to the argument itself; in practice, read_to_iter segfaults (!) when given an object whose read method returns a buffer-protocol object that is not exactly bytes (reduced test case below).
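
For reference, the compression side of my program looks roughly like this (a minimal sketch, not the literal code; the chunk_size value is arbitrary):

import zstandard as zstd

data = b'AB' * 1000

# ZstdCompressionChunker buffers input and emits fixed-size compressed chunks,
# avoiding repeated reallocation of one big output buffer.
cctx = zstd.ZstdCompressor()
chunker = cctx.chunker(chunk_size=16384)
chunks = list(chunker.compress(data))
chunks.extend(chunker.finish())  # flush whatever is still buffered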

My feature request is to provide an efficient way to decompress a sequence of chunks compressed with ZstdCompressionChunker (or to document an existing method as the efficient way, if there is one).
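
For completeness, here is the ZstdDecompressionObj route from the list above; it does keep state across chunks, it is just the path the documentation warns about. A minimal sketch, reusing data and chunks (and the zstandard import) from the compression sketch:

dctx = zstd.ZstdDecompressor()
# decompressobj() keeps decompression state across calls, so it can consume
# the chunker's output chunk by chunk; the docs warn that its internal
# (re)allocations make it slower than the other APIs.
dobj = dctx.decompressobj()
out = b''.join(dobj.decompress(c) for c in chunks)
assert out == data

The reduced test case for the segfault follows.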


import zstandard as zstd

b = b'AB' * 1000
d = zstd.compress(b)
assert zstd.decompress(memoryview(d)) == b  # passes: the one-shot API accepts buffer objects

class Whatever:
    def __init__(self, data):
        self.data = data

    def read(self, size):
        # Hand back the whole compressed payload in one call, as a memoryview
        # (a buffer-protocol object, not bytes).
        assert len(self.data) <= size
        return memoryview(self.data)

dctx = zstd.ZstdDecompressor()
assert b''.join(dctx.read_to_iter(Whatever(d))) == b  # segfault

Segfaults using Arch Linux's python 3.13.2-1 and python-zstandard 0.23.0-2.
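
For what it's worth, a wrapper whose read() returns actual bytes presumably avoids the crash, since bytes is the return type the documentation most plainly promises, but the bytes() copy is exactly what I was trying to avoid. A sketch (untested beyond that reasoning):

class BytesWhatever:
    def __init__(self, data):
        self.data = data
        self.done = False

    def read(self, size):
        # Return the payload once as bytes (copying!), then signal EOF.
        # Assumes the payload fits in a single read, as in the test case above.
        if self.done:
            return b''
        assert len(self.data) <= size
        self.done = True
        return bytes(self.data)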
