feature - avoid utf-8 decoding for text frames #1376

Closed
toppk opened this issue Jun 26, 2023 · 8 comments

@toppk

toppk commented Jun 26, 2023

Just because it is supposed to be in UTF-8 doesn't mean I prefer it in that form. Specifically, my use case is giving the data to orjson and passing it around as an orjson.Fragment().

Here are the documents for that use case.

https://github.com/ijl/orjson#deserialize
https://github.com/ijl/orjson#fragment

Looking at the websockets code, if such a capability were implemented, it seems we'd want to add a flag to WebSocketCommonProtocol() and use it to force binary at the point where it decides whether to decode the payload, located here:

https://github.com/python-websockets/websockets/blob/main/src/websockets/legacy/protocol.py#L1053

I'd be happy to whip up a patch in case you would consider this feature request.
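
To illustrate why skipping the decode step is attractive, here is a minimal sketch. It uses the standard-library json module as a stand-in for orjson (an assumption made so the snippet runs without third-party packages); both parsers accept bytes directly, so decoding a text frame to str first is an extra step that the parser does not need.

```python
import json

# The payload of a text frame as it arrives on the wire: UTF-8 encoded bytes.
payload = '{"user": "zoë", "id": 7}'.encode("utf-8")

# Path taken when the library decodes text frames: bytes -> str -> parser.
via_str = json.loads(payload.decode("utf-8"))

# Path this request asks for: hand the raw bytes straight to the parser.
via_bytes = json.loads(payload)

# Both paths produce the same object; only the intermediate str differs.
assert via_str == via_bytes
```

With orjson the same bytes could also be wrapped in orjson.Fragment() and embedded in a larger document without being parsed at all, which is the use case described above.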

@aaugustin
Member

I understand your use case and, indeed, you cannot do this with the current API.

For receiving frames, it would mean an API like websocket.recv(decode_text_frames=False). (Naming TBC.) Can you confirm that it's what you want? (Then, you get bytes in all cases, so you cannot tell if it was a Text or Binary frame in the first place; but you don't really care anyway.)

This raises the question of providing a symmetrical API for sending bytes (assumed to be valid UTF-8) as a Text frame. You didn't ask for this but I'd like to keep consistency between both sides.

@toppk
Author

toppk commented Jun 26, 2023

That would work quite well. I guess I misunderstood the code, because it looked to me as if the recv() method was decoupled from where inbound data is actually processed (read_message()). The solution you propose would certainly be more flexible.

@toppk
Author

toppk commented Jun 28, 2023

Just thinking about the send side, I think it really is less important. There aren't too many servers that are strict in what they accept, especially when they are expecting text. If we implement it for send, the effect is the same (skip encode, skip decode), but the names of the options will be different, e.g. decode_text_frames=False for recv() and send_as_text=True for send().
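
A toy sketch of what the send-side option discussed here could mean. The function frame_for_send and the string labels "text" and "binary" are hypothetical, and send_as_text is only the name floated in this comment, not a confirmed websockets parameter.

```python
from typing import Tuple, Union


def frame_for_send(
    message: Union[str, bytes], send_as_text: bool = False
) -> Tuple[str, bytes]:
    """Pick a frame type and wire payload for an outgoing message (toy model)."""
    if isinstance(message, str):
        # str is always encoded to UTF-8 and sent as a text frame.
        return "text", message.encode("utf-8")
    # bytes go out as a binary frame by default. With send_as_text=True the
    # caller vouches that they are valid UTF-8, so the encode step is skipped
    # and the bytes are sent unchanged in a text frame.
    return ("text" if send_as_text else "binary"), bytes(message)
```

For example, frame_for_send(b'{"a":1}', send_as_text=True) yields a text frame carrying the bytes unchanged, which is what a caller holding already-serialized JSON would want.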

@aaugustin
Member

Yes, we need to pick the names for both sides carefully and, ideally, consistently.

raw_utf8 is a name that could work for both sides. I'm not sure it's the best name we can find, though.

If we have two names, I'd like some symmetry e.g. using the words decode and encode.

@carlos-sarmiento

I'm finding myself in the same position, trying to send data encoded with orjson as a text frame even when it is provided to websockets in binary form.

Any chance this gets added?

@aaugustin
Member

This will be added as part of the new asyncio implementation (#1332).

aaugustin added a commit that referenced this issue Aug 7, 2024
Also support decoding binary frames.

Fix #1376.
@aaugustin
Member

The new asyncio implementation supports recv(decode=False), which is the original request here.

(Also recv(decode=True) for the opposite behavior: decoding binary frames to str.)

I'm not planning to work on the other features discussed above, notably send(), until someone has a use case.
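
The semantics of the decode option for recv() described above can be modelled with a small toy function. This is a sketch, not the library's code: deliver, TEXT and BINARY are hypothetical names, and treating an unset option (None) as "decode text frames, leave binary frames alone" is an assumption based on the behavior discussed in this thread.

```python
from typing import Optional, Union

TEXT, BINARY = "text", "binary"  # hypothetical frame-type labels


def deliver(
    frame_type: str, payload: bytes, decode: Optional[bool] = None
) -> Union[str, bytes]:
    """Return a received payload as str or bytes, depending on `decode`."""
    if decode is None:
        # Assumed default: follow the frame type.
        decode = frame_type == TEXT
    # decode=True forces str (even for binary frames);
    # decode=False forces bytes (even for text frames).
    return payload.decode("utf-8") if decode else payload
```

With decode=False a text frame comes back as the raw UTF-8 bytes, ready to hand to a parser that accepts bytes; with decode=True a binary frame is decoded to str.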

@aaugustin
Member

send() was handled in #1515.

aaugustin added a commit that referenced this issue Oct 25, 2024
Previously, a latch was used to synchronize the user thread reading messages and
the background thread reading from the network. This required two thread switches
per message.

Now, the background thread writes messages to a queue, from which the user thread
reads. This allows passing several frames at each thread switch, reducing the
overhead.

With this server code::

    async def test(websocket):
        for i in range(int(await websocket.recv())):
            await websocket.send(f"{{\"iteration\": {i}}}")

and this client code::

    with connect("ws://localhost:8765", compression=None) as websocket:
        websocket.send("1_000_000")
        for message in websocket:
            pass

an unscientific benchmark (running it on my laptop) shows a 2.5x speedup, going
from 11 seconds to 4.4 seconds. Setting a very large recv_bufsize and max_size doesn't yield significant further improvement.

The new implementation mirrors the asyncio implementation and gains the
option to prevent or force decoding of frames. Refs #1376.
aaugustin added a commit that referenced this issue Oct 25, 2024
Previously, a latch was used to synchronize the user thread reading messages and
the background thread reading from the network. This required two thread switches
per message.

Now, the background thread writes messages to a queue, from which the user thread
reads. This allows passing several frames at each thread switch, reducing the
overhead.

With this server code:

    async def test(websocket):
        for i in range(int(await websocket.recv())):
            await websocket.send(f"{{\"iteration\": {i}}}")

    async with serve(test, "localhost", 8765) as server:
        await server.serve_forever()

and this client code:

    with connect("ws://localhost:8765", compression=None) as websocket:
        websocket.send("1_000_000")
        for message in websocket:
            pass

an unscientific benchmark (running it on my laptop) shows a 2.5x speedup,
going from 11 seconds to 4.4 seconds. Setting a very large recv_bufsize
and max_size doesn't yield significant further improvement.

Flow control was tested by inserting debug logs in maybe_pause/resume()
and by measuring the wait for the recv_flow_control lock. It showed the
expected behavior of pausing and unpausing coupled with some wait time.

The new implementation mirrors the asyncio implementation and gains the
option to prevent or force decoding of frames.

Fix #1376 for the threading implementation.
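
The queue-based hand-off described in this commit message can be sketched with the standard library. This is a simplified illustration under stated assumptions, not the actual websockets code: reader and receive_all are hypothetical names, a plain list stands in for the network, and flow control (maybe_pause/resume) is left out.

```python
import queue
import threading


def reader(frames, messages):
    """Background thread: push each incoming message, then a sentinel."""
    for frame in frames:
        messages.put(frame)
    messages.put(None)  # sentinel meaning "connection closed"


def receive_all(frames):
    """User thread: drain messages from the queue until the sentinel."""
    messages = queue.Queue()
    thread = threading.Thread(target=reader, args=(frames, messages))
    thread.start()
    received = []
    while True:
        message = messages.get()  # blocks only while the queue is empty
        if message is None:
            break
        received.append(message)
    thread.join()
    return received
```

Because the background thread can enqueue several messages before the user thread wakes up, one thread switch can deliver more than one message, which is the source of the reduced overhead the commit message reports.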