
help request: WebSocket Load Balancing Imbalance Issue After Upstream Node Scaling #12217

Open
@coder2z

Description

Issue Description
When using APISIX to proxy WebSocket requests, we've observed that when upstream nodes are scaled out, the load distribution of WebSocket connections becomes unbalanced.

Steps to Reproduce
1. Configure APISIX to proxy WebSocket requests to backend services
2. Start with 2 upstream nodes providing service
3. Establish a large number of long-lived WebSocket connections
4. Scale out the upstream (e.g., from 2 nodes to 3 or more)
5. Observe the connection distribution across the nodes (a client sketch for steps 3 and 5 follows this list)
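
For steps 3 and 5, a minimal client sketch along the following lines can be used to open many connections through APISIX and tally which upstream node served each one. The route URL ws://127.0.0.1:9080/ws, the connection count, and the assumption that each backend sends its own node id as the first message are all hypothetical and not part of this report:

```python
# Hypothetical reproduction helper: open many WebSocket connections through
# APISIX and count how many land on each upstream node. Assumes each backend
# replies with its own node id as the first message after the handshake.
import asyncio
from collections import Counter

import websockets  # pip install websockets

APISIX_WS_URL = "ws://127.0.0.1:9080/ws"  # assumed route with enable_websocket
TOTAL_CONNECTIONS = 200

async def open_and_identify(url):
    # Keep the connection object around so the connection stays open (long-lived).
    ws = await websockets.connect(url)
    node_id = await ws.recv()  # hypothetical: backend announces its node id
    return ws, node_id

async def main():
    results = await asyncio.gather(
        *(open_and_identify(APISIX_WS_URL) for _ in range(TOTAL_CONNECTIONS))
    )
    print("connections per node:", dict(Counter(node for _, node in results)))
    # Leave the connections open, scale the upstream from 2 to 3 nodes,
    # then run a second batch to compare where the *new* connections land.
    await asyncio.sleep(3600)

asyncio.run(main())
```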
Current Behavior
After scaling, new connections are distributed evenly across all nodes, but previously established long-lived WebSocket connections remain concentrated on the original two nodes, so the overall load distribution is unbalanced.

Expected Behavior
After scaling, the load balancer should take existing long-lived connections into account so that load is distributed more evenly across all nodes, including the newly added ones.

Root Cause Analysis
Based on observation, the issue appears to be caused by:

APISIX maintains a counter mechanism for load balancing. For example, when there are 2 nodes, each node's counter is initialized to 10000. When upstream nodes are scaled out, APISIX resets all counters, but previously established long-lived WebSocket connections are not reflected in the new counts, which makes the load calculation inaccurate.

Specifically:

• A large number of long-lived WebSocket connections have already been established on the original two nodes
• After scaling, the counters are reset and these existing connections are "forgotten" in load-balancing decisions
• New connections are distributed evenly, but combined with the existing connections the overall load distribution remains unbalanced (a simplified simulation of this follows the list)
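
To make the effect of the reset concrete, here is a deliberately simplified simulation of the counter behaviour described above. It is not APISIX's actual balancer code; only the node counts and the 10000 counter budget are taken from the example in this report:

```python
# Simplified model (not APISIX source code) of a counter-based round-robin
# picker that is rebuilt, with fresh counters, whenever the node list changes.
from collections import Counter

class RoundRobinPicker:
    def __init__(self, nodes, budget=10000):
        # Every node starts with the same counter, regardless of how many
        # live connections it already holds.
        self.counters = {node: budget for node in nodes}

    def pick(self):
        node = max(self.counters, key=self.counters.get)
        self.counters[node] -= 1
        return node

active = Counter()

# Phase 1: two nodes, 10000 long-lived WebSocket connections.
picker = RoundRobinPicker(["node-1", "node-2"])
for _ in range(10000):
    active[picker.pick()] += 1

# Phase 2: scale out to three nodes. The picker is rebuilt, so the 10000
# connections opened in phase 1 are invisible to the fresh counters.
picker = RoundRobinPicker(["node-1", "node-2", "node-3"])
for _ in range(3000):
    active[picker.pick()] += 1

# Prints {'node-1': 6000, 'node-2': 6000, 'node-3': 1000}: new connections are
# spread evenly, but the overall distribution stays skewed toward the old nodes.
print(dict(active))
```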
Environment Information
APISIX version: latest
Operating system: Linux
Deployment method: Kubernetes
Additional Information
This issue is particularly noticeable in high-concurrency WebSocket scenarios, especially when connections persist for extended periods. We hope the load-balancing algorithm can be improved to take existing long-lived connections into account when nodes are scaled out.
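
As an illustration of the requested behaviour, and only as a sketch (the class below is hypothetical, not an APISIX API or a proposed patch), a balancer that tracks live connections per node would naturally steer new connections to a freshly added node until the distribution evens out:

```python
# Sketch of a connection-aware picker (illustration only, not APISIX code):
# new connections go to the node with the fewest live connections, so a
# freshly added node absorbs new traffic until the distribution evens out.
from collections import Counter

class LeastConnectionPicker:
    def __init__(self, nodes):
        self.active = Counter({node: 0 for node in nodes})

    def add_node(self, node):
        # A newly scaled-out node starts at zero live connections.
        self.active.setdefault(node, 0)

    def pick(self):
        # Choose the node currently holding the fewest live connections.
        node = min(self.active, key=self.active.get)
        self.active[node] += 1
        return node

    def release(self, node):
        # Call when a WebSocket connection is closed.
        self.active[node] -= 1

picker = LeastConnectionPicker(["node-1", "node-2"])
for _ in range(10000):          # existing long-lived connections
    picker.pick()
picker.add_node("node-3")       # scale out
for _ in range(3000):           # new connections after scaling
    picker.pick()
print(dict(picker.active))      # {'node-1': 5000, 'node-2': 5000, 'node-3': 3000}
```

Note that even with connection-aware selection, connections established before the scale-out cannot be moved transparently; they stay on their original nodes until they are closed and re-established, so rebalancing existing traffic would also require some form of draining or client-side reconnection.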

Environment

  • APISIX version (run apisix version): latest
  • Operating system (run uname -a): Linux
  • OpenResty / Nginx version (run openresty -V or nginx -V): latest
  • etcd version, if relevant (run curl http://127.0.0.1:9090/v1/server_info): NA
  • APISIX Dashboard version, if relevant: NA
  • Plugin runner version, for issues related to plugin runners: NA
  • LuaRocks version, for installation issues (run luarocks --version): NA
