In 6.2, Cluster Metric API calls to other nodes time out after 1s ignoring config #22595

Open
tellistone opened this issue May 15, 2025 · 0 comments

A community post here raises the following:

I run a Graylog 6.2.0 cluster with nodes distributed across multiple geographic regions. When I log in to a node in one region, I am unable to view performance metrics (e.g., Memory/Heap, Buffers, Journal) for nodes in other regions. However, if I log in to a node in the same geographic location, all metrics display correctly.

The logs show inter-node API timeouts like this:

2025-04-29T18:30:10.070Z WARN  [ProxiedResource] Failed to call API on node <1ce8335f-e3a9-4d66-b1eb-4bdcdedf827b>, cause: timeout (duration: 1002 ms)
2025-04-29T18:30:10.070Z WARN  [ProxiedResource] Failed to call API on node <09ac9ab1-ec2c-4e54-88a0-1c74f0291dfe>, cause: timeout (duration: 1001 ms)
2025-04-29T18:30:10.070Z WARN  [ProxiedResource] Failed to call API on node <a1e3a52c-7d66-4514-afe0-3e4d7afb5e68>, cause: timeout (duration: 1001 ms)
2025-04-29T18:30:10.070Z WARN  [ProxiedResource] Failed to call API on node <1db2c9da-8c1c-4f24-9b6e-b85f49fadd94>, cause: timeout (duration: 1002 ms)
2025-04-29T18:31:10.071Z WARN  [ProxiedResource] Failed to call API on node <c81809a0-b020-489f-892c-15211dd73696>, cause: timeout (duration: 1001 ms)
2025-04-29T18:31:10.071Z WARN  [ProxiedResource] Failed to call API on node <1ce8335f-e3a9-4d66-b1eb-4bdcdedf827b>, cause: timeout (duration: 1001 ms)
2025-04-29T18:31:10.071Z WARN  [ProxiedResource] Failed to call API on node <1db2c9da-8c1c-4f24-9b6e-b85f49fadd94>, cause: timeout (duration: 1001 ms)
2025-04-29T18:31:10.071Z WARN  [ProxiedResource] Failed to call API on node <09ac9ab1-ec2c-4e54-88a0-1c74f0291dfe>, cause: timeout (duration: 1001 ms)
2025-04-29T18:31:10.071Z WARN  [ProxiedResource] Failed to call API on node <a1e3a52c-7d66-4514-afe0-3e4d7afb5e68>, cause: timeout (duration: 1001 ms)
2025-04-29T18:31:10.071Z WARN  [ProxiedResource] Failed to call API on node <7569ace1-6ea9-4f67-8531-da9c0919ee3d>, cause: timeout (duration: 1002 ms)
2025-04-29T18:31:58.277Z WARN  [ProxiedResource] Failed to call API on node <a1e3a52c-7d66-4514-afe0-3e4d7afb5e68>, cause: timeout (duration: 1001 ms)
2025-04-29T18:31:58.277Z WARN  [ProxiedResource] Failed to call API on node <c81809a0-b020-489f-892c-15211dd73696>, cause: timeout (duration: 1001 ms)
2025-04-29T18:31:58.277Z WARN  [ProxiedResource] Failed to call API on node <1ce8335f-e3a9-4d66-b1eb-4bdcdedf827b>, cause: timeout (duration: 1001 ms)
2025-04-29T18:31:58.277Z WARN  [ProxiedResource] Failed to call API on node <1db2c9da-8c1c-4f24-9b6e-b85f49fadd94>, cause: timeout (duration: 1001 ms)
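
To check whether every failure really sits at the ~1 s mark, the warnings can be grouped per node. The following is a minimal sketch, assuming the log line format matches the excerpt above exactly; the function name `summarize_timeouts` and the three-line `sample` are illustrative, not part of Graylog.

```python
import re
from collections import defaultdict

# Matches the ProxiedResource timeout warnings in the format shown in the excerpt.
LINE_RE = re.compile(
    r"Failed to call API on node <(?P<node>[0-9a-f-]+)>, "
    r"cause: timeout \(duration: (?P<ms>\d+) ms\)"
)


def summarize_timeouts(lines):
    """Return {node_id: [durations in ms]} for every timeout warning found."""
    per_node = defaultdict(list)
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            per_node[match.group("node")].append(int(match.group("ms")))
    return dict(per_node)


# Three lines copied from the log excerpt; in practice, pass the open log file.
sample = [
    "2025-04-29T18:30:10.070Z WARN  [ProxiedResource] Failed to call API on node "
    "<1ce8335f-e3a9-4d66-b1eb-4bdcdedf827b>, cause: timeout (duration: 1002 ms)",
    "2025-04-29T18:30:10.070Z WARN  [ProxiedResource] Failed to call API on node "
    "<09ac9ab1-ec2c-4e54-88a0-1c74f0291dfe>, cause: timeout (duration: 1001 ms)",
    "2025-04-29T18:31:10.071Z WARN  [ProxiedResource] Failed to call API on node "
    "<1ce8335f-e3a9-4d66-b1eb-4bdcdedf827b>, cause: timeout (duration: 1001 ms)",
]

summary = summarize_timeouts(sample)
for node, durations in summary.items():
    # Print node id, number of timeouts, and the min/max duration in ms.
    print(node, len(durations), min(durations), max(durations))
```

If every node reports durations between 1001 and 1002 ms regardless of region, that points to a fixed 1 s client-side timeout rather than variable network latency.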

I tried uncommenting and setting proxied_requests_default_call_timeout = 5s in server.conf, but it doesn’t seem to have any effect: the timeouts still occur at around 1 second. I also reviewed the config file but couldn’t find any other relevant settings to adjust this timeout.
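
Before treating this as a regression, it may help to rule out that the setting is still commented out or overridden further down the file. The sketch below is a hypothetical helper (`effective_setting` is not a Graylog tool) that reports the last uncommented value for a key in `key = value` style text. It assumes later duplicates win, and it only inspects the file contents, not what the running server actually applies.

```python
def effective_setting(conf_text, key):
    """Return the last uncommented value for key, or None if absent or commented out."""
    value = None
    for raw in conf_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#") or line.startswith("!"):
            continue
        name, sep, rest = line.partition("=")
        if sep and name.strip() == key:
            value = rest.strip()
    return value


# Illustrative content; in practice, read the text of your own server.conf.
sample_conf = """
# proxied_requests_default_call_timeout = 5s
proxied_requests_default_call_timeout = 5s
"""

print(effective_setting(sample_conf, "proxied_requests_default_call_timeout"))
```

A result of `None` would mean the line is still commented out or missing. A value of `5s` alongside continued ~1 s timeouts supports the report that the setting is not being honored for these calls.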

Everything else in the cluster appears to be functioning properly. This issue is specifically with viewing node performance stats across regions.

Environment Details:

OS: Ubuntu 24.04
Graylog Version: 6.2.0
Number of Graylog nodes: 9 (will be scaling to 20+)
OpenSearch Version: 2.15.0
Number of OpenSearch data nodes: 32

I’ve noticed that a node’s metrics page (e.g., https://10.X.X.X:9000/system/metrics/node/) fails to load for any node located in a different geographic region than the node you’re currently accessing. This doesn’t appear to be a firewall or DNS issue; I’ve verified both are working correctly.

It’s also worth noting that this behavior was not present in Graylog 6.1.11. In that version, the cluster page displayed journal and JVM heap metrics for all nodes, which relied on successful metric queries across the cluster. It seems that something changed in Graylog 6.2 that affects this functionality.

Is there a new or alternative setting in Graylog 6.2.0 to increase the inter-node API call timeout, or is this a regression?
