feat: add support for multiple nacos instances and replace events library with shdict #12263

Open · wants to merge 46 commits into master
Conversation

@Revolyssup (Contributor) commented May 28, 2025

Description

  • Currently, nacos discovery uses the lua-resty-events library, and during testing it was found that not all workers always receive the events, leading to inconsistencies. Moreover, the idiomatic way to share data between workers is through a shared dict. This PR therefore migrates nacos from the older events mechanism to the shared-dict approach: only the privileged agent fetches data from nacos and writes it to the shdict, while all workers read from it.
  • Currently, nacos discovery supports only a single nacos registry, but users have a use case for multiple nacos registries. This PR adds support for specifying `discovery.nacos` as an array of nacos instances, each with a unique `id`.

NOTE: The older way of configuring nacos is still supported and older data remains valid. The old configuration is now converted internally to the new format before being passed to the discovery logic, making this change backward compatible: users can use either the single-instance or the multi-instance configuration.
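As a hedged illustration of the two supported shapes (field names other than `discovery.nacos` and `id` are assumptions based on the description above, not taken from the PR's diff):

```yaml
# Older single-instance form: still accepted and converted internally
# to the multi-instance form.
discovery:
  nacos:
    host:
      - "http://192.168.33.1:8848"
---
# New multi-instance form: an array of nacos instances, each with a unique id.
discovery:
  nacos:
    - id: nacos-cluster-1
      host:
        - "http://192.168.33.1:8848"
    - id: nacos-cluster-2
      host:
        - "http://192.168.33.2:8848"
```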

Fixes #11252

Checklist

  • I have explained the need for this PR and the problem it solves
  • I have explained the changes or the new features added to this PR
  • I have added tests corresponding to this change
  • I have updated the documentation to reflect this change
  • I have verified that this change is backward compatible (If not, please discuss on the APISIX mailing list first)

@dosubot dosubot bot added the size:XL This PR changes 500-999 lines, ignoring generated files. label May 28, 2025
@Revolyssup Revolyssup marked this pull request as draft May 28, 2025 08:23
@dosubot dosubot bot added the enhancement New feature or request label May 28, 2025
@Revolyssup Revolyssup marked this pull request as ready for review May 28, 2025 19:28
@dosubot dosubot bot added size:XXL This PR changes 1000+ lines, ignoring generated files. and removed size:XL This PR changes 500-999 lines, ignoring generated files. labels May 28, 2025
@@ -29,8 +29,47 @@ The performance of this module needs to be improved:

### Configuration for Nacos

There are two ways to configure Nacos: the single-instance method and the multi-instance method.

Add following configuration in `conf/config.yaml` :
Member: Can remove this sentence and just add a "title", e.g. `` ```yaml title="conf/config.yaml" ``, to the single- and multi-instance examples.


#### Multi Instance method
Member: Put the single-instance method before the multi-instance one, so the reading flows better.

@@ -60,7 +99,7 @@ discovery:

#### L7

Here is an example of routing a request with a URI of "/nacos/*" to a service named "http://192.168.33.1:8848/nacos/v1/ns/instance/list?serviceName=APISIX-NACOS", configured with id `nacos-cluster-1`, using the nacos discovery client in the registry:
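A hedged sketch of such a route as an Admin API request body (`discovery_args.id` for selecting the `nacos-cluster-1` instance is an assumption; the exact field name may differ in the PR):

```json
{
  "uri": "/nacos/*",
  "upstream": {
    "type": "roundrobin",
    "discovery_type": "nacos",
    "service_name": "APISIX-NACOS",
    "discovery_args": {
      "id": "nacos-cluster-1"
    }
  }
}
```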

:::note
You can fetch the `admin_key` from `config.yaml` and save it to an environment variable with the following command:
Member: This is not the right place to have this note, under the L7 section. Move it one level up.

Contributor (Author): Do you mean the :::note section? It was already here. Do you want me to move it?

Contributor @bzp2010 left a comment: There are quite a few issues. I have already checked the hot paths in the code (excluding the utils section). You can take a look and refute the errors I have pointed out.

```lua
local value = nacos_dict:get_stale(key)
local nodes = {}
if not value then
    -- maximum waiting time: 5 seconds
```

Contributor: Fix the indentation here.

```lua
        ns_id,
        group_name,
        service_name)
local value = nacos_dict:get_stale(key)
```

Contributor: Why is it necessary to retrieve potentially stale data using `get_stale` instead of `get`?

Contributor (Author): This is to retrieve the old data when nacos is down.

Member: Note that the value of an expired key is not guaranteed to be available, so one should never rely on the availability of expired items. Trying to use `get_stale` to retrieve an expired item is unreliable.
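A minimal sketch of the trade-off being discussed (illustrative only; `nacos_dict` and `key` follow the snippets above, and this is not the PR's actual code). `get_stale` may still return a value after expiry, but eviction is not guaranteed, so the fallback is best-effort:

```lua
-- Illustrative only: prefer fresh data, fall back to possibly-stale data.
-- ngx.shared.DICT is only available inside OpenResty/nginx.
local nacos_dict = ngx.shared["nacos"]

local function lookup(key)
    local value = nacos_dict:get(key)
    if value then
        return value
    end
    -- Best-effort fallback when nacos itself is down: the expired value
    -- may or may not still be present in the shared dict.
    return nacos_dict:get_stale(key)
end
```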

Comment on lines 52 to 63
```lua
local waiting_time = 5
local step = 0.1
local logged = false
while not value and waiting_time > 0 do
    if not logged then
        logged = true
    end
    ngx.sleep(step)
    waiting_time = waiting_time - step
    value = nacos_dict:get_stale(key)
end
```
Contributor: What does this code mean?

Currently, nodes have been reduced to a mechanism that only reads from the shdict. Before the code I selected, you did not attempt to initiate a request, so nodes would not suddenly be filled. I understand that filling the shdict is done by a decoupled privileged process, but I don't see the value of using a spin-lock-like waiting mechanism here.

In my opinion, it should fail fast, which would cause APISIX to return a 503 error and print a log.

No matter how I look at it, it is strange for a request to be blocked for 5 seconds because the service discovery provider did not return nodes.

Contributor (Author): `dict:get_stale` is called before starting the wait, and if no value is found it waits for one `step` and tries again, for up to 5 seconds. This waiting mechanism already existed before this PR; the only change here is replacing `get` with `get_stale`, for the reasons mentioned in the comment above.

Contributor: It really depends on what you think the purpose of this PR is.

As I mentioned above, this logic is weird, so my suggestion is to remove it. Let this logic fail fast instead of waiting for something. If the upstream on a route is not loaded or does not exist at all, APISIX will obviously return 503 immediately.

You can request opinions from other reviewers here.
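As an illustration of the fail-fast behavior being suggested (a hedged sketch; `nacos_dict`, `core.json`, and the key layout are assumed from the surrounding snippets, not taken from the PR):

```lua
-- Illustrative only: return immediately instead of spin-waiting.
-- When nil is returned, the upstream has no nodes and APISIX
-- responds with 503 on its own.
local function nodes(key)
    local value = nacos_dict:get(key)
    if not value then
        ngx.log(ngx.ERR, "no nodes in shdict for key: ", key)
        return nil
    end
    return core.json.decode(value)
end
```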

Contributor (Author): @nic-6443 What do you think?

Member: Agree to remove it; this is weird.

Comment on lines 117 to 134
```lua
for _, val in ipairs(discovery_conf) do
    local id = val.id
    local version = ngx.md5(core.json.encode(val, true))
    keep[id] = true

    -- The nacos config has not been changed.
    if nacos_clients[id] and nacos_clients[id].version == version then
        goto CONTINUE
    end

    if nacos_clients[id] then
        nacos_clients[id]:stop()
    end
    local new_client = nacos_factory.new(val)
    new_client:start()
    nacos_clients[id] = new_client

    ::CONTINUE::
end
```
Contributor: Why check for configuration changes and rebuild the client?

  1. init_worker is not called frequently; it is only called once when the privileged process starts. So when would you check for configuration changes?

  2. You store nacos_clients in a global variable in init.lua, but regardless, the scope of that variable is limited to the LuaVM. Restarting the privileged process is equivalent to restarting the worker, which rebuilds the LuaVM and clears the table. Comparing the configuration version with an old client stored there seems pointless, as it will always fall into the branch where the old client no longer exists.

```lua
function _M.dump_data()
    return {
        config = local_conf.discovery.nacos,
```
Contributor: It seems that printing configurations is not secure.

If users are responsible for maintaining APISIX deployments, they can view these in the configuration files. The Control API itself should also be used by maintenance personnel, but we know that security controls on APIs are always difficult to configure correctly.

Contributor (Author): I don't think there is sensitive data in this configuration.

Comment on lines +263 to +266
```lua
    start = start,
    stop = stop,
    fetch_instances = fetch_instances,
    fetch_full_registry = fetch_full_registry,
```
Contributor: According to APISIX convention, you should use a metatable and `__index`. I don't remember us having such a precedent.

Comment on lines 102 to 103
```lua
["username"] = username,
["password"] = password,
```

Contributor: Suggested change:

```lua
username = username,
password = password,
```

Comment on lines 141 to 147
```lua
local function is_grpc(scheme)
    if scheme == 'grpc' or scheme == 'grpcs' then
        return true
    end

    return false
end
```

Contributor: Suggested change:

```lua
local function is_grpc(scheme)
    return scheme == "grpc" or scheme == "grpcs"
end
```

```lua
method = method,
headers = headers,
body = body,
ssl_verify = false,
```
Contributor: It seems unsafe.

Contributor (Author): I think it's okay to allow non-TLS traffic for service discovery. The Eureka service discovery code also has the same logic.

Comment on lines 31 to 33
```lua
if ngx.config.subsystem == "stream" then
    shdict_name = shdict_name .. "-stream"
end
```
Contributor: Is it possible to share services directly between http and stream, i.e., using the meta shdict?

The specific reasons are as follows: we do not configure discovery separately for http and stream in the configuration file; they always use the same provider or the same provider ID. The only difference is that they may use different service names, but we do not seem to need to concern ourselves with that. In fact, you have already covered both routes and stream routes in https://github.com/apache/apisix/pull/12263/files#diff-ac8ce1b2b70b12a55abac8b1d508079cab4136f5bf830052ba325e543b04bf9eR130-R148.

Contributor (Author): Okay, makes sense.

```lua
    end
end
-- delete unused service names
local keys = nacos_dict:get_keys(0)
```
Member @nic-6443 commented Jun 5, 2025:

> CAUTION: Avoid calling this method on dictionaries with a very large number of keys, as it may lock the dictionary for a significant amount of time and block Nginx worker processes trying to access the dictionary.

According to the OpenResty documentation, `get_keys` should be avoided. We can record services_in_use in a module variable, so that by comparing the services_in_use values from the last and current executions of fetch_full_registry we can determine the changes without using `get_keys`.
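A hedged sketch of that bookkeeping (illustrative only; the variable and function names follow the comment above, not the PR's actual code):

```lua
-- Illustrative only: remember the keys written in the previous fetch
-- cycle and delete the ones that disappeared, without scanning the
-- whole shared dict via get_keys().
local last_services_in_use = {}

local function sync_deletions(nacos_dict, services_in_use)
    -- Any key present last time but absent now is no longer in use.
    for key in pairs(last_services_in_use) do
        if not services_in_use[key] then
            nacos_dict:delete(key)
        end
    end
    last_services_in_use = services_in_use
end
```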

Labels: enhancement (New feature or request), size:XXL (This PR changes 1000+ lines, ignoring generated files)

Successfully merging this pull request may close these issues:

  • bug: nacos node failure should not corrupt in-memory data

4 participants