feat: add support for multiple nacos instances and replace events library with shdict #12263
base: master
Conversation
@@ -29,8 +29,47 @@ The performance of this module needs to be improved:

### Configuration for Nacos

There are two ways to configure Nacos: the single-instance method and the multi-instance method.

Add the following configuration in `conf/config.yaml`:
Can remove this sentence and just add a "title", e.g. `` ```yaml title="conf/config.yaml" ``, to the single- and multi-instance examples.
Add the following configuration in `conf/config.yaml`:

#### Multi Instance method
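For context, a multi-instance configuration might look roughly like this. This is an illustrative sketch based on the PR description and the `nacos-cluster-1` id used later in this doc; field names other than `id` are assumed to follow the existing single-instance schema:

```yaml title="conf/config.yaml"
discovery:
  nacos:
    - id: nacos-cluster-1        # unique ID referenced when resolving services
      host:
        - "http://192.168.33.1:8848"
      prefix: "/nacos/v1/"
    - id: nacos-cluster-2
      host:
        - "http://192.168.33.2:8848"
      prefix: "/nacos/v1/"
```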
@@ -60,7 +99,7 @@ discovery:

#### L7

Here is an example of routing a request with the URI "/nacos/*" to a service named "http://192.168.33.1:8848/nacos/v1/ns/instance/list?serviceName=APISIX-NACOS" that uses the nacos discovery client in the registry:
Here is an example of routing a request with the URI "/nacos/*" to a service named "http://192.168.33.1:8848/nacos/v1/ns/instance/list?serviceName=APISIX-NACOS", configured with id `nacos-cluster-1`, that uses the nacos discovery client in the registry:

:::note
You can fetch the `admin_key` from `config.yaml` and save it to an environment variable with the following command:
Not the right place to have this note under L7. Move it one level up.
Do you mean the :::note section? It was here already. Do you want me to move it?
Co-authored-by: Traky Deng <trakydeng@gmail.com>
There are quite a few issues. I have already checked the hot paths in the code (excluding the utils section). You can take a look and refute the errors I have pointed out.
`apisix/discovery/nacos/init.lua` (Outdated)
```lua
local value = nacos_dict:get_stale(key)
local nodes = {}
if not value then
    -- maximum waiting time: 5 seconds
```
indent
`apisix/discovery/nacos/init.lua` (Outdated)
```lua
                         ns_id,
                         group_name,
                         service_name)
local value = nacos_dict:get_stale(key)
```
Why is it necessary to retrieve potentially stale data using `get_stale` instead of `get`?
This is to retrieve the old data when nacos is down.
Note that the value of an expired key is not guaranteed to be available, so one should never rely on the availability of expired items. Trying to use `get_stale` to retrieve expired items is unreliable.
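For reference, a minimal sketch of the difference, using the standard `lua_shared_dict` API (the dict and key names are taken from the fragments above):

```lua
-- `get` returns nil once the key's TTL has passed
local value = nacos_dict:get(key)

-- `get_stale` may still return the expired value, plus a staleness flag,
-- but shdict is free to evict expired items at any time, so this is a
-- best-effort fallback rather than a guarantee
local stale_value, _, is_stale = nacos_dict:get_stale(key)
```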
`apisix/discovery/nacos/init.lua` (Outdated)
```lua
-- removed in this hunk:
if body and 'table' == type(body) then
    local err
    body, err = core.json.encode(body)
    if not body then
        return nil, 'invalid body : ' .. err
    end
end

-- added in this hunk:
local waiting_time = 5
local step = 0.1
local logged = false
while not value and waiting_time > 0 do
    if not logged then
        logged = true
    end
    ngx.sleep(step)
    waiting_time = waiting_time - step
    value = nacos_dict:get_stale(key)
end
```
What does this code mean?
Currently, nodes have been reduced to a mechanism that only reads from shdict. Before the code I selected, you did not attempt to initiate a request, so nodes would not suddenly be filled.
I understand that filling shdict is done by a decoupled privileged process, but I don't understand the value of using a spin lock-like mechanism for waiting here.
In my opinion, it should fast fail, which would cause APISIX to return a 503 error and print the log.
No matter how I look at it, it's strange that the request was blocked for 5 seconds because the service discovery provider did not return nodes.
`dict:get_stale` is called before starting the wait, and if no value is found it waits for `step` and tries again, for up to 5 seconds. This waiting mechanism was already there before my PR. The only thing I have done here is replace `get` with `get_stale`, for the reasons mentioned in the comment above.
It really depends on what you think the purpose of this PR is.
As I mentioned above, this logic is weird, so my suggestion is to remove it. Let this logic fail fast instead of waiting for something.
If the upstream on a route is not loaded or does not exist at all, APISIX will obviously return 503 immediately.
You can request opinions from other reviewers here.
@nic-6443 What do you think?
Agree to remove it, this is weird.
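A fail-fast version along the lines suggested above might look like this. This is a sketch, not the PR's final code; `core.log` and `core.json` are APISIX's usual helpers:

```lua
-- sketch: fail fast instead of spin-waiting for the privileged
-- process to fill the shdict
local value = nacos_dict:get_stale(key)
if not value then
    core.log.error("no service found in shdict for key: ", key)
    return nil  -- upstream resolution fails and APISIX returns 503
end
local nodes = core.json.decode(value)
return nodes
```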
`apisix/discovery/nacos/init.lua` (Outdated)
```lua
-- removed in this hunk:
local dup = de_duplication(services, namespace_id, group_name,
                           up.service_name, up.scheme)
if dup then
    goto CONTINUE
end

if up.discovery_type == 'nacos' then
    core.table.insert(services, {
        service_name = up.service_name,
        namespace_id = namespace_id,
        group_name = group_name,
        scheme = up.scheme,
    })
end

-- added in this hunk:
for _, val in ipairs(discovery_conf) do
    local id = val.id
    local version = ngx.md5(core.json.encode(val, true))
    keep[id] = true

    -- The nacos config has not been changed.
    if nacos_clients[id] and nacos_clients[id].version == version then
        goto CONTINUE
    end

    if nacos_clients[id] then
        nacos_clients[id]:stop()
    end
    local new_client = nacos_factory.new(val)
    new_client:start()
    nacos_clients[id] = new_client

    ::CONTINUE::
```
Why check for configuration changes and rebuild the client?

- `init_worker` is not called frequently; it is only called once when the privileged process starts. So when do you check for configuration changes?
- You store `nacos_clients` in a global variable in the init.lua file, but regardless, the scope of this variable is limited to the LuaVM. Restarting the privileged process is equivalent to restarting the worker, which causes the LuaVM to be rebuilt, thereby clearing the table. Comparing the configuration version with an old client stored there seems pointless, as it will always fall into the branch where the old client no longer exists.
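To illustrate the second point, a sketch (assuming `nacos_clients` is the module-level table from the hunk above):

```lua
-- module-level state: it lives only as long as this LuaVM
local nacos_clients = {}

function _M.init_worker()
    for _, val in ipairs(discovery_conf) do
        local version = ngx.md5(core.json.encode(val, true))
        local old = nacos_clients[val.id]
        -- after a privileged-process restart, init_worker runs in a fresh
        -- LuaVM, so `old` is always nil and this branch is never taken
        if old and old.version == version then
            core.log.info("nacos config unchanged for instance ", val.id)
        end
    end
end
```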
```diff
 function _M.dump_data()
-    return {config = local_conf.discovery.nacos, services = applications or {}}
+    return {
+        config = local_conf.discovery.nacos,
```
It seems that printing configurations is not secure.
If users are responsible for maintaining APISIX deployments, they can view these in the configuration files. The Control API itself should also be used by maintenance personnel, but we know that security controls on APIs are always difficult to configure correctly.
I don't think there is sensitive data in this configuration.
```lua
    start = start,
    stop = stop,
    fetch_instances = fetch_instances,
    fetch_full_registry = fetch_full_registry,
```
According to APISIX convention, you should use a metatable with `__index`. I don't remember us having such a precedent.
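For reference, the metatable pattern the reviewer is referring to looks roughly like this (a sketch; only the method names come from the fragment above):

```lua
local _M = {}
local mt = { __index = _M }

function _M.new(conf)
    -- per-instance state lives in the object; methods are resolved
    -- through the shared metatable
    return setmetatable({ conf = conf }, mt)
end

function _M:start() end                -- stubs standing in for the
function _M:stop() end                 -- real client methods
function _M:fetch_instances() end
function _M:fetch_full_registry() end

return _M
```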
`apisix/discovery/nacos/factory.lua` (Outdated)
["username"] = username, | ||
["password"] = password, |
["username"] = username, | |
["password"] = password, | |
username = username, | |
password = password, |
`apisix/discovery/nacos/factory.lua` (Outdated)
```lua
local function is_grpc(scheme)
    if scheme == 'grpc' or scheme == 'grpcs' then
        return true
    end

    return false
end
```
```diff
-local function is_grpc(scheme)
-    if scheme == 'grpc' or scheme == 'grpcs' then
-        return true
-    end
-
-    return false
-end
+local function is_grpc(scheme)
+    return scheme == "grpc" or scheme == "grpcs"
+end
```
```lua
    method = method,
    headers = headers,
    body = body,
    ssl_verify = false,
```
It seems unsafe.
I think it's okay to allow non-TLS traffic for service discovery. The Eureka service discovery code also has the same logic.
`apisix/discovery/nacos/factory.lua` (Outdated)
```lua
if ngx.config.subsystem == "stream" then
    shdict_name = shdict_name .. "-stream"
end
```
Is it possible to share services directly between http and stream, i.e., using a meta shdict? The specific reasons are as follows:

- We do not configure discovery separately for http and stream in the configuration file; they always use the same provider or the same provider ID.
- The only difference is that they may use different service names. However, we do not seem to need to concern ourselves with this.

In fact, you have already set https://github.com/apache/apisix/pull/12263/files#diff-ac8ce1b2b70b12a55abac8b1d508079cab4136f5bf830052ba325e543b04bf9eR130-R148, which includes both routes and stream routes.
okay makes sense
```lua
        end
    end
    -- delete unused service names
    local keys = nacos_dict:get_keys(0)
```
> CAUTION: Avoid calling this method on dictionaries with a very large number of keys as it may lock the dictionary for significant amount of time and block Nginx worker processes trying to access the dictionary.

According to the OpenResty documentation, I realized that `get_keys` should be avoided. We can record `services_in_use` in a module variable, so that by comparing the `services_in_use` values from the previous and current executions of `fetch_full_registry`, we can determine the changes without using `get_keys`.
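A sketch of that approach (`services_in_use` and `fetch_full_registry` are the names from the comment; everything else is illustrative):

```lua
-- module-level snapshot of the keys written during the previous run
local prev_services_in_use = {}

local function fetch_full_registry()
    local services_in_use = {}

    -- ... fetch services from nacos, write each one into the shdict,
    -- and mark its key: services_in_use[key] = true ...

    -- delete keys used last time but not this time, without scanning
    -- the whole dictionary via get_keys()
    for key in pairs(prev_services_in_use) do
        if not services_in_use[key] then
            nacos_dict:delete(key)
        end
    end
    prev_services_in_use = services_in_use
end
```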
Description

- Support configuring `discovery.nacos` as an array of nacos instances, each with a unique ID.

NOTE: The older way of configuring nacos is still supported, and older data remains valid. The older configuration is now converted internally to the new configuration before being passed on to the logic, making this change backwards compatible. This means users can use either the single-instance or the multi-instance configuration.
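A sketch of the kind of internal conversion described above (the helper name and default id are hypothetical, not the PR's actual code):

```lua
-- normalize the legacy single-instance table into the new array form
local function normalize_nacos_conf(conf)
    if #conf == 0 then
        -- legacy map-style config ({ host = ..., prefix = ... }),
        -- not an array: wrap it as a one-element list
        conf = { conf }
        conf[1].id = conf[1].id or "nacos"  -- hypothetical default id
    end
    return conf
end
```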
Fixes #11252
Checklist