Etcd Certificates are not generated when adding nodes to an existing cluster with scale.yml #12120

kartsank · 2025-04-09T13:10:16Z

What type of PR is this?
/kind bug

What this PR does / why we need it:
It adds new hosts, and hosts missing an etcd node certificate, to the gen_node_certs_True group as part of check_certs, and uses that group to create the certificates in the gen_certs task.

Which issue(s) this PR fixes:
Fixes #12117

Special notes for your reviewer:

Does this PR introduce a user-facing change?:
None

NONE

linux-foundation-easycla · 2025-04-09T13:10:21Z

The committers listed above are authorized under a signed CLA.

✅ login: kartsank / name: Karthik S (5e650c3, a3ccbe2, c33c394)

k8s-ci-robot · 2025-04-09T13:10:26Z

Welcome @kartsank!

It looks like this is your first PR to kubernetes-sigs/kubespray 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/kubespray has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

k8s-ci-robot · 2025-04-09T13:10:27Z

Hi @kartsank. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

chadswen · 2025-04-09T15:48:19Z

/ok-to-test

chadswen · 2025-04-09T16:54:24Z

/retest

VannTen · 2025-04-09T19:36:43Z

You're duplicating the group definition in the task above (with group_by). Even if add_host is the right approach, we should only keep one group generation.

kartsank · 2025-04-09T21:17:24Z

> You're duplicating the group definition in the task above (with group_by). Even if add_host is the right approach, we should only keep one group generation.

An existing group was created using inventory_hostname (etcd, control-plane). If the etcd play is executed without the kube_node or new_host inventory group, the node certificate will not be generated on the first etcd node. To address this, I am re-evaluating all hosts in the k8s_cluster instead of just the inventory_hostname hosts and adding any missing hosts to the existing group.

VannTen · 2025-04-09T21:32:30Z

I see where the problem is. This was fixed for cluster.yml and upgrade-cluster.yml in #10769, but scale.yml does not use the common playbook install_etcd.yml, which has the correct behavior (it dynamically adds only the needed hosts, i.e. nodes that use a direct etcd client). I don't think anything prevents us from using the same thing in scale.yml, which should fix the problem.
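A rough sketch of the "dynamically add only the needed hosts" pattern, for readers who have not looked at the shared playbook. This is not the content of install_etcd.yml: the play names and the group name needs_etcd_client_certs are hypothetical, the conditions are borrowed from the scale.yml play quoted later in this thread, and it assumes kube_network_plugin, calico_datastore and related variables are already defined by the inventory or defaults.

```yaml
# Sketch only: play names and the group name are hypothetical.
- name: Select the nodes that talk to etcd directly
  hosts: kube_node
  gather_facts: false
  tasks:
    - name: Group nodes whose network plugin needs etcd client certificates
      ansible.builtin.group_by:
        key: needs_etcd_client_certs  # hypothetical group name
      when:
        - kube_network_plugin in ["calico", "flannel", "canal", "cilium"] or cilium_deploy_additionally | default(false) | bool
        - kube_network_plugin != "calico" or calico_datastore == "etcd"

# The certificate play then targets the static groups plus the dynamic one.
- name: Generate etcd certificates
  hosts: etcd:kube_control_plane:needs_etcd_client_certs
  gather_facts: false
  roles:
    - role: etcd
      tags: etcd
```

Because the group is built at run time, nodes that do not need a direct etcd client never enter the certificate play.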

kartsank · 2025-04-09T22:05:06Z

> I see where the problem is. This was fixed for cluster.yml and upgrade-cluster.yml in #10769, but scale.yml does not use the common playbook install_etcd.yml, which has the correct behavior (it dynamically adds only the needed hosts, i.e. nodes that use a direct etcd client). I don't think anything prevents us from using the same thing in scale.yml, which should fix the problem.

Initially, I considered including install_etcd.yml, but it adds all kube_node hosts to the group rather than just the missing nodes. My goal is to create the node certificate only for the missing or new hosts, rather than re-creating the node certificate each time.

https://github.com/kubernetes-sigs/kubespray/blob/master/playbooks/install_etcd.yml#L1C1-L15C17

If force_etcd_cert_refresh is set to true, we can add all hosts. Otherwise, there is no need to re-create the node certificate for every execution of scale.yml. Additionally, the installation of etcd is not required for scale.yml unless etcd_cluster_setup is enabled for new nodes.

https://github.com/kubernetes-sigs/kubespray/blob/master/playbooks/install_etcd.yml#L17-L29

VannTen · 2025-04-10T07:18:14Z

> Initially, I considered including install_etcd.yml, but it adds all kube_node hosts to the group rather than just the missing nodes. My goal is to create the node certificate only for the missing or new hosts, rather than re-creating the node certificate each time.

group_by still honors --limit, doesn't it? So even with force_etcd_refresh plus scale.yml with --limit (which is the intended usage), I don't see the advantage.

kartsank · 2025-04-10T16:12:11Z

> > Initially, I considered including install_etcd.yml, but it adds all kube_node hosts to the group rather than just the missing nodes. My goal is to create the node certificate only for the missing or new hosts, rather than re-creating the node certificate each time.

> group_by still honors --limit, doesn't it? So even with force_etcd_refresh plus scale.yml with --limit (which is the intended usage), I don't see the advantage.

I believe group_by does not take --limit into account. Therefore, we have two options: we can either include install_etcd.yml within scale.yml, or we can add the kube_node group to the existing etcd play in scale.yml.

- name: Generate the etcd certificates beforehand
  hosts: etcd:kube_control_plane  # proposed change: append :kube_node
  gather_facts: false
  any_errors_fatal: "{{ any_errors_fatal | default(true) }}"
  environment: "{{ proxy_disable_env }}"
  roles:
    - { role: kubespray-defaults }
    - role: etcd
      tags: etcd
      vars:
        etcd_cluster_setup: false
        etcd_events_cluster_setup: false
      when:
        - etcd_deployment_type != "kubeadm"
        - kube_network_plugin in ["calico", "flannel", "canal", "cilium"] or cilium_deploy_additionally | default(false) | bool
        - kube_network_plugin != "calico" or calico_datastore == "etcd"

https://github.com/kubernetes-sigs/kubespray/blob/master/playbooks/scale.yml#L9

VannTen · 2025-04-10T16:29:54Z

> I don't think so; limit is not considered by group_by.

It does, I just tested.
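For anyone who wants to reproduce that check, a minimal sketch follows. The file name limit-demo.yml, the group name demo_group, and the host name node3 are placeholders, not anything taken from kubespray.

```yaml
# limit-demo.yml (hypothetical file name)
- name: Build a dynamic group from the targeted hosts
  hosts: all
  gather_facts: false
  tasks:
    - name: Put every host reached by this play into demo_group
      ansible.builtin.group_by:
        key: demo_group

- name: Show the resulting group membership
  hosts: demo_group
  gather_facts: false
  tasks:
    - name: Print the members of demo_group once
      ansible.builtin.debug:
        msg: "{{ groups['demo_group'] }}"
      run_once: true
```

Run it as `ansible-playbook -i <inventory> limit-demo.yml --limit node3`. Since group_by is an ordinary per-host task, it should only run for the hosts that survive --limit, so the printed group should list only those hosts, which is consistent with the result reported above.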

kartsank · 2025-04-10T19:38:01Z

I tested the install_etcd.yml in scale.yml, and all tests were successful. Therefore, I will include install_etcd.yml in scale.yml to resolve the add_node certificate issue.

Thank you for the feedback. I will update this pull request to include install_etcd.yml.

@VannTen Please review and provide your comments.

kartsank · 2025-04-18T14:31:17Z

@VannTen Please review and provide your comments. I updated scale.yml to use the install_etcd playbook.

… scale.yml

VannTen · 2025-04-24T08:27:57Z

playbooks/scale.yml

-      vars:
-        etcd_cluster_setup: false
-        etcd_events_cluster_setup: false


The only slight concern I have is this, which differs in install_etcd.yml.

However, this only affects 'etcd', so it would only be a problem when using scale.yml with etcd nodes, which is not supported. I wonder if we have an easy way to fail early if something like this is attempted 🤔

Maybe that's out of scope for this though.

@tico88612 @ant31 thoughts ?

Can we set the variables in the import, or with set_fact beforehand?

The default value for etcd_cluster_setup is true. However, it can be overridden to false by setting the variable in scale.yml.

https://github.com/kubernetes-sigs/kubespray/blob/master/roles/etcd/defaults/main.yml#L6

@VannTen and @ant31 please review my changes.

Oh right, I don't know why I thought we could not set vars on import_playbook.

In that case we'd define in both cases at import_playbook level and be done with it.

… scale.yml

VannTen

You'll need to update cluster.yml and upgrade-cluster.yml as well.

VannTen · 2025-05-01T08:24:59Z

playbooks/scale.yml

-      vars:
-        etcd_cluster_setup: false
-        etcd_events_cluster_setup: false


Oh right, I don't know why I thought we could not set vars on import_playbook.

In that case we'd define in both cases at import_playbook level and be done with it.

VannTen · 2025-05-01T08:25:52Z

playbooks/install_etcd.yml

      vars:
-        etcd_cluster_setup: true
        etcd_events_cluster_setup: "{{ etcd_events_cluster_enabled }}"


So the whole section should be scrapped and handled at the import_playbook level at both use sites.
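A sketch of what that could look like at the two kinds of import sites. The variable values come from the diffs quoted in this review; the play names are placeholders, the relative path to install_etcd.yml is assumed, and the surrounding plays are omitted.

```yaml
# In scale.yml: only certificates are needed, no etcd cluster setup.
- name: Generate etcd certificates for the new nodes  # placeholder name
  import_playbook: install_etcd.yml
  vars:
    etcd_cluster_setup: false
    etcd_events_cluster_setup: false

# In cluster.yml and upgrade-cluster.yml: keep the values that
# install_etcd.yml previously set itself.
- name: Install etcd  # placeholder name
  import_playbook: install_etcd.yml
  vars:
    etcd_cluster_setup: true
    etcd_events_cluster_setup: "{{ etcd_events_cluster_enabled }}"
```

Setting vars on import_playbook keeps install_etcd.yml free of hard-coded values, so each caller states explicitly whether it wants a cluster setup or certificates only.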

@VannTen I updated cluster.yml and upgrade_cluster.yml. Please review. Thanks.

… scale.yml

kartsank · 2025-05-01T18:37:21Z

> You'll need to update cluster.yml and upgrade-cluster.yml as well.

updated

VannTen

Thanks for your patience !
That should do it
/approve
/lgtm

k8s-ci-robot · 2025-05-02T07:02:27Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kartsank, VannTen

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [VannTen]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

kartsank · 2025-05-02T14:34:29Z

> Thanks for your patience ! That should do it /approve /lgtm

Thanks for your review support and approval.

k8s-ci-robot added the do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. label Apr 9, 2025

k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label Apr 9, 2025

k8s-ci-robot requested review from tico88612 and VannTen April 9, 2025 13:10

k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 9, 2025

k8s-ci-robot added the do-not-merge/contains-merge-commits label Apr 9, 2025

kartsank force-pushed the scale-node-fix branch from 32a6cda to 7c75271 Compare April 9, 2025 15:11

k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 9, 2025

chadswen changed the title ~~[Issue-12117]-Certificates for the new hosts are not generated during…~~ Etcd Certificates are not generated when adding nodes to an existing cluster with scale.yml Apr 9, 2025

k8s-ci-robot added the do-not-merge/contains-merge-commits label Apr 14, 2025

k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. do-not-merge/contains-merge-commits and removed do-not-merge/contains-merge-commits size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Apr 15, 2025

kartsank force-pushed the scale-node-fix branch from f077093 to 872787e Compare April 17, 2025 22:31

k8s-ci-robot removed the do-not-merge/contains-merge-commits label Apr 17, 2025

kartsank force-pushed the scale-node-fix branch 2 times, most recently from 9ea55a8 to f1e8bda Compare April 23, 2025 22:29

[Issue-12117]-Certificates for the new hosts are not generated during…

c33c394

… scale.yml

kartsank force-pushed the scale-node-fix branch from f1e8bda to c33c394 Compare April 23, 2025 22:31

k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Apr 23, 2025

VannTen reviewed Apr 24, 2025

[Issue-12117]-Certificates for the new hosts are not generated during…

5e650c3

… scale.yml

kartsank requested review from ant31 and VannTen May 1, 2025 03:03

VannTen requested changes May 1, 2025

[Issue-12117]-Certificates for the new hosts are not generated during…

a3ccbe2

… scale.yml

k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels May 1, 2025

kartsank requested a review from VannTen May 1, 2025 18:37

VannTen reviewed May 2, 2025

k8s-ci-robot assigned VannTen May 2, 2025

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 2, 2025

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 2, 2025

k8s-ci-robot merged commit a3e6e66 into kubernetes-sigs:master May 2, 2025
46 checks passed
