Skip to content

Etcd Certificates are not generated when adding nodes to an existing cluster with scale.yml #12120

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
May 2, 2025

Conversation

kartsank
Copy link
Contributor

@kartsank kartsank commented Apr 9, 2025

What type of PR is this?
/kind bug

What this PR does / why we need it:
if will add new host or missing node etcd cert host to gen_node_certs_True group as part of check_certs and use to create certs using gen_certs task.

Which issue(s) this PR fixes:
Fixes #12117

Special notes for your reviewer:

Does this PR introduce a user-facing change?:
None

NONE

@k8s-ci-robot k8s-ci-robot added the do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. label Apr 9, 2025
Copy link

linux-foundation-easycla bot commented Apr 9, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label Apr 9, 2025
@k8s-ci-robot
Copy link
Contributor

Welcome @kartsank!

It looks like this is your first PR to kubernetes-sigs/kubespray 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/kubespray has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 9, 2025
@k8s-ci-robot
Copy link
Contributor

Hi @kartsank. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/contains-merge-commits cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Apr 9, 2025
@chadswen
Copy link
Member

chadswen commented Apr 9, 2025

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 9, 2025
@chadswen
Copy link
Member

chadswen commented Apr 9, 2025

/retest

@chadswen chadswen changed the title [Issue-12117]-Certificates for the new hosts are not generated during… Etcd Certificates are not generated when adding nodes to an existing cluster with scale.yml Apr 9, 2025
@VannTen
Copy link
Contributor

VannTen commented Apr 9, 2025 via email

@kartsank
Copy link
Contributor Author

kartsank commented Apr 9, 2025

You're duplicating the group definition in the task above (with group_by). Even if add_host is the right approach, we should only keep one group generation.

An existing group was created using inventory_hostname (etcd, control-plane). If the etcd play is executed without the kube_node or new_host inventory group, the node certificate will not be generated on the first etcd node. To address this, I am re-evaluating all hosts in the k8s_cluster instead of just the inventory_hostname hosts and adding any missing hosts to the existing group.

@VannTen
Copy link
Contributor

VannTen commented Apr 9, 2025 via email

@kartsank
Copy link
Contributor Author

kartsank commented Apr 9, 2025

I see where the problem is. This was fixed for cluster.yml and upgrade-cluster.yml in #10769 , but scale.yml does not use the common playbook install-etcd.yml which has the correct behavior (dynamically add only the needed hosts, aka nodes using a direct etcd client). I don't think there is anything preventing us from using the same thing in scale.yml, which should fix the problem.

Initially, I considered including install-etcd.yml, but it adds all kube_node to the groups rather than just the missing nodes. My goal is to create the node certificate only for the missing or new hosts, rather than re-creating the node certificate each time.

https://github.com/kubernetes-sigs/kubespray/blob/master/playbooks/install_etcd.yml#L1C1-L15C17

If force_etcd_cert_refresh is set to true, we can add all hosts. Otherwise, there is no need to re-create the node certificate for every execution of scale.yml. Additionally, the installation of etcd is not required for scale.yml unless etcd_cluster_setup is enabled for new nodes.

https://github.com/kubernetes-sigs/kubespray/blob/master/playbooks/install_etcd.yml#L17-L29

@VannTen
Copy link
Contributor

VannTen commented Apr 10, 2025

Initially, I considered including install-etcd.yml, but it adds all kube_node to the groups rather than just the missing nodes. My goal is to create the node certificate only for the missing or new hosts, rather than re-creating the node certificate each time.

group_by still consider --limit, or does it not ? So even with force_etcd_refresh + scale.yml with --limit (which is the intended usage), I don't see the advantage.

@kartsank
Copy link
Contributor Author

kartsank commented Apr 10, 2025

Initially, I considered including install-etcd.yml, but it adds all kube_node to the groups rather than just the missing nodes. My goal is to create the node certificate only for the missing or new hosts, rather than re-creating the node certificate each time.

group_by still consider --limit, or does it not ? So even with force_etcd_refresh + scale.yml with --limit (which is the intended usage), I don't see the advantage.

I believe the limit is not considered by group_by. Therefore, we have two options: we can either include install-etcd.yml within scale.yml, or we can add the kube_node group to the existing ETCD play in scale.yml.

  • name: Generate the etcd certificates beforehand
    hosts: etcd:kube_control_plane---------> add :kube_node
    gather_facts: false
    any_errors_fatal: "{{ any_errors_fatal | default(true) }}"
    environment: "{{ proxy_disable_env }}"
    roles:
    • { role: kubespray-defaults }
    • role: etcd
      tags: etcd
      vars:
      etcd_cluster_setup: false
      etcd_events_cluster_setup: false
      when:
      • etcd_deployment_type != "kubeadm"
      • kube_network_plugin in ["calico", "flannel", "canal", "cilium"] or cilium_deploy_additionally | default(false) | bool
      • kube_network_plugin != "calico" or calico_datastore == "etcd"

https://github.com/kubernetes-sigs/kubespray/blob/master/playbooks/scale.yml#L9

@VannTen
Copy link
Contributor

VannTen commented Apr 10, 2025 via email

@kartsank
Copy link
Contributor Author

kartsank commented Apr 10, 2025

I tested the install_etcd.yml in scale.yml, and all tests were successful. Therefore, I will include install_etcd.yml in scale.yml to resolve the add_node certificate issue.

Thank you for the feedback. I will update this pull request to include install_etcd.yml.

@VannTen Please review and provider your comments

@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. do-not-merge/contains-merge-commits and removed do-not-merge/contains-merge-commits size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Apr 15, 2025
@kartsank
Copy link
Contributor Author

@VannTen Please review and provider your comments. updated scale.yml with install_etcd playbook.

@kartsank kartsank force-pushed the scale-node-fix branch 2 times, most recently from 9ea55a8 to f1e8bda Compare April 23, 2025 22:29
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Apr 23, 2025
Comment on lines -17 to -19
vars:
etcd_cluster_setup: false
etcd_events_cluster_setup: false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only slight concern I have is this, which differs in install-etcd.yml.

However this only affect 'etcd', so it would only be a problem for using scale.yml with etcd nodes, which is not supported. I wonder if we have an easy way to fail early if something like this is attempted 🤔

Maybe that's out of scope for this though.

@tico88612 @ant31 thoughts ?

Copy link
Contributor

@ant31 ant31 Apr 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we set the variables in the import? or with set fact before ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default value for etcd_cluster_setup is set to true. However, this can be changed to false by configuring the variables in scale.yaml.

https://github.com/kubernetes-sigs/kubespray/blob/master/roles/etcd/defaults/main.yml#L6

@VannTen and @ant31 please review my changes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh right, I don't why I thought we could not set vars on import_playbook.

In that case we'd define in both cases at import_playbook level and be done with it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated.

@kartsank kartsank requested review from ant31 and VannTen May 1, 2025 03:03
Copy link
Contributor

@VannTen VannTen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'll need to update cluster.yml and upgrade-cluster.yml as well.

Comment on lines -17 to -19
vars:
etcd_cluster_setup: false
etcd_events_cluster_setup: false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh right, I don't why I thought we could not set vars on import_playbook.

In that case we'd define in both cases at import_playbook level and be done with it.

Comment on lines 26 to 27
vars:
etcd_cluster_setup: true
etcd_events_cluster_setup: "{{ etcd_events_cluster_enabled }}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the whole section should be scraped and handled at import_playbook level on both use sites.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@VannTen updated on cluster.yml and upgrade_cluster.yml. please review. Thanks

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels May 1, 2025
@kartsank
Copy link
Contributor Author

kartsank commented May 1, 2025

You'll need to update cluster.yml and upgrade-cluster.yml as well.

updated

@kartsank kartsank requested a review from VannTen May 1, 2025 18:37
Copy link
Contributor

@VannTen VannTen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your patience !
That should do it
/approve
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 2, 2025
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kartsank, VannTen

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 2, 2025
@k8s-ci-robot k8s-ci-robot merged commit a3e6e66 into kubernetes-sigs:master May 2, 2025
46 checks passed
@kartsank
Copy link
Contributor Author

kartsank commented May 2, 2025

Thanks for your patience ! That should do it /approve /lgtm

Thanks for your review support and approval.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note-none Denotes a PR that doesn't merit a release note. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
5 participants