fix: Add the nodepool cgroup mode to the NAP config #2356

Open · wants to merge 4 commits into main

Conversation

davidholsgrove
Contributor

Closes #2321

Add the nodepool cgroup mode to the NAP config
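For context, issue #2321 reports that node auto-provisioning (NAP) was not picking up the cgroup mode. In terms of the google_container_cluster resource, the intent is roughly the following (a simplified sketch for illustration, not the module's exact code, and assuming the provider supports linux_node_config under node_pool_auto_config):

```hcl
# Sketch only: propagate the configured cgroup mode to auto-provisioned nodes.
node_pool_auto_config {
  linux_node_config {
    cgroup_mode = "CGROUP_MODE_V2" # or CGROUP_MODE_V1 / CGROUP_MODE_UNSPECIFIED
  }
}
```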

davidholsgrove requested review from apeabody, ericyz and a team as code owners on May 22, 2025 23:32

google-cla bot commented May 22, 2025

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@apeabody
Collaborator

Hi @davidholsgrove - Can you please check the CLA? Thanks!

davidholsgrove changed the title from "Add the nodepool cgroup mode to the NAP config" to "fix: Add the nodepool cgroup mode to the NAP config" on May 30, 2025
@davidholsgrove
Contributor Author

CLA updated. Thanks @apeabody

@apeabody
Collaborator

/gcbrun

@apeabody
Collaborator

Hi @davidholsgrove - From the CI Tests:

        	Error:      	Received unexpected error:
        	            	FatalError{Underlying: error while running command: exit status 1; 
        	            	Error: Error updating LinuxNodeConfig: googleapi: Error 400: Setting cgroup_mode to v1 is not allowed for clusters created with version 1.32.3-gke.1927009.
        	            	Details:
        	            	[
        	            	  {
        	            	    "@type": "type.googleapis.com/google.rpc.RequestInfo",
        	            	    "requestId": "0x261042fcc5e9d554"
        	            	  }
        	            	]
        	            	, badRequest
        	            	
        	            	  with module.example.module.gke.google_container_cluster.primary,
        	            	  on ../../../modules/beta-public-cluster/cluster.tf line 22, in resource "google_container_cluster" "primary":
        	            	  22: resource "google_container_cluster" "primary" {
        	            	}
        	Test:       	TestNodePool

@davidholsgrove
Contributor Author

Hi @apeabody

I think the test case was updated in #2224 to downgrade from CGROUP_MODE_V2 to CGROUP_MODE_V1.

I'm not sure from https://cloud.google.com/kubernetes-engine/docs/how-to/migrate-cgroupv2#transition-plan what the default cgroup mode is for clusters created with 1.32, but from the error message in the test cases it looks like it's now V2, and the API is rejecting the attempt to set it to V1?

I don't have access to the Cloud Build logs to see further than the quote in your reply, apologies.

If the test case needs to revert that diff, I can include it in my PR?

@apeabody
Collaborator

apeabody commented Jun 2, 2025

Hi @davidholsgrove!

Yes, the API is rejecting it when it is specifically added to a new NAP config; full removal of CGROUP_MODE_V1 isn't expected until v1.35.

I suggest swapping the values of pool-01 and all (and their test assets). That should fix this test while maintaining test coverage. https://github.com/terraform-google-modules/terraform-google-kubernetes-engine/blob/main/examples/node_pool/main.tf#L165
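For illustration, the swapped values might look roughly like this in examples/node_pool/main.tf, assuming the example passes cgroup modes as a map keyed by node pool name (the variable name below is inferred from local.node_pools_cgroup_mode mentioned later in this thread and may not match the example exactly):

```hcl
# Hypothetical sketch of the suggested swap: "all" becomes the V2 default,
# while pool-01 keeps V1 so that value still has test coverage.
node_pools_cgroup_mode = {
  all     = "CGROUP_MODE_V2"
  pool-01 = "CGROUP_MODE_V1"
}
```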

@davidholsgrove
Contributor Author

/gcbrun

@davidholsgrove
Contributor Author

Thanks @apeabody - test cases updated to flip the default to V2, with the exception of pool-01, to continue coverage.
I don't have permissions to trigger a retest; could you run that for me and let me know if you need anything else?

@apeabody
Collaborator

apeabody commented Jun 3, 2025

/gcbrun

@apeabody
Collaborator

apeabody commented Jun 3, 2025

I'll re-run the test again; the current failure is:

        	Error:      	Received unexpected error:
        	            	FatalError{Underlying: error while running command: exit status 1; 
        	            	Error: Error updating LinuxNodeConfig: googleapi: Error 400: Cluster is running incompatible operation operation-1748969188467-eb201d3b-0103-4996-89d9-a6cf6fa5ed15.
        	            	Details:
        	            	[
        	            	  {
        	            	    "@type": "type.googleapis.com/google.rpc.RequestInfo",
        	            	    "requestId": "0x7d2092479dc8c224"
        	            	  },
        	            	  {
        	            	    "@type": "type.googleapis.com/google.rpc.ErrorInfo",
        	            	    "domain": "container.googleapis.com",
        	            	    "reason": "CLUSTER_ALREADY_HAS_OPERATION"
        	            	  }
        	            	]
        	            	, failedPrecondition
        	            	
        	            	  with module.example.module.gke.google_container_cluster.primary,
        	            	  on ../../../modules/beta-public-cluster/cluster.tf line 22, in resource "google_container_cluster" "primary":
        	            	  22: resource "google_container_cluster" "primary" {
        	            	}
        	Test:       	TestNodePool

@davidholsgrove
Contributor Author

Thanks @apeabody - I'm not sure how to resolve that failure without access to run the tests or dig deeper.
It looks unrelated to the change in question though?

@apeabody
Collaborator

apeabody commented Jun 4, 2025

Looks like we are still seeing it in the re-test; let me check a few PRs that don't include this change.

@apeabody
Collaborator

apeabody commented Jun 4, 2025

Hi @davidholsgrove - I've checked, and this error seems to be specific to this change. I'd suggest (temporarily) removing pool-01 = "CGROUP_MODE_V1", as I'm wondering whether, if the linux_node_config is first set to "CGROUP_MODE_V1" by all, pool-01 can't later be set to "CGROUP_MODE_V1"?
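As a sketch, using the same assumed map as above, temporarily dropping the pool-01 override would leave only the default:

```hcl
# Hypothetical: keep only the "all" default while debugging the
# CLUSTER_ALREADY_HAS_OPERATION failure (variable name assumed).
node_pools_cgroup_mode = {
  all = "CGROUP_MODE_V2"
}
```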

@davidholsgrove
Contributor Author

Thanks @apeabody - rebased and removed the explicit setting of pool-01 = "CGROUP_MODE_V1" from the testcase.
Could you trigger again?

@apeabody
Collaborator

apeabody commented Jun 9, 2025

/gcbrun

@apeabody
Collaborator

apeabody commented Jun 9, 2025

Hi @davidholsgrove - Yeah, it's still having the issue:

Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206: Error: Error updating LinuxNodeConfig: googleapi: Error 400: Cluster is running incompatible operation operation-1749488222583-0c77b4da-5a67-4e0b-94ab-ad5f6a673820.
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206: Details:
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206: [
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206:   {
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206:     "@type": "type.googleapis.com/google.rpc.RequestInfo",
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206:     "requestId": "0x663f464965d3f2fb"
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206:   },
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206:   {
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206:     "@type": "type.googleapis.com/google.rpc.ErrorInfo",
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206:     "domain": "container.googleapis.com",
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206:     "reason": "CLUSTER_ALREADY_HAS_OPERATION"
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206:   }
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206: ]
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206: , failedPrecondition
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206: 
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206:   with module.example.module.gke.google_container_cluster.primary,
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206:   on ../../../modules/beta-public-cluster/cluster.tf line 22, in resource "google_container_cluster" "primary":
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206:   22: resource "google_container_cluster" "primary" {
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206: 
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z retry.go:99: Returning due to fatal error: FatalError{Underlying: error while running command: exit status 1; 

Looking at your change, another possibility is that the addition of local.node_pools_cgroup_mode != null to the node_pool_auto_config gating condition is also resulting in the network_tags block being created. I would suggest making network_tags a dynamic block as well, gated on var.cluster_autoscaling.enabled && (length(var.network_tags) > 0 || var.add_cluster_firewall_rules), to eliminate that possibility.
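A rough sketch of that suggestion in cluster.tf, assuming dynamic blocks along these lines (the gating expressions and surrounding structure are simplified assumptions, not the module's actual code):

```hcl
dynamic "node_pool_auto_config" {
  # Assumed outer gate: NAP autoscaling, cluster firewall rules, or the new cgroup mode setting.
  for_each = (var.cluster_autoscaling.enabled || var.add_cluster_firewall_rules || local.node_pools_cgroup_mode != null) ? [1] : []
  content {
    # Only emit network_tags when tags are actually needed, so enabling the
    # block for cgroup_mode alone no longer creates an empty network_tags block.
    dynamic "network_tags" {
      for_each = (var.cluster_autoscaling.enabled && (length(var.network_tags) > 0 || var.add_cluster_firewall_rules)) ? [1] : []
      content {
        tags = var.network_tags
      }
    }

    # The change under review: propagate the cgroup mode to auto-provisioned nodes.
    dynamic "linux_node_config" {
      for_each = local.node_pools_cgroup_mode != null ? [1] : []
      content {
        cgroup_mode = local.node_pools_cgroup_mode
      }
    }
  }
}
```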

Successfully merging this pull request may close these issues.

NAP not setting linux_node_config for cgroup_mode