fix: Add the nodepool cgroup mode to the NAP config #2356

Open · wants to merge 4 commits into main

Conversation

davidholsgrove
Contributor

Closes #2321

Add the nodepool cgroup mode to the NAP config
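For context, issue #2321 reports that node auto-provisioning (NAP) was not picking up the cgroup mode. In terms of the google_container_cluster resource, the intent is roughly the following (a simplified sketch for illustration, not the module's exact code, and assuming the provider supports linux_node_config under node_pool_auto_config):

```hcl
# Sketch only: propagate the configured cgroup mode to auto-provisioned nodes.
node_pool_auto_config {
  linux_node_config {
    cgroup_mode = "CGROUP_MODE_V2" # or CGROUP_MODE_V1 / CGROUP_MODE_UNSPECIFIED
  }
}
```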

davidholsgrove requested review from apeabody, ericyz and a team as code owners on May 22, 2025 23:32

google-cla bot commented May 22, 2025

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@apeabody
Collaborator

Hi @davidholsgrove - Can you please check the CLA? Thanks!

davidholsgrove changed the title from "Add the nodepool cgroup mode to the NAP config" to "fix: Add the nodepool cgroup mode to the NAP config" on May 30, 2025
@davidholsgrove
Contributor Author

CLA updated. Thanks @apeabody

@apeabody
Collaborator

/gcbrun

@apeabody
Collaborator

Hi @davidholsgrove - From the CI Tests:

        	Error:      	Received unexpected error:
        	            	FatalError{Underlying: error while running command: exit status 1; 
        	            	Error: Error updating LinuxNodeConfig: googleapi: Error 400: Setting cgroup_mode to v1 is not allowed for clusters created with version 1.32.3-gke.1927009.
        	            	Details:
        	            	[
        	            	  {
        	            	    "@type": "type.googleapis.com/google.rpc.RequestInfo",
        	            	    "requestId": "0x261042fcc5e9d554"
        	            	  }
        	            	]
        	            	, badRequest
        	            	
        	            	  with module.example.module.gke.google_container_cluster.primary,
        	            	  on ../../../modules/beta-public-cluster/cluster.tf line 22, in resource "google_container_cluster" "primary":
        	            	  22: resource "google_container_cluster" "primary" {
        	            	}
        	Test:       	TestNodePool

@davidholsgrove
Contributor Author

Hi @apeabody

I think the test case was updated in #2224 to downgrade from CGROUP_MODE_V2 to CGROUP_MODE_V1.

I'm not sure from https://cloud.google.com/kubernetes-engine/docs/how-to/migrate-cgroupv2#transition-plan what the default cgroup mode is for clusters created with 1.32, but from the error message in the test cases it looks like it's now V2, and the API is rejecting the attempt to set it to V1?

I don't have access to the Cloud Build logs to see further than the quote in your reply, apologies.

If the test case needs to revert that diff, I can include it in my PR?

@apeabody
Collaborator

apeabody commented Jun 2, 2025

Hi @davidholsgrove!

Yes, the API is rejecting it when it is specifically added to a new NAP config; full removal of CGROUP_MODE_V1 isn't expected until v1.35.

I suggest swapping the values of pool-01 and all (and their test assets). That should fix this test while maintaining test coverage. https://github.com/terraform-google-modules/terraform-google-kubernetes-engine/blob/main/examples/node_pool/main.tf#L165
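For illustration, the swapped values might look roughly like this in examples/node_pool/main.tf, assuming the example passes cgroup modes as a map keyed by node pool name (the variable name below is inferred from local.node_pools_cgroup_mode mentioned later in this thread and may not match the example exactly):

```hcl
# Hypothetical sketch of the suggested swap: "all" becomes the V2 default,
# while pool-01 keeps V1 so that value still has test coverage.
node_pools_cgroup_mode = {
  all     = "CGROUP_MODE_V2"
  pool-01 = "CGROUP_MODE_V1"
}
```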

@davidholsgrove
Contributor Author

/gcbrun

@davidholsgrove
Contributor Author

Thanks @apeabody - test cases updated to flip the default to V2, with the exception of pool-01, to continue coverage.
I don't have permissions to trigger a retest; could you run that for me and let me know if you need anything else?

@apeabody
Collaborator

apeabody commented Jun 3, 2025

/gcbrun

@apeabody
Collaborator

apeabody commented Jun 3, 2025

I'll re-run the test again; the current failure is:

        	Error:      	Received unexpected error:
        	            	FatalError{Underlying: error while running command: exit status 1; 
        	            	Error: Error updating LinuxNodeConfig: googleapi: Error 400: Cluster is running incompatible operation operation-1748969188467-eb201d3b-0103-4996-89d9-a6cf6fa5ed15.
        	            	Details:
        	            	[
        	            	  {
        	            	    "@type": "type.googleapis.com/google.rpc.RequestInfo",
        	            	    "requestId": "0x7d2092479dc8c224"
        	            	  },
        	            	  {
        	            	    "@type": "type.googleapis.com/google.rpc.ErrorInfo",
        	            	    "domain": "container.googleapis.com",
        	            	    "reason": "CLUSTER_ALREADY_HAS_OPERATION"
        	            	  }
        	            	]
        	            	, failedPrecondition
        	            	
        	            	  with module.example.module.gke.google_container_cluster.primary,
        	            	  on ../../../modules/beta-public-cluster/cluster.tf line 22, in resource "google_container_cluster" "primary":
        	            	  22: resource "google_container_cluster" "primary" {
        	            	}
        	Test:       	TestNodePool

@davidholsgrove
Contributor Author

Thanks @apeabody - I'm not sure how to resolve that failure without access to run the tests or dig deeper.
It looks unrelated to the change in question though?

@apeabody
Collaborator

apeabody commented Jun 4, 2025

Looks like we are still seeing it in the re-test; let me check a few PRs that don't include this change.

@apeabody
Collaborator

apeabody commented Jun 4, 2025

Hi @davidholsgrove - I've checked, and this error seems to be specific to this change. I'd suggest (temporarily) removing pool-01 = "CGROUP_MODE_V1", as I'm wondering whether, if the linux_node_config is first set to "CGROUP_MODE_V1" by all, pool-01 can't later be set to "CGROUP_MODE_V1"?
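As a sketch, using the same assumed map as above, temporarily dropping the pool-01 override would leave only the default:

```hcl
# Hypothetical: keep only the "all" default while debugging the
# CLUSTER_ALREADY_HAS_OPERATION failure (variable name assumed).
node_pools_cgroup_mode = {
  all = "CGROUP_MODE_V2"
}
```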

@davidholsgrove
Contributor Author

Thanks @apeabody - rebased and removed the explicit setting of pool-01 = "CGROUP_MODE_V1" from the testcase.
Could you trigger again?

@apeabody
Collaborator

apeabody commented Jun 9, 2025

/gcbrun

@apeabody
Collaborator

apeabody commented Jun 9, 2025

Hi @davidholsgrove - Yeah, it's still having the issue:

Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206: Error: Error updating LinuxNodeConfig: googleapi: Error 400: Cluster is running incompatible operation operation-1749488222583-0c77b4da-5a67-4e0b-94ab-ad5f6a673820.
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206: Details:
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206: [
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206:   {
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206:     "@type": "type.googleapis.com/google.rpc.RequestInfo",
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206:     "requestId": "0x663f464965d3f2fb"
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206:   },
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206:   {
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206:     "@type": "type.googleapis.com/google.rpc.ErrorInfo",
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206:     "domain": "container.googleapis.com",
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206:     "reason": "CLUSTER_ALREADY_HAS_OPERATION"
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206:   }
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206: ]
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206: , failedPrecondition
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206: 
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206:   with module.example.module.gke.google_container_cluster.primary,
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206:   on ../../../modules/beta-public-cluster/cluster.tf line 22, in resource "google_container_cluster" "primary":
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206:   22: resource "google_container_cluster" "primary" {
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z command.go:206: 
Step #54 - "apply node-pool-local": TestNodePool 2025-06-09T16:57:04Z retry.go:99: Returning due to fatal error: FatalError{Underlying: error while running command: exit status 1; 

Looking at your change, another possibility is that the addition of local.node_pools_cgroup_mode != null to the node_pool_auto_config gating condition is also resulting in the network_tags block being created. I would suggest making network_tags a dynamic block as well, gated on var.cluster_autoscaling.enabled && (length(var.network_tags) > 0 || var.add_cluster_firewall_rules), to eliminate that possibility.
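A rough sketch of that suggestion in cluster.tf, assuming dynamic blocks along these lines (the gating expressions and surrounding structure are simplified assumptions, not the module's actual code):

```hcl
dynamic "node_pool_auto_config" {
  # Assumed outer gate: NAP autoscaling, cluster firewall rules, or the new cgroup mode setting.
  for_each = (var.cluster_autoscaling.enabled || var.add_cluster_firewall_rules || local.node_pools_cgroup_mode != null) ? [1] : []
  content {
    # Only emit network_tags when tags are actually needed, so enabling the
    # block for cgroup_mode alone no longer creates an empty network_tags block.
    dynamic "network_tags" {
      for_each = (var.cluster_autoscaling.enabled && (length(var.network_tags) > 0 || var.add_cluster_firewall_rules)) ? [1] : []
      content {
        tags = var.network_tags
      }
    }

    # The change under review: propagate the cgroup mode to auto-provisioned nodes.
    dynamic "linux_node_config" {
      for_each = local.node_pools_cgroup_mode != null ? [1] : []
      content {
        cgroup_mode = local.node_pools_cgroup_mode
      }
    }
  }
}
```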

Successfully merging this pull request may close these issues.

NAP not setting linux_node_config for cgroup_mode