Commit f11ddee

Terraform Modules for HyperPod EKS (#586)

Authored by bluecrayon52 and KeitaW

* updated instance count environment variables
* updated message for IAM execution role creation
* added check_jq function
* removed old todos
* updated order of hyperpod cluster config message
* updated hyperpod cluster stack to conditionally disable deep health checks
* put S3 endpoint into separate cfn stack
* updated helm chart injector to use kube-system namespace
* syntax fix in lambda function
* enabled passthrough of existing resource ids from tmp_env_vars to env_vars
* fixed execution role stack boolean variable and security group stack display
* bump k8s version to 1.31
* breaking ground on terraform support
* adding boilerplate module files
* added child modules and default values in root
* made output and variable corrections
* bug fixes on helm chart and eks auth mode
* Remove .terraform.lock.hcl file
* Added .terraform.lock.hcl to gitignore
* rename parent directory hyperpod-eks-tf
* added readme and env vars script
* code tidy after testing
* Update 1.architectures/7.sagemaker-hyperpod-eks/terraform-modules/README.md

---------

Co-authored-by: Keita Watanabe <mlkeita@amazon.com>

1 parent ac776b0 · commit f11ddee


50 files changed: +2067 -0 lines changed
@@ -0,0 +1,34 @@
# Ignore local Terraform directories
**/.terraform/*

# Ignore state files and backups
*.tfstate
*.tfstate.*

# Ignore variable files with sensitive data
# *.tfvars
# *.tfvars.json

# Ignore crash logs
crash.log
crash.*.log

# Ignore override files
override.tf
override.tf.json
*_override.tf
*_override.tf.json

# Ignore plan output files
*.tfplan

# Ignore CLI configuration files
.terraformrc
terraform.rc

# Ignore hashes of provider binaries
.terraform.lock.hcl

# Ignore environment variables
env_vars.sh
terraform_outputs.json
@@ -0,0 +1,82 @@
# Deploy HyperPod Infrastructure using Terraform

## Modules

The diagram below depicts the Terraform modules that have been bundled into a single project to enable you to deploy a full HyperPod cluster environment all at once.

<img src="./smhp_tf_modules.png" width="50%"/>

## Configuration
Start by reviewing the default configuration in the `terraform.tfvars` file and modify it as needed to suit your environment.

```bash
vim hyperpod-eks-tf/terraform.tfvars
```
For example, you may want to add or modify the HyperPod instance groups to be created:
```hcl
instance_groups = {
  group1 = {
    instance_type             = "ml.g5.8xlarge"
    instance_count            = 8
    ebs_volume_size           = 100
    threads_per_core          = 2
    enable_stress_check       = true
    enable_connectivity_check = true
    lifecycle_script          = "on_create.sh"
  }
}
```
If you wish to reuse any cloud resources rather than creating new ones, set the associated `create_*` variable to `false` and provide the ID of the corresponding resource as the value of the `existing_*` variable.

For example, if you want to reuse an existing VPC, set `create_vpc` to `false`, then set `existing_vpc_id` to your VPC ID, like `vpc-1234567890abcdef0`.

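A minimal sketch of the corresponding `terraform.tfvars` entries (the VPC ID shown is the placeholder value from the example above):

```hcl
# Reuse an existing VPC instead of creating a new one
create_vpc      = false
existing_vpc_id = "vpc-1234567890abcdef0"
```
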
## Deployment
Run `terraform init` to initialize the Terraform working directory, install necessary provider plugins, download modules, set up state storage, and configure the backend for managing infrastructure state:

```bash
terraform -chdir=hyperpod-eks-tf init
```
Run `terraform plan` to generate and display an execution plan that outlines the changes Terraform will make to your infrastructure, allowing you to review and validate the proposed updates before applying them.

```bash
terraform -chdir=hyperpod-eks-tf plan
```
Run `terraform apply` to execute the proposed changes outlined in the Terraform plan, creating, updating, or deleting infrastructure resources according to your configuration, and updating the state to reflect the new infrastructure setup.

```bash
terraform -chdir=hyperpod-eks-tf apply
```
When prompted to confirm, type `yes` and press enter.

You can also run `terraform apply` with the `-auto-approve` flag to avoid being prompted for confirmation, but use it with caution to avoid unintended changes to your infrastructure.

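For example, a non-interactive apply is the same command as above with the flag appended:

```bash
# Skips the interactive confirmation prompt
terraform -chdir=hyperpod-eks-tf apply -auto-approve
```
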

## Environment Variables
Run the `terraform_outputs.sh` script, which populates the `env_vars.sh` script with your environment variables for future reference:
```bash
chmod +x terraform_outputs.sh
./terraform_outputs.sh
cat env_vars.sh
```
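As a hypothetical illustration, the populated `env_vars.sh` can be thought of as a set of shell export statements like the following (values are placeholders; the exact variables and format depend on what `terraform_outputs.sh` writes for your deployment):

```bash
# Illustrative only; the real file is generated by terraform_outputs.sh
export EKS_CLUSTER_NAME=my-eks-cluster
export PRIVATE_SUBNET_ID=subnet-0123456789abcdef0
export SECURITY_GROUP_ID=sg-0123456789abcdef0
```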
Source the `env_vars.sh` script to set your environment variables:
```bash
source env_vars.sh
```
Verify that your environment variables are set:
```bash
echo $EKS_CLUSTER_NAME
echo $PRIVATE_SUBNET_ID
echo $SECURITY_GROUP_ID
```

## Clean Up

Before cleaning up, validate the changes by running a speculative destroy plan:

```bash
terraform -chdir=hyperpod-eks-tf plan -destroy
```

Once you've validated the changes, you can proceed to destroy the resources:
```bash
terraform -chdir=hyperpod-eks-tf destroy
```
@@ -0,0 +1,138 @@
locals {
  vpc_id                  = var.create_vpc ? module.vpc[0].vpc_id : var.existing_vpc_id
  private_subnet_id       = var.create_private_subnet ? module.private_subnet[0].private_subnet_id : var.existing_private_subnet_id
  security_group_id       = var.create_security_group ? module.security_group[0].security_group_id : var.existing_security_group_id
  s3_bucket_name          = var.create_s3_bucket ? module.s3_bucket[0].s3_bucket_name : var.existing_s3_bucket_name
  eks_cluster_name        = var.create_eks ? module.eks_cluster[0].eks_cluster_name : var.existing_eks_cluster_name
  sagemaker_iam_role_name = var.create_sagemaker_iam_role ? module.sagemaker_iam_role[0].sagemaker_iam_role_name : var.existing_sagemaker_iam_role_name
}

module "vpc" {
  count  = var.create_vpc ? 1 : 0
  source = "./modules/vpc"

  resource_name_prefix = var.resource_name_prefix
  vpc_cidr             = var.vpc_cidr
  public_subnet_1_cidr = var.public_subnet_1_cidr
  public_subnet_2_cidr = var.public_subnet_2_cidr
}

module "private_subnet" {
  count  = var.create_private_subnet ? 1 : 0
  source = "./modules/private_subnet"

  resource_name_prefix = var.resource_name_prefix
  vpc_id               = local.vpc_id
  availability_zone_id = var.availability_zone_id
  private_subnet_cidr  = var.private_subnet_cidr
  nat_gateway_id       = var.create_vpc ? module.vpc[0].nat_gateway_1_id : var.existing_nat_gateway_id
}

module "security_group" {
  count  = var.create_security_group ? 1 : 0
  source = "./modules/security_group"

  resource_name_prefix       = var.resource_name_prefix
  vpc_id                     = local.vpc_id
  create_new_sg              = var.create_eks
  existing_security_group_id = var.existing_security_group_id
}

module "eks_cluster" {
  count  = var.create_eks ? 1 : 0
  source = "./modules/eks_cluster"

  resource_name_prefix     = var.resource_name_prefix
  vpc_id                   = local.vpc_id
  eks_cluster_name         = var.eks_cluster_name
  kubernetes_version       = var.kubernetes_version
  security_group_id        = local.security_group_id
  private_subnet_cidrs     = [var.eks_private_subnet_1_cidr, var.eks_private_subnet_2_cidr]
  private_node_subnet_cidr = var.eks_private_node_subnet_cidr
  nat_gateway_id           = var.create_vpc ? module.vpc[0].nat_gateway_1_id : var.existing_nat_gateway_id
}

module "s3_bucket" {
  count  = var.create_s3_bucket ? 1 : 0
  source = "./modules/s3_bucket"

  resource_name_prefix = var.resource_name_prefix
}

module "s3_endpoint" {
  count  = var.create_s3_endpoint ? 1 : 0
  source = "./modules/s3_endpoint"

  vpc_id                 = local.vpc_id
  private_route_table_id = var.create_private_subnet ? module.private_subnet[0].private_route_table_id : var.existing_private_route_table_id
}

module "lifecycle_script" {
  count  = var.create_lifecycle_script ? 1 : 0
  source = "./modules/lifecycle_script"

  resource_name_prefix = var.resource_name_prefix
  s3_bucket_name       = local.s3_bucket_name
}

module "sagemaker_iam_role" {
  count  = var.create_sagemaker_iam_role ? 1 : 0
  source = "./modules/sagemaker_iam_role"

  resource_name_prefix = var.resource_name_prefix
  s3_bucket_name       = local.s3_bucket_name
}

module "helm_chart" {
  count  = var.create_helm_chart ? 1 : 0
  source = "./modules/helm_chart"

  depends_on = [module.eks_cluster]

  resource_name_prefix = var.resource_name_prefix
  helm_repo_url        = var.helm_repo_url
  helm_repo_path       = var.helm_repo_path
  namespace            = var.namespace
  helm_release_name    = var.helm_release_name
  eks_cluster_name     = local.eks_cluster_name
}

module "hyperpod_cluster" {
  count  = var.create_hyperpod ? 1 : 0
  source = "./modules/hyperpod_cluster"

  depends_on = [
    module.helm_chart,
    module.eks_cluster,
    module.private_subnet,
    module.security_group,
    module.s3_bucket,
    module.s3_endpoint,
    module.sagemaker_iam_role
  ]

  resource_name_prefix    = var.resource_name_prefix
  hyperpod_cluster_name   = var.hyperpod_cluster_name
  node_recovery           = var.node_recovery
  instance_groups         = var.instance_groups
  private_subnet_id       = local.private_subnet_id
  security_group_id       = local.security_group_id
  eks_cluster_name        = local.eks_cluster_name
  s3_bucket_name          = local.s3_bucket_name
  sagemaker_iam_role_name = local.sagemaker_iam_role_name
}

# Data source for current AWS region
data "aws_region" "current" {}

data "aws_eks_cluster" "existing_eks_cluster" {
  count = var.create_eks ? 0 : 1
  name  = var.eks_cluster_name
}

data "aws_s3_bucket" "existing_s3_bucket" {
  count  = var.create_s3_bucket ? 0 : 1
  bucket = var.existing_s3_bucket_name
}
