Commit 7255024

Do not auto-prune instance types if there are too many (#235)
Previously, I allowed only 1 memory size/core count combination to keep the number of compute resources down, and I also combined multiple instance types in one compute resource where possible. The goal was to maximize the number of instance types that were configured. As a result, people were not able to configure the exact instance types they wanted.

The preference is to notify the user and let them choose which instance types to exclude, or to reduce the number of included types. So I've reverted to my original strategy of 1 instance type per compute resource (CR) and 1 CR per queue. The compute resources can be combined into any queues that the user wants using custom slurm settings.

I had to exclude instance types in the default configuration to keep from exceeding the ParallelCluster (PC) limits.

Resolves #220

Update ParallelCluster version in config files and docs.

Clean up security scan.
1 parent 70fd1ef commit 7255024
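
The strategy described in the commit message (one instance type per compute resource, one compute resource per queue, and an error instead of silent pruning) can be sketched roughly as follows. This is an illustrative sketch only, not the code from `source/cdk/cdk_slurm_stack.py`, whose diff is not rendered on this page. The helper `plan_queues`, the queue naming scheme, and the error text are hypothetical; the limit of 50 mirrors the value returned by `MAX_NUMBER_OF_QUEUES` in `source/cdk/config_schema.py`.

```python
MAX_QUEUES = 50  # mirrors the value returned by MAX_NUMBER_OF_QUEUES in this commit


def plan_queues(instance_types, max_queues=MAX_QUEUES):
    """Map each instance type to its own queue and compute resource.

    Hypothetical helper: raises instead of silently dropping instance types,
    so the user decides what to exclude.
    """
    if len(instance_types) > max_queues:
        excess = len(instance_types) - max_queues
        raise ValueError(
            f"{len(instance_types)} instance types selected but only "
            f"{max_queues} queues are allowed; exclude at least {excess} "
            "instance types or reduce the included types."
        )
    # One instance type per compute resource, one compute resource per queue.
    return [
        {
            "queue": f"queue-{index}",  # hypothetical naming scheme
            "compute_resource": instance_type.replace(".", "-"),
            "instance_type": instance_type,
        }
        for index, instance_type in enumerate(sorted(instance_types))
    ]


if __name__ == "__main__":
    for entry in plan_queues(["r7a.medium", "c7a.large"]):
        print(entry)
```

Raising an error keeps the choice with the user, which is the stated preference in the commit message; an auto-pruning variant would instead truncate the list without telling anyone.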

13 files changed, +260 -261 lines changed

.gitignore

+3
@@ -1,7 +1,10 @@
 
+
 .mkdocs_venv/
 _site
 site/
 .vscode/
 source/resources/parallel-cluster/config/build-files/*/*/parallelcluster-*.yml
+security_scan/bandit-env
+security_scan/bandit.log
 security_scan/cfn_nag.log

Makefile

+2-2
@@ -1,8 +1,8 @@
 
-.PHONY: help local-docs test clean
+.PHONY: help local-docs security_scan test clean
 
 help:
-	@echo "Usage: make [ help | local-docs | github-docs | clean ]"
+	@echo "Usage: make [ help | local-docs | github-docs | security_scan | test | clean ]"
 
 .mkdocs_venv/bin/activate:
 	rm -rf .mkdocs_venv

docs/deploy-parallel-cluster.md

+1-1
@@ -4,7 +4,7 @@ A ParallelCluster configuration will be generated and used to create a ParallelC
 The first supported ParallelCluster version is 3.6.0.
 Version 3.7.0 is the recommended minimum version because it supports compute node weighting that is proportional to instance type
 cost so that the least expensive instance types that meet job requirements are used.
-The current latest version is 3.8.0.
+The current latest version is 3.9.1.
 
 ## Prerequisites
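
The 3.7.0 recommendation above corresponds to the version gate this commit adds in `source/cdk/config_schema.py` (`PARALLEL_CLUSTER_SUPPORTS_NODE_WEIGHTS`). A minimal standalone sketch of that pattern, assuming a simple tuple-based stand-in for `parse_version` (the project's own `parse_version` import is not shown in this diff):

```python
def parse_version(version):
    """Stand-in for the project's parse_version: '3.9.1' -> (3, 9, 1)."""
    return tuple(int(part) for part in version.split("."))


PARALLEL_CLUSTER_SUPPORTS_NODE_WEIGHTS_VERSION = parse_version("3.7.0")


def PARALLEL_CLUSTER_SUPPORTS_NODE_WEIGHTS(parallel_cluster_version):
    # Node weighting is available from 3.7.0 onward.
    return parallel_cluster_version >= PARALLEL_CLUSTER_SUPPORTS_NODE_WEIGHTS_VERSION


if __name__ == "__main__":
    for version in ("3.6.0", "3.7.0", "3.9.1"):
        supported = PARALLEL_CLUSTER_SUPPORTS_NODE_WEIGHTS(parse_version(version))
        print(f"{version}: node weights supported = {supported}")
```

Comparing parsed tuples rather than raw strings matters once a minor version reaches two digits: as strings, `'3.10.0'` sorts before `'3.7.0'`.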

docs/res_integration.md

+3-3
@@ -30,7 +30,7 @@ The following example shows the configuration parameters for a RES with the Envi
 # Command line values override values in the config file.
 #====================================================================
 
-StackName: res-eda-pc-3-8-0-rhel8-x86-config
+StackName: res-eda-pc-3-9-1-rhel8-x86-config
 
 Region: <region>
 SshKeyPair: <key-name>
@@ -42,10 +42,10 @@ ErrorSnsTopicArn: <topic-arn>
 TimeZone: 'US/Central'
 
 slurm:
-  ClusterName: res-eda-pc-3-8-0-rhel8-x86
+  ClusterName: res-eda-pc-3-9-1-rhel8-x86
 
   ParallelClusterConfig:
-    Version: '3.8.0'
+    Version: '3.9.1'
     Image:
       Os: 'rhel8'
     Architecture: 'x86_64'

security_scan/security_scan.sh

+2-2
@@ -5,9 +5,9 @@
 scriptdir=$(dirname $(readlink -f $0))
 
 cd $scriptdir/..
-./install.sh --config-file ~/slurm/res-eda/res-eda-pc-3-7-2-centos7-x86-config.yml --cdk-cmd synth
+./install.sh --config-file ~/slurm/res-eda/res-eda-pc-3-9-1-rhel8-x86-config.yml --cdk-cmd synth
 
-cfn_nag_scan --input-path $scriptdir/../source/cdk.out/res-eda-pc-3-7-2-centos7-x86-config.template.json --deny-list-path $scriptdir/cfn_nag-deny-list.yml --fail-on-warnings &> $scriptdir/cfn_nag.log
+cfn_nag_scan --input-path $scriptdir/../source/cdk.out/res-eda-pc-3-9-1-rhel8-x86-config.template.json --deny-list-path $scriptdir/cfn_nag-deny-list.yml --fail-on-warnings &> $scriptdir/cfn_nag.log
 
 cd $scriptdir
 if [ ! -e $scriptdir/bandit-env ]; then

source/cdk/cdk_slurm_stack.py

+155-246
Large diffs are not rendered by default.

source/cdk/config_schema.py

+88-1
@@ -181,6 +181,15 @@ def get_slurm_rest_api_version(config):
 
 # Feature support
 
+def MAX_NUMBER_OF_QUEUES(parallel_cluster_version):
+    return 50
+
+def MAX_NUMBER_OF_COMPUTE_RESOURCES(parallel_cluster_version):
+    return 50
+
+def MAX_NUMBER_OF_COMPUTE_RESOURCES_PER_QUEUE(parallel_cluster_version):
+    return 50
+
 # Version 3.7.0:
 PARALLEL_CLUSTER_SUPPORTS_LOGIN_NODES_VERSION = parse_version('3.7.0')
 def PARALLEL_CLUSTER_SUPPORTS_LOGIN_NODES(parallel_cluster_version):
@@ -194,6 +203,10 @@ def PARALLEL_CLUSTER_SUPPORTS_MULTIPLE_COMPUTE_RESOURCES_PER_QUEUE(parallel_clus
 def PARALLEL_CLUSTER_SUPPORTS_MULTIPLE_INSTANCE_TYPES_PER_COMPUTE_RESOURCE(parallel_cluster_version):
     return parallel_cluster_version >= PARALLEL_CLUSTER_SUPPORTS_MULTIPLE_INSTANCE_TYPES_PER_COMPUTE_RESOURCE_VERSION
 
+PARALLEL_CLUSTER_SUPPORTS_NODE_WEIGHTS_VERSION = parse_version('3.7.0')
+def PARALLEL_CLUSTER_SUPPORTS_NODE_WEIGHTS(parallel_cluster_version):
+    return parallel_cluster_version >= PARALLEL_CLUSTER_SUPPORTS_NODE_WEIGHTS_VERSION
+
 # Version 3.8.0
 
 PARALLEL_CLUSTER_SUPPORTS_CUSTOM_ROCKY_8_VERSION = parse_version('3.8.0')
@@ -297,6 +310,7 @@ def DEFAULT_OS(config):
 
     'x2iezn', # Intel Xeon Platinum 8252 4.5 GHz 1.5 TB
 
+    'u',
     #'u-6tb1', # Intel Xeon Scalable (Skylake) 6 TB
     #'u-9tb1', # Intel Xeon Scalable (Skylake) 9 TB
     #'u-12tb1', # Intel Xeon Scalable (Skylake) 12 TB
@@ -371,7 +385,80 @@ def DEFAULT_OS(config):
 
 default_excluded_instance_types = [
     '.+\.(micro|nano)', # Not enough memory
-    '.*\.metal.*'
+    '.*\.metal.*',
+
+    # Reduce the number of selected instance types to 25.
+    # Exclude larger core counts for each memory size
+    # 2 GB:
+    'c7a.medium',
+    'c7g.medium',
+    # 4 GB: m7a.medium, m7g.medium
+    'c7a.large',
+    'c7g.large',
+    # 8 GB: r7a.medium, r7g.medium
+    'm5zn.large',
+    'm7a.large',
+    'm7g.large',
+    'c7a.xlarge',
+    'c7g.xlarge',
+    # 16 GB: r7a.large, x2gd.medium, r7g.large
+    'r7iz.large',
+    'm5zn.xlarge',
+    'm7a.xlarge',
+    'm7g.xlarge',
+    'c7a.2xlarge',
+    'c7g.2xlarge',
+    # 32 GB: r7a.xlarge, x2gd.large, r7g.xlarge
+    'r7iz.xlarge',
+    'm5zn.2xlarge',
+    'm7a.2xlarge',
+    'm7g.2xlarge',
+    'c7a.4xlarge',
+    'c7g.4xlarge',
+    # 64 GB: r7a.2xlarge, x2gd.xlarge, r7g.2xlarge
+    'r7iz.2xlarge',
+    'm7a.4xlarge',
+    'm7g.4xlarge',
+    'c7a.8xlarge',
+    'c7g.8xlarge',
+    # 96 GB:
+    'm5zn.6xlarge',
+    'c7a.12xlarge',
+    'c7g.12xlarge',
+    # 128 GB: x2iedn.xlarge, r7iz.4xlarge, x2gd.2xlarge, r7g.4xlarge
+    'r7a.4xlarge',
+    'm7a.8xlarge',
+    'm7g.8xlarge',
+    'c7a.16xlarge',
+    'c7g.8xlarge',
+    # 192 GB: m5zn.12xlarge, m7a.12xlarge, m7g.12xlarge
+    'c7a.24xlarge',
+    # 256 GB: x2iedn.2xlarge, x2iezn.2xlarge, x2gd.4xlarge, r7g.8xlarge
+    'r7iz.8xlarge',
+    'r7a.8xlarge',
+    'm7a.16xlarge',
+    'm7g.16xlarge',
+    'c7a.32xlarge',
+    # 384 GB: 'r7iz.12xlarge', r7g.12xlarge
+    'r7a.12xlarge',
+    'm7a.24xlarge',
+    'c7a.48xlarge',
+    # 512 GB: x2iedn.4xlarge, x2iezn.4xlarge, x2gd.8xlarge, r7g.16xlarge
+    'r7iz.16xlarge',
+    'r7a.16xlarge',
+    'm7a.32xlarge',
+    # 768 GB: r7a.24xlarge, x2gd.12xlarge
+    'x2iezn.6xlarge',
+    'm7a.48xlarge',
+    # 1024 GB: x2iedn.8xlarge, x2iezn.8xlarge, x2gd.16xlarge
+    'r7iz.32xlarge',
+    'r7a.32xlarge',
+    # 1536 GB: x2iezn.12xlarge, x2idn.24xlarge
+    'r7a.48xlarge',
+    # 2048 GB: x2iedn.16xlarge
+    'x2idn.32xlarge',
+    # 3072 GB: 'x2iedn.24xlarge',
+    # 4096 GB: x2iedn.32xlarge
 ]
 
 architectures = [
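
The entries in `default_excluded_instance_types` mix regular expressions (such as `'.*\.metal.*'`) with plain instance type names. A minimal sketch of how such a list could be applied, assuming each entry is matched against the whole instance type name with `re.fullmatch`. The helper `filter_instance_types` is hypothetical; the code that actually consumes the list lives in `source/cdk/cdk_slurm_stack.py`, whose diff is not rendered on this page.

```python
import re

# A short subset of the exclusion list from this commit.
excluded_patterns = [
    r'.+\.(micro|nano)',  # Not enough memory
    r'.*\.metal.*',
    'c7a.medium',
    'c7g.medium',
]


def filter_instance_types(instance_types, patterns):
    """Drop every instance type that fully matches any exclusion pattern."""
    compiled = [re.compile(pattern) for pattern in patterns]
    return [
        instance_type
        for instance_type in instance_types
        if not any(regex.fullmatch(instance_type) for regex in compiled)
    ]


if __name__ == "__main__":
    candidates = ['t3.micro', 'c7a.medium', 'm7a.medium', 'c7g.metal', 'r7a.medium']
    print(filter_instance_types(candidates, excluded_patterns))
    # prints ['m7a.medium', 'r7a.medium']
```

Under this assumption a plain name such as `'c7a.medium'` is also treated as a pattern, so its `.` matches any single character. That is harmless for real instance type names, and `re.escape` would make the match strictly literal if needed.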

source/resources/config/default_config.yml

+1-1
@@ -43,7 +43,7 @@ StackName: slurmminimal-config
 
 slurm:
   ParallelClusterConfig:
-    Version: 3.8.0
+    Version: 3.9.1
     # @TODO: Choose the CPU architecture: x86_64, arm64. Default: x86_64
     # Architecture: x86_64
     # @TODO: Update DatabaseStackName with stack name you deployed ParallelCluster database into. See: https://docs.aws.amazon.com/parallelcluster/latest/ug/tutorials_07_slurm-accounting-v3.html#slurm-accounting-db-stack-v3

source/resources/config/slurm_all_arm_instance_types.yml

+1-1
@@ -37,7 +37,7 @@ StackName: slurm-all-arm-config
 
 slurm:
   ParallelClusterConfig:
-    Version: 3.8.0
+    Version: 3.9.1
     Architecture: arm64
     # @TODO: Update DatabaseStackName with stack name you deployed ParallelCluster database into. See: https://docs.aws.amazon.com/parallelcluster/latest/ug/tutorials_07_slurm-accounting-v3.html#slurm-accounting-db-stack-v3
     # Database:

source/resources/config/slurm_all_x86_instance_types.yml

+1-1
@@ -37,7 +37,7 @@ StackName: slurm-all-x86-config
 
 slurm:
   ParallelClusterConfig:
-    Version: 3.8.0
+    Version: 3.9.1
     Architecture: x86_64
     # @TODO: Update DatabaseStackName with stack name you deployed ParallelCluster database into. See: https://docs.aws.amazon.com/parallelcluster/latest/ug/tutorials_07_slurm-accounting-v3.html#slurm-accounting-db-stack-v3
     # Database:

source/resources/config/slurm_recommended_arm_instance_types.yml

+1-1
@@ -37,7 +37,7 @@ StackName: slurm-arm-config
 
 slurm:
   ParallelClusterConfig:
-    Version: 3.8.0
+    Version: 3.9.1
     Architecture: arm64
     # @TODO: Update DatabaseStackName with stack name you deployed ParallelCluster database into. See: https://docs.aws.amazon.com/parallelcluster/latest/ug/tutorials_07_slurm-accounting-v3.html#slurm-accounting-db-stack-v3
     # Database:

source/resources/config/slurm_recommended_x86_instance_types.yml

+1-1
@@ -37,7 +37,7 @@ StackName: slurm-x86-config
 
 slurm:
   ParallelClusterConfig:
-    Version: 3.8.0
+    Version: 3.9.1
     Architecture: x86_64
     # @TODO: Update DatabaseStackName with stack name you deployed ParallelCluster database into. See: https://docs.aws.amazon.com/parallelcluster/latest/ug/tutorials_07_slurm-accounting-v3.html#slurm-accounting-db-stack-v3
     # Database:

source/resources/lambdas/DeconfigureRESUsersGroupsJson/DeconfigureRESUsersGroupsJson.py

+1-1
@@ -137,7 +137,7 @@ def lambda_handler(event, context):
 sudo rmdir $mount_dest
 fi
 
-pass
+true
 """
 logger.info(f"Submitting SSM command")
 send_command_response = ssm_client.send_command(
