Commit 009bdeb

Merge pull request #1093 from dholt/release-22.01
Release 22.01
2 parents 21c039d + aaedef8 commit 009bdeb

41 files changed, +457 -57 lines changed

.github/workflows/molecule.yml

+33
@@ -0,0 +1,33 @@
+---
+name: test ansible roles with molecule
+on:
+  - push
+  - pull_request
+jobs:
+  build:
+    runs-on: ubuntu-20.04
+    strategy:
+      max-parallel: 4
+      matrix:
+        deepops-role:
+          - singularity_wrapper
+    steps:
+      - name: check out repo
+        uses: actions/checkout@v2
+        with:
+          path: "${{ github.repository }}"
+      - name: set up python
+        uses: actions/setup-python@v2
+        with:
+          python-version: "3.9"
+      - name: install dependencies
+        run: |
+          python3 -m pip install --upgrade pip
+          python3 -m pip install molecule[docker] docker ansible
+      - name: run molecule test
+        run: |
+          cd "${{ github.repository }}/roles"
+          ansible-galaxy role install --force -r ./requirements.yml
+          ansible-galaxy collection install --force -r ./requirements.yml
+          cd "${{ matrix.deepops-role }}"
+          molecule test
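
The two `run` steps of this workflow can be tried locally before pushing. The following is a minimal sketch, assuming a local checkout of the repository and a running Docker daemon. The names `run_role_test`, `maybe_run`, and `DRY_RUN` are illustrative and not part of the commit. By default the sketch only prints the commands it would run.

```shell
#!/usr/bin/env bash
# Sketch only: mirrors the "install dependencies" and "run molecule test"
# steps of the workflow. run_role_test, maybe_run and DRY_RUN are
# hypothetical names introduced for this example.

# Print the command when DRY_RUN=1 (the default); otherwise execute it.
maybe_run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Usage: run_role_test <path-to-deepops-checkout> <role-name>
# The body runs in a subshell so the caller's working directory is untouched.
run_role_test() (
  set -e
  deepops_dir="$1"
  role="$2"
  maybe_run python3 -m pip install --upgrade pip
  maybe_run python3 -m pip install "molecule[docker]" docker ansible
  maybe_run cd "${deepops_dir}/roles"
  maybe_run ansible-galaxy role install --force -r ./requirements.yml
  maybe_run ansible-galaxy collection install --force -r ./requirements.yml
  maybe_run cd "${role}"
  maybe_run molecule test
)
```

Calling `run_role_test ~/deepops singularity_wrapper` lists the commands; prefixing the call with `DRY_RUN=0` executes them.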

README.md

+1-1
@@ -16,7 +16,7 @@ Check out the [video tutorial](https://drive.google.com/file/d/1RNLQYlgJqE8JMv0n
 
 ## Releases
 
-Latest release: [DeepOps 21.09 Release](https://github.com/NVIDIA/deepops/releases/tag/21.09)
+Latest release: [DeepOps 22.01 Release](https://github.com/NVIDIA/deepops/releases/tag/22.01)
 
 It is recommended to use the latest release branch for stable code (linked above). All development takes place on the master branch, which is generally [functional](docs/deepops/testing.md) but may change significantly between releases.
 

config.example/env.sh

+6
@@ -0,0 +1,6 @@
+# This file acts as a location to override the default configurations of deepops/scripts/*
+# Many of the scripts in this directory define global variables and set reasonable defaults
+# Global variables (in all caps) that are defined here will be automatically sourced and used in all scripts
+# See deepops/scripts/common.sh for implementation details
+
+DEEPOPS_EXAMPLE_VAR=""
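
The comments in `env.sh` describe an override pattern: scripts define defaults, and anything set in `config/env.sh` takes precedence. The actual logic lives in `scripts/common.sh`, which this view does not show, so the following is only a sketch of how such a pattern could look. The function names `load_deepops_env` and `set_defaults` and the value `default-value` are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the override pattern; NOT the contents of
# scripts/common.sh. load_deepops_env, set_defaults and "default-value"
# are names invented for illustration.

# Source the override file if it exists; do nothing otherwise.
load_deepops_env() {
  local env_file="$1"
  if [ -f "$env_file" ]; then
    # shellcheck disable=SC1090
    . "$env_file"
  fi
}

# Apply a script-level default only when no override was provided.
# With ":-", an empty value (as in config.example/env.sh) also falls
# back to the default.
set_defaults() {
  DEEPOPS_EXAMPLE_VAR="${DEEPOPS_EXAMPLE_VAR:-default-value}"
}
```

In this sketch a script would call `load_deepops_env config/env.sh` first and `set_defaults` second, so a non-empty value from `env.sh` wins over the script's default.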

config.example/group_vars/all.yml

+1-1
@@ -122,7 +122,7 @@ sftp_chroot: false
 ################################################################################
 # NVIDIA GPU configuration
 # Playbook: nvidia-cuda
-cuda_version: cuda-toolkit-11-4
+cuda_version: cuda-toolkit-11-5
 
 # DGX-specific vars may be used to target specific models,
 # because available versions for DGX may differ from the generic repo

config.example/group_vars/k8s-cluster.yml

-3
@@ -36,9 +36,6 @@ dashboard_image_repo: "kubernetesui/dashboard"
 dashboard_metrics_scrape_tagr: "v1.0.4"
 dashboard_metrics_scraper_repo: "kubernetesui/metrics-scraper"
 
-# Override the Helm version installed by Kubespray
-helm_version: "v3.5.4"
-
 # Ensure hosts file generation only runs across k8s cluster
 hosts_add_ansible_managed_hosts_groups: ["k8s-cluster"]
 

config.example/group_vars/slurm-cluster.yml

+4-4
@@ -3,7 +3,7 @@
 ################################################################################
 # Slurm job scheduler configuration
 # Playbook: slurm, slurm-cluster, slurm-perf, slurm-perf-cluster, slurm-validation
-slurm_version: 21.08.1
+slurm_version: 21.08.5
 slurm_install_prefix: /usr/local
 pmix_install_prefix: /opt/deepops/pmix
 hwloc_install_prefix: /opt/deepops/hwloc
@@ -117,9 +117,9 @@ sm_install_host: "slurm-master[0]"
 slurm_install_hpcsdk: true
 
 # Select the version of HPC SDK to download
-hpcsdk_major_version: "21"
-hpcsdk_minor_version: "9"
-hpcsdk_file_cuda: "11.4"
+hpcsdk_major_version: "22"
+hpcsdk_minor_version: "1"
+hpcsdk_file_cuda: "11.5"
 hpcsdk_arch: "x86_64"
 
 # In a Slurm cluster, default to setting up HPC SDK as modules rather than in

docs/deepops/configuration.md

+1
@@ -12,6 +12,7 @@ In particular, this directory includes:
 - `config/group_vars/all.yml`: An Ansible [variables file](https://docs.ansible.com/ansible/latest/user_guide/playbooks_variables.html) that contains variables we expect to work for all hosts
 - `config/group_vars/k8s-cluster.yml`: Variables specific to deploying Kubernetes clusters
 - `config/group_vars/slurm-cluster.yml`: Variables specific to deploying Slurm clusters
+- `config/env.sh`: Global variables that override default variable values for all `sh` files in `scripts/*`.
 - `config/requirements.yml`: An Ansible Galaxy [requirements file](https://docs.ansible.com/ansible/latest/galaxy/user_guide.html#installing-roles-and-collections-from-the-same-requirements-yml-file) that contains a list of custom Collections and Roles to install. Collections and Roles required by DeepOps are stored in a separate `roles/requirements.yml` file, which should not be modified.
 
 It's expected that most DeepOps deployments will make changes to these files!

docs/deepops/testing.md

+74-2
@@ -1,12 +1,13 @@
 # DeepOps Testing, CI/CD, and Validation
 
-## DeepOps Continuous Integration Testing
+
+## DeepOps end-to-end testing
 
 The DeepOps project leverages a private Jenkins server to run continuous integration tests. Testing is done using the [virtual](../../virtual) deployment mechanism. Several Vagrant VMs are created, the cluster is deployed, tests are executed, and then the VMs are destroyed.
 
 The goal of the DeepOps CI is to prevent bugs from being introduced into the code base and to identify when changes in 3rd party platforms have occurred or impacted the DeepOps deployment mechanisms. In general, K8s and Slurm deployment issues are detected and resolved with urgency. Many components of DeepOps are 3rd party open source tools that may silently fail or suddenly change without notice. The team will make a best-effort to resolve these issues and include regression tests, however there may be times where a fix is unavailable. Historically, this has been an issue with Rook-Ceph and Kubeflow, and those GitHub communities are best equipped to help with resolutions.
 
-### Testing Methodi
+### Testing Method
 
 DeepOps CI contains two types of automated tests:
 
@@ -63,6 +64,77 @@ A short description of the nightly testing is outlined below. The full suit of t
 | MIG configuration | | | | No testing support
 
 
+## DeepOps Ansible role testing
+
+A subset of the Ansible roles in DeepOps have tests defined using [Ansible Molecule](https://molecule.readthedocs.io/en/latest/).
+This testing mechanism allows the roles to be tested individually, providing additional test signal to identify issues which do not appear in the end-to-end tests.
+These tests are run automatically for each pull request using [GitHub Actions](https://github.com/NVIDIA/deepops/actions).
+
+Molecule testing runs the Ansible role in question inside a Docker container.
+As such, not all roles will be easy to test with this mechanism.
+Roles which mostly involve installing software, configuring services, or executing scripts should generally be possible to test.
+Roles which rely on the presence of specific hardware (such as GPUs), which reboot the nodes they act on, or which make changes to kernel configuration are going to be harder to test with Molecule.
+
+### Defining Molecule tests for a new role
+
+To add Molecule tests to a new role, the following procedure can be used.
+
+1. Ensure you have Docker installed in your development environment
+
+2. Install Ansible Molecule in your development environment
+
+```
+$ python3 -m pip install "molecule[docker,lint]"
+```
+
+3. Initialize Molecule in your new role
+
+```
+$ cd deepops/roles/<your-role>
+$ molecule init scenario -r <your-role> --driver docker
+```
+
+4. In the file `molecule/default/molecule.yml`, define the list of platforms to be tested.
+DeepOps currently supports operating systems based on Ubuntu 18.04, Ubuntu 20.04, EL7, and EL8.
+To test these stacks, the following `platforms` stanza can be used.
+
+```
+platforms:
+  - name: ubuntu-1804
+    image: geerlingguy/docker-ubuntu1804-ansible
+    pre_build_image: true
+  - name: ubuntu-2004
+    image: geerlingguy/docker-ubuntu2004-ansible
+    pre_build_image: true
+  - name: centos-7
+    image: geerlingguy/docker-centos7-ansible
+    pre_build_image: true
+  - name: centos-8
+    image: geerlingguy/docker-centos8-ansible
+    pre_build_image: true
+```
+
+5. If you haven't already, define your role's metadata in the file `meta/main.yml`.
+A sample `meta/main.yml` is shown here:
+
+```
+galaxy_info:
+  role_name: <your-role>
+  namespace: deepops
+  author: DeepOps Team
+  company: NVIDIA
+  description: <your-description>
+  license: 3-Clause BSD
+  min_ansible_version: 2.9
+```
+
+6. Once this is done, verify that your role executes successfully in the Molecule environment by running `molecule test`. If you run into any issues, consult the [Molecule documentation](https://molecule.readthedocs.io/en/latest/index.html) for help resolving them.
+
+7. (optional) In addition to testing successful execution, you can add further tests, which will be run after your role completes, in a file `molecule/default/verify.yml`. This is an Ansible playbook that runs in the same environment as your role. For a simple example of such a verify playbook, see the [Enroot role](https://github.com/NVIDIA/ansible-role-enroot/blob/master/molecule/default/verify.yml).
+
+8. Once you're confident that your new tests are all passing, add your role to the `deepops-role` section in the `.github/workflows/molecule.yml` file.
+
+
 ## DeepOps Deployment Validation
 
 The Slurm and Kubernetes deployment guides both document cluster verification steps. These should be run during the installation process to validate a GPU workload can be executed on the cluster.
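
The role-testing procedure added to `docs/deepops/testing.md` names three files inside a role: `molecule/default/molecule.yml` and `meta/main.yml` are required by the steps, and `molecule/default/verify.yml` is optional. Before adding a role to the workflow matrix, a quick layout check along these lines may help. This is a sketch, and `check_role_layout` is a hypothetical helper that does not exist in the repository.

```shell
#!/usr/bin/env bash
# Sketch only: check_role_layout is a hypothetical helper. It checks for
# the files named in the documented procedure and nothing else.

# Usage: check_role_layout <path-to-role>
# Returns 0 when both required files exist, 1 otherwise.
check_role_layout() {
  local role_dir="$1" missing=0 f
  for f in molecule/default/molecule.yml meta/main.yml; do
    if [ ! -f "${role_dir}/${f}" ]; then
      echo "missing: ${f}"
      missing=1
    fi
  done
  # verify.yml is optional in the procedure, so only note its absence.
  if [ ! -f "${role_dir}/molecule/default/verify.yml" ]; then
    echo "note: no molecule/default/verify.yml (optional)"
  fi
  return "$missing"
}
```

For example, `check_role_layout deepops/roles/<your-role> && echo "layout ok"` reports any required file that is still missing.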

playbooks/container/singularity.yml

+1-6
@@ -1,10 +1,5 @@
 ---
 - hosts: all
   become: yes
-  pre_tasks:
-    - name: create a folder for go
-      file:
-        path: "{{ golang_install_dir }}"
-        recurse: yes
   roles:
-    - lecorguille.singularity
+    - singularity_wrapper

@@ -0,0 +1 @@
+-Dlog4j2.formatMsgNoLookups=true

playbooks/slurm-cluster/logging.yml

+64-8
@@ -3,21 +3,77 @@
   become: true
   vars:
     elasticsearch_network_host: 0.0.0.0
-    logstash_listen_port_beats: 5000
+  pre_tasks:
+    - name: debian - ensure apt cache updated
+      apt:
+        update_cache: true
+      when: ansible_os_family == "Debian"
   roles:
-    - geerlingguy.java
-    - geerlingguy.elasticsearch
-    - geerlingguy.logstash
-    - geerlingguy.kibana
+    - robertdebock.java
+    - robertdebock.elastic_repo
+    - robertdebock.elasticsearch
+    - robertdebock.logstash
+    - robertdebock.kibana
+
+- hosts: slurm-master[0]
+  become: true
+  vars:
+    filebeat_port: "5000"
+  tasks:
+    - name: configure logstash to accept logs from filebeat
+      template:
+        src: "filebeat.conf"
+        dest: "/etc/logstash/conf.d/filebeat.conf"
+        owner: "root"
+        group: "root"
+        mode: "0644"
+
+# Mitigation for CVE-2021-44228 impacting Log4j2
+# https://discuss.elastic.co/t/apache-log4j2-remote-code-execution-rce-vulnerability-cve-2021-44228-esa-2021-31/291476
+- hosts: slurm-master[0]
+  become: yes
   tasks:
-    - name: fix bug in logstash role
-      command: /usr/share/logstash/bin/logstash-plugin install logstash-filter-multiline
+    - name: configure elasticsearch to mitigate CVE-2021-44228
+      copy:
+        src: "cve_2021_44228.options"
+        dest: "/etc/elasticsearch/jvm.options.d/cve_2021_44228.options"
+        owner: "root"
+        group: "root"
+        mode: "0644"
+      notify:
+        - restart-elasticsearch
+    - name: check for relevant class in logstash
+      shell: unzip -l /usr/share/logstash/logstash-core/lib/jars/log4j-core-2.* | grep JndiLookup.class
+      register: logstash_jndi
+      changed_when: logstash_jndi.rc == 0
+      failed_when: logstash_jndi.rc == 2
+    - name: configure logstash to mitigate CVE-2021-44228
+      shell: zip -q -d /usr/share/logstash/logstash-core/lib/jars/log4j-core-2.* org/apache/logging/log4j/core/lookup/JndiLookup.class
+      notify:
+        - restart-logstash
+      when: logstash_jndi.changed
+    - name: manually stop logstash as restart is not consistently working later
+      service:
+        name: logstash
+        state: stopped
+      notify:
+        - restart-logstash
+      when: logstash_jndi.changed
+  handlers:
+    - name: restart-elasticsearch
+      service:
+        name: elasticsearch
+        state: restarted
+    - name: restart-logstash
+      service:
+        name: logstash
+        state: restarted
 
 - hosts: slurm-cluster
   become: true
   vars:
     filebeat_create_config: true
-    filebeat_prospectors:
+    filebeat_inputs:
       - input_type: log
         paths:
           - "/var/log/*.log"
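
The Logstash mitigation in `logging.yml` hinges on the exit status of `grep`: 0 when `JndiLookup.class` appears in the archive listing, 1 when it does not, and 2 on an error. Because `unzip -l` is piped into `grep`, the registered `rc` is normally the status of `grep`, the last command in the pipeline. The sketch below restates that mapping in plain shell. `jndi_state` is a hypothetical helper that takes a listing as text, so it can be tried without Logstash installed.

```shell
#!/usr/bin/env bash
# Sketch only: jndi_state is a hypothetical helper that mirrors the
# changed_when / failed_when conditions of the "check for relevant class
# in logstash" task. It inspects a listing passed as text, not a real jar.

# Usage: jndi_state "<output of unzip -l>"
# Prints one of: changed, ok, failed
jndi_state() {
  local listing="$1" rc=0
  grep -q 'JndiLookup.class' <<<"$listing" || rc=$?
  case "$rc" in
    0) echo "changed" ;;  # class present: the zip -d removal task would run
    1) echo "ok" ;;       # class absent: the removal task is skipped
    *) echo "failed" ;;   # grep itself reported an error (rc == 2)
  esac
}
```

Under this mapping, an empty listing yields `ok` rather than `failed`, which suggests that a missing jar would pass the check silently.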

@@ -0,0 +1,12 @@
+input {
+  beats {
+    port => {{ filebeat_port }}
+  }
+}
+
+output {
+  elasticsearch {
+    hosts => ["http://localhost:9200"]
+    index => "%{[@metadata][beat]}-%{[@metadata][version]}"
+  }
+}

roles/dns-config/tasks/main.yml

+2-2
@@ -16,12 +16,12 @@
     - systemd-resolved
   when: ansible_distribution == 'Ubuntu' and ansible_distribution_major_version == '16'
 
-- name: disable services (bionic)
+- name: disable services (bionic, focal)
   service:
     name: systemd-resolved
     state: stopped
     enabled: no
-  when: ansible_distribution == 'Ubuntu' and ansible_distribution_major_version == '18'
+  when: ansible_distribution == 'Ubuntu' and (ansible_distribution_major_version in ['18', '20'])
 
 - name: install /etc/resolv.conf
   template:

roles/nvidia-cuda/defaults/main.yml

+1-1
@@ -1,6 +1,6 @@
 ---
 # 'cuda' is the generic package and will pull the latest version
-cuda_version: "cuda-toolkit-11-3"
+cuda_version: "cuda-toolkit-11-5"
 
 # DGX-specific vars may be used to target specific models,
 # because available versions for DGX may differ from the generic repo

roles/nvidia-hpc-sdk/defaults/main.yml

+4-4
@@ -15,15 +15,15 @@
 # See https://developer.nvidia.com/nvidia-hpc-sdk-downloads for more detail on available downloads.
 
 # Version strings used to construct download URL
-hpcsdk_major_version: "21"
-hpcsdk_minor_version: "9"
-hpcsdk_file_cuda: "11.4"
+hpcsdk_major_version: "22"
+hpcsdk_minor_version: "1"
+hpcsdk_file_cuda: "11.5"
 hpcsdk_arch: "x86_64"
 
 # We need to specify the default CUDA toolkit to use during installation.
 # This should usually be the latest CUDA included in the HPC SDK you are
 # installing.
-hpcsdk_default_cuda: "11.4"
+hpcsdk_default_cuda: "11.5"
 
 # Add HPC SDK modules to the MODULEPATH?
 hpcsdk_install_as_modules: false
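
The defaults above are described as "version strings used to construct download URL", but the template that combines them is outside this diff. The sketch below shows one way the four values could be joined. The URL pattern is an assumption based on the publicly documented HPC SDK tarball naming, not something taken from the role, and `hpcsdk_url` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Sketch only. ASSUMPTION: the URL pattern below is inferred from public
# HPC SDK download naming and is not shown in this commit; the role's
# actual template may differ. hpcsdk_url is a hypothetical helper.

# Usage: hpcsdk_url <major> <minor> <file_cuda> <arch>
hpcsdk_url() {
  local major="$1" minor="$2" cuda="$3" arch="$4"
  echo "https://developer.download.nvidia.com/hpc-sdk/${major}.${minor}/nvhpc_20${major}_${major}${minor}_Linux_${arch}_cuda_${cuda}.tar.gz"
}
```

With the new defaults, the call would be `hpcsdk_url 22 1 11.5 x86_64`. Check the result against the downloads page linked in the role's comments before relying on it.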

roles/requirements.yml

+15-12
@@ -36,19 +36,22 @@ roles:
     version: "v0.5.0"
 
   - src: geerlingguy.filebeat
-    version: "2.0.1"
+    version: "3.3.0"
 
-  - src: geerlingguy.logstash
-    version: "4.0.0"
+  - src: robertdebock.java
+    version: "4.1.1"
 
-  - src: geerlingguy.elasticsearch
-    version: "3.0.1"
+  - src: robertdebock.elastic_repo
+    version: "1.0.3"
 
-  - src: geerlingguy.java
-    version: "1.9.5"
+  - src: robertdebock.logstash
+    version: "1.1.1"
 
-  - src: geerlingguy.kibana
-    version: "3.2.1"
+  - src: robertdebock.elasticsearch
+    version: "1.1.3"
+
+  - src: robertdebock.kibana
+    version: "1.2.4"
 
   - src: https://github.com/DeepOps/ansible-maas.git
     name: ansible-maas
@@ -61,8 +64,8 @@ roles:
   - src: https://github.com/OSC/ood-ansible.git
     version: 'v2.0.3'
 
+  - src: abims_sbr.singularity
+    version: 3.7.1-1
+
   - src: gantsign.golang
     version: 2.4.0
-
-  - src: lecorguille.singularity
-    version: 1.2.0
