diff --git a/docs/hpc/imgs/vista/MPS-graphs.png b/docs/hpc/imgs/vista/MPS-graphs.png new file mode 100644 index 0000000..6d705d0 Binary files /dev/null and b/docs/hpc/imgs/vista/MPS-graphs.png differ diff --git a/docs/hpc/imgs/vista-topology.png b/docs/hpc/imgs/vista/vista-topology.png similarity index 100% rename from docs/hpc/imgs/vista-topology.png rename to docs/hpc/imgs/vista/vista-topology.png diff --git a/docs/hpc/vista.md b/docs/hpc/vista.md index 542f43c..86d322e 100644 --- a/docs/hpc/vista.md +++ b/docs/hpc/vista.md @@ -1,8 +1,9 @@ # Vista User Guide -*Last update: February 7, 2025* +*Last update: February 18, 2025* ## Notices { #notices } +* **New**: See TACC Staff's [notes on incorporating NVIDIA's Multi-Process Service](#mps). (MPS) * **Important**: Please note [TACC's new SU charge policy](#sunotice). (09/20/2024) * **[Subscribe][TACCSUBSCRIBE] to Vista User News**. Stay up-to-date on Vista's status, scheduled maintenances and other notifications. (09/01/2024) @@ -36,7 +37,7 @@ Vista is funded by the National Science Foundation (NSF) via a supplement to the Vista's compute system is divided into Grace-Grace and Grace-Hopper subsystems networked in two-level fat-tree topology as illustrated in Figure 1. below. -
[figure: imgs/vista-topology.png] Figure 1. Vista Topology
+[figure: imgs/vista/vista-topology.png] Figure 1. Vista Topology
The Grace-Grace (GG) subsystem, a purely CPU-based system, is housed in four racks, each containing 64 Grace-Grace (GG) nodes. Each GG node contains 144 processing cores. A GG node provides over 7 TFlops of double precision performance and up to 1 TiB/s of memory bandwidth. GG nodes connect via an InfiniBand 200 Gb/s fabric to a top rack shelf NVIDIA Quantum-2 MQM9790 NDR switch. In total, the subsystem contains sixty-four 200 Gb/s uplinks to the NDR rack shelf switch. @@ -354,6 +355,93 @@ For more information on this and other matters related to Slurm job submission, +## NVIDIA MPS { #mps } + +NVIDIA's [Multi-Process Service](https://docs.nvidia.com/deploy/mps/) (MPS) allows multiple processes to share a GPU efficiently by reducing scheduling overhead. MPS can improve GPU resource sharing between processes when a single process cannot fully saturate the GPU's compute capacity. + +Follow these steps to configure MPS on Vista for optimized multi-process workflows: + +1. **Configure Environment Variables** + + Set environment variables to define where MPS stores its runtime pipes and logs. In the example below, these are placed in each node's `/tmp` directory. The `/tmp` directory is ephemeral and cleared after a job ends or a node reboots. Add these lines to your job script or shell session: + + ```job-script + # Set MPS environment variables + export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps + export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log + ``` + + To retain these logs for later analysis, specify directories in `$SCRATCH`, `$WORK`, or `$HOME` file systems instead of `/tmp`. + +2. **Launch MPS Control Daemon** + + Use `ibrun` to start the MPS daemon across all allocated nodes. This ensures one MPS control process per node, targeting GPU 0: + + ```job-script + # Launch MPS daemon on all nodes + export TACC_TASKS_PER_NODE=1 # Force one task per node + ibrun -np $SLURM_NNODES nvidia-cuda-mps-control -d + unset TACC_TASKS_PER_NODE # Reset to default task distribution + ``` + +3. **Submit Your GPU Job** + + After enabling MPS, run your CUDA application as usual. For example: + + ```job-script + ibrun ./your_cuda_executable + ``` + +### Sample Job Script { #scripts } + +Incorporating the above elements into a job script may look like this: + +```job-script +#!/bin/bash +#SBATCH -J mps_gpu_job # Job name +#SBATCH -o mps_job.%j.out # Output file (%j = job ID) +#SBATCH -t 01:00:00 # Wall time (1 hour) +#SBATCH -N 2 # Number of nodes +#SBATCH -n 8 # Total tasks (4 per node) +#SBATCH -p gh # GPU partition (modify as needed) +#SBATCH -A your_project # Project allocation + +# 1. Configure environment +export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps +export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log + +# 2. Launch MPS daemon on all nodes +echo "Starting MPS daemon..." +export TACC_TASKS_PER_NODE=1 # Force 1 task/node +ibrun -np $SLURM_NNODES nvidia-cuda-mps-control -d +unset TACC_TASKS_PER_NODE +sleep 5 # Wait for daemons to initialize + +# 3. Run your CUDA application +echo "Launching application..." 
+ibrun ./your_cuda_executable # Replace with your executable +``` + +### Notes on Performance + +MPS is particularly effective for workloads characterized by: + +* Fine-grained GPU operations (many small kernel launches) +* Concurrent processes sharing the same GPU +* Underutilized GPU resources in single-process workflows + +You may verify performance gains for your use case using the following command to monitor the node that your job is running on (e.g., `c608-052`): + +```cmd-line +login1$ nvidia-smi dmon --gpm-metrics=3,12 -s u +``` + +The side-by-side plots in Figure 1 illustrate the performance enhancement obtained by running two GPU processes simultaneously on a single Hopper node with MPS. The GPU performance improvement with MPS is ~12%, compared to no improvement without it. Also, the setup cost on the CPU (about 12 seconds) is completely overlapped, resulting in a 1.2x total improvement for two simultaneous Amber executions. Even better performance is expected for applications that don't load the GPU as heavily as Amber. + + +
[figure: imgs/vista/MPS-graphs.png] Figure 1. Usage (SM, Memory and FP32) and SM occupancy percentages for single and dual Amber GPU executions (single-precision) on Hopper H200.
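+
+Optionally, you can shut the MPS control daemons down before your job exits. NVIDIA's documented way to stop the daemon is to send `quit` to `nvidia-cuda-mps-control`; the sketch below assumes you wrap that pipe in `sh -c` so it runs once per node, mirroring the daemon-launch step above:
+
+```job-script
+# 4. (Optional) Shut down the MPS daemon on each node before the job ends
+export TACC_TASKS_PER_NODE=1      # One shutdown command per node
+ibrun -np $SLURM_NNODES sh -c "echo quit | nvidia-cuda-mps-control"
+unset TACC_TASKS_PER_NODE
+```
+
+Skipping this step is usually harmless on Vista because `/tmp` is cleared when the job ends, but an explicit shutdown is good practice if you redirected the MPS pipe and log directories to `$SCRATCH`, `$WORK`, or `$HOME`.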
+ + ## Machine Learning { #ml } Vista is well equipped to provide researchers with the latest in Machine Learning frameworks, for example, PyTorch. The installation process will be a little different depending on whether you are using single or multiple nodes. Below we detail how to use PyTorch on our systems for both scenarios. diff --git a/docs/hpc/vista/makefile b/docs/hpc/vista/makefile index 7297b22..6b2d08a 100644 --- a/docs/hpc/vista/makefile +++ b/docs/hpc/vista/makefile @@ -4,6 +4,7 @@ VISTA_OBJS = \ system.md \ running.md \ launching.md \ + mps.md \ ml.md \ building.md \ nvidia.md \ diff --git a/docs/hpc/vista/mps.md b/docs/hpc/vista/mps.md new file mode 100644 index 0000000..3eb336f --- /dev/null +++ b/docs/hpc/vista/mps.md @@ -0,0 +1,87 @@ +## NVIDIA MPS { #mps } + +NVIDIA's [Multi-Process Service](https://docs.nvidia.com/deploy/mps/) (MPS) allows multiple processes to share a GPU efficiently by reducing scheduling overhead. MPS can improve GPU resource sharing between processes when a single process cannot fully saturate the GPU's compute capacity. + +Follow these steps to configure MPS on Vista for optimized multi-process workflows: + +1. **Configure Environment Variables** + + Set environment variables to define where MPS stores its runtime pipes and logs. In the example below, these are placed in each node's `/tmp` directory. The `/tmp` directory is ephemeral and cleared after a job ends or a node reboots. Add these lines to your job script or shell session: + + ```job-script + # Set MPS environment variables + export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps + export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log + ``` + + To retain these logs for later analysis, specify directories in `$SCRATCH`, `$WORK`, or `$HOME` file systems instead of `/tmp`. + +2. **Launch MPS Control Daemon** + + Use `ibrun` to start the MPS daemon across all allocated nodes. This ensures one MPS control process per node, targeting GPU 0: + + ```job-script + # Launch MPS daemon on all nodes + export TACC_TASKS_PER_NODE=1 # Force one task per node + ibrun -np $SLURM_NNODES nvidia-cuda-mps-control -d + unset TACC_TASKS_PER_NODE # Reset to default task distribution + ``` + +3. **Submit Your GPU Job** + + After enabling MPS, run your CUDA application as usual. For example: + + ```job-script + ibrun ./your_cuda_executable + ``` + +### Sample Job Script { #scripts } + +Incorporating the above elements into a job script may look like this: + +```job-script +#!/bin/bash +#SBATCH -J mps_gpu_job # Job name +#SBATCH -o mps_job.%j.out # Output file (%j = job ID) +#SBATCH -t 01:00:00 # Wall time (1 hour) +#SBATCH -N 2 # Number of nodes +#SBATCH -n 8 # Total tasks (4 per node) +#SBATCH -p gh # GPU partition (modify as needed) +#SBATCH -A your_project # Project allocation + +# 1. Configure environment +export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps +export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log + +# 2. Launch MPS daemon on all nodes +echo "Starting MPS daemon..." +export TACC_TASKS_PER_NODE=1 # Force 1 task/node +ibrun -np $SLURM_NNODES nvidia-cuda-mps-control -d +unset TACC_TASKS_PER_NODE +sleep 5 # Wait for daemons to initialize + +# 3. Run your CUDA application +echo "Launching application..." 
+ibrun ./your_cuda_executable # Replace with your executable +``` + +### Notes on Performance + +MPS is particularly effective for workloads characterized by: + +* Fine-grained GPU operations (many small kernel launches) +* Concurrent processes sharing the same GPU +* Underutilized GPU resources in single-process workflows + +You may verify performance gains for your use case using the following command to monitor the node that your job is running on (e.g., `c608-052`): + +```cmd-line +login1$ nvidia-smi dmon --gpm-metrics=3,12 -s u +``` + +The side-by-side plots in Figure 1 illustrate the performance enhancement obtained by running two GPU processes simultaneously on a single Hopper node with MPS. The GPU performance improvement with MPS is ~12%, compared to no improvement without it. Also, the setup cost on the CPU (about 12 seconds) is completely overlapped, resulting in a 1.2x total improvement for two simultaneous Amber executions. Even better performance is expected for applications that don't load the GPU as heavily as Amber. + + +
[figure: imgs/vista/MPS-graphs.png] Figure 1. Usage (SM, Memory and FP32) and SM occupancy percentages for single and dual Amber GPU executions (single-precision) on Hopper H200.
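+
+Optionally, you can shut the MPS control daemons down before your job exits. NVIDIA's documented way to stop the daemon is to send `quit` to `nvidia-cuda-mps-control`; the sketch below assumes you wrap that pipe in `sh -c` so it runs once per node, mirroring the daemon-launch step above:
+
+```job-script
+# 4. (Optional) Shut down the MPS daemon on each node before the job ends
+export TACC_TASKS_PER_NODE=1      # One shutdown command per node
+ibrun -np $SLURM_NNODES sh -c "echo quit | nvidia-cuda-mps-control"
+unset TACC_TASKS_PER_NODE
+```
+
+Skipping this step is usually harmless on Vista because `/tmp` is cleared when the job ends, but an explicit shutdown is good practice if you redirected the MPS pipe and log directories to `$SCRATCH`, `$WORK`, or `$HOME`.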
+ + diff --git a/docs/hpc/vista/notices.md b/docs/hpc/vista/notices.md index 3fc1dfb..0725c14 100644 --- a/docs/hpc/vista/notices.md +++ b/docs/hpc/vista/notices.md @@ -1,8 +1,9 @@ # Vista User Guide -*Last update: February 7, 2025* +*Last update: February 18, 2025* ## Notices { #notices } +* **New**: See TACC Staff's [notes on incorporating NVIDIA's Multi-Process Service](#mps). (MPS) * **Important**: Please note [TACC's new SU charge policy](#sunotice). (09/20/2024) * **[Subscribe][TACCSUBSCRIBE] to Vista User News**. Stay up-to-date on Vista's status, scheduled maintenances and other notifications. (09/01/2024) diff --git a/docs/hpc/vista/system.md b/docs/hpc/vista/system.md index f51c281..0060633 100644 --- a/docs/hpc/vista/system.md +++ b/docs/hpc/vista/system.md @@ -4,7 +4,7 @@ Vista's compute system is divided into Grace-Grace and Grace-Hopper subsystems networked in two-level fat-tree topology as illustrated in Figure 1. below. -
[figure: imgs/vista-topology.png] Figure 1. Vista Topology
+[figure: imgs/vista/vista-topology.png] Figure 1. Vista Topology
The Grace-Grace (GG) subsystem, a purely CPU-based system, is housed in four racks, each containing 64 Grace-Grace (GG) nodes. Each GG node contains 144 processing cores. A GG node provides over 7 TFlops of double precision performance and up to 1 TiB/s of memory bandwidth. GG nodes connect via an InfiniBand 200 Gb/s fabric to a top rack shelf NVIDIA Quantum-2 MQM9790 NDR switch. In total, the subsystem contains sixty-four 200 Gb/s uplinks to the NDR rack shelf switch.