Vista - new section NVIDIA MPS per Ian Wang
susanunit committed Feb 18, 2025
1 parent 92b8fe1 commit 7e7b1b8
Showing 7 changed files with 181 additions and 4 deletions.
Binary file added docs/hpc/imgs/vista/MPS-graphs.png
File renamed without changes
92 changes: 90 additions & 2 deletions docs/hpc/vista.md
@@ -1,8 +1,9 @@
# Vista User Guide
*Last update: February 7, 2025*
*Last update: February 18, 2025*

## Notices { #notices }

* **New**: See TACC Staff's [notes on incorporating NVIDIA's Multi-Process Service (MPS)](#mps). (02/18/2025)
* **Important**: Please note [TACC's new SU charge policy](#sunotice). (09/20/2024)
* **[Subscribe][TACCSUBSCRIBE] to Vista User News**. Stay up-to-date on Vista's status, scheduled maintenances and other notifications. (09/01/2024)

@@ -36,7 +37,7 @@ Vista is funded by the National Science Foundation (NSF) via a supplement to the

Vista's compute system is divided into Grace-Grace and Grace-Hopper subsystems networked in two-level fat-tree topology as illustrated in Figure 1. below.

<figure><img src="../imgs/vista-topology.png"> <figcaption>Figure 1. Vista Topology</figcaption></figure>
<figure><img src="../imgs/vista/vista-topology.png"> <figcaption>Figure 1. Vista Topology</figcaption></figure>

The Grace-Grace (GG) subsystem, a purely CPU-based system, is housed in four racks, each containing 64 Grace-Grace (GG) nodes. Each GG node contains 144 processing cores. A GG node provides over 7 TFlops of double precision performance and up to 1 TiB/s of memory bandwidth. GG nodes connect via an InfiniBand 200 Gb/s fabric to a top rack shelf NVIDIA Quantum-2 MQM9790 NDR switch. In total, the subsystem contains sixty-four 200 Gb/s uplinks to the NDR rack shelf switch.

@@ -354,6 +355,93 @@ For more information on this and other matters related to Slurm job submission,



## NVIDIA MPS { #mps }

NVIDIA's [Multi-Process Service](https://docs.nvidia.com/deploy/mps/) (MPS) allows multiple processes to share a GPU efficiently by reducing scheduling overhead. MPS can improve GPU resource sharing between processes when a single process cannot fully saturate the GPU's compute capacity.

Follow these steps to configure MPS on Vista for optimized multi-process workflows:

1. **Configure Environment Variables**

Set environment variables to define where MPS stores its runtime pipes and logs. In the example below, these are placed in each node's `/tmp` directory. The `/tmp` directory is ephemeral and cleared after a job ends or a node reboots. Add these lines to your job script or shell session:

```job-script
# Set MPS environment variables
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log
```

To retain these logs for later analysis, specify a directory on the `$SCRATCH`, `$WORK`, or `$HOME` file systems instead of `/tmp`, as sketched below.
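
For example, a minimal sketch that keeps the pipes in `/tmp` but writes the logs to `$SCRATCH` (the subdirectory layout and use of the job ID are illustrative choices, not requirements):

```job-script
# Pipes stay in node-local /tmp; logs go to the shared $SCRATCH file system
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=$SCRATCH/mps-logs/$SLURM_JOB_ID
mkdir -p $CUDA_MPS_LOG_DIRECTORY
# Note: in multi-node jobs all daemons share this directory;
# per-node subdirectories may be preferable if you need separate logs.
```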

2. **Launch MPS Control Daemon**

Use `ibrun` to start the MPS daemon across all allocated nodes. This ensures one MPS control process per node, targeting GPU 0:

```job-script
# Launch MPS daemon on all nodes
export TACC_TASKS_PER_NODE=1 # Force one task per node
ibrun -np $SLURM_NNODES nvidia-cuda-mps-control -d
unset TACC_TASKS_PER_NODE # Reset to default task distribution
```

3. **Submit Your GPU Job**

After enabling MPS, run your CUDA application as usual. For example:

```job-script
ibrun ./your_cuda_executable
```
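
The control daemon is normally left running for the remainder of the job and is cleaned up along with it. If you prefer to shut it down explicitly after your application finishes, the daemon accepts a `quit` command on its standard input; a minimal sketch (wrapping the pipe in `bash -c` so it executes on every node under `ibrun` is our assumption):

```job-script
# Optionally stop the MPS control daemon on every node after the run
export TACC_TASKS_PER_NODE=1     # one shutdown command per node
ibrun -np $SLURM_NNODES bash -c "echo quit | nvidia-cuda-mps-control"
unset TACC_TASKS_PER_NODE
```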

### Sample Job Script { #scripts }

A job script incorporating the above elements might look like this:

```job-script
#!/bin/bash
#SBATCH -J mps_gpu_job # Job name
#SBATCH -o mps_job.%j.out # Output file (%j = job ID)
#SBATCH -t 01:00:00 # Wall time (1 hour)
#SBATCH -N 2 # Number of nodes
#SBATCH -n 8 # Total tasks (4 per node)
#SBATCH -p gh # GPU partition (modify as needed)
#SBATCH -A your_project # Project allocation
# 1. Configure environment
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log
# 2. Launch MPS daemon on all nodes
echo "Starting MPS daemon..."
export TACC_TASKS_PER_NODE=1 # Force 1 task/node
ibrun -np $SLURM_NNODES nvidia-cuda-mps-control -d
unset TACC_TASKS_PER_NODE
sleep 5 # Wait for daemons to initialize
# 3. Run your CUDA application
echo "Launching application..."
ibrun ./your_cuda_executable # Replace with your executable
```
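
Submit the script with `sbatch` as usual; the script filename below is only a placeholder:

```cmd-line
login1$ sbatch mps_job.slurm
```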

### Notes on Performance

MPS is particularly effective for workloads characterized by:

* Fine-grained GPU operations (many small kernel launches)
* Concurrent processes sharing the same GPU
* Underutilized GPU resources in single-process workflows

To verify performance gains for your use case, log on to the node your job is running on (e.g., `c608-052`) and run the following command to monitor GPU utilization:

```cmd-line
c608-052$ nvidia-smi dmon --gpm-metrics=3,12 -s u
```
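
You can also confirm that your processes are attached to an MPS server. The control daemon accepts queries on its standard input; a minimal sketch, run on the compute node with the same pipe directory the job used:

```cmd-line
c608-052$ export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
c608-052$ echo get_server_list | nvidia-cuda-mps-control
```

`get_server_list` prints the PID of each active MPS server; `get_client_list <PID>` then lists the client processes attached to that server.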

The side-by-side plots in the figure below illustrate the performance enhancement obtained by running two GPU processes simultaneously on a single Hopper node with MPS. The GPU performance improvement is ~12%, compared to no improvement without MPS. In addition, the setup cost on the CPU (about 12 seconds) is completely overlapped, resulting in a 1.2x total improvement for two simultaneous Amber executions. Even better performance is expected for applications that don't load the GPU as heavily as Amber.


<figure><img src="../imgs/vista/MPS-graphs.png" width="800"><figcaption>Figure 1. Usage (SM, Memory and FP32) and SM occupancy percentages for single and dual Amber GPU executions (single-precision) on Hopper H200.</figcaption></figure>


## Machine Learning { #ml }

Vista is well equipped to provide researchers with the latest in Machine Learning frameworks, for example, PyTorch. The installation process will be a little different depending on whether you are using single or multiple nodes. Below we detail how to use PyTorch on our systems for both scenarios.
1 change: 1 addition & 0 deletions docs/hpc/vista/makefile
@@ -4,6 +4,7 @@ VISTA_OBJS = \
system.md \
running.md \
launching.md \
mps.md \
ml.md \
building.md \
nvidia.md \
87 changes: 87 additions & 0 deletions docs/hpc/vista/mps.md
@@ -0,0 +1,87 @@
## NVIDIA MPS { #mps }

NVIDIA's [Multi-Process Service](https://docs.nvidia.com/deploy/mps/) (MPS) allows multiple processes to share a GPU efficiently by reducing scheduling overhead. MPS can improve GPU resource sharing between processes when a single process cannot fully saturate the GPU's compute capacity.

Follow these steps to configure MPS on Vista for optimized multi-process workflows:

1. **Configure Environment Variables**

Set environment variables to define where MPS stores its runtime pipes and logs. In the example below, these are placed in each node's `/tmp` directory. The `/tmp` directory is ephemeral and cleared after a job ends or a node reboots. Add these lines to your job script or shell session:

```job-script
# Set MPS environment variables
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log
```

To retain these logs for later analysis, specify a directory on the `$SCRATCH`, `$WORK`, or `$HOME` file systems instead of `/tmp`, as sketched below.
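
For example, a minimal sketch that keeps the pipes in `/tmp` but writes the logs to `$SCRATCH` (the subdirectory layout and use of the job ID are illustrative choices, not requirements):

```job-script
# Pipes stay in node-local /tmp; logs go to the shared $SCRATCH file system
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=$SCRATCH/mps-logs/$SLURM_JOB_ID
mkdir -p $CUDA_MPS_LOG_DIRECTORY
# Note: in multi-node jobs all daemons share this directory;
# per-node subdirectories may be preferable if you need separate logs.
```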

2. **Launch MPS Control Daemon**

Use `ibrun` to start the MPS daemon across all allocated nodes. This ensures one MPS control process per node, targeting GPU 0:

```job-script
# Launch MPS daemon on all nodes
export TACC_TASKS_PER_NODE=1 # Force one task per node
ibrun -np $SLURM_NNODES nvidia-cuda-mps-control -d
unset TACC_TASKS_PER_NODE # Reset to default task distribution
```

3. **Submit Your GPU Job**

After enabling MPS, run your CUDA application as usual. For example:

```job-script
ibrun ./your_cuda_executable
```
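
The control daemon is normally left running for the remainder of the job and is cleaned up along with it. If you prefer to shut it down explicitly after your application finishes, the daemon accepts a `quit` command on its standard input; a minimal sketch (wrapping the pipe in `bash -c` so it executes on every node under `ibrun` is our assumption):

```job-script
# Optionally stop the MPS control daemon on every node after the run
export TACC_TASKS_PER_NODE=1     # one shutdown command per node
ibrun -np $SLURM_NNODES bash -c "echo quit | nvidia-cuda-mps-control"
unset TACC_TASKS_PER_NODE
```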

### Sample Job Script { #scripts }

A job script incorporating the above elements might look like this:

```job-script
#!/bin/bash
#SBATCH -J mps_gpu_job # Job name
#SBATCH -o mps_job.%j.out # Output file (%j = job ID)
#SBATCH -t 01:00:00 # Wall time (1 hour)
#SBATCH -N 2 # Number of nodes
#SBATCH -n 8 # Total tasks (4 per node)
#SBATCH -p gh # GPU partition (modify as needed)
#SBATCH -A your_project # Project allocation
# 1. Configure environment
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log
# 2. Launch MPS daemon on all nodes
echo "Starting MPS daemon..."
export TACC_TASKS_PER_NODE=1 # Force 1 task/node
ibrun -np $SLURM_NNODES nvidia-cuda-mps-control -d
unset TACC_TASKS_PER_NODE
sleep 5 # Wait for daemons to initialize
# 3. Run your CUDA application
echo "Launching application..."
ibrun ./your_cuda_executable # Replace with your executable
```
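
Submit the script with `sbatch` as usual; the script filename below is only a placeholder:

```cmd-line
login1$ sbatch mps_job.slurm
```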

### Notes on Performance

MPS is particularly effective for workloads characterized by:

* Fine-grained GPU operations (many small kernel launches)
* Concurrent processes sharing the same GPU
* Underutilized GPU resources in single-process workflows

To verify performance gains for your use case, log on to the node your job is running on (e.g., `c608-052`) and run the following command to monitor GPU utilization:

```cmd-line
c608-052$ nvidia-smi dmon --gpm-metrics=3,12 -s u
```
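
You can also confirm that your processes are attached to an MPS server. The control daemon accepts queries on its standard input; a minimal sketch, run on the compute node with the same pipe directory the job used:

```cmd-line
c608-052$ export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
c608-052$ echo get_server_list | nvidia-cuda-mps-control
```

`get_server_list` prints the PID of each active MPS server; `get_client_list <PID>` then lists the client processes attached to that server.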

The side-by-side plots in the figure below illustrate the performance enhancement obtained by running two GPU processes simultaneously on a single Hopper node with MPS. The GPU performance improvement is ~12%, compared to no improvement without MPS. In addition, the setup cost on the CPU (about 12 seconds) is completely overlapped, resulting in a 1.2x total improvement for two simultaneous Amber executions. Even better performance is expected for applications that don't load the GPU as heavily as Amber.


<figure><img src="../imgs/vista/MPS-graphs.png" width="800"><figcaption>Figure 1. Usage (SM, Memory and FP32) and SM occupancy percentages for single and dual Amber GPU executions (single-precision) on Hopper H200.</figcaption></figure>


3 changes: 2 additions & 1 deletion docs/hpc/vista/notices.md
@@ -1,8 +1,9 @@
# Vista User Guide
*Last update: February 7, 2025*
*Last update: February 18, 2025*

## Notices { #notices }

* **New**: See TACC Staff's [notes on incorporating NVIDIA's Multi-Process Service (MPS)](#mps). (02/18/2025)
* **Important**: Please note [TACC's new SU charge policy](#sunotice). (09/20/2024)
* **[Subscribe][TACCSUBSCRIBE] to Vista User News**. Stay up-to-date on Vista's status, scheduled maintenances and other notifications. (09/01/2024)

2 changes: 1 addition & 1 deletion docs/hpc/vista/system.md
Expand Up @@ -4,7 +4,7 @@

Vista's compute system is divided into Grace-Grace and Grace-Hopper subsystems networked in two-level fat-tree topology as illustrated in Figure 1. below.

<figure><img src="../imgs/vista-topology.png"> <figcaption>Figure 1. Vista Topology</figcaption></figure>
<figure><img src="../imgs/vista/vista-topology.png"> <figcaption>Figure 1. Vista Topology</figcaption></figure>

The Grace-Grace (GG) subsystem, a purely CPU-based system, is housed in four racks, each containing 64 Grace-Grace (GG) nodes. Each GG node contains 144 processing cores. A GG node provides over 7 TFlops of double precision performance and up to 1 TiB/s of memory bandwidth. GG nodes connect via an InfiniBand 200 Gb/s fabric to a top rack shelf NVIDIA Quantum-2 MQM9790 NDR switch. In total, the subsystem contains sixty-four 200 Gb/s uplinks to the NDR rack shelf switch.

