Vista - new section NVIDIA MPS per Ian Wang
susanunit committed Feb 18, 2025
1 parent 92b8fe1 commit 7e7b1b8
Showing 7 changed files with 181 additions and 4 deletions.
Binary file added docs/hpc/imgs/vista/MPS-graphs.png
File renamed without changes
92 changes: 90 additions & 2 deletions docs/hpc/vista.md
@@ -1,8 +1,9 @@
# Vista User Guide
*Last update: February 7, 2025*
*Last update: February 18, 2025*

## Notices { #notices }

* **New**: See TACC Staff's [notes on incorporating NVIDIA's Multi-Process Service (MPS)](#mps). (02/18/2025)
* **Important**: Please note [TACC's new SU charge policy](#sunotice). (09/20/2024)
* **[Subscribe][TACCSUBSCRIBE] to Vista User News**. Stay up-to-date on Vista's status, scheduled maintenances and other notifications. (09/01/2024)

@@ -36,7 +37,7 @@ Vista is funded by the National Science Foundation (NSF) via a supplement to the

Vista's compute system is divided into Grace-Grace and Grace-Hopper subsystems networked in two-level fat-tree topology as illustrated in Figure 1. below.

<figure><img src="../imgs/vista-topology.png"> <figcaption>Figure 1. Vista Topology</figcaption></figure>
<figure><img src="../imgs/vista/vista-topology.png"> <figcaption>Figure 1. Vista Topology</figcaption></figure>

The Grace-Grace (GG) subsystem, a purely CPU-based system, is housed in four racks, each containing 64 Grace-Grace (GG) nodes. Each GG node contains 144 processing cores. A GG node provides over 7 TFlops of double precision performance and up to 1 TiB/s of memory bandwidth. GG nodes connect via an InfiniBand 200 Gb/s fabric to a top rack shelf NVIDIA Quantum-2 MQM9790 NDR switch. In total, the subsystem contains sixty-four 200 Gb/s uplinks to the NDR rack shelf switch.

@@ -354,6 +355,93 @@ For more information on this and other matters related to Slurm job submission,



## NVIDIA MPS { #mps }

NVIDIA's [Multi-Process Service](https://docs.nvidia.com/deploy/mps/) (MPS) allows multiple processes to share a GPU efficiently by reducing scheduling overhead. MPS can improve GPU resource sharing between processes when a single process cannot fully saturate the GPU's compute capacity.

Follow these steps to configure MPS on Vista for optimized multi-process workflows:

1. **Configure Environment Variables**

Set environment variables to define where MPS stores its runtime pipes and logs. In the example below, these are placed in each node's `/tmp` directory. The `/tmp` directory is ephemeral and cleared after a job ends or a node reboots. Add these lines to your job script or shell session:

```job-script
# Set MPS environment variables
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log
```

To retain these logs for later analysis, specify a directory on the `$SCRATCH`, `$WORK`, or `$HOME` file systems instead of `/tmp`, as sketched below.
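
For example, a minimal sketch that keeps the pipes in `/tmp` but writes the logs to `$SCRATCH` (the subdirectory layout and use of the job ID are illustrative choices, not requirements):

```job-script
# Pipes stay in node-local /tmp; logs go to the shared $SCRATCH file system
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=$SCRATCH/mps-logs/$SLURM_JOB_ID
mkdir -p $CUDA_MPS_LOG_DIRECTORY
# Note: in multi-node jobs all daemons share this directory;
# per-node subdirectories may be preferable if you need separate logs.
```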

2. **Launch MPS Control Daemon**

Use `ibrun` to start the MPS daemon across all allocated nodes. This ensures one MPS control process per node, targeting GPU 0:

```job-script
# Launch MPS daemon on all nodes
export TACC_TASKS_PER_NODE=1 # Force one task per node
ibrun -np $SLURM_NNODES nvidia-cuda-mps-control -d
unset TACC_TASKS_PER_NODE # Reset to default task distribution
```

3. **Submit Your GPU Job**

After enabling MPS, run your CUDA application as usual. For example:

```job-script
ibrun ./your_cuda_executable
```
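
The control daemon is normally left running for the remainder of the job and is cleaned up along with it. If you prefer to shut it down explicitly after your application finishes, the daemon accepts a `quit` command on its standard input; a minimal sketch (wrapping the pipe in `bash -c` so it executes on every node under `ibrun` is our assumption):

```job-script
# Optionally stop the MPS control daemon on every node after the run
export TACC_TASKS_PER_NODE=1     # one shutdown command per node
ibrun -np $SLURM_NNODES bash -c "echo quit | nvidia-cuda-mps-control"
unset TACC_TASKS_PER_NODE
```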

### Sample Job Script { #scripts }

A job script incorporating the above elements might look like this:

```job-script
#!/bin/bash
#SBATCH -J mps_gpu_job # Job name
#SBATCH -o mps_job.%j.out # Output file (%j = job ID)
#SBATCH -t 01:00:00 # Wall time (1 hour)
#SBATCH -N 2 # Number of nodes
#SBATCH -n 8 # Total tasks (4 per node)
#SBATCH -p gh # GPU partition (modify as needed)
#SBATCH -A your_project # Project allocation
# 1. Configure environment
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log
# 2. Launch MPS daemon on all nodes
echo "Starting MPS daemon..."
export TACC_TASKS_PER_NODE=1 # Force 1 task/node
ibrun -np $SLURM_NNODES nvidia-cuda-mps-control -d
unset TACC_TASKS_PER_NODE
sleep 5 # Wait for daemons to initialize
# 3. Run your CUDA application
echo "Launching application..."
ibrun ./your_cuda_executable # Replace with your executable
```
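
Submit the script with `sbatch` as usual; the script filename below is only a placeholder:

```cmd-line
login1$ sbatch mps_job.slurm
```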

### Notes on Performance

MPS is particularly effective for workloads characterized by:

* Fine-grained GPU operations (many small kernel launches)
* Concurrent processes sharing the same GPU
* Underutilized GPU resources in single-process workflows

To verify performance gains for your use case, log on to the node your job is running on (e.g., `c608-052`) and run the following command to monitor GPU utilization:

```cmd-line
c608-052$ nvidia-smi dmon --gpm-metrics=3,12 -s u
```
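
You can also confirm that your processes are attached to an MPS server. The control daemon accepts queries on its standard input; a minimal sketch, run on the compute node with the same pipe directory the job used:

```cmd-line
c608-052$ export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
c608-052$ echo get_server_list | nvidia-cuda-mps-control
```

`get_server_list` prints the PID of each active MPS server; `get_client_list <PID>` then lists the client processes attached to that server.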

The side-by-side plots in the figure below illustrate the performance enhancement obtained by running two GPU processes simultaneously on a single Hopper node with MPS. The GPU performance improvement is ~12%, compared to no improvement without MPS. In addition, the setup cost on the CPU (about 12 seconds) is completely overlapped, resulting in a 1.2x total improvement for two simultaneous Amber executions. Even better performance is expected for applications that don't load the GPU as heavily as Amber.


<figure><img src="../imgs/vista/MPS-graphs.png" width="800"><figcaption>Figure 1. Usage (SM, Memory and FP32) and SM occupancy percentages for single and dual Amber GPU executions (single-precision) on Hopper H200.</figcaption></figure>


## Machine Learning { #ml }

Vista is well equipped to provide researchers with the latest in Machine Learning frameworks, for example, PyTorch. The installation process will be a little different depending on whether you are using single or multiple nodes. Below we detail how to use PyTorch on our systems for both scenarios.
1 change: 1 addition & 0 deletions docs/hpc/vista/makefile
@@ -4,6 +4,7 @@ VISTA_OBJS = \
system.md \
running.md \
launching.md \
mps.md \
ml.md \
building.md \
nvidia.md \
87 changes: 87 additions & 0 deletions docs/hpc/vista/mps.md
@@ -0,0 +1,87 @@
## NVIDIA MPS { #mps }

NVIDIA's [Multi-Process Service](https://docs.nvidia.com/deploy/mps/) (MPS) allows multiple processes to share a GPU efficiently by reducing scheduling overhead. MPS can improve GPU resource sharing between processes when a single process cannot fully saturate the GPU's compute capacity.

Follow these steps to configure MPS on Vista for optimized multi-process workflows:

1. **Configure Environment Variables**

Set environment variables to define where MPS stores its runtime pipes and logs. In the example below, these are placed in each node's `/tmp` directory. The `/tmp` directory is ephemeral and cleared after a job ends or a node reboots. Add these lines to your job script or shell session:

```job-script
# Set MPS environment variables
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log
```

To retain these logs for later analysis, specify a directory on the `$SCRATCH`, `$WORK`, or `$HOME` file systems instead of `/tmp`, as sketched below.
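
For example, a minimal sketch that keeps the pipes in `/tmp` but writes the logs to `$SCRATCH` (the subdirectory layout and use of the job ID are illustrative choices, not requirements):

```job-script
# Pipes stay in node-local /tmp; logs go to the shared $SCRATCH file system
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=$SCRATCH/mps-logs/$SLURM_JOB_ID
mkdir -p $CUDA_MPS_LOG_DIRECTORY
# Note: in multi-node jobs all daemons share this directory;
# per-node subdirectories may be preferable if you need separate logs.
```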

2. **Launch MPS Control Daemon**

Use `ibrun` to start the MPS daemon across all allocated nodes. This ensures one MPS control process per node, targeting GPU 0:

```job-script
# Launch MPS daemon on all nodes
export TACC_TASKS_PER_NODE=1 # Force one task per node
ibrun -np $SLURM_NNODES nvidia-cuda-mps-control -d
unset TACC_TASKS_PER_NODE # Reset to default task distribution
```

3. **Submit Your GPU Job**

After enabling MPS, run your CUDA application as usual. For example:

```job-script
ibrun ./your_cuda_executable
```
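
The control daemon is normally left running for the remainder of the job and is cleaned up along with it. If you prefer to shut it down explicitly after your application finishes, the daemon accepts a `quit` command on its standard input; a minimal sketch (wrapping the pipe in `bash -c` so it executes on every node under `ibrun` is our assumption):

```job-script
# Optionally stop the MPS control daemon on every node after the run
export TACC_TASKS_PER_NODE=1     # one shutdown command per node
ibrun -np $SLURM_NNODES bash -c "echo quit | nvidia-cuda-mps-control"
unset TACC_TASKS_PER_NODE
```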

### Sample Job Script { #scripts }

A job script incorporating the above elements might look like this:

```job-script
#!/bin/bash
#SBATCH -J mps_gpu_job # Job name
#SBATCH -o mps_job.%j.out # Output file (%j = job ID)
#SBATCH -t 01:00:00 # Wall time (1 hour)
#SBATCH -N 2 # Number of nodes
#SBATCH -n 8 # Total tasks (4 per node)
#SBATCH -p gh # GPU partition (modify as needed)
#SBATCH -A your_project # Project allocation
# 1. Configure environment
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log
# 2. Launch MPS daemon on all nodes
echo "Starting MPS daemon..."
export TACC_TASKS_PER_NODE=1 # Force 1 task/node
ibrun -np $SLURM_NNODES nvidia-cuda-mps-control -d
unset TACC_TASKS_PER_NODE
sleep 5 # Wait for daemons to initialize
# 3. Run your CUDA application
echo "Launching application..."
ibrun ./your_cuda_executable # Replace with your executable
```
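
Submit the script with `sbatch` as usual; the script filename below is only a placeholder:

```cmd-line
login1$ sbatch mps_job.slurm
```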

### Notes on Performance

MPS is particularly effective for workloads characterized by:

* Fine-grained GPU operations (many small kernel launches)
* Concurrent processes sharing the same GPU
* Underutilized GPU resources in single-process workflows

To verify performance gains for your use case, log on to the node your job is running on (e.g., `c608-052`) and run the following command to monitor GPU utilization:

```cmd-line
c608-052$ nvidia-smi dmon --gpm-metrics=3,12 -s u
```
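
You can also confirm that your processes are attached to an MPS server. The control daemon accepts queries on its standard input; a minimal sketch, run on the compute node with the same pipe directory the job used:

```cmd-line
c608-052$ export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
c608-052$ echo get_server_list | nvidia-cuda-mps-control
```

`get_server_list` prints the PID of each active MPS server; `get_client_list <PID>` then lists the client processes attached to that server.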

The side-by-side plots in the figure below illustrate the performance enhancement obtained by running two GPU processes simultaneously on a single Hopper node with MPS. The GPU performance improvement is ~12%, compared to no improvement without MPS. In addition, the setup cost on the CPU (about 12 seconds) is completely overlapped, resulting in a 1.2x total improvement for two simultaneous Amber executions. Even better performance is expected for applications that don't load the GPU as heavily as Amber.


<figure><img src="../imgs/vista/MPS-graphs.png" width="800"><figcaption>Figure 1. Usage (SM, Memory and FP32) and SM occupancy percentages for single and dual Amber GPU executions (single-precision) on Hopper H200.</figcaption></figure>


3 changes: 2 additions & 1 deletion docs/hpc/vista/notices.md
@@ -1,8 +1,9 @@
# Vista User Guide
*Last update: February 7, 2025*
*Last update: February 18, 2025*

## Notices { #notices }

* **New**: See TACC Staff's [notes on incorporating NVIDIA's Multi-Process Service (MPS)](#mps). (02/18/2025)
* **Important**: Please note [TACC's new SU charge policy](#sunotice). (09/20/2024)
* **[Subscribe][TACCSUBSCRIBE] to Vista User News**. Stay up-to-date on Vista's status, scheduled maintenances and other notifications. (09/01/2024)

2 changes: 1 addition & 1 deletion docs/hpc/vista/system.md
Expand Up @@ -4,7 +4,7 @@

Vista's compute system is divided into Grace-Grace and Grace-Hopper subsystems networked in two-level fat-tree topology as illustrated in Figure 1. below.

<figure><img src="../imgs/vista-topology.png"> <figcaption>Figure 1. Vista Topology</figcaption></figure>
<figure><img src="../imgs/vista/vista-topology.png"> <figcaption>Figure 1. Vista Topology</figcaption></figure>

The Grace-Grace (GG) subsystem, a purely CPU-based system, is housed in four racks, each containing 64 Grace-Grace (GG) nodes. Each GG node contains 144 processing cores. A GG node provides over 7 TFlops of double precision performance and up to 1 TiB/s of memory bandwidth. GG nodes connect via an InfiniBand 200 Gb/s fabric to a top rack shelf NVIDIA Quantum-2 MQM9790 NDR switch. In total, the subsystem contains sixty-four 200 Gb/s uplinks to the NDR rack shelf switch.

