diff --git a/docs/hpc/imgs/vista/MPS-graphs.png b/docs/hpc/imgs/vista/MPS-graphs.png
new file mode 100644
index 0000000..6d705d0
Binary files /dev/null and b/docs/hpc/imgs/vista/MPS-graphs.png differ
diff --git a/docs/hpc/imgs/vista-topology.png b/docs/hpc/imgs/vista/vista-topology.png
similarity index 100%
rename from docs/hpc/imgs/vista-topology.png
rename to docs/hpc/imgs/vista/vista-topology.png
diff --git a/docs/hpc/vista.md b/docs/hpc/vista.md
index 542f43c..86d322e 100644
--- a/docs/hpc/vista.md
+++ b/docs/hpc/vista.md
@@ -1,8 +1,9 @@
# Vista User Guide
-*Last update: February 7, 2025*
+*Last update: February 18, 2025*
## Notices { #notices }
+* **New**: See TACC Staff's [notes on incorporating NVIDIA's Multi-Process Service (MPS)](#mps). (02/18/2025)
* **Important**: Please note [TACC's new SU charge policy](#sunotice). (09/20/2024)
* **[Subscribe][TACCSUBSCRIBE] to Vista User News**. Stay up-to-date on Vista's status, scheduled maintenances and other notifications. (09/01/2024)
@@ -36,7 +37,7 @@ Vista is funded by the National Science Foundation (NSF) via a supplement to the
Vista's compute system is divided into Grace-Grace and Grace-Hopper subsystems networked in two-level fat-tree topology as illustrated in Figure 1. below.
-Figure 1. Vista Topology
+Figure 1. Vista Topology
The Grace-Grace (GG) subsystem, a purely CPU-based system, is housed in four racks, each containing 64 Grace-Grace (GG) nodes. Each GG node contains 144 processing cores. A GG node provides over 7 TFlops of double precision performance and up to 1 TiB/s of memory bandwidth. GG nodes connect via an InfiniBand 200 Gb/s fabric to a top rack shelf NVIDIA Quantum-2 MQM9790 NDR switch. In total, the subsystem contains sixty-four 200 Gb/s uplinks to the NDR rack shelf switch.
@@ -354,6 +355,93 @@ For more information on this and other matters related to Slurm job submission,
+## NVIDIA MPS { #mps }
+
+NVIDIA's [Multi-Process Service](https://docs.nvidia.com/deploy/mps/) (MPS) allows multiple processes to share a GPU efficiently by reducing scheduling overhead. MPS can improve GPU resource sharing between processes when a single process cannot fully saturate the GPU's compute capacity.
+
+Follow these steps to configure MPS on Vista for optimized multi-process workflows:
+
+1. **Configure Environment Variables**
+
+ Set environment variables to define where MPS stores its runtime pipes and logs. In the example below, these are placed in each node's `/tmp` directory. The `/tmp` directory is ephemeral and cleared after a job ends or a node reboots. Add these lines to your job script or shell session:
+
+ ```job-script
+ # Set MPS environment variables
+ export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
+ export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log
+ ```
+
+    To retain these logs for later analysis, specify directories on the `$SCRATCH`, `$WORK`, or `$HOME` file systems instead of `/tmp`; see the sketch following this list.
+
+2. **Launch MPS Control Daemon**
+
+ Use `ibrun` to start the MPS daemon across all allocated nodes. This ensures one MPS control process per node, targeting GPU 0:
+
+ ```job-script
+ # Launch MPS daemon on all nodes
+ export TACC_TASKS_PER_NODE=1 # Force one task per node
+ ibrun -np $SLURM_NNODES nvidia-cuda-mps-control -d
+ unset TACC_TASKS_PER_NODE # Reset to default task distribution
+ ```
+
+3. **Submit Your GPU Job**
+
+ After enabling MPS, run your CUDA application as usual. For example:
+
+ ```job-script
+ ibrun ./your_cuda_executable
+ ```
+
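+If you choose to retain the MPS logs (step 1 above), a minimal sketch is shown below. It keeps the node-local pipes in `/tmp` but collects the logs in a job-specific directory on `$SCRATCH`; the `mps-logs/$SLURM_JOB_ID` path is only an example, not a TACC convention:
+
+```job-script
+# Keep MPS pipes node-local, but collect logs on the shared $SCRATCH file system
+export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
+export CUDA_MPS_LOG_DIRECTORY=$SCRATCH/mps-logs/$SLURM_JOB_ID   # example path only
+mkdir -p "$CUDA_MPS_LOG_DIRECTORY"                              # create it before starting the daemons
+# Note: with multiple nodes, every node's MPS daemon writes into this shared directory.
+```
+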
+### Sample Job Script { #scripts }
+
+A job script incorporating the above elements might look like this:
+
+```job-script
+#!/bin/bash
+#SBATCH -J mps_gpu_job # Job name
+#SBATCH -o mps_job.%j.out # Output file (%j = job ID)
+#SBATCH -t 01:00:00 # Wall time (1 hour)
+#SBATCH -N 2 # Number of nodes
+#SBATCH -n 8 # Total tasks (4 per node)
+#SBATCH -p gh # GPU partition (modify as needed)
+#SBATCH -A your_project # Project allocation
+
+# 1. Configure environment
+export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
+export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log
+
+# 2. Launch MPS daemon on all nodes
+echo "Starting MPS daemon..."
+export TACC_TASKS_PER_NODE=1 # Force 1 task/node
+ibrun -np $SLURM_NNODES nvidia-cuda-mps-control -d
+unset TACC_TASKS_PER_NODE
+sleep 5 # Wait for daemons to initialize
+
+# 3. Run your CUDA application
+echo "Launching application..."
+ibrun ./your_cuda_executable # Replace with your executable
+```
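+
+The MPS daemons do not need to be stopped explicitly; their pipes and logs under `/tmp` disappear when the job ends. If you prefer a clean shutdown at the end of the script, NVIDIA's control daemon accepts a `quit` command on its standard input. The sketch below reuses the one-task-per-node `ibrun` pattern from step 2 and is optional, not a required step:
+
+```job-script
+# 4. (Optional) Shut down the MPS control daemon on every node
+export TACC_TASKS_PER_NODE=1   # one shutdown command per node
+ibrun -np $SLURM_NNODES bash -c 'echo quit | nvidia-cuda-mps-control'
+unset TACC_TASKS_PER_NODE
+```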
+
+### Notes on Performance
+
+MPS is particularly effective for workloads characterized by:
+
+* Fine-grained GPU operations (many small kernel launches)
+* Concurrent processes sharing the same GPU
+* Underutilized GPU resources in single-process workflows
+
+To verify the performance gains for your use case, log on to the node your job is running on (e.g., `c608-052`) and monitor GPU usage with the following command:
+
+```cmd-line
+c608-052$ nvidia-smi dmon --gpm-metrics=3,12 -s u
+```
+
+The side-by-side plots in the figure below illustrate the performance enhancement obtained by running two GPU processes simultaneously on a single Hopper node with MPS. The GPU performance improvement is ~12%, compared to no improvement without MPS. In addition, the setup cost on the CPU (about 12 seconds) is completely overlapped, resulting in a 1.2x total improvement for two simultaneous Amber executions. Even better performance is expected for applications that don't load the GPU as heavily as Amber.
+
+
+Figure 1. Usage (SM, Memory and FP32) and SM occupancy percentages for single and dual Amber GPU executions (single-precision) on Hopper H200.
+
+
## Machine Learning { #ml }
Vista is well equipped to provide researchers with the latest in Machine Learning frameworks, for example, PyTorch. The installation process will be a little different depending on whether you are using single or multiple nodes. Below we detail how to use PyTorch on our systems for both scenarios.
diff --git a/docs/hpc/vista/makefile b/docs/hpc/vista/makefile
index 7297b22..6b2d08a 100644
--- a/docs/hpc/vista/makefile
+++ b/docs/hpc/vista/makefile
@@ -4,6 +4,7 @@ VISTA_OBJS = \
system.md \
running.md \
launching.md \
+ mps.md \
ml.md \
building.md \
nvidia.md \
diff --git a/docs/hpc/vista/mps.md b/docs/hpc/vista/mps.md
new file mode 100644
index 0000000..3eb336f
--- /dev/null
+++ b/docs/hpc/vista/mps.md
@@ -0,0 +1,87 @@
+## NVIDIA MPS { #mps }
+
+NVIDIA's [Multi-Process Service](https://docs.nvidia.com/deploy/mps/) (MPS) allows multiple processes to share a GPU efficiently by reducing scheduling overhead. MPS can improve GPU resource sharing between processes when a single process cannot fully saturate the GPU's compute capacity.
+
+Follow these steps to configure MPS on Vista for optimized multi-process workflows:
+
+1. **Configure Environment Variables**
+
+ Set environment variables to define where MPS stores its runtime pipes and logs. In the example below, these are placed in each node's `/tmp` directory. The `/tmp` directory is ephemeral and cleared after a job ends or a node reboots. Add these lines to your job script or shell session:
+
+ ```job-script
+ # Set MPS environment variables
+ export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
+ export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log
+ ```
+
+    To retain these logs for later analysis, specify directories on the `$SCRATCH`, `$WORK`, or `$HOME` file systems instead of `/tmp`; see the sketch following this list.
+
+2. **Launch MPS Control Daemon**
+
+ Use `ibrun` to start the MPS daemon across all allocated nodes. This ensures one MPS control process per node, targeting GPU 0:
+
+ ```job-script
+ # Launch MPS daemon on all nodes
+ export TACC_TASKS_PER_NODE=1 # Force one task per node
+ ibrun -np $SLURM_NNODES nvidia-cuda-mps-control -d
+ unset TACC_TASKS_PER_NODE # Reset to default task distribution
+ ```
+
+3. **Submit Your GPU Job**
+
+ After enabling MPS, run your CUDA application as usual. For example:
+
+ ```job-script
+ ibrun ./your_cuda_executable
+ ```
+
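+If you choose to retain the MPS logs (step 1 above), a minimal sketch is shown below. It keeps the node-local pipes in `/tmp` but collects the logs in a job-specific directory on `$SCRATCH`; the `mps-logs/$SLURM_JOB_ID` path is only an example, not a TACC convention:
+
+```job-script
+# Keep MPS pipes node-local, but collect logs on the shared $SCRATCH file system
+export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
+export CUDA_MPS_LOG_DIRECTORY=$SCRATCH/mps-logs/$SLURM_JOB_ID   # example path only
+mkdir -p "$CUDA_MPS_LOG_DIRECTORY"                              # create it before starting the daemons
+# Note: with multiple nodes, every node's MPS daemon writes into this shared directory.
+```
+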
+### Sample Job Script { #scripts }
+
+A job script incorporating the above elements might look like this:
+
+```job-script
+#!/bin/bash
+#SBATCH -J mps_gpu_job # Job name
+#SBATCH -o mps_job.%j.out # Output file (%j = job ID)
+#SBATCH -t 01:00:00 # Wall time (1 hour)
+#SBATCH -N 2 # Number of nodes
+#SBATCH -n 8 # Total tasks (4 per node)
+#SBATCH -p gh # GPU partition (modify as needed)
+#SBATCH -A your_project # Project allocation
+
+# 1. Configure environment
+export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
+export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log
+
+# 2. Launch MPS daemon on all nodes
+echo "Starting MPS daemon..."
+export TACC_TASKS_PER_NODE=1 # Force 1 task/node
+ibrun -np $SLURM_NNODES nvidia-cuda-mps-control -d
+unset TACC_TASKS_PER_NODE
+sleep 5 # Wait for daemons to initialize
+
+# 3. Run your CUDA application
+echo "Launching application..."
+ibrun ./your_cuda_executable # Replace with your executable
+```
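+
+The MPS daemons do not need to be stopped explicitly; their pipes and logs under `/tmp` disappear when the job ends. If you prefer a clean shutdown at the end of the script, NVIDIA's control daemon accepts a `quit` command on its standard input. The sketch below reuses the one-task-per-node `ibrun` pattern from step 2 and is optional, not a required step:
+
+```job-script
+# 4. (Optional) Shut down the MPS control daemon on every node
+export TACC_TASKS_PER_NODE=1   # one shutdown command per node
+ibrun -np $SLURM_NNODES bash -c 'echo quit | nvidia-cuda-mps-control'
+unset TACC_TASKS_PER_NODE
+```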
+
+### Notes on Performance
+
+MPS is particularly effective for workloads characterized by:
+
+* Fine-grained GPU operations (many small kernel launches)
+* Concurrent processes sharing the same GPU
+* Underutilized GPU resources in single-process workflows
+
+To verify the performance gains for your use case, log on to the node your job is running on (e.g., `c608-052`) and monitor GPU usage with the following command:
+
+```cmd-line
+c608-052$ nvidia-smi dmon --gpm-metrics=3,12 -s u
+```
+
+The side-by-side plots in the figure below illustrate the performance enhancement obtained by running two GPU processes simultaneously on a single Hopper node with MPS. The GPU performance improvement is ~12%, compared to no improvement without MPS. In addition, the setup cost on the CPU (about 12 seconds) is completely overlapped, resulting in a 1.2x total improvement for two simultaneous Amber executions. Even better performance is expected for applications that don't load the GPU as heavily as Amber.
+
+
+Figure 1. Usage (SM, Memory and FP32) and SM occupancy percentages for single and dual Amber GPU executions (single-precision) on Hopper H200.
+
+
diff --git a/docs/hpc/vista/notices.md b/docs/hpc/vista/notices.md
index 3fc1dfb..0725c14 100644
--- a/docs/hpc/vista/notices.md
+++ b/docs/hpc/vista/notices.md
@@ -1,8 +1,9 @@
# Vista User Guide
-*Last update: February 7, 2025*
+*Last update: February 18, 2025*
## Notices { #notices }
+* **New**: See TACC Staff's [notes on incorporating NVIDIA's Multi-Process Service (MPS)](#mps). (02/18/2025)
* **Important**: Please note [TACC's new SU charge policy](#sunotice). (09/20/2024)
* **[Subscribe][TACCSUBSCRIBE] to Vista User News**. Stay up-to-date on Vista's status, scheduled maintenances and other notifications. (09/01/2024)
diff --git a/docs/hpc/vista/system.md b/docs/hpc/vista/system.md
index f51c281..0060633 100644
--- a/docs/hpc/vista/system.md
+++ b/docs/hpc/vista/system.md
@@ -4,7 +4,7 @@
Vista's compute system is divided into Grace-Grace and Grace-Hopper subsystems networked in two-level fat-tree topology as illustrated in Figure 1. below.
-Figure 1. Vista Topology
+Figure 1. Vista Topology
The Grace-Grace (GG) subsystem, a purely CPU-based system, is housed in four racks, each containing 64 Grace-Grace (GG) nodes. Each GG node contains 144 processing cores. A GG node provides over 7 TFlops of double precision performance and up to 1 TiB/s of memory bandwidth. GG nodes connect via an InfiniBand 200 Gb/s fabric to a top rack shelf NVIDIA Quantum-2 MQM9790 NDR switch. In total, the subsystem contains sixty-four 200 Gb/s uplinks to the NDR rack shelf switch.