Ray tutorial #1302

Draft · wants to merge 3 commits into main
Conversation

keshavb96

This document presents a detailed tutorial on how Ray can be used together with JAX to achieve fault-tolerant training.


### Starting a Ray Cluster manually

We will begin with a simple example of how to manually start a Ray cluster on 2 physical nodes. This will involve a single Ray head node and 2 Ray worker nodes, where each Ray worker node is allocated all GPUs of the node it runs on. We will assume the IP addresses of the physical nodes are `IP_ADDR_1` and `IP_ADDR_2` and that the head node will be allocated on the physical node with `IP_ADDR_1`.
Contributor

In the previous section you mention that we'll operate in a 1-process-per-GPU setting, but here we're running with 2 Ray worker nodes. Does that mean we're running 8 actors per worker node? And does each of them run in a separate process? I would clarify this.
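The intended packing can be checked with simple arithmetic (a sketch; the 8-GPUs-per-node figure and the per-actor resource request are assumptions based on the snippets later in this PR, where each actor asks for `num_gpus=1` and one `worker_units`):

```python
# Sketch: how many actors Ray can pack onto one worker node, assuming each
# node registers 8 GPUs and 8 "worker_units", and each actor requests
# num_gpus=1 and worker_units=1 (one process per GPU).
node_resources = {"GPU": 8, "worker_units": 8}     # assumed per-node capacity
per_actor_request = {"GPU": 1, "worker_units": 1}  # from the options() call below

actors_per_node = min(
    node_resources[k] // per_actor_request[k] for k in per_actor_request
)
print(actors_per_node)  # -> 8 actors per worker node, each in its own process
```

Under these assumptions the scheduler can place 8 actors per worker node, and since each Ray actor runs in its own worker process, this matches the 1-process-per-GPU setting.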

```console
# First we start the head of the ray cluster on one of the physical nodes.
# In this case we are giving an entire physical node to the ray head node.
# The ray head node is marked by passing --head to the ray start command.
srun --nodes=1 --ntasks=1 -w "$head_node" ray start --head \
```
Contributor

How come we didn't need --head in the manual example above?


```console
#!/bin/bash
#SBATCH --nodes=<NUM_NODES>+1
```
Contributor

Do we really need an extra physical node for the head? Could we instead launch the head on node#0?

Comment on lines +95 to +98
```console
# Getting the node names
nodes=$(scontrol show hostnames "$SLURM_JOB_NODELIST")
nodes_array=($nodes)
```

Contributor

Redundant with the lines below?


```console
srun --exact --nodes=1 --ntasks=1 --cpus-per-task=$((16 * gpus_per_node)) -w "$node_i" \
    ray start --address "$ip_head" \
    --resources="{\"worker_units\": $gpus_per_node}"
```
Contributor

gpus_per_node seems to be missing a $?
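Without the `$`, the shell passes the literal string `gpus_per_node` inside the resources JSON, which is not valid. The difference can be seen with plain string manipulation (hypothetical value for `gpus_per_node`; no Ray involved):

```python
import json

gpus_per_node = 8  # hypothetical value for illustration

# With shell interpolation ($gpus_per_node) the flag value is valid JSON:
good = '{"worker_units": %d}' % gpus_per_node
print(json.loads(good))  # -> {'worker_units': 8}

# Without the $, the literal variable name makes the JSON unparseable:
bad = '{"worker_units": gpus_per_node}'
try:
    json.loads(bad)
except json.JSONDecodeError:
    print("invalid JSON")
```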

Comment on lines +196 to +202
```python
def __init__(self, worker_cls, num_workers) -> None:
    self.worker_cls = worker_cls
    self.num_workers = num_workers

    self.workers = [
        worker_cls.options(num_gpus=1,
                           num_cpus=16,
                           resources={"worker_units": 1}).remote()
        for _ in range(self.num_workers)
    ]
```
Contributor

Formatting seems off and will break in Python (2 spaces vs 4 spaces)?

```python
def initialize_workers(self, **kwargs):
    self.worker_init_kwargs = kwargs
    coordinator_ip = ray.get(self.workers[0].get_host_ip.remote())
    coordinator_port = random.randint(1, 100000) % 2**12 + (65535 - 2**12 + 1)
```
Contributor

2**16 - 2**12 + random.randrange(2**12)?
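The suggested expression picks uniformly from the top 4096 ports (61440–65535), whereas `randint(1, 100000) % 2**12` is slightly non-uniform because 100000 is not a multiple of 4096. A sketch of the suggested fix:

```python
import random

def pick_coordinator_port() -> int:
    # Uniform over the top 4096 ports: [61440, 65535].
    # 2**16 - 2**12 == 61440; randrange(2**12) adds an offset in [0, 4095].
    return 2**16 - 2**12 + random.randrange(2**12)

port = pick_coordinator_port()
assert 61440 <= port <= 65535
```

Both versions land in the same range, so this is a uniformity cleanup rather than a correctness fix.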
