Ray tutorial #1302

Draft · wants to merge 3 commits into main
Conversation

keshavb96

This document presents a detailed tutorial on how Ray can be used together with JAX to achieve fault-tolerant training.


### Starting a Ray Cluster manually

We will begin with a simple example of how to manually start a Ray cluster on 2 physical nodes. This will involve a single Ray head node and 2 Ray worker nodes, where each Ray worker node is allocated all GPUs of the node it runs on. We will assume the IP addresses of the physical nodes are `IP_ADDR_1` and `IP_ADDR_2` and that the head node will be allocated on the physical node with `IP_ADDR_1`.
Contributor

In the previous section you mention that we'll operate in a 1-process-per-GPU setting, but here we're running with 2 Ray worker nodes. Does that mean we're running 8 actors per worker node? And does each of them run in a separate process? I would clarify this.
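The intended packing can be checked with simple arithmetic (a sketch; the 8-GPUs-per-node figure and the per-actor resource request are assumptions based on the snippets later in this PR, where each actor asks for `num_gpus=1` and one `worker_units`):

```python
# Sketch: how many actors Ray can pack onto one worker node, assuming each
# node registers 8 GPUs and 8 "worker_units", and each actor requests
# num_gpus=1 and worker_units=1 (one process per GPU).
node_resources = {"GPU": 8, "worker_units": 8}     # assumed per-node capacity
per_actor_request = {"GPU": 1, "worker_units": 1}  # from the options() call below

actors_per_node = min(
    node_resources[k] // per_actor_request[k] for k in per_actor_request
)
print(actors_per_node)  # -> 8 actors per worker node, each in its own process
```

Under these assumptions the scheduler can place 8 actors per worker node, and since each Ray actor runs in its own worker process, this matches the 1-process-per-GPU setting.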

```console
# First we start the head of the ray cluster on one of the physical nodes.
# In this case we are giving an entire physical node to the ray head node.
# The ray head node is marked by passing --head to the ray start command.
srun --nodes=1 --ntasks=1 -w "$head_node" ray start --head \
```
Contributor

How come we didn't need --head in the manual example above?


```console
#!/bin/bash
#SBATCH --nodes=<NUM_NODES>+1
```
Contributor

Do we really need an extra physical node for the head? Could we instead launch the head on node#0?

Comment on lines +95 to +98
```console
# Getting the node names
nodes=$(scontrol show hostnames "$SLURM_JOB_NODELIST")
nodes_array=($nodes)
```

Contributor

Redundant with the lines below?


```console
srun --exact --nodes=1 --ntasks=1 --cpus-per-task=$((16 * gpus_per_node)) -w "$node_i" \
    ray start --address "$ip_head" \
    --resources="{\"worker_units\": $gpus_per_node}"
```
Contributor

gpus_per_node seems to be missing a $?
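Without the `$`, the shell passes the literal string `gpus_per_node` inside the resources JSON, which is not valid. The difference can be seen with plain string manipulation (hypothetical value for `gpus_per_node`; no Ray involved):

```python
import json

gpus_per_node = 8  # hypothetical value for illustration

# With shell interpolation ($gpus_per_node) the flag value is valid JSON:
good = '{"worker_units": %d}' % gpus_per_node
print(json.loads(good))  # -> {'worker_units': 8}

# Without the $, the literal variable name makes the JSON unparseable:
bad = '{"worker_units": gpus_per_node}'
try:
    json.loads(bad)
except json.JSONDecodeError:
    print("invalid JSON")
```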

Comment on lines +196 to +202
```python
def __init__(self, worker_cls, num_workers) -> None:
    self.worker_cls = worker_cls
    self.num_workers = num_workers

    self.workers = [
        worker_cls.options(num_gpus=1,
                           num_cpus=16,
                           resources={"worker_units": 1}).remote()
        for _ in range(self.num_workers)
    ]
```
Contributor

Formatting seems off and will break in Python (2 spaces vs 4 spaces)?

```python
def initialize_workers(self, **kwargs):
    self.worker_init_kwargs = kwargs
    coordinator_ip = ray.get(self.workers[0].get_host_ip.remote())
    coordinator_port = random.randint(1, 100000) % 2**12 + (65535 - 2**12 + 1)
```
Contributor

2**16 - 2**12 + random.randrange(2**12)?
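The suggested expression picks uniformly from the top 4096 ports (61440–65535), whereas `randint(1, 100000) % 2**12` is slightly non-uniform because 100000 is not a multiple of 4096. A sketch of the suggested fix:

```python
import random

def pick_coordinator_port() -> int:
    # Uniform over the top 4096 ports: [61440, 65535].
    # 2**16 - 2**12 == 61440; randrange(2**12) adds an offset in [0, 4095].
    return 2**16 - 2**12 + random.randrange(2**12)

port = pick_coordinator_port()
assert 61440 <= port <= 65535
```

Both versions land in the same range, so this is a uniformity cleanup rather than a correctness fix.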
