Ray tutorial #1302
base: main
Conversation
### Starting a Ray Cluster manually
We will begin with a simple example of how to manually start a Ray cluster on 2 physical nodes. This will involve a single Ray head node and 2 Ray worker nodes, where each Ray worker node is allocated all GPUs of the node it runs on. We will assume the IP addresses of the physical nodes are `IP_ADDR_1` and `IP_ADDR_2` and that the head node will be allocated on the physical node with `IP_ADDR_1`.
In the previous section you mention that we'll operate in a 1-process-per-GPU setting, but here we're running with 2 Ray worker nodes. Does that mean we're running 8 actors per worker node? And does each of them run in a separate process? I would clarify this.
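For reference, a minimal manual start along the lines of that paragraph might look like the sketch below. The port number is an illustrative assumption, and where the second worker node runs is exactly the ambiguity raised above:

```console
# On IP_ADDR_1: start the Ray head node
ray start --head --node-ip-address=IP_ADDR_1 --port=6379

# On each physical node that should contribute a worker node
# (IP_ADDR_2, and possibly IP_ADDR_1 as well): join the existing cluster
ray start --address=IP_ADDR_1:6379
```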
```console
# First we start the head of the ray cluster on one of the physical nodes
# In this case we are giving an entire physical node to the ray head node
# The ray head node is marked by including --head to the ray start command
srun --nodes=1 --ntasks=1 -w "$head_node" ray start --head \
```
How come we didn't need `--head` in the manual example above?
```console
#!/bin/bash
#SBATCH --nodes=<NUM_NODES>+1
```
Do we really need an extra physical node for the head? Could we instead launch the head on node#0?
```console
# Getting the node names
nodes=$(scontrol show hostnames "$SLURM_JOB_NODELIST")
nodes_array=($nodes)
```
Redundant with the lines below?
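For context, the `$ip_head` variable used in the excerpt below is usually derived once from this parsed node list. The port number and the `hostname --ip-address` call in this sketch follow the common Ray-on-SLURM pattern and are assumptions, not taken from the PR:

```console
# Derive the head node's address once from the parsed SLURM node list
nodes=$(scontrol show hostnames "$SLURM_JOB_NODELIST")
nodes_array=($nodes)
head_node=${nodes_array[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)
ip_head="$head_node_ip:6379"
echo "Head node is $head_node at $ip_head"
```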
```console
srun --exact --nodes=1 --ntasks=1 --cpus-per-task=$((16 * gpus_per_node)) -w "$node_i" \
    ray start --address "$ip_head" \
    --resources="{\"worker_units\": gpus_per_node}" \
```
`gpus_per_node` seems to be missing a `$`?
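If that is the intent, the fixed invocation would presumably read (remaining flags of the original script omitted):

```console
srun --exact --nodes=1 --ntasks=1 --cpus-per-task=$((16 * gpus_per_node)) -w "$node_i" \
    ray start --address "$ip_head" \
    --resources="{\"worker_units\": $gpus_per_node}"
```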
```python
def __init__(self, worker_cls, num_workers) -> None:
  self.worker_cls = worker_cls
  self.num_workers = num_workers

  self.workers = [worker_cls.options(num_gpus=1,
      num_cpus=16,
      resources={"worker_units": 1}).remote() for _ in range(self.num_workers)]
```
Formatting seems off and will break in Python (2 spaces vs 4 spaces)?
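As a sketch, a consistently indented (4-space) version of the constructor could look like the following; the enclosing class name is not shown in the excerpt and is assumed here:

```python
class RayWorkerPool:  # hypothetical name; the PR excerpt does not show the class statement
    def __init__(self, worker_cls, num_workers) -> None:
        self.worker_cls = worker_cls
        self.num_workers = num_workers

        # One Ray actor per GPU: each actor reserves 1 GPU, 16 CPUs,
        # and one unit of the custom "worker_units" resource
        self.workers = [
            worker_cls.options(
                num_gpus=1,
                num_cpus=16,
                resources={"worker_units": 1},
            ).remote()
            for _ in range(self.num_workers)
        ]
```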
```python
def initialize_workers(self, **kwargs):
    self.worker_init_kwargs = kwargs
    coordinator_ip = ray.get(self.workers[0].get_host_ip.remote())
    coordinator_port = random.randint(1, 100000) % 2**12 + (65535 - 2**12 + 1)
```
`2**16 - 2**12 + random.randrange(2**12)`?
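Spelled out, that suggestion draws the port uniformly from the top 2**12 ports, which the original modulo expression only approximates:

```python
import random

# Uniformly pick a coordinator port in [2**16 - 2**12, 2**16 - 1] = [61440, 65535]
coordinator_port = 2**16 - 2**12 + random.randrange(2**12)
```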
This document presents a detailed tutorial on how Ray can be used together with JAX to achieve fault-tolerant training.