Assumptions:
- Head node is firewalled but worker nodes are not.
- Nodes can ssh into each other.
On the head node, run the following commands:
ray stop
ray start --head --port=6379 --redis-shard-ports=6380 --object-manager-port=6381 --node-manager-port=6382
ray status
should show the head node as running.
On the worker nodes, do the following:
- create an entry in
~/.ssh/config
that looks like this:
Host head-node
HostName <head-node-ip>
User <head-node-user>
LocalForward 6379 localhost:6379 # redis/gcs
LocalForward 6380 localhost:6380 # redis-shard-ports
LocalForward 6381 localhost:6381 # object-manager ports
LocalForward 6382 localhost:6382 # node-manager ports
LocalForward 8265 localhost:8265 # dashboard
- ssh into the head node:
ssh head-node
in a separate terminal (or just run ssh head-node -N in the background).
- create a dummy entry in /etc/hosts, e.g., 127.0.0.1 fake-head-node (yes, trust me).
- start Ray on the worker:
ray start --address='fake-head-node:6379' --resources='{"noobness": 4}' --node-ip-address=<worker-node-public-ip/domain>
ray status
should show the head node and the worker node as running.
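If you'd rather not touch ~/.ssh/config, the same forwards can be passed on the ssh command line. Here is a minimal sketch, assuming the same five ports as in the config above. HEAD_USER and HEAD_IP are placeholders and build_tunnel_cmd is just a helper name I made up. It only prints the command so you can look it over before running it.

```shell
#!/bin/sh
# Placeholders (assumptions): replace with your head node's user and IP.
HEAD_USER="${HEAD_USER:-head-node-user}"
HEAD_IP="${HEAD_IP:-head-node-ip}"

# Build one ssh command that forwards every Ray port to the head node.
# -f: go to background after authentication, -N: run no remote command,
# ExitOnForwardFailure=yes: abort if any forward cannot be set up.
build_tunnel_cmd() {
  cmd="ssh -f -N -o ExitOnForwardFailure=yes"
  # gcs/redis, redis shard, object manager, node manager, dashboard
  for port in 6379 6380 6381 6382 8265; do
    cmd="$cmd -L ${port}:localhost:${port}"
  done
  echo "$cmd ${HEAD_USER}@${HEAD_IP}"
}

# Print the command for review; copy-paste it to actually open the tunnel.
build_tunnel_cmd
```

The ExitOnForwardFailure option is there so a port that is already taken on the worker shows up as an error instead of a tunnel that is only half working.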
The dummy entry in /etc/hosts is needed because the --address flag in ray start doesn't like "localhost/127.0.0.1". If it sees either, it automatically tries to pick another IP (e.g., the node's IP on some other interface). We want to force it to use localhost:6379 because the ports are tunneled to the head node, and an alias such as fake-head-node that resolves to 127.0.0.1 does exactly that.
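Before running ray start, it can help to confirm that the alias really resolves to loopback. This is a rough sketch, assuming a Linux box with glibc's getent. The helper names (is_loopback, resolve_host, check_alias) are made up for illustration.

```shell
#!/bin/sh
# Succeeds if the given IPv4 address is in 127.0.0.0/8.
is_loopback() {
  case "$1" in
    127.*) return 0 ;;
    *) return 1 ;;
  esac
}

# Resolve a hostname to its first IPv4 address via the system resolver
# (which consults /etc/hosts), using getent.
resolve_host() {
  getent ahostsv4 "$1" | awk '{print $1; exit}'
}

# Report whether the alias points at loopback.
check_alias() {
  ip="$(resolve_host "$1")"
  if is_loopback "$ip"; then
    echo "ok: $1 -> $ip"
  else
    echo "problem: $1 -> ${ip:-unresolved}"
    return 1
  fi
}
```

Usage: check_alias fake-head-node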
In step 4, if you leave out --node-ip-address (the worker node's public IP), Ray fails silently and gives mixed signals: ray status will show 2 nodes as connected, but the resources will only be those of the head node.
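Since the node count alone is misleading, a quick way to catch this is to look for the worker's custom resource in the ray status output. Here is a small sketch, assuming ray status lists custom resources by name (I haven't checked every Ray version). has_resource is a hypothetical helper, and noobness is the resource name from the ray start command above.

```shell
#!/bin/sh
# has_resource NAME: reads `ray status` output on stdin and succeeds
# if some line mentions NAME as a whole word.
has_resource() {
  grep -q -w -e "$1"
}
```

Usage: ray status | has_resource noobness && echo "worker resources registered"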
Another thing to be careful about: if your worker node is on Azure (or another cloud provider), you need to add an entry in Azure's firewall to allow all traffic from the worker node's public IP. That's because when you run ray start on the worker node, the dashboard agent tries to connect to itself (idk why, it should connect to the head node, right? right?), and that connection is blocked by the firewall. The entry looks like this: Allow inbound traffic from <worker-node-public-ip> to any port. This is not needed if the worker node is on-prem.
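With the Azure CLI, that firewall entry could be sketched roughly like this. I'm assuming the rule goes on the network security group attached to the worker VM. The resource group, NSG name, rule name, priority, and IP below are all placeholders, and build_nsg_rule_cmd is a made-up helper. It only prints the az command so you can adjust it before running.

```shell
#!/bin/sh
# Placeholders (assumptions): replace with your own values.
RG="${RG:-my-resource-group}"
NSG="${NSG:-worker-nsg}"
WORKER_IP="${WORKER_IP:-203.0.113.10}"

# Print an `az network nsg rule create` command that allows all inbound
# traffic from the worker's own public IP to any port.
build_nsg_rule_cmd() {
  echo "az network nsg rule create" \
    "--resource-group $RG --nsg-name $NSG" \
    "--name allow-worker-self --priority 100" \
    "--direction Inbound --access Allow --protocol '*'" \
    "--source-address-prefixes $WORKER_IP" \
    "--destination-port-ranges '*'"
}

build_nsg_rule_cmd
```

Priority 100 is arbitrary; pick one that doesn't clash with rules you already have.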
I think it should be possible to make this work when all nodes are behind firewalls too, although I haven't thought it through yet. The problem is that Ray wants bidirectional communication between nodes, and firewalls cripple that. SSH tunneling gets around it, but so far I've only figured out one-way tunneling.
References: