When multiple Ray clusters have processes running on the same node, we want each cluster's dashboard to contain metrics only from the processes in its own cluster. Currently, the reporter process responsible for collecting these metrics (of which there is one per unique (cluster, node) pair) fetches metrics for all Ray workers on the node, regardless of which cluster they belong to.
I assume that we still want a separate reporter process for each (cluster, node) pair, rather than switching to a single reporter process per node. This is easier to implement given our current process handling, and it allows per-cluster configuration of reporting, which, although we do not use it today, I think we should aim to support. The downside is that some of the metrics we collect are node-wide rather than cluster-specific, such as CPU utilization, and those would be monitored by N reporter processes instead of one.
Of the two solutions described below, I think Alternative 1 is simpler to implement and leverages existing behavior of the system, whereas Alternative 2 is perhaps easier to follow and makes sense conceptually, but will introduce more code.
Alternative 1: I would change reporter.py to filter the worker processes it fetches based on their ppid, which is the process ID of the raylet from which the worker was forked. I would also need to change process start-up so that the raylet's process ID is passed to the Reporter at construction time, or else record the raylet process ID in redis and read it from the Reporter class.
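A minimal sketch of what this filtering could look like, assuming the raylet's PID is available to the Reporter (the function name and the use of psutil here are illustrative, not existing Ray code; any existing logic for recognizing worker processes would still apply on top of this filter):

```python
import psutil


def get_cluster_workers(raylet_pid):
    """Return processes forked from this cluster's raylet.

    `raylet_pid` would be passed to the Reporter at construction
    time or read from redis; this helper is a hypothetical sketch.
    """
    workers = []
    for proc in psutil.process_iter(attrs=["pid", "ppid", "cmdline"]):
        try:
            # Keep only processes whose parent is this cluster's raylet.
            if proc.info["ppid"] == raylet_pid:
                workers.append(proc)
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue
    return workers
```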
Alternative 2: Start all worker processes for a given Ray cluster in the same Unix process group, and write this group ID into redis. The reporter can then fetch all processes in that group, and workers from other clusters will not be part of it.
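A rough sketch of how this could work, under the assumption that worker launch goes through something like subprocess on Unix; the function names and the redis write are hypothetical, not Ray's actual API:

```python
import os
import subprocess

import psutil


def start_worker_in_cluster_group(worker_cmd, cluster_pgid=0):
    """Launch a worker inside the cluster's Unix process group.

    With cluster_pgid=0 the first worker creates a new group (whose id is
    its own pid); passing an existing group id makes later workers join it.
    The resulting group id would then be written to redis for the reporter.
    """
    proc = subprocess.Popen(
        worker_cmd,
        # Runs in the child before exec: create or join the process group.
        preexec_fn=lambda: os.setpgid(0, cluster_pgid),
    )
    return proc, os.getpgid(proc.pid)


def get_cluster_workers(cluster_pgid):
    """Reporter side: collect only processes in this cluster's group."""
    workers = []
    for proc in psutil.process_iter(attrs=["pid"]):
        try:
            if os.getpgid(proc.info["pid"]) == cluster_pgid:
                workers.append(proc)
        except (ProcessLookupError, psutil.NoSuchProcess):
            continue
    return workers
```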