This document explains how to build an automated health-checking proxy that serves the beacon API to validator client processes.
An Ethereum validator client process needs to periodically perform actions on chain, such as attesting to blocks proposed by other validators, publishing sync committee message digests, or, if lucky, proposing its own blocks.
Although the validator client always possesses the means to produce the message data necessary for on-chain actions, it has no mechanism to distribute that data to other Ethereum nodes. An Ethereum beacon node is a separate process, which the validator client connects to over an HTTP API, and the beacon node has all the means to bring data into the network over a suite of Ethereum P2P protocols. Once the validator client has produced the necessary protocol data, it sends the data to the beacon node, which then acts upon it and makes sure its peers have received and acknowledged the attestation, block or sync committee message. This means that, even though the validator client is an authoritative participant in the blockchain network, within the current architecture of Ethereum proof-of-stake it cannot act without a connection to a beacon node, and this connection is on the critical path of operation for the validators that secure the Ethereum network.
Most Ethereum validator clients nowadays support load balancing over multiple beacon nodes to fulfill their network duties. This means that if a beacon node becomes non-functional, in most cases the validator client will stop using the connection to that node and will fall back to the other nodes specified in its configuration. For example, if the beacon node's HTTP API returns 503 instead of the expected return code, the validator client will not use it for producing the next attestation or block, but will fail over to another node. There are other scenarios in which a validator client will drop a beacon node, but the checks that the validator client performs on the beacon node differ between Ethereum client implementations. Every client checks whether the corresponding beacon node is currently syncing, and will not perform any action through a node that is not synced to the latest head. However, given Ethereum's dual nature, with both beacon (consensus) and execution nodes, it is important for reliable block production to verify that the execution node is not syncing either, not only the consensus node. In today's landscape of Ethereum client implementations, not every client does this. As of the time of this writing, the Lighthouse validator client performs a custom check to ensure the execution node is synced, and Teku does not.
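The fail-over behavior described above can be sketched roughly as follows. This is only an illustration of the idea, not how Lighthouse, Teku or any other client actually implements it; the function name `pick_beacon_node` and the `probe` callable are hypothetical, and treating HTTP 200 as the "expected return code" is an assumption.

```python
from typing import Callable, Iterable, Optional


def pick_beacon_node(
    nodes: Iterable[str],
    probe: Callable[[str], int],
) -> Optional[str]:
    """Return the first configured node whose probe yields HTTP 200.

    `probe` is a hypothetical callable that performs one HTTP request
    against the node and returns the status code. Nodes that answer
    with anything else (for example 503) or that cannot be reached
    are skipped, which is the fail-over step.
    """
    for url in nodes:
        try:
            status = probe(url)
        except OSError:
            # Connection refused, timeout, DNS failure: try the next node.
            continue
        if status == 200:
            return url
    # No usable node: the validator client cannot perform its duty.
    return None
```

With nodes listed in order of preference, a node that starts answering 503 is skipped on the next duty and picked up again once it returns 200.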
There is one more challenge that matters less to people who stake a few validators, but matters more to node operators who routinely connect several hundred validators to a single beacon node. To produce and distribute attestations efficiently, the set of Ethereum validators is randomly divided across 64 attestation subnets, and every attestation message is distributed among the peer nodes that operate on the specific subnet and is not delivered to nodes that are not subscribed to that subnet. However, larger node operators usually need to subscribe to all attestation subnets to make sure that their diverse validator set can deliver messages to every necessary subnet every time an attestation duty is due. On the Ethereum network, not every beacon node subscribes to every subnet, so to be present on every necessary subnet the beacon node needs a sufficient number of peers; otherwise some attestation messages could be lost and never delivered to the network, because the beacon node is not connected to the required subnet. This is why, for bigger node operators, it is also important to verify that a beacon node is connected to a sufficient number of peers before letting validator clients put that node into operation. Operators must ensure that nodes that have started up recently and do not yet have enough peers, or nodes that are experiencing a networking incident and cannot maintain enough peers, are not used by the validators.
So, at least three important health checks need to be performed on the connection between the beacon node and the validator client, to ensure the validator client can trust the beacon node to distribute all the necessary protocol messages. These checks are as follows:
- Beacon node is synced
- Execution node is synced
- Beacon node has sufficient number of peers to accommodate attestations on all subnets
To implement the health checks described, validator clients and the health checking layer can use the following Ethereum APIs:
- beacon node HTTP method /eth/v1/node/syncing to make sure consensus node is not syncing
- JSON RPC method eth_syncing to make sure execution node is not syncing
- beacon node HTTP method /eth/v1/node/peer_count to get peer count, and ensure it is sufficient
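As a rough sketch, the three checks listed above could be evaluated like this in Python, using only the standard library. The helper names, the base URLs passed in, and the `min_peers` threshold are assumptions made for illustration; the response fields used (`data.is_syncing`, `result`, `data.connected`) follow the standard beacon node API and JSON-RPC response shapes.

```python
import json
import urllib.request


def beacon_synced(body: dict) -> bool:
    """/eth/v1/node/syncing: healthy when data.is_syncing is false."""
    return body["data"]["is_syncing"] is False


def execution_synced(body: dict) -> bool:
    """eth_syncing: result is `false` when synced, a progress object otherwise."""
    return body.get("result") is False


def enough_peers(body: dict, min_peers: int) -> bool:
    """/eth/v1/node/peer_count: counts are encoded as decimal strings."""
    return int(body["data"]["connected"]) >= min_peers


def get_json(url: str, payload=None, timeout: float = 2.0) -> dict:
    """GET the URL, or POST `payload` as JSON when one is given."""
    data = json.dumps(payload).encode() if payload is not None else None
    req = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


def node_healthy(beacon_url: str, execution_url: str, min_peers: int) -> bool:
    """Run all three checks; any error or failed check means unhealthy."""
    rpc = {"jsonrpc": "2.0", "id": 1, "method": "eth_syncing", "params": []}
    try:
        return (
            beacon_synced(get_json(beacon_url + "/eth/v1/node/syncing"))
            and execution_synced(get_json(execution_url, rpc))
            and enough_peers(
                get_json(beacon_url + "/eth/v1/node/peer_count"), min_peers
            )
        )
    except (OSError, ValueError, KeyError, TypeError, AttributeError):
        # Unreachable node, malformed JSON, or an unexpected response shape.
        return False
```

The right value for `min_peers` depends on the operator's setup and is deliberately left as a parameter here.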
In an ideal world, all Ethereum client implementations would perform all three checks on the beacon node before starting to use it to produce attestation, block or sync committee messages. But this is not happening today: the functionality behind health checking differs between client implementations, as seen in the discrepancy in execution node checking between Teku and Lighthouse described above, and it continues to evolve.
To address this inconsistency in health checking behavior between different Ethereum client implementations, a smart Ethereum node operator can employ a custom proxy layer that health-checks the HTTP API exposed by the beacon node. A custom proxy must be deployed for every beacon node, and the Ethereum validator client connects to the custom proxy instead of directly to the beacon node. Every proxy performs a consistent set of health checks on its beacon node and, if any health check fails, marks the beacon node as out of rotation, which results in a 503 response to every validator client request over the HTTP API. As a result, no validator client will use such a faulty node to perform any of its validation duties. This works the same regardless of the client implementation, and can include more health checks if they become necessary as the Ethereum protocol evolves.
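The gating logic of such a proxy can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production proxy: the `HealthGate` class, the `ttl` cache interval, and the shape of the `forward` callable are all hypothetical, and a real deployment would more likely rely on the health checking features of existing proxy software.

```python
import time
from typing import Callable, Dict, Optional, Tuple

Response = Tuple[int, bytes]


class HealthGate:
    """Answers 503 for every request while any health check fails."""

    def __init__(
        self,
        checks: Dict[str, Callable[[], bool]],
        ttl: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._checks = checks
        self._ttl = ttl  # seconds a health verdict is reused (assumed value)
        self._clock = clock
        self._healthy = False
        self._checked_at: Optional[float] = None

    @staticmethod
    def _run(check: Callable[[], bool]) -> bool:
        # A check that raises is treated the same as a failed check.
        try:
            return bool(check())
        except Exception:
            return False

    def healthy(self) -> bool:
        """Re-run the checks when the cached verdict is older than ttl."""
        now = self._clock()
        if self._checked_at is None or now - self._checked_at >= self._ttl:
            self._healthy = all(self._run(c) for c in self._checks.values())
            self._checked_at = now
        return self._healthy

    def handle(self, forward: Callable[[], Response]) -> Response:
        """Forward to the beacon node only while it is in rotation."""
        if not self.healthy():
            return 503, b"beacon node out of rotation"
        return forward()
```

Here `checks` would hold one callable per health check (beacon synced, execution synced, enough peers), and `forward` stands in for whatever relays the validator client's request to the beacon node. Caching the verdict for a short interval keeps the checks off the request path.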
A layout of health checking proxies, beacon nodes, execution nodes and a validator client that allows for consistent health checking is pictured below in an ASCII diagram.
                 +-------------------+
                 | Validator Client  |
                 +--+---------------++
                    |               |
                    |               |
               Beacon API      Beacon API
                    |               |
                    |               |
            +-------v-------+     +-v-------------+
      +-----+Proxy (Health) |     |Proxy (Health) |
      |     +---------+-----+     +---+-----------++
      |               |               |            |
      |            Beacon        Beacon HTTP       |
      |           HTTP API           API           |
      |          +----<-----+     +---<-----+      |
    JSON         |Beacon    |     |Beacon   |    JSON
    RPC          |Node      |     |Node     |    RPC
    Health       |1         |     |2        |    Health
    Check        +----------+     +---------+    Check
      |           JSON-RPC         JSON-RPC        |
      |          +-----<----+     +----<-----+     |
      |          |Execution |     |Execution |     |
      |          |Node      |     |Node      |     |
      |          |1         |     |2         <-----+
      +---------->          |     |          |
                 +----------+     +----------+
There are multiple ways to implement such a health checking layer for Ethereum validation, and smart node operators should prefer the software that is already being used by their organization, be it Istio, Traefik, HAProxy or any other proxy software capable of health checking. For me personally, HAProxy is the best of these offerings, because it includes Lua programming language support, which makes it possible to implement health checks in a way that is both efficient at getting the job done and pleasant to program.
The problem of health checking beacon nodes is an interesting one to solve. However, in an ideal world there would be a solution for such health checking that is built into the validator clients and mandated by the protocol. The only suggestion I have for the protocol developers is to make sure that, if such a mechanism ever becomes an Ethereum protocol requirement, it is extensible, so that people working on future extensions of the protocol that might need custom health checking can easily implement new health checks.