Description

The Nimbus cluster is configured via the infra-nimbus repo, mostly through the infra-role-beacon-node role.

Here are some thoughts from a sysadmin's perspective.

Thoughts

Multiple Docker Images

Currently, to support different network setups and "flavours", multiple versions of the Docker image have to be built.

For more details, see the docker/beacon_node folder in the nimbus repo, and more specifically the build_beacon_node.sh script, which currently produces 2 images but will produce 4 or more in the future.

This is an anti-pattern. Runtime settings should be configured at runtime. This is understandable while Nimbus is still under heavy development, but once it is stable this should not be the case.
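
As a rough illustration of configuring at runtime instead of at build time, the sketch below resolves settings at startup from command-line flags, falling back to environment variables and then to defaults, so a single image could serve every setup. The flag names, the `NODE_NETWORK` and `NODE_FLAVOUR` variables, and the default values are all hypothetical; none of them are taken from Nimbus.

```python
import argparse
import os


def load_settings(argv=None, env=None):
    """Resolve settings at startup: CLI flag > environment variable > default."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(description="hypothetical node launcher")
    # NODE_NETWORK and NODE_FLAVOUR are made-up names used only for illustration.
    parser.add_argument("--network", default=env.get("NODE_NETWORK", "devnet"))
    parser.add_argument("--flavour", default=env.get("NODE_FLAVOUR", "default"))
    args = parser.parse_args(argv)
    return {"network": args.network, "flavour": args.flavour}
```

With something like this, the same container image can be started with `-e NODE_NETWORK=...` or with an explicit `--network` argument, and no rebuild is needed to switch setups.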

Healthcheck

The service should provide a way to verify that it is healthy. Just because a process is up doesn't mean that the service is functioning properly. Some way of periodically checking that the service is working correctly is always a good idea.

Whether it's a JSON RPC endpoint, plain HTTP, or even the most basic TCP ping/pong, it's always good to provide a way to verify that the service works. If it can also return some info (active peers, syncing progress, or state of worker threads), even better.
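
A minimal sketch of the plain HTTP variant, using only the Python standard library, might look like the following. The `/health` path, the field names, and the rule "healthy means at least one active peer" are assumptions made for the example, not anything Nimbus defines.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def build_health_response(info, min_peers=1):
    """Return (http_status, body_bytes) for a snapshot of node state."""
    # Assumed rule: the node is healthy once it has at least min_peers peers.
    healthy = info.get("active_peers", 0) >= min_peers
    body = json.dumps({"healthy": healthy, **info}, sort_keys=True).encode()
    return (200 if healthy else 503), body


def make_handler(get_info):
    """Build a request handler; get_info is a callable returning a dict."""

    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/health":
                self.send_error(404)
                return
            code, body = build_health_response(get_info())
            self.send_response(code)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    return HealthHandler


def serve_health(get_info, host="127.0.0.1", port=8080):
    """Blocking call; a real service would run this in its own thread."""
    HTTPServer((host, port), make_handler(get_info)).serve_forever()
```

Because the status code alone distinguishes healthy from unhealthy, a monitoring probe does not need to parse the body, while a human can still read the extra fields.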

Throttling & Limits

Any server-side software should have configurable limits, such as limits on requests, connections, or peers. All hosts have limited resources, and setting a hard limit on anything that could overload the host and bring the service down is always a good thing. If this is supposed to be run in production, I would recommend including such limits in the configuration.
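
One simple way to enforce such a hard limit is a counting semaphore sized from configuration, as in this sketch. The class name and the `max_connections` parameter are illustrative only.

```python
import threading


class ConnectionLimiter:
    """Reject new connections once a configured ceiling is reached."""

    def __init__(self, max_connections):
        if max_connections < 1:
            raise ValueError("max_connections must be at least 1")
        self._slots = threading.BoundedSemaphore(max_connections)

    def try_acquire(self):
        # Non-blocking: returns False immediately instead of queueing the caller.
        return self._slots.acquire(blocking=False)

    def release(self):
        # Call exactly once for every successful try_acquire().
        self._slots.release()
```

The accept loop would call `try_acquire()` for each incoming connection and close the socket right away when it returns `False`, so an overload turns into refused connections rather than an exhausted host.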

Multi-threading

I don't know if this software is supposed to be multi-threaded, but if so, the number of threads should also be configurable.

Logging

Currently all logs are JSON. That can be useful, but for most people JSON is neither readable nor convenient. An option to run a node with regular human-readable logs would be a good idea.

In addition, an option to disable colors in the logs would be helpful for people who run log aggregators.
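
To make the two options concrete, here is a small sketch using Python's standard `logging` module that picks between JSON, colored text, and plain text output. The `log_format` and `color` parameter names, the field names, and the color choices are assumptions for the example.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """One JSON object per line, suited to machine consumption."""

    def format(self, record):
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "msg": record.getMessage(),
        })


class ColorFormatter(logging.Formatter):
    """Human-readable lines, with ANSI colors for warnings and errors."""

    COLORS = {"WARNING": "\033[33m", "ERROR": "\033[31m"}
    RESET = "\033[0m"

    def format(self, record):
        line = super().format(record)
        color = self.COLORS.get(record.levelname)
        return f"{color}{line}{self.RESET}" if color else line


def build_formatter(log_format="text", color=True):
    """Select a formatter from two runtime options."""
    if log_format == "json":
        return JsonFormatter()
    pattern = "%(asctime)s %(levelname)-7s %(message)s"
    return ColorFormatter(pattern) if color else logging.Formatter(pattern)


def configure_logging(log_format="text", color=True, stream=None):
    """Replace the root logger's handlers with one using the chosen format."""
    handler = logging.StreamHandler(stream or sys.stderr)
    handler.setFormatter(build_formatter(log_format, color))
    root = logging.getLogger()
    root.handlers[:] = [handler]
    root.setLevel(logging.INFO)
```

Calling `configure_logging("text", color=False)` yields plain lines without escape sequences, which is what a log aggregator usually wants.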

Metrics

If possible, a separate call for getting a node's metrics would be preferable to exposing them via logs. In geth, for example, a JSON RPC call provides all the metrics collected by the service, which leaves the user free to decide how often to probe for them.

The method itself isn't as important as making the metrics available via pull rather than push.
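
The pull model can be sketched as an in-process registry that is only serialized when someone asks for it. The `Metrics` class, the `handle_metrics_request` function, and the counter names below are hypothetical and do not reflect geth's or Nimbus's actual API.

```python
import json
import threading
from collections import Counter


class Metrics:
    """In-memory counters, updated by the service as it works."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counters = Counter()

    def inc(self, name, amount=1):
        with self._lock:
            self._counters[name] += amount

    def snapshot(self):
        # Copy under the lock so the caller gets a consistent view.
        with self._lock:
            return dict(self._counters)


def handle_metrics_request(metrics):
    """What a pull endpoint (HTTP or JSON RPC) would return when asked."""
    return json.dumps(metrics.snapshot(), sort_keys=True)
```

Nothing is emitted until a client calls the endpoint, so the probing frequency is entirely the operator's choice.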

Signals

Any well-behaved service should respect several signals:

SIGTERM - Fast shutdown of the service.
SIGQUIT - Slow and graceful shutdown.
SIGHUP - Restarting worker processes, resetting connections, reloading configuration.
SIGUSR1 & SIGUSR2 - Often used for re-opening log files, flushing caches, sending keepalive packets, and so on.

In general, SIGHUP is the most important one to support, but handling your own SIGUSR* signals that prove useful for administrators - like checking whether all connected peers are live, or dropping all peers - can make managing the service much easier.
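
A minimal sketch of such handling with Python's `signal` module follows, using the same signal-to-action mapping as the list above. It is Unix-only, the `SignalState` class and its field names are illustrative, and the choice of "re-open logs" for SIGUSR1 is just one of the common uses mentioned. The handler only records what was requested; the main loop is expected to act on the flags.

```python
import signal


class SignalState:
    """Flags set by signal handlers and consumed by the main loop."""

    def __init__(self):
        self.reload_requested = False
        self.reopen_logs_requested = False
        self.shutdown = None  # None, "fast" or "graceful"


def install_handlers(state):
    """Register one handler for all supported signals and return it."""

    def on_signal(signum, frame):
        # Keep the handler tiny: record the request, do the work elsewhere.
        if signum == signal.SIGTERM:
            state.shutdown = "fast"
        elif signum == signal.SIGQUIT:
            state.shutdown = "graceful"
        elif signum == signal.SIGHUP:
            state.reload_requested = True
        elif signum == signal.SIGUSR1:
            state.reopen_logs_requested = True

    for sig in (signal.SIGTERM, signal.SIGQUIT, signal.SIGHUP, signal.SIGUSR1):
        signal.signal(sig, on_signal)
    return on_signal
```

An administrator could then run `kill -HUP <pid>` to request a configuration reload without restarting the process.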

Example Setup

Providing an example of a possible setup of your service is always a good idea. For example, systemd is almost everywhere nowadays, so giving users an example nimbus.service file is a great way to make their setup easier.

jakubgs/nimbus_cluster_thoughts.md
