I am far from certain about the cause of the issue.
The symptoms are as follows (I haven't gathered extensive evidence yet, because I'm just setting up and testing sacred for the first time):
- the start of the experiment is always logged
- sometimes a few heartbeats are sent at the beginning (I infer this because the stdout capture and metrics show up)
- then they stop (stdout capture and metrics are no longer updated, and the experiment gets the 'Probably dead' status in sacredboard)
- sometimes they show up again for a while (tens of minutes later, far longer than I would consider a reasonable delay)
- and then they usually disappear again
- when the experiment completes, the appropriate event is sent to the database (and the experiment is marked as completed in sacredboard), but the script freezes after the final message
  `INFO - experiment_name - Completed after 0:14:29`
  and takes ages to finish (same as #273). If I let the frozen script finish, the output capture and metrics are collected.
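
To put numbers on the gaps described above, successive heartbeat timestamps could be compared. This is only a minimal sketch: it assumes the heartbeat times have been collected by hand (for example by periodically reading the run's `heartbeat` field in the database), and the function name `find_heartbeat_gaps` and the 60-second threshold are hypothetical choices, not part of sacred.

```python
from datetime import datetime, timedelta


def find_heartbeat_gaps(timestamps, max_gap=timedelta(seconds=60)):
    """Return (previous, current, gap) for each pair of consecutive
    heartbeats that are further apart than max_gap.

    max_gap is an arbitrary threshold picked for illustration.
    """
    ordered = sorted(timestamps)
    gaps = []
    for prev, cur in zip(ordered, ordered[1:]):
        delta = cur - prev
        if delta > max_gap:
            gaps.append((prev, cur, delta))
    return gaps


# Made-up timestamps for illustration, not taken from a real run.
start = datetime(2018, 1, 1, 12, 0, 0)
observed = [
    start,
    start + timedelta(seconds=10),
    start + timedelta(seconds=20),
    start + timedelta(minutes=25),  # heartbeat reappears much later
]

for prev, cur, delta in find_heartbeat_gaps(observed):
    print(f"no heartbeat between {prev:%H:%M:%S} and {cur:%H:%M:%S} ({delta})")
```

A list of such gaps would make it easier to tell whether the heartbeats are merely delayed or are dropped entirely.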