I am far from certain about the cause of the issue.
The symptoms are as follows (I haven't gathered extensive evidence yet, because I'm just setting up and testing sacred for the first time):
- the start of the experiment is always logged
- sometimes some heartbeats are sent at the beginning (because the stdout capture and metrics are showing up)
- then they stop (stdout capture and metrics are not updated, and the experiment has 'Probably dead' status in sacredboard)
- sometimes they show up again for a while (tens of minutes later, much longer than I would consider a reasonable delay)
- and then they usually disappear again
- when the experiment completes, the appropriate event is sent to the database (and the experiment is marked as completed in sacredboard), but the script freezes after the final message
INFO - experiment_name - Completed after 0:14:29
and takes ages to finish (same as #273); if I let the frozen script finish, the output capture and metrics are collected
- if I interrupt it with Ctrl+C, I get the same stack trace as in #273 (included at the end of this comment)
Losing heartbeats somewhat resembles #258, but I've tested only with PyTorch code so far, rather than Caffe.
For now my workaround is to run _run._emit_heartbeat()
inside my training loop to ensure that I don't lose metrics.
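A minimal sketch of this workaround, assuming the training loop has access to the sacred Run object (the `_run` special argument). The helper name `emit_heartbeat_every`, the `every` interval, and the `FakeRun` stand-in are hypothetical and only there so the snippet runs without sacred or MongoDB. Note that `_emit_heartbeat` is a private method and may change between sacred versions.

```python
def emit_heartbeat_every(run, step, every=50):
    """Flush captured output and metrics to the observers every `every` steps.

    `run` is expected to be the sacred Run object (the `_run` argument of a
    captured function). Returns True if a heartbeat was emitted.
    """
    if every > 0 and step % every == 0:
        run._emit_heartbeat()  # private sacred API, used here as a workaround
        return True
    return False


class FakeRun:
    """Hypothetical stand-in for sacred's Run, so the sketch is runnable alone."""

    def __init__(self):
        self.heartbeats = 0

    def _emit_heartbeat(self):
        self.heartbeats += 1


def train(run, n_steps=200, every=50):
    for step in range(1, n_steps + 1):
        # ... forward/backward pass and metric logging would go here ...
        emit_heartbeat_every(run, step, every)
    return run


fake = train(FakeRun())
print(fake.heartbeats)  # 4 (steps 50, 100, 150 and 200)
```

In a real experiment, `train` would be called with `_run` from inside the function decorated with `@ex.automain` instead of `FakeRun()`. Emitting on every step is also possible, but each heartbeat is a database write, so a coarser interval is probably kinder to MongoDB.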
I did a little debugging, and it seems that when the heartbeats stop, it is because the IntervalTimer thread gets stuck at https://github.com/IDSIA/sacred/blob/eca2e75867e1c033638ab26c63f358e315c123a2/sacred/observers/mongo.py#L249-L250 However
It seems that the following helped as an ad hoc solution:
- emitting heartbeats in the training loop
- fixing the handling of the pymongo AutoReconnect error in the MongoObserver.log_metrics() function
- setting explicit timeouts for pymongo where the default is infinite
- removing the standard threaded way of sending heartbeats altogether
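The timeout and reconnect items above can be sketched roughly as follows. This is not the actual patch: `mongo_url_with_timeouts` and `call_with_retry` are hypothetical helpers, and the timeout values are arbitrary. The three URI options are standard MongoDB connection-string options; as far as I know, `socketTimeoutMS` is the one that has no limit by default in PyMongo 3.x.

```python
import time
from urllib.parse import urlencode


def mongo_url_with_timeouts(host="localhost", port=27017,
                            server_selection_ms=5000,
                            connect_ms=5000,
                            socket_ms=10000):
    """Build a MongoDB URI that sets finite timeouts via URI options."""
    options = urlencode({
        "serverSelectionTimeoutMS": server_selection_ms,
        "connectTimeoutMS": connect_ms,
        "socketTimeoutMS": socket_ms,
    })
    return "mongodb://{}:{}/?{}".format(host, port, options)


def call_with_retry(func, retry_on, attempts=3, delay=0.0):
    """Call func(); retry when it raises `retry_on`, up to `attempts` calls."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == attempts:
                raise  # give up and let the caller see the error
            time.sleep(delay)


# Assumed usage with pymongo (not executed here):
#   client = pymongo.MongoClient(mongo_url_with_timeouts())
#   call_with_retry(lambda: collection.update_one(query, update),
#                   retry_on=pymongo.errors.AutoReconnect)
print(mongo_url_with_timeouts())
```

With a finite socket timeout, a stuck database call raises an error instead of blocking forever, and the retry wrapper gives a dropped connection a bounded number of chances to recover.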
System: Fedora release 25, Anaconda Python 3.6.5, sacred 0.7.3, PyMongo 3.4.0, MongoDB 3.4.10
INFO - pyro_threshold - Completed after 0:00:24
^C
Exception ignored in: <module 'threading' from '/homes/a/utils/miniconda3/envs/py/lib/python3.6/threading.py'>
Traceback (most recent call last):
File "/homes/a/utils/miniconda3/envs/py/lib/python3.6/threading.py", line 1294, in _shutdown
t.join()
File "/homes/a/utils/miniconda3/envs/py/lib/python3.6/threading.py", line 1056, in join
self._wait_for_tstate_lock()
File "/homes/a/utils/miniconda3/envs/py/lib/python3.6/threading.py", line 1072, in _wait_for_tstate_lock
elif lock.acquire(block, timeout):
KeyboardInterrupt
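The traceback shows the interpreter waiting in threading._shutdown, which joins the remaining non-daemon threads at exit. That is consistent with a heartbeat thread stuck inside a blocking call. The following stdlib-only sketch reproduces the pattern; it is not sacred's IntervalTimer code, and `blocked_worker` and the event names are made up for illustration.

```python
import threading


def blocked_worker(stop_event, entered, release):
    """Mimic a timer thread whose periodic call never returns on its own."""
    while not stop_event.is_set():
        entered.set()   # signal that we are about to block
        release.wait()  # stands in for a database call with no timeout


stop_event = threading.Event()
entered = threading.Event()
release = threading.Event()
worker = threading.Thread(target=blocked_worker,
                          args=(stop_event, entered, release))
worker.start()
entered.wait()            # make sure the worker is inside the loop body

stop_event.set()          # asking the loop to stop is not enough...
worker.join(timeout=0.2)  # ...the thread is still inside the blocking call
still_stuck = worker.is_alive()

release.set()             # the blocking call finally returns
worker.join(timeout=5)
finished = not worker.is_alive()
print(still_stuck, finished)  # True True
```

Without the `release.set()` line, the script would hang at exit in the same `t.join()` shown in the traceback, until the blocking call returned or the process was interrupted.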