@talesa
Last active January 11, 2019 14:48
sacred issue

I am far from certain about the cause of the issue.

The symptoms are as follows (I haven't gathered extensive evidence yet, because I'm just setting up and testing sacred for the first time):

  • the start of the experiment is always logged
  • sometimes some heartbeats are sent at the beginning (because the stdout capture and metrics are showing up)
  • then they stop (stdout capture and metrics are not updated, and the experiment has 'Probably dead' status in sacredboard)
  • sometimes they show up again for a while (tens of minutes later, much longer than I would consider a reasonable delay)
  • and then they usually disappear again
  • when the experiment completes, the appropriate event is sent to the database (and the experiment is marked as completed in sacredboard), but the script freezes after the final message INFO - experiment_name - Completed after 0:14:29 and takes ages to finish (same as #273)
  • if I let the frozen script finish, the output capture and metrics are collected
  • if I interrupt it with Ctrl+C, I get the same stack trace as in #273 (included at the end of this comment)

Losing heartbeats somewhat resembles #258, but so far I've tested only with PyTorch code, not Caffe.

For now my workaround is to call _run._emit_heartbeat() inside my training loop to ensure that I don't lose metrics.
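A minimal sketch of that workaround, assuming _run is the sacred Run object injected into the main function. The train function, its heartbeat_every_s parameter, and the FakeRun stand-in are hypothetical names used only so the snippet runs on its own; _emit_heartbeat() is a private sacred method, so it may change between versions.

```python
import time


def train(_run, n_steps, heartbeat_every_s=10.0):
    """Training loop that emits heartbeats from the main thread.

    Returns the number of heartbeats that were emitted.
    """
    last_beat = time.monotonic()
    beats = 0
    for step in range(n_steps):
        # ... one optimisation step would go here ...
        now = time.monotonic()
        if now - last_beat >= heartbeat_every_s:
            # Private sacred API: pushes pending output to the observers
            # without relying on the background timer thread.
            _run._emit_heartbeat()
            last_beat = now
            beats += 1
    return beats


class FakeRun:
    """Hypothetical stand-in for sacred's Run, only for a standalone demo."""

    def __init__(self):
        self.heartbeats = 0

    def _emit_heartbeat(self):
        self.heartbeats += 1


if __name__ == "__main__":
    run = FakeRun()
    print(train(run, n_steps=5, heartbeat_every_s=0.0), "heartbeats emitted")
```

Rate-limiting by wall-clock time rather than emitting on every step keeps the number of database writes bounded when individual steps are fast.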

I did a little debugging, and it seems to me that when the heartbeats stop, it is because the IntervalTimer thread gets stuck at https://github.com/IDSIA/sacred/blob/eca2e75867e1c033638ab26c63f358e315c123a2/sacred/observers/mongo.py#L249-L250 However

It seems that the following helped as an ad hoc solution:

  • emitting heartbeats in the training loop
  • fixing the handling of the pymongo AutoReconnect error in the MongoObserver.log_metrics() function
  • setting explicit timeouts for pymongo where the default is infinite
  • removing the standard threaded way of sending heartbeats altogether
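The timeout and reconnect points above could look roughly like this. It is a sketch, not the actual change made to sacred: the timeout values are illustrative, call_with_retries is a hypothetical helper, and the names url, collection, query, and update in the comments are placeholders. The option names are regular pymongo.MongoClient keyword arguments, and pymongo.errors.AutoReconnect is the exception pymongo raises when a connection is lost.

```python
import time

# Keyword options accepted by pymongo.MongoClient; the values are illustrative.
MONGO_TIMEOUTS = {
    "serverSelectionTimeoutMS": 5000,
    "connectTimeoutMS": 5000,
    "socketTimeoutMS": 10000,  # pymongo sets no socket timeout by default
}


def call_with_retries(fn, retry_on, attempts=3, delay_s=0.0):
    """Call fn(); retry on `retry_on` up to `attempts` total tries, then re-raise."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == attempts:
                raise
            time.sleep(delay_s)


# Intended use (not executed here, it needs pymongo and a running server):
#   client = pymongo.MongoClient(url, **MONGO_TIMEOUTS)
#   call_with_retries(lambda: collection.update_one(query, update),
#                     retry_on=pymongo.errors.AutoReconnect)
```

With finite timeouts a write that cannot reach the server fails after a bounded wait instead of blocking the calling thread, and the bounded retry gives a short outage a chance to pass before the error is reported.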

System: Fedora release 25, Anaconda Python 3.6.5, sacred 0.7.3, PyMongo 3.4.0, MongoDB 3.4.10

INFO - pyro_threshold - Completed after 0:00:24
^C
Exception ignored in: <module 'threading' from '/homes/a/utils/miniconda3/envs/py/lib/python3.6/threading.py'>
Traceback (most recent call last):
  File "/homes/a/utils/miniconda3/envs/py/lib/python3.6/threading.py", line 1294, in _shutdown
    t.join()
  File "/homes/a/utils/miniconda3/envs/py/lib/python3.6/threading.py", line 1056, in join
    self._wait_for_tstate_lock()
  File "/homes/a/utils/miniconda3/envs/py/lib/python3.6/threading.py", line 1072, in _wait_for_tstate_lock
    elif lock.acquire(block, timeout):
KeyboardInterrupt
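
The trace shows the interpreter's threading._shutdown waiting in t.join() for a non-daemon thread that has not finished. As a generic illustration of how a periodic worker can avoid blocking exit (not sacred's implementation; DaemonIntervalTimer and its parameters are hypothetical names), the thread can be marked as a daemon and joined with a bounded timeout:

```python
import threading


class DaemonIntervalTimer(threading.Thread):
    """Calls func every interval_s seconds until stop() is called."""

    def __init__(self, func, interval_s):
        # daemon=True: interpreter shutdown does not wait for this thread.
        super().__init__(daemon=True)
        self.func = func
        self.interval_s = interval_s
        self.stop_event = threading.Event()

    def run(self):
        # Event.wait returns False on timeout and True once stop() sets the event.
        while not self.stop_event.wait(self.interval_s):
            self.func()

    def stop(self, timeout_s=2.0):
        """Request a stop and wait at most timeout_s; return True if the thread ended."""
        self.stop_event.set()
        self.join(timeout=timeout_s)
        return not self.is_alive()
```

The trade-off is that a daemon thread is abandoned at exit, so any work still in flight inside func (for example a pending database write) can be lost; that is why emitting a final heartbeat from the main thread is still worthwhile.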