W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/stream_executor/cuda/cuda_driver.cc:590] creating context when one is currently active; existing: 0x118af890
W tensorflow/stream_executor/cuda/cuda_driver.cc:590] creating context when one is currently active; existing: 0x118b36a0
W tensorflow/stream_executor/cuda/cuda_driver.cc:590] creating context when one is currently active; existing: 0x118b74e0
INFO:tensorflow:0314:14:49:41.625:sg_train.py:312] Training started from epoch[001]-step[0].
W tensorflow/core/common_runtime/bfc_allocator.cc:274] **************************************************************************************_**********xxx
W tensorflow/core/common_runtime/bfc_allocator.cc:275] Ran out of memory trying to allocate 368.75MiB. See logs for memory state.
W tensorflow/core/framework/op_kernel.cc:993] Resource exhausted: OOM when allocating tensor with shape[4096,400,1,59]
W tensorflow/core/common_runtime/bfc_allocator.cc:274] **************************************************************************************_**********xxx
W tensorflow/core/common_runtime/bfc_allocator.cc:275] Ran out of memory trying to allocate 368.75MiB. See logs for memory state.
W tensorflow/core/framework/op_kernel.cc:993] Resource exhausted: OOM when allocating tensor with shape[4096,400,1,59]
W tensorflow/core/common_runtime/bfc_allocator.cc:274] **************************************************************************************_**********xxx
W tensorflow/core/common_runtime/bfc_allocator.cc:275] Ran out of memory trying to allocate 368.75MiB. See logs for memory state.
W tensorflow/core/framework/op_kernel.cc:993] Resource exhausted: OOM when allocating tensor with shape[4096,400,1,59]
W tensorflow/core/common_runtime/bfc_allocator.cc:274] **************************************************************************************_**********xxx
W tensorflow/core/common_runtime/bfc_allocator.cc:275] Ran out of memory trying to allocate 368.75MiB. See logs for memory state.
W tensorflow/core/framework/op_kernel.cc:993] Resource exhausted: OOM when allocating tensor with shape[4096,400,1,59]
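Note: the warning blocks above are the BFC allocator failing the same 368.75 MiB request repeatedly. That figure follows directly from the reported shape [4096,400,1,59] at 4 bytes per element (the Conv2DBackpropInput node later in the log is declared with T=DT_FLOAT). A quick check, assuming float32:

shape = (4096, 400, 1, 59)       # as reported by the allocator
num_elements = 1
for dim in shape:
    num_elements *= dim
print(num_elements * 4 / 2**20)  # 368.75 MiB, matching the log

Since this tensor scales with batch size and sequence length, the usual TF 1.x remedies are a smaller batch, shorter sequences, or setting config.gpu_options.allow_growth = True on the session config; sugartensor's sg_train builds its own session, so treat that as general advice rather than a drop-in fix.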
Exception in thread Thread-441:
Traceback (most recent call last):
File "/zhome/ff/2/77654/stdpy3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1022, in _do_call
return fn(*args)
File "/zhome/ff/2/77654/stdpy3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1004, in _run_fn
status, run_metadata)
File "/appl/python/3.6.0/lib/python3.6/contextlib.py", line 89, in __exit__
next(self.gen)
File "/zhome/ff/2/77654/stdpy3/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 469, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.CancelledError: Enqueue operation was cancelled
[[Node: random_shuffle_queue_1_EnqueueMany = QueueEnqueueManyV2[Tcomponents=[DT_INT32], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/cpu:0"](random_shuffle_queue_1, fraction_of_32_full_DequeueMany)]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/appl/python/3.6.0/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/appl/python/3.6.0/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/zhome/ff/2/77654/stdpy3/lib/python3.6/site-packages/tensorflow/python/training/queue_runner_impl.py", line 250, in _run
coord.request_stop(e)
File "/zhome/ff/2/77654/stdpy3/lib/python3.6/site-packages/tensorflow/python/training/coordinator.py", line 211, in request_stop
six.reraise(*sys.exc_info())
File "/zhome/ff/2/77654/stdpy3/lib/python3.6/site-packages/six.py", line 686, in reraise
raise value
File "/zhome/ff/2/77654/stdpy3/lib/python3.6/site-packages/tensorflow/python/training/queue_runner_impl.py", line 234, in _run
sess.run(enqueue_op)
File "/zhome/ff/2/77654/stdpy3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 767, in run
run_metadata_ptr)
File "/zhome/ff/2/77654/stdpy3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 965, in _run
feed_dict_string, options, run_metadata)
File "/zhome/ff/2/77654/stdpy3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1015, in _do_run
target_list, options, run_metadata)
File "/zhome/ff/2/77654/stdpy3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1035, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.CancelledError: Enqueue operation was cancelled
[[Node: random_shuffle_queue_1_EnqueueMany = QueueEnqueueManyV2[Tcomponents=[DT_INT32], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/cpu:0"](random_shuffle_queue_1, fraction_of_32_full_DequeueMany)]]
Caused by op 'random_shuffle_queue_1_EnqueueMany', defined at:
File "code/script/bytenet_europarl_train.py", line 21, in <module>
validate=True)
File "/zhome/ff/2/77654/workspace/kandidat/code/dataset/wmt_bilingual_news.py", line 117, in __init__
**kwargs
File "/zhome/ff/2/77654/workspace/kandidat/code/dataset/abstract/text_dataset.py", line 75, in __init__
**kwargs)
File "/zhome/ff/2/77654/workspace/kandidat/code/dataset/abstract/dataset.py", line 80, in __init__
length, source, target = self.queue.read()
File "/zhome/ff/2/77654/workspace/kandidat/code/dataset/util/sequence_queue/sequence_queue_memory.py", line 53, in read
seed=self.seed
File "/zhome/ff/2/77654/workspace/kandidat/code/tf_operator/queue/shuffle_tensor_index.py", line 13, in shuffle_tensor_index
shuffle_enqueue = shuffle_queue.enqueue_many([dequeue_op])
File "/zhome/ff/2/77654/stdpy3/lib/python3.6/site-packages/tensorflow/python/ops/data_flow_ops.py", line 378, in enqueue_many
self._queue_ref, vals, name=scope)
File "/zhome/ff/2/77654/stdpy3/lib/python3.6/site-packages/tensorflow/python/ops/gen_data_flow_ops.py", line 1538, in _queue_enqueue_many_v2
name=name)
File "/zhome/ff/2/77654/stdpy3/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 763, in apply_op
op_def=op_def)
File "/zhome/ff/2/77654/stdpy3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2395, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/zhome/ff/2/77654/stdpy3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1264, in __init__
self._traceback = _extract_stack()
CancelledError (see above for traceback): Enqueue operation was cancelled
[[Node: random_shuffle_queue_1_EnqueueMany = QueueEnqueueManyV2[Tcomponents=[DT_INT32], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/cpu:0"](random_shuffle_queue_1, fraction_of_32_full_DequeueMany)]]
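The CancelledError above is raised on one of the input-pipeline threads (Thread-441): once the training step fails, the coordinator closes the queues and the pending enqueue_many is cancelled, so this traceback is a symptom of the shutdown rather than the root cause. A simplified sketch of the worker loop in queue_runner_impl._run that appears in the trace (the function name run_enqueue_loop is illustrative, not a TF API):

import tensorflow as tf

def run_enqueue_loop(sess, coord, enqueue_op):
    # Simplified TF 1.x QueueRunner worker loop: keep enqueueing until the
    # coordinator stops, and hand any exception (here the CancelledError)
    # back to the coordinator.
    try:
        while not coord.should_stop():
            sess.run(enqueue_op)   # raises CancelledError once the queue is closed
    except Exception as e:
        coord.request_stop(e)      # surfaced later via coord.join() / should_stop()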
INFO:tensorflow:0315:04:23:36.454:sg_train.py:415] Training finished at epoch[1]-step[21334].
Traceback (most recent call last):
File "/zhome/ff/2/77654/stdpy3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1022, in _do_call
return fn(*args)
File "/zhome/ff/2/77654/stdpy3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1004, in _run_fn
status, run_metadata)
File "/appl/python/3.6.0/lib/python3.6/contextlib.py", line 89, in __exit__
next(self.gen)
File "/zhome/ff/2/77654/stdpy3/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 469, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[4096,400,1,59]
[[Node: train/gradients_1/bytenet-model_1/decoder/bytenet-decoder-depth-2/decoder-res-block.2.3.16/conv-dilated/atrous_conv2d_grad/Conv2DBackpropInput = Conv2DBackpropInput[T=DT_FLOAT, data_format="NHWC", padding="VALID", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:1"](train/gradients_1/bytenet-model_1/decoder/bytenet-decoder-depth-2/decoder-res-block.2.3.16/conv-dilated/atrous_conv2d_grad/Shape, bytenet-model/decoder/decoder-res-block.2.3.16/conv-dilated/W/read/_22785, train/gradients_1/bytenet-model_1/decoder/bytenet-decoder-depth-2/decoder-res-block.2.3.16/conv-dilated/atrous_conv2d/BatchToSpaceND_grad/SpaceToBatchND)]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/zhome/ff/2/77654/stdpy3/lib/python3.6/site-packages/sugartensor/sg_main.py", line 527, in sg_queue_context
yield
File "/zhome/ff/2/77654/stdpy3/lib/python3.6/site-packages/sugartensor/sg_train.py", line 340, in wrapper
batch_loss = func(sess, opt)
File "/zhome/ff/2/77654/workspace/kandidat/code/tf_operator/train/basic_train.py", line 44, in train_func
run_metadata=run_metadata)[0]
File "/zhome/ff/2/77654/stdpy3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 767, in run
run_metadata_ptr)
File "/zhome/ff/2/77654/stdpy3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 965, in _run
feed_dict_string, options, run_metadata)
File "/zhome/ff/2/77654/stdpy3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1015, in _do_run
target_list, options, run_metadata)
File "/zhome/ff/2/77654/stdpy3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1035, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[4096,400,1,59]
[[Node: train/gradients_1/bytenet-model_1/decoder/bytenet-decoder-depth-2/decoder-res-block.2.3.16/conv-dilated/atrous_conv2d_grad/Conv2DBackpropInput = Conv2DBackpropInput[T=DT_FLOAT, data_format="NHWC", padding="VALID", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:1"](train/gradients_1/bytenet-model_1/decoder/bytenet-decoder-depth-2/decoder-res-block.2.3.16/conv-dilated/atrous_conv2d_grad/Shape, bytenet-model/decoder/decoder-res-block.2.3.16/conv-dilated/W/read/_22785, train/gradients_1/bytenet-model_1/decoder/bytenet-decoder-depth-2/decoder-res-block.2.3.16/conv-dilated/atrous_conv2d/BatchToSpaceND_grad/SpaceToBatchND)]]
Caused by op 'train/gradients_1/bytenet-model_1/decoder/bytenet-decoder-depth-2/decoder-res-block.2.3.16/conv-dilated/atrous_conv2d_grad/Conv2DBackpropInput', defined at:
File "code/script/bytenet_europarl_train.py", line 27, in <module>
model.train(max_ep=60, lr=0.0001)
File "/zhome/ff/2/77654/workspace/kandidat/code/model/abstract/model.py", line 73, in train
update = self._update_model(losses, lr=lr, **kwargs)
File "/zhome/ff/2/77654/workspace/kandidat/code/model/abstract/model.py", line 87, in _update_model
return tower_optim(losses, **kwargs)
File "/zhome/ff/2/77654/workspace/kandidat/code/tf_operator/parallel/tower_optim.py", line 16, in tower_optim
gradient = tower_gradient(opt, losses, var_list)
File "/zhome/ff/2/77654/workspace/kandidat/code/tf_operator/parallel/tower_gradient.py", line 12, in tower_gradient
tower_grads.append(opt.compute_gradients(loss, var_list=var_list))
File "/zhome/ff/2/77654/stdpy3/lib/python3.6/site-packages/tensorflow/python/training/optimizer.py", line 354, in compute_gradients
colocate_gradients_with_ops=colocate_gradients_with_ops)
File "/zhome/ff/2/77654/stdpy3/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py", line 482, in gradients
in_grads = grad_fn(op, *out_grads)
File "/zhome/ff/2/77654/stdpy3/lib/python3.6/site-packages/tensorflow/python/ops/nn_grad.py", line 359, in _Conv2DGrad
op.get_attr("data_format")),
File "/zhome/ff/2/77654/stdpy3/lib/python3.6/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 489, in conv2d_backprop_input
data_format=data_format, name=name)
File "/zhome/ff/2/77654/stdpy3/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 763, in apply_op
op_def=op_def)
File "/zhome/ff/2/77654/stdpy3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2395, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/zhome/ff/2/77654/stdpy3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1264, in __init__
self._traceback = _extract_stack()
...which was originally created as op 'bytenet-model_1/decoder/bytenet-decoder-depth-2/decoder-res-block.2.3.16/conv-dilated/atrous_conv2d', defined at:
File "code/script/bytenet_europarl_train.py", line 27, in <module>
model.train(max_ep=60, lr=0.0001)
File "/zhome/ff/2/77654/workspace/kandidat/code/model/abstract/model.py", line 51, in train
loss, losses = self.train_model(reuse=reuse)
File "/zhome/ff/2/77654/workspace/kandidat/code/model/abstract/model.py", line 40, in train_model
**kwargs)
File "/zhome/ff/2/77654/workspace/kandidat/code/model/bytenet.py", line 74, in loss_model
name="bytenet-model"
File "/zhome/ff/2/77654/workspace/kandidat/code/tf_operator/bytenet_translator/bytenet_supervised_translator.py", line 54, in bytenet_supervised_translator
name="decoder")
File "/zhome/ff/2/77654/workspace/kandidat/code/tf_operator/bytenet_decoder/parallel_bytenet_decoder.py", line 21, in parallel_bytenet_decoder
name=f'decoder-res-block.{i}.{size}.{rate_i}'
File "/zhome/ff/2/77654/workspace/kandidat/code/tf_operator/decoder_residual_block/parallel_decoder_residual_block.py", line 77, in parallel_decoder_residual_block
name="conv-dilated")
File "/zhome/ff/2/77654/stdpy3/lib/python3.6/site-packages/sugartensor/sg_main.py", line 235, in wrapper
out = func(tensor, opt)
File "/zhome/ff/2/77654/stdpy3/lib/python3.6/site-packages/sugartensor/sg_layer.py", line 197, in sg_aconv1d
out = tf.nn.atrous_conv2d(x, w, rate=opt.rate, padding='VALID') + b
File "/zhome/ff/2/77654/stdpy3/lib/python3.6/site-packages/tensorflow/python/ops/nn_ops.py", line 994, in atrous_conv2d
name=name)
File "/zhome/ff/2/77654/stdpy3/lib/python3.6/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 396, in conv2d
data_format=data_format, name=name)
File "/zhome/ff/2/77654/stdpy3/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 763, in apply_op
op_def=op_def)
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[4096,400,1,59]
[[Node: train/gradients_1/bytenet-model_1/decoder/bytenet-decoder-depth-2/decoder-res-block.2.3.16/conv-dilated/atrous_conv2d_grad/Conv2DBackpropInput = Conv2DBackpropInput[T=DT_FLOAT, data_format="NHWC", padding="VALID", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:1"](train/gradients_1/bytenet-model_1/decoder/bytenet-decoder-depth-2/decoder-res-block.2.3.16/conv-dilated/atrous_conv2d_grad/Shape, bytenet-model/decoder/decoder-res-block.2.3.16/conv-dilated/W/read/_22785, train/gradients_1/bytenet-model_1/decoder/bytenet-decoder-depth-2/decoder-res-block.2.3.16/conv-dilated/atrous_conv2d/BatchToSpaceND_grad/SpaceToBatchND)]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "code/script/bytenet_europarl_train.py", line 27, in <module>
model.train(max_ep=60, lr=0.0001)
File "/zhome/ff/2/77654/workspace/kandidat/code/model/abstract/model.py", line 84, in train
lr=lr)
File "/zhome/ff/2/77654/workspace/kandidat/code/model/abstract/model.py", line 92, in _train_loop
basic_train(loss, update_op, **kwargs)
File "/zhome/ff/2/77654/workspace/kandidat/code/tf_operator/train/basic_train.py", line 55, in basic_train
train_func(save_dir=save_dir, **kwargs)
File "/zhome/ff/2/77654/stdpy3/lib/python3.6/site-packages/sugartensor/sg_train.py", line 404, in wrapper
break
File "/appl/python/3.6.0/lib/python3.6/contextlib.py", line 100, in __exit__
self.gen.throw(type, value, traceback)
File "/zhome/ff/2/77654/stdpy3/lib/python3.6/site-packages/sugartensor/sg_main.py", line 532, in sg_queue_context
coord.join(threads)
File "/zhome/ff/2/77654/stdpy3/lib/python3.6/site-packages/tensorflow/python/training/coordinator.py", line 390, in join
" ".join(stragglers))
RuntimeError: Coordinator stopped with threads still running: Thread-104 Thread-116 Thread-140 Thread-119 Thread-143 Thread-118 Thread-142 Thread-108 Thread-120 Thread-144 Thread-123 Thread-147 Thread-122 Thread-146 Thread-149 Thread-125 Thread-112 Thread-124 Thread-148 Thread-115 Thread-127 Thread-151 Thread-114 Thread-126 Thread-150 Thread-261 Thread-260 Thread-262 Thread-37 Thread-264 Thread-36 Thread-267 Thread-39 Thread-266 Thread-38 Thread-268 Thread-43 Thread-270 Thread-42 Thread-129 Thread-128 Thread-131 Thread-130 Thread-137 Thread-136 Thread-139 Thread-138 Thread-273 Thread-272 Thread-274 Thread-286 Thread-276 Thread-288 Thread-290 Thread-278 Thread-293 Thread-280 Thread-292 Thread-16 Thread-282 Thread-294 Thread-7 Thread-83 Thread-82 Thread-154 Thread-9 Thread-8 Thread-85 Thread-157 Thread-84 Thread-156 Thread-10 Thread-159 Thread-86 Thread-158 Thread-13 Thread-12 Thread-89 Thread-161 Thread-88 Thread-160 Thread-15 Thread-14 Thread-91 Thread-163 Thread-90 Thread-162 Thread-17 Thread-20 Thread-68 Thread-308 Thread-71 Thread-23 Thread-22 Thread-70 Thread-73 Thread-313 Thread-312 Thread-72 Thread-24 Thread-315 Thread-27 Thread-75 Thread-314 Thread-74 Thread-26 Thread-29 Thread-316 Thread-76 Thread-28 Thread-31 Thread-318 Thread-30 Thread-297 Thread-92 Thread-296 Thread-299 Thread-95 Thread-94 Thread-298 Thread-97 Thread-96 Thread-99 Thread-303 Thread-98 Thread-302 Thread-101 Thread-100 Thread-304 Thread-306 Thread-102 Thread-213 Thread-357 Thread-333 Thread-56 Thread-57 Thread-188 Thread-212 Thread-320 Thread-332 Thread-383 Thread-382 Thread-335 Thread-215 Thread-322 Thread-334 Thread-58 Thread-190 Thread-214 Thread-325 Thread-385 Thread-360 Thread-384 Thread-193 Thread-337 Thread-217 Thread-216 Thread-324 Thread-60 Thread-192 Thread-327 Thread-363 Thread-362 Thread-219 Thread-338 Thread-62 Thread-194 Thread-218 Thread-389 Thread-388 Thread-221 Thread-328 Thread-340 Thread-197 Thread-64 Thread-196 Thread-220 Thread-391 Thread-366 Thread-390 Thread-222 Thread-342 Thread-223 Thread-330 Thread-66 Thread-198 Thread-177 Thread-176 Thread-179 Thread-178 Thread-181 Thread-180 Thread-442 Thread-441 Thread-182 Thread-444 Thread-443 Thread-185 Thread-184 Thread-445 Thread-187 Thread-186 Thread-448 Thread-447 Thread-345 Thread-369 Thread-393 Thread-344 Thread-368 Thread-392 Thread-461 Thread-371 Thread-395 Thread-346 Thread-370 Thread-394 Thread-464 Thread-463 Thread-349 Thread-373 Thread-397 Thread-348 Thread-372 Thread-396 Thread-466 Thread-465 Thread-351 Thread-399 Thread-350 Thread-398 Thread-468 Thread-467 Thread-353 Thread-401 Thread-352 Thread-376 Thread-400 Thread-470 Thread-469 Thread-379 Thread-403 Thread-354 Thread-378 Thread-402 Thread-472 Thread-471 Thread-225 Thread-201 Thread-224 Thread-200 Thread-474 Thread-498 Thread-449 Thread-473 Thread-497 Thread-227 Thread-226 Thread-476 Thread-488 Thread-531 Thread-451 Thread-475 Thread-487 Thread-228 Thread-204 Thread-454 Thread-453 Thread-477 Thread-489 Thread-504 Thread-207 Thread-206 Thread-230 Thread-456 Thread-480 Thread-492 Thread-455 Thread-479 Thread-491 Thread-233 Thread-232 Thread-208 Thread-494 Thread-457 Thread-493 Thread-211 Thread-235 Thread-210 Thread-234 Thread-496 Thread-459 Thread-495 Thread-237 Thread-236 Thread-59 Thread-238 Thread-241 Thread-240 Thread-243 Thread-242 Thread-336 Thread-244 Thread-246 Thread-18 Thread-326 Thread-164 Thread-166 Thread-169 Thread-168 Thread-356 Thread-171 Thread-170 Thread-173 Thread-172 Thread-358 Thread-331 Thread-67 Thread-45 Thread-44 Thread-248 Thread-251 Thread-191 Thread-250 
Thread-253 Thread-49 Thread-252 Thread-48 Thread-50 Thread-254 Thread-256 Thread-52 Thread-364 Thread-55 Thread-258 Thread-54 Thread-386
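The final RuntimeError comes from tf.train.Coordinator.join: after the OOM the coordinator requests a stop, but the several hundred shuffle-queue runner threads do not all exit within the grace period, so join lists them as stragglers. A minimal, self-contained TF 1.x sketch of the pattern that produces this message (the FIFOQueue here stands in for the gist's shuffle queues):

import tensorflow as tf

# Toy queue plus a QueueRunner so start_queue_runners has threads to launch.
queue = tf.FIFOQueue(capacity=32, dtypes=[tf.int32])
enqueue_op = queue.enqueue([tf.constant(0)])
tf.train.add_queue_runner(tf.train.QueueRunner(queue, [enqueue_op]))
dequeue_op = queue.dequeue()

with tf.Session() as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    try:
        for _ in range(10):
            sess.run(dequeue_op)   # an OOM raised here is what stops the coordinator
    finally:
        coord.request_stop()
        # join() raises the RuntimeError above if any runner thread is still alive
        # after the grace period (default 120 s).
        coord.join(threads, stop_grace_period_secs=120)

In this gist the straggler threads are the input pipeline's shuffle-queue runners, which never got a chance to drain after the primary ResourceExhaustedError.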