WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
[W socket.cpp:701] The server socket on [ip-10-200-31-5.ec2.internal]:40891 is not yet listening (errno: 111 - Connection refused), will retry.
[W socket.cpp:701] The server socket on [ip-10-200-31-5.ec2.internal]:40891 is not yet listening (errno: 111 - Connection refused), will retry.
[W socket.cpp:701] The server socket on [ip-10-200-31-5.ec2.internal]:40891 is not yet listening (errno: 111 - Connection refused), will retry.
[W socket.cpp:701] The server socket on [ip-10-200-31-5.ec2.internal]:40891 is not yet listening (errno: 111 - Connection refused), will retry.
[W socket.cpp:701] The server socket on [ip-10-200-31-5.ec2.internal]:40891 is not yet listening (errno: 111 - Connection refused), will retry.
[W socket.cpp:701] The server socket on [ip-10-200-31-5.ec2.internal]:40891 is not yet listening (errno: 111 - Connection refused), will retry.
REPLICATE config: False -> MultiUseParameterConfig.TRANSMIT
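(The line above means the REPLICATE flag was off, so multiply-used parameters resolve to MultiUseParameterConfig.TRANSMIT. Below, the test prints the traced pipeline. A rough sketch of how such a pipelined GraphModule is produced, assuming the pippy.IR front end of the time (Pipe.from_tracing, pipe_split); the ExampleCode model and its dimensions are hypothetical stand-ins, not the actual test model:

    import torch
    from pippy.IR import Pipe, pipe_split, MultiUseParameterConfig

    class ExampleCode(torch.nn.Module):  # hypothetical stand-in model
        def __init__(self):
            super().__init__()
            self.lin0 = torch.nn.Linear(512, 512)
            self.lin1 = torch.nn.Linear(512, 512)
            self.lin2 = torch.nn.Linear(512, 512)

        def forward(self, x):
            x = self.lin0(x)
            pipe_split()  # boundary between submod_0 and submod_1
            x = self.lin1(x)
            pipe_split()  # boundary between submod_1 and submod_2
            return self.lin2(x)

    # TRANSMIT (as logged above) is understood to have the owning stage send a
    # multiply-used parameter to consuming stages rather than replicating it.
    pipe = Pipe.from_tracing(ExampleCode(), MultiUseParameterConfig.TRANSMIT)
    print(pipe)  # a GraphModule with submod_0..submod_2, as printed below
)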
GraphModule(
  (submod_0): GraphModule()
  (submod_1): GraphModule()
  (submod_2): GraphModule()
  (_loss): MSELoss()
)
def forward(self, x, target):
    submod_0 = self.submod_0(x)
    getitem_2 = submod_0[2]
    getitem = submod_0[0]
    getitem_1 = submod_0[1]
    submod_1 = self.submod_1(getitem, getitem_2)
    getitem_4 = submod_1[1]
    getitem_3 = submod_1[0]
    submod_2 = self.submod_2(getitem_3, getitem_1, getitem_4)
    _loss = self._loss(submod_2, target)
    stage_backward = pippy_IR_stage_backward(stage_output = _loss, output_grads = None, input_values = [submod_2, target]); target = None
    getitem_5 = stage_backward[0]
    getitem_6 = stage_backward[1]; stage_backward = None
    stage_backward_1 = pippy_IR_stage_backward(stage_output = submod_2, output_grads = getitem_5, input_values = [getitem_3, getitem_1, getitem_4]); submod_2 = getitem_5 = getitem_3 = getitem_1 = getitem_4 = None
    getitem_7 = stage_backward_1[0]
    getitem_8 = stage_backward_1[1]
    getitem_9 = stage_backward_1[2]; stage_backward_1 = None
    stage_backward_2 = pippy_IR_stage_backward(stage_output = submod_1, output_grads = [getitem_7, getitem_9], input_values = [getitem, getitem_2]); submod_1 = getitem_7 = getitem_9 = getitem = getitem_2 = None
    getitem_10 = stage_backward_2[0]
    getitem_11 = stage_backward_2[1]; stage_backward_2 = None
    stage_backward_3 = pippy_IR_stage_backward(stage_output = submod_0, output_grads = [getitem_10, getitem_8, getitem_11], input_values = [x]); submod_0 = getitem_10 = getitem_8 = getitem_11 = x = None
    getitem_12 = stage_backward_3[0]; stage_backward_3 = None
    return _loss
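(The pippy_IR_stage_backward calls in the generated code above run the backward pass stage by stage, mirroring the forward calls in reverse. A conceptual sketch of what each call computes, not PiPPy's actual implementation: given a stage's output, the gradients flowing into that output, and the values that fed the stage, return the gradients with respect to those input values so they can be routed to the upstream stage:

    import torch

    def stage_backward(stage_output, output_grads, input_values):
        # Only tensors that participate in autograd get a gradient; e.g. in
        # the loss stage above, input_values is [submod_2, target], and
        # target requires no grad, so its slot comes back as None.
        diff_inputs = [v for v in input_values
                       if isinstance(v, torch.Tensor) and v.requires_grad]
        grads = iter(torch.autograd.grad(
            outputs=stage_output,       # a tensor or tuple of stage outputs
            inputs=diff_inputs,
            grad_outputs=output_grads,  # None for the scalar loss stage
            allow_unused=True))
        return tuple(next(grads)
                     if (isinstance(v, torch.Tensor) and v.requires_grad)
                     else None
                     for v in input_values)
)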
/fsx/users/jamesreed/pipeline_for_real/pippy/PipelineDriver.py:394: UserWarning: Running pipeline with 3 stages on world_size of 10. Remaining ranks will be idle.
  warnings.warn(f'Running pipeline with {len(executor_descriptors)} stages on world_size of {self.world_size}. '
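(This warning is benign: the pipeline has only 3 stages while the job was launched with world_size 10, so the remaining ranks sit idle inside the RPC group. A minimal sketch of that launch pattern; run_worker is an illustrative name, not from the test script. Every rank joins init_rpc, but only rank 0 drives the pipeline:

    import torch.distributed.rpc as rpc

    def run_worker(rank: int, world_size: int):
        rpc.init_rpc(f"worker{rank}", rank=rank, world_size=world_size)
        if rank == 0:
            pass  # rank 0 builds the PipelineDriver and calls .run(...)
        rpc.shutdown()  # all ranks block here until the driver finishes
)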
(22440) Instantiating OwnerRRef GloballyUniqueId(created_on=0, local_id=0) with future 0x55ed480483b0
(22440) Populating OwnerRRef GloballyUniqueId(created_on=0, local_id=0) with future 0x55ed480483b0
(22440) Instantiating OwnerRRef GloballyUniqueId(created_on=0, local_id=6) with future 0x55ed4809c660
(22440) Instantiating OwnerRRef GloballyUniqueId(created_on=0, local_id=8) with future 0x55ed480778d0
(22440) Instantiating OwnerRRef GloballyUniqueId(created_on=0, local_id=10) with future 0x55ed48078220
(22440) Instantiating OwnerRRef GloballyUniqueId(created_on=0, local_id=12) with future 0x55ed48065bd0
(22440) Populating OwnerRRef GloballyUniqueId(created_on=0, local_id=6) with future 0x55ed4809c660
(22440) Populating OwnerRRef GloballyUniqueId(created_on=0, local_id=8) with future 0x55ed480778d0
(22440) Populating OwnerRRef GloballyUniqueId(created_on=0, local_id=10) with future 0x55ed48078220
(22440) Populating OwnerRRef GloballyUniqueId(created_on=0, local_id=12) with future 0x55ed48065bd0
(22444) Instantiating OwnerRRef GloballyUniqueId(created_on=0, local_id=1) with future 0x7f2374006750
(22444) Populating OwnerRRef GloballyUniqueId(created_on=0, local_id=1) with future 0x7f2374006750
(22444) Instantiating OwnerRRef GloballyUniqueId(created_on=1, local_id=0) with future 0x7f2370008300
(22444) Populating OwnerRRef GloballyUniqueId(created_on=1, local_id=0) with future 0x7f2370008300
(22444) Instantiating OwnerRRef GloballyUniqueId(created_on=0, local_id=17) with future 0x7f2370007160
(22444) Instantiating OwnerRRef GloballyUniqueId(created_on=1, local_id=3) with future 0x7f2370008730
(22444) Populating OwnerRRef GloballyUniqueId(created_on=1, local_id=3) with future 0x7f2370008730
(22445) Instantiating OwnerRRef GloballyUniqueId(created_on=0, local_id=3) with future 0x7fe9c4006750
(22445) Populating OwnerRRef GloballyUniqueId(created_on=0, local_id=3) with future 0x7fe9c4006750
^^^^ Scenario 1
(22445) Instantiating OwnerRRef GloballyUniqueId(created_on=0, local_id=30) with future 0x7fe9b4007870
(22445) Instantiating OwnerRRef GloballyUniqueId(created_on=2, local_id=0) with future 0x7fe9b4009970
(22445) Populating OwnerRRef GloballyUniqueId(created_on=2, local_id=0) with future 0x7fe9b4009970
(22445) Instantiating OwnerRRef GloballyUniqueId(created_on=0, local_id=34) with future 0x7fe9ac007f00
^^^^ Scenario 2
(22440) Instantiating OwnerRRef GloballyUniqueId(created_on=0, local_id=83) with future 0x55ed481193d0
(22445) Instantiating OwnerRRef GloballyUniqueId(created_on=2, local_id=3) with future 0x7fe9ac00a310
(22445) Populating OwnerRRef GloballyUniqueId(created_on=2, local_id=3) with future 0x7fe9ac00a310
(22440) Instantiating OwnerRRef GloballyUniqueId(created_on=0, local_id=85) with future 0x55ed481365e0
(22445) Instantiating OwnerRRef GloballyUniqueId(created_on=2, local_id=6) with future 0x7fe9ac00a760
(22445) Populating OwnerRRef GloballyUniqueId(created_on=2, local_id=6) with future 0x7fe9ac00a760
(22445) Instantiating OwnerRRef GloballyUniqueId(created_on=0, local_id=39) with future 0x7fe9ac00c130
(22444) Instantiating OwnerRRef GloballyUniqueId(created_on=0, local_id=70) with future 0x7f23500068c0
^^^^ Scenario 2
(22445) Instantiating OwnerRRef GloballyUniqueId(created_on=0, local_id=42) with future 0x7fe998008bf0
(22445) Instantiating OwnerRRef GloballyUniqueId(created_on=0, local_id=53) with future 0x7fe9ac0082d0
(22440) Instantiating OwnerRRef GloballyUniqueId(created_on=0, local_id=86) with future 0x7fcc200064a0
(22440) Populating OwnerRRef GloballyUniqueId(created_on=0, local_id=86) with future 0x7fcc200064a0
(22444) Populating OwnerRRef GloballyUniqueId(created_on=0, local_id=17) with future 0x7f2370007160
(22444) Instantiating OwnerRRef GloballyUniqueId(created_on=0, local_id=23) with future 0x7f2368007880
(22444) Populating OwnerRRef GloballyUniqueId(created_on=0, local_id=23) with future 0x7f2368007880
(22444) Instantiating OwnerRRef GloballyUniqueId(created_on=0, local_id=20) with future 0x7f2364007fa0
(22444) Populating OwnerRRef GloballyUniqueId(created_on=0, local_id=20) with future 0x7f2364007fa0
(22440) Instantiating OwnerRRef GloballyUniqueId(created_on=0, local_id=89) with future 0x7fcc20008e60
(22440) Populating OwnerRRef GloballyUniqueId(created_on=0, local_id=89) with future 0x7fcc20008e60
(22440) Instantiating OwnerRRef GloballyUniqueId(created_on=0, local_id=93) with future 0x7fcc20009d20
(22440) Populating OwnerRRef GloballyUniqueId(created_on=0, local_id=93) with future 0x7fcc20009d20
(22440) Instantiating OwnerRRef GloballyUniqueId(created_on=0, local_id=96) with future 0x7fcc2000a6e0
(22440) Populating OwnerRRef GloballyUniqueId(created_on=0, local_id=96) with future 0x7fcc2000a6e0
(22444) Instantiating OwnerRRef GloballyUniqueId(created_on=0, local_id=73) with future 0x7f23ec012fd0
(22445) Instantiating OwnerRRef GloballyUniqueId(created_on=2, local_id=9) with future 0x7fe9b0006970
(22445) Populating OwnerRRef GloballyUniqueId(created_on=2, local_id=9) with future 0x7fe9b0006970
(22444) Instantiating OwnerRRef GloballyUniqueId(created_on=1, local_id=8) with future 0x7f2340001a90
(22444) Populating OwnerRRef GloballyUniqueId(created_on=1, local_id=8) with future 0x7f2340001a90
(22444) Instantiating OwnerRRef GloballyUniqueId(created_on=0, local_id=76) with future 0x7f23440075d0
(22445) Instantiating OwnerRRef GloballyUniqueId(created_on=0, local_id=59) with future 0x7fe9b40072d0
^^^^ Scenario 2
(22444) Instantiating OwnerRRef GloballyUniqueId(created_on=1, local_id=11) with future 0x7f234c006760
(22444) Populating OwnerRRef GloballyUniqueId(created_on=1, local_id=11) with future 0x7f234c006760
(22444) Instantiating OwnerRRef GloballyUniqueId(created_on=1, local_id=15) with future 0x7f23780098a0
(22444) Populating OwnerRRef GloballyUniqueId(created_on=1, local_id=15) with future 0x7f23780098a0
(22445) Instantiating OwnerRRef GloballyUniqueId(created_on=2, local_id=12) with future 0x7fe990001680
(22445) Populating OwnerRRef GloballyUniqueId(created_on=2, local_id=12) with future 0x7fe990001680
(22445) Instantiating OwnerRRef GloballyUniqueId(created_on=2, local_id=15) with future 0x7fe9c4006b40
(22445) Populating OwnerRRef GloballyUniqueId(created_on=2, local_id=15) with future 0x7fe9c4006b40
(22444) Instantiating OwnerRRef GloballyUniqueId(created_on=1, local_id=18) with future 0x7f237800a720
(22444) Populating OwnerRRef GloballyUniqueId(created_on=1, local_id=18) with future 0x7f237800a720
^^^^ Scenario 2
(22445) Instantiating OwnerRRef GloballyUniqueId(created_on=0, local_id=56) with future 0x7fe9ac009c90
(22444) Instantiating OwnerRRef GloballyUniqueId(created_on=1, local_id=23) with future 0x7f23780270a0
(22444) Populating OwnerRRef GloballyUniqueId(created_on=1, local_id=23) with future 0x7f23780270a0
^^^^ Scenario 2
(22445) Instantiating OwnerRRef GloballyUniqueId(created_on=2, local_id=21) with future 0x7fe990008540
(22445) Populating OwnerRRef GloballyUniqueId(created_on=2, local_id=21) with future 0x7fe990008540
(22445) Instantiating OwnerRRef GloballyUniqueId(created_on=0, local_id=62) with future 0x7fea3c006f80
(22445) Instantiating OwnerRRef GloballyUniqueId(created_on=2, local_id=25) with future 0x7fe9c8007ca0
(22445) Populating OwnerRRef GloballyUniqueId(created_on=2, local_id=25) with future 0x7fe9c8007ca0
(22445) Instantiating OwnerRRef GloballyUniqueId(created_on=2, local_id=27) with future 0x7fe990009380
(22445) Populating OwnerRRef GloballyUniqueId(created_on=2, local_id=27) with future 0x7fe990009380
^^^^ Scenario 2
(22445) Instantiating OwnerRRef GloballyUniqueId(created_on=2, local_id=31) with future 0x7fe990008be0
(22445) Populating OwnerRRef GloballyUniqueId(created_on=2, local_id=31) with future 0x7fe990008be0
^^^^ Scenario 2
^^^^ Scenario 1
(22445) Instantiating OwnerRRef GloballyUniqueId(created_on=0, local_id=30) with future 0x7fe9c80096e0
(22445) Instantiating OwnerRRef GloballyUniqueId(created_on=2, local_id=36) with future 0x7fe99000adf0
(22445) Populating OwnerRRef GloballyUniqueId(created_on=2, local_id=36) with future 0x7fe99000adf0
^^^^ Scenario 2
(22445) Populating OwnerRRef GloballyUniqueId(created_on=0, local_id=30) with future 0x7fe9c80096e0
Traceback (most recent call last):
  File "/fsx/users/jamesreed/pipeline_for_real/test/local_test_forward_backward.py", line 105, in <module>
    out = pipe_driver.run(input, target, chunks=CHUNKS, _debug_mask_minibatches = DEBUG_MASK_MINIBATCHES)
  File "/fsx/users/jamesreed/pipeline_for_real/pippy/PipelineDriver.py", line 586, in run
    return self._retrieve_output_values(microbatch_interpreters, last_nodes, _debug_mask_minibatches, splits_per_arg)
  File "/fsx/users/jamesreed/pipeline_for_real/pippy/PipelineDriver.py", line 596, in _retrieve_output_values
    local_results = [to_here(result) for result in output_vals]
  File "/fsx/users/jamesreed/pipeline_for_real/pippy/PipelineDriver.py", line 596, in <listcomp>
    local_results = [to_here(result) for result in output_vals]
  File "/fsx/users/jamesreed/pipeline_for_real/pippy/PipelineDriver.py", line 45, in to_here
    return a.to_here()
RuntimeError: RPCErr:1:RPC ran for more than set timeout (60000 ms) and will now be marked with an error
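(This RuntimeError is the default 60 s RPC timeout firing while rank 0 blocks in to_here() waiting on a value that never arrives. Given the "Instantiating" lines above with no matching "Populating", this looks like a hang rather than mere slowness, so raising the timeout mainly buys debugging time. A sketch of how the timeout could be raised using the standard TensorPipe backend options; the worker name, rank, and value below are illustrative:

    import torch.distributed.rpc as rpc

    opts = rpc.TensorPipeRpcBackendOptions(rpc_timeout=600)  # seconds; default 60
    rpc.init_rpc("worker0", rank=0, world_size=10, rpc_backend_options=opts)
)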
terminate called without an active exception
terminate called recursively
terminate called recursively
[W tensorpipe_agent.cpp:682] RPC agent for worker9 encountered error when reading incoming request from worker0: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
[W tensorpipe_agent.cpp:682] RPC agent for worker2 encountered error when reading incoming request from worker0: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
[W tensorpipe_agent.cpp:682] RPC agent for worker1 encountered error when reading incoming request from worker0: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
[W tensorpipe_agent.cpp:682] RPC agent for worker3 encountered error when reading incoming request from worker0: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
[W tensorpipe_agent.cpp:682] RPC agent for worker4 encountered error when reading incoming request from worker0: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
[W tensorpipe_agent.cpp:682] RPC agent for worker5 encountered error when reading incoming request from worker0: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
[W tensorpipe_agent.cpp:682] RPC agent for worker7 encountered error when reading incoming request from worker0: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
[W tensorpipe_agent.cpp:682] RPC agent for worker6 encountered error when reading incoming request from worker0: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
[W tensorpipe_agent.cpp:682] RPC agent for worker8 encountered error when reading incoming request from worker0: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 22444 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 22445 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 22446 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 22447 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 22448 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 22449 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 22450 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 22451 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 22452 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 22440) of binary: /fsx/users/jamesreed/conda/bin/python
Traceback (most recent call last):
  File "/fsx/users/jamesreed/conda/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch', 'console_scripts', 'torchrun')())
  File "/fsx/users/jamesreed/pytorch/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/fsx/users/jamesreed/pytorch/torch/distributed/run.py", line 724, in main
    run(args)
  File "/fsx/users/jamesreed/pytorch/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/fsx/users/jamesreed/pytorch/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/fsx/users/jamesreed/pytorch/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/fsx/users/jamesreed/pipeline_for_real/test/local_test_forward_backward.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-02-23_20:37:24
  host      : ip-10-200-31-5.ec2.internal
  rank      : 0 (local_rank: 0)
  exitcode  : -6 (pid: 22440)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 22440
============================================================