Created
June 21, 2024 19:39
nccl error
0%| | 0/4484 [00:00<?, ?it/s][rank1]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=680, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600013 milliseconds before timing out. | |
[rank6]:[E ProcessGroupNCCL.cpp:563] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=680, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600011 milliseconds before timing out. | |
[rank5]:[E ProcessGroupNCCL.cpp:563] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=680, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600048 milliseconds before timing out. | |
[rank2]:[E ProcessGroupNCCL.cpp:563] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=680, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600050 milliseconds before timing out. | |
[rank3]:[E ProcessGroupNCCL.cpp:563] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=678, OpType=_ALLGATHER_BASE, NumelIn=4096, NumelOut=32768, Timeout(ms)=600000) ran for 600157 milliseconds before timing out. | |
[rank4]:[E ProcessGroupNCCL.cpp:563] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=680, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600063 milliseconds before timing out. | |
[rank1]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 1] Timeout at NCCL work: 680, last enqueued NCCL work: 680, last completed NCCL work: 679. | |
[rank1]:[E ProcessGroupNCCL.cpp:577] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. | |
[rank1]:[E ProcessGroupNCCL.cpp:583] [Rank 1] To avoid data inconsistency, we are taking the entire process down. | |
[rank1]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=680, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600013 milliseconds before timing out. | |
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): | |
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f03c09a3897 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libc10.so) | |
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f03c1c7cc62 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) | |
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f03c1c81a80 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) | |
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f03c1c82dcc in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) | |
frame #4: <unknown function> + 0xd3e95 (0x7f040d736e95 in /opt/conda/bin/../lib/libstdc++.so.6) | |
frame #5: <unknown function> + 0x7ea7 (0x7f040e8f2ea7 in /lib/x86_64-linux-gnu/libpthread.so.0) | |
frame #6: clone + 0x3f (0x7f040e6c3a6f in /lib/x86_64-linux-gnu/libc.so.6) | |
terminate called after throwing an instance of 'c10::DistBackendError' | |
what(): [PG 0 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=680, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600013 milliseconds before timing out. | |
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): | |
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f03c09a3897 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libc10.so) | |
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f03c1c7cc62 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) | |
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f03c1c81a80 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) | |
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f03c1c82dcc in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) | |
frame #4: <unknown function> + 0xd3e95 (0x7f040d736e95 in /opt/conda/bin/../lib/libstdc++.so.6) | |
frame #5: <unknown function> + 0x7ea7 (0x7f040e8f2ea7 in /lib/x86_64-linux-gnu/libpthread.so.0) | |
frame #6: clone + 0x3f (0x7f040e6c3a6f in /lib/x86_64-linux-gnu/libc.so.6) | |
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): | |
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f03c09a3897 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libc10.so) | |
frame #1: <unknown function> + 0xe32119 (0x7f03c1906119 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) | |
frame #2: <unknown function> + 0xd3e95 (0x7f040d736e95 in /opt/conda/bin/../lib/libstdc++.so.6) | |
frame #3: <unknown function> + 0x7ea7 (0x7f040e8f2ea7 in /lib/x86_64-linux-gnu/libpthread.so.0) | |
frame #4: clone + 0x3f (0x7f040e6c3a6f in /lib/x86_64-linux-gnu/libc.so.6) | |
[rank6]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 6] Timeout at NCCL work: 680, last enqueued NCCL work: 680, last completed NCCL work: 679. | |
[rank6]:[E ProcessGroupNCCL.cpp:577] [Rank 6] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. | |
[rank6]:[E ProcessGroupNCCL.cpp:583] [Rank 6] To avoid data inconsistency, we are taking the entire process down. | |
[rank6]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 6] Process group watchdog thread terminated with exception: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=680, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600011 milliseconds before timing out. | |
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): | |
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fc466710897 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libc10.so) | |
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fc4679e9c62 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) | |
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fc4679eea80 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) | |
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fc4679efdcc in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) | |
frame #4: <unknown function> + 0xd3e95 (0x7fc4b34a3e95 in /opt/conda/bin/../lib/libstdc++.so.6) | |
frame #5: <unknown function> + 0x7ea7 (0x7fc4b465fea7 in /lib/x86_64-linux-gnu/libpthread.so.0) | |
frame #6: clone + 0x3f (0x7fc4b4430a6f in /lib/x86_64-linux-gnu/libc.so.6) | |
terminate called after throwing an instance of 'c10::DistBackendError' | |
what(): [PG 0 Rank 6] Process group watchdog thread terminated with exception: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=680, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600011 milliseconds before timing out. | |
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): | |
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fc466710897 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libc10.so) | |
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fc4679e9c62 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) | |
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fc4679eea80 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) | |
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fc4679efdcc in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) | |
frame #4: <unknown function> + 0xd3e95 (0x7fc4b34a3e95 in /opt/conda/bin/../lib/libstdc++.so.6) | |
frame #5: <unknown function> + 0x7ea7 (0x7fc4b465fea7 in /lib/x86_64-linux-gnu/libpthread.so.0) | |
frame #6: clone + 0x3f (0x7fc4b4430a6f in /lib/x86_64-linux-gnu/libc.so.6) | |
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): | |
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fc466710897 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libc10.so) | |
frame #1: <unknown function> + 0xe32119 (0x7fc467673119 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) | |
frame #2: <unknown function> + 0xd3e95 (0x7fc4b34a3e95 in /opt/conda/bin/../lib/libstdc++.so.6) | |
frame #3: <unknown function> + 0x7ea7 (0x7fc4b465fea7 in /lib/x86_64-linux-gnu/libpthread.so.0) | |
frame #4: clone + 0x3f (0x7fc4b4430a6f in /lib/x86_64-linux-gnu/libc.so.6) | |
[rank5]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 5] Timeout at NCCL work: 680, last enqueued NCCL work: 680, last completed NCCL work: 679. | |
[rank5]:[E ProcessGroupNCCL.cpp:577] [Rank 5] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. | |
[rank5]:[E ProcessGroupNCCL.cpp:583] [Rank 5] To avoid data inconsistency, we are taking the entire process down. | |
[rank5]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 5] Process group watchdog thread terminated with exception: [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=680, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600048 milliseconds before timing out. | |
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): | |
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f8f3baa5897 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libc10.so) | |
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f8f3cd7ec62 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) | |
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f8f3cd83a80 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) | |
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f8f3cd84dcc in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) | |
frame #4: <unknown function> + 0xd3e95 (0x7f8f88838e95 in /opt/conda/bin/../lib/libstdc++.so.6) | |
frame #5: <unknown function> + 0x7ea7 (0x7f8f899f4ea7 in /lib/x86_64-linux-gnu/libpthread.so.0) | |
frame #6: clone + 0x3f (0x7f8f897c5a6f in /lib/x86_64-linux-gnu/libc.so.6) | |
terminate called after throwing an instance of 'c10::DistBackendError' | |
what(): [PG 0 Rank 5] Process group watchdog thread terminated with exception: [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=680, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600048 milliseconds before timing out. | |
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): | |
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f8f3baa5897 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libc10.so) | |
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f8f3cd7ec62 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) | |
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f8f3cd83a80 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) | |
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f8f3cd84dcc in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) | |
frame #4: <unknown function> + 0xd3e95 (0x7f8f88838e95 in /opt/conda/bin/../lib/libstdc++.so.6) | |
frame #5: <unknown function> + 0x7ea7 (0x7f8f899f4ea7 in /lib/x86_64-linux-gnu/libpthread.so.0) | |
frame #6: clone + 0x3f (0x7f8f897c5a6f in /lib/x86_64-linux-gnu/libc.so.6) | |
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): | |
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f8f3baa5897 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libc10.so) | |
frame #1: <unknown function> + 0xe32119 (0x7f8f3ca08119 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) | |
frame #2: <unknown function> + 0xd3e95 (0x7f8f88838e95 in /opt/conda/bin/../lib/libstdc++.so.6) | |
frame #3: <unknown function> + 0x7ea7 (0x7f8f899f4ea7 in /lib/x86_64-linux-gnu/libpthread.so.0) | |
frame #4: clone + 0x3f (0x7f8f897c5a6f in /lib/x86_64-linux-gnu/libc.so.6) | |
[rank2]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 2] Timeout at NCCL work: 680, last enqueued NCCL work: 680, last completed NCCL work: 679. | |
[rank2]:[E ProcessGroupNCCL.cpp:577] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. | |
[rank2]:[E ProcessGroupNCCL.cpp:583] [Rank 2] To avoid data inconsistency, we are taking the entire process down. | |
[rank2]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=680, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600050 milliseconds before timing out. | |
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): | |
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f4a7b412897 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libc10.so) | |
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f4a7c6ebc62 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) | |
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f4a7c6f0a80 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) | |
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f4a7c6f1dcc in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) | |
frame #4: <unknown function> + 0xd3e95 (0x7f4ac81a5e95 in /opt/conda/bin/../lib/libstdc++.so.6) | |
frame #5: <unknown function> + 0x7ea7 (0x7f4ac9361ea7 in /lib/x86_64-linux-gnu/libpthread.so.0) | |
frame #6: clone + 0x3f (0x7f4ac9132a6f in /lib/x86_64-linux-gnu/libc.so.6) | |
terminate called after throwing an instance of 'c10::DistBackendError' | |
what(): [PG 0 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=680, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600050 milliseconds before timing out. | |
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): | |
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f4a7b412897 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libc10.so) | |
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f4a7c6ebc62 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) | |
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f4a7c6f0a80 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) | |
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f4a7c6f1dcc in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) | |
frame #4: <unknown function> + 0xd3e95 (0x7f4ac81a5e95 in /opt/conda/bin/../lib/libstdc++.so.6) | |
frame #5: <unknown function> + 0x7ea7 (0x7f4ac9361ea7 in /lib/x86_64-linux-gnu/libpthread.so.0) | |
frame #6: clone + 0x3f (0x7f4ac9132a6f in /lib/x86_64-linux-gnu/libc.so.6) | |
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): | |
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f4a7b412897 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libc10.so) | |
frame #1: <unknown function> + 0xe32119 (0x7f4a7c375119 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) | |
frame #2: <unknown function> + 0xd3e95 (0x7f4ac81a5e95 in /opt/conda/bin/../lib/libstdc++.so.6) | |
frame #3: <unknown function> + 0x7ea7 (0x7f4ac9361ea7 in /lib/x86_64-linux-gnu/libpthread.so.0) | |
frame #4: clone + 0x3f (0x7f4ac9132a6f in /lib/x86_64-linux-gnu/libc.so.6) | |
[rank4]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 4] Timeout at NCCL work: 680, last enqueued NCCL work: 682, last completed NCCL work: 679. | |
[rank4]:[E ProcessGroupNCCL.cpp:577] [Rank 4] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. | |
[rank4]:[E ProcessGroupNCCL.cpp:583] [Rank 4] To avoid data inconsistency, we are taking the entire process down. | |
[rank4]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 4] Process group watchdog thread terminated with exception: [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=680, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600063 milliseconds before timing out. | |
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): | |
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f59c1cd8897 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libc10.so) | |
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f59c2fb1c62 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) | |
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f59c2fb6a80 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) | |
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f59c2fb7dcc in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) | |
frame #4: <unknown function> + 0xd3e95 (0x7f5a0ea6be95 in /opt/conda/bin/../lib/libstdc++.so.6) | |
frame #5: <unknown function> + 0x7ea7 (0x7f5a0fc27ea7 in /lib/x86_64-linux-gnu/libpthread.so.0) | |
frame #6: clone + 0x3f (0x7f5a0f9f8a6f in /lib/x86_64-linux-gnu/libc.so.6) | |
terminate called after throwing an instance of 'c10::DistBackendError' | |
what(): [PG 0 Rank 4] Process group watchdog thread terminated with exception: [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=680, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600063 milliseconds before timing out. | |
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): | |
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f59c1cd8897 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libc10.so) | |
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f59c2fb1c62 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) | |
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f59c2fb6a80 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) | |
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f59c2fb7dcc in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) | |
frame #4: <unknown function> + 0xd3e95 (0x7f5a0ea6be95 in /opt/conda/bin/../lib/libstdc++.so.6) | |
frame #5: <unknown function> + 0x7ea7 (0x7f5a0fc27ea7 in /lib/x86_64-linux-gnu/libpthread.so.0) | |
frame #6: clone + 0x3f (0x7f5a0f9f8a6f in /lib/x86_64-linux-gnu/libc.so.6) | |
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): | |
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f59c1cd8897 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libc10.so) | |
frame #1: <unknown function> + 0xe32119 (0x7f59c2c3b119 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) | |
frame #2: <unknown function> + 0xd3e95 (0x7f5a0ea6be95 in /opt/conda/bin/../lib/libstdc++.so.6) | |
frame #3: <unknown function> + 0x7ea7 (0x7f5a0fc27ea7 in /lib/x86_64-linux-gnu/libpthread.so.0) | |
frame #4: clone + 0x3f (0x7f5a0f9f8a6f in /lib/x86_64-linux-gnu/libc.so.6) | |
[rank3]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 3] Timeout at NCCL work: 678, last enqueued NCCL work: 680, last completed NCCL work: 677. | |
[rank3]:[E ProcessGroupNCCL.cpp:577] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. | |
[rank3]:[E ProcessGroupNCCL.cpp:583] [Rank 3] To avoid data inconsistency, we are taking the entire process down. | |
[rank3]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=678, OpType=_ALLGATHER_BASE, NumelIn=4096, NumelOut=32768, Timeout(ms)=600000) ran for 600157 milliseconds before timing out. | |
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): | |
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f92cae6a897 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libc10.so) | |
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f92cc143c62 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) | |
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f92cc148a80 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) | |
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f92cc149dcc in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) | |
frame #4: <unknown function> + 0xd3e95 (0x7f9317bfde95 in /opt/conda/bin/../lib/libstdc++.so.6) | |
frame #5: <unknown function> + 0x7ea7 (0x7f9318db9ea7 in /lib/x86_64-linux-gnu/libpthread.so.0) | |
frame #6: clone + 0x3f (0x7f9318b8aa6f in /lib/x86_64-linux-gnu/libc.so.6) | |
terminate called after throwing an instance of 'c10::DistBackendError' | |
what(): [PG 0 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=678, OpType=_ALLGATHER_BASE, NumelIn=4096, NumelOut=32768, Timeout(ms)=600000) ran for 600157 milliseconds before timing out. | |
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first): | |
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f92cae6a897 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libc10.so) | |
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f92cc143c62 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) | |
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f92cc148a80 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) | |
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f92cc149dcc in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) | |
frame #4: <unknown function> + 0xd3e95 (0x7f9317bfde95 in /opt/conda/bin/../lib/libstdc++.so.6) | |
frame #5: <unknown function> + 0x7ea7 (0x7f9318db9ea7 in /lib/x86_64-linux-gnu/libpthread.so.0) | |
frame #6: clone + 0x3f (0x7f9318b8aa6f in /lib/x86_64-linux-gnu/libc.so.6) | |
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first): | |
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f92cae6a897 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libc10.so) | |
frame #1: <unknown function> + 0xe32119 (0x7f92cbdcd119 in /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) | |
frame #2: <unknown function> + 0xd3e95 (0x7f9317bfde95 in /opt/conda/bin/../lib/libstdc++.so.6) | |
frame #3: <unknown function> + 0x7ea7 (0x7f9318db9ea7 in /lib/x86_64-linux-gnu/libpthread.so.0) | |
frame #4: clone + 0x3f (0x7f9318b8aa6f in /lib/x86_64-linux-gnu/libc.so.6) | |
W0621 19:37:31.172000 140614666856256 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1121510 closing signal SIGTERM | |
W0621 19:37:31.172000 140614666856256 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1121517 closing signal SIGTERM | |
E0621 19:37:32.302000 140614666856256 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -6) local_rank: 1 (pid: 1121511) of binary: /home/ethan/.cache/pypoetry/virtualenvs/mathesis-rTKd3EqQ-py3.10/bin/python | |
Traceback (most recent call last):
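The watchdog lines above can be summarized per rank to see which ranks were stuck on which collective. The following is a minimal sketch, not part of the original log or of PyTorch: the regex is written against the message format seen in this log only, and the names `TIMEOUT_RE`, `summarize_timeouts`, `find_outliers`, and `SAMPLE` are hypothetical helpers introduced for illustration. A rank whose `SeqNum`/`OpType` differs from the majority is a hint about where the ranks diverged, not proof of the root cause.

```python
import re
from collections import Counter

# Matches the "[Rank N] Watchdog caught collective operation timeout" messages
# in the form they take in this log; other PyTorch versions may word them differently.
TIMEOUT_RE = re.compile(
    r"\[Rank (?P<rank>\d+)\] Watchdog caught collective operation timeout: "
    r"WorkNCCL\(SeqNum=(?P<seq>\d+), OpType=(?P<op>\w+), "
    r"NumelIn=(?P<numel_in>\d+), NumelOut=(?P<numel_out>\d+), "
    r"Timeout\(ms\)=(?P<timeout_ms>\d+)\)"
)


def summarize_timeouts(log_text):
    """Return {rank: (seq_num, op_type)} for every rank that reported a timeout."""
    summary = {}
    for match in TIMEOUT_RE.finditer(log_text):
        # The same message repeats several times per rank; later copies
        # carry the same values, so overwriting is harmless.
        summary[int(match["rank"])] = (int(match["seq"]), match["op"])
    return summary


def find_outliers(summary):
    """Return the ranks whose (SeqNum, OpType) differs from the most common pair."""
    if not summary:
        return {}
    majority, _count = Counter(summary.values()).most_common(1)[0]
    return {rank: info for rank, info in summary.items() if info != majority}


# Three lines copied from the log above, used as a small self-check.
SAMPLE = (
    "[rank1]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: "
    "WorkNCCL(SeqNum=680, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) "
    "ran for 600013 milliseconds before timing out.\n"
    "[rank6]:[E ProcessGroupNCCL.cpp:563] [Rank 6] Watchdog caught collective operation timeout: "
    "WorkNCCL(SeqNum=680, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) "
    "ran for 600011 milliseconds before timing out.\n"
    "[rank3]:[E ProcessGroupNCCL.cpp:563] [Rank 3] Watchdog caught collective operation timeout: "
    "WorkNCCL(SeqNum=678, OpType=_ALLGATHER_BASE, NumelIn=4096, NumelOut=32768, Timeout(ms)=600000) "
    "ran for 600157 milliseconds before timing out.\n"
)

if __name__ == "__main__":
    summary = summarize_timeouts(SAMPLE)
    for rank in sorted(summary):
        print(rank, summary[rank])
    print("outliers:", find_outliers(summary))
```

Run against the full log, this would show rank 3 timing out on `_ALLGATHER_BASE` at SeqNum 678 while ranks 1, 2, 4, 5, and 6 time out on `ALLREDUCE` at SeqNum 680. Ranks 0 and 7 report no watchdog timeout in the portion of the log captured here.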