@LunNova
Created December 15, 2024 19:19
I1215 08:32:37.869000 4070434 torch/_inductor/config.py:635] compile_threads set to 12 via env
using device: cuda:2
using device: cuda:1
using device: cuda:3
using device: cuda:5
using device: cuda:4
using device: cuda:0
tsukiakari-nixos:4070434:4070434 [0] NCCL INFO Bootstrap : Using eno1np0:10.5.5.236<0>
tsukiakari-nixos:4070434:4070434 [0] NCCL INFO NET/Plugin: No plugin found (librccl-net.so)
tsukiakari-nixos:4070434:4070434 [0] NCCL INFO NET/Plugin: Plugin load returned 2 : librccl-net.so: cannot open shared object file: No such file or directory : when loading librccl-net.so
tsukiakari-nixos:4070434:4070434 [0] NCCL INFO NET/Plugin: Using internal network plugin.
tsukiakari-nixos:4070434:4070434 [0] NCCL INFO Kernel version: 6.12.0
tsukiakari-nixos:4070435:4070435 [1] NCCL INFO ROCr version 1.1
tsukiakari-nixos:4070435:4070435 [1] NCCL INFO NCCL_DMABUF_ENABLE set by environment to 1.
tsukiakari-nixos:4070436:4070436 [2] NCCL INFO ROCr version 1.1
tsukiakari-nixos:4070434:4070434 [0] NCCL INFO ROCr version 1.1
tsukiakari-nixos:4070435:4070435 [1] NCCL INFO Could not open kernel conf file, will assume CONFIG_DMABUF_MOVE_NOTIFY and CONFIG_PCI_P2PDMA are enabled
tsukiakari-nixos:4070436:4070436 [2] NCCL INFO NCCL_DMABUF_ENABLE set by environment to 1.
tsukiakari-nixos:4070435:4070435 [1] NCCL INFO DMA_BUF Support Enabled
tsukiakari-nixos:4070434:4070434 [0] NCCL INFO NCCL_DMABUF_ENABLE set by environment to 1.
tsukiakari-nixos:4070436:4070436 [2] NCCL INFO Could not open kernel conf file, will assume CONFIG_DMABUF_MOVE_NOTIFY and CONFIG_PCI_P2PDMA are enabled
tsukiakari-nixos:4070437:4070437 [3] NCCL INFO ROCr version 1.1
tsukiakari-nixos:4070434:4070434 [0] NCCL INFO Could not open kernel conf file, will assume CONFIG_DMABUF_MOVE_NOTIFY and CONFIG_PCI_P2PDMA are enabled
tsukiakari-nixos:4070436:4070436 [2] NCCL INFO DMA_BUF Support Enabled
tsukiakari-nixos:4070434:4070434 [0] NCCL INFO DMA_BUF Support Enabled
tsukiakari-nixos:4070439:4070439 [5] NCCL INFO ROCr version 1.1
tsukiakari-nixos:4070437:4070437 [3] NCCL INFO NCCL_DMABUF_ENABLE set by environment to 1.
tsukiakari-nixos:4070439:4070439 [5] NCCL INFO NCCL_DMABUF_ENABLE set by environment to 1.
RCCL version : 2.21.5-Unknown
HIP version : 6.3.42131-
ROCm version : 6.2.2.0-9999-unknown
Hostname : tsukiakari-nixos
Librccl path : /nix/store/hc2saq7x6k17z58nx66aybbwvh4bbzlq-rccl-6.3.0/lib/librccl.so.1
tsukiakari-nixos:4070437:4070437 [3] NCCL INFO Could not open kernel conf file, will assume CONFIG_DMABUF_MOVE_NOTIFY and CONFIG_PCI_P2PDMA are enabled
tsukiakari-nixos:4070439:4070439 [5] NCCL INFO Could not open kernel conf file, will assume CONFIG_DMABUF_MOVE_NOTIFY and CONFIG_PCI_P2PDMA are enabled
tsukiakari-nixos:4070437:4070437 [3] NCCL INFO DMA_BUF Support Enabled
tsukiakari-nixos:4070439:4070439 [5] NCCL INFO DMA_BUF Support Enabled
tsukiakari-nixos:4070434:4070434 [0] NCCL INFO Comm config Blocking set to 0
tsukiakari-nixos:4070435:4070435 [1] NCCL INFO Bootstrap : Using eno1np0:10.5.5.236<0>
tsukiakari-nixos:4070439:4070439 [5] NCCL INFO Bootstrap : Using eno1np0:10.5.5.236<0>
tsukiakari-nixos:4070437:4070437 [3] NCCL INFO Bootstrap : Using eno1np0:10.5.5.236<0>
tsukiakari-nixos:4070435:4070435 [1] NCCL INFO NET/Plugin: No plugin found (librccl-net.so)
tsukiakari-nixos:4070436:4070436 [2] NCCL INFO Bootstrap : Using eno1np0:10.5.5.236<0>
tsukiakari-nixos:4070435:4070435 [1] NCCL INFO NET/Plugin: Plugin load returned 2 : librccl-net.so: cannot open shared object file: No such file or directory : when loading librccl-net.so
tsukiakari-nixos:4070439:4070439 [5] NCCL INFO NET/Plugin: No plugin found (librccl-net.so)
tsukiakari-nixos:4070435:4070435 [1] NCCL INFO NET/Plugin: Using internal network plugin.
tsukiakari-nixos:4070437:4070437 [3] NCCL INFO NET/Plugin: No plugin found (librccl-net.so)
tsukiakari-nixos:4070439:4070439 [5] NCCL INFO NET/Plugin: Plugin load returned 2 : librccl-net.so: cannot open shared object file: No such file or directory : when loading librccl-net.so
tsukiakari-nixos:4070439:4070439 [5] NCCL INFO NET/Plugin: Using internal network plugin.
tsukiakari-nixos:4070437:4070437 [3] NCCL INFO NET/Plugin: Plugin load returned 2 : librccl-net.so: cannot open shared object file: No such file or directory : when loading librccl-net.so
tsukiakari-nixos:4070437:4070437 [3] NCCL INFO NET/Plugin: Using internal network plugin.
tsukiakari-nixos:4070435:4070435 [1] NCCL INFO Kernel version: 6.12.0
tsukiakari-nixos:4070436:4070436 [2] NCCL INFO NET/Plugin: No plugin found (librccl-net.so)
tsukiakari-nixos:4070436:4070436 [2] NCCL INFO NET/Plugin: Plugin load returned 2 : librccl-net.so: cannot open shared object file: No such file or directory : when loading librccl-net.so
tsukiakari-nixos:4070439:4070439 [5] NCCL INFO Kernel version: 6.12.0
tsukiakari-nixos:4070436:4070436 [2] NCCL INFO NET/Plugin: Using internal network plugin.
tsukiakari-nixos:4070437:4070437 [3] NCCL INFO Kernel version: 6.12.0
tsukiakari-nixos:4070438:4070438 [4] NCCL INFO ROCr version 1.1
tsukiakari-nixos:4070436:4070436 [2] NCCL INFO Kernel version: 6.12.0
tsukiakari-nixos:4070438:4070438 [4] NCCL INFO NCCL_DMABUF_ENABLE set by environment to 1.
tsukiakari-nixos:4070435:4070435 [1] NCCL INFO Comm config Blocking set to 0
tsukiakari-nixos:4070439:4070439 [5] NCCL INFO Comm config Blocking set to 0
tsukiakari-nixos:4070437:4070437 [3] NCCL INFO Comm config Blocking set to 0
tsukiakari-nixos:4070438:4070438 [4] NCCL INFO Could not open kernel conf file, will assume CONFIG_DMABUF_MOVE_NOTIFY and CONFIG_PCI_P2PDMA are enabled
tsukiakari-nixos:4070436:4070436 [2] NCCL INFO Comm config Blocking set to 0
tsukiakari-nixos:4070438:4070438 [4] NCCL INFO DMA_BUF Support Enabled
tsukiakari-nixos:4070438:4070438 [4] NCCL INFO Bootstrap : Using eno1np0:10.5.5.236<0>
tsukiakari-nixos:4070438:4070438 [4] NCCL INFO NET/Plugin: No plugin found (librccl-net.so)
tsukiakari-nixos:4070438:4070438 [4] NCCL INFO NET/Plugin: Plugin load returned 2 : librccl-net.so: cannot open shared object file: No such file or directory : when loading librccl-net.so
tsukiakari-nixos:4070438:4070438 [4] NCCL INFO NET/Plugin: Using internal network plugin.
tsukiakari-nixos:4070438:4070438 [4] NCCL INFO Kernel version: 6.12.0
tsukiakari-nixos:4070438:4070438 [4] NCCL INFO Comm config Blocking set to 0
tsukiakari-nixos:4070437:4070474 [3] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [RO]; OOB eno1np0:10.5.5.236<0>
/build/source/build/hipify/src/transport/net_ib.cc:592:12: runtime error: index 3 out of bounds for type 'const char *[3]'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /build/source/build/hipify/src/transport/net_ib.cc:592:12
=================================================================
==4070437==ERROR: AddressSanitizer: stack-buffer-overflow on address 0x7ff82df5b038 at pc 0x7fff7c15a3b5 bp 0x7ff8312e28f0 sp 0x7ff8312e28e8
READ of size 8 at 0x7ff82df5b038 thread T6
tsukiakari-nixos:4070434:4070466 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [RO]; OOB eno1np0:10.5.5.236<0>
/build/source/build/hipify/src/transport/net_ib.cc:592:12: runtime error: index 3 out of bounds for type 'const char *[3]'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /build/source/build/hipify/src/transport/net_ib.cc:592:12
=================================================================
==4070434==ERROR: AddressSanitizer: stack-buffer-overflow on address 0x7ff82cc51038 at pc 0x7fff7c15a3b5 bp 0x7ff82db508f0 sp 0x7ff82db508e8
READ of size 8 at 0x7ff82cc51038 thread T7
tsukiakari-nixos:4070435:4070471 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [RO]; OOB eno1np0:10.5.5.236<0>
/build/source/build/hipify/src/transport/net_ib.cc:592:12: runtime error: index 3 out of bounds for type 'const char *[3]'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /build/source/build/hipify/src/transport/net_ib.cc:592:12
=================================================================
tsukiakari-nixos:4070436:4070473 [2] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [RO]; OOB eno1np0:10.5.5.236<0>
tsukiakari-nixos:4070439:4070472 [5] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [RO]; OOB eno1np0:10.5.5.236<0>
/build/source/build/hipify/src/transport/net_ib.cc:592:12: runtime error: index 3 out of bounds for type 'const char *[3]'
/build/source/build/hipify/src/transport/net_ib.cc:592:12: runtime error: index 3 out of bounds for type 'const char *[3]'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /build/source/build/hipify/src/transport/net_ib.cc:592:12
=================================================================
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /build/source/build/hipify/src/transport/net_ib.cc:592:12
=================================================================
==4070435==ERROR: AddressSanitizer: stack-buffer-overflow on address 0x7ff82df5b038 at pc 0x7fff7c15a3b5 bp 0x7ff8312e28f0 sp 0x7ff8312e28e8
READ of size 8 at 0x7ff82df5b038 thread T6
==4070436==ERROR: AddressSanitizer: stack-buffer-overflow on address 0x7ff82df5b038 at pc 0x7fff7c15a3b5 bp 0x7ff8312e28f0 sp 0x7ff8312e28e8
READ of size 8 at 0x7ff82df5b038 thread T6
==4070439==ERROR: AddressSanitizer: stack-buffer-overflow on address 0x7ff82d8a8038 at pc 0x7fff7c15a3b5 bp 0x7ff82e7a78f0 sp 0x7ff82e7a78e8
READ of size 8 at 0x7ff82d8a8038 thread T6
tsukiakari-nixos:4070438:4070476 [4] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [RO]; OOB eno1np0:10.5.5.236<0>
/build/source/build/hipify/src/transport/net_ib.cc:592:12: runtime error: index 3 out of bounds for type 'const char *[3]'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /build/source/build/hipify/src/transport/net_ib.cc:592:12
=================================================================
==4070438==ERROR: AddressSanitizer: stack-buffer-overflow on address 0x7ff82df5b038 at pc 0x7fff7c15a3b5 bp 0x7ff8312e28f0 sp 0x7ff8312e28e8
READ of size 8 at 0x7ff82df5b038 thread T6
#0 0x7fff7c15a3b4 in ncclIbGdrSupport() /build/source/build/hipify/src/transport/net_ib.cc:592:12
#1 0x7fff7c15adfe in ncclIbGetProperties(int, ncclNetProperties_v8_t*) /build/source/build/hipify/src/transport/net_ib.cc:688:7
#2 0x7fff7bf38198 in ncclNetCheckDeviceVersion(ncclComm*, ncclNet_v8_t*, int) /build/source/build/hipify/src/net.cc:501:3
#3 0x7fff7bf388d5 in ncclNetInit(ncclComm*) /build/source/build/hipify/src/net.cc:562:24
#4 0x7fff7bf0b0ad in commAlloc(ncclComm*, ncclComm*, int, int) /build/source/build/hipify/src/init.cc:533:3
#5 0x7fff7bf00395 in ncclCommInitRankFunc(ncclAsyncJob*) /build/source/build/hipify/src/init.cc:2002:5
#6 0x7fff7bee824e in ncclAsyncJobMain(void*) /build/source/build/hipify/src/group.cc:67:17
#7 0x7ffff749f0d4 in asan_thread_start(void*) (/nix/store/7r6z6nb443psc1ghiyjlqmhwkll7wiia-clr-6.3.0/llvm/lib/linux/libclang_rt.asan-x86_64.so+0x9f0d4)
#8 0x7ffff69b0d01 in start_thread (/nix/store/pacbfvpzqz2mksby36awvbcn051zcji3-glibc-2.40-36/lib/libc.so.6+0x90d01) (BuildId: 2de6548b3bd2f2857c3c1d5f85e5e817ce2c4a7e)
#9 0x7ffff6a303ab in __GI___clone3 (/nix/store/pacbfvpzqz2mksby36awvbcn051zcji3-glibc-2.40-36/lib/libc.so.6+0x1103ab) (BuildId: 2de6548b3bd2f2857c3c1d5f85e5e817ce2c4a7e)
Address 0x7ff82df5b038 is located in stack of thread T6 at offset 56 in frame
#0 0x7fff7c159ae7 in ncclIbGdrSupport() /build/source/build/hipify/src/transport/net_ib.cc:568
This frame has 3 object(s):
[32, 56) 'memory_peers_paths' (line 587) <== Memory access at offset 56 overflows this variable
[96, 351) 'strValue' (line 603)
[416, 672) 'buf' (line 613)
HINT: this may be a false positive if your program uses some custom stack unwind mechanism, swapcontext or vfork
(longjmp and C++ exceptions *are* supported)
Thread T6 created by T5 here:
#0 0x7ffff755c38d in pthread_create (/nix/store/7r6z6nb443psc1ghiyjlqmhwkll7wiia-clr-6.3.0/llvm/lib/linux/libclang_rt.asan-x86_64.so+0x15c38d)
#1 0x7fff7beead9d in groupLaunch(ncclAsyncJob*) /build/source/build/hipify/src/group.cc:314:7
#2 0x7fff7bee824e in ncclAsyncJobMain(void*) /build/source/build/hipify/src/group.cc:67:17
#3 0x7ffff749f0d4 in asan_thread_start(void*) (/nix/store/7r6z6nb443psc1ghiyjlqmhwkll7wiia-clr-6.3.0/llvm/lib/linux/libclang_rt.asan-x86_64.so+0x9f0d4)
Thread T5 created by T0 here:
#0 0x7fff7c15a3b4 in ncclIbGdrSupport() /build/source/build/hipify/src/transport/net_ib.cc:592:12
#1 0x7fff7c15adfe in ncclIbGetProperties(int, ncclNetProperties_v8_t*) /build/source/build/hipify/src/transport/net_ib.cc:688:7
#2 0x7fff7bf38198 in ncclNetCheckDeviceVersion(ncclComm*, ncclNet_v8_t*, int) /build/source/build/hipify/src/net.cc:501:3
#3 0x7fff7bf388d5 in ncclNetInit(ncclComm*) /build/source/build/hipify/src/net.cc:562:24
#4 0x7fff7bf0b0ad in commAlloc(ncclComm*, ncclComm*, int, int) /build/source/build/hipify/src/init.cc:533:3
#5 0x7fff7bf00395 in ncclCommInitRankFunc(ncclAsyncJob*) /build/source/build/hipify/src/init.cc:2002:5
#6 0x7fff7bee824e in ncclAsyncJobMain(void*) /build/source/build/hipify/src/group.cc:67:17
#7 0x7ffff749f0d4 in asan_thread_start(void*) (/nix/store/7r6z6nb443psc1ghiyjlqmhwkll7wiia-clr-6.3.0/llvm/lib/linux/libclang_rt.asan-x86_64.so+0x9f0d4)
#8 0x7ffff69b0d01 in start_thread (/nix/store/pacbfvpzqz2mksby36awvbcn051zcji3-glibc-2.40-36/lib/libc.so.6+0x90d01) (BuildId: 2de6548b3bd2f2857c3c1d5f85e5e817ce2c4a7e)
#9 0x7ffff6a303ab in __GI___clone3 (/nix/store/pacbfvpzqz2mksby36awvbcn051zcji3-glibc-2.40-36/lib/libc.so.6+0x1103ab) (BuildId: 2de6548b3bd2f2857c3c1d5f85e5e817ce2c4a7e)
Address 0x7ff82df5b038 is located in stack of thread T6 at offset 56 in frame
#0 0x7fff7c159ae7 in ncclIbGdrSupport() /build/source/build/hipify/src/transport/net_ib.cc:568
This frame has 3 object(s):
[32, 56) 'memory_peers_paths' (line 587) <== Memory access at offset 56 overflows this variable
[96, 351) 'strValue' (line 603)
[416, 672) 'buf' (line 613)
HINT: this may be a false positive if your program uses some custom stack unwind mechanism, swapcontext or vfork
(longjmp and C++ exceptions *are* supported)
Thread T6 created by T5 here:
#0 0x7ffff755c38d in pthread_create (/nix/store/7r6z6nb443psc1ghiyjlqmhwkll7wiia-clr-6.3.0/llvm/lib/linux/libclang_rt.asan-x86_64.so+0x15c38d)
#1 0x7fff7beead9d in groupLaunch(ncclAsyncJob*) /build/source/build/hipify/src/group.cc:314:7
#2 0x7fff7bee824e in ncclAsyncJobMain(void*) /build/source/build/hipify/src/group.cc:67:17
#3 0x7ffff749f0d4 in asan_thread_start(void*) (/nix/store/7r6z6nb443psc1ghiyjlqmhwkll7wiia-clr-6.3.0/llvm/lib/linux/libclang_rt.asan-x86_64.so+0x9f0d4)
Thread T5 created by T0 here:
#0 0x7fff7c15a3b4 in ncclIbGdrSupport() /build/source/build/hipify/src/transport/net_ib.cc:592:12
#1 0x7fff7c15adfe in ncclIbGetProperties(int, ncclNetProperties_v8_t*) /build/source/build/hipify/src/transport/net_ib.cc:688:7
#2 0x7fff7bf38198 in ncclNetCheckDeviceVersion(ncclComm*, ncclNet_v8_t*, int) /build/source/build/hipify/src/net.cc:501:3
#3 0x7fff7bf388d5 in ncclNetInit(ncclComm*) /build/source/build/hipify/src/net.cc:562:24
#4 0x7fff7bf0b0ad in commAlloc(ncclComm*, ncclComm*, int, int) /build/source/build/hipify/src/init.cc:533:3
#5 0x7fff7bf00395 in ncclCommInitRankFunc(ncclAsyncJob*) /build/source/build/hipify/src/init.cc:2002:5
#6 0x7fff7bee824e in ncclAsyncJobMain(void*) /build/source/build/hipify/src/group.cc:67:17
#7 0x7ffff749f0d4 in asan_thread_start(void*) (/nix/store/7r6z6nb443psc1ghiyjlqmhwkll7wiia-clr-6.3.0/llvm/lib/linux/libclang_rt.asan-x86_64.so+0x9f0d4)
#8 0x7ffff69b0d01 in start_thread (/nix/store/pacbfvpzqz2mksby36awvbcn051zcji3-glibc-2.40-36/lib/libc.so.6+0x90d01) (BuildId: 2de6548b3bd2f2857c3c1d5f85e5e817ce2c4a7e)
#9 0x7ffff6a303ab in __GI___clone3 (/nix/store/pacbfvpzqz2mksby36awvbcn051zcji3-glibc-2.40-36/lib/libc.so.6+0x1103ab) (BuildId: 2de6548b3bd2f2857c3c1d5f85e5e817ce2c4a7e)
Address 0x7ff82df5b038 is located in stack of thread T6 at offset 56 in frame
#0 0x7fff7c159ae7 in ncclIbGdrSupport() /build/source/build/hipify/src/transport/net_ib.cc:568
This frame has 3 object(s):
[32, 56) 'memory_peers_paths' (line 587) <== Memory access at offset 56 overflows this variable
[96, 351) 'strValue' (line 603)
[416, 672) 'buf' (line 613)
HINT: this may be a false positive if your program uses some custom stack unwind mechanism, swapcontext or vfork
(longjmp and C++ exceptions *are* supported)
Thread T6 created by T5 here:
#0 0x7ffff755c38d in pthread_create (/nix/store/7r6z6nb443psc1ghiyjlqmhwkll7wiia-clr-6.3.0/llvm/lib/linux/libclang_rt.asan-x86_64.so+0x15c38d)
#1 0x7fff7beead9d in groupLaunch(ncclAsyncJob*) /build/source/build/hipify/src/group.cc:314:7
#2 0x7fff7bee824e in ncclAsyncJobMain(void*) /build/source/build/hipify/src/group.cc:67:17
#3 0x7ffff749f0d4 in asan_thread_start(void*) (/nix/store/7r6z6nb443psc1ghiyjlqmhwkll7wiia-clr-6.3.0/llvm/lib/linux/libclang_rt.asan-x86_64.so+0x9f0d4)
Thread T5 created by T0 here:
#0 0x7fff7c15a3b4 in ncclIbGdrSupport() /build/source/build/hipify/src/transport/net_ib.cc:592:12
#1 0x7fff7c15adfe in ncclIbGetProperties(int, ncclNetProperties_v8_t*) /build/source/build/hipify/src/transport/net_ib.cc:688:7
#2 0x7fff7bf38198 in ncclNetCheckDeviceVersion(ncclComm*, ncclNet_v8_t*, int) /build/source/build/hipify/src/net.cc:501:3
#3 0x7fff7bf388d5 in ncclNetInit(ncclComm*) /build/source/build/hipify/src/net.cc:562:24
#4 0x7fff7bf0b0ad in commAlloc(ncclComm*, ncclComm*, int, int) /build/source/build/hipify/src/init.cc:533:3
#5 0x7fff7bf00395 in ncclCommInitRankFunc(ncclAsyncJob*) /build/source/build/hipify/src/init.cc:2002:5
#6 0x7fff7bee824e in ncclAsyncJobMain(void*) /build/source/build/hipify/src/group.cc:67:17
#7 0x7ffff749f0d4 in asan_thread_start(void*) (/nix/store/7r6z6nb443psc1ghiyjlqmhwkll7wiia-clr-6.3.0/llvm/lib/linux/libclang_rt.asan-x86_64.so+0x9f0d4)
#8 0x7ffff69b0d01 in start_thread (/nix/store/pacbfvpzqz2mksby36awvbcn051zcji3-glibc-2.40-36/lib/libc.so.6+0x90d01) (BuildId: 2de6548b3bd2f2857c3c1d5f85e5e817ce2c4a7e)
#9 0x7ffff6a303ab in __GI___clone3 (/nix/store/pacbfvpzqz2mksby36awvbcn051zcji3-glibc-2.40-36/lib/libc.so.6+0x1103ab) (BuildId: 2de6548b3bd2f2857c3c1d5f85e5e817ce2c4a7e)
Address 0x7ff82d8a8038 is located in stack of thread T6 at offset 56 in frame
#0 0x7fff7c159ae7 in ncclIbGdrSupport() /build/source/build/hipify/src/transport/net_ib.cc:568
This frame has 3 object(s):
[32, 56) 'memory_peers_paths' (line 587) <== Memory access at offset 56 overflows this variable
[96, 351) 'strValue' (line 603)
[416, 672) 'buf' (line 613)
HINT: this may be a false positive if your program uses some custom stack unwind mechanism, swapcontext or vfork
(longjmp and C++ exceptions *are* supported)
Thread T6 created by T5 here:
#0 0x7ffff755c38d in pthread_create (/nix/store/7r6z6nb443psc1ghiyjlqmhwkll7wiia-clr-6.3.0/llvm/lib/linux/libclang_rt.asan-x86_64.so+0x15c38d)
#1 0x7fff7beead9d in groupLaunch(ncclAsyncJob*) /build/source/build/hipify/src/group.cc:314:7
#2 0x7fff7bee824e in ncclAsyncJobMain(void*) /build/source/build/hipify/src/group.cc:67:17
#3 0x7ffff749f0d4 in asan_thread_start(void*) (/nix/store/7r6z6nb443psc1ghiyjlqmhwkll7wiia-clr-6.3.0/llvm/lib/linux/libclang_rt.asan-x86_64.so+0x9f0d4)
Thread T5 created by T0 here:
#0 0x7fff7c15a3b4 in ncclIbGdrSupport() /build/source/build/hipify/src/transport/net_ib.cc:592:12
#1 0x7fff7c15adfe in ncclIbGetProperties(int, ncclNetProperties_v8_t*) /build/source/build/hipify/src/transport/net_ib.cc:688:7
#2 0x7fff7bf38198 in ncclNetCheckDeviceVersion(ncclComm*, ncclNet_v8_t*, int) /build/source/build/hipify/src/net.cc:501:3
#3 0x7fff7bf388d5 in ncclNetInit(ncclComm*) /build/source/build/hipify/src/net.cc:562:24
#4 0x7fff7bf0b0ad in commAlloc(ncclComm*, ncclComm*, int, int) /build/source/build/hipify/src/init.cc:533:3
#5 0x7fff7bf00395 in ncclCommInitRankFunc(ncclAsyncJob*) /build/source/build/hipify/src/init.cc:2002:5
#6 0x7fff7bee824e in ncclAsyncJobMain(void*) /build/source/build/hipify/src/group.cc:67:17
#7 0x7ffff749f0d4 in asan_thread_start(void*) (/nix/store/7r6z6nb443psc1ghiyjlqmhwkll7wiia-clr-6.3.0/llvm/lib/linux/libclang_rt.asan-x86_64.so+0x9f0d4)
#8 0x7ffff69b0d01 in start_thread (/nix/store/pacbfvpzqz2mksby36awvbcn051zcji3-glibc-2.40-36/lib/libc.so.6+0x90d01) (BuildId: 2de6548b3bd2f2857c3c1d5f85e5e817ce2c4a7e)
#9 0x7ffff6a303ab in __GI___clone3 (/nix/store/pacbfvpzqz2mksby36awvbcn051zcji3-glibc-2.40-36/lib/libc.so.6+0x1103ab) (BuildId: 2de6548b3bd2f2857c3c1d5f85e5e817ce2c4a7e)
Address 0x7ff82df5b038 is located in stack of thread T6 at offset 56 in frame
#0 0x7fff7c159ae7 in ncclIbGdrSupport() /build/source/build/hipify/src/transport/net_ib.cc:568
This frame has 3 object(s):
[32, 56) 'memory_peers_paths' (line 587) <== Memory access at offset 56 overflows this variable
[96, 351) 'strValue' (line 603)
[416, 672) 'buf' (line 613)
HINT: this may be a false positive if your program uses some custom stack unwind mechanism, swapcontext or vfork
(longjmp and C++ exceptions *are* supported)
Thread T6 created by T5 here:
#0 0x7ffff755c38d in pthread_create (/nix/store/7r6z6nb443psc1ghiyjlqmhwkll7wiia-clr-6.3.0/llvm/lib/linux/libclang_rt.asan-x86_64.so+0x15c38d)
#1 0x7fff7beead9d in groupLaunch(ncclAsyncJob*) /build/source/build/hipify/src/group.cc:314:7
#2 0x7fff7bee824e in ncclAsyncJobMain(void*) /build/source/build/hipify/src/group.cc:67:17
#3 0x7ffff749f0d4 in asan_thread_start(void*) (/nix/store/7r6z6nb443psc1ghiyjlqmhwkll7wiia-clr-6.3.0/llvm/lib/linux/libclang_rt.asan-x86_64.so+0x9f0d4)
Thread T5 created by T0 here:
#0 0x7ffff755c38d in pthread_create (/nix/store/7r6z6nb443psc1ghiyjlqmhwkll7wiia-clr-6.3.0/llvm/lib/linux/libclang_rt.asan-x86_64.so+0x15c38d)
#1 0x7fff7bee9809 in ncclGroupEndInternal() /build/source/build/hipify/src/group.cc:434:7
#2 0x7fff7bef9d1b in ncclCommInitRankConfig_impl(ncclComm**, int, ncclUniqueId, int, ncclConfig_v21700*) /build/source/build/hipify/src/init.cc:2452:3
#3 0x7fff7c07b58c in ncclCommInitRankConfig /build/source/build/hipify/src/misc/api_trace.cc:544:12
#4 0x7fffc2f25a24 in c10d::NCCLComm::create(int, int, ncclUniqueId, signed char, ncclConfig_v21700&) (/nix/store/xflh908bzilmxqbh24r3hy1djgj6hmjb-python3.12-torch-2.6.0a-nightly-20241203/lib/python3.12/site-packages/torch/lib/libtorch_hip.so+0x2325a24)
#5 0x7fffc2ef5571 in c10d::ProcessGroupNCCL::initNCCLComm(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, c10::Device&, c10d::OpType, int, bool) (/nix/store/xflh908bzilmxqbh24r3hy1djgj6hmjb-python3.12-torch-2.6.0a-nightly-20241203/lib/python3.12/site-packages/torch/lib/libtorch_hip.so+0x22f5571)
#6 0x7fffc2ef77f7 in c10d::ProcessGroupNCCL::eagerConnectSingleDevice(c10::Device) (/nix/store/xflh908bzilmxqbh24r3hy1djgj6hmjb-python3.12-torch-2.6.0a-nightly-20241203/lib/python3.12/site-packages/torch/lib/libtorch_hip.so+0x22f77f7)
#7 0x7fffee0bbd3d in void pybind11::cpp_function::initialize<pybind11::cpp_function::cpp_function<void, c10d::Backend, c10::Device, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::call_guard<pybind11::gil_scoped_release>>(void (c10d::Backend::*)(c10::Device), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::'lambda'(c10d::Backend*, c10::Device), void, c10d::Backend*, c10::Device, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::call_guard<pybind11::gil_scoped_release>>(void&&, c10d::Backend (*)(c10::Device), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::'lambda1'(pybind11::detail::function_call&)::_FUN(pybind11::detail::function_call&) (/nix/store/xflh908bzilmxqbh24r3hy1djgj6hmjb-python3.12-torch-2.6.0a-nightly-20241203/lib/python3.12/site-packages/torch/lib/libtorch_python.so+0xebbd3d)
#8 0x7fffee54d8ff in typeinfo for c10d::Backend::Options (/nix/store/xflh908bzilmxqbh24r3hy1djgj6hmjb-python3.12-torch-2.6.0a-nightly-20241203/lib/python3.12/site-packages/torch/lib/libtorch_python.so+0x134d8ff)
#0 0x7fff7c15a3b4 in ncclIbGdrSupport() /build/source/build/hipify/src/transport/net_ib.cc:592:12
#1 0x7fff7c15adfe in ncclIbGetProperties(int, ncclNetProperties_v8_t*) /build/source/build/hipify/src/transport/net_ib.cc:688:7
#2 0x7fff7bf38198 in ncclNetCheckDeviceVersion(ncclComm*, ncclNet_v8_t*, int) /build/source/build/hipify/src/net.cc:501:3
#3 0x7fff7bf388d5 in ncclNetInit(ncclComm*) /build/source/build/hipify/src/net.cc:562:24
#4 0x7fff7bf0b0ad in commAlloc(ncclComm*, ncclComm*, int, int) /build/source/build/hipify/src/init.cc:533:3
#5 0x7fff7bf00395 in ncclCommInitRankFunc(ncclAsyncJob*) /build/source/build/hipify/src/init.cc:2002:5
#6 0x7fff7bee824e in ncclAsyncJobMain(void*) /build/source/build/hipify/src/group.cc:67:17
#7 0x7ffff749f0d4 in asan_thread_start(void*) (/nix/store/7r6z6nb443psc1ghiyjlqmhwkll7wiia-clr-6.3.0/llvm/lib/linux/libclang_rt.asan-x86_64.so+0x9f0d4)
#8 0x7ffff69b0d01 in start_thread (/nix/store/pacbfvpzqz2mksby36awvbcn051zcji3-glibc-2.40-36/lib/libc.so.6+0x90d01) (BuildId: 2de6548b3bd2f2857c3c1d5f85e5e817ce2c4a7e)
#9 0x7ffff6a303ab in __GI___clone3 (/nix/store/pacbfvpzqz2mksby36awvbcn051zcji3-glibc-2.40-36/lib/libc.so.6+0x1103ab) (BuildId: 2de6548b3bd2f2857c3c1d5f85e5e817ce2c4a7e)
Address 0x7ff82cc51038 is located in stack of thread T7 at offset 56 in frame
#0 0x7fff7c159ae7 in ncclIbGdrSupport() /build/source/build/hipify/src/transport/net_ib.cc:568
This frame has 3 object(s):
[32, 56) 'memory_peers_paths' (line 587) <== Memory access at offset 56 overflows this variable
[96, 351) 'strValue' (line 603)
SUMMARY: AddressSanitizer: stack-buffer-overflow /build/source/build/hipify/src/transport/net_ib.cc:592:12 in ncclIbGdrSupport()
[416, 672) 'buf' (line 613)
Shadow bytes around the buggy address:
0x7ff82df5ad80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x7ff82df5ae00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x7ff82df5ae80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x7ff82df5af00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x7ff82df5af80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
=>0x7ff82df5b000: f1 f1 f1 f1 00 00 00[f2]f2 f2 f2 f2 f8 f8 f8 f8
0x7ff82df5b080: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
0x7ff82df5b100: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f2 f2 f2 f2
0x7ff82df5b180: f2 f2 f2 f2 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
0x7ff82df5b200: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
0x7ff82df5b280: f8 f8 f8 f8 f3 f3 f3 f3 f3 f3 f3 f3 00 00 00 00
Shadow byte legend (one shadow byte represents 8 application bytes):
Addressable: 00
Partially addressable: 01 02 03 04 05 06 07
Heap left redzone: fa
Freed heap region: fd
Stack left redzone: f1
Stack mid redzone: f2
Stack right redzone: f3
Stack after return: f5
Stack use after scope: f8
Global redzone: f9
Global init order: f6
Poisoned by user: f7
Container overflow: fc
Array cookie: ac
Intra object redzone: bb
ASan internal: fe
Left alloca redzone: ca
Right alloca redzone: cb
==4070437==ABORTING
HINT: this may be a false positive if your program uses some custom stack unwind mechanism, swapcontext or vfork
(longjmp and C++ exceptions *are* supported)
Thread T7 created by T6 here:
#0 0x7ffff755c38d in pthread_create (/nix/store/7r6z6nb443psc1ghiyjlqmhwkll7wiia-clr-6.3.0/llvm/lib/linux/libclang_rt.asan-x86_64.so+0x15c38d)
#1 0x7fff7beead9d in groupLaunch(ncclAsyncJob*) /build/source/build/hipify/src/group.cc:314:7
#2 0x7fff7bee824e in ncclAsyncJobMain(void*) /build/source/build/hipify/src/group.cc:67:17
#3 0x7ffff749f0d4 in asan_thread_start(void*) (/nix/store/7r6z6nb443psc1ghiyjlqmhwkll7wiia-clr-6.3.0/llvm/lib/linux/libclang_rt.asan-x86_64.so+0x9f0d4)
Thread T6 created by T0 here:
#0 0x7ffff755c38d in pthread_create (/nix/store/7r6z6nb443psc1ghiyjlqmhwkll7wiia-clr-6.3.0/llvm/lib/linux/libclang_rt.asan-x86_64.so+0x15c38d)
#1 0x7fff7bee9809 in ncclGroupEndInternal() /build/source/build/hipify/src/group.cc:434:7
#2 0x7fff7bef9d1b in ncclCommInitRankConfig_impl(ncclComm**, int, ncclUniqueId, int, ncclConfig_v21700*) /build/source/build/hipify/src/init.cc:2452:3
#3 0x7fff7c07b58c in ncclCommInitRankConfig /build/source/build/hipify/src/misc/api_trace.cc:544:12
#4 0x7fffc2f25a24 in c10d::NCCLComm::create(int, int, ncclUniqueId, signed char, ncclConfig_v21700&) (/nix/store/xflh908bzilmxqbh24r3hy1djgj6hmjb-python3.12-torch-2.6.0a-nightly-20241203/lib/python3.12/site-packages/torch/lib/libtorch_hip.so+0x2325a24)
#5 0x7fffc2ef5571 in c10d::ProcessGroupNCCL::initNCCLComm(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, c10::Device&, c10d::OpType, int, bool) (/nix/store/xflh908bzilmxqbh24r3hy1djgj6hmjb-python3.12-torch-2.6.0a-nightly-20241203/lib/python3.12/site-packages/torch/lib/libtorch_hip.so+0x22f5571)
#6 0x7fffc2ef77f7 in c10d::ProcessGroupNCCL::eagerConnectSingleDevice(c10::Device) (/nix/store/xflh908bzilmxqbh24r3hy1djgj6hmjb-python3.12-torch-2.6.0a-nightly-20241203/lib/python3.12/site-packages/torch/lib/libtorch_hip.so+0x22f77f7)
#7 0x7fffee0bbd3d in void pybind11::cpp_function::initialize<pybind11::cpp_function::cpp_function<void, c10d::Backend, c10::Device, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::call_guard<pybind11::gil_scoped_release>>(void (c10d::Backend::*)(c10::Device), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::'lambda'(c10d::Backend*, c10::Device), void, c10d::Backend*, c10::Device, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::call_guard<pybind11::gil_scoped_release>>(void&&, c10d::Backend (*)(c10::Device), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::'lambda1'(pybind11::detail::function_call&)::_FUN(pybind11::detail::function_call&) (/nix/store/xflh908bzilmxqbh24r3hy1djgj6hmjb-python3.12-torch-2.6.0a-nightly-20241203/lib/python3.12/site-packages/torch/lib/libtorch_python.so+0xebbd3d)
#8 0x7fffee54d8ff in typeinfo for c10d::Backend::Options (/nix/store/xflh908bzilmxqbh24r3hy1djgj6hmjb-python3.12-torch-2.6.0a-nightly-20241203/lib/python3.12/site-packages/torch/lib/libtorch_python.so+0x134d8ff)
SUMMARY: AddressSanitizer: stack-buffer-overflow /build/source/build/hipify/src/transport/net_ib.cc:592:12 in ncclIbGdrSupport()
#0 0x7ffff755c38d in pthread_create (/nix/store/7r6z6nb443psc1ghiyjlqmhwkll7wiia-clr-6.3.0/llvm/lib/linux/libclang_rt.asan-x86_64.so+0x15c38d)
#1 0x7fff7bee9809 in ncclGroupEndInternal() /build/source/build/hipify/src/group.cc:434:7
#2 0x7fff7bef9d1b in ncclCommInitRankConfig_impl(ncclComm**, int, ncclUniqueId, int, ncclConfig_v21700*) /build/source/build/hipify/src/init.cc:2452:3
#3 0x7fff7c07b58c in ncclCommInitRankConfig /build/source/build/hipify/src/misc/api_trace.cc:544:12
#4 0x7fffc2f25a24 in c10d::NCCLComm::create(int, int, ncclUniqueId, signed char, ncclConfig_v21700&) (/nix/store/xflh908bzilmxqbh24r3hy1djgj6hmjb-python3.12-torch-2.6.0a-nightly-20241203/lib/python3.12/site-packages/torch/lib/libtorch_hip.so+0x2325a24)
#5 0x7fffc2ef5571 in c10d::ProcessGroupNCCL::initNCCLComm(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, c10::Device&, c10d::OpType, int, bool) (/nix/store/xflh908bzilmxqbh24r3hy1djgj6hmjb-python3.12-torch-2.6.0a-nightly-20241203/lib/python3.12/site-packages/torch/lib/libtorch_hip.so+0x22f5571)
#6 0x7fffc2ef77f7 in c10d::ProcessGroupNCCL::eagerConnectSingleDevice(c10::Device) (/nix/store/xflh908bzilmxqbh24r3hy1djgj6hmjb-python3.12-torch-2.6.0a-nightly-20241203/lib/python3.12/site-packages/torch/lib/libtorch_hip.so+0x22f77f7)
#7 0x7fffee0bbd3d in void pybind11::cpp_function::initialize<pybind11::cpp_function::cpp_function<void, c10d::Backend, c10::Device, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::call_guard<pybind11::gil_scoped_release>>(void (c10d::Backend::*)(c10::Device), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::'lambda'(c10d::Backend*, c10::Device), void, c10d::Backend*, c10::Device, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::call_guard<pybind11::gil_scoped_release>>(void&&, c10d::Backend (*)(c10::Device), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::'lambda1'(pybind11::detail::function_call&)::_FUN(pybind11::detail::function_call&) (/nix/store/xflh908bzilmxqbh24r3hy1djgj6hmjb-python3.12-torch-2.6.0a-nightly-20241203/lib/python3.12/site-packages/torch/lib/libtorch_python.so+0xebbd3d)
#8 0x7fffee54d8ff in typeinfo for c10d::Backend::Options (/nix/store/xflh908bzilmxqbh24r3hy1djgj6hmjb-python3.12-torch-2.6.0a-nightly-20241203/lib/python3.12/site-packages/torch/lib/libtorch_python.so+0x134d8ff)
SUMMARY: AddressSanitizer: stack-buffer-overflow /build/source/build/hipify/src/transport/net_ib.cc:592:12 in ncclIbGdrSupport()
Shadow bytes around the buggy address:
0x7ff82d8a7d80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x7ff82d8a7e00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x7ff82d8a7e80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x7ff82d8a7f00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x7ff82d8a7f80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
=>0x7ff82d8a8000: f1 f1 f1 f1 00 00 00[f2]f2 f2 f2 f2 f8 f8 f8 f8
0x7ff82d8a8080: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
0x7ff82d8a8100: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f2 f2 f2 f2
0x7ff82d8a8180: f2 f2 f2 f2 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
0x7ff82d8a8200: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
0x7ff82d8a8280: f8 f8 f8 f8 f3 f3 f3 f3 f3 f3 f3 f3 00 00 00 00
Shadow byte legend (one shadow byte represents 8 application bytes):
Addressable: 00
Partially addressable: 01 02 03 04 05 06 07
Heap left redzone: fa
Freed heap region: fd
Stack left redzone: f1
Stack mid redzone: f2
Stack right redzone: f3
Stack after return: f5
Stack use after scope: f8
Global redzone: f9
Global init order: f6
Poisoned by user: f7
Container overflow: fc
Array cookie: ac
Intra object redzone: bb
ASan internal: fe
Left alloca redzone: ca
Right alloca redzone: cb
==4070439==ABORTING
Shadow bytes around the buggy address:
0x7ff82df5ad80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x7ff82df5ae00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x7ff82df5ae80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x7ff82df5af00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x7ff82df5af80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
=>0x7ff82df5b000: f1 f1 f1 f1 00 00 00[f2]f2 f2 f2 f2 f8 f8 f8 f8
0x7ff82df5b080: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
0x7ff82df5b100: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f2 f2 f2 f2
0x7ff82df5b180: f2 f2 f2 f2 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
0x7ff82df5b200: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
0x7ff82df5b280: f8 f8 f8 f8 f3 f3 f3 f3 f3 f3 f3 f3 00 00 00 00
Shadow byte legend (one shadow byte represents 8 application bytes):
Addressable: 00
Partially addressable: 01 02 03 04 05 06 07
Heap left redzone: fa
Freed heap region: fd
Stack left redzone: f1
Stack mid redzone: f2
Stack right redzone: f3
Stack after return: f5
Stack use after scope: f8
Global redzone: f9
Global init order: f6
Poisoned by user: f7
Container overflow: fc
Array cookie: ac
Intra object redzone: bb
ASan internal: fe
Left alloca redzone: ca
Right alloca redzone: cb
==4070435==ABORTING
#0 0x7ffff755c38d in pthread_create (/nix/store/7r6z6nb443psc1ghiyjlqmhwkll7wiia-clr-6.3.0/llvm/lib/linux/libclang_rt.asan-x86_64.so+0x15c38d)
#1 0x7fff7bee9809 in ncclGroupEndInternal() /build/source/build/hipify/src/group.cc:434:7
#2 0x7fff7bef9d1b in ncclCommInitRankConfig_impl(ncclComm**, int, ncclUniqueId, int, ncclConfig_v21700*) /build/source/build/hipify/src/init.cc:2452:3
#3 0x7fff7c07b58c in ncclCommInitRankConfig /build/source/build/hipify/src/misc/api_trace.cc:544:12
#4 0x7fffc2f25a24 in c10d::NCCLComm::create(int, int, ncclUniqueId, signed char, ncclConfig_v21700&) (/nix/store/xflh908bzilmxqbh24r3hy1djgj6hmjb-python3.12-torch-2.6.0a-nightly-20241203/lib/python3.12/site-packages/torch/lib/libtorch_hip.so+0x2325a24)
#5 0x7fffc2ef5571 in c10d::ProcessGroupNCCL::initNCCLComm(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, c10::Device&, c10d::OpType, int, bool) (/nix/store/xflh908bzilmxqbh24r3hy1djgj6hmjb-python3.12-torch-2.6.0a-nightly-20241203/lib/python3.12/site-packages/torch/lib/libtorch_hip.so+0x22f5571)
#6 0x7fffc2ef77f7 in c10d::ProcessGroupNCCL::eagerConnectSingleDevice(c10::Device) (/nix/store/xflh908bzilmxqbh24r3hy1djgj6hmjb-python3.12-torch-2.6.0a-nightly-20241203/lib/python3.12/site-packages/torch/lib/libtorch_hip.so+0x22f77f7)
#7 0x7fffee0bbd3d in void pybind11::cpp_function::initialize<pybind11::cpp_function::cpp_function<void, c10d::Backend, c10::Device, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::call_guard<pybind11::gil_scoped_release>>(void (c10d::Backend::*)(c10::Device), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::'lambda'(c10d::Backend*, c10::Device), void, c10d::Backend*, c10::Device, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::call_guard<pybind11::gil_scoped_release>>(void&&, c10d::Backend (*)(c10::Device), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::'lambda1'(pybind11::detail::function_call&)::_FUN(pybind11::detail::function_call&) (/nix/store/xflh908bzilmxqbh24r3hy1djgj6hmjb-python3.12-torch-2.6.0a-nightly-20241203/lib/python3.12/site-packages/torch/lib/libtorch_python.so+0xebbd3d)
#8 0x7fffee54d8ff in typeinfo for c10d::Backend::Options (/nix/store/xflh908bzilmxqbh24r3hy1djgj6hmjb-python3.12-torch-2.6.0a-nightly-20241203/lib/python3.12/site-packages/torch/lib/libtorch_python.so+0x134d8ff)
SUMMARY: AddressSanitizer: stack-buffer-overflow /build/source/build/hipify/src/transport/net_ib.cc:592:12 in ncclIbGdrSupport()
Shadow bytes around the buggy address:
0x7ff82df5ad80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x7ff82df5ae00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x7ff82df5ae80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x7ff82df5af00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x7ff82df5af80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
=>0x7ff82df5b000: f1 f1 f1 f1 00 00 00[f2]f2 f2 f2 f2 f8 f8 f8 f8
0x7ff82df5b080: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
0x7ff82df5b100: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f2 f2 f2 f2
0x7ff82df5b180: f2 f2 f2 f2 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
0x7ff82df5b200: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
0x7ff82df5b280: f8 f8 f8 f8 f3 f3 f3 f3 f3 f3 f3 f3 00 00 00 00
Shadow byte legend (one shadow byte represents 8 application bytes):
Addressable: 00
Partially addressable: 01 02 03 04 05 06 07
Heap left redzone: fa
Freed heap region: fd
Stack left redzone: f1
Stack mid redzone: f2
Stack right redzone: f3
Stack after return: f5
Stack use after scope: f8
Global redzone: f9
Global init order: f6
Poisoned by user: f7
Container overflow: fc
Array cookie: ac
Intra object redzone: bb
ASan internal: fe
Left alloca redzone: ca
Right alloca redzone: cb
==4070436==ABORTING
#0 0x7ffff755c38d in pthread_create (/nix/store/7r6z6nb443psc1ghiyjlqmhwkll7wiia-clr-6.3.0/llvm/lib/linux/libclang_rt.asan-x86_64.so+0x15c38d)
#1 0x7fff7bee9809 in ncclGroupEndInternal() /build/source/build/hipify/src/group.cc:434:7
#2 0x7fff7bef9d1b in ncclCommInitRankConfig_impl(ncclComm**, int, ncclUniqueId, int, ncclConfig_v21700*) /build/source/build/hipify/src/init.cc:2452:3
#3 0x7fff7c07b58c in ncclCommInitRankConfig /build/source/build/hipify/src/misc/api_trace.cc:544:12
#4 0x7fffc2f25a24 in c10d::NCCLComm::create(int, int, ncclUniqueId, signed char, ncclConfig_v21700&) (/nix/store/xflh908bzilmxqbh24r3hy1djgj6hmjb-python3.12-torch-2.6.0a-nightly-20241203/lib/python3.12/site-packages/torch/lib/libtorch_hip.so+0x2325a24)
#5 0x7fffc2ef5571 in c10d::ProcessGroupNCCL::initNCCLComm(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, c10::Device&, c10d::OpType, int, bool) (/nix/store/xflh908bzilmxqbh24r3hy1djgj6hmjb-python3.12-torch-2.6.0a-nightly-20241203/lib/python3.12/site-packages/torch/lib/libtorch_hip.so+0x22f5571)
#6 0x7fffc2ef77f7 in c10d::ProcessGroupNCCL::eagerConnectSingleDevice(c10::Device) (/nix/store/xflh908bzilmxqbh24r3hy1djgj6hmjb-python3.12-torch-2.6.0a-nightly-20241203/lib/python3.12/site-packages/torch/lib/libtorch_hip.so+0x22f77f7)
#7 0x7fffee0bbd3d in void pybind11::cpp_function::initialize<pybind11::cpp_function::cpp_function<void, c10d::Backend, c10::Device, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::call_guard<pybind11::gil_scoped_release>>(void (c10d::Backend::*)(c10::Device), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::'lambda'(c10d::Backend*, c10::Device), void, c10d::Backend*, c10::Device, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::call_guard<pybind11::gil_scoped_release>>(void&&, c10d::Backend (*)(c10::Device), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::'lambda1'(pybind11::detail::function_call&)::_FUN(pybind11::detail::function_call&) (/nix/store/xflh908bzilmxqbh24r3hy1djgj6hmjb-python3.12-torch-2.6.0a-nightly-20241203/lib/python3.12/site-packages/torch/lib/libtorch_python.so+0xebbd3d)
#8 0x7fffee54d8ff in typeinfo for c10d::Backend::Options (/nix/store/xflh908bzilmxqbh24r3hy1djgj6hmjb-python3.12-torch-2.6.0a-nightly-20241203/lib/python3.12/site-packages/torch/lib/libtorch_python.so+0x134d8ff)
SUMMARY: AddressSanitizer: stack-buffer-overflow /build/source/build/hipify/src/transport/net_ib.cc:592:12 in ncclIbGdrSupport()
Shadow bytes around the buggy address:
0x7ff82df5ad80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x7ff82df5ae00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x7ff82df5ae80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x7ff82df5af00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x7ff82df5af80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
=>0x7ff82df5b000: f1 f1 f1 f1 00 00 00[f2]f2 f2 f2 f2 f8 f8 f8 f8
0x7ff82df5b080: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
0x7ff82df5b100: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f2 f2 f2 f2
0x7ff82df5b180: f2 f2 f2 f2 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
0x7ff82df5b200: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
0x7ff82df5b280: f8 f8 f8 f8 f3 f3 f3 f3 f3 f3 f3 f3 00 00 00 00
Shadow byte legend (one shadow byte represents 8 application bytes):
Addressable: 00
Partially addressable: 01 02 03 04 05 06 07
Heap left redzone: fa
Freed heap region: fd
Stack left redzone: f1
Stack mid redzone: f2
Stack right redzone: f3
Stack after return: f5
Stack use after scope: f8
Global redzone: f9
Global init order: f6
Poisoned by user: f7
Container overflow: fc
Array cookie: ac
Intra object redzone: bb
ASan internal: fe
Left alloca redzone: ca
Right alloca redzone: cb
==4070438==ABORTING
#0 0x7ffff755c38d in pthread_create (/nix/store/7r6z6nb443psc1ghiyjlqmhwkll7wiia-clr-6.3.0/llvm/lib/linux/libclang_rt.asan-x86_64.so+0x15c38d)
#1 0x7fff7bee9809 in ncclGroupEndInternal() /build/source/build/hipify/src/group.cc:434:7
#2 0x7fff7bef9d1b in ncclCommInitRankConfig_impl(ncclComm**, int, ncclUniqueId, int, ncclConfig_v21700*) /build/source/build/hipify/src/init.cc:2452:3
#3 0x7fff7c07b58c in ncclCommInitRankConfig /build/source/build/hipify/src/misc/api_trace.cc:544:12
#4 0x7fffc2f25a24 in c10d::NCCLComm::create(int, int, ncclUniqueId, signed char, ncclConfig_v21700&) (/nix/store/xflh908bzilmxqbh24r3hy1djgj6hmjb-python3.12-torch-2.6.0a-nightly-20241203/lib/python3.12/site-packages/torch/lib/libtorch_hip.so+0x2325a24)
#5 0x7fffc2ef5571 in c10d::ProcessGroupNCCL::initNCCLComm(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, c10::Device&, c10d::OpType, int, bool) (/nix/store/xflh908bzilmxqbh24r3hy1djgj6hmjb-python3.12-torch-2.6.0a-nightly-20241203/lib/python3.12/site-packages/torch/lib/libtorch_hip.so+0x22f5571)
#6 0x7fffc2ef77f7 in c10d::ProcessGroupNCCL::eagerConnectSingleDevice(c10::Device) (/nix/store/xflh908bzilmxqbh24r3hy1djgj6hmjb-python3.12-torch-2.6.0a-nightly-20241203/lib/python3.12/site-packages/torch/lib/libtorch_hip.so+0x22f77f7)
#7 0x7fffee0bbd3d in void pybind11::cpp_function::initialize<pybind11::cpp_function::cpp_function<void, c10d::Backend, c10::Device, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::call_guard<pybind11::gil_scoped_release>>(void (c10d::Backend::*)(c10::Device), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::'lambda'(c10d::Backend*, c10::Device), void, c10d::Backend*, c10::Device, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::call_guard<pybind11::gil_scoped_release>>(void&&, c10d::Backend (*)(c10::Device), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::'lambda1'(pybind11::detail::function_call&)::_FUN(pybind11::detail::function_call&) (/nix/store/xflh908bzilmxqbh24r3hy1djgj6hmjb-python3.12-torch-2.6.0a-nightly-20241203/lib/python3.12/site-packages/torch/lib/libtorch_python.so+0xebbd3d)
#8 0x7fffee54d8ff in typeinfo for c10d::Backend::Options (/nix/store/xflh908bzilmxqbh24r3hy1djgj6hmjb-python3.12-torch-2.6.0a-nightly-20241203/lib/python3.12/site-packages/torch/lib/libtorch_python.so+0x134d8ff)
SUMMARY: AddressSanitizer: stack-buffer-overflow /build/source/build/hipify/src/transport/net_ib.cc:592:12 in ncclIbGdrSupport()
Shadow bytes around the buggy address:
0x7ff82cc50d80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x7ff82cc50e00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x7ff82cc50e80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x7ff82cc50f00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x7ff82cc50f80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
=>0x7ff82cc51000: f1 f1 f1 f1 00 00 00[f2]f2 f2 f2 f2 f8 f8 f8 f8
0x7ff82cc51080: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
0x7ff82cc51100: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f2 f2 f2 f2
0x7ff82cc51180: f2 f2 f2 f2 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
0x7ff82cc51200: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
0x7ff82cc51280: f8 f8 f8 f8 f3 f3 f3 f3 f3 f3 f3 f3 00 00 00 00
Shadow byte legend (one shadow byte represents 8 application bytes):
Addressable: 00
Partially addressable: 01 02 03 04 05 06 07
Heap left redzone: fa
Freed heap region: fd
Stack left redzone: f1
Stack mid redzone: f2
Stack right redzone: f3
Stack after return: f5
Stack use after scope: f8
Global redzone: f9
Global init order: f6
Poisoned by user: f7
Container overflow: fc
Array cookie: ac
Intra object redzone: bb
ASan internal: fe
Left alloca redzone: ca
Right alloca redzone: cb
==4070434==ABORTING