VRehnberg · November 12, 2024 12:12
diff --git a/DeepSpeed-0.14.5-foss-2023a-CUDA-12.1.1_partial.log b/DeepSpeed-0.14.5-foss-2023a-CUDA-12.1.1_partial.log
            | ^ ~~
      csrc/lamb/fused_lamb_cuda_kernel.cu:379:1075: warning: T* at::Tensor::data() const [with T = c10::Half] is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
        379 |         AT_DISPATCH_FLOATING_TYPES_AND_HALF(
            |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   ^
      /apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/ATen/core/TensorBody.h:247:1: note: declared here
        247 |   T * data() const {
            | ^ ~~
      csrc/lamb/fused_lamb_cuda_kernel.cu:379:1177: warning: T* at::Tensor::data() const [with T = float] is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
        379 |         AT_DISPATCH_FLOATING_TYPES_AND_HALF(
            |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         ^
      /apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/ATen/core/TensorBody.h:247:1: note: declared here
        247 |   T * data() const {
            | ^ ~~
      csrc/lamb/fused_lamb_cuda_kernel.cu:379:1203: warning: T* at::Tensor::data() const [with T = float] is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
        379 |         AT_DISPATCH_FLOATING_TYPES_AND_HALF(
            |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   ^
      /apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/ATen/core/TensorBody.h:247:1: note: declared here
        247 |   T * data() const {
            | ^ ~~
      csrc/lamb/fused_lamb_cuda_kernel.cu:379:163: warning: T* at::Tensor::data() const [with T = float] is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
        379 |         AT_DISPATCH_FLOATING_TYPES_AND_HALF(
            |                                                                                                                                                                   ^
      /apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/ATen/core/TensorBody.h:247:1: note: declared here
        247 |   T * data() const {
            | ^ ~~
      csrc/lamb/fused_lamb_cuda_kernel.cu:379:189: warning: T* at::Tensor::data() const [with T = float] is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
        379 |         AT_DISPATCH_FLOATING_TYPES_AND_HALF(
            |                                                                                                                                                                                             ^
      /apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/ATen/core/TensorBody.h:247:1: note: declared here
        247 |   T * data() const {
            | ^ ~~
      csrc/lamb/fused_lamb_cuda_kernel.cu:379:339: warning: T* at::Tensor::data() const [with T = float] is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
        379 |         AT_DISPATCH_FLOATING_TYPES_AND_HALF(
            |                                                                                                                                                                                                                                                                                                                                                   ^
      /apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/ATen/core/TensorBody.h:247:1: note: declared here
        247 |   T * data() const {
            | ^ ~~
      csrc/lamb/fused_lamb_cuda_kernel.cu:379:390: warning: T* at::Tensor::data() const [with T = c10::Half] is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
        379 |         AT_DISPATCH_FLOATING_TYPES_AND_HALF(
            |                                                                                                                                                                                                                                                                                                                                                                                                      ^
      /apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/ATen/core/TensorBody.h:247:1: note: declared here
        247 |   T * data() const {
            | ^ ~~
      csrc/lamb/fused_lamb_cuda_kernel.cu:379:422: warning: T* at::Tensor::data() const [with T = float] is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
        379 |         AT_DISPATCH_FLOATING_TYPES_AND_HALF(
            |                                                                                                                                                                                                                                                                                                                                                                                                                                      ^
      /apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/ATen/core/TensorBody.h:247:1: note: declared here
        247 |   T * data() const {
            | ^ ~~
      csrc/lamb/fused_lamb_cuda_kernel.cu:379:443: warning: T* at::Tensor::data() const [with T = float] is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
        379 |         AT_DISPATCH_FLOATING_TYPES_AND_HALF(
            |                                                                                                                                                                                                                                                                                                                                                                                                                                                           ^
      /apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/ATen/core/TensorBody.h:247:1: note: declared here
        247 |   T * data() const {
            | ^ ~~
      csrc/lamb/fused_lamb_cuda_kernel.cu:379:468: warning: T* at::Tensor::data() const [with T = c10::Half] is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
        379 |         AT_DISPATCH_FLOATING_TYPES_AND_HALF(
            |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ^
      /apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/ATen/core/TensorBody.h:247:1: note: declared here
        247 |   T * data() const {
            | ^ ~~
      csrc/lamb/fused_lamb_cuda_kernel.cu:379:592: warning: T* at::Tensor::data() const [with T = float] is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
        379 |         AT_DISPATCH_FLOATING_TYPES_AND_HALF(
            |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ^
      /apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/ATen/core/TensorBody.h:247:1: note: declared here
        247 |   T * data() const {
            | ^ ~~
      csrc/lamb/fused_lamb_cuda_kernel.cu:379:618: warning: T* at::Tensor::data() const [with T = float] is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
        379 |         AT_DISPATCH_FLOATING_TYPES_AND_HALF(
            |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ^
      /apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/ATen/core/TensorBody.h:247:1: note: declared here
        247 |   T * data() const {
            | ^ ~~
      csrc/lamb/fused_lamb_cuda_kernel.cu:379:648: warning: T* at::Tensor::data() const [with T = float] is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
        379 |         AT_DISPATCH_FLOATING_TYPES_AND_HALF(
            |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ^
      /apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/ATen/core/TensorBody.h:247:1: note: declared here
        247 |   T * data() const {
            | ^ ~~
      csrc/lamb/fused_lamb_cuda_kernel.cu: In lambda function:
      csrc/lamb/fused_lamb_cuda_kernel.cu:428:1074: warning: T* at::Tensor::data() const [with T = double] is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
        428 |         AT_DISPATCH_FLOATING_TYPES(
            |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ^
      /apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/ATen/core/TensorBody.h:247:1: note: declared here
        247 |   T * data() const {
            | ^ ~~
      csrc/lamb/fused_lamb_cuda_kernel.cu:428:1104: warning: T* at::Tensor::data() const [with T = double] is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
        428 |         AT_DISPATCH_FLOATING_TYPES(
            |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ^
      /apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/ATen/core/TensorBody.h:247:1: note: declared here
        247 |   T * data() const {
            | ^ ~~
      csrc/lamb/fused_lamb_cuda_kernel.cu:428:1126: warning: T* at::Tensor::data() const [with T = double] is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
        428 |         AT_DISPATCH_FLOATING_TYPES(
            |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ^
      /apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/ATen/core/TensorBody.h:247:1: note: declared here
        247 |   T * data() const {
            | ^ ~~
      csrc/lamb/fused_lamb_cuda_kernel.cu:428:1148: warning: T* at::Tensor::data() const [with T = double] is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
        428 |         AT_DISPATCH_FLOATING_TYPES(
            |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ^
      /apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/ATen/core/TensorBody.h:247:1: note: declared here
        247 |   T * data() const {
            | ^ ~~
      csrc/lamb/fused_lamb_cuda_kernel.cu:428:1251: warning: T* at::Tensor::data() const [with T = double] is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
        428 |         AT_DISPATCH_FLOATING_TYPES(
            |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   ^
      /apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/ATen/core/TensorBody.h:247:1: note: declared here
        247 |   T * data() const {
            | ^ ~~
      csrc/lamb/fused_lamb_cuda_kernel.cu:428:1278: warning: T* at::Tensor::data() const [with T = double] is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
        428 |         AT_DISPATCH_FLOATING_TYPES(
            |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ^
      /apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/ATen/core/TensorBody.h:247:1: note: declared here
        247 |   T * data() const {
            | ^ ~~
      csrc/lamb/fused_lamb_cuda_kernel.cu:428:162: warning: T* at::Tensor::data() const [with T = double] is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
        428 |         AT_DISPATCH_FLOATING_TYPES(
            |                                                                                                                                                                  ^
      /apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/ATen/core/TensorBody.h:247:1: note: declared here
        247 |   T * data() const {
            | ^ ~~
      csrc/lamb/fused_lamb_cuda_kernel.cu:428:189: warning: T* at::Tensor::data() const [with T = double] is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
        428 |         AT_DISPATCH_FLOATING_TYPES(
            |                                                                                                                                                                                             ^
      /apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/ATen/core/TensorBody.h:247:1: note: declared here
        247 |   T * data() const {
            | ^ ~~
      csrc/lamb/fused_lamb_cuda_kernel.cu:428:338: warning: T* at::Tensor::data() const [with T = double] is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
        428 |         AT_DISPATCH_FLOATING_TYPES(
            |                                                                                                                                                                                                                                                                                                                                                  ^
      /apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/ATen/core/TensorBody.h:247:1: note: declared here
        247 |   T * data() const {
            | ^ ~~
      csrc/lamb/fused_lamb_cuda_kernel.cu:428:368: warning: T* at::Tensor::data() const [with T = double] is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
        428 |         AT_DISPATCH_FLOATING_TYPES(
            |                                                                                                                                                                                                                                                                                                                                                                                ^
      /apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/ATen/core/TensorBody.h:247:1: note: declared here
        247 |   T * data() const {
            | ^ ~~
      csrc/lamb/fused_lamb_cuda_kernel.cu:428:390: warning: T* at::Tensor::data() const [with T = double] is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
        428 |         AT_DISPATCH_FLOATING_TYPES(
            |                                                                                                                                                                                                                                                                                                                                                                                                      ^
      /apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/ATen/core/TensorBody.h:247:1: note: declared here
        247 |   T * data() const {
            | ^ ~~
      csrc/lamb/fused_lamb_cuda_kernel.cu:428:412: warning: T* at::Tensor::data() const [with T = double] is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
        428 |         AT_DISPATCH_FLOATING_TYPES(
            |                                                                                                                                                                                                                                                                                                                                                                                                                            ^
      /apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/ATen/core/TensorBody.h:247:1: note: declared here
        247 |   T * data() const {
            | ^ ~~
      csrc/lamb/fused_lamb_cuda_kernel.cu:428:537: warning: T* at::Tensor::data() const [with T = double] is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
        428 |         AT_DISPATCH_FLOATING_TYPES(
            |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         ^
      /apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/ATen/core/TensorBody.h:247:1: note: declared here
        247 |   T * data() const {
            | ^ ~~
      csrc/lamb/fused_lamb_cuda_kernel.cu:428:564: warning: T* at::Tensor::data() const [with T = double] is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
        428 |         AT_DISPATCH_FLOATING_TYPES(
            |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ^
      /apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/ATen/core/TensorBody.h:247:1: note: declared here
        247 |   T * data() const {
            | ^ ~~
      csrc/lamb/fused_lamb_cuda_kernel.cu:428:595: warning: T* at::Tensor::data() const [with T = double] is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
        428 |         AT_DISPATCH_FLOATING_TYPES(
            |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   ^
      /apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/ATen/core/TensorBody.h:247:1: note: declared here
        247 |   T * data() const {
            | ^ ~~
      csrc/lamb/fused_lamb_cuda_kernel.cu: In lambda function:
      csrc/lamb/fused_lamb_cuda_kernel.cu:428:893: warning: T* at::Tensor::data() const [with T = float] is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
        428 |         AT_DISPATCH_FLOATING_TYPES(
            |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             ^
      /apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/ATen/core/TensorBody.h:247:1: note: declared here
        247 |   T * data() const {
            | ^ ~~
      csrc/lamb/fused_lamb_cuda_kernel.cu:428:922: warning: T* at::Tensor::data() const [with T = float] is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
        428 |         AT_DISPATCH_FLOATING_TYPES(
            |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ^
      /apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/ATen/core/TensorBody.h:247:1: note: declared here
        247 |   T * data() const {
            | ^ ~~
      csrc/lamb/fused_lamb_cuda_kernel.cu:428:943: warning: T* at::Tensor::data() const [with T = float] is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
        428 |         AT_DISPATCH_FLOATING_TYPES(
            |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               ^
      /apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/ATen/core/TensorBody.h:247:1: note: declared here
        247 |   T * data() const {
            | ^ ~~
      csrc/lamb/fused_lamb_cuda_kernel.cu:428:964: warning: T* at::Tensor::data() const [with T = float] is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
        428 |         AT_DISPATCH_FLOATING_TYPES(
            |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ^
      /apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/ATen/core/TensorBody.h:247:1: note: declared here
        247 |   T * data() const {
            | ^ ~~
      csrc/lamb/fused_lamb_cuda_kernel.cu:428:1066: warning: T* at::Tensor::data() const [with T = float] is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
        428 |         AT_DISPATCH_FLOATING_TYPES(
            |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ^
      /apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/ATen/core/TensorBody.h:247:1: note: declared here
        247 |   T * data() const {
            | ^ ~~
      csrc/lamb/fused_lamb_cuda_kernel.cu:428:1092: warning: T* at::Tensor::data() const [with T = float] is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
        428 |         AT_DISPATCH_FLOATING_TYPES(
            |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ^
      /apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/ATen/core/TensorBody.h:247:1: note: declared here
        247 |   T * data() const {
            | ^ ~~
      csrc/lamb/fused_lamb_cuda_kernel.cu:428:159: warning: T* at::Tensor::data() const [with T = float] is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
        428 |         AT_DISPATCH_FLOATING_TYPES(
            |                                                                                                                                                               ^
      /apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/ATen/core/TensorBody.h:247:1: note: declared here
        247 |   T * data() const {
            | ^ ~~
      csrc/lamb/fused_lamb_cuda_kernel.cu:428:185: warning: T* at::Tensor::data() const [with T = float] is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
        428 |         AT_DISPATCH_FLOATING_TYPES(
            |                                                                                                                                                                                         ^
      /apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/ATen/core/TensorBody.h:247:1: note: declared here
        247 |   T * data() const {
            | ^ ~~
      csrc/lamb/fused_lamb_cuda_kernel.cu:428:331: warning: T* at::Tensor::data() const [with T = float] is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
        428 |         AT_DISPATCH_FLOATING_TYPES(
            |                                                                                                                                                                                                                                                                                                                                           ^
      /apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/ATen/core/TensorBody.h:247:1: note: declared here
        247 |   T * data() const {
            | ^ ~~
      csrc/lamb/fused_lamb_cuda_kernel.cu:428:360: warning: T* at::Tensor::data() const [with T = float] is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
        428 |         AT_DISPATCH_FLOATING_TYPES(
            |                                                                                                                                                                                                                                                                                                                                                                        ^
      /apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/ATen/core/TensorBody.h:247:1: note: declared here
        247 |   T * data() const {
            | ^ ~~
      csrc/lamb/fused_lamb_cuda_kernel.cu:428:381: warning: T* at::Tensor::data() const [with T = float] is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
        428 |         AT_DISPATCH_FLOATING_TYPES(
            |                                                                                                                                                                                                                                                                                                                                                                                             ^
      /apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/ATen/core/TensorBody.h:247:1: note: declared here
        247 |   T * data() const {
            | ^ ~~
      csrc/lamb/fused_lamb_cuda_kernel.cu:428:402: warning: T* at::Tensor::data() const [with T = float] is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
        428 |         AT_DISPATCH_FLOATING_TYPES(
            |                                                                                                                                                                                                                                                                                                                                                                                                                  ^
      /apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/ATen/core/TensorBody.h:247:1: note: declared here
        247 |   T * data() const {
            | ^ ~~
      csrc/lamb/fused_lamb_cuda_kernel.cu:428:526: warning: T* at::Tensor::data() const [with T = float] is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
        428 |         AT_DISPATCH_FLOATING_TYPES(
            |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ^
      /apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/ATen/core/TensorBody.h:247:1: note: declared here
        247 |   T * data() const {
            | ^ ~~
      csrc/lamb/fused_lamb_cuda_kernel.cu:428:552: warning: T* at::Tensor::data() const [with T = float] is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
        428 |         AT_DISPATCH_FLOATING_TYPES(
            |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ^
      /apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/ATen/core/TensorBody.h:247:1: note: declared here
        247 |   T * data() const {
            | ^ ~~
      csrc/lamb/fused_lamb_cuda_kernel.cu:428:582: warning: T* at::Tensor::data() const [with T = float] is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
        428 |         AT_DISPATCH_FLOATING_TYPES(
            |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ^
      /apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/ATen/core/TensorBody.h:247:1: note: declared here
        247 |   T * data() const {
            | ^ ~~
      creating build/lib.linux-x86_64-cpython-311/deepspeed/ops/lamb
      g++ -shared -L/apps/Test/software/OpenSSL/1.1/lib64 -L/apps/Test/software/OpenSSL/1.1/lib -L/apps/Test/software/libffi/3.4.4-GCCcore-12.3.0/lib64 -L/apps/Test/software/libffi/3.4.4-GCCcore-12.3.0/lib -L/apps/Test/software/XZ/5.4.2-GCCcore-12.3.0/lib64 -L/apps/Test/software/XZ/5.4.2-GCCcore-12.3.0/lib -L/apps/Test/software/SQLite/3.42.0-GCCcore-12.3.0/lib64 -L/apps/Test/software/SQLite/3.42.0-GCCcore-12.3.0/lib -L/apps/Test/software/ncurses/6.4-GCCcore-12.3.0/lib64 -L/apps/Test/software/ncurses/6.4-GCCcore-12.3.0/lib -L/apps/Test/software/libreadline/8.2-GCCcore-12.3.0/lib64 -L/apps/Test/software/libreadline/8.2-GCCcore-12.3.0/lib -L/apps/Test/software/zlib/1.2.13-GCCcore-12.3.0/lib64 -L/apps/Test/software/zlib/1.2.13-GCCcore-12.3.0/lib -L/apps/Test/software/bzip2/1.0.8-GCCcore-12.3.0/lib64 -L/apps/Test/software/bzip2/1.0.8-GCCcore-12.3.0/lib -L/apps/Test/software/binutils/2.40-GCCcore-12.3.0/lib64 -L/apps/Test/software/binutils/2.40-GCCcore-12.3.0/lib -L/apps/Test/software/pkgconf/1.9.5-GCCcore-12.3.0/lib64 -L/apps/Test/software/pkgconf/1.9.5-GCCcore-12.3.0/lib -L/apps/Test/software/GCCcore/12.3.0/lib64 -L/apps/Test/software/GCCcore/12.3.0/lib -L/apps/Test/software/OpenSSL/1.1/lib64 -L/apps/Test/software/OpenSSL/1.1/lib -L/apps/Test/software/libffi/3.4.4-GCCcore-12.3.0/lib64 -L/apps/Test/software/libffi/3.4.4-GCCcore-12.3.0/lib -L/apps/Test/software/XZ/5.4.2-GCCcore-12.3.0/lib64 -L/apps/Test/software/XZ/5.4.2-GCCcore-12.3.0/lib -L/apps/Test/software/SQLite/3.42.0-GCCcore-12.3.0/lib64 -L/apps/Test/software/SQLite/3.42.0-GCCcore-12.3.0/lib -L/apps/Test/software/ncurses/6.4-GCCcore-12.3.0/lib64 -L/apps/Test/software/ncurses/6.4-GCCcore-12.3.0/lib -L/apps/Test/software/libreadline/8.2-GCCcore-12.3.0/lib64 -L/apps/Test/software/libreadline/8.2-GCCcore-12.3.0/lib -L/apps/Test/software/zlib/1.2.13-GCCcore-12.3.0/lib64 -L/apps/Test/software/zlib/1.2.13-GCCcore-12.3.0/lib -L/apps/Test/software/bzip2/1.0.8-GCCcore-12.3.0/lib64 -L/apps/Test/software/bzip2/1.0.8-GCCcore-12.3.0/lib -L/apps/Test/software/binutils/2.40-GCCcore-12.3.0/lib64 -L/apps/Test/software/binutils/2.40-GCCcore-12.3.0/lib -L/apps/Test/software/pkgconf/1.9.5-GCCcore-12.3.0/lib64 -L/apps/Test/software/pkgconf/1.9.5-GCCcore-12.3.0/lib -L/apps/Test/software/GCCcore/12.3.0/lib64 -L/apps/Test/software/GCCcore/12.3.0/lib -L/apps/Test/software/FFTW/3.3.10-GCC-12.3.0/lib64 -L/apps/Test/software/FFTW/3.3.10-GCC-12.3.0/lib -L/apps/Test/software/ScaLAPACK/2.2.0-gompi-2023a-fb/lib64 -L/apps/Test/software/ScaLAPACK/2.2.0-gompi-2023a-fb/lib -L/apps/Test/software/FlexiBLAS/3.3.1-GCC-12.3.0/lib64 -L/apps/Test/software/FlexiBLAS/3.3.1-GCC-12.3.0/lib -L/apps/Test/software/GCCcore/12.3.0/lib64 -L/apps/Test/software/GCCcore/12.3.0/lib -O2 -ftree-vectorize -march=native -fno-math-errno -I/apps/Test/software/FFTW/3.3.10-GCC-12.3.0/include -I/apps/Test/software/FlexiBLAS/3.3.1-GCC-12.3.0/include -I/apps/Test/software/FlexiBLAS/3.3.1-GCC-12.3.0/include/flexiblas build/temp.linux-x86_64-cpython-311/csrc/lamb/fused_lamb_cuda.o build/temp.linux-x86_64-cpython-311/csrc/lamb/fused_lamb_cuda_kernel.o -L/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/lib -L/apps/Common/software/CUDA/12.1.1/lib64 -L/apps/Test/software/Python/3.11.3-GCCcore-12.3.0/lib -lc10 -ltorch -ltorch_cpu -ltorch_python -lcudart -lc10_cuda -ltorch_cuda -o build/lib.linux-x86_64-cpython-311/deepspeed/ops/lamb/fused_lamb_op.cpython-311-x86_64-linux-gnu.so
      creating build/lib.linux-x86_64-cpython-311/deepspeed/ops/lion
      g++ -shared -L/apps/Test/software/OpenSSL/1.1/lib64 -L/apps/Test/software/OpenSSL/1.1/lib -L/apps/Test/software/libffi/3.4.4-GCCcore-12.3.0/lib64 -L/apps/Test/software/libffi/3.4.4-GCCcore-12.3.0/lib -L/apps/Test/software/XZ/5.4.2-GCCcore-12.3.0/lib64 -L/apps/Test/software/XZ/5.4.2-GCCcore-12.3.0/lib -L/apps/Test/software/SQLite/3.42.0-GCCcore-12.3.0/lib64 -L/apps/Test/software/SQLite/3.42.0-GCCcore-12.3.0/lib -L/apps/Test/software/ncurses/6.4-GCCcore-12.3.0/lib64 -L/apps/Test/software/ncurses/6.4-GCCcore-12.3.0/lib -L/apps/Test/software/libreadline/8.2-GCCcore-12.3.0/lib64 -L/apps/Test/software/libreadline/8.2-GCCcore-12.3.0/lib -L/apps/Test/software/zlib/1.2.13-GCCcore-12.3.0/lib64 -L/apps/Test/software/zlib/1.2.13-GCCcore-12.3.0/lib -L/apps/Test/software/bzip2/1.0.8-GCCcore-12.3.0/lib64 -L/apps/Test/software/bzip2/1.0.8-GCCcore-12.3.0/lib -L/apps/Test/software/binutils/2.40-GCCcore-12.3.0/lib64 -L/apps/Test/software/binutils/2.40-GCCcore-12.3.0/lib -L/apps/Test/software/pkgconf/1.9.5-GCCcore-12.3.0/lib64 -L/apps/Test/software/pkgconf/1.9.5-GCCcore-12.3.0/lib -L/apps/Test/software/GCCcore/12.3.0/lib64 -L/apps/Test/software/GCCcore/12.3.0/lib -L/apps/Test/software/OpenSSL/1.1/lib64 -L/apps/Test/software/OpenSSL/1.1/lib -L/apps/Test/software/libffi/3.4.4-GCCcore-12.3.0/lib64 -L/apps/Test/software/libffi/3.4.4-GCCcore-12.3.0/lib -L/apps/Test/software/XZ/5.4.2-GCCcore-12.3.0/lib64 -L/apps/Test/software/XZ/5.4.2-GCCcore-12.3.0/lib -L/apps/Test/software/SQLite/3.42.0-GCCcore-12.3.0/lib64 -L/apps/Test/software/SQLite/3.42.0-GCCcore-12.3.0/lib -L/apps/Test/software/ncurses/6.4-GCCcore-12.3.0/lib64 -L/apps/Test/software/ncurses/6.4-GCCcore-12.3.0/lib -L/apps/Test/software/libreadline/8.2-GCCcore-12.3.0/lib64 -L/apps/Test/software/libreadline/8.2-GCCcore-12.3.0/lib -L/apps/Test/software/zlib/1.2.13-GCCcore-12.3.0/lib64 -L/apps/Test/software/zlib/1.2.13-GCCcore-12.3.0/lib -L/apps/Test/software/bzip2/1.0.8-GCCcore-12.3.0/lib64 -L/apps/Test/software/bzip2/1.0.8-GCCcore-12.3.0/lib -L/apps/Test/software/binutils/2.40-GCCcore-12.3.0/lib64 -L/apps/Test/software/binutils/2.40-GCCcore-12.3.0/lib -L/apps/Test/software/pkgconf/1.9.5-GCCcore-12.3.0/lib64 -L/apps/Test/software/pkgconf/1.9.5-GCCcore-12.3.0/lib -L/apps/Test/software/GCCcore/12.3.0/lib64 -L/apps/Test/software/GCCcore/12.3.0/lib -L/apps/Test/software/FFTW/3.3.10-GCC-12.3.0/lib64 -L/apps/Test/software/FFTW/3.3.10-GCC-12.3.0/lib -L/apps/Test/software/ScaLAPACK/2.2.0-gompi-2023a-fb/lib64 -L/apps/Test/software/ScaLAPACK/2.2.0-gompi-2023a-fb/lib -L/apps/Test/software/FlexiBLAS/3.3.1-GCC-12.3.0/lib64 -L/apps/Test/software/FlexiBLAS/3.3.1-GCC-12.3.0/lib -L/apps/Test/software/GCCcore/12.3.0/lib64 -L/apps/Test/software/GCCcore/12.3.0/lib -O2 -ftree-vectorize -march=native -fno-math-errno -I/apps/Test/software/FFTW/3.3.10-GCC-12.3.0/include -I/apps/Test/software/FlexiBLAS/3.3.1-GCC-12.3.0/include -I/apps/Test/software/FlexiBLAS/3.3.1-GCC-12.3.0/include/flexiblas build/temp.linux-x86_64-cpython-311/csrc/lion/cpu_lion.o build/temp.linux-x86_64-cpython-311/csrc/lion/cpu_lion_impl.o -L/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/lib -L/apps/Common/software/CUDA/12.1.1/lib64 -L/apps/Test/software/Python/3.11.3-GCCcore-12.3.0/lib -lc10 -ltorch -ltorch_cpu -ltorch_python -lcudart -lc10_cuda -ltorch_cuda -o build/lib.linux-x86_64-cpython-311/deepspeed/ops/lion/cpu_lion_op.cpython-311-x86_64-linux-gnu.so -lcurand -L/apps/Common/software/CUDA/12.1.1/lib64
      creating build/lib.linux-x86_64-cpython-311/deepspeed/ops/adam
      g++ -shared -L/apps/Test/software/OpenSSL/1.1/lib64 -L/apps/Test/software/OpenSSL/1.1/lib -L/apps/Test/software/libffi/3.4.4-GCCcore-12.3.0/lib64 -L/apps/Test/software/libffi/3.4.4-GCCcore-12.3.0/lib -L/apps/Test/software/XZ/5.4.2-GCCcore-12.3.0/lib64 -L/apps/Test/software/XZ/5.4.2-GCCcore-12.3.0/lib -L/apps/Test/software/SQLite/3.42.0-GCCcore-12.3.0/lib64 -L/apps/Test/software/SQLite/3.42.0-GCCcore-12.3.0/lib -L/apps/Test/software/ncurses/6.4-GCCcore-12.3.0/lib64 -L/apps/Test/software/ncurses/6.4-GCCcore-12.3.0/lib -L/apps/Test/software/libreadline/8.2-GCCcore-12.3.0/lib64 -L/apps/Test/software/libreadline/8.2-GCCcore-12.3.0/lib -L/apps/Test/software/zlib/1.2.13-GCCcore-12.3.0/lib64 -L/apps/Test/software/zlib/1.2.13-GCCcore-12.3.0/lib -L/apps/Test/software/bzip2/1.0.8-GCCcore-12.3.0/lib64 -L/apps/Test/software/bzip2/1.0.8-GCCcore-12.3.0/lib -L/apps/Test/software/binutils/2.40-GCCcore-12.3.0/lib64 -L/apps/Test/software/binutils/2.40-GCCcore-12.3.0/lib -L/apps/Test/software/pkgconf/1.9.5-GCCcore-12.3.0/lib64 -L/apps/Test/software/pkgconf/1.9.5-GCCcore-12.3.0/lib -L/apps/Test/software/GCCcore/12.3.0/lib64 -L/apps/Test/software/GCCcore/12.3.0/lib -L/apps/Test/software/OpenSSL/1.1/lib64 -L/apps/Test/software/OpenSSL/1.1/lib -L/apps/Test/software/libffi/3.4.4-GCCcore-12.3.0/lib64 -L/apps/Test/software/libffi/3.4.4-GCCcore-12.3.0/lib -L/apps/Test/software/XZ/5.4.2-GCCcore-12.3.0/lib64 -L/apps/Test/software/XZ/5.4.2-GCCcore-12.3.0/lib -L/apps/Test/software/SQLite/3.42.0-GCCcore-12.3.0/lib64 -L/apps/Test/software/SQLite/3.42.0-GCCcore-12.3.0/lib -L/apps/Test/software/ncurses/6.4-GCCcore-12.3.0/lib64 -L/apps/Test/software/ncurses/6.4-GCCcore-12.3.0/lib -L/apps/Test/software/libreadline/8.2-GCCcore-12.3.0/lib64 -L/apps/Test/software/libreadline/8.2-GCCcore-12.3.0/lib -L/apps/Test/software/zlib/1.2.13-GCCcore-12.3.0/lib64 -L/apps/Test/software/zlib/1.2.13-GCCcore-12.3.0/lib -L/apps/Test/software/bzip2/1.0.8-GCCcore-12.3.0/lib64 -L/apps/Test/software/bzip2/1.0.8-GCCcore-12.3.0/lib -L/apps/Test/software/binutils/2.40-GCCcore-12.3.0/lib64 -L/apps/Test/software/binutils/2.40-GCCcore-12.3.0/lib -L/apps/Test/software/pkgconf/1.9.5-GCCcore-12.3.0/lib64 -L/apps/Test/software/pkgconf/1.9.5-GCCcore-12.3.0/lib -L/apps/Test/software/GCCcore/12.3.0/lib64 -L/apps/Test/software/GCCcore/12.3.0/lib -L/apps/Test/software/FFTW/3.3.10-GCC-12.3.0/lib64 -L/apps/Test/software/FFTW/3.3.10-GCC-12.3.0/lib -L/apps/Test/software/ScaLAPACK/2.2.0-gompi-2023a-fb/lib64 -L/apps/Test/software/ScaLAPACK/2.2.0-gompi-2023a-fb/lib -L/apps/Test/software/FlexiBLAS/3.3.1-GCC-12.3.0/lib64 -L/apps/Test/software/FlexiBLAS/3.3.1-GCC-12.3.0/lib -L/apps/Test/software/GCCcore/12.3.0/lib64 -L/apps/Test/software/GCCcore/12.3.0/lib -O2 -ftree-vectorize -march=native -fno-math-errno -I/apps/Test/software/FFTW/3.3.10-GCC-12.3.0/include -I/apps/Test/software/FlexiBLAS/3.3.1-GCC-12.3.0/include -I/apps/Test/software/FlexiBLAS/3.3.1-GCC-12.3.0/include/flexiblas build/temp.linux-x86_64-cpython-311/csrc/adam/cpu_adam.o build/temp.linux-x86_64-cpython-311/csrc/adam/cpu_adam_impl.o -L/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/lib -L/apps/Common/software/CUDA/12.1.1/lib64 -L/apps/Test/software/Python/3.11.3-GCCcore-12.3.0/lib -lc10 -ltorch -ltorch_cpu -ltorch_python -lcudart -lc10_cuda -ltorch_cuda -o build/lib.linux-x86_64-cpython-311/deepspeed/ops/adam/cpu_adam_op.cpython-311-x86_64-linux-gnu.so -lcurand -L/apps/Common/software/CUDA/12.1.1/lib64
      g++ -shared -L/apps/Test/software/OpenSSL/1.1/lib64 -L/apps/Test/software/OpenSSL/1.1/lib -L/apps/Test/software/libffi/3.4.4-GCCcore-12.3.0/lib64 -L/apps/Test/software/libffi/3.4.4-GCCcore-12.3.0/lib -L/apps/Test/software/XZ/5.4.2-GCCcore-12.3.0/lib64 -L/apps/Test/software/XZ/5.4.2-GCCcore-12.3.0/lib -L/apps/Test/software/SQLite/3.42.0-GCCcore-12.3.0/lib64 -L/apps/Test/software/SQLite/3.42.0-GCCcore-12.3.0/lib -L/apps/Test/software/ncurses/6.4-GCCcore-12.3.0/lib64 -L/apps/Test/software/ncurses/6.4-GCCcore-12.3.0/lib -L/apps/Test/software/libreadline/8.2-GCCcore-12.3.0/lib64 -L/apps/Test/software/libreadline/8.2-GCCcore-12.3.0/lib -L/apps/Test/software/zlib/1.2.13-GCCcore-12.3.0/lib64 -L/apps/Test/software/zlib/1.2.13-GCCcore-12.3.0/lib -L/apps/Test/software/bzip2/1.0.8-GCCcore-12.3.0/lib64 -L/apps/Test/software/bzip2/1.0.8-GCCcore-12.3.0/lib -L/apps/Test/software/binutils/2.40-GCCcore-12.3.0/lib64 -L/apps/Test/software/binutils/2.40-GCCcore-12.3.0/lib -L/apps/Test/software/pkgconf/1.9.5-GCCcore-12.3.0/lib64 -L/apps/Test/software/pkgconf/1.9.5-GCCcore-12.3.0/lib -L/apps/Test/software/GCCcore/12.3.0/lib64 -L/apps/Test/software/GCCcore/12.3.0/lib -L/apps/Test/software/OpenSSL/1.1/lib64 -L/apps/Test/software/OpenSSL/1.1/lib -L/apps/Test/software/libffi/3.4.4-GCCcore-12.3.0/lib64 -L/apps/Test/software/libffi/3.4.4-GCCcore-12.3.0/lib -L/apps/Test/software/XZ/5.4.2-GCCcore-12.3.0/lib64 -L/apps/Test/software/XZ/5.4.2-GCCcore-12.3.0/lib -L/apps/Test/software/SQLite/3.42.0-GCCcore-12.3.0/lib64 -L/apps/Test/software/SQLite/3.42.0-GCCcore-12.3.0/lib -L/apps/Test/software/ncurses/6.4-GCCcore-12.3.0/lib64 -L/apps/Test/software/ncurses/6.4-GCCcore-12.3.0/lib -L/apps/Test/software/libreadline/8.2-GCCcore-12.3.0/lib64 -L/apps/Test/software/libreadline/8.2-GCCcore-12.3.0/lib -L/apps/Test/software/zlib/1.2.13-GCCcore-12.3.0/lib64 -L/apps/Test/software/zlib/1.2.13-GCCcore-12.3.0/lib -L/apps/Test/software/bzip2/1.0.8-GCCcore-12.3.0/lib64 -L/apps/Test/software/bzip2/1.0.8-GCCcore-12.3.0/lib -L/apps/Test/software/binutils/2.40-GCCcore-12.3.0/lib64 -L/apps/Test/software/binutils/2.40-GCCcore-12.3.0/lib -L/apps/Test/software/pkgconf/1.9.5-GCCcore-12.3.0/lib64 -L/apps/Test/software/pkgconf/1.9.5-GCCcore-12.3.0/lib -L/apps/Test/software/GCCcore/12.3.0/lib64 -L/apps/Test/software/GCCcore/12.3.0/lib -L/apps/Test/software/FFTW/3.3.10-GCC-12.3.0/lib64 -L/apps/Test/software/FFTW/3.3.10-GCC-12.3.0/lib -L/apps/Test/software/ScaLAPACK/2.2.0-gompi-2023a-fb/lib64 -L/apps/Test/software/ScaLAPACK/2.2.0-gompi-2023a-fb/lib -L/apps/Test/software/FlexiBLAS/3.3.1-GCC-12.3.0/lib64 -L/apps/Test/software/FlexiBLAS/3.3.1-GCC-12.3.0/lib -L/apps/Test/software/GCCcore/12.3.0/lib64 -L/apps/Test/software/GCCcore/12.3.0/lib -O2 -ftree-vectorize -march=native -fno-math-errno -I/apps/Test/software/FFTW/3.3.10-GCC-12.3.0/include -I/apps/Test/software/FlexiBLAS/3.3.1-GCC-12.3.0/include -I/apps/Test/software/FlexiBLAS/3.3.1-GCC-12.3.0/include/flexiblas build/temp.linux-x86_64-cpython-311/csrc/lion/fused_lion_frontend.o build/temp.linux-x86_64-cpython-311/csrc/lion/multi_tensor_lion.o -L/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/lib -L/apps/Common/software/CUDA/12.1.1/lib64 -L/apps/Test/software/Python/3.11.3-GCCcore-12.3.0/lib -lc10 -ltorch -ltorch_cpu -ltorch_python -lcudart -lc10_cuda -ltorch_cuda -o build/lib.linux-x86_64-cpython-311/deepspeed/ops/lion/fused_lion_op.cpython-311-x86_64-linux-gnu.so
      g++ -shared -L/apps/Test/software/OpenSSL/1.1/lib64 -L/apps/Test/software/OpenSSL/1.1/lib -L/apps/Test/software/libffi/3.4.4-GCCcore-12.3.0/lib64 -L/apps/Test/software/libffi/3.4.4-GCCcore-12.3.0/lib -L/apps/Test/software/XZ/5.4.2-GCCcore-12.3.0/lib64 -L/apps/Test/software/XZ/5.4.2-GCCcore-12.3.0/lib -L/apps/Test/software/SQLite/3.42.0-GCCcore-12.3.0/lib64 -L/apps/Test/software/SQLite/3.42.0-GCCcore-12.3.0/lib -L/apps/Test/software/ncurses/6.4-GCCcore-12.3.0/lib64 -L/apps/Test/software/ncurses/6.4-GCCcore-12.3.0/lib -L/apps/Test/software/libreadline/8.2-GCCcore-12.3.0/lib64 -L/apps/Test/software/libreadline/8.2-GCCcore-12.3.0/lib -L/apps/Test/software/zlib/1.2.13-GCCcore-12.3.0/lib64 -L/apps/Test/software/zlib/1.2.13-GCCcore-12.3.0/lib -L/apps/Test/software/bzip2/1.0.8-GCCcore-12.3.0/lib64 -L/apps/Test/software/bzip2/1.0.8-GCCcore-12.3.0/lib -L/apps/Test/software/binutils/2.40-GCCcore-12.3.0/lib64 -L/apps/Test/software/binutils/2.40-GCCcore-12.3.0/lib -L/apps/Test/software/pkgconf/1.9.5-GCCcore-12.3.0/lib64 -L/apps/Test/software/pkgconf/1.9.5-GCCcore-12.3.0/lib -L/apps/Test/software/GCCcore/12.3.0/lib64 -L/apps/Test/software/GCCcore/12.3.0/lib -L/apps/Test/software/OpenSSL/1.1/lib64 -L/apps/Test/software/OpenSSL/1.1/lib -L/apps/Test/software/libffi/3.4.4-GCCcore-12.3.0/lib64 -L/apps/Test/software/libffi/3.4.4-GCCcore-12.3.0/lib -L/apps/Test/software/XZ/5.4.2-GCCcore-12.3.0/lib64 -L/apps/Test/software/XZ/5.4.2-GCCcore-12.3.0/lib -L/apps/Test/software/SQLite/3.42.0-GCCcore-12.3.0/lib64 -L/apps/Test/software/SQLite/3.42.0-GCCcore-12.3.0/lib -L/apps/Test/software/ncurses/6.4-GCCcore-12.3.0/lib64 -L/apps/Test/software/ncurses/6.4-GCCcore-12.3.0/lib -L/apps/Test/software/libreadline/8.2-GCCcore-12.3.0/lib64 -L/apps/Test/software/libreadline/8.2-GCCcore-12.3.0/lib -L/apps/Test/software/zlib/1.2.13-GCCcore-12.3.0/lib64 -L/apps/Test/software/zlib/1.2.13-GCCcore-12.3.0/lib -L/apps/Test/software/bzip2/1.0.8-GCCcore-12.3.0/lib64 -L/apps/Test/software/bzip2/1.0.8-GCCcore-12.3.0/lib -L/apps/Test/software/binutils/2.40-GCCcore-12.3.0/lib64 -L/apps/Test/software/binutils/2.40-GCCcore-12.3.0/lib -L/apps/Test/software/pkgconf/1.9.5-GCCcore-12.3.0/lib64 -L/apps/Test/software/pkgconf/1.9.5-GCCcore-12.3.0/lib -L/apps/Test/software/GCCcore/12.3.0/lib64 -L/apps/Test/software/GCCcore/12.3.0/lib -L/apps/Test/software/FFTW/3.3.10-GCC-12.3.0/lib64 -L/apps/Test/software/FFTW/3.3.10-GCC-12.3.0/lib -L/apps/Test/software/ScaLAPACK/2.2.0-gompi-2023a-fb/lib64 -L/apps/Test/software/ScaLAPACK/2.2.0-gompi-2023a-fb/lib -L/apps/Test/software/FlexiBLAS/3.3.1-GCC-12.3.0/lib64 -L/apps/Test/software/FlexiBLAS/3.3.1-GCC-12.3.0/lib -L/apps/Test/software/GCCcore/12.3.0/lib64 -L/apps/Test/software/GCCcore/12.3.0/lib -O2 -ftree-vectorize -march=native -fno-math-errno -I/apps/Test/software/FFTW/3.3.10-GCC-12.3.0/include -I/apps/Test/software/FlexiBLAS/3.3.1-GCC-12.3.0/include -I/apps/Test/software/FlexiBLAS/3.3.1-GCC-12.3.0/include/flexiblas build/temp.linux-x86_64-cpython-311/csrc/adam/fused_adam_frontend.o build/temp.linux-x86_64-cpython-311/csrc/adam/multi_tensor_adam.o -L/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/lib -L/apps/Common/software/CUDA/12.1.1/lib64 -L/apps/Test/software/Python/3.11.3-GCCcore-12.3.0/lib -lc10 -ltorch -ltorch_cpu -ltorch_python -lcudart -lc10_cuda -ltorch_cuda -o build/lib.linux-x86_64-cpython-311/deepspeed/ops/adam/fused_adam_op.cpython-311-x86_64-linux-gnu.so
      /apps/Common/software/CUDA/12.1.1/bin/nvcc -I/dev/shm/DeepSpeed/0.14.5/foss-2023a-CUDA-12.1.1/DeepSpeed/DeepSpeed-0.14.5/deepspeed/inference/v2/kernels/core_ops/bias_activations -I/dev/shm/DeepSpeed/0.14.5/foss-2023a-CUDA-12.1.1/DeepSpeed/DeepSpeed-0.14.5/deepspeed/inference/v2/kernels/core_ops/blas_kernels -I/dev/shm/DeepSpeed/0.14.5/foss-2023a-CUDA-12.1.1/DeepSpeed/DeepSpeed-0.14.5/deepspeed/inference/v2/kernels/core_ops/cuda_layer_norm -I/dev/shm/DeepSpeed/0.14.5/foss-2023a-CUDA-12.1.1/DeepSpeed/DeepSpeed-0.14.5/deepspeed/inference/v2/kernels/core_ops/cuda_rms_norm -I/dev/shm/DeepSpeed/0.14.5/foss-2023a-CUDA-12.1.1/DeepSpeed/DeepSpeed-0.14.5/deepspeed/inference/v2/kernels/core_ops/gated_activations -I/dev/shm/DeepSpeed/0.14.5/foss-2023a-CUDA-12.1.1/DeepSpeed/DeepSpeed-0.14.5/deepspeed/inference/v2/kernels/core_ops/cuda_linear -I/dev/shm/DeepSpeed/0.14.5/foss-2023a-CUDA-12.1.1/DeepSpeed/DeepSpeed-0.14.5/deepspeed/inference/v2/kernels/includes -I/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include -I/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/TH -I/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/THC -I/apps/Common/software/CUDA/12.1.1/include -I/apps/Test/software/Python/3.11.3-GCCcore-12.3.0/include/python3.11 -c deepspeed/inference/v2/kernels/core_ops/cuda_layer_norm/layer_norm.cpp -o build/temp.linux-x86_64-cpython-311/deepspeed/inference/v2/kernels/core_ops/cuda_layer_norm/layer_norm.o -fPIC -O3 -std=c++17 -g -Wno-reorder -DBF16_AVAILABLE -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1017\" -DTORCH_EXTENSION_NAME=kernelsinference_core_ops -D_GLIBCXX_USE_CXX11_ABI=1
      /apps/Common/software/CUDA/12.1.1/bin/nvcc -I/dev/shm/DeepSpeed/0.14.5/foss-2023a-CUDA-12.1.1/DeepSpeed/DeepSpeed-0.14.5/csrc/transformer/inference/includes -I/dev/shm/DeepSpeed/0.14.5/foss-2023a-CUDA-12.1.1/DeepSpeed/DeepSpeed-0.14.5/csrc/includes -I/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include -I/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/TH -I/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/THC -I/apps/Common/software/CUDA/12.1.1/include -I/apps/Test/software/Python/3.11.3-GCCcore-12.3.0/include/python3.11 -c csrc/transformer/inference/csrc/transform.cu -o build/temp.linux-x86_64-cpython-311/csrc/transformer/inference/csrc/transform.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options '-fPIC' -Xcompiler -fPIC -O3 --use_fast_math -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ --threads=8 -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_86,code=sm_86 -DBF16_AVAILABLE -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1017\" -DTORCH_EXTENSION_NAME=transformer_inference_op -D_GLIBCXX_USE_CXX11_ABI=1 -ccbin gcc
      csrc/transformer/inference/csrc/transform.cu(38): warning #177-D: variable "d0_stride" was declared but never referenced
            int d0_stride = hidden_dim * seq_length;
                ^
      
      Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
      
      csrc/transformer/inference/csrc/transform.cu(66): warning #177-D: variable "lane" was declared but never referenced
            int lane = d3 & 0x1f;
                ^
      
      csrc/transformer/inference/csrc/transform.cu(109): warning #177-D: variable "half_dim" was declared but never referenced
            unsigned half_dim = (rotary_dim << 3) >> 1;
                     ^
                detected during instantiation of "void launch_bias_add_transform_0213(T *, T *, T *, const T *, const T *, int, int, unsigned int, int, int, int, int, int, __nv_bool, __nv_bool, cudaStream_t, int, int, float) [with T=__nv_bfloat16]" at line 281
      
      csrc/transformer/inference/csrc/transform.cu(110): warning #177-D: variable "d0_stride" was declared but never referenced
            int d0_stride = hidden_dim * seq_length;
                ^
                detected during instantiation of "void launch_bias_add_transform_0213(T *, T *, T *, const T *, const T *, int, int, unsigned int, int, int, int, int, int, __nv_bool, __nv_bool, cudaStream_t, int, int, float) [with T=__nv_bfloat16]" at line 281
      
      csrc/transformer/inference/csrc/transform.cu(126): warning #177-D: variable "vals_half" was declared but never referenced
            T2* vals_half = reinterpret_cast<T2*>(&vals_arr);
                ^
                detected during instantiation of "void launch_bias_add_transform_0213(T *, T *, T *, const T *, const T *, int, int, unsigned int, int, int, int, int, int, __nv_bool, __nv_bool, cudaStream_t, int, int, float) [with T=__nv_bfloat16]" at line 281
      
      csrc/transformer/inference/csrc/transform.cu(127): warning #177-D: variable "output_half" was declared but never referenced
            T2* output_half = reinterpret_cast<T2*>(&output_arr);
                ^
                detected during instantiation of "void launch_bias_add_transform_0213(T *, T *, T *, const T *, const T *, int, int, unsigned int, int, int, int, int, int, __nv_bool, __nv_bool, cudaStream_t, int, int, float) [with T=__nv_bfloat16]" at line 281
      
      csrc/transformer/inference/csrc/transform.cu(144): warning #177-D: variable "lane" was declared but never referenced
            int lane = d3 & 0x1f;
                ^
                detected during instantiation of "void launch_bias_add_transform_0213(T *, T *, T *, const T *, const T *, int, int, unsigned int, int, int, int, int, int, __nv_bool, __nv_bool, cudaStream_t, int, int, float) [with T=__nv_bfloat16]" at line 281
      
      csrc/transformer/inference/csrc/transform.cu(38): warning #177-D: variable "d0_stride" was declared but never referenced
            int d0_stride = hidden_dim * seq_length;
                ^
      
      Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
      
      csrc/transformer/inference/csrc/transform.cu(66): warning #177-D: variable "lane" was declared but never referenced
            int lane = d3 & 0x1f;
                ^
      
      csrc/transformer/inference/csrc/transform.cu(109): warning #177-D: variable "half_dim" was declared but never referenced
            unsigned half_dim = (rotary_dim << 3) >> 1;
                     ^
                detected during instantiation of "void launch_bias_add_transform_0213(T *, T *, T *, const T *, const T *, int, int, unsigned int, int, int, int, int, int, __nv_bool, __nv_bool, cudaStream_t, int, int, float) [with T=__nv_bfloat16]" at line 281
      
      csrc/transformer/inference/csrc/transform.cu(110): warning #177-D: variable "d0_stride" was declared but never referenced
            int d0_stride = hidden_dim * seq_length;
                ^
                detected during instantiation of "void launch_bias_add_transform_0213(T *, T *, T *, const T *, const T *, int, int, unsigned int, int, int, int, int, int, __nv_bool, __nv_bool, cudaStream_t, int, int, float) [with T=__nv_bfloat16]" at line 281
      
      csrc/transformer/inference/csrc/transform.cu(126): warning #177-D: variable "vals_half" was declared but never referenced
            T2* vals_half = reinterpret_cast<T2*>(&vals_arr);
                ^
                detected during instantiation of "void launch_bias_add_transform_0213(T *, T *, T *, const T *, const T *, int, int, unsigned int, int, int, int, int, int, __nv_bool, __nv_bool, cudaStream_t, int, int, float) [with T=__nv_bfloat16]" at line 281
      
      csrc/transformer/inference/csrc/transform.cu(127): warning #177-D: variable "output_half" was declared but never referenced
            T2* output_half = reinterpret_cast<T2*>(&output_arr);
                ^
                detected during instantiation of "void launch_bias_add_transform_0213(T *, T *, T *, const T *, const T *, int, int, unsigned int, int, int, int, int, int, __nv_bool, __nv_bool, cudaStream_t, int, int, float) [with T=__nv_bfloat16]" at line 281
      
      csrc/transformer/inference/csrc/transform.cu(144): warning #177-D: variable "lane" was declared but never referenced
            int lane = d3 & 0x1f;
                ^
                detected during instantiation of "void launch_bias_add_transform_0213(T *, T *, T *, const T *, const T *, int, int, unsigned int, int, int, int, int, int, __nv_bool, __nv_bool, cudaStream_t, int, int, float) [with T=__nv_bfloat16]" at line 281
      
      gcc -DNDEBUG -g -fwrapv -O3 -Wall -O2 -ftree-vectorize -march=native -fno-math-errno -fPIC -O2 -ftree-vectorize -march=native -fno-math-errno -fPIC -O2 -ftree-vectorize -march=native -fno-math-errno -I/apps/Test/software/FFTW/3.3.10-GCC-12.3.0/include -I/apps/Test/software/FlexiBLAS/3.3.1-GCC-12.3.0/include -I/apps/Test/software/FlexiBLAS/3.3.1-GCC-12.3.0/include/flexiblas -fPIC -I/dev/shm/DeepSpeed/0.14.5/foss-2023a-CUDA-12.1.1/DeepSpeed/DeepSpeed-0.14.5/csrc/aio/py_lib -I/dev/shm/DeepSpeed/0.14.5/foss-2023a-CUDA-12.1.1/DeepSpeed/DeepSpeed-0.14.5/csrc/aio/common -I/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include -I/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/TH -I/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/THC -I/apps/Test/software/Python/3.11.3-GCCcore-12.3.0/include/python3.11 -c csrc/aio/py_lib/deepspeed_py_aio_handle.cpp -o build/temp.linux-x86_64-cpython-311/csrc/aio/py_lib/deepspeed_py_aio_handle.o -g -Wall -O0 -std=c++17 -shared -fPIC -Wno-reorder -march=native -fopenmp -D__AVX512__ -laio -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1017\" -DTORCH_EXTENSION_NAME=async_io_op -D_GLIBCXX_USE_CXX11_ABI=1
      creating build/lib.linux-x86_64-cpython-311/deepspeed/ops/transformer
      creating build/lib.linux-x86_64-cpython-311/deepspeed/ops/transformer/inference
      g++ -shared -L/apps/Test/software/OpenSSL/1.1/lib64 -L/apps/Test/software/OpenSSL/1.1/lib -L/apps/Test/software/libffi/3.4.4-GCCcore-12.3.0/lib64 -L/apps/Test/software/libffi/3.4.4-GCCcore-12.3.0/lib -L/apps/Test/software/XZ/5.4.2-GCCcore-12.3.0/lib64 -L/apps/Test/software/XZ/5.4.2-GCCcore-12.3.0/lib -L/apps/Test/software/SQLite/3.42.0-GCCcore-12.3.0/lib64 -L/apps/Test/software/SQLite/3.42.0-GCCcore-12.3.0/lib -L/apps/Test/software/ncurses/6.4-GCCcore-12.3.0/lib64 -L/apps/Test/software/ncurses/6.4-GCCcore-12.3.0/lib -L/apps/Test/software/libreadline/8.2-GCCcore-12.3.0/lib64 -L/apps/Test/software/libreadline/8.2-GCCcore-12.3.0/lib -L/apps/Test/software/zlib/1.2.13-GCCcore-12.3.0/lib64 -L/apps/Test/software/zlib/1.2.13-GCCcore-12.3.0/lib -L/apps/Test/software/bzip2/1.0.8-GCCcore-12.3.0/lib64 -L/apps/Test/software/bzip2/1.0.8-GCCcore-12.3.0/lib -L/apps/Test/software/binutils/2.40-GCCcore-12.3.0/lib64 -L/apps/Test/software/binutils/2.40-GCCcore-12.3.0/lib -L/apps/Test/software/pkgconf/1.9.5-GCCcore-12.3.0/lib64 -L/apps/Test/software/pkgconf/1.9.5-GCCcore-12.3.0/lib -L/apps/Test/software/GCCcore/12.3.0/lib64 -L/apps/Test/software/GCCcore/12.3.0/lib -L/apps/Test/software/OpenSSL/1.1/lib64 -L/apps/Test/software/OpenSSL/1.1/lib -L/apps/Test/software/libffi/3.4.4-GCCcore-12.3.0/lib64 -L/apps/Test/software/libffi/3.4.4-GCCcore-12.3.0/lib -L/apps/Test/software/XZ/5.4.2-GCCcore-12.3.0/lib64 -L/apps/Test/software/XZ/5.4.2-GCCcore-12.3.0/lib -L/apps/Test/software/SQLite/3.42.0-GCCcore-12.3.0/lib64 -L/apps/Test/software/SQLite/3.42.0-GCCcore-12.3.0/lib -L/apps/Test/software/ncurses/6.4-GCCcore-12.3.0/lib64 -L/apps/Test/software/ncurses/6.4-GCCcore-12.3.0/lib -L/apps/Test/software/libreadline/8.2-GCCcore-12.3.0/lib64 -L/apps/Test/software/libreadline/8.2-GCCcore-12.3.0/lib -L/apps/Test/software/zlib/1.2.13-GCCcore-12.3.0/lib64 -L/apps/Test/software/zlib/1.2.13-GCCcore-12.3.0/lib -L/apps/Test/software/bzip2/1.0.8-GCCcore-12.3.0/lib64 -L/apps/Test/software/bzip2/1.0.8-GCCcore-12.3.0/lib -L/apps/Test/software/binutils/2.40-GCCcore-12.3.0/lib64 -L/apps/Test/software/binutils/2.40-GCCcore-12.3.0/lib -L/apps/Test/software/pkgconf/1.9.5-GCCcore-12.3.0/lib64 -L/apps/Test/software/pkgconf/1.9.5-GCCcore-12.3.0/lib -L/apps/Test/software/GCCcore/12.3.0/lib64 -L/apps/Test/software/GCCcore/12.3.0/lib -L/apps/Test/software/FFTW/3.3.10-GCC-12.3.0/lib64 -L/apps/Test/software/FFTW/3.3.10-GCC-12.3.0/lib -L/apps/Test/software/ScaLAPACK/2.2.0-gompi-2023a-fb/lib64 -L/apps/Test/software/ScaLAPACK/2.2.0-gompi-2023a-fb/lib -L/apps/Test/software/FlexiBLAS/3.3.1-GCC-12.3.0/lib64 -L/apps/Test/software/FlexiBLAS/3.3.1-GCC-12.3.0/lib -L/apps/Test/software/GCCcore/12.3.0/lib64 -L/apps/Test/software/GCCcore/12.3.0/lib -O2 -ftree-vectorize -march=native -fno-math-errno -I/apps/Test/software/FFTW/3.3.10-GCC-12.3.0/include -I/apps/Test/software/FlexiBLAS/3.3.1-GCC-12.3.0/include -I/apps/Test/software/FlexiBLAS/3.3.1-GCC-12.3.0/include/flexiblas build/temp.linux-x86_64-cpython-311/csrc/transformer/inference/csrc/apply_rotary_pos_emb.o build/temp.linux-x86_64-cpython-311/csrc/transformer/inference/csrc/dequantize.o build/temp.linux-x86_64-cpython-311/csrc/transformer/inference/csrc/gelu.o build/temp.linux-x86_64-cpython-311/csrc/transformer/inference/csrc/layer_norm.o build/temp.linux-x86_64-cpython-311/csrc/transformer/inference/csrc/pointwise_ops.o build/temp.linux-x86_64-cpython-311/csrc/transformer/inference/csrc/pt_binding.o build/temp.linux-x86_64-cpython-311/csrc/transformer/inference/csrc/relu.o build/temp.linux-x86_64-cpython-311/csrc/transformer/inference/csrc/rms_norm.o build/temp.linux-x86_64-cpython-311/csrc/transformer/inference/csrc/softmax.o build/temp.linux-x86_64-cpython-311/csrc/transformer/inference/csrc/transform.o -L/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/lib -L/apps/Common/software/CUDA/12.1.1/lib64 -L/apps/Test/software/Python/3.11.3-GCCcore-12.3.0/lib -lc10 -ltorch -ltorch_cpu -ltorch_python -lcudart -lc10_cuda -ltorch_cuda -o build/lib.linux-x86_64-cpython-311/deepspeed/ops/transformer/inference/transformer_inference_op.cpython-311-x86_64-linux-gnu.so -lcurand
      /apps/Common/software/CUDA/12.1.1/bin/nvcc -I/dev/shm/DeepSpeed/0.14.5/foss-2023a-CUDA-12.1.1/DeepSpeed/DeepSpeed-0.14.5/csrc/includes -I/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include -I/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/TH -I/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/THC -I/apps/Common/software/CUDA/12.1.1/include -I/apps/Test/software/Python/3.11.3-GCCcore-12.3.0/include/python3.11 -c csrc/transformer/transform_kernels.cu -o build/temp.linux-x86_64-cpython-311/csrc/transformer/transform_kernels.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options '-fPIC' -Xcompiler -fPIC -O3 --use_fast_math -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ --threads=8 -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_86,code=sm_86 -D__STOCHASTIC_MODE__ -DBF16_AVAILABLE -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1017\" -DTORCH_EXTENSION_NAME=stochastic_transformer_op -D_GLIBCXX_USE_CXX11_ABI=1 -ccbin gcc
      /apps/Common/software/CUDA/12.1.1/bin/nvcc -I/dev/shm/DeepSpeed/0.14.5/foss-2023a-CUDA-12.1.1/DeepSpeed/DeepSpeed-0.14.5/csrc/includes -I/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include -I/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/TH -I/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/THC -I/apps/Common/software/CUDA/12.1.1/include -I/apps/Test/software/Python/3.11.3-GCCcore-12.3.0/include/python3.11 -c csrc/transformer/transform_kernels.cu -o build/temp.linux-x86_64-cpython-311/csrc/transformer/transform_kernels.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options '-fPIC' -Xcompiler -fPIC -O3 --use_fast_math -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ --threads=8 -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_86,code=sm_86 -DBF16_AVAILABLE -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1017\" -DTORCH_EXTENSION_NAME=transformer_op -D_GLIBCXX_USE_CXX11_ABI=1 -ccbin gcc
      /apps/Common/software/CUDA/12.1.1/bin/nvcc -I/dev/shm/DeepSpeed/0.14.5/foss-2023a-CUDA-12.1.1/DeepSpeed/DeepSpeed-0.14.5/csrc/includes -I/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include -I/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/TH -I/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/THC -I/apps/Common/software/CUDA/12.1.1/include -I/apps/Test/software/Python/3.11.3-GCCcore-12.3.0/include/python3.11 -c csrc/quantization/quantize.cu -o build/temp.linux-x86_64-cpython-311/csrc/quantization/quantize.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options '-fPIC' -Xcompiler -fPIC -O3 --use_fast_math -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ --threads=8 -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_86,code=sm_86 -DBF16_AVAILABLE -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1017\" -DTORCH_EXTENSION_NAME=quantizer_op -D_GLIBCXX_USE_CXX11_ABI=1 -ccbin gcc
      /dev/shm/DeepSpeed/0.14.5/foss-2023a-CUDA-12.1.1/DeepSpeed/DeepSpeed-0.14.5/csrc/includes/reduction_utils.h(787): warning #1866-D: attribute does not apply to any entity
        __attribute__((aligned(8))) struct IdxReduceResult {
                       ^
      
      Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
      
      g++ -shared -L/apps/Test/software/OpenSSL/1.1/lib64 -L/apps/Test/software/OpenSSL/1.1/lib -L/apps/Test/software/libffi/3.4.4-GCCcore-12.3.0/lib64 -L/apps/Test/software/libffi/3.4.4-GCCcore-12.3.0/lib -L/apps/Test/software/XZ/5.4.2-GCCcore-12.3.0/lib64 -L/apps/Test/software/XZ/5.4.2-GCCcore-12.3.0/lib -L/apps/Test/software/SQLite/3.42.0-GCCcore-12.3.0/lib64 -L/apps/Test/software/SQLite/3.42.0-GCCcore-12.3.0/lib -L/apps/Test/software/ncurses/6.4-GCCcore-12.3.0/lib64 -L/apps/Test/software/ncurses/6.4-GCCcore-12.3.0/lib -L/apps/Test/software/libreadline/8.2-GCCcore-12.3.0/lib64 -L/apps/Test/software/libreadline/8.2-GCCcore-12.3.0/lib -L/apps/Test/software/zlib/1.2.13-GCCcore-12.3.0/lib64 -L/apps/Test/software/zlib/1.2.13-GCCcore-12.3.0/lib -L/apps/Test/software/bzip2/1.0.8-GCCcore-12.3.0/lib64 -L/apps/Test/software/bzip2/1.0.8-GCCcore-12.3.0/lib -L/apps/Test/software/binutils/2.40-GCCcore-12.3.0/lib64 -L/apps/Test/software/binutils/2.40-GCCcore-12.3.0/lib -L/apps/Test/software/pkgconf/1.9.5-GCCcore-12.3.0/lib64 -L/apps/Test/software/pkgconf/1.9.5-GCCcore-12.3.0/lib -L/apps/Test/software/GCCcore/12.3.0/lib64 -L/apps/Test/software/GCCcore/12.3.0/lib -L/apps/Test/software/OpenSSL/1.1/lib64 -L/apps/Test/software/OpenSSL/1.1/lib -L/apps/Test/software/libffi/3.4.4-GCCcore-12.3.0/lib64 -L/apps/Test/software/libffi/3.4.4-GCCcore-12.3.0/lib -L/apps/Test/software/XZ/5.4.2-GCCcore-12.3.0/lib64 -L/apps/Test/software/XZ/5.4.2-GCCcore-12.3.0/lib -L/apps/Test/software/SQLite/3.42.0-GCCcore-12.3.0/lib64 -L/apps/Test/software/SQLite/3.42.0-GCCcore-12.3.0/lib -L/apps/Test/software/ncurses/6.4-GCCcore-12.3.0/lib64 -L/apps/Test/software/ncurses/6.4-GCCcore-12.3.0/lib -L/apps/Test/software/libreadline/8.2-GCCcore-12.3.0/lib64 -L/apps/Test/software/libreadline/8.2-GCCcore-12.3.0/lib -L/apps/Test/software/zlib/1.2.13-GCCcore-12.3.0/lib64 -L/apps/Test/software/zlib/1.2.13-GCCcore-12.3.0/lib -L/apps/Test/software/bzip2/1.0.8-GCCcore-12.3.0/lib64 -L/apps/Test/software/bzip2/1.0.8-GCCcore-12.3.0/lib -L/apps/Test/software/binutils/2.40-GCCcore-12.3.0/lib64 -L/apps/Test/software/binutils/2.40-GCCcore-12.3.0/lib -L/apps/Test/software/pkgconf/1.9.5-GCCcore-12.3.0/lib64 -L/apps/Test/software/pkgconf/1.9.5-GCCcore-12.3.0/lib -L/apps/Test/software/GCCcore/12.3.0/lib64 -L/apps/Test/software/GCCcore/12.3.0/lib -L/apps/Test/software/FFTW/3.3.10-GCC-12.3.0/lib64 -L/apps/Test/software/FFTW/3.3.10-GCC-12.3.0/lib -L/apps/Test/software/ScaLAPACK/2.2.0-gompi-2023a-fb/lib64 -L/apps/Test/software/ScaLAPACK/2.2.0-gompi-2023a-fb/lib -L/apps/Test/software/FlexiBLAS/3.3.1-GCC-12.3.0/lib64 -L/apps/Test/software/FlexiBLAS/3.3.1-GCC-12.3.0/lib -L/apps/Test/software/GCCcore/12.3.0/lib64 -L/apps/Test/software/GCCcore/12.3.0/lib -O2 -ftree-vectorize -march=native -fno-math-errno -I/apps/Test/software/FFTW/3.3.10-GCC-12.3.0/include -I/apps/Test/software/FlexiBLAS/3.3.1-GCC-12.3.0/include -I/apps/Test/software/FlexiBLAS/3.3.1-GCC-12.3.0/include/flexiblas build/temp.linux-x86_64-cpython-311/csrc/transformer/cublas_wrappers.o build/temp.linux-x86_64-cpython-311/csrc/transformer/dropout_kernels.o build/temp.linux-x86_64-cpython-311/csrc/transformer/ds_transformer_cuda.o build/temp.linux-x86_64-cpython-311/csrc/transformer/gelu_kernels.o build/temp.linux-x86_64-cpython-311/csrc/transformer/general_kernels.o build/temp.linux-x86_64-cpython-311/csrc/transformer/normalize_kernels.o build/temp.linux-x86_64-cpython-311/csrc/transformer/softmax_kernels.o build/temp.linux-x86_64-cpython-311/csrc/transformer/transform_kernels.o -L/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/lib -L/apps/Common/software/CUDA/12.1.1/lib64 -L/apps/Test/software/Python/3.11.3-GCCcore-12.3.0/lib -lc10 -ltorch -ltorch_cpu -ltorch_python -lcudart -lc10_cuda -ltorch_cuda -o build/lib.linux-x86_64-cpython-311/deepspeed/ops/transformer/stochastic_transformer_op.cpython-311-x86_64-linux-gnu.so -lcurand
      build/temp.linux-x86_64-cpython-311/csrc/transformer/dropout_kernels.o: file not recognized: file format not recognized
      collect2: error: ld returned 1 exit status
      g++ -shared -L/apps/Test/software/OpenSSL/1.1/lib64 -L/apps/Test/software/OpenSSL/1.1/lib -L/apps/Test/software/libffi/3.4.4-GCCcore-12.3.0/lib64 -L/apps/Test/software/libffi/3.4.4-GCCcore-12.3.0/lib -L/apps/Test/software/XZ/5.4.2-GCCcore-12.3.0/lib64 -L/apps/Test/software/XZ/5.4.2-GCCcore-12.3.0/lib -L/apps/Test/software/SQLite/3.42.0-GCCcore-12.3.0/lib64 -L/apps/Test/software/SQLite/3.42.0-GCCcore-12.3.0/lib -L/apps/Test/software/ncurses/6.4-GCCcore-12.3.0/lib64 -L/apps/Test/software/ncurses/6.4-GCCcore-12.3.0/lib -L/apps/Test/software/libreadline/8.2-GCCcore-12.3.0/lib64 -L/apps/Test/software/libreadline/8.2-GCCcore-12.3.0/lib -L/apps/Test/software/zlib/1.2.13-GCCcore-12.3.0/lib64 -L/apps/Test/software/zlib/1.2.13-GCCcore-12.3.0/lib -L/apps/Test/software/bzip2/1.0.8-GCCcore-12.3.0/lib64 -L/apps/Test/software/bzip2/1.0.8-GCCcore-12.3.0/lib -L/apps/Test/software/binutils/2.40-GCCcore-12.3.0/lib64 -L/apps/Test/software/binutils/2.40-GCCcore-12.3.0/lib -L/apps/Test/software/pkgconf/1.9.5-GCCcore-12.3.0/lib64 -L/apps/Test/software/pkgconf/1.9.5-GCCcore-12.3.0/lib -L/apps/Test/software/GCCcore/12.3.0/lib64 -L/apps/Test/software/GCCcore/12.3.0/lib -L/apps/Test/software/OpenSSL/1.1/lib64 -L/apps/Test/software/OpenSSL/1.1/lib -L/apps/Test/software/libffi/3.4.4-GCCcore-12.3.0/lib64 -L/apps/Test/software/libffi/3.4.4-GCCcore-12.3.0/lib -L/apps/Test/software/XZ/5.4.2-GCCcore-12.3.0/lib64 -L/apps/Test/software/XZ/5.4.2-GCCcore-12.3.0/lib -L/apps/Test/software/SQLite/3.42.0-GCCcore-12.3.0/lib64 -L/apps/Test/software/SQLite/3.42.0-GCCcore-12.3.0/lib -L/apps/Test/software/ncurses/6.4-GCCcore-12.3.0/lib64 -L/apps/Test/software/ncurses/6.4-GCCcore-12.3.0/lib -L/apps/Test/software/libreadline/8.2-GCCcore-12.3.0/lib64 -L/apps/Test/software/libreadline/8.2-GCCcore-12.3.0/lib -L/apps/Test/software/zlib/1.2.13-GCCcore-12.3.0/lib64 -L/apps/Test/software/zlib/1.2.13-GCCcore-12.3.0/lib -L/apps/Test/software/bzip2/1.0.8-GCCcore-12.3.0/lib64 -L/apps/Test/software/bzip2/1.0.8-GCCcore-12.3.0/lib -L/apps/Test/software/binutils/2.40-GCCcore-12.3.0/lib64 -L/apps/Test/software/binutils/2.40-GCCcore-12.3.0/lib -L/apps/Test/software/pkgconf/1.9.5-GCCcore-12.3.0/lib64 -L/apps/Test/software/pkgconf/1.9.5-GCCcore-12.3.0/lib -L/apps/Test/software/GCCcore/12.3.0/lib64 -L/apps/Test/software/GCCcore/12.3.0/lib -L/apps/Test/software/FFTW/3.3.10-GCC-12.3.0/lib64 -L/apps/Test/software/FFTW/3.3.10-GCC-12.3.0/lib -L/apps/Test/software/ScaLAPACK/2.2.0-gompi-2023a-fb/lib64 -L/apps/Test/software/ScaLAPACK/2.2.0-gompi-2023a-fb/lib -L/apps/Test/software/FlexiBLAS/3.3.1-GCC-12.3.0/lib64 -L/apps/Test/software/FlexiBLAS/3.3.1-GCC-12.3.0/lib -L/apps/Test/software/GCCcore/12.3.0/lib64 -L/apps/Test/software/GCCcore/12.3.0/lib -O2 -ftree-vectorize -march=native -fno-math-errno -I/apps/Test/software/FFTW/3.3.10-GCC-12.3.0/include -I/apps/Test/software/FlexiBLAS/3.3.1-GCC-12.3.0/include -I/apps/Test/software/FlexiBLAS/3.3.1-GCC-12.3.0/include/flexiblas build/temp.linux-x86_64-cpython-311/csrc/transformer/cublas_wrappers.o build/temp.linux-x86_64-cpython-311/csrc/transformer/dropout_kernels.o build/temp.linux-x86_64-cpython-311/csrc/transformer/ds_transformer_cuda.o build/temp.linux-x86_64-cpython-311/csrc/transformer/gelu_kernels.o build/temp.linux-x86_64-cpython-311/csrc/transformer/general_kernels.o build/temp.linux-x86_64-cpython-311/csrc/transformer/normalize_kernels.o build/temp.linux-x86_64-cpython-311/csrc/transformer/softmax_kernels.o build/temp.linux-x86_64-cpython-311/csrc/transformer/transform_kernels.o -L/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/lib -L/apps/Common/software/CUDA/12.1.1/lib64 -L/apps/Test/software/Python/3.11.3-GCCcore-12.3.0/lib -lc10 -ltorch -ltorch_cpu -ltorch_python -lcudart -lc10_cuda -ltorch_cuda -o build/lib.linux-x86_64-cpython-311/deepspeed/ops/transformer/transformer_op.cpython-311-x86_64-linux-gnu.so -lcurand
      build/temp.linux-x86_64-cpython-311/csrc/transformer/dropout_kernels.o: file not recognized: file format not recognized
      collect2: error: ld returned 1 exit status
      /dev/shm/DeepSpeed/0.14.5/foss-2023a-CUDA-12.1.1/DeepSpeed/DeepSpeed-0.14.5/csrc/includes/reduction_utils.h(787): warning #1866-D: attribute does not apply to any entity
        __attribute__((aligned(8))) struct IdxReduceResult {
                       ^
      
      Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
      
      /dev/shm/DeepSpeed/0.14.5/foss-2023a-CUDA-12.1.1/DeepSpeed/DeepSpeed-0.14.5/csrc/includes/reduction_utils.h(787): warning #1866-D: attribute does not apply to any entity
        __attribute__((aligned(8))) struct IdxReduceResult {
                       ^
      
      Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
      
      /apps/Common/software/CUDA/12.1.1/bin/nvcc -I/dev/shm/DeepSpeed/0.14.5/foss-2023a-CUDA-12.1.1/DeepSpeed/DeepSpeed-0.14.5/csrc/includes -I/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include -I/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/TH -I/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/THC -I/apps/Common/software/CUDA/12.1.1/include -I/apps/Test/software/Python/3.11.3-GCCcore-12.3.0/include/python3.11 -c csrc/quantization/quantize_intX.cu -o build/temp.linux-x86_64-cpython-311/csrc/quantization/quantize_intX.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options '-fPIC' -Xcompiler -fPIC -O3 --use_fast_math -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ --threads=8 -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_86,code=sm_86 -DBF16_AVAILABLE -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1017\" -DTORCH_EXTENSION_NAME=quantizer_op -D_GLIBCXX_USE_CXX11_ABI=1 -ccbin gcc
      /apps/Common/software/CUDA/12.1.1/bin/nvcc -I/dev/shm/DeepSpeed/0.14.5/foss-2023a-CUDA-12.1.1/DeepSpeed/DeepSpeed-0.14.5/csrc/includes -I/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include -I/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/TH -I/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/THC -I/apps/Common/software/CUDA/12.1.1/include -I/apps/Test/software/Python/3.11.3-GCCcore-12.3.0/include/python3.11 -c csrc/quantization/swizzled_quantize.cu -o build/temp.linux-x86_64-cpython-311/csrc/quantization/swizzled_quantize.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options '-fPIC' -Xcompiler -fPIC -O3 --use_fast_math -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ --threads=8 -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_86,code=sm_86 -DBF16_AVAILABLE -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1017\" -DTORCH_EXTENSION_NAME=quantizer_op -D_GLIBCXX_USE_CXX11_ABI=1 -ccbin gcc
      /dev/shm/DeepSpeed/0.14.5/foss-2023a-CUDA-12.1.1/DeepSpeed/DeepSpeed-0.14.5/csrc/includes/reduction_utils.h(787): warning #1866-D: attribute does not apply to any entity
        __attribute__((aligned(8))) struct IdxReduceResult {
                       ^
      
      Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
      
      /dev/shm/DeepSpeed/0.14.5/foss-2023a-CUDA-12.1.1/DeepSpeed/DeepSpeed-0.14.5/csrc/includes/reduction_utils.h(787): warning #1866-D: attribute does not apply to any entity
        __attribute__((aligned(8))) struct IdxReduceResult {
                       ^
      
      Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
      
      /dev/shm/DeepSpeed/0.14.5/foss-2023a-CUDA-12.1.1/DeepSpeed/DeepSpeed-0.14.5/csrc/includes/reduction_utils.h(787): warning #1866-D: attribute does not apply to any entity
        __attribute__((aligned(8))) struct IdxReduceResult {
                       ^
      
      Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
      
      gcc -DNDEBUG -g -fwrapv -O3 -Wall -O2 -ftree-vectorize -march=native -fno-math-errno -fPIC -O2 -ftree-vectorize -march=native -fno-math-errno -fPIC -O2 -ftree-vectorize -march=native -fno-math-errno -I/apps/Test/software/FFTW/3.3.10-GCC-12.3.0/include -I/apps/Test/software/FlexiBLAS/3.3.1-GCC-12.3.0/include -I/apps/Test/software/FlexiBLAS/3.3.1-GCC-12.3.0/include/flexiblas -fPIC -I/dev/shm/DeepSpeed/0.14.5/foss-2023a-CUDA-12.1.1/DeepSpeed/DeepSpeed-0.14.5/csrc/aio/py_lib -I/dev/shm/DeepSpeed/0.14.5/foss-2023a-CUDA-12.1.1/DeepSpeed/DeepSpeed-0.14.5/csrc/aio/common -I/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include -I/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/TH -I/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/THC -I/apps/Test/software/Python/3.11.3-GCCcore-12.3.0/include/python3.11 -c csrc/aio/py_lib/deepspeed_py_copy.cpp -o build/temp.linux-x86_64-cpython-311/csrc/aio/py_lib/deepspeed_py_copy.o -g -Wall -O0 -std=c++17 -shared -fPIC -Wno-reorder -march=native -fopenmp -D__AVX512__ -laio -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1017\" -DTORCH_EXTENSION_NAME=async_io_op -D_GLIBCXX_USE_CXX11_ABI=1
      creating build/lib.linux-x86_64-cpython-311/deepspeed/ops/quantizer
      g++ -shared -L/apps/Test/software/OpenSSL/1.1/lib64 -L/apps/Test/software/OpenSSL/1.1/lib -L/apps/Test/software/libffi/3.4.4-GCCcore-12.3.0/lib64 -L/apps/Test/software/libffi/3.4.4-GCCcore-12.3.0/lib -L/apps/Test/software/XZ/5.4.2-GCCcore-12.3.0/lib64 -L/apps/Test/software/XZ/5.4.2-GCCcore-12.3.0/lib -L/apps/Test/software/SQLite/3.42.0-GCCcore-12.3.0/lib64 -L/apps/Test/software/SQLite/3.42.0-GCCcore-12.3.0/lib -L/apps/Test/software/ncurses/6.4-GCCcore-12.3.0/lib64 -L/apps/Test/software/ncurses/6.4-GCCcore-12.3.0/lib -L/apps/Test/software/libreadline/8.2-GCCcore-12.3.0/lib64 -L/apps/Test/software/libreadline/8.2-GCCcore-12.3.0/lib -L/apps/Test/software/zlib/1.2.13-GCCcore-12.3.0/lib64 -L/apps/Test/software/zlib/1.2.13-GCCcore-12.3.0/lib -L/apps/Test/software/bzip2/1.0.8-GCCcore-12.3.0/lib64 -L/apps/Test/software/bzip2/1.0.8-GCCcore-12.3.0/lib -L/apps/Test/software/binutils/2.40-GCCcore-12.3.0/lib64 -L/apps/Test/software/binutils/2.40-GCCcore-12.3.0/lib -L/apps/Test/software/pkgconf/1.9.5-GCCcore-12.3.0/lib64 -L/apps/Test/software/pkgconf/1.9.5-GCCcore-12.3.0/lib -L/apps/Test/software/GCCcore/12.3.0/lib64 -L/apps/Test/software/GCCcore/12.3.0/lib -L/apps/Test/software/OpenSSL/1.1/lib64 -L/apps/Test/software/OpenSSL/1.1/lib -L/apps/Test/software/libffi/3.4.4-GCCcore-12.3.0/lib64 -L/apps/Test/software/libffi/3.4.4-GCCcore-12.3.0/lib -L/apps/Test/software/XZ/5.4.2-GCCcore-12.3.0/lib64 -L/apps/Test/software/XZ/5.4.2-GCCcore-12.3.0/lib -L/apps/Test/software/SQLite/3.42.0-GCCcore-12.3.0/lib64 -L/apps/Test/software/SQLite/3.42.0-GCCcore-12.3.0/lib -L/apps/Test/software/ncurses/6.4-GCCcore-12.3.0/lib64 -L/apps/Test/software/ncurses/6.4-GCCcore-12.3.0/lib -L/apps/Test/software/libreadline/8.2-GCCcore-12.3.0/lib64 -L/apps/Test/software/libreadline/8.2-GCCcore-12.3.0/lib -L/apps/Test/software/zlib/1.2.13-GCCcore-12.3.0/lib64 -L/apps/Test/software/zlib/1.2.13-GCCcore-12.3.0/lib -L/apps/Test/software/bzip2/1.0.8-GCCcore-12.3.0/lib64 -L/apps/Test/software/bzip2/1.0.8-GCCcore-12.3.0/lib -L/apps/Test/software/binutils/2.40-GCCcore-12.3.0/lib64 -L/apps/Test/software/binutils/2.40-GCCcore-12.3.0/lib -L/apps/Test/software/pkgconf/1.9.5-GCCcore-12.3.0/lib64 -L/apps/Test/software/pkgconf/1.9.5-GCCcore-12.3.0/lib -L/apps/Test/software/GCCcore/12.3.0/lib64 -L/apps/Test/software/GCCcore/12.3.0/lib -L/apps/Test/software/FFTW/3.3.10-GCC-12.3.0/lib64 -L/apps/Test/software/FFTW/3.3.10-GCC-12.3.0/lib -L/apps/Test/software/ScaLAPACK/2.2.0-gompi-2023a-fb/lib64 -L/apps/Test/software/ScaLAPACK/2.2.0-gompi-2023a-fb/lib -L/apps/Test/software/FlexiBLAS/3.3.1-GCC-12.3.0/lib64 -L/apps/Test/software/FlexiBLAS/3.3.1-GCC-12.3.0/lib -L/apps/Test/software/GCCcore/12.3.0/lib64 -L/apps/Test/software/GCCcore/12.3.0/lib -O2 -ftree-vectorize -march=native -fno-math-errno -I/apps/Test/software/FFTW/3.3.10-GCC-12.3.0/include -I/apps/Test/software/FlexiBLAS/3.3.1-GCC-12.3.0/include -I/apps/Test/software/FlexiBLAS/3.3.1-GCC-12.3.0/include/flexiblas build/temp.linux-x86_64-cpython-311/csrc/quantization/dequantize.o build/temp.linux-x86_64-cpython-311/csrc/quantization/fake_quantizer.o build/temp.linux-x86_64-cpython-311/csrc/quantization/pt_binding.o build/temp.linux-x86_64-cpython-311/csrc/quantization/quant_reduce.o build/temp.linux-x86_64-cpython-311/csrc/quantization/quantize.o build/temp.linux-x86_64-cpython-311/csrc/quantization/quantize_intX.o build/temp.linux-x86_64-cpython-311/csrc/quantization/swizzled_quantize.o -L/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/lib -L/apps/Common/software/CUDA/12.1.1/lib64 -L/apps/Test/software/Python/3.11.3-GCCcore-12.3.0/lib -lc10 -ltorch -ltorch_cpu -ltorch_python -lcudart -lc10_cuda -ltorch_cuda -o build/lib.linux-x86_64-cpython-311/deepspeed/ops/quantizer/quantizer_op.cpython-311-x86_64-linux-gnu.so -lcurand
      /apps/Common/software/CUDA/12.1.1/bin/nvcc -I/dev/shm/DeepSpeed/0.14.5/foss-2023a-CUDA-12.1.1/DeepSpeed/DeepSpeed-0.14.5/deepspeed/inference/v2/kernels/core_ops/bias_activations -I/dev/shm/DeepSpeed/0.14.5/foss-2023a-CUDA-12.1.1/DeepSpeed/DeepSpeed-0.14.5/deepspeed/inference/v2/kernels/core_ops/blas_kernels -I/dev/shm/DeepSpeed/0.14.5/foss-2023a-CUDA-12.1.1/DeepSpeed/DeepSpeed-0.14.5/deepspeed/inference/v2/kernels/core_ops/cuda_layer_norm -I/dev/shm/DeepSpeed/0.14.5/foss-2023a-CUDA-12.1.1/DeepSpeed/DeepSpeed-0.14.5/deepspeed/inference/v2/kernels/core_ops/cuda_rms_norm -I/dev/shm/DeepSpeed/0.14.5/foss-2023a-CUDA-12.1.1/DeepSpeed/DeepSpeed-0.14.5/deepspeed/inference/v2/kernels/core_ops/gated_activations -I/dev/shm/DeepSpeed/0.14.5/foss-2023a-CUDA-12.1.1/DeepSpeed/DeepSpeed-0.14.5/deepspeed/inference/v2/kernels/core_ops/cuda_linear -I/dev/shm/DeepSpeed/0.14.5/foss-2023a-CUDA-12.1.1/DeepSpeed/DeepSpeed-0.14.5/deepspeed/inference/v2/kernels/includes -I/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include -I/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/TH -I/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/THC -I/apps/Common/software/CUDA/12.1.1/include -I/apps/Test/software/Python/3.11.3-GCCcore-12.3.0/include/python3.11 -c deepspeed/inference/v2/kernels/core_ops/cuda_layer_norm/layer_norm_cuda.cu -o build/temp.linux-x86_64-cpython-311/deepspeed/inference/v2/kernels/core_ops/cuda_layer_norm/layer_norm_cuda.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options '-fPIC' -Xcompiler -fPIC -O3 --use_fast_math -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ --threads=8 -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_86,code=sm_86 -DBF16_AVAILABLE -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1017\" -DTORCH_EXTENSION_NAME=kernelsinference_core_ops -D_GLIBCXX_USE_CXX11_ABI=1 -ccbin gcc
      /dev/shm/DeepSpeed/0.14.5/foss-2023a-CUDA-12.1.1/DeepSpeed/DeepSpeed-0.14.5/deepspeed/inference/v2/kernels/includes/reduction_utils.h(739): warning #1866-D: attribute does not apply to any entity
        __attribute__((aligned(8))) struct IdxReduceResult {
                       ^
      
      Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
      
      /dev/shm/DeepSpeed/0.14.5/foss-2023a-CUDA-12.1.1/DeepSpeed/DeepSpeed-0.14.5/deepspeed/inference/v2/kernels/includes/reduction_utils.h(739): warning #1866-D: attribute does not apply to any entity
        __attribute__((aligned(8))) struct IdxReduceResult {
                       ^
      
      Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
      
      /dev/shm/DeepSpeed/0.14.5/foss-2023a-CUDA-12.1.1/DeepSpeed/DeepSpeed-0.14.5/deepspeed/inference/v2/kernels/includes/reduction_utils.h(739): warning #1866-D: attribute does not apply to any entity
        __attribute__((aligned(8))) struct IdxReduceResult {
                       ^
      
      Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
      
      /apps/Common/software/CUDA/12.1.1/bin/nvcc -I/dev/shm/DeepSpeed/0.14.5/foss-2023a-CUDA-12.1.1/DeepSpeed/DeepSpeed-0.14.5/deepspeed/inference/v2/kernels/core_ops/bias_activations -I/dev/shm/DeepSpeed/0.14.5/foss-2023a-CUDA-12.1.1/DeepSpeed/DeepSpeed-0.14.5/deepspeed/inference/v2/kernels/core_ops/blas_kernels -I/dev/shm/DeepSpeed/0.14.5/foss-2023a-CUDA-12.1.1/DeepSpeed/DeepSpeed-0.14.5/deepspeed/inference/v2/kernels/core_ops/cuda_layer_norm -I/dev/shm/DeepSpeed/0.14.5/foss-2023a-CUDA-12.1.1/DeepSpeed/DeepSpeed-0.14.5/deepspeed/inference/v2/kernels/core_ops/cuda_rms_norm -I/dev/shm/DeepSpeed/0.14.5/foss-2023a-CUDA-12.1.1/DeepSpeed/DeepSpeed-0.14.5/deepspeed/inference/v2/kernels/core_ops/gated_activations -I/dev/shm/DeepSpeed/0.14.5/foss-2023a-CUDA-12.1.1/DeepSpeed/DeepSpeed-0.14.5/deepspeed/inference/v2/kernels/core_ops/cuda_linear -I/dev/shm/DeepSpeed/0.14.5/foss-2023a-CUDA-12.1.1/DeepSpeed/DeepSpeed-0.14.5/deepspeed/inference/v2/kernels/includes -I/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include -I/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/TH -I/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/THC -I/apps/Common/software/CUDA/12.1.1/include -I/apps/Test/software/Python/3.11.3-GCCcore-12.3.0/include/python3.11 -c deepspeed/inference/v2/kernels/core_ops/cuda_linear/linear_kernels.cpp -o build/temp.linux-x86_64-cpython-311/deepspeed/inference/v2/kernels/core_ops/cuda_linear/linear_kernels.o -fPIC -O3 -std=c++17 -g -Wno-reorder -DBF16_AVAILABLE -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1017\" -DTORCH_EXTENSION_NAME=kernelsinference_core_ops -D_GLIBCXX_USE_CXX11_ABI=1
      gcc -DNDEBUG -g -fwrapv -O3 -Wall -O2 -ftree-vectorize -march=native -fno-math-errno -fPIC -O2 -ftree-vectorize -march=native -fno-math-errno -fPIC -O2 -ftree-vectorize -march=native -fno-math-errno -I/apps/Test/software/FFTW/3.3.10-GCC-12.3.0/include -I/apps/Test/software/FlexiBLAS/3.3.1-GCC-12.3.0/include -I/apps/Test/software/FlexiBLAS/3.3.1-GCC-12.3.0/include/flexiblas -fPIC -I/dev/shm/DeepSpeed/0.14.5/foss-2023a-CUDA-12.1.1/DeepSpeed/DeepSpeed-0.14.5/csrc/aio/py_lib -I/dev/shm/DeepSpeed/0.14.5/foss-2023a-CUDA-12.1.1/DeepSpeed/DeepSpeed-0.14.5/csrc/aio/common -I/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include -I/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/TH -I/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/THC -I/apps/Test/software/Python/3.11.3-GCCcore-12.3.0/include/python3.11 -c csrc/aio/py_lib/py_ds_aio.cpp -o build/temp.linux-x86_64-cpython-311/csrc/aio/py_lib/py_ds_aio.o -g -Wall -O0 -std=c++17 -shared -fPIC -Wno-reorder -march=native -fopenmp -D__AVX512__ -laio -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1017\" -DTORCH_EXTENSION_NAME=async_io_op -D_GLIBCXX_USE_CXX11_ABI=1
      In file included from /apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/torch/csrc/Exceptions.h:14,
                       from /apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/torch/csrc/api/include/torch/python.h:11,
                       from /apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/torch/extension.h:9,
                       from csrc/aio/py_lib/py_ds_aio.cpp:10:
      /apps/Test/software/pybind11/2.11.1-GCCcore-12.3.0/include/pybind11/pybind11.h: In instantiation of class pybind11::class_<deepspeed_aio_handle_t>:
      csrc/aio/py_lib/py_ds_aio.cpp:22:55:   required from here
      /apps/Test/software/pybind11/2.11.1-GCCcore-12.3.0/include/pybind11/pybind11.h:1496:7: warning: pybind11::class_<deepspeed_aio_handle_t> declared with greater visibility than its base pybind11::detail::generic_type [-Wattributes]
       1496 | class class_ : public detail::generic_type {
            |       ^~~~~~
      /apps/Common/software/CUDA/12.1.1/bin/nvcc -I/dev/shm/DeepSpeed/0.14.5/foss-2023a-CUDA-12.1.1/DeepSpeed/DeepSpeed-0.14.5/deepspeed/inference/v2/kernels/core_ops/bias_activations -I/dev/shm/DeepSpeed/0.14.5/foss-2023a-CUDA-12.1.1/DeepSpeed/DeepSpeed-0.14.5/deepspeed/inference/v2/kernels/core_ops/blas_kernels -I/dev/shm/DeepSpeed/0.14.5/foss-2023a-CUDA-12.1.1/DeepSpeed/DeepSpeed-0.14.5/deepspeed/inference/v2/kernels/core_ops/cuda_layer_norm -I/dev/shm/DeepSpeed/0.14.5/foss-2023a-CUDA-12.1.1/DeepSpeed/DeepSpeed-0.14.5/deepspeed/inference/v2/kernels/core_ops/cuda_rms_norm -I/dev/shm/DeepSpeed/0.14.5/foss-2023a-CUDA-12.1.1/DeepSpeed/DeepSpeed-0.14.5/deepspeed/inference/v2/kernels/core_ops/gated_activations -I/dev/shm/DeepSpeed/0.14.5/foss-2023a-CUDA-12.1.1/DeepSpeed/DeepSpeed-0.14.5/deepspeed/inference/v2/kernels/core_ops/cuda_linear -I/dev/shm/DeepSpeed/0.14.5/foss-2023a-CUDA-12.1.1/DeepSpeed/DeepSpeed-0.14.5/deepspeed/inference/v2/kernels/includes -I/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include -I/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/TH -I/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/THC -I/apps/Common/software/CUDA/12.1.1/include -I/apps/Test/software/Python/3.11.3-GCCcore-12.3.0/include/python3.11 -c deepspeed/inference/v2/kernels/core_ops/cuda_linear/linear_kernels_cuda.cu -o build/temp.linux-x86_64-cpython-311/deepspeed/inference/v2/kernels/core_ops/cuda_linear/linear_kernels_cuda.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options '-fPIC' -Xcompiler -fPIC -O3 --use_fast_math -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ --threads=8 -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_86,code=sm_86 -DBF16_AVAILABLE -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1017\" -DTORCH_EXTENSION_NAME=kernelsinference_core_ops -D_GLIBCXX_USE_CXX11_ABI=1 -ccbin gcc
      creating build/lib.linux-x86_64-cpython-311/deepspeed/ops/aio
      g++ -shared -L/apps/Test/software/OpenSSL/1.1/lib64 -L/apps/Test/software/OpenSSL/1.1/lib -L/apps/Test/software/libffi/3.4.4-GCCcore-12.3.0/lib64 -L/apps/Test/software/libffi/3.4.4-GCCcore-12.3.0/lib -L/apps/Test/software/XZ/5.4.2-GCCcore-12.3.0/lib64 -L/apps/Test/software/XZ/5.4.2-GCCcore-12.3.0/lib -L/apps/Test/software/SQLite/3.42.0-GCCcore-12.3.0/lib64 -L/apps/Test/software/SQLite/3.42.0-GCCcore-12.3.0/lib -L/apps/Test/software/ncurses/6.4-GCCcore-12.3.0/lib64 -L/apps/Test/software/ncurses/6.4-GCCcore-12.3.0/lib -L/apps/Test/software/libreadline/8.2-GCCcore-12.3.0/lib64 -L/apps/Test/software/libreadline/8.2-GCCcore-12.3.0/lib -L/apps/Test/software/zlib/1.2.13-GCCcore-12.3.0/lib64 -L/apps/Test/software/zlib/1.2.13-GCCcore-12.3.0/lib -L/apps/Test/software/bzip2/1.0.8-GCCcore-12.3.0/lib64 -L/apps/Test/software/bzip2/1.0.8-GCCcore-12.3.0/lib -L/apps/Test/software/binutils/2.40-GCCcore-12.3.0/lib64 -L/apps/Test/software/binutils/2.40-GCCcore-12.3.0/lib -L/apps/Test/software/pkgconf/1.9.5-GCCcore-12.3.0/lib64 -L/apps/Test/software/pkgconf/1.9.5-GCCcore-12.3.0/lib -L/apps/Test/software/GCCcore/12.3.0/lib64 -L/apps/Test/software/GCCcore/12.3.0/lib -L/apps/Test/software/OpenSSL/1.1/lib64 -L/apps/Test/software/OpenSSL/1.1/lib -L/apps/Test/software/libffi/3.4.4-GCCcore-12.3.0/lib64 -L/apps/Test/software/libffi/3.4.4-GCCcore-12.3.0/lib -L/apps/Test/software/XZ/5.4.2-GCCcore-12.3.0/lib64 -L/apps/Test/software/XZ/5.4.2-GCCcore-12.3.0/lib -L/apps/Test/software/SQLite/3.42.0-GCCcore-12.3.0/lib64 -L/apps/Test/software/SQLite/3.42.0-GCCcore-12.3.0/lib -L/apps/Test/software/ncurses/6.4-GCCcore-12.3.0/lib64 -L/apps/Test/software/ncurses/6.4-GCCcore-12.3.0/lib -L/apps/Test/software/libreadline/8.2-GCCcore-12.3.0/lib64 -L/apps/Test/software/libreadline/8.2-GCCcore-12.3.0/lib -L/apps/Test/software/zlib/1.2.13-GCCcore-12.3.0/lib64 -L/apps/Test/software/zlib/1.2.13-GCCcore-12.3.0/lib -L/apps/Test/software/bzip2/1.0.8-GCCcore-12.3.0/lib64 -L/apps/Test/software/bzip2/1.0.8-GCCcore-12.3.0/lib -L/apps/Test/software/binutils/2.40-GCCcore-12.3.0/lib64 -L/apps/Test/software/binutils/2.40-GCCcore-12.3.0/lib -L/apps/Test/software/pkgconf/1.9.5-GCCcore-12.3.0/lib64 -L/apps/Test/software/pkgconf/1.9.5-GCCcore-12.3.0/lib -L/apps/Test/software/GCCcore/12.3.0/lib64 -L/apps/Test/software/GCCcore/12.3.0/lib -L/apps/Test/software/FFTW/3.3.10-GCC-12.3.0/lib64 -L/apps/Test/software/FFTW/3.3.10-GCC-12.3.0/lib -L/apps/Test/software/ScaLAPACK/2.2.0-gompi-2023a-fb/lib64 -L/apps/Test/software/ScaLAPACK/2.2.0-gompi-2023a-fb/lib -L/apps/Test/software/FlexiBLAS/3.3.1-GCC-12.3.0/lib64 -L/apps/Test/software/FlexiBLAS/3.3.1-GCC-12.3.0/lib -L/apps/Test/software/GCCcore/12.3.0/lib64 -L/apps/Test/software/GCCcore/12.3.0/lib -O2 -ftree-vectorize -march=native -fno-math-errno -I/apps/Test/software/FFTW/3.3.10-GCC-12.3.0/include -I/apps/Test/software/FlexiBLAS/3.3.1-GCC-12.3.0/include -I/apps/Test/software/FlexiBLAS/3.3.1-GCC-12.3.0/include/flexiblas build/temp.linux-x86_64-cpython-311/csrc/aio/common/deepspeed_aio_common.o build/temp.linux-x86_64-cpython-311/csrc/aio/common/deepspeed_aio_types.o build/temp.linux-x86_64-cpython-311/csrc/aio/common/deepspeed_aio_utils.o build/temp.linux-x86_64-cpython-311/csrc/aio/py_lib/deepspeed_aio_thread.o build/temp.linux-x86_64-cpython-311/csrc/aio/py_lib/deepspeed_pin_tensor.o build/temp.linux-x86_64-cpython-311/csrc/aio/py_lib/deepspeed_py_aio.o build/temp.linux-x86_64-cpython-311/csrc/aio/py_lib/deepspeed_py_aio_handle.o build/temp.linux-x86_64-cpython-311/csrc/aio/py_lib/deepspeed_py_copy.o build/temp.linux-x86_64-cpython-311/csrc/aio/py_lib/py_ds_aio.o -L/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/lib -L/apps/Test/software/Python/3.11.3-GCCcore-12.3.0/lib -lc10 -ltorch -ltorch_cpu -ltorch_python -o build/lib.linux-x86_64-cpython-311/deepspeed/ops/aio/async_io_op.cpython-311-x86_64-linux-gnu.so -laio
      deepspeed/inference/v2/kernels/core_ops/cuda_linear/include/ptx_mma.cuh(59): warning #174-D: expression has no effect
           ("The matrix load functions are only supported on Ampere and newer architectures", false)
            ^
      
      Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
      
      deepspeed/inference/v2/kernels/core_ops/cuda_linear/include/ptx_mma.cuh(135): warning #174-D: expression has no effect
           ("The mma functions are only implemented for Ampere and newer architectures", false)
            ^
      
      deepspeed/inference/v2/kernels/core_ops/cuda_linear/include/ptx_cp.async.cuh(33): warning #174-D: expression has no effect
           ("The async copy functions are only supported on Ampere and newer architectures", false)
            ^
      
      deepspeed/inference/v2/kernels/core_ops/cuda_linear/include/ptx_cp.async.cuh(44): warning #174-D: expression has no effect
           ("The async copy functions are only supported on Ampere and newer architectures", false)
            ^
      
      deepspeed/inference/v2/kernels/core_ops/cuda_linear/include/ptx_cp.async.cuh(56): warning #174-D: expression has no effect
           ("The async copy functions are only supported on Ampere and newer architectures", false)
            ^
      
      deepspeed/inference/v2/kernels/core_ops/cuda_linear/include/ptx_cp.async.cuh(70): warning #174-D: expression has no effect
           ("The async copy functions are only supported on Ampere and newer architectures", false)
            ^
      
      deepspeed/inference/v2/kernels/core_ops/cuda_linear/include/kernel_matmul.cuh(268): warning #174-D: expression has no effect
           ("The FP6 functions are only available on Ampere GPUs.", false)
            ^
      
      /apps/Common/software/CUDA/12.1.1/bin/nvcc -I/dev/shm/DeepSpeed/0.14.5/foss-2023a-CUDA-12.1.1/DeepSpeed/DeepSpeed-0.14.5/deepspeed/inference/v2/kernels/core_ops/bias_activations -I/dev/shm/DeepSpeed/0.14.5/foss-2023a-CUDA-12.1.1/DeepSpeed/DeepSpeed-0.14.5/deepspeed/inference/v2/kernels/core_ops/blas_kernels -I/dev/shm/DeepSpeed/0.14.5/foss-2023a-CUDA-12.1.1/DeepSpeed/DeepSpeed-0.14.5/deepspeed/inference/v2/kernels/core_ops/cuda_layer_norm -I/dev/shm/DeepSpeed/0.14.5/foss-2023a-CUDA-12.1.1/DeepSpeed/DeepSpeed-0.14.5/deepspeed/inference/v2/kernels/core_ops/cuda_rms_norm -I/dev/shm/DeepSpeed/0.14.5/foss-2023a-CUDA-12.1.1/DeepSpeed/DeepSpeed-0.14.5/deepspeed/inference/v2/kernels/core_ops/gated_activations -I/dev/shm/DeepSpeed/0.14.5/foss-2023a-CUDA-12.1.1/DeepSpeed/DeepSpeed-0.14.5/deepspeed/inference/v2/kernels/core_ops/cuda_linear -I/dev/shm/DeepSpeed/0.14.5/foss-2023a-CUDA-12.1.1/DeepSpeed/DeepSpeed-0.14.5/deepspeed/inference/v2/kernels/includes -I/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include -I/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/TH -I/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/THC -I/apps/Common/software/CUDA/12.1.1/include -I/apps/Test/software/Python/3.11.3-GCCcore-12.3.0/include/python3.11 -c deepspeed/inference/v2/kernels/core_ops/cuda_rms_norm/rms_norm.cpp -o build/temp.linux-x86_64-cpython-311/deepspeed/inference/v2/kernels/core_ops/cuda_rms_norm/rms_norm.o -fPIC -O3 -std=c++17 -g -Wno-reorder -DBF16_AVAILABLE -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1017\" -DTORCH_EXTENSION_NAME=kernelsinference_core_ops -D_GLIBCXX_USE_CXX11_ABI=1
      /apps/Common/software/CUDA/12.1.1/bin/nvcc -I/dev/shm/DeepSpeed/0.14.5/foss-2023a-CUDA-12.1.1/DeepSpeed/DeepSpeed-0.14.5/deepspeed/inference/v2/kernels/core_ops/bias_activations -I/dev/shm/DeepSpeed/0.14.5/foss-2023a-CUDA-12.1.1/DeepSpeed/DeepSpeed-0.14.5/deepspeed/inference/v2/kernels/core_ops/blas_kernels -I/dev/shm/DeepSpeed/0.14.5/foss-2023a-CUDA-12.1.1/DeepSpeed/DeepSpeed-0.14.5/deepspeed/inference/v2/kernels/core_ops/cuda_layer_norm -I/dev/shm/DeepSpeed/0.14.5/foss-2023a-CUDA-12.1.1/DeepSpeed/DeepSpeed-0.14.5/deepspeed/inference/v2/kernels/core_ops/cuda_rms_norm -I/dev/shm/DeepSpeed/0.14.5/foss-2023a-CUDA-12.1.1/DeepSpeed/DeepSpeed-0.14.5/deepspeed/inference/v2/kernels/core_ops/gated_activations -I/dev/shm/DeepSpeed/0.14.5/foss-2023a-CUDA-12.1.1/DeepSpeed/DeepSpeed-0.14.5/deepspeed/inference/v2/kernels/core_ops/cuda_linear -I/dev/shm/DeepSpeed/0.14.5/foss-2023a-CUDA-12.1.1/DeepSpeed/DeepSpeed-0.14.5/deepspeed/inference/v2/kernels/includes -I/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include -I/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/TH -I/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/THC -I/apps/Common/software/CUDA/12.1.1/include -I/apps/Test/software/Python/3.11.3-GCCcore-12.3.0/include/python3.11 -c deepspeed/inference/v2/kernels/core_ops/cuda_rms_norm/rms_norm_cuda.cu -o build/temp.linux-x86_64-cpython-311/deepspeed/inference/v2/kernels/core_ops/cuda_rms_norm/rms_norm_cuda.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options '-fPIC' -Xcompiler -fPIC -O3 --use_fast_math -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ --threads=8 -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_86,code=sm_86 -DBF16_AVAILABLE -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1017\" -DTORCH_EXTENSION_NAME=kernelsinference_core_ops -D_GLIBCXX_USE_CXX11_ABI=1 -ccbin gcc
      /dev/shm/DeepSpeed/0.14.5/foss-2023a-CUDA-12.1.1/DeepSpeed/DeepSpeed-0.14.5/deepspeed/inference/v2/kernels/includes/reduction_utils.h(739): warning #1866-D: attribute does not apply to any entity
        __attribute__((aligned(8))) struct IdxReduceResult {
                       ^
      
      Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
      
      /dev/shm/DeepSpeed/0.14.5/foss-2023a-CUDA-12.1.1/DeepSpeed/DeepSpeed-0.14.5/deepspeed/inference/v2/kernels/includes/reduction_utils.h(739): warning #1866-D: attribute does not apply to any entity
        __attribute__((aligned(8))) struct IdxReduceResult {
                       ^
      
      Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
      
      /dev/shm/DeepSpeed/0.14.5/foss-2023a-CUDA-12.1.1/DeepSpeed/DeepSpeed-0.14.5/deepspeed/inference/v2/kernels/includes/reduction_utils.h(739): warning #1866-D: attribute does not apply to any entity
        __attribute__((aligned(8))) struct IdxReduceResult {
                       ^
      
      Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
      
      /apps/Common/software/CUDA/12.1.1/bin/nvcc -I/dev/shm/DeepSpeed/0.14.5/foss-2023a-CUDA-12.1.1/DeepSpeed/DeepSpeed-0.14.5/deepspeed/inference/v2/kernels/core_ops/bias_activations -I/dev/shm/DeepSpeed/0.14.5/foss-2023a-CUDA-12.1.1/DeepSpeed/DeepSpeed-0.14.5/deepspeed/inference/v2/kernels/core_ops/blas_kernels -I/dev/shm/DeepSpeed/0.14.5/foss-2023a-CUDA-12.1.1/DeepSpeed/DeepSpeed-0.14.5/deepspeed/inference/v2/kernels/core_ops/cuda_layer_norm -I/dev/shm/DeepSpeed/0.14.5/foss-2023a-CUDA-12.1.1/DeepSpeed/DeepSpeed-0.14.5/deepspeed/inference/v2/kernels/core_ops/cuda_rms_norm -I/dev/shm/DeepSpeed/0.14.5/foss-2023a-CUDA-12.1.1/DeepSpeed/DeepSpeed-0.14.5/deepspeed/inference/v2/kernels/core_ops/gated_activations -I/dev/shm/DeepSpeed/0.14.5/foss-2023a-CUDA-12.1.1/DeepSpeed/DeepSpeed-0.14.5/deepspeed/inference/v2/kernels/core_ops/cuda_linear -I/dev/shm/DeepSpeed/0.14.5/foss-2023a-CUDA-12.1.1/DeepSpeed/DeepSpeed-0.14.5/deepspeed/inference/v2/kernels/includes -I/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include -I/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/TH -I/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/THC -I/apps/Common/software/CUDA/12.1.1/include -I/apps/Test/software/Python/3.11.3-GCCcore-12.3.0/include/python3.11 -c deepspeed/inference/v2/kernels/core_ops/gated_activations/gated_activation_kernels.cpp -o build/temp.linux-x86_64-cpython-311/deepspeed/inference/v2/kernels/core_ops/gated_activations/gated_activation_kernels.o -fPIC -O3 -std=c++17 -g -Wno-reorder -DBF16_AVAILABLE -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1017\" -DTORCH_EXTENSION_NAME=kernelsinference_core_ops -D_GLIBCXX_USE_CXX11_ABI=1
      /apps/Common/software/CUDA/12.1.1/bin/nvcc -I/dev/shm/DeepSpeed/0.14.5/foss-2023a-CUDA-12.1.1/DeepSpeed/DeepSpeed-0.14.5/deepspeed/inference/v2/kernels/core_ops/bias_activations -I/dev/shm/DeepSpeed/0.14.5/foss-2023a-CUDA-12.1.1/DeepSpeed/DeepSpeed-0.14.5/deepspeed/inference/v2/kernels/core_ops/blas_kernels -I/dev/shm/DeepSpeed/0.14.5/foss-2023a-CUDA-12.1.1/DeepSpeed/DeepSpeed-0.14.5/deepspeed/inference/v2/kernels/core_ops/cuda_layer_norm -I/dev/shm/DeepSpeed/0.14.5/foss-2023a-CUDA-12.1.1/DeepSpeed/DeepSpeed-0.14.5/deepspeed/inference/v2/kernels/core_ops/cuda_rms_norm -I/dev/shm/DeepSpeed/0.14.5/foss-2023a-CUDA-12.1.1/DeepSpeed/DeepSpeed-0.14.5/deepspeed/inference/v2/kernels/core_ops/gated_activations -I/dev/shm/DeepSpeed/0.14.5/foss-2023a-CUDA-12.1.1/DeepSpeed/DeepSpeed-0.14.5/deepspeed/inference/v2/kernels/core_ops/cuda_linear -I/dev/shm/DeepSpeed/0.14.5/foss-2023a-CUDA-12.1.1/DeepSpeed/DeepSpeed-0.14.5/deepspeed/inference/v2/kernels/includes -I/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include -I/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/TH -I/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/THC -I/apps/Common/software/CUDA/12.1.1/include -I/apps/Test/software/Python/3.11.3-GCCcore-12.3.0/include/python3.11 -c deepspeed/inference/v2/kernels/core_ops/gated_activations/gated_activation_kernels_cuda.cu -o build/temp.linux-x86_64-cpython-311/deepspeed/inference/v2/kernels/core_ops/gated_activations/gated_activation_kernels_cuda.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options '-fPIC' -Xcompiler -fPIC -O3 --use_fast_math -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ --threads=8 -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_86,code=sm_86 -DBF16_AVAILABLE -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1017\" -DTORCH_EXTENSION_NAME=kernelsinference_core_ops -D_GLIBCXX_USE_CXX11_ABI=1 -ccbin gcc
      g++ -shared -L/apps/Test/software/OpenSSL/1.1/lib64 -L/apps/Test/software/OpenSSL/1.1/lib -L/apps/Test/software/libffi/3.4.4-GCCcore-12.3.0/lib64 -L/apps/Test/software/libffi/3.4.4-GCCcore-12.3.0/lib -L/apps/Test/software/XZ/5.4.2-GCCcore-12.3.0/lib64 -L/apps/Test/software/XZ/5.4.2-GCCcore-12.3.0/lib -L/apps/Test/software/SQLite/3.42.0-GCCcore-12.3.0/lib64 -L/apps/Test/software/SQLite/3.42.0-GCCcore-12.3.0/lib -L/apps/Test/software/ncurses/6.4-GCCcore-12.3.0/lib64 -L/apps/Test/software/ncurses/6.4-GCCcore-12.3.0/lib -L/apps/Test/software/libreadline/8.2-GCCcore-12.3.0/lib64 -L/apps/Test/software/libreadline/8.2-GCCcore-12.3.0/lib -L/apps/Test/software/zlib/1.2.13-GCCcore-12.3.0/lib64 -L/apps/Test/software/zlib/1.2.13-GCCcore-12.3.0/lib -L/apps/Test/software/bzip2/1.0.8-GCCcore-12.3.0/lib64 -L/apps/Test/software/bzip2/1.0.8-GCCcore-12.3.0/lib -L/apps/Test/software/binutils/2.40-GCCcore-12.3.0/lib64 -L/apps/Test/software/binutils/2.40-GCCcore-12.3.0/lib -L/apps/Test/software/pkgconf/1.9.5-GCCcore-12.3.0/lib64 -L/apps/Test/software/pkgconf/1.9.5-GCCcore-12.3.0/lib -L/apps/Test/software/GCCcore/12.3.0/lib64 -L/apps/Test/software/GCCcore/12.3.0/lib -L/apps/Test/software/OpenSSL/1.1/lib64 -L/apps/Test/software/OpenSSL/1.1/lib -L/apps/Test/software/libffi/3.4.4-GCCcore-12.3.0/lib64 -L/apps/Test/software/libffi/3.4.4-GCCcore-12.3.0/lib -L/apps/Test/software/XZ/5.4.2-GCCcore-12.3.0/lib64 -L/apps/Test/software/XZ/5.4.2-GCCcore-12.3.0/lib -L/apps/Test/software/SQLite/3.42.0-GCCcore-12.3.0/lib64 -L/apps/Test/software/SQLite/3.42.0-GCCcore-12.3.0/lib -L/apps/Test/software/ncurses/6.4-GCCcore-12.3.0/lib64 -L/apps/Test/software/ncurses/6.4-GCCcore-12.3.0/lib -L/apps/Test/software/libreadline/8.2-GCCcore-12.3.0/lib64 -L/apps/Test/software/libreadline/8.2-GCCcore-12.3.0/lib -L/apps/Test/software/zlib/1.2.13-GCCcore-12.3.0/lib64 -L/apps/Test/software/zlib/1.2.13-GCCcore-12.3.0/lib -L/apps/Test/software/bzip2/1.0.8-GCCcore-12.3.0/lib64 -L/apps/Test/software/bzip2/1.0.8-GCCcore-12.3.0/lib -L/apps/Test/software/binutils/2.40-GCCcore-12.3.0/lib64 -L/apps/Test/software/binutils/2.40-GCCcore-12.3.0/lib -L/apps/Test/software/pkgconf/1.9.5-GCCcore-12.3.0/lib64 -L/apps/Test/software/pkgconf/1.9.5-GCCcore-12.3.0/lib -L/apps/Test/software/GCCcore/12.3.0/lib64 -L/apps/Test/software/GCCcore/12.3.0/lib -L/apps/Test/software/FFTW/3.3.10-GCC-12.3.0/lib64 -L/apps/Test/software/FFTW/3.3.10-GCC-12.3.0/lib -L/apps/Test/software/ScaLAPACK/2.2.0-gompi-2023a-fb/lib64 -L/apps/Test/software/ScaLAPACK/2.2.0-gompi-2023a-fb/lib -L/apps/Test/software/FlexiBLAS/3.3.1-GCC-12.3.0/lib64 -L/apps/Test/software/FlexiBLAS/3.3.1-GCC-12.3.0/lib -L/apps/Test/software/GCCcore/12.3.0/lib64 -L/apps/Test/software/GCCcore/12.3.0/lib -O2 -ftree-vectorize -march=native -fno-math-errno -I/apps/Test/software/FFTW/3.3.10-GCC-12.3.0/include -I/apps/Test/software/FlexiBLAS/3.3.1-GCC-12.3.0/include -I/apps/Test/software/FlexiBLAS/3.3.1-GCC-12.3.0/include/flexiblas build/temp.linux-x86_64-cpython-311/deepspeed/inference/v2/kernels/core_ops/bias_activations/bias_activation.o build/temp.linux-x86_64-cpython-311/deepspeed/inference/v2/kernels/core_ops/bias_activations/bias_activation_cuda.o build/temp.linux-x86_64-cpython-311/deepspeed/inference/v2/kernels/core_ops/core_ops.o build/temp.linux-x86_64-cpython-311/deepspeed/inference/v2/kernels/core_ops/cuda_layer_norm/layer_norm.o build/temp.linux-x86_64-cpython-311/deepspeed/inference/v2/kernels/core_ops/cuda_layer_norm/layer_norm_cuda.o build/temp.linux-x86_64-cpython-311/deepspeed/inference/v2/kernels/core_ops/cuda_linear/linear_kernels.o build/temp.linux-x86_64-cpython-311/deepspeed/inference/v2/kernels/core_ops/cuda_linear/linear_kernels_cuda.o build/temp.linux-x86_64-cpython-311/deepspeed/inference/v2/kernels/core_ops/cuda_rms_norm/rms_norm.o build/temp.linux-x86_64-cpython-311/deepspeed/inference/v2/kernels/core_ops/cuda_rms_norm/rms_norm_cuda.o build/temp.linux-x86_64-cpython-311/deepspeed/inference/v2/kernels/core_ops/gated_activations/gated_activation_kernels.o build/temp.linux-x86_64-cpython-311/deepspeed/inference/v2/kernels/core_ops/gated_activations/gated_activation_kernels_cuda.o -L/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/lib -L/apps/Common/software/CUDA/12.1.1/lib64 -L/apps/Test/software/Python/3.11.3-GCCcore-12.3.0/lib -lc10 -ltorch -ltorch_cpu -ltorch_python -lcudart -lc10_cuda -ltorch_cuda -o build/lib.linux-x86_64-cpython-311/deepspeed/inference/v2/kernelsinference_core_ops.cpython-311-x86_64-linux-gnu.so
      /apps/Common/software/CUDA/12.1.1/bin/nvcc -I/apps/Test/software/CUTLASS/3.4.0-foss-2023a-CUDA-12.1.1/include -I/apps/Test/software/CUTLASS/3.4.0-foss-2023a-CUDA-12.1.1/tools/util/include -I/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include -I/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/TH -I/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/include/THC -I/apps/Common/software/CUDA/12.1.1/include -I/apps/Test/software/Python/3.11.3-GCCcore-12.3.0/include/python3.11 -c csrc/deepspeed4science/evoformer_attn/attention_cu.cu -o build/temp.linux-x86_64-cpython-311/csrc/deepspeed4science/evoformer_attn/attention_cu.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options '-fPIC' -Xcompiler -fPIC -O3 --use_fast_math -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ --threads=8 -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_86,code=sm_86 -DGPU_ARCH=80 -DBF16_AVAILABLE -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1017\" -DTORCH_EXTENSION_NAME=evoformer_attn_op -D_GLIBCXX_USE_CXX11_ABI=1 -ccbin gcc
      g++ -shared -L/apps/Test/software/OpenSSL/1.1/lib64 -L/apps/Test/software/OpenSSL/1.1/lib -L/apps/Test/software/libffi/3.4.4-GCCcore-12.3.0/lib64 -L/apps/Test/software/libffi/3.4.4-GCCcore-12.3.0/lib -L/apps/Test/software/XZ/5.4.2-GCCcore-12.3.0/lib64 -L/apps/Test/software/XZ/5.4.2-GCCcore-12.3.0/lib -L/apps/Test/software/SQLite/3.42.0-GCCcore-12.3.0/lib64 -L/apps/Test/software/SQLite/3.42.0-GCCcore-12.3.0/lib -L/apps/Test/software/ncurses/6.4-GCCcore-12.3.0/lib64 -L/apps/Test/software/ncurses/6.4-GCCcore-12.3.0/lib -L/apps/Test/software/libreadline/8.2-GCCcore-12.3.0/lib64 -L/apps/Test/software/libreadline/8.2-GCCcore-12.3.0/lib -L/apps/Test/software/zlib/1.2.13-GCCcore-12.3.0/lib64 -L/apps/Test/software/zlib/1.2.13-GCCcore-12.3.0/lib -L/apps/Test/software/bzip2/1.0.8-GCCcore-12.3.0/lib64 -L/apps/Test/software/bzip2/1.0.8-GCCcore-12.3.0/lib -L/apps/Test/software/binutils/2.40-GCCcore-12.3.0/lib64 -L/apps/Test/software/binutils/2.40-GCCcore-12.3.0/lib -L/apps/Test/software/pkgconf/1.9.5-GCCcore-12.3.0/lib64 -L/apps/Test/software/pkgconf/1.9.5-GCCcore-12.3.0/lib -L/apps/Test/software/GCCcore/12.3.0/lib64 -L/apps/Test/software/GCCcore/12.3.0/lib -L/apps/Test/software/OpenSSL/1.1/lib64 -L/apps/Test/software/OpenSSL/1.1/lib -L/apps/Test/software/libffi/3.4.4-GCCcore-12.3.0/lib64 -L/apps/Test/software/libffi/3.4.4-GCCcore-12.3.0/lib -L/apps/Test/software/XZ/5.4.2-GCCcore-12.3.0/lib64 -L/apps/Test/software/XZ/5.4.2-GCCcore-12.3.0/lib -L/apps/Test/software/SQLite/3.42.0-GCCcore-12.3.0/lib64 -L/apps/Test/software/SQLite/3.42.0-GCCcore-12.3.0/lib -L/apps/Test/software/ncurses/6.4-GCCcore-12.3.0/lib64 -L/apps/Test/software/ncurses/6.4-GCCcore-12.3.0/lib -L/apps/Test/software/libreadline/8.2-GCCcore-12.3.0/lib64 -L/apps/Test/software/libreadline/8.2-GCCcore-12.3.0/lib -L/apps/Test/software/zlib/1.2.13-GCCcore-12.3.0/lib64 -L/apps/Test/software/zlib/1.2.13-GCCcore-12.3.0/lib -L/apps/Test/software/bzip2/1.0.8-GCCcore-12.3.0/lib64 -L/apps/Test/software/bzip2/1.0.8-GCCcore-12.3.0/lib -L/apps/Test/software/binutils/2.40-GCCcore-12.3.0/lib64 -L/apps/Test/software/binutils/2.40-GCCcore-12.3.0/lib -L/apps/Test/software/pkgconf/1.9.5-GCCcore-12.3.0/lib64 -L/apps/Test/software/pkgconf/1.9.5-GCCcore-12.3.0/lib -L/apps/Test/software/GCCcore/12.3.0/lib64 -L/apps/Test/software/GCCcore/12.3.0/lib -L/apps/Test/software/FFTW/3.3.10-GCC-12.3.0/lib64 -L/apps/Test/software/FFTW/3.3.10-GCC-12.3.0/lib -L/apps/Test/software/ScaLAPACK/2.2.0-gompi-2023a-fb/lib64 -L/apps/Test/software/ScaLAPACK/2.2.0-gompi-2023a-fb/lib -L/apps/Test/software/FlexiBLAS/3.3.1-GCC-12.3.0/lib64 -L/apps/Test/software/FlexiBLAS/3.3.1-GCC-12.3.0/lib -L/apps/Test/software/GCCcore/12.3.0/lib64 -L/apps/Test/software/GCCcore/12.3.0/lib -O2 -ftree-vectorize -march=native -fno-math-errno -I/apps/Test/software/FFTW/3.3.10-GCC-12.3.0/include -I/apps/Test/software/FlexiBLAS/3.3.1-GCC-12.3.0/include -I/apps/Test/software/FlexiBLAS/3.3.1-GCC-12.3.0/include/flexiblas build/temp.linux-x86_64-cpython-311/csrc/deepspeed4science/evoformer_attn/attention.o build/temp.linux-x86_64-cpython-311/csrc/deepspeed4science/evoformer_attn/attention_back.o build/temp.linux-x86_64-cpython-311/csrc/deepspeed4science/evoformer_attn/attention_cu.o -L/apps/Test/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch/lib -L/apps/Common/software/CUDA/12.1.1/lib64 -L/apps/Test/software/Python/3.11.3-GCCcore-12.3.0/lib -lc10 -ltorch -ltorch_cpu -ltorch_python -lcudart -lc10_cuda -ltorch_cuda -o build/lib.linux-x86_64-cpython-311/deepspeed/ops/evoformer_attn_op.cpython-311-x86_64-linux-gnu.so -lcurand
      error: command '/apps/Test/software/GCCcore/12.3.0/bin/g++' failed with exit code 1
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for deepspeed
  Running setup.py clean for deepspeed
 Failed to build deepspeed
 ERROR: Could not build wheels for deepspeed, which is required to install pyproject.toml-based projects
 (at easybuild/tools/run.py:695 in parse_cmd_output)
 == 2024-11-12 13:12:22,631 build_log.py:267 INFO 	... (took 9 mins 1 secs)
 == 2024-11-12 13:12:22,631 build_log.py:267 INFO ... (took 9 mins 25 secs)
 == 2024-11-12 13:12:22,631 filetools.py:2025 INFO Removing lock /apps/Test/software/.locks/_apps_Test_software_DeepSpeed_0.14.5-foss-2023a-CUDA-12.1.1.lock...
 == 2024-11-12 13:12:22,636 filetools.py:385 INFO Path /apps/Test/software/.locks/_apps_Test_software_DeepSpeed_0.14.5-foss-2023a-CUDA-12.1.1.lock successfully removed.
 == 2024-11-12 13:12:22,636 filetools.py:2029 INFO Lock removed: /apps/Test/software/.locks/_apps_Test_software_DeepSpeed_0.14.5-foss-2023a-CUDA-12.1.1.lock
 == 2024-11-12 13:12:22,636 easyblock.py:4297 WARNING build failed (first 300 chars): cmd "export PATH=/cephyr/NOBACKUP/priv/c3-staff/eb-tmp/eb-nk79msr5/tmpdacggzek/bin:$PATH PYTHONPATH=/cephyr/NOBACKUP/priv/c3-staff/eb-tmp/eb-nk79msr5/tmpdacggzek/lib/python3.11/site-packages:$PYTHONPATH LD_LIBRARY_PATH=/cephyr/NOBACKUP/priv/c3-staff/eb-tmp/eb-nk79msr5/tmpdacggzek/lib/python3.11/site
 == 2024-11-12 13:12:22,636 easyblock.py:326 INFO Closing log for application name DeepSpeed version 0.14.5