Skip to content

Instantly share code, notes, and snippets.

@celsowm
Created August 9, 2025 00:07
Show Gist options
  • Save celsowm/21554352a908cfd49e3bd56a3e747d58 to your computer and use it in GitHub Desktop.
Save celsowm/21554352a908cfd49e3bd56a3e747d58 to your computer and use it in GitHub Desktop.
vllm_error.txt
INFO 08-08 17:04:52 [__init__.py:241] Automatically detected platform cuda.
(APIServer pid=1) INFO 08-08 17:04:54 [api_server.py:1787] vLLM API server version 0.10.2.dev2+gf5635d62e.d20250807
(APIServer pid=1) INFO 08-08 17:04:54 [utils.py:326] non-default args: {'model': '/model', 'max_model_len': 32768, 'served_model_name': ['gpt-oss-20b']}
(APIServer pid=1) INFO 08-08 17:04:58 [config.py:726] Resolved architecture: GptOssForCausalLM
(APIServer pid=1) ERROR 08-08 17:04:58 [config.py:123] Error retrieving safetensors: Repo id must use alphanumeric chars or '-', '_', '.', '--' and '..' are forbidden, '-' and '.' cannot start or end the name, max length is 96: '/model'., retrying 1 of 2
(APIServer pid=1) ERROR 08-08 17:05:00 [config.py:121] Error retrieving safetensors: Repo id must use alphanumeric chars or '-', '_', '.', '--' and '..' are forbidden, '-' and '.' cannot start or end the name, max length is 96: '/model'.
(APIServer pid=1) INFO 08-08 17:05:00 [config.py:3628] Downcasting torch.float32 to torch.bfloat16.
(APIServer pid=1) INFO 08-08 17:05:00 [config.py:1759] Using max model len 32768
(APIServer pid=1) WARNING 08-08 17:05:00 [config.py:1198] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
(APIServer pid=1) INFO 08-08 17:05:01 [config.py:2588] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=1) INFO 08-08 17:05:01 [config.py:244] Overriding cuda graph sizes to [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512, 528, 544, 560, 576, 592, 608, 624, 640, 656, 672, 688, 704, 720, 736, 752, 768, 784, 800, 816, 832, 848, 864, 880, 896, 912, 928, 944, 960, 976, 992, 1008, 1024]
INFO 08-08 17:05:03 [__init__.py:241] Automatically detected platform cuda.
(EngineCore_0 pid=142) INFO 08-08 17:05:04 [core.py:654] Waiting for init message from front-end.
(EngineCore_0 pid=142) INFO 08-08 17:05:04 [core.py:73] Initializing a V1 LLM engine (v0.10.2.dev2+gf5635d62e.d20250807) with config: model='/model', speculative_config=None, tokenizer='/model', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=mxfp4, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend='openai'), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=gpt-oss-20b, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[1024,1008,992,976,960,944,928,912,896,880,864,848,832,816,800,784,768,752,736,720,704,688,672,656,640,624,608,592,576,560,544,528,512,496,480,464,448,432,416,400,384,368,352,336,320,304,288,272,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":1024,"local_cache_dir":null}
(EngineCore_0 pid=142) W0808 17:05:04.906000 142 torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
(EngineCore_0 pid=142) W0808 17:05:04.906000 142 torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
[W808 17:05:05.215649513 ProcessGroupNCCL.cpp:915] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(EngineCore_0 pid=142)
(EngineCore_0 pid=142) LL LL MMM MMM
(EngineCore_0 pid=142) LL LL MMMM MMMM
(EngineCore_0 pid=142) V LL LL MM MM MM MM
(EngineCore_0 pid=142) vvvv VVVV LL LL MM MM MM MM
(EngineCore_0 pid=142) vvvv VVVV LL LL MM MMM MM
(EngineCore_0 pid=142) vvv VVVV LL LL MM M MM
(EngineCore_0 pid=142) vvVVVV LL LL MM MM
(EngineCore_0 pid=142) VVVV LLLLLLLLLL LLLLLLLLL M M
(EngineCore_0 pid=142)
(EngineCore_0 pid=142) INFO 08-08 17:05:05 [parallel_state.py:1102] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(EngineCore_0 pid=142) INFO 08-08 17:05:05 [topk_topp_sampler.py:49] Using FlashInfer for top-p & top-k sampling.
(EngineCore_0 pid=142) INFO 08-08 17:05:05 [gpu_model_runner.py:1913] Starting to load model /model...
(EngineCore_0 pid=142) INFO 08-08 17:05:05 [gpu_model_runner.py:1945] Loading model from scratch...
(EngineCore_0 pid=142) INFO 08-08 17:05:05 [cuda.py:286] Using Triton backend on V1 engine.
(EngineCore_0 pid=142) WARNING 08-08 17:05:05 [rocm.py:29] Failed to import from amdsmi with ModuleNotFoundError("No module named 'amdsmi'")
(EngineCore_0 pid=142) WARNING 08-08 17:05:05 [rocm.py:40] Failed to import from vllm._rocm_C with ModuleNotFoundError("No module named 'vllm._rocm_C'")
(EngineCore_0 pid=142) INFO 08-08 17:05:05 [triton_attn.py:263] Using vllm unified attention for TritonAttentionImpl
Loading safetensors checkpoint shards: 0% Completed | 0/3 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 33% Completed | 1/3 [00:01<00:02, 1.43s/it]
Loading safetensors checkpoint shards: 67% Completed | 2/3 [00:04<00:02, 2.14s/it]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:06<00:00, 2.36s/it]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:06<00:00, 2.23s/it]
(EngineCore_0 pid=142)
(EngineCore_0 pid=142) INFO 08-08 17:05:12 [default_loader.py:262] Loading weights took 6.71 seconds
(EngineCore_0 pid=142) INFO 08-08 17:05:13 [gpu_model_runner.py:1962] Model loading took 13.4825 GiB and 6.927250 seconds
(EngineCore_0 pid=142) INFO 08-08 17:05:14 [backends.py:530] Using cache directory: /root/.cache/vllm/torch_compile_cache/51913bfe77/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_0 pid=142) INFO 08-08 17:05:14 [backends.py:541] Dynamo bytecode transform time: 1.73 s
(EngineCore_0 pid=142) INFO 08-08 17:05:16 [backends.py:194] Cache the graph for dynamic shape for later use
(EngineCore_0 pid=142) INFO 08-08 17:05:27 [backends.py:215] Compiling a graph for dynamic shape takes 12.48 s
(EngineCore_0 pid=142)
(EngineCore_0 pid=142)
(EngineCore_0 pid=142) ================================================================
(EngineCore_0 pid=142) Internal Triton PTX codegen error
(EngineCore_0 pid=142) `ptxas` stderr:
(EngineCore_0 pid=142) ptxas /tmp/tmp7fyy4dpd.ptx, line 670; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ptxas /tmp/tmp7fyy4dpd.ptx, line 675; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ptxas /tmp/tmp7fyy4dpd.ptx, line 679; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ptxas /tmp/tmp7fyy4dpd.ptx, line 683; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ptxas /tmp/tmp7fyy4dpd.ptx, line 687; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ptxas /tmp/tmp7fyy4dpd.ptx, line 691; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ptxas /tmp/tmp7fyy4dpd.ptx, line 695; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ptxas /tmp/tmp7fyy4dpd.ptx, line 699; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ptxas /tmp/tmp7fyy4dpd.ptx, line 703; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ptxas /tmp/tmp7fyy4dpd.ptx, line 707; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ptxas /tmp/tmp7fyy4dpd.ptx, line 711; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ptxas /tmp/tmp7fyy4dpd.ptx, line 715; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ptxas /tmp/tmp7fyy4dpd.ptx, line 719; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ptxas /tmp/tmp7fyy4dpd.ptx, line 723; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ptxas /tmp/tmp7fyy4dpd.ptx, line 727; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ptxas /tmp/tmp7fyy4dpd.ptx, line 731; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ptxas /tmp/tmp7fyy4dpd.ptx, line 735; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ptxas /tmp/tmp7fyy4dpd.ptx, line 739; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ptxas /tmp/tmp7fyy4dpd.ptx, line 743; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ptxas /tmp/tmp7fyy4dpd.ptx, line 747; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ptxas /tmp/tmp7fyy4dpd.ptx, line 751; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ptxas /tmp/tmp7fyy4dpd.ptx, line 755; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ptxas /tmp/tmp7fyy4dpd.ptx, line 759; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ptxas /tmp/tmp7fyy4dpd.ptx, line 763; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ptxas /tmp/tmp7fyy4dpd.ptx, line 767; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ptxas /tmp/tmp7fyy4dpd.ptx, line 771; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ptxas /tmp/tmp7fyy4dpd.ptx, line 775; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ptxas /tmp/tmp7fyy4dpd.ptx, line 779; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ptxas /tmp/tmp7fyy4dpd.ptx, line 783; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ptxas /tmp/tmp7fyy4dpd.ptx, line 787; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ptxas /tmp/tmp7fyy4dpd.ptx, line 791; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ptxas /tmp/tmp7fyy4dpd.ptx, line 795; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ptxas fatal : Ptx assembly aborted due to errors
(EngineCore_0 pid=142)
(EngineCore_0 pid=142) Repro command: /usr/local/lib/python3.12/dist-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_120a /tmp/tmp7fyy4dpd.ptx -o /tmp/tmp7fyy4dpd.ptx.o
(EngineCore_0 pid=142)
(EngineCore_0 pid=142)
(EngineCore_0 pid=142) //
(EngineCore_0 pid=142) // Generated by LLVM NVPTX Back-End
(EngineCore_0 pid=142) //
(EngineCore_0 pid=142)
(EngineCore_0 pid=142) .version 8.7
(EngineCore_0 pid=142) .target sm_120a
(EngineCore_0 pid=142) .address_size 64
(EngineCore_0 pid=142)
(EngineCore_0 pid=142) // .globl _p_matmul_ogs_NNT_bf16xbf16xmxfp4_128x256x128x1_swiglu // -- Begin function _p_matmul_ogs_NNT_bf16xbf16xmxfp4_128x256x128x1_swiglu
(EngineCore_0 pid=142) .extern .shared .align 16 .b8 global_smem[];
(EngineCore_0 pid=142) // @_p_matmul_ogs_NNT_bf16xbf16xmxfp4_128x256x128x1_swiglu
(EngineCore_0 pid=142) .visible .entry _p_matmul_ogs_NNT_bf16xbf16xmxfp4_128x256x128x1_swiglu(
(EngineCore_0 pid=142) .param .u64 .ptr .global .align 1 _p_matmul_ogs_NNT_bf16xbf16xmxfp4_128x256x128x1_swiglu_param_0,
(EngineCore_0 pid=142) .param .u64 .ptr .global .align 1 _p_matmul_ogs_NNT_bf16xbf16xmxfp4_128x256x128x1_swiglu_param_1,
(EngineCore_0 pid=142) .param .u32 _p_matmul_ogs_NNT_bf16xbf16xmxfp4_128x256x128x1_swiglu_param_2,
(EngineCore_0 pid=142) .param .u32 _p_matmul_ogs_NNT_bf16xbf16xmxfp4_128x256x128x1_swiglu_param_3,
(EngineCore_0 pid=142) .param .u32 _p_matmul_ogs_NNT_bf16xbf16xmxfp4_128x256x128x1_swiglu_param_4,
(EngineCore_0 pid=142) .param .align 64 .b8 _p_matmul_ogs_NNT_bf16xbf16xmxfp4_128x256x128x1_swiglu_param_5[128],
(EngineCore_0 pid=142) .param .u32 _p_matmul_ogs_NNT_bf16xbf16xmxfp4_128x256x128x1_swiglu_param_6,
(EngineCore_0 pid=142) .param .u32 _p_matmul_ogs_NNT_bf16xbf16xmxfp4_128x256x128x1_swiglu_param_7,
(EngineCore_0 pid=142) .param .u64 _p_matmul_ogs_NNT_bf16xbf16xmxfp4_128x256x128x1_swiglu_param_8,
(EngineCore_0 pid=142) .param .u64 _p_matmul_ogs_NNT_bf16xbf16xmxfp4_128x256x128x1_swiglu_param_9,
(EngineCore_0 pid=142) .param .u64 .ptr .global .align 1 _p_matmul_ogs_NNT_bf16xbf16xmxfp4_128x256x128x1_swiglu_param_10,
(EngineCore_0 pid=142) .param .u32 _p_matmul_ogs_NNT_bf16xbf16xmxfp4_128x256x128x1_swiglu_param_11,
(EngineCore_0 pid=142) .param .u32 _p_matmul_ogs_NNT_bf16xbf16xmxfp4_128x256x128x1_swiglu_param_12,
(EngineCore_0 pid=142) .param .align 64 .b8 _p_matmul_ogs_NNT_bf16xbf16xmxfp4_128x256x128x1_swiglu_param_13[128],
(EngineCore_0 pid=142) .param .u32 _p_matmul_ogs_NNT_bf16xbf16xmxfp4_128x256x128x1_swiglu_param_14,
(EngineCore_0 pid=142) .param .u32 _p_matmul_ogs_NNT_bf16xbf16xmxfp4_128x256x128x1_swiglu_param_15,
(EngineCore_0 pid=142) .param .u32 _p_matmul_ogs_NNT_bf16xbf16xmxfp4_128x256x128x1_swiglu_param_16,
(EngineCore_0 pid=142) .param .u64 _p_matmul_ogs_NNT_bf16xbf16xmxfp4_128x256x128x1_swiglu_param_17,
(EngineCore_0 pid=142) .param .u64 _p_matmul_ogs_NNT_bf16xbf16xmxfp4_128x256x128x1_swiglu_param_18,
(EngineCore_0 pid=142) .param .u64 _p_matmul_ogs_NNT_bf16xbf16xmxfp4_128x256x128x1_swiglu_param_19,
(EngineCore_0 pid=142) .param .u32 _p_matmul_ogs_NNT_bf16xbf16xmxfp4_128x256x128x1_swiglu_param_20,
(EngineCore_0 pid=142) .param .u32 _p_matmul_ogs_NNT_bf16xbf16xmxfp4_128x256x128x1_swiglu_param_21,
(EngineCore_0 pid=142) .param .align 64 .b8 _p_matmul_ogs_NNT_bf16xbf16xmxfp4_128x256x128x1_swiglu_param_22[128],
(EngineCore_0 pid=142) .param .u32 _p_matmul_ogs_NNT_bf16xbf16xmxfp4_128x256x128x1_swiglu_param_23,
(EngineCore_0 pid=142) .param .u32 _p_matmul_ogs_NNT_bf16xbf16xmxfp4_128x256x128x1_swiglu_param_24,
(EngineCore_0 pid=142) .param .u32 _p_matmul_ogs_NNT_bf16xbf16xmxfp4_128x256x128x1_swiglu_param_25,
(EngineCore_0 pid=142) .param .u32 _p_matmul_ogs_NNT_bf16xbf16xmxfp4_128x256x128x1_swiglu_param_26,
(EngineCore_0 pid=142) .param .u32 _p_matmul_ogs_NNT_bf16xbf16xmxfp4_128x256x128x1_swiglu_param_27,
(EngineCore_0 pid=142) .param .u64 _p_matmul_ogs_NNT_bf16xbf16xmxfp4_128x256x128x1_swiglu_param_28,
(EngineCore_0 pid=142) .param .u64 _p_matmul_ogs_NNT_bf16xbf16xmxfp4_128x256x128x1_swiglu_param_29,
(EngineCore_0 pid=142) .param .u64 _p_matmul_ogs_NNT_bf16xbf16xmxfp4_128x256x128x1_swiglu_param_30,
(EngineCore_0 pid=142) .param .u64 _p_matmul_ogs_NNT_bf16xbf16xmxfp4_128x256x128x1_swiglu_param_31,
(EngineCore_0 pid=142) .param .u64 _p_matmul_ogs_NNT_bf16xbf16xmxfp4_128x256x128x1_swiglu_param_32,
(EngineCore_0 pid=142) .param .u64 .ptr .global .align 1 _p_matmul_ogs_NNT_bf16xbf16xmxfp4_128x256x128x1_swiglu_param_33,
(EngineCore_0 pid=142) .param .u32 _p_matmul_ogs_NNT_bf16xbf16xmxfp4_128x256x128x1_swiglu_param_34,
(EngineCore_0 pid=142) .param .u32 _p_matmul_ogs_NNT_bf16xbf16xmxfp4_128x256x128x1_swiglu_param_35,
(EngineCore_0 pid=142) .param .u32 _p_matmul_ogs_NNT_bf16xbf16xmxfp4_128x256x128x1_swiglu_param_36,
(EngineCore_0 pid=142) .param .u32 _p_matmul_ogs_NNT_bf16xbf16xmxfp4_128x256x128x1_swiglu_param_37,
(EngineCore_0 pid=142) .param .u64 .ptr .global .align 1 _p_matmul_ogs_NNT_bf16xbf16xmxfp4_128x256x128x1_swiglu_param_38,
(EngineCore_0 pid=142) .param .u64 .ptr .global .align 1 _p_matmul_ogs_NNT_bf16xbf16xmxfp4_128x256x128x1_swiglu_param_39,
(EngineCore_0 pid=142) .param .u64 .ptr .global .align 1 _p_matmul_ogs_NNT_bf16xbf16xmxfp4_128x256x128x1_swiglu_param_40,
(EngineCore_0 pid=142) .param .u64 .ptr .global .align 1 _p_matmul_ogs_NNT_bf16xbf16xmxfp4_128x256x128x1_swiglu_param_41,
(EngineCore_0 pid=142) .param .u64 .ptr .global .align 1 _p_matmul_ogs_NNT_bf16xbf16xmxfp4_128x256x128x1_swiglu_param_42,
(EngineCore_0 pid=142) .param .u32 _p_matmul_ogs_NNT_bf16xbf16xmxfp4_128x256x128x1_swiglu_param_43,
(EngineCore_0 pid=142) .param .u32 _p_matmul_ogs_NNT_bf16xbf16xmxfp4_128x256x128x1_swiglu_param_44,
(EngineCore_0 pid=142) .param .f32 _p_matmul_ogs_NNT_bf16xbf16xmxfp4_128x256x128x1_swiglu_param_45,
(EngineCore_0 pid=142) .param .f32 _p_matmul_ogs_NNT_bf16xbf16xmxfp4_128x256x128x1_swiglu_param_46,
(EngineCore_0 pid=142) .param .u64 .ptr .global .align 1 _p_matmul_ogs_NNT_bf16xbf16xmxfp4_128x256x128x1_swiglu_param_47
(EngineCore_0 pid=142) )
(EngineCore_0 pid=142) .reqntid 384
(EngineCore_0 pid=142) .maxnreg 168
(EngineCore_0 pid=142) {
(EngineCore_0 pid=142) .reg .pred %p<124>;
(EngineCore_0 pid=142) .reg .b16 %rs<739>;
(EngineCore_0 pid=142) .reg .b32 %r<6301>;
(EngineCore_0 pid=142) .reg .b64 %rd<134>;
(EngineCore_0 pid=142) .loc 1 33 0 // specialize.py:33:0
(EngineCore_0 pid=142) $L__func_begin0:
(EngineCore_0 pid=142) .loc 1 33 0 // specialize.py:33:0
(EngineCore_0 pid=142)
(EngineCore_0 pid=142) // %bb.0:
(EngineCore_0 pid=142) ld.param.b32 %r667, [_p_matmul_ogs_NNT_bf16xbf16xmxfp4_128x256x128x1_swiglu_param_44];
(EngineCore_0 pid=142) ld.param.b64 %rd17, [_p_matmul_ogs_NNT_bf16xbf16xmxfp4_128x256x128x1_swiglu_param_42];
(EngineCore_0 pid=142) ld.param.b64 %rd15, [_p_matmul_ogs_NNT_bf16xbf16xmxfp4_128x256x128x1_swiglu_param_40];
(EngineCore_0 pid=142) ld.param.b64 %rd14, [_p_matmul_ogs_NNT_bf16xbf16xmxfp4_128x256x128x1_swiglu_param_39];
(EngineCore_0 pid=142) ld.param.b32 %r666, [_p_matmul_ogs_NNT_bf16xbf16xmxfp4_128x256x128x1_swiglu_param_37];
(EngineCore_0 pid=142) ld.param.b32 %r665, [_p_matmul_ogs_NNT_bf16xbf16xmxfp4_128x256x128x1_swiglu_param_36];
(EngineCore_0 pid=142) mov.b64 %rd19, _p_matmul_ogs_NNT_bf16xbf16xmxfp4_128x256x128x1_swiglu_param_22;
(EngineCore_0 pid=142) $L__tmp0:
(EngineCore_0 pid=142) .loc 1 33 0 // specialize.py:33
(EngineCore_0 pid=142) cvta.param.u64 %rd65, %rd19;
(EngineCore_0 pid=142) mov.b64 %rd20, _p_matmul_ogs_NNT_bf16xbf16xmxfp4_128x256x128x1_swiglu_param_13;
(EngineCore_0 pid=142) cvta.param.u64 %rd64, %rd20;
(EngineCore_0 pid=142) mov.b64 %rd21, _p_matmul_ogs_NNT_bf16xbf16xmxfp4_128x256x128x1_swiglu_param_5;
(EngineCore_0 pid=142) cvta.param.u64 %rd3, %rd21;
(EngineCore_0 pid=142) mov.u32 %r1, %tid.x;
(EngineCore_0 pid=142) shr.u32 %r670, %r1, 5;
(EngineCore_0 pid=142) shfl.sync.idx.b32 %r2, %r670, 0, 31, -1;
(EngineCore_0 pid=142) setp.lt.u32 %p1, %r2, 8;
(EngineCore_0 pid=142) @%p1 bra $L__BB0_13;
(EngineCore_0 pid=142) bra.uni $L__BB0_1;
(EngineCore_0 pid=142) $L__BB0_13:
(EngineCore_0 pid=142) .loc 1 0 0 // specialize.py:0:0
(EngineCore_0 pid=142) ld.param.b64 %rd18, [_p_matmul_ogs_NNT_bf16xbf16xmxfp4_128x256x128x1_swiglu_param_47];
(EngineCore_0 pid=142) ld.param.b32 %r669, [_p_matmul_ogs_NNT_bf16xbf16xmxfp4_128x256x128x1_swiglu_param_46];
(EngineCore_0 pid=142) ld.param.b32 %r668, [_p_matmul_ogs_NNT_bf16xbf16xmxfp4_128x256x128x1_swiglu_param_45];
(EngineCore_0 pid=142) ld.param.b64 %rd16, [_p_matmul_ogs_NNT_bf16xbf16xmxfp4_128x256x128x1_swiglu_param_41];
(EngineCore_0 pid=142) ld.param.b32 %r664, [_p_matmul_ogs_NNT_bf16xbf16xmxfp4_128x256x128x1_swiglu_param_34];
(EngineCore_0 pid=142) ld.param.b64 %rd12, [_p_matmul_ogs_NNT_bf16xbf16xmxfp4_128x256x128x1_swiglu_param_33];
(EngineCore_0 pid=142) ld.param.b32 %r663, [_p_matmul_ogs_NNT_bf16xbf16xmxfp4_128x256x128x1_swiglu_param_4];
(EngineCore_0 pid=142) ld.param.b64 %rd11, [_p_matmul_ogs_NNT_bf16xbf16xmxfp4_128x256x128x1_swiglu_param_1];
(EngineCore_0 pid=142) .loc 1 33 0 // specialize.py:33
(EngineCore_0 pid=142) setmaxnreg.inc.sync.aligned.u32 240;
(EngineCore_0 pid=142) .loc 1 64 37 // specialize.py:64:37
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mov.u32 %r224, 0x0;
(EngineCore_0 pid=142) ld.global.b32 { %r224 }, [ %rd16 + 0 ];
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) .loc 1 77 14 // specialize.py:77:14
(EngineCore_0 pid=142) shr.u32 %r981, %r665, 31;
(EngineCore_0 pid=142) add.s32 %r982, %r665, %r981;
(EngineCore_0 pid=142) shr.s32 %r4935, %r982, 1;
(EngineCore_0 pid=142) .loc 1 117 52 // specialize.py:117:52
(EngineCore_0 pid=142) mul.lo.s32 %r226, %r224, %r667;
(EngineCore_0 pid=142) .loc 1 125 33 // specialize.py:125:33
(EngineCore_0 pid=142) mov.u32 %r6039, %ctaid.x;
(EngineCore_0 pid=142) $L__tmp1:
(EngineCore_0 pid=142) .loc 2 49 22 // _common.py:49:22 @[ specialize.py:140:21 ]
(EngineCore_0 pid=142) shl.b32 %r229, %r667, 3;
(EngineCore_0 pid=142) $L__tmp2:
(EngineCore_0 pid=142) .loc 1 261 85 // specialize.py:261:85
(EngineCore_0 pid=142) cvt.s64.s32 %rd4, %r663;
(EngineCore_0 pid=142) .loc 1 274 41 // specialize.py:274:41
(EngineCore_0 pid=142) and.b32 %r230, %r1, 255;
(EngineCore_0 pid=142) mov.b32 %r983, 0f00000000;
(EngineCore_0 pid=142) $L__tmp3:
(EngineCore_0 pid=142) .loc 3 10 26 // _swiglu.py:10:26 @[ specialize.py:330:46 ]
(EngineCore_0 pid=142) sub.f32 %r231, %r983, %r669;
(EngineCore_0 pid=142) .loc 3 45 28 // _swiglu.py:45:28 @[ specialize.py:330:46 ]
(EngineCore_0 pid=142) sub.f32 %r232, %r983, %r668;
(EngineCore_0 pid=142) $L__tmp4:
(EngineCore_0 pid=142) .loc 1 172 36 // specialize.py:172:36
(EngineCore_0 pid=142) add.s32 %r984, %r666, 254;
(EngineCore_0 pid=142) setp.gt.u32 %p50, %r984, 254;
(EngineCore_0 pid=142) @%p50 bra $L__BB0_17;
(EngineCore_0 pid=142) // %bb.14: // %.preheader
(EngineCore_0 pid=142) .loc 1 135 133 // specialize.py:135:133
(EngineCore_0 pid=142) setp.le.s32 %p99, %r226, %r6039;
(EngineCore_0 pid=142) @%p99 bra $L__BB0_23;
(EngineCore_0 pid=142) // %bb.15: // %.lr.ph781
(EngineCore_0 pid=142) .loc 1 0 133 // specialize.py:0:133
(EngineCore_0 pid=142) mov.u32 %r5814, %ctaid.y;
(EngineCore_0 pid=142) mov.u32 %r5815, %ctaid.z;
(EngineCore_0 pid=142) mov.u32 %r5816, %nctaid.x;
(EngineCore_0 pid=142) mov.u32 %r5817, %nctaid.y;
(EngineCore_0 pid=142) mad.lo.s32 %r5818, %r5815, %r5817, %r5814;
(EngineCore_0 pid=142) mad.lo.s32 %r5819, %r5818, %r5816, %r6039;
(EngineCore_0 pid=142) shl.b32 %r5820, %r5819, 8;
(EngineCore_0 pid=142) cvt.s64.s32 %rd101, %r5820;
(EngineCore_0 pid=142) add.s64 %rd120, %rd18, %rd101;
(EngineCore_0 pid=142) shl.b64 %rd113, %rd4, 1;
(EngineCore_0 pid=142) shl.b32 %r5821, %r230, 2;
(EngineCore_0 pid=142) mov.b32 %r5855, global_smem;
(EngineCore_0 pid=142) add.s32 %r5840, %r5855, %r5821;
(EngineCore_0 pid=142) cvta.global.u64 %rd124, %rd120;
(EngineCore_0 pid=142) shl.b32 %r5823, %r1, 1;
(EngineCore_0 pid=142) and.b32 %r5824, %r5823, 508;
(EngineCore_0 pid=142) and.b32 %r5825, %r1, 1;
(EngineCore_0 pid=142) neg.s32 %r5826, %r5825;
(EngineCore_0 pid=142) and.b32 %r5827, %r5826, 576;
(EngineCore_0 pid=142) xor.b32 %r5828, %r5827, %r5824;
(EngineCore_0 pid=142) add.s32 %r234, %r5855, %r5828;
(EngineCore_0 pid=142) shl.b32 %r5829, %r1, 2;
(EngineCore_0 pid=142) and.b32 %r5830, %r5829, 252;
(EngineCore_0 pid=142) add.s32 %r235, %r5855, %r5830;
(EngineCore_0 pid=142) xor.b32 %r5831, %r5830, 576;
(EngineCore_0 pid=142) add.s32 %r236, %r5855, %r5831;
(EngineCore_0 pid=142) shl.b32 %r5832, %r230, 1;
(EngineCore_0 pid=142) shr.u32 %r5833, %r1, 2;
(EngineCore_0 pid=142) and.b32 %r5834, %r5833, 48;
(EngineCore_0 pid=142) xor.b32 %r5835, %r5832, %r5834;
(EngineCore_0 pid=142) add.s32 %r237, %r5855, %r5835;
(EngineCore_0 pid=142) xor.b32 %r5836, %r5835, 576;
(EngineCore_0 pid=142) add.s32 %r238, %r5855, %r5836;
(EngineCore_0 pid=142) setp.eq.s32 %p101, %r230, 0;
(EngineCore_0 pid=142) setp.lt.u32 %p100, %r230, 32;
(EngineCore_0 pid=142) $L__BB0_16: // =>This Inner Loop Header: Depth=1
(EngineCore_0 pid=142) $L__tmp5:
(EngineCore_0 pid=142) .loc 4 69 26 // _p_matmul_ogs.py:69:26 @[ specialize.py:225:25 ]
(EngineCore_0 pid=142) rem.s32 %r5856, %r6039, %r226;
(EngineCore_0 pid=142) .loc 2 50 22 // _common.py:50:22 @[ specialize.py:225:25 ]
(EngineCore_0 pid=142) div.s32 %r5858, %r5856, %r229;
(EngineCore_0 pid=142) .loc 2 51 41 // _common.py:51:41 @[ specialize.py:225:25 ]
(EngineCore_0 pid=142) shl.b32 %r5859, %r5858, 3;
(EngineCore_0 pid=142) .loc 2 51 30 // _common.py:51:30 @[ specialize.py:225:25 ]
(EngineCore_0 pid=142) sub.s32 %r5860, %r224, %r5859;
(EngineCore_0 pid=142) .loc 2 51 50 // _common.py:51:50 @[ specialize.py:225:25 ]
(EngineCore_0 pid=142) min.s32 %r5861, %r5860, 8;
(EngineCore_0 pid=142) .loc 2 52 40 // _common.py:52:40 @[ specialize.py:225:25 ]
(EngineCore_0 pid=142) rem.s32 %r5862, %r5856, %r5861;
(EngineCore_0 pid=142) .loc 2 52 34 // _common.py:52:34 @[ specialize.py:225:25 ]
(EngineCore_0 pid=142) add.s32 %r5863, %r5859, %r5862;
(EngineCore_0 pid=142) .loc 2 53 19 // _common.py:53:19 @[ specialize.py:225:25 ]
(EngineCore_0 pid=142) mul.lo.s32 %r5864, %r5858, %r229;
(EngineCore_0 pid=142) sub.s32 %r5865, %r5856, %r5864;
(EngineCore_0 pid=142) .loc 2 53 30 // _common.py:53:30 @[ specialize.py:225:25 ]
(EngineCore_0 pid=142) div.s32 %r5866, %r5865, %r5861;
(EngineCore_0 pid=142) .loc 4 84 39 // _p_matmul_ogs.py:84:39 @[ specialize.py:225:25 ]
(EngineCore_0 pid=142) mul.wide.s32 %rd126, %r5863, 4;
(EngineCore_0 pid=142) add.s64 %rd102, %rd17, %rd126;
(EngineCore_0 pid=142) .loc 4 84 28 // _p_matmul_ogs.py:84:28 @[ specialize.py:225:25 ]
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mov.u32 %r5837, 0x0;
(EngineCore_0 pid=142) ld.global.b32 { %r5837 }, [ %rd102 + 0 ];
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) .loc 4 85 30 // _p_matmul_ogs.py:85:30 @[ specialize.py:225:25 ]
(EngineCore_0 pid=142) and.b32 %r5867, %r5837, 65535;
(EngineCore_0 pid=142) .loc 4 87 32 // _p_matmul_ogs.py:87:32 @[ specialize.py:225:25 ]
(EngineCore_0 pid=142) mul.wide.u32 %rd127, %r5867, 4;
(EngineCore_0 pid=142) add.s64 %rd103, %rd14, %rd127;
(EngineCore_0 pid=142) .loc 4 87 21 // _p_matmul_ogs.py:87:21 @[ specialize.py:225:25 ]
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mov.u32 %r5838, 0x0;
(EngineCore_0 pid=142) ld.global.b32 { %r5838 }, [ %rd103 + 0 ];
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) .loc 4 88 37 // _p_matmul_ogs.py:88:37 @[ specialize.py:225:25 ]
(EngineCore_0 pid=142) add.s64 %rd104, %rd15, %rd127;
(EngineCore_0 pid=142) .loc 4 88 26 // _p_matmul_ogs.py:88:26 @[ specialize.py:225:25 ]
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mov.u32 %r5839, 0x0;
(EngineCore_0 pid=142) ld.global.b32 { %r5839 }, [ %rd104 + 0 ];
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) .loc 4 91 22 // _p_matmul_ogs.py:91:22 @[ specialize.py:225:25 ]
(EngineCore_0 pid=142) shr.s32 %r5868, %r5837, 9;
(EngineCore_0 pid=142) and.b32 %r5851, %r5868, -128;
(EngineCore_0 pid=142) .loc 4 92 22 // _p_matmul_ogs.py:92:22 @[ specialize.py:225:25 ]
(EngineCore_0 pid=142) shl.b32 %r5869, %r5866, 8;
(EngineCore_0 pid=142) $L__tmp6:
(EngineCore_0 pid=142) .loc 1 261 71 // specialize.py:261:71
(EngineCore_0 pid=142) cvt.s64.s32 %rd128, %r5839;
(EngineCore_0 pid=142) .loc 1 261 85 // specialize.py:261:85
(EngineCore_0 pid=142) mul.lo.s64 %rd129, %rd128, %rd4;
(EngineCore_0 pid=142) .loc 1 261 59 // specialize.py:261:59
(EngineCore_0 pid=142) shl.b64 %rd130, %rd129, 1;
(EngineCore_0 pid=142) add.s64 %rd106, %rd11, %rd130;
(EngineCore_0 pid=142) mov.b32 %r5849, 0;
(EngineCore_0 pid=142) .loc 1 267 16 // specialize.py:267:16
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p100 st.shared.b32 [ %r5840 + 0 ], %r5849;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) bar.warp.sync -1;
(EngineCore_0 pid=142) cvt.u64.u32 %rd105, %r5855;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p101 tensormap.replace.tile.global_address.shared::cta.b1024.b64 [ %rd105 + 0 ], %rd106;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p101 tensormap.replace.tile.rank.shared::cta.b1024.b32 [ %rd105 + 0 ], 0x1;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) mov.b32 %r5842, 64;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p101 tensormap.replace.tile.box_dim.shared::cta.b1024.b32 [ %rd105 + 0 ], 0x0, %r5842;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) mov.b32 %r5843, 128;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p101 tensormap.replace.tile.box_dim.shared::cta.b1024.b32 [ %rd105 + 0 ], 0x1, %r5843;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p101 tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [ %rd105 + 0 ], 0x0, %r4935;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p101 tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [ %rd105 + 0 ], 0x1, %r5838;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p101 tensormap.replace.tile.global_stride.shared::cta.b1024.b64 [ %rd105 + 0 ], 0x0, %rd113;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) mov.b32 %r5846, 1;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p101 tensormap.replace.tile.element_stride.shared::cta.b1024.b32 [ %rd105 + 0 ], 0x0, %r5846;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p101 tensormap.replace.tile.element_stride.shared::cta.b1024.b32 [ %rd105 + 0 ], 0x1, %r5846;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p101 tensormap.replace.tile.elemtype.shared::cta.b1024.b32 [ %rd105 + 0 ], 0xa;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p101 tensormap.replace.tile.interleave_layout.shared::cta.b1024.b32 [ %rd105 + 0 ], 0x0;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p101 tensormap.replace.tile.swizzle_mode.shared::cta.b1024.b32 [ %rd105 + 0 ], 0x3;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p101 tensormap.replace.tile.fill_mode.shared::cta.b1024.b32 [ %rd105 + 0 ], 0x0;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p100 tensormap.cp_fenceproxy.global.shared::cta.tensormap::generic.release.gpu.sync.aligned [ %rd120 + 0 ], [ %rd105 + 0 ], 0x80;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p100 fence.proxy.tensormap::generic.acquire.gpu [ %rd120 + 0 ], 0x80;
(EngineCore_0 pid=142) @%p100 cp.async.bulk.commit_group ;
(EngineCore_0 pid=142) @%p100 cp.async.bulk.wait_group.read 0 ;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) barrier.sync.aligned 0, 256 ;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) .loc 1 274 28 // specialize.py:274:28
(EngineCore_0 pid=142) or.b32 %r5870, %r5869, %r230;
(EngineCore_0 pid=142) .loc 1 275 28 // specialize.py:275:28
(EngineCore_0 pid=142) setp.lt.s32 %p118, %r5870, %r665;
(EngineCore_0 pid=142) .loc 1 277 35 // specialize.py:277:35
(EngineCore_0 pid=142) mul.lo.s32 %r5871, %r5867, %r664;
(EngineCore_0 pid=142) .loc 1 277 24 // specialize.py:277:24
(EngineCore_0 pid=142) mul.wide.s32 %rd131, %r5871, 4;
(EngineCore_0 pid=142) add.s64 %rd132, %rd12, %rd131;
(EngineCore_0 pid=142) .loc 1 277 48 // specialize.py:277:48
(EngineCore_0 pid=142) mul.wide.s32 %rd133, %r5870, 4;
(EngineCore_0 pid=142) add.s64 %rd123, %rd132, %rd133;
(EngineCore_0 pid=142) .loc 1 279 31 // specialize.py:279:31
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mov.u32 %r5848, %r5849;
(EngineCore_0 pid=142) @%p118 ld.global.b32 { %r5848 }, [ %rd123 + 0 ];
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) .loc 1 304 27 // specialize.py:304:27
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) barrier.sync.aligned 0, 256 ;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) st.shared.b32 [%r234], %r5848;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) barrier.sync.aligned 0, 256 ;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) ld.shared.b32 %r5872, [%r235];
(EngineCore_0 pid=142) ld.shared.b32 %r5873, [%r235+256];
(EngineCore_0 pid=142) ld.shared.b32 %r5874, [%r236];
(EngineCore_0 pid=142) ld.shared.b32 %r5875, [%r236+256];
(EngineCore_0 pid=142) .loc 1 325 34 // specialize.py:325:34
(EngineCore_0 pid=142) add.f32 %r5876, %r5872, 0f00000000;
(EngineCore_0 pid=142) add.f32 %r5877, %r5874, 0f00000000;
(EngineCore_0 pid=142) $L__tmp7:
(EngineCore_0 pid=142) .loc 3 8 24 // _swiglu.py:8:24 @[ specialize.py:330:46 ]
(EngineCore_0 pid=142) min.f32 %r5878, %r5876, %r669;
(EngineCore_0 pid=142) min.f32 %r5879, %r5877, %r669;
(EngineCore_0 pid=142) .loc 3 10 33 // _swiglu.py:10:33 @[ specialize.py:330:46 ]
(EngineCore_0 pid=142) max.f32 %r5880, %r231, %r5879;
(EngineCore_0 pid=142) .loc 3 45 36 // _swiglu.py:45:36 @[ specialize.py:330:46 ]
(EngineCore_0 pid=142) mul.f32 %r5881, %r232, %r5878;
(EngineCore_0 pid=142) .loc 3 45 27 // _swiglu.py:45:27 @[ specialize.py:330:46 ]
(EngineCore_0 pid=142) mul.f32 %r5882, %r5881, 0f3FB8AA3B;
(EngineCore_0 pid=142) ex2.approx.f32 %r5883, %r5882;
(EngineCore_0 pid=142) .loc 3 45 20 // _swiglu.py:45:20 @[ specialize.py:330:46 ]
(EngineCore_0 pid=142) add.f32 %r5884, %r5883, 0f3F800000;
(EngineCore_0 pid=142) .loc 3 45 16 // _swiglu.py:45:16 @[ specialize.py:330:46 ]
(EngineCore_0 pid=142) div.full.f32 %r5885, %r5878, %r5884;
(EngineCore_0 pid=142) .loc 3 46 29 // _swiglu.py:46:29 @[ specialize.py:330:46 ]
(EngineCore_0 pid=142) fma.rn.f32 %r5886, %r5885, %r5880, %r5885;
(EngineCore_0 pid=142) $L__tmp8:
(EngineCore_0 pid=142) .loc 1 353 34 // specialize.py:353:34
(EngineCore_0 pid=142) shr.s32 %r5850, %r5869, 1;
(EngineCore_0 pid=142) .loc 1 360 57 // specialize.py:360:57
(EngineCore_0 pid=142) cvt.rn.bf16.f32 %rs737, %r5886;
(EngineCore_0 pid=142) .loc 1 360 50 // specialize.py:360:50
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) barrier.sync.aligned 0, 256 ;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) st.shared.b16 [%r237], %rs737;
(EngineCore_0 pid=142) st.shared.b16 [%r237+1024], %rs737;
(EngineCore_0 pid=142) st.shared.b16 [%r237+2048], %rs737;
(EngineCore_0 pid=142) st.shared.b16 [%r237+3072], %rs737;
(EngineCore_0 pid=142) st.shared.b16 [%r237+4096], %rs737;
(EngineCore_0 pid=142) st.shared.b16 [%r237+5120], %rs737;
(EngineCore_0 pid=142) st.shared.b16 [%r237+6144], %rs737;
(EngineCore_0 pid=142) st.shared.b16 [%r237+7168], %rs737;
(EngineCore_0 pid=142) st.shared.b16 [%r237+8192], %rs737;
(EngineCore_0 pid=142) st.shared.b16 [%r237+9216], %rs737;
(EngineCore_0 pid=142) st.shared.b16 [%r237+10240], %rs737;
(EngineCore_0 pid=142) st.shared.b16 [%r237+11264], %rs737;
(EngineCore_0 pid=142) st.shared.b16 [%r237+12288], %rs737;
(EngineCore_0 pid=142) st.shared.b16 [%r237+13312], %rs737;
(EngineCore_0 pid=142) st.shared.b16 [%r237+14336], %rs737;
(EngineCore_0 pid=142) st.shared.b16 [%r237+15360], %rs737;
(EngineCore_0 pid=142) st.shared.b16 [%r238], %rs737;
(EngineCore_0 pid=142) st.shared.b16 [%r238+1024], %rs737;
(EngineCore_0 pid=142) st.shared.b16 [%r238+2048], %rs737;
(EngineCore_0 pid=142) st.shared.b16 [%r238+3072], %rs737;
(EngineCore_0 pid=142) st.shared.b16 [%r238+4096], %rs737;
(EngineCore_0 pid=142) st.shared.b16 [%r238+5120], %rs737;
(EngineCore_0 pid=142) st.shared.b16 [%r238+6144], %rs737;
(EngineCore_0 pid=142) st.shared.b16 [%r238+7168], %rs737;
(EngineCore_0 pid=142) st.shared.b16 [%r238+8192], %rs737;
(EngineCore_0 pid=142) st.shared.b16 [%r238+9216], %rs737;
(EngineCore_0 pid=142) st.shared.b16 [%r238+10240], %rs737;
(EngineCore_0 pid=142) st.shared.b16 [%r238+11264], %rs737;
(EngineCore_0 pid=142) st.shared.b16 [%r238+12288], %rs737;
(EngineCore_0 pid=142) st.shared.b16 [%r238+13312], %rs737;
(EngineCore_0 pid=142) st.shared.b16 [%r238+14336], %rs737;
(EngineCore_0 pid=142) st.shared.b16 [%r238+15360], %rs737;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) fence.proxy.async.shared::cta;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) barrier.sync.aligned 0, 256 ;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) elect.sync %r5887|%p121, -1;
(EngineCore_0 pid=142) and.pred %p119, %p100, %p121;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p119 cp.async.bulk.tensor.2d.global.shared::cta.bulk_group [%rd124, {%r5850, %r5851}], [%r5855];
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) cp.async.bulk.commit_group;
(EngineCore_0 pid=142) cp.async.bulk.wait_group.read 0;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) barrier.sync.aligned 0, 256 ;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) .loc 1 325 34 // specialize.py:325:34
(EngineCore_0 pid=142) add.f32 %r5888, %r5873, 0f00000000;
(EngineCore_0 pid=142) add.f32 %r5889, %r5875, 0f00000000;
(EngineCore_0 pid=142) $L__tmp9:
(EngineCore_0 pid=142) .loc 3 8 24 // _swiglu.py:8:24 @[ specialize.py:330:46 ]
(EngineCore_0 pid=142) min.f32 %r5890, %r5888, %r669;
(EngineCore_0 pid=142) min.f32 %r5891, %r5889, %r669;
(EngineCore_0 pid=142) .loc 3 10 33 // _swiglu.py:10:33 @[ specialize.py:330:46 ]
(EngineCore_0 pid=142) max.f32 %r5892, %r231, %r5891;
(EngineCore_0 pid=142) .loc 3 45 36 // _swiglu.py:45:36 @[ specialize.py:330:46 ]
(EngineCore_0 pid=142) mul.f32 %r5893, %r232, %r5890;
(EngineCore_0 pid=142) .loc 3 45 27 // _swiglu.py:45:27 @[ specialize.py:330:46 ]
(EngineCore_0 pid=142) mul.f32 %r5894, %r5893, 0f3FB8AA3B;
(EngineCore_0 pid=142) ex2.approx.f32 %r5895, %r5894;
(EngineCore_0 pid=142) .loc 3 45 20 // _swiglu.py:45:20 @[ specialize.py:330:46 ]
(EngineCore_0 pid=142) add.f32 %r5896, %r5895, 0f3F800000;
(EngineCore_0 pid=142) .loc 3 45 16 // _swiglu.py:45:16 @[ specialize.py:330:46 ]
(EngineCore_0 pid=142) div.full.f32 %r5897, %r5890, %r5896;
(EngineCore_0 pid=142) .loc 3 46 29 // _swiglu.py:46:29 @[ specialize.py:330:46 ]
(EngineCore_0 pid=142) fma.rn.f32 %r5898, %r5897, %r5892, %r5897;
(EngineCore_0 pid=142) $L__tmp10:
(EngineCore_0 pid=142) .loc 1 353 59 // specialize.py:353:59
(EngineCore_0 pid=142) or.b32 %r5853, %r5850, 64;
(EngineCore_0 pid=142) .loc 1 360 57 // specialize.py:360:57
(EngineCore_0 pid=142) cvt.rn.bf16.f32 %rs738, %r5898;
(EngineCore_0 pid=142) .loc 1 360 50 // specialize.py:360:50
(EngineCore_0 pid=142) st.shared.b16 [%r237], %rs738;
(EngineCore_0 pid=142) st.shared.b16 [%r237+1024], %rs738;
(EngineCore_0 pid=142) st.shared.b16 [%r237+2048], %rs738;
(EngineCore_0 pid=142) st.shared.b16 [%r237+3072], %rs738;
(EngineCore_0 pid=142) st.shared.b16 [%r237+4096], %rs738;
(EngineCore_0 pid=142) st.shared.b16 [%r237+5120], %rs738;
(EngineCore_0 pid=142) st.shared.b16 [%r237+6144], %rs738;
(EngineCore_0 pid=142) st.shared.b16 [%r237+7168], %rs738;
(EngineCore_0 pid=142) st.shared.b16 [%r237+8192], %rs738;
(EngineCore_0 pid=142) st.shared.b16 [%r237+9216], %rs738;
(EngineCore_0 pid=142) st.shared.b16 [%r237+10240], %rs738;
(EngineCore_0 pid=142) st.shared.b16 [%r237+11264], %rs738;
(EngineCore_0 pid=142) st.shared.b16 [%r237+12288], %rs738;
(EngineCore_0 pid=142) st.shared.b16 [%r237+13312], %rs738;
(EngineCore_0 pid=142) st.shared.b16 [%r237+14336], %rs738;
(EngineCore_0 pid=142) st.shared.b16 [%r237+15360], %rs738;
(EngineCore_0 pid=142) st.shared.b16 [%r238], %rs738;
(EngineCore_0 pid=142) st.shared.b16 [%r238+1024], %rs738;
(EngineCore_0 pid=142) st.shared.b16 [%r238+2048], %rs738;
(EngineCore_0 pid=142) st.shared.b16 [%r238+3072], %rs738;
(EngineCore_0 pid=142) st.shared.b16 [%r238+4096], %rs738;
(EngineCore_0 pid=142) st.shared.b16 [%r238+5120], %rs738;
(EngineCore_0 pid=142) st.shared.b16 [%r238+6144], %rs738;
(EngineCore_0 pid=142) st.shared.b16 [%r238+7168], %rs738;
(EngineCore_0 pid=142) st.shared.b16 [%r238+8192], %rs738;
(EngineCore_0 pid=142) st.shared.b16 [%r238+9216], %rs738;
(EngineCore_0 pid=142) st.shared.b16 [%r238+10240], %rs738;
(EngineCore_0 pid=142) st.shared.b16 [%r238+11264], %rs738;
(EngineCore_0 pid=142) st.shared.b16 [%r238+12288], %rs738;
(EngineCore_0 pid=142) st.shared.b16 [%r238+13312], %rs738;
(EngineCore_0 pid=142) st.shared.b16 [%r238+14336], %rs738;
(EngineCore_0 pid=142) st.shared.b16 [%r238+15360], %rs738;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) fence.proxy.async.shared::cta;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) barrier.sync.aligned 0, 256 ;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) elect.sync %r5899|%p122, -1;
(EngineCore_0 pid=142) and.pred %p120, %p100, %p122;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p120 cp.async.bulk.tensor.2d.global.shared::cta.bulk_group [%rd124, {%r5853, %r5851}], [%r5855];
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) cp.async.bulk.commit_group;
(EngineCore_0 pid=142) cp.async.bulk.wait_group.read 0;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) barrier.sync.aligned 0, 256 ;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) .loc 1 135 133 // specialize.py:135:133
(EngineCore_0 pid=142) add.s32 %r6039, %r6039, 170;
(EngineCore_0 pid=142) setp.lt.s32 %p123, %r6039, %r226;
(EngineCore_0 pid=142) @%p123 bra $L__BB0_16;
(EngineCore_0 pid=142) bra.uni $L__BB0_23;
(EngineCore_0 pid=142) $L__BB0_1: // %.preheader777.preheader
(EngineCore_0 pid=142) .loc 1 0 133 // specialize.py:0:133
(EngineCore_0 pid=142) ld.param.b64 %rd13, [_p_matmul_ogs_NNT_bf16xbf16xmxfp4_128x256x128x1_swiglu_param_38];
(EngineCore_0 pid=142) bra.uni $L__BB0_2;
(EngineCore_0 pid=142) $L__BB0_12: // in Loop: Header=BB0_2 Depth=1
(EngineCore_0 pid=142) .loc 1 135 133 // specialize.py:135:133
(EngineCore_0 pid=142) setmaxnreg.inc.sync.aligned.u32 88;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) barrier.sync 1 ;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) barrier.sync 1 ;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) setmaxnreg.dec.sync.aligned.u32 24;
(EngineCore_0 pid=142) $L__BB0_2: // %.preheader777
(EngineCore_0 pid=142) // =>This Loop Header: Depth=1
(EngineCore_0 pid=142) // Child Loop BB0_8 Depth 2
(EngineCore_0 pid=142) .loc 1 33 0 // specialize.py:33
(EngineCore_0 pid=142) setmaxnreg.dec.sync.aligned.u32 24;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) barrier.sync 1 ;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) mov.b32 %r672, global_smem;
(EngineCore_0 pid=142) add.s32 %r673, %r672, %r2;
(EngineCore_0 pid=142) ld.shared.b8 %r671, [%r673+83280];
(EngineCore_0 pid=142) setp.gt.u32 %p2, %r671, 3;
(EngineCore_0 pid=142) @%p2 bra $L__BB0_4;
(EngineCore_0 pid=142) // %bb.3: // %.preheader777
(EngineCore_0 pid=142) // in Loop: Header=BB0_2 Depth=1
(EngineCore_0 pid=142) $L_brx_0: .branchtargets
(EngineCore_0 pid=142) $L__BB0_5,
(EngineCore_0 pid=142) $L__BB0_6,
(EngineCore_0 pid=142) $L__BB0_12,
(EngineCore_0 pid=142) $L__BB0_24;
(EngineCore_0 pid=142) brx.idx %r671, $L_brx_0;
(EngineCore_0 pid=142) $L__BB0_5: // in Loop: Header=BB0_2 Depth=1
(EngineCore_0 pid=142) .loc 1 135 133 // specialize.py:135:133
(EngineCore_0 pid=142) setmaxnreg.inc.sync.aligned.u32 88;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) barrier.sync 1 ;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) barrier.sync 1 ;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) setmaxnreg.dec.sync.aligned.u32 24;
(EngineCore_0 pid=142) bra.uni $L__BB0_2;
(EngineCore_0 pid=142) $L__BB0_6: // in Loop: Header=BB0_2 Depth=1
(EngineCore_0 pid=142) .loc 1 135 133 // specialize.py:135:133
(EngineCore_0 pid=142) setmaxnreg.inc.sync.aligned.u32 88;
(EngineCore_0 pid=142) ld.shared.v4.b32 {%r3, %r4, %r5901, %r5974}, [global_smem+32768];
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) barrier.sync 1 ;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) setp.lt.s32 %p3, %r5901, 1;
(EngineCore_0 pid=142) @%p3 bra $L__BB0_11;
(EngineCore_0 pid=142) // %bb.7: // %.lr.ph
(EngineCore_0 pid=142) // in Loop: Header=BB0_2 Depth=1
(EngineCore_0 pid=142) .loc 1 0 0 // specialize.py:0
(EngineCore_0 pid=142) add.s32 %r674, %r665, 127;
(EngineCore_0 pid=142) shr.s32 %r675, %r674, 31;
(EngineCore_0 pid=142) shr.u32 %r676, %r675, 25;
(EngineCore_0 pid=142) add.s32 %r677, %r674, %r676;
(EngineCore_0 pid=142) shr.s32 %r7, %r677, 7;
(EngineCore_0 pid=142) shl.b32 %r8, %r667, 3;
(EngineCore_0 pid=142) add.s32 %r678, %r666, 127;
(EngineCore_0 pid=142) shr.s32 %r679, %r678, 31;
(EngineCore_0 pid=142) shr.u32 %r680, %r679, 25;
(EngineCore_0 pid=142) add.s32 %r681, %r678, %r680;
(EngineCore_0 pid=142) shr.s32 %r682, %r681, 7;
(EngineCore_0 pid=142) max.s32 %r683, %r682, 1;
(EngineCore_0 pid=142) add.s32 %r9, %r683, -1;
(EngineCore_0 pid=142) .loc 1 150 42 // specialize.py:150:42
(EngineCore_0 pid=142) and.b32 %r10, %r1, 63;
(EngineCore_0 pid=142) add.s32 %r686, %r1, -256;
(EngineCore_0 pid=142) shr.u32 %r687, %r1, 2;
(EngineCore_0 pid=142) and.b32 %r688, %r687, 8;
(EngineCore_0 pid=142) and.b32 %r689, %r1, 15;
(EngineCore_0 pid=142) or.b32 %r690, %r688, %r689;
(EngineCore_0 pid=142) shl.b32 %r691, %r690, 4;
(EngineCore_0 pid=142) add.s32 %r693, %r672, 82944;
(EngineCore_0 pid=142) add.s32 %r702, %r693, %r691;
(EngineCore_0 pid=142) shr.u32 %r694, %r1, 1;
(EngineCore_0 pid=142) and.b32 %r695, %r694, 16;
(EngineCore_0 pid=142) add.s32 %r12, %r693, %r695;
(EngineCore_0 pid=142) shr.u32 %r13, %r686, 5;
(EngineCore_0 pid=142) mov.b32 %r5971, -1;
(EngineCore_0 pid=142) mov.b32 %r5975, 0;
(EngineCore_0 pid=142) setp.lt.u32 %p46, %r10, 32;
(EngineCore_0 pid=142) setp.eq.s32 %p9, %r10, 0;
(EngineCore_0 pid=142) mov.b32 %r5976, %r5975;
(EngineCore_0 pid=142) mov.b32 %r5977, %r5975;
(EngineCore_0 pid=142) mov.b32 %r5978, %r5975;
(EngineCore_0 pid=142) mov.b32 %r5979, %r5975;
(EngineCore_0 pid=142) mov.b32 %r5980, %r5975;
(EngineCore_0 pid=142) mov.b32 %r5981, %r5975;
(EngineCore_0 pid=142) mov.b32 %r5982, %r5975;
(EngineCore_0 pid=142) mov.b32 %r5983, %r5975;
(EngineCore_0 pid=142) mov.b32 %r5984, %r5975;
(EngineCore_0 pid=142) mov.b32 %r5985, %r5975;
(EngineCore_0 pid=142) mov.b32 %r5986, %r5975;
(EngineCore_0 pid=142) mov.b32 %r5987, %r5975;
(EngineCore_0 pid=142) mov.b32 %r5988, %r5975;
(EngineCore_0 pid=142) mov.b32 %r5989, %r5975;
(EngineCore_0 pid=142) mov.b32 %r5990, %r5975;
(EngineCore_0 pid=142) mov.b32 %r5991, %r5975;
(EngineCore_0 pid=142) mov.b32 %r5992, %r5975;
(EngineCore_0 pid=142) mov.b32 %r5993, %r5975;
(EngineCore_0 pid=142) mov.b32 %r5994, %r5975;
(EngineCore_0 pid=142) mov.b32 %r5995, %r5975;
(EngineCore_0 pid=142) mov.b32 %r5996, %r5975;
(EngineCore_0 pid=142) mov.b32 %r5997, %r5975;
(EngineCore_0 pid=142) mov.b32 %r5998, %r5975;
(EngineCore_0 pid=142) mov.b32 %r5999, %r5975;
(EngineCore_0 pid=142) mov.b32 %r6000, %r5975;
(EngineCore_0 pid=142) mov.b32 %r6001, %r5975;
(EngineCore_0 pid=142) mov.b32 %r6002, %r5975;
(EngineCore_0 pid=142) mov.b32 %r6003, %r5975;
(EngineCore_0 pid=142) mov.b32 %r6004, %r5975;
(EngineCore_0 pid=142) mov.b32 %r6005, %r5975;
(EngineCore_0 pid=142) mov.b32 %r6006, %r5975;
(EngineCore_0 pid=142) mov.b32 %r6007, %r5975;
(EngineCore_0 pid=142) mov.b32 %r6008, %r5975;
(EngineCore_0 pid=142) mov.b32 %r6009, %r5975;
(EngineCore_0 pid=142) mov.b32 %r6010, %r5975;
(EngineCore_0 pid=142) mov.b32 %r6011, %r5975;
(EngineCore_0 pid=142) mov.b32 %r6012, %r5975;
(EngineCore_0 pid=142) mov.b32 %r6013, %r5975;
(EngineCore_0 pid=142) mov.b32 %r6014, %r5975;
(EngineCore_0 pid=142) mov.b32 %r6015, %r5975;
(EngineCore_0 pid=142) mov.b32 %r6016, %r5975;
(EngineCore_0 pid=142) mov.b32 %r6017, %r5975;
(EngineCore_0 pid=142) mov.b32 %r6018, %r5975;
(EngineCore_0 pid=142) mov.b32 %r6019, %r5975;
(EngineCore_0 pid=142) mov.b32 %r6020, %r5975;
(EngineCore_0 pid=142) mov.b32 %r6021, %r5975;
(EngineCore_0 pid=142) mov.b32 %r6022, %r5975;
(EngineCore_0 pid=142) mov.b32 %r6023, %r5975;
(EngineCore_0 pid=142) mov.b32 %r6024, %r5975;
(EngineCore_0 pid=142) mov.b32 %r6025, %r5975;
(EngineCore_0 pid=142) mov.b32 %r6026, %r5975;
(EngineCore_0 pid=142) mov.b32 %r6027, %r5975;
(EngineCore_0 pid=142) mov.b32 %r6028, %r5975;
(EngineCore_0 pid=142) mov.b32 %r6029, %r5975;
(EngineCore_0 pid=142) mov.b32 %r6030, %r5975;
(EngineCore_0 pid=142) mov.b32 %r6031, %r5975;
(EngineCore_0 pid=142) mov.b32 %r6032, %r5975;
(EngineCore_0 pid=142) mov.b32 %r6033, %r5975;
(EngineCore_0 pid=142) mov.b32 %r6034, %r5975;
(EngineCore_0 pid=142) mov.b32 %r6035, %r5975;
(EngineCore_0 pid=142) mov.b32 %r6036, %r5975;
(EngineCore_0 pid=142) mov.b32 %r6037, %r5975;
(EngineCore_0 pid=142) mov.b32 %r6038, %r5975;
(EngineCore_0 pid=142) mov.b32 %r5966, %r5975;
(EngineCore_0 pid=142) mov.b32 %r5973, %r5975;
(EngineCore_0 pid=142) mov.b32 %r5972, %r5975;
(EngineCore_0 pid=142) mov.b32 %r5969, %r5975;
(EngineCore_0 pid=142) bra.uni $L__BB0_8;
(EngineCore_0 pid=142) $L__BB0_10: // in Loop: Header=BB0_8 Depth=2
(EngineCore_0 pid=142) .loc 1 173 43 // specialize.py:173:43
(EngineCore_0 pid=142) shl.b32 %r731, %r968, 7;
(EngineCore_0 pid=142) .loc 1 174 54 // specialize.py:174:54
(EngineCore_0 pid=142) shl.b32 %r958, %r968, 6;
(EngineCore_0 pid=142) .loc 1 178 39 // specialize.py:178:39
(EngineCore_0 pid=142) add.s32 %r727, %r672, 83200;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142)
(EngineCore_0 pid=142) {
(EngineCore_0 pid=142) .reg .pred complete;
(EngineCore_0 pid=142) waitLoop:
(EngineCore_0 pid=142) mbarrier.try_wait.parity.shared.b64 complete, [%r727], %r5966;
(EngineCore_0 pid=142) @!complete bra.uni waitLoop;
(EngineCore_0 pid=142) }
(EngineCore_0 pid=142)
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) add.s32 %r729, %r672, 83216;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p9 mbarrier.arrive.expect_tx.shared.b64 _, [%r729], 32768;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) barrier.sync.aligned 3, 64 ;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) shfl.sync.idx.b32 %r973, %r13, 0, 31, -1;
(EngineCore_0 pid=142) elect.sync %r974|%p10, -1;
(EngineCore_0 pid=142) shl.b32 %r975, %r973, 9;
(EngineCore_0 pid=142) and.b32 %r976, %r975, 512;
(EngineCore_0 pid=142) add.s32 %r730, %r672, %r976;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p10 cp.async.bulk.tensor.2d.tile::gather4.shared::cluster.global.mbarrier::complete_tx::bytes [%r730], [%rd3, {%r731, %r5975, %r5976, %r5977, %r5978}], [%r729];
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) add.s32 %r737, %r730, 16384;
(EngineCore_0 pid=142) or.b32 %r738, %r731, 64;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p10 cp.async.bulk.tensor.2d.tile::gather4.shared::cluster.global.mbarrier::complete_tx::bytes [%r737], [%rd3, {%r738, %r5975, %r5976, %r5977, %r5978}], [%r729];
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) add.s32 %r744, %r730, 1024;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p10 cp.async.bulk.tensor.2d.tile::gather4.shared::cluster.global.mbarrier::complete_tx::bytes [%r744], [%rd3, {%r731, %r5979, %r5980, %r5981, %r5982}], [%r729];
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) add.s32 %r751, %r730, 17408;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p10 cp.async.bulk.tensor.2d.tile::gather4.shared::cluster.global.mbarrier::complete_tx::bytes [%r751], [%rd3, {%r738, %r5979, %r5980, %r5981, %r5982}], [%r729];
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) add.s32 %r758, %r730, 2048;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p10 cp.async.bulk.tensor.2d.tile::gather4.shared::cluster.global.mbarrier::complete_tx::bytes [%r758], [%rd3, {%r731, %r5983, %r5984, %r5985, %r5986}], [%r729];
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) add.s32 %r765, %r730, 18432;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p10 cp.async.bulk.tensor.2d.tile::gather4.shared::cluster.global.mbarrier::complete_tx::bytes [%r765], [%rd3, {%r738, %r5983, %r5984, %r5985, %r5986}], [%r729];
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) add.s32 %r772, %r730, 3072;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p10 cp.async.bulk.tensor.2d.tile::gather4.shared::cluster.global.mbarrier::complete_tx::bytes [%r772], [%rd3, {%r731, %r5987, %r5988, %r5989, %r5990}], [%r729];
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) add.s32 %r779, %r730, 19456;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p10 cp.async.bulk.tensor.2d.tile::gather4.shared::cluster.global.mbarrier::complete_tx::bytes [%r779], [%rd3, {%r738, %r5987, %r5988, %r5989, %r5990}], [%r729];
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) add.s32 %r786, %r730, 4096;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p10 cp.async.bulk.tensor.2d.tile::gather4.shared::cluster.global.mbarrier::complete_tx::bytes [%r786], [%rd3, {%r731, %r5991, %r5992, %r5993, %r5994}], [%r729];
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) add.s32 %r793, %r730, 20480;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p10 cp.async.bulk.tensor.2d.tile::gather4.shared::cluster.global.mbarrier::complete_tx::bytes [%r793], [%rd3, {%r738, %r5991, %r5992, %r5993, %r5994}], [%r729];
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) add.s32 %r800, %r730, 5120;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p10 cp.async.bulk.tensor.2d.tile::gather4.shared::cluster.global.mbarrier::complete_tx::bytes [%r800], [%rd3, {%r731, %r5995, %r5996, %r5997, %r5998}], [%r729];
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) add.s32 %r807, %r730, 21504;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p10 cp.async.bulk.tensor.2d.tile::gather4.shared::cluster.global.mbarrier::complete_tx::bytes [%r807], [%rd3, {%r738, %r5995, %r5996, %r5997, %r5998}], [%r729];
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) add.s32 %r814, %r730, 6144;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p10 cp.async.bulk.tensor.2d.tile::gather4.shared::cluster.global.mbarrier::complete_tx::bytes [%r814], [%rd3, {%r731, %r5999, %r6000, %r6001, %r6002}], [%r729];
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) add.s32 %r821, %r730, 22528;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p10 cp.async.bulk.tensor.2d.tile::gather4.shared::cluster.global.mbarrier::complete_tx::bytes [%r821], [%rd3, {%r738, %r5999, %r6000, %r6001, %r6002}], [%r729];
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) add.s32 %r828, %r730, 7168;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p10 cp.async.bulk.tensor.2d.tile::gather4.shared::cluster.global.mbarrier::complete_tx::bytes [%r828], [%rd3, {%r731, %r6003, %r6004, %r6005, %r6006}], [%r729];
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) add.s32 %r835, %r730, 23552;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p10 cp.async.bulk.tensor.2d.tile::gather4.shared::cluster.global.mbarrier::complete_tx::bytes [%r835], [%rd3, {%r738, %r6003, %r6004, %r6005, %r6006}], [%r729];
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) add.s32 %r842, %r730, 8192;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p10 cp.async.bulk.tensor.2d.tile::gather4.shared::cluster.global.mbarrier::complete_tx::bytes [%r842], [%rd3, {%r731, %r6007, %r6008, %r6009, %r6010}], [%r729];
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) add.s32 %r849, %r730, 24576;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p10 cp.async.bulk.tensor.2d.tile::gather4.shared::cluster.global.mbarrier::complete_tx::bytes [%r849], [%rd3, {%r738, %r6007, %r6008, %r6009, %r6010}], [%r729];
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) add.s32 %r856, %r730, 9216;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p10 cp.async.bulk.tensor.2d.tile::gather4.shared::cluster.global.mbarrier::complete_tx::bytes [%r856], [%rd3, {%r731, %r6011, %r6012, %r6013, %r6014}], [%r729];
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) add.s32 %r863, %r730, 25600;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p10 cp.async.bulk.tensor.2d.tile::gather4.shared::cluster.global.mbarrier::complete_tx::bytes [%r863], [%rd3, {%r738, %r6011, %r6012, %r6013, %r6014}], [%r729];
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) add.s32 %r870, %r730, 10240;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p10 cp.async.bulk.tensor.2d.tile::gather4.shared::cluster.global.mbarrier::complete_tx::bytes [%r870], [%rd3, {%r731, %r6015, %r6016, %r6017, %r6018}], [%r729];
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) add.s32 %r877, %r730, 26624;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p10 cp.async.bulk.tensor.2d.tile::gather4.shared::cluster.global.mbarrier::complete_tx::bytes [%r877], [%rd3, {%r738, %r6015, %r6016, %r6017, %r6018}], [%r729];
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) add.s32 %r884, %r730, 11264;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p10 cp.async.bulk.tensor.2d.tile::gather4.shared::cluster.global.mbarrier::complete_tx::bytes [%r884], [%rd3, {%r731, %r6019, %r6020, %r6021, %r6022}], [%r729];
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) add.s32 %r891, %r730, 27648;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p10 cp.async.bulk.tensor.2d.tile::gather4.shared::cluster.global.mbarrier::complete_tx::bytes [%r891], [%rd3, {%r738, %r6019, %r6020, %r6021, %r6022}], [%r729];
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) add.s32 %r898, %r730, 12288;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p10 cp.async.bulk.tensor.2d.tile::gather4.shared::cluster.global.mbarrier::complete_tx::bytes [%r898], [%rd3, {%r731, %r6023, %r6024, %r6025, %r6026}], [%r729];
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) add.s32 %r905, %r730, 28672;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p10 cp.async.bulk.tensor.2d.tile::gather4.shared::cluster.global.mbarrier::complete_tx::bytes [%r905], [%rd3, {%r738, %r6023, %r6024, %r6025, %r6026}], [%r729];
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) add.s32 %r912, %r730, 13312;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p10 cp.async.bulk.tensor.2d.tile::gather4.shared::cluster.global.mbarrier::complete_tx::bytes [%r912], [%rd3, {%r731, %r6027, %r6028, %r6029, %r6030}], [%r729];
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) add.s32 %r919, %r730, 29696;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p10 cp.async.bulk.tensor.2d.tile::gather4.shared::cluster.global.mbarrier::complete_tx::bytes [%r919], [%rd3, {%r738, %r6027, %r6028, %r6029, %r6030}], [%r729];
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) add.s32 %r926, %r730, 14336;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p10 cp.async.bulk.tensor.2d.tile::gather4.shared::cluster.global.mbarrier::complete_tx::bytes [%r926], [%rd3, {%r731, %r6031, %r6032, %r6033, %r6034}], [%r729];
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) add.s32 %r933, %r730, 30720;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p10 cp.async.bulk.tensor.2d.tile::gather4.shared::cluster.global.mbarrier::complete_tx::bytes [%r933], [%rd3, {%r738, %r6031, %r6032, %r6033, %r6034}], [%r729];
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) add.s32 %r940, %r730, 15360;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p10 cp.async.bulk.tensor.2d.tile::gather4.shared::cluster.global.mbarrier::complete_tx::bytes [%r940], [%rd3, {%r731, %r6035, %r6036, %r6037, %r6038}], [%r729];
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) add.s32 %r947, %r730, 31744;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p10 cp.async.bulk.tensor.2d.tile::gather4.shared::cluster.global.mbarrier::complete_tx::bytes [%r947], [%rd3, {%r738, %r6035, %r6036, %r6037, %r6038}], [%r729];
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) $L__tmp11:
(EngineCore_0 pid=142) .loc 4 39 20 // _p_matmul_ogs.py:39:20 @[ specialize.py:193:32 ]
(EngineCore_0 pid=142) add.s32 %r954, %r672, 83232;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142)
(EngineCore_0 pid=142) {
(EngineCore_0 pid=142) .reg .pred complete;
(EngineCore_0 pid=142) waitLoop:
(EngineCore_0 pid=142) mbarrier.try_wait.parity.shared.b64 complete, [%r954], %r5966;
(EngineCore_0 pid=142) @!complete bra.uni waitLoop;
(EngineCore_0 pid=142) }
(EngineCore_0 pid=142)
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) add.s32 %r956, %r672, 83248;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p9 mbarrier.arrive.expect_tx.shared.b64 _, [%r956], 16384;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) barrier.sync.aligned 3, 64 ;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) elect.sync %r977|%p47, -1;
(EngineCore_0 pid=142) and.pred %p43, %p46, %p47;
(EngineCore_0 pid=142) add.s32 %r957, %r672, 65536;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p43 cp.async.bulk.tensor.3d.shared::cluster.global.mbarrier::complete_tx::bytes [%r957], [%rd64, {%r958, %r5973, %r5972}], [%r956];
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) $L__tmp12:
(EngineCore_0 pid=142) .loc 1 203 84 // specialize.py:203:84
(EngineCore_0 pid=142) shr.s32 %r978, %r5973, 7;
(EngineCore_0 pid=142) .loc 1 203 75 // specialize.py:203:75
(EngineCore_0 pid=142) mad.lo.s32 %r969, %r5972, %r7, %r978;
(EngineCore_0 pid=142) .loc 1 204 44 // specialize.py:204:44
(EngineCore_0 pid=142) add.s32 %r962, %r672, 83264;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142)
(EngineCore_0 pid=142) {
(EngineCore_0 pid=142) .reg .pred complete;
(EngineCore_0 pid=142) waitLoop:
(EngineCore_0 pid=142) mbarrier.try_wait.parity.shared.b64 complete, [%r962], %r5966;
(EngineCore_0 pid=142) @!complete bra.uni waitLoop;
(EngineCore_0 pid=142) }
(EngineCore_0 pid=142)
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) add.s32 %r964, %r672, 83280;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p9 mbarrier.arrive.expect_tx.shared.b64 _, [%r964], 1024;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) barrier.sync.aligned 3, 64 ;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) elect.sync %r979|%p48, -1;
(EngineCore_0 pid=142) and.pred %p45, %p46, %p48;
(EngineCore_0 pid=142) add.s32 %r965, %r672, 81920;
(EngineCore_0 pid=142) mov.b32 %r966, 0;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p45 cp.async.bulk.tensor.5d.shared::cluster.global.mbarrier::complete_tx::bytes [%r965], [%rd65, {%r966, %r966, %r968, %r969, %r966}], [%r964];
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) .loc 1 135 133 // specialize.py:135:133
(EngineCore_0 pid=142) add.s32 %r5969, %r968, 1;
(EngineCore_0 pid=142) .loc 1 178 39 // specialize.py:178:39
(EngineCore_0 pid=142) xor.b32 %r5966, %r5966, 1;
(EngineCore_0 pid=142) .loc 1 135 133 // specialize.py:135:133
(EngineCore_0 pid=142) add.s32 %r5901, %r5901, -1;
(EngineCore_0 pid=142) setp.ne.s32 %p49, %r5901, 0;
(EngineCore_0 pid=142) @%p49 bra $L__BB0_8;
(EngineCore_0 pid=142) bra.uni $L__BB0_11;
(EngineCore_0 pid=142) $L__BB0_8: // Parent Loop BB0_2 Depth=1
(EngineCore_0 pid=142) // => This Inner Loop Header: Depth=2
(EngineCore_0 pid=142) .loc 1 135 133 // specialize.py:135:133
(EngineCore_0 pid=142) add.s32 %r696, %r5971, 1;
(EngineCore_0 pid=142) setp.eq.s32 %p4, %r5971, %r9;
(EngineCore_0 pid=142) selp.b32 %r5971, 0, %r696, %p4;
(EngineCore_0 pid=142) setp.ne.s32 %p5, %r5971, 0;
(EngineCore_0 pid=142) setp.eq.s32 %p6, %r5971, 0;
(EngineCore_0 pid=142) selp.b32 %r968, 0, %r5969, %p6;
(EngineCore_0 pid=142) @%p5 bra $L__BB0_10;
(EngineCore_0 pid=142) // %bb.9: // in Loop: Header=BB0_8 Depth=2
(EngineCore_0 pid=142) add.s32 %r5974, %r5974, 170;
(EngineCore_0 pid=142) $L__tmp13:
(EngineCore_0 pid=142) .loc 4 69 26 // _p_matmul_ogs.py:69:26 @[ specialize.py:140:21 ]
(EngineCore_0 pid=142) rem.s32 %r706, %r5974, %r3;
(EngineCore_0 pid=142) .loc 2 50 22 // _common.py:50:22 @[ specialize.py:140:21 ]
(EngineCore_0 pid=142) div.s32 %r708, %r706, %r8;
(EngineCore_0 pid=142) .loc 2 51 41 // _common.py:51:41 @[ specialize.py:140:21 ]
(EngineCore_0 pid=142) shl.b32 %r709, %r708, 3;
(EngineCore_0 pid=142) .loc 2 51 30 // _common.py:51:30 @[ specialize.py:140:21 ]
(EngineCore_0 pid=142) sub.s32 %r710, %r4, %r709;
(EngineCore_0 pid=142) .loc 2 51 50 // _common.py:51:50 @[ specialize.py:140:21 ]
(EngineCore_0 pid=142) min.s32 %r711, %r710, 8;
(EngineCore_0 pid=142) .loc 2 52 40 // _common.py:52:40 @[ specialize.py:140:21 ]
(EngineCore_0 pid=142) rem.s32 %r712, %r706, %r711;
(EngineCore_0 pid=142) .loc 2 52 34 // _common.py:52:34 @[ specialize.py:140:21 ]
(EngineCore_0 pid=142) add.s32 %r713, %r709, %r712;
(EngineCore_0 pid=142) .loc 2 53 19 // _common.py:53:19 @[ specialize.py:140:21 ]
(EngineCore_0 pid=142) mul.lo.s32 %r714, %r708, %r8;
(EngineCore_0 pid=142) sub.s32 %r715, %r706, %r714;
(EngineCore_0 pid=142) .loc 2 53 30 // _common.py:53:30 @[ specialize.py:140:21 ]
(EngineCore_0 pid=142) div.s32 %r716, %r715, %r711;
(EngineCore_0 pid=142) .loc 4 84 39 // _p_matmul_ogs.py:84:39 @[ specialize.py:140:21 ]
(EngineCore_0 pid=142) mul.wide.s32 %rd27, %r713, 4;
(EngineCore_0 pid=142) add.s64 %rd22, %rd17, %rd27;
(EngineCore_0 pid=142) .loc 4 84 28 // _p_matmul_ogs.py:84:28 @[ specialize.py:140:21 ]
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mov.u32 %r697, 0x0;
(EngineCore_0 pid=142) ld.global.b32 { %r697 }, [ %rd22 + 0 ];
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) .loc 4 85 30 // _p_matmul_ogs.py:85:30 @[ specialize.py:140:21 ]
(EngineCore_0 pid=142) and.b32 %r5972, %r697, 65535;
(EngineCore_0 pid=142) .loc 4 87 32 // _p_matmul_ogs.py:87:32 @[ specialize.py:140:21 ]
(EngineCore_0 pid=142) mul.wide.u32 %rd28, %r5972, 4;
(EngineCore_0 pid=142) add.s64 %rd23, %rd14, %rd28;
(EngineCore_0 pid=142) .loc 4 87 21 // _p_matmul_ogs.py:87:21 @[ specialize.py:140:21 ]
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mov.u32 %r698, 0x0;
(EngineCore_0 pid=142) ld.global.b32 { %r698 }, [ %rd23 + 0 ];
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) .loc 4 88 37 // _p_matmul_ogs.py:88:37 @[ specialize.py:140:21 ]
(EngineCore_0 pid=142) add.s64 %rd24, %rd15, %rd28;
(EngineCore_0 pid=142) .loc 4 88 26 // _p_matmul_ogs.py:88:26 @[ specialize.py:140:21 ]
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mov.u32 %r699, 0x0;
(EngineCore_0 pid=142) ld.global.b32 { %r699 }, [ %rd24 + 0 ];
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) .loc 4 91 22 // _p_matmul_ogs.py:91:22 @[ specialize.py:140:21 ]
(EngineCore_0 pid=142) shr.s32 %r717, %r697, 9;
(EngineCore_0 pid=142) and.b32 %r718, %r717, -128;
(EngineCore_0 pid=142) .loc 4 92 22 // _p_matmul_ogs.py:92:22 @[ specialize.py:140:21 ]
(EngineCore_0 pid=142) shl.b32 %r5973, %r716, 8;
(EngineCore_0 pid=142) $L__tmp14:
(EngineCore_0 pid=142) .loc 1 150 29 // specialize.py:150:29
(EngineCore_0 pid=142) or.b32 %r719, %r718, %r10;
(EngineCore_0 pid=142) or.b32 %r720, %r719, 64;
(EngineCore_0 pid=142) .loc 1 151 31 // specialize.py:151:31
(EngineCore_0 pid=142) setp.lt.s32 %p7, %r719, %r698;
(EngineCore_0 pid=142) setp.lt.s32 %p8, %r720, %r698;
(EngineCore_0 pid=142) .loc 1 160 52 // specialize.py:160:52
(EngineCore_0 pid=142) mul.wide.s32 %rd29, %r699, 4;
(EngineCore_0 pid=142) add.s64 %rd30, %rd13, %rd29;
(EngineCore_0 pid=142) .loc 1 160 77 // specialize.py:160:77
(EngineCore_0 pid=142) mul.wide.s32 %rd31, %r719, 4;
(EngineCore_0 pid=142) add.s64 %rd25, %rd30, %rd31;
(EngineCore_0 pid=142) add.s64 %rd26, %rd25, 256;
(EngineCore_0 pid=142) .loc 1 160 39 // specialize.py:160:39
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mov.u32 %r700, 0xfffffffffffffffc;
(EngineCore_0 pid=142) @%p7 ld.global.b32 { %r700 }, [ %rd25 + 0 ];
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mov.u32 %r701, 0xfffffffffffffffc;
(EngineCore_0 pid=142) @%p8 ld.global.b32 { %r701 }, [ %rd26 + 0 ];
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) .loc 1 161 75 // specialize.py:161:75
(EngineCore_0 pid=142) shr.s32 %r721, %r700, 31;
(EngineCore_0 pid=142) shr.u32 %r722, %r721, 30;
(EngineCore_0 pid=142) add.s32 %r723, %r700, %r722;
(EngineCore_0 pid=142) shr.s32 %r703, %r723, 2;
(EngineCore_0 pid=142) shr.s32 %r724, %r701, 31;
(EngineCore_0 pid=142) shr.u32 %r725, %r724, 30;
(EngineCore_0 pid=142) add.s32 %r726, %r701, %r725;
(EngineCore_0 pid=142) shr.s32 %r705, %r726, 2;
(EngineCore_0 pid=142) .loc 1 135 133 // specialize.py:135:133
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) stmatrix.sync.aligned.x1.m8n8.shared.b16 [%r702], {%r703};
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) barrier.sync.aligned 3, 64 ;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) ld.shared.v4.b32 {%r5975, %r5976, %r5977, %r5978}, [%r12];
(EngineCore_0 pid=142) ld.shared.v4.b32 {%r5979, %r5980, %r5981, %r5982}, [%r12+32];
(EngineCore_0 pid=142) ld.shared.v4.b32 {%r5983, %r5984, %r5985, %r5986}, [%r12+64];
(EngineCore_0 pid=142) ld.shared.v4.b32 {%r5987, %r5988, %r5989, %r5990}, [%r12+96];
(EngineCore_0 pid=142) ld.shared.v4.b32 {%r5991, %r5992, %r5993, %r5994}, [%r12+128];
(EngineCore_0 pid=142) ld.shared.v4.b32 {%r5995, %r5996, %r5997, %r5998}, [%r12+160];
(EngineCore_0 pid=142) ld.shared.v4.b32 {%r5999, %r6000, %r6001, %r6002}, [%r12+192];
(EngineCore_0 pid=142) ld.shared.v4.b32 {%r6003, %r6004, %r6005, %r6006}, [%r12+224];
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) barrier.sync.aligned 3, 64 ;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) stmatrix.sync.aligned.x1.m8n8.shared.b16 [%r702], {%r705};
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) barrier.sync.aligned 3, 64 ;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) ld.shared.v4.b32 {%r6007, %r6008, %r6009, %r6010}, [%r12];
(EngineCore_0 pid=142) ld.shared.v4.b32 {%r6011, %r6012, %r6013, %r6014}, [%r12+32];
(EngineCore_0 pid=142) ld.shared.v4.b32 {%r6015, %r6016, %r6017, %r6018}, [%r12+64];
(EngineCore_0 pid=142) ld.shared.v4.b32 {%r6019, %r6020, %r6021, %r6022}, [%r12+96];
(EngineCore_0 pid=142) ld.shared.v4.b32 {%r6023, %r6024, %r6025, %r6026}, [%r12+128];
(EngineCore_0 pid=142) ld.shared.v4.b32 {%r6027, %r6028, %r6029, %r6030}, [%r12+160];
(EngineCore_0 pid=142) ld.shared.v4.b32 {%r6031, %r6032, %r6033, %r6034}, [%r12+192];
(EngineCore_0 pid=142) ld.shared.v4.b32 {%r6035, %r6036, %r6037, %r6038}, [%r12+224];
(EngineCore_0 pid=142) bra.uni $L__BB0_10;
(EngineCore_0 pid=142) $L__BB0_11: // %._crit_edge
(EngineCore_0 pid=142) // in Loop: Header=BB0_2 Depth=1
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) barrier.sync 1 ;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) setmaxnreg.dec.sync.aligned.u32 24;
(EngineCore_0 pid=142) bra.uni $L__BB0_2;
(EngineCore_0 pid=142) $L__BB0_4: // in Loop: Header=BB0_2 Depth=1
(EngineCore_0 pid=142) .loc 1 33 0 // specialize.py:33
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) barrier.sync 1 ;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) barrier.sync 1 ;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) bra.uni $L__BB0_2;
(EngineCore_0 pid=142) $L__BB0_17:
(EngineCore_0 pid=142) .loc 1 0 0 // specialize.py:0
(EngineCore_0 pid=142) add.s32 %r6172, %r6039, -170;
(EngineCore_0 pid=142) $L__tmp15:
(EngineCore_0 pid=142) .loc 5 41 22 // standard.py:41:22 @[ specialize.py:116:25 ]
(EngineCore_0 pid=142) add.s32 %r994, %r666, 127;
(EngineCore_0 pid=142) .loc 5 41 28 // standard.py:41:28 @[ specialize.py:116:25 ]
(EngineCore_0 pid=142) shr.s32 %r995, %r994, 31;
(EngineCore_0 pid=142) shr.u32 %r996, %r995, 25;
(EngineCore_0 pid=142) add.s32 %r997, %r994, %r996;
(EngineCore_0 pid=142) shr.s32 %r998, %r997, 7;
(EngineCore_0 pid=142) $L__tmp16:
(EngineCore_0 pid=142) .loc 1 135 133 // specialize.py:135:133
(EngineCore_0 pid=142) sub.s32 %r999, %r226, %r6039;
(EngineCore_0 pid=142) mul.hi.s32 %r1000, %r999, 1616928865;
(EngineCore_0 pid=142) shr.u32 %r1001, %r1000, 31;
(EngineCore_0 pid=142) shr.s32 %r1002, %r1000, 6;
(EngineCore_0 pid=142) add.s32 %r1003, %r1002, %r1001;
(EngineCore_0 pid=142) mul.lo.s32 %r1004, %r1003, 170;
(EngineCore_0 pid=142) setp.ne.s32 %p60, %r999, %r1004;
(EngineCore_0 pid=142) setp.gt.s32 %p61, %r999, -1;
(EngineCore_0 pid=142) and.pred %p62, %p61, %p60;
(EngineCore_0 pid=142) selp.b32 %r1005, 1, 0, %p62;
(EngineCore_0 pid=142) add.s32 %r1006, %r1003, %r1005;
(EngineCore_0 pid=142) max.s32 %r1007, %r998, 1;
(EngineCore_0 pid=142) mul.lo.s32 %r6040, %r1006, %r1007;
(EngineCore_0 pid=142) setp.eq.s32 %p93, %r230, 0;
(EngineCore_0 pid=142) mov.b32 %r1008, global_smem;
(EngineCore_0 pid=142) add.s32 %r985, %r1008, 83200;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p93 mbarrier.init.shared::cta.b64 [%r985], 1;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) add.s32 %r986, %r1008, 83216;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p93 mbarrier.init.shared::cta.b64 [%r986], 1;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) .loc 1 178 39 // specialize.py:178:39
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) barrier.sync.aligned 0, 256 ;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p93 mbarrier.arrive.shared::cta.b64 _, [%r985];
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) .loc 1 135 133 // specialize.py:135:133
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) barrier.sync.aligned 0, 256 ;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) add.s32 %r988, %r1008, 83232;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p93 mbarrier.init.shared::cta.b64 [%r988], 1;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) add.s32 %r989, %r1008, 83248;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p93 mbarrier.init.shared::cta.b64 [%r989], 1;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) $L__tmp17:
(EngineCore_0 pid=142) .loc 4 39 20 // _p_matmul_ogs.py:39:20 @[ specialize.py:193:32 ]
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) barrier.sync.aligned 0, 256 ;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p93 mbarrier.arrive.shared::cta.b64 _, [%r988];
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) $L__tmp18:
(EngineCore_0 pid=142) .loc 1 135 133 // specialize.py:135:133
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) barrier.sync.aligned 0, 256 ;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) add.s32 %r991, %r1008, 83264;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p93 mbarrier.init.shared::cta.b64 [%r991], 1;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) add.s32 %r992, %r1008, 83280;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p93 mbarrier.init.shared::cta.b64 [%r992], 1;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) .loc 1 204 44 // specialize.py:204:44
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) barrier.sync.aligned 0, 256 ;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p93 mbarrier.arrive.shared::cta.b64 _, [%r991];
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) mov.b32 %r1009, 33554689;
(EngineCore_0 pid=142) .loc 1 135 133 // specialize.py:135:133
(EngineCore_0 pid=142) st.shared.b32 [global_smem+83288], %r1009;
(EngineCore_0 pid=142) st.shared.v4.b32 [global_smem+32768], {%r226, %r224, %r6040, %r6172};
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) barrier.sync 1 ;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) setmaxnreg.dec.sync.aligned.u32 208;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) barrier.sync 1 ;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) setp.lt.s32 %p63, %r6040, 1;
(EngineCore_0 pid=142) @%p63 bra $L__BB0_22;
(EngineCore_0 pid=142) // %bb.18: // %.lr.ph779
(EngineCore_0 pid=142) .loc 1 0 133 // specialize.py:0:133
(EngineCore_0 pid=142) add.s32 %r242, %r1007, -1;
(EngineCore_0 pid=142) shl.b32 %r1013, %r1, 2;
(EngineCore_0 pid=142) and.b32 %r1014, %r1013, 368;
(EngineCore_0 pid=142) bfe.s32 %r1015, %r1, 7, 1;
(EngineCore_0 pid=142) and.b32 %r1016, %r1, 128;
(EngineCore_0 pid=142) shr.u32 %r1017, %r1016, 5;
(EngineCore_0 pid=142) add.s32 %r1019, %r1008, %r1014;
(EngineCore_0 pid=142) add.s32 %r1020, %r1019, %r1017;
(EngineCore_0 pid=142) add.s32 %r243, %r1020, 81920;
(EngineCore_0 pid=142) shl.b32 %r1021, %r1, 1;
(EngineCore_0 pid=142) and.b32 %r1022, %r1021, 126;
(EngineCore_0 pid=142) and.b32 %r1023, %r1, 64;
(EngineCore_0 pid=142) shl.b32 %r1024, %r1023, 8;
(EngineCore_0 pid=142) or.b32 %r1025, %r1022, %r1024;
(EngineCore_0 pid=142) and.b32 %r1026, %r1015, 144;
(EngineCore_0 pid=142) xor.b32 %r1027, %r1025, %r1026;
(EngineCore_0 pid=142) add.s32 %r244, %r1008, %r1027;
(EngineCore_0 pid=142) xor.b32 %r1028, %r1027, 288;
(EngineCore_0 pid=142) add.s32 %r245, %r1008, %r1028;
(EngineCore_0 pid=142) xor.b32 %r1029, %r1027, 576;
(EngineCore_0 pid=142) add.s32 %r246, %r1008, %r1029;
(EngineCore_0 pid=142) xor.b32 %r1030, %r1027, 864;
(EngineCore_0 pid=142) add.s32 %r247, %r1008, %r1030;
(EngineCore_0 pid=142) and.b32 %r1031, %r1021, 254;
(EngineCore_0 pid=142) and.b32 %r1032, %r1015, 320;
(EngineCore_0 pid=142) xor.b32 %r1033, %r1032, %r1031;
(EngineCore_0 pid=142) add.s32 %r4943, %r1008, 32768;
(EngineCore_0 pid=142) add.s32 %r248, %r4943, %r1033;
(EngineCore_0 pid=142) shl.b32 %r1035, %r1, 6;
(EngineCore_0 pid=142) and.b32 %r1036, %r1035, 3840;
(EngineCore_0 pid=142) and.b32 %r1037, %r1, 7;
(EngineCore_0 pid=142) shl.b32 %r1038, %r1037, 4;
(EngineCore_0 pid=142) or.b32 %r1039, %r1036, %r1038;
(EngineCore_0 pid=142) add.s32 %r249, %r4943, %r1039;
(EngineCore_0 pid=142) xor.b32 %r1040, %r1039, 64;
(EngineCore_0 pid=142) add.s32 %r250, %r4943, %r1040;
(EngineCore_0 pid=142) and.b32 %r1041, %r1035, 960;
(EngineCore_0 pid=142) and.b32 %r1042, %r1, 192;
(EngineCore_0 pid=142) shl.b32 %r1043, %r1042, 4;
(EngineCore_0 pid=142) or.b32 %r1044, %r1041, %r1043;
(EngineCore_0 pid=142) shl.b32 %r1045, %r1, 3;
(EngineCore_0 pid=142) and.b32 %r1046, %r1045, 48;
(EngineCore_0 pid=142) or.b32 %r1047, %r1044, %r1046;
(EngineCore_0 pid=142) and.b32 %r1048, %r1, 16;
(EngineCore_0 pid=142) xor.b32 %r1049, %r1047, %r1048;
(EngineCore_0 pid=142) add.s32 %r1050, %r1008, 65536;
(EngineCore_0 pid=142) add.s32 %r1105, %r1050, %r1049;
(EngineCore_0 pid=142) add.s32 %r1110, %r1105, 4096;
(EngineCore_0 pid=142) add.s32 %r1115, %r1105, 8192;
(EngineCore_0 pid=142) add.s32 %r1120, %r1105, 12288;
(EngineCore_0 pid=142) xor.b32 %r1051, %r1049, 32;
(EngineCore_0 pid=142) add.s32 %r1125, %r1050, %r1051;
(EngineCore_0 pid=142) add.s32 %r1130, %r1125, 4096;
(EngineCore_0 pid=142) add.s32 %r1135, %r1125, 8192;
(EngineCore_0 pid=142) add.s32 %r1140, %r1125, 12288;
(EngineCore_0 pid=142) mov.u32 %r1052, %ctaid.y;
(EngineCore_0 pid=142) mov.u32 %r1053, %ctaid.z;
(EngineCore_0 pid=142) mov.u32 %r1054, %nctaid.x;
(EngineCore_0 pid=142) mov.u32 %r1055, %nctaid.y;
(EngineCore_0 pid=142) mad.lo.s32 %r1056, %r1053, %r1055, %r1052;
(EngineCore_0 pid=142) mad.lo.s32 %r1057, %r1056, %r1054, %r6039;
(EngineCore_0 pid=142) shl.b32 %r1058, %r1057, 8;
(EngineCore_0 pid=142) cvt.s64.s32 %rd67, %r1058;
(EngineCore_0 pid=142) add.s64 %rd68, %rd18, %rd67;
(EngineCore_0 pid=142) add.s64 %rd87, %rd68, 128;
(EngineCore_0 pid=142) shl.b64 %rd80, %rd4, 1;
(EngineCore_0 pid=142) shl.b32 %r1059, %r230, 2;
(EngineCore_0 pid=142) add.s32 %r4931, %r4943, %r1059;
(EngineCore_0 pid=142) cvta.global.u64 %rd91, %rd87;
(EngineCore_0 pid=142) shl.b32 %r1060, %r1037, 3;
(EngineCore_0 pid=142) and.b32 %r1061, %r1013, 960;
(EngineCore_0 pid=142) shr.u32 %r1062, %r1, 1;
(EngineCore_0 pid=142) and.b32 %r1063, %r1062, 4;
(EngineCore_0 pid=142) add.s32 %r1064, %r4943, %r1060;
(EngineCore_0 pid=142) add.s32 %r1065, %r1064, %r1061;
(EngineCore_0 pid=142) add.s32 %r260, %r1065, %r1063;
(EngineCore_0 pid=142) and.b32 %r1066, %r1021, 56;
(EngineCore_0 pid=142) add.s32 %r1067, %r4943, %r1066;
(EngineCore_0 pid=142) add.s32 %r261, %r1067, %r1042;
(EngineCore_0 pid=142) and.b32 %r1068, %r1, 3;
(EngineCore_0 pid=142) shl.b32 %r1069, %r1068, 12;
(EngineCore_0 pid=142) and.b32 %r1070, %r1045, 1792;
(EngineCore_0 pid=142) or.b32 %r1071, %r1069, %r1070;
(EngineCore_0 pid=142) or.b32 %r1072, %r1071, %r1038;
(EngineCore_0 pid=142) and.b32 %r1073, %r1021, 48;
(EngineCore_0 pid=142) xor.b32 %r1074, %r1072, %r1073;
(EngineCore_0 pid=142) add.s32 %r262, %r4943, %r1074;
(EngineCore_0 pid=142) xor.b32 %r1075, %r1074, 16448;
(EngineCore_0 pid=142) add.s32 %r263, %r4943, %r1075;
(EngineCore_0 pid=142) and.b32 %r1076, %r1035, 3584;
(EngineCore_0 pid=142) shl.b32 %r1077, %r1068, 4;
(EngineCore_0 pid=142) or.b32 %r1078, %r1076, %r1077;
(EngineCore_0 pid=142) bfe.s32 %r1079, %r1, 2, 1;
(EngineCore_0 pid=142) and.b32 %r1080, %r1079, 16448;
(EngineCore_0 pid=142) or.b32 %r1081, %r1078, %r1080;
(EngineCore_0 pid=142) shl.b32 %r1082, %r1023, 1;
(EngineCore_0 pid=142) or.b32 %r1083, %r1081, %r1082;
(EngineCore_0 pid=142) and.b32 %r1084, %r1015, 4112;
(EngineCore_0 pid=142) xor.b32 %r1085, %r1083, %r1084;
(EngineCore_0 pid=142) add.s32 %r264, %r4943, %r1085;
(EngineCore_0 pid=142) xor.b32 %r1086, %r1085, 64;
(EngineCore_0 pid=142) add.s32 %r265, %r4943, %r1086;
(EngineCore_0 pid=142) xor.b32 %r1087, %r1085, 8224;
(EngineCore_0 pid=142) add.s32 %r266, %r4943, %r1087;
(EngineCore_0 pid=142) xor.b32 %r1088, %r1085, 8288;
(EngineCore_0 pid=142) add.s32 %r267, %r4943, %r1088;
(EngineCore_0 pid=142) shl.b32 %r1089, %r230, 1;
(EngineCore_0 pid=142) shr.u32 %r1090, %r1042, 2;
(EngineCore_0 pid=142) xor.b32 %r1091, %r1089, %r1090;
(EngineCore_0 pid=142) add.s32 %r268, %r4943, %r1091;
(EngineCore_0 pid=142) xor.b32 %r1092, %r1091, 576;
(EngineCore_0 pid=142) add.s32 %r269, %r4943, %r1092;
(EngineCore_0 pid=142) mov.b32 %r6171, -1;
(EngineCore_0 pid=142) mov.b32 %r6042, 0f00000000;
(EngineCore_0 pid=142) mov.b32 %r6041, 0;
(EngineCore_0 pid=142) mov.b32 %r6043, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6044, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6045, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6046, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6047, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6048, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6049, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6050, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6051, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6052, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6053, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6054, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6055, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6056, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6057, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6058, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6059, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6060, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6061, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6062, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6063, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6064, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6065, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6066, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6067, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6068, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6069, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6070, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6071, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6072, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6073, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6074, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6075, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6076, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6077, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6078, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6079, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6080, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6081, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6082, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6083, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6084, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6085, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6086, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6087, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6088, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6089, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6090, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6091, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6092, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6093, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6094, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6095, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6096, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6097, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6098, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6099, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6100, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6101, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6102, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6103, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6104, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6105, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6106, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6107, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6108, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6109, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6110, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6111, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6112, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6113, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6114, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6115, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6116, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6117, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6118, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6119, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6120, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6121, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6122, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6123, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6124, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6125, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6126, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6127, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6128, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6129, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6130, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6131, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6132, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6133, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6134, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6135, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6136, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6137, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6138, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6139, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6140, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6141, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6142, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6143, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6144, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6145, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6146, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6147, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6148, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6149, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6150, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6151, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6152, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6153, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6154, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6155, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6156, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6157, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6158, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6159, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6160, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6161, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6162, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6163, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6164, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6165, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6166, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6167, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6168, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6169, %r6042;
(EngineCore_0 pid=142) bra.uni $L__BB0_19;
(EngineCore_0 pid=142) $L__BB0_21: // in Loop: Header=BB0_19 Depth=1
(EngineCore_0 pid=142) .loc 1 178 39 // specialize.py:178:39
(EngineCore_0 pid=142) xor.b32 %r6041, %r6041, 1;
(EngineCore_0 pid=142) .loc 1 135 133 // specialize.py:135:133
(EngineCore_0 pid=142) add.s32 %r6040, %r6040, -1;
(EngineCore_0 pid=142) setp.ne.s32 %p92, %r6040, 0;
(EngineCore_0 pid=142) @%p92 bra $L__BB0_19;
(EngineCore_0 pid=142) bra.uni $L__BB0_22;
(EngineCore_0 pid=142) $L__BB0_19: // =>This Inner Loop Header: Depth=1
(EngineCore_0 pid=142) .loc 1 135 133 // specialize.py:135:133
(EngineCore_0 pid=142) add.s32 %r4886, %r6171, 1;
(EngineCore_0 pid=142) setp.eq.s32 %p67, %r6171, %r242;
(EngineCore_0 pid=142) selp.b32 %r6171, 0, %r4886, %p67;
(EngineCore_0 pid=142) .loc 1 204 44 // specialize.py:204:44
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) barrier.sync.aligned 0, 256 ;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142)
(EngineCore_0 pid=142) {
(EngineCore_0 pid=142) .reg .pred complete;
(EngineCore_0 pid=142) waitLoop:
(EngineCore_0 pid=142) mbarrier.try_wait.parity.shared.b64 complete, [%r992], %r6041;
(EngineCore_0 pid=142) @!complete bra.uni waitLoop;
(EngineCore_0 pid=142) }
(EngineCore_0 pid=142)
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) $L__tmp19:
(EngineCore_0 pid=142) .loc 6 57 61 // blackwell_scale.py:57:61 @[ specialize.py:206:53 ]
(EngineCore_0 pid=142) ld.shared.b32 %r4888, [%r243];
(EngineCore_0 pid=142) bfe.u32 %r4889, %r4888, 0, 8;
(EngineCore_0 pid=142) cvt.u16.u32 %rs1, %r4889;
(EngineCore_0 pid=142) bfe.u32 %r4890, %r4888, 8, 8;
(EngineCore_0 pid=142) cvt.u16.u32 %rs2, %r4890;
(EngineCore_0 pid=142) bfe.u32 %r4891, %r4888, 16, 8;
(EngineCore_0 pid=142) cvt.u16.u32 %rs3, %r4891;
(EngineCore_0 pid=142) bfe.u32 %r4892, %r4888, 24, 8;
(EngineCore_0 pid=142) cvt.u16.u32 %rs4, %r4892;
(EngineCore_0 pid=142) ld.shared.b32 %r4893, [%r243+128];
(EngineCore_0 pid=142) bfe.u32 %r4894, %r4893, 0, 8;
(EngineCore_0 pid=142) cvt.u16.u32 %rs5, %r4894;
(EngineCore_0 pid=142) bfe.u32 %r4895, %r4893, 8, 8;
(EngineCore_0 pid=142) cvt.u16.u32 %rs6, %r4895;
(EngineCore_0 pid=142) bfe.u32 %r4896, %r4893, 16, 8;
(EngineCore_0 pid=142) cvt.u16.u32 %rs7, %r4896;
(EngineCore_0 pid=142) bfe.u32 %r4897, %r4893, 24, 8;
(EngineCore_0 pid=142) cvt.u16.u32 %rs8, %r4897;
(EngineCore_0 pid=142) ld.shared.b32 %r4898, [%r243+8];
(EngineCore_0 pid=142) bfe.u32 %r4899, %r4898, 0, 8;
(EngineCore_0 pid=142) cvt.u16.u32 %rs9, %r4899;
(EngineCore_0 pid=142) bfe.u32 %r4900, %r4898, 8, 8;
(EngineCore_0 pid=142) cvt.u16.u32 %rs10, %r4900;
(EngineCore_0 pid=142) bfe.u32 %r4901, %r4898, 16, 8;
(EngineCore_0 pid=142) cvt.u16.u32 %rs11, %r4901;
(EngineCore_0 pid=142) bfe.u32 %r4902, %r4898, 24, 8;
(EngineCore_0 pid=142) cvt.u16.u32 %rs12, %r4902;
(EngineCore_0 pid=142) ld.shared.b32 %r4903, [%r243+136];
(EngineCore_0 pid=142) bfe.u32 %r4904, %r4903, 0, 8;
(EngineCore_0 pid=142) cvt.u16.u32 %rs13, %r4904;
(EngineCore_0 pid=142) bfe.u32 %r4905, %r4903, 8, 8;
(EngineCore_0 pid=142) cvt.u16.u32 %rs14, %r4905;
(EngineCore_0 pid=142) bfe.u32 %r4906, %r4903, 16, 8;
(EngineCore_0 pid=142) cvt.u16.u32 %rs15, %r4906;
(EngineCore_0 pid=142) bfe.u32 %r4907, %r4903, 24, 8;
(EngineCore_0 pid=142) cvt.u16.u32 %rs16, %r4907;
(EngineCore_0 pid=142) ld.shared.b32 %r4908, [%r243+512];
(EngineCore_0 pid=142) bfe.u32 %r4909, %r4908, 0, 8;
(EngineCore_0 pid=142) cvt.u16.u32 %rs17, %r4909;
(EngineCore_0 pid=142) bfe.u32 %r4910, %r4908, 8, 8;
(EngineCore_0 pid=142) cvt.u16.u32 %rs18, %r4910;
(EngineCore_0 pid=142) bfe.u32 %r4911, %r4908, 16, 8;
(EngineCore_0 pid=142) cvt.u16.u32 %rs19, %r4911;
(EngineCore_0 pid=142) bfe.u32 %r4912, %r4908, 24, 8;
(EngineCore_0 pid=142) cvt.u16.u32 %rs20, %r4912;
(EngineCore_0 pid=142) ld.shared.b32 %r4913, [%r243+640];
(EngineCore_0 pid=142) bfe.u32 %r4914, %r4913, 0, 8;
(EngineCore_0 pid=142) cvt.u16.u32 %rs21, %r4914;
(EngineCore_0 pid=142) bfe.u32 %r4915, %r4913, 8, 8;
(EngineCore_0 pid=142) cvt.u16.u32 %rs22, %r4915;
(EngineCore_0 pid=142) bfe.u32 %r4916, %r4913, 16, 8;
(EngineCore_0 pid=142) cvt.u16.u32 %rs23, %r4916;
(EngineCore_0 pid=142) bfe.u32 %r4917, %r4913, 24, 8;
(EngineCore_0 pid=142) cvt.u16.u32 %rs24, %r4917;
(EngineCore_0 pid=142) ld.shared.b32 %r4918, [%r243+520];
(EngineCore_0 pid=142) bfe.u32 %r4919, %r4918, 0, 8;
(EngineCore_0 pid=142) cvt.u16.u32 %rs25, %r4919;
(EngineCore_0 pid=142) bfe.u32 %r4920, %r4918, 8, 8;
(EngineCore_0 pid=142) cvt.u16.u32 %rs26, %r4920;
(EngineCore_0 pid=142) bfe.u32 %r4921, %r4918, 16, 8;
(EngineCore_0 pid=142) cvt.u16.u32 %rs27, %r4921;
(EngineCore_0 pid=142) bfe.u32 %r4922, %r4918, 24, 8;
(EngineCore_0 pid=142) cvt.u16.u32 %rs28, %r4922;
(EngineCore_0 pid=142) ld.shared.b32 %r4923, [%r243+648];
(EngineCore_0 pid=142) bfe.u32 %r4924, %r4923, 0, 8;
(EngineCore_0 pid=142) cvt.u16.u32 %rs29, %r4924;
(EngineCore_0 pid=142) bfe.u32 %r4925, %r4923, 8, 8;
(EngineCore_0 pid=142) cvt.u16.u32 %rs30, %r4925;
(EngineCore_0 pid=142) bfe.u32 %r4926, %r4923, 16, 8;
(EngineCore_0 pid=142) cvt.u16.u32 %rs31, %r4926;
(EngineCore_0 pid=142) bfe.u32 %r4927, %r4923, 24, 8;
(EngineCore_0 pid=142) cvt.u16.u32 %rs32, %r4927;
(EngineCore_0 pid=142) $L__tmp20:
(EngineCore_0 pid=142) .loc 1 204 44 // specialize.py:204:44
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) fence.proxy.async.shared::cta;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) barrier.sync.aligned 0, 256 ;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p93 mbarrier.arrive.shared::cta.b64 _, [%r991];
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) .loc 1 178 39 // specialize.py:178:39
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) barrier.sync.aligned 0, 256 ;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142)
(EngineCore_0 pid=142) {
(EngineCore_0 pid=142) .reg .pred complete;
(EngineCore_0 pid=142) waitLoop:
(EngineCore_0 pid=142) mbarrier.try_wait.parity.shared.b64 complete, [%r986], %r6041;
(EngineCore_0 pid=142) @!complete bra.uni waitLoop;
(EngineCore_0 pid=142) }
(EngineCore_0 pid=142)
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) ld.shared.b16 %rs33, [%r244];
(EngineCore_0 pid=142) ld.shared.b16 %rs34, [%r244+1024];
(EngineCore_0 pid=142) ld.shared.b16 %rs35, [%r244+2048];
(EngineCore_0 pid=142) ld.shared.b16 %rs36, [%r244+3072];
(EngineCore_0 pid=142) ld.shared.b16 %rs37, [%r244+4096];
(EngineCore_0 pid=142) ld.shared.b16 %rs38, [%r244+5120];
(EngineCore_0 pid=142) ld.shared.b16 %rs39, [%r244+6144];
(EngineCore_0 pid=142) ld.shared.b16 %rs40, [%r244+7168];
(EngineCore_0 pid=142) ld.shared.b16 %rs41, [%r244+8192];
(EngineCore_0 pid=142) ld.shared.b16 %rs42, [%r244+9216];
(EngineCore_0 pid=142) ld.shared.b16 %rs43, [%r244+10240];
(EngineCore_0 pid=142) ld.shared.b16 %rs44, [%r244+11264];
(EngineCore_0 pid=142) ld.shared.b16 %rs45, [%r244+12288];
(EngineCore_0 pid=142) ld.shared.b16 %rs46, [%r244+13312];
(EngineCore_0 pid=142) ld.shared.b16 %rs47, [%r244+14336];
(EngineCore_0 pid=142) ld.shared.b16 %rs48, [%r244+15360];
(EngineCore_0 pid=142) ld.shared.b16 %rs49, [%r245];
(EngineCore_0 pid=142) ld.shared.b16 %rs50, [%r245+1024];
(EngineCore_0 pid=142) ld.shared.b16 %rs51, [%r245+2048];
(EngineCore_0 pid=142) ld.shared.b16 %rs52, [%r245+3072];
(EngineCore_0 pid=142) ld.shared.b16 %rs53, [%r245+4096];
(EngineCore_0 pid=142) ld.shared.b16 %rs54, [%r245+5120];
(EngineCore_0 pid=142) ld.shared.b16 %rs55, [%r245+6144];
(EngineCore_0 pid=142) ld.shared.b16 %rs56, [%r245+7168];
(EngineCore_0 pid=142) ld.shared.b16 %rs57, [%r245+8192];
(EngineCore_0 pid=142) ld.shared.b16 %rs58, [%r245+9216];
(EngineCore_0 pid=142) ld.shared.b16 %rs59, [%r245+10240];
(EngineCore_0 pid=142) ld.shared.b16 %rs60, [%r245+11264];
(EngineCore_0 pid=142) ld.shared.b16 %rs61, [%r245+12288];
(EngineCore_0 pid=142) ld.shared.b16 %rs62, [%r245+13312];
(EngineCore_0 pid=142) ld.shared.b16 %rs63, [%r245+14336];
(EngineCore_0 pid=142) ld.shared.b16 %rs64, [%r245+15360];
(EngineCore_0 pid=142) ld.shared.b16 %rs65, [%r246];
(EngineCore_0 pid=142) ld.shared.b16 %rs66, [%r246+1024];
(EngineCore_0 pid=142) ld.shared.b16 %rs67, [%r246+2048];
(EngineCore_0 pid=142) ld.shared.b16 %rs68, [%r246+3072];
(EngineCore_0 pid=142) ld.shared.b16 %rs69, [%r246+4096];
(EngineCore_0 pid=142) ld.shared.b16 %rs70, [%r246+5120];
(EngineCore_0 pid=142) ld.shared.b16 %rs71, [%r246+6144];
(EngineCore_0 pid=142) ld.shared.b16 %rs72, [%r246+7168];
(EngineCore_0 pid=142) ld.shared.b16 %rs73, [%r246+8192];
(EngineCore_0 pid=142) ld.shared.b16 %rs74, [%r246+9216];
(EngineCore_0 pid=142) ld.shared.b16 %rs75, [%r246+10240];
(EngineCore_0 pid=142) ld.shared.b16 %rs76, [%r246+11264];
(EngineCore_0 pid=142) ld.shared.b16 %rs77, [%r246+12288];
(EngineCore_0 pid=142) ld.shared.b16 %rs78, [%r246+13312];
(EngineCore_0 pid=142) ld.shared.b16 %rs79, [%r246+14336];
(EngineCore_0 pid=142) ld.shared.b16 %rs80, [%r246+15360];
(EngineCore_0 pid=142) ld.shared.b16 %rs81, [%r247];
(EngineCore_0 pid=142) ld.shared.b16 %rs82, [%r247+1024];
(EngineCore_0 pid=142) ld.shared.b16 %rs83, [%r247+2048];
(EngineCore_0 pid=142) ld.shared.b16 %rs84, [%r247+3072];
(EngineCore_0 pid=142) ld.shared.b16 %rs85, [%r247+4096];
(EngineCore_0 pid=142) ld.shared.b16 %rs86, [%r247+5120];
(EngineCore_0 pid=142) ld.shared.b16 %rs87, [%r247+6144];
(EngineCore_0 pid=142) ld.shared.b16 %rs88, [%r247+7168];
(EngineCore_0 pid=142) ld.shared.b16 %rs89, [%r247+8192];
(EngineCore_0 pid=142) ld.shared.b16 %rs90, [%r247+9216];
(EngineCore_0 pid=142) ld.shared.b16 %rs91, [%r247+10240];
(EngineCore_0 pid=142) ld.shared.b16 %rs92, [%r247+11264];
(EngineCore_0 pid=142) ld.shared.b16 %rs93, [%r247+12288];
(EngineCore_0 pid=142) ld.shared.b16 %rs94, [%r247+13312];
(EngineCore_0 pid=142) ld.shared.b16 %rs95, [%r247+14336];
(EngineCore_0 pid=142) ld.shared.b16 %rs96, [%r247+15360];
(EngineCore_0 pid=142) st.shared.b16 [%r248], %rs33;
(EngineCore_0 pid=142) st.shared.b16 [%r248+512], %rs49;
(EngineCore_0 pid=142) st.shared.b16 [%r248+1024], %rs65;
(EngineCore_0 pid=142) st.shared.b16 [%r248+1536], %rs81;
(EngineCore_0 pid=142) st.shared.b16 [%r248+2048], %rs34;
(EngineCore_0 pid=142) st.shared.b16 [%r248+2560], %rs50;
(EngineCore_0 pid=142) st.shared.b16 [%r248+3072], %rs66;
(EngineCore_0 pid=142) st.shared.b16 [%r248+3584], %rs82;
(EngineCore_0 pid=142) st.shared.b16 [%r248+4096], %rs35;
(EngineCore_0 pid=142) st.shared.b16 [%r248+4608], %rs51;
(EngineCore_0 pid=142) st.shared.b16 [%r248+5120], %rs67;
(EngineCore_0 pid=142) st.shared.b16 [%r248+5632], %rs83;
(EngineCore_0 pid=142) st.shared.b16 [%r248+6144], %rs36;
(EngineCore_0 pid=142) st.shared.b16 [%r248+6656], %rs52;
(EngineCore_0 pid=142) st.shared.b16 [%r248+7168], %rs68;
(EngineCore_0 pid=142) st.shared.b16 [%r248+7680], %rs84;
(EngineCore_0 pid=142) st.shared.b16 [%r248+8192], %rs37;
(EngineCore_0 pid=142) st.shared.b16 [%r248+8704], %rs53;
(EngineCore_0 pid=142) st.shared.b16 [%r248+9216], %rs69;
(EngineCore_0 pid=142) st.shared.b16 [%r248+9728], %rs85;
(EngineCore_0 pid=142) st.shared.b16 [%r248+10240], %rs38;
(EngineCore_0 pid=142) st.shared.b16 [%r248+10752], %rs54;
(EngineCore_0 pid=142) st.shared.b16 [%r248+11264], %rs70;
(EngineCore_0 pid=142) st.shared.b16 [%r248+11776], %rs86;
(EngineCore_0 pid=142) st.shared.b16 [%r248+12288], %rs39;
(EngineCore_0 pid=142) st.shared.b16 [%r248+12800], %rs55;
(EngineCore_0 pid=142) st.shared.b16 [%r248+13312], %rs71;
(EngineCore_0 pid=142) st.shared.b16 [%r248+13824], %rs87;
(EngineCore_0 pid=142) st.shared.b16 [%r248+14336], %rs40;
(EngineCore_0 pid=142) st.shared.b16 [%r248+14848], %rs56;
(EngineCore_0 pid=142) st.shared.b16 [%r248+15360], %rs72;
(EngineCore_0 pid=142) st.shared.b16 [%r248+15872], %rs88;
(EngineCore_0 pid=142) st.shared.b16 [%r248+16384], %rs41;
(EngineCore_0 pid=142) st.shared.b16 [%r248+16896], %rs57;
(EngineCore_0 pid=142) st.shared.b16 [%r248+17408], %rs73;
(EngineCore_0 pid=142) st.shared.b16 [%r248+17920], %rs89;
(EngineCore_0 pid=142) st.shared.b16 [%r248+18432], %rs42;
(EngineCore_0 pid=142) st.shared.b16 [%r248+18944], %rs58;
(EngineCore_0 pid=142) st.shared.b16 [%r248+19456], %rs74;
(EngineCore_0 pid=142) st.shared.b16 [%r248+19968], %rs90;
(EngineCore_0 pid=142) st.shared.b16 [%r248+20480], %rs43;
(EngineCore_0 pid=142) st.shared.b16 [%r248+20992], %rs59;
(EngineCore_0 pid=142) st.shared.b16 [%r248+21504], %rs75;
(EngineCore_0 pid=142) st.shared.b16 [%r248+22016], %rs91;
(EngineCore_0 pid=142) st.shared.b16 [%r248+22528], %rs44;
(EngineCore_0 pid=142) st.shared.b16 [%r248+23040], %rs60;
(EngineCore_0 pid=142) st.shared.b16 [%r248+23552], %rs76;
(EngineCore_0 pid=142) st.shared.b16 [%r248+24064], %rs92;
(EngineCore_0 pid=142) st.shared.b16 [%r248+24576], %rs45;
(EngineCore_0 pid=142) st.shared.b16 [%r248+25088], %rs61;
(EngineCore_0 pid=142) st.shared.b16 [%r248+25600], %rs77;
(EngineCore_0 pid=142) st.shared.b16 [%r248+26112], %rs93;
(EngineCore_0 pid=142) st.shared.b16 [%r248+26624], %rs46;
(EngineCore_0 pid=142) st.shared.b16 [%r248+27136], %rs62;
(EngineCore_0 pid=142) st.shared.b16 [%r248+27648], %rs78;
(EngineCore_0 pid=142) st.shared.b16 [%r248+28160], %rs94;
(EngineCore_0 pid=142) st.shared.b16 [%r248+28672], %rs47;
(EngineCore_0 pid=142) st.shared.b16 [%r248+29184], %rs63;
(EngineCore_0 pid=142) st.shared.b16 [%r248+29696], %rs79;
(EngineCore_0 pid=142) st.shared.b16 [%r248+30208], %rs95;
(EngineCore_0 pid=142) st.shared.b16 [%r248+30720], %rs48;
(EngineCore_0 pid=142) st.shared.b16 [%r248+31232], %rs64;
(EngineCore_0 pid=142) st.shared.b16 [%r248+31744], %rs80;
(EngineCore_0 pid=142) st.shared.b16 [%r248+32256], %rs96;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) fence.proxy.async.shared::cta;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) barrier.sync.aligned 0, 256 ;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p93 mbarrier.arrive.shared::cta.b64 _, [%r985];
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) barrier.sync.aligned 0, 256 ;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) ld.shared.v4.b32 {%r1314, %r1762, %r2210, %r2658}, [%r249];
(EngineCore_0 pid=142) ld.shared.v4.b32 {%r3106, %r3554, %r4002, %r4450}, [%r249+128];
(EngineCore_0 pid=142) ld.shared.v4.b32 {%r1328, %r1776, %r2224, %r2672}, [%r249+4096];
(EngineCore_0 pid=142) ld.shared.v4.b32 {%r3120, %r3568, %r4016, %r4464}, [%r249+4224];
(EngineCore_0 pid=142) ld.shared.v4.b32 {%r1342, %r1790, %r2238, %r2686}, [%r249+8192];
(EngineCore_0 pid=142) ld.shared.v4.b32 {%r3134, %r3582, %r4030, %r4478}, [%r249+8320];
(EngineCore_0 pid=142) ld.shared.v4.b32 {%r1356, %r1804, %r2252, %r2700}, [%r249+12288];
(EngineCore_0 pid=142) ld.shared.v4.b32 {%r3148, %r3596, %r4044, %r4492}, [%r249+12416];
(EngineCore_0 pid=142) ld.shared.v4.b32 {%r1370, %r1818, %r2266, %r2714}, [%r249+16384];
(EngineCore_0 pid=142) ld.shared.v4.b32 {%r3162, %r3610, %r4058, %r4506}, [%r249+16512];
(EngineCore_0 pid=142) ld.shared.v4.b32 {%r1384, %r1832, %r2280, %r2728}, [%r249+20480];
(EngineCore_0 pid=142) ld.shared.v4.b32 {%r3176, %r3624, %r4072, %r4520}, [%r249+20608];
(EngineCore_0 pid=142) ld.shared.v4.b32 {%r1398, %r1846, %r2294, %r2742}, [%r249+24576];
(EngineCore_0 pid=142) ld.shared.v4.b32 {%r3190, %r3638, %r4086, %r4534}, [%r249+24704];
(EngineCore_0 pid=142) ld.shared.v4.b32 {%r1412, %r1860, %r2308, %r2756}, [%r249+28672];
(EngineCore_0 pid=142) ld.shared.v4.b32 {%r3204, %r3652, %r4100, %r4548}, [%r249+28800];
(EngineCore_0 pid=142) ld.shared.v4.b32 {%r1315, %r1763, %r2211, %r2659}, [%r250];
(EngineCore_0 pid=142) ld.shared.v4.b32 {%r3107, %r3555, %r4003, %r4451}, [%r250+128];
(EngineCore_0 pid=142) ld.shared.v4.b32 {%r1329, %r1777, %r2225, %r2673}, [%r250+4096];
(EngineCore_0 pid=142) ld.shared.v4.b32 {%r3121, %r3569, %r4017, %r4465}, [%r250+4224];
(EngineCore_0 pid=142) ld.shared.v4.b32 {%r1343, %r1791, %r2239, %r2687}, [%r250+8192];
(EngineCore_0 pid=142) ld.shared.v4.b32 {%r3135, %r3583, %r4031, %r4479}, [%r250+8320];
(EngineCore_0 pid=142) ld.shared.v4.b32 {%r1357, %r1805, %r2253, %r2701}, [%r250+12288];
(EngineCore_0 pid=142) ld.shared.v4.b32 {%r3149, %r3597, %r4045, %r4493}, [%r250+12416];
(EngineCore_0 pid=142) ld.shared.v4.b32 {%r1371, %r1819, %r2267, %r2715}, [%r250+16384];
(EngineCore_0 pid=142) ld.shared.v4.b32 {%r3163, %r3611, %r4059, %r4507}, [%r250+16512];
(EngineCore_0 pid=142) ld.shared.v4.b32 {%r1385, %r1833, %r2281, %r2729}, [%r250+20480];
(EngineCore_0 pid=142) ld.shared.v4.b32 {%r3177, %r3625, %r4073, %r4521}, [%r250+20608];
(EngineCore_0 pid=142) ld.shared.v4.b32 {%r1399, %r1847, %r2295, %r2743}, [%r250+24576];
(EngineCore_0 pid=142) ld.shared.v4.b32 {%r3191, %r3639, %r4087, %r4535}, [%r250+24704];
(EngineCore_0 pid=142) ld.shared.v4.b32 {%r1413, %r1861, %r2309, %r2757}, [%r250+28672];
(EngineCore_0 pid=142) ld.shared.v4.b32 {%r3205, %r3653, %r4101, %r4549}, [%r250+28800];
(EngineCore_0 pid=142) $L__tmp21:
(EngineCore_0 pid=142) .loc 4 39 20 // _p_matmul_ogs.py:39:20 @[ specialize.py:193:32 ]
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142)
(EngineCore_0 pid=142) {
(EngineCore_0 pid=142) .reg .pred complete;
(EngineCore_0 pid=142) waitLoop:
(EngineCore_0 pid=142) mbarrier.try_wait.parity.shared.b64 complete, [%r989], %r6041;
(EngineCore_0 pid=142) @!complete bra.uni waitLoop;
(EngineCore_0 pid=142) }
(EngineCore_0 pid=142)
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) $L__tmp22:
(EngineCore_0 pid=142) .loc 1 212 76 // specialize.py:212:76
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r1101, %r1102, %r1103, %r1104}, [%r1105];
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r1106, %r1107, %r1108, %r1109}, [%r1110];
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r1111, %r1112, %r1113, %r1114}, [%r1115];
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r1116, %r1117, %r1118, %r1119}, [%r1120];
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r1121, %r1122, %r1123, %r1124}, [%r1125];
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r1126, %r1127, %r1128, %r1129}, [%r1130];
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r1131, %r1132, %r1133, %r1134}, [%r1135];
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r1136, %r1137, %r1138, %r1139}, [%r1140];
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) $L__tmp23:
(EngineCore_0 pid=142) .loc 4 39 20 // _p_matmul_ogs.py:39:20 @[ specialize.py:193:32 ]
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) fence.proxy.async.shared::cta;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) barrier.sync.aligned 0, 256 ;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p93 mbarrier.arrive.shared::cta.b64 _, [%r988];
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) $L__tmp24:
(EngineCore_0 pid=142) .loc 1 212 76 // specialize.py:212:76
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) {
(EngineCore_0 pid=142) .reg .b32 a<14>;
(EngineCore_0 pid=142) and.b32 a0, %r1101, -2004318072;
(EngineCore_0 pid=142) shr.u32 a1, a0, 3;
(EngineCore_0 pid=142) and.b32 a2, %r1101, 2004318071;
(EngineCore_0 pid=142) shr.u32 a3, a2, 16;
(EngineCore_0 pid=142) shr.u32 a4, a0, 19;
(EngineCore_0 pid=142) prmt.b32 a5, -1065353216, -1065336832, a2;
(EngineCore_0 pid=142) prmt.b32 a6, -1065353216, -1065336832, a3;
(EngineCore_0 pid=142) prmt.b32 a7, 1061109504, 1077952576, a2;
(EngineCore_0 pid=142) prmt.b32 a8, 1061109504, 1077952576, a3;
(EngineCore_0 pid=142) prmt.b32 a9, 32768, 0, a1;
(EngineCore_0 pid=142) prmt.b32 a10, 32768, 0, a4;
(EngineCore_0 pid=142) or.b32 a11, a7, a9;
(EngineCore_0 pid=142) or.b32 a12, a8, a10;
(EngineCore_0 pid=142) prmt.b32 %r1142, a5, a11, 20800;
(EngineCore_0 pid=142) prmt.b32 %r1143, a5, a11, 29538;
(EngineCore_0 pid=142) prmt.b32 %r1144, a6, a12, 20800;
(EngineCore_0 pid=142) prmt.b32 %r1145, a6, a12, 29538;
(EngineCore_0 pid=142) }
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) mov.b32 {%rs97, %rs98}, %r1142;
(EngineCore_0 pid=142) mov.b32 {%rs99, %rs100}, %r1143;
(EngineCore_0 pid=142) mov.b32 {%rs101, %rs102}, %r1144;
(EngineCore_0 pid=142) mov.b32 {%rs103, %rs104}, %r1145;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) {
(EngineCore_0 pid=142) .reg .b32 a<14>;
(EngineCore_0 pid=142) and.b32 a0, %r1102, -2004318072;
(EngineCore_0 pid=142) shr.u32 a1, a0, 3;
(EngineCore_0 pid=142) and.b32 a2, %r1102, 2004318071;
(EngineCore_0 pid=142) shr.u32 a3, a2, 16;
(EngineCore_0 pid=142) shr.u32 a4, a0, 19;
(EngineCore_0 pid=142) prmt.b32 a5, -1065353216, -1065336832, a2;
(EngineCore_0 pid=142) prmt.b32 a6, -1065353216, -1065336832, a3;
(EngineCore_0 pid=142) prmt.b32 a7, 1061109504, 1077952576, a2;
(EngineCore_0 pid=142) prmt.b32 a8, 1061109504, 1077952576, a3;
(EngineCore_0 pid=142) prmt.b32 a9, 32768, 0, a1;
(EngineCore_0 pid=142) prmt.b32 a10, 32768, 0, a4;
(EngineCore_0 pid=142) or.b32 a11, a7, a9;
(EngineCore_0 pid=142) or.b32 a12, a8, a10;
(EngineCore_0 pid=142) prmt.b32 %r1147, a5, a11, 20800;
(EngineCore_0 pid=142) prmt.b32 %r1148, a5, a11, 29538;
(EngineCore_0 pid=142) prmt.b32 %r1149, a6, a12, 20800;
(EngineCore_0 pid=142) prmt.b32 %r1150, a6, a12, 29538;
(EngineCore_0 pid=142) }
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) mov.b32 {%rs105, %rs106}, %r1147;
(EngineCore_0 pid=142) mov.b32 {%rs107, %rs108}, %r1148;
(EngineCore_0 pid=142) mov.b32 {%rs109, %rs110}, %r1149;
(EngineCore_0 pid=142) mov.b32 {%rs111, %rs112}, %r1150;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) {
(EngineCore_0 pid=142) .reg .b32 a<14>;
(EngineCore_0 pid=142) and.b32 a0, %r1103, -2004318072;
(EngineCore_0 pid=142) shr.u32 a1, a0, 3;
(EngineCore_0 pid=142) and.b32 a2, %r1103, 2004318071;
(EngineCore_0 pid=142) shr.u32 a3, a2, 16;
(EngineCore_0 pid=142) shr.u32 a4, a0, 19;
(EngineCore_0 pid=142) prmt.b32 a5, -1065353216, -1065336832, a2;
(EngineCore_0 pid=142) prmt.b32 a6, -1065353216, -1065336832, a3;
(EngineCore_0 pid=142) prmt.b32 a7, 1061109504, 1077952576, a2;
(EngineCore_0 pid=142) prmt.b32 a8, 1061109504, 1077952576, a3;
(EngineCore_0 pid=142) prmt.b32 a9, 32768, 0, a1;
(EngineCore_0 pid=142) prmt.b32 a10, 32768, 0, a4;
(EngineCore_0 pid=142) or.b32 a11, a7, a9;
(EngineCore_0 pid=142) or.b32 a12, a8, a10;
(EngineCore_0 pid=142) prmt.b32 %r1152, a5, a11, 20800;
(EngineCore_0 pid=142) prmt.b32 %r1153, a5, a11, 29538;
(EngineCore_0 pid=142) prmt.b32 %r1154, a6, a12, 20800;
(EngineCore_0 pid=142) prmt.b32 %r1155, a6, a12, 29538;
(EngineCore_0 pid=142) }
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) mov.b32 {%rs113, %rs114}, %r1152;
(EngineCore_0 pid=142) mov.b32 {%rs115, %rs116}, %r1153;
(EngineCore_0 pid=142) mov.b32 {%rs117, %rs118}, %r1154;
(EngineCore_0 pid=142) mov.b32 {%rs119, %rs120}, %r1155;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) {
(EngineCore_0 pid=142) .reg .b32 a<14>;
(EngineCore_0 pid=142) and.b32 a0, %r1104, -2004318072;
(EngineCore_0 pid=142) shr.u32 a1, a0, 3;
(EngineCore_0 pid=142) and.b32 a2, %r1104, 2004318071;
(EngineCore_0 pid=142) shr.u32 a3, a2, 16;
(EngineCore_0 pid=142) shr.u32 a4, a0, 19;
(EngineCore_0 pid=142) prmt.b32 a5, -1065353216, -1065336832, a2;
(EngineCore_0 pid=142) prmt.b32 a6, -1065353216, -1065336832, a3;
(EngineCore_0 pid=142) prmt.b32 a7, 1061109504, 1077952576, a2;
(EngineCore_0 pid=142) prmt.b32 a8, 1061109504, 1077952576, a3;
(EngineCore_0 pid=142) prmt.b32 a9, 32768, 0, a1;
(EngineCore_0 pid=142) prmt.b32 a10, 32768, 0, a4;
(EngineCore_0 pid=142) or.b32 a11, a7, a9;
(EngineCore_0 pid=142) or.b32 a12, a8, a10;
(EngineCore_0 pid=142) prmt.b32 %r1157, a5, a11, 20800;
(EngineCore_0 pid=142) prmt.b32 %r1158, a5, a11, 29538;
(EngineCore_0 pid=142) prmt.b32 %r1159, a6, a12, 20800;
(EngineCore_0 pid=142) prmt.b32 %r1160, a6, a12, 29538;
(EngineCore_0 pid=142) }
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) mov.b32 {%rs121, %rs122}, %r1157;
(EngineCore_0 pid=142) mov.b32 {%rs123, %rs124}, %r1158;
(EngineCore_0 pid=142) mov.b32 {%rs125, %rs126}, %r1159;
(EngineCore_0 pid=142) mov.b32 {%rs127, %rs128}, %r1160;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) {
(EngineCore_0 pid=142) .reg .b32 a<14>;
(EngineCore_0 pid=142) and.b32 a0, %r1121, -2004318072;
(EngineCore_0 pid=142) shr.u32 a1, a0, 3;
(EngineCore_0 pid=142) and.b32 a2, %r1121, 2004318071;
(EngineCore_0 pid=142) shr.u32 a3, a2, 16;
(EngineCore_0 pid=142) shr.u32 a4, a0, 19;
(EngineCore_0 pid=142) prmt.b32 a5, -1065353216, -1065336832, a2;
(EngineCore_0 pid=142) prmt.b32 a6, -1065353216, -1065336832, a3;
(EngineCore_0 pid=142) prmt.b32 a7, 1061109504, 1077952576, a2;
(EngineCore_0 pid=142) prmt.b32 a8, 1061109504, 1077952576, a3;
(EngineCore_0 pid=142) prmt.b32 a9, 32768, 0, a1;
(EngineCore_0 pid=142) prmt.b32 a10, 32768, 0, a4;
(EngineCore_0 pid=142) or.b32 a11, a7, a9;
(EngineCore_0 pid=142) or.b32 a12, a8, a10;
(EngineCore_0 pid=142) prmt.b32 %r1162, a5, a11, 20800;
(EngineCore_0 pid=142) prmt.b32 %r1163, a5, a11, 29538;
(EngineCore_0 pid=142) prmt.b32 %r1164, a6, a12, 20800;
(EngineCore_0 pid=142) prmt.b32 %r1165, a6, a12, 29538;
(EngineCore_0 pid=142) }
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) mov.b32 {%rs129, %rs130}, %r1162;
(EngineCore_0 pid=142) mov.b32 {%rs131, %rs132}, %r1163;
(EngineCore_0 pid=142) mov.b32 {%rs133, %rs134}, %r1164;
(EngineCore_0 pid=142) mov.b32 {%rs135, %rs136}, %r1165;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) {
(EngineCore_0 pid=142) .reg .b32 a<14>;
(EngineCore_0 pid=142) and.b32 a0, %r1122, -2004318072;
(EngineCore_0 pid=142) shr.u32 a1, a0, 3;
(EngineCore_0 pid=142) and.b32 a2, %r1122, 2004318071;
(EngineCore_0 pid=142) shr.u32 a3, a2, 16;
(EngineCore_0 pid=142) shr.u32 a4, a0, 19;
(EngineCore_0 pid=142) prmt.b32 a5, -1065353216, -1065336832, a2;
(EngineCore_0 pid=142) prmt.b32 a6, -1065353216, -1065336832, a3;
(EngineCore_0 pid=142) prmt.b32 a7, 1061109504, 1077952576, a2;
(EngineCore_0 pid=142) prmt.b32 a8, 1061109504, 1077952576, a3;
(EngineCore_0 pid=142) prmt.b32 a9, 32768, 0, a1;
(EngineCore_0 pid=142) prmt.b32 a10, 32768, 0, a4;
(EngineCore_0 pid=142) or.b32 a11, a7, a9;
(EngineCore_0 pid=142) or.b32 a12, a8, a10;
(EngineCore_0 pid=142) prmt.b32 %r1167, a5, a11, 20800;
(EngineCore_0 pid=142) prmt.b32 %r1168, a5, a11, 29538;
(EngineCore_0 pid=142) prmt.b32 %r1169, a6, a12, 20800;
(EngineCore_0 pid=142) prmt.b32 %r1170, a6, a12, 29538;
(EngineCore_0 pid=142) }
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) mov.b32 {%rs137, %rs138}, %r1167;
(EngineCore_0 pid=142) mov.b32 {%rs139, %rs140}, %r1168;
(EngineCore_0 pid=142) mov.b32 {%rs141, %rs142}, %r1169;
(EngineCore_0 pid=142) mov.b32 {%rs143, %rs144}, %r1170;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) {
(EngineCore_0 pid=142) .reg .b32 a<14>;
(EngineCore_0 pid=142) and.b32 a0, %r1123, -2004318072;
(EngineCore_0 pid=142) shr.u32 a1, a0, 3;
(EngineCore_0 pid=142) and.b32 a2, %r1123, 2004318071;
(EngineCore_0 pid=142) shr.u32 a3, a2, 16;
(EngineCore_0 pid=142) shr.u32 a4, a0, 19;
(EngineCore_0 pid=142) prmt.b32 a5, -1065353216, -1065336832, a2;
(EngineCore_0 pid=142) prmt.b32 a6, -1065353216, -1065336832, a3;
(EngineCore_0 pid=142) prmt.b32 a7, 1061109504, 1077952576, a2;
(EngineCore_0 pid=142) prmt.b32 a8, 1061109504, 1077952576, a3;
(EngineCore_0 pid=142) prmt.b32 a9, 32768, 0, a1;
(EngineCore_0 pid=142) prmt.b32 a10, 32768, 0, a4;
(EngineCore_0 pid=142) or.b32 a11, a7, a9;
(EngineCore_0 pid=142) or.b32 a12, a8, a10;
(EngineCore_0 pid=142) prmt.b32 %r1172, a5, a11, 20800;
(EngineCore_0 pid=142) prmt.b32 %r1173, a5, a11, 29538;
(EngineCore_0 pid=142) prmt.b32 %r1174, a6, a12, 20800;
(EngineCore_0 pid=142) prmt.b32 %r1175, a6, a12, 29538;
(EngineCore_0 pid=142) }
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) mov.b32 {%rs145, %rs146}, %r1172;
(EngineCore_0 pid=142) mov.b32 {%rs147, %rs148}, %r1173;
(EngineCore_0 pid=142) mov.b32 {%rs149, %rs150}, %r1174;
(EngineCore_0 pid=142) mov.b32 {%rs151, %rs152}, %r1175;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) {
(EngineCore_0 pid=142) .reg .b32 a<14>;
(EngineCore_0 pid=142) and.b32 a0, %r1124, -2004318072;
(EngineCore_0 pid=142) shr.u32 a1, a0, 3;
(EngineCore_0 pid=142) and.b32 a2, %r1124, 2004318071;
(EngineCore_0 pid=142) shr.u32 a3, a2, 16;
(EngineCore_0 pid=142) shr.u32 a4, a0, 19;
(EngineCore_0 pid=142) prmt.b32 a5, -1065353216, -1065336832, a2;
(EngineCore_0 pid=142) prmt.b32 a6, -1065353216, -1065336832, a3;
(EngineCore_0 pid=142) prmt.b32 a7, 1061109504, 1077952576, a2;
(EngineCore_0 pid=142) prmt.b32 a8, 1061109504, 1077952576, a3;
(EngineCore_0 pid=142) prmt.b32 a9, 32768, 0, a1;
(EngineCore_0 pid=142) prmt.b32 a10, 32768, 0, a4;
(EngineCore_0 pid=142) or.b32 a11, a7, a9;
(EngineCore_0 pid=142) or.b32 a12, a8, a10;
(EngineCore_0 pid=142) prmt.b32 %r1177, a5, a11, 20800;
(EngineCore_0 pid=142) prmt.b32 %r1178, a5, a11, 29538;
(EngineCore_0 pid=142) prmt.b32 %r1179, a6, a12, 20800;
(EngineCore_0 pid=142) prmt.b32 %r1180, a6, a12, 29538;
(EngineCore_0 pid=142) }
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) mov.b32 {%rs153, %rs154}, %r1177;
(EngineCore_0 pid=142) mov.b32 {%rs155, %rs156}, %r1178;
(EngineCore_0 pid=142) mov.b32 {%rs157, %rs158}, %r1179;
(EngineCore_0 pid=142) mov.b32 {%rs159, %rs160}, %r1180;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) {
(EngineCore_0 pid=142) .reg .b32 a<14>;
(EngineCore_0 pid=142) and.b32 a0, %r1106, -2004318072;
(EngineCore_0 pid=142) shr.u32 a1, a0, 3;
(EngineCore_0 pid=142) and.b32 a2, %r1106, 2004318071;
(EngineCore_0 pid=142) shr.u32 a3, a2, 16;
(EngineCore_0 pid=142) shr.u32 a4, a0, 19;
(EngineCore_0 pid=142) prmt.b32 a5, -1065353216, -1065336832, a2;
(EngineCore_0 pid=142) prmt.b32 a6, -1065353216, -1065336832, a3;
(EngineCore_0 pid=142) prmt.b32 a7, 1061109504, 1077952576, a2;
(EngineCore_0 pid=142) prmt.b32 a8, 1061109504, 1077952576, a3;
(EngineCore_0 pid=142) prmt.b32 a9, 32768, 0, a1;
(EngineCore_0 pid=142) prmt.b32 a10, 32768, 0, a4;
(EngineCore_0 pid=142) or.b32 a11, a7, a9;
(EngineCore_0 pid=142) or.b32 a12, a8, a10;
(EngineCore_0 pid=142) prmt.b32 %r1182, a5, a11, 20800;
(EngineCore_0 pid=142) prmt.b32 %r1183, a5, a11, 29538;
(EngineCore_0 pid=142) prmt.b32 %r1184, a6, a12, 20800;
(EngineCore_0 pid=142) prmt.b32 %r1185, a6, a12, 29538;
(EngineCore_0 pid=142) }
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) mov.b32 {%rs161, %rs162}, %r1182;
(EngineCore_0 pid=142) mov.b32 {%rs163, %rs164}, %r1183;
(EngineCore_0 pid=142) mov.b32 {%rs165, %rs166}, %r1184;
(EngineCore_0 pid=142) mov.b32 {%rs167, %rs168}, %r1185;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) {
(EngineCore_0 pid=142) .reg .b32 a<14>;
(EngineCore_0 pid=142) and.b32 a0, %r1107, -2004318072;
(EngineCore_0 pid=142) shr.u32 a1, a0, 3;
(EngineCore_0 pid=142) and.b32 a2, %r1107, 2004318071;
(EngineCore_0 pid=142) shr.u32 a3, a2, 16;
(EngineCore_0 pid=142) shr.u32 a4, a0, 19;
(EngineCore_0 pid=142) prmt.b32 a5, -1065353216, -1065336832, a2;
(EngineCore_0 pid=142) prmt.b32 a6, -1065353216, -1065336832, a3;
(EngineCore_0 pid=142) prmt.b32 a7, 1061109504, 1077952576, a2;
(EngineCore_0 pid=142) prmt.b32 a8, 1061109504, 1077952576, a3;
(EngineCore_0 pid=142) prmt.b32 a9, 32768, 0, a1;
(EngineCore_0 pid=142) prmt.b32 a10, 32768, 0, a4;
(EngineCore_0 pid=142) or.b32 a11, a7, a9;
(EngineCore_0 pid=142) or.b32 a12, a8, a10;
(EngineCore_0 pid=142) prmt.b32 %r1187, a5, a11, 20800;
(EngineCore_0 pid=142) prmt.b32 %r1188, a5, a11, 29538;
(EngineCore_0 pid=142) prmt.b32 %r1189, a6, a12, 20800;
(EngineCore_0 pid=142) prmt.b32 %r1190, a6, a12, 29538;
(EngineCore_0 pid=142) }
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) mov.b32 {%rs169, %rs170}, %r1187;
(EngineCore_0 pid=142) mov.b32 {%rs171, %rs172}, %r1188;
(EngineCore_0 pid=142) mov.b32 {%rs173, %rs174}, %r1189;
(EngineCore_0 pid=142) mov.b32 {%rs175, %rs176}, %r1190;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) {
(EngineCore_0 pid=142) .reg .b32 a<14>;
(EngineCore_0 pid=142) and.b32 a0, %r1108, -2004318072;
(EngineCore_0 pid=142) shr.u32 a1, a0, 3;
(EngineCore_0 pid=142) and.b32 a2, %r1108, 2004318071;
(EngineCore_0 pid=142) shr.u32 a3, a2, 16;
(EngineCore_0 pid=142) shr.u32 a4, a0, 19;
(EngineCore_0 pid=142) prmt.b32 a5, -1065353216, -1065336832, a2;
(EngineCore_0 pid=142) prmt.b32 a6, -1065353216, -1065336832, a3;
(EngineCore_0 pid=142) prmt.b32 a7, 1061109504, 1077952576, a2;
(EngineCore_0 pid=142) prmt.b32 a8, 1061109504, 1077952576, a3;
(EngineCore_0 pid=142) prmt.b32 a9, 32768, 0, a1;
(EngineCore_0 pid=142) prmt.b32 a10, 32768, 0, a4;
(EngineCore_0 pid=142) or.b32 a11, a7, a9;
(EngineCore_0 pid=142) or.b32 a12, a8, a10;
(EngineCore_0 pid=142) prmt.b32 %r1192, a5, a11, 20800;
(EngineCore_0 pid=142) prmt.b32 %r1193, a5, a11, 29538;
(EngineCore_0 pid=142) prmt.b32 %r1194, a6, a12, 20800;
(EngineCore_0 pid=142) prmt.b32 %r1195, a6, a12, 29538;
(EngineCore_0 pid=142) }
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) mov.b32 {%rs177, %rs178}, %r1192;
(EngineCore_0 pid=142) mov.b32 {%rs179, %rs180}, %r1193;
(EngineCore_0 pid=142) mov.b32 {%rs181, %rs182}, %r1194;
(EngineCore_0 pid=142) mov.b32 {%rs183, %rs184}, %r1195;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) {
(EngineCore_0 pid=142) .reg .b32 a<14>;
(EngineCore_0 pid=142) and.b32 a0, %r1109, -2004318072;
(EngineCore_0 pid=142) shr.u32 a1, a0, 3;
(EngineCore_0 pid=142) and.b32 a2, %r1109, 2004318071;
(EngineCore_0 pid=142) shr.u32 a3, a2, 16;
(EngineCore_0 pid=142) shr.u32 a4, a0, 19;
(EngineCore_0 pid=142) prmt.b32 a5, -1065353216, -1065336832, a2;
(EngineCore_0 pid=142) prmt.b32 a6, -1065353216, -1065336832, a3;
(EngineCore_0 pid=142) prmt.b32 a7, 1061109504, 1077952576, a2;
(EngineCore_0 pid=142) prmt.b32 a8, 1061109504, 1077952576, a3;
(EngineCore_0 pid=142) prmt.b32 a9, 32768, 0, a1;
(EngineCore_0 pid=142) prmt.b32 a10, 32768, 0, a4;
(EngineCore_0 pid=142) or.b32 a11, a7, a9;
(EngineCore_0 pid=142) or.b32 a12, a8, a10;
(EngineCore_0 pid=142) prmt.b32 %r1197, a5, a11, 20800;
(EngineCore_0 pid=142) prmt.b32 %r1198, a5, a11, 29538;
(EngineCore_0 pid=142) prmt.b32 %r1199, a6, a12, 20800;
(EngineCore_0 pid=142) prmt.b32 %r1200, a6, a12, 29538;
(EngineCore_0 pid=142) }
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) mov.b32 {%rs185, %rs186}, %r1197;
(EngineCore_0 pid=142) mov.b32 {%rs187, %rs188}, %r1198;
(EngineCore_0 pid=142) mov.b32 {%rs189, %rs190}, %r1199;
(EngineCore_0 pid=142) mov.b32 {%rs191, %rs192}, %r1200;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) {
(EngineCore_0 pid=142) .reg .b32 a<14>;
(EngineCore_0 pid=142) and.b32 a0, %r1126, -2004318072;
(EngineCore_0 pid=142) shr.u32 a1, a0, 3;
(EngineCore_0 pid=142) and.b32 a2, %r1126, 2004318071;
(EngineCore_0 pid=142) shr.u32 a3, a2, 16;
(EngineCore_0 pid=142) shr.u32 a4, a0, 19;
(EngineCore_0 pid=142) prmt.b32 a5, -1065353216, -1065336832, a2;
(EngineCore_0 pid=142) prmt.b32 a6, -1065353216, -1065336832, a3;
(EngineCore_0 pid=142) prmt.b32 a7, 1061109504, 1077952576, a2;
(EngineCore_0 pid=142) prmt.b32 a8, 1061109504, 1077952576, a3;
(EngineCore_0 pid=142) prmt.b32 a9, 32768, 0, a1;
(EngineCore_0 pid=142) prmt.b32 a10, 32768, 0, a4;
(EngineCore_0 pid=142) or.b32 a11, a7, a9;
(EngineCore_0 pid=142) or.b32 a12, a8, a10;
(EngineCore_0 pid=142) prmt.b32 %r1202, a5, a11, 20800;
(EngineCore_0 pid=142) prmt.b32 %r1203, a5, a11, 29538;
(EngineCore_0 pid=142) prmt.b32 %r1204, a6, a12, 20800;
(EngineCore_0 pid=142) prmt.b32 %r1205, a6, a12, 29538;
(EngineCore_0 pid=142) }
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) mov.b32 {%rs193, %rs194}, %r1202;
(EngineCore_0 pid=142) mov.b32 {%rs195, %rs196}, %r1203;
(EngineCore_0 pid=142) mov.b32 {%rs197, %rs198}, %r1204;
(EngineCore_0 pid=142) mov.b32 {%rs199, %rs200}, %r1205;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) {
(EngineCore_0 pid=142) .reg .b32 a<14>;
(EngineCore_0 pid=142) and.b32 a0, %r1127, -2004318072;
(EngineCore_0 pid=142) shr.u32 a1, a0, 3;
(EngineCore_0 pid=142) and.b32 a2, %r1127, 2004318071;
(EngineCore_0 pid=142) shr.u32 a3, a2, 16;
(EngineCore_0 pid=142) shr.u32 a4, a0, 19;
(EngineCore_0 pid=142) prmt.b32 a5, -1065353216, -1065336832, a2;
(EngineCore_0 pid=142) prmt.b32 a6, -1065353216, -1065336832, a3;
(EngineCore_0 pid=142) prmt.b32 a7, 1061109504, 1077952576, a2;
(EngineCore_0 pid=142) prmt.b32 a8, 1061109504, 1077952576, a3;
(EngineCore_0 pid=142) prmt.b32 a9, 32768, 0, a1;
(EngineCore_0 pid=142) prmt.b32 a10, 32768, 0, a4;
(EngineCore_0 pid=142) or.b32 a11, a7, a9;
(EngineCore_0 pid=142) or.b32 a12, a8, a10;
(EngineCore_0 pid=142) prmt.b32 %r1207, a5, a11, 20800;
(EngineCore_0 pid=142) prmt.b32 %r1208, a5, a11, 29538;
(EngineCore_0 pid=142) prmt.b32 %r1209, a6, a12, 20800;
(EngineCore_0 pid=142) prmt.b32 %r1210, a6, a12, 29538;
(EngineCore_0 pid=142) }
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) mov.b32 {%rs201, %rs202}, %r1207;
(EngineCore_0 pid=142) mov.b32 {%rs203, %rs204}, %r1208;
(EngineCore_0 pid=142) mov.b32 {%rs205, %rs206}, %r1209;
(EngineCore_0 pid=142) mov.b32 {%rs207, %rs208}, %r1210;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) {
(EngineCore_0 pid=142) .reg .b32 a<14>;
(EngineCore_0 pid=142) and.b32 a0, %r1128, -2004318072;
(EngineCore_0 pid=142) shr.u32 a1, a0, 3;
(EngineCore_0 pid=142) and.b32 a2, %r1128, 2004318071;
(EngineCore_0 pid=142) shr.u32 a3, a2, 16;
(EngineCore_0 pid=142) shr.u32 a4, a0, 19;
(EngineCore_0 pid=142) prmt.b32 a5, -1065353216, -1065336832, a2;
(EngineCore_0 pid=142) prmt.b32 a6, -1065353216, -1065336832, a3;
(EngineCore_0 pid=142) prmt.b32 a7, 1061109504, 1077952576, a2;
(EngineCore_0 pid=142) prmt.b32 a8, 1061109504, 1077952576, a3;
(EngineCore_0 pid=142) prmt.b32 a9, 32768, 0, a1;
(EngineCore_0 pid=142) prmt.b32 a10, 32768, 0, a4;
(EngineCore_0 pid=142) or.b32 a11, a7, a9;
(EngineCore_0 pid=142) or.b32 a12, a8, a10;
(EngineCore_0 pid=142) prmt.b32 %r1212, a5, a11, 20800;
(EngineCore_0 pid=142) prmt.b32 %r1213, a5, a11, 29538;
(EngineCore_0 pid=142) prmt.b32 %r1214, a6, a12, 20800;
(EngineCore_0 pid=142) prmt.b32 %r1215, a6, a12, 29538;
(EngineCore_0 pid=142) }
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) mov.b32 {%rs209, %rs210}, %r1212;
(EngineCore_0 pid=142) mov.b32 {%rs211, %rs212}, %r1213;
(EngineCore_0 pid=142) mov.b32 {%rs213, %rs214}, %r1214;
(EngineCore_0 pid=142) mov.b32 {%rs215, %rs216}, %r1215;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) {
(EngineCore_0 pid=142) .reg .b32 a<14>;
(EngineCore_0 pid=142) and.b32 a0, %r1129, -2004318072;
(EngineCore_0 pid=142) shr.u32 a1, a0, 3;
(EngineCore_0 pid=142) and.b32 a2, %r1129, 2004318071;
(EngineCore_0 pid=142) shr.u32 a3, a2, 16;
(EngineCore_0 pid=142) shr.u32 a4, a0, 19;
(EngineCore_0 pid=142) prmt.b32 a5, -1065353216, -1065336832, a2;
(EngineCore_0 pid=142) prmt.b32 a6, -1065353216, -1065336832, a3;
(EngineCore_0 pid=142) prmt.b32 a7, 1061109504, 1077952576, a2;
(EngineCore_0 pid=142) prmt.b32 a8, 1061109504, 1077952576, a3;
(EngineCore_0 pid=142) prmt.b32 a9, 32768, 0, a1;
(EngineCore_0 pid=142) prmt.b32 a10, 32768, 0, a4;
(EngineCore_0 pid=142) or.b32 a11, a7, a9;
(EngineCore_0 pid=142) or.b32 a12, a8, a10;
(EngineCore_0 pid=142) prmt.b32 %r1217, a5, a11, 20800;
(EngineCore_0 pid=142) prmt.b32 %r1218, a5, a11, 29538;
(EngineCore_0 pid=142) prmt.b32 %r1219, a6, a12, 20800;
(EngineCore_0 pid=142) prmt.b32 %r1220, a6, a12, 29538;
(EngineCore_0 pid=142) }
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) mov.b32 {%rs217, %rs218}, %r1217;
(EngineCore_0 pid=142) mov.b32 {%rs219, %rs220}, %r1218;
(EngineCore_0 pid=142) mov.b32 {%rs221, %rs222}, %r1219;
(EngineCore_0 pid=142) mov.b32 {%rs223, %rs224}, %r1220;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) {
(EngineCore_0 pid=142) .reg .b32 a<14>;
(EngineCore_0 pid=142) and.b32 a0, %r1111, -2004318072;
(EngineCore_0 pid=142) shr.u32 a1, a0, 3;
(EngineCore_0 pid=142) and.b32 a2, %r1111, 2004318071;
(EngineCore_0 pid=142) shr.u32 a3, a2, 16;
(EngineCore_0 pid=142) shr.u32 a4, a0, 19;
(EngineCore_0 pid=142) prmt.b32 a5, -1065353216, -1065336832, a2;
(EngineCore_0 pid=142) prmt.b32 a6, -1065353216, -1065336832, a3;
(EngineCore_0 pid=142) prmt.b32 a7, 1061109504, 1077952576, a2;
(EngineCore_0 pid=142) prmt.b32 a8, 1061109504, 1077952576, a3;
(EngineCore_0 pid=142) prmt.b32 a9, 32768, 0, a1;
(EngineCore_0 pid=142) prmt.b32 a10, 32768, 0, a4;
(EngineCore_0 pid=142) or.b32 a11, a7, a9;
(EngineCore_0 pid=142) or.b32 a12, a8, a10;
(EngineCore_0 pid=142) prmt.b32 %r1222, a5, a11, 20800;
(EngineCore_0 pid=142) prmt.b32 %r1223, a5, a11, 29538;
(EngineCore_0 pid=142) prmt.b32 %r1224, a6, a12, 20800;
(EngineCore_0 pid=142) prmt.b32 %r1225, a6, a12, 29538;
(EngineCore_0 pid=142) }
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) mov.b32 {%rs225, %rs226}, %r1222;
(EngineCore_0 pid=142) mov.b32 {%rs227, %rs228}, %r1223;
(EngineCore_0 pid=142) mov.b32 {%rs229, %rs230}, %r1224;
(EngineCore_0 pid=142) mov.b32 {%rs231, %rs232}, %r1225;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) {
(EngineCore_0 pid=142) .reg .b32 a<14>;
(EngineCore_0 pid=142) and.b32 a0, %r1112, -2004318072;
(EngineCore_0 pid=142) shr.u32 a1, a0, 3;
(EngineCore_0 pid=142) and.b32 a2, %r1112, 2004318071;
(EngineCore_0 pid=142) shr.u32 a3, a2, 16;
(EngineCore_0 pid=142) shr.u32 a4, a0, 19;
(EngineCore_0 pid=142) prmt.b32 a5, -1065353216, -1065336832, a2;
(EngineCore_0 pid=142) prmt.b32 a6, -1065353216, -1065336832, a3;
(EngineCore_0 pid=142) prmt.b32 a7, 1061109504, 1077952576, a2;
(EngineCore_0 pid=142) prmt.b32 a8, 1061109504, 1077952576, a3;
(EngineCore_0 pid=142) prmt.b32 a9, 32768, 0, a1;
(EngineCore_0 pid=142) prmt.b32 a10, 32768, 0, a4;
(EngineCore_0 pid=142) or.b32 a11, a7, a9;
(EngineCore_0 pid=142) or.b32 a12, a8, a10;
(EngineCore_0 pid=142) prmt.b32 %r1227, a5, a11, 20800;
(EngineCore_0 pid=142) prmt.b32 %r1228, a5, a11, 29538;
(EngineCore_0 pid=142) prmt.b32 %r1229, a6, a12, 20800;
(EngineCore_0 pid=142) prmt.b32 %r1230, a6, a12, 29538;
(EngineCore_0 pid=142) }
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) mov.b32 {%rs233, %rs234}, %r1227;
(EngineCore_0 pid=142) mov.b32 {%rs235, %rs236}, %r1228;
(EngineCore_0 pid=142) mov.b32 {%rs237, %rs238}, %r1229;
(EngineCore_0 pid=142) mov.b32 {%rs239, %rs240}, %r1230;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) {
(EngineCore_0 pid=142) .reg .b32 a<14>;
(EngineCore_0 pid=142) and.b32 a0, %r1113, -2004318072;
(EngineCore_0 pid=142) shr.u32 a1, a0, 3;
(EngineCore_0 pid=142) and.b32 a2, %r1113, 2004318071;
(EngineCore_0 pid=142) shr.u32 a3, a2, 16;
(EngineCore_0 pid=142) shr.u32 a4, a0, 19;
(EngineCore_0 pid=142) prmt.b32 a5, -1065353216, -1065336832, a2;
(EngineCore_0 pid=142) prmt.b32 a6, -1065353216, -1065336832, a3;
(EngineCore_0 pid=142) prmt.b32 a7, 1061109504, 1077952576, a2;
(EngineCore_0 pid=142) prmt.b32 a8, 1061109504, 1077952576, a3;
(EngineCore_0 pid=142) prmt.b32 a9, 32768, 0, a1;
(EngineCore_0 pid=142) prmt.b32 a10, 32768, 0, a4;
(EngineCore_0 pid=142) or.b32 a11, a7, a9;
(EngineCore_0 pid=142) or.b32 a12, a8, a10;
(EngineCore_0 pid=142) prmt.b32 %r1232, a5, a11, 20800;
(EngineCore_0 pid=142) prmt.b32 %r1233, a5, a11, 29538;
(EngineCore_0 pid=142) prmt.b32 %r1234, a6, a12, 20800;
(EngineCore_0 pid=142) prmt.b32 %r1235, a6, a12, 29538;
(EngineCore_0 pid=142) }
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) mov.b32 {%rs241, %rs242}, %r1232;
(EngineCore_0 pid=142) mov.b32 {%rs243, %rs244}, %r1233;
(EngineCore_0 pid=142) mov.b32 {%rs245, %rs246}, %r1234;
(EngineCore_0 pid=142) mov.b32 {%rs247, %rs248}, %r1235;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) {
(EngineCore_0 pid=142) .reg .b32 a<14>;
(EngineCore_0 pid=142) and.b32 a0, %r1114, -2004318072;
(EngineCore_0 pid=142) shr.u32 a1, a0, 3;
(EngineCore_0 pid=142) and.b32 a2, %r1114, 2004318071;
(EngineCore_0 pid=142) shr.u32 a3, a2, 16;
(EngineCore_0 pid=142) shr.u32 a4, a0, 19;
(EngineCore_0 pid=142) prmt.b32 a5, -1065353216, -1065336832, a2;
(EngineCore_0 pid=142) prmt.b32 a6, -1065353216, -1065336832, a3;
(EngineCore_0 pid=142) prmt.b32 a7, 1061109504, 1077952576, a2;
(EngineCore_0 pid=142) prmt.b32 a8, 1061109504, 1077952576, a3;
(EngineCore_0 pid=142) prmt.b32 a9, 32768, 0, a1;
(EngineCore_0 pid=142) prmt.b32 a10, 32768, 0, a4;
(EngineCore_0 pid=142) or.b32 a11, a7, a9;
(EngineCore_0 pid=142) or.b32 a12, a8, a10;
(EngineCore_0 pid=142) prmt.b32 %r1237, a5, a11, 20800;
(EngineCore_0 pid=142) prmt.b32 %r1238, a5, a11, 29538;
(EngineCore_0 pid=142) prmt.b32 %r1239, a6, a12, 20800;
(EngineCore_0 pid=142) prmt.b32 %r1240, a6, a12, 29538;
(EngineCore_0 pid=142) }
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) mov.b32 {%rs249, %rs250}, %r1237;
(EngineCore_0 pid=142) mov.b32 {%rs251, %rs252}, %r1238;
(EngineCore_0 pid=142) mov.b32 {%rs253, %rs254}, %r1239;
(EngineCore_0 pid=142) mov.b32 {%rs255, %rs256}, %r1240;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) {
(EngineCore_0 pid=142) .reg .b32 a<14>;
(EngineCore_0 pid=142) and.b32 a0, %r1131, -2004318072;
(EngineCore_0 pid=142) shr.u32 a1, a0, 3;
(EngineCore_0 pid=142) and.b32 a2, %r1131, 2004318071;
(EngineCore_0 pid=142) shr.u32 a3, a2, 16;
(EngineCore_0 pid=142) shr.u32 a4, a0, 19;
(EngineCore_0 pid=142) prmt.b32 a5, -1065353216, -1065336832, a2;
(EngineCore_0 pid=142) prmt.b32 a6, -1065353216, -1065336832, a3;
(EngineCore_0 pid=142) prmt.b32 a7, 1061109504, 1077952576, a2;
(EngineCore_0 pid=142) prmt.b32 a8, 1061109504, 1077952576, a3;
(EngineCore_0 pid=142) prmt.b32 a9, 32768, 0, a1;
(EngineCore_0 pid=142) prmt.b32 a10, 32768, 0, a4;
(EngineCore_0 pid=142) or.b32 a11, a7, a9;
(EngineCore_0 pid=142) or.b32 a12, a8, a10;
(EngineCore_0 pid=142) prmt.b32 %r1242, a5, a11, 20800;
(EngineCore_0 pid=142) prmt.b32 %r1243, a5, a11, 29538;
(EngineCore_0 pid=142) prmt.b32 %r1244, a6, a12, 20800;
(EngineCore_0 pid=142) prmt.b32 %r1245, a6, a12, 29538;
(EngineCore_0 pid=142) }
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) mov.b32 {%rs257, %rs258}, %r1242;
(EngineCore_0 pid=142) mov.b32 {%rs259, %rs260}, %r1243;
(EngineCore_0 pid=142) mov.b32 {%rs261, %rs262}, %r1244;
(EngineCore_0 pid=142) mov.b32 {%rs263, %rs264}, %r1245;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) {
(EngineCore_0 pid=142) .reg .b32 a<14>;
(EngineCore_0 pid=142) and.b32 a0, %r1132, -2004318072;
(EngineCore_0 pid=142) shr.u32 a1, a0, 3;
(EngineCore_0 pid=142) and.b32 a2, %r1132, 2004318071;
(EngineCore_0 pid=142) shr.u32 a3, a2, 16;
(EngineCore_0 pid=142) shr.u32 a4, a0, 19;
(EngineCore_0 pid=142) prmt.b32 a5, -1065353216, -1065336832, a2;
(EngineCore_0 pid=142) prmt.b32 a6, -1065353216, -1065336832, a3;
(EngineCore_0 pid=142) prmt.b32 a7, 1061109504, 1077952576, a2;
(EngineCore_0 pid=142) prmt.b32 a8, 1061109504, 1077952576, a3;
(EngineCore_0 pid=142) prmt.b32 a9, 32768, 0, a1;
(EngineCore_0 pid=142) prmt.b32 a10, 32768, 0, a4;
(EngineCore_0 pid=142) or.b32 a11, a7, a9;
(EngineCore_0 pid=142) or.b32 a12, a8, a10;
(EngineCore_0 pid=142) prmt.b32 %r1247, a5, a11, 20800;
(EngineCore_0 pid=142) prmt.b32 %r1248, a5, a11, 29538;
(EngineCore_0 pid=142) prmt.b32 %r1249, a6, a12, 20800;
(EngineCore_0 pid=142) prmt.b32 %r1250, a6, a12, 29538;
(EngineCore_0 pid=142) }
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) mov.b32 {%rs265, %rs266}, %r1247;
(EngineCore_0 pid=142) mov.b32 {%rs267, %rs268}, %r1248;
(EngineCore_0 pid=142) mov.b32 {%rs269, %rs270}, %r1249;
(EngineCore_0 pid=142) mov.b32 {%rs271, %rs272}, %r1250;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) {
(EngineCore_0 pid=142) .reg .b32 a<14>;
(EngineCore_0 pid=142) and.b32 a0, %r1133, -2004318072;
(EngineCore_0 pid=142) shr.u32 a1, a0, 3;
(EngineCore_0 pid=142) and.b32 a2, %r1133, 2004318071;
(EngineCore_0 pid=142) shr.u32 a3, a2, 16;
(EngineCore_0 pid=142) shr.u32 a4, a0, 19;
(EngineCore_0 pid=142) prmt.b32 a5, -1065353216, -1065336832, a2;
(EngineCore_0 pid=142) prmt.b32 a6, -1065353216, -1065336832, a3;
(EngineCore_0 pid=142) prmt.b32 a7, 1061109504, 1077952576, a2;
(EngineCore_0 pid=142) prmt.b32 a8, 1061109504, 1077952576, a3;
(EngineCore_0 pid=142) prmt.b32 a9, 32768, 0, a1;
(EngineCore_0 pid=142) prmt.b32 a10, 32768, 0, a4;
(EngineCore_0 pid=142) or.b32 a11, a7, a9;
(EngineCore_0 pid=142) or.b32 a12, a8, a10;
(EngineCore_0 pid=142) prmt.b32 %r1252, a5, a11, 20800;
(EngineCore_0 pid=142) prmt.b32 %r1253, a5, a11, 29538;
(EngineCore_0 pid=142) prmt.b32 %r1254, a6, a12, 20800;
(EngineCore_0 pid=142) prmt.b32 %r1255, a6, a12, 29538;
(EngineCore_0 pid=142) }
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) mov.b32 {%rs273, %rs274}, %r1252;
(EngineCore_0 pid=142) mov.b32 {%rs275, %rs276}, %r1253;
(EngineCore_0 pid=142) mov.b32 {%rs277, %rs278}, %r1254;
(EngineCore_0 pid=142) mov.b32 {%rs279, %rs280}, %r1255;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) {
(EngineCore_0 pid=142) .reg .b32 a<14>;
(EngineCore_0 pid=142) and.b32 a0, %r1134, -2004318072;
(EngineCore_0 pid=142) shr.u32 a1, a0, 3;
(EngineCore_0 pid=142) and.b32 a2, %r1134, 2004318071;
(EngineCore_0 pid=142) shr.u32 a3, a2, 16;
(EngineCore_0 pid=142) shr.u32 a4, a0, 19;
(EngineCore_0 pid=142) prmt.b32 a5, -1065353216, -1065336832, a2;
(EngineCore_0 pid=142) prmt.b32 a6, -1065353216, -1065336832, a3;
(EngineCore_0 pid=142) prmt.b32 a7, 1061109504, 1077952576, a2;
(EngineCore_0 pid=142) prmt.b32 a8, 1061109504, 1077952576, a3;
(EngineCore_0 pid=142) prmt.b32 a9, 32768, 0, a1;
(EngineCore_0 pid=142) prmt.b32 a10, 32768, 0, a4;
(EngineCore_0 pid=142) or.b32 a11, a7, a9;
(EngineCore_0 pid=142) or.b32 a12, a8, a10;
(EngineCore_0 pid=142) prmt.b32 %r1257, a5, a11, 20800;
(EngineCore_0 pid=142) prmt.b32 %r1258, a5, a11, 29538;
(EngineCore_0 pid=142) prmt.b32 %r1259, a6, a12, 20800;
(EngineCore_0 pid=142) prmt.b32 %r1260, a6, a12, 29538;
(EngineCore_0 pid=142) }
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) mov.b32 {%rs281, %rs282}, %r1257;
(EngineCore_0 pid=142) mov.b32 {%rs283, %rs284}, %r1258;
(EngineCore_0 pid=142) mov.b32 {%rs285, %rs286}, %r1259;
(EngineCore_0 pid=142) mov.b32 {%rs287, %rs288}, %r1260;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) {
(EngineCore_0 pid=142) .reg .b32 a<14>;
(EngineCore_0 pid=142) and.b32 a0, %r1116, -2004318072;
(EngineCore_0 pid=142) shr.u32 a1, a0, 3;
(EngineCore_0 pid=142) and.b32 a2, %r1116, 2004318071;
(EngineCore_0 pid=142) shr.u32 a3, a2, 16;
(EngineCore_0 pid=142) shr.u32 a4, a0, 19;
(EngineCore_0 pid=142) prmt.b32 a5, -1065353216, -1065336832, a2;
(EngineCore_0 pid=142) prmt.b32 a6, -1065353216, -1065336832, a3;
(EngineCore_0 pid=142) prmt.b32 a7, 1061109504, 1077952576, a2;
(EngineCore_0 pid=142) prmt.b32 a8, 1061109504, 1077952576, a3;
(EngineCore_0 pid=142) prmt.b32 a9, 32768, 0, a1;
(EngineCore_0 pid=142) prmt.b32 a10, 32768, 0, a4;
(EngineCore_0 pid=142) or.b32 a11, a7, a9;
(EngineCore_0 pid=142) or.b32 a12, a8, a10;
(EngineCore_0 pid=142) prmt.b32 %r1262, a5, a11, 20800;
(EngineCore_0 pid=142) prmt.b32 %r1263, a5, a11, 29538;
(EngineCore_0 pid=142) prmt.b32 %r1264, a6, a12, 20800;
(EngineCore_0 pid=142) prmt.b32 %r1265, a6, a12, 29538;
(EngineCore_0 pid=142) }
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) mov.b32 {%rs289, %rs290}, %r1262;
(EngineCore_0 pid=142) mov.b32 {%rs291, %rs292}, %r1263;
(EngineCore_0 pid=142) mov.b32 {%rs293, %rs294}, %r1264;
(EngineCore_0 pid=142) mov.b32 {%rs295, %rs296}, %r1265;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) {
(EngineCore_0 pid=142) .reg .b32 a<14>;
(EngineCore_0 pid=142) and.b32 a0, %r1117, -2004318072;
(EngineCore_0 pid=142) shr.u32 a1, a0, 3;
(EngineCore_0 pid=142) and.b32 a2, %r1117, 2004318071;
(EngineCore_0 pid=142) shr.u32 a3, a2, 16;
(EngineCore_0 pid=142) shr.u32 a4, a0, 19;
(EngineCore_0 pid=142) prmt.b32 a5, -1065353216, -1065336832, a2;
(EngineCore_0 pid=142) prmt.b32 a6, -1065353216, -1065336832, a3;
(EngineCore_0 pid=142) prmt.b32 a7, 1061109504, 1077952576, a2;
(EngineCore_0 pid=142) prmt.b32 a8, 1061109504, 1077952576, a3;
(EngineCore_0 pid=142) prmt.b32 a9, 32768, 0, a1;
(EngineCore_0 pid=142) prmt.b32 a10, 32768, 0, a4;
(EngineCore_0 pid=142) or.b32 a11, a7, a9;
(EngineCore_0 pid=142) or.b32 a12, a8, a10;
(EngineCore_0 pid=142) prmt.b32 %r1267, a5, a11, 20800;
(EngineCore_0 pid=142) prmt.b32 %r1268, a5, a11, 29538;
(EngineCore_0 pid=142) prmt.b32 %r1269, a6, a12, 20800;
(EngineCore_0 pid=142) prmt.b32 %r1270, a6, a12, 29538;
(EngineCore_0 pid=142) }
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) mov.b32 {%rs297, %rs298}, %r1267;
(EngineCore_0 pid=142) mov.b32 {%rs299, %rs300}, %r1268;
(EngineCore_0 pid=142) mov.b32 {%rs301, %rs302}, %r1269;
(EngineCore_0 pid=142) mov.b32 {%rs303, %rs304}, %r1270;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) {
(EngineCore_0 pid=142) .reg .b32 a<14>;
(EngineCore_0 pid=142) and.b32 a0, %r1118, -2004318072;
(EngineCore_0 pid=142) shr.u32 a1, a0, 3;
(EngineCore_0 pid=142) and.b32 a2, %r1118, 2004318071;
(EngineCore_0 pid=142) shr.u32 a3, a2, 16;
(EngineCore_0 pid=142) shr.u32 a4, a0, 19;
(EngineCore_0 pid=142) prmt.b32 a5, -1065353216, -1065336832, a2;
(EngineCore_0 pid=142) prmt.b32 a6, -1065353216, -1065336832, a3;
(EngineCore_0 pid=142) prmt.b32 a7, 1061109504, 1077952576, a2;
(EngineCore_0 pid=142) prmt.b32 a8, 1061109504, 1077952576, a3;
(EngineCore_0 pid=142) prmt.b32 a9, 32768, 0, a1;
(EngineCore_0 pid=142) prmt.b32 a10, 32768, 0, a4;
(EngineCore_0 pid=142) or.b32 a11, a7, a9;
(EngineCore_0 pid=142) or.b32 a12, a8, a10;
(EngineCore_0 pid=142) prmt.b32 %r1272, a5, a11, 20800;
(EngineCore_0 pid=142) prmt.b32 %r1273, a5, a11, 29538;
(EngineCore_0 pid=142) prmt.b32 %r1274, a6, a12, 20800;
(EngineCore_0 pid=142) prmt.b32 %r1275, a6, a12, 29538;
(EngineCore_0 pid=142) }
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) mov.b32 {%rs305, %rs306}, %r1272;
(EngineCore_0 pid=142) mov.b32 {%rs307, %rs308}, %r1273;
(EngineCore_0 pid=142) mov.b32 {%rs309, %rs310}, %r1274;
(EngineCore_0 pid=142) mov.b32 {%rs311, %rs312}, %r1275;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) {
(EngineCore_0 pid=142) .reg .b32 a<14>;
(EngineCore_0 pid=142) and.b32 a0, %r1119, -2004318072;
(EngineCore_0 pid=142) shr.u32 a1, a0, 3;
(EngineCore_0 pid=142) and.b32 a2, %r1119, 2004318071;
(EngineCore_0 pid=142) shr.u32 a3, a2, 16;
(EngineCore_0 pid=142) shr.u32 a4, a0, 19;
(EngineCore_0 pid=142) prmt.b32 a5, -1065353216, -1065336832, a2;
(EngineCore_0 pid=142) prmt.b32 a6, -1065353216, -1065336832, a3;
(EngineCore_0 pid=142) prmt.b32 a7, 1061109504, 1077952576, a2;
(EngineCore_0 pid=142) prmt.b32 a8, 1061109504, 1077952576, a3;
(EngineCore_0 pid=142) prmt.b32 a9, 32768, 0, a1;
(EngineCore_0 pid=142) prmt.b32 a10, 32768, 0, a4;
(EngineCore_0 pid=142) or.b32 a11, a7, a9;
(EngineCore_0 pid=142) or.b32 a12, a8, a10;
(EngineCore_0 pid=142) prmt.b32 %r1277, a5, a11, 20800;
(EngineCore_0 pid=142) prmt.b32 %r1278, a5, a11, 29538;
(EngineCore_0 pid=142) prmt.b32 %r1279, a6, a12, 20800;
(EngineCore_0 pid=142) prmt.b32 %r1280, a6, a12, 29538;
(EngineCore_0 pid=142) }
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) mov.b32 {%rs313, %rs314}, %r1277;
(EngineCore_0 pid=142) mov.b32 {%rs315, %rs316}, %r1278;
(EngineCore_0 pid=142) mov.b32 {%rs317, %rs318}, %r1279;
(EngineCore_0 pid=142) mov.b32 {%rs319, %rs320}, %r1280;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) {
(EngineCore_0 pid=142) .reg .b32 a<14>;
(EngineCore_0 pid=142) and.b32 a0, %r1136, -2004318072;
(EngineCore_0 pid=142) shr.u32 a1, a0, 3;
(EngineCore_0 pid=142) and.b32 a2, %r1136, 2004318071;
(EngineCore_0 pid=142) shr.u32 a3, a2, 16;
(EngineCore_0 pid=142) shr.u32 a4, a0, 19;
(EngineCore_0 pid=142) prmt.b32 a5, -1065353216, -1065336832, a2;
(EngineCore_0 pid=142) prmt.b32 a6, -1065353216, -1065336832, a3;
(EngineCore_0 pid=142) prmt.b32 a7, 1061109504, 1077952576, a2;
(EngineCore_0 pid=142) prmt.b32 a8, 1061109504, 1077952576, a3;
(EngineCore_0 pid=142) prmt.b32 a9, 32768, 0, a1;
(EngineCore_0 pid=142) prmt.b32 a10, 32768, 0, a4;
(EngineCore_0 pid=142) or.b32 a11, a7, a9;
(EngineCore_0 pid=142) or.b32 a12, a8, a10;
(EngineCore_0 pid=142) prmt.b32 %r1282, a5, a11, 20800;
(EngineCore_0 pid=142) prmt.b32 %r1283, a5, a11, 29538;
(EngineCore_0 pid=142) prmt.b32 %r1284, a6, a12, 20800;
(EngineCore_0 pid=142) prmt.b32 %r1285, a6, a12, 29538;
(EngineCore_0 pid=142) }
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) mov.b32 {%rs321, %rs322}, %r1282;
(EngineCore_0 pid=142) mov.b32 {%rs323, %rs324}, %r1283;
(EngineCore_0 pid=142) mov.b32 {%rs325, %rs326}, %r1284;
(EngineCore_0 pid=142) mov.b32 {%rs327, %rs328}, %r1285;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) {
(EngineCore_0 pid=142) .reg .b32 a<14>;
(EngineCore_0 pid=142) and.b32 a0, %r1137, -2004318072;
(EngineCore_0 pid=142) shr.u32 a1, a0, 3;
(EngineCore_0 pid=142) and.b32 a2, %r1137, 2004318071;
(EngineCore_0 pid=142) shr.u32 a3, a2, 16;
(EngineCore_0 pid=142) shr.u32 a4, a0, 19;
(EngineCore_0 pid=142) prmt.b32 a5, -1065353216, -1065336832, a2;
(EngineCore_0 pid=142) prmt.b32 a6, -1065353216, -1065336832, a3;
(EngineCore_0 pid=142) prmt.b32 a7, 1061109504, 1077952576, a2;
(EngineCore_0 pid=142) prmt.b32 a8, 1061109504, 1077952576, a3;
(EngineCore_0 pid=142) prmt.b32 a9, 32768, 0, a1;
(EngineCore_0 pid=142) prmt.b32 a10, 32768, 0, a4;
(EngineCore_0 pid=142) or.b32 a11, a7, a9;
(EngineCore_0 pid=142) or.b32 a12, a8, a10;
(EngineCore_0 pid=142) prmt.b32 %r1287, a5, a11, 20800;
(EngineCore_0 pid=142) prmt.b32 %r1288, a5, a11, 29538;
(EngineCore_0 pid=142) prmt.b32 %r1289, a6, a12, 20800;
(EngineCore_0 pid=142) prmt.b32 %r1290, a6, a12, 29538;
(EngineCore_0 pid=142) }
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) mov.b32 {%rs329, %rs330}, %r1287;
(EngineCore_0 pid=142) mov.b32 {%rs331, %rs332}, %r1288;
(EngineCore_0 pid=142) mov.b32 {%rs333, %rs334}, %r1289;
(EngineCore_0 pid=142) mov.b32 {%rs335, %rs336}, %r1290;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) {
(EngineCore_0 pid=142) .reg .b32 a<14>;
(EngineCore_0 pid=142) and.b32 a0, %r1138, -2004318072;
(EngineCore_0 pid=142) shr.u32 a1, a0, 3;
(EngineCore_0 pid=142) and.b32 a2, %r1138, 2004318071;
(EngineCore_0 pid=142) shr.u32 a3, a2, 16;
(EngineCore_0 pid=142) shr.u32 a4, a0, 19;
(EngineCore_0 pid=142) prmt.b32 a5, -1065353216, -1065336832, a2;
(EngineCore_0 pid=142) prmt.b32 a6, -1065353216, -1065336832, a3;
(EngineCore_0 pid=142) prmt.b32 a7, 1061109504, 1077952576, a2;
(EngineCore_0 pid=142) prmt.b32 a8, 1061109504, 1077952576, a3;
(EngineCore_0 pid=142) prmt.b32 a9, 32768, 0, a1;
(EngineCore_0 pid=142) prmt.b32 a10, 32768, 0, a4;
(EngineCore_0 pid=142) or.b32 a11, a7, a9;
(EngineCore_0 pid=142) or.b32 a12, a8, a10;
(EngineCore_0 pid=142) prmt.b32 %r1292, a5, a11, 20800;
(EngineCore_0 pid=142) prmt.b32 %r1293, a5, a11, 29538;
(EngineCore_0 pid=142) prmt.b32 %r1294, a6, a12, 20800;
(EngineCore_0 pid=142) prmt.b32 %r1295, a6, a12, 29538;
(EngineCore_0 pid=142) }
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) mov.b32 {%rs337, %rs338}, %r1292;
(EngineCore_0 pid=142) mov.b32 {%rs339, %rs340}, %r1293;
(EngineCore_0 pid=142) mov.b32 {%rs341, %rs342}, %r1294;
(EngineCore_0 pid=142) mov.b32 {%rs343, %rs344}, %r1295;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) {
(EngineCore_0 pid=142) .reg .b32 a<14>;
(EngineCore_0 pid=142) and.b32 a0, %r1139, -2004318072;
(EngineCore_0 pid=142) shr.u32 a1, a0, 3;
(EngineCore_0 pid=142) and.b32 a2, %r1139, 2004318071;
(EngineCore_0 pid=142) shr.u32 a3, a2, 16;
(EngineCore_0 pid=142) shr.u32 a4, a0, 19;
(EngineCore_0 pid=142) prmt.b32 a5, -1065353216, -1065336832, a2;
(EngineCore_0 pid=142) prmt.b32 a6, -1065353216, -1065336832, a3;
(EngineCore_0 pid=142) prmt.b32 a7, 1061109504, 1077952576, a2;
(EngineCore_0 pid=142) prmt.b32 a8, 1061109504, 1077952576, a3;
(EngineCore_0 pid=142) prmt.b32 a9, 32768, 0, a1;
(EngineCore_0 pid=142) prmt.b32 a10, 32768, 0, a4;
(EngineCore_0 pid=142) or.b32 a11, a7, a9;
(EngineCore_0 pid=142) or.b32 a12, a8, a10;
(EngineCore_0 pid=142) prmt.b32 %r1297, a5, a11, 20800;
(EngineCore_0 pid=142) prmt.b32 %r1298, a5, a11, 29538;
(EngineCore_0 pid=142) prmt.b32 %r1299, a6, a12, 20800;
(EngineCore_0 pid=142) prmt.b32 %r1300, a6, a12, 29538;
(EngineCore_0 pid=142) }
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) mov.b32 {%rs345, %rs346}, %r1297;
(EngineCore_0 pid=142) mov.b32 {%rs347, %rs348}, %r1298;
(EngineCore_0 pid=142) mov.b32 {%rs349, %rs350}, %r1299;
(EngineCore_0 pid=142) mov.b32 {%rs351, %rs352}, %r1300;
(EngineCore_0 pid=142) $L__tmp25:
(EngineCore_0 pid=142) .loc 6 57 61 // blackwell_scale.py:57:61 @[ specialize.py:206:53 ]
(EngineCore_0 pid=142) and.b16 %rs353, %rs1, 255;
(EngineCore_0 pid=142) and.b16 %rs354, %rs5, 255;
(EngineCore_0 pid=142) and.b16 %rs355, %rs2, 255;
(EngineCore_0 pid=142) and.b16 %rs356, %rs6, 255;
(EngineCore_0 pid=142) and.b16 %rs357, %rs3, 255;
(EngineCore_0 pid=142) and.b16 %rs358, %rs7, 255;
(EngineCore_0 pid=142) and.b16 %rs359, %rs4, 255;
(EngineCore_0 pid=142) and.b16 %rs360, %rs8, 255;
(EngineCore_0 pid=142) and.b16 %rs361, %rs9, 255;
(EngineCore_0 pid=142) and.b16 %rs362, %rs13, 255;
(EngineCore_0 pid=142) and.b16 %rs363, %rs10, 255;
(EngineCore_0 pid=142) and.b16 %rs364, %rs14, 255;
(EngineCore_0 pid=142) and.b16 %rs365, %rs11, 255;
(EngineCore_0 pid=142) and.b16 %rs366, %rs15, 255;
(EngineCore_0 pid=142) and.b16 %rs367, %rs12, 255;
(EngineCore_0 pid=142) and.b16 %rs368, %rs16, 255;
(EngineCore_0 pid=142) and.b16 %rs369, %rs17, 255;
(EngineCore_0 pid=142) and.b16 %rs370, %rs21, 255;
(EngineCore_0 pid=142) and.b16 %rs371, %rs18, 255;
(EngineCore_0 pid=142) and.b16 %rs372, %rs22, 255;
(EngineCore_0 pid=142) and.b16 %rs373, %rs19, 255;
(EngineCore_0 pid=142) and.b16 %rs374, %rs23, 255;
(EngineCore_0 pid=142) and.b16 %rs375, %rs20, 255;
(EngineCore_0 pid=142) and.b16 %rs376, %rs24, 255;
(EngineCore_0 pid=142) and.b16 %rs377, %rs25, 255;
(EngineCore_0 pid=142) and.b16 %rs378, %rs29, 255;
(EngineCore_0 pid=142) and.b16 %rs379, %rs26, 255;
(EngineCore_0 pid=142) and.b16 %rs380, %rs30, 255;
(EngineCore_0 pid=142) and.b16 %rs381, %rs27, 255;
(EngineCore_0 pid=142) and.b16 %rs382, %rs31, 255;
(EngineCore_0 pid=142) and.b16 %rs383, %rs28, 255;
(EngineCore_0 pid=142) and.b16 %rs384, %rs32, 255;
(EngineCore_0 pid=142) shl.b16 %rs385, %rs353, 7;
(EngineCore_0 pid=142) shl.b16 %rs386, %rs354, 7;
(EngineCore_0 pid=142) shl.b16 %rs387, %rs355, 7;
(EngineCore_0 pid=142) shl.b16 %rs388, %rs356, 7;
(EngineCore_0 pid=142) shl.b16 %rs389, %rs357, 7;
(EngineCore_0 pid=142) shl.b16 %rs390, %rs358, 7;
(EngineCore_0 pid=142) shl.b16 %rs391, %rs359, 7;
(EngineCore_0 pid=142) shl.b16 %rs392, %rs360, 7;
(EngineCore_0 pid=142) shl.b16 %rs393, %rs361, 7;
(EngineCore_0 pid=142) shl.b16 %rs394, %rs362, 7;
(EngineCore_0 pid=142) shl.b16 %rs395, %rs363, 7;
(EngineCore_0 pid=142) shl.b16 %rs396, %rs364, 7;
(EngineCore_0 pid=142) shl.b16 %rs397, %rs365, 7;
(EngineCore_0 pid=142) shl.b16 %rs398, %rs366, 7;
(EngineCore_0 pid=142) shl.b16 %rs399, %rs367, 7;
(EngineCore_0 pid=142) shl.b16 %rs400, %rs368, 7;
(EngineCore_0 pid=142) shl.b16 %rs401, %rs369, 7;
(EngineCore_0 pid=142) shl.b16 %rs402, %rs370, 7;
(EngineCore_0 pid=142) shl.b16 %rs403, %rs371, 7;
(EngineCore_0 pid=142) shl.b16 %rs404, %rs372, 7;
(EngineCore_0 pid=142) shl.b16 %rs405, %rs373, 7;
(EngineCore_0 pid=142) shl.b16 %rs406, %rs374, 7;
(EngineCore_0 pid=142) shl.b16 %rs407, %rs375, 7;
(EngineCore_0 pid=142) shl.b16 %rs408, %rs376, 7;
(EngineCore_0 pid=142) shl.b16 %rs409, %rs377, 7;
(EngineCore_0 pid=142) shl.b16 %rs410, %rs378, 7;
(EngineCore_0 pid=142) shl.b16 %rs411, %rs379, 7;
(EngineCore_0 pid=142) shl.b16 %rs412, %rs380, 7;
(EngineCore_0 pid=142) shl.b16 %rs413, %rs381, 7;
(EngineCore_0 pid=142) shl.b16 %rs414, %rs382, 7;
(EngineCore_0 pid=142) shl.b16 %rs415, %rs383, 7;
(EngineCore_0 pid=142) shl.b16 %rs416, %rs384, 7;
(EngineCore_0 pid=142) $L__tmp26:
(EngineCore_0 pid=142) .loc 1 212 76 // specialize.py:212:76
(EngineCore_0 pid=142) mul.bf16 %rs417, %rs97, %rs385;
(EngineCore_0 pid=142) mul.bf16 %rs418, %rs98, %rs385;
(EngineCore_0 pid=142) mul.bf16 %rs419, %rs99, %rs385;
(EngineCore_0 pid=142) mul.bf16 %rs420, %rs100, %rs385;
(EngineCore_0 pid=142) mul.bf16 %rs421, %rs101, %rs385;
(EngineCore_0 pid=142) mul.bf16 %rs422, %rs102, %rs385;
(EngineCore_0 pid=142) mul.bf16 %rs423, %rs103, %rs385;
(EngineCore_0 pid=142) mul.bf16 %rs424, %rs104, %rs385;
(EngineCore_0 pid=142) mul.bf16 %rs425, %rs105, %rs386;
(EngineCore_0 pid=142) mul.bf16 %rs426, %rs106, %rs386;
(EngineCore_0 pid=142) mul.bf16 %rs427, %rs107, %rs386;
(EngineCore_0 pid=142) mul.bf16 %rs428, %rs108, %rs386;
(EngineCore_0 pid=142) mul.bf16 %rs429, %rs109, %rs386;
(EngineCore_0 pid=142) mul.bf16 %rs430, %rs110, %rs386;
(EngineCore_0 pid=142) mul.bf16 %rs431, %rs111, %rs386;
(EngineCore_0 pid=142) mul.bf16 %rs432, %rs112, %rs386;
(EngineCore_0 pid=142) mul.bf16 %rs433, %rs113, %rs387;
(EngineCore_0 pid=142) mul.bf16 %rs434, %rs114, %rs387;
(EngineCore_0 pid=142) mul.bf16 %rs435, %rs115, %rs387;
(EngineCore_0 pid=142) mul.bf16 %rs436, %rs116, %rs387;
(EngineCore_0 pid=142) mul.bf16 %rs437, %rs117, %rs387;
(EngineCore_0 pid=142) mul.bf16 %rs438, %rs118, %rs387;
(EngineCore_0 pid=142) mul.bf16 %rs439, %rs119, %rs387;
(EngineCore_0 pid=142) mul.bf16 %rs440, %rs120, %rs387;
(EngineCore_0 pid=142) mul.bf16 %rs441, %rs121, %rs388;
(EngineCore_0 pid=142) mul.bf16 %rs442, %rs122, %rs388;
(EngineCore_0 pid=142) mul.bf16 %rs443, %rs123, %rs388;
(EngineCore_0 pid=142) mul.bf16 %rs444, %rs124, %rs388;
(EngineCore_0 pid=142) mul.bf16 %rs445, %rs125, %rs388;
(EngineCore_0 pid=142) mul.bf16 %rs446, %rs126, %rs388;
(EngineCore_0 pid=142) mul.bf16 %rs447, %rs127, %rs388;
(EngineCore_0 pid=142) mul.bf16 %rs448, %rs128, %rs388;
(EngineCore_0 pid=142) mul.bf16 %rs449, %rs129, %rs389;
(EngineCore_0 pid=142) mul.bf16 %rs450, %rs130, %rs389;
(EngineCore_0 pid=142) mul.bf16 %rs451, %rs131, %rs389;
(EngineCore_0 pid=142) mul.bf16 %rs452, %rs132, %rs389;
(EngineCore_0 pid=142) mul.bf16 %rs453, %rs133, %rs389;
(EngineCore_0 pid=142) mul.bf16 %rs454, %rs134, %rs389;
(EngineCore_0 pid=142) mul.bf16 %rs455, %rs135, %rs389;
(EngineCore_0 pid=142) mul.bf16 %rs456, %rs136, %rs389;
(EngineCore_0 pid=142) mul.bf16 %rs457, %rs137, %rs390;
(EngineCore_0 pid=142) mul.bf16 %rs458, %rs138, %rs390;
(EngineCore_0 pid=142) mul.bf16 %rs459, %rs139, %rs390;
(EngineCore_0 pid=142) mul.bf16 %rs460, %rs140, %rs390;
(EngineCore_0 pid=142) mul.bf16 %rs461, %rs141, %rs390;
(EngineCore_0 pid=142) mul.bf16 %rs462, %rs142, %rs390;
(EngineCore_0 pid=142) mul.bf16 %rs463, %rs143, %rs390;
(EngineCore_0 pid=142) mul.bf16 %rs464, %rs144, %rs390;
(EngineCore_0 pid=142) mul.bf16 %rs465, %rs145, %rs391;
(EngineCore_0 pid=142) mul.bf16 %rs466, %rs146, %rs391;
(EngineCore_0 pid=142) mul.bf16 %rs467, %rs147, %rs391;
(EngineCore_0 pid=142) mul.bf16 %rs468, %rs148, %rs391;
(EngineCore_0 pid=142) mul.bf16 %rs469, %rs149, %rs391;
(EngineCore_0 pid=142) mul.bf16 %rs470, %rs150, %rs391;
(EngineCore_0 pid=142) mul.bf16 %rs471, %rs151, %rs391;
(EngineCore_0 pid=142) mul.bf16 %rs472, %rs152, %rs391;
(EngineCore_0 pid=142) mul.bf16 %rs473, %rs153, %rs392;
(EngineCore_0 pid=142) mul.bf16 %rs474, %rs154, %rs392;
(EngineCore_0 pid=142) mul.bf16 %rs475, %rs155, %rs392;
(EngineCore_0 pid=142) mul.bf16 %rs476, %rs156, %rs392;
(EngineCore_0 pid=142) mul.bf16 %rs477, %rs157, %rs392;
(EngineCore_0 pid=142) mul.bf16 %rs478, %rs158, %rs392;
(EngineCore_0 pid=142) mul.bf16 %rs479, %rs159, %rs392;
(EngineCore_0 pid=142) mul.bf16 %rs480, %rs160, %rs392;
(EngineCore_0 pid=142) mul.bf16 %rs481, %rs161, %rs393;
(EngineCore_0 pid=142) mul.bf16 %rs482, %rs162, %rs393;
(EngineCore_0 pid=142) mul.bf16 %rs483, %rs163, %rs393;
(EngineCore_0 pid=142) mul.bf16 %rs484, %rs164, %rs393;
(EngineCore_0 pid=142) mul.bf16 %rs485, %rs165, %rs393;
(EngineCore_0 pid=142) mul.bf16 %rs486, %rs166, %rs393;
(EngineCore_0 pid=142) mul.bf16 %rs487, %rs167, %rs393;
(EngineCore_0 pid=142) mul.bf16 %rs488, %rs168, %rs393;
(EngineCore_0 pid=142) mul.bf16 %rs489, %rs169, %rs394;
(EngineCore_0 pid=142) mul.bf16 %rs490, %rs170, %rs394;
(EngineCore_0 pid=142) mul.bf16 %rs491, %rs171, %rs394;
(EngineCore_0 pid=142) mul.bf16 %rs492, %rs172, %rs394;
(EngineCore_0 pid=142) mul.bf16 %rs493, %rs173, %rs394;
(EngineCore_0 pid=142) mul.bf16 %rs494, %rs174, %rs394;
(EngineCore_0 pid=142) mul.bf16 %rs495, %rs175, %rs394;
(EngineCore_0 pid=142) mul.bf16 %rs496, %rs176, %rs394;
(EngineCore_0 pid=142) mul.bf16 %rs497, %rs177, %rs395;
(EngineCore_0 pid=142) mul.bf16 %rs498, %rs178, %rs395;
(EngineCore_0 pid=142) mul.bf16 %rs499, %rs179, %rs395;
(EngineCore_0 pid=142) mul.bf16 %rs500, %rs180, %rs395;
(EngineCore_0 pid=142) mul.bf16 %rs501, %rs181, %rs395;
(EngineCore_0 pid=142) mul.bf16 %rs502, %rs182, %rs395;
(EngineCore_0 pid=142) mul.bf16 %rs503, %rs183, %rs395;
(EngineCore_0 pid=142) mul.bf16 %rs504, %rs184, %rs395;
(EngineCore_0 pid=142) mul.bf16 %rs505, %rs185, %rs396;
(EngineCore_0 pid=142) mul.bf16 %rs506, %rs186, %rs396;
(EngineCore_0 pid=142) mul.bf16 %rs507, %rs187, %rs396;
(EngineCore_0 pid=142) mul.bf16 %rs508, %rs188, %rs396;
(EngineCore_0 pid=142) mul.bf16 %rs509, %rs189, %rs396;
(EngineCore_0 pid=142) mul.bf16 %rs510, %rs190, %rs396;
(EngineCore_0 pid=142) mul.bf16 %rs511, %rs191, %rs396;
(EngineCore_0 pid=142) mul.bf16 %rs512, %rs192, %rs396;
(EngineCore_0 pid=142) mul.bf16 %rs513, %rs193, %rs397;
(EngineCore_0 pid=142) mul.bf16 %rs514, %rs194, %rs397;
(EngineCore_0 pid=142) mul.bf16 %rs515, %rs195, %rs397;
(EngineCore_0 pid=142) mul.bf16 %rs516, %rs196, %rs397;
(EngineCore_0 pid=142) mul.bf16 %rs517, %rs197, %rs397;
(EngineCore_0 pid=142) mul.bf16 %rs518, %rs198, %rs397;
(EngineCore_0 pid=142) mul.bf16 %rs519, %rs199, %rs397;
(EngineCore_0 pid=142) mul.bf16 %rs520, %rs200, %rs397;
(EngineCore_0 pid=142) mul.bf16 %rs521, %rs201, %rs398;
(EngineCore_0 pid=142) mul.bf16 %rs522, %rs202, %rs398;
(EngineCore_0 pid=142) mul.bf16 %rs523, %rs203, %rs398;
(EngineCore_0 pid=142) mul.bf16 %rs524, %rs204, %rs398;
(EngineCore_0 pid=142) mul.bf16 %rs525, %rs205, %rs398;
(EngineCore_0 pid=142) mul.bf16 %rs526, %rs206, %rs398;
(EngineCore_0 pid=142) mul.bf16 %rs527, %rs207, %rs398;
(EngineCore_0 pid=142) mul.bf16 %rs528, %rs208, %rs398;
(EngineCore_0 pid=142) mul.bf16 %rs529, %rs209, %rs399;
(EngineCore_0 pid=142) mul.bf16 %rs530, %rs210, %rs399;
(EngineCore_0 pid=142) mul.bf16 %rs531, %rs211, %rs399;
(EngineCore_0 pid=142) mul.bf16 %rs532, %rs212, %rs399;
(EngineCore_0 pid=142) mul.bf16 %rs533, %rs213, %rs399;
(EngineCore_0 pid=142) mul.bf16 %rs534, %rs214, %rs399;
(EngineCore_0 pid=142) mul.bf16 %rs535, %rs215, %rs399;
(EngineCore_0 pid=142) mul.bf16 %rs536, %rs216, %rs399;
(EngineCore_0 pid=142) mul.bf16 %rs537, %rs217, %rs400;
(EngineCore_0 pid=142) mul.bf16 %rs538, %rs218, %rs400;
(EngineCore_0 pid=142) mul.bf16 %rs539, %rs219, %rs400;
(EngineCore_0 pid=142) mul.bf16 %rs540, %rs220, %rs400;
(EngineCore_0 pid=142) mul.bf16 %rs541, %rs221, %rs400;
(EngineCore_0 pid=142) mul.bf16 %rs542, %rs222, %rs400;
(EngineCore_0 pid=142) mul.bf16 %rs543, %rs223, %rs400;
(EngineCore_0 pid=142) mul.bf16 %rs544, %rs224, %rs400;
(EngineCore_0 pid=142) mul.bf16 %rs545, %rs225, %rs401;
(EngineCore_0 pid=142) mul.bf16 %rs546, %rs226, %rs401;
(EngineCore_0 pid=142) mul.bf16 %rs547, %rs227, %rs401;
(EngineCore_0 pid=142) mul.bf16 %rs548, %rs228, %rs401;
(EngineCore_0 pid=142) mul.bf16 %rs549, %rs229, %rs401;
(EngineCore_0 pid=142) mul.bf16 %rs550, %rs230, %rs401;
(EngineCore_0 pid=142) mul.bf16 %rs551, %rs231, %rs401;
(EngineCore_0 pid=142) mul.bf16 %rs552, %rs232, %rs401;
(EngineCore_0 pid=142) mul.bf16 %rs553, %rs233, %rs402;
(EngineCore_0 pid=142) mul.bf16 %rs554, %rs234, %rs402;
(EngineCore_0 pid=142) mul.bf16 %rs555, %rs235, %rs402;
(EngineCore_0 pid=142) mul.bf16 %rs556, %rs236, %rs402;
(EngineCore_0 pid=142) mul.bf16 %rs557, %rs237, %rs402;
(EngineCore_0 pid=142) mul.bf16 %rs558, %rs238, %rs402;
(EngineCore_0 pid=142) mul.bf16 %rs559, %rs239, %rs402;
(EngineCore_0 pid=142) mul.bf16 %rs560, %rs240, %rs402;
(EngineCore_0 pid=142) mul.bf16 %rs561, %rs241, %rs403;
(EngineCore_0 pid=142) mul.bf16 %rs562, %rs242, %rs403;
(EngineCore_0 pid=142) mul.bf16 %rs563, %rs243, %rs403;
(EngineCore_0 pid=142) mul.bf16 %rs564, %rs244, %rs403;
(EngineCore_0 pid=142) mul.bf16 %rs565, %rs245, %rs403;
(EngineCore_0 pid=142) mul.bf16 %rs566, %rs246, %rs403;
(EngineCore_0 pid=142) mul.bf16 %rs567, %rs247, %rs403;
(EngineCore_0 pid=142) mul.bf16 %rs568, %rs248, %rs403;
(EngineCore_0 pid=142) mul.bf16 %rs569, %rs249, %rs404;
(EngineCore_0 pid=142) mul.bf16 %rs570, %rs250, %rs404;
(EngineCore_0 pid=142) mul.bf16 %rs571, %rs251, %rs404;
(EngineCore_0 pid=142) mul.bf16 %rs572, %rs252, %rs404;
(EngineCore_0 pid=142) mul.bf16 %rs573, %rs253, %rs404;
(EngineCore_0 pid=142) mul.bf16 %rs574, %rs254, %rs404;
(EngineCore_0 pid=142) mul.bf16 %rs575, %rs255, %rs404;
(EngineCore_0 pid=142) mul.bf16 %rs576, %rs256, %rs404;
(EngineCore_0 pid=142) mul.bf16 %rs577, %rs257, %rs405;
(EngineCore_0 pid=142) mul.bf16 %rs578, %rs258, %rs405;
(EngineCore_0 pid=142) mul.bf16 %rs579, %rs259, %rs405;
(EngineCore_0 pid=142) mul.bf16 %rs580, %rs260, %rs405;
(EngineCore_0 pid=142) mul.bf16 %rs581, %rs261, %rs405;
(EngineCore_0 pid=142) mul.bf16 %rs582, %rs262, %rs405;
(EngineCore_0 pid=142) mul.bf16 %rs583, %rs263, %rs405;
(EngineCore_0 pid=142) mul.bf16 %rs584, %rs264, %rs405;
(EngineCore_0 pid=142) mul.bf16 %rs585, %rs265, %rs406;
(EngineCore_0 pid=142) mul.bf16 %rs586, %rs266, %rs406;
(EngineCore_0 pid=142) mul.bf16 %rs587, %rs267, %rs406;
(EngineCore_0 pid=142) mul.bf16 %rs588, %rs268, %rs406;
(EngineCore_0 pid=142) mul.bf16 %rs589, %rs269, %rs406;
(EngineCore_0 pid=142) mul.bf16 %rs590, %rs270, %rs406;
(EngineCore_0 pid=142) mul.bf16 %rs591, %rs271, %rs406;
(EngineCore_0 pid=142) mul.bf16 %rs592, %rs272, %rs406;
(EngineCore_0 pid=142) mul.bf16 %rs593, %rs273, %rs407;
(EngineCore_0 pid=142) mul.bf16 %rs594, %rs274, %rs407;
(EngineCore_0 pid=142) mul.bf16 %rs595, %rs275, %rs407;
(EngineCore_0 pid=142) mul.bf16 %rs596, %rs276, %rs407;
(EngineCore_0 pid=142) mul.bf16 %rs597, %rs277, %rs407;
(EngineCore_0 pid=142) mul.bf16 %rs598, %rs278, %rs407;
(EngineCore_0 pid=142) mul.bf16 %rs599, %rs279, %rs407;
(EngineCore_0 pid=142) mul.bf16 %rs600, %rs280, %rs407;
(EngineCore_0 pid=142) mul.bf16 %rs601, %rs281, %rs408;
(EngineCore_0 pid=142) mul.bf16 %rs602, %rs282, %rs408;
(EngineCore_0 pid=142) mul.bf16 %rs603, %rs283, %rs408;
(EngineCore_0 pid=142) mul.bf16 %rs604, %rs284, %rs408;
(EngineCore_0 pid=142) mul.bf16 %rs605, %rs285, %rs408;
(EngineCore_0 pid=142) mul.bf16 %rs606, %rs286, %rs408;
(EngineCore_0 pid=142) mul.bf16 %rs607, %rs287, %rs408;
(EngineCore_0 pid=142) mul.bf16 %rs608, %rs288, %rs408;
(EngineCore_0 pid=142) mul.bf16 %rs609, %rs289, %rs409;
(EngineCore_0 pid=142) mul.bf16 %rs610, %rs290, %rs409;
(EngineCore_0 pid=142) mul.bf16 %rs611, %rs291, %rs409;
(EngineCore_0 pid=142) mul.bf16 %rs612, %rs292, %rs409;
(EngineCore_0 pid=142) mul.bf16 %rs613, %rs293, %rs409;
(EngineCore_0 pid=142) mul.bf16 %rs614, %rs294, %rs409;
(EngineCore_0 pid=142) mul.bf16 %rs615, %rs295, %rs409;
(EngineCore_0 pid=142) mul.bf16 %rs616, %rs296, %rs409;
(EngineCore_0 pid=142) mul.bf16 %rs617, %rs297, %rs410;
(EngineCore_0 pid=142) mul.bf16 %rs618, %rs298, %rs410;
(EngineCore_0 pid=142) mul.bf16 %rs619, %rs299, %rs410;
(EngineCore_0 pid=142) mul.bf16 %rs620, %rs300, %rs410;
(EngineCore_0 pid=142) mul.bf16 %rs621, %rs301, %rs410;
(EngineCore_0 pid=142) mul.bf16 %rs622, %rs302, %rs410;
(EngineCore_0 pid=142) mul.bf16 %rs623, %rs303, %rs410;
(EngineCore_0 pid=142) mul.bf16 %rs624, %rs304, %rs410;
(EngineCore_0 pid=142) mul.bf16 %rs625, %rs305, %rs411;
(EngineCore_0 pid=142) mul.bf16 %rs626, %rs306, %rs411;
(EngineCore_0 pid=142) mul.bf16 %rs627, %rs307, %rs411;
(EngineCore_0 pid=142) mul.bf16 %rs628, %rs308, %rs411;
(EngineCore_0 pid=142) mul.bf16 %rs629, %rs309, %rs411;
(EngineCore_0 pid=142) mul.bf16 %rs630, %rs310, %rs411;
(EngineCore_0 pid=142) mul.bf16 %rs631, %rs311, %rs411;
(EngineCore_0 pid=142) mul.bf16 %rs632, %rs312, %rs411;
(EngineCore_0 pid=142) mul.bf16 %rs633, %rs313, %rs412;
(EngineCore_0 pid=142) mul.bf16 %rs634, %rs314, %rs412;
(EngineCore_0 pid=142) mul.bf16 %rs635, %rs315, %rs412;
(EngineCore_0 pid=142) mul.bf16 %rs636, %rs316, %rs412;
(EngineCore_0 pid=142) mul.bf16 %rs637, %rs317, %rs412;
(EngineCore_0 pid=142) mul.bf16 %rs638, %rs318, %rs412;
(EngineCore_0 pid=142) mul.bf16 %rs639, %rs319, %rs412;
(EngineCore_0 pid=142) mul.bf16 %rs640, %rs320, %rs412;
(EngineCore_0 pid=142) mul.bf16 %rs641, %rs321, %rs413;
(EngineCore_0 pid=142) mul.bf16 %rs642, %rs322, %rs413;
(EngineCore_0 pid=142) mul.bf16 %rs643, %rs323, %rs413;
(EngineCore_0 pid=142) mul.bf16 %rs644, %rs324, %rs413;
(EngineCore_0 pid=142) mul.bf16 %rs645, %rs325, %rs413;
(EngineCore_0 pid=142) mul.bf16 %rs646, %rs326, %rs413;
(EngineCore_0 pid=142) mul.bf16 %rs647, %rs327, %rs413;
(EngineCore_0 pid=142) mul.bf16 %rs648, %rs328, %rs413;
(EngineCore_0 pid=142) mul.bf16 %rs649, %rs329, %rs414;
(EngineCore_0 pid=142) mul.bf16 %rs650, %rs330, %rs414;
(EngineCore_0 pid=142) mul.bf16 %rs651, %rs331, %rs414;
(EngineCore_0 pid=142) mul.bf16 %rs652, %rs332, %rs414;
(EngineCore_0 pid=142) mul.bf16 %rs653, %rs333, %rs414;
(EngineCore_0 pid=142) mul.bf16 %rs654, %rs334, %rs414;
(EngineCore_0 pid=142) mul.bf16 %rs655, %rs335, %rs414;
(EngineCore_0 pid=142) mul.bf16 %rs656, %rs336, %rs414;
(EngineCore_0 pid=142) mul.bf16 %rs657, %rs337, %rs415;
(EngineCore_0 pid=142) mul.bf16 %rs658, %rs338, %rs415;
(EngineCore_0 pid=142) mul.bf16 %rs659, %rs339, %rs415;
(EngineCore_0 pid=142) mul.bf16 %rs660, %rs340, %rs415;
(EngineCore_0 pid=142) mul.bf16 %rs661, %rs341, %rs415;
(EngineCore_0 pid=142) mul.bf16 %rs662, %rs342, %rs415;
(EngineCore_0 pid=142) mul.bf16 %rs663, %rs343, %rs415;
(EngineCore_0 pid=142) mul.bf16 %rs664, %rs344, %rs415;
(EngineCore_0 pid=142) mul.bf16 %rs665, %rs345, %rs416;
(EngineCore_0 pid=142) mul.bf16 %rs666, %rs346, %rs416;
(EngineCore_0 pid=142) mul.bf16 %rs667, %rs347, %rs416;
(EngineCore_0 pid=142) mul.bf16 %rs668, %rs348, %rs416;
(EngineCore_0 pid=142) mul.bf16 %rs669, %rs349, %rs416;
(EngineCore_0 pid=142) mul.bf16 %rs670, %rs350, %rs416;
(EngineCore_0 pid=142) mul.bf16 %rs671, %rs351, %rs416;
(EngineCore_0 pid=142) mul.bf16 %rs672, %rs352, %rs416;
(EngineCore_0 pid=142) mov.b32 %r1324, {%rs417, %rs418};
(EngineCore_0 pid=142) mov.b32 %r1325, {%rs425, %rs426};
(EngineCore_0 pid=142) mov.b32 %r1326, {%rs433, %rs434};
(EngineCore_0 pid=142) mov.b32 %r1327, {%rs441, %rs442};
(EngineCore_0 pid=142) mov.b32 %r1772, {%rs419, %rs420};
(EngineCore_0 pid=142) mov.b32 %r1773, {%rs427, %rs428};
(EngineCore_0 pid=142) mov.b32 %r1774, {%rs435, %rs436};
(EngineCore_0 pid=142) mov.b32 %r1775, {%rs443, %rs444};
(EngineCore_0 pid=142) mov.b32 %r2220, {%rs421, %rs422};
(EngineCore_0 pid=142) mov.b32 %r2221, {%rs429, %rs430};
(EngineCore_0 pid=142) mov.b32 %r2222, {%rs437, %rs438};
(EngineCore_0 pid=142) mov.b32 %r2223, {%rs445, %rs446};
(EngineCore_0 pid=142) mov.b32 %r2668, {%rs423, %rs424};
(EngineCore_0 pid=142) mov.b32 %r2669, {%rs431, %rs432};
(EngineCore_0 pid=142) mov.b32 %r2670, {%rs439, %rs440};
(EngineCore_0 pid=142) mov.b32 %r2671, {%rs447, %rs448};
(EngineCore_0 pid=142) mov.b32 %r3116, {%rs449, %rs450};
(EngineCore_0 pid=142) mov.b32 %r3117, {%rs457, %rs458};
(EngineCore_0 pid=142) mov.b32 %r3118, {%rs465, %rs466};
(EngineCore_0 pid=142) mov.b32 %r3119, {%rs473, %rs474};
(EngineCore_0 pid=142) mov.b32 %r3564, {%rs451, %rs452};
(EngineCore_0 pid=142) mov.b32 %r3565, {%rs459, %rs460};
(EngineCore_0 pid=142) mov.b32 %r3566, {%rs467, %rs468};
(EngineCore_0 pid=142) mov.b32 %r3567, {%rs475, %rs476};
(EngineCore_0 pid=142) mov.b32 %r4012, {%rs453, %rs454};
(EngineCore_0 pid=142) mov.b32 %r4013, {%rs461, %rs462};
(EngineCore_0 pid=142) mov.b32 %r4014, {%rs469, %rs470};
(EngineCore_0 pid=142) mov.b32 %r4015, {%rs477, %rs478};
(EngineCore_0 pid=142) mov.b32 %r4460, {%rs455, %rs456};
(EngineCore_0 pid=142) mov.b32 %r4461, {%rs463, %rs464};
(EngineCore_0 pid=142) mov.b32 %r4462, {%rs471, %rs472};
(EngineCore_0 pid=142) mov.b32 %r4463, {%rs479, %rs480};
(EngineCore_0 pid=142) mov.b32 %r1436, {%rs481, %rs482};
(EngineCore_0 pid=142) mov.b32 %r1437, {%rs489, %rs490};
(EngineCore_0 pid=142) mov.b32 %r1438, {%rs497, %rs498};
(EngineCore_0 pid=142) mov.b32 %r1439, {%rs505, %rs506};
(EngineCore_0 pid=142) mov.b32 %r1884, {%rs483, %rs484};
(EngineCore_0 pid=142) mov.b32 %r1885, {%rs491, %rs492};
(EngineCore_0 pid=142) mov.b32 %r1886, {%rs499, %rs500};
(EngineCore_0 pid=142) mov.b32 %r1887, {%rs507, %rs508};
(EngineCore_0 pid=142) mov.b32 %r2332, {%rs485, %rs486};
(EngineCore_0 pid=142) mov.b32 %r2333, {%rs493, %rs494};
(EngineCore_0 pid=142) mov.b32 %r2334, {%rs501, %rs502};
(EngineCore_0 pid=142) mov.b32 %r2335, {%rs509, %rs510};
(EngineCore_0 pid=142) mov.b32 %r2780, {%rs487, %rs488};
(EngineCore_0 pid=142) mov.b32 %r2781, {%rs495, %rs496};
(EngineCore_0 pid=142) mov.b32 %r2782, {%rs503, %rs504};
(EngineCore_0 pid=142) mov.b32 %r2783, {%rs511, %rs512};
(EngineCore_0 pid=142) mov.b32 %r3228, {%rs513, %rs514};
(EngineCore_0 pid=142) mov.b32 %r3229, {%rs521, %rs522};
(EngineCore_0 pid=142) mov.b32 %r3230, {%rs529, %rs530};
(EngineCore_0 pid=142) mov.b32 %r3231, {%rs537, %rs538};
(EngineCore_0 pid=142) mov.b32 %r3676, {%rs515, %rs516};
(EngineCore_0 pid=142) mov.b32 %r3677, {%rs523, %rs524};
(EngineCore_0 pid=142) mov.b32 %r3678, {%rs531, %rs532};
(EngineCore_0 pid=142) mov.b32 %r3679, {%rs539, %rs540};
(EngineCore_0 pid=142) mov.b32 %r4124, {%rs517, %rs518};
(EngineCore_0 pid=142) mov.b32 %r4125, {%rs525, %rs526};
(EngineCore_0 pid=142) mov.b32 %r4126, {%rs533, %rs534};
(EngineCore_0 pid=142) mov.b32 %r4127, {%rs541, %rs542};
(EngineCore_0 pid=142) mov.b32 %r4572, {%rs519, %rs520};
(EngineCore_0 pid=142) mov.b32 %r4573, {%rs527, %rs528};
(EngineCore_0 pid=142) mov.b32 %r4574, {%rs535, %rs536};
(EngineCore_0 pid=142) mov.b32 %r4575, {%rs543, %rs544};
(EngineCore_0 pid=142) mov.b32 %r1548, {%rs545, %rs546};
(EngineCore_0 pid=142) mov.b32 %r1549, {%rs553, %rs554};
(EngineCore_0 pid=142) mov.b32 %r1550, {%rs561, %rs562};
(EngineCore_0 pid=142) mov.b32 %r1551, {%rs569, %rs570};
(EngineCore_0 pid=142) mov.b32 %r1996, {%rs547, %rs548};
(EngineCore_0 pid=142) mov.b32 %r1997, {%rs555, %rs556};
(EngineCore_0 pid=142) mov.b32 %r1998, {%rs563, %rs564};
(EngineCore_0 pid=142) mov.b32 %r1999, {%rs571, %rs572};
(EngineCore_0 pid=142) mov.b32 %r2444, {%rs549, %rs550};
(EngineCore_0 pid=142) mov.b32 %r2445, {%rs557, %rs558};
(EngineCore_0 pid=142) mov.b32 %r2446, {%rs565, %rs566};
(EngineCore_0 pid=142) mov.b32 %r2447, {%rs573, %rs574};
(EngineCore_0 pid=142) mov.b32 %r2892, {%rs551, %rs552};
(EngineCore_0 pid=142) mov.b32 %r2893, {%rs559, %rs560};
(EngineCore_0 pid=142) mov.b32 %r2894, {%rs567, %rs568};
(EngineCore_0 pid=142) mov.b32 %r2895, {%rs575, %rs576};
(EngineCore_0 pid=142) mov.b32 %r3340, {%rs577, %rs578};
(EngineCore_0 pid=142) mov.b32 %r3341, {%rs585, %rs586};
(EngineCore_0 pid=142) mov.b32 %r3342, {%rs593, %rs594};
(EngineCore_0 pid=142) mov.b32 %r3343, {%rs601, %rs602};
(EngineCore_0 pid=142) mov.b32 %r3788, {%rs579, %rs580};
(EngineCore_0 pid=142) mov.b32 %r3789, {%rs587, %rs588};
(EngineCore_0 pid=142) mov.b32 %r3790, {%rs595, %rs596};
(EngineCore_0 pid=142) mov.b32 %r3791, {%rs603, %rs604};
(EngineCore_0 pid=142) mov.b32 %r4236, {%rs581, %rs582};
(EngineCore_0 pid=142) mov.b32 %r4237, {%rs589, %rs590};
(EngineCore_0 pid=142) mov.b32 %r4238, {%rs597, %rs598};
(EngineCore_0 pid=142) mov.b32 %r4239, {%rs605, %rs606};
(EngineCore_0 pid=142) mov.b32 %r4684, {%rs583, %rs584};
(EngineCore_0 pid=142) mov.b32 %r4685, {%rs591, %rs592};
(EngineCore_0 pid=142) mov.b32 %r4686, {%rs599, %rs600};
(EngineCore_0 pid=142) mov.b32 %r4687, {%rs607, %rs608};
(EngineCore_0 pid=142) mov.b32 %r1660, {%rs609, %rs610};
(EngineCore_0 pid=142) mov.b32 %r1661, {%rs617, %rs618};
(EngineCore_0 pid=142) mov.b32 %r1662, {%rs625, %rs626};
(EngineCore_0 pid=142) mov.b32 %r1663, {%rs633, %rs634};
(EngineCore_0 pid=142) mov.b32 %r2108, {%rs611, %rs612};
(EngineCore_0 pid=142) mov.b32 %r2109, {%rs619, %rs620};
(EngineCore_0 pid=142) mov.b32 %r2110, {%rs627, %rs628};
(EngineCore_0 pid=142) mov.b32 %r2111, {%rs635, %rs636};
(EngineCore_0 pid=142) mov.b32 %r2556, {%rs613, %rs614};
(EngineCore_0 pid=142) mov.b32 %r2557, {%rs621, %rs622};
(EngineCore_0 pid=142) mov.b32 %r2558, {%rs629, %rs630};
(EngineCore_0 pid=142) mov.b32 %r2559, {%rs637, %rs638};
(EngineCore_0 pid=142) mov.b32 %r3004, {%rs615, %rs616};
(EngineCore_0 pid=142) mov.b32 %r3005, {%rs623, %rs624};
(EngineCore_0 pid=142) mov.b32 %r3006, {%rs631, %rs632};
(EngineCore_0 pid=142) mov.b32 %r3007, {%rs639, %rs640};
(EngineCore_0 pid=142) mov.b32 %r3452, {%rs641, %rs642};
(EngineCore_0 pid=142) mov.b32 %r3453, {%rs649, %rs650};
(EngineCore_0 pid=142) mov.b32 %r3454, {%rs657, %rs658};
(EngineCore_0 pid=142) mov.b32 %r3455, {%rs665, %rs666};
(EngineCore_0 pid=142) mov.b32 %r3900, {%rs643, %rs644};
(EngineCore_0 pid=142) mov.b32 %r3901, {%rs651, %rs652};
(EngineCore_0 pid=142) mov.b32 %r3902, {%rs659, %rs660};
(EngineCore_0 pid=142) mov.b32 %r3903, {%rs667, %rs668};
(EngineCore_0 pid=142) mov.b32 %r4348, {%rs645, %rs646};
(EngineCore_0 pid=142) mov.b32 %r4349, {%rs653, %rs654};
(EngineCore_0 pid=142) mov.b32 %r4350, {%rs661, %rs662};
(EngineCore_0 pid=142) mov.b32 %r4351, {%rs669, %rs670};
(EngineCore_0 pid=142) mov.b32 %r4796, {%rs647, %rs648};
(EngineCore_0 pid=142) mov.b32 %r4797, {%rs655, %rs656};
(EngineCore_0 pid=142) mov.b32 %r4798, {%rs663, %rs664};
(EngineCore_0 pid=142) mov.b32 %r4799, {%rs671, %rs672};
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6042, %r6043, %r6044, %r6045 }, { %r1324, %r1325, %r1326, %r1327 }, { %r1314, %r1315 }, { %r6042, %r6043, %r6044, %r6045 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6046, %r6047, %r6048, %r6049 }, { %r1324, %r1325, %r1326, %r1327 }, { %r1328, %r1329 }, { %r6046, %r6047, %r6048, %r6049 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6050, %r6051, %r6052, %r6053 }, { %r1324, %r1325, %r1326, %r1327 }, { %r1342, %r1343 }, { %r6050, %r6051, %r6052, %r6053 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6054, %r6055, %r6056, %r6057 }, { %r1324, %r1325, %r1326, %r1327 }, { %r1356, %r1357 }, { %r6054, %r6055, %r6056, %r6057 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6058, %r6059, %r6060, %r6061 }, { %r1324, %r1325, %r1326, %r1327 }, { %r1370, %r1371 }, { %r6058, %r6059, %r6060, %r6061 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6062, %r6063, %r6064, %r6065 }, { %r1324, %r1325, %r1326, %r1327 }, { %r1384, %r1385 }, { %r6062, %r6063, %r6064, %r6065 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6066, %r6067, %r6068, %r6069 }, { %r1324, %r1325, %r1326, %r1327 }, { %r1398, %r1399 }, { %r6066, %r6067, %r6068, %r6069 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6070, %r6071, %r6072, %r6073 }, { %r1324, %r1325, %r1326, %r1327 }, { %r1412, %r1413 }, { %r6070, %r6071, %r6072, %r6073 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6074, %r6075, %r6076, %r6077 }, { %r1436, %r1437, %r1438, %r1439 }, { %r1314, %r1315 }, { %r6074, %r6075, %r6076, %r6077 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6078, %r6079, %r6080, %r6081 }, { %r1436, %r1437, %r1438, %r1439 }, { %r1328, %r1329 }, { %r6078, %r6079, %r6080, %r6081 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6082, %r6083, %r6084, %r6085 }, { %r1436, %r1437, %r1438, %r1439 }, { %r1342, %r1343 }, { %r6082, %r6083, %r6084, %r6085 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6086, %r6087, %r6088, %r6089 }, { %r1436, %r1437, %r1438, %r1439 }, { %r1356, %r1357 }, { %r6086, %r6087, %r6088, %r6089 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6090, %r6091, %r6092, %r6093 }, { %r1436, %r1437, %r1438, %r1439 }, { %r1370, %r1371 }, { %r6090, %r6091, %r6092, %r6093 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6094, %r6095, %r6096, %r6097 }, { %r1436, %r1437, %r1438, %r1439 }, { %r1384, %r1385 }, { %r6094, %r6095, %r6096, %r6097 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6098, %r6099, %r6100, %r6101 }, { %r1436, %r1437, %r1438, %r1439 }, { %r1398, %r1399 }, { %r6098, %r6099, %r6100, %r6101 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6102, %r6103, %r6104, %r6105 }, { %r1436, %r1437, %r1438, %r1439 }, { %r1412, %r1413 }, { %r6102, %r6103, %r6104, %r6105 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6106, %r6107, %r6108, %r6109 }, { %r1548, %r1549, %r1550, %r1551 }, { %r1314, %r1315 }, { %r6106, %r6107, %r6108, %r6109 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6110, %r6111, %r6112, %r6113 }, { %r1548, %r1549, %r1550, %r1551 }, { %r1328, %r1329 }, { %r6110, %r6111, %r6112, %r6113 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6114, %r6115, %r6116, %r6117 }, { %r1548, %r1549, %r1550, %r1551 }, { %r1342, %r1343 }, { %r6114, %r6115, %r6116, %r6117 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6118, %r6119, %r6120, %r6121 }, { %r1548, %r1549, %r1550, %r1551 }, { %r1356, %r1357 }, { %r6118, %r6119, %r6120, %r6121 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6122, %r6123, %r6124, %r6125 }, { %r1548, %r1549, %r1550, %r1551 }, { %r1370, %r1371 }, { %r6122, %r6123, %r6124, %r6125 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6126, %r6127, %r6128, %r6129 }, { %r1548, %r1549, %r1550, %r1551 }, { %r1384, %r1385 }, { %r6126, %r6127, %r6128, %r6129 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6130, %r6131, %r6132, %r6133 }, { %r1548, %r1549, %r1550, %r1551 }, { %r1398, %r1399 }, { %r6130, %r6131, %r6132, %r6133 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6134, %r6135, %r6136, %r6137 }, { %r1548, %r1549, %r1550, %r1551 }, { %r1412, %r1413 }, { %r6134, %r6135, %r6136, %r6137 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6138, %r6139, %r6140, %r6141 }, { %r1660, %r1661, %r1662, %r1663 }, { %r1314, %r1315 }, { %r6138, %r6139, %r6140, %r6141 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6142, %r6143, %r6144, %r6145 }, { %r1660, %r1661, %r1662, %r1663 }, { %r1328, %r1329 }, { %r6142, %r6143, %r6144, %r6145 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6146, %r6147, %r6148, %r6149 }, { %r1660, %r1661, %r1662, %r1663 }, { %r1342, %r1343 }, { %r6146, %r6147, %r6148, %r6149 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6150, %r6151, %r6152, %r6153 }, { %r1660, %r1661, %r1662, %r1663 }, { %r1356, %r1357 }, { %r6150, %r6151, %r6152, %r6153 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6154, %r6155, %r6156, %r6157 }, { %r1660, %r1661, %r1662, %r1663 }, { %r1370, %r1371 }, { %r6154, %r6155, %r6156, %r6157 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6158, %r6159, %r6160, %r6161 }, { %r1660, %r1661, %r1662, %r1663 }, { %r1384, %r1385 }, { %r6158, %r6159, %r6160, %r6161 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6162, %r6163, %r6164, %r6165 }, { %r1660, %r1661, %r1662, %r1663 }, { %r1398, %r1399 }, { %r6162, %r6163, %r6164, %r6165 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6166, %r6167, %r6168, %r6169 }, { %r1660, %r1661, %r1662, %r1663 }, { %r1412, %r1413 }, { %r6166, %r6167, %r6168, %r6169 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6042, %r6043, %r6044, %r6045 }, { %r1772, %r1773, %r1774, %r1775 }, { %r1762, %r1763 }, { %r6042, %r6043, %r6044, %r6045 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6046, %r6047, %r6048, %r6049 }, { %r1772, %r1773, %r1774, %r1775 }, { %r1776, %r1777 }, { %r6046, %r6047, %r6048, %r6049 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6050, %r6051, %r6052, %r6053 }, { %r1772, %r1773, %r1774, %r1775 }, { %r1790, %r1791 }, { %r6050, %r6051, %r6052, %r6053 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6054, %r6055, %r6056, %r6057 }, { %r1772, %r1773, %r1774, %r1775 }, { %r1804, %r1805 }, { %r6054, %r6055, %r6056, %r6057 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6058, %r6059, %r6060, %r6061 }, { %r1772, %r1773, %r1774, %r1775 }, { %r1818, %r1819 }, { %r6058, %r6059, %r6060, %r6061 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6062, %r6063, %r6064, %r6065 }, { %r1772, %r1773, %r1774, %r1775 }, { %r1832, %r1833 }, { %r6062, %r6063, %r6064, %r6065 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6066, %r6067, %r6068, %r6069 }, { %r1772, %r1773, %r1774, %r1775 }, { %r1846, %r1847 }, { %r6066, %r6067, %r6068, %r6069 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6070, %r6071, %r6072, %r6073 }, { %r1772, %r1773, %r1774, %r1775 }, { %r1860, %r1861 }, { %r6070, %r6071, %r6072, %r6073 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6074, %r6075, %r6076, %r6077 }, { %r1884, %r1885, %r1886, %r1887 }, { %r1762, %r1763 }, { %r6074, %r6075, %r6076, %r6077 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6078, %r6079, %r6080, %r6081 }, { %r1884, %r1885, %r1886, %r1887 }, { %r1776, %r1777 }, { %r6078, %r6079, %r6080, %r6081 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6082, %r6083, %r6084, %r6085 }, { %r1884, %r1885, %r1886, %r1887 }, { %r1790, %r1791 }, { %r6082, %r6083, %r6084, %r6085 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6086, %r6087, %r6088, %r6089 }, { %r1884, %r1885, %r1886, %r1887 }, { %r1804, %r1805 }, { %r6086, %r6087, %r6088, %r6089 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6090, %r6091, %r6092, %r6093 }, { %r1884, %r1885, %r1886, %r1887 }, { %r1818, %r1819 }, { %r6090, %r6091, %r6092, %r6093 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6094, %r6095, %r6096, %r6097 }, { %r1884, %r1885, %r1886, %r1887 }, { %r1832, %r1833 }, { %r6094, %r6095, %r6096, %r6097 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6098, %r6099, %r6100, %r6101 }, { %r1884, %r1885, %r1886, %r1887 }, { %r1846, %r1847 }, { %r6098, %r6099, %r6100, %r6101 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6102, %r6103, %r6104, %r6105 }, { %r1884, %r1885, %r1886, %r1887 }, { %r1860, %r1861 }, { %r6102, %r6103, %r6104, %r6105 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6106, %r6107, %r6108, %r6109 }, { %r1996, %r1997, %r1998, %r1999 }, { %r1762, %r1763 }, { %r6106, %r6107, %r6108, %r6109 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6110, %r6111, %r6112, %r6113 }, { %r1996, %r1997, %r1998, %r1999 }, { %r1776, %r1777 }, { %r6110, %r6111, %r6112, %r6113 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6114, %r6115, %r6116, %r6117 }, { %r1996, %r1997, %r1998, %r1999 }, { %r1790, %r1791 }, { %r6114, %r6115, %r6116, %r6117 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6118, %r6119, %r6120, %r6121 }, { %r1996, %r1997, %r1998, %r1999 }, { %r1804, %r1805 }, { %r6118, %r6119, %r6120, %r6121 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6122, %r6123, %r6124, %r6125 }, { %r1996, %r1997, %r1998, %r1999 }, { %r1818, %r1819 }, { %r6122, %r6123, %r6124, %r6125 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6126, %r6127, %r6128, %r6129 }, { %r1996, %r1997, %r1998, %r1999 }, { %r1832, %r1833 }, { %r6126, %r6127, %r6128, %r6129 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6130, %r6131, %r6132, %r6133 }, { %r1996, %r1997, %r1998, %r1999 }, { %r1846, %r1847 }, { %r6130, %r6131, %r6132, %r6133 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6134, %r6135, %r6136, %r6137 }, { %r1996, %r1997, %r1998, %r1999 }, { %r1860, %r1861 }, { %r6134, %r6135, %r6136, %r6137 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6138, %r6139, %r6140, %r6141 }, { %r2108, %r2109, %r2110, %r2111 }, { %r1762, %r1763 }, { %r6138, %r6139, %r6140, %r6141 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6142, %r6143, %r6144, %r6145 }, { %r2108, %r2109, %r2110, %r2111 }, { %r1776, %r1777 }, { %r6142, %r6143, %r6144, %r6145 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6146, %r6147, %r6148, %r6149 }, { %r2108, %r2109, %r2110, %r2111 }, { %r1790, %r1791 }, { %r6146, %r6147, %r6148, %r6149 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6150, %r6151, %r6152, %r6153 }, { %r2108, %r2109, %r2110, %r2111 }, { %r1804, %r1805 }, { %r6150, %r6151, %r6152, %r6153 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6154, %r6155, %r6156, %r6157 }, { %r2108, %r2109, %r2110, %r2111 }, { %r1818, %r1819 }, { %r6154, %r6155, %r6156, %r6157 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6158, %r6159, %r6160, %r6161 }, { %r2108, %r2109, %r2110, %r2111 }, { %r1832, %r1833 }, { %r6158, %r6159, %r6160, %r6161 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6162, %r6163, %r6164, %r6165 }, { %r2108, %r2109, %r2110, %r2111 }, { %r1846, %r1847 }, { %r6162, %r6163, %r6164, %r6165 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6166, %r6167, %r6168, %r6169 }, { %r2108, %r2109, %r2110, %r2111 }, { %r1860, %r1861 }, { %r6166, %r6167, %r6168, %r6169 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6042, %r6043, %r6044, %r6045 }, { %r2220, %r2221, %r2222, %r2223 }, { %r2210, %r2211 }, { %r6042, %r6043, %r6044, %r6045 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6046, %r6047, %r6048, %r6049 }, { %r2220, %r2221, %r2222, %r2223 }, { %r2224, %r2225 }, { %r6046, %r6047, %r6048, %r6049 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6050, %r6051, %r6052, %r6053 }, { %r2220, %r2221, %r2222, %r2223 }, { %r2238, %r2239 }, { %r6050, %r6051, %r6052, %r6053 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6054, %r6055, %r6056, %r6057 }, { %r2220, %r2221, %r2222, %r2223 }, { %r2252, %r2253 }, { %r6054, %r6055, %r6056, %r6057 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6058, %r6059, %r6060, %r6061 }, { %r2220, %r2221, %r2222, %r2223 }, { %r2266, %r2267 }, { %r6058, %r6059, %r6060, %r6061 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6062, %r6063, %r6064, %r6065 }, { %r2220, %r2221, %r2222, %r2223 }, { %r2280, %r2281 }, { %r6062, %r6063, %r6064, %r6065 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6066, %r6067, %r6068, %r6069 }, { %r2220, %r2221, %r2222, %r2223 }, { %r2294, %r2295 }, { %r6066, %r6067, %r6068, %r6069 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6070, %r6071, %r6072, %r6073 }, { %r2220, %r2221, %r2222, %r2223 }, { %r2308, %r2309 }, { %r6070, %r6071, %r6072, %r6073 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6074, %r6075, %r6076, %r6077 }, { %r2332, %r2333, %r2334, %r2335 }, { %r2210, %r2211 }, { %r6074, %r6075, %r6076, %r6077 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6078, %r6079, %r6080, %r6081 }, { %r2332, %r2333, %r2334, %r2335 }, { %r2224, %r2225 }, { %r6078, %r6079, %r6080, %r6081 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6082, %r6083, %r6084, %r6085 }, { %r2332, %r2333, %r2334, %r2335 }, { %r2238, %r2239 }, { %r6082, %r6083, %r6084, %r6085 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6086, %r6087, %r6088, %r6089 }, { %r2332, %r2333, %r2334, %r2335 }, { %r2252, %r2253 }, { %r6086, %r6087, %r6088, %r6089 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6090, %r6091, %r6092, %r6093 }, { %r2332, %r2333, %r2334, %r2335 }, { %r2266, %r2267 }, { %r6090, %r6091, %r6092, %r6093 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6094, %r6095, %r6096, %r6097 }, { %r2332, %r2333, %r2334, %r2335 }, { %r2280, %r2281 }, { %r6094, %r6095, %r6096, %r6097 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6098, %r6099, %r6100, %r6101 }, { %r2332, %r2333, %r2334, %r2335 }, { %r2294, %r2295 }, { %r6098, %r6099, %r6100, %r6101 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6102, %r6103, %r6104, %r6105 }, { %r2332, %r2333, %r2334, %r2335 }, { %r2308, %r2309 }, { %r6102, %r6103, %r6104, %r6105 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6106, %r6107, %r6108, %r6109 }, { %r2444, %r2445, %r2446, %r2447 }, { %r2210, %r2211 }, { %r6106, %r6107, %r6108, %r6109 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6110, %r6111, %r6112, %r6113 }, { %r2444, %r2445, %r2446, %r2447 }, { %r2224, %r2225 }, { %r6110, %r6111, %r6112, %r6113 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6114, %r6115, %r6116, %r6117 }, { %r2444, %r2445, %r2446, %r2447 }, { %r2238, %r2239 }, { %r6114, %r6115, %r6116, %r6117 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6118, %r6119, %r6120, %r6121 }, { %r2444, %r2445, %r2446, %r2447 }, { %r2252, %r2253 }, { %r6118, %r6119, %r6120, %r6121 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6122, %r6123, %r6124, %r6125 }, { %r2444, %r2445, %r2446, %r2447 }, { %r2266, %r2267 }, { %r6122, %r6123, %r6124, %r6125 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6126, %r6127, %r6128, %r6129 }, { %r2444, %r2445, %r2446, %r2447 }, { %r2280, %r2281 }, { %r6126, %r6127, %r6128, %r6129 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6130, %r6131, %r6132, %r6133 }, { %r2444, %r2445, %r2446, %r2447 }, { %r2294, %r2295 }, { %r6130, %r6131, %r6132, %r6133 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6134, %r6135, %r6136, %r6137 }, { %r2444, %r2445, %r2446, %r2447 }, { %r2308, %r2309 }, { %r6134, %r6135, %r6136, %r6137 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6138, %r6139, %r6140, %r6141 }, { %r2556, %r2557, %r2558, %r2559 }, { %r2210, %r2211 }, { %r6138, %r6139, %r6140, %r6141 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6142, %r6143, %r6144, %r6145 }, { %r2556, %r2557, %r2558, %r2559 }, { %r2224, %r2225 }, { %r6142, %r6143, %r6144, %r6145 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6146, %r6147, %r6148, %r6149 }, { %r2556, %r2557, %r2558, %r2559 }, { %r2238, %r2239 }, { %r6146, %r6147, %r6148, %r6149 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6150, %r6151, %r6152, %r6153 }, { %r2556, %r2557, %r2558, %r2559 }, { %r2252, %r2253 }, { %r6150, %r6151, %r6152, %r6153 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6154, %r6155, %r6156, %r6157 }, { %r2556, %r2557, %r2558, %r2559 }, { %r2266, %r2267 }, { %r6154, %r6155, %r6156, %r6157 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6158, %r6159, %r6160, %r6161 }, { %r2556, %r2557, %r2558, %r2559 }, { %r2280, %r2281 }, { %r6158, %r6159, %r6160, %r6161 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6162, %r6163, %r6164, %r6165 }, { %r2556, %r2557, %r2558, %r2559 }, { %r2294, %r2295 }, { %r6162, %r6163, %r6164, %r6165 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6166, %r6167, %r6168, %r6169 }, { %r2556, %r2557, %r2558, %r2559 }, { %r2308, %r2309 }, { %r6166, %r6167, %r6168, %r6169 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6042, %r6043, %r6044, %r6045 }, { %r2668, %r2669, %r2670, %r2671 }, { %r2658, %r2659 }, { %r6042, %r6043, %r6044, %r6045 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6046, %r6047, %r6048, %r6049 }, { %r2668, %r2669, %r2670, %r2671 }, { %r2672, %r2673 }, { %r6046, %r6047, %r6048, %r6049 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6050, %r6051, %r6052, %r6053 }, { %r2668, %r2669, %r2670, %r2671 }, { %r2686, %r2687 }, { %r6050, %r6051, %r6052, %r6053 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6054, %r6055, %r6056, %r6057 }, { %r2668, %r2669, %r2670, %r2671 }, { %r2700, %r2701 }, { %r6054, %r6055, %r6056, %r6057 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6058, %r6059, %r6060, %r6061 }, { %r2668, %r2669, %r2670, %r2671 }, { %r2714, %r2715 }, { %r6058, %r6059, %r6060, %r6061 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6062, %r6063, %r6064, %r6065 }, { %r2668, %r2669, %r2670, %r2671 }, { %r2728, %r2729 }, { %r6062, %r6063, %r6064, %r6065 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6066, %r6067, %r6068, %r6069 }, { %r2668, %r2669, %r2670, %r2671 }, { %r2742, %r2743 }, { %r6066, %r6067, %r6068, %r6069 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6070, %r6071, %r6072, %r6073 }, { %r2668, %r2669, %r2670, %r2671 }, { %r2756, %r2757 }, { %r6070, %r6071, %r6072, %r6073 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6074, %r6075, %r6076, %r6077 }, { %r2780, %r2781, %r2782, %r2783 }, { %r2658, %r2659 }, { %r6074, %r6075, %r6076, %r6077 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6078, %r6079, %r6080, %r6081 }, { %r2780, %r2781, %r2782, %r2783 }, { %r2672, %r2673 }, { %r6078, %r6079, %r6080, %r6081 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6082, %r6083, %r6084, %r6085 }, { %r2780, %r2781, %r2782, %r2783 }, { %r2686, %r2687 }, { %r6082, %r6083, %r6084, %r6085 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6086, %r6087, %r6088, %r6089 }, { %r2780, %r2781, %r2782, %r2783 }, { %r2700, %r2701 }, { %r6086, %r6087, %r6088, %r6089 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6090, %r6091, %r6092, %r6093 }, { %r2780, %r2781, %r2782, %r2783 }, { %r2714, %r2715 }, { %r6090, %r6091, %r6092, %r6093 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6094, %r6095, %r6096, %r6097 }, { %r2780, %r2781, %r2782, %r2783 }, { %r2728, %r2729 }, { %r6094, %r6095, %r6096, %r6097 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6098, %r6099, %r6100, %r6101 }, { %r2780, %r2781, %r2782, %r2783 }, { %r2742, %r2743 }, { %r6098, %r6099, %r6100, %r6101 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6102, %r6103, %r6104, %r6105 }, { %r2780, %r2781, %r2782, %r2783 }, { %r2756, %r2757 }, { %r6102, %r6103, %r6104, %r6105 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6106, %r6107, %r6108, %r6109 }, { %r2892, %r2893, %r2894, %r2895 }, { %r2658, %r2659 }, { %r6106, %r6107, %r6108, %r6109 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6110, %r6111, %r6112, %r6113 }, { %r2892, %r2893, %r2894, %r2895 }, { %r2672, %r2673 }, { %r6110, %r6111, %r6112, %r6113 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6114, %r6115, %r6116, %r6117 }, { %r2892, %r2893, %r2894, %r2895 }, { %r2686, %r2687 }, { %r6114, %r6115, %r6116, %r6117 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6118, %r6119, %r6120, %r6121 }, { %r2892, %r2893, %r2894, %r2895 }, { %r2700, %r2701 }, { %r6118, %r6119, %r6120, %r6121 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6122, %r6123, %r6124, %r6125 }, { %r2892, %r2893, %r2894, %r2895 }, { %r2714, %r2715 }, { %r6122, %r6123, %r6124, %r6125 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6126, %r6127, %r6128, %r6129 }, { %r2892, %r2893, %r2894, %r2895 }, { %r2728, %r2729 }, { %r6126, %r6127, %r6128, %r6129 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6130, %r6131, %r6132, %r6133 }, { %r2892, %r2893, %r2894, %r2895 }, { %r2742, %r2743 }, { %r6130, %r6131, %r6132, %r6133 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6134, %r6135, %r6136, %r6137 }, { %r2892, %r2893, %r2894, %r2895 }, { %r2756, %r2757 }, { %r6134, %r6135, %r6136, %r6137 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6138, %r6139, %r6140, %r6141 }, { %r3004, %r3005, %r3006, %r3007 }, { %r2658, %r2659 }, { %r6138, %r6139, %r6140, %r6141 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6142, %r6143, %r6144, %r6145 }, { %r3004, %r3005, %r3006, %r3007 }, { %r2672, %r2673 }, { %r6142, %r6143, %r6144, %r6145 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6146, %r6147, %r6148, %r6149 }, { %r3004, %r3005, %r3006, %r3007 }, { %r2686, %r2687 }, { %r6146, %r6147, %r6148, %r6149 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6150, %r6151, %r6152, %r6153 }, { %r3004, %r3005, %r3006, %r3007 }, { %r2700, %r2701 }, { %r6150, %r6151, %r6152, %r6153 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6154, %r6155, %r6156, %r6157 }, { %r3004, %r3005, %r3006, %r3007 }, { %r2714, %r2715 }, { %r6154, %r6155, %r6156, %r6157 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6158, %r6159, %r6160, %r6161 }, { %r3004, %r3005, %r3006, %r3007 }, { %r2728, %r2729 }, { %r6158, %r6159, %r6160, %r6161 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6162, %r6163, %r6164, %r6165 }, { %r3004, %r3005, %r3006, %r3007 }, { %r2742, %r2743 }, { %r6162, %r6163, %r6164, %r6165 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6166, %r6167, %r6168, %r6169 }, { %r3004, %r3005, %r3006, %r3007 }, { %r2756, %r2757 }, { %r6166, %r6167, %r6168, %r6169 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6042, %r6043, %r6044, %r6045 }, { %r3116, %r3117, %r3118, %r3119 }, { %r3106, %r3107 }, { %r6042, %r6043, %r6044, %r6045 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6046, %r6047, %r6048, %r6049 }, { %r3116, %r3117, %r3118, %r3119 }, { %r3120, %r3121 }, { %r6046, %r6047, %r6048, %r6049 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6050, %r6051, %r6052, %r6053 }, { %r3116, %r3117, %r3118, %r3119 }, { %r3134, %r3135 }, { %r6050, %r6051, %r6052, %r6053 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6054, %r6055, %r6056, %r6057 }, { %r3116, %r3117, %r3118, %r3119 }, { %r3148, %r3149 }, { %r6054, %r6055, %r6056, %r6057 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6058, %r6059, %r6060, %r6061 }, { %r3116, %r3117, %r3118, %r3119 }, { %r3162, %r3163 }, { %r6058, %r6059, %r6060, %r6061 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6062, %r6063, %r6064, %r6065 }, { %r3116, %r3117, %r3118, %r3119 }, { %r3176, %r3177 }, { %r6062, %r6063, %r6064, %r6065 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6066, %r6067, %r6068, %r6069 }, { %r3116, %r3117, %r3118, %r3119 }, { %r3190, %r3191 }, { %r6066, %r6067, %r6068, %r6069 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6070, %r6071, %r6072, %r6073 }, { %r3116, %r3117, %r3118, %r3119 }, { %r3204, %r3205 }, { %r6070, %r6071, %r6072, %r6073 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6074, %r6075, %r6076, %r6077 }, { %r3228, %r3229, %r3230, %r3231 }, { %r3106, %r3107 }, { %r6074, %r6075, %r6076, %r6077 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6078, %r6079, %r6080, %r6081 }, { %r3228, %r3229, %r3230, %r3231 }, { %r3120, %r3121 }, { %r6078, %r6079, %r6080, %r6081 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6082, %r6083, %r6084, %r6085 }, { %r3228, %r3229, %r3230, %r3231 }, { %r3134, %r3135 }, { %r6082, %r6083, %r6084, %r6085 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6086, %r6087, %r6088, %r6089 }, { %r3228, %r3229, %r3230, %r3231 }, { %r3148, %r3149 }, { %r6086, %r6087, %r6088, %r6089 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6090, %r6091, %r6092, %r6093 }, { %r3228, %r3229, %r3230, %r3231 }, { %r3162, %r3163 }, { %r6090, %r6091, %r6092, %r6093 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6094, %r6095, %r6096, %r6097 }, { %r3228, %r3229, %r3230, %r3231 }, { %r3176, %r3177 }, { %r6094, %r6095, %r6096, %r6097 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6098, %r6099, %r6100, %r6101 }, { %r3228, %r3229, %r3230, %r3231 }, { %r3190, %r3191 }, { %r6098, %r6099, %r6100, %r6101 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6102, %r6103, %r6104, %r6105 }, { %r3228, %r3229, %r3230, %r3231 }, { %r3204, %r3205 }, { %r6102, %r6103, %r6104, %r6105 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6106, %r6107, %r6108, %r6109 }, { %r3340, %r3341, %r3342, %r3343 }, { %r3106, %r3107 }, { %r6106, %r6107, %r6108, %r6109 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6110, %r6111, %r6112, %r6113 }, { %r3340, %r3341, %r3342, %r3343 }, { %r3120, %r3121 }, { %r6110, %r6111, %r6112, %r6113 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6114, %r6115, %r6116, %r6117 }, { %r3340, %r3341, %r3342, %r3343 }, { %r3134, %r3135 }, { %r6114, %r6115, %r6116, %r6117 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6118, %r6119, %r6120, %r6121 }, { %r3340, %r3341, %r3342, %r3343 }, { %r3148, %r3149 }, { %r6118, %r6119, %r6120, %r6121 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6122, %r6123, %r6124, %r6125 }, { %r3340, %r3341, %r3342, %r3343 }, { %r3162, %r3163 }, { %r6122, %r6123, %r6124, %r6125 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6126, %r6127, %r6128, %r6129 }, { %r3340, %r3341, %r3342, %r3343 }, { %r3176, %r3177 }, { %r6126, %r6127, %r6128, %r6129 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6130, %r6131, %r6132, %r6133 }, { %r3340, %r3341, %r3342, %r3343 }, { %r3190, %r3191 }, { %r6130, %r6131, %r6132, %r6133 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6134, %r6135, %r6136, %r6137 }, { %r3340, %r3341, %r3342, %r3343 }, { %r3204, %r3205 }, { %r6134, %r6135, %r6136, %r6137 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6138, %r6139, %r6140, %r6141 }, { %r3452, %r3453, %r3454, %r3455 }, { %r3106, %r3107 }, { %r6138, %r6139, %r6140, %r6141 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6142, %r6143, %r6144, %r6145 }, { %r3452, %r3453, %r3454, %r3455 }, { %r3120, %r3121 }, { %r6142, %r6143, %r6144, %r6145 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6146, %r6147, %r6148, %r6149 }, { %r3452, %r3453, %r3454, %r3455 }, { %r3134, %r3135 }, { %r6146, %r6147, %r6148, %r6149 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6150, %r6151, %r6152, %r6153 }, { %r3452, %r3453, %r3454, %r3455 }, { %r3148, %r3149 }, { %r6150, %r6151, %r6152, %r6153 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6154, %r6155, %r6156, %r6157 }, { %r3452, %r3453, %r3454, %r3455 }, { %r3162, %r3163 }, { %r6154, %r6155, %r6156, %r6157 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6158, %r6159, %r6160, %r6161 }, { %r3452, %r3453, %r3454, %r3455 }, { %r3176, %r3177 }, { %r6158, %r6159, %r6160, %r6161 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6162, %r6163, %r6164, %r6165 }, { %r3452, %r3453, %r3454, %r3455 }, { %r3190, %r3191 }, { %r6162, %r6163, %r6164, %r6165 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6166, %r6167, %r6168, %r6169 }, { %r3452, %r3453, %r3454, %r3455 }, { %r3204, %r3205 }, { %r6166, %r6167, %r6168, %r6169 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6042, %r6043, %r6044, %r6045 }, { %r3564, %r3565, %r3566, %r3567 }, { %r3554, %r3555 }, { %r6042, %r6043, %r6044, %r6045 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6046, %r6047, %r6048, %r6049 }, { %r3564, %r3565, %r3566, %r3567 }, { %r3568, %r3569 }, { %r6046, %r6047, %r6048, %r6049 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6050, %r6051, %r6052, %r6053 }, { %r3564, %r3565, %r3566, %r3567 }, { %r3582, %r3583 }, { %r6050, %r6051, %r6052, %r6053 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6054, %r6055, %r6056, %r6057 }, { %r3564, %r3565, %r3566, %r3567 }, { %r3596, %r3597 }, { %r6054, %r6055, %r6056, %r6057 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6058, %r6059, %r6060, %r6061 }, { %r3564, %r3565, %r3566, %r3567 }, { %r3610, %r3611 }, { %r6058, %r6059, %r6060, %r6061 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6062, %r6063, %r6064, %r6065 }, { %r3564, %r3565, %r3566, %r3567 }, { %r3624, %r3625 }, { %r6062, %r6063, %r6064, %r6065 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6066, %r6067, %r6068, %r6069 }, { %r3564, %r3565, %r3566, %r3567 }, { %r3638, %r3639 }, { %r6066, %r6067, %r6068, %r6069 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6070, %r6071, %r6072, %r6073 }, { %r3564, %r3565, %r3566, %r3567 }, { %r3652, %r3653 }, { %r6070, %r6071, %r6072, %r6073 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6074, %r6075, %r6076, %r6077 }, { %r3676, %r3677, %r3678, %r3679 }, { %r3554, %r3555 }, { %r6074, %r6075, %r6076, %r6077 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6078, %r6079, %r6080, %r6081 }, { %r3676, %r3677, %r3678, %r3679 }, { %r3568, %r3569 }, { %r6078, %r6079, %r6080, %r6081 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6082, %r6083, %r6084, %r6085 }, { %r3676, %r3677, %r3678, %r3679 }, { %r3582, %r3583 }, { %r6082, %r6083, %r6084, %r6085 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6086, %r6087, %r6088, %r6089 }, { %r3676, %r3677, %r3678, %r3679 }, { %r3596, %r3597 }, { %r6086, %r6087, %r6088, %r6089 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6090, %r6091, %r6092, %r6093 }, { %r3676, %r3677, %r3678, %r3679 }, { %r3610, %r3611 }, { %r6090, %r6091, %r6092, %r6093 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6094, %r6095, %r6096, %r6097 }, { %r3676, %r3677, %r3678, %r3679 }, { %r3624, %r3625 }, { %r6094, %r6095, %r6096, %r6097 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6098, %r6099, %r6100, %r6101 }, { %r3676, %r3677, %r3678, %r3679 }, { %r3638, %r3639 }, { %r6098, %r6099, %r6100, %r6101 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6102, %r6103, %r6104, %r6105 }, { %r3676, %r3677, %r3678, %r3679 }, { %r3652, %r3653 }, { %r6102, %r6103, %r6104, %r6105 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6106, %r6107, %r6108, %r6109 }, { %r3788, %r3789, %r3790, %r3791 }, { %r3554, %r3555 }, { %r6106, %r6107, %r6108, %r6109 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6110, %r6111, %r6112, %r6113 }, { %r3788, %r3789, %r3790, %r3791 }, { %r3568, %r3569 }, { %r6110, %r6111, %r6112, %r6113 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6114, %r6115, %r6116, %r6117 }, { %r3788, %r3789, %r3790, %r3791 }, { %r3582, %r3583 }, { %r6114, %r6115, %r6116, %r6117 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6118, %r6119, %r6120, %r6121 }, { %r3788, %r3789, %r3790, %r3791 }, { %r3596, %r3597 }, { %r6118, %r6119, %r6120, %r6121 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6122, %r6123, %r6124, %r6125 }, { %r3788, %r3789, %r3790, %r3791 }, { %r3610, %r3611 }, { %r6122, %r6123, %r6124, %r6125 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6126, %r6127, %r6128, %r6129 }, { %r3788, %r3789, %r3790, %r3791 }, { %r3624, %r3625 }, { %r6126, %r6127, %r6128, %r6129 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6130, %r6131, %r6132, %r6133 }, { %r3788, %r3789, %r3790, %r3791 }, { %r3638, %r3639 }, { %r6130, %r6131, %r6132, %r6133 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6134, %r6135, %r6136, %r6137 }, { %r3788, %r3789, %r3790, %r3791 }, { %r3652, %r3653 }, { %r6134, %r6135, %r6136, %r6137 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6138, %r6139, %r6140, %r6141 }, { %r3900, %r3901, %r3902, %r3903 }, { %r3554, %r3555 }, { %r6138, %r6139, %r6140, %r6141 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6142, %r6143, %r6144, %r6145 }, { %r3900, %r3901, %r3902, %r3903 }, { %r3568, %r3569 }, { %r6142, %r6143, %r6144, %r6145 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6146, %r6147, %r6148, %r6149 }, { %r3900, %r3901, %r3902, %r3903 }, { %r3582, %r3583 }, { %r6146, %r6147, %r6148, %r6149 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6150, %r6151, %r6152, %r6153 }, { %r3900, %r3901, %r3902, %r3903 }, { %r3596, %r3597 }, { %r6150, %r6151, %r6152, %r6153 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6154, %r6155, %r6156, %r6157 }, { %r3900, %r3901, %r3902, %r3903 }, { %r3610, %r3611 }, { %r6154, %r6155, %r6156, %r6157 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6158, %r6159, %r6160, %r6161 }, { %r3900, %r3901, %r3902, %r3903 }, { %r3624, %r3625 }, { %r6158, %r6159, %r6160, %r6161 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6162, %r6163, %r6164, %r6165 }, { %r3900, %r3901, %r3902, %r3903 }, { %r3638, %r3639 }, { %r6162, %r6163, %r6164, %r6165 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6166, %r6167, %r6168, %r6169 }, { %r3900, %r3901, %r3902, %r3903 }, { %r3652, %r3653 }, { %r6166, %r6167, %r6168, %r6169 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6042, %r6043, %r6044, %r6045 }, { %r4012, %r4013, %r4014, %r4015 }, { %r4002, %r4003 }, { %r6042, %r6043, %r6044, %r6045 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6046, %r6047, %r6048, %r6049 }, { %r4012, %r4013, %r4014, %r4015 }, { %r4016, %r4017 }, { %r6046, %r6047, %r6048, %r6049 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6050, %r6051, %r6052, %r6053 }, { %r4012, %r4013, %r4014, %r4015 }, { %r4030, %r4031 }, { %r6050, %r6051, %r6052, %r6053 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6054, %r6055, %r6056, %r6057 }, { %r4012, %r4013, %r4014, %r4015 }, { %r4044, %r4045 }, { %r6054, %r6055, %r6056, %r6057 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6058, %r6059, %r6060, %r6061 }, { %r4012, %r4013, %r4014, %r4015 }, { %r4058, %r4059 }, { %r6058, %r6059, %r6060, %r6061 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6062, %r6063, %r6064, %r6065 }, { %r4012, %r4013, %r4014, %r4015 }, { %r4072, %r4073 }, { %r6062, %r6063, %r6064, %r6065 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6066, %r6067, %r6068, %r6069 }, { %r4012, %r4013, %r4014, %r4015 }, { %r4086, %r4087 }, { %r6066, %r6067, %r6068, %r6069 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6070, %r6071, %r6072, %r6073 }, { %r4012, %r4013, %r4014, %r4015 }, { %r4100, %r4101 }, { %r6070, %r6071, %r6072, %r6073 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6074, %r6075, %r6076, %r6077 }, { %r4124, %r4125, %r4126, %r4127 }, { %r4002, %r4003 }, { %r6074, %r6075, %r6076, %r6077 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6078, %r6079, %r6080, %r6081 }, { %r4124, %r4125, %r4126, %r4127 }, { %r4016, %r4017 }, { %r6078, %r6079, %r6080, %r6081 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6082, %r6083, %r6084, %r6085 }, { %r4124, %r4125, %r4126, %r4127 }, { %r4030, %r4031 }, { %r6082, %r6083, %r6084, %r6085 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6086, %r6087, %r6088, %r6089 }, { %r4124, %r4125, %r4126, %r4127 }, { %r4044, %r4045 }, { %r6086, %r6087, %r6088, %r6089 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6090, %r6091, %r6092, %r6093 }, { %r4124, %r4125, %r4126, %r4127 }, { %r4058, %r4059 }, { %r6090, %r6091, %r6092, %r6093 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6094, %r6095, %r6096, %r6097 }, { %r4124, %r4125, %r4126, %r4127 }, { %r4072, %r4073 }, { %r6094, %r6095, %r6096, %r6097 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6098, %r6099, %r6100, %r6101 }, { %r4124, %r4125, %r4126, %r4127 }, { %r4086, %r4087 }, { %r6098, %r6099, %r6100, %r6101 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6102, %r6103, %r6104, %r6105 }, { %r4124, %r4125, %r4126, %r4127 }, { %r4100, %r4101 }, { %r6102, %r6103, %r6104, %r6105 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6106, %r6107, %r6108, %r6109 }, { %r4236, %r4237, %r4238, %r4239 }, { %r4002, %r4003 }, { %r6106, %r6107, %r6108, %r6109 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6110, %r6111, %r6112, %r6113 }, { %r4236, %r4237, %r4238, %r4239 }, { %r4016, %r4017 }, { %r6110, %r6111, %r6112, %r6113 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6114, %r6115, %r6116, %r6117 }, { %r4236, %r4237, %r4238, %r4239 }, { %r4030, %r4031 }, { %r6114, %r6115, %r6116, %r6117 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6118, %r6119, %r6120, %r6121 }, { %r4236, %r4237, %r4238, %r4239 }, { %r4044, %r4045 }, { %r6118, %r6119, %r6120, %r6121 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6122, %r6123, %r6124, %r6125 }, { %r4236, %r4237, %r4238, %r4239 }, { %r4058, %r4059 }, { %r6122, %r6123, %r6124, %r6125 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6126, %r6127, %r6128, %r6129 }, { %r4236, %r4237, %r4238, %r4239 }, { %r4072, %r4073 }, { %r6126, %r6127, %r6128, %r6129 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6130, %r6131, %r6132, %r6133 }, { %r4236, %r4237, %r4238, %r4239 }, { %r4086, %r4087 }, { %r6130, %r6131, %r6132, %r6133 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6134, %r6135, %r6136, %r6137 }, { %r4236, %r4237, %r4238, %r4239 }, { %r4100, %r4101 }, { %r6134, %r6135, %r6136, %r6137 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6138, %r6139, %r6140, %r6141 }, { %r4348, %r4349, %r4350, %r4351 }, { %r4002, %r4003 }, { %r6138, %r6139, %r6140, %r6141 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6142, %r6143, %r6144, %r6145 }, { %r4348, %r4349, %r4350, %r4351 }, { %r4016, %r4017 }, { %r6142, %r6143, %r6144, %r6145 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6146, %r6147, %r6148, %r6149 }, { %r4348, %r4349, %r4350, %r4351 }, { %r4030, %r4031 }, { %r6146, %r6147, %r6148, %r6149 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6150, %r6151, %r6152, %r6153 }, { %r4348, %r4349, %r4350, %r4351 }, { %r4044, %r4045 }, { %r6150, %r6151, %r6152, %r6153 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6154, %r6155, %r6156, %r6157 }, { %r4348, %r4349, %r4350, %r4351 }, { %r4058, %r4059 }, { %r6154, %r6155, %r6156, %r6157 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6158, %r6159, %r6160, %r6161 }, { %r4348, %r4349, %r4350, %r4351 }, { %r4072, %r4073 }, { %r6158, %r6159, %r6160, %r6161 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6162, %r6163, %r6164, %r6165 }, { %r4348, %r4349, %r4350, %r4351 }, { %r4086, %r4087 }, { %r6162, %r6163, %r6164, %r6165 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6166, %r6167, %r6168, %r6169 }, { %r4348, %r4349, %r4350, %r4351 }, { %r4100, %r4101 }, { %r6166, %r6167, %r6168, %r6169 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6042, %r6043, %r6044, %r6045 }, { %r4460, %r4461, %r4462, %r4463 }, { %r4450, %r4451 }, { %r6042, %r6043, %r6044, %r6045 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6046, %r6047, %r6048, %r6049 }, { %r4460, %r4461, %r4462, %r4463 }, { %r4464, %r4465 }, { %r6046, %r6047, %r6048, %r6049 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6050, %r6051, %r6052, %r6053 }, { %r4460, %r4461, %r4462, %r4463 }, { %r4478, %r4479 }, { %r6050, %r6051, %r6052, %r6053 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6054, %r6055, %r6056, %r6057 }, { %r4460, %r4461, %r4462, %r4463 }, { %r4492, %r4493 }, { %r6054, %r6055, %r6056, %r6057 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6058, %r6059, %r6060, %r6061 }, { %r4460, %r4461, %r4462, %r4463 }, { %r4506, %r4507 }, { %r6058, %r6059, %r6060, %r6061 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6062, %r6063, %r6064, %r6065 }, { %r4460, %r4461, %r4462, %r4463 }, { %r4520, %r4521 }, { %r6062, %r6063, %r6064, %r6065 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6066, %r6067, %r6068, %r6069 }, { %r4460, %r4461, %r4462, %r4463 }, { %r4534, %r4535 }, { %r6066, %r6067, %r6068, %r6069 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6070, %r6071, %r6072, %r6073 }, { %r4460, %r4461, %r4462, %r4463 }, { %r4548, %r4549 }, { %r6070, %r6071, %r6072, %r6073 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6074, %r6075, %r6076, %r6077 }, { %r4572, %r4573, %r4574, %r4575 }, { %r4450, %r4451 }, { %r6074, %r6075, %r6076, %r6077 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6078, %r6079, %r6080, %r6081 }, { %r4572, %r4573, %r4574, %r4575 }, { %r4464, %r4465 }, { %r6078, %r6079, %r6080, %r6081 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6082, %r6083, %r6084, %r6085 }, { %r4572, %r4573, %r4574, %r4575 }, { %r4478, %r4479 }, { %r6082, %r6083, %r6084, %r6085 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6086, %r6087, %r6088, %r6089 }, { %r4572, %r4573, %r4574, %r4575 }, { %r4492, %r4493 }, { %r6086, %r6087, %r6088, %r6089 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6090, %r6091, %r6092, %r6093 }, { %r4572, %r4573, %r4574, %r4575 }, { %r4506, %r4507 }, { %r6090, %r6091, %r6092, %r6093 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6094, %r6095, %r6096, %r6097 }, { %r4572, %r4573, %r4574, %r4575 }, { %r4520, %r4521 }, { %r6094, %r6095, %r6096, %r6097 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6098, %r6099, %r6100, %r6101 }, { %r4572, %r4573, %r4574, %r4575 }, { %r4534, %r4535 }, { %r6098, %r6099, %r6100, %r6101 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6102, %r6103, %r6104, %r6105 }, { %r4572, %r4573, %r4574, %r4575 }, { %r4548, %r4549 }, { %r6102, %r6103, %r6104, %r6105 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6106, %r6107, %r6108, %r6109 }, { %r4684, %r4685, %r4686, %r4687 }, { %r4450, %r4451 }, { %r6106, %r6107, %r6108, %r6109 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6110, %r6111, %r6112, %r6113 }, { %r4684, %r4685, %r4686, %r4687 }, { %r4464, %r4465 }, { %r6110, %r6111, %r6112, %r6113 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6114, %r6115, %r6116, %r6117 }, { %r4684, %r4685, %r4686, %r4687 }, { %r4478, %r4479 }, { %r6114, %r6115, %r6116, %r6117 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6118, %r6119, %r6120, %r6121 }, { %r4684, %r4685, %r4686, %r4687 }, { %r4492, %r4493 }, { %r6118, %r6119, %r6120, %r6121 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6122, %r6123, %r6124, %r6125 }, { %r4684, %r4685, %r4686, %r4687 }, { %r4506, %r4507 }, { %r6122, %r6123, %r6124, %r6125 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6126, %r6127, %r6128, %r6129 }, { %r4684, %r4685, %r4686, %r4687 }, { %r4520, %r4521 }, { %r6126, %r6127, %r6128, %r6129 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6130, %r6131, %r6132, %r6133 }, { %r4684, %r4685, %r4686, %r4687 }, { %r4534, %r4535 }, { %r6130, %r6131, %r6132, %r6133 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6134, %r6135, %r6136, %r6137 }, { %r4684, %r4685, %r4686, %r4687 }, { %r4548, %r4549 }, { %r6134, %r6135, %r6136, %r6137 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6138, %r6139, %r6140, %r6141 }, { %r4796, %r4797, %r4798, %r4799 }, { %r4450, %r4451 }, { %r6138, %r6139, %r6140, %r6141 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6142, %r6143, %r6144, %r6145 }, { %r4796, %r4797, %r4798, %r4799 }, { %r4464, %r4465 }, { %r6142, %r6143, %r6144, %r6145 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6146, %r6147, %r6148, %r6149 }, { %r4796, %r4797, %r4798, %r4799 }, { %r4478, %r4479 }, { %r6146, %r6147, %r6148, %r6149 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6150, %r6151, %r6152, %r6153 }, { %r4796, %r4797, %r4798, %r4799 }, { %r4492, %r4493 }, { %r6150, %r6151, %r6152, %r6153 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6154, %r6155, %r6156, %r6157 }, { %r4796, %r4797, %r4798, %r4799 }, { %r4506, %r4507 }, { %r6154, %r6155, %r6156, %r6157 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6158, %r6159, %r6160, %r6161 }, { %r4796, %r4797, %r4798, %r4799 }, { %r4520, %r4521 }, { %r6158, %r6159, %r6160, %r6161 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6162, %r6163, %r6164, %r6165 }, { %r4796, %r4797, %r4798, %r4799 }, { %r4534, %r4535 }, { %r6162, %r6163, %r6164, %r6165 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 { %r6166, %r6167, %r6168, %r6169 }, { %r4796, %r4797, %r4798, %r4799 }, { %r4548, %r4549 }, { %r6166, %r6167, %r6168, %r6169 };
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) .loc 1 135 133 // specialize.py:135:133
(EngineCore_0 pid=142) setp.ne.s32 %p68, %r6171, %r242;
(EngineCore_0 pid=142) @%p68 bra $L__BB0_21;
(EngineCore_0 pid=142) // %bb.20: // in Loop: Header=BB0_19 Depth=1
(EngineCore_0 pid=142) .loc 1 0 133 // specialize.py:0:133
(EngineCore_0 pid=142) setp.lt.u32 %p69, %r230, 32;
(EngineCore_0 pid=142) .loc 1 220 24 // specialize.py:220:24
(EngineCore_0 pid=142) add.s32 %r6172, %r6172, 170;
(EngineCore_0 pid=142) $L__tmp27:
(EngineCore_0 pid=142) .loc 4 69 26 // _p_matmul_ogs.py:69:26 @[ specialize.py:225:25 ]
(EngineCore_0 pid=142) rem.s32 %r4948, %r6172, %r226;
(EngineCore_0 pid=142) .loc 2 50 22 // _common.py:50:22 @[ specialize.py:225:25 ]
(EngineCore_0 pid=142) div.s32 %r4950, %r4948, %r229;
(EngineCore_0 pid=142) .loc 2 51 41 // _common.py:51:41 @[ specialize.py:225:25 ]
(EngineCore_0 pid=142) shl.b32 %r4951, %r4950, 3;
(EngineCore_0 pid=142) .loc 2 51 30 // _common.py:51:30 @[ specialize.py:225:25 ]
(EngineCore_0 pid=142) sub.s32 %r4952, %r224, %r4951;
(EngineCore_0 pid=142) .loc 2 51 50 // _common.py:51:50 @[ specialize.py:225:25 ]
(EngineCore_0 pid=142) min.s32 %r4953, %r4952, 8;
(EngineCore_0 pid=142) .loc 2 52 40 // _common.py:52:40 @[ specialize.py:225:25 ]
(EngineCore_0 pid=142) rem.s32 %r4954, %r4948, %r4953;
(EngineCore_0 pid=142) .loc 2 52 34 // _common.py:52:34 @[ specialize.py:225:25 ]
(EngineCore_0 pid=142) add.s32 %r4955, %r4951, %r4954;
(EngineCore_0 pid=142) .loc 2 53 19 // _common.py:53:19 @[ specialize.py:225:25 ]
(EngineCore_0 pid=142) mul.lo.s32 %r4956, %r4950, %r229;
(EngineCore_0 pid=142) sub.s32 %r4957, %r4948, %r4956;
(EngineCore_0 pid=142) .loc 2 53 30 // _common.py:53:30 @[ specialize.py:225:25 ]
(EngineCore_0 pid=142) div.s32 %r4958, %r4957, %r4953;
(EngineCore_0 pid=142) .loc 4 84 39 // _p_matmul_ogs.py:84:39 @[ specialize.py:225:25 ]
(EngineCore_0 pid=142) mul.wide.s32 %rd93, %r4955, 4;
(EngineCore_0 pid=142) add.s64 %rd69, %rd17, %rd93;
(EngineCore_0 pid=142) .loc 4 84 28 // _p_matmul_ogs.py:84:28 @[ specialize.py:225:25 ]
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mov.u32 %r4928, 0x0;
(EngineCore_0 pid=142) ld.global.b32 { %r4928 }, [ %rd69 + 0 ];
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) .loc 4 85 30 // _p_matmul_ogs.py:85:30 @[ specialize.py:225:25 ]
(EngineCore_0 pid=142) and.b32 %r4959, %r4928, 65535;
(EngineCore_0 pid=142) .loc 4 87 32 // _p_matmul_ogs.py:87:32 @[ specialize.py:225:25 ]
(EngineCore_0 pid=142) mul.wide.u32 %rd94, %r4959, 4;
(EngineCore_0 pid=142) add.s64 %rd70, %rd14, %rd94;
(EngineCore_0 pid=142) .loc 4 87 21 // _p_matmul_ogs.py:87:21 @[ specialize.py:225:25 ]
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mov.u32 %r4936, 0x0;
(EngineCore_0 pid=142) ld.global.b32 { %r4936 }, [ %rd70 + 0 ];
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) .loc 4 88 37 // _p_matmul_ogs.py:88:37 @[ specialize.py:225:25 ]
(EngineCore_0 pid=142) add.s64 %rd71, %rd15, %rd94;
(EngineCore_0 pid=142) .loc 4 88 26 // _p_matmul_ogs.py:88:26 @[ specialize.py:225:25 ]
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mov.u32 %r4930, 0x0;
(EngineCore_0 pid=142) ld.global.b32 { %r4930 }, [ %rd71 + 0 ];
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) .loc 4 91 22 // _p_matmul_ogs.py:91:22 @[ specialize.py:225:25 ]
(EngineCore_0 pid=142) shr.s32 %r4960, %r4928, 9;
(EngineCore_0 pid=142) and.b32 %r4942, %r4960, -128;
(EngineCore_0 pid=142) .loc 4 92 22 // _p_matmul_ogs.py:92:22 @[ specialize.py:225:25 ]
(EngineCore_0 pid=142) shl.b32 %r4961, %r4958, 8;
(EngineCore_0 pid=142) $L__tmp28:
(EngineCore_0 pid=142) .loc 1 261 71 // specialize.py:261:71
(EngineCore_0 pid=142) cvt.s64.s32 %rd95, %r4930;
(EngineCore_0 pid=142) .loc 1 261 85 // specialize.py:261:85
(EngineCore_0 pid=142) mul.lo.s64 %rd96, %rd95, %rd4;
(EngineCore_0 pid=142) .loc 1 261 59 // specialize.py:261:59
(EngineCore_0 pid=142) shl.b64 %rd97, %rd96, 1;
(EngineCore_0 pid=142) add.s64 %rd73, %rd11, %rd97;
(EngineCore_0 pid=142) .loc 1 267 16 // specialize.py:267:16
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) barrier.sync.aligned 0, 256 ;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) mov.b32 %r4940, 0;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p69 st.shared.b32 [ %r4931 + 0 ], %r4940;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) bar.warp.sync -1;
(EngineCore_0 pid=142) cvt.u64.u32 %rd72, %r4943;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p93 tensormap.replace.tile.global_address.shared::cta.b1024.b64 [ %rd72 + 0 ], %rd73;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p93 tensormap.replace.tile.rank.shared::cta.b1024.b32 [ %rd72 + 0 ], 0x1;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) mov.b32 %r4933, 64;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p93 tensormap.replace.tile.box_dim.shared::cta.b1024.b32 [ %rd72 + 0 ], 0x0, %r4933;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) mov.b32 %r4934, 128;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p93 tensormap.replace.tile.box_dim.shared::cta.b1024.b32 [ %rd72 + 0 ], 0x1, %r4934;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p93 tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [ %rd72 + 0 ], 0x0, %r4935;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p93 tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [ %rd72 + 0 ], 0x1, %r4936;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p93 tensormap.replace.tile.global_stride.shared::cta.b1024.b64 [ %rd72 + 0 ], 0x0, %rd80;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) mov.b32 %r4937, 1;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p93 tensormap.replace.tile.element_stride.shared::cta.b1024.b32 [ %rd72 + 0 ], 0x0, %r4937;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p93 tensormap.replace.tile.element_stride.shared::cta.b1024.b32 [ %rd72 + 0 ], 0x1, %r4937;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p93 tensormap.replace.tile.elemtype.shared::cta.b1024.b32 [ %rd72 + 0 ], 0xa;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p93 tensormap.replace.tile.interleave_layout.shared::cta.b1024.b32 [ %rd72 + 0 ], 0x0;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p93 tensormap.replace.tile.swizzle_mode.shared::cta.b1024.b32 [ %rd72 + 0 ], 0x3;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p93 tensormap.replace.tile.fill_mode.shared::cta.b1024.b32 [ %rd72 + 0 ], 0x0;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p69 tensormap.cp_fenceproxy.global.shared::cta.tensormap::generic.release.gpu.sync.aligned [ %rd87 + 0 ], [ %rd72 + 0 ], 0x80;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p69 fence.proxy.tensormap::generic.acquire.gpu [ %rd87 + 0 ], 0x80;
(EngineCore_0 pid=142) @%p69 cp.async.bulk.commit_group ;
(EngineCore_0 pid=142) @%p69 cp.async.bulk.wait_group.read 0 ;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) barrier.sync.aligned 0, 256 ;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) .loc 1 274 28 // specialize.py:274:28
(EngineCore_0 pid=142) or.b32 %r4963, %r4961, %r230;
(EngineCore_0 pid=142) .loc 1 275 28 // specialize.py:275:28
(EngineCore_0 pid=142) setp.lt.s32 %p87, %r4963, %r665;
(EngineCore_0 pid=142) .loc 1 277 35 // specialize.py:277:35
(EngineCore_0 pid=142) mul.lo.s32 %r4964, %r4959, %r664;
(EngineCore_0 pid=142) .loc 1 277 24 // specialize.py:277:24
(EngineCore_0 pid=142) mul.wide.s32 %rd98, %r4964, 4;
(EngineCore_0 pid=142) add.s64 %rd99, %rd12, %rd98;
(EngineCore_0 pid=142) .loc 1 277 48 // specialize.py:277:48
(EngineCore_0 pid=142) mul.wide.s32 %rd100, %r4963, 4;
(EngineCore_0 pid=142) add.s64 %rd90, %rd99, %rd100;
(EngineCore_0 pid=142) .loc 1 279 31 // specialize.py:279:31
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) mov.u32 %r4939, %r4940;
(EngineCore_0 pid=142) @%p87 ld.global.b32 { %r4939 }, [ %rd90 + 0 ];
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) .loc 1 304 27 // specialize.py:304:27
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) barrier.sync.aligned 0, 256 ;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) st.shared.b32 [%r260], %r4939;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) barrier.sync.aligned 0, 256 ;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) ld.shared.v2.b32 {%r4965, %r4966}, [%r261];
(EngineCore_0 pid=142) ld.shared.v2.b32 {%r4967, %r4968}, [%r261+512];
(EngineCore_0 pid=142) ld.shared.v2.b32 {%r4969, %r4970}, [%r261+256];
(EngineCore_0 pid=142) ld.shared.v2.b32 {%r4971, %r4972}, [%r261+768];
(EngineCore_0 pid=142) .loc 1 325 34 // specialize.py:325:34
(EngineCore_0 pid=142) add.f32 %r4973, %r6054, %r4965;
(EngineCore_0 pid=142) add.f32 %r4974, %r6050, %r4965;
(EngineCore_0 pid=142) add.f32 %r4975, %r6046, %r4965;
(EngineCore_0 pid=142) add.f32 %r4976, %r6042, %r4965;
(EngineCore_0 pid=142) add.f32 %r4977, %r6055, %r4965;
(EngineCore_0 pid=142) add.f32 %r4978, %r6051, %r4965;
(EngineCore_0 pid=142) add.f32 %r4979, %r6047, %r4965;
(EngineCore_0 pid=142) add.f32 %r4980, %r6043, %r4965;
(EngineCore_0 pid=142) add.f32 %r4981, %r6056, %r4966;
(EngineCore_0 pid=142) add.f32 %r4982, %r6052, %r4966;
(EngineCore_0 pid=142) add.f32 %r4983, %r6048, %r4966;
(EngineCore_0 pid=142) add.f32 %r4984, %r6044, %r4966;
(EngineCore_0 pid=142) add.f32 %r4985, %r6057, %r4966;
(EngineCore_0 pid=142) add.f32 %r4986, %r6053, %r4966;
(EngineCore_0 pid=142) add.f32 %r4987, %r6049, %r4966;
(EngineCore_0 pid=142) add.f32 %r4988, %r6045, %r4966;
(EngineCore_0 pid=142) add.f32 %r4989, %r6070, %r4965;
(EngineCore_0 pid=142) add.f32 %r4990, %r6066, %r4965;
(EngineCore_0 pid=142) add.f32 %r4991, %r6062, %r4965;
(EngineCore_0 pid=142) add.f32 %r4992, %r6058, %r4965;
(EngineCore_0 pid=142) add.f32 %r4993, %r6071, %r4965;
(EngineCore_0 pid=142) add.f32 %r4994, %r6067, %r4965;
(EngineCore_0 pid=142) add.f32 %r4995, %r6063, %r4965;
(EngineCore_0 pid=142) add.f32 %r4996, %r6059, %r4965;
(EngineCore_0 pid=142) add.f32 %r4997, %r6072, %r4966;
(EngineCore_0 pid=142) add.f32 %r4998, %r6068, %r4966;
(EngineCore_0 pid=142) add.f32 %r4999, %r6064, %r4966;
(EngineCore_0 pid=142) add.f32 %r5000, %r6060, %r4966;
(EngineCore_0 pid=142) add.f32 %r5001, %r6073, %r4966;
(EngineCore_0 pid=142) add.f32 %r5002, %r6069, %r4966;
(EngineCore_0 pid=142) add.f32 %r5003, %r6065, %r4966;
(EngineCore_0 pid=142) add.f32 %r5004, %r6061, %r4966;
(EngineCore_0 pid=142) add.f32 %r5005, %r6086, %r4969;
(EngineCore_0 pid=142) add.f32 %r5006, %r6082, %r4969;
(EngineCore_0 pid=142) add.f32 %r5007, %r6078, %r4969;
(EngineCore_0 pid=142) add.f32 %r5008, %r6074, %r4969;
(EngineCore_0 pid=142) add.f32 %r5009, %r6087, %r4969;
(EngineCore_0 pid=142) add.f32 %r5010, %r6083, %r4969;
(EngineCore_0 pid=142) add.f32 %r5011, %r6079, %r4969;
(EngineCore_0 pid=142) add.f32 %r5012, %r6075, %r4969;
(EngineCore_0 pid=142) add.f32 %r5013, %r6088, %r4970;
(EngineCore_0 pid=142) add.f32 %r5014, %r6084, %r4970;
(EngineCore_0 pid=142) add.f32 %r5015, %r6080, %r4970;
(EngineCore_0 pid=142) add.f32 %r5016, %r6076, %r4970;
(EngineCore_0 pid=142) add.f32 %r5017, %r6089, %r4970;
(EngineCore_0 pid=142) add.f32 %r5018, %r6085, %r4970;
(EngineCore_0 pid=142) add.f32 %r5019, %r6081, %r4970;
(EngineCore_0 pid=142) add.f32 %r5020, %r6077, %r4970;
(EngineCore_0 pid=142) add.f32 %r5021, %r6102, %r4969;
(EngineCore_0 pid=142) add.f32 %r5022, %r6098, %r4969;
(EngineCore_0 pid=142) add.f32 %r5023, %r6094, %r4969;
(EngineCore_0 pid=142) add.f32 %r5024, %r6090, %r4969;
(EngineCore_0 pid=142) add.f32 %r5025, %r6103, %r4969;
(EngineCore_0 pid=142) add.f32 %r5026, %r6099, %r4969;
(EngineCore_0 pid=142) add.f32 %r5027, %r6095, %r4969;
(EngineCore_0 pid=142) add.f32 %r5028, %r6091, %r4969;
(EngineCore_0 pid=142) add.f32 %r5029, %r6104, %r4970;
(EngineCore_0 pid=142) add.f32 %r5030, %r6100, %r4970;
(EngineCore_0 pid=142) add.f32 %r5031, %r6096, %r4970;
(EngineCore_0 pid=142) add.f32 %r5032, %r6092, %r4970;
(EngineCore_0 pid=142) add.f32 %r5033, %r6105, %r4970;
(EngineCore_0 pid=142) add.f32 %r5034, %r6101, %r4970;
(EngineCore_0 pid=142) add.f32 %r5035, %r6097, %r4970;
(EngineCore_0 pid=142) add.f32 %r5036, %r6093, %r4970;
(EngineCore_0 pid=142) $L__tmp29:
(EngineCore_0 pid=142) .loc 3 51 46 // _swiglu.py:51:46 @[ specialize.py:330:46 ]
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) barrier.sync.aligned 0, 256 ;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) st.shared.v4.b32 [%r262], {%r4976, %r4975, %r4974, %r4973};
(EngineCore_0 pid=142) st.shared.v4.b32 [%r262+128], {%r4980, %r4979, %r4978, %r4977};
(EngineCore_0 pid=142) st.shared.v4.b32 [%r262+2048], {%r5008, %r5007, %r5006, %r5005};
(EngineCore_0 pid=142) st.shared.v4.b32 [%r262+2176], {%r5012, %r5011, %r5010, %r5009};
(EngineCore_0 pid=142) st.shared.v4.b32 [%r263], {%r4984, %r4983, %r4982, %r4981};
(EngineCore_0 pid=142) st.shared.v4.b32 [%r263+128], {%r4988, %r4987, %r4986, %r4985};
(EngineCore_0 pid=142) st.shared.v4.b32 [%r263+2048], {%r5016, %r5015, %r5014, %r5013};
(EngineCore_0 pid=142) st.shared.v4.b32 [%r263+2176], {%r5020, %r5019, %r5018, %r5017};
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) barrier.sync.aligned 0, 256 ;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) ld.shared.v4.b32 {%r5037, %r5038, %r5039, %r5040}, [%r264];
(EngineCore_0 pid=142) ld.shared.v4.b32 {%r5041, %r5042, %r5043, %r5044}, [%r264+256];
(EngineCore_0 pid=142) ld.shared.v4.b32 {%r5045, %r5046, %r5047, %r5048}, [%r265];
(EngineCore_0 pid=142) ld.shared.v4.b32 {%r5049, %r5050, %r5051, %r5052}, [%r265+256];
(EngineCore_0 pid=142) ld.shared.v4.b32 {%r5053, %r5054, %r5055, %r5056}, [%r266];
(EngineCore_0 pid=142) ld.shared.v4.b32 {%r5057, %r5058, %r5059, %r5060}, [%r266+256];
(EngineCore_0 pid=142) ld.shared.v4.b32 {%r5061, %r5062, %r5063, %r5064}, [%r267];
(EngineCore_0 pid=142) ld.shared.v4.b32 {%r5065, %r5066, %r5067, %r5068}, [%r267+256];
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) barrier.sync.aligned 0, 256 ;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) st.shared.v4.b32 [%r262], {%r4992, %r4991, %r4990, %r4989};
(EngineCore_0 pid=142) st.shared.v4.b32 [%r262+128], {%r4996, %r4995, %r4994, %r4993};
(EngineCore_0 pid=142) st.shared.v4.b32 [%r262+2048], {%r5024, %r5023, %r5022, %r5021};
(EngineCore_0 pid=142) st.shared.v4.b32 [%r262+2176], {%r5028, %r5027, %r5026, %r5025};
(EngineCore_0 pid=142) st.shared.v4.b32 [%r263], {%r5000, %r4999, %r4998, %r4997};
(EngineCore_0 pid=142) st.shared.v4.b32 [%r263+128], {%r5004, %r5003, %r5002, %r5001};
(EngineCore_0 pid=142) st.shared.v4.b32 [%r263+2048], {%r5032, %r5031, %r5030, %r5029};
(EngineCore_0 pid=142) st.shared.v4.b32 [%r263+2176], {%r5036, %r5035, %r5034, %r5033};
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) barrier.sync.aligned 0, 256 ;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) ld.shared.v4.b32 {%r5069, %r5070, %r5071, %r5072}, [%r264];
(EngineCore_0 pid=142) ld.shared.v4.b32 {%r5073, %r5074, %r5075, %r5076}, [%r264+256];
(EngineCore_0 pid=142) ld.shared.v4.b32 {%r5077, %r5078, %r5079, %r5080}, [%r265];
(EngineCore_0 pid=142) ld.shared.v4.b32 {%r5081, %r5082, %r5083, %r5084}, [%r265+256];
(EngineCore_0 pid=142) ld.shared.v4.b32 {%r5085, %r5086, %r5087, %r5088}, [%r266];
(EngineCore_0 pid=142) ld.shared.v4.b32 {%r5089, %r5090, %r5091, %r5092}, [%r266+256];
(EngineCore_0 pid=142) ld.shared.v4.b32 {%r5093, %r5094, %r5095, %r5096}, [%r267];
(EngineCore_0 pid=142) ld.shared.v4.b32 {%r5097, %r5098, %r5099, %r5100}, [%r267+256];
(EngineCore_0 pid=142) .loc 3 8 24 // _swiglu.py:8:24 @[ specialize.py:330:46 ]
(EngineCore_0 pid=142) min.f32 %r5101, %r5037, %r669;
(EngineCore_0 pid=142) min.f32 %r5102, %r5053, %r669;
(EngineCore_0 pid=142) min.f32 %r5103, %r5041, %r669;
(EngineCore_0 pid=142) min.f32 %r5104, %r5057, %r669;
(EngineCore_0 pid=142) min.f32 %r5105, %r5038, %r669;
(EngineCore_0 pid=142) min.f32 %r5106, %r5054, %r669;
(EngineCore_0 pid=142) min.f32 %r5107, %r5042, %r669;
(EngineCore_0 pid=142) min.f32 %r5108, %r5058, %r669;
(EngineCore_0 pid=142) min.f32 %r5109, %r5039, %r669;
(EngineCore_0 pid=142) min.f32 %r5110, %r5055, %r669;
(EngineCore_0 pid=142) min.f32 %r5111, %r5043, %r669;
(EngineCore_0 pid=142) min.f32 %r5112, %r5059, %r669;
(EngineCore_0 pid=142) min.f32 %r5113, %r5040, %r669;
(EngineCore_0 pid=142) min.f32 %r5114, %r5056, %r669;
(EngineCore_0 pid=142) min.f32 %r5115, %r5044, %r669;
(EngineCore_0 pid=142) min.f32 %r5116, %r5060, %r669;
(EngineCore_0 pid=142) min.f32 %r5117, %r5069, %r669;
(EngineCore_0 pid=142) min.f32 %r5118, %r5085, %r669;
(EngineCore_0 pid=142) min.f32 %r5119, %r5073, %r669;
(EngineCore_0 pid=142) min.f32 %r5120, %r5089, %r669;
(EngineCore_0 pid=142) min.f32 %r5121, %r5070, %r669;
(EngineCore_0 pid=142) min.f32 %r5122, %r5086, %r669;
(EngineCore_0 pid=142) min.f32 %r5123, %r5074, %r669;
(EngineCore_0 pid=142) min.f32 %r5124, %r5090, %r669;
(EngineCore_0 pid=142) min.f32 %r5125, %r5071, %r669;
(EngineCore_0 pid=142) min.f32 %r5126, %r5087, %r669;
(EngineCore_0 pid=142) min.f32 %r5127, %r5075, %r669;
(EngineCore_0 pid=142) min.f32 %r5128, %r5091, %r669;
(EngineCore_0 pid=142) min.f32 %r5129, %r5072, %r669;
(EngineCore_0 pid=142) min.f32 %r5130, %r5088, %r669;
(EngineCore_0 pid=142) min.f32 %r5131, %r5076, %r669;
(EngineCore_0 pid=142) min.f32 %r5132, %r5092, %r669;
(EngineCore_0 pid=142) min.f32 %r5133, %r5045, %r669;
(EngineCore_0 pid=142) min.f32 %r5134, %r5061, %r669;
(EngineCore_0 pid=142) min.f32 %r5135, %r5049, %r669;
(EngineCore_0 pid=142) min.f32 %r5136, %r5065, %r669;
(EngineCore_0 pid=142) min.f32 %r5137, %r5046, %r669;
(EngineCore_0 pid=142) min.f32 %r5138, %r5062, %r669;
(EngineCore_0 pid=142) min.f32 %r5139, %r5050, %r669;
(EngineCore_0 pid=142) min.f32 %r5140, %r5066, %r669;
(EngineCore_0 pid=142) min.f32 %r5141, %r5047, %r669;
(EngineCore_0 pid=142) min.f32 %r5142, %r5063, %r669;
(EngineCore_0 pid=142) min.f32 %r5143, %r5051, %r669;
(EngineCore_0 pid=142) min.f32 %r5144, %r5067, %r669;
(EngineCore_0 pid=142) min.f32 %r5145, %r5048, %r669;
(EngineCore_0 pid=142) min.f32 %r5146, %r5064, %r669;
(EngineCore_0 pid=142) min.f32 %r5147, %r5052, %r669;
(EngineCore_0 pid=142) min.f32 %r5148, %r5068, %r669;
(EngineCore_0 pid=142) min.f32 %r5149, %r5077, %r669;
(EngineCore_0 pid=142) min.f32 %r5150, %r5093, %r669;
(EngineCore_0 pid=142) min.f32 %r5151, %r5081, %r669;
(EngineCore_0 pid=142) min.f32 %r5152, %r5097, %r669;
(EngineCore_0 pid=142) min.f32 %r5153, %r5078, %r669;
(EngineCore_0 pid=142) min.f32 %r5154, %r5094, %r669;
(EngineCore_0 pid=142) min.f32 %r5155, %r5082, %r669;
(EngineCore_0 pid=142) min.f32 %r5156, %r5098, %r669;
(EngineCore_0 pid=142) min.f32 %r5157, %r5079, %r669;
(EngineCore_0 pid=142) min.f32 %r5158, %r5095, %r669;
(EngineCore_0 pid=142) min.f32 %r5159, %r5083, %r669;
(EngineCore_0 pid=142) min.f32 %r5160, %r5099, %r669;
(EngineCore_0 pid=142) min.f32 %r5161, %r5080, %r669;
(EngineCore_0 pid=142) min.f32 %r5162, %r5096, %r669;
(EngineCore_0 pid=142) min.f32 %r5163, %r5084, %r669;
(EngineCore_0 pid=142) min.f32 %r5164, %r5100, %r669;
(EngineCore_0 pid=142) .loc 3 10 33 // _swiglu.py:10:33 @[ specialize.py:330:46 ]
(EngineCore_0 pid=142) max.f32 %r5165, %r231, %r5133;
(EngineCore_0 pid=142) max.f32 %r5166, %r231, %r5134;
(EngineCore_0 pid=142) max.f32 %r5167, %r231, %r5135;
(EngineCore_0 pid=142) max.f32 %r5168, %r231, %r5136;
(EngineCore_0 pid=142) max.f32 %r5169, %r231, %r5137;
(EngineCore_0 pid=142) max.f32 %r5170, %r231, %r5138;
(EngineCore_0 pid=142) max.f32 %r5171, %r231, %r5139;
(EngineCore_0 pid=142) max.f32 %r5172, %r231, %r5140;
(EngineCore_0 pid=142) max.f32 %r5173, %r231, %r5141;
(EngineCore_0 pid=142) max.f32 %r5174, %r231, %r5142;
(EngineCore_0 pid=142) max.f32 %r5175, %r231, %r5143;
(EngineCore_0 pid=142) max.f32 %r5176, %r231, %r5144;
(EngineCore_0 pid=142) max.f32 %r5177, %r231, %r5145;
(EngineCore_0 pid=142) max.f32 %r5178, %r231, %r5146;
(EngineCore_0 pid=142) max.f32 %r5179, %r231, %r5147;
(EngineCore_0 pid=142) max.f32 %r5180, %r231, %r5148;
(EngineCore_0 pid=142) max.f32 %r5181, %r231, %r5149;
(EngineCore_0 pid=142) max.f32 %r5182, %r231, %r5150;
(EngineCore_0 pid=142) max.f32 %r5183, %r231, %r5151;
(EngineCore_0 pid=142) max.f32 %r5184, %r231, %r5152;
(EngineCore_0 pid=142) max.f32 %r5185, %r231, %r5153;
(EngineCore_0 pid=142) max.f32 %r5186, %r231, %r5154;
(EngineCore_0 pid=142) max.f32 %r5187, %r231, %r5155;
(EngineCore_0 pid=142) max.f32 %r5188, %r231, %r5156;
(EngineCore_0 pid=142) max.f32 %r5189, %r231, %r5157;
(EngineCore_0 pid=142) max.f32 %r5190, %r231, %r5158;
(EngineCore_0 pid=142) max.f32 %r5191, %r231, %r5159;
(EngineCore_0 pid=142) max.f32 %r5192, %r231, %r5160;
(EngineCore_0 pid=142) max.f32 %r5193, %r231, %r5161;
(EngineCore_0 pid=142) max.f32 %r5194, %r231, %r5162;
(EngineCore_0 pid=142) max.f32 %r5195, %r231, %r5163;
(EngineCore_0 pid=142) max.f32 %r5196, %r231, %r5164;
(EngineCore_0 pid=142) .loc 3 45 36 // _swiglu.py:45:36 @[ specialize.py:330:46 ]
(EngineCore_0 pid=142) mul.f32 %r5197, %r232, %r5101;
(EngineCore_0 pid=142) mul.f32 %r5198, %r232, %r5102;
(EngineCore_0 pid=142) mul.f32 %r5199, %r232, %r5103;
(EngineCore_0 pid=142) mul.f32 %r5200, %r232, %r5104;
(EngineCore_0 pid=142) mul.f32 %r5201, %r232, %r5105;
(EngineCore_0 pid=142) mul.f32 %r5202, %r232, %r5106;
(EngineCore_0 pid=142) mul.f32 %r5203, %r232, %r5107;
(EngineCore_0 pid=142) mul.f32 %r5204, %r232, %r5108;
(EngineCore_0 pid=142) mul.f32 %r5205, %r232, %r5109;
(EngineCore_0 pid=142) mul.f32 %r5206, %r232, %r5110;
(EngineCore_0 pid=142) mul.f32 %r5207, %r232, %r5111;
(EngineCore_0 pid=142) mul.f32 %r5208, %r232, %r5112;
(EngineCore_0 pid=142) mul.f32 %r5209, %r232, %r5113;
(EngineCore_0 pid=142) mul.f32 %r5210, %r232, %r5114;
(EngineCore_0 pid=142) mul.f32 %r5211, %r232, %r5115;
(EngineCore_0 pid=142) mul.f32 %r5212, %r232, %r5116;
(EngineCore_0 pid=142) mul.f32 %r5213, %r232, %r5117;
(EngineCore_0 pid=142) mul.f32 %r5214, %r232, %r5118;
(EngineCore_0 pid=142) mul.f32 %r5215, %r232, %r5119;
(EngineCore_0 pid=142) mul.f32 %r5216, %r232, %r5120;
(EngineCore_0 pid=142) mul.f32 %r5217, %r232, %r5121;
(EngineCore_0 pid=142) mul.f32 %r5218, %r232, %r5122;
(EngineCore_0 pid=142) mul.f32 %r5219, %r232, %r5123;
(EngineCore_0 pid=142) mul.f32 %r5220, %r232, %r5124;
(EngineCore_0 pid=142) mul.f32 %r5221, %r232, %r5125;
(EngineCore_0 pid=142) mul.f32 %r5222, %r232, %r5126;
(EngineCore_0 pid=142) mul.f32 %r5223, %r232, %r5127;
(EngineCore_0 pid=142) mul.f32 %r5224, %r232, %r5128;
(EngineCore_0 pid=142) mul.f32 %r5225, %r232, %r5129;
(EngineCore_0 pid=142) mul.f32 %r5226, %r232, %r5130;
(EngineCore_0 pid=142) mul.f32 %r5227, %r232, %r5131;
(EngineCore_0 pid=142) mul.f32 %r5228, %r232, %r5132;
(EngineCore_0 pid=142) .loc 3 45 27 // _swiglu.py:45:27 @[ specialize.py:330:46 ]
(EngineCore_0 pid=142) mul.f32 %r5229, %r5197, 0f3FB8AA3B;
(EngineCore_0 pid=142) ex2.approx.f32 %r5230, %r5229;
(EngineCore_0 pid=142) mul.f32 %r5231, %r5198, 0f3FB8AA3B;
(EngineCore_0 pid=142) ex2.approx.f32 %r5232, %r5231;
(EngineCore_0 pid=142) mul.f32 %r5233, %r5199, 0f3FB8AA3B;
(EngineCore_0 pid=142) ex2.approx.f32 %r5234, %r5233;
(EngineCore_0 pid=142) mul.f32 %r5235, %r5200, 0f3FB8AA3B;
(EngineCore_0 pid=142) ex2.approx.f32 %r5236, %r5235;
(EngineCore_0 pid=142) mul.f32 %r5237, %r5201, 0f3FB8AA3B;
(EngineCore_0 pid=142) ex2.approx.f32 %r5238, %r5237;
(EngineCore_0 pid=142) mul.f32 %r5239, %r5202, 0f3FB8AA3B;
(EngineCore_0 pid=142) ex2.approx.f32 %r5240, %r5239;
(EngineCore_0 pid=142) mul.f32 %r5241, %r5203, 0f3FB8AA3B;
(EngineCore_0 pid=142) ex2.approx.f32 %r5242, %r5241;
(EngineCore_0 pid=142) mul.f32 %r5243, %r5204, 0f3FB8AA3B;
(EngineCore_0 pid=142) ex2.approx.f32 %r5244, %r5243;
(EngineCore_0 pid=142) mul.f32 %r5245, %r5205, 0f3FB8AA3B;
(EngineCore_0 pid=142) ex2.approx.f32 %r5246, %r5245;
(EngineCore_0 pid=142) mul.f32 %r5247, %r5206, 0f3FB8AA3B;
(EngineCore_0 pid=142) ex2.approx.f32 %r5248, %r5247;
(EngineCore_0 pid=142) mul.f32 %r5249, %r5207, 0f3FB8AA3B;
(EngineCore_0 pid=142) ex2.approx.f32 %r5250, %r5249;
(EngineCore_0 pid=142) mul.f32 %r5251, %r5208, 0f3FB8AA3B;
(EngineCore_0 pid=142) ex2.approx.f32 %r5252, %r5251;
(EngineCore_0 pid=142) mul.f32 %r5253, %r5209, 0f3FB8AA3B;
(EngineCore_0 pid=142) ex2.approx.f32 %r5254, %r5253;
(EngineCore_0 pid=142) mul.f32 %r5255, %r5210, 0f3FB8AA3B;
(EngineCore_0 pid=142) ex2.approx.f32 %r5256, %r5255;
(EngineCore_0 pid=142) mul.f32 %r5257, %r5211, 0f3FB8AA3B;
(EngineCore_0 pid=142) ex2.approx.f32 %r5258, %r5257;
(EngineCore_0 pid=142) mul.f32 %r5259, %r5212, 0f3FB8AA3B;
(EngineCore_0 pid=142) ex2.approx.f32 %r5260, %r5259;
(EngineCore_0 pid=142) mul.f32 %r5261, %r5213, 0f3FB8AA3B;
(EngineCore_0 pid=142) ex2.approx.f32 %r5262, %r5261;
(EngineCore_0 pid=142) mul.f32 %r5263, %r5214, 0f3FB8AA3B;
(EngineCore_0 pid=142) ex2.approx.f32 %r5264, %r5263;
(EngineCore_0 pid=142) mul.f32 %r5265, %r5215, 0f3FB8AA3B;
(EngineCore_0 pid=142) ex2.approx.f32 %r5266, %r5265;
(EngineCore_0 pid=142) mul.f32 %r5267, %r5216, 0f3FB8AA3B;
(EngineCore_0 pid=142) ex2.approx.f32 %r5268, %r5267;
(EngineCore_0 pid=142) mul.f32 %r5269, %r5217, 0f3FB8AA3B;
(EngineCore_0 pid=142) ex2.approx.f32 %r5270, %r5269;
(EngineCore_0 pid=142) mul.f32 %r5271, %r5218, 0f3FB8AA3B;
(EngineCore_0 pid=142) ex2.approx.f32 %r5272, %r5271;
(EngineCore_0 pid=142) mul.f32 %r5273, %r5219, 0f3FB8AA3B;
(EngineCore_0 pid=142) ex2.approx.f32 %r5274, %r5273;
(EngineCore_0 pid=142) mul.f32 %r5275, %r5220, 0f3FB8AA3B;
(EngineCore_0 pid=142) ex2.approx.f32 %r5276, %r5275;
(EngineCore_0 pid=142) mul.f32 %r5277, %r5221, 0f3FB8AA3B;
(EngineCore_0 pid=142) ex2.approx.f32 %r5278, %r5277;
(EngineCore_0 pid=142) mul.f32 %r5279, %r5222, 0f3FB8AA3B;
(EngineCore_0 pid=142) ex2.approx.f32 %r5280, %r5279;
(EngineCore_0 pid=142) mul.f32 %r5281, %r5223, 0f3FB8AA3B;
(EngineCore_0 pid=142) ex2.approx.f32 %r5282, %r5281;
(EngineCore_0 pid=142) mul.f32 %r5283, %r5224, 0f3FB8AA3B;
(EngineCore_0 pid=142) ex2.approx.f32 %r5284, %r5283;
(EngineCore_0 pid=142) mul.f32 %r5285, %r5225, 0f3FB8AA3B;
(EngineCore_0 pid=142) ex2.approx.f32 %r5286, %r5285;
(EngineCore_0 pid=142) mul.f32 %r5287, %r5226, 0f3FB8AA3B;
(EngineCore_0 pid=142) ex2.approx.f32 %r5288, %r5287;
(EngineCore_0 pid=142) mul.f32 %r5289, %r5227, 0f3FB8AA3B;
(EngineCore_0 pid=142) ex2.approx.f32 %r5290, %r5289;
(EngineCore_0 pid=142) mul.f32 %r5291, %r5228, 0f3FB8AA3B;
(EngineCore_0 pid=142) ex2.approx.f32 %r5292, %r5291;
(EngineCore_0 pid=142) .loc 3 45 20 // _swiglu.py:45:20 @[ specialize.py:330:46 ]
(EngineCore_0 pid=142) add.f32 %r5293, %r5230, 0f3F800000;
(EngineCore_0 pid=142) add.f32 %r5294, %r5232, 0f3F800000;
(EngineCore_0 pid=142) add.f32 %r5295, %r5234, 0f3F800000;
(EngineCore_0 pid=142) add.f32 %r5296, %r5236, 0f3F800000;
(EngineCore_0 pid=142) add.f32 %r5297, %r5238, 0f3F800000;
(EngineCore_0 pid=142) add.f32 %r5298, %r5240, 0f3F800000;
(EngineCore_0 pid=142) add.f32 %r5299, %r5242, 0f3F800000;
(EngineCore_0 pid=142) add.f32 %r5300, %r5244, 0f3F800000;
(EngineCore_0 pid=142) add.f32 %r5301, %r5246, 0f3F800000;
(EngineCore_0 pid=142) add.f32 %r5302, %r5248, 0f3F800000;
(EngineCore_0 pid=142) add.f32 %r5303, %r5250, 0f3F800000;
(EngineCore_0 pid=142) add.f32 %r5304, %r5252, 0f3F800000;
(EngineCore_0 pid=142) add.f32 %r5305, %r5254, 0f3F800000;
(EngineCore_0 pid=142) add.f32 %r5306, %r5256, 0f3F800000;
(EngineCore_0 pid=142) add.f32 %r5307, %r5258, 0f3F800000;
(EngineCore_0 pid=142) add.f32 %r5308, %r5260, 0f3F800000;
(EngineCore_0 pid=142) add.f32 %r5309, %r5262, 0f3F800000;
(EngineCore_0 pid=142) add.f32 %r5310, %r5264, 0f3F800000;
(EngineCore_0 pid=142) add.f32 %r5311, %r5266, 0f3F800000;
(EngineCore_0 pid=142) add.f32 %r5312, %r5268, 0f3F800000;
(EngineCore_0 pid=142) add.f32 %r5313, %r5270, 0f3F800000;
(EngineCore_0 pid=142) add.f32 %r5314, %r5272, 0f3F800000;
(EngineCore_0 pid=142) add.f32 %r5315, %r5274, 0f3F800000;
(EngineCore_0 pid=142) add.f32 %r5316, %r5276, 0f3F800000;
(EngineCore_0 pid=142) add.f32 %r5317, %r5278, 0f3F800000;
(EngineCore_0 pid=142) add.f32 %r5318, %r5280, 0f3F800000;
(EngineCore_0 pid=142) add.f32 %r5319, %r5282, 0f3F800000;
(EngineCore_0 pid=142) add.f32 %r5320, %r5284, 0f3F800000;
(EngineCore_0 pid=142) add.f32 %r5321, %r5286, 0f3F800000;
(EngineCore_0 pid=142) add.f32 %r5322, %r5288, 0f3F800000;
(EngineCore_0 pid=142) add.f32 %r5323, %r5290, 0f3F800000;
(EngineCore_0 pid=142) add.f32 %r5324, %r5292, 0f3F800000;
(EngineCore_0 pid=142) .loc 3 45 16 // _swiglu.py:45:16 @[ specialize.py:330:46 ]
(EngineCore_0 pid=142) div.full.f32 %r5325, %r5101, %r5293;
(EngineCore_0 pid=142) div.full.f32 %r5326, %r5102, %r5294;
(EngineCore_0 pid=142) div.full.f32 %r5327, %r5103, %r5295;
(EngineCore_0 pid=142) div.full.f32 %r5328, %r5104, %r5296;
(EngineCore_0 pid=142) div.full.f32 %r5329, %r5105, %r5297;
(EngineCore_0 pid=142) div.full.f32 %r5330, %r5106, %r5298;
(EngineCore_0 pid=142) div.full.f32 %r5331, %r5107, %r5299;
(EngineCore_0 pid=142) div.full.f32 %r5332, %r5108, %r5300;
(EngineCore_0 pid=142) div.full.f32 %r5333, %r5109, %r5301;
(EngineCore_0 pid=142) div.full.f32 %r5334, %r5110, %r5302;
(EngineCore_0 pid=142) div.full.f32 %r5335, %r5111, %r5303;
(EngineCore_0 pid=142) div.full.f32 %r5336, %r5112, %r5304;
(EngineCore_0 pid=142) div.full.f32 %r5337, %r5113, %r5305;
(EngineCore_0 pid=142) div.full.f32 %r5338, %r5114, %r5306;
(EngineCore_0 pid=142) div.full.f32 %r5339, %r5115, %r5307;
(EngineCore_0 pid=142) div.full.f32 %r5340, %r5116, %r5308;
(EngineCore_0 pid=142) div.full.f32 %r5341, %r5117, %r5309;
(EngineCore_0 pid=142) div.full.f32 %r5342, %r5118, %r5310;
(EngineCore_0 pid=142) div.full.f32 %r5343, %r5119, %r5311;
(EngineCore_0 pid=142) div.full.f32 %r5344, %r5120, %r5312;
(EngineCore_0 pid=142) div.full.f32 %r5345, %r5121, %r5313;
(EngineCore_0 pid=142) div.full.f32 %r5346, %r5122, %r5314;
(EngineCore_0 pid=142) div.full.f32 %r5347, %r5123, %r5315;
(EngineCore_0 pid=142) div.full.f32 %r5348, %r5124, %r5316;
(EngineCore_0 pid=142) div.full.f32 %r5349, %r5125, %r5317;
(EngineCore_0 pid=142) div.full.f32 %r5350, %r5126, %r5318;
(EngineCore_0 pid=142) div.full.f32 %r5351, %r5127, %r5319;
(EngineCore_0 pid=142) div.full.f32 %r5352, %r5128, %r5320;
(EngineCore_0 pid=142) div.full.f32 %r5353, %r5129, %r5321;
(EngineCore_0 pid=142) div.full.f32 %r5354, %r5130, %r5322;
(EngineCore_0 pid=142) div.full.f32 %r5355, %r5131, %r5323;
(EngineCore_0 pid=142) div.full.f32 %r5356, %r5132, %r5324;
(EngineCore_0 pid=142) .loc 3 46 29 // _swiglu.py:46:29 @[ specialize.py:330:46 ]
(EngineCore_0 pid=142) fma.rn.f32 %r5357, %r5325, %r5165, %r5325;
(EngineCore_0 pid=142) fma.rn.f32 %r5358, %r5326, %r5166, %r5326;
(EngineCore_0 pid=142) fma.rn.f32 %r5359, %r5327, %r5167, %r5327;
(EngineCore_0 pid=142) fma.rn.f32 %r5360, %r5328, %r5168, %r5328;
(EngineCore_0 pid=142) fma.rn.f32 %r5361, %r5329, %r5169, %r5329;
(EngineCore_0 pid=142) fma.rn.f32 %r5362, %r5330, %r5170, %r5330;
(EngineCore_0 pid=142) fma.rn.f32 %r5363, %r5331, %r5171, %r5331;
(EngineCore_0 pid=142) fma.rn.f32 %r5364, %r5332, %r5172, %r5332;
(EngineCore_0 pid=142) fma.rn.f32 %r5365, %r5333, %r5173, %r5333;
(EngineCore_0 pid=142) fma.rn.f32 %r5366, %r5334, %r5174, %r5334;
(EngineCore_0 pid=142) fma.rn.f32 %r5367, %r5335, %r5175, %r5335;
(EngineCore_0 pid=142) fma.rn.f32 %r5368, %r5336, %r5176, %r5336;
(EngineCore_0 pid=142) fma.rn.f32 %r5369, %r5337, %r5177, %r5337;
(EngineCore_0 pid=142) fma.rn.f32 %r5370, %r5338, %r5178, %r5338;
(EngineCore_0 pid=142) fma.rn.f32 %r5371, %r5339, %r5179, %r5339;
(EngineCore_0 pid=142) fma.rn.f32 %r5372, %r5340, %r5180, %r5340;
(EngineCore_0 pid=142) fma.rn.f32 %r5373, %r5341, %r5181, %r5341;
(EngineCore_0 pid=142) fma.rn.f32 %r5374, %r5342, %r5182, %r5342;
(EngineCore_0 pid=142) fma.rn.f32 %r5375, %r5343, %r5183, %r5343;
(EngineCore_0 pid=142) fma.rn.f32 %r5376, %r5344, %r5184, %r5344;
(EngineCore_0 pid=142) fma.rn.f32 %r5377, %r5345, %r5185, %r5345;
(EngineCore_0 pid=142) fma.rn.f32 %r5378, %r5346, %r5186, %r5346;
(EngineCore_0 pid=142) fma.rn.f32 %r5379, %r5347, %r5187, %r5347;
(EngineCore_0 pid=142) fma.rn.f32 %r5380, %r5348, %r5188, %r5348;
(EngineCore_0 pid=142) fma.rn.f32 %r5381, %r5349, %r5189, %r5349;
(EngineCore_0 pid=142) fma.rn.f32 %r5382, %r5350, %r5190, %r5350;
(EngineCore_0 pid=142) fma.rn.f32 %r5383, %r5351, %r5191, %r5351;
(EngineCore_0 pid=142) fma.rn.f32 %r5384, %r5352, %r5192, %r5352;
(EngineCore_0 pid=142) fma.rn.f32 %r5385, %r5353, %r5193, %r5353;
(EngineCore_0 pid=142) fma.rn.f32 %r5386, %r5354, %r5194, %r5354;
(EngineCore_0 pid=142) fma.rn.f32 %r5387, %r5355, %r5195, %r5355;
(EngineCore_0 pid=142) fma.rn.f32 %r5388, %r5356, %r5196, %r5356;
(EngineCore_0 pid=142) $L__tmp30:
(EngineCore_0 pid=142) .loc 1 353 34 // specialize.py:353:34
(EngineCore_0 pid=142) shr.s32 %r4941, %r4961, 1;
(EngineCore_0 pid=142) .loc 1 360 57 // specialize.py:360:57
(EngineCore_0 pid=142) cvt.rn.bf16.f32 %rs673, %r5357;
(EngineCore_0 pid=142) cvt.rn.bf16.f32 %rs674, %r5358;
(EngineCore_0 pid=142) cvt.rn.bf16.f32 %rs675, %r5359;
(EngineCore_0 pid=142) cvt.rn.bf16.f32 %rs676, %r5360;
(EngineCore_0 pid=142) cvt.rn.bf16.f32 %rs677, %r5361;
(EngineCore_0 pid=142) cvt.rn.bf16.f32 %rs678, %r5362;
(EngineCore_0 pid=142) cvt.rn.bf16.f32 %rs679, %r5363;
(EngineCore_0 pid=142) cvt.rn.bf16.f32 %rs680, %r5364;
(EngineCore_0 pid=142) cvt.rn.bf16.f32 %rs681, %r5365;
(EngineCore_0 pid=142) cvt.rn.bf16.f32 %rs682, %r5366;
(EngineCore_0 pid=142) cvt.rn.bf16.f32 %rs683, %r5367;
(EngineCore_0 pid=142) cvt.rn.bf16.f32 %rs684, %r5368;
(EngineCore_0 pid=142) cvt.rn.bf16.f32 %rs685, %r5369;
(EngineCore_0 pid=142) cvt.rn.bf16.f32 %rs686, %r5370;
(EngineCore_0 pid=142) cvt.rn.bf16.f32 %rs687, %r5371;
(EngineCore_0 pid=142) cvt.rn.bf16.f32 %rs688, %r5372;
(EngineCore_0 pid=142) cvt.rn.bf16.f32 %rs689, %r5373;
(EngineCore_0 pid=142) cvt.rn.bf16.f32 %rs690, %r5374;
(EngineCore_0 pid=142) cvt.rn.bf16.f32 %rs691, %r5375;
(EngineCore_0 pid=142) cvt.rn.bf16.f32 %rs692, %r5376;
(EngineCore_0 pid=142) cvt.rn.bf16.f32 %rs693, %r5377;
(EngineCore_0 pid=142) cvt.rn.bf16.f32 %rs694, %r5378;
(EngineCore_0 pid=142) cvt.rn.bf16.f32 %rs695, %r5379;
(EngineCore_0 pid=142) cvt.rn.bf16.f32 %rs696, %r5380;
(EngineCore_0 pid=142) cvt.rn.bf16.f32 %rs697, %r5381;
(EngineCore_0 pid=142) cvt.rn.bf16.f32 %rs698, %r5382;
(EngineCore_0 pid=142) cvt.rn.bf16.f32 %rs699, %r5383;
(EngineCore_0 pid=142) cvt.rn.bf16.f32 %rs700, %r5384;
(EngineCore_0 pid=142) cvt.rn.bf16.f32 %rs701, %r5385;
(EngineCore_0 pid=142) cvt.rn.bf16.f32 %rs702, %r5386;
(EngineCore_0 pid=142) cvt.rn.bf16.f32 %rs703, %r5387;
(EngineCore_0 pid=142) cvt.rn.bf16.f32 %rs704, %r5388;
(EngineCore_0 pid=142) .loc 1 360 50 // specialize.py:360:50
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) barrier.sync.aligned 0, 256 ;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) st.shared.b16 [%r268], %rs673;
(EngineCore_0 pid=142) st.shared.b16 [%r268+1024], %rs675;
(EngineCore_0 pid=142) st.shared.b16 [%r268+2048], %rs677;
(EngineCore_0 pid=142) st.shared.b16 [%r268+3072], %rs679;
(EngineCore_0 pid=142) st.shared.b16 [%r268+4096], %rs681;
(EngineCore_0 pid=142) st.shared.b16 [%r268+5120], %rs683;
(EngineCore_0 pid=142) st.shared.b16 [%r268+6144], %rs685;
(EngineCore_0 pid=142) st.shared.b16 [%r268+7168], %rs687;
(EngineCore_0 pid=142) st.shared.b16 [%r268+8192], %rs689;
(EngineCore_0 pid=142) st.shared.b16 [%r268+9216], %rs691;
(EngineCore_0 pid=142) st.shared.b16 [%r268+10240], %rs693;
(EngineCore_0 pid=142) st.shared.b16 [%r268+11264], %rs695;
(EngineCore_0 pid=142) st.shared.b16 [%r268+12288], %rs697;
(EngineCore_0 pid=142) st.shared.b16 [%r268+13312], %rs699;
(EngineCore_0 pid=142) st.shared.b16 [%r268+14336], %rs701;
(EngineCore_0 pid=142) st.shared.b16 [%r268+15360], %rs703;
(EngineCore_0 pid=142) st.shared.b16 [%r269], %rs674;
(EngineCore_0 pid=142) st.shared.b16 [%r269+1024], %rs676;
(EngineCore_0 pid=142) st.shared.b16 [%r269+2048], %rs678;
(EngineCore_0 pid=142) st.shared.b16 [%r269+3072], %rs680;
(EngineCore_0 pid=142) st.shared.b16 [%r269+4096], %rs682;
(EngineCore_0 pid=142) st.shared.b16 [%r269+5120], %rs684;
(EngineCore_0 pid=142) st.shared.b16 [%r269+6144], %rs686;
(EngineCore_0 pid=142) st.shared.b16 [%r269+7168], %rs688;
(EngineCore_0 pid=142) st.shared.b16 [%r269+8192], %rs690;
(EngineCore_0 pid=142) st.shared.b16 [%r269+9216], %rs692;
(EngineCore_0 pid=142) st.shared.b16 [%r269+10240], %rs694;
(EngineCore_0 pid=142) st.shared.b16 [%r269+11264], %rs696;
(EngineCore_0 pid=142) st.shared.b16 [%r269+12288], %rs698;
(EngineCore_0 pid=142) st.shared.b16 [%r269+13312], %rs700;
(EngineCore_0 pid=142) st.shared.b16 [%r269+14336], %rs702;
(EngineCore_0 pid=142) st.shared.b16 [%r269+15360], %rs704;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) fence.proxy.async.shared::cta;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) barrier.sync.aligned 0, 256 ;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) elect.sync %r5389|%p90, -1;
(EngineCore_0 pid=142) and.pred %p88, %p69, %p90;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p88 cp.async.bulk.tensor.2d.global.shared::cta.bulk_group [%rd91, {%r4941, %r4942}], [%r4943];
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) cp.async.bulk.commit_group;
(EngineCore_0 pid=142) cp.async.bulk.wait_group.read 0;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) barrier.sync.aligned 0, 256 ;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) .loc 1 325 34 // specialize.py:325:34
(EngineCore_0 pid=142) add.f32 %r5390, %r6118, %r4967;
(EngineCore_0 pid=142) add.f32 %r5391, %r6114, %r4967;
(EngineCore_0 pid=142) add.f32 %r5392, %r6110, %r4967;
(EngineCore_0 pid=142) add.f32 %r5393, %r6106, %r4967;
(EngineCore_0 pid=142) add.f32 %r5394, %r6119, %r4967;
(EngineCore_0 pid=142) add.f32 %r5395, %r6115, %r4967;
(EngineCore_0 pid=142) add.f32 %r5396, %r6111, %r4967;
(EngineCore_0 pid=142) add.f32 %r5397, %r6107, %r4967;
(EngineCore_0 pid=142) add.f32 %r5398, %r6120, %r4968;
(EngineCore_0 pid=142) add.f32 %r5399, %r6116, %r4968;
(EngineCore_0 pid=142) add.f32 %r5400, %r6112, %r4968;
(EngineCore_0 pid=142) add.f32 %r5401, %r6108, %r4968;
(EngineCore_0 pid=142) add.f32 %r5402, %r6121, %r4968;
(EngineCore_0 pid=142) add.f32 %r5403, %r6117, %r4968;
(EngineCore_0 pid=142) add.f32 %r5404, %r6113, %r4968;
(EngineCore_0 pid=142) add.f32 %r5405, %r6109, %r4968;
(EngineCore_0 pid=142) add.f32 %r5406, %r6134, %r4967;
(EngineCore_0 pid=142) add.f32 %r5407, %r6130, %r4967;
(EngineCore_0 pid=142) add.f32 %r5408, %r6126, %r4967;
(EngineCore_0 pid=142) add.f32 %r5409, %r6122, %r4967;
(EngineCore_0 pid=142) add.f32 %r5410, %r6135, %r4967;
(EngineCore_0 pid=142) add.f32 %r5411, %r6131, %r4967;
(EngineCore_0 pid=142) add.f32 %r5412, %r6127, %r4967;
(EngineCore_0 pid=142) add.f32 %r5413, %r6123, %r4967;
(EngineCore_0 pid=142) add.f32 %r5414, %r6136, %r4968;
(EngineCore_0 pid=142) add.f32 %r5415, %r6132, %r4968;
(EngineCore_0 pid=142) add.f32 %r5416, %r6128, %r4968;
(EngineCore_0 pid=142) add.f32 %r5417, %r6124, %r4968;
(EngineCore_0 pid=142) add.f32 %r5418, %r6137, %r4968;
(EngineCore_0 pid=142) add.f32 %r5419, %r6133, %r4968;
(EngineCore_0 pid=142) add.f32 %r5420, %r6129, %r4968;
(EngineCore_0 pid=142) add.f32 %r5421, %r6125, %r4968;
(EngineCore_0 pid=142) add.f32 %r5422, %r6150, %r4971;
(EngineCore_0 pid=142) add.f32 %r5423, %r6146, %r4971;
(EngineCore_0 pid=142) add.f32 %r5424, %r6142, %r4971;
(EngineCore_0 pid=142) add.f32 %r5425, %r6138, %r4971;
(EngineCore_0 pid=142) add.f32 %r5426, %r6151, %r4971;
(EngineCore_0 pid=142) add.f32 %r5427, %r6147, %r4971;
(EngineCore_0 pid=142) add.f32 %r5428, %r6143, %r4971;
(EngineCore_0 pid=142) add.f32 %r5429, %r6139, %r4971;
(EngineCore_0 pid=142) add.f32 %r5430, %r6152, %r4972;
(EngineCore_0 pid=142) add.f32 %r5431, %r6148, %r4972;
(EngineCore_0 pid=142) add.f32 %r5432, %r6144, %r4972;
(EngineCore_0 pid=142) add.f32 %r5433, %r6140, %r4972;
(EngineCore_0 pid=142) add.f32 %r5434, %r6153, %r4972;
(EngineCore_0 pid=142) add.f32 %r5435, %r6149, %r4972;
(EngineCore_0 pid=142) add.f32 %r5436, %r6145, %r4972;
(EngineCore_0 pid=142) add.f32 %r5437, %r6141, %r4972;
(EngineCore_0 pid=142) add.f32 %r5438, %r6166, %r4971;
(EngineCore_0 pid=142) add.f32 %r5439, %r6162, %r4971;
(EngineCore_0 pid=142) add.f32 %r5440, %r6158, %r4971;
(EngineCore_0 pid=142) add.f32 %r5441, %r6154, %r4971;
(EngineCore_0 pid=142) add.f32 %r5442, %r6167, %r4971;
(EngineCore_0 pid=142) add.f32 %r5443, %r6163, %r4971;
(EngineCore_0 pid=142) add.f32 %r5444, %r6159, %r4971;
(EngineCore_0 pid=142) add.f32 %r5445, %r6155, %r4971;
(EngineCore_0 pid=142) add.f32 %r5446, %r6168, %r4972;
(EngineCore_0 pid=142) add.f32 %r5447, %r6164, %r4972;
(EngineCore_0 pid=142) add.f32 %r5448, %r6160, %r4972;
(EngineCore_0 pid=142) add.f32 %r5449, %r6156, %r4972;
(EngineCore_0 pid=142) add.f32 %r5450, %r6169, %r4972;
(EngineCore_0 pid=142) add.f32 %r5451, %r6165, %r4972;
(EngineCore_0 pid=142) add.f32 %r5452, %r6161, %r4972;
(EngineCore_0 pid=142) add.f32 %r5453, %r6157, %r4972;
(EngineCore_0 pid=142) $L__tmp31:
(EngineCore_0 pid=142) .loc 3 51 46 // _swiglu.py:51:46 @[ specialize.py:330:46 ]
(EngineCore_0 pid=142) st.shared.v4.b32 [%r262], {%r5393, %r5392, %r5391, %r5390};
(EngineCore_0 pid=142) st.shared.v4.b32 [%r262+128], {%r5397, %r5396, %r5395, %r5394};
(EngineCore_0 pid=142) st.shared.v4.b32 [%r262+2048], {%r5425, %r5424, %r5423, %r5422};
(EngineCore_0 pid=142) st.shared.v4.b32 [%r262+2176], {%r5429, %r5428, %r5427, %r5426};
(EngineCore_0 pid=142) st.shared.v4.b32 [%r263], {%r5401, %r5400, %r5399, %r5398};
(EngineCore_0 pid=142) st.shared.v4.b32 [%r263+128], {%r5405, %r5404, %r5403, %r5402};
(EngineCore_0 pid=142) st.shared.v4.b32 [%r263+2048], {%r5433, %r5432, %r5431, %r5430};
(EngineCore_0 pid=142) st.shared.v4.b32 [%r263+2176], {%r5437, %r5436, %r5435, %r5434};
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) barrier.sync.aligned 0, 256 ;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) ld.shared.v4.b32 {%r5454, %r5455, %r5456, %r5457}, [%r264];
(EngineCore_0 pid=142) ld.shared.v4.b32 {%r5458, %r5459, %r5460, %r5461}, [%r264+256];
(EngineCore_0 pid=142) ld.shared.v4.b32 {%r5462, %r5463, %r5464, %r5465}, [%r265];
(EngineCore_0 pid=142) ld.shared.v4.b32 {%r5466, %r5467, %r5468, %r5469}, [%r265+256];
(EngineCore_0 pid=142) ld.shared.v4.b32 {%r5470, %r5471, %r5472, %r5473}, [%r266];
(EngineCore_0 pid=142) ld.shared.v4.b32 {%r5474, %r5475, %r5476, %r5477}, [%r266+256];
(EngineCore_0 pid=142) ld.shared.v4.b32 {%r5478, %r5479, %r5480, %r5481}, [%r267];
(EngineCore_0 pid=142) ld.shared.v4.b32 {%r5482, %r5483, %r5484, %r5485}, [%r267+256];
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) barrier.sync.aligned 0, 256 ;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) st.shared.v4.b32 [%r262], {%r5409, %r5408, %r5407, %r5406};
(EngineCore_0 pid=142) st.shared.v4.b32 [%r262+128], {%r5413, %r5412, %r5411, %r5410};
(EngineCore_0 pid=142) st.shared.v4.b32 [%r262+2048], {%r5441, %r5440, %r5439, %r5438};
(EngineCore_0 pid=142) st.shared.v4.b32 [%r262+2176], {%r5445, %r5444, %r5443, %r5442};
(EngineCore_0 pid=142) st.shared.v4.b32 [%r263], {%r5417, %r5416, %r5415, %r5414};
(EngineCore_0 pid=142) st.shared.v4.b32 [%r263+128], {%r5421, %r5420, %r5419, %r5418};
(EngineCore_0 pid=142) st.shared.v4.b32 [%r263+2048], {%r5449, %r5448, %r5447, %r5446};
(EngineCore_0 pid=142) st.shared.v4.b32 [%r263+2176], {%r5453, %r5452, %r5451, %r5450};
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) barrier.sync.aligned 0, 256 ;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) ld.shared.v4.b32 {%r5486, %r5487, %r5488, %r5489}, [%r264];
(EngineCore_0 pid=142) ld.shared.v4.b32 {%r5490, %r5491, %r5492, %r5493}, [%r264+256];
(EngineCore_0 pid=142) ld.shared.v4.b32 {%r5494, %r5495, %r5496, %r5497}, [%r265];
(EngineCore_0 pid=142) ld.shared.v4.b32 {%r5498, %r5499, %r5500, %r5501}, [%r265+256];
(EngineCore_0 pid=142) ld.shared.v4.b32 {%r5502, %r5503, %r5504, %r5505}, [%r266];
(EngineCore_0 pid=142) ld.shared.v4.b32 {%r5506, %r5507, %r5508, %r5509}, [%r266+256];
(EngineCore_0 pid=142) ld.shared.v4.b32 {%r5510, %r5511, %r5512, %r5513}, [%r267];
(EngineCore_0 pid=142) ld.shared.v4.b32 {%r5514, %r5515, %r5516, %r5517}, [%r267+256];
(EngineCore_0 pid=142) .loc 3 8 24 // _swiglu.py:8:24 @[ specialize.py:330:46 ]
(EngineCore_0 pid=142) min.f32 %r5518, %r5454, %r669;
(EngineCore_0 pid=142) min.f32 %r5519, %r5470, %r669;
(EngineCore_0 pid=142) min.f32 %r5520, %r5458, %r669;
(EngineCore_0 pid=142) min.f32 %r5521, %r5474, %r669;
(EngineCore_0 pid=142) min.f32 %r5522, %r5455, %r669;
(EngineCore_0 pid=142) min.f32 %r5523, %r5471, %r669;
(EngineCore_0 pid=142) min.f32 %r5524, %r5459, %r669;
(EngineCore_0 pid=142) min.f32 %r5525, %r5475, %r669;
(EngineCore_0 pid=142) min.f32 %r5526, %r5456, %r669;
(EngineCore_0 pid=142) min.f32 %r5527, %r5472, %r669;
(EngineCore_0 pid=142) min.f32 %r5528, %r5460, %r669;
(EngineCore_0 pid=142) min.f32 %r5529, %r5476, %r669;
(EngineCore_0 pid=142) min.f32 %r5530, %r5457, %r669;
(EngineCore_0 pid=142) min.f32 %r5531, %r5473, %r669;
(EngineCore_0 pid=142) min.f32 %r5532, %r5461, %r669;
(EngineCore_0 pid=142) min.f32 %r5533, %r5477, %r669;
(EngineCore_0 pid=142) min.f32 %r5534, %r5486, %r669;
(EngineCore_0 pid=142) min.f32 %r5535, %r5502, %r669;
(EngineCore_0 pid=142) min.f32 %r5536, %r5490, %r669;
(EngineCore_0 pid=142) min.f32 %r5537, %r5506, %r669;
(EngineCore_0 pid=142) min.f32 %r5538, %r5487, %r669;
(EngineCore_0 pid=142) min.f32 %r5539, %r5503, %r669;
(EngineCore_0 pid=142) min.f32 %r5540, %r5491, %r669;
(EngineCore_0 pid=142) min.f32 %r5541, %r5507, %r669;
(EngineCore_0 pid=142) min.f32 %r5542, %r5488, %r669;
(EngineCore_0 pid=142) min.f32 %r5543, %r5504, %r669;
(EngineCore_0 pid=142) min.f32 %r5544, %r5492, %r669;
(EngineCore_0 pid=142) min.f32 %r5545, %r5508, %r669;
(EngineCore_0 pid=142) min.f32 %r5546, %r5489, %r669;
(EngineCore_0 pid=142) min.f32 %r5547, %r5505, %r669;
(EngineCore_0 pid=142) min.f32 %r5548, %r5493, %r669;
(EngineCore_0 pid=142) min.f32 %r5549, %r5509, %r669;
(EngineCore_0 pid=142) min.f32 %r5550, %r5462, %r669;
(EngineCore_0 pid=142) min.f32 %r5551, %r5478, %r669;
(EngineCore_0 pid=142) min.f32 %r5552, %r5466, %r669;
(EngineCore_0 pid=142) min.f32 %r5553, %r5482, %r669;
(EngineCore_0 pid=142) min.f32 %r5554, %r5463, %r669;
(EngineCore_0 pid=142) min.f32 %r5555, %r5479, %r669;
(EngineCore_0 pid=142) min.f32 %r5556, %r5467, %r669;
(EngineCore_0 pid=142) min.f32 %r5557, %r5483, %r669;
(EngineCore_0 pid=142) min.f32 %r5558, %r5464, %r669;
(EngineCore_0 pid=142) min.f32 %r5559, %r5480, %r669;
(EngineCore_0 pid=142) min.f32 %r5560, %r5468, %r669;
(EngineCore_0 pid=142) min.f32 %r5561, %r5484, %r669;
(EngineCore_0 pid=142) min.f32 %r5562, %r5465, %r669;
(EngineCore_0 pid=142) min.f32 %r5563, %r5481, %r669;
(EngineCore_0 pid=142) min.f32 %r5564, %r5469, %r669;
(EngineCore_0 pid=142) min.f32 %r5565, %r5485, %r669;
(EngineCore_0 pid=142) min.f32 %r5566, %r5494, %r669;
(EngineCore_0 pid=142) min.f32 %r5567, %r5510, %r669;
(EngineCore_0 pid=142) min.f32 %r5568, %r5498, %r669;
(EngineCore_0 pid=142) min.f32 %r5569, %r5514, %r669;
(EngineCore_0 pid=142) min.f32 %r5570, %r5495, %r669;
(EngineCore_0 pid=142) min.f32 %r5571, %r5511, %r669;
(EngineCore_0 pid=142) min.f32 %r5572, %r5499, %r669;
(EngineCore_0 pid=142) min.f32 %r5573, %r5515, %r669;
(EngineCore_0 pid=142) min.f32 %r5574, %r5496, %r669;
(EngineCore_0 pid=142) min.f32 %r5575, %r5512, %r669;
(EngineCore_0 pid=142) min.f32 %r5576, %r5500, %r669;
(EngineCore_0 pid=142) min.f32 %r5577, %r5516, %r669;
(EngineCore_0 pid=142) min.f32 %r5578, %r5497, %r669;
(EngineCore_0 pid=142) min.f32 %r5579, %r5513, %r669;
(EngineCore_0 pid=142) min.f32 %r5580, %r5501, %r669;
(EngineCore_0 pid=142) min.f32 %r5581, %r5517, %r669;
(EngineCore_0 pid=142) .loc 3 10 33 // _swiglu.py:10:33 @[ specialize.py:330:46 ]
(EngineCore_0 pid=142) max.f32 %r5582, %r231, %r5550;
(EngineCore_0 pid=142) max.f32 %r5583, %r231, %r5551;
(EngineCore_0 pid=142) max.f32 %r5584, %r231, %r5552;
(EngineCore_0 pid=142) max.f32 %r5585, %r231, %r5553;
(EngineCore_0 pid=142) max.f32 %r5586, %r231, %r5554;
(EngineCore_0 pid=142) max.f32 %r5587, %r231, %r5555;
(EngineCore_0 pid=142) max.f32 %r5588, %r231, %r5556;
(EngineCore_0 pid=142) max.f32 %r5589, %r231, %r5557;
(EngineCore_0 pid=142) max.f32 %r5590, %r231, %r5558;
(EngineCore_0 pid=142) max.f32 %r5591, %r231, %r5559;
(EngineCore_0 pid=142) max.f32 %r5592, %r231, %r5560;
(EngineCore_0 pid=142) max.f32 %r5593, %r231, %r5561;
(EngineCore_0 pid=142) max.f32 %r5594, %r231, %r5562;
(EngineCore_0 pid=142) max.f32 %r5595, %r231, %r5563;
(EngineCore_0 pid=142) max.f32 %r5596, %r231, %r5564;
(EngineCore_0 pid=142) max.f32 %r5597, %r231, %r5565;
(EngineCore_0 pid=142) max.f32 %r5598, %r231, %r5566;
(EngineCore_0 pid=142) max.f32 %r5599, %r231, %r5567;
(EngineCore_0 pid=142) max.f32 %r5600, %r231, %r5568;
(EngineCore_0 pid=142) max.f32 %r5601, %r231, %r5569;
(EngineCore_0 pid=142) max.f32 %r5602, %r231, %r5570;
(EngineCore_0 pid=142) max.f32 %r5603, %r231, %r5571;
(EngineCore_0 pid=142) max.f32 %r5604, %r231, %r5572;
(EngineCore_0 pid=142) max.f32 %r5605, %r231, %r5573;
(EngineCore_0 pid=142) max.f32 %r5606, %r231, %r5574;
(EngineCore_0 pid=142) max.f32 %r5607, %r231, %r5575;
(EngineCore_0 pid=142) max.f32 %r5608, %r231, %r5576;
(EngineCore_0 pid=142) max.f32 %r5609, %r231, %r5577;
(EngineCore_0 pid=142) max.f32 %r5610, %r231, %r5578;
(EngineCore_0 pid=142) max.f32 %r5611, %r231, %r5579;
(EngineCore_0 pid=142) max.f32 %r5612, %r231, %r5580;
(EngineCore_0 pid=142) max.f32 %r5613, %r231, %r5581;
(EngineCore_0 pid=142) .loc 3 45 36 // _swiglu.py:45:36 @[ specialize.py:330:46 ]
(EngineCore_0 pid=142) mul.f32 %r5614, %r232, %r5518;
(EngineCore_0 pid=142) mul.f32 %r5615, %r232, %r5519;
(EngineCore_0 pid=142) mul.f32 %r5616, %r232, %r5520;
(EngineCore_0 pid=142) mul.f32 %r5617, %r232, %r5521;
(EngineCore_0 pid=142) mul.f32 %r5618, %r232, %r5522;
(EngineCore_0 pid=142) mul.f32 %r5619, %r232, %r5523;
(EngineCore_0 pid=142) mul.f32 %r5620, %r232, %r5524;
(EngineCore_0 pid=142) mul.f32 %r5621, %r232, %r5525;
(EngineCore_0 pid=142) mul.f32 %r5622, %r232, %r5526;
(EngineCore_0 pid=142) mul.f32 %r5623, %r232, %r5527;
(EngineCore_0 pid=142) mul.f32 %r5624, %r232, %r5528;
(EngineCore_0 pid=142) mul.f32 %r5625, %r232, %r5529;
(EngineCore_0 pid=142) mul.f32 %r5626, %r232, %r5530;
(EngineCore_0 pid=142) mul.f32 %r5627, %r232, %r5531;
(EngineCore_0 pid=142) mul.f32 %r5628, %r232, %r5532;
(EngineCore_0 pid=142) mul.f32 %r5629, %r232, %r5533;
(EngineCore_0 pid=142) mul.f32 %r5630, %r232, %r5534;
(EngineCore_0 pid=142) mul.f32 %r5631, %r232, %r5535;
(EngineCore_0 pid=142) mul.f32 %r5632, %r232, %r5536;
(EngineCore_0 pid=142) mul.f32 %r5633, %r232, %r5537;
(EngineCore_0 pid=142) mul.f32 %r5634, %r232, %r5538;
(EngineCore_0 pid=142) mul.f32 %r5635, %r232, %r5539;
(EngineCore_0 pid=142) mul.f32 %r5636, %r232, %r5540;
(EngineCore_0 pid=142) mul.f32 %r5637, %r232, %r5541;
(EngineCore_0 pid=142) mul.f32 %r5638, %r232, %r5542;
(EngineCore_0 pid=142) mul.f32 %r5639, %r232, %r5543;
(EngineCore_0 pid=142) mul.f32 %r5640, %r232, %r5544;
(EngineCore_0 pid=142) mul.f32 %r5641, %r232, %r5545;
(EngineCore_0 pid=142) mul.f32 %r5642, %r232, %r5546;
(EngineCore_0 pid=142) mul.f32 %r5643, %r232, %r5547;
(EngineCore_0 pid=142) mul.f32 %r5644, %r232, %r5548;
(EngineCore_0 pid=142) mul.f32 %r5645, %r232, %r5549;
(EngineCore_0 pid=142) .loc 3 45 27 // _swiglu.py:45:27 @[ specialize.py:330:46 ]
(EngineCore_0 pid=142) mul.f32 %r5646, %r5614, 0f3FB8AA3B;
(EngineCore_0 pid=142) ex2.approx.f32 %r5647, %r5646;
(EngineCore_0 pid=142) mul.f32 %r5648, %r5615, 0f3FB8AA3B;
(EngineCore_0 pid=142) ex2.approx.f32 %r5649, %r5648;
(EngineCore_0 pid=142) mul.f32 %r5650, %r5616, 0f3FB8AA3B;
(EngineCore_0 pid=142) ex2.approx.f32 %r5651, %r5650;
(EngineCore_0 pid=142) mul.f32 %r5652, %r5617, 0f3FB8AA3B;
(EngineCore_0 pid=142) ex2.approx.f32 %r5653, %r5652;
(EngineCore_0 pid=142) mul.f32 %r5654, %r5618, 0f3FB8AA3B;
(EngineCore_0 pid=142) ex2.approx.f32 %r5655, %r5654;
(EngineCore_0 pid=142) mul.f32 %r5656, %r5619, 0f3FB8AA3B;
(EngineCore_0 pid=142) ex2.approx.f32 %r5657, %r5656;
(EngineCore_0 pid=142) mul.f32 %r5658, %r5620, 0f3FB8AA3B;
(EngineCore_0 pid=142) ex2.approx.f32 %r5659, %r5658;
(EngineCore_0 pid=142) mul.f32 %r5660, %r5621, 0f3FB8AA3B;
(EngineCore_0 pid=142) ex2.approx.f32 %r5661, %r5660;
(EngineCore_0 pid=142) mul.f32 %r5662, %r5622, 0f3FB8AA3B;
(EngineCore_0 pid=142) ex2.approx.f32 %r5663, %r5662;
(EngineCore_0 pid=142) mul.f32 %r5664, %r5623, 0f3FB8AA3B;
(EngineCore_0 pid=142) ex2.approx.f32 %r5665, %r5664;
(EngineCore_0 pid=142) mul.f32 %r5666, %r5624, 0f3FB8AA3B;
(EngineCore_0 pid=142) ex2.approx.f32 %r5667, %r5666;
(EngineCore_0 pid=142) mul.f32 %r5668, %r5625, 0f3FB8AA3B;
(EngineCore_0 pid=142) ex2.approx.f32 %r5669, %r5668;
(EngineCore_0 pid=142) mul.f32 %r5670, %r5626, 0f3FB8AA3B;
(EngineCore_0 pid=142) ex2.approx.f32 %r5671, %r5670;
(EngineCore_0 pid=142) mul.f32 %r5672, %r5627, 0f3FB8AA3B;
(EngineCore_0 pid=142) ex2.approx.f32 %r5673, %r5672;
(EngineCore_0 pid=142) mul.f32 %r5674, %r5628, 0f3FB8AA3B;
(EngineCore_0 pid=142) ex2.approx.f32 %r5675, %r5674;
(EngineCore_0 pid=142) mul.f32 %r5676, %r5629, 0f3FB8AA3B;
(EngineCore_0 pid=142) ex2.approx.f32 %r5677, %r5676;
(EngineCore_0 pid=142) mul.f32 %r5678, %r5630, 0f3FB8AA3B;
(EngineCore_0 pid=142) ex2.approx.f32 %r5679, %r5678;
(EngineCore_0 pid=142) mul.f32 %r5680, %r5631, 0f3FB8AA3B;
(EngineCore_0 pid=142) ex2.approx.f32 %r5681, %r5680;
(EngineCore_0 pid=142) mul.f32 %r5682, %r5632, 0f3FB8AA3B;
(EngineCore_0 pid=142) ex2.approx.f32 %r5683, %r5682;
(EngineCore_0 pid=142) mul.f32 %r5684, %r5633, 0f3FB8AA3B;
(EngineCore_0 pid=142) ex2.approx.f32 %r5685, %r5684;
(EngineCore_0 pid=142) mul.f32 %r5686, %r5634, 0f3FB8AA3B;
(EngineCore_0 pid=142) ex2.approx.f32 %r5687, %r5686;
(EngineCore_0 pid=142) mul.f32 %r5688, %r5635, 0f3FB8AA3B;
(EngineCore_0 pid=142) ex2.approx.f32 %r5689, %r5688;
(EngineCore_0 pid=142) mul.f32 %r5690, %r5636, 0f3FB8AA3B;
(EngineCore_0 pid=142) ex2.approx.f32 %r5691, %r5690;
(EngineCore_0 pid=142) mul.f32 %r5692, %r5637, 0f3FB8AA3B;
(EngineCore_0 pid=142) ex2.approx.f32 %r5693, %r5692;
(EngineCore_0 pid=142) mul.f32 %r5694, %r5638, 0f3FB8AA3B;
(EngineCore_0 pid=142) ex2.approx.f32 %r5695, %r5694;
(EngineCore_0 pid=142) mul.f32 %r5696, %r5639, 0f3FB8AA3B;
(EngineCore_0 pid=142) ex2.approx.f32 %r5697, %r5696;
(EngineCore_0 pid=142) mul.f32 %r5698, %r5640, 0f3FB8AA3B;
(EngineCore_0 pid=142) ex2.approx.f32 %r5699, %r5698;
(EngineCore_0 pid=142) mul.f32 %r5700, %r5641, 0f3FB8AA3B;
(EngineCore_0 pid=142) ex2.approx.f32 %r5701, %r5700;
(EngineCore_0 pid=142) mul.f32 %r5702, %r5642, 0f3FB8AA3B;
(EngineCore_0 pid=142) ex2.approx.f32 %r5703, %r5702;
(EngineCore_0 pid=142) mul.f32 %r5704, %r5643, 0f3FB8AA3B;
(EngineCore_0 pid=142) ex2.approx.f32 %r5705, %r5704;
(EngineCore_0 pid=142) mul.f32 %r5706, %r5644, 0f3FB8AA3B;
(EngineCore_0 pid=142) ex2.approx.f32 %r5707, %r5706;
(EngineCore_0 pid=142) mul.f32 %r5708, %r5645, 0f3FB8AA3B;
(EngineCore_0 pid=142) ex2.approx.f32 %r5709, %r5708;
(EngineCore_0 pid=142) .loc 3 45 20 // _swiglu.py:45:20 @[ specialize.py:330:46 ]
(EngineCore_0 pid=142) add.f32 %r5710, %r5647, 0f3F800000;
(EngineCore_0 pid=142) add.f32 %r5711, %r5649, 0f3F800000;
(EngineCore_0 pid=142) add.f32 %r5712, %r5651, 0f3F800000;
(EngineCore_0 pid=142) add.f32 %r5713, %r5653, 0f3F800000;
(EngineCore_0 pid=142) add.f32 %r5714, %r5655, 0f3F800000;
(EngineCore_0 pid=142) add.f32 %r5715, %r5657, 0f3F800000;
(EngineCore_0 pid=142) add.f32 %r5716, %r5659, 0f3F800000;
(EngineCore_0 pid=142) add.f32 %r5717, %r5661, 0f3F800000;
(EngineCore_0 pid=142) add.f32 %r5718, %r5663, 0f3F800000;
(EngineCore_0 pid=142) add.f32 %r5719, %r5665, 0f3F800000;
(EngineCore_0 pid=142) add.f32 %r5720, %r5667, 0f3F800000;
(EngineCore_0 pid=142) add.f32 %r5721, %r5669, 0f3F800000;
(EngineCore_0 pid=142) add.f32 %r5722, %r5671, 0f3F800000;
(EngineCore_0 pid=142) add.f32 %r5723, %r5673, 0f3F800000;
(EngineCore_0 pid=142) add.f32 %r5724, %r5675, 0f3F800000;
(EngineCore_0 pid=142) add.f32 %r5725, %r5677, 0f3F800000;
(EngineCore_0 pid=142) add.f32 %r5726, %r5679, 0f3F800000;
(EngineCore_0 pid=142) add.f32 %r5727, %r5681, 0f3F800000;
(EngineCore_0 pid=142) add.f32 %r5728, %r5683, 0f3F800000;
(EngineCore_0 pid=142) add.f32 %r5729, %r5685, 0f3F800000;
(EngineCore_0 pid=142) add.f32 %r5730, %r5687, 0f3F800000;
(EngineCore_0 pid=142) add.f32 %r5731, %r5689, 0f3F800000;
(EngineCore_0 pid=142) add.f32 %r5732, %r5691, 0f3F800000;
(EngineCore_0 pid=142) add.f32 %r5733, %r5693, 0f3F800000;
(EngineCore_0 pid=142) add.f32 %r5734, %r5695, 0f3F800000;
(EngineCore_0 pid=142) add.f32 %r5735, %r5697, 0f3F800000;
(EngineCore_0 pid=142) add.f32 %r5736, %r5699, 0f3F800000;
(EngineCore_0 pid=142) add.f32 %r5737, %r5701, 0f3F800000;
(EngineCore_0 pid=142) add.f32 %r5738, %r5703, 0f3F800000;
(EngineCore_0 pid=142) add.f32 %r5739, %r5705, 0f3F800000;
(EngineCore_0 pid=142) add.f32 %r5740, %r5707, 0f3F800000;
(EngineCore_0 pid=142) add.f32 %r5741, %r5709, 0f3F800000;
(EngineCore_0 pid=142) .loc 3 45 16 // _swiglu.py:45:16 @[ specialize.py:330:46 ]
(EngineCore_0 pid=142) div.full.f32 %r5742, %r5518, %r5710;
(EngineCore_0 pid=142) div.full.f32 %r5743, %r5519, %r5711;
(EngineCore_0 pid=142) div.full.f32 %r5744, %r5520, %r5712;
(EngineCore_0 pid=142) div.full.f32 %r5745, %r5521, %r5713;
(EngineCore_0 pid=142) div.full.f32 %r5746, %r5522, %r5714;
(EngineCore_0 pid=142) div.full.f32 %r5747, %r5523, %r5715;
(EngineCore_0 pid=142) div.full.f32 %r5748, %r5524, %r5716;
(EngineCore_0 pid=142) div.full.f32 %r5749, %r5525, %r5717;
(EngineCore_0 pid=142) div.full.f32 %r5750, %r5526, %r5718;
(EngineCore_0 pid=142) div.full.f32 %r5751, %r5527, %r5719;
(EngineCore_0 pid=142) div.full.f32 %r5752, %r5528, %r5720;
(EngineCore_0 pid=142) div.full.f32 %r5753, %r5529, %r5721;
(EngineCore_0 pid=142) div.full.f32 %r5754, %r5530, %r5722;
(EngineCore_0 pid=142) div.full.f32 %r5755, %r5531, %r5723;
(EngineCore_0 pid=142) div.full.f32 %r5756, %r5532, %r5724;
(EngineCore_0 pid=142) div.full.f32 %r5757, %r5533, %r5725;
(EngineCore_0 pid=142) div.full.f32 %r5758, %r5534, %r5726;
(EngineCore_0 pid=142) div.full.f32 %r5759, %r5535, %r5727;
(EngineCore_0 pid=142) div.full.f32 %r5760, %r5536, %r5728;
(EngineCore_0 pid=142) div.full.f32 %r5761, %r5537, %r5729;
(EngineCore_0 pid=142) div.full.f32 %r5762, %r5538, %r5730;
(EngineCore_0 pid=142) div.full.f32 %r5763, %r5539, %r5731;
(EngineCore_0 pid=142) div.full.f32 %r5764, %r5540, %r5732;
(EngineCore_0 pid=142) div.full.f32 %r5765, %r5541, %r5733;
(EngineCore_0 pid=142) div.full.f32 %r5766, %r5542, %r5734;
(EngineCore_0 pid=142) div.full.f32 %r5767, %r5543, %r5735;
(EngineCore_0 pid=142) div.full.f32 %r5768, %r5544, %r5736;
(EngineCore_0 pid=142) div.full.f32 %r5769, %r5545, %r5737;
(EngineCore_0 pid=142) div.full.f32 %r5770, %r5546, %r5738;
(EngineCore_0 pid=142) div.full.f32 %r5771, %r5547, %r5739;
(EngineCore_0 pid=142) div.full.f32 %r5772, %r5548, %r5740;
(EngineCore_0 pid=142) div.full.f32 %r5773, %r5549, %r5741;
(EngineCore_0 pid=142) .loc 3 46 29 // _swiglu.py:46:29 @[ specialize.py:330:46 ]
(EngineCore_0 pid=142) fma.rn.f32 %r5774, %r5742, %r5582, %r5742;
(EngineCore_0 pid=142) fma.rn.f32 %r5775, %r5743, %r5583, %r5743;
(EngineCore_0 pid=142) fma.rn.f32 %r5776, %r5744, %r5584, %r5744;
(EngineCore_0 pid=142) fma.rn.f32 %r5777, %r5745, %r5585, %r5745;
(EngineCore_0 pid=142) fma.rn.f32 %r5778, %r5746, %r5586, %r5746;
(EngineCore_0 pid=142) fma.rn.f32 %r5779, %r5747, %r5587, %r5747;
(EngineCore_0 pid=142) fma.rn.f32 %r5780, %r5748, %r5588, %r5748;
(EngineCore_0 pid=142) fma.rn.f32 %r5781, %r5749, %r5589, %r5749;
(EngineCore_0 pid=142) fma.rn.f32 %r5782, %r5750, %r5590, %r5750;
(EngineCore_0 pid=142) fma.rn.f32 %r5783, %r5751, %r5591, %r5751;
(EngineCore_0 pid=142) fma.rn.f32 %r5784, %r5752, %r5592, %r5752;
(EngineCore_0 pid=142) fma.rn.f32 %r5785, %r5753, %r5593, %r5753;
(EngineCore_0 pid=142) fma.rn.f32 %r5786, %r5754, %r5594, %r5754;
(EngineCore_0 pid=142) fma.rn.f32 %r5787, %r5755, %r5595, %r5755;
(EngineCore_0 pid=142) fma.rn.f32 %r5788, %r5756, %r5596, %r5756;
(EngineCore_0 pid=142) fma.rn.f32 %r5789, %r5757, %r5597, %r5757;
(EngineCore_0 pid=142) fma.rn.f32 %r5790, %r5758, %r5598, %r5758;
(EngineCore_0 pid=142) fma.rn.f32 %r5791, %r5759, %r5599, %r5759;
(EngineCore_0 pid=142) fma.rn.f32 %r5792, %r5760, %r5600, %r5760;
(EngineCore_0 pid=142) fma.rn.f32 %r5793, %r5761, %r5601, %r5761;
(EngineCore_0 pid=142) fma.rn.f32 %r5794, %r5762, %r5602, %r5762;
(EngineCore_0 pid=142) fma.rn.f32 %r5795, %r5763, %r5603, %r5763;
(EngineCore_0 pid=142) fma.rn.f32 %r5796, %r5764, %r5604, %r5764;
(EngineCore_0 pid=142) fma.rn.f32 %r5797, %r5765, %r5605, %r5765;
(EngineCore_0 pid=142) fma.rn.f32 %r5798, %r5766, %r5606, %r5766;
(EngineCore_0 pid=142) fma.rn.f32 %r5799, %r5767, %r5607, %r5767;
(EngineCore_0 pid=142) fma.rn.f32 %r5800, %r5768, %r5608, %r5768;
(EngineCore_0 pid=142) fma.rn.f32 %r5801, %r5769, %r5609, %r5769;
(EngineCore_0 pid=142) fma.rn.f32 %r5802, %r5770, %r5610, %r5770;
(EngineCore_0 pid=142) fma.rn.f32 %r5803, %r5771, %r5611, %r5771;
(EngineCore_0 pid=142) fma.rn.f32 %r5804, %r5772, %r5612, %r5772;
(EngineCore_0 pid=142) fma.rn.f32 %r5805, %r5773, %r5613, %r5773;
(EngineCore_0 pid=142) $L__tmp32:
(EngineCore_0 pid=142) .loc 1 353 59 // specialize.py:353:59
(EngineCore_0 pid=142) or.b32 %r4944, %r4941, 64;
(EngineCore_0 pid=142) .loc 1 360 57 // specialize.py:360:57
(EngineCore_0 pid=142) cvt.rn.bf16.f32 %rs705, %r5774;
(EngineCore_0 pid=142) cvt.rn.bf16.f32 %rs706, %r5775;
(EngineCore_0 pid=142) cvt.rn.bf16.f32 %rs707, %r5776;
(EngineCore_0 pid=142) cvt.rn.bf16.f32 %rs708, %r5777;
(EngineCore_0 pid=142) cvt.rn.bf16.f32 %rs709, %r5778;
(EngineCore_0 pid=142) cvt.rn.bf16.f32 %rs710, %r5779;
(EngineCore_0 pid=142) cvt.rn.bf16.f32 %rs711, %r5780;
(EngineCore_0 pid=142) cvt.rn.bf16.f32 %rs712, %r5781;
(EngineCore_0 pid=142) cvt.rn.bf16.f32 %rs713, %r5782;
(EngineCore_0 pid=142) cvt.rn.bf16.f32 %rs714, %r5783;
(EngineCore_0 pid=142) cvt.rn.bf16.f32 %rs715, %r5784;
(EngineCore_0 pid=142) cvt.rn.bf16.f32 %rs716, %r5785;
(EngineCore_0 pid=142) cvt.rn.bf16.f32 %rs717, %r5786;
(EngineCore_0 pid=142) cvt.rn.bf16.f32 %rs718, %r5787;
(EngineCore_0 pid=142) cvt.rn.bf16.f32 %rs719, %r5788;
(EngineCore_0 pid=142) cvt.rn.bf16.f32 %rs720, %r5789;
(EngineCore_0 pid=142) cvt.rn.bf16.f32 %rs721, %r5790;
(EngineCore_0 pid=142) cvt.rn.bf16.f32 %rs722, %r5791;
(EngineCore_0 pid=142) cvt.rn.bf16.f32 %rs723, %r5792;
(EngineCore_0 pid=142) cvt.rn.bf16.f32 %rs724, %r5793;
(EngineCore_0 pid=142) cvt.rn.bf16.f32 %rs725, %r5794;
(EngineCore_0 pid=142) cvt.rn.bf16.f32 %rs726, %r5795;
(EngineCore_0 pid=142) cvt.rn.bf16.f32 %rs727, %r5796;
(EngineCore_0 pid=142) cvt.rn.bf16.f32 %rs728, %r5797;
(EngineCore_0 pid=142) cvt.rn.bf16.f32 %rs729, %r5798;
(EngineCore_0 pid=142) cvt.rn.bf16.f32 %rs730, %r5799;
(EngineCore_0 pid=142) cvt.rn.bf16.f32 %rs731, %r5800;
(EngineCore_0 pid=142) cvt.rn.bf16.f32 %rs732, %r5801;
(EngineCore_0 pid=142) cvt.rn.bf16.f32 %rs733, %r5802;
(EngineCore_0 pid=142) cvt.rn.bf16.f32 %rs734, %r5803;
(EngineCore_0 pid=142) cvt.rn.bf16.f32 %rs735, %r5804;
(EngineCore_0 pid=142) cvt.rn.bf16.f32 %rs736, %r5805;
(EngineCore_0 pid=142) .loc 1 360 50 // specialize.py:360:50
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) barrier.sync.aligned 0, 256 ;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) st.shared.b16 [%r268], %rs705;
(EngineCore_0 pid=142) st.shared.b16 [%r268+1024], %rs707;
(EngineCore_0 pid=142) st.shared.b16 [%r268+2048], %rs709;
(EngineCore_0 pid=142) st.shared.b16 [%r268+3072], %rs711;
(EngineCore_0 pid=142) st.shared.b16 [%r268+4096], %rs713;
(EngineCore_0 pid=142) st.shared.b16 [%r268+5120], %rs715;
(EngineCore_0 pid=142) st.shared.b16 [%r268+6144], %rs717;
(EngineCore_0 pid=142) st.shared.b16 [%r268+7168], %rs719;
(EngineCore_0 pid=142) st.shared.b16 [%r268+8192], %rs721;
(EngineCore_0 pid=142) st.shared.b16 [%r268+9216], %rs723;
(EngineCore_0 pid=142) st.shared.b16 [%r268+10240], %rs725;
(EngineCore_0 pid=142) st.shared.b16 [%r268+11264], %rs727;
(EngineCore_0 pid=142) st.shared.b16 [%r268+12288], %rs729;
(EngineCore_0 pid=142) st.shared.b16 [%r268+13312], %rs731;
(EngineCore_0 pid=142) st.shared.b16 [%r268+14336], %rs733;
(EngineCore_0 pid=142) st.shared.b16 [%r268+15360], %rs735;
(EngineCore_0 pid=142) st.shared.b16 [%r269], %rs706;
(EngineCore_0 pid=142) st.shared.b16 [%r269+1024], %rs708;
(EngineCore_0 pid=142) st.shared.b16 [%r269+2048], %rs710;
(EngineCore_0 pid=142) st.shared.b16 [%r269+3072], %rs712;
(EngineCore_0 pid=142) st.shared.b16 [%r269+4096], %rs714;
(EngineCore_0 pid=142) st.shared.b16 [%r269+5120], %rs716;
(EngineCore_0 pid=142) st.shared.b16 [%r269+6144], %rs718;
(EngineCore_0 pid=142) st.shared.b16 [%r269+7168], %rs720;
(EngineCore_0 pid=142) st.shared.b16 [%r269+8192], %rs722;
(EngineCore_0 pid=142) st.shared.b16 [%r269+9216], %rs724;
(EngineCore_0 pid=142) st.shared.b16 [%r269+10240], %rs726;
(EngineCore_0 pid=142) st.shared.b16 [%r269+11264], %rs728;
(EngineCore_0 pid=142) st.shared.b16 [%r269+12288], %rs730;
(EngineCore_0 pid=142) st.shared.b16 [%r269+13312], %rs732;
(EngineCore_0 pid=142) st.shared.b16 [%r269+14336], %rs734;
(EngineCore_0 pid=142) st.shared.b16 [%r269+15360], %rs736;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) fence.proxy.async.shared::cta;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) barrier.sync.aligned 0, 256 ;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) elect.sync %r5806|%p91, -1;
(EngineCore_0 pid=142) and.pred %p89, %p69, %p91;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p89 cp.async.bulk.tensor.2d.global.shared::cta.bulk_group [%rd91, {%r4944, %r4942}], [%r4943];
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) cp.async.bulk.commit_group;
(EngineCore_0 pid=142) cp.async.bulk.wait_group.read 0;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) barrier.sync.aligned 0, 256 ;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) mov.b32 %r6042, 0f00000000;
(EngineCore_0 pid=142) mov.b32 %r6043, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6044, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6045, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6046, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6047, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6048, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6049, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6050, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6051, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6052, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6053, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6054, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6055, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6056, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6057, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6058, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6059, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6060, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6061, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6062, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6063, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6064, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6065, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6066, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6067, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6068, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6069, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6070, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6071, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6072, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6073, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6074, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6075, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6076, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6077, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6078, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6079, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6080, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6081, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6082, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6083, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6084, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6085, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6086, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6087, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6088, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6089, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6090, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6091, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6092, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6093, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6094, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6095, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6096, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6097, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6098, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6099, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6100, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6101, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6102, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6103, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6104, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6105, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6106, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6107, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6108, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6109, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6110, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6111, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6112, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6113, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6114, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6115, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6116, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6117, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6118, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6119, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6120, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6121, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6122, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6123, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6124, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6125, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6126, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6127, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6128, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6129, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6130, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6131, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6132, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6133, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6134, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6135, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6136, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6137, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6138, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6139, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6140, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6141, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6142, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6143, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6144, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6145, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6146, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6147, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6148, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6149, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6150, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6151, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6152, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6153, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6154, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6155, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6156, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6157, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6158, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6159, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6160, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6161, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6162, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6163, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6164, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6165, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6166, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6167, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6168, %r6042;
(EngineCore_0 pid=142) mov.b32 %r6169, %r6042;
(EngineCore_0 pid=142) bra.uni $L__BB0_21;
(EngineCore_0 pid=142) $L__BB0_22: // %._crit_edge780
(EngineCore_0 pid=142) .loc 1 135 133 // specialize.py:135:133
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) barrier.sync 1 ;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) setmaxnreg.inc.sync.aligned.u32 240;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) barrier.sync.aligned 0, 256 ;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p93 mbarrier.inval.shared::cta.b64 [%r992];
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p93 mbarrier.inval.shared::cta.b64 [%r991];
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p93 mbarrier.inval.shared::cta.b64 [%r989];
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p93 mbarrier.inval.shared::cta.b64 [%r988];
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p93 mbarrier.inval.shared::cta.b64 [%r986];
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) @%p93 mbarrier.inval.shared::cta.b64 [%r985];
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) $L__BB0_23: // %.loopexit
(EngineCore_0 pid=142) .loc 1 0 133 // specialize.py:0:133
(EngineCore_0 pid=142) mov.b32 %r5900, 50529027;
(EngineCore_0 pid=142) .loc 1 371 4 // specialize.py:371:4
(EngineCore_0 pid=142) st.shared.b32 [global_smem+83288], %r5900;
(EngineCore_0 pid=142) // begin inline asm
(EngineCore_0 pid=142) barrier.sync 1 ;
(EngineCore_0 pid=142) // end inline asm
(EngineCore_0 pid=142) $L__BB0_24: // %common.ret
(EngineCore_0 pid=142) .loc 1 0 0 // specialize.py:0
(EngineCore_0 pid=142) ret;
(EngineCore_0 pid=142) $L__tmp33:
(EngineCore_0 pid=142) $L__func_end0:
(EngineCore_0 pid=142) // -- End function
(EngineCore_0 pid=142) }
(EngineCore_0 pid=142) .file 1 "/usr/local/lib/python3.12/dist-packages/triton_kernels/specialize.py"
(EngineCore_0 pid=142) .file 2 "/usr/local/lib/python3.12/dist-packages/triton_kernels/matmul_ogs_details/_common.py"
(EngineCore_0 pid=142) .file 3 "/usr/local/lib/python3.12/dist-packages/triton_kernels/swiglu_details/_swiglu.py"
(EngineCore_0 pid=142) .file 4 "/usr/local/lib/python3.12/dist-packages/triton_kernels/matmul_ogs_details/_p_matmul_ogs.py"
(EngineCore_0 pid=142) .file 5 "/usr/local/lib/python3.12/dist-packages/triton/language/standard.py"
(EngineCore_0 pid=142) .file 6 "/usr/local/lib/python3.12/dist-packages/triton_kernels/tensor_details/layout_details/blackwell_scale.py"
(EngineCore_0 pid=142) .section .debug_abbrev
(EngineCore_0 pid=142) {
(EngineCore_0 pid=142) .b8 1 // Abbreviation Code
(EngineCore_0 pid=142) .b8 17 // DW_TAG_compile_unit
(EngineCore_0 pid=142) .b8 1 // DW_CHILDREN_yes
(EngineCore_0 pid=142) .b8 37 // DW_AT_producer
(EngineCore_0 pid=142) .b8 8 // DW_FORM_string
(EngineCore_0 pid=142) .b8 19 // DW_AT_language
(EngineCore_0 pid=142) .b8 5 // DW_FORM_data2
(EngineCore_0 pid=142) .b8 3 // DW_AT_name
(EngineCore_0 pid=142) .b8 8 // DW_FORM_string
(EngineCore_0 pid=142) .b8 16 // DW_AT_stmt_list
(EngineCore_0 pid=142) .b8 6 // DW_FORM_data4
(EngineCore_0 pid=142) .b8 27 // DW_AT_comp_dir
(EngineCore_0 pid=142) .b8 8 // DW_FORM_string
(EngineCore_0 pid=142) .b8 0 // EOM(1)
(EngineCore_0 pid=142) .b8 0 // EOM(2)
(EngineCore_0 pid=142) .b8 2 // Abbreviation Code
(EngineCore_0 pid=142) .b8 46 // DW_TAG_subprogram
(EngineCore_0 pid=142) .b8 0 // DW_CHILDREN_no
(EngineCore_0 pid=142) .b8 3 // DW_AT_name
(EngineCore_0 pid=142) .b8 8 // DW_FORM_string
(EngineCore_0 pid=142) .b8 32 // DW_AT_inline
(EngineCore_0 pid=142) .b8 11 // DW_FORM_data1
(EngineCore_0 pid=142) .b8 0 // EOM(1)
(EngineCore_0 pid=142) .b8 0 // EOM(2)
(EngineCore_0 pid=142) .b8 3 // Abbreviation Code
(EngineCore_0 pid=142) .b8 46 // DW_TAG_subprogram
(EngineCore_0 pid=142) .b8 1 // DW_CHILDREN_yes
(EngineCore_0 pid=142) .b8 17 // DW_AT_low_pc
(EngineCore_0 pid=142) .b8 1 // DW_FORM_addr
(EngineCore_0 pid=142) .b8 18 // DW_AT_high_pc
(EngineCore_0 pid=142) .b8 1 // DW_FORM_addr
(EngineCore_0 pid=142) .b8 49 // DW_AT_abstract_origin
(EngineCore_0 pid=142) .b8 19 // DW_FORM_ref4
(EngineCore_0 pid=142) .b8 0 // EOM(1)
(EngineCore_0 pid=142) .b8 0 // EOM(2)
(EngineCore_0 pid=142) .b8 4 // Abbreviation Code
(EngineCore_0 pid=142) .b8 29 // DW_TAG_inlined_subroutine
(EngineCore_0 pid=142) .b8 0 // DW_CHILDREN_no
(EngineCore_0 pid=142) .b8 49 // DW_AT_abstract_origin
(EngineCore_0 pid=142) .b8 19 // DW_FORM_ref4
(EngineCore_0 pid=142) .b8 17 // DW_AT_low_pc
(EngineCore_0 pid=142) .b8 1 // DW_FORM_addr
(EngineCore_0 pid=142) .b8 18 // DW_AT_high_pc
(EngineCore_0 pid=142) .b8 1 // DW_FORM_addr
(EngineCore_0 pid=142) .b8 88 // DW_AT_call_file
(EngineCore_0 pid=142) .b8 11 // DW_FORM_data1
(EngineCore_0 pid=142) .b8 89 // DW_AT_call_line
(EngineCore_0 pid=142) .b8 11 // DW_FORM_data1
(EngineCore_0 pid=142) .b8 87 // DW_AT_call_column
(EngineCore_0 pid=142) .b8 11 // DW_FORM_data1
(EngineCore_0 pid=142) .b8 0 // EOM(1)
(EngineCore_0 pid=142) .b8 0 // EOM(2)
(EngineCore_0 pid=142) .b8 5 // Abbreviation Code
(EngineCore_0 pid=142) .b8 29 // DW_TAG_inlined_subroutine
(EngineCore_0 pid=142) .b8 0 // DW_CHILDREN_no
(EngineCore_0 pid=142) .b8 49 // DW_AT_abstract_origin
(EngineCore_0 pid=142) .b8 19 // DW_FORM_ref4
(EngineCore_0 pid=142) .b8 17 // DW_AT_low_pc
(EngineCore_0 pid=142) .b8 1 // DW_FORM_addr
(EngineCore_0 pid=142) .b8 18 // DW_AT_high_pc
(EngineCore_0 pid=142) .b8 1 // DW_FORM_addr
(EngineCore_0 pid=142) .b8 88 // DW_AT_call_file
(EngineCore_0 pid=142) .b8 11 // DW_FORM_data1
(EngineCore_0 pid=142) .b8 89 // DW_AT_call_line
(EngineCore_0 pid=142) .b8 5 // DW_FORM_data2
(EngineCore_0 pid=142) .b8 87 // DW_AT_call_column
(EngineCore_0 pid=142) .b8 11 // DW_FORM_data1
(EngineCore_0 pid=142) .b8 0 // EOM(1)
(EngineCore_0 pid=142) .b8 0 // EOM(2)
(EngineCore_0 pid=142) .b8 0 // EOM(3)
(EngineCore_0 pid=142) }
(EngineCore_0 pid=142) .section .debug_info
(EngineCore_0 pid=142) {
(EngineCore_0 pid=142) .b32 315 // Length of Unit
(EngineCore_0 pid=142) .b8 2 // DWARF version number
(EngineCore_0 pid=142) .b8 0
(EngineCore_0 pid=142) .b32 .debug_abbrev // Offset Into Abbrev. Section
(EngineCore_0 pid=142) .b8 8 // Address Size (in bytes)
(EngineCore_0 pid=142) .b8 1 // Abbrev [1] 0xb:0x134 DW_TAG_compile_unit
(EngineCore_0 pid=142) .b8 116 // DW_AT_producer
(EngineCore_0 pid=142) .b8 114
(EngineCore_0 pid=142) .b8 105
(EngineCore_0 pid=142) .b8 116
(EngineCore_0 pid=142) .b8 111
(EngineCore_0 pid=142) .b8 110
(EngineCore_0 pid=142) .b8 0
(EngineCore_0 pid=142) .b8 2 // DW_AT_language
(EngineCore_0 pid=142) .b8 0
(EngineCore_0 pid=142) .b8 115 // DW_AT_name
(EngineCore_0 pid=142) .b8 112
(EngineCore_0 pid=142) .b8 101
(EngineCore_0 pid=142) .b8 99
(EngineCore_0 pid=142) .b8 105
(EngineCore_0 pid=142) .b8 97
(EngineCore_0 pid=142) .b8 108
(EngineCore_0 pid=142) .b8 105
(EngineCore_0 pid=142) .b8 122
(EngineCore_0 pid=142) .b8 101
(EngineCore_0 pid=142) .b8 46
(EngineCore_0 pid=142) .b8 112
(EngineCore_0 pid=142) .b8 121
(EngineCore_0 pid=142) .b8 0
(EngineCore_0 pid=142) .b32 .debug_line // DW_AT_stmt_list
(EngineCore_0 pid=142) .b8 47 // DW_AT_comp_dir
(EngineCore_0 pid=142) .b8 117
(EngineCore_0 pid=142) .b8 115
(EngineCore_0 pid=142) .b8 114
(EngineCore_0 pid=142) .b8 47
(EngineCore_0 pid=142) .b8 108
(EngineCore_0 pid=142) .b8 111
(EngineCore_0 pid=142) .b8 99
(EngineCore_0 pid=142) .b8 97
(EngineCore_0 pid=142) .b8 108
(EngineCore_0 pid=142) .b8 47
(EngineCore_0 pid=142) .b8 108
(EngineCore_0 pid=142) .b8 105
(EngineCore_0 pid=142) .b8 98
(EngineCore_0 pid=142) .b8 47
(EngineCore_0 pid=142) .b8 112
(EngineCore_0 pid=142) .b8 121
(EngineCore_0 pid=142) .b8 116
(EngineCore_0 pid=142) .b8 104
(EngineCore_0 pid=142) .b8 111
(EngineCore_0 pid=142) .b8 110
(EngineCore_0 pid=142) .b8 51
(EngineCore_0 pid=142) .b8 46
(EngineCore_0 pid=142) .b8 49
(EngineCore_0 pid=142) .b8 50
(EngineCore_0 pid=142) .b8 47
(EngineCore_0 pid=142) .b8 100
(EngineCore_0 pid=142) .b8 105
(EngineCore_0 pid=142) .b8 115
(EngineCore_0 pid=142) .b8 116
(EngineCore_0 pid=142) .b8 45
(EngineCore_0 pid=142) .b8 112
(EngineCore_0 pid=142) .b8 97
(EngineCore_0 pid=142) .b8 99
(EngineCore_0 pid=142) .b8 107
(EngineCore_0 pid=142) .b8 97
(EngineCore_0 pid=142) .b8 103
(EngineCore_0 pid=142) .b8 101
(EngineCore_0 pid=142) .b8 115
(EngineCore_0 pid=142) .b8 47
(EngineCore_0 pid=142) .b8 116
(EngineCore_0 pid=142) .b8 114
(EngineCore_0 pid=142) .b8 105
(EngineCore_0 pid=142) .b8 116
(EngineCore_0 pid=142) .b8 111
(EngineCore_0 pid=142) .b8 110
(EngineCore_0 pid=142) .b8 95
(EngineCore_0 pid=142) .b8 107
(EngineCore_0 pid=142) .b8 101
(EngineCore_0 pid=142) .b8 114
(EngineCore_0 pid=142) .b8 110
(EngineCore_0 pid=142) .b8 101
(EngineCore_0 pid=142) .b8 108
(EngineCore_0 pid=142) .b8 115
(EngineCore_0 pid=142) .b8 0
(EngineCore_0 pid=142) .b8 2 // Abbrev [2] 0x5e:0x39 DW_TAG_subprogram
(EngineCore_0 pid=142) .b8 95 // DW_AT_name
(EngineCore_0 pid=142) .b8 112
(EngineCore_0 pid=142) .b8 95
(EngineCore_0 pid=142) .b8 109
(EngineCore_0 pid=142) .b8 97
(EngineCore_0 pid=142) .b8 116
(EngineCore_0 pid=142) .b8 109
(EngineCore_0 pid=142) .b8 117
(EngineCore_0 pid=142) .b8 108
(EngineCore_0 pid=142) .b8 95
(EngineCore_0 pid=142) .b8 111
(EngineCore_0 pid=142) .b8 103
(EngineCore_0 pid=142) .b8 115
(EngineCore_0 pid=142) .b8 95
(EngineCore_0 pid=142) .b8 78
(EngineCore_0 pid=142) .b8 78
(EngineCore_0 pid=142) .b8 84
(EngineCore_0 pid=142) .b8 95
(EngineCore_0 pid=142) .b8 98
(EngineCore_0 pid=142) .b8 102
(EngineCore_0 pid=142) .b8 49
(EngineCore_0 pid=142) .b8 54
(EngineCore_0 pid=142) .b8 120
(EngineCore_0 pid=142) .b8 98
(EngineCore_0 pid=142) .b8 102
(EngineCore_0 pid=142) .b8 49
(EngineCore_0 pid=142) .b8 54
(EngineCore_0 pid=142) .b8 120
(EngineCore_0 pid=142) .b8 109
(EngineCore_0 pid=142) .b8 120
(EngineCore_0 pid=142) .b8 102
(EngineCore_0 pid=142) .b8 112
(EngineCore_0 pid=142) .b8 52
(EngineCore_0 pid=142) .b8 95
(EngineCore_0 pid=142) .b8 49
(EngineCore_0 pid=142) .b8 50
(EngineCore_0 pid=142) .b8 56
(EngineCore_0 pid=142) .b8 120
(EngineCore_0 pid=142) .b8 50
(EngineCore_0 pid=142) .b8 53
(EngineCore_0 pid=142) .b8 54
(EngineCore_0 pid=142) .b8 120
(EngineCore_0 pid=142) .b8 49
(EngineCore_0 pid=142) .b8 50
(EngineCore_0 pid=142) .b8 56
(EngineCore_0 pid=142) .b8 120
(EngineCore_0 pid=142) .b8 49
(EngineCore_0 pid=142) .b8 95
(EngineCore_0 pid=142) .b8 115
(EngineCore_0 pid=142) .b8 119
(EngineCore_0 pid=142) .b8 105
(EngineCore_0 pid=142) .b8 103
(EngineCore_0 pid=142) .b8 108
(EngineCore_0 pid=142) .b8 117
(EngineCore_0 pid=142) .b8 0
(EngineCore_0 pid=142) .b8 1 // DW_AT_inline
(EngineCore_0 pid=142) .b8 3 // Abbrev [3] 0x97:0xa7 DW_TAG_subprogram
(EngineCore_0 pid=142) .b64 $L__func_begin0 // DW_AT_low_pc
(EngineCore_0 pid=142) .b64 $L__func_end0 // DW_AT_high_pc
(EngineCore_0 pid=142) .b32 94 // DW_AT_abstract_origin
(EngineCore_0 pid=142) .b8 4 // Abbrev [4] 0xac:0x18 DW_TAG_inlined_subroutine
(EngineCore_0 pid=142) .b32 94 // DW_AT_abstract_origin
(EngineCore_0 pid=142) .b64 $L__tmp1 // DW_AT_low_pc
(EngineCore_0 pid=142) .b64 $L__tmp14 // DW_AT_high_pc
(EngineCore_0 pid=142) .b8 1 // DW_AT_call_file
(EngineCore_0 pid=142) .b8 140 // DW_AT_call_line
(EngineCore_0 pid=142) .b8 21 // DW_AT_call_column
(EngineCore_0 pid=142) .b8 5 // Abbrev [5] 0xc4:0x19 DW_TAG_inlined_subroutine
(EngineCore_0 pid=142) .b32 94 // DW_AT_abstract_origin
(EngineCore_0 pid=142) .b64 $L__tmp3 // DW_AT_low_pc
(EngineCore_0 pid=142) .b64 $L__tmp32 // DW_AT_high_pc
(EngineCore_0 pid=142) .b8 1 // DW_AT_call_file
(EngineCore_0 pid=142) .b8 74 // DW_AT_call_line
(EngineCore_0 pid=142) .b8 1
(EngineCore_0 pid=142) .b8 46 // DW_AT_call_column
(EngineCore_0 pid=142) .b8 4 // Abbrev [4] 0xdd:0x18 DW_TAG_inlined_subroutine
(EngineCore_0 pid=142) .b32 94 // DW_AT_abstract_origin
(EngineCore_0 pid=142) .b64 $L__tmp5 // DW_AT_low_pc
(EngineCore_0 pid=142) .b64 $L__tmp28 // DW_AT_high_pc
(EngineCore_0 pid=142) .b8 1 // DW_AT_call_file
(EngineCore_0 pid=142) .b8 225 // DW_AT_call_line
(EngineCore_0 pid=142) .b8 25 // DW_AT_call_column
(EngineCore_0 pid=142) .b8 4 // Abbrev [4] 0xf5:0x18 DW_TAG_inlined_subroutine
(EngineCore_0 pid=142) .b32 94 // DW_AT_abstract_origin
(EngineCore_0 pid=142) .b64 $L__tmp11 // DW_AT_low_pc
(EngineCore_0 pid=142) .b64 $L__tmp24 // DW_AT_high_pc
(EngineCore_0 pid=142) .b8 1 // DW_AT_call_file
(EngineCore_0 pid=142) .b8 193 // DW_AT_call_line
(EngineCore_0 pid=142) .b8 32 // DW_AT_call_column
(EngineCore_0 pid=142) .b8 4 // Abbrev [4] 0x10d:0x18 DW_TAG_inlined_subroutine
(EngineCore_0 pid=142) .b32 94 // DW_AT_abstract_origin
(EngineCore_0 pid=142) .b64 $L__tmp15 // DW_AT_low_pc
(EngineCore_0 pid=142) .b64 $L__tmp16 // DW_AT_high_pc
(EngineCore_0 pid=142) .b8 1 // DW_AT_call_file
(EngineCore_0 pid=142) .b8 116 // DW_AT_call_line
(EngineCore_0 pid=142) .b8 25 // DW_AT_call_column
(EngineCore_0 pid=142) .b8 4 // Abbrev [4] 0x125:0x18 DW_TAG_inlined_subroutine
(EngineCore_0 pid=142) .b32 94 // DW_AT_abstract_origin
(EngineCore_0 pid=142) .b64 $L__tmp19 // DW_AT_low_pc
(EngineCore_0 pid=142) .b64 $L__tmp26 // DW_AT_high_pc
(EngineCore_0 pid=142) .b8 1 // DW_AT_call_file
(EngineCore_0 pid=142) .b8 206 // DW_AT_call_line
(EngineCore_0 pid=142) .b8 53 // DW_AT_call_column
(EngineCore_0 pid=142) .b8 0 // End Of Children Mark
(EngineCore_0 pid=142) .b8 0 // End Of Children Mark
(EngineCore_0 pid=142) }
(EngineCore_0 pid=142) .section .debug_macinfo { }
(EngineCore_0 pid=142)
(EngineCore_0 pid=142) ================================================================
(EngineCore_0 pid=142) please share the reproducer above with Triton project.
(EngineCore_0 pid=142)
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] EngineCore failed to start.
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] Traceback (most recent call last):
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/triton/backends/nvidia/compiler.py", line 439, in make_cubin
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] subprocess.run(ptxas_cmd, check=True, close_fds=False, stderr=flog)
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] File "/usr/lib/python3.12/subprocess.py", line 571, in run
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] raise CalledProcessError(retcode, process.args,
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] subprocess.CalledProcessError: Command '['/usr/local/lib/python3.12/dist-packages/triton/backends/nvidia/bin/ptxas', '-lineinfo', '-v', '--gpu-name=sm_120a', '/tmp/tmp7fyy4dpd.ptx', '-o', '/tmp/tmp7fyy4dpd.ptx.o']' returned non-zero exit status 255.
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718]
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] During handling of the above exception, another exception occurred:
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718]
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] Traceback (most recent call last):
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 709, in run_engine_core
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 510, in __init__
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 91, in __init__
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] self._initialize_kv_caches(vllm_config)
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 181, in _initialize_kv_caches
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] self.model_executor.determine_available_memory())
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 76, in determine_available_memory
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] output = self.collective_rpc("determine_available_memory")
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 58, in collective_rpc
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] answer = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/utils/__init__.py", line 2948, in run_method
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] return func(*args, **kwargs)
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] return func(*args, **kwargs)
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 243, in determine_available_memory
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] self.model_runner.profile_run()
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 2498, in profile_run
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] = self._dummy_run(self.max_num_tokens, is_profile=True)
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] return func(*args, **kwargs)
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 2250, in _dummy_run
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] outputs = model(
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] ^^^^^^
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] return self._call_impl(*args, **kwargs)
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] return forward_call(*args, **kwargs)
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gpt_oss.py", line 258, in forward
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] return self.model(input_ids, positions)
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 272, in __call__
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] output = self.compiled_callable(*args, **kwargs)
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 804, in compile_wrapper
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] return fn(*args, **kwargs)
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] ^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gpt_oss.py", line 222, in forward
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] def forward(self, input_ids: torch.Tensor,
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 413, in __call__
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] return super().__call__(*args, **kwargs)
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] return self._call_impl(*args, **kwargs)
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] return forward_call(*args, **kwargs)
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 1005, in _fn
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] return fn(*args, **kwargs)
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] ^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 837, in call_wrapped
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] return self._wrapped_call(self, *args, **kwargs)
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 413, in __call__
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] raise e
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 400, in __call__
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] return super(self.cls, obj).__call__(*args, **kwargs) # type: ignore[misc]
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] return self._call_impl(*args, **kwargs)
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] return forward_call(*args, **kwargs)
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] File "<eval_with_key>.50", line 209, in forward
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] submod_2 = self.submod_2(getitem_3, s72, l_self_modules_layers_modules_0_modules_attn_modules_o_proj_parameters_weight_, l_self_modules_layers_modules_0_modules_attn_modules_o_proj_parameters_bias_, getitem_4, l_self_modules_layers_modules_0_modules_mlp_modules_norm_parameters_weight_, l_self_modules_layers_modules_0_modules_mlp_modules_router_parameters_weight_, l_self_modules_layers_modules_0_modules_mlp_modules_router_parameters_bias_, l_self_modules_layers_modules_1_modules_attn_modules_norm_parameters_weight_, l_self_modules_layers_modules_1_modules_attn_modules_qkv_parameters_weight_, l_self_modules_layers_modules_1_modules_attn_modules_qkv_parameters_bias_, l_positions_, l_self_modules_layers_modules_0_modules_attn_modules_rotary_emb_buffers_cos_sin_cache_); getitem_3 = l_self_modules_layers_modules_0_modules_attn_modules_o_proj_parameters_weight_ = l_self_modules_layers_modules_0_modules_attn_modules_o_proj_parameters_bias_ = getitem_4 = l_self_modules_layers_modules_0_modules_mlp_modules_norm_parameters_weight_ = l_self_modules_layers_modules_0_modules_mlp_modules_router_parameters_weight_ = l_self_modules_layers_modules_0_modules_mlp_modules_router_parameters_bias_ = l_self_modules_layers_modules_1_modules_attn_modules_norm_parameters_weight_ = l_self_modules_layers_modules_1_modules_attn_modules_qkv_parameters_weight_ = l_self_modules_layers_modules_1_modules_attn_modules_qkv_parameters_bias_ = None
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/cuda_piecewise_backend.py", line 112, in __call__
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] return self.compiled_graph_for_general_shape(*args)
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/standalone_compile.py", line 62, in __call__
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] return self._compiled_fn(*args)
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 1005, in _fn
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] return fn(*args, **kwargs)
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] ^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/aot_autograd.py", line 1124, in forward
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] return compiled_fn(full_args)
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 353, in runtime_wrapper
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] all_outs = call_func_at_runtime_with_args(
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/utils.py", line 129, in call_func_at_runtime_with_args
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] out = normalize_as_list(f(args))
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] ^^^^^^^
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 724, in inner_fn
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] outs = compiled_fn(args)
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] ^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 526, in wrapper
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] return compiled_fn(runtime_args)
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/output_code.py", line 588, in __call__
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] return self.current_callable(inputs)
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/utils.py", line 2887, in run
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] out = model(new_inputs)
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] ^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] File "/tmp/torchinductor_root/6n/c6ncsqmwytnmkipbbjrqegchtbp4mvs5v2zban6y36txr7wkgpjh.py", line 1030, in call
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] buf4 = torch.ops.vllm.moe_forward.default(buf2, buf3, 'model.block.0.mlp.experts')
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 840, in __call__
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] return self._op(*args, **kwargs)
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/layer.py", line 1687, in moe_forward
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] return self.forward_impl(hidden_states, router_logits)
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/layer.py", line 1596, in forward_impl
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] final_hidden_states = self.quant_method.apply(
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/mxfp4.py", line 486, in apply
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] return triton_kernel_moe_forward(
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/triton_kernels_moe.py", line 45, in triton_kernel_moe_forward
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] return triton_kernel_fused_experts(
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/triton_kernels_moe.py", line 123, in triton_kernel_fused_experts
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] intermediate_cache1 = matmul_ogs(
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] ^^^^^^^^^^^
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/triton_kernels/matmul_ogs.py", line 531, in matmul_ogs
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] (kernels._p_matmul_ogs if opt_flags.is_persistent else kernels._matmul_ogs)[(grid,)](
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py", line 393, in <lambda>
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py", line 593, in run
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] kernel = self._do_compile(key, signature, device, constexprs, options, attrs, warmup)
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py", line 776, in _do_compile
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] kernel = self.compile(src, target=target, options=options.__dict__)
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/triton/compiler/compiler.py", line 322, in compile
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] next_module = compile_ir(module, metadata)
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/triton/backends/nvidia/compiler.py", line 491, in <lambda>
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] stages["cubin"] = lambda src, metadata: self.make_cubin(src, metadata, options, self.target.arch)
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/triton/backends/nvidia/compiler.py", line 474, in make_cubin
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] raise PTXASError(error)
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] triton.runtime.errors.PTXASError: PTXAS error: Internal Triton PTX codegen error
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] `ptxas` stderr:
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] ptxas /tmp/tmp7fyy4dpd.ptx, line 670; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] ptxas /tmp/tmp7fyy4dpd.ptx, line 675; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] ptxas /tmp/tmp7fyy4dpd.ptx, line 679; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] ptxas /tmp/tmp7fyy4dpd.ptx, line 683; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] ptxas /tmp/tmp7fyy4dpd.ptx, line 687; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] ptxas /tmp/tmp7fyy4dpd.ptx, line 691; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] ptxas /tmp/tmp7fyy4dpd.ptx, line 695; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] ptxas /tmp/tmp7fyy4dpd.ptx, line 699; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] ptxas /tmp/tmp7fyy4dpd.ptx, line 703; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] ptxas /tmp/tmp7fyy4dpd.ptx, line 707; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] ptxas /tmp/tmp7fyy4dpd.ptx, line 711; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] ptxas /tmp/tmp7fyy4dpd.ptx, line 715; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] ptxas /tmp/tmp7fyy4dpd.ptx, line 719; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] ptxas /tmp/tmp7fyy4dpd.ptx, line 723; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] ptxas /tmp/tmp7fyy4dpd.ptx, line 727; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] ptxas /tmp/tmp7fyy4dpd.ptx, line 731; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] ptxas /tmp/tmp7fyy4dpd.ptx, line 735; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] ptxas /tmp/tmp7fyy4dpd.ptx, line 739; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] ptxas /tmp/tmp7fyy4dpd.ptx, line 743; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] ptxas /tmp/tmp7fyy4dpd.ptx, line 747; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] ptxas /tmp/tmp7fyy4dpd.ptx, line 751; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] ptxas /tmp/tmp7fyy4dpd.ptx, line 755; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] ptxas /tmp/tmp7fyy4dpd.ptx, line 759; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] ptxas /tmp/tmp7fyy4dpd.ptx, line 763; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] ptxas /tmp/tmp7fyy4dpd.ptx, line 767; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] ptxas /tmp/tmp7fyy4dpd.ptx, line 771; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] ptxas /tmp/tmp7fyy4dpd.ptx, line 775; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] ptxas /tmp/tmp7fyy4dpd.ptx, line 779; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] ptxas /tmp/tmp7fyy4dpd.ptx, line 783; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] ptxas /tmp/tmp7fyy4dpd.ptx, line 787; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] ptxas /tmp/tmp7fyy4dpd.ptx, line 791; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] ptxas /tmp/tmp7fyy4dpd.ptx, line 795; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] ptxas fatal : Ptx assembly aborted due to errors
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718]
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718] Repro command: /usr/local/lib/python3.12/dist-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_120a /tmp/tmp7fyy4dpd.ptx -o /tmp/tmp7fyy4dpd.ptx.o
(EngineCore_0 pid=142) ERROR 08-08 17:05:31 [core.py:718]
(EngineCore_0 pid=142) Process EngineCore_0:
(EngineCore_0 pid=142) Traceback (most recent call last):
(EngineCore_0 pid=142) File "/usr/local/lib/python3.12/dist-packages/triton/backends/nvidia/compiler.py", line 439, in make_cubin
(EngineCore_0 pid=142) subprocess.run(ptxas_cmd, check=True, close_fds=False, stderr=flog)
(EngineCore_0 pid=142) File "/usr/lib/python3.12/subprocess.py", line 571, in run
(EngineCore_0 pid=142) raise CalledProcessError(retcode, process.args,
(EngineCore_0 pid=142) subprocess.CalledProcessError: Command '['/usr/local/lib/python3.12/dist-packages/triton/backends/nvidia/bin/ptxas', '-lineinfo', '-v', '--gpu-name=sm_120a', '/tmp/tmp7fyy4dpd.ptx', '-o', '/tmp/tmp7fyy4dpd.ptx.o']' returned non-zero exit status 255.
(EngineCore_0 pid=142)
(EngineCore_0 pid=142) During handling of the above exception, another exception occurred:
(EngineCore_0 pid=142)
(EngineCore_0 pid=142) Traceback (most recent call last):
(EngineCore_0 pid=142) File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_0 pid=142) self.run()
(EngineCore_0 pid=142) File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore_0 pid=142) self._target(*self._args, **self._kwargs)
(EngineCore_0 pid=142) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 722, in run_engine_core
(EngineCore_0 pid=142) raise e
(EngineCore_0 pid=142) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 709, in run_engine_core
(EngineCore_0 pid=142) engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_0 pid=142) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 510, in __init__
(EngineCore_0 pid=142) super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_0 pid=142) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 91, in __init__
(EngineCore_0 pid=142) self._initialize_kv_caches(vllm_config)
(EngineCore_0 pid=142) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 181, in _initialize_kv_caches
(EngineCore_0 pid=142) self.model_executor.determine_available_memory())
(EngineCore_0 pid=142) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 76, in determine_available_memory
(EngineCore_0 pid=142) output = self.collective_rpc("determine_available_memory")
(EngineCore_0 pid=142) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 58, in collective_rpc
(EngineCore_0 pid=142) answer = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_0 pid=142) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) File "/usr/local/lib/python3.12/dist-packages/vllm/utils/__init__.py", line 2948, in run_method
(EngineCore_0 pid=142) return func(*args, **kwargs)
(EngineCore_0 pid=142) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore_0 pid=142) return func(*args, **kwargs)
(EngineCore_0 pid=142) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 243, in determine_available_memory
(EngineCore_0 pid=142) self.model_runner.profile_run()
(EngineCore_0 pid=142) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 2498, in profile_run
(EngineCore_0 pid=142) = self._dummy_run(self.max_num_tokens, is_profile=True)
(EngineCore_0 pid=142) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore_0 pid=142) return func(*args, **kwargs)
(EngineCore_0 pid=142) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 2250, in _dummy_run
(EngineCore_0 pid=142) outputs = model(
(EngineCore_0 pid=142) ^^^^^^
(EngineCore_0 pid=142) File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore_0 pid=142) return self._call_impl(*args, **kwargs)
(EngineCore_0 pid=142) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore_0 pid=142) return forward_call(*args, **kwargs)
(EngineCore_0 pid=142) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gpt_oss.py", line 258, in forward
(EngineCore_0 pid=142) return self.model(input_ids, positions)
(EngineCore_0 pid=142) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 272, in __call__
(EngineCore_0 pid=142) output = self.compiled_callable(*args, **kwargs)
(EngineCore_0 pid=142) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 804, in compile_wrapper
(EngineCore_0 pid=142) return fn(*args, **kwargs)
(EngineCore_0 pid=142) ^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gpt_oss.py", line 222, in forward
(EngineCore_0 pid=142) def forward(self, input_ids: torch.Tensor,
(EngineCore_0 pid=142) File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 413, in __call__
(EngineCore_0 pid=142) return super().__call__(*args, **kwargs)
(EngineCore_0 pid=142) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore_0 pid=142) return self._call_impl(*args, **kwargs)
(EngineCore_0 pid=142) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore_0 pid=142) return forward_call(*args, **kwargs)
(EngineCore_0 pid=142) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 1005, in _fn
(EngineCore_0 pid=142) return fn(*args, **kwargs)
(EngineCore_0 pid=142) ^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 837, in call_wrapped
(EngineCore_0 pid=142) return self._wrapped_call(self, *args, **kwargs)
(EngineCore_0 pid=142) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 413, in __call__
(EngineCore_0 pid=142) raise e
(EngineCore_0 pid=142) File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 400, in __call__
(EngineCore_0 pid=142) return super(self.cls, obj).__call__(*args, **kwargs) # type: ignore[misc]
(EngineCore_0 pid=142) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore_0 pid=142) return self._call_impl(*args, **kwargs)
(EngineCore_0 pid=142) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore_0 pid=142) return forward_call(*args, **kwargs)
(EngineCore_0 pid=142) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) File "<eval_with_key>.50", line 209, in forward
(EngineCore_0 pid=142) submod_2 = self.submod_2(getitem_3, s72, l_self_modules_layers_modules_0_modules_attn_modules_o_proj_parameters_weight_, l_self_modules_layers_modules_0_modules_attn_modules_o_proj_parameters_bias_, getitem_4, l_self_modules_layers_modules_0_modules_mlp_modules_norm_parameters_weight_, l_self_modules_layers_modules_0_modules_mlp_modules_router_parameters_weight_, l_self_modules_layers_modules_0_modules_mlp_modules_router_parameters_bias_, l_self_modules_layers_modules_1_modules_attn_modules_norm_parameters_weight_, l_self_modules_layers_modules_1_modules_attn_modules_qkv_parameters_weight_, l_self_modules_layers_modules_1_modules_attn_modules_qkv_parameters_bias_, l_positions_, l_self_modules_layers_modules_0_modules_attn_modules_rotary_emb_buffers_cos_sin_cache_); getitem_3 = l_self_modules_layers_modules_0_modules_attn_modules_o_proj_parameters_weight_ = l_self_modules_layers_modules_0_modules_attn_modules_o_proj_parameters_bias_ = getitem_4 = l_self_modules_layers_modules_0_modules_mlp_modules_norm_parameters_weight_ = l_self_modules_layers_modules_0_modules_mlp_modules_router_parameters_weight_ = l_self_modules_layers_modules_0_modules_mlp_modules_router_parameters_bias_ = l_self_modules_layers_modules_1_modules_attn_modules_norm_parameters_weight_ = l_self_modules_layers_modules_1_modules_attn_modules_qkv_parameters_weight_ = l_self_modules_layers_modules_1_modules_attn_modules_qkv_parameters_bias_ = None
(EngineCore_0 pid=142) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/cuda_piecewise_backend.py", line 112, in __call__
(EngineCore_0 pid=142) return self.compiled_graph_for_general_shape(*args)
(EngineCore_0 pid=142) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/standalone_compile.py", line 62, in __call__
(EngineCore_0 pid=142) return self._compiled_fn(*args)
(EngineCore_0 pid=142) ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 1005, in _fn
(EngineCore_0 pid=142) return fn(*args, **kwargs)
(EngineCore_0 pid=142) ^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/aot_autograd.py", line 1124, in forward
(EngineCore_0 pid=142) return compiled_fn(full_args)
(EngineCore_0 pid=142) ^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 353, in runtime_wrapper
(EngineCore_0 pid=142) all_outs = call_func_at_runtime_with_args(
(EngineCore_0 pid=142) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/utils.py", line 129, in call_func_at_runtime_with_args
(EngineCore_0 pid=142) out = normalize_as_list(f(args))
(EngineCore_0 pid=142) ^^^^^^^
(EngineCore_0 pid=142) File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 724, in inner_fn
(EngineCore_0 pid=142) outs = compiled_fn(args)
(EngineCore_0 pid=142) ^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 526, in wrapper
(EngineCore_0 pid=142) return compiled_fn(runtime_args)
(EngineCore_0 pid=142) ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/output_code.py", line 588, in __call__
(EngineCore_0 pid=142) return self.current_callable(inputs)
(EngineCore_0 pid=142) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/utils.py", line 2887, in run
(EngineCore_0 pid=142) out = model(new_inputs)
(EngineCore_0 pid=142) ^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) File "/tmp/torchinductor_root/6n/c6ncsqmwytnmkipbbjrqegchtbp4mvs5v2zban6y36txr7wkgpjh.py", line 1030, in call
(EngineCore_0 pid=142) buf4 = torch.ops.vllm.moe_forward.default(buf2, buf3, 'model.block.0.mlp.experts')
(EngineCore_0 pid=142) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 840, in __call__
(EngineCore_0 pid=142) return self._op(*args, **kwargs)
(EngineCore_0 pid=142) ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/layer.py", line 1687, in moe_forward
(EngineCore_0 pid=142) return self.forward_impl(hidden_states, router_logits)
(EngineCore_0 pid=142) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/layer.py", line 1596, in forward_impl
(EngineCore_0 pid=142) final_hidden_states = self.quant_method.apply(
(EngineCore_0 pid=142) ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/mxfp4.py", line 486, in apply
(EngineCore_0 pid=142) return triton_kernel_moe_forward(
(EngineCore_0 pid=142) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/triton_kernels_moe.py", line 45, in triton_kernel_moe_forward
(EngineCore_0 pid=142) return triton_kernel_fused_experts(
(EngineCore_0 pid=142) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/triton_kernels_moe.py", line 123, in triton_kernel_fused_experts
(EngineCore_0 pid=142) intermediate_cache1 = matmul_ogs(
(EngineCore_0 pid=142) ^^^^^^^^^^^
(EngineCore_0 pid=142) File "/usr/local/lib/python3.12/dist-packages/triton_kernels/matmul_ogs.py", line 531, in matmul_ogs
(EngineCore_0 pid=142) (kernels._p_matmul_ogs if opt_flags.is_persistent else kernels._matmul_ogs)[(grid,)](
(EngineCore_0 pid=142) File "/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py", line 393, in <lambda>
(EngineCore_0 pid=142) return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
(EngineCore_0 pid=142) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) File "/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py", line 593, in run
(EngineCore_0 pid=142) kernel = self._do_compile(key, signature, device, constexprs, options, attrs, warmup)
(EngineCore_0 pid=142) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) File "/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py", line 776, in _do_compile
(EngineCore_0 pid=142) kernel = self.compile(src, target=target, options=options.__dict__)
(EngineCore_0 pid=142) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) File "/usr/local/lib/python3.12/dist-packages/triton/compiler/compiler.py", line 322, in compile
(EngineCore_0 pid=142) next_module = compile_ir(module, metadata)
(EngineCore_0 pid=142) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) File "/usr/local/lib/python3.12/dist-packages/triton/backends/nvidia/compiler.py", line 491, in <lambda>
(EngineCore_0 pid=142) stages["cubin"] = lambda src, metadata: self.make_cubin(src, metadata, options, self.target.arch)
(EngineCore_0 pid=142) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=142) File "/usr/local/lib/python3.12/dist-packages/triton/backends/nvidia/compiler.py", line 474, in make_cubin
(EngineCore_0 pid=142) raise PTXASError(error)
(EngineCore_0 pid=142) triton.runtime.errors.PTXASError: PTXAS error: Internal Triton PTX codegen error
(EngineCore_0 pid=142) `ptxas` stderr:
(EngineCore_0 pid=142) ptxas /tmp/tmp7fyy4dpd.ptx, line 670; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ptxas /tmp/tmp7fyy4dpd.ptx, line 675; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ptxas /tmp/tmp7fyy4dpd.ptx, line 679; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ptxas /tmp/tmp7fyy4dpd.ptx, line 683; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ptxas /tmp/tmp7fyy4dpd.ptx, line 687; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ptxas /tmp/tmp7fyy4dpd.ptx, line 691; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ptxas /tmp/tmp7fyy4dpd.ptx, line 695; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ptxas /tmp/tmp7fyy4dpd.ptx, line 699; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ptxas /tmp/tmp7fyy4dpd.ptx, line 703; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ptxas /tmp/tmp7fyy4dpd.ptx, line 707; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ptxas /tmp/tmp7fyy4dpd.ptx, line 711; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ptxas /tmp/tmp7fyy4dpd.ptx, line 715; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ptxas /tmp/tmp7fyy4dpd.ptx, line 719; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ptxas /tmp/tmp7fyy4dpd.ptx, line 723; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ptxas /tmp/tmp7fyy4dpd.ptx, line 727; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ptxas /tmp/tmp7fyy4dpd.ptx, line 731; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ptxas /tmp/tmp7fyy4dpd.ptx, line 735; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ptxas /tmp/tmp7fyy4dpd.ptx, line 739; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ptxas /tmp/tmp7fyy4dpd.ptx, line 743; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ptxas /tmp/tmp7fyy4dpd.ptx, line 747; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ptxas /tmp/tmp7fyy4dpd.ptx, line 751; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ptxas /tmp/tmp7fyy4dpd.ptx, line 755; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ptxas /tmp/tmp7fyy4dpd.ptx, line 759; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ptxas /tmp/tmp7fyy4dpd.ptx, line 763; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ptxas /tmp/tmp7fyy4dpd.ptx, line 767; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ptxas /tmp/tmp7fyy4dpd.ptx, line 771; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ptxas /tmp/tmp7fyy4dpd.ptx, line 775; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ptxas /tmp/tmp7fyy4dpd.ptx, line 779; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ptxas /tmp/tmp7fyy4dpd.ptx, line 783; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ptxas /tmp/tmp7fyy4dpd.ptx, line 787; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ptxas /tmp/tmp7fyy4dpd.ptx, line 791; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ptxas /tmp/tmp7fyy4dpd.ptx, line 795; error : Feature '.tile::gather4 with destination state space as .shared::cluster' not supported on .target 'sm_120a'
(EngineCore_0 pid=142) ptxas fatal : Ptx assembly aborted due to errors
(EngineCore_0 pid=142)
(EngineCore_0 pid=142) Repro command: /usr/local/lib/python3.12/dist-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_120a /tmp/tmp7fyy4dpd.ptx -o /tmp/tmp7fyy4dpd.ptx.o
(EngineCore_0 pid=142)
[rank0]:[W808 17:05:31.177526296 ProcessGroupNCCL.cpp:1522] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(APIServer pid=1) Traceback (most recent call last):
(APIServer pid=1) File "<frozen runpy>", line 198, in _run_module_as_main
(APIServer pid=1) File "<frozen runpy>", line 88, in _run_code
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1895, in <module>
(APIServer pid=1) uvloop.run(run_server(args))
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 109, in run
(APIServer pid=1) return __asyncio.run(
(APIServer pid=1) ^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=1) return runner.run(main)
(APIServer pid=1) ^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=1) return self._loop.run_until_complete(task)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 61, in wrapper
(APIServer pid=1) return await main
(APIServer pid=1) ^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1827, in run_server
(APIServer pid=1) await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1847, in run_server_worker
(APIServer pid=1) async with build_async_engine_client(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1) return await anext(self.gen)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 167, in build_async_engine_client
(APIServer pid=1) async with build_async_engine_client_from_engine_args(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1) return await anext(self.gen)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 209, in build_async_engine_client_from_engine_args
(APIServer pid=1) async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/utils/__init__.py", line 1520, in inner
(APIServer pid=1) return fn(*args, **kwargs)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 173, in from_vllm_config
(APIServer pid=1) return cls(
(APIServer pid=1) ^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 119, in __init__
(APIServer pid=1) self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 101, in make_async_mp_client
(APIServer pid=1) return AsyncMPClient(*client_args)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 733, in __init__
(APIServer pid=1) super().__init__(
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 421, in __init__
(APIServer pid=1) with launch_core_engines(vllm_config, executor_class,
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=1) next(self.gen)
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 697, in launch_core_engines
(APIServer pid=1) wait_for_engine_startup(
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 750, in wait_for_engine_startup
(APIServer pid=1) raise RuntimeError("Engine core initialization failed. "
(APIServer pid=1) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
root@srv-ia-010:/home/IN.PGE.RJ.GOV.BR/fontesc#
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment