(.venv) ➜ shark-ai git:(chi/xfail_f16) ✗ pytest sharktank/tests/models/llama/benchmark_amdgpu_test.py -v -s -m "expensive" --iree-hip-target=gfx942 --iree-device=hip://4 -k testBenchmark8B_fp8_TP1_Non_Decomposed
======================================================================================= test session starts =======================================================================================
platform linux -- Python 3.12.9, pytest-8.0.0, pluggy-1.5.0 -- /home/chi/src/shark-ai/.venv/bin/python
cachedir: .pytest_cache
metadata: {'Python': '3.12.9', 'Platform': 'Linux-6.8.0-52-generic-x86_64-with-glibc2.35', 'Packages': {'pytest': '8.0.0', 'pluggy': '1.5.0'}, 'Plugins': {'timeout': '2.3.1', 'anyio': '4.9.0', 'metadata': '3.1.1', 'html': '4.1.1', 'asyncio': '0.23.8', 'xdist': '3.5.0'}}
rootdir: /home/chi/src/shark-ai/sharktank
configfile: pyproject.toml
plugins: timeout-2.3.1, anyio-4.9.0, metadata-3.1.1, html-4.1.1, asyncio-0.23.8, xdist-3.5.0
asyncio: mode=Mode.STRICT
collected 13 items / 12 deselected / 1 selected
sharktank/tests/models/llama/benchmark_amdgpu_test.py::BenchmarkLlama3_1_8B::testBenchmark8B_fp8_TP1_Non_Decomposed
2025-04-07T17:35:38-07:00
Running /home/chi/src/shark-ai/.venv/lib/python3.12/site-packages/iree/_runtime_libs/iree-benchmark-module
Run on (96 X 3810.79 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x96)
L1 Instruction 32 KiB (x96)
L2 Unified 1024 KiB (x96)
L3 Unified 32768 KiB (x16)
Load Average: 5.01, 2.31, 2.13
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
2025-04-07T17:35:43-07:00
Running /home/chi/src/shark-ai/.venv/lib/python3.12/site-packages/iree/_runtime_libs/iree-benchmark-module
Run on (96 X 3810.79 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x96)
L1 Instruction 32 KiB (x96)
L2 Unified 1024 KiB (x96)
L3 Unified 32768 KiB (x16)
Load Average: 4.69, 2.29, 2.13
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
FAILED
============================================================================================ FAILURES =============================================================================================
___________________________________________________________________ BenchmarkLlama3_1_8B.testBenchmark8B_fp8_TP1_Non_Decomposed ___________________________________________________________________
[XPASS(strict)] Benchmarking Error
---------------------------------------------------------------------------------------- Captured log call ----------------------------------------------------------------------------------------
INFO eval:export_artifacts.py:216 Exporting mlir:
cd /home/chi/src/shark-ai && python3 -m sharktank.examples.export_paged_llm_v1 --irpa-file=/shark-dev/8b/fp8/native_fp8_e4m3fnuz_llama3_8b.irpa --output-mlir=/home/chi/src/shark-ai/2025-04-07/llama-8b/fp8_torch_tp1.mlir --output-config=/home/chi/src/shark-ai/2025-04-07/llama-8b/fp8_torch_tp1.json --bs-prefill=4 --bs-decode=4 --block-seq-stride=32 --attention-dtype=bfloat16 --activation-dtype=bfloat16 --kv-cache-dtype=float8_e4m3fnuz --attention-kernel=torch --use-hf
INFO eval:export_artifacts.py:222 Exported to mlir successfully:
Exporting prefill_bs4
Exporting decode_bs4
GENERATED!
Exporting
Saving to '/home/chi/src/shark-ai/2025-04-07/llama-8b/fp8_torch_tp1.mlir'
INFO eval:export_artifacts.py:140 export_to_mlir: 00 hrs : 01 mins : 30.95 secs
INFO eval:export_artifacts.py:271 Launching compile command:
cd /home/chi/src/shark-ai && iree-compile /home/chi/src/shark-ai/2025-04-07/llama-8b/fp8_torch_tp1.mlir --iree-hip-target=gfx942 -o=/home/chi/src/shark-ai/2025-04-07/llama-8b/fp8_torch_tp1.vmfb --iree-hal-target-device=hip --iree-hal-dump-executable-files-to=/home/chi/src/shark-ai/2025-04-07/llama-8b/fp8_torch_tp1/files --iree-opt-level=O3 --iree-hal-indirect-command-buffers=true --iree-stream-resource-memory-model=discrete --iree-hal-memoization=true
INFO eval:export_artifacts.py:140 compile_to_vmfb: 19.67 secs
INFO eval:export_artifacts.py:328 Launching run command:
cd /home/chi/src/shark-ai && ROCR_VISIBLE_DEVICES=0,1,2,3,4 iree-benchmark-module --hip_use_streams=true --module=/home/chi/src/shark-ai/2025-04-07/llama-8b/fp8_torch_tp1.vmfb --parameters=model=/shark-dev/8b/fp8/native_fp8_e4m3fnuz_llama3_8b.irpa --device=hip://4 --function=prefill_bs4 --input=4x128xi64=@/shark-dev/8b/prefill_args_fp8/tokens.bin --input=4xi64=@/shark-dev/8b/prefill_args_fp8/seq_lens.bin --input=4x4xi64=@/shark-dev/8b/prefill_args_fp8/seq_block_ids.bin --input=261x2097152xf8E4M3FNUZ=@/shark-dev/8b/prefill_args_fp8/cs_f8E4M3FNUZ.bin --benchmark_repetitions=3 >> /home/chi/src/shark-ai/2025-04-07/llama-8b/fp8_torch_tp1.txt
INFO eval:export_artifacts.py:328 Launching run command:
cd /home/chi/src/shark-ai && ROCR_VISIBLE_DEVICES=0,1,2,3,4 iree-benchmark-module --hip_use_streams=true --module=/home/chi/src/shark-ai/2025-04-07/llama-8b/fp8_torch_tp1.vmfb --parameters=model=/shark-dev/8b/fp8/native_fp8_e4m3fnuz_llama3_8b.irpa --device=hip://4 --function=decode_bs4 --input=4x1xi64=@/shark-dev/8b/decode_args_fp8/next_tokens.bin --input=4xi64=@/shark-dev/8b/decode_args_fp8/seq_lens.bin --input=4xi64=@/shark-dev/8b/decode_args_fp8/start_positions.bin --input=4x5xi64=@/shark-dev/8b/decode_args_fp8/seq_block_ids.bin --input=261x2097152xf8E4M3FNUZ=@/shark-dev/8b/decode_args_fp8/cs_f8E4M3FNUZ.bin --benchmark_repetitions=3 >> /home/chi/src/shark-ai/2025-04-07/llama-8b/fp8_torch_tp1.txt
===================================================================================== short test summary info =====================================================================================
FAILED sharktank/tests/models/llama/benchmark_amdgpu_test.py::BenchmarkLlama3_1_8B::testBenchmark8B_fp8_TP1_Non_Decomposed
========================================================================== 1 failed, 12 deselected in 122.52s (0:02:02) ===========================================================
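For reference, the first run fails with [XPASS(strict)]: the test is marked pytest.mark.xfail with strict=True and raises=IreeBenchmarkException, so when the benchmark unexpectedly runs to completion the XPASS is reported as a failure. A minimal, self-contained sketch of that pytest behavior (illustrative names only, not the actual shark-ai test):

    import pytest

    class IreeBenchmarkException(Exception):
        """Stand-in for sharktank.utils.export_artifacts.IreeBenchmarkException."""

    BENCHMARK_SUCCEEDS = True  # flip to False to simulate the expected failure path

    @pytest.mark.xfail(
        reason="Benchmarking Error", strict=True, raises=IreeBenchmarkException
    )
    def test_benchmark_sketch():
        if not BENCHMARK_SUCCEEDS:
            # Expected path: the raised exception matches `raises`, so pytest
            # reports the test as XFAIL.
            raise IreeBenchmarkException("iree-benchmark-module failed")
        # Unexpected path: the test body completes, and because strict=True the
        # XPASS is reported as a failure -- "[XPASS(strict)] Benchmarking Error".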
AmosLewis commented Apr 8, 2025

If the xfail marker is deleted (commented out below), the test fails with the underlying benchmark error instead:

   # @pytest.mark.xfail(
   #     reason="Benchmarking Error", strict=True, raises=IreeBenchmarkException
   # )
(.venv) ➜  shark-ai git:(chi/xfail_f16) ✗ pytest sharktank/tests/models/llama/benchmark_amdgpu_test.py -v -s -m "expensive" --iree-hip-target=gfx942 --iree-device=hip://4 -k testBenchmark8B_fp8_TP1_Non_Decomposed
======================================================================================= test session starts =======================================================================================
platform linux -- Python 3.12.9, pytest-8.0.0, pluggy-1.5.0 -- /home/chi/src/shark-ai/.venv/bin/python
cachedir: .pytest_cache
metadata: {'Python': '3.12.9', 'Platform': 'Linux-6.8.0-52-generic-x86_64-with-glibc2.35', 'Packages': {'pytest': '8.0.0', 'pluggy': '1.5.0'}, 'Plugins': {'timeout': '2.3.1', 'anyio': '4.9.0', 'metadata': '3.1.1', 'html': '4.1.1', 'asyncio': '0.23.8', 'xdist': '3.5.0'}}
rootdir: /home/chi/src/shark-ai/sharktank
configfile: pyproject.toml
plugins: timeout-2.3.1, anyio-4.9.0, metadata-3.1.1, html-4.1.1, asyncio-0.23.8, xdist-3.5.0
asyncio: mode=Mode.STRICT
collected 13 items / 12 deselected / 1 selected                                                                                                                                                   

sharktank/tests/models/llama/benchmark_amdgpu_test.py::BenchmarkLlama3_1_8B::testBenchmark8B_fp8_TP1_Non_Decomposed 
2025-04-08T09:22:45-07:00
Running /home/chi/src/shark-ai/.venv/lib/python3.12/site-packages/iree/_runtime_libs/iree-benchmark-module
Run on (96 X 3810.79 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x96)
  L1 Instruction 32 KiB (x96)
  L2 Unified 1024 KiB (x96)
  L3 Unified 32768 KiB (x16)
Load Average: 3.09, 21.26, 48.42
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
2025-04-08T09:22:49-07:00
Running /home/chi/src/shark-ai/.venv/lib/python3.12/site-packages/iree/_runtime_libs/iree-benchmark-module
Run on (96 X 3810.79 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x96)
  L1 Instruction 32 KiB (x96)
  L2 Unified 1024 KiB (x96)
  L3 Unified 32768 KiB (x16)
Load Average: 3.09, 21.26, 48.42
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
Memory access fault by GPU node-6 (Agent handle: 0x25386130) on address 0x76d205400000. Reason: Unknown.
FAILED

============================================================================================ FAILURES =============================================================================================
___________________________________________________________________ BenchmarkLlama3_1_8B.testBenchmark8B_fp8_TP1_Non_Decomposed ___________________________________________________________________

self = <tests.models.llama.benchmark_amdgpu_test.BenchmarkLlama3_1_8B testMethod=testBenchmark8B_fp8_TP1_Non_Decomposed>

    @skipif_run_quick_llama_test
    # @pytest.mark.xfail(
    #     reason="Benchmarking Error", strict=True, raises=IreeBenchmarkException
    # )
    def testBenchmark8B_fp8_TP1_Non_Decomposed(self):
        output_file_name = self.dir_path_8b / "fp8_torch_tp1"
        output_mlir = self.llama8b_fp8_torch_sdpa_artifacts.create_file(
            suffix=".mlir", prefix=output_file_name
        )
        output_json = self.llama8b_fp8_torch_sdpa_artifacts.create_file(
            suffix=".json", prefix=output_file_name
        )
        output_vmfb = self.llama8b_fp8_torch_sdpa_artifacts.create_file(
            suffix=".vmfb", prefix=output_file_name
        )
        output_benchmark = self.llama8b_fp8_torch_sdpa_artifacts.create_file(
            suffix=".txt", prefix=output_file_name
        )
        export_return_code = self.llama8b_fp8_torch_sdpa_artifacts.export_to_mlir(
            mlir_path=output_mlir,
            json_path=output_json,
        )
        self.llama8b_fp8_torch_sdpa_artifacts.compile_to_vmfb(
            mlir_path=str(output_mlir),
            vmfb_path=output_vmfb,
            hal_dump_path=output_file_name,
            cwd=self.repo_root,
            args=self.compile_args,
        )
        # benchmark prefill
        self.llama8b_fp8_torch_sdpa_artifacts.iree_benchmark_vmfb(
            hip_device_id=self.iree_device,
            vmfb_name=output_vmfb,
            irpa_path=self.irpa_path_fp8,
            benchmark_filename=output_benchmark,
            args=self.iree_run_prefill_args_fp8,
            cwd=self.repo_root,
        )
        # benchmark decode
>       self.llama8b_fp8_torch_sdpa_artifacts.iree_benchmark_vmfb(
            hip_device_id=self.iree_device,
            vmfb_name=output_vmfb,
            irpa_path=self.irpa_path_fp8,
            benchmark_filename=output_benchmark,
            args=self.iree_run_decode_args_fp8,
            cwd=self.repo_root,
        )

sharktank/tests/models/llama/benchmark_amdgpu_test.py:362: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <sharktank.utils.export_artifacts.ExportArtifacts object at 0x715f9582dc10>

    def iree_benchmark_vmfb(
        self,
        *,
        hip_device_id: str,
        vmfb_name: str,
        irpa_path: str,
        benchmark_filename: Optional[Path] = None,
        args: List[str],
        cwd: str | Path,
    ):
        """Runs a compiled program with the given args using `iree-benchmark-module`.
        This assumes that the `iree-benchmark-module` command is available (usually via PATH).
        Args:
            vmfb_name: Name of the .vmfb file (relative to `cwd`).
            args: List of arguments to pass to `iree-benchmark-module`.
            cwd: Working directory to run the command within. (either string or Path works)
            compile_cmd: Command used to compile the program, for inclusion in error messages.
        Raises Exception if running fails for some reason.
        """
        benchmark_args = []
        if self.tensor_parallelism_size > 1:
            base_irpa_path, _ = os.path.splitext(irpa_path)
            rocr_visible_devices = [
                f"ROCR_VISIBLE_DEVICES={','.join(str(i) for i in range(self.tensor_parallelism_size))}"
            ]
            params = [f"--parameters=model={base_irpa_path}.irpa"]
            params += [
                f"--parameters=model={base_irpa_path}.rank{i}.irpa"
                for i in range(self.tensor_parallelism_size)
            ]
            devices = [
                f"--device=hip://{i}" for i in range(self.tensor_parallelism_size)
            ]
        else:
            hip_device_arg = int(hip_device_id.split("://")[1])
            rocr_visible_devices = [
                f"ROCR_VISIBLE_DEVICES={','.join(str(i) for i in range(hip_device_arg + 1))}"
            ]
            params = [f"--parameters=model={irpa_path}"]
            devices = [f"--device={hip_device_id}"]
        benchmark_args += rocr_visible_devices
        benchmark_args += [
            "iree-benchmark-module",
            "--hip_use_streams=true",
            f"--module={vmfb_name}",
        ]
        benchmark_args += params
        benchmark_args += devices
        benchmark_args += args
        benchmark_args += [str(benchmark_filename)]
        cmd = subprocess.list2cmdline(benchmark_args)
        logger.info(f" Launching run command:\n" f"cd {cwd} && {cmd}")
        proc = subprocess.run(cmd, shell=True, stdout=sys.stdout, cwd=cwd)
        return_code = proc.returncode
        if return_code != 0:
>           raise IreeBenchmarkException(proc, cwd)
E           sharktank.utils.export_artifacts.IreeBenchmarkException: Error invoking iree-benchmark-module
E           Error code: 250
E           Stderr diagnostics:
E           None
E           Stdout diagnostics:
E           None
E           Run with:
E             cd /home/chi/src/shark-ai && ROCR_VISIBLE_DEVICES=0,1,2,3,4 iree-benchmark-module --hip_use_streams=true --module=/home/chi/src/shark-ai/2025-04-08/llama-8b/fp8_torch_tp1.vmfb --parameters=model=/shark-dev/8b/fp8/native_fp8_e4m3fnuz_llama3_8b.irpa --device=hip://4 --function=decode_bs4 --input=4x1xi64=@/shark-dev/8b/decode_args_fp8/next_tokens.bin --input=4xi64=@/shark-dev/8b/decode_args_fp8/seq_lens.bin --input=4xi64=@/shark-dev/8b/decode_args_fp8/start_positions.bin --input=4x5xi64=@/shark-dev/8b/decode_args_fp8/seq_block_ids.bin --input=261x2097152xf8E4M3FNUZ=@/shark-dev/8b/decode_args_fp8/cs_f8E4M3FNUZ.bin --benchmark_repetitions=3 >> /home/chi/src/shark-ai/2025-04-08/llama-8b/fp8_torch_tp1.txt

sharktank/sharktank/utils/export_artifacts.py:332: IreeBenchmarkException
---------------------------------------------------------------------------------------- Captured log call ----------------------------------------------------------------------------------------
INFO     eval:export_artifacts.py:216  Exporting mlir:
cd /home/chi/src/shark-ai && python3 -m sharktank.examples.export_paged_llm_v1 --irpa-file=/shark-dev/8b/fp8/native_fp8_e4m3fnuz_llama3_8b.irpa --output-mlir=/home/chi/src/shark-ai/2025-04-08/llama-8b/fp8_torch_tp1.mlir --output-config=/home/chi/src/shark-ai/2025-04-08/llama-8b/fp8_torch_tp1.json --bs-prefill=4 --bs-decode=4 --block-seq-stride=32 --attention-dtype=bfloat16 --activation-dtype=bfloat16 --kv-cache-dtype=float8_e4m3fnuz --attention-kernel=torch --use-hf
INFO     eval:export_artifacts.py:222  Exported to mlir successfully:
Exporting prefill_bs4
Exporting decode_bs4
GENERATED!
Exporting
Saving to '/home/chi/src/shark-ai/2025-04-08/llama-8b/fp8_torch_tp1.mlir'

INFO     eval:export_artifacts.py:140  export_to_mlir: 00 hrs : 01 mins : 36.71 secs
INFO     eval:export_artifacts.py:271  Launching compile command:
cd /home/chi/src/shark-ai && iree-compile /home/chi/src/shark-ai/2025-04-08/llama-8b/fp8_torch_tp1.mlir --iree-hip-target=gfx942 -o=/home/chi/src/shark-ai/2025-04-08/llama-8b/fp8_torch_tp1.vmfb --iree-hal-target-device=hip --iree-hal-dump-executable-files-to=/home/chi/src/shark-ai/2025-04-08/llama-8b/fp8_torch_tp1/files --iree-opt-level=O3 --iree-hal-indirect-command-buffers=true --iree-stream-resource-memory-model=discrete --iree-hal-memoization=true
INFO     eval:export_artifacts.py:140  compile_to_vmfb: 19.67 secs
INFO     eval:export_artifacts.py:328  Launching run command:
cd /home/chi/src/shark-ai && ROCR_VISIBLE_DEVICES=0,1,2,3,4 iree-benchmark-module --hip_use_streams=true --module=/home/chi/src/shark-ai/2025-04-08/llama-8b/fp8_torch_tp1.vmfb --parameters=model=/shark-dev/8b/fp8/native_fp8_e4m3fnuz_llama3_8b.irpa --device=hip://4 --function=prefill_bs4 --input=4x128xi64=@/shark-dev/8b/prefill_args_fp8/tokens.bin --input=4xi64=@/shark-dev/8b/prefill_args_fp8/seq_lens.bin --input=4x4xi64=@/shark-dev/8b/prefill_args_fp8/seq_block_ids.bin --input=261x2097152xf8E4M3FNUZ=@/shark-dev/8b/prefill_args_fp8/cs_f8E4M3FNUZ.bin --benchmark_repetitions=3 >> /home/chi/src/shark-ai/2025-04-08/llama-8b/fp8_torch_tp1.txt
INFO     eval:export_artifacts.py:328  Launching run command:
cd /home/chi/src/shark-ai && ROCR_VISIBLE_DEVICES=0,1,2,3,4 iree-benchmark-module --hip_use_streams=true --module=/home/chi/src/shark-ai/2025-04-08/llama-8b/fp8_torch_tp1.vmfb --parameters=model=/shark-dev/8b/fp8/native_fp8_e4m3fnuz_llama3_8b.irpa --device=hip://4 --function=decode_bs4 --input=4x1xi64=@/shark-dev/8b/decode_args_fp8/next_tokens.bin --input=4xi64=@/shark-dev/8b/decode_args_fp8/seq_lens.bin --input=4xi64=@/shark-dev/8b/decode_args_fp8/start_positions.bin --input=4x5xi64=@/shark-dev/8b/decode_args_fp8/seq_block_ids.bin --input=261x2097152xf8E4M3FNUZ=@/shark-dev/8b/decode_args_fp8/cs_f8E4M3FNUZ.bin --benchmark_repetitions=3 >> /home/chi/src/shark-ai/2025-04-08/llama-8b/fp8_torch_tp1.txt
===================================================================================== short test summary info =====================================================================================
FAILED sharktank/tests/models/llama/benchmark_amdgpu_test.py::BenchmarkLlama3_1_8B::testBenchmark8B_fp8_TP1_Non_Decomposed - sharktank.utils.export_artifacts.IreeBenchmarkException: Error invoking iree-benchmark-module
========================================================================== 1 failed, 12 deselected in 126.01s (0:02:06) ===========================================================================
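A side note on the run commands above: the ROCR_VISIBLE_DEVICES=0,1,2,3,4 prefix comes from the single-device (tensor_parallelism_size == 1) branch of iree_benchmark_vmfb shown in the traceback, which exposes GPUs 0 through N when --iree-device=hip://N is passed. A standalone reproduction of just that computation (not the sharktank module itself):

    # Why hip://4 yields ROCR_VISIBLE_DEVICES=0,1,2,3,4 in the logged command.
    hip_device_id = "hip://4"
    hip_device_arg = int(hip_device_id.split("://")[1])  # -> 4
    rocr_visible_devices = "ROCR_VISIBLE_DEVICES=" + ",".join(
        str(i) for i in range(hip_device_arg + 1)
    )
    print(rocr_visible_devices)  # ROCR_VISIBLE_DEVICES=0,1,2,3,4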
