#!/usr/bin/env bash
# Debug script to see the full FBGEMM-GPU build error for B200
CONTAINER="nvcr.io/nvidia/pytorch:25.10-py3"
echo "=========================================="
echo "Debugging FBGEMM-GPU Build for B200"
echo "=========================================="
echo ""

After resetting Windows, reinstall the build tools with:

choco install visualstudio2022buildtools -y --params "--add Microsoft.VisualStudio.Component.VC.Tools.x86.x64 --add Microsoft.VisualStudio.Component.VC.CMake.Project --add Microsoft.VisualStudio.Component.VC.ATL --add Microsoft.VisualStudio.Component.Windows11SDK.22621"
choco install git.install -y --params "'/GitAndUnixToolsOnPath'"
choco install cmake --version=3.31.0 -y
choco install ninja -y
choco install ccache -y
choco install python -y
choco install strawberryperl -y
(.venv) C:\b\pytorch>cmake -G"Visual Studio 17 2022" -Ax64 -Thost=x64 -B build -DBUILD_PYTHON=True -DBUILD_TEST=False -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=C:\b\pytorch\torch -DCMAKE_PREFIX_PATH=C:\TheRock\.venv\Lib\site-packages;C:\TheRock\.venv\Lib\site-packages\_rocm_sdk_devel\lib\cmake -DPython_EXECUTABLE=C:\TheRock\.venv\Scripts\python.exe -DPython_NumPy_INCLUDE_DIR=C:\TheRock\.venv\Lib\site-packages\numpy\_core\include -DTORCH_BUILD_VERSION=2.10.0a0+rocmsdk20251027 -DUSE_CUDA=OFF -DUSE_FLASH_ATTENTION=0 -DUSE_KINETO=OFF -DUSE_MEM_EFF_ATTENTION=0 -DUSE_MPI=OFF -DUSE_NUMA=OFF -DUSE_NUMPY=True -DUSE_ROCM=ON
CMake Deprecation Warning at CMakeLists.txt:8 (cmake_policy):
The OLD behavior for policy CMP0126 will be removed from a future version
of CMake.
The cmake-policies(7) manual explains that the OLD behaviors of all
policies are deprecated and that a policy should be set to OLD only under
specific short-term circumstances. Projects should be ported to the NEW
behavior and not rel
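
The configure command above is long enough that a single typo is easy to miss. As a minimal sketch, assuming you want to drive the same configure step from a script, the cache entries can be kept in a dictionary and expanded into `-D` flags. The helper name `cmake_configure_args` is hypothetical; the generator, platform, toolset, and every value are copied from the command in these notes.

```python
# Cache entries copied from the configure command in these notes.
CMAKE_DEFINES = {
    "BUILD_PYTHON": "True",
    "BUILD_TEST": "False",
    "CMAKE_BUILD_TYPE": "Release",
    "CMAKE_INSTALL_PREFIX": r"C:\b\pytorch\torch",
    "CMAKE_PREFIX_PATH": (
        r"C:\TheRock\.venv\Lib\site-packages;"
        r"C:\TheRock\.venv\Lib\site-packages\_rocm_sdk_devel\lib\cmake"
    ),
    "Python_EXECUTABLE": r"C:\TheRock\.venv\Scripts\python.exe",
    "Python_NumPy_INCLUDE_DIR": r"C:\TheRock\.venv\Lib\site-packages\numpy\_core\include",
    "TORCH_BUILD_VERSION": "2.10.0a0+rocmsdk20251027",
    "USE_CUDA": "OFF",
    "USE_FLASH_ATTENTION": "0",
    "USE_KINETO": "OFF",
    "USE_MEM_EFF_ATTENTION": "0",
    "USE_MPI": "OFF",
    "USE_NUMA": "OFF",
    "USE_NUMPY": "True",
    "USE_ROCM": "ON",
}


def cmake_configure_args(defines, build_dir="build"):
    """Build the cmake argument list (hypothetical helper, not part of TheRock)."""
    args = [
        "cmake",
        "-G", "Visual Studio 17 2022",  # generator
        "-A", "x64",                    # target platform
        "-T", "host=x64",               # 64-bit host toolset
        "-B", build_dir,
    ]
    # One -DNAME=VALUE flag per cache entry, in dictionary order.
    args += [f"-D{name}={value}" for name, value in defines.items()]
    return args


if __name__ == "__main__":
    # Print one argument per line so the flags are easy to review.
    for arg in cmake_configure_args(CMAKE_DEFINES):
        print(arg)
```

Passing the resulting list to `subprocess.run(..., cwd=r"C:\b\pytorch")` avoids shell quoting problems with the semicolon in `CMAKE_PREFIX_PATH`.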
```
(.venv) C:\Users\chiliu12\Documents\src\TheRock\external-builds\pytorch>python pytorch_torch_repo.py checkout --checkout-dir C:/b/pytorch
Cloning repository at nightly
++ Exec [C:\b\pytorch]$ git init --initial-branch=main
Initialized empty Git repository in C:/b/pytorch/.git/
++ Exec [C:\b\pytorch]$ git config advice.detachedHead false
++ Exec [C:\b\pytorch]$ git remote add origin https://github.com/pytorch/pytorch.git
++ Exec [C:\b\pytorch]$ git fetch -j 10 origin nightly
remote: Enumerating objects: 593632, done.
remote: Counting objects: 100% (93636/93636), done.
```

172.19.166.8

In the Start menu, search for "x64 Native Tools Command Prompt for Visual Studio 2022".

OR

Open Windows Terminal as Administrator.
Click the dropdown (v) → Settings.
Add a new profile for "Developer Command Prompt":
Name: VS 2022 x64 (Admin)

Topic: ROCm 7.1 versions on MI350/355 machines (Installation Guide)
STEP 1 -> Uninstall ROCm packages - https://rocm.docs.amd.com/en/docs-7.0-alpha/preview/install/rocm.html#uninstalling 
STEP 2 -> Uninstall ROCm driver - https://rocm.docs.amd.com/en/docs-7.0-alpha/preview/install/instinct-driver.html#uninstalling 

RC-1 Build Installation Instructions
wget -N -P /tmp/ https://artifactory-cdn.amd.com/artifactory/list/amdgpu-deb/amdgpu-install-internal_7.1-22.04-1_all.deb
sudo apt update
sudo apt-get install /tmp/amdgpu-install-internal_7.1-22.04-1_all.deb
sudo sh -c 'echo deb [arch=amd64 trusted=yes] https://compute-artifactory.amd.com/artifactory/list/rocm-release-archive-22.04-deb/ 7.1 rel-18 > /etc/apt/sources.list.d/rocm-build.list'

GPU: AMD Radeon™ 890M Graphics. The AMD Radeon™ 890M integrated GPU is based on the gfx1150 architecture, which corresponds to the RDNA 3.5 generation used in AMD's Strix Point APUs such as the Ryzen AI 300 series.

Issue:

PS C:\Users\chiliu12\Documents\src> git clone https://github.com/ROCm/TheRock.git
git : The term 'git' is not recognized as the name of a cmdlet, function, script file, or operable program. Check the
spelling of the name, or if a path was included, verify that the path is correct and try again.
At line:1 char:1
+ git clone https://github.com/ROCm/TheRock.git
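
The error above means PowerShell could not resolve `git` on PATH in that session. Typical causes are that Git is not installed yet, or that the terminal was opened before the installer updated PATH. A minimal sketch for checking which build tools are visible from the current shell follows. The tool names come from the choco install list in these notes; `perl` is an assumption for the executable that strawberryperl provides, and `missing_tools` is a hypothetical helper.

```python
import shutil

# Executable names assumed from the choco packages listed in these notes.
BUILD_TOOLS = ["git", "cmake", "ninja", "ccache", "python", "perl"]


def missing_tools(names):
    """Return the names that cannot be resolved on the current PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools(BUILD_TOOLS)
    if missing:
        print("Not on PATH:", ", ".join(missing))
        print("Install the tool, or open a new terminal so PATH changes take effect.")
    else:
        print("All build tools found.")
```

If `git` is reported missing right after `choco install git.install`, opening a fresh terminal is usually worth trying before reinstalling.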

docker

FROM ghcr.io/rocm/no_rocm_image_ubuntu24_04:main

# ######################################################
# # Install MLPerf+Shark reference implementation
# ######################################################
ENV DEBIAN_FRONTEND=noninteractive

USER root
root@smci350-odcdh2-a05-1:/mlperf/harness# ./run_offline.sh --shortfin-config shortfin_405b_config_fp4.json --test-mode PerformanceOnly 2>&1 | tee server-time.log
Warning: Missing argument '--test-scenario'
Info: Defaulting to test scenario 'Offline'
Log started at: 2025-10-07 22:23:01
INFO:shortfin_apps.llm.components.service_debug_dumper:[debug_service.py] Please find debug dumps for service.py in /root/.shortfin/debug/llm_service_invocation_dumps/2025-10-07T22:23:01.515306
INFO:root:####################################################################################################################################################################################
Running python3 harness_alt_mi355.py --devices 0,1,2,3,4,5,6,7 --scenario Offline --test_mode PerformanceOnly --bs 8 --user_conf_path user.conf --count 16 --tensor_path /data/mlperf_llama3.1_405b_dataset_8313_processed_fp16_eval.pkl --logfile_outdir OutputOfflinePerformanceOnly-a05-pp8-16samples-ps8_ds200_dc_4096_1007 --debug False --verbose True --u