Getting an Nvidia eGPU working on macOS with PyTorch and fast.ai

[Updated on 2018.11.14] I finally got my GTX 1070 working with my MBP for PyTorch and fast.ai. Below are the steps:

Environment

MacBook Pro (15-inch, 2016) with touch bar
OSX version: 10.13.6 (Mojave may not work yet as of this writing)
eGPU: Razer Core X + GTX 1070 (MSI)

Step 1: Install Nvidia Web Driver

10.13.6 + 17G65

Follow this if your system is 10.13.6 17G65 (you can check this number by clicking "Version 10.13.6" in "About This Mac"). If it is 17G3025 or later, jump to the next section, "10.13.6 + 17G3025".

Use the great macOS-eGPU tool to install the Nvidia web driver. Just follow its guide and install by running "> macos-egpu".

It also offers an option to install CUDA for you, but DO NOT use it: it automatically installs the latest version, which does not seem to work with PyTorch yet. Install only the Nvidia web driver.

After the installation, my web driver version is: 387.10.10.10.40.105. Make sure you have this version if your OSX version is 10.13.6 + 17G65.

10.13.6 + 17G3025 [Added on 2018.11.14]

A new security patch for High Sierra arrived at the beginning of November and moves the system to 10.13.6 17G3025 (you can check this number by clicking "Version 10.13.6" in "About This Mac"). macOS-eGPU has not been updated for this new OSX build yet (as of today, 2018.11.14), so I suggest using another tool instead: purge-wrangler. It is as good or better (personal opinion); just follow its guide and select "Enable NVIDIA eGPUs".

After the installation, the web driver version will be: 387.10.10.10.40.108.

10.13.6 + 17G5019 [Added on 2019.02.27]

Same as before: just use purge-wrangler to apply the patch. If you already patched the system with purge-wrangler before, simply upgrade the Nvidia web driver. After the restart, purge-wrangler will prompt you to re-patch the system. Easy as pie.

After the installation, the web driver version will be: 387.10.10.10.40.118.
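As a quick sanity check, the build/driver pairs from the three sections above can be put in a small lookup. This is only a sketch: the function names are my own, the table covers just the builds mentioned in this guide, and it assumes the standard macOS command `sw_vers -buildVersion` is available.

```python
import shutil
import subprocess

# Build number -> web driver version, copied from the sections above.
EXPECTED_WEB_DRIVER = {
    "17G65": "387.10.10.10.40.105",
    "17G3025": "387.10.10.10.40.108",
    "17G5019": "387.10.10.10.40.118",
}


def expected_web_driver(build):
    """Return the driver version this guide used for a build, or None."""
    return EXPECTED_WEB_DRIVER.get(build.strip())


def current_build():
    """Ask macOS for its build number; returns None when sw_vers is missing."""
    if shutil.which("sw_vers") is None:
        return None
    result = subprocess.run(
        ["sw_vers", "-buildVersion"],
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    return result.stdout.strip() or None


build = current_build()
if build is None:
    print("sw_vers not found; this check only makes sense on macOS")
else:
    print(build, "->", expected_web_driver(build) or "not covered by this guide")
```

If your build is not in the table, check whether a matching web driver has been released before patching anything.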

Verify

If it is installed successfully, once you plug in your eGPU you should see your GTX 1070 in "About This Mac -> System Report... -> Graphics/Displays" and in "Activity Monitor -> Window -> GPU History". Or you can simply plug an external monitor into the eGPU to see if it works.

NOTE: eGPU hot unplug is not supported yet, so the suggestion is to "reboot and unplug the moment the eGPU power shuts down" (if this is not done properly, a kernel panic will happen). My Razer Core X, however, does not shut down its power during the restart: its fan keeps spinning, and the fan on the GTX 1070 does not spin at all, probably because the GPU temperature is low. So for me there is no way to tell the right moment from the eGPU itself. After some experimenting, I found that it seems safe to unplug at the moment the keyboard backlight turns off during the restart.

Step 2: Install CUDA driver, toolkit

PyTorch works with CUDA 9.2 and does not support the latest CUDA 10.0 yet, so I downloaded the CUDA 9.2 installation image from Nvidia. It includes the CUDA driver, toolkit and samples. Install all of them; we will need the samples later on. CUDA Toolkit 9.2 also has a patch, so install the patch as well. You can download it from the same place as the installation image.

Follow the installation guide here

Make sure the deviceQuery and bandwidthTest from samples work after installation.
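Both samples print "Result = PASS" when they succeed, so the check can be scripted. The sketch below is an illustration, not part of the original steps: the helper names are mine, and SAMPLES_BIN is a hypothetical path that you need to point at wherever you built the samples.

```python
import os
import subprocess


def sample_passed(output):
    """deviceQuery and bandwidthTest both print 'Result = PASS' on success."""
    return "Result = PASS" in output


def run_sample(path):
    """Run one compiled sample binary and report whether it passed."""
    if not os.path.isfile(path):
        print("not found:", path)
        return False
    result = subprocess.run(
        [path],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return sample_passed(result.stdout)


# Hypothetical location: change this to your own samples build directory.
SAMPLES_BIN = os.path.expanduser("~/cuda-samples/bin/x86_64/darwin/release")

for name in ("deviceQuery", "bandwidthTest"):
    ok = run_sample(os.path.join(SAMPLES_BIN, name))
    print(name, "PASS" if ok else "did not pass")
```

Run it with the eGPU connected; without a visible CUDA device neither sample will pass.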

After the installation, my CUDA driver version is 396.148. You can get this information with the command "> macos-egpu -C".
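Since the whole setup depends on the toolkit being 9.2 rather than 10.0, it can help to confirm which `nvcc` is on your PATH. A minimal sketch, assuming `nvcc --version` prints its usual "release X.Y" line; the function name is my own.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(text):
    """Extract (major, minor) from `nvcc --version` output, or None."""
    match = re.search(r"release (\d+)\.(\d+)", text)
    return (int(match.group(1)), int(match.group(2))) if match else None


nvcc = shutil.which("nvcc")
if nvcc is None:
    print("nvcc is not on PATH")
else:
    out = subprocess.run(
        [nvcc, "--version"],
        stdout=subprocess.PIPE,
        universal_newlines=True,
    ).stdout
    print("toolkit release:", parse_nvcc_release(out), "(this guide expects (9, 2))")
```

If `nvcc` is not found, the PATH entries from the environment variables in Step 4 are probably not set in the current terminal.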

Step 3: Install CUDNN

Go to this page to download the installation image (registration required): "https://developer.nvidia.com/rdp/cudnn-archive" -> click "cuDNN v7.1.4 Library for OSX".

Make sure to use cuDNN v7.1.4.

Follow this guide for the installation.
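To confirm that v7.1.4 is the version that actually got installed, you can read the version macros from cudnn.h. This is a sketch under assumptions: the header path below is where I would expect the file after copying it into the CUDA directory, so adjust it if yours lives elsewhere, and the function name is my own.

```python
import os
import re


def parse_cudnn_version(header_text):
    """Read CUDNN_MAJOR/MINOR/PATCHLEVEL from the text of cudnn.h."""
    parts = []
    for name in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        match = re.search(r"#define\s+" + name + r"\s+(\d+)", header_text)
        if match is None:
            return None
        parts.append(int(match.group(1)))
    return tuple(parts)


# Assumed install location; change it if you copied cudnn.h somewhere else.
HEADER = "/usr/local/cuda/include/cudnn.h"

if os.path.isfile(HEADER):
    with open(HEADER) as handle:
        print("cuDNN version:", parse_cudnn_version(handle.read()))
else:
    print("cudnn.h not found at", HEADER)
```

For the setup in this guide the result should be (7, 1, 4).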

Step 4: Compile and install Pytorch

I followed this guide. It was mostly correct for me, but not entirely, so I am writing down the steps that worked for me.

Create conda environment

conda create --name ptc python=3.6 pip

With ptc active (> source activate ptc)

export CMAKE_PREFIX_PATH=[anaconda root directory]
# for me, the anaconda root directory is "/Users/<your_user_name>/anaconda3"

#Install optional dependencies
conda install numpy pyyaml mkl mkl-include setuptools cmake cffi typing

After the above step, unset CMAKE_PREFIX_PATH or simply open a new terminal and activate ptc again. This is very important: we are about to compile PyTorch, and leaving CMAKE_PREFIX_PATH set causes problems during the build (it did for me).

Get the PyTorch source

git clone --recursive https://github.com/pytorch/pytorch
cd pytorch

Switch to v0.4.1 and initialize submodules

git checkout tags/v0.4.1
git checkout -b v0.4.1
git submodule update --init

Before we go ahead to start compiling, make sure we have everything correctly:

The following are my environment variables in ~/.bash_profile. CUDA_HOME and CUDA_NVCC_EXECUTABLE may not be needed; they are there because I tried to compile TensorFlow previously. The last PATH line (PATH=/usr/local/cuda/bin:$PATH) can probably be removed as well. But to be safe, you can keep everything the same as mine.

export CUDA_HOME=/usr/local/cuda
export CUDA_NVCC_EXECUTABLE=/usr/local/cuda/bin/nvcc
export DYLD_LIBRARY_PATH="$CUDA_HOME/extras/CUPTI/lib:/Developer/NVIDIA/CUDA-9.2/lib:$DYLD_LIBRARY_PATH"
export LD_LIBRARY_PATH=$DYLD_LIBRARY_PATH
export PATH=/Developer/NVIDIA/CUDA-9.2/bin${PATH:+:${PATH}}
export PATH=/usr/local/cuda/bin:$PATH
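Before starting the long build, the two things most likely to go wrong here (a leftover CMAKE_PREFIX_PATH and CUDA paths that do not exist) can be checked from Python. This is a rough sketch with a helper name of my own; it only looks at PATH and DYLD_LIBRARY_PATH entries that mention CUDA.

```python
import os


def build_env_problems(env):
    """Return a list of human-readable problems found in an environment mapping."""
    problems = []
    if env.get("CMAKE_PREFIX_PATH"):
        problems.append("CMAKE_PREFIX_PATH is still set; unset it before building")
    for var in ("DYLD_LIBRARY_PATH", "PATH"):
        for entry in env.get(var, "").split(":"):
            # Only CUDA-related entries are checked; other entries are ignored.
            if "CUDA" in entry.upper() and not os.path.isdir(entry):
                problems.append(var + " entry does not exist: " + entry)
    return problems


for problem in build_env_problems(os.environ):
    print("WARNING:", problem)
```

No output means neither problem was detected in the current terminal session.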

Check the clang version. After step 2 it should look something like this:

$ clang -v
Apple LLVM version 9.0.0 (clang-900.0.39.2)
Target: x86_64-apple-darwin17.7.0
Thread model: posix
InstalledDir: /Applications/Xcode_9.2.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
Found CUDA installation: /usr/local/cuda, version unknown

Build and install Pytorch. It will take a while (like 30 minutes to an hour). Just be patient.

MACOSX_DEPLOYMENT_TARGET=10.9 CC=clang CXX=clang++ python setup.py install

After it is done, verify it with "> pip list". You should see "torch" with version "0.5.0a0+a24163a". With the eGPU connected, you can also run the following and make sure is_available() returns True:

cd <any directory except pytorch!>
python
>>> import torch
>>> torch.cuda.is_available()
True
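If you want a little more detail than a single True, the following sketch collects the versions and runs a small computation on the GPU. The function name and the dictionary keys are my own choices; the torch calls are standard ones that should exist in this build, but treat the whole thing as an illustration rather than an official check.

```python
def cuda_report():
    """Collect version info; returns {'torch': None} if torch is not importable."""
    try:
        import torch
    except ImportError:
        return {"torch": None}
    report = {
        "torch": torch.__version__,
        "cuda_built": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
    }
    if report["cuda_available"]:
        report["device"] = torch.cuda.get_device_name(0)
        # Multiply two small random matrices on the GPU and read the result back.
        x = torch.rand(256, 256, device="cuda")
        report["matmul_ok"] = (x @ x).sum().item() > 0
    return report


for key, value in cuda_report().items():
    print(key, "=", value)
```

As with the check above, run it from a directory other than the pytorch source tree, so that Python imports the installed package.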

Step 5: Install fastai

pip install fastai

This will install torchvision-nightly for you, which is needed by fastai.

Double-check pip list and make sure Pillow is installed correctly and that "Pillow" and "pillow" are not both listed.
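Spotting such a duplicate by eye is easy to get wrong in a long list, so here is a small sketch that groups package names differing only by letter case. The helper names are mine and the versions in the sample text are made up for illustration; feed it your own `pip freeze` output instead.

```python
from collections import defaultdict


def case_insensitive_duplicates(names):
    """Group names that differ only by letter case, e.g. 'Pillow' and 'pillow'."""
    groups = defaultdict(set)
    for name in names:
        groups[name.lower()].add(name)
    return sorted(sorted(group) for group in groups.values() if len(group) > 1)


def names_from_freeze(text):
    """Pull package names out of `pip freeze`-style 'name==version' lines."""
    return [line.split("==")[0].strip() for line in text.splitlines() if "==" in line]


# Made-up sample; replace it with the real output of `pip freeze`.
sample = "Pillow==5.3.0\npillow==5.2.0\nnumpy==1.15.4\n"
print(case_insensitive_duplicates(names_from_freeze(sample)))
```

An empty list means no package is listed twice under different capitalization.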

With torchvision-nightly installed, we can verify that PyTorch is installed correctly. Download the PyTorch examples and compare the time required with and without CUDA.

$ git clone https://github.com/pytorch/examples
$ cd examples/mnist

$ time python main.py >/dev/null
real 1m38.430s
user 2m6.163s
sys 0m7.762s

$ time python main.py --no-cuda >/dev/null
real 5m47.750s
user 37m22.609s
sys 1m23.813s
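To put a number on the difference, the two "real" times above can be converted to seconds and divided. The helper name is my own; the two values are the ones measured above.

```python
import re


def to_seconds(stamp):
    """Convert a `time` value such as '1m38.430s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp.strip())
    if match is None:
        raise ValueError("unexpected time format: " + stamp)
    return int(match.group(1)) * 60 + float(match.group(2))


gpu = to_seconds("1m38.430s")  # "real" time with CUDA
cpu = to_seconds("5m47.750s")  # "real" time with --no-cuda
print("GPU: %.2fs, CPU: %.2fs, speedup: %.1fx" % (gpu, cpu, cpu / gpu))
```

That works out to roughly a 3.5x speedup in wall-clock time for the MNIST example on my setup.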

A couple of other packages should be installed for fastai as well.

pip install bcolz
pip install opencv-python
pip install seaborn
pip install graphviz
pip install sklearn_pandas
pip install isoweek
pip install pandas_summary
pip install torchtext

There may be more, and the best way may be to install everything from the fastai repo, but these work for me.
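Before opening the notebook, you can check which of the packages above are still missing. A minimal sketch: the list uses the import names of those packages (opencv-python is imported as cv2), and the function name is my own.

```python
import importlib.util

# Import names for the packages listed above (opencv-python imports as cv2).
FASTAI_EXTRAS = [
    "bcolz", "cv2", "seaborn", "graphviz",
    "sklearn_pandas", "isoweek", "pandas_summary", "torchtext",
]


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print("missing:", missing_modules(FASTAI_EXTRAS) or "none")
```

Run it with the ptc environment active, otherwise it reports on the wrong Python installation.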

Let's test.

git clone https://github.com/fastai/fastai
cd fastai
jupyter-notebook

In Jupyter notebook, open courses/dl1/lesson1.ipynb. Run the first few code blocks, especially the imports, and see if there is any error. If there is an error about a missing package, just pip install it.

Then you can go ahead with the lesson, test and enjoy your eGPU!

BTW, You can open "GPU History" from "Activity Monitor" and monitor your eGPU's load while testing.

If this gist helped you, please leave a star ;-) I will be very happy to see that it helped.

Final Note

Be VERY VERY CAREFUL about installing OSX security patches / updates.

I installed the latest update for High Sierra, which updated the macOS build to 17G3025 (still version 10.13.6). macos-egpu did not support this new build and would not recognize the eGPU anymore. Luckily I found purge-wrangler, which saved my life. I also like the way it is designed and explained.

Every macOS update rewrites kernel extensions (including security updates). This means that all patches installed using purge-wrangler.sh are reset. With V5.0.0 or later, the system will notify you if this has happened, and allow you to re-patch immediately.

I recommend making a Time Machine backup before every system update, because there always seems to be a gap between the release of an OSX system update and the release of the corresponding Nvidia web driver. If you apply the system update before the new web driver is available, you will be stuck. If you still want to use the system, you have to either roll back your system with Time Machine or use Web-Driver-Toolkit as suggested here to patch NVDAStartup (I only read that post and did not try it myself).

dandanwei/osx_egpu_pytorch_fastai.md