@Lyken17
Forked from Mistobaan/tensorflow_cuda_osx.md
Created August 24, 2016 10:32
How to enable CUDA support for TensorFlow on Mac OS X (updated April 2016 for TensorFlow 0.8)

These instructions explain how to install TensorFlow on a Mac with CUDA-enabled GPU support. I assume you know what TensorFlow is and why you would want a deep learning framework running on your computer.

Prerequisites

Make sure to update your Homebrew formulas:

brew update

Install coreutils and swig for Mac OS X:

brew install coreutils swig

Install the CUDA libraries for Mac OS X. You can install CUDA from Homebrew using cask:

brew cask install cuda

Make sure that the installed CUDA version is 7.5. You can check the version with:

brew cask info cuda
cuda: 7.5.20
Nvidia CUDA
https://developer.nvidia.com/cuda-zone
Not installed
https://github.com/caskroom/homebrew-cask/blob/master/Casks/cuda.rb
No Artifact Info

If you don't see 7.5, make sure to upgrade your brew formulas:

brew update
brew upgrade cuda

You need NVIDIA's CUDA Deep Neural Network library, cuDNN. You have to register and download it from the website: https://developer.nvidia.com/cudnn.

(Note: TensorFlow supports cuDNN v5 from version 0.8; versions 0.7 and 0.7.1 support v4.)

Download the file cudnn-7.5-osx-x64-v5.0-rc.tgz

Once downloaded, you need to manually copy the files into the /usr/local/cuda/ directory:

tar xzvf ~/Downloads/cudnn-7.5-osx-x64-v5.0-rc.tgz
sudo mv -v cuda/lib/libcudnn* /usr/local/cuda/lib
sudo mv -v cuda/include/cudnn.h /usr/local/cuda/include
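After moving the files, a quick check that they landed where the build expects them can save a failed compile later. The following is a minimal sketch, not part of the original instructions; the helper name `missing_cudnn_files` is hypothetical, and it only looks for the header and for any file matching `libcudnn*` under the CUDA root used above.

```python
import glob
import os


def missing_cudnn_files(cuda_root="/usr/local/cuda"):
    """Return the cuDNN paths expected under cuda_root that were not found."""
    missing = []
    # The header copied by the second mv command.
    header = os.path.join(cuda_root, "include", "cudnn.h")
    if not os.path.isfile(header):
        missing.append(header)
    # Any of the libcudnn* files copied by the first mv command.
    lib_pattern = os.path.join(cuda_root, "lib", "libcudnn*")
    if not glob.glob(lib_pattern):
        missing.append(lib_pattern)
    return missing


if __name__ == "__main__":
    problems = missing_cudnn_files()
    print("cuDNN files found" if not problems else "missing: %s" % problems)
```

An empty list means both the header and at least one library file are present; it does not verify the cuDNN version.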

Add a reference to /usr/local/cuda/lib in your ~/.bash_profile. You will need it to run the Python scripts.

export DYLD_LIBRARY_PATH=/usr/local/cuda/lib:$DYLD_LIBRARY_PATH
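To confirm that a new shell actually picked up the setting, you can inspect the variable from Python. This is a small sketch under the assumption that the variable is a plain colon-separated list; `path_list_contains` is a hypothetical helper name.

```python
import os


def path_list_contains(path_list, directory):
    """Return True if directory appears in a colon-separated path list."""
    wanted = os.path.normpath(directory)
    # Skip empty entries produced by leading, trailing, or doubled colons.
    entries = [p for p in path_list.split(":") if p]
    return any(os.path.normpath(p) == wanted for p in entries)


if __name__ == "__main__":
    current = os.environ.get("DYLD_LIBRARY_PATH", "")
    print(path_list_contains(current, "/usr/local/cuda/lib"))
```

The comparison normalizes paths, so a trailing slash in the profile entry does not cause a false negative.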

Now let's make sure that we are able to compile CUDA programs. If you have the latest Xcode installed (7.3 at the time of this post), nvcc will not work and will give an error like:

nvcc fatal   : The version ('70300') of the host compiler ('Apple clang') is not supported

In order to fix this you need to:

  1. download Xcode 7.2 from the Apple developer website
  2. create a new directory /Applications/XCode7.2/
  3. copy the entire Xcode.app inside /Applications/XCode7.2
  4. run sudo xcode-select -s /Applications/XCode7.2/Xcode.app/
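The number in the nvcc error message can be read as a compiler version. The sketch below assumes the code is packed as major * 10000 + minor * 100 + patch, which is consistent with '70300' appearing for Xcode 7.3; that packing is an assumption, and `decode_compiler_version` is a hypothetical helper.

```python
def decode_compiler_version(code):
    """Split a version code such as '70300' into (major, minor, patch).

    Assumes the code is major * 10000 + minor * 100 + patch.
    """
    number = int(code)
    major, rest = divmod(number, 10000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


print(decode_compiler_version("70300"))
```

Under that assumption the message above reports version 7.3.0, which is why switching to Xcode 7.2 with xcode-select avoids the error.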

You should now be able to compile CUDA code. Let's compile the deviceQuery utility, found among the CUDA SDK samples, to figure out the CUDA capability supported by our graphics card:

cd /usr/local/cuda/samples
sudo make -C 1_Utilities/deviceQuery

And now we run it:

cd /usr/local/cuda/samples/
./bin/x86_64/darwin/release/deviceQuery

The output will look like:

./bin/x86_64/darwin/release/deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GT 650M"
  CUDA Driver Version / Runtime Version          7.5 / 7.5
  CUDA Capability Major/Minor version number:    3.0
  Total amount of global memory:                 1024 MBytes (1073414144 bytes)
  ( 2) Multiprocessors, (192) CUDA Cores/MP:     384 CUDA Cores
  GPU Max Clock rate:                            900 MHz (0.90 GHz)
  Memory Clock rate:                             2508 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 262144 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.5, CUDA Runtime Version = 7.5, NumDevs = 1, Device0 = GeForce GT 650M
Result = PASS

Here you can confirm that the driver version is 7.5, and you can also find the CUDA capability of your GPU (CUDA Capability Major/Minor version number: 3.0 in my case). We will set this value when we configure TensorFlow.
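If you prefer not to read the capability off the screen, it can be pulled out of the deviceQuery output with a regular expression. This is a minimal sketch that relies only on the line format shown in the output above; `parse_compute_capabilities` is a hypothetical helper name.

```python
import re

# Matches the capability line printed by deviceQuery for each device.
CAPABILITY_RE = re.compile(
    r"CUDA Capability Major/Minor version number:\s*(\d+)\.(\d+)"
)


def parse_compute_capabilities(device_query_output):
    """Return the compute capability of each device as 'major.minor' strings."""
    return ["%s.%s" % match for match in CAPABILITY_RE.findall(device_query_output)]


sample = (
    'Device 0: "GeForce GT 650M"\n'
    "  CUDA Driver Version / Runtime Version          7.5 / 7.5\n"
    "  CUDA Capability Major/Minor version number:    3.0\n"
)
print(parse_compute_capabilities(sample))
```

In practice you would feed it the captured output of ./bin/x86_64/darwin/release/deviceQuery instead of the sample string.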

Install bazel

Use homebrew:

brew install bazel

or install it manually from source:

git clone https://github.com/bazelbuild/bazel.git
cd bazel
git checkout tags/0.2.1
./compile.sh
sudo cp output/bazel /usr/local/bin

Make sure you have the right version of bazel, at least 0.2.1:

$ bazel version
Build label: 0.2.1-homebrew
Build target: bazel-out/local_darwin-fastbuild/bin/src/main/java/com/google/devtools/build/lib/bazel/BazelServer_deploy.jar
Build time: Fri Apr 1 00:35:17 2016 (1459470917)
Build timestamp: 1459470917
Build timestamp as int: 1459470917
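The version check can also be scripted. The sketch below parses the "Build label:" line shown above and compares it with the 0.2.1 minimum; the function names are hypothetical, and it assumes the label starts with three dot-separated numbers as in the example output.

```python
import re


def bazel_version(version_output):
    """Extract (major, minor, patch) from the 'Build label:' line, or None."""
    match = re.search(
        r"^Build label:\s*(\d+)\.(\d+)\.(\d+)", version_output, re.MULTILINE
    )
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def bazel_is_new_enough(version_output, minimum=(0, 2, 1)):
    """Return True if the reported version is at least the given minimum."""
    version = bazel_version(version_output)
    return version is not None and version >= minimum


print(bazel_is_new_enough("Build label: 0.2.1-homebrew"))
```

Suffixes such as "-homebrew" are ignored, since only the numeric prefix is compared.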

Checkout tensorflow

As of the end of April 2016, the build system is merged into the main development line!

git clone --recurse-submodules https://github.com/tensorflow/tensorflow
cd tensorflow
git checkout master

Then we need to configure it.

I use Anaconda for the Python distribution. Notice that you need to set TF_CUDA_COMPUTE_CAPABILITIES to the value reported by the earlier deviceQuery run.

PYTHON_BIN_PATH=$HOME/anaconda/bin/python \
CUDA_TOOLKIT_PATH="/usr/local/cuda" \
CUDNN_INSTALL_PATH="/usr/local/cuda" \
TF_UNOFFICIAL_SETTING=1 \
TF_NEED_CUDA=1 \
TF_CUDA_COMPUTE_CAPABILITIES="3.0" \
TF_CUDNN_VERSION="5" \
TF_CUDA_VERSION="7.5" \
TF_CUDA_VERSION_TOOLKIT=7.5 \
./configure
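If you drive the build from a script, the same settings can be assembled as an environment dictionary. This is a sketch, not an official interface: the variable names and values are copied from the command above, while `configure_environment` and its parameters are hypothetical.

```python
import os


def configure_environment(compute_capability, python_bin, base_env=None):
    """Return a copy of the environment with the ./configure settings applied."""
    env = dict(os.environ if base_env is None else base_env)
    env.update({
        "PYTHON_BIN_PATH": python_bin,
        "CUDA_TOOLKIT_PATH": "/usr/local/cuda",
        "CUDNN_INSTALL_PATH": "/usr/local/cuda",
        "TF_UNOFFICIAL_SETTING": "1",
        "TF_NEED_CUDA": "1",
        # The value reported by deviceQuery, e.g. "3.0".
        "TF_CUDA_COMPUTE_CAPABILITIES": compute_capability,
        "TF_CUDNN_VERSION": "5",
        "TF_CUDA_VERSION": "7.5",
        "TF_CUDA_VERSION_TOOLKIT": "7.5",
    })
    return env
```

The result could then be passed to `subprocess.check_call(["./configure"], env=env)` from inside the tensorflow checkout.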

Now we are ready to build the TensorFlow pip package. This may take a while.

bazel build -c opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
pip install --upgrade /tmp/tensorflow_pkg/tensorflow-0.8.0rc0-py2-none-any.whl

If you are using Anaconda like me, you will want to add --ignore-installed:

pip install --upgrade --ignore-installed /tmp/tensorflow_pkg/tensorflow-0.8.0rc0-py2-none-any.whl 
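The wheel file name depends on the version you built, so the name above may not match yours. A small sketch for locating whatever wheel the build produced, assuming it was written to /tmp/tensorflow_pkg as in the commands above (`newest_wheel` is a hypothetical helper):

```python
import glob
import os


def newest_wheel(package_dir="/tmp/tensorflow_pkg"):
    """Return the most recently modified tensorflow wheel in package_dir, or None."""
    wheels = glob.glob(os.path.join(package_dir, "tensorflow-*.whl"))
    if not wheels:
        return None
    return max(wheels, key=os.path.getmtime)


if __name__ == "__main__":
    print(newest_wheel())
```

Pass the printed path to pip install in place of the hard-coded file name.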

Now move to another directory and run a test script:

import tensorflow as tf

# Creates a graph.
a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
c = tf.matmul(a, b)

# Creates a session with log_device_placement set to True.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))

# Runs the op.
print(sess.run(c))

You should now see the output from the sample program. If not, check the Caveats section.

deeplearning$ python test_install.py
I tensorflow/stream_executor/dso_loader.cc:107] successfully opened CUDA library libcublas.7.5.dylib locally
I tensorflow/stream_executor/dso_loader.cc:107] successfully opened CUDA library libcudnn.6.5.dylib locally
I tensorflow/stream_executor/dso_loader.cc:107] successfully opened CUDA library libcufft.7.5.dylib locally
I tensorflow/stream_executor/dso_loader.cc:107] successfully opened CUDA library libcuda.dylib locally
I tensorflow/stream_executor/dso_loader.cc:107] successfully opened CUDA library libcurand.7.5.dylib locally
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:892] OS X does not support NUMA - returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_init.cc:103] Found device 0 with properties:
name: GeForce GT 650M
major: 3 minor: 0 memoryClockRate (GHz) 0.9
pciBusID 0000:01:00.0
Total memory: 1023.69MiB
Free memory: 19.40MiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:127] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_init.cc:137] 0:   Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:703] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GT 650M, pci bus id: 0000:01:00.0)
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:42] Allocating 19.40MiB bytes.
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:52] GPU 0 memory begins at 0x700a80000 extends to 0x701de6000
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 1.0KiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 2.0KiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 4.0KiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 8.0KiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 16.0KiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 32.0KiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 64.0KiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 128.0KiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 256.0KiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 512.0KiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 1.00MiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 2.00MiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 4.00MiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 8.00MiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 16.00MiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 32.00MiB
Device mapping:
/job:localhost/replica:0/task:0/gpu:0 -> device: 0, name: GeForce GT 650M, pci bus id: 0000:01:00.0
I tensorflow/core/common_runtime/direct_session.cc:137] Device mapping:
/job:localhost/replica:0/task:0/gpu:0 -> device: 0, name: GeForce GT 650M, pci bus id: 0000:01:00.0

b: /job:localhost/replica:0/task:0/gpu:0
I tensorflow/core/common_runtime/simple_placer.cc:304] b: /job:localhost/replica:0/task:0/gpu:0
a: /job:localhost/replica:0/task:0/gpu:0
I tensorflow/core/common_runtime/simple_placer.cc:304] a: /job:localhost/replica:0/task:0/gpu:0
MatMul: /job:localhost/replica:0/task:0/gpu:0
I tensorflow/core/common_runtime/simple_placer.cc:304] MatMul: /job:localhost/replica:0/task:0/gpu:0
[[ 22.  28.]
 [ 49.  64.]]
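The final matrix can be checked by hand. The sketch below repeats the same 2x3 by 3x2 multiplication in plain Python, without TensorFlow, using the row-major layout of the two constants from the test script.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]    # shape [2, 3]
b = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # shape [3, 2]
print(matmul(a, b))  # [[22.0, 28.0], [49.0, 64.0]]
```

For example, the top-left entry is 1*1 + 2*3 + 3*5 = 22, matching the GPU result.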

You can read more about using GPUs in TensorFlow in the official GPU article.

Caveats

If you run into this error:

ImportError: No module named core.framework.graph_pb2

you are running the script from inside the tensorflow source directory, and Python is importing the local directory as the module. Change to another directory (see the Stack Overflow question).
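A quick way to tell whether the current directory would shadow the installed package is to look for a local folder or file with the package's name. This is a minimal sketch; `shadows_package` is a hypothetical helper and it only covers the common case of a same-named directory or .py file.

```python
import os


def shadows_package(directory, package="tensorflow"):
    """Return True if directory holds a folder or file Python would import as package."""
    return (
        os.path.isdir(os.path.join(directory, package))
        or os.path.isfile(os.path.join(directory, package + ".py"))
    )


if __name__ == "__main__":
    print(shadows_package(os.getcwd()))
```

If it prints True, change to a different directory before running the test script.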

If you get this error:

: Library not loaded: @rpath/libcudart.7.5.dylib
  Referenced from: ~/anaconda/envs/tensorflow/lib/python2.7/site-packages/tensorflow/python/_pywrap_tensorflow.so
  Reason: image not found

This is because Python is not able to find the CUDA library. Make sure to set the environment variable:

export DYLD_LIBRARY_PATH=/usr/local/cuda/lib:$DYLD_LIBRARY_PATH

If you see

Ignoring gpu device (device: 0, name: GeForce GT 650M, pci bus id: 0000:01:00.0) with Cuda compute capability 3.0. The minimum required Cuda capability is 3.5.

recompile the library and use the right compute capability (3.0) when ./configure prompts for it:

Please specify a list of comma-separated Cuda compute capabilities you want to build with.
You can find the compute capability of your device at: https://developer.nvidia.com/cuda-gpus.
Please note that each additional compute capability significantly increases your build time and binary size.
[Default is: "3.5,5.2"]: 3.0
Setting up Cuda include
Setting up Cuda lib
Setting up Cuda bin
Setting up Cuda nvvm
Configuration finished
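The warning above can be reasoned about as a simple comparison. The sketch below assumes that the minimum required capability is the smallest entry in the list the library was built with (consistent with the default "3.5,5.2" and the reported minimum of 3.5, but still an assumption); both function names are hypothetical.

```python
def parse_capability(text):
    """Turn a string such as '3.0' into a comparable (major, minor) tuple."""
    major, minor = text.strip().split(".")
    return int(major), int(minor)


def is_device_ignored(device_capability, built_capabilities):
    """True if the device is below the lowest capability the build targets."""
    minimum = min(parse_capability(c) for c in built_capabilities.split(","))
    return parse_capability(device_capability) < minimum


print(is_device_ignored("3.0", "3.5,5.2"))  # True: 3.0 is below 3.5
print(is_device_ignored("3.0", "3.0"))      # False after rebuilding with 3.0
```

Comparing tuples rather than floats keeps versions such as 3.10 and 3.2 in the right order.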

Note that this is still a pull request, so it is not officially supported.

I hope that with this tutorial more OS X developers can try the patch, report any errors, and confirm that it would be a good patch to merge into the main repository.

Stay hungry and have fun.
