Compiling an optimized ATLAS on Ubuntu 12.04

  1. Download the latest ATLAS source from http://sourceforge.net/projects/math-atlas/files/. I'm using version 3.10.1.
  2. Download the latest Netlib LAPACK from http://www.netlib.org/lapack/. I'm using version 3.4.2.
  3. Turn off frequency scaling on your chip so that you can get reliable timings. This is essential for a good ATLAS build.
  sudo apt-get install cpufreq-info cpuspeed cpufrequtils sysfsutils
  # set each core to the "performance" governor, so that the clock frequency doesn't go down when idle
  # I have 8 cores, which is why I need to do this 8 times
  sudo cpufreq-selector -c 0 -g performance
  sudo cpufreq-selector -c 1 -g performance
  sudo cpufreq-selector -c 2 -g performance
  sudo cpufreq-selector -c 3 -g performance
  sudo cpufreq-selector -c 4 -g performance
  sudo cpufreq-selector -c 5 -g performance
  sudo cpufreq-selector -c 6 -g performance
  sudo cpufreq-selector -c 7 -g performance
  
  # check to make sure the scaling was set correctly
  sudo cat  /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
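  
  # equivalently, you can loop over the cores instead of repeating the command
  # (this assumes 8 cores, numbered 0-7)
  for c in $(seq 0 7); do sudo cpufreq-selector -c $c -g performance; done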
  4. Compile ATLAS.
tar -xjvf atlas3.10.1.tar.bz2
cd ATLAS
mkdir build
cd build
../configure -Fa alg '-fPIC' --with-netlib-lapack-tarfile=<PATH_TO_NETLIB_LAPACK_TARBALL> --prefix=$HOME/opt/atlas --shared
make
make test
make install

export LD_LIBRARY_PATH=$HOME/opt/atlas/lib:$LD_LIBRARY_PATH
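
If the build and install succeeded, the prefix should now contain the two shared libraries discussed next (the exact set of installed files can vary a little between ATLAS versions):

ls $HOME/opt/atlas/lib   # expect libsatlas.so and libtatlas.so among the installed libraries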

In ATLAS 3.10.1, the two shared libraries that get compiled are named libsatlas.so and libtatlas.so. As configured above, both contain the full (C)BLAS + LAPACK interface; the difference is that the first is serial and the second is threaded. If you find that confusing and want to be able to link against -latlas or -lcblas, go into the install directory and add some symlinks.

cd $HOME/opt/atlas/lib
ln -s libtatlas.so libatlas.so  # make "libatlas" point to the threaded library
ln -s libtatlas.so libcblas.so  # make "libcblas" point to the threaded atlas
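
With those symlinks in place, you can sanity-check that the generic names resolve to your new build by linking the timing program described below against -lcblas. The -I flag assumes ATLAS installed its cblas.h under the same prefix, and the LD_LIBRARY_PATH export above lets the loader find the library at run time:

gcc time_dgemm.c -I$HOME/opt/atlas/include -L$HOME/opt/atlas/lib -lcblas && ./a.out
ldd ./a.out | grep atlas   # should resolve to a library under $HOME/opt/atlas/lib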

Linking numpy/scipy to your optimized ATLAS

Get the numpy source distribution. Move site.cfg.example to site.cfg, and set the following entries:

[DEFAULT]
library_dirs = <YOUR_HOME_DIRECTORY>/opt/atlas/lib
include_dirs = <YOUR_HOME_DIRECTORY>/opt/atlas/include

[blas_opt]
libraries = tatlas

[lapack_opt]
libraries = tatlas

Now run python setup.py install. Compile scipy from source as well; it will automatically build against the same ATLAS, since it picks up the build configuration through numpy.distutils.
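
To confirm that numpy actually picked up this ATLAS rather than a system BLAS, inspect its build configuration (the exact output format varies between numpy versions):

# the blas_opt / lapack_opt sections should mention 'tatlas' and your $HOME/opt/atlas directories
python -c "import numpy; numpy.show_config()"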

Timing your BLAS implementation

Download the file time_dgemm.c from this gist (the source is reproduced at the bottom of this page). It does a big matrix multiply via the cblas_dgemm function. You can link the program against different BLAS implementations to compare their speed.
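
For reference, the problem size hard-coded in the source is m = 20000, k = 2000, n = 1000, so a single call to cblas_dgemm performs about 2*m*n*k = 8e10 floating point operations; dividing that by the reported time gives a rough GFLOP/s figure for each library (about 7 GFLOP/s for the system cblas below, versus roughly 60 GFLOP/s for the threaded ATLAS).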

# link against the default system cblas.
$ gcc time_dgemm.c -lcblas && ./a.out
10.997839 s

You can see with ldd that this is linked against the system cblas/atlas, installed through the package manager.

$ ldd a.out
    linux-vdso.so.1 =>  (0x00007fffbddff000)
    libcblas.so.3gf => /usr/lib/libcblas.so.3gf (0x00007f5aa9498000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f5aa90d8000)
    libatlas.so.3gf => /usr/lib/libatlas.so.3gf (0x00007f5aa8ba7000)
    libgfortran.so.3 => /usr/lib/x86_64-linux-gnu/libgfortran.so.3 (0x00007f5aa8890000)
    libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f5aa8679000)
    libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f5aa845c000)
    libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f5aa8160000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f5aa96d1000)
    libquadmath.so.0 => /usr/lib/x86_64-linux-gnu/libquadmath.so.0 (0x00007f5aa7f29000)

Linking the same code against our threaded ATLAS (libtatlas), I get roughly an 8x speedup.

$ gcc time_dgemm.c -L$HOME/opt/atlas/lib -ltatlas && ./a.out
1.302789 s

The single-threaded optimized version (libsatlas) gets a ~3x speedup too.

$ gcc time_dgemm.c -L$HOME/opt/atlas/lib -lsatlas && ./a.out
3.237809 s

Compiling OpenBLAS

  1. Download the latest OpenBLAS from http://xianyi.github.io/OpenBLAS/
  2. Untar the package, and compile it.
$ tar -xvf v0.2.8
$ cd xianyi-OpenBLAS-9c51cdf
$ make
$ make PREFIX=$HOME/opt/openblas-0.2.8 install
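
As with ATLAS, add the install prefix to LD_LIBRARY_PATH so that binaries linked against the shared libopenblas can find it at run time:

$ export LD_LIBRARY_PATH=$HOME/opt/openblas-0.2.8/lib:$LD_LIBRARY_PATH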

Using OpenBLAS 0.2.8, I get better performance than ATLAS.

$ gcc time_dgemm.c -L$HOME/opt/openblas-0.2.8/lib -lopenblas && ./a.out
1.031470 s
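
OpenBLAS picks its thread count at run time, so you can also cap it with the OPENBLAS_NUM_THREADS environment variable to get a single-threaded number for a direct comparison with libsatlas:

$ OPENBLAS_NUM_THREADS=1 ./a.out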

Timing MKL

Using the (non-free) MKL, the performance is pretty similar, but the link line is much more complex.

$ gcc -fopenmp -m64 -I$MKLROOT/include time_dgemm.c -L$MKLROOT/lib/intel64 -lmkl_intel_lp64 -lmkl_gnu_thread -lmkl_core -ldl -lpthread -lm && ./a.out
0.951390 s
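
MKL reads its thread count from MKL_NUM_THREADS (or OMP_NUM_THREADS, since this build uses the GNU OpenMP threading layer), so you can get a single-threaded MKL number the same way, assuming the MKL shared libraries are on your loader path:

$ MKL_NUM_THREADS=1 ./a.out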

time_dgemm.c

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include "cblas.h"

/* wall-clock time in seconds */
double get_time() {
    struct timeval t;
    gettimeofday(&t, NULL);
    return t.tv_sec + t.tv_usec * 1e-6;
}

int main()
{
    double *A = NULL, *B = NULL, *C = NULL;
    int m, n, k, i;
    double alpha, beta;
    double start, end;

    /* C (m x n) = alpha * A (m x k) * B (k x n) + beta * C */
    m = 20000; k = 2000; n = 1000;
    alpha = 1.0; beta = 0.0;

    /* 64-byte aligned allocations; check the return codes, since
       posix_memalign does not promise to NULL the pointers on failure */
    if (posix_memalign((void**) &A, 64, m*k*sizeof(double)) != 0 ||
        posix_memalign((void**) &B, 64, k*n*sizeof(double)) != 0 ||
        posix_memalign((void**) &C, 64, m*n*sizeof(double)) != 0) {
        printf("\n ERROR: Can't allocate memory for matrices. Aborting... \n\n");
        free(A);
        free(B);
        free(C);
        return 1;
    }

    /* fill A and B with arbitrary values, and zero C */
    for (i = 0; i < (m*k); i++) {
        A[i] = (double)(i+1);
    }
    for (i = 0; i < (k*n); i++) {
        B[i] = (double)(-i-1);
    }
    for (i = 0; i < (m*n); i++) {
        C[i] = 0.0;
    }

    /* time a single double-precision matrix-matrix multiply */
    start = get_time();
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, alpha, A, k, B, n, beta, C, n);
    end = get_time();
    printf("%f s\n", (end - start));

    free(A);
    free(B);
    free(C);
    return 0;
}