@lh3
Last active September 22, 2016 09:27
Writing a program for efficient matrix multiplication is tricky. A naive implementation usually looks like this (I will use a square matrix for simplicity):

for (i = 0; i < n; ++i)
  for (j = 0; j < n; ++j)
    for (k = 0, m[i][j] = 0.; k < n; ++k)
      m[i][j] += a[i][k] * b[k][j];

The problem is the inner-loop product a[i][k] * b[k][j]: b[k][j] and b[k+1][j] are not adjacent in memory, which leads to frequent cache misses for large matrices. A better implementation transposes matrix b first. It looks like:

for (i = 0; i < n; ++i) // transpose
	for (j = 0; j < n; ++j)
		c[i][j] = b[j][i];
for (i = 0; i < n; ++i)
	for (j = 0; j < n; ++j)
		for (k = 0, m[i][j] = 0.; k < n; ++k)
			m[i][j] += a[i][k] * c[j][k];

Although the transpose itself also incurs cache misses, it is an O(n^2) operation, much cheaper than the O(n^3) multiplication. This simple treatment dramatically speeds up the multiplication of large matrices. The following table shows the time to multiply two square matrices on my Mac laptop (1.7GHz Core i5; clang/LLVM-3.5svn and ruby-2.0.0p481):

Dimension  Language  Method      CPU time (s)
1000       C         Transposed   1.91
1000       C         Naive       11.90
 500       Ruby      Transposed  23.19
 500       Ruby      Library     78.36

The Ruby Matrix library does not use the transpose trick and is thus inefficient. In fact, most examples on Rosetta Code are naive implementations, too. My reflection on this: occasionally, even wheels provided by widely used libraries can be square; we need to reinvent them to make them round.

#include <stdlib.h>
#include <stdio.h>
#include <sys/resource.h>
#include <sys/time.h>

// user + system CPU time in seconds
double cputime()
{
	struct rusage r;
	getrusage(RUSAGE_SELF, &r);
	return r.ru_utime.tv_sec + r.ru_stime.tv_sec + 1e-6 * (r.ru_utime.tv_usec + r.ru_stime.tv_usec);
}
double **mm_init(int n)
{
	double **m;
	int i;
	m = (double**)malloc(n * sizeof(void*));
	for (i = 0; i < n; ++i)
		m[i] = calloc(n, sizeof(double));
	return m;
}
void mm_destroy(int n, double **m)
{
	int i;
	for (i = 0; i < n; ++i) free(m[i]);
	free(m);
}
double **mm_gen(int n)
{
	double **m, tmp = 1. / n / n;
	int i, j;
	m = mm_init(n);
	for (i = 0; i < n; ++i)
		for (j = 0; j < n; ++j)
			m[i][j] = tmp * (i - j) * (i + j);
	return m;
}
// better cache performance by transposing the second matrix
double **mm_mul_fast(int n, double *const *a, double *const *b)
{
	int i, j, k;
	double **m, **c;
	m = mm_init(n); c = mm_init(n);
	for (i = 0; i < n; ++i) // transpose b into c
		for (j = 0; j < n; ++j)
			c[i][j] = b[j][i];
	for (i = 0; i < n; ++i) {
		double *p = a[i], *q = m[i];
		for (j = 0; j < n; ++j) {
			double t = 0.0, *r = c[j];
			for (k = 0; k < n; ++k) // both p[] and r[] are scanned sequentially
				t += p[k] * r[k];
			q[j] = t;
		}
	}
	mm_destroy(n, c);
	return m;
}
double **mm_mul_naive(int n, double *const *a, double *const *b)
{
	int i, j, k;
	double **m;
	m = mm_init(n);
	for (i = 0; i < n; ++i) {
		for (j = 0; j < n; ++j) {
			double t = 0.0;
			for (k = 0; k < n; ++k) // b[k][j] has stride n: cache-hostile
				t += a[i][k] * b[k][j];
			m[i][j] = t;
		}
	}
	return m;
}
int main(int argc, char *argv[])
{
	int n = 100;
	double **a, **b, **m, t;
	if (argc > 1) n = atoi(argv[1]);
	n = (n/2) * 2;
	a = mm_gen(n); b = mm_gen(n);
	t = cputime();
	m = mm_mul_fast(n, a, b);
	fprintf(stderr, "Fast; Central value: %.3f; CPU time: %.3f\n", m[n/2][n/2], cputime() - t);
	mm_destroy(n, m); // free only after the central value has been printed
	t = cputime();
	m = mm_mul_naive(n, a, b);
	fprintf(stderr, "Naive; Central value: %.3f; CPU time: %.3f\n", m[n/2][n/2], cputime() - t);
	mm_destroy(n, m);
	mm_destroy(n, a); mm_destroy(n, b);
	return 0;
}
require 'benchmark'
require 'matrix'

def matmul(a, b)
  m = a.length
  n = a[0].length
  p = b[0].length
  # transpose b (n x p) into b2 (p x n) so the inner loop scans rows
  b2 = Array.new(p) { Array.new(n) { 0 } }
  for i in 0 .. n-1
    for j in 0 .. p-1
      b2[j][i] = b[i][j]
    end
  end
  # multiplication
  c = Array.new(m) { Array.new(p) { 0 } }
  for i in 0 .. m-1
    for j in 0 .. p-1
      s = 0
      ai, b2j = a[i], b2[j]
      for k in 0 .. n-1
        s += ai[k] * b2j[k]
      end
      c[i][j] = s
    end
  end
  return c
end
def matgen(n)
  tmp = 1.0 / n / n
  a = Array.new(n) { Array.new(n) { 0 } }
  for i in 0 .. n-1
    for j in 0 .. n-1
      a[i][j] = tmp * (i - j) * (i + j)
    end
  end
  return a
end
n = 100
if ARGV.length >= 1
  n = ARGV[0].to_i
end
n = n / 2 * 2
a = matgen(n)
b = matgen(n)
puts Benchmark.measure { c = matmul(a, b); puts "Fast: #{c[n/2][n/2]}" }
tmp = 1.0 / n / n
am = Matrix.build(n, n) {|i, j| tmp * (i - j) * (i + j)}
bm = Matrix.build(n, n) {|i, j| tmp * (i - j) * (i + j)}
puts Benchmark.measure { c = am * bm; puts "Library: #{c[n/2,n/2]}" }
voutcn commented Sep 22, 2016

Another cache-friendly version, with O(1) extra memory:

for (i = 0; i < n; ++i)
  for (j = 0; j < n; ++j)
    for (k = 0; k < n; ++k) // m must be zero-initialized first
      m[i][k] += a[i][j] * b[j][k];
