Writing a program for efficient matrix multiplication is tricky. A naive implementation usually looks like this (I will use square matrices for simplicity):
    for (i = 0; i < n; ++i)
        for (j = 0; j < n; ++j)
            for (k = 0, m[i][j] = 0.; k < n; ++k)
                m[i][j] += a[i][k] * b[k][j];
The problem is in the inner loop: in the expression a[i][k] * b[k][j], the elements b[k][j] and b[k+1][j] read on consecutive iterations are not adjacent in memory, which leads to frequent cache misses for large matrices. A better implementation is to transpose matrix b first. It looks like this:
    for (i = 0; i < n; ++i) // transpose b into c
        for (j = 0; j < n; ++j)
            c[i][j] = b[j][i];
    for (i = 0; i < n; ++i) // multiply, reading c along its rows
        for (j = 0; j < n; ++j)
            for (k = 0, m[i][j] = 0.; k < n; ++k)
                m[i][j] += a[i][k] * c[j][k];
Although the transpose itself also incurs cache misses, it is an O(n^2) operation, much cheaper than the O(n^3) multiplication. This simple treatment dramatically speeds up the multiplication of large matrices. The following table shows the time to multiply two square matrices on my Mac laptop (1.7GHz Core i5; clang/LLVM-3.5svn and ruby-2.0.0p481):
Dimension | Language | Method | CPU time (s) |
---|---|---|---|
1000 | C | Transposed | 1.91 |
1000 | C | Naive | 11.90 |
500 | Ruby | Transposed | 23.19 |
500 | Ruby | Library | 78.36 |
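
For reference, the C numbers can be reproduced with a small driver along these lines (a sketch rather than the original benchmark; the random fill and the use of clock() for CPU time are my assumptions):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void)
    {
        int i, j, k, n = 1000;
        /* allocate the matrices as arrays of row pointers */
        double **a = malloc(n * sizeof(double*));
        double **b = malloc(n * sizeof(double*));
        double **c = malloc(n * sizeof(double*));
        double **m = malloc(n * sizeof(double*));
        for (i = 0; i < n; ++i) {
            a[i] = malloc(n * sizeof(double));
            b[i] = malloc(n * sizeof(double));
            c[i] = malloc(n * sizeof(double));
            m[i] = malloc(n * sizeof(double));
            for (j = 0; j < n; ++j) {
                a[i][j] = rand() / (double)RAND_MAX;
                b[i][j] = rand() / (double)RAND_MAX;
            }
        }
        clock_t t = clock();
        for (i = 0; i < n; ++i)      /* transpose b into c */
            for (j = 0; j < n; ++j)
                c[i][j] = b[j][i];
        for (i = 0; i < n; ++i)      /* transposed multiplication */
            for (j = 0; j < n; ++j)
                for (k = 0, m[i][j] = 0.; k < n; ++k)
                    m[i][j] += a[i][k] * c[j][k];
        printf("m[0][0] = %g; CPU time: %.2f s\n", m[0][0], (double)(clock() - t) / CLOCKS_PER_SEC);
        return 0;  /* freeing memory omitted for brevity */
    }

The naive timing is obtained by replacing the last triple loop with the first version above.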
The Ruby matrix library does not use the transpose trick and is thus inefficient. In fact, most examples on Rosetta Code are naive implementations, too. My reflection on this is: occasionally, even wheels provided by widely used libraries can be square – we need to reinvent them to make them round.
Another cache-friendly version uses O(1) extra memory, avoiding the transposed copy altogether.
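One common way to get this (a sketch of the usual loop-reordering trick; I am assuming this is the kind of variant meant here) is to make j the innermost loop, so that b[k][j] and m[i][j] are both scanned along a row:

    for (i = 0; i < n; ++i)        // clear the output first, because the
        for (j = 0; j < n; ++j)    // accumulation is now spread over the k loop
            m[i][j] = 0.;
    for (i = 0; i < n; ++i)
        for (k = 0; k < n; ++k)    // k is the middle loop
            for (j = 0; j < n; ++j)    // innermost loop walks b[k][.] and m[i][.] sequentially
                m[i][j] += a[i][k] * b[k][j];

Every array touched in the innermost loop is traversed sequentially, so the cache behaviour is similar to the transposed version, but without the extra n^2 buffer.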