struct AbstractMatrix {
int m; // number of rows
int n; // number of columns
// Pack block at ib, jb of size mb, nb into dest in row-major format.
virtual void pack_rowmajor(int ib, int jb, int mb, int nb, float *dest) const = 0;
// Unpack row-major matrix from src into block at ib, jb of size mb, nb.
virtual void unpack_rowmajor(int ib, int jb, int mb, int nb, const float *src) = 0;
/*
transpose_ij (10000x5000): gmemops=0.23, min=0.436734, avg=0.455368, relerr=3.42%
transpose_ji (10000x5000): gmemops=0.28, min=0.356635, avg=0.363628, relerr=2.55%
transpose_ij_ij (10000x5000): gmemops=1.53, min=0.065207, avg=0.069465, relerr=6.07%
transpose_rec1 (10000x5000): gmemops=1.44, min=0.069258, avg=0.075378, relerr=8.96%
transpose_rec2 (10000x5000): gmemops=1.37, min=0.072819, avg=0.079644, relerr=8.77%
transpose_ij (100000x100): gmemops=1.70, min=0.011731, avg=0.014102, relerr=9.99%
transpose_ji (100000x100): gmemops=0.23, min=0.086909, avg=0.095706, relerr=3.37%
transpose_ij_ij (100000x100): gmemops=1.73, min=0.011543, avg=0.013170, relerr=6.96%
struct Matrix {
float *base; // pointer to first element
int m; // number of rows
int n; // number of columns
int drow; // row stride
Matrix(float *base, int m, int n, int drow = 0) : base(base), m(m), n(n), drow(drow ? drow : n) { }
INLINE float& operator()(int i, int j) {
// assert(0 <= i && i < m);
#if _WIN32
struct Timer {
LARGE_INTEGER win32_freq;
LARGE_INTEGER win32_start;
Timer() {
QueryPerformanceFrequency(&win32_freq);
win32_start.QuadPart = 0;
}
#if _WIN32
// measure the time in seconds to run f(). doubles will retain nanosecond precision.
template<typename F>
double timeit(F f) {
LARGE_INTEGER freq;
QueryPerformanceFrequency(&freq);
LARGE_INTEGER start;
QueryPerformanceCounter(&start);
f();
LARGE_INTEGER end;
// Version 1: Recursion.
node_t *find1(node_t *node, int key) {
if (!node) {
return NULL;
}
if (key < node->key) {
return find1(node->left, key);
} else if (key == node->key) {
return node;
def blocks(A, block_size=(1, 1)):
    i, j = block_size
    while len(A):
        yield A[:i, :j], A[:i, j:], A[i:, :j], A[i:, j:]
        A = A[i:, j:]
# 2400 ms for 1000x1000 matrix (~0.4 GFLOPS). Equivalent C code is only twice as fast (~0.8 GFLOPS).
# The reason the C code isn't much faster is that it's an O(n^3) algorithm and most of the time is spent in
# the O(n^2) kernel routine for the outer product in A11[:] -= A10 @ A10.T. Even if we posit that Python
# is 1000x slower than C for the outer loop, that's still 1000n + n^3 vs n^3, which is negligible for n = 1000.
def givens(x, y):
    r = np.hypot(x, y)
    if np.isclose(r, 0):
        return x, 1, 0
    return r, x / r, y / r
def cholesky_update(L, x):
    m, n = L.shape
    x = x.copy()
    for i in range(m):

Here's an example of an algebraic approach to bitwise operations. I'm intentionally choosing something that is obvious from the programmer's point of view to see what it corresponds to algebraically.

We know that bitwise rotation of an n-bit word x can be done with left/right shifts:

(x << 1) | (x >> (n-1)) = (x << 1) ^ (x >> (n-1)).

Algebraically, writing L and R for the one-bit left and right shift operators on n-bit words and + for XOR, this means that the rotation operator C satisfies C = L + R^(n-1). Since C is a rotation, it must satisfy C^n = 1, i.e. if we rotate n times we end up where we started. This corresponds to the identity (L + R^(n-1))^n = 1.
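
As a quick sanity check, the two claims above can be tested by brute force on small words. This is only a sketch: the function names `rotl1`, `rotl1_xor` and `rotate_times` are made up for illustration, and because Python integers are unbounded, the truncation to n bits that a fixed-width word does implicitly is written out as an explicit mask.

```python
def rotl1(x, n):
    """Rotate the n-bit word x left by one bit, using OR to combine the shifts."""
    mask = (1 << n) - 1  # keep only the low n bits, as a fixed-width word would
    return ((x << 1) | (x >> (n - 1))) & mask


def rotl1_xor(x, n):
    """Same rotation, but combining the two shifts with XOR instead of OR."""
    mask = (1 << n) - 1
    return ((x << 1) ^ (x >> (n - 1))) & mask


def rotate_times(x, n, k):
    """Apply the one-bit rotation k times, i.e. compute C^k x."""
    for _ in range(k):
        x = rotl1(x, n)
    return x


n = 8
for x in range(1 << n):
    # OR and XOR agree because the two shifted terms never share a set bit.
    assert rotl1(x, n) == rotl1_xor(x, n)
    # C^n = 1: rotating n times returns the original word.
    assert rotate_times(x, n, n) == x
```

The OR and XOR forms agree because `x << 1` always has bit 0 clear while `x >> (n-1)` can only have bit 0 set, so the two terms never overlap.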

Every emulator should have most of these input-related features but I haven't found anything with more than a small fraction:

Default and custom input profiles. Custom profiles can have game-specific input bindings. A binding in a custom profile that is set to the 'default' value defers to the corresponding binding in the default profile. It's important that custom profiles aren't just initialized as a copy of the then-current default profile: that makes it impossible to later change a non-overridden binding in the default profile and have the change automatically propagate to existing custom profiles.
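
One way to get the fallback behaviour described above is to resolve bindings at lookup time rather than copying them when the profile is created. The following is a minimal sketch under that assumption; `InputProfile`, `DEFAULT`, `resolve` and the action and button names are all hypothetical and not taken from any particular emulator.

```python
DEFAULT = object()  # hypothetical sentinel meaning "defer to the default profile"


class InputProfile:
    """A set of action -> input bindings with an optional parent profile."""

    def __init__(self, bindings=None, parent=None):
        self.bindings = dict(bindings or {})
        self.parent = parent  # None for the default profile itself

    def resolve(self, action):
        """Return the effective binding, deferring to the parent when unset."""
        value = self.bindings.get(action, DEFAULT)
        if value is DEFAULT:
            if self.parent is None:
                raise KeyError(action)
            return self.parent.resolve(action)
        return value


default_profile = InputProfile({"jump": "A", "fire": "B"})
game_profile = InputProfile({"fire": "X"}, parent=default_profile)

# Changing the default later propagates to every binding that was not overridden.
default_profile.bindings["jump"] = "Y"
assert game_profile.resolve("jump") == "Y"
assert game_profile.resolve("fire") == "X"
```

Because the custom profile stores only its overrides, a later edit to the default profile is visible immediately, which is exactly what a copy-on-create scheme cannot provide.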

The emulator should remember which custom profile was last used for which game, based on the ROM hash or filename. (The emulator should also automatically save any relevant per-game emulator settings, but I think you want that separate from input profiles. As an example of what not to do, if I set the CPU overclocking for Metal Slug in MAME to 200% to