---
title: Futhark Presentation for the /r/ProgrammingLanguages meetup
author: Troels Henriksen
date: November 22nd, 2020
---
 _____      _   _                _
|  ___|   _| |_| |__   __ _ _ __| | __
| |_ | | | | __| '_ \ / _` | '__| |/ /
|  _|| |_| | |_| | | | (_| | |  |   <
|_|   \__,_|\__|_| |_|\__,_|_|  |_|\_\

22nd of November, 2020

- Troels Henriksen (and many others)
- DIKU - University of Copenhagen
- futhark-lang.org
- But modern parallel machines (e.g. GPUs) are difficult to program.
- But human brains are really poor at reasoning about massive concurrency.
Could a sufficiently smart compiler automagically make our old code parallel? Probably not.
for (int i = 0; i < n; i++) {
  ys[i] = f(xs[i]);
}
. . .
Yes, but requires the compiler to perform index analysis to see it.
for (int i = 0; i < n; i++) {
  ys[i+1] = f(ys[i], xs[i]);
}
. . .
Yes, but I would be surprised if any parallelising compiler could detect it.
We express algorithms that have plenty of parallelism with a sequential vocabulary.
. . .
for (int i = 0; i < n; i++) {
  ys[i] = f(xs[i]);            ⇨   let ys = map f xs
}

for (int i = 0; i < n; i++) {
  ys[i+1] = f(ys[i], xs[i]);   ⇨   let ys = scan f xs
}
The existing functional programming vocabulary is almost what we need.
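To make the correspondence concrete, here is a small hedged sketch in Futhark (the function names `add_one` and `running_sums` are invented for this example): `map` expresses the independent loop directly, while `scan` expresses the recurrence, at the price of also requiring a neutral element and an associative operator.

let add_one [n] (xs: [n]i32): [n]i32 =
  map (\x -> x + 1) xs   -- the first loop: every iteration is independent

let running_sums [n] (xs: [n]i32): [n]i32 =
  scan (+) 0 xs          -- the second loop: a recurrence over an associative operator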
- Functional programming provides a programming model that is in principle parallel.
- The trick is merely how to generate code that is also fast in practice.
- We need a restricted functional language that omits complex features that are hard to implement efficiently...
- ...but is still flexible enough to write the programs we care about.
- A least-common-denominator, purely functional, hardware-agnostic language with parallel constructs.
- Practical performance is the main goal, which carries many restrictions.
- Syntax is a mix of Haskell and SML to ensure equal dislike by everybody.
. . .
- Fusion, moderate flattening, and many other optimisations (a small fusion sketch follows below).
- Focuses on the efficient common case for now.
- Generates GPU-optimised code via OpenCL or CUDA.
- ... or multicore CPU code with pthreads.
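As a rough, hedged illustration of what fusion buys (the function name `normalise` and its constants are invented for this sketch): the producer and consumer `map`s below can be combined into a single pass, so the intermediate array never needs to be materialised in memory.

let normalise [n] (xs: [n]f32): [n]f32 =
  let shifted = map (\x -> x - 1.0) xs   -- producer
  in map (\x -> x * 0.5) shifted         -- consumer; fused with the producer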
. . .
let dotprod [n] (xs: [n]i32) (ys: [n]i32): i32 =
  reduce (+) 0 (map2 (*) xs ys)
. . .
let dotprod [n] (xs: [n]i32) (ys: [n]i32): i32 =
  reduce (+) 0 (map2 (*) xs ys)

let matmul [n][m][p] (a: [n][m]i32) (b: [m][p]i32): [n][p]i32 =
  map (\a_row ->
        map (\b_col ->
               dotprod a_row b_col)
            (transpose b))
      a
Be faster than everything that is more flexible, and more flexible than everything that is faster.
. . .
Taking over the world!
. . .
- We don't want to replace all languages. (Nor could we.)
- Futhark is for small performance-sensitive computational kernels. The rest of the application can remain unchanged.
- Normally implemented via function pointers.
- GPUs do not (efficiently) support function pointers.
- Indirect calls are slow even on CPUs.
. . .
Fortunately, the 70s were full of people who did not like function pointers either.
Basic idea:
- Replace each lambda by a tagged data value that captures the free variables:

  \x -> x + y  ⇨  Lam0 y
  \x -> z * x  ⇨  Lam1 z
. . .
- Replace function calls by case-switching over these functions:

  f a  ⇨  case f of Lam0 y -> a + y
                    Lam1 z -> z * a
                    ...
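A hedged sketch of this classic scheme in Futhark's own syntax, using a sum type (the names `closure`, `apply`, `#lam0`, and `#lam1` are invented for illustration; this is not literally what the Futhark compiler generates):

-- Each constructor carries the free variables of one lambda.
type closure = #lam0 i32   -- \x -> x + y, capturing y
             | #lam1 i32   -- \x -> z * x, capturing z

-- Every application site becomes a case-switch over the tag.
let apply (f: closure) (x: i32): i32 =
  match f
  case #lam0 y -> x + y
  case #lam1 z -> z * x

The `match` makes the cost explicit: every call turns into a branch over all the lambdas in the program.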
. . .
Unfortunately, branches are also problematic!
Type restrictions to ensure we always know which function is being called.
. . .
let f = if b1 then \x -> foo
        else if b2 then \x -> bar
        else ... \x -> baz
in ... f y
- Which function `f` is applied?
- So we ban conditionals from returning functions.
let fs = [\y -> y+a, \z -> z*b, ...]
in ... fs[i] 5
- Which function `fs[i]` is applied?
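For contrast, a hedged sketch of higher-order code that the restrictions do permit (the function `scale` is invented for illustration): the compiler can always see exactly which lambda reaches the application site inside `map`.

let scale [n] (factor: i32) (xs: [n]i32): [n]i32 =
  let f = \x -> x * factor  -- fine: f is bound directly to a single known lambda
  in map f xs               -- fine: the function being applied is statically known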
Original program:
let a = 1
let b = 2
let f = \x -> x+a
in f b
. . .
Defunctionalised:
let a = 1
let b = 2
let f = {a=a} -- record that captures free variables
in f' f b
with lifted function
let f' env x =
  let a = env.a
  in x+a
- Restricting the language enables better code generation.
- Crucial: the restrictions are easy to understand, checked by the type checker, and usually not a hindrance in practice.
. . .
- Futhark unboxes everything
- Everything is call-by-value (except arrays)
A triple `(a,b,c)` is treated as three distinct values, kept in registers.

- Consider arrays of type `[](i32, i8)`.
- Since an `i32` is four bytes and an `i8` is one byte, how should Futhark store this in memory?
. . .
0          4    5          9    10...
┌──────────┬────┬──────────┬────┬─
│   i32    │ i8 │   i32    │ i8 │...
└──────────┴────┴──────────┴────┴─
. . .
Problem: Unaligned access.
. . .
0          4          8          12         16
┌──────────┬──────────┬──────────┬──────────┬─
│   i32    │    i8    │   i32    │    i8    │...
└──────────┴──────────┴──────────┴──────────┴─
Problem: Waste of memory.
The Futhark compiler represents an array of type `[n](t1, t2, t3, ...)` as multiple arrays of types `[n]t1`, `[n]t2`, `[n]t3`, ...
0          4          8          12         16
┌──────────┬──────────┬──────────┬──────────┬─
│   i32    │   i32    │   i32    │   i32    │...
└──────────┴──────────┴──────────┴──────────┴─

0    1    2    3    4    5    6    7    8    9
┌────┬────┬────┬────┬────┬────┬────┬────┬────┬─
│ i8 │ i8 │ i8 │ i8 │ i8 │ i8 │ i8 │ i8 │ i8 │...
└────┴────┴────┴────┴────┴────┴────┴────┴────┴─
- Common (and crucial) transformation.
- Called "struct of arrays" in legacy languages.
- Automatically done by the Futhark compiler (a small sketch of the source-level view follows below).
- Only affects the internal language.
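As a hedged sketch of how this surfaces at the source level (the function `example` is invented for illustration): arrays of tuples are perfectly fine to write, and `zip`/`unzip` are essentially free, precisely because the components are already stored as separate arrays.

let example [n] (ints: [n]i32) (bytes: [n]i8): ([n]i32, [n]i8) =
  let pairs = zip ints bytes   -- an array of type [n](i32, i8)
  let (xs, ys) = unzip pairs   -- recovers the component arrays without copying
  in (xs, ys)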
- Writing a small and focused language is fun.
- Designing a language for performance requires you to think hard about what the language permits.
- Lots more on our dev blog, particularly: