---
title: Futhark Presentation for the /r/ProgrammingLanguages meetup
author: Troels Henriksen
date: November 22nd, 2020
---
 _____      _   _                _
|  ___|   _| |_| |__   __ _ _ __| | __
| |_ | | | | __| '_ \ / _` | '__| |/ /
|  _|| |_| | |_| | | | (_| | |  |   <
|_|   \__,_|\__|_| |_|\__,_|_|  |_|\_\

22nd of November, 2020

- Troels Henriksen (and many others)
- DIKU - University of Copenhagen
- futhark-lang.org
- But modern parallel machines (e.g. GPUs) are difficult to program.
- But human brains are really poor at reasoning about massive concurrency.
Could a sufficiently smart compiler automagically make our old code parallel? Probably not.
for (int i = 0; i < n; i++) {
  ys[i] = f(xs[i]);
}
. . .
Yes, but requires the compiler to perform index analysis to see it.
for (int i = 0; i < n; i++) {
  ys[i+1] = f(ys[i], xs[i]);
}
. . .
Yes, but I would be surprised if any parallelising compiler could detect it.
We express algorithms that have plenty of parallelism with a sequential vocabulary.
. . .
for (int i = 0; i < n; i++) {
  ys[i] = f(xs[i]);            ⇨   let ys = map f xs
}

for (int i = 0; i < n; i++) {
  ys[i+1] = f(ys[i], xs[i]);   ⇨   let ys = scan f xs
}
The existing functional programming vocabulary is almost what we need.
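To make the correspondence concrete, here is a small hedged sketch in Futhark (the function names `add_one` and `running_sums` are invented for this example): `map` expresses the independent loop directly, while `scan` expresses the recurrence, at the price of also requiring a neutral element and an associative operator.

let add_one [n] (xs: [n]i32): [n]i32 =
  map (\x -> x + 1) xs   -- the first loop: every iteration is independent

let running_sums [n] (xs: [n]i32): [n]i32 =
  scan (+) 0 xs          -- the second loop: a recurrence over an associative operator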
- Functional programming provides a programming model that is in principle parallel.
- The trick is merely how to generate code that is also fast in practice.
- We need a restricted functional language that omits complex features that are hard to implement efficiently...
- ...but is still flexible enough to write the programs we care about.
- A least-common-denominator, purely functional, hardware-agnostic language with parallel constructs.
- Practical performance is the main goal, which carries many restrictions.
- Syntax is a mix of Haskell and SML to ensure equal dislike by everybody.
. . .
- Fusion, moderate flattening, and many other optimisations (a small fusion sketch follows below).
- Focuses on the efficient common case for now.
- Generates GPU-optimised code via OpenCL or CUDA.
- ... or multicore CPU code with pthreads.
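As a rough, hedged illustration of what fusion buys (the function name `normalise` and its constants are invented for this sketch): the producer and consumer `map`s below can be combined into a single pass, so the intermediate array never needs to be materialised in memory.

let normalise [n] (xs: [n]f32): [n]f32 =
  let shifted = map (\x -> x - 1.0) xs   -- producer
  in map (\x -> x * 0.5) shifted         -- consumer; fused with the producer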
. . .
let dotprod [n] (xs: [n]i32) (ys: [n]i32): i32 =
  reduce (+) 0 (map2 (*) xs ys)
. . .
let dotprod [n] (xs: [n]i32) (ys: [n]i32): i32 =
  reduce (+) 0 (map2 (*) xs ys)

let matmul [n][m][p] (a: [n][m]i32) (b: [m][p]i32): [n][p]i32 =
  map (\a_row ->
        map (\b_col ->
               dotprod a_row b_col)
            (transpose b))
      a
Be faster than everything that is more flexible, and more flexible than everything that is faster.
. . .
Taking over the world!
. . .
- We don't want to replace all languages. (Nor could we.)
- Futhark is for small performance-sensitive computational kernels. The rest of the application can remain unchanged.
- Normally implemented via function pointers.
- GPUs do not (efficiently) support function pointers.
- Indirect calls are slow even on CPUs.
. . .
Fortunately, the 70s were full of people who did not like function pointers either.
Basic idea:
- Replace each lambda by a tagged data value that captures the free variables:

  \x -> x + y  ⇨  Lam0 y
  \x -> z * x  ⇨  Lam1 z
. . .
- Replace function calls by case-switching over these functions:

  f a  ⇨  case f of Lam0 y -> a + y
                    Lam1 z -> z * a
                    ...
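A hedged sketch of this classic scheme in Futhark's own syntax, using a sum type (the names `closure`, `apply`, `#lam0`, and `#lam1` are invented for illustration; this is not literally what the Futhark compiler generates):

-- Each constructor carries the free variables of one lambda.
type closure = #lam0 i32   -- \x -> x + y, capturing y
             | #lam1 i32   -- \x -> z * x, capturing z

-- Every application site becomes a case-switch over the tag.
let apply (f: closure) (x: i32): i32 =
  match f
  case #lam0 y -> x + y
  case #lam1 z -> z * x

The `match` makes the cost explicit: every call turns into a branch over all the lambdas in the program.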
. . .
Unfortunately, branches are also problematic!
Type restrictions to ensure we always know which function is being called.
. . .
let f = if b1 then \x -> foo
        else if b2 then \x -> bar
        else ... \x -> baz
in ... f y
- Which function `f` is applied?
- So we ban conditionals from returning functions.
let fs = [\y -> y+a, \z -> z*b, ...]
in ... fs[i] 5
- Which function `fs[i]` is applied?
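For contrast, a hedged sketch of higher-order code that the restrictions do permit (the function `scale` is invented for illustration): the compiler can always see exactly which lambda reaches the application site inside `map`.

let scale [n] (factor: i32) (xs: [n]i32): [n]i32 =
  let f = \x -> x * factor  -- fine: f is bound directly to a single known lambda
  in map f xs               -- fine: the function being applied is statically known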
Original program:
let a = 1
let b = 2
let f = \x -> x+a
in f b
. . .
Defunctionalised:
let a = 1
let b = 2
let f = {a=a} -- record that captures free variables
in f' f b
with lifted function
let f' env x =
  let a = env.a
  in x+a
- Restricting the language enables better code generation.
- Crucial: the restrictions are easy to understand, checked by the type checker, and usually not a hindrance in practice.
. . .
- Futhark unboxes everything
- Everything is call-by-value (except arrays)
A triple `(a,b,c)` is treated as three distinct values, kept in registers.

- Consider arrays of type `[](i32, i8)`.
- Since an `i32` is four bytes and an `i8` is one byte, how should Futhark store this in memory?
. . .
0          4    5          9    10...
┌──────────┬────┬──────────┬────┬─
│   i32    │ i8 │   i32    │ i8 │...
└──────────┴────┴──────────┴────┴─
. . .
Problem: Unaligned access.
. . .
0          4          8          12         16
┌──────────┬──────────┬──────────┬──────────┬─
│   i32    │    i8    │   i32    │    i8    │...
└──────────┴──────────┴──────────┴──────────┴─
Problem: Waste of memory.
The Futhark compiler represents an array of type `[n](t1, t2, t3, ...)` as multiple arrays of types `[n]t1`, `[n]t2`, `[n]t3`, ...
0          4          8          12         16
┌──────────┬──────────┬──────────┬──────────┬─
│   i32    │   i32    │   i32    │   i32    │...
└──────────┴──────────┴──────────┴──────────┴─

0    1    2    3    4    5    6    7    8    9
┌────┬────┬────┬────┬────┬────┬────┬────┬────┬─
│ i8 │ i8 │ i8 │ i8 │ i8 │ i8 │ i8 │ i8 │ i8 │...
└────┴────┴────┴────┴────┴────┴────┴────┴────┴─
- Common (and crucial) transformation.
- Called "struct of arrays" in legacy languages.
- Automatically done by the Futhark compiler (a small sketch of the source-level view follows below).
- Only affects the internal language.
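As a hedged sketch of how this surfaces at the source level (the function `example` is invented for illustration): arrays of tuples are perfectly fine to write, and `zip`/`unzip` are essentially free, precisely because the components are already stored as separate arrays.

let example [n] (ints: [n]i32) (bytes: [n]i8): ([n]i32, [n]i8) =
  let pairs = zip ints bytes   -- an array of type [n](i32, i8)
  let (xs, ys) = unzip pairs   -- recovers the component arrays without copying
  in (xs, ys)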
- Writing a small and focused language is fun.
- Designing a language for performance requires you to think hard about what the language permits.
- Lots more on our dev blog, particularly: