
HVM3 - Pitch Benchmarks

Stress Test - HVM's Historic Progress

This is a stress test of the maximum possible throughput of the HVM. It is a very simple program that just loops in a "best case" scenario: maximum parallelism, and maximum compiled speed (since it compiles to a C loop). I like it because it highlights each major generational breakthrough:

  • Optlam/Absal were my first JS toy implementations of the theory. They were exponentially faster than GHC in higher-order computations, but veeeery slow in "normal" programs.

  • Formality-Net was an early attempt at implementing interaction nets efficiently, back when I worked on Ethereum, but it was memory-inefficient and had lower throughput than I had hoped.

  • HVM1 was the first implementation of our memory-efficient architecture. It was much faster than Absal, and "just" ~3x slower than Haskell/GHC on a single core. It also had a preview of a parallel evaluator and a compiler, but both were experimental and buggy.

  • HVM2 marks the first correct version of our mass parallelizer. It achieves near-ideal speedup per core: on 16 CPU cores, it is ~12x faster than on one CPU core; on ~32384 GPU cores, it is ~20000x faster than on one GPU core.

  • HVM3 marks the first long-term release of our low-level compiler. It identifies fragments that can be compiled to efficient low-level code, and emits C for them. This results in a 10x-100x speedup in some cases, surpassing single-core Haskell and approaching C and Rust.

The combination of all of these breakthroughs - a memory-efficient architecture (HVM1), a mass parallelizer (HVM2), and a low-level compiler (HVM3) - will form the backbone of Bend, which I believe is heading towards becoming the fastest programming language in the world, period. And all of that while being asymptotically optimal, i.e., exponentially (in a complexity-theory sense) faster than all alternatives for higher-order and symbolic computations.

Absal: ~12h

(performance estimated based on interactions per second - Absal had no native ints)

Formality-Net: ~5m

(performance also estimated based on interactions per second)

HVM1: 16.36s

  • Initial architecture, with efficient memory format
// Run with compiled mode, 1-thread:
// $ hvm1 compile stress.hvm1
// $ cd stress
// $ cargo build
// $ cd target/release
// $ hvm1 run "Main" -t 1 -c

(Loop 0) = 0
(Loop n) = (Loop (- n 1))

(Fun 0) = (Loop 65536)
(Fun n) = (+ (Fun (- n 1)) (Fun (- n 1)))

(Main) = (Fun 16)
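
For reference, here is a direct Haskell transliteration of the same program - my own sketch, not part of the original benchmark suite - showing what the GHC baseline mentioned above would compute: fun 16 forks into 2^16 independent calls, each counting down from 65536.

-- Sketch: Haskell version of the stress program (for comparison only)
loop :: Int -> Int
loop 0 = 0
loop n = loop (n - 1)

fun :: Int -> Int
fun 0 = loop 65536
fun n = fun (n - 1) + fun (n - 1)

main :: IO ()
main = print (fun 16)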

HVM2: 2.57s

  • Automatic parallelism introduced
// Run compiled to C:
// $ hvm2 gen-c stress.hvm2 > stress.c
// $ gcc stress.c -O2 -o stress
// $ ./stress

@fun = (?((@fun__C0 @fun__C1) a) a)

@fun__C0 = a
  & @loop ~ (65536 a)

@fun__C1 = ({a b} d)
  &! @fun ~ (a $([+] $(c d)))
  &! @fun ~ (b c)

@loop = (?((0 @loop__C0) a) a)

@loop__C0 = a
  & @loop ~ a

@main = a
  & @fun ~ (16 a)

HVM3: 0.27s

  • Low-level compiler and codegen (coming soon!)
// Run with C mode:
// $ hvm3 run stress.hvm3 -c -s

@loop(n) = ~n {
  0: 0
  p: @loop(p)
}

@fun(n) = ~n {
  0: @loop(65536)
  p: !&0{p0 p1}=p (+ @fun(p0) @fun(p1))
}

@main = @fun(16)

Note, again, that this is a very simple test meant to highlight the key milestones. We aim to launch a full benchmark suite after the release of the compiler. I believe the results will speak for themselves and show why interaction nets can be so efficient: they're the leanest model of computation, and, surprisingly enough, the fewer things a computer does, the faster it runs.

Algebraic Factorizer - HVM vs Alternatives (for Symbolic Computing)

In this test, we implemented a prime-number factorizer with a catch: it uses no machine ints, only algebraic datatypes, pattern-matching, and recursion. Of course, this is a terrible way to factor numbers; the goal here is to test each runtime as a symbolic computing engine. Even on this very simple test, HVM already shows a ~20x speedup against GHC, which is one of the fastest functional compilers in the world. Note that this speedup comes from neither parallelism nor λ-encodings: it is merely due to optimal looping (via superposition nodes). We believe larger speedups will be seen in more complex algorithms.

Python: ~2h

# COMMAND: python main.py

# ADT constructors using classes for faster pattern matching
class O:
    def __init__(self, p):
        self.p = p
        self.tag = 0  # Using integers for faster comparison

class I:
    def __init__(self, p):
        self.p = p
        self.tag = 1

class E:
    def __init__(self):
        self.tag = 2

E = E()  # single shared instance for empty (intentionally shadows the class)

def u32(b):
    if b.tag == 0:  # O
        return 2 * u32(b.p) + 0
    elif b.tag == 1:  # I
        return 2 * u32(b.p) + 1
    else:  # E
        return 0

def bin(s, n):
    if s == 0:
        return E
    return O(bin(s-1, n//2)) if n % 2 == 0 else I(bin(s-1, n//2))

def eq(a, b):
    if a.tag == 2 and b.tag == 2:
        return True
    if a.tag == 0 and b.tag == 0:
        return eq(a.p, b.p)
    if a.tag == 1 and b.tag == 1:
        return eq(a.p, b.p)
    return False

def inc(b):
    if b.tag == 0:
        return I(b.p)
    elif b.tag == 1:
        return O(inc(b.p))
    else:
        return E

def add(a, b):
    if a.tag == 2 or b.tag == 2:
        return E
    if a.tag == 0 and b.tag == 0:
        return O(add(a.p, b.p))
    if a.tag == 0 and b.tag == 1:
        return I(add(a.p, b.p))
    if a.tag == 1 and b.tag == 0:
        return I(add(a.p, b.p))
    if a.tag == 1 and b.tag == 1:
        return O(inc(add(a.p, b.p)))

def mul(a, b):
    if b.tag == 2:
        return E
    if b.tag == 0:
        return O(mul(a, b.p))
    return add(a, O(mul(a, b.p)))

def cat(a, b):
    if a.tag == 0:
        return O(cat(a.p, b))
    elif a.tag == 1:
        return I(cat(a.p, b))
    else:
        return b

#k = 12
#h = 13
#s = 26
#p = I(O(I(I(O(O(O(O(I(I(I(O(O(O(I(O(I(I(I(I(I(I(O(I(O(I(E))))))))))))))))))))))))))

k = 13
h = 14
s = 28
p = I(I(O(I(O(I(I(O(I(I(I(O(O(I(I(O(I(O(O(O(I(O(O(I(I(O(O(I(E))))))))))))))))))))))))))))

import time

def main():
    start = time.time()
    for a in range(2 ** h):
        for b in range(2 ** h):
            binA = cat(bin(h, a), bin(h, 0))
            binB = cat(bin(h, b), bin(h, 0))
            if eq(mul(binA, binB), p):
                duration = time.time() - start
                print(f"FACT: {a} {b}")
                print(f"TIME: {duration:.7f} seconds")
                return

if __name__ == "__main__":
    main()

JavaScript: 5m40s

// COMMAND: bun run prime.js

// ADT constructors
const O = p => ({$:"O", p})
const I = p => ({$:"I", p})
const E = {$:"E"}

const u32 = b => {
  switch(b.$) {
    case "O": return 2 * u32(b.p) + 0
    case "I": return 2 * u32(b.p) + 1
    case "E": return 0
  }
}

const bin = (s, n) => {
  if (s === 0) return E
  return n % 2 === 0 
    ? O(bin(s-1, Math.floor(n/2)))
    : I(bin(s-1, Math.floor(n/2)))
}

const eq = (a, b) => {
  if (a.$ === "E" && b.$ === "E") return true
  if (a.$ === "O" && b.$ === "O") return eq(a.p, b.p)
  if (a.$ === "I" && b.$ === "I") return eq(a.p, b.p)
  return false
}

const inc = b => {
  switch(b.$) {
    case "O": return I(b.p)
    case "I": return O(inc(b.p))
    case "E": return E
  }
}

const add = (a, b) => {
  if (a.$ === "E" || b.$ === "E") return E
  if (a.$ === "O" && b.$ === "O") return O(add(a.p, b.p))
  if (a.$ === "O" && b.$ === "I") return I(add(a.p, b.p))
  if (a.$ === "I" && b.$ === "O") return I(add(a.p, b.p))
  if (a.$ === "I" && b.$ === "I") return O(inc(add(a.p, b.p)))
}

const mul = (a, b) => {
  if (b.$ === "E") return E
  if (b.$ === "O") return O(mul(a, b.p))
  if (b.$ === "I") return add(a, O(mul(a, b.p)))
}

const cat = (a, b) => {
  switch(a.$) {
    case "O": return O(cat(a.p, b))
    case "I": return I(cat(a.p, b))
    case "E": return b
  }
}

//const k = 12
//const h = 13
//const s = 26
//const p = I(O(I(I(O(O(O(O(I(I(I(O(O(O(I(O(I(I(I(I(I(I(O(I(O(I(E))))))))))))))))))))))))))

const k = 13
const h = 14
const s = 28
const p = I(I(O(I(O(I(I(O(I(I(I(O(O(I(I(O(I(O(O(O(I(O(O(I(I(O(O(I(E))))))))))))))))))))))))))))

async function main() {
  const start = Date.now()
  for (let a = 0; a < Math.pow(2, h); a++) {
    for (let b = 0; b < Math.pow(2, h); b++) {
      const binA = cat(bin(h, a), bin(h, 0))
      const binB = cat(bin(h, b), bin(h, 0))
      if (eq(mul(binA, binB), p)) {
        const duration = (Date.now() - start) / 1000
        console.log(`FACT: ${a} ${b}`)
        console.log(`TIME: ${duration.toFixed(7)} seconds`)
        process.exit(0)
      }
    }
  }
}

main()

Haskell: 2.76s

-- COMMAND: ghc -O2 prime.hs -o prime; ./prime

import Control.Monad (forM_, when)
import Data.Time.Clock (getCurrentTime, diffUTCTime)
import System.Exit (exitSuccess)
import Text.Printf (printf)

data Bin = O Bin | I Bin | E deriving (Show, Eq)

u32 :: Bin -> Int
u32 (O p) = 2 * u32 p + 0
u32 (I p) = 2 * u32 p + 1
u32 E     = 0

bin :: Int -> Int -> Bin
bin 0 _ = E
bin s n = case n `mod` 2 of
  0 -> O (bin (s-1) (n `div` 2))
  _ -> I (bin (s-1) (n `div` 2))

eq :: Bin -> Bin -> Bool
eq E     E     = True
eq (O a) (O b) = eq a b
eq (I a) (I b) = eq a b
eq _     _     = False

inc :: Bin -> Bin
inc (O p) = I p
inc (I p) = O (inc p)
inc E     = E

add :: Bin -> Bin -> Bin
add (O a) (O b) = O (add a b)
add (O a) (I b) = I (add a b)
add (I a) (O b) = I (add a b)
add (I a) (I b) = O (inc (add a b))
add E     _     = E
add _     E     = E

mul :: Bin -> Bin -> Bin
mul _ E     = E
mul a (O b) = O (mul a b)
mul a (I b) = add a (O (mul a b))

cat :: Bin -> Bin -> Bin
cat (O a) b = O (cat a b)
cat (I a) b = I (cat a b)
cat E     b = b

-- k = 12
-- h = 13
-- s = 26
-- p = I(O(I(I(O(O(O(O(I(I(I(O(O(O(I(O(I(I(I(I(I(I(O(I(O(I(E))))))))))))))))))))))))))

k = 13
h = 14
s = 28
p = I(I(O(I(O(I(I(O(I(I(I(O(O(I(I(O(I(O(O(O(I(O(O(I(I(O(O(I(E))))))))))))))))))))))))))))

main :: IO ()
main = do
  start <- getCurrentTime
  forM_ [0..2^h-1] $ \a -> do
    forM_ [0..2^h-1] $ \b -> do
      let binA = cat (bin h a) (bin h 0)
      let binB = cat (bin h b) (bin h 0)
      when (eq (mul binA binB) p) $ do
        end <- getCurrentTime
        let duration = diffUTCTime end start
        putStrLn $ "FACT: " ++ show a ++ " " ++ show b
        putStrLn $ "TIME: " ++ printf "%.7f seconds" (realToFrac duration :: Double)
        exitSuccess

HVM3: 0.15s

// COMMAND: hvml run main.hvml -s -S -c

// Bitstrings
data Bin { #O{pred} #I{pred} #E }

// If-Then-Else
@if(b t f) = ~b {
  0: f
  k: t
}

// Converts a Bin to an U32
@u32(b) = ~b{
  #O{p}: (+ (* 2 @u32(p)) 0)
  #I{p}: (+ (* 2 @u32(p)) 1)
  #E: 0
}

// Converts an U32 to a Bin of given size
@bin(s n) = ~s{
  0: #E
  p: !&0{n0 n1}=n ~(% n0 2) !p !n1 {
    0: #O{@bin(p (/ n1 2))}
    k: #I{@bin(p (/ n1 2))}
  }
}

// Bin Equality
@eq(a b) = ~a !b {
  #E: ~b {
    #O{bp}: 0
    #I{bp}: 0
    #E: 1
  }
  #O{ap}: ~b{
    #O{bp}: @eq(ap bp)
    #I{bp}: 0
    #E: 0
  }
  #I{ap}: ~b{
    #O{bp}: 0
    #I{bp}: @eq(ap bp)
    #E: 0
  }
}

// Increments a Bin
@inc(a) = ~a{
  #O{p}: #I{p}
  #I{p}: #O{@inc(p)}
  #E: #E
}

// Decrements a Bin
@dec(a) = ~a{
  #O{p}: #O{@dec(p)}
  #I{p}: #I{p}
  #E: #E
}

// Adds two Bins
@add(a b) = ~a !b {
  #O{ap}: ~b !ap {
    #O{bp}: #O{@add(ap bp)}
    #I{bp}: #I{@add(ap bp)}
    #E: #E
  }
  #I{ap}: ~b !ap {
    #O{bp}: #I{@add(ap bp)}
    #I{bp}: #O{@inc(@add(ap bp))}
    #E: #E
  }
  #E: #E
}

// Muls two Bins
@mul(a b) = ~b !a {
  #O{bp}: #O{@mul(a bp)}
  #I{bp}: !&0{a0 a1}=a @add(a0 #O{@mul(a1 bp)})
  #E: #E
}

// Concatenates two Bins
@cat(a b) = ~a !b {
  #O{ap}: #O{@cat(ap b)}
  #I{ap}: #I{@cat(ap b)}
  #E: b
}

// Enums all Bins of given size (label 1)
@all1(s) = ~s{
  0: #E
  p: !&1{p0 p1}=p &1{ #O{@all1(p0)} #I{@all1(p1)} }
}

// Enums all Bins of given size (label 2)
@all2(s) = ~s{
  0: #E
  p: !&2{p0 p1}=p &2{ #O{@all2(p0)} #I{@all2(p1)} }
}

//@K = 12
//@H = 13
//@S = 26
//@X = @cat(@all1(@H) @bin(@H 0))
//@Y = @cat(@all2(@H) @bin(@H 0))
//@P = #1{#0{#1{#1{#0{#0{#0{#0{#1{#1{#1{#0{#0{#0{#1{#0{#1{#1{#1{#1{#1{#1{#0{#1{#0{#1{#2}}}}}}}}}}}}}}}}}}}}}}}}}}

@K = 13
@H = 14
@S = 28
@X = @cat(@all1(@H) @bin(@H 0))
@Y = @cat(@all2(@H) @bin(@H 0))
@P = #1{#1{#0{#1{#0{#1{#1{#0{#1{#1{#1{#0{#0{#1{#1{#0{#1{#0{#0{#0{#1{#0{#0{#1{#1{#0{#0{#1{#2}}}}}}}}}}}}}}}}}}}}}}}}}}}}

@main = @if(@eq(@mul(@X @Y) @P) λt(t @u32(@X) @u32(@Y)) *)
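
As an aside on where the speedup comes from: @all1 and @all2 above superpose every Bin of a given size, so @main tests all candidate pairs in a single expression, sharing work across branches. A rough Haskell analogy (mine, and only an analogy - the list monad re-executes work that superpositions share) would be:

-- Sketch: the same candidate enumeration as plain nondeterminism
data Bin = O Bin | I Bin | E deriving (Show, Eq)

-- every Bin of the given size; @all1/@all2 superpose this whole space
allBins :: Int -> [Bin]
allBins 0 = [E]
allBins s = concatMap (\r -> [O r, I r]) (allBins (s - 1))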

Higher-Order Equation Solving

This is a new test that I just implemented today (Nov 28, 2024). The goal here was to solve a simple higher-order equation:

(?X λt(t 1 2)) == λt(t 2 1)

I implemented it in 3 different ways:

    1. Enumerate BLC strings, parse and test (Haskell): 0.992s
    2. Superpose BLC strings, parse and test (HVM): 0.026s (38x speedup)
    3. Superpose λ-terms directly and test (HVM): 0.0011s (862x speedup)

The 3rd approach is implemented and described in this file.
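
For the curious, below is a minimal sketch of approach 1 - my own reconstruction, not the gist's actual code. It enumerates closed λ-terms by size (de Bruijn indices rather than BLC strings, and unoptimized, so far slower than the 0.992s figure above), applies each candidate X to λt(t 1 2), normalizes with a fuel limit, and compares against λt(t 2 1), with Church numerals standing in for the literals 1 and 2.

-- Sketch: enumerate closed λ-terms, apply each candidate X to
-- λt(t 1 2), normalize, and compare against λt(t 2 1). Fuel bounds
-- normalization so divergent candidates are skipped, not looped on.
import Data.Maybe (listToMaybe)

data Term = Var Int | Lam Term | App Term Term deriving (Eq, Show)

-- shift free variables >= cutoff c by d
shift :: Int -> Int -> Term -> Term
shift d c (Var k)   = Var (if k >= c then k + d else k)
shift d c (Lam b)   = Lam (shift d (c + 1) b)
shift d c (App f a) = App (shift d c f) (shift d c a)

-- substitute s for variable j
subst :: Int -> Term -> Term -> Term
subst j s (Var k)   = if k == j then s else Var k
subst j s (Lam b)   = Lam (subst (j + 1) (shift 1 0 s) b)
subst j s (App f a) = App (subst j s f) (subst j s a)

-- beta-reduce the body b applied to argument a
beta :: Term -> Term -> Term
beta b a = shift (-1) 0 (subst 0 (shift 1 0 a) b)

-- weak head normal form, with fuel
whnf :: Int -> Term -> Maybe Term
whnf 0 _         = Nothing
whnf n (App f a) = do
  f' <- whnf (n - 1) f
  case f' of
    Lam b -> whnf (n - 1) (beta b a)
    _     -> Just (App f' a)
whnf _ t         = Just t

-- full normal form, with fuel
nf :: Int -> Term -> Maybe Term
nf 0 _ = Nothing
nf n t = do
  t' <- whnf n t
  case t' of
    Var k   -> Just (Var k)
    Lam b   -> Lam <$> nf (n - 1) b
    App f a -> App <$> nf (n - 1) f <*> nf (n - 1) a

-- all terms of exactly s constructors, under env binders
termsOf :: Int -> Int -> [Term]
termsOf s env
  | s <= 0    = []
  | s == 1    = [Var i | i <- [0 .. env - 1]]
  | otherwise = [Lam b | b <- termsOf (s - 1) (env + 1)]
             ++ [App f a | k <- [1 .. s - 2]
                         , f <- termsOf k env
                         , a <- termsOf (s - 1 - k) env]

-- Church numeral for n
church :: Int -> Term
church n = Lam (Lam (iterate (App (Var 1)) (Var 0) !! n))

input, target :: Term
input  = Lam (App (App (Var 0) (church 1)) (church 2))  -- λt(t 1 2)
target = Lam (App (App (Var 0) (church 2)) (church 1))  -- λt(t 2 1)

main :: IO ()
main = print $ listToMaybe
  [ x | s <- [1 ..], x <- termsOf s 0
      , nf 1000 (App x input) == Just target ]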

@ashtonsix commented Nov 28, 2024:

Ah gotcha, I updated the fork.

I'm currently trying to learn just enough about lambda calculus and interaction nets to see how well they might be able to navigate low-level optimisation constraints such as cache-line alignment, cache misses, and branch prediction, and, of course, make appropriate use of SIMD intrinsics (there are over 30,000 between x86 and ARM combined). Between naive and optimal C++ you can often realise an order-of-magnitude performance difference by paying attention to these, but it takes a lot of time and thinking from a skilled human to pull it off. AI really struggles with it, and so do compilers like LLVM.

SIMD parallelism on the CPU compares very favourably to GPU parallelism for fewer than 1000 elements, since such loops can usually complete within the time it takes to schedule a GPU kernel (allocating SMs, memory, etc.). With non-unified memory, where there's additional latency introduced by the RAM/VRAM interconnect, the balance tilts further towards the CPU. So for optimal performance you'd likely need to explore hybrids of CPU/GPU compute. Oh, and I totally forgot about the low-level constraints to satisfy on the GPU: thread warps and bank collisions and such.
