This is a stress test of the maximum possible throughput of the HVM. It is a very simple program that just loops in a "best case" scenario: maximum parallelism and maximum compiled speed (since it compiles to a C loop). I like it because it highlights each major generational breakthrough:
- Optlam/Absal were my first JS toy implementations of the theory. They were exponentially faster than GHC in higher-order computations, but veeeery slow in "normal" programs.
- Formality-Net was an early attempt at implementing inets efficiently, back when I worked on Ethereum, but it was memory-inefficient and had a lower throughput than I wished.
- HVM1 was the first implementation of our memory-efficient architecture. It was much faster than Absal, and "just" ~3x slower than Haskell/GHC on a single core. It also included previews of a parallel evaluator and a compiler, but both were experimental and buggy.
- HVM2 marks the first correct version of our mass parallelizer. It achieves near-ideal speedup per core: on 16 CPU cores, it is ~12x faster than on one CPU core; on ~32384 GPU cores, it is ~20000x faster than on one GPU core.
- HVM3 marks the first long-term release of our low-level compiler. It identifies fragments that can be compiled to efficient low-level code and emits C for them. This results in a 10x-100x speedup in some cases, surpassing single-core Haskell and approaching C and Rust.
The combination of all of these breakthroughs - a memory-efficient architecture (HVM1), a mass parallelizer (HVM2), and a low-level compiler (HVM3) - will form the backbone of Bend, which I believe is heading towards becoming the fastest programming language in the world, period. And all of that while being asymptotically optimal, i.e., exponentially faster (in a complexity-theory sense) than all alternatives for higher-order and symbolic computations.
(Performance of the earlier runtimes is estimated from interactions per second; Absal had no native ints.)
- Initial architecture, with efficient memory format
// Run with compiled mode, 1-thread:
// $ hvm1 compile stress.hvm1
// $ cd stress
// $ cargo build
// $ cd target/release
// $ hvm1 run "Main" -t 1 -c
(Loop 0) = 0
(Loop n) = (Loop (- n 1))
(Fun 0) = (Loop 65536)
(Fun n) = (+ (Fun (- n 1)) (Fun (- n 1)))
(Main) = (Fun 16)
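For reference, here is the same workload in plain Haskell (my own transliteration, not part of the original benchmark). It makes clear why the program parallelizes so well: `Fun 16` fans out into 2^16 fully independent calls to `Loop 65536`.
-- A direct transliteration of the stress program above, handy as a
-- single-core GHC baseline. The binary recursion in `fun` spawns
-- 2^16 independent loops of 65536 decrements each.
loop :: Int -> Int
loop 0 = 0
loop n = loop (n - 1)

fun :: Int -> Int
fun 0 = loop 65536
fun n = fun (n - 1) + fun (n - 1)

main :: IO ()
main = print (fun 16)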
- Automatic parallelism introduced
// Run compiled to C:
// $ hvm2 gen-c stress.hvm2 > stress.c
// $ gcc stress.c -O2 -o stress
// $ ./stress
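// (Informal reading of the net below, under my understanding of the
// syntax: ?(...) matches a number against a zero branch and a successor
// branch; {a b} duplicates the predecessor; $([+] ...) is a native add
// node; each "&" line is a redex, and the "&!" redexes are the two
// recursive calls, which the parallel evaluator is free to fork.)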
@fun = (?((@fun__C0 @fun__C1) a) a)
@fun__C0 = a
& @loop ~ (65536 a)
@fun__C1 = ({a b} d)
&! @fun ~ (a $([+] $(c d)))
&! @fun ~ (b c)
@loop = (?((0 @loop__C0) a) a)
@loop__C0 = a
& @loop ~ a
@main = a
& @fun ~ (16 a)
- Low-level compiler and codegen (coming soon!)
// Run with C mode:
// $ hvm3 run stress.hvm3 -c -s
@loop(n) = ~n {
0: 0
p: @loop(p)
}
@fun(n) = ~n{
0: @loop(65536)
p: !&0{p0 p1}=p (+ @fun(p0) @fun(p1))
}
@main = @fun(16)
Note, again, that this is a very simple test to highlight the key milestones. We aim to launch a full benchmark suite after the release of the compiler, and I believe the results will speak for themselves and make everyone see why interaction nets can be so efficient: they're the leanest model of computation, and, surprisingly enough, the fewer things a computer does, the faster it runs.
In this test, we implemented a prime number factorizer with a catch: it uses no machine ints, only algebraic datatypes, pattern-matching, and recursion. Of course, this is a terrible way to factor numbers; the goal here is to test each runtime as a symbolic computing engine. It is a very simple test on which HVM already shows a ~20x speedup over GHC, one of the fastest functional compilers in the world. Note that this speedup comes neither from parallelism nor from λ-encodings; it is due purely to optimal looping (via superposition nodes). We believe larger speedups will be seen in more complex algorithms.
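All four versions below share the same bit-level representation: a Bin stores its bits least-significant first. A tiny self-contained Haskell illustration of the encoding (mirroring the `bin`/`u32` helpers each version defines):
-- Worked example of the LSB-first encoding: 6 = 0b110
-- encodes as O (I (I (O E))).
data Bin = O Bin | I Bin | E deriving Show

bin :: Int -> Int -> Bin
bin 0 _ = E
bin s n = (if even n then O else I) (bin (s - 1) (n `div` 2))

u32 :: Bin -> Int
u32 (O p) = 2 * u32 p
u32 (I p) = 2 * u32 p + 1
u32 E     = 0

main :: IO ()
main = do
  print (bin 4 6)        -- O (I (I (O E)))
  print (u32 (bin 4 6))  -- 6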
# COMMAND: python main.py
# ADT constructors using classes for faster pattern matching
class O:
def __init__(self, p):
self.p = p
self.tag = 0 # Using integers for faster comparison
class I:
def __init__(self, p):
self.p = p
self.tag = 1
class E:
def __init__(self):
self.tag = 2
E = E() # Single instance for empty
def u32(b):
if b.tag == 0: # O
return 2 * u32(b.p) + 0
elif b.tag == 1: # I
return 2 * u32(b.p) + 1
else: # E
return 0
def bin(s, n):
if s == 0:
return E
return O(bin(s-1, n//2)) if n % 2 == 0 else I(bin(s-1, n//2))
def eq(a, b):
if a.tag == 2 and b.tag == 2:
return True
if a.tag == 0 and b.tag == 0:
return eq(a.p, b.p)
if a.tag == 1 and b.tag == 1:
return eq(a.p, b.p)
return False
def inc(b):
if b.tag == 0:
return I(b.p)
elif b.tag == 1:
return O(inc(b.p))
else:
return E
def add(a, b):
if a.tag == 2 or b.tag == 2:
return E
if a.tag == 0 and b.tag == 0:
return O(add(a.p, b.p))
if a.tag == 0 and b.tag == 1:
return I(add(a.p, b.p))
if a.tag == 1 and b.tag == 0:
return I(add(a.p, b.p))
if a.tag == 1 and b.tag == 1:
return O(inc(add(a.p, b.p)))
def mul(a, b):
if b.tag == 2:
return E
if b.tag == 0:
return O(mul(a, b.p))
return add(a, O(mul(a, b.p)))
def cat(a, b):
if a.tag == 0:
return O(cat(a.p, b))
elif a.tag == 1:
return I(cat(a.p, b))
else:
return b
#k = 12
#h = 13
#s = 26
#p = I(O(I(I(O(O(O(O(I(I(I(O(O(O(I(O(I(I(I(I(I(I(O(I(O(I(E))))))))))))))))))))))))))
k = 13
h = 14
s = 28
p = I(I(O(I(O(I(I(O(I(I(I(O(O(I(I(O(I(O(O(O(I(O(O(I(I(O(O(I(E))))))))))))))))))))))))))))
import time
def main():
start = time.time()
for a in range(2 ** h):
for b in range(2 ** h):
binA = cat(bin(h, a), bin(h, 0))
binB = cat(bin(h, b), bin(h, 0))
if eq(mul(binA, binB), p):
duration = time.time() - start
print(f"FACT: {a} {b}")
print(f"TIME: {duration:.7f} seconds")
return
if __name__ == "__main__":
main()
// COMMAND: bun run prime.js
// ADT constructors
const O = p => ({$:"O", p})
const I = p => ({$:"I", p})
const E = {$:"E"}
const u32 = b => {
switch(b.$) {
case "O": return 2 * u32(b.p) + 0
case "I": return 2 * u32(b.p) + 1
case "E": return 0
}
}
const bin = (s, n) => {
if (s === 0) return E
return n % 2 === 0
? O(bin(s-1, Math.floor(n/2)))
: I(bin(s-1, Math.floor(n/2)))
}
const eq = (a, b) => {
if (a.$ === "E" && b.$ === "E") return true
if (a.$ === "O" && b.$ === "O") return eq(a.p, b.p)
if (a.$ === "I" && b.$ === "I") return eq(a.p, b.p)
return false
}
const inc = b => {
switch(b.$) {
case "O": return I(b.p)
case "I": return O(inc(b.p))
case "E": return E
}
}
const add = (a, b) => {
if (a.$ === "E" || b.$ === "E") return E
if (a.$ === "O" && b.$ === "O") return O(add(a.p, b.p))
if (a.$ === "O" && b.$ === "I") return I(add(a.p, b.p))
if (a.$ === "I" && b.$ === "O") return I(add(a.p, b.p))
if (a.$ === "I" && b.$ === "I") return O(inc(add(a.p, b.p)))
}
const mul = (a, b) => {
if (b.$ === "E") return E
if (b.$ === "O") return O(mul(a, b.p))
if (b.$ === "I") return add(a, O(mul(a, b.p)))
}
const cat = (a, b) => {
switch(a.$) {
case "O": return O(cat(a.p, b))
case "I": return I(cat(a.p, b))
case "E": return b
}
}
//const k = 12
//const h = 13
//const s = 26
//const p = I(O(I(I(O(O(O(O(I(I(I(O(O(O(I(O(I(I(I(I(I(I(O(I(O(I(E))))))))))))))))))))))))))
const k = 13
const h = 14
const s = 28
const p = I(I(O(I(O(I(I(O(I(I(I(O(O(I(I(O(I(O(O(O(I(O(O(I(I(O(O(I(E))))))))))))))))))))))))))))
async function main() {
const start = Date.now()
for (let a = 0; a < Math.pow(2, h); a++) {
for (let b = 0; b < Math.pow(2, h); b++) {
const binA = cat(bin(h, a), bin(h, 0))
const binB = cat(bin(h, b), bin(h, 0))
if (eq(mul(binA, binB), p)) {
const duration = (Date.now() - start) / 1000
console.log(`FACT: ${a} ${b}`)
console.log(`TIME: ${duration.toFixed(7)} seconds`)
process.exit(0)
}
}
}
}
main()
// COMMAND: ghc -O2 prime.hs -o prime; ./prime
import Control.Monad (forM_, when)
import Data.Time.Clock (getCurrentTime, diffUTCTime)
import System.Exit (exitSuccess)
import Text.Printf (printf)
data Bin = O Bin | I Bin | E deriving (Show, Eq)
u32 :: Bin -> Int
u32 (O p) = 2 * u32 p + 0
u32 (I p) = 2 * u32 p + 1
u32 E = 0
bin :: Int -> Int -> Bin
bin 0 _ = E
bin s n = case n `mod` 2 of
0 -> O (bin (s-1) (n `div` 2))
_ -> I (bin (s-1) (n `div` 2))
eq :: Bin -> Bin -> Bool
eq E E = True
eq (O a) (O b) = eq a b
eq (I a) (I b) = eq a b
eq _ _ = False
inc :: Bin -> Bin
inc (O p) = I p
inc (I p) = O (inc p)
inc E = E
add :: Bin -> Bin -> Bin
add (O a) (O b) = O (add a b)
add (O a) (I b) = I (add a b)
add (I a) (O b) = I (add a b)
add (I a) (I b) = O (inc (add a b))
add E _ = E
add _ E = E
mul :: Bin -> Bin -> Bin
mul _ E = E
mul a (O b) = O (mul a b)
mul a (I b) = add a (O (mul a b))
cat :: Bin -> Bin -> Bin
cat (O a) b = O (cat a b)
cat (I a) b = I (cat a b)
cat E b = b
-- k = 12
-- h = 13
-- s = 26
-- p = I(O(I(I(O(O(O(O(I(I(I(O(O(O(I(O(I(I(I(I(I(I(O(I(O(I(E))))))))))))))))))))))))))
k = 13
h = 14
s = 28
p = I(I(O(I(O(I(I(O(I(I(I(O(O(I(I(O(I(O(O(O(I(O(O(I(I(O(O(I(E))))))))))))))))))))))))))))
main :: IO ()
main = do
start <- getCurrentTime
forM_ [0..2^h-1] $ \a -> do
forM_ [0..2^h-1] $ \b -> do
let binA = cat (bin h a) (bin h 0)
let binB = cat (bin h b) (bin h 0)
when (eq (mul binA binB) p) $ do
end <- getCurrentTime
let duration = diffUTCTime end start
putStrLn $ "FACT: " ++ show a ++ " " ++ show b
putStrLn $ "TIME: " ++ printf "%.7f seconds" (realToFrac duration :: Double)
exitSuccess
// COMMAND: hvml run main.hvml -s -S -c
// Bitstrings
data Bin { #O{pred} #I{pred} #E }
// If-Then-Else
@if(b t f) = ~b {
0: f
k: t
}
// Converts a Bin to a U32
@u32(b) = ~b{
#O{p}: (+ (* 2 @u32(p)) 0)
#I{p}: (+ (* 2 @u32(p)) 1)
#E: 0
}
// Converts a U32 to a Bin of given size
@bin(s n) = ~s{
0: #E
p: !&0{n0 n1}=n ~(% n0 2) !p !n1 {
0: #O{@bin(p (/ n1 2))}
k: #I{@bin(p (/ n1 2))}
}
}
// Bin Equality
@eq(a b) = ~a !b {
#E: ~b {
#O{bp}: 0
#I{bp}: 0
#E: 1
}
#O{ap}: ~b{
#O{bp}: @eq(ap bp)
#I{bp}: 0
#E: 0
}
#I{ap}: ~b{
#O{bp}: 0
#I{bp}: @eq(ap bp)
#E: 0
}
}
// Increments a Bin
@inc(a) = ~a{
#O{p}: #I{p}
#I{p}: #O{@inc(p)}
#E: #E
}
// Decrements a Bin
@dec(a) = ~a{
#O{p}: #O{@dec(p)}
#I{p}: #I{p}
#E: #E
}
// Adds two Bins
@add(a b) = ~a !b {
#O{ap}: ~b !ap {
#O{bp}: #O{@add(ap bp)}
#I{bp}: #I{@add(ap bp)}
#E: #E
}
#I{ap}: ~b !ap {
#O{bp}: #I{@add(ap bp)}
#I{bp}: #O{@inc(@add(ap bp))}
#E: #E
}
#E: #E
}
// Multiplies two Bins
@mul(a b) = ~b !a {
#O{bp}: #O{@mul(a bp)}
#I{bp}: !&0{a0 a1}=a @add(a0 #O{@mul(a1 bp)})
#E: #E
}
// Concatenates two Bins
@cat(a b) = ~a !b {
#O{ap}: #O{@cat(ap b)}
#I{ap}: #I{@cat(ap b)}
#E: b
}
// Enumerates all Bins of a given size (label 1)
@all1(s) = ~s{
0: #E
p: !&1{p0 p1}=p &1{ #O{@all1(p0)} #I{@all1(p1)} }
}
// Enumerates all Bins of a given size (label 2)
@all2(s) = ~s{
0: #E
p: !&2{p0 p1}=p &2{ #O{@all2(p0)} #I{@all2(p1)} }
}
//@K = 12
//@H = 13
//@S = 26
//@X = @cat(@all1(@H) @bin(@H 0))
//@Y = @cat(@all2(@H) @bin(@H 0))
//@P = #1{#0{#1{#1{#0{#0{#0{#0{#1{#1{#1{#0{#0{#0{#1{#0{#1{#1{#1{#1{#1{#1{#0{#1{#0{#1{#2}}}}}}}}}}}}}}}}}}}}}}}}}}
@K = 13
@H = 14
@S = 28
@X = @cat(@all1(@H) @bin(@H 0))
@Y = @cat(@all2(@H) @bin(@H 0))
@P = #1{#1{#0{#1{#0{#1{#1{#0{#1{#1{#1{#0{#0{#1{#1{#0{#1{#0{#0{#0{#1{#0{#0{#1{#1{#0{#0{#1{#2}}}}}}}}}}}}}}}}}}}}}}}}}}}}
@main = @if(@eq(@mul(@X @Y) @P) λt(t @u32(@X) @u32(@Y)) *)
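For intuition on where the speedup comes from: @all1 and @all2 build superpositions of every bitstring of a given size, and the distinct labels (&1 vs &2) keep @X's choices independent from @Y's, so @main effectively tests all 2^H x 2^H candidate pairs. As a rough analogy only (HVM shares work across branches in one superposed structure rather than materializing each value), @all1 denotes what this Haskell enumerator produces:
data Bin = O Bin | I Bin | E

-- What @all1/@all2 denote: every Bin of the given size. HVM keeps
-- these superposed in one shared structure instead of building a
-- list of 2^s separate values like this does.
allBins :: Int -> [Bin]
allBins 0 = [E]
allBins s = [c rest | c <- [O, I], rest <- allBins (s - 1)]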
This is a new test that I just implemented today (Nov 28, 2024). The goal here was to solve a simple higher-order equation:
(?X λt(t 1 2)) == λt(t 2 1)
I implemented it in 3 different ways:
- Enumerate BLC strings, parse and test (Haskell): 0.992s
- Superpose BLC strings, parse and test (HVM): 0.026s (38x speedup)
- Superpose λ-Terms directly and test (HVM): 0.0011s (862x speedup)
The 3rd approach is implemented and described in this file.
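For concreteness, one solution to the equation is the pair-swap function λp. λt. (p λa. λb. (t b a)). Below is a minimal Haskell sketch (my own reconstruction, not the code from this file, with Church numerals standing in for the literals 1 and 2) that verifies this candidate by normalization; the three approaches above search for such an X instead of checking a known one:
-- De Bruijn λ-terms with a normalizer, used to check that applying
-- `swap` to the pair λt(t 1 2) normalizes to λt(t 2 1).
data Term = Var Int | Lam Term | App Term Term deriving Eq

shift :: Int -> Int -> Term -> Term
shift c k (Var i)   = Var (if i >= c then i + k else i)
shift c k (Lam b)   = Lam (shift (c + 1) k b)
shift c k (App f a) = App (shift c k f) (shift c k a)

subst :: Int -> Term -> Term -> Term
subst j s (Var i)
  | i == j    = s
  | i >  j    = Var (i - 1)
  | otherwise = Var i
subst j s (Lam b)   = Lam (subst (j + 1) (shift 0 1 s) b)
subst j s (App f a) = App (subst j s f) (subst j s a)

-- Full normalization (fine here: every term involved terminates).
nf :: Term -> Term
nf (Var i)   = Var i
nf (Lam b)   = Lam (nf b)
nf (App f a) = case nf f of
  Lam b -> nf (subst 0 a b)
  f'    -> App f' (nf a)

church :: Int -> Term  -- Church numeral, standing in for 1 and 2
church n = Lam (Lam (iterate (App (Var 1)) (Var 0) !! n))

pair :: Term -> Term -> Term  -- λt.(t a b)
pair a b = Lam (App (App (Var 0) (shift 0 1 a)) (shift 0 1 b))

swap :: Term  -- λp. λt. (p λa. λb. (t b a))
swap = Lam (Lam (App (Var 1) (Lam (Lam (App (App (Var 2) (Var 0)) (Var 1))))))

main :: IO ()
main = print (nf (App swap (pair (church 1) (church 2)))
           == nf (pair (church 2) (church 1)))  -- True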
Ah gotcha, I updated the fork.
I'm currently trying to learn just enough about lambda calculus and interaction nets to see how well they might be able to navigate low-level optimisation constraints such as cache line alignment, cache misses and branch prediction, and, of course, make appropriate use of SIMD intrinsics (there are over 30,000 between x86 and ARM combined). Between naive and optimal C++ you can often realise an order-of-magnitude performance gain by accounting for these, but it takes a lot of time and thinking from a skilled human to pull it off. AI really struggles at it, and so do compilers like LLVM.
SIMD parallelism on the CPU compares very favourably to GPU parallelism for fewer than 1000 elements, since such loops can usually complete within the time it takes to schedule a GPU kernel (allocating SMs, memory, etc.). With non-unified memory, where there's additional latency introduced by the RAM/VRAM interconnect, the balance tilts further towards the CPU. So for optimal performance you'd likely need to explore hybrids of CPU/GPU compute. Oh, and I totally forgot about the low-level constraints to satisfy on the GPU: thread warps, bank conflicts, and such.