TLDR: I understand what Bend proposes, but when the efficiency loss is so large that an RTX 4090, in a near-optimal scenario, is only 7x faster than 2 cores of an M3 Max running a language like JavaScript, you should probably not take the trade-off.
On the other hand, languages such as OCaml, where parallelism is easy to use but opt-in, exist and can compete, so you should likely pick those instead. If you need even more performance, Rust could likely beat the RTX 4090 results on a mobile CPU.
Of course, future optimizations should improve Bend's results, but my goal here is to show that the current results are not as impressive as they may look. A JIT would likely make the RTX 4090 results 10x faster, but an RTX 4090 still draws at least 100 times more power than a single M3 core at any instant. Additionally, GPUs are in principle better suited to purely parallel tasks.
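To put those figures side by side, here is a rough back-of-the-envelope sketch in Rust. It only uses the ratios quoted above (7x speedup over 2 cores, a guessed 10x from a JIT, at least 100x the power of one core). The helper `relative_energy` is a hypothetical name, and the idea that 2 cores draw twice the power of 1 core is my own simplifying assumption, not a measurement.

```rust
/// Energy the GPU spends on one run, relative to the CPU baseline.
/// `speedup`: how many times faster the GPU finishes the task.
/// `power_ratio`: how many times more power the GPU draws than the baseline.
fn relative_energy(speedup: f64, power_ratio: f64) -> f64 {
    // energy = power * time, and time scales as 1 / speedup
    power_ratio / speedup
}

fn main() {
    // From the text: at least 100x the power of a single M3 core.
    // Assumption: 2 cores draw twice the power of 1 core, so at least 50x.
    let power_vs_two_cores = 100.0 / 2.0;

    // From the text: 7x faster than 2 cores today.
    let today = relative_energy(7.0, power_vs_two_cores);

    // From the text (a guess, not a measurement): a JIT gives another 10x.
    let with_jit = relative_energy(7.0 * 10.0, power_vs_two_cores);

    println!("today: >= {today:.1}x the energy of the CPU run");
    println!("with a 10x JIT: >= {with_jit:.2}x the energy of the CPU run");
}
```

Under these assumptions the GPU run costs at least about 7x the energy of the 2-core run today, and at least about 0.7x with the hypothetical JIT. Since the power figure is a lower bound, both numbers are lower bounds too.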
Also, this benchmark example is very parallelism-friendly, which counts both against Bend and in its favour: most real code is not pure and is not a purely binary algorithm.
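For readers wondering what a "purely binary" and pure workload looks like, here is a minimal sketch in Rust. It is not the benchmark code itself. `tree_sum` and `par_tree_sum` are hypothetical names, and forking with `std::thread::scope` is just one way to show opt-in parallelism on such a shape.

```rust
use std::thread;

/// Hypothetical workload: sum every integer in `lo..hi` by splitting the
/// range in half at each step, so the recursion forms a pure binary tree.
/// Callers are expected to pass `lo <= hi`.
fn tree_sum(lo: u64, hi: u64) -> u64 {
    if hi - lo <= 1 {
        // Leaf: the range holds zero or one element.
        return if hi > lo { lo } else { 0 };
    }
    let mid = lo + (hi - lo) / 2;
    tree_sum(lo, mid) + tree_sum(mid, hi)
}

/// Same computation, but the top `par_depth` levels of the tree fork their
/// left half onto a scoped thread. Parallelism is opt-in: with
/// `par_depth == 0` this is just the sequential version.
fn par_tree_sum(lo: u64, hi: u64, par_depth: u32) -> u64 {
    if par_depth == 0 || hi - lo <= 1 {
        return tree_sum(lo, hi);
    }
    let mid = lo + (hi - lo) / 2;
    thread::scope(|s| {
        // The two halves share no state, so they can run independently.
        let left = s.spawn(move || par_tree_sum(lo, mid, par_depth - 1));
        let right = par_tree_sum(mid, hi, par_depth - 1);
        left.join().expect("worker thread panicked") + right
    })
}

fn main() {
    let n: u64 = 1 << 20;
    // par_depth = 1 splits the work across two threads.
    let total = par_tree_sum(0, n, 1);
    assert_eq!(total, n * (n - 1) / 2);
    println!("{total}");
}
```

Both halves are independent and equally sized at every level, which is about as friendly as a workload gets for any parallel runtime. Code with shared state, I/O, or uneven branches does not split this cleanly.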