early Bao video subtitles
[Script Info]
; Script generated by Aegisub 3.2.2
; http://www.aegisub.org/
Title: Default Aegisub file
ScriptType: v4.00+
WrapStyle: 0
ScaledBorderAndShadow: yes
YCbCr Matrix: TV.601
PlayResX: 1280
PlayResY: 720

[Aegisub Project Garbage]
Audio File: /home/jacko/Downloads/yt5s.com-BLAKE2b and the bao tree mode SIMD and multithreaded hashing in Rust.mp4
Video File: /home/jacko/Downloads/yt5s.com-BLAKE2b and the bao tree mode SIMD and multithreaded hashing in Rust.mp4
Video AR Mode: 4
Video AR Value: 1.777778
Video Zoom Percent: 0.250000
Scroll Position: 18
Active Line: 24
Video Position: 4341

[V4+ Styles]
Format: Name, Fontname, Fontsize, PrimaryColour, SecondaryColour, OutlineColour, BackColour, Bold, Italic, Underline, StrikeOut, ScaleX, ScaleY, Spacing, Angle, BorderStyle, Outline, Shadow, Alignment, MarginL, MarginR, MarginV, Encoding
Style: Default,Arial,50,&H00FFFFFF,&H000000FF,&H00000000,&H00000000,0,0,0,0,100,100,0,0,1,2,2,2,10,10,10,1

[Events]
Format: Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text
Dialogue: 0,0:00:00.86,0:00:05.26,Default,,0,0,0,,Alright, great, so...
Dialogue: 0,0:00:05.26,0:00:06.08,Default,,0,0,0,,Who hashes stuff?
Dialogue: 0,0:00:06.08,0:00:09.92,Default,,0,0,0,,Anybody ever use like `md5sum`, `sha1sum`, those things on the command line?
Dialogue: 0,0:00:09.92,0:00:14.94,Default,,0,0,0,,Sometimes we use `md5sum`, but we probably shouldn't. Sometimes we use `sha1sum`, but we probably shouldn't.
Dialogue: 0,0:00:14.94,0:00:21.81,Default,,0,0,0,,You may have heard that SHA-1 was broken, about two years ago now, because time moves really fast, by some very expensive machines at Google.
Dialogue: 0,0:00:21.81,0:00:26.32,Default,,0,0,0,,Those are designed to be very fast. And it turns out that using Rust, we can make things that are even faster.
Dialogue: 0,0:00:26.32,0:00:34.40,Default,,0,0,0,,So let's look real quick. `md5sum` of a file "f"...
Dialogue: 0,0:00:34.40,0:00:38.44,Default,,0,0,0,,Always do this twice, because you need to make sure it's in memory.
Dialogue: 0,0:00:38.44,0:00:46.03,Default,,0,0,0,,Right, so that's `md5sum`, that's a benchmark there. "f" is, let's see here, almost exactly 1 GB in size. It's just full of random data.
Dialogue: 0,0:00:46.03,0:00:54.67,Default,,0,0,0,,So that's `md5sum`. We also have `sha512sum`. That's probably the one we should be using if we're doing anything important.
Dialogue: 0,0:00:54.67,0:00:57.55,Default,,0,0,0,,Which is a little slow, which is kind of depressing.
Dialogue: 0,0:00:57.55,0:01:06.14,Default,,0,0,0,,It turns out the fastest guy on the block right now is still `sha1sum`, unfortunately, which clocks in at 1.3-ish.
Dialogue: 0,0:01:06.14,0:01:12.52,Default,,0,0,0,,But I always -- sorry, this isn't necessarily clear -- we're looking at this number over here. That's the actual time that passed.
Dialogue: 0,0:01:12.52,0:01:20.56,Default,,0,0,0,,So 1.3, that's pretty cool. But we shouldn't be using it anymore, because it's a broken hash function. So what can we use that's modern, that might actually conceivably be faster?
Dialogue: 0,0:01:20.56,0:01:27.50,Default,,0,0,0,,Some of you guys might've heard of a hash function called BLAKE2. Anybody raise their hand who's actually heard that name before?
Dialogue: 0,0:01:27.50,0:01:41.52,Default,,0,0,0,,There is a BLAKE2 implementation on most people's machines, if you have like GNU Coreutils. So let me try that. `/usr/bin/b2sum`...live coding...
Dialogue: 0,0:01:41.52,0:01:43.15,Default,,0,0,0,,And unfortunately that was a little slow.
Dialogue: 0,0:01:43.15,0:01:45.24,Default,,0,0,0,,Despite being a modern hash function it's a little slow.
Dialogue: 0,0:01:45.24,0:01:52.52,Default,,0,0,0,,The reason it's slow [by comparison] is that `sha1sum`, under the covers, is linking against OpenSSL, with a very fancy implementation of SHA-1.
Dialogue: 0,0:01:52.52,0:01:56.41,Default,,0,0,0,,GNU Coreutils is just using portable C code, which is very clean, but it's not quite as fast.
Dialogue: 0,0:01:56.41,0:02:03.52,Default,,0,0,0,,It turns out that using Rust's just recently stabilized "SIMD intrinsics", we can get quite a lot faster.
Dialogue: 0,0:02:03.52,0:02:14.00,Default,,0,0,0,,So if I run `b2sum` directly -- that's actually a binary that I've compiled on my machine...
Dialogue: 0,0:02:14.00,0:02:22.08,Default,,0,0,0,,Boom! We beat SHA-1. We beat SHA-1 using a set of intrinsics called AVX2. Has anybody heard the name "AVX2" before?
Dialogue: 0,0:02:22.08,0:02:24.70,Default,,0,0,0,,This should be even fewer than "BLAKE2"... Maybe some? Cool.
Dialogue: 0,0:02:24.70,0:02:33.77,Default,,0,0,0,,So, real quick -- I have so little time -- the Intel people have been spending a lot of time making processors faster, and they've been spending a lot of time making cores to run multiple threads.
Dialogue: 0,0:02:33.77,0:02:38.00,Default,,0,0,0,,But also they've been adding new instructions, so we can do things that the old processors couldn't do.
Dialogue: 0,0:02:38.00,0:02:41.15,Default,,0,0,0,,Now if you use those instructions, your program's not compatible with old machines.
Dialogue: 0,0:02:41.15,0:02:43.32,Default,,0,0,0,,So generally compilers don't emit them by default.
Dialogue: 0,0:02:43.32,0:02:48.78,Default,,0,0,0,,But if you tell your compiler to emit them, it will. And then you have to be very careful, cause you're doing unsafe things.
Dialogue: 0,0:02:48.78,0:02:55.48,Default,,0,0,0,,So here is a bunch of crap. And what it means doesn't really matter. It's shuffling bytes around. It's a hash function, it's shuffling bytes around, we all know.
Dialogue: 0,0:02:55.48,0:02:57.63,Default,,0,0,0,,But this is in Rust, of course.
Dialogue: 0,0:02:57.63,0:03:06.86,Default,,0,0,0,,And we see this cute little thing here, which is telling the compiler that, regardless of what platform you thought you were compiling for, in this function, use those AVX2 instructions.
Dialogue: 0,0:03:06.86,0:03:11.24,Default,,0,0,0,,You know, multiple integers at a time, loading and storing, that kind of stuff.
Dialogue: 0,0:03:11.24,0:03:16.70,Default,,0,0,0,,I did not invent this implementation. I ported it from C. So they, the actual authors, get credit for the performance.
Dialogue: 0,0:03:16.70,0:03:27.32,Default,,0,0,0,,So doing this says the compiler will emit these instructions for all targets, and then we're responsible for never ever calling this function when it's unsafe. So...
Dialogue: 0,0:03:27.32,0:03:31.53,Default,,0,0,0,,Down here, we see what we have to do to make this sort of stuff useful.
Dialogue: 0,0:03:31.53,0:03:35.21,Default,,0,0,0,,So, this is if you're sort of statically targeting modern processors, which is rare.
Dialogue: 0,0:03:35.21,0:03:39.53,Default,,0,0,0,,But here, we see that we're at runtime asking the processor,
Dialogue: 0,0:03:39.53,0:03:43.58,Default,,0,0,0,,"What instructions do you support?" If this processor supports AVX2, we'll call this function.
Dialogue: 0,0:03:43.58,0:03:47.76,Default,,0,0,0,,If I replace this with `if true`, this would hopefully crash.
Dialogue: 0,0:03:47.76,0:03:52.27,Default,,0,0,0,,Probably do something worse, on a machine that's older than three years. So this part's pretty important.
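
(A minimal sketch of the pattern described above, in Rust: one function compiled with AVX2 enabled regardless of the build target, plus a runtime CPU check that decides whether it's safe to call. The names and signatures here are illustrative, not the actual `blake2b_simd` internals.)

```rust
// Compiled with AVX2 enabled even if the overall target doesn't have it.
// Calling this on a CPU without AVX2 is undefined behavior, hence `unsafe`.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "avx2")]
unsafe fn compress_avx2(state: &mut [u64; 8], block: &[u8; 128]) {
    // The real body is full of core::arch AVX2 intrinsics shuffling bytes around.
    let _ = (state, block);
}

fn compress_portable(state: &mut [u64; 8], block: &[u8; 128]) {
    // Portable fallback that works everywhere.
    let _ = (state, block);
}

fn compress(state: &mut [u64; 8], block: &[u8; 128]) {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("avx2") {
            // Only safe because we just asked the CPU whether it supports AVX2.
            // Replacing this check with `if true` would be UB on older machines.
            return unsafe { compress_avx2(state, block) };
        }
    }
    compress_portable(state, block)
}
```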
Dialogue: 0,0:03:52.27,0:03:56.67,Default,,0,0,0,,So that's how you beat `sha1sum`, using pure, stable Rust.
Dialogue: 0,0:03:56.67,0:03:58.67,Default,,0,0,0,,And that works now.
Dialogue: 0,0:03:58.67,0:04:00.67,Default,,0,0,0,,If we want to get even faster than this,
Dialogue: 0,0:04:00.67,0:04:02.14,Default,,0,0,0,,we have to start using multiple threads.
Dialogue: 0,0:04:02.14,0:04:05.32,Default,,0,0,0,,So, a standard hash function is simply not designed to run on multiple different threads.
Dialogue: 0,0:04:05.32,0:04:08.44,Default,,0,0,0,,Each step depends on the last step.
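
(Why that's true, as a schematic: the loop below is not BLAKE2's real compression, just an illustration of the data dependency that blocks parallelism.)

```rust
// Schematic of a conventional serial hash: every block's compression takes the
// state produced by the previous block, so block N can't start before block
// N-1 finishes. Illustrative only.
fn hash_serial(input: &[u8]) -> [u64; 8] {
    let mut state = [0u64; 8]; // stand-in for the real initialization vector
    for block in input.chunks(128) {
        state = toy_compress(state, block); // strict dependency on the previous state
    }
    state
}

fn toy_compress(state: [u64; 8], block: &[u8]) -> [u64; 8] {
    // Placeholder mixing step, just so the example runs.
    let mut out = state;
    for (i, &b) in block.iter().enumerate() {
        out[i % 8] = out[i % 8].wrapping_mul(31).wrapping_add(u64::from(b));
    }
    out
}
```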
Dialogue: 0,0:04:08.44,0:04:16.01,Default,,0,0,0,,So as a...so we can see that this implementation is of course producing the same result.
Dialogue: 0,0:04:16.01,0:04:18.99,Default,,0,0,0,,4e2...something something, as the portable C.
Dialogue: 0,0:04:18.99,0:04:21.80,Default,,0,0,0,,That's good, because it has the same name, so it better produce the same result.
Dialogue: 0,0:04:21.80,0:04:27.93,Default,,0,0,0,,But if we wanted, if we were willing to use a different construction designed for multiple threads, we could do things faster. So let's try that.
Dialogue: 0,0:04:27.93,0:04:29.66,Default,,0,0,0,,This utility implements it.
Dialogue: 0,0:04:29.66,0:04:32.75,Default,,0,0,0,,The BLAKE2 standard includes such a construction.
Dialogue: 0,0:04:32.75,0:04:38.72,Default,,0,0,0,,[inaudible] called BLAKE2bp,
Dialogue: 0,0:04:38.72,0:04:39.82,Default,,0,0,0,,which is quite fast.
Dialogue: 0,0:04:39.82,0:04:42.14,Default,,0,0,0,,So that's going to fire up four threads.
Dialogue: 0,0:04:42.14,0:04:46.30,Default,,0,0,0,,It's going to chunk the input into four groups.
Dialogue: 0,0:04:46.30,0:04:52.17,Default,,0,0,0,,So the idea is like a block size of, BLAKE2[b] uses 128 bytes if I remember correctly,
Dialogue: 0,0:04:52.17,0:04:55.64,Default,,0,0,0,,So the first 128 bytes is block 0, block 1, 2, 3, and 4.
Dialogue: 0,0:04:55.64,0:04:58.68,Default,,0,0,0,,So then you get 0, goes with 4, goes with 8, goes with 12, that's on one thread.
Dialogue: 0,0:04:58.68,0:05:00.40,Default,,0,0,0,,And 1 goes with 5, and so on.
Dialogue: 0,0:05:00.40,0:05:02.70,Default,,0,0,0,,So you get four threads kind of running in parallel on the same input.
Dialogue: 0,0:05:02.70,0:05:06.46,Default,,0,0,0,,You get four hashes, hash those together, the result is the BLAKE2bp sum.
Dialogue: 0,0:05:06.46,0:05:09.72,Default,,0,0,0,,That plus a few other parameter tweaks, that's the definition of BLAKE2bp.
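
(A sketch of that interleaving. `blake2b` below is a placeholder, and real BLAKE2bp also tweaks parameters per lane; this only shows the block-to-lane assignment and the final combine.)

```rust
// Sketch of the BLAKE2bp shape described above: 128-byte blocks are dealt out
// round-robin across four lanes, each lane is hashed (on its own thread in the
// real thing), and the four lane hashes are hashed together for the final value.
fn blake2bp_like(input: &[u8]) -> [u8; 64] {
    let mut lanes: [Vec<u8>; 4] = Default::default();
    for (i, block) in input.chunks(128).enumerate() {
        lanes[i % 4].extend_from_slice(block); // block 0 -> lane 0, block 4 -> lane 0, ...
    }
    let mut root_input = Vec::new();
    for lane in &lanes {
        root_input.extend_from_slice(&blake2b(lane));
    }
    blake2b(&root_input)
}

fn blake2b(_data: &[u8]) -> [u8; 64] {
    [0u8; 64] // placeholder, not a real hash
}
```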
Dialogue: 0,0:05:09.72,0:05:13.00,Default,,0,0,0,,And then, it's not like a 4x speedup, unfortunately.
Dialogue: 0,0:05:13.00,0:05:16.80,Default,,0,0,0,,There are four physical cores on this machine. So we hope we get 4x, but it's [inaudible].
Dialogue: 0,0:05:16.80,0:05:19.23,Default,,0,0,0,,That's BLAKE2bp, and again that's slightly standard.
Dialogue: 0,0:05:19.23,0:05:23.28,Default,,0,0,0,,Not widely implemented in a lot of places, but in theory other software implements this.
Dialogue: 0,0:05:23.28,0:05:26.78,Default,,0,0,0,,If we want to get even faster than that, we have to start doing very interesting stuff.
Dialogue: 0,0:05:26.78,0:05:28.84,Default,,0,0,0,,Real quick, since we're on BLAKE2bp though,
Dialogue: 0,0:05:28.84,0:05:33.12,Default,,0,0,0,,let's look at the implementation of that.
Dialogue: 0,0:05:33.12,0:05:35.24,Default,,0,0,0,,I'm going to ignore most of the details.
Dialogue: 0,0:05:35.24,0:05:39.47,Default,,0,0,0,,
Dialogue: 0,0:05:39.47,0:05:43.23,Default,,0,0,0,,So, four of these go first in parallel.
Dialogue: 0,0:05:43.23,0:05:46.11,Default,,0,0,0,,One of these goes at the end, slightly different parameters, it's just a hash.
Dialogue: 0,0:05:46.11,0:05:49.31,Default,,0,0,0,,This is the interesting part, which I'll zoom in on.
Dialogue: 0,0:05:49.31,0:05:52.19,Default,,0,0,0,,This is Rayon. Who's used Rayon before?
Dialogue: 0,0:05:52.19,0:05:54.88,Default,,0,0,0,,It's surprisingly easy to use.
Dialogue: 0,0:05:54.88,0:06:03.72,Default,,0,0,0,,Niko...oh my god I should not even attempt to pronounce his last name...Matsakis, invented this early on, probably before 1.0.
Dialogue: 0,0:06:03.72,0:06:06.01,Default,,0,0,0,,It makes threading surprisingly easy.
Dialogue: 0,0:06:06.01,0:06:10.44,Default,,0,0,0,,`join` means, something on the left something on the right, run them on multiple threads...maybe.
Dialogue: 0,0:06:10.44,0:06:12.70,Default,,0,0,0,,And you'll notice it does some very very clever things.
Dialogue: 0,0:06:12.70,0:06:15.31,Default,,0,0,0,,If I scroll up...
Dialogue: 0,0:06:15.31,0:06:17.88,Default,,0,0,0,,`input` is a slice argument to this function, right.
Dialogue: 0,0:06:17.88,0:06:24.40,Default,,0,0,0,,Then this closure, the worker closure, is like chunking up input [inaudible] and hashing it.
Dialogue: 0,0:06:24.40,0:06:26.27,Default,,0,0,0,,And this is running that on multiple threads.
Dialogue: 0,0:06:26.27,0:06:32.86,Default,,0,0,0,,The type checker is already doing all the work to make sure that these threads are not going to take this reference that lives on this thread's stack
Dialogue: 0,0:06:32.86,0:06:37.58,Default,,0,0,0,,and then like outlast it. It's all safe, and all this stuff magically works. That's really cool.
Dialogue: 0,0:06:37.58,0:06:44.84,Default,,0,0,0,,And if the program happens to be using Rayon threads in other capacities, it's all the same global threadpool, so it kind of does a lot for you.
Dialogue: 0,0:06:44.84,0:06:50.01,Default,,0,0,0,,So this is just, do some things in parallel, surprisingly easy, and you get your answer.
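
(The shape of that `rayon::join` call, assuming `rayon` as a dependency. The worker below is a stand-in, not the code on screen.)

```rust
use rayon::join;

// Two closures, each borrowing a different half of the same input slice,
// potentially run in parallel on Rayon's global thread pool. The borrow
// checker guarantees the borrowed slices can't outlive the join() call,
// which is why no unsafe code is needed here.
fn hash_halves(input: &[u8]) -> (u64, u64) {
    let (left, right) = input.split_at(input.len() / 2);
    join(|| checksum(left), || checksum(right))
}

// Stand-in for the real per-thread worker; just sums bytes.
fn checksum(data: &[u8]) -> u64 {
    data.iter().map(|&b| u64::from(b)).sum()
}
```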
Dialogue: 0,0:06:50.01,0:06:55.26,Default,,0,0,0,,Again, if we want to do something even faster than 0.4 seconds,
Dialogue: 0,0:06:55.26,0:06:58.20,Default,,0,0,0,,we need to start defining custom hashing modes.
Dialogue: 0,0:06:58.20,0:07:04.28,Default,,0,0,0,,So the project that actually...led up to...all this talk...is called "Bao", currently. B-A-O.
Dialogue: 0,0:07:04.28,0:07:08.86,Default,,0,0,0,,It's a custom tree hashing mode. The details don't matter; it's a binary tree.
Dialogue: 0,0:07:08.86,0:07:12.94,Default,,0,0,0,,And if we run `bao hash` on "f"...
Dialogue: 0,0:07:12.94,0:07:15.02,Default,,0,0,0,,Maybe we want to know how fast it is...
Dialogue: 0,0:07:15.02,0:07:17.02,Default,,0,0,0,,Quite fast.
Dialogue: 0,0:07:17.02,0:07:19.52,Default,,0,0,0,,That is almost 4 gigabytes per second.
Dialogue: 0,0:07:19.52,0:07:22.16,Default,,0,0,0,,That is utilizing all the cores on the machine.
Dialogue: 0,0:07:22.16,0:07:26.03,Default,,0,0,0,,That is utilizing the AVX2 instructions in that BLAKE2 implementation we just saw.
Dialogue: 0,0:07:26.03,0:07:29.07,Default,,0,0,0,,If you look at the implementation of this...
Dialogue: 0,0:07:29.07,0:07:31.68,Default,,0,0,0,,This is the top-level hash function.
Dialogue: 0,0:07:31.68,0:07:36.30,Default,,0,0,0,,It doesn't make sense to use Rayon if you have like, a thousand bytes, cause there's a lot of overhead.
Dialogue: 0,0:07:36.30,0:07:40.88,Default,,0,0,0,,But if you have enough bytes, it calls into this recursive thing,
Dialogue: 0,0:07:40.88,0:07:42.30,Default,,0,0,0,,which again uses `rayon::join`.
Dialogue: 0,0:07:42.30,0:07:47.55,Default,,0,0,0,,This is a shockingly simple function for implementing, like, a four gigabyte per second hash.
Dialogue: 0,0:07:47.55,0:07:53.02,Default,,0,0,0,,It's...the main difference between this and BLAKE2bp is that this one is actually directly recursing on itself.
Dialogue: 0,0:07:53.02,0:07:56.35,Default,,0,0,0,,So we actually call `hash_recurse_rayon` inside of `hash_recurse_rayon`.
Dialogue: 0,0:07:56.35,0:08:01.64,Default,,0,0,0,,Before, with BLAKE2bp, it was just sort of a hardcoded set of four calls into a worker.
Dialogue: 0,0:08:01.64,0:08:05.80,Default,,0,0,0,,And this will parallelize an arbitrary number of threads on the way down.
Dialogue: 0,0:08:05.80,0:08:10.08,Default,,0,0,0,,`hash_node` is literally just the BLAKE2 hash. When you actually get to the bottom, you hash a leaf.
Dialogue: 0,0:08:10.08,0:08:12.40,Default,,0,0,0,,Then you hash everything together at the end.
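
(The recursive shape just described, sketched with made-up constants. `hash_recurse_rayon` and `hash_node` are the names from the talk, but this body is illustrative, not Bao's actual definition.)

```rust
use rayon::join;

// Sketch: small inputs are hashed directly as a leaf; larger inputs are split,
// both halves are hashed via rayon::join (recursing on this same function),
// and the two child hashes are hashed together as a parent node. The threshold,
// split rule, and hash_node body are illustrative, not Bao's real ones.
const LEAF_THRESHOLD: usize = 4096;

fn hash_recurse_rayon(input: &[u8]) -> Vec<u8> {
    if input.len() <= LEAF_THRESHOLD {
        return hash_node(input);
    }
    let (left, right) = input.split_at(input.len() / 2);
    let (left_hash, right_hash) = join(
        || hash_recurse_rayon(left),
        || hash_recurse_rayon(right),
    );
    let mut parent = left_hash;
    parent.extend_from_slice(&right_hash);
    hash_node(&parent) // parent node: hash the two child hashes together
}

fn hash_node(data: &[u8]) -> Vec<u8> {
    data.iter().take(32).copied().collect() // placeholder for a BLAKE2 call
}
```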
Dialogue: 0,0:08:12.40,0:08:14.49,Default,,0,0,0,,So, that's Bao. It's pretty fast.
Dialogue: 0,0:08:14.49,0:08:19.66,Default,,0,0,0,,If anybody has any questions about Rayon or SIMD instructions, let me know.
Dialogue: 0,0:08:19.66,0:08:28.14,Default,,0,0,0,,
Dialogue: 0,0:08:28.14,0:08:32.81,Default,,0,0,0,,So what's the name of the...how do you spell that...the hash?
Dialogue: 0,0:08:32.81,0:08:34.03,Default,,0,0,0,,The one at the end?
Dialogue: 0,0:08:34.03,0:08:35.32,Default,,0,0,0,,I think you said "Bao"?
Dialogue: 0,0:08:35.32,0:08:41.02,Default,,0,0,0,,B-A-O. I just put up the GitHub link, but it's too small for anyone to read.
Dialogue: 0,0:08:41.02,0:08:53.60,Default,,0,0,0,,
Dialogue: 0,0:08:53.60,0:08:56.30,Default,,0,0,0,,The other one was `blake2b_simd`.
Dialogue: 0,0:08:56.30,0:09:03.50,Default,,0,0,0,,This actually...this implements...the first one implements the standard BLAKE[2] hash function, which you might actually be using in other applications. It's just faster.
Dialogue: 0,0:09:03.50,0:09:04.92,Default,,0,0,0,,
Dialogue: 0,0:09:04.92,0:09:08.83,Default,,0,0,0,,The compress[ion] function you showed using SIMD instructions is marked unsafe.
Dialogue: 0,0:09:08.83,0:09:14.48,Default,,0,0,0,,What's [inaudible] SIMD intrinsics, those happen to be unsafe?
Dialogue: 0,0:09:14.48,0:09:15.28,Default,,0,0,0,,Those happen to be unsafe.
Dialogue: 0,0:09:15.28,0:09:17.80,Default,,0,0,0,,There's sort of two unsafe things about them.
Dialogue: 0,0:09:17.80,0:09:23.56,Default,,0,0,0,,One is that, if you just simply call them at all on a platform that doesn't support them, it's just by definition undefined behavior.
Dialogue: 0,0:09:23.56,0:09:25.26,Default,,0,0,0,,I have no idea what actually happens.
Dialogue: 0,0:09:25.26,0:09:31.80,Default,,0,0,0,,Another is that those...one of the intrinsics defined in those sets is usually some sort of wide load.
Dialogue: 0,0:09:31.80,0:09:37.29,Default,,0,0,0,,Like load a 256-bit integer, or something like that, and those usually require aligned pointers.
Dialogue: 0,0:09:37.29,0:09:39.74,Default,,0,0,0,,So if you have some pointer...
Dialogue: 0,0:09:39.74,0:09:42.33,Default,,0,0,0,,You run into this if you're doing pointer casts.
Dialogue: 0,0:09:42.33,0:09:46.49,Default,,0,0,0,,If you take a pointer to some bytes, and you cast it into a pointer to a 64-bit integer,
Dialogue: 0,0:09:46.49,0:09:52.48,Default,,0,0,0,,you might think all you need to do is a length check, and on x86 that might be true...sometimes.
Dialogue: 0,0:09:52.48,0:09:56.70,Default,,0,0,0,,But in fact what you've done, is you've got a problem with your unaligned pointer, and you've caused undefined behavior.
Dialogue: 0,0:09:56.70,0:10:01.68,Default,,0,0,0,,So that is a thing that tends to happen if you [inaudible].
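
(The hazard being described, in miniature; these helper functions are illustrative.)

```rust
// Casting a byte pointer to a wider type and dereferencing it is undefined
// behavior whenever the address isn't suitably aligned, even if the length
// check passes.
fn read_u64_bad(bytes: &[u8]) -> u64 {
    assert!(bytes.len() >= 8);
    unsafe { *(bytes.as_ptr() as *const u64) } // UB if the pointer isn't 8-byte aligned
}

// Two ways that are actually fine: an explicitly unaligned read, or copying
// into an array and converting.
fn read_u64_unaligned(bytes: &[u8]) -> u64 {
    assert!(bytes.len() >= 8);
    unsafe { (bytes.as_ptr() as *const u64).read_unaligned() }
}

fn read_u64_copy(bytes: &[u8]) -> u64 {
    let mut buf = [0u8; 8];
    buf.copy_from_slice(&bytes[..8]);
    u64::from_le_bytes(buf)
}
```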
Dialogue: 0,0:10:01.68,0:10:07.90,Default,,0,0,0,,Does Rayon act like actual threads, or does it do something with like scheduled threads?
Dialogue: 0,0:10:07.90,0:10:09.90,Default,,0,0,0,,Tell me more about the difference between those two things.
Dialogue: 0,0:10:09.90,0:10:13.95,Default,,0,0,0,,Like do you [inaudible] full operating system multiple threads where you're running it on...
Dialogue: 0,0:10:13.95,0:10:19.21,Default,,0,0,0,,Yes, under the covers you have an actual thread pool with actual threads.
Dialogue: 0,0:10:19.21,0:10:26.49,Default,,0,0,0,,It does a fair bit of logic to make sure you have, like, idle cores and stuff like that. I think it tries to be fairly intelligent.
Dialogue: 0,0:10:26.49,0:10:28.49,Default,,0,0,0,,It also tries to [inaudible] around whether...
Dialogue: 0,0:10:28.49,0:10:32.59,Default,,0,0,0,,Of course this tree, you could imagine you have like 2^30 branches.
Dialogue: 0,0:10:32.59,0:10:35.02,Default,,0,0,0,,You might not want to run them all on individual worker threads.
Dialogue: 0,0:10:35.02,0:10:37.76,Default,,0,0,0,,So it tries to [inaudible], but yes, real threads.
Dialogue: 0,0:10:37.76,0:10:39.21,Default,,0,0,0,,
Dialogue: 0,0:10:39.21,0:10:41.21,Default,,0,0,0,,Alright. Thanks guys.