early Bao video subtitles
[Script Info]
; Script generated by Aegisub 3.2.2
; http://www.aegisub.org/
Title: Default Aegisub file
ScriptType: v4.00+
WrapStyle: 0
ScaledBorderAndShadow: yes
YCbCr Matrix: TV.601
PlayResX: 1280
PlayResY: 720

[Aegisub Project Garbage]
Audio File: /home/jacko/Downloads/yt5s.com-BLAKE2b and the bao tree mode SIMD and multithreaded hashing in Rust.mp4
Video File: /home/jacko/Downloads/yt5s.com-BLAKE2b and the bao tree mode SIMD and multithreaded hashing in Rust.mp4
Video AR Mode: 4
Video AR Value: 1.777778
Video Zoom Percent: 0.250000
Scroll Position: 18
Active Line: 24
Video Position: 4341

[V4+ Styles]
Format: Name, Fontname, Fontsize, PrimaryColour, SecondaryColour, OutlineColour, BackColour, Bold, Italic, Underline, StrikeOut, ScaleX, ScaleY, Spacing, Angle, BorderStyle, Outline, Shadow, Alignment, MarginL, MarginR, MarginV, Encoding
Style: Default,Arial,50,&H00FFFFFF,&H000000FF,&H00000000,&H00000000,0,0,0,0,100,100,0,0,1,2,2,2,10,10,10,1

[Events]
Format: Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text
Dialogue: 0,0:00:00.86,0:00:05.26,Default,,0,0,0,,Alright, great, so...
Dialogue: 0,0:00:05.26,0:00:06.08,Default,,0,0,0,,Who hashes stuff?
Dialogue: 0,0:00:06.08,0:00:09.92,Default,,0,0,0,,Anybody ever use like `md5sum`, `sha1sum`, those things on the command line?
Dialogue: 0,0:00:09.92,0:00:14.94,Default,,0,0,0,,Sometimes we use `md5sum`, but we probably shouldn't. Sometimes we use `sha1sum`, but we probably shouldn't.
Dialogue: 0,0:00:14.94,0:00:21.81,Default,,0,0,0,,You may have heard that SHA-1 was broken, about two years ago now, because time moves really fast, by some very expensive machines at Google.
Dialogue: 0,0:00:21.81,0:00:26.32,Default,,0,0,0,,Those are designed to be very fast. And it turns out that using Rust, we can make things that are even faster.
Dialogue: 0,0:00:26.32,0:00:34.40,Default,,0,0,0,,So let's look real quick. `md5sum` of a file "f"...
Dialogue: 0,0:00:34.40,0:00:38.44,Default,,0,0,0,,Always do this twice, because you need to make sure it's in memory.
Dialogue: 0,0:00:38.44,0:00:46.03,Default,,0,0,0,,Right, so that's `md5sum`, that's a benchmark there. "f" is, let's see here, almost exactly 1 GB in size. It's just full of random data.
Dialogue: 0,0:00:46.03,0:00:54.67,Default,,0,0,0,,So that's `md5sum`. We also have `sha512sum`. That's probably the one we should be using if we're doing anything important.
Dialogue: 0,0:00:54.67,0:00:57.55,Default,,0,0,0,,Which is a little slow, which is kind of depressing.
Dialogue: 0,0:00:57.55,0:01:06.14,Default,,0,0,0,,It turns out the fastest guy on the block right now is still `sha1sum`, unfortunately, which clocks in at 1.3-ish.
Dialogue: 0,0:01:06.14,0:01:12.52,Default,,0,0,0,,But I always -- sorry, this isn't necessarily clear -- we're looking at this number over here. That's the actual time that passed.
Dialogue: 0,0:01:12.52,0:01:20.56,Default,,0,0,0,,So 1.3, that's pretty cool. But we shouldn't be using it anymore, because it's a broken hash function. So what can we use that's modern, that might actually conceivably be faster?
Dialogue: 0,0:01:20.56,0:01:27.50,Default,,0,0,0,,Some of you guys might've heard of a hash function called BLAKE2. Anybody raise their hand who's actually heard that name before?
Dialogue: 0,0:01:27.50,0:01:41.52,Default,,0,0,0,,There is a BLAKE2 implementation on most people's machines, if you have like GNU Coreutils. So let me try that. `/usr/bin/b2sum`...live coding...
Dialogue: 0,0:01:41.52,0:01:43.15,Default,,0,0,0,,And unfortunately that was a little slow.
Dialogue: 0,0:01:43.15,0:01:45.24,Default,,0,0,0,,Despite being a modern hash function it's a little slow.
Dialogue: 0,0:01:45.24,0:01:52.52,Default,,0,0,0,,The reason it's slow [by comparison] is that `sha1sum`, under the covers, is linking against OpenSSL, with a very fancy implementation of SHA-1.
Dialogue: 0,0:01:52.52,0:01:56.41,Default,,0,0,0,,GNU Coreutils is just using portable C code, which is very clean, but it's not quite as fast.
Dialogue: 0,0:01:56.41,0:02:03.52,Default,,0,0,0,,It turns out that using Rust's just recently stabilized "SIMD intrinsics", we can get quite a lot faster.
Dialogue: 0,0:02:03.52,0:02:14.00,Default,,0,0,0,,So if I run `b2sum` directly -- that's actually a binary that I've compiled on my machine...
Dialogue: 0,0:02:14.00,0:02:22.08,Default,,0,0,0,,Boom! We beat SHA-1. We beat SHA-1 using a set of intrinsics called AVX2. Has anybody heard the name "AVX2" before?
Dialogue: 0,0:02:22.08,0:02:24.70,Default,,0,0,0,,This should be even fewer than "BLAKE2"... Maybe some? Cool.
Dialogue: 0,0:02:24.70,0:02:33.77,Default,,0,0,0,,So, real quick -- I have so little time -- the Intel people have been spending a lot of time making processors faster, and they've been spending a lot of time making cores to run multiple threads.
Dialogue: 0,0:02:33.77,0:02:38.00,Default,,0,0,0,,But also they've been adding new instructions, so we can do things that the old processors couldn't do.
Dialogue: 0,0:02:38.00,0:02:41.15,Default,,0,0,0,,Now if you use those instructions, your program's not compatible with old machines.
Dialogue: 0,0:02:41.15,0:02:43.32,Default,,0,0,0,,So generally compilers don't emit them by default.
Dialogue: 0,0:02:43.32,0:02:48.78,Default,,0,0,0,,But if you tell your compiler to emit them, it will. And then you have to be very careful, cause you're doing unsafe things.
Dialogue: 0,0:02:48.78,0:02:55.48,Default,,0,0,0,,So here is a bunch of crap. And what it means doesn't really matter. It's shuffling bytes around. It's a hash function, it's shuffling bytes around, we all know.
Dialogue: 0,0:02:55.48,0:02:57.63,Default,,0,0,0,,But this is in Rust, of course.
Dialogue: 0,0:02:57.63,0:03:06.86,Default,,0,0,0,,And we see this cute little thing here, which is telling the compiler that, regardless of what platform you thought you were compiling for, in this function, use those AVX2 instructions.
Dialogue: 0,0:03:06.86,0:03:11.24,Default,,0,0,0,,You know, multiple integers at a time, loading and storing, that kind of stuff.
Dialogue: 0,0:03:11.24,0:03:16.70,Default,,0,0,0,,I did not invent this implementation. I ported it from C. So they, the actual authors, get credit for the performance.
Dialogue: 0,0:03:16.70,0:03:27.32,Default,,0,0,0,,So doing this says the compiler will emit these instructions for all targets, and then we're responsible for never ever calling this function when it's unsafe. So...
Dialogue: 0,0:03:27.32,0:03:31.53,Default,,0,0,0,,Down here, we see what we have to do to make this sort of stuff useful.
Dialogue: 0,0:03:31.53,0:03:35.21,Default,,0,0,0,,So, this is if you're sort of statically targeting modern processors, which is rare.
Dialogue: 0,0:03:35.21,0:03:39.53,Default,,0,0,0,,But here, we see that we're at runtime asking the processor,
Dialogue: 0,0:03:39.53,0:03:43.58,Default,,0,0,0,,"What instructions do you support?" If this processor supports AVX2, we'll call this function.
Dialogue: 0,0:03:43.58,0:03:47.76,Default,,0,0,0,,If I replace this with `if true`, this would hopefully crash.
Dialogue: 0,0:03:47.76,0:03:52.27,Default,,0,0,0,,Probably do something worse, on a machine that's older than three years. So this part's pretty important.
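
(A minimal sketch of the pattern described above, in Rust: one function compiled with AVX2 enabled regardless of the build target, plus a runtime CPU check that decides whether it's safe to call. The names and signatures here are illustrative, not the actual `blake2b_simd` internals.)

```rust
// Compiled with AVX2 enabled even if the overall target doesn't have it.
// Calling this on a CPU without AVX2 is undefined behavior, hence `unsafe`.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "avx2")]
unsafe fn compress_avx2(state: &mut [u64; 8], block: &[u8; 128]) {
    // The real body is full of core::arch AVX2 intrinsics shuffling bytes around.
    let _ = (state, block);
}

fn compress_portable(state: &mut [u64; 8], block: &[u8; 128]) {
    // Portable fallback that works everywhere.
    let _ = (state, block);
}

fn compress(state: &mut [u64; 8], block: &[u8; 128]) {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("avx2") {
            // Only safe because we just asked the CPU whether it supports AVX2.
            // Replacing this check with `if true` would be UB on older machines.
            return unsafe { compress_avx2(state, block) };
        }
    }
    compress_portable(state, block)
}
```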
Dialogue: 0,0:03:52.27,0:03:56.67,Default,,0,0,0,,So that's how you beat `sha1sum`, using pure, stable Rust.
Dialogue: 0,0:03:56.67,0:03:58.67,Default,,0,0,0,,And that works now.
Dialogue: 0,0:03:58.67,0:04:00.67,Default,,0,0,0,,If we want to get even faster than this,
Dialogue: 0,0:04:00.67,0:04:02.14,Default,,0,0,0,,we have to start using multiple threads.
Dialogue: 0,0:04:02.14,0:04:05.32,Default,,0,0,0,,So, a standard hash function is simply not designed to run on multiple different threads.
Dialogue: 0,0:04:05.32,0:04:08.44,Default,,0,0,0,,Each step depends on the last step.
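
(Why that's true, as a schematic: the loop below is not BLAKE2's real compression, just an illustration of the data dependency that blocks parallelism.)

```rust
// Schematic of a conventional serial hash: every block's compression takes the
// state produced by the previous block, so block N can't start before block
// N-1 finishes. Illustrative only.
fn hash_serial(input: &[u8]) -> [u64; 8] {
    let mut state = [0u64; 8]; // stand-in for the real initialization vector
    for block in input.chunks(128) {
        state = toy_compress(state, block); // strict dependency on the previous state
    }
    state
}

fn toy_compress(state: [u64; 8], block: &[u8]) -> [u64; 8] {
    // Placeholder mixing step, just so the example runs.
    let mut out = state;
    for (i, &b) in block.iter().enumerate() {
        out[i % 8] = out[i % 8].wrapping_mul(31).wrapping_add(u64::from(b));
    }
    out
}
```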
Dialogue: 0,0:04:08.44,0:04:16.01,Default,,0,0,0,,So as a...so we can see that this implementation is of course producing the same result.
Dialogue: 0,0:04:16.01,0:04:18.99,Default,,0,0,0,,4e2...something something, as the portable C.
Dialogue: 0,0:04:18.99,0:04:21.80,Default,,0,0,0,,That's good, because it has the same name, so it better produce the same result.
Dialogue: 0,0:04:21.80,0:04:27.93,Default,,0,0,0,,But if we wanted, if we were willing to use a different construction designed for multiple threads, we could do things faster. So let's try that.
Dialogue: 0,0:04:27.93,0:04:29.66,Default,,0,0,0,,This utility implements it.
Dialogue: 0,0:04:29.66,0:04:32.75,Default,,0,0,0,,The BLAKE2 standard includes such a construction.
Dialogue: 0,0:04:32.75,0:04:38.72,Default,,0,0,0,,[inaudible] called BLAKE2bp,
Dialogue: 0,0:04:38.72,0:04:39.82,Default,,0,0,0,,which is quite fast.
Dialogue: 0,0:04:39.82,0:04:42.14,Default,,0,0,0,,So that's going to fire up four threads.
Dialogue: 0,0:04:42.14,0:04:46.30,Default,,0,0,0,,It's going to chunk the input into four groups.
Dialogue: 0,0:04:46.30,0:04:52.17,Default,,0,0,0,,So the idea is like a block size of, BLAKE2[b] uses 128 bytes if I remember correctly,
Dialogue: 0,0:04:52.17,0:04:55.64,Default,,0,0,0,,So the first 128 bytes is block 0, block 1, 2, 3, and 4.
Dialogue: 0,0:04:55.64,0:04:58.68,Default,,0,0,0,,So then you get 0, goes with 4, goes with 8, goes with 12, that's on one thread.
Dialogue: 0,0:04:58.68,0:05:00.40,Default,,0,0,0,,And 1 goes with 5, and so on.
Dialogue: 0,0:05:00.40,0:05:02.70,Default,,0,0,0,,So you get four threads kind of running in parallel on the same input.
Dialogue: 0,0:05:02.70,0:05:06.46,Default,,0,0,0,,You get four hashes, hash those together, the result is the BLAKE2bp sum.
Dialogue: 0,0:05:06.46,0:05:09.72,Default,,0,0,0,,That plus a few other parameter tweaks, that's the definition of BLAKE2bp.
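
(A sketch of that interleaving. `blake2b` below is a placeholder, and real BLAKE2bp also tweaks parameters per lane; this only shows the block-to-lane assignment and the final combine.)

```rust
// Sketch of the BLAKE2bp shape described above: 128-byte blocks are dealt out
// round-robin across four lanes, each lane is hashed (on its own thread in the
// real thing), and the four lane hashes are hashed together for the final value.
fn blake2bp_like(input: &[u8]) -> [u8; 64] {
    let mut lanes: [Vec<u8>; 4] = Default::default();
    for (i, block) in input.chunks(128).enumerate() {
        lanes[i % 4].extend_from_slice(block); // block 0 -> lane 0, block 4 -> lane 0, ...
    }
    let mut root_input = Vec::new();
    for lane in &lanes {
        root_input.extend_from_slice(&blake2b(lane));
    }
    blake2b(&root_input)
}

fn blake2b(_data: &[u8]) -> [u8; 64] {
    [0u8; 64] // placeholder, not a real hash
}
```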
Dialogue: 0,0:05:09.72,0:05:13.00,Default,,0,0,0,,And then, it's not like a 4x speedup, unfortunately.
Dialogue: 0,0:05:13.00,0:05:16.80,Default,,0,0,0,,There are four physical cores on this machine. So we hope we get 4x, but it's [inaudible].
Dialogue: 0,0:05:16.80,0:05:19.23,Default,,0,0,0,,That's BLAKE2bp, and again that's slightly standard.
Dialogue: 0,0:05:19.23,0:05:23.28,Default,,0,0,0,,Not widely implemented in a lot of places, but in theory other software implements this.
Dialogue: 0,0:05:23.28,0:05:26.78,Default,,0,0,0,,If we want to get even faster than that, we have to start doing very interesting stuff.
Dialogue: 0,0:05:26.78,0:05:28.84,Default,,0,0,0,,Real quick, since we're on BLAKE2bp though,
Dialogue: 0,0:05:28.84,0:05:33.12,Default,,0,0,0,,let's look at the implementation of that.
Dialogue: 0,0:05:33.12,0:05:35.24,Default,,0,0,0,,I'm going to ignore most of the details.
Dialogue: 0,0:05:35.24,0:05:39.47,Default,,0,0,0,,
Dialogue: 0,0:05:39.47,0:05:43.23,Default,,0,0,0,,So, four of these go first in parallel.
Dialogue: 0,0:05:43.23,0:05:46.11,Default,,0,0,0,,One of these goes at the end, slightly different parameters, it's just a hash.
Dialogue: 0,0:05:46.11,0:05:49.31,Default,,0,0,0,,This is the interesting part, which I'll zoom in on.
Dialogue: 0,0:05:49.31,0:05:52.19,Default,,0,0,0,,This is Rayon. Who's used Rayon before?
Dialogue: 0,0:05:52.19,0:05:54.88,Default,,0,0,0,,It's surprisingly easy to use.
Dialogue: 0,0:05:54.88,0:06:03.72,Default,,0,0,0,,Niko...oh my god I should not even attempt to pronounce his last name...Matsakis, invented this early on, probably before 1.0.
Dialogue: 0,0:06:03.72,0:06:06.01,Default,,0,0,0,,It makes threading surprisingly easy.
Dialogue: 0,0:06:06.01,0:06:10.44,Default,,0,0,0,,`join` means, something on the left something on the right, run them on multiple threads...maybe.
Dialogue: 0,0:06:10.44,0:06:12.70,Default,,0,0,0,,And you'll notice it does some very very clever things.
Dialogue: 0,0:06:12.70,0:06:15.31,Default,,0,0,0,,If I scroll up...
Dialogue: 0,0:06:15.31,0:06:17.88,Default,,0,0,0,,`input` is a slice argument to this function, right.
Dialogue: 0,0:06:17.88,0:06:24.40,Default,,0,0,0,,Then this closure, the worker closure, is like chunking up input [inaudible] and hashing it.
Dialogue: 0,0:06:24.40,0:06:26.27,Default,,0,0,0,,And this is running that on multiple threads.
Dialogue: 0,0:06:26.27,0:06:32.86,Default,,0,0,0,,The type checker is already doing all the work to make sure that these threads are not going to take this reference that lives on this thread's stack
Dialogue: 0,0:06:32.86,0:06:37.58,Default,,0,0,0,,and then like outlast it. It's all safe, and all this stuff magically works. That's really cool.
Dialogue: 0,0:06:37.58,0:06:44.84,Default,,0,0,0,,And if the program happens to be using Rayon threads in other capacities, it's all the same global threadpool, so it kind of does a lot for you.
Dialogue: 0,0:06:44.84,0:06:50.01,Default,,0,0,0,,So this is just, do some things in parallel, surprisingly easy, and you get your answer.
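
(The shape of that `rayon::join` call, assuming `rayon` as a dependency. The worker below is a stand-in, not the code on screen.)

```rust
use rayon::join;

// Two closures, each borrowing a different half of the same input slice,
// potentially run in parallel on Rayon's global thread pool. The borrow
// checker guarantees the borrowed slices can't outlive the join() call,
// which is why no unsafe code is needed here.
fn hash_halves(input: &[u8]) -> (u64, u64) {
    let (left, right) = input.split_at(input.len() / 2);
    join(|| checksum(left), || checksum(right))
}

// Stand-in for the real per-thread worker; just sums bytes.
fn checksum(data: &[u8]) -> u64 {
    data.iter().map(|&b| u64::from(b)).sum()
}
```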
Dialogue: 0,0:06:50.01,0:06:55.26,Default,,0,0,0,,Again, if we want to do something even faster than 0.4 seconds,
Dialogue: 0,0:06:55.26,0:06:58.20,Default,,0,0,0,,we need to start defining custom hashing modes.
Dialogue: 0,0:06:58.20,0:07:04.28,Default,,0,0,0,,So the project that actually...led up to...all this talk...is called "Bao", currently. B-A-O.
Dialogue: 0,0:07:04.28,0:07:08.86,Default,,0,0,0,,It's a custom tree hashing mode. The details don't matter; it's a binary tree.
Dialogue: 0,0:07:08.86,0:07:12.94,Default,,0,0,0,,And if we run `bao hash` on "f"...
Dialogue: 0,0:07:12.94,0:07:15.02,Default,,0,0,0,,Maybe we want to know how fast it is...
Dialogue: 0,0:07:15.02,0:07:17.02,Default,,0,0,0,,Quite fast.
Dialogue: 0,0:07:17.02,0:07:19.52,Default,,0,0,0,,That is almost 4 gigabytes per second.
Dialogue: 0,0:07:19.52,0:07:22.16,Default,,0,0,0,,That is utilizing all the cores on the machine.
Dialogue: 0,0:07:22.16,0:07:26.03,Default,,0,0,0,,That is utilizing the AVX2 instructions in that BLAKE2 implementation we just saw.
Dialogue: 0,0:07:26.03,0:07:29.07,Default,,0,0,0,,If you look at the implementation of this...
Dialogue: 0,0:07:29.07,0:07:31.68,Default,,0,0,0,,This is the top-level hash function.
Dialogue: 0,0:07:31.68,0:07:36.30,Default,,0,0,0,,It doesn't make sense to use Rayon if you have like, a thousand bytes, cause there's a lot of overhead.
Dialogue: 0,0:07:36.30,0:07:40.88,Default,,0,0,0,,But if you have enough bytes, it calls into this recursive thing,
Dialogue: 0,0:07:40.88,0:07:42.30,Default,,0,0,0,,which again uses `rayon::join`.
Dialogue: 0,0:07:42.30,0:07:47.55,Default,,0,0,0,,This is a shockingly simple function for implementing, like, a four gigabyte per second hash.
Dialogue: 0,0:07:47.55,0:07:53.02,Default,,0,0,0,,It's...the main difference between this and BLAKE2bp is that this one is actually directly recursing on itself.
Dialogue: 0,0:07:53.02,0:07:56.35,Default,,0,0,0,,So we actually call `hash_recurse_rayon` inside of `hash_recurse_rayon`.
Dialogue: 0,0:07:56.35,0:08:01.64,Default,,0,0,0,,Before, with BLAKE2bp, it was just sort of a hardcoded set of four calls into a worker.
Dialogue: 0,0:08:01.64,0:08:05.80,Default,,0,0,0,,And this will parallelize an arbitrary number of threads on the way down.
Dialogue: 0,0:08:05.80,0:08:10.08,Default,,0,0,0,,`hash_node` is literally just the BLAKE2 hash. When you actually get to the bottom, you hash a leaf.
Dialogue: 0,0:08:10.08,0:08:12.40,Default,,0,0,0,,Then you hash everything together at the end.
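
(The recursive shape just described, sketched with made-up constants. `hash_recurse_rayon` and `hash_node` are the names from the talk, but this body is illustrative, not Bao's actual definition.)

```rust
use rayon::join;

// Sketch: small inputs are hashed directly as a leaf; larger inputs are split,
// both halves are hashed via rayon::join (recursing on this same function),
// and the two child hashes are hashed together as a parent node. The threshold,
// split rule, and hash_node body are illustrative, not Bao's real ones.
const LEAF_THRESHOLD: usize = 4096;

fn hash_recurse_rayon(input: &[u8]) -> Vec<u8> {
    if input.len() <= LEAF_THRESHOLD {
        return hash_node(input);
    }
    let (left, right) = input.split_at(input.len() / 2);
    let (left_hash, right_hash) = join(
        || hash_recurse_rayon(left),
        || hash_recurse_rayon(right),
    );
    let mut parent = left_hash;
    parent.extend_from_slice(&right_hash);
    hash_node(&parent) // parent node: hash the two child hashes together
}

fn hash_node(data: &[u8]) -> Vec<u8> {
    data.iter().take(32).copied().collect() // placeholder for a BLAKE2 call
}
```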
Dialogue: 0,0:08:12.40,0:08:14.49,Default,,0,0,0,,So, that's Bao. It's pretty fast.
Dialogue: 0,0:08:14.49,0:08:19.66,Default,,0,0,0,,If anybody has any questions about Rayon or SIMD instructions, let me know.
Dialogue: 0,0:08:19.66,0:08:28.14,Default,,0,0,0,,
Dialogue: 0,0:08:28.14,0:08:32.81,Default,,0,0,0,,So what's the name of the...how do you spell that...the hash?
Dialogue: 0,0:08:32.81,0:08:34.03,Default,,0,0,0,,The one at the end?
Dialogue: 0,0:08:34.03,0:08:35.32,Default,,0,0,0,,I think you said "Bao"?
Dialogue: 0,0:08:35.32,0:08:41.02,Default,,0,0,0,,B-A-O. I just put up the GitHub link, but it's too small for anyone to read.
Dialogue: 0,0:08:41.02,0:08:53.60,Default,,0,0,0,,
Dialogue: 0,0:08:53.60,0:08:56.30,Default,,0,0,0,,The other one was `blake2b_simd`.
Dialogue: 0,0:08:56.30,0:09:03.50,Default,,0,0,0,,This actually...this implements...the first one implements the standard BLAKE[2] hash function, which you might actually be using in other applications. It's just faster.
Dialogue: 0,0:09:03.50,0:09:04.92,Default,,0,0,0,,
Dialogue: 0,0:09:04.92,0:09:08.83,Default,,0,0,0,,The compress[ion] function you showed using SIMD instructions is marked unsafe.
Dialogue: 0,0:09:08.83,0:09:14.48,Default,,0,0,0,,What's [inaudible] SIMD intrinsics, those happen to be unsafe?
Dialogue: 0,0:09:14.48,0:09:15.28,Default,,0,0,0,,Those happen to be unsafe.
Dialogue: 0,0:09:15.28,0:09:17.80,Default,,0,0,0,,There's sort of two unsafe things about them.
Dialogue: 0,0:09:17.80,0:09:23.56,Default,,0,0,0,,One is that, if you just simply call them at all on a platform that doesn't support them, it's just by definition undefined behavior.
Dialogue: 0,0:09:23.56,0:09:25.26,Default,,0,0,0,,I have no idea what actually happens.
Dialogue: 0,0:09:25.26,0:09:31.80,Default,,0,0,0,,Another is that those...one of the intrinsics defined in those sets is usually some sort of wide load.
Dialogue: 0,0:09:31.80,0:09:37.29,Default,,0,0,0,,Like load a 256-bit integer, or something like that, and those usually require aligned pointers.
Dialogue: 0,0:09:37.29,0:09:39.74,Default,,0,0,0,,So if you have some pointer...
Dialogue: 0,0:09:39.74,0:09:42.33,Default,,0,0,0,,You run into this if you're doing pointer casts.
Dialogue: 0,0:09:42.33,0:09:46.49,Default,,0,0,0,,If you take a pointer to some bytes, and you cast it into a pointer to a 64-bit integer,
Dialogue: 0,0:09:46.49,0:09:52.48,Default,,0,0,0,,you might think all you need to do is a length check, and on x86 that might be true...sometimes.
Dialogue: 0,0:09:52.48,0:09:56.70,Default,,0,0,0,,But in fact what you've done, is you've got a problem with your unaligned pointer, and you've caused undefined behavior.
Dialogue: 0,0:09:56.70,0:10:01.68,Default,,0,0,0,,So that is a thing that tends to happen if you [inaudible].
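
(The hazard being described, in miniature; these helper functions are illustrative.)

```rust
// Casting a byte pointer to a wider type and dereferencing it is undefined
// behavior whenever the address isn't suitably aligned, even if the length
// check passes.
fn read_u64_bad(bytes: &[u8]) -> u64 {
    assert!(bytes.len() >= 8);
    unsafe { *(bytes.as_ptr() as *const u64) } // UB if the pointer isn't 8-byte aligned
}

// Two ways that are actually fine: an explicitly unaligned read, or copying
// into an array and converting.
fn read_u64_unaligned(bytes: &[u8]) -> u64 {
    assert!(bytes.len() >= 8);
    unsafe { (bytes.as_ptr() as *const u64).read_unaligned() }
}

fn read_u64_copy(bytes: &[u8]) -> u64 {
    let mut buf = [0u8; 8];
    buf.copy_from_slice(&bytes[..8]);
    u64::from_le_bytes(buf)
}
```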
Dialogue: 0,0:10:01.68,0:10:07.90,Default,,0,0,0,,Does Rayon act like actual threads, or does it do something with like scheduled threads?
Dialogue: 0,0:10:07.90,0:10:09.90,Default,,0,0,0,,Tell me more about the difference between those two things.
Dialogue: 0,0:10:09.90,0:10:13.95,Default,,0,0,0,,Like do you [inaudible] full operating system multiple threads where you're running it on...
Dialogue: 0,0:10:13.95,0:10:19.21,Default,,0,0,0,,Yes, under the covers you have an actual thread pool with actual threads.
Dialogue: 0,0:10:19.21,0:10:26.49,Default,,0,0,0,,It does a fair bit of logic to make sure you have, like, idle cores and stuff like that. I think it tries to be fairly intelligent.
Dialogue: 0,0:10:26.49,0:10:28.49,Default,,0,0,0,,It also tries to [inaudible] around whether...
Dialogue: 0,0:10:28.49,0:10:32.59,Default,,0,0,0,,Of course this tree, you could imagine you have like 2^30 branches.
Dialogue: 0,0:10:32.59,0:10:35.02,Default,,0,0,0,,You might not want to run them all on individual worker threads.
Dialogue: 0,0:10:35.02,0:10:37.76,Default,,0,0,0,,So it tries to [inaudible], but yes, real threads.
Dialogue: 0,0:10:37.76,0:10:39.21,Default,,0,0,0,,
Dialogue: 0,0:10:39.21,0:10:41.21,Default,,0,0,0,,Alright. Thanks guys.