[00:00.000 --> 00:16.400] Hello, hello, and welcome to "Waterpark - Transforming Health Care with Distributed Actors".
[00:16.400 --> 00:17.400] My name is Brian Hunter.
[00:17.400 --> 00:22.160] It is awesome to be back in Oslo, and it's a joy to be here today to share some of my
[00:22.160 --> 00:24.520] team's research.
[00:24.520 --> 00:31.240] This research led to a product with an unusual architecture that's out there doing good
[00:31.240 --> 00:36.000] in the world, and so it's going to be fun to share this with you all.
[00:36.000 --> 00:40.360] All right, so I'm an Enterprise Fellow at HCA Health Care.
[00:40.360 --> 00:44.520] Here in Norway, you might have never heard of HCA Health Care, you might not be familiar
[00:44.520 --> 00:47.640] with it, but HCA is big.
[00:47.640 --> 00:55.000] There are 186 hospitals that are owned and operated by HCA, and over 2,300 surgery centers,
[00:55.000 --> 01:01.680] freestanding ERs, imaging centers, and clinics around the United States.
[01:01.680 --> 01:09.800] I have 309,000 colleagues that can all send me chat messages and emails, and on Monday,
[01:09.800 --> 01:13.840] after being gone for a week, it'll probably feel like they all have.
[01:13.840 --> 01:22.560] So with 43 million patient encounters each year, a lot of people depend on us doing things
[01:22.560 --> 01:28.240] right.
[01:28.240 --> 01:32.600] Long ideas: this idea of holding an idea for a long time.
[01:32.600 --> 01:37.760] So early in my career, well, not quite that early, I worked as a developer in healthcare
[01:37.760 --> 01:38.760] IT.
[01:39.080 --> 01:46.840] Much of the software used in healthcare then was poor and it still is, but I saw the potential
[01:46.840 --> 01:51.960] that code could help patients and could help caregivers.
[01:51.960 --> 01:57.080] That idea stuck with me, and even though my work back then wasn't particularly helpful
[01:57.080 --> 02:04.120] or meaningful to anyone, I knew that the potential for meaning existed in this healthcare work.
[02:05.080 --> 02:11.640] A little bit later in my career, my first trip to Oslo was in 2007.
[02:11.640 --> 02:19.520] I was here on this six-week OO .Net consulting project, not healthcare-related, and over
[02:19.520 --> 02:26.120] dinner one night, a Norwegian developer I was working with, he mentioned Erlang, the
[02:26.120 --> 02:33.800] Erlang VM, and a real-world example of a system with four years of continuous uptime.
[02:33.880 --> 02:37.560] And that story, it stuck with me, too.
[02:37.560 --> 02:41.960] So that night, back at the hotel, I started digging into Erlang, looked up the Wikipedia
[02:41.960 --> 02:48.240] page, and as I got into it, I realized I'd never seen anything like it, and that brief conversation
[02:48.240 --> 02:50.160] changed my life.
[02:50.160 --> 02:55.180] It set me on an entirely different course than the one I was on before, and I spent 10 years after that
[02:55.180 --> 03:01.960] thinking through how well the Erlang VM mapped on to these problems of healthcare, these
[03:02.120 --> 03:06.240] big, meaningful healthcare problems that I hadn't really had much of a chance of solving
[03:06.240 --> 03:08.280] earlier.
[03:08.280 --> 03:16.040] So in 2017, after 10 years of thinking about this for free (HCA got 10 years of free consulting
[03:16.040 --> 03:21.400] before I ever engaged with them), after 10 years of thinking about this problem, I
[03:21.400 --> 03:27.400] was presented with this research and development challenge at HCA, within the Enterprise Integration
[03:27.400 --> 03:29.640] Group.
[03:29.640 --> 03:35.400] So given "all the data" and the ErlangVM, what good could we do?
[03:35.400 --> 03:38.300] Could we improve patient outcomes?
[03:38.300 --> 03:43.480] Could we help caregivers, our nurses, our doctors?
[03:43.480 --> 03:50.960] And could we give our hospitals better footing in the case of a disaster or a pandemic?
[03:50.960 --> 03:55.840] And luck favors the prepared, as the saying goes.
[03:55.840 --> 03:59.560] So secondarily, could we bring some joy to IT?
[03:59.560 --> 04:06.700] For me, dev joy is being productive and having low-friction ways of doing important work.
[04:06.700 --> 04:09.640] That makes me happy as a developer.
[04:09.640 --> 04:13.760] And biz joy, faster delivery times, makes the business happy.
[04:13.760 --> 04:17.480] And also lowering risk and uncertainty makes the business happy.
[04:17.480 --> 04:20.160] So we wanted to do these things, these were our goals.
[04:20.160 --> 04:29.320] And then if we have observable fault-tolerant systems with stress-free failure models, then
[04:29.320 --> 04:31.360] your ops people, they have some joy.
[04:31.360 --> 04:38.480] Ops people don't have a lot of joy, but they have some joy.
[04:38.480 --> 04:43.280] So that year of experimentation, proofs, and research, it seeded several useful projects,
[04:43.280 --> 04:45.480] including Waterpark.
[04:45.480 --> 04:48.520] So you might be wondering about what's with that name?
[04:48.520 --> 04:50.160] Why is this healthcare system called this?
[04:50.160 --> 04:56.080] And so it came about because we knew we wanted this thing to be, it was going to be fast,
[04:56.080 --> 04:59.400] it was going to be fun, it was going to be full of joy, as we just mentioned.
[04:59.400 --> 05:01.960] And then all this data was going to be flowing through it.
[05:01.960 --> 05:04.120] Everything is going to be flowing through.
[05:04.120 --> 05:09.240] And we thought of it less like a data lake or a data swamp and more like a data waterpark.
[05:09.240 --> 05:13.120] And so the name made everyone happy and so we went with it.
[05:14.120 --> 05:19.280] Okay, if you want to help your patients, your caregivers, and your IT friends, you have
[05:19.280 --> 05:20.280] to be up.
[05:20.280 --> 05:23.480] You can't have a system that's falling down all the time.
[05:23.480 --> 05:29.560] But "highly available", that term, as we thought about it, felt too subjective.
[05:29.560 --> 05:34.800] And so we decided on something from the start where instead of highly available, we'd be
[05:34.800 --> 05:36.960] continuously available.
[05:36.960 --> 05:38.360] No downtime.
[05:38.360 --> 05:41.240] That was an initial goal.
[05:41.280 --> 05:47.480] And in an organization, if it's hard to get access to data that you need, then teams inevitably,
[05:47.480 --> 05:49.720] they're going to hoard data.
[05:49.720 --> 05:55.120] They're going to hoard the data and then these silos of data will appear in the organization.
[05:55.120 --> 06:01.680] But if you make publishing and subscribing to the data easy, self-service, and you provide
[06:01.680 --> 06:07.680] it in the data types, data formats, and protocols that people want
[06:07.720 --> 06:12.720] to talk with, you meet them where they're at, then the hoarding, it stops.
[06:12.720 --> 06:22.080] We also thought about this problem of the cost of failed experiments.
[06:22.080 --> 06:28.560] So a business leader, innovator, they have an experiment and if the cost of failure is
[06:28.560 --> 06:33.440] high, either in dollars or political capital, then the experimentation, innovation, it's
[06:33.440 --> 06:35.520] going to dry up.
[06:35.520 --> 06:43.040] So could we create this place where experiments were free, and we could
[06:43.040 --> 06:48.160] have ideas that were fast, cheap, and low-risk to try out?
[06:48.160 --> 06:51.800] And then finally, could we shine?
[06:51.800 --> 06:57.480] Could we use Waterpark itself as an exemplar of how software can be built?
[06:57.480 --> 07:00.780] So those are these goals at the very beginning in 2017.
[07:00.780 --> 07:06.560] So they're kind of audacious, some of them, but let's see what we get.
[07:06.560 --> 07:12.820] So here's Waterpark, it runs on a cluster of servers, data comes in, we get thousands
[07:12.820 --> 07:18.260] and thousands of feeds coming in, just high velocity data from lots of different sources.
[07:18.260 --> 07:22.980] So this is truly a big data platform, volume, velocity, variety.
[07:22.980 --> 07:25.580] And some of the data, we simply route.
[07:25.580 --> 07:28.820] So we're an integration platform, so some of it comes in and we just send it right back
[07:28.820 --> 07:29.820] out as it was.
[07:30.220 --> 07:35.460] And then some data we transform, maybe it came in as XML over REST, and we transform
[07:35.460 --> 07:42.020] it into JSON and put it on a Kafka topic, if that's what they want, the subscribers.
[07:42.020 --> 07:45.900] And other signals are generated entirely within Waterpark.
[07:45.900 --> 07:49.660] And so this is where we're a little bit different than a lot of integration platforms.
[07:49.660 --> 07:55.380] And these signals are generated entirely within Waterpark by inferring knowledge from these
[07:55.580 --> 08:00.020] disparate streams of data that are coming in.
[08:00.020 --> 08:05.540] So Waterpark, it covers a lot of bases: integration engine, streaming system, distributed database,
[08:05.540 --> 08:10.580] content delivery network, functions as a service platform, complex event processor, queue,
[08:10.580 --> 08:12.860] and a cache.
[08:12.860 --> 08:19.580] So by implementing from scratch, in Elixir, the minimal set of features that we needed
[08:19.740 --> 08:26.340] from say a cache or a queue, we didn't have to take a hard dependency on those.
[08:26.340 --> 08:31.580] So we don't need to depend on Redis, we are a cache.
[08:31.580 --> 08:37.380] We don't depend on Kafka, we are a streaming platform and a queue.
[08:37.380 --> 08:40.140] Waterpark doesn't even use a database.
[08:40.140 --> 08:45.620] We don't depend on SQL Server, Postgres, or Cassandra, or any database.
[08:45.620 --> 08:47.900] Waterpark is a database.
[08:47.900 --> 08:57.620] It's a distributed, geographically fault-tolerant, RAM-based, never-touches-disk database.
[08:57.620 --> 09:07.300] So we don't use our drives for anything other than spinning up the product when Linux boots.
[09:07.300 --> 09:11.460] Investing a few months in building a tailor-fit subset of the features that we truly needed
[09:11.460 --> 09:14.160] let us avoid dependencies.
[09:14.160 --> 09:18.440] And dependencies, even when they're free, they're not really free.
[09:18.440 --> 09:23.800] In Waterpark, our deployment and failure models as a result are our own.
[09:23.800 --> 09:28.920] And this has been a key to us having zero downtime since going to production over five
[09:28.920 --> 09:31.380] years ago.
[09:31.380 --> 09:37.240] Not a millisecond of downtime on this platform.
If you wish to make an apple pie from scratch, you must first invent the universe. Carl Sagan (Cosmos, "The Lives of the Stars", Nov 23, 1980)
[09:37.240 --> 09:43.340] But as virtuous as wheel reinvention can be, inventing the universe is a much bigger job
[09:43.640 --> 09:45.840] and you need to be able to differentiate between the two.
[09:45.840 --> 09:51.800] Are you rebuilding a wheel or are you trying to create the universe?
[09:51.800 --> 09:55.760] So for Waterpark, we didn't invent hardware.
[09:55.760 --> 10:05.760] We chose simple 1U servers with onboard disks, no fancy shared storage, and we expect these
[10:05.760 --> 10:07.120] servers to fail.
[10:07.120 --> 10:12.080] We got the cheapest 1U servers that they would let us put in the data centers.
[10:12.080 --> 10:16.700] And the goal was we wanted these boxes to fail as a unit.
[10:16.700 --> 10:21.820] We wanted that box to collapse and go away and not have like half of its dependencies
[10:21.820 --> 10:26.020] across the network, like its shared storage.
[10:26.020 --> 10:30.420] And we didn't write a language or an operating system either, but we did choose a less conventional
[10:30.420 --> 10:33.780] operating system, the Erlang VM.
[10:33.780 --> 10:39.140] And in case you're wondering: yes, the Erlang VM is an operating system.
[10:39.140 --> 10:43.640] You can deploy Erlang bare metal, without Linux or anything underneath it.
[10:43.640 --> 10:44.840] We don't do that.
[10:44.840 --> 10:51.440] We deploy our systems written in Elixir or on the Erlang VM and we deploy them on Linux,
[10:51.440 --> 10:55.020] just because we get nice things in Linux that we wanted.
[10:55.020 --> 10:57.360] But we wouldn't have to.
[10:57.360 --> 11:01.040] And it does all these things on the right over here: process management, interrupts,
[11:01.040 --> 11:02.580] and so on.
[11:02.580 --> 11:06.860] It's entirely an operating system, and that gives us these superpowers.
[11:06.860 --> 11:11.900] So I wanted to mention the language that we use is Elixir, which is a modern productive
[11:11.900 --> 11:14.780] DevJoy language that runs on the Erlang VM.
[11:14.780 --> 11:19.420] And there's a [QR code there](https://www.youtube.com/watch?v=iswld-Rpi_g) if you wanted to check out a talk from ElixirConf just to
[11:19.420 --> 11:23.840] give you like a flavor of what Elixir is.
[11:23.840 --> 11:32.260] And so on a single commodity server, we run millions of actor processes, massive concurrency
[11:32.260 --> 11:34.760] here, millions of processes running on here.
[11:35.520 --> 11:40.240] We're talking the same isolation that you'd have between Notepad and Calculator on your
[11:40.240 --> 11:41.920] desktop, that much isolation.
[11:41.920 --> 11:45.080] So millions of these things running on one server.
[11:45.080 --> 11:48.000] And these actors, they auto balance across the cores.
[11:48.000 --> 11:49.240] We don't think about threading.
[11:49.240 --> 11:53.520] There's no concept of threading on the Erlang VM.
[11:53.520 --> 11:57.380] Everything is a process and so the processes are automatically concurrent and they can
[11:57.380 --> 12:00.940] then be distributed across cores.
[12:00.980 --> 12:05.860] And we get this for free, all the time, without thinking about it, and talking to an actor on the other
[12:05.860 --> 12:10.420] side of the country is no more difficult than talking to one in the same data center.
[12:10.420 --> 12:11.420] It's the same code.
[12:11.420 --> 12:16.420] It just takes a little bit longer because light, you know, in physics.
[12:16.420 --> 12:21.260] So the Erlang VM is this special purpose operating system, built around the actor model, and that
[12:21.260 --> 12:24.060] fit our needs really well.
[12:24.060 --> 12:27.100] So the actor model was introduced by Carl Hewitt in 1973. https://www.ijcai.org/Proceedings/73/Papers/027B.pdf
[12:27.540 --> 12:31.980] It's this model of concurrent computation that's inspired by physics.
[12:31.980 --> 12:38.420] And in this model, actors, they're the fundamental element and actors interact through message
[12:38.420 --> 12:39.420] passing.
[12:39.420 --> 12:42.820] So everything is done through message passing in the actor model.
[12:42.820 --> 12:45.640] So what does that look like in Elixir?
[12:45.640 --> 12:50.940] So every actor process has some props.
[12:50.940 --> 12:56.540] First, each actor process has its own dedicated, isolated, share-nothing memory.
[12:56.540 --> 12:58.260] So if you have two processes,
[12:58.260 --> 12:59.820] they can't access each other's memory, any more
[12:59.820 --> 13:03.220] than Notepad and Calculator can access each other's memory.
[13:03.220 --> 13:08.580] And you have millions of these running on a server again.
[13:08.580 --> 13:13.140] And so this memory here, it starts off tiny, two kilobytes on a 64-bit system.
[13:13.140 --> 13:15.340] And inside of that, we get a stack and a heap.
[13:15.340 --> 13:18.340] And as our actors run and they accumulate state, they just grow.
[13:18.340 --> 13:20.580] They just grab more memory.
[13:20.580 --> 13:25.180] And the growth follows the Fibonacci series as it's growing.
[13:25.740 --> 13:30.180] And each process also gets its own garbage collector.
[13:30.180 --> 13:32.460] So let that sink in for just a second.
[13:32.460 --> 13:40.140] We've got these hundreds of thousands of processes, and they all have their own garbage collectors
[13:40.140 --> 13:41.740] running,
[13:41.740 --> 13:44.620] right there on this box.
[13:44.620 --> 13:49.900] And think about this job, though, of a garbage collector on an operating system where all
[13:49.900 --> 13:53.340] of the languages are functional.
[13:53.340 --> 13:57.380] And so in Elixir, on the Erlang VM, you can't mutate state.
[13:57.380 --> 14:00.900] There's no option of marking something mutable so that you can cheat.
[14:00.900 --> 14:02.740] You just can't mutate state.
[14:02.740 --> 14:07.220] And so you've got functional languages, the code can't mutate state, and there's no shared
[14:07.220 --> 14:09.100] memory between these different processes.
[14:09.100 --> 14:10.900] There's no concept of threads.
[14:10.900 --> 14:14.940] So the easiest garbage collection job in the world.
[14:14.940 --> 14:18.100] Not a bad place to be a garbage collector.
[14:18.100 --> 14:24.340] And so we get these tiny, deterministic GCs, and we don't get these availability-killing,
[14:24.340 --> 14:29.060] stop-the-world GCs that we'd expect on other platforms.
[14:29.060 --> 14:32.260] Every process also gets its own dedicated mailbox.
[14:32.260 --> 14:38.900] And this is the only way that a process can talk to other processes on the cluster.
[14:38.900 --> 14:42.740] And it's actually the only way it can talk to the outside world, including the file system.
[14:42.740 --> 14:45.140] Everything is through message passing.
[14:45.140 --> 14:51.460] So it's absolutely lonely being an Elixir process.
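The mailbox-and-message-passing idea above can be sketched in a few lines of Elixir. This is a toy example, not Waterpark code; the `Counter` module and message shapes are invented for illustration.

```elixir
# Toy sketch: one actor that owns a counter, reachable only through its mailbox.
defmodule Counter do
  def loop(count) do
    receive do
      {:increment, by} ->
        loop(count + by)                # "change" state by recursing with a new value
      {:get, caller} ->
        send(caller, {:count, count})   # reply by message -- there is no shared memory
        loop(count)
    end
  end
end

pid = spawn(Counter, :loop, [0])   # an isolated process with its own heap and mailbox
send(pid, {:increment, 5})
send(pid, {:get, self()})

receive do
  {:count, n} -> IO.puts("count is #{n}")   # prints "count is 5"
end
```

Messages from one process to another arrive in the order they were sent, so the increment is applied before the read.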
[14:51.460 --> 14:52.940] Fourth, links and monitors.
[14:52.940 --> 14:59.620] So we can tell the Erlang VM that we want to know immediately if some other process, on
[14:59.620 --> 15:05.260] the same server, across the cluster, or across the country, crashes.
[15:05.260 --> 15:11.580] And so a link is one form of this, and with a link, it's a bi-directional thing.
[15:11.580 --> 15:15.740] And it says, OK, if you die, I want to die with you.
[15:15.740 --> 15:20.260] This is going to be Thelma and Louise, going over the cliff.
[15:20.260 --> 15:25.220] There's another idea called monitors, and this is more like reading the obituaries.
[15:25.220 --> 15:29.660] It's like, I care, I'll grieve, but I'll carry on.
[15:29.660 --> 15:31.140] I don't want to die with you.
[15:31.140 --> 15:35.700] And so these low level operating system level messages get passed around.
[15:35.700 --> 15:38.340] And remember the ErlangVM is an operating system.
[15:38.340 --> 15:43.340] They're used to build these higher level ErlangVM concepts, called supervision trees
[15:43.340 --> 15:44.740] and supervisors.
[15:44.740 --> 15:49.980] And so what a supervisor lets you do is this: a process crashes, and it'll be restarted automatically
[15:49.980 --> 15:52.540] into its init state.
[15:52.540 --> 15:59.220] So this is one of the primitives that gives the ErlangVM high availability or fault tolerance.
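That restart behavior can be sketched directly. The `Flaky` module here is hypothetical, not part of Waterpark; it just shows a one_for_one supervisor bringing a killed worker back into its init state.

```elixir
# A supervised worker (hypothetical module name, for illustration only).
defmodule Flaky do
  use GenServer
  def start_link(_arg), do: GenServer.start_link(__MODULE__, :ok, name: __MODULE__)
  def init(:ok), do: {:ok, :fresh_init_state}
end

# one_for_one: if a child dies, restart just that child.
{:ok, _sup} = Supervisor.start_link([{Flaky, []}], strategy: :one_for_one)

pid1 = Process.whereis(Flaky)
Process.exit(pid1, :kill)   # simulate a crash
Process.sleep(100)          # give the supervisor a beat to restart it

pid2 = Process.whereis(Flaky)
IO.puts(is_pid(pid2) and pid2 != pid1)   # prints "true" -- same name, brand-new process
```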
[15:59.220 --> 16:01.940] All right.
[16:01.940 --> 16:05.140] Now let's take a quick look at process scheduling.
[16:05.140 --> 16:11.300] So we have a single CPU core, and we're going to get a single actor process scheduler.
[16:11.300 --> 16:14.940] So you can think of a program as a stream of operations.
[16:14.940 --> 16:22.180] So the ErlangVM scheduler gives each actor process a chance to do 2,000 operations.
[16:22.180 --> 16:26.300] So 2,000, then immediately it moves on to the next actor process and gives it 2,000
[16:26.300 --> 16:27.300] and so on.
[16:27.300 --> 16:30.980] And there's no way that an actor can take more than 2,000.
[16:30.980 --> 16:32.940] There are no hogs.
[16:32.940 --> 16:38.980] And the context-switching cost between these is almost zero.
[16:38.980 --> 16:42.060] It's nothing like the context-switching cost of threads.
[16:42.060 --> 16:47.100] And so we get this buttery-smooth, soft real-time processing.
[16:47.100 --> 16:53.820] So two cores, we get two schedulers, two things happening in parallel, and so on.
[16:53.820 --> 16:58.180] And as a note, Waterpark's 1U servers, they have 56 logical cores, and so we have
[16:58.580 --> 17:02.820] 56 things happening in parallel on each one of our servers.
[17:02.820 --> 17:06.420] And we don't have to do a thing to make that happen.
[17:06.420 --> 17:10.140] Okay, the actor model in healthcare.
[17:10.140 --> 17:12.620] We've got this common understanding of the Erlang VM and the actor model.
[17:12.620 --> 17:15.520] Let's see how we apply this.
[17:15.520 --> 17:21.560] So we use actor processes to represent each patient as a digital twin.
[17:21.560 --> 17:27.220] Actor process per patient, not database row per patient.
[17:27.260 --> 17:31.220] In more typical healthcare systems, a patient is represented as a moment in time snapshot
[17:31.220 --> 17:37.180] of data, usually on disk, but this can be HL7 messages, table rows, JSON, et cetera.
[17:37.180 --> 17:39.740] And most systems, they read patient data.
[17:39.740 --> 17:43.740] They perform work on the current values, and they flush the buffers, and they move on with
[17:43.740 --> 17:47.340] no memory of just what they did before, like a goldfish.
In Waterpark, we model each patient as a long-running "Patient Actor" (currently millions of Patient Actors)
[17:47.340 --> 17:53.220] In Waterpark, though, we model each patient as a long-running Patient Actor.
[17:53.220 --> 17:58.660] Currently, there are millions of Patient Actors running on the cluster, and a patient
[17:58.660 --> 18:07.660] actor represents, and is dedicated to, one individual human at our hospitals.
[18:07.660 --> 18:11.900] These Patient Actors, they run from pre-admit to post-discharge, and they're going to run
[18:11.900 --> 18:14.300] a minimum of 21 days.
[18:14.300 --> 18:19.780] So when one comes in, we reset a clock every time a new message gets routed to that patient
[18:19.820 --> 18:25.340] actor, and so these processes are going to be running for a minimum of three weeks.
[18:25.340 --> 18:29.340] Millions of these things, every human represented by a long-running process.
[18:29.340 --> 18:33.920] A Patient Actor is not limited to just the data of the latest HL7 message, it holds every
[18:33.920 --> 18:37.420] message and event that led to the current state.
[18:37.420 --> 18:44.220] So you can start thinking about this in event-sourcing terms.
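A sketch of such a Patient Actor might look like the following GenServer. The module and field names here are hypothetical, not the real Waterpark implementation; it shows the two ideas from above: an append-only transcript (event-sourcing style), and an idle timeout that resets on every new message, standing in for the 21-day clock.

```elixir
# Hypothetical sketch of a per-patient actor (not Waterpark's real code).
defmodule PatientActor do
  use GenServer

  @idle_timeout :timer.hours(24 * 21)   # 21 days, reset on every message

  def start_link(patient_id),
    do: GenServer.start_link(__MODULE__, patient_id)

  def init(patient_id),
    do: {:ok, %{patient_id: patient_id, transcript: []}, @idle_timeout}

  # Append each HL7 message to the transcript: current state is derivable
  # from the full history, not from a moment-in-time snapshot.
  def handle_cast({:hl7, msg}, state),
    do: {:noreply, %{state | transcript: [msg | state.transcript]}, @idle_timeout}

  def handle_call(:transcript, _from, state),
    do: {:reply, Enum.reverse(state.transcript), state, @idle_timeout}

  # Only fires after 21 days with no messages at all.
  def handle_info(:timeout, state), do: {:stop, :normal, state}
end

{:ok, pid} = PatientActor.start_link("1001")
GenServer.cast(pid, {:hl7, "ADT^A01 admit"})
GenServer.cast(pid, {:hl7, "ORU^R01 lab result"})
IO.inspect(GenServer.call(pid, :transcript))
# prints ["ADT^A01 admit", "ORU^R01 lab result"]
```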
[18:44.220 --> 18:45.220] The long shift.
[18:45.220 --> 18:49.420] Do any of you have friends or family in health care?
[18:49.420 --> 18:52.060] Okay, I see some hands.
Nurses often work 12-hour shifts to provide continuity-of-care for patients and to reduce the frequency of dangerous handoffs
[18:52.060 --> 18:58.300] So you are inevitably aware of the long shift, this idea of 12-hour shifts or longer
[18:58.300 --> 19:03.060] that doctors and nurses work, and this is done to provide continuity-of-care for the patients
[19:03.060 --> 19:05.580] and to reduce dangerous handoffs.
[19:05.580 --> 19:12.500] And what I mean by that, let me make it easier to read: according to the Joint Commission (JCAHO), 80% of serious
[19:12.500 --> 19:16.900] medical errors involve miscommunication during handoffs, during the transition
[19:16.900 --> 19:17.900] of care.
[19:17.900 --> 19:23.420] So it's dangerous to have a handoff, so by going to 12-hour shifts, you have two changes
[19:23.420 --> 19:25.200] a day instead of three.
[19:25.200 --> 19:28.140] So you reduce your risks.
[19:28.140 --> 19:35.140] We extend this idea of the long shift, except we're extending it to 21 days, at minimum.
[19:35.140 --> 19:37.720] So we're not losing context about this patient.
[19:37.720 --> 19:39.620] We know everything that's happened to them.
[19:39.620 --> 19:44.680] And so we extend this idea to the clinical systems, and it helps us with having awareness
[19:44.680 --> 19:46.680] of the patient's full visit.
[19:46.680 --> 19:51.780] So this full visit awareness enables real-time notifications and alerts based on days or
[19:51.780 --> 19:56.420] weeks of events, whether that's drugs administered, transfers, procedures, lab results, and so
[19:56.420 --> 19:57.420] on.
[19:57.420 --> 19:58.420] So let's see what this looks like.
[19:58.420 --> 20:04.300] We've got this hospital with two patients, and we have our EMR here, our medical record
[20:04.300 --> 20:09.140] system, and it's going to generate HL7 messages, and it's going to relay those.
[20:09.380 --> 20:13.100] When they hit Waterpark, we're going to spin up a Patient Actor to represent this human,
[20:13.100 --> 20:17.020] because we have never seen patient 1001 before.
[20:17.020 --> 20:22.140] So we spin that patient up, and we apply this admit message.
[20:22.140 --> 20:25.580] Another message comes in, and we say, hey, we do have this on the cluster, and it gets
[20:25.580 --> 20:31.820] routed to this one instance of that patient on the cluster.
[20:31.820 --> 20:36.820] And we've never seen 1022, and so we spin that one up.
[20:36.820 --> 20:38.840] And then another message comes in.
[20:38.840 --> 20:42.860] So we can see this transcript building up of these HL7 messages.
[20:42.860 --> 20:45.920] So I mentioned HL7 messages a few times so far.
[20:45.920 --> 20:48.880] So let's take a peek what healthcare looks like.
[20:48.880 --> 20:53.380] So you might squint, might close your eyes just a little bit.
[20:53.380 --> 20:57.220] And so this is what healthcare looks like all over the world.
[20:57.220 --> 21:03.700] So every time you go to a doctor, there's something like this, flowing back and forth.
[21:03.700 --> 21:08.260] I'll make it a little bit easier to see by putting spaces between the segments and bolding
[21:08.260 --> 21:11.140] these segment headers.
[21:11.140 --> 21:18.240] But there's an idea called the path syntax in HL7 that lets you say things like: `MSH-3` maps
[21:18.240 --> 21:19.980] to this, the sending application,
[21:19.980 --> 21:23.300] `MSH-4` to the sending facility, and so on.
[21:23.300 --> 21:25.860] And so these paths are there.
[21:25.860 --> 21:27.640] So that's nice.
[21:27.640 --> 21:28.640] We can walk through this.
[21:28.640 --> 21:30.940] This thing ends up being a pretty interesting data structure.
[21:30.940 --> 21:31.940] It looks ugly to begin with.
[21:31.940 --> 21:33.300] It's this tree structure, though.
[21:33.300 --> 21:36.900] It's just a four-level deep tree.
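You can see that tree fall out of the delimiters with a toy parse. This is not the real elixir-hl7 library, and it ignores HL7's repetition (`~`) and escape rules; it just shows the four levels: segments on carriage returns, fields on `|`, components on `^`, subcomponents on `&`.

```elixir
# Toy four-level HL7 parse (illustration only; ignores repetitions and escapes).
parse = fn text ->
  text
  |> String.split("\r", trim: true)              # level 1: segments
  |> Enum.map(fn seg ->
    seg
    |> String.split("|")                         # level 2: fields
    |> Enum.map(fn field ->
      field
      |> String.split("^")                       # level 3: components
      |> Enum.map(&String.split(&1, "&"))        # level 4: subcomponents
    end)
  end)
end

msg = "MSH|^~\\&|SENDING_APP|SENDING_FACILITY\rPID|1||1001"
[msh, pid_segment] = parse.(msg)
IO.inspect(hd(msh))   # prints [["MSH"]]
```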
[21:36.900 --> 21:42.140] And so HL7, though, was critically important from the beginning of Waterpark.
[21:42.140 --> 21:44.540] And we needed a library, so we built one. https://github.com/HCA-Healthcare/elixir-hl7
[21:44.540 --> 21:48.340] And I'm proud that HCA made this work available to the community.
[21:48.340 --> 21:50.860] So this is open source out in the world.
[21:50.860 --> 21:55.180] And raising all boats, caring for the community, and it's one more reason for healthcare companies
[21:55.180 --> 21:57.680] to use Elixir.
[21:57.680 --> 21:59.420] So let's see what this library looks like.
iex> msg = HL7.Message.new(hl7_text)
%HL7.Message{ fragments: [], header: %HL7.Header{hl7_version: "2.5", ...}}
iex> msg |> HL7.Query.new() |> HL7.Query.get_part("AL1-3.2")
"ASPIRIN"
[21:59.940 --> 22:04.460] We go through, message new, off that text we were looking at there, we get this data
[22:04.460 --> 22:05.460] structure.
[22:05.460 --> 22:12.220] And then we can query, get part, and say AL1 3.2.
[22:12.220 --> 22:18.820] And we see that in their allergy segment, they're allergic to aspirin, it looks like.
[22:18.820 --> 22:24.420] So we got to thinking about those paths and this processing, and we were playing around.
[22:24.420 --> 22:27.620] And so we thought about scalable bloom filters and how we could apply that.
[22:27.620 --> 22:31.500] So scalable bloom filter is this probabilistic data structure that supports two things.
[22:31.500 --> 22:34.420] You can `add_item(bloom, item)` to the bloom filter.
[22:34.420 --> 22:38.620] And you can say, `is_member?(bloom, item)`, have I ever added this item to this bloom filter?
[22:38.620 --> 22:40.420] And it'll give you a Boolean.
[22:40.420 --> 22:45.880] And so it'll never return a false negative, but it can return false positives.
[22:45.880 --> 22:47.620] So that's what makes it probabilistic.
[22:47.620 --> 22:50.380] There's a chance it'll give you a false positive.
[22:50.380 --> 22:55.660] And the scalable part here is you can say, I want to maintain a 1% false positive rate
[22:55.780 --> 23:01.060] and just grow as you need to, to meet that, or a 10% or a 0.1%, whatever you want to set
[23:01.060 --> 23:02.060] it to.
[23:02.060 --> 23:05.100] But the idea behind a bloom filter is they're super space efficient.
[23:05.100 --> 23:06.580] They're really tight.
[23:06.580 --> 23:10.340] And if you're interested in this idea of probabilistic data structures, there's an awesome book
[23:10.340 --> 23:14.820] that Andrii Gakhov wrote. Very good. https://www.amazon.de/-/en/Probabilistic-Data-Structures-Algorithms-Applications/dp/3748190484
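Here's a tiny bloom filter sketch matching the `add_item`/`is_member?` shape just described. It's a toy: fixed-size rather than scalable, backed by a set of bit positions instead of a real bitstring, and not Waterpark's implementation; the k hash positions come from `:erlang.phash2` with different seeds.

```elixir
# Toy fixed-size bloom filter: no false negatives, rare false positives.
defmodule TinyBloom do
  @bits 10_000   # size of the bit space
  @hashes 7      # k hash functions per item

  def new, do: MapSet.new()   # a set of positions stands in for a bitstring

  def add_item(bloom, item),
    do: Enum.reduce(positions(item), bloom, &MapSet.put(&2, &1))

  # True if every one of the item's k positions is set; a false positive
  # needs all k bits to collide with previously added items.
  def is_member?(bloom, item),
    do: Enum.all?(positions(item), &MapSet.member?(bloom, &1))

  defp positions(item),
    do: for(seed <- 1..@hashes, do: :erlang.phash2({seed, item}, @bits))
end

bloom =
  TinyBloom.new()
  |> TinyBloom.add_item("AL1-3.2: ASPIRIN")
  |> TinyBloom.add_item("DG1-3.1: 786.50")

IO.puts(TinyBloom.is_member?(bloom, "DG1-3.1: 786.50"))   # prints "true"
IO.puts(TinyBloom.is_member?(bloom, "SARS-CoV-2"))        # almost certainly "false"
```

The scalable variant from the talk additionally grows the bit space to hold a target false-positive rate, which this sketch does not do.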
[23:14.820 --> 23:16.100] But let's see how we can apply that.
[23:16.100 --> 23:19.420] So a bloom filter for Patient 1000 here.
[23:19.420 --> 23:22.340] So we could add `"AL1-3.2: Aspirin"`.
[23:22.340 --> 23:23.780] And we're going to add it as a string.
[23:23.780 --> 23:28.160] So why are we adding AL1 colon aspirin here?
[23:28.160 --> 23:35.740] And why are we adding `DG1-3.1: 786.50`, this chest pain unspecified code here?
[23:35.740 --> 23:38.280] And a zip code.
[23:38.280 --> 23:41.580] We could use this to build an advanced search.
[23:41.580 --> 23:47.260] So we could search by paths and say, give me all of the patients who have this diagnosis
[23:47.260 --> 23:52.100] code of chest pain unspecified, and you could just iterate across your bloom filters,
[23:52.740 --> 23:56.580] and you would get a hit for every patient that ever had that in one of their messages, which
[23:56.580 --> 23:58.180] would be really awesome.
[23:58.180 --> 24:01.420] So bloom filters are used by all the search engines, by the way.
[24:01.420 --> 24:04.100] And so we're using it in our search engine.
[24:04.100 --> 24:07.380] And so while we're doing this on this one pass, though, we're going to go ahead and
[24:07.380 --> 24:08.380] do two things:
[24:08.380 --> 24:12.000] On the left side, we're going to do this thing with the path colon string.
[24:12.000 --> 24:13.840] And on the right, we're just going to take the string.
[24:13.840 --> 24:16.720] So on the left, we get an advanced search.
[24:16.720 --> 24:22.680] And on the right, we get a full text search, all in exactly once processing of these messages,
[24:22.680 --> 24:24.800] which is pretty neat.
[24:24.800 --> 24:26.920] So we're then going to be able to do something with that like this:
[24:26.920 --> 24:29.920] We're going to ask `is_member?(patient1000_bloom, "PID-7: 19620910")` of this patient's bloom filter,
[24:29.920 --> 24:32.400] looking for this birth date.
[24:32.400 --> 24:34.040] And true, it found it.
[24:34.040 --> 24:38.520] And we're going to ask `is_member?(patient1000_bloom, "Birmingham")`, which is a full-text search.
[24:38.520 --> 24:40.560] And it found that.
[24:40.560 --> 24:44.080] And we're going to search for `is_member?(patient1000_bloom, "SARS-CoV-2")`.
[24:44.080 --> 24:46.280] And it's not going to find that.
[24:46.360 --> 24:49.160] Because COVID didn't exist then.
[24:49.160 --> 24:50.720] It wasn't a thing then.
[24:50.720 --> 24:55.160] And all of a sudden, COVID did exist and our world changed.
[24:55.160 --> 25:00.840] And that's when we got this: in three weeks, we need to take Waterpark to production.
[25:00.840 --> 25:03.160] And so the play stopped.
[25:03.160 --> 25:08.480] Which also meant that in three weeks, you'll never, ever be able to take downtime
[25:08.480 --> 25:11.060] again.
[25:11.060 --> 25:18.380] So continuous availability means no unplanned outages and no planned outages.
[25:18.380 --> 25:22.820] And to get that, you really have to follow this idea of no masters.
[25:25.500 --> 25:26.540] So what do I mean by that?
[25:26.540 --> 25:31.820] So here we have a pattern that's common.
[25:31.820 --> 25:34.660] We have a task router and then these three worker nodes.
[25:34.660 --> 25:40.060] It's a bad plan, because if the task router goes down, you're not doing any work anymore.
[25:40.060 --> 25:41.540] The system is down.
[25:41.540 --> 25:43.100] This is an even more common pattern.
[25:43.100 --> 25:46.620] You've got some worker nodes, workers out there doing things, and they have shared storage,
[25:46.620 --> 25:47.900] like a database.
[25:47.900 --> 25:48.900] The database goes down.
[25:48.900 --> 25:50.060] Well, you're toast.
[25:50.060 --> 25:52.380] Your whole system is down.
[25:52.380 --> 25:53.380] That's a bad plan.
[25:53.380 --> 25:58.260] But it's the one that the industry is stuck in, right now.
[25:58.260 --> 26:06.740] So on the left, we have roles-in-series, also known as N-tier architecture, and also
[26:06.740 --> 26:12.460] known as the worst high availability plan ever devised.
[26:12.460 --> 26:14.380] So what's the problem?
[26:14.380 --> 26:17.820] So we have this component with three nines (99.9% or 0.999) of availability.
[26:17.820 --> 26:20.900] That's going to give us 8.8 hours of downtime a year.
[26:20.900 --> 26:27.940] If we put this in series on the left, versus parallel on the right, we get quite different
[26:27.940 --> 26:28.940] results.
[26:28.940 --> 26:30.260] So it multiplies out.
[26:30.260 --> 26:35.780] We end up with 26 hours of downtime a year by having these things in series.
[26:35.820 --> 26:41.060] And on the right, if any one of those is up, we're able to process.
[26:41.060 --> 26:47.140] And so we get nine nines of availability on that side: about 32 milliseconds of downtime per year.
[26:47.140 --> 26:49.700] So we liked the one on the right.
[26:49.700 --> 26:53.940] And so we went with that option, no masters, and we want to avoid our single points of
[26:53.940 --> 26:54.940] failure.
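The arithmetic behind that slide is easy to check for yourself. This is just the math from the talk in a few lines of Python, not anyone's implementation:

```python
HOURS_PER_YEAR = 24 * 365  # 8760
a = 0.999                  # three nines per component

# Roles in series: every component must be up, so availabilities multiply.
series = a ** 3
series_downtime_hours = (1 - series) * HOURS_PER_YEAR          # ~26.3 h/yr

# Roles in parallel: the system is up if any one of the three is up.
parallel = 1 - (1 - a) ** 3                                    # nine nines
parallel_downtime_ms = (1 - parallel) * HOURS_PER_YEAR * 3600 * 1000

print(f"one component: {(1 - a) * HOURS_PER_YEAR:.2f} h/yr down")   # ~8.8 h
print(f"series of 3:   {series_downtime_hours:.2f} h/yr down")
print(f"parallel of 3: {parallel_downtime_ms:.1f} ms/yr down")      # ~31.5 ms
```

Chaining three-nines components in series nearly triples the downtime; putting them in parallel shrinks it by five orders of magnitude.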
[26:54.940 --> 26:55.940] So we liked three servers.
[26:55.940 --> 26:56.940] But we liked eight better.
[26:58.900 --> 27:01.900] And we thought, if we have eight, let's go ahead and divide those.
[27:02.020 --> 27:05.700] So we'll put a little line in the middle, and we'll create these availability zones,
[27:05.700 --> 27:14.180] so that in the data centers, if power fails on a series of racks, well, only
[27:14.180 --> 27:15.780] four of the servers will be affected.
[27:15.780 --> 27:21.860] If a network core dies, if a truck drives into the building, if a fire happens, whatever,
[27:21.860 --> 27:23.700] we'll be protected, and we'll have these availability zones.
[27:23.700 --> 27:25.360] And we thought that would be good.
[27:25.360 --> 27:29.060] And so we also thought it'd be good if we had another level here.
[27:29.220 --> 27:34.380] And so we put some in Florida, some in Tennessee, some in Texas, and some in Utah, so this spans
[27:34.380 --> 27:35.380] the United States.
[27:35.380 --> 27:36.980] We've got these servers there.
[27:36.980 --> 27:40.040] And every one of these servers is an exact peer.
[27:40.040 --> 27:41.260] They're absolutely identical.
[27:41.260 --> 27:44.940] There's no difference between the server at the top left and the one at the bottom right,
[27:44.940 --> 27:46.580] other than its name.
[27:46.580 --> 27:51.700] That's the only thing that breaks symmetry across these servers.
[27:51.700 --> 27:52.880] Absolutely the same capabilities.
[27:52.880 --> 27:58.820] And so they are the roles in parallel.
[27:59.580 --> 28:00.580] So process pairs.
[28:00.580 --> 28:05.500] This idea of process pairs was used by Jim Gray in the design of the fault-tolerant
[28:05.500 --> 28:06.500] Tandem computers. ("Why Do Computers Stop and What Can Be Done About It?")
[28:06.500 --> 28:13.260] And so this process pairs idea was adopted early on in the design of Erlang.
[28:13.260 --> 28:16.340] And so here's Joe Armstrong.
[28:16.340 --> 28:20.220] Who here, by chance, met him when he was at NDC Oslo, when we had a functional programming
[28:20.220 --> 28:21.220] track?
[28:21.220 --> 28:22.220] Okay, awesome.
[28:22.220 --> 28:23.220] That's super.
[28:23.220 --> 28:24.220] Okay.
Fault-tolerance is achieved like this:
If a machine crashes, the failure is detected by another machine in the network.
The machine (or machines) detecting the failure must have sufficient data to take over
from the machine that crashed and continue with the application.
Users should not notice the failure.
By Joe Armstrong
Communications of the ACM
September 2010, Vol. 53 No. 9, Pages 68-75
10.1145/1810891.1810910
[28:24.220 --> 28:28.100] To be fault-tolerant, nodes in a cluster, they must be able to detect other nodes crashing.
[28:28.100 --> 28:30.380] You remember those links and monitors?
[28:30.380 --> 28:31.980] So there we go.
[28:31.980 --> 28:35.940] And they must have sufficient data to take over for a downed node.
[28:35.940 --> 28:41.160] And users should not notice the failure.
[28:41.160 --> 28:43.800] So how do we manage that in waterpark?
[28:43.800 --> 28:51.100] So if we spin up an actor on Florida A2 (FLA2), we want to have a read replica of it somewhere.
[28:51.100 --> 28:53.380] And so this is our backup for it.
[28:53.380 --> 28:57.300] But we also are going to create another backup at each data center.
[28:57.300 --> 29:02.900] So each actor is going to have a read replica at each of the data centers.
[29:02.900 --> 29:08.180] What that means is that when a server just catches on fire, we'll be able to spin that
[29:08.180 --> 29:10.360] actor up again somewhere else.
[29:10.360 --> 29:15.980] That actor will restart somewhere, and it's going to rehydrate from its buddies.
[29:15.980 --> 29:16.980] Okay?
[29:16.980 --> 29:20.180] So say it rehydrated onto a place that's about to get hit by an asteroid.
[29:20.180 --> 29:23.620] And unfortunately, we just lost the Tennessee data center and all the servers there,
[29:24.620 --> 29:28.060] including that Patient Actor process.
[29:28.060 --> 29:30.060] And my home is not too far from that data center.
[29:30.060 --> 29:31.660] So maybe I'm gone too.
[29:31.660 --> 29:35.600] But it's okay for everyone using the system because I don't need to be there and that
[29:35.600 --> 29:37.260] data center doesn't need to be there.
[29:37.260 --> 29:41.380] And no one even needs to wake up unless they want to tell my family: "Sorry".
[29:41.380 --> 29:44.260] Because the system is going to keep on chugging.
[29:44.260 --> 29:45.540] It's not going to lose anything.
[29:45.540 --> 29:51.380] And so even if someone sent a message to that server and got an ACK back, and one second
[29:51.420 --> 29:55.660] later that asteroid hit and vaporized the data center,
[29:55.660 --> 30:00.420] that message was not lost.
[30:00.420 --> 30:03.740] And so you think about how many systems can make that guarantee.
[30:03.740 --> 30:04.740] Not many.
[30:04.740 --> 30:10.740] We make that guarantee on every message that comes into Waterpark.
[30:10.740 --> 30:14.440] Okay?
[30:14.440 --> 30:16.900] So we get to an interesting thing.
Key Routing Needs:
- All servers must agree on where an actor lives
- Use math instead of a centralized distributed registry
- Minimize churn when servers are added or removed
- Fairly distribute our actors across the cluster
[30:16.900 --> 30:19.700] How do we know where those processes should live?
[30:19.700 --> 30:23.100] What servers they should be on?
[30:23.100 --> 30:24.740] Key routing.
[30:24.740 --> 30:30.140] So with key routing, we know all servers must agree on where an actor lives.
[30:30.140 --> 30:33.060] And we're going to use math instead of a centralized registry.
[30:33.060 --> 30:34.060] Centralized registries are expensive.
[30:34.060 --> 30:37.380] And we're going to minimize churn when things are added and removed.
[30:37.380 --> 30:39.900] And we want to fairly distribute our actors across the cluster.
[30:39.900 --> 30:41.700] These are our goals in key routing.
[30:41.700 --> 30:47.180] So we're going to start off here with a simple hash, Erlang's `:erlang.phash2`.
[30:47.180 --> 30:53.860] And we're going to say node count of eight, got a cluster there represented.
[30:53.860 --> 30:57.180] We're going to put a key in here and we're going to see where it maps to.
[30:57.180 --> 31:04.180] And it's always going to map to slice 6 there, assuming we have 8 slices.
Given 8 slices, "actor-key-1000" will always map to slice 6:
iex> node_count = 8
8
iex> :erlang.phash2("actor-key-1000", node_count)
6
[31:04.180 --> 31:11.180] The problem comes in, though, if we have seven slices: when we map the same key, the answer changes.
iex> node_count = 7
7
iex> :erlang.phash2("actor-key-1000", node_count)
3
[31:11.180 --> 31:12.180] It's going to move over there.
[31:12.180 --> 31:14.380] So that's not good, right?
[31:14.380 --> 31:21.180] So really not good, really bad, actually, because we have all this churn on the cluster
[31:21.180 --> 31:23.300] and we don't want churn on the cluster.
[31:23.300 --> 31:28.140] We can't have that because that's expensive on a RAM-based system, especially.
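You can see how bad modulo-style churn is with a quick sketch. Here SHA-256 is a deterministic stand-in for `:erlang.phash2`; with uniform hashing, shrinking from 8 nodes to 7 remaps roughly 7 out of every 8 keys:

```python
import hashlib

def stable_hash(key: str) -> int:
    # Deterministic stand-in for :erlang.phash2 (any uniform hash works).
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

keys = [f"actor-key-{i}" for i in range(10_000)]

# A key stays put only when hash % 8 happens to equal hash % 7.
moved = sum(1 for k in keys if stable_hash(k) % 8 != stable_hash(k) % 7)
print(f"{moved / len(keys):.0%} of keys move when going from 8 nodes to 7")
```

On a RAM-based system, remapping nearly every actor on any topology change is exactly the churn you cannot afford.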
[31:28.140 --> 31:33.300] So we have a solution for this, another paper we love, consistent hashing.
Our caching protocols are based on a special kind of hashing that we call consistent hashing. Roughly speaking, a consistent hash function is one which changes minimally as the range of the function changes. Through the development of good consistent hash functions, we are able to develop caching protocols which do not require users to have a current or even consistent view of the network. We believe that consistent hash functions may eventually prove to be useful in other applications such as distributed name servers and/or quorum systems.
[31:33.300 --> 31:37.300] So there's a part of this, I think, that's interesting, down here at the bottom.
[31:37.300 --> 31:41.260] So back in 97, this was written, it says, may eventually prove to be useful in name
[31:41.260 --> 31:43.160] servers and quorum systems.
[31:43.160 --> 31:45.320] And so they were right, it's useful.
[31:45.320 --> 31:50.680] And so here's the consistent hashing lookup we're going to use:
iex> nodes = ["s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8"]
iex> KeyRouting.key_to_node("actor-key-1001", nodes)
"s7"
[31:50.680 --> 31:54.360] And we had used consistent hashing for years in Waterpark (we've just recently switched),
[31:54.360 --> 31:58.240] and here's what it looks like with our nodes.
[31:58.240 --> 32:04.320] The key is always going to live there, even when we drop nodes.
iex> nodes = ["s1", "s2", "s3", "s6", "s7", "s8"]
iex> KeyRouting.key_to_node("actor-key-1001", nodes)
"s7"
[32:04.320 --> 32:06.920] So it's going to stay there anyway.
[32:06.920 --> 32:08.480] That's good.
[32:08.480 --> 32:13.240] So we introduced rendezvous hashing this year, and we switched over to it. Here's the
[32:13.240 --> 32:15.400] paper on it: [A Name-Based Mapping Scheme for Rendezvous](https://www.eecs.umich.edu/techreports/cse/96/CSE-TR-316-96.pdf)
[32:15.400 --> 32:20.840] And the reason that we did this is it solves a problem.
[32:20.840 --> 32:23.960] It solves two problems, actually. One's about fairness.
[32:23.960 --> 32:28.120] Consistent hashing reduces churn, but it's not entirely fair.
[32:28.120 --> 32:32.280] You might end up with more processes on one server than on another.
[32:32.280 --> 32:37.600] So you might end up with an imbalance like this (not this bad), when we wanted an even spread like this, and rendezvous hashing gives us that.
[32:37.600 --> 32:39.480] Another thing is it's super simple.
[32:39.480 --> 32:43.000] So we wrote our rendezvous hashing algorithm, and as a result, we were able to drop a library
[32:43.000 --> 32:46.440] dependency that had a couple thousand lines of complex code in it, that was kind of hard
[32:46.440 --> 32:50.640] to understand, and instead we replaced it with a dozen lines of code, because rendezvous
[32:50.640 --> 32:56.480] hashing's easy, simple, and fast.
[32:56.480 --> 32:58.720] And so it works like this.
[32:58.720 --> 33:03.040] So we've got our same ring there, shown for the visual.
[33:03.040 --> 33:10.840] And a score is going to be computed for the key against each server in that ring.
[33:10.840 --> 33:16.600] The server scoring the key highest is going to own that key.
[33:16.600 --> 33:20.280] And so then if those others go away, there's no reason for it to move.
[33:20.280 --> 33:23.540] The only things that are ever going to move are the things that were on S2 and S3.
[33:23.540 --> 33:28.040] They're going to have to move somewhere else.
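A minimal highest-random-weight (rendezvous) sketch, here in Python rather than Waterpark's Elixir, shows why it comes out to about a dozen lines:

```python
import hashlib

def score(key: str, node: str) -> int:
    # Deterministic score of this key for this node.
    return int.from_bytes(
        hashlib.sha256(f"{node}|{key}".encode()).digest()[:8], "big")

def key_to_node(key: str, nodes: list) -> str:
    # Highest-random-weight: the node scoring the key highest owns the key.
    return max(nodes, key=lambda n: score(key, n))

nodes = ["s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8"]
owner = key_to_node("actor-key-1001", nodes)

# Removing any *other* node never moves this key; only keys owned by a
# removed node move, and those spread across the survivors.
for gone in (n for n in nodes if n != owner):
    assert key_to_node("actor-key-1001", [n for n in nodes if n != gone]) == owner
```

The owner this sketch picks won't match the talk's `"s7"`, since the scoring hash is different; the stability and fairness properties are what carry over.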
[33:28.040 --> 33:34.160] So on a running system, again, we went from consistent hashing to rendezvous
[33:34.160 --> 33:38.720] hashing, kind of like this, where we kept upping the percentage: we took one
[33:38.720 --> 33:42.880] percent of our writers and put them on rendezvous, then two,
[33:42.880 --> 33:43.880] three, four, five.
[33:43.880 --> 33:48.360] And so by the end of it, we went through a bit of churn as we changed algorithms, and
[33:48.360 --> 33:51.960] on the other side of it, we were entirely on rendezvous.
[33:51.960 --> 33:56.600] And we have just perfect balance now across our servers.
[33:56.600 --> 33:58.160] So that's good.
[33:58.160 --> 34:00.280] We checked some boxes here.
[34:00.280 --> 34:09.240] But we still have to wonder about where these actors live, and so this gets us to topology.
[34:09.240 --> 34:15.960] So we ask our topology to get our servers, and it's going to give us all of our servers.
iex> Topology.get_servers()
["FLA1", "FLA2", "FLA3", "FLA4",
"FLB1", "FLB2", "FLB3", "FLB4",
"TNA1", "TNA2", "TNA3", "TNA4",
"TNB1", "TNB2", "TNB3", "TNB4",
"TXA1", "TXA2", "TXA3", "TXA4",
"TXB1", "TXB2", "TXB3", "TXB4",
"UTA1", "UTA2", "UTA3", "UTA4",
"UTB1", "UTB2", "UTB3", "UTB4"]
[34:15.960 --> 34:20.240] We ask topology to get our data centers, and it's going to give us our list of data centers.
iex> Topology.get_data_centers()
["FL", "TN", "TX", "UT"]
[34:20.240 --> 34:26.240] We say, give us the servers at Tennessee, and it gives us that list of servers.
iex> Topology.get_dc_servers("TN")
["TNA1", "TNA2", "TNA3", "TNA4",
"TNB1", "TNB2", "TNB3", "TNB4"]
[34:26.240 --> 34:31.580] We say, let's map a key to a server, and we're going to pass in this
[34:31.580 --> 34:40.760] patient key here, patient 1001, in the writer role, and it's going to say where it lives.
iex> Topology.key_to_server(%{id: "p1001", role: :writer})
"FLA2"
[34:40.760 --> 34:45.280] And the math is going to be the same on every server, because the topology is the same.
[34:45.280 --> 34:50.760] If they all have the same list of servers in their brain, then they're all going to say `FLA2`.
[34:52.920 --> 34:58.600] So we wanted a globally consistent process registry, but that was far too expensive.
[34:58.600 --> 35:02.600] The solution? Math!
[35:02.600 --> 35:04.420] But we still have a problem, though.
[35:04.420 --> 35:07.320] We do need all of our servers to agree on the topology.
[35:07.320 --> 35:10.560] They all need to have the same map of the world, and if they don't agree, there's still
[35:10.560 --> 35:12.960] going to be some churn.
[35:12.960 --> 35:16.400] It'll sort itself out to a degree, because most of them should agree on what the world
[35:16.400 --> 35:19.080] looks like, but unless we enforce it, it won't fully converge.
[35:19.080 --> 35:23.880] So we thought, okay, let's put in strong consensus on just topology.
[35:23.880 --> 35:30.120] But consensus is expensive, and this paper by Heidi Howard talks about why:
Distributed consensus revised, by Heidi Howard
Paxos is widely deployed in production systems, yet it is poorly understood, and it proves to be heavyweight, unscalable, and unreliable in practice.
[35:30.120 --> 35:35.100] Paxos is widely deployed in production, poorly understood, proves to be heavyweight, unscalable,
[35:35.100 --> 35:39.320] and unreliable in practice, and this was our experience as well.
[35:39.320 --> 35:45.860] Even Raft, the much lighter-weight alternative that people use (Raft is basically an implementation
[35:45.860 --> 35:51.260] of Paxos that leaves some things off, and it's much simpler as a result),
[35:51.260 --> 35:56.820] was also too heavy and couldn't deal with faults in the way that we needed.
[35:56.820 --> 36:02.240] And so, by the way, here's a [QR code](https://pwlconf.org/2019/heidi-howard/) for a [really excellent talk by Heidi](https://www.youtube.com/watch?v=Pqc6X3sj6q8) here at Papers We Love.
[36:04.140 --> 36:10.900] And so we realized, we had this aha moment, actually watching her talk, and we don't need a consensus algorithm.
[36:12.400 --> 36:15.500] We can actually take less than that: just the part we need, and that is leader election.
[36:16.980 --> 36:18.060] That's all we really need.
[36:18.060 --> 36:21.660] We need that one tiny part of a consensus algorithm.
[36:21.660 --> 36:27.660] And so our leader, we can say, topology key to server, and we just pass it some arbitrary string like "the-boss".
iex> Topology.key_to_server("the-boss")
"TNB1"
[36:30.060 --> 36:34.460] And whichever server comes back, that's going to be our leader candidate; you just route
[36:34.460 --> 36:35.460] it through the same key routing.
[36:35.460 --> 36:39.300] And so if everyone has the same servers, they're all going to agree that the boss is Tennessee B1
[36:39.300 --> 36:44.700] and if they don't have the same list of servers, then maybe some node will have a wrong view,
[36:44.900 --> 36:46.300] but we're not counting on that.
[36:46.300 --> 36:49.540] We're going to have a leader election algorithm here.
[36:49.540 --> 36:55.240] And so election starts, the non-candidate nodes vote.
[36:55.240 --> 36:59.740] So the ones that don't think they're the boss, vote for who they think is the boss.
[36:59.740 --> 37:04.020] And the one that thinks it's going to be the boss sits there and it waits for a majority
[37:04.020 --> 37:07.500] of messages to come in saying, "yeah, you're the boss".
[37:07.500 --> 37:12.700] And then on majority, it's going to claim victory, and it's going to tell even the ones
[37:12.700 --> 37:15.220] that didn't vote for it, "I'm your boss, and you're going to have to listen to me from now on."
[37:18.340 --> 37:21.900] And then the leader then shoves its topology out to all of them.
[37:21.900 --> 37:23.940] So they all then have a consistent topology.
[37:23.940 --> 37:26.100] They all have the same map of the world.
[37:26.100 --> 37:29.100] So we just maintain consensus on topology.
[37:29.100 --> 37:33.140] We use that to route to the proper local process registry.
[37:33.140 --> 37:37.700] So local process registry is fine, cheap, not hard to synchronize.
[37:37.700 --> 37:41.020] We just use the topology to get us there.
[37:41.020 --> 37:43.580] And now we don't need a global registry.
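A hypothetical sketch of that election, with `key_to_server` and `elect` as illustrative stand-ins for the real code (in Waterpark the candidate waits for vote messages to arrive; here the tally is collapsed into one function for brevity):

```python
import hashlib

def key_to_server(key, servers):
    # Deterministic routing: highest hash score wins (rendezvous-style).
    return max(servers,
               key=lambda s: hashlib.sha256(f"{s}|{key}".encode()).digest())

def elect(views):
    """views: each node's (possibly stale) list of servers.
    Each node votes for whoever its own topology says "the-boss" maps to;
    a candidate wins only with a strict majority of votes."""
    votes = {}
    for view in views:
        candidate = key_to_server("the-boss", view)
        votes[candidate] = votes.get(candidate, 0) + 1
    winner, count = max(votes.items(), key=lambda kv: kv[1])
    return winner if count > len(views) // 2 else None

servers = ["TNB1", "FLA2", "TXA3", "UTB4", "FLB1"]
# When all five nodes share the same topology, they elect the same boss.
assert elect([servers] * 5) == key_to_server("the-boss", servers)
```

Because the candidate is derived from the same routing math everyone already runs, a node with a stale view votes "wrong" without breaking the majority.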
[37:43.580 --> 37:45.220] So, next up:
[37:45.220 --> 37:47.740] Location transparency.
[37:47.740 --> 37:52.420] So "However it works in the distributed case is how it should work locally", is another Joe Armstrong quote.
[37:53.900 --> 37:56.140] So super powerful idea on the VM.
[37:56.140 --> 38:04.100] And it's another one of the reasons that we used the BEAM or the VM for Waterpark.
[38:04.100 --> 38:05.440] And here's what it looks like here.
iex> Mailroom.deliver_to_actor(actor_key, message)
[38:05.440 --> 38:08.900] We just say mailroom, deliver to actor, we give it a key and a message, and we don't
[38:08.900 --> 38:10.760] care what server it's on.
[38:10.760 --> 38:14.720] It's up to topology and key routing to decide.
[38:14.720 --> 38:21.180] So that brings us to the mailroom.
[38:21.180 --> 38:26.280] So in Waterpark, each node is a peer, as we said, and any node can receive the message
[38:26.280 --> 38:28.360] so it can hit any box in the load balancer.
[38:28.360 --> 38:32.740] No one on the outside even knows or cares about the servers, hashes, registries, etc.
[38:32.740 --> 38:34.800] And each node has a mailroom.
[38:34.800 --> 38:36.920] The mailroom knows the topology.
[38:36.920 --> 38:39.720] And when inbound messages come in, they're going to be routed through the mailroom to
[38:39.800 --> 38:41.680] the appropriate Patient Actor.
[38:41.680 --> 38:46.000] And if the patient is remote, then the mailroom's going to route it, send that over to the mailroom
[38:46.000 --> 38:47.840] on the server it thinks it should be on.
[38:47.840 --> 38:49.920] It's going to go over there.
[38:49.920 --> 38:54.800] And if the topology changed in flight, the mailroom will reroute that to the correct
[38:54.800 --> 38:58.560] mailroom, and that local mailroom will then deliver the message to the correct patient
[38:58.560 --> 38:59.840] actor.
[38:59.840 --> 39:01.200] And it'll be applied.
[39:01.200 --> 39:03.880] So that's a visual of how that goes.
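Here's a hedged sketch of that routing flow. The names (`Mailroom`, `deliver_to_actor`) mirror the talk, but the logic is only an illustration of the rehop behaviour described, not Waterpark's implementation:

```python
import hashlib

class Cluster:
    def __init__(self, names):
        self.rooms = {name: Mailroom(name, self) for name in names}

    def route(self, key):
        # Same deterministic math on every node (rendezvous-style).
        name = max(self.rooms,
                   key=lambda n: hashlib.sha256(f"{n}|{key}".encode()).digest())
        return self.rooms[name]

class Mailroom:
    def __init__(self, name, cluster):
        self.name, self.cluster = name, cluster
        self.delivered = []  # messages applied by local Patient Actors

    def deliver_to_actor(self, actor_key, message, hops=0):
        assert hops < 3, "routing should converge in a hop or two"
        target = self.cluster.route(actor_key)
        if target is self:
            self.delivered.append((actor_key, message))   # local delivery
        else:
            target.deliver_to_actor(actor_key, message, hops + 1)  # forward

cluster = Cluster(["FLA1", "TNA1", "TXA1"])
# A message may enter at any node and still reaches the owning actor:
cluster.rooms["FLA1"].deliver_to_actor("p1001", "ADT-A01")
assert cluster.route("p1001").delivered == [("p1001", "ADT-A01")]
```

The caller never names a server; a topology change between hops just means one extra forward before the message lands.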
[39:03.880 --> 39:06.400] So we get some really nice things out of this mailroom:
A mailroom provides
- a nice seam for testing
- a place to bundle data
- a spot to compress data
- a way to hide icky bits (:rpc)
- a path to distribution replacements
[39:06.400 --> 39:14.440] We get a nice seam for testing, a place to bundle data, a spot to compress data, and a way to hide the icky bits.
[39:14.440 --> 39:22.720] And in the future, we could change our distribution model from Distributed Erlang to gRPC or to anything, really.
[39:22.720 --> 39:31.840] But this testing seam is really nice as well, because instead of having to do integration tests, you can do unit tests with seams and mocks.
[39:31.840 --> 39:36.120] Joe Armstrong had this thing that he would say every time you met him, or basically every
[39:36.120 --> 39:39.400] time he talked about anything: "state your invariants".
[39:39.400 --> 39:41.200] And this is the first time I met Joe.
[39:41.200 --> 39:46.440] We'd been on podcasts together before and got to be friends online, but this is the
[39:46.440 --> 39:47.440] first time I met him.
[39:47.440 --> 39:51.200] And we spent probably two hours talking about stating your invariants and what all
[39:51.200 --> 39:52.240] was meant by that.
[39:52.240 --> 39:55.880] So in Waterpark, we want to state our Patient Actor invariants.
[39:55.880 --> 40:00.780] Every Patient Actor is registered (by key) on its node's registry.
[40:00.780 --> 40:02.900] Every Patient Actor is supervised.
[40:02.900 --> 40:05.060] Commands are delivered to the Patient Actor via the mailroom.
[40:05.060 --> 40:09.980] If the Patient Actor is not running when a command is delivered, it will be started.
[40:09.980 --> 40:13.900] If Topology has changed and a Patient Actor's key no longer maps to its current node, it will migrate to the correct node.
[40:16.460 --> 40:21.660] Every Patient Actor has one read only follower (process pair) at each data center.
[40:22.660 --> 40:28.660] A Patient Actor processes commands and emits events.
[40:28.660 --> 40:32.900] Before a Patient Actor commits an event to its event log, two of its four read-only followers must acknowledge the receipt of the event.
[40:34.700 --> 40:39.700] When a Patient Actor starts, it will ask (via the mailroom) if its four read-only followers have state.
[40:40.700 --> 40:43.700] If they do, the Patient Actor will recover from the "best" reader.
[40:43.700 --> 40:48.700] Each Patient Actor's state contains its key, an event store, and a Map for the event handler plugins to store projections.
[40:51.260 --> 40:53.380] So we could visualize that really quickly.
[40:53.380 --> 40:54.380] So it spins up.
[40:54.380 --> 40:56.660] This is going to be brand new.
[40:56.660 --> 40:57.940] These readers didn't even exist.
[40:57.940 --> 40:59.820] And so they're all going to return new back there.
[40:59.820 --> 41:04.820] And so then the writer knows that it was a new writer and not a crashed writer.
[41:04.820 --> 41:07.980] And then it's going to process that message, and it's going to then send that across the cluster to all of the readers.
[41:11.300 --> 41:14.980] And after two of them have returned `:ok`, it's going to go ahead and return an `:ok` back to the client.
[41:16.620 --> 41:20.620] So we can lower latency, we can be more efficient.
[41:20.620 --> 41:23.600] And then these other two are going to trickle in, but we don't have to wait for those.
[41:23.600 --> 41:24.820] Because we already have our guarantee.
[41:24.820 --> 41:30.700] We already have our asteroid protection, because we've written it to two data centers.
[41:30.700 --> 41:35.860] If you're interested in this, you might dig into the [Dynamo paper](https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf); we're a quorum system like this.
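The write-quorum invariant above can be sketched like this. This is illustrative Python; the real system is Elixir processes whose acks race in parallel, while here they're checked in a simple loop:

```python
def commit(event, followers, quorum=2):
    """Writer-side quorum: ack the client only once `quorum` of the
    cross-data-center read replicas have acknowledged the event."""
    acks = 0
    for follower in followers:       # in reality these acks arrive in parallel
        if follower(event):          # follower stored the event and acked
            acks += 1
            if acks >= quorum:
                return True          # safe: event now lives in 2 data centers
    return False                     # quorum not reached; do not ack the client

ok_follower = lambda event: True     # healthy replica
down_follower = lambda event: False  # unreachable replica

# Two healthy replicas suffice, even with two data centers unreachable:
assert commit({"type": "lab_result"},
              [down_follower, ok_follower, down_follower, ok_follower])
# With fewer than two healthy followers, the writer must not ack the client:
assert not commit({"type": "lab_result"},
                  [down_follower, ok_follower, down_follower, down_follower])
```

The remaining two acks trickle in afterwards, which is why the client sees low latency without giving up the asteroid guarantee.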
[41:35.860 --> 41:37.860] Deployments.
[41:37.860 --> 41:41.340] How do we deploy changes to the cluster?
[41:41.340 --> 41:46.380] Our rolling deployment scheme was originally our chaos testing scheme.
[41:46.380 --> 41:51.660] So we would just bounce nodes and see how well we healed after things happened.
[41:51.660 --> 41:55.540] And so we just added the extra step of putting a release out there.
[41:55.540 --> 41:59.580] And we continued to then just basically do this brutal rolling.
[41:59.580 --> 42:02.300] And it's kept us honest, and it's just a way of deploying.
[42:02.300 --> 42:08.700] We also have another way, a superpower of the Erlang VM called "hot code loading".
[42:08.700 --> 42:13.500] And so here we have BEAM modules, these four modules of code that are out there, and these
[42:13.500 --> 42:15.740] three processes running that code.
[42:15.740 --> 42:19.220] All we have to do if we're going to change this third module, we just change it, we put
[42:19.220 --> 42:20.900] a new BEAM file out there.
[42:20.900 --> 42:23.020] Nothing has to be bounced or restarted.
[42:23.020 --> 42:28.740] The next function call into that module is going to pick up the new code.
[42:28.740 --> 42:30.740] And it's just a crazy superpower of the Erlang VM.
[42:30.740 --> 42:40.100] And if it weren't for that, I can't imagine that we would have been up these five years.
[42:40.100 --> 42:41.980] And this is what that looks like.
[42:41.980 --> 42:42.980] Boop.
[42:42.980 --> 42:45.580] And so it's all immediately done.
[42:45.580 --> 42:49.540] Every node gets it at the same time, and it's done with basically this code:
def push_module(node, module) do
  # :code.get_object_code/1 returns {Module, Binary, Filename}
  {^module, bin, beam_path} = :code.get_object_code(module)
  # :code.load_binary/3 takes the module, its path, and the object code
  Mailroom.call(node, :code, :load_binary, [module, beam_path, bin])
end
[42:49.540 --> 42:50.540] Whoops.
[42:50.540 --> 42:53.620] It's done with basically this code: we get the object code, and then we load that
[42:53.620 --> 42:57.580] object code onto each node.
https://joearms.github.io/published/2013-11-21-My-favorite-erlang-program.html
[42:57.580 --> 43:01.660] Universal servers, Joe Armstrong's favorite program.
[43:01.660 --> 43:09.140] So his idea of the universal server was, you have a server that'll take commands.
[43:09.140 --> 43:13.300] It doesn't do anything by itself, but it'll take commands on how to become an FTP server, or how to
[43:13.300 --> 43:19.700] become an HTTP server, or whatever you give it the commands to become.
[43:19.700 --> 43:23.820] So as COVID hit and Waterpark went to production, we knew we needed a way to turn the patient
[43:23.820 --> 43:28.140] actors into universal servers, that could be extended and changed dynamically at runtime,
[43:28.140 --> 43:29.380] with no downtime.
[43:29.380 --> 43:31.140] So in three weeks, this is what we did.
[43:31.140 --> 43:33.220] We implemented this subsystem of event handler plugins.
[43:33.220 --> 43:35.380] It looks like this.
# Patient Actor
%{
key: %{id: "1001", facility: "XYZ"},
projections: %{...},
event_store: [...]
}
[43:35.380 --> 43:38.820] We got this endless event stream that comes through, and they're going to flow into our Patient Actors.
[43:39.820 --> 43:44.380] And we're going to have this list of our event handler plugins where use cases are going to come in.
[43:45.540 --> 43:54.580] And each use case that's there has a contract: you tell it to handle or apply
[43:54.580 --> 44:01.060] the message, and it takes the key of the actor, the existing projections, the message
[44:01.060 --> 44:03.460] it's supposed to apply, and a pointer back to the entire message history that it can reach into.
[44:03.460 --> 44:12.400] And whatever its logic is, it's going to infer meaning from that new message and the old
[44:12.400 --> 44:13.720] state, and build new projections that it's going to return.
[44:13.720 --> 44:16.880] It's also going to return side effects.
@spec handle(
key :: map(),
projections :: projections(),
msg :: Msg.t(),
history :: [Message.t()]
) :: {:ok, projections(), side_effects()}
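A hypothetical Python rendering of that contract, with a made-up allergy handler as the use case (the field names and the `update_allergy_banner` side effect are illustrative, not a real plugin):

```python
def handle(key, projections, msg, history):
    """Apply one message; return ("ok", new projections, side effects).
    `history` is the full message log, available if the handler needs it."""
    side_effects = []
    if msg.get("type") == "allergy":
        # Build a new projection rather than mutating the old one.
        allergies = projections.get("allergies", []) + [msg["substance"]]
        projections = {**projections, "allergies": allergies}
        side_effects.append(("update_allergy_banner", key["id"]))
    return ("ok", projections, side_effects)

status, proj, effects = handle(
    {"id": "1001", "facility": "XYZ"},
    {},                                           # no prior projections
    {"type": "allergy", "substance": "penicillin"},
    [],                                           # history unused here
)
assert status == "ok" and proj["allergies"] == ["penicillin"]
assert effects == [("update_allergy_banner", "1001")]
```

Because the handler is a pure function of key, projections, message, and history, new use cases can be hot-loaded without touching the actor itself.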
[44:16.880 --> 44:21.400] So much of the talk has been about patterns of "how?". In this segment, we're going to focus on why it matters.
[44:22.840 --> 44:29.380] So early in the pandemic, we deployed a Patient Actor plug-in that would immediately
[44:29.380 --> 44:34.760] notify caseworkers any time a patient who had transferred from a nursing home
[44:34.760 --> 44:38.160] came in and then tested positive for COVID.
[44:38.160 --> 44:44.680] And this allowed early contact tracing and quarantines, and reduced the virus spread within this vulnerable community.
[44:44.680 --> 44:49.040] So this is what that looked like.
Simplified Rules:
- Is admission source nursing home or long-term care?
- Is there a COVID-19 lab order?
- Do we have a result?
- Is the result "positive"?
- Case manager informed?
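Those simplified rules can be sketched as one pure function over the message history. The field names here are illustrative, not real HL7 parsing:

```python
def evaluate(history):
    """Re-evaluate the rule chain as each message trickles in."""
    admitted_from_ltc = any(m.get("admission_source") == "nursing_home"
                            for m in history)
    covid_order = any(m.get("order") == "COVID-19" for m in history)
    result = next((m["result"] for m in history if "result" in m), None)
    if admitted_from_ltc and covid_order and result == "positive":
        return "alert_case_manager"
    return None  # some rule not yet satisfied; keep waiting

history = [{"admission_source": "nursing_home"}]
assert evaluate(history) is None                    # no lab order yet
history.append({"order": "COVID-19"})
assert evaluate(history) is None                    # ordered, no result yet
history.append({"result": "positive"})
assert evaluate(history) == "alert_case_manager"    # notify immediately
```

Each new message just re-runs the checks against the accumulated history, which is exactly how a plugin riding the event stream behaves.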
[44:49.040 --> 44:56.640] So we would go through, and these messages are coming in hours and hours apart, and these digital twins are out there representing patients.
[44:56.640 --> 45:06.840] And so we see this lab order came in, it's a COVID lab order, and we do have a result back, and in our code, maybe right after that, we ask: is it positive?
[45:08.560 --> 45:15.200] And if so, we send a long-term care facility alert to notify the case manager, and then we update that.
[45:15.200 --> 45:25.880] So, my friend Joe, he passed away in April of 2019.
[45:25.880 --> 45:34.800] And I think it's just amazing that about a year later, his ideas are out there making nursing homes throughout the United States safer than they would have been otherwise.
[45:34.800 --> 45:40.240] So I just go, wow, you know, what an amazing guy. Thanks, Joe.
[45:41.240 --> 45:55.080] I'm going to talk just real quickly, we have just a few minutes left, about the big rewrite.
[45:57.040 --> 46:03.360] So as I mentioned earlier, we rushed into production long before we intended to, and that left
[46:03.720 --> 46:07.640] some technical debt, and it made many things that we wanted to do harder than they should have been.
[46:09.120 --> 46:17.160] And so how do you do a big rewrite, though, on a zero downtime system? On a RAM-based system?
[46:26.000 --> 46:31.580] So we opted, instead of that dangerous maneuver there, to use the strangler pattern.
[46:31.580 --> 46:33.940] And so with that, we basically did this.
[46:33.940 --> 46:39.060] We have these components, or green components here, and what we do is we introduced a blue circle,
[46:39.060 --> 46:43.340] that would be a replacement for the green circle.
[46:43.340 --> 46:48.600] And we then started bleeding commands and things over to the blue one.
[46:48.600 --> 46:53.580] And after we had bled off from 0% up to 100% over to the new one, and we see that it's
[46:53.580 --> 46:58.100] all working, with the two systems running in parallel, we then move on, and we get rid
[46:58.100 --> 46:59.100] of the green circle.
[46:59.100 --> 47:01.520] We only have a blue circle now.
[47:01.520 --> 47:03.860] Then on to the star, and so on.
[47:03.860 --> 47:06.080] And so we replaced bit by bit of the system.
[47:06.080 --> 47:09.740] We even replaced the entire idea of a Patient Actor.
[47:09.740 --> 47:12.860] We had two sets of the Patient Actors running.
[47:12.860 --> 47:14.400] We kept them all running at the same time.
[47:14.400 --> 47:18.960] They were both doing all their work, and we slowly cut over, and then the old code
[47:18.960 --> 47:25.860] was gone from the system.
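The cut-over described above can be sketched as a routing dial. This is a hypothetical illustration of the strangler pattern, not Waterpark's actual API: each command is routed to the old ("green") or new ("blue") implementation based on a percentage that gets ratcheted from 0 to 100 while both run in parallel.

```python
# Hypothetical strangler-pattern router: bleed traffic from the green
# component to its blue replacement, then delete green once 100% is proven.
import random

class StranglerRouter:
    def __init__(self, old_handler, new_handler, percent_to_new=0):
        self.old = old_handler                 # the green circle
        self.new = new_handler                 # the blue replacement
        self.percent_to_new = percent_to_new   # 0 = all green, 100 = all blue

    def dispatch(self, command):
        # random.random() is in [0, 1), so 0 never routes blue
        # and 100 always does.
        if random.random() * 100 < self.percent_to_new:
            return self.new(command)
        return self.old(command)

old = lambda cmd: ("green", cmd)
new = lambda cmd: ("blue", cmd)

router = StranglerRouter(old, new, percent_to_new=0)
assert router.dispatch("admit")[0] == "green"   # 0%: everything stays green

router.percent_to_new = 100                      # fully bled over
assert router.dispatch("admit")[0] == "blue"
# Once 100% is proven out, the green handler is deleted entirely.
```

A real migration would ratchet the percentage gradually and compare the two components' outputs along the way; the dial is the part that makes a zero-downtime rewrite possible.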
[47:25.860 --> 47:29.380] And we got done with that in 2023.
[47:29.380 --> 47:35.440] And recently, on April 1st, we celebrated our five-year birthday: up in production
[47:35.440 --> 47:38.920] for five years with zero downtime.
[47:38.920 --> 47:46.440] So the closing thought I want to pass along, I guess, is this: absorb papers and
[47:46.440 --> 47:51.260] conference talks, and conversations over meals.
[47:51.260 --> 47:56.820] Sometimes people can give you really, really good advice over a meal, like I got.
[47:57.360 --> 48:01.320] Absorb these papers and think about them for a long time.
[48:01.320 --> 48:02.860] Pursue these long ideas.
[48:02.860 --> 48:06.060] Hold on to things and don't let them go.
[48:06.060 --> 48:08.820] And don't be afraid of reinventing wheels.
[48:08.820 --> 48:13.780] It's a disease I think we have in this industry, where everyone kind of pokes at you.
[48:13.780 --> 48:17.500] They say "not invented here" and stuff like that.
[48:17.500 --> 48:22.620] Well, sometimes the wheel is just not good enough for what you need it to do, and you
[48:22.620 --> 48:24.380] need to reinvent the wheel.
[48:24.380 --> 48:28.220] And so just know the difference, and push back on that.
[48:28.220 --> 48:32.300] I think it's awful advice that we give ourselves as an industry.
[48:32.300 --> 48:38.940] So reinvent some wheels, and find, or create, work that's meaningful.
[48:38.940 --> 48:43.660] So I'd like to give special thanks to my team at HCA.
[48:43.660 --> 48:47.580] Building Waterpark with them, it's been the most fun, meaningful work of my career so far.
[48:49.980 --> 48:54.860] There are people who are alive today who wouldn't be if it weren't for Waterpark.
[48:54.860 --> 48:59.780] This system saves lives.
[48:59.780 --> 49:04.500] And the interesting thing is, I doubt that Waterpark would even exist if not for that
[49:04.500 --> 49:07.420] chance encounter here in Oslo.
[49:07.420 --> 49:10.460] And so I'm thankful for this place.
[49:10.460 --> 49:14.780] And Oslo, it's full of goodness here.
[49:15.460 --> 49:19.380] You've got a good place, good community here.
[49:19.380 --> 49:21.380] And so thank you.
[49:21.380 --> 49:22.900] That's the talk.
[49:36.900 --> 49:42.220] I've got some cards up here, and if you want to reach
[49:42.220 --> 49:45.380] out to me about any of this, the QR code will take you to my LinkedIn.
- https://github.com/bryanhunter
[49:45.380 --> 49:48.260] And I love answering questions about this stuff.
[49:48.260 --> 49:49.980] So don't feel like you're bothering me.
[49:49.980 --> 49:51.980] I, you know.
[49:51.980 --> 49:53.980] Yes.
[50:05.980 --> 50:08.780] So FHIR, yes, there is FHIR.
[50:08.780 --> 50:11.660] It's still a tiny, tiny fraction of healthcare.
[50:11.660 --> 50:13.500] And so we process FHIR as well.
[50:13.500 --> 50:18.940] It's just that it takes so long for things to change in healthcare.
[50:19.940 --> 50:25.660] And so HL7, it'll probably be 40 years before, I don't even know, it may last forever.
[50:25.660 --> 50:32.060] You know, COBOL is still here, but I mean HL7, it'll be around for a long time, I think.
[50:32.060 --> 50:35.660] And yeah, but really a great call-out about FHIR.