Cleaned-up transcript of Rich Hickey's talk "Deconstructing the Database"

This is a talk I keep referring back to, and I wanted to have it in text form. I grabbed the raw machine-generated transcript from YouTube and used GPT-3 to help me turn it into well-punctuated sentences and paragraphs. I had to do some additional cleanup, but it got me most of the way there - my first experience getting AI to help me out with a real task!


Deconstructing the Database

Rich Hickey, author of Clojure, and designer of Datomic presents a new way to look at database architectures in this talk from JaxConf 2012. https://www.youtube.com/watch?v=Cym4TZwTCNU

The title of this talk is "Deconstructing the Database".

What that is, is a new way to look at database architecture.

It happens to also be a look at the underpinnings of the architecture of Datomic, which is the database I've been working on for the last two years; but it's not a sales pitch for Datomic, it's really about the ideas underlying the design choices.

Complexity

So why do we want to deconstruct the database? What are we trying to accomplish? What problem are we trying to solve?

I think the fundamental problem we're trying to solve is the problem of complexity in programming.

How many people think dealing with databases is easy and trouble-free? Nice. Most people don't, and there's a number of sources of complexity.

There's this great paper, "Out of the Tar Pit." In it, the authors identified a bunch of problems related to complexity in programming. They basically said that all complexity boils down to two flavors: One that has to do with state, and the other that has to do with control. The authors didn't implement it, but they suggested that by adopting functional programming and declarative programming and a relational model for data inside our applications, we could get rid of this complexity. It's a great paper, I really recommend you read it.

But one of the problems with the paper is that while they had a good grip on the functional programming and declarative programming part — and also possibly on using a relational model for data inside your applications — they really punted. Has everybody seen the cartoon where the mathematician has this chalkboard, and it's full of stuff, and then in the bottom corner it says "and then a miracle occurs", and then there's the answer?

The big thing I was missing from their picture was this: they imagined there would be this relational model of your data that you could access in your application, and that somehow it got updated. Something happened in the world, and it was different. And all the ick related to state had to do with however that got updated — but they didn't say how it would.

I would call that updating, and the problem they avoided talking about is really the problem of process in programs. In other words, we know there's going to be novelty in the world that our programs are going to encounter. Where does that go, in a model that's otherwise functional?

Declarative Programming

One thing we want to obtain in revisiting the database in a fresh way is to embrace declarative programming. I agree with the paper — we want declarative programming.

What's the best example of declarative programming we encounter most often? If we're not artificial intelligence researchers, it's SQL. It's actually the most declarative thing most of us encounter on a day-to-day basis.

Declarative programming is much better at manipulating data than what we do in our languages, even functional languages. We don't have this very nice higher-order set logic for dealing with data. It's superior to dealing with data even in a functional language, and way superior to dealing with data in an object-oriented language. (We use object-oriented languages because we have them, not because they're better at this; they're really much worse at this.)

The problem we have with a client-server database is that declarative programming is alien to us.

The other problem related to the model these guys were espousing is that there's no basis. If I want to do a calculation related to a database, what's the basis for that calculation? Well, if the whole database is changing all the time, we're back to the problems I talked about in the keynote. The database, in its entirety, is a place, and we have a problem where we say, "What's the basis for our decision making? Well, I don't know. It was what I saw last Tuesday when I ran this computation, but I can't tell you now what that was exactly."

"Over There"

There are problems with databases related simply to the client-server nature of them being "over there." The basis problem is one of them.

The other is that we're afraid of round trips, and we're afraid of round trips I think most often for performance reasons. But actually, the biggest round trip problem is that same basis problem: What if I have a composite decision to make? Can I ask the database three independent questions over time, and then get my answer? No! Why not? Because stuff has happened to that place in between those calls. This makes us do weird stuff. In particular, one of the things I think it makes us do is couple questions with reporting.

Let's say your application makes a decision about what entities we're going to put on sale on this webpage. Do we ever send the query to find out what those entities are, and then later send a query to gather the data we need for display? No. A lot of times we piggyback those two things together, because we're afraid of the result sets not matching up anymore. That's actually a fear that's born of the lack of basis.

From a design perspective, we know those two pieces of logic should be separate. They should be independent decisions: one part of my app knows the logic for deciding what should be displayed, and the other one knows about what the screen should look like and what we want to show of it.

Consistency and Scale

We also have problems related to consistency and scale. I don't know if anyone saw any of the NoSQL talks this week, but a lot of times we have difficulty scaling servers that are monolithic by default. We've seen NoSQL, we've seen the Dynamo paper, and some of these other technologies.

I think one of the questions we have in revisiting the architecture of a database is, what's possible? How much of the value propositions of databases can we retain while tapping into some of the new value propositions of distributed systems, in particular, their arbitrary scalability and elasticity?

Also, I think people are adopting these distributed systems and getting a lot of complexity as a result. Because they're trading off distribution and scale for consistency, they're losing consistency in the trade-off. But we have things like Dynamo and BigTable. How do we use them?

Flexibility

Other problems we have in general when we talk about traditional databases are flexibility problems. Everyone knows the rigidity of relational databases and the big rectangles. We also have the artifice of having to form intersection record tables and things like that. Things that you really shouldn't have to know about. And your application ends up becoming rigid because it does know about those.

In addition, lots of things are difficult to represent in a traditional relational model, like sparse data, irregular data, hierarchical data, and things like that. So we want to be more flexible and more agile in our development. We want to try to avoid this rigidity seeping in.

Information and Time

Another thing we want to try to get right if we revisit database architecture is information and time in particular. We want a database that we can use to represent information, we can use to obtain real memory and real record-keeping like we used to have before we had computers.

There's lots of good reasons for this — it helps support decision-making and auditing, and there's plenty of domains in which it's a requirement. People are doing this manually on top of systems that don't really understand that that's what you're trying to do.

(How many people have ever added a timestamp field themselves to tables and managed it all themselves? A lot of people have, right? How many people have written the query that gets you "now" out of that table? How many people have tuned that? Yeah, that's a nightmare. Anybody like that query tuning? That query is brutal, right? The contention is terrible, especially if it's also an online system. )

Perception and Reaction

The last thing I think we'd like from databases, that maybe we don't think about now because we don't connect the two necessarily, is a strong model for perception and reaction.

Perception is what I was talking about before: Getting that stable basis for decision-making.

Reaction is more like eventing: Things are changing in the world. How do we see change in a traditional client-server database? What do we do? It's a four-letter word that begins with “P” and ends with “oll”. Poll, we poll! It’s gross, right?

So we'd like to be able to make reactive systems that don't poll. And we'd like those systems to get consistent views of the world, which is another difficult thing. Even if you manually build, say, a trigger-based eventing system, and the triggers say, "Oh, something changed" — it's like, okay, great. Well, in between, when they told you something changed, and you're wanting to make a decision on the basis of it, maybe you're going to go back to the database. What's the basis now for that? Did it change again in between? What's in flight? You have no way to know that this changed, and that change was related to the database at this point in time. You can’t go back and ask questions to figure out what was going on, and what either caused that change, or what the effects of that change should be, or how it relates to the rest of the world. You have no way to do that. You just were told something changed, and maybe a value about it, but not where that's situated relative to the rest of the world. So we want to do that better.

Traditional Database

So, if we're going to take a database apart, we have to look at how it's put together. This is not a particular database we're talking about here, but this should be familiar to anybody who's dealt with traditional databases, and how they're laid out. And the guts, the meat of it, is at the bottom, so we'll start at the bottom.

A database certainly is something in general we expect to be durable. So most traditional databases are built around this: there's a disk, and we're going to put that disk in a box, and that box is going to be in charge of the disk, and everything comes from there. So there's some I/O subsystem that deals with the disk.

And then there are two fundamental sets of services the database will provide. They both rely on the disk. First, there's a transactional component that accepts novelty and integrates novelty into the view of the world. Then there's a query support component that accepts questions and gives you back answers based upon the values here.

If you just took everything that came into a database and you just appended it to a flat file, how good would the query engine be? Not very, right? So leverage comes from indexing. Leverage comes from organizing the way we store the data, such that a query engine has sorted views of things that it can use to answer questions quickly. That's the leverage of a database.

(I think we've gone to key value stores that have almost no leverage, and we're still calling them databases. But, in my mind, this is what made a database a database. Otherwise, we had file systems and all other kinds of things before we had databases, and we didn't call them databases. Why are we calling key value stores databases now?)

In general, traditionally, this was a big monolithic thing. There was a big sophisticated or complicated process that knew about all this stuff, and it had an integrated view of how they would work. And, because this was expensive, and the memory needed was expensive, and the box on which it ran was expensive, this was a very special thing. You had one of them, and then you had clients, which were somewhat more lightweight, and they communicate using usually a foreign language. How do you communicate with a SQL database? Strings in the foreign language. You send it over, and it does something. How do you communicate queries? You send strings over in a foreign language, and you get back — well, who knows, maybe the API makes it look like a result set to you, up at the Java level.

And then we know, as we get a lot of apps going, this unique resource gets taxed. Everyone's putting all the data in there, and everybody's asking questions there. We know the questions dominate — most applications are read-oriented. So, most applications eventually end up adding another tier. If it's very costly for me to ask questions, I'm going to store the answers to those questions in a cache. So, maybe next time I want to ask that question, I'll check the cache first, otherwise I'll incur the cost of going all the way to the server.

What goes in the cache? What form does it take? When does it get invalidated? Whose problems are all these questions? Your problem. (Or maybe you buy into some fancy ORM and that makes it your problem with another layer on top of your problem. Now you have two problems.) Yeah, it's up to you. It's definitely not the server's job. That’s why we call it caching over a database.

And there are some other things that database comes with, that we don't necessarily think about.

Certainly, most databases have a data model. It can be a really low-level thing that is about how things are stored, or an API kind of thing, or it can be a relatively high-level thing. It's certainly a great trait of SQL databases that they're based upon a mathematical foundation in relational algebra. There's a proper data model with a bunch of great characteristics that allow you to write those declarative programs.

But they also contain a state model. And, in fact, relational algebra is a lot like that old paper. Relational algebra is like, perfect. It says there is the state of the database, and all this algebra applies, this math. It's great.

How do you get a new state of the database? Well, a miracle occurs, and then you have a new relational world. But update is not mathematical; there's not the same model behind it. So there is a state model, and in general — not all the time — that's an update-in-place model. It's subject to all the kinds of criticisms I gave in the keynote.

What's usually missing is an information model. By an information model, I mean the ability to store facts, to not have things replace other things in place, to have some temporal notion to what's being stored. That's what I would consider a true information model, and that's usually missing from the databases.

Challenges

So, we want more scalability; we want to try and leverage these new systems; we'd like to have more declarative programming in our applications; we'd like to have a proper information model; maybe we don't want to program with strings anymore.

What are the challenges we're going to face if we try to do that with this approach? The biggest one by far is definitely the state model. The fact that it's update-in-place. There's a great reason why traditional databases work the way they do; because when they were invented 30 or more years ago, these resources were scarce. You couldn't make a database that said, "I'll just keep everything" because you had like this tiny little disk. So they invented all this update-in-place technology.

Usually, inside a database, there are these B-trees. They use blocks on the disk, they'll reuse the blocks, they'll fill the blocks, they're rewriting them; they're usually interacting at a pretty intimate level with the memory management on the computer. And because they're updating in place, and they're trying to serve multiple clients, they have a huge amount of coordination overhead to do that. That slows them down significantly.

Approach

So, the approach we're going to take in trying to break things apart is based on these four principles:

  1. Move to an information model
  2. Split process and perception
  3. Immutable basis in storage
  4. Novelty in memory

A database of facts

To move to an information model means to move to a data model that is fundamentally about facts. So we're going to have a database of facts, and we're going to get rid of this complexity.

And that means sucking the structure out. Because when you look at a relational row or a document, there's nothing fact-like about that. It doesn't say when, and the granularity of the fact is this composite thing. So if I had a whole row for you and you change your email — and email is one of the columns — the row is not a fact. It's not at the granularity of a fact. It's bigger than a fact. Maybe it's a set of facts.

So we want to get down to single facts. That's going to be important for efficiency reasons, but it also dramatically simplifies stuff.

RDF is an attempt to have a universal schema for information, and they use something called triples, which are subject, predicate, object. I've argued that that's not enough, because it doesn't let you represent facts, because it doesn't have any temporal aspect. But it's generally a good idea; it seems atomic. We really do want atomic facts.

We call them "datoms" and we just spell it differently, so we can say "datoms" because if we spelled it "datum", the plural would be "data", and then people wouldn't know what we mean. So we have "datom", which is an atomic fact, and "datoms" are more than one fact.

A datom is just an entity, an attribute, and a value, and then some temporal component.

You could just put the timestamp there. But that doesn't give you a lot of power.

If instead you say, "This thing came in as part of a transaction, I can store the transaction there." If your transactions are first-class — and they are in this system — you can put the time on the transaction. But you can also put who set it on the transaction, or where it came from, or whether or not it's been audited, or any other kind of provenance, or other characteristics. So that's what we do.
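As a rough sketch (this is illustrative Clojure data, not Datomic's internal representation), a datom is just a small tuple:

```clojure
;; A datom, sketched as plain Clojure data: entity, attribute, value,
;; and the transaction that asserted it (illustrative only).
[:sally              ; entity
 :person/email       ; attribute
 "sally@example.com" ; value
 :tx-1001]           ; transaction — itself an entity you can attach facts to:
                     ; when it happened, who submitted it, was it audited, etc.
```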

Database state

So now we have the problem of database state.

We say we want to have the database fundamentally be a value. We want it to be immutable. It seems to be a contradiction in terms, because we know there's going to be novelty. Our business is going to run, and we're going to sell more stuff or get new products or have new customers — and that novelty, that newness, has to go somewhere.

So how does that jibe with the notion of a value? The best analogy I could come up with was a tree ring. Think of the database as an ever-expanding value that never updates in place. It only expands; it only grows outward. It grows by accretion of facts. We're just going to add more facts. We never go back inside. Just like the tree rings — you never go back inside the tree rings, you just add more rings.

We end up with something that really smells like a value and will function as a value. We're only accreting. The past doesn't change, so the core upon which we're building never changes. And that's really the key characteristic we expect of a value — anything I've seen before will never change.

One implication is that new stuff means new space.

And this is key: We're going to get a lot of freedom by moving away from places.

Process

The other problem we have is, how do we represent change?

We said we're going to accrete facts. So what is the granularity of change?

We're used to saying, "Update this place and here's the address of the place, or here's the primary key of the place. Go do something there."

If we don't want to say that anymore — if we just want to accrete facts — then what is the fundamental unit of novelty? Say I have a new customer — what am I going to say to the database?

What we're going to say is that, at the bottom, we can represent this process. This is the problem we're trying to solve: the novelty problem. We're going to say, we can represent novelty, but just as assertions or retractions of facts.

"This new thing is true."

"This new thing is true."

"That thing that was true is not true anymore." Okay. Still a fact that it was true from then to here. It was true.

And this ends up being the minimal possible representation of process. You can express anything in these terms. And so we'll say that all the other transformations will expand into this. And I'll show you that a little bit later.
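In Datomic's transaction-data style, that minimal representation looks roughly like this (the entity id and attribute names here are placeholders):

```clojure
;; Novelty as plain data: assertions and retractions of facts.
[[:db/add     sally-id :person/email "sally@newmail.example"]  ; this is true
 [:db/retract sally-id :person/email "sally@oldmail.example"]] ; no longer true
```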

The other key thing we want to do with process is, we want to reify it.

How many people have heard of event sourcing or anything like that? So one of the ideas behind it is that, if you just look at a database that's been running for a while now, it's had a lot of activity, and you want to know what happened, how do you figure that out? How did it become what it is? You have no resources for doing that — unless you know how to read the logs, maybe there's a transaction log and maybe there's a way to read that. But a lot of times you're going to have to replay that, because that's just a successive set of modifications to places. It's really hard to read that and understand what happened.

What if instead, we say, "We're going to reify process." When you add a new customer, there's going to be something that says, "There is a new customer. That customer's name is Sally." That's what's going to be in a reified version of process, that says process is just assertions and retractions of facts.

So that's great. So we want to make a thing out of that, because that's something that we can store, we can look at, we can understand when we look at it: This change happened. We added this. We retracted that email, we added a new email. We sold this. We did this, we did this. Fact. Fact. Fact. Fact. These things happened. It's an information system: Fact. Fact. Fact. Fact.

And that's going to be great, because that's going to let us do some other cool things later, like events.

Accretion

One of the things that is important to understand about the accretion process is that it really does add to what's already there. So that means that, if you ever look at the view of the database at any point in time, you will be able to access the past. It's still inside. Just like the inner tree rings are still there. It's not like there's a snapshot from last Tuesday, and then one from Wednesday, and one from Thursday, and one from Friday, and each one has more stuff in it. It's as if anytime you look at the database it includes all of history inside of it.

This is important for an information model, where we want to give people decision-making capabilities that say "how much changed in this window of time?" or "count how many of those that happened over a window of time". We need all the time in one place, we don't want a bunch of independent records — here's Tuesday's facts, and Wednesday's, and Thursday's, separately. We have this growing tree ring thing.

Deconstruction

All right, so that's our plan. How do we do it?

We're talking more now at the model level. We want to deconstruct this. Now we're looking at just the server component: indexing, transactions, query, I/O, and disk. I think you can divide this up into halves, and that's why I said before we want to separate process from perception.

Because there's a process part of this — novelty processing. I have a new thing, it has to go through a transaction processing thing; maybe indexing happens on it then or not — don't know, it's an open question at this point in the talk. Then there's output to the storage system.

Completely independent of that, there's a perception characteristic to the use of databases. I have a question I want to ask — the question may leverage indexes; almost definitely it does. And there's going to be input — relative to my process, this is all input to me, as I read back from storage. We can separate these two things out, and we can only do that because we've adopted an information model and immutability.

Model

So the model we're trying to get to — and again this is not yet a physical model — is one where we can empower independent applications with as much of these capabilities as we can. We want them to be able to perceive change, we want them to be able to react to it, we want them to be able to independently remember anything that's important to their decision-making process, and we want them to make decisions and then possibly affect the process that they're sharing.

Obviously there are going to be some shared resources here — there's got to be some coordination, really, around change. And there's going to have to be some shared resource around storage. We want to minimize the coordination that's necessary to support this.

But that's the model we want, because now if we can do this — if we want a more powerful system — what do we have to do? If we want more query capability, what do we need to do? Just add more of these guys, and we don't really care about this growing, because we're not asking it to do much — just some coordination.

State

So if we revisit that whole tree ring thing — now we're into implementation details. How do we represent state? How do we represent this immutable expanding value? We know one thing — whatever representation we use, it has to be organized to support query. It has to be sorted in some way. That's really the fundamental leverage capability we have — sorting things.

It ends up that there's a technique that can be used — it's been used in functional programming quite often — called persistent data structures. That word "persistent" there does not mean durable — it's a different notion of persistence, and it has to do with the fact that you can represent a large immutable structure. And it doesn't matter whether it's supposed to represent an array or a map or a sorted set, you can represent almost anything as a tree.

Structural Sharing

And you can represent that immutably by using something called structural sharing. So this tree can represent anything — it can represent, for instance, a sorted set. This is the view of it we have right now, and it contains all these nodes. If we want to add another child to this node, we're going to make a new version of the set, but we don't want to update it in place. We're going to allocate a new node for that leaf, and then copy the path to the root. Then we have a new tree that has this new piece of information in it and substantially shares structure with the old tree. Why can we do that? Because it's all immutable.

Right, so this is the underpinnings of what are called persistent data structures. So we can do this in memory — it's done in memory by most functional programming languages. But what we're going to start to do though is do this on disk, so we'll have durable and persistent data structures that have this kind of shape.
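Clojure's own in-memory collections are a handy illustration of the idea: an "update" is really path copying that produces a new value and leaves the old one untouched.

```clojure
;; Persistent data structures: conj returns a new sorted set built by
;; path copying; the original value is never modified.
(def v1 (sorted-set 1 3 5 7))
(def v2 (conj v1 4))

v1 ;=> #{1 3 5 7}    -- unchanged
v2 ;=> #{1 3 4 5 7}  -- shares most of its structure with v1
```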

Storage

So, you saw an earlier diagram — we had a server, and it had I/O and a disk. I don't want to know about I/O and disks anymore. There was a paper recently that came out — it said "Disk Locality Considered Irrelevant". It's another one of these old notions that's now dying. It used to be — boy, if you're not the machine that has the disks, you're at a tremendous disadvantage from a computational perspective, because that machine — it's got a card that's attached to the disk, you can get the data up into memory, it's lightning fast. That machine has a privileged access to that disk. If you try to access that data on the disk from another machine, you can be paying a huge overhead.

Well, now, how much faster are our networks? Way faster! And what's the differential between disk and memory? Huge! It ends up that anybody that needs to access the disk is losing, and the guy who has the disk in the same box is only losing very, very slightly less than anybody else would lose. So it doesn't matter anymore — that kind of locality is not the way we should be architecting systems. We don't care about it — we care a lot more about putting data into memory, and having good locality in the way we do that. But if we actually have to touch the disk, we're losing anyway.

So we're just going to wrap up both the I/O and the disk, and say that's something we're going to call a black box called storage. What we're going to put in storage are two things — one is a log of every new fact as it comes in, and that's really an append-only log. Somebody says "I sold this", "Sally gave me an email". Fact, fact, fact. We're just going to shovel those facts as fast as we can into storage.

The other thing that will be in storage are much the same shaped things that we used to have locally. When we had a database that was monolithic, it had B-trees on disk. We're now going to have those persistent trees in storage, with nodes that we just don't change, but otherwise it's the same kind of idea. We've moved away from disk to storage, and we're storing index segments that we're not going to mutate.

But the key thing now is that we're treating storage with a simple interface — it’s just a key-value interface. We say, this block of the index tree maps to this segment which is a blob of stuff, and that's going to be immutable. So all we need from storage is the key-value thing. It's a lot like how the old databases used to say, “this block on the disk contains these bytes”. Now we're just lifting it up to a systems level, cooperating systems level.

So from storage, we need a key-value interface where we can store blocks of the index under keys.

We also do need a little bit of modifiable storage for the roots. What's the current root of the whole database tree? It's something we're going to need to point at new versions of the tree.

The other characteristic we need from storage is that we have to be able to obtain that using consistent read. So we're now getting a set of definitions for a storage service. What are the requirements of a storage service? It must support key-value storage, and every now and then we're going to ask for consistent read. These are the two things we need out of storage. Otherwise, I don't care how it works or where it is. I don't want to know. I'm now getting architectural flexibility from doing that. And there's lots of things that can satisfy this: We can sit on top of DynamoDB, or a SQL database can satisfy those two things. You can treat a SQL database as a key-value store; stick blobs in it, and it offers consistent read.
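A minimal sketch of that storage contract, written as a hypothetical Clojure protocol (this is not Datomic's actual storage interface, just the shape of the requirements):

```clojure
;; Hypothetical storage contract: immutable segments under keys, plus a
;; consistently-readable, atomically-swappable pointer to the current root.
(defprotocol Storage
  (get-segment [store k]        "Fetch an immutable index segment by key.")
  (put-segment [store k bytes]  "Store a new immutable segment under a fresh key.")
  (read-root   [store]          "Consistent read of the current index root key.")
  (swap-root!  [store old new]  "Atomically move the root pointer from old to new."))
```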

Indexes

An index is a very simple thing: it's a tree. There's some root, it's got pointers to inner nodes which have then got pointers to leaf nodes. All that's in these leaf segments are sorted datoms. So it's a big block of datoms sorted in a particular order.

You can imagine the sort orders. We said entity-attribute-value-transaction, so you're going to have a sort by entity. That's going to give you a great way to pull out what look like objects or documents, because it's entity-oriented.

But you also have a sort that's oriented by attribute first. Same data, just a second copy of it sorted a different way. One that's driven by attributes is going to feel like what kind of database? It's going to feel like a column store. Column stores store just email addresses all together, and then they store just phone numbers all together. Column stores are very powerful tools for analytics.

And you can store other flavors, but there's only six ways to sort them.
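A toy sketch of the same datoms in two sort orders (the maps and attribute names are made up; real index segments are just sorted blocks of datoms):

```clojure
;; Same datoms, two sort orders.
(def datoms
  [{:e 42 :a :person/name  :v "Sally"             :tx 1000}
   {:e 42 :a :person/email :v "sally@example.com" :tx 1001}
   {:e 43 :a :person/email :v "bob@example.com"   :tx 1002}])

(sort-by (juxt :e :a) datoms) ; entity-first: feels like rows/objects/documents
(sort-by (juxt :a :e) datoms) ; attribute-first: feels like a column store
```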

So then there's the actual job of indexing. We get novelty in, we want to incorporate this. We're going to have these trees, obviously; once we have a lot of data, those trees are going to be huge. And we said they're immutable. So we said anytime you want to incorporate new information into that tree, we're going to have to do that path copying job. Do we want to do that every time we get a new fact? No, that's a disaster. We can't do that. So I would call that maintaining the sort live in storage. It's not something you can do efficiently if you're going to treat storage immutably.

So Google's BigTable is an example of a solution to this problem that other people have used as well. The idea is, everything that's new is already logged to storage. This is not a durability question, it just has to do with how often do you integrate novelty into that big, relatively expensive to create index. The way BigTable works is it accumulates novelty in memory, until it's got 64 MB of novelty; and then it blows that out onto disk. And then a separate process later takes that 64 MB and the big sorted flat file that's everything it knew before and it does a merge sort and produces a new flat file. None of the files ever get changed, so it's the same idea; they're all immutable.

The biggest difference between that and what I've been talking about is that we have trees of smaller segments, and we can share structure. When BigTable creates a new flat file, it shares nothing with the older flat file, and it's huge, so it's not particularly addressable and it's not easily cached in chunks. By using trees, you get fine-grained addressability and you get these nice small chunks which are good for caching. You also have the potential for structural sharing. I can integrate new stuff into the next version of the index, and it could share a whole bunch of nodes with the old index — and if they've been cached, they're still good, because we know they're not going to change.

So we accumulate in memory, and anytime we want a current view of the world, we're going to have to merge dynamically what we have in memory with what's coming from storage. Every now and then, something has to go and integrate what's in memory into storage; as soon as that's been done, everybody can drop that from memory. We don't care about it anymore — we start fresh, because we know that it's now in the index that's in storage.
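A very rough sketch of that dynamic merge, assuming both the durable index and the live index can be walked as sorted collections of datoms:

```clojure
;; Current view = durable index in storage merged with the live in-memory
;; index of recent novelty (sketch only; a real merge join also honors
;; retractions found in the live index).
(defn facts-about [storage-index live-index entity]
  (concat (filter #(= entity (:e %)) storage-index)   ; seek in storage
          (filter #(= entity (:e %)) live-index)))    ; seek in memory
```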

Transactions and Indexing

So that just looks like this — this is still pretty much a logical model, but whatever is handling transactions is going to take novelty and immediately log it. That's where you get your durability — so if the thing dies, it's somewhere. But that's not organized in a leverageable way. It's also going to put it into an index in memory — we'll call it the live index. This is sorted — it's very inexpensive to create a sorted set in memory.

And then sometime later, occasionally, some other process is going to go and take this — whatever's there — and merge. There's going to be a lot more efficiency to that, because now it's got a whole bunch of novelty, and it's going to make a new tree that incorporates all of it. There's a lot of efficiency to doing that job in batches as opposed to doing it for every new transaction.

Perception

So that's the process side.

The perception side is really straightforward. If I want to see what's going on — could be a query, could be an analytics thing, could be just getting an entity — if I want to see the current state of the world, I have to somehow have access to the live index, and access to storage.

Look at all the stuff I don't have to have access to: I don’t need to be near the transaction processing system, I don't need to even have anything to do with it.

Is there any coordination associated with doing this? No. Something has happened in a couple of slides here: What happened to read transactions? They're gone! Where did they go? They just disappeared — because a correct implementation of perception does not require coordination. Perception in the real world does not require coordination.

We are left with one lingering question, which is: how does this get updated? If this is actually local to me, where does that come from? I'll answer that in a second.

Components

So now we have a couple of names that may be new as we look at the real architecture. We've broken stuff down: Now, we have storage separate. We have perceivers, who might be doing queries, they're separate. And we're going to have something that processes and coordinates transactions, we're going to call that the transactor.

We're going to call anyone who has capabilities to perceive, remember, decide, react, stuff — a peer. They're equals, they're very powerful equals in this system, and they're your only application servers.

Finally, we have some storage servers. Ideally, it would be a redundant store, one of these newfangled storages, that do distributed redundancy. They really have some great properties, as long as you don't try treating them like a database. If you treat them like a key-value store, they are awesome! They’re redundant, highly reliable, highly available, scalable, distributed — there's a lot of power there. We want to use that.

Datomic Architecture

So this is the same jobs, rearranged. For the purposes of talking about this, I'm going to say that a transactor will occasionally do the indexing job — in the current implementation it does as well, just because it has the spare cycles and it's convenient to do so. But anytime you want to, you can move this to a separate box.

So we'll start with some novelty. Your app has some novelty; it needs to communicate with the transactor. The transactor is going to log that right away. It has its own live index, so you can just imagine that's here. It's going to put it in there. The other thing it's going to do is transmit that novelty to all of the peers. That's how the live indexes are kept together: It's just a re-broadcast of the novelty, and the novelty only. I'll talk a little bit more about this process in a second.

The storage service, we don't care a whole lot about. It's very much a black box. It needs to be a key-value store that can support consistent read. This works on top of DynamoDB. It works on top of Postgres or any relational database. It works on top of InfiniSpan — if you want a big memory grid behind it, you don't actually ever want disks, you put InfiniSpan behind it, and now you have a big memory grid. Basically, anything that can support that, we’ll eventually support, but those are the ones we support right now.

If we look at a peer — obviously they have a communications component so they can talk to the transactor. They're going to have this live index in memory. And they have to have some ability (just ignore the caching for the moment) to talk directly to storage.

And this is the other critical thing: Now who has locality to storage? Nobody! Actually, no one has locality to storage in this (except whatever the storage infrastructure is). This guy's no closer to storage than these guys, because there's no advantage to being close to storage. So this is just a service or a server somewhere else. But once it is that, it means that everybody can read directly.

[inaudible question] No, there’s one transactor. I’ll talk about that in a second.

So everybody has access to storage.

It also means — Where does query live? Query can live anywhere you want. What does query need? What does the query engine actually need? It needs access to the organized data: access to the index. Since anybody can get access to the index, we can put query anywhere we want. There's no special machines for query. So in the case of Datomic, you end up with this library you put in your Java app or JVM app that has a query engine built into it. It's got the live index built into it. And I'll show you what that looks like in a second.
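With the Datomic peer library, that looks roughly like the following (the connection URI and attribute name are placeholders):

```clojure
(require '[datomic.api :as d])

(def conn (d/connect "datomic:dev://localhost:4334/my-db")) ; placeholder URI
(def db   (d/db conn)) ; an immutable database value, held locally by this peer

;; The query engine runs in-process, against that value.
(d/q '[:find ?e
       :where [?e :person/email "sally@example.com"]]
     db)
```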

The other thing that's really critical about this is, we have this storage — it's remote, it's at least a network hop. We said the network hop isn't costing you nearly as much as the disk would if you actually had to touch the disk. But it is still a network hop. Can we alleviate it? What is safe to cache from this storage? Everything. Why? It's immutable. When does it expire? Never. These are the things you want to hear as an architect, right? This is sweet. We can cache this stuff relentlessly anywhere we want. You can have a local cache; you can set up a Memcached cluster, right, and cache stuff there. So when guys don't find it in their local cache, they can pull it from storage and put it in a cache that a whole bunch of peers could share. This is very powerful.

But what's different about this? Whose problem is this — looking it up in storage and then putting it in the cache if it's not there? Whose problem is it? It's my problem. It's the system's problem. It's this library's problem. It's not your problem! You never see that. Effectively, you ask a query; the query tries to find the data; it's either got it in the cache, or it goes and figures out how to get it. But it's caching underneath, caching inside. It's not an application-level problem, this caching, because it's mechanical; there's no special logic around it, or anything else. The system can do this caching for you. All you have to do is start up Memcached and tell it about it.

[inaudible question] No, so this is like, this is an app. What does it do? I don't know — maybe it does analytics on pricing, so it's going to be interested in a certain portion of this data. That's all that we're going to cache. This other app — this guy underneath him — he's the website. He's putting up product pages, and he's reading a little different kind of stuff from here. This is not a replication of everything. It's not even proactive replication of anything in specific. Everybody is pulling in their working sets, depending on the work they're doing. And they don't care; they never need to pull in anything else, ever. Now, of course, being trees, they're all going to have a really nice set of the top part of that tree they'll all cache. And they'll get a lot of efficiency from that. In fact, it ends up that that diagram where I showed you three tiers — that's it. It doesn't get any deeper than that. Which means that, once you've cached the top part of that, you can find any piece of information you've never seen before in one read. And potentially, that's one read from Memcached. (Although some of these storage services, like Dynamo, are really fast.) So it's an on-demand-driven cache; it's filled on demand. It's filled with your working set. So different peers will have different working sets, and different amounts of stuff cached.

[inaudible question] No, this part is writes. This is the write path. I have some novelty, I send it into a transaction, it writes it in here. That's the write part. This is all read.

[inaudible question] So let's just talk through it a little bit. Let's just imagine that indexing takes the log and turns it into a sorted tree. So the log is just append, append, append, and the index here in storage is the sorted tree. Let's say it last did that an hour ago. Everything that's happened since it last did that is right here. It came in, it said boom, it got logged — but it's not in the index yet — but it got reflected back out, so it's here. So now you want to answer a question: "I want to put up Sally's profile page." In the last hour, Sally told me her email address had changed. That fact is here, but all the facts Sally told us prior to an hour ago are in here. When I want to go and ask a question — "Tell me everything about Sally, because I need to show her profile page" — the query engine is going to do a merge join. It's going to say, "Seek to Sally in here, seek to Sally in there, merge them together." So if there are retractions in here, they'll make it seem like the older facts in here are invisible; but what you'll get is the sum of information from the two, and that looks like now. Eventually, let's say we allow this to build up for an hour and we say, "Now is a good time to index." We're going to make a new tree here. It's going to share a lot with the old tree, but it will now incorporate Sally's new email address. When that job is done, everyone will be informed, and this will turn to empty and will start accumulating the next window of change. That's how it works.

[inaudible question] Yes, the live index is actually local in memory to every peer. Every change is broadcast to every peer.

[inaudible question] Well, so the thing is that these are servers; these are not something you're going to have ten thousand of. This peer is really something like an app server. That's the way to think of it. It, in turn, is probably serving other people, the way your app servers do today. It's not necessarily the web layer. Ten thousand peers in your web tier would be a lot. We don't currently have a multicast infrastructure for making that propagation — if you wanted tens of thousands — but that would be a way to approach doing that. It definitely does happen — that stuff does come over here, but it's not the same as what's there. It's not organized yet. The organization happens here.

Let me keep moving, and then I'll take questions at the end.

Process

So, process itself: it ends up that, as I said, we can boil everything down to assertions and retractions. But you can certainly think of transformations that you couldn't express that way. Like, I want to add $10 to your bank account. That's the logical process I want to make happen. But it ends up that the fact that results from that process is dependent upon the existing value in your bank account.

So, how do you do that? The way you do that is with transaction functions. A transaction function is a function of the database and some arguments. What it yields is transaction data. We said transaction data is assertions and retractions of facts. Now we're going to say transaction data is assertions and retractions of facts, or transaction functions and arguments. Which means that we can have a transaction function that is 'add', and it will be passed the database. We say, 'add $10 to Sally's bank account.' That function will be run inside the transaction. It will be given the current value of the database, because the database really is a value. So this is a real function of a real value. It can perform any queries it wants — including looking up Sally's current balance — and then will yield more transaction data.
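A sketch of such a transaction function using the classic peer API's d/function (the account attributes are made up, and error handling is omitted):

```clojure
;; A transaction function: a pure function of the database value plus
;; arguments, yielding more transaction data.
(def add-to-balance
  (d/function
    '{:lang   :clojure
      :params [db account amount]
      :code   (let [balance (or (:account/balance (datomic.api/entity db account)) 0)]
                [[:db/add account :account/balance (+ balance amount)]])}))

;; Installed under an ident, it is then invoked as data inside a transaction,
;; e.g. [[:account/add-to-balance sally-id 10]].
```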

Process Expansion

So if it was that simple a thing, it would go: that function could go look it up, find out the balance is $100, say it's $110, and what it would expand into is the fact that Sally's new balance is $110. But it could do some more involved things: You can imagine any transaction as assertions and retractions and potentially calls to transaction functions. Those transaction functions can expand into calls to other transaction functions, but eventually will expand into assertions and retractions. This expansion happens over and over again until it's all assertions and retractions. And this allows you to do any arbitrary transformation on the data in the database. You can ask questions, you can do anything you want.

But the cool thing is you have a decision — architectural independence — as to when you do this. If you're adding pure novelty, you don't need to do it inside the transaction. You just say, "I have new facts, I know they're new." If you have something where you think you have very little contention, you can say, "My local view of Sally's bank account is 100. I'm going to make it 110," and I'll just put that in with a condition that says: if it wasn't 100, fail. That's more optimistic, but if it was an expensive calculation and the chance of collision was really low, that's the more efficient way to do it.

Or, you can do it this sort of old-school way, which is to send the function all the way into the transaction, which is going to get run atomically inside a transaction with no interference.
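The optimistic middle option maps onto Datomic's built-in compare-and-swap transaction function, :db.fn/cas (the entity id and attribute below are placeholders): assert the new value only if the old one still holds, otherwise the transaction fails.

```clojure
;; Optimistic, conditional update: set the balance to 110 only if it is
;; still 100; otherwise the transaction aborts.
@(d/transact conn [[:db.fn/cas sally-id :account/balance 100 110]])
```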

So we call that expansion.

Transactor

So this is what the transactor does. It accepts transactions. Transactions are just data: Fact. Fact. Fact. Assert. Assert. Retract. And even if you want to have a function, it's still expressed as data. It says, "Call the update salary function with this argument and this target entity."

It expands them, eventually it will apply them to the in-memory view of the database inside the transactor, and it will log it. Finally, it will acknowledge the transaction to you and then broadcast it to everybody else. Every now and then, this indexing will occur.

(As I said in my talk, there's the storage equivalent of garbage, and garbage collection comes out of doing it this way, because as you make a new index, now no one else is going to care about the old index. The root and a bunch of nodes of the old index are junk in storage. So we have to clean that up. But the analogy to memory is really good.)

These peer servers have direct access to storage. They have their own query engine, they have their own live memory index, and they do that merge themselves.

There's also this two-tier cache going on. This is inside, right — underneath. When you ask a query, if it doesn't have the data, it gets it. It will look in a local cache, which is at the object level. If that is a miss, it will go and look for segments in Memcached, if you've configured it. Otherwise, it'll go to storage, and then it will cache it.

What’s in a DB Value?

I'm not going to talk too much about this, because it's involved. But this is just a picture of what's happening: There's an in-memory persistent data structure, that's the live index. It's really immutable in-memory, too. We're just moving from one to the next. There's an infrastructure for allowing you to find the data in storage. It goes through the cache, and then to storage; and it's trees, the roots of which will almost definitely be cached in memory.

This value is just a pointer; it's a struct that points to these things. It's inside a box. The only mutable thing in the system is the fact that the contents of this box will move from one of these immutable structures to another whenever we've updated the memory index, or whenever an indexing job is completed. There's nothing else mutable in the system — just this one identity, which is the database as of right now. Which means that if you've obtained this and you started running a long-running query, everything that you've got won't change underneath you, even amongst threads in your same process. You're really working with the value.

Consistency and Scale

So what are some of the characteristics? Somebody asked before, does it use Lamport clocks, or vector clocks? And one of the cool things about separating perception and process is you now can make independent decisions about availability and scalability for those two things, because they're completely separate.

So the decision that's made by Datomic, because I think it's a market need and has a lot of value to companies, is to make a traditional decision about the process side: It's transactional; it's a consistent view of the world; it's a single writer system.

The cool thing, though, is that the transactor is not doing anything else. It's not serving queries or anything at all. All it has to do is handle writes.

If you want arbitrary write scalability, you're going to have to give up transactions and queries. And everybody who's adopting NoSQL databases that make that choice for you is getting that trade-off. I think it's not a great trade-off for a lot of companies, and that's why I'm in this space trying to give them this hybrid solution that's a combination of the two.

On the read side, though, we get all the benefits of distribution. If you put DynamoDB in that storage slot, you have arbitrary scaling, all of Amazon's goodness about storing stuff in different places. You have knobs that bound throughput for reads and writes — a tremendous service approach to storage. Or if you're already running a SQL database, you can leave it there and just start adding data in this format on top of it. But it's an independent decision.

If you do choose a redundant, scalable storage subsystem, though, you will get the benefits of having done that. So you get scalable reads. And obviously queries scale, because the queries are not in one box. The queries are in every peer box. So you not only have scalability of that, but you have very elastic scalability of that. It's not like adding a new box to a cluster configuration — that's kind of a big job. This can scale up and down with Amazon's auto-scaling. As more peers come up and down with load, you have more query capability. So it's elastic.

Flexibility

I'm going to hurry up through the last couple here.

You definitely have more flexibility. The schema model is extremely small: All you define are the attribute definitions, the name and type of an attribute, what its cardinality is, and things like that. There's no higher level thing. The fundamental unit is a fact. The only configurable aspect of the fact is the attribute. So that's the only schema that exists. And there is schema.
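A whole attribute definition is roughly this small (the attribute name is illustrative; recent peer versions let you transact the attribute map directly):

```clojure
;; The entire schema for one attribute: name, type, cardinality.
@(d/transact conn
   [{:db/ident       :person/email
     :db/valueType   :db.type/string
     :db/cardinality :db.cardinality/many   ; a person can have several emails
     :db/doc         "Email addresses for a person"}])
```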

And the net result is that it feels like you're programming with multimaps. If you take an entity approach to this, it feels like every entity is a key-value set, where each key and each value could be multi-valued if you wanted. It's very convenient, it's very flexible, and it's an easy to use programming model down at the bottom.

Time

The other huge benefit you get from this is you get time. The database is really a value; it includes all the past. Which means you can take any database that you've got and say, "Pretend it's as of last month." That doesn't do anything; it doesn't change stuff, it doesn't throw stuff away; just pretend it's last month. Now you can ask queries of that database, right? When you ask the queries, if it finds anything newer than last month, it just doesn't use it, and it can answer those questions. That means you can do "as-of" questions against the current database for a past point in time. It doesn't cost anything to do that.

The other critical thing is how many people have ever built systems with timestamps and had to do as-of queries? It's brutal. Because what do you have to do? You have to flow around that T, that time. If I want to say "as of last week", last week has to be part of every query, has to be part of every join, and it's nasty. You have to do max and all that. Very, very tough stuff.

What did I just say about this? I walk up to a database and I say "as of last week." So I can have a query which is, "Get me the sales subtotals." That query works against now. I can tell the database, "Give me yourself as of last week." I can take the same query and run it against that database value; it's not parameterized by time anymore. The database knows I'm supposed to be pretending it's last week. Now you can ask the same query; it doesn't have any time in it. Times are not part of the join. Of course, you can also ask queries where you want to compare times and things like that. So you can do as-of a point in time. You can do windows since a point in time.
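In the peer API that is d/as-of; the query itself doesn't change (the query name below is a placeholder):

```clojure
;; Same query, two bases: the database now, and as of a past point in time.
(def db-now  (d/db conn))
(def db-then (d/as-of db-now #inst "2012-07-01"))

(d/q sales-subtotals-query db-now)   ; answers as of now
(d/q sales-subtotals-query db-then)  ; answers as of that moment, no time in the join
```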

The other thing you can do — because the database is actually a local value; you have that memory component, but it is really local in your process — you can do "as-if", "what if". What if I added this stuff, what if I committed this transaction — what would the database look like? What would the answer to this query look like? Do I need to go to the server to do that? No. I can just take the in-memory data structure, which is persistent, and make a new version of it that includes the transaction data I'm thinking about committing. That's now another database value; it's all in memory. I can take the same query and ask it. I don't need anybody's help. I'm not bothering anybody. I can do "what if", and then I can say, "Well, that query still works, so good. Now I'll try to commit it." Or maybe I'm just doing what-if analysis for people, and I never commit it. We're just trying to make decisions: what would happen if we made this decision, what would the world look like? I don't have to put that into the database to answer the question. I still have query capability.
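The "what if" version is d/with: it builds a speculative database value locally, without transacting anything (the tx data and query here are placeholders):

```clojure
;; Speculative "as-if" value, entirely in memory on this peer.
(let [speculative (:db-after (d/with (d/db conn) proposed-tx-data))]
  (d/q some-query speculative))   ; nothing was committed anywhere
```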

Perception and reaction

Perception is straightforward. We have this immutable thing. All the queries have been flavors of perception. But the reaction now is easy, because we have this live feed. The transactor is sending all novelty to all the peers, which means it’s easy to make an event on the peer that says, "Some novelty came in."

But the other thing that's beautiful about it is you can say, "Here's some novelty: Sally changed her email address." And if it was important to you to ask, "Wow, is it the same as what it was?" — can you do that? Sure! Because the database as a value means that you can capture the database value when that change was made, and in fact it gets sent to you with the events as well. Here's the database before the change, here's the data of the change, here's the database after the change. Ask any queries you want.

And if you want to have sophisticated eventing, you just have to query that data. You just say, I’ve got this event feed, I'm only interested in these certain kinds of changes. You just query the feed, and now you get that, because the query engine is in memory, and it operates not only against the database, but it operates on memory data structures — combinations of your own memory and what's coming from the database — so you can filter that way.
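On a peer, that live feed is available as a queue of transaction reports; each report carries the change and both bases (a minimal sketch):

```clojure
;; Reaction without polling: every transaction is pushed to the peer,
;; with the database values before and after the change attached.
(def reports (d/tx-report-queue conn)) ; a java.util.concurrent.BlockingQueue

(let [{:keys [db-before db-after tx-data]} (.take reports)]
  ;; tx-data is the set of datoms in this transaction; db-before/db-after
  ;; are full database values — ask any query you want against either.
  (d/q '[:find ?e :where [?e :person/email]] db-after))
```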

DB Simplicity

So I think you get a lot out of this. In particular, what you get is simplicity in the database. The state model is what I would call epochal. It only moves from valid, consistent point to valid, consistent point. There's never any in-between. There's no read coordination or anything else. There's only coordination for the process, and that's done in the transactor. Every time you ask the same query, you get the same results. You have a stable basis for decision-making.

If you think there was a problem with the system in the middle of the day and you're not going to get to look at it till next week — how many people have ever had that happen? The queries are returning some weird results at 5 o'clock, everybody wants to go home, so we'll look into this tomorrow. Then you come back tomorrow and the query is fine, just because more data was added into the system. How are you ever going to find out what was wrong? You're toast! With this, it's straightforward. We can say, "Ask that query as if it were five o'clock, because I know at five o'clock that query was screwy and giving me weird results." And you can figure out the answer to your problem, because you can get a basis anytime you want.

And the transaction is really well-defined. What a transaction is, is a function of the database value.

Other benefits

The other thing that is important is that you can communicate a basis. If I make a decision or I want to give you work to do, I have a way to communicate it. I'm not going to say "go look in the database". I'm going to say "there's some work you have to do at database time T", and you can go and look at exactly what I was seeing, so you know what to do.

We've seen already architecturally, because we've broken stuff apart, we have freedom to relocate things. We don't care where the queries are owned. We don't care where the storage is. We don't care where the transactor is running. You have a whole bunch of flexibility about where you put things. You can put data up on DynamoDB and run the whole system in your LAN, and put Memcached in your LAN and isolate yourself from the fact that the internet is actually potentially in between.

We saw the time travel stuff and we saw the event processing.

The Database as a Value

So the net result of deconstructing the database is that you're able to treat the database as a value in precisely the way I was talking about in my talk.

You have a real information model.

You have a system that's substantially less complex — it's easier to understand; it's easier to change; if you want to change your storage, you can do that easily; it's easy to replace components; it's easy to relocate things.

It's more powerful and more scalable. If you want more brains, you start more peers.

There's less coordination, which also leads to scalability aspects.

The information model not only lets you remember everything — which is important for your businesses because they want to make decisions — but also gives you more flexibility because that’s a model that’s easy to turn into any shape that you want.

With that, I'll wrap and answer any questions. Thanks.
