Created
August 12, 2013 19:20
-
-
Save ox/6214169 to your computer and use it in GitHub Desktop.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
The polar opinions on various databases always crack me up. If you ask a programmer about a given database they'll either love it for solving all of their problems or hate it for causing them. Most of the time the issue is actually caused by the way the programmer decided to originally model their data. This is a proposal for a program that will automatically handle the way associations are stored in a key-value store and hopefully end some of the mindless debates about how #webscale your favorite #noSQL DB is. | |
Example Schema | |
Throughout this article I’ll use the below schema for a user as an example: | |
{ | |
id: String, | |
name: String, | |
friends: [User] | |
} | |
Optimizing for Reads | |
If we were to optimize the above user schema for reads, we would store entire user objects in a user’s array of friends. That way when you requested a user object, you could also get their friends with only one request. | |
So what’s the problem? Imagine if Robert is friends with 1000 people. If Robert decides he likes the nickname Bob more, you would have to change his name in all of his 1000 friends’ arrays. Not very efficient. | |
Optimizing for Updates | |
In order to avoid the crazy update situation, we can make user.friends an array of user ids. Now, if we want to get information on all of a user’s friends, we can make a batch get-request for all of the users with ids in the friends array. This mimics how a SQL join works, except that SQL is generally faster at doing this. | |
Optimizing the Optimization | |
I’d like to propose a more intelligent approach to storing associations in key-value stores—optimizing some records for reads and some for updates, depending on the frequency at which they are accessed. | |
Counting Links | |
In the example of a user with friends, you probably don’t want to optimize for reads if the user is in more than X user’s list of friends. At some point, a user will be linked to enough other users where it won’t be worth duplicating their data in all of their friends’ arrays—instead, the data should be optimized for updates. I could see a programmer setting a max X per schema. | |
Query Type Frequency | |
A simple heuristic for what optimization to make (that incorporates the number of links) is to just count the number of times updates and reads are made for a given record — just optimize for whichever is more frequent. | |
One problem with this could be if the number of reads and updates are roughly equivalent and changes non-deterministically. In that case, the system will constantly be changing the way data is saved, resulting in unnecessary database requests. | |
Manual Declaration | |
Due to scenarios like the one above, it would be nice to allow the programmer to decide how a record should be stored when they create/update it. It would also be interesting to incorporate this as an option at the schema level in an ORM. | |
Implementation | |
In order to implement this system, you would need: | |
a delayed jobs queue that can handle a task to calculate whether or not the way a record is stored should be changed each time an update is made to a record | |
a script to change the way a record is stored from embedded doc to a list of ids | |
Conclusion | |
Like all software implementations, there are going to be other limitations that this proposal is overlooking. Regardless, I think we can more intelligently store associations in key value stores. |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment