Created
February 12, 2012 21:51
Revisions
-
Tom Boutell created this gist
Feb 12, 2012 .There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersOriginal file line number Diff line number Diff line change @@ -0,0 +1,93 @@ I'm going to examine a plausible, nontrivial query that comes up for client sites and consider how it would be implemented in both MongoDB and a MySQL back end that attempts to support a lot of the capabilities we like in MongoDB. I'm not going to look at how we retrieve the inner contents of a page (nested contents) or information about subpages (related contents) because I have a pretty good idea how we want to cope with those bits in both cases. The query syntax here is just pseudocode, I'm not proposing it: (published_at > :now) AND ((category_7 == 1) OR (tag_count > 2)) ORDER BY published_at Note the use of category_7 as the property we check for to see if a content object has a particular category. We do this because we don't want to assume a backend that supports queries on deep structures and/or understands arrays. Similarly tag_count is something we maintain separate from setting tag_1 and tag_7. We can provide helpers to make these things more transparent. What do implementations look like in MongoDB and MySQL, assuming that we want all implementations to support the loosey-gooseyness of MongoDB as much as practical? The MongoDB implementation is: db.content.find( { $and: [ { published_at: {$gt: now} }, { $or: [ { category_7: 1}, { tag_count: { $gt: 2 } } ] } ] }) .sort({ published_at: 1 }); Now let's look at the MySQL implementation. Let's assume that we don't have columns for all of these things in the content table, because we're not requiring people to declare everything in advance, and because there could be potentially hundreds of properties for different content types. So we use the entity-attribute-value pattern. Each content object has a row in a content table and we join that to an attr table which stores all the properties. It turns out I can automatically translate the whole thing into a series of left joins and a 'where' clause. It's straightforward even if clauses are nested as in this query. The first left join is intended to actually return the properties of the chosen objects. The rest are used to implement the query: select c.*, a.* from content c left join attr a on c.id = a.content_id left join attr b on c.id = a.content_id and b.name = 'published_at' and b.value > :now left join attr c on c.id = a.content_id and c.name = 'category:7' and c.value = 1 left join attr d on d.id = a.content_id and d.name = 'tag_count' and d.value > 2 where (b.id and c.id and (c.id or d.id)); ... But for performance we should use INNER JOIN wherever we can to avoid trolling through more rows than necessary. If the query is simple like this: (color == 'blue') Or part of it descends strictly via AND clauses from the main query: (published_at > :now) AND ((category_7 == 1) OR (tag_count > 2)) ^^^ this part Then that part of the query can be an inner join. Checking it in the 'where' is then redundant although not harmful: select c.*, a.* from content c left join attr a on c.id = a.content_id inner join attr b on c.id = a.content_id and b.name = 'published_at' and b.value > :now left join attr c on c.id = a.content_id and c.name = 'category:7' and c.value = 1 left join attr d on d.id = a.content_id and d.name = 'tag_count' and d.value > 2 where (b.id and c.id and (c.id or d.id)) I should be able to do an ORDER BY clause here to implement the sort: ORDER BY b.value I suspect it's best to put the final join for all the attributes of the "winning" objects at the end to help MySQL understand what optimizations to make: select c.*, a.* from content c inner join attr b on c.id = a.content_id and a.name = 'published_at' and a.value > :date left join attr c on c.id = a.content_id and c.name = 'category:7' and a.value = 1 left join attr d on d.id = a.content_id and d.name = 'tag_count' and a.value > 2 left join attr a on c.id = a.content_id where (b.id and c.id and (c.id or d.id)); We should also support IN clauses because each IN clause only needs a single join whereas a big dumb OR looking for various ids would generate many joins (without really excessive cleverness anyway). One big concern is that MySQL is case insensitive, while MongoDB is case sensitive. This runs pretty deep and you can't change it in MongoDB - you have to explicitly store lowercase versions of things for sorting purposes if you want to sort and search that way (which we mostly do). It is possible to force MySQL to be case sensitive by setting the utf8_general_cs collation (UTF8, General, Case Sensitive).