@vonconrad
Created April 13, 2012 02:29
MongoDB structure

So, I'm putting together a MongoDB database to store XML documents. There are plenty of reasons why I would want to do something seemingly insane like this; most importantly, I need to be able to see what's changed in a document, as well as easily access individual element and attribute values.

I want to store each individual "item" as a row. By "item," I mean each child of the root element. Each item will have standard XML attributes and element children.

Now, one thing to keep in mind: some of these documents are going to be small (< 100 KB), others pretty big (> 100 MB), with plenty in between. The documents are going to range from a few hundred items to potentially hundreds of thousands.

My initial thought was to have an Item model for each item, with its child elements and attributes as nested documents. Consider this XML doc:

<root>
  <item sku="123">
    <brand>Apple</brand>
    <model>iPhone 4s</model>
    <rating>5</rating>
    <price>599.95</price>
  </item>
  <item sku="456">
    <brand>Samsung</brand>
    <model>Galaxy S II</model>
    <rating>2</rating>
    <price>514.99</price>
  </item>
  <item sku="789">
    <brand>Nokia</brand>
    <model>Lumia 900</model>
    <rating>3</rating>
    <price>499.99</price>
  </item>
</root>

I thought that the item might look something like this:

#<Item _id: ..., document_id: ..., checksum: ..., attributes: <NestedDocument>, elements: <NestedDocument>>
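
As a rough sketch of that shape, here is one item written as a plain Ruby hash, with no MongoDB involved. The `build_item` helper and the choice of SHA-256 over a JSON dump for the checksum are assumptions made for illustration; the post does not say how the checksum is computed.

```ruby
require 'digest'
require 'json'

# Hypothetical helper: builds the hash that would be stored for one <item>.
def build_item(document_id, attributes, elements)
  payload = { 'attributes' => attributes, 'elements' => elements }
  {
    'document_id' => document_id,
    # Assumed scheme: SHA-256 over the item's content, so a changed item
    # can be spotted by comparing checksums between two imports.
    'checksum'    => Digest::SHA256.hexdigest(JSON.generate(payload)),
    'attributes'  => attributes,
    'elements'    => elements
  }
end

old_item = build_item('doc-1', { 'sku' => '123' },
                      { 'brand' => 'Apple', 'model' => 'iPhone 4s',
                        'rating' => '5', 'price' => '599.95' })
new_item = build_item('doc-1', { 'sku' => '123' },
                      { 'brand' => 'Apple', 'model' => 'iPhone 4s',
                        'rating' => '5', 'price' => '549.95' })

# Only the price differs, so the checksums should differ as well.
changed = old_item['checksum'] != new_item['checksum']
puts changed
```

Comparing checksums per item would be one way to answer "what's changed in the document" without diffing every element.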

However, the more I think about it, the more I'm not sure whether it's a good idea to have them as nested documents or separate.

As for how I'll access the data: I'm obviously going to need to fetch all the items (together with their attributes and elements) for a single document a lot. I can't think of a scenario where I'd need to fetch the items without their attributes and elements.

But occasionally, I'm also going to need to grab, for example, all brands within a document--basically, returning an array of ['Apple', 'Samsung', 'Nokia']. This is where I'm unsure, as I think separate MongoDB documents would help with this operation. I guess I just don't know enough about Mongo yet!
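
For what it's worth, here is a small in-memory sketch of that "all brands" operation when the elements are nested inside each item. The `items` array is a hypothetical stand-in for whatever a query for one document's items would return; it only shows that the nested layout does not rule the operation out.

```ruby
# In-memory stand-ins for the stored items of one document.
items = [
  { 'attributes' => { 'sku' => '123' },
    'elements'   => { 'brand' => 'Apple',   'model' => 'iPhone 4s' } },
  { 'attributes' => { 'sku' => '456' },
    'elements'   => { 'brand' => 'Samsung', 'model' => 'Galaxy S II' } },
  { 'attributes' => { 'sku' => '789' },
    'elements'   => { 'brand' => 'Nokia',   'model' => 'Lumia 900' } }
]

# Pull one nested field out of every item, skipping items without it
# and dropping duplicates.
brands = items.map { |item| item['elements']['brand'] }.compact.uniq
p brands  # prints ["Apple", "Samsung", "Nokia"]
```

On the database side, MongoDB can address nested fields with dot notation (here that would be `elements.brand`), so the same idea may not require separate documents.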

@mariovisic

So each node in the XML will be a document in a collection.

If you go with a referenced association, you get a single collection that holds all of the node documents, and you can easily query them directly. For example, if you wanted to find a document which has a Samsung brand, you could do it with a simple query in Mongoid or another ORM:

node = Node.where(:name => 'brand', :value => 'Samsung').first

From there you can grab the document or the other nodes around it. Even if you go with the embedded approach there are ways to achieve the same sort of lookup, but you'll probably be using map-reduce to get the documents you want.
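
To make the referenced layout concrete, here is a plain-Ruby sketch that mimics the query above against an in-memory array. The `Node` struct, its `parent_id` field and the sample records are hypothetical; in Mongoid the linking would be handled by the association rather than by hand.

```ruby
# Hypothetical flat "nodes" collection: one record per XML node,
# linked to its parent by id.
Node = Struct.new(:id, :parent_id, :name, :value)

NODES = [
  Node.new(1, nil, 'item',  nil),
  Node.new(2, 1,   'brand', 'Apple'),
  Node.new(3, 1,   'model', 'iPhone 4s'),
  Node.new(4, nil, 'item',  nil),
  Node.new(5, 4,   'brand', 'Samsung'),
  Node.new(6, 4,   'model', 'Galaxy S II')
].freeze

# Plain-Ruby stand-in for Node.where(:name => 'brand', :value => 'Samsung').first
node = NODES.find { |n| n.name == 'brand' && n.value == 'Samsung' }

# "Other nodes around it": siblings share the same parent_id.
siblings = NODES.select { |n| n.parent_id == node.parent_id && n.id != node.id }
p siblings.map(&:value)  # prints ["Galaxy S II"]
```

Because every node is a top-level record, the lookup is a flat filter; the cost is that rebuilding a whole item means collecting its nodes again.
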
The alternative is using an embedded association. When you do this, the associated elements are stored inside the same document, so the BSON structure you end up with might look something like this:

{
  "title":"My first XML document",
  "nodes":[
    {"name":"root","nodes":[
      {"name":"item","attributes":{}}
    ]}
  ]
}

This is really good if you always want the nodes to be quickly accessible as soon as you grab the parent's data. The obvious problem is that it is more difficult to query for individual nodes (which doesn't matter if that isn't a requirement).
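
As an illustration of why node lookups take more work in the embedded layout, here is a sketch that searches a nested hash shaped like the structure above. The `find_nodes` helper and the extra `brand` node are hypothetical additions for the example.

```ruby
# Nested structure in the shape of the example above, extended with a
# hypothetical brand node so there is something to find.
document = {
  'title' => 'My first XML document',
  'nodes' => [
    { 'name' => 'root', 'nodes' => [
      { 'name' => 'item', 'attributes' => { 'sku' => '456' }, 'nodes' => [
        { 'name' => 'brand', 'value' => 'Samsung' }
      ] }
    ] }
  ]
}

# Depth-first search: returns every node hash with the given name,
# however deeply it is nested.
def find_nodes(nodes, name)
  nodes.flat_map do |node|
    matches = node['name'] == name ? [node] : []
    matches + find_nodes(node.fetch('nodes', []), name)
  end
end

brand_nodes = find_nodes(document['nodes'], 'brand')
p brand_nodes.map { |n| n['value'] }  # prints ["Samsung"]
```

The whole tree arrives in one read, but finding a node means walking it, whereas the referenced layout can filter a flat collection.
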
The choice is ultimately yours and will depend on whether or not you need to query nodes directly. If you're always going to return a document with all of its nodes, then the embedded approach may be best: as the documentation suggests, it only has to perform one lookup, so it may be much quicker (for small collections).

If you're going to start getting a large number of nodes, then pulling the data out may become slow if they are all contained within a single collection. But if you're using referenced associations and pulling out all of the nodes at once, it's probably going to be slower anyway.

Hope that helps a bit.
