- MapReduce is a two-step process
- the reduce function takes the output of the map step, aggregates it, and returns the result in a collection
- http://www.mongodb.org/display/DOCS/MapReduce
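The two steps can be illustrated with a small in-memory sketch. This is not MongoDB's implementation; `mapReduce`, `posts`, and the tag-counting functions are hypothetical names chosen for the example. In MongoDB itself the map function reads the current document from `this` and calls a global `emit(key, value)`.

```javascript
// In-memory sketch of the two MapReduce steps (not MongoDB's implementation).
// `docs`, `mapFn`, and `reduceFn` are hypothetical names used for illustration.
function mapReduce(docs, mapFn, reduceFn) {
  const groups = new Map();
  // Step 1: map. Each document may emit any number of (key, value) pairs.
  for (const doc of docs) {
    mapFn(doc, (key, value) => {
      if (!groups.has(key)) groups.set(key, []);
      groups.get(key).push(value);
    });
  }
  // Step 2: reduce. All values emitted for a key are aggregated into one result.
  const out = [];
  for (const [key, values] of groups) {
    out.push({ _id: key, value: reduceFn(key, values) });
  }
  return out;
}

const posts = [
  { author: "ann", tags: ["mongo", "db"] },
  { author: "bob", tags: ["mongo"] },
];

// Count how many posts carry each tag.
const tagCounts = mapReduce(
  posts,
  (doc, emit) => doc.tags.forEach((tag) => emit(tag, 1)),
  (key, values) => values.reduce((a, b) => a + b, 0)
);
console.log(tagCounts);
```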
- How to start thinking in terms of rich document modeling
- mongo can feel like you are denormalizing your data; it makes your data feel more object-like
- being object-like is a huge gain of mongo
- a collection is a set of documents, roughly equivalent to a table
- NO joins in mongo, but there is embedding
- sophisticated query system, not as good as SQL, but pretty decent
- all updates to a single document are atomic and isolated
- Considerations
- no joins
- documents are atomic
- the mongo _id is a BSON-specific id (ObjectId) that is generated for you
- you get an automatic timestamp as well, embedded in the ObjectId
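As a rough illustration of the automatic timestamp, the sketch below recovers the creation time from an ObjectId's hex string. It assumes the standard 12-byte ObjectId layout, where the first 4 bytes are big-endian seconds since the Unix epoch. `objectIdToDate` is a hypothetical helper and the id is made up.

```javascript
// Sketch: recover the creation time embedded in an ObjectId's hex string.
// Assumes the standard 12-byte layout, where the first 4 bytes are
// big-endian seconds since the Unix epoch.
function objectIdToDate(hex) {
  if (!/^[0-9a-fA-F]{24}$/.test(hex)) {
    throw new Error("expected a 24-character hex string");
  }
  const seconds = parseInt(hex.slice(0, 8), 16);
  return new Date(seconds * 1000);
}

// Made-up id for illustration; only the first 8 hex digits matter here.
const created = objectIdToDate("4e0000000000000000000000");
console.log(created.toISOString());
```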
- You can examine the query plan by using .explain()
- cool update operators such as $push, $pull, $pop, etc.
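To make the array operators concrete, here is an in-memory sketch of what `$push`, `$pull`, and `$pop` do to a document. `applyUpdate` is a hypothetical helper, not a MongoDB API, and it only handles top-level fields with primitive values.

```javascript
// In-memory sketch of what $push, $pull, and $pop do to array fields.
// `applyUpdate` is a hypothetical helper, not a MongoDB API; it handles
// top-level fields and primitive values only.
function applyUpdate(doc, update) {
  const result = JSON.parse(JSON.stringify(doc)); // work on a copy
  for (const [field, value] of Object.entries(update.$push || {})) {
    if (!Array.isArray(result[field])) result[field] = [];
    result[field].push(value); // append to the array
  }
  for (const [field, value] of Object.entries(update.$pull || {})) {
    if (!Array.isArray(result[field])) continue;
    result[field] = result[field].filter((v) => v !== value); // drop all matches
  }
  for (const [field, direction] of Object.entries(update.$pop || {})) {
    if (!Array.isArray(result[field])) continue;
    if (direction === -1) result[field].shift(); // -1 removes the first element
    else result[field].pop();                    // 1 removes the last element
  }
  return result;
}

let post = { title: "notes", tags: ["a", "b", "c"] };
post = applyUpdate(post, { $push: { tags: "d" } }); // tags: a, b, c, d
post = applyUpdate(post, { $pull: { tags: "b" } }); // tags: a, c, d
post = applyUpdate(post, { $pop: { tags: 1 } });    // tags: a, c
console.log(post.tags);
```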
- The 'dot' operator
- reach into the fields of the objects
- http://www.mongodb.org/display/DOCS/Dot+Notation+%28Reaching+into+Objects%29
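Dot notation can be pictured as walking a path through nested fields. The sketch below resolves such a path against a plain object; `getPath` and the `user` document are hypothetical, and real queries apply this on the server rather than in application code.

```javascript
// Sketch of dot notation: resolve a path such as "address.city"
// against a nested document. `getPath` is a hypothetical helper.
function getPath(doc, path) {
  return path.split(".").reduce(
    (value, key) => (value == null ? undefined : value[key]),
    doc
  );
}

const user = {
  name: "ann",
  address: { city: "Palo Alto", zip: "94301" },
  scores: [10, 20, 30],
};

console.log(getPath(user, "address.city"));    // "Palo Alto"
console.log(getPath(user, "scores.1"));        // 20 (array positions work too)
console.log(getPath(user, "address.country")); // undefined
```

In the shell, the same path would appear as a quoted key in a query document, for example `{"address.city": "Palo Alto"}`.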
- Modify atomically
- findAndModify allows you to find and modify atomically
- have the db conform to the application you are trying to build
- Shutterfly doesn't use any cloud-based infrastructure; they run on their own private servers
- traditional to RDBMS environments
- Data modeling matters; it is where tuning starts
- General tuning order
- modeling
- statement tuning
- instance tuning
- hardware tuning
- data modeling (http://www.mongodb.org/display/DOCS/MongoDB+Data+Modeling+and+Rails)
- Statement tuning
- enable the profiler and leave it on; its overhead is low
- What to look for?
- full scans
- nreturned vs nscanned
- updates
- fastmod (fastest)
- moved (exceeds reserved space in document)
- key updates (indexes need update)
- explain()
- use during development
- use when you find bad operations in profiler
- db.foo.find().explain()
- index usage; nscanned vs nreturned
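- nYields = number of times the query yielded its lock while waiting for other operations to complete
- a covered index means all the requested data can be read from the index alone, with no need to fetch the document payload
- run the query twice to see in-memory speed (the first run warms the cache)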
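The nscanned vs nreturned comparison can be turned into a simple check. The sketch below takes an object shaped like the fields named in these notes and flags operations that scan far more than they return. `scanReport` is a hypothetical helper and the threshold of 10 is an arbitrary assumption.

```javascript
// Sketch: flag operations that scan far more documents than they return.
// The input shape ({ nscanned, nreturned }) follows the field names in the
// notes above; the default threshold of 10 is an arbitrary assumption.
function scanReport(op, maxRatio = 10) {
  const returned = Math.max(op.nreturned, 1); // avoid dividing by zero
  const ratio = op.nscanned / returned;
  return { ratio, suspicious: ratio > maxRatio };
}

// Likely a full scan: 50,000 documents examined to return 5.
const fullScan = scanReport({ nscanned: 50000, nreturned: 5 });
// Well indexed: every scanned entry was returned.
const indexed = scanReport({ nscanned: 5, nreturned: 5 });
console.log(fullScan, indexed);
```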
- High performance writes
- Tuning
- read before write
- profiler
- tune for fastmod
- architectural changes
- split by collection
- shard
- High performance reads
- cache to disk ratio
- try to have enough memory in system for your indexes
- mongostat faults column
- data locality
- organize data for optimized I/O path. Minimize I/O per query
- Tools
- mongostat
- aggregate instance level information
- faults: cache misses
- lock%: tune updates
- mtop
- good picture of current session level information
- iostat
- how much physical I/O you are doing?
- is it faster to use a single thread for writes?
- yes
- the mongo shell is built on SpiderMonkey (a JavaScript engine)
- what is it good for?
- debugging
- administration
- scripting glue
- NOT for building apps
- Endianness http://en.wikipedia.org/wiki/Endianness
- Shard http://en.wikipedia.org/wiki/Sharding
- SAN http://en.wikipedia.org/wiki/Storage_area_network
- Lesson: Replica Sets Rock
- Lesson: Know your data
- MongoDB strings are UTF-8
- Lesson: Know your data size
- maximum document size is 4 MB in 1.6.x and 16 MB in 1.8.x
- Lesson: Know some sharding
- balancer can be your frenemy
- initial insert rate: 8000/sec
- http://blog.zawodny.com/2011/03/06/mongodb-pre-splitting-for-faster-data-loading-and-importing/
- mongostat
- like iostat
- gives you your virtual size
- provided by a database command called serverStatus
- db.serverStatus();
- profiler
- db.setProfilingLevel(2)
- 1 = only operations slower than a threshold in milliseconds (default 100); 2 = all operations (inserts, reads, writes)
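The difference between profiling levels can be sketched as a small decision function: level 0 logs nothing, level 1 logs only operations slower than the threshold, and level 2 logs everything. `shouldProfile` is a hypothetical helper written for illustration; it does not talk to a server.

```javascript
// Sketch of the profiler's logging decision at each level.
// `shouldProfile` is a hypothetical helper; 100 ms is the default slow threshold.
function shouldProfile(level, millis, slowMs = 100) {
  if (level === 0) return false;           // profiler off
  if (level === 1) return millis > slowMs; // slow operations only
  if (level === 2) return true;            // every operation
  throw new RangeError("profiling level must be 0, 1, or 2");
}

console.log(shouldProfile(1, 250)); // true: slower than 100 ms
console.log(shouldProfile(1, 20));  // false: fast enough to skip
console.log(shouldProfile(2, 20));  // true: level 2 logs everything
```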
- principles for indexing
- same as RDBMS
- Monitoring service
- Nagios and Munin as well as MMS (MongoDB Monitoring Service)
- Write block percentage
- Concurrency
- one write OR many readers
- web-console
- by default there is always an HTTP page with console info at port 28017
- background flushing
- 10gen tells people to RAID their EBS volumes
- connection leaks are sometimes an issue
- Network bytes in and out
- important for read heavy applications
- Fragmentation
- padding factor
- you cannot manually set padding factor right now
- dynamically calculated; it is the amount of extra space left around a document so it can grow when updated
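The effect of padding can be approximated with simple arithmetic. The sketch below assumes the space reserved for a document is its size multiplied by the padding factor, rounded up. Both function names are hypothetical and the real allocator has more detail than this.

```javascript
// Sketch: how a padding factor translates into reserved space.
// Assumes reserved size = document size * padding factor, rounded up.
function allocatedBytes(docBytes, paddingFactor) {
  return Math.ceil(docBytes * paddingFactor);
}

// An update can stay in place only if the grown document still fits
// in the reserved space; otherwise the document has to be moved.
function fitsInPlace(oldBytes, newBytes, paddingFactor) {
  return newBytes <= allocatedBytes(oldBytes, paddingFactor);
}

console.log(allocatedBytes(1000, 1.5));    // 1500
console.log(fitsInPlace(1000, 1400, 1.5)); // true: fits in the padding
console.log(fitsInPlace(1000, 1600, 1.5)); // false: exceeds reserved space
```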
- Journaling
- recommend having a second spindle just for the journal because syncing to the journal is a little expensive
- you can create a secondary index in the background
- you can also take a secondary offline, build the index there, and then sync it back up
- nginx, Haproxy
- mongodb and migrating off of postgres
- what we love about mongodb
- fast
- indexes and rich queries
- sharding and auto-balancing
- replication (see http://engineering.foursquare.com/2011/05/24/fun-with-mongodb-replica-sets/)
- lessons learned
- keep working set in memory
- keep indexes in memory
- avoid long-running queries
- monitor everything (per collection stats)
- application-level metrics are always good to monitor
- use small field names for large collections
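The reason short field names matter is that every document stores its own field names. The sketch below estimates that overhead, assuming each name costs its UTF-8 length plus one byte for the null terminator, as in BSON. `fieldNameBytes`, the field names, and the document count are made up for illustration.

```javascript
// Sketch: estimate bytes spent on field names across a collection.
// In BSON each field name is stored in every document as a
// null-terminated string, so a name costs (UTF-8 length + 1) bytes.
function fieldNameBytes(fieldNames, docCount) {
  const perDoc = fieldNames.reduce(
    (sum, name) => sum + Buffer.byteLength(name, "utf8") + 1,
    0
  );
  return perDoc * docCount;
}

const docCount = 100000000; // hypothetical collection of 100 million documents
const longNames = fieldNameBytes(["created_at", "user_identifier"], docCount);
const shortNames = fieldNameBytes(["c", "u"], docCount);
console.log(longNames - shortNames); // bytes saved by the shorter names
```

With these made-up numbers the difference is about 2.3 GB, which is space that would otherwise compete with the working set for memory.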
- mongo gem and bson gem because bson is the native object
- the bson_ext gem makes it a bit faster
- all ruby types map to bson types
- object ids are NOT strings
- MongoMapper recommended over Mongoid. There is also Mongomatic
- You need to size every secondary in your replica set as if it were the primary
- A typical MongoD should be on a Large or Extra Large standard on-demand instance on EC2
- A big MongoD should be on an Extra Large, Double Extra Large, or Quadruple Extra Large high-memory on-demand instance on EC2
- Small instance on EC2 is 32-bit so DO NOT use it
- ConfigD/Arbiter can run on a micro instance on EC2
- High-CPU Medium is 32-bit so DO NOT use it on EC2. High-CPU in general is just not necessary. More RAM is more important than having more CPU
- Operating Systems (Debian, Ubuntu, Fedora, Red Hat, FreeBSD)
- Turn off atime
- Raise file descriptor limits
- cat >> /etc/security/limits.conf << EOF
- * hard nofile 65536
- * soft nofile 65536
- EOF
- Use ext4, xfs
- DO NOT use large VM pages
- Use RAID
- RAID10 on MongoD
- RAID1 on ConfigDB
- MongoD on EC2
- LVM or MDADM
- 64-Bit EC2 instance
- striping = stripes across mirrors
- MongoS on EC2
- Runs on Application server
- doesn't need disk, ebs volume, raid
- 32 or 64 bit instance
- Arbiter on EC2
- Meant to vote on elections
- Normally need once a week
- Do not run it on the same node as MongoD
- 64 bit EC2 instance, micro or small is fine
- ConfigDB on EC2
- LVM or MDADM
- 64 bit EC2 instance micro or small is fine
- Deployment scenarios
- 3 - Node replica set
- 2 large MongoD in US-East one is primary and one is secondary with RAID 10
- 1 secondary MongoD with priority = 0 (cannot become a primary) in US-West also with RAID 10
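The three-node layout above could be described with a replica set configuration document along these lines. This is a sketch: the set name and hostnames are made up, and only the `priority: 0` setting comes from the notes.

```javascript
// Sketch of a replica set configuration document for the three-node layout
// above. The set name and hostnames are made up for illustration.
const config = {
  _id: "rs0",
  members: [
    { _id: 0, host: "east-1.example.com:27017" },              // US-East, may become primary
    { _id: 1, host: "east-2.example.com:27017" },              // US-East, may become primary
    { _id: 2, host: "west-1.example.com:27017", priority: 0 }, // US-West, never primary
  ],
};

// Members that may be elected primary (priority defaults to 1 when omitted).
const electable = config.members.filter(
  (m) => (m.priority === undefined ? 1 : m.priority) > 0
);
console.log(electable.map((m) => m.host));
```

A document of this shape is what `rs.initiate()` accepts in the mongo shell.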
- way to find out which node is the master
- db.is_master?