Eliot Horowitz, CTO, MongoDB
Exciting features in MongoDB 3.4
- Split-storage replica sets
- Recursive graph lookup
- Faceted search
- Compass data exploration tool
- $lookup (like joins)
- Read-only views
Ross Mauri, General Manager, z Systems and LinuxONE at IBM
LinuxONE + MongoDB; UK Schools program: ibm.com/linuxone/mongodb
Dan Worth, Director of Engineering, fuboTV
Brian McNamara, Founder, CloudyOps, LLC
- Tag sets / stateful services
- Mongo data should persist
- Label selectors; persistent volumes in pod definitions / replication controllers
- MongoDB service endpoint
- fuboTV blog post
[Marco Bonezzi][], Technical Services Engineer at MongoDB
[Marco Bonezzi]: https://twitter.com/marcobonezzi
Deployment -> Orchestration: using predefined cluster patterns; replicating environments.
Resource Control -> Resource Management: setting limits on key resources.
MongoDB & Docker -> Automate for Scaling: create once, deploy everywhere; deploy patterns, not processes.
Docker usage survey > Why are companies interested in Docker?
Docker Machine / Swarm / Compose
Blog post: Evaluating container platforms at scale
Using affinity filters in Docker Swarm to keep replica set members off the same instance.
Resource control with cgroups
WiredTiger memory is split in two: the WT cache plus MongoDB memory (connections, aggregations, map-reduce, etc.).
The mongod process does not see cgroup limits, so the WiredTiger cache size should be set explicitly.
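A minimal sketch of what "set the cache explicitly" implies: compute a cache size from the container's memory cap using MongoDB's documented default formula (the larger of 50% of (RAM minus 1 GB) or 256 MB), then pass it to `mongod --wiredTigerCacheSizeGB`. The helper name and the 4 GB example limit are illustrative.

```python
def wt_cache_gb(container_limit_bytes):
    """Suggested WiredTiger cache size for a memory-capped container,
    mirroring MongoDB's default formula:
    max(50% of (RAM - 1 GB), 256 MB)."""
    gb = container_limit_bytes / 1024 ** 3
    return max(0.5 * (gb - 1.0), 0.25)

# For a container capped at 4 GB, you would pass the result along, e.g.:
#   mongod --wiredTigerCacheSizeGB 1.5
limit = 4 * 1024 ** 3
print(wt_cache_gb(limit))  # 1.5
```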
docker top rs1 / docker stats rs1
Combine docker metrics with mongo metrics
Creating a Swarm cluster on AWS to deploy MongoDB
- Configure docker-machine with ec2 driver (AWS)
- Deploy discovery service for Swarm Master
- Deploy AWS instances for swarm master and swarm worker nodes
- Define compose file for deployment
- Define swarm filters and constraints and cgroup limits
- Connect to the swarm master
- Deploy the environment with a single command using the compose file
- Configure our MongoDB sharded cluster using the Cloud Manager API
- Demo!
[A Jesse Jiryu Davis][], Staff Engineer at MongoDB
[A Jesse Jiryu Davis]: https://twitter.com/jessejiryudavis
Links: screencast, resources
MongoS
Smart retry strategies
SDAM: Server Discovery and Monitoring
Bad retry strategies as they apply to:
- Network blips
- Primary failover
- Network down
- Command error
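The distinction in the list above can be sketched in pure Python: a network blip is transient and worth retrying with backoff, while a command error means the server saw and rejected the operation, so retrying just repeats the failure. This is a hedged sketch, not driver code; the exception names are hypothetical stand-ins.

```python
import time

class NetworkBlip(Exception): pass   # transient: safe to retry
class CommandError(Exception): pass  # server rejected the op: don't retry

def with_retry(op, retries=3, delay=0.01):
    """Retry transient network failures with exponential backoff.
    CommandError propagates immediately (retrying can't help)."""
    for attempt in range(retries):
        try:
            return op()
        except NetworkBlip:
            if attempt == retries - 1:
                raise
            time.sleep(delay * 2 ** attempt)

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise NetworkBlip()
    return "ok"

print(with_retry(flaky))  # ok -- succeeds on the third attempt
```

Note this still isn't safe for non-idempotent operations, which is exactly the point of the next section.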
Make your operations idempotent.
Non-idempotent update ($inc)
- Add unique token.
- Remove token and increment counter.
- $addToSet
- $pull, $inc
Eventual consistency after a network outage.
Appropriate for high-value, infrequent updates, when the trade-off of additional load/latency is warranted.
Otherwise, just use $inc, and accept possible count misses.
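The two-phase token pattern above can be simulated in pure Python (no real MongoDB): phase 1 behaves like `$addToSet` of a unique token, which is safe to retry blindly; phase 2 behaves like a `$pull` of that token combined with `$inc`, so the counter moves at most once per token. Field names here are invented for illustration.

```python
import uuid

def add_token(doc, token):
    """Phase 1 -- like $addToSet: adding the same token twice
    has no effect, so this step can be retried freely."""
    if token not in doc["pending"]:
        doc["pending"].append(token)

def commit_token(doc, token):
    """Phase 2 -- like $pull + $inc matched on the token: the
    counter increments only when the token was still present,
    so retries cannot double-count."""
    if token in doc["pending"]:
        doc["pending"].remove(token)
        doc["count"] += 1

doc = {"count": 0, "pending": []}
t = str(uuid.uuid4())
for _ in range(3):           # "network blips" force retries
    add_token(doc, t)
for _ in range(3):           # the commit is also retried
    commit_token(doc, t)
print(doc["count"])  # 1 -- incremented exactly once despite retries
```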
Black Pipe Testing / MockupDB
bit.ly/resilient-applications
Naama Bamberger, Software Developer at Cisco
MongoDB
- Easier to perform updates
- Forwards and backwards compatible
Automated deployment of MongoDB
Ops Manager?
OpenStack, HEAT, Puppet, Python for tooling
OpenStack is a virtualization solution; AWS is one of many clouds, and others are built on top of OpenStack. Begun in 2010 by NASA and Rackspace.
OpenStack services: Compute/Networking/Storage
Heat is the orchestration service; a YAML config describes deployment requirements.
Create VMs for each component.
OpenStack UI / Ops Manager UI
Automation agent
Goal state = ready to use.
Multiple data centers
On secondary data centers: deploy agents and create a deployer; from the deployer, access Ops Manager in the primary data center to extend the cluster.
Third data center with one arbiter.
When a failure occurs, create a new machine with an arbiter so a new primary can be elected.
Ceph / ScaleIO storage layer
[Kyle Erf][], Software Engineer, MongoDB
Shraya Ramani, Software Engineer, MongoDB
[Kyle Erf]: https://twitter.com/KyleErf
https://github.com/evergreen-ci/evergreen
Evergreen / In house continuous integration system
MongoDB is not a typical use case for CI
Most users have a good idea of how their product will be used.
Tests would take 20+ hours serially on one machine
Supported on multiple platforms.
MongoDB is tested on 50 different variants.
600 hours of compute time to run tests on all variants.
Previously used Buildbot, which couldn't scale enough.
Evergreen autoscales testing hardware to meet commit traffic.
Multi-platform support.
Powerful navigation. Open source licensing.
Components
- Repo tracker: uses a polling strategy for recent commits.
- Scheduler
- Host initializer
- Agent
- Task runner
Goals
- Minimize time in task queue
- Minimize idle host time
"Job shop scheduling": minimize makespan.
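The goals above (keep the task queue short, keep hosts busy) are a makespan-minimization problem. A hedged sketch of the classic longest-processing-time-first greedy, a well-known approximation for this problem, not Evergreen's actual scheduler:

```python
import heapq

def lpt_schedule(task_minutes, hosts):
    """Longest-Processing-Time-first greedy: sort tasks by
    descending duration, always hand the next task to the
    least-loaded host. A classic ~4/3-approximation for makespan."""
    loads = [(0, h) for h in range(hosts)]  # (total minutes, host id)
    heapq.heapify(loads)
    assignment = {h: [] for h in range(hosts)}
    for t in sorted(task_minutes, reverse=True):
        load, h = heapq.heappop(loads)
        assignment[h].append(t)
        heapq.heappush(loads, (load + t, h))
    makespan = max(load for load, _ in loads)
    return assignment, makespan

_, span = lpt_schedule([7, 5, 4, 3, 3, 2], hosts=2)
print(span)  # 12 -- the 24 minutes of work split evenly across 2 hosts
```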
Oron Gill Haus, Managing Vice President, Consumer Bank Engineering, Retail and Direct Bank, Capital One
Hygieia: Capital One DevOps dashboard
https://github.com/capitalone/Hygieia
[Dr. Eric Brewer][], Vice President of Infrastructure, Google
[Dr. Eric Brewer]: https://twitter.com/eric_brewer
Kubernetes 1.3:
- Rolling upgrades with pod labels
- Names are persistent and resolvable
- Init hook / recover or initialize state
- Staggered start
Dr. Hannah Fry, Lecturer in the Mathematics of Cities at the Centre for Advanced Spatial Analysis, University College London (UCL)
https://twitter.com/FryRsquared
Bike share usage patterns, tweet language distribution, and cows in heat!
Randolph Tan, Software Engineer at MongoDB
Config servers with replica sets. Election process.
Single source of truth. Single server maintains lock.
readConcern / readConcernMajority / readAfterOpTime
Topics on MongoDB docs:
- Read concern
- Replica set
- Rollback
- Sharding concepts
Beibei Xiao, DevOps Engineer at Baidu
2D geospatial indexes
MongoDB service API
Single point of entry
Quota control / Authorization / Flow control
Split large database into smaller databases
Rolling index builds: create the index on each node in turn, secondaries first and the primary last; just watch the oplog time.
Each node is brought up in a new mode without replication, the index is run in the foreground, then the node is added back to the replica set [make primary?]; the oplog is synced to the others.
Balancer and Migration
Problems:
- Balancing degrades system performance.
- Disk space is not released to the system after migration.
- Balancing is not fast enough when the shard count increases.
Solutions:
- Use a hashed shard key instead of a range shard key to avoid balancing.
- Pre-split and move chunks
- Limit balancer running time window
- Baidu custom balancer script
- Migrate databases between nodes.
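Why a hashed shard key reduces balancer work, sketched in pure Python: hashing spreads even a monotonically increasing key (like an ObjectId or timestamp) evenly across shards, so no single shard absorbs all inserts. This uses `md5` purely for illustration; it is not MongoDB's actual hash function, and the 4-shard layout is invented.

```python
import hashlib
from collections import Counter

def shard_for(key, shards=4):
    """Hashed shard key: map a key to a shard by hashing it.
    (md5 is illustrative only, not MongoDB's hash.)"""
    h = int(hashlib.md5(str(key).encode()).hexdigest(), 16)
    return h % shards

# Monotonically increasing keys (a worst case for range sharding)
# land nearly uniformly across all 4 shards under hashing.
counts = Counter(shard_for(i) for i in range(10000))
print(sorted(counts.items()))
```

With a range shard key, these same keys would all target one "hot" chunk and then need to be migrated; under hashing the balancer has little to do.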
In the Future:
- Spinning disk / better write performance when not using SSDs
- Balancer: lighter and faster
- Data compression.
- WT engine, document validation.
Coming in MongoDB 3.4:
- Enable parallel chunk migration
- Remove migration throttling by default for WiredTiger
MongoRocks: MongoDB storage integration layer for the RocksDB storage engine
https://github.com/mongodb-partners/mongo-rocks
Joe Drumgoole, Director of Developer Advocacy, EMEA at MongoDB
Aggregation grew up in 3.0
Cursors are returned from aggregate queries. $out writes to a new collection.
Processing pipeline: designed to process large groups of documents in parallel; is shard-aware; can create new data from old.
Match -> Project -> Group -> Sort
Group: group by, execute accumulators, rename fields.
Geo queries must be the first stage. $out must be the last stage.
Demoed with U.K. Driver and Vehicle Licensing Agency (DVLA) data set
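The Match -> Project -> Group -> Sort flow can be simulated in pure Python over a handful of dicts (no driver, no server); the vehicle-record field names are invented stand-ins for something like the DVLA data:

```python
# Toy documents standing in for vehicle records
docs = [
    {"make": "Ford", "colour": "red",  "cc": 1600},
    {"make": "Ford", "colour": "blue", "cc": 2000},
    {"make": "BMW",  "colour": "red",  "cc": 2500},
    {"make": "Ford", "colour": "red",  "cc": 1200},
]

# $match: keep only red vehicles
stage1 = [d for d in docs if d["colour"] == "red"]

# $project: keep only the fields later stages need
stage2 = [{"make": d["make"], "cc": d["cc"]} for d in stage1]

# $group: group by make, running a count and a sum accumulator
groups = {}
for d in stage2:
    g = groups.setdefault(d["make"], {"_id": d["make"], "n": 0, "cc": 0})
    g["n"] += 1
    g["cc"] += d["cc"]

# $sort: descending by count
result = sorted(groups.values(), key=lambda g: g["n"], reverse=True)
print(result)
# [{'_id': 'Ford', 'n': 2, 'cc': 2800}, {'_id': 'BMW', 'n': 1, 'cc': 2500}]
```

Each stage consumes the previous stage's output, which is why stage order matters (geo first, $out last).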
Keith Bostic, Senior Staff Engineer at MongoDB
Skiplists: top-to-bottom level searching; higher levels skip more values; singly linked forward lists.
Atomic increment/decrement
- Hazard pointers
- Skiplist
- Ticket locks
Open source implementations in WiredTiger
~200 lines of code in a btree; ~20 lines of code in a skiplist.
7x-10x performance bump
https://github.com/wiredtiger
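A minimal single-threaded skiplist sketch of the structure described above (search starts at the top level, where links skip many keys, and drops a level whenever it would overshoot). This is an illustration of the data structure only; WiredTiger's lock-free C implementation with hazard pointers is far more involved.

```python
import random

class Node:
    def __init__(self, key, height):
        self.key = key
        self.next = [None] * height  # one forward pointer per level

class SkipList:
    MAX = 8

    def __init__(self):
        self.head = Node(None, self.MAX)

    def insert(self, key):
        height = 1
        while height < self.MAX and random.random() < 0.5:
            height += 1  # coin flips pick the node's height
        node = Node(key, height)
        cur = self.head
        for lvl in range(self.MAX - 1, -1, -1):
            # advance along this level while the next key is smaller
            while cur.next[lvl] and cur.next[lvl].key < key:
                cur = cur.next[lvl]
            if lvl < height:  # splice the node in at this level
                node.next[lvl] = cur.next[lvl]
                cur.next[lvl] = node

    def search(self, key):
        cur = self.head
        for lvl in range(self.MAX - 1, -1, -1):
            while cur.next[lvl] and cur.next[lvl].key < key:
                cur = cur.next[lvl]  # skip ahead at the current level
        cur = cur.next[0]
        return cur is not None and cur.key == key

sl = SkipList()
for k in [30, 10, 20, 50, 40]:
    sl.insert(k)
print(sl.search(20), sl.search(25))  # True False
```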
Nuri Halperin, Principal at Plus N Consulting
Same rules, completely different scale
"Mongorama"
Example: "Maker Space" says the boss / space, creativity, tools, community
So many tables / relationships. Done. Mic drop.
First rule of Nuri's thumb: Key interactions should drive document design
Why is my data scattered across tables? Must all objects be flat? Can we iterate quickly?
Add makership info directly on a person document.
Data that works together lives together.
Instead of application logic to require certifications, put that boolean within the document?
Maybe the application should define and enforce the rules?
Embed immutable data.
You don't get much value from referential schema design.
Should I embed?: Ownership? / Work Together / Bound Growth / Lifetime
Example: Library book checkout system. Card stays with book.
Will I need more and more library cards?
Does the rate of change of the embedded data match the parent document?
Have a separate ledger collection. Add some additional information for maker / tool so we don't have to do a referential lookup to answer questions.
Key interactions:
- Maker gets verified
- Borrow and return tool
Aggregation framework. Who are the biggest users of tools? Audit report. Aging report.
Let the engine work for you.
The aggregation framework should be used for reports instead of calculating in memory.
You can keep a fixed length array of recent checkouts on the tool itself. Do it either on write or have a background job.
$slice: -3
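A pure-Python sketch of what a `$push` with `$slice: -3` update does to the tool document: append the newest checkout, then keep only the last three entries. The document shape and field names are invented for illustration.

```python
def push_with_slice(doc, field, value, keep=-3):
    """Mimics MongoDB's {$push: {field: {$each: [value], $slice: -3}}}:
    append the new entry, then retain only the newest three."""
    arr = doc.setdefault(field, [])
    arr.append(value)
    doc[field] = arr[keep:]

tool = {"name": "laser cutter", "recentCheckouts": []}
for maker in ["ann", "bob", "cho", "dev"]:
    push_with_slice(tool, "recentCheckouts", maker)
print(tool["recentCheckouts"])  # ['bob', 'cho', 'dev']
```

Doing this on write keeps the recent-checkouts view on the tool itself, so reads need no extra lookup against the ledger collection.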
Collections: makers, tools, toolLog
Easier to determine function with these 3 collections than many more flat tables with mostly referential data.
Takeaways:
- Key interactions drive schema design
- Data that works together lives together.
- Embed immutable data.
- Let the engine work for you
The roles have changed. DBAs no longer set the schema for developers to conform to. Responsibility on the developer side has increased.