MongoDB SF

http://www.mongodb.com/events/mongodb-sf-2014

Keynote

Mostly about 2.8 features.

  • document-level locking
  • pluggable storage engine
    • in Asya's demo, it looks like rs.status() does not show the storage engine used by each replica set member (lame)
    • there's a bug hunt with prizes for the release candidate of the storage engine stuff
  • some MMS automation bullshit (magic click-to-deploy stuff)
    • sounds like it makes the upgrade path simple, but it's not something we can make use of
      • however, we can easily emulate their approach (take down replica set members one at a time and upgrade them; rough sketch below)
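
A rough sketch of that rolling-upgrade sequence from the mongo shell (my own guess at the mechanics, not something they showed):

    // for each secondary in turn: stop mongod, swap in the new binary,
    // restart it, then wait for it to catch up before touching the next one
    rs.status();     // confirm the member is healthy and back in SECONDARY state

    // once all secondaries run the new version, hand off the primary:
    rs.stepDown(60); // primary steps down; an already-upgraded member takes over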

Internet of Things with MongoDB

  • should be a neat live demo
  • "10k inserts per second is where stuff gets interesting"
  • problems with IoT and systems with zillions of sensors:
    • lots and lots of writes
    • really big data
  • ... writing some visualization stuff live using Processing (seems pretty off-topic...)
    • nothing to do with mongodb yet
    • so far just wasting time showing us how processing works and talking about basic programming/GUI stuff
  • everyone in the audience has a big white piece of paper and he's setting up a camera on stage that will detect when we're holding them up
    • presenter has written more processing code to capture from the camera and display it on a different part of the canvas
      • also downsampling to low resolution and grayscale as well as cropping to make it easier to work with
  • now he's hooking up mongo to store readings from the camera
    • storing in a collection:
      • current time
      • coordinate of pixel (x,y)
      • "color" (really just an int 0-255 for brightness since he's using grayscale and masking it)
  • first brute force approach has performance problems (can't do inserts quickly enough)
    • going to refactor to use bulk inserts (see the sketch after this list)
    • after the refactor: much higher performance (about 50k inserts/second)
  • using aggregation framework to display analysis
    • aggregating to determine average lightness of each frame in order to graph how many people are holding up the white signs
      • not an accurate head count, just a way to compare over time whether one moment was whiter than another
    • the simple approach here has performance issues
  • going to use "pre-aggregation" to improve things (also sketched after this list, along with the singleton trick)
    • basically only storing what we care about for the analysis and using upsert to reduce the number of unique documents
  • next step: hot spot analysis (checking each pixel for its whiteness)
    • going to use a "singleton collection" (a collection that only ever has one document)
      • don't care about historical data, only the current state, so we can just overwrite this document as we go
      • this way queries only ever need to retrieve one thing
      • pre-aggregating values for each pixel at a unique "x.y" field path in the document
      • more processing code to display it
      • it never quite worked before he ran out of time
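
For reference, a minimal sketch of the bulk-insert refactor in the mongo shell (collection, field, and helper names are guesses from the description above):

    var width = 64, height = 48;                // downsampled frame size (made up)
    function brightnessAt(x, y) { return 0; }   // stand-in for the camera sampler

    // one document per pixel per frame: timestamp, (x,y) coordinate, 0-255 brightness
    var bulk = db.readings.initializeUnorderedBulkOp();
    var now = new Date();
    for (var x = 0; x < width; x++) {
      for (var y = 0; y < height; y++) {
        bulk.insert({ ts: now, x: x, y: y, color: brightnessAt(x, y) });
      }
    }
    bulk.execute(); // one round trip per batch instead of one per insert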
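
And sketches of the follow-up tricks with equally invented names: the naive per-frame average, the upsert-based pre-aggregation, and the singleton hot-spot document:

    // naive analysis: aggregate the raw readings into a per-frame average
    db.readings.aggregate([
      { $group: { _id: "$ts", avgColor: { $avg: "$color" } } },
      { $sort: { _id: 1 } }
    ]);

    // pre-aggregation: one document per frame, accumulated with an upsert,
    // so the average is just totalColor / pixels at read time
    var frameTimestamp = new Date(), brightness = 142;  // stand-ins
    db.frames.update(
      { _id: frameTimestamp },
      { $inc: { totalColor: brightness, pixels: 1 } },
      { upsert: true }
    );

    // singleton collection: one document overwritten every frame, with a
    // pre-aggregated value per "x.y" pixel path; queries only fetch one doc
    db.hotspots.update(
      {},
      { $set: { "12.34": 198, "12.35": 201 /* ...one field per pixel... */ } },
      { upsert: true }
    );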

Unify Your Selling Channels in One Product Catalog Service

  • "systems of engagement"
    • ways to answer questions and take action when a customer is in the store
    • rapid iteration in retail (e.g. "is this sale working? if not, what do i need to change?")
  • challenges:
    • data model changes frequently
      • new products, partners, product/customer attributes, etc all the time
    • desire to ask questions in real time
    • geo-location
  • use cases: modern, seamless retail
    • store rich product information (shitloads of attributes and relations to other products, etc)
    • "consolidated customer view"
      • e.g. same customer across in-person store, online store, catalog orders, phone, etc
      • personalize the (virtual) storefront for each customer
      • even external stuff, e.g. "what did this particular customer say about this particular product on facebook"
    • objective is a "global product service"
      • single canonical view of a product, all products in one central service
        • schema needs to be flexible
        • geographical distribution
        • high volume read/write spikes, e.g. 100k reads/second
        • need good indexes!
      • how to manage multiple copies of the same data
        • e.g. individual store wants a local catalog of its products that are somehow copies of some central catalog of all products
        • briefly mentioned geographically-distributed replica set members (but this only solves the read problem, not writes)
        • not sure if he ever actually mentioned a full solution for this
      • responding to events in real time
        • examples:
          • twitter promotion on black friday for a discount, decided and implemented within an hour because of time-sensitivity
          • for virtual storefronts, what's the current weather at the customer's location? (e.g. do i advertise umbrellas or flip flops?)
      • price may vary across many dimensions:
        • product, size, color, store, customer, etc
      • search
  • TL;DR of everything so far: "retail has a naturally complicated data model and high performance/availability/consistency demands, also lots of reads and writes"
    • k
    • stopped taking detailed notes at this point, didn't seem worth it...
  • another +1 for "tailor your schema to your queries"
    • and another +1 for pre-aggregation (his example was just a count, but still)
  • takeaway: MongoDB's flexible schemas are a much better fit for retail data than traditional strictly-schema'd RDBMSes (example document below)
    • but it still requires the same kind of planning and due diligence
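
To make the flexible-schema takeaway concrete, a hypothetical product document of the kind he was describing (every name below is invented):

    db.products.insert({
      _id: "sku-12345",
      name: "Umbrella",
      attributes: { color: "red", sizes: ["S", "M", "L"] },  // varies per product
      related: ["sku-67890"],
      pricing: [                        // price varies by store/channel/customer
        { store: "sf-01", price: 19.99 },
        { channel: "online", segment: "loyalty", price: 17.99 }
      ],
      location: { type: "Point", coordinates: [-122.4, 37.77] }
    });

    // geo queries ("what's near this customer?") want a geospatial index;
    // another instance of "tailor your schema (and indexes) to your queries"
    db.products.ensureIndex({ location: "2dsphere" });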

A Full-Stack, Realtime Database Driver: Meteor and the Next Generation of Web and Mobile Applications

  • Meteor is a JS framework/ecosystem for building realtime apps, made up of:

    • LiveQuery
      • realtime DB queries
    • DDP
      • subscribe to changes in DB
    • MiniMongo
      • run db queries from the client
      • cache relevant data on the client
    • Tracker
      • re-run functions when data changes
    • Blaze
      • keep the view up-to-date with data
  • JS runs on both client and server

    // shared code runs in both environments; the conditionals gate the rest
    doSomethingOnBothClientAndServer();
    if (Meteor.isClient) {
      doSomethingOnClientOnly();
    } else if (Meteor.isServer) {
      doSomethingOnServerOnly();
    }
    • code exists in both places, it just doesn't execute everywhere (because of the conditionals)
      • sounds like you want to be careful about where you put your secret sauce and how it is exposed
  • Blaze is a custom templating language

    • HTML + goop
    • it looks kinda like handlebars (tiny sketch below)
    • view automatically updates when data changes
      • e.g. you just run a mongo query from the client, a bunch of magic happens, and the view automatically responds (without redrawing everything)
        • UI elements are bound by observers to live data (kind of like cursors with a websocket in the middle)
      • actually it's clever, the UI is updated as soon as the client fires off the update request
        • but the server is still the ultimate source of truth, it'll send back what it thinks is the new state of the data and if needed the UI gets updated again to reflect the server-side state
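
    A toy sketch of that reactivity (my own example, not from the talk; Signs is a made-up collection):

    // given a template like <template name="signCount"><p>{{count}}</p></template>,
    // this helper re-runs whenever the query's results change, and Blaze
    // patches just the affected DOM node instead of redrawing the page
    Signs = new Mongo.Collection('signs');
    Template.signCount.helpers({
      count: function () {
        return Signs.find({ up: true }).count();
      }
    });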
  • uses websockets for continuous communication

  • seems like this would have been a good choice for my battle cobras hackathon project

  • meteor has its own package system

    • managing packages actually affects clients in real time too (no need to refresh your browser)
  • can explicitly decide what to publish to a client (the "autopublish" package just shoots out everything, great for demos, terrible for anything real); see the sketch below

    • the default behavior is to observe for changes and blast out data appropriate to the given query
    • the oplog is tailed to become aware of updates without having to poll
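
    A minimal sketch of explicit publishing (publication and collection names invented):

    Signs = new Mongo.Collection('signs');  // shared by client and server

    if (Meteor.isServer) {
      // publish only the documents (and fields) this client should see
      Meteor.publish('upSigns', function () {
        return Signs.find({ up: true }, { fields: { up: 1, ts: 1 } });
      });
    }

    if (Meteor.isClient) {
      Meteor.subscribe('upSigns');  // MiniMongo now mirrors just that subset
    }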
  • server knows about the current state of the client's data so that it doesn't send unnecessary junk

  • does not work with sharded clusters

  • "mongo1" discount code for eventedmind.com

  • my takeaway: a really neat approach, but still immature and it seems far too "insecure by default"

    • also not sure how well it will scale

MongoDB Analytics: Learn Aggregation by Example - Exploratory Analytics and Visualization Using Flight Data

  • for his demo:
    • one collection
    • every document is a flight
      • tons o' fields
    • will use data set to answer various questions
      • which airline has the most delays?
      • which airports are the worst in terms of cancelled flights?
      • etc
  • aggregation operations
    • group
    • sort
    • sum
    • avg
    • projection:
      • everything to do with fields (one at a time)
        • create computed fields on outputs based on other fields
        • rename fields
        • etc
      • e.g. i have a number of total flights and a number of cancelled flights, what's the cancel rate?
        • answer: you use $divide to compute cancelled/total and spit it into a new field (sketched after this list)
    • unwind
  • the order of operations in an aggregation pipeline does matter (stages run in sequence, so e.g. filtering early keeps later stages cheap)
  • overall this was more about "how to do analysis" and not very much about "how mongodb aggregation framework works"
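
A sketch of the cancel-rate question as a pipeline (field names guessed, not from his actual data set):

    db.flights.aggregate([
      { $group: {
          _id: "$carrier",                                   // one bucket per airline
          total:     { $sum: 1 },
          cancelled: { $sum: { $cond: ["$cancelled", 1, 0] } },
          avgDelay:  { $avg: "$arrDelay" }
      } },
      { $project: {
          avgDelay: 1,
          cancelRate: { $divide: ["$cancelled", "$total"] }  // computed field
      } },
      { $sort: { cancelRate: -1 } }                          // worst first
    ]);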

MATH is Hard: TTL Index Configuration and Considerations

http://docs.mongodb.org/manual/core/index-ttl/

  • TTL indexes define how many seconds a document lives for
    • a TTLMonitor process sweeps through and deletes documents whose TTL has expired
  • avoids having to do manual deletes of stale data
  • expire after vs expires at
    • expireAfterSeconds is a global policy
    • expiresAt is per-document expiration
    • can combine these two
  • created like a normal index, you just have to specify expireAfterSeconds (shell sketches at the end of this list)
    • you always need to specify expireAfterSeconds, even if you use expiresAt
      • can be expireAfterSeconds: 0
  • think about fragmentation and other costs of frequent deletes
    • probably want to keep TTL'd data separate from other data to avoid performance issues
  • TTL index limitations:
    • can't use _id
    • can't use nulls
    • no compound indexes
  • sounds like you manually set the timestamp for creation date/expires at?
    • if so, gotta keep your app code smart
  • TTLMonitor doesn't always delete stuff immediately at the expiration date
    • depends on workload, etc
    • only runs once every 60 seconds
  • other tips:
    • ISODate has millisecond resolution, while TTL is specified in whole seconds
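
Shell sketches of the two flavors (collection and field names mine):

    // "expire after": global policy; documents die 1 hour after their createdAt
    db.events.ensureIndex({ createdAt: 1 }, { expireAfterSeconds: 3600 });
    db.events.insert({ createdAt: new Date(), msg: "hi" });  // app sets the date

    // "expires at": per-document deadline; expireAfterSeconds is still
    // required, it's just 0, so the stored date itself is the deadline
    db.sessions.ensureIndex({ expiresAt: 1 }, { expireAfterSeconds: 0 });
    db.sessions.insert({ expiresAt: new Date("2014-12-25T00:00:00Z") });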