WNY Ruby meetup notes for March-a-palooza

WNY Ruby March-a-palooza

'^`*.,*'^`*.,*'^`*.,*'^`*.,*'^`*.,*'^`*.                         
*'^`*.,*'^`*.,*'^`*.,*'^`*.,*'^`*.,*'^`*.,*'                     
,*'^`*.,*'^`*.,*'^`*.,*'^`*.,*'^`*.,*'^`*.,*'^`*         ,---/V\  <  RUBY!! )
.,*'^`*.,*'^`*.,*'^`*.,*'^`*.,*'^`*.,*'^`*.,*'^`*.,*    ~|__(o.o)
*.,*'^`*.,*'^`*.,*'^`*.,*'^`*.,*'^`*.,*'^`*.,*'^`*.,*'^`  UU  UU 

Start 7:30pm

Riak for the Ruby Brigade

by Andrew Thompson

Traits

  • key value
  • simple operations (get, put, delete; see the Ruby sketch after this list)
  • REST-inspired API
  • optional metadata
  • mapreduce (like ruby map / inject)
  • links (link to other keys, useful for associations)
  • full-text search (via riak-search, bundled)
  • secondary indexes
  • distributed, clustered operation (recommended)
  • fault tolerance (no master / slave), handles node failure
  • all data written N times
  • tolerates network partitions (i.e. a switch dies)
  • Highly available (ask any node to get data)
  • live node inherits responsibility of dead node
  • lets you define the strictness (tunable consistency)
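
For the Ruby folks, the simple get/put/delete operations look roughly like this with the riak-client gem (a minimal sketch; option names, the bucket, and the key below are for illustration and may differ across gem versions):

    require 'riak'

    client = Riak::Client.new(host: '127.0.0.1', http_port: 8098)
    bucket = client.bucket('meetups')            # keys are grouped into buckets

    note = bucket.new('march-a-palooza')         # put
    note.content_type = 'application/json'
    note.data = { 'topic' => 'Riak', 'speaker' => 'Andrew Thompson' }
    note.store

    fetched = bucket.get('march-a-palooza')      # get
    puts fetched.data['topic']

    bucket.delete('march-a-palooza')             # delete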

Background

  • built with erlang/OTP (open telecom platform)

Speaking REST

  • HTTP default interface
  • works well with reverse proxy caches, load balancers, web servers, etc.
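
Because the interface is plain HTTP, you can also skip the client library entirely. A rough sketch with Ruby's Net::HTTP, assuming a local node on port 8098 and the older /riak/<bucket>/<key> URL scheme:

    require 'net/http'
    require 'json'

    uri = URI('http://127.0.0.1:8098/riak/meetups/march-a-palooza')

    # PUT a JSON value...
    put = Net::HTTP::Put.new(uri, 'Content-Type' => 'application/json')
    put.body = { 'topic' => 'Riak' }.to_json
    Net::HTTP.start(uri.host, uri.port) { |http| http.request(put) }

    # ...and GET it back; proxies, caches, and load balancers just see ordinary HTTP
    res = Net::HTTP.get_response(uri)
    puts res['Content-Type']
    puts res.body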

Amazon Dynamo-inspired

  • masterless
  • consistent hashing
  • anti-entropy: read-repair, hinted hand-off

Data Model

  • referenced by its key
  • grouped into buckets
  • object composed of metadata and value
  • metadata:
    • vector clock
    • Content-Type (text/html, text/plain, application/json, etc.)
    • Links
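
A sketch of setting that metadata through riak-client; the Riak::Link constructor and the walk call here are from memory, so treat the details as approximate:

    require 'riak'

    client = Riak::Client.new
    people = client.bucket('people')

    jon = people.new('jon')
    jon.content_type = 'application/json'                         # Content-Type metadata
    jon.data = { 'name' => 'Jon' }
    jon.links << Riak::Link.new('/riak/people/andrew', 'friend')  # link to another key
    jon.store                                                     # client tracks the vector clock for us

    jon.walk(bucket: 'people', tag: 'friend')                     # follow links like an association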

Consistent Hashing

  • bucket/key combination is hashed to a 160-bit value
  • ring is split into evenly sized partitions; Riak tries hard to keep adjacent partitions from being handled by the same node
  • ring divided evenly as possible among nodes
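
Not Riak's actual code, but a toy Ruby sketch of the idea: SHA-1 the bucket/key pair into the 160-bit space, then see which of the evenly sized partitions the value falls into (the partition count of 64 here is just an example):

    require 'digest/sha1'

    KEYSPACE   = 2**160                 # the 160-bit ring
    PARTITIONS = 64                     # ring is split into evenly sized partitions
    WIDTH      = KEYSPACE / PARTITIONS

    def partition_for(bucket, key)
      hash = Digest::SHA1.hexdigest("#{bucket}/#{key}").to_i(16)  # 160-bit integer
      hash / WIDTH                      # index of the partition that owns this key
    end

    puts partition_for('meetups', 'march-a-palooza')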

Requests and magic numbers

  • N - number of replicas (default: 2)
  • R - read quorum (how many reads you want to succeed)
  • W - write quorum (how many nodes must accept the write before the request returns)
  • DW - durable write (write actually happened, likely on disk)

every request reads from all N replicas
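
These quorums can be tuned per request; in riak-client they map to options like :r, :w, and :dw (option names as I recall them, so a sketch rather than gospel):

    require 'riak'

    client = Riak::Client.new
    bucket = client.bucket('meetups')

    obj = bucket.get('march-a-palooza', r: 2)   # wait for 2 replicas to answer the read
    obj.data['attendees'] = 30
    obj.store(w: 2, dw: 1)                      # 2 accepted writes, 1 durable, before returning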

Conflict resolution

  • conflicting writes: a vector clock is used to determine which one is right
  • you can configure buckets to allow multiple values (siblings) and let the application resolve the conflict manually
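
A sketch of the "allow multiple" route with riak-client (allow_mult, conflict?, and siblings are the gem's names as I remember them); the resolution policy itself is up to the application:

    require 'riak'

    client = Riak::Client.new
    bucket = client.bucket('meetups')
    bucket.allow_mult = true                    # keep conflicting writes around as siblings

    obj = bucket.get('march-a-palooza')
    if obj.conflict?
      # application-level resolution: here we arbitrarily keep the largest sibling
      winner = obj.siblings.max_by { |s| s.data.to_s.length }
      obj.siblings = [winner]
      obj.store                                 # the resolved value gets a new vector clock
    end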

Node failure

  1. node dies (had certain partitions)
  2. partitions get redistributed to live nodes
  3. requests are re-routed, writes still succeed, and N replicas of each write are still created
  4. when node comes back up, revived node is updated
  • read repair -> if one replica missed a write, that node gets pushed the correct information

  • can gracefully scale nodes

Clients

Official clients: Erlang, Java, Ruby, Python, and PHP.

There are several unofficial clients.

Storage engine

  • Bitcask (default) - append-only, all keys stored in memory, disk for data
  • Memory - all stored in memory
  • LevelDB - by Google, faster than Innostore
  • Innostore - 'Embedded InnoDB', not supported by Oracle, GPL (not recommended)
  • Multi - proxies to different backends depending on bucket name

Search and secondary indexes

  • full-text search:

    • custom analyzer support
    • custom schema support
    • Lucene-inspired query syntax (inspired, not an exact or complete match)
  • secondary indexes

    • LevelDB backend only
    • integer or string indexes
    • can be arbitrary
    • query on keys using secondary indexes, can do range (or exact) queries

Both of these let you feed the result to map/reduce.
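
Roughly what that looks like from riak-client; the _bin index suffix, get_index, and Riak::MapReduce calls are from memory, so verify against the gem:

    require 'riak'

    client = Riak::Client.new
    people = client.bucket('people')

    jon = people.new('jon')
    jon.content_type = 'application/json'
    jon.data = { 'name' => 'Jon', 'city' => 'Buffalo' }
    jon.indexes['city_bin'] << 'Buffalo'              # arbitrary string index
    jon.store

    keys = people.get_index('city_bin', 'Buffalo')    # exact-match secondary index query

    # feed the matching keys into map/reduce
    mr = Riak::MapReduce.new(client)
    keys.each { |k| mr.add('people', k) }
    mr.map('function(v){ return [JSON.parse(v.values[0].data).name]; }', keep: true)
    puts mr.run.inspect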

Enterprise edition (commercial offering)

  • masterless multi-site replication (useful across different regions), real-time or periodic (differential), can be bidirectional
  • full customer support

Reasons not to use

  • you only need a single-server database
  • your data fits a simple relational model
  • you need to commit to Riak rather than leaning on existing SQL knowledge
  • you're unsure of your application's data model (Riak needs up-front planning)

Reasons to use

  • Availability, assurance of uptime
  • Scale predictability
  • latency predictability
  • willing

Riak success stories

  • webapps: Yammer, Clipboard, GitHub (link shorteners)
  • casual gaming
  • medical records (Trifork + Denmark)
  • smartphone applications (Voxer, Bump)
  • caching (Posterous)
  • session store

Voxer

push-to-talk phone application

  • store both user and raw audio data in riak

Questions

Q: scaling, initial configuration of ring size
A: hard to do; it is a pain point but is being worked on. Append-only storage
   is cleaned out periodically

Q: learning a new way to structure data
A: occasionally, relational databases fall over at certain scale levels.
   For time-series data (range queries), you can do rollups

End 8:22pm

Intro to MongoDB

Ines, Data Engineer @ Engine Yard

Relational World

ACID

  • sharding

NoSQL

CAP - Consistency, Availability, Partition Tolerance

  • horizontal scaling
  • unstructured data
  • large quantity of data

MongoDB

Document-oriented DB (BSON, binary JSON), schema-less, memory-mapped data files (up to 2GB each), written in C++
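
For the Ruby side, a minimal sketch with the 1.x mongo gem of that era (Mongo::Connection was its entry point; the database, collection, and documents below are invented):

    require 'mongo'

    conn  = Mongo::Connection.new('localhost', 27017)
    db    = conn.db('meetup')
    talks = db.collection('talks')                 # collections hold BSON documents

    talks.insert('title' => 'Intro to MongoDB', 'speaker' => 'Ines', 'tags' => %w[nosql ruby])

    talks.find('tags' => 'nosql').each do |doc|    # schema-less: documents can vary in shape
      puts doc['title']
    end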

Query interface

  • js console

  • will have aggregation framework (native, faster)

  • sharding out-of-the-box

  • map/reduce

  • geospatial indexes

  • backups

  • import/export data

  • single write lock (hinders heavy writes)

Structure

  • single master DB
  • roles: master and slave
  • Replica sets:
    • 2-3 nodes basic install
    • Primary, Secondary & arbiter (Tie-breaker)
    • Automatic failover

Replica sets (1.7+) are recommended over a master/slave setup

  • replica sets: data
  • config servers: sharding info, balancer
  • mongos: routes application connections to the cluster
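
From the application's point of view (again a sketch with the 1.x Ruby driver; the seed hosts are made up, and the ReplSetConnection constructor shape varied across driver versions, so check yours):

    require 'mongo'

    # Talk to the replica set itself: the driver finds the primary and
    # follows the automatic failover when a new one is elected.
    rs = Mongo::ReplSetConnection.new(['db1.local:27017', 'db2.local:27017', 'db3.local:27017'])

    # In a sharded cluster the application connects to mongos instead,
    # which looks like a single server to the driver.
    via_mongos = Mongo::Connection.new('mongos.local', 27017)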

Oplog

  • keeps track of replication
  • capped collection: default 5% of total disk (EY uses 10%)
  • size determines how much replication will be retained

The Journal

  • write-ahead redo logs
  • performance impact
  • faster recovery

Drivers

  • MongoMapper
  • Mongoid
  • the Ruby mongo driver
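
For example, a Mongoid document model looks roughly like this (model and fields invented; connection configuration omitted):

    require 'mongoid'

    class Talk
      include Mongoid::Document
      field :title,   type: String
      field :speaker, type: String
      field :tags,    type: Array
    end

    # assumes Mongoid has been configured to point at a MongoDB instance
    Talk.create(title: 'Intro to MongoDB', speaker: 'Ines', tags: %w[nosql ruby])
    Talk.where(speaker: 'Ines').first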

Tips

  • Use 64-bit builds; 32-bit limits you to ~2.5 GB of addressable memory, then kaputt
  • Don't run standalone, use replication
  • use replica sets
  • turn journaling on (the default since ~2.0.x)
  • keep with current versions
  • keep working set in memory
  • use MMS (MongoDB Monitoring Service) for monitoring
  • choose a good sharding key
  • mind your config servers