@ckdake
Created June 17, 2011 17:52
Velocity 2011 Notes

Velocity 2011

Tuesday

OpenStack workshop – Ron Pedde (Rackspace Hosting), Todd Willey (OpenStack), Matt Ray (Opscode)

  • Open Source Cloud – compute nodes and object storage
  • openstack-cookbooks – stable ([email protected]/opscode.org)
  • dellcloudedge – bare metal openstack installer
  • voxeldotnet – cookbooks for launching swift in production with spiceweasel
  • not ready for prime time, specifically the networking side

How to Scale Dirty and Its Influence on People – Philip Kromer (Infochimps), Dennis Yang (Infochimps)

  • cluster_chef
  • Goliath for concurrency
  • `knife cluster launch spidermonkey webnode --bootstrap`
  • startup is a tool to turn time and money into a validation of what the world wants
  • prefinery – manage customer signups in beta
  • “only automate out of boredom or terror”
  • “don’t solve problems you’d like to have”
  • monitoring: statsd (udp/decoupled), graphite (embeddable/decoupled) – measure anything measure everything
  • Flume – secretly ‘reliability glue’
  • tests are for confident deploys and decoupling (integration tests, not unit tests)
  • radical decoupling – reduce critical path edges (sum(kloc)^2) – hoorah for rails engines
  • unbundle reliability glue from function
  • distributed but fits in your brain
  • optimize debug loop time – scripts should be optimized for developer happiness, improving performance is to lower debug loop time (if system can scale like disk, makes it easier to do this)
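The "udp/decoupled" point about statsd can be sketched as follows. This is a minimal illustration, assuming statsd's plain-text `name:value|type` line format and its default port 8125; the function names are hypothetical and not from the talk.

```python
import socket


def format_metric(name: str, value: int, metric_type: str = "c") -> bytes:
    """Build one statsd line, e.g. b'signups:1|c' (counter) or b'render:320|ms' (timer)."""
    return f"{name}:{value}|{metric_type}".encode("ascii")


def send_metric(name: str, value: int, metric_type: str = "c",
                host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget over UDP: the app never waits on the collector."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.sendto(format_metric(name, value, metric_type), (host, port))
    except OSError:
        # Assumed policy: a broken metrics path must never break the application.
        pass


send_metric("signups", 1)        # count one event
send_metric("render", 320, "ms")  # record a timing in milliseconds
```

Because nothing is read back from the socket, a slow or missing statsd daemon costs the application nothing, which is what makes "measure anything, measure everything" cheap.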

Advanced PostMortem Fu & Human Error 101 – John Allspaw (Etsy)

  • ‘fundamental surprises’ are events that weren’t thought to be possible
  • organizations are the blunt end, operators are the sharp end
  • ‘The Conversation’ movie w/Gene Hackman
  • ‘Severity 3 is kind of like your third child’
  • ‘thematic vagabonding’ – context switching meets troubleshooting – too much switching to actually make progress
  • ‘goal fixation’ (encystment) – irrationally ignoring signals based on gut feelings
  • improvisation is a requirement for troubleshooting complex systems
  • ‘any explanation is better than none’
  • hindsight bias – knowledge of the outcome influences the analysis of the process
  • ‘people’s need to be right is stronger than their ability to be objective’
  • systemic model: ‘resultant’ – known things lined up, ‘emergent’ → not things in a line, but factors spread out all over
  • functional resonance – in isolation, components act within bounds. Interconnected, they produce emerging behaviors
  • “There is no root cause”
  • ‘human error is an effect, not a cause’ – it’s just a label
  • success is a special case of failure
  • tame complexity through new forms of feedback
  • human error is an inevitable by-product of strained complex systems
  • reprimanding someone is like peeing your pants – first feels nice and warm, then uncomfortable
  • accountability = responsibility + requisite authority

Managing the System Lifecycle and Configuration of Apache Hadoop and Other Distributed Systems – Philip (Cloudera)

  • magic system for managing Hadoop

Ignite Velocity

  • Zoompf.com/free yslow hooked up to web crawler with web frontend
  • Interviewing – bookofbrilliantthings.com – guidelines for interviewing and being interviewed
  • Finding a place to rent in San Francisco
  • monitor driven infrastructure – hubspot
  • everything i need to know about capacity management i learned at disneyland – Little’s law → N = XR , fail fast+shed load OR performance tuning OR scale up (add more Dumbos)
  • lightning karaoke
  • ETIL
  • CloudFlare 50Gbps DDoS protection
  • @lnxchk – around the world in a glass
  • Why is Pixar so fscking good
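Little's law from the capacity management talk (N = XR: items in the system = throughput × residence time) can be worked through with a small sketch. The numbers below are made up for illustration and the function names are hypothetical.

```python
def in_flight(throughput_per_s: float, residence_time_s: float) -> float:
    """Little's law: N = X * R."""
    return throughput_per_s * residence_time_s


def max_throughput(capacity: float, residence_time_s: float) -> float:
    """Rearranged for a system with a fixed number of slots: X = N / R."""
    return capacity / residence_time_s


# Hypothetical service: 100 concurrent slots, 0.5 s per request.
baseline = max_throughput(100, 0.5)     # 200 requests/s
tuned = max_throughput(100, 0.25)       # performance tuning: halve R
scaled_up = max_throughput(200, 0.5)    # scale up: double N ("add more Dumbos")
```

Halving residence time and doubling capacity buy the same throughput here; once offered load exceeds X, the remaining option from the talk is to fail fast and shed load.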

Wednesday

Keynotes

CSS3 & HTML5 – Beyond the Hype! – Nicole Sullivan (@stubbornella)

  • Don’t worry about degradations in IE6 for non-major things (rounded corners, drop shadows, etc)
  • oocss
  • CSS xfer size has highest correlation to render time (HTTP archive)
  • use classes/ids instead of descendants
  • div.className is worse than .className (KISS)
  • avoid transparency calculations (border-radius, box-shadow, rgba)
  • avoid combining CSS effects (test together not separately)
  • use CSS3 to cut down on number of images (password strength widgets, tab and table decorations), ‘callout’ module
  • RGBA is nice (Zurb’s awesome buttons)
  • border-radius is nice too, goes after vendor prefix properties
  • gradients – Lea Verou pattern backgrounds
  • csslint (github.com/nzakas/parser-lib) csslint github

Why the Yahoo FrontPage Went Down and Why It Didn’t Go Down For up to a Decade before That – Jake Loomis (Yahoo!)

  • redundancy for everything (understand your system’s failure points)
  • error proof change – make changes in a safe environment – treat staging as production and fork in prod traffic to staging env. continuous integration. dark launch code
  • global load balancing (many different locations)
  • monitor everything
  • peak traffic on news was royal wedding, then bin laden death. shifted content to other properties, turned off features (35k tested peak rqs for news, up to 41k)
  • fallback plan in case of failure. isolate failure. drop features to add capacity
  • metrics on performance of external sites to prevent sending them more traffic than they can handle
  • APC cache was root cause of Yahoo.com outage

Oh, To Be Single Again – Building a Single Codebase in a Client-server World – Daniel Hunt (Yahoo!)

  • history lesson on evolution of application stack and complexity (static → dynamic → personalized dynamic) and sources of latency (network, storage, browser)
  • optimize time to interaction on Y!Mail by sending HTML skeleton and launch.js to client; as soon as each backend data feed is ready, it gets pushed as JSON to the user (facebook, mail, twitter, etc).
  • new Y!Mail uses this
  • Ground rules:
    • use element.innerHTML instead of document.createElement
    • use function loadScript() instead of script tags
    • separate out pieces
  • use a templating strategy (like mustache). PHP is not a good one
  • javascript classes only pulled in for needed phase. rendering can be done on server or on client
  • NodeJS handles workload much better than Apache (5-6x)
  • Single codebase is the way to go! Only one set of tests

How to Build a SaaS App With Twitter-like Throughput on Just 9 Servers – Lew Cirne (New Relic)

  • 20 billion measurements from 150k processes from 10k customers every day on 9 servers
  • majority of data collected is ‘timeslices’ ~250 bytes, ~100k/second (twitter peaks at ~8k/s)
  • collecting is one thing, but need to provide realtime insight. data is always stale, caching techniques are no good
  • average page load time for main dashboard is 2.4s
  • User interface rails 2.3 (Rails+nginx+linux), data collectors on Jetty (Java, Jetty, Linux), data store shared by accounts (MySQL, percona)
  • 9 agg/col/db servers: 24 core nehalem, 48gb ram, SAS RAID 5
  • 2 web app servers – 12 core nehalem, 48gb ram
  • new version ‘real user monitoring’ uses ‘Episodes’ by steve souders to time full page load time → over 1 billion POSTs per week
  • RUM beacons go into beacon servers on EC2 which then roll it up and async aggregate to collectors
  • challenges: data purging, what to pre-aggregate, large accounts, MySQL tuning, IO performance
  • MySQL can control exact placement of bytes on disk → optimize number of disk seeks required to render page
  • Keep it simple, less is more, Trendy != Reliable, Plan for scale, Use the right technology for a given task

Creating the Dev/Test/PM/Ops Supertribe: From “Visible Ops” To DevOps – Gene Kim (Visible IT Flow)

  • higher performing IT orgs are more stable/nimble/compliant/secure
  • common traits of high performers: culture of change management, causality, compliance and continual reduction of operational variance
  • author of visible ops handbook
  • 3 questions that predict 60% of performance: to what extent does an org define/monitor/enforce
    • standardized configuration strategy
    • process discipline
    • controlled access to production systems
  • the dreaded disease “IT Operations Constipatus” – accumulating technical debt
  • Goal #1: Decrease cycle time of releases “Release early, release often”
  • Goal #2: Increase Production Rigor “Better → Faster → Cheaper”
  • “When IT Fails: The Novel” and “DevOps Cookbook” on the way

Building for the Cloud: Lessons Learned at Heroku – Mark Imbriaco (Heroku)

  • Heroku is on Amazon’s cloud 100%
  • ELB (SSL termination) → mesh of routing proxies (erlang) → grid of application servers
  • routing proxy maps apps to app servers where it’s running
  • router idles out app processes for free apps after some time
  • used to have varnish but pulled it out in favor of CDNs
  • opinionated about decisions and force the Right Way on their users
  • No Persistent Filesystem – shouldn’t do this anyways
  • Horizontal Scalability → think about sharding up front (disk IO is major problem)
  • Avoid the disk. EBS is tempting but don’t do it. keep everything in memory
  • working towards disposable compute where any node can fail at any time → very well understood failure domain (Netflix Chaos Monkey strategy)
  • throw nodes away instead of diagnosing individual problems
  • Elasticity isn’t just for scaling → deploy new code to new servers for rolling updates on excess capacity
  • Service Discovery – good to be good at keeping everything in sync (AMQP announcements – treat as ephemeral, timeout since ‘down’ messages might not make it)
  • doozer github
  • DNS is sexy: low TTLs, use subdomains (can’t cname apex)
  • ‘yo dawg, i heard you like platforms so i put a platform on your platform’ – running platform for an app on top of Heroku. Minimal changes for heroku to run their app on top of their infrastructure if can boot a single app (‘internal’ feature flag)
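The service discovery point above (treat announcements as ephemeral and time them out, since a "down" message might never arrive) can be sketched as a registry that only trusts recent heartbeats. This is a hypothetical in-memory illustration, not Heroku's implementation; the class name, the 30 second TTL, and the injectable clock are all assumptions.

```python
import time
from typing import Callable, Dict, List


class ServiceRegistry:
    """Tracks nodes by their last announcement; silence counts as 'down'."""

    def __init__(self, ttl_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl_s = ttl_s
        self._clock = clock
        self._last_seen: Dict[str, float] = {}

    def announce(self, node: str) -> None:
        """Record an 'I am up' announcement (e.g. one received off a message bus)."""
        self._last_seen[node] = self._clock()

    def live_nodes(self) -> List[str]:
        """Drop nodes whose last announcement is older than the TTL, return the rest."""
        now = self._clock()
        self._last_seen = {
            node: seen for node, seen in self._last_seen.items()
            if now - seen <= self._ttl_s
        }
        return sorted(self._last_seen)


# Usage with a fake clock so the expiry is easy to follow.
fake_now = [0.0]
registry = ServiceRegistry(ttl_s=30.0, clock=lambda: fake_now[0])
registry.announce("app-1")
registry.announce("app-2")
fake_now[0] = 20.0
registry.announce("app-2")   # app-2 keeps announcing, app-1 goes silent
fake_now[0] = 40.0
live = registry.live_nodes()  # app-1 has timed out without ever saying 'down'
```

No explicit "down" message is needed: a node that crashes simply stops announcing and ages out.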

Thursday

Keynotes

Ian Flynt (Yahoo!) – World IPv6 day

  • monitoring isn’t the same in a dual-stack environment – DNS health checks didn’t know about IPv6 so fell back to default rotation in US datacenters
  • don’t start something big and risky at a traffic inflection point
  • always have more than one way to look at things (2x)
  • practice makes perfect. for a major change, schedule multiple test runs

Facebook Open Compute & Other Infrastructure – Jonathan Heiliger (Facebook)

  • make audacious bets and iterate quickly, but manage risk with hedges

Velocity Culture – Jon Jenkins (Amazon.com)

  • success of culture depends on linking it to the business
  • focus on capacity planning is focus on spending money → better to do capacity optimization
  • November 10, 2010 → last physical web server at amazon.com shut down → full move to ec2
  • deployments every 12s to production, average 10k hosts receiving deployment simultaneously

Artur on SSD’s – Artur Bergman (Fastly)

  • SSDs for disk → 1 watt vs 15 watt, 7 min fsck on massive filesystem, not that expensive, DO IT

Cisco and Open Stack – Lew Tucker (Cisco)

  • OpenStack – Compute, Image Service, Object Storage → open source cloud that runs on multiple hypervisors
  • Cisco doing open source for the first time
  • Quantum – network service – virtual wire (L2/L3, attach VMs and services, etc)
  • API abstraction with vendor-specific plugin backend

State of the Infrastructure – Rachel Chalmers (The 451 Group)

  • science fiction – obsession with tools, fantasy – obsession with symbols
  • amazon is ‘a bookstore selling crack out the back door’

Holistic Performance – John Resig (Mozilla Corporation)

  • performance in jQuery project
  • more than wall time → battery usage, parse time, number of requests, file size, etc
  • can’t drop browser support for performance gains in another (can’t slow down IE just to make others faster either)
  • have to prove positive impact of JS performance change to do it
  • jsperf
  • doesn’t matter how much you unroll a loop if that loop is doing DOM manipulation
  • don’t compromise code quality in exchange for performance
  • hard to create realistic test cases

Lightning Demos Thursday – Michael Schneider (Google), Andreas Grabner (dynaTrace Software), Paul Irish (jQuery Developer Relations), Sergey Chernyshev (truTV)

  • pagespeed
  • dynatrace – John Resig and Steve Souders both like → full support for FF4
  • chrome tools
    • task manager → right click to get more info
    • performance.timing → added performance.memory via --enable-memory-info
    • window.onerror
    • console.profile() and console.profiles[] → console.profileEnd()
    • console.markTimeline()
    • extensions: audits (way to maintain standards like no images above 80k)
    • heap profiler
    • remote debugging → --remote-debugging-port
  • show slow
    • can pull in custom metrics (GoogleAnalytics, order system, etc)
    • can send events to it via a web service i.e. ‘Combined multiple JS files into one’

Cast – The Open Deployment Platform – Paul Querna (Rackspace)

  • from cloudkick
  • cast
  • github
  • Maslow’s hierarchy: Having Releases → Your 7 year old bash script → Configuration management → APIs for Deployment → Heroku
  • Service management → start/stop/restart via HTTP API
  • Version management → Distribution/upgrade/rollback
  • Service monitoring → logfiles/network ports/processes
  • Service coordination → config/where is my database? who is the master?

Making the web instant – Arvind Jain (Google), Sreeram Ramachandran (Google)

  • average time to click on link → onmouseover to onclick → 300ms avg
  • Google instant pages → pre-renders first page from search results by predicting what you’ll click on
  • Google Chrome Instant → preference in Chrome, preloads based on what you’re typing in the omnibox
  • link rel=prerender → instruct the browser to load a page that the user is likely to visit next
  • proposed new standard “Page Visibility API” → determine if a page has actually been seen by a user or not

Wikia: The Road to Active/Active – Jason Cook (Wikia)

  • 10s of millions of articles, 246 languages, 45+ million monthly visitors, 1 billion pages per month
  • mediawiki: master/slave, read after write, direct cache invalidation, caches really well
  • 94% cache hit, 1-5% logged in, 99% no-writes
  • DR site in Iowa (Identical copy of hardware)
  • Varnish → swiss army knife of wikia. multiple retries in config instead of sending error pages to user
  • MediaWiki read-only mode for DR site → no user problems until they try to edit things
  • SSD sped up MySQL slave buffer cache full time from 1.5 hours to 3 minutes
  • varnish purges pushed to a queue, processed worldwide in under 1s – make sure to do these after MySQL updates make it through replication
  • use idle apaches to render for geo directed users, but some GETs trigger writes.
  • users get cookies from master after POST, directed to master until slave cluster catches up
  • don’t forget to maintain 2x capacity!
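The read-after-write rule above (a cookie set after a POST keeps the user on the master until the slave cluster catches up) can be sketched as a small routing function. This is a hypothetical illustration, not Wikia's code: the 10 second stickiness window stands in for "until the slaves catch up", and the names are invented.

```python
from typing import Optional, Tuple

# Assumption: replication lag stays under this window.
STICKY_SECONDS = 10.0


def route(method: str, sticky_until: Optional[float],
          now: float) -> Tuple[str, Optional[float]]:
    """Return (backend, cookie value) for one request.

    The cookie value is the time until which the user must stay on the master.
    """
    if method not in ("GET", "HEAD"):
        # A write: serve from the master and (re)start the sticky window.
        return "master", now + STICKY_SECONDS
    if sticky_until is not None and now < sticky_until:
        # A read shortly after a write: the slaves may still be stale.
        return "master", sticky_until
    # An ordinary read: safe to serve from the nearby slave cluster.
    return "slave", None


anonymous_read = route("GET", None, now=100.0)
edit = route("POST", None, now=100.0)
read_after_edit = route("GET", edit[1], now=105.0)
read_later = route("GET", edit[1], now=111.0)
```

The sketch treats every GET as a pure read, so it does not cover the case noted above where some GETs trigger writes.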

Instrumenting the real-time web: Node.js, DTrace and the Robinson Projection – Bryan Cantrill (Joyent, Inc.)

  • nodejs is three ideas:
    • JS rich support for asynchrony (i.e. closures)
    • High-performance VMs
    • The system abstractions that God intended “dynamic c”
  • nodeknockout contest → instrument incoming connections and geolocate them to provide ‘leaderboard’
  • use DTrace to instrument connections “dtrace – something that is important for ass saving”
  • dtrace is in kernel and instruments entire system, but tough for high-level interpreted environments → USDT “User-level statically defined tracing”
  • function in javascript that probes into C++ backend with USDT instrumentation
  • if you’re hitting GC in node you have a memory leak, you’re like a drug addict that has hit rock bottom. if i give you more memory it’s going right in your arm
  • Joyent uses OS level virtualization instead of hypervisor to allow introspection like dtrace from dom-0
  • github
  • leaderd/tickerd solution → 700ms latency from connection to app to appearing on dashboard
  • what map projection to use? Robinson projection → it’s not actually a projection github
  • node.js perfect for web-facing real-time systems that are hurt by long latency events and not CPU time
  • “Data Intensive Real Time” – DIRT

Scaling Concurrently – John Adams (Twitter)

  • traditional monolith is one big block of code, one big database, lots of SPOFs and technical debt
  • start early with being able to monitor: observability
  • ensure that network services are reachable (monit, god, etc)
  • separate everything as decoupled individual services – introduces latency but that’s ok
  • decompose
    • via RPC: Transport (thrift, protocol buffers), Data Marshaling (JSON, Binary JSON, Raw Data, Messagepack), Fail Fast (less than 20ms)
    • via queues/daemons: (kestrel is twitter’s): Handling failures? re-queue? drop? global locks and concurrency a concern
  • kestrel – scala / open source
    • set enqueue, get dequeue (speaks memcache)
    • no job ordering, no shared state
  • start configuration management early
  • SSL performance: cert key length matters, cipher accept order matters, ssl keepalives to eliminate handshake but uses more sockets/descriptors
  • Scribe for logging
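The "set enqueue, get dequeue (speaks memcache)" note means a queue client only needs the memcache text protocol. The sketch below just builds the wire bytes for the two standard memcache commands; the helper names and the queue name are hypothetical, and it makes no claims about kestrel-specific options.

```python
def enqueue_command(queue: str, payload: bytes) -> bytes:
    """memcache 'set <key> <flags> <exptime> <bytes>' where the key is the queue name."""
    header = f"set {queue} 0 0 {len(payload)}\r\n".encode("ascii")
    return header + payload + b"\r\n"


def dequeue_command(queue: str) -> bytes:
    """memcache 'get <key>': on a queue server this pops the next item."""
    return f"get {queue}\r\n".encode("ascii")


push = enqueue_command("jobs", b"hello")
pop = dequeue_command("jobs")
```

Reusing an existing protocol means any memcache client library can act as a queue producer or consumer with no new code.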

HTML5, Flash and the Battle for Faster Cat Videos – Greg Schechter (YouTube), Phil Harnish (YouTube)

  • HTML5 v flash – features, accessibility, deviceability, security, embeds, api
  • HTML5 is missing:
    • content protection (RTMPE protocol)
    • camera and microphone access
    • fullscreen video
    • consistent format support – needs to support H.264 and WebM
    • cross platform consistency
  • HTML5 has
    • open source (full stack)
    • lower latency (no plug-in)
    • better performance and fidelity
    • accessibility
  • 45% of youtube.com has ability to watch HTML5, but majority of people using YoutubeAPI are in HTML5
  • HTML5 player boots faster but video play time lags flash → mostly due to HTML5 ones not being in cache
  • 200ms improvement by preloading video connection in head
  • wait until user clicks play to do all the work
  • Flash still preferred most places, but HTML5 is awesome and there is demand

reddit.com War Stories: The mistakes we made and how you can avoid them. – Jeremy Edberg (reddit.com)

  • mistake: relying on a single cloud product and expecting it to work as advertised (avoid EBS for now, or RAID around it)
  • mistake: a single EBS volume for a database → they now use 13 disks: 6 pairs spanned and a spare
  • mistake: not accounting for increased latency in virtualized environments
  • mistake: not using a service based architecture sooner
  • mistake: not using a consistent key hashing algorithm at first → move to Cassandra (Dynamo model for consistent hashing)
  • mistake: using bleeding edge software in production (Cassandra 0.7)
  • mistake: not having enough monitoring and not having monitoring that is virtualization friendly (Use Ganglia, backed by RRD, not friendly to change)
  • database scaling issues: they do shard, but using key value they don’t use transactions and wish they did
  • use londiste for replication which is great and flexible, but doesn’t handle errors well like slow disk
  • users notice inconsistency and make comments about it
  • plan for 3 or more than 3 whenever you are writing code
  • queues are your friend
  • treat logged out users as second class (serve all from cdn)
  • reddit is open source on github
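The consistent key hashing mistake above is easier to see with a sketch. This is a generic hash ring with virtual nodes, not reddit's or Cassandra's implementation; the class name, the MD5 choice, and the replica count are assumptions for illustration.

```python
import bisect
import hashlib
from typing import List, Tuple


def _hash(value: str) -> int:
    """Map a string to a stable point on the ring."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Each node owns many points; a key belongs to the next point clockwise."""

    def __init__(self, nodes: List[str], replicas: int = 100) -> None:
        self._ring: List[Tuple[int, str]] = sorted(
            (_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(replicas)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key: str) -> str:
        # Wrap around to the first point when the key hashes past the last one.
        idx = bisect.bisect(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]


keys = [f"key{i}" for i in range(1000)]
before = HashRing(["cache-a", "cache-b", "cache-c"])
after = HashRing(["cache-a", "cache-b", "cache-c", "cache-d"])
moved = [k for k in keys if before.node_for(k) != after.node_for(k)]
```

With naive `hash(key) % node_count` placement, adding a fourth node would remap most keys. On the ring, the only keys that move are the ones the new node takes over.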

Choose Your Own Adventure 2: Electric Boogaloo ;-) – Adam Jacob (Opscode), Jesse Robbins (Opscode)

  • sales and marketing: marketing brings leads via campaigns. lead nurturing → qualified prospects. ‘prosecuted’
  • no assholes rule (positive interactions must outnumber negative ones 5:1) withholders of effort, affectively negative, or interpersonal deviants
  • automate: provisioning, DNS, server inventory, configuration management, identity management, version control, monitoring and trending, application deployment
  • polyglots: sysadmins are software developers with shitty languages: sysadmins should learn scheme
  • managing ops: ops responsible for: system availability and efficiency. developers must be on call, sysadmins should be escalated to. metrics tie to $$. should be saying ‘yes’ instead of no, but make people commit
  • philosophy: people don’t remember tools used to build great things. can only be measured by final solution. your best skill is knowing systems and problems
  • open source: we like to look at pretty cars, but we take the ugly ones home and work on it/fix it up/etc. you cannot leapfrog stewardship
  • devops: is not a job description, you can’t ‘be’ a devops “you don’t care that i got divorced because your crappy code woke me up” – sysadmin to dev. devops is all inclusive. someone not happy or exclusive → doing it wrong