Velocity 2011

Tuesday

OpenStack workshop – Ron Pedde (Rackspace Hosting), Todd Willey (OpenStack), Matt Ray (Opscode)

Open Source Cloud – compute nodes and object storage
openstack-cookbooks – stable ([email protected]/opscode.org)
dellcloudedge – bare metal openstack installer
voxeldotnet – cookbooks for launching swift in production with spiceweasel
not ready for prime time, specifically the networking side

How to Scale Dirty and Its Influence on People – Philip Kromer (Infochimps), Dennis Yang (Infochimps)

cluster_chef
Goliath for concurrency
`knife cluster launch spidermonkey webnode --bootstrap`
startup is a tool to turn time and money into a validation of what the world wants
prefinery – manage customer signups in beta
“only automate out of boredom or terror”
“don’t solve problems you’d like to have”
monitoring: statsd (udp/decoupled), graphite (embeddable/decoupled) – measure anything measure everything
Flume – secretly ‘reliability glue’
tests are for confident deploys and decoupling (integration tests, not unit tests)
radical decoupling – reduce critical path edges (sum(kloc)^2) – hoorah for rails engines
unbundle reliability glue from function
distributed but fits in your brain
optimize debug loop time – scripts should be optimized for developer happiness, improving performance is to lower debug loop time (if system can scale like disk, makes it easier to do this)
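
The statsd note above (udp/decoupled) can be sketched like this. It is a minimal sketch, assuming the plain-text statsd line format `name:value|type`; the function names, host, and default port are illustrative choices (8125 is the conventional statsd port), not something shown in the talk.

```python
import socket


def format_metric(name, value, metric_type="c"):
    """Build one statsd line, e.g. 'signups:1|c' for a counter."""
    return f"{name}:{value}|{metric_type}"


def send_metric(name, value, metric_type="c", host="127.0.0.1", port=8125):
    """Fire-and-forget over UDP so the app never waits on the metrics server."""
    payload = format_metric(name, value, metric_type).encode("ascii")
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(payload, (host, port))
    except OSError:
        # Decoupled by design: a metrics failure must not break the caller.
        pass
    finally:
        sock.close()
    return payload
```

Calling `send_metric("signups", 1)` with no collector listening simply drops the datagram, which is the decoupling the note refers to.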

Advanced Postmortem Fu & Human Error 101 – John Allspaw (Etsy)

‘fundamental surprises’ are events that weren’t thought to be possible
organizations are the blunt end, operators are the sharp end
‘The Conversation’ movie w/Gene Hackman
‘Severity 3 is kind of like your third child’
‘thematic vagabonding’ – context switching meets troubleshooting – too much switching to actually make progress
‘goal fixation’ (encystment) – irrationally ignoring signals based on gut feelings
improvisation is a requirement for troubleshooting complex systems
‘any explanation is better than none’
hindsight bias – knowledge of the outcome influences the analysis of the process
‘people’s need to be right is stronger than their ability to be objective’
systemic model: ‘resultant’ – known things lined up, ‘emergent’ → not things in a line, but factors spread out all over
functional resonance – in isolation, components act within bounds. Interconnected, they produce emerging behaviors
“There is no root cause”
‘human error is an effect, not a cause’ – it’s just a label
success is a special case of failure
tame complexity through new forms of feedback
human error is an inevitable by-product of strained complex systems
reprimanding someone is like peeing your pants – first feels nice and warm, then uncomfortable
accountability = responsibility + requisite authority

Managing the System Lifecycle and Configuration of Apache Hadoop and Other Distributed Systems – Philip (Cloudera)

magic system for managing Hadoop

Ignite Velocity

Zoompf.com/free yslow hooked up to web crawler with web frontend
Interviewing – bookofbrilliantthings.com – guidelines for interviewing and being interviewed
Finding a place to rent in San Francisco
monitor driven infrastructure – hubspot
everything i need to know about capacity management i learned at disneyland – little’s law → N = XR , fail fast+shed load OR performance tuning OR scale up (add more Dumbos)
lightning karaoke
ETIL
CloudFlare 50 Gbps DDoS protection
@lnxchk – around the world in a glass
Why is Pixar so fscking good
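
Little's law from the Disneyland capacity talk (N = XR) as a worked example. This is a small sketch; the function names and the ride numbers in the usage note are made up for illustration.

```python
def in_system(throughput_per_s, residence_time_s):
    """Little's law: N = X * R (items in the system at steady state)."""
    return throughput_per_s * residence_time_s


def required_throughput(n_in_system, residence_time_s):
    """Rearranged: X = N / R, the throughput needed to hold N at residence time R."""
    return n_in_system / residence_time_s
```

For example, a ride that admits 0.5 guests per second and keeps each one for 120 seconds holds `in_system(0.5, 120)` = 60 guests. To serve more guests per second at the same N you either shorten R (performance tuning) or raise N (add more Dumbos).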

Wednesday

Keynotes

CSS3 & HTML5 – Beyond the Hype! – Nicole Sullivan (@stubbornella)

Don’t worry about degradations in IE6 for non-major things (rounded corners, drop shadows, etc)
oocss
CSS xfer size has highest correlation to render time (HTTP archive)
use classes/ids instead of descendants
div.className is worse than .className (KISS)
avoid transparency calculations (border-radius, box-shadow, rgba)
avoid combining CSS effects (test together not separately)
use CSS3 to cut down on number of images (password strength widgets, tab and table decorations), ‘callout’ module
RGBA is nice (Zurb’s awesome buttons)
border-radius is nice too, goes after vendor prefix properties
gradients – Lea Verou pattern backgrounds
csslint (github.com/nzakas/parser-lib) csslint github

Why the Yahoo FrontPage Went Down and Why It Didn’t Go Down For up to a Decade before That – Jake Loomis (Yahoo!)

redundancy for everything (understand your system’s failure points)
error proof change – make changes in a safe environment – treat staging as production and fork in prod traffic to staging env. continuous integration. dark launch code
global load balancing (many different locations)
monitor everything
peak traffic on news was royal wedding, then bin laden death. shifted content to other properties, turned off features (35k tested peak rqs for news, up to 41k)
fallback plan in case of failure. isolate failure. drop features to add capacity
metrics on performance of external sites to prevent sending them more traffic than they can handle
APC cache was root cause of Yahoo.com outage

Oh, To Be Single Again – Building a Single Codebase in a Client-server World – Daniel Hunt (Yahoo!)

history lesson on evolution of application stack and complexity (static → dynamic → personalized dynamic) and sources of latency (network, storage, browser)
optimize time to interaction on Y!Mail by sending an HTML skeleton and launch.js to the client; as soon as each backend data feed is ready, it gets pushed as JSON to the user (Facebook, mail, Twitter, etc).
new Y!Mail uses this
Ground rules:
- use element.innerHTML instead of document.createElement
- use function loadScript() instead of script tags
- separate out pieces
use a templating strategy (like mustache). PHP is not a good one
javascript classes only pulled in for needed phase. rendering can be done on server or on client
NodeJS handles workload much better than Apache (5-6x)
Single codebase is the way to go! Only one set of tests

How to Build a SaaS App With Twitter-like Throughput on Just 9 Servers – Lew Cirne (New Relic)

20 billion measurements from 150k processes from 10k customers every day on 9 servers
majority of data collected is ‘timeslices’ ~250 bytes, ~100k/second (twitter peaks at ~8k/s)
collecting is one thing, but need to provide realtime insight. data is always stale, caching techniques are no good
average page load time for main dashboard is 2.4s
User interface rails 2.3 (Rails+nginx+linux), data collectors on Jetty (Java, Jetty, Linux), data store shared by accounts (MySQL, percona)
9 agg/col/db servers: 24 core nehalem, 48gb ram, SAS RAID 5
2 web app servers: 12 core nehalem, 48gb ram
new version ‘real user monitoring’ uses ‘Episodes’ by steve souders to time full page load time → over 1 billion POSTs per week
RUM beacons go into beacon servers on EC2 which then roll it up and async aggregate to collectors
challenges: data purging, what to pre-aggregate, large accounts, MySQL tuning, IO performance
MySQL can control exact placement of bytes on disk → optimize number of disk seeks required to render page
Keep it simple, less is more, Trendy != Reliable, Plan for scale, Use the right technology for a given task
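
A back-of-the-envelope check of the timeslice numbers above (~250 bytes, ~100k/second, 9 servers). This sketch counts raw payload only and assumes the load is spread evenly across the 9 servers, which the notes do not state; the function name is illustrative.

```python
SECONDS_PER_DAY = 86_400


def ingest_estimate(bytes_per_item=250, items_per_second=100_000, servers=9):
    """Rough ingest figures from the quoted timeslice size and rate."""
    bytes_per_second = bytes_per_item * items_per_second
    return {
        "bytes_per_second": bytes_per_second,
        "bytes_per_day": bytes_per_second * SECONDS_PER_DAY,
        # Assumption: even spread across servers (not stated in the talk notes).
        "items_per_second_per_server": items_per_second / servers,
    }
```

With the defaults this gives 25 MB/s of raw timeslice data, about 2.16 TB per day, and roughly 11k timeslices per second per server.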

Creating the Dev/Test/PM/Ops Supertribe: From “Visible Ops” To DevOps – Gene Kim (Visible IT Flow)

higher performing IT orgs are more stable/nimble/compliant/secure
common traits of high performers: culture of change management, causality, compliance and continual reduction of operational variance
author of visible ops handbook
3 questions that predict 60% of performance: to what extent does an org define/monitor/enforce
- standardized configuration strategy
- process discipline
- controlled access to production systems
the dreaded disease “IT Operations Constipatus” – accumulating technical debt
Goal #1: Decrease cycle time of releases “Release early, release often”
Goal #2: Increase Production Rigor “Better → Faster → Cheaper”
“When IT Fails: The Novel” and “DevOps Cookbook” on the way

Building for the Cloud: Lessons Learned at Heroku – Mark Imbriaco (Heroku)

Heroku is on Amazon’s cloud 100%
ELB (SSL termination) → mesh of routing proxies (erlang) → grid of application servers
routing proxy maps apps to app servers where it’s running
router idles out app processes for free apps after some time
used to have varnish but pulled it out in favor of CDNs
opinionated about decisions and force the Right Way on their users
No Persistent Filesystem – shouldn’t do this anyways
Horizontal Scalability → think about sharding up front (disk IO is major problem)
Avoid the disk. EBS is tempting but don’t do it. keep everything in memory
working towards disposable compute where any node can fail at any time → very well understood failure domain (Netflix Chaos Monkey strategy)
throw nodes away instead of diagnosing individual problems
Elasticity isn’t just for scaling → deploy new code to new servers for rolling updates on excess capacity
Service Discovery – good to be good at keeping everything in sync (AMQP announcements – treat as ephemeral, timeout since ‘down’ messages might not make it)
doozer github
DNS is sexy: low TTLs, use subdomains (can’t cname apex)
‘yo dawg, i heard you like platforms so i put a platform on your platform’ – running platform for an app on top of Heroku. Minimal changes for heroku to run their app on top of their infrastructure if can boot a single app (‘internal’ feature flag)
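
The service discovery note above (treat announcements as ephemeral, time them out because 'down' messages might not arrive) can be sketched as a registry with a TTL. This is a minimal in-memory sketch, not Heroku's implementation; the class name, method names, and the 30 second default are assumptions, and the AMQP transport is left out entirely.

```python
import time


class ServiceRegistry:
    """Tracks announcements and forgets nodes that stop announcing."""

    def __init__(self, ttl_s=30.0, clock=time.monotonic):
        self._ttl_s = ttl_s
        self._clock = clock          # injectable so expiry can be tested
        self._last_seen = {}         # (service, node) -> last announcement time

    def announce(self, service, node):
        """Record an 'I am up' message; there is no matching 'down' message."""
        self._last_seen[(service, node)] = self._clock()

    def live_nodes(self, service):
        """Nodes whose most recent announcement is within the TTL."""
        now = self._clock()
        return sorted(
            node
            for (svc, node), seen in self._last_seen.items()
            if svc == service and now - seen <= self._ttl_s
        )
```

A node that dies silently drops out of `live_nodes()` once its last announcement is older than the TTL, so nothing depends on a 'down' message being delivered.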

Thursday

Keynotes

Ian Flynt (Yahoo!) – World IPv6 day

monitoring isn’t the same in a dual-stack environment – DNS health checks didn’t know about IPv6 so fell back to default rotation in US datacenters
don’t start something big and risky at a traffic inflection point
always have more than one way to look at things (2x)
practice makes perfect. for a major change, schedule multiple test runs

Facebook Open Compute & Other Infrastructure – Jonathan Heiliger (Facebook)

make audacious bets and iterate quickly, but manage risk with hedges

Velocity Culture – Jon Jenkins (Amazon.com)

success of culture depends on linking it to the business
focus on capacity planning is focus on spending money → better to do capacity optimization
November 10, 2010 → last physical web server at amazon.com shut down → full move to ec2
deployments every 12s to production, average 10k hosts receiving deployment simultaneously

Artur on SSD’s – Artur Bergman (Fastly)

SSDs for disk → 1 watt vs 15 watt, 7 min fsck on massive filesystem, not that expensive, DO IT

Cisco and Open Stack – Lew Tucker (Cisco)

OpenStack – Compute, Image Service, Object Storage → open source cloud that runs on multiple hypervisors
Cisco doing open source for the first time
Quantum – network service – virtual wire (L2/L3, attach VMs and services, etc)
API abstraction with vendor-specific plugin backend

State of the Infrastructure – Rachel Chalmers (The 451 Group)

science fiction – obsession with tools, fantasy – obsession with symbols
amazon is ‘a bookstore selling crack out the back door’

Holistic Performance – John Resig (Mozilla Corporation)

performance in jQuery project
more than wall time → battery usage, parse time, number of requests, file size, etc
can’t drop browser support for performance gains in another (can’t slow down IE just to make others faster either)
have to prove positive impact of JS performance change to do it
jsperf
doesn’t matter how much you unroll a loop if that loop is doing DOM manipulation
don’t compromise code quality in exchange for performance
hard to create realistic test cases

Lightning Demos Thursday – Michael Schneider (Google), Andreas Grabner (dynaTrace Software), Paul Irish (jQuery Developer Relations), Sergey Chernyshev (truTV)

pagespeed
dynatrace – John Resig and Steve Souders both like → full support for FF4
chrome tools
- task manager → right click to get more info
- performance.timing → added performance.memory via --enable-memory-info
- window.onerror
- console.profile() and console.profiles[] → console.profileEnd()
- console.markTimeline()
- extensions: audits (way to maintain standards like no images above 80k)
- heap profiler
- remote debugging via --remote-debugging-port
show slow
- can pull in custom metrics (GoogleAnalytics, order system, etc)
- can send events to it via a web service i.e. ‘Combined multiple JS files into one’

Cast – The Open Deployment Platform – Paul Querna (Rackspace)

from cloudkick
cast
github
Maslow’s hierarchy: Having Releases → Your 7 year old bash script → Configuration management → APIs for Deployment → Heroku
Service management → start/stop/restart via HTTP API
Version management → Distribution/upgrade/rollback
Service monitoring → logfiles/network ports/processes
Service coordination → config/where is my database? who is the master?

Making the web instant – Arvind Jain (Google), Sreeram Ramachandran (Google)

average time to click on link → onmouseover to onclick → 300ms avg
Google instant pages → pre-renders first page from search results by predicting what you’ll click on
Google Chrome Instant → preference in Chrome, preloads based on what you’re typing in the omnibox
link rel=prerender → instruct the browser to load a page that the user is likely to visit next
proposed new standard “Page Visibility API” → determine if a page has actually been seen by a user or not

Wikia: The Road to Active/Active – Jason Cook (Wikia)

10s of millions of articles, 246 languages, 45+ million monthly visitors, 1 billion pages per month
mediawiki: master/slave, read after write, direct cache invalidation, caches really well
94% cache hit, 1-5% logged in, 99% no-writes
DR site in Iowa (Identical copy of hardware)
Varnish → swiss army knife of wikia. multiple retries in config instead of sending error pages to user
MediaWiki read-only mode for DR site → no user problems until they try to edit things
SSD sped up MySQL slave buffer cache full time from 1.5 hours to 3 minutes
varnish purges pushed to a queue, processed worldwide in under 1s – make sure to do these after MySQL updates make it through replication
use idle apaches to render for geo directed users, but some GETs trigger writes.
users get cookies from master after POST, directed to master until slave cluster catches up
don’t forget to maintain 2x capacity!
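
The cookie note above (users are directed to the master after a POST until the slave cluster catches up) can be sketched as a routing decision. This is a hypothetical sketch, not Wikia's code: the function name, the idea of storing the last write time in the cookie, and the `slave_applied_ts` value are all assumptions. It also ignores the GETs that trigger writes.

```python
def choose_backend(method, last_write_ts, slave_applied_ts):
    """Return 'master' or 'slave' for one request.

    method           -- HTTP method of the request
    last_write_ts    -- time of the user's last write, read from their cookie
                        (None if the user has no such cookie)
    slave_applied_ts -- time through which the slave cluster has replayed
                        the master's writes
    """
    if method != "GET":
        return "master"   # all writes go to the master site
    if last_write_ts is not None and slave_applied_ts < last_write_ts:
        return "master"   # slave has not caught up to this user's own write
    return "slave"
```

Once replication passes the timestamp in the cookie, the same user falls back to the local slave cluster, which preserves read-after-write for the person who just edited.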

Instrumenting the real-time web: Node.js, DTrace and the Robinson Projection – Bryan Cantrill (Joyent, Inc.)

nodejs is three ideas:
- JS rich support for asynchrony (i.e. closures)
- High-performance VMs
- The system abstractions that God intended “dynamic c”
nodeknockout contest → instrument incoming connections and geolocate them to provide ‘leaderboard’
use DTrace to instrument connections “dtrace – something that is important for ass saving”
dtrace is in kernel and instruments entire system, but tough for high-level interpreted environments → USDT “User-level statically defined tracing”
function in javascript that probes into C++ backend with USDT instrumentation
if you’re hitting GC in node you have a memory leak, you’re like a drug addict that has hit rock bottom. if i give you more memory it’s going right in your arm
Joyent uses OS level virtualization instead of hypervisor to allow introspection like dtrace from dom-0
github
leaderd/tickerd solution → 700ms latency from connection to app to appearing on dashboard
what map projection to use? Robinson projection → it’s not actually a projection github
node.js perfect for web-facing real-time systems that are hurt by long latency events and not CPU time
“Data Intensive Real Time” – DIRT

Scaling Concurrently – John Adams (Twitter)

traditional monolith is one big block of code, one big database, lots of SPOFs and technical debt
start early with being able to monitor: observability
ensure that network services are reachable (monit, god, etc)
separate everything as decoupled individual services – introduces latency but that’s ok
decompose
- via RPC: Transport (thrift, protocol buffers), Data Marshaling (JSON, Binary JSON, Raw Data, Messagepack), Fail Fast (less than 20ms)
- via queues/daemons: (kestrel is twitter’s): Handling failures? re-queue? drop? global locks and concurrency a concern
kestrel – scala / open source
- set enqueue, get dequeue (speaks memcache)
- no job ordering, no shared state
start configuration management early
SSL performance: cert key length matters, cipher accept order matters, ssl keepalives to eliminate handshake but uses more sockets/descriptors
Scribe for logging
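
The "Fail Fast (less than 20ms)" note above can be sketched as a call with a deadline and a fallback. This is a minimal sketch using a thread pool, not how Twitter implements it; the function name, pool size, and fallback parameter are assumptions.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Shared pool: a timed-out call keeps running in the background,
# so the caller is never blocked waiting for it to finish.
_POOL = ThreadPoolExecutor(max_workers=4)


def call_fail_fast(fn, *args, timeout_s=0.020, fallback=None):
    """Run fn(*args); give up and return fallback after timeout_s seconds."""
    future = _POOL.submit(fn, *args)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        future.cancel()  # only has an effect if the call has not started yet
        return fallback
```

For example, `call_fail_fast(time.sleep, 0.2, fallback="degraded")` returns `"degraded"` after about 20ms instead of waiting the full 200ms. Exceptions raised by `fn` still propagate to the caller.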

HTML5, Flash and the Battle for Faster Cat Videos – Greg Schechter (YouTube), Phil Harnish (YouTube)

HTML5 v flash – features, accessibility, deviceability, security, embeds, api
HTML5 is missing:
- content protection (RTMPE protocol)
- camera and microphone access
- fullscreen video
- consistent format support – needs to support H.264 and WebM
- cross platform consistency
HTML5 has
- open source (full stack)
- lower latency (no plug-in)
- better performance and fidelity
- accessibility
45% of youtube.com has ability to watch HTML5, but majority of people using YoutubeAPI are in HTML5
HTML5 player boots faster but video play time lags flash → mostly due to HTML5 ones not being in cache
200ms improvement by preloading video connection in head
wait until user clicks play to do all the work
Flash still preferred most places, but HTML5 is awesome and there is demand

reddit.com War Stories: The mistakes we made and how you can avoid them. – Jeremy Edberg (reddit.com)

mistake: relying on a single cloud product and expecting it to work as advertised (avoid EBS for now, or RAID around it)
single EBS volume for a database: they now use 13 disks, 6 pairs spanned and a spare
mistake: not accounting for increased latency in virtualized environments
mistake: not using a service based architecture sooner
mistake: not using a consistent key hashing algorithm at first → move to Cassandra (Dynamo model for consistent hashing)
mistake: using bleeding edge software in production (Cassandra 0.7)
mistake: not having enough monitoring and not having monitoring that is virtualization friendly (Use Ganglia, backed by RRD, not friendly to change)
database scaling issues: they do shard, but because they use key-value storage they don’t use transactions, and wish they did
use londiste for replication which is great and flexible, but doesn’t handle errors well like slow disk
users notice inconsistency and make comments about it
plan for 3 or more than 3 whenever you are writing code
queues are your friend
treat logged out users as second class (serve all from cdn)
reddit is open source on github
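
The consistent key hashing mistake above is easier to see with a small hash ring. This is a generic sketch of consistent hashing with virtual nodes, not reddit's or Cassandra's implementation; the class name, the choice of SHA-256, and the replica count are assumptions.

```python
import bisect
import hashlib


class HashRing:
    """Maps keys to nodes so that adding a node moves only a fraction of keys."""

    def __init__(self, nodes=(), replicas=100):
        self._replicas = replicas   # virtual nodes per real node
        self._keys = []             # sorted ring positions
        self._owners = {}           # ring position -> node name
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(value):
        return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)

    def add(self, node):
        for i in range(self._replicas):
            pos = self._hash(f"{node}#{i}")
            if pos not in self._owners:
                bisect.insort(self._keys, pos)
            self._owners[pos] = node

    def remove(self, node):
        for i in range(self._replicas):
            pos = self._hash(f"{node}#{i}")
            if self._owners.get(pos) == node:
                del self._owners[pos]
                self._keys.remove(pos)

    def node_for(self, key):
        """Walk clockwise from the key's position to the next virtual node."""
        if not self._keys:
            raise ValueError("ring is empty")
        idx = bisect.bisect(self._keys, self._hash(key)) % len(self._keys)
        return self._owners[self._keys[idx]]
```

With a naive `hash(key) % number_of_servers` scheme, adding a server remaps most keys at once. With the ring, keys either stay where they were or move to the new node.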

Choose Your Own Adventure 2: Electric Boogaloo ;-) – Adam Jacob (Opscode), Jesse Robbins (Opscode)

sales and marketing: marketing brings leads via campaigns. lead nurturing → qualified prospects. ‘prosecuted’
no assholes rule (positive interactions must outnumber negative ones 5:1) withholders of effort, affectively negative, or interpersonal deviants
automate: provisioning, DNS, server inventory, configuration management, identity management, version control, monitoring and trending, application deployment
polyglots: sysadmins are software developers with shitty languages: sysadmins should learn scheme
managing ops: ops responsible for: system availability and efficiency. developers must be on call, sysadmins should be escalated to. metrics tie to $$. should be saying ‘yes’ instead of no, but make people commit
philosophy: people don’t remember tools used to build great things. can only be measured by final solution. your best skill is knowing systems and problems
open source: we like to look at pretty cars, but we take the ugly ones home and work on it/fix it up/etc. you cannot leapfrog stewardship
devops: is not a job description, you can’t ‘be’ a devops “you don’t care that i got divorced because your crappy code woke me up” – sysadmin to dev. devops is all inclusive. someone not happy or exclusive → doing it wrong
