optimize debug loop time – scripts should be optimized for developer happiness, improving performance is to lower debug loop time (if system can scale like disk, makes it easier to do this)
Advanced Postmortem Fu & Human Error 101 – John Allspaw (Etsy)
‘fundamental surprises’ are events that weren’t thought to be possible
organizations are the blunt end, operators are the sharp end
‘The Conversation’ movie w/Gene Hackman
‘Severity 3 is kind of like your third child’
‘thematic vagabonding’ – context switching meets troubleshooting – too much switching to actually make progress
‘goal fixation’ (encystment) – irrationally ignoring signals based on gut feelings
improvisation is a requirement for troubleshooting complex systems
‘any explanation is better than none’
hindsight bias – knowledge of the outcome influences the analysis of the process
‘people’s need to be right is stronger than their ability to be objective’
systemic model: ‘resultant’ – known things lined up, ‘emergent’ → not things in a line, but factors spread out all over
functional resonance – in isolation, components act within bounds. Interconnected, they produce emergent behaviors
“There is no root cause”
‘human error is an effect, not a cause’ – it’s just a label
success is a special case of failure
tame complexity through new forms of feedback
human error is an inevitable by-product of strained complex systems
reprimanding someone is like peeing your pants – first feels nice and warm, then uncomfortable
everything i need to know about capacity management i learned at disneyland – Little’s law → N = XR (number in the system = throughput × residence time); fail fast + shed load OR performance tuning OR scale up (add more Dumbos)
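Little's law from the note above can be worked through in a few lines. This is a minimal sketch with made-up numbers (none of them come from the talk); the function names are hypothetical.

```python
def concurrency(throughput_per_s: float, residence_time_s: float) -> float:
    """Little's law: N = X * R (average number of requests in the system)."""
    return throughput_per_s * residence_time_s


def max_throughput(capacity: float, residence_time_s: float) -> float:
    """Rearranged as X = N / R: throughput that a fixed number of slots can sustain."""
    return capacity / residence_time_s


# Hypothetical numbers for illustration only.
in_flight = concurrency(throughput_per_s=200.0, residence_time_s=0.5)
ceiling = max_throughput(capacity=100.0, residence_time_s=0.5)

print(f"200 req/s at 0.5 s each keeps {in_flight:.0f} requests in flight")
print(f"100 slots at 0.5 s each sustain at most {ceiling:.0f} req/s")
```

The three options in the note map onto the three terms: shedding load lowers X, performance tuning lowers R, and scaling up raises the N the system can hold.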
Why the Yahoo FrontPage Went Down and Why It Didn’t Go Down For up to a Decade before That – Jake Loomis (Yahoo!)
redundancy for everything (understand your system’s failure points)
error proof change – make changes in a safe environment – treat staging as production and fork in prod traffic to staging env. continuous integration. dark launch code
global load balancing (many different locations)
monitor everything
peak traffic on news was the royal wedding, then bin Laden's death. shifted content to other properties, turned off features (35k requests/s tested peak for news, went up to 41k)
fallback plan in case of failure. isolate failure. drop features to add capacity
metrics on performance of external sites to prevent sending them more traffic than they can handle
APC cache was root cause of Yahoo.com outage
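The note about not sending external sites more traffic than they can handle could be sketched as a per-site cap. The talk notes do not describe how Yahoo implements this; the class name, the one-second window, and the limits below are all assumptions for illustration.

```python
import time


class OutboundTrafficCap:
    """Caps referrals sent to each external site per one-second window."""

    def __init__(self, limits_per_s, clock=time.monotonic):
        # site -> max referrals per second; in practice these would be
        # derived from measured performance of the external site (assumed).
        self.limits = dict(limits_per_s)
        self.clock = clock
        self.window = {}  # site -> (window_start_second, count)

    def allow(self, site):
        """Return True if one more referral to `site` fits in this second."""
        now = int(self.clock())
        start, count = self.window.get(site, (now, 0))
        if start != now:  # a new one-second window has begun
            start, count = now, 0
        if count >= self.limits.get(site, 0):  # unknown sites get no traffic
            return False  # over the cap: caller should show a fallback
        self.window[site] = (start, count + 1)
        return True
```

A caller would check `cap.allow("partner.example")` before featuring a link and fall back to other content when it returns False.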
Oh, To Be Single Again – Building a Single Codebase in a Client-server World – Daniel Hunt (Yahoo!)
history lesson on evolution of application stack and complexity (static → dynamic → personalized dynamic) and sources of latency (network, storage, browser)
optimize time to interaction on Y!Mail by sending an HTML skeleton and launch.js to the client; as soon as each backend data feed (facebook, mail, twitter, etc.) is ready, it gets pushed to the user as JSON.
new Y!Mail uses this
Ground rules:
use element.innerHTML instead of document.createElement
use function loadScript() instead of script tags
separate out pieces
use a templating strategy (like mustache). PHP is not a good one
javascript classes only pulled in for needed phase. rendering can be done on server or on client
NodeJS handles workload much better than Apache (5-6x)
Single codebase is the way to go! Only one set of tests
How to Build a SaaS App With Twitter-like Throughput on Just 9 Servers – Lew Cirne (New Relic)
20 billion measurements from 150k processes from 10k customers every day on 9 servers
majority of data collected is ‘timeslices’ ~250 bytes, ~100k/second (twitter peaks at ~8k/s)
collecting is one thing, but need to provide realtime insight. data is always stale, caching techniques are no good
average page load time for main dashboard is 2.4s
User interface rails 2.3 (Rails+nginx+linux), data collectors on Jetty (Java, Jetty, Linux), data store shared by accounts (MySQL, percona)
new version ‘real user monitoring’ uses ‘Episodes’ by steve souders to time full page load time → over 1 billion POSTs per week
RUM beacons go into beacon servers on EC2 which then roll it up and async aggregate to collectors
challenges: data purging, what to pre-aggregate, large accounts, MySQL tuning, IO performance
MySQL can control exact placement of bytes on disk → optimize number of disk seeks required to render page
Keep it simple, less is more, Trendy != Reliable, Plan for scale, Use the right technology for a given task
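To get a feel for the timeslice volume quoted above, here is the back-of-the-envelope arithmetic using only the two figures from the notes (~250 bytes, ~100k/second). The notes do not say how timeslices map onto the 20 billion daily measurements, so this covers raw timeslice ingest only.

```python
TIMESLICE_BYTES = 250            # "~250 bytes" from the notes
TIMESLICES_PER_SECOND = 100_000  # "~100k/second" from the notes
SECONDS_PER_DAY = 24 * 60 * 60

bytes_per_second = TIMESLICE_BYTES * TIMESLICES_PER_SECOND
timeslices_per_day = TIMESLICES_PER_SECOND * SECONDS_PER_DAY
bytes_per_day = bytes_per_second * SECONDS_PER_DAY

print(f"{bytes_per_second / 1e6:.0f} MB/s ingest")
print(f"{timeslices_per_day / 1e9:.2f} billion timeslices/day")
print(f"{bytes_per_day / 1e12:.2f} TB/day before any aggregation or purging")
```

That works out to roughly 25 MB/s and a little over 2 TB/day of raw data, which is why purging and pre-aggregation appear in the list of challenges.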
Creating the Dev/Test/PM/Ops Supertribe: From “Visible Ops” To DevOps – Gene Kim (Visible IT Flow)
higher performing IT orgs are more stable/nimble/compliant/secure
common traits of high performers: culture of change management, causality, compliance and continual reduction of operational variance
author of visible ops handbook
3 questions that predict 60% of performance: to what extent does an org define/monitor/enforce
standardized configuration strategy
process discipline
controlled access to production systems
the dreaded disease “IT Operations Constipatus” – accumulating technical debt
Goal #1: Decrease cycle time of releases “Release early, release often”
Goal #2: Increase Production Rigor “Better → Faster → Cheaper”
“When IT Fails: The Novel” and “DevOps Cookbook” on the way
Building for the Cloud: Lessons Learned at Heroku – Mark Imbriaco (Heroku)
Heroku is on Amazon’s cloud 100%
ELB (SSL termination) → mesh of routing proxies (erlang) → grid of application servers
routing proxy maps apps to app servers where it’s running
router idles out app processes for free apps after some time
used to have varnish but pulled it out in favor of CDNs
opinionated about decisions and force the Right Way on their users
No Persistent Filesystem – shouldn’t do this anyways
Horizontal Scalability → think about sharding up front (disk IO is major problem)
Avoid the disk. EBS is tempting but don’t do it. keep everything in memory
working towards disposable compute where any node can fail at any time → very well understood failure domain (Netflix Chaos Monkey strategy)
throw nodes away instead of diagnosing individual problems
Elasticity isn’t just for scaling → deploy new code to new servers for rolling updates on excess capacity
Service Discovery – good to be good at keeping everything in sync (AMQP announcements – treat as ephemeral, timeout since ‘down’ messages might not make it)
DNS is sexy: low TTLs, use subdomains (can’t cname apex)
‘yo dawg, i heard you like platforms so i put a platform on your platform’ – running platform for an app on top of Heroku. Minimal changes for heroku to run their app on top of their infrastructure if can boot a single app (‘internal’ feature flag)
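The service discovery note (treat announcements as ephemeral, time them out because 'down' messages might not arrive) can be sketched as a registry with a TTL. This is not Heroku's implementation; the class name, the 30-second TTL, and the node names are assumptions, and the AMQP transport is left out entirely.

```python
import time


class EphemeralRegistry:
    """Tracks service announcements and expires any that are not refreshed."""

    def __init__(self, ttl_s=30.0, clock=time.monotonic):
        self.ttl_s = ttl_s
        self.clock = clock
        self.last_seen = {}  # (service, node) -> time of last announcement

    def announce(self, service, node):
        """Record an 'up' announcement (for example, one received over AMQP)."""
        self.last_seen[(service, node)] = self.clock()

    def withdraw(self, service, node):
        """Handle an explicit 'down' message; harmless if it never arrives."""
        self.last_seen.pop((service, node), None)

    def live_nodes(self, service):
        """Return nodes whose last announcement is still within the TTL."""
        now = self.clock()
        return sorted(
            node
            for (svc, node), seen in self.last_seen.items()
            if svc == service and now - seen <= self.ttl_s
        )
```

Because liveness depends only on the last announcement time, a node that dies without sending 'down' simply ages out of `live_nodes`.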
Thursday
Keynotes
Ian Flint (Yahoo!) – World IPv6 day
monitoring isn’t the same in a dual-stack environment – DNS health checks didn’t know about IPv6 so fell back to default rotation in US datacenters
don’t start something big and risky at a traffic inflection point
always have more than one way to look at things (2x)
practice makes perfect. for a major change, schedule multiple test runs
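The dual-stack monitoring lesson above suggests checking each address family separately instead of assuming one check covers both. A minimal sketch, assuming name resolution is the thing being checked (it does not test reachability); the function names and the injectable `resolver` parameter are hypothetical.

```python
import socket


def resolve_by_family(host, port=80, resolver=socket.getaddrinfo):
    """Return {'ipv4': [...], 'ipv6': [...]} so each stack is checked on its own."""
    results = {}
    for name, family in (("ipv4", socket.AF_INET), ("ipv6", socket.AF_INET6)):
        try:
            infos = resolver(host, port, family, socket.SOCK_STREAM)
            # Each entry is (family, type, proto, canonname, sockaddr).
            results[name] = sorted({info[4][0] for info in infos})
        except socket.gaierror:
            results[name] = []  # nothing resolved for this address family
    return results


def dual_stack_healthy(host, **kwargs):
    """Healthy only if both families resolve; a v4-only check would miss v6 failures."""
    addrs = resolve_by_family(host, **kwargs)
    return all(addrs.values())
```

Reporting the two families separately also gives a second way to look at the same service, in the spirit of the 'more than one way to look at things' note.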
Facebook Open Compute & Other Infrastructure – Jonathan Heiliger (Facebook)
make audacious bets and iterate quickly, but manage risk with hedges
Velocity Culture – Jon Jenkins (Amazon.com)
success of culture depends on linking it to the business
focus on capacity planning is focus on spending money → better to do capacity optimization
November 10, 2010 → last physical web server at amazon.com shut down → full move to ec2
deployments every 12s to production, average 10k hosts receiving deployment simultaneously
Artur on SSD’s – Artur Bergman (Fastly)
SSDs for disk → 1 watt vs 15 watt, 7 min fsck on massive filesystem, not that expensive, DO IT
Cisco and Open Stack – Lew Tucker (Cisco)
OpenStack – Compute, Image Service, Object Storage → open source cloud that runs on multiple hypervisors
Cisco doing open source for the first time
Quantum – network service – virtual wire (L2/L3, attach VMs and services, etc)
API abstraction with vendor-specific plugin backend
State of the Infrastructure – Rachel Chalmers (The 451 Group)
science fiction – obsession with tools, fantasy – obsession with symbols
amazon is ‘a bookstore selling crack out the back door’
Holistic Performance – John Resig (Mozilla Corporation)
performance in jQuery project
more than wall time → battery usage, parse time, number of requests, file size, etc
can’t sacrifice one browser for performance gains in another (can’t slow down IE just to make others faster either)
have to prove positive impact of JS performance change to do it
Maslow’s hierarchy: Having Releases → Your 7 year old bash script → Configuration management → APIs for Deployment → Heroku
Service management → start/stop/restart via HTTP API
Version management → Distribution/upgrade/rollback
Service monitoring → logfiles/network ports/processes
Service coordination → config/where is my database? who is the master?
Making the web instant – Arvind Jain (Google), Sreeram Ramachandran (Google)
average time to click on link → onmouseover to onclick → 300ms avg
Google instant pages → pre-renders first page from search results by predicting what you’ll click on
Google Chrome Instant → preference in Chrome, preloads based on what you’re typing in the omnibox
link rel="prerender" → instruct the browser to load a page that the user is likely to visit next
proposed new standard “Page Visibility API” → determine whether a page has actually been seen by a user or not
Wikia: The Road to Active/Active – Jason Cook (Wikia)
10s of millions of articles, 246 languages, 45+ million monthly visitors, 1 billion pages per month
mediawiki: master/slave, read after write, direct cache invalidation, caches really well
94% cache hit, 1-5% logged in, 99% no-writes
DR site in Iowa (Identical copy of hardware)
Varnish → swiss army knife of wikia. multiple retries in config instead of sending error pages to user
MediaWiki read-only mode for DR site → no user problems until they try to edit things
SSDs cut the time to fill the MySQL slave buffer cache from 1.5 hours to 3 minutes
varnish purges pushed to a queue, processed worldwide in under 1s – make sure to do these after MySQL updates make it through replication
use idle apaches to render for geo directed users, but some GETs trigger writes.
users get cookies from master after POST, directed to master until slave cluster catches up
don’t forget to maintain 2x capacity!
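The read-after-write note (users get a cookie after a POST and are directed to the master until the slaves catch up) could look roughly like this. It is a sketch, not Wikia's code: the cookie name is hypothetical, and a fixed lag window stands in for actually knowing when the slave cluster has caught up.

```python
STICKY_COOKIE = "use_master_until"  # hypothetical cookie name


def cookies_after_write(now, max_replication_lag_s=5.0):
    """After a POST, pin the user to the master for an assumed worst-case lag."""
    return {STICKY_COOKIE: str(now + max_replication_lag_s)}


def choose_backend(method, cookies, now):
    """Route writes and recently-written users to the master, others to a slave."""
    if method not in ("GET", "HEAD"):
        return "master"
    try:
        pinned_until = float(cookies.get(STICKY_COOKIE, "0"))
    except ValueError:
        pinned_until = 0.0  # ignore a malformed cookie
    return "master" if now < pinned_until else "slave"
```

This does not cover the GETs that trigger writes mentioned above; those would still need to be identified and sent to the master separately.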
Instrumenting the real-time web: Node.js, DTrace and the Robinson Projection – Bryan Cantrill (Joyent, Inc.)
nodejs is three ideas:
JS rich support for asynchrony (i.e. closures)
High-performance VMs
The system abstractions that God intended “dynamic c”
nodeknockout contest → instrument incoming connections and geolocate them to provide ‘leaderboard’
use DTrace to instrument connections “dtrace – something that is important for ass saving”
dtrace is in kernel and instruments entire system, but tough for high-level interpreted environments → USDT “User-level statically defined tracing”
function in javascript that probes into C++ backend with USDT instrumentation
if you’re hitting GC in node you have a memory leak, you’re like a drug addict that has hit rock bottom. if i give you more memory it’s going right in your arm
Joyent uses OS level virtualization instead of hypervisor to allow introspection like dtrace from dom-0
Choose Your Own Adventure 2: Electric Boogaloo ;-) – Adam Jacob (Opscode), Jesse Robbins (Opscode)
sales and marketing: marketing brings leads via campaigns. lead nurturing → qualified prospects. ‘prosecuted’
no assholes rule (positive interactions must outnumber negative ones 5:1) withholders of effort, affectively negative, or interpersonal deviants
automate: provisioning, DNS, server inventory, configuration management, identity management, version control, monitoring and trending, application deployment
polyglots: sysadmins are software developers with shitty languages: sysadmins should learn scheme
managing ops: ops responsible for: system availability and efficiency. developers must be on call, sysadmins should be escalated to. metrics tie to $$. should be saying ‘yes’ instead of no, but make people commit
philosophy: people don’t remember tools used to build great things. can only be measured by final solution. your best skill is knowing systems and problems
open source: we like to look at pretty cars, but we take the ugly ones home and work on it/fix it up/etc. you cannot leapfrog stewardship
devops: is not a job description, you can’t ‘be’ a devops “you don’t care that i got divorced because your crappy code woke me up” – sysadmin to dev. devops is all inclusive. someone not happy or exclusive → doing it wrong