Operations - anything pertaining to maintaining and implementing systems at scale
Software engineers think they are off the hook with performance, management, scalability.
Operations Engineering at Scale is a specialized skillset. It is not "soft eng lite". It is not someone to do all the annoying parts of running systems for you.
You have hard operational problems
Hard Operational Problems
- Extreme reliability demands
- Extreme scalability (3x-10x year over year)
- Extreme security requirements
- Solving operational category problems for the whole internet (platforms, services)
Look for strengths that are key to your company's success.
A good operations engineer is broadly literate and can go deep on at least one or two areas.
- Strong automation instincts
- Ownership over their systems
- Strong opinions, weakly held
- Simplify, simplify, simplify
- Excellent communication skills in a crisis
- Value process: it prevents making the same mistake over and over
- Empathy
Things that don't work
- whiteboarding code
- particular technology or language
- particular degree
- big company pedigree
Success at a startup
- comfortable with chaos
- knows when to solve 80% and move on
- total responsibility for outcomes
- good judgement
- highly reactive
- technical breadth
First, do the job yourself
Figure out what strengths you really need (The Hard Thing About Hard Things, Horowitz)
Hire for strengths, not for a lack of weaknesses
Hire for YOUR weaknesses. Find someone to fill them.
Good interview questions
- are leading, to establish technical range
- probe the candidate's self-reported strengths
- related to your problems
- ask culture questions. screen for learned helplessness
Bad interview
- depends on a specific tech.
- look for a reason to deny a candidate
- designed to trip them up
- deny candidates the resources they would use to solve something in the real world
Ask culture questions: how they felt about their last job and coworkers. Screen for learned helplessness, which is like startup kryptonite.
"Employers put too much weight on interviews, and too little weight on references." - Reid Hoffman
Include the ops team in product development.
Bad ops engineers / fire them
- Tweaking indefinitely & pointlessly
- Walling off prod from dev.
- Adding complexity
- Won't admit they don't know things
- Disconnected from customer experience.
How to lose good ops engs
- all the responsibility, no authority
- all the tedious shitwork
- blameful culture
- no interesting operational problems
encouraging work-life balance, supporting team members
culture: the patterns you call out and celebrate will get repeated
Cube analytics (business intel.), predictive analytics (statistics and machine learning)
Analyze data as it is being produced: streaming
interactive: store data and provide results instantly when a query is posed
Aurora, Borealis, Cayuga, STREAM, NiagaraCQ (SIGMOD / CIDR)
S4, Storm, Samza, Spark, Pulsar
guaranteed message protocol, horizontal scalability, robust fault tolerance, concise code (focus on logic)
topology - directed acyclic graph - vertices = computations, edges = streams of data
spouts - sources of data for the topology
bolts - units of computation on data
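The topology / spout / bolt vocabulary above can be sketched in a few lines. This is plain Python with generators, not Storm's API; the function names (`sentence_spout`, `split_bolt`, `count_bolt`) and the sample sentences are made up for illustration.

```python
from collections import Counter
from typing import Iterable, Iterator


def sentence_spout() -> Iterator[str]:
    """Spout: a source of data for the topology (hypothetical fixed input)."""
    yield from ["the cow jumped", "the moon"]


def split_bolt(stream: Iterable[str]) -> Iterator[str]:
    """Bolt: a unit of computation; splits each sentence into words."""
    for sentence in stream:
        yield from sentence.split()


def count_bolt(stream: Iterable[str]) -> Counter:
    """Bolt: terminal computation; counts the words it receives."""
    return Counter(stream)


def run_topology() -> Counter:
    """Wire the vertices together; each call boundary is an edge (a stream)."""
    return count_bolt(split_bolt(sentence_spout()))


counts = run_topology()
```

A real system distributes each vertex across many workers and tracks message delivery; this sketch only shows the shape of the graph.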
- Bad Host
- Hot Keys
- Network Issues
Performance Bottlenecks
- Real-time Processing
- Failures (slow writes, connectivity issues)
- Backpressure / Container Deaths (ms spent in backpressure)
Spike in input traffic Hot Keys/Connectivity Issues Anomalous Nodes - Kestrel Spout Lag
Finding
Automated
Statistically Robust (minimize false positives)
- R Package : Seasonality and Trend Aware (available on Twitter blog)
- Key Features - Filter/Expected values/Long term
- Widely Used Outside Twitter
Applicable to univariate time series
Leverage multiple metrics (minimize false positives)
Exploit correlation/topology - observed variables and latent variables
Determine the intersection of the set of anomalies of each process
Service Component Health
Determine the intersection of the set of anomalies of each process HC
Anomaly Type - Container Death = all metrics of instances on that container had drops
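A minimal sketch of the "intersection of the set of anomalies" idea, assuming each metric already has a set of anomalous timestamps produced by some detector. The function names, metric names, and sample data are hypothetical and are not taken from the talk.

```python
from typing import Dict, Set


def common_anomalies(anomalies_by_metric: Dict[str, Set[int]]) -> Set[int]:
    """Timestamps flagged as anomalous by every metric."""
    if not anomalies_by_metric:
        return set()
    return set.intersection(*anomalies_by_metric.values())


def container_death_times(
    drops_by_instance: Dict[str, Dict[str, Set[int]]]
) -> Set[int]:
    """Timestamps where all metrics of all instances on a container dropped."""
    per_instance = [common_anomalies(m) for m in drops_by_instance.values()]
    if not per_instance:
        return set()
    return set.intersection(*per_instance)


# Hypothetical drop timestamps per instance and metric on one container.
drops = {
    "instance-a": {"cpu": {10, 42}, "emit_count": {42}, "ack_count": {42, 77}},
    "instance-b": {"cpu": {42}, "emit_count": {5, 42}, "ack_count": {42}},
}
death_times = container_death_times(drops)
```

Requiring agreement across metrics is what cuts false positives: a drop in a single metric (timestamps 5, 10, 77 here) does not count.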
How philosophy of failure is approached at Etsy
- you will create bugs
- you will build the wrong thing
- you will not foresee the unexpected
- money
- time
- data (loss)
- customers
- credibility
expensive failure is not
just speed just trust
logster to grab specific metrics out of logs
captures in range, aggregates, sends to graphite
reported weekly w/median and perc95
individual check for each page/api individual threshold for each page/api
How do you choose thresholds?
Perfnag (Etsy)
- 95th percentile over 2 weeks * 1.1 (warning)
- something (critical)
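The warning rule above can be sketched as follows. This is not Perfnag's code: the nearest-rank percentile method and the function names are assumptions, and because the notes do not record the critical multiplier, the sketch only computes the warning threshold.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile (an assumed method; others interpolate)."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def warning_threshold(samples_ms: Sequence[float], factor: float = 1.1) -> float:
    """95th percentile of the sample window (e.g. two weeks) times a factor."""
    return percentile(samples_ms, 95) * factor


# Hypothetical two-week window of page load times, in milliseconds.
samples = list(range(1, 101))
warning = warning_threshold(samples)
```

Computing one threshold per page or API from that page's own history avoids a single global number that is too loose for fast pages and too tight for slow ones.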
github.com/etsy/nagios-herald
Better visualization for warning / critical states.
focus on improvements
removing dba to database engineer
Databases at Scale - Laine Campbell and Charity Majors
- quantitative
- interdisciplinary
- results focused
- repeatable and code-driven
- designing and managing complex systems for complete life-cycle
- translation from biz to sys, focus
- designing process to balance objectives
- infrastructure to serve business
- focuses on the glue common to all services/platform
- deployment, efficiency, scale, perf., observability
- often done by systems and operations eng. rather than being their own
- forces horz. scaling
- forces designing for resilience
- elasticity drives new data store
- management by api
- lean manufacturing defines our workflows
- tighter feedback loops require org. shifts
- experimentation and controlled failure shift arch and proc. design
- integration drives empathy
- brings us to the source code control paradigm
- we must be teachers, not gatekeepers
- testing and compliance become top priorities
- relational is not the end of the line
- data must be looked at end to end
- function dictates form
- we cannot predict all uses
- it's about the mission
- protect the data
- elim. waste
- data-driven decision making
- dbs are not special
- eliminate the barriers between sw and ops
- mission KPIs
- function not form
- operational processes and management
how quickly can we pivot or change the datastore
how elastic, adding resources, vendor lockin, cost per transaction?
user management, audit trail, data and connection encryption, vulnerabilities history
how tunable, limits and curves
spofs, backups, partitioning, failover/rebalancing, consistency
Partitions always occur, whether outage or overload
- E.F. Codd's 12 rules of relations
- sql access
- acid levels
files/apis - antipatterns: in memory/binaries; degradation mgmt: read-only modes, dynamic config, queue draining, timeouts
static config, long timeouts, bad defaults
online changes, fast alters, atomic ops, instrum.
anti-pattern: schema level locking
voldemort hive
config. management orchestration self-service
anomaly tests and statistics
lambda arch: pubsub, batch proc, hadoop
cache
immutable architectures will force us to create change at the template layer and redeploy
Docker Slave - The Rickbot
Added consul (eventually consistent service registry) - register & lookup for port mapping containers
Chef to only manage on-box/host (outside of docker)
Host machine - Consul, logging, metric agg.
Marathon - Bamboo - HAProxy to register services w/haproxy
github.com/QubitProducts/bamboo
yeoman - bootstrap services
Spring Boot - java autoconfig
Dropwizard Metrics - java in app metrics
Consul Registration/Discovery (OrbitzWorldwide/consul-client)
Logstash/Logback
Swagger
Hystrix
Retrofit + Consul
Amazon ECS? Docker Swarm? Kubernetes?
- Docker - repeatable apps
- Chef - repeatable infra
- Jenkins - repeatable releases
- compile time - bake into docker image
- boot time - bake into playbook/launcher - parameter for Docker
- anytime - externalize (consul kv, etcd, zookeeper)
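A rough sketch of the three configuration binding times above. The precedence order (later-bound values override earlier ones) is an assumption, not something stated in the talk. The `APP_` prefix, the key names, and the plain dict standing in for an external store such as Consul KV are all hypothetical.

```python
import os
from typing import Dict, Mapping, Optional

# Compile time: defaults baked into the image (hypothetical keys).
BAKED_DEFAULTS = {"log_level": "info", "http_port": "8080"}


def boot_time_config(
    environ: Optional[Mapping[str, str]] = None, prefix: str = "APP_"
) -> Dict[str, str]:
    """Boot time: parameters handed to the container as environment variables."""
    env = os.environ if environ is None else environ
    return {
        key[len(prefix):].lower(): value
        for key, value in env.items()
        if key.startswith(prefix)
    }


def resolve_config(
    baked: Mapping[str, str],
    boot: Mapping[str, str],
    external: Mapping[str, str],
) -> Dict[str, str]:
    """Merge layers; assumed precedence is external > boot > baked."""
    merged = dict(baked)
    merged.update(boot)
    merged.update(external)
    return merged


# Anytime: a dict standing in for values read from an external KV store.
external_store = {"log_level": "debug"}
config = resolve_config(
    BAKED_DEFAULTS,
    boot_time_config({"APP_HTTP_PORT": "9090", "HOME": "/root"}),
    external_store,
)
```

In a real deployment the external layer would be fetched from the store and refreshed at runtime, which is what makes it changeable "anytime" without rebuilding or restarting.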
developing a hostile attitude towards their job, losing motivation and passion, shifting from your best to the bare minimum
negative feelings turning inward on yourself, mistakes on your path, impostor syndrome
Workers are overwhelmed, unable to cope, unmotivated, and display negative attitudes and poor performance.
- prolonged response to chronic interpersonal stressors on the job
- three dimensions
- exhaustion: individual stress
- cynicism: negative response to job
- professional inefficacy: negative self-eval
- poor quality of work
- low morale
- absenteeism - goes up
- turnover
- health problems
- depression
Burnout is predicted by a mismatch/misfit between person and job. "Is it the job or the person?" is the wrong question: both have to be taken into account.
- workload: not usually the issue. Good: sustainable workload.
- control: how much agency one has over their job, whether in a micromanaged or a chaotic work environment; feeling they have an appropriate level. Good: choice and control.
- reward: not just tangible things. Good: recognition and reward.
- community: workplace social relationships with other people, colleagues, supervisors; gaining trust and team spirit with each other. Bad: unresolved conflicts, competing against each other, anything preventing clear comms, sharing knowledge, or providing support; "socially toxic" gossiping/politics. This is becoming more important. Good: supportive work community.
- fairness: how we do work, policies, rules. People who feel they aren't being treated fairly or with respect will become cynical, and counter-productive behaviors will evolve. Good: fairness, respect, and social justice.
- values: not in conflict with what you hold; respecting others for theirs. Value conflicts will erode. Good: clear values and meaningful work.
more mismatches = more burnout more matches = more engagement
preventing burnout is a better strategy than waiting to treat it
building engagement is the best approach to preventing burnout
organizational intervention can be more productive than individual intervention
- Christina Maslach books: The Truth About Burnout, Banishing Burnout
Values and community - recognize them early as a manager and foster them. It's a big responsibility.
first person to burnout might not be the only
Organizational, not individual; patronizing to isolate
very few orgs do work/health polling
pluralistic ignorance - I have to look like I'm doing the right thing, I'm okay, fitting in, mirroring others in the workplace -- then holding back on issues
creating a safe harbor for people to speak if they don't feel comfortable being explicit
trying extra hard to be a team
- beyond cat gifs
- being mindful of degrading our community
sharing stories, growing bonds, feeling you aren't alone
a culture that glamorizes the hero is toxic. teams that encourage people to disconnect. little things can mean a lot
focusing on when things go right, rather than wrong
larger outcomes, rather than smaller wins
teamwork and community != friendship based more on trust: confidence and character
research work in canada: incivility in the community (rudeness, bullying, snark, sarcasm); people reciprocate, and it spirals downward. number one sign in this group: eyerolling
Civility, Respect, Engagement at Work (CREW)
Read "The Goal" to discover constraints in action (Business Process Optimization)
Goal of the business is to make money (derp)
throughput: money generated by sales
inventory: money invested / WIP / not yet earned back / features
operating expense: time to ship a feature, moving it from inventory to throughput
businesses are systems to produce money
- theory of constraints: primary constraints hold back every step; documenting/blogging/etc. can slow down a feature release; optimize the process around constraints
- dependencies: linkages to other events that have to happen before something happens
- variations: changes in input will change output
- decouple to create "async" processing
- gate-ing work/buffering work to slowly release it out
- trimming waste will be negative, since it creates fluctuations, unless you know for sure it is the bottleneck
- "this is queueing theory" amdahl's law
- make your bottleneck the leader to synchronize, then improve/remove constraints
- brent is the constraint/bottleneck
switching back and forth and being the dependency is bad
- no prediction
- lack of repeatable steps
- lack of knowledge sharing when people were siloed
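A small sketch of the bottleneck notes above: in a serial process, throughput is capped by the slowest step, so speeding up any other step changes nothing. The stage names and rates (features per week) are invented for illustration.

```python
from typing import Dict


def pipeline_throughput(stage_rates: Dict[str, float]) -> float:
    """A serial pipeline can go no faster than its slowest stage."""
    return min(stage_rates.values())


def bottleneck(stage_rates: Dict[str, float]) -> str:
    """Name of the stage that constrains the whole pipeline."""
    return min(stage_rates, key=stage_rates.get)


# Hypothetical capacity of each step, in features per week.
stages = {"design": 10, "code": 8, "review": 3, "deploy": 12}

before = pipeline_throughput(stages)
# Doubling a non-bottleneck stage leaves throughput unchanged.
after_faster_code = pipeline_throughput(dict(stages, code=16))
# Doubling the bottleneck stage raises throughput.
after_faster_review = pipeline_throughput(dict(stages, review=6))
```

This is why the notes say to find the constraint first: effort spent anywhere else only builds inventory in front of it.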
The E-Myth Revisited (book)
live demo of a copy change, class open to all staff, intro to dev tools + process
reference to: liberal arts schools
github flow in the browser - lowest barrier of entry!!!!
shows how we test and deploy
makes engineering culture transparent so other teams can see how we work
everyone can commit code
lightweight process for making simple changes
handling most trivial changes this way lets you avoid building a CMS!
cultural values become transparent to engineering, which helps generate consensus
version control is a communication tool: it creates a history and story for all changes
transparency + consensus = blamelessness
increases your impact
- explain git branches + commits
- explain file layout
- always be learning (just in time learning)
- dry - don't repeat yourself: code at the limit of your understanding to improve
protip: safe deploy process
- a git branch of your own
- tests and continuous integration
- deployer checks what's getting deployed
protip: explain what this means
WTF emojis mean
👍 means what for different groups
Experience D.U.T.
gh:hivequeen rack-attack
"It's a security liability" - But what else is?