Scale Summit 2014

Intro, MBS

Chatham House Rule, so no attribution of ideas to people or companies

ideas for sessions

bootstrapping environments (without object stores)
service discovery
removing spofs
modern monitoring – sensu, runbooks, dashboards
- tradeoff between ease of management and sophistication
- elastic sites?
surviving DDoS attacks when your site is transactional
modern cmdbs
ansible
icinga re-acknowledgement
- ie I know disk is critical at 10%, but please re-alert at 5%
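The re-acknowledgement idea could be sketched roughly as below. This is a minimal illustration only, not Icinga's API: the `DiskCheck` class, its field names and the thresholds are all hypothetical. An acknowledgement carries its own lower threshold, and the alert fires again once free space drops below it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DiskCheck:
    """Hypothetical check state; not part of Icinga or any real plugin."""

    critical_below: float = 10.0  # percent free that counts as critical
    ack_until_below: Optional[float] = None  # ack holds while free >= this

    def acknowledge(self, realert_below: float) -> None:
        """Silence the alert, but only while free space stays at or above realert_below."""
        self.ack_until_below = realert_below

    def should_alert(self, percent_free: float) -> bool:
        if percent_free >= self.critical_below:
            # Recovered: drop any acknowledgement so the next incident alerts afresh.
            self.ack_until_below = None
            return False
        if self.ack_until_below is not None and percent_free >= self.ack_until_below:
            return False  # still inside the acknowledged band
        return True
```

For example, after `check.acknowledge(5.0)` a reading of 8% free stays quiet, while 4.9% free alerts again.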

session 1: monitoring & metrics

big infrastructure
- shared web servers
- shared tomcat servers
- zenoss over snmp
  - snmp didn’t scale
- problem: everything is averaged over 5 minutes
  - teams are spinning up their own graphite instances to monitor their own stuff
- zenoss required 40 boxes, I expected 2
what does graphite look like at scale?
- protip: buy fusion io
- it can be hard to rebalance your metrics
  - particularly if you’re using consistent hashing
- carbonate for migrating data to another graphite server
  - though you’ll probably end up with downtime
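To make the rebalancing problem concrete, here is a small sketch of a consistent hash ring and of which metrics change owner when a node is added. It is an illustration under assumptions, not carbon's actual hashing code: `HashRing`, `moved_metrics`, the md5-based hash and the replica count are all made up for the example.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    # Any stable hash works for the illustration; md5 is an arbitrary choice.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


class HashRing:
    """Hypothetical consistent hash ring with virtual nodes (replicas)."""

    def __init__(self, nodes, replicas=100):
        self._ring = sorted(
            (_hash(f"{node}:{i}"), node) for node in nodes for i in range(replicas)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, metric: str) -> str:
        # The first ring point clockwise from the metric's hash owns the metric.
        idx = bisect.bisect(self._points, _hash(metric)) % len(self._ring)
        return self._ring[idx][1]


def moved_metrics(metrics, old_nodes, new_nodes):
    """Return the metrics whose owning node differs between the two rings."""
    old, new = HashRing(old_nodes), HashRing(new_nodes)
    return [m for m in metrics if old.node_for(m) != new.node_for(m)]
```

Only a fraction of the metrics move when a node is added, and they all move to the new node. The history for those metrics still lives on the old nodes, though, which is the data that has to be migrated.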
has anyone used skyline?
- we looked at it, but we got lots of false alerts
  - my suspicion is that if we understood maths better, we could make it work really well
in sensu-community-plugins, there’s a check-graphite
- it does nice things like exceeding N std deviations
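A check of the "exceeds N standard deviations" kind might look roughly like this. It is a sketch of the general technique, not the code of the check-graphite plugin; the function name and the default of 3 deviations are assumptions.

```python
from statistics import mean, stdev


def exceeds_n_stddev(history, latest, n=3.0):
    """Return True if latest is more than n sample standard deviations from the mean of history."""
    if len(history) < 2:
        return False  # not enough data to estimate a deviation
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A perfectly flat series: any change at all is a deviation.
        return latest != mu
    return abs(latest - mu) > n * sigma
```

In practice `history` would be a recent window of datapoints fetched from graphite and `latest` the newest value.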
what do people use below graphite?
- we’re using collectd
  - the latest stuff has statsd and jmx connectors
anyone using ganglia?
- we’re replacing ganglia with sensu stuff and diamond
  - why are you using diamond rather than the in-built sensu stuff?
    - because we’re a python shop
  - we push data over rabbitmq
  - and fan in to a big fat central fusionio graphite
  - how do you monitor rabbit?
    - sensu monitors rabbit using rabbit
    - there are healthchecks which should fire if rabbit is completely broken
    - we have a cron job on every rabbit and every sensu server to kill the process every hour
      - and it still works
is anyone using riemann?
- is it worth spending time with?
- where does it add value?
  - real time anomaly detection
  - it does events as well as numbers
  - it also has event timeouts – it can notify on an absence of events as well as presence
  - I think you could replace statsd with riemann
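The "notify on an absence of events" idea can be sketched in a few lines. This is not Riemann's API (Riemann is configured in Clojure); `HeartbeatWatcher` and its methods are hypothetical names, and the injectable clock is only there to make the sketch testable.

```python
import time


class HeartbeatWatcher:
    """Hypothetical tracker that reports services which have gone quiet."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self._clock = clock
        self._last_seen = {}

    def record(self, service):
        """Note that an event for this service has just arrived."""
        self._last_seen[service] = self._clock()

    def expired(self):
        """Return the services with no event for longer than the ttl, sorted by name."""
        now = self._clock()
        return sorted(
            service
            for service, seen in self._last_seen.items()
            if now - seen > self.ttl
        )
```

A periodic job could call `expired()` and raise an alert for each service returned.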
does anyone store second or subsecond data for a long time?
we have a single biggest day each year
- we snapshot everything for that day - stats, logs, etc
- use it to drive load testing for the next year
we’ve been trying redshift
how does elasticsearch cope with metrics?
- we push quite large documents about everything to do with a web request
- I often find log data in kibana much more useful than the same data in graphite
- does anyone use realtime queries to drive alerting from elasticsearch?
  - yes, from graylog2
one thing we’ve done recently is tuning down the number of io operations that carbon uses per second. it massively reduces disk usage
- or write to ram disk and sync once per minute
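Both suggestions trade a little durability for fewer, larger writes. A rough sketch of that idea follows, assuming a hypothetical `write_batch(metric, points)` callback that performs one disk write per metric. It is not carbon's implementation; `BatchingWriter` and its parameters are made up for illustration.

```python
from collections import defaultdict


class BatchingWriter:
    """Buffer datapoints in RAM and write them to disk in per-metric batches."""

    def __init__(self, write_batch, max_writes_per_flush):
        # write_batch(metric, points) is a hypothetical callback that performs
        # one disk write for all the buffered points of a single metric.
        self._write_batch = write_batch
        self._max_writes = max_writes_per_flush
        self._pending = defaultdict(list)

    def add(self, metric, timestamp, value):
        """Record a datapoint in memory only; no disk io happens here."""
        self._pending[metric].append((timestamp, value))

    def flush(self):
        """Write the fullest buffers first, capped at max_writes_per_flush."""
        fullest_first = sorted(
            self._pending, key=lambda m: len(self._pending[m]), reverse=True
        )
        written = 0
        for metric in fullest_first[: self._max_writes]:
            self._write_batch(metric, self._pending.pop(metric))
            written += 1
        return written
```

Calling `flush()` on a timer bounds the number of writes per interval. Anything still buffered is lost if the process dies, which is the same risk as the ram disk approach.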
how do you get devs to make more metrics available?
- you put them on call until they do
do people cull metrics at all?
- i never have enough data
do people have app metrics measured by their continuous delivery pipeline?
- our apps publish an xml document which is a schema of the types of metrics that they can publish
if I don’t hate myself, is there anything other than sensu I should use for monitoring that environment?
- does anyone rely on cloudwatch?
  - we use it as a source for some data (ELB metrics)
    - you can get these delivered into S3 these days
  - but it only stores data for two weeks
does anyone using sensu miss nagios tactical view?
- I miss having a decent dashboard
  - I don’t miss the 10 different nagioses per environment
  - I don’t miss the failover when we lost the primary nagios instance and all the state in it
- we wrote a dashboard to query nagios and sensu
from the internet peanut gallery: is anyone using circonus?

Session 2: versioning of artefacts

my agenda:
- the presence of artefacts I don’t necessarily own
  - large graphical images or video data
  - third party applications
- I may wish to release the same artefact multiple times
  - we’ll use oracle 11 everywhere at one patch level
    - but in different configurations
- windows images (VDIs)
fpm is useful
- but it never generates a spec file or a source rpm
- makes me uncomfortable
I’m not happy about rpms, because you can only have one version of one package installed at once
- eg a simple webapp where we don’t want to do the loadbalancer dance
- that also implies the app is relocatable which vendor binaries often aren’t
is containerization part of the solution?
- it allows you to have multiple overlapping filesystems
- a model: each customer has their own container
  - we haven’t done it
  - that sounds very expensive
- how do you version control containers?
  - do you treat them as a single binary?
  - do you reconstruct it?
- a lot of solutions assume all machines are stateless
  - someone else will deal with the databases
- containers allow you to minimize surprise
  - a DBA logging into your container can find things where they expect, even if it’s from an underlying frankenstein filesystem
- I don’t mind snapshots, but they should be generated mechanically and repeatably.
what tool would you love to exist in an ideal world?
- I’d like the deployment database to do effectively dependency injection
  - I know where the dependencies are and what data I’m injecting, so I can use system monitoring to know what I’ve deployed

Session 2b: µservices

HTTP isn’t the best protocol in the world
use queues!
refactoring and testing is a better solved problem within the python programming language than over the network
- I don’t think it’s hard to test µservices
  - there are clear contracts
    - that’s the theory, right?
we end up building lots of small monoliths and wiring together
we switched to using amazon SNS to manage notifications
how you get your ops team to support µservices is you get them to support as little as possible
- they only work when the functional team owns the whole stack right to the bottom
services have a life cycle
- we like building things
- we should get better at killing things when they’re not being used
is there an additional cost to the organization for running µservices?
- is there an organizational cost to having a 2 million line codebase?
ownership of services
- handover of building team to ongoing running team
- problems can get pushed back to the building team
antipattern around µservices:
- developers think they’re clever
ntp is a µservice
aren’t µservices and SOA the same thing?
- is it SOA done right?

Session 3: managing OSS software at work

how do you deal with PRs?
- what about things that are not on your roadmap?
  - by not having a very good roadmap?
- or moving in directions you don’t want to go?
- it can be awkward because people might have put a lot of work in
  - but you need to explain “if you want to do that you need to fork it”
  - you can try to avoid it by writing a decent rationale of what you’re trying to do
  - though you can’t answer all the questions up front
you want to optimize for dragging people into your community
- as the implementer, your documentation is going to be awful
- because you already understand the whole system and don’t understand when you’re assuming tacit knowledge
- whereas if you can attract users to your irc channel, and answer their questions really clearly, they can write great docs for you
- I try to have a policy of: if anything confuses you, here’s my email, twitter, irc, etc and I will try to help you
- encourage people to raise bugs against docs
- I come from the perl community
  - there are 10-15 year old projects there where the maintainer has changed 4-6 times
  - have you got an example?
    - Catalyst
      - ~200 repos (core + plugins)
      - ~450 active committers
plugins are interesting: if people are trying to pull the project in different directions, you can let them through extensions but keep the core very small
does anyone have experience of running OSS projects at work?
- how do you manage your time?
  - the important PRs to pay attention to are those from new contributors
    - certainly get back within 24 hours
    - don’t necessarily have to merge
why are you open sourcing this code?
- to get the community using it
- to get good publicity
do you have an OSS landing page?
- yes, but it’s out of date
the OSS stuff that has mostly been infrastructure-related we’ve been trying to put into a separate github org
you imply some level of support here
- running an OSS project is more than just making code open
- to be able to do that successfully, you need to at least mentally divest yourself from your parent organization
what do you do if that project isn’t your main focus?
- my OSS contributions are entirely selfish
- you need a maintainer
  - there needs to be clear communication channels
what does a maintainer do?
- is it always one person?
  - no! not if you can avoid it
  - once a project has a community it’s difficult for one person to maintain
  - even if you’re not writing code, managing the community can rapidly become a full-time job
- what about the cost of maintenance?
  - use travis!
  - but please review the contribution even if the contribution passes the tests
- problem of selectivity, vision and direction
  - mozilla in the early days, just accepted everything.
  - ended up having to rewrite as firebird (now firefox)
how do you ensure governance doesn’t become onerous?
- example of people who forked their own project after it had become an apache project
- example of gcc fork (egcs) which got merged back in
a lot comes back to documenting your original vision
- I’ve been added as a maintainer in places, and sometimes there’s clear advice and sometimes there isn’t.
open sourcing a project that you don’t use is a recipe for abandonware.
- we also have an organization for abandoned code to move it out of our main github org
forks
- how do you transfer maintainership?
- what happens if a project gets abandoned and then forked?
what are the good communication channels to have for an OSS project?
- own website for announcement and discovery?
  - how do you summarize your project?
  - peeve: like <other project> but X
- community of contributors comes from community of users
  - so good user documentation will foster contributors
- issues
  - is it worth seeding the issues list even if we have an internal tracker?
  - yes, because it helps users google for error messages
  - they are effectively documentation
  - do you move to only use the external tracker or do you have an internal tracker too?
- do you need a security contact?
  - yes, with a GPG key
- people need to see activity
  - if all your activity is on your internal tracker & mailing list & private irc, people will think it’s dead
- where do people host mailing lists?
  - google groups
- a few people are averse to irc
  - people don’t realise that they won’t get an immediate response necessarily
  - irc shouldn’t be used alone
  - timezones are also an issue
- ipython uses hangouts
- gmane: a newsgroup view on your mailing list
- don’t have a separate irc channel per project if you’re managing lots of projects
how do you host your docs?
- you should control your domain?
- when is a README not enough?
- start with github pages, and you can migrate later
- what should it have?
  - screenshots
  - getting started guides
- github pages are a bad idea because you can’t version them
  - readthedocs keeps old versions too
- contributions must update docs when they update behaviour
documentation & communication is super super important
- careful with contributions from newbies
  - rejecting a contribution because of lack of tests can be tricky
    - they might not have written many tests in general
    - they might not understand your particular test framework
  - but rejecting because of no docs is more reasonable
  - you can write tests for them
    - and use this as a communication channel
    - “does this test look like it’s measuring the thing you’re trying to build?”
how do you handle trolls, griefers and timewasters?
- one small doc patch earns you a hundred stupid questions
- love your idiots

Session 4: what’s changed since last scale camp?

what’s arrived? what’s died?
Big Data is now a thing people talk about
- you’re now seeing adverts on the tube about it
is couchdb dead?
- npm?
- we still use it, but we only used it as a key-value store
still going:
- mongo
- riak
websockets are now standardized and supported by lbs, proxies
edgeconf
- grunt and pig and oink and stuff
- doing a js build and running tests
- angularjs
ndoc has gone
flash is in its death throes
most video sites work on an ipad
webgl has taken hold
epic demoed unreal engine 4 in firefox
60 fps on the web
docker!
- although solaris has been doing it for yonks
golang has taken off
- when did go hit 1.0?
- people are rewriting individual bits in go (rather than everything)
is hacker news dead yet?
bitcoin happened
- VPS providers have been getting attacked by people trying to steal bitcoins
- people trawling github to find access keys
- bitcoin mining in the browser
erlang
- nobody’s started writing things in it
- though there’s elixir
- and julia
- and idris
what’s falling out of favour?
- ruby? no
- scala? no
facebook’s hack
- seems sensible if you’re already in a php environment
bittorrent
- an incredibly good way of saturating your network
- though this isn’t new
µservices
- just due to containerization?
- seems to be a bunch of ex-tw people
elasticsearch is now usable
- and quite good
- and they acquired logstash and kibana
logs being searchable in es
- splunk has a reasonable oss competitor
graphite has grown
- there’s experimentation going on there
  - storage backends (cassandra, leveldb)
what about lucene?
- very few people use it directly these days
snowden
DC security
https everywhere
- gmail is now ssl only
- facebook
- PFS
- the perception that TLS is expensive
- spdy
webp
IE6 is on its deathbed
winxp
- though it’s still in cash terminals
mobile growth
- many sites are on the edge of 50% mobile traffic
- talk of mobile first and now mobile only
4G
bootstrap
wearables & IoT
- fitbit
- pebble
- automotive
  - tesla motors
security updates
- wordpress now has autoupdate
nagios isn’t dead yet
- sensu is still the hot new thing
- riemann
- flapjack
desktops are going away
- except for gaming
centos is now owned by redhat
linux mint?
systemd
ubuntu as a server is now more probable
- is upstart going away?
postgres got built-in replication
graph dbs (neo4j)
paas
- people are still excited
- it got even more complicated to install your own
where’s node going?
streaming extensions
- rx in .NET
- rise of functional
linux on the desktop?
- the XPS13 is good
- the rise of chromebooks
openstack?
- everyone thinks it’s a great idea
private clouds?
- azure will sell you an on-premise cloud thing
- what’s the difference between an in-house cloud and a data centre?
drones, quadcopters, hexapods
- for filming
what’s coming up? what will be important at the next scale summit?
- net security is in flux
- forks of android will be the new linux distro
- http 2
- IPv6?
- anomaly detection
- software defined networks
- containerization
- silicon roundabout?
  - it’s not a playground for children anymore
  - the adults have taken over
- computing in government
  - US has 18F
  - GDS
- I’d like there to be a world-class home grown east london startup doing technically challenging stuff
  - startups which solve technical problems don’t generally get funded
  - acquisitions
- crowdfunding?
  - no one cares
what’s going to die?
- couchdb
- python 2 will not die

Session 5: mentoring

how do we hire & train new people into our industry?
we certainly have struggled to recruit
- we’ve come to the realization that part of the solution is hiring junior people & growing them into the role
- I’ve been asked to mentor a junior person but I’ve no idea what to do
I’m a recent junior
- one on one time is quite good
- I came in having a basic idea what I’d be doing
- be open for questions
  - the devops world is really overwhelming
  - it’s so useful to be able to ask things
- that’s one of the ground rules we’ve agreed on
  - ie that I’m interruptible
- we’ve certainly noticed that hiring in the junior area is useful
it’s great having juniors because you get chaos monkeys as well
- if you’re not prepared to let a junior touch something, you probably need to make it more resilient
ETO1: 12-week night course
- teaches you how to teach
how do you get the theory? how do you talk about underlying principles that are independent of the particular situation at hand?
- pair programming is really good for that
  - does that depend on the teaching style of the pair?
- make the junior document the things that you’re teaching them
  - it helps ensure that they’ve understood it
I get irritated when technical people tweet complaining about the cost of interruptions
- when you have new people, you have to empower them to interrupt
- I don’t think you should have your entire team mentor a new starter
- we use the red flag system
  - you put a red flag up if you don’t want to be interrupted
- designated interruptible person
- juniors also have a difficult time saying no
  - you want to make everyone happy and be helpful
- do you have a system that makes work visible? eg kanban
  - we have a helpdesk system
  - but external people don’t use it for smaller tasks
    - raise a ticket on their behalf
- how do we teach juniors that it’s ok to say no?
  - also, how to understand what the requestor is trying to achieve, rather than the specific task they want done, and recognize when it’s the wrong fit?
juniors are way more engaged if they get a choice (however constrained) on what they get to spend their time on
also allow people to fail
- teach them that it’s okay to fail
- I troll my junior developers sometimes
  - I lead them down the garden path
  - but then I’m there to pick up the pieces when they fail
- do something that’s visible to other people in the company
  - so that they can show people what they’re capable of
how do you direct people through different areas of knowledge?
- do you go shallow on lots of tools? Or really deep on one thing?
- depends on the junior
  - throw things at them and see what sticks
- go broad with the concepts early on
  - architecture, system, etc
onboarding
- desk & computer should be ready
- first week should be meeting all the people they need to know about
- have monthly checkins with the mentor
  - checkins, not reviews!
- get a sales person to give a demo of whatever it is you build
can anyone recommend useful resources for managing developers?
- how to talk to your kids or something like that
how do you improve diversity?
- how do juniors find your roles?
- you don’t have to stick to the same old networks when hiring juniors
- thoughtbot – structured apprentice schemes
- I wonder if it would help to be more explicit & realistic in job postings about the experience required and the salary?
  - recruiters muddy the waters a lot
  - go direct if you can
how do you know when to stop mentoring? and how do you measure success?

lightning talks

tdoran docker to prod in 5 minutes

docker + 150 lines of shell

mirroring the internet

mirroring cpan, rubygems, npm
filesystems are good at serving things that look like files
you don’t need to use couch or
what was the easiest to mirror?
- cpan – it has a single line rsync command to create a mirror
wikipedia is hard to mirror
- each wikimedia site has a different set of plugins

analytics and search evaluation

it’s important to have good search for your site
we use google analytics. you can use this to find click behaviour for particular search terms
- ie for term X, how often do people click on link 1, 2, 3, 4, etc
automate this!
crunch the most popular searches
identify how many clicks they got
use it to calculate how many more clicks we would have got if we had ordered the results better
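One way to estimate the missed clicks is sketched below. It rests on a loud assumption that is not from the talk: that clicks on a result equal its intrinsic appeal multiplied by a known position bias (how much attention each slot gets). The function name and the bias values in the example are hypothetical.

```python
def extra_clicks_if_reordered(clicks_by_position, position_bias):
    """Estimate extra clicks if results were shown in order of appeal.

    clicks_by_position[i] is the observed click count for the result in slot i.
    position_bias[i] is an assumed attention factor for slot i (decreasing).
    """
    if len(clicks_by_position) > len(position_bias):
        raise ValueError("need a position bias for every result slot")
    # Undo the position effect to estimate each result's intrinsic appeal.
    appeal = [c / position_bias[i] for i, c in enumerate(clicks_by_position)]
    # Put the most appealing results in the most visible slots.
    best_order = sorted(appeal, reverse=True)
    predicted = sum(a * b for a, b in zip(best_order, position_bias))
    return predicted - sum(clicks_by_position)
```

For instance, with observed clicks `[10, 40, 5]` and an assumed bias of `[1.0, 0.5, 0.25]`, the second result is clearly under-ranked, and the function returns a positive number of extra clicks. Summing this over the most popular search terms gives a rough ranking of which searches to fix first.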

juju

juju is a service orchestration tool

your laptop is not your friend

apple, facebook employees hacked via website malware, java vulnerability
data in transit protection
data at rest protection
authentication
- user to device, user to service, device to service
secure boot
- firmware
platform integrity and app sandboxing
app whitelisting
- although key here is to ensure that whitelist doesn’t take too long to modify for new things
security policy
sounds like configuration management
external interface protection (firewalls)
device update policy
incident response
- things will go wrong
although don’t worry too much about this
- unless you have to.

philandstuff/scale-summit.org

write libraries, not services

we’re doing a festival called electromagnetic field

outro

petemounce commented Mar 22, 2014