- Chatham House Rule, so no attribution of ideas to people or companies
- bootstrapping environments (without object stores)
- service discovery
- removing spofs
- modern monitoring – sensu, runbooks, dashboards
- tradeoff between ease of management and sophistication
- elastic sites?
- surviving DDoS attacks when your site is transactional
- modern cmdbs
- ansible
- icinga re-acknowledgement
- ie I know disk is critical at 10%, but please re-alert at 5%
- big infrastructure
- shared web servers
- shared tomcat servers
- zenoss over snmp
- snmp didn’t scale
- problem: everything is averaged over 5 minutes
- teams are spinning up their own graphite instances to monitor their own stuff
- zenoss required 40 boxes, I expected 2
- what does graphite look like at scale?
- protip: buy fusion io
- it can be hard to rebalance your metrics
- particularly if you’re using consistent hashing
- carbonate for migrating data to another graphite server
- though you’ll probably end up with downtime
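
A minimal sketch of why rebalancing is awkward with consistent hashing. This is a toy ring, not carbon's actual implementation; the node names, the metric naming scheme and the replica count are all made up for illustration. It counts how many metrics change owner when a third server joins a two-server ring.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    # Stable hash; md5 is used only for even distribution, not security.
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent-hash ring with virtual nodes (hypothetical, for illustration)."""

    def __init__(self, nodes, replicas=100):
        # Each node is placed on the ring `replicas` times to smooth the distribution.
        self._ring = sorted(
            (_hash(f"{node}:{i}"), node) for node in nodes for i in range(replicas)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, metric: str) -> str:
        # First ring point clockwise from the metric's hash, wrapping at the end.
        idx = bisect.bisect(self._points, _hash(metric)) % len(self._ring)
        return self._ring[idx][1]


def moved_metrics(metrics, old_nodes, new_nodes):
    """Return the metrics whose owning node differs between the two rings."""
    old, new = HashRing(old_nodes), HashRing(new_nodes)
    return [m for m in metrics if old.node_for(m) != new.node_for(m)]


# Hypothetical metric names and server names.
metrics = [f"servers.web{i:03d}.cpu.user" for i in range(1000)]
moved = moved_metrics(
    metrics,
    ["graphite-a", "graphite-b"],
    ["graphite-a", "graphite-b", "graphite-c"],
)
print(f"{len(moved)} of {len(metrics)} metrics change owner")
```

Only metrics that land on the new node move, but their history still has to be copied across, which is where a tool such as carbonate comes in.
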
- has anyone used skyline?
- we looked at it, but we got lots of false alerts
- my suspicion is that if we understood maths better, we could make it work really well
- in sensu-community-plugins, there’s a check-graphite
- it does nice things like exceeding N std deviations
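
A rough sketch of the "exceeding N std deviations" idea. This is not the check-graphite plugin's code (that lives in sensu-community-plugins); the function name, its parameters and the sample values are assumptions made for illustration.

```python
import statistics


def exceeds_n_sigma(history, latest, n=3.0):
    """True if `latest` is more than `n` standard deviations from the mean of `history`."""
    if len(history) < 2:
        return False  # not enough data to estimate the spread
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        # A perfectly flat series: any change at all is a deviation.
        return latest != mean
    return abs(latest - mean) > n * stdev


# Hypothetical datapoints, e.g. requests per second over recent intervals.
history = [10, 11, 9, 10, 12, 10, 9, 11]
print(exceeds_n_sigma(history, 30))
print(exceeds_n_sigma(history, 11))
```

A check like this assumes the metric is roughly stable; spiky or seasonal metrics would need a smarter baseline, which may be part of the false alert problem mentioned above.
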
- what do people use below graphite?
- we’re using collectd
- the latest stuff has statsd and jmx connectors
- anyone using ganglia?
- we’re replacing ganglia with sensu stuff and diamond
- why are you using diamond rather than the in-built sensu stuff?
- because we’re a python shop
- we push data over rabbitmq
- and fan in to a big fat central fusionio graphite
- how do you monitor rabbit?
- sensu monitors rabbit using rabbit
- there are healthchecks which should fire if rabbit is completely broken
- we have a cron job on every rabbit and every sensu server to
kill the process every hour
- and it still works
- is anyone using riemann?
- is it worth spending time with?
- where does it add value?
- real time anomaly detection
- it does events as well as numbers
- it also has event timeouts – it can notify on the absence of events as well as their presence
- I think you could replace statsd with riemann
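
A small sketch of the "notify on absence" idea: each event carries a time to live, and anything not refreshed before its deadline is reported as expired. Riemann itself is configured in Clojure; the class and method names below are hypothetical and only illustrate the concept.

```python
class ExpiryIndex:
    """Tracks the latest deadline per (host, service); hypothetical illustration."""

    def __init__(self):
        self._deadline = {}  # (host, service) -> time after which the event is stale

    def record(self, host, service, now, ttl):
        # A fresh event pushes the deadline forward.
        self._deadline[(host, service)] = now + ttl

    def expired(self, now):
        # Report and forget every key whose deadline has passed.
        stale = sorted(key for key, deadline in self._deadline.items() if deadline < now)
        for key in stale:
            del self._deadline[key]
        return stale


index = ExpiryIndex()
index.record("web1", "heartbeat", now=0, ttl=10)
index.record("web2", "heartbeat", now=0, ttl=60)
print(index.expired(now=5))
print(index.expired(now=11))
```

The point is that silence becomes an alertable event in its own right, rather than something you notice only when a graph goes flat.
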
- does anyone store second or subsecond data for a long time?
- we have a single biggest day each year
- we snapshot everything for that day - stats, logs, etc
- use it to drive load testing for the next year
- we’ve been trying redshift
- how does elasticsearch cope with metrics?
- we push quite large documents about everything to do with a web request
- I often find log data in kibana much more useful than the same data in graphite
- does anyone use realtime queries to drive alerting from
elasticsearch?
- yes, from graylog2
- one thing we’ve done recently is tuning down the amount of io
operations that carbon uses per second. massively reduces disk
usage
- or write to ram disk and sync once per minute
- how do you get devs to make more metrics available?
- you put them on call until they do
- do people cull metrics at all?
- i never have enough data
- do people have app metrics measured by their continuous delivery
pipeline?
- our apps publish an xml document which is a schema of the types of metrics that they can publish
- if I don’t hate myself, is there anything other than sensu I
should use for monitoring that environment?
- does anyone rely on cloudwatch?
- we use it as a source for some data (ELB metrics)
- you can get these delivered into S3 these days
- but it only stores data for two weeks
- does anyone using sensu miss nagios tactical view?
- I miss having a decent dashboard
- I don’t miss the 10 different nagioses per environment
- I don’t miss the failover when we lost the primary nagios instance and all the state in it
- we wrote a dashboard to query nagios and sensu
- from the internet peanut gallery: is anyone using circonus?
- my agenda:
- the presence of artefacts I don’t necessarily own
- large graphical images or video data
- third party applications
- I may wish to release the same artefact multiple times
- we’ll use oracle 11 everywhere at one patch level
- but in different configurations
- windows images (VDIs)
- fpm is useful
- but it never generates a spec file or a source rpm
- makes me uncomfortable
- I’m not happy about rpms, because you can only have one version of
one package installed at once
- eg a simple webapp where we don’t want to do the loadbalancer dance
- that also implies the app is relocatable which vendor binaries often aren’t
- is containerization part of the solution?
- it allows you to have multiple overlapping filesystems
- a model: each customer has their own container
- we haven’t done it
- that sounds very expensive
- how do you version control containers?
- do you treat them as a single binary?
- do you reconstruct it?
- a lot of solutions assume all machines are stateless
- someone else will deal with the databases
- containers allow you to minimize surprise
- a DBA logging into your container can find things where they expect, even if it’s from an underlying frankenstein filesystem
- I don’t mind snapshots, but they should be generated mechanically and repeatably.
- what tool would you love to exist in an ideal world?
- I’d like the deployment database to do effectively dependency
injection
- I know where the dependencies are and what data I’m injecting, so I can use system monitoring to know what I’ve deployed
- HTTP isn’t the best protocol in the world
- use queues!
- refactoring and testing is a better solved problem within the
python programming language than over the network
- I don’t think it’s hard to test µservices
- there are clear contracts
- that’s the theory, right?
- we end up building lots of small monoliths and wiring them together
- we switched to using amazon SNS to manage notifications
- how you get your ops team to support µservices is you get them to
support as little as possible
- they only work when the functional team owns the whole stack right to the bottom
- services have a life cycle
- we like building things
- we should get better at killing things when nobody is using them
- is there an additional cost to the organization for running
µservices?
- is there an organizational cost to having a 2 million line codebase?
- ownership of services
- handover of building team to ongoing running team
- problems can get pushed back to the building team
- antipattern around µservices:
- developers think they’re clever
- ntp is a µservice
- aren’t µservices and SOA the same thing?
- is it SOA done right?
- how do you deal with PRs?
- what about things that are not on your roadmap?
- by not having a very good roadmap?
- or moving in directions you don’t want to go?
- it can be awkward because people might have put a lot of work in
- but you need to explain “if you want to do that you need to fork it”
- you can try to avoid it by writing a decent rationale of what you’re trying to do
- though you can’t answer all the questions up front
- you want to optimize for dragging people into your community
- as the implementer, your documentation is going to be awful
- because you already understand the whole system and don’t understand when you’re assuming tacit knowledge
- whereas if you can attract users to your irc channel, and answer their questions really clearly, they can write great docs for you
- I try to have a policy of: if anything confuses you, here’s my email, twitter, irc, etc and I will try to help you
- encourage people to raise bugs against docs
- I come from the perl community
- there are 10-15 year old projects there where the maintainer has changed 4-6 times
- have you got an example?
- Catalyst
- ~200 repos (core + plugins)
- ~450 active committers
- plugins are interesting: if people are trying to pull the project in different directions, you can let them through extensions but keep the core very small
- does anyone have experience of running OSS projects at work?
- how do you manage your time?
- the important PRs to pay attention to are those from new
contributors
- certainly get back within 24 hours
- don’t necessarily have to merge
- why are you open sourcing this code?
- to get the community using it
- to get good publicity
- do you have an OSS landing page?
- yes, but it’s out of date
- the OSS stuff that has mostly been infrastructure-related we’ve been trying to put into a separate github org
- you imply some level of support here
- running an OSS project is more than just making code open
- to be able to do that successfully, you need to at least mentally divest yourself from your parent organization
- what do you do if that project isn’t your main focus?
- my OSS contributions are entirely selfish
- you need a maintainer
- there need to be clear communication channels
- what does a maintainer do?
- is it always one person?
- no! not if you can avoid it
- once a project has a community it’s difficult for one person to maintain
- even if you’re not writing code, managing the community can rapidly become a full-time job
- what about the cost of maintenance?
- use travis!
- but please review the contribution even if the contribution passes the tests
- problem of selectivity, vision and direction
- mozilla in the early days, just accepted everything.
- ended up having to rewrite as firebird (now firefox)
- how do you ensure governance doesn’t become onerous?
- example of people who forked their own project after it had become an apache project
- example of gcc fork (egcs) which got merged back in
- a lot comes back to documenting your original vision
- I’ve been added as a maintainer in places, and sometimes there’s clear advice and sometimes there isn’t.
- open sourcing a project that you don’t use is a recipe for
abandonware.
- we also have an organization for abandoned code to move it out of our main github org
- forks
- how do you transfer maintainership?
- what happens if a project gets abandoned and then forked?
- what are the good communication channels to have for an OSS
project?
- own website for announcement and discovery?
- how do you summarize your project?
- peeve: like <other project> but X
- community of contributors comes from community of users
- so good user documentation will foster contributors
- issues
- is it worth seeding the issues list even if we have an internal tracker?
- yes, because it helps users google for error messages
- they are effectively documentation
- do you move to only use the external tracker or do you have an internal tracker too?
- do you need a security contact?
- yes, with a GPG key
- people need to see activity
- if all your activity is on your internal tracker & mailing list & private irc, people will think it’s dead
- where do people host mailing lists?
- google groups
- a few people are averse to irc
- people don’t realise that they won’t get an immediate response necessarily
- irc shouldn’t be used alone
- timezones are also an issue
- ipython uses hangouts
- gmane: a newsgroup view on your mailing list
- don’t have a separate irc channel per project if you’re managing lots of projects
- how do you host your docs?
- you should control your domain?
- when is a README not enough?
- start with github pages, and you can migrate later
- what should it have?
- screenshots
- getting started guides
- github pages are a bad idea because you can’t version them
- readthedocs keeps old versions too
- contributions must update docs when they update behaviour
- documentation & communication are super super important
- careful with contributions from newbies
- rejecting a contribution because of lack of tests can be
tricky
- they might not have written many tests in general
- they might not understand your particular test framework
- but rejecting because of no docs is more reasonable
- you can write tests for them
- and use this as a communication channel
- “does this test look like it’s measuring the thing you’re trying to build?”
- how do you handle trolls, griefers and timewasters?
- one small doc patch earns you a hundred stupid questions
- love your idiots
- what’s arrived? what’s died?
- Big Data is now a thing people talk about
- you’re now seeing adverts on the tube about it
- is couchdb dead?
- npm?
- we still use it, but we only used it as a key-value store
- still going:
- mongo
- riak
- websockets are now standardized and supported by lbs, proxies
- edgeconf
- grunt and pig and oink and stuff
- doing a js build and running tests
- angularjs
- ndoc has gone
- flash is in its death throes
- most video sites work on an ipad
- webgl has taken hold
- epic demoed unreal engine 4 in firefox
- 60 fps on the web
- docker!
- although solaris has been doing it for yonks
- golang has taken off
- when did go hit 1.0?
- people are rewriting individual bits in go (rather than everything)
- is hacker news dead yet?
- bitcoin happened
- VPS providers have been getting attacked by people trying to steal bitcoins
- people trawling github to find access keys
- bitcoin mining in the browser
- erlang
- nobody’s started writing things in it
- though there’s elixir
- and julia
- and idris
- what’s falling out of favour?
- ruby? no
- scala? no
- facebook’s hack
- seems sensible if you’re already in a php environment
- bittorrent
- an incredibly good way of saturating your network
- though this isn’t new
- µservices
- just due to containerization?
- seems to be a bunch of ex-tw people
- elasticsearch is now usable
- and quite good
- and they acquired logstash and kibana
- logs being searchable in es
- splunk has a reasonable oss competitor
- graphite has grown
- there’s experimentation going on there
- storage backends (cassandra, leveldb)
- what about lucene?
- very few people use it directly these days
- snowden
- DC security
- https everywhere
- gmail is now ssl only
- PFS
- the perception that TLS is expensive
- spdy
- webp
- IE6 is on its deathbed
- winxp
- though it’s still in cash terminals
- mobile growth
- many sites are on the edge of 50% mobile
- talk of mobile first and now mobile only
- 4G
- bootstrap
- wearables & IoT
- fitbit
- pebble
- automotive
- tesla motors
- security updates
- wordpress now has autoupdate
- nagios isn’t dead yet
- sensu is still the hot new thing
- riemann
- flapjack
- desktops are going away
- except for gaming
- centos is now owned by redhat
- linux mint?
- systemd
- ubuntu as a server is now more probable
- is upstart going away?
- postgres got built-in replication
- graph dbs (neo4j)
- paas
- people are still excited
- it got even more complicated to install your own
- where’s node going?
- streaming extensions
- rx in .NET
- rise of functional
- linux on the desktop?
- the XPS13 is good
- the rise of chromebooks
- openstack?
- everyone thinks it’s a great idea
- private clouds?
- azure will sell you an on-premise cloud thing
- what’s the difference between an in-house cloud and a data centre?
- drones, quadcopters, hexapods
- for filming
- what’s coming up? what will be important at the next scale
summit?
- net security is in flux
- forks of android will be the new linux distro
- http 2
- IPv6?
- anomaly detection
- software defined networks
- containerization
- silicon roundabout?
- it’s not a playground for children anymore
- the adults have taken over
- computing in government
- US has 18F
- GDS
- I’d like there to be a world-class home grown east london
startup doing technically challenging stuff
- startups which solve technical problems don’t generally get funded
- acquisitions
- crowdfunding?
- no one cares
- what’s going to die?
- couchdb
- python 2 will not die
- how do we hire & train new people into our industry?
- we certainly have struggled to recruit
- we’ve come to the realization that part of the solution is hiring junior people & growing them into the role
- I’ve been asked to mentor a junior person but I’ve no idea what to do
- I’m a recent junior
- one on one time is quite good
- I came in having a basic idea what I’d be doing
- be open for questions
- the devops world is really overwhelming
- it’s so useful to be able to ask things
- that’s one of the ground rules we’ve agreed on
- ie that I’m interruptible
- we’ve certainly noticed that hiring in the junior area is useful
- it’s great having juniors because you get chaos monkeys as well
- if you’re not prepared to let a junior touch something, you probably need to make it more resilient
- ETO1: 12-week night course
- teaches you how to teach
- how do you get the theory? how do you talk about underlying
principles that are independent of the particular situation at
hand?
- pair programming is really good for that
- does that depend on the teaching style of the pair?
- make the junior document the things that you’re teaching them
- it helps ensure that they’ve understood it
- I get irritated when technical people tweet complaining about the
cost of interruptions
- when you have new people, you have to empower them to interrupt
- I don’t think you should have your entire team mentor a new starter
- we use the red flag system
- you put a red flag up if you don’t want to be interrupted
- designated interruptible person
- juniors also have a difficult time saying no
- you want to make everyone happy and be helpful
- do you have a system that makes work visible? eg kanban
- we have a helpdesk system
- but external people don’t use it for smaller tasks
- raise a ticket on their behalf
- how do we teach juniors that it’s ok to say no?
- also, how to understand what the requestor is trying to achieve, rather than the specific task they want done, and recognize when it’s the wrong fit?
- juniors are way more engaged if they get a choice (however constrained) on what they get to spend their time on
- also allow people to fail
- teach them that it’s okay to fail
- I troll my junior developers sometimes
- I lead them down the garden path
- but then I’m there to pick up the pieces when they fail
- do something that’s visible to other people in the company
- so that they can show people what they’re capable of
- how do you direct people through different areas of knowledge?
- do you go shallow on lots of tools? Or really deep on one thing?
- depends on the junior
- throw things at them and see what sticks
- go broad with the concepts early on
- architecture, system, etc
- onboarding
- desk & computer should be ready
- first week should be meeting all the people they need to know about
- have monthly checkins with the mentor
- checkins, not reviews!
- get a sales person to give a demo of whatever it is you build
- can anyone recommend useful resources for managing developers?
- how to talk to your kids or something like that
- how do you improve diversity?
- how do juniors find your roles?
- you don’t have to stick to the same old networks when hiring juniors
- thoughtbot – structured apprentice schemes
- I wonder if being more explicit & realistic in job postings about what
experience is required and what the salaries are would help?
- recruiters muddy the waters a lot
- go direct if you can
- how do you know when to stop mentoring? and how do you measure success?
- docker + 150 lines of shell
- mirroring cpan, rubygems, npm
- filesystems are good at serving things that look like files
- you don’t need to use couch or
- what was the easiest to mirror?
- cpan – it has a single line rsync command to create a mirror
- wikipedia is hard to mirror
- each wikimedia site has a different set of plugins
- it’s important to have good search for your site
- we use google analytics. you can use this to find click behaviour
for particular search terms
- ie for term X, how often do people click on link 1, 2, 3, 4, etc
- automate this!
- crunch the most popular searches
- identify how many clicks they got
- use it to calculate how many more clicks we would have got if we had ordered the results better
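
One possible way to sketch the last step, under loudly stated assumptions: each result slot has a fixed share of attention (`position_bias`, which you would have to estimate from your own analytics), and a result's appeal is its clicks divided by the bias of the slot it was shown in. Neither the function name nor this model comes from the discussion; it is only an illustration of "how many more clicks if we had ordered the results better".

```python
def extra_clicks_if_reordered(clicks_by_position, position_bias):
    """Estimate the click gain from showing the most appealing results first.

    clicks_by_position: observed clicks on results 1, 2, 3, ... for one search term
    position_bias: assumed share of attention each slot receives (hypothetical model)
    """
    # Appeal of each result, corrected for the slot it happened to occupy.
    appeal = [clicks / bias for clicks, bias in zip(clicks_by_position, position_bias)]
    # Pair the most appealing result with the most visible slot, and so on down.
    best_order = sorted(appeal, reverse=True)
    slots = sorted(position_bias, reverse=True)
    expected = sum(a * b for a, b in zip(best_order, slots))
    return expected - sum(clicks_by_position)


# Hypothetical numbers: the second result is clearly what people wanted.
gain = extra_clicks_if_reordered([10, 40, 5], [0.5, 0.3, 0.2])
print(f"estimated extra clicks: {gain:.1f}")
```

Running this over the most popular searches and summing the gains gives a rough ranking of which result pages are most worth fixing.
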
- juju is a service orchestration tool
- apple, facebook employees hacked via website malware, java vulnerability
- data in transit protection
- data at rest protection
- authentication
- user to device, user to service, device to service
- secure boot
- firmware
- platform integrity and app sandboxing
- app whitelisting
- although key here is to ensure that whitelist doesn’t take too long to modify for new things
- security policy
- sounds like configuration management
- external interface protection (firewalls)
- device update policy
- incident response
- things will go wrong
- although don’t worry too much about this
- unless you have to.
- scale using libraries
- a library has all the modularity properties that services have
- except you don’t need to worry about the network going down
- august 29th for 3 days
- go here
The book is "how to talk so kids will listen & listen so kids will talk" by Adele Faber and Elaine Mazlish.