- jenkins as a build tool
- microservices
- how do we standardize unit testing in our pipeline?
- is a failure because we broke the pipeline, or because the code is bad?
- we’ve recently been trying jenkins declarative pipelines
- bash scripts
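- for illustration, a minimal declarative Jenkinsfile might look like this (stage names and shell scripts are hypothetical stand-ins):

```groovy
// minimal declarative pipeline; the bash scripts are hypothetical
pipeline {
    agent any
    stages {
        stage('Build') {
            steps {
                sh './build.sh'
            }
        }
        stage('Unit tests') {
            steps {
                sh './run-unit-tests.sh'
            }
        }
    }
    post {
        failure {
            // a pipeline syntax error fails before any stage runs, which
            // helps separate "we broke the pipeline" from "the code is bad"
            echo 'build failed'
        }
    }
}
```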
- it’s good to have jenkins config in source control
- what we did to scale up (in terms of lots of people able to create new projects)
- need some toil to set up jenkins
- automated this: jenkins detects new projects
- version control deployments
- team segregations, bringing up new agents
- just change a couple of yaml lines
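- (the notes don’t record the real schema; illustratively, onboarding might be a diff like this:)

```yaml
# hypothetical registry entry that jenkins polls to create jobs & agents
- project: payments-service
  team: payments
  agents: 2
```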
- we use bazel within our jenkins & teamcity builds
- bazel/teamcity autodetect flaky tests & rerun them
- remote caching & remote execution
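- roughly what that looks like in a .bazelrc (endpoints are hypothetical; flag spellings vary by bazel version):

```
# .bazelrc sketch
test --flaky_test_attempts=3                       # rerun failures; pass-on-retry is marked FLAKY
build --remote_cache=grpc://cache.internal:9092    # shared remote cache
build --remote_executor=grpc://exec.internal:8980  # remote execution
```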
- we have tension between dev teams
- we want entirely reproducible builds
- dev teams want to be able to control their own builds
- (do you version control your shared libraries too?) yes
- we have some standard pipelines rather than a jenkinsfile per repo
- reduces influence the dev teams can have on the build
- (does it cause tension because it slows them down?)
- i find it hard to talk about from dev perspective
- we have a different approach: we have 2 monorepos (!)
- one server-side / platform / continuously deployed
- one packaged software with longer release cadence
- we source control the jenkins jobs in the platform monorepo
- people will copy/paste the job builder files around
- github status checks, etc
- who uses something they didn’t build themselves for CI?
- who’s using AWS codebuild, etc
- circleci, travisci, codeship
- all have different advantages & disadvantages (eg none supports windows)
- charging models are tricky for CI SaaS
- charging for builds disincentivizes devs from using it
- https://buildkite.com/ was the nicest one we found
- problem with circle & codeship
- basic/pro tiers
- completely different implementation
- pro for codeship: ship us a dockerfile
- basic for codeship: web interface
- circleci are scrapping the web config
- (how do you scale the number of agents?)
- you add more agents
- a few people running more than 100 agents
- how do you trade off making things fast for dev teams against the size of the cluster for running jobs?
- we (as far as i’m aware) don’t test our infrastructure code
- in previous places I’ve done bits & bobs with test kitchen & serverspec
- we use terraform & puppet
- we share both in common modules
- don’t really test terraform
- puppet: test-kitchen
- central product
- jenkins master
- rspec testing
- (do you want to just run terraform plan to keep an eye on your infrastructure or something else?)
- we have terraform running from jenkins
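- one hedged sketch of that job, using plan’s exit codes to spot drift:

```bash
#!/usr/bin/env bash
# fail the jenkins job when live infrastructure has drifted from the code
set -euo pipefail
terraform init -input=false
# -detailed-exitcode: 0 = clean, 1 = error, 2 = pending changes (drift)
terraform plan -detailed-exitcode -input=false
```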
- goss
- what is infrastructure testing for?
- tests & declarations duplicate each other
- testing infrastructure is more at the integration level
- check that graphite is actually running on this port
- (is that testing or is it monitoring?)
- depends how frequently you run it
- monitoring is continuous
- interesting to hear about goss for healthchecks
- https://github.com/aelsabbahy/goss
- you can integrate it into your monitoring system
- you can specify sets of tests you want
- runs on server so it’s really fast - quick feedback
- can autogenerate tests from an existing “perfect” environment
- analyse state of all ports and create a config
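- eg, a goss.yaml for the graphite check mentioned earlier (assuming a stock carbon-cache service):

```yaml
# assert carbon is listening where graphite expects it
port:
  tcp:2003:
    listening: true
service:
  carbon-cache:
    enabled: true
    running: true
```

- `goss serve` exposes the same checks as an http healthcheck endpoint, which is one way it plugs into the monitoring system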
- i don’t think there needs to be a distinction between testing & monitoring
- the test pyramid for app code testing doesn’t work for infrastructure
- (eg: your cloud platform might make a breaking change to its API: your code might still pass but your system is broken)
- we should talk about feedback cycles:
- when you check in a piece of code, you want to know when it’s broken
- swiss cheese model:
- different layers of tests
- we used puppet testing to manage our rollout from puppet 3 to puppet 5
- infrastructure testing is useful for regression testing
- it depends on the infrastructure you inherit
- if you have inherited some pet-style (rather than cattle-style) infrastructure
- you need to manage the pets even as you migrate to cattle
- how do you build that pet (even if you don’t want to)
- BDD models are more useful for infrastructure testing
- you can repurpose tests like “is graphite running?” as monitoring
- the more unit-level tests - serverspec - can ossify the codebase
- it just tells me a person wrote the algorithm the way they intended to write it
- triggering builds based on github labels
- mark a PR as possibly affecting performance
- webhook carries label, triggers gatling run
- results are written back to PR as comment
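- a sketch of the glue (flask; the label name and runner script are hypothetical):

```python
# receive github "pull_request" webhooks; a "perf" label kicks off gatling
from flask import Flask, request
import subprocess

app = Flask(__name__)

@app.route("/webhook", methods=["POST"])
def pull_request_event():
    event = request.get_json()
    # github sends action == "labeled" when a label is added to a PR
    if event.get("action") == "labeled" and event["label"]["name"] == "perf":
        sha = event["pull_request"]["head"]["sha"]
        # hypothetical runner; it would post results back as a PR comment
        subprocess.Popen(["./run-gatling.sh", sha])
    return "", 204
```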
- problem with developers making gradual changes against their local environments
- we have a weekly local-environment-teardown ceremony
- (what does “development environment” mean here?)
- some people are using the kubernetes docker thing
- some people have their own kubernetes clusters in aws
- (are you talking about minikube?)
- don’t know
- most tools are instance-centric
- with the growth of FaaS these tools don’t fit any more
- IAM policies
- security groups
- eg you have a hosted RDS db that’s only accessible from a lambda fn
- you want to verify that only certain ports open on the hosted db
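- that check is scriptable; a sketch with boto3, assuming a postgres RDS (group ids are hypothetical):

```python
# assert the db's security group only allows 5432, and only from the lambda's sg
import boto3

DB_SG = "sg-11111111"      # hypothetical
LAMBDA_SG = "sg-22222222"  # hypothetical

ec2 = boto3.client("ec2")
sg = ec2.describe_security_groups(GroupIds=[DB_SG])["SecurityGroups"][0]

for rule in sg["IpPermissions"]:
    assert rule["FromPort"] == 5432 and rule["ToPort"] == 5432
    assert not rule["IpRanges"]  # no CIDR-based ingress at all
    assert all(p["GroupId"] == LAMBDA_SG for p in rule["UserIdGroupPairs"])
```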
- security/audit people asking for documentation for infrastructure
- then, asking for the tests
- “tell me the ports you have open”
- lynis - node auditing
- how do you get from the “panic” phase to a reasonable, constructive way of attacking this?
- bad but effective: stand back long enough for someone else to screw it up
- but:
- we won’t know the outcome on machine learning stuff for a while
- other things like: consent has to be granular: this is really clear
- you can’t do pre-checked boxes
- problematic overlays between cookies and consent regulations
- even doing nothing is a decision, and your lawyers need to be aware
- when you get reported and the ICO come knocking you need to explain your decision
- “significant automated decision making”
- you need a data protection officer who reports directly to the board
- machine learning can codify existing biases
- we dwell on machine learning because it’s a gnarly edge case
- there’s lots of lower hanging fruit to start with
- poor security practices
- being unaware of where data comes from and goes to
- what data is kept where, and what consent was given by data
subject when they submitted that data
- you need version control for your privacy policy
- if you have a marketing email list based on dark-pattern pre-checked tickboxes, can you use that list any more after 25th may?
- ICO has limited money & people
- they won’t go after everyone on day one
- there will probably be some benefit of the doubt
- some high-profile cases will generate case law
- containeros / smartos / redhat
- what are the motivations for wanting to do immutable OS?
- security
- consistency
- avoiding snowflakes
- thinking about host os
- unikernels
- i’ve done one
- you’re responsible for everything
- it’s an interesting concept
- a friend does mirage
- the whole mindset fights against the habit of cracking open a shell and working out what’s going on
- when something goes wrong, it’s hard to debug
- a lot of unikernel things are academic projects
- there’s a lot of duct tape
- firmwares are like unikernels
- no visibility, telemetry
- a lot of them did xen initially
- they’re bright people
- it’ll take them a while to get there
- how do you find monitoring lambda?
- cloudwatch stats are basically everything
- newrelic had some functionality but it didn’t look useful
- x-ray is great for microservices generally (lambda or not)
- https://docs.aws.amazon.com/xray/latest/devguide/xray-services-lambda.html
- service map - visualization of all the traces
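- turning it on for a function is one flag (function name hypothetical):

```bash
aws lambda update-function-configuration \
    --function-name my-function \
    --tracing-config Mode=Active
```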
- what are the biggest problems people have seen with serverless?
- terraform is tricky
- API gateway isn’t one thing - it’s loads of different things bundled
- it has a hidden cloudfront in the middle of it
- main pain point we see isn’t lambda, it’s dynamo
- response time
- we found putting lambdas into VPCs added an order of magnitude to response time (with standard test traffic)
- concurrent execution limits
- how do you view x-ray? do you aggregate into cloudwatch?
- if there’s a jump in response time, you use x-ray to investigate
- anyone using kinesis much?
- shipping logs
- my experience: it works fine, but you have to specify the number of shards
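- a minimal sketch with boto3 (names and counts hypothetical); each shard takes 1 MB/s or 1,000 records/s in, and changing the count later is an UpdateShardCount / split-merge operation, not a config tweak:

```python
import boto3

kinesis = boto3.client("kinesis")
# shard count must be chosen up front
kinesis.create_stream(StreamName="app-logs", ShardCount=4)
```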
- what languages are people writing lambdas in?
- python, node, java
- google functions
- lambci - run lambda in a docker container
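- usage sketch (handler and payload are hypothetical):

```bash
# run a lambda handler locally in lambci's docker image
docker run --rm -v "$PWD":/var/task lambci/lambda:python3.6 \
    handler.lambda_handler '{"source": "local-test"}'
```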
- anyone using serverless marketplace?
- how do you onboard people?
- it was a real headfuck at first
- RPC’s bad, right?
- what is this binary format
- i don’t know why we’re doing this
- generating code as part of workflow? feels weird
- you have to think about types a bit more
- if you don’t have a monorepo, where do schemas live?
- for a while we had one repo just for protos. that was terrible
- two monorepos: one for platform (continuous deployment), one for product (6-8 week packaged software release)
- linting is important to check you haven’t reused an index
- you need to structure your repository to match your system boundaries
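- on the index-reuse point: proto’s `reserved` keyword is the guard; a sketch (message and fields hypothetical):

```protobuf
syntax = "proto3";

message Order {
  // field 2 used to be "coupon_code"; reserving it means a later edit
  // can't silently reuse the index with a different meaning
  reserved 2;
  reserved "coupon_code";
  string id = 1;
  int64 total_cents = 3;
}
```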
- graphql as another alternative?
- we like being able to deprecate things easily and to have a schema
- django impl called graphene
- js frontend
- flow for types
- relay - state management
- same types going all the way through
- we use thrift as part of our content api
- binary messaging formats
- capnproto
- a friend looked at it and liked it for low-latency work
- no deserialization - directly dump into memory
- did they need that? I’m skeptical
- changing schemas
- simple binary encoding
- https://dataintensive.net/ has a chapter on serialization formats (chapter 4: encoding and evolution)
- https://gafferongames.com/ has some articles on designing custom network protocols for games
- where TCP doesn’t fit but you still need reliability so you have to build something custom on UDP
- if you introduce something, who’s going to manage it when you go away?
- autonomy doesn’t mean free from responsibility
- SRE model: if you want to put something in prod, you have to tick these boxes
- standard tooling
- release engineering
- well-trodden path
- standards should empower autonomy
- there should be multiple implementations of the standard to make things work well
- what does your organization value? what are they concerned about?
- this will inform your approach to standards and consistency
- what are the tactics for fixing standards?
- standards always slip and they’re never up-to-date
- if you accept this, you build for it
- where i currently am, we have fully autonomous code teams
- if they want to use a new thing, they have to own it
- no tossing over the fence
- we had a devops rota
- no constraints on language
- team that did on call was drawn from devs on all different teams
- extremely good monetary reward for being called out
- (although: watch out for incentivising people to build crappy code)
- changing standards is often a power question
- a lot of orgs are top-down
- but then enforcement is lacking
- automation and testing make top-down command and control easier by detecting breaches
- we need to precisely define who is involved in standards process
- who’s accountable for the standard being met?
- our org had the spotify model (guilds/squads/etc)
- engineering guild owned how on call works
- it was in my interest to be a member of the guild
- we also had an “interested parties review” for a team to propose a novel change
- amount of rigour proportional to size of change
- from just a conversation on slack to a full document on wiki + formal meeting
- our org is full TOGAF
- services that are expensive have to budget for operational cost
- have to take the system design to a technical design authority for approval
- they can check against standard or authorize an exception
- it all sounds like it meets what we’re talking about but none of it works
- ops team gets budget cut every year and says no to everything
- TDA say no to everything and have never updated standards
- none of the delivery teams ever deliver anything
- where you can, automate compliance checking
- make it clear how to get the standard updated
- http://danger.systems/ - failure reports have link to how to update the repo
- it boils down to good team culture, good onboarding
- we have a process where people have to present ideas
- but it’s not so much about yes/no vs: it looks like you’re trying to do monitoring. are you aware of these other things that are going on?
- there’s different return on investment at different scales
- if you only have 10 devs, you probably don’t want too much process
- we’re talking about different kinds of standards:
- languages / libraries / style
- easily automatable
- processes
- much more difficult to automate
- another org: lots and lots of little things
- 7 programming languages
- all operated by same ops
- people burn out
- we want to get somewhere where these people don’t feel like everything is their responsibility
- if you put an app into heroku, and it crashes all the time, heroku won’t fix it for you
- question: how do you introduce standards where divergence already exists? has anyone done this before?
- question: how do you deprecate standards? how do you squash those last bits of (eg) php? has anyone done this?
- we’ve only been doing ADRs for just over a year
- we’re already seeing benefits from people asking questions
- joyent have an open rfd process
- rust have rfcs
- feels like RFCs are about product decisions, whereas ADRs are documenting a technical decision based on an already-established product need
- what’s the process for who gets to merge a PR?
- two thumbs up 👍👍
- numbered ADRs
- don’t change a decision, supersede it
- https://github.com/npryce/adr-tools
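- typical adr-tools usage (decision titles hypothetical):

```bash
adr init doc/adr                             # records ADR 1 for you
adr new "Use PostgreSQL for persistence"     # creates doc/adr/0002-...
adr new -s 2 "Use DynamoDB for persistence"  # supersedes ADR 2, links the two
```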
- some people don’t like them because their opinion wasn’t the popular one
- what’s useful to have in ADRs?
- where are ADRs appropriate?
- some things are relevant within a single file, can just be a code comment
- “this is modelled as a finite state machine, so don’t worry if it looks like two nested switch statements which you’d normally run away from”
- ADRs are for decisions at a larger scale
- do you do anything to control drift from previous decisions?
- no, not really
- tech lead: derisking
- talk about:
- what to build
- how you might build it
- milestones
- during reviews of decisions, i found i didn’t want to okay them
- or at least i wanted to postpone them
- i introspected to understand why
- it was because i didn’t trust the specification of the product in the first place
- i was trying to insulate the tech from what was actually a human problem
- sometimes “we can’t deploy to this environment” means “we don’t trust your team to do it correctly”
- once you realise you’re talking about a human problem as well as a tech problem, what do you do?
- how do you build trust? esp within an outside party who doesn’t trust your team’s decision making?
- @bridgetkromhout
- why?
- visibility!
- put your twitter handle on every slide please
- lower the barrier to entry for people to attribute things to you
- pre-livetweeting checklist
- find conference hashtag
- decide which talks (if multi-track)
- find speaker twitter handles
- draft tweets in tweetbot
- even if you don’t use twitter (maybe you’re on mastodon.social instead), you could still park a twitter handle for people to use as a reference to you
- who is the hashtag for?
- people at the conference
- attendees - which talk should i go to?
- speakers
- organizers who are too busy to go to talks!
- people not at the conference
- people with FOMO
- your followers who aren’t interested in the event
- they can mute the hashtag
- #keep #hashtags #simple
- you don’t need the year in the hashtag (looking at you #scalesummit18)
- take photos!
- choose the right angle
- try to include:
- speaker
- slide
- some of the room
- what is your goal here?
- when i don’t tweet
- E_TOO_MANY_DISTRACTIONS (emails etc)
- kindness > negativity
- (backchannels > subtweets)
- life is too short to argue on the internet
- mute early, mute often
- mute “well actually”
- incidents & accidents
- misquotes / misunderstandings
- trip reports
- highlight activities of your team
- pages for your own talks
- before: title, description, date, event, location
- shortly after: slides, embedded tweets
- eventually: video
- https://bridgetkromhout.com/speaking/
- conference twitter isn’t the only twitter
- we are entire human beings