- jenkins as a build tool
- microservices
- how do we standardize unit testing in our pipeline?
- is a failure because we broke the pipeline, or because the code is bad?
- we’ve recently been trying jenkins declarative pipelines
- bash scripts
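- for illustration, a minimal declarative Jenkinsfile might look like this (stage names and shell scripts are hypothetical stand-ins):

```groovy
// minimal declarative pipeline; the bash scripts are hypothetical
pipeline {
    agent any
    stages {
        stage('Build') {
            steps {
                sh './build.sh'
            }
        }
        stage('Unit tests') {
            steps {
                sh './run-unit-tests.sh'
            }
        }
    }
    post {
        failure {
            // a pipeline syntax error fails before any stage runs, which
            // helps separate "we broke the pipeline" from "the code is bad"
            echo 'build failed'
        }
    }
}
```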
- it’s good to have jenkins config in source control
- what we did to scale up (in terms of lots of people able to create new projects)
- need some toil to set up jenkins
- automated this: jenkins detects new projects
- version control deployments
- team segregations, bringing up new agents
- just change a couple of yaml lines
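- (the notes don’t record the real schema; illustratively, onboarding might be a diff like this:)

```yaml
# hypothetical registry entry that jenkins polls to create jobs & agents
- project: payments-service
  team: payments
  agents: 2
```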
- we use bazel within our jenkins & teamcity builds
- bazel/teamcity autodetect flaky tests & rerun them
- remote caching & remote execution
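- roughly what that looks like in a .bazelrc (endpoints are hypothetical; flag spellings vary by bazel version):

```
# .bazelrc sketch
test --flaky_test_attempts=3                       # rerun failures; pass-on-retry is marked FLAKY
build --remote_cache=grpc://cache.internal:9092    # shared remote cache
build --remote_executor=grpc://exec.internal:8980  # remote execution
```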
- we have tension between dev teams
- we want entirely reproducible builds
- dev teams want to be able to control their own builds
- (do you version control your shared libraries too?) yes
- we have some standard pipelines rather than a jenkinsfile per repo
- reduces influence the dev teams can have on the build
- (does it cause tension because it slows them down?)
- i find it hard to talk about from dev perspective
- we have a different approach: we have 2 monorepos (!)
- one server-side / platform / continuously deployed
- one packaged software with longer release cadence
- we source control the jenkins jobs in the platform monorepo
- people will copy/paste the job builder files around
- github status checks, etc
- who uses something they didn’t build themselves for CI?
- who’s using AWS codebuild, etc
- circleci, travisci, codeship
- all have different advantages & disadvantages (eg none supports windows)
- charging models are tricky for CI SaaS
- charging for builds disincentivizes devs from using it
- https://buildkite.com/ was the nicest one we found
- problem with circle & codeship
- basic/pro tiers
- completely different implementation
- pro for codeship: ship us a dockerfile
- basic for codeship: web interface
- circleci are scrapping the web config
- (how do you scale the number of agents?)
- you add more agents
- a few people running more than 100 agents
- how do you trade off making things fast for dev teams against the size of the cluster for running jobs?
- we (as far as i’m aware) don’t test our infrastructure code
- in previous places I’ve done bits & bobs with test kitchen & serverspec
- we use terraform & puppet
- we share both in common modules
- don’t really test terraform
- puppet: test-kitchen
- central product
- jenkins master
- rspec testing
- (do you want to just run terraform plan to keep an eye on your infrastructure or something else?)
- we have terraform running from jenkins
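- one hedged sketch of that job, using plan’s exit codes to spot drift:

```bash
#!/usr/bin/env bash
# fail the jenkins job when live infrastructure has drifted from the code
set -euo pipefail
terraform init -input=false
# -detailed-exitcode: 0 = clean, 1 = error, 2 = pending changes (drift)
terraform plan -detailed-exitcode -input=false
```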
- goss
- what is infrastructure testing for?
- tests & declarations duplicate each other
- testing infrastructure is more at the integration level
- check that graphite is actually running on this port
- (is that testing or is it monitoring?)
- depends how frequently you run it
- monitoring is continuous
- interesting to hear about goss for healthchecks
- https://github.com/aelsabbahy/goss
- you can integrate it into your monitoring system
- you can specify sets of tests you want
- runs on server so it’s really fast - quick feedback
- can autogenerate tests from an existing “perfect” environment
- analyse state of all ports and create a config
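- eg, a goss.yaml for the graphite check mentioned earlier (assuming a stock carbon-cache service):

```yaml
# assert carbon is listening where graphite expects it
port:
  tcp:2003:
    listening: true
service:
  carbon-cache:
    enabled: true
    running: true
```

- `goss serve` exposes the same checks as an http healthcheck endpoint, which is one way it plugs into the monitoring system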
- i don’t think there needs to be a distinction between testing & monitoring
- the test pyramid for app code testing doesn’t work for infrastructure
- (eg: your cloud platform might make a breaking change to its API: your code might still pass but your system is broken)
- we should talk about feedback cycles:
- when you check in a piece of code, you want to know when it’s broken
- swiss cheese model:
- different layers of tests
- we used puppet testing to manage our rollout from puppet 3 to puppet 5
- infrastructure testing is useful for regression testing
- it depends on the infrastructure you inherit
- if you have inherited some pet-style (rather than cattle-style) infrastructure
- you need to manage the pets even as you migrate to cattle
- how do you build that pet (even if you don’t want to)
- BDD models are more useful for infrastructure testing
- you can repurpose tests like “is graphite running?” as monitoring
- the more unit-level tests - serverspec - can ossify the codebase
- it just tells me a person wrote the algorithm the way they intended to write it
- triggering builds based on github labels
- mark a PR as possibly affecting performance
- webhook carries label, triggers gatling run
- results are written back to PR as comment
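- a sketch of the glue (flask; the label name and runner script are hypothetical):

```python
# receive github "pull_request" webhooks; a "perf" label kicks off gatling
from flask import Flask, request
import subprocess

app = Flask(__name__)

@app.route("/webhook", methods=["POST"])
def pull_request_event():
    event = request.get_json()
    # github sends action == "labeled" when a label is added to a PR
    if event.get("action") == "labeled" and event["label"]["name"] == "perf":
        sha = event["pull_request"]["head"]["sha"]
        # hypothetical runner; it would post results back as a PR comment
        subprocess.Popen(["./run-gatling.sh", sha])
    return "", 204
```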
- problem with developers making gradual changes against their local environments
- we have a weekly local-environment-teardown ceremony
- (what does “development environment” mean here?)
- some people are using the kubernetes docker thing
- some people have their own kubernetes clusters in aws
- (are you talking about minikube?)
- don’t know
- most tools are instance-centric
- with the growth of FaaS these tools don’t fit any more
- IAM policies
- security groups
- eg you have a hosted RDS db that’s only accessible from a lambda fn
- you want to verify that only certain ports open on the hosted db
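- that check is scriptable; a sketch with boto3, assuming a postgres RDS (group ids are hypothetical):

```python
# assert the db's security group only allows 5432, and only from the lambda's sg
import boto3

DB_SG = "sg-11111111"      # hypothetical
LAMBDA_SG = "sg-22222222"  # hypothetical

ec2 = boto3.client("ec2")
sg = ec2.describe_security_groups(GroupIds=[DB_SG])["SecurityGroups"][0]

for rule in sg["IpPermissions"]:
    assert rule["FromPort"] == 5432 and rule["ToPort"] == 5432
    assert not rule["IpRanges"]  # no CIDR-based ingress at all
    assert all(p["GroupId"] == LAMBDA_SG for p in rule["UserIdGroupPairs"])
```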
- security/audit people asking for documentation for infrastructure
- then, asking for the tests
- “tell me the ports you have open”
- lynis - node auditing
- how do you get from the “panic” phase to a reasonable, constructive way of attacking this?
- bad but effective: stand back long enough for someone else to screw it up
- but:
- we won’t know the outcome on machine learning stuff for a while
- other things like: consent has to be granular: this is really clear
- you can’t do pre-checked boxes
- problematic overlays between cookies and consent regulations
- even doing nothing is a decision, and your lawyers need to be aware
- when you get reported and the ICO come knocking you need to explain your decision
- “significant automated decision making”
- you need a data protection officer who reports directly to the board
- machine learning can codify existing biases
- we dwell on machine learning because it’s a gnarly edge case
- there’s lots of lower hanging fruit to start with
- poor security practices
- being unaware of where data comes from and goes to
- what data is kept where, and what consent was given by data
subject when they submitted that data
- you need version control for your privacy policy
- if you have a marketing email list based on dark-pattern pre-checked tickboxes, can you use that list any more after 25th may?
- ICO has limited money & people
- they won’t go after everyone on day one
- there will probably be some benefit of the doubt
- some high-profile cases will generate case law
- containeros / smartos / redhat
- what are the motivations for wanting to do immutable OS?
- security
- consistency
- avoiding snowflakes
- thinking about host os
- unikernels
- i’ve done one
- you’re responsible for everything
- it’s an interesting concept
- a friend does mirage
- the whole mindset fights against the habit of cracking open a shell and working out what’s going on
- when something goes wrong, it’s hard to debug
- a lot of unikernel things are academic projects
- there’s a lot of duct tape
- firmwares are like unikernels
- no visibility, telemetry
- a lot of them did xen initially
- they’re bright people
- it’ll take them a while to get there
- how do you find monitoring lambda?
- cloudwatch stats are basically everything
- newrelic had some functionality but it didn’t look useful
- x-ray is great for microservices generally (lambda or not)
- https://docs.aws.amazon.com/xray/latest/devguide/xray-services-lambda.html
- service map - visualization of all the traces
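- turning it on for a function is one flag (function name hypothetical):

```bash
aws lambda update-function-configuration \
    --function-name my-function \
    --tracing-config Mode=Active
```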
- what are the biggest problems people have seen with serverless?
- terraform is tricky
- API gateway isn’t one thing - it’s loads of different things bundled
- it has a hidden cloudfront in the middle of it
- main pain point we see isn’t lambda, it’s dynamo
- response time
- we found putting lambdas into VPCs added an order of magnitude to response time (with standard test traffic)
- concurrent execution limits
- how do you view x-ray? do you aggregate into cloudwatch?
- if there’s a jump in response time, you use x-ray to investigate
- anyone using kinesis much?
- shipping logs
- my experience: it works fine, but you have to specify the number of shards
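- a minimal sketch with boto3 (names and counts hypothetical); each shard takes 1 MB/s or 1,000 records/s in, and changing the count later is an UpdateShardCount / split-merge operation, not a config tweak:

```python
import boto3

kinesis = boto3.client("kinesis")
# shard count must be chosen up front
kinesis.create_stream(StreamName="app-logs", ShardCount=4)
```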
- what languages are people writing lambdas in?
- python, node, java
- google functions
- lambci - run lambda in a docker container
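- usage sketch (handler and payload are hypothetical):

```bash
# run a lambda handler locally in lambci's docker image
docker run --rm -v "$PWD":/var/task lambci/lambda:python3.6 \
    handler.lambda_handler '{"source": "local-test"}'
```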
- anyone using serverless marketplace?
- how do you onboard people?
- it was a real headfuck at first
- RPC’s bad, right?
- what is this binary format
- i don’t know why we’re doing this
- generating code as part of workflow? feels weird
- you have to think about types a bit more
- if you don’t have a monorepo, where do schemas live?
- for a while we had one repo just for protos. that was terrible
- two monorepos: one for platform (continuous deployment), one for product (6-8 week packaged software release)
- linting is important to check you haven’t reused an index
- you need to structure your repository to match your system boundaries
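- on the index-reuse point: proto’s `reserved` keyword is the guard; a sketch (message and fields hypothetical):

```protobuf
syntax = "proto3";

message Order {
  // field 2 used to be "coupon_code"; reserving it means a later edit
  // can't silently reuse the index with a different meaning
  reserved 2;
  reserved "coupon_code";
  string id = 1;
  int64 total_cents = 3;
}
```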
- graphql as another alternative?
- we like being able to deprecate things easily and to have a schema
- django impl called graphene
- js frontend
- flow for types
- relay - state management
- same types going all the way through
- we use thrift as part of our content api
- binary messaging formats
- capnproto
- a friend looked at it and liked it for low-latency work
- no deserialization - directly dump into memory
- did they need that? I’m skeptical
- changing schemas
- simple binary encoding
- https://dataintensive.net/ has a chapter on serialization formats (chapter 4: encoding and evolution)
- https://gafferongames.com/ has some articles on designing custom network protocols for games
- where TCP doesn’t fit but you still need reliability so you have to build something custom on UDP
- if you introduce something, who’s going to manage it when you go away?
- autonomy doesn’t mean free from responsibility
- SRE model: if you want to put something in prod, you have to tick these boxes
- standard tooling
- release engineering
- well-trodden path
- standards should empower autonomy
- there should be multiple implementations of the standard to make things work well
- what does your organization value? what are they concerned about?
- this will inform your approach to standards and consistency
- what are the tactics for fixing standards?
- standards always slip and they’re never up-to-date
- if you accept this, you build for it
- where i currently am, we have fully autonomous code teams
- if they want to use a new thing, they have to own it
- no tossing over the fence
- we had a devops rota
- no constraints on language
- team that did on call was drawn from devs on all different teams
- extremely good monetary reward for being called out
- (although: watch out for incentivising people to build crappy code)
- changing standards is often a power question
- a lot of orgs are top-down
- but then enforcement is lacking
- automation and testing make top-down command and control easier by detecting breaches
- we need to precisely define who is involved in standards process
- who’s accountable for the standard being met?
- our org had the spotify model (guilds/squads/etc)
- engineering guild owned how on call works
- it was in my interest to be a member of the guild
- we also had an “interested parties review” for a team to propose a novel change
- amount of rigour proportional to size of change
- from just a conversation on slack to a full document on wiki + formal meeting
- our org is full TOGAF
- services that are expensive have to budget for operational cost
- have to take the system design to a technical design authority for approval
- they can check against standard or authorize an exception
- it all sounds like it meets what we’re talking about but none of it works
- ops team gets budget cut every year and says no to everything
- TDA say no to everything and have never updated standards
- none of the delivery teams ever deliver anything
- where you can, automate compliance checking
- make it clear how to get the standard updated
- http://danger.systems/ - failure reports have link to how to update the repo
- it boils down to good team culture, good onboarding
- we have a process where people have to present ideas
- but it’s not so much about yes/no vs: it looks like you’re trying to do monitoring. are you aware of these other things that are going on?
- there’s different return on investment at different scales
- if you only have 10 devs, you probably don’t want too much process
- we’re talking about different kinds of standards:
- languages / libraries / style
- easily automatable
- processes
- much more difficult to automate
- another org: lots and lots of little things
- 7 programming languages
- all operated by same ops
- people burn out
- we want to get somewhere where these people don’t feel like everything is their responsibility
- if you put an app into heroku, and it crashes all the time, heroku won’t fix it for you
- question: how do you introduce standards where divergence already exists? has anyone done this before?
- question: how do you deprecate standards? how do you squash those last bits of (eg) php? has anyone done this?
- we’ve only been doing ADRs for just over a year
- we’re already seeing benefits from people asking questions
- joyent have an open rfd process
- rust have rfcs
- feels like RFCs are about product decisions, whereas ADRs are documenting a technical decision based on an already-established product need
- what’s the process for who gets to merge a PR?
- two thumbs up 👍👍
- numbered ADRs
- don’t change a decision, supersede it
- https://github.com/npryce/adr-tools
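- typical adr-tools usage (decision titles hypothetical):

```bash
adr init doc/adr                             # records ADR 1 for you
adr new "Use PostgreSQL for persistence"     # creates doc/adr/0002-...
adr new -s 2 "Use DynamoDB for persistence"  # supersedes ADR 2, links the two
```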
- some people don’t like them because their opinion wasn’t the popular one
- what’s useful to have in ADRs?
- where are ADRs appropriate?
- some things are relevant within a single file, can just be a code comment
- “this is modelled as a finite state machine, so don’t worry if it looks like two nested switch statements which you’d normally run away from”
- ADRs are for decisions at a larger scale
- do you do anything to control drift from previous decisions?
- no, not really
- tech lead: derisking
- talk about:
- what to build
- how you might build it
- milestones
- during reviews of decisions, i found i didn’t want to okay them
- or at least i wanted to postpone them
- i introspected to understand why
- it was because i didn’t trust the specification of the product in the first place
- i was trying to insulate the tech from what was actually a human problem
- sometimes “we can’t deploy to this environment” means “we don’t trust your team to do it correctly”
- once you realise you’re talking about a human problem as well as a tech problem, what do you do?
- how do you build trust? esp within an outside party who doesn’t trust your team’s decision making?
- @bridgetkromhout
- why?
- visibility!
- put your twitter handle on every slide please
- lower the barrier to entry for people to attribute things to you
- pre-livetweeting checklist
- find conference hashtag
- decide which talks (if multi-track)
- find speaker twitter handles
- draft tweets in tweetbot
- even if you don’t use twitter (maybe you’re on mastodon.social instead), you could still park a twitter handle for people to use as a reference to you
- who is the hashtag for?
- people at the conference
- attendees - which talk should i go to?
- speakers
- organizers who are too busy to go to talks!
- people not at the conference
- people with FOMO
- your followers who aren’t interested in the event
- they can mute the hashtag
- #keep #hashtags #simple
- you don’t need the year in the hashtag (looking at you #scalesummit18)
- take photos!
- choose the right angle
- try to include:
- speaker
- slide
- some of the room
- what is your goal here?
- when i don’t tweet
- E_TOO_MANY_DISTRACTIONS (emails etc)
- kindness > negativity
- (backchannels > subtweets)
- life is too short to argue on the internet
- mute early, mute often
- mute “well actually”
- incidents & accidents
- misquotes / misunderstandings
- trip reports
- highlight activities of your team
- pages for your own talks
- before: title, description, date, event, location
- shortly after: slides, embedded tweets
- eventually: video
- https://bridgetkromhout.com/speaking/
- conference twitter isn’t the only twitter
- we are entire human beings