Morgan McLean - Google
This talk is all about opencensus.io. Morgan is the PM for it. He's arguing that the usual N pillars of observability are not sufficient to... have observability.
Instead he's saying context/topology+status+root cause analysis == observability.
OpenCensus does:
- distributed traces
- tags
- metrics
They're providing libraries in a bunch of languages. This seems like a plug-and-play replacement for a Prometheus+Zipkin stack? https://opencensus.io/faq/index.html
N.B.: The Python client libs aren't complete yet. They do tracing but not metrics exporting.
Very focused on app-level telemetry. Don't really support system metrics etc.
Overall: out of the box observability. Writing your own exporter is super easy etc.
Yan Cui - DAZN
Sports streaming site. Not launched in the US yet, but soon.
Not being able to install daemons (because serverless) makes life hard. You don't want the monitoring in your critical path, etc.
Great Charity Majors quote:
"With distributed systems you don't care about the health of the system - you care about the health of the event or the slice."
Prateek Rungta - Uber
Golden Signals: Something high in signal/noise ratio. Lots of them come from the SRE book.
Auto-create dashboards for your services.
Group alerts (we do this) and set some alerts as dependent on others (we don't quite do this, but we probably could).
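A rough sketch of the "alerts dependent on others" idea, to make it concrete. This is not Uber's implementation: the alert names, the `DEPENDS_ON` map, and the `alerts_to_page` helper are all made up for illustration. The assumption is that a firing alert should be suppressed whenever something upstream of it is also firing.

```python
# Hypothetical dependency map: alert -> the alerts it depends on (upstream causes).
DEPENDS_ON = {
    "api_latency_high": ["db_down"],
    "checkout_errors": ["api_latency_high"],
    "db_down": [],
}


def alerts_to_page(firing, depends_on):
    """Return the firing alerts that are not explained by a firing upstream alert."""
    firing = set(firing)

    def has_firing_ancestor(alert, seen):
        for parent in depends_on.get(alert, []):
            if parent in seen:
                continue  # guard against cycles in the map
            if parent in firing or has_firing_ancestor(parent, seen | {parent}):
                return True
        return False

    return sorted(a for a in firing if not has_firing_ancestor(a, {a}))


# Only the root cause is left to page on.
print(alerts_to_page(["api_latency_high", "checkout_errors", "db_down"], DEPENDS_ON))
# ['db_down']
```

Note the sketch suppresses on any firing ancestor, even if the alert in between is quiet. Whether that is what you want is a policy call.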
400-600M raw metrics per second. 20M stored metrics per second. Seeing about 20% growth quarter over quarter.
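Quick back-of-the-envelope on those numbers, assuming the 20% compounds every quarter and taking the raw and stored rates at face value (the talk notes don't say how raw metrics become stored ones, so the ratio is just a ratio):

```python
def yearly_multiplier(quarterly_growth):
    """Compound a per-quarter growth rate over four quarters."""
    return (1 + quarterly_growth) ** 4


print(f"{yearly_multiplier(0.20):.2f}x per year")  # 2.07x per year

# Ratio of raw to stored metrics per second, for both ends of the quoted range.
raw_low, raw_high, stored = 400e6, 600e6, 20e6
print(raw_low / stored, raw_high / stored)  # 20.0 30.0
```

So 20% quarter over quarter is roughly a doubling every year.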
2014-2015: Graphite. 2015-2016: Cassandra, with 16x YoY growth. Expensive, more than 1500 Cassandra hosts! Mostly due to compactions and slow repairs, and they ended up turning down the replication factor to cope. Sound familiar?!
They looked at OSS products; none scaled that far. They looked at vendors; none were that cost-effective. So they wrote their own: M3DB. It's open source.
This is now going pretty deep into the M3DB architecture. It seems like a pretty sweet design, but I think if you want to know all the details it's best just to look at Prateek's talk. More info: github.com/m3db
Slides: bit.ly/m3db-monitorama2018
Mercedes Coyle - Sensu
Allan Espinosa - Bloomberg
Assisted Remediation: By trying to build an autoremediation system, we realized we never actually wanted one
Kale Stedman - Demonware
Dave Cadwallader - DNAnexus
- Relationship between Ops and Security matters!
- Security vs Compliance
- How to automate compliance checking
- Creating compliance SLOs/SLAs
So DNAnexus' whole thing is being a platform for DNA research and storage. This is hXc HIPAA information, so they have to do a ton of work around compliance and reporting.
Compliance just means you meet a certain set of requirements at a certain moment in time. You still have to take action at all the other times to be secure.
They use Prometheus to ping things and say "how you doin?" To use this with CloudWatch, for example, you need an intermediate exporter process.
He shows an example of using linear extrapolation within Prometheus to do "disk will fill up in X hours". It looks pretty simple to do. He's also using InSpec (github.com/inspec), which is an auditing framework.
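To make the "disk will fill up in X hours" trick concrete, here is a minimal sketch of the underlying math: fit a least-squares line through recent free-space samples and see where it crosses zero. Prometheus has a built-in `predict_linear()` function that does this kind of fit over a range vector; these notes don't record the exact query Dave used, so the function names and sample data below are illustrative assumptions, not his code.

```python
def fit_line(samples):
    """Least-squares fit through (timestamp_seconds, value) pairs.

    Returns (slope, intercept) of value = slope * t + intercept.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    if var == 0:
        raise ValueError("samples must have distinct timestamps")
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    slope = cov / var
    return slope, mean_v - slope * mean_t


def hours_until_full(free_bytes_samples):
    """Hours from the last sample until free space hits zero.

    Returns None if free space is flat or growing.
    """
    slope, intercept = fit_line(free_bytes_samples)
    if slope >= 0:
        return None
    t_zero = -intercept / slope  # timestamp where the fitted line crosses zero
    return (t_zero - free_bytes_samples[-1][0]) / 3600


# Made-up samples: free space dropping by 10 units per hour.
samples = [(0, 100.0), (3600, 90.0), (7200, 80.0)]
print(hours_until_full(samples))
```

With those samples the line hits zero at t = 36000 seconds, which is about 8 hours after the last sample.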
InSpec wants to SSH to each of your prod boxes and run audits on them. "Eeew!" said the security team. So you can just schedule it on each machine locally, dump to JSON, and then have a bit of code that writes to Prometheus!
I really like his summarization here. He's just doing "passed, failed, skipped" counters for each host. If you need to investigate, go to the darn host and read the logs! Then, you can put the ultimate compliance SLO on your boxes! If you have any failed tests, your detector rule flags it. Perfect.
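A sketch of what that "bit of code" might look like, under a few assumptions: that the InSpec JSON report nests results as profiles -> controls -> results with a `status` field, that the numbers are exposed in the Prometheus text format (for example via a file that node_exporter's textfile collector picks up), and that the metric name `inspec_checks` is made up here. This is not the code from Dave's exporter. It also uses a gauge rather than a counter, since each run is a snapshot.

```python
import json

STATUSES = ("passed", "failed", "skipped")


def count_results(report):
    """Tally result statuses across every profile and control in a report."""
    counts = {status: 0 for status in STATUSES}
    for profile in report.get("profiles", []):
        for control in profile.get("controls", []):
            for result in control.get("results", []):
                status = result.get("status")
                if status in counts:
                    counts[status] += 1
    return counts


def to_prometheus_text(counts, metric="inspec_checks"):
    """Render the tallies in the Prometheus text exposition format."""
    lines = [f"# TYPE {metric} gauge"]
    for status in STATUSES:
        lines.append(f'{metric}{{status="{status}"}} {counts[status]}')
    return "\n".join(lines) + "\n"


def is_compliant(counts):
    """The 'compliance SLO': zero failed tests on this host."""
    return counts["failed"] == 0


# Made-up report in the assumed shape.
report = json.loads("""
{"profiles": [{"controls": [
  {"results": [{"status": "passed"}, {"status": "failed"}]},
  {"results": [{"status": "skipped"}, {"status": "passed"}]}
]}]}
""")
counts = count_results(report)
print(to_prometheus_text(counts), end="")
print("compliant:", is_compliant(counts))
```

The alert rule on the Prometheus side then only needs to check whether the failed count is above zero for any host.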
He's looking for collaborators on his "security through observability" project: https://github.com/geekdave/prometheus_inspec_exporter
I always enjoy Dave's talks. This one was really cool.
Beth Cornils - Hashicorp
This talk is not about monitoring. It's about D&I.
How she says you should hire people. A lot to unpack here, because it appears - per my own interpretation - essentially unmeritocratic. But I may be wrong on this and missing a lot of nuance.
- Cand. meets min qualifications for the job?
- Cand. has capacity to learn/grow into the job?
- Will cand. contribute to grow a culture of inclusion?
Then we do the privilege walk, except with hand raising.
Then we talk about volunteering and mentoring. Title I schools mentioned.