and1truong/00 - SRE conferences index.md

Last active January 16, 2025 05:38

SRE conferences

00 - SRE conferences index.md

SRE Conferences

2015.md

SREcon15

Notes from Production Engineering
Case Study: Adopting SRE Principles at StackOverflow
Monitoring without Infrastructure at Airbnb
Scaling Networks through Software
Incident Analysis
From Zero to Hero: Recommended Practices for training your ever-evolving SRE Teams
Architecting and launching the Halo 4 Services
Being afraid - How Paranoia at Dropbox protects your data
Panel: AMA with the SREcon chairs and speakers
Netflix RaaS: Reliability as a service
Making the Sum of AWS Networking greater than its parts -- Achieving High Availability
Making every SRE Hire count
Building Billion User load balancer ✨
Panel: Educate SRE
Collin and the Slingbot
Smart monitor system for automatic anomaly detection at Baidu ✨
MySQL automation at Facebook Scale
Learning from Mistakes and outages at Facebook
Lightning Talks
Mux: How I stopped worrying and learned to love the multiplexing ✨
Instagration: A case study in Cloud Migration at scale
Error Budgets and risks
Panel: The weeping angels of Site Reliability
Ensuring Success During Disaster
Panel: Fifty shades of Grey: Different Models for Reliability Work

2016.md

SREcon16

The Realities of the Job of Delivering Reliability
Beyond repair: Proactive maintenance work at scale ✨
nrrd 911 ic me: The Incident Commander Role
Continuous Deployment to Millions of users 40 times a day
What's NetDevOps? How do I start?
Netflix: 190 countries and 5 core SREs ✨
Debugging distributed systems ✨
Doorman: Global Distributed client side rate limiting ✨
How to improve a service by roasting it
College student to SRE: Onboarding your entry level talent
Service Levels and Error Budgets
Stepping up to scale
From Ops to SRE on a Brazilian Startup
Shopping Event Reliability
Using salt to make infrastructure Consumable (Tasty, Even)
Operations at (small) Scale
Operational Buddhism: Building reliable services from Unreliable components
Finding the order of chaos
Moving large workload from a public to an OpenStack Private Cloud: Is it really worth it?
SREs + Software Engineers: Making it work
Monitoring the unmeasurable ✨
Go for SREs using Python
A young lady's illustrated primer to technical decision-making
Putting together great SRE teams
Server Provisioning in an IPv6 only world ✨
Privacy Reliability Engineering: Looking at privacy through the Lens of SRE
Building reliable social infrastructure for Google
The Evolution of Global traffic routing and failover
Lightning talks
Terraform at Adobe ✨
Transforming Tier 1 Caterpillars to Butterflies
The Art of Performance Monitoring
Managing Grumpy: Embracing Diversity to Build Stronger Teams
It's People All the Way Down
Running Consul at Scale - Journey from RFC to Production
Panel: Who/What is SRE?
Avoiding Cascading Failures at eBay? ✨
SRE at a Start-Up: Lessons from LinkedIn ✨
Less Alarming Alerts!
Shaping Reality to Shape Outcomes: Making SRE Work with Uber Growth
Panel: SRE Managers
Performance Checklists for SREs ✨

2017.md

SREcon17 Asia/Australia

LinkedIn SRE From Inception to Global Scale ✨
Next Generation of DevOps AIOps in Practice @Baidu ✨
How Could Small Teams Get Ready for SRE
How We Built TechLadies in Singapore
Focal Impact - The Service Pyramid ✨
Smart Monitoring System for Anomaly Detection on Business Trends in Alibaba ✨
Graphite@Scale or How to Store Millions of Metrics per Second ✨
Data Checking at Dropbox ✨
Managing Server Secrets at Scale with a Vaultless Password Manager
Open Falcon - A Distributed and High Performance Monitoring System ✨
Talking to an OpenStack Cluster in Plain English
Distributed Consensus Algorithms
A Distribution Framework over ANSIBLE
Draining the Flood - A Combat against Alert Fatigue
Good, Better, Best, Mobile User Experience
Reliable Launches at Scale
Didi: How to Provide a Reliable Ridesharing Service
Measuring the Success of Incident Management at Atlassian
Managing Changes Seamlessly on Yahoo's Hadoop Infrastructure Servers
Event Correlation - A Fresh Approach towards Reducing
Automated Troubleshooting of Live Site Issues
A Unit Test Would Have Caught This
Testing for DR Failover Testing
Accept Partial Failures, Minimize Service Loss
Azure SREBot - More than a Chatbot
Merou: A Decentralized, Audited Authorization Service
Canary in the Internet Mine
InnoDB to MyRocks Migration in Main MySQL Database at Facebook
Golang's Garbage
Capacity Planning and Flow Control
Managing Capacity @ LinkedIn
Distributed Scheduler Hell
SRE Your gRPC - Building Reliable Distributed Systems Illustrated with gRPC
Operationalizing DevOps Teaching
Scaling Reliability at Dropbox - Our Journey towards a Distributed Ownership

SREcon17 Americas

Reducing MTTR and False Escalations: Event Correlation at LinkedIn
The Service Score Card—Gamifying Operational Excellence
Postmortem Action Items: Plan the Work and Work the Plan ✨
Don't Call Me Remote! Building and Managing Distributed Teams
Observability in the Cambrian Stack Era ✨
Keep Calm and Carry On: Scaling Your Org with Microservices
From Engineering Operations to Site Reliability Engineering
DNSControl: A DSL for DNS as Code from StackOverflow.com
Every Day Is Monday in Operations ✨
Traps and Cookies
Spotify's Love-Hate Relationship with DNS ✨
Lyft's Envoy: Experiences Operating a Large Service Mesh ✨
Principles of Chaos Engineering ✨
BPerf-Bing.com Cloud Profiling on Production ✨
I'm an SRE Lead! Now What? How to Bootstrap and Organize Your SRE Team
Ambry - LinkedIn's Distributed Immutable Object Store ✨
A Million Containers Isn't Cool
It's the End of the World as We Know It (I Feel Fine): Engineering for Crisis
Killing Our Darlings: How to Deprecate Systems
Tune Your Way to Savings!
Feedback Loops: How SREs Benefit and What is Needed to Realize Their Potential
Anomaly Detection in Infrequently Occurred Patterns
SRE and Presidential Campaigns
A Practical Guide to Monitoring and Alerting with Time Series at Scale
Panel: Training New SREs
Deployment Automation: Releasing Quickly and Reliably
Lightning Talks 1
Lightning Talks 2

SREcon17 Europe/Middle East/Africa

Care and Feeding of SRE
Diversity and Inclusion in SRE: A Postmortem
Globalizing SRE in a Walkup Culture
Make Haste Slowly: Balancing SRE Diligence in Urgency...
Want to Solve Over-Monitoring...
SRE Your gRPC... ✨
Profiling Node Applications
The Dangers of Being Overly-Paranoid
Show Me the RIGHT Numbers! Are Our Users Happy? ✨
Standing On the Shoulders of Giants...
InStream: Large Scale Distribution...
Use Load Testing to Build a Proper Mental Model of Your Service
Traffic Steering using RUM DNS @ LinkedIn ✨
Capturing and Analyzing Millions...
OK Log: Distributed and Coördination-Free Logging ✨
How We Try to Make a Lion Bulletproof...
From Firefighting to Proactive Work: ...
Incident Command at the Edge ✨
Resiliency Testing with Toxiproxy
Building a Culture of Reliability
Tech Leadership in SRE
Case Study: Lessons Learned from Our First Worldwide Outage
When Trouble Comes to Town
The Why, What, and How of Starting an SRE Engagement
Startup Systems Engineers Instruction Manual
Cognitive Bias and On-Call
Reducing MTTR and False Escalations: Event Correlation
The Never-Ending Story of Site Reliability ✨
Hiring SREs May Be Literally Impossible
Gamifying Reliability Excellence—The Service Score Card
Incident Management
Lightning Talks
Why Work with Tech Writers? ✨
Postmortem Action Items: Plan the Work and Work the Plan
Building an On-Premise Kubernetes
Distributed Systems, Like It or Not ✨
Avoiding and Breaking Out of Capacity Prison ✨
Service with an Angry Smile: Passive-Aggressive Behavior in SRE ✨
The Cult(ure) of Strength
Run Less Software; Use Less Bits ✨
Monitoring Cloudflare's Planet-Scale Edge Network
Monitoring Design Principles ✨
And the CFO Wept: AWS Cost Control
Have You Tried Turning It off and Turning It on Again?
100 Teams, 100 Ways to Fail ✨
Persistent SRE Antipatterns: Pitfalls On the...

2018.md

SREcon18 Americas

If You Don’t Know Where You’re Going, It Doesn’t Matter How Fast You Get There
Security and SRE: Natural Force Multipliers
What It Really Means to Be an Effective Engineer
SparkPost: The Day the DNS Died
Stable and Accurate Health-Checking of Horizontally-Scaled Services
Beyond Burnout: Mental Health and Neurodiversity in Engineering
Bootstrapping an SRE Team:
Don’t Ever Change! Are Immutable Deployments Really Simpler, Faster, and Safer?
Lessons Learned from Our Main Database Migrations at Facebook
Leveraging Multiple Regions to Improve Site Reliability:
Building Successful SRE in Large Enterprises—One Year Later
Working with Third Parties Shouldn't Suck
When to NOT Set SLOs: Lots of Strangers Are Running My Software!
Lessons Learned from Five Years of Multi-Cloud at PagerDuty
Help Protect Your Data Centers with Safety Constraints
Real World SLOs and SLIs: A Deep Dive
How SREs Found More than $100 Million Using Failed Customer Interactions
Learning at Scale Is Hard! Outage Pattern Analysis and Dirty Data
How Not to Go Boom: Lessons for SREs from Oil Refineries
Containerization War Stories
Resolving Outages Faster with Better Debugging Strategies
Monitoring DNS with Open-Source Solutions
Antics, Drift, and Chaos
Security as a Service
Breaking in a New Job as an SRE
"Capacity Prediction" instead of "Capacity Planning":
Distributed Tracing, Lessons Learned
Junior Engineers Are Features, Not Bugs
Approaching the Unacceptable Workload Boundary
Building Shopify's PaaS on Kubernetes
Know Thy Enemy: How to Prioritize and Communicate Risks
Automatic Metric Screening for Service Diagnosis
Whispers in Chaos: Searching for Weak Signals in Incidents
Architecting a Technical Post Mortem
Your System Has Recovered from an Incident, but Have Your Developers?
The History of Fire Escapes
Leaping from Mainframes to AWS: Technology Time Travel in the Government
Operational Excellence in April Fools’ Pranks: Being Funny Is Serious Work!

SREcon18 Asia/Australia

The Evolution of Site Reliability Engineering
Safe Client Behaviour
Service Monitoring Manual—2018 Edition
Introduction to Alibaba Monitoring System
Building SRE: Culture from the Outside In
Quantifying Empathy with Service Level Objectives
Doing Things the Hard Way
Achieving Observability into Your Application with OpenCensus
Know Thy Enemy: How to Prioritize and Communicate Risks
Data Visualization for SREs—an Essential Skill for Quick Debugging
You Can't Stop Fires with an Ambulance
Comprehensive Container-Based Service Monitoring with Kubernetes and Istio
How to Make Releases Safer in Baidu
Cultural Nuance and Effective Collaboration for Multicultural Teams
Automatic Datacenter and Service Deployments...
From Monitoring to Automated Testing of Your Infrastructure Code
Shopify's Move from the Data Centre to the Cloud
Ensuring Reliability of High-Performance Applications
Smarter Disasters: End-to-End Automation for Incidents
Debugging at Scale—Going from Single Box to Production
Productionizing Machine-Learning Services: Lessons from Google SRE
Pro Tip: Save Money on Outages by Having a Bot Do the Heavy Lifting
Evolution of SRE and Rising Need of SRE Catalyzers
How to Serve and Protect (with Client Isolation)
A Tale of One Billion Time Series
Isolation without Containers
Automatic Traffic Scheduling for Internet Connectivity Failures
Lessons Learned from Our Main Database Migrations at Facebook
PV Monitoring Based on Linear Regression
Do Docs Better: Practical Tips on Delivering...
Characterizing and Understanding Phases of SRE Practices
Scaling Yourself for Managing Distributed Teams...
Interviewing for Systems Design Skills
Scaling a Distributed Stateful System: A LinkedIn Case Study
Mentoring: A Newcomer's Perspective
You Get What You Measure—Why Metrics Are Important
Blame. Language. Sharing: Three Tips for...
A Theory and Practice of Alerting with Service Level Objectives
Production Engineering: Connect the Dots
Mental Models for SREs

SREcon18 Europe

Circonus: Design (Failures) Case Study
SRE Theory vs. Practice: A Song of Ice and TireFire
Data Protection Update and Tales from the Introduction of the GDPR
What Makes a Good SRE: Findings from the SRE Survey
Sustainability Starts Early: Creating a Great Ops Internship
The Silver Lining Consortium: Post-Mortems for the Rest of Us
Migrations under Production Load: How to Switch...
The 7 Deadly Sins of Documentation
Y2K and Other Disappointing Disasters: Risk Reduction and Harm Mitigation
Availability, Latency, and Cost: Withstanding Regional Outages
SRE for Mobile Applications
Your System Has Recovered from an Incident, but Have Your Developers?
Against On-Call: A Polemic
Impact of Network Automation
Migrating Your Old Server Products to Be Stateless Cloud Services
Lightning Talks
Dealing with Dark Debt: Lessons Learnt at Goldman Sachs
Halt and Don’t Catch Fire
Applying the Principles of Chaos to Serverless
Know Your Kubernetes Deploys
Not Invented Here Syndrome and Dark Debt: The PagerDuty Story
Building a Debuggable Go Server
Building a Fellowship Program to Mentor and Grow Your SRE Team
SoundCloud's Story of Seeking Sustainable SRE
How We Un-Scattered Our DNS Setup and Unlocked New Automation Options
Kernel Upgrades at Facebook
Managing Misfortune for Best Results
Clearing the Way for SRE in the Enterprise
The Math behind Project Scheduling, Bug Tracking, and Triage
Ethics in Computing
Canarying Well: Lessons Learned from Canarying Large Populations
Real World SLOs and SLIs: A Deep Dive
I’m SRE and You Can Too!—A Fine Manual...
Lessons Learned—Data Driven Hiring 3 Years Later
SRE Team Lifecycles
Capacity Planning in Four Parts: Telling the Future without a Crystal Ball
The Nth Region Project: An Open Retrospective
This IS NOT Fine: Putting Out (Code) Fires
What Medicine Can Teach Us about Being On-Call
Tradeoffs in Resiliency: Managing the Burden of Data Recoverability
Scalable Coding—Find the Error
Delete This: Decommissioning Servers at Scale
Observability for Emerging Infra: What Got You Here Won't Get You There
Deploying SRE Training Best Practices to Production...
Keep Building Fresh: Shopify's Journey to Kubernetes
The Myth of Cloud Agnosticism
SRE for Good: Engineering Intersections between Operations and Social Activism
Can I Tell You a Secret? I See Dead Systems
Junior Engineers Are Features, Not Bugs

2020.md

SREcon Conversations 2020

SREcon Conversations Europe/Middle East/Africa with Amy Tobey, Equinix Metal
SREcon Conversations Europe/Middle East/Africa with Jennifer Petoff and JC van Winkel, Google Inc.
SREcon Conversations Europe/Middle East/Africa with Avery Pennarun, Tailscale
SREcon Conversations Europe/Middle East/Africa with King'ori Maina, Zappi
SREcon Conversations Europe/Middle East/Africa with Alex Hidalgo
SREcon Conversations Europe/Middle East/Africa with Štěpán Davidovič, Google Inc.
SREcon Conversations Asia/Pacific with Koon Seng Lim, DBS
SREcon Conversations Asia/Pacific with Katherine Lim, Innablr
SREcon Conversations Asia/Pacific with Karthikeyan Selvaraj and Rajesh Ramachandran, PayPal
SREcon Conversations with David Argent, Amazon (August 2020)
SREcon Conversations with Avleen Vig, Facebook (July 2020)
SREcon Conversations with Ingrid Epure, Netlify (May 2020)

SREcon20 Americas

The Secret Lives of SREs - Controlling the Costs of Coordination across Remote
Identifying Hidden Dependencies
Are We Getting Better Yet? Progress Toward Safer Operations
Continuously Improving Culture through Design Decisions
Avoiding Goodhart's Law - Use SLO's as Tools Not Cudgels
Off the Beaten Path: Moving Observability Focus from Your Service
Observing from Incidents
Building Service Ownership Using Documentation, Telemetry, and a Chance to Make
Study on Human Factors and Team Culture to Improve Pager Fatigue
Squish Level Objectives: How SRE can Help Align Technical Work to User Benefit
Building Actionable Code Ownership
SREs at Telecom and Media Industry: Bridging between Legacy and Cloud Native App
Jupyter as Incident Response Tool
Sustainable Software Engineering & SREs
When /bin/sh Attacks: Revisiting "Automate All the Things"
Testing Encyclopedias in Production
Why SREs can't afford to NOT do Chaos Engineering
Implementing Distributed Consensus
Incident Response in Unfamiliar Sociotechnical Systems
Making Infrastructure More Friendly for Beginners
Low Context DevOps: Improving SRE Team Culture through Defaults, Documentation
The Smallest Possible SRE Team
Cloudy with a Chance of Chaos
Confessions of a Systems Engineer: Learning from My 20+ Years of Failure
Pragmatic Security for SRE
"Disorganizing" Your SRE Organization
Failure is Not an Option! SRE Lessons 50 Years after the Apollo 13 Flight
Challenges of Starting an SRE Team from Scratch in an Enterprise
The Good, the Bad and the Ugly: The 3 Learnings of an SRE
9 Years of Failure: How Racing Crappy Cars Made Me a Better SRE
Production Population Control: My Cattle are Rabbits!
Latency and Availability Error Budgets Done Right at Scale
The Evolution of Traffic Routing in a Streaming World
Heap Optimization for Go Systems
Soft Failures, Hard Goals - Accelerating Payments at Scale During the Pandemic
Give Your PXE wings! Bootstrapping Explained
Hot Swap Your Datastore: A Practical Approach and Lessons Learned
Automatically Detect the Top Performance & Scalability Issues in Distributed
A Bartender's Guide to Network Monitoring
Achieving the Ultimate Performance with KVM
Weeks of Debugging Can Save You Hours of TLA+
Capacity Planning and Performance Enhancement with Page Reference Sampling
Achieving Mutual TLS: Secure Pod-to-Pod Communication Without the Hassle
Panel: Learning from Adaptations to Coronavirus
It's a Trap! How Abstractions Have Failed Us.