@and1truong
Last active January 16, 2025 05:38
SRE conferences
  • Notes from Production Engineering
  • Case Study: Adopting SRE Principles at StackOverflow
  • Monitoring without Infrastructure at Airbnb
  • Scaling Networks through Software
  • Incident Analysis
  • From Zero to Hero: Recommended Practices for training your ever-evolving SRE Teams
  • Architecting and launching the Halo 4 Services
  • Being afraid - How Paranoia at Dropbox protects your data
  • Panel: AMA with the SRECon chairs and speakers
  • Netflix RaaS: Reliability as a service
  • Making the Sum of AWS Networking greater than its parts -- Achieving High Availability
  • Making every SRE Hire count
  • Building a Billion User load balancer ✨
  • Panel: Educate SRE
  • Collin and the Slingbot
  • Smart monitor system for automatic anomaly detection at Baidu ✨
  • MySQL automation at Facebook Scale
  • Learning from Mistakes and outages at Facebook
  • Lightning Talks
  • Mux: How I stopped worrying and learned to love the multiplexing ✨
  • Instagration: A case study in Cloud Migration at scale
  • Error Budgets and risks
  • Panel: The weeping angels of Site Reliability
  • Ensuring Success During Disaster
  • Panel: Fifty shades of Grey: Different Models for Reliability Work
  • The Realities of the Job of Delivering Reliability
  • Beyond repair: Proactive maintenance work at scale ✨
  • nrrd 911 ic me: The Incident Commander Role
  • Continuous Deployment to Millions of users 40 times a day
  • What's NetDevOps? How do I start?
  • Netflix: 190 countries and 5 core SREs ✨
  • Debugging distributed systems ✨
  • Doorman: Global Distributed client side rate limiting ✨
  • How to improve a service by roasting it
  • College student to SRE: Onboarding your entry level talent
  • Service Levels and Error Budgets
  • Stepping up to scale
  • From Ops to SRE on a Brazilian Startup
  • Shopping Event Reliability
  • Using salt to make infrastructure Consumable (Tasty, Even)
  • Operations at (small) Scale
  • Operational Buddhism: Building reliable services from Unreliable components
  • Finding the order in chaos
  • Moving large workloads from a public cloud to an OpenStack Private Cloud: Is it really worth it?
  • SREs + Software Engineers: Making it work
  • Monitoring the unmeasurable ✨
  • Go for SREs using Python
  • A young lady's illustrated primer to technical decision-making
  • Putting together great SRE teams
  • Server Provisioning in an IPv6 only world ✨
  • Privacy Reliability Engineering: Looking at privacy through the Lens of SRE
  • Building reliable social infrastructure for Google
  • The Evolution of Global traffic routing and failover
  • Lightning talks
  • Terraform at Adobe ✨
  • Transforming Tier 1 Caterpillars to Butterflies
  • The Art of Performance Monitoring
  • Managing Grumpy: Embracing Diversity to Build Stronger Teams
  • It's People All the Way Down
  • Running Consul at Scale - Journey from RFC to Production
  • Panel: Who/What is SRE?
  • Avoiding Cascading Failures at eBay? ✨
  • SRE at a Start-Up: Lessons from LinkedIn ✨
  • Less Alarming Alerts!
  • Shaping Reality to Shape Outcomes: Making SRE Work with Uber Growth
  • Panel: SRE Managers
  • Performance Checklists for SREs ✨
  • LinkedIn SRE From Inception to Global Scale ✨
  • Next Generation of DevOps: AIOps in Practice @Baidu ✨
  • How Could Small Teams Get Ready for SRE
  • How We Built TechLadies in Singapore
  • Focal Impact - The Service Pyramid ✨
  • Smart Monitoring System for Anomaly Detection on Business Trends in Alibaba ✨
  • Graphite@Scale or How to Store Millions of Metrics per Second ✨
  • Data Checking at Dropbox ✨
  • Managing Server Secrets at Scale with a Vaultless Password Manager
  • Open Falcon - A Distributed and High Performance Monitoring System ✨
  • Talking to an OpenStack Cluster in Plain English
  • Distributed Consensus Algorithms
  • A Distribution Framework over ANSIBLE
  • Draining the Flood - A Combat against Alert Fatigue
  • Good, Better, Best, Mobile User Experience
  • Reliable Launches at Scale
  • Didi: How to Provide a Reliable Ridesharing Service
  • Measuring the Success of Incident Management at Atlassian
  • Managing Changes Seamlessly on Yahoo's Hadoop Infrastructure Servers
  • Event Correlation - A Fresh Approach towards Reducing
  • Automated Troubleshooting of Live Site Issues
  • A Unit Test Would Have Caught This
  • Testing for DR Failover Testing
  • Accept Partial Failures, Minimize Service Loss
  • Azure SREBot - More than a Chatbot
  • Merou: A Decentralized, Audited Authorization Service
  • Canary in the Internet Mine
  • InnoDB to MyRocks Migration in Main MySQL Database at Facebook
  • Golang's Garbage
  • Capacity Planning and Flow Control
  • Managing Capacity @ LinkedIn
  • Distributed Scheduler Hell
  • SRE Your gRPC - Building Reliable Distributed Systems Illustrated with gRPC
  • Operationalizing DevOps Teaching
  • Scaling Reliability at Dropbox - Our Journey towards a Distributed Ownership
  • Reducing MTTR and False Escalations: Event Correlation at LinkedIn
  • The Service Score Card—Gamifying Operational Excellence
  • Postmortem Action Items: Plan the Work and Work the Plan ✨
  • Don't Call Me Remote! Building and Managing Distributed Teams
  • Observability in the Cambrian Stack Era ✨
  • Keep Calm and Carry On: Scaling Your Org with Microservices
  • From Engineering Operations to Site Reliability Engineering
  • DNSControl: A DSL for DNS as Code from StackOverflow.com
  • Every Day Is Monday in Operations ✨
  • Traps and Cookies
  • Spotify's Love-Hate Relationship with DNS ✨
  • Lyft's Envoy: Experiences Operating a Large Service Mesh ✨
  • Principles of Chaos Engineering ✨
  • BPerf-Bing.com Cloud Profiling on Production ✨
  • I'm an SRE Lead! Now What? How to Bootstrap and Organize Your SRE Team
  • Ambry - LinkedIn's Distributed Immutable Object Store ✨
  • A Million Containers Isn't Cool
  • It's the End of the World as We Know It (I Feel Fine): Engineering for Crisis
  • Killing Our Darlings: How to Deprecate Systems
  • Tune Your Way to Savings!
  • Feedback Loops: How SREs Benefit and What is Needed to Realize Their Potential
  • Anomaly Detection in Infrequently Occurred Patterns
  • SRE and Presidential Campaigns
  • A Practical Guide to Monitoring and Alerting with Time Series at Scale
  • Panel: Training New SREs
  • Deployment Automation: Releasing Quickly and Reliably
  • Lightning Talks 1
  • Lightning Talks 2
  • Care and Feeding of SRE
  • Diversity and Inclusion in SRE: A Postmortem
  • Globalizing SRE in a Walkup Culture
  • Make Haste Slowly: Balancing SRE Diligence in Urgency...
  • Want to Solve Over-Monitoring...
  • SRE Your gRPC... ✨
  • Profiling Node Applications
  • The Dangers of Being Overly-Paranoid
  • Show Me the RIGHT Numbers! Are Our Users Happy? ✨
  • Standing On the Shoulders of Giants...
  • InStream: Large Scale Distribution...
  • Use Load Testing to Build a Proper Mental Model of Your Service
  • Traffic Steering using RUM DNS @ LinkedIn ✨
  • Capturing and Analyzing Millions...
  • OK Log: Distributed and Coördination-Free Logging ✨
  • How We Try to Make a Lion Bulletproof...
  • From Firefighting to Proactive Work: ...
  • Incident Command at the Edge ✨
  • Resiliency Testing with Toxiproxy
  • Building a Culture of Reliability
  • Tech Leadership in SRE
  • Case Study: Lessons Learned from Our First Worldwide Outage
  • When Trouble Comes to Town
  • The Why, What, and How of Starting an SRE Engagement
  • Startup Systems Engineers Instruction Manual
  • Cognitive Bias and On-Call
  • Reducing MTTR and False Escalations: Event Correlation
  • The Never-Ending Story of Site Reliability ✨
  • Hiring SREs May Be Literally Impossible
  • Gamifying Reliability Excellence—The Service Score Card
  • Incident Management
  • Lightning Talks
  • Why Work with Tech Writers? ✨
  • Postmortem Action Items: Plan the Work and Work the Plan
  • Building an On-Premise Kubernetes
  • Distributed Systems, Like It or Not ✨
  • Avoiding and Breaking Out of Capacity Prison ✨
  • Service with an Angry Smile: Passive-Aggressive Behavior in SRE ✨
  • The Cult(ure) of Strength
  • Run Less Software; Use Less Bits ✨
  • Monitoring Cloudflare's Planet-Scale Edge Network
  • Monitoring Design Principles ✨
  • And the CFO Wept: AWS Cost Control
  • Have You Tried Turning It off and Turning It on Again?
  • 100 Teams, 100 Ways to Fail ✨
  • Persistent SRE Antipatterns: Pitfalls On the...
  • If You Don’t Know Where You’re Going, It Doesn’t Matter How Fast You Get There
  • Security and SRE: Natural Force Multipliers
  • What It Really Means to Be an Effective Engineer
  • SparkPost: The Day the DNS Died
  • Stable and Accurate Health-Checking of Horizontally-Scaled Services
  • Beyond Burnout: Mental Health and Neurodiversity in Engineering
  • Bootstrapping an SRE Team:
  • Don’t Ever Change! Are Immutable Deployments Really Simpler, Faster, and Safer?
  • Lessons Learned from Our Main Database Migrations at Facebook
  • Leveraging Multiple Regions to Improve Site Reliability:
  • Building Successful SRE in Large Enterprises—One Year Later
  • Working with Third Parties Shouldn't Suck
  • When to NOT Set SLOs: Lots of Strangers Are Running My Software!
  • Lessons Learned from Five Years of Multi-Cloud at PagerDuty
  • Help Protect Your Data Centers with Safety Constraints
  • Real World SLOs and SLIs: A Deep Dive
  • How SREs Found More than $100 Million Using Failed Customer Interactions
  • Learning at Scale Is Hard! Outage Pattern Analysis and Dirty Data
  • How Not to Go Boom: Lessons for SREs from Oil Refineries
  • Containerization War Stories
  • Resolving Outages Faster with Better Debugging Strategies
  • Monitoring DNS with Open-Source Solutions
  • Antics, Drift, and Chaos
  • Security as a Service
  • Breaking in a New Job as an SRE
  • "Capacity Prediction" instead of "Capacity Planning":
  • Distributed Tracing, Lessons Learned
  • Junior Engineers Are Features, Not Bugs
  • Approaching the Unacceptable Workload Boundary
  • Building Shopify's PaaS on Kubernetes
  • Know Thy Enemy: How to Prioritize and Communicate Risks
  • Automatic Metric Screening for Service Diagnosis
  • Whispers in Chaos: Searching for Weak Signals in Incidents
  • Architecting a Technical Post Mortem
  • Your System Has Recovered from an Incident, but Have Your Developers?
  • The History of Fire Escapes
  • Leaping from Mainframes to AWS: Technology Time Travel in the Government
  • Operational Excellence in April Fools’ Pranks: Being Funny Is Serious Work!
  • The Evolution of Site Reliability Engineering
  • Safe Client Behaviour
  • Service Monitoring Manual—2018 Edition
  • Introduction to Alibaba Monitoring System
  • Building SRE: Culture from the Outside In
  • Quantifying Empathy with Service Level Objectives
  • Doing Things the Hard Way
  • Achieving Observability into Your Application with OpenCensus
  • Know Thy Enemy: How to Prioritize and Communicate Risks
  • Data Visualization for SREs—an Essential Skill for Quick Debugging
  • You Can't Stop Fires with an Ambulance
  • Comprehensive Container-Based Service Monitoring with Kubernetes and Istio
  • How to Make Releases Safer in Baidu
  • Cultural Nuance and Effective Collaboration for Multicultural Teams
  • Automatic Datacenter and Service Deployments...
  • From Monitoring to Automated Testing of Your Infrastructure Code
  • Shopify's Move from the Data Centre to the Cloud
  • Ensuring Reliability of High-Performance Applications
  • Smarter Disasters: End-to-End Automation for Incidents
  • Debugging at Scale—Going from Single Box to Production
  • Productionizing Machine-Learning Services: Lessons from Google SRE
  • Pro Tip: Save Money on Outages by Having a Bot Do the Heavy Lifting
  • Evolution of SRE and Rising Need of SRE Catalyzers
  • How to Serve and Protect (with Client Isolation)
  • A Tale of One Billion Time Series
  • Isolation without Containers
  • Automatic Traffic Scheduling for Internet Connectivity Failures
  • Lessons Learned from Our Main Database Migrations at Facebook
  • PV Monitoring Based on Linear Regression
  • Do Docs Better: Practical Tips on Delivering...
  • Characterizing and Understanding Phases of SRE Practices
  • Scaling Yourself for Managing Distributed Teams...
  • Interviewing for Systems Design Skills
  • Scaling a Distributed Stateful System: A LinkedIn Case Study
  • Mentoring: A Newcomer's Perspective
  • You Get What You Measure—Why Metrics Are Important
  • Blame. Language. Sharing: Three Tips for...
  • A Theory and Practice of Alerting with Service Level Objectives
  • Production Engineering: Connect the Dots
  • Mental Models for SREs
  • Circonus: Design (Failures) Case Study
  • SRE Theory vs. Practice: A Song of Ice and TireFire
  • Data Protection Update and Tales from the Introduction of the GDPR
  • What Makes a Good SRE: Findings from the SRE Survey
  • Sustainability Starts Early: Creating a Great Ops Internship
  • The Silver Lining Consortium: Post-Mortems for the Rest of Us
  • Migrations under Production Load: How to Switch...
  • The 7 Deadly Sins of Documentation
  • Y2K and Other Disappointing Disasters: Risk Reduction and Harm Mitigation
  • Availability, Latency, and Cost: Withstanding Regional Outages
  • SRE for Mobile Applications
  • Your System Has Recovered from an Incident, but Have Your Developers?
  • Against On-Call: A Polemic
  • Impact of Network Automation
  • Migrating Your Old Server Products to Be Stateless Cloud Services
  • Lightning Talks
  • Dealing with Dark Debt: Lessons Learnt at Goldman Sachs
  • Halt and Don’t Catch Fire
  • Applying the Principles of Chaos to Serverless
  • Know Your Kubernetes Deploys
  • Not Invented Here Syndrome and Dark Debt: The PagerDuty Story
  • Building a Debuggable Go Server
  • Building a Fellowship Program to Mentor and Grow Your SRE Team
  • SoundCloud's Story of Seeking Sustainable SRE
  • How We Un-Scattered Our DNS Setup and Unlocked New Automation Options
  • Kernel Upgrades at Facebook
  • Managing Misfortune for Best Results
  • Clearing the Way for SRE in the Enterprise
  • The Math behind Project Scheduling, Bug Tracking, and Triage
  • Ethics in Computing
  • Canarying Well: Lessons Learned from Canarying Large Populations
  • Real World SLOs and SLIs: A Deep Dive
  • I’m SRE and You Can Too!—A Fine Manual...
  • Lessons Learned—Data Driven Hiring 3 Years Later
  • SRE Team Lifecycles
  • Capacity Planning in Four Parts: Telling the Future without a Crystal Ball
  • The Nth Region Project: An Open Retrospective
  • This IS NOT Fine: Putting Out (Code) Fires
  • What Medicine Can Teach Us about Being On-Call
  • Tradeoffs in Resiliency: Managing the Burden of Data Recoverability
  • Scalable Coding—Find the Error
  • Delete This: Decommissioning Servers at Scale
  • Observability for Emerging Infra: What Got You Here Won't Get You There
  • Deploying SRE Training Best Practices to Production...
  • Keep Building Fresh: Shopify's Journey to Kubernetes
  • The Myth of Cloud Agnosticism
  • SRE for Good: Engineering Intersections between Operations and Social Activism
  • Can I Tell You a Secret? I See Dead Systems
  • Junior Engineers Are Features, Not Bugs
  • SREcon Conversations Europe/Middle East/Africa with Amy Tobey, Equinix Metal
  • SREcon Conversations Europe/Middle East/Africa with Jennifer Petoff and JC van Winkel, Google Inc.
  • SREcon Conversations Europe/Middle East/Africa with Avery Pennarun, Tailscale
  • SREcon Conversations Europe/Middle East/Africa with King'ori Maina, Zappi
  • SREcon Conversations Europe/Middle East/Africa with Alex Hidalgo
  • SREcon Conversations Europe/Middle East/Africa with Štěpán Davidovič, Google Inc.
  • SREcon Conversations Asia/Pacific with Koon Seng Lim, DBS
  • SREcon Conversations Asia/Pacific with Katherine Lim, Innablr
  • SREcon Conversations Asia/Pacific with Karthikeyan Selvaraj and Rajesh Ramachandran, PayPal
  • SREcon Conversations with David Argent, Amazon (August 2020)
  • SREcon Conversations with Avleen Vig, Facebook (July 2020)
  • SREcon Conversations with Ingrid Epure, Netlify (May 2020)
  • The Secret Lives of SREs - Controlling the Costs of Coordination across Remote
  • Identifying Hidden Dependencies
  • Are We Getting Better Yet? Progress Toward Safer Operations
  • Continuously Improving Culture through Design Decisions
  • Avoiding Goodhart's Law - Use SLO's as Tools Not Cudgels
  • Off the Beaten Path: Moving Observability Focus from Your Service
  • Observing from Incidents
  • Building Service Ownership Using Documentation, Telemetry, and a Chance to Make
  • Study on Human Factors and Team Culture to Improve Pager Fatigue
  • Squish Level Objectives: How SRE can Help Align Technical Work to User Benefit
  • Building Actionable Code Ownership
  • SREs at Telecom and Media Industry: Bridging between Legacy and Cloud Native App
  • Jupyter as Incident Response Tool
  • Sustainable Software Engineering & SREs
  • When /bin/sh Attacks: Revisiting "Automate All the Things"
  • Testing Encyclopedias in Production
  • Why SREs can't afford to NOT do Chaos Engineering
  • Implementing Distributed Consensus
  • Incident Response in Unfamiliar Sociotechnical Systems
  • Making Infrastructure More Friendly for Beginners
  • Low Context DevOps: Improving SRE Team Culture through Defaults, Documentation
  • The Smallest Possible SRE Team
  • Cloudy with a Chance of Chaos
  • Confessions of a Systems Engineer: Learning from My 20+ Years of Failure
  • Pragmatic Security for SRE
  • "Disorganizing" Your SRE Organization
  • Failure is Not an Option! SRE Lessons 50 Years after the Apollo 13 Flight
  • Challenges of Starting an SRE Team from Scratch in an Enterprise
  • The Good, the Bad and the Ugly: The 3 Learnings of an SRE
  • 9 Years of Failure: How Racing Crappy Cars Made Me a Better SRE
  • Production Population Control: My Cattle are Rabbits!
  • Latency and Availability Error Budgets Done Right at Scale
  • The Evolution of Traffic Routing in a Streaming World
  • Heap Optimization for Go Systems
  • Soft Failures, Hard Goals - Accelerating Payments at Scale During the Pandemic
  • Give Your PXE wings! Bootstrapping Explained
  • Hot Swap Your Datastore: A Practical Approach and Lessons Learned
  • Automatically Detect the Top Performance & Scalability Issues in Distributed
  • A Bartender's Guide to Network Monitoring
  • Achieving the Ultimate Performance with KVM
  • Weeks of Debugging Can Save You Hours of TLA+
  • Capacity Planning and Performance Enhancement with Page Reference Sampling
  • Achieving Mutual TLS: Secure Pod-to-Pod Communication Without the Hassle
  • Panel: Learning from Adaptations to Coronavirus
  • It's a Trap! How Abstractions Have Failed Us.