@and1truong
Created January 15, 2025 19:16

2022 - SRE conferences

  • The 'Success' in SRE Is Silent
  • Building and Running a Diversity-focused Pre-internship Program for SRE
  • A Postmortem of SRE Interviewing
  • Self-Destructing Feature Flags
  • Tales from the VOID: The Scary Truth about Incident Metrics
  • How We Survived (and Thrived) During The Pandemic and Helped Millions...
  • The Pandemic and The Classroom—Enabling Education for Millions
  • Applied Science Fiction: Operating a Research-Led Product
  • Taking the 737 to the Max
  • Securing Your Software Delivery Chain with Process Auditing
  • The Future of above-the-line Tooling
  • Tracing Bare Metal with OpenTelemetry
  • Are We There Yet? Metrics-Driven Prioritization for Your Reliability Roadmap
  • SRE stands for...Skydiving Resilience Engineer
  • Building a Path to the Future: Mentoring New SREs
  • eBPF: The Next Power Tool of SREs
  • How the Metrics Backend Works at Datadog
  • Automated Operating System and Environment Certification at LinkedIn...
  • Triaging Real-time Security Threats with eBPF-powered Observability
  • Exemplars in Practice: Finding the Needle in Your Observability Haystack
  • Dark Sky Camping: Reducing Alert Pollution with Modern Observability Practices
  • Ten-year Journey to 10,000 Production Machines
  • Beyond Distributed Tracing
  • History-based Latency Prober Tuning
  • Using Serverless Functions for Real-time Observability
  • Improving How We Observe Our Observability Data: Techniques for SREs
  • Principled Performance Analytics
  • Modeling Alert Quality
  • Emergent Organizational Failure: Five Disconnections
  • DO, RE, Me: Measuring the Effectiveness of Site Reliability Engineering
  • The Scientific Method for Resilience
  • A Fresh Look at Operational Debt
  • Knowledge and Power: A Sociotechnical Systems Discussion on...
  • SRE as She Is Spoke
  • Oncall: An Equal Opportunity Waste of Time
  • Financial Regulators Worldwide Are Getting the Legal Right...
  • Statistics for Engineers
  • Measuring Reliability: What Got Us Here Won't Get Us There
  • Crayon Drawing Is a Vital Engineering Skill
  • Building Dynamic Configuration into Terraform
  • Hunting for Risky Dependencies in the World of Microservices
  • How We Implemented High Throughput Logging at Spotify
  • Engineering for Sustainability
  • SLOs, SREs, and GHGs
  • The Biases Confronting SREs
  • Market Data: Applying SRE Techniques to Legacy Designs
  • Life after The Chocolate Factory
  • Is Our Team as Resilient as Our Systems?
  • What SRE Could Be: Systems Reliability Engineering
  • Diamonds with Flaws: Examining the Pressures, Realities, and...
  • How We Drained Every Backbone Router Simultaneously
  • Break Free of the Template: Incident Writeups They Want to Read
  • Making the Impossible Impossible: Improving Reliability by...
  • Deep Dive: Azure Resource Manager Outage
  • Commas Save Lives, or at Least LinkedIn
  • Passing the Torch - Building a New Grad Program to Mentor...
  • Going from 30 to 30 Million SLOs
  • Disaster Recovery Testing at Booking.com
  • Slack's DNSSEC Rollout: Third Time's the Outage
  • Meatbag Systems: How Our Reliability Culture & Practice...
  • Principled Identification of "Root Causes" Using Techniques...
  • A Case Study in Chaos Testing: Uncovering Kernel Scaling Issues
  • A Better Way to Manage Command Line Tools: What We Learned...
  • Honey, I Broke the Things: Debugging Gray Failures...
  • The Repeat Incident Fallacy: What Jurassic Park Can Teach Us...
  • SRE in Enterprise
  • Unified Theory of SRE
  • Dissecting the Humble LSM Tree and SSTable
  • Caching Entire Systems without Invalidation
  • An SRE Guide to Linux Kernel Upgrades
  • The Math of Scalability
  • Schema-First Application Telemetry
  • SRE Is Weird, Down the Stack
  • SRE and ML: Why It Matters
  • Emotional Disaster Recovery: Debugging the Self with...
  • Over Nine Billion Dollars of SRE Lessons - the James Webb...
  • Rock Fishing and Incident Analysis: Increasing Insight
  • How Can SRE Help Security Governance?...
  • Navigating in the Dark
  • Computing Performance 2022: What's on the Horizon
  • Move Fast and Learn Things: Principles of Cognition, Teaming...
  • How to Not Destroy Your Production Kubernetes Clusters
  • The Math behind the Incident Aftermath: A Practical Guide to Measuring...
  • OpenTelemetry and Observability: What, Why, and Why Now?
  • Principles of Safety and Reliability Learned from US Navy Landing Signal...
  • Infra Eng to Staff SRE: A Tale of Developing Yourself in an Ever Evolving...
  • Lifecycle of a Sample in the Prometheus TSDB
  • Metrics Stream Processing Using Riemann
  • Lifecycle of Reusable Automations: Track, Maintain, Deprecate
  • Dashboards and Runbooks: Scrapbooking for Engineers
  • Observability Is Not Analytics!
  • Lessons Learned Building a Global Synthetic Monitoring System
  • Sustaining Everything, Everywhere, All at Once!
  • Introducing the Reliability Map – r9y.dev
  • Chaos Engineering at Scale
  • The Multi Layered Cake of Resilience
  • Capacity vs Efficiency: Building a Globally Scalable Cloud Database
  • Improving Observability, Reliability, and Security of Relational Database...
  • Real-Time Adaptive Controls for Resilient Distributed Systems
  • Improving Machine Learning Development Reliability
  • How Can We Make Data Integrity Easy?
  • Cognitive and Self-Adaptive System for Effective Distributed-Tracing...
  • Site Reliability Evangelism: Practice Start-up within an Established...
  • Deploying Humans at the Edge of SRE
  • Challenges, Best Practices, and Solutions for Monitoring and Alerting...
  • A Better Way to Manage Stateful Systems: Design for Observability and Robust
  • Reliability Reviews in the Wild: Using Data to Drive Production Health
  • Leveraging Continuous Production Profiling for Providing Insights into...
  • Applying SRE Principles to CI/CD
  • Gremlins Exposed: Shining a Light on Mischievous Systems
  • Burnout at Scale: What to Try When You Just Can't
  • Backend API Design for SREs
  • Online Database Reliability, Performance, and Consistency Engineering
  • Migrating Datastores
  • Our Experience Tracking and Driving SLO Adoption at Goldman Sachs
  • Operationalizing ML Training Infra at Meta Scale
  • Advanced Linux Kernel Networking Monitoring
  • Using the Internet as Your Load-Balancer
  • A Post Incident Review Review