@and1truong
Last active January 15, 2025 19:15

2023 - SRE conferences

  • How the Sony PlayStation Network Does SRE
  • Unleashing Generative AI: Improving Developer Productivity in SRE
  • Patterns, Not Categories: Learning Across Incidents
  • Observability in the MLOps Lifecycle with Prometheus
  • Towards Zero Carbon: Implementing Sustainable Battery Lifecycle...
  • Leveraging Analytics for Technical Efficiency and Enhanced User Experience
  • LiveMLP: ML Platform for Assisting Contact Center Agents in Real-Time
  • How Safe Is Your Domain?
  • From Push to Pull: Managing Mutable Infrastructure at a Global Scale
  • Autonomous Automation: How Cloudflare Handles Server Diagnostics...
  • Real World Debugging with eBPF
  • An SRE Guide to Linux Kernel Upgrades
  • Taming Spiky Log Volumes: Maintaining Real-Time Log Accessibility with Kaldb
  • Better Observability with No Code Changes
  • The Secret Weapon for a Successful SRE Career - And It's Not What You Think!
  • Untangling the Tangled Cloud
  • Functional Resonance Analysis: Diagramming Your System
  • Start Small, Scale Big: Building and Scaling Platforms and SRE Culture...
  • Cultivating Accountability and Resilience
  • Finding the Needle in the Haystack: Predicting Storage Device Failures in...
  • Lessons Learned Running GKE Clusters on Spot Instances
  • Are We All on the Same Page? Let's Fix That
  • Humane On-call
  • Giving Away Your Secrets: Opening Metrics Up to Users
  • From "Keeping the Lights On" to "Designing the LEDs": A Detailed Review...
  • Fighting Financial Crimes as an SRE
  • Beyond Observability - Aligning Technology Performance to Business Outcomes
  • What Is Linux Kernel Keystore and Why You Should Use It in Your Next...
  • Multicloud and the Chamber of Secrets
  • Hold My Beer - Load Testing. In Production. On Autopilot.
  • Performance Testing in Keptn Using K6
  • Mastering Chaos: Achieving Fault Tolerance with Observability-Driven...
  • Distributed Tracing: Adaptive and Telemetry-Based Approach for Effective...
  • Challenges of Managing Real-Time Financial Market Data Storage
  • Transformation Journey of E2E Customer Flow Testing...
  • The Only Constant Is Change: Lessons from a 25 Year SRE Career
  • eBPF Superpowers for SRE
  • Building a 5-Exaflop Supercomputer for Meta-AI Research and...
  • From Sysadmins to (almost) Flying Unicorns
  • Implementing SRE in a Telco with Reliability Enhancing...
  • Symptom-based Alerting for Machine Learning - What I Learned...
  • Reliable Data for Large ML Models: Principles and Practices
  • New Grads Becoming New SREs: Catalyzing a “Circle of Life”...
  • Scale Your Future: An Immersive Engineering Programme
  • Over, Under, Around, and Through: A Detailed Comparison...
  • Deploying and Debugging HTTP/3
  • The Engineer/Manager Pendulum Goes Mainstream
  • Do Not Thrash the Node.js Event Loop
  • Scaling Chef Emotionally
  • SRE for [cyber]security
  • Cloud, Kubernetes, and Service Networking - Taming the Turtles
  • Designing Matrix: A Global Decentralised End-to-End Encrypted...
  • Tracing the Journey into Distributed Tracing
  • The World Blew Up but We’re All Okay: How We Managed a...
  • When One Line Took Thousands of Websites Offline
  • HTTP Headers that Make Your Website Go Faster
  • Cache Me If You Can: How Grafana Labs Scaled Up Their...
  • Embracing the Multi-Party Dilemma: Incident Response Across...
  • The Incident Is The Way: Using Your Incidents to Win...
  • That Time I Accidentally DDoS'd My Company
  • Artificial Intelligence: How Much Will It Cost You?
  • Just the Cryptography You Need to Know for TLS
  • You Depend on DNS, This Is How It Works and You Won't...
  • 9 Things You Should Do When Starting to Use SLOs
  • Silent Spring: What if the GDPR Was Real?
  • From Exceptional Maintenance to Automated Routine Operation:...
  • Should I Use OTel (collectors), or Is Prometheus Good Enough?
  • Implementing Open-source Observability within Maersk
  • Journey from Fluent Bit, Fluentd and Prometheus to Open...
  • Level 7 Egress Control in Kubernetes: Current Solutions,...
  • Leveraging Unikernels and Kubernetes to (Transparently)...
  • Monoceros: Faster and Predictable Services through In-pod...
  • Continuous Profiling in the Cloud-Native era
  • How to Use Prometheus's Native Histograms
  • Overcoming Challenges in Serving Large Language Model
  • The Value of Reliability
  • A Dual Approach to Accountability Engineering
  • Succeeding as the Lone SRE in a Small Team
  • Deconstructing an Abstraction to Reconstruct an Outage
  • When Clouds Stop Raining Discounts: Surviving the Drought
  • Should an SRE Care About FinOps? Using Observability to...
  • Looking at SRE Needs and Trends over Two Decades with a...
  • How to Make Your Automation a Better Team Player
  • Dark Matter and Deep State: The Unseen Majority of Everything
  • The Endgame of SRE
  • SRE's Critical Role in the COVID-19 Pandemic Response in Government
  • We're Still Down: A Metastable Failure Tale
  • Watering the Roots of Resilience: Learning from Failure with Decision Trees
  • Scaling Telemetry Systems with Streaming
  • Hacking the Pachyderm: Scaling Servers and People
  • Logs Told Us It Was DNS, It Looked like DNS, It Had to Be DNS, It Wasn't DNS
  • Epic Incidents of History: The 1979 NORAD Nuclear Near Miss
  • Scaling Terraform at ThousandEyes
  • OpenTelemetry Metrics 101
  • Incident Commanders to Incident Analysts: How We Got Here
  • Handover Communications in Software Operations: Findings from the Field
  • On the Wings of SREs; J.P. Morgan's Journey into the Cloud
  • SRE in Transition: From Startup to Established Business
  • Lessons Learned from 7 Years of Running Developer Platforms
  • Cognitive Apprenticeship in Practice with Alert Triage Hour of Power
  • Building a Diverse SRE Talent Pipeline
  • The Best SREs Seem to Be the Ones without an SRE Title—And What We Can Do
  • Confessions of an SRE Manager
  • Exploring Disconnects between Reliability Practitioners and Management
  • Beacon: Intelligent Latency-Aware and Load Shedding Service Routing
  • Resiliency Practices in Managing CDN (Content Delivery Network)
  • Why This Stuff Is Hard
  • Turning an Incident Report into a Design Issue with TLA+
  • The Making of an Ultra Low Latency Trading System with Go and Java
  • Seeing the Invisible: Two Years at Wikipedia with W3C's Network Error Logging
  • Avoiding Cachepocalypse in the Land of the Monolith
  • Incident Archaeology: Extracting Value from Paperwork and Narratives
  • An Organizational Response to Incidents: Designing for Smooth Coordination
  • Building an APM with OpenTelemetry and OpenSource
  • Measuring Real-Life Latency of the Internet: A Netflix Story
  • Founder/CTO Perspectives: The Future of Distributed Tracing
  • Lightning Talks
  • Human Observability of Incident Response
  • Far from the Shallows: The Value of Deeper Incident Analysis
  • How SRE Makes Electric Vehicles
  • Warding against the Dark Arts: Crafting a Defense Strategy against Botnet DDoS
  • The Revolution Will Not Be Terraformed: SRE and the Anarchist Style
  • Implementing SRE in a Regulated Environment
  • Financial Resiliency Engineering: Taming Cloud Costs
  • Sto: A Better Way to Store and Query Profiler Data
  • Chaos-Driven Development: TDD for Distributed Systems
  • Adaptive Concurrency Control for Mixed Analytical Workloads
  • If I Can Do It on an Ambulance, You Can Do It in an Office: Scalable Incident
  • How To Take Prometheus Planet Scale: Massively Large Scale Metrics Installations
  • Your Infrastructure Needs to D.I.E.
  • Not All Minutes Are Equal: The Secret behind SLO Adoption Failure
  • Hell Is Other Platforms