- How the Sony PlayStation Network Does SRE
- Unleashing Generative AI: Improving Developer Productivity in SRE
- Patterns, Not Categories: Learning Across Incidents
- Observability in the MLOps Lifecycle with Prometheus
- Towards Zero Carbon: Implementing Sustainable Battery Lifecycle...
- Leveraging Analytics for Technical Efficiency and Enhanced User Experience
- LiveMLP: ML Platform for Assisting Contact Center Agents in Real-Time
- How Safe Is Your Domain?
- From Push to Pull: Managing Mutable Infrastructure at a Global Scale
- Autonomous Automation: How Cloudflare Handles Server Diagnostics...
- Real World Debugging with eBPF
- An SRE Guide to Linux Kernel Upgrades
- Taming Spiky Log Volumes: Maintaining Real-Time Log Accessibility with Kaldb
- Better Observability with No Code Changes
- The Secret Weapon for a Successful SRE Career - And It's Not What You Think!
- Untangling the Tangled Cloud
- Functional Resonance Analysis: Diagramming Your System
- Start Small, Scale Big: Building and Scaling Platforms and SRE Culture...
- Cultivating Accountability and Resilience
- Finding the Needle in the Haystack: Predicting Storage Device Failures in...
- Lessons Learned Running GKE Clusters on Spot Instances
- Are We All on the Same Page? Let's Fix That
- Humane On-call
- Giving Away Your Secrets: Opening Metrics Up to Users
- From "Keeping the Lights On" to "Designing the LEDs": A Detailed Review...
- Fighting Financial Crimes as an SRE
- Beyond Observability - Aligning Technology Performance to Business Outcomes
- What Is Linux Kernel Keystore and Why You Should Use It in Your Next...
- Multicloud and the Chamber of Secrets
- Hold My Beer - Load Testing. In Production. On Autopilot.
- Performance Testing in Keptn Using K6
- Mastering Chaos: Achieving Fault Tolerance with Observability-Driven...
- Distributed Tracing: Adaptive and Telemetry-Based Approach for Effective...
- Challenges of Managing Real-Time Financial Market Data Storage
- Transformation Journey of E2E Customer Flow Testing...
- The Only Constant Is Change: Lessons from a 25 Year SRE Career
- eBPF Superpowers for SRE
- Building a 5-Exaflop Supercomputer for Meta-AI Research and...
- From Sysadmins to (almost) Flying Unicorns
- Implementing SRE in a Telco with Reliability Enhancing...
- Symptom-based Alerting for Machine Learning - What I Learned...
- Reliable Data for Large ML Models: Principles and Practices
- New Grads Becoming New SREs: Catalyzing a “Circle of Life”...
- Scale Your Future: An Immersive Engineering Programme
- Over, Under, Around, and Through: A Detailed Comparison...
- Deploying and Debugging HTTP/3
- The Engineer/Manager Pendulum Goes Mainstream
- Do Not Thrash the Node.js Event Loop
- Scaling Chef Emotionally
- SRE for [cyber]security
- Cloud, Kubernetes, and Service Networking - Taming the Turtles
- Designing Matrix: A Global Decentralised End-to-End Encrypted...
- Tracing the Journey into Distributed Tracing
- The World Blew Up but We’re All Okay: How We Managed a...
- When One Line Took Thousands of Websites Offline
- HTTP Headers that Make Your Website Go Faster
- Cache Me If You Can: How Grafana Labs Scaled Up Their...
- Embracing the Multi-Party Dilemma: Incident Response Across...
- The Incident Is The Way: Using Your Incidents to Win...
- That Time I Accidentally DDoS'd My Company
- Artificial Intelligence: How Much Will It Cost You?
- Just the Cryptography You Need to Know for TLS
- You Depend on DNS, This Is How It Works and You Won't...
- 9 Things You Should Do When Starting to Use SLOs
- Silent Spring: What if the GDPR Was Real?
- From Exceptional Maintenance to Automated Routine Operation:...
- Should I Use OTel (collectors), or Is Prometheus Good Enough?
- Implementing Open-source Observability within Maersk
- Journey from Fluent Bit, Fluentd and Prometheus to Open...
- Level 7 Egress Control in Kubernetes: Current Solutions,...
- Leveraging Unikernels and Kubernetes to (Transparently)...
- Monoceros: Faster and Predictable Services through In-pod...
- Continuous Profiling in the Cloud-Native era
- How to Use Prometheus's Native Histograms
- Overcoming Challenges in Serving Large Language Models
- The Value of Reliability
- A Dual Approach to Accountability Engineering
- Succeeding as the Lone SRE in a Small Team
- Deconstructing an Abstraction to Reconstruct an Outage
- When Clouds Stop Raining Discounts: Surviving the Drought
- Should an SRE Care About FinOps? Using Observability to...
- Looking at SRE Needs and Trends over Two Decades with a...
- How to Make Your Automation a Better Team Player
- Dark Matter and Deep State: The Unseen Majority of Everything
- The Endgame of SRE
- SRE's Critical Role in the COVID-19 Pandemic Response in Government
- We're Still Down: A Metastable Failure Tale
- Watering the Roots of Resilience: Learning from Failure with Decision Trees
- Scaling Telemetry Systems with Streaming
- Hacking the Pachyderm: Scaling Servers and People
- Logs Told Us It Was DNS, It Looked like DNS, It Had to Be DNS, It Wasn't DNS
- Epic Incidents of History: The 1979 NORAD Nuclear Near Miss
- Scaling Terraform at ThousandEyes
- OpenTelemetry Metrics 101
- Incident Commanders to Incident Analysts: How We Got Here
- Handover Communications in Software Operations: Findings from the Field
- On the Wings of SREs; J.P. Morgan's Journey into the Cloud
- SRE in Transition: From Startup to Established Business
- Lessons Learned from 7 Years of Running Developer Platforms
- Cognitive Apprenticeship in Practice with Alert Triage Hour of Power
- Building a Diverse SRE Talent Pipeline
- The Best SREs Seem to Be the Ones without an SRE Title—And What We Can Do
- Confessions of an SRE Manager
- Exploring Disconnects between Reliability Practitioners and Management
- Beacon: Intelligent Latency-Aware and Load Shedding Service Routing
- Resiliency Practices in Managing CDN (Content Delivery Network)
- Why This Stuff Is Hard
- Turning an Incident Report into a Design Issue with TLA+
- The Making of an Ultra Low Latency Trading System with Go and Java
- Seeing the Invisible: Two Years at Wikipedia with W3C's Network Error Logging
- Avoiding Cachepocalypse in the Land of the Monolith
- Incident Archaeology: Extracting Value from Paperwork and Narratives
- An Organizational Response to Incidents: Designing for Smooth Coordination
- Building an APM with OpenTelemetry and OpenSource
- Measuring Real-Life Latency of the Internet: A Netflix Story
- Founder/CTO Perspectives: The Future of Distributed Tracing
- Lightning Talks
- Human Observability of Incident Response
- Far from the Shallows: The Value of Deeper Incident Analysis
- How SRE Makes Electric Vehicles
- Warding against the Dark Arts: Crafting a Defense Strategy against Botnet DDoS
- The Revolution Will Not Be Terraformed: SRE and the Anarchist Style
- Implementing SRE in a Regulated Environment
- Financial Resiliency Engineering: Taming Cloud Costs
- Sto: A Better Way to Store and Query Profiler Data
- Chaos-Driven Development: TDD for Distributed Systems
- Adaptive Concurrency Control for Mixed Analytical Workloads
- If I Can Do It on an Ambulance, You Can Do It in an Office: Scalable Incident...
- How To Take Prometheus Planet Scale: Massively Large Scale Metrics Installations
- Your Infrastructure Needs to D.I.E.
- Not All Minutes Are Equal: The Secret behind SLO Adoption Failure
- Hell Is Other Platforms