2023 - SRE conferences

SREcon23 Asia/Pacific

How the Sony PlayStation Network Does SRE
Unleashing Generative AI: Improving Developer Productivity in SRE
Patterns, Not Categories: Learning Across Incidents
Observability in the MLOps Lifecycle with Prometheus
Towards Zero Carbon: Implementing Sustainable Battery Lifecycle...
Leveraging Analytics for Technical Efficiency and Enhanced User Experience
LiveMLP: ML Platform for Assisting Contact Center Agents in Real-Time
How Safe Is Your Domain?
From Push to Pull: Managing Mutable Infrastructure at a Global Scale
Autonomous Automation: How Cloudflare Handles Server Diagnostics...
Real World Debugging with eBPF
An SRE Guide to Linux Kernel Upgrades
Taming Spiky Log Volumes: Maintaining Real-Time Log Accessibility with Kaldb
Better Observability with No Code Changes
The Secret Weapon for a Successful SRE Career - And It's Not What You Think!
Untangling the Tangled Cloud
Functional Resonance Analysis: Diagramming Your System
Start Small, Scale Big: Building and Scaling Platforms and SRE Culture...
Cultivating Accountability and Resilience
Finding the Needle in the Haystack: Predicting Storage Device Failures in...
Lessons Learned Running GKE Clusters on Spot Instances
Are We All on the Same Page? Let's Fix That
Humane On-call
Giving Away Your Secrets: Opening Metrics Up to Users
From "Keeping the Lights On" to "Designing the LEDs": A Detailed Review...
Fighting Financial Crimes as an SRE
Beyond Observability - Aligning Technology Performance to Business Outcomes
What Is Linux Kernel Keystore and Why You Should Use It in Your Next...
Multicloud and the Chamber of Secrets
Hold My Beer - Load Testing. In Production. On Autopilot.
Performance Testing in Keptn Using K6
Mastering Chaos: Achieving Fault Tolerance with Observability-Driven...
Distributed Tracing: Adaptive and Telemetry-Based Approach for Effective...
Challenges of Managing Real-Time Financial Market Data Storage
Transformation Journey of E2E Customer Flow Testing...
The Only Constant Is Change: Lessons from a 25 Year SRE Career

SREcon23 Europe/Middle East/Africa

eBPF Superpowers for SRE
Building a 5-Exaflop Supercomputer for Meta-AI Research and...
From Sysadmins to (almost) Flying Unicorns
Implementing SRE in a Telco with Reliability Enhancing...
Symptom-based Alerting for Machine Learning - What I Learned...
Reliable Data for Large ML Models: Principles and Practices
New Grads Becoming New SREs: Catalyzing a “Circle of Life”...
Scale Your Future: An Immersive Engineering Programme
Over, Under, Around, and Through: A Detailed Comparison...
Deploying and Debugging HTTP/3
The Engineer/Manager Pendulum Goes Mainstream
Do Not Thrash the Node.js Event Loop
Scaling Chef Emotionally
SRE for [cyber]security
Cloud, Kubernetes, and Service Networking - Taming the Turtles
Designing Matrix: A Global Decentralised End-to-End Encrypted...
Tracing the Journey into Distributed Tracing
The World Blew Up but We’re All Okay: How We Managed a...
When One Line Took Thousands of Websites Offline
HTTP Headers that Make Your Website Go Faster
Cache Me If You Can: How Grafana Labs Scaled Up Their...
Embracing the Multi-Party Dilemma: Incident Response Across...
The Incident Is The Way: Using Your Incidents to Win...
That Time I Accidentally DDoS'd My Company
Artificial Intelligence: How Much Will It Cost You?
Just the Cryptography You Need to Know for TLS
You Depend on DNS, This Is How It Works and You Won't...
9 Things You Should Do When Starting to Use SLOs
Silent Spring: What if the GDPR Was Real?
From Exceptional Maintenance to Automated Routine Operation:...
Should I Use OTel (collectors), or Is Prometheus Good Enough?
Implementing Open-source Observability within Maersk
Journey from Fluent Bit, Fluentd and Prometheus to Open...
Level 7 Egress Control in Kubernetes: Current Solutions,...
Leveraging Unikernels and Kubernetes to (Transparently)...
Monoceros: Faster and Predictable Services through In-pod...
Continuous Profiling in the Cloud-Native Era
How to Use Prometheus's Native Histograms
Overcoming Challenges in Serving Large Language Models
The Value of Reliability
A Dual Approach to Accountability Engineering
Succeeding as the Lone SRE in a Small Team
Deconstructing an Abstraction to Reconstruct an Outage
When Clouds Stop Raining Discounts: Surviving the Drought
Should an SRE Care About FinOps? Using Observability to...
Looking at SRE Needs and Trends over Two Decades with a...
How to Make Your Automation a Better Team Player
Dark Matter and Deep State: The Unseen Majority of Everything

SREcon23 Americas

The Endgame of SRE
SRE's Critical Role in the COVID-19 Pandemic Response in Government
We're Still Down: A Metastable Failure Tale
Watering the Roots of Resilience: Learning from Failure with Decision Trees
Scaling Telemetry Systems with Streaming
Hacking the Pachyderm: Scaling Servers and People
Logs Told Us It Was DNS, It Looked like DNS, It Had to Be DNS, It Wasn't DNS
Epic Incidents of History: The 1979 NORAD Nuclear Near Miss
Scaling Terraform at ThousandEyes
OpenTelemetry Metrics 101
Incident Commanders to Incident Analysts: How We Got Here
Handover Communications in Software Operations: Findings from the Field
On the Wings of SREs: J.P. Morgan's Journey into the Cloud
SRE in Transition: From Startup to Established Business
Lessons Learned from 7 Years of Running Developer Platforms
Cognitive Apprenticeship in Practice with Alert Triage Hour of Power
Building a Diverse SRE Talent Pipeline
The Best SREs Seem to Be the Ones without an SRE Title—And What We Can Do
Confessions of an SRE Manager
Exploring Disconnects between Reliability Practitioners and Management
Beacon: Intelligent Latency-Aware and Load Shedding Service Routing
Resiliency Practices in Managing CDN (Content Delivery Network)
Why This Stuff Is Hard
Turning an Incident Report into a Design Issue with TLA+
The Making of an Ultra Low Latency Trading System with Go and Java
Seeing the Invisible: Two Years at Wikipedia with W3C's Network Error Logging
Avoiding Cachepocalypse in the Land of the Monolith
Incident Archaeology: Extracting Value from Paperwork and Narratives
An Organizational Response to Incidents: Designing for Smooth Coordination
Building an APM with OpenTelemetry and OpenSource
Measuring Real-Life Latency of the Internet: A Netflix Story
Founder/CTO Perspectives: The Future of Distributed Tracing
Lightning Talks
Human Observability of Incident Response
Far from the Shallows: The Value of Deeper Incident Analysis
How SRE Makes Electric Vehicles
Warding against the Dark Arts: Crafting a Defense Strategy against Botnet DDoS
The Revolution Will Not Be Terraformed: SRE and the Anarchist Style
Implementing SRE in a Regulated Environment
Financial Resiliency Engineering: Taming Cloud Costs
Sto: A Better Way to Store and Query Profiler Data
Chaos-Driven Development: TDD for Distributed Systems
Adaptive Concurrency Control for Mixed Analytical Workloads
If I Can Do It on an Ambulance, You Can Do It in an Office: Scalable Incident
How To Take Prometheus Planet Scale: Massively Large Scale Metrics Installations
Your Infrastructure Needs to D.I.E.
Not All Minutes Are Equal: The Secret behind SLO Adoption Failure
Hell Is Other Platforms
