Last active
January 16, 2025 05:38
SRE conferences
- Notes from Production Engineering
- Case Study: Adopting SRE Principles at StackOverflow
- Monitoring without Infrastructure at Airbnb
- Scaling Networks through Software
- Incident Analysis
- From Zero to Hero: Recommended Practices for training your ever-evolving SRE Teams
- Architecting and launching the Halo 4 Services
- Being afraid - How Paranoia at Dropbox protects your data
- Panel: AMA with the SRECon chairs and speakers
- Netflix RaaS: Reliability as a service
- Making the Sum of AWS Networking greater than its parts -- Achieving High Availability
- Making every SRE Hire count
- Building a Billion User load balancer ✨
- Panel: Educate SRE
- Collin and the Slingbot
- Smart monitor system for automatic anomaly detection at Baidu ✨
- MySQL automation at Facebook Scale
- Learning from Mistakes and outages at Facebook
- Lightning Talks
- Mux: How I stopped worrying and learned to love the multiplexing ✨
- Instagration: A case study in Cloud Migration at scale
- Error Budgets and risks
- Panel: The weeping angels of Site Reliability
- Ensuring Success During Disaster
- Panel: Fifty shades of Grey: Different Models for Reliability Work
- The Realities of the Job of Delivering Reliability
- Beyond repair: Proactive maintenance work at scale ✨
- nrrd 911 ic me: The Incident Commander Role
- Continuous Deployment to Millions of users 40 times a day
- What's NetDevOps? How do I start?
- Netflix: 190 countries and 5 core SREs ✨
- Debugging distributed systems ✨
- Doorman: Global Distributed client side rate limiting ✨
- How to improve a service by roasting it
- College student to SRE: Onboarding your entry level talent
- Service Levels and Error Budgets
- Stepping up to scale
- From Ops to SRE on a Brazilian Startup
- Shopping Event Reliability
- Using salt to make infrastructure Consumable (Tasty, Even)
- Operations at (small) Scale
- Operational Buddhism: Building reliable services from Unreliable components
- Finding the order in chaos
- Moving large workloads from a public cloud to an OpenStack Private Cloud: Is it really worth it?
- SREs + Software Engineers: Making it work
- Monitoring the unmeasurable ✨
- Go for SREs using Python
- A young lady's illustrated primer to technical decision-making
- Putting together great SRE teams
- Server Provisioning in an IPv6 only world ✨
- Privacy Reliability Engineering: Looking at privacy through the Lens of SRE
- Building reliable social infrastructure for Google
- The Evolution of Global traffic routing and failover
- Lightning talks
- Terraform at Adobe ✨
- Transforming Tier 1 Caterpillars to Butterflies
- The Art of Performance Monitoring
- Managing Grumpy: Embracing Diversity to Build Stronger Teams
- It's People All the Way Down
- Running Consul at Scale - Journey from RFC to Production
- Panel: Who/What is SRE?
- Avoiding Cascading Failures at eBay? ✨
- SRE at a Start-Up: Lessons from LinkedIn ✨
- Less Alarming Alerts!
- Shaping Reality to Shape Outcomes: Making SRE Work with Uber Growth
- Panel: SRE Managers
- Performance Checklists for SREs ✨
- LinkedIn SRE From Inception to Global Scale ✨
- Next Generation of DevOps: AIOps in Practice @Baidu ✨
- How Could Small Teams Get Ready for SRE
- How We Built TechLadies in Singapore
- Focal Impact - The Service Pyramid ✨
- Smart Monitoring System for Anomaly Detection on Business Trends in Alibaba ✨
- Graphite@Scale or How to Store Millions of Metrics per Second ✨
- Data Checking at Dropbox ✨
- Managing Server Secrets at Scale with a Vaultless Password Manager
- Open Falcon - A Distributed and High Performance Monitoring System ✨
- Talking to an OpenStack Cluster in Plain English
- Distributed Consensus Algorithms
- A Distribution Framework over ANSIBLE
- Draining the Flood - A Combat against Alert Fatigue
- Good, Better, Best, Mobile User Experience
- Reliable Launches at Scale
- Didi: How to Provide a Reliable Ridesharing Service
- Measuring the Success of Incident Management at Atlassian
- Managing Changes Seamlessly on Yahoo's Hadoop Infrastructure Servers
- Event Correlation - A Fresh Approach towards Reducing
- Automated Troubleshooting of Live Site Issues
- A Unit Test Would Have Caught This
- Testing for DR Failover Testing
- Accept Partial Failures, Minimize Service Loss
- Azure SREBot - More than a Chatbot
- Merou: A Decentralized, Audited Authorization Service
- Canary in the Internet Mine
- InnoDB to MyRocks Migration in Main MySQL Database at Facebook
- Golang's Garbage
- Capacity Planning and Flow Control
- Managing Capacity @ LinkedIn
- Distributed Scheduler Hell
- SRE Your gRPC - Building Reliable Distributed Systems Illustrated with gRPC
- Operationalizing DevOps Teaching
- Scaling Reliability at Dropbox - Our Journey towards a Distributed Ownership
- Reducing MTTR and False Escalations: Event Correlation at LinkedIn
- The Service Score Card—Gamifying Operational Excellence
- Postmortem Action Items: Plan the Work and Work the Plan ✨
- Don't Call Me Remote! Building and Managing Distributed Teams
- Observability in the Cambrian Stack Era ✨
- Keep Calm and Carry On: Scaling Your Org with Microservices
- From Engineering Operations to Site Reliability Engineering
- DNSControl: A DSL for DNS as Code from StackOverflow.com
- Every Day Is Monday in Operations ✨
- Traps and Cookies
- Spotify's Love-Hate Relationship with DNS ✨
- Lyft's Envoy: Experiences Operating a Large Service Mesh ✨
- Principles of Chaos Engineering ✨
- BPerf-Bing.com Cloud Profiling on Production ✨
- I'm an SRE Lead! Now What? How to Bootstrap and Organize Your SRE Team
- Ambry: LinkedIn's Distributed Immutable Object Store ✨
- A Million Containers Isn't Cool
- It's the End of the World as We Know It (I Feel Fine): Engineering for Crisis
- Killing Our Darlings: How to Deprecate Systems
- Tune Your Way to Savings!
- Feedback Loops: How SREs Benefit and What is Needed to Realize Their Potential
- Anomaly Detection in Infrequently Occurred Patterns
- SRE and Presidential Campaigns
- A Practical Guide to Monitoring and Alerting with Time Series at Scale
- Panel: Training New SREs
- Deployment Automation: Releasing Quickly and Reliably
- Lightning Talks 1
- Lightning Talks 2
- Care and Feeding of SRE
- Diversity and Inclusion in SRE: A Postmortem
- Globalizing SRE in a Walkup Culture
- Make Haste Slowly: Balancing SRE Diligence in Urgency...
- Want to Solve Over-Monitoring...
- SRE Your gRPC... ✨
- Profiling Node Applications
- The Dangers of Being Overly-Paranoid
- Show Me the RIGHT Numbers! Are Our Users Happy? ✨
- Standing On the Shoulders of Giants...
- InStream: Large Scale Distribution...
- Use Load Testing to Build a Proper Mental Model of Your Service
- Traffic Steering using RUM DNS @ LinkedIn ✨
- Capturing and Analyzing Millions...
- OK Log: Distributed and Coördination-Free Logging ✨
- How We Try to Make a Lion Bulletproof...
- From Firefighting to Proactive Work: ...
- Incident Command at the Edge ✨
- Resiliency Testing with Toxiproxy
- Building a Culture of Reliability
- Tech Leadership in SRE
- Case Study: Lessons Learned from Our First Worldwide Outage
- When Trouble Comes to Town
- The Why, What, and How of Starting an SRE Engagement
- Startup Systems Engineers Instruction Manual
- Cognitive Bias and On-Call
- Reducing MTTR and False Escalations: Event Correlation
- The Never-Ending Story of Site Reliability ✨
- Hiring SREs May Be Literally Impossible
- Gamifying Reliability Excellence—The Service Score Card
- Incident Management
- Lightning Talks
- Why Work with Tech Writers? ✨
- Postmortem Action Items: Plan the Work and Work the Plan
- Building an On-Premise Kubernetes
- Distributed Systems, Like It or Not ✨
- Avoiding and Breaking Out of Capacity Prison ✨
- Service with an Angry Smile: Passive-Aggressive Behavior in SRE ✨
- The Cult(ure) of Strength
- Run Less Software; Use Less Bits ✨
- Monitoring Cloudflare's Planet-Scale Edge Network
- Monitoring Design Principles ✨
- And the CFO Wept: AWS Cost Control
- Have You Tried Turning It off and Turning It on Again?
- 100 Teams, 100 Ways to Fail ✨
- Persistent SRE Antipatterns: Pitfalls On the...
- If You Don’t Know Where You’re Going, It Doesn’t Matter How Fast You Get There
- Security and SRE: Natural Force Multipliers
- What It Really Means to Be an Effective Engineer
- SparkPost: The Day the DNS Died
- Stable and Accurate Health-Checking of Horizontally-Scaled Services
- Beyond Burnout: Mental Health and Neurodiversity in Engineering
- Bootstrapping an SRE Team:
- Don’t Ever Change! Are Immutable Deployments Really Simpler, Faster, and Safer?
- Lessons Learned from Our Main Database Migrations at Facebook
- Leveraging Multiple Regions to Improve Site Reliability:
- Building Successful SRE in Large Enterprises—One Year Later
- Working with Third Parties Shouldn't Suck
- When to NOT Set SLOs: Lots of Strangers Are Running My Software!
- Lessons Learned from Five Years of Multi-Cloud at PagerDuty
- Help Protect Your Data Centers with Safety Constraints
- Real World SLOs and SLIs: A Deep Dive
- How SREs Found More than $100 Million Using Failed Customer Interactions
- Learning at Scale Is Hard! Outage Pattern Analysis and Dirty Data
- How Not to Go Boom: Lessons for SREs from Oil Refineries
- Containerization War Stories
- Resolving Outages Faster with Better Debugging Strategies
- Monitoring DNS with Open-Source Solutions
- Antics, Drift, and Chaos
- Security as a Service
- Breaking in a New Job as an SRE
- "Capacity Prediction" instead of "Capacity Planning":
- Distributed Tracing, Lessons Learned
- Junior Engineers Are Features, Not Bugs
- Approaching the Unacceptable Workload Boundary
- Building Shopify's PaaS on Kubernetes
- Know Thy Enemy: How to Prioritize and Communicate Risks
- Automatic Metric Screening for Service Diagnosis
- Whispers in Chaos: Searching for Weak Signals in Incidents
- Architecting a Technical Post Mortem
- Your System Has Recovered from an Incident, but Have Your Developers?
- The History of Fire Escapes
- Leaping from Mainframes to AWS: Technology Time Travel in the Government
- Operational Excellence in April Fools’ Pranks: Being Funny Is Serious Work!
- The Evolution of Site Reliability Engineering
- Safe Client Behaviour
- Service Monitoring Manual—2018 Edition
- Introduction to Alibaba Monitoring System
- Building SRE: Culture from the Outside In
- Quantifying Empathy with Service Level Objectives
- Doing Things the Hard Way
- Achieving Observability into Your Application with OpenCensus
- Know Thy Enemy: How to Prioritize and Communicate Risks
- Data Visualization for SREs—an Essential Skill for Quick Debugging
- You Can't Stop Fires with an Ambulance
- Comprehensive Container-Based Service Monitoring with Kubernetes and Istio
- How to Make Releases Safer in Baidu
- Cultural Nuance and Effective Collaboration for Multicultural Teams
- Automatic Datacenter and Service Deployments...
- From Monitoring to Automated Testing of Your Infrastructure Code
- Shopify's Move from the Data Centre to the Cloud
- Ensuring Reliability of High-Performance Applications
- Smarter Disasters: End-to-End Automation for Incidents
- Debugging at Scale—Going from Single Box to Production
- Productionizing Machine-Learning Services: Lessons from Google SRE
- Pro Tip: Save Money on Outages by Having a Bot Do the Heavy Lifting
- Evolution of SRE and Rising Need of SRE Catalyzers
- How to Serve and Protect (with Client Isolation)
- A Tale of One Billion Time Series
- Isolation without Containers
- Automatic Traffic Scheduling for Internet Connectivity Failures
- Lessons Learned from Our Main Database Migrations at Facebook
- PV Monitoring Based on Linear Regression
- Do Docs Better: Practical Tips on Delivering...
- Characterizing and Understanding Phases of SRE Practices
- Scaling Yourself for Managing Distributed Teams...
- Interviewing for Systems Design Skills
- Scaling a Distributed Stateful System: A LinkedIn Case Study
- Mentoring: A Newcomer's Perspective
- You Get What You Measure—Why Metrics Are Important
- Blame. Language. Sharing: Three Tips for...
- A Theory and Practice of Alerting with Service Level Objectives
- Production Engineering: Connect the Dots
- Mental Models for SREs
- Circonus: Design (Failures) Case Study
- SRE Theory vs. Practice: A Song of Ice and TireFire
- Data Protection Update and Tales from the Introduction of the GDPR
- What Makes a Good SRE: Findings from the SRE Survey
- Sustainability Starts Early: Creating a Great Ops Internship
- The Silver Lining Consortium: Post-Mortems for the Rest of Us
- Migrations under Production Load: How to Switch...
- The 7 Deadly Sins of Documentation
- Y2K and Other Disappointing Disasters: Risk Reduction and Harm Mitigation
- Availability, Latency, and Cost: Withstanding Regional Outages
- SRE for Mobile Applications
- Your System Has Recovered from an Incident, but Have Your Developers?
- Against On-Call: A Polemic
- Impact of Network Automation
- Migrating Your Old Server Products to Be Stateless Cloud Services
- Lightning Talks
- Dealing with Dark Debt: Lessons Learnt at Goldman Sachs
- Halt and Don’t Catch Fire
- Applying the Principles of Chaos to Serverless
- Know Your Kubernetes Deploys
- Not Invented Here Syndrome and Dark Debt: The PagerDuty Story
- Building a Debuggable Go Server
- Building a Fellowship Program to Mentor and Grow Your SRE Team
- SoundCloud's Story of Seeking Sustainable SRE
- How We Un-Scattered Our DNS Setup and Unlocked New Automation Options
- Kernel Upgrades at Facebook
- Managing Misfortune for Best Results
- Clearing the Way for SRE in the Enterprise
- The Math behind Project Scheduling, Bug Tracking, and Triage
- Ethics in Computing
- Canarying Well: Lessons Learned from Canarying Large Populations
- Real World SLOs and SLIs: A Deep Dive
- I’m SRE and You Can Too!—A Fine Manual...
- Lessons Learned—Data Driven Hiring 3 Years Later
- SRE Team Lifecycles
- Capacity Planning in Four Parts: Telling the Future without a Crystal Ball
- The Nth Region Project: An Open Retrospective
- This IS NOT Fine: Putting Out (Code) Fires
- What Medicine Can Teach Us about Being On-Call
- Tradeoffs in Resiliency: Managing the Burden of Data Recoverability
- Scalable Coding—Find the Error
- Delete This: Decommissioning Servers at Scale
- Observability for Emerging Infra: What Got You Here Won't Get You There
- Deploying SRE Training Best Practices to Production...
- Keep Building Fresh: Shopify's Journey to Kubernetes
- The Myth of Cloud Agnosticism
- SRE for Good: Engineering Intersections between Operations and Social Activism
- Can I Tell You a Secret? I See Dead Systems
- Junior Engineers Are Features, Not Bugs
- SREcon Conversations Europe/Middle East/Africa with Amy Tobey, Equinix Metal
- SREcon Conversations Europe/Middle East/Africa with Jennifer Petoff and JC van Winkel, Google Inc.
- SREcon Conversations Europe/Middle East/Africa with Avery Pennarun, Tailscale
- SREcon Conversations Europe/Middle East/Africa with King'ori Maina, Zappi
- SREcon Conversations Europe/Middle East/Africa with Alex Hidalgo
- SREcon Conversations Europe/Middle East/Africa with Štěpán Davidovič, Google Inc.
- SREcon Conversations Asia/Pacific with Koon Seng Lim, DBS
- SREcon Conversations Asia/Pacific with Katherine Lim, Innablr
- SREcon Conversations Asia/Pacific with Karthikeyan Selvaraj and Rajesh Ramachandran, PayPal
- SREcon Conversations with David Argent, Amazon (August 2020)
- SREcon Conversations with Avleen Vig, Facebook (July 2020)
- SREcon Conversations with Ingrid Epure, Netlify (May 2020)
- The Secret Lives of SREs - Controlling the Costs of Coordination across Remote
- Identifying Hidden Dependencies
- Are We Getting Better Yet? Progress Toward Safer Operations
- Continuously Improving Culture through Design Decisions
- Avoiding Goodhart's Law - Use SLOs as Tools Not Cudgels
- Off the Beaten Path: Moving Observability Focus from Your Service
- Observing from Incidents
- Building Service Ownership Using Documentation, Telemetry, and a Chance to Make
- Study on Human Factors and Team Culture to Improve Pager Fatigue
- Squish Level Objectives: How SRE can Help Align Technical Work to User Benefit
- Building Actionable Code Ownership
- SREs at Telecom and Media Industry: Bridging between Legacy and Cloud Native App
- Jupyter as Incident Response Tool
- Sustainable Software Engineering & SREs
- When /bin/sh Attacks: Revisiting "Automate All the Things"
- Testing Encyclopedias in Production
- Why SREs can't afford to NOT do Chaos Engineering
- Implementing Distributed Consensus
- Incident Response in Unfamiliar Sociotechnical Systems
- Making Infrastructure More Friendly for Beginners
- Low Context DevOps: Improving SRE Team Culture through Defaults, Documentation
- The Smallest Possible SRE Team
- Cloudy with a Chance of Chaos
- Confessions of a Systems Engineer: Learning from My 20+ Years of Failure
- Pragmatic Security for SRE
- "Disorganizing" Your SRE Organization
- Failure is Not an Option! SRE Lessons 50 Years after the Apollo 13 Flight
- Challenges of Starting an SRE Team from Scratch in an Enterprise
- The Good, the Bad and the Ugly: The 3 Learnings of an SRE
- 9 Years of Failure: How Racing Crappy Cars Made Me a Better SRE
- Production Population Control: My Cattle are Rabbits!
- Latency and Availability Error Budgets Done Right at Scale
- The Evolution of Traffic Routing in a Streaming World
- Heap Optimization for Go Systems
- Soft Failures, Hard Goals - Accelerating Payments at Scale During the Pandemic
- Give Your PXE wings! Bootstrapping Explained
- Hot Swap Your Datastore: A Practical Approach and Lessons Learned
- Automatically Detect the Top Performance & Scalability Issues in Distributed
- A Bartender's Guide to Network Monitoring
- Achieving the Ultimate Performance with KVM
- Weeks of Debugging Can Save You Hours of TLA+
- Capacity Planning and Performance Enhancement with Page Reference Sampling
- Achieving Mutual TLS: Secure Pod-to-Pod Communication Without the Hassle
- Panel: Learning from Adaptations to Coronavirus
- It's a Trap! How Abstractions Have Failed Us.