@tarcisio
Created May 27, 2025 11:10
## Books
Site Reliability Engineering (Google SRE Book)
URL: https://sre.google/books/
5. Eliminating Toil: https://sre.google/sre-book/eliminating-toil/
6. Monitoring Distributed Systems: https://sre.google/sre-book/monitoring-distributed-systems/
Emphasizes that automating manual work with appropriate tools is essential for scalability and efficiency.
### Other parts of the book: https://sre.google/sre-book/being-on-call/
"It’s important that on-call SREs understand that they can rely on several resources that make the experience of being on-call less daunting than it may seem. The most important on-call resources are"
https://sre.google/sre-book/effective-troubleshooting/
Ask "what," "where," and "why"
A malfunctioning system is often still trying to do something—just not the thing you want it to be doing. Finding out what it’s doing, then asking why it’s doing that and where its resources are being used or where its output is going can help you understand how things have gone wrong
!!!!!!!!!!!!!!!!
https://sre.google/sre-book/software-engineering-in-sre/
SREs are in a unique position to effectively develop internal software for a number of reasons:
Auxon Case Study
Auxon, a powerful tool developed within SRE to automate capacity planning for services running in Google production
The DevOps Handbook (Gene Kim, Jez Humble, Patrick Debois, John Willis)
Chapters: "The Three Ways" and "Automate Repetitive Tasks".
URL: https://itrevolution.com/book/the-devops-handbook/
Provides concrete examples and arguments for how automation improves reliability and performance beyond what manual processes can achieve.
## Whitepapers
Google's State of DevOps Report (DORA)
https://dora.dev/
Empirical data show that high-performing companies invest heavily in automation, tooling, and technical solutions alongside their processes. The report uses metrics such as deployment frequency, lead time, MTTR, and change failure rate to illustrate practical results.
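
The four metrics named above can be computed from a plain list of deployment records. The following is a minimal sketch, not DORA's own tooling: the `Deployment` record, its field names, and the `dora_metrics` function are hypothetical, and it assumes that lead time runs from commit to production and that a failure begins at the moment of the failing deployment.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Deployment:
    """Hypothetical record of one production deployment."""
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause a production failure?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def mean_hours(deltas: list[timedelta]) -> float:
    """Average a list of durations, in hours (0.0 for an empty list)."""
    if not deltas:
        return 0.0
    return sum(deltas, timedelta()).total_seconds() / 3600 / len(deltas)


def dora_metrics(deployments: list[Deployment], period_days: int) -> dict[str, float]:
    """Compute the four metrics over a window of `period_days` days."""
    if not deployments or period_days <= 0:
        raise ValueError("need at least one deployment and a positive period")

    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    failures = [d for d in deployments if d.failed]
    # Assumption: the outage starts when the failing change is deployed.
    recoveries = [d.restored_at - d.deployed_at for d in failures if d.restored_at]

    return {
        "deployments_per_day": len(deployments) / period_days,
        "mean_lead_time_hours": mean_hours(lead_times),
        "change_failure_rate": len(failures) / len(deployments),
        "mttr_hours": mean_hours(recoveries),
    }


# Made-up sample data: four deployments over two days, one of which failed.
t0 = datetime(2025, 5, 1, 9, 0)
sample = [
    Deployment(t0, t0 + timedelta(hours=2)),
    Deployment(t0, t0 + timedelta(hours=4), failed=True,
               restored_at=t0 + timedelta(hours=5)),
    Deployment(t0 + timedelta(days=1), t0 + timedelta(days=1, hours=6)),
    Deployment(t0 + timedelta(days=2), t0 + timedelta(days=2, hours=4)),
]
metrics = dora_metrics(sample, period_days=2)
print(metrics)
```

With this sample, the result is 2.0 deployments per day, a mean lead time of 4.0 hours, a change failure rate of 0.25, and an MTTR of 1.0 hour. Real pipelines would pull these timestamps from CI/CD and incident systems rather than hand-built records.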
Puppet State of DevOps Report
URL: https://puppet.com/resources/report/state-of-devops-report/
Contains measurable evidence linking the use of advanced tooling and automation to significant efficiency gains, reduced downtime, and improved reliability.
## Industry examples
Netflix Engineering Blog
URL: https://netflixtechblog.com/
Numerous examples of automation and technology solving real-world scalability and reliability problems.
Netflix emphasizes "tools over process" for speed and scale.
Spotify Engineering Blog
URL: https://engineering.atspotify.com/
Why useful: Emphasizes automation, monitoring, and tooling as key elements in achieving reliability at scale.
## Articles
"Toil Reduction as a SRE Fundamental"
URL: https://cloud.google.com/blog/topics/sre/toil-reduction-as-sre-fundamental
Why useful: Specifically details how automation and technical solutions are essential for reducing manual, repetitive work.
"Scaling Reliability Through Automation" (AWS SRE blog)
URL: https://aws.amazon.com/builders-library/scaling-reliability-through-automation/
Why useful: AWS illustrates that without proper tools and automation, process improvements alone hit diminishing returns rapidly.
## Academic and Formal Presentations
Presentation by Google SREs at Industry Conferences
Look for YouTube videos from conferences like SRECon, DevOps Days, and Velocity. These presentations often demonstrate concrete outcomes from technical tools, automation, and engineering-focused solutions rather than purely process-based improvements.
## Metrics and Concepts to Highlight
When presenting your argument, emphasize these metrics:
Reduction in Toil: Automating routine tasks drastically reduces manual labor, improving scalability and reliability.
Mean Time to Recovery (MTTR): Proper monitoring and automated remediation tools directly reduce outage duration.
Scalability: Technical tools are necessary to scale processes. Manual processes alone cannot keep pace with rapid growth in systems and traffic.
Observability: Comprehensive, integrated monitoring tools reduce the time to identify and remediate issues dramatically compared to fragmented manual processes.
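
To put a number on "Reduction in Toil", one option is to track the share of logged hours spent on toil and rank the toil tasks by time consumed. This is a minimal sketch under stated assumptions: the `WorkLog` record, the task names, and both helper functions are hypothetical, and the 50% threshold follows the SRE book's stated goal of keeping toil below half of each SRE's time.

```python
from dataclasses import dataclass

TOIL_CAP = 0.5  # SRE book guideline: toil below 50% of each SRE's time


@dataclass
class WorkLog:
    """Hypothetical time-tracking entry."""
    task: str
    hours: float
    is_toil: bool  # manual, repetitive, automatable work


def toil_fraction(entries: list[WorkLog]) -> float:
    """Share of all logged hours that went to toil."""
    total = sum(e.hours for e in entries)
    if total <= 0:
        raise ValueError("no hours logged")
    return sum(e.hours for e in entries if e.is_toil) / total


def automation_candidates(entries: list[WorkLog], top: int = 3) -> list[tuple[str, float]]:
    """Toil tasks ranked by total hours, largest first."""
    totals: dict[str, float] = {}
    for e in entries:
        if e.is_toil:
            totals[e.task] = totals.get(e.task, 0.0) + e.hours
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:top]


# Made-up week of work for one engineer (40 hours in total).
week = [
    WorkLog("manual cert rotation", 6, True),
    WorkLog("ticket-driven restarts", 12, True),
    WorkLog("capacity dashboard project", 18, False),
    WorkLog("manual cert rotation", 4, True),
]

fraction = toil_fraction(week)
over_cap = fraction > TOIL_CAP
candidates = automation_candidates(week)
print(f"toil: {fraction:.0%}, over cap: {over_cap}")
print(candidates)
```

Here 22 of 40 hours are toil (55%), which exceeds the cap, and the ranking puts "ticket-driven restarts" (12 hours) ahead of "manual cert rotation" (10 hours) as the first automation target.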