Created
May 27, 2025 11:10
## Books
Site Reliability Engineering (Google SRE Book)
URL: https://sre.google/books/
5. Eliminating Toil: https://sre.google/sre-book/eliminating-toil/
6. Monitoring Distributed Systems: https://sre.google/sre-book/monitoring-distributed-systems/
Emphasizes that automating manual work with appropriate tools is essential for scalability and efficiency.
### Other parts of the book: https://sre.google/sre-book/being-on-call/
"It’s important that on-call SREs understand that they can rely on several resources that make the experience of being on-call less daunting than it may seem. The most important on-call resources are"
https://sre.google/sre-book/effective-troubleshooting/
Ask "what," "where," and "why"
A malfunctioning system is often still trying to do something, just not the thing you want it to be doing. Finding out what it’s doing, then asking why it’s doing that and where its resources are being used or where its output is going can help you understand how things have gone wrong.
!!!!!!!!!!!!!!!!
https://sre.google/sre-book/software-engineering-in-sre/
SREs are in a unique position to effectively develop internal software for a number of reasons:
Auxon Case Study
Auxon, a powerful tool developed within SRE to automate capacity planning for services running in Google production
The DevOps Handbook (Gene Kim, Jez Humble, Patrick Debois, John Willis)
Chapters: "The Three Ways" and "Automate Repetitive Tasks".
URL: https://itrevolution.com/book/the-devops-handbook/
Provides concrete examples and arguments for how automation improves reliability and performance beyond what manual processes can achieve.
## Whitepapers
Google's State of DevOps Report (DORA)
https://dora.dev/
Empirical data shows that high-performing companies invest heavily in automation, tooling, and technical solutions alongside their processes. The report uses metrics such as deployment frequency, lead time, MTTR, and change failure rate to illustrate practical results.
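To make the four metrics named above concrete, here is a minimal sketch of how they could be computed from deployment records. The `Deployment` record, its field names, the `dora_metrics` helper, and the sample data are all hypothetical and chosen for illustration; they are not taken from the DORA report, and real teams would pull these values from their CI/CD and incident tooling.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean, median
from typing import Optional


@dataclass
class Deployment:
    """Hypothetical record of one production deployment."""
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause a production failure?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def _hours(start: datetime, end: datetime) -> float:
    return (end - start).total_seconds() / 3600


def dora_metrics(deployments: list[Deployment], window_days: int) -> dict:
    """Compute the four metrics over an observation window of `window_days` days."""
    if not deployments or window_days <= 0:
        raise ValueError("need at least one deployment and a positive window")
    lead_times = [_hours(d.committed_at, d.deployed_at) for d in deployments]
    failures = [d for d in deployments if d.failed]
    recoveries = [
        _hours(d.deployed_at, d.restored_at)
        for d in failures
        if d.restored_at is not None
    ]
    return {
        "deploys_per_day": len(deployments) / window_days,
        "median_lead_time_hours": median(lead_times),
        "change_failure_rate": len(failures) / len(deployments),
        # None when no failed deployment has been restored in the window
        "mttr_hours": mean(recoveries) if recoveries else None,
    }


# Made-up sample data: four deployments in one week, two of which failed.
SAMPLE = [
    Deployment(datetime(2025, 5, 1, 9), datetime(2025, 5, 1, 11)),
    Deployment(datetime(2025, 5, 2, 9), datetime(2025, 5, 2, 15),
               failed=True, restored_at=datetime(2025, 5, 2, 16)),
    Deployment(datetime(2025, 5, 3, 10), datetime(2025, 5, 3, 14)),
    Deployment(datetime(2025, 5, 5, 8), datetime(2025, 5, 5, 16),
               failed=True, restored_at=datetime(2025, 5, 5, 19)),
]

if __name__ == "__main__":
    for name, value in dora_metrics(SAMPLE, window_days=7).items():
        print(f"{name}: {value}")
```

This sketch measures recovery time from the moment of deployment to restoration, which is a simplifying assumption; measuring from the time the failure was detected is an equally reasonable choice.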
Puppet State of DevOps Report
URL: https://puppet.com/resources/report/state-of-devops-report/
Contains measurable evidence linking the use of advanced tooling and automation to significant efficiency gains, reduced downtime, and improved reliability.
## Industry examples
Netflix Engineering Blog
URL: https://netflixtechblog.com/
Numerous examples of automation and technology solving real-world scalability and reliability problems.
Netflix emphasizes "tools over process" for speed and scale.
Spotify Engineering Blog
URL: https://engineering.atspotify.com/
Why useful: Emphasizes automation, monitoring, and tooling as key elements in achieving reliability at scale.
## Articles
"Toil Reduction as a SRE Fundamental"
URL: https://cloud.google.com/blog/topics/sre/toil-reduction-as-sre-fundamental
Why useful: Details how automation and technical solutions are essential for reducing manual, repetitive work.
"Scaling Reliability Through Automation" (AWS SRE blog)
URL: https://aws.amazon.com/builders-library/scaling-reliability-through-automation/
Why useful: AWS illustrates that without proper tools and automation, process improvements alone quickly hit diminishing returns.
## Academic and Formal Presentations
Presentations by Google SREs at Industry Conferences
Look for YouTube videos from conferences like SRECon, DevOps Days, and Velocity. These presentations often demonstrate concrete outcomes from technical tools, automation, and engineering-focused solutions rather than purely process-based improvements.
## Metrics and Concepts to Highlight
When presenting your argument, emphasize these metrics:
Reduction in Toil: Automating routine tasks reduces manual labor, which improves scalability and reliability.
Mean Time to Recovery (MTTR): Proper monitoring and automated remediation tools directly reduce outage duration.
Scalability: Tooling is needed to scale processes. Manual processes alone cannot keep pace with system growth.
Observability: Comprehensive, integrated monitoring tools shorten the time to identify and remediate issues compared to fragmented manual processes.
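As one way to put a number on the "Reduction in Toil" point above, the following minimal sketch computes the share of logged time spent on toil and compares it to a budget. The category names, the `toil_fraction` helper, and the sample hours are hypothetical; the 50% budget reflects the goal stated in the SRE book's "Eliminating Toil" chapter of keeping toil below half of each SRE's time.

```python
# Goal stated in the SRE book: keep toil below 50% of each SRE's time.
TOIL_BUDGET = 0.5


def toil_fraction(hours_by_category: dict[str, float],
                  toil_categories: set[str]) -> float:
    """Return the share of total logged hours that falls into toil categories."""
    total = sum(hours_by_category.values())
    if total <= 0:
        raise ValueError("no hours logged")
    toil = sum(h for c, h in hours_by_category.items() if c in toil_categories)
    return toil / total


# Made-up time log for one engineer's 40-hour week.
WEEK = {
    "manual_deploys": 10.0,      # toil
    "ticket_handling": 8.0,      # toil
    "automation_project": 16.0,  # engineering work
    "design_review": 6.0,        # engineering work
}
TOIL = {"manual_deploys", "ticket_handling"}

if __name__ == "__main__":
    share = toil_fraction(WEEK, TOIL)
    status = "within" if share < TOIL_BUDGET else "over"
    print(f"toil share: {share:.0%} ({status} the {TOIL_BUDGET:.0%} budget)")
```

Tracking this share before and after an automation project gives a simple before/after figure to support the argument.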