
@aw-junaid
Created February 23, 2026 23:15
A Comprehensive Guide to Modern DevOps Practices, Tools, and Cultural Transformation

DevOps Engineering: From Foundations to Enterprise-Scale Platform Architecture


Table of Contents

PART I — DEVOPS FOUNDATIONS

  • Chapter 1 — Introduction to DevOps
  • Chapter 2 — DevOps Culture & Organizational Design
  • Chapter 3 — Linux & System Fundamentals for DevOps

PART II — VERSION CONTROL & COLLABORATION

  • Chapter 4 — Git Internals & Advanced Workflows
  • Chapter 5 — Platforms

PART III — CI/CD PIPELINES

  • Chapter 6 — Continuous Integration
  • Chapter 7 — CI Tools
  • Chapter 8 — Continuous Delivery & Deployment

PART IV — CONTAINERS & ORCHESTRATION

  • Chapter 9 — Containerization
  • Chapter 10 — Kubernetes Deep Dive
  • Chapter 11 — Kubernetes in Production

PART V — INFRASTRUCTURE AS CODE

  • Chapter 12 — Infrastructure as Code Principles
  • Chapter 13 — IaC Tools

PART VI — CLOUD PLATFORMS

  • Chapter 14 — Cloud Fundamentals
  • Chapter 15 — Amazon Web Services
  • Chapter 16 — Microsoft Azure
  • Chapter 17 — Google Cloud Platform

PART VII — OBSERVABILITY & SRE

  • Chapter 18 — Monitoring & Logging
  • Chapter 19 — Site Reliability Engineering

PART VIII — DEVSECOPS

  • Chapter 20 — Secure DevOps
  • Chapter 21 — Security Tools

PART IX — ADVANCED TOPICS

  • Chapter 22 — GitOps & Platform Engineering
  • Chapter 23 — Serverless & Edge
  • Chapter 24 — Performance & Scalability
  • Chapter 25 — DevOps at Enterprise Scale

PART X — PRACTICAL IMPLEMENTATION

  • Chapter 26 — Building a Complete DevOps Pipeline
  • Chapter 27 — Real-World Case Studies

Appendices


PART I — DEVOPS FOUNDATIONS

Chapter 1 — Introduction to DevOps

1.1 History of Software Development

The journey of software development methodologies spans over six decades, evolving from the nascent days of computing to the sophisticated, automated pipelines we see today. Understanding this history is crucial for appreciating why DevOps emerged as a necessary evolution rather than a passing trend.

The Pioneering Era (1950s-1960s)

In the early days of computing, software was tightly coupled with hardware. Programs were written in machine language or assembly, and the concept of "software development" as a distinct discipline barely existed. The IBM 704, introduced in 1954, was one of the first mass-produced computers, and programming it involved physical plugboards and punch cards. There was no separation between development and operations—the same people who wrote the code also ran the machines. This period was characterized by:

  • Batch Processing: Jobs were submitted on punch cards, and results would return hours or days later.
  • Hardware Dominance: Software was often given away for free with hardware purchases.
  • No Standardization: Every machine had its own architecture and instruction set.

The Software Crisis and Structured Programming (1960s-1970s)

As hardware became more powerful and affordable, software complexity grew exponentially. The NATO Software Engineering Conferences of 1968 and 1969 coined the term "software crisis," highlighting that projects were running over budget, over time, and producing unreliable software. This crisis led to:

  • Structured Programming: Pioneered by Edsger Dijkstra and others, this paradigm introduced disciplined control structures (if-then-else, loops) instead of chaotic goto statements.
  • The Waterfall Model: Winston Royce's 1970 paper (often mischaracterized) described a sequential model that would become the dominant methodology for decades.
  • Separation of Concerns: For the first time, distinct roles emerged—analysts, designers, programmers, testers, and operators.

The Rise of Personal Computing and Client-Server (1980s)

The 1980s brought personal computers and the client-server architecture. Software was now shipped on floppy disks and later CDs. This era saw:

  • Packaged Software: Companies like Microsoft began selling software as products.
  • Graphical User Interfaces: The Macintosh (1984) and Windows (1985) made computing accessible to non-technical users.
  • Networked Applications: With the growth of LANs, applications became distributed.
  • Formalized ITIL: The Information Technology Infrastructure Library emerged in the UK, providing a framework for IT service management, further codifying the separation between development (creating applications) and operations (running infrastructure).

The Internet Boom (1990s)

The commercialization of the internet in the mid-1990s changed everything. Companies like Amazon (1994), eBay (1995), and Google (1998) were born on the web, long before cloud computing as we know it existed. This period introduced:

  • Web Applications: Software was no longer installed but accessed via browsers.
  • LAMP Stack: Linux, Apache, MySQL, and PHP/Python/Perl became the dominant open-source web development platform.
  • Rapid Growth: The pressure to release features quickly to beat competitors intensified.
  • Dot-com Bubble: The frenzy led to massive investments and subsequent crash, but the foundational technologies survived.

The Agile Manifesto (2001)

By the late 1990s, the heavyweight, documentation-driven methodologies were creaking under the pressure of internet-speed development. Seventeen software developers met at a ski resort in Utah and crafted the Agile Manifesto, which emphasized:

  • Individuals and interactions over processes and tools
  • Working software over comprehensive documentation
  • Customer collaboration over contract negotiation
  • Responding to change over following a plan

Agile methodologies like Scrum, Extreme Programming (XP), and Kanban transformed how development teams worked, promoting iterative development, continuous feedback, and cross-functional collaboration. However, Agile focused primarily on developers and product owners—operations remained largely untouched.

1.2 From Waterfall to Agile

To understand the transition from Waterfall to Agile, we must examine both methodologies in depth.

The Waterfall Model

The Waterfall model, despite its widespread adoption, was never intended to be rigid. Royce's original paper actually recommended iteration. However, the model that emerged was strictly sequential:

  1. Requirements Analysis: Gather and document all requirements before any design begins.
  2. System Design: Create detailed architectural and design specifications based on requirements.
  3. Implementation: Write code according to the design documents.
  4. Testing: Verify that the implemented system meets the requirements.
  5. Deployment: Release the tested system to production.
  6. Maintenance: Fix issues and make enhancements post-release.

Challenges with Waterfall:

  • Late Feedback: Users don't see working software until very late in the process.
  • Change Resistance: Changing requirements mid-stream is expensive and disruptive.
  • Integration Hell: Integration happens at the end, often revealing conflicts and issues that require significant rework.
  • Long Release Cycles: Releases might take months or years.
  • Siloed Teams: Developers throw code "over the wall" to testers, who then throw it to operations.

The Agile Revolution

Agile methodologies emerged as a direct response to these challenges. The Agile Manifesto's 12 principles include:

  • Deliver working software frequently, from a couple of weeks to a couple of months.
  • Welcome changing requirements, even late in development.
  • Business people and developers must work together daily throughout the project.
  • Build projects around motivated individuals and trust them to get the job done.
  • Working software is the primary measure of progress.
  • Continuous attention to technical excellence and good design enhances agility.

Scrum became the most popular Agile framework, introducing:

  • Sprints: Time-boxed iterations (usually 2 weeks)
  • Roles: Product Owner, Scrum Master, Development Team
  • Ceremonies: Sprint Planning, Daily Stand-up, Sprint Review, Sprint Retrospective

Kanban offered a different approach:

  • Visualize workflow
  • Limit work in progress
  • Manage flow
  • Make process policies explicit
  • Improve collaboratively

The Gap Agile Created

While Agile dramatically improved development productivity, it inadvertently widened the gap between Dev and Ops. Developers were now releasing software every two weeks, but operations teams (still following ITIL) were accustomed to quarterly or annual releases. This created:

  • Deployment Conflicts: Developers wanted frequent releases; operations prioritized stability.
  • Environment Inconsistencies: Code worked on developer laptops but failed in production.
  • Blame Game: When production issues occurred, developers blamed operations for poor infrastructure, and operations blamed developers for buggy code.
  • Manual Handoffs: Each release required manual documentation, change requests, and deployment procedures.

1.3 The DevOps Movement

The term "DevOps" was coined in 2009 by Patrick Debois, who organized the first DevOpsDays conference in Ghent, Belgium. However, the ideas behind DevOps had been brewing for years.

The Agile Infrastructure Conversation

In 2008, at the Agile Conference in Toronto, Andrew Clay Shafer and Patrick Debois discussed the idea of "Agile Infrastructure." They realized that the principles of Agile—collaboration, iteration, feedback—could and should apply to operations. This conversation planted the seeds for what would become DevOps.

The Flickr Talk

At the 2009 Velocity Conference, John Allspaw and Paul Hammond from Flickr presented "10+ Deploys per Day: Dev and Ops Cooperation at Flickr." This groundbreaking talk showed how Flickr had broken down the barriers between development and operations, achieving unprecedented deployment frequency. The talk went viral in the tech community and catalyzed the DevOps movement.

Defining DevOps

DevOps is not a tool, a job title, or a specific technology. It's a cultural and professional movement that stresses communication, collaboration, and integration between software developers and IT operations professionals. At its core, DevOps aims to:

  • Break down silos between development, operations, and other stakeholders
  • Automate manual processes to increase efficiency and reduce errors
  • Measure everything to understand system behavior and business impact
  • Share knowledge, responsibility, and ownership across teams

The Three Ways

Gene Kim, in "The Phoenix Project" and "The DevOps Handbook," codified DevOps principles into "The Three Ways":

First Way: Systems Thinking (Flow)

  • Emphasizes the performance of the entire system, not just silos
  • Focus on creating fast, smooth flow from development to operations to the customer
  • Never pass known defects downstream
  • Optimize for global goals, not local efficiencies

Second Way: Amplify Feedback Loops

  • Create short, fast feedback loops from operations back to development
  • Enable quick detection and recovery from issues
  • Swarm problems to prevent recurrence
  • Build quality in by finding and fixing defects at the source

Third Way: Culture of Continuous Experimentation and Learning

  • Foster a culture that takes risks and learns from failure
  • Understand that repetition and practice are prerequisites to mastery
  • Allocate time for improvement of daily work
  • Introduce faults to increase resilience (chaos engineering)

1.4 CAMS Model (Culture, Automation, Measurement, Sharing)

The CAMS model, popularized by Damon Edwards and John Willis, provides a framework for understanding the core dimensions of DevOps.

Culture (The Foundation)

Culture is the most critical and most challenging aspect of DevOps. It encompasses:

  • Trust and Collaboration: Teams trust each other and collaborate across boundaries.
  • Shared Goals: Dev and Ops share responsibility for the entire service lifecycle.
  • Respect: Each team respects the others' expertise and constraints.
  • Experimentation: Failure is viewed as a learning opportunity, not a reason for punishment.
  • Continuous Improvement: Teams constantly seek ways to improve processes and systems.

Culture Anti-patterns:

  • Blaming individuals for system failures
  • Throwing work "over the wall" between teams
  • Hiding information or hoarding knowledge
  • Fear of change or experimentation

Automation (The Enabler)

Automation is what makes DevOps practices scalable and repeatable. Key areas include:

  • Infrastructure Automation: Provisioning servers, networks, and storage through code (Terraform, CloudFormation)
  • Configuration Automation: Managing system configurations (Ansible, Puppet, Chef)
  • Build and Deployment Automation: CI/CD pipelines (Jenkins, GitHub Actions)
  • Testing Automation: Automated unit, integration, and security tests
  • Environment Management: Consistent development, testing, and production environments

Automation Principles:

  • Automate repetitive, error-prone manual tasks
  • Version control everything (infrastructure, configuration, pipelines)
  • Treat automation code as production code (testing, review, documentation)
  • Start with the most painful manual processes first
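The last of these principles, treating automation code like production code, is easier to see with a concrete sketch. The hypothetical Python task below is idempotent (safe to rerun) and reports a changed/unchanged result in the style configuration tools such as Ansible use; the file name and setting are purely illustrative:

```python
import tempfile
from pathlib import Path

def ensure_line(path: Path, line: str) -> bool:
    """Idempotently ensure `line` is present in the file at `path`.

    Returns True if the file was changed, False if it was already in
    the desired state -- the 'changed' vs 'ok' convention that
    configuration management tools report.
    """
    existing = path.read_text().splitlines() if path.exists() else []
    if line in existing:
        return False  # already converged: running again is a no-op
    path.write_text("\n".join(existing + [line]) + "\n")
    return True

# Running the task twice demonstrates idempotency: only the first
# run reports a change.
cfg = Path(tempfile.mkdtemp()) / "sshd_config.sample"
first = ensure_line(cfg, "PermitRootLogin no")
second = ensure_line(cfg, "PermitRootLogin no")
print(first, second)  # True False
```

Because the task is a plain function with a deterministic result, it can be unit-tested and code-reviewed exactly like application code.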

Measurement (The Evidence)

You cannot improve what you cannot measure. Measurement in DevOps includes:

  • Deployment Metrics: Frequency, lead time, success rate
  • Operational Metrics: Availability, latency, throughput, error rates
  • Business Metrics: Customer satisfaction, revenue, feature adoption
  • Team Metrics: Morale, burnout, knowledge sharing

Key Performance Indicators (KPIs):

  • Deployment Frequency: How often do we deploy to production?
  • Lead Time for Changes: How long does it take from commit to running in production?
  • Mean Time to Recovery (MTTR): How quickly can we recover from failures?
  • Change Failure Rate: What percentage of changes cause degraded service?
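These four KPIs (the DORA metrics) can be computed mechanically from deployment records. A minimal Python sketch follows; the `Deployment` data model is illustrative, not a standard schema:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional

@dataclass
class Deployment:
    committed_at: datetime                  # first commit of the change
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause degraded service?
    restored_at: Optional[datetime] = None  # when service recovered, if failed

def dora_metrics(deploys, window_days):
    """Compute the four DORA metrics from a window of deployment records."""
    frequency = len(deploys) / window_days  # deployment frequency (per day)
    lead_time = median(d.deployed_at - d.committed_at for d in deploys)
    failures = [d for d in deploys if d.failed]
    change_failure_rate = len(failures) / len(deploys)
    # Time to restore, taken here as the median over failed deployments.
    time_to_restore = (median(d.restored_at - d.deployed_at for d in failures)
                       if failures else timedelta(0))
    return frequency, lead_time, change_failure_rate, time_to_restore

now = datetime(2025, 1, 10)
deploys = [
    Deployment(now - timedelta(hours=30), now - timedelta(hours=28)),
    Deployment(now - timedelta(hours=10), now - timedelta(hours=9),
               failed=True, restored_at=now - timedelta(hours=8, minutes=30)),
]
freq, lead, cfr, mttr = dora_metrics(deploys, window_days=7)
print(f"{freq:.2f}/day, lead {lead}, CFR {cfr:.0%}, restore {mttr}")
```

Medians are used rather than means so a single pathological deployment does not dominate the numbers; teams differ on this choice.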

Sharing (The Multiplier)

Sharing creates a virtuous cycle where knowledge and improvements propagate throughout the organization.

  • Cross-functional Teams: Dev and Ops work together on shared goals.
  • Knowledge Transfer: Pair programming, documentation, brown bag sessions.
  • Shared Tools and Platforms: Internal developer platforms, common toolchains.
  • Blame-free Postmortems: Share learnings from failures without fear of reprisal.
  • Open Source Contributions: Share innovations with the broader community.

1.5 DevOps vs Agile vs SRE

Understanding the distinctions and relationships between these complementary approaches is essential.

DevOps vs Agile

| Aspect | Agile | DevOps |
|---|---|---|
| Focus | Development practices | Full lifecycle (Dev+Ops) |
| Primary Goal | Deliver value iteratively | Deliver value continuously and reliably |
| Scope | Development team | Development + Operations + QA + Security |
| Timeframe | Sprint iterations | Continuous delivery pipeline |
| Key Practices | Stand-ups, retrospectives, story pointing | CI/CD, monitoring, infrastructure as code |
| Metrics | Velocity, story points | DORA metrics, SLIs/SLOs |

Relationship: Agile and DevOps are complementary. Agile improves how features are built; DevOps improves how those features are delivered and operated. Many organizations adopt Agile first, then DevOps to address operational bottlenecks.

DevOps vs SRE

Site Reliability Engineering (SRE) was pioneered at Google and codified by Ben Treynor Sloss. SRE applies software engineering principles to operations problems.

| Aspect | DevOps | SRE |
|---|---|---|
| Origin | Community movement | Google internal practice |
| Philosophy | Break down silos, collaborate | Apply software engineering to ops |
| Key Concept | CAMS model | Error budgets |
| Implementation | Cultural and technical practices | Specific roles and practices |
| Focus | Collaboration and automation | Reliability and scalability |

Relationship: Google describes SRE as "what happens when you ask a software engineer to design an operations team." Many consider SRE a specific implementation of DevOps principles with a stronger focus on reliability engineering.

Key SRE Practices:

  • Service Level Objectives (SLOs) and Error Budgets
  • Eliminating toil through automation
  • Monitoring and alerting design
  • Capacity planning
  • Incident response
  • Chaos engineering
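Error budgets reduce to simple arithmetic: an availability SLO directly implies how much downtime is tolerable per window. For example, a 99.9% SLO over 30 days leaves a 0.1% budget, roughly 43 minutes of acceptable downtime:

```python
from datetime import timedelta

def error_budget(slo: float, window: timedelta) -> timedelta:
    """Downtime permitted by an availability SLO over a window.

    A 99.9% SLO leaves a 0.1% error budget: the service may be
    unavailable for 0.1% of the window before the SLO is breached.
    """
    return (1.0 - slo) * window

budget = error_budget(0.999, timedelta(days=30))
print(budget)  # about 43 minutes per 30-day window
```

While budget remains, teams ship changes freely; once it is spent, releases pause in favor of reliability work.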

1.6 DevSecOps Overview

DevSecOps integrates security practices throughout the DevOps lifecycle rather than adding security as a final gate. The motto is "Security as Code" and "Shift Left" (moving security earlier in the development process).

Why DevSecOps?

Traditional security approaches created bottlenecks:

  • Security testing happened at the end of development
  • Security findings caused last-minute delays
  • Security teams were seen as blockers, not enablers
  • Vulnerabilities were discovered too late for easy remediation

DevSecOps Principles:

  1. Shift Left: Test security early and often throughout the pipeline
  2. Automate Security: Integrate automated security tools into CI/CD
  3. Security as Code: Define security policies and configurations in code
  4. Continuous Compliance: Automate compliance checking and reporting
  5. Shared Responsibility: Everyone owns security, not just the security team

Security Integration Points:

  • Code: SAST (Static Application Security Testing), secrets scanning
  • Dependencies: SCA (Software Composition Analysis), dependency scanning
  • Build: Container scanning, SBOM generation
  • Deploy: Policy as code, compliance validation
  • Runtime: DAST (Dynamic Application Security Testing), runtime protection
  • Infrastructure: Infrastructure scanning, cloud security posture management
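To make "shift left" concrete, the secrets-scanning step above can be sketched as a pre-commit-style check. The patterns below are illustrative only; real secret scanners ship large, curated, entropy-aware rule sets:

```python
import re

# Illustrative patterns only -- not a production rule set.
SECRET_PATTERNS = {
    "AWS access key": re.compile(r"AKIA[0-9A-Z]{16}"),
    "Private key header": re.compile(r"-----BEGIN (RSA|EC|OPENSSH) PRIVATE KEY-----"),
    "Hardcoded password": re.compile(r"password\s*=\s*['\"][^'\"]+['\"]", re.IGNORECASE),
}

def scan(text: str) -> list:
    """Return the names of secret patterns found in `text`."""
    return [name for name, pat in SECRET_PATTERNS.items() if pat.search(text)]

findings = scan('db_password = "hunter2"  # TODO remove')
print(findings)  # ['Hardcoded password']
```

Wired into CI (or a pre-commit hook), a check like this fails the build before a credential ever reaches the repository history, which is far cheaper than rotating a leaked key later.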

1.7 Platform Engineering Evolution

Platform Engineering has emerged as a natural evolution of DevOps practices, especially in large organizations. It focuses on building Internal Developer Platforms (IDPs) that abstract infrastructure complexity and provide self-service capabilities to development teams.

The Problem Platform Engineering Solves

As organizations scale, the cognitive load on developers increases:

  • Multiple cloud providers
  • Complex Kubernetes configurations
  • Numerous tools and technologies
  • Security and compliance requirements
  • Observability setup

Developers spend more time on infrastructure and tooling than on business logic.

What is an Internal Developer Platform?

An IDP is a cohesive layer of tools and services that development teams use to build, deploy, and operate applications without needing to understand the underlying infrastructure.

Key Capabilities:

  • Self-service provisioning of environments
  • Standardized deployment pipelines
  • Built-in security and compliance controls
  • Golden paths and paved roads
  • Observability and debugging tools
  • Documentation and onboarding

Platform Engineering vs DevOps

| Aspect | DevOps | Platform Engineering |
|---|---|---|
| Focus | Culture and practices | Building and maintaining platforms |
| Target | All teams | Platform team and application teams |
| Output | Improved collaboration | Internal developer platform |
| Key Metric | DORA metrics | Developer satisfaction, time-to-value |

1.8 DevOps Myths and Anti-Patterns

Common Myths:

Myth 1: DevOps is a tool or technology.
Reality: DevOps is fundamentally about culture and practices. Tools enable DevOps but don't create it.

Myth 2: DevOps means no operations team.
Reality: Operations responsibilities shift from manual management to building automation and platforms.

Myth 3: DevOps is only for startups.
Reality: Large enterprises like Amazon, Netflix, and Google have successfully adopted DevOps.

Myth 4: DevOps requires rewriting everything.
Reality: DevOps can be applied incrementally to existing systems and processes.

Myth 5: DevOps eliminates the need for testing.
Reality: Testing becomes more critical and more automated.

Anti-patterns:

  1. DevOps Team: Creating a separate "DevOps team" that acts as a silo defeats the purpose.

  2. Tools First: Buying and installing tools without addressing culture and processes.

  3. Automation Without Understanding: Automating broken processes just breaks things faster.

  4. No Measurement: Implementing practices without measuring their impact.

  5. Skipping Security: Treating security as an afterthought.

  6. Hero Culture: Relying on individuals to fix problems manually rather than building resilient systems.

  7. Ignoring Technical Debt: Accumulating technical debt that slows down delivery.

1.9 Business Impact of DevOps

Organizations that successfully implement DevOps see measurable business benefits:

Speed:

  • 200x more frequent deployments (DORA research)
  • 2,555x faster lead time from commit to deploy
  • Faster time-to-market for new features

Stability:

  • 3x lower change failure rate
  • 24x faster recovery from failures
  • 50% fewer outages

Security:

  • 50% less time spent on security remediation
  • Faster vulnerability patching
  • Improved compliance posture

Business Outcomes:

  • Higher customer satisfaction
  • Increased market share
  • Better employee retention
  • Lower operational costs
  • Improved innovation capacity

1.10 Case Studies

Netflix: Cloud Native Excellence

Netflix's DevOps journey is legendary. After a major database corruption in 2008 that prevented DVD shipments for days, Netflix committed to moving to AWS and embracing cloud-native architecture.

Key Practices:

  • Chaos Engineering: Simian Army tools (Chaos Monkey) deliberately cause failures to test resilience
  • Immutable Infrastructure: Servers are never patched; they're replaced
  • Microservices: Thousands of microservices running on AWS
  • Continuous Delivery: Thousands of deployments daily
  • Culture of Freedom and Responsibility: Engineers have significant autonomy and ownership

Results: Netflix achieved global scale, 99.99% availability, and the ability to deploy thousands of times daily.

Amazon: The Deployment Machine

Amazon's journey to DevOps was driven by CEO Jeff Bezos' mandate: all teams must expose their data and functionality through service interfaces, and teams must communicate only through these interfaces.

Key Practices:

  • Two-Pizza Teams: Small, autonomous teams (fewer than 10 people)
  • You Build It, You Run It: Teams own their services end-to-end
  • Single-threaded Ownership: Clear ownership without shared responsibility
  • Deployment Pipeline: Sophisticated pipeline enabling 50 million+ deployments annually
  • API Mandate: All communication through well-defined APIs

Results: Amazon performs tens of millions of deployments per year, averaging more than one per second, with each team deploying independently.

Google: SRE Pioneers

Google developed SRE to manage its massive scale. The SRE team at Google is responsible for keeping services running while maintaining a 50% cap on operational work—the rest is development work to improve systems.

Key Practices:

  • Error Budgets: 100% reliability is the wrong target; error budgets define acceptable unreliability
  • Borg/Omega/Kubernetes: Internal container orchestration evolved into Kubernetes
  • Blameless Postmortems: Focus on fixing systems, not blaming people
  • Toil Elimination: Automate away repetitive operational work
  • Capacity Planning: Data-driven approach to scaling

Results: Google maintains incredible reliability (Gmail 99.978%) while continuously deploying thousands of changes.


Chapter 2 — DevOps Culture & Organizational Design

2.1 Organizational Structures (Functional vs Product Teams)

The structure of an organization profoundly impacts its ability to implement DevOps. Understanding different organizational models is essential.

Functional (Siloed) Structure

In traditional IT organizations, teams are structured by function:

                    CEO
        ┌────────────┼────────────┐
    Development    QA        Operations
        │             │             │
    Dev Teams    QA Teams    Ops Teams

Characteristics:

  • Clear career paths within functions
  • Deep expertise in specific domains
  • Standardized practices within silos
  • Handoffs between teams
  • Local optimization over global outcomes

Problems with Functional Structure:

  • Slow handoffs create bottlenecks
  • Misaligned incentives (Dev wants features, Ops wants stability)
  • Blame culture when things go wrong
  • Knowledge silos
  • Difficulty implementing end-to-end ownership

Product-Aligned (Cross-functional) Structure

DevOps promotes organizing around products or services:

                    CEO
        ┌────────────┼────────────┐
    Product A   Product B    Product C
        │             │             │
    [Dev, QA, Ops] [Dev, QA, Ops] [Dev, QA, Ops]

Characteristics:

  • Teams own their product end-to-end
  • Members from different functions collaborate daily
  • Aligned incentives around product success
  • Faster decision-making
  • Clear ownership and accountability

Benefits:

  • Reduced handoffs and waiting times
  • Faster feedback loops
  • Better understanding of customer needs
  • Improved quality through ownership
  • Higher team morale and autonomy

Matrix Structure (Hybrid)

Some organizations use a matrix structure where individuals report to both functional and product managers:

                    CEO
        ┌────────────┼────────────┐
    Development     QA        Operations
        │            │            │
      A B C        A B C        A B C
        └────────────┼────────────┘
        Product teams A, B, and C each
        draw one member from every function

Benefits:

  • Maintain functional expertise while enabling product focus
  • Flexible resource allocation
  • Career development within functions

Challenges:

  • Conflicting priorities (functional vs product goals)
  • Complex reporting relationships
  • Potential for confusion and politics

2.2 Conway's Law

Conway's Law, formulated by Melvin Conway in 1967, states:

"Organizations which design systems ... are constrained to produce designs which are copies of the communication structures of these organizations."

In simpler terms: Your system architecture will mirror your organizational structure.

Implications for DevOps:

  1. Communication Patterns Become Architecture:

    • If teams communicate through tickets, the system will have slow, bureaucratic interfaces
    • If teams can talk directly, the system can have tight integration
    • If teams are siloed, the system will have siloed components
  2. Inverse Conway Maneuver:

    • To achieve a desired architecture, reorganize teams to match it
    • Want microservices? Create small, autonomous teams
    • Want a platform? Create a platform team that treats other teams as customers
  3. Team Boundaries:

    • Teams should own complete, loosely-coupled components
    • APIs between teams should be clean and well-documented
    • Teams should be able to deploy independently

Practical Application:

When designing microservices architecture:

  • Identify bounded contexts (domain-driven design)
  • Form teams around these contexts
  • Ensure teams have all necessary skills (cross-functional)
  • Define clear APIs between team-owned services
  • Enable independent deployment per team

2.3 Psychological Safety

Psychological safety, a concept popularized by Harvard professor Amy Edmondson, is crucial for high-performing DevOps teams. It's defined as "a shared belief that the team is safe for interpersonal risk-taking."

Why It Matters in DevOps:

  1. Blameless Culture: When incidents occur, teams need to investigate without fear of punishment.

  2. Experimentation: DevOps requires trying new things; psychological safety enables this.

  3. Learning from Failure: Only in safe environments do people openly discuss mistakes.

  4. Speaking Up: Team members need to raise concerns about security, quality, or process issues.

  5. Innovation: New ideas emerge when people feel safe sharing half-formed thoughts.

Building Psychological Safety:

For Leaders:

  • Model vulnerability by admitting your own mistakes
  • Ask questions, don't provide all answers
  • Frame work as learning problems, not execution problems
  • Acknowledge your own fallibility
  • Actively invite input from quieter team members

For Teams:

  • Establish ground rules for discussion
  • No interrupting or dismissing ideas
  • Focus on systems, not people, when things go wrong
  • Celebrate learning from failures
  • Create anonymous feedback channels

For Individuals:

  • Ask for help when needed
  • Offer help to others
  • Share your mistakes and what you learned
  • Assume good intentions from others

Measuring Psychological Safety:

  • Do team members feel comfortable admitting mistakes?
  • Are dissenting opinions expressed and heard?
  • Do people ask for help without hesitation?
  • Is failure discussed as a learning opportunity?
  • Are there diverse perspectives in decision-making?

2.4 Blameless Postmortems

The blameless postmortem is a cornerstone of DevOps culture. After an incident, teams conduct a thorough analysis focused on understanding what happened and preventing recurrence—not on assigning blame.

Principles of Blameless Postmortems:

  1. Assume Good Intentions: Everyone was doing their best with the information they had.

  2. Focus on Systems, Not People: Human error is a symptom of system problems.

  3. Fix the Process, Not the Person: If a person could make a mistake, the system allowed it.

  4. Share Learnings Widely: Postmortems should be public within the organization.

  5. Actionable Improvements: Every postmortem should produce concrete action items.

The Postmortem Process:

Immediate Response (During Incident):

  • Focus on restoring service
  • Document actions and timestamps
  • Preserve evidence (logs, metrics)

Post-Incident Analysis (24-48 hours after):

  • Gather all participants
  • Timeline reconstruction
  • Root cause analysis (multiple contributing factors)
  • Identify what went well and what didn't

Writing the Postmortem:

A good postmortem includes:

  • Executive Summary: Brief overview for leadership
  • Incident Details: Date, duration, impact, severity
  • Timeline: Chronological sequence of events
  • Root Cause: Technical explanation of what failed
  • Contributing Factors: Why the conditions existed
  • Detection: How the incident was discovered
  • Response: How the team handled it
  • Lessons Learned: What we now know
  • Action Items: Specific, assigned tasks with due dates

Example Action Items:

  • "Add monitoring for database connection pool exhaustion"
  • "Update deployment documentation with rollback procedure"
  • "Implement automated testing for migration scripts"
  • "Add canary deployment for configuration changes"

Common Pitfalls:

  • Superficial Analysis: Stopping at "human error" instead of digging deeper
  • No Action Items: Learning without implementing improvements
  • Blaming Language: "He should have..." instead of "The system allowed..."
  • Keeping Secrets: Hiding postmortems from other teams
  • Punishing Honesty: Making people regret speaking openly

2.5 DevOps Leadership

DevOps transformations require leadership at all levels, but especially from those in formal leadership positions.

Characteristics of DevOps Leaders:

  1. Servant Leadership: Leaders exist to serve and enable their teams, not the other way around.

  2. Systems Thinkers: Leaders understand how parts of the organization interact.

  3. Change Agents: They actively work to improve culture and processes.

  4. Technical Empathy: They understand technical challenges and constraints.

  5. Coaching Mindset: They develop people, not just deliver projects.

  6. Bias for Action: They value progress over perfection.

  7. Long-term Perspective: They invest in capabilities, not just immediate results.

Leadership Responsibilities:

Creating Vision:

  • Articulate why DevOps matters
  • Define success metrics
  • Communicate the transformation journey
  • Align DevOps goals with business objectives

Removing Obstacles:

  • Eliminate bureaucratic barriers
  • Provide resources and tools
  • Resolve organizational conflicts
  • Shield teams from distractions

Modeling Behavior:

  • Demonstrate blameless culture
  • Show vulnerability
  • Learn in public
  • Celebrate learning from failure

Building Capability:

  • Invest in training and development
  • Create career paths
  • Hire for culture add
  • Develop internal expertise

Measuring Progress:

  • Track DORA metrics
  • Survey team morale
  • Monitor business outcomes
  • Adjust strategy based on data

Leadership Anti-patterns:

  • Command and Control: Dictating solutions instead of enabling teams
  • Short-term Focus: Prioritizing immediate features over long-term capabilities
  • Inconsistent Messaging: Saying one thing but rewarding another
  • Fear-based Management: Using metrics to punish instead of improve
  • Hollow Empowerment: Saying "you're empowered" but overriding decisions

2.6 Change Management

DevOps transforms how organizations approach change—from rigid, approval-based processes to automated, verified, and continuous flows.

Traditional Change Management:

  • Change Advisory Board (CAB) approves all changes
  • Weekly or bi-weekly meetings
  • Paperwork-heavy requests
  • Focus on risk avoidance
  • Slow, batch-oriented

DevOps Change Management:

  • Automated validation and testing
  • Peer review through code review
  • Gradual rollout with monitoring
  • Fast rollback capability
  • Focus on risk management
  • Continuous, small changes

Key Principles:

  1. Changes Should Be Small: Small changes are easier to review, test, and roll back.

  2. Automate Where Possible: Automated testing replaces manual approval for many changes.

  3. Verification Over Approval: Prove changes work through testing rather than seeking permission.

  4. Gradual Exposure: Roll out changes progressively, monitoring impact.

  5. Emergency Changes Are Rare: If you need frequent emergency changes, your process is broken.

The Change Management Spectrum:

Type             Traditional               DevOps
Infrastructure   CAB approval              Terraform + automated testing
Application      Release manager           CI/CD pipeline + canary
Configuration    Ticket + manual           Git push + automated
Security         Pen test before release   Continuous scanning

When CAB Still Makes Sense:

  • Regulatory compliance requirements
  • Financial systems with audit mandates
  • Changes with no rollback option
  • External customer commitments
  • Initial transformation phase

2.7 DevOps Metrics for Management

Measuring DevOps success requires moving beyond traditional IT metrics.

DORA Metrics (Four Key Metrics):

The State of DevOps Reports, produced by DORA (DevOps Research and Assessment), identified four key metrics that predict organizational performance:

  1. Deployment Frequency: How often an organization successfully releases to production

    • Elite: Multiple deploys per day
    • High: Weekly to monthly
    • Medium: Monthly to every 6 months
    • Low: Fewer than once every six months
  2. Lead Time for Changes: The time from code commit to code successfully running in production

    • Elite: Less than one hour
    • High: One day to one week
    • Medium: One week to one month
    • Low: One month to six months
  3. Mean Time to Recovery (MTTR): How long it takes to restore service after an incident

    • Elite: Less than one hour
    • High: Less than one day
    • Medium: One day to one week
    • Low: One week to one month
  4. Change Failure Rate: The percentage of changes that result in degraded service

    • Elite: 0-15%
    • High, Medium, Low: 16-30% (the DORA research found no meaningful separation below the elite cohort)
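Two of the four metrics can be approximated directly from a deployment log. A minimal shell sketch, assuming a hypothetical deploys.csv with one "date,status" line per production deploy (the format is an illustration, not a standard):

```shell
# Fabricated deploy log: date,status (one line per production deploy)
cat > deploys.csv <<'EOF'
2024-01-02,success
2024-01-03,success
2024-01-03,failure
2024-01-05,success
2024-01-08,success
EOF

# Deployment frequency: number of distinct deploy days in the log window
days=$(cut -d, -f1 deploys.csv | sort -u | wc -l)

# Change failure rate: failed deploys as a percentage of all deploys
cfr=$(awk -F, '{t++; if ($2 == "failure") f++} END {printf "%d", f * 100 / t}' deploys.csv)

echo "deploy days: $days, change failure rate: ${cfr}%"
rm -f deploys.csv
```

Lead time and MTTR need commit and incident timestamps as well, so they usually come from the CI/CD system and the incident tracker rather than a single log.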

Additional Metrics:

Flow Metrics:

  • Deployment size (smaller is better)
  • Batch size (smaller is better)
  • Wait times between stages
  • Work in progress (WIP) limits

Quality Metrics:

  • Defect escape rate (bugs found in production)
  • Test coverage
  • Mean time to detection (MTTD)
  • Mean time between failures (MTBF)

Business Metrics:

  • Time to market for new features
  • Customer satisfaction (CSAT/NPS)
  • Revenue per employee
  • Feature adoption rate

Team Health Metrics:

  • Employee Net Promoter Score (eNPS)
  • Turnover rate
  • Burnout indicators
  • Learning and development hours

Metrics Anti-patterns:

  • Vanity Metrics: Numbers that look good but don't indicate real performance
  • Gaming the System: Optimizing metrics at the expense of actual outcomes
  • Comparing Teams: Using metrics to rank teams creates unhealthy competition
  • No Context: Metrics without understanding the underlying context
  • Measuring Everything: Analysis paralysis from too many metrics

2.8 Building High-Performance Teams

High-performing DevOps teams share common characteristics and practices.

Characteristics:

  1. Cross-functional Composition:

    • All skills needed to deliver value
    • No external dependencies for common tasks
    • T-shaped skills (deep in one area, broad in others)
  2. Clear Ownership:

    • End-to-end responsibility
    • Clear boundaries between teams
    • "You build it, you run it" mentality
  3. Autonomy with Alignment:

    • Freedom to choose how to achieve goals
    • Alignment on what goals matter
    • Guardrails, not gates
  4. Psychological Safety:

    • Safe to take risks
    • Open communication
    • Learning culture
  5. Continuous Improvement:

    • Regular retrospectives
    • Time for improvement work
    • Blameless problem-solving

Building Practices:

Team Formation:

  • Start with clear mission and boundaries
  • Include all necessary roles
  • Define success metrics together
  • Establish team norms and working agreements

Onboarding:

  • Structured mentorship program
  • Pair programming with experienced team members
  • Gradual responsibility increase
  • Documentation and learning resources

Team Rituals:

  • Daily stand-up (15 minutes max)
  • Regular planning sessions
  • Retrospectives (blameless, action-oriented)
  • Demo days or show-and-tell
  • Social activities

Knowledge Management:

  • Living documentation
  • Code comments and READMEs
  • Architecture decision records (ADRs)
  • Brown bag lunches
  • Internal tech talks

Career Development:

  • Individual growth plans
  • Technical and management tracks
  • Conference attendance and speaking
  • Internal mobility opportunities
  • Mentoring programs

2.9 InnerSource Model

InnerSource applies open source software development practices to internal software development.

What is InnerSource?

InnerSource takes the lessons learned from open source development (transparency, collaboration, meritocracy) and applies them within the corporate firewall. It enables developers from different teams to contribute to each other's codebases.

Core Principles:

  1. Open by Default: Code is visible to everyone in the organization.

  2. Voluntary Participation: Contributors choose what to work on.

  3. Meritocracy: Influence comes from contribution quality, not position.

  4. Asynchronous Collaboration: Work happens across time zones without constant coordination.

  5. Community Over Committee: Decisions emerge from community practice.

Benefits:

  • Reduced Duplication: Teams can reuse and improve existing code
  • Cross-team Collaboration: Breaking down silos organically
  • Skill Development: Developers learn from diverse codebases
  • Faster Innovation: More contributors finding and fixing problems
  • Standardization: Natural emergence of best practices

InnerSource Roles:

  • Trusted Committers: Maintainers who review and merge contributions
  • Contributors: Developers submitting improvements
  • Product Owners: Define direction and priorities
  • Users: Teams that depend on the code

InnerSource Workflow:

  1. Discover: Find a project to contribute to
  2. Understand: Read documentation and code
  3. Discuss: Open an issue or discussion
  4. Develop: Create your changes
  5. Submit: Open a pull request
  6. Review: Work with maintainers on feedback
  7. Merge: Code is accepted and deployed
  8. Celebrate: Recognition for contribution

Implementing InnerSource:

Start Small:

  • Choose one or two foundational projects
  • Document contribution guidelines clearly
  • Make it easy to find and build projects
  • Recognize and reward contributions

Infrastructure Needs:

  • Internal code hosting (GitHub Enterprise, GitLab)
  • CI/CD that works for external contributors
  • Clear documentation and onboarding
  • Communication channels (Slack, mailing lists)

Cultural Requirements:

  • Leadership support for cross-team work
  • Time allocated for contributing to other teams
  • Recognition for contributions
  • Trust that teams will make good decisions

2.10 DevOps Transformation Roadmap

Transforming to DevOps is a journey, not a destination. Here's a structured approach.

Phase 1: Foundation (3-6 months)

Goals:

  • Build awareness and understanding
  • Secure leadership buy-in
  • Identify pilot teams and projects
  • Establish basic metrics

Activities:

  • Executive workshops on DevOps principles
  • Assess current state and pain points
  • Form a DevOps Center of Excellence (optional)
  • Train pilot teams on DevOps basics
  • Implement version control for everything

Success Criteria:

  • Leadership alignment on transformation goals
  • Pilot teams identified and trained
  • Baseline metrics established
  • Initial version control adoption

Phase 2: Pilot (6-12 months)

Goals:

  • Demonstrate success with pilot teams
  • Build reusable patterns and practices
  • Develop internal expertise
  • Create momentum for broader adoption

Activities:

  • Implement CI/CD for pilot applications
  • Automate infrastructure provisioning
  • Establish monitoring and alerting
  • Conduct blameless postmortems
  • Document patterns and practices
  • Share successes across organization

Success Criteria:

  • Measurable improvements in DORA metrics for pilots
  • Repeatable patterns documented
  • Internal champions developed
  • Interest from other teams

Phase 3: Expand (12-24 months)

Goals:

  • Scale practices across organization
  • Standardize tools and platforms
  • Build internal platform/self-service capabilities
  • Embed DevOps in organizational processes

Activities:

  • Train all teams on DevOps practices
  • Implement standard toolchain
  • Build Internal Developer Platform
  • Update HR processes (hiring, reviews)
  • Integrate security (DevSecOps)
  • Establish communities of practice

Success Criteria:

  • Organization-wide adoption of core practices
  • Self-service platform available
  • Security integrated in pipelines
  • DevOps competencies in job descriptions

Phase 4: Optimize (24+ months)

Goals:

  • Continuous improvement culture
  • Experimentation and innovation
  • Industry leadership
  • Platform evolution

Activities:

  • Advanced practices (chaos engineering, SRE)
  • Machine learning for operations
  • Open source contributions
  • Publish case studies and speak at conferences
  • Evolve platform based on feedback

Success Criteria:

  • Elite DORA performance
  • Industry recognition
  • Attract and retain top talent
  • Business outcomes clearly linked to DevOps

Critical Success Factors:

  1. Leadership Commitment: Transformation requires sustained executive support
  2. Patience: Culture change takes years, not months
  3. Focus on Value: Always connect DevOps work to business outcomes
  4. Celebrate Wins: Recognize and share successes
  5. Learn from Failures: Treat setbacks as learning opportunities
  6. Stay Humble: There's always more to learn and improve

Chapter 3 — Linux & System Fundamentals for DevOps

3.1 Linux Architecture

Understanding Linux architecture is fundamental for any DevOps engineer. Linux powers the vast majority of servers, containers, and cloud infrastructure.

The Linux Kernel

The kernel is the core of the operating system, managing hardware resources and providing essential services:

Kernel Components:

  1. Process Scheduler (CPU Management):

    • Manages process execution
    • Implements scheduling policies (CFS - Completely Fair Scheduler)
    • Handles context switching
    • Manages CPU affinity and priorities
  2. Memory Manager:

    • Virtual memory management
    • Paging and swapping
    • Memory allocation (malloc/free)
    • Shared memory and memory mapping
    • Page cache for file I/O
  3. File System Manager:

    • Virtual File System (VFS) abstraction
    • Supports multiple file systems (ext4, XFS, Btrfs)
    • Inode management
    • File permissions and attributes
    • Journaling for reliability
  4. Network Stack:

    • Protocol implementations (TCP/IP, UDP)
    • Socket abstraction
    • Network device drivers
    • Firewall (netfilter/iptables/nftables)
    • Traffic control and QoS
  5. Device Drivers:

    • Interface with hardware devices
    • Character and block devices
    • USB, PCI, SCSI subsystems
    • Device model and sysfs
  6. Inter-process Communication (IPC):

    • Pipes and FIFOs
    • Message queues
    • Shared memory
    • Semaphores
    • Signals
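Of the IPC mechanisms above, named pipes (FIFOs) are the easiest to try from the shell. A minimal sketch in which a background writer and a foreground reader rendezvous through a FIFO:

```shell
# Create a FIFO in a throwaway directory
workdir=$(mktemp -d)
mkfifo "$workdir/chan"

# Writer in the background: open() blocks until a reader opens the FIFO
echo "ping" > "$workdir/chan" &

# Reader: blocks until data arrives, then both sides proceed
read -r msg < "$workdir/chan"
wait                                # reap the background writer

echo "received: $msg"
rm -r "$workdir"
```

The blocking open on both ends is the synchronization: neither process proceeds until the other has attached.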

User Space vs Kernel Space

Linux separates execution into two modes:

Kernel Space:

  • Runs in privileged mode
  • Direct hardware access
  • Memory protected from user space
  • Device drivers and core services

User Space:

  • Runs in unprivileged mode
  • Access to hardware only through kernel syscalls
  • Applications, libraries, and services
  • Isolated from other user processes

System Calls

User space programs request kernel services through system calls:

Application (user space)
        ↓
    Library call (glibc)
        ↓
    System call (int 0x80 / syscall)
        ↓
Kernel (kernel space)

Common system calls:

  • read(), write() - File I/O
  • fork(), exec() - Process creation
  • socket(), connect() - Network
  • mmap() - Memory mapping
  • open(), close() - File operations

File System Hierarchy

Linux follows the Filesystem Hierarchy Standard (FHS):

/ (root)
├── bin - Essential user binaries
├── boot - Boot loader files
├── dev - Device files
├── etc - System configuration
├── home - User home directories
├── lib - Essential shared libraries
├── media - Mount points for removable media
├── mnt - Temporarily mounted filesystems
├── opt - Optional application software
├── proc - Virtual filesystem for process info
├── root - Root user home
├── sbin - System binaries
├── sys - Virtual filesystem for system info
├── tmp - Temporary files
├── usr - User utilities and applications
│   ├── bin - User binaries
│   ├── lib - Libraries
│   ├── local - Locally installed software
│   └── share - Architecture-independent data
└── var - Variable data
    ├── log - Log files
    ├── mail - Mail spool
    └── tmp - Temporary files preserved across reboots

3.2 Process Management

Processes are the running instances of programs. Understanding process management is crucial for debugging and performance tuning.

Process States

A process can be in one of several states:

R (Running/Runnable): Process is executing or ready to execute
S (Sleeping): Waiting for an event (interruptible)
D (Uninterruptible Sleep): Waiting for I/O (usually disk)
T (Stopped): Stopped by job control signal
Z (Zombie): Terminated but not yet reaped by parent

Process Lifecycle:

  1. Creation: fork() creates a copy of parent, exec() loads new program
  2. Ready: Process is ready to run and waiting for CPU
  3. Running: Process is executing on CPU
  4. Waiting: Process waiting for I/O or event
  5. Terminated: Process finished execution
  6. Zombie: Waiting for parent to read exit status

Process Attributes:

  • PID (Process ID): Unique identifier
  • PPID (Parent PID): ID of parent process
  • UID/EUID: User ID and effective user ID
  • GID/EGID: Group ID and effective group ID
  • Priority/Nice value: Scheduling priority
  • Environment variables: Process environment
  • File descriptors: Open files and sockets

Process Management Commands:

Viewing Processes:

ps aux                    # All processes with details
ps -ef                    # Full format listing
top                       # Interactive process viewer
htop                      # Enhanced interactive viewer
pstree                    # Process tree
pgrep sshd                # Find PIDs by name

Process Control:

kill -TERM <PID>          # Terminate gracefully
kill -KILL <PID>          # Force kill
kill -STOP <PID>          # Suspend process
kill -CONT <PID>          # Resume process
nice -n 10 command        # Start with lower priority
renice 10 <PID>           # Change priority of running process

Background/Foreground:

command &                  # Run in background
Ctrl+Z                     # Suspend foreground job
jobs                       # List background jobs
bg %1                      # Resume job in background
fg %1                      # Bring job to foreground

Process Limits:

View and modify process limits with ulimit:

ulimit -a                  # Show all limits
ulimit -n 65536            # Max open files
ulimit -u 100              # Max user processes

Important limits:

  • nofile: Maximum open file descriptors
  • nproc: Maximum user processes
  • stack: Stack size
  • core: Core file size
  • memlock: Max locked-in-memory address space
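A common preflight pattern is to verify limits before starting a service, since exhausting nofile at runtime fails in confusing ways. A minimal sketch (the threshold is a hypothetical value, not a recommendation):

```shell
# Verify the soft open-files limit meets an assumed application minimum
required=64                         # hypothetical minimum for the app
current=$(ulimit -n)

if [ "$current" = "unlimited" ] || [ "$current" -ge "$required" ]; then
    status="ok"
else
    status="too low"
fi
echo "nofile=$current ($status)"
```

In a real startup script the "too low" branch would exit nonzero so the service manager reports the failure instead of the application crashing later.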

3.3 File Systems

Linux supports multiple file systems and provides a unified interface through the Virtual File System (VFS).

Common File Systems:

ext4 (Fourth Extended Filesystem):

  • Default for many Linux distributions
  • Journaling for reliability
  • Supports large files (up to 16TB) and volumes (up to 1EB)
  • Backward compatible with ext2/ext3

XFS:

  • High-performance, scalable
  • Excellent for large files and parallel I/O
  • Online defragmentation and resizing
  • Common for media and data-intensive applications

Btrfs (B-tree Filesystem):

  • Copy-on-write (COW) architecture
  • Built-in snapshots and rollback
  • Subvolumes and quotas
  • RAID support integrated
  • Checksums on data and metadata

ZFS (on Linux via OpenZFS):

  • Combined file system and volume manager
  • Data integrity with checksums
  • Snapshots, clones, and replication
  • Compression and deduplication
  • Originally from Solaris, now available on Linux

tmpfs:

  • Temporary file system in RAM
  • Fast but volatile
  • Mounted at /tmp, /run, /dev/shm

procfs and sysfs:

  • Virtual file systems for kernel interfaces
  • /proc: Process and system information
  • /sys: Device and kernel parameters

File System Operations:

Mounting and Unmounting:

mount /dev/sda1 /mnt/data        # Mount filesystem
umount /mnt/data                  # Unmount
mount -a                          # Mount all in fstab
findmnt                           # Show mount tree
df -h                             # Disk usage of mounted filesystems

Creating File Systems:

mkfs.ext4 /dev/sdb1              # Create ext4 filesystem
mkfs.xfs /dev/sdc1                # Create XFS filesystem
mkfs.btrfs /dev/sdd1              # Create Btrfs filesystem

Checking and Repairing:

fsck /dev/sda1                    # Check and repair
xfs_repair /dev/sdb1              # XFS repair
btrfs check /dev/sdc1             # Btrfs check

File System Tuning:

tune2fs -l /dev/sda1              # View ext4 parameters
xfs_info /dev/sdb1                 # View XFS parameters
btrfs filesystem show              # Show Btrfs info

Inodes and Directory Structure:

  • Inode: Metadata structure for files (permissions, ownership, timestamps, pointers to data blocks)
  • Directory: Mapping of filenames to inodes
  • Hard links: Multiple filenames pointing to same inode
  • Symbolic links: Special files pointing to other filenames
ls -i                             # Show inode numbers
stat file.txt                     # Show inode details
ln file.txt hardlink              # Create hard link
ln -s file.txt symlink            # Create symbolic link
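The difference between the two link types can be verified by comparing inode numbers: a hard link shares the original file's inode, while a symbolic link gets its own. A minimal sketch:

```shell
# Create a file, a hard link, and a symlink in a throwaway directory
workdir=$(mktemp -d)
echo "data" > "$workdir/file.txt"
ln "$workdir/file.txt" "$workdir/hardlink"
ln -s file.txt "$workdir/symlink"

# First column of ls -i is the inode number
ino_file=$(ls -i "$workdir/file.txt" | awk '{print $1}')
ino_hard=$(ls -i "$workdir/hardlink" | awk '{print $1}')
ino_sym=$(ls -i "$workdir/symlink"  | awk '{print $1}')

echo "file=$ino_file hardlink=$ino_hard symlink=$ino_sym"
rm -r "$workdir"
```

This is also why deleting the original file leaves a hard link working (the inode survives until its link count reaches zero) but leaves a symlink dangling.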

3.4 Networking Basics

Networking is fundamental to distributed systems. DevOps engineers must understand Linux networking deeply.

Network Stack Overview:

Application Layer (HTTP, DNS, SSH)
    ↓
Transport Layer (TCP, UDP)
    ↓
Network Layer (IP, ICMP)
    ↓
Link Layer (Ethernet, WiFi)
    ↓
Physical Hardware

Network Configuration:

Network Interfaces:

ip link                          # List network interfaces
ip addr show                     # Show IP addresses
ip route show                    # Show routing table
ethtool eth0                     # Show interface details
ss -tulpn                        # Show listening sockets

Interface Configuration (Netplan/ifupdown):

Modern Linux uses Netplan (Ubuntu) or NetworkManager:

# /etc/netplan/01-netcfg.yaml
network:
  version: 2
  ethernets:
    eth0:
      addresses:
        - 192.168.1.100/24
      routes:
        - to: default
          via: 192.168.1.1
      nameservers:
        addresses: [8.8.8.8, 8.8.4.4]

Network Namespaces:

Network namespaces provide isolated network stacks:

ip netns add red                 # Create namespace
ip netns exec red bash           # Run shell in namespace
ip link add veth0 type veth peer name veth1  # Virtual ethernet pair
ip link set veth0 netns red      # Move interface to namespace

Socket Programming Concepts:

  • Socket: Endpoint for communication
  • Port: 16-bit number identifying service
  • TCP: Connection-oriented, reliable, ordered
  • UDP: Connectionless, unreliable, unordered
  • UNIX domain sockets: IPC on same host

Common Network Services:

DNS (Domain Name System):

cat /etc/resolv.conf             # DNS configuration
dig example.com                  # DNS lookup
nslookup example.com              # Alternative lookup
host example.com                  # Simple lookup

HTTP/HTTPS:

curl -I https://example.com       # Fetch HTTP headers
wget https://example.com/file     # Download file
nc -v example.com 80              # Test TCP connection

Network Diagnostics:

ping -c 4 example.com             # Test connectivity
traceroute example.com             # Trace network path
mtr example.com                    # Combined ping+traceroute
ss -tulpn                         # Socket statistics
netstat -an                        # Network statistics (older)
tcpdump -i eth0 port 80           # Capture packets
nmap -p 1-1000 example.com         # Port scanning

Firewall with iptables/nftables:

iptables (legacy):

iptables -L                        # List rules
iptables -A INPUT -p tcp --dport 22 -j ACCEPT  # Allow SSH
iptables -A INPUT -j DROP          # Drop everything else
iptables-save > rules.txt          # Save rules

nftables (modern):

nft list ruleset                   # List all rules
nft add table inet filter          # Create table
nft add chain inet filter input { type filter hook input priority 0\; }
nft add rule inet filter input tcp dport 22 accept

3.5 Shell Scripting (Bash)

Shell scripting automates repetitive tasks and is essential for DevOps.

Bash Basics:

Shebang and Execution:

#!/bin/bash
# This is a comment

echo "Hello, World!"

Variables:

name="John"
echo "Hello, $name"
readonly constant="cannot change"
export ENV_VAR="visible to child processes"

Arrays:

fruits=("apple" "banana" "orange")
echo ${fruits[0]}                  # First element
echo ${fruits[@]}                   # All elements
echo ${#fruits[@]}                  # Array length

Conditionals:

if [ "$name" == "John" ]; then
    echo "Hello John"
elif [ "$name" == "Jane" ]; then
    echo "Hello Jane"
else
    echo "Hello stranger"
fi

# File tests
if [ -f "$file" ]; then            # File exists
if [ -d "$dir" ]; then              # Directory exists
if [ -x "$executable" ]; then       # Is executable

Loops:

# For loop
for i in {1..5}; do
    echo "Number $i"
done

# While loop
count=1
while [ $count -le 5 ]; do
    echo "Count $count"
    ((count++))
done

# Reading lines
while IFS= read -r line; do
    echo "Line: $line"
done < file.txt

Functions:

greet() {
    local name="$1"                 # Local variable
    echo "Hello, $name"
    return 0                        # Return status
}

greet "World"

Error Handling:

set -e                              # Exit on error
set -u                              # Exit on undefined variable
set -o pipefail                     # Pipe fails if any command fails

trap 'cleanup' EXIT                  # Run on exit
trap 'echo "Interrupted"; exit' INT  # Handle Ctrl+C
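Putting these together, a sketch of a script skeleton that combines strict mode with a cleanup trap, so temporary files are removed on success, failure, or interruption alike:

```shell
set -euo pipefail                   # strict mode: fail fast and loudly

tmpfile=$(mktemp)
cleanup() {
    rm -f "$tmpfile"                # runs on every exit path
    echo "cleaned up"
}
trap cleanup EXIT

echo "working" > "$tmpfile"
result=$(cat "$tmpfile")
echo "result: $result"
```

Because the trap is registered on EXIT rather than on individual signals, cleanup also runs when set -e aborts the script partway through.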

Practical DevOps Scripts:

Backup Script:

#!/bin/bash
set -euo pipefail

BACKUP_DIR="/backup/$(date +%Y%m%d)"
SOURCE_DIR="/data"

mkdir -p "$BACKUP_DIR"
tar -czf "$BACKUP_DIR/backup.tar.gz" "$SOURCE_DIR"

# Rotate old backups (keep 7 days); -mindepth protects /backup itself
find /backup -mindepth 1 -maxdepth 1 -type d -mtime +7 -exec rm -rf {} +

Health Check Script:

#!/bin/bash
check_service() {
    local host="$1"
    local port="$2"
    timeout 1 bash -c "echo >/dev/tcp/$host/$port" 2>/dev/null
    return $?
}

if check_service "localhost" 8080; then
    echo "Service is up"
else
    echo "Service is down"
    exit 1
fi

Deployment Script:

#!/bin/bash
set -e

VERSION="$1"
if [ -z "$VERSION" ]; then
    echo "Usage: $0 <version>"
    exit 1
fi

echo "Deploying version $VERSION"
./run_tests.sh
./build.sh "$VERSION"
scp "build/app-$VERSION" server:/apps/current
ssh server systemctl restart myapp

3.6 Systemd

Systemd is the init system and service manager for most modern Linux distributions.

Core Concepts:

  • Units: Resources managed by systemd (services, sockets, mounts, etc.)
  • Targets: Groups of units (like runlevels)
  • Journal: Centralized logging system

Unit Types:

  • .service: System services
  • .socket: IPC or network sockets
  • .device: Device files
  • .mount: Filesystem mount points
  • .timer: Scheduled tasks (cron replacement)
  • .target: Group of units

Service Unit Example:

# /etc/systemd/system/myapp.service
[Unit]
Description=My Application
After=network.target
Wants=redis.service
Requires=mongodb.service

[Service]
Type=simple
User=myapp
Group=myapp
WorkingDirectory=/opt/myapp
ExecStart=/usr/bin/node /opt/myapp/app.js
ExecReload=/bin/kill -HUP $MAINPID
Restart=always
RestartSec=10
Environment=NODE_ENV=production
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target

Common Commands:

systemctl start myapp             # Start service
systemctl stop myapp               # Stop service
systemctl restart myapp             # Restart service
systemctl reload myapp              # Reload configuration
systemctl status myapp              # Show status
systemctl enable myapp              # Enable at boot
systemctl disable myapp             # Disable at boot
systemctl daemon-reload             # Reload unit files

Journald (Logging):

journalctl -u myapp                # Show logs for service
journalctl -f                       # Follow logs
journalctl --since "1 hour ago"     # Time-based filter
journalctl -p err                    # Show only errors
journalctl _PID=1234                 # Filter by PID

Timer Units (Cron Replacement):

# /etc/systemd/system/backup.timer
[Unit]
Description=Daily backup timer

[Timer]
OnCalendar=daily
Persistent=true

[Install]
WantedBy=timers.target

# /etc/systemd/system/backup.service
[Unit]
Description=Daily backup

[Service]
Type=oneshot
ExecStart=/usr/local/bin/backup.sh

3.7 Package Management

Linux distributions use package managers to install, update, and remove software.

Debian/Ubuntu (apt/dpkg):

# Update package lists
apt update

# Upgrade all packages
apt upgrade

# Install package
apt install nginx

# Remove package
apt remove nginx

# Search packages
apt search nginx

# Show package info
apt show nginx

# List installed
dpkg -l

# Find which package owns a file
dpkg -S /etc/nginx/nginx.conf

Red Hat/CentOS/Fedora (yum/dnf):

# Update package lists
yum check-update

# Upgrade packages
yum update

# Install package
yum install nginx

# Remove package
yum remove nginx

# Search
yum search nginx

# Show info
yum info nginx

# List installed
rpm -qa

# Find package owner
rpm -qf /etc/nginx/nginx.conf

Building from Source:

Sometimes packages aren't available and you need to compile:

wget https://example.com/software.tar.gz
tar -xzf software.tar.gz
cd software
./configure --prefix=/usr/local
make
make install

3.8 Performance Monitoring

Performance monitoring helps identify bottlenecks and capacity issues.

CPU Monitoring:

top                             # Real-time process view
htop                            # Enhanced top
mpstat -P ALL 1                 # Per-CPU statistics
vmstat 1                        # System statistics
uptime                          # Load average
cat /proc/cpuinfo               # CPU information

Memory Monitoring:

free -h                         # Memory usage
vmstat 1                        # Virtual memory stats
cat /proc/meminfo               # Detailed memory info
smem                            # Memory per process

Disk I/O Monitoring:

iostat -x 1                     # Extended disk statistics
iotop                           # I/O per process
df -h                           # Filesystem usage
du -sh *                        # Directory sizes

Network Monitoring:

iftop                           # Network traffic by host
nethogs                         # Traffic by process
ss -tulpn                       # Socket statistics
sar -n DEV 1                    # Network statistics

System Performance Tuning:

Kernel Parameters (/etc/sysctl.conf):

# Increase network buffers
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216

# TCP tuning
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216

# File system
fs.file-max = 2097152

# Virtual memory
vm.swappiness = 10
vm.dirty_ratio = 40

Process Limits (/etc/security/limits.conf):

* soft nofile 65536
* hard nofile 65536
* soft nproc unlimited
* hard nproc unlimited

3.9 Log Management

Logs are crucial for troubleshooting and monitoring.

System Logs:

  • /var/log/syslog or /var/log/messages: General system logs
  • /var/log/auth.log: Authentication logs
  • /var/log/kern.log: Kernel messages
  • /var/log/dmesg: Boot messages
  • /var/log/nginx/: Nginx logs
  • /var/log/mysql/: MySQL logs

Log Rotation (logrotate):

# /etc/logrotate.d/nginx
/var/log/nginx/*.log {
    daily
    missingok
    rotate 14
    compress
    delaycompress
    notifempty
    create 0640 nginx adm
    sharedscripts
    postrotate
        [ -f /var/run/nginx.pid ] && kill -USR1 `cat /var/run/nginx.pid`
    endscript
}

Centralized Logging with rsyslog:

# /etc/rsyslog.conf
*.* @logserver.example.com:514    # Send all logs to remote server

Log Analysis:

# Count errors
grep -c "ERROR" app.log

# Tail with filtering
tail -f app.log | grep ERROR

# Find unique IPs
awk '{print $1}' access.log | sort | uniq -c | sort -nr

# Time-based analysis
grep "$(date +%Y-%m-%d)" app.log
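As a worked example, the same awk pipeline applied to a fabricated access log in combined log format (the log lines are invented for illustration):

```shell
# Fabricated access log in combined format
cat > access.log <<'EOF'
10.0.0.1 - - [23/Feb/2026:10:00:01 +0000] "GET / HTTP/1.1" 200 512
10.0.0.2 - - [23/Feb/2026:10:00:02 +0000] "GET /login HTTP/1.1" 500 128
10.0.0.1 - - [23/Feb/2026:10:00:03 +0000] "GET /api HTTP/1.1" 200 256
EOF

# Field 1 is the client IP; count occurrences, then sort by frequency
top_ip=$(awk '{print $1}' access.log | sort | uniq -c | sort -nr | head -1 | awk '{print $2}')

# Field 9 is the HTTP status; count server-side errors (5xx)
errors=$(awk '$9 >= 500 {n++} END {print n + 0}' access.log)

echo "busiest client: $top_ip, server errors: $errors"
rm -f access.log
```

The same field-counting idiom scales to gigabyte logs, which is why awk/sort/uniq pipelines remain the first tool reached for before a log is shipped to a centralized system.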

3.10 Hardening Linux Servers

Security is critical for production systems.

User and Access Management:

# Remove unnecessary users
userdel -r username

# Disable root SSH login
# In /etc/ssh/sshd_config:
# PermitRootLogin no

# Use SSH keys only
# PasswordAuthentication no

# Implement sudo with care
visudo

File Permissions:

# Secure sensitive files
chmod 600 /etc/shadow
chmod 644 /etc/passwd
chmod 600 /etc/ssh/sshd_config

# Set proper ownership
chown root:root /etc/passwd

Network Security:

# Basic firewall
ufw default deny incoming
ufw default allow outgoing
ufw allow ssh
ufw enable

# Disable unused services
systemctl disable bluetooth
systemctl disable cups

# Secure sysctl settings
# /etc/sysctl.d/99-security.conf
net.ipv4.conf.all.rp_filter = 1
net.ipv4.conf.all.accept_source_route = 0
net.ipv4.tcp_syncookies = 1

Filesystem Security:

# Mount options in /etc/fstab
# /dev/sda1 /home ext4 defaults,noexec,nosuid 0 2
# /tmp tmpfs tmpfs defaults,noexec,nosuid,nodev 0 0

Auditing and Monitoring:

# Install and configure auditd
auditctl -w /etc/passwd -p wa -k passwd_changes
auditctl -w /etc/shadow -p wa -k shadow_changes

# Check for unusual activity
lastb                           # Failed login attempts
last                            # Last logins
journalctl -u ssh                # SSH logs

Automatic Security Updates:

# Ubuntu/Debian
apt install unattended-upgrades
dpkg-reconfigure -plow unattended-upgrades

# Red Hat/CentOS
yum install yum-cron
systemctl enable yum-cron

Security Tools:

  • Lynis: Security auditing tool
  • ClamAV: Antivirus
  • rkhunter: Rootkit hunter
  • chkrootkit: Rootkit detector
  • fail2ban: Brute force protection

PART II — VERSION CONTROL & COLLABORATION

Chapter 4 — Git Internals & Advanced Workflows

4.1 Git Architecture (Objects, Trees, Commits)

Understanding Git's internal architecture demystifies its behavior and enables advanced usage.

The Object Database

Git is fundamentally a content-addressable filesystem with a VCS interface. Everything is stored as objects in the .git/objects directory.

Object Types:

  1. Blob: File contents (binary large object)
  2. Tree: Directory listings (filenames + permissions + blob references)
  3. Commit: Snapshot metadata (tree hash, parent, author, message)
  4. Tag: Named reference to a commit (optionally signed)

Object Storage:

Each object is identified by a SHA-1 hash of its content:

echo 'hello world' | git hash-object --stdin
# 3b18e512dba79e4c8300dd08aeb37f8e728b8dad

Objects are stored zlib-compressed under .git/objects/, with the first two characters of the hash as a directory name: .git/objects/3b/18e512dba79e4c8300dd08aeb37f8e728b8dad
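
The object database can be queried directly with the cat-file plumbing command. A self-contained sketch (the demo directory name is illustrative):

```shell
# demo: write a blob into a fresh repository, then inspect it
git init -q objdemo && cd objdemo

hash=$(echo 'hello world' | git hash-object -w --stdin)   # -w writes the object
git cat-file -t "$hash"    # blob
git cat-file -p "$hash"    # hello world
git cat-file -s "$hash"    # 12 (bytes, including the trailing newline)
```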

The Commit Graph

commit (hash: a1b2c3)
tree: d4e5f6
parent: f7e8d9 (previous commit)
author: John <john@example.com>
committer: John <john@example.com>
message: Add feature X
    ↓
tree (hash: d4e5f6)
    blob: 1a2b3c (README.md)
    blob: 4d5e6f (main.py)
    tree: 7a8b9c (lib/)
        blob: 0d1e2f (lib/utils.py)

References (Refs)

Refs are pointers to commits, stored in .git/refs/:

  • heads/: Local branches
  • remotes/: Remote tracking branches
  • tags/: Tags

HEAD is a special ref pointing to current branch or commit.

cat .git/HEAD
# ref: refs/heads/main

cat .git/refs/heads/main
# a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0

The Index (Staging Area)

The index is a binary file (.git/index) that represents the next commit. It's a sorted list of path names with blob hashes and file metadata.
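
The index contents can be dumped with git ls-files --stage. A minimal sketch (the demo directory and file names are illustrative):

```shell
# demo: stage a file, then dump the index
git init -q indexdemo && cd indexdemo
echo 'hello world' > hello.txt
git add hello.txt

# each entry: <mode> <blob hash> <stage number> <path>
git ls-files --stage
```

The stage number is 0 for normal entries; stages 1-3 appear only during merge conflicts.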

Plumbing vs Porcelain

Git commands are categorized as:

  • Porcelain: User-friendly commands (git add, git commit)
  • Plumbing: Low-level commands for scripting (git hash-object, git update-index)

Low-level Examples:

# Create blob
echo 'content' | git hash-object -w --stdin

# Create tree
git update-index --add --cacheinfo 100644 \
  $(git hash-object -w file.txt) file.txt
git write-tree

# Create commit
echo 'message' | git commit-tree TREE_HASH -p PARENT_HASH
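
Chained together, the plumbing commands above produce a complete commit with no porcelain at all. A self-contained sketch (demo names are illustrative):

```shell
# demo: build a commit from raw objects
git init -q plumbingdemo && cd plumbingdemo
git config user.name "Demo" && git config user.email "demo@example.com"

blob=$(echo 'content' | git hash-object -w --stdin)         # store file contents
git update-index --add --cacheinfo 100644 "$blob" file.txt  # stage it in the index
tree=$(git write-tree)                                      # snapshot the index as a tree
commit=$(echo 'initial commit' | git commit-tree "$tree")   # wrap the tree in a commit
git update-ref refs/heads/main "$commit"                    # point a branch at it

git log --oneline main    # <short hash> initial commit
```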

4.2 Branching Strategies

Branching strategies define how teams use branches for development.

Git Flow

Classic branching model by Vincent Driessen:

main (production)
  ↑
release/1.0 (staging)
  ↑
develop (integration)
  ↑
feature/new-feature (development)

Branches:

  • main: Production-ready code
  • develop: Integration branch
  • feature/*: New features (branch from develop)
  • release/*: Release preparation (branch from develop, merge to main and develop)
  • hotfix/*: Emergency fixes (branch from main, merge to main and develop)

Pros:

  • Clear structure
  • Works well for versioned releases
  • Good for larger teams

Cons:

  • Complex
  • Overkill for continuous delivery
  • Many branches to maintain

GitHub Flow

Simpler flow used by GitHub:

main (always deployable)
  ↑
feature/* → Pull Request → main

Principles:

  • main is always deployable
  • Create feature branches for changes
  • Open pull requests for review
  • Merge and deploy immediately

Pros:

  • Simple
  • Works with CI/CD
  • Continuous deployment friendly

Cons:

  • Less structure for releases
  • Can be chaotic with many changes

GitLab Flow

GitLab's hybrid approach:

production (or environment branches)
  ↑
pre-production
  ↑
main
  ↑
feature/*

Environment Branches:

  • production: Deployed to production
  • staging: Deployed to staging
  • main: Integration branch

Pros:

  • Environment-specific branches
  • Works well with deployment pipelines
  • Clear promotion path

Trunk-Based Development

All developers work on short-lived branches from main:

main ←─── short branch ───┐
     └─── short branch ───┤
      └─── short branch ──┤

Rules:

  • Branches live < 1 day
  • Small, frequent commits
  • Feature flags for incomplete work
  • Automated testing before merge

Pros:

  • Minimal merge conflicts
  • Continuous integration
  • Fast feedback

Cons:

  • Requires feature flags
  • Discipline required
  • Not suitable for all projects

4.3 Git Rebase vs Merge

Understanding the difference is crucial for clean history.

Merge

git checkout main
git merge feature

Result:

  • Creates merge commit
  • Preserves exact history
  • Shows when the branch actually happened

*   Merge branch 'feature' (main)
|\
| * Add feature (feature)
* | Update main (main)
|/
* Initial commit

Pros:

  • Preserves context
  • Safe (non-destructive)
  • Shows actual branch timeline

Cons:

  • Cluttered history
  • Many merge commits

Rebase

git checkout feature
git rebase main
git checkout main
git merge feature    # fast-forward

Result:

  • Replays commits on top of main
  • Linear history
  • No merge commits

* Add feature (main)
* Update main
* Initial commit

Pros:

  • Clean, linear history
  • Easier to read
  • Bisect friendly

Cons:

  • Rewrites history
  • Dangerous on shared branches
  • Loses branch context

Interactive Rebase

git rebase -i HEAD~3

Allows:

  • squash: Combine commits
  • reword: Change commit message
  • edit: Modify commit
  • drop: Remove commit
  • reorder: Change order
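
Running the command opens a todo list in your editor; changing the verbs rewrites history on save (hashes and messages below are illustrative):

```
pick a1b2c3 Add login form
squash d4e5f6 Fix typo in login form
reword 7f8e9d Add logout endpoint
```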

Golden Rule of Rebasing:

Never rebase commits that have been pushed to a shared repository. It will cause chaos for other developers.

When to Use What:

Use Merge When:

  • Merging a long-lived branch
  • Preserving branch history is important
  • Working on public/shared branch

Use Rebase When:

  • Updating feature branch with main
  • Cleaning up local commits before PR
  • Creating linear history

Squash and Merge (GitHub):

Combines all commits from feature branch into one commit on main. Good for keeping main history clean.
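
The same result is available from the command line with git merge --squash. A self-contained sketch (branch, directory, and file names are illustrative):

```shell
# demo: two feature commits collapse into one commit on main
git init -q squashdemo && cd squashdemo
git config user.name "Demo" && git config user.email "demo@example.com"
git checkout -q -b main
echo base > base.txt && git add base.txt && git commit -qm "base"

git checkout -q -b feature
echo one > f.txt  && git add f.txt && git commit -qm "wip 1"
echo two >> f.txt && git commit -qam "wip 2"

git checkout -q main
git merge --squash feature                 # stages the combined diff, no commit yet
git commit -qm "Add feature X (squashed)"
git log --oneline                          # squashed commit, then base
```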

4.4 Submodules

Submodules allow including external repositories within your repository.

Basic Usage:

# Add submodule
git submodule add https://github.com/user/lib.git lib

# Clone with submodules
git clone --recursive https://github.com/user/project.git

# Update submodules
git submodule update --init --recursive

# Pull latest in submodules
git submodule update --remote

.gitmodules File:

[submodule "lib"]
    path = lib
    url = https://github.com/user/lib.git
    branch = main

Challenges:

  1. Detached HEAD: Submodules are checked out at specific commits
  2. Updates: Need to commit submodule reference changes
  3. Collaboration: Team members must remember to update submodules

Alternatives:

  • Subtrees: Copy code into your repo (git subtree)
  • Package managers: npm, pip, maven, etc.
  • Monorepo: Single repository for all code

4.5 Monorepo vs Polyrepo

Monorepo (Single Repository)

All code in one repository.

Pros:

  • Atomic commits across projects
  • Easy code sharing
  • Simplified dependency management
  • Consistent tooling
  • Easier refactoring

Cons:

  • Scales poorly (Git struggles with huge repos)
  • Complex access control
  • Build system complexity
  • Learning curve

Examples: Google, Microsoft, Facebook

Polyrepo (Multiple Repositories)

Each project in its own repository.

Pros:

  • Clear ownership
  • Independent versioning
  • Simpler tooling per project
  • Better access control
  • Scales naturally

Cons:

  • Cross-repo changes are painful
  • Dependency hell
  • Inconsistent tooling
  • Duplication

Hybrid Approaches:

  • Repo orchestration tools: Google's repo, Microsoft's VFS for Git
  • Monorepo with modular build: Bazel, Pants, Please
  • Package-based monorepo: Lerna (JavaScript), Gradle (Java)

4.6 Git Hooks

Git hooks are scripts that run automatically on Git events.

Client-Side Hooks (.git/hooks/):

  • pre-commit: Before commit message editor
  • prepare-commit-msg: Before commit message editor (with template)
  • commit-msg: After commit message
  • post-commit: After commit
  • pre-push: Before push
  • pre-rebase: Before rebase
  • post-checkout: After checkout
  • post-merge: After merge

Server-Side Hooks:

  • pre-receive: Before accepting push
  • update: Per-branch pre-receive
  • post-receive: After push

Example pre-commit hook (linting):

#!/bin/bash
# .git/hooks/pre-commit

echo "Running linter..."
files=$(git diff --cached --name-only --diff-filter=ACM | grep '\.js$')
if [ -n "$files" ]; then
    eslint $files
    if [ $? -ne 0 ]; then
        echo "Linting failed"
        exit 1
    fi
fi
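
A commit-msg hook can enforce a message convention. A sketch that assumes Conventional Commits style prefixes (demo and file names are illustrative):

```shell
# demo: install a commit-msg hook, then commit with a conforming message
git init -q hookdemo && cd hookdemo
git config user.name "Demo" && git config user.email "demo@example.com"

cat > .git/hooks/commit-msg <<'EOF'
#!/bin/sh
# Git passes the path of the commit message file as $1
if ! grep -qE '^(feat|fix|docs|chore|refactor|test)(\([a-z0-9-]+\))?: .+' "$1"; then
    echo "Commit message must look like 'feat: short description'" >&2
    exit 1
fi
EOF
chmod +x .git/hooks/commit-msg

echo hi > greeting.txt && git add greeting.txt
git commit -qm "feat: add greeting"    # passes the hook
```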

Managing Hooks with Tools:

  • Husky (JavaScript): Manages hooks via package.json
  • pre-commit (Python): Framework for multi-language hooks
  • overcommit (Ruby): Extensible hook manager

4.7 Large Scale Git Management

Handling Large Repositories:

Shallow Clones:

git clone --depth 1 https://github.com/user/repo.git

Partial Clones:

git clone --filter=blob:none https://github.com/user/repo.git

Sparse Checkout:

git sparse-checkout set src/

Git LFS (Large File Storage):

Replaces large files with text pointers:

git lfs track "*.psd"
git add .gitattributes
git add file.psd
git commit -m "Add design file"

Performance Optimization:

  • git gc: Garbage collection
  • git repack: Optimize pack files
  • git fsck: Verify database integrity
  • git prune: Remove unreachable objects

Scaling Git Servers:

  • GitLab: Built for enterprise scale
  • GitHub: GitHub AE for large enterprises
  • Bitbucket Data Center: Clustered for scale
  • Gerrit: Code review focused, scales well

4.8 Code Review Best Practices

For Authors:

  1. Keep changes small: < 400 lines is ideal
  2. Write good descriptions: What, why, how
  3. Add context: Screenshots, test results
  4. Self-review first: Catch obvious issues
  5. Respond graciously: To all comments
  6. Explain changes: In comments and commits

For Reviewers:

  1. Review promptly: Within 24 hours ideally
  2. Be kind: Focus on code, not person
  3. Ask questions: "What do you think about..." not "You should..."
  4. Be specific: Point to exact lines and alternatives
  5. Prioritize: Security > correctness > style
  6. Approve thoughtfully: Understand the code

Code Review Checklist:

  • Does the code work?
  • Is it tested appropriately?
  • Is it secure?
  • Is it performant?
  • Is it maintainable?
  • Is it well-named?
  • Does it follow style guide?
  • Is documentation updated?
  • Are there edge cases?
  • Will it scale?

Automated Checks:

  • Linting: Enforce style
  • Static analysis: Find bugs
  • Test coverage: Ensure testing
  • Security scanning: Find vulnerabilities
  • Size checks: Prevent bloat

Chapter 5 — Platforms

5.1 GitHub Enterprise

GitHub Enterprise provides self-hosted or cloud-based GitHub for organizations.

Key Features:

Authentication and Authorization:

  • SAML/SSO integration
  • LDAP/Active Directory
  • Fine-grained permissions
  • Team synchronization

Security:

  • 2FA enforcement
  • Audit logging
  • Secret scanning
  • Dependency graph
  • Security advisories

Collaboration:

  • Protected branches
  • Required reviews
  • Code owners
  • Issue templates
  • Project boards

Actions:

  • Built-in CI/CD
  • Self-hosted runners
  • Marketplace integrations
  • Reusable workflows

API and Automation:

  • GraphQL API
  • REST API
  • Webhooks
  • GitHub Apps

Deployment Options:

GitHub Enterprise Cloud:

  • Hosted by GitHub
  • Enterprise features
  • SLA guarantee
  • Regular updates

GitHub Enterprise Server:

  • Self-hosted
  • Full control
  • Air-gapped possible
  • Upgrade on your schedule

5.2 GitLab CI/CD

GitLab provides integrated CI/CD with their repository platform.

Core Concepts:

.gitlab-ci.yml:

stages:
  - build
  - test
  - deploy

variables:
  DOCKER_DRIVER: overlay2

build:
  stage: build
  script:
    - docker build -t $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA .
    - docker push $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA

test:
  stage: test
  script:
    - npm install
    - npm test

deploy:
  stage: deploy
  script:
    - kubectl set image deployment/myapp app=$CI_REGISTRY_IMAGE:$CI_COMMIT_SHA
  only:
    - main

Runners:

  • Shared runners: Provided by GitLab
  • Group runners: Shared by group
  • Specific runners: Project-specific
  • Auto-scaling: Dynamic provisioning

Features:

  • Auto DevOps
  • Review Apps (ephemeral environments)
  • Container registry
  • Dependency scanning
  • License compliance
  • Browser testing

5.3 Bitbucket

Bitbucket, part of Atlassian, integrates well with Jira and other Atlassian tools.

Key Features:

Branch Permissions:

  • Restrict pushes
  • Require pull requests
  • Prevent deletion
  • Merge checks

Pull Requests:

  • Code reviews
  • Inline comments
  • Task lists
  • Approvals required

Pipelines:

  • Built-in CI/CD
  • Docker support
  • Service containers
  • Deployments to environments

Integration:

  • Jira integration
  • Slack notifications
  • Marketplace add-ons
  • REST API
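
A minimal bitbucket-pipelines.yml sketch for a Node.js project (image tag, step names, and scripts are illustrative):

```yaml
image: node:18

pipelines:
  default:                  # runs on every push
    - step:
        name: Build and test
        caches:
          - node
        script:
          - npm ci
          - npm test
  branches:
    main:                   # additional deploy step only on main
      - step:
          name: Deploy
          deployment: production
          script:
            - ./deploy.sh
```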

5.4 Pull Requests & Merge Requests

PRs (GitHub) and MRs (GitLab) are the primary code review mechanism.

Pull Request Lifecycle:

  1. Create branch from main
  2. Make changes and commit
  3. Push branch to remote
  4. Open PR with description
  5. Automated checks run
  6. Reviewers comment and approve
  7. Address feedback with more commits
  8. Merge when ready
  9. Delete branch

PR Templates:

## Description
[Describe the changes]

## Type of Change
- [ ] Bug fix
- [ ] New feature
- [ ] Breaking change
- [ ] Documentation update

## Testing
[Describe how you tested]

## Screenshots
[If applicable]

## Related Issues
Fixes #123

Best Practices:

  • Link to issues: Connect work to tracking
  • Use draft PRs: For work in progress
  • Small PRs: Easier to review
  • Descriptive titles: "Fix login bug" not "Update"
  • Self-review: Check your own PR first

5.5 Branch Protection Rules

Branch protection prevents force pushes and requires certain conditions before merging.

Common Rules:

Require pull request reviews:

  • Number of approvals required
  • Dismiss stale reviews
  • Require review from code owners

Require status checks:

  • CI must pass
  • Specific checks required
  • Branches must be up to date

Restrict who can push:

  • Specific users/teams
  • Admins included/excluded

Other rules:

  • No force pushes
  • No deletions
  • Include administrators
  • Linear history required

Example GitHub Settings:

{
  "required_status_checks": {
    "strict": true,
    "contexts": ["continuous-integration/jenkins"]
  },
  "enforce_admins": true,
  "required_pull_request_reviews": {
    "required_approving_review_count": 2,
    "dismiss_stale_reviews": true,
    "require_code_owner_reviews": true
  },
  "restrictions": null
}

5.6 Secrets in Repositories

Never store secrets in code. Use secret management tools.

What Not to Store:

  • API keys
  • Passwords
  • SSH keys
  • Database credentials
  • Tokens
  • Certificates

Secret Management Solutions:

GitHub Encrypted Secrets:

# In GitHub Actions
env:
  API_KEY: ${{ secrets.API_KEY }}

GitLab CI/CD Variables:

# Masked and protected variables
script:
  - echo "$CI_DEPLOY_PASSWORD"

HashiCorp Vault:

vault kv put secret/myapp api_key=12345

AWS Secrets Manager:

aws secretsmanager get-secret-value --secret-id myapp

Azure Key Vault:

az keyvault secret show --name api-key --vault-name myvault

Tools for Secret Detection:

  • git-secrets: Prevents committing secrets
  • truffleHog: Searches for secrets in Git history
  • GitHub secret scanning: Automatic detection
  • GitLab secret detection: Built-in scanning
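
Even without dedicated tooling, a pattern scan over tracked files catches the most obvious leaks. A crude sketch that only matches AWS access key IDs and is no substitute for real scanners (demo names are illustrative; AKIAIOSFODNN7EXAMPLE is AWS's documented example key):

```shell
# demo: a leaked key in a tracked file is caught by a simple scan
git init -q scandemo && cd scandemo
echo 'AWS_KEY=AKIAIOSFODNN7EXAMPLE' > config.env
git add config.env

git grep -nE 'AKIA[0-9A-Z]{16}'    # config.env:1:AWS_KEY=AKIAIOSFODNN7EXAMPLE
```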

5.7 Repository Security

Access Control:

  • Principle of least privilege: Grant minimum needed access
  • Regular audits: Review who has access
  • Team-based permissions: Manage groups, not individuals
  • SSO enforcement: Require corporate authentication

Security Features:

Signed Commits:

git commit -S -m "Signed commit"
git config commit.gpgsign true

Signed Tags:

git tag -s v1.0 -m "Signed tag"

Verified commits are displayed with a "Verified" badge in GitHub/GitLab.

Dependency Management:

  • Dependabot: Automated security updates
  • Renovate: Dependency update tool
  • Snyk: Vulnerability scanning
  • OWASP Dependency Check: Security scanning

Audit Logging:

Monitor for suspicious activity:

  • Repository access
  • Permission changes
  • Secret pushes
  • Branch deletions

Incident Response:

When secrets are exposed:

  1. Immediate: Revoke compromised credentials
  2. Investigate: Check access logs
  3. Rotate: Replace all affected secrets
  4. Notify: Inform affected parties
  5. Prevent: Improve scanning/prevention

PART III — CI/CD PIPELINES

Chapter 6 — Continuous Integration

6.1 CI Principles

Continuous Integration is the practice of merging all developer working copies to a shared mainline several times a day.

Core Principles:

  1. Maintain a single source repository: Everything needed to build should be in version control.

  2. Automate the build: One command should build the system.

  3. Make the build self-testing: Tests should be part of the build.

  4. Everyone commits to mainline every day: Avoid long-lived branches.

  5. Every commit should build on an integration machine: Catch problems early.

  6. Keep the build fast: Fast feedback encourages frequent commits.

  7. Test in a clone of production environment: Avoid environment-specific issues.

  8. Make it easy to get the latest deliverables: Artifacts should be easily accessible.

  9. Everyone can see what's happening: Transparency enables collaboration.

  10. Automate deployment: Make it trivial to deploy anywhere.

Benefits:

  • Reduced integration risk: Problems found early
  • Higher code quality: Constant testing
  • Faster delivery: Always releasable state
  • Improved visibility: Build status visible
  • Greater confidence: Automated verification

6.2 Build Automation

Build automation compiles source code into binary artifacts.

Build Tools by Language:

  • Java: Maven, Gradle, Ant
  • JavaScript: npm, yarn, webpack
  • Python: setuptools, poetry, pip
  • Go: go build, make
  • Ruby: rake, bundler
  • C/C++: make, cmake, ninja
  • .NET: MSBuild, dotnet CLI

Build Automation Goals:

  1. Repeatable: Same input → same output
  2. Fast: Minimize feedback time
  3. Idempotent: Can run multiple times
  4. Self-contained: No external dependencies
  5. Consistent: Same process everywhere

Build Script Example (Makefile):

.PHONY: build test clean

build:
	go build -o bin/app ./cmd/app

test:
	go test ./...

clean:
	rm -rf bin/

Build Pipeline Stages:

Source → Compile → Test → Package → Publish

  1. Compile: Convert source to binaries
  2. Test: Run unit and integration tests
  3. Package: Create deployable artifact (JAR, Docker image)
  4. Publish: Store artifact in repository

6.3 Artifact Management

Artifacts are the outputs of build processes that need to be stored and versioned.

Types of Artifacts:

  • Binaries (JAR, EXE, DLL)
  • Packages (DEB, RPM, NPM)
  • Container images
  • Documentation
  • Test reports
  • Configuration files

Artifact Repositories:

Language-specific:

  • Maven: Nexus, Artifactory, Archiva
  • npm: npm registry, Verdaccio
  • Python: PyPI, DevPI
  • Ruby: RubyGems, Geminabox
  • Go: Go proxy, Athens

Universal:

  • JFrog Artifactory: Multi-format support
  • Sonatype Nexus: Repository manager
  • Cloud-specific: AWS CodeArtifact, Azure Artifacts, GCP Artifact Registry

Container Registries:

  • Docker Hub
  • GitHub Container Registry
  • GitLab Container Registry
  • Amazon ECR
  • Azure ACR
  • Google GCR

Best Practices:

  1. Version everything: Use semantic versioning
  2. Immutable artifacts: Never change published artifacts
  3. Metadata: Store build info, commit hash, timestamps
  4. Retention policies: Automatically clean old artifacts
  5. Security scanning: Scan artifacts for vulnerabilities
  6. Access control: Who can read/write artifacts
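
Versioning starts in Git: an annotated tag pins the exact commit an artifact was built from, and git describe recovers that version later. A self-contained sketch (demo names and version numbers are illustrative):

```shell
# demo: pin a release with an annotated tag
git init -q tagdemo && cd tagdemo
git config user.name "Demo" && git config user.email "demo@example.com"
echo v1 > app.txt && git add app.txt && git commit -qm "release prep"

git tag -a v1.4.2 -m "Release 1.4.2"
git describe --tags    # v1.4.2
```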

Artifact Lifecycle:

Build    →  Stage    →  Release     →  Retire
  ↑           ↑            ↑              ↑
Snapshot    Testing    Production      Delete

6.4 Pipeline as Code

Define CI/CD pipelines in code, stored in version control.

Benefits:

  • Version control: Track changes to pipeline
  • Code review: Review pipeline changes
  • Reusability: Share pipeline templates
  • Consistency: Same process everywhere
  • Documentation: Pipeline as executable documentation

Examples:

GitHub Actions:

name: CI
on: [push]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - run: npm install
      - run: npm test

GitLab CI:

stages:
  - build
  - test

build:
  stage: build
  script:
    - go build ./...

test:
  stage: test
  script:
    - go test ./...

Jenkinsfile (Declarative):

pipeline {
    agent any
    stages {
        stage('Build') {
            steps {
                sh 'make build'
            }
        }
        stage('Test') {
            steps {
                sh 'make test'
            }
        }
    }
}

Pipeline Patterns:

DRY (Don't Repeat Yourself):

# Reusable workflow
.build-template: &build-template
  stage: build
  script:
    - docker build -t $IMAGE .

build-app:
  <<: *build-template
  variables:
    IMAGE: app

build-api:
  <<: *build-template
  variables:
    IMAGE: api

6.5 Testing Strategies

Testing in CI/CD requires a comprehensive strategy.

Testing Pyramid:

    /\    E2E Tests (slow, expensive)
   /  \   Integration Tests
  /----\  Component Tests
 /------\ Unit Tests (fast, cheap)
/--------\

Unit Tests:

  • Test individual functions/classes
  • Fast execution (< 100ms each)
  • No external dependencies
  • High coverage (70-80%+)

Integration Tests:

  • Test component interactions
  • May use databases, APIs
  • Slower but more realistic
  • Medium coverage

Component Tests:

  • Test entire component in isolation
  • Mock external dependencies
  • Contract testing with consumers

E2E Tests:

  • Test complete user journeys
  • Full system with all dependencies
  • Slow and brittle
  • Few critical paths only

Other Test Types:

Smoke Tests: Quick sanity checks after deployment

Performance Tests: Load, stress, soak testing

Security Tests: Vulnerability scanning, penetration testing

Mutation Tests: Validate test quality by introducing bugs

Contract Tests: Ensure API compatibility

Test Automation Best Practices:

  1. Run fast tests first: Fail fast
  2. Parallelize tests: Speed up execution
  3. Quarantine flaky tests: Don't block pipeline
  4. Test data management: Consistent test data
  5. Test reporting: Clear results and trends
  6. Test environment parity: Match production

6.6 Parallel Builds

Parallel execution speeds up CI pipelines.

Types of Parallelism:

  1. Test parallelization: Run tests across multiple workers
  2. Matrix builds: Test multiple versions/configurations
  3. Stage parallelization: Run independent stages simultaneously

GitHub Actions Matrix:

jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        node: [14, 16, 18]
        os: [ubuntu-latest, windows-latest]
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-node@v2
        with:
          node-version: ${{ matrix.node }}
      - run: npm test

Test Splitting:

# Split tests by timing
jest --maxWorkers=4 --shard=1/4
jest --maxWorkers=4 --shard=2/4
jest --maxWorkers=4 --shard=3/4
jest --maxWorkers=4 --shard=4/4

Parallel Stages in GitLab:

stages:
  - test
  - deploy

test:
  stage: test
  parallel: 5
  script:
    - ./run-tests.sh $CI_NODE_INDEX $CI_NODE_TOTAL

6.7 Caching & Optimization

Caching reduces build times by reusing previous work.

Cacheable Items:

  • Dependency packages (node_modules, vendor/bundle)
  • Compiled artifacts (.class, .pyc)
  • Docker layers
  • Test results
  • Build tools

GitHub Actions Caching:

- name: Cache node_modules
  uses: actions/cache@v2
  with:
    path: node_modules
    key: ${{ runner.os }}-node-${{ hashFiles('package-lock.json') }}
    restore-keys: |
      ${{ runner.os }}-node-

Docker Layer Caching:

# Cache dependencies first
COPY package*.json ./
RUN npm install          # layer is reused unless package*.json changes
COPY . .
RUN npm run build

Optimization Techniques:

  1. Incremental builds: Only rebuild changed code
  2. Conditional execution: Skip stages when not needed
  3. Build artifacts: Save intermediate outputs
  4. Dependency caching: Cache package managers
  5. Workspace reuse: Reuse workspace across jobs
  6. Container caching: Use cached base images

Pipeline Optimization Checklist:

  • Fast feedback (< 10 minutes)
  • Parallel execution where possible
  • Caching dependencies
  • Skipping irrelevant jobs
  • Efficient test ordering
  • Build only changed code

Chapter 7 — CI Tools

7.1 Jenkins Architecture

Jenkins is the most widely used open-source automation server.

Core Architecture:

User → Jenkins UI/API
        ↓
   Jenkins Master
        ↓
   Build Queue
        ↓
   Build Executors (Master or Agents)

Jenkins Master:

  • Web UI and API
  • Job configuration
  • Build queue management
  • Monitoring and reporting
  • Plugin management

Jenkins Agents (Nodes):

  • Execute builds
  • Distributed across machines
  • Different environments
  • Label-based selection

Installation Options:

  • WAR file: java -jar jenkins.war
  • Package: apt/yum install jenkins
  • Docker: docker run jenkins/jenkins
  • Kubernetes: Jenkins Helm chart

Jenkins Pipeline:

Declarative Pipeline:

pipeline {
    agent any
    stages {
        stage('Build') {
            steps {
                sh 'make build'
            }
        }
        stage('Test') {
            steps {
                sh 'make test'
            }
        }
        stage('Deploy') {
            when {
                branch 'main'
            }
            steps {
                sh 'make deploy'
            }
        }
    }
    post {
        always {
            cleanWs()
        }
        failure {
            slackSend(color: 'danger', message: "Build failed")
        }
    }
}

Scripted Pipeline:

node {
    try {
        stage('Checkout') {
            checkout scm
        }
        stage('Build') {
            sh 'make build'
        }
        stage('Test') {
            sh 'make test'
        }
    } catch (err) {
        currentBuild.result = 'FAILURE'
        throw err
    } finally {
        cleanWs()
    }
}

Shared Libraries:

Reusable pipeline code across projects:

// vars/buildGo.groovy
def call(String version = '1.16') {
    sh "docker run --rm -v $PWD:/app -w /app golang:$version go build"
}

Jenkins Configuration as Code (JCasC):

jenkins:
  systemMessage: "Jenkins configured by JCasC"
  securityRealm:
    ldap:
      configurations:
        - server: ldap.example.com
          rootDN: dc=example,dc=com
  authorizationStrategy:
    globalMatrix:
      permissions:
        - "Overall/Administer:admin"

7.2 GitHub Actions

GitHub-native CI/CD tightly integrated with repositories.

Core Concepts:

  • Workflows: YAML files in .github/workflows/
  • Events: Triggers (push, pull_request, schedule)
  • Jobs: Groups of steps that run on runners
  • Steps: Individual tasks (run commands or actions)
  • Actions: Reusable units of code
  • Runners: Virtual machines that execute jobs

Workflow Structure:

name: CI
on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]

env:
  NODE_VERSION: 16

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      
      - name: Setup Node
        uses: actions/setup-node@v2
        with:
          node-version: ${{ env.NODE_VERSION }}
          
      - name: Install dependencies
        run: npm ci
        
      - name: Run tests
        run: npm test
        
      - name: Upload artifacts
        uses: actions/upload-artifact@v2
        with:
          name: build-output
          path: dist/

Custom Actions:

Docker Container Action:

name: 'My Action'
description: 'Does something'
runs:
  using: 'docker'
  image: 'Dockerfile'

JavaScript Action:

name: 'My Action'
description: 'Does something'
runs:
  using: 'node12'
  main: 'index.js'

Composite Action:

name: 'Composite Action'
description: 'Combines steps'
runs:
  using: 'composite'
  steps:
    - run: echo Hello
      shell: bash

Workflow Features:

  • Matrix strategies: Test multiple configurations
  • Environments: Protection rules and secrets
  • Concurrency: Control parallel runs
  • Dependencies: needs keyword
  • Conditionals: if conditions
  • Reusable workflows: Call workflows from workflows

7.3 GitLab CI

Integrated CI/CD with GitLab's DevOps platform.

.gitlab-ci.yml Structure:

stages:
  - build
  - test
  - deploy

variables:
  DOCKER_DRIVER: overlay2
  IMAGE_TAG: $CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA

cache:
  paths:
    - node_modules/

before_script:
  - docker login -u $CI_REGISTRY_USER -p $CI_REGISTRY_PASSWORD $CI_REGISTRY

build:
  stage: build
  script:
    - docker build -t $IMAGE_TAG .
    - docker push $IMAGE_TAG
  only:
    - main

test:
  stage: test
  script:
    - npm ci
    - npm test

deploy_staging:
  stage: deploy
  script:
    - kubectl set image deployment/app app=$IMAGE_TAG
  environment:
    name: staging
    url: https://staging.example.com
  only:
    - main

deploy_production:
  stage: deploy
  script:
    - kubectl set image deployment/app app=$IMAGE_TAG
  environment:
    name: production
    url: https://example.com
  when: manual
  only:
    - main

Key Features:

  • Review Apps: Ephemeral environments for merge requests
  • Auto DevOps: Preconfigured CI/CD
  • Multi-project pipelines: Cross-project dependencies
  • Parent-child pipelines: Dynamic pipeline generation
  • Rules: Advanced conditional logic
  • Includes: Include external YAML files

GitLab Runners:

  • Shared: Provided by GitLab.com
  • Group: Shared within group
  • Project: Dedicated to project
  • Specific: Custom configuration

Runner Configuration (config.toml):

concurrent = 10
[[runners]]
  name = "docker-runner"
  url = "https://gitlab.com"
  token = "xxxxx"
  executor = "docker"
  [runners.docker]
    image = "alpine"
    volumes = ["/cache"]

7.4 CircleCI

Cloud-native CI/CD with focus on speed and convenience.

Configuration (.circleci/config.yml):

version: 2.1

orbs:
  node: circleci/node@5.0.0

jobs:
  build:
    docker:
      - image: cimg/node:16.10
        auth:
          username: mydockerhub-user
          password: $DOCKERHUB_PASSWORD
    steps:
      - checkout
      - node/install-packages:
          pkg-manager: npm
      - run:
          name: Run tests
          command: npm test
      - persist_to_workspace:
          root: ~/project
          paths:
            - .

  deploy:
    docker:
      - image: cimg/base:2022.06
    steps:
      - attach_workspace:
          at: ~/project
      - run:
          name: Deploy to production
          command: ./deploy.sh

workflows:
  version: 2
  build_and_deploy:
    jobs:
      - build
      - deploy:
          requires:
            - build
          filters:
            branches:
              only: main

CircleCI Concepts:

  • Orbs: Reusable configuration packages
  • Executors: Docker, machine, macOS, Windows
  • Workspaces: Persist data between jobs
  • Caching: Speed up dependency installation
  • Contexts: Share environment variables across projects
  • SSH debugging: Debug builds interactively

7.5 Azure DevOps

Microsoft's enterprise DevOps platform.

Pipelines (YAML):

trigger:
- main

pool:
  vmImage: ubuntu-latest

variables:
  buildConfiguration: 'Release'
  majorVersion: 1
  minorVersion: 0

stages:
- stage: Build
  jobs:
  - job: BuildJob
    steps:
    - task: DotNetCoreCLI@2
      inputs:
        command: 'build'
        projects: '**/*.csproj'
        arguments: '--configuration $(buildConfiguration)'
    
    - task: DotNetCoreCLI@2
      inputs:
        command: 'test'
        projects: '**/*Tests.csproj'
        arguments: '--configuration $(buildConfiguration)'
    
    - task: DotNetCoreCLI@2
      inputs:
        command: 'publish'
        publishWebProjects: true
        arguments: '--configuration $(buildConfiguration) --output $(Build.ArtifactStagingDirectory)'
    
    - task: PublishBuildArtifacts@1
      inputs:
        PathtoPublish: '$(Build.ArtifactStagingDirectory)'
        ArtifactName: 'drop'

- stage: Deploy
  jobs:
  - deployment: DeployWeb
    environment: 'production'
    strategy:
      runOnce:
        deploy:
          steps:
          - task: AzureWebApp@1
            inputs:
              azureSubscription: 'my-connection'
              appName: 'my-app'
              package: '$(Pipeline.Workspace)/drop/**/*.zip'

Azure DevOps Components:

  • Azure Pipelines: CI/CD
  • Azure Repos: Git repositories
  • Azure Boards: Work tracking
  • Azure Test Plans: Testing tools
  • Azure Artifacts: Package management

Key Features:

  • Multi-stage pipelines: Visual designer
  • Environments: Track deployments
  • Approvals: Manual intervention
  • Gates: Automated health checks
  • Service connections: Connect to Azure services
  • Task groups: Reusable task collections

7.6 Pipeline Security

Securing CI/CD pipelines is critical: they hold credentials for source code, artifact registries, and production environments.

Security Principles:

  1. Least privilege: Minimal permissions
  2. Isolation: Separate build environments
  3. Secrets management: Never expose secrets
  4. Input validation: Protect against injection
  5. Audit logging: Track all changes
  6. Dependency verification: Verify third-party code

Common Threats:

Credential Exposure:

  • Secrets in logs
  • Hardcoded credentials
  • Exposed environment variables

Supply Chain Attacks:

  • Compromised dependencies
  • Malicious packages
  • Typosquatting

Pipeline Tampering:

  • Unauthorized pipeline changes
  • Malicious commits
  • Build environment compromise

Security Best Practices:

Secrets:

# NEVER do this
- run: echo "password=12345"  # Bad!

# Use secrets
- run: echo "password=$SECRET"
  env:
    SECRET: ${{ secrets.MY_SECRET }}
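Beyond referencing secrets from the platform's store, CI systems typically mask registered secret values in job logs before the lines are written. A toy sketch of that masking (the function and its behavior are illustrative, not any particular CI product's implementation):

```python
def mask_secrets(line: str, secrets: list[str]) -> str:
    """Replace every occurrence of a known secret value with asterisks.

    CI systems such as GitHub Actions apply a similar transform to job
    log output for every registered secret.
    """
    for secret in secrets:
        if secret:  # never mask on an empty string
            line = line.replace(secret, "***")
    return line

print(mask_secrets("connecting with password=hunter2", ["hunter2"]))
# → connecting with password=***
```

Masking is best-effort — it cannot catch secrets that are encoded or split across lines — which is why the primary rule remains: never echo secrets at all.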

OIDC (OpenID Connect):

# Instead of long-lived secrets, exchange a short-lived OIDC token
# for cloud credentials (the job needs `permissions: id-token: write`)
- name: Configure AWS credentials
  uses: aws-actions/configure-aws-credentials@v1
  with:
    role-to-assume: arn:aws:iam::123456789:role/GitHubActions
    aws-region: us-east-1

Signed Commits:

  • Require signed commits for sensitive repos
  • Verify commit signatures in pipeline

Dependency Verification:

# Verify package integrity
- run: npm audit
- run: npm ci --ignore-scripts  # Disable install scripts

Isolation:

  • Use ephemeral runners
  • Network isolation
  • Container sandboxing

7.7 Scaling CI Infrastructure

As teams grow, CI infrastructure needs to scale.

Scaling Strategies:

1. Horizontal Scaling:

  • Add more build agents
  • Auto-scaling based on queue
  • Multiple regions/zones

2. Vertical Scaling:

  • Bigger machines
  • More CPU/memory per build
  • Faster storage (SSD)

3. Build Optimization:

  • Caching dependencies
  • Parallel test execution
  • Incremental builds
  • Skipping unnecessary builds
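The queue-based auto-scaling from strategy 1 reduces to a simple sizing rule: one agent per N queued jobs, clamped so the pool neither vanishes when idle nor grows without bound. A minimal sketch (the parameter names are illustrative):

```python
import math

def desired_agents(queued_jobs: int, jobs_per_agent: int,
                   min_agents: int, max_agents: int) -> int:
    """Scale the agent pool to the queue: one agent per
    `jobs_per_agent` queued jobs, clamped to [min_agents, max_agents]."""
    needed = math.ceil(queued_jobs / jobs_per_agent) if queued_jobs else 0
    return max(min_agents, min(max_agents, needed))

print(desired_agents(queued_jobs=23, jobs_per_agent=5,
                     min_agents=1, max_agents=10))  # → 5
```

Real autoscalers add hysteresis (scale down more slowly than up) to avoid thrashing when the queue oscillates.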

Jenkins Scaling:

Master-Agent Setup:

pipeline {
    agent { label 'linux && large' }
    stages {
        stage('Build') {
            steps {
                sh 'make build'
            }
        }
    }
}

Dynamic Agents (Kubernetes):

apiVersion: v1
kind: Pod
spec:
  containers:
  - name: jnlp
    image: jenkins/inbound-agent
  - name: golang
    image: golang:1.16
    command:
    - cat
  - name: docker
    image: docker:20.10
    command:
    - cat
    volumeMounts:
    - name: docker-sock
      mountPath: /var/run/docker.sock
  volumes:
  - name: docker-sock
    hostPath:
      path: /var/run/docker.sock

GitHub Actions Scaling:

  • Self-hosted runners: Custom machines
  • Runner groups: Organization/enterprise level
  • Auto-scaling: Dynamic provisioning

Self-hosted Runner Auto-scaling (Azure):

resource "azuredevops_agent_pool" "pool" {
  name           = "my-pool"
  auto_provision = true
}

resource "azuredevops_elastic_pool" "elastic" {
  name                = "my-elastic-pool"
  service_endpoint_id = azuredevops_serviceendpoint_azurerm.az.id
  
  azure_resource_id = azurerm_linux_virtual_machine_scale_set.vmss.id
  
  desired_idle = 1
  max_capacity = 10
}

Monitoring CI Infrastructure:

Key metrics:

  • Queue time
  • Build duration
  • Success/failure rate
  • Agent utilization
  • Cost per build

Cost Optimization:

  • Use spot/preemptible instances
  • Auto-scale down when idle
  • Cache effectively
  • Right-size instances

Chapter 8 — Continuous Delivery & Deployment

8.1 CD vs Continuous Deployment

Continuous Delivery

Every change is deployable, but deployment may be manual.

Commit → Build → Test → Staging → Manual Approval → Production
                           ↑
                     Always deployable

Key Characteristics:

  • Software always in releasable state
  • Deployment is a business decision
  • Manual approval for production
  • Compliance and audit gates

Continuous Deployment

Every change that passes tests is automatically deployed.

Commit → Build → Test → Staging → Auto → Production
                                     ↑
                            Automated promotion

Key Characteristics:

  • Fully automated pipeline
  • No manual intervention
  • Multiple daily deployments
  • Requires high confidence in testing

Choosing Between Them:

Continuous Delivery is better when:

  • Regulatory/compliance requirements
  • Business needs release coordination
  • Low deployment frequency is acceptable
  • Building confidence gradually

Continuous Deployment is better when:

  • SaaS/cloud native applications
  • High deployment frequency desired
  • Strong automated testing
  • Feature flags in place
  • Individual deployments are small and low-risk

8.2 Deployment Strategies

Blue/Green Deployment

Two identical environments, one live (blue), one idle (green).

Before switch:
Users → Blue (v1)    Green (v2 - idle)

After switch:
Users → Green (v2)    Blue (v1 - idle)

Implementation:

# Kubernetes with labels
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-blue
spec:
  replicas: 10
  template:
    metadata:
      labels:
        version: blue
---
apiVersion: v1
kind: Service
metadata:
  name: app-service
spec:
  selector:
    version: blue  # Switch to green when ready

Pros:

  • Instant rollback (switch back)
  • No downtime
  • Staging environment always available

Cons:

  • Double infrastructure cost
  • Database schema challenges

Canary Deployment

Gradually shift traffic to new version.

Users → 90% → v1
        10% → v2 (canary)

If successful: increase to 25%, 50%, 100%
If problems: route back to 100% v1

Kubernetes with Istio:

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: app
spec:
  hosts:
  - app
  http:
  - match:
    - headers:
        canary:
          exact: "true"
    route:
    - destination:
        host: app
        subset: v2
      weight: 100
  - route:
    - destination:
        host: app
        subset: v1
      weight: 90
    - destination:
        host: app
        subset: v2
      weight: 10

Pros:

  • Real traffic testing
  • Gradual risk exposure
  • Canary analysis

Cons:

  • Complex routing
  • Longer deployment time
  • Requires monitoring
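The weighted routing above can be simulated in a few lines. Hashing the user ID into a stable bucket makes the canary "sticky" — each user consistently sees the same version — which real routers often prefer over per-request randomness. A sketch under those assumptions:

```python
import hashlib

def route(user_id: str, canary_weight: int) -> str:
    """Sticky canary routing: hash the user ID into a 0-99 bucket so
    `canary_weight` percent of users land on v2, and the same user
    always gets the same answer."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "v2" if bucket < canary_weight else "v1"

# With a 10% canary, roughly one user in ten is routed to v2
assignments = [route(f"user-{i}", 10) for i in range(1000)]
print(assignments.count("v2"))  # roughly 100
```

Promoting the canary is then just raising `canary_weight` toward 100; rolling back is setting it to 0.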

Rolling Deployment

Gradually replace instances.

v1 → v1 → v1 → v1 → v1 → v1 → v1 → v1 → v1 → v1
v2 → v2 → v1 → v1 → v1 → v1 → v1 → v1 → v1 → v1
v2 → v2 → v2 → v2 → v1 → v1 → v1 → v1 → v1 → v1
v2 → v2 → v2 → v2 → v2 → v2 → v2 → v2 → v2 → v2

Kubernetes Rolling Update:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
spec:
  replicas: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2        # How many extra pods
      maxUnavailable: 1   # How many can be down
  template:
    spec:
      containers:
      - image: app:v2
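The two tuning knobs translate directly into a pod-count envelope: with 10 replicas, maxSurge: 2 and maxUnavailable: 1 mean at most 12 pods exist and at least 9 are available at any instant. A sketch of that arithmetic (absolute values only; both fields also accept percentages):

```python
def rollout_bounds(replicas: int, max_surge: int, max_unavailable: int) -> dict:
    """Pod-count envelope during a RollingUpdate: Kubernetes never
    runs more than replicas + maxSurge total pods, and never lets
    available pods drop below replicas - maxUnavailable."""
    return {
        "max_total": replicas + max_surge,
        "min_available": replicas - max_unavailable,
    }

print(rollout_bounds(replicas=10, max_surge=2, max_unavailable=1))
# → {'max_total': 12, 'min_available': 9}
```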

Pros:

  • No extra infrastructure
  • Gradual replacement
  • Kubernetes native

Cons:

  • Slower rollout
  • Complex rollback
  • Version mix during deployment

Shadow Deployment

Run new version alongside old, mirror traffic but discard responses.

User → v1 (serves response)
   ↓
   → v2 (shadow - discard response)

Pros:

  • Test with production traffic
  • No user impact
  • Performance comparison

Cons:

  • Double resource usage
  • No feedback to users
  • Complex implementation

8.3 Feature Flags

Feature flags (toggles) enable deploying incomplete features safely.

Types of Flags:

  1. Release toggles: Control feature visibility
  2. Experiment toggles: A/B testing
  3. Ops toggles: Operational controls
  4. Permission toggles: User targeting

Implementation:

# Simple flag check
if feature_flags.is_enabled('new-checkout'):
    return new_checkout_flow()
else:
    return old_checkout_flow()

Targeting Rules:

// LaunchDarkly example
const context = { key: user.id, email: user.email };
const showFeature = ldclient.variation('new-feature', context, false);

Flag Management Systems:

  • LaunchDarkly: Enterprise feature management
  • Split.io: Feature experimentation
  • Flagsmith: Open source
  • Unleash: Open source
  • ConfigCat: Simple feature flags
  • Custom: Database + cache

Best Practices:

  1. Short-lived flags: Remove after rollout
  2. Flag naming: Clear and consistent
  3. Audit logging: Track flag changes
  4. Default to off: Safe fallback
  5. Flag hygiene: Regular cleanup
  6. Testing: Test with flags on/off
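Percentage rollouts — a common targeting rule in all of the systems listed above — are usually implemented with consistent-hash bucketing. A minimal sketch (names are illustrative; folding the flag name into the hash decorrelates rollouts so the same 10% of users are not the guinea pigs for every flag):

```python
import hashlib

def is_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    """Percentage rollout: hash flag+user into a stable 0-99 bucket.
    A user's answer for a given flag never changes as long as the
    rollout percentage stays the same or increases past their bucket."""
    key = f"{flag}:{user_id}".encode()
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100
    return bucket < rollout_percent

print(is_enabled("new-checkout", "user-42", 100))  # → True
print(is_enabled("new-checkout", "user-42", 0))    # → False
```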

8.4 Database Migration Strategies

Database changes are often the riskiest part of deployment.

Principles:

  1. Separate schema changes from code changes
  2. Forward and backward compatible
  3. Automated migrations
  4. Testable rollbacks

Migration Types:

Expand/Migrate/Contract Pattern:

Phase 1: Expand
- Add new column (nullable)
- Dual-write to both columns

Phase 2: Migrate
- Backfill data to new column
- Migrate reads to new column

Phase 3: Contract
- Remove old column
- Remove dual-write
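The Expand and Migrate phases can be exercised end-to-end against an in-memory SQLite database. A sketch with a hypothetical `users` schema (the Contract phase would later drop the old source column once no reader depends on it):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("INSERT INTO users (id, name) VALUES (1, 'ada'), (2, 'lin')")

# Phase 1 - Expand: add the new column as nullable,
# so old code that never writes it keeps working
db.execute("ALTER TABLE users ADD COLUMN email TEXT")

# Phase 2 - Migrate: backfill existing rows
db.execute("UPDATE users SET email = name || '@example.com' "
           "WHERE email IS NULL")

print(db.execute("SELECT email FROM users ORDER BY id").fetchall())
# → [('ada@example.com',), ('lin@example.com',)]
```

On a large production table the backfill would run in batches to avoid long-held locks.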

Online Schema Change Tools:

  • gh-ost: GitHub's online schema migration
  • pt-online-schema-change: Percona Toolkit
  • Liquibase: Database refactoring
  • Flyway: Version control for databases
  • Alembic: Python migrations

Example Flyway Migration:

-- V1__initial_schema.sql
CREATE TABLE users (
    id INT PRIMARY KEY,
    name VARCHAR(255)
);

-- V2__add_email.sql
ALTER TABLE users ADD COLUMN email VARCHAR(255);

-- V3__populate_email.sql
UPDATE users SET email = CONCAT(name, '@example.com');

Zero-Downtime Migration Strategy:

1. Add nullable column
2. Dual-write to new column (code change)
3. Backfill data
4. Make column non-nullable (if needed)
5. Remove old column (future release)

8.5 Rollbacks & Recovery

Despite best efforts, things go wrong. Be prepared.

Rollback Strategies:

Version Rollback:

  • Revert to previous artifact
  • Simple and fast
  • Loses new features

Feature Flag Rollback:

  • Disable problematic feature
  • No deployment needed
  • Keep other features

Database Rollback:

  • Restore from backup
  • Apply compensating transactions
  • Forward-only migrations (avoid rollbacks)

Automated Rollback Triggers:

# Canary analysis with automated rollback
deploy:
  strategy:
    canary:
      steps:
      - setWeight: 10
      - pause:
          duration: 5m
      - analysis:
          metrics:
          - name: error-rate
            threshold: 1
      - setWeight: 50
      - pause:
          duration: 5m
      - analysis:
          metrics:
          - name: error-rate
            threshold: 1
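The analysis step above boils down to comparing observed metric samples against a threshold during the pause window. A toy version of that decision (real canary analysis tools weigh multiple metrics and statistical significance):

```python
def should_rollback(error_rates: list[float], threshold: float) -> bool:
    """Roll back if any error-rate sample observed during the pause
    window exceeds the configured threshold."""
    return any(rate > threshold for rate in error_rates)

print(should_rollback([0.2, 0.4, 1.8], threshold=1.0))  # → True
print(should_rollback([0.2, 0.4, 0.9], threshold=1.0))  # → False
```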

Rollback Procedure:

  1. Detect the problem (monitoring)
  2. Decide to roll back (automated or manual)
  3. Execute rollback (deploy previous version)
  4. Verify system is healthy
  5. Post-mortem to prevent recurrence

8.6 GitOps Workflow

GitOps uses Git as the single source of truth for declarative infrastructure and applications.

Core Principles:

  1. Declarative description: Entire system described in Git
  2. Git as source of truth: Cluster state matches Git
  3. Automated convergence: Software ensures cluster matches Git
  4. Pull-based deployments: Cluster pulls changes

GitOps Architecture:

Developer pushes to Git
        ↓
   Git Repository
        ↓
GitOps Operator (ArgoCD/Flux)
        ↓
   Kubernetes Cluster

Benefits:

  • Audit trail: All changes in Git
  • Faster recovery: Recreate from Git
  • Standard workflows: Use Git tools
  • Security: Pull model reduces credentials
  • Observability: Drift detection
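The convergence loop an operator such as ArgoCD or Flux runs can be reduced to "diff desired state (from Git) against actual state, then apply the difference". A toy sketch of one reconcile pass over plain dictionaries (real operators do this continuously against the Kubernetes API):

```python
def reconcile(desired: dict, actual: dict) -> list[str]:
    """One pass of a GitOps convergence loop: emit the actions needed
    to make the actual state match the desired state."""
    actions = []
    for name, spec in desired.items():
        if name not in actual:
            actions.append(f"create {name}")
        elif actual[name] != spec:
            actions.append(f"update {name}")  # drift or new revision
    for name in actual:
        if name not in desired:
            actions.append(f"prune {name}")   # like `prune: true`
    return actions

desired = {"web": {"replicas": 3}, "api": {"replicas": 2}}
actual = {"web": {"replicas": 2}, "old-job": {"replicas": 1}}
print(reconcile(desired, actual))
# → ['update web', 'create api', 'prune old-job']
```

The `selfHeal` option corresponds to running this loop even when the change originated in the cluster rather than in Git.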

ArgoCD Example:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: myapp
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/user/repo.git
    targetRevision: HEAD
    path: k8s
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true
      selfHeal: true

Flux Example:

apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: GitRepository
metadata:
  name: myapp
  namespace: flux-system
spec:
  interval: 1m
  url: https://github.com/user/repo
  ref:
    branch: main
---
apiVersion: kustomize.toolkit.fluxcd.io/v1beta2
kind: Kustomization
metadata:
  name: myapp
  namespace: flux-system
spec:
  interval: 10m
  path: ./k8s
  prune: true
  sourceRef:
    kind: GitRepository
    name: myapp

PART IV — CONTAINERS & ORCHESTRATION

Chapter 9 — Containerization

9.1 Container Fundamentals

Containers provide lightweight virtualization at the OS level.

What are Containers?

Containers package an application with its dependencies, libraries, and configuration files, running isolated from other processes on the same host.

Containers vs Virtual Machines:

Aspect          Containers          Virtual Machines
──────          ──────────          ────────────────
Isolation       Process-level       Hardware-level
OS              Share host kernel   Each has guest OS
Startup         Milliseconds        Minutes
Size            MB                  GB
Performance     Native              Some overhead
Resource usage  Lightweight         Heavy

Container Technologies:

  • LXC (Linux Containers): Original Linux containers
  • Docker: Most popular container platform
  • Podman: Daemonless container engine
  • containerd: Industry-standard runtime
  • CRI-O: Kubernetes-specific runtime

Linux Kernel Features:

Namespaces: Isolate process views

  • PID: Process IDs
  • NET: Network interfaces
  • MNT: Mount points
  • UTS: Hostname
  • IPC: Inter-process communication
  • USER: User IDs

Cgroups (Control Groups): Limit resources

  • CPU shares/quota
  • Memory limits
  • Block I/O
  • Network bandwidth

Union Filesystems: Layer management

  • OverlayFS
  • AUFS
  • Device Mapper

9.2 Docker Internals

Docker Architecture:

Client (docker CLI)
    ↓
Docker Daemon (dockerd)
    ↓
Containerd
    ↓
runc (OCI runtime)
    ↓
Container

Components:

  • docker CLI: User interface
  • dockerd: Persistent daemon
  • containerd: Container lifecycle management
  • runc: OCI runtime (creates containers)
  • containerd-shim: Parent of container processes

Images and Layers:

Docker images are built in layers:

Layer 4: CMD ["node", "app.js"]
Layer 3: COPY . /app
Layer 2: RUN npm install
Layer 1: FROM node:16
      ↓
Union mount at runtime

Layer Caching:

Each layer is cached. When rebuilding:

  • Unchanged layers reused
  • Changed layers and all subsequent rebuilt
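This invalidation rule — the first changed instruction forces a rebuild of itself and everything after it — can be sketched in a few lines (simplified: real cache keys also hash the file contents behind COPY/ADD, not just the instruction text):

```python
def layers_to_rebuild(layers: list[str], changed: set[str]) -> list[str]:
    """Docker layer caching: the first changed layer invalidates the
    cache for itself and every subsequent layer."""
    for i, layer in enumerate(layers):
        if layer in changed:
            return layers[i:]
    return []  # full cache hit

dockerfile = ["FROM node:16", "RUN npm install",
              "COPY . /app", "CMD node app.js"]
print(layers_to_rebuild(dockerfile, changed={"COPY . /app"}))
# → ['COPY . /app', 'CMD node app.js']
```

This is why ordering Dockerfile instructions from least- to most-frequently changed (dependencies before source) pays off.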

Docker Storage Drivers:

  • overlay2: Default (recommended)
  • devicemapper: Legacy
  • btrfs/zfs: Advanced features
  • vfs: No copy-on-write

Network Drivers:

  • bridge: Default, NAT through host
  • host: Use host network directly
  • overlay: Multi-host networking
  • macvlan: Assign MAC addresses
  • none: No networking

9.3 Dockerfiles Best Practices

Base Images:

# Use specific tags, not latest
FROM node:16.14.2-alpine

# Use minimal base images
FROM alpine:3.15

Layer Optimization:

# Bad - each RUN creates layer
RUN apt-get update
RUN apt-get install -y curl
RUN apt-get clean

# Good - combine commands
RUN apt-get update && \
    apt-get install -y curl && \
    apt-get clean

Order Matters:

# Copy dependency files first (cached longer)
COPY package*.json ./
RUN npm install

# Copy source last (changes frequently)
COPY . .

Multi-stage Builds:

# Build stage
FROM node:16 AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

# Production stage
FROM nginx:alpine
COPY --from=builder /app/dist /usr/share/nginx/html
COPY nginx.conf /etc/nginx/nginx.conf

Security Best Practices:

# Run as non-root
RUN addgroup -g 1000 -S appgroup && \
    adduser -u 1000 -S appuser -G appgroup
USER appuser

# No secrets in build args
ARG DB_PASSWORD  # Bad - visible in history

# Use BuildKit build secrets instead
# (docker build --secret id=db_password,src=./db_password.txt .)
RUN --mount=type=secret,id=db_password \
    cat /run/secrets/db_password

.dockerignore:

node_modules
.git
*.log
.env
Dockerfile
.dockerignore

9.4 Multi-Stage Builds

Multi-stage builds optimize final image size by separating build and runtime environments.

Example: Go Application

# Build stage
FROM golang:1.17 AS builder
WORKDIR /app
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 GOOS=linux go build -o main .

# Runtime stage
FROM alpine:3.15
RUN apk --no-cache add ca-certificates
WORKDIR /root/
COPY --from=builder /app/main .
EXPOSE 8080
CMD ["./main"]

Example: React Application

# Build stage
FROM node:16 AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

# Runtime stage
FROM nginx:alpine
COPY --from=builder /app/build /usr/share/nginx/html
COPY nginx.conf /etc/nginx/nginx.conf
EXPOSE 80
CMD ["nginx", "-g", "daemon off;"]

Benefits:

  • Smaller images (MB vs GB)
  • No build tools in production
  • Better security
  • Faster pulls

9.5 Container Security

Security Principles:

  1. Least privilege: Minimal capabilities
  2. Immutable: No runtime changes
  3. Read-only root filesystem
  4. No privileged containers
  5. Vulnerability scanning

Security Best Practices:

User Namespace Remapping:

{
  "userns-remap": "default"
}

Read-only Root:

# Run with a read-only root filesystem; give the app
# writable tmpfs mounts only where it needs them
docker run --read-only --tmpfs /tmp --tmpfs /var/log myapp

Drop Capabilities:

docker run --cap-drop=ALL --cap-add=NET_BIND_SERVICE myapp

Security Context (Kubernetes):

securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  capabilities:
    drop: ["ALL"]
  readOnlyRootFilesystem: true

Image Signing:

# Docker Content Trust
export DOCKER_CONTENT_TRUST=1
docker push myapp:latest

9.6 Image Scanning

Scan images for vulnerabilities before deployment.

Common Scanners:

  • Trivy: Comprehensive, easy to use
  • Clair: CoreOS scanner
  • Anchore: Deep inspection
  • Snyk: Developer-focused
  • Docker Scout: Docker native
  • Grype: Fast vulnerability scanner

Trivy Example:

# Scan image
trivy image myapp:latest

# Scan with severity filter
trivy image --severity CRITICAL,HIGH myapp:latest

# Generate HTML report
trivy image --format template --template "@contrib/html.tpl" -o report.html myapp:latest

CI Integration:

# GitHub Actions
- name: Scan image
  uses: aquasecurity/trivy-action@master
  with:
    image-ref: 'myapp:latest'
    format: 'sarif'
    output: 'trivy-results.sarif'

SBOM (Software Bill of Materials):

# Generate SBOM
trivy image --format cyclonedx myapp:latest > sbom.json

# Scan for known vulnerabilities
trivy sbom sbom.json

9.7 OCI Standards

Open Container Initiative (OCI) ensures container format and runtime interoperability.

OCI Specifications:

  1. Image Specification: Container image format
  2. Runtime Specification: Container execution
  3. Distribution Specification: Content distribution

OCI Image Layout:

myimage/
├── blobs/
│   └── sha256/
│       ├── a1b2c3... (layer)
│       ├── d4e5f6... (config)
│       └── g7h8i9... (manifest)
└── index.json

Benefits:

  • Interoperability: Works across tools
  • Portability: Run anywhere
  • Stability: Backward compatible
  • Ecosystem: Wide tool support

Tools Supporting OCI:

  • Docker (with containerd)
  • Podman
  • Buildah
  • Skopeo
  • CRI-O
  • Kubernetes

Chapter 10 — Kubernetes Deep Dive

10.1 Kubernetes Architecture

Kubernetes orchestrates containerized applications across clusters of machines.

High-Level Architecture:

                    ┌─────────────────────┐
                    │   Control Plane      │
                    │  ┌─────────────────┐ │
                    │  │  API Server     │ │
                    │  └─────────────────┘ │
                    │  ┌─────────────────┐ │
                    │  │  Scheduler      │ │
                    │  └─────────────────┘ │
                    │  ┌─────────────────┐ │
                    │  │ Controller Mgr   │ │
                    │  └─────────────────┘ │
                    │  ┌─────────────────┐ │
                    │  │  etcd           │ │
                    │  └─────────────────┘ │
                    └──────────┬──────────┘
                               │
        ┌──────────────────────┼──────────────────────┐
        │                      │                      │
┌───────▼───────┐      ┌───────▼───────┐      ┌───────▼───────┐
│   Node 1      │      │   Node 2      │      │   Node 3      │
│ ┌───────────┐ │      │ ┌───────────┐ │      │ ┌───────────┐ │
│ │ kubelet   │ │      │ │ kubelet   │ │      │ │ kubelet   │ │
│ └───────────┘ │      │ └───────────┘ │      │ └───────────┘ │
│ ┌───────────┐ │      │ ┌───────────┐ │      │ ┌───────────┐ │
│ │ kube-proxy│ │      │ │ kube-proxy│ │      │ │ kube-proxy│ │
│ └───────────┘ │      │ └───────────┘ │      │ └───────────┘ │
│ ┌───────────┐ │      │ ┌───────────┐ │      │ ┌───────────┐ │
│ │ Container │ │      │ │ Container │ │      │ │ Container │ │
│ │ Runtime   │ │      │ │ Runtime   │ │      │ │ Runtime   │ │
│ └───────────┘ │      │ └───────────┘ │      │ └───────────┘ │
└───────────────┘      └───────────────┘      └───────────────┘

10.2 Control Plane Components

API Server (kube-apiserver):

  • Frontend to control plane
  • Validates and configures objects
  • Serves REST API
  • Horizontal scalable

etcd:

  • Distributed key-value store
  • Cluster state storage
  • Consistent and highly available
  • Raft consensus algorithm

Scheduler (kube-scheduler):

  • Assigns pods to nodes
  • Considers resources, constraints
  • Policy-based scheduling
  • Extensible with custom schedulers

Controller Manager (kube-controller-manager):

Runs controllers:

  • Node controller
  • Replication controller
  • Endpoint controller
  • Service Account controller
  • etc.

Cloud Controller Manager (cloud-controller-manager):

Interacts with cloud providers:

  • Node management
  • Load balancers
  • Routes
  • Volumes

10.3 Pods, Deployments, Services

Pod:

Smallest deployable unit, one or more containers.

apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    app: web
spec:
  containers:
  - name: nginx
    image: nginx:1.21
    ports:
    - containerPort: 80
    resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
      limits:
        memory: "128Mi"
        cpu: "500m"

Deployment:

Manages replica sets and rolling updates.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: nginx
        image: nginx:1.21
        ports:
        - containerPort: 80
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0

Service:

Stable network endpoint for pods.

apiVersion: v1
kind: Service
metadata:
  name: web-service
spec:
  selector:
    app: web
  ports:
  - port: 80
    targetPort: 80
  type: ClusterIP  # Default, internal only

Service types:

  • ClusterIP: Internal cluster IP
  • NodePort: Expose on each node's IP
  • LoadBalancer: Cloud load balancer
  • ExternalName: DNS alias

10.4 Networking Model

Kubernetes Networking Requirements:

  • Pods can communicate with all other pods without NAT
  • Nodes can communicate with all pods without NAT
  • A pod's IP address is the same from its own view and from other pods

CNI (Container Network Interface):

Plugins implement networking:

  • Calico: Network policy, BGP
  • Flannel: Simple overlay
  • Weave: Mesh networking
  • Cilium: eBPF-based, security
  • AWS VPC CNI: Native VPC integration

Network Policies:

Firewall rules for pods:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-allow
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend
    ports:
    - protocol: TCP
      port: 8080

10.5 Storage in Kubernetes

Volumes:

apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  volumes:
  - name: data
    emptyDir: {}  # Temporary
  - name: config
    configMap:
      name: app-config
  - name: secret
    secret:
      secretName: db-secret
  containers:
  - name: app
    volumeMounts:
    - name: data
      mountPath: /data

Persistent Volumes (PV):

Cluster storage resource:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-volume
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  awsElasticBlockStore:
    volumeID: vol-12345
    fsType: ext4

Persistent Volume Claims (PVC):

Request storage:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-claim
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi

Storage Classes:

Dynamic provisioning:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp3
  fsType: ext4

10.6 RBAC (Role-Based Access Control)

Core Concepts:

  • Role/ClusterRole: Set of permissions
  • RoleBinding/ClusterRoleBinding: Bind roles to users/groups

Role Example:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: default
  name: pod-reader
rules:
- apiGroups: [""]  # Core API group
  resources: ["pods"]
  verbs: ["get", "list", "watch"]

ClusterRole Example:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cluster-admin
rules:
- apiGroups: ["*"]
  resources: ["*"]
  verbs: ["*"]

RoleBinding:

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods
  namespace: default
subjects:
- kind: User
  name: jane
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io

Service Account Example:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: app-sa
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: app-binding
subjects:
- kind: ServiceAccount
  name: app-sa
  namespace: default
roleRef:
  kind: ClusterRole
  name: view
  apiGroup: rbac.authorization.k8s.io

10.7 Helm Charts

Helm is the package manager for Kubernetes.

Chart Structure:

mychart/
├── Chart.yaml          # Metadata
├── values.yaml         # Default values
├── templates/          # Template files
│   ├── deployment.yaml
│   ├── service.yaml
│   └── _helpers.tpl    # Helper templates
└── charts/             # Dependencies

Chart.yaml:

apiVersion: v2
name: myapp
description: My application
type: application
version: 0.1.0
appVersion: "1.0.0"
dependencies:
- name: redis
  version: 16.0.0
  repository: https://charts.bitnami.com/bitnami

Template Example:

# templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ include "mychart.fullname" . }}
  labels:
    {{- include "mychart.labels" . | nindent 4 }}
spec:
  replicas: {{ .Values.replicaCount }}
  selector:
    matchLabels:
      {{- include "mychart.selectorLabels" . | nindent 6 }}
  template:
    metadata:
      labels:
        {{- include "mychart.selectorLabels" . | nindent 8 }}
    spec:
      containers:
      - name: {{ .Chart.Name }}
        image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
        ports:
        - containerPort: {{ .Values.service.port }}

values.yaml:

replicaCount: 3
image:
  repository: nginx
  tag: latest
service:
  type: ClusterIP
  port: 80

Helm Commands:

# Install chart
helm install myapp ./mychart

# Upgrade release
helm upgrade myapp ./mychart

# Rollback
helm rollback myapp 1

# Template rendering
helm template ./mychart

# Package chart
helm package ./mychart

10.8 Operators Pattern

Operators automate application management using Kubernetes custom resources.

What are Operators?

Operators encode human operational knowledge into software to:

  • Deploy applications
  • Handle backups
  • Perform upgrades
  • Respond to failures

Operator Pattern:

Custom Resource (CR) → Operator → Manage application
     ↑                    ↓
User defines          Actual state
desired state          reconciled

Example: Prometheus Operator

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: main
spec:
  replicas: 2
  resources:
    requests:
      memory: 400Mi
  alerting:
    alertmanagers:
    - namespace: monitoring
      name: alertmanager-main
      port: web

Building Operators:

  • Operator SDK: Framework for building
  • Kubebuilder: Kubernetes API extensions
  • Metacontroller: Simple operators

Operator Best Practices:

  1. Idempotent: Safe to run repeatedly
  2. Self-healing: React to changes
  3. Upgradeable: Handle version upgrades
  4. Observable: Emit metrics/events
  5. Testable: Comprehensive testing

10.9 Custom Resource Definitions (CRD)

CRDs extend Kubernetes API with custom resources.

CRD Example:

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: databases.example.com
spec:
  group: example.com
  names:
    kind: Database
    plural: databases
    singular: database
    shortNames:
    - db
  scope: Namespaced
  versions:
  - name: v1
    served: true
    storage: true
    schema:
      openAPIV3Schema:
        type: object
        properties:
          spec:
            type: object
            properties:
              engine:
                type: string
                enum: ["mysql", "postgres"]
              version:
                type: string
              size:
                type: string
                pattern: '^[0-9]+Gi$'

Using Custom Resource:

apiVersion: example.com/v1
kind: Database
metadata:
  name: mydb
spec:
  engine: postgres
  version: "13"
  size: 10Gi
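The API server enforces the openAPIV3Schema above at admission time, rejecting custom resources whose spec violates the declared constraints. A toy validation mirroring that schema (the checks are illustrative, not the API server's actual validator):

```python
import re

# Checks mirroring the Database CRD schema: engine enum,
# string version, and a size matching ^[0-9]+Gi$
SCHEMA = {
    "engine": lambda v: v in ("mysql", "postgres"),
    "version": lambda v: isinstance(v, str),
    "size": lambda v: isinstance(v, str) and re.fullmatch(r"[0-9]+Gi", v) is not None,
}

def validate(spec: dict) -> list[str]:
    """Return the names of spec fields that violate the schema."""
    return [field for field, check in SCHEMA.items()
            if field in spec and not check(spec[field])]

print(validate({"engine": "postgres", "version": "13", "size": "10Gi"}))  # → []
print(validate({"engine": "oracle", "size": "10GB"}))  # → ['engine', 'size']
```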

10.10 Cluster Hardening

Security Best Practices:

API Server Security:

  • Enable RBAC
  • Use TLS for all communication
  • Enable audit logging
  • Disable anonymous auth
# kube-apiserver flags
--authorization-mode=Node,RBAC
--anonymous-auth=false
--audit-log-path=/var/log/kubernetes/audit.log
--enable-admission-plugins=NamespaceLifecycle,PodSecurityPolicy

etcd Security:

  • Encrypt secrets at rest
  • TLS for peer/client communication
  • Firewall access
  • Regular backups

Node Security:

  • Minimal host OS
  • Regular security updates
  • CIS benchmarks
  • Disable SSH or use bastion

Pod Security:

Pod Security Standards (PodSecurity admission):

apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: baseline

Pod Security Policies (deprecated in 1.21):

apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: restricted
spec:
  privileged: false
  runAsUser:
    rule: MustRunAsNonRoot
  seLinux:
    rule: RunAsAny
  fsGroup:
    rule: MustRunAs
    ranges:
    - min: 1
      max: 65535
  volumes:
  - 'configMap'
  - 'emptyDir'
  - 'projected'
  - 'secret'
  - 'downwardAPI'
  - 'persistentVolumeClaim'

Network Security:

  • Network policies
  • Encrypted traffic (mTLS with service mesh)
  • Limit external access

Image Security:

  • Scan images for vulnerabilities
  • Use private registry
  • Sign and verify images

Chapter 11 — Kubernetes in Production

11.1 High Availability Clusters

Control Plane HA:

Load Balancer
    ↓
┌───┼───┐
API  API  API
Server Server Server
 ↓     ↓     ↓
etcd  etcd  etcd (3-5 nodes)

Requirements:

  • Odd number of etcd nodes (3, 5, or 7) so a quorum survives node failures
  • API servers behind load balancer
  • Scheduler and controller manager with leader election
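The odd-number recommendation falls out of quorum arithmetic: an n-member etcd cluster needs a majority (floor(n/2) + 1) of members up, so adding an even member adds cost without adding fault tolerance. A quick sketch:

```python
# Quorum math behind the "odd number of etcd nodes" rule.
def quorum(n: int) -> int:
    """Members required for a majority in an n-member cluster."""
    return n // 2 + 1

def fault_tolerance(n: int) -> int:
    """Members that can fail while the cluster still has a quorum."""
    return n - quorum(n)

for n in range(1, 8):
    print(f"{n} members: quorum={quorum(n)}, tolerates {fault_tolerance(n)} failure(s)")
# 3 and 4 members both tolerate 1 failure; 5 and 6 both tolerate 2 --
# the extra even member only adds cost and write latency.
```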

Node Considerations:

  • Spread across availability zones
  • Cordoning and draining for maintenance
  • PodDisruptionBudgets

PodDisruptionBudget:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: myapp
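A PodDisruptionBudget constrains voluntary evictions (node drains, not crashes): with `minAvailable: 2`, the eviction API allows at most `healthy - 2` concurrent disruptions. A minimal sketch of that arithmetic:

```python
# How a PDB with minAvailable translates into allowed voluntary disruptions.
def allowed_disruptions(healthy_pods: int, min_available: int) -> int:
    """Evictions the PDB permits right now (never negative)."""
    return max(0, healthy_pods - min_available)

# With 3 healthy replicas and minAvailable: 2, one pod may be evicted at a time.
print(allowed_disruptions(3, 2))  # → 1
# If a failure already took the app down to 2 healthy pods, drains must wait.
print(allowed_disruptions(2, 2))  # → 0
```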

11.2 Multi-Cluster Strategy

Reasons for Multi-Cluster:

  • Geographic distribution: Lower latency
  • Compliance: Data sovereignty
  • Isolation: Dev/test/prod separation
  • Scaling: Beyond single cluster limits
  • Disaster recovery: Active/passive or active/active

Multi-Cluster Patterns:

  1. Federation: Single control plane managing multiple clusters (KubeFed)

  2. Hub and Spoke: Central management with workload clusters

  3. Independent: Separate clusters with common tooling

  4. Hybrid: Mix of on-prem and cloud

Cluster API:

Declarative cluster management:

apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: my-cluster
spec:
  clusterNetwork:
    pods:
      cidrBlocks: ["192.168.0.0/16"]
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: AWSCluster
    name: my-cluster
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: AWSCluster
metadata:
  name: my-cluster
spec:
  region: us-west-2
  sshKeyName: default

11.3 Service Mesh (Istio, Linkerd)

Service meshes provide observability, security, and traffic management.

Service Mesh Architecture:

Pod
├── App Container
└── Sidecar Proxy (Envoy/Linkerd2-proxy)
     ↑
Control Plane (Istiod/Linkerd controller)

Istio Example:

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
  - reviews
  http:
  - match:
    - headers:
        end-user:
          exact: jason
    route:
    - destination:
        host: reviews
        subset: v2
  - route:
    - destination:
        host: reviews
        subset: v1

mTLS (mutual TLS):

apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system
spec:
  mtls:
    mode: STRICT  # Require mTLS

Linkerd Example:

apiVersion: policy.linkerd.io/v1beta1
kind: HTTPRoute
metadata:
  name: api-route
  namespace: emojivoto
spec:
  parentRefs:
    - name: web-svc
      kind: Service
      group: core
      port: 80
  rules:
    - matches:
        - path:
            value: "/api/vote"
      filters:
        - type: RequestRedirect
          requestRedirect:
            scheme: https

Benefits:

  • Traffic management: Canary, blue/green
  • Security: mTLS, authorization
  • Observability: Metrics, tracing, logs
  • Resilience: Retries, timeouts, circuit breakers

11.4 Autoscaling (HPA, VPA)

Horizontal Pod Autoscaler (HPA):

Scales based on metrics:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Pods
    pods:
      metric:
        name: requests_per_second
      target:
        type: AverageValue
        averageValue: 1000
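Under the hood, the HPA controller computes its target from a documented formula: desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), clamped to the min/max bounds. This sketch omits the tolerance window and stabilization behavior:

```python
import math

def hpa_desired_replicas(current_replicas, current_metric, target_metric,
                         min_replicas, max_replicas):
    """Core HPA scaling formula, clamped to the configured replica bounds."""
    desired = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, desired))

# 3 replicas at 90% average CPU against a 70% target -> scale to 4
print(hpa_desired_replicas(3, 90, 70, 3, 20))  # → 4
# Load drops to 20%: the formula says 1, but minReplicas: 3 wins
print(hpa_desired_replicas(3, 20, 70, 3, 20))  # → 3
```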

Vertical Pod Autoscaler (VPA):

Adjusts resource requests:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: app-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: app
  updatePolicy:
    updateMode: "Auto"  # Auto, Initial, Off
  resourcePolicy:
    containerPolicies:
    - containerName: '*'
      minAllowed:
        cpu: 250m
        memory: 512Mi
      maxAllowed:
        cpu: 4
        memory: 8Gi

Cluster Autoscaler:

Scales nodes based on pending pods:

# Keep a node out of scale-down (node annotation)
kubectl annotate node my-node \
  cluster-autoscaler.kubernetes.io/scale-down-disabled=true

# Prevent the autoscaler from evicting a pod (pod annotation,
# blocks scale-down of the node it runs on)
metadata:
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"

KEDA (Kubernetes Event-driven Autoscaling):

Scale based on events (Kafka, RabbitMQ, etc.):

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: kafka-scaler
spec:
  scaleTargetRef:
    name: consumer
  triggers:
  - type: kafka
    metadata:
      topic: my-topic
      bootstrapServers: kafka:9092
      consumerGroup: my-group
      lagThreshold: "10"

11.5 Observability in Kubernetes

Metrics:

  • Node metrics: CPU, memory, disk
  • Pod metrics: Resource usage
  • Custom metrics: Application-specific

Prometheus Stack:

# ServiceMonitor for Prometheus
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: app-monitor
spec:
  selector:
    matchLabels:
      app: myapp
  endpoints:
  - port: metrics
    interval: 30s

Logging:

  • Container logs: stdout/stderr
  • Node logs: kubelet, container runtime
  • Audit logs: API server activity

EFK Stack:

  • Elasticsearch: Storage and search
  • Fluentd/Fluent Bit: Log collection
  • Kibana: Visualization

Tracing:

Distributed tracing with Jaeger:

apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: simplest

OpenTelemetry:

Vendor-neutral observability:

apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: simplest
spec:
  config: |
    receivers:
      otlp:
        protocols:
          grpc:
          http:
    processors:
      memory_limiter:
        limit_mib: 512
    exporters:
      jaeger:
        endpoint: jaeger:14250
        tls:
          insecure: true
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter]
          exporters: [jaeger]

11.6 Disaster Recovery

Backup Strategies:

etcd Backup:

# Snapshot etcd
ETCDCTL_API=3 etcdctl snapshot save snapshot.db

# Restore
ETCDCTL_API=3 etcdctl snapshot restore snapshot.db

Velero (formerly Heptio Ark):

Backup and restore Kubernetes resources:

# Schedule backup
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-backup
spec:
  schedule: "0 1 * * *"
  template:
    includedNamespaces:
    - production
    ttl: 720h

Velero Commands:

# On-demand backup
velero backup create app-backup --include-namespaces production

# Restore
velero restore create --from-backup app-backup

# Schedule backup
velero schedule create daily --schedule="0 1 * * *" --include-namespaces production

DR Patterns:

Active-Passive:

  • One cluster active, one standby
  • Data replication between clusters
  • DNS switch on failure

Active-Active:

  • Multiple clusters serving traffic
  • Global load balancing
  • Data synchronization challenges

Backup and Restore:

  • Regular backups
  • Documented restore procedures
  • Test restores regularly

11.7 Cost Optimization

Resource Management:

Rightsizing:

  • Use VPA to find optimal requests
  • Analyze usage patterns
  • Remove unused resources

Node Optimization:

  • Use spot/preemptible instances for stateless workloads
  • Right-size instance types
  • Use cluster autoscaler

Kubecost:

# Kubecost deployment
helm repo add kubecost https://kubecost.github.io/cost-analyzer/
helm install kubecost kubecost/cost-analyzer

Karpenter (AWS):

Dynamic node provisioning:

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot", "on-demand"]
    - key: kubernetes.io/arch
      operator: In
      values: ["amd64", "arm64"]
  limits:
    resources:
      cpu: 1000
  provider:
    subnetSelector:
      karpenter.sh/discovery: my-cluster
    securityGroupSelector:
      karpenter.sh/discovery: my-cluster

Cost Optimization Checklist:

  • Rightsize pods (use VPA)
  • Use spot instances where possible
  • Scale down non-production clusters
  • Remove unused load balancers
  • Optimize storage (use reclaim policies)
  • Monitor and alert on cost spikes
  • Use namespace quotas
  • Implement resource limits

PART V — INFRASTRUCTURE AS CODE

Chapter 12 — Infrastructure as Code Principles

12.1 Declarative vs Imperative

Imperative Approach:

Describe how to achieve desired state:

# Create VPC
aws ec2 create-vpc --cidr-block 10.0.0.0/16

# Create subnet
aws ec2 create-subnet --vpc-id vpc-123 --cidr-block 10.0.1.0/24

# Create internet gateway
aws ec2 create-internet-gateway
aws ec2 attach-internet-gateway --vpc-id vpc-123 --internet-gateway-id igw-456

Problems:

  • Not idempotent
  • Difficult to reproduce
  • No state tracking
  • Error-prone

Declarative Approach:

Describe what you want:

# Terraform
resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
}

resource "aws_subnet" "main" {
  vpc_id     = aws_vpc.main.id
  cidr_block = "10.0.1.0/24"
}

resource "aws_internet_gateway" "main" {
  vpc_id = aws_vpc.main.id
}

Benefits:

  • Idempotent
  • Self-documenting
  • Version controllable
  • Predictable
  • Reusable

12.2 Immutable Infrastructure

Mutable Infrastructure:

  • Servers are updated in place
  • Configuration drifts over time
  • Configuration management tools fix drift
  • "Snowflake" servers

Immutable Infrastructure:

  • Never modify servers after deployment
  • Replace, don't change
  • Everything in version control
  • Identical environments
  • Easy rollback (redeploy previous version)

Benefits:

  • Consistency: All servers identical
  • Reproducibility: Recreate from scratch
  • Testing: Test immutable artifacts
  • Rollback: Deploy previous version
  • Debugging: Known state

Implementation:

Version 1:
Source → Build → Image v1 → Deploy → Running v1

Version 2:
Source → Build → Image v2 → Deploy → Running v2
                               ↓
                          Terminate v1
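The version flow above is a replace-then-terminate operation: a new fleet comes up from a freshly built image, then the old fleet is destroyed rather than patched. A toy sketch (instance IDs are illustrative):

```python
# Sketch of the replace-don't-modify deployment flow from the diagram.
def deploy_immutable(running, new_version, instance_ids):
    """Bring up the new fleet, then terminate every older fleet.
    No server is ever modified in place."""
    running[new_version] = instance_ids            # fleet from the new image
    for version in [v for v in running if v != new_version]:
        del running[version]                       # old fleet terminated, not patched
    return running

fleet = {"v1": ["i-aaa", "i-bbb"]}
print(deploy_immutable(fleet, "v2", ["i-ccc", "i-ddd"]))
# → {'v2': ['i-ccc', 'i-ddd']}
```

Rollback is the same operation in reverse: redeploy the v1 image as a new fleet.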

12.3 Idempotency

Definition: An operation is idempotent if applying it multiple times has the same effect as applying it once.

Examples:

Non-idempotent:

# Appends another line on every run
echo "new line" >> file.txt

# Fails on the second run (directory already exists)
mkdir mydir

Idempotent:

# Succeeds whether or not the file exists
touch file.txt

# Overwrites, so the end state is identical every run
echo "data" > file.txt

# Creates only if missing, succeeds either way
mkdir -p mydir

In IaC:

# Idempotent - apply converges to this state no matter how often it runs
resource "aws_instance" "web" {
  ami           = "ami-123"
  instance_type = "t2.micro"
  
  # Tags identify the managed instance
  tags = {
    Name = "web-server"
  }
}

Benefits:

  • Safe to reapply
  • Predictable outcomes
  • Easy automation
  • Self-healing
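Configuration tools are built around this property: every module checks the current state before acting and reports whether anything changed. A minimal Python sketch of an idempotent "ensure" operation, in the spirit of Ansible's lineinfile (the helper name is hypothetical):

```python
from pathlib import Path

def ensure_line(path: str, line: str) -> bool:
    """Idempotently ensure `line` appears in the file.
    Returns True if a change was made (an Ansible-style 'changed' flag)."""
    p = Path(path)
    lines = p.read_text().splitlines() if p.exists() else []
    if line in lines:
        return False          # already converged: applying again is a no-op
    lines.append(line)
    p.write_text("\n".join(lines) + "\n")
    return True

# First call mutates; every later call with the same input is a no-op.
print(ensure_line("/tmp/demo.conf", "max_clients 200"))  # True on first run
print(ensure_line("/tmp/demo.conf", "max_clients 200"))  # False thereafter
```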

12.4 State Management

State tracks resources managed by IaC.

Why State Matters:

  • Maps configuration to real resources
  • Tracks metadata and dependencies
  • Enables updates and deletion
  • Improves performance (caching)
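Conceptually, a tool like Terraform produces its plan by diffing the desired configuration against the recorded state: resources only in the config are created, resources only in the state are destroyed, and shared resources are updated if their attributes differ. A toy sketch (resource addresses are illustrative):

```python
# Minimal sketch of planning via a desired-vs-state diff.
desired = {
    "aws_vpc.main":    {"cidr_block": "10.0.0.0/16"},
    "aws_subnet.main": {"cidr_block": "10.0.1.0/24"},
}
state = {
    "aws_vpc.main":    {"cidr_block": "10.0.0.0/16"},
    "aws_subnet.old":  {"cidr_block": "10.0.9.0/24"},
}

def plan(desired, state):
    actions = {}
    for addr in desired.keys() - state.keys():
        actions[addr] = "create"        # in config, not yet real
    for addr in desired.keys() & state.keys():
        actions[addr] = "no-op" if desired[addr] == state[addr] else "update"
    for addr in state.keys() - desired.keys():
        actions[addr] = "destroy"       # real, but removed from config
    return actions

for addr, action in sorted(plan(desired, state).items()):
    print(f"{action:7s} {addr}")
```

Without the state file the tool could not know that `aws_subnet.old` is its responsibility to destroy.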

State Storage:

Local State:

terraform {
  backend "local" {
    path = "terraform.tfstate"
  }
}

  • Simple but not for teams
  • No locking
  • Easy to lose

Remote State:

# AWS S3
terraform {
  backend "s3" {
    bucket = "my-terraform-state"
    key    = "prod/network/terraform.tfstate"
    region = "us-east-1"
    
    # Enable locking
    dynamodb_table = "terraform-locks"
  }
}

Azure Storage:

terraform {
  backend "azurerm" {
    storage_account_name = "tfstate123"
    container_name       = "tfstate"
    key                  = "prod.terraform.tfstate"
    # Prefer the ARM_ACCESS_KEY environment variable over hardcoding this
    access_key           = "xxx"
  }
}

Google Cloud Storage:

terraform {
  backend "gcs" {
    bucket = "tf-state-prod"
    prefix = "terraform/state"
  }
}

State Best Practices:

  1. Remote storage: Never store state locally
  2. State locking: Prevent concurrent modifications
  3. Encryption: Encrypt state at rest
  4. Access control: Restrict who can read/write
  5. Backup: Regular state backups
  6. Isolation: Separate state per environment

Chapter 13 — IaC Tools

13.1 Terraform

HashiCorp Terraform is the most popular IaC tool.

Core Concepts:

  • Providers: AWS, Azure, GCP, Kubernetes, etc.
  • Resources: Infrastructure components
  • Data sources: Read existing resources
  • Variables: Parameterize configurations
  • Outputs: Export resource attributes
  • Modules: Reusable configurations

Basic Example:

# main.tf
terraform {
  required_version = ">= 1.0"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 4.0"
    }
  }
}

provider "aws" {
  region = var.aws_region
}

resource "aws_instance" "web" {
  ami           = data.aws_ami.ubuntu.id
  instance_type = var.instance_type
  
  tags = {
    Name        = "web-${var.environment}"
    Environment = var.environment
  }
}

data "aws_ami" "ubuntu" {
  most_recent = true
  owners      = ["099720109477"]  # Canonical
  
  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-*"]
  }
}

variable "aws_region" {
  description = "AWS region"
  type        = string
  default     = "us-east-1"
}

variable "instance_type" {
  description = "EC2 instance type"
  type        = string
}

variable "environment" {
  description = "Environment name"
  type        = string
}

output "instance_ip" {
  description = "Public IP of instance"
  value       = aws_instance.web.public_ip
}

Variables File (terraform.tfvars):

instance_type = "t3.micro"
environment   = "production"

Commands:

# Initialize (download providers)
terraform init

# Format code
terraform fmt

# Validate syntax
terraform validate

# Plan changes
terraform plan

# Apply changes
terraform apply

# Destroy resources
terraform destroy

# Show state
terraform show

# List resources
terraform state list

13.2 Ansible

Agentless configuration management and automation.

Core Concepts:

  • Playbooks: YAML files defining automation
  • Modules: Reusable units of work
  • Inventory: List of managed hosts
  • Roles: Organized playbook structure
  • Facts: System information gathered

Playbook Example:

---
- name: Configure web servers
  hosts: webservers
  become: yes
  vars:
    http_port: 80
    max_clients: 200
  
  tasks:
    - name: Ensure nginx is installed
      apt:
        name: nginx
        state: present
      when: ansible_os_family == "Debian"
    
    - name: Ensure nginx is running
      service:
        name: nginx
        state: started
        enabled: yes
    
    - name: Copy nginx configuration
      template:
        src: nginx.conf.j2
        dest: /etc/nginx/nginx.conf
      notify: restart nginx
    
    - name: Deploy website
      copy:
        src: index.html
        dest: /var/www/html/index.html
  
  handlers:
    - name: restart nginx
      service:
        name: nginx
        state: restarted

Inventory (hosts.ini):

[webservers]
web1.example.com
web2.example.com

[databases]
db1.example.com
db2.example.com

[all:vars]
ansible_user=ubuntu
ansible_ssh_private_key_file=~/.ssh/prod-key.pem

Role Structure:

roles/
└── nginx/
    ├── tasks/
    │   └── main.yml
    ├── handlers/
    │   └── main.yml
    ├── templates/
    │   └── nginx.conf.j2
    ├── files/
    │   └── index.html
    ├── vars/
    │   └── main.yml
    └── defaults/
        └── main.yml

Commands:

# Ping all hosts
ansible all -m ping

# Run ad-hoc command
ansible webservers -m command -a "uptime"

# Run playbook
ansible-playbook site.yml

# Check syntax
ansible-playbook site.yml --syntax-check

# Dry run
ansible-playbook site.yml --check

# Limit to specific hosts
ansible-playbook site.yml --limit web1

13.3 Pulumi

IaC using general-purpose programming languages.

Example (TypeScript):

import * as aws from "@pulumi/aws";
import * as pulumi from "@pulumi/pulumi";

const config = new pulumi.Config();
const instanceType = config.get("instanceType") || "t3.micro";

// Get the latest Ubuntu AMI
const ubuntu = aws.ec2.getAmi({
  mostRecent: true,
  filters: [
    {
      name: "name",
      values: ["ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-*"],
    },
  ],
  owners: ["099720109477"],
});

// Create a security group
const group = new aws.ec2.SecurityGroup("web-sg", {
  description: "Allow HTTP and SSH",
  ingress: [
    { protocol: "tcp", fromPort: 22, toPort: 22, cidrBlocks: ["0.0.0.0/0"] },
    { protocol: "tcp", fromPort: 80, toPort: 80, cidrBlocks: ["0.0.0.0/0"] },
  ],
  egress: [
    { protocol: "-1", fromPort: 0, toPort: 0, cidrBlocks: ["0.0.0.0/0"] },
  ],
});

// Create an EC2 instance
const server = new aws.ec2.Instance("web-server", {
  instanceType: instanceType,
  ami: ubuntu.then(ami => ami.id),
  vpcSecurityGroupIds: [group.id],
  userData: `#!/bin/bash
    apt-get update
    apt-get install -y nginx
    systemctl start nginx
  `,
  tags: {
    Name: "web-server",
    Environment: pulumi.getStack(),
  },
});

// Export the instance's public IP
export const publicIp = server.publicIp;
export const publicHostname = server.publicDns;

Example (Python):

import pulumi
import pulumi_aws as aws

config = pulumi.Config()
instance_type = config.get("instanceType") or "t3.micro"

# Get the latest Ubuntu AMI
ubuntu = aws.ec2.get_ami(
    most_recent=True,
    filters=[
        {
            "name": "name",
            "values": ["ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-*"]
        }
    ],
    owners=["099720109477"]
)

# Create security group
group = aws.ec2.SecurityGroup("web-sg",
    description="Allow HTTP and SSH",
    ingress=[
        {"protocol": "tcp", "from_port": 22, "to_port": 22, "cidr_blocks": ["0.0.0.0/0"]},
        {"protocol": "tcp", "from_port": 80, "to_port": 80, "cidr_blocks": ["0.0.0.0/0"]},
    ],
    egress=[
        {"protocol": "-1", "from_port": 0, "to_port": 0, "cidr_blocks": ["0.0.0.0/0"]}
    ]
)

# Create EC2 instance
server = aws.ec2.Instance("web-server",
    instance_type=instance_type,
    ami=ubuntu.id,
    vpc_security_group_ids=[group.id],
    user_data="""#!/bin/bash
        apt-get update
        apt-get install -y nginx
        systemctl start nginx
    """,
    tags={
        "Name": "web-server",
        "Environment": pulumi.get_stack()
    }
)

pulumi.export("public_ip", server.public_ip)
pulumi.export("public_hostname", server.public_dns)

Benefits:

  • Use familiar programming languages
  • Loops, conditionals, functions
  • Strong typing (TypeScript, Go)
  • Reuse existing code/libraries
  • Better IDE support

13.4 CloudFormation

AWS-native IaC tool.

Template Structure:

AWSTemplateFormatVersion: "2010-09-09"
Description: "Web server stack"

Parameters:
  InstanceType:
    Description: EC2 instance type
    Type: String
    Default: t3.micro
    AllowedValues:
      - t3.micro
      - t3.small
      - t3.medium

Mappings:
  RegionMap:
    us-east-1:
      AMI: ami-0c02fb55956c7d316  # replace with a current Ubuntu 20.04 AMI for this region
    us-west-2:
      AMI: ami-0d6621c01e8c2de54

Resources:
  WebServerSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Allow HTTP and SSH
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 22
          ToPort: 22
          CidrIp: 0.0.0.0/0
        - IpProtocol: tcp
          FromPort: 80
          ToPort: 80
          CidrIp: 0.0.0.0/0

  WebServer:
    Type: AWS::EC2::Instance
    Properties:
      ImageId: !FindInMap [RegionMap, !Ref "AWS::Region", AMI]
      InstanceType: !Ref InstanceType
      SecurityGroupIds:
        - !Ref WebServerSecurityGroup
      UserData:
        Fn::Base64: !Sub |
          #!/bin/bash
          apt-get update
          apt-get install -y nginx
          systemctl start nginx
      Tags:
        - Key: Name
          Value: WebServer

Outputs:
  PublicIP:
    Description: Public IP of web server
    Value: !GetAtt WebServer.PublicIp
  PublicDNS:
    Description: Public DNS of web server
    Value: !GetAtt WebServer.PublicDnsName

StackSets: Deploy across multiple regions/accounts.

Change Sets: Preview changes before applying.

13.5 Remote State Backends

Terraform Backends:

S3 Backend:

terraform {
  backend "s3" {
    bucket         = "company-terraform-state"
    key            = "prod/network/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"
  }
}

DynamoDB Lock Table:

resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"
  
  attribute {
    name = "LockID"
    type = "S"
  }
}

Azure Backend:

terraform {
  backend "azurerm" {
    resource_group_name  = "terraform-state"
    storage_account_name = "tfstate123"
    container_name       = "tfstate"
    key                  = "prod.terraform.tfstate"
  }
}

GCS Backend:

terraform {
  backend "gcs" {
    bucket = "terraform-state-prod"
    prefix = "network"
  }
}

State Isolation Strategies:

  1. Workspaces: Same config, separate state
  2. Directory structure: Different configs per environment
  3. Terragrunt: DRY configurations

Workspaces:

# Create workspace
terraform workspace new dev
terraform workspace new prod

# List workspaces
terraform workspace list

# Switch workspace
terraform workspace select prod

# Use in config
locals {
  environment = terraform.workspace
}

13.6 Modules & Reusability

Module Structure:

modules/
└── webserver/
    ├── main.tf
    ├── variables.tf
    ├── outputs.tf
    └── README.md

Module Code (main.tf):

resource "aws_instance" "web" {
  ami           = var.ami
  instance_type = var.instance_type
  subnet_id     = var.subnet_id
  
  vpc_security_group_ids = [aws_security_group.web.id]
  
  user_data = var.user_data
  
  tags = var.tags
}

resource "aws_security_group" "web" {
  name_prefix = "${var.name}-sg"
  vpc_id      = var.vpc_id
  
  dynamic "ingress" {
    for_each = var.ingress_rules
    content {
      from_port   = ingress.value.from_port
      to_port     = ingress.value.to_port
      protocol    = ingress.value.protocol
      cidr_blocks = ingress.value.cidr_blocks
    }
  }
  
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
  
  tags = var.tags
}

variables.tf:

variable "name" {
  description = "Name prefix for resources"
  type        = string
}

variable "ami" {
  description = "AMI ID for the instance"
  type        = string
}

variable "instance_type" {
  description = "Instance type"
  type        = string
  default     = "t3.micro"
}

variable "subnet_id" {
  description = "Subnet ID for the instance"
  type        = string
}

variable "vpc_id" {
  description = "VPC ID for security group"
  type        = string
}

variable "user_data" {
  description = "User data script"
  type        = string
  default     = ""
}

variable "ingress_rules" {
  description = "List of ingress rules"
  type = list(object({
    from_port   = number
    to_port     = number
    protocol    = string
    cidr_blocks = list(string)
  }))
  default = [
    {
      from_port   = 80
      to_port     = 80
      protocol    = "tcp"
      cidr_blocks = ["0.0.0.0/0"]
    }
  ]
}

variable "tags" {
  description = "Tags to apply"
  type        = map(string)
  default     = {}
}

outputs.tf:

output "instance_id" {
  description = "Instance ID"
  value       = aws_instance.web.id
}

output "public_ip" {
  description = "Public IP address"
  value       = aws_instance.web.public_ip
}

output "security_group_id" {
  description = "Security group ID"
  value       = aws_security_group.web.id
}

Using the Module:

module "web_server" {
  source = "../modules/webserver"
  
  name        = "prod-web"
  ami         = data.aws_ami.ubuntu.id
  instance_type = "t3.small"
  subnet_id   = aws_subnet.public.id
  vpc_id      = aws_vpc.main.id
  
  ingress_rules = [
    {
      from_port   = 80
      to_port     = 80
      protocol    = "tcp"
      cidr_blocks = ["0.0.0.0/0"]
    },
    {
      from_port   = 443
      to_port     = 443
      protocol    = "tcp"
      cidr_blocks = ["0.0.0.0/0"]
    }
  ]
  
  tags = {
    Environment = "production"
    ManagedBy   = "terraform"
  }
}

output "web_ip" {
  value = module.web_server.public_ip
}

13.7 Policy as Code

Enforce policies on infrastructure.

Sentinel (HashiCorp):

# Restrict instance types
import "tfplan"

main = rule {
  all tfplan.resources.aws_instance as _, instances {
    all instances as _, instance {
      instance.applied.instance_type in ["t3.micro", "t3.small"]
    }
  }
}

Open Policy Agent (OPA):

Rego policy:

package terraform

deny[msg] {
  resource := input.resource_changes[_]
  resource.type == "aws_instance"
  resource.change.after.instance_type == "t3.large"
  msg := sprintf("Instance type t3.large not allowed in %v", [resource.address])
}

deny[msg] {
  resource := input.resource_changes[_]
  resource.type == "aws_s3_bucket"
  count([pab | pab := input.resource_changes[_]; pab.type == "aws_s3_bucket_public_access_block"]) == 0
  msg := sprintf("S3 bucket %v requires a public access block", [resource.address])
}

Checkov:

Scan Terraform for security issues:

# Install
pip install checkov

# Scan
checkov -d ./

# Scan specific file
checkov -f main.tf

# Output formats
checkov -d ./ --output junitxml > results.xml

Example Check:

# Custom check
from checkov.common.models.enums import CheckResult, CheckCategories
from checkov.terraform.checks.resource.base_resource_check import BaseResourceCheck

class S3PublicACL(BaseResourceCheck):
    def __init__(self):
        name = "Ensure S3 bucket has no public ACL"
        id = "CUSTOM_AWS_001"
        supported_resources = ['aws_s3_bucket']
        categories = [CheckCategories.SECURITY]
        super().__init__(name=name, id=id, categories=categories, supported_resources=supported_resources)

    def scan_resource_conf(self, conf):
        if 'acl' in conf and conf['acl'] in (['public-read'], ['public-read-write']):
            return CheckResult.FAILED
        return CheckResult.PASSED

check = S3PublicACL()

PART VI — CLOUD PLATFORMS

Chapter 14 — Cloud Fundamentals

14.1 IaaS, PaaS, SaaS

Infrastructure as a Service (IaaS):

  • Virtual machines, storage, networks
  • You manage OS, middleware, runtime, data, apps
  • Provider manages virtualization, servers, storage, networking

Examples: AWS EC2, Azure VMs, Google Compute Engine

Platform as a Service (PaaS):

  • Managed runtime environment
  • You manage data and apps
  • Provider manages everything else

Examples: Heroku, Google App Engine, AWS Elastic Beanstalk

Software as a Service (SaaS):

  • Complete application
  • You just use it
  • Provider manages everything

Examples: Salesforce, Office 365, Google Workspace

Function as a Service (FaaS):

  • Serverless functions
  • You write code, provider runs it
  • Pay per execution

Examples: AWS Lambda, Azure Functions, Google Cloud Functions

14.2 Public vs Private vs Hybrid

Public Cloud:

  • Shared infrastructure
  • Multi-tenant
  • Pay-as-you-go
  • Global scale
  • Examples: AWS, Azure, GCP

Private Cloud:

  • Dedicated infrastructure
  • Single tenant
  • More control
  • Compliance benefits
  • Examples: OpenStack, VMware

Hybrid Cloud:

  • Mix of public and private
  • Workload mobility
  • Data locality options
  • Burst to public cloud

Multi-Cloud:

  • Multiple public cloud providers
  • Avoid vendor lock-in
  • Best-of-breed services
  • Geographic presence

14.3 Cloud Networking

Virtual Private Cloud (VPC):

Isolated network section:

VPC (10.0.0.0/16)
├── Public Subnet (10.0.1.0/24)
│   └── Internet Gateway
├── Private Subnet (10.0.2.0/24)
│   └── NAT Gateway
└── Database Subnet (10.0.3.0/24)
    └── No internet access
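The CIDR layout above can be validated with Python's standard ipaddress module: every subnet must nest inside the VPC CIDR, and subnets must not overlap one another.

```python
import ipaddress

# The VPC layout from the diagram above.
vpc = ipaddress.ip_network("10.0.0.0/16")
subnets = {
    "public":   ipaddress.ip_network("10.0.1.0/24"),
    "private":  ipaddress.ip_network("10.0.2.0/24"),
    "database": ipaddress.ip_network("10.0.3.0/24"),
}

# Every subnet must fall inside the VPC CIDR.
for name, net in subnets.items():
    assert net.subnet_of(vpc), f"{name} falls outside the VPC"

# No two subnets may overlap.
nets = list(subnets.values())
for i, a in enumerate(nets):
    for b in nets[i + 1:]:
        assert not a.overlaps(b), f"{a} and {b} overlap"

print("layout valid:", vpc.num_addresses, "addresses,",
      sum(n.num_addresses for n in nets), "allocated")
```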

Key Components:

  • Subnets: Network segments
  • Route tables: Traffic routing
  • Internet Gateway: Public internet access
  • NAT Gateway: Private subnet outbound access
  • VPN Gateway: On-premises connection
  • Load Balancers: Traffic distribution
  • CDN: Content delivery

Network Security:

  • Security Groups: Instance-level firewall (stateful)
  • Network ACLs: Subnet-level firewall (stateless)
  • WAF: Web application firewall
  • DDoS protection: Shield, Cloudflare

14.4 IAM Concepts

Identity and Access Management (IAM):

Core Components:

  • Users: Individual people/accounts
  • Groups: Collections of users
  • Roles: Temporary permissions
  • Policies: Permission documents
  • Permissions: Allow/deny actions

IAM Policy Example (AWS):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket",
        "s3:GetObject"
      ],
      "Resource": [
        "arn:aws:s3:::my-bucket",
        "arn:aws:s3:::my-bucket/*"
      ],
      "Condition": {
        "IpAddress": {
          "aws:SourceIp": "192.168.1.0/24"
        }
      }
    }
  ]
}

Least Privilege Principle:

  • Grant minimum necessary permissions
  • Regularly audit permissions
  • Use groups and roles
  • Avoid wildcards when possible
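AWS evaluates policies with a fixed precedence: an explicit Deny always wins, otherwise an explicit Allow grants access, otherwise the request is implicitly denied. A minimal sketch of that decision order (statements are simplified dicts; real evaluation also matches resources and conditions):

```python
# IAM decision order: explicit Deny > explicit Allow > implicit deny.
def evaluate(statements, action):
    decision = "ImplicitDeny"
    for stmt in statements:
        if action in stmt["Action"]:
            if stmt["Effect"] == "Deny":
                return "Deny"        # explicit deny short-circuits everything
            decision = "Allow"       # remember the allow, keep scanning for denies
    return decision

policy = [
    {"Effect": "Allow", "Action": ["s3:GetObject", "s3:ListBucket"]},
    {"Effect": "Deny",  "Action": ["s3:DeleteObject"]},
]

print(evaluate(policy, "s3:GetObject"))     # → Allow
print(evaluate(policy, "s3:DeleteObject"))  # → Deny
print(evaluate(policy, "s3:PutObject"))     # → ImplicitDeny
```

The implicit-deny default is what makes least privilege workable: anything not explicitly allowed is already denied.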

Identity Federation:

  • SAML 2.0 (Active Directory)
  • OIDC (Google, GitHub)
  • Social logins

Chapter 15 — Amazon Web Services

15.1 Amazon Web Services Overview

AWS is the leading cloud provider with the broadest service portfolio.

Global Infrastructure:

  • Regions: Geographic areas (us-east-1, eu-west-1)
  • Availability Zones: Isolated data centers per region
  • Edge Locations: CDN endpoints
  • Local Zones: Extend regions to population centers

Service Categories:

  • Compute
  • Storage
  • Database
  • Networking
  • Security & Identity
  • Analytics
  • Machine Learning
  • Developer Tools
  • Management & Governance

15.2 EC2 (Elastic Compute Cloud)

Virtual servers in the cloud.

Instance Types:

  • General Purpose: t3, m5 (balanced)
  • Compute Optimized: c5 (CPU intensive)
  • Memory Optimized: r5, x1 (RAM intensive)
  • Storage Optimized: i3, d2 (disk I/O)
  • GPU Instances: p3, g4 (graphics, ML)

Launch Configuration:

resource "aws_instance" "web" {
  ami           = "ami-0c02fb55956c7d316"
  instance_type = "t3.micro"
  
  subnet_id                   = aws_subnet.public.id
  vpc_security_group_ids      = [aws_security_group.web.id]
  associate_public_ip_address = true
  
  user_data = <<-EOF
    #!/bin/bash
    yum update -y
    yum install -y httpd
    systemctl start httpd
    systemctl enable httpd
    echo "<h1>Hello from $(hostname -f)</h1>" > /var/www/html/index.html
  EOF
  
  tags = {
    Name = "web-server"
  }
}

Purchase Options:

  • On-Demand: Pay by hour/second
  • Reserved: 1-3 year commitment, up to 75% discount
  • Spot: Bid for unused capacity, up to 90% discount
  • Savings Plans: Flexible pricing
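The trade-off between these options is easiest to see with back-of-the-envelope math. All rates below are made-up assumptions for illustration, not AWS quotes:

```python
# Illustrative monthly cost comparison; every number here is an assumption.
on_demand_hourly = 0.0416      # hypothetical small-instance rate, USD/hr
hours_per_month = 730
reserved_discount = 0.60       # assumption: ~60% off with a commitment
spot_discount = 0.90           # assumption: ~90% off spot (interruptible)

on_demand = on_demand_hourly * hours_per_month
print(f"on-demand: ${on_demand:.2f}/month")
print(f"reserved : ${on_demand * (1 - reserved_discount):.2f}/month")
print(f"spot     : ${on_demand * (1 - spot_discount):.2f}/month")
```

The catch: spot capacity can be reclaimed with short notice, so it suits stateless or fault-tolerant workloads, while reserved pricing only pays off if the instance actually runs for the commitment period.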

15.3 S3 (Simple Storage Service)

Object storage for the cloud.

Storage Classes:

  • S3 Standard: Frequently accessed data
  • S3 Intelligent-Tiering: Auto-tiering
  • S3 Standard-IA: Infrequent access
  • S3 One Zone-IA: Lower cost, less durable
  • S3 Glacier: Archive (minutes to hours retrieval)
  • S3 Glacier Deep Archive: Long-term archive (hours retrieval)

Bucket Example:

resource "aws_s3_bucket" "data" {
  bucket = "my-company-data-${var.environment}"
  
  tags = {
    Environment = var.environment
  }
}

resource "aws_s3_bucket_versioning" "data" {
  bucket = aws_s3_bucket.data.id
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "data" {
  bucket = aws_s3_bucket.data.id
  
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}

resource "aws_s3_bucket_public_access_block" "data" {
  bucket = aws_s3_bucket.data.id
  
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

CLI Commands:

# List buckets
aws s3 ls

# Copy file
aws s3 cp file.txt s3://my-bucket/

# Sync directory
aws s3 sync ./local s3://my-bucket/

# Set lifecycle policy
aws s3api put-bucket-lifecycle-configuration \
  --bucket my-bucket \
  --lifecycle-configuration file://lifecycle.json

15.4 RDS (Relational Database Service)

Managed relational databases.

Supported Engines:

  • Amazon Aurora (MySQL/PostgreSQL compatible)
  • MySQL
  • PostgreSQL
  • MariaDB
  • Oracle
  • SQL Server

Example (PostgreSQL):

resource "aws_db_instance" "postgres" {
  identifier = "myapp-${var.environment}"
  
  engine         = "postgres"
  engine_version = "13.7"
  instance_class = "db.t3.micro"
  
  allocated_storage     = 20
  storage_type          = "gp3"
  storage_encrypted     = true
  
  db_name  = "myapp"
  username = "admin"
  password = random_password.db_password.result
  
  vpc_security_group_ids = [aws_security_group.database.id]
  db_subnet_group_name   = aws_db_subnet_group.main.name
  
  backup_retention_period = 30
  backup_window           = "03:00-04:00"
  maintenance_window      = "sun:04:00-sun:05:00"
  
  skip_final_snapshot = false
  final_snapshot_identifier = "myapp-${var.environment}-final-${formatdate("YYYY-MM-DD-hhmm", timestamp())}"
  
  tags = {
    Environment = var.environment
  }
}

resource "random_password" "db_password" {
  length  = 32
  special = false
}

Aurora Serverless:

resource "aws_rds_cluster" "aurora" {
  cluster_identifier = "aurora-serverless-${var.environment}"
  engine             = "aurora-postgresql"
  engine_version     = "13.6"
  database_name      = "myapp"
  master_username    = "admin"
  master_password    = random_password.db_password.result
  
  serverlessv2_scaling_configuration {
    min_capacity = 0.5
    max_capacity = 8
  }
  
  vpc_security_group_ids = [aws_security_group.database.id]
  db_subnet_group_name   = aws_db_subnet_group.main.name
  
  backup_retention_period = 7
  
  skip_final_snapshot = false
  final_snapshot_identifier = "aurora-${var.environment}-final"
}

# Serverless v2 also needs at least one cluster instance using the special
# "db.serverless" instance class
resource "aws_rds_cluster_instance" "aurora" {
  cluster_identifier = aws_rds_cluster.aurora.id
  instance_class     = "db.serverless"
  engine             = aws_rds_cluster.aurora.engine
  engine_version     = aws_rds_cluster.aurora.engine_version
}

15.5 VPC (Virtual Private Cloud)

Isolated network environment.

Complete VPC Example:

# VPC
resource "aws_vpc" "main" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_hostnames = true
  enable_dns_support   = true
  
  tags = {
    Name = "main-${var.environment}"
  }
}

# Public subnets
resource "aws_subnet" "public" {
  count = length(var.availability_zones)
  
  vpc_id                  = aws_vpc.main.id
  cidr_block              = "10.0.${count.index}.0/24"
  availability_zone       = var.availability_zones[count.index]
  map_public_ip_on_launch = true
  
  tags = {
    Name = "public-${var.availability_zones[count.index]}"
  }
}

# Private subnets
resource "aws_subnet" "private" {
  count = length(var.availability_zones)
  
  vpc_id            = aws_vpc.main.id
  cidr_block        = "10.0.${count.index + 10}.0/24"
  availability_zone = var.availability_zones[count.index]
  
  tags = {
    Name = "private-${var.availability_zones[count.index]}"
  }
}

# Internet Gateway
resource "aws_internet_gateway" "main" {
  vpc_id = aws_vpc.main.id
  
  tags = {
    Name = "main-igw"
  }
}

# NAT Gateways (one per AZ)
resource "aws_eip" "nat" {
  count  = length(var.availability_zones)
  domain = "vpc" # the older "vpc = true" argument is deprecated in AWS provider v5+
  
  tags = {
    Name = "nat-${var.availability_zones[count.index]}"
  }
}

resource "aws_nat_gateway" "main" {
  count = length(var.availability_zones)
  
  allocation_id = aws_eip.nat[count.index].id
  subnet_id     = aws_subnet.public[count.index].id
  
  tags = {
    Name = "nat-${var.availability_zones[count.index]}"
  }
  
  depends_on = [aws_internet_gateway.main]
}

# Route tables
resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id
  
  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.main.id
  }
  
  tags = {
    Name = "public"
  }
}

resource "aws_route_table" "private" {
  count = length(var.availability_zones)
  
  vpc_id = aws_vpc.main.id
  
  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.main[count.index].id
  }
  
  tags = {
    Name = "private-${var.availability_zones[count.index]}"
  }
}

# Route table associations
resource "aws_route_table_association" "public" {
  count = length(var.availability_zones)
  
  subnet_id      = aws_subnet.public[count.index].id
  route_table_id = aws_route_table.public.id
}

resource "aws_route_table_association" "private" {
  count = length(var.availability_zones)
  
  subnet_id      = aws_subnet.private[count.index].id
  route_table_id = aws_route_table.private[count.index].id
}
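The `cidr_block` expressions in the subnet resources above carve /24 networks out of the /16 VPC, with private subnets offset by 10. The same arithmetic can be reproduced with Python's standard `ipaddress` module; the AZ names are assumed examples.

```python
# Reproduce the subnet numbering used in the Terraform above: public
# subnets take 10.0.0.0/24, 10.0.1.0/24, ... and private subnets are
# offset by 10 (10.0.10.0/24, 10.0.11.0/24, ...).
import ipaddress

vpc = ipaddress.ip_network("10.0.0.0/16")
azs = ["us-east-1a", "us-east-1b", "us-east-1c"]  # assumed AZ list

subnets = list(vpc.subnets(new_prefix=24))  # all 256 possible /24s

public  = {az: subnets[i]      for i, az in enumerate(azs)}
private = {az: subnets[i + 10] for i, az in enumerate(azs)}

print(public["us-east-1a"])   # 10.0.0.0/24
print(private["us-east-1a"])  # 10.0.10.0/24
```

Keeping the offset well above the public range leaves room to add AZs later without renumbering.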

15.6 IAM (Identity and Access Management)

IAM User and Group:

# Create group
resource "aws_iam_group" "developers" {
  name = "developers"
}

# Create user
resource "aws_iam_user" "john" {
  name = "john.doe"
  path = "/developers/"
}

# Add user to group
resource "aws_iam_group_membership" "developers" {
  name = "developers-group-membership"
  
  users = [
    aws_iam_user.john.name,
  ]
  
  group = aws_iam_group.developers.name
}

# Group policy
resource "aws_iam_group_policy" "developers_policy" {
  name  = "developers-policy"
  group = aws_iam_group.developers.name
  
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "ec2:Describe*",
          "s3:ListBucket",
        ]
        Resource = "*"
      }
    ]
  })
}

IAM Role for EC2:

# Role
resource "aws_iam_role" "ec2_role" {
  name = "ec2-role"
  
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Principal = {
          Service = "ec2.amazonaws.com"
        }
        Action = "sts:AssumeRole"
      }
    ]
  })
}

# Policy attachment
resource "aws_iam_role_policy_attachment" "s3_read" {
  role       = aws_iam_role.ec2_role.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess"
}

# Instance profile
resource "aws_iam_instance_profile" "ec2_profile" {
  name = "ec2-profile"
  role = aws_iam_role.ec2_role.name
}

15.7 EKS (Elastic Kubernetes Service)

Managed Kubernetes on AWS.

EKS Cluster:

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "19.0.0"
  
  cluster_name    = "myapp-${var.environment}"
  cluster_version = "1.24"
  
  vpc_id     = aws_vpc.main.id
  subnet_ids = concat(aws_subnet.public[*].id, aws_subnet.private[*].id)
  
  # Managed node groups
  eks_managed_node_groups = {
    main = {
      desired_size = 3
      min_size     = 1
      max_size     = 10
      
      instance_types = ["t3.medium"]
      
      tags = {
        Environment = var.environment
      }
    }
  }
  
  # Fargate profiles (serverless)
  fargate_profiles = {
    default = {
      name = "default"
      selectors = [
        {
          namespace = "default"
        }
      ]
    }
  }
  
  tags = {
    Environment = var.environment
  }
}

# Configure kubectl: the EKS module no longer exposes a kubeconfig output
# in v18+, so generate one with the AWS CLI instead:
#   aws eks update-kubeconfig --region <region> --name myapp-<environment>

Access Entry (EKS API):

resource "aws_eks_access_entry" "admin" {
  cluster_name  = module.eks.cluster_name
  principal_arn = "arn:aws:iam::123456789:role/Admin"
  type          = "STANDARD"
}

resource "aws_eks_access_policy_association" "admin" {
  cluster_name  = module.eks.cluster_name
  policy_arn    = "arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy"
  principal_arn = aws_eks_access_entry.admin.principal_arn
  
  access_scope {
    type = "cluster"
  }
}

Chapter 16 — Microsoft Azure

16.1 Microsoft Azure Overview

Azure is Microsoft's cloud platform, strong in enterprise integration.

Global Infrastructure:

  • 60+ regions worldwide
  • Availability Zones
  • ExpressRoute private connections

Key Services:

  • Azure Virtual Machines (IaaS)
  • Azure Kubernetes Service (AKS)
  • Azure App Service (PaaS)
  • Azure SQL Database
  • Azure DevOps

16.2 Virtual Machines

VM Deployment:

# Terraform AzureRM provider
provider "azurerm" {
  features {}
}

resource "azurerm_resource_group" "main" {
  name     = "myapp-${var.environment}-rg"
  location = var.location
}

resource "azurerm_virtual_network" "main" {
  name                = "myapp-${var.environment}-vnet"
  address_space       = ["10.0.0.0/16"]
  location            = azurerm_resource_group.main.location
  resource_group_name = azurerm_resource_group.main.name
}

resource "azurerm_subnet" "internal" {
  name                 = "internal"
  resource_group_name  = azurerm_resource_group.main.name
  virtual_network_name = azurerm_virtual_network.main.name
  address_prefixes     = ["10.0.2.0/24"]
}

resource "azurerm_public_ip" "vm" {
  name                = "vm-public-ip"
  resource_group_name = azurerm_resource_group.main.name
  location            = azurerm_resource_group.main.location
  allocation_method   = "Dynamic"
}

resource "azurerm_network_interface" "main" {
  name                = "vm-nic"
  resource_group_name = azurerm_resource_group.main.name
  location            = azurerm_resource_group.main.location
  
  ip_configuration {
    name                          = "internal"
    subnet_id                     = azurerm_subnet.internal.id
    private_ip_address_allocation = "Dynamic"
    public_ip_address_id          = azurerm_public_ip.vm.id
  }
}

resource "azurerm_linux_virtual_machine" "main" {
  name                = "vm-${var.environment}"
  resource_group_name = azurerm_resource_group.main.name
  location            = azurerm_resource_group.main.location
  size                = "Standard_B2s"
  admin_username      = "azureuser"
  
  network_interface_ids = [
    azurerm_network_interface.main.id,
  ]
  
  admin_ssh_key {
    username   = "azureuser"
    public_key = file("~/.ssh/id_rsa.pub")
  }
  
  source_image_reference {
    publisher = "Canonical"
    offer     = "0001-com-ubuntu-server-focal"
    sku       = "20_04-lts"
    version   = "latest"
  }
  
  os_disk {
    caching              = "ReadWrite"
    storage_account_type = "Standard_LRS"
  }
  
  tags = {
    environment = var.environment
  }
}

16.3 Azure Kubernetes Service (AKS)

AKS Cluster:

resource "azurerm_kubernetes_cluster" "main" {
  name                = "aks-${var.environment}"
  location            = azurerm_resource_group.main.location
  resource_group_name = azurerm_resource_group.main.name
  dns_prefix          = "myapp-${var.environment}"
  
  default_node_pool {
    name       = "default"
    node_count = 3
    vm_size    = "Standard_DS2_v2"
    
    enable_auto_scaling = true
    min_count           = 1
    max_count           = 5
  }
  
  identity {
    type = "SystemAssigned"
  }
  
  network_profile {
    network_plugin = "azure"
    network_policy = "calico"
  }
  
  role_based_access_control_enabled = true
  
  azure_active_directory_role_based_access_control {
    managed            = true
    azure_rbac_enabled = true
  }
  
  tags = {
    Environment = var.environment
  }
}

# Get credentials
resource "local_file" "kubeconfig" {
  content  = azurerm_kubernetes_cluster.main.kube_config_raw
  filename = "./kubeconfig_aks_${var.environment}"
}

AKS with Availability Zones:

resource "azurerm_kubernetes_cluster" "main" {
  # ... existing configuration ...
  
  default_node_pool {
    name                = "default"
    node_count          = 3
    vm_size             = "Standard_DS2_v2"
    zones               = ["1", "2", "3"] # named "availability_zones" before azurerm 3.x
    enable_node_public_ip = false
    
    upgrade_settings {
      max_surge = "33%"
    }
  }
  
  # Enable cluster autoscaler
  auto_scaler_profile {
    balance_similar_node_groups = true
    max_graceful_termination_sec = 600
  }
}

16.4 Azure DevOps Integration

Service Connection:

# azure-pipelines.yml
trigger:
- main

pool:
  vmImage: ubuntu-latest

variables:
  azureSubscription: 'my-azure-connection'
  resourceGroup: 'myapp-prod-rg'
  aksCluster: 'myapp-prod-aks'

stages:
- stage: Build
  jobs:
  - job: Build
    steps:
    - task: Docker@2
      inputs:
        containerRegistry: 'my-acr'
        repository: 'myapp'
        command: 'buildAndPush'
        Dockerfile: '**/Dockerfile'
        tags: '$(Build.BuildId)'

- stage: Deploy
  jobs:
  - deployment: Deploy
    environment: 'production'
    strategy:
      runOnce:
        deploy:
          steps:
          - task: KubernetesManifest@0
            inputs:
              action: 'deploy'
              kubernetesServiceConnection: 'my-aks-connection'
              namespace: 'default'
              manifests: 'manifests/deployment.yaml'
              containers: 'myacr.azurecr.io/myapp:$(Build.BuildId)'

16.5 Networking & Security

Virtual Network with Service Endpoints:

resource "azurerm_virtual_network" "main" {
  name                = "vnet-${var.environment}"
  location            = azurerm_resource_group.main.location
  resource_group_name = azurerm_resource_group.main.name
  address_space       = ["10.0.0.0/16"]
}

# Subnet with service endpoints
resource "azurerm_subnet" "private" {
  name                 = "private"
  resource_group_name  = azurerm_resource_group.main.name
  virtual_network_name = azurerm_virtual_network.main.name
  address_prefixes     = ["10.0.1.0/24"]
  
  service_endpoints = [
    "Microsoft.Sql",
    "Microsoft.Storage"
  ]
}

# Private endpoint for storage
resource "azurerm_private_endpoint" "storage" {
  name                = "pe-storage-${var.environment}"
  location            = azurerm_resource_group.main.location
  resource_group_name = azurerm_resource_group.main.name
  subnet_id           = azurerm_subnet.private.id
  
  private_service_connection {
    name                           = "storage-connection"
    private_connection_resource_id = azurerm_storage_account.main.id
    is_manual_connection           = false
    subresource_names              = ["blob"]
  }
}

Network Security Group:

resource "azurerm_network_security_group" "web" {
  name                = "nsg-web"
  location            = azurerm_resource_group.main.location
  resource_group_name = azurerm_resource_group.main.name
  
  security_rule {
    name                       = "HTTP"
    priority                   = 100
    direction                  = "Inbound"
    access                     = "Allow"
    protocol                   = "Tcp"
    source_port_range          = "*"
    destination_port_range     = "80"
    source_address_prefix      = "*"
    destination_address_prefix = "*"
  }
  
  security_rule {
    name                       = "HTTPS"
    priority                   = 110
    direction                  = "Inbound"
    access                     = "Allow"
    protocol                   = "Tcp"
    source_port_range          = "*"
    destination_port_range     = "443"
    source_address_prefix      = "*"
    destination_address_prefix = "*"
  }
  
  security_rule {
    name                       = "SSH"
    priority                   = 120
    direction                  = "Inbound"
    access                     = "Allow"
    protocol                   = "Tcp"
    source_port_range          = "*"
    destination_port_range     = "22"
    source_address_prefix      = "10.0.0.0/8"
    destination_address_prefix = "*"
  }
}

Chapter 17 — Google Cloud Platform

17.1 Google Cloud Platform Overview

GCP excels in data analytics, machine learning, and containers.

Global Infrastructure:

  • 30+ regions
  • 100+ edge locations
  • Global fiber network

Key Services:

  • Compute Engine (VMs)
  • Google Kubernetes Engine (GKE)
  • BigQuery (analytics)
  • Cloud Run (serverless containers)
  • Cloud Functions

17.2 Compute Engine

VM Instance:

# Terraform GCP provider
provider "google" {
  project = var.project_id
  region  = var.region
}

resource "google_compute_network" "vpc" {
  name                    = "vpc-${var.environment}"
  auto_create_subnetworks = false
}

resource "google_compute_subnetwork" "subnet" {
  name          = "subnet-${var.environment}"
  ip_cidr_range = "10.0.1.0/24"
  region        = var.region
  network       = google_compute_network.vpc.id
  
  private_ip_google_access = true
}

resource "google_compute_firewall" "ssh" {
  name    = "allow-ssh"
  network = google_compute_network.vpc.name
  
  allow {
    protocol = "tcp"
    ports    = ["22"]
  }
  
  source_ranges = ["0.0.0.0/0"]
  target_tags   = ["ssh"]
}

resource "google_compute_address" "static" {
  name = "vm-address-${var.environment}"
}

resource "google_compute_instance" "default" {
  name         = "vm-${var.environment}"
  machine_type = "e2-medium"
  zone         = var.zone
  
  tags = ["ssh", "http"]
  
  boot_disk {
    initialize_params {
      image = "ubuntu-os-cloud/ubuntu-2004-lts"
      size  = 50
      type  = "pd-ssd"
    }
  }
  
  network_interface {
    network    = google_compute_network.vpc.name
    subnetwork = google_compute_subnetwork.subnet.name
    
    access_config {
      nat_ip = google_compute_address.static.address
    }
  }
  
  metadata_startup_script = <<-EOF
    #!/bin/bash
    apt-get update
    apt-get install -y nginx
    systemctl start nginx
  EOF
  
  service_account {
    scopes = ["cloud-platform"]
  }
}

17.3 GKE (Google Kubernetes Engine)

GKE Cluster:

resource "google_container_cluster" "primary" {
  name     = "gke-${var.environment}"
  location = var.region
  
  remove_default_node_pool = true
  initial_node_count       = 1
  
  network    = google_compute_network.vpc.name
  subnetwork = google_compute_subnetwork.subnet.name
  
  # Enable Shielded Nodes
  enable_shielded_nodes = true
  
  # Release channel (RAPID, REGULAR, STABLE)
  release_channel {
    channel = "REGULAR"
  }
  
  # Private cluster
  private_cluster_config {
    enable_private_nodes    = true
    enable_private_endpoint = false
    master_ipv4_cidr_block  = "172.16.0.0/28"
  }
  
  # Network policy
  network_policy {
    enabled = true
  }
  
  # Workload identity
  workload_identity_config {
    workload_pool = "${var.project_id}.svc.id.goog"
  }
  
  maintenance_policy {
    recurring_window {
      start_time = "2023-01-01T04:00:00Z"
      end_time   = "2023-01-01T06:00:00Z"
      recurrence = "FREQ=WEEKLY;BYDAY=SA,SU"
    }
  }
}

resource "google_container_node_pool" "primary_nodes" {
  name       = "primary-pool"
  location   = var.region
  cluster    = google_container_cluster.primary.name
  node_count = 3
  
  node_config {
    machine_type = "e2-standard-4"
    
    service_account = google_service_account.gke.email
    oauth_scopes = [
      "https://www.googleapis.com/auth/cloud-platform"
    ]
    
    metadata = {
      disable-legacy-endpoints = "true"
    }
    
    labels = {
      environment = var.environment
    }
    
    tags = ["gke-node", var.environment]
    
    shielded_instance_config {
      enable_secure_boot          = true
      enable_integrity_monitoring = true
    }
    
    workload_metadata_config {
      mode = "GKE_METADATA"
    }
  }
  
  autoscaling {
    min_node_count = 1
    max_node_count = 10
  }
  
  management {
    auto_repair  = true
    auto_upgrade = true
  }
}

17.4 IAM

Service Account:

# Service account
resource "google_service_account" "gke" {
  account_id   = "gke-sa-${var.environment}"
  display_name = "GKE Service Account"
}

# IAM binding
resource "google_project_iam_member" "gke_logging" {
  project = var.project_id
  role    = "roles/logging.logWriter"
  member  = "serviceAccount:${google_service_account.gke.email}"
}

resource "google_project_iam_member" "gke_monitoring" {
  project = var.project_id
  role    = "roles/monitoring.metricWriter"
  member  = "serviceAccount:${google_service_account.gke.email}"
}

resource "google_project_iam_member" "gke_metadata" {
  project = var.project_id
  role    = "roles/stackdriver.resourceMetadata.writer"
  member  = "serviceAccount:${google_service_account.gke.email}"
}

Custom Role:

resource "google_project_iam_custom_role" "myrole" {
  role_id     = "customRole_${var.environment}"
  title       = "Custom Role"
  description = "Custom role for myapp"
  permissions = [
    "storage.buckets.get",
    "storage.objects.get",
    "storage.objects.list",
  ]
}

resource "google_project_iam_member" "custom" {
  project = var.project_id
  role    = google_project_iam_custom_role.myrole.id
  member  = "serviceAccount:${google_service_account.app.email}"
}

17.5 BigQuery

Data warehouse for analytics.

Dataset and Table:

resource "google_bigquery_dataset" "dataset" {
  dataset_id    = "myapp_${replace(var.environment, "-", "_")}"
  friendly_name = "MyApp Dataset"
  description   = "Dataset for MyApp analytics"
  location      = var.region
  
  default_table_expiration_ms = 2592000000 # 30 days
  
  labels = {
    environment = var.environment
  }
}

resource "google_bigquery_table" "events" {
  dataset_id = google_bigquery_dataset.dataset.dataset_id
  table_id   = "events"
  
  time_partitioning {
    type = "DAY"
  }
  
  clustering = ["event_type", "user_id"]
  
  schema = jsonencode([
    {
      name = "event_id"
      type = "STRING"
      mode = "REQUIRED"
    },
    {
      name = "event_type"
      type = "STRING"
      mode = "REQUIRED"
    },
    {
      name = "user_id"
      type = "STRING"
      mode = "REQUIRED"
    },
    {
      name = "timestamp"
      type = "TIMESTAMP"
      mode = "REQUIRED"
    },
    {
      name = "properties"
      type = "JSON"
      mode = "NULLABLE"
    }
  ])
}

# Authorized view
resource "google_bigquery_table" "daily_events" {
  dataset_id = google_bigquery_dataset.dataset.dataset_id
  table_id   = "daily_events"
  
  view {
    query = <<EOF
      SELECT
        DATE(timestamp) as event_date,
        event_type,
        COUNT(*) as count
      FROM `${var.project_id}.${google_bigquery_dataset.dataset.dataset_id}.events`
      GROUP BY event_date, event_type
    EOF
    
    use_legacy_sql = false
  }
}

BigQuery Query Example:

-- Top users by event count
SELECT
  user_id,
  COUNT(*) as event_count
FROM `myproject.myapp_prod.events`
WHERE timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
GROUP BY user_id
ORDER BY event_count DESC
LIMIT 10;

-- Real-time dashboard query
SELECT
  event_type,
  COUNT(*) as events,
  COUNT(DISTINCT user_id) as unique_users
FROM `myproject.myapp_prod.events`
WHERE timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
GROUP BY event_type;

PART VII — OBSERVABILITY & SRE

Chapter 18 — Monitoring & Logging

18.1 Monitoring Principles

What to Monitor:

  • Infrastructure: CPU, memory, disk, network
  • Application: Request rate, errors, latency
  • Business: Active users, revenue, conversions
  • Security: Auth failures, suspicious patterns

The Four Golden Signals (Google):

  1. Latency: Time to serve requests
  2. Traffic: How much demand
  3. Errors: Rate of failed requests
  4. Saturation: How "full" the system is

RED Method (for services):

  • Rate: Requests per second
  • Errors: Failed requests per second
  • Duration: Distribution of request latencies

USE Method (for resources):

  • Utilization: Average time resource busy
  • Saturation: Extra work resource can't handle
  • Errors: Error counts
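The RED metrics above can be computed directly from a window of request records. The sketch below uses a synthetic one-minute sample; the record fields and values are made-up data, and the median is a deliberately crude stand-in for a real latency histogram.

```python
# Compute RED metrics (Rate, Errors, Duration) over a synthetic one-minute
# window of request records. The records are made-up sample data.
requests = [
    {"status": 200, "duration_ms": 45},
    {"status": 200, "duration_ms": 60},
    {"status": 500, "duration_ms": 120},
    {"status": 200, "duration_ms": 30},
]
window_seconds = 60

rate = len(requests) / window_seconds                           # requests/sec
errors = sum(1 for r in requests if r["status"] >= 500) / window_seconds
durations = sorted(r["duration_ms"] for r in requests)
p50 = durations[len(durations) // 2]                            # crude median

print(f"Rate: {rate:.3f} req/s, Errors: {errors:.3f}/s, p50: {p50}ms")
```

In practice a monitoring system like Prometheus does this aggregation continuously; the point here is only what each RED signal measures.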

18.2 Metrics vs Logs vs Traces

Metrics:

  • Numerical measurements over time
  • Small data footprint
  • Aggregatable
  • Best for: Alerting, dashboards, trends

Examples: CPU usage, request latency p99, error rate

Logs:

  • Detailed event records
  • Text or structured data
  • Large volume
  • Best for: Debugging, audit trails, detailed analysis

Examples: Error stack traces, access logs, audit events

Traces:

  • End-to-end request paths
  • Span context
  • Show service dependencies
  • Best for: Performance analysis, distributed debugging

Examples:

  • Frontend → API → Auth → Database
  • Service call hierarchies

The Three Pillars of Observability:

Observability
├── Metrics (what's happening)
├── Logs (why it's happening)
└── Traces (where it's happening)

18.3 Prometheus

Prometheus is the leading open-source monitoring system.

Architecture:

Service → Exporter → Prometheus Server → Alertmanager
              ↑            ↓
          Service      Grafana
          Discovery

Prometheus Configuration:

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

rule_files:
  - 'alerts.yml'

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']

  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)

Exporters:

  • node_exporter: System metrics
  • blackbox_exporter: HTTP/HTTPS probing
  • mysqld_exporter: MySQL metrics
  • postgres_exporter: PostgreSQL metrics
  • nginx_exporter: Nginx metrics

PromQL (Prometheus Query Language):

# CPU usage
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Request rate
rate(http_requests_total[5m])

# Error ratio
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

# 95th percentile latency
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Memory usage
container_memory_usage_bytes{container!=""} / container_spec_memory_limit_bytes

18.4 Grafana

Visualization and dashboards.

Dashboard Example:

{
  "title": "Web Service Dashboard",
  "panels": [
    {
      "title": "Request Rate",
      "type": "graph",
      "targets": [
        {
          "expr": "rate(http_requests_total[1m])",
          "legendFormat": "{{service}}"
        }
      ]
    },
    {
      "title": "Error Rate",
      "type": "graph",
      "targets": [
        {
          "expr": "rate(http_requests_total{status=~'5..'}[1m])",
          "legendFormat": "{{service}}"
        }
      ]
    },
    {
      "title": "Latency (p99)",
      "type": "heatmap",
      "targets": [
        {
          "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))"
        }
      ]
    }
  ]
}

Grafana Datasources:

  • Prometheus
  • Elasticsearch
  • InfluxDB
  • Graphite
  • CloudWatch
  • Azure Monitor
  • Google Cloud Monitoring

18.5 ELK Stack

Elasticsearch, Logstash, Kibana for logging.

Architecture:

Logs → Filebeat → Logstash → Elasticsearch → Kibana
                    ↑
              (Processing)

Filebeat Configuration:

# filebeat.yml
filebeat.inputs:
- type: container
  paths:
    - /var/log/containers/*.log
  processors:
    - add_kubernetes_metadata:
        host: ${NODE_NAME}
        matchers:
        - logs_path:
            logs_path: "/var/log/containers/"

output.logstash:
  hosts: ["logstash:5044"]

Logstash Configuration:

input {
  beats {
    port => 5044
  }
}

filter {
  grok {
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
  
  date {
    match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
  }
  
  geoip {
    source => "clientip"
  }
}

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "logs-%{+YYYY.MM.dd}"
  }
}

Kibana Queries:

# Find errors
log_level: ERROR

# Find specific request
request_id: "abc123"

# Time range and filter
@timestamp >= "now-1h" AND kubernetes.namespace: production

# Pattern matching
message: "Failed to connect to *"

18.6 Alerting Strategies

Alert Design Principles:

  1. Actionable: Alerts should require action
  2. Urgent: Alert on imminent problems
  3. Real: Avoid false positives
  4. Understandable: Clear what's wrong
  5. Documented: Runbooks for alerts

Alert Severity Levels:

  • P0/Critical: Service down, immediate response
  • P1/High: Severe degradation, respond within hour
  • P2/Medium: Minor issues, respond within day
  • P3/Low: Informational, no response needed

Alert Rules (Prometheus):

# alerts.yml
groups:
- name: instance_alerts
  rules:
  - alert: InstanceDown
    expr: up == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }} down"
      description: "{{ $labels.instance }} has been down for more than 5 minutes."

  - alert: HighCPUUsage
    expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "High CPU usage on {{ $labels.instance }}"
      description: "CPU usage is {{ $value }}% for 10 minutes."

- name: service_alerts
  rules:
  - alert: HighErrorRate
    expr: |
      sum(rate(http_requests_total{status=~"5.."}[5m])) 
      / 
      sum(rate(http_requests_total[5m])) > 0.05
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "High error rate for {{ $labels.service }}"
      description: "Error rate is {{ $value | humanizePercentage }}"

Alertmanager Configuration:

# alertmanager.yml
global:
  slack_api_url: 'https://hooks.slack.com/services/...'

route:
  group_by: ['alertname', 'cluster']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'team-alerts'
  routes:
  - match:
      severity: critical
    receiver: 'pagerduty-critical'
    continue: true
  - match:
      severity: warning
    receiver: 'slack-warnings'

receivers:
- name: 'team-alerts'
  slack_configs:
  - channel: '#alerts'
    title: '{{ template "slack.title" . }}'
    text: '{{ template "slack.text" . }}'

- name: 'pagerduty-critical'
  pagerduty_configs:
  - service_key: '...'

- name: 'slack-warnings'
  slack_configs:
  - channel: '#warnings'
    title: '{{ template "slack.title" . }}'
    text: '{{ template "slack.text" . }}'

18.7 Incident Response

Incident Management Process:

  1. Detection: Alert triggers or user reports
  2. Triage: Assess severity and impact
  3. Response: Assign incident commander
  4. Mitigation: Stop the bleeding
  5. Resolution: Fix root cause
  6. Post-mortem: Learn and prevent

Incident Severity Matrix:

Severity Impact Response Examples
SEV1 Critical outage Immediate, all hands Site down, data loss
SEV2 Major degradation < 1 hour response Feature broken, slow
SEV3 Minor issue < 1 day response UI glitch, non-critical
SEV4 Informational Next release Cosmetic issues

Incident Commander Responsibilities:

  • Coordinate response
  • Communicate status
  • Make decisions
  • Delegate tasks
  • Track timeline

Communication Templates:

Initial Alert:

INCIDENT: {{title}}
SEVERITY: {{severity}}
TIME: {{timestamp}}
IMPACT: {{impact}}
LEAD: {{commander}}
CHANNEL: {{slack_channel}}

Status Update:

STATUS UPDATE ({{time}})
Current: {{what's happening}}
Action: {{what's being done}}
Next: {{next check-in}}

Resolution:

RESOLVED: {{title}}
TIME: {{timestamp}}
DURATION: {{duration}}
ACTION: {{mitigation}}
ROOT CAUSE: {{cause}}
POST-MORTEM: {{link}}

Chapter 19 — Site Reliability Engineering

19.1 SRE Principles

SRE applies software engineering to operations.

Core Principles (Google):

  1. Operations is a software problem: Automate away toil
  2. Manage by service level objectives: SLOs drive decisions
  3. Work to minimize toil: Spend 50% time on development
  4. Toil decreases monotonically: Operational load should shrink as the service matures
  5. Error budgets: Balance reliability and velocity
  6. Monitoring should be minimal: Alert on symptoms, not causes

SRE vs Traditional Ops:

  Aspect       Traditional Ops       SRE
  Focus        Keep systems running  Build systems that run themselves
  Change       Minimize change       Embrace change with safety
  Measurement  Uptime                Error budgets
  Work         Manual operations     Automation development
  Incidents    Fix and forget        Post-mortems and prevention

19.2 SLIs, SLOs, SLAs

Service Level Indicators (SLIs):

Metrics that measure service performance:

  • Availability: % of successful requests
  • Latency: Time to respond (e.g., p99 < 100ms)
  • Throughput: Requests per second
  • Durability: Data persistence rate
  • Correctness: % of accurate responses
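An availability SLI is just the ratio of good events to total events, which can then be compared against an SLO target. The request counts below are illustrative sample numbers.

```python
# Compute an availability SLI from request counts and check it against an
# SLO target. The counts here are illustrative sample data.
good_requests = 999_543
total_requests = 1_000_000
slo_target = 0.999  # 99.9%

sli = good_requests / total_requests
meets_slo = sli >= slo_target

print(f"Availability SLI: {sli:.4%}, meets 99.9% SLO: {meets_slo}")
```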

Service Level Objectives (SLOs):

Target values for SLIs:

"99.9% of requests complete in < 200ms over rolling 30 days"

Characteristics:

  • Specific and measurable
  • Time-bound
  • Achievable
  • Business-aligned

Service Level Agreements (SLAs):

Contracts with consequences for missing SLOs:

  • Financial penalties
  • Service credits
  • Legal implications

SLO Examples:

apiVersion: v1
kind: ServiceLevelObjective
metadata:
  name: api-availability
spec:
  service: user-api
  indicator:
    type: availability
    ratio:
      good:
        filter: "job='api' and status_code=200"
        count: successful_requests
      total:
        filter: "job='api'"
        count: total_requests
  target: 99.9%
  window: 30d
---
apiVersion: v1
kind: ServiceLevelObjective
metadata:
  name: api-latency
spec:
  service: user-api
  indicator:
    type: latency
    latency:
      threshold: 200ms
    filter: "job='api'"
  target: 99%
  window: 7d

19.3 Error Budgets

Error budgets = 100% - SLO target

Example: 99.9% SLO → 0.1% error budget

Error Budget Calculation:

Error Budget = (1 - SLO) × Total Time

For 30 days (2,592,000 seconds) with 99.9% SLO:
Error Budget = 0.001 × 2,592,000 = 2,592 seconds = 43.2 minutes
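The calculation above can be wrapped in a small helper; the function name and the 30-day default window are illustrative:

```python
def error_budget_seconds(slo: float, window_seconds: int = 30 * 24 * 3600) -> float:
    """Seconds of allowed unavailability: (1 - SLO) x window length."""
    return (1 - slo) * window_seconds

# 99.9% SLO over 30 days -> 2592 seconds (43.2 minutes)
budget = error_budget_seconds(0.999)
print(round(budget, 1), round(budget / 60, 1))
```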

Error Budget Policy:

  • While budget remains: Release velocity prioritized
  • When budget exhausted: Freeze releases, focus on reliability

Benefits:

  • Aligns Dev and Ops goals
  • Data-driven release decisions
  • Balances risk and innovation

19.4 Toil Reduction

What is Toil?

Manual, repetitive, automatable work with no enduring value.

Examples of Toil:

  • Manual deployments
  • Password resets
  • Restarting services
  • Answering repetitive questions
  • Manual data fixes

Toil Characteristics:

  1. Manual: Requires human action
  2. Repetitive: Done frequently
  3. Automatable: Could be done by machine
  4. Tactical: No lasting value
  5. Scales linearly: More work = more people

Toil Reduction Strategies:

  1. Measure toil: Track time spent
  2. Set goals: Target < 50% time on toil
  3. Automate everything: Scripts, tools, platforms
  4. Build self-service: Empower developers
  5. Improve reliability: Reduce firefighting

Toil Budget:

Time Allocation:
├── 50% max toil (operational)
└── 50% min engineering (development)
    ├── Automation
    ├── Tooling
    └── Architecture improvements

19.5 Chaos Engineering

Definition: A disciplined approach to experimenting on a system in order to identify failures before they become outages.

Principles (from Principles of Chaos):

  1. Build a hypothesis around steady state
  2. Vary real-world events
  3. Run experiments in production
  4. Automate experiments to run continuously
  5. Minimize blast radius

Chaos Engineering Tools:

  • Chaos Monkey: Random instance termination
  • Gremlin: Chaos engineering platform
  • Litmus: Kubernetes chaos
  • Chaos Mesh: Kubernetes chaos platform
  • AWS Fault Injection Simulator

Chaos Experiment Example (Chaos Mesh):

apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-example
spec:
  action: pod-kill
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: web-server
  duration: "60s"

Experiment Design:

  1. Define steady state: Normal metrics (error rate < 0.1%)
  2. Hypothesis: System survives losing one pod
  3. Run experiment: Kill one pod
  4. Prove/disprove: Did error rate spike?
  5. Fix or automate: Add redundancy or document
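The prove/disprove step can be reduced to a simple steady-state check. The function and the sample error rates below are hypothetical; in practice the samples would come from your monitoring system:

```python
def hypothesis_holds(error_rates: list[float], threshold: float = 0.001) -> bool:
    """Steady state holds if the error rate never exceeded the threshold
    during the experiment window (here: 0.1%, matching the example above)."""
    return max(error_rates) <= threshold

# Error-rate samples collected while one pod was being killed (hypothetical)
during_experiment = [0.0002, 0.0004, 0.0009, 0.0003]
print(hypothesis_holds(during_experiment))  # True -> system survived the pod kill
```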

19.6 Capacity Planning

Goals:

  • Meet demand without waste
  • Anticipate scaling needs
  • Optimize costs

Capacity Planning Process:

  1. Measure current usage: Trends, peaks
  2. Forecast demand: Business growth, seasonality
  3. Model scenarios: What-if analysis
  4. Plan capacity: When to add resources
  5. Procure/scale: Execute plan

Key Metrics:

  • Peak utilization: Max observed
  • Headroom: Buffer for spikes
  • Growth rate: % increase over time
  • Lead time: How long to add capacity

Prediction Methods:

Trend Analysis:

Future Capacity = Current Usage × (1 + Growth Rate)^Time
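The trend formula translates directly to a one-line projection (names are illustrative):

```python
def projected_capacity(current: float, growth_rate: float, periods: int) -> float:
    """Compound-growth forecast: current x (1 + rate)^periods."""
    return current * (1 + growth_rate) ** periods

# 500 req/s growing 10% per month, projected 12 months out
print(round(projected_capacity(500, 0.10, 12)))  # 1569
```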

Seasonal Patterns:

  • Daily patterns
  • Weekly patterns
  • Holiday spikes
  • Marketing campaigns

Tools:

  • Prometheus: Historical metrics
  • Grafana: Visualization
  • Forecast libraries: Prophet, statsmodels
  • Cloud auto-scaling: Dynamic capacity

PART VIII — DEVSECOPS

Chapter 20 — Secure DevOps

20.1 Threat Modeling

Identify and prioritize security threats.

Threat Modeling Process (STRIDE):

  • Spoofing: Impersonating something/someone
  • Tampering: Modifying data/code
  • Repudiation: Denying actions
  • Information Disclosure: Exposing data
  • Denial of Service: Disrupting service
  • Elevation of Privilege: Gaining unauthorized access

Common Frameworks:

PASTA (Process for Attack Simulation and Threat Analysis):

  1. Define objectives
  2. Define technical scope
  3. Decompose application
  4. Threat analysis
  5. Vulnerability analysis
  6. Attack modeling
  7. Risk analysis

Threat Modeling Example:

System: User Authentication Service

Assets:
- User credentials
- Session tokens
- Personal data

Trust Boundaries:
- Browser ↔ API
- API ↔ Database

Threats:
1. SQL Injection (Tampering)
   Mitigation: Parameterized queries, input validation

2. Session Hijacking (Spoofing)
   Mitigation: HTTPS, secure cookies, short expiration

3. Brute Force (DoS)
   Mitigation: Rate limiting, account lockout

4. Password Leak (Info Disclosure)
   Mitigation: Hashing, encryption, secure storage

20.2 Supply Chain Security

Protect against compromised dependencies and tools.

Supply Chain Attacks:

  • Dependency confusion: Malicious packages with same name
  • Typosquatting: Similar package names
  • Compromised maintainers: Attacked developer accounts
  • Build pipeline: Inject malware during build

Mitigation Strategies:

  1. Lock dependencies: Use lock files (package-lock.json)
  2. Verify integrity: Checksums, signatures
  3. Private registry: Curated packages
  4. Continuous scanning: Detect vulnerabilities
  5. Least privilege: Limit CI/CD permissions

Software Bill of Materials (SBOM):

{
  "bomFormat": "CycloneDX",
  "specVersion": "1.4",
  "version": 1,
  "components": [
    {
      "type": "library",
      "name": "lodash",
      "version": "4.17.21",
      "purl": "pkg:npm/lodash@4.17.21",
      "licenses": [{"license": {"id": "MIT"}}]
    }
  ]
}

20.3 SBOM (Software Bill of Materials)

What is SBOM?

A formal, machine-readable inventory of software components and dependencies.

SBOM Formats:

  • SPDX: Linux Foundation
  • CycloneDX: OWASP
  • SWID: ISO standard

Why SBOM Matters:

  • Know what's in your software
  • Rapid vulnerability response
  • License compliance
  • Supply chain transparency

Generating SBOM:

# Using syft
syft myapp:latest -o cyclonedx-json > sbom.json

# Using trivy
trivy image --format cyclonedx myapp:latest > sbom.json

# Using cdxgen
cdxgen -o bom.xml

20.4 Secrets Management

Never store secrets in code.

Secret Types:

  • API keys
  • Database passwords
  • TLS certificates
  • SSH keys
  • OAuth tokens

Secret Management Solutions:

HashiCorp Vault:

# Vault policy
path "secret/data/myapp/*" {
  capabilities = ["read"]
}
# Store secret
vault kv put secret/myapp/api key=12345

# Read secret
vault kv get secret/myapp/api

# Dynamic database credentials
vault read database/creds/myapp

Cloud Secret Managers:

  • AWS Secrets Manager:
aws secretsmanager create-secret --name myapp/api --secret-string '{"key":"12345"}'
  • Azure Key Vault:
az keyvault secret set --vault-name myvault --name api-key --value 12345
  • Google Secret Manager:
echo -n "12345" | gcloud secrets create api-key --data-file=-

Kubernetes Secrets:

apiVersion: v1
kind: Secret
metadata:
  name: db-secret
type: Opaque
data:
  username: YWRtaW4=  # base64 encoded
  password: MWYyZDFlMmU2N2Rm  # base64 encoded
---
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
  - name: app
    env:
    - name: DB_USERNAME
      valueFrom:
        secretKeyRef:
          name: db-secret
          key: username
    - name: DB_PASSWORD
      valueFrom:
        secretKeyRef:
          name: db-secret
          key: password

Tools for Secret Detection:

# GitHub Actions secret scanning
name: Secret Scanning
on: [push]
jobs:
  scan:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v2
    - name: TruffleHog
      uses: trufflesecurity/trufflehog@main
      with:
        path: ./
        base: ${{ github.event.repository.default_branch }}

20.5 CI/CD Security Hardening

Pipeline Security Checklist:

  • Use OIDC instead of long-lived credentials
  • Scan dependencies for vulnerabilities
  • Scan container images
  • Run SAST on code
  • Run DAST on deployments
  • Sign and verify artifacts
  • Immutable build environments
  • Least privilege for CI jobs
  • Audit all pipeline changes
  • Secrets never in logs

Secure Pipeline Example:

name: Secure CI/CD

on: [push]

jobs:
  security-scans:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      
      - name: Scan code for secrets
        uses: trufflesecurity/trufflehog@main
        
      - name: Initialize CodeQL
        uses: github/codeql-action/init@v1
        with:
          languages: javascript

      - name: Run SAST (CodeQL analysis)
        uses: github/codeql-action/analyze@v1
        
      - name: Scan dependencies
        run: |
          npm audit --audit-level=high
          npm outdated
      
      - name: Build image
        run: docker build -t myapp:${{ github.sha }} .
      
      - name: Scan image
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: 'myapp:${{ github.sha }}'
          severity: 'CRITICAL,HIGH'
      
      - name: Sign image
        run: |
          cosign sign --key k8s://my-namespace/cosign myapp:${{ github.sha }}
      
      - name: Deploy (if scans pass)
        if: success()
        run: ./deploy.sh

Chapter 21 — Security Tools

21.1 SAST (Static Application Security Testing)

Analyze source code for vulnerabilities.

Common SAST Tools:

  • SonarQube: Multi-language, quality and security
  • Checkmarx: Enterprise SAST
  • Fortify: Micro Focus
  • Semgrep: Fast, customizable
  • CodeQL: GitHub's analysis engine
  • ESLint (security plugins): JavaScript

Semgrep Example:

# semgrep.yml
rules:
  - id: no-hardcoded-secrets
    patterns:
      - pattern: password = "..."
      - pattern-not: password = os.getenv("...")
    message: "Hardcoded password detected"
    languages: [python]
    severity: ERROR

  - id: sql-injection
    patterns:
      - pattern: |
          cursor.execute("SELECT ... WHERE ... = " + $VAR)
    message: "Possible SQL injection"
    languages: [python]
    severity: WARNING

CI Integration:

- name: Run Semgrep
  uses: returntocorp/semgrep-action@v1
  with:
    config: >-
      p/security-audit
      p/secrets

21.2 DAST (Dynamic Application Security Testing)

Test running applications for vulnerabilities.

Common DAST Tools:

  • OWASP ZAP: Free, powerful
  • Burp Suite: Professional penetration testing
  • Acunetix: Commercial scanner
  • Nessus: Vulnerability scanner
  • Qualys: Cloud-based scanning

OWASP ZAP in CI:

- name: ZAP Scan
  uses: zaproxy/action-full-scan@v0.4.0
  with:
    target: 'https://staging.example.com'
    rules_file_name: '.zap/rules.tsv'
    cmd_options: '-a'

Types of DAST Tests:

  • Vulnerability scanning: SQLi, XSS, CSRF
  • Fuzzing: Unexpected inputs
  • Authentication testing: Login bypass
  • Session management: Token handling
  • Input validation: Boundary testing

21.3 Container Scanning

Scan container images for vulnerabilities.

Container Scanning Tools:

  • Trivy: Comprehensive, fast
  • Clair: CoreOS scanner
  • Anchore: Deep inspection
  • Docker Scout: Docker native
  • Grype: From Anchore
  • Snyk Container: Developer friendly

Trivy Example:

# Scan image
trivy image myapp:latest

# Scan with severity filter
trivy image --severity CRITICAL,HIGH myapp:latest

# Ignore unfixed
trivy image --ignore-unfixed myapp:latest

# Output formats
trivy image --format sarif myapp:latest > results.sarif

# Scan filesystem
trivy fs --severity HIGH,CRITICAL .

Kubernetes Admission Control:

apiVersion: v1
kind: ConfigMap
metadata:
  name: trivy-admission
data:
  policy.rego: |
    package trivy
    
    deny[msg] {
      input.request.kind.kind == "Pod"
      image := input.request.object.spec.containers[_].image
      not valid_image(image)
      msg := sprintf("Image %v has critical vulnerabilities", [image])
    }
    
    valid_image(image) {
      # Check with Trivy
      # ...
    }

21.4 Dependency Scanning

Scan project dependencies for known vulnerabilities.

Tools:

  • OWASP Dependency Check: Java, .NET, Python
  • Snyk: Multi-language, commercial
  • npm audit: JavaScript
  • Safety: Python
  • Gemnasium: GitLab's scanner
  • Dependabot: GitHub's automated updates

Snyk Example:

# .snyk
version: v1.25.0
ignore:
  SNYK-JS-LODASH-567746:
    - '*':
        reason: 'No patch available'
        expires: '2024-01-01'
patch: {}

CI Integration:

- name: Snyk Scan
  uses: snyk/actions/node@master
  env:
    SNYK_TOKEN: ${{ secrets.SNYK_TOKEN }}
  with:
    args: --severity-threshold=high

Dependabot Configuration:

# .github/dependabot.yml
version: 2
updates:
  - package-ecosystem: "npm"
    directory: "/"
    schedule:
      interval: "daily"
    open-pull-requests-limit: 10
    ignore:
      - dependency-name: "express"
        versions: ["5.x"]
    labels:
      - "dependencies"
      - "security"

21.5 Policy Enforcement

Enforce security policies across infrastructure.

Open Policy Agent (OPA):

package kubernetes.admission

deny[msg] {
  input.request.kind.kind == "Pod"
  container := input.request.object.spec.containers[_]
  not container.securityContext.runAsNonRoot
  msg := "Containers must set runAsNonRoot: true"
}

deny[msg] {
  input.request.kind.kind == "Deployment"
  not input.request.object.spec.template.metadata.labels.owner
  msg := "All resources must have owner label"
}

Kyverno (Kubernetes):

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-labels
spec:
  validationFailureAction: enforce
  rules:
  - name: check-for-labels
    match:
      resources:
        kinds:
        - Pod
    validate:
      message: "Label 'app' is required"
      pattern:
        metadata:
          labels:
            app: "?*"
---
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-latest-tag
spec:
  validationFailureAction: audit
  rules:
  - name: require-image-tag
    match:
      resources:
        kinds:
        - Pod
    validate:
      message: "Using 'latest' tag is not allowed"
      pattern:
        spec:
          containers:
          - image: "!*:latest"

Conftest (Configuration Testing):

package main

deny[msg] {
  input.kind == "Deployment"
  not input.spec.template.metadata.labels.app
  msg = "Deployments must have app label"
}

deny[msg] {
  input.kind == "Service"
  input.spec.type == "LoadBalancer"
  not input.metadata.annotations["service.beta.kubernetes.io/aws-load-balancer-internal"]
  msg = "LoadBalancer services must be internal"
}

# Test Kubernetes manifests against the policies
conftest test deployment.yaml --policy policy/

PART IX — ADVANCED TOPICS

Chapter 22 — GitOps & Platform Engineering

22.1 GitOps Principles

Core Principles:

  1. Declarative: Entire system described declaratively
  2. Versioned and Immutable: Desired state stored in Git
  3. Pulled Automatically: Software agents pull changes
  4. Continuously Reconciled: Correct drift automatically

GitOps Workflow:

Developer → Git Push
    ↓
Git Repository (source of truth)
    ↓
GitOps Operator (ArgoCD/Flux)
    ↓
Kubernetes Cluster
    ↑
Monitoring (drift detection)

Benefits:

  • Audit trail: All changes in Git
  • Faster recovery: Recreate cluster from Git
  • Standard tools: Use Git workflows
  • Security: Pull model reduces credentials
  • Observability: Drift detection

22.2 ArgoCD

Declarative GitOps for Kubernetes.

ArgoCD Architecture:

User (CLI/UI) → ArgoCD API Server
        ↓
   Repository Server
        ↓
   Controller
        ↓
   Kubernetes API

Application Definition:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: myapp
  namespace: argocd
spec:
  project: default
  
  source:
    repoURL: https://github.com/user/repo.git
    targetRevision: HEAD
    path: k8s
    helm:
      valueFiles:
      - values-production.yaml
  
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
    - CreateNamespace=true
    - PruneLast=true
  
  revisionHistoryLimit: 10

ApplicationSet (Multi-cluster):

apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: myapp
spec:
  generators:
  - clusters:
      selector:
        matchLabels:
          environment: production
  template:
    metadata:
      name: '{{name}}-myapp'
    spec:
      project: default
      source:
        repoURL: https://github.com/user/repo.git
        targetRevision: HEAD
        path: k8s
      destination:
        server: '{{server}}'
        namespace: 'myapp-{{name}}'

ArgoCD Commands:

# List apps
argocd app list

# Sync app
argocd app sync myapp

# Get app details
argocd app get myapp

# Rollback
argocd app rollback myapp 1

# Set image (with Kustomize)
argocd app set myapp --kustomize-image myapp:v2

22.3 Flux

Another GitOps operator, lighter weight.

Flux Components:

  • Source Controller: Manages Git repositories
  • Kustomize Controller: Applies Kustomize overlays
  • Helm Controller: Manages Helm releases
  • Notification Controller: Handles alerts

Flux Configuration:

# GitRepository source
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: GitRepository
metadata:
  name: myapp
  namespace: flux-system
spec:
  interval: 1m
  url: https://github.com/user/repo
  ref:
    branch: main
  secretRef:
    name: repo-auth

# Kustomization
apiVersion: kustomize.toolkit.fluxcd.io/v1beta2
kind: Kustomization
metadata:
  name: myapp
  namespace: flux-system
spec:
  interval: 10m
  path: ./k8s/overlays/production
  prune: true
  sourceRef:
    kind: GitRepository
    name: myapp
  healthChecks:
    - apiVersion: apps/v1
      kind: Deployment
      name: myapp
      namespace: production

Flux with Helm:

# HelmRepository
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: HelmRepository
metadata:
  name: bitnami
  namespace: flux-system
spec:
  interval: 1h
  url: https://charts.bitnami.com/bitnami

# HelmRelease
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: redis
  namespace: production
spec:
  interval: 5m
  chart:
    spec:
      chart: redis
      sourceRef:
        kind: HelmRepository
        name: bitnami
        namespace: flux-system
      interval: 1m
  values:
    architecture: standalone
    auth:
      enabled: false

22.4 Internal Developer Platforms

What is an IDP?

A layer of tools and services that development teams use to build, deploy, and operate applications without needing to understand the underlying infrastructure.

IDP Components:

Developer Portal (Backstage, Kratix)
    ↓
Orchestration (Terraform, Crossplane)
    ↓
GitOps (ArgoCD, Flux)
    ↓
Kubernetes (EKS, AKS, GKE)
    ↓
Cloud Providers (AWS, Azure, GCP)

Backstage (Spotify's Developer Portal):

// Component definition
import { Entity } from '@backstage/catalog-model';

export const myComponent: Entity = {
  apiVersion: 'backstage.io/v1alpha1',
  kind: 'Component',
  metadata: {
    name: 'my-service',
    description: 'My awesome service',
    annotations: {
      'github.com/project-slug': 'org/my-service',
      'backstage.io/techdocs-ref': 'dir:.',
    },
    tags: ['java', 'web'],
  },
  spec: {
    type: 'service',
    lifecycle: 'production',
    owner: 'team-a',
    system: 'product-catalog',
  },
};

Crossplane (Infrastructure as Code Platform):

apiVersion: aws.crossplane.io/v1beta1
kind: ProviderConfig
metadata:
  name: aws-provider
spec:
  credentials:
    source: Secret
    secretRef:
      namespace: crossplane-system
      name: aws-creds
      key: creds

---
apiVersion: database.aws.crossplane.io/v1beta1
kind: RDSInstance
metadata:
  name: mydb
spec:
  forProvider:
    region: us-east-1
    dbInstanceClass: db.t3.micro
    masterUsername: admin
    engine: postgres
    engineVersion: "13"
    allocatedStorage: 20
    publiclyAccessible: false
  writeConnectionSecretToRef:
    name: db-conn
    namespace: production
  providerConfigRef:
    name: aws-provider

Platform Engineering Team Responsibilities:

  • Build and maintain IDP
  • Define "golden paths" for developers
  • Provide self-service capabilities
  • Abstract infrastructure complexity
  • Ensure security and compliance
  • Collect feedback and improve

Golden Path Example:

Developer Workflow:
1. Create repo from template
2. Run `platform create-service`
3. Add code and push
4. PR creates preview environment
5. Merge to main → staging deploy
6. Promote to production via UI

Chapter 23 — Serverless & Edge

23.1 Serverless Architecture

What is Serverless?

  • No server management
  • Automatic scaling
  • Pay per execution
  • Event-driven

Benefits:

  • Reduced operational overhead
  • Auto-scaling to zero
  • Cost efficiency for variable workloads
  • Faster time to market

Trade-offs:

  • Cold starts
  • Vendor lock-in
  • Execution limits
  • Debugging complexity

23.2 AWS Lambda

Lambda Function Example (Node.js):

exports.handler = async (event) => {
  console.log('Event:', JSON.stringify(event, null, 2));
  
  try {
    const { name } = event.queryStringParameters || {};
    const response = {
      statusCode: 200,
      headers: {
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({
        message: `Hello, ${name || 'World'}!`,
        timestamp: new Date().toISOString(),
      }),
    };
    
    return response;
  } catch (error) {
    console.error('Error:', error);
    return {
      statusCode: 500,
      body: JSON.stringify({ error: 'Internal Server Error' }),
    };
  }
};

Terraform Lambda Deployment:

# IAM Role
resource "aws_iam_role" "lambda_role" {
  name = "lambda_role"
  
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = "sts:AssumeRole"
        Effect = "Allow"
        Principal = {
          Service = "lambda.amazonaws.com"
        }
      }
    ]
  })
}

# Lambda function
resource "aws_lambda_function" "api" {
  filename      = "function.zip"
  function_name = "my-api"
  role          = aws_iam_role.lambda_role.arn
  handler       = "index.handler"
  runtime       = "nodejs18.x"
  
  environment {
    variables = {
      TABLE_NAME = aws_dynamodb_table.data.name
    }
  }
  
  tracing_config {
    mode = "Active"
  }
}

# API Gateway trigger
resource "aws_apigatewayv2_api" "lambda" {
  name          = "serverless-api"
  protocol_type = "HTTP"
  
  cors {
    allow_origins = ["*"]
    allow_methods = ["GET", "POST"]
  }
}

resource "aws_apigatewayv2_integration" "lambda" {
  api_id = aws_apigatewayv2_api.lambda.id
  
  integration_uri    = aws_lambda_function.api.invoke_arn
  integration_type   = "AWS_PROXY"
  integration_method = "POST"
}

resource "aws_apigatewayv2_route" "get" {
  api_id    = aws_apigatewayv2_api.lambda.id
  route_key = "GET /hello"
  target    = "integrations/${aws_apigatewayv2_integration.lambda.id}"
}

23.3 Azure Functions

Azure Function (Python):

import azure.functions as func
import logging
import json
from datetime import datetime

def main(req: func.HttpRequest) -> func.HttpResponse:
    logging.info('Python HTTP trigger function processed a request.')
    
    name = req.params.get('name')
    if not name:
        try:
            req_body = req.get_json()
        except ValueError:
            pass
        else:
            name = req_body.get('name')
    
    if name:
        return func.HttpResponse(
            json.dumps({
                "message": f"Hello, {name}!",
                "timestamp": datetime.utcnow().isoformat()
            }),
            status_code=200,
            mimetype="application/json"
        )
    else:
        return func.HttpResponse(
            "Please pass a name on the query string or in the request body",
            status_code=400
        )

Azure Functions Configuration:

{
  "IsEncrypted": false,
  "Values": {
    "AzureWebJobsStorage": "UseDevelopmentStorage=true",
    "FUNCTIONS_WORKER_RUNTIME": "python",
    "COSMOS_CONNECTION": "AccountEndpoint=...;"
  }
}

23.4 Cloud Run

Serverless containers on GCP.

Cloud Run Service:

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: hello-world
spec:
  template:
    spec:
      containers:
      - image: gcr.io/myproject/hello:v1
        ports:
        - containerPort: 8080
        resources:
          limits:
            memory: "256Mi"
            cpu: "1"
        env:
        - name: DATABASE_URL
          valueFrom:
            secretKeyRef:
              name: db-secret
              key: url

Deployment with gcloud:

# Build and deploy
gcloud builds submit --tag gcr.io/myproject/hello:v1
gcloud run deploy hello \
  --image gcr.io/myproject/hello:v1 \
  --platform managed \
  --region us-central1 \
  --allow-unauthenticated \
  --memory 256Mi \
  --concurrency 80

Terraform:

resource "google_cloud_run_service" "default" {
  name     = "hello"
  location = "us-central1"
  
  template {
    spec {
      containers {
        image = "gcr.io/myproject/hello:v1"
        
        resources {
          limits = {
            cpu    = "1000m"
            memory = "256Mi"
          }
        }
        
        env {
          name = "DATABASE_URL"
          value_from {
            secret_key_ref {
              name = google_secret_manager_secret.db.secret_id
              key  = "latest"
            }
          }
        }
      }
      
      container_concurrency = 80
      timeout_seconds       = 300
    }
  }
  
  traffic {
    percent         = 100
    latest_revision = true
  }
}

23.5 Edge Computing

Compute at the network edge, closer to users.

Cloudflare Workers:

// Cloudflare Worker
addEventListener('fetch', event => {
  event.respondWith(handleRequest(event))
})

async function handleRequest(event) {
  const request = event.request
  const cache = caches.default
  let response = await cache.match(request)
  
  if (!response) {
    response = await fetch(request)
    
    // Cache responses
    if (response.status === 200) {
      const cloned = response.clone()
      const headers = new Headers(cloned.headers)
      headers.set('Cache-Control', 'public, max-age=3600')
      
      const cached = new Response(cloned.body, {
        status: cloned.status,
        statusText: cloned.statusText,
        headers: headers
      })
      
      event.waitUntil(cache.put(request, cached))
    }
  }
  
  return response
}

AWS Lambda@Edge:

'use strict';

// Origin response trigger
exports.handler = (event, context, callback) => {
  const response = event.Records[0].cf.response;
  const headers = response.headers;
  
  // Add security headers
  headers['strict-transport-security'] = [{
    key: 'Strict-Transport-Security',
    value: 'max-age=63072000; includeSubdomains; preload'
  }];
  
  headers['x-content-type-options'] = [{
    key: 'X-Content-Type-Options',
    value: 'nosniff'
  }];
  
  headers['x-frame-options'] = [{
    key: 'X-Frame-Options',
    value: 'DENY'
  }];
  
  headers['x-xss-protection'] = [{
    key: 'X-XSS-Protection',
    value: '1; mode=block'
  }];
  
  callback(null, response);
};

Use Cases:

  • CDN caching
  • Authentication at edge
  • A/B testing
  • Geolocation routing
  • Bot mitigation
  • API aggregation

Chapter 24 — Performance & Scalability

24.1 Load Balancing

Distribute traffic across multiple servers.

Load Balancer Types:

  • Layer 4 (Transport): TCP/UDP, IP-based
  • Layer 7 (Application): HTTP/HTTPS, content-based

Algorithms:

  • Round Robin: Simple rotation
  • Least Connections: Routes to the server with the fewest active connections
  • IP Hash: Sticky sessions
  • Weighted: Capacity-based distribution
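Two of these algorithms can be sketched in a few lines (helper names are illustrative):

```python
import itertools

def round_robin(servers):
    """Plain round robin: an endless rotation over the server list."""
    return itertools.cycle(servers)

def least_connections(active: dict) -> str:
    """Pick the server currently holding the fewest active connections."""
    return min(active, key=active.get)

rr = round_robin(["a", "b", "c"])
print([next(rr) for _ in range(4)])                   # ['a', 'b', 'c', 'a']
print(least_connections({"a": 12, "b": 3, "c": 7}))   # b
```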

AWS Application Load Balancer:

resource "aws_lb" "main" {
  name               = "app-lb"
  internal           = false
  load_balancer_type = "application"
  security_groups    = [aws_security_group.lb.id]
  subnets            = aws_subnet.public[*].id
  
  enable_deletion_protection = true
  
  access_logs {
    bucket  = aws_s3_bucket.lb_logs.bucket
    prefix  = "alb-logs"
    enabled = true
  }
}

resource "aws_lb_target_group" "app" {
  name     = "app-targets"
  port     = 80
  protocol = "HTTP"
  vpc_id   = aws_vpc.main.id
  
  health_check {
    enabled             = true
    healthy_threshold   = 2
    unhealthy_threshold = 2
    timeout             = 5
    interval            = 30
    path                = "/health"
  }
  
  stickiness {
    type            = "lb_cookie"
    cookie_duration = 86400
    enabled         = true
  }
}

resource "aws_lb_listener" "front_end" {
  load_balancer_arn = aws_lb.main.arn
  port              = "443"
  protocol          = "HTTPS"
  ssl_policy        = "ELBSecurityPolicy-2016-08"
  certificate_arn   = aws_acm_certificate.lb.arn
  
  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.app.arn
  }
}

24.2 CDN (Content Delivery Network)

Distribute content globally for faster delivery.

CloudFront with S3:

# Origin Access Identity
resource "aws_cloudfront_origin_access_identity" "oai" {
  comment = "OAI for S3 bucket"
}

# CloudFront distribution
resource "aws_cloudfront_distribution" "cdn" {
  enabled = true
  
  origin {
    domain_name = aws_s3_bucket.website.bucket_regional_domain_name
    origin_id   = "S3-website"
    
    s3_origin_config {
      origin_access_identity = aws_cloudfront_origin_access_identity.oai.cloudfront_access_identity_path
    }
  }
  
  default_cache_behavior {
    allowed_methods  = ["GET", "HEAD", "OPTIONS"]
    cached_methods   = ["GET", "HEAD"]
    target_origin_id = "S3-website"
    
    forwarded_values {
      query_string = false
      cookies {
        forward = "none"
      }
    }
    
    viewer_protocol_policy = "redirect-to-https"
    min_ttl                = 0
    default_ttl            = 3600
    max_ttl                = 86400
    
    compress = true
  }
  
  price_class = "PriceClass_100"
  
  viewer_certificate {
    cloudfront_default_certificate = true
  }
  
  restrictions {
    geo_restriction {
      restriction_type = "none"
    }
  }
  
  custom_error_response {
    error_code            = 404
    response_code         = 200
    response_page_path    = "/index.html"
    error_caching_min_ttl = 300
  }
  
  tags = {
    Environment = var.environment
  }
}

24.3 Caching Strategies

Cache Levels:

  1. Browser Cache: Local to user
  2. CDN Cache: Edge locations
  3. Application Cache: In-memory (Redis, Memcached)
  4. Database Cache: Query cache

Cache Headers:

# Nginx cache configuration
location /static/ {
    expires 1y;
    add_header Cache-Control "public, immutable";
}

location /api/ {
    expires 1m;
    add_header Cache-Control "private, must-revalidate";
    
    # Proxy cache
    proxy_cache api_cache;
    proxy_cache_key "$scheme$request_method$host$request_uri";
    proxy_cache_valid 200 302 60m;
    proxy_cache_valid 404 1m;
    proxy_cache_use_stale error timeout updating;
}

Redis Caching:

import redis
import json

redis_client = redis.Redis(host='redis', port=6379, db=0)

def get_user(user_id):
    # Try cache first
    cached = redis_client.get(f"user:{user_id}")
    if cached:
        return json.loads(cached)
    
    # Cache miss - get from database
    user = db.query(User).get(user_id)
    if user:
        # Store in cache for 1 hour
        redis_client.setex(
            f"user:{user_id}",
            3600,
            json.dumps(user.to_dict())
        )
    return user

def invalidate_user(user_id):
    redis_client.delete(f"user:{user_id}")

Cache Invalidation Strategies:

  • Time-based: Expire after TTL
  • Event-based: Invalidate on update
  • Version-based: Use version in cache key
  • Manual: Purge via API
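Version-based invalidation can be sketched with a plain dict standing in for Redis; all names here are illustrative:

```python
# A dict stands in for Redis here; the versioning logic is the same.
cache: dict = {}
versions: dict = {}

def cache_key(entity: str, entity_id: int) -> str:
    """The current version number is part of the key."""
    v = versions.get((entity, entity_id), 0)
    return f"{entity}:{entity_id}:v{v}"

def invalidate(entity: str, entity_id: int) -> None:
    # Bumping the version makes every old key unreachable -- no deletes needed;
    # stale entries simply age out via TTL.
    versions[(entity, entity_id)] = versions.get((entity, entity_id), 0) + 1

cache[cache_key("user", 42)] = {"name": "Ada"}
invalidate("user", 42)
print(cache_key("user", 42) in cache)  # False -- old entry is now orphaned
```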

24.4 Database Scaling

Vertical Scaling (Scale Up):

  • Bigger instance
  • More CPU/RAM
  • Limited by hardware

Horizontal Scaling (Scale Out):

  • More instances
  • Sharding
  • Read replicas

Read Replicas:

-- Write to master
INSERT INTO users (name) VALUES ('John');

-- Read from replica
SELECT * FROM users;  -- Connect to replica endpoint
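
Applications usually hide the replica topology behind a small router that sends writes to the primary and spreads reads across replicas. A minimal sketch, assuming connection-like objects supplied by the caller (the class name and round-robin policy are illustrative):

```python
import itertools

class ReadWriteRouter:
    """Route writes to the primary and spread reads across replicas."""

    def __init__(self, primary, replicas):
        self.primary = primary
        # Simple round-robin over replicas for read load spreading
        self._replica_cycle = itertools.cycle(replicas)

    def connection_for(self, sql):
        # Anything that is not a plain SELECT goes to the primary
        if sql.lstrip().upper().startswith("SELECT"):
            return next(self._replica_cycle)
        return self.primary
```

Note that replication lag means a read routed to a replica may briefly miss a just-committed write; read-your-own-writes flows should pin to the primary.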

Database Sharding:

Range-based: each shard owns a contiguous key range.

Shard 0: users 0-10000
Shard 1: users 10001-20000
Shard 2: users 20001-30000

Hash-based: a hash (or modulo) of the key picks the shard.

shard_id = user_id % num_shards
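
Both routing rules fit in a few lines. A minimal sketch (the boundary values are illustrative; `boundaries[i]` is the exclusive upper bound of shard i):

```python
def route_by_hash(user_id, num_shards):
    """Hash-based sharding: the key modulo the shard count picks the shard."""
    return user_id % num_shards

def route_by_range(user_id, boundaries):
    """Range-based sharding: return the first shard whose upper bound exceeds the key."""
    for shard_id, upper in enumerate(boundaries):
        if user_id < upper:
            return shard_id
    raise ValueError(f"user_id {user_id} is beyond the last shard boundary")
```

Hash routing spreads load evenly but makes range scans expensive; range routing keeps adjacent keys together but can create hot shards.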

Connection Pooling:

from sqlalchemy import create_engine
from sqlalchemy.pool import QueuePool

engine = create_engine(
    'postgresql://user:pass@localhost/mydb',
    poolclass=QueuePool,
    pool_size=20,
    max_overflow=10,
    pool_pre_ping=True,
    pool_recycle=3600
)

24.5 High Throughput Systems

Asynchronous Processing:

# FastAPI with background tasks
# (Order is a Pydantic model; save_order/update_database are app helpers, elided)
from fastapi import FastAPI, BackgroundTasks
import asyncio

app = FastAPI()

async def process_order(order_id: str):
    # Long-running task
    await asyncio.sleep(5)
    # Update order status
    await update_database(order_id, "processed")

@app.post("/orders")
async def create_order(order: Order, background_tasks: BackgroundTasks):
    # Save order quickly
    order_id = await save_order(order)
    
    # Process in background
    background_tasks.add_task(process_order, order_id)
    
    return {"order_id": order_id, "status": "accepted"}

Message Queues:

# Producer (FastAPI)
import json
import aio_pika

async def publish_order(order):
    connection = await aio_pika.connect_robust("amqp://guest:guest@rabbitmq/")
    channel = await connection.channel()
    
    await channel.default_exchange.publish(
        aio_pika.Message(
            body=json.dumps(order).encode(),
            delivery_mode=aio_pika.DeliveryMode.PERSISTENT
        ),
        routing_key="orders"
    )
    
    await connection.close()

# Consumer (Worker)
async def process_orders():
    connection = await aio_pika.connect_robust("amqp://guest:guest@rabbitmq/")
    channel = await connection.channel()
    
    queue = await channel.declare_queue("orders", durable=True)
    
    async with queue.iterator() as queue_iter:
        async for message in queue_iter:
            async with message.process():
                order = json.loads(message.body)
                await process_order(order)

Rate Limiting:

from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse
import redis

app = FastAPI()
redis_client = redis.Redis(host='redis', port=6379, db=0)

@app.middleware("http")
async def rate_limit(request: Request, call_next):
    client_ip = request.client.host
    key = f"rate_limit:{client_ip}"
    
    # Increment first, then check - a single pipeline avoids a read/write race
    pipe = redis_client.pipeline()
    pipe.incr(key)
    pipe.expire(key, 60)  # 1 minute window
    current, _ = pipe.execute()
    
    if int(current) > 100:
        # Middleware must return a response; a raised HTTPException
        # would bypass the normal exception handlers here
        return JSONResponse(status_code=429, content={"detail": "Too many requests"})
    
    return await call_next(request)

Chapter 25 — DevOps at Enterprise Scale

25.1 Multi-Region Architecture

Active-Passive:

Region A (Primary)
├── Traffic: 100%
├── Database: Read/Write
└── Ready for failover

Region B (Standby)
├── Traffic: 0%
├── Database: Read-only replica
└── Failover target

Active-Active:

        Global Load Balancer
                ↓
        ┌───────┴───────┐
    Region A          Region B
    Traffic: 50%      Traffic: 50%
    Database: sync    Database: sync

DNS Failover (Route53):

resource "aws_route53_record" "www" {
  zone_id = data.aws_route53_zone.main.zone_id
  name    = "www.example.com"
  type    = "A"
  
  alias {
    name                   = aws_lb.primary.dns_name
    zone_id                = aws_lb.primary.zone_id
    evaluate_target_health = true
  }
  
  failover_routing_policy {
    type = "PRIMARY"
  }
  
  set_identifier = "primary"
}

resource "aws_route53_record" "www_failover" {
  zone_id = data.aws_route53_zone.main.zone_id
  name    = "www.example.com"
  type    = "A"
  
  alias {
    name                   = aws_lb.secondary.dns_name
    zone_id                = aws_lb.secondary.zone_id
    evaluate_target_health = true
  }
  
  failover_routing_policy {
    type = "SECONDARY"
  }
  
  set_identifier = "secondary"
}

25.2 Compliance (ISO, SOC2)

Common Compliance Frameworks:

  • ISO 27001: Information security management
  • SOC 2: Service organization controls
  • PCI DSS: Payment card industry
  • HIPAA: Healthcare
  • GDPR: Data privacy

Automated Compliance Checks:

# AWS Config rule
resource "aws_config_config_rule" "encrypted_volumes" {
  name = "encrypted-volumes"
  
  source {
    owner             = "AWS"
    source_identifier = "ENCRYPTED_VOLUMES"
  }
  
  scope {
    compliance_resource_types = ["AWS::EC2::Volume"]
  }
}

Evidence Collection:

# Automated evidence collection
import boto3
import json
from datetime import datetime

def collect_evidence():
    # Collect IAM policies
    iam = boto3.client('iam')
    policies = iam.list_policies(Scope='Local')
    
    # Collect security group rules
    ec2 = boto3.client('ec2')
    security_groups = ec2.describe_security_groups()
    
    # Collect CloudTrail logs
    cloudtrail = boto3.client('cloudtrail')
    trails = cloudtrail.describe_trails()
    
    evidence = {
        'timestamp': datetime.utcnow().isoformat(),
        'iam_policies': policies,
        'security_groups': security_groups,
        'cloudtrail': trails
    }
    
    # Store in secure bucket
    s3 = boto3.client('s3')
    s3.put_object(
        Bucket='compliance-evidence',
        Key=f"evidence/{datetime.now().date()}/config.json",
        Body=json.dumps(evidence, default=str)
    )

25.3 Governance

Policy as Code:

# AWS Service Control Policy (SCP)
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Deny",
      "Action": [
        "ec2:RunInstances"
      ],
      "Resource": [
        "arn:aws:ec2:*:*:instance/*"
      ],
      "Condition": {
        "StringNotEquals": {
          "ec2:InstanceType": [
            "t3.micro",
            "t3.small",
            "m5.large"
          ]
        }
      }
    },
    {
      "Effect": "Deny",
      "Action": [
        "s3:PutBucketPublicAccessBlock"
      ],
      "Resource": "*"
    }
  ]
}

Tagging Strategy:

# Enforce tags
resource "aws_cloudformation_stack" "enforce_tags" {
  name = "enforce-tags"
  
  template_body = <<TEMPLATE
Resources:
  EnforceTagsLambda:
    Type: AWS::Lambda::Function
    Properties:
      Handler: index.handler
      Runtime: python3.9
      Code:
        ZipFile: |
          import boto3
          import json
          
          def handler(event, context):
              ec2 = boto3.client('ec2')
              
              # describe_instances cannot filter on a *missing* tag,
              # so list everything and inspect tags in code
              resources = ec2.describe_instances()
              
              # Stop instances that lack the required Environment tag
              for reservation in resources['Reservations']:
                  for instance in reservation['Instances']:
                      tags = {t['Key'] for t in instance.get('Tags', [])}
                      if 'Environment' not in tags:
                          ec2.stop_instances(InstanceIds=[instance['InstanceId']])
              
              return {'status': 'completed'}
TEMPLATE
}

25.4 Cost Management

Cost Allocation Tags:

resource "aws_instance" "web" {
  # ... other configuration
  
  tags = {
    Name        = "web-server"
    Environment = "production"
    CostCenter  = "product-engineering"
    Project     = "customer-portal"
    Owner       = "team-alpha"
    Expires     = "never"  # or "2024-12-31"
  }
}

Budget Alerts:

resource "aws_budgets_budget" "monthly" {
  name         = "monthly-budget"
  budget_type  = "COST"
  limit_amount = "10000"
  limit_unit   = "USD"
  time_unit    = "MONTHLY"
  
  cost_types {
    include_credit = false
    include_discount = false
    include_other_subscription = true
    include_recurring = true
    include_refund = false
    include_subscription = true
    include_support = true
    include_tax = true
    include_upfront = true
    use_blended = false
  }
  
  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = ["finance@example.com"]
  }
}

25.5 FinOps

Financial operations for cloud.

FinOps Principles:

  1. Teams need to collaborate: Finance, engineering, product
  2. Decisions driven by business value: Cost vs. features
  3. Everyone takes ownership: Decentralized accountability
  4. Reports should be accessible: Transparency
  5. Cloud is variable cost: Optimize continuously
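
Principle 4 (accessible reports) is typically implemented by grouping spend by cost-allocation tag. A minimal sketch that turns Cost Explorer-style result rows into a per-team report (the row shape mirrors the `boto3` `get_cost_and_usage` response, where tag groups are encoded as "TagKey$TagValue"; the tag key and figures are illustrative):

```python
def cost_by_team(groups):
    """Aggregate Cost Explorer 'Groups' rows into a cost-per-team dict.

    Each row looks like:
    {"Keys": ["CostCenter$team-alpha"],
     "Metrics": {"UnblendedCost": {"Amount": "12.5"}}}
    """
    report = {}
    for g in groups:
        # Cost Explorer encodes tag groups as "<TagKey>$<TagValue>";
        # an empty value means the resource is untagged
        team = g["Keys"][0].split("$", 1)[1] or "untagged"
        amount = float(g["Metrics"]["UnblendedCost"]["Amount"])
        report[team] = report.get(team, 0.0) + amount
    return report
```

Publishing a report like this per team each month gives engineering the feedback loop that the FinOps principles call for.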

Cost Optimization Strategies:

# Automated rightsizing recommendation
def analyze_rightsizing():
    # Get usage metrics
    cloudwatch = boto3.client('cloudwatch')
    
    # For each instance
    for instance in get_all_instances():
        # Get CPU utilization
        stats = cloudwatch.get_metric_statistics(
            Namespace='AWS/EC2',
            MetricName='CPUUtilization',
            Dimensions=[{'Name': 'InstanceId', 'Value': instance.id}],
            StartTime=datetime.now() - timedelta(days=30),
            EndTime=datetime.now(),
            Period=3600,
            Statistics=['Average']
        )
        
        datapoints = stats['Datapoints']
        if not datapoints:
            continue  # no metrics reported for this instance yet
        avg_cpu = sum(p['Average'] for p in datapoints) / len(datapoints)
        
        # Recommend downsizing if low utilization
        if avg_cpu < 10:
            recommend_smaller_instance(instance)
        
        # Recommend spot if appropriate
        if can_use_spot(instance):
            recommend_spot_conversion(instance)

Spot Instance Strategy:

# Spot instance with mixed types
resource "aws_ec2_fleet" "compute" {
  launch_template_config {
    launch_template_specification {
      launch_template_id = aws_launch_template.app.id
      version            = "$Latest"
    }
    
    overrides {
      instance_type = "c5.large"
      weighted_capacity = 2
    }
    
    overrides {
      instance_type = "c5a.large"
      weighted_capacity = 2
    }
    
    overrides {
      instance_type = "m5.large"
      weighted_capacity = 2
    }
  }
  
  target_capacity_specification {
    default_target_capacity_type = "spot"
    total_target_capacity        = 20
    spot_target_capacity         = 20
  }
  
  spot_options {
    allocation_strategy            = "capacity-optimized"
    instance_interruption_behavior = "terminate"
    min_target_capacity            = 10
  }
}

25.6 Migration Strategies

The 7 Rs of Migration:

  1. Rehost (Lift and Shift): Move as-is
  2. Replatform (Lift, Tinker, Shift): Minor optimizations
  3. Repurchase (Drop and Shop): Move to SaaS
  4. Refactor (Re-architect): Modernize for cloud
  5. Retire: Decommission unused
  6. Retain: Keep on-premises
  7. Relocate: Move infrastructure as-is to a cloud-hosted platform (e.g., VMware Cloud on AWS)

Migration Phases:

  1. Assess: Discovery and planning
  2. Mobilize: Pilot and skills building
  3. Migrate: Scale migration
  4. Modernize: Optimize and innovate

Database Migration Service:

# AWS DMS replication task
resource "aws_dms_replication_task" "migrate" {
  replication_task_id       = "migrate-db"
  migration_type            = "full-load"
  replication_instance_arn  = aws_dms_replication_instance.dms.replication_instance_arn
  source_endpoint_arn       = aws_dms_endpoint.source.endpoint_arn
  target_endpoint_arn       = aws_dms_endpoint.target.endpoint_arn
  table_mappings            = jsonencode({
    "rules": [
      {
        "rule-type": "selection",
        "rule-id": "1",
        "rule-name": "1",
        "object-locator": {
          "schema-name": "public",
          "table-name": "users"
        },
        "rule-action": "include"
      }
    ]
  })
  
  replication_task_settings = jsonencode({
    "TargetMetadata": {
      "TargetSchema": "",
      "SupportLobs": true,
      "FullLobMode": false,
      "LobChunkSize": 64,
      "LimitedSizeLobMode": false,
      "LobMaxSize": 32
    },
    "FullLoadSettings": {
      "TargetTablePrepMode": "DROP_AND_CREATE",
      "CreatePkAfterFullLoad": false,
      "StopTaskCachedChangesApplied": false,
      "StopTaskCachedChangesNotApplied": false,
      "MaxFullLoadSubTasks": 8,
      "TransactionConsistencyTimeout": 600,
      "CommitRate": 10000
    }
  })
}

PART X — PRACTICAL IMPLEMENTATION

Chapter 26 — Building a Complete DevOps Pipeline

26.1 Sample Microservices Project

Architecture:

┌─────────┐    ┌─────────┐    ┌─────────┐
│  React  │ → │   API   │ → │  Users  │
│   App   │ ← │ Gateway │ ← │ Service │
└─────────┘    └─────────┘    └─────────┘
                    ↓              ↓
              ┌─────────┐    ┌─────────┐
              │  Auth   │    │  Posts  │
              │ Service │    │ Service │
              └─────────┘    └─────────┘

Repository Structure:

myapp/
├── services/
│   ├── api-gateway/
│   │   ├── src/
│   │   ├── Dockerfile
│   │   └── package.json
│   ├── users-service/
│   │   ├── src/
│   │   ├── Dockerfile
│   │   └── requirements.txt
│   └── posts-service/
│       ├── src/
│       ├── Dockerfile
│       └── go.mod
├── frontend/
│   ├── src/
│   ├── Dockerfile
│   └── package.json
├── k8s/
│   ├── base/
│   │   ├── deployment.yaml
│   │   └── service.yaml
│   └── overlays/
│       ├── dev/
│       └── prod/
├── terraform/
│   ├── main.tf
│   ├── variables.tf
│   └── outputs.tf
├── .github/
│   └── workflows/
│       ├── ci.yml
│       └── cd.yml
└── README.md

26.2 Git Workflow

Branch Strategy:

  • main - Production-ready code
  • develop - Integration branch
  • feature/* - New features
  • release/* - Release preparation
  • hotfix/* - Emergency fixes

PR Template:

## Description
[Describe your changes]

## Type of Change
- [ ] Bug fix
- [ ] New feature
- [ ] Breaking change
- [ ] Documentation update

## Testing
- [ ] Unit tests passing
- [ ] Integration tests passing
- [ ] Manual testing completed

## Checklist
- [ ] Code follows style guide
- [ ] Documentation updated
- [ ] Dependencies updated
- [ ] Security considerations addressed

## Related Issues
Closes #[issue-number]

26.3 CI Pipeline

GitHub Actions CI:

# .github/workflows/ci.yml
name: CI Pipeline

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      
      - name: Lint API Gateway
        working-directory: services/api-gateway
        run: |
          npm install
          npm run lint
      
      - name: Lint Users Service
        working-directory: services/users-service
        run: |
          pip install flake8
          flake8 src/
      
      - name: Lint Posts Service
        working-directory: services/posts-service
        run: |
          go install golang.org/x/lint/golint@latest
          golint ./...
  
  test:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:13
        env:
          POSTGRES_PASSWORD: testpass
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
        ports:
          - 5432:5432
      
      redis:
        image: redis:6
        options: >-
          --health-cmd "redis-cli ping"
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
        ports:
          - 6379:6379
    
    steps:
      - uses: actions/checkout@v2
      
      - name: Test API Gateway
        working-directory: services/api-gateway
        run: |
          npm install
          npm test -- --coverage
      
      - name: Test Users Service
        working-directory: services/users-service
        env:
          DATABASE_URL: postgresql://postgres:testpass@localhost/test
        run: |
          pip install -r requirements.txt
          pytest --cov=src tests/
      
      - name: Test Posts Service
        working-directory: services/posts-service
        run: |
          go mod download
          go test -v -cover ./...
  
  security:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      
      - name: Initialize CodeQL
        uses: github/codeql-action/init@v1
        with:
          languages: javascript,python,go
      
      - name: Autobuild
        uses: github/codeql-action/autobuild@v1
      
      - name: Run CodeQL analysis
        uses: github/codeql-action/analyze@v1
      
      - name: Scan dependencies
        run: |
          npm audit --audit-level=high
          safety check
          go list -json -deps ./... | nancy sleuth
      
      - name: Scan for secrets
        uses: trufflesecurity/trufflehog@main
  
  build:
    runs-on: ubuntu-latest
    needs: [lint, test, security]
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    
    steps:
      - uses: actions/checkout@v2
      
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v1
      
      - name: Login to Container Registry
        uses: docker/login-action@v1
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      
      - name: Build and push API Gateway
        uses: docker/build-push-action@v2
        with:
          context: services/api-gateway
          push: true
          tags: |
            ghcr.io/${{ github.repository }}/api-gateway:${{ github.sha }}
            ghcr.io/${{ github.repository }}/api-gateway:latest
          cache-from: type=gha
          cache-to: type=gha,mode=max
      
      - name: Build and push Users Service
        uses: docker/build-push-action@v2
        with:
          context: services/users-service
          push: true
          tags: |
            ghcr.io/${{ github.repository }}/users-service:${{ github.sha }}
            ghcr.io/${{ github.repository }}/users-service:latest
      
      - name: Build and push Posts Service
        uses: docker/build-push-action@v2
        with:
          context: services/posts-service
          push: true
          tags: |
            ghcr.io/${{ github.repository }}/posts-service:${{ github.sha }}
            ghcr.io/${{ github.repository }}/posts-service:latest
      
      - name: Scan images for vulnerabilities
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: 'ghcr.io/${{ github.repository }}/api-gateway:${{ github.sha }}'
          severity: 'CRITICAL,HIGH'
          format: 'sarif'
          output: 'trivy-results.sarif'

26.4 Dockerization

API Gateway Dockerfile:

FROM node:18-alpine AS builder

WORKDIR /app

COPY package*.json ./
RUN npm ci --only=production

COPY . .

FROM node:18-alpine

RUN addgroup -g 1001 -S nodejs && \
    adduser -S nodejs -u 1001

WORKDIR /app

COPY --from=builder --chown=nodejs:nodejs /app/node_modules ./node_modules
COPY --chown=nodejs:nodejs . .

USER nodejs

EXPOSE 3000

HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
  CMD node healthcheck.js

CMD ["node", "src/server.js"]

Users Service Dockerfile:

FROM python:3.10-slim AS builder

WORKDIR /app

COPY requirements.txt .
RUN pip install --user --no-cache-dir -r requirements.txt

FROM python:3.10-slim

RUN apt-get update && apt-get install -y --no-install-recommends \
    curl \
    && rm -rf /var/lib/apt/lists/*

RUN groupadd -r appuser && useradd -r -g appuser appuser

WORKDIR /app

COPY --from=builder /root/.local /home/appuser/.local
COPY . .

ENV PATH=/home/appuser/.local/bin:$PATH

USER appuser

EXPOSE 8000

HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
  CMD curl -f http://localhost:8000/health || exit 1

CMD ["uvicorn", "src.main:app", "--host", "0.0.0.0", "--port", "8000"]

Posts Service Dockerfile:

FROM golang:1.19-alpine AS builder

WORKDIR /app

COPY go.mod go.sum ./
RUN go mod download

COPY . .
RUN CGO_ENABLED=0 GOOS=linux go build -o posts-service ./cmd/server

FROM alpine:3.17

RUN apk --no-cache add ca-certificates

RUN addgroup -g 1001 -S appgroup && \
    adduser -S appuser -u 1001 -G appgroup

WORKDIR /app

COPY --from=builder --chown=appuser:appgroup /app/posts-service .

USER appuser

EXPOSE 8080

HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
  CMD ["./posts-service", "health"]

CMD ["./posts-service"]

26.5 Kubernetes Deployment

Kustomize Base:

# k8s/base/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-gateway
  labels:
    app: api-gateway
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api-gateway
  template:
    metadata:
      labels:
        app: api-gateway
    spec:
      containers:
      - name: api-gateway
        image: ghcr.io/myorg/myapp/api-gateway:latest
        ports:
        - containerPort: 3000
        env:
        - name: NODE_ENV
          value: "production"
        - name: USERS_SERVICE_URL
          value: "http://users-service:8000"
        - name: POSTS_SERVICE_URL
          value: "http://posts-service:8080"
        - name: REDIS_URL
          valueFrom:
            secretKeyRef:
              name: app-secrets
              key: redis-url
        resources:
          requests:
            memory: "128Mi"
            cpu: "250m"
          limits:
            memory: "256Mi"
            cpu: "500m"
        livenessProbe:
          httpGet:
            path: /health
            port: 3000
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /ready
            port: 3000
          initialDelaySeconds: 5
          periodSeconds: 5
---
# k8s/base/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: api-gateway
spec:
  selector:
    app: api-gateway
  ports:
  - port: 80
    targetPort: 3000
  type: ClusterIP
---
# k8s/base/ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-gateway
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - api.example.com
    secretName: api-tls
  rules:
  - host: api.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: api-gateway
            port:
              number: 80

Production Overlay:

# k8s/overlays/prod/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
- ../../base

namespace: production

images:
- name: ghcr.io/myorg/myapp/api-gateway
  newTag: v1.2.3
- name: ghcr.io/myorg/myapp/users-service
  newTag: v1.2.3
- name: ghcr.io/myorg/myapp/posts-service
  newTag: v1.2.3

patchesStrategicMerge:
- increase-replicas.yaml
- resource-limits.yaml

configMapGenerator:
- name: app-config
  behavior: merge
  literals:
  - LOG_LEVEL=info
  - ENVIRONMENT=production

secretGenerator:
- name: app-secrets
  behavior: merge
  literals:
  - redis-url=redis://redis-service:6379
  - database-url=postgresql://user:pass@postgres:5432/prod

# k8s/overlays/prod/increase-replicas.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-gateway
spec:
  replicas: 5
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: users-service
spec:
  replicas: 3
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: posts-service
spec:
  replicas: 3

26.6 Monitoring Setup

Prometheus Configuration:

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
- job_name: 'kubernetes-pods'
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    action: keep
    regex: true
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
    action: replace
    target_label: __metrics_path__
    regex: (.+)
  - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
    action: replace
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
    target_label: __address__
  - source_labels: [__meta_kubernetes_pod_phase]
    regex: (Failed|Succeeded)
    action: drop

ServiceMonitor for Custom Metrics:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: api-gateway
spec:
  selector:
    matchLabels:
      app: api-gateway
  endpoints:
  - port: metrics
    interval: 30s
    path: /metrics
  namespaceSelector:
    matchNames:
    - production

Grafana Dashboard:

{
  "dashboard": {
    "title": "API Gateway Dashboard",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{app='api-gateway'}[5m])) by (status_code)",
            "legendFormat": "{{status_code}}"
          }
        ]
      },
      {
        "title": "Request Latency (p99)",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{app='api-gateway'}[5m])) by (le))",
            "legendFormat": "p99"
          }
        ]
      },
      {
        "title": "Error Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{app='api-gateway', status_code=~'5..'}[5m])) / sum(rate(http_requests_total{app='api-gateway'}[5m]))",
            "legendFormat": "error ratio"
          }
        ]
      },
      {
        "title": "CPU Usage",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(container_cpu_usage_seconds_total{container='api-gateway'}) by (pod)",
            "legendFormat": "{{pod}}"
          }
        ]
      },
      {
        "title": "Memory Usage",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(container_memory_working_set_bytes{container='api-gateway'}) by (pod)",
            "legendFormat": "{{pod}}"
          }
        ]
      }
    ]
  }
}

Alert Rules:

# alerts.yml
groups:
- name: api-gateway
  rules:
  - alert: APIHighErrorRate
    expr: |
      sum(rate(http_requests_total{app='api-gateway', status_code=~'5..'}[5m]))
      /
      sum(rate(http_requests_total{app='api-gateway'}[5m])) > 0.05
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "API Gateway high error rate"
      description: "Error rate is {{ $value | humanizePercentage }} for 5 minutes"

  - alert: APIHighLatency
    expr: |
      histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{app='api-gateway'}[5m])) by (le)) > 0.5
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "API Gateway high latency"
      description: "p99 latency is {{ $value }}s for 10 minutes"

  - alert: APIDown
    expr: up{job='api-gateway'} == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "API Gateway is down"
      description: "API Gateway has been down for more than 1 minute"

26.7 Security Integration

Secret Management:

# secrets.yaml (encrypted with sops)
apiVersion: v1
kind: Secret
metadata:
  name: app-secrets
type: Opaque
data:
  database-url: ENC[AES256_GCM,data:...]
  redis-url: ENC[AES256_GCM,data:...]
  api-key: ENC[AES256_GCM,data:...]
sops:
  kms:
  - arn: arn:aws:kms:us-east-1:123456789:key/...
    created_at: "..."
    enc: "..."

Pod Security Policy (removed in Kubernetes 1.25; newer clusters use Pod Security Admission instead):

apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: restricted
spec:
  privileged: false
  allowPrivilegeEscalation: false
  requiredDropCapabilities:
  - ALL
  volumes:
  - 'configMap'
  - 'emptyDir'
  - 'projected'
  - 'secret'
  - 'downwardAPI'
  hostNetwork: false
  hostIPC: false
  hostPID: false
  runAsUser:
    rule: 'MustRunAsNonRoot'
  seLinux:
    rule: 'RunAsAny'
  supplementalGroups:
    rule: 'MustRunAs'
    ranges:
    - min: 1
      max: 65535
  fsGroup:
    rule: 'MustRunAs'
    ranges:
    - min: 1
      max: 65535

Network Policy:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-gateway-network-policy
spec:
  podSelector:
    matchLabels:
      app: api-gateway
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: frontend
    ports:
    - protocol: TCP
      port: 3000
  egress:
  - to:
    - podSelector:
        matchLabels:
          app: users-service
    ports:
    - protocol: TCP
      port: 8000
  - to:
    - podSelector:
        matchLabels:
          app: posts-service
    ports:
    - protocol: TCP
      port: 8080
  - to:
    - podSelector:
        matchLabels:
          app: redis
    ports:
    - protocol: TCP
      port: 6379

Chapter 27 — Real-World Case Studies

27.1 Netflix DevOps Model

Scale:

  • 200M+ subscribers
  • Thousands of microservices
  • Millions of streaming hours daily
  • Thousands of deployments daily

Key Practices:

1. Chaos Engineering

  • Chaos Monkey randomly terminates instances
  • Simian Army tests various failure modes
  • Latency Monkey introduces delays
  • Conformity Monkey enforces best practices

# Chaos Monkey simplified example
import random
import boto3

class ChaosMonkey:
    def __init__(self, probability=0.01):
        self.probability = probability
        self.ec2 = boto3.client('ec2')
    
    def run(self):
        for instance in self.get_production_instances():
            if random.random() < self.probability:
                self.terminate_instance(instance)
                self.notify_team(instance)  # alerting hook, implementation elided
    
    def get_production_instances(self):
        # Flatten reservations into a list of instance dicts
        response = self.ec2.describe_instances(
            Filters=[
                {'Name': 'tag:Environment', 'Values': ['production']}
            ]
        )
        return [
            instance
            for reservation in response['Reservations']
            for instance in reservation['Instances']
        ]
    
    def terminate_instance(self, instance):
        self.ec2.terminate_instances(InstanceIds=[instance['InstanceId']])

2. Immutable Infrastructure

  • Servers never patched, always replaced
  • Golden AMIs with everything baked in
  • Blue/green deployments
  • Automated rollback

3. Spinnaker for CD

  • Multi-cloud continuous delivery
  • Pipeline stages: bake, test, deploy
  • Canary analysis
  • Automated rollbacks

// Spinnaker pipeline
{
  "application": "netflix",
  "name": "deploy-service",
  "stages": [
    {
      "type": "bake",
      "name": "Bake Image",
      "baseOs": "ubuntu",
      "package": "myapp"
    },
    {
      "type": "canary",
      "name": "Canary Deploy",
      "cluster": "myapp-canary",
      "targetSize": 5,
      "analysisType": "realTime",
      "metrics": [
        "error_rate < 0.1%",
        "latency_p99 < 200ms"
      ]
    },
    {
      "type": "rollingPush",
      "name": "Production Deploy",
      "cluster": "myapp-prod",
      "targetSize": 100
    }
  ]
}

4. Culture of Freedom and Responsibility

  • "You build it, you run it"
  • Engineers own their services
  • Blameless postmortems
  • Data-driven decisions

27.2 Amazon Deployment Model

Scale:

  • 100M+ deployments per year
  • 143,000 deployments in peak hour
  • 2-pizza teams (6-10 people)
  • Service-oriented architecture

Key Practices:

1. Two-Pizza Teams

  • Small, autonomous teams
  • Full ownership of services
  • Independent deployment
  • Clear API contracts

2. Deployment Pipeline

# Amazon's deployment pipeline (simplified; helper methods elided)
import time

class DeploymentPipeline:
    def __init__(self, service_name):
        self.service = service_name
        self.stages = [
            'commit',
            'build',
            'unit_tests',
            'integration_tests',
            'performance_tests',
            'security_scan',
            'canary',
            'production'
        ]
    
    def execute(self, version):
        for stage in self.stages:
            if not self.run_stage(stage, version):
                self.rollback(version)
                return False
            
            # Collect metrics
            metrics = self.collect_metrics(stage)
            if self.thresholds_exceeded(metrics):
                self.rollback(version)
                return False
        
        return True
    
    def canary_deploy(self, version):
        # Deploy to 1% of instances
        canary_group = self.deploy_to_group(version, percent=1)
        
        # Monitor for 15 minutes
        time.sleep(900)
        
        # Check metrics
        if self.canary_healthy(canary_group):
            # Gradual rollout
            self.deploy_to_group(version, percent=10)
            time.sleep(300)
            self.deploy_to_group(version, percent=25)
            time.sleep(300)
            self.deploy_to_group(version, percent=50)
            time.sleep(300)
            self.deploy_to_group(version, percent=100)
        else:
            self.rollback_canary(version)

3. API Mandate

  • All teams expose APIs
  • No direct database access
  • Backward compatibility required
  • Versioned APIs
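The mandate's key constraint — existing API versions keep working while new ones ship — can be sketched in a few lines (handler and route names below are hypothetical, Python stdlib only):

```python
# Minimal sketch of API versioning with backward compatibility:
# a shipped version is never changed, only new versions are added,
# so callers pinned to v1 keep working after v2 ships.

def get_user_v1(user_id):
    # Original contract: returns only the name.
    return {"name": f"user-{user_id}"}

def get_user_v2(user_id):
    # New contract adds a field; the v1 response shape is preserved.
    return {"name": f"user-{user_id}", "tier": "standard"}

ROUTES = {
    ("GET", "/v1/users"): get_user_v1,
    ("GET", "/v2/users"): get_user_v2,
}

def handle(method, path, user_id):
    # Dispatch on the versioned path; unknown versions fail explicitly.
    handler = ROUTES.get((method, path))
    if handler is None:
        return {"error": "unknown version"}
    return handler(user_id)
```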

4. "You Build It, You Run It"

  • Developers carry pagers
  • On-call rotation within dev teams
  • Operational excellence is priority
  • Automated remediation
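The automated-remediation idea can be sketched as a small control loop (all function names here are hypothetical): attempt a known fix before paging a human, and escalate only if recovery fails.

```python
def remediate(check_health, restart, page, max_restarts=2):
    """Try automated recovery before waking an engineer."""
    for _ in range(max_restarts):
        if check_health():
            return "healthy"
        restart()  # apply the known fix (e.g. restart the service)
    # One final check after the last restart attempt
    if check_health():
        return "recovered"
    # Automation exhausted; escalate to the on-call engineer
    page("automated remediation failed")
    return "escalated"
```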

27.3 Google SRE Model

Scale:

  • Billions of users
  • Global infrastructure
  • SLOs defined for 100% of services
  • Error budgets for all services

Key Practices:

1. Error Budgets

class ErrorBudget:
    def __init__(self, service, slo=99.99):
        self.service = service
        self.slo = slo
        self.budget = 100 - slo
        self.consumed = 0
    
    def track_error(self, duration):
        # Accumulate error time against the budget (as % of the window)
        error_seconds = duration
        total_seconds = self.get_total_seconds()
        
        self.consumed += (error_seconds / total_seconds) * 100
        
        if self.consumed > self.budget:
            self.enforce_freeze()
    
    def enforce_freeze(self):
        # Block releases when budget exhausted
        print(f"Error budget exhausted for {self.service}")
        self.block_releases()
        self.focus_on_reliability()
    
    def reset_monthly(self):
        self.consumed = 0
        self.unblock_releases()

2. Toil Elimination

  • Target < 50% time on toil
  • Automate everything
  • Self-service platforms
  • Continuous improvement

# Toil tracking
class ToilTracker:
    def __init__(self):
        self.toil_time = 0
        self.eng_time = 0
    
    def track_activity(self, activity_type, duration):
        if activity_type == 'toil':
            self.toil_time += duration
        else:
            self.eng_time += duration
        
        self.ensure_balance()
    
    def ensure_balance(self):
        total = self.toil_time + self.eng_time
        if total > 0:
            toil_percentage = (self.toil_time / total) * 100
            
            if toil_percentage > 50:
                self.trigger_toil_reduction()
    
    def trigger_toil_reduction(self):
        print("Toil exceeds 50% - initiating reduction projects")
        # Start automation projects
        # Assign engineering time to reduce toil

3. Monitoring Philosophy

  • Monitor symptoms, not causes
  • Only alert if action required
  • Use SLIs, SLOs, error budgets
  • Minimal, actionable alerts
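These rules can be expressed as a burn-rate check: page only when the error-rate SLI is consuming the error budget fast enough to demand immediate action. The 14.4x fast-burn threshold below is a common convention, not a fixed rule; exact values vary by team and alert window.

```python
def burn_rate(error_ratio, slo=0.999):
    """How fast the error budget is being consumed.
    A rate of 1.0 means the budget lasts exactly the SLO window."""
    budget = 1 - slo  # allowed error ratio, e.g. 0.1% for a 99.9% SLO
    return error_ratio / budget

def should_page(error_ratio, slo=0.999, threshold=14.4):
    # Alert on the symptom (observed error ratio), and only when the
    # budget is burning fast enough that a human must act now.
    return burn_rate(error_ratio, slo) >= threshold
```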

27.4 Startup DevOps Strategy

Profile:

  • Series B startup
  • 50 engineers
  • AWS cloud
  • 10 microservices
  • 100K users

DevOps Implementation:

Phase 1: Foundation (Month 1-3)

  • GitHub for version control
  • GitHub Actions for CI
  • Terraform for infrastructure
  • Docker for containerization
  • ECS for orchestration (simpler than K8s)

Phase 2: Automation (Month 4-6)

  • Automated testing in CI
  • Container image building
  • Blue/green deployments
  • Basic monitoring (CloudWatch)

Phase 3: Scaling (Month 7-12)

  • Migrate to EKS
  • Service mesh (Linkerd)
  • Prometheus/Grafana
  • Centralized logging (ELK)
  • Security scanning (Trivy)

Sample CI Pipeline:

name: Startup CI/CD

on:
  push:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - run: npm ci
      - run: npm test
      - run: npm run lint
  
  build:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - run: docker build -t myapp:${{ github.sha }} .
      - run: docker tag myapp:${{ github.sha }} ${{ secrets.ECR_REPO }}:latest
      - run: aws ecr get-login-password | docker login --username AWS --password-stdin ${{ secrets.ECR_REPO }}
      - run: docker push ${{ secrets.ECR_REPO }}:latest
  
  deploy:
    needs: build
    runs-on: ubuntu-latest
    environment: production
    steps:
      - uses: actions/checkout@v2
      - run: |
          aws ecs update-service \
            --cluster myapp-cluster \
            --service myapp-service \
            --force-new-deployment \
            --region us-east-1

27.5 Enterprise Migration Story

Profile:

  • Fortune 500 financial services
  • 10,000+ employees
  • 1,000+ applications
  • Legacy data centers
  • Strict regulatory requirements

Challenges:

  • Legacy mainframe applications
  • Regulatory compliance (SOX, PCI)
  • Security concerns
  • Siloed teams
  • Vendor lock-in

Migration Phases:

Phase 1: Assessment (6 months)

  • Application portfolio analysis
  • Dependency mapping
  • Compliance requirements review
  • Skills assessment
  • Vendor evaluation

Phase 2: Foundation (12 months)

  • Create cloud landing zone
  • Establish governance framework
  • Build central platform team
  • Implement security controls
  • Set up connectivity (Direct Connect)

# Enterprise landing zone
module "landing_zone" {
  source = "terraform-aws-modules/control-tower/aws"
  
  # Multi-account structure
  organizational_units = {
    "Security" = {
      accounts = ["audit", "security-tooling"]
    }
    "Infrastructure" = {
      accounts = ["network", "shared-services", "cicd"]
    }
    "Workloads" = {
      accounts = ["dev", "test", "prod", "dr"]
    }
  }
  
  # Guardrails
  guardrails = {
    "DISALLOW_PUBLIC_IPS" = {
      type = "mandatory"
    }
    "ENFORCE_ENCRYPTION" = {
      type = "mandatory"
    }
    "ENABLE_CLOUDTRAIL" = {
      type = "mandatory"
    }
  }
}

Phase 3: Pilot (6 months)

  • Select 3 pilot applications
  • Lift-and-shift initial migrations
  • Validate security controls
  • Train first teams
  • Document patterns

Phase 4: Scale (18 months)

  • Wave-based migrations
  • Automate where possible
  • Modernize applications
  • Implement CI/CD
  • Establish FinOps
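Wave planning in this phase amounts to a topological sort of the dependency map produced during assessment: each wave contains only applications whose dependencies have already migrated. A minimal sketch using Python's standard library (application names are hypothetical):

```python
from graphlib import TopologicalSorter  # Python 3.9+

# app -> set of apps it depends on (dependencies must migrate first)
deps = {
    "web-frontend": {"auth-service", "billing"},
    "billing": {"auth-service"},
    "auth-service": set(),
    "reporting": {"billing"},
}

def migration_waves(deps):
    # Group apps into waves: everything ready at the same time
    # (all dependencies done) can migrate in parallel.
    ts = TopologicalSorter(deps)
    ts.prepare()
    waves = []
    while ts.is_active():
        ready = list(ts.get_ready())
        waves.append(sorted(ready))
        ts.done(*ready)
    return waves
```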

Phase 5: Optimize (ongoing)

  • Rightsizing
  • Spot instances
  • Containerization
  • Serverless adoption
  • Continuous improvement

Key Success Factors:

  1. Executive sponsorship - C-level support
  2. Center of Excellence - Central team
  3. Training program - Skill development
  4. Security first - Compliance from day one
  5. Measurable wins - Show progress
  6. Cultural change - DevOps mindset

Appendices

Appendix A: Linux Command Reference

File Operations:

ls -la                    # List all files with details
cd /path/to/dir           # Change directory
pwd                       # Print working directory
cp -r source dest         # Copy recursively
mv source dest            # Move/rename
rm -rf dir                # Remove forcefully
mkdir -p path/to/dir      # Create directory with parents
touch file.txt            # Create empty file/update timestamp
cat file.txt              # Display file content
less file.txt             # View file page by page
head -n 10 file.txt       # First 10 lines
tail -f file.txt          # Follow file (live updates)
find . -name "*.txt"      # Find files by name
grep -r "pattern" .       # Search recursively

Process Management:

ps aux                     # All processes
top                        # Interactive process viewer
htop                       # Enhanced top
kill -9 PID                # Force kill process
kill -15 PID               # Graceful termination
pgrep process_name         # Find PID by name
pkill process_name         # Kill by name
jobs                       # List background jobs
bg %1                      # Resume job in background
fg %1                      # Bring to foreground
nohup command &            # Run immune to hangups

Network Commands:

ip addr show               # IP addresses
ip route show              # Routing table
ss -tulpn                  # Listening ports
netstat -an                # Network statistics (legacy)
curl -I http://example.com # HTTP headers
wget http://example.com/file # Download file
ping -c 4 example.com      # ICMP ping
traceroute example.com     # Trace route
nslookup example.com       # DNS lookup
dig example.com            # Detailed DNS
telnet host port           # Test TCP connection
nc -vz host port           # Netcat port scan
tcpdump -i eth0            # Capture packets

System Information:

uname -a                    # Kernel info
cat /etc/os-release         # OS info
lscpu                       # CPU info
free -h                     # Memory usage
df -h                       # Disk usage
du -sh *                    # Directory sizes
uptime                      # System uptime
whoami                      # Current user
id                          # User identity
hostname                    # System hostname
date                        # Current date/time
dmesg | tail                # Kernel messages

Package Management (Ubuntu/Debian):

apt update                  # Update package lists
apt upgrade                 # Upgrade all packages
apt install package         # Install package
apt remove package          # Remove package
apt autoremove              # Remove unused packages
apt search pattern          # Search packages
dpkg -l                     # List installed
dpkg -S /path/to/file       # Which package owns file

Package Management (RHEL/CentOS):

yum update                  # Update all packages
yum install package         # Install package
yum remove package          # Remove package
yum search pattern          # Search packages
rpm -qa                     # List installed
rpm -qf /path/to/file       # Which package owns file

Systemd Commands:

systemctl status service     # Service status
systemctl start service      # Start service
systemctl stop service       # Stop service
systemctl restart service    # Restart service
systemctl enable service     # Enable at boot
systemctl disable service    # Disable at boot
systemctl list-units         # List all units
journalctl -u service        # View logs
journalctl -f                # Follow logs
systemctl daemon-reload      # Reload unit files

Appendix B: Git Cheat Sheet

Basic Commands:

git init                    # Initialize repository
git clone url               # Clone repository
git add file                # Stage file
git add .                   # Stage all
git commit -m "message"     # Commit staged
git status                  # Show status
git log                     # Show history
git log --oneline           # Compact history
git diff                    # Show unstaged changes
git diff --staged           # Show staged changes

Branching:

git branch                  # List branches
git branch new-branch       # Create branch
git checkout branch         # Switch branch
git checkout -b new-branch  # Create and switch
git merge branch            # Merge branch into current
git branch -d branch        # Delete branch
git push origin --delete branch # Delete remote branch

Remote Operations:

git remote -v               # List remotes
git remote add origin url   # Add remote
git push origin main        # Push to remote
git pull origin main        # Pull from remote
git fetch origin            # Fetch without merge
git remote update           # Update all remotes

Undoing Changes:

git reset file              # Unstage file
git reset --soft HEAD~1     # Undo commit, keep changes
git reset --hard HEAD~1     # Undo commit, discard changes
git revert HEAD             # Create revert commit
git checkout -- file        # Discard changes in file
git clean -fd               # Remove untracked files

Stashing:

git stash                   # Stash changes
git stash list              # List stashes
git stash pop               # Apply and remove stash
git stash apply             # Apply stash
git stash drop stash@{0}    # Drop stash
git stash branch new-branch # Create branch from stash

History and Debugging:

git log --graph --oneline   # Visual history
git blame file              # Who changed what
git bisect start            # Binary search for bug
git bisect bad              # Current is bad
git bisect good commit      # Mark good commit
git reflog                  # Reference log

Advanced:

git rebase -i HEAD~3        # Interactive rebase
git cherry-pick commit      # Apply specific commit
git tag v1.0.0              # Create tag
git push --tags             # Push tags
git submodule add url       # Add submodule
git submodule update --init # Update submodules

Appendix C: Kubernetes YAML Reference

Pod:

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
  labels:
    app: myapp
spec:
  containers:
  - name: my-container
    image: nginx:latest
    ports:
    - containerPort: 80
    env:
    - name: ENV_VAR
      value: "value"
    resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
      limits:
        memory: "128Mi"
        cpu: "500m"
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    emptyDir: {}

Deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
      - name: myapp
        image: myapp:latest
        ports:
        - containerPort: 8080
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5

Service:

apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  selector:
    app: myapp
  ports:
  - port: 80
    targetPort: 8080
    nodePort: 30080
  type: NodePort  # ClusterIP, NodePort, LoadBalancer

Ingress:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-ingress
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
spec:
  rules:
  - host: myapp.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: my-service
            port:
              number: 80
  tls:
  - hosts:
    - myapp.example.com
    secretName: myapp-tls

ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  config.json: |
    {
      "log_level": "info",
      "max_connections": 100
    }
  database_url: "postgresql://localhost/mydb"

Secret:

apiVersion: v1
kind: Secret
metadata:
  name: app-secret
type: Opaque
data:
  username: YWRtaW4=  # base64 encoded
  password: MWYyZDFlMmU2N2Rm

PersistentVolumeClaim:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-claim
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: fast

Appendix D: Terraform Module Examples

VPC Module:

# modules/vpc/main.tf
resource "aws_vpc" "main" {
  cidr_block           = var.cidr_block
  enable_dns_hostnames = true
  enable_dns_support   = true
  
  tags = var.tags
}

resource "aws_subnet" "public" {
  count = length(var.public_subnets)
  
  vpc_id            = aws_vpc.main.id
  cidr_block        = var.public_subnets[count.index]
  availability_zone = var.availability_zones[count.index]
  
  map_public_ip_on_launch = true
  
  tags = merge(var.tags, {
    Name = "public-${var.availability_zones[count.index]}"
  })
}

# modules/vpc/variables.tf
variable "cidr_block" {
  description = "CIDR block for VPC"
  type        = string
}

variable "public_subnets" {
  description = "List of public subnet CIDRs"
  type        = list(string)
}

variable "availability_zones" {
  description = "List of availability zones"
  type        = list(string)
}

variable "tags" {
  description = "Tags to apply"
  type        = map(string)
  default     = {}
}

# modules/vpc/outputs.tf
output "vpc_id" {
  value = aws_vpc.main.id
}

output "public_subnet_ids" {
  value = aws_subnet.public[*].id
}

EC2 Instance Module:

# modules/ec2/main.tf
data "aws_ami" "amazon_linux" {
  most_recent = true
  owners      = ["amazon"]
  
  filter {
    name   = "name"
    values = ["amzn2-ami-hvm-*-x86_64-gp2"]
  }
}

resource "aws_instance" "this" {
  ami                    = var.ami != "" ? var.ami : data.aws_ami.amazon_linux.id
  instance_type          = var.instance_type
  subnet_id              = var.subnet_id
  vpc_security_group_ids = var.security_group_ids
  key_name               = var.key_name
  
  user_data = var.user_data
  
  root_block_device {
    volume_type = var.root_volume_type
    volume_size = var.root_volume_size
    encrypted   = var.root_volume_encrypted
  }
  
  tags = merge(var.tags, {
    Name = var.name
  })
}

# modules/ec2/variables.tf
variable "name" {
  description = "Instance name"
  type        = string
}

variable "instance_type" {
  description = "Instance type"
  type        = string
}

variable "subnet_id" {
  description = "Subnet ID"
  type        = string
}

variable "security_group_ids" {
  description = "Security group IDs"
  type        = list(string)
}

variable "ami" {
  description = "AMI ID (optional)"
  type        = string
  default     = ""
}

variable "key_name" {
  description = "Key pair name"
  type        = string
  default     = ""
}

variable "user_data" {
  description = "User data script"
  type        = string
  default     = ""
}

variable "root_volume_size" {
  description = "Root volume size in GB"
  type        = number
  default     = 20
}

variable "root_volume_type" {
  description = "Root volume type"
  type        = string
  default     = "gp3"
}

variable "root_volume_encrypted" {
  description = "Encrypt root volume"
  type        = bool
  default     = true
}

variable "tags" {
  description = "Tags to apply"
  type        = map(string)
  default     = {}
}

# modules/ec2/outputs.tf
output "instance_id" {
  value = aws_instance.this.id
}

output "public_ip" {
  value = aws_instance.this.public_ip
}

output "private_ip" {
  value = aws_instance.this.private_ip
}

Appendix E: CI/CD Templates

GitHub Actions Multi-Stage:

name: Multi-Stage Pipeline

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

env:
  AWS_REGION: us-east-1
  ECR_REPOSITORY: myapp

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v2
    
    - name: Run tests
      run: |
        npm ci
        npm test
        npm run lint
    
    - name: Upload coverage
      uses: codecov/codecov-action@v2

  build:
    needs: test
    runs-on: ubuntu-latest
    if: github.event_name == 'push'
    outputs:
      image_tag: ${{ steps.docker_build.outputs.image_tag }}
    
    steps:
    - uses: actions/checkout@v2
    
    - name: Configure AWS credentials
      uses: aws-actions/configure-aws-credentials@v1
      with:
        aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
        aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        aws-region: ${{ env.AWS_REGION }}
    
    - name: Login to Amazon ECR
      id: login-ecr
      uses: aws-actions/amazon-ecr-login@v1
    
    - name: Build and push Docker image
      id: docker_build
      env:
        ECR_REGISTRY: ${{ steps.login-ecr.outputs.registry }}
        IMAGE_TAG: ${{ github.sha }}
      run: |
        docker build -t $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG .
        docker push $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG
        echo "::set-output name=image_tag::$IMAGE_TAG"
    
    - name: Scan image
      uses: aquasecurity/trivy-action@master
      with:
        image-ref: ${{ steps.login-ecr.outputs.registry }}/${{ env.ECR_REPOSITORY }}:${{ github.sha }}
        severity: CRITICAL,HIGH
        exit-code: 1

  deploy-dev:
    needs: build
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/develop'
    environment: development
    
    steps:
    - uses: actions/checkout@v2
    
    - name: Configure AWS credentials
      uses: aws-actions/configure-aws-credentials@v1
      with:
        aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
        aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        aws-region: ${{ env.AWS_REGION }}
    
    - name: Login to Amazon ECR
      id: login-ecr
      uses: aws-actions/amazon-ecr-login@v1
    
    - name: Update kubeconfig
      run: aws eks update-kubeconfig --name dev-cluster --region ${{ env.AWS_REGION }}
    
    - name: Deploy to EKS
      run: |
        kubectl set image deployment/myapp \
          myapp=${{ steps.login-ecr.outputs.registry }}/${{ env.ECR_REPOSITORY }}:${{ needs.build.outputs.image_tag }} \
          -n development
        
        kubectl rollout status deployment/myapp -n development

  deploy-prod:
    needs: build
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    environment: production
    
    steps:
    - uses: actions/checkout@v2
    
    - name: Configure AWS credentials
      uses: aws-actions/configure-aws-credentials@v1
      with:
        aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
        aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        aws-region: ${{ env.AWS_REGION }}
    
    - name: Login to Amazon ECR
      id: login-ecr
      uses: aws-actions/amazon-ecr-login@v1
    
    - name: Update kubeconfig
      run: aws eks update-kubeconfig --name prod-cluster --region ${{ env.AWS_REGION }}
    
    - name: Deploy to production
      run: |
        # Canary deployment (10%)
        kubectl set image deployment/myapp-canary \
          myapp=${{ steps.login-ecr.outputs.registry }}/${{ env.ECR_REPOSITORY }}:${{ needs.build.outputs.image_tag }} \
          -n production
        
        # Wait and monitor
        sleep 300
        
        # Full rollout
        kubectl set image deployment/myapp \
          myapp=${{ steps.login-ecr.outputs.registry }}/${{ env.ECR_REPOSITORY }}:${{ needs.build.outputs.image_tag }} \
          -n production
        
        kubectl rollout status deployment/myapp -n production

GitLab CI Pipeline:

stages:
  - test
  - build
  - deploy

variables:
  DOCKER_DRIVER: overlay2
  IMAGE_TAG: $CI_COMMIT_SHORT_SHA
  DOCKER_HOST: tcp://docker:2375

cache:
  paths:
    - node_modules/

test:
  stage: test
  image: node:16
  script:
    - npm ci
    - npm run lint
    - npm test
  coverage: '/All files[^|]*\|[^|]*\s+([\d\.]+)/'

build:
  stage: build
  image: docker:20.10.16
  services:
    - docker:20.10.16-dind
  # Login only where docker is used; the test job's node image
  # has no docker daemon
  before_script:
    - docker login -u $CI_REGISTRY_USER -p $CI_REGISTRY_PASSWORD $CI_REGISTRY
  script:
    - docker build -t $CI_REGISTRY_IMAGE:$IMAGE_TAG .
    - docker push $CI_REGISTRY_IMAGE:$IMAGE_TAG
  only:
    - main
    - develop

.deploy_template: &deploy_template
  stage: deploy
  image: alpine/k8s:1.22  # ships kubectl, no manual install needed
  script:
    - kubectl set image deployment/myapp myapp=$CI_REGISTRY_IMAGE:$IMAGE_TAG -n $K8S_NAMESPACE
    - kubectl rollout status deployment/myapp -n $K8S_NAMESPACE

deploy_dev:
  <<: *deploy_template
  variables:
    K8S_NAMESPACE: development
  environment:
    name: development
    url: https://dev.example.com
  only:
    - develop

deploy_staging:
  <<: *deploy_template
  variables:
    K8S_NAMESPACE: staging
  environment:
    name: staging
    url: https://staging.example.com
  only:
    - main

deploy_production:
  <<: *deploy_template
  variables:
    K8S_NAMESPACE: production
  environment:
    name: production
    url: https://example.com
  only:
    - main
  when: manual
  needs: ["deploy_staging"]

Appendix F: DevOps Interview Questions

General DevOps:

  1. What is DevOps and why is it important?
  2. Explain the CAMS model.
  3. What are the Three Ways of DevOps?
  4. How do you measure DevOps success?
  5. What is the difference between Continuous Delivery and Continuous Deployment?
  6. Explain the concept of "shift left" in security.
  7. What is Conway's Law and how does it apply to DevOps?
  8. How do you handle blameless postmortems?
  9. What are DORA metrics?
  10. Explain the difference between Agile and DevOps.

CI/CD:

  1. How would you design a CI/CD pipeline?
  2. What's the difference between Jenkins, GitHub Actions, and GitLab CI?
  3. How do you handle database migrations in CI/CD?
  4. Explain blue/green deployment.
  5. What is canary deployment and when would you use it?
  6. How do you handle secrets in CI/CD pipelines?
  7. What is pipeline as code and why is it important?
  8. How do you ensure pipeline security?
  9. Explain the concept of "build once, deploy many".
  10. How do you handle rollbacks?

Containers & Kubernetes:

  1. What's the difference between Docker and Kubernetes?
  2. Explain Kubernetes architecture.
  3. How do you expose an application running in Kubernetes?
  4. What are Kubernetes Operators?
  5. How do you handle persistent storage in Kubernetes?
  6. Explain Kubernetes network policies.
  7. What's the difference between a deployment and a statefulset?
  8. How do you debug a pod that won't start?
  9. What is Helm and why use it?
  10. Explain Kubernetes RBAC.

Infrastructure as Code:

  1. What's the difference between declarative and imperative IaC?
  2. Explain Terraform vs Ansible.
  3. How do you manage Terraform state?
  4. What are modules in Terraform and why use them?
  5. How do you test infrastructure code?
  6. What is immutable infrastructure?
  7. Explain idempotency in IaC.
  8. How do you handle secrets in Terraform?
  9. What's the difference between Terraform and CloudFormation?
  10. How do you version infrastructure code?

Cloud:

  1. Explain the shared responsibility model.
  2. What's the difference between IaaS, PaaS, and SaaS?
  3. How do you design for high availability?
  4. Explain multi-region architecture.
  5. How do you manage cloud costs?
  6. What is VPC peering?
  7. Explain the difference between security groups and network ACLs.
  8. How do you implement disaster recovery?
  9. What is a landing zone?
  10. How do you handle cloud governance?

Monitoring & SRE:

  1. What are the four golden signals?
  2. Explain SLIs, SLOs, and SLAs.
  3. What is an error budget?
  4. How do you design effective alerts?
  5. What's the difference between metrics, logs, and traces?
  6. Explain the USE method.
  7. What is the RED method?
  8. How do you handle on-call rotations?
  9. What is chaos engineering?
  10. How do you measure reliability?

Security:

  1. What is DevSecOps?
  2. How do you implement security in CI/CD?
  3. What is SAST vs DAST?
  4. Explain container security best practices.
  5. How do you manage secrets?
  6. What is SBOM and why is it important?
  7. How do you scan for vulnerabilities?
  8. Explain the principle of least privilege.
  9. What is policy as code?
  10. How do you handle compliance in cloud?

Scenario Questions:

  1. A deployment is causing 500 errors. How do you respond?
  2. How would you migrate a legacy application to the cloud?
  3. Your builds are taking 30 minutes. How do you optimize?
  4. How would you implement a multi-region disaster recovery plan?
  5. A critical vulnerability is found in a dependency. What do you do?
  6. How would you convince management to invest in DevOps?
  7. Your team is experiencing burnout from on-call. How do you fix it?
  8. How would you design a platform for 100 microservices?
  9. A database migration caused downtime. How do you prevent recurrence?
  10. How would you implement cost optimization for a growing startup?

Appendix G: DevOps Maturity Model

Level 1: Initial

  • Manual deployments
  • No version control
  • Siloed teams
  • Reactive monitoring
  • Long release cycles (months)
  • High failure rate
  • Firefighting culture

Level 2: Managed

  • Version control for code
  • Basic CI (build automation)
  • Some documentation
  • Scheduled releases
  • Basic monitoring
  • Defined roles
  • Tickets for operations

Level 3: Defined

  • CI/CD pipelines
  • Automated testing
  • Configuration management
  • Standardized environments
  • Proactive monitoring
  • Defined SLIs/SLOs
  • Blameless postmortems

Level 4: Measured

  • Pipeline as code
  • Infrastructure as code
  • Self-service platforms
  • Automated security scanning
  • Performance testing
  • Capacity planning
  • Error budgets

Level 5: Optimizing

  • GitOps workflows
  • Chaos engineering
  • AIOps/MLOps
  • Auto-remediation
  • Continuous experimentation
  • FinOps optimization
  • Platform engineering

Appendix H: Glossary

A

  • Agile: Iterative software development methodology
  • Artifact: Output of build process (JAR, Docker image)
  • Autoscaling: Automatically adjusting resources based on demand

B

  • Blue/Green Deployment: Two identical environments, switch traffic
  • Build: Process of compiling source code into artifacts

C

  • CAMS: Culture, Automation, Measurement, Sharing
  • Canary Deployment: Gradual rollout to subset of users
  • CD: Continuous Delivery/Deployment
  • CI: Continuous Integration
  • Chaos Engineering: Deliberately introducing failures
  • CNCF: Cloud Native Computing Foundation
  • Container: Lightweight virtualization at OS level
  • CRD: Custom Resource Definition (Kubernetes)

D

  • DaemonSet: Runs pod on every node (Kubernetes)
  • DAST: Dynamic Application Security Testing
  • Deployment: Kubernetes resource for managing pods
  • DevOps: Cultural and technical movement for collaboration
  • DORA: DevOps Research and Assessment
  • Docker: Container platform

E

  • EKS: Amazon Elastic Kubernetes Service
  • ELK: Elasticsearch, Logstash, Kibana
  • Error Budget: (1 - SLO) × time window; the amount of unreliability a service may accrue

F

  • Feature Flag: Toggle for feature visibility
  • FinOps: Cloud financial management
  • Flux: GitOps operator

G

  • Git: Distributed version control
  • GitOps: Git as source of truth for infrastructure
  • GKE: Google Kubernetes Engine
  • Grafana: Visualization platform

H

  • Helm: Kubernetes package manager
  • HPA: Horizontal Pod Autoscaler
  • Hybrid Cloud: Mix of public and private cloud

I

  • IaC: Infrastructure as Code
  • IAM: Identity and Access Management
  • Idempotent: Operation with same effect when run multiple times
  • Ingress: Kubernetes API object for external access
  • Istio: Service mesh

J

  • Jenkins: CI/CD automation server
  • JSON: JavaScript Object Notation

K

  • K8s: Kubernetes ("K", 8 letters, "s")
  • Kustomize: Kubernetes configuration customization
  • Kyverno: Kubernetes policy engine

L

  • Lambda: AWS serverless function
  • Load Balancer: Distributes traffic
  • Logging: Recording events

M

  • Microservices: Architecture with small, independent services
  • Monitoring: Collecting and analyzing metrics
  • mTLS: Mutual TLS for service authentication

N

  • Namespace: Isolation mechanism in Kubernetes
  • Network Policy: Firewall rules for pods
  • Node: Worker machine in Kubernetes

O

  • Observability: Understanding system internals through outputs
  • OCI: Open Container Initiative
  • OPA: Open Policy Agent
  • Operator: Kubernetes extension for application management

P

  • PaaS: Platform as a Service
  • Pod: Smallest deployable unit in Kubernetes
  • Prometheus: Monitoring system
  • PV: Persistent Volume
  • PVC: Persistent Volume Claim

R

  • RBAC: Role-Based Access Control
  • ReplicaSet: Ensures specified number of pods running
  • Rolling Update: Gradually replacing instances
  • Runbook: Documented procedures for operations

S

  • SaaS: Software as a Service
  • SAST: Static Application Security Testing
  • SBOM: Software Bill of Materials
  • Secret: Kubernetes resource for sensitive data
  • Service Mesh: Infrastructure layer for service communication
  • SLA: Service Level Agreement
  • SLI: Service Level Indicator
  • SLO: Service Level Objective
  • SRE: Site Reliability Engineering

T

  • Terraform: IaC tool by HashiCorp
  • Toil: Manual, repetitive operational work
  • Tracing: Tracking request through distributed system

U

  • Unit Test: Testing individual components
  • USE Method: Utilization, Saturation, Errors

V

  • VCS: Version Control System
  • VPC: Virtual Private Cloud
  • VPA: Vertical Pod Autoscaler

W

  • Waterfall: Sequential development methodology
  • Workload: Application running on Kubernetes

X

  • XML: eXtensible Markup Language

Y

  • YAML: YAML Ain't Markup Language

Z

  • Zero Downtime Deployment: Deployment without service interruption
