
@aw-junaid
Created February 23, 2026 23:15
A Comprehensive Guide to Modern DevOps Practices, Tools, and Cultural Transformation

DevOps Engineering: From Foundations to Enterprise-Scale Platform Architecture


Table of Contents

PART I — DEVOPS FOUNDATIONS

  • Chapter 1 — Introduction to DevOps
  • Chapter 2 — DevOps Culture & Organizational Design
  • Chapter 3 — Linux & System Fundamentals for DevOps

PART II — VERSION CONTROL & COLLABORATION

  • Chapter 4 — Git Internals & Advanced Workflows
  • Chapter 5 — Platforms

PART III — CI/CD PIPELINES

  • Chapter 6 — Continuous Integration
  • Chapter 7 — CI Tools
  • Chapter 8 — Continuous Delivery & Deployment

PART IV — CONTAINERS & ORCHESTRATION

  • Chapter 9 — Containerization
  • Chapter 10 — Kubernetes Deep Dive
  • Chapter 11 — Kubernetes in Production

PART V — INFRASTRUCTURE AS CODE

  • Chapter 12 — Infrastructure as Code Principles
  • Chapter 13 — IaC Tools

PART VI — CLOUD PLATFORMS

  • Chapter 14 — Cloud Fundamentals
  • Chapter 15 — Amazon Web Services
  • Chapter 16 — Microsoft Azure
  • Chapter 17 — Google Cloud Platform

PART VII — OBSERVABILITY & SRE

  • Chapter 18 — Monitoring & Logging
  • Chapter 19 — Site Reliability Engineering

PART VIII — DEVSECOPS

  • Chapter 20 — Secure DevOps
  • Chapter 21 — Security Tools

PART IX — ADVANCED TOPICS

  • Chapter 22 — GitOps & Platform Engineering
  • Chapter 23 — Serverless & Edge
  • Chapter 24 — Performance & Scalability
  • Chapter 25 — DevOps at Enterprise Scale

PART X — PRACTICAL IMPLEMENTATION

  • Chapter 26 — Building a Complete DevOps Pipeline
  • Chapter 27 — Real-World Case Studies

Appendices


PART I — DEVOPS FOUNDATIONS

Chapter 1 — Introduction to DevOps

1.1 History of Software Development

The journey of software development methodologies spans over six decades, evolving from the nascent days of computing to the sophisticated, automated pipelines we see today. Understanding this history is crucial for appreciating why DevOps emerged as a necessary evolution rather than a passing trend.

The Pioneering Era (1950s-1960s)

In the early days of computing, software was tightly coupled with hardware. Programs were written in machine language or assembly, and the concept of "software development" as a distinct discipline barely existed. The IBM 704, introduced in 1954, was one of the first mass-produced computers, and programming it involved physical plugboards and punch cards. There was no separation between development and operations—the same people who wrote the code also ran the machines. This period was characterized by:

  • Batch Processing: Jobs were submitted on punch cards, and results would return hours or days later.
  • Hardware Dominance: Software was often given away for free with hardware purchases.
  • No Standardization: Every machine had its own architecture and instruction set.

The Software Crisis and Structured Programming (1960s-1970s)

As hardware became more powerful and affordable, software complexity grew exponentially. The NATO Software Engineering Conferences of 1968 and 1969 coined the term "software crisis," highlighting that projects were running over budget, over time, and producing unreliable software. This crisis led to:

  • Structured Programming: Pioneered by Edsger Dijkstra and others, this paradigm introduced disciplined control structures (if-then-else, loops) instead of chaotic goto statements.
  • The Waterfall Model: Winston Royce's 1970 paper (often mischaracterized) described a sequential model that would become the dominant methodology for decades.
  • Separation of Concerns: For the first time, distinct roles emerged—analysts, designers, programmers, testers, and operators.

The Rise of Personal Computing and Client-Server (1980s)

The 1980s brought personal computers and the client-server architecture. Software was now shipped on floppy disks and later CDs. This era saw:

  • Packaged Software: Companies like Microsoft began selling software as products.
  • Graphical User Interfaces: The Macintosh (1984) and Windows (1985) made computing accessible to non-technical users.
  • Networked Applications: With the growth of LANs, applications became distributed.
  • Formalized ITIL: The Information Technology Infrastructure Library emerged in the UK, providing a framework for IT service management, further codifying the separation between development (creating applications) and operations (running infrastructure).

The Internet Boom (1990s)

The commercialization of the internet in the mid-1990s changed everything. Companies like Amazon (1994), eBay (1995), and Google (1998) were born on the web, long before cloud computing as we know it existed. This period introduced:

  • Web Applications: Software was no longer installed but accessed via browsers.
  • LAMP Stack: Linux, Apache, MySQL, and PHP/Python/Perl became the dominant open-source web development platform.
  • Rapid Growth: The pressure to release features quickly to beat competitors intensified.
  • Dot-com Bubble: The frenzy led to massive investments and subsequent crash, but the foundational technologies survived.

The Agile Manifesto (2001)

By the late 1990s, the heavyweight, documentation-driven methodologies were creaking under the pressure of internet-speed development. Seventeen software developers met at a ski resort in Utah and crafted the Agile Manifesto, which emphasized:

  • Individuals and interactions over processes and tools
  • Working software over comprehensive documentation
  • Customer collaboration over contract negotiation
  • Responding to change over following a plan

Agile methodologies like Scrum, Extreme Programming (XP), and Kanban transformed how development teams worked, promoting iterative development, continuous feedback, and cross-functional collaboration. However, Agile focused primarily on developers and product owners—operations remained largely untouched.

1.2 From Waterfall to Agile

To understand the transition from Waterfall to Agile, we must examine both methodologies in depth.

The Waterfall Model

The Waterfall model, despite its widespread adoption, was never intended to be rigid. Royce's original paper actually recommended iteration. However, the model that emerged was strictly sequential:

  1. Requirements Analysis: Gather and document all requirements before any design begins.
  2. System Design: Create detailed architectural and design specifications based on requirements.
  3. Implementation: Write code according to the design documents.
  4. Testing: Verify that the implemented system meets the requirements.
  5. Deployment: Release the tested system to production.
  6. Maintenance: Fix issues and make enhancements post-release.

Challenges with Waterfall:

  • Late Feedback: Users don't see working software until very late in the process.
  • Change Resistance: Changing requirements mid-stream is expensive and disruptive.
  • Integration Hell: Integration happens at the end, often revealing conflicts and issues that require significant rework.
  • Long Release Cycles: Releases might take months or years.
  • Siloed Teams: Developers throw code "over the wall" to testers, who then throw it to operations.

The Agile Revolution

Agile methodologies emerged as a direct response to these challenges. The Agile Manifesto's 12 principles include:

  • Deliver working software frequently, from a couple of weeks to a couple of months.
  • Welcome changing requirements, even late in development.
  • Business people and developers must work together daily throughout the project.
  • Build projects around motivated individuals and trust them to get the job done.
  • Working software is the primary measure of progress.
  • Continuous attention to technical excellence and good design enhances agility.

Scrum became the most popular Agile framework, introducing:

  • Sprints: Time-boxed iterations (usually 2 weeks)
  • Roles: Product Owner, Scrum Master, Development Team
  • Ceremonies: Sprint Planning, Daily Stand-up, Sprint Review, Sprint Retrospective

Kanban offered a different approach:

  • Visualize workflow
  • Limit work in progress
  • Manage flow
  • Make process policies explicit
  • Improve collaboratively

The Gap Agile Created

While Agile dramatically improved development productivity, it inadvertently widened the gap between Dev and Ops. Developers were now releasing software every two weeks, but operations teams (still following ITIL) were accustomed to quarterly or annual releases. This created:

  • Deployment Conflicts: Developers wanted frequent releases; operations prioritized stability.
  • Environment Inconsistencies: Code worked on developer laptops but failed in production.
  • Blame Game: When production issues occurred, developers blamed operations for poor infrastructure, and operations blamed developers for buggy code.
  • Manual Handoffs: Each release required manual documentation, change requests, and deployment procedures.

1.3 The DevOps Movement

The term "DevOps" was coined in 2009 by Patrick Debois, who organized the first DevOpsDays conference in Ghent, Belgium. However, the ideas behind DevOps had been brewing for years.

The Agile Infrastructure Conversation

In 2008, at the Agile Conference in Toronto, Andrew Clay Shafer and Patrick Debois discussed the idea of "Agile Infrastructure." They realized that the principles of Agile—collaboration, iteration, feedback—could and should apply to operations. This conversation planted the seeds for what would become DevOps.

The Flickr Talk

At the 2009 Velocity Conference, John Allspaw and Paul Hammond from Flickr presented "10+ Deploys per Day: Dev and Ops Cooperation at Flickr." This groundbreaking talk showed how Flickr had broken down the barriers between development and operations, achieving unprecedented deployment frequency. The talk went viral in the tech community and catalyzed the DevOps movement.

Defining DevOps

DevOps is not a tool, a job title, or a specific technology. It's a cultural and professional movement that stresses communication, collaboration, and integration between software developers and IT operations professionals. At its core, DevOps aims to:

  • Break down silos between development, operations, and other stakeholders
  • Automate manual processes to increase efficiency and reduce errors
  • Measure everything to understand system behavior and business impact
  • Share knowledge, responsibility, and ownership across teams

The Three Ways

Gene Kim, in "The Phoenix Project" and "The DevOps Handbook," codified DevOps principles into "The Three Ways":

First Way: Systems Thinking (Flow)

  • Emphasizes the performance of the entire system, not just silos
  • Focus on creating fast, smooth flow from development to operations to the customer
  • Never pass known defects downstream
  • Optimize for global goals, not local efficiencies

Second Way: Amplify Feedback Loops

  • Create short, fast feedback loops from operations back to development
  • Enable quick detection and recovery from issues
  • Swarm problems to prevent recurrence
  • Build quality in by finding and fixing defects at the source

Third Way: Culture of Continuous Experimentation and Learning

  • Foster a culture that takes risks and learns from failure
  • Understand that repetition and practice are prerequisites to mastery
  • Allocate time for improvement of daily work
  • Introduce faults to increase resilience (chaos engineering)

1.4 CAMS Model (Culture, Automation, Measurement, Sharing)

The CAMS model, popularized by Damon Edwards and John Willis, provides a framework for understanding the core dimensions of DevOps.

Culture (The Foundation)

Culture is the most critical and most challenging aspect of DevOps. It encompasses:

  • Trust and Collaboration: Teams trust each other and collaborate across boundaries.
  • Shared Goals: Dev and Ops share responsibility for the entire service lifecycle.
  • Respect: Each team respects the others' expertise and constraints.
  • Experimentation: Failure is viewed as a learning opportunity, not a reason for punishment.
  • Continuous Improvement: Teams constantly seek ways to improve processes and systems.

Culture Anti-patterns:

  • Blaming individuals for system failures
  • Throwing work "over the wall" between teams
  • Hiding information or hoarding knowledge
  • Fear of change or experimentation

Automation (The Enabler)

Automation is what makes DevOps practices scalable and repeatable. Key areas include:

  • Infrastructure Automation: Provisioning servers, networks, and storage through code (Terraform, CloudFormation)
  • Configuration Automation: Managing system configurations (Ansible, Puppet, Chef)
  • Build and Deployment Automation: CI/CD pipelines (Jenkins, GitHub Actions)
  • Testing Automation: Automated unit, integration, and security tests
  • Environment Management: Consistent development, testing, and production environments

Automation Principles:

  • Automate repetitive, error-prone manual tasks
  • Version control everything (infrastructure, configuration, pipelines)
  • Treat automation code as production code (testing, review, documentation)
  • Start with the most painful manual processes first
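The last of these principles, treating automation code like production code, is easier to see with a concrete sketch. The hypothetical Python task below is idempotent (safe to rerun) and reports a changed/unchanged result in the style configuration tools such as Ansible use; the file name and setting are purely illustrative:

```python
import tempfile
from pathlib import Path

def ensure_line(path: Path, line: str) -> bool:
    """Idempotently ensure `line` is present in the file at `path`.

    Returns True if the file was changed, False if it was already in
    the desired state -- the 'changed' vs 'ok' convention that
    configuration management tools report.
    """
    existing = path.read_text().splitlines() if path.exists() else []
    if line in existing:
        return False  # already converged: running again is a no-op
    path.write_text("\n".join(existing + [line]) + "\n")
    return True

# Running the task twice demonstrates idempotency: only the first
# run reports a change.
cfg = Path(tempfile.mkdtemp()) / "sshd_config.sample"
first = ensure_line(cfg, "PermitRootLogin no")
second = ensure_line(cfg, "PermitRootLogin no")
print(first, second)  # True False
```

Because the task is a plain function with a deterministic result, it can be unit-tested and code-reviewed exactly like application code.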

Measurement (The Evidence)

You cannot improve what you cannot measure. Measurement in DevOps includes:

  • Deployment Metrics: Frequency, lead time, success rate
  • Operational Metrics: Availability, latency, throughput, error rates
  • Business Metrics: Customer satisfaction, revenue, feature adoption
  • Team Metrics: Morale, burnout, knowledge sharing

Key Performance Indicators (KPIs):

  • Deployment Frequency: How often do we deploy to production?
  • Lead Time for Changes: How long does it take from commit to running in production?
  • Mean Time to Recovery (MTTR): How quickly can we recover from failures?
  • Change Failure Rate: What percentage of changes cause degraded service?
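These four KPIs (the DORA metrics) can be computed mechanically from deployment records. A minimal Python sketch follows; the `Deployment` data model is illustrative, not a standard schema:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional

@dataclass
class Deployment:
    committed_at: datetime                  # first commit of the change
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause degraded service?
    restored_at: Optional[datetime] = None  # when service recovered, if failed

def dora_metrics(deploys, window_days):
    """Compute the four DORA metrics from a window of deployment records."""
    frequency = len(deploys) / window_days  # deployment frequency (per day)
    lead_time = median(d.deployed_at - d.committed_at for d in deploys)
    failures = [d for d in deploys if d.failed]
    change_failure_rate = len(failures) / len(deploys)
    # Time to restore, taken here as the median over failed deployments.
    time_to_restore = (median(d.restored_at - d.deployed_at for d in failures)
                       if failures else timedelta(0))
    return frequency, lead_time, change_failure_rate, time_to_restore

now = datetime(2025, 1, 10)
deploys = [
    Deployment(now - timedelta(hours=30), now - timedelta(hours=28)),
    Deployment(now - timedelta(hours=10), now - timedelta(hours=9),
               failed=True, restored_at=now - timedelta(hours=8, minutes=30)),
]
freq, lead, cfr, mttr = dora_metrics(deploys, window_days=7)
print(f"{freq:.2f}/day, lead {lead}, CFR {cfr:.0%}, restore {mttr}")
```

Medians are used rather than means so a single pathological deployment does not dominate the numbers; teams differ on this choice.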

Sharing (The Multiplier)

Sharing creates a virtuous cycle where knowledge and improvements propagate throughout the organization.

  • Cross-functional Teams: Dev and Ops work together on shared goals.
  • Knowledge Transfer: Pair programming, documentation, brown bag sessions.
  • Shared Tools and Platforms: Internal developer platforms, common toolchains.
  • Blame-free Postmortems: Share learnings from failures without fear of reprisal.
  • Open Source Contributions: Share innovations with the broader community.

1.5 DevOps vs Agile vs SRE

Understanding the distinctions and relationships between these complementary approaches is essential.

DevOps vs Agile

| Aspect | Agile | DevOps |
|---|---|---|
| Focus | Development practices | Full lifecycle (Dev+Ops) |
| Primary Goal | Deliver value iteratively | Deliver value continuously and reliably |
| Scope | Development team | Development + Operations + QA + Security |
| Timeframe | Sprint iterations | Continuous delivery pipeline |
| Key Practices | Stand-ups, retrospectives, story pointing | CI/CD, monitoring, infrastructure as code |
| Metrics | Velocity, story points | DORA metrics, SLIs/SLOs |

Relationship: Agile and DevOps are complementary. Agile improves how features are built; DevOps improves how those features are delivered and operated. Many organizations adopt Agile first, then DevOps to address operational bottlenecks.

DevOps vs SRE

Site Reliability Engineering (SRE) was pioneered at Google and codified by Ben Treynor Sloss. SRE applies software engineering principles to operations problems.

| Aspect | DevOps | SRE |
|---|---|---|
| Origin | Community movement | Google internal practice |
| Philosophy | Break down silos, collaborate | Apply software engineering to ops |
| Key Concept | CAMS model | Error budgets |
| Implementation | Cultural and technical practices | Specific roles and practices |
| Focus | Collaboration and automation | Reliability and scalability |

Relationship: Google describes SRE as "what happens when you ask a software engineer to design an operations team." Many consider SRE a specific implementation of DevOps principles with a stronger focus on reliability engineering.

Key SRE Practices:

  • Service Level Objectives (SLOs) and Error Budgets
  • Eliminating toil through automation
  • Monitoring and alerting design
  • Capacity planning
  • Incident response
  • Chaos engineering
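Error budgets reduce to simple arithmetic: an availability SLO directly implies how much downtime is tolerable per window. For example, a 99.9% SLO over 30 days leaves a 0.1% budget, roughly 43 minutes of acceptable downtime:

```python
from datetime import timedelta

def error_budget(slo: float, window: timedelta) -> timedelta:
    """Downtime permitted by an availability SLO over a window.

    A 99.9% SLO leaves a 0.1% error budget: the service may be
    unavailable for 0.1% of the window before the SLO is breached.
    """
    return (1.0 - slo) * window

budget = error_budget(0.999, timedelta(days=30))
print(budget)  # about 43 minutes per 30-day window
```

While budget remains, teams ship changes freely; once it is spent, releases pause in favor of reliability work.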

1.6 DevSecOps Overview

DevSecOps integrates security practices throughout the DevOps lifecycle rather than adding security as a final gate. The motto is "Security as Code" and "Shift Left" (moving security earlier in the development process).

Why DevSecOps?

Traditional security approaches created bottlenecks:

  • Security testing happened at the end of development
  • Security findings caused last-minute delays
  • Security teams were seen as blockers, not enablers
  • Vulnerabilities were discovered too late for easy remediation

DevSecOps Principles:

  1. Shift Left: Test security early and often throughout the pipeline
  2. Automate Security: Integrate automated security tools into CI/CD
  3. Security as Code: Define security policies and configurations in code
  4. Continuous Compliance: Automate compliance checking and reporting
  5. Shared Responsibility: Everyone owns security, not just the security team

Security Integration Points:

  • Code: SAST (Static Application Security Testing), secrets scanning
  • Dependencies: SCA (Software Composition Analysis), dependency scanning
  • Build: Container scanning, SBOM generation
  • Deploy: Policy as code, compliance validation
  • Runtime: DAST (Dynamic Application Security Testing), runtime protection
  • Infrastructure: Infrastructure scanning, cloud security posture management
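To make "shift left" concrete, the secrets-scanning step above can be sketched as a pre-commit-style check. The patterns below are illustrative only; real secret scanners ship large, curated, entropy-aware rule sets:

```python
import re

# Illustrative patterns only -- not a production rule set.
SECRET_PATTERNS = {
    "AWS access key": re.compile(r"AKIA[0-9A-Z]{16}"),
    "Private key header": re.compile(r"-----BEGIN (RSA|EC|OPENSSH) PRIVATE KEY-----"),
    "Hardcoded password": re.compile(r"password\s*=\s*['\"][^'\"]+['\"]", re.IGNORECASE),
}

def scan(text: str) -> list:
    """Return the names of secret patterns found in `text`."""
    return [name for name, pat in SECRET_PATTERNS.items() if pat.search(text)]

findings = scan('db_password = "hunter2"  # TODO remove')
print(findings)  # ['Hardcoded password']
```

Wired into CI (or a pre-commit hook), a check like this fails the build before a credential ever reaches the repository history, which is far cheaper than rotating a leaked key later.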

1.7 Platform Engineering Evolution

Platform Engineering has emerged as a natural evolution of DevOps practices, especially in large organizations. It focuses on building Internal Developer Platforms (IDPs) that abstract infrastructure complexity and provide self-service capabilities to development teams.

The Problem Platform Engineering Solves

As organizations scale, the cognitive load on developers increases:

  • Multiple cloud providers
  • Complex Kubernetes configurations
  • Numerous tools and technologies
  • Security and compliance requirements
  • Observability setup

Developers spend more time on infrastructure and tooling than on business logic.

What is an Internal Developer Platform?

An IDP is a cohesive layer of tools and services that development teams use to build, deploy, and operate applications without needing to understand the underlying infrastructure.

Key Capabilities:

  • Self-service provisioning of environments
  • Standardized deployment pipelines
  • Built-in security and compliance controls
  • Golden paths and paved roads
  • Observability and debugging tools
  • Documentation and onboarding

Platform Engineering vs DevOps

| Aspect | DevOps | Platform Engineering |
|---|---|---|
| Focus | Culture and practices | Building and maintaining platforms |
| Target | All teams | Platform team and application teams |
| Output | Improved collaboration | Internal developer platform |
| Key Metric | DORA metrics | Developer satisfaction, time-to-value |

1.8 DevOps Myths and Anti-Patterns

Common Myths:

Myth 1: DevOps is a tool or technology.
Reality: DevOps is fundamentally about culture and practices. Tools enable DevOps but don't create it.

Myth 2: DevOps means no operations team.
Reality: Operations responsibilities shift from manual management to building automation and platforms.

Myth 3: DevOps is only for startups.
Reality: Large enterprises like Amazon, Netflix, and Google have successfully adopted DevOps.

Myth 4: DevOps requires rewriting everything.
Reality: DevOps can be applied incrementally to existing systems and processes.

Myth 5: DevOps eliminates the need for testing.
Reality: Testing becomes more critical and more automated.

Anti-patterns:

  1. DevOps Team: Creating a separate "DevOps team" that acts as a silo defeats the purpose.

  2. Tools First: Buying and installing tools without addressing culture and processes.

  3. Automation Without Understanding: Automating broken processes just breaks things faster.

  4. No Measurement: Implementing practices without measuring their impact.

  5. Skipping Security: Treating security as an afterthought.

  6. Hero Culture: Relying on individuals to fix problems manually rather than building resilient systems.

  7. Ignoring Technical Debt: Accumulating technical debt that slows down delivery.

1.9 Business Impact of DevOps

Organizations that successfully implement DevOps see measurable business benefits:

Speed:

  • 200x more frequent deployments (DORA research)
  • 2,555x faster lead time from commit to deploy
  • Faster time-to-market for new features

Stability:

  • 3x lower change failure rate
  • 24x faster recovery from failures
  • 50% fewer outages

Security:

  • 50% less time spent on security remediation
  • Faster vulnerability patching
  • Improved compliance posture

Business Outcomes:

  • Higher customer satisfaction
  • Increased market share
  • Better employee retention
  • Lower operational costs
  • Improved innovation capacity

1.10 Case Studies

Netflix: Cloud Native Excellence

Netflix's DevOps journey is legendary. After a major database corruption in 2008 that prevented DVD shipments for days, Netflix committed to moving to AWS and embracing cloud-native architecture.

Key Practices:

  • Chaos Engineering: Simian Army tools (Chaos Monkey) deliberately cause failures to test resilience
  • Immutable Infrastructure: Servers are never patched; they're replaced
  • Microservices: Thousands of microservices running on AWS
  • Continuous Delivery: Thousands of deployments daily
  • Culture of Freedom and Responsibility: Engineers have significant autonomy and ownership

Results: Netflix achieved global scale, 99.99% availability, and the ability to deploy thousands of times daily.

Amazon: The Deployment Machine

Amazon's journey to DevOps was driven by CEO Jeff Bezos' mandate: all teams must expose their data and functionality through service interfaces, and teams must communicate only through these interfaces.

Key Practices:

  • Two-Pizza Teams: Small, autonomous teams (fewer than 10 people)
  • You Build It, You Run It: Teams own their services end-to-end
  • Single-threaded Ownership: Clear ownership without shared responsibility
  • Deployment Pipeline: Sophisticated pipeline enabling 50 million+ deployments annually
  • API Mandate: All communication through well-defined APIs

Results: Amazon performs tens of millions of deployments per year, averaging more than one per second, with each team deploying independently.

Google: SRE Pioneers

Google developed SRE to manage its massive scale. The SRE team at Google is responsible for keeping services running while maintaining a 50% cap on operational work—the rest is development work to improve systems.

Key Practices:

  • Error Budgets: 100% reliability is the wrong target; error budgets define acceptable unreliability
  • Borg/Omega/Kubernetes: Internal container orchestration evolved into Kubernetes
  • Blameless Postmortems: Focus on fixing systems, not blaming people
  • Toil Elimination: Automate away repetitive operational work
  • Capacity Planning: Data-driven approach to scaling

Results: Google maintains incredible reliability (Gmail 99.978%) while continuously deploying thousands of changes.


Chapter 2 — DevOps Culture & Organizational Design

2.1 Organizational Structures (Functional vs Product Teams)

The structure of an organization profoundly impacts its ability to implement DevOps. Understanding different organizational models is essential.

Functional (Siloed) Structure

In traditional IT organizations, teams are structured by function:

                    CEO
        ┌────────────┼────────────┐
    Development    QA        Operations
        │             │             │
    Dev Teams    QA Teams    Ops Teams

Characteristics:

  • Clear career paths within functions
  • Deep expertise in specific domains
  • Standardized practices within silos
  • Handoffs between teams
  • Local optimization over global outcomes

Problems with Functional Structure:

  • Slow handoffs create bottlenecks
  • Misaligned incentives (Dev wants features, Ops wants stability)
  • Blame culture when things go wrong
  • Knowledge silos
  • Difficulty implementing end-to-end ownership

Product-Aligned (Cross-functional) Structure

DevOps promotes organizing around products or services:

                    CEO
        ┌────────────┼────────────┐
    Product A   Product B    Product C
        │             │             │
    [Dev, QA, Ops] [Dev, QA, Ops] [Dev, QA, Ops]

Characteristics:

  • Teams own their product end-to-end
  • Members from different functions collaborate daily
  • Aligned incentives around product success
  • Faster decision-making
  • Clear ownership and accountability

Benefits:

  • Reduced handoffs and waiting times
  • Faster feedback loops
  • Better understanding of customer needs
  • Improved quality through ownership
  • Higher team morale and autonomy

Matrix Structure (Hybrid)

Some organizations use a matrix structure where individuals report to both functional and product managers:

                    CEO
        ┌────────────┼────────────┐
    Development     QA        Operations
        │            │            │
      A B C        A B C        A B C
        └────────────┼────────────┘
        Product teams A, B, and C each
        draw one member from every function

Benefits:

  • Maintain functional expertise while enabling product focus
  • Flexible resource allocation
  • Career development within functions

Challenges:

  • Conflicting priorities (functional vs product goals)
  • Complex reporting relationships
  • Potential for confusion and politics

2.2 Conway's Law

Conway's Law, formulated by Melvin Conway in 1967, states:

"Organizations which design systems ... are constrained to produce designs which are copies of the communication structures of these organizations."

In simpler terms: Your system architecture will mirror your organizational structure.

Implications for DevOps:

  1. Communication Patterns Become Architecture:

    • If teams communicate through tickets, the system will have slow, bureaucratic interfaces
    • If teams can talk directly, the system can have tight integration
    • If teams are siloed, the system will have siloed components
  2. Inverse Conway Maneuver:

    • To achieve a desired architecture, reorganize teams to match it
    • Want microservices? Create small, autonomous teams
    • Want a platform? Create a platform team that treats other teams as customers
  3. Team Boundaries:

    • Teams should own complete, loosely-coupled components
    • APIs between teams should be clean and well-documented
    • Teams should be able to deploy independently

Practical Application:

When designing microservices architecture:

  • Identify bounded contexts (domain-driven design)
  • Form teams around these contexts
  • Ensure teams have all necessary skills (cross-functional)
  • Define clear APIs between team-owned services
  • Enable independent deployment per team

2.3 Psychological Safety

Psychological safety, a concept popularized by Harvard professor Amy Edmondson, is crucial for high-performing DevOps teams. It's defined as "a shared belief that the team is safe for interpersonal risk-taking."

Why It Matters in DevOps:

  1. Blameless Culture: When incidents occur, teams need to investigate without fear of punishment.

  2. Experimentation: DevOps requires trying new things; psychological safety enables this.

  3. Learning from Failure: Only in safe environments do people openly discuss mistakes.

  4. Speaking Up: Team members need to raise concerns about security, quality, or process issues.

  5. Innovation: New ideas emerge when people feel safe sharing half-formed thoughts.

Building Psychological Safety:

For Leaders:

  • Model vulnerability by admitting your own mistakes
  • Ask questions, don't provide all answers
  • Frame work as learning problems, not execution problems
  • Acknowledge your own fallibility
  • Actively invite input from quieter team members

For Teams:

  • Establish ground rules for discussion
  • No interrupting or dismissing ideas
  • Focus on systems, not people, when things go wrong
  • Celebrate learning from failures
  • Create anonymous feedback channels

For Individuals:

  • Ask for help when needed
  • Offer help to others
  • Share your mistakes and what you learned
  • Assume good intentions from others

Measuring Psychological Safety:

  • Do team members feel comfortable admitting mistakes?
  • Are dissenting opinions expressed and heard?
  • Do people ask for help without hesitation?
  • Is failure discussed as a learning opportunity?
  • Are there diverse perspectives in decision-making?

2.4 Blameless Postmortems

The blameless postmortem is a cornerstone of DevOps culture. After an incident, teams conduct a thorough analysis focused on understanding what happened and preventing recurrence—not on assigning blame.

Principles of Blameless Postmortems:

  1. Assume Good Intentions: Everyone was doing their best with the information they had.

  2. Focus on Systems, Not People: Human error is a symptom of system problems.

  3. Fix the Process, Not the Person: If a person could make a mistake, the system allowed it.

  4. Share Learnings Widely: Postmortems should be public within the organization.

  5. Actionable Improvements: Every postmortem should produce concrete action items.

The Postmortem Process:

Immediate Response (During Incident):

  • Focus on restoring service
  • Document actions and timestamps
  • Preserve evidence (logs, metrics)

Post-Incident Analysis (24-48 hours after):

  • Gather all participants
  • Timeline reconstruction
  • Root cause analysis (multiple contributing factors)
  • Identify what went well and what didn't

Writing the Postmortem:

A good postmortem includes:

  • Executive Summary: Brief overview for leadership
  • Incident Details: Date, duration, impact, severity
  • Timeline: Chronological sequence of events
  • Root Cause: Technical explanation of what failed
  • Contributing Factors: Why the conditions existed
  • Detection: How the incident was discovered
  • Response: How the team handled it
  • Lessons Learned: What we now know
  • Action Items: Specific, assigned tasks with due dates

Example Action Items:

  • "Add monitoring for database connection pool exhaustion"
  • "Update deployment documentation with rollback procedure"
  • "Implement automated testing for migration scripts"
  • "Add canary deployment for configuration changes"

Common Pitfalls:

  • Superficial Analysis: Stopping at "human error" instead of digging deeper
  • No Action Items: Learning without implementing improvements
  • Blaming Language: "He should have..." instead of "The system allowed..."
  • Keeping Secrets: Hiding postmortems from other teams
  • Punishing Honesty: Making people regret speaking openly

2.5 DevOps Leadership

DevOps transformations require leadership at all levels, but especially from those in formal leadership positions.

Characteristics of DevOps Leaders:

  1. Servant Leadership: Leaders exist to serve and enable their teams, not the other way around.

  2. Systems Thinkers: Leaders understand how parts of the organization interact.

  3. Change Agents: They actively work to improve culture and processes.

  4. Technical Empathy: They understand technical challenges and constraints.

  5. Coaching Mindset: They develop people, not just deliver projects.

  6. Bias for Action: They value progress over perfection.

  7. Long-term Perspective: They invest in capabilities, not just immediate results.

Leadership Responsibilities:

Creating Vision:

  • Articulate why DevOps matters
  • Define success metrics
  • Communicate the transformation journey
  • Align DevOps goals with business objectives

Removing Obstacles:

  • Eliminate bureaucratic barriers
  • Provide resources and tools
  • Resolve organizational conflicts
  • Shield teams from distractions

Modeling Behavior:

  • Demonstrate blameless culture
  • Show vulnerability
  • Learn in public
  • Celebrate learning from failure

Building Capability:

  • Invest in training and development
  • Create career paths
  • Hire for culture add
  • Develop internal expertise

Measuring Progress:

  • Track DORA metrics
  • Survey team morale
  • Monitor business outcomes
  • Adjust strategy based on data

Leadership Anti-patterns:

  • Command and Control: Dictating solutions instead of enabling teams
  • Short-term Focus: Prioritizing immediate features over long-term capabilities
  • Inconsistent Messaging: Saying one thing but rewarding another
  • Fear-based Management: Using metrics to punish instead of improve
  • Hollow Empowerment: Saying "you're empowered" but overriding decisions

2.6 Change Management

DevOps transforms how organizations approach change—from rigid, approval-based processes to automated, verified, and continuous flows.

Traditional Change Management:

  • Change Advisory Board (CAB) approves all changes
  • Weekly or bi-weekly meetings
  • Paperwork-heavy requests
  • Focus on risk avoidance
  • Slow, batch-oriented

DevOps Change Management:

  • Automated validation and testing
  • Peer review through code review
  • Gradual rollout with monitoring
  • Fast rollback capability
  • Focus on risk management
  • Continuous, small changes

Key Principles:

  1. Changes Should Be Small: Small changes are easier to review, test, and roll back.

  2. Automate Where Possible: Automated testing replaces manual approval for many changes.

  3. Verification Over Approval: Prove changes work through testing rather than seeking permission.

  4. Gradual Exposure: Roll out changes progressively, monitoring impact.

  5. Emergency Changes Are Rare: If you need frequent emergency changes, your process is broken.

The Change Management Spectrum:

Type             Traditional               DevOps
Infrastructure   CAB approval              Terraform + automated testing
Application      Release manager           CI/CD pipeline + canary
Configuration    Ticket + manual           Git push + automated
Security         Pen test before release   Continuous scanning

When CAB Still Makes Sense:

  • Regulatory compliance requirements
  • Financial systems with audit mandates
  • Changes with no rollback option
  • External customer commitments
  • Initial transformation phase

2.7 DevOps Metrics for Management

Measuring DevOps success requires moving beyond traditional IT metrics.

DORA Metrics (Four Key Metrics):

The State of DevOps Reports, produced by DORA (DevOps Research and Assessment), identified four key metrics that predict organizational performance:

  1. Deployment Frequency: How often an organization successfully releases to production

    • Elite: Multiple deploys per day
    • High: Weekly to monthly
    • Medium: Monthly to every 6 months
    • Low: Fewer than once every six months
  2. Lead Time for Changes: The time from code commit to code successfully running in production

    • Elite: Less than one hour
    • High: One day to one week
    • Medium: One week to one month
    • Low: One month to six months
  3. Mean Time to Recovery (MTTR): How long it takes to restore service after an incident

    • Elite: Less than one hour
    • High: Less than one day
    • Medium: One day to one week
    • Low: One week to one month
  4. Change Failure Rate: The percentage of changes that result in degraded service

    • Elite: 0-15%
    • High, Medium, Low: 16-30% (the DORA research found no meaningful separation below the elite cohort)
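Two of the four metrics can be approximated directly from a deployment log. A minimal shell sketch, assuming a hypothetical deploys.csv with one "date,status" line per production deploy (the format is an illustration, not a standard):

```shell
# Fabricated deploy log: date,status (one line per production deploy)
cat > deploys.csv <<'EOF'
2024-01-02,success
2024-01-03,success
2024-01-03,failure
2024-01-05,success
2024-01-08,success
EOF

# Deployment frequency: number of distinct deploy days in the log window
days=$(cut -d, -f1 deploys.csv | sort -u | wc -l)

# Change failure rate: failed deploys as a percentage of all deploys
cfr=$(awk -F, '{t++; if ($2 == "failure") f++} END {printf "%d", f * 100 / t}' deploys.csv)

echo "deploy days: $days, change failure rate: ${cfr}%"
rm -f deploys.csv
```

Lead time and MTTR need commit and incident timestamps as well, so they usually come from the CI/CD system and the incident tracker rather than a single log.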

Additional Metrics:

Flow Metrics:

  • Deployment size (smaller is better)
  • Batch size (smaller is better)
  • Wait times between stages
  • Work in progress (WIP) limits

Quality Metrics:

  • Defect escape rate (bugs found in production)
  • Test coverage
  • Mean time to detection (MTTD)
  • Mean time between failures (MTBF)

Business Metrics:

  • Time to market for new features
  • Customer satisfaction (CSAT/NPS)
  • Revenue per employee
  • Feature adoption rate

Team Health Metrics:

  • Employee Net Promoter Score (eNPS)
  • Turnover rate
  • Burnout indicators
  • Learning and development hours

Metrics Anti-patterns:

  • Vanity Metrics: Numbers that look good but don't indicate real performance
  • Gaming the System: Optimizing metrics at the expense of actual outcomes
  • Comparing Teams: Using metrics to rank teams creates unhealthy competition
  • No Context: Metrics without understanding the underlying context
  • Measuring Everything: Analysis paralysis from too many metrics

2.8 Building High-Performance Teams

High-performing DevOps teams share common characteristics and practices.

Characteristics:

  1. Cross-functional Composition:

    • All skills needed to deliver value
    • No external dependencies for common tasks
    • T-shaped skills (deep in one area, broad in others)
  2. Clear Ownership:

    • End-to-end responsibility
    • Clear boundaries between teams
    • "You build it, you run it" mentality
  3. Autonomy with Alignment:

    • Freedom to choose how to achieve goals
    • Alignment on what goals matter
    • Guardrails, not gates
  4. Psychological Safety:

    • Safe to take risks
    • Open communication
    • Learning culture
  5. Continuous Improvement:

    • Regular retrospectives
    • Time for improvement work
    • Blameless problem-solving

Building Practices:

Team Formation:

  • Start with clear mission and boundaries
  • Include all necessary roles
  • Define success metrics together
  • Establish team norms and working agreements

Onboarding:

  • Structured mentorship program
  • Pair programming with experienced team members
  • Gradual responsibility increase
  • Documentation and learning resources

Team Rituals:

  • Daily stand-up (15 minutes max)
  • Regular planning sessions
  • Retrospectives (blameless, action-oriented)
  • Demo days or show-and-tell
  • Social activities

Knowledge Management:

  • Living documentation
  • Code comments and READMEs
  • Architecture decision records (ADRs)
  • Brown bag lunches
  • Internal tech talks

Career Development:

  • Individual growth plans
  • Technical and management tracks
  • Conference attendance and speaking
  • Internal mobility opportunities
  • Mentoring programs

2.9 InnerSource Model

InnerSource applies open source software development practices to internal software development.

What is InnerSource?

InnerSource takes the lessons learned from open source development (transparency, collaboration, meritocracy) and applies them within the corporate firewall. It enables developers from different teams to contribute to each other's codebases.

Core Principles:

  1. Open by Default: Code is visible to everyone in the organization.

  2. Voluntary Participation: Contributors choose what to work on.

  3. Meritocracy: Influence comes from contribution quality, not position.

  4. Asynchronous Collaboration: Work happens across time zones without constant coordination.

  5. Community Over Committee: Decisions emerge from community practice.

Benefits:

  • Reduced Duplication: Teams can reuse and improve existing code
  • Cross-team Collaboration: Breaking down silos organically
  • Skill Development: Developers learn from diverse codebases
  • Faster Innovation: More contributors finding and fixing problems
  • Standardization: Natural emergence of best practices

InnerSource Roles:

  • Trusted Committers: Maintainers who review and merge contributions
  • Contributors: Developers submitting improvements
  • Product Owners: Define direction and priorities
  • Users: Teams that depend on the code

InnerSource Workflow:

  1. Discover: Find a project to contribute to
  2. Understand: Read documentation and code
  3. Discuss: Open an issue or discussion
  4. Develop: Create your changes
  5. Submit: Open a pull request
  6. Review: Work with maintainers on feedback
  7. Merge: Code is accepted and deployed
  8. Celebrate: Recognition for contribution

Implementing InnerSource:

Start Small:

  • Choose one or two foundational projects
  • Document contribution guidelines clearly
  • Make it easy to find and build projects
  • Recognize and reward contributions

Infrastructure Needs:

  • Internal code hosting (GitHub Enterprise, GitLab)
  • CI/CD that works for external contributors
  • Clear documentation and onboarding
  • Communication channels (Slack, mailing lists)

Cultural Requirements:

  • Leadership support for cross-team work
  • Time allocated for contributing to other teams
  • Recognition for contributions
  • Trust that teams will make good decisions

2.10 DevOps Transformation Roadmap

Transforming to DevOps is a journey, not a destination. Here's a structured approach.

Phase 1: Foundation (3-6 months)

Goals:

  • Build awareness and understanding
  • Secure leadership buy-in
  • Identify pilot teams and projects
  • Establish basic metrics

Activities:

  • Executive workshops on DevOps principles
  • Assess current state and pain points
  • Form a DevOps Center of Excellence (optional)
  • Train pilot teams on DevOps basics
  • Implement version control for everything

Success Criteria:

  • Leadership alignment on transformation goals
  • Pilot teams identified and trained
  • Baseline metrics established
  • Initial version control adoption

Phase 2: Pilot (6-12 months)

Goals:

  • Demonstrate success with pilot teams
  • Build reusable patterns and practices
  • Develop internal expertise
  • Create momentum for broader adoption

Activities:

  • Implement CI/CD for pilot applications
  • Automate infrastructure provisioning
  • Establish monitoring and alerting
  • Conduct blameless postmortems
  • Document patterns and practices
  • Share successes across organization

Success Criteria:

  • Measurable improvements in DORA metrics for pilots
  • Repeatable patterns documented
  • Internal champions developed
  • Interest from other teams

Phase 3: Expand (12-24 months)

Goals:

  • Scale practices across organization
  • Standardize tools and platforms
  • Build internal platform/self-service capabilities
  • Embed DevOps in organizational processes

Activities:

  • Train all teams on DevOps practices
  • Implement standard toolchain
  • Build Internal Developer Platform
  • Update HR processes (hiring, reviews)
  • Integrate security (DevSecOps)
  • Establish communities of practice

Success Criteria:

  • Organization-wide adoption of core practices
  • Self-service platform available
  • Security integrated in pipelines
  • DevOps competencies in job descriptions

Phase 4: Optimize (24+ months)

Goals:

  • Continuous improvement culture
  • Experimentation and innovation
  • Industry leadership
  • Platform evolution

Activities:

  • Advanced practices (chaos engineering, SRE)
  • Machine learning for operations
  • Open source contributions
  • Publish case studies and speak at conferences
  • Evolve platform based on feedback

Success Criteria:

  • Elite DORA performance
  • Industry recognition
  • Attract and retain top talent
  • Business outcomes clearly linked to DevOps

Critical Success Factors:

  1. Leadership Commitment: Transformation requires sustained executive support
  2. Patience: Culture change takes years, not months
  3. Focus on Value: Always connect DevOps work to business outcomes
  4. Celebrate Wins: Recognize and share successes
  5. Learn from Failures: Treat setbacks as learning opportunities
  6. Stay Humble: There's always more to learn and improve

Chapter 3 — Linux & System Fundamentals for DevOps

3.1 Linux Architecture

Understanding Linux architecture is fundamental for any DevOps engineer. Linux powers the vast majority of servers, containers, and cloud infrastructure.

The Linux Kernel

The kernel is the core of the operating system, managing hardware resources and providing essential services:

Kernel Components:

  1. Process Scheduler (CPU Management):

    • Manages process execution
    • Implements scheduling policies (CFS - Completely Fair Scheduler)
    • Handles context switching
    • Manages CPU affinity and priorities
  2. Memory Manager:

    • Virtual memory management
    • Paging and swapping
    • Memory allocation (malloc/free)
    • Shared memory and memory mapping
    • Page cache for file I/O
  3. File System Manager:

    • Virtual File System (VFS) abstraction
    • Supports multiple file systems (ext4, XFS, Btrfs)
    • Inode management
    • File permissions and attributes
    • Journaling for reliability
  4. Network Stack:

    • Protocol implementations (TCP/IP, UDP)
    • Socket abstraction
    • Network device drivers
    • Firewall (netfilter/iptables/nftables)
    • Traffic control and QoS
  5. Device Drivers:

    • Interface with hardware devices
    • Character and block devices
    • USB, PCI, SCSI subsystems
    • Device model and sysfs
  6. Inter-process Communication (IPC):

    • Pipes and FIFOs
    • Message queues
    • Shared memory
    • Semaphores
    • Signals
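Of the IPC mechanisms above, named pipes (FIFOs) are the easiest to try from the shell. A minimal sketch in which a background writer and a foreground reader rendezvous through a FIFO:

```shell
# Create a FIFO in a throwaway directory
workdir=$(mktemp -d)
mkfifo "$workdir/chan"

# Writer in the background: open() blocks until a reader opens the FIFO
echo "ping" > "$workdir/chan" &

# Reader: blocks until data arrives, then both sides proceed
read -r msg < "$workdir/chan"
wait                                # reap the background writer

echo "received: $msg"
rm -r "$workdir"
```

The blocking open on both ends is the synchronization: neither process proceeds until the other has attached.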

User Space vs Kernel Space

Linux separates execution into two modes:

Kernel Space:

  • Runs in privileged mode
  • Direct hardware access
  • Memory protected from user space
  • Device drivers and core services

User Space:

  • Runs in unprivileged mode
  • Access to hardware only through kernel syscalls
  • Applications, libraries, and services
  • Isolated from other user processes

System Calls

User space programs request kernel services through system calls:

Application (user space)
        ↓
    Library call (glibc)
        ↓
    System call (int 0x80 / syscall)
        ↓
Kernel (kernel space)

Common system calls:

  • read(), write() - File I/O
  • fork(), exec() - Process creation
  • socket(), connect() - Network
  • mmap() - Memory mapping
  • open(), close() - File operations

File System Hierarchy

Linux follows the Filesystem Hierarchy Standard (FHS):

/ (root)
├── bin - Essential user binaries
├── boot - Boot loader files
├── dev - Device files
├── etc - System configuration
├── home - User home directories
├── lib - Essential shared libraries
├── media - Mount points for removable media
├── mnt - Temporarily mounted filesystems
├── opt - Optional application software
├── proc - Virtual filesystem for process info
├── root - Root user home
├── sbin - System binaries
├── sys - Virtual filesystem for system info
├── tmp - Temporary files
├── usr - User utilities and applications
│   ├── bin - User binaries
│   ├── lib - Libraries
│   ├── local - Locally installed software
│   └── share - Architecture-independent data
└── var - Variable data
    ├── log - Log files
    ├── mail - Mail spool
    └── tmp - Temporary files preserved across reboots

3.2 Process Management

Processes are the running instances of programs. Understanding process management is crucial for debugging and performance tuning.

Process States

A process can be in one of several states:

R (Running/Runnable): Process is executing or ready to execute
S (Sleeping): Waiting for an event (interruptible)
D (Uninterruptible Sleep): Waiting for I/O (usually disk)
T (Stopped): Stopped by job control signal
Z (Zombie): Terminated but not yet reaped by parent

Process Lifecycle:

  1. Creation: fork() creates a copy of parent, exec() loads new program
  2. Ready: Process is ready to run and waiting for CPU
  3. Running: Process is executing on CPU
  4. Waiting: Process waiting for I/O or event
  5. Terminated: Process finished execution
  6. Zombie: Waiting for parent to read exit status

Process Attributes:

  • PID (Process ID): Unique identifier
  • PPID (Parent PID): ID of parent process
  • UID/EUID: User ID and effective user ID
  • GID/EGID: Group ID and effective group ID
  • Priority/Nice value: Scheduling priority
  • Environment variables: Process environment
  • File descriptors: Open files and sockets

Process Management Commands:

Viewing Processes:

ps aux                    # All processes with details
ps -ef                    # Full format listing
top                       # Interactive process viewer
htop                      # Enhanced interactive viewer
pstree                    # Process tree
pgrep sshd                # Find PIDs by name

Process Control:

kill -TERM <PID>          # Terminate gracefully
kill -KILL <PID>          # Force kill
kill -STOP <PID>          # Suspend process
kill -CONT <PID>          # Resume process
nice -n 10 command        # Start with lower priority
renice 10 <PID>           # Change priority of running process

Background/Foreground:

command &                  # Run in background
Ctrl+Z                     # Suspend foreground job
jobs                       # List background jobs
bg %1                      # Resume job in background
fg %1                      # Bring job to foreground

Process Limits:

View and modify process limits with ulimit:

ulimit -a                  # Show all limits
ulimit -n 65536            # Max open files
ulimit -u 100              # Max user processes

Important limits:

  • nofile: Maximum open file descriptors
  • nproc: Maximum user processes
  • stack: Stack size
  • core: Core file size
  • memlock: Max locked-in-memory address space
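A common preflight pattern is to verify limits before starting a service, since exhausting nofile at runtime fails in confusing ways. A minimal sketch (the threshold is a hypothetical value, not a recommendation):

```shell
# Verify the soft open-files limit meets an assumed application minimum
required=64                         # hypothetical minimum for the app
current=$(ulimit -n)

if [ "$current" = "unlimited" ] || [ "$current" -ge "$required" ]; then
    status="ok"
else
    status="too low"
fi
echo "nofile=$current ($status)"
```

In a real startup script the "too low" branch would exit nonzero so the service manager reports the failure instead of the application crashing later.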

3.3 File Systems

Linux supports multiple file systems and provides a unified interface through the Virtual File System (VFS).

Common File Systems:

ext4 (Fourth Extended Filesystem):

  • Default for many Linux distributions
  • Journaling for reliability
  • Supports large files (up to 16TB) and volumes (up to 1EB)
  • Backward compatible with ext2/ext3

XFS:

  • High-performance, scalable
  • Excellent for large files and parallel I/O
  • Online defragmentation and resizing
  • Common for media and data-intensive applications

Btrfs (B-tree Filesystem):

  • Copy-on-write (COW) architecture
  • Built-in snapshots and rollback
  • Subvolumes and quotas
  • RAID support integrated
  • Checksums on data and metadata

ZFS (on Linux via OpenZFS):

  • Combined file system and volume manager
  • Data integrity with checksums
  • Snapshots, clones, and replication
  • Compression and deduplication
  • Originally from Solaris, now available on Linux

tmpfs:

  • Temporary file system in RAM
  • Fast but volatile
  • Mounted at /tmp, /run, /dev/shm

procfs and sysfs:

  • Virtual file systems for kernel interfaces
  • /proc: Process and system information
  • /sys: Device and kernel parameters

File System Operations:

Mounting and Unmounting:

mount /dev/sda1 /mnt/data        # Mount filesystem
umount /mnt/data                  # Unmount
mount -a                          # Mount all in fstab
findmnt                           # Show mount tree
df -h                             # Disk usage of mounted filesystems

Creating File Systems:

mkfs.ext4 /dev/sdb1              # Create ext4 filesystem
mkfs.xfs /dev/sdc1                # Create XFS filesystem
mkfs.btrfs /dev/sdd1              # Create Btrfs filesystem

Checking and Repairing:

fsck /dev/sda1                    # Check and repair
xfs_repair /dev/sdb1              # XFS repair
btrfs check /dev/sdc1             # Btrfs check

File System Tuning:

tune2fs -l /dev/sda1              # View ext4 parameters
xfs_info /dev/sdb1                 # View XFS parameters
btrfs filesystem show              # Show Btrfs info

Inodes and Directory Structure:

  • Inode: Metadata structure for files (permissions, ownership, timestamps, pointers to data blocks)
  • Directory: Mapping of filenames to inodes
  • Hard links: Multiple filenames pointing to same inode
  • Symbolic links: Special files pointing to other filenames
ls -i                             # Show inode numbers
stat file.txt                     # Show inode details
ln file.txt hardlink              # Create hard link
ln -s file.txt symlink            # Create symbolic link
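The difference between the two link types can be verified by comparing inode numbers: a hard link shares the original file's inode, while a symbolic link gets its own. A minimal sketch:

```shell
# Create a file, a hard link, and a symlink in a throwaway directory
workdir=$(mktemp -d)
echo "data" > "$workdir/file.txt"
ln "$workdir/file.txt" "$workdir/hardlink"
ln -s file.txt "$workdir/symlink"

# First column of ls -i is the inode number
ino_file=$(ls -i "$workdir/file.txt" | awk '{print $1}')
ino_hard=$(ls -i "$workdir/hardlink" | awk '{print $1}')
ino_sym=$(ls -i "$workdir/symlink"  | awk '{print $1}')

echo "file=$ino_file hardlink=$ino_hard symlink=$ino_sym"
rm -r "$workdir"
```

This is also why deleting the original file leaves a hard link working (the inode survives until its link count reaches zero) but leaves a symlink dangling.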

3.4 Networking Basics

Networking is fundamental to distributed systems. DevOps engineers must understand Linux networking deeply.

Network Stack Overview:

Application Layer (HTTP, DNS, SSH)
    ↓
Transport Layer (TCP, UDP)
    ↓
Network Layer (IP, ICMP)
    ↓
Link Layer (Ethernet, WiFi)
    ↓
Physical Hardware

Network Configuration:

Network Interfaces:

ip link                          # List network interfaces
ip addr show                     # Show IP addresses
ip route show                    # Show routing table
ethtool eth0                     # Show interface details
ss -tulpn                        # Show listening sockets

Interface Configuration (Netplan/ifupdown):

Modern Linux uses Netplan (Ubuntu) or NetworkManager:

# /etc/netplan/01-netcfg.yaml
network:
  version: 2
  ethernets:
    eth0:
      addresses:
        - 192.168.1.100/24
      routes:
        - to: default
          via: 192.168.1.1
      nameservers:
        addresses: [8.8.8.8, 8.8.4.4]

Network Namespaces:

Network namespaces provide isolated network stacks:

ip netns add red                 # Create namespace
ip netns exec red bash           # Run shell in namespace
ip link add veth0 type veth peer name veth1  # Virtual ethernet pair
ip link set veth0 netns red      # Move interface to namespace

Socket Programming Concepts:

  • Socket: Endpoint for communication
  • Port: 16-bit number identifying service
  • TCP: Connection-oriented, reliable, ordered
  • UDP: Connectionless, unreliable, unordered
  • UNIX domain sockets: IPC on same host

Common Network Services:

DNS (Domain Name System):

cat /etc/resolv.conf             # DNS configuration
dig example.com                  # DNS lookup
nslookup example.com              # Alternative lookup
host example.com                  # Simple lookup

HTTP/HTTPS:

curl -I https://example.com       # Fetch HTTP headers
wget https://example.com/file     # Download file
nc -v example.com 80              # Test TCP connection

Network Diagnostics:

ping -c 4 example.com             # Test connectivity
traceroute example.com             # Trace network path
mtr example.com                    # Combined ping+traceroute
ss -tulpn                         # Socket statistics
netstat -an                        # Network statistics (older)
tcpdump -i eth0 port 80           # Capture packets
nmap -p 1-1000 example.com         # Port scanning

Firewall with iptables/nftables:

iptables (legacy):

iptables -L                        # List rules
iptables -A INPUT -p tcp --dport 22 -j ACCEPT  # Allow SSH
iptables -A INPUT -j DROP          # Drop everything else
iptables-save > rules.txt          # Save rules

nftables (modern):

nft list ruleset                   # List all rules
nft add table inet filter          # Create table
nft add chain inet filter input { type filter hook input priority 0\; }
nft add rule inet filter input tcp dport 22 accept

3.5 Shell Scripting (Bash)

Shell scripting automates repetitive tasks and is essential for DevOps.

Bash Basics:

Shebang and Execution:

#!/bin/bash
# This is a comment

echo "Hello, World!"

Variables:

name="John"
echo "Hello, $name"
readonly constant="cannot change"
export ENV_VAR="visible to child processes"

Arrays:

fruits=("apple" "banana" "orange")
echo ${fruits[0]}                  # First element
echo ${fruits[@]}                   # All elements
echo ${#fruits[@]}                  # Array length

Conditionals:

if [ "$name" == "John" ]; then
    echo "Hello John"
elif [ "$name" == "Jane" ]; then
    echo "Hello Jane"
else
    echo "Hello stranger"
fi

# File tests
if [ -f "$file" ]; then            # File exists
if [ -d "$dir" ]; then              # Directory exists
if [ -x "$executable" ]; then       # Is executable

Loops:

# For loop
for i in {1..5}; do
    echo "Number $i"
done

# While loop
count=1
while [ $count -le 5 ]; do
    echo "Count $count"
    ((count++))
done

# Reading lines
while IFS= read -r line; do
    echo "Line: $line"
done < file.txt

Functions:

greet() {
    local name="$1"                 # Local variable
    echo "Hello, $name"
    return 0                        # Return status
}

greet "World"

Error Handling:

set -e                              # Exit on error
set -u                              # Exit on undefined variable
set -o pipefail                     # Pipe fails if any command fails

trap 'cleanup' EXIT                  # Run on exit
trap 'echo "Interrupted"; exit' INT  # Handle Ctrl+C
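Putting these together, a sketch of a script skeleton that combines strict mode with a cleanup trap, so temporary files are removed on success, failure, or interruption alike:

```shell
set -euo pipefail                   # strict mode: fail fast and loudly

tmpfile=$(mktemp)
cleanup() {
    rm -f "$tmpfile"                # runs on every exit path
    echo "cleaned up"
}
trap cleanup EXIT

echo "working" > "$tmpfile"
result=$(cat "$tmpfile")
echo "result: $result"
```

Because the trap is registered on EXIT rather than on individual signals, cleanup also runs when set -e aborts the script partway through.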

Practical DevOps Scripts:

Backup Script:

#!/bin/bash
set -euo pipefail

BACKUP_DIR="/backup/$(date +%Y%m%d)"
SOURCE_DIR="/data"

mkdir -p "$BACKUP_DIR"
tar -czf "$BACKUP_DIR/backup.tar.gz" "$SOURCE_DIR"

# Rotate old backups (keep 7 days); -mindepth protects /backup itself
find /backup -mindepth 1 -maxdepth 1 -type d -mtime +7 -exec rm -rf {} +

Health Check Script:

#!/bin/bash
check_service() {
    local host="$1"
    local port="$2"
    timeout 1 bash -c "echo >/dev/tcp/$host/$port" 2>/dev/null
    return $?
}

if check_service "localhost" 8080; then
    echo "Service is up"
else
    echo "Service is down"
    exit 1
fi

Deployment Script:

#!/bin/bash
set -e

VERSION="$1"
if [ -z "$VERSION" ]; then
    echo "Usage: $0 <version>"
    exit 1
fi

echo "Deploying version $VERSION"
./run_tests.sh
./build.sh "$VERSION"
scp "build/app-$VERSION" server:/apps/current
ssh server systemctl restart myapp

3.6 Systemd

Systemd is the init system and service manager for most modern Linux distributions.

Core Concepts:

  • Units: Resources managed by systemd (services, sockets, mounts, etc.)
  • Targets: Groups of units (like runlevels)
  • Journal: Centralized logging system

Unit Types:

  • .service: System services
  • .socket: IPC or network sockets
  • .device: Device files
  • .mount: Filesystem mount points
  • .timer: Scheduled tasks (cron replacement)
  • .target: Group of units

Service Unit Example:

# /etc/systemd/system/myapp.service
[Unit]
Description=My Application
After=network.target
Wants=redis.service
Requires=mongodb.service

[Service]
Type=simple
User=myapp
Group=myapp
WorkingDirectory=/opt/myapp
ExecStart=/usr/bin/node /opt/myapp/app.js
ExecReload=/bin/kill -HUP $MAINPID
Restart=always
RestartSec=10
Environment=NODE_ENV=production
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target

Common Commands:

systemctl start myapp             # Start service
systemctl stop myapp               # Stop service
systemctl restart myapp             # Restart service
systemctl reload myapp              # Reload configuration
systemctl status myapp              # Show status
systemctl enable myapp              # Enable at boot
systemctl disable myapp             # Disable at boot
systemctl daemon-reload             # Reload unit files

Journald (Logging):

journalctl -u myapp                # Show logs for service
journalctl -f                       # Follow logs
journalctl --since "1 hour ago"     # Time-based filter
journalctl -p err                    # Show only errors
journalctl _PID=1234                 # Filter by PID

Timer Units (Cron Replacement):

# /etc/systemd/system/backup.timer
[Unit]
Description=Daily backup timer

[Timer]
OnCalendar=daily
Persistent=true

[Install]
WantedBy=timers.target

# /etc/systemd/system/backup.service
[Unit]
Description=Daily backup

[Service]
Type=oneshot
ExecStart=/usr/local/bin/backup.sh

3.7 Package Management

Linux distributions use package managers to install, update, and remove software.

Debian/Ubuntu (apt/dpkg):

# Update package lists
apt update

# Upgrade all packages
apt upgrade

# Install package
apt install nginx

# Remove package
apt remove nginx

# Search packages
apt search nginx

# Show package info
apt show nginx

# List installed
dpkg -l

# Find which package owns a file
dpkg -S /etc/nginx/nginx.conf

Red Hat/CentOS/Fedora (yum/dnf):

# Update package lists
yum check-update

# Upgrade packages
yum update

# Install package
yum install nginx

# Remove package
yum remove nginx

# Search
yum search nginx

# Show info
yum info nginx

# List installed
rpm -qa

# Find package owner
rpm -qf /etc/nginx/nginx.conf

Building from Source:

Sometimes packages aren't available and you need to compile:

wget https://example.com/software.tar.gz
tar -xzf software.tar.gz
cd software
./configure --prefix=/usr/local
make
make install

3.8 Performance Monitoring

Performance monitoring helps identify bottlenecks and capacity issues.

CPU Monitoring:

top                             # Real-time process view
htop                            # Enhanced top
mpstat -P ALL 1                 # Per-CPU statistics
vmstat 1                        # System statistics
uptime                          # Load average
cat /proc/cpuinfo               # CPU information

Memory Monitoring:

free -h                         # Memory usage
vmstat 1                        # Virtual memory stats
cat /proc/meminfo               # Detailed memory info
smem                            # Memory per process

Disk I/O Monitoring:

iostat -x 1                     # Extended disk statistics
iotop                           # I/O per process
df -h                           # Filesystem usage
du -sh *                        # Directory sizes

Network Monitoring:

iftop                           # Network traffic by host
nethogs                         # Traffic by process
ss -tulpn                       # Socket statistics
sar -n DEV 1                    # Network statistics

System Performance Tuning:

Kernel Parameters (/etc/sysctl.conf):

# Increase network buffers
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216

# TCP tuning
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216

# File system
fs.file-max = 2097152

# Virtual memory
vm.swappiness = 10
vm.dirty_ratio = 40

Process Limits (/etc/security/limits.conf):

* soft nofile 65536
* hard nofile 65536
* soft nproc unlimited
* hard nproc unlimited

3.9 Log Management

Logs are crucial for troubleshooting and monitoring.

System Logs:

  • /var/log/syslog or /var/log/messages: General system logs
  • /var/log/auth.log: Authentication logs
  • /var/log/kern.log: Kernel messages
  • /var/log/dmesg: Boot messages
  • /var/log/nginx/: Nginx logs
  • /var/log/mysql/: MySQL logs

Log Rotation (logrotate):

# /etc/logrotate.d/nginx
/var/log/nginx/*.log {
    daily
    missingok
    rotate 14
    compress
    delaycompress
    notifempty
    create 0640 nginx adm
    sharedscripts
    postrotate
        [ -f /var/run/nginx.pid ] && kill -USR1 `cat /var/run/nginx.pid`
    endscript
}

Centralized Logging with rsyslog:

# /etc/rsyslog.conf
*.* @logserver.example.com:514    # Send all logs to remote server

Log Analysis:

# Count errors
grep -c "ERROR" app.log

# Tail with filtering
tail -f app.log | grep ERROR

# Find unique IPs
awk '{print $1}' access.log | sort | uniq -c | sort -nr

# Time-based analysis
grep "$(date +%Y-%m-%d)" app.log
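As a worked example, the same awk pipeline applied to a fabricated access log in combined log format (the log lines are invented for illustration):

```shell
# Fabricated access log in combined format
cat > access.log <<'EOF'
10.0.0.1 - - [23/Feb/2026:10:00:01 +0000] "GET / HTTP/1.1" 200 512
10.0.0.2 - - [23/Feb/2026:10:00:02 +0000] "GET /login HTTP/1.1" 500 128
10.0.0.1 - - [23/Feb/2026:10:00:03 +0000] "GET /api HTTP/1.1" 200 256
EOF

# Field 1 is the client IP; count occurrences, then sort by frequency
top_ip=$(awk '{print $1}' access.log | sort | uniq -c | sort -nr | head -1 | awk '{print $2}')

# Field 9 is the HTTP status; count server-side errors (5xx)
errors=$(awk '$9 >= 500 {n++} END {print n + 0}' access.log)

echo "busiest client: $top_ip, server errors: $errors"
rm -f access.log
```

The same field-counting idiom scales to gigabyte logs, which is why awk/sort/uniq pipelines remain the first tool reached for before a log is shipped to a centralized system.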

3.10 Hardening Linux Servers

Security is critical for production systems.

User and Access Management:

# Remove unnecessary users
userdel -r username

# Disable root SSH login
# In /etc/ssh/sshd_config:
# PermitRootLogin no

# Use SSH keys only
# PasswordAuthentication no

# Implement sudo with care
visudo

File Permissions:

# Secure sensitive files
chmod 600 /etc/shadow
chmod 644 /etc/passwd
chmod 600 /etc/ssh/sshd_config

# Set proper ownership
chown root:root /etc/passwd

Network Security:

# Basic firewall
ufw default deny incoming
ufw default allow outgoing
ufw allow ssh
ufw enable

# Disable unused services
systemctl disable bluetooth
systemctl disable cups

# Secure sysctl settings
# /etc/sysctl.d/99-security.conf
net.ipv4.conf.all.rp_filter = 1
net.ipv4.conf.all.accept_source_route = 0
net.ipv4.tcp_syncookies = 1

Filesystem Security:

# Mount options in /etc/fstab
# /dev/sda1 /home ext4 defaults,noexec,nosuid 0 2
# /tmp tmpfs tmpfs defaults,noexec,nosuid,nodev 0 0

Auditing and Monitoring:

# Install and configure auditd
auditctl -w /etc/passwd -p wa -k passwd_changes
auditctl -w /etc/shadow -p wa -k shadow_changes

# Check for unusual activity
lastb                           # Failed login attempts
last                            # Last logins
journalctl -u ssh                # SSH logs

Automatic Security Updates:

# Ubuntu/Debian
apt install unattended-upgrades
dpkg-reconfigure -plow unattended-upgrades

# Red Hat/CentOS
yum install yum-cron
systemctl enable yum-cron

Security Tools:

  • Lynis: Security auditing tool
  • ClamAV: Antivirus
  • rkhunter: Rootkit hunter
  • chkrootkit: Rootkit detector
  • fail2ban: Brute force protection

PART II — VERSION CONTROL & COLLABORATION

Chapter 4 — Git Internals & Advanced Workflows

4.1 Git Architecture (Objects, Trees, Commits)

Understanding Git's internal architecture demystifies its behavior and enables advanced usage.

The Object Database

Git is fundamentally a content-addressable filesystem with a VCS interface. Everything is stored as objects in the .git/objects directory.

Object Types:

  1. Blob: File contents (binary large object)
  2. Tree: Directory listings (filenames + permissions + blob references)
  3. Commit: Snapshot metadata (tree hash, parent, author, message)
  4. Tag: Named reference to a commit (optionally signed)

Object Storage:

Each object is identified by a SHA-1 hash of its content:

echo 'hello world' | git hash-object --stdin
# 3b18e512dba79e4c8300dd08aeb37f8e728b8dad

Objects are stored zlib-compressed under .git/objects/, with the first two characters of the hash as a directory name: .git/objects/3b/18e512dba79e4c8300dd08aeb37f8e728b8dad
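
The object database can be queried directly with the cat-file plumbing command. A self-contained sketch (the demo directory name is illustrative):

```shell
# demo: write a blob into a fresh repository, then inspect it
git init -q objdemo && cd objdemo

hash=$(echo 'hello world' | git hash-object -w --stdin)   # -w writes the object
git cat-file -t "$hash"    # blob
git cat-file -p "$hash"    # hello world
git cat-file -s "$hash"    # 12 (bytes, including the trailing newline)
```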

The Commit Graph

commit (hash: a1b2c3)
tree: d4e5f6
parent: f7e8d9 (previous commit)
author: John <john@example.com>
committer: John <john@example.com>
message: Add feature X
    ↓
tree (hash: d4e5f6)
    blob: 1a2b3c (README.md)
    blob: 4d5e6f (main.py)
    tree: 7a8b9c (lib/)
        blob: 0d1e2f (lib/utils.py)

References (Refs)

Refs are pointers to commits, stored in .git/refs/:

  • heads/: Local branches
  • remotes/: Remote tracking branches
  • tags/: Tags

HEAD is a special ref pointing to current branch or commit.

cat .git/HEAD
# ref: refs/heads/main

cat .git/refs/heads/main
# a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0

The Index (Staging Area)

The index is a binary file (.git/index) that represents the next commit. It's a sorted list of path names with blob hashes and file metadata.
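
The index contents can be dumped with git ls-files --stage. A minimal sketch (the demo directory and file names are illustrative):

```shell
# demo: stage a file, then dump the index
git init -q indexdemo && cd indexdemo
echo 'hello world' > hello.txt
git add hello.txt

# each entry: <mode> <blob hash> <stage number> <path>
git ls-files --stage
```

The stage number is 0 for normal entries; stages 1-3 appear only during merge conflicts.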

Plumbing vs Porcelain

Git commands are categorized as:

  • Porcelain: User-friendly commands (git add, git commit)
  • Plumbing: Low-level commands for scripting (git hash-object, git update-index)

Low-level Examples:

# Create blob
echo 'content' | git hash-object -w --stdin

# Create tree
git update-index --add --cacheinfo 100644 \
  $(git hash-object -w file.txt) file.txt
git write-tree

# Create commit
echo 'message' | git commit-tree TREE_HASH -p PARENT_HASH
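
Chained together, the plumbing commands above produce a complete commit with no porcelain at all. A self-contained sketch (demo names are illustrative):

```shell
# demo: build a commit from raw objects
git init -q plumbingdemo && cd plumbingdemo
git config user.name "Demo" && git config user.email "demo@example.com"

blob=$(echo 'content' | git hash-object -w --stdin)         # store file contents
git update-index --add --cacheinfo 100644 "$blob" file.txt  # stage it in the index
tree=$(git write-tree)                                      # snapshot the index as a tree
commit=$(echo 'initial commit' | git commit-tree "$tree")   # wrap the tree in a commit
git update-ref refs/heads/main "$commit"                    # point a branch at it

git log --oneline main    # <short hash> initial commit
```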

4.2 Branching Strategies

Branching strategies define how teams use branches for development.

Git Flow

Classic branching model by Vincent Driessen:

main (production)
  ↑
release/1.0 (staging)
  ↑
develop (integration)
  ↑
feature/new-feature (development)

Branches:

  • main: Production-ready code
  • develop: Integration branch
  • feature/*: New features (branch from develop)
  • release/*: Release preparation (branch from develop, merge to main and develop)
  • hotfix/*: Emergency fixes (branch from main, merge to main and develop)

Pros:

  • Clear structure
  • Works well for versioned releases
  • Good for larger teams

Cons:

  • Complex
  • Overkill for continuous delivery
  • Many branches to maintain

GitHub Flow

Simpler flow used by GitHub:

main (always deployable)
  ↑
feature/* → Pull Request → main

Principles:

  • main is always deployable
  • Create feature branches for changes
  • Open pull requests for review
  • Merge and deploy immediately

Pros:

  • Simple
  • Works with CI/CD
  • Continuous deployment friendly

Cons:

  • Less structure for releases
  • Can be chaotic with many changes

GitLab Flow

GitLab's hybrid approach:

production (or environment branches)
  ↑
pre-production
  ↑
main
  ↑
feature/*

Environment Branches:

  • production: Deployed to production
  • staging: Deployed to staging
  • main: Integration branch

Pros:

  • Environment-specific branches
  • Works well with deployment pipelines
  • Clear promotion path

Trunk-Based Development

All developers work on short-lived branches from main:

main ←─── short branch ───┐
     └─── short branch ───┤
      └─── short branch ──┤

Rules:

  • Branches live < 1 day
  • Small, frequent commits
  • Feature flags for incomplete work
  • Automated testing before merge

Pros:

  • Minimal merge conflicts
  • Continuous integration
  • Fast feedback

Cons:

  • Requires feature flags
  • Discipline required
  • Not suitable for all projects

4.3 Git Rebase vs Merge

Understanding the difference is crucial for clean history.

Merge

git checkout main
git merge feature

Result:

  • Creates merge commit
  • Preserves exact history
  • Shows when the branch actually happened

*   Merge branch 'feature' (main)
|\
| * Add feature (feature)
* | Update main (main)
|/
* Initial commit

Pros:

  • Preserves context
  • Safe (non-destructive)
  • Shows actual branch timeline

Cons:

  • Cluttered history
  • Many merge commits

Rebase

git checkout feature
git rebase main
git checkout main
git merge feature    # fast-forward

Result:

  • Replays commits on top of main
  • Linear history
  • No merge commits

* Add feature (main)
* Update main
* Initial commit

Pros:

  • Clean, linear history
  • Easier to read
  • Bisect friendly

Cons:

  • Rewrites history
  • Dangerous on shared branches
  • Loses branch context

Interactive Rebase

git rebase -i HEAD~3

Allows:

  • squash: Combine commits
  • reword: Change commit message
  • edit: Modify commit
  • drop: Remove commit
  • reorder: Change order
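
Running the command opens a todo list in your editor; changing the verbs rewrites history on save (hashes and messages below are illustrative):

```
pick a1b2c3 Add login form
squash d4e5f6 Fix typo in login form
reword 7f8e9d Add logout endpoint
```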

Golden Rule of Rebasing:

Never rebase commits that have been pushed to a shared repository. It will cause chaos for other developers.

When to Use What:

Use Merge When:

  • Merging a long-lived branch
  • Preserving branch history is important
  • Working on public/shared branch

Use Rebase When:

  • Updating feature branch with main
  • Cleaning up local commits before PR
  • Creating linear history

Squash and Merge (GitHub):

Combines all commits from feature branch into one commit on main. Good for keeping main history clean.
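
The same result is available from the command line with git merge --squash. A self-contained sketch (branch, directory, and file names are illustrative):

```shell
# demo: two feature commits collapse into one commit on main
git init -q squashdemo && cd squashdemo
git config user.name "Demo" && git config user.email "demo@example.com"
git checkout -q -b main
echo base > base.txt && git add base.txt && git commit -qm "base"

git checkout -q -b feature
echo one > f.txt  && git add f.txt && git commit -qm "wip 1"
echo two >> f.txt && git commit -qam "wip 2"

git checkout -q main
git merge --squash feature                 # stages the combined diff, no commit yet
git commit -qm "Add feature X (squashed)"
git log --oneline                          # squashed commit, then base
```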

4.4 Submodules

Submodules allow including external repositories within your repository.

Basic Usage:

# Add submodule
git submodule add https://github.com/user/lib.git lib

# Clone with submodules
git clone --recursive https://github.com/user/project.git

# Update submodules
git submodule update --init --recursive

# Pull latest in submodules
git submodule update --remote

.gitmodules File:

[submodule "lib"]
    path = lib
    url = https://github.com/user/lib.git
    branch = main

Challenges:

  1. Detached HEAD: Submodules are checked out at specific commits
  2. Updates: Need to commit submodule reference changes
  3. Collaboration: Team members must remember to update submodules

Alternatives:

  • Subtrees: Copy code into your repo (git subtree)
  • Package managers: npm, pip, maven, etc.
  • Monorepo: Single repository for all code

4.5 Monorepo vs Polyrepo

Monorepo (Single Repository)

All code in one repository.

Pros:

  • Atomic commits across projects
  • Easy code sharing
  • Simplified dependency management
  • Consistent tooling
  • Easier refactoring

Cons:

  • Scales poorly (Git struggles with huge repos)
  • Complex access control
  • Build system complexity
  • Learning curve

Examples: Google, Microsoft, Facebook

Polyrepo (Multiple Repositories)

Each project in its own repository.

Pros:

  • Clear ownership
  • Independent versioning
  • Simpler tooling per project
  • Better access control
  • Scales naturally

Cons:

  • Cross-repo changes are painful
  • Dependency hell
  • Inconsistent tooling
  • Duplication

Hybrid Approaches:

  • Repo orchestration tools: Google's repo, Microsoft's VFS for Git
  • Monorepo with modular build: Bazel, Pants, Please
  • Package-based monorepo: Lerna (JavaScript), Gradle (Java)

4.6 Git Hooks

Git hooks are scripts that run automatically on Git events.

Client-Side Hooks (.git/hooks/):

  • pre-commit: Before commit message editor
  • prepare-commit-msg: Before commit message editor (with template)
  • commit-msg: After commit message
  • post-commit: After commit
  • pre-push: Before push
  • pre-rebase: Before rebase
  • post-checkout: After checkout
  • post-merge: After merge

Server-Side Hooks:

  • pre-receive: Before accepting push
  • update: Per-branch pre-receive
  • post-receive: After push

Example pre-commit hook (linting):

#!/bin/bash
# .git/hooks/pre-commit

echo "Running linter..."
files=$(git diff --cached --name-only --diff-filter=ACM | grep '\.js$')
if [ -n "$files" ]; then
    eslint $files
    if [ $? -ne 0 ]; then
        echo "Linting failed"
        exit 1
    fi
fi
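
A commit-msg hook can enforce a message convention. A sketch that assumes Conventional Commits style prefixes (demo and file names are illustrative):

```shell
# demo: install a commit-msg hook, then commit with a conforming message
git init -q hookdemo && cd hookdemo
git config user.name "Demo" && git config user.email "demo@example.com"

cat > .git/hooks/commit-msg <<'EOF'
#!/bin/sh
# Git passes the path of the commit message file as $1
if ! grep -qE '^(feat|fix|docs|chore|refactor|test)(\([a-z0-9-]+\))?: .+' "$1"; then
    echo "Commit message must look like 'feat: short description'" >&2
    exit 1
fi
EOF
chmod +x .git/hooks/commit-msg

echo hi > greeting.txt && git add greeting.txt
git commit -qm "feat: add greeting"    # passes the hook
```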

Managing Hooks with Tools:

  • Husky (JavaScript): Manages hooks via package.json
  • pre-commit (Python): Framework for multi-language hooks
  • overcommit (Ruby): Extensible hook manager

4.7 Large Scale Git Management

Handling Large Repositories:

Shallow Clones:

git clone --depth 1 https://github.com/user/repo.git

Partial Clones:

git clone --filter=blob:none https://github.com/user/repo.git

Sparse Checkout:

git sparse-checkout set src/

Git LFS (Large File Storage):

Replaces large files with text pointers:

git lfs track "*.psd"
git add .gitattributes
git add file.psd
git commit -m "Add design file"

Performance Optimization:

  • git gc: Garbage collection
  • git repack: Optimize pack files
  • git fsck: Verify database integrity
  • git prune: Remove unreachable objects

Scaling Git Servers:

  • GitLab: Built for enterprise scale
  • GitHub: GitHub AE for large enterprises
  • Bitbucket Data Center: Clustered for scale
  • Gerrit: Code review focused, scales well

4.8 Code Review Best Practices

For Authors:

  1. Keep changes small: < 400 lines is ideal
  2. Write good descriptions: What, why, how
  3. Add context: Screenshots, test results
  4. Self-review first: Catch obvious issues
  5. Respond graciously: To all comments
  6. Explain changes: In comments and commits

For Reviewers:

  1. Review promptly: Within 24 hours ideally
  2. Be kind: Focus on code, not person
  3. Ask questions: "What do you think about..." not "You should..."
  4. Be specific: Point to exact lines and alternatives
  5. Prioritize: Security > correctness > style
  6. Approve thoughtfully: Understand the code

Code Review Checklist:

  • Does the code work?
  • Is it tested appropriately?
  • Is it secure?
  • Is it performant?
  • Is it maintainable?
  • Is it well-named?
  • Does it follow style guide?
  • Is documentation updated?
  • Are there edge cases?
  • Will it scale?

Automated Checks:

  • Linting: Enforce style
  • Static analysis: Find bugs
  • Test coverage: Ensure testing
  • Security scanning: Find vulnerabilities
  • Size checks: Prevent bloat

Chapter 5 — Platforms

5.1 GitHub Enterprise

GitHub Enterprise provides self-hosted or cloud-based GitHub for organizations.

Key Features:

Authentication and Authorization:

  • SAML/SSO integration
  • LDAP/Active Directory
  • Fine-grained permissions
  • Team synchronization

Security:

  • 2FA enforcement
  • Audit logging
  • Secret scanning
  • Dependency graph
  • Security advisories

Collaboration:

  • Protected branches
  • Required reviews
  • Code owners
  • Issue templates
  • Project boards

Actions:

  • Built-in CI/CD
  • Self-hosted runners
  • Marketplace integrations
  • Reusable workflows

API and Automation:

  • GraphQL API
  • REST API
  • Webhooks
  • GitHub Apps

Deployment Options:

GitHub Enterprise Cloud:

  • Hosted by GitHub
  • Enterprise features
  • SLA guarantee
  • Regular updates

GitHub Enterprise Server:

  • Self-hosted
  • Full control
  • Air-gapped possible
  • Upgrade on your schedule

5.2 GitLab CI/CD

GitLab provides integrated CI/CD with their repository platform.

Core Concepts:

.gitlab-ci.yml:

stages:
  - build
  - test
  - deploy

variables:
  DOCKER_DRIVER: overlay2

build:
  stage: build
  script:
    - docker build -t $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA .
    - docker push $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA

test:
  stage: test
  script:
    - npm install
    - npm test

deploy:
  stage: deploy
  script:
    - kubectl set image deployment/myapp app=$CI_REGISTRY_IMAGE:$CI_COMMIT_SHA
  only:
    - main

Runners:

  • Shared runners: Provided by GitLab
  • Group runners: Shared by group
  • Specific runners: Project-specific
  • Auto-scaling: Dynamic provisioning

Features:

  • Auto DevOps
  • Review Apps (ephemeral environments)
  • Container registry
  • Dependency scanning
  • License compliance
  • Browser testing

5.3 Bitbucket

Bitbucket, part of Atlassian, integrates well with Jira and other Atlassian tools.

Key Features:

Branch Permissions:

  • Restrict pushes
  • Require pull requests
  • Prevent deletion
  • Merge checks

Pull Requests:

  • Code reviews
  • Inline comments
  • Task lists
  • Approvals required

Pipelines:

  • Built-in CI/CD
  • Docker support
  • Service containers
  • Deployments to environments

Integration:

  • Jira integration
  • Slack notifications
  • Marketplace add-ons
  • REST API
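
A minimal bitbucket-pipelines.yml sketch for a Node.js project (image tag, step names, and scripts are illustrative):

```yaml
image: node:18

pipelines:
  default:                  # runs on every push
    - step:
        name: Build and test
        caches:
          - node
        script:
          - npm ci
          - npm test
  branches:
    main:                   # additional deploy step only on main
      - step:
          name: Deploy
          deployment: production
          script:
            - ./deploy.sh
```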

5.4 Pull Requests & Merge Requests

PRs (GitHub) and MRs (GitLab) are the primary code review mechanism.

Pull Request Lifecycle:

  1. Create branch from main
  2. Make changes and commit
  3. Push branch to remote
  4. Open PR with description
  5. Automated checks run
  6. Reviewers comment and approve
  7. Address feedback with more commits
  8. Merge when ready
  9. Delete branch

PR Templates:

## Description
[Describe the changes]

## Type of Change
- [ ] Bug fix
- [ ] New feature
- [ ] Breaking change
- [ ] Documentation update

## Testing
[Describe how you tested]

## Screenshots
[If applicable]

## Related Issues
Fixes #123

Best Practices:

  • Link to issues: Connect work to tracking
  • Use draft PRs: For work in progress
  • Small PRs: Easier to review
  • Descriptive titles: "Fix login bug" not "Update"
  • Self-review: Check your own PR first

5.5 Branch Protection Rules

Branch protection prevents force pushes and requires certain conditions before merging.

Common Rules:

Require pull request reviews:

  • Number of approvals required
  • Dismiss stale reviews
  • Require review from code owners

Require status checks:

  • CI must pass
  • Specific checks required
  • Branches must be up to date

Restrict who can push:

  • Specific users/teams
  • Admins included/excluded

Other rules:

  • No force pushes
  • No deletions
  • Include administrators
  • Linear history required

Example GitHub Settings:

{
  "required_status_checks": {
    "strict": true,
    "contexts": ["continuous-integration/jenkins"]
  },
  "enforce_admins": true,
  "required_pull_request_reviews": {
    "required_approving_review_count": 2,
    "dismiss_stale_reviews": true,
    "require_code_owner_reviews": true
  },
  "restrictions": null
}

5.6 Secrets in Repositories

Never store secrets in code. Use secret management tools.

What Not to Store:

  • API keys
  • Passwords
  • SSH keys
  • Database credentials
  • Tokens
  • Certificates

Secret Management Solutions:

GitHub Encrypted Secrets:

# In GitHub Actions
env:
  API_KEY: ${{ secrets.API_KEY }}

GitLab CI/CD Variables:

# Masked and protected variables
script:
  - echo "$CI_DEPLOY_PASSWORD"

HashiCorp Vault:

vault kv put secret/myapp api_key=12345

AWS Secrets Manager:

aws secretsmanager get-secret-value --secret-id myapp

Azure Key Vault:

az keyvault secret show --name api-key --vault-name myvault

Tools for Secret Detection:

  • git-secrets: Prevents committing secrets
  • truffleHog: Searches for secrets in Git history
  • GitHub secret scanning: Automatic detection
  • GitLab secret detection: Built-in scanning
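
Even without dedicated tooling, a pattern scan over tracked files catches the most obvious leaks. A crude sketch that only matches AWS access key IDs and is no substitute for real scanners (demo names are illustrative; AKIAIOSFODNN7EXAMPLE is AWS's documented example key):

```shell
# demo: a leaked key in a tracked file is caught by a simple scan
git init -q scandemo && cd scandemo
echo 'AWS_KEY=AKIAIOSFODNN7EXAMPLE' > config.env
git add config.env

git grep -nE 'AKIA[0-9A-Z]{16}'    # config.env:1:AWS_KEY=AKIAIOSFODNN7EXAMPLE
```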

5.7 Repository Security

Access Control:

  • Principle of least privilege: Grant minimum needed access
  • Regular audits: Review who has access
  • Team-based permissions: Manage groups, not individuals
  • SSO enforcement: Require corporate authentication

Security Features:

Signed Commits:

git commit -S -m "Signed commit"
git config commit.gpgsign true

Signed Tags:

git tag -s v1.0 -m "Signed tag"

Verified commits are displayed with a "Verified" badge in GitHub/GitLab.

Dependency Management:

  • Dependabot: Automated security updates
  • Renovate: Dependency update tool
  • Snyk: Vulnerability scanning
  • OWASP Dependency Check: Security scanning

Audit Logging:

Monitor for suspicious activity:

  • Repository access
  • Permission changes
  • Secret pushes
  • Branch deletions

Incident Response:

When secrets are exposed:

  1. Immediate: Revoke compromised credentials
  2. Investigate: Check access logs
  3. Rotate: Replace all affected secrets
  4. Notify: Inform affected parties
  5. Prevent: Improve scanning/prevention

PART III — CI/CD PIPELINES

Chapter 6 — Continuous Integration

6.1 CI Principles

Continuous Integration is the practice of merging all developer working copies to a shared mainline several times a day.

Core Principles:

  1. Maintain a single source repository: Everything needed to build should be in version control.

  2. Automate the build: One command should build the system.

  3. Make the build self-testing: Tests should be part of the build.

  4. Everyone commits to mainline every day: Avoid long-lived branches.

  5. Every commit should build on an integration machine: Catch problems early.

  6. Keep the build fast: Fast feedback encourages frequent commits.

  7. Test in a clone of production environment: Avoid environment-specific issues.

  8. Make it easy to get the latest deliverables: Artifacts should be easily accessible.

  9. Everyone can see what's happening: Transparency enables collaboration.

  10. Automate deployment: Make it trivial to deploy anywhere.

Benefits:

  • Reduced integration risk: Problems found early
  • Higher code quality: Constant testing
  • Faster delivery: Always releasable state
  • Improved visibility: Build status visible
  • Greater confidence: Automated verification

6.2 Build Automation

Build automation compiles source code into binary artifacts.

Build Tools by Language:

  • Java: Maven, Gradle, Ant
  • JavaScript: npm, yarn, webpack
  • Python: setuptools, poetry, pip
  • Go: go build, make
  • Ruby: rake, bundler
  • C/C++: make, cmake, ninja
  • .NET: MSBuild, dotnet CLI

Build Automation Goals:

  1. Repeatable: Same input → same output
  2. Fast: Minimize feedback time
  3. Idempotent: Can run multiple times
  4. Self-contained: No external dependencies
  5. Consistent: Same process everywhere

Build Script Example (Makefile):

.PHONY: build test clean

build:
	go build -o bin/app ./cmd/app

test:
	go test ./...

clean:
	rm -rf bin/

Build Pipeline Stages:

Source → Compile → Test → Package → Publish

  1. Compile: Convert source to binaries
  2. Test: Run unit and integration tests
  3. Package: Create deployable artifact (JAR, Docker image)
  4. Publish: Store artifact in repository

6.3 Artifact Management

Artifacts are the outputs of build processes that need to be stored and versioned.

Types of Artifacts:

  • Binaries (JAR, EXE, DLL)
  • Packages (DEB, RPM, NPM)
  • Container images
  • Documentation
  • Test reports
  • Configuration files

Artifact Repositories:

Language-specific:

  • Maven: Nexus, Artifactory, Archiva
  • npm: npm registry, Verdaccio
  • Python: PyPI, DevPI
  • Ruby: RubyGems, Geminabox
  • Go: Go proxy, Athens

Universal:

  • JFrog Artifactory: Multi-format support
  • Sonatype Nexus: Repository manager
  • Cloud-specific: AWS CodeArtifact, Azure Artifacts, GCP Artifact Registry

Container Registries:

  • Docker Hub
  • GitHub Container Registry
  • GitLab Container Registry
  • Amazon ECR
  • Azure ACR
  • Google GCR

Best Practices:

  1. Version everything: Use semantic versioning
  2. Immutable artifacts: Never change published artifacts
  3. Metadata: Store build info, commit hash, timestamps
  4. Retention policies: Automatically clean old artifacts
  5. Security scanning: Scan artifacts for vulnerabilities
  6. Access control: Who can read/write artifacts
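
Versioning starts in Git: an annotated tag pins the exact commit an artifact was built from, and git describe recovers that version later. A self-contained sketch (demo names and version numbers are illustrative):

```shell
# demo: pin a release with an annotated tag
git init -q tagdemo && cd tagdemo
git config user.name "Demo" && git config user.email "demo@example.com"
echo v1 > app.txt && git add app.txt && git commit -qm "release prep"

git tag -a v1.4.2 -m "Release 1.4.2"
git describe --tags    # v1.4.2
```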

Artifact Lifecycle:

Build    →  Stage    →  Release     →  Retire
  ↑           ↑            ↑              ↑
Snapshot    Testing    Production      Delete

6.4 Pipeline as Code

Define CI/CD pipelines in code, stored in version control.

Benefits:

  • Version control: Track changes to pipeline
  • Code review: Review pipeline changes
  • Reusability: Share pipeline templates
  • Consistency: Same process everywhere
  • Documentation: Pipeline as executable documentation

Examples:

GitHub Actions:

name: CI
on: [push]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - run: npm install
      - run: npm test

GitLab CI:

stages:
  - build
  - test

build:
  stage: build
  script:
    - go build ./...

test:
  stage: test
  script:
    - go test ./...

Jenkinsfile (Declarative):

pipeline {
    agent any
    stages {
        stage('Build') {
            steps {
                sh 'make build'
            }
        }
        stage('Test') {
            steps {
                sh 'make test'
            }
        }
    }
}

Pipeline Patterns:

DRY (Don't Repeat Yourself):

# Reusable workflow
.build-template: &build-template
  stage: build
  script:
    - docker build -t $IMAGE .

build-app:
  <<: *build-template
  variables:
    IMAGE: app

build-api:
  <<: *build-template
  variables:
    IMAGE: api

6.5 Testing Strategies

Testing in CI/CD requires a comprehensive strategy.

Testing Pyramid:

    /\    E2E Tests (slow, expensive)
   /  \   Integration Tests
  /----\  Component Tests
 /------\ Unit Tests (fast, cheap)
/--------\

Unit Tests:

  • Test individual functions/classes
  • Fast execution (< 100ms each)
  • No external dependencies
  • High coverage (70-80%+)

Integration Tests:

  • Test component interactions
  • May use databases, APIs
  • Slower but more realistic
  • Medium coverage

Component Tests:

  • Test entire component in isolation
  • Mock external dependencies
  • Contract testing with consumers

E2E Tests:

  • Test complete user journeys
  • Full system with all dependencies
  • Slow and brittle
  • Few critical paths only

Other Test Types:

Smoke Tests: Quick sanity checks after deployment

Performance Tests: Load, stress, soak testing

Security Tests: Vulnerability scanning, penetration testing

Mutation Tests: Validate test quality by introducing bugs

Contract Tests: Ensure API compatibility

Test Automation Best Practices:

  1. Run fast tests first: Fail fast
  2. Parallelize tests: Speed up execution
  3. Quarantine flaky tests: Don't block pipeline
  4. Test data management: Consistent test data
  5. Test reporting: Clear results and trends
  6. Test environment parity: Match production

6.6 Parallel Builds

Parallel execution speeds up CI pipelines.

Types of Parallelism:

  1. Test parallelization: Run tests across multiple workers
  2. Matrix builds: Test multiple versions/configurations
  3. Stage parallelization: Run independent stages simultaneously

GitHub Actions Matrix:

jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        node: [14, 16, 18]
        os: [ubuntu-latest, windows-latest]
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-node@v2
        with:
          node-version: ${{ matrix.node }}
      - run: npm test

Test Splitting:

# Split tests by timing
jest --maxWorkers=4 --shard=1/4
jest --maxWorkers=4 --shard=2/4
jest --maxWorkers=4 --shard=3/4
jest --maxWorkers=4 --shard=4/4

Parallel Stages in GitLab:

stages:
  - test
  - deploy

test:
  stage: test
  parallel: 5
  script:
    - ./run-tests.sh $CI_NODE_INDEX $CI_NODE_TOTAL

6.7 Caching & Optimization

Caching reduces build times by reusing previous work.

Cacheable Items:

  • Dependency packages (node_modules, vendor/bundle)
  • Compiled artifacts (.class, .pyc)
  • Docker layers
  • Test results
  • Build tools

GitHub Actions Caching:

- name: Cache node_modules
  uses: actions/cache@v2
  with:
    path: node_modules
    key: ${{ runner.os }}-node-${{ hashFiles('package-lock.json') }}
    restore-keys: |
      ${{ runner.os }}-node-

Docker Layer Caching:

# Cache dependencies first
COPY package*.json ./
RUN npm install          # layer is reused unless package*.json changes
COPY . .
RUN npm run build

Optimization Techniques:

  1. Incremental builds: Only rebuild changed code
  2. Conditional execution: Skip stages when not needed
  3. Build artifacts: Save intermediate outputs
  4. Dependency caching: Cache package managers
  5. Workspace reuse: Reuse workspace across jobs
  6. Container caching: Use cached base images

Pipeline Optimization Checklist:

  • Fast feedback (< 10 minutes)
  • Parallel execution where possible
  • Caching dependencies
  • Skipping irrelevant jobs
  • Efficient test ordering
  • Build only changed code

Chapter 7 — CI Tools

7.1 Jenkins Architecture

Jenkins is the most widely used open-source automation server.

Core Architecture:

User → Jenkins UI/API
        ↓
   Jenkins Master
        ↓
   Build Queue
        ↓
   Build Executors (Master or Agents)

Jenkins Master:

  • Web UI and API
  • Job configuration
  • Build queue management
  • Monitoring and reporting
  • Plugin management

Jenkins Agents (Nodes):

  • Execute builds
  • Distributed across machines
  • Different environments
  • Label-based selection

Installation Options:

  • WAR file: java -jar jenkins.war
  • Package: apt/yum install jenkins
  • Docker: docker run jenkins/jenkins
  • Kubernetes: Jenkins Helm chart

Jenkins Pipeline:

Declarative Pipeline:

pipeline {
    agent any
    stages {
        stage('Build') {
            steps {
                sh 'make build'
            }
        }
        stage('Test') {
            steps {
                sh 'make test'
            }
        }
        stage('Deploy') {
            when {
                branch 'main'
            }
            steps {
                sh 'make deploy'
            }
        }
    }
    post {
        always {
            cleanWs()
        }
        failure {
            slackSend(color: 'danger', message: "Build failed")
        }
    }
}

Scripted Pipeline:

node {
    try {
        stage('Checkout') {
            checkout scm
        }
        stage('Build') {
            sh 'make build'
        }
        stage('Test') {
            sh 'make test'
        }
    } catch (err) {
        currentBuild.result = 'FAILURE'
        throw err
    } finally {
        cleanWs()
    }
}

Shared Libraries:

Reusable pipeline code across projects:

// vars/buildGo.groovy
def call(String version = '1.16') {
    sh "docker run --rm -v $PWD:/app -w /app golang:$version go build"
}

Jenkins Configuration as Code (JCasC):

jenkins:
  systemMessage: "Jenkins configured by JCasC"
  securityRealm:
    ldap:
      configurations:
        - server: ldap.example.com
          rootDN: dc=example,dc=com
  authorizationStrategy:
    globalMatrix:
      permissions:
        - "Overall/Administer:admin"

7.2 GitHub Actions

GitHub-native CI/CD tightly integrated with repositories.

Core Concepts:

  • Workflows: YAML files in .github/workflows/
  • Events: Triggers (push, pull_request, schedule)
  • Jobs: Groups of steps that run on runners
  • Steps: Individual tasks (run commands or actions)
  • Actions: Reusable units of code
  • Runners: Virtual machines that execute jobs

Workflow Structure:

name: CI
on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]

env:
  NODE_VERSION: 16

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      
      - name: Setup Node
        uses: actions/setup-node@v2
        with:
          node-version: ${{ env.NODE_VERSION }}
          
      - name: Install dependencies
        run: npm ci
        
      - name: Run tests
        run: npm test
        
      - name: Upload artifacts
        uses: actions/upload-artifact@v2
        with:
          name: build-output
          path: dist/

Custom Actions:

Docker Container Action:

name: 'My Action'
description: 'Does something'
runs:
  using: 'docker'
  image: 'Dockerfile'

JavaScript Action:

name: 'My Action'
description: 'Does something'
runs:
  using: 'node12'
  main: 'index.js'

Composite Action:

name: 'Composite Action'
description: 'Combines steps'
runs:
  using: 'composite'
  steps:
    - run: echo Hello
      shell: bash

Workflow Features:

  • Matrix strategies: Test multiple configurations
  • Environments: Protection rules and secrets
  • Concurrency: Control parallel runs
  • Dependencies: needs keyword
  • Conditionals: if conditions
  • Reusable workflows: Call workflows from workflows

7.3 GitLab CI

Integrated CI/CD with GitLab's DevOps platform.

.gitlab-ci.yml Structure:

stages:
  - build
  - test
  - deploy

variables:
  DOCKER_DRIVER: overlay2
  IMAGE_TAG: $CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA

cache:
  paths:
    - node_modules/

before_script:
  - docker login -u $CI_REGISTRY_USER -p $CI_REGISTRY_PASSWORD $CI_REGISTRY

build:
  stage: build
  script:
    - docker build -t $IMAGE_TAG .
    - docker push $IMAGE_TAG
  only:
    - main

test:
  stage: test
  script:
    - npm ci
    - npm test

deploy_staging:
  stage: deploy
  script:
    - kubectl set image deployment/app app=$IMAGE_TAG
  environment:
    name: staging
    url: https://staging.example.com
  only:
    - main

deploy_production:
  stage: deploy
  script:
    - kubectl set image deployment/app app=$IMAGE_TAG
  environment:
    name: production
    url: https://example.com
  when: manual
  only:
    - main

Key Features:

  • Review Apps: Ephemeral environments for merge requests
  • Auto DevOps: Preconfigured CI/CD
  • Multi-project pipelines: Cross-project dependencies
  • Parent-child pipelines: Dynamic pipeline generation
  • Rules: Advanced conditional logic
  • Includes: Include external YAML files

GitLab Runners:

  • Shared: Provided by GitLab.com
  • Group: Shared within group
  • Project: Dedicated to project
  • Specific: Custom configuration

Runner Configuration (config.toml):

concurrent = 10
[[runners]]
  name = "docker-runner"
  url = "https://gitlab.com"
  token = "xxxxx"
  executor = "docker"
  [runners.docker]
    image = "alpine"
    volumes = ["/cache"]

7.4 CircleCI

Cloud-native CI/CD with focus on speed and convenience.

Configuration (.circleci/config.yml):

version: 2.1

orbs:
  node: circleci/node@5.0.0

jobs:
  build:
    docker:
      - image: cimg/node:16.10
        auth:
          username: mydockerhub-user
          password: $DOCKERHUB_PASSWORD
    steps:
      - checkout
      - node/install-packages:
          pkg-manager: npm
      - run:
          name: Run tests
          command: npm test
      - persist_to_workspace:
          root: ~/project
          paths:
            - .

  deploy:
    docker:
      - image: cimg/base:2022.06
    steps:
      - attach_workspace:
          at: ~/project
      - run:
          name: Deploy to production
          command: ./deploy.sh

workflows:
  version: 2
  build_and_deploy:
    jobs:
      - build
      - deploy:
          requires:
            - build
          filters:
            branches:
              only: main

CircleCI Concepts:

  • Orbs: Reusable configuration packages
  • Executors: Docker, machine, macOS, Windows
  • Workspaces: Persist data between jobs
  • Caching: Speed up dependency installation
  • Contexts: Share environment variables across projects
  • SSH debugging: Debug builds interactively

7.5 Azure DevOps

Microsoft's enterprise DevOps platform.

Pipelines (YAML):

trigger:
- main

pool:
  vmImage: ubuntu-latest

variables:
  buildConfiguration: 'Release'
  majorVersion: 1
  minorVersion: 0

stages:
- stage: Build
  jobs:
  - job: BuildJob
    steps:
    - task: DotNetCoreCLI@2
      inputs:
        command: 'build'
        projects: '**/*.csproj'
        arguments: '--configuration $(buildConfiguration)'
    
    - task: DotNetCoreCLI@2
      inputs:
        command: 'test'
        projects: '**/*Tests.csproj'
        arguments: '--configuration $(buildConfiguration)'
    
    - task: DotNetCoreCLI@2
      inputs:
        command: 'publish'
        publishWebProjects: true
        arguments: '--configuration $(buildConfiguration) --output $(Build.ArtifactStagingDirectory)'
    
    - task: PublishBuildArtifacts@1
      inputs:
        PathtoPublish: '$(Build.ArtifactStagingDirectory)'
        ArtifactName: 'drop'

- stage: Deploy
  jobs:
  - deployment: DeployWeb
    environment: 'production'
    strategy:
      runOnce:
        deploy:
          steps:
          - task: AzureWebApp@1
            inputs:
              azureSubscription: 'my-connection'
              appName: 'my-app'
              package: '$(Pipeline.Workspace)/drop/**/*.zip'

Azure DevOps Components:

  • Azure Pipelines: CI/CD
  • Azure Repos: Git repositories
  • Azure Boards: Work tracking
  • Azure Test Plans: Testing tools
  • Azure Artifacts: Package management

Key Features:

  • Multi-stage pipelines: Visual designer
  • Environments: Track deployments
  • Approvals: Manual intervention
  • Gates: Automated health checks
  • Service connections: Connect to Azure services
  • Task groups: Reusable task collections

7.6 Pipeline Security

Securing CI/CD pipelines is critical: they hold credentials for source code, artifact registries, and production environments.

Security Principles:

  1. Least privilege: Minimal permissions
  2. Isolation: Separate build environments
  3. Secrets management: Never expose secrets
  4. Input validation: Protect against injection
  5. Audit logging: Track all changes
  6. Dependency verification: Verify third-party code

Common Threats:

Credential Exposure:

  • Secrets in logs
  • Hardcoded credentials
  • Exposed environment variables

Supply Chain Attacks:

  • Compromised dependencies
  • Malicious packages
  • Typosquatting

Pipeline Tampering:

  • Unauthorized pipeline changes
  • Malicious commits
  • Build environment compromise

Security Best Practices:

Secrets:

# NEVER do this
- run: echo "password=12345"  # Bad!

# Use secrets
- run: echo "password=$SECRET"
  env:
    SECRET: ${{ secrets.MY_SECRET }}
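Beyond referencing secrets from the platform's store, CI systems typically mask registered secret values in job logs before the lines are written. A toy sketch of that masking (the function and its behavior are illustrative, not any particular CI product's implementation):

```python
def mask_secrets(line: str, secrets: list[str]) -> str:
    """Replace every occurrence of a known secret value with asterisks.

    CI systems such as GitHub Actions apply a similar transform to job
    log output for every registered secret.
    """
    for secret in secrets:
        if secret:  # never mask on an empty string
            line = line.replace(secret, "***")
    return line

print(mask_secrets("connecting with password=hunter2", ["hunter2"]))
# → connecting with password=***
```

Masking is best-effort — it cannot catch secrets that are encoded or split across lines — which is why the primary rule remains: never echo secrets at all.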

OIDC (OpenID Connect):

# Instead of long-lived secrets, exchange a short-lived OIDC token
# for cloud credentials (the job needs `permissions: id-token: write`)
- name: Configure AWS credentials
  uses: aws-actions/configure-aws-credentials@v1
  with:
    role-to-assume: arn:aws:iam::123456789:role/GitHubActions
    aws-region: us-east-1

Signed Commits:

  • Require signed commits for sensitive repos
  • Verify commit signatures in pipeline

Dependency Verification:

# Verify package integrity
- run: npm audit
- run: npm ci --ignore-scripts  # Disable install scripts

Isolation:

  • Use ephemeral runners
  • Network isolation
  • Container sandboxing

7.7 Scaling CI Infrastructure

As teams grow, CI infrastructure needs to scale.

Scaling Strategies:

1. Horizontal Scaling:

  • Add more build agents
  • Auto-scaling based on queue
  • Multiple regions/zones

2. Vertical Scaling:

  • Bigger machines
  • More CPU/memory per build
  • Faster storage (SSD)

3. Build Optimization:

  • Caching dependencies
  • Parallel test execution
  • Incremental builds
  • Skipping unnecessary builds
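The queue-based auto-scaling from strategy 1 reduces to a simple sizing rule: one agent per N queued jobs, clamped so the pool neither vanishes when idle nor grows without bound. A minimal sketch (the parameter names are illustrative):

```python
import math

def desired_agents(queued_jobs: int, jobs_per_agent: int,
                   min_agents: int, max_agents: int) -> int:
    """Scale the agent pool to the queue: one agent per
    `jobs_per_agent` queued jobs, clamped to [min_agents, max_agents]."""
    needed = math.ceil(queued_jobs / jobs_per_agent) if queued_jobs else 0
    return max(min_agents, min(max_agents, needed))

print(desired_agents(queued_jobs=23, jobs_per_agent=5,
                     min_agents=1, max_agents=10))  # → 5
```

Real autoscalers add hysteresis (scale down more slowly than up) to avoid thrashing when the queue oscillates.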

Jenkins Scaling:

Master-Agent Setup:

pipeline {
    agent { label 'linux && large' }
    stages {
        stage('Build') {
            steps {
                sh 'make build'
            }
        }
    }
}

Dynamic Agents (Kubernetes):

apiVersion: v1
kind: Pod
spec:
  containers:
  - name: jnlp
    image: jenkins/inbound-agent
  - name: golang
    image: golang:1.16
    command:
    - cat
  - name: docker
    image: docker:20.10
    command:
    - cat
    volumeMounts:
    - name: docker-sock
      mountPath: /var/run/docker.sock
  volumes:
  - name: docker-sock
    hostPath:
      path: /var/run/docker.sock

GitHub Actions Scaling:

  • Self-hosted runners: Custom machines
  • Runner groups: Organization/enterprise level
  • Auto-scaling: Dynamic provisioning

Self-hosted Runner Auto-scaling (Azure):

resource "azuredevops_agent_pool" "pool" {
  name           = "my-pool"
  auto_provision = true
}

resource "azuredevops_elastic_pool" "elastic" {
  name                = "my-elastic-pool"
  service_endpoint_id = azuredevops_serviceendpoint_azurerm.az.id
  
  azure_resource_id = azurerm_linux_virtual_machine_scale_set.vmss.id
  
  desired_idle = 1
  max_capacity = 10
}

Monitoring CI Infrastructure:

Key metrics:

  • Queue time
  • Build duration
  • Success/failure rate
  • Agent utilization
  • Cost per build

Cost Optimization:

  • Use spot/preemptible instances
  • Auto-scale down when idle
  • Cache effectively
  • Right-size instances

Chapter 8 — Continuous Delivery & Deployment

8.1 CD vs Continuous Deployment

Continuous Delivery

Every change is deployable, but deployment may be manual.

Commit → Build → Test → Staging → Manual Approval → Production
                           ↑
                     Always deployable

Key Characteristics:

  • Software always in releasable state
  • Deployment is a business decision
  • Manual approval for production
  • Compliance and audit gates

Continuous Deployment

Every change that passes tests is automatically deployed.

Commit → Build → Test → Staging → Auto → Production
                                     ↑
                            Automated promotion

Key Characteristics:

  • Fully automated pipeline
  • No manual intervention
  • Multiple daily deployments
  • Requires high confidence in testing

Choosing Between Them:

Continuous Delivery is better when:

  • Regulatory/compliance requirements
  • Business needs release coordination
  • Low deployment frequency is acceptable
  • Building confidence gradually

Continuous Deployment is better when:

  • SaaS/cloud native applications
  • High deployment frequency desired
  • Strong automated testing
  • Feature flags in place
  • Individual deployments are small and low-risk

8.2 Deployment Strategies

Blue/Green Deployment

Two identical environments, one live (blue), one idle (green).

Before switch:
Users → Blue (v1)    Green (v2 - idle)

After switch:
Users → Green (v2)    Blue (v1 - idle)

Implementation:

# Kubernetes with labels
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-blue
spec:
  replicas: 10
  template:
    metadata:
      labels:
        version: blue
---
apiVersion: v1
kind: Service
metadata:
  name: app-service
spec:
  selector:
    version: blue  # Switch to green when ready

Pros:

  • Instant rollback (switch back)
  • No downtime
  • Staging environment always available

Cons:

  • Double infrastructure cost
  • Database schema challenges

Canary Deployment

Gradually shift traffic to new version.

Users → 90% → v1
        10% → v2 (canary)

If successful: increase to 25%, 50%, 100%
If problems: route back to 100% v1

Kubernetes with Istio:

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: app
spec:
  hosts:
  - app
  http:
  - match:
    - headers:
        canary:
          exact: "true"
    route:
    - destination:
        host: app
        subset: v2
      weight: 100
  - route:
    - destination:
        host: app
        subset: v1
      weight: 90
    - destination:
        host: app
        subset: v2
      weight: 10

Pros:

  • Real traffic testing
  • Gradual risk exposure
  • Canary analysis

Cons:

  • Complex routing
  • Longer deployment time
  • Requires monitoring
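The weighted routing above can be simulated in a few lines. Hashing the user ID into a stable bucket makes the canary "sticky" — each user consistently sees the same version — which real routers often prefer over per-request randomness. A sketch under those assumptions:

```python
import hashlib

def route(user_id: str, canary_weight: int) -> str:
    """Sticky canary routing: hash the user ID into a 0-99 bucket so
    `canary_weight` percent of users land on v2, and the same user
    always gets the same answer."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "v2" if bucket < canary_weight else "v1"

# With a 10% canary, roughly one user in ten is routed to v2
assignments = [route(f"user-{i}", 10) for i in range(1000)]
print(assignments.count("v2"))  # roughly 100
```

Promoting the canary is then just raising `canary_weight` toward 100; rolling back is setting it to 0.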

Rolling Deployment

Gradually replace instances.

v1 → v1 → v1 → v1 → v1 → v1 → v1 → v1 → v1 → v1
v2 → v2 → v1 → v1 → v1 → v1 → v1 → v1 → v1 → v1
v2 → v2 → v2 → v2 → v1 → v1 → v1 → v1 → v1 → v1
v2 → v2 → v2 → v2 → v2 → v2 → v2 → v2 → v2 → v2

Kubernetes Rolling Update:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
spec:
  replicas: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2        # How many extra pods
      maxUnavailable: 1   # How many can be down
  template:
    spec:
      containers:
      - image: app:v2
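The two tuning knobs translate directly into a pod-count envelope: with 10 replicas, maxSurge: 2 and maxUnavailable: 1 mean at most 12 pods exist and at least 9 are available at any instant. A sketch of that arithmetic (absolute values only; both fields also accept percentages):

```python
def rollout_bounds(replicas: int, max_surge: int, max_unavailable: int) -> dict:
    """Pod-count envelope during a RollingUpdate: Kubernetes never
    runs more than replicas + maxSurge total pods, and never lets
    available pods drop below replicas - maxUnavailable."""
    return {
        "max_total": replicas + max_surge,
        "min_available": replicas - max_unavailable,
    }

print(rollout_bounds(replicas=10, max_surge=2, max_unavailable=1))
# → {'max_total': 12, 'min_available': 9}
```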

Pros:

  • No extra infrastructure
  • Gradual replacement
  • Kubernetes native

Cons:

  • Slower rollout
  • Complex rollback
  • Version mix during deployment

Shadow Deployment

Run new version alongside old, mirror traffic but discard responses.

User → v1 (serves response)
   ↓
   → v2 (shadow - discard response)

Pros:

  • Test with production traffic
  • No user impact
  • Performance comparison

Cons:

  • Double resource usage
  • No feedback to users
  • Complex implementation

8.3 Feature Flags

Feature flags (toggles) enable deploying incomplete features safely.

Types of Flags:

  1. Release toggles: Control feature visibility
  2. Experiment toggles: A/B testing
  3. Ops toggles: Operational controls
  4. Permission toggles: User targeting

Implementation:

# Simple flag check
if feature_flags.is_enabled('new-checkout'):
    return new_checkout_flow()
else:
    return old_checkout_flow()

Targeting Rules:

// LaunchDarkly example
const context = { key: user.id, email: user.email };
const showFeature = ldclient.variation('new-feature', context, false);

Flag Management Systems:

  • LaunchDarkly: Enterprise feature management
  • Split.io: Feature experimentation
  • Flagsmith: Open source
  • Unleash: Open source
  • ConfigCat: Simple feature flags
  • Custom: Database + cache

Best Practices:

  1. Short-lived flags: Remove after rollout
  2. Flag naming: Clear and consistent
  3. Audit logging: Track flag changes
  4. Default to off: Safe fallback
  5. Flag hygiene: Regular cleanup
  6. Testing: Test with flags on/off
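Percentage rollouts — a common targeting rule in all of the systems listed above — are usually implemented with consistent-hash bucketing. A minimal sketch (names are illustrative; folding the flag name into the hash decorrelates rollouts so the same 10% of users are not the guinea pigs for every flag):

```python
import hashlib

def is_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    """Percentage rollout: hash flag+user into a stable 0-99 bucket.
    A user's answer for a given flag never changes as long as the
    rollout percentage stays the same or increases past their bucket."""
    key = f"{flag}:{user_id}".encode()
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100
    return bucket < rollout_percent

print(is_enabled("new-checkout", "user-42", 100))  # → True
print(is_enabled("new-checkout", "user-42", 0))    # → False
```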

8.4 Database Migration Strategies

Database changes are often the riskiest part of deployment.

Principles:

  1. Separate schema changes from code changes
  2. Forward and backward compatible
  3. Automated migrations
  4. Testable rollbacks

Migration Types:

Expand/Migrate/Contract Pattern:

Phase 1: Expand
- Add new column (nullable)
- Dual-write to both columns

Phase 2: Migrate
- Backfill data to new column
- Migrate reads to new column

Phase 3: Contract
- Remove old column
- Remove dual-write
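The Expand and Migrate phases can be exercised end-to-end against an in-memory SQLite database. A sketch with a hypothetical `users` schema (the Contract phase would later drop the old source column once no reader depends on it):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("INSERT INTO users (id, name) VALUES (1, 'ada'), (2, 'lin')")

# Phase 1 - Expand: add the new column as nullable,
# so old code that never writes it keeps working
db.execute("ALTER TABLE users ADD COLUMN email TEXT")

# Phase 2 - Migrate: backfill existing rows
db.execute("UPDATE users SET email = name || '@example.com' "
           "WHERE email IS NULL")

print(db.execute("SELECT email FROM users ORDER BY id").fetchall())
# → [('ada@example.com',), ('lin@example.com',)]
```

On a large production table the backfill would run in batches to avoid long-held locks.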

Online Schema Change Tools:

  • gh-ost: GitHub's online schema migration
  • pt-online-schema-change: Percona Toolkit
  • Liquibase: Database refactoring
  • Flyway: Version control for databases
  • Alembic: Python migrations

Example Flyway Migration:

-- V1__initial_schema.sql
CREATE TABLE users (
    id INT PRIMARY KEY,
    name VARCHAR(255)
);

-- V2__add_email.sql
ALTER TABLE users ADD COLUMN email VARCHAR(255);

-- V3__populate_email.sql
UPDATE users SET email = CONCAT(name, '@example.com');

Zero-Downtime Migration Strategy:

1. Add nullable column
2. Dual-write to new column (code change)
3. Backfill data
4. Make column non-nullable (if needed)
5. Remove old column (future release)

8.5 Rollbacks & Recovery

Despite best efforts, things go wrong. Be prepared.

Rollback Strategies:

Version Rollback:

  • Revert to previous artifact
  • Simple and fast
  • Loses new features

Feature Flag Rollback:

  • Disable problematic feature
  • No deployment needed
  • Keep other features

Database Rollback:

  • Restore from backup
  • Apply compensating transactions
  • Forward-only migrations (avoid rollbacks)

Automated Rollback Triggers:

# Canary analysis with automated rollback
deploy:
  strategy:
    canary:
      steps:
      - setWeight: 10
      - pause:
          duration: 5m
      - analysis:
          metrics:
          - name: error-rate
            threshold: 1
      - setWeight: 50
      - pause:
          duration: 5m
      - analysis:
          metrics:
          - name: error-rate
            threshold: 1
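The analysis step above boils down to comparing observed metric samples against a threshold during the pause window. A toy version of that decision (real canary analysis tools weigh multiple metrics and statistical significance):

```python
def should_rollback(error_rates: list[float], threshold: float) -> bool:
    """Roll back if any error-rate sample observed during the pause
    window exceeds the configured threshold."""
    return any(rate > threshold for rate in error_rates)

print(should_rollback([0.2, 0.4, 1.8], threshold=1.0))  # → True
print(should_rollback([0.2, 0.4, 0.9], threshold=1.0))  # → False
```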

Rollback Procedure:

  1. Detect the problem (monitoring)
  2. Decide to roll back (automated or manual)
  3. Execute rollback (deploy previous version)
  4. Verify system is healthy
  5. Post-mortem to prevent recurrence

8.6 GitOps Workflow

GitOps uses Git as the single source of truth for declarative infrastructure and applications.

Core Principles:

  1. Declarative description: Entire system described in Git
  2. Git as source of truth: Cluster state matches Git
  3. Automated convergence: Software ensures cluster matches Git
  4. Pull-based deployments: Cluster pulls changes

GitOps Architecture:

Developer pushes to Git
        ↓
   Git Repository
        ↓
GitOps Operator (ArgoCD/Flux)
        ↓
   Kubernetes Cluster

Benefits:

  • Audit trail: All changes in Git
  • Faster recovery: Recreate from Git
  • Standard workflows: Use Git tools
  • Security: Pull model reduces credentials
  • Observability: Drift detection
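The convergence loop an operator such as ArgoCD or Flux runs can be reduced to "diff desired state (from Git) against actual state, then apply the difference". A toy sketch of one reconcile pass over plain dictionaries (real operators do this continuously against the Kubernetes API):

```python
def reconcile(desired: dict, actual: dict) -> list[str]:
    """One pass of a GitOps convergence loop: emit the actions needed
    to make the actual state match the desired state."""
    actions = []
    for name, spec in desired.items():
        if name not in actual:
            actions.append(f"create {name}")
        elif actual[name] != spec:
            actions.append(f"update {name}")  # drift or new revision
    for name in actual:
        if name not in desired:
            actions.append(f"prune {name}")   # like `prune: true`
    return actions

desired = {"web": {"replicas": 3}, "api": {"replicas": 2}}
actual = {"web": {"replicas": 2}, "old-job": {"replicas": 1}}
print(reconcile(desired, actual))
# → ['update web', 'create api', 'prune old-job']
```

The `selfHeal` option corresponds to running this loop even when the change originated in the cluster rather than in Git.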

ArgoCD Example:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: myapp
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/user/repo.git
    targetRevision: HEAD
    path: k8s
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true
      selfHeal: true

Flux Example:

apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: GitRepository
metadata:
  name: myapp
  namespace: flux-system
spec:
  interval: 1m
  url: https://github.com/user/repo
  ref:
    branch: main
---
apiVersion: kustomize.toolkit.fluxcd.io/v1beta2
kind: Kustomization
metadata:
  name: myapp
  namespace: flux-system
spec:
  interval: 10m
  path: ./k8s
  prune: true
  sourceRef:
    kind: GitRepository
    name: myapp

PART IV — CONTAINERS & ORCHESTRATION

Chapter 9 — Containerization

9.1 Container Fundamentals

Containers provide lightweight virtualization at the OS level.

What are Containers?

Containers package an application with its dependencies, libraries, and configuration files, running isolated from other processes on the same host.

Containers vs Virtual Machines:

Aspect          Containers          Virtual Machines
──────          ──────────          ────────────────
Isolation       Process-level       Hardware-level
OS              Share host kernel   Each has guest OS
Startup         Milliseconds        Minutes
Size            MB                  GB
Performance     Native              Some overhead
Resource usage  Lightweight         Heavy

Container Technologies:

  • LXC (Linux Containers): Original Linux containers
  • Docker: Most popular container platform
  • Podman: Daemonless container engine
  • containerd: Industry-standard runtime
  • CRI-O: Kubernetes-specific runtime

Linux Kernel Features:

Namespaces: Isolate process views

  • PID: Process IDs
  • NET: Network interfaces
  • MNT: Mount points
  • UTS: Hostname
  • IPC: Inter-process communication
  • USER: User IDs

Cgroups (Control Groups): Limit resources

  • CPU shares/quota
  • Memory limits
  • Block I/O
  • Network bandwidth

Union Filesystems: Layer management

  • OverlayFS
  • AUFS
  • Device Mapper

9.2 Docker Internals

Docker Architecture:

Client (docker CLI)
    ↓
Docker Daemon (dockerd)
    ↓
Containerd
    ↓
runc (OCI runtime)
    ↓
Container

Components:

  • docker CLI: User interface
  • dockerd: Persistent daemon
  • containerd: Container lifecycle management
  • runc: OCI runtime (creates containers)
  • containerd-shim: Parent of container processes

Images and Layers:

Docker images are built in layers:

Layer 4: CMD ["node", "app.js"]
Layer 3: COPY . /app
Layer 2: RUN npm install
Layer 1: FROM node:16
      ↓
Union mount at runtime

Layer Caching:

Each layer is cached. When rebuilding:

  • Unchanged layers reused
  • Changed layers and all subsequent rebuilt
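This invalidation rule — the first changed instruction forces a rebuild of itself and everything after it — can be sketched in a few lines (simplified: real cache keys also hash the file contents behind COPY/ADD, not just the instruction text):

```python
def layers_to_rebuild(layers: list[str], changed: set[str]) -> list[str]:
    """Docker layer caching: the first changed layer invalidates the
    cache for itself and every subsequent layer."""
    for i, layer in enumerate(layers):
        if layer in changed:
            return layers[i:]
    return []  # full cache hit

dockerfile = ["FROM node:16", "RUN npm install",
              "COPY . /app", "CMD node app.js"]
print(layers_to_rebuild(dockerfile, changed={"COPY . /app"}))
# → ['COPY . /app', 'CMD node app.js']
```

This is why ordering Dockerfile instructions from least- to most-frequently changed (dependencies before source) pays off.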

Docker Storage Drivers:

  • overlay2: Default (recommended)
  • devicemapper: Legacy
  • btrfs/zfs: Advanced features
  • vfs: No copy-on-write

Network Drivers:

  • bridge: Default, NAT through host
  • host: Use host network directly
  • overlay: Multi-host networking
  • macvlan: Assign MAC addresses
  • none: No networking

9.3 Dockerfiles Best Practices

Base Images:

# Use specific tags, not latest
FROM node:16.14.2-alpine

# Use minimal base images
FROM alpine:3.15

Layer Optimization:

# Bad - each RUN creates layer
RUN apt-get update
RUN apt-get install -y curl
RUN apt-get clean

# Good - combine commands
RUN apt-get update && \
    apt-get install -y curl && \
    apt-get clean

Order Matters:

# Copy dependency files first (cached longer)
COPY package*.json ./
RUN npm install

# Copy source last (changes frequently)
COPY . .

Multi-stage Builds:

# Build stage
FROM node:16 AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

# Production stage
FROM nginx:alpine
COPY --from=builder /app/dist /usr/share/nginx/html
COPY nginx.conf /etc/nginx/nginx.conf

Security Best Practices:

# Run as non-root
RUN addgroup -g 1000 -S appgroup && \
    adduser -u 1000 -S appuser -G appgroup
USER appuser

# No secrets in build args
ARG DB_PASSWORD  # Bad - visible in history

# Use BuildKit build secrets instead
# (docker build --secret id=db_password,src=./db_password.txt .)
RUN --mount=type=secret,id=db_password \
    cat /run/secrets/db_password

.dockerignore:

node_modules
.git
*.log
.env
Dockerfile
.dockerignore

9.4 Multi-Stage Builds

Multi-stage builds optimize final image size by separating build and runtime environments.

Example: Go Application

# Build stage
FROM golang:1.17 AS builder
WORKDIR /app
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 GOOS=linux go build -o main .

# Runtime stage
FROM alpine:3.15
RUN apk --no-cache add ca-certificates
WORKDIR /root/
COPY --from=builder /app/main .
EXPOSE 8080
CMD ["./main"]

Example: React Application

# Build stage
FROM node:16 AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

# Runtime stage
FROM nginx:alpine
COPY --from=builder /app/build /usr/share/nginx/html
COPY nginx.conf /etc/nginx/nginx.conf
EXPOSE 80
CMD ["nginx", "-g", "daemon off;"]

Benefits:

  • Smaller images (MB vs GB)
  • No build tools in production
  • Better security
  • Faster pulls

9.5 Container Security

Security Principles:

  1. Least privilege: Minimal capabilities
  2. Immutable: No runtime changes
  3. Read-only root filesystem
  4. No privileged containers
  5. Vulnerability scanning

Security Best Practices:

User Namespace Remapping:

{
  "userns-remap": "default"
}

Read-only Root:

# Run with a read-only root filesystem; give the app
# writable tmpfs mounts only where it needs them
docker run --read-only --tmpfs /tmp --tmpfs /var/log myapp

Drop Capabilities:

docker run --cap-drop=ALL --cap-add=NET_BIND_SERVICE myapp

Security Context (Kubernetes):

securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  capabilities:
    drop: ["ALL"]
  readOnlyRootFilesystem: true

Image Signing:

# Docker Content Trust
export DOCKER_CONTENT_TRUST=1
docker push myapp:latest

9.6 Image Scanning

Scan images for vulnerabilities before deployment.

Common Scanners:

  • Trivy: Comprehensive, easy to use
  • Clair: CoreOS scanner
  • Anchore: Deep inspection
  • Snyk: Developer-focused
  • Docker Scout: Docker native
  • Grype: Fast vulnerability scanner

Trivy Example:

# Scan image
trivy image myapp:latest

# Scan with severity filter
trivy image --severity CRITICAL,HIGH myapp:latest

# Generate HTML report
trivy image --format template --template "@contrib/html.tpl" -o report.html myapp:latest

CI Integration:

# GitHub Actions
- name: Scan image
  uses: aquasecurity/trivy-action@master
  with:
    image-ref: 'myapp:latest'
    format: 'sarif'
    output: 'trivy-results.sarif'

SBOM (Software Bill of Materials):

# Generate SBOM
trivy image --format cyclonedx myapp:latest > sbom.json

# Scan for known vulnerabilities
trivy sbom sbom.json

9.7 OCI Standards

Open Container Initiative (OCI) ensures container format and runtime interoperability.

OCI Specifications:

  1. Image Specification: Container image format
  2. Runtime Specification: Container execution
  3. Distribution Specification: Content distribution

OCI Image Layout:

myimage/
├── blobs/
│   └── sha256/
│       ├── a1b2c3... (layer)
│       ├── d4e5f6... (config)
│       └── g7h8i9... (manifest)
└── index.json

Benefits:

  • Interoperability: Works across tools
  • Portability: Run anywhere
  • Stability: Backward compatible
  • Ecosystem: Wide tool support

Tools Supporting OCI:

  • Docker (with containerd)
  • Podman
  • Buildah
  • Skopeo
  • CRI-O
  • Kubernetes

Chapter 10 — Kubernetes Deep Dive

10.1 Kubernetes Architecture

Kubernetes orchestrates containerized applications across clusters of machines.

High-Level Architecture:

                    ┌─────────────────────┐
                    │   Control Plane      │
                    │  ┌─────────────────┐ │
                    │  │  API Server     │ │
                    │  └─────────────────┘ │
                    │  ┌─────────────────┐ │
                    │  │  Scheduler      │ │
                    │  └─────────────────┘ │
                    │  ┌─────────────────┐ │
                    │  │ Controller Mgr   │ │
                    │  └─────────────────┘ │
                    │  ┌─────────────────┐ │
                    │  │  etcd           │ │
                    │  └─────────────────┘ │
                    └──────────┬──────────┘
                               │
        ┌──────────────────────┼──────────────────────┐
        │                      │                      │
┌───────▼───────┐      ┌───────▼───────┐      ┌───────▼───────┐
│   Node 1      │      │   Node 2      │      │   Node 3      │
│ ┌───────────┐ │      │ ┌───────────┐ │      │ ┌───────────┐ │
│ │ kubelet   │ │      │ │ kubelet   │ │      │ │ kubelet   │ │
│ └───────────┘ │      │ └───────────┘ │      │ └───────────┘ │
│ ┌───────────┐ │      │ ┌───────────┐ │      │ ┌───────────┐ │
│ │ kube-proxy│ │      │ │ kube-proxy│ │      │ │ kube-proxy│ │
│ └───────────┘ │      │ └───────────┘ │      │ └───────────┘ │
│ ┌───────────┐ │      │ ┌───────────┐ │      │ ┌───────────┐ │
│ │ Container │ │      │ │ Container │ │      │ │ Container │ │
│ │ Runtime   │ │      │ │ Runtime   │ │      │ │ Runtime   │ │
│ └───────────┘ │      │ └───────────┘ │      │ └───────────┘ │
└───────────────┘      └───────────────┘      └───────────────┘

10.2 Control Plane Components

API Server (kube-apiserver):

  • Frontend to control plane
  • Validates and configures objects
  • Serves REST API
  • Horizontal scalable

etcd:

  • Distributed key-value store
  • Cluster state storage
  • Consistent and highly available
  • Raft consensus algorithm

Scheduler (kube-scheduler):

  • Assigns pods to nodes
  • Considers resources, constraints
  • Policy-based scheduling
  • Extensible with custom schedulers

Controller Manager (kube-controller-manager):

Runs controllers:

  • Node controller
  • Replication controller
  • Endpoint controller
  • Service Account controller
  • etc.

Cloud Controller Manager (cloud-controller-manager):

Interacts with cloud providers:

  • Node management
  • Load balancers
  • Routes
  • Volumes

10.3 Pods, Deployments, Services

Pod:

Smallest deployable unit, one or more containers.

apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    app: web
spec:
  containers:
  - name: nginx
    image: nginx:1.21
    ports:
    - containerPort: 80
    resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
      limits:
        memory: "128Mi"
        cpu: "500m"

Deployment:

Manages replica sets and rolling updates.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: nginx
        image: nginx:1.21
        ports:
        - containerPort: 80
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0

Service:

Stable network endpoint for pods.

apiVersion: v1
kind: Service
metadata:
  name: web-service
spec:
  selector:
    app: web
  ports:
  - port: 80
    targetPort: 80
  type: ClusterIP  # Default, internal only

Service types:

  • ClusterIP: Internal cluster IP
  • NodePort: Expose on each node's IP
  • LoadBalancer: Cloud load balancer
  • ExternalName: DNS alias

10.4 Networking Model

Kubernetes Networking Requirements:

  • Pods can communicate with all other pods without NAT
  • Nodes can communicate with all pods without NAT
  • A pod's IP address is the same from its own view and from other pods

CNI (Container Network Interface):

Plugins implement networking:

  • Calico: Network policy, BGP
  • Flannel: Simple overlay
  • Weave: Mesh networking
  • Cilium: eBPF-based, security
  • AWS VPC CNI: Native VPC integration

Network Policies:

Firewall rules for pods:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-allow
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend
    ports:
    - protocol: TCP
      port: 8080

10.5 Storage in Kubernetes

Volumes:

apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  volumes:
  - name: data
    emptyDir: {}  # Temporary
  - name: config
    configMap:
      name: app-config
  - name: secret
    secret:
      secretName: db-secret
  containers:
  - name: app
    volumeMounts:
    - name: data
      mountPath: /data

Persistent Volumes (PV):

Cluster storage resource:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-volume
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  awsElasticBlockStore:
    volumeID: vol-12345
    fsType: ext4

Persistent Volume Claims (PVC):

Request storage:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-claim
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi

Storage Classes:

Dynamic provisioning:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp3
  fsType: ext4

10.6 RBAC (Role-Based Access Control)

Core Concepts:

  • Role/ClusterRole: Set of permissions
  • RoleBinding/ClusterRoleBinding: Bind roles to users/groups

Role Example:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: default
  name: pod-reader
rules:
- apiGroups: [""]  # Core API group
  resources: ["pods"]
  verbs: ["get", "list", "watch"]

ClusterRole Example:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cluster-admin
rules:
- apiGroups: ["*"]
  resources: ["*"]
  verbs: ["*"]

RoleBinding:

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods
  namespace: default
subjects:
- kind: User
  name: jane
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io

Service Account Example:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: app-sa
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: app-binding
subjects:
- kind: ServiceAccount
  name: app-sa
  namespace: default
roleRef:
  kind: ClusterRole
  name: view
  apiGroup: rbac.authorization.k8s.io

10.7 Helm Charts

Helm is the package manager for Kubernetes.

Chart Structure:

mychart/
├── Chart.yaml          # Metadata
├── values.yaml         # Default values
├── templates/          # Template files
│   ├── deployment.yaml
│   ├── service.yaml
│   └── _helpers.tpl    # Helper templates
└── charts/             # Dependencies

Chart.yaml:

apiVersion: v2
name: myapp
description: My application
type: application
version: 0.1.0
appVersion: "1.0.0"
dependencies:
- name: redis
  version: 16.0.0
  repository: https://charts.bitnami.com/bitnami

Template Example:

# templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ include "mychart.fullname" . }}
  labels:
    {{- include "mychart.labels" . | nindent 4 }}
spec:
  replicas: {{ .Values.replicaCount }}
  selector:
    matchLabels:
      {{- include "mychart.selectorLabels" . | nindent 6 }}
  template:
    metadata:
      labels:
        {{- include "mychart.selectorLabels" . | nindent 8 }}
    spec:
      containers:
      - name: {{ .Chart.Name }}
        image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
        ports:
        - containerPort: {{ .Values.service.port }}

values.yaml:

replicaCount: 3
image:
  repository: nginx
  tag: latest
service:
  type: ClusterIP
  port: 80

Helm Commands:

# Install chart
helm install myapp ./mychart

# Upgrade release
helm upgrade myapp ./mychart

# Rollback
helm rollback myapp 1

# Template rendering
helm template ./mychart

# Package chart
helm package ./mychart

10.8 Operators Pattern

Operators automate application management using Kubernetes custom resources.

What are Operators?

Operators encode human operational knowledge into software to:

  • Deploy applications
  • Handle backups
  • Perform upgrades
  • Respond to failures

Operator Pattern:

Custom Resource (CR) → Operator → Manage application
     ↑                    ↓
User defines          Actual state
desired state          reconciled

Example: Prometheus Operator

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: main
spec:
  replicas: 2
  resources:
    requests:
      memory: 400Mi
  alerting:
    alertmanagers:
    - namespace: monitoring
      name: alertmanager-main
      port: web

Building Operators:

  • Operator SDK: Framework for building
  • Kubebuilder: Kubernetes API extensions
  • Metacontroller: Simple operators

Operator Best Practices:

  1. Idempotent: Safe to run repeatedly
  2. Self-healing: React to changes
  3. Upgradeable: Handle version upgrades
  4. Observable: Emit metrics/events
  5. Testable: Comprehensive testing

10.9 Custom Resource Definitions (CRD)

CRDs extend Kubernetes API with custom resources.

CRD Example:

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: databases.example.com
spec:
  group: example.com
  names:
    kind: Database
    plural: databases
    singular: database
    shortNames:
    - db
  scope: Namespaced
  versions:
  - name: v1
    served: true
    storage: true
    schema:
      openAPIV3Schema:
        type: object
        properties:
          spec:
            type: object
            properties:
              engine:
                type: string
                enum: ["mysql", "postgres"]
              version:
                type: string
              size:
                type: string
                pattern: '^[0-9]+Gi$'

Using Custom Resource:

apiVersion: example.com/v1
kind: Database
metadata:
  name: mydb
spec:
  engine: postgres
  version: "13"
  size: 10Gi
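The API server enforces the openAPIV3Schema above at admission time, rejecting custom resources whose spec violates the declared constraints. A toy validation mirroring that schema (the checks are illustrative, not the API server's actual validator):

```python
import re

# Checks mirroring the Database CRD schema: engine enum,
# string version, and a size matching ^[0-9]+Gi$
SCHEMA = {
    "engine": lambda v: v in ("mysql", "postgres"),
    "version": lambda v: isinstance(v, str),
    "size": lambda v: isinstance(v, str) and re.fullmatch(r"[0-9]+Gi", v) is not None,
}

def validate(spec: dict) -> list[str]:
    """Return the names of spec fields that violate the schema."""
    return [field for field, check in SCHEMA.items()
            if field in spec and not check(spec[field])]

print(validate({"engine": "postgres", "version": "13", "size": "10Gi"}))  # → []
print(validate({"engine": "oracle", "size": "10GB"}))  # → ['engine', 'size']
```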

10.10 Cluster Hardening

Security Best Practices:

API Server Security:

  • Enable RBAC
  • Use TLS for all communication
  • Enable audit logging
  • Disable anonymous auth
# kube-apiserver flags
--authorization-mode=Node,RBAC
--anonymous-auth=false
--audit-log-path=/var/log/kubernetes/audit.log
--enable-admission-plugins=NamespaceLifecycle,PodSecurityPolicy

etcd Security:

  • Encrypt secrets at rest
  • TLS for peer/client communication
  • Firewall access
  • Regular backups

Node Security:

  • Minimal host OS
  • Regular security updates
  • CIS benchmarks
  • Disable SSH or use bastion

Pod Security:

Pod Security Standards (PodSecurity admission):

apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: baseline

Pod Security Policies (deprecated in 1.21):

apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: restricted
spec:
  privileged: false
  runAsUser:
    rule: MustRunAsNonRoot
  seLinux:
    rule: RunAsAny
  fsGroup:
    rule: MustRunAs
    ranges:
    - min: 1
      max: 65535
  volumes:
  - 'configMap'
  - 'emptyDir'
  - 'projected'
  - 'secret'
  - 'downwardAPI'
  - 'persistentVolumeClaim'

Network Security:

  • Network policies
  • Encrypted traffic (mTLS with service mesh)
  • Limit external access

Image Security:

  • Scan images for vulnerabilities
  • Use private registry
  • Sign and verify images

Chapter 11 — Kubernetes in Production

11.1 High Availability Clusters

Control Plane HA:

Load Balancer
    ↓
┌───┼───┐
API  API  API
Server Server Server
 ↓     ↓     ↓
etcd  etcd  etcd (3-5 nodes)

Requirements:

  • Odd number of etcd nodes (3, 5, or 7) so a quorum survives node failures
  • API servers behind load balancer
  • Scheduler and controller manager with leader election
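The odd-number recommendation falls out of quorum arithmetic: an n-member etcd cluster needs a majority (floor(n/2) + 1) of members up, so adding an even member adds cost without adding fault tolerance. A quick sketch:

```python
# Quorum math behind the "odd number of etcd nodes" rule.
def quorum(n: int) -> int:
    """Members required for a majority in an n-member cluster."""
    return n // 2 + 1

def fault_tolerance(n: int) -> int:
    """Members that can fail while the cluster still has a quorum."""
    return n - quorum(n)

for n in range(1, 8):
    print(f"{n} members: quorum={quorum(n)}, tolerates {fault_tolerance(n)} failure(s)")
# 3 and 4 members both tolerate 1 failure; 5 and 6 both tolerate 2 --
# the extra even member only adds cost and write latency.
```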

Node Considerations:

  • Spread across availability zones
  • Cordoning and draining for maintenance
  • PodDisruptionBudgets

PodDisruptionBudget:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: myapp
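A PodDisruptionBudget constrains voluntary evictions (node drains, not crashes): with `minAvailable: 2`, the eviction API allows at most `healthy - 2` concurrent disruptions. A minimal sketch of that arithmetic:

```python
# How a PDB with minAvailable translates into allowed voluntary disruptions.
def allowed_disruptions(healthy_pods: int, min_available: int) -> int:
    """Evictions the PDB permits right now (never negative)."""
    return max(0, healthy_pods - min_available)

# With 3 healthy replicas and minAvailable: 2, one pod may be evicted at a time.
print(allowed_disruptions(3, 2))  # → 1
# If a failure already took the app down to 2 healthy pods, drains must wait.
print(allowed_disruptions(2, 2))  # → 0
```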

11.2 Multi-Cluster Strategy

Reasons for Multi-Cluster:

  • Geographic distribution: Lower latency
  • Compliance: Data sovereignty
  • Isolation: Dev/test/prod separation
  • Scaling: Beyond single cluster limits
  • Disaster recovery: Active/passive or active/active

Multi-Cluster Patterns:

  1. Federation: Single control plane managing multiple clusters (KubeFed)

  2. Hub and Spoke: Central management with workload clusters

  3. Independent: Separate clusters with common tooling

  4. Hybrid: Mix of on-prem and cloud

Cluster API:

Declarative cluster management:

apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: my-cluster
spec:
  clusterNetwork:
    pods:
      cidrBlocks: ["192.168.0.0/16"]
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: AWSCluster
    name: my-cluster
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: AWSCluster
metadata:
  name: my-cluster
spec:
  region: us-west-2
  sshKeyName: default

11.3 Service Mesh (Istio, Linkerd)

Service meshes provide observability, security, and traffic management.

Service Mesh Architecture:

Pod
├── App Container
└── Sidecar Proxy (Envoy/Linkerd2-proxy)
     ↑
Control Plane (Istiod/Linkerd controller)

Istio Example:

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
  - reviews
  http:
  - match:
    - headers:
        end-user:
          exact: jason
    route:
    - destination:
        host: reviews
        subset: v2
  - route:
    - destination:
        host: reviews
        subset: v1

mTLS (mutual TLS):

apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system
spec:
  mtls:
    mode: STRICT  # Require mTLS

Linkerd Example:

apiVersion: policy.linkerd.io/v1beta1
kind: HTTPRoute
metadata:
  name: api-route
  namespace: emojivoto
spec:
  parentRefs:
    - name: web-svc
      kind: Service
      group: core
      port: 80
  rules:
    - matches:
        - path:
            value: "/api/vote"
      filters:
        - type: RequestRedirect
          requestRedirect:
            scheme: https

Benefits:

  • Traffic management: Canary, blue/green
  • Security: mTLS, authorization
  • Observability: Metrics, tracing, logs
  • Resilience: Retries, timeouts, circuit breakers

11.4 Autoscaling (HPA, VPA)

Horizontal Pod Autoscaler (HPA):

Scales based on metrics:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Pods
    pods:
      metric:
        name: requests_per_second
      target:
        type: AverageValue
        averageValue: 1000
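Under the hood, the HPA controller computes its target from a documented formula: desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), clamped to the min/max bounds. This sketch omits the tolerance window and stabilization behavior:

```python
import math

def hpa_desired_replicas(current_replicas, current_metric, target_metric,
                         min_replicas, max_replicas):
    """Core HPA scaling formula, clamped to the configured replica bounds."""
    desired = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, desired))

# 3 replicas at 90% average CPU against a 70% target -> scale to 4
print(hpa_desired_replicas(3, 90, 70, 3, 20))  # → 4
# Load drops to 20%: the formula says 1, but minReplicas: 3 wins
print(hpa_desired_replicas(3, 20, 70, 3, 20))  # → 3
```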

Vertical Pod Autoscaler (VPA):

Adjusts resource requests:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: app-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: app
  updatePolicy:
    updateMode: "Auto"  # Auto, Initial, Off
  resourcePolicy:
    containerPolicies:
    - containerName: '*'
      minAllowed:
        cpu: 250m
        memory: 512Mi
      maxAllowed:
        cpu: 4
        memory: 8Gi

Cluster Autoscaler:

Scales nodes based on pending pods:

# Keep a node out of scale-down (node annotation)
kubectl annotate node my-node \
  cluster-autoscaler.kubernetes.io/scale-down-disabled=true

# Prevent the autoscaler from evicting a pod (pod annotation,
# blocks scale-down of the node it runs on)
metadata:
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"

KEDA (Kubernetes Event-driven Autoscaling):

Scale based on events (Kafka, RabbitMQ, etc.):

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: kafka-scaler
spec:
  scaleTargetRef:
    name: consumer
  triggers:
  - type: kafka
    metadata:
      topic: my-topic
      bootstrapServers: kafka:9092
      consumerGroup: my-group
      lagThreshold: "10"

11.5 Observability in Kubernetes

Metrics:

  • Node metrics: CPU, memory, disk
  • Pod metrics: Resource usage
  • Custom metrics: Application-specific

Prometheus Stack:

# ServiceMonitor for Prometheus
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: app-monitor
spec:
  selector:
    matchLabels:
      app: myapp
  endpoints:
  - port: metrics
    interval: 30s

Logging:

  • Container logs: stdout/stderr
  • Node logs: kubelet, container runtime
  • Audit logs: API server activity

EFK Stack:

  • Elasticsearch: Storage and search
  • Fluentd/Fluent Bit: Log collection
  • Kibana: Visualization

Tracing:

Distributed tracing with Jaeger:

apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: simplest

OpenTelemetry:

Vendor-neutral observability:

apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: simplest
spec:
  config: |
    receivers:
      otlp:
        protocols:
          grpc:
          http:
    processors:
      memory_limiter:
        limit_mib: 512
    exporters:
      jaeger:
        endpoint: jaeger:14250
        tls:
          insecure: true
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter]
          exporters: [jaeger]

11.6 Disaster Recovery

Backup Strategies:

etcd Backup:

# Snapshot etcd
ETCDCTL_API=3 etcdctl snapshot save snapshot.db

# Restore
ETCDCTL_API=3 etcdctl snapshot restore snapshot.db

Velero (formerly Heptio Ark):

Backup and restore Kubernetes resources:

# Schedule backup
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-backup
spec:
  schedule: "0 1 * * *"
  template:
    includedNamespaces:
    - production
    ttl: 720h

Velero Commands:

# On-demand backup
velero backup create app-backup --include-namespaces production

# Restore
velero restore create --from-backup app-backup

# Schedule backup
velero schedule create daily --schedule="0 1 * * *" --include-namespaces production

DR Patterns:

Active-Passive:

  • One cluster active, one standby
  • Data replication between clusters
  • DNS switch on failure

Active-Active:

  • Multiple clusters serving traffic
  • Global load balancing
  • Data synchronization challenges

Backup and Restore:

  • Regular backups
  • Documented restore procedures
  • Test restores regularly

11.7 Cost Optimization

Resource Management:

Rightsizing:

  • Use VPA to find optimal requests
  • Analyze usage patterns
  • Remove unused resources

Node Optimization:

  • Use spot/preemptible instances for stateless workloads
  • Right-size instance types
  • Use cluster autoscaler

Kubecost:

# Kubecost deployment
helm repo add kubecost https://kubecost.github.io/cost-analyzer/
helm install kubecost kubecost/cost-analyzer

Karpenter (AWS):

Dynamic node provisioning:

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot", "on-demand"]
    - key: kubernetes.io/arch
      operator: In
      values: ["amd64", "arm64"]
  limits:
    resources:
      cpu: 1000
  provider:
    subnetSelector:
      karpenter.sh/discovery: my-cluster
    securityGroupSelector:
      karpenter.sh/discovery: my-cluster

Cost Optimization Checklist:

  • Rightsize pods (use VPA)
  • Use spot instances where possible
  • Scale down non-production clusters
  • Remove unused load balancers
  • Optimize storage (use reclaim policies)
  • Monitor and alert on cost spikes
  • Use namespace quotas
  • Implement resource limits

PART V — INFRASTRUCTURE AS CODE

Chapter 12 — Infrastructure as Code Principles

12.1 Declarative vs Imperative

Imperative Approach:

Describe how to achieve desired state:

# Create VPC
aws ec2 create-vpc --cidr-block 10.0.0.0/16

# Create subnet
aws ec2 create-subnet --vpc-id vpc-123 --cidr-block 10.0.1.0/24

# Create internet gateway
aws ec2 create-internet-gateway
aws ec2 attach-internet-gateway --vpc-id vpc-123 --internet-gateway-id igw-456

Problems:

  • Not idempotent
  • Difficult to reproduce
  • No state tracking
  • Error-prone

Declarative Approach:

Describe what you want:

# Terraform
resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
}

resource "aws_subnet" "main" {
  vpc_id     = aws_vpc.main.id
  cidr_block = "10.0.1.0/24"
}

resource "aws_internet_gateway" "main" {
  vpc_id = aws_vpc.main.id
}

Benefits:

  • Idempotent
  • Self-documenting
  • Version controllable
  • Predictable
  • Reusable

12.2 Immutable Infrastructure

Mutable Infrastructure:

  • Servers are updated in place
  • Configuration drifts over time
  • Configuration management tools fix drift
  • "Snowflake" servers

Immutable Infrastructure:

  • Never modify servers after deployment
  • Replace, don't change
  • Everything in version control
  • Identical environments
  • Easy rollback (redeploy previous version)

Benefits:

  • Consistency: All servers identical
  • Reproducibility: Recreate from scratch
  • Testing: Test immutable artifacts
  • Rollback: Deploy previous version
  • Debugging: Known state

Implementation:

Version 1:
Source → Build → Image v1 → Deploy → Running v1

Version 2:
Source → Build → Image v2 → Deploy → Running v2
                               ↓
                          Terminate v1
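The version flow above is a replace-then-terminate operation: a new fleet comes up from a freshly built image, then the old fleet is destroyed rather than patched. A toy sketch (instance IDs are illustrative):

```python
# Sketch of the replace-don't-modify deployment flow from the diagram.
def deploy_immutable(running, new_version, instance_ids):
    """Bring up the new fleet, then terminate every older fleet.
    No server is ever modified in place."""
    running[new_version] = instance_ids            # fleet from the new image
    for version in [v for v in running if v != new_version]:
        del running[version]                       # old fleet terminated, not patched
    return running

fleet = {"v1": ["i-aaa", "i-bbb"]}
print(deploy_immutable(fleet, "v2", ["i-ccc", "i-ddd"]))
# → {'v2': ['i-ccc', 'i-ddd']}
```

Rollback is the same operation in reverse: redeploy the v1 image as a new fleet.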

12.3 Idempotency

Definition: An operation is idempotent if applying it multiple times has the same effect as applying it once.

Examples:

Non-idempotent:

# Appends another line on every run
echo "new line" >> file.txt

# Fails on the second run (directory already exists)
mkdir mydir

Idempotent:

# Succeeds whether or not the file exists
touch file.txt

# Overwrites, so the end state is identical every run
echo "data" > file.txt

# Creates only if missing, succeeds either way
mkdir -p mydir

In IaC:

# Idempotent - apply converges to this state no matter how often it runs
resource "aws_instance" "web" {
  ami           = "ami-123"
  instance_type = "t2.micro"
  
  # Tags identify the managed instance
  tags = {
    Name = "web-server"
  }
}

Benefits:

  • Safe to reapply
  • Predictable outcomes
  • Easy automation
  • Self-healing
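Configuration tools are built around this property: every module checks the current state before acting and reports whether anything changed. A minimal Python sketch of an idempotent "ensure" operation, in the spirit of Ansible's lineinfile (the helper name is hypothetical):

```python
from pathlib import Path

def ensure_line(path: str, line: str) -> bool:
    """Idempotently ensure `line` appears in the file.
    Returns True if a change was made (an Ansible-style 'changed' flag)."""
    p = Path(path)
    lines = p.read_text().splitlines() if p.exists() else []
    if line in lines:
        return False          # already converged: applying again is a no-op
    lines.append(line)
    p.write_text("\n".join(lines) + "\n")
    return True

# First call mutates; every later call with the same input is a no-op.
print(ensure_line("/tmp/demo.conf", "max_clients 200"))  # True on first run
print(ensure_line("/tmp/demo.conf", "max_clients 200"))  # False thereafter
```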

12.4 State Management

State tracks resources managed by IaC.

Why State Matters:

  • Maps configuration to real resources
  • Tracks metadata and dependencies
  • Enables updates and deletion
  • Improves performance (caching)
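Conceptually, a tool like Terraform produces its plan by diffing the desired configuration against the recorded state: resources only in the config are created, resources only in the state are destroyed, and shared resources are updated if their attributes differ. A toy sketch (resource addresses are illustrative):

```python
# Minimal sketch of planning via a desired-vs-state diff.
desired = {
    "aws_vpc.main":    {"cidr_block": "10.0.0.0/16"},
    "aws_subnet.main": {"cidr_block": "10.0.1.0/24"},
}
state = {
    "aws_vpc.main":    {"cidr_block": "10.0.0.0/16"},
    "aws_subnet.old":  {"cidr_block": "10.0.9.0/24"},
}

def plan(desired, state):
    actions = {}
    for addr in desired.keys() - state.keys():
        actions[addr] = "create"        # in config, not yet real
    for addr in desired.keys() & state.keys():
        actions[addr] = "no-op" if desired[addr] == state[addr] else "update"
    for addr in state.keys() - desired.keys():
        actions[addr] = "destroy"       # real, but removed from config
    return actions

for addr, action in sorted(plan(desired, state).items()):
    print(f"{action:7s} {addr}")
```

Without the state file the tool could not know that `aws_subnet.old` is its responsibility to destroy.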

State Storage:

Local State:

terraform {
  backend "local" {
    path = "terraform.tfstate"
  }
}

  • Simple but not for teams
  • No locking
  • Easy to lose

Remote State:

# AWS S3
terraform {
  backend "s3" {
    bucket = "my-terraform-state"
    key    = "prod/network/terraform.tfstate"
    region = "us-east-1"
    
    # Enable locking
    dynamodb_table = "terraform-locks"
  }
}

Azure Storage:

terraform {
  backend "azurerm" {
    storage_account_name = "tfstate123"
    container_name       = "tfstate"
    key                  = "prod.terraform.tfstate"
    # Prefer the ARM_ACCESS_KEY environment variable over hardcoding this
    access_key           = "xxx"
  }
}

Google Cloud Storage:

terraform {
  backend "gcs" {
    bucket = "tf-state-prod"
    prefix = "terraform/state"
  }
}

State Best Practices:

  1. Remote storage: Never store state locally
  2. State locking: Prevent concurrent modifications
  3. Encryption: Encrypt state at rest
  4. Access control: Restrict who can read/write
  5. Backup: Regular state backups
  6. Isolation: Separate state per environment

Chapter 13 — IaC Tools

13.1 Terraform

HashiCorp Terraform is the most popular IaC tool.

Core Concepts:

  • Providers: AWS, Azure, GCP, Kubernetes, etc.
  • Resources: Infrastructure components
  • Data sources: Read existing resources
  • Variables: Parameterize configurations
  • Outputs: Export resource attributes
  • Modules: Reusable configurations

Basic Example:

# main.tf
terraform {
  required_version = ">= 1.0"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 4.0"
    }
  }
}

provider "aws" {
  region = var.aws_region
}

resource "aws_instance" "web" {
  ami           = data.aws_ami.ubuntu.id
  instance_type = var.instance_type
  
  tags = {
    Name        = "web-${var.environment}"
    Environment = var.environment
  }
}

data "aws_ami" "ubuntu" {
  most_recent = true
  owners      = ["099720109477"]  # Canonical
  
  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-*"]
  }
}

variable "aws_region" {
  description = "AWS region"
  type        = string
  default     = "us-east-1"
}

variable "instance_type" {
  description = "EC2 instance type"
  type        = string
}

variable "environment" {
  description = "Environment name"
  type        = string
}

output "instance_ip" {
  description = "Public IP of instance"
  value       = aws_instance.web.public_ip
}

Variables File (terraform.tfvars):

instance_type = "t3.micro"
environment   = "production"

Commands:

# Initialize (download providers)
terraform init

# Format code
terraform fmt

# Validate syntax
terraform validate

# Plan changes
terraform plan

# Apply changes
terraform apply

# Destroy resources
terraform destroy

# Show state
terraform show

# List resources
terraform state list

13.2 Ansible

Agentless configuration management and automation.

Core Concepts:

  • Playbooks: YAML files defining automation
  • Modules: Reusable units of work
  • Inventory: List of managed hosts
  • Roles: Organized playbook structure
  • Facts: System information gathered

Playbook Example:

---
- name: Configure web servers
  hosts: webservers
  become: yes
  vars:
    http_port: 80
    max_clients: 200
  
  tasks:
    - name: Ensure nginx is installed
      apt:
        name: nginx
        state: present
      when: ansible_os_family == "Debian"
    
    - name: Ensure nginx is running
      service:
        name: nginx
        state: started
        enabled: yes
    
    - name: Copy nginx configuration
      template:
        src: nginx.conf.j2
        dest: /etc/nginx/nginx.conf
      notify: restart nginx
    
    - name: Deploy website
      copy:
        src: index.html
        dest: /var/www/html/index.html
  
  handlers:
    - name: restart nginx
      service:
        name: nginx
        state: restarted

Inventory (hosts.ini):

[webservers]
web1.example.com
web2.example.com

[databases]
db1.example.com
db2.example.com

[all:vars]
ansible_user=ubuntu
ansible_ssh_private_key_file=~/.ssh/prod-key.pem

Role Structure:

roles/
└── nginx/
    ├── tasks/
    │   └── main.yml
    ├── handlers/
    │   └── main.yml
    ├── templates/
    │   └── nginx.conf.j2
    ├── files/
    │   └── index.html
    ├── vars/
    │   └── main.yml
    └── defaults/
        └── main.yml

Commands:

# Ping all hosts
ansible all -m ping

# Run ad-hoc command
ansible webservers -m command -a "uptime"

# Run playbook
ansible-playbook site.yml

# Check syntax
ansible-playbook site.yml --syntax-check

# Dry run
ansible-playbook site.yml --check

# Limit to specific hosts
ansible-playbook site.yml --limit web1

13.3 Pulumi

IaC using general-purpose programming languages.

Example (TypeScript):

import * as aws from "@pulumi/aws";
import * as pulumi from "@pulumi/pulumi";

const config = new pulumi.Config();
const instanceType = config.get("instanceType") || "t3.micro";

// Get the latest Ubuntu AMI
const ubuntu = aws.ec2.getAmi({
  mostRecent: true,
  filters: [
    {
      name: "name",
      values: ["ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-*"],
    },
  ],
  owners: ["099720109477"],
});

// Create a security group
const group = new aws.ec2.SecurityGroup("web-sg", {
  description: "Allow HTTP and SSH",
  ingress: [
    { protocol: "tcp", fromPort: 22, toPort: 22, cidrBlocks: ["0.0.0.0/0"] },
    { protocol: "tcp", fromPort: 80, toPort: 80, cidrBlocks: ["0.0.0.0/0"] },
  ],
  egress: [
    { protocol: "-1", fromPort: 0, toPort: 0, cidrBlocks: ["0.0.0.0/0"] },
  ],
});

// Create an EC2 instance
const server = new aws.ec2.Instance("web-server", {
  instanceType: instanceType,
  ami: ubuntu.then(ami => ami.id),
  vpcSecurityGroupIds: [group.id],
  userData: `#!/bin/bash
    apt-get update
    apt-get install -y nginx
    systemctl start nginx
  `,
  tags: {
    Name: "web-server",
    Environment: pulumi.getStack(),
  },
});

// Export the instance's public IP
export const publicIp = server.publicIp;
export const publicHostname = server.publicDns;

Example (Python):

import pulumi
import pulumi_aws as aws

config = pulumi.Config()
instance_type = config.get("instanceType") or "t3.micro"

# Get the latest Ubuntu AMI
ubuntu = aws.ec2.get_ami(
    most_recent=True,
    filters=[
        {
            "name": "name",
            "values": ["ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-*"]
        }
    ],
    owners=["099720109477"]
)

# Create security group
group = aws.ec2.SecurityGroup("web-sg",
    description="Allow HTTP and SSH",
    ingress=[
        {"protocol": "tcp", "from_port": 22, "to_port": 22, "cidr_blocks": ["0.0.0.0/0"]},
        {"protocol": "tcp", "from_port": 80, "to_port": 80, "cidr_blocks": ["0.0.0.0/0"]},
    ],
    egress=[
        {"protocol": "-1", "from_port": 0, "to_port": 0, "cidr_blocks": ["0.0.0.0/0"]}
    ]
)

# Create EC2 instance
server = aws.ec2.Instance("web-server",
    instance_type=instance_type,
    ami=ubuntu.id,
    vpc_security_group_ids=[group.id],
    user_data="""#!/bin/bash
        apt-get update
        apt-get install -y nginx
        systemctl start nginx
    """,
    tags={
        "Name": "web-server",
        "Environment": pulumi.get_stack()
    }
)

pulumi.export("public_ip", server.public_ip)
pulumi.export("public_hostname", server.public_dns)

Benefits:

  • Use familiar programming languages
  • Loops, conditionals, functions
  • Strong typing (TypeScript, Go)
  • Reuse existing code/libraries
  • Better IDE support

13.4 CloudFormation

AWS-native IaC tool.

Template Structure:

AWSTemplateFormatVersion: "2010-09-09"
Description: "Web server stack"

Parameters:
  InstanceType:
    Description: EC2 instance type
    Type: String
    Default: t3.micro
    AllowedValues:
      - t3.micro
      - t3.small
      - t3.medium

Mappings:
  RegionMap:
    us-east-1:
      AMI: ami-0c02fb55956c7d316  # replace with a current Ubuntu 20.04 AMI for this region
    us-west-2:
      AMI: ami-0d6621c01e8c2de54

Resources:
  WebServerSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Allow HTTP and SSH
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 22
          ToPort: 22
          CidrIp: 0.0.0.0/0
        - IpProtocol: tcp
          FromPort: 80
          ToPort: 80
          CidrIp: 0.0.0.0/0

  WebServer:
    Type: AWS::EC2::Instance
    Properties:
      ImageId: !FindInMap [RegionMap, !Ref "AWS::Region", AMI]
      InstanceType: !Ref InstanceType
      SecurityGroupIds:
        - !Ref WebServerSecurityGroup
      UserData:
        Fn::Base64: !Sub |
          #!/bin/bash
          apt-get update
          apt-get install -y nginx
          systemctl start nginx
      Tags:
        - Key: Name
          Value: WebServer

Outputs:
  PublicIP:
    Description: Public IP of web server
    Value: !GetAtt WebServer.PublicIp
  PublicDNS:
    Description: Public DNS of web server
    Value: !GetAtt WebServer.PublicDnsName

StackSets: Deploy across multiple regions/accounts.

Change Sets: Preview changes before applying.

13.5 Remote State Backends

Terraform Backends:

S3 Backend:

terraform {
  backend "s3" {
    bucket         = "company-terraform-state"
    key            = "prod/network/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"
  }
}

DynamoDB Lock Table:

resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"
  
  attribute {
    name = "LockID"
    type = "S"
  }
}

Azure Backend:

terraform {
  backend "azurerm" {
    resource_group_name  = "terraform-state"
    storage_account_name = "tfstate123"
    container_name       = "tfstate"
    key                  = "prod.terraform.tfstate"
  }
}

GCS Backend:

terraform {
  backend "gcs" {
    bucket = "terraform-state-prod"
    prefix = "network"
  }
}

State Isolation Strategies:

  1. Workspaces: Same config, separate state
  2. Directory structure: Different configs per environment
  3. Terragrunt: DRY configurations

Workspaces:

# Create workspace
terraform workspace new dev
terraform workspace new prod

# List workspaces
terraform workspace list

# Switch workspace
terraform workspace select prod

# Use in config
locals {
  environment = terraform.workspace
}

13.6 Modules & Reusability

Module Structure:

modules/
└── webserver/
    ├── main.tf
    ├── variables.tf
    ├── outputs.tf
    └── README.md

Module Code (main.tf):

resource "aws_instance" "web" {
  ami           = var.ami
  instance_type = var.instance_type
  subnet_id     = var.subnet_id
  
  vpc_security_group_ids = [aws_security_group.web.id]
  
  user_data = var.user_data
  
  tags = var.tags
}

resource "aws_security_group" "web" {
  name_prefix = "${var.name}-sg"
  vpc_id      = var.vpc_id
  
  dynamic "ingress" {
    for_each = var.ingress_rules
    content {
      from_port   = ingress.value.from_port
      to_port     = ingress.value.to_port
      protocol    = ingress.value.protocol
      cidr_blocks = ingress.value.cidr_blocks
    }
  }
  
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
  
  tags = var.tags
}

variables.tf:

variable "name" {
  description = "Name prefix for resources"
  type        = string
}

variable "ami" {
  description = "AMI ID for the instance"
  type        = string
}

variable "instance_type" {
  description = "Instance type"
  type        = string
  default     = "t3.micro"
}

variable "subnet_id" {
  description = "Subnet ID for the instance"
  type        = string
}

variable "vpc_id" {
  description = "VPC ID for security group"
  type        = string
}

variable "user_data" {
  description = "User data script"
  type        = string
  default     = ""
}

variable "ingress_rules" {
  description = "List of ingress rules"
  type = list(object({
    from_port   = number
    to_port     = number
    protocol    = string
    cidr_blocks = list(string)
  }))
  default = [
    {
      from_port   = 80
      to_port     = 80
      protocol    = "tcp"
      cidr_blocks = ["0.0.0.0/0"]
    }
  ]
}

variable "tags" {
  description = "Tags to apply"
  type        = map(string)
  default     = {}
}

outputs.tf:

output "instance_id" {
  description = "Instance ID"
  value       = aws_instance.web.id
}

output "public_ip" {
  description = "Public IP address"
  value       = aws_instance.web.public_ip
}

output "security_group_id" {
  description = "Security group ID"
  value       = aws_security_group.web.id
}

Using the Module:

module "web_server" {
  source = "../modules/webserver"
  
  name        = "prod-web"
  ami         = data.aws_ami.ubuntu.id
  instance_type = "t3.small"
  subnet_id   = aws_subnet.public.id
  vpc_id      = aws_vpc.main.id
  
  ingress_rules = [
    {
      from_port   = 80
      to_port     = 80
      protocol    = "tcp"
      cidr_blocks = ["0.0.0.0/0"]
    },
    {
      from_port   = 443
      to_port     = 443
      protocol    = "tcp"
      cidr_blocks = ["0.0.0.0/0"]
    }
  ]
  
  tags = {
    Environment = "production"
    ManagedBy   = "terraform"
  }
}

output "web_ip" {
  value = module.web_server.public_ip
}

13.7 Policy as Code

Enforce policies on infrastructure.

Sentinel (HashiCorp):

# Restrict instance types
import "tfplan"

main = rule {
  all tfplan.resources.aws_instance as _, instances {
    all instances as _, instance {
      instance.applied.instance_type in ["t3.micro", "t3.small"]
    }
  }
}

Open Policy Agent (OPA):

Rego policy:

package terraform

deny[msg] {
  resource := input.resource_changes[_]
  resource.type == "aws_instance"
  resource.change.after.instance_type == "t3.large"
  msg := sprintf("Instance type t3.large not allowed in %v", [resource.address])
}

deny[msg] {
  resource := input.resource_changes[_]
  resource.type == "aws_s3_bucket"
  count([pab | pab := input.resource_changes[_]; pab.type == "aws_s3_bucket_public_access_block"]) == 0
  msg := sprintf("S3 bucket %v requires a public access block", [resource.address])
}

Checkov:

Scan Terraform for security issues:

# Install
pip install checkov

# Scan
checkov -d ./

# Scan specific file
checkov -f main.tf

# Output formats
checkov -d ./ --output junitxml > results.xml

Example Check:

# Custom check
from checkov.common.models.enums import CheckResult, CheckCategories
from checkov.terraform.checks.resource.base_resource_check import BaseResourceCheck

class S3PublicACL(BaseResourceCheck):
    def __init__(self):
        name = "Ensure S3 bucket has no public ACL"
        id = "CUSTOM_AWS_001"
        supported_resources = ['aws_s3_bucket']
        categories = [CheckCategories.SECURITY]
        super().__init__(name=name, id=id, categories=categories, supported_resources=supported_resources)

    def scan_resource_conf(self, conf):
        if 'acl' in conf and conf['acl'] in (['public-read'], ['public-read-write']):
            return CheckResult.FAILED
        return CheckResult.PASSED

check = S3PublicACL()

PART VI — CLOUD PLATFORMS

Chapter 14 — Cloud Fundamentals

14.1 IaaS, PaaS, SaaS

Infrastructure as a Service (IaaS):

  • Virtual machines, storage, networks
  • You manage OS, middleware, runtime, data, apps
  • Provider manages virtualization, servers, storage, networking

Examples: AWS EC2, Azure VMs, Google Compute Engine

Platform as a Service (PaaS):

  • Managed runtime environment
  • You manage data and apps
  • Provider manages everything else

Examples: Heroku, Google App Engine, AWS Elastic Beanstalk

Software as a Service (SaaS):

  • Complete application
  • You just use it
  • Provider manages everything

Examples: Salesforce, Office 365, Google Workspace

Function as a Service (FaaS):

  • Serverless functions
  • You write code, provider runs it
  • Pay per execution

Examples: AWS Lambda, Azure Functions, Google Cloud Functions

14.2 Public vs Private vs Hybrid

Public Cloud:

  • Shared infrastructure
  • Multi-tenant
  • Pay-as-you-go
  • Global scale
  • Examples: AWS, Azure, GCP

Private Cloud:

  • Dedicated infrastructure
  • Single tenant
  • More control
  • Compliance benefits
  • Examples: OpenStack, VMware

Hybrid Cloud:

  • Mix of public and private
  • Workload mobility
  • Data locality options
  • Burst to public cloud

Multi-Cloud:

  • Multiple public cloud providers
  • Avoid vendor lock-in
  • Best-of-breed services
  • Geographic presence

14.3 Cloud Networking

Virtual Private Cloud (VPC):

Isolated network section:

VPC (10.0.0.0/16)
├── Public Subnet (10.0.1.0/24)
│   └── Internet Gateway
├── Private Subnet (10.0.2.0/24)
│   └── NAT Gateway
└── Database Subnet (10.0.3.0/24)
    └── No internet access
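The CIDR layout above can be validated with Python's standard ipaddress module: every subnet must nest inside the VPC CIDR, and subnets must not overlap one another.

```python
import ipaddress

# The VPC layout from the diagram above.
vpc = ipaddress.ip_network("10.0.0.0/16")
subnets = {
    "public":   ipaddress.ip_network("10.0.1.0/24"),
    "private":  ipaddress.ip_network("10.0.2.0/24"),
    "database": ipaddress.ip_network("10.0.3.0/24"),
}

# Every subnet must fall inside the VPC CIDR.
for name, net in subnets.items():
    assert net.subnet_of(vpc), f"{name} falls outside the VPC"

# No two subnets may overlap.
nets = list(subnets.values())
for i, a in enumerate(nets):
    for b in nets[i + 1:]:
        assert not a.overlaps(b), f"{a} and {b} overlap"

print("layout valid:", vpc.num_addresses, "addresses,",
      sum(n.num_addresses for n in nets), "allocated")
```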

Key Components:

  • Subnets: Network segments
  • Route tables: Traffic routing
  • Internet Gateway: Public internet access
  • NAT Gateway: Private subnet outbound access
  • VPN Gateway: On-premises connection
  • Load Balancers: Traffic distribution
  • CDN: Content delivery

Network Security:

  • Security Groups: Instance-level firewall (stateful)
  • Network ACLs: Subnet-level firewall (stateless)
  • WAF: Web application firewall
  • DDoS protection: Shield, Cloudflare

14.4 IAM Concepts

Identity and Access Management (IAM):

Core Components:

  • Users: Individual people/accounts
  • Groups: Collections of users
  • Roles: Temporary permissions
  • Policies: Permission documents
  • Permissions: Allow/deny actions

IAM Policy Example (AWS):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket",
        "s3:GetObject"
      ],
      "Resource": [
        "arn:aws:s3:::my-bucket",
        "arn:aws:s3:::my-bucket/*"
      ],
      "Condition": {
        "IpAddress": {
          "aws:SourceIp": "192.168.1.0/24"
        }
      }
    }
  ]
}

Least Privilege Principle:

  • Grant minimum necessary permissions
  • Regularly audit permissions
  • Use groups and roles
  • Avoid wildcards when possible
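AWS evaluates policies with a fixed precedence: an explicit Deny always wins, otherwise an explicit Allow grants access, otherwise the request is implicitly denied. A minimal sketch of that decision order (statements are simplified dicts; real evaluation also matches resources and conditions):

```python
# IAM decision order: explicit Deny > explicit Allow > implicit deny.
def evaluate(statements, action):
    decision = "ImplicitDeny"
    for stmt in statements:
        if action in stmt["Action"]:
            if stmt["Effect"] == "Deny":
                return "Deny"        # explicit deny short-circuits everything
            decision = "Allow"       # remember the allow, keep scanning for denies
    return decision

policy = [
    {"Effect": "Allow", "Action": ["s3:GetObject", "s3:ListBucket"]},
    {"Effect": "Deny",  "Action": ["s3:DeleteObject"]},
]

print(evaluate(policy, "s3:GetObject"))     # → Allow
print(evaluate(policy, "s3:DeleteObject"))  # → Deny
print(evaluate(policy, "s3:PutObject"))     # → ImplicitDeny
```

The implicit-deny default is what makes least privilege workable: anything not explicitly allowed is already denied.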

Identity Federation:

  • SAML 2.0 (Active Directory)
  • OIDC (Google, GitHub)
  • Social logins

Chapter 15 — Amazon Web Services

15.1 Amazon Web Services Overview

AWS is the leading cloud provider with the broadest service portfolio.

Global Infrastructure:

  • Regions: Geographic areas (us-east-1, eu-west-1)
  • Availability Zones: Isolated data centers per region
  • Edge Locations: CDN endpoints
  • Local Zones: Extend regions to population centers

Service Categories:

  • Compute
  • Storage
  • Database
  • Networking
  • Security & Identity
  • Analytics
  • Machine Learning
  • Developer Tools
  • Management & Governance

15.2 EC2 (Elastic Compute Cloud)

Virtual servers in the cloud.

Instance Types:

  • General Purpose: t3, m5 (balanced)
  • Compute Optimized: c5 (CPU intensive)
  • Memory Optimized: r5, x1 (RAM intensive)
  • Storage Optimized: i3, d2 (disk I/O)
  • GPU Instances: p3, g4 (graphics, ML)

Launch Configuration:

resource "aws_instance" "web" {
  ami           = "ami-0c02fb55956c7d316"
  instance_type = "t3.micro"
  
  subnet_id                   = aws_subnet.public.id
  vpc_security_group_ids      = [aws_security_group.web.id]
  associate_public_ip_address = true
  
  user_data = <<-EOF
    #!/bin/bash
    yum update -y
    yum install -y httpd
    systemctl start httpd
    systemctl enable httpd
    echo "<h1>Hello from $(hostname -f)</h1>" > /var/www/html/index.html
  EOF
  
  tags = {
    Name = "web-server"
  }
}

Purchase Options:

  • On-Demand: Pay by hour/second
  • Reserved: 1-3 year commitment, up to 75% discount
  • Spot: Bid for unused capacity, up to 90% discount
  • Savings Plans: Flexible pricing
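The trade-off between these options is easiest to see with back-of-the-envelope math. All rates below are made-up assumptions for illustration, not AWS quotes:

```python
# Illustrative monthly cost comparison; every number here is an assumption.
on_demand_hourly = 0.0416      # hypothetical small-instance rate, USD/hr
hours_per_month = 730
reserved_discount = 0.60       # assumption: ~60% off with a commitment
spot_discount = 0.90           # assumption: ~90% off spot (interruptible)

on_demand = on_demand_hourly * hours_per_month
print(f"on-demand: ${on_demand:.2f}/month")
print(f"reserved : ${on_demand * (1 - reserved_discount):.2f}/month")
print(f"spot     : ${on_demand * (1 - spot_discount):.2f}/month")
```

The catch: spot capacity can be reclaimed with short notice, so it suits stateless or fault-tolerant workloads, while reserved pricing only pays off if the instance actually runs for the commitment period.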

15.3 S3 (Simple Storage Service)

Object storage for the cloud.

Storage Classes:

  • S3 Standard: Frequently accessed data
  • S3 Intelligent-Tiering: Auto-tiering
  • S3 Standard-IA: Infrequent access
  • S3 One Zone-IA: Lower cost, less durable
  • S3 Glacier: Archive (minutes to hours retrieval)
  • S3 Glacier Deep Archive: Long-term archive (hours retrieval)

Bucket Example:

resource "aws_s3_bucket" "data" {
  bucket = "my-company-data-${var.environment}"
  
  tags = {
    Environment = var.environment
  }
}

resource "aws_s3_bucket_versioning" "data" {
  bucket = aws_s3_bucket.data.id
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "data" {
  bucket = aws_s3_bucket.data.id
  
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}

resource "aws_s3_bucket_public_access_block" "data" {
  bucket = aws_s3_bucket.data.id
  
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

CLI Commands:

# List buckets
aws s3 ls

# Copy file
aws s3 cp file.txt s3://my-bucket/

# Sync directory
aws s3 sync ./local s3://my-bucket/

# Set lifecycle policy
aws s3api put-bucket-lifecycle-configuration \
  --bucket my-bucket \
  --lifecycle-configuration file://lifecycle.json

15.4 RDS (Relational Database Service)

Managed relational databases.

Supported Engines:

  • Amazon Aurora (MySQL/PostgreSQL compatible)
  • MySQL
  • PostgreSQL
  • MariaDB
  • Oracle
  • SQL Server

Example (PostgreSQL):

resource "aws_db_instance" "postgres" {
  identifier = "myapp-${var.environment}"
  
  engine         = "postgres"
  engine_version = "13.7"
  instance_class = "db.t3.micro"
  
  allocated_storage     = 20
  storage_type          = "gp3"
  storage_encrypted     = true
  
  db_name  = "myapp"
  username = "admin"
  password = random_password.db_password.result
  
  vpc_security_group_ids = [aws_security_group.database.id]
  db_subnet_group_name   = aws_db_subnet_group.main.name
  
  backup_retention_period = 30
  backup_window           = "03:00-04:00"
  maintenance_window      = "sun:04:00-sun:05:00"
  
  skip_final_snapshot = false
  final_snapshot_identifier = "myapp-${var.environment}-final-${formatdate("YYYY-MM-DD-hhmm", timestamp())}"
  
  tags = {
    Environment = var.environment
  }
}

resource "random_password" "db_password" {
  length  = 32
  special = false
}

Aurora Serverless:

resource "aws_rds_cluster" "aurora" {
  cluster_identifier = "aurora-serverless-${var.environment}"
  engine             = "aurora-postgresql"
  engine_version     = "13.6"
  database_name      = "myapp"
  master_username    = "admin"
  master_password    = random_password.db_password.result
  
  serverlessv2_scaling_configuration {
    min_capacity = 0.5
    max_capacity = 8
  }
  
  vpc_security_group_ids = [aws_security_group.database.id]
  db_subnet_group_name   = aws_db_subnet_group.main.name
  
  backup_retention_period = 7
  
  skip_final_snapshot = false
  final_snapshot_identifier = "aurora-${var.environment}-final"
}

# Serverless v2 also needs at least one cluster instance using the special
# "db.serverless" instance class
resource "aws_rds_cluster_instance" "aurora" {
  cluster_identifier = aws_rds_cluster.aurora.id
  instance_class     = "db.serverless"
  engine             = aws_rds_cluster.aurora.engine
  engine_version     = aws_rds_cluster.aurora.engine_version
}

15.5 VPC (Virtual Private Cloud)

Isolated network environment.

Complete VPC Example:

# VPC
resource "aws_vpc" "main" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_hostnames = true
  enable_dns_support   = true
  
  tags = {
    Name = "main-${var.environment}"
  }
}

# Public subnets
resource "aws_subnet" "public" {
  count = length(var.availability_zones)
  
  vpc_id                  = aws_vpc.main.id
  cidr_block              = "10.0.${count.index}.0/24"
  availability_zone       = var.availability_zones[count.index]
  map_public_ip_on_launch = true
  
  tags = {
    Name = "public-${var.availability_zones[count.index]}"
  }
}

# Private subnets
resource "aws_subnet" "private" {
  count = length(var.availability_zones)
  
  vpc_id            = aws_vpc.main.id
  cidr_block        = "10.0.${count.index + 10}.0/24"
  availability_zone = var.availability_zones[count.index]
  
  tags = {
    Name = "private-${var.availability_zones[count.index]}"
  }
}

# Internet Gateway
resource "aws_internet_gateway" "main" {
  vpc_id = aws_vpc.main.id
  
  tags = {
    Name = "main-igw"
  }
}

# NAT Gateways (one per AZ)
resource "aws_eip" "nat" {
  count  = length(var.availability_zones)
  domain = "vpc" # the older "vpc = true" argument is deprecated in AWS provider v5+
  
  tags = {
    Name = "nat-${var.availability_zones[count.index]}"
  }
}

resource "aws_nat_gateway" "main" {
  count = length(var.availability_zones)
  
  allocation_id = aws_eip.nat[count.index].id
  subnet_id     = aws_subnet.public[count.index].id
  
  tags = {
    Name = "nat-${var.availability_zones[count.index]}"
  }
  
  depends_on = [aws_internet_gateway.main]
}

# Route tables
resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id
  
  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.main.id
  }
  
  tags = {
    Name = "public"
  }
}

resource "aws_route_table" "private" {
  count = length(var.availability_zones)
  
  vpc_id = aws_vpc.main.id
  
  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.main[count.index].id
  }
  
  tags = {
    Name = "private-${var.availability_zones[count.index]}"
  }
}

# Route table associations
resource "aws_route_table_association" "public" {
  count = length(var.availability_zones)
  
  subnet_id      = aws_subnet.public[count.index].id
  route_table_id = aws_route_table.public.id
}

resource "aws_route_table_association" "private" {
  count = length(var.availability_zones)
  
  subnet_id      = aws_subnet.private[count.index].id
  route_table_id = aws_route_table.private[count.index].id
}
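The `cidr_block` expressions in the subnet resources above carve /24 networks out of the /16 VPC, with private subnets offset by 10. The same arithmetic can be reproduced with Python's standard `ipaddress` module; the AZ names are assumed examples.

```python
# Reproduce the subnet numbering used in the Terraform above: public
# subnets take 10.0.0.0/24, 10.0.1.0/24, ... and private subnets are
# offset by 10 (10.0.10.0/24, 10.0.11.0/24, ...).
import ipaddress

vpc = ipaddress.ip_network("10.0.0.0/16")
azs = ["us-east-1a", "us-east-1b", "us-east-1c"]  # assumed AZ list

subnets = list(vpc.subnets(new_prefix=24))  # all 256 possible /24s

public  = {az: subnets[i]      for i, az in enumerate(azs)}
private = {az: subnets[i + 10] for i, az in enumerate(azs)}

print(public["us-east-1a"])   # 10.0.0.0/24
print(private["us-east-1a"])  # 10.0.10.0/24
```

Keeping the offset well above the public range leaves room to add AZs later without renumbering.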

15.6 IAM (Identity and Access Management)

IAM User and Group:

# Create group
resource "aws_iam_group" "developers" {
  name = "developers"
}

# Create user
resource "aws_iam_user" "john" {
  name = "john.doe"
  path = "/developers/"
}

# Add user to group
resource "aws_iam_group_membership" "developers" {
  name = "developers-group-membership"
  
  users = [
    aws_iam_user.john.name,
  ]
  
  group = aws_iam_group.developers.name
}

# Group policy
resource "aws_iam_group_policy" "developers_policy" {
  name  = "developers-policy"
  group = aws_iam_group.developers.name
  
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "ec2:Describe*",
          "s3:ListBucket",
        ]
        Resource = "*"
      }
    ]
  })
}

IAM Role for EC2:

# Role
resource "aws_iam_role" "ec2_role" {
  name = "ec2-role"
  
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Principal = {
          Service = "ec2.amazonaws.com"
        }
        Action = "sts:AssumeRole"
      }
    ]
  })
}

# Policy attachment
resource "aws_iam_role_policy_attachment" "s3_read" {
  role       = aws_iam_role.ec2_role.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess"
}

# Instance profile
resource "aws_iam_instance_profile" "ec2_profile" {
  name = "ec2-profile"
  role = aws_iam_role.ec2_role.name
}

15.7 EKS (Elastic Kubernetes Service)

Managed Kubernetes on AWS.

EKS Cluster:

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "19.0.0"
  
  cluster_name    = "myapp-${var.environment}"
  cluster_version = "1.24"
  
  vpc_id     = aws_vpc.main.id
  subnet_ids = concat(aws_subnet.public[*].id, aws_subnet.private[*].id)
  
  # Managed node groups
  eks_managed_node_groups = {
    main = {
      desired_size = 3
      min_size     = 1
      max_size     = 10
      
      instance_types = ["t3.medium"]
      
      tags = {
        Environment = var.environment
      }
    }
  }
  
  # Fargate profiles (serverless)
  fargate_profiles = {
    default = {
      name = "default"
      selectors = [
        {
          namespace = "default"
        }
      ]
    }
  }
  
  tags = {
    Environment = var.environment
  }
}

# Configure kubectl: the EKS module no longer exposes a kubeconfig output
# in v18+, so generate one with the AWS CLI instead:
#   aws eks update-kubeconfig --region <region> --name myapp-<environment>

Access Entry (EKS API):

resource "aws_eks_access_entry" "admin" {
  cluster_name  = module.eks.cluster_name
  principal_arn = "arn:aws:iam::123456789:role/Admin"
  type          = "STANDARD"
}

resource "aws_eks_access_policy_association" "admin" {
  cluster_name  = module.eks.cluster_name
  policy_arn    = "arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy"
  principal_arn = aws_eks_access_entry.admin.principal_arn
  
  access_scope {
    type = "cluster"
  }
}

Chapter 16 — Microsoft Azure

16.1 Microsoft Azure Overview

Azure is Microsoft's cloud platform, strong in enterprise integration.

Global Infrastructure:

  • 60+ regions worldwide
  • Availability Zones
  • ExpressRoute private connections

Key Services:

  • Azure Virtual Machines (IaaS)
  • Azure Kubernetes Service (AKS)
  • Azure App Service (PaaS)
  • Azure SQL Database
  • Azure DevOps

16.2 Virtual Machines

VM Deployment:

# Terraform AzureRM provider
provider "azurerm" {
  features {}
}

resource "azurerm_resource_group" "main" {
  name     = "myapp-${var.environment}-rg"
  location = var.location
}

resource "azurerm_virtual_network" "main" {
  name                = "myapp-${var.environment}-vnet"
  address_space       = ["10.0.0.0/16"]
  location            = azurerm_resource_group.main.location
  resource_group_name = azurerm_resource_group.main.name
}

resource "azurerm_subnet" "internal" {
  name                 = "internal"
  resource_group_name  = azurerm_resource_group.main.name
  virtual_network_name = azurerm_virtual_network.main.name
  address_prefixes     = ["10.0.2.0/24"]
}

resource "azurerm_public_ip" "vm" {
  name                = "vm-public-ip"
  resource_group_name = azurerm_resource_group.main.name
  location            = azurerm_resource_group.main.location
  allocation_method   = "Dynamic"
}

resource "azurerm_network_interface" "main" {
  name                = "vm-nic"
  resource_group_name = azurerm_resource_group.main.name
  location            = azurerm_resource_group.main.location
  
  ip_configuration {
    name                          = "internal"
    subnet_id                     = azurerm_subnet.internal.id
    private_ip_address_allocation = "Dynamic"
    public_ip_address_id          = azurerm_public_ip.vm.id
  }
}

resource "azurerm_linux_virtual_machine" "main" {
  name                = "vm-${var.environment}"
  resource_group_name = azurerm_resource_group.main.name
  location            = azurerm_resource_group.main.location
  size                = "Standard_B2s"
  admin_username      = "azureuser"
  
  network_interface_ids = [
    azurerm_network_interface.main.id,
  ]
  
  admin_ssh_key {
    username   = "azureuser"
    public_key = file("~/.ssh/id_rsa.pub")
  }
  
  source_image_reference {
    publisher = "Canonical"
    offer     = "0001-com-ubuntu-server-focal"
    sku       = "20_04-lts"
    version   = "latest"
  }
  
  os_disk {
    caching              = "ReadWrite"
    storage_account_type = "Standard_LRS"
  }
  
  tags = {
    environment = var.environment
  }
}

16.3 Azure Kubernetes Service (AKS)

AKS Cluster:

resource "azurerm_kubernetes_cluster" "main" {
  name                = "aks-${var.environment}"
  location            = azurerm_resource_group.main.location
  resource_group_name = azurerm_resource_group.main.name
  dns_prefix          = "myapp-${var.environment}"
  
  default_node_pool {
    name       = "default"
    node_count = 3
    vm_size    = "Standard_DS2_v2"
    
    enable_auto_scaling = true
    min_count           = 1
    max_count           = 5
  }
  
  identity {
    type = "SystemAssigned"
  }
  
  network_profile {
    network_plugin = "azure"
    network_policy = "calico"
  }
  
  role_based_access_control_enabled = true
  
  azure_active_directory_role_based_access_control {
    managed            = true
    azure_rbac_enabled = true
  }
  
  tags = {
    Environment = var.environment
  }
}

# Get credentials
resource "local_file" "kubeconfig" {
  content  = azurerm_kubernetes_cluster.main.kube_config_raw
  filename = "./kubeconfig_aks_${var.environment}"
}

AKS with Availability Zones:

resource "azurerm_kubernetes_cluster" "main" {
  # ... existing configuration ...
  
  default_node_pool {
    name                = "default"
    node_count          = 3
    vm_size             = "Standard_DS2_v2"
    zones               = ["1", "2", "3"] # named "availability_zones" before azurerm 3.x
    enable_node_public_ip = false
    
    upgrade_settings {
      max_surge = "33%"
    }
  }
  
  # Enable cluster autoscaler
  auto_scaler_profile {
    balance_similar_node_groups = true
    max_graceful_termination_sec = 600
  }
}

16.4 Azure DevOps Integration

Service Connection:

# azure-pipelines.yml
trigger:
- main

pool:
  vmImage: ubuntu-latest

variables:
  azureSubscription: 'my-azure-connection'
  resourceGroup: 'myapp-prod-rg'
  aksCluster: 'myapp-prod-aks'

stages:
- stage: Build
  jobs:
  - job: Build
    steps:
    - task: Docker@2
      inputs:
        containerRegistry: 'my-acr'
        repository: 'myapp'
        command: 'buildAndPush'
        Dockerfile: '**/Dockerfile'
        tags: '$(Build.BuildId)'

- stage: Deploy
  jobs:
  - deployment: Deploy
    environment: 'production'
    strategy:
      runOnce:
        deploy:
          steps:
          - task: KubernetesManifest@0
            inputs:
              action: 'deploy'
              kubernetesServiceConnection: 'my-aks-connection'
              namespace: 'default'
              manifests: 'manifests/deployment.yaml'
              containers: 'myacr.azurecr.io/myapp:$(Build.BuildId)'

16.5 Networking & Security

Virtual Network with Service Endpoints:

resource "azurerm_virtual_network" "main" {
  name                = "vnet-${var.environment}"
  location            = azurerm_resource_group.main.location
  resource_group_name = azurerm_resource_group.main.name
  address_space       = ["10.0.0.0/16"]
}

# Subnet with service endpoints
resource "azurerm_subnet" "private" {
  name                 = "private"
  resource_group_name  = azurerm_resource_group.main.name
  virtual_network_name = azurerm_virtual_network.main.name
  address_prefixes     = ["10.0.1.0/24"]
  
  service_endpoints = [
    "Microsoft.Sql",
    "Microsoft.Storage"
  ]
}

# Private endpoint for storage
resource "azurerm_private_endpoint" "storage" {
  name                = "pe-storage-${var.environment}"
  location            = azurerm_resource_group.main.location
  resource_group_name = azurerm_resource_group.main.name
  subnet_id           = azurerm_subnet.private.id
  
  private_service_connection {
    name                           = "storage-connection"
    private_connection_resource_id = azurerm_storage_account.main.id
    is_manual_connection           = false
    subresource_names              = ["blob"]
  }
}

Network Security Group:

resource "azurerm_network_security_group" "web" {
  name                = "nsg-web"
  location            = azurerm_resource_group.main.location
  resource_group_name = azurerm_resource_group.main.name
  
  security_rule {
    name                       = "HTTP"
    priority                   = 100
    direction                  = "Inbound"
    access                     = "Allow"
    protocol                   = "Tcp"
    source_port_range          = "*"
    destination_port_range     = "80"
    source_address_prefix      = "*"
    destination_address_prefix = "*"
  }
  
  security_rule {
    name                       = "HTTPS"
    priority                   = 110
    direction                  = "Inbound"
    access                     = "Allow"
    protocol                   = "Tcp"
    source_port_range          = "*"
    destination_port_range     = "443"
    source_address_prefix      = "*"
    destination_address_prefix = "*"
  }
  
  security_rule {
    name                       = "SSH"
    priority                   = 120
    direction                  = "Inbound"
    access                     = "Allow"
    protocol                   = "Tcp"
    source_port_range          = "*"
    destination_port_range     = "22"
    source_address_prefix      = "10.0.0.0/8"
    destination_address_prefix = "*"
  }
}

Chapter 17 — Google Cloud Platform

17.1 Google Cloud Platform Overview

GCP excels in data analytics, machine learning, and containers.

Global Infrastructure:

  • 30+ regions
  • 100+ edge locations
  • Global fiber network

Key Services:

  • Compute Engine (VMs)
  • Google Kubernetes Engine (GKE)
  • BigQuery (analytics)
  • Cloud Run (serverless containers)
  • Cloud Functions

17.2 Compute Engine

VM Instance:

# Terraform GCP provider
provider "google" {
  project = var.project_id
  region  = var.region
}

resource "google_compute_network" "vpc" {
  name                    = "vpc-${var.environment}"
  auto_create_subnetworks = false
}

resource "google_compute_subnetwork" "subnet" {
  name          = "subnet-${var.environment}"
  ip_cidr_range = "10.0.1.0/24"
  region        = var.region
  network       = google_compute_network.vpc.id
  
  private_ip_google_access = true
}

resource "google_compute_firewall" "ssh" {
  name    = "allow-ssh"
  network = google_compute_network.vpc.name
  
  allow {
    protocol = "tcp"
    ports    = ["22"]
  }
  
  source_ranges = ["0.0.0.0/0"]
  target_tags   = ["ssh"]
}

resource "google_compute_address" "static" {
  name = "vm-address-${var.environment}"
}

resource "google_compute_instance" "default" {
  name         = "vm-${var.environment}"
  machine_type = "e2-medium"
  zone         = var.zone
  
  tags = ["ssh", "http"]
  
  boot_disk {
    initialize_params {
      image = "ubuntu-os-cloud/ubuntu-2004-lts"
      size  = 50
      type  = "pd-ssd"
    }
  }
  
  network_interface {
    network    = google_compute_network.vpc.name
    subnetwork = google_compute_subnetwork.subnet.name
    
    access_config {
      nat_ip = google_compute_address.static.address
    }
  }
  
  metadata_startup_script = <<-EOF
    #!/bin/bash
    apt-get update
    apt-get install -y nginx
    systemctl start nginx
  EOF
  
  service_account {
    scopes = ["cloud-platform"]
  }
}

17.3 GKE (Google Kubernetes Engine)

GKE Cluster:

resource "google_container_cluster" "primary" {
  name     = "gke-${var.environment}"
  location = var.region
  
  remove_default_node_pool = true
  initial_node_count       = 1
  
  network    = google_compute_network.vpc.name
  subnetwork = google_compute_subnetwork.subnet.name
  
  # Enable Shielded Nodes
  enable_shielded_nodes = true
  
  # Release channel (RAPID, REGULAR, STABLE)
  release_channel {
    channel = "REGULAR"
  }
  
  # Private cluster
  private_cluster_config {
    enable_private_nodes    = true
    enable_private_endpoint = false
    master_ipv4_cidr_block  = "172.16.0.0/28"
  }
  
  # Network policy
  network_policy {
    enabled = true
  }
  
  # Workload identity
  workload_identity_config {
    workload_pool = "${var.project_id}.svc.id.goog"
  }
  
  maintenance_policy {
    recurring_window {
      start_time = "2023-01-01T04:00:00Z"
      end_time   = "2023-01-01T06:00:00Z"
      recurrence = "FREQ=WEEKLY;BYDAY=SA,SU"
    }
  }
}

resource "google_container_node_pool" "primary_nodes" {
  name       = "primary-pool"
  location   = var.region
  cluster    = google_container_cluster.primary.name
  node_count = 3
  
  node_config {
    machine_type = "e2-standard-4"
    
    service_account = google_service_account.gke.email
    oauth_scopes = [
      "https://www.googleapis.com/auth/cloud-platform"
    ]
    
    metadata = {
      disable-legacy-endpoints = "true"
    }
    
    labels = {
      environment = var.environment
    }
    
    tags = ["gke-node", var.environment]
    
    shielded_instance_config {
      enable_secure_boot          = true
      enable_integrity_monitoring = true
    }
    
    workload_metadata_config {
      mode = "GKE_METADATA"
    }
  }
  
  autoscaling {
    min_node_count = 1
    max_node_count = 10
  }
  
  management {
    auto_repair  = true
    auto_upgrade = true
  }
}

17.4 IAM

Service Account:

# Service account
resource "google_service_account" "gke" {
  account_id   = "gke-sa-${var.environment}"
  display_name = "GKE Service Account"
}

# IAM binding
resource "google_project_iam_member" "gke_logging" {
  project = var.project_id
  role    = "roles/logging.logWriter"
  member  = "serviceAccount:${google_service_account.gke.email}"
}

resource "google_project_iam_member" "gke_monitoring" {
  project = var.project_id
  role    = "roles/monitoring.metricWriter"
  member  = "serviceAccount:${google_service_account.gke.email}"
}

resource "google_project_iam_member" "gke_metadata" {
  project = var.project_id
  role    = "roles/stackdriver.resourceMetadata.writer"
  member  = "serviceAccount:${google_service_account.gke.email}"
}

Custom Role:

resource "google_project_iam_custom_role" "myrole" {
  role_id     = "customRole_${var.environment}"
  title       = "Custom Role"
  description = "Custom role for myapp"
  permissions = [
    "storage.buckets.get",
    "storage.objects.get",
    "storage.objects.list",
  ]
}

resource "google_project_iam_member" "custom" {
  project = var.project_id
  role    = google_project_iam_custom_role.myrole.id
  member  = "serviceAccount:${google_service_account.app.email}"
}

17.5 BigQuery

Data warehouse for analytics.

Dataset and Table:

resource "google_bigquery_dataset" "dataset" {
  dataset_id    = "myapp_${replace(var.environment, "-", "_")}"
  friendly_name = "MyApp Dataset"
  description   = "Dataset for MyApp analytics"
  location      = var.region
  
  default_table_expiration_ms = 2592000000 # 30 days
  
  labels = {
    environment = var.environment
  }
}

resource "google_bigquery_table" "events" {
  dataset_id = google_bigquery_dataset.dataset.dataset_id
  table_id   = "events"
  
  time_partitioning {
    type = "DAY"
  }
  
  clustering = ["event_type", "user_id"]
  
  schema = jsonencode([
    {
      name = "event_id"
      type = "STRING"
      mode = "REQUIRED"
    },
    {
      name = "event_type"
      type = "STRING"
      mode = "REQUIRED"
    },
    {
      name = "user_id"
      type = "STRING"
      mode = "REQUIRED"
    },
    {
      name = "timestamp"
      type = "TIMESTAMP"
      mode = "REQUIRED"
    },
    {
      name = "properties"
      type = "JSON"
      mode = "NULLABLE"
    }
  ])
}

# Authorized view
resource "google_bigquery_table" "daily_events" {
  dataset_id = google_bigquery_dataset.dataset.dataset_id
  table_id   = "daily_events"
  
  view {
    query = <<EOF
      SELECT
        DATE(timestamp) as event_date,
        event_type,
        COUNT(*) as count
      FROM `${var.project_id}.${google_bigquery_dataset.dataset.dataset_id}.events`
      GROUP BY event_date, event_type
    EOF
    
    use_legacy_sql = false
  }
}

BigQuery Query Example:

-- Top users by event count
SELECT
  user_id,
  COUNT(*) as event_count
FROM `myproject.myapp_prod.events`
WHERE timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
GROUP BY user_id
ORDER BY event_count DESC
LIMIT 10;

-- Real-time dashboard query
SELECT
  event_type,
  COUNT(*) as events,
  COUNT(DISTINCT user_id) as unique_users
FROM `myproject.myapp_prod.events`
WHERE timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
GROUP BY event_type;

PART VII — OBSERVABILITY & SRE

Chapter 18 — Monitoring & Logging

18.1 Monitoring Principles

What to Monitor:

  • Infrastructure: CPU, memory, disk, network
  • Application: Request rate, errors, latency
  • Business: Active users, revenue, conversions
  • Security: Auth failures, suspicious patterns

The Four Golden Signals (Google):

  1. Latency: Time to serve requests
  2. Traffic: How much demand
  3. Errors: Rate of failed requests
  4. Saturation: How "full" the system is

RED Method (for services):

  • Rate: Requests per second
  • Errors: Failed requests per second
  • Duration: Distribution of request latencies

USE Method (for resources):

  • Utilization: Average time resource busy
  • Saturation: Extra work resource can't handle
  • Errors: Error counts
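The RED metrics above can be computed directly from a window of request records. The sketch below uses a synthetic one-minute sample; the record fields and values are made-up data, and the median is a deliberately crude stand-in for a real latency histogram.

```python
# Compute RED metrics (Rate, Errors, Duration) over a synthetic one-minute
# window of request records. The records are made-up sample data.
requests = [
    {"status": 200, "duration_ms": 45},
    {"status": 200, "duration_ms": 60},
    {"status": 500, "duration_ms": 120},
    {"status": 200, "duration_ms": 30},
]
window_seconds = 60

rate = len(requests) / window_seconds                           # requests/sec
errors = sum(1 for r in requests if r["status"] >= 500) / window_seconds
durations = sorted(r["duration_ms"] for r in requests)
p50 = durations[len(durations) // 2]                            # crude median

print(f"Rate: {rate:.3f} req/s, Errors: {errors:.3f}/s, p50: {p50}ms")
```

In practice a monitoring system like Prometheus does this aggregation continuously; the point here is only what each RED signal measures.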

18.2 Metrics vs Logs vs Traces

Metrics:

  • Numerical measurements over time
  • Small data footprint
  • Aggregatable
  • Best for: Alerting, dashboards, trends

Examples: CPU usage, request latency p99, error rate

Logs:

  • Detailed event records
  • Text or structured data
  • Large volume
  • Best for: Debugging, audit trails, detailed analysis

Examples: Error stack traces, access logs, audit events

Traces:

  • End-to-end request paths
  • Span context
  • Show service dependencies
  • Best for: Performance analysis, distributed debugging

Examples:

  • Frontend → API → Auth → Database
  • Service call hierarchies

The Three Pillars of Observability:

Observability
├── Metrics (what's happening)
├── Logs (why it's happening)
└── Traces (where it's happening)

18.3 Prometheus

Prometheus is the leading open-source monitoring system.

Architecture:

Service → Exporter → Prometheus Server → Alertmanager
              ↑            ↓
          Service      Grafana
          Discovery

Prometheus Configuration:

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

rule_files:
  - 'alerts.yml'

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']

  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)

Exporters:

  • node_exporter: System metrics
  • blackbox_exporter: HTTP/HTTPS probing
  • mysqld_exporter: MySQL metrics
  • postgres_exporter: PostgreSQL metrics
  • nginx_exporter: Nginx metrics

PromQL (Prometheus Query Language):

# CPU usage
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Request rate
rate(http_requests_total[5m])

# Error ratio
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

# 95th percentile latency
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Memory usage
container_memory_usage_bytes{container!=""} / container_spec_memory_limit_bytes

18.4 Grafana

Visualization and dashboards.

Dashboard Example:

{
  "title": "Web Service Dashboard",
  "panels": [
    {
      "title": "Request Rate",
      "type": "graph",
      "targets": [
        {
          "expr": "rate(http_requests_total[1m])",
          "legendFormat": "{{service}}"
        }
      ]
    },
    {
      "title": "Error Rate",
      "type": "graph",
      "targets": [
        {
          "expr": "rate(http_requests_total{status=~'5..'}[1m])",
          "legendFormat": "{{service}}"
        }
      ]
    },
    {
      "title": "Latency (p99)",
      "type": "heatmap",
      "targets": [
        {
          "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))"
        }
      ]
    }
  ]
}

Grafana Datasources:

  • Prometheus
  • Elasticsearch
  • InfluxDB
  • Graphite
  • CloudWatch
  • Azure Monitor
  • Google Cloud Monitoring

18.5 ELK Stack

Elasticsearch, Logstash, Kibana for logging.

Architecture:

Logs → Filebeat → Logstash → Elasticsearch → Kibana
                    ↑
              (Processing)

Filebeat Configuration:

# filebeat.yml
filebeat.inputs:
- type: container
  paths:
    - /var/log/containers/*.log
  processors:
    - add_kubernetes_metadata:
        host: ${NODE_NAME}
        matchers:
        - logs_path:
            logs_path: "/var/log/containers/"

output.logstash:
  hosts: ["logstash:5044"]

Logstash Configuration:

input {
  beats {
    port => 5044
  }
}

filter {
  grok {
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
  
  date {
    match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
  }
  
  geoip {
    source => "clientip"
  }
}

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "logs-%{+YYYY.MM.dd}"
  }
}

Kibana Queries:

# Find errors
log_level: ERROR

# Find specific request
request_id: "abc123"

# Time range and filter
@timestamp >= "now-1h" AND kubernetes.namespace: production

# Pattern matching
message: "Failed to connect to *"

18.6 Alerting Strategies

Alert Design Principles:

  1. Actionable: Alerts should require action
  2. Urgent: Alert on imminent problems
  3. Real: Avoid false positives
  4. Understandable: Clear what's wrong
  5. Documented: Runbooks for alerts

Alert Severity Levels:

  • P0/Critical: Service down, immediate response
  • P1/High: Severe degradation, respond within hour
  • P2/Medium: Minor issues, respond within day
  • P3/Low: Informational, no response needed

Alert Rules (Prometheus):

# alerts.yml
groups:
- name: instance_alerts
  rules:
  - alert: InstanceDown
    expr: up == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }} down"
      description: "{{ $labels.instance }} has been down for more than 5 minutes."

  - alert: HighCPUUsage
    expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "High CPU usage on {{ $labels.instance }}"
      description: "CPU usage is {{ $value }}% for 10 minutes."

- name: service_alerts
  rules:
  - alert: HighErrorRate
    expr: |
      sum(rate(http_requests_total{status=~"5.."}[5m])) 
      / 
      sum(rate(http_requests_total[5m])) > 0.05
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "High error rate for {{ $labels.service }}"
      description: "Error rate is {{ $value | humanizePercentage }}"

Alertmanager Configuration:

# alertmanager.yml
global:
  slack_api_url: 'https://hooks.slack.com/services/...'

route:
  group_by: ['alertname', 'cluster']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'team-alerts'
  routes:
  - match:
      severity: critical
    receiver: 'pagerduty-critical'
    continue: true
  - match:
      severity: warning
    receiver: 'slack-warnings'

receivers:
- name: 'team-alerts'
  slack_configs:
  - channel: '#alerts'
    title: '{{ template "slack.title" . }}'
    text: '{{ template "slack.text" . }}'

- name: 'pagerduty-critical'
  pagerduty_configs:
  - service_key: '...'

- name: 'slack-warnings'
  slack_configs:
  - channel: '#warnings'
    title: '{{ template "slack.title" . }}'
    text: '{{ template "slack.text" . }}'

18.7 Incident Response

Incident Management Process:

  1. Detection: Alert triggers or user reports
  2. Triage: Assess severity and impact
  3. Response: Assign incident commander
  4. Mitigation: Stop the bleeding
  5. Resolution: Fix root cause
  6. Post-mortem: Learn and prevent

Incident Severity Matrix:

Severity Impact Response Examples
SEV1 Critical outage Immediate, all hands Site down, data loss
SEV2 Major degradation < 1 hour response Feature broken, slow
SEV3 Minor issue < 1 day response UI glitch, non-critical
SEV4 Informational Next release Cosmetic issues

Incident Commander Responsibilities:

  • Coordinate response
  • Communicate status
  • Make decisions
  • Delegate tasks
  • Track timeline

Communication Templates:

Initial Alert:

INCIDENT: {{title}}
SEVERITY: {{severity}}
TIME: {{timestamp}}
IMPACT: {{impact}}
LEAD: {{commander}}
CHANNEL: {{slack_channel}}

Status Update:

STATUS UPDATE ({{time}})
Current: {{what's happening}}
Action: {{what's being done}}
Next: {{next check-in}}

Resolution:

RESOLVED: {{title}}
TIME: {{timestamp}}
DURATION: {{duration}}
ACTION: {{mitigation}}
ROOT CAUSE: {{cause}}
POST-MORTEM: {{link}}

Chapter 19 — Site Reliability Engineering

19.1 SRE Principles

SRE applies software engineering to operations.

Core Principles (Google):

  1. Operations is a software problem: Automate away toil
  2. Manage by service level objectives: SLOs drive decisions
  3. Work to minimize toil: Spend 50% time on development
  4. Toil decreases monotonically: Operational load should shrink as the service matures
  5. Error budgets: Balance reliability and velocity
  6. Monitoring should be minimal: Alert on symptoms, not causes

SRE vs Traditional Ops:

  Aspect       Traditional Ops       SRE
  Focus        Keep systems running  Build systems that run themselves
  Change       Minimize change       Embrace change with safety
  Measurement  Uptime                Error budgets
  Work         Manual operations     Automation development
  Incidents    Fix and forget        Post-mortems and prevention

19.2 SLIs, SLOs, SLAs

Service Level Indicators (SLIs):

Metrics that measure service performance:

  • Availability: % of successful requests
  • Latency: Time to respond (e.g., p99 < 100ms)
  • Throughput: Requests per second
  • Durability: Data persistence rate
  • Correctness: % of accurate responses
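An availability SLI is just the ratio of good events to total events, which can then be compared against an SLO target. The request counts below are illustrative sample numbers.

```python
# Compute an availability SLI from request counts and check it against an
# SLO target. The counts here are illustrative sample data.
good_requests = 999_543
total_requests = 1_000_000
slo_target = 0.999  # 99.9%

sli = good_requests / total_requests
meets_slo = sli >= slo_target

print(f"Availability SLI: {sli:.4%}, meets 99.9% SLO: {meets_slo}")
```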

Service Level Objectives (SLOs):

Target values for SLIs:

"99.9% of requests complete in < 200ms over rolling 30 days"

Characteristics:

  • Specific and measurable
  • Time-bound
  • Achievable
  • Business-aligned

Service Level Agreements (SLAs):

Contracts with consequences for missing SLOs:

  • Financial penalties
  • Service credits
  • Legal implications

SLO Examples:

apiVersion: v1
kind: ServiceLevelObjective
metadata:
  name: api-availability
spec:
  service: user-api
  indicator:
    type: availability
    ratio:
      good:
        filter: "job='api' and status_code=200"
        count: successful_requests
      total:
        filter: "job='api'"
        count: total_requests
  target: 99.9%
  window: 30d
---
apiVersion: v1
kind: ServiceLevelObjective
metadata:
  name: api-latency
spec:
  service: user-api
  indicator:
    type: latency
    latency:
      threshold: 200ms
    filter: "job='api'"
  target: 99%
  window: 7d

19.3 Error Budgets

Error budgets = 100% - SLO target

Example: 99.9% SLO → 0.1% error budget

Error Budget Calculation:

Error Budget = (1 - SLO) × Total Time

For 30 days (2,592,000 seconds) with 99.9% SLO:
Error Budget = 0.001 × 2,592,000 = 2,592 seconds = 43.2 minutes
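The calculation above can be wrapped in a small helper; the function name and the 30-day default window are illustrative:

```python
def error_budget_seconds(slo: float, window_seconds: int = 30 * 24 * 3600) -> float:
    """Seconds of allowed unavailability: (1 - SLO) x window length."""
    return (1 - slo) * window_seconds

# 99.9% SLO over 30 days -> 2592 seconds (43.2 minutes)
budget = error_budget_seconds(0.999)
print(round(budget, 1), round(budget / 60, 1))
```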

Error Budget Policy:

  • While budget remains: Release velocity prioritized
  • When budget exhausted: Freeze releases, focus on reliability

Benefits:

  • Aligns Dev and Ops goals
  • Data-driven release decisions
  • Balances risk and innovation

19.4 Toil Reduction

What is Toil?

Manual, repetitive, automatable work with no enduring value.

Examples of Toil:

  • Manual deployments
  • Password resets
  • Restarting services
  • Answering repetitive questions
  • Manual data fixes

Toil Characteristics:

  1. Manual: Requires human action
  2. Repetitive: Done frequently
  3. Automatable: Could be done by machine
  4. Tactical: No lasting value
  5. Scales linearly: More work = more people

Toil Reduction Strategies:

  1. Measure toil: Track time spent
  2. Set goals: Target < 50% time on toil
  3. Automate everything: Scripts, tools, platforms
  4. Build self-service: Empower developers
  5. Improve reliability: Reduce firefighting

Toil Budget:

Time Allocation:
├── 50% max toil (operational)
└── 50% min engineering (development)
    ├── Automation
    ├── Tooling
    └── Architecture improvements

19.5 Chaos Engineering

Definition: A disciplined approach to experimenting on a system in order to identify failures before they become outages.

Principles (from Principles of Chaos):

  1. Build a hypothesis around steady state
  2. Vary real-world events
  3. Run experiments in production
  4. Automate experiments to run continuously
  5. Minimize blast radius

Chaos Engineering Tools:

  • Chaos Monkey: Random instance termination
  • Gremlin: Chaos engineering platform
  • Litmus: Kubernetes chaos
  • Chaos Mesh: Kubernetes chaos platform
  • AWS Fault Injection Simulator

Chaos Experiment Example (Chaos Mesh):

apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-example
spec:
  action: pod-kill
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: web-server
  duration: "60s"

Experiment Design:

  1. Define steady state: Normal metrics (error rate < 0.1%)
  2. Hypothesis: System survives losing one pod
  3. Run experiment: Kill one pod
  4. Prove/disprove: Did error rate spike?
  5. Fix or automate: Add redundancy or document
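The prove/disprove step can be reduced to a simple steady-state check. The function and the sample error rates below are hypothetical; in practice the samples would come from your monitoring system:

```python
def hypothesis_holds(error_rates: list[float], threshold: float = 0.001) -> bool:
    """Steady state holds if the error rate never exceeded the threshold
    during the experiment window (here: 0.1%, matching the example above)."""
    return max(error_rates) <= threshold

# Error-rate samples collected while one pod was being killed (hypothetical)
during_experiment = [0.0002, 0.0004, 0.0009, 0.0003]
print(hypothesis_holds(during_experiment))  # True -> system survived the pod kill
```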

19.6 Capacity Planning

Goals:

  • Meet demand without waste
  • Anticipate scaling needs
  • Optimize costs

Capacity Planning Process:

  1. Measure current usage: Trends, peaks
  2. Forecast demand: Business growth, seasonality
  3. Model scenarios: What-if analysis
  4. Plan capacity: When to add resources
  5. Procure/scale: Execute plan

Key Metrics:

  • Peak utilization: Max observed
  • Headroom: Buffer for spikes
  • Growth rate: % increase over time
  • Lead time: How long to add capacity

Prediction Methods:

Trend Analysis:

Future Capacity = Current Usage × (1 + Growth Rate)^Time
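The trend formula translates directly to a one-line projection (names are illustrative):

```python
def projected_capacity(current: float, growth_rate: float, periods: int) -> float:
    """Compound-growth forecast: current x (1 + rate)^periods."""
    return current * (1 + growth_rate) ** periods

# 500 req/s growing 10% per month, projected 12 months out
print(round(projected_capacity(500, 0.10, 12)))  # 1569
```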

Seasonal Patterns:

  • Daily patterns
  • Weekly patterns
  • Holiday spikes
  • Marketing campaigns

Tools:

  • Prometheus: Historical metrics
  • Grafana: Visualization
  • Forecast libraries: Prophet, statsmodels
  • Cloud auto-scaling: Dynamic capacity

PART VIII — DEVSECOPS

Chapter 20 — Secure DevOps

20.1 Threat Modeling

Identify and prioritize security threats.

Threat Modeling Process (STRIDE):

  • Spoofing: Impersonating something/someone
  • Tampering: Modifying data/code
  • Repudiation: Denying actions
  • Information Disclosure: Exposing data
  • Denial of Service: Disrupting service
  • Elevation of Privilege: Gaining unauthorized access

Common Frameworks:

PASTA (Process for Attack Simulation and Threat Analysis):

  1. Define objectives
  2. Define technical scope
  3. Decompose application
  4. Threat analysis
  5. Vulnerability analysis
  6. Attack modeling
  7. Risk analysis

Threat Modeling Example:

System: User Authentication Service

Assets:
- User credentials
- Session tokens
- Personal data

Trust Boundaries:
- Browser ↔ API
- API ↔ Database

Threats:
1. SQL Injection (Tampering)
   Mitigation: Parameterized queries, input validation

2. Session Hijacking (Spoofing)
   Mitigation: HTTPS, secure cookies, short expiration

3. Brute Force (DoS)
   Mitigation: Rate limiting, account lockout

4. Password Leak (Info Disclosure)
   Mitigation: Hashing, encryption, secure storage

20.2 Supply Chain Security

Protect against compromised dependencies and tools.

Supply Chain Attacks:

  • Dependency confusion: Malicious packages with same name
  • Typosquatting: Similar package names
  • Compromised maintainers: Attacked developer accounts
  • Build pipeline: Inject malware during build

Mitigation Strategies:

  1. Lock dependencies: Use lock files (package-lock.json)
  2. Verify integrity: Checksums, signatures
  3. Private registry: Curated packages
  4. Continuous scanning: Detect vulnerabilities
  5. Least privilege: Limit CI/CD permissions

Software Bill of Materials (SBOM):

{
  "bomFormat": "CycloneDX",
  "specVersion": "1.4",
  "version": 1,
  "components": [
    {
      "type": "library",
      "name": "lodash",
      "version": "4.17.21",
      "purl": "pkg:npm/lodash@4.17.21",
      "licenses": [{"license": {"id": "MIT"}}]
    }
  ]
}

20.3 SBOM (Software Bill of Materials)

What is SBOM?

A formal, machine-readable inventory of software components and dependencies.

SBOM Formats:

  • SPDX: Linux Foundation
  • CycloneDX: OWASP
  • SWID: ISO standard

Why SBOM Matters:

  • Know what's in your software
  • Rapid vulnerability response
  • License compliance
  • Supply chain transparency

Generating SBOM:

# Using syft
syft myapp:latest -o cyclonedx-json > sbom.json

# Using trivy
trivy image --format cyclonedx myapp:latest > sbom.json

# Using cdxgen
cdxgen -o bom.xml

20.4 Secrets Management

Never store secrets in code.

Secret Types:

  • API keys
  • Database passwords
  • TLS certificates
  • SSH keys
  • OAuth tokens

Secret Management Solutions:

HashiCorp Vault:

# Vault policy
path "secret/data/myapp/*" {
  capabilities = ["read"]
}
# Store secret
vault kv put secret/myapp/api key=12345

# Read secret
vault kv get secret/myapp/api

# Dynamic database credentials
vault read database/creds/myapp

Cloud Secret Managers:

  • AWS Secrets Manager:
aws secretsmanager create-secret --name myapp/api --secret-string '{"key":"12345"}'
  • Azure Key Vault:
az keyvault secret set --vault-name myvault --name api-key --value 12345
  • Google Secret Manager:
echo -n "12345" | gcloud secrets create api-key --data-file=-

Kubernetes Secrets:

apiVersion: v1
kind: Secret
metadata:
  name: db-secret
type: Opaque
data:
  username: YWRtaW4=  # base64 encoded
  password: MWYyZDFlMmU2N2Rm  # base64 encoded
---
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
  - name: app
    env:
    - name: DB_USERNAME
      valueFrom:
        secretKeyRef:
          name: db-secret
          key: username
    - name: DB_PASSWORD
      valueFrom:
        secretKeyRef:
          name: db-secret
          key: password

Tools for Secret Detection:

# GitHub Actions secret scanning
name: Secret Scanning
on: [push]
jobs:
  scan:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v2
    - name: TruffleHog
      uses: trufflesecurity/trufflehog@main
      with:
        path: ./
        base: ${{ github.event.repository.default_branch }}

20.5 CI/CD Security Hardening

Pipeline Security Checklist:

  • Use OIDC instead of long-lived credentials
  • Scan dependencies for vulnerabilities
  • Scan container images
  • Run SAST on code
  • Run DAST on deployments
  • Sign and verify artifacts
  • Immutable build environments
  • Least privilege for CI jobs
  • Audit all pipeline changes
  • Secrets never in logs

Secure Pipeline Example:

name: Secure CI/CD

on: [push]

jobs:
  security-scans:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      
      - name: Scan code for secrets
        uses: trufflesecurity/trufflehog@main
        
      - name: Initialize CodeQL
        uses: github/codeql-action/init@v1
        with:
          languages: javascript

      - name: Run SAST (CodeQL analysis)
        uses: github/codeql-action/analyze@v1
        
      - name: Scan dependencies
        run: |
          npm audit --audit-level=high
          npm outdated
      
      - name: Build image
        run: docker build -t myapp:${{ github.sha }} .
      
      - name: Scan image
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: 'myapp:${{ github.sha }}'
          severity: 'CRITICAL,HIGH'
      
      - name: Sign image
        run: |
          cosign sign --key k8s://my-namespace/cosign myapp:${{ github.sha }}
      
      - name: Deploy (if scans pass)
        if: success()
        run: ./deploy.sh

Chapter 21 — Security Tools

21.1 SAST (Static Application Security Testing)

Analyze source code for vulnerabilities.

Common SAST Tools:

  • SonarQube: Multi-language, quality and security
  • Checkmarx: Enterprise SAST
  • Fortify: Micro Focus
  • Semgrep: Fast, customizable
  • CodeQL: GitHub's analysis engine
  • ESLint (security plugins): JavaScript

Semgrep Example:

# semgrep.yml
rules:
  - id: no-hardcoded-secrets
    patterns:
      - pattern: password = "..."
      - pattern-not: password = os.getenv("...")
    message: "Hardcoded password detected"
    languages: [python]
    severity: ERROR

  - id: sql-injection
    patterns:
      - pattern: |
          cursor.execute("SELECT ... WHERE ... = " + $VAR)
    message: "Possible SQL injection"
    languages: [python]
    severity: WARNING

CI Integration:

- name: Run Semgrep
  uses: returntocorp/semgrep-action@v1
  with:
    config: >-
      p/security-audit
      p/secrets

21.2 DAST (Dynamic Application Security Testing)

Test running applications for vulnerabilities.

Common DAST Tools:

  • OWASP ZAP: Free, powerful
  • Burp Suite: Professional penetration testing
  • Acunetix: Commercial scanner
  • Nessus: Vulnerability scanner
  • Qualys: Cloud-based scanning

OWASP ZAP in CI:

- name: ZAP Scan
  uses: zaproxy/action-full-scan@v0.4.0
  with:
    target: 'https://staging.example.com'
    rules_file_name: '.zap/rules.tsv'
    cmd_options: '-a'

Types of DAST Tests:

  • Vulnerability scanning: SQLi, XSS, CSRF
  • Fuzzing: Unexpected inputs
  • Authentication testing: Login bypass
  • Session management: Token handling
  • Input validation: Boundary testing

21.3 Container Scanning

Scan container images for vulnerabilities.

Container Scanning Tools:

  • Trivy: Comprehensive, fast
  • Clair: CoreOS scanner
  • Anchore: Deep inspection
  • Docker Scout: Docker native
  • Grype: From Anchore
  • Snyk Container: Developer friendly

Trivy Example:

# Scan image
trivy image myapp:latest

# Scan with severity filter
trivy image --severity CRITICAL,HIGH myapp:latest

# Ignore unfixed
trivy image --ignore-unfixed myapp:latest

# Output formats
trivy image --format sarif myapp:latest > results.sarif

# Scan filesystem
trivy fs --severity HIGH,CRITICAL .

Kubernetes Admission Control:

apiVersion: v1
kind: ConfigMap
metadata:
  name: trivy-admission
data:
  policy.rego: |
    package trivy
    
    deny[msg] {
      input.request.kind.kind == "Pod"
      image := input.request.object.spec.containers[_].image
      not valid_image(image)
      msg := sprintf("Image %v has critical vulnerabilities", [image])
    }
    
    valid_image(image) {
      # Check with Trivy
      # ...
    }

21.4 Dependency Scanning

Scan project dependencies for known vulnerabilities.

Tools:

  • OWASP Dependency Check: Java, .NET, Python
  • Snyk: Multi-language, commercial
  • npm audit: JavaScript
  • Safety: Python
  • Gemnasium: GitLab's scanner
  • Dependabot: GitHub's automated updates

Snyk Example:

# .snyk
version: v1.25.0
ignore:
  SNYK-JS-LODASH-567746:
    - '*':
        reason: 'No patch available'
        expires: '2024-01-01'
patch: {}

CI Integration:

- name: Snyk Scan
  uses: snyk/actions/node@master
  env:
    SNYK_TOKEN: ${{ secrets.SNYK_TOKEN }}
  with:
    args: --severity-threshold=high

Dependabot Configuration:

# .github/dependabot.yml
version: 2
updates:
  - package-ecosystem: "npm"
    directory: "/"
    schedule:
      interval: "daily"
    open-pull-requests-limit: 10
    ignore:
      - dependency-name: "express"
        versions: ["5.x"]
    labels:
      - "dependencies"
      - "security"

21.5 Policy Enforcement

Enforce security policies across infrastructure.

Open Policy Agent (OPA):

package kubernetes.admission

deny[msg] {
  input.request.kind.kind == "Pod"
  container := input.request.object.spec.containers[_]
  not container.securityContext.runAsNonRoot
  msg := "Containers must set runAsNonRoot: true"
}

deny[msg] {
  input.request.kind.kind == "Deployment"
  not input.request.object.spec.template.metadata.labels.owner
  msg := "All resources must have owner label"
}

Kyverno (Kubernetes):

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-labels
spec:
  validationFailureAction: enforce
  rules:
  - name: check-for-labels
    match:
      resources:
        kinds:
        - Pod
    validate:
      message: "Label 'app' is required"
      pattern:
        metadata:
          labels:
            app: "?*"
---
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-latest-tag
spec:
  validationFailureAction: audit
  rules:
  - name: require-image-tag
    match:
      resources:
        kinds:
        - Pod
    validate:
      message: "Using 'latest' tag is not allowed"
      pattern:
        spec:
          containers:
          - image: "!*:latest"

Conftest (Configuration Testing):

package main

deny[msg] {
  input.kind == "Deployment"
  not input.spec.template.metadata.labels.app
  msg = "Deployments must have app label"
}

deny[msg] {
  input.kind == "Service"
  input.spec.type == "LoadBalancer"
  not input.metadata.annotations["service.beta.kubernetes.io/aws-load-balancer-internal"]
  msg = "LoadBalancer services must be internal"
}

# Test Kubernetes manifests against the policies
conftest test deployment.yaml --policy policy/

PART IX — ADVANCED TOPICS

Chapter 22 — GitOps & Platform Engineering

22.1 GitOps Principles

Core Principles:

  1. Declarative: Entire system described declaratively
  2. Versioned and Immutable: Desired state stored in Git
  3. Pulled Automatically: Software agents pull changes
  4. Continuously Reconciled: Correct drift automatically

GitOps Workflow:

Developer → Git Push
    ↓
Git Repository (source of truth)
    ↓
GitOps Operator (ArgoCD/Flux)
    ↓
Kubernetes Cluster
    ↑
Monitoring (drift detection)

Benefits:

  • Audit trail: All changes in Git
  • Faster recovery: Recreate cluster from Git
  • Standard tools: Use Git workflows
  • Security: Pull model reduces credentials
  • Observability: Drift detection

22.2 ArgoCD

Declarative GitOps for Kubernetes.

ArgoCD Architecture:

User (CLI/UI) → ArgoCD API Server
        ↓
   Repository Server
        ↓
   Controller
        ↓
   Kubernetes API

Application Definition:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: myapp
  namespace: argocd
spec:
  project: default
  
  source:
    repoURL: https://github.com/user/repo.git
    targetRevision: HEAD
    path: k8s
    helm:
      valueFiles:
      - values-production.yaml
  
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
    - CreateNamespace=true
    - PruneLast=true
  
  revisionHistoryLimit: 10

ApplicationSet (Multi-cluster):

apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: myapp
spec:
  generators:
  - clusters:
      selector:
        matchLabels:
          environment: production
  template:
    metadata:
      name: '{{name}}-myapp'
    spec:
      project: default
      source:
        repoURL: https://github.com/user/repo.git
        targetRevision: HEAD
        path: k8s
      destination:
        server: '{{server}}'
        namespace: 'myapp-{{name}}'

ArgoCD Commands:

# List apps
argocd app list

# Sync app
argocd app sync myapp

# Get app details
argocd app get myapp

# Rollback
argocd app rollback myapp 1

# Set image (with Kustomize)
argocd app set myapp --kustomize-image myapp:v2

22.3 Flux

Another GitOps operator, lighter weight.

Flux Components:

  • Source Controller: Manages Git repositories
  • Kustomize Controller: Applies Kustomize overlays
  • Helm Controller: Manages Helm releases
  • Notification Controller: Handles alerts

Flux Configuration:

# GitRepository source
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: GitRepository
metadata:
  name: myapp
  namespace: flux-system
spec:
  interval: 1m
  url: https://github.com/user/repo
  ref:
    branch: main
  secretRef:
    name: repo-auth

# Kustomization
apiVersion: kustomize.toolkit.fluxcd.io/v1beta2
kind: Kustomization
metadata:
  name: myapp
  namespace: flux-system
spec:
  interval: 10m
  path: ./k8s/overlays/production
  prune: true
  sourceRef:
    kind: GitRepository
    name: myapp
  healthChecks:
    - apiVersion: apps/v1
      kind: Deployment
      name: myapp
      namespace: production

Flux with Helm:

# HelmRepository
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: HelmRepository
metadata:
  name: bitnami
  namespace: flux-system
spec:
  interval: 1h
  url: https://charts.bitnami.com/bitnami

# HelmRelease
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: redis
  namespace: production
spec:
  interval: 5m
  chart:
    spec:
      chart: redis
      sourceRef:
        kind: HelmRepository
        name: bitnami
        namespace: flux-system
      interval: 1m
  values:
    architecture: standalone
    auth:
      enabled: false

22.4 Internal Developer Platforms

What is an IDP?

A layer of tools and services that development teams use to build, deploy, and operate applications without needing to understand the underlying infrastructure.

IDP Components:

Developer Portal (Backstage, Kratix)
    ↓
Orchestration (Terraform, Crossplane)
    ↓
GitOps (ArgoCD, Flux)
    ↓
Kubernetes (EKS, AKS, GKE)
    ↓
Cloud Providers (AWS, Azure, GCP)

Backstage (Spotify's Developer Portal):

// Component definition
import { Entity } from '@backstage/catalog-model';

export const myComponent: Entity = {
  apiVersion: 'backstage.io/v1alpha1',
  kind: 'Component',
  metadata: {
    name: 'my-service',
    description: 'My awesome service',
    annotations: {
      'github.com/project-slug': 'org/my-service',
      'backstage.io/techdocs-ref': 'dir:.',
    },
    tags: ['java', 'web'],
  },
  spec: {
    type: 'service',
    lifecycle: 'production',
    owner: 'team-a',
    system: 'product-catalog',
  },
};

Crossplane (Infrastructure as Code Platform):

apiVersion: aws.crossplane.io/v1beta1
kind: ProviderConfig
metadata:
  name: aws-provider
spec:
  credentials:
    source: Secret
    secretRef:
      namespace: crossplane-system
      name: aws-creds
      key: creds

---
apiVersion: database.aws.crossplane.io/v1beta1
kind: RDSInstance
metadata:
  name: mydb
spec:
  forProvider:
    region: us-east-1
    dbInstanceClass: db.t3.micro
    masterUsername: admin
    engine: postgres
    engineVersion: "13"
    allocatedStorage: 20
    publiclyAccessible: false
  writeConnectionSecretToRef:
    name: db-conn
    namespace: production
  providerConfigRef:
    name: aws-provider

Platform Engineering Team Responsibilities:

  • Build and maintain IDP
  • Define "golden paths" for developers
  • Provide self-service capabilities
  • Abstract infrastructure complexity
  • Ensure security and compliance
  • Collect feedback and improve

Golden Path Example:

Developer Workflow:
1. Create repo from template
2. Run `platform create-service`
3. Add code and push
4. PR creates preview environment
5. Merge to main → staging deploy
6. Promote to production via UI

Chapter 23 — Serverless & Edge

23.1 Serverless Architecture

What is Serverless?

  • No server management
  • Automatic scaling
  • Pay per execution
  • Event-driven

Benefits:

  • Reduced operational overhead
  • Auto-scaling to zero
  • Cost efficiency for variable workloads
  • Faster time to market

Trade-offs:

  • Cold starts
  • Vendor lock-in
  • Execution limits
  • Debugging complexity

23.2 AWS Lambda

Lambda Function Example (Node.js):

exports.handler = async (event) => {
  console.log('Event:', JSON.stringify(event, null, 2));
  
  try {
    const { name } = event.queryStringParameters || {};
    const response = {
      statusCode: 200,
      headers: {
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({
        message: `Hello, ${name || 'World'}!`,
        timestamp: new Date().toISOString(),
      }),
    };
    
    return response;
  } catch (error) {
    console.error('Error:', error);
    return {
      statusCode: 500,
      body: JSON.stringify({ error: 'Internal Server Error' }),
    };
  }
};

Terraform Lambda Deployment:

# IAM Role
resource "aws_iam_role" "lambda_role" {
  name = "lambda_role"
  
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = "sts:AssumeRole"
        Effect = "Allow"
        Principal = {
          Service = "lambda.amazonaws.com"
        }
      }
    ]
  })
}

# Lambda function
resource "aws_lambda_function" "api" {
  filename      = "function.zip"
  function_name = "my-api"
  role          = aws_iam_role.lambda_role.arn
  handler       = "index.handler"
  runtime       = "nodejs18.x"
  
  environment {
    variables = {
      TABLE_NAME = aws_dynamodb_table.data.name
    }
  }
  
  tracing_config {
    mode = "Active"
  }
}

# API Gateway trigger
resource "aws_apigatewayv2_api" "lambda" {
  name          = "serverless-api"
  protocol_type = "HTTP"
  
  cors {
    allow_origins = ["*"]
    allow_methods = ["GET", "POST"]
  }
}

resource "aws_apigatewayv2_integration" "lambda" {
  api_id = aws_apigatewayv2_api.lambda.id
  
  integration_uri    = aws_lambda_function.api.invoke_arn
  integration_type   = "AWS_PROXY"
  integration_method = "POST"
}

resource "aws_apigatewayv2_route" "get" {
  api_id    = aws_apigatewayv2_api.lambda.id
  route_key = "GET /hello"
  target    = "integrations/${aws_apigatewayv2_integration.lambda.id}"
}

23.3 Azure Functions

Azure Function (Python):

import azure.functions as func
import logging
import json
from datetime import datetime

def main(req: func.HttpRequest) -> func.HttpResponse:
    logging.info('Python HTTP trigger function processed a request.')
    
    name = req.params.get('name')
    if not name:
        try:
            req_body = req.get_json()
        except ValueError:
            pass
        else:
            name = req_body.get('name')
    
    if name:
        return func.HttpResponse(
            json.dumps({
                "message": f"Hello, {name}!",
                "timestamp": datetime.utcnow().isoformat()
            }),
            status_code=200,
            mimetype="application/json"
        )
    else:
        return func.HttpResponse(
            "Please pass a name on the query string or in the request body",
            status_code=400
        )

Azure Functions Configuration:

{
  "IsEncrypted": false,
  "Values": {
    "AzureWebJobsStorage": "UseDevelopmentStorage=true",
    "FUNCTIONS_WORKER_RUNTIME": "python",
    "COSMOS_CONNECTION": "AccountEndpoint=...;"
  }
}

23.4 Cloud Run

Serverless containers on GCP.

Cloud Run Service:

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: hello-world
spec:
  template:
    spec:
      containers:
      - image: gcr.io/myproject/hello:v1
        ports:
        - containerPort: 8080
        resources:
          limits:
            memory: "256Mi"
            cpu: "1"
        env:
        - name: DATABASE_URL
          valueFrom:
            secretKeyRef:
              name: db-secret
              key: url

Deployment with gcloud:

# Build and deploy
gcloud builds submit --tag gcr.io/myproject/hello:v1
gcloud run deploy hello \
  --image gcr.io/myproject/hello:v1 \
  --platform managed \
  --region us-central1 \
  --allow-unauthenticated \
  --memory 256Mi \
  --concurrency 80

Terraform:

resource "google_cloud_run_service" "default" {
  name     = "hello"
  location = "us-central1"
  
  template {
    spec {
      containers {
        image = "gcr.io/myproject/hello:v1"
        
        resources {
          limits = {
            cpu    = "1000m"
            memory = "256Mi"
          }
        }
        
        env {
          name = "DATABASE_URL"
          value_from {
            secret_key_ref {
              name = google_secret_manager_secret.db.secret_id
              key  = "latest"
            }
          }
        }
      }
      
      container_concurrency = 80
      timeout_seconds       = 300
    }
  }
  
  traffic {
    percent         = 100
    latest_revision = true
  }
}

23.5 Edge Computing

Compute at the network edge, closer to users.

Cloudflare Workers:

// Cloudflare Worker
addEventListener('fetch', event => {
  event.respondWith(handleRequest(event))
})

async function handleRequest(event) {
  const request = event.request
  const cache = caches.default
  let response = await cache.match(request)
  
  if (!response) {
    response = await fetch(request)
    
    // Cache responses
    if (response.status === 200) {
      const cloned = response.clone()
      const headers = new Headers(cloned.headers)
      headers.set('Cache-Control', 'public, max-age=3600')
      
      const cached = new Response(cloned.body, {
        status: cloned.status,
        statusText: cloned.statusText,
        headers: headers
      })
      
      event.waitUntil(cache.put(request, cached))
    }
  }
  
  return response
}

AWS Lambda@Edge:

'use strict';

// Origin response trigger
exports.handler = (event, context, callback) => {
  const response = event.Records[0].cf.response;
  const headers = response.headers;
  
  // Add security headers
  headers['strict-transport-security'] = [{
    key: 'Strict-Transport-Security',
    value: 'max-age=63072000; includeSubdomains; preload'
  }];
  
  headers['x-content-type-options'] = [{
    key: 'X-Content-Type-Options',
    value: 'nosniff'
  }];
  
  headers['x-frame-options'] = [{
    key: 'X-Frame-Options',
    value: 'DENY'
  }];
  
  headers['x-xss-protection'] = [{
    key: 'X-XSS-Protection',
    value: '1; mode=block'
  }];
  
  callback(null, response);
};

Use Cases:

  • CDN caching
  • Authentication at edge
  • A/B testing
  • Geolocation routing
  • Bot mitigation
  • API aggregation

Chapter 24 — Performance & Scalability

24.1 Load Balancing

Distribute traffic across multiple servers.

Load Balancer Types:

  • Layer 4 (Transport): TCP/UDP, IP-based
  • Layer 7 (Application): HTTP/HTTPS, content-based

Algorithms:

  • Round Robin: Simple rotation
  • Least Connections: Routes to the server with the fewest active connections
  • IP Hash: Sticky sessions
  • Weighted: Capacity-based distribution
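Two of these algorithms can be sketched in a few lines (helper names are illustrative):

```python
import itertools

def round_robin(servers):
    """Plain round robin: an endless rotation over the server list."""
    return itertools.cycle(servers)

def least_connections(active: dict) -> str:
    """Pick the server currently holding the fewest active connections."""
    return min(active, key=active.get)

rr = round_robin(["a", "b", "c"])
print([next(rr) for _ in range(4)])                   # ['a', 'b', 'c', 'a']
print(least_connections({"a": 12, "b": 3, "c": 7}))   # b
```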

AWS Application Load Balancer:

resource "aws_lb" "main" {
  name               = "app-lb"
  internal           = false
  load_balancer_type = "application"
  security_groups    = [aws_security_group.lb.id]
  subnets            = aws_subnet.public[*].id
  
  enable_deletion_protection = true
  
  access_logs {
    bucket  = aws_s3_bucket.lb_logs.bucket
    prefix  = "alb-logs"
    enabled = true
  }
}

resource "aws_lb_target_group" "app" {
  name     = "app-targets"
  port     = 80
  protocol = "HTTP"
  vpc_id   = aws_vpc.main.id
  
  health_check {
    enabled             = true
    healthy_threshold   = 2
    unhealthy_threshold = 2
    timeout             = 5
    interval            = 30
    path                = "/health"
  }
  
  stickiness {
    type            = "lb_cookie"
    cookie_duration = 86400
    enabled         = true
  }
}

resource "aws_lb_listener" "front_end" {
  load_balancer_arn = aws_lb.main.arn
  port              = "443"
  protocol          = "HTTPS"
  ssl_policy        = "ELBSecurityPolicy-2016-08"
  certificate_arn   = aws_acm_certificate.lb.arn
  
  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.app.arn
  }
}

24.2 CDN (Content Delivery Network)

Distribute content globally for faster delivery.

CloudFront with S3:

# Origin Access Identity
resource "aws_cloudfront_origin_access_identity" "oai" {
  comment = "OAI for S3 bucket"
}

# CloudFront distribution
resource "aws_cloudfront_distribution" "cdn" {
  enabled = true
  
  origin {
    domain_name = aws_s3_bucket.website.bucket_regional_domain_name
    origin_id   = "S3-website"
    
    s3_origin_config {
      origin_access_identity = aws_cloudfront_origin_access_identity.oai.cloudfront_access_identity_path
    }
  }
  
  default_cache_behavior {
    allowed_methods  = ["GET", "HEAD", "OPTIONS"]
    cached_methods   = ["GET", "HEAD"]
    target_origin_id = "S3-website"
    
    forwarded_values {
      query_string = false
      cookies {
        forward = "none"
      }
    }
    
    viewer_protocol_policy = "redirect-to-https"
    min_ttl                = 0
    default_ttl            = 3600
    max_ttl                = 86400
    
    compress = true
  }
  
  price_class = "PriceClass_100"
  
  viewer_certificate {
    cloudfront_default_certificate = true
  }
  
  restrictions {
    geo_restriction {
      restriction_type = "none"
    }
  }
  
  custom_error_response {
    error_code            = 404
    response_code         = 200
    response_page_path    = "/index.html"
    error_caching_min_ttl = 300
  }
  
  tags = {
    Environment = var.environment
  }
}

24.3 Caching Strategies

Cache Levels:

  1. Browser Cache: Local to user
  2. CDN Cache: Edge locations
  3. Application Cache: In-memory (Redis, Memcached)
  4. Database Cache: Query cache

Cache Headers:

# Nginx cache configuration
location /static/ {
    expires 1y;
    add_header Cache-Control "public, immutable";
}

location /api/ {
    expires 1m;
    add_header Cache-Control "private, must-revalidate";
    
    # Proxy cache
    proxy_cache api_cache;
    proxy_cache_key "$scheme$request_method$host$request_uri";
    proxy_cache_valid 200 302 60m;
    proxy_cache_valid 404 1m;
    proxy_cache_use_stale error timeout updating;
}

Redis Caching:

import redis
import json

redis_client = redis.Redis(host='redis', port=6379, db=0)

def get_user(user_id):
    # Try cache first
    cached = redis_client.get(f"user:{user_id}")
    if cached:
        return json.loads(cached)
    
    # Cache miss - get from database
    user = db.query(User).get(user_id)
    if user:
        # Store in cache for 1 hour
        redis_client.setex(
            f"user:{user_id}",
            3600,
            json.dumps(user.to_dict())
        )
    return user

def invalidate_user(user_id):
    redis_client.delete(f"user:{user_id}")

Cache Invalidation Strategies:

  • Time-based: Expire after TTL
  • Event-based: Invalidate on update
  • Version-based: Use version in cache key
  • Manual: Purge via API
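Version-based invalidation can be sketched with a plain dict standing in for Redis; all names here are illustrative:

```python
# A dict stands in for Redis here; the versioning logic is the same.
cache: dict = {}
versions: dict = {}

def cache_key(entity: str, entity_id: int) -> str:
    """The current version number is part of the key."""
    v = versions.get((entity, entity_id), 0)
    return f"{entity}:{entity_id}:v{v}"

def invalidate(entity: str, entity_id: int) -> None:
    # Bumping the version makes every old key unreachable -- no deletes needed;
    # stale entries simply age out via TTL.
    versions[(entity, entity_id)] = versions.get((entity, entity_id), 0) + 1

cache[cache_key("user", 42)] = {"name": "Ada"}
invalidate("user", 42)
print(cache_key("user", 42) in cache)  # False -- old entry is now orphaned
```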

24.4 Database Scaling

Vertical Scaling (Scale Up):

  • Bigger instance
  • More CPU/RAM
  • Limited by hardware

Horizontal Scaling (Scale Out):

  • More instances
  • Sharding
  • Read replicas

Read Replicas:

-- Write to master
INSERT INTO users (name) VALUES ('John');

-- Read from replica
SELECT * FROM users;  -- Connect to replica endpoint
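
Applications usually hide the replica topology behind a small router that sends writes to the primary and spreads reads across replicas. A minimal sketch, assuming connection-like objects supplied by the caller (the class name and round-robin policy are illustrative):

```python
import itertools

class ReadWriteRouter:
    """Route writes to the primary and spread reads across replicas."""

    def __init__(self, primary, replicas):
        self.primary = primary
        # Simple round-robin over replicas for read load spreading
        self._replica_cycle = itertools.cycle(replicas)

    def connection_for(self, sql):
        # Anything that is not a plain SELECT goes to the primary
        if sql.lstrip().upper().startswith("SELECT"):
            return next(self._replica_cycle)
        return self.primary
```

Note that replication lag means a read routed to a replica may briefly miss a just-committed write; read-your-own-writes flows should pin to the primary.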

Database Sharding:

Range-based: each shard owns a contiguous key range.

Shard 0: users 0-10000
Shard 1: users 10001-20000
Shard 2: users 20001-30000

Hash-based: a hash (or modulo) of the key picks the shard.

shard_id = user_id % num_shards
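
Both routing rules fit in a few lines. A minimal sketch (the boundary values are illustrative; `boundaries[i]` is the exclusive upper bound of shard i):

```python
def route_by_hash(user_id, num_shards):
    """Hash-based sharding: the key modulo the shard count picks the shard."""
    return user_id % num_shards

def route_by_range(user_id, boundaries):
    """Range-based sharding: return the first shard whose upper bound exceeds the key."""
    for shard_id, upper in enumerate(boundaries):
        if user_id < upper:
            return shard_id
    raise ValueError(f"user_id {user_id} is beyond the last shard boundary")
```

Hash routing spreads load evenly but makes range scans expensive; range routing keeps adjacent keys together but can create hot shards.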

Connection Pooling:

from sqlalchemy import create_engine
from sqlalchemy.pool import QueuePool

engine = create_engine(
    'postgresql://user:pass@localhost/mydb',
    poolclass=QueuePool,
    pool_size=20,
    max_overflow=10,
    pool_pre_ping=True,
    pool_recycle=3600
)

24.5 High Throughput Systems

Asynchronous Processing:

# FastAPI with background tasks
# (Order is a Pydantic model; save_order/update_database are app helpers, elided)
from fastapi import FastAPI, BackgroundTasks
import asyncio

app = FastAPI()

async def process_order(order_id: str):
    # Long-running task
    await asyncio.sleep(5)
    # Update order status
    await update_database(order_id, "processed")

@app.post("/orders")
async def create_order(order: Order, background_tasks: BackgroundTasks):
    # Save order quickly
    order_id = await save_order(order)
    
    # Process in background
    background_tasks.add_task(process_order, order_id)
    
    return {"order_id": order_id, "status": "accepted"}

Message Queues:

# Producer (FastAPI)
import json
import aio_pika

async def publish_order(order):
    connection = await aio_pika.connect_robust("amqp://guest:guest@rabbitmq/")
    channel = await connection.channel()
    
    await channel.default_exchange.publish(
        aio_pika.Message(
            body=json.dumps(order).encode(),
            delivery_mode=aio_pika.DeliveryMode.PERSISTENT
        ),
        routing_key="orders"
    )
    
    await connection.close()

# Consumer (Worker)
async def process_orders():
    connection = await aio_pika.connect_robust("amqp://guest:guest@rabbitmq/")
    channel = await connection.channel()
    
    queue = await channel.declare_queue("orders", durable=True)
    
    async with queue.iterator() as queue_iter:
        async for message in queue_iter:
            async with message.process():
                order = json.loads(message.body)
                await process_order(order)

Rate Limiting:

from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse
import redis

app = FastAPI()
redis_client = redis.Redis(host='redis', port=6379, db=0)

@app.middleware("http")
async def rate_limit(request: Request, call_next):
    client_ip = request.client.host
    key = f"rate_limit:{client_ip}"
    
    # Increment first, then check - a single pipeline avoids a read/write race
    pipe = redis_client.pipeline()
    pipe.incr(key)
    pipe.expire(key, 60)  # 1 minute window
    current, _ = pipe.execute()
    
    if int(current) > 100:
        # Middleware must return a response; a raised HTTPException
        # would bypass the normal exception handlers here
        return JSONResponse(status_code=429, content={"detail": "Too many requests"})
    
    return await call_next(request)

Chapter 25 — DevOps at Enterprise Scale

25.1 Multi-Region Architecture

Active-Passive:

Region A (Primary)
├── Traffic: 100%
├── Database: Read/Write
└── Ready for failover

Region B (Standby)
├── Traffic: 0%
├── Database: Read-only replica
└── Failover target

Active-Active:

        Global Load Balancer
                ↓
        ┌───────┴───────┐
    Region A          Region B
    Traffic: 50%      Traffic: 50%
    Database: sync    Database: sync

DNS Failover (Route53):

resource "aws_route53_record" "www" {
  zone_id = data.aws_route53_zone.main.zone_id
  name    = "www.example.com"
  type    = "A"
  
  alias {
    name                   = aws_lb.primary.dns_name
    zone_id                = aws_lb.primary.zone_id
    evaluate_target_health = true
  }
  
  failover_routing_policy {
    type = "PRIMARY"
  }
  
  set_identifier = "primary"
}

resource "aws_route53_record" "www_failover" {
  zone_id = data.aws_route53_zone.main.zone_id
  name    = "www.example.com"
  type    = "A"
  
  alias {
    name                   = aws_lb.secondary.dns_name
    zone_id                = aws_lb.secondary.zone_id
    evaluate_target_health = true
  }
  
  failover_routing_policy {
    type = "SECONDARY"
  }
  
  set_identifier = "secondary"
}

25.2 Compliance (ISO, SOC2)

Common Compliance Frameworks:

  • ISO 27001: Information security management
  • SOC 2: Service organization controls
  • PCI DSS: Payment card industry
  • HIPAA: Healthcare
  • GDPR: Data privacy

Automated Compliance Checks:

# AWS Config rule
resource "aws_config_config_rule" "encrypted_volumes" {
  name = "encrypted-volumes"
  
  source {
    owner             = "AWS"
    source_identifier = "ENCRYPTED_VOLUMES"
  }
  
  scope {
    compliance_resource_types = ["AWS::EC2::Volume"]
  }
}

Evidence Collection:

# Automated evidence collection
import boto3
import json
from datetime import datetime

def collect_evidence():
    # Collect IAM policies
    iam = boto3.client('iam')
    policies = iam.list_policies(Scope='Local')
    
    # Collect security group rules
    ec2 = boto3.client('ec2')
    security_groups = ec2.describe_security_groups()
    
    # Collect CloudTrail logs
    cloudtrail = boto3.client('cloudtrail')
    trails = cloudtrail.describe_trails()
    
    evidence = {
        'timestamp': datetime.utcnow().isoformat(),
        'iam_policies': policies,
        'security_groups': security_groups,
        'cloudtrail': trails
    }
    
    # Store in secure bucket
    s3 = boto3.client('s3')
    s3.put_object(
        Bucket='compliance-evidence',
        Key=f"evidence/{datetime.now().date()}/config.json",
        Body=json.dumps(evidence, default=str)
    )

25.3 Governance

Policy as Code:

# AWS Service Control Policy (SCP)
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Deny",
      "Action": [
        "ec2:RunInstances"
      ],
      "Resource": [
        "arn:aws:ec2:*:*:instance/*"
      ],
      "Condition": {
        "StringNotEquals": {
          "ec2:InstanceType": [
            "t3.micro",
            "t3.small",
            "m5.large"
          ]
        }
      }
    },
    {
      "Effect": "Deny",
      "Action": [
        "s3:PutBucketPublicAccessBlock"
      ],
      "Resource": "*"
    }
  ]
}

Tagging Strategy:

# Enforce tags
resource "aws_cloudformation_stack" "enforce_tags" {
  name = "enforce-tags"
  
  template_body = <<TEMPLATE
Resources:
  EnforceTagsLambda:
    Type: AWS::Lambda::Function
    Properties:
      Handler: index.handler
      Runtime: python3.9
      Code:
        ZipFile: |
          import boto3
          import json
          
          def handler(event, context):
              ec2 = boto3.client('ec2')
              
              # describe_instances cannot filter on a *missing* tag,
              # so list everything and inspect tags in code
              resources = ec2.describe_instances()
              
              # Stop instances that lack the required Environment tag
              for reservation in resources['Reservations']:
                  for instance in reservation['Instances']:
                      tags = {t['Key'] for t in instance.get('Tags', [])}
                      if 'Environment' not in tags:
                          ec2.stop_instances(InstanceIds=[instance['InstanceId']])
              
              return {'status': 'completed'}
TEMPLATE
}

25.4 Cost Management

Cost Allocation Tags:

resource "aws_instance" "web" {
  # ... other configuration
  
  tags = {
    Name        = "web-server"
    Environment = "production"
    CostCenter  = "product-engineering"
    Project     = "customer-portal"
    Owner       = "team-alpha"
    Expires     = "never"  # or "2024-12-31"
  }
}

Budget Alerts:

resource "aws_budgets_budget" "monthly" {
  name         = "monthly-budget"
  budget_type  = "COST"
  limit_amount = "10000"
  limit_unit   = "USD"
  time_unit    = "MONTHLY"
  
  cost_types {
    include_credit = false
    include_discount = false
    include_other_subscription = true
    include_recurring = true
    include_refund = false
    include_subscription = true
    include_support = true
    include_tax = true
    include_upfront = true
    use_blended = false
  }
  
  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = ["finance@example.com"]
  }
}

25.5 FinOps

Financial operations for cloud.

FinOps Principles:

  1. Teams need to collaborate: Finance, engineering, product
  2. Decisions driven by business value: Cost vs. features
  3. Everyone takes ownership: Decentralized accountability
  4. Reports should be accessible: Transparency
  5. Cloud is variable cost: Optimize continuously
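
Principle 4 (accessible reports) is typically implemented by grouping spend by cost-allocation tag. A minimal sketch that turns Cost Explorer-style result rows into a per-team report (the row shape mirrors the `boto3` `get_cost_and_usage` response, where tag groups are encoded as "TagKey$TagValue"; the tag key and figures are illustrative):

```python
def cost_by_team(groups):
    """Aggregate Cost Explorer 'Groups' rows into a cost-per-team dict.

    Each row looks like:
    {"Keys": ["CostCenter$team-alpha"],
     "Metrics": {"UnblendedCost": {"Amount": "12.5"}}}
    """
    report = {}
    for g in groups:
        # Cost Explorer encodes tag groups as "<TagKey>$<TagValue>";
        # an empty value means the resource is untagged
        team = g["Keys"][0].split("$", 1)[1] or "untagged"
        amount = float(g["Metrics"]["UnblendedCost"]["Amount"])
        report[team] = report.get(team, 0.0) + amount
    return report
```

Publishing a report like this per team each month gives engineering the feedback loop that the FinOps principles call for.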

Cost Optimization Strategies:

# Automated rightsizing recommendation
def analyze_rightsizing():
    # Get usage metrics
    cloudwatch = boto3.client('cloudwatch')
    
    # For each instance
    for instance in get_all_instances():
        # Get CPU utilization
        stats = cloudwatch.get_metric_statistics(
            Namespace='AWS/EC2',
            MetricName='CPUUtilization',
            Dimensions=[{'Name': 'InstanceId', 'Value': instance.id}],
            StartTime=datetime.now() - timedelta(days=30),
            EndTime=datetime.now(),
            Period=3600,
            Statistics=['Average']
        )
        
        datapoints = stats['Datapoints']
        if not datapoints:
            continue  # no metrics reported for this instance yet
        avg_cpu = sum(p['Average'] for p in datapoints) / len(datapoints)
        
        # Recommend downsizing if low utilization
        if avg_cpu < 10:
            recommend_smaller_instance(instance)
        
        # Recommend spot if appropriate
        if can_use_spot(instance):
            recommend_spot_conversion(instance)

Spot Instance Strategy:

# Spot instance with mixed types
resource "aws_ec2_fleet" "compute" {
  launch_template_config {
    launch_template_specification {
      launch_template_id = aws_launch_template.app.id
      version            = "$Latest"
    }
    
    overrides {
      instance_type = "c5.large"
      weighted_capacity = 2
    }
    
    overrides {
      instance_type = "c5a.large"
      weighted_capacity = 2
    }
    
    overrides {
      instance_type = "m5.large"
      weighted_capacity = 2
    }
  }
  
  target_capacity_specification {
    default_target_capacity_type = "spot"
    total_target_capacity        = 20
    spot_target_capacity         = 20
  }
  
  spot_options {
    allocation_strategy            = "capacity-optimized"
    instance_interruption_behavior = "terminate"
    min_target_capacity            = 10
  }
}

25.6 Migration Strategies

The 7 Rs of Migration:

  1. Rehost (Lift and Shift): Move as-is
  2. Replatform (Lift, Tinker, Shift): Minor optimizations
  3. Repurchase (Drop and Shop): Move to SaaS
  4. Refactor (Re-architect): Modernize for cloud
  5. Retire: Decommission unused
  6. Retain: Keep on-premises
  7. Relocate: Move infrastructure as-is to a cloud-hosted platform (e.g., VMware Cloud on AWS)

Migration Phases:

  1. Assess: Discovery and planning
  2. Mobilize: Pilot and skills building
  3. Migrate: Scale migration
  4. Modernize: Optimize and innovate

Database Migration Service:

# AWS DMS replication task
resource "aws_dms_replication_task" "migrate" {
  replication_task_id       = "migrate-db"
  migration_type            = "full-load"
  replication_instance_arn  = aws_dms_replication_instance.dms.replication_instance_arn
  source_endpoint_arn       = aws_dms_endpoint.source.endpoint_arn
  target_endpoint_arn       = aws_dms_endpoint.target.endpoint_arn
  table_mappings            = jsonencode({
    "rules": [
      {
        "rule-type": "selection",
        "rule-id": "1",
        "rule-name": "1",
        "object-locator": {
          "schema-name": "public",
          "table-name": "users"
        },
        "rule-action": "include"
      }
    ]
  })
  
  replication_task_settings = jsonencode({
    "TargetMetadata": {
      "TargetSchema": "",
      "SupportLobs": true,
      "FullLobMode": false,
      "LobChunkSize": 64,
      "LimitedSizeLobMode": false,
      "LobMaxSize": 32
    },
    "FullLoadSettings": {
      "TargetTablePrepMode": "DROP_AND_CREATE",
      "CreatePkAfterFullLoad": false,
      "StopTaskCachedChangesApplied": false,
      "StopTaskCachedChangesNotApplied": false,
      "MaxFullLoadSubTasks": 8,
      "TransactionConsistencyTimeout": 600,
      "CommitRate": 10000
    }
  })
}

PART X — PRACTICAL IMPLEMENTATION

Chapter 26 — Building a Complete DevOps Pipeline

26.1 Sample Microservices Project

Architecture:

┌─────────┐    ┌─────────┐    ┌─────────┐
│  React  │ → │   API   │ → │  Users  │
│   App   │ ← │ Gateway │ ← │ Service │
└─────────┘    └─────────┘    └─────────┘
                    ↓              ↓
              ┌─────────┐    ┌─────────┐
              │  Auth   │    │  Posts  │
              │ Service │    │ Service │
              └─────────┘    └─────────┘

Repository Structure:

myapp/
├── services/
│   ├── api-gateway/
│   │   ├── src/
│   │   ├── Dockerfile
│   │   └── package.json
│   ├── users-service/
│   │   ├── src/
│   │   ├── Dockerfile
│   │   └── requirements.txt
│   └── posts-service/
│       ├── src/
│       ├── Dockerfile
│       └── go.mod
├── frontend/
│   ├── src/
│   ├── Dockerfile
│   └── package.json
├── k8s/
│   ├── base/
│   │   ├── deployment.yaml
│   │   └── service.yaml
│   └── overlays/
│       ├── dev/
│       └── prod/
├── terraform/
│   ├── main.tf
│   ├── variables.tf
│   └── outputs.tf
├── .github/
│   └── workflows/
│       ├── ci.yml
│       └── cd.yml
└── README.md

26.2 Git Workflow

Branch Strategy:

  • main - Production-ready code
  • develop - Integration branch
  • feature/* - New features
  • release/* - Release preparation
  • hotfix/* - Emergency fixes

PR Template:

## Description
[Describe your changes]

## Type of Change
- [ ] Bug fix
- [ ] New feature
- [ ] Breaking change
- [ ] Documentation update

## Testing
- [ ] Unit tests passing
- [ ] Integration tests passing
- [ ] Manual testing completed

## Checklist
- [ ] Code follows style guide
- [ ] Documentation updated
- [ ] Dependencies updated
- [ ] Security considerations addressed

## Related Issues
Closes #[issue-number]

26.3 CI Pipeline

GitHub Actions CI:

# .github/workflows/ci.yml
name: CI Pipeline

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      
      - name: Lint API Gateway
        working-directory: services/api-gateway
        run: |
          npm install
          npm run lint
      
      - name: Lint Users Service
        working-directory: services/users-service
        run: |
          pip install flake8
          flake8 src/
      
      - name: Lint Posts Service
        working-directory: services/posts-service
        run: |
          go install golang.org/x/lint/golint@latest
          golint ./...
  
  test:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:13
        env:
          POSTGRES_PASSWORD: testpass
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
        ports:
          - 5432:5432
      
      redis:
        image: redis:6
        options: >-
          --health-cmd "redis-cli ping"
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
        ports:
          - 6379:6379
    
    steps:
      - uses: actions/checkout@v2
      
      - name: Test API Gateway
        working-directory: services/api-gateway
        run: |
          npm install
          npm test -- --coverage
      
      - name: Test Users Service
        working-directory: services/users-service
        env:
          DATABASE_URL: postgresql://postgres:testpass@localhost/test
        run: |
          pip install -r requirements.txt
          pytest --cov=src tests/
      
      - name: Test Posts Service
        working-directory: services/posts-service
        run: |
          go mod download
          go test -v -cover ./...
  
  security:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      
      - name: Initialize CodeQL
        uses: github/codeql-action/init@v1
        with:
          languages: javascript,python,go
      
      - name: Autobuild
        uses: github/codeql-action/autobuild@v1
      
      - name: Run CodeQL analysis
        uses: github/codeql-action/analyze@v1
      
      - name: Scan dependencies
        run: |
          npm audit --audit-level=high
          safety check
          go list -json -deps ./... | nancy sleuth
      
      - name: Scan for secrets
        uses: trufflesecurity/trufflehog@main
  
  build:
    runs-on: ubuntu-latest
    needs: [lint, test, security]
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    
    steps:
      - uses: actions/checkout@v2
      
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v1
      
      - name: Login to Container Registry
        uses: docker/login-action@v1
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      
      - name: Build and push API Gateway
        uses: docker/build-push-action@v2
        with:
          context: services/api-gateway
          push: true
          tags: |
            ghcr.io/${{ github.repository }}/api-gateway:${{ github.sha }}
            ghcr.io/${{ github.repository }}/api-gateway:latest
          cache-from: type=gha
          cache-to: type=gha,mode=max
      
      - name: Build and push Users Service
        uses: docker/build-push-action@v2
        with:
          context: services/users-service
          push: true
          tags: |
            ghcr.io/${{ github.repository }}/users-service:${{ github.sha }}
            ghcr.io/${{ github.repository }}/users-service:latest
      
      - name: Build and push Posts Service
        uses: docker/build-push-action@v2
        with:
          context: services/posts-service
          push: true
          tags: |
            ghcr.io/${{ github.repository }}/posts-service:${{ github.sha }}
            ghcr.io/${{ github.repository }}/posts-service:latest
      
      - name: Scan images for vulnerabilities
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: 'ghcr.io/${{ github.repository }}/api-gateway:${{ github.sha }}'
          severity: 'CRITICAL,HIGH'
          format: 'sarif'
          output: 'trivy-results.sarif'

26.4 Dockerization

API Gateway Dockerfile:

FROM node:18-alpine AS builder

WORKDIR /app

COPY package*.json ./
RUN npm ci --only=production

COPY . .

FROM node:18-alpine

RUN addgroup -g 1001 -S nodejs && \
    adduser -S nodejs -u 1001

WORKDIR /app

COPY --from=builder --chown=nodejs:nodejs /app/node_modules ./node_modules
COPY --chown=nodejs:nodejs . .

USER nodejs

EXPOSE 3000

HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
  CMD node healthcheck.js

CMD ["node", "src/server.js"]

Users Service Dockerfile:

FROM python:3.10-slim AS builder

WORKDIR /app

COPY requirements.txt .
RUN pip install --user --no-cache-dir -r requirements.txt

FROM python:3.10-slim

RUN apt-get update && apt-get install -y --no-install-recommends \
    curl \
    && rm -rf /var/lib/apt/lists/*

RUN groupadd -r appuser && useradd -r -g appuser appuser

WORKDIR /app

COPY --from=builder /root/.local /home/appuser/.local
COPY . .

ENV PATH=/home/appuser/.local/bin:$PATH

USER appuser

EXPOSE 8000

HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
  CMD curl -f http://localhost:8000/health || exit 1

CMD ["uvicorn", "src.main:app", "--host", "0.0.0.0", "--port", "8000"]

Posts Service Dockerfile:

FROM golang:1.19-alpine AS builder

WORKDIR /app

COPY go.mod go.sum ./
RUN go mod download

COPY . .
RUN CGO_ENABLED=0 GOOS=linux go build -o posts-service ./cmd/server

FROM alpine:3.17

RUN apk --no-cache add ca-certificates

RUN addgroup -g 1001 -S appgroup && \
    adduser -S appuser -u 1001 -G appgroup

WORKDIR /app

COPY --from=builder --chown=appuser:appgroup /app/posts-service .

USER appuser

EXPOSE 8080

HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
  CMD ["./posts-service", "health"]

CMD ["./posts-service"]

26.5 Kubernetes Deployment

Kustomize Base:

# k8s/base/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-gateway
  labels:
    app: api-gateway
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api-gateway
  template:
    metadata:
      labels:
        app: api-gateway
    spec:
      containers:
      - name: api-gateway
        image: ghcr.io/myorg/myapp/api-gateway:latest
        ports:
        - containerPort: 3000
        env:
        - name: NODE_ENV
          value: "production"
        - name: USERS_SERVICE_URL
          value: "http://users-service:8000"
        - name: POSTS_SERVICE_URL
          value: "http://posts-service:8080"
        - name: REDIS_URL
          valueFrom:
            secretKeyRef:
              name: app-secrets
              key: redis-url
        resources:
          requests:
            memory: "128Mi"
            cpu: "250m"
          limits:
            memory: "256Mi"
            cpu: "500m"
        livenessProbe:
          httpGet:
            path: /health
            port: 3000
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /ready
            port: 3000
          initialDelaySeconds: 5
          periodSeconds: 5
---
# k8s/base/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: api-gateway
spec:
  selector:
    app: api-gateway
  ports:
  - port: 80
    targetPort: 3000
  type: ClusterIP
---
# k8s/base/ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-gateway
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - api.example.com
    secretName: api-tls
  rules:
  - host: api.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: api-gateway
            port:
              number: 80

Production Overlay:

# k8s/overlays/prod/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
- ../../base

namespace: production

images:
- name: ghcr.io/myorg/myapp/api-gateway
  newTag: v1.2.3
- name: ghcr.io/myorg/myapp/users-service
  newTag: v1.2.3
- name: ghcr.io/myorg/myapp/posts-service
  newTag: v1.2.3

patchesStrategicMerge:
- increase-replicas.yaml
- resource-limits.yaml

configMapGenerator:
- name: app-config
  behavior: merge
  literals:
  - LOG_LEVEL=info
  - ENVIRONMENT=production

secretGenerator:
- name: app-secrets
  behavior: merge
  literals:
  - redis-url=redis://redis-service:6379
  - database-url=postgresql://user:pass@postgres:5432/prod

# k8s/overlays/prod/increase-replicas.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-gateway
spec:
  replicas: 5
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: users-service
spec:
  replicas: 3
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: posts-service
spec:
  replicas: 3

26.6 Monitoring Setup

Prometheus Configuration:

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
- job_name: 'kubernetes-pods'
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    action: keep
    regex: true
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
    action: replace
    target_label: __metrics_path__
    regex: (.+)
  - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
    action: replace
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
    target_label: __address__
  - source_labels: [__meta_kubernetes_pod_phase]
    regex: (Failed|Succeeded)
    action: drop

ServiceMonitor for Custom Metrics:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: api-gateway
spec:
  selector:
    matchLabels:
      app: api-gateway
  endpoints:
  - port: metrics
    interval: 30s
    path: /metrics
  namespaceSelector:
    matchNames:
    - production

Grafana Dashboard:

{
  "dashboard": {
    "title": "API Gateway Dashboard",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{app='api-gateway'}[5m])) by (status_code)",
            "legendFormat": "{{status_code}}"
          }
        ]
      },
      {
        "title": "Request Latency (p99)",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{app='api-gateway'}[5m])) by (le))",
            "legendFormat": "p99"
          }
        ]
      },
      {
        "title": "Error Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{app='api-gateway', status_code=~'5..'}[5m])) / sum(rate(http_requests_total{app='api-gateway'}[5m]))",
            "legendFormat": "error ratio"
          }
        ]
      },
      {
        "title": "CPU Usage",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(container_cpu_usage_seconds_total{container='api-gateway'}) by (pod)",
            "legendFormat": "{{pod}}"
          }
        ]
      },
      {
        "title": "Memory Usage",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(container_memory_working_set_bytes{container='api-gateway'}) by (pod)",
            "legendFormat": "{{pod}}"
          }
        ]
      }
    ]
  }
}

Alert Rules:

# alerts.yml
groups:
- name: api-gateway
  rules:
  - alert: APIHighErrorRate
    expr: |
      sum(rate(http_requests_total{app='api-gateway', status_code=~'5..'}[5m]))
      /
      sum(rate(http_requests_total{app='api-gateway'}[5m])) > 0.05
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "API Gateway high error rate"
      description: "Error rate is {{ $value | humanizePercentage }} for 5 minutes"

  - alert: APIHighLatency
    expr: |
      histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{app='api-gateway'}[5m])) by (le)) > 0.5
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "API Gateway high latency"
      description: "p99 latency is {{ $value }}s for 10 minutes"

  - alert: APIDown
    expr: up{job='api-gateway'} == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "API Gateway is down"
      description: "API Gateway has been down for more than 1 minute"

26.7 Security Integration

Secret Management:

# secrets.yaml (encrypted with sops)
apiVersion: v1
kind: Secret
metadata:
  name: app-secrets
type: Opaque
data:
  database-url: ENC[AES256_GCM,data:...]
  redis-url: ENC[AES256_GCM,data:...]
  api-key: ENC[AES256_GCM,data:...]
sops:
  kms:
  - arn: arn:aws:kms:us-east-1:123456789:key/...
    created_at: "..."
    enc: "..."

Pod Security Policy (removed in Kubernetes 1.25; newer clusters use Pod Security Admission instead):

apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: restricted
spec:
  privileged: false
  allowPrivilegeEscalation: false
  requiredDropCapabilities:
  - ALL
  volumes:
  - 'configMap'
  - 'emptyDir'
  - 'projected'
  - 'secret'
  - 'downwardAPI'
  hostNetwork: false
  hostIPC: false
  hostPID: false
  runAsUser:
    rule: 'MustRunAsNonRoot'
  seLinux:
    rule: 'RunAsAny'
  supplementalGroups:
    rule: 'MustRunAs'
    ranges:
    - min: 1
      max: 65535
  fsGroup:
    rule: 'MustRunAs'
    ranges:
    - min: 1
      max: 65535

Network Policy:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-gateway-network-policy
spec:
  podSelector:
    matchLabels:
      app: api-gateway
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: frontend
    ports:
    - protocol: TCP
      port: 3000
  egress:
  - to:
    - podSelector:
        matchLabels:
          app: users-service
    ports:
    - protocol: TCP
      port: 8000
  - to:
    - podSelector:
        matchLabels:
          app: posts-service
    ports:
    - protocol: TCP
      port: 8080
  - to:
    - podSelector:
        matchLabels:
          app: redis
    ports:
    - protocol: TCP
      port: 6379

Chapter 27 — Real-World Case Studies

27.1 Netflix DevOps Model

Scale:

  • 200M+ subscribers
  • Thousands of microservices
  • Millions of streaming hours daily
  • Thousands of deployments daily

Key Practices:

1. Chaos Engineering

  • Chaos Monkey randomly terminates instances
  • Simian Army tests various failure modes
  • Latency Monkey introduces delays
  • Conformity Monkey enforces best practices

# Chaos Monkey simplified example
import random
import boto3

class ChaosMonkey:
    def __init__(self, probability=0.01):
        self.probability = probability
        self.ec2 = boto3.client('ec2')
    
    def run(self):
        for instance in self.get_production_instances():
            if random.random() < self.probability:
                self.terminate_instance(instance)
                self.notify_team(instance)  # alerting hook, implementation elided
    
    def get_production_instances(self):
        # Flatten reservations into a list of instance dicts
        response = self.ec2.describe_instances(
            Filters=[
                {'Name': 'tag:Environment', 'Values': ['production']}
            ]
        )
        return [
            instance
            for reservation in response['Reservations']
            for instance in reservation['Instances']
        ]
    
    def terminate_instance(self, instance):
        self.ec2.terminate_instances(InstanceIds=[instance['InstanceId']])

2. Immutable Infrastructure

  • Servers never patched, always replaced
  • Golden AMIs with everything baked in
  • Blue/green deployments
  • Automated rollback

3. Spinnaker for CD

  • Multi-cloud continuous delivery
  • Pipeline stages: bake, test, deploy
  • Canary analysis
  • Automated rollbacks

// Spinnaker pipeline
{
  "application": "netflix",
  "name": "deploy-service",
  "stages": [
    {
      "type": "bake",
      "name": "Bake Image",
      "baseOs": "ubuntu",
      "package": "myapp"
    },
    {
      "type": "canary",
      "name": "Canary Deploy",
      "cluster": "myapp-canary",
      "targetSize": 5,
      "analysisType": "realTime",
      "metrics": [
        "error_rate < 0.1%",
        "latency_p99 < 200ms"
      ]
    },
    {
      "type": "rollingPush",
      "name": "Production Deploy",
      "cluster": "myapp-prod",
      "targetSize": 100
    }
  ]
}

4. Culture of Freedom and Responsibility

  • "You build it, you run it"
  • Engineers own their services
  • Blameless postmortems
  • Data-driven decisions

27.2 Amazon Deployment Model

Scale:

  • 100M+ deployments per year
  • 143,000 deployments in peak hour
  • 2-pizza teams (6-10 people)
  • Service-oriented architecture

Key Practices:

1. Two-Pizza Teams

  • Small, autonomous teams
  • Full ownership of services
  • Independent deployment
  • Clear API contracts

2. Deployment Pipeline

# Amazon's deployment pipeline (simplified; helper methods elided)
import time

class DeploymentPipeline:
    def __init__(self, service_name):
        self.service = service_name
        self.stages = [
            'commit',
            'build',
            'unit_tests',
            'integration_tests',
            'performance_tests',
            'security_scan',
            'canary',
            'production'
        ]
    
    def execute(self, version):
        for stage in self.stages:
            if not self.run_stage(stage, version):
                self.rollback(version)
                return False
            
            # Collect metrics
            metrics = self.collect_metrics(stage)
            if self.thresholds_exceeded(metrics):
                self.rollback(version)
                return False
        
        return True
    
    def canary_deploy(self, version):
        # Deploy to 1% of instances
        canary_group = self.deploy_to_group(version, percent=1)
        
        # Monitor for 15 minutes
        time.sleep(900)
        
        # Check metrics
        if self.canary_healthy(canary_group):
            # Gradual rollout
            self.deploy_to_group(version, percent=10)
            time.sleep(300)
            self.deploy_to_group(version, percent=25)
            time.sleep(300)
            self.deploy_to_group(version, percent=50)
            time.sleep(300)
            self.deploy_to_group(version, percent=100)
        else:
            self.rollback_canary(version)

3. API Mandate

  • All teams expose APIs
  • No direct database access
  • Backward compatibility required
  • Versioned APIs
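The mandate's key constraint — existing API versions keep working while new ones ship — can be sketched in a few lines (handler and route names below are hypothetical, Python stdlib only):

```python
# Minimal sketch of API versioning with backward compatibility:
# a shipped version is never changed, only new versions are added,
# so callers pinned to v1 keep working after v2 ships.

def get_user_v1(user_id):
    # Original contract: returns only the name.
    return {"name": f"user-{user_id}"}

def get_user_v2(user_id):
    # New contract adds a field; the v1 response shape is preserved.
    return {"name": f"user-{user_id}", "tier": "standard"}

ROUTES = {
    ("GET", "/v1/users"): get_user_v1,
    ("GET", "/v2/users"): get_user_v2,
}

def handle(method, path, user_id):
    # Dispatch on the versioned path; unknown versions fail explicitly.
    handler = ROUTES.get((method, path))
    if handler is None:
        return {"error": "unknown version"}
    return handler(user_id)
```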

4. "You Build It, You Run It"

  • Developers carry pagers
  • On-call rotation within dev teams
  • Operational excellence is priority
  • Automated remediation
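The automated-remediation idea can be sketched as a small control loop (all function names here are hypothetical): attempt a known fix before paging a human, and escalate only if recovery fails.

```python
def remediate(check_health, restart, page, max_restarts=2):
    """Try automated recovery before waking an engineer."""
    for _ in range(max_restarts):
        if check_health():
            return "healthy"
        restart()  # apply the known fix (e.g. restart the service)
    # One final check after the last restart attempt
    if check_health():
        return "recovered"
    # Automation exhausted; escalate to the on-call engineer
    page("automated remediation failed")
    return "escalated"
```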

27.3 Google SRE Model

Scale:

  • Billions of users
  • Global infrastructure
  • SLOs defined for 100% of services
  • Error budgets for all services

Key Practices:

1. Error Budgets

class ErrorBudget:
    def __init__(self, service, slo=99.99):
        self.service = service
        self.slo = slo
        self.budget = 100 - slo
        self.consumed = 0
    
    def track_error(self, duration):
        # Accumulate error time against the budget (as % of the window)
        error_seconds = duration
        total_seconds = self.get_total_seconds()
        
        self.consumed += (error_seconds / total_seconds) * 100
        
        if self.consumed > self.budget:
            self.enforce_freeze()
    
    def enforce_freeze(self):
        # Block releases when budget exhausted
        print(f"Error budget exhausted for {self.service}")
        self.block_releases()
        self.focus_on_reliability()
    
    def reset_monthly(self):
        self.consumed = 0
        self.unblock_releases()

2. Toil Elimination

  • Target < 50% time on toil
  • Automate everything
  • Self-service platforms
  • Continuous improvement

# Toil tracking
class ToilTracker:
    def __init__(self):
        self.toil_time = 0
        self.eng_time = 0
    
    def track_activity(self, activity_type, duration):
        if activity_type == 'toil':
            self.toil_time += duration
        else:
            self.eng_time += duration
        
        self.ensure_balance()
    
    def ensure_balance(self):
        total = self.toil_time + self.eng_time
        if total > 0:
            toil_percentage = (self.toil_time / total) * 100
            
            if toil_percentage > 50:
                self.trigger_toil_reduction()
    
    def trigger_toil_reduction(self):
        print("Toil exceeds 50% - initiating reduction projects")
        # Start automation projects
        # Assign engineering time to reduce toil

3. Monitoring Philosophy

  • Monitor symptoms, not causes
  • Only alert if action required
  • Use SLIs, SLOs, error budgets
  • Minimal, actionable alerts
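These rules can be expressed as a burn-rate check: page only when the error-rate SLI is consuming the error budget fast enough to demand immediate action. The 14.4x fast-burn threshold below is a common convention, not a fixed rule; exact values vary by team and alert window.

```python
def burn_rate(error_ratio, slo=0.999):
    """How fast the error budget is being consumed.
    A rate of 1.0 means the budget lasts exactly the SLO window."""
    budget = 1 - slo  # allowed error ratio, e.g. 0.1% for a 99.9% SLO
    return error_ratio / budget

def should_page(error_ratio, slo=0.999, threshold=14.4):
    # Alert on the symptom (observed error ratio), and only when the
    # budget is burning fast enough that a human must act now.
    return burn_rate(error_ratio, slo) >= threshold
```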

27.4 Startup DevOps Strategy

Profile:

  • Series B startup
  • 50 engineers
  • AWS cloud
  • 10 microservices
  • 100K users

DevOps Implementation:

Phase 1: Foundation (Month 1-3)

  • GitHub for version control
  • GitHub Actions for CI
  • Terraform for infrastructure
  • Docker for containerization
  • ECS for orchestration (simpler than K8s)

Phase 2: Automation (Month 4-6)

  • Automated testing in CI
  • Container image building
  • Blue/green deployments
  • Basic monitoring (CloudWatch)

Phase 3: Scaling (Month 7-12)

  • Migrate to EKS
  • Service mesh (Linkerd)
  • Prometheus/Grafana
  • Centralized logging (ELK)
  • Security scanning (Trivy)

Sample CI Pipeline:

name: Startup CI/CD

on:
  push:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - run: npm ci
      - run: npm test
      - run: npm run lint
  
  build:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - run: docker build -t myapp:${{ github.sha }} .
      - run: docker tag myapp:${{ github.sha }} ${{ secrets.ECR_REPO }}:latest
      - run: aws ecr get-login-password | docker login --username AWS --password-stdin ${{ secrets.ECR_REPO }}
      - run: docker push ${{ secrets.ECR_REPO }}:latest
  
  deploy:
    needs: build
    runs-on: ubuntu-latest
    environment: production
    steps:
      - uses: actions/checkout@v2
      - run: |
          aws ecs update-service \
            --cluster myapp-cluster \
            --service myapp-service \
            --force-new-deployment \
            --region us-east-1

27.5 Enterprise Migration Story

Profile:

  • Fortune 500 financial services
  • 10,000+ employees
  • 1,000+ applications
  • Legacy data centers
  • Strict regulatory requirements

Challenges:

  • Legacy mainframe applications
  • Regulatory compliance (SOX, PCI)
  • Security concerns
  • Siloed teams
  • Vendor lock-in

Migration Phases:

Phase 1: Assessment (6 months)

  • Application portfolio analysis
  • Dependency mapping
  • Compliance requirements review
  • Skills assessment
  • Vendor evaluation

Phase 2: Foundation (12 months)

  • Create cloud landing zone
  • Establish governance framework
  • Build central platform team
  • Implement security controls
  • Set up connectivity (Direct Connect)

# Enterprise landing zone
module "landing_zone" {
  source = "terraform-aws-modules/control-tower/aws"
  
  # Multi-account structure
  organizational_units = {
    "Security" = {
      accounts = ["audit", "security-tooling"]
    }
    "Infrastructure" = {
      accounts = ["network", "shared-services", "cicd"]
    }
    "Workloads" = {
      accounts = ["dev", "test", "prod", "dr"]
    }
  }
  
  # Guardrails
  guardrails = {
    "DISALLOW_PUBLIC_IPS" = {
      type = "mandatory"
    }
    "ENFORCE_ENCRYPTION" = {
      type = "mandatory"
    }
    "ENABLE_CLOUDTRAIL" = {
      type = "mandatory"
    }
  }
}

Phase 3: Pilot (6 months)

  • Select 3 pilot applications
  • Lift-and-shift initial migrations
  • Validate security controls
  • Train first teams
  • Document patterns

Phase 4: Scale (18 months)

  • Wave-based migrations
  • Automate where possible
  • Modernize applications
  • Implement CI/CD
  • Establish FinOps
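Wave planning in this phase amounts to a topological sort of the dependency map produced during assessment: each wave contains only applications whose dependencies have already migrated. A minimal sketch using Python's standard library (application names are hypothetical):

```python
from graphlib import TopologicalSorter  # Python 3.9+

# app -> set of apps it depends on (dependencies must migrate first)
deps = {
    "web-frontend": {"auth-service", "billing"},
    "billing": {"auth-service"},
    "auth-service": set(),
    "reporting": {"billing"},
}

def migration_waves(deps):
    # Group apps into waves: everything ready at the same time
    # (all dependencies done) can migrate in parallel.
    ts = TopologicalSorter(deps)
    ts.prepare()
    waves = []
    while ts.is_active():
        ready = list(ts.get_ready())
        waves.append(sorted(ready))
        ts.done(*ready)
    return waves
```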

Phase 5: Optimize (ongoing)

  • Rightsizing
  • Spot instances
  • Containerization
  • Serverless adoption
  • Continuous improvement

Key Success Factors:

  1. Executive sponsorship - C-level support
  2. Center of Excellence - Central team
  3. Training program - Skill development
  4. Security first - Compliance from day one
  5. Measurable wins - Show progress
  6. Cultural change - DevOps mindset

Appendices

Appendix A: Linux Command Reference

File Operations:

ls -la                    # List all files with details
cd /path/to/dir           # Change directory
pwd                       # Print working directory
cp -r source dest         # Copy recursively
mv source dest            # Move/rename
rm -rf dir                # Remove forcefully
mkdir -p path/to/dir      # Create directory with parents
touch file.txt            # Create empty file/update timestamp
cat file.txt              # Display file content
less file.txt             # View file page by page
head -n 10 file.txt       # First 10 lines
tail -f file.txt          # Follow file (live updates)
find . -name "*.txt"      # Find files by name
grep -r "pattern" .       # Search recursively

Process Management:

ps aux                     # All processes
top                        # Interactive process viewer
htop                       # Enhanced top
kill -9 PID                # Force kill process
kill -15 PID               # Graceful termination
pgrep process_name         # Find PID by name
pkill process_name         # Kill by name
jobs                       # List background jobs
bg %1                      # Resume job in background
fg %1                      # Bring to foreground
nohup command &            # Run immune to hangups

Network Commands:

ip addr show               # IP addresses
ip route show              # Routing table
ss -tulpn                  # Listening ports
netstat -an                # Network statistics (legacy)
curl -I http://example.com # HTTP headers
wget http://example.com/file # Download file
ping -c 4 example.com      # ICMP ping
traceroute example.com     # Trace route
nslookup example.com       # DNS lookup
dig example.com            # Detailed DNS
telnet host port           # Test TCP connection
nc -vz host port           # Netcat port scan
tcpdump -i eth0            # Capture packets

System Information:

uname -a                    # Kernel info
cat /etc/os-release         # OS info
lscpu                       # CPU info
free -h                     # Memory usage
df -h                       # Disk usage
du -sh *                    # Directory sizes
uptime                      # System uptime
whoami                      # Current user
id                          # User identity
hostname                    # System hostname
date                        # Current date/time
dmesg | tail                # Kernel messages

Package Management (Ubuntu/Debian):

apt update                  # Update package lists
apt upgrade                 # Upgrade all packages
apt install package         # Install package
apt remove package          # Remove package
apt autoremove              # Remove unused packages
apt search pattern          # Search packages
dpkg -l                     # List installed
dpkg -S /path/to/file       # Which package owns file

Package Management (RHEL/CentOS):

yum update                  # Update all packages
yum install package         # Install package
yum remove package          # Remove package
yum search pattern          # Search packages
rpm -qa                     # List installed
rpm -qf /path/to/file       # Which package owns file

Systemd Commands:

systemctl status service     # Service status
systemctl start service      # Start service
systemctl stop service       # Stop service
systemctl restart service    # Restart service
systemctl enable service     # Enable at boot
systemctl disable service    # Disable at boot
systemctl list-units         # List all units
journalctl -u service        # View logs
journalctl -f                # Follow logs
systemctl daemon-reload      # Reload unit files

Appendix B: Git Cheat Sheet

Basic Commands:

git init                    # Initialize repository
git clone url               # Clone repository
git add file                # Stage file
git add .                   # Stage all
git commit -m "message"     # Commit staged
git status                  # Show status
git log                     # Show history
git log --oneline           # Compact history
git diff                    # Show unstaged changes
git diff --staged           # Show staged changes

Branching:

git branch                  # List branches
git branch new-branch       # Create branch
git checkout branch         # Switch branch
git checkout -b new-branch  # Create and switch
git merge branch            # Merge branch into current
git branch -d branch        # Delete branch
git push origin --delete branch # Delete remote branch

Remote Operations:

git remote -v               # List remotes
git remote add origin url   # Add remote
git push origin main        # Push to remote
git pull origin main        # Pull from remote
git fetch origin            # Fetch without merge
git remote update           # Update all remotes

Undoing Changes:

git reset file              # Unstage file
git reset --soft HEAD~1     # Undo commit, keep changes
git reset --hard HEAD~1     # Undo commit, discard changes
git revert HEAD             # Create revert commit
git checkout -- file        # Discard changes in file
git clean -fd               # Remove untracked files

Stashing:

git stash                   # Stash changes
git stash list              # List stashes
git stash pop               # Apply and remove stash
git stash apply             # Apply stash
git stash drop stash@{0}    # Drop stash
git stash branch new-branch # Create branch from stash

History and Debugging:

git log --graph --oneline   # Visual history
git blame file              # Who changed what
git bisect start            # Binary search for bug
git bisect bad              # Current is bad
git bisect good commit      # Mark good commit
git reflog                  # Reference log

Advanced:

git rebase -i HEAD~3        # Interactive rebase
git cherry-pick commit      # Apply specific commit
git tag v1.0.0              # Create tag
git push --tags             # Push tags
git submodule add url       # Add submodule
git submodule update --init # Update submodules

Appendix C: Kubernetes YAML Reference

Pod:

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
  labels:
    app: myapp
spec:
  containers:
  - name: my-container
    image: nginx:latest
    ports:
    - containerPort: 80
    env:
    - name: ENV_VAR
      value: "value"
    resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
      limits:
        memory: "128Mi"
        cpu: "500m"
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    emptyDir: {}

Deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
      - name: myapp
        image: myapp:latest
        ports:
        - containerPort: 8080
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5

Service:

apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  selector:
    app: myapp
  ports:
  - port: 80
    targetPort: 8080
    nodePort: 30080
  type: NodePort  # ClusterIP, NodePort, LoadBalancer

Ingress:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-ingress
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
spec:
  rules:
  - host: myapp.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: my-service
            port:
              number: 80
  tls:
  - hosts:
    - myapp.example.com
    secretName: myapp-tls

ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  config.json: |
    {
      "log_level": "info",
      "max_connections": 100
    }
  database_url: "postgresql://localhost/mydb"

Secret:

apiVersion: v1
kind: Secret
metadata:
  name: app-secret
type: Opaque
data:
  username: YWRtaW4=  # base64 encoded
  password: MWYyZDFlMmU2N2Rm

PersistentVolumeClaim:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-claim
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: fast

Appendix D: Terraform Module Examples

VPC Module:

# modules/vpc/main.tf
resource "aws_vpc" "main" {
  cidr_block           = var.cidr_block
  enable_dns_hostnames = true
  enable_dns_support   = true
  
  tags = var.tags
}

resource "aws_subnet" "public" {
  count = length(var.public_subnets)
  
  vpc_id            = aws_vpc.main.id
  cidr_block        = var.public_subnets[count.index]
  availability_zone = var.availability_zones[count.index]
  
  map_public_ip_on_launch = true
  
  tags = merge(var.tags, {
    Name = "public-${var.availability_zones[count.index]}"
  })
}

# modules/vpc/variables.tf
variable "cidr_block" {
  description = "CIDR block for VPC"
  type        = string
}

variable "public_subnets" {
  description = "List of public subnet CIDRs"
  type        = list(string)
}

variable "availability_zones" {
  description = "List of availability zones"
  type        = list(string)
}

variable "tags" {
  description = "Tags to apply"
  type        = map(string)
  default     = {}
}

# modules/vpc/outputs.tf
output "vpc_id" {
  value = aws_vpc.main.id
}

output "public_subnet_ids" {
  value = aws_subnet.public[*].id
}

EC2 Instance Module:

# modules/ec2/main.tf
data "aws_ami" "amazon_linux" {
  most_recent = true
  owners      = ["amazon"]
  
  filter {
    name   = "name"
    values = ["amzn2-ami-hvm-*-x86_64-gp2"]
  }
}

resource "aws_instance" "this" {
  ami                    = var.ami != "" ? var.ami : data.aws_ami.amazon_linux.id
  instance_type          = var.instance_type
  subnet_id              = var.subnet_id
  vpc_security_group_ids = var.security_group_ids
  key_name               = var.key_name
  
  user_data = var.user_data
  
  root_block_device {
    volume_type = var.root_volume_type
    volume_size = var.root_volume_size
    encrypted   = var.root_volume_encrypted
  }
  
  tags = merge(var.tags, {
    Name = var.name
  })
}

# modules/ec2/variables.tf
variable "name" {
  description = "Instance name"
  type        = string
}

variable "instance_type" {
  description = "Instance type"
  type        = string
}

variable "subnet_id" {
  description = "Subnet ID"
  type        = string
}

variable "security_group_ids" {
  description = "Security group IDs"
  type        = list(string)
}

variable "ami" {
  description = "AMI ID (optional)"
  type        = string
  default     = ""
}

variable "key_name" {
  description = "Key pair name"
  type        = string
  default     = ""
}

variable "user_data" {
  description = "User data script"
  type        = string
  default     = ""
}

variable "root_volume_size" {
  description = "Root volume size in GB"
  type        = number
  default     = 20
}

variable "root_volume_type" {
  description = "Root volume type"
  type        = string
  default     = "gp3"
}

variable "root_volume_encrypted" {
  description = "Encrypt root volume"
  type        = bool
  default     = true
}

variable "tags" {
  description = "Tags to apply"
  type        = map(string)
  default     = {}
}

# modules/ec2/outputs.tf
output "instance_id" {
  value = aws_instance.this.id
}

output "public_ip" {
  value = aws_instance.this.public_ip
}

output "private_ip" {
  value = aws_instance.this.private_ip
}

Appendix E: CI/CD Templates

GitHub Actions Multi-Stage:

name: Multi-Stage Pipeline

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

env:
  AWS_REGION: us-east-1
  ECR_REPOSITORY: myapp

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v2
    
    - name: Run tests
      run: |
        npm ci
        npm test
        npm run lint
    
    - name: Upload coverage
      uses: codecov/codecov-action@v2

  build:
    needs: test
    runs-on: ubuntu-latest
    if: github.event_name == 'push'
    outputs:
      image_tag: ${{ steps.docker_build.outputs.image_tag }}
    
    steps:
    - uses: actions/checkout@v2
    
    - name: Configure AWS credentials
      uses: aws-actions/configure-aws-credentials@v1
      with:
        aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
        aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        aws-region: ${{ env.AWS_REGION }}
    
    - name: Login to Amazon ECR
      id: login-ecr
      uses: aws-actions/amazon-ecr-login@v1
    
    - name: Build and push Docker image
      id: docker_build
      env:
        ECR_REGISTRY: ${{ steps.login-ecr.outputs.registry }}
        IMAGE_TAG: ${{ github.sha }}
      run: |
        docker build -t $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG .
        docker push $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG
        echo "::set-output name=image_tag::$IMAGE_TAG"
    
    - name: Scan image
      uses: aquasecurity/trivy-action@master
      with:
        image-ref: ${{ steps.login-ecr.outputs.registry }}/${{ env.ECR_REPOSITORY }}:${{ github.sha }}
        severity: CRITICAL,HIGH
        exit-code: 1

  deploy-dev:
    needs: build
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/develop'
    environment: development
    
    steps:
    - uses: actions/checkout@v2
    
    - name: Configure AWS credentials
      uses: aws-actions/configure-aws-credentials@v1
      with:
        aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
        aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        aws-region: ${{ env.AWS_REGION }}
    
    - name: Login to Amazon ECR
      id: login-ecr
      uses: aws-actions/amazon-ecr-login@v1
    
    - name: Update kubeconfig
      run: aws eks update-kubeconfig --name dev-cluster --region ${{ env.AWS_REGION }}
    
    - name: Deploy to EKS
      run: |
        kubectl set image deployment/myapp \
          myapp=${{ steps.login-ecr.outputs.registry }}/${{ env.ECR_REPOSITORY }}:${{ needs.build.outputs.image_tag }} \
          -n development
        
        kubectl rollout status deployment/myapp -n development

  deploy-prod:
    needs: build
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    environment: production
    
    steps:
    - uses: actions/checkout@v2
    
    - name: Configure AWS credentials
      uses: aws-actions/configure-aws-credentials@v1
      with:
        aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
        aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        aws-region: ${{ env.AWS_REGION }}
    
    - name: Login to Amazon ECR
      id: login-ecr
      uses: aws-actions/amazon-ecr-login@v1
    
    - name: Update kubeconfig
      run: aws eks update-kubeconfig --name prod-cluster --region ${{ env.AWS_REGION }}
    
    - name: Deploy to production
      run: |
        # Canary deployment (10%)
        kubectl set image deployment/myapp-canary \
          myapp=${{ steps.login-ecr.outputs.registry }}/${{ env.ECR_REPOSITORY }}:${{ needs.build.outputs.image_tag }} \
          -n production
        
        # Wait and monitor
        sleep 300
        
        # Full rollout
        kubectl set image deployment/myapp \
          myapp=${{ steps.login-ecr.outputs.registry }}/${{ env.ECR_REPOSITORY }}:${{ needs.build.outputs.image_tag }} \
          -n production
        
        kubectl rollout status deployment/myapp -n production

GitLab CI Pipeline:

stages:
  - test
  - build
  - deploy

variables:
  DOCKER_DRIVER: overlay2
  IMAGE_TAG: $CI_COMMIT_SHORT_SHA
  DOCKER_HOST: tcp://docker:2375

cache:
  paths:
    - node_modules/

test:
  stage: test
  image: node:16
  script:
    - npm ci
    - npm run lint
    - npm test
  coverage: '/All files[^|]*\|[^|]*\s+([\d\.]+)/'

build:
  stage: build
  image: docker:20.10.16
  services:
    - docker:20.10.16-dind
  # Login only where docker is used; the test job's node image
  # has no docker daemon
  before_script:
    - docker login -u $CI_REGISTRY_USER -p $CI_REGISTRY_PASSWORD $CI_REGISTRY
  script:
    - docker build -t $CI_REGISTRY_IMAGE:$IMAGE_TAG .
    - docker push $CI_REGISTRY_IMAGE:$IMAGE_TAG
  only:
    - main
    - develop

.deploy_template: &deploy_template
  stage: deploy
  image: alpine/k8s:1.22  # ships kubectl, no manual install needed
  script:
    - kubectl set image deployment/myapp myapp=$CI_REGISTRY_IMAGE:$IMAGE_TAG -n $K8S_NAMESPACE
    - kubectl rollout status deployment/myapp -n $K8S_NAMESPACE

deploy_dev:
  <<: *deploy_template
  variables:
    K8S_NAMESPACE: development
  environment:
    name: development
    url: https://dev.example.com
  only:
    - develop

deploy_staging:
  <<: *deploy_template
  variables:
    K8S_NAMESPACE: staging
  environment:
    name: staging
    url: https://staging.example.com
  only:
    - main

deploy_production:
  <<: *deploy_template
  variables:
    K8S_NAMESPACE: production
  environment:
    name: production
    url: https://example.com
  only:
    - main
  when: manual
  needs: ["deploy_staging"]

Appendix F: DevOps Interview Questions

General DevOps:

  1. What is DevOps and why is it important?
  2. Explain the CAMS model.
  3. What are the Three Ways of DevOps?
  4. How do you measure DevOps success?
  5. What is the difference between Continuous Delivery and Continuous Deployment?
  6. Explain the concept of "shift left" in security.
  7. What is Conway's Law and how does it apply to DevOps?
  8. How do you handle blameless postmortems?
  9. What are DORA metrics?
  10. Explain the difference between Agile and DevOps.

CI/CD:

  1. How would you design a CI/CD pipeline?
  2. What's the difference between Jenkins, GitHub Actions, and GitLab CI?
  3. How do you handle database migrations in CI/CD?
  4. Explain blue/green deployment.
  5. What is canary deployment and when would you use it?
  6. How do you handle secrets in CI/CD pipelines?
  7. What is pipeline as code and why is it important?
  8. How do you ensure pipeline security?
  9. Explain the concept of "build once, deploy many".
  10. How do you handle rollbacks?

Containers & Kubernetes:

  1. What's the difference between Docker and Kubernetes?
  2. Explain Kubernetes architecture.
  3. How do you expose an application running in Kubernetes?
  4. What are Kubernetes Operators?
  5. How do you handle persistent storage in Kubernetes?
  6. Explain Kubernetes network policies.
  7. What's the difference between a deployment and a statefulset?
  8. How do you debug a pod that won't start?
  9. What is Helm and why use it?
  10. Explain Kubernetes RBAC.

Infrastructure as Code:

  1. What's the difference between declarative and imperative IaC?
  2. Explain Terraform vs Ansible.
  3. How do you manage Terraform state?
  4. What are modules in Terraform and why use them?
  5. How do you test infrastructure code?
  6. What is immutable infrastructure?
  7. Explain idempotency in IaC.
  8. How do you handle secrets in Terraform?
  9. What's the difference between Terraform and CloudFormation?
  10. How do you version infrastructure code?

Cloud:

  1. Explain the shared responsibility model.
  2. What's the difference between IaaS, PaaS, and SaaS?
  3. How do you design for high availability?
  4. Explain multi-region architecture.
  5. How do you manage cloud costs?
  6. What is VPC peering?
  7. Explain the difference between security groups and network ACLs.
  8. How do you implement disaster recovery?
  9. What is a landing zone?
  10. How do you handle cloud governance?

Monitoring & SRE:

  1. What are the four golden signals?
  2. Explain SLIs, SLOs, and SLAs.
  3. What is an error budget?
  4. How do you design effective alerts?
  5. What's the difference between metrics, logs, and traces?
  6. Explain the USE method.
  7. What is the RED method?
  8. How do you handle on-call rotations?
  9. What is chaos engineering?
  10. How do you measure reliability?

Security:

  1. What is DevSecOps?
  2. How do you implement security in CI/CD?
  3. What is SAST vs DAST?
  4. Explain container security best practices.
  5. How do you manage secrets?
  6. What is SBOM and why is it important?
  7. How do you scan for vulnerabilities?
  8. Explain the principle of least privilege.
  9. What is policy as code?
  10. How do you handle compliance in cloud?

Scenario Questions:

  1. A deployment is causing 500 errors. How do you respond?
  2. How would you migrate a legacy application to the cloud?
  3. Your builds are taking 30 minutes. How do you optimize?
  4. How would you implement a multi-region disaster recovery plan?
  5. A critical vulnerability is found in a dependency. What do you do?
  6. How would you convince management to invest in DevOps?
  7. Your team is experiencing burnout from on-call. How do you fix it?
  8. How would you design a platform for 100 microservices?
  9. A database migration caused downtime. How do you prevent recurrence?
  10. How would you implement cost optimization for a growing startup?

Appendix G: DevOps Maturity Model

Level 1: Initial

  • Manual deployments
  • No version control
  • Siloed teams
  • Reactive monitoring
  • Long release cycles (months)
  • High failure rate
  • Firefighting culture

Level 2: Managed

  • Version control for code
  • Basic CI (build automation)
  • Some documentation
  • Scheduled releases
  • Basic monitoring
  • Defined roles
  • Tickets for operations

Level 3: Defined

  • CI/CD pipelines
  • Automated testing
  • Configuration management
  • Standardized environments
  • Proactive monitoring
  • Defined SLIs/SLOs
  • Blameless postmortems

Level 4: Measured

  • Pipeline as code
  • Infrastructure as code
  • Self-service platforms
  • Automated security scanning
  • Performance testing
  • Capacity planning
  • Error budgets

Level 5: Optimizing

  • GitOps workflows
  • Chaos engineering
  • AIOps/MLOps
  • Auto-remediation
  • Continuous experimentation
  • FinOps optimization
  • Platform engineering

Appendix H: Glossary

A

  • Agile: Iterative software development methodology
  • Artifact: Output of build process (JAR, Docker image)
  • Autoscaling: Automatically adjusting resources based on demand

B

  • Blue/Green Deployment: Two identical environments, switch traffic
  • Build: Process of compiling source code into artifacts

C

  • CAMS: Culture, Automation, Measurement, Sharing
  • Canary Deployment: Gradual rollout to subset of users
  • CD: Continuous Delivery/Deployment
  • CI: Continuous Integration
  • Chaos Engineering: Deliberately introducing failures
  • CNCF: Cloud Native Computing Foundation
  • Container: Lightweight virtualization at OS level
  • CRD: Custom Resource Definition (Kubernetes)

D

  • DaemonSet: Runs pod on every node (Kubernetes)
  • DAST: Dynamic Application Security Testing
  • Deployment: Kubernetes resource for managing pods
  • DevOps: Cultural and technical movement for collaboration
  • DORA: DevOps Research and Assessment
  • Docker: Container platform

E

  • EKS: Amazon Elastic Kubernetes Service
  • ELK: Elasticsearch, Logstash, Kibana
  • Error Budget: (1 - SLO) × time window; the amount of unreliability a service may accrue

F

  • Feature Flag: Toggle for feature visibility
  • FinOps: Cloud financial management
  • Flux: GitOps operator

G

  • Git: Distributed version control
  • GitOps: Git as source of truth for infrastructure
  • GKE: Google Kubernetes Engine
  • Grafana: Visualization platform

H

  • Helm: Kubernetes package manager
  • HPA: Horizontal Pod Autoscaler
  • Hybrid Cloud: Mix of public and private cloud

I

  • IaC: Infrastructure as Code
  • IAM: Identity and Access Management
  • Idempotent: Operation with same effect when run multiple times
  • Ingress: Kubernetes API object for external access
  • Istio: Service mesh

J

  • Jenkins: CI/CD automation server
  • JSON: JavaScript Object Notation

K

  • K8s: Kubernetes ("K", 8 letters, "s")
  • Kustomize: Kubernetes configuration customization
  • Kyverno: Kubernetes policy engine

L

  • Lambda: AWS serverless function
  • Load Balancer: Distributes traffic
  • Logging: Recording events

M

  • Microservices: Architecture with small, independent services
  • Monitoring: Collecting and analyzing metrics
  • mTLS: Mutual TLS for service authentication

N

  • Namespace: Isolation mechanism in Kubernetes
  • Network Policy: Firewall rules for pods
  • Node: Worker machine in Kubernetes

O

  • Observability: Understanding system internals through outputs
  • OCI: Open Container Initiative
  • OPA: Open Policy Agent
  • Operator: Kubernetes extension for application management

P

  • PaaS: Platform as a Service
  • Pod: Smallest deployable unit in Kubernetes
  • Prometheus: Monitoring system
  • PV: Persistent Volume
  • PVC: Persistent Volume Claim

R

  • RBAC: Role-Based Access Control
  • ReplicaSet: Ensures specified number of pods running
  • Rolling Update: Gradually replacing instances
  • Runbook: Documented procedures for operations

S

  • SaaS: Software as a Service
  • SAST: Static Application Security Testing
  • SBOM: Software Bill of Materials
  • Secret: Kubernetes resource for sensitive data
  • Service Mesh: Infrastructure layer for service communication
  • SLA: Service Level Agreement
  • SLI: Service Level Indicator
  • SLO: Service Level Objective
  • SRE: Site Reliability Engineering

T

  • Terraform: IaC tool by HashiCorp
  • Toil: Manual, repetitive operational work
  • Tracing: Tracking request through distributed system

U

  • Unit Test: Testing individual components
  • USE Method: Utilization, Saturation, Errors

V

  • VCS: Version Control System
  • VPC: Virtual Private Cloud
  • VPA: Vertical Pod Autoscaler

W

  • Waterfall: Sequential development methodology
  • Workload: Application running on Kubernetes

X

  • XML: eXtensible Markup Language

Y

  • YAML: YAML Ain't Markup Language

Z

  • Zero Downtime Deployment: Deployment without service interruption
