PART I — DEVOPS FOUNDATIONS
- Chapter 1 — Introduction to DevOps
- Chapter 2 — DevOps Culture & Organizational Design
- Chapter 3 — Linux & System Fundamentals for DevOps
PART II — VERSION CONTROL & COLLABORATION
- Chapter 4 — Git Internals & Advanced Workflows
- Chapter 5 — Platforms
PART III — CI/CD PIPELINES
- Chapter 6 — Continuous Integration
- Chapter 7 — CI Tools
- Chapter 8 — Continuous Delivery & Deployment
PART IV — CONTAINERS & ORCHESTRATION
- Chapter 9 — Containerization
- Chapter 10 — Kubernetes Deep Dive
- Chapter 11 — Kubernetes in Production
PART V — INFRASTRUCTURE AS CODE
- Chapter 12 — Infrastructure as Code Principles
- Chapter 13 — IaC Tools
PART VI — CLOUD PLATFORMS
- Chapter 14 — Cloud Fundamentals
- Chapter 15 — Amazon Web Services
- Chapter 16 — Microsoft Azure
- Chapter 17 — Google Cloud Platform
PART VII — OBSERVABILITY & SRE
- Chapter 18 — Monitoring & Logging
- Chapter 19 — Site Reliability Engineering
PART VIII — DEVSECOPS
- Chapter 20 — Secure DevOps
- Chapter 21 — Security Tools
PART IX — ADVANCED TOPICS
- Chapter 22 — GitOps & Platform Engineering
- Chapter 23 — Serverless & Edge
- Chapter 24 — Performance & Scalability
- Chapter 25 — DevOps at Enterprise Scale
PART X — PRACTICAL IMPLEMENTATION
- Chapter 26 — Building a Complete DevOps Pipeline
- Chapter 27 — Real-World Case Studies
Appendices
The journey of software development methodologies spans over six decades, evolving from the nascent days of computing to the sophisticated, automated pipelines we see today. Understanding this history is crucial for appreciating why DevOps emerged as a necessary evolution rather than a passing trend.
The Pioneering Era (1950s-1960s)
In the early days of computing, software was tightly coupled with hardware. Programs were written in machine language or assembly, and the concept of "software development" as a distinct discipline barely existed. The IBM 704, introduced in 1954, was one of the first mass-produced computers, and programming it involved physical plugboards and punch cards. There was no separation between development and operations—the same people who wrote the code also ran the machines. This period was characterized by:
- Batch Processing: Jobs were submitted on punch cards, and results would return hours or days later.
- Hardware Dominance: Software was often given away for free with hardware purchases.
- No Standardization: Every machine had its own architecture and instruction set.
The Software Crisis and Structured Programming (1960s-1970s)
As hardware became more powerful and affordable, software complexity grew exponentially. The NATO Software Engineering Conferences of 1968 and 1969 coined the term "software crisis," highlighting that projects were running over budget, over time, and producing unreliable software. This crisis led to:
- Structured Programming: Pioneered by Edsger Dijkstra and others, this paradigm introduced disciplined control structures (if-then-else, loops) instead of chaotic goto statements.
- The Waterfall Model: Winston Royce's 1970 paper (often mischaracterized) described a sequential model that would become the dominant methodology for decades.
- Separation of Concerns: For the first time, distinct roles emerged—analysts, designers, programmers, testers, and operators.
The Rise of Personal Computing and Client-Server (1980s)
The 1980s brought personal computers and the client-server architecture. Software was now shipped on floppy disks and later CDs. This era saw:
- Packaged Software: Companies like Microsoft began selling software as products.
- Graphical User Interfaces: The Macintosh (1984) and Windows (1985) made computing accessible to non-technical users.
- Networked Applications: With the growth of LANs, applications became distributed.
- Formalized ITIL: The Information Technology Infrastructure Library emerged in the UK, providing a framework for IT service management, further codifying the separation between development (creating applications) and operations (running infrastructure).
The Internet Boom (1990s)
The commercialization of the internet in the mid-1990s changed everything. Companies like Amazon (1994), eBay (1995), and Google (1998) were born on the web, even though cloud computing as we know it didn't exist yet. This period introduced:
- Web Applications: Software was no longer installed but accessed via browsers.
- LAMP Stack: Linux, Apache, MySQL, and PHP/Python/Perl became the dominant open-source web development platform.
- Rapid Growth: The pressure to release features quickly to beat competitors intensified.
- Dot-com Bubble: The frenzy led to massive investments and subsequent crash, but the foundational technologies survived.
The Agile Manifesto (2001)
By the late 1990s, the heavyweight, documentation-driven methodologies were creaking under the pressure of internet-speed development. Seventeen software developers met at a ski resort in Utah and crafted the Agile Manifesto, which emphasized:
- Individuals and interactions over processes and tools
- Working software over comprehensive documentation
- Customer collaboration over contract negotiation
- Responding to change over following a plan
Agile methodologies like Scrum, Extreme Programming (XP), and Kanban transformed how development teams worked, promoting iterative development, continuous feedback, and cross-functional collaboration. However, Agile focused primarily on developers and product owners—operations remained largely untouched.
To understand the transition from Waterfall to Agile, we must examine both methodologies in depth.
The Waterfall Model
The Waterfall model, despite its widespread adoption, was never intended to be rigid. Royce's original paper actually recommended iteration. However, the model that emerged was strictly sequential:
- Requirements Analysis: Gather and document all requirements before any design begins.
- System Design: Create detailed architectural and design specifications based on requirements.
- Implementation: Write code according to the design documents.
- Testing: Verify that the implemented system meets the requirements.
- Deployment: Release the tested system to production.
- Maintenance: Fix issues and make enhancements post-release.
Challenges with Waterfall:
- Late Feedback: Users don't see working software until very late in the process.
- Change Resistance: Changing requirements mid-stream is expensive and disruptive.
- Integration Hell: Integration happens at the end, often revealing conflicts and issues that require significant rework.
- Long Release Cycles: Releases might take months or years.
- Siloed Teams: Developers throw code "over the wall" to testers, who then throw it to operations.
The Agile Revolution
Agile methodologies emerged as a direct response to these challenges. The Agile Manifesto's 12 principles include:
- Deliver working software frequently, from a couple of weeks to a couple of months.
- Welcome changing requirements, even late in development.
- Business people and developers must work together daily throughout the project.
- Build projects around motivated individuals and trust them to get the job done.
- Working software is the primary measure of progress.
- Continuous attention to technical excellence and good design enhances agility.
Scrum became the most popular Agile framework, introducing:
- Sprints: Time-boxed iterations (usually 2 weeks)
- Roles: Product Owner, Scrum Master, Development Team
- Ceremonies: Sprint Planning, Daily Stand-up, Sprint Review, Sprint Retrospective
Kanban offered a different approach:
- Visualize workflow
- Limit work in progress
- Manage flow
- Make process policies explicit
- Improve collaboratively
The Gap Agile Created
While Agile dramatically improved development productivity, it inadvertently widened the gap between Dev and Ops. Developers were now releasing software every two weeks, but operations teams (still following ITIL) were accustomed to quarterly or annual releases. This created:
- Deployment Conflicts: Developers wanted frequent releases; operations prioritized stability.
- Environment Inconsistencies: Code worked on developer laptops but failed in production.
- Blame Game: When production issues occurred, developers blamed operations for poor infrastructure, and operations blamed developers for buggy code.
- Manual Handoffs: Each release required manual documentation, change requests, and deployment procedures.
The term "DevOps" was coined in 2009 by Patrick Debois, who organized the first DevOpsDays conference in Ghent, Belgium. However, the ideas behind DevOps had been brewing for years.
The Agile Infrastructure Conversation (2008)
In 2008, at the Agile Conference in Toronto, Andrew Clay Shafer and Patrick Debois discussed the idea of "Agile Infrastructure." They realized that the principles of Agile—collaboration, iteration, feedback—could and should apply to operations. This conversation planted the seeds for what would become DevOps.
The Flickr Talk
At the 2009 Velocity Conference, John Allspaw and Paul Hammond from Flickr presented "10+ Deploys per Day: Dev and Ops Cooperation at Flickr." This groundbreaking talk showed how Flickr had broken down the barriers between development and operations, achieving unprecedented deployment frequency. The talk went viral in the tech community and catalyzed the DevOps movement.
Defining DevOps
DevOps is not a tool, a job title, or a specific technology. It's a cultural and professional movement that stresses communication, collaboration, and integration between software developers and IT operations professionals. At its core, DevOps aims to:
- Break down silos between development, operations, and other stakeholders
- Automate manual processes to increase efficiency and reduce errors
- Measure everything to understand system behavior and business impact
- Share knowledge, responsibility, and ownership across teams
The Three Ways
Gene Kim, in "The Phoenix Project" and "The DevOps Handbook," codified DevOps principles into "The Three Ways":
First Way: Systems Thinking (Flow)
- Emphasizes the performance of the entire system, not just silos
- Focus on creating fast, smooth flow from development to operations to the customer
- Never pass known defects downstream
- Optimize for global goals, not local efficiencies
Second Way: Amplify Feedback Loops
- Create short, fast feedback loops from operations back to development
- Enable quick detection and recovery from issues
- Swarm problems to prevent recurrence
- Build quality in by finding and fixing defects at the source
Third Way: Culture of Continuous Experimentation and Learning
- Foster a culture that takes risks and learns from failure
- Understand that repetition and practice are prerequisites to mastery
- Allocate time for improvement of daily work
- Introduce faults to increase resilience (chaos engineering)
The CAMS model, popularized by Damon Edwards and John Willis, provides a framework for understanding the core dimensions of DevOps.
Culture (The Foundation)
Culture is the most critical and most challenging aspect of DevOps. It encompasses:
- Trust and Collaboration: Teams trust each other and collaborate across boundaries.
- Shared Goals: Dev and Ops share responsibility for the entire service lifecycle.
- Respect: Each team respects the others' expertise and constraints.
- Experimentation: Failure is viewed as a learning opportunity, not a reason for punishment.
- Continuous Improvement: Teams constantly seek ways to improve processes and systems.
Culture Anti-patterns:
- Blaming individuals for system failures
- Throwing work "over the wall" between teams
- Hiding information or hoarding knowledge
- Fear of change or experimentation
Automation (The Enabler)
Automation is what makes DevOps practices scalable and repeatable. Key areas include:
- Infrastructure Automation: Provisioning servers, networks, and storage through code (Terraform, CloudFormation)
- Configuration Automation: Managing system configurations (Ansible, Puppet, Chef)
- Build and Deployment Automation: CI/CD pipelines (Jenkins, GitHub Actions)
- Testing Automation: Automated unit, integration, and security tests
- Environment Management: Consistent development, testing, and production environments
Automation Principles:
- Automate repetitive, error-prone manual tasks
- Version control everything (infrastructure, configuration, pipelines)
- Treat automation code as production code (testing, review, documentation)
- Start with the most painful manual processes first
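The principle of treating automation code as production code implies, above all, idempotence: running the same automation twice must not change the system twice. A minimal sketch of desired-state convergence, using a hypothetical config-file example:

```python
import os
import tempfile

def ensure_line(path, line):
    """Idempotently ensure a config file contains the given line.

    Returns True if a change was made, False if the system was
    already in the desired state (so re-running is always safe).
    """
    lines = []
    if os.path.exists(path):
        with open(path) as f:
            lines = f.read().splitlines()
    if line in lines:
        return False  # desired state already holds: no change made
    with open(path, "a") as f:
        f.write(line + "\n")
    return True

# The first run converges the system; the second is a no-op.
path = os.path.join(tempfile.mkdtemp(), "sshd_config")
print(ensure_line(path, "PermitRootLogin no"))  # True  (changed)
print(ensure_line(path, "PermitRootLogin no"))  # False (already converged)
```

This check-before-change pattern is what tools like Ansible and Terraform perform for every managed resource, which is why their runs can be repeated and reviewed like any other code.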
Measurement (The Evidence)
You cannot improve what you cannot measure. Measurement in DevOps includes:
- Deployment Metrics: Frequency, lead time, success rate
- Operational Metrics: Availability, latency, throughput, error rates
- Business Metrics: Customer satisfaction, revenue, feature adoption
- Team Metrics: Morale, burnout, knowledge sharing
Key Performance Indicators (KPIs):
- Deployment Frequency: How often do we deploy to production?
- Lead Time for Changes: How long does it take from commit to running in production?
- Mean Time to Recovery (MTTR): How quickly can we recover from failures?
- Change Failure Rate: What percentage of changes cause degraded service?
Sharing (The Multiplier)
Sharing creates a virtuous cycle where knowledge and improvements propagate throughout the organization.
- Cross-functional Teams: Dev and Ops work together on shared goals.
- Knowledge Transfer: Pair programming, documentation, brown bag sessions.
- Shared Tools and Platforms: Internal developer platforms, common toolchains.
- Blame-free Postmortems: Share learnings from failures without fear of reprisal.
- Open Source Contributions: Share innovations with the broader community.
Understanding the distinctions and relationships between DevOps and its complementary approaches (Agile, SRE, DevSecOps, and platform engineering) is essential.
DevOps vs Agile
| Aspect | Agile | DevOps |
|---|---|---|
| Focus | Development practices | Full lifecycle (Dev+Ops) |
| Primary Goal | Deliver value iteratively | Deliver value continuously and reliably |
| Scope | Development team | Development + Operations + QA + Security |
| Timeframe | Sprint iterations | Continuous delivery pipeline |
| Key Practices | Stand-ups, retrospectives, story pointing | CI/CD, monitoring, infrastructure as code |
| Metrics | Velocity, story points | DORA metrics, SLIs/SLOs |
Relationship: Agile and DevOps are complementary. Agile improves how features are built; DevOps improves how those features are delivered and operated. Many organizations adopt Agile first, then DevOps to address operational bottlenecks.
DevOps vs SRE
Site Reliability Engineering (SRE) was pioneered at Google and codified by Ben Treynor Sloss. SRE applies software engineering principles to operations problems.
| Aspect | DevOps | SRE |
|---|---|---|
| Origin | Community movement | Google internal practice |
| Philosophy | Break down silos, collaborate | Apply software engineering to ops |
| Key Concept | CAMS model | Error budgets |
| Implementation | Cultural and technical practices | Specific roles and practices |
| Focus | Collaboration and automation | Reliability and scalability |
Relationship: Google describes SRE as "what happens when you ask a software engineer to design an operations team." Many consider SRE a specific implementation of DevOps principles with a stronger focus on reliability engineering.
Key SRE Practices:
- Service Level Objectives (SLOs) and Error Budgets
- Eliminating toil through automation
- Monitoring and alerting design
- Capacity planning
- Incident response
- Chaos engineering
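The error-budget idea is concrete enough to compute. A small sketch with illustrative numbers: an availability SLO implies a fixed allowance of downtime per window, and each outage spends part of it.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for a given availability SLO."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo)

def budget_remaining(slo: float, window_days: int, downtime_minutes: float) -> float:
    """Fraction of the error budget still unspent (negative = overspent)."""
    budget = error_budget_minutes(slo, window_days)
    return 1 - downtime_minutes / budget

# A 99.9% SLO over 30 days allows about 43.2 minutes of downtime.
print(error_budget_minutes(0.999))
# After a 20-minute outage, roughly 54% of the budget remains:
print(round(budget_remaining(0.999, 30, 20), 2))
```

When the remaining budget approaches zero, SRE practice shifts the team from launching features to improving reliability; the budget, not an argument, makes that call.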
DevSecOps integrates security practices throughout the DevOps lifecycle rather than adding security as a final gate. The motto is "Security as Code" and "Shift Left" (moving security earlier in the development process).
Why DevSecOps?
Traditional security approaches created bottlenecks:
- Security testing happened at the end of development
- Security findings caused last-minute delays
- Security teams were seen as blockers, not enablers
- Vulnerabilities were discovered too late for easy remediation
DevSecOps Principles:
- Shift Left: Test security early and often throughout the pipeline
- Automate Security: Integrate automated security tools into CI/CD
- Security as Code: Define security policies and configurations in code
- Continuous Compliance: Automate compliance checking and reporting
- Shared Responsibility: Everyone owns security, not just the security team
Security Integration Points:
- Code: SAST (Static Application Security Testing), secrets scanning
- Dependencies: SCA (Software Composition Analysis), dependency scanning
- Build: Container scanning, SBOM generation
- Deploy: Policy as code, compliance validation
- Runtime: DAST (Dynamic Application Security Testing), runtime protection
- Infrastructure: Infrastructure scanning, cloud security posture management
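To give a flavor of the "Code" integration point, secrets scanning can be approximated with a few regular expressions. Real scanners such as gitleaks ship hundreds of patterns plus entropy checks; the patterns below are illustrative only.

```python
import re

# Illustrative patterns only; production scanners use far richer rule
# sets and entropy analysis to catch keys these regexes would miss.
SECRET_PATTERNS = {
    "aws_access_key": re.compile(r"AKIA[0-9A-Z]{16}"),
    "private_key": re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
    "generic_token": re.compile(
        r"(?i)(api|secret)[_-]?key\s*=\s*['\"][^'\"]{16,}['\"]"),
}

def scan(text):
    """Return (line_number, pattern_name) for each suspected secret."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, name))
    return findings

sample = 'db_host = "10.0.0.5"\napi_key = "0123456789abcdef0123"\n'
print(scan(sample))  # [(2, 'generic_token')]
```

Wired into a CI job that fails the build on any finding, even this crude check shifts secret detection left of the commit reaching production.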
Platform Engineering has emerged as a natural evolution of DevOps practices, especially in large organizations. It focuses on building Internal Developer Platforms (IDPs) that abstract infrastructure complexity and provide self-service capabilities to development teams.
The Problem Platform Engineering Solves
As organizations scale, the cognitive load on developers increases:
- Multiple cloud providers
- Complex Kubernetes configurations
- Numerous tools and technologies
- Security and compliance requirements
- Observability setup
Developers spend more time on infrastructure and tooling than on business logic.
What is an Internal Developer Platform?
An IDP is a cohesive layer of tools and services that development teams use to build, deploy, and operate applications without needing to understand the underlying infrastructure.
Key Capabilities:
- Self-service provisioning of environments
- Standardized deployment pipelines
- Built-in security and compliance controls
- Golden paths and paved roads
- Observability and debugging tools
- Documentation and onboarding
Platform Engineering vs DevOps
| Aspect | DevOps | Platform Engineering |
|---|---|---|
| Focus | Culture and practices | Building and maintaining platforms |
| Target | All teams | Platform team and application teams |
| Output | Improved collaboration | Internal developer platform |
| Key Metric | DORA metrics | Developer satisfaction, time-to-value |
Common Myths:
Myth 1: DevOps is a tool or technology. Reality: DevOps is fundamentally about culture and practices. Tools enable DevOps but don't create it.
Myth 2: DevOps means no operations team. Reality: Operations responsibilities shift from manual management to building automation and platforms.
Myth 3: DevOps is only for startups. Reality: Large enterprises like Amazon, Netflix, and Google have successfully adopted DevOps.
Myth 4: DevOps requires rewriting everything. Reality: DevOps can be applied incrementally to existing systems and processes.
Myth 5: DevOps eliminates the need for testing. Reality: Testing becomes more critical and more automated.
Anti-patterns:
- DevOps Team: Creating a separate "DevOps team" that acts as a silo defeats the purpose.
- Tools First: Buying and installing tools without addressing culture and processes.
- Automation Without Understanding: Automating broken processes just breaks things faster.
- No Measurement: Implementing practices without measuring their impact.
- Skipping Security: Treating security as an afterthought.
- Hero Culture: Relying on individuals to fix problems manually rather than building resilient systems.
- Ignoring Technical Debt: Accumulating technical debt that slows down delivery.
Organizations that successfully implement DevOps see measurable business benefits:
Speed:
- 200x more frequent deployments (DORA research)
- 2,555x faster lead time from commit to deploy
- Faster time-to-market for new features
Stability:
- 3x lower change failure rate
- 24x faster recovery from failures
- 50% fewer outages
Security:
- 50% less time spent on security remediation
- Faster vulnerability patching
- Improved compliance posture
Business Outcomes:
- Higher customer satisfaction
- Increased market share
- Better employee retention
- Lower operational costs
- Improved innovation capacity
Netflix: Cloud Native Excellence
Netflix's DevOps journey is legendary. After a major database corruption in 2008 that prevented DVD shipments for days, Netflix committed to moving to AWS and embracing cloud-native architecture.
Key Practices:
- Chaos Engineering: Simian Army tools (Chaos Monkey) deliberately cause failures to test resilience
- Immutable Infrastructure: Servers are never patched; they're replaced
- Microservices: Thousands of microservices running on AWS
- Continuous Delivery: Thousands of deployments daily
- Culture of Freedom and Responsibility: Engineers have significant autonomy and ownership
Results: Netflix achieved global scale, 99.99% availability, and the ability to deploy thousands of times daily.
Amazon: The Deployment Machine
Amazon's journey to DevOps was driven by CEO Jeff Bezos' mandate: all teams must expose their data and functionality through service interfaces, and teams must communicate only through these interfaces.
Key Practices:
- Two-Pizza Teams: Small, autonomous teams (fewer than 10 people)
- You Build It, You Run It: Teams own their services end-to-end
- Single-threaded Ownership: Clear ownership without shared responsibility
- Deployment Pipeline: Sophisticated pipeline enabling 50 million+ deployments annually
- API Mandate: All communication through well-defined APIs
Results: Amazon achieves on the order of 140,000 deployments in a single day (consistent with 50 million+ per year), with each team deploying independently.
Google: SRE Pioneers
Google developed SRE to manage its massive scale. The SRE team at Google is responsible for keeping services running while maintaining a 50% cap on operational work—the rest is development work to improve systems.
Key Practices:
- Error Budgets: 100% reliability is the wrong target; error budgets define acceptable unreliability
- Borg/Omega/Kubernetes: Internal container orchestration evolved into Kubernetes
- Blameless Postmortems: Focus on fixing systems, not blaming people
- Toil Elimination: Automate away repetitive operational work
- Capacity Planning: Data-driven approach to scaling
Results: Google maintains incredible reliability (Gmail 99.978%) while continuously deploying thousands of changes.
The structure of an organization profoundly impacts its ability to implement DevOps. Understanding different organizational models is essential.
Functional (Siloed) Structure
In traditional IT organizations, teams are structured by function:
CEO
┌────────────┼────────────┐
Development QA Operations
│ │ │
Dev Teams QA Teams Ops Teams
Characteristics:
- Clear career paths within functions
- Deep expertise in specific domains
- Standardized practices within silos
- Handoffs between teams
- Local optimization over global outcomes
Problems with Functional Structure:
- Slow handoffs create bottlenecks
- Misaligned incentives (Dev wants features, Ops wants stability)
- Blame culture when things go wrong
- Knowledge silos
- Difficulty implementing end-to-end ownership
Product-Aligned (Cross-functional) Structure
DevOps promotes organizing around products or services:
CEO
┌────────────┼────────────┐
Product A Product B Product C
│ │ │
[Dev, QA, Ops] [Dev, QA, Ops] [Dev, QA, Ops]
Characteristics:
- Teams own their product end-to-end
- Members from different functions collaborate daily
- Aligned incentives around product success
- Faster decision-making
- Clear ownership and accountability
Benefits:
- Reduced handoffs and waiting times
- Faster feedback loops
- Better understanding of customer needs
- Improved quality through ownership
- Higher team morale and autonomy
Matrix Structure (Hybrid)
Some organizations use a matrix structure where individuals report to both functional and product managers:
CEO
┌────────────┼────────────┐
Development QA Operations
│ │ │
┌────┼────┐ ┌────┼────┐ ┌────┼────┐
A B C A B C A B C
│ │ │ │ │ │ │ │ │
└─┼─┼───────┼─┼─┼───────┼─┼─┘
└─┼───────┼─┼─────────┘
└───────┼─┘
↓
Product A
Benefits:
- Maintain functional expertise while enabling product focus
- Flexible resource allocation
- Career development within functions
Challenges:
- Conflicting priorities (functional vs product goals)
- Complex reporting relationships
- Potential for confusion and politics
Conway's Law, formulated by Melvin Conway in 1967, states:
"Organizations which design systems ... are constrained to produce designs which are copies of the communication structures of these organizations."
In simpler terms: Your system architecture will mirror your organizational structure.
Implications for DevOps:
- Communication Patterns Become Architecture:
  - If teams communicate through tickets, the system will have slow, bureaucratic interfaces
  - If teams can talk directly, the system can have tight integration
  - If teams are siloed, the system will have siloed components
- Inverse Conway Maneuver:
  - To achieve a desired architecture, reorganize teams to match it
  - Want microservices? Create small, autonomous teams
  - Want a platform? Create a platform team that treats other teams as customers
- Team Boundaries:
  - Teams should own complete, loosely-coupled components
  - APIs between teams should be clean and well-documented
  - Teams should be able to deploy independently
Practical Application:
When designing microservices architecture:
- Identify bounded contexts (domain-driven design)
- Form teams around these contexts
- Ensure teams have all necessary skills (cross-functional)
- Define clear APIs between team-owned services
- Enable independent deployment per team
Psychological safety, a concept popularized by Harvard professor Amy Edmondson, is crucial for high-performing DevOps teams. It's defined as "a shared belief that the team is safe for interpersonal risk-taking."
Why It Matters in DevOps:
- Blameless Culture: When incidents occur, teams need to investigate without fear of punishment.
- Experimentation: DevOps requires trying new things; psychological safety enables this.
- Learning from Failure: Only in safe environments do people openly discuss mistakes.
- Speaking Up: Team members need to raise concerns about security, quality, or process issues.
- Innovation: New ideas emerge when people feel safe sharing half-formed thoughts.
Building Psychological Safety:
For Leaders:
- Model vulnerability by admitting your own mistakes
- Ask questions, don't provide all answers
- Frame work as learning problems, not execution problems
- Acknowledge your own fallibility
- Actively invite input from quieter team members
For Teams:
- Establish ground rules for discussion
- No interrupting or dismissing ideas
- Focus on systems, not people, when things go wrong
- Celebrate learning from failures
- Create anonymous feedback channels
For Individuals:
- Ask for help when needed
- Offer help to others
- Share your mistakes and what you learned
- Assume good intentions from others
Measuring Psychological Safety:
- Do team members feel comfortable admitting mistakes?
- Are dissenting opinions expressed and heard?
- Do people ask for help without hesitation?
- Is failure discussed as a learning opportunity?
- Are there diverse perspectives in decision-making?
The blameless postmortem is a cornerstone of DevOps culture. After an incident, teams conduct a thorough analysis focused on understanding what happened and preventing recurrence—not on assigning blame.
Principles of Blameless Postmortems:
- Assume Good Intentions: Everyone was doing their best with the information they had.
- Focus on Systems, Not People: Human error is a symptom of system problems.
- Fix the Process, Not the Person: If a person could make a mistake, the system allowed it.
- Share Learnings Widely: Postmortems should be public within the organization.
- Actionable Improvements: Every postmortem should produce concrete action items.
The Postmortem Process:
Immediate Response (During Incident):
- Focus on restoring service
- Document actions and timestamps
- Preserve evidence (logs, metrics)
Post-Incident Analysis (24-48 hours after):
- Gather all participants
- Timeline reconstruction
- Root cause analysis (multiple contributing factors)
- Identify what went well and what didn't
Writing the Postmortem:
A good postmortem includes:
- Executive Summary: Brief overview for leadership
- Incident Details: Date, duration, impact, severity
- Timeline: Chronological sequence of events
- Root Cause: Technical explanation of what failed
- Contributing Factors: Why the conditions existed
- Detection: How the incident was discovered
- Response: How the team handled it
- Lessons Learned: What we now know
- Action Items: Specific, assigned tasks with due dates
Example Action Items:
- "Add monitoring for database connection pool exhaustion"
- "Update deployment documentation with rollback procedure"
- "Implement automated testing for migration scripts"
- "Add canary deployment for configuration changes"
Common Pitfalls:
- Superficial Analysis: Stopping at "human error" instead of digging deeper
- No Action Items: Learning without implementing improvements
- Blaming Language: "He should have..." instead of "The system allowed..."
- Keeping Secrets: Hiding postmortems from other teams
- Punishing Honesty: Making people regret speaking openly
DevOps transformations require leadership at all levels, but especially from those in formal leadership positions.
Characteristics of DevOps Leaders:
- Servant Leadership: Leaders exist to serve and enable their teams, not the other way around.
- Systems Thinkers: Leaders understand how parts of the organization interact.
- Change Agents: They actively work to improve culture and processes.
- Technical Empathy: They understand technical challenges and constraints.
- Coaching Mindset: They develop people, not just deliver projects.
- Bias for Action: They value progress over perfection.
- Long-term Perspective: They invest in capabilities, not just immediate results.
Leadership Responsibilities:
Creating Vision:
- Articulate why DevOps matters
- Define success metrics
- Communicate the transformation journey
- Align DevOps goals with business objectives
Removing Obstacles:
- Eliminate bureaucratic barriers
- Provide resources and tools
- Resolve organizational conflicts
- Shield teams from distractions
Modeling Behavior:
- Demonstrate blameless culture
- Show vulnerability
- Learn in public
- Celebrate learning from failure
Building Capability:
- Invest in training and development
- Create career paths
- Hire for culture add
- Develop internal expertise
Measuring Progress:
- Track DORA metrics
- Survey team morale
- Monitor business outcomes
- Adjust strategy based on data
Leadership Anti-patterns:
- Command and Control: Dictating solutions instead of enabling teams
- Short-term Focus: Prioritizing immediate features over long-term capabilities
- Inconsistent Messaging: Saying one thing but rewarding another
- Fear-based Management: Using metrics to punish instead of improve
- Hollow Empowerment: Saying "you're empowered" but overriding decisions
DevOps transforms how organizations approach change—from rigid, approval-based processes to automated, verified, and continuous flows.
Traditional Change Management:
- Change Advisory Board (CAB) approves all changes
- Weekly or bi-weekly meetings
- Paperwork-heavy requests
- Focus on risk avoidance
- Slow, batch-oriented
DevOps Change Management:
- Automated validation and testing
- Peer review through code review
- Gradual rollout with monitoring
- Fast rollback capability
- Focus on risk management
- Continuous, small changes
Key Principles:
- Changes Should Be Small: Small changes are easier to review, test, and roll back.
- Automate Where Possible: Automated testing replaces manual approval for many changes.
- Verification Over Approval: Prove changes work through testing rather than seeking permission.
- Gradual Exposure: Roll out changes progressively, monitoring impact.
- Emergency Changes Are Rare: If you need frequent emergency changes, your process is broken.
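Gradual exposure can be sketched as a canary loop that widens traffic only while the observed error rate stays below a threshold. The `error_rate_at` callable is a stand-in for a monitoring query; the stages and threshold are illustrative.

```python
def canary_rollout(stages, error_rate_at, threshold=0.01):
    """Progressively shift traffic; roll back on the first bad reading.

    stages        -- traffic percentages to walk through, e.g. [1, 5, 25, 100]
    error_rate_at -- callable returning the observed error rate at a stage
                     (in production this would query monitoring, not a stub)
    Returns the final traffic percentage: stages[-1] on success, 0 on rollback.
    """
    for pct in stages:
        if error_rate_at(pct) > threshold:
            print(f"error budget breached at {pct}% -> rolling back")
            return 0  # automated, fast rollback replaces manual approval
        print(f"{pct}% of traffic healthy, proceeding")
    return stages[-1]

# A healthy release reaches 100%; a faulty one is caught at the 5% stage.
print(canary_rollout([1, 5, 25, 100], lambda pct: 0.002))
print(canary_rollout([1, 5, 25, 100], lambda pct: 0.002 if pct < 5 else 0.08))
```

The point is structural: the decision to proceed is made by a measured signal at each stage, so a bad change is contained while its blast radius is still small.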
The Change Management Spectrum:
| Type | Traditional | DevOps |
|---|---|---|
| Infrastructure | CAB approval | Terraform + automated testing |
| Application | Release manager | CI/CD pipeline + canary |
| Configuration | Ticket + manual | Git push + automated |
| Security | Pen test before release | Continuous scanning |
When CAB Still Makes Sense:
- Regulatory compliance requirements
- Financial systems with audit mandates
- Changes with no rollback option
- External customer commitments
- Initial transformation phase
Measuring DevOps success requires moving beyond traditional IT metrics.
DORA Metrics (Four Key Metrics):
The State of DevOps Reports, produced by DORA (DevOps Research and Assessment), identified four key metrics that predict organizational performance:
- Deployment Frequency: How often an organization successfully releases to production
- Elite: Multiple deploys per day
- High: Weekly to monthly
- Medium: Monthly to every 6 months
- Low: Less than every 6 months
- Lead Time for Changes: The time from code commit to code successfully running in production
- Elite: Less than one hour
- High: One day to one week
- Medium: One week to one month
- Low: One month to six months
- Mean Time to Recovery (MTTR): How long it takes to restore service after an incident
- Elite: Less than one hour
- High: Less than one day
- Medium: One day to one week
- Low: One week to one month
- Change Failure Rate: The percentage of changes that result in degraded service
- Elite: 0-15%
- High: 16-30%
- Medium: 16-30%
- Low: 16-30%
(The bands overlap deliberately: the 2021 DORA report placed High, Medium, and Low in the same 16-30% range for this metric.)
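The tier boundaries above can be turned into simple classifiers, which is handy when reporting DORA metrics from your own deployment data. This is a sketch with simplified band edges; the function names are ours, not DORA's.

```python
# Toy helpers mapping raw delivery numbers onto the bands listed above.

def deployment_frequency_tier(deploys_per_year):
    if deploys_per_year >= 365:      # multiple deploys per day
        return "Elite"
    if deploys_per_year >= 12:       # weekly to monthly
        return "High"
    if deploys_per_year >= 2:        # monthly to every six months
        return "Medium"
    return "Low"

def lead_time_tier(hours):
    if hours < 1:                    # under one hour
        return "Elite"
    if hours <= 7 * 24:              # up to one week
        return "High"
    if hours <= 30 * 24:             # up to one month
        return "Medium"
    return "Low"

print(deployment_frequency_tier(730), lead_time_tier(48))
```

In practice these numbers come from pipeline events (deploy timestamps, commit timestamps), so the same classification can be computed automatically rather than by survey.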
Additional Metrics:
Flow Metrics:
- Deployment size (smaller is better)
- Batch size (smaller is better)
- Wait times between stages
- Work in progress (WIP) limits
Quality Metrics:
- Defect escape rate (bugs found in production)
- Test coverage
- Mean time to detection (MTTD)
- Mean time between failures (MTBF)
Business Metrics:
- Time to market for new features
- Customer satisfaction (CSAT/NPS)
- Revenue per employee
- Feature adoption rate
Team Health Metrics:
- Employee Net Promoter Score (eNPS)
- Turnover rate
- Burnout indicators
- Learning and development hours
Metrics Anti-patterns:
- Vanity Metrics: Numbers that look good but don't indicate real performance
- Gaming the System: Optimizing metrics at the expense of actual outcomes
- Comparing Teams: Using metrics to rank teams creates unhealthy competition
- No Context: Metrics without understanding the underlying context
- Measuring Everything: Analysis paralysis from too many metrics
High-performing DevOps teams share common characteristics and practices.
Characteristics:
- Cross-functional Composition:
- All skills needed to deliver value
- No external dependencies for common tasks
- T-shaped skills (deep in one area, broad in others)
- Clear Ownership:
- End-to-end responsibility
- Clear boundaries between teams
- "You build it, you run it" mentality
- Autonomy with Alignment:
- Freedom to choose how to achieve goals
- Alignment on what goals matter
- Guardrails, not gates
- Psychological Safety:
- Safe to take risks
- Open communication
- Learning culture
- Continuous Improvement:
- Regular retrospectives
- Time for improvement work
- Blameless problem-solving
Building Practices:
Team Formation:
- Start with clear mission and boundaries
- Include all necessary roles
- Define success metrics together
- Establish team norms and working agreements
Onboarding:
- Structured mentorship program
- Pair programming with experienced team members
- Gradual responsibility increase
- Documentation and learning resources
Team Rituals:
- Daily stand-up (15 minutes max)
- Regular planning sessions
- Retrospectives (blameless, action-oriented)
- Demo days or show-and-tell
- Social activities
Knowledge Management:
- Living documentation
- Code comments and READMEs
- Architecture decision records (ADRs)
- Brown bag lunches
- Internal tech talks
Career Development:
- Individual growth plans
- Technical and management tracks
- Conference attendance and speaking
- Internal mobility opportunities
- Mentoring programs
InnerSource applies open source software development practices to internal software development.
What is InnerSource?
InnerSource takes the lessons learned from open source development (transparency, collaboration, meritocracy) and applies them within the corporate firewall. It enables developers from different teams to contribute to each other's codebases.
Core Principles:
- Open by Default: Code is visible to everyone in the organization.
- Voluntary Participation: Contributors choose what to work on.
- Meritocracy: Influence comes from contribution quality, not position.
- Asynchronous Collaboration: Work happens across time zones without constant coordination.
- Community Over Committee: Decisions emerge from community practice.
Benefits:
- Reduced Duplication: Teams can reuse and improve existing code
- Cross-team Collaboration: Breaking down silos organically
- Skill Development: Developers learn from diverse codebases
- Faster Innovation: More contributors finding and fixing problems
- Standardization: Natural emergence of best practices
InnerSource Roles:
- Trusted Committers: Maintainers who review and merge contributions
- Contributors: Developers submitting improvements
- Product Owners: Define direction and priorities
- Users: Teams that depend on the code
InnerSource Workflow:
- Discover: Find a project to contribute to
- Understand: Read documentation and code
- Discuss: Open an issue or discussion
- Develop: Create your changes
- Submit: Open a pull request
- Review: Work with maintainers on feedback
- Merge: Code is accepted and deployed
- Celebrate: Recognition for contribution
Implementing InnerSource:
Start Small:
- Choose one or two foundational projects
- Document contribution guidelines clearly
- Make it easy to find and build projects
- Recognize and reward contributions
Infrastructure Needs:
- Internal code hosting (GitHub Enterprise, GitLab)
- CI/CD that works for external contributors
- Clear documentation and onboarding
- Communication channels (Slack, mailing lists)
Cultural Requirements:
- Leadership support for cross-team work
- Time allocated for contributing to other teams
- Recognition for contributions
- Trust that teams will make good decisions
Transforming to DevOps is a journey, not a destination. Here's a structured approach.
Phase 1: Foundation (3-6 months)
Goals:
- Build awareness and understanding
- Secure leadership buy-in
- Identify pilot teams and projects
- Establish basic metrics
Activities:
- Executive workshops on DevOps principles
- Assess current state and pain points
- Form a DevOps Center of Excellence (optional)
- Train pilot teams on DevOps basics
- Implement version control for everything
Success Criteria:
- Leadership alignment on transformation goals
- Pilot teams identified and trained
- Baseline metrics established
- Initial version control adoption
Phase 2: Pilot (6-12 months)
Goals:
- Demonstrate success with pilot teams
- Build reusable patterns and practices
- Develop internal expertise
- Create momentum for broader adoption
Activities:
- Implement CI/CD for pilot applications
- Automate infrastructure provisioning
- Establish monitoring and alerting
- Conduct blameless postmortems
- Document patterns and practices
- Share successes across organization
Success Criteria:
- Measurable improvements in DORA metrics for pilots
- Repeatable patterns documented
- Internal champions developed
- Interest from other teams
Phase 3: Expand (12-24 months)
Goals:
- Scale practices across organization
- Standardize tools and platforms
- Build internal platform/self-service capabilities
- Embed DevOps in organizational processes
Activities:
- Train all teams on DevOps practices
- Implement standard toolchain
- Build Internal Developer Platform
- Update HR processes (hiring, reviews)
- Integrate security (DevSecOps)
- Establish communities of practice
Success Criteria:
- Organization-wide adoption of core practices
- Self-service platform available
- Security integrated in pipelines
- DevOps competencies in job descriptions
Phase 4: Optimize (24+ months)
Goals:
- Continuous improvement culture
- Experimentation and innovation
- Industry leadership
- Platform evolution
Activities:
- Advanced practices (chaos engineering, SRE)
- Machine learning for operations
- Open source contributions
- Publish case studies and speak at conferences
- Evolve platform based on feedback
Success Criteria:
- Elite DORA performance
- Industry recognition
- Attract and retain top talent
- Business outcomes clearly linked to DevOps
Critical Success Factors:
- Leadership Commitment: Transformation requires sustained executive support
- Patience: Culture change takes years, not months
- Focus on Value: Always connect DevOps work to business outcomes
- Celebrate Wins: Recognize and share successes
- Learn from Failures: Treat setbacks as learning opportunities
- Stay Humble: There's always more to learn and improve
Understanding Linux architecture is fundamental for any DevOps engineer. Linux powers the vast majority of servers, containers, and cloud infrastructure.
The Linux Kernel
The kernel is the core of the operating system, managing hardware resources and providing essential services:
Kernel Components:
- Process Scheduler (CPU Management):
- Manages process execution
- Implements scheduling policies (CFS - Completely Fair Scheduler)
- Handles context switching
- Manages CPU affinity and priorities
- Memory Manager:
- Virtual memory management
- Paging and swapping
- Memory allocation (malloc/free)
- Shared memory and memory mapping
- Page cache for file I/O
- File System Manager:
- Virtual File System (VFS) abstraction
- Supports multiple file systems (ext4, XFS, Btrfs)
- Inode management
- File permissions and attributes
- Journaling for reliability
- Network Stack:
- Protocol implementations (TCP/IP, UDP)
- Socket abstraction
- Network device drivers
- Firewall (netfilter/iptables/nftables)
- Traffic control and QoS
- Device Drivers:
- Interface with hardware devices
- Character and block devices
- USB, PCI, SCSI subsystems
- Device model and sysfs
- Inter-process Communication (IPC):
- Pipes and FIFOs
- Message queues
- Shared memory
- Semaphores
- Signals
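The first IPC mechanism in the list, the anonymous pipe, can be demonstrated in a few lines. This is a minimal POSIX-only sketch using Python's thin wrappers over the underlying kernel calls; the message text is arbitrary.

```python
import os

# An anonymous pipe shared between a parent and a forked child.
r, w = os.pipe()
pid = os.fork()
if pid == 0:                      # child: write one message and exit
    os.close(r)
    os.write(w, b"hello from child")
    os.close(w)
    os._exit(0)
os.close(w)                       # parent: close the unused write end
msg = os.read(r, 1024)
os.close(r)
os.waitpid(pid, 0)                # reap the child
print(msg.decode())
```

Closing the unused ends matters: if the parent kept the write end open, its read could block forever waiting for data that never comes.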
User Space vs Kernel Space
Linux separates execution into two modes:
Kernel Space:
- Runs in privileged mode
- Direct hardware access
- Memory protected from user space
- Device drivers and core services
User Space:
- Runs in unprivileged mode
- Access to hardware only through kernel syscalls
- Applications, libraries, and services
- Isolated from other user processes
System Calls
User space programs request kernel services through system calls:
Application (user space)
↓
Library call (glibc)
↓
System call (int 0x80 / syscall)
↓
Kernel (kernel space)
Common system calls:
- read(), write() - File I/O
- fork(), exec() - Process creation
- socket(), connect() - Networking
- mmap() - Memory mapping
- open(), close() - File operations
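These calls can be exercised directly from Python, whose os module exposes thin wrappers over the raw syscalls (file descriptors instead of file objects). A small sketch, using a temp-directory path of our own choosing:

```python
import os, tempfile

path = os.path.join(tempfile.gettempdir(), "syscall_demo.txt")

fd = os.open(path, os.O_CREAT | os.O_WRONLY | os.O_TRUNC, 0o644)  # open()
os.write(fd, b"written via the write() syscall\n")                # write()
os.close(fd)                                                      # close()

fd = os.open(path, os.O_RDONLY)
data = os.read(fd, 4096)                                          # read()
os.close(fd)
os.remove(path)
print(data.decode(), end="")
```

Running the same script under strace would show each of these wrappers issuing exactly one corresponding syscall.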
File System Hierarchy
Linux follows the Filesystem Hierarchy Standard (FHS):
/ (root)
├── bin - Essential user binaries
├── boot - Boot loader files
├── dev - Device files
├── etc - System configuration
├── home - User home directories
├── lib - Essential shared libraries
├── media - Mount points for removable media
├── mnt - Temporarily mounted filesystems
├── opt - Optional application software
├── proc - Virtual filesystem for process info
├── root - Root user home
├── sbin - System binaries
├── sys - Virtual filesystem for system info
├── tmp - Temporary files
├── usr - User utilities and applications
│ ├── bin - User binaries
│ ├── lib - Libraries
│ ├── local - Locally installed software
│ └── share - Architecture-independent data
└── var - Variable data
├── log - Log files
├── mail - Mail spool
└── tmp - Temporary files preserved across reboots
Processes are the running instances of programs. Understanding process management is crucial for debugging and performance tuning.
Process States
A process can be in one of several states:
R (Running/Runnable): Process is executing or ready to execute
S (Sleeping): Waiting for an event (interruptible)
D (Uninterruptible Sleep): Waiting for I/O (usually disk)
T (Stopped): Stopped by job control signal
Z (Zombie): Terminated but not yet reaped by parent
Process Lifecycle:
- Creation: fork() creates a copy of the parent; exec() loads a new program
- Ready: Process is ready to run and waiting for CPU
- Running: Process is executing on CPU
- Waiting: Process waiting for I/O or event
- Terminated: Process finished execution
- Zombie: Waiting for parent to read exit status
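The zombie state can be observed directly. In this Linux-only sketch (it reads /proc), a child exits immediately, and until the parent calls waitpid() the kernel keeps the child's exit status around and reports its state as Z:

```python
import os, time

def proc_state(pid):
    # Read the 'State:' field from /proc/<pid>/status (Linux-specific)
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("State:"):
                return line.split()[1]   # e.g. 'R', 'S', 'Z'

pid = os.fork()
if pid == 0:
    os._exit(7)                          # child terminates immediately
time.sleep(0.2)                          # let the child exit
state = proc_state(pid)                  # 'Z' until the parent reaps it
_, status = os.waitpid(pid, 0)           # reap: the zombie disappears
print("state before reaping:", state)
print("child exit code:", os.WEXITSTATUS(status))
```

This is why long-running daemons must reap their children (or ignore SIGCHLD): otherwise zombies accumulate and eventually exhaust the PID space.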
Process Attributes:
- PID (Process ID): Unique identifier
- PPID (Parent PID): ID of parent process
- UID/EUID: User ID and effective user ID
- GID/EGID: Group ID and effective group ID
- Priority/Nice value: Scheduling priority
- Environment variables: Process environment
- File descriptors: Open files and sockets
Process Management Commands:
Viewing Processes:
ps aux # All processes with details
ps -ef # Full format listing
top # Interactive process viewer
htop # Enhanced interactive viewer
pstree # Process tree
pgrep sshd # Find PIDs by name
Process Control:
kill -TERM <PID> # Terminate gracefully
kill -KILL <PID> # Force kill
kill -STOP <PID> # Suspend process
kill -CONT <PID> # Resume process
nice -n 10 command # Start with lower priority
renice 10 <PID> # Change priority of running process
Background/Foreground:
command & # Run in background
Ctrl+Z # Suspend foreground job
jobs # List background jobs
bg %1 # Resume job in background
fg %1 # Bring job to foreground
Process Limits:
View and modify process limits with ulimit:
ulimit -a # Show all limits
ulimit -n 65536 # Max open files
ulimit -u 100 # Max user processes
Important limits:
- nofile: Maximum open file descriptors
- nproc: Maximum user processes
- stack: Stack size
- core: Core file size
- memlock: Max locked-in-memory address space
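These limits are also reachable programmatically: Python's resource module wraps the getrlimit()/setrlimit() calls that back ulimit. A small sketch for the nofile limit (the 1024 value is just an illustrative target):

```python
import resource

# RLIMIT_NOFILE corresponds to the 'nofile' limit above.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open files: soft={soft} hard={hard}")

# A process may lower its own soft limit without privileges;
# raising it above the hard limit requires CAP_SYS_RESOURCE/root.
resource.setrlimit(resource.RLIMIT_NOFILE, (min(soft, 1024), hard))
new_soft, _ = resource.getrlimit(resource.RLIMIT_NOFILE)
```

Services that open many sockets (proxies, databases) routinely check and raise this limit at startup rather than relying on shell defaults.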
Linux supports multiple file systems and provides a unified interface through the Virtual File System (VFS).
Common File Systems:
ext4 (Fourth Extended Filesystem):
- Default for many Linux distributions
- Journaling for reliability
- Supports large files (up to 16TB) and volumes (up to 1EB)
- Backward compatible with ext2/ext3
XFS:
- High-performance, scalable
- Excellent for large files and parallel I/O
- Online defragmentation and resizing
- Common for media and data-intensive applications
Btrfs (B-tree Filesystem):
- Copy-on-write (COW) architecture
- Built-in snapshots and rollback
- Subvolumes and quotas
- RAID support integrated
- Checksums on data and metadata
ZFS (on Linux via OpenZFS):
- Combined file system and volume manager
- Data integrity with checksums
- Snapshots, clones, and replication
- Compression and deduplication
- Originally from Solaris, now available on Linux
tmpfs:
- Temporary file system in RAM
- Fast but volatile
- Mounted at /tmp, /run, /dev/shm
procfs and sysfs:
- Virtual file systems for kernel interfaces
- /proc: Process and system information
- /sys: Device and kernel parameters
File System Operations:
Mounting and Unmounting:
mount /dev/sda1 /mnt/data # Mount filesystem
umount /mnt/data # Unmount
mount -a # Mount all in fstab
findmnt # Show mount tree
df -h # Disk usage of mounted filesystems
Creating File Systems:
mkfs.ext4 /dev/sdb1 # Create ext4 filesystem
mkfs.xfs /dev/sdc1 # Create XFS filesystem
mkfs.btrfs /dev/sdd1 # Create Btrfs filesystem
Checking and Repairing:
fsck /dev/sda1 # Check and repair
xfs_repair /dev/sdb1 # XFS repair
btrfs check /dev/sdc1 # Btrfs check
File System Tuning:
tune2fs -l /dev/sda1 # View ext4 parameters
xfs_info /dev/sdb1 # View XFS parameters
btrfs filesystem show # Show Btrfs info
Inodes and Directory Structure:
- Inode: Metadata structure for files (permissions, ownership, timestamps, pointers to data blocks)
- Directory: Mapping of filenames to inodes
- Hard links: Multiple filenames pointing to same inode
- Symbolic links: Special files pointing to other filenames
ls -i # Show inode numbers
stat file.txt # Show inode details
ln file.txt hardlink # Create hard link
ln -s file.txt symlink # Create symbolic link
Networking is fundamental to distributed systems. DevOps engineers must understand Linux networking deeply.
Network Stack Overview:
Application Layer (HTTP, DNS, SSH)
↓
Transport Layer (TCP, UDP)
↓
Network Layer (IP, ICMP)
↓
Link Layer (Ethernet, WiFi)
↓
Physical Hardware
Network Configuration:
Network Interfaces:
ip link # List network interfaces
ip addr show # Show IP addresses
ip route show # Show routing table
ethtool eth0 # Show interface details
ss -tulpn # Show listening sockets
Interface Configuration (Netplan/ifupdown):
Modern Linux uses Netplan (Ubuntu) or NetworkManager:
# /etc/netplan/01-netcfg.yaml
network:
  version: 2
  ethernets:
    eth0:
      addresses:
        - 192.168.1.100/24
      routes:
        - to: default
          via: 192.168.1.1
      nameservers:
        addresses: [8.8.8.8, 8.8.4.4]
Network Namespaces:
Network namespaces provide isolated network stacks:
ip netns add red # Create namespace
ip netns exec red bash # Run shell in namespace
ip link add veth0 type veth peer name veth1 # Virtual ethernet pair
ip link set veth0 netns red # Move interface to namespace
Socket Programming Concepts:
- Socket: Endpoint for communication
- Port: 16-bit number identifying service
- TCP: Connection-oriented, reliable, ordered
- UDP: Connectionless, unreliable, unordered
- UNIX domain sockets: IPC on same host
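The concepts above fit together in a few lines of code: a listening socket bound to a port, a connecting client, and data flowing over a TCP connection. A minimal localhost echo round-trip (port 0 asks the kernel to pick a free port; the thread just keeps the example single-file):

```python
import socket, threading

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))          # port 0: kernel picks a free port
server.listen(1)
port = server.getsockname()[1]

def handle():
    conn, _ = server.accept()
    conn.sendall(conn.recv(1024))      # echo the payload back
    conn.close()

t = threading.Thread(target=handle)
t.start()

client = socket.create_connection(("127.0.0.1", port))
client.sendall(b"ping")
reply = client.recv(1024)
client.close()
t.join()
server.close()
print(reply)
```

Each of these Python calls maps onto a syscall from the kernel's socket API: socket(), bind(), listen(), accept(), connect(), send(), recv().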
Common Network Services:
DNS (Domain Name System):
cat /etc/resolv.conf # DNS configuration
dig example.com # DNS lookup
nslookup example.com # Alternative lookup
host example.com # Simple lookup
HTTP/HTTPS:
curl -I https://example.com # Fetch HTTP headers
wget https://example.com/file # Download file
nc -v example.com 80 # Test TCP connection
Network Diagnostics:
ping -c 4 example.com # Test connectivity
traceroute example.com # Trace network path
mtr example.com # Combined ping+traceroute
ss -tulpn # Socket statistics
netstat -an # Network statistics (older)
tcpdump -i eth0 port 80 # Capture packets
nmap -p 1-1000 example.com # Port scanning
Firewall with iptables/nftables:
iptables (legacy):
iptables -L # List rules
iptables -A INPUT -p tcp --dport 22 -j ACCEPT # Allow SSH
iptables -A INPUT -j DROP # Drop everything else
iptables-save > rules.txt # Save rules
nftables (modern):
nft list ruleset # List all rules
nft add table inet filter # Create table
nft add chain inet filter input { type filter hook input priority 0\; }
nft add rule inet filter input tcp dport 22 accept
Shell scripting automates repetitive tasks and is essential for DevOps.
Bash Basics:
Shebang and Execution:
#!/bin/bash
# This is a comment
echo "Hello, World!"
Variables:
name="John"
echo "Hello, $name"
readonly constant="cannot change"
export ENV_VAR="visible to child processes"
Arrays:
fruits=("apple" "banana" "orange")
echo ${fruits[0]} # First element
echo ${fruits[@]} # All elements
echo ${#fruits[@]} # Array length
Conditionals:
if [ "$name" == "John" ]; then
echo "Hello John"
elif [ "$name" == "Jane" ]; then
echo "Hello Jane"
else
echo "Hello stranger"
fi
# File tests
if [ -f "$file" ]; then # File exists
if [ -d "$dir" ]; then # Directory exists
if [ -x "$executable" ]; then # Is executable
Loops:
# For loop
for i in {1..5}; do
echo "Number $i"
done
# While loop
count=1
while [ $count -le 5 ]; do
echo "Count $count"
((count++))
done
# Reading lines
while IFS= read -r line; do
echo "Line: $line"
done < file.txt
Functions:
greet() {
local name="$1" # Local variable
echo "Hello, $name"
return 0 # Return status
}
greet "World"
Error Handling:
set -e # Exit on error
set -u # Exit on undefined variable
set -o pipefail # Pipe fails if any command fails
trap 'cleanup' EXIT # Run on exit
trap 'echo "Interrupted"; exit' INT # Handle Ctrl+C
Practical DevOps Scripts:
Backup Script:
#!/bin/bash
set -euo pipefail
BACKUP_DIR="/backup/$(date +%Y%m%d)"
SOURCE_DIR="/data"
mkdir -p "$BACKUP_DIR"
tar -czf "$BACKUP_DIR/backup.tar.gz" "$SOURCE_DIR"
# Rotate old backups (keep 7 days)
find /backup -type d -mtime +7 -exec rm -rf {} \;
Health Check Script:
#!/bin/bash
check_service() {
local host="$1"
local port="$2"
timeout 1 bash -c "echo >/dev/tcp/$host/$port" 2>/dev/null
return $?
}
if check_service "localhost" 8080; then
echo "Service is up"
else
echo "Service is down"
exit 1
fi
Deployment Script:
#!/bin/bash
set -e
VERSION="$1"
if [ -z "$VERSION" ]; then
echo "Usage: $0 <version>"
exit 1
fi
echo "Deploying version $VERSION"
./run_tests.sh
./build.sh "$VERSION"
scp "build/app-$VERSION" server:/apps/current
ssh server systemctl restart myapp
Systemd is the init system and service manager for most modern Linux distributions.
Core Concepts:
- Units: Resources managed by systemd (services, sockets, mounts, etc.)
- Targets: Groups of units (like runlevels)
- Journal: Centralized logging system
Unit Types:
- .service: System services
- .socket: IPC or network sockets
- .device: Device files
- .mount: Filesystem mount points
- .timer: Scheduled tasks (cron replacement)
- .target: Group of units
Service Unit Example:
# /etc/systemd/system/myapp.service
[Unit]
Description=My Application
After=network.target
Wants=redis.service
Requires=mongodb.service
[Service]
Type=simple
User=myapp
Group=myapp
WorkingDirectory=/opt/myapp
ExecStart=/usr/bin/node /opt/myapp/app.js
ExecReload=/bin/kill -HUP $MAINPID
Restart=always
RestartSec=10
Environment=NODE_ENV=production
LimitNOFILE=65536
[Install]
WantedBy=multi-user.target
Common Commands:
systemctl start myapp # Start service
systemctl stop myapp # Stop service
systemctl restart myapp # Restart service
systemctl reload myapp # Reload configuration
systemctl status myapp # Show status
systemctl enable myapp # Enable at boot
systemctl disable myapp # Disable at boot
systemctl daemon-reload # Reload unit files
Journald (Logging):
journalctl -u myapp # Show logs for service
journalctl -f # Follow logs
journalctl --since "1 hour ago" # Time-based filter
journalctl -p err # Show only errors
journalctl _PID=1234 # Filter by PID
Timer Units (Cron Replacement):
# /etc/systemd/system/backup.timer
[Unit]
Description=Daily backup timer
[Timer]
OnCalendar=daily
Persistent=true
[Install]
WantedBy=timers.target
# /etc/systemd/system/backup.service
[Unit]
Description=Daily backup
[Service]
Type=oneshot
ExecStart=/usr/local/bin/backup.sh
Linux distributions use package managers to install, update, and remove software.
Debian/Ubuntu (apt/dpkg):
# Update package lists
apt update
# Upgrade all packages
apt upgrade
# Install package
apt install nginx
# Remove package
apt remove nginx
# Search packages
apt search nginx
# Show package info
apt show nginx
# List installed
dpkg -l
# Find which package owns a file
dpkg -S /etc/nginx/nginx.conf
Red Hat/CentOS/Fedora (yum/dnf):
# Update package lists
yum check-update
# Upgrade packages
yum update
# Install package
yum install nginx
# Remove package
yum remove nginx
# Search
yum search nginx
# Show info
yum info nginx
# List installed
rpm -qa
# Find package owner
rpm -qf /etc/nginx/nginx.conf
Building from Source:
Sometimes packages aren't available and you need to compile:
wget https://example.com/software.tar.gz
tar -xzf software.tar.gz
cd software
./configure --prefix=/usr/local
make
make install
Performance monitoring helps identify bottlenecks and capacity issues.
CPU Monitoring:
top # Real-time process view
htop # Enhanced top
mpstat -P ALL 1 # Per-CPU statistics
vmstat 1 # System statistics
uptime # Load average
cat /proc/cpuinfo # CPU information
Memory Monitoring:
free -h # Memory usage
vmstat 1 # Virtual memory stats
cat /proc/meminfo # Detailed memory info
smem # Memory per process
Disk I/O Monitoring:
iostat -x 1 # Extended disk statistics
iotop # I/O per process
df -h # Filesystem usage
du -sh * # Directory sizes
Network Monitoring:
iftop # Network traffic by host
nethogs # Traffic by process
ss -tulpn # Socket statistics
sar -n DEV 1 # Network statistics
System Performance Tuning:
Kernel Parameters (/etc/sysctl.conf):
# Increase network buffers
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
# TCP tuning
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
# File system
fs.file-max = 2097152
# Virtual memory
vm.swappiness = 10
vm.dirty_ratio = 40
Process Limits (/etc/security/limits.conf):
* soft nofile 65536
* hard nofile 65536
* soft nproc unlimited
* hard nproc unlimited
Logs are crucial for troubleshooting and monitoring.
System Logs:
- /var/log/syslog or /var/log/messages: General system logs
- /var/log/auth.log: Authentication logs
- /var/log/kern.log: Kernel messages
- /var/log/dmesg: Boot messages
- /var/log/nginx/: Nginx logs
- /var/log/mysql/: MySQL logs
Log Rotation (logrotate):
# /etc/logrotate.d/nginx
/var/log/nginx/*.log {
daily
missingok
rotate 14
compress
delaycompress
notifempty
create 0640 nginx adm
sharedscripts
postrotate
[ -f /var/run/nginx.pid ] && kill -USR1 `cat /var/run/nginx.pid`
endscript
}
Centralized Logging with rsyslog:
# /etc/rsyslog.conf
*.* @logserver.example.com:514 # Send all logs to remote server
Log Analysis:
# Count errors
grep -c "ERROR" app.log
# Tail with filtering
tail -f app.log | grep ERROR
# Find unique IPs
awk '{print $1}' access.log | sort | uniq -c | sort -nr
# Time-based analysis
grep "$(date +%Y-%m-%d)" app.log
Security is critical for production systems.
User and Access Management:
# Remove unnecessary users
userdel -r username
# Disable root SSH login
# In /etc/ssh/sshd_config:
# PermitRootLogin no
# Use SSH keys only
# PasswordAuthentication no
# Implement sudo with care
visudo
File Permissions:
# Secure sensitive files
chmod 600 /etc/shadow
chmod 644 /etc/passwd
chmod 600 /etc/ssh/sshd_config
# Set proper ownership
chown root:root /etc/passwd
Network Security:
# Basic firewall
ufw default deny incoming
ufw default allow outgoing
ufw allow ssh
ufw enable
# Disable unused services
systemctl disable bluetooth
systemctl disable cups
# Secure sysctl settings
# /etc/sysctl.d/99-security.conf
net.ipv4.conf.all.rp_filter = 1
net.ipv4.conf.all.accept_source_route = 0
net.ipv4.tcp_syncookies = 1
Filesystem Security:
# Mount options in /etc/fstab
# /dev/sda1 /home ext4 defaults,noexec,nosuid 0 2
# /tmp tmpfs tmpfs defaults,noexec,nosuid,nodev 0 0
Auditing and Monitoring:
# Install and configure auditd
auditctl -w /etc/passwd -p wa -k passwd_changes
auditctl -w /etc/shadow -p wa -k shadow_changes
# Check for unusual activity
lastb # Failed login attempts
last # Last logins
journalctl -u ssh # SSH logs
Automatic Security Updates:
# Ubuntu/Debian
apt install unattended-upgrades
dpkg-reconfigure -plow unattended-upgrades
# Red Hat/CentOS
yum install yum-cron
systemctl enable yum-cron
Security Tools:
- Lynis: Security auditing tool
- ClamAV: Antivirus
- rkhunter: Rootkit hunter
- chkrootkit: Rootkit detector
- fail2ban: Brute force protection
Understanding Git's internal architecture demystifies its behavior and enables advanced usage.
The Object Database
Git is fundamentally a content-addressable filesystem with a VCS interface. Everything is stored as objects in the .git/objects directory.
Object Types:
- Blob: File contents (binary large object)
- Tree: Directory listings (filenames + permissions + blob references)
- Commit: Snapshot metadata (tree hash, parent, author, message)
- Tag: Named reference to a commit (optionally signed)
Object Storage:
Each object is identified by a SHA-1 hash of a short type/size header followed by its content:
echo 'hello world' | git hash-object --stdin
# 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
Objects are stored zlib-compressed under .git/objects/, with the first two hex characters as a directory name: .git/objects/3b/18e512dba79e4c8300dd08aeb37f8e728b8dad
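The hash above can be reproduced without Git at all, which shows how simple the object format is: SHA-1 over a "blob <size>\0" header plus the content (echo appends a trailing newline, so the content is 12 bytes):

```python
import hashlib

content = b"hello world\n"
store = b"blob %d\x00" % len(content) + content
oid = hashlib.sha1(store).hexdigest()
print(oid)                                   # matches git hash-object
print(f".git/objects/{oid[:2]}/{oid[2:]}")   # on-disk location
```

Tree and commit objects are hashed the same way, just with "tree" or "commit" in the header, which is why identical content always produces identical object IDs.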
The Commit Graph
commit (hash: a1b2c3)
tree: d4e5f6
parent: f7g8h9 (previous commit)
author: John <john@example.com>
committer: John <john@example.com>
message: Add feature X
↓
tree (hash: d4e5f6)
blob: 1a2b3c (README.md)
blob: 4d5e6f (main.py)
tree: 7g8h9i (lib/)
blob: 0j1k2l (lib/utils.py)
References (Refs)
Refs are pointers to commits, stored in .git/refs/:
- heads/: Local branches
- remotes/: Remote tracking branches
- tags/: Tags
HEAD is a special ref pointing to current branch or commit.
cat .git/HEAD
# ref: refs/heads/main
cat .git/refs/heads/main
# a1b2c3d4e5f6g7h8i9j0k1l2m3n4o5p6q7r8s9t0
The Index (Staging Area)
The index is a binary file (.git/index) that represents the next commit. It's a sorted list of path names with blob hashes and file metadata.
Plumbing vs Porcelain
Git commands are categorized as:
- Porcelain: User-friendly commands (git add, git commit)
- Plumbing: Low-level commands for scripting (git hash-object, git update-index)
Low-level Examples:
# Create blob
echo 'content' | git hash-object -w --stdin
# Create tree
git update-index --add --cacheinfo 100644 \
$(git hash-object -w file.txt) file.txt
git write-tree
# Create commit
echo 'message' | git commit-tree TREE_HASH -p PARENT_HASH
Branching strategies define how teams use branches for development.
Git Flow
Classic branching model by Vincent Driessen:
main (production)
↑
release/1.0 (staging)
↑
develop (integration)
↑
feature/new-feature (development)
Branches:
- main: Production-ready code
- develop: Integration branch
- feature/*: New features (branch from develop)
- release/*: Release preparation (branch from develop, merge to main and develop)
- hotfix/*: Emergency fixes (branch from main, merge to main and develop)
Pros:
- Clear structure
- Works well for versioned releases
- Good for larger teams
Cons:
- Complex
- Overkill for continuous delivery
- Many branches to maintain
GitHub Flow
Simpler flow used by GitHub:
main (always deployable)
↑
feature/* → Pull Request → main
Principles:
- main is always deployable
- Create feature branches for changes
- Open pull requests for review
- Merge and deploy immediately
Pros:
- Simple
- Works with CI/CD
- Continuous deployment friendly
Cons:
- Less structure for releases
- Can be chaotic with many changes
GitLab Flow
GitLab's hybrid approach:
production (or environment branches)
↑
pre-production
↑
main
↑
feature/*
Environment Branches:
- production: Deployed to production
- staging: Deployed to staging
- main: Integration branch
Pros:
- Environment-specific branches
- Works well with deployment pipelines
- Clear promotion path
Trunk-Based Development
All developers work on short-lived branches from main:
main ←─── short branch ───┐
└─── short branch ───┤
└─── short branch ──┤
Rules:
- Branches live < 1 day
- Small, frequent commits
- Feature flags for incomplete work
- Automated testing before merge
Pros:
- Minimal merge conflicts
- Continuous integration
- Fast feedback
Cons:
- Requires feature flags
- Discipline required
- Not suitable for all projects
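Trunk-based development depends on feature flags to ship incomplete work dark. A minimal sketch of the technique: the flag store here is a plain dict and all names are illustrative; real systems back this with a config service and a percentage rollout per flag.

```python
import zlib

FLAGS = {"new-checkout": {"enabled": True, "rollout_percent": 25}}

def is_enabled(flag, user_id):
    cfg = FLAGS.get(flag)
    if not cfg or not cfg["enabled"]:
        return False
    # Stable bucketing: a given user always lands in the same bucket,
    # so the flag doesn't flap between requests.
    bucket = zlib.crc32(f"{flag}:{user_id}".encode()) % 100
    return bucket < cfg["rollout_percent"]

print(is_enabled("new-checkout", "user-42"), is_enabled("missing-flag", "user-42"))
```

Because the check is a runtime decision, the code path for an unfinished feature can merge to main daily and simply stay disabled until it is ready, which is what keeps trunk branches short-lived.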
Understanding the difference is crucial for clean history.
Merge
git checkout main
git merge feature
Result:
- Creates merge commit
- Preserves exact history
- Shows when branch happened
* Merge branch 'feature' (main)
|\
| * Add feature (feature)
* | Update main (main)
|/
* Initial commit
Pros:
- Preserves context
- Safe (non-destructive)
- Shows actual branch timeline
Cons:
- Cluttered history
- Many merge commits
Rebase
git checkout feature
git rebase main
git checkout main
git merge feature # fast-forward
Result:
- Replays commits on top of main
- Linear history
- No merge commits
* Add feature (main)
* Update main
* Initial commit
Pros:
- Clean, linear history
- Easier to read
- Bisect friendly
Cons:
- Rewrites history
- Dangerous on shared branches
- Loses branch context
Interactive Rebase
git rebase -i HEAD~3
Allows:
- squash: Combine commits
- reword: Change commit message
- edit: Modify commit
- drop: Remove commit
- reorder: Change order
Golden Rule of Rebasing:
Never rebase commits that have been pushed to a shared repository. It will cause chaos for other developers.
When to Use What:
Use Merge When:
- Merging a long-lived branch
- Preserving branch history is important
- Working on public/shared branch
Use Rebase When:
- Updating feature branch with main
- Cleaning up local commits before PR
- Creating linear history
Squash and Merge (GitHub):
Combines all commits from feature branch into one commit on main. Good for keeping main history clean.
Submodules allow including external repositories within your repository.
Basic Usage:
# Add submodule
git submodule add https://github.com/user/lib.git lib
# Clone with submodules
git clone --recursive https://github.com/user/project.git
# Update submodules
git submodule update --init --recursive
# Pull latest in submodules
git submodule update --remote
.gitmodules File:
[submodule "lib"]
path = lib
url = https://github.com/user/lib.git
branch = main
Challenges:
- Detached HEAD: Submodules are checked out at specific commits
- Updates: Need to commit submodule reference changes
- Collaboration: Team members must remember to update submodules
Alternatives:
- Subtrees: Copy code into your repo (git subtree)
- Package managers: npm, pip, maven, etc.
- Monorepo: Single repository for all code
Monorepo (Single Repository)
All code in one repository.
Pros:
- Atomic commits across projects
- Easy code sharing
- Simplified dependency management
- Consistent tooling
- Easier refactoring
Cons:
- Scales poorly (Git struggles with huge repos)
- Complex access control
- Build system complexity
- Learning curve
Examples: Google, Microsoft, Facebook
Polyrepo (Multiple Repositories)
Each project in its own repository.
Pros:
- Clear ownership
- Independent versioning
- Simpler tooling per project
- Better access control
- Scales naturally
Cons:
- Cross-repo changes are painful
- Dependency hell
- Inconsistent tooling
- Duplication
Hybrid Approaches:
- Repo orchestration tools: Google's repo, Microsoft's VFS for Git
- Monorepo with modular build: Bazel, Pants, Please
- Package-based monorepo: Lerna (JavaScript), Gradle (Java)
Git hooks are scripts that run automatically on Git events.
Client-Side Hooks (.git/hooks/):
- pre-commit: Runs first, before the commit message editor
- prepare-commit-msg: Before the commit message editor (with template)
- commit-msg: After the commit message is entered
- post-commit: After the commit completes
- pre-push: Before push
- pre-rebase: Before rebase
- post-checkout: After checkout
- post-merge: After merge
Server-Side Hooks:
- pre-receive: Before accepting a push
- update: Like pre-receive, but runs once per branch
- post-receive: After a push is accepted
Example pre-commit hook (linting):
#!/bin/bash
# .git/hooks/pre-commit
echo "Running linter..."
files=$(git diff --cached --name-only --diff-filter=ACM | grep '\.js$')
if [ -n "$files" ]; then
eslint $files
if [ $? -ne 0 ]; then
echo "Linting failed"
exit 1
fi
fi
Managing Hooks with Tools:
- Husky (JavaScript): Manages hooks via package.json
- pre-commit (Python): Framework for multi-language hooks
- overcommit (Ruby): Extensible hook manager
Handling Large Repositories:
Shallow Clones:
git clone --depth 1 https://github.com/user/repo.git
Partial Clones:
git clone --filter=blob:none https://github.com/user/repo.git
Sparse Checkout:
git sparse-checkout set src/
Git LFS (Large File Storage):
Replaces large files with text pointers:
git lfs track "*.psd"
git add .gitattributes
git add file.psd
git commit -m "Add design file"
Performance Optimization:
- git gc: Garbage collection
- git repack: Optimize pack files
- git fsck: Verify database integrity
- git prune: Remove unreachable objects
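Shallow clones and these maintenance commands can be tried end to end against a small local repository. The sketch below clones over the file:// transport (a plain local path would ignore --depth); all paths are temporary and illustrative:

```shell
#!/bin/sh
set -e
src=$(mktemp -d)
cd "$src" && git init -q
git config user.email "[email protected]"
git config user.name "Demo"
for n in 1 2 3; do
  echo "$n" > file.txt && git add file.txt && git commit -qm "commit $n"
done
work=$(mktemp -d)
git clone -q --depth 1 "file://$src" "$work/shallow"
cd "$work/shallow"
git rev-list --count HEAD    # 1 -- only the tip commit was fetched
git gc --quiet               # repack and prune loose objects
git count-objects -v         # object/pack statistics after gc
```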
Scaling Git Servers:
- GitLab: Built for enterprise scale
- GitHub: GitHub AE for large enterprises
- BitBucket Data Center: Clustered for scale
- Gerrit: Code review focused, scales well
For Authors:
- Keep changes small: < 400 lines is ideal
- Write good descriptions: What, why, how
- Add context: Screenshots, test results
- Self-review first: Catch obvious issues
- Respond graciously: To all comments
- Explain changes: In comments and commits
For Reviewers:
- Review promptly: Within 24 hours ideally
- Be kind: Focus on code, not person
- Ask questions: "What do you think about..." not "You should..."
- Be specific: Point to exact lines and alternatives
- Prioritize: Security > correctness > style
- Approve thoughtfully: Understand the code
Code Review Checklist:
- Does the code work?
- Is it tested appropriately?
- Is it secure?
- Is it performant?
- Is it maintainable?
- Is it well-named?
- Does it follow style guide?
- Is documentation updated?
- Are there edge cases?
- Will it scale?
Automated Checks:
- Linting: Enforce style
- Static analysis: Find bugs
- Test coverage: Ensure testing
- Security scanning: Find vulnerabilities
- Size checks: Prevent bloat
GitHub Enterprise provides self-hosted or cloud-based GitHub for organizations.
Key Features:
Authentication and Authorization:
- SAML/SSO integration
- LDAP/Active Directory
- Fine-grained permissions
- Team synchronization
Security:
- 2FA enforcement
- Audit logging
- Secret scanning
- Dependency graph
- Security advisories
Collaboration:
- Protected branches
- Required reviews
- Code owners
- Issue templates
- Project boards
Actions:
- Built-in CI/CD
- Self-hosted runners
- Marketplace integrations
- Reusable workflows
API and Automation:
- GraphQL API
- REST API
- Webhooks
- GitHub Apps
Deployment Options:
GitHub Enterprise Cloud:
- Hosted by GitHub
- Enterprise features
- SLA guarantee
- Regular updates
GitHub Enterprise Server:
- Self-hosted
- Full control
- Air-gapped possible
- Upgrade on your schedule
GitLab provides integrated CI/CD with their repository platform.
Core Concepts:
.gitlab-ci.yml:
stages:
- build
- test
- deploy
variables:
DOCKER_DRIVER: overlay2
build:
stage: build
script:
- docker build -t $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA .
- docker push $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA
test:
stage: test
script:
- npm install
- npm test
deploy:
stage: deploy
script:
- kubectl set image deployment/myapp app=$CI_REGISTRY_IMAGE:$CI_COMMIT_SHA
only:
- main
Runners:
- Shared runners: Provided by GitLab
- Group runners: Shared by group
- Specific runners: Project-specific
- Auto-scaling: Dynamic provisioning
Features:
- Auto DevOps
- Review Apps (ephemeral environments)
- Container registry
- Dependency scanning
- License compliance
- Browser testing
Bitbucket, part of Atlassian, integrates well with Jira and other Atlassian tools.
Key Features:
Branch Permissions:
- Restrict pushes
- Require pull requests
- Prevent deletion
- Merge checks
Pull Requests:
- Code reviews
- Inline comments
- Task lists
- Approvals required
Pipelines:
- Built-in CI/CD
- Docker support
- Service containers
- Deployments to environments
Integration:
- Jira integration
- Slack notifications
- Marketplace add-ons
- REST API
PRs (GitHub) and MRs (GitLab) are the primary code review mechanism.
Pull Request Lifecycle:
- Create branch from main
- Make changes and commit
- Push branch to remote
- Open PR with description
- Automated checks run
- Reviewers comment and approve
- Address feedback with more commits
- Merge when ready
- Delete branch
PR Templates:
## Description
[Describe the changes]
## Type of Change
- [ ] Bug fix
- [ ] New feature
- [ ] Breaking change
- [ ] Documentation update
## Testing
[Describe how you tested]
## Screenshots
[If applicable]
## Related Issues
Fixes #123
Best Practices:
- Link to issues: Connect work to tracking
- Use draft PRs: For work in progress
- Small PRs: Easier to review
- Descriptive titles: "Fix login bug" not "Update"
- Self-review: Check your own PR first
Branch protection prevents force pushes and requires certain conditions before merging.
Common Rules:
Require pull request reviews:
- Number of approvals required
- Dismiss stale reviews
- Require review from code owners
Require status checks:
- CI must pass
- Specific checks required
- Branches must be up to date
Restrict who can push:
- Specific users/teams
- Admins included/excluded
Other rules:
- No force pushes
- No deletions
- Include administrators
- Linear history required
Example GitHub Settings:
{
"required_status_checks": {
"strict": true,
"contexts": ["continuous-integration/jenkins"]
},
"enforce_admins": true,
"required_pull_request_reviews": {
"required_approving_review_count": 2,
"dismiss_stale_reviews": true,
"require_code_owner_reviews": true
},
"restrictions": null
}
Never store secrets in code. Use secret management tools.
What Not to Store:
- API keys
- Passwords
- SSH keys
- Database credentials
- Tokens
- Certificates
Secret Management Solutions:
GitHub Encrypted Secrets:
# In GitHub Actions
env:
API_KEY: ${{ secrets.API_KEY }}
GitLab CI/CD Variables:
# Masked and protected variables
script:
- echo "$CI_DEPLOY_PASSWORD"
HashiCorp Vault:
vault kv put secret/myapp api_key=12345
AWS Secrets Manager:
aws secretsmanager get-secret-value --secret-id myapp
Azure Key Vault:
az keyvault secret show --name api-key --vault-name myvault
Tools for Secret Detection:
- git-secrets: Prevents committing secrets
- truffleHog: Searches for secrets in Git history
- GitHub secret scanning: Automatic detection
- GitLab secret detection: Built-in scanning
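A minimal sketch of what these tools do under the hood: grep the staged diff for a known credential pattern before allowing a commit. The AWS access key pattern is real; the repository, file, and key value below are illustrative (the key is AWS's documented example):

```shell
#!/bin/sh
set -e
dir=$(mktemp -d) && cd "$dir"
git init -q
git config user.email "[email protected]"
git config user.name "Demo"
echo 'aws_key = "AKIAIOSFODNN7EXAMPLE"' > config.py
git add config.py
found=0
# Scan only the staged changes, as a pre-commit hook would
if git diff --cached -U0 | grep -qE 'AKIA[0-9A-Z]{16}'; then
  echo "Potential AWS access key staged -- refusing to commit"
  found=1
fi
```

Real scanners cover many more patterns and use entropy checks; this shows only the basic mechanism.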
Access Control:
- Principle of least privilege: Grant minimum needed access
- Regular audits: Review who has access
- Team-based permissions: Manage groups, not individuals
- SSO enforcement: Require corporate authentication
Security Features:
Signed Commits:
git commit -S -m "Signed commit"
git config commit.gpgsign true
Signed Tags:
git tag -s v1.0 -m "Signed tag"
Verified commits show as "Verified" in GitHub/GitLab.
Dependency Management:
- Dependabot: Automated security updates
- Renovate: Dependency update tool
- Snyk: Vulnerability scanning
- OWASP Dependency Check: Security scanning
Audit Logging:
Monitor for suspicious activity:
- Repository access
- Permission changes
- Secret pushes
- Branch deletions
Incident Response:
When secrets are exposed:
- Immediate: Revoke compromised credentials
- Investigate: Check access logs
- Rotate: Replace all affected secrets
- Notify: Inform affected parties
- Prevent: Improve scanning/prevention
Continuous Integration is the practice of merging all developer working copies to a shared mainline several times a day.
Core Principles:
- Maintain a single source repository: Everything needed to build should be in version control.
- Automate the build: One command should build the system.
- Make the build self-testing: Tests should be part of the build.
- Everyone commits to mainline every day: Avoid long-lived branches.
- Every commit should build on an integration machine: Catch problems early.
- Keep the build fast: Fast feedback encourages frequent commits.
- Test in a clone of the production environment: Avoid environment-specific issues.
- Make it easy to get the latest deliverables: Artifacts should be easily accessible.
- Everyone can see what's happening: Transparency enables collaboration.
- Automate deployment: Make it trivial to deploy anywhere.
Benefits:
- Reduced integration risk: Problems found early
- Higher code quality: Constant testing
- Faster delivery: Always releasable state
- Improved visibility: Build status visible
- Greater confidence: Automated verification
Build automation compiles source code into binary artifacts.
Build Tools by Language:
- Java: Maven, Gradle, Ant
- JavaScript: npm, yarn, webpack
- Python: setuptools, poetry, pip
- Go: go build, make
- Ruby: rake, bundler
- C/C++: make, cmake, ninja
- .NET: MSBuild, dotnet CLI
Build Automation Goals:
- Repeatable: Same input → same output
- Fast: Minimize feedback time
- Idempotent: Can run multiple times
- Self-contained: No external dependencies
- Consistent: Same process everywhere
Build Script Example (Makefile):
.PHONY: build test clean
build:
go build -o bin/app ./cmd/app
test:
go test ./...
clean:
rm -rf bin/
Build Pipeline Stages:
Source → Compile → Test → Package → Publish
- Compile: Convert source to binaries
- Test: Run unit and integration tests
- Package: Create deployable artifact (JAR, Docker image)
- Publish: Store artifact in repository
Artifacts are the outputs of build processes that need to be stored and versioned.
Types of Artifacts:
- Binaries (JAR, EXE, DLL)
- Packages (DEB, RPM, NPM)
- Container images
- Documentation
- Test reports
- Configuration files
Artifact Repositories:
Language-specific:
- Maven: Nexus, Artifactory, Archiva
- npm: npm registry, Verdaccio
- Python: PyPI, DevPI
- Ruby: RubyGems, Geminabox
- Go: Go proxy, Athens
Universal:
- JFrog Artifactory: Multi-format support
- Sonatype Nexus: Repository manager
- Cloud-specific: AWS CodeArtifact, Azure Artifacts, GCP Artifact Registry
Container Registries:
- Docker Hub
- GitHub Container Registry
- GitLab Container Registry
- Amazon ECR
- Azure ACR
- Google GCR
Best Practices:
- Version everything: Use semantic versioning
- Immutable artifacts: Never change published artifacts
- Metadata: Store build info, commit hash, timestamps
- Retention policies: Automatically clean old artifacts
- Security scanning: Scan artifacts for vulnerabilities
- Access control: Who can read/write artifacts
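The "version everything" practice can be automated in a release script. A hedged sketch, using made-up tag names, that reads the latest tag and cuts the next patch version:

```shell
#!/bin/sh
set -e
dir=$(mktemp -d) && cd "$dir"
git init -q
git config user.email "[email protected]"
git config user.name "Demo"
echo app > app.txt && git add app.txt && git commit -qm "release"
git tag v1.2.3
latest=$(git describe --tags --abbrev=0)   # v1.2.3
# Bump the patch component of a vMAJOR.MINOR.PATCH tag
next=$(echo "${latest#v}" | awk -F. '{printf "v%d.%d.%d", $1, $2, $3 + 1}')
echo "$next"                               # v1.2.4
git tag "$next"                            # tag the immutable artifact version
```

In a real pipeline the new tag would be attached to the published artifact's metadata along with the commit hash and build timestamp.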
Artifact Lifecycle:
Build → Stage → Release → Retire
↑ ↑ ↑ ↑
Snapshot Testing Production Delete
Define CI/CD pipelines in code, stored in version control.
Benefits:
- Version control: Track changes to pipeline
- Code review: Review pipeline changes
- Reusability: Share pipeline templates
- Consistency: Same process everywhere
- Documentation: Pipeline as executable documentation
Examples:
GitHub Actions:
name: CI
on: [push]
jobs:
build:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- run: npm install
- run: npm test
GitLab CI:
stages:
- build
- test
build:
stage: build
script:
- go build ./...
test:
stage: test
script:
- go test ./...
Jenkinsfile (Declarative):
pipeline {
agent any
stages {
stage('Build') {
steps {
sh 'make build'
}
}
stage('Test') {
steps {
sh 'make test'
}
}
}
}
Pipeline Patterns:
DRY (Don't Repeat Yourself):
# Reusable workflow
.build-template: &build-template
stage: build
script:
- docker build -t $IMAGE .
build-app:
<<: *build-template
variables:
IMAGE: app
build-api:
<<: *build-template
variables:
IMAGE: api
Testing in CI/CD requires a comprehensive strategy.
Testing Pyramid:
/\ E2E Tests (slow, expensive)
/ \ Integration Tests
/----\ Component Tests
/------\ Unit Tests (fast, cheap)
/--------\
Unit Tests:
- Test individual functions/classes
- Fast execution (< 100ms each)
- No external dependencies
- High coverage (70-80%+)
Integration Tests:
- Test component interactions
- May use databases, APIs
- Slower but more realistic
- Medium coverage
Component Tests:
- Test entire component in isolation
- Mock external dependencies
- Contract testing with consumers
E2E Tests:
- Test complete user journeys
- Full system with all dependencies
- Slow and brittle
- Few critical paths only
Other Test Types:
Smoke Tests: Quick sanity checks after deployment
Performance Tests: Load, stress, soak testing
Security Tests: Vulnerability scanning, penetration testing
Mutation Tests: Validate test quality by introducing bugs
Contract Tests: Ensure API compatibility
Test Automation Best Practices:
- Run fast tests first: Fail fast
- Parallelize tests: Speed up execution
- Quarantine flaky tests: Don't block pipeline
- Test data management: Consistent test data
- Test reporting: Clear results and trends
- Test environment parity: Match production
Parallel execution speeds up CI pipelines.
Types of Parallelism:
- Test parallelization: Run tests across multiple workers
- Matrix builds: Test multiple versions/configurations
- Stage parallelization: Run independent stages simultaneously
GitHub Actions Matrix:
jobs:
test:
runs-on: ubuntu-latest
strategy:
matrix:
node: [14, 16, 18]
os: [ubuntu-latest, windows-latest]
steps:
- uses: actions/checkout@v2
- uses: actions/setup-node@v2
with:
node-version: ${{ matrix.node }}
- run: npm test
Test Splitting:
# Split tests by timing
jest --maxWorkers=4 --shard=1/4
jest --maxWorkers=4 --shard=2/4
jest --maxWorkers=4 --shard=3/4
jest --maxWorkers=4 --shard=4/4
Parallel Stages in GitLab:
stages:
- test
- deploy
test:
stage: test
parallel: 5
script:
- ./run-tests.sh $CI_NODE_INDEX $CI_NODE_TOTAL
Caching reduces build times by reusing previous work.
Cacheable Items:
- Dependency packages (node_modules, vendor/bundle)
- Compiled artifacts (.class, .pyc)
- Docker layers
- Test results
- Build tools
GitHub Actions Caching:
- name: Cache node_modules
uses: actions/cache@v2
with:
path: node_modules
key: ${{ runner.os }}-node-${{ hashFiles('package-lock.json') }}
restore-keys: |
${{ runner.os }}-node-
Docker Layer Caching:
# Cache dependencies first
COPY package*.json ./
RUN npm install # This layer cached unless package.json changes
COPY . .
RUN npm run build
Optimization Techniques:
- Incremental builds: Only rebuild changed code
- Conditional execution: Skip stages when not needed
- Build artifacts: Save intermediate outputs
- Dependency caching: Cache package managers
- Workspace reuse: Reuse workspace across jobs
- Container caching: Use cached base images
Pipeline Optimization Checklist:
- Fast feedback (< 10 minutes)
- Parallel execution where possible
- Caching dependencies
- Skipping irrelevant jobs
- Efficient test ordering
- Build only changed code
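The last checklist item, building only changed code, often reduces to a git diff over the paths a job cares about. A sketch in a throwaway repository (directory names are illustrative) that skips the build when the latest commit touched nothing under src/:

```shell
#!/bin/sh
set -e
dir=$(mktemp -d) && cd "$dir"
git init -q
git config user.email "[email protected]"
git config user.name "Demo"
mkdir src docs
echo code > src/main.go && git add . && git commit -qm "add source"
echo readme > docs/README.md && git add . && git commit -qm "docs only"
# Compare the last two commits, restricted to src/
changed=$(git diff --name-only HEAD~1 HEAD -- src/)
if [ -z "$changed" ]; then
  decision="skip build"
else
  decision="run build"
fi
echo "$decision"    # skip build -- the last commit touched only docs/
```

CI systems expose the same idea declaratively (e.g. path filters on triggers), but the underlying check is this diff.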
Jenkins is the most widely used open-source automation server.
Core Architecture:
User → Jenkins UI/API
↓
Jenkins Master
↓
Build Queue
↓
Build Executors (Master or Agents)
Jenkins Master:
- Web UI and API
- Job configuration
- Build queue management
- Monitoring and reporting
- Plugin management
Jenkins Agents (Nodes):
- Execute builds
- Distributed across machines
- Different environments
- Label-based selection
Installation Options:
- WAR file: java -jar jenkins.war
- Package: apt/yum install jenkins
- Docker: docker run jenkins/jenkins
- Kubernetes: Jenkins Helm chart
Jenkins Pipeline:
Declarative Pipeline:
pipeline {
agent any
stages {
stage('Build') {
steps {
sh 'make build'
}
}
stage('Test') {
steps {
sh 'make test'
}
}
stage('Deploy') {
when {
branch 'main'
}
steps {
sh 'make deploy'
}
}
}
post {
always {
cleanWs()
}
failure {
slackSend(color: 'danger', message: "Build failed")
}
}
}
Scripted Pipeline:
node {
try {
stage('Checkout') {
checkout scm
}
stage('Build') {
sh 'make build'
}
stage('Test') {
sh 'make test'
}
} catch (err) {
currentBuild.result = 'FAILURE'
throw err
} finally {
cleanWs()
}
}
Shared Libraries:
Reusable pipeline code across projects:
// vars/buildGo.groovy
def call(String version = '1.16') {
sh "docker run --rm -v $PWD:/app -w /app golang:$version go build"
}
Jenkins Configuration as Code (JCasC):
jenkins:
systemMessage: "Jenkins configured by JCasC"
securityRealm:
ldap:
configurations:
- server: ldap.example.com
rootDN: dc=example,dc=com
authorizationStrategy:
globalMatrix:
permissions:
- "Overall/Administer:admin"
GitHub-native CI/CD tightly integrated with repositories.
Core Concepts:
Workflows: YAML files in .github/workflows/
Events: Triggers (push, pull_request, schedule)
Jobs: Groups of steps (run on runners)
Steps: Individual tasks (run commands or actions)
Actions: Reusable units of code
Runners: Virtual machines that execute jobs
Workflow Structure:
name: CI
on:
push:
branches: [ main ]
pull_request:
branches: [ main ]
env:
NODE_VERSION: 16
jobs:
build:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- name: Setup Node
uses: actions/setup-node@v2
with:
node-version: ${{ env.NODE_VERSION }}
- name: Install dependencies
run: npm ci
- name: Run tests
run: npm test
- name: Upload artifacts
uses: actions/upload-artifact@v2
with:
name: build-output
path: dist/
Custom Actions:
Docker Container Action:
name: 'My Action'
description: 'Does something'
runs:
using: 'docker'
image: 'Dockerfile'
JavaScript Action:
name: 'My Action'
description: 'Does something'
runs:
using: 'node20'
main: 'index.js'
Composite Action:
name: 'Composite Action'
description: 'Combines steps'
runs:
using: 'composite'
steps:
- run: echo Hello
shell: bash
Workflow Features:
- Matrix strategies: Test multiple configurations
- Environments: Protection rules and secrets
- Concurrency: Control parallel runs
- Dependencies: the needs keyword
- Conditionals: if conditions
- Reusable workflows: Call workflows from workflows
Integrated CI/CD with GitLab's DevOps platform.
.gitlab-ci.yml Structure:
stages:
- build
- test
- deploy
variables:
DOCKER_DRIVER: overlay2
IMAGE_TAG: $CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA
cache:
paths:
- node_modules/
before_script:
- docker login -u $CI_REGISTRY_USER -p $CI_REGISTRY_PASSWORD $CI_REGISTRY
build:
stage: build
script:
- docker build -t $IMAGE_TAG .
- docker push $IMAGE_TAG
only:
- main
test:
stage: test
script:
- npm ci
- npm test
deploy_staging:
stage: deploy
script:
- kubectl set image deployment/app app=$IMAGE_TAG
environment:
name: staging
url: https://staging.example.com
only:
- main
deploy_production:
stage: deploy
script:
- kubectl set image deployment/app app=$IMAGE_TAG
environment:
name: production
url: https://example.com
when: manual
only:
- main
Key Features:
- Review Apps: Ephemeral environments for MRs
- Auto DevOps: Preconfigured CI/CD
- Multi-project pipelines: Cross-project dependencies
- Parent-child pipelines: Dynamic pipeline generation
- Rules: Advanced conditional logic
- Includes: Include external YAML files
GitLab Runners:
- Shared: Provided by GitLab.com
- Group: Shared within group
- Project: Dedicated to project
- Specific: Custom configuration
Runner Configuration (config.toml):
concurrent = 10
[[runners]]
name = "docker-runner"
url = "https://gitlab.com"
token = "xxxxx"
executor = "docker"
[runners.docker]
image = "alpine"
volumes = ["/cache"]
Cloud-native CI/CD with a focus on speed and convenience.
Configuration (.circleci/config.yml):
version: 2.1
orbs:
node: circleci/node@5.0.0
jobs:
build:
docker:
- image: cimg/node:16.10
auth:
username: mydockerhub-user
password: $DOCKERHUB_PASSWORD
steps:
- checkout
- node/install-packages:
pkg-manager: npm
- run:
name: Run tests
command: npm test
- persist_to_workspace:
root: ~/project
paths:
- .
deploy:
docker:
- image: cimg/base:2022.06
steps:
- attach_workspace:
at: ~/project
- run:
name: Deploy to production
command: ./deploy.sh
workflows:
version: 2
build_and_deploy:
jobs:
- build
- deploy:
requires:
- build
filters:
branches:
only: main
CircleCI Concepts:
- Orbs: Reusable configuration packages
- Executors: Docker, machine, macOS, Windows
- Workspaces: Persist data between jobs
- Caching: Speed up dependency installation
- Contexts: Share environment variables
- SSH debugging: Debug builds interactively
Microsoft's enterprise DevOps platform.
Pipelines (YAML):
trigger:
- main
pool:
vmImage: ubuntu-latest
variables:
buildConfiguration: 'Release'
majorVersion: 1
minorVersion: 0
stages:
- stage: Build
jobs:
- job: BuildJob
steps:
- task: DotNetCoreCLI@2
inputs:
command: 'build'
projects: '**/*.csproj'
arguments: '--configuration $(buildConfiguration)'
- task: DotNetCoreCLI@2
inputs:
command: 'test'
projects: '**/*Tests.csproj'
arguments: '--configuration $(buildConfiguration)'
- task: DotNetCoreCLI@2
inputs:
command: 'publish'
publishWebProjects: true
arguments: '--configuration $(buildConfiguration) --output $(Build.ArtifactStagingDirectory)'
- task: PublishBuildArtifacts@1
inputs:
PathtoPublish: '$(Build.ArtifactStagingDirectory)'
ArtifactName: 'drop'
- stage: Deploy
jobs:
- deployment: DeployWeb
environment: 'production'
strategy:
runOnce:
deploy:
steps:
- task: AzureWebApp@1
inputs:
azureSubscription: 'my-connection'
appName: 'my-app'
package: '$(Pipeline.Workspace)/drop/**/*.zip'
Azure DevOps Components:
- Azure Pipelines: CI/CD
- Azure Repos: Git repositories
- Azure Boards: Work tracking
- Azure Test Plans: Testing tools
- Azure Artifacts: Package management
Key Features:
- Multi-stage pipelines: Visual designer
- Environments: Track deployments
- Approvals: Manual intervention
- Gates: Automated health checks
- Service connections: Connect to Azure services
- Task groups: Reusable task collections
Securing CI/CD pipelines is critical as they have access to production.
Security Principles:
- Least privilege: Minimal permissions
- Isolation: Separate build environments
- Secrets management: Never expose secrets
- Input validation: Protect against injection
- Audit logging: Track all changes
- Dependency verification: Verify third-party code
Common Threats:
Credential Exposure:
- Secrets in logs
- Hardcoded credentials
- Exposed environment variables
Supply Chain Attacks:
- Compromised dependencies
- Malicious packages
- Typosquatting
Pipeline Tampering:
- Unauthorized pipeline changes
- Malicious commits
- Build environment compromise
Security Best Practices:
Secrets:
# NEVER do this
- run: echo "password=12345" # Bad!
# Use secrets
- run: echo "password=$SECRET"
env:
SECRET: ${{ secrets.MY_SECRET }}
OIDC (OpenID Connect):
# Instead of long-lived secrets
- name: Configure AWS credentials
uses: aws-actions/configure-aws-credentials@v1
with:
role-to-assume: arn:aws:iam::123456789:role/GitHubActions
aws-region: us-east-1
Signed Commits:
- Require signed commits for sensitive repos
- Verify commit signatures in pipeline
Dependency Verification:
# Verify package integrity
- run: npm audit
- run: npm ci --ignore-scripts # Disable install scripts
Isolation:
- Use ephemeral runners
- Network isolation
- Container sandboxing
As teams grow, CI infrastructure needs to scale.
Scaling Strategies:
1. Horizontal Scaling:
- Add more build agents
- Auto-scaling based on queue
- Multiple regions/zones
2. Vertical Scaling:
- Bigger machines
- More CPU/memory per build
- Faster storage (SSD)
3. Build Optimization:
- Caching dependencies
- Parallel test execution
- Incremental builds
- Skipping unnecessary builds
Jenkins Scaling:
Master-Agent Setup:
pipeline {
agent { label 'linux && large' }
stages {
stage('Build') {
steps {
sh 'make build'
}
}
}
}
Dynamic Agents (Kubernetes):
apiVersion: v1
kind: Pod
spec:
containers:
- name: jnlp
image: jenkins/inbound-agent
- name: golang
image: golang:1.16
command:
- cat
- name: docker
image: docker:20.10
command:
- cat
volumeMounts:
- name: docker-sock
mountPath: /var/run/docker.sock
GitHub Actions Scaling:
- Self-hosted runners: Custom machines
- Runner groups: Organization/enterprise level
- Auto-scaling: Dynamic provisioning
Self-hosted Runner Auto-scaling (Azure):
resource "azuredevops_agent_pool" "pool" {
name = "my-pool"
auto_provision = true
}
resource "azuredevops_elastic_pool" "elastic" {
name = "my-elastic-pool"
service_endpoint_id = azuredevops_serviceendpoint_azurerm.az.id
azure_resource_id = azurerm_linux_virtual_machine_scale_set.vmss.id
desired_idle = 1
max_capacity = 10
}
Monitoring CI Infrastructure:
Key metrics:
- Queue time
- Build duration
- Success/failure rate
- Agent utilization
- Cost per build
Cost Optimization:
- Use spot/preemptible instances
- Auto-scale down when idle
- Cache effectively
- Right-size instances
Continuous Delivery
Every change is deployable, but deployment may be manual.
Commit → Build → Test → Staging → Manual Approval → Production
↑
Always deployable
Key Characteristics:
- Software always in releasable state
- Deployment is a business decision
- Manual approval for production
- Compliance and audit gates
Continuous Deployment
Every change that passes tests is automatically deployed.
Commit → Build → Test → Staging → Auto → Production
↑
Automated promotion
Key Characteristics:
- Fully automated pipeline
- No manual intervention
- Multiple daily deployments
- Requires high confidence in testing
Choosing Between Them:
Continuous Delivery is better when:
- Regulatory/compliance requirements
- Business needs release coordination
- Low deployment frequency is acceptable
- Building confidence gradually
Continuous Deployment is better when:
- SaaS/cloud native applications
- High deployment frequency desired
- Strong automated testing
- Feature flags in place
- Individual deployments are small and low-risk
Blue/Green Deployment
Two identical environments, one live (blue), one idle (green).
Before switch:
Users → Blue (v1) Green (v2 - idle)
After switch:
Users → Green (v2) Blue (v1 - idle)
Implementation:
# Kubernetes with labels
apiVersion: apps/v1
kind: Deployment
metadata:
name: app-blue
spec:
replicas: 10
template:
metadata:
labels:
version: blue
---
apiVersion: v1
kind: Service
metadata:
name: app-service
spec:
selector:
version: blue # Switch to green when ready
Pros:
- Instant rollback (switch back)
- No downtime
- Staging environment always available
Cons:
- Double infrastructure cost
- Database schema challenges
Canary Deployment
Gradually shift traffic to new version.
Users → 90% → v1
10% → v2 (canary)
If successful: increase to 25%, 50%, 100%
If problems: route back to 100% v1
Kubernetes with Istio:
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: app
spec:
hosts:
- app
http:
- match:
- headers:
canary:
exact: "true"
route:
- destination:
host: app
subset: v2
weight: 100
- route:
- destination:
host: app
subset: v1
weight: 90
- destination:
host: app
subset: v2
weight: 10
Pros:
- Real traffic testing
- Gradual risk exposure
- Canary analysis
Cons:
- Complex routing
- Longer deployment time
- Requires monitoring
Rolling Deployment
Gradually replace instances.
v1 → v1 → v1 → v1 → v1 → v1 → v1 → v1 → v1 → v1
v2 → v2 → v1 → v1 → v1 → v1 → v1 → v1 → v1 → v1
v2 → v2 → v2 → v2 → v1 → v1 → v1 → v1 → v1 → v1
v2 → v2 → v2 → v2 → v2 → v2 → v2 → v2 → v2 → v2
Kubernetes Rolling Update:
apiVersion: apps/v1
kind: Deployment
metadata:
name: app
spec:
replicas: 10
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 2 # How many extra pods
maxUnavailable: 1 # How many can be down
template:
spec:
containers:
- image: app:v2
Pros:
- No extra infrastructure
- Gradual replacement
- Kubernetes native
Cons:
- Slower rollout
- Complex rollback
- Version mix during deployment
Shadow Deployment
Run new version alongside old, mirror traffic but discard responses.
User → v1 (serves response)
↓
→ v2 (shadow - discard response)
Pros:
- Test with production traffic
- No user impact
- Performance comparison
Cons:
- Double resource usage
- No feedback to users
- Complex implementation
Feature flags (toggles) enable deploying incomplete features safely.
Types of Flags:
- Release toggles: Control feature visibility
- Experiment toggles: A/B testing
- Ops toggles: Operational controls
- Permission toggles: User targeting
Implementation:
# Simple flag check
if feature_flags.is_enabled('new-checkout'):
return new_checkout_flow()
else:
return old_checkout_flow()
Targeting Rules:
// LaunchDarkly example
const context = { key: user.id, email: user.email };
const showFeature = ldclient.variation('new-feature', context, false);
Flag Management Systems:
- LaunchDarkly: Enterprise feature management
- Split.io: Feature experimentation
- Flagsmith: Open source
- Unleash: Open source
- ConfigCat: Simple feature flags
- Custom: Database + cache
Best Practices:
- Short-lived flags: Remove after rollout
- Flag naming: Clear and consistent
- Audit logging: Track flag changes
- Default to off: Safe fallback
- Flag hygiene: Regular cleanup
- Testing: Test with flags on/off
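The simplest possible ops toggle is an environment variable checked at the decision point, which also makes "default to off" and "test with flags on/off" easy to exercise. A minimal sketch; the flag name and flow names are illustrative, not from any flag library:

```shell
#!/bin/sh
# Release toggle read from an environment variable, defaulting to off
checkout() {
  if [ "${FEATURE_NEW_CHECKOUT:-false}" = "true" ]; then
    echo "new checkout flow"
  else
    echo "old checkout flow"
  fi
}
default=$(checkout)                             # flag unset -> old flow
flagged=$(FEATURE_NEW_CHECKOUT=true; checkout)  # flag on in a subshell -> new flow
echo "$default / $flagged"
```

Dedicated flag systems add what this lacks: targeting rules, audit logs, and runtime changes without redeploying.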
Database changes are often the riskiest part of deployment.
Principles:
- Separate schema changes from code changes
- Forward and backward compatible
- Automated migrations
- Testable rollbacks
Migration Types:
Expand/Migrate/Contract Pattern:
Phase 1: Expand
- Add new column (nullable)
- Dual-write to both columns
Phase 2: Migrate
- Backfill data to new column
- Migrate reads to new column
Phase 3: Contract
- Remove old column
- Remove dual-write
Online Schema Change Tools:
- gh-ost: GitHub's online schema migration
- pt-online-schema-change: Percona Toolkit
- Liquibase: Database refactoring
- Flyway: Version control for databases
- Alembic: Python migrations
Example Flyway Migration:
-- V1__initial_schema.sql
CREATE TABLE users (
id INT PRIMARY KEY,
name VARCHAR(255)
);
-- V2__add_email.sql
ALTER TABLE users ADD COLUMN email VARCHAR(255);
-- V3__populate_email.sql
UPDATE users SET email = CONCAT(name, '@example.com');
Zero-Downtime Migration Strategy:
1. Add nullable column
2. Dual-write to new column (code change)
3. Backfill data
4. Make column non-nullable (if needed)
5. Remove old column (future release)
Despite best efforts, things go wrong. Be prepared.
Rollback Strategies:
Version Rollback:
- Revert to previous artifact
- Simple and fast
- Loses new features
Feature Flag Rollback:
- Disable problematic feature
- No deployment needed
- Keep other features
Database Rollback:
- Restore from backup
- Apply compensating transactions
- Forward-only migrations (avoid rollbacks)
Automated Rollback Triggers:
# Canary analysis with automated rollback
deploy:
strategy:
canary:
steps:
- setWeight: 10
- pause:
duration: 5m
- analysis:
metrics:
- name: error-rate
threshold: 1
- setWeight: 50
- pause:
duration: 5m
- analysis:
metrics:
- name: error-rate
          threshold: 1
Rollback Procedure:
- Detect the problem (monitoring)
- Decide to roll back (automated or manual)
- Execute rollback (deploy previous version)
- Verify system is healthy
- Post-mortem to prevent recurrence
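The automated trigger in the canary config above boils down to a simple check over the analysis window. A sketch (function and threshold are illustrative):

```python
def should_rollback(error_rates, threshold=1.0):
    """Return True if any sampled error rate breaches the canary threshold."""
    return any(rate > threshold for rate in error_rates)

print(should_rollback([0.2, 0.4, 3.1]))  # spike past the 1% threshold -> True
print(should_rollback([0.2, 0.4, 0.5]))  # healthy canary -> False
```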
GitOps uses Git as the single source of truth for declarative infrastructure and applications.
Core Principles:
- Declarative description: Entire system described in Git
- Git as source of truth: Cluster state matches Git
- Automated convergence: Software ensures cluster matches Git
- Pull-based deployments: Cluster pulls changes
GitOps Architecture:
Developer pushes to Git
↓
Git Repository
↓
GitOps Operator (ArgoCD/Flux)
↓
Kubernetes Cluster
Benefits:
- Audit trail: All changes in Git
- Faster recovery: Recreate from Git
- Standard workflows: Use Git tools
- Security: Pull model reduces credentials
- Observability: Drift detection
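The convergence step an operator performs maps directly onto the `selfHeal` and `prune` options shown in the manifests that follow. A minimal sketch (data shapes are illustrative, not ArgoCD's API):

```python
def converge(git_state, cluster_state, prune=True):
    """Compare desired state from Git with live cluster state.

    Returns (drift, pruned): resources to re-apply from Git (selfHeal)
    and resources to delete because they are no longer in Git (prune).
    """
    drift = {k: v for k, v in git_state.items() if cluster_state.get(k) != v}
    pruned = [k for k in cluster_state if k not in git_state] if prune else []
    return drift, pruned

drift, pruned = converge({"deploy/web": "v2"},
                         {"deploy/web": "v1", "job/tmp": "x"})
print(drift)   # {'deploy/web': 'v2'} -> re-apply from Git
print(pruned)  # ['job/tmp']          -> delete, not in Git
```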
ArgoCD Example:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: myapp
namespace: argocd
spec:
project: default
source:
repoURL: https://github.com/user/repo.git
targetRevision: HEAD
path: k8s
destination:
server: https://kubernetes.default.svc
namespace: production
syncPolicy:
automated:
prune: true
      selfHeal: true
Flux Example:
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: GitRepository
metadata:
name: myapp
namespace: flux-system
spec:
interval: 1m
url: https://github.com/user/repo
ref:
branch: main
---
apiVersion: kustomize.toolkit.fluxcd.io/v1beta2
kind: Kustomization
metadata:
name: myapp
namespace: flux-system
spec:
interval: 10m
path: ./k8s
prune: true
sourceRef:
kind: GitRepository
    name: myapp
Containers provide lightweight virtualization at the OS level.
What are Containers?
Containers package an application with its dependencies, libraries, and configuration files, running isolated from other processes on the same host.
Containers vs Virtual Machines:
| Aspect | Containers | Virtual Machines |
|---|---|---|
| Isolation | Process-level | Hardware-level |
| OS | Share host kernel | Each has guest OS |
| Startup | Milliseconds | Minutes |
| Size | MB | GB |
| Performance | Native | Some overhead |
| Resource usage | Lightweight | Heavy |
Container Technologies:
- LXC (Linux Containers): Original Linux containers
- Docker: Most popular container platform
- Podman: Daemonless container engine
- containerd: Industry-standard runtime
- CRI-O: Kubernetes-specific runtime
Linux Kernel Features:
Namespaces: Isolate process views
- PID: Process IDs
- NET: Network interfaces
- MNT: Mount points
- UTS: Hostname
- IPC: Inter-process communication
- USER: User IDs
Cgroups (Control Groups): Limit resources
- CPU shares/quota
- Memory limits
- Block I/O
- Network bandwidth
Union Filesystems: Layer management
- OverlayFS
- AUFS
- Device Mapper
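The union-filesystem idea above can be modeled with Python's `ChainMap`: reads resolve against the top-most layer first, so upper layers shadow lower ones (paths and contents here are illustrative):

```python
from collections import ChainMap

# Lower (base image) layer and upper (container writable) layer
lower = {"/bin/sh": "busybox", "/etc/os-release": "alpine 3.15"}
upper = {"/app/server.py": "v2", "/etc/os-release": "patched"}

merged = ChainMap(upper, lower)   # lookups try the upper layer first
print(merged["/etc/os-release"])  # "patched" -- the upper layer wins
print(merged["/bin/sh"])          # "busybox" -- falls through to the base layer
```

OverlayFS applies the same rule to directories: the merged mount shows upper-layer files where they exist and lower-layer files everywhere else.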
Docker Architecture:
Client (docker CLI)
↓
Docker Daemon (dockerd)
↓
Containerd
↓
runc (OCI runtime)
↓
Container
Components:
- docker CLI: User interface
- dockerd: Persistent daemon
- containerd: Container lifecycle management
- runc: OCI runtime (creates containers)
- containerd-shim: Parent of container processes
Images and Layers:
Docker images are built in layers:
Layer 4: CMD ["node", "app.js"]
Layer 3: COPY . /app
Layer 2: RUN npm install
Layer 1: FROM node:16
↓
Union mount at runtime
Layer Caching:
Each layer is cached. When rebuilding:
- Unchanged layers reused
- Changed layers and all subsequent rebuilt
Docker Storage Drivers:
- overlay2: Default (recommended)
- devicemapper: Legacy
- btrfs/zfs: Advanced features
- vfs: No copy-on-write
Network Drivers:
- bridge: Default, NAT through host
- host: Use host network directly
- overlay: Multi-host networking
- macvlan: Assign MAC addresses
- none: No networking
Base Images:
# Use specific tags, not latest
FROM node:16.14.2-alpine
# Use minimal base images
FROM alpine:3.15
Layer Optimization:
# Bad - each RUN creates layer
RUN apt-get update
RUN apt-get install -y curl
RUN apt-get clean
# Good - combine commands
RUN apt-get update && \
apt-get install -y curl && \
    apt-get clean
Order Matters:
# Copy dependency files first (cached longer)
COPY package*.json ./
RUN npm install
# Copy source last (changes frequently)
COPY . .
Multi-stage Builds:
# Build stage
FROM node:16 AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build
# Production stage
FROM nginx:alpine
COPY --from=builder /app/dist /usr/share/nginx/html
COPY nginx.conf /etc/nginx/nginx.conf
Security Best Practices:
# Run as non-root
RUN addgroup -g 1000 -S appgroup && \
adduser -u 1000 -S appuser -G appgroup
USER appuser
# No secrets in build args
ARG DB_PASSWORD # Bad - visible in history
# Use build secrets
RUN --mount=type=secret,id=db_password \
    cat /run/secrets/db_password
.dockerignore:
node_modules
.git
*.log
.env
Dockerfile
.dockerignore
Multi-stage builds optimize final image size by separating build and runtime environments.
Example: Go Application
# Build stage
FROM golang:1.17 AS builder
WORKDIR /app
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 GOOS=linux go build -o main .
# Runtime stage
FROM alpine:3.15
RUN apk --no-cache add ca-certificates
WORKDIR /root/
COPY --from=builder /app/main .
EXPOSE 8080
CMD ["./main"]
Example: React Application
# Build stage
FROM node:16 AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build
# Runtime stage
FROM nginx:alpine
COPY --from=builder /app/build /usr/share/nginx/html
COPY nginx.conf /etc/nginx/nginx.conf
EXPOSE 80
CMD ["nginx", "-g", "daemon off;"]
Benefits:
- Smaller images (MB vs GB)
- No build tools in production
- Better security
- Faster pulls
Security Principles:
- Least privilege: Minimal capabilities
- Immutable: No runtime changes
- Read-only root filesystem
- No privileged containers
- Vulnerability scanning
Security Best Practices:
User Namespace Remapping:
{
"userns-remap": "default"
}
Read-only Root:
VOLUME ["/tmp", "/var/log"] # Writable volumes
# Rest of filesystem read-only
Drop Capabilities:
docker run --cap-drop=ALL --cap-add=NET_BIND_SERVICE
Security Context (Kubernetes):
securityContext:
runAsNonRoot: true
runAsUser: 1000
capabilities:
drop: ["ALL"]
  readOnlyRootFilesystem: true
Image Signing:
# Docker Content Trust
export DOCKER_CONTENT_TRUST=1
docker push myapp:latest
Scan images for vulnerabilities before deployment.
Common Scanners:
- Trivy: Comprehensive, easy to use
- Clair: CoreOS scanner
- Anchore: Deep inspection
- Snyk: Developer-focused
- Docker Scout: Docker native
- Grype: Fast vulnerability scanner
Trivy Example:
# Scan image
trivy image myapp:latest
# Scan with severity filter
trivy image --severity CRITICAL,HIGH myapp:latest
# Generate HTML report
trivy image --format template --template "@contrib/html.tpl" -o report.html myapp:latest
CI Integration:
# GitHub Actions
- name: Scan image
uses: aquasecurity/trivy-action@master
with:
image-ref: 'myapp:latest'
format: 'sarif'
    output: 'trivy-results.sarif'
SBOM (Software Bill of Materials):
# Generate SBOM
trivy image --format cyclonedx myapp:latest > sbom.json
# Scan for known vulnerabilities
trivy sbom sbom.json
Open Container Initiative (OCI) ensures container format and runtime interoperability.
OCI Specifications:
- Image Specification: Container image format
- Runtime Specification: Container execution
- Distribution Specification: Content distribution
OCI Image Layout:
myimage/
├── blobs/
│ └── sha256/
│ ├── a1b2c3... (layer)
│ ├── d4e5f6... (config)
│ └── g7h8i9... (manifest)
└── index.json
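The `blobs/sha256/` layout above is content addressing: every blob (layer, config, manifest) is stored under the sha256 digest of its bytes, and manifests reference blobs by those digests. A sketch with illustrative bytes:

```python
import hashlib

layer = b"example layer bytes"
digest = "sha256:" + hashlib.sha256(layer).hexdigest()

# A manifest references layers by digest and size, never by file name,
# so any two tools that hash the same bytes agree on the address.
manifest = {
    "schemaVersion": 2,
    "layers": [{"digest": digest, "size": len(layer)}],
}
print(digest.startswith("sha256:"))  # True
```

Content addressing is also what makes layer deduplication and registry caching safe: identical bytes always map to the same blob path.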
Benefits:
- Interoperability: Works across tools
- Portability: Run anywhere
- Stability: Backward compatible
- Ecosystem: Wide tool support
Tools Supporting OCI:
- Docker (with containerd)
- Podman
- Buildah
- Skopeo
- CRI-O
- Kubernetes
Kubernetes orchestrates containerized applications across clusters of machines.
High-Level Architecture:
┌─────────────────────┐
│ Control Plane │
│ ┌─────────────────┐ │
│ │ API Server │ │
│ └─────────────────┘ │
│ ┌─────────────────┐ │
│ │ Scheduler │ │
│ └─────────────────┘ │
│ ┌─────────────────┐ │
│ │ Controller Mgr │ │
│ └─────────────────┘ │
│ ┌─────────────────┐ │
│ │ etcd │ │
│ └─────────────────┘ │
└──────────┬──────────┘
│
┌──────────────────────┼──────────────────────┐
│ │ │
┌───────▼───────┐ ┌───────▼───────┐ ┌───────▼───────┐
│ Node 1 │ │ Node 2 │ │ Node 3 │
│ ┌───────────┐ │ │ ┌───────────┐ │ │ ┌───────────┐ │
│ │ kubelet │ │ │ │ kubelet │ │ │ │ kubelet │ │
│ └───────────┘ │ │ └───────────┘ │ │ └───────────┘ │
│ ┌───────────┐ │ │ ┌───────────┐ │ │ ┌───────────┐ │
│ │ kube-proxy│ │ │ │ kube-proxy│ │ │ │ kube-proxy│ │
│ └───────────┘ │ │ └───────────┘ │ │ └───────────┘ │
│ ┌───────────┐ │ │ ┌───────────┐ │ │ ┌───────────┐ │
│ │ Container │ │ │ │ Container │ │ │ │ Container │ │
│ │ Runtime │ │ │ │ Runtime │ │ │ │ Runtime │ │
│ └───────────┘ │ │ └───────────┘ │ │ └───────────┘ │
└───────────────┘ └───────────────┘ └───────────────┘
API Server (kube-apiserver):
- Frontend to control plane
- Validates and configures objects
- Serves REST API
- Horizontally scalable
etcd:
- Distributed key-value store
- Cluster state storage
- Consistent and highly available
- Raft consensus algorithm
Scheduler (kube-scheduler):
- Assigns pods to nodes
- Considers resources, constraints
- Policy-based scheduling
- Extensible with custom schedulers
Controller Manager (kube-controller-manager):
Runs controllers:
- Node controller
- Replication controller
- Endpoint controller
- Service Account controller
- etc.
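Every controller in the list above runs the same reconcile loop: observe actual state, compare with declared state, and emit corrective actions. A replica-count sketch (names are illustrative):

```python
def reconcile(spec_replicas, running_pods):
    """One reconciliation pass for a ReplicaSet-style controller."""
    diff = spec_replicas - len(running_pods)
    if diff > 0:                                          # too few: create
        return [("create-pod", i) for i in range(diff)]
    if diff < 0:                                          # too many: delete
        return [("delete-pod", p) for p in running_pods[:-diff]]
    return []                                             # converged

print(reconcile(3, ["pod-a"]))                    # scale up by two
print(reconcile(1, ["pod-a", "pod-b", "pod-c"]))  # scale down by two
```

The real controllers run this continuously against watch events from the API server, which is why manual changes to managed resources get reverted.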
Cloud Controller Manager (cloud-controller-manager):
Interacts with cloud providers:
- Node management
- Load balancers
- Routes
- Volumes
Pod:
Smallest deployable unit, one or more containers.
apiVersion: v1
kind: Pod
metadata:
name: nginx
labels:
app: web
spec:
containers:
- name: nginx
image: nginx:1.21
ports:
- containerPort: 80
resources:
requests:
memory: "64Mi"
cpu: "250m"
limits:
memory: "128Mi"
        cpu: "500m"
Deployment:
Manages replica sets and rolling updates.
apiVersion: apps/v1
kind: Deployment
metadata:
name: nginx-deployment
spec:
replicas: 3
selector:
matchLabels:
app: web
template:
metadata:
labels:
app: web
spec:
containers:
- name: nginx
image: nginx:1.21
ports:
- containerPort: 80
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1
      maxUnavailable: 0
Service:
Stable network endpoint for pods.
apiVersion: v1
kind: Service
metadata:
name: web-service
spec:
selector:
app: web
ports:
- port: 80
targetPort: 80
  type: ClusterIP # Default, internal only
Service types:
- ClusterIP: Internal cluster IP
- NodePort: Expose on each node's IP
- LoadBalancer: Cloud load balancer
- ExternalName: DNS alias
Kubernetes Networking Requirements:
- Pods can communicate with all other pods without NAT
- Nodes can communicate with all pods without NAT
- A pod's IP address is the same from its own view and from other pods
CNI (Container Network Interface):
Plugins implement networking:
- Calico: Network policy, BGP
- Flannel: Simple overlay
- Weave: Mesh networking
- Cilium: eBPF-based, security
- AWS VPC CNI: Native VPC integration
Network Policies:
Firewall rules for pods:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: api-allow
spec:
podSelector:
matchLabels:
app: api
policyTypes:
- Ingress
ingress:
- from:
- podSelector:
matchLabels:
app: frontend
ports:
- protocol: TCP
      port: 8080
Volumes:
apiVersion: v1
kind: Pod
metadata:
name: app
spec:
volumes:
- name: data
emptyDir: {} # Temporary
- name: config
configMap:
name: app-config
- name: secret
secret:
secretName: db-secret
containers:
- name: app
volumeMounts:
- name: data
      mountPath: /data
Persistent Volumes (PV):
Cluster storage resource:
apiVersion: v1
kind: PersistentVolume
metadata:
name: pv-volume
spec:
capacity:
storage: 10Gi
accessModes:
- ReadWriteOnce
persistentVolumeReclaimPolicy: Retain
awsElasticBlockStore:
volumeID: vol-12345
    fsType: ext4
Persistent Volume Claims (PVC):
Request storage:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: data-claim
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
      storage: 5Gi
Storage Classes:
Dynamic provisioning:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: fast
provisioner: kubernetes.io/aws-ebs
parameters:
type: gp3
  fsType: ext4
Core Concepts:
- Role/ClusterRole: Set of permissions
- RoleBinding/ClusterRoleBinding: Bind roles to users/groups
Role Example:
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
namespace: default
name: pod-reader
rules:
- apiGroups: [""] # Core API group
resources: ["pods"]
  verbs: ["get", "list", "watch"]
ClusterRole Example:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: cluster-admin
rules:
- apiGroups: ["*"]
resources: ["*"]
  verbs: ["*"]
RoleBinding:
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: read-pods
namespace: default
subjects:
- kind: User
name: jane
apiGroup: rbac.authorization.k8s.io
roleRef:
kind: Role
name: pod-reader
  apiGroup: rbac.authorization.k8s.io
Service Account Example:
apiVersion: v1
kind: ServiceAccount
metadata:
name: app-sa
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: app-binding
subjects:
- kind: ServiceAccount
name: app-sa
namespace: default
roleRef:
kind: ClusterRole
name: view
  apiGroup: rbac.authorization.k8s.io
Helm is the package manager for Kubernetes.
Chart Structure:
mychart/
├── Chart.yaml # Metadata
├── values.yaml # Default values
├── templates/ # Template files
│ ├── deployment.yaml
│ ├── service.yaml
│ └── _helpers.tpl # Helper templates
└── charts/ # Dependencies
Chart.yaml:
apiVersion: v2
name: myapp
description: My application
type: application
version: 0.1.0
appVersion: "1.0.0"
dependencies:
- name: redis
version: 16.0.0
    repository: https://charts.bitnami.com/bitnami
Template Example:
# templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: {{ include "mychart.fullname" . }}
labels:
{{- include "mychart.labels" . | nindent 4 }}
spec:
replicas: {{ .Values.replicaCount }}
selector:
matchLabels:
{{- include "mychart.selectorLabels" . | nindent 6 }}
template:
metadata:
labels:
{{- include "mychart.selectorLabels" . | nindent 8 }}
spec:
containers:
- name: {{ .Chart.Name }}
image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
ports:
            - containerPort: {{ .Values.service.port }}
values.yaml:
replicaCount: 3
image:
repository: nginx
tag: latest
service:
type: ClusterIP
  port: 80
Helm Commands:
# Install chart
helm install myapp ./mychart
# Upgrade release
helm upgrade myapp ./mychart
# Rollback
helm rollback myapp 1
# Template rendering
helm template ./mychart
# Package chart
helm package ./mychart
Operators automate application management using Kubernetes custom resources.
What are Operators?
Operators encode human operational knowledge into software to:
- Deploy applications
- Handle backups
- Perform upgrades
- Respond to failures
Operator Pattern:
Custom Resource (CR) → Operator → Manage application
↑ ↓
User defines Actual state
desired state reconciled
Example: Prometheus Operator
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
name: main
spec:
replicas: 2
resources:
requests:
memory: 400Mi
alerting:
alertmanagers:
- namespace: monitoring
name: alertmanager-main
      port: web
Building Operators:
- Operator SDK: Framework for building
- Kubebuilder: Kubernetes API extensions
- Metacontroller: Simple operators
Operator Best Practices:
- Idempotent: Safe to run repeatedly
- Self-healing: React to changes
- Upgradeable: Handle version upgrades
- Observable: Emit metrics/events
- Testable: Comprehensive testing
CRDs extend Kubernetes API with custom resources.
CRD Example:
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
name: databases.example.com
spec:
group: example.com
names:
kind: Database
plural: databases
singular: database
shortNames:
- db
scope: Namespaced
versions:
- name: v1
served: true
storage: true
schema:
openAPIV3Schema:
type: object
properties:
spec:
type: object
properties:
engine:
type: string
enum: ["mysql", "postgres"]
version:
type: string
size:
type: string
                pattern: '^[0-9]+Gi$'
Using Custom Resource:
apiVersion: example.com/v1
kind: Database
metadata:
name: mydb
spec:
engine: postgres
version: "13"
  size: 10Gi
Security Best Practices:
API Server Security:
- Enable RBAC
- Use TLS for all communication
- Enable audit logging
- Disable anonymous auth
# kube-apiserver flags
--authorization-mode=Node,RBAC
--anonymous-auth=false
--audit-log-path=/var/log/kubernetes/audit.log
--enable-admission-plugins=NamespaceLifecycle,PodSecurityPolicy
etcd Security:
- Encrypt secrets at rest
- TLS for peer/client communication
- Firewall access
- Regular backups
Node Security:
- Minimal host OS
- Regular security updates
- CIS benchmarks
- Disable SSH or use bastion
Pod Security:
Pod Security Standards (PodSecurity admission):
apiVersion: v1
kind: Namespace
metadata:
name: production
labels:
pod-security.kubernetes.io/enforce: restricted
  pod-security.kubernetes.io/audit: baseline
Pod Security Policies (deprecated in 1.21, removed in 1.25):
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
name: restricted
spec:
privileged: false
runAsUser:
rule: MustRunAsNonRoot
seLinux:
rule: RunAsAny
fsGroup:
rule: MustRunAs
ranges:
- min: 1
max: 65535
volumes:
- 'configMap'
- 'emptyDir'
- 'projected'
- 'secret'
- 'downwardAPI'
  - 'persistentVolumeClaim'
Network Security:
- Network policies
- Encrypted traffic (mTLS with service mesh)
- Limit external access
Image Security:
- Scan images for vulnerabilities
- Use private registry
- Sign and verify images
Control Plane HA:
Load Balancer
↓
┌───┼───┐
API API API
Server Server Server
↓ ↓ ↓
etcd etcd etcd (3-5 nodes)
Requirements:
- Odd number of etcd nodes (3,5,7)
- API servers behind load balancer
- Scheduler and controller manager with leader election
Node Considerations:
- Spread across availability zones
- Cordoning and draining for maintenance
- PodDisruptionBudgets
PodDisruptionBudget:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: app-pdb
spec:
minAvailable: 2
selector:
matchLabels:
      app: myapp
Reasons for Multi-Cluster:
- Geographic distribution: Lower latency
- Compliance: Data sovereignty
- Isolation: Dev/test/prod separation
- Scaling: Beyond single cluster limits
- Disaster recovery: Active/passive or active/active
Multi-Cluster Patterns:
- Federation: Single control plane managing multiple clusters (KubeFed)
- Hub and Spoke: Central management with workload clusters
- Independent: Separate clusters with common tooling
- Hybrid: Mix of on-prem and cloud
Cluster API:
Declarative cluster management:
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
name: my-cluster
spec:
clusterNetwork:
pods:
cidrBlocks: ["192.168.0.0/16"]
infrastructureRef:
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: AWSCluster
name: my-cluster
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: AWSCluster
metadata:
name: my-cluster
spec:
region: us-west-2
  sshKeyName: default
Service meshes provide observability, security, and traffic management.
Service Mesh Architecture:
Pod
├── App Container
└── Sidecar Proxy (Envoy/Linkerd2-proxy)
↑
Control Plane (Istiod/Linkerd controller)
Istio Example:
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: reviews
spec:
hosts:
- reviews
http:
- match:
- headers:
end-user:
exact: jason
route:
- destination:
host: reviews
subset: v2
- route:
- destination:
host: reviews
      subset: v1
mTLS (mutual TLS):
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
name: default
namespace: istio-system
spec:
mtls:
    mode: STRICT # Require mTLS
Linkerd Example:
apiVersion: policy.linkerd.io/v1beta1
kind: HTTPRoute
metadata:
name: api-route
namespace: emojivoto
spec:
parentRefs:
- name: web-svc
kind: Service
group: core
port: 80
rules:
- matches:
- path:
value: "/api/vote"
filters:
- type: RequestRedirect
requestRedirect:
      scheme: https
Benefits:
- Traffic management: Canary, blue/green
- Security: mTLS, authorization
- Observability: Metrics, tracing, logs
- Resilience: Retries, timeouts, circuit breakers
Horizontal Pod Autoscaler (HPA):
Scales based on metrics:
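Behind the manifest that follows, the HPA applies one core rule from the Kubernetes documentation: desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric). A quick sketch:

```python
import math

def desired_replicas(current_replicas, current_value, target_value):
    """Kubernetes HPA core scaling rule."""
    return math.ceil(current_replicas * (current_value / target_value))

print(desired_replicas(4, 90, 70))  # CPU at 90% against a 70% target -> 6
print(desired_replicas(6, 35, 70))  # load halves -> scale down to 3
```

The manifest below applies this rule against a 70% CPU utilization target and a custom requests-per-second metric, bounded by minReplicas and maxReplicas.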
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: app-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: app
minReplicas: 3
maxReplicas: 20
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
- type: Pods
pods:
metric:
name: requests_per_second
target:
type: AverageValue
        averageValue: 1000
Vertical Pod Autoscaler (VPA):
Adjusts resource requests:
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
name: app-vpa
spec:
targetRef:
apiVersion: "apps/v1"
kind: Deployment
name: app
updatePolicy:
updateMode: "Auto" # Auto, Initial, Off
resourcePolicy:
containerPolicies:
- containerName: '*'
minAllowed:
cpu: 250m
memory: 512Mi
maxAllowed:
cpu: 4
      memory: 8Gi
Cluster Autoscaler:
Scales nodes based on pending pods:
# Add to deployment
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: cluster-autoscaler.kubernetes.io/scale-down-disabled
      operator: DoesNotExist
KEDA (Kubernetes Event-driven Autoscaling):
Scale based on events (Kafka, RabbitMQ, etc.):
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: kafka-scaler
spec:
scaleTargetRef:
name: consumer
triggers:
- type: kafka
metadata:
topic: my-topic
bootstrapServers: kafka:9092
consumerGroup: my-group
    lagThreshold: "10"
Metrics:
- Node metrics: CPU, memory, disk
- Pod metrics: Resource usage
- Custom metrics: Application-specific
Prometheus Stack:
# ServiceMonitor for Prometheus
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: app-monitor
spec:
selector:
matchLabels:
app: myapp
endpoints:
- port: metrics
    interval: 30s
Logging:
- Container logs: stdout/stderr
- Node logs: kubelet, container runtime
- Audit logs: API server activity
EFK Stack:
- Elasticsearch: Storage and search
- Fluentd/Fluent Bit: Log collection
- Kibana: Visualization
Tracing:
Distributed tracing with Jaeger:
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: simplest
OpenTelemetry:
Vendor-neutral observability:
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
name: simplest
spec:
config: |
receivers:
otlp:
protocols:
grpc:
http:
processors:
memory_limiter:
limit_mib: 512
exporters:
jaeger:
endpoint: jaeger:14250
tls:
insecure: true
service:
pipelines:
traces:
receivers: [otlp]
processors: [memory_limiter]
          exporters: [jaeger]
Backup Strategies:
etcd Backup:
# Snapshot etcd
ETCDCTL_API=3 etcdctl snapshot save snapshot.db
# Restore
ETCDCTL_API=3 etcdctl snapshot restore snapshot.db
Velero (formerly Heptio Ark):
Backup and restore Kubernetes resources:
# Schedule backup
apiVersion: velero.io/v1
kind: Schedule
metadata:
name: daily-backup
spec:
schedule: "0 1 * * *"
template:
includedNamespaces:
- production
  ttl: 720h
Velero Commands:
# On-demand backup
velero backup create app-backup --include-namespaces production
# Restore
velero restore create --from-backup app-backup
# Schedule backup
velero schedule create daily --schedule="0 1 * * *" --include-namespaces production
DR Patterns:
Active-Passive:
- One cluster active, one standby
- Data replication between clusters
- DNS switch on failure
Active-Active:
- Multiple clusters serving traffic
- Global load balancing
- Data synchronization challenges
Backup and Restore:
- Regular backups
- Documented restore procedures
- Test restores regularly
Resource Management:
Rightsizing:
- Use VPA to find optimal requests
- Analyze usage patterns
- Remove unused resources
Node Optimization:
- Use spot/preemptible instances for stateless workloads
- Right-size instance types
- Use cluster autoscaler
Kubecost:
# Kubecost deployment
helm repo add kubecost https://kubecost.github.io/cost-analyzer/
helm install kubecost kubecost/cost-analyzer
Karpenter (AWS):
Dynamic node provisioning:
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
name: default
spec:
requirements:
- key: karpenter.sh/capacity-type
operator: In
values: ["spot", "on-demand"]
- key: kubernetes.io/arch
operator: In
values: ["amd64", "arm64"]
limits:
resources:
cpu: 1000
provider:
subnetSelector:
karpenter/discovery: my-cluster
securityGroupSelector:
    karpenter/discovery: my-cluster
Cost Optimization Checklist:
- Rightsize pods (use VPA)
- Use spot instances where possible
- Scale down non-production clusters
- Remove unused load balancers
- Optimize storage (use reclaim policies)
- Monitor and alert on cost spikes
- Use namespace quotas
- Implement resource limits
Imperative Approach:
Describe how to achieve desired state:
# Create VPC
aws ec2 create-vpc --cidr-block 10.0.0.0/16
# Create subnet
aws ec2 create-subnet --vpc-id vpc-123 --cidr-block 10.0.1.0/24
# Create internet gateway
aws ec2 create-internet-gateway
aws ec2 attach-internet-gateway --vpc-id vpc-123 --internet-gateway-id igw-456
Problems:
- Not idempotent
- Difficult to reproduce
- No state tracking
- Error-prone
Declarative Approach:
Describe what you want:
# Terraform
resource "aws_vpc" "main" {
cidr_block = "10.0.0.0/16"
}
resource "aws_subnet" "main" {
vpc_id = aws_vpc.main.id
cidr_block = "10.0.1.0/24"
}
resource "aws_internet_gateway" "main" {
vpc_id = aws_vpc.main.id
}
Benefits:
- Idempotent
- Self-documenting
- Version controllable
- Predictable
- Reusable
Mutable Infrastructure:
- Servers are updated in place
- Configuration drifts over time
- Configuration management tools fix drift
- "Snowflake" servers
Immutable Infrastructure:
- Never modify servers after deployment
- Replace, don't change
- Everything in version control
- Identical environments
- Easy rollback (redeploy previous version)
Benefits:
- Consistency: All servers identical
- Reproducibility: Recreate from scratch
- Testing: Test immutable artifacts
- Rollback: Deploy previous version
- Debugging: Known state
Implementation:
Version 1:
Source → Build → Image v1 → Deploy → Running v1
Version 2:
Source → Build → Image v2 → Deploy → Running v2
↓
Terminate v1
Definition: An operation is idempotent if applying it multiple times has the same effect as applying it once.
Examples:
Non-idempotent:
# Each run appends another line
echo "new line" >> file.txt
# Fails on the second run (directory already exists)
mkdir data
Idempotent:
# Only creates if it doesn't exist
touch file.txt
# Overwrites with the same content every run
echo "data" > file.txt
In IaC:
# Idempotent - creates only if doesn't exist
resource "aws_instance" "web" {
ami = "ami-123"
instance_type = "t2.micro"
# Tags ensure we can identify
tags = {
Name = "web-server"
}
}
Benefits:
- Safe to reapply
- Predictable outcomes
- Easy automation
- Self-healing
State tracks resources managed by IaC.
Why State Matters:
- Maps configuration to real resources
- Tracks metadata and dependencies
- Enables updates and deletion
- Improves performance (caching)
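The first point, mapping configuration to real resources, is the crux: without a record tying a resource address to a real-world ID, a tool cannot tell "update this instance" from "create another one". A sketch with illustrative names and IDs (not Terraform's state format):

```python
# State: resource address -> real-world ID, recorded at creation time
state = {"aws_instance.web": "i-0abc123"}

def action_for(address, state):
    """Decide create vs update by consulting state, as an IaC tool would."""
    if address in state:
        return ("update", state[address])   # known resource: update in place
    return ("create", None)                 # unknown resource: create it

print(action_for("aws_instance.web", state))  # ('update', 'i-0abc123')
print(action_for("aws_subnet.main", state))   # ('create', None)
```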
State Storage:
Local State:
terraform {
backend "local" {
path = "terraform.tfstate"
}
}
- Simple but not for teams
- No locking
- Easy to lose
Remote State:
# AWS S3
terraform {
backend "s3" {
bucket = "my-terraform-state"
key = "prod/network/terraform.tfstate"
region = "us-east-1"
# Enable locking
dynamodb_table = "terraform-locks"
}
}
Azure Storage:
terraform {
backend "azurerm" {
storage_account_name = "tfstate123"
container_name = "tfstate"
key = "prod.terraform.tfstate"
    access_key = "xxx" # Prefer an environment variable over hardcoding
}
}
Google Cloud Storage:
terraform {
backend "gcs" {
bucket = "tf-state-prod"
prefix = "terraform/state"
}
}
State Best Practices:
- Remote storage: Never store state locally
- State locking: Prevent concurrent modifications
- Encryption: Encrypt state at rest
- Access control: Restrict who can read/write
- Backup: Regular state backups
- Isolation: Separate state per environment
HashiCorp Terraform is the most popular IaC tool.
Core Concepts:
- Providers: AWS, Azure, GCP, Kubernetes, etc.
- Resources: Infrastructure components
- Data sources: Read existing resources
- Variables: Parameterize configurations
- Outputs: Export resource attributes
- Modules: Reusable configurations
Basic Example:
# main.tf
terraform {
required_version = ">= 1.0"
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 4.0"
}
}
}
provider "aws" {
region = var.aws_region
}
resource "aws_instance" "web" {
ami = data.aws_ami.ubuntu.id
instance_type = var.instance_type
tags = {
Name = "web-${var.environment}"
Environment = var.environment
}
}
data "aws_ami" "ubuntu" {
most_recent = true
owners = ["099720109477"] # Canonical
filter {
name = "name"
values = ["ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-*"]
}
}
variable "aws_region" {
description = "AWS region"
type = string
default = "us-east-1"
}
variable "instance_type" {
description = "EC2 instance type"
type = string
}
variable "environment" {
description = "Environment name"
type = string
}
output "instance_ip" {
description = "Public IP of instance"
value = aws_instance.web.public_ip
}
Variables File (terraform.tfvars):
instance_type = "t3.micro"
environment   = "production"
Commands:
# Initialize (download providers)
terraform init
# Format code
terraform fmt
# Validate syntax
terraform validate
# Plan changes
terraform plan
# Apply changes
terraform apply
# Destroy resources
terraform destroy
# Show state
terraform show
# List resources
terraform state list
Agentless configuration management and automation.
Core Concepts:
- Playbooks: YAML files defining automation
- Modules: Reusable units of work
- Inventory: List of managed hosts
- Roles: Organized playbook structure
- Facts: System information gathered
Playbook Example:
---
- name: Configure web servers
hosts: webservers
become: yes
vars:
http_port: 80
max_clients: 200
tasks:
- name: Ensure nginx is installed
apt:
name: nginx
state: present
when: ansible_os_family == "Debian"
- name: Ensure nginx is running
service:
name: nginx
state: started
enabled: yes
- name: Copy nginx configuration
template:
src: nginx.conf.j2
dest: /etc/nginx/nginx.conf
notify: restart nginx
- name: Deploy website
copy:
src: index.html
dest: /var/www/html/index.html
handlers:
- name: restart nginx
service:
name: nginx
      state: restarted
Inventory (hosts.ini):
[webservers]
web1.example.com
web2.example.com
[databases]
db1.example.com
db2.example.com
[all:vars]
ansible_user = ubuntu
ansible_ssh_private_key_file = ~/.ssh/prod-key.pem
Role Structure:
roles/
└── nginx/
├── tasks/
│ └── main.yml
├── handlers/
│ └── main.yml
├── templates/
│ └── nginx.conf.j2
├── files/
│ └── index.html
├── vars/
│ └── main.yml
└── defaults/
└── main.yml
Commands:
# Ping all hosts
ansible all -m ping
# Run ad-hoc command
ansible webservers -m command -a "uptime"
# Run playbook
ansible-playbook site.yml
# Check syntax
ansible-playbook site.yml --syntax-check
# Dry run
ansible-playbook site.yml --check
# Limit to specific hosts
ansible-playbook site.yml --limit web1
IaC using general-purpose programming languages.
Example (TypeScript):
import * as aws from "@pulumi/aws";
import * as pulumi from "@pulumi/pulumi";
const config = new pulumi.Config();
const instanceType = config.get("instanceType") || "t3.micro";
// Get the latest Ubuntu AMI
const ubuntu = aws.ec2.getAmi({
mostRecent: true,
filters: [
{
name: "name",
values: ["ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-*"],
},
],
owners: ["099720109477"],
});
// Create a security group
const group = new aws.ec2.SecurityGroup("web-sg", {
description: "Allow HTTP and SSH",
ingress: [
{ protocol: "tcp", fromPort: 22, toPort: 22, cidrBlocks: ["0.0.0.0/0"] },
{ protocol: "tcp", fromPort: 80, toPort: 80, cidrBlocks: ["0.0.0.0/0"] },
],
egress: [
{ protocol: "-1", fromPort: 0, toPort: 0, cidrBlocks: ["0.0.0.0/0"] },
],
});
// Create an EC2 instance
const server = new aws.ec2.Instance("web-server", {
instanceType: instanceType,
ami: ubuntu.then(ami => ami.id),
vpcSecurityGroupIds: [group.id],
userData: `#!/bin/bash
apt-get update
apt-get install -y nginx
systemctl start nginx
`,
tags: {
Name: "web-server",
Environment: pulumi.getStack(),
},
});
// Export the instance's public IP
export const publicIp = server.publicIp;
export const publicHostname = server.publicDns;
Example (Python):
import pulumi
import pulumi_aws as aws
config = pulumi.Config()
instance_type = config.get("instanceType", "t3.micro")
# Get the latest Ubuntu AMI
ubuntu = aws.ec2.get_ami(
most_recent=True,
filters=[
{
"name": "name",
"values": ["ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-*"]
}
],
owners=["099720109477"]
)
# Create security group
group = aws.ec2.SecurityGroup("web-sg",
description="Allow HTTP and SSH",
ingress=[
{"protocol": "tcp", "from_port": 22, "to_port": 22, "cidr_blocks": ["0.0.0.0/0"]},
{"protocol": "tcp", "from_port": 80, "to_port": 80, "cidr_blocks": ["0.0.0.0/0"]},
],
egress=[
{"protocol": "-1", "from_port": 0, "to_port": 0, "cidr_blocks": ["0.0.0.0/0"]}
]
)
# Create EC2 instance
server = aws.ec2.Instance("web-server",
instance_type=instance_type,
ami=ubuntu.id,
vpc_security_group_ids=[group.id],
user_data="""#!/bin/bash
apt-get update
apt-get install -y nginx
systemctl start nginx
""",
tags={
"Name": "web-server",
"Environment": pulumi.get_stack()
}
)
pulumi.export("public_ip", server.public_ip)
pulumi.export("public_hostname", server.public_dns)
Benefits:
- Use familiar programming languages
- Loops, conditionals, functions
- Strong typing (TypeScript, Go)
- Reuse existing code/libraries
- Better IDE support
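The loops-and-functions benefit can be sketched in plain Python: a small helper generates the ingress-rule dictionaries that a security group expects, where a declarative template would need one verbatim block per port. The `make_ingress_rules` helper is illustrative, not part of the Pulumi SDK.

```python
# Illustrative only: a general-purpose language lets you generate
# resource arguments with loops and functions. This helper is
# hypothetical, not a Pulumi API.
def make_ingress_rules(ports, cidr="0.0.0.0/0"):
    """Build one TCP ingress-rule dict per port, in the shape the
    security group examples above pass as their ingress argument."""
    return [
        {"protocol": "tcp", "from_port": p, "to_port": p, "cidr_blocks": [cidr]}
        for p in ports
    ]

rules = make_ingress_rules([22, 80, 443])
# Each rule is a plain dict, so it can be inspected or unit tested
# before any cloud resource is created.
```

Because the rules are ordinary data structures, the same helper can be reused across stacks or covered by unit tests, which is hard to do with raw HCL or YAML.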
CloudFormation: AWS-native IaC tool.
Template Structure:
AWSTemplateFormatVersion: "2010-09-09"
Description: "Web server stack"
Parameters:
InstanceType:
Description: EC2 instance type
Type: String
Default: t3.micro
AllowedValues:
- t3.micro
- t3.small
- t3.medium
Mappings:
RegionMap:
us-east-1:
AMI: ami-0c02fb55956c7d316 # Ubuntu 20.04
us-west-2:
AMI: ami-0d6621c01e8c2de54
Resources:
WebServerSecurityGroup:
Type: AWS::EC2::SecurityGroup
Properties:
GroupDescription: Allow HTTP and SSH
SecurityGroupIngress:
- IpProtocol: tcp
FromPort: 22
ToPort: 22
CidrIp: 0.0.0.0/0
- IpProtocol: tcp
FromPort: 80
ToPort: 80
CidrIp: 0.0.0.0/0
WebServer:
Type: AWS::EC2::Instance
Properties:
ImageId: !FindInMap [RegionMap, !Ref "AWS::Region", AMI]
InstanceType: !Ref InstanceType
SecurityGroupIds:
- !Ref WebServerSecurityGroup
UserData:
Fn::Base64: !Sub |
#!/bin/bash
apt-get update
apt-get install -y nginx
systemctl start nginx
Tags:
- Key: Name
Value: WebServer
Outputs:
PublicIP:
Description: Public IP of web server
Value: !GetAtt WebServer.PublicIp
PublicDNS:
Description: Public DNS of web server
Value: !GetAtt WebServer.PublicDnsName
StackSets: Deploy across multiple regions/accounts.
Change Sets: Preview changes before applying.
Terraform Backends:
S3 Backend:
terraform {
backend "s3" {
bucket = "company-terraform-state"
key = "prod/network/terraform.tfstate"
region = "us-east-1"
encrypt = true
dynamodb_table = "terraform-locks"
}
}
DynamoDB Lock Table:
resource "aws_dynamodb_table" "terraform_locks" {
name = "terraform-locks"
billing_mode = "PAY_PER_REQUEST"
hash_key = "LockID"
attribute {
name = "LockID"
type = "S"
}
}
Azure Backend:
terraform {
backend "azurerm" {
resource_group_name = "terraform-state"
storage_account_name = "tfstate123"
container_name = "tfstate"
key = "prod.terraform.tfstate"
}
}
GCS Backend:
terraform {
backend "gcs" {
bucket = "terraform-state-prod"
prefix = "network"
}
}
State Isolation Strategies:
- Workspaces: Same config, separate state
- Directory structure: Different configs per environment
- Terragrunt: DRY configurations
Workspaces:
# Create workspace
terraform workspace new dev
terraform workspace new prod
# List workspaces
terraform workspace list
# Switch workspace
terraform workspace select prod
# Use in config
locals {
environment = terraform.workspace
}
Module Structure:
modules/
└── webserver/
├── main.tf
├── variables.tf
├── outputs.tf
└── README.md
Module Code (main.tf):
resource "aws_instance" "web" {
ami = var.ami
instance_type = var.instance_type
subnet_id = var.subnet_id
vpc_security_group_ids = [aws_security_group.web.id]
user_data = var.user_data
tags = var.tags
}
resource "aws_security_group" "web" {
name_prefix = "${var.name}-sg"
vpc_id = var.vpc_id
dynamic "ingress" {
for_each = var.ingress_rules
content {
from_port = ingress.value.from_port
to_port = ingress.value.to_port
protocol = ingress.value.protocol
cidr_blocks = ingress.value.cidr_blocks
}
}
egress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
}
tags = var.tags
}
variables.tf:
variable "name" {
description = "Name prefix for resources"
type = string
}
variable "ami" {
description = "AMI ID for the instance"
type = string
}
variable "instance_type" {
description = "Instance type"
type = string
default = "t3.micro"
}
variable "subnet_id" {
description = "Subnet ID for the instance"
type = string
}
variable "vpc_id" {
description = "VPC ID for security group"
type = string
}
variable "user_data" {
description = "User data script"
type = string
default = ""
}
variable "ingress_rules" {
description = "List of ingress rules"
type = list(object({
from_port = number
to_port = number
protocol = string
cidr_blocks = list(string)
}))
default = [
{
from_port = 80
to_port = 80
protocol = "tcp"
cidr_blocks = ["0.0.0.0/0"]
}
]
}
variable "tags" {
description = "Tags to apply"
type = map(string)
default = {}
}
outputs.tf:
output "instance_id" {
description = "Instance ID"
value = aws_instance.web.id
}
output "public_ip" {
description = "Public IP address"
value = aws_instance.web.public_ip
}
output "security_group_id" {
description = "Security group ID"
value = aws_security_group.web.id
}
Using the Module:
module "web_server" {
source = "../modules/webserver"
name = "prod-web"
ami = data.aws_ami.ubuntu.id
instance_type = "t3.small"
subnet_id = aws_subnet.public.id
vpc_id = aws_vpc.main.id
ingress_rules = [
{
from_port = 80
to_port = 80
protocol = "tcp"
cidr_blocks = ["0.0.0.0/0"]
},
{
from_port = 443
to_port = 443
protocol = "tcp"
cidr_blocks = ["0.0.0.0/0"]
}
]
tags = {
Environment = "production"
ManagedBy = "terraform"
}
}
output "web_ip" {
value = module.web_server.public_ip
}
Policy as Code: Enforce policies on infrastructure.
Sentinel (HashiCorp):
# Restrict instance types
import "tfplan"
main = rule {
all tfplan.resources.aws_instance as _, instances {
all instances as _, instance {
instance.applied.instance_type in ["t3.micro", "t3.small"]
}
}
}
Open Policy Agent (OPA):
Rego policy:
package terraform
deny[msg] {
resource := input.resource_changes[_]
resource.type == "aws_instance"
resource.change.after.instance_type == "t3.large"
msg := sprintf("Instance type t3.large not allowed in %v", [resource.address])
}
deny[msg] {
resource := input.resource_changes[_]
resource.type == "aws_s3_bucket"
not resource.change.after_unknown.aws_s3_bucket_public_access_block
msg := sprintf("S3 bucket %v requires public access block", [resource.address])
}
Checkov:
Scan Terraform for security issues:
# Install
pip install checkov
# Scan
checkov -d ./
# Scan specific file
checkov -f main.tf
# Output formats
checkov -d ./ --output junitxml > results.xml
Example Check:
# Custom check
from checkov.common.models.enums import CheckResult, CheckCategories
from checkov.terraform.checks.resource.base_resource_check import BaseResourceCheck
class S3PublicACL(BaseResourceCheck):
def __init__(self):
name = "Ensure S3 bucket has no public ACL"
id = "CUSTOM_AWS_001"
supported_resources = ['aws_s3_bucket']
categories = [CheckCategories.SECURITY]
super().__init__(name=name, id=id, categories=categories, supported_resources=supported_resources)
def scan_resource_conf(self, conf):
if 'acl' in conf and conf['acl'] == ['public-read']:
return CheckResult.FAILED
return CheckResult.PASSED
check = S3PublicACL()
Infrastructure as a Service (IaaS):
- Virtual machines, storage, networks
- You manage OS, middleware, runtime, data, apps
- Provider manages virtualization, servers, storage, networking
Examples: AWS EC2, Azure VMs, Google Compute Engine
Platform as a Service (PaaS):
- Managed runtime environment
- You manage data and apps
- Provider manages everything else
Examples: Heroku, Google App Engine, AWS Elastic Beanstalk
Software as a Service (SaaS):
- Complete application
- You just use it
- Provider manages everything
Examples: Salesforce, Office 365, Google Workspace
Function as a Service (FaaS):
- Serverless functions
- You write code, provider runs it
- Pay per execution
Examples: AWS Lambda, Azure Functions, Google Cloud Functions
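The FaaS model can be illustrated with a minimal AWS-Lambda-style handler: Lambda invokes a function with an event payload and a context object, and you pay only for the invocations. This is a sketch of the handler convention, not a deployable function.

```python
import json

# Minimal AWS-Lambda-style handler sketch. Lambda calls the function
# with an event dict and a context object; this one echoes a greeting.
def handler(event, context):
    name = event.get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"Hello, {name}!"}),
    }

# Locally the handler is just a function, which makes unit testing
# straightforward before any cloud deployment:
response = handler({"name": "DevOps"}, None)
```

There are no servers to manage: the provider handles provisioning, scaling, and patching of the runtime underneath the function.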
Public Cloud:
- Shared infrastructure
- Multi-tenant
- Pay-as-you-go
- Global scale
- Examples: AWS, Azure, GCP
Private Cloud:
- Dedicated infrastructure
- Single tenant
- More control
- Compliance benefits
- Examples: OpenStack, VMware
Hybrid Cloud:
- Mix of public and private
- Workload mobility
- Data locality options
- Burst to public cloud
Multi-Cloud:
- Multiple public cloud providers
- Avoid vendor lock-in
- Best-of-breed services
- Geographic presence
Virtual Private Cloud (VPC):
Isolated network section:
VPC (10.0.0.0/16)
├── Public Subnet (10.0.1.0/24)
│ └── Internet Gateway
├── Private Subnet (10.0.2.0/24)
│ └── NAT Gateway
└── Database Subnet (10.0.3.0/24)
└── No internet access
Key Components:
- Subnets: Network segments
- Route tables: Traffic routing
- Internet Gateway: Public internet access
- NAT Gateway: Private subnet outbound access
- VPN Gateway: On-premises connection
- Load Balancers: Traffic distribution
- CDN: Content delivery
Network Security:
- Security Groups: Instance-level firewall (stateful)
- Network ACLs: Subnet-level firewall (stateless)
- WAF: Web application firewall
- DDoS protection: Shield, Cloudflare
Identity and Access Management (IAM):
Core Components:
- Users: Individual people/accounts
- Groups: Collections of users
- Roles: Temporary permissions
- Policies: Permission documents
- Permissions: Allow/deny actions
IAM Policy Example (AWS):
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:ListBucket",
"s3:GetObject"
],
"Resource": [
"arn:aws:s3:::my-bucket",
"arn:aws:s3:::my-bucket/*"
],
"Condition": {
"IpAddress": {
"aws:SourceIp": "192.168.1.0/24"
}
}
}
]
}
Least Privilege Principle:
- Grant minimum necessary permissions
- Regularly audit permissions
- Use groups and roles
- Avoid wildcards when possible
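The "avoid wildcards" guidance can be automated with a small audit pass over policy documents. The helper below is an illustrative sketch, not an AWS SDK feature: it flags statements whose actions end in a wildcard.

```python
# Illustrative audit helper: flag IAM policy statements that grant
# wildcard actions. A sketch of the "avoid wildcards" rule, not an
# official AWS tool.
def find_wildcard_actions(policy):
    findings = []
    for stmt in policy.get("Statement", []):
        actions = stmt.get("Action", [])
        if isinstance(actions, str):  # Action may be a string or a list
            actions = [actions]
        for action in actions:
            if action == "*" or action.endswith(":*"):
                findings.append(action)
    return findings

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": ["s3:GetObject", "s3:*"], "Resource": "*"}
    ],
}
# find_wildcard_actions(policy) flags "s3:*" but not "s3:GetObject"
```

Checks like this fit naturally into the policy-as-code tools covered earlier (Sentinel, OPA, Checkov), which apply the same idea at plan time.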
Identity Federation:
- SAML 2.0 (Active Directory)
- OIDC (Google, GitHub)
- Social logins
AWS is the leading cloud provider with the broadest service portfolio.
Global Infrastructure:
- Regions: Geographic areas (us-east-1, eu-west-1)
- Availability Zones: Isolated data centers per region
- Edge Locations: CDN endpoints
- Local Zones: Extend regions to population centers
Service Categories:
- Compute
- Storage
- Database
- Networking
- Security & Identity
- Analytics
- Machine Learning
- Developer Tools
- Management & Governance
Amazon EC2: Virtual servers in the cloud.
Instance Types:
- General Purpose: t3, m5 (balanced)
- Compute Optimized: c5 (CPU intensive)
- Memory Optimized: r5, x1 (RAM intensive)
- Storage Optimized: i3, d2 (disk I/O)
- GPU Instances: p3, g4 (graphics, ML)
Launch Configuration:
resource "aws_instance" "web" {
ami = "ami-0c02fb55956c7d316"
instance_type = "t3.micro"
subnet_id = aws_subnet.public.id
vpc_security_group_ids = [aws_security_group.web.id]
associate_public_ip_address = true
user_data = <<-EOF
#!/bin/bash
yum update -y
yum install -y httpd
systemctl start httpd
systemctl enable httpd
echo "<h1>Hello from $(hostname -f)</h1>" > /var/www/html/index.html
EOF
tags = {
Name = "web-server"
}
}
Purchase Options:
- On-Demand: Pay by hour/second
- Reserved: 1-3 year commitment, up to 75% discount
- Spot: Bid for unused capacity, up to 90% discount
- Savings Plans: Flexible pricing
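A back-of-envelope comparison shows how the discounts compound for always-on workloads. The prices and discount rates below are assumed for illustration; real rates vary by region, instance type, and term.

```python
# Rough yearly cost comparison of EC2 purchase options.
# All prices and discounts below are hypothetical placeholders.
HOURS_PER_YEAR = 8760

on_demand_hourly = 0.0416   # assumed on-demand rate, $/hour
reserved_discount = 0.60    # assumed 1-year reserved discount
spot_discount = 0.90        # assumed spot discount vs on-demand

on_demand_yearly = on_demand_hourly * HOURS_PER_YEAR
reserved_yearly = on_demand_yearly * (1 - reserved_discount)
spot_yearly = on_demand_yearly * (1 - spot_discount)

# A steady 24x7 workload favors reserved pricing or Savings Plans;
# interruptible batch work can use spot for the deepest discount.
```

The general rule of thumb: commit (reserved/Savings Plans) for predictable baseline load, spot for fault-tolerant burst work, on-demand for everything else.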
Amazon S3: Object storage for the cloud.
Storage Classes:
- S3 Standard: Frequently accessed data
- S3 Intelligent-Tiering: Auto-tiering
- S3 Standard-IA: Infrequent access
- S3 One Zone-IA: Lower cost, less durable
- S3 Glacier: Archive (minutes to hours retrieval)
- S3 Glacier Deep Archive: Long-term archive (hours retrieval)
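These classes are typically combined with a lifecycle configuration that transitions objects automatically as they age. A sketch of such a policy in the JSON shape the S3 lifecycle API expects (the rule ID, prefix, and day counts are illustrative):

```json
{
  "Rules": [
    {
      "ID": "archive-old-objects",
      "Status": "Enabled",
      "Filter": { "Prefix": "logs/" },
      "Transitions": [
        { "Days": 30, "StorageClass": "STANDARD_IA" },
        { "Days": 90, "StorageClass": "GLACIER" }
      ],
      "Expiration": { "Days": 365 }
    }
  ]
}
```

This moves objects under `logs/` to Standard-IA after 30 days, to Glacier after 90, and deletes them after a year.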
Bucket Example:
resource "aws_s3_bucket" "data" {
bucket = "my-company-data-${var.environment}"
tags = {
Environment = var.environment
}
}
resource "aws_s3_bucket_versioning" "data" {
bucket = aws_s3_bucket.data.id
versioning_configuration {
status = "Enabled"
}
}
resource "aws_s3_bucket_server_side_encryption_configuration" "data" {
bucket = aws_s3_bucket.data.id
rule {
apply_server_side_encryption_by_default {
sse_algorithm = "AES256"
}
}
}
resource "aws_s3_bucket_public_access_block" "data" {
bucket = aws_s3_bucket.data.id
block_public_acls = true
block_public_policy = true
ignore_public_acls = true
restrict_public_buckets = true
}
CLI Commands:
# List buckets
aws s3 ls
# Copy file
aws s3 cp file.txt s3://my-bucket/
# Sync directory
aws s3 sync ./local s3://my-bucket/
# Set lifecycle policy
aws s3api put-bucket-lifecycle-configuration \
--bucket my-bucket \
--lifecycle-configuration file://lifecycle.json
Amazon RDS: Managed relational databases.
Supported Engines:
- Amazon Aurora (MySQL/PostgreSQL compatible)
- MySQL
- PostgreSQL
- MariaDB
- Oracle
- SQL Server
Example (PostgreSQL):
resource "aws_db_instance" "postgres" {
identifier = "myapp-${var.environment}"
engine = "postgres"
engine_version = "13.7"
instance_class = "db.t3.micro"
allocated_storage = 20
storage_type = "gp3"
storage_encrypted = true
db_name = "myapp"
username = "admin"
password = random_password.db_password.result
vpc_security_group_ids = [aws_security_group.database.id]
db_subnet_group_name = aws_db_subnet_group.main.name
backup_retention_period = 30
backup_window = "03:00-04:00"
maintenance_window = "sun:04:00-sun:05:00"
skip_final_snapshot = false
final_snapshot_identifier = "myapp-${var.environment}-final-${formatdate("YYYY-MM-DD-hhmm", timestamp())}"
tags = {
Environment = var.environment
}
}
resource "random_password" "db_password" {
length = 32
special = false
}
Aurora Serverless:
resource "aws_rds_cluster" "aurora" {
cluster_identifier = "aurora-serverless-${var.environment}"
engine = "aurora-postgresql"
engine_version = "13.6"
database_name = "myapp"
master_username = "admin"
master_password = random_password.db_password.result
serverlessv2_scaling_configuration {
min_capacity = 0.5
max_capacity = 8
}
vpc_security_group_ids = [aws_security_group.database.id]
db_subnet_group_name = aws_db_subnet_group.main.name
backup_retention_period = 7
skip_final_snapshot = false
final_snapshot_identifier = "aurora-${var.environment}-final"
}
Amazon VPC: Isolated network environment.
Complete VPC Example:
# VPC
resource "aws_vpc" "main" {
cidr_block = "10.0.0.0/16"
enable_dns_hostnames = true
enable_dns_support = true
tags = {
Name = "main-${var.environment}"
}
}
# Public subnets
resource "aws_subnet" "public" {
count = length(var.availability_zones)
vpc_id = aws_vpc.main.id
cidr_block = "10.0.${count.index}.0/24"
availability_zone = var.availability_zones[count.index]
map_public_ip_on_launch = true
tags = {
Name = "public-${var.availability_zones[count.index]}"
}
}
# Private subnets
resource "aws_subnet" "private" {
count = length(var.availability_zones)
vpc_id = aws_vpc.main.id
cidr_block = "10.0.${count.index + 10}.0/24"
availability_zone = var.availability_zones[count.index]
tags = {
Name = "private-${var.availability_zones[count.index]}"
}
}
# Internet Gateway
resource "aws_internet_gateway" "main" {
vpc_id = aws_vpc.main.id
tags = {
Name = "main-igw"
}
}
# NAT Gateways (one per AZ)
resource "aws_eip" "nat" {
count = length(var.availability_zones)
vpc = true
tags = {
Name = "nat-${var.availability_zones[count.index]}"
}
}
resource "aws_nat_gateway" "main" {
count = length(var.availability_zones)
allocation_id = aws_eip.nat[count.index].id
subnet_id = aws_subnet.public[count.index].id
tags = {
Name = "nat-${var.availability_zones[count.index]}"
}
depends_on = [aws_internet_gateway.main]
}
# Route tables
resource "aws_route_table" "public" {
vpc_id = aws_vpc.main.id
route {
cidr_block = "0.0.0.0/0"
gateway_id = aws_internet_gateway.main.id
}
tags = {
Name = "public"
}
}
resource "aws_route_table" "private" {
count = length(var.availability_zones)
vpc_id = aws_vpc.main.id
route {
cidr_block = "0.0.0.0/0"
nat_gateway_id = aws_nat_gateway.main[count.index].id
}
tags = {
Name = "private-${var.availability_zones[count.index]}"
}
}
# Route table associations
resource "aws_route_table_association" "public" {
count = length(var.availability_zones)
subnet_id = aws_subnet.public[count.index].id
route_table_id = aws_route_table.public.id
}
resource "aws_route_table_association" "private" {
count = length(var.availability_zones)
subnet_id = aws_subnet.private[count.index].id
route_table_id = aws_route_table.private[count.index].id
}
IAM User and Group:
# Create group
resource "aws_iam_group" "developers" {
name = "developers"
}
# Create user
resource "aws_iam_user" "john" {
name = "john.doe"
path = "/developers/"
}
# Add user to group
resource "aws_iam_group_membership" "developers" {
name = "developers-group-membership"
users = [
aws_iam_user.john.name,
]
group = aws_iam_group.developers.name
}
# Group policy
resource "aws_iam_group_policy" "developers_policy" {
name = "developers-policy"
group = aws_iam_group.developers.name
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Effect = "Allow"
Action = [
"ec2:Describe*",
"s3:ListBucket",
]
Resource = "*"
}
]
})
}
IAM Role for EC2:
# Role
resource "aws_iam_role" "ec2_role" {
name = "ec2-role"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Effect = "Allow"
Principal = {
Service = "ec2.amazonaws.com"
}
Action = "sts:AssumeRole"
}
]
})
}
# Policy attachment
resource "aws_iam_role_policy_attachment" "s3_read" {
role = aws_iam_role.ec2_role.name
policy_arn = "arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess"
}
# Instance profile
resource "aws_iam_instance_profile" "ec2_profile" {
name = "ec2-profile"
role = aws_iam_role.ec2_role.name
}
Amazon EKS: Managed Kubernetes on AWS.
EKS Cluster:
module "eks" {
source = "terraform-aws-modules/eks/aws"
version = "19.0.0"
cluster_name = "myapp-${var.environment}"
cluster_version = "1.24"
vpc_id = aws_vpc.main.id
subnet_ids = concat(aws_subnet.public[*].id, aws_subnet.private[*].id)
# Managed node groups
eks_managed_node_groups = {
main = {
desired_size = 3
min_size = 1
max_size = 10
instance_types = ["t3.medium"]
tags = {
Environment = var.environment
}
}
}
# Fargate profiles (serverless)
fargate_profiles = {
default = {
name = "default"
selectors = [
{
namespace = "default"
}
]
}
}
tags = {
Environment = var.environment
}
}
# Configure kubectl
resource "local_file" "kubeconfig" {
content = module.eks.kubeconfig
filename = "./kubeconfig_${var.environment}"
}Access Entry (EKS API):
resource "aws_eks_access_entry" "admin" {
cluster_name = module.eks.cluster_name
principal_arn = "arn:aws:iam::123456789:role/Admin"
type = "STANDARD"
}
resource "aws_eks_access_policy_association" "admin" {
cluster_name = module.eks.cluster_name
policy_arn = "arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy"
principal_arn = aws_eks_access_entry.admin.principal_arn
access_scope {
type = "cluster"
}
}
Azure is Microsoft's cloud platform, strong in enterprise integration.
Global Infrastructure:
- 60+ regions worldwide
- Availability Zones
- ExpressRoute private connections
Key Services:
- Azure Virtual Machines (IaaS)
- Azure Kubernetes Service (AKS)
- Azure App Service (PaaS)
- Azure SQL Database
- Azure DevOps
VM Deployment:
# Terraform AzureRM provider
provider "azurerm" {
features {}
}
resource "azurerm_resource_group" "main" {
name = "myapp-${var.environment}-rg"
location = var.location
}
resource "azurerm_virtual_network" "main" {
name = "myapp-${var.environment}-vnet"
address_space = ["10.0.0.0/16"]
location = azurerm_resource_group.main.location
resource_group_name = azurerm_resource_group.main.name
}
resource "azurerm_subnet" "internal" {
name = "internal"
resource_group_name = azurerm_resource_group.main.name
virtual_network_name = azurerm_virtual_network.main.name
address_prefixes = ["10.0.2.0/24"]
}
resource "azurerm_public_ip" "vm" {
name = "vm-public-ip"
resource_group_name = azurerm_resource_group.main.name
location = azurerm_resource_group.main.location
allocation_method = "Dynamic"
}
resource "azurerm_network_interface" "main" {
name = "vm-nic"
resource_group_name = azurerm_resource_group.main.name
location = azurerm_resource_group.main.location
ip_configuration {
name = "internal"
subnet_id = azurerm_subnet.internal.id
private_ip_address_allocation = "Dynamic"
public_ip_address_id = azurerm_public_ip.vm.id
}
}
resource "azurerm_linux_virtual_machine" "main" {
name = "vm-${var.environment}"
resource_group_name = azurerm_resource_group.main.name
location = azurerm_resource_group.main.location
size = "Standard_B2s"
admin_username = "azureuser"
network_interface_ids = [
azurerm_network_interface.main.id,
]
admin_ssh_key {
username = "azureuser"
public_key = file("~/.ssh/id_rsa.pub")
}
source_image_reference {
publisher = "Canonical"
offer = "0001-com-ubuntu-server-focal"
sku = "20_04-lts"
version = "latest"
}
os_disk {
caching = "ReadWrite"
storage_account_type = "Standard_LRS"
}
tags = {
environment = var.environment
}
}
AKS Cluster:
resource "azurerm_kubernetes_cluster" "main" {
name = "aks-${var.environment}"
location = azurerm_resource_group.main.location
resource_group_name = azurerm_resource_group.main.name
dns_prefix = "myapp-${var.environment}"
default_node_pool {
name = "default"
node_count = 3
vm_size = "Standard_DS2_v2"
enable_auto_scaling = true
min_count = 1
max_count = 5
}
identity {
type = "SystemAssigned"
}
network_profile {
network_plugin = "azure"
network_policy = "calico"
}
role_based_access_control_enabled = true
azure_active_directory_role_based_access_control {
managed = true
azure_rbac_enabled = true
}
tags = {
Environment = var.environment
}
}
# Get credentials
resource "local_file" "kubeconfig" {
content = azurerm_kubernetes_cluster.main.kube_config_raw
filename = "./kubeconfig_aks_${var.environment}"
}AKS with Availability Zones:
resource "azurerm_kubernetes_cluster" "main" {
# ... existing configuration ...
default_node_pool {
name = "default"
node_count = 3
vm_size = "Standard_DS2_v2"
availability_zones = ["1", "2", "3"]
enable_node_public_ip = false
upgrade_settings {
max_surge = "33%"
}
}
# Enable cluster autoscaler
auto_scaler_profile {
balance_similar_node_groups = true
max_graceful_termination_sec = 600
}
}
Service Connection:
# azure-pipelines.yml
trigger:
- main
pool:
vmImage: ubuntu-latest
variables:
azureSubscription: 'my-azure-connection'
resourceGroup: 'myapp-prod-rg'
aksCluster: 'myapp-prod-aks'
stages:
- stage: Build
jobs:
- job: Build
steps:
- task: Docker@2
inputs:
containerRegistry: 'my-acr'
repository: 'myapp'
command: 'buildAndPush'
Dockerfile: '**/Dockerfile'
tags: '$(Build.BuildId)'
- stage: Deploy
jobs:
- deployment: Deploy
environment: 'production'
strategy:
runOnce:
deploy:
steps:
- task: KubernetesManifest@0
inputs:
action: 'deploy'
kubernetesServiceConnection: 'my-aks-connection'
namespace: 'default'
manifests: 'manifests/deployment.yaml'
containers: 'myacr.azurecr.io/myapp:$(Build.BuildId)'
Virtual Network with Service Endpoints:
resource "azurerm_virtual_network" "main" {
name = "vnet-${var.environment}"
location = azurerm_resource_group.main.location
resource_group_name = azurerm_resource_group.main.name
address_space = ["10.0.0.0/16"]
}
# Subnet with service endpoints
resource "azurerm_subnet" "private" {
name = "private"
resource_group_name = azurerm_resource_group.main.name
virtual_network_name = azurerm_virtual_network.main.name
address_prefixes = ["10.0.1.0/24"]
service_endpoints = [
"Microsoft.Sql",
"Microsoft.Storage"
]
}
# Private endpoint for storage
resource "azurerm_private_endpoint" "storage" {
name = "pe-storage-${var.environment}"
location = azurerm_resource_group.main.location
resource_group_name = azurerm_resource_group.main.name
subnet_id = azurerm_subnet.private.id
private_service_connection {
name = "storage-connection"
private_connection_resource_id = azurerm_storage_account.main.id
is_manual_connection = false
subresource_names = ["blob"]
}
}
Network Security Group:
resource "azurerm_network_security_group" "web" {
name = "nsg-web"
location = azurerm_resource_group.main.location
resource_group_name = azurerm_resource_group.main.name
security_rule {
name = "HTTP"
priority = 100
direction = "Inbound"
access = "Allow"
protocol = "Tcp"
source_port_range = "*"
destination_port_range = "80"
source_address_prefix = "*"
destination_address_prefix = "*"
}
security_rule {
name = "HTTPS"
priority = 110
direction = "Inbound"
access = "Allow"
protocol = "Tcp"
source_port_range = "*"
destination_port_range = "443"
source_address_prefix = "*"
destination_address_prefix = "*"
}
security_rule {
name = "SSH"
priority = 120
direction = "Inbound"
access = "Allow"
protocol = "Tcp"
source_port_range = "*"
destination_port_range = "22"
source_address_prefix = "10.0.0.0/8"
destination_address_prefix = "*"
}
}
GCP excels in data analytics, machine learning, and containers.
Global Infrastructure:
- 30+ regions
- 100+ edge locations
- Global fiber network
Key Services:
- Compute Engine (VMs)
- Google Kubernetes Engine (GKE)
- BigQuery (analytics)
- Cloud Run (serverless containers)
- Cloud Functions
VM Instance:
# Terraform GCP provider
provider "google" {
project = var.project_id
region = var.region
}
resource "google_compute_network" "vpc" {
name = "vpc-${var.environment}"
auto_create_subnetworks = false
}
resource "google_compute_subnetwork" "subnet" {
name = "subnet-${var.environment}"
ip_cidr_range = "10.0.1.0/24"
region = var.region
network = google_compute_network.vpc.id
private_ip_google_access = true
}
resource "google_compute_firewall" "ssh" {
name = "allow-ssh"
network = google_compute_network.vpc.name
allow {
protocol = "tcp"
ports = ["22"]
}
source_ranges = ["0.0.0.0/0"]
target_tags = ["ssh"]
}
resource "google_compute_address" "static" {
name = "vm-address-${var.environment}"
}
resource "google_compute_instance" "default" {
name = "vm-${var.environment}"
machine_type = "e2-medium"
zone = var.zone
tags = ["ssh", "http"]
boot_disk {
initialize_params {
image = "ubuntu-os-cloud/ubuntu-2004-lts"
size = 50
type = "pd-ssd"
}
}
network_interface {
network = google_compute_network.vpc.name
subnetwork = google_compute_subnetwork.subnet.name
access_config {
nat_ip = google_compute_address.static.address
}
}
metadata_startup_script = <<-EOF
#!/bin/bash
apt-get update
apt-get install -y nginx
systemctl start nginx
EOF
service_account {
scopes = ["cloud-platform"]
}
}
GKE Cluster:
resource "google_container_cluster" "primary" {
name = "gke-${var.environment}"
location = var.region
remove_default_node_pool = true
initial_node_count = 1
network = google_compute_network.vpc.name
subnetwork = google_compute_subnetwork.subnet.name
# Enable Shielded Nodes
enable_shielded_nodes = true
# Release channel (RAPID, REGULAR, STABLE)
release_channel {
channel = "REGULAR"
}
# Private cluster
private_cluster_config {
enable_private_nodes = true
enable_private_endpoint = false
master_ipv4_cidr_block = "172.16.0.0/28"
}
# Network policy
network_policy {
enabled = true
}
# Workload identity
workload_identity_config {
workload_pool = "${var.project_id}.svc.id.goog"
}
maintenance_policy {
recurring_window {
start_time = "2023-01-01T04:00:00Z"
end_time = "2023-01-01T06:00:00Z"
recurrence = "FREQ=WEEKLY;BYDAY=SA,SU"
}
}
}
resource "google_container_node_pool" "primary_nodes" {
name = "primary-pool"
location = var.region
cluster = google_container_cluster.primary.name
node_count = 3
node_config {
machine_type = "e2-standard-4"
service_account = google_service_account.gke.email
oauth_scopes = [
"https://www.googleapis.com/auth/cloud-platform"
]
metadata = {
disable-legacy-endpoints = "true"
}
labels = {
environment = var.environment
}
tags = ["gke-node", var.environment]
shielded_instance_config {
enable_secure_boot = true
enable_integrity_monitoring = true
}
workload_metadata_config {
mode = "GKE_METADATA"
}
}
autoscaling {
min_node_count = 1
max_node_count = 10
}
management {
auto_repair = true
auto_upgrade = true
}
}
Service Account:
# Service account
resource "google_service_account" "gke" {
account_id = "gke-sa-${var.environment}"
display_name = "GKE Service Account"
}
# IAM binding
resource "google_project_iam_member" "gke_logging" {
project = var.project_id
role = "roles/logging.logWriter"
member = "serviceAccount:${google_service_account.gke.email}"
}
resource "google_project_iam_member" "gke_monitoring" {
project = var.project_id
role = "roles/monitoring.metricWriter"
member = "serviceAccount:${google_service_account.gke.email}"
}
resource "google_project_iam_member" "gke_metadata" {
project = var.project_id
role = "roles/stackdriver.resourceMetadata.writer"
member = "serviceAccount:${google_service_account.gke.email}"
}
Custom Role:
resource "google_project_iam_custom_role" "myrole" {
role_id = "customRole_${var.environment}"
title = "Custom Role"
description = "Custom role for myapp"
permissions = [
"storage.buckets.get",
"storage.objects.get",
"storage.objects.list",
]
}
resource "google_project_iam_member" "custom" {
project = var.project_id
role = google_project_iam_custom_role.myrole.id
member = "serviceAccount:${google_service_account.app.email}"
}
BigQuery: Data warehouse for analytics.
Dataset and Table:
resource "google_bigquery_dataset" "dataset" {
dataset_id = "myapp_${replace(var.environment, "-", "_")}"
friendly_name = "MyApp Dataset"
description = "Dataset for MyApp analytics"
location = var.region
default_table_expiration_ms = 2592000000 # 30 days
labels = {
environment = var.environment
}
}
resource "google_bigquery_table" "events" {
dataset_id = google_bigquery_dataset.dataset.dataset_id
table_id = "events"
time_partitioning {
type = "DAY"
}
clustering = ["event_type", "user_id"]
schema = jsonencode([
{
name = "event_id"
type = "STRING"
mode = "REQUIRED"
},
{
name = "event_type"
type = "STRING"
mode = "REQUIRED"
},
{
name = "user_id"
type = "STRING"
mode = "REQUIRED"
},
{
name = "timestamp"
type = "TIMESTAMP"
mode = "REQUIRED"
},
{
name = "properties"
type = "JSON"
mode = "NULLABLE"
}
])
}
# Authorized view
resource "google_bigquery_table" "daily_events" {
dataset_id = google_bigquery_dataset.dataset.dataset_id
table_id = "daily_events"
view {
query = <<EOF
SELECT
DATE(timestamp) as event_date,
event_type,
COUNT(*) as count
FROM `${var.project_id}.${google_bigquery_dataset.dataset.dataset_id}.events`
GROUP BY event_date, event_type
EOF
use_legacy_sql = false
}
}
BigQuery Query Example:
-- Top users by event count
SELECT
user_id,
COUNT(*) as event_count
FROM `myproject.myapp_prod.events`
WHERE timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
GROUP BY user_id
ORDER BY event_count DESC
LIMIT 10;
-- Real-time dashboard query
SELECT
event_type,
COUNT(*) as events,
COUNT(DISTINCT user_id) as unique_users
FROM `myproject.myapp_prod.events`
WHERE timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
GROUP BY event_type;
What to Monitor:
- Infrastructure: CPU, memory, disk, network
- Application: Request rate, errors, latency
- Business: Active users, revenue, conversions
- Security: Auth failures, suspicious patterns
The Four Golden Signals (Google):
- Latency: Time to serve requests
- Traffic: How much demand
- Errors: Rate of failed requests
- Saturation: How "full" the system is
RED Method (for services):
- Rate: Requests per second
- Errors: Failed requests per second
- Duration: Distribution of request latencies
USE Method (for resources):
- Utilization: Average time resource busy
- Saturation: Extra work resource can't handle
- Errors: Error counts
Metrics:
- Numerical measurements over time
- Small data footprint
- Aggregatable
- Best for: Alerting, dashboards, trends
Examples: CPU usage, request latency p99, error rate
Logs:
- Detailed event records
- Text or structured data
- Large volume
- Best for: Debugging, audit trails, detailed analysis
Examples: Error stack traces, access logs, audit events
Traces:
- End-to-end request paths
- Span context
- Show service dependencies
- Best for: Performance analysis, distributed debugging
Examples:
- Frontend → API → Auth → Database
- Service call hierarchies
The Three Pillars of Observability:
Observability
├── Metrics (what's happening)
├── Logs (why it's happening)
└── Traces (where it's happening)
Prometheus is the leading open-source monitoring system.
Architecture:
Service → Exporter → Prometheus Server → Alertmanager
                          ↑         ↓
                 Service Discovery   Grafana
Prometheus Configuration:
# prometheus.yml
global:
scrape_interval: 15s
evaluation_interval: 15s
alerting:
alertmanagers:
- static_configs:
- targets: ['alertmanager:9093']
rule_files:
- 'alerts.yml'
scrape_configs:
- job_name: 'node'
static_configs:
- targets: ['localhost:9100']
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
Exporters:
- node_exporter: System metrics
- blackbox_exporter: HTTP/HTTPS probing
- mysqld_exporter: MySQL metrics
- postgres_exporter: PostgreSQL metrics
- nginx_exporter: Nginx metrics
PromQL (Prometheus Query Language):
# CPU usage
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Request rate
rate(http_requests_total[5m])
# Error ratio
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
# 95th percentile latency
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
# Memory usage
container_memory_usage_bytes{container!=""} / container_spec_memory_limit_bytes
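PromQL's `rate()` computes a per-second average increase of a counter over the window; a simplified sketch of the idea (ignoring counter resets and the extrapolation real Prometheus performs):

```python
def simple_rate(samples):
    """samples: list of (timestamp_seconds, counter_value), oldest first.
    Returns the average per-second increase over the window."""
    (t0, v0), (tn, vn) = samples[0], samples[-1]
    return (vn - v0) / (tn - t0)

# A counter that went from 100 to 400 over 300 seconds → 1 request/s
r = simple_rate([(0, 100), (150, 250), (300, 400)])
print(r)  # 1.0
```

This is why `rate()` always takes a counter and a range selector like `[5m]`: it needs at least two samples inside the window.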
Grafana provides visualization and dashboards.
Dashboard Example:
{
"title": "Web Service Dashboard",
"panels": [
{
"title": "Request Rate",
"type": "graph",
"targets": [
{
"expr": "rate(http_requests_total[1m])",
"legendFormat": "{{service}}"
}
]
},
{
"title": "Error Rate",
"type": "graph",
"targets": [
{
"expr": "rate(http_requests_total{status=~'5..'}[1m])",
"legendFormat": "{{service}}"
}
]
},
{
"title": "Latency (p99)",
"type": "heatmap",
"targets": [
{
"expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))"
}
]
}
]
}
Grafana Datasources:
- Prometheus
- Elasticsearch
- InfluxDB
- Graphite
- CloudWatch
- Azure Monitor
- Google Cloud Monitoring
The ELK stack: Elasticsearch, Logstash, and Kibana for centralized logging.
Architecture:
Logs → Filebeat → Logstash → Elasticsearch → Kibana
                     ↑
                (Processing)
Filebeat Configuration:
# filebeat.yml
filebeat.inputs:
- type: container
paths:
- /var/log/containers/*.log
processors:
- add_kubernetes_metadata:
host: ${NODE_NAME}
matchers:
- logs_path:
logs_path: "/var/log/containers/"
output.logstash:
hosts: ["logstash:5044"]
Logstash Configuration:
input {
beats {
port => 5044
}
}
filter {
grok {
match => { "message" => "%{COMBINEDAPACHELOG}" }
}
date {
match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
}
geoip {
source => "clientip"
}
}
output {
elasticsearch {
hosts => ["elasticsearch:9200"]
index => "logs-%{+YYYY.MM.dd}"
}
}
Kibana Queries:
# Find errors
log_level: ERROR
# Find specific request
request_id: "abc123"
# Time range and filter
@timestamp >= "now-1h" AND kubernetes.namespace: production
# Pattern matching
message: "Failed to connect to *"
Alert Design Principles:
- Actionable: Alerts should require action
- Urgent: Alert on imminent problems
- Real: Avoid false positives
- Understandable: Clear what's wrong
- Documented: Runbooks for alerts
Alert Severity Levels:
- P0/Critical: Service down, immediate response
- P1/High: Severe degradation, respond within hour
- P2/Medium: Minor issues, respond within day
- P3/Low: Informational, no response needed
Alert Rules (Prometheus):
# alerts.yml
groups:
- name: instance_alerts
rules:
- alert: InstanceDown
expr: up == 0
for: 5m
labels:
severity: critical
annotations:
summary: "Instance {{ $labels.instance }} down"
description: "{{ $labels.instance }} has been down for more than 5 minutes."
- alert: HighCPUUsage
expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
for: 10m
labels:
severity: warning
annotations:
summary: "High CPU usage on {{ $labels.instance }}"
description: "CPU usage is {{ $value }}% for 10 minutes."
- name: service_alerts
rules:
- alert: HighErrorRate
expr: |
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m])) > 0.05
for: 5m
labels:
severity: critical
annotations:
summary: "High error rate for {{ $labels.service }}"
description: "Error rate is {{ $value | humanizePercentage }}"
Alertmanager Configuration:
# alertmanager.yml
global:
slack_api_url: 'https://hooks.slack.com/services/...'
route:
group_by: ['alertname', 'cluster']
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
receiver: 'team-alerts'
routes:
- match:
severity: critical
receiver: 'pagerduty-critical'
continue: true
- match:
severity: warning
receiver: 'slack-warnings'
receivers:
- name: 'team-alerts'
slack_configs:
- channel: '#alerts'
title: '{{ template "slack.title" . }}'
text: '{{ template "slack.text" . }}'
- name: 'pagerduty-critical'
pagerduty_configs:
- service_key: '...'
- name: 'slack-warnings'
slack_configs:
- channel: '#warnings'
title: '{{ template "slack.title" . }}'
text: '{{ template "slack.text" . }}'
Incident Management Process:
- Detection: Alert triggers or user reports
- Triage: Assess severity and impact
- Response: Assign incident commander
- Mitigation: Stop the bleeding
- Resolution: Fix root cause
- Post-mortem: Learn and prevent
Incident Severity Matrix:
| Severity | Impact | Response | Examples |
|---|---|---|---|
| SEV1 | Critical outage | Immediate, all hands | Site down, data loss |
| SEV2 | Major degradation | < 1 hour response | Feature broken, slow |
| SEV3 | Minor issue | < 1 day response | UI glitch, non-critical |
| SEV4 | Informational | Next release | Cosmetic issues |
Incident Commander Responsibilities:
- Coordinate response
- Communicate status
- Make decisions
- Delegate tasks
- Track timeline
Communication Templates:
Initial Alert:
INCIDENT: {{title}}
SEVERITY: {{severity}}
TIME: {{timestamp}}
IMPACT: {{impact}}
LEAD: {{commander}}
CHANNEL: {{slack_channel}}
Status Update:
STATUS UPDATE ({{time}})
Current: {{what's happening}}
Action: {{what's being done}}
Next: {{next check-in}}
Resolution:
RESOLVED: {{title}}
TIME: {{timestamp}}
DURATION: {{duration}}
ACTION: {{mitigation}}
ROOT CAUSE: {{cause}}
POST-MORTEM: {{link}}
SRE applies software engineering to operations.
Core Principles (Google):
- Operations is a software problem: Automate away toil
- Manage by service level objectives: SLOs drive decisions
- Work to minimize toil: Spend 50% time on development
- Monotonically decreasing toil: Always reducing
- Error budgets: Balance reliability and velocity
- Monitoring should be minimal: Alert on symptoms, not causes
SRE vs Traditional Ops:
| Aspect | Traditional Ops | SRE |
|---|---|---|
| Focus | Keep systems running | Build systems that run themselves |
| Change | Minimize change | Embrace change with safety |
| Measurement | Uptime | Error budgets |
| Work | Manual operations | Automation development |
| Incidents | Fix and forget | Post-mortems and prevention |
Service Level Indicators (SLIs):
Metrics that measure service performance:
- Availability: % of successful requests
- Latency: Time to respond (e.g., p99 < 100ms)
- Throughput: Requests per second
- Durability: Data persistence rate
- Correctness: % of accurate responses
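An availability SLI like the one listed above is just a ratio of good events to total events; a sketch (the request counts are illustrative):

```python
def availability_sli(good_requests, total_requests):
    """Availability SLI: fraction of successful requests, as a percentage."""
    return 100.0 * good_requests / total_requests

# 999,500 successful requests out of 1,000,000 in the window
sli = availability_sli(999_500, 1_000_000)
print(f"{sli:.2f}%")          # 99.95%
meets_slo = sli >= 99.9       # compare the measured SLI to a 99.9% SLO target
print(meets_slo)              # True
```

The SLO then becomes a simple comparison against this measured value over the rolling window.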
Service Level Objectives (SLOs):
Target values for SLIs:
"99.9% of requests complete in < 200ms over rolling 30 days"
Characteristics:
- Specific and measurable
- Time-bound
- Achievable
- Business-aligned
Service Level Agreements (SLAs):
Contracts with consequences for missing SLOs:
- Financial penalties
- Service credits
- Legal implications
SLO Examples:
apiVersion: v1
kind: ServiceLevelObjective
metadata:
name: api-availability
spec:
service: user-api
indicator:
type: availability
ratio:
good:
filter: "job='api' and status_code=200"
count: successful_requests
total:
filter: "job='api'"
count: total_requests
target: 99.9%
window: 30d
---
apiVersion: v1
kind: ServiceLevelObjective
metadata:
name: api-latency
spec:
service: user-api
indicator:
type: latency
latency:
threshold: 200ms
filter: "job='api'"
target: 99%
window: 7d
Error budgets = 100% - SLO target
Example: 99.9% SLO → 0.1% error budget
Error Budget Calculation:
Error Budget = (1 - SLO) × Total Time
For 30 days (2,592,000 seconds) with 99.9% SLO:
Error Budget = 0.001 × 2,592,000 = 2,592 seconds = 43.2 minutes
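The calculation above generalizes to any SLO target; a small sketch:

```python
def error_budget_seconds(slo_percent, window_days=30):
    """Allowed 'bad' time in seconds for an SLO over a rolling window."""
    window_seconds = window_days * 24 * 3600
    return (1 - slo_percent / 100) * window_seconds

for slo in (99.0, 99.9, 99.99):
    minutes = error_budget_seconds(slo) / 60
    print(f"{slo}% SLO -> {minutes:.1f} minutes of budget per 30 days")
```

Note how each extra "nine" divides the budget by ten: 99.9% allows 43.2 minutes per month, 99.99% only about 4.3 minutes.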
Error Budget Policy:
- While budget remains: Release velocity prioritized
- When budget exhausted: Freeze releases, focus on reliability
Benefits:
- Aligns Dev and Ops goals
- Data-driven release decisions
- Balances risk and innovation
What is Toil?
Manual, repetitive, automatable work with no enduring value.
Examples of Toil:
- Manual deployments
- Password resets
- Restarting services
- Answering repetitive questions
- Manual data fixes
Toil Characteristics:
- Manual: Requires human action
- Repetitive: Done frequently
- Automatable: Could be done by machine
- Tactical: No lasting value
- Scales linearly: More work = more people
Toil Reduction Strategies:
- Measure toil: Track time spent
- Set goals: Target < 50% time on toil
- Automate everything: Scripts, tools, platforms
- Build self-service: Empower developers
- Improve reliability: Reduce firefighting
Toil Budget:
Time Allocation:
├── 50% max toil (operational)
└── 50% min engineering (development)
├── Automation
├── Tooling
└── Architecture improvements
Definition: "Disciplined approach to identifying failures before they become outages" (Principles of Chaos)
Principles (from Principles of Chaos):
- Build a hypothesis around steady state
- Vary real-world events
- Run experiments in production
- Automate experiments to run continuously
- Minimize blast radius
Chaos Engineering Tools:
- Chaos Monkey: Random instance termination
- Gremlin: Chaos engineering platform
- Litmus: Kubernetes chaos
- Chaos Mesh: Kubernetes chaos platform
- AWS Fault Injection Simulator
Chaos Experiment Example (Chaos Mesh):
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
name: pod-kill-example
spec:
action: pod-kill
mode: one
selector:
namespaces:
- production
labelSelectors:
app: web-server
duration: "60s"
Experiment Design:
- Define steady state: Normal metrics (error rate < 0.1%)
- Hypothesis: System survives losing one pod
- Run experiment: Kill one pod
- Prove/disprove: Did error rate spike?
- Fix or automate: Add redundancy or document
Goals:
- Meet demand without waste
- Anticipate scaling needs
- Optimize costs
Capacity Planning Process:
- Measure current usage: Trends, peaks
- Forecast demand: Business growth, seasonality
- Model scenarios: What-if analysis
- Plan capacity: When to add resources
- Procure/scale: Execute plan
Key Metrics:
- Peak utilization: Max observed
- Headroom: Buffer for spikes
- Growth rate: % increase over time
- Lead time: How long to add capacity
Prediction Methods:
Trend Analysis:
Future Capacity = Current Usage × (1 + Growth Rate)^Time
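The trend formula can be applied directly; for example, forecasting 100 cores of current usage at 5% monthly growth (numbers are illustrative):

```python
def forecast_capacity(current, growth_rate, periods):
    """Future capacity need: current × (1 + growth_rate)^periods."""
    return current * (1 + growth_rate) ** periods

# 100 cores today, 5% growth per month, 12 months out
needed = forecast_capacity(100, 0.05, 12)
print(round(needed, 1))  # 179.6
```

So steady 5% monthly growth nearly doubles the required capacity within a year, which is why lead time for adding capacity matters.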
Seasonal Patterns:
- Daily patterns
- Weekly patterns
- Holiday spikes
- Marketing campaigns
Tools:
- Prometheus: Historical metrics
- Grafana: Visualization
- Forecast libraries: Prophet, statsmodels
- Cloud auto-scaling: Dynamic capacity
Identify and prioritize security threats.
Threat Modeling Process (STRIDE):
- Spoofing: Impersonating something/someone
- Tampering: Modifying data/code
- Repudiation: Denying actions
- Information Disclosure: Exposing data
- Denial of Service: Disrupting service
- Elevation of Privilege: Gaining unauthorized access
Common Frameworks:
PASTA (Process for Attack Simulation and Threat Analysis):
- Define objectives
- Define technical scope
- Decompose application
- Threat analysis
- Vulnerability analysis
- Attack modeling
- Risk analysis
Threat Modeling Example:
System: User Authentication Service
Assets:
- User credentials
- Session tokens
- Personal data
Trust Boundaries:
- Browser ↔ API
- API ↔ Database
Threats:
1. SQL Injection (Tampering)
Mitigation: Parameterized queries, input validation
2. Session Hijacking (Spoofing)
Mitigation: HTTPS, secure cookies, short expiration
3. Brute Force (DoS)
Mitigation: Rate limiting, account lockout
4. Password Leak (Info Disclosure)
Mitigation: Hashing, encryption, secure storage
Protect against compromised dependencies and tools.
Supply Chain Attacks:
- Dependency confusion: Malicious packages with same name
- Typosquatting: Similar package names
- Compromised maintainers: Attacked developer accounts
- Build pipeline: Inject malware during build
Mitigation Strategies:
- Lock dependencies: Use lock files (package-lock.json)
- Verify integrity: Checksums, signatures
- Private registry: Curated packages
- Continuous scanning: Detect vulnerabilities
- Least privilege: Limit CI/CD permissions
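The "verify integrity" step can be as simple as comparing a SHA-256 digest of the downloaded artifact against a value pinned in a lock file; a sketch (the artifact contents here are simulated with a temp file):

```python
import hashlib
import tempfile

def verify_checksum(path, expected_sha256):
    """Return True if the file's SHA-256 digest matches the pinned value."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest() == expected_sha256

# Simulate a downloaded package and a digest pinned at lock time
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"package contents")
pinned = hashlib.sha256(b"package contents").hexdigest()
ok = verify_checksum(tmp.name, pinned)
print(ok)  # True
```

Lock files such as package-lock.json embed exactly these digests, which is what makes them a supply chain defense and not just a version pin.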
Software Bill of Materials (SBOM):
{
"bomFormat": "CycloneDX",
"specVersion": "1.4",
"version": 1,
"components": [
{
"type": "library",
"name": "lodash",
"version": "4.17.21",
"purl": "pkg:npm/lodash@4.17.21",
"licenses": ["MIT"]
}
]
}
What is SBOM?
A formal, machine-readable inventory of software components and dependencies.
SBOM Formats:
- SPDX: Linux Foundation
- CycloneDX: OWASP
- SWID: ISO standard
Why SBOM Matters:
- Know what's in your software
- Rapid vulnerability response
- License compliance
- Supply chain transparency
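Because an SBOM is machine-readable, answering "do we ship package X, and which version?" takes a few lines of code; a sketch against the CycloneDX JSON shape shown above (the embedded document is illustrative):

```python
import json

sbom_json = """
{
  "bomFormat": "CycloneDX",
  "specVersion": "1.4",
  "components": [
    {"type": "library", "name": "lodash", "version": "4.17.21"},
    {"type": "library", "name": "express", "version": "4.18.2"}
  ]
}
"""

def find_component(sbom, name):
    """Return the versions of a named component present in the SBOM."""
    return [c["version"] for c in sbom.get("components", []) if c["name"] == name]

sbom = json.loads(sbom_json)
print(find_component(sbom, "lodash"))  # ['4.17.21']
```

This is exactly the query security teams run across every SBOM in the fleet when a CVE lands.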
Generating SBOM:
# Using syft
syft myapp:latest -o cyclonedx > sbom.json
# Using trivy
trivy image --format cyclonedx myapp:latest > sbom.json
# Using cdxgen
cdxgen -o bom.xml
Never store secrets in code.
Secret Types:
- API keys
- Database passwords
- TLS certificates
- SSH keys
- OAuth tokens
Secret Management Solutions:
HashiCorp Vault:
# Vault policy
path "secret/data/myapp/*" {
capabilities = ["read"]
}
# Store secret
vault kv put secret/myapp/api key=12345
# Read secret
vault kv get secret/myapp/api
# Dynamic database credentials
vault read database/creds/myapp
Cloud Secret Managers:
- AWS Secrets Manager:
aws secretsmanager create-secret --name myapp/api --secret-string '{"key":"12345"}'
- Azure Key Vault:
az keyvault secret set --vault-name myvault --name api-key --value 12345
- Google Secret Manager:
echo -n "12345" | gcloud secrets create api-key --data-file=-
Kubernetes Secrets:
apiVersion: v1
kind: Secret
metadata:
name: db-secret
type: Opaque
data:
username: YWRtaW4= # base64 encoded
password: MWYyZDFlMmU2N2Rm # base64 encoded
---
apiVersion: v1
kind: Pod
spec:
containers:
- name: app
env:
- name: DB_USERNAME
valueFrom:
secretKeyRef:
name: db-secret
key: username
- name: DB_PASSWORD
valueFrom:
secretKeyRef:
name: db-secret
key: password
Tools for Secret Detection:
# GitHub Actions secret scanning
name: Secret Scanning
on: [push]
jobs:
scan:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- name: TruffleHog
uses: trufflesecurity/trufflehog@main
with:
path: ./
base: ${{ github.event.repository.default_branch }}
Pipeline Security Checklist:
- Use OIDC instead of long-lived credentials
- Scan dependencies for vulnerabilities
- Scan container images
- Run SAST on code
- Run DAST on deployments
- Sign and verify artifacts
- Immutable build environments
- Least privilege for CI jobs
- Audit all pipeline changes
- Secrets never in logs
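Scanners like TruffleHog combine known key formats with entropy checks; a toy sketch of the pattern-matching half (these two patterns are simplified assumptions, real scanners ship hundreds):

```python
import re

SECRET_PATTERNS = {
    "aws_access_key": re.compile(r"AKIA[0-9A-Z]{16}"),
    "generic_password": re.compile(r"password\s*=\s*['\"][^'\"]+['\"]", re.I),
}

def scan_text(text):
    """Return a list of (rule, matched_text) findings for the given text."""
    findings = []
    for rule, pattern in SECRET_PATTERNS.items():
        for m in pattern.finditer(text):
            findings.append((rule, m.group()))
    return findings

code = 'db_password = "hunter2"\nkey = "AKIAIOSFODNN7EXAMPLE"'
hits = scan_text(code)
print([rule for rule, _ in hits])  # ['aws_access_key', 'generic_password']
```

Running such a scan pre-commit is far cheaper than rotating a credential after it lands in Git history.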
Secure Pipeline Example:
name: Secure CI/CD
on: [push]
jobs:
security-scans:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- name: Scan code for secrets
uses: trufflesecurity/trufflehog@main
- name: Run SAST
uses: github/codeql-action/init@v1
with:
languages: javascript
- name: Scan dependencies
run: |
npm audit --audit-level=high
npm outdated
- name: Build image
run: docker build -t myapp:${{ github.sha }} .
- name: Scan image
uses: aquasecurity/trivy-action@master
with:
image-ref: 'myapp:${{ github.sha }}'
severity: 'CRITICAL,HIGH'
- name: Sign image
run: |
cosign sign --key k8s://my-namespace/cosign myapp:${{ github.sha }}
- name: Deploy (if scans pass)
if: success()
run: ./deploy.sh
Analyze source code for vulnerabilities.
Common SAST Tools:
- SonarQube: Multi-language, quality and security
- Checkmarx: Enterprise SAST
- Fortify: Micro Focus
- Semgrep: Fast, customizable
- CodeQL: GitHub's analysis engine
- ESLint (security plugins): JavaScript
Semgrep Example:
# semgrep.yml
rules:
- id: no-hardcoded-secrets
patterns:
- pattern: password = "..."
- pattern-not: password = os.getenv("...")
message: "Hardcoded password detected"
languages: [python]
severity: ERROR
- id: sql-injection
patterns:
- pattern: |
cursor.execute("SELECT ... WHERE ... = " + $VAR)
message: "Possible SQL injection"
languages: [python]
severity: WARNING
CI Integration:
- name: Run Semgrep
uses: returntocorp/semgrep-action@v1
with:
config: >-
p/security-audit
p/secrets
Test running applications for vulnerabilities.
Common DAST Tools:
- OWASP ZAP: Free, powerful
- Burp Suite: Professional penetration testing
- Acunetix: Commercial scanner
- Nessus: Vulnerability scanner
- Qualys: Cloud-based scanning
OWASP ZAP in CI:
- name: ZAP Scan
uses: zaproxy/action-full-scan@v0.4.0
with:
target: 'https://staging.example.com'
rules_file_name: '.zap/rules.tsv'
cmd_options: '-a'
Types of DAST Tests:
- Vulnerability scanning: SQLi, XSS, CSRF
- Fuzzing: Unexpected inputs
- Authentication testing: Login bypass
- Session management: Token handling
- Input validation: Boundary testing
Scan container images for vulnerabilities.
Container Scanning Tools:
- Trivy: Comprehensive, fast
- Clair: CoreOS scanner
- Anchore: Deep inspection
- Docker Scout: Docker native
- Grype: From Anchore
- Snyk Container: Developer friendly
Trivy Example:
# Scan image
trivy image myapp:latest
# Scan with severity filter
trivy image --severity CRITICAL,HIGH myapp:latest
# Ignore unfixed
trivy image --ignore-unfixed myapp:latest
# Output formats
trivy image --format sarif myapp:latest > results.sarif
# Scan filesystem
trivy fs --severity HIGH,CRITICAL .
Kubernetes Admission Control:
apiVersion: v1
kind: ConfigMap
metadata:
name: trivy-admission
data:
policy.rego: |
package trivy
deny[msg] {
input.request.kind.kind == "Pod"
image := input.request.object.spec.containers[_].image
not valid_image(image)
msg := sprintf("Image %v has critical vulnerabilities", [image])
}
valid_image(image) {
# Check with Trivy
# ...
}
Scan project dependencies for known vulnerabilities.
Tools:
- OWASP Dependency Check: Java, .NET, Python
- Snyk: Multi-language, commercial
- npm audit: JavaScript
- Safety: Python
- Gemnasium: GitLab's scanner
- Dependabot: GitHub's automated updates
Snyk Example:
# .snyk
version: v1.25.0
ignore:
SNYK-JS-LODASH-567746:
- '*':
reason: 'No patch available'
expires: '2024-01-01'
patch: {}
CI Integration:
- name: Snyk Scan
uses: snyk/actions/node@master
env:
SNYK_TOKEN: ${{ secrets.SNYK_TOKEN }}
with:
args: --severity-threshold=high
Dependabot Configuration:
# .github/dependabot.yml
version: 2
updates:
- package-ecosystem: "npm"
directory: "/"
schedule:
interval: "daily"
open-pull-requests-limit: 10
ignore:
- dependency-name: "express"
versions: ["5.x"]
labels:
- "dependencies"
- "security"
Enforce security policies across infrastructure.
Open Policy Agent (OPA):
package kubernetes.admission
deny[msg] {
input.request.kind.kind == "Pod"
container := input.request.object.spec.containers[_]
container.securityContext.runAsRoot
msg := "Containers must not run as root"
}
deny[msg] {
input.request.kind.kind == "Deployment"
not input.request.object.spec.template.metadata.labels.owner
msg := "All resources must have owner label"
}
Kyverno (Kubernetes):
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: require-labels
spec:
validationFailureAction: enforce
rules:
- name: check-for-labels
match:
resources:
kinds:
- Pod
validate:
message: "Label 'app' is required"
pattern:
metadata:
labels:
app: "?*"
---
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: disallow-latest-tag
spec:
validationFailureAction: audit
rules:
- name: require-image-tag
match:
resources:
kinds:
- Pod
validate:
message: "Using 'latest' tag is not allowed"
pattern:
spec:
containers:
- image: "!*:latest"
Conftest (Configuration Testing):
package main
deny[msg] {
input.kind == "Deployment"
not input.spec.template.metadata.labels.app
msg = "Deployments must have app label"
}
deny[msg] {
input.kind == "Service"
input.spec.type == "LoadBalancer"
not input.metadata.annotations["service.beta.kubernetes.io/aws-load-balancer-internal"]
msg = "LoadBalancer services must be internal"
}
# Test Kubernetes manifests
conftest test deployment.yaml --policy policy/
Core Principles:
- Declarative: Entire system described declaratively
- Versioned and Immutable: Desired state stored in Git
- Pulled Automatically: Software agents pull changes
- Continuously Reconciled: Correct drift automatically
GitOps Workflow:
Developer → Git Push
↓
Git Repository (source of truth)
↓
GitOps Operator (ArgoCD/Flux)
↓
Kubernetes Cluster
↑
Monitoring (drift detection)
Benefits:
- Audit trail: All changes in Git
- Faster recovery: Recreate cluster from Git
- Standard tools: Use Git workflows
- Security: Pull model reduces credentials
- Observability: Drift detection
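The "continuously reconciled" principle boils down to a diff-and-apply loop run by the operator; a minimal sketch, with plain dicts standing in for manifests and live cluster state:

```python
def reconcile(desired, actual):
    """Compute actions that drive actual state toward desired state."""
    actions = []
    for name, spec in desired.items():
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != spec:
            actions.append(("update", name))   # drift detected: correct it
    for name in actual:
        if name not in desired:
            actions.append(("prune", name))    # resource no longer in Git
    return actions

desired = {"web": {"replicas": 3}, "api": {"replicas": 2}}
actual = {"web": {"replicas": 1}, "old-job": {"replicas": 1}}
actions = reconcile(desired, actual)
print(actions)  # [('update', 'web'), ('create', 'api'), ('prune', 'old-job')]
```

ArgoCD's `selfHeal` and `prune` sync options map directly onto the update and prune branches of this loop.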
Declarative GitOps for Kubernetes.
ArgoCD Architecture:
User (CLI/UI) → ArgoCD API Server
↓
Repository Server
↓
Controller
↓
Kubernetes API
Application Definition:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: myapp
namespace: argocd
spec:
project: default
source:
repoURL: https://github.com/user/repo.git
targetRevision: HEAD
path: k8s
helm:
valueFiles:
- values-production.yaml
destination:
server: https://kubernetes.default.svc
namespace: production
syncPolicy:
automated:
prune: true
selfHeal: true
syncOptions:
- CreateNamespace=true
- PruneLast=true
revisionHistoryLimit: 10
ApplicationSet (Multi-cluster):
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
name: myapp
spec:
generators:
- clusters:
selector:
matchLabels:
environment: production
template:
metadata:
name: '{{name}}-myapp'
spec:
project: default
source:
repoURL: https://github.com/user/repo.git
targetRevision: HEAD
path: k8s
destination:
server: '{{server}}'
namespace: 'myapp-{{name}}'
ArgoCD Commands:
# List apps
argocd app list
# Sync app
argocd app sync myapp
# Get app details
argocd app get myapp
# Rollback
argocd app rollback myapp 1
# Set image (with Kustomize)
argocd app set myapp --kustomize-image myapp:v2
Flux is another GitOps operator, lighter weight than ArgoCD.
Flux Components:
- Source Controller: Manages Git repositories
- Kustomize Controller: Applies Kustomize overlays
- Helm Controller: Manages Helm releases
- Notification Controller: Handles alerts
Flux Configuration:
# GitRepository source
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: GitRepository
metadata:
name: myapp
namespace: flux-system
spec:
interval: 1m
url: https://github.com/user/repo
ref:
branch: main
secretRef:
name: repo-auth
# Kustomization
apiVersion: kustomize.toolkit.fluxcd.io/v1beta2
kind: Kustomization
metadata:
name: myapp
namespace: flux-system
spec:
interval: 10m
path: ./k8s/overlays/production
prune: true
sourceRef:
kind: GitRepository
name: myapp
validation: client
healthChecks:
- apiVersion: apps/v1
kind: Deployment
name: myapp
namespace: production
Flux with Helm:
# HelmRepository
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: HelmRepository
metadata:
name: bitnami
namespace: flux-system
spec:
interval: 1h
url: https://charts.bitnami.com/bitnami
# HelmRelease
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
name: redis
namespace: production
spec:
interval: 5m
chart:
spec:
chart: redis
sourceRef:
kind: HelmRepository
name: bitnami
namespace: flux-system
interval: 1m
values:
architecture: standalone
auth:
enabled: false
What is an IDP?
A layer of tools and services that development teams use to build, deploy, and operate applications without needing to understand the underlying infrastructure.
IDP Components:
Developer Portal (Backstage, Kratix)
↓
Orchestration (Terraform, Crossplane)
↓
GitOps (ArgoCD, Flux)
↓
Kubernetes (EKS, AKS, GKE)
↓
Cloud Providers (AWS, Azure, GCP)
Backstage (Spotify's Developer Portal):
// Component definition
import { Entity } from '@backstage/catalog-model';
export const myComponent: Entity = {
apiVersion: 'backstage.io/v1alpha1',
kind: 'Component',
metadata: {
name: 'my-service',
description: 'My awesome service',
annotations: {
'github.com/project-slug': 'org/my-service',
'backstage.io/techdocs-ref': 'dir:.',
},
tags: ['java', 'web'],
},
spec: {
type: 'service',
lifecycle: 'production',
owner: 'team-a',
system: 'product-catalog',
},
};
Crossplane (Infrastructure as Code Platform):
apiVersion: aws.crossplane.io/v1beta1
kind: ProviderConfig
metadata:
name: aws-provider
spec:
credentials:
source: Secret
secretRef:
namespace: crossplane-system
name: aws-creds
key: creds
---
apiVersion: database.aws.crossplane.io/v1beta1
kind: RDSInstance
metadata:
name: mydb
spec:
forProvider:
region: us-east-1
dbInstanceClass: db.t3.micro
masterUsername: admin
engine: postgres
engineVersion: "13"
allocatedStorage: 20
publiclyAccessible: false
writeConnectionSecretToRef:
name: db-conn
namespace: production
providerConfigRef:
name: aws-provider
Platform Engineering Team Responsibilities:
- Build and maintain IDP
- Define "golden paths" for developers
- Provide self-service capabilities
- Abstract infrastructure complexity
- Ensure security and compliance
- Collect feedback and improve
Golden Path Example:
Developer Workflow:
1. Create repo from template
2. Run `platform create-service`
3. Add code and push
4. PR creates preview environment
5. Merge to main → staging deploy
6. Promote to production via UI
What is Serverless?
- No server management
- Automatic scaling
- Pay per execution
- Event-driven
Benefits:
- Reduced operational overhead
- Auto-scaling to zero
- Cost efficiency for variable workloads
- Faster time to market
Trade-offs:
- Cold starts
- Vendor lock-in
- Execution limits
- Debugging complexity
Lambda Function Example (Node.js):
exports.handler = async (event) => {
console.log('Event:', JSON.stringify(event, null, 2));
try {
const { name } = event.queryStringParameters || {};
const response = {
statusCode: 200,
headers: {
'Content-Type': 'application/json',
},
body: JSON.stringify({
message: `Hello, ${name || 'World'}!`,
timestamp: new Date().toISOString(),
}),
};
return response;
} catch (error) {
console.error('Error:', error);
return {
statusCode: 500,
body: JSON.stringify({ error: 'Internal Server Error' }),
};
}
};
Terraform Lambda Deployment:
# IAM Role
resource "aws_iam_role" "lambda_role" {
name = "lambda_role"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Action = "sts:AssumeRole"
Effect = "Allow"
Principal = {
Service = "lambda.amazonaws.com"
}
}
]
})
}
# Lambda function
resource "aws_lambda_function" "api" {
filename = "function.zip"
function_name = "my-api"
role = aws_iam_role.lambda_role.arn
handler = "index.handler"
runtime = "nodejs18.x"
environment {
variables = {
TABLE_NAME = aws_dynamodb_table.data.name
}
}
tracing_config {
mode = "Active"
}
}
# API Gateway trigger
resource "aws_apigatewayv2_api" "lambda" {
name = "serverless-api"
protocol_type = "HTTP"
cors {
allow_origins = ["*"]
allow_methods = ["GET", "POST"]
}
}
resource "aws_apigatewayv2_integration" "lambda" {
api_id = aws_apigatewayv2_api.lambda.id
integration_uri = aws_lambda_function.api.invoke_arn
integration_type = "AWS_PROXY"
integration_method = "POST"
}
resource "aws_apigatewayv2_route" "get" {
api_id = aws_apigatewayv2_api.lambda.id
route_key = "GET /hello"
target = "integrations/${aws_apigatewayv2_integration.lambda.id}"
}Azure Function (Python):
import azure.functions as func
import logging
import json
from datetime import datetime
def main(req: func.HttpRequest) -> func.HttpResponse:
logging.info('Python HTTP trigger function processed a request.')
name = req.params.get('name')
if not name:
try:
req_body = req.get_json()
except ValueError:
pass
else:
name = req_body.get('name')
if name:
return func.HttpResponse(
json.dumps({
"message": f"Hello, {name}!",
"timestamp": datetime.utcnow().isoformat()
}),
status_code=200,
mimetype="application/json"
)
else:
return func.HttpResponse(
"Please pass a name on the query string or in the request body",
status_code=400
)
Azure Functions Configuration:
{
"IsEncrypted": false,
"Values": {
"AzureWebJobsStorage": "UseDevelopmentStorage=true",
"FUNCTIONS_WORKER_RUNTIME": "python",
"COSMOS_CONNECTION": "AccountEndpoint=...;"
}
}
Serverless containers on GCP.
Cloud Run Service:
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
name: hello-world
spec:
template:
spec:
containers:
- image: gcr.io/myproject/hello:v1
ports:
- containerPort: 8080
resources:
limits:
memory: "256Mi"
cpu: "1"
env:
- name: DATABASE_URL
valueFrom:
secretKeyRef:
name: db-secret
key: url
Deployment with gcloud:
# Build and deploy
gcloud builds submit --tag gcr.io/myproject/hello:v1
gcloud run deploy hello \
--image gcr.io/myproject/hello:v1 \
--platform managed \
--region us-central1 \
--allow-unauthenticated \
--memory 256Mi \
--concurrency 80
Terraform:
resource "google_cloud_run_service" "default" {
name = "hello"
location = "us-central1"
template {
spec {
containers {
image = "gcr.io/myproject/hello:v1"
resources {
limits = {
cpu = "1000m"
memory = "256Mi"
}
}
env {
name = "DATABASE_URL"
value_from {
secret_key_ref {
name = google_secret_manager_secret.db.secret_id
key = "latest"
}
}
}
}
container_concurrency = 80
timeout_seconds = 300
}
}
traffic {
percent = 100
latest_revision = true
}
}
Compute at the network edge, closer to users.
Cloudflare Workers:
// Cloudflare Worker
addEventListener('fetch', event => {
event.respondWith(handleRequest(event.request))
})
async function handleRequest(request) {
const cache = caches.default
let response = await cache.match(request)
if (!response) {
response = await fetch(request)
// Cache responses
if (response.status === 200) {
const cloned = response.clone()
const headers = new Headers(cloned.headers)
headers.set('Cache-Control', 'public, max-age=3600')
const cached = new Response(cloned.body, {
status: cloned.status,
statusText: cloned.statusText,
headers: headers
})
event.waitUntil(cache.put(request, cached))
}
}
return response
}
AWS Lambda@Edge:
'use strict';
// Origin response trigger
exports.handler = (event, context, callback) => {
const response = event.Records[0].cf.response;
const headers = response.headers;
// Add security headers
headers['strict-transport-security'] = [{
key: 'Strict-Transport-Security',
value: 'max-age=63072000; includeSubdomains; preload'
}];
headers['x-content-type-options'] = [{
key: 'X-Content-Type-Options',
value: 'nosniff'
}];
headers['x-frame-options'] = [{
key: 'X-Frame-Options',
value: 'DENY'
}];
headers['x-xss-protection'] = [{
key: 'X-XSS-Protection',
value: '1; mode=block'
}];
callback(null, response);
};
Use Cases:
- CDN caching
- Authentication at edge
- A/B testing
- Geolocation routing
- Bot mitigation
- API aggregation
Distribute traffic across multiple servers.
Load Balancer Types:
- Layer 4 (Transport): TCP/UDP, IP-based
- Layer 7 (Application): HTTP/HTTPS, content-based
Algorithms:
- Round Robin: Simple rotation
- Least Connections: To busiest server
- IP Hash: Sticky sessions
- Weighted: Capacity-based distribution
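The round-robin and weighted algorithms listed above are easy to sketch:

```python
import itertools

def round_robin(servers):
    """Endless round-robin iterator over a list of servers."""
    return itertools.cycle(servers)

def weighted_pool(weights):
    """Expand {server: weight} into a pool: servers appear once per weight
    unit, so cycling over the pool gives weighted round-robin."""
    return [s for s, w in weights.items() for _ in range(w)]

rr = round_robin(["a", "b", "c"])
picks = [next(rr) for _ in range(4)]
print(picks)                                   # ['a', 'b', 'c', 'a']
print(weighted_pool({"big": 3, "small": 1}))   # ['big', 'big', 'big', 'small']
```

Real load balancers add health checks on top: an unhealthy server is simply removed from the rotation until its checks pass again.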
AWS Application Load Balancer:
resource "aws_lb" "main" {
name = "app-lb"
internal = false
load_balancer_type = "application"
security_groups = [aws_security_group.lb.id]
subnets = aws_subnet.public[*].id
enable_deletion_protection = true
access_logs {
bucket = aws_s3_bucket.lb_logs.bucket
prefix = "alb-logs"
enabled = true
}
}
resource "aws_lb_target_group" "app" {
name = "app-targets"
port = 80
protocol = "HTTP"
vpc_id = aws_vpc.main.id
health_check {
enabled = true
healthy_threshold = 2
unhealthy_threshold = 2
timeout = 5
interval = 30
path = "/health"
}
stickiness {
type = "lb_cookie"
cookie_duration = 86400
enabled = true
}
}
resource "aws_lb_listener" "front_end" {
load_balancer_arn = aws_lb.main.arn
port = "443"
protocol = "HTTPS"
ssl_policy = "ELBSecurityPolicy-2016-08"
certificate_arn = aws_acm_certificate.lb.arn
default_action {
type = "forward"
target_group_arn = aws_lb_target_group.app.arn
}
}
Distribute content globally for faster delivery.
CloudFront with S3:
# Origin Access Identity
resource "aws_cloudfront_origin_access_identity" "oai" {
comment = "OAI for S3 bucket"
}
# CloudFront distribution
resource "aws_cloudfront_distribution" "cdn" {
enabled = true
origin {
domain_name = aws_s3_bucket.website.bucket_regional_domain_name
origin_id = "S3-website"
s3_origin_config {
origin_access_identity = aws_cloudfront_origin_access_identity.oai.cloudfront_access_identity_path
}
}
default_cache_behavior {
allowed_methods = ["GET", "HEAD", "OPTIONS"]
cached_methods = ["GET", "HEAD"]
target_origin_id = "S3-website"
forwarded_values {
query_string = false
cookies {
forward = "none"
}
}
viewer_protocol_policy = "redirect-to-https"
min_ttl = 0
default_ttl = 3600
max_ttl = 86400
compress = true
}
price_class = "PriceClass_100"
viewer_certificate {
cloudfront_default_certificate = true
}
restrictions {
geo_restriction {
restriction_type = "none"
}
}
custom_error_response {
error_code = 404
response_code = 200
response_page_path = "/index.html"
error_caching_min_ttl = 300
}
tags = {
Environment = var.environment
}
}
Cache Levels:
- Browser Cache: Local to user
- CDN Cache: Edge locations
- Application Cache: In-memory (Redis, Memcached)
- Database Cache: Query cache
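These levels compose: a lookup falls through from the fastest tier to the slowest store. A minimal sketch, with plain dicts standing in for the in-process and shared (Redis-style) caches:

```python
db_reads = 0

def read_from_db(key):
    """Stand-in for a slow database query."""
    global db_reads
    db_reads += 1
    return f"value-for-{key}"

local_cache = {}   # in-process application cache (fastest)
shared_cache = {}  # stand-in for a shared cache such as Redis

def get(key):
    if key in local_cache:            # local hit: no network at all
        return local_cache[key]
    if key in shared_cache:           # shared hit: promote to local
        local_cache[key] = shared_cache[key]
        return local_cache[key]
    value = read_from_db(key)         # miss everywhere: hit the database
    shared_cache[key] = value
    local_cache[key] = value
    return value

get("user:1")
get("user:1")
print(db_reads)  # 1 -- the second call never reaches the database
```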
Cache Headers:
# Nginx cache configuration
location /static/ {
expires 1y;
add_header Cache-Control "public, immutable";
}
location /api/ {
expires 1m;
add_header Cache-Control "private, must-revalidate";
# Proxy cache
proxy_cache api_cache;
proxy_cache_key "$scheme$request_method$host$request_uri";
proxy_cache_valid 200 302 60m;
proxy_cache_valid 404 1m;
proxy_cache_use_stale error timeout updating;
}
Redis Caching:
import redis
import json
redis_client = redis.Redis(host='redis', port=6379, db=0)
def get_user(user_id):
# Try cache first
cached = redis_client.get(f"user:{user_id}")
if cached:
return json.loads(cached)
# Cache miss - get from database
user = db.query(User).get(user_id)
if user:
# Store in cache for 1 hour
redis_client.setex(
f"user:{user_id}",
3600,
json.dumps(user.to_dict())
)
return user
def invalidate_user(user_id):
redis_client.delete(f"user:{user_id}")
Cache Invalidation Strategies:
- Time-based: Expire after TTL
- Event-based: Invalidate on update
- Version-based: Use version in cache key
- Manual: Purge via API
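The version-based strategy deserves a sketch, since it sidesteps deletion entirely: invalidating means bumping a counter so readers compute a new key, while stale entries simply age out via TTL (illustrative code; the version counters would live in the cache itself in production):

```python
versions = {}  # version counter per entity; in production, stored in the cache

def cache_key(entity, entity_id):
    v = versions.get((entity, entity_id), 0)
    return f"{entity}:{entity_id}:v{v}"

def bump_version(entity, entity_id):
    """Invalidate by changing the key, not by deleting the old entry."""
    versions[(entity, entity_id)] = versions.get((entity, entity_id), 0) + 1

print(cache_key("user", 42))  # user:42:v0
bump_version("user", 42)
print(cache_key("user", 42))  # user:42:v1
```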
Vertical Scaling (Scale Up):
- Bigger instance
- More CPU/RAM
- Limited by hardware
Horizontal Scaling (Scale Out):
- More instances
- Sharding
- Read replicas
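The read-replica item above boils down to routing by statement type: writes go to the primary, reads spread across replicas. A tiny router sketch (the class and server names are hypothetical; real applications configure one connection pool per role):

```python
import random

class RoutingSession:
    """Send writes to the primary, reads to any replica (sketch only)."""
    WRITE_VERBS = {"INSERT", "UPDATE", "DELETE"}

    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = replicas

    def target_for(self, sql):
        verb = sql.lstrip().split()[0].upper()
        if verb in self.WRITE_VERBS:
            return self.primary
        return random.choice(self.replicas)

session = RoutingSession("primary-db", ["replica-1", "replica-2"])
print(session.target_for("INSERT INTO users (name) VALUES ('John')"))  # primary-db
print(session.target_for("SELECT * FROM users") in session.replicas)   # True
```

One caveat the sketch glosses over: replicas lag the primary, so read-your-own-writes flows may need to pin a session to the primary briefly after a write.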
Read Replicas:
-- Write to master
INSERT INTO users (name) VALUES ('John');
-- Read from replica
SELECT * FROM users; -- Connect to replica endpoint
Database Sharding:
Range-based (each shard owns a contiguous ID range):
Shard 0: users 0-10000
Shard 1: users 10001-20000
Shard 2: users 20001-30000
Hash-based (each shard owns the IDs that map to it):
shard_id = user_id % num_shards
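The modulo formula can be sketched directly; note that changing the shard count remaps most keys, which is why many production systems reach for consistent hashing instead (the hashed variant below is an illustrative addition, not from the text):

```python
import hashlib

SHARDS = ["shard-0", "shard-1", "shard-2"]

def shard_for(user_id):
    """Modulo routing, as in shard_id = user_id % num_shards."""
    return SHARDS[user_id % len(SHARDS)]

def shard_for_key(key):
    """Hash first so non-numeric or skewed keys still spread evenly."""
    digest = int(hashlib.sha1(str(key).encode()).hexdigest(), 16)
    return SHARDS[digest % len(SHARDS)]

print(shard_for(10001))  # shard-2
```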
Connection Pooling:
from sqlalchemy import create_engine
from sqlalchemy.pool import QueuePool
engine = create_engine(
'postgresql://user:pass@localhost/mydb',
poolclass=QueuePool,
pool_size=20,
max_overflow=10,
pool_pre_ping=True,
pool_recycle=3600
)
Asynchronous Processing:
# FastAPI with background tasks
from fastapi import FastAPI, BackgroundTasks
import asyncio
app = FastAPI()
async def process_order(order_id: str):
# Long-running task
await asyncio.sleep(5)
# Update order status
await update_database(order_id, "processed")
@app.post("/orders")
async def create_order(order: Order, background_tasks: BackgroundTasks):
# Save order quickly
order_id = await save_order(order)
# Process in background
background_tasks.add_task(process_order, order_id)
return {"order_id": order_id, "status": "accepted"}Message Queues:
# Producer (FastAPI)
import aio_pika
async def publish_order(order):
connection = await aio_pika.connect_robust("amqp://guest:guest@rabbitmq/")
channel = await connection.channel()
await channel.default_exchange.publish(
aio_pika.Message(
body=json.dumps(order).encode(),
delivery_mode=aio_pika.DeliveryMode.PERSISTENT
),
routing_key="orders"
)
await connection.close()
# Consumer (Worker)
async def process_orders():
connection = await aio_pika.connect_robust("amqp://guest:guest@rabbitmq/")
channel = await connection.channel()
queue = await channel.declare_queue("orders", durable=True)
async with queue.iterator() as queue_iter:
async for message in queue_iter:
async with message.process():
order = json.loads(message.body)
await process_order(order)
Rate Limiting:
from fastapi import FastAPI, HTTPException, Request
from datetime import datetime, timedelta
import redis
app = FastAPI()
redis_client = redis.Redis(host='redis', port=6379, db=0)
@app.middleware("http")
async def rate_limit(request: Request, call_next):
client_ip = request.client.host
key = f"rate_limit:{client_ip}"
# Check rate limit
current = redis_client.get(key)
if current and int(current) > 100:
raise HTTPException(status_code=429, detail="Too many requests")
# Increment counter
pipe = redis_client.pipeline()
pipe.incr(key)
pipe.expire(key, 60) # 1 minute window
pipe.execute()
response = await call_next(request)
return response
Active-Passive:
Region A (Primary)
├── Traffic: 100%
├── Database: Read/Write
└── Ready for failover
Region B (Standby)
├── Traffic: 0%
├── Database: Read-only replica
└── Failover target
Active-Active:
Global Load Balancer
↓
┌───┴───┐
Region A Region B
Traffic 50% Traffic 50%
Database sync Database sync
DNS Failover (Route53):
resource "aws_route53_record" "www" {
zone_id = data.aws_route53_zone.main.zone_id
name = "www.example.com"
type = "A"
alias {
name = aws_lb.primary.dns_name
zone_id = aws_lb.primary.zone_id
evaluate_target_health = true
}
failover_routing_policy {
type = "PRIMARY"
}
set_identifier = "primary"
}
resource "aws_route53_record" "www_failover" {
zone_id = data.aws_route53_zone.main.zone_id
name = "www.example.com"
type = "A"
alias {
name = aws_lb.secondary.dns_name
zone_id = aws_lb.secondary.zone_id
evaluate_target_health = true
}
failover_routing_policy {
type = "SECONDARY"
}
set_identifier = "secondary"
}
Common Compliance Frameworks:
- ISO 27001: Information security management
- SOC 2: Service organization controls
- PCI DSS: Payment card industry
- HIPAA: Healthcare
- GDPR: Data privacy
Automated Compliance Checks:
# AWS Config rule
resource "aws_config_config_rule" "encrypted_volumes" {
name = "encrypted-volumes"
source {
owner = "AWS"
source_identifier = "ENCRYPTED_VOLUMES"
}
scope {
compliance_resource_types = ["AWS::EC2::Volume"]
}
}
Evidence Collection:
# Automated evidence collection
import boto3
import json
from datetime import datetime
def collect_evidence():
# Collect IAM policies
iam = boto3.client('iam')
policies = iam.list_policies(Scope='Local')
# Collect security group rules
ec2 = boto3.client('ec2')
security_groups = ec2.describe_security_groups()
# Collect CloudTrail logs
cloudtrail = boto3.client('cloudtrail')
trails = cloudtrail.describe_trails()
evidence = {
'timestamp': datetime.utcnow().isoformat(),
'iam_policies': policies,
'security_groups': security_groups,
'cloudtrail': trails
}
# Store in secure bucket
s3 = boto3.client('s3')
s3.put_object(
Bucket='compliance-evidence',
Key=f"evidence/{datetime.now().date()}/config.json",
Body=json.dumps(evidence, default=str)
)
Policy as Code:
# AWS Service Control Policy (SCP)
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Deny",
"Action": [
"ec2:RunInstances"
],
"Resource": [
"arn:aws:ec2:*:*:instance/*"
],
"Condition": {
"StringNotEquals": {
"ec2:InstanceType": [
"t3.micro",
"t3.small",
"m5.large"
]
}
}
},
{
"Effect": "Deny",
"Action": [
"s3:PutBucketPublicAccessBlock"
],
"Resource": "*"
}
]
}
Tagging Strategy:
# Enforce tags
resource "aws_cloudformation_stack" "enforce_tags" {
name = "enforce-tags"
template_body = <<TEMPLATE
Resources:
EnforceTagsLambda:
Type: AWS::Lambda::Function
Properties:
Handler: index.handler
Runtime: python3.9
Code:
ZipFile: |
import boto3
import json
def handler(event, context):
ec2 = boto3.client('ec2')
# Tag filters cannot match absent tags, so list all
# instances and check for the missing tag in code
resources = ec2.describe_instances()
# Stop untagged resources
for reservation in resources['Reservations']:
for instance in reservation['Instances']:
tags = {t['Key'] for t in instance.get('Tags', [])}
if 'Environment' not in tags:
ec2.stop_instances(InstanceIds=[instance['InstanceId']])
return {'status': 'completed'}
TEMPLATE
}
Cost Allocation Tags:
resource "aws_instance" "web" {
# ... other configuration
tags = {
Name = "web-server"
Environment = "production"
CostCenter = "product-engineering"
Project = "customer-portal"
Owner = "team-alpha"
Expires = "never" # or "2024-12-31"
}
}
Budget Alerts:
resource "aws_budgets_budget" "monthly" {
name = "monthly-budget"
budget_type = "COST"
limit_amount = "10000"
limit_unit = "USD"
time_unit = "MONTHLY"
cost_types {
include_credit = false
include_discount = false
include_other_subscription = true
include_recurring = true
include_refund = false
include_subscription = true
include_support = true
include_tax = true
include_upfront = true
use_blended = false
}
notification {
comparison_operator = "GREATER_THAN"
threshold = 80
threshold_type = "PERCENTAGE"
notification_type = "ACTUAL"
subscriber_email_addresses = ["finance@example.com"]
}
}
FinOps is financial operations for the cloud.
FinOps Principles:
- Teams need to collaborate: Finance, engineering, product
- Decisions driven by business value: Cost vs. features
- Everyone takes ownership: Decentralized accountability
- Reports should be accessible: Transparency
- Cloud is variable cost: Optimize continuously
Cost Optimization Strategies:
# Automated rightsizing recommendation
def analyze_rightsizing():
# Get usage metrics
cloudwatch = boto3.client('cloudwatch')
# For each instance
for instance in get_all_instances():
# Get CPU utilization
stats = cloudwatch.get_metric_statistics(
Namespace='AWS/EC2',
MetricName='CPUUtilization',
Dimensions=[{'Name': 'InstanceId', 'Value': instance.id}],
StartTime=datetime.now() - timedelta(days=30),
EndTime=datetime.now(),
Period=3600,
Statistics=['Average']
)
datapoints = stats['Datapoints']
if not datapoints:
continue  # no metrics reported; skip rather than divide by zero
avg_cpu = sum(p['Average'] for p in datapoints) / len(datapoints)
# Recommend downsizing if low utilization
if avg_cpu < 10:
recommend_smaller_instance(instance)
# Recommend spot if appropriate
if can_use_spot(instance):
recommend_spot_conversion(instance)
Spot Instance Strategy:
# Spot instance with mixed types
resource "aws_ec2_fleet" "compute" {
launch_template_config {
launch_template_specification {
launch_template_id = aws_launch_template.app.id
version = "$Latest"
}
overrides {
instance_type = "c5.large"
weighted_capacity = 2
}
overrides {
instance_type = "c5a.large"
weighted_capacity = 2
}
overrides {
instance_type = "m5.large"
weighted_capacity = 2
}
}
target_capacity_specification {
default_target_capacity_type = "spot"
total_target_capacity = 20
spot_target_capacity = 20
}
spot_options {
allocation_strategy = "capacity-optimized"
instance_interruption_behavior = "terminate"
min_target_capacity = 10
}
}
The 7 Rs of Migration:
- Rehost (Lift and Shift): Move as-is
- Replatform (Lift, Tinker, Shift): Minor optimizations
- Repurchase (Drop and Shop): Move to SaaS
- Refactor (Re-architect): Modernize for cloud
- Retire: Decommission unused
- Retain: Keep on-premises
- Relocate: Move to hyperconverged
Migration Phases:
- Assess: Discovery and planning
- Mobilize: Pilot and skills building
- Migrate: Scale migration
- Modernize: Optimize and innovate
Database Migration Service:
# AWS DMS replication task
resource "aws_dms_replication_task" "migrate" {
replication_task_id = "migrate-db"
migration_type = "full-load"
replication_instance_arn = aws_dms_replication_instance.dms.replication_instance_arn
source_endpoint_arn = aws_dms_endpoint.source.endpoint_arn
target_endpoint_arn = aws_dms_endpoint.target.endpoint_arn
table_mappings = jsonencode({
"rules": [
{
"rule-type": "selection",
"rule-id": "1",
"rule-name": "1",
"object-locator": {
"schema-name": "public",
"table-name": "users"
},
"rule-action": "include"
}
]
})
replication_task_settings = jsonencode({
"TargetMetadata": {
"TargetSchema": "",
"SupportLobs": true,
"FullLobMode": false,
"LobChunkSize": 64,
"LimitedSizeLobMode": false,
"LobMaxSize": 32
},
"FullLoadSettings": {
"TargetTablePrepMode": "DROP_AND_CREATE",
"CreatePkAfterFullLoad": false,
"StopTaskCachedChangesApplied": false,
"StopTaskCachedChangesNotApplied": false,
"MaxFullLoadSubTasks": 8,
"TransactionConsistencyTimeout": 600,
"CommitRate": 10000
}
})
}
Architecture:
┌─────────┐ ┌─────────┐ ┌─────────┐
│ React │ → │ API │ → │ Users │
│ App │ ← │ Gateway │ ← │ Service │
└─────────┘ └─────────┘ └─────────┘
↓ ↓
┌─────────┐ ┌─────────┐
│ Auth │ │ Posts │
│ Service │ │ Service │
└─────────┘ └─────────┘
Repository Structure:
myapp/
├── services/
│ ├── api-gateway/
│ │ ├── src/
│ │ ├── Dockerfile
│ │ └── package.json
│ ├── users-service/
│ │ ├── src/
│ │ ├── Dockerfile
│ │ └── requirements.txt
│ └── posts-service/
│ ├── src/
│ ├── Dockerfile
│ └── go.mod
├── frontend/
│ ├── src/
│ ├── Dockerfile
│ └── package.json
├── k8s/
│ ├── base/
│ │ ├── deployment.yaml
│ │ └── service.yaml
│ └── overlays/
│ ├── dev/
│ └── prod/
├── terraform/
│ ├── main.tf
│ ├── variables.tf
│ └── outputs.tf
├── .github/
│ └── workflows/
│ ├── ci.yml
│ └── cd.yml
└── README.md
Branch Strategy:
- main: Production-ready code
- develop: Integration branch
- feature/*: New features
- release/*: Release preparation
- hotfix/*: Emergency fixes
PR Template:
## Description
[Describe your changes]
## Type of Change
- [ ] Bug fix
- [ ] New feature
- [ ] Breaking change
- [ ] Documentation update
## Testing
- [ ] Unit tests passing
- [ ] Integration tests passing
- [ ] Manual testing completed
## Checklist
- [ ] Code follows style guide
- [ ] Documentation updated
- [ ] Dependencies updated
- [ ] Security considerations addressed
## Related Issues
Closes #[issue-number]
GitHub Actions CI:
# .github/workflows/ci.yml
name: CI Pipeline
on:
push:
branches: [main, develop]
pull_request:
branches: [main]
jobs:
lint:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- name: Lint API Gateway
working-directory: services/api-gateway
run: |
npm install
npm run lint
- name: Lint Users Service
working-directory: services/users-service
run: |
pip install flake8
flake8 src/
- name: Lint Posts Service
working-directory: services/posts-service
run: |
go install golang.org/x/lint/golint@latest
golint ./...
test:
runs-on: ubuntu-latest
services:
postgres:
image: postgres:13
env:
POSTGRES_PASSWORD: testpass
options: >-
--health-cmd pg_isready
--health-interval 10s
--health-timeout 5s
--health-retries 5
ports:
- 5432:5432
redis:
image: redis:6
options: >-
--health-cmd "redis-cli ping"
--health-interval 10s
--health-timeout 5s
--health-retries 5
ports:
- 6379:6379
steps:
- uses: actions/checkout@v2
- name: Test API Gateway
working-directory: services/api-gateway
run: |
npm install
npm test -- --coverage
- name: Test Users Service
working-directory: services/users-service
env:
DATABASE_URL: postgresql://postgres:testpass@localhost/test
run: |
pip install -r requirements.txt
pytest --cov=src tests/
- name: Test Posts Service
working-directory: services/posts-service
run: |
go mod download
go test -v -cover ./...
security:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- name: Run SAST
uses: github/codeql-action/init@v1
with:
languages: javascript,python,go
- name: Scan dependencies
run: |
npm audit --audit-level=high
safety check
go list -json -deps | nancy sleuth
- name: Scan for secrets
uses: trufflesecurity/trufflehog@main
build:
runs-on: ubuntu-latest
needs: [lint, test, security]
if: github.event_name == 'push' && github.ref == 'refs/heads/main'
steps:
- uses: actions/checkout@v2
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v1
- name: Login to Container Registry
uses: docker/login-action@v1
with:
registry: ghcr.io
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
- name: Build and push API Gateway
uses: docker/build-push-action@v2
with:
context: services/api-gateway
push: true
tags: |
ghcr.io/${{ github.repository }}/api-gateway:${{ github.sha }}
ghcr.io/${{ github.repository }}/api-gateway:latest
cache-from: type=gha
cache-to: type=gha,mode=max
- name: Build and push Users Service
uses: docker/build-push-action@v2
with:
context: services/users-service
push: true
tags: |
ghcr.io/${{ github.repository }}/users-service:${{ github.sha }}
ghcr.io/${{ github.repository }}/users-service:latest
- name: Build and push Posts Service
uses: docker/build-push-action@v2
with:
context: services/posts-service
push: true
tags: |
ghcr.io/${{ github.repository }}/posts-service:${{ github.sha }}
ghcr.io/${{ github.repository }}/posts-service:latest
- name: Scan images for vulnerabilities
uses: aquasecurity/trivy-action@master
with:
image-ref: 'ghcr.io/${{ github.repository }}/api-gateway:${{ github.sha }}'
severity: 'CRITICAL,HIGH'
format: 'sarif'
output: 'trivy-results.sarif'
API Gateway Dockerfile:
FROM node:18-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci --only=production
COPY . .
FROM node:18-alpine
RUN addgroup -g 1001 -S nodejs && \
adduser -S nodejs -u 1001
WORKDIR /app
COPY --from=builder --chown=nodejs:nodejs /app/node_modules ./node_modules
COPY --chown=nodejs:nodejs . .
USER nodejs
EXPOSE 3000
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
CMD node healthcheck.js
CMD ["node", "src/server.js"]Users Service Dockerfile:
FROM python:3.10-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --user --no-cache-dir -r requirements.txt
FROM python:3.10-slim
RUN apt-get update && apt-get install -y --no-install-recommends \
curl \
&& rm -rf /var/lib/apt/lists/*
RUN groupadd -r appuser && useradd -r -g appuser appuser
WORKDIR /app
COPY --from=builder /root/.local /home/appuser/.local
COPY . .
ENV PATH=/home/appuser/.local/bin:$PATH
USER appuser
EXPOSE 8000
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
CMD curl -f http://localhost:8000/health || exit 1
CMD ["uvicorn", "src.main:app", "--host", "0.0.0.0", "--port", "8000"]Posts Service Dockerfile:
FROM golang:1.19-alpine AS builder
WORKDIR /app
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 GOOS=linux go build -o posts-service ./cmd/server
FROM alpine:3.17
RUN apk --no-cache add ca-certificates
RUN addgroup -g 1001 -S appgroup && \
adduser -S appuser -u 1001 -G appgroup
WORKDIR /app
COPY --from=builder --chown=appuser:appgroup /app/posts-service .
USER appuser
EXPOSE 8080
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
CMD ["./posts-service", "health"]
CMD ["./posts-service"]Kustomize Base:
# k8s/base/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: api-gateway
labels:
app: api-gateway
spec:
replicas: 3
selector:
matchLabels:
app: api-gateway
template:
metadata:
labels:
app: api-gateway
spec:
containers:
- name: api-gateway
image: ghcr.io/myorg/myapp/api-gateway:latest
ports:
- containerPort: 3000
env:
- name: NODE_ENV
value: "production"
- name: USERS_SERVICE_URL
value: "http://users-service:8000"
- name: POSTS_SERVICE_URL
value: "http://posts-service:8080"
- name: REDIS_URL
valueFrom:
secretKeyRef:
name: app-secrets
key: redis-url
resources:
requests:
memory: "128Mi"
cpu: "250m"
limits:
memory: "256Mi"
cpu: "500m"
livenessProbe:
httpGet:
path: /health
port: 3000
initialDelaySeconds: 30
periodSeconds: 10
readinessProbe:
httpGet:
path: /ready
port: 3000
initialDelaySeconds: 5
periodSeconds: 5
---
# k8s/base/service.yaml
apiVersion: v1
kind: Service
metadata:
name: api-gateway
spec:
selector:
app: api-gateway
ports:
- port: 80
targetPort: 3000
type: ClusterIP
---
# k8s/base/ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: api-gateway
annotations:
kubernetes.io/ingress.class: nginx
cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
tls:
- hosts:
- api.example.com
secretName: api-tls
rules:
- host: api.example.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: api-gateway
port:
number: 80
Production Overlay:
# k8s/overlays/prod/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
bases:
- ../../base
namespace: production
images:
- name: ghcr.io/myorg/myapp/api-gateway
newTag: v1.2.3
- name: ghcr.io/myorg/myapp/users-service
newTag: v1.2.3
- name: ghcr.io/myorg/myapp/posts-service
newTag: v1.2.3
patchesStrategicMerge:
- increase-replicas.yaml
- resource-limits.yaml
configMapGenerator:
- name: app-config
behavior: merge
literals:
- LOG_LEVEL=info
- ENVIRONMENT=production
secretGenerator:
- name: app-secrets
behavior: merge
literals:
- redis-url=redis://redis-service:6379
- database-url=postgresql://user:pass@postgres:5432/prod
# k8s/overlays/prod/increase-replicas.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: api-gateway
spec:
replicas: 5
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: users-service
spec:
replicas: 3
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: posts-service
spec:
replicas: 3
Prometheus Configuration:
# prometheus.yml
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_configs:
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
- source_labels: [__meta_kubernetes_pod_phase]
regex: (Failed|Succeeded)
action: drop
ServiceMonitor for Custom Metrics:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: api-gateway
spec:
selector:
matchLabels:
app: api-gateway
endpoints:
- port: metrics
interval: 30s
path: /metrics
namespaceSelector:
matchNames:
- production
Grafana Dashboard:
{
"dashboard": {
"title": "API Gateway Dashboard",
"panels": [
{
"title": "Request Rate",
"type": "graph",
"targets": [
{
"expr": "sum(rate(http_requests_total{app='api-gateway'}[5m])) by (status_code)",
"legendFormat": "{{status_code}}"
}
]
},
{
"title": "Request Latency (p99)",
"type": "graph",
"targets": [
{
"expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{app='api-gateway'}[5m])) by (le))",
"legendFormat": "p99"
}
]
},
{
"title": "Error Rate",
"type": "graph",
"targets": [
{
"expr": "sum(rate(http_requests_total{app='api-gateway', status_code=~'5..'}[5m])) / sum(rate(http_requests_total{app='api-gateway'}[5m]))",
"legendFormat": "error ratio"
}
]
},
{
"title": "CPU Usage",
"type": "graph",
"targets": [
{
"expr": "sum(container_cpu_usage_seconds_total{container='api-gateway'}) by (pod)",
"legendFormat": "{{pod}}"
}
]
},
{
"title": "Memory Usage",
"type": "graph",
"targets": [
{
"expr": "sum(container_memory_working_set_bytes{container='api-gateway'}) by (pod)",
"legendFormat": "{{pod}}"
}
]
}
]
}
}
Alert Rules:
# alerts.yml
groups:
- name: api-gateway
rules:
- alert: APIHighErrorRate
expr: |
sum(rate(http_requests_total{app='api-gateway', status_code=~'5..'}[5m]))
/
sum(rate(http_requests_total{app='api-gateway'}[5m])) > 0.05
for: 5m
labels:
severity: critical
annotations:
summary: "API Gateway high error rate"
description: "Error rate is {{ $value | humanizePercentage }} for 5 minutes"
- alert: APIHighLatency
expr: |
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{app='api-gateway'}[5m])) by (le)) > 0.5
for: 10m
labels:
severity: warning
annotations:
summary: "API Gateway high latency"
description: "p99 latency is {{ $value }}s for 10 minutes"
- alert: APIDown
expr: up{job='api-gateway'} == 0
for: 1m
labels:
severity: critical
annotations:
summary: "API Gateway is down"
description: "API Gateway has been down for more than 1 minute"Secret Management:
# secrets.yaml (encrypted with sops)
apiVersion: v1
kind: Secret
metadata:
name: app-secrets
type: Opaque
data:
database-url: ENC[AES256_GCM,data:...]
redis-url: ENC[AES256_GCM,data:...]
api-key: ENC[AES256_GCM,data:...]
sops:
kms:
- arn: arn:aws:kms:us-east-1:123456789:key/...
created_at: "..."
enc: "..."Pod Security Policy:
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
name: restricted
spec:
privileged: false
allowPrivilegeEscalation: false
requiredDropCapabilities:
- ALL
volumes:
- 'configMap'
- 'emptyDir'
- 'projected'
- 'secret'
- 'downwardAPI'
hostNetwork: false
hostIPC: false
hostPID: false
runAsUser:
rule: 'MustRunAsNonRoot'
seLinux:
rule: 'RunAsAny'
supplementalGroups:
rule: 'MustRunAs'
ranges:
- min: 1
max: 65535
fsGroup:
rule: 'MustRunAs'
ranges:
- min: 1
max: 65535
Network Policy:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: api-gateway-network-policy
spec:
podSelector:
matchLabels:
app: api-gateway
policyTypes:
- Ingress
- Egress
ingress:
- from:
- namespaceSelector:
matchLabels:
name: frontend
ports:
- protocol: TCP
port: 3000
egress:
- to:
- podSelector:
matchLabels:
app: users-service
ports:
- protocol: TCP
port: 8000
- to:
- podSelector:
matchLabels:
app: posts-service
ports:
- protocol: TCP
port: 8080
- to:
- podSelector:
matchLabels:
app: redis
ports:
- protocol: TCP
port: 6379
Netflix
Scale:
- 200M+ subscribers
- Thousands of microservices
- Millions of streaming hours daily
- Thousands of deployments daily
Key Practices:
1. Chaos Engineering
- Chaos Monkey randomly terminates instances
- Simian Army tests various failure modes
- Latency Monkey introduces delays
- Conformity Monkey enforces best practices
# Chaos Monkey simplified example
import random
import boto3
class ChaosMonkey:
def __init__(self, probability=0.01):
self.probability = probability
self.ec2 = boto3.client('ec2')
def run(self):
instances = self.get_production_instances()
for instance in instances:
if random.random() < self.probability:
self.terminate_instance(instance)
self.notify_team(instance)
def get_production_instances(self):
# Get instances with production tag
response = self.ec2.describe_instances(
Filters=[
{'Name': 'tag:Environment', 'Values': ['production']}
]
)
return response['Reservations']
def terminate_instance(self, instance):
instance_id = instance['Instances'][0]['InstanceId']
self.ec2.terminate_instances(InstanceIds=[instance_id])
2. Immutable Infrastructure
- Servers never patched, always replaced
- Golden AMIs with everything baked in
- Blue/green deployments
- Automated rollback
3. Spinnaker for CD
- Multi-cloud continuous delivery
- Pipeline stages: bake, test, deploy
- Canary analysis
- Automated rollbacks
// Spinnaker pipeline
{
"application": "netflix",
"name": "deploy-service",
"stages": [
{
"type": "bake",
"name": "Bake Image",
"baseOs": "ubuntu",
"package": "myapp"
},
{
"type": "canary",
"name": "Canary Deploy",
"cluster": "myapp-canary",
"targetSize": 5,
"analysisType": "realTime",
"metrics": [
"error_rate < 0.1%",
"latency_p99 < 200ms"
]
},
{
"type": "rollingPush",
"name": "Production Deploy",
"cluster": "myapp-prod",
"targetSize": 100
}
]
}
4. Culture of Freedom and Responsibility
- "You build it, you run it"
- Engineers own their services
- Blameless postmortems
- Data-driven decisions
Amazon
Scale:
- 100M+ deployments per year
- 143,000 deployments in peak hour
- 2-pizza teams (6-10 people)
- Service-oriented architecture
Key Practices:
1. Two-Pizza Teams
- Small, autonomous teams
- Full ownership of services
- Independent deployment
- Clear API contracts
2. Deployment Pipeline
# Amazon's deployment pipeline simplified
class DeploymentPipeline:
def __init__(self, service_name):
self.service = service_name
self.stages = [
'commit',
'build',
'unit_tests',
'integration_tests',
'performance_tests',
'security_scan',
'canary',
'production'
]
def execute(self, version):
for stage in self.stages:
if not self.run_stage(stage, version):
self.rollback(version)
return False
# Collect metrics
metrics = self.collect_metrics(stage)
if self.thresholds_exceeded(metrics):
self.rollback(version)
return False
return True
def canary_deploy(self, version):
# Deploy to 1% of instances
canary_group = self.deploy_to_group(version, percent=1)
# Monitor for 15 minutes
time.sleep(900)
# Check metrics
if self.canary_healthy(canary_group):
# Gradual rollout
self.deploy_to_group(version, percent=10)
time.sleep(300)
self.deploy_to_group(version, percent=25)
time.sleep(300)
self.deploy_to_group(version, percent=50)
time.sleep(300)
self.deploy_to_group(version, percent=100)
else:
self.rollback_canary(version)
3. API Mandate
- All teams expose APIs
- No direct database access
- Backward compatibility required
- Versioned APIs
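The versioning and backward-compatibility items can be made concrete with a handler registry: every version a client depends on keeps working until that client migrates (hypothetical sketch; `api`, `dispatch`, and the payload shapes are illustrative, not Amazon's actual mechanism):

```python
# Registry of (handler name, version) -> callable
HANDLERS = {}

def api(version):
    """Decorator that registers a handler under an explicit API version."""
    def register(fn):
        HANDLERS[(fn.__name__, version)] = fn
        return fn
    return register

@api(1)
def get_user(user_id):
    return {"id": user_id, "name": "John Doe"}

@api(2)
def get_user(user_id):
    # v2 restructures the payload; v1 callers are unaffected
    return {"id": user_id, "first_name": "John", "last_name": "Doe"}

def dispatch(name, version, *args):
    """Callers pin a version; old versions stay registered until retired."""
    return HANDLERS[(name, version)](*args)

print(dispatch("get_user", 1, 7))  # {'id': 7, 'name': 'John Doe'}
```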
4. "You Build It, You Run It"
- Developers carry pagers
- On-call rotation within dev teams
- Operational excellence is priority
- Automated remediation
Google
Scale:
- Billions of users
- Global infrastructure
- 100% services with SLOs
- Error budgets for all services
Key Practices:
1. Error Budgets
class ErrorBudget:
def __init__(self, service, slo=99.99):
self.service = service
self.slo = slo
self.budget = 100 - slo
self.consumed = 0
def track_error(self, duration):
# Track error against budget
error_seconds = duration
total_seconds = self.get_total_seconds()
self.consumed = (error_seconds / total_seconds) * 100
if self.consumed > self.budget:
self.enforce_freeze()
def enforce_freeze(self):
# Block releases when budget exhausted
print(f"Error budget exhausted for {self.service}")
self.block_releases()
self.focus_on_reliability()
def reset_monthly(self):
self.consumed = 0
self.unblock_releases()
2. Toil Elimination
- Target < 50% time on toil
- Automate everything
- Self-service platforms
- Continuous improvement
# Toil tracking
class ToilTracker:
def __init__(self):
self.toil_time = 0
self.eng_time = 0
def track_activity(self, activity_type, duration):
if activity_type == 'toil':
self.toil_time += duration
else:
self.eng_time += duration
self.ensure_balance()
def ensure_balance(self):
total = self.toil_time + self.eng_time
if total > 0:
toil_percentage = (self.toil_time / total) * 100
if toil_percentage > 50:
self.trigger_toil_reduction()
def trigger_toil_reduction(self):
print("Toil exceeds 50% - initiating reduction projects")
# Start automation projects
# Assign engineering time to reduce toil
3. Monitoring Philosophy
- Monitor symptoms, not causes
- Only alert if action required
- Use SLIs, SLOs, error budgets
- Minimal, actionable alerts
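The SLI/SLO/error-budget triple is often operationalized as burn-rate alerting: page only when the budget is being consumed fast enough to matter, which is exactly "alert only if action required". A minimal sketch (the 14.4x fast-burn threshold follows the multi-window policy described in Google's SRE Workbook; treat the numbers as illustrative):

```python
def burn_rate(error_ratio, slo=0.999):
    """Budget-consumption speed: 1.0 means erring at exactly the budgeted rate."""
    budget = 1.0 - slo
    return error_ratio / budget

def should_page(error_ratio, slo=0.999, threshold=14.4):
    # 14.4x over a short window burns ~2% of a 30-day budget in one hour,
    # the commonly cited fast-burn paging threshold.
    return burn_rate(error_ratio, slo) >= threshold

print(round(burn_rate(0.001), 6))  # 1.0 -- on budget, no alert
print(should_page(0.02))           # True -- ~20x burn, page a human
```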
Profile:
- Series B startup
- 50 engineers
- AWS cloud
- 10 microservices
- 100K users
DevOps Implementation:
Phase 1: Foundation (Month 1-3)
- GitHub for version control
- GitHub Actions for CI
- Terraform for infrastructure
- Docker for containerization
- ECS for orchestration (simpler than K8s)
Phase 2: Automation (Month 4-6)
- Automated testing in CI
- Container image building
- Blue/green deployments
- Basic monitoring (CloudWatch)
Phase 3: Scaling (Month 7-12)
- Migrate to EKS
- Service mesh (Linkerd)
- Prometheus/Grafana
- Centralized logging (ELK)
- Security scanning (Trivy)
Sample CI Pipeline:
name: Startup CI/CD
on:
  push:
    branches: [main]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm test
      - run: npm run lint
  build:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Log in to ECR before building and pushing
      - run: aws ecr get-login-password | docker login --username AWS --password-stdin ${{ secrets.ECR_REPO }}
      - run: docker build -t myapp:${{ github.sha }} .
      - run: docker tag myapp:${{ github.sha }} ${{ secrets.ECR_REPO }}:latest
      - run: docker push ${{ secrets.ECR_REPO }}:latest
  deploy:
    needs: build
    runs-on: ubuntu-latest
    environment: production
    steps:
      - uses: actions/checkout@v4
      - run: |
          aws ecs update-service \
            --cluster myapp-cluster \
            --service myapp-service \
            --force-new-deployment \
            --region us-east-1

Profile:
- Fortune 500 financial services
- 10,000+ employees
- 1,000+ applications
- Legacy data centers
- Strict regulatory requirements
Challenges:
- Legacy mainframe applications
- Regulatory compliance (SOX, PCI)
- Security concerns
- Siloed teams
- Vendor lock-in
Migration Phases:
Phase 1: Assessment (6 months)
- Application portfolio analysis
- Dependency mapping
- Compliance requirements review
- Skills assessment
- Vendor evaluation
Phase 2: Foundation (12 months)
- Create cloud landing zone
- Establish governance framework
- Build central platform team
- Implement security controls
- Set up connectivity (Direct Connect)
# Enterprise landing zone (module source shown for illustration)
module "landing_zone" {
  source = "terraform-aws-modules/control-tower/aws"

  # Multi-account structure
  organizational_units = {
    "Security" = {
      accounts = ["audit", "security-tooling"]
    }
    "Infrastructure" = {
      accounts = ["network", "shared-services", "cicd"]
    }
    "Workloads" = {
      accounts = ["dev", "test", "prod", "dr"]
    }
  }

  # Guardrails
  guardrails = {
    "DISALLOW_PUBLIC_IPS" = {
      type = "mandatory"
    }
    "ENFORCE_ENCRYPTION" = {
      type = "mandatory"
    }
    "ENABLE_CLOUDTRAIL" = {
      type = "mandatory"
    }
  }
}

Phase 3: Pilot (6 months)
- Select 3 pilot applications
- Lift-and-shift initial migrations
- Validate security controls
- Train first teams
- Document patterns
Phase 4: Scale (18 months)
- Wave-based migrations
- Automate where possible
- Modernize applications
- Implement CI/CD
- Establish FinOps
Phase 5: Optimize (ongoing)
- Rightsizing
- Spot instances
- Containerization
- Serverless adoption
- Continuous improvement
Key Success Factors:
- Executive sponsorship - C-level support
- Center of Excellence - Central team
- Training program - Skill development
- Security first - Compliance from day one
- Measurable wins - Show progress
- Cultural change - DevOps mindset
File Operations:
ls -la # List all files with details
cd /path/to/dir # Change directory
pwd # Print working directory
cp -r source dest # Copy recursively
mv source dest # Move/rename
rm -rf dir # Remove forcefully
mkdir -p path/to/dir # Create directory with parents
touch file.txt # Create empty file/update timestamp
cat file.txt # Display file content
less file.txt # View file page by page
head -n 10 file.txt # First 10 lines
tail -f file.txt # Follow file (live updates)
find . -name "*.txt" # Find files by name
grep -r "pattern" . # Search recursively

Process Management:
ps aux # All processes
top # Interactive process viewer
htop # Enhanced top
kill -9 PID # Force kill process
kill -15 PID # Graceful termination
pgrep process_name # Find PID by name
pkill process_name # Kill by name
jobs # List background jobs
bg %1 # Resume job in background
fg %1 # Bring to foreground
nohup command & # Run immune to hangups

Network Commands:
ip addr show # IP addresses
ip route show # Routing table
ss -tulpn # Listening ports
netstat -an # Network statistics (legacy)
curl -I http://example.com # HTTP headers
wget http://example.com/file # Download file
ping -c 4 example.com # ICMP ping
traceroute example.com # Trace route
nslookup example.com # DNS lookup
dig example.com # Detailed DNS
telnet host port # Test TCP connection
nc -vz host port # Netcat port scan
tcpdump -i eth0 # Capture packets

System Information:
uname -a # Kernel info
cat /etc/os-release # OS info
lscpu # CPU info
free -h # Memory usage
df -h # Disk usage
du -sh * # Directory sizes
uptime # System uptime
whoami # Current user
id # User identity
hostname # System hostname
date # Current date/time
dmesg | tail # Kernel messages

Package Management (Ubuntu/Debian):
apt update # Update package lists
apt upgrade # Upgrade all packages
apt install package # Install package
apt remove package # Remove package
apt autoremove # Remove unused packages
apt search pattern # Search packages
dpkg -l # List installed
dpkg -S /path/to/file # Which package owns file

Package Management (RHEL/CentOS):
yum update # Update all packages
yum install package # Install package
yum remove package # Remove package
yum search pattern # Search packages
rpm -qa # List installed
rpm -qf /path/to/file # Which package owns file

Systemd Commands:
systemctl status service # Service status
systemctl start service # Start service
systemctl stop service # Stop service
systemctl restart service # Restart service
systemctl enable service # Enable at boot
systemctl disable service # Disable at boot
systemctl list-units # List all units
journalctl -u service # View logs
journalctl -f # Follow logs
systemctl daemon-reload # Reload unit files

Basic Commands:
git init # Initialize repository
git clone url # Clone repository
git add file # Stage file
git add . # Stage all
git commit -m "message" # Commit staged
git status # Show status
git log # Show history
git log --oneline # Compact history
git diff # Show unstaged changes
git diff --staged # Show staged changes

Branching:
git branch # List branches
git branch new-branch # Create branch
git checkout branch # Switch branch
git checkout -b new-branch # Create and switch
git merge branch # Merge branch into current
git branch -d branch # Delete branch
git push origin --delete branch # Delete remote branch

Remote Operations:
git remote -v # List remotes
git remote add origin url # Add remote
git push origin main # Push to remote
git pull origin main # Pull from remote
git fetch origin # Fetch without merge
git remote update # Update all remotes

Undoing Changes:
git reset file # Unstage file
git reset --soft HEAD~1 # Undo commit, keep changes
git reset --hard HEAD~1 # Undo commit, discard changes
git revert HEAD # Create revert commit
git checkout -- file # Discard changes in file
git clean -fd # Remove untracked files

Stashing:
git stash # Stash changes
git stash list # List stashes
git stash pop # Apply and remove stash
git stash apply # Apply stash
git stash drop stash@{0} # Drop stash
git stash branch new-branch # Create branch from stash

History and Debugging:
git log --graph --oneline # Visual history
git blame file # Who changed what
git bisect start # Binary search for bug
git bisect bad # Current is bad
git bisect good commit # Mark good commit
git reflog # Reference log

Advanced:
git rebase -i HEAD~3 # Interactive rebase
git cherry-pick commit # Apply specific commit
git tag v1.0.0 # Create tag
git push --tags # Push tags
git submodule add url # Add submodule
git submodule update --init # Update submodules

Pod:
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
  labels:
    app: myapp
spec:
  containers:
    - name: my-container
      image: nginx:latest
      ports:
        - containerPort: 80
      env:
        - name: ENV_VAR
          value: "value"
      resources:
        requests:
          memory: "64Mi"
          cpu: "250m"
        limits:
          memory: "128Mi"
          cpu: "500m"
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      emptyDir: {}

Deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          image: myapp:latest
          ports:
            - containerPort: 8080
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5

Service:
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  selector:
    app: myapp
  ports:
    - port: 80
      targetPort: 8080
      nodePort: 30080
  type: NodePort # ClusterIP, NodePort, LoadBalancer

Ingress:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-ingress
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
spec:
  rules:
    - host: myapp.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-service
                port:
                  number: 80
  tls:
    - hosts:
        - myapp.example.com
      secretName: myapp-tls

ConfigMap:
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  config.json: |
    {
      "log_level": "info",
      "max_connections": 100
    }
  database_url: "postgresql://localhost/mydb"

Secret:
apiVersion: v1
kind: Secret
metadata:
  name: app-secret
type: Opaque
data:
  username: YWRtaW4= # "admin", base64 encoded
  password: MWYyZDFlMmU2N2Rm

PersistentVolumeClaim:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-claim
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: fast

VPC Module:
# modules/vpc/main.tf
resource "aws_vpc" "main" {
  cidr_block           = var.cidr_block
  enable_dns_hostnames = true
  enable_dns_support   = true
  tags                 = var.tags
}

resource "aws_subnet" "public" {
  count                   = length(var.public_subnets)
  vpc_id                  = aws_vpc.main.id
  cidr_block              = var.public_subnets[count.index]
  availability_zone       = var.availability_zones[count.index]
  map_public_ip_on_launch = true
  tags = merge(var.tags, {
    Name = "public-${var.availability_zones[count.index]}"
  })
}

# modules/vpc/variables.tf
variable "cidr_block" {
  description = "CIDR block for VPC"
  type        = string
}

variable "public_subnets" {
  description = "List of public subnet CIDRs"
  type        = list(string)
}

variable "availability_zones" {
  description = "List of availability zones"
  type        = list(string)
}

variable "tags" {
  description = "Tags to apply"
  type        = map(string)
  default     = {}
}

# modules/vpc/outputs.tf
output "vpc_id" {
  value = aws_vpc.main.id
}

output "public_subnet_ids" {
  value = aws_subnet.public[*].id
}

EC2 Instance Module:
# modules/ec2/main.tf
data "aws_ami" "amazon_linux" {
  most_recent = true
  owners      = ["amazon"]
  filter {
    name   = "name"
    values = ["amzn2-ami-hvm-*-x86_64-gp2"]
  }
}

resource "aws_instance" "this" {
  ami                    = var.ami != "" ? var.ami : data.aws_ami.amazon_linux.id
  instance_type          = var.instance_type
  subnet_id              = var.subnet_id
  vpc_security_group_ids = var.security_group_ids
  key_name               = var.key_name
  user_data              = var.user_data
  root_block_device {
    volume_type = var.root_volume_type
    volume_size = var.root_volume_size
    encrypted   = var.root_volume_encrypted
  }
  tags = merge(var.tags, {
    Name = var.name
  })
}

# modules/ec2/variables.tf
variable "name" {
  description = "Instance name"
  type        = string
}

variable "instance_type" {
  description = "Instance type"
  type        = string
}

variable "subnet_id" {
  description = "Subnet ID"
  type        = string
}

variable "security_group_ids" {
  description = "Security group IDs"
  type        = list(string)
}

variable "ami" {
  description = "AMI ID (optional)"
  type        = string
  default     = ""
}

variable "key_name" {
  description = "Key pair name"
  type        = string
  default     = ""
}

variable "user_data" {
  description = "User data script"
  type        = string
  default     = ""
}

variable "root_volume_size" {
  description = "Root volume size in GB"
  type        = number
  default     = 20
}

variable "root_volume_type" {
  description = "Root volume type"
  type        = string
  default     = "gp3"
}

variable "root_volume_encrypted" {
  description = "Encrypt root volume"
  type        = bool
  default     = true
}

# modules/ec2/outputs.tf
output "instance_id" {
  value = aws_instance.this.id
}

output "public_ip" {
  value = aws_instance.this.public_ip
}

output "private_ip" {
  value = aws_instance.this.private_ip
}

GitHub Actions Multi-Stage:
name: Multi-Stage Pipeline
on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]
env:
  AWS_REGION: us-east-1
  ECR_REPOSITORY: myapp
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run tests
        run: |
          npm ci
          npm test
          npm run lint
      - name: Upload coverage
        uses: codecov/codecov-action@v4
  build:
    needs: test
    runs-on: ubuntu-latest
    if: github.event_name == 'push'
    outputs:
      image_tag: ${{ steps.docker_build.outputs.image_tag }}
      registry: ${{ steps.login-ecr.outputs.registry }}
    steps:
      - uses: actions/checkout@v4
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ env.AWS_REGION }}
      - name: Login to Amazon ECR
        id: login-ecr
        uses: aws-actions/amazon-ecr-login@v2
      - name: Build and push Docker image
        id: docker_build
        env:
          ECR_REGISTRY: ${{ steps.login-ecr.outputs.registry }}
          IMAGE_TAG: ${{ github.sha }}
        run: |
          docker build -t $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG .
          docker push $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG
          echo "image_tag=$IMAGE_TAG" >> "$GITHUB_OUTPUT"
      - name: Scan image
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: ${{ steps.login-ecr.outputs.registry }}/${{ env.ECR_REPOSITORY }}:${{ github.sha }}
          severity: CRITICAL,HIGH
          exit-code: 1
  deploy-dev:
    needs: build
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/develop'
    environment: development
    steps:
      - uses: actions/checkout@v4
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ env.AWS_REGION }}
      - name: Update kubeconfig
        run: aws eks update-kubeconfig --name dev-cluster --region ${{ env.AWS_REGION }}
      - name: Deploy to EKS
        run: |
          # Image reference comes from the build job's outputs; steps from
          # other jobs are not addressable here.
          kubectl set image deployment/myapp \
            myapp=${{ needs.build.outputs.registry }}/${{ env.ECR_REPOSITORY }}:${{ needs.build.outputs.image_tag }} \
            -n development
          kubectl rollout status deployment/myapp -n development
  deploy-prod:
    needs: build
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    environment: production
    steps:
      - uses: actions/checkout@v4
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ env.AWS_REGION }}
      - name: Update kubeconfig
        run: aws eks update-kubeconfig --name prod-cluster --region ${{ env.AWS_REGION }}
      - name: Deploy to production
        run: |
          # Canary deployment (10%)
          kubectl set image deployment/myapp-canary \
            myapp=${{ needs.build.outputs.registry }}/${{ env.ECR_REPOSITORY }}:${{ needs.build.outputs.image_tag }} \
            -n production
          # Wait and monitor
          sleep 300
          # Full rollout
          kubectl set image deployment/myapp \
            myapp=${{ needs.build.outputs.registry }}/${{ env.ECR_REPOSITORY }}:${{ needs.build.outputs.image_tag }} \
            -n production
          kubectl rollout status deployment/myapp -n production

GitLab CI Pipeline:
stages:
  - test
  - build
  - deploy

variables:
  DOCKER_DRIVER: overlay2
  IMAGE_TAG: $CI_COMMIT_SHORT_SHA
  DOCKER_HOST: tcp://docker:2375

cache:
  paths:
    - node_modules/

test:
  stage: test
  image: node:16
  script:
    - npm ci
    - npm run lint
    - npm test
  coverage: '/All files[^|]*\|[^|]*\s+([\d\.]+)/'

build:
  stage: build
  image: docker:20.10.16
  services:
    - docker:20.10.16-dind
  # Registry login is scoped to the build job; the test and deploy images
  # have no Docker CLI, so a global before_script would fail there.
  before_script:
    - docker login -u $CI_REGISTRY_USER -p $CI_REGISTRY_PASSWORD $CI_REGISTRY
  script:
    - docker build -t $CI_REGISTRY_IMAGE:$IMAGE_TAG .
    - docker push $CI_REGISTRY_IMAGE:$IMAGE_TAG
  only:
    - main
    - develop

.deploy_template: &deploy_template
  stage: deploy
  image: alpine/k8s:1.22
  script:
    - apk add --no-cache curl
    - curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
    - chmod +x kubectl && mv kubectl /usr/local/bin/
    - kubectl set image deployment/myapp myapp=$CI_REGISTRY_IMAGE:$IMAGE_TAG -n $K8S_NAMESPACE
    - kubectl rollout status deployment/myapp -n $K8S_NAMESPACE

deploy_dev:
  <<: *deploy_template
  variables:
    K8S_NAMESPACE: development
  environment:
    name: development
    url: https://dev.example.com
  only:
    - develop

deploy_staging:
  <<: *deploy_template
  variables:
    K8S_NAMESPACE: staging
  environment:
    name: staging
    url: https://staging.example.com
  only:
    - main

deploy_production:
  <<: *deploy_template
  variables:
    K8S_NAMESPACE: production
  environment:
    name: production
    url: https://example.com
  only:
    - main
  when: manual
  needs: ["deploy_staging"]

General DevOps:
- What is DevOps and why is it important?
- Explain the CAMS model.
- What are the Three Ways of DevOps?
- How do you measure DevOps success?
- What is the difference between Continuous Delivery and Continuous Deployment?
- Explain the concept of "shift left" in security.
- What is Conway's Law and how does it apply to DevOps?
- How do you handle blameless postmortems?
- What are DORA metrics?
- Explain the difference between Agile and DevOps.
CI/CD:
- How would you design a CI/CD pipeline?
- What's the difference between Jenkins, GitHub Actions, and GitLab CI?
- How do you handle database migrations in CI/CD?
- Explain blue/green deployment.
- What is canary deployment and when would you use it?
- How do you handle secrets in CI/CD pipelines?
- What is pipeline as code and why is it important?
- How do you ensure pipeline security?
- Explain the concept of "build once, deploy many".
- How do you handle rollbacks?
Containers & Kubernetes:
- What's the difference between Docker and Kubernetes?
- Explain Kubernetes architecture.
- How do you expose an application running in Kubernetes?
- What are Kubernetes Operators?
- How do you handle persistent storage in Kubernetes?
- Explain Kubernetes network policies.
- What's the difference between a deployment and a statefulset?
- How do you debug a pod that won't start?
- What is Helm and why use it?
- Explain Kubernetes RBAC.
Infrastructure as Code:
- What's the difference between declarative and imperative IaC?
- Explain Terraform vs Ansible.
- How do you manage Terraform state?
- What are modules in Terraform and why use them?
- How do you test infrastructure code?
- What is immutable infrastructure?
- Explain idempotency in IaC.
- How do you handle secrets in Terraform?
- What's the difference between Terraform and CloudFormation?
- How do you version infrastructure code?
Cloud:
- Explain the shared responsibility model.
- What's the difference between IaaS, PaaS, and SaaS?
- How do you design for high availability?
- Explain multi-region architecture.
- How do you manage cloud costs?
- What is VPC peering?
- Explain the difference between security groups and network ACLs.
- How do you implement disaster recovery?
- What is a landing zone?
- How do you handle cloud governance?
Monitoring & SRE:
- What are the four golden signals?
- Explain SLIs, SLOs, and SLAs.
- What is an error budget?
- How do you design effective alerts?
- What's the difference between metrics, logs, and traces?
- Explain the USE method.
- What is the RED method?
- How do you handle on-call rotations?
- What is chaos engineering?
- How do you measure reliability?
Security:
- What is DevSecOps?
- How do you implement security in CI/CD?
- What is SAST vs DAST?
- Explain container security best practices.
- How do you manage secrets?
- What is SBOM and why is it important?
- How do you scan for vulnerabilities?
- Explain the principle of least privilege.
- What is policy as code?
- How do you handle compliance in cloud?
Scenario Questions:
- A deployment is causing 500 errors. How do you respond?
- How would you migrate a legacy application to the cloud?
- Your builds are taking 30 minutes. How do you optimize?
- How would you implement a multi-region disaster recovery plan?
- A critical vulnerability is found in a dependency. What do you do?
- How would you convince management to invest in DevOps?
- Your team is experiencing burnout from on-call. How do you fix it?
- How would you design a platform for 100 microservices?
- A database migration caused downtime. How do you prevent recurrence?
- How would you implement cost optimization for a growing startup?
Level 1: Initial
- Manual deployments
- No version control
- Siloed teams
- Reactive monitoring
- Long release cycles (months)
- High failure rate
- Firefighting culture
Level 2: Managed
- Version control for code
- Basic CI (build automation)
- Some documentation
- Scheduled releases
- Basic monitoring
- Defined roles
- Tickets for operations
Level 3: Defined
- CI/CD pipelines
- Automated testing
- Configuration management
- Standardized environments
- Proactive monitoring
- Defined SLIs/SLOs
- Blameless postmortems
Level 4: Measured
- Pipeline as code
- Infrastructure as code
- Self-service platforms
- Automated security scanning
- Performance testing
- Capacity planning
- Error budgets
Level 5: Optimizing
- GitOps workflows
- Chaos engineering
- AIOps/MLOps
- Auto-remediation
- Continuous experimentation
- FinOps optimization
- Platform engineering
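One practical way to apply a maturity model like this is a self-assessment in which the weakest area gates overall maturity. A minimal sketch; the area names and scores below are hypothetical:

```python
# Hypothetical self-assessment: rate each practice area 1-5 against the
# maturity levels above.
scores = {
    "version_control": 4,
    "ci_cd": 3,
    "infrastructure_as_code": 3,
    "monitoring": 2,
    "culture": 3,
}

overall = min(scores.values())            # maturity is gated by the weakest area
next_focus = min(scores, key=scores.get)  # the area to invest in next
print(f"Level {overall}; focus next on {next_focus}")  # Level 2; focus next on monitoring
```

Taking the minimum rather than the average reflects that a Level 5 CI/CD pipeline does little good while monitoring remains reactive.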
A
- Agile: Iterative software development methodology
- Artifact: Output of build process (JAR, Docker image)
- Autoscaling: Automatically adjusting resources based on demand
B
- Blue/Green Deployment: Two identical environments, switch traffic
- Build: Process of compiling source code into artifacts
C
- CAMS: Culture, Automation, Measurement, Sharing
- Canary Deployment: Gradual rollout to subset of users
- CD: Continuous Delivery/Deployment
- CI: Continuous Integration
- Chaos Engineering: Deliberately introducing failures
- CNCF: Cloud Native Computing Foundation
- Container: Lightweight virtualization at OS level
- CRD: Custom Resource Definition (Kubernetes)
D
- DaemonSet: Runs pod on every node (Kubernetes)
- DAST: Dynamic Application Security Testing
- Deployment: Kubernetes resource for managing pods
- DevOps: Cultural and technical movement for collaboration
- DORA: DevOps Research and Assessment
- Docker: Container platform
E
- EKS: Amazon Elastic Kubernetes Service
- ELK: Elasticsearch, Logstash, Kibana
- Error Budget: (1 - SLO) * time window; the amount of failure a service may accrue before reliability work takes priority
F
- Feature Flag: Toggle for feature visibility
- FinOps: Cloud financial management
- Flux: GitOps operator
G
- Git: Distributed version control
- GitOps: Git as source of truth for infrastructure
- GKE: Google Kubernetes Engine
- Grafana: Visualization platform
H
- Helm: Kubernetes package manager
- HPA: Horizontal Pod Autoscaler
- Hybrid Cloud: Mix of public and private cloud
I
- IaC: Infrastructure as Code
- IAM: Identity and Access Management
- Idempotent: Operation with same effect when run multiple times
- Ingress: Kubernetes API object for external access
- Istio: Service mesh
J
- Jenkins: CI/CD automation server
- JSON: JavaScript Object Notation
K
- K8s: Kubernetes ("K", 8 letters, "s")
- Kustomize: Kubernetes configuration customization
- Kyverno: Kubernetes policy engine
L
- Lambda: AWS serverless function
- Load Balancer: Distributes traffic
- Logging: Recording events
M
- Microservices: Architecture with small, independent services
- Monitoring: Collecting and analyzing metrics
- mTLS: Mutual TLS for service authentication
N
- Namespace: Isolation mechanism in Kubernetes
- Network Policy: Firewall rules for pods
- Node: Worker machine in Kubernetes
O
- Observability: Understanding system internals through outputs
- OCI: Open Container Initiative
- OPA: Open Policy Agent
- Operator: Kubernetes extension for application management
P
- PaaS: Platform as a Service
- Pod: Smallest deployable unit in Kubernetes
- Prometheus: Monitoring system
- PV: Persistent Volume
- PVC: Persistent Volume Claim
R
- RBAC: Role-Based Access Control
- ReplicaSet: Ensures specified number of pods running
- Rolling Update: Gradually replacing instances
- Runbook: Documented procedures for operations
S
- SaaS: Software as a Service
- SAST: Static Application Security Testing
- SBOM: Software Bill of Materials
- Secret: Kubernetes resource for sensitive data
- Service Mesh: Infrastructure layer for service communication
- SLA: Service Level Agreement
- SLI: Service Level Indicator
- SLO: Service Level Objective
- SRE: Site Reliability Engineering
T
- Terraform: IaC tool by HashiCorp
- Toil: Manual, repetitive operational work
- Tracing: Tracking request through distributed system
U
- Unit Test: Testing individual components
- USE Method: Utilization, Saturation, Errors
V
- VCS: Version Control System
- VPC: Virtual Private Cloud
- VPA: Vertical Pod Autoscaler
W
- Waterfall: Sequential development methodology
- Workload: Application running on Kubernetes
X
- XML: eXtensible Markup Language
Y
- YAML: YAML Ain't Markup Language
Z
- Zero Downtime Deployment: Deployment without service interruption