Things Cloud Engineers Should Know

Multi-Cloud Decisions

Key Enablers
- Workload Portability
- Ability to negotiate with suppliers
- Ability to select the best tool for a given job
Keys to Success
- Visibility - a trusted single source of truth
- Efficiency - across development, QA, security, and operations
- Governance - automated security, code quality, vulnerability management, policy enforcement
Use Managed Services
Every Engineer should be a Cloud Engineer
Keep Scalability in mind, but don't overdo it
- Does everything have to scale automatically?
- Know the upper bound
- Gather usage data, and keep testing and planning
- Identify limiting factors, e.g. the database
- Check whether the code can take advantage of more CPU/memory
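One rough way to "know the upper bound" is to note that, when every request passes through every tier, end-to-end throughput cannot exceed that of the slowest tier. The sketch below assumes you already have per-tier maximum throughput from load tests; the tier names and numbers are hypothetical.

```python
from typing import Dict, Tuple

def find_bottleneck(max_rps: Dict[str, float]) -> Tuple[str, float]:
    """Return the tier with the lowest sustainable requests/second.

    Assumes a simple chain where each request hits every tier once,
    so the system is capped by its slowest link.
    """
    if not max_rps:
        raise ValueError("need at least one component")
    name = min(max_rps, key=max_rps.get)
    return name, max_rps[name]

# Hypothetical load-test results (requests/second each tier sustained).
measured = {"load_balancer": 20_000.0, "app_servers": 4_500.0, "database": 1_200.0}
component, ceiling = find_bottleneck(measured)
print(f"Upper bound ~{ceiling:.0f} rps, limited by {component}")
```

Here the database is the limiting factor, so adding more app servers would not raise the ceiling.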
Containers aren't magic
- Use container scanning services to periodically check the contents of images for the latest known vulnerabilities
- Build secure containers and run them in a strict container sandbox
Re-Platform every 5 to 10 years
- Increased velocity
- Future proofing
- Scalability
- Security
- Efficiency
- Community backing
Visualize distributed systems
Serverless bad practices
- Deploying a lot of functions - increases size, complexity, and maintenance
- Calling functions asynchronously - asynchronous calls increase the complexity of a system. Costs also increase, because a response channel and a serverless message queue are required
- Employing many libraries - increases warm-up time
- Using many technologies - requires people with skills in all of them
- Not documenting functions
Topology
- Modularity - separation of concerns
- Deployment strategy, e.g. canary, blue/green
- Datacenter affinity - active/active, active/passive
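As an illustration of the canary strategy, the sketch below sends a fixed percentage of users to a new release. It hashes a user ID rather than choosing randomly per request, so each user stays on one version for the whole rollout; the function name and the "canary"/"stable" labels are hypothetical, and real deployments usually do this in a load balancer or service mesh.

```python
import hashlib

def route(user_id: str, canary_percent: int) -> str:
    """Deterministically send roughly canary_percent% of users to the canary."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be in [0, 100]")
    # Map the user ID to a stable bucket in 0..99.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "canary" if bucket < canary_percent else "stable"

users = [f"user-{i}" for i in range(10_000)]
share = sum(route(u, 10) == "canary" for u in users) / len(users)
print(f"{share:.1%} of users on canary")  # close to 10%
```

Raising `canary_percent` step by step while watching error rates is the usual way to widen the rollout.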
Understand how services work under the hood, e.g. Lambda cold start times, running out of IP addresses
Failing cloud migrations
- Not optimizing for the cloud - doing only a lift and shift
- Lack of architectural strategy - downtime management, latency, data migration, etc.
Antipatterns
- Wild West - each business unit (BU) buys its own logging and monitoring solutions, and CI/CD workflows differ between teams
- Command & Control - a ticketing process for cloud access
Security is Essential
Automation is required
Secrets
- Know where the secrets are kept
- Audit secrets - rotation, revocation
- Encryption
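Auditing rotation can start as a simple age check over an inventory of secrets. This sketch assumes such an inventory has been exported from your secrets manager; the `SecretRecord` fields, the secret names, and the 90-day policy are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional

@dataclass
class SecretRecord:
    # Hypothetical inventory entry; a real secrets manager exposes its own metadata.
    name: str
    last_rotated: datetime  # timezone-aware timestamp
    revoked: bool = False

def overdue_for_rotation(secrets: List[SecretRecord], max_age: timedelta,
                         now: Optional[datetime] = None) -> List[str]:
    """Return names of active (non-revoked) secrets older than max_age."""
    now = now or datetime.now(timezone.utc)
    return [s.name for s in secrets
            if not s.revoked and now - s.last_rotated > max_age]

now = datetime(2024, 6, 1, tzinfo=timezone.utc)
inventory = [
    SecretRecord("db-password", now - timedelta(days=120)),
    SecretRecord("api-key", now - timedelta(days=10)),
    SecretRecord("old-token", now - timedelta(days=400), revoked=True),
]
print(overdue_for_rotation(inventory, timedelta(days=90), now))  # ['db-password']
```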
Never take a single-region dependency
- Redundant storage
- Multiple availability zones (AZs)
- Backup
- Recovery
- Failovers should be automatic
- Practice failovers
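Automatic failover boils down to health checks plus a rule for when to switch. The sketch below fails over only after several consecutive failed checks, so a single blip does not flap traffic between regions. The class, the threshold, and the region labels are hypothetical; in practice this logic usually lives in DNS or load-balancer health checks.

```python
from collections import defaultdict
from typing import Dict, List

class FailoverRouter:
    """Route to the highest-priority region that has not failed N checks in a row."""

    def __init__(self, regions: List[str], failure_threshold: int = 3) -> None:
        self.regions = regions  # priority order, primary first
        self.failure_threshold = failure_threshold
        self.consecutive_failures: Dict[str, int] = defaultdict(int)

    def record_check(self, region: str, healthy: bool) -> None:
        # A healthy check resets the counter; a failed one increments it.
        if healthy:
            self.consecutive_failures[region] = 0
        else:
            self.consecutive_failures[region] += 1

    def active_region(self) -> str:
        for region in self.regions:
            if self.consecutive_failures[region] < self.failure_threshold:
                return region
        raise RuntimeError("no healthy region available")

router = FailoverRouter(["us-east-1", "us-west-2"], failure_threshold=3)
for _ in range(3):
    router.record_check("us-east-1", healthy=False)
print(router.active_region())  # us-west-2
```

Injecting failed checks like this is also a cheap way to rehearse failovers in a drill.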
Monitoring with Visualizations/Dashboards
Incident Analysis and Chaos Engineering
Monitoring
- Functionality
- Usage patterns
- User Experience
- Security
- Billing
- Health Status
KISS It
- Avoid premature optimizations
- Start small and use MVPs to guide design decisions
- Read documentation - pay attention to limits and error codes
- Focus on learning best practices
- Use standard naming conventions
- Delete unused cloud resources to remove clutter
- Find system failure scenarios and provide runbooks
Maintain Service Levels with Feature flags and circuit breakers
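A circuit breaker protects service levels by failing fast once a dependency keeps failing, then probing it again after a cooldown. The following is a minimal single-threaded sketch, not any particular library's API; the thresholds and the injectable `clock` are assumptions made so the behaviour is easy to test.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

class CircuitBreaker:
    """Opens after failure_threshold consecutive failures; allows a trial call after reset_timeout seconds."""

    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock                      # injectable so tests can fake time
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, fn: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                return fallback()               # open: fail fast without touching the dependency
            self.opened_at = None               # half-open: let one trial call through
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()   # (re)open the circuit
            return fallback()
        self.failures = 0                       # a success closes the circuit
        return result

# Usage with a fake clock so the behaviour is deterministic.
fake_now = [0.0]
breaker = CircuitBreaker(failure_threshold=2, reset_timeout=10.0, clock=lambda: fake_now[0])

def failing_call() -> str:
    raise ConnectionError("downstream unavailable")

print([breaker.call(failing_call, lambda: "cached") for _ in range(3)])  # ['cached', 'cached', 'cached']
fake_now[0] = 11.0
print(breaker.call(lambda: "fresh", lambda: "cached"))  # fresh
```

The fallback (a cached value here) is where degraded-but-available behaviour goes; a feature flag can switch the feature off entirely when even that is not acceptable.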
Design First, then code
Strategies to cope with duplicate messages
- Stateless consumers
- Keeping state - use a TTL so deduplication records expire
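For the "keeping state" strategy, a consumer can remember processed message IDs for a limited time and skip redeliveries. This sketch uses an in-memory dict as a stand-in for a shared store with expiring keys (an assumption; multiple consumers would need shared state). The TTL should exceed the window in which your broker may redeliver a message.

```python
import time
from typing import Callable, Dict

class TtlDeduplicator:
    """Remembers processed message IDs for `ttl` seconds so redeliveries are skipped."""

    def __init__(self, ttl: float, clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl
        self.clock = clock
        self.seen: Dict[str, float] = {}  # message_id -> expiry time

    def should_process(self, message_id: str) -> bool:
        now = self.clock()
        # Drop expired entries so state stays bounded (O(n) per call; fine for a sketch).
        self.seen = {mid: exp for mid, exp in self.seen.items() if exp > now}
        if message_id in self.seen:
            return False
        self.seen[message_id] = now + self.ttl
        return True

fake_now = [0.0]
dedup = TtlDeduplicator(ttl=60.0, clock=lambda: fake_now[0])
print(dedup.should_process("order-42"))  # True
print(dedup.should_process("order-42"))  # False (duplicate within TTL)
fake_now[0] = 61.0
print(dedup.should_process("order-42"))  # True (entry expired)
```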
Avoid big rewrites
Risks
- Not making deadlines
- Going over budget
- Burning out team members
- Losing stakeholder confidence
Steps
- Be realistic
- Use the strangler pattern - incrementally modify an existing system by gradually extracting parts of it
- Repeat
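The strangler pattern usually starts with a facade that decides, per request, whether the legacy system or the newly extracted code handles it. In this sketch routing is by path prefix; the handlers and paths are hypothetical placeholders for real services.

```python
from typing import Callable, Dict

Handler = Callable[[str], str]

def legacy_handler(path: str) -> str:
    # Hypothetical stand-in for the existing monolith.
    return f"legacy served {path}"

def new_orders_service(path: str) -> str:
    # Hypothetical stand-in for a newly extracted service.
    return f"new service served {path}"

class StranglerFacade:
    """Routes migrated path prefixes to new code and everything else to legacy."""

    def __init__(self, legacy: Handler) -> None:
        self.legacy = legacy
        self.migrated: Dict[str, Handler] = {}

    def migrate(self, prefix: str, handler: Handler) -> None:
        self.migrated[prefix] = handler

    def handle(self, path: str) -> str:
        # Longest matching prefix wins, so more specific routes take priority.
        for prefix in sorted(self.migrated, key=len, reverse=True):
            if path.startswith(prefix):
                return self.migrated[prefix](path)
        return self.legacy(path)

facade = StranglerFacade(legacy_handler)
facade.migrate("/orders", new_orders_service)
print(facade.handle("/orders/17"))   # new service served /orders/17
print(facade.handle("/invoices/3"))  # legacy served /invoices/3
```

Each "repeat" step is one more `migrate` call, until nothing is left for the legacy handler.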
QA is also feedback, and early feedback
FinOps
- Make finance and procurement part of the planning process
- Provide guardrails for shared financial accountability
- Design and architect with finance in mind
- Use financial tracing to align cloud spending to product and customer metrics
- Provide real-time visibility of cloud spending for consuming teams
How it can be done
- Offer visibility for everyone
- Identify your cost drivers and metrics
- Have guardrails for the cost-control policy
Make sure you are watching and measuring your costs
Set billing alarms
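Cloud providers offer native budget and billing alarms, which are the first thing to reach for. To show the logic such alarms typically apply, this sketch checks month-to-date spend and a naive straight-line forecast against budget thresholds; the function, thresholds, and dollar amounts are hypothetical.

```python
from typing import List, Sequence

def billing_alerts(month_to_date: float, day_of_month: int, days_in_month: int,
                   budget: float, thresholds: Sequence[float] = (0.5, 0.8, 1.0)) -> List[str]:
    """Return alerts when actual or forecast spend crosses a budget threshold."""
    if not 1 <= day_of_month <= days_in_month:
        raise ValueError("day_of_month must be within the month")
    # Naive forecast: assume the rest of the month looks like the days so far.
    forecast = month_to_date / day_of_month * days_in_month
    alerts = []
    for t in thresholds:
        if month_to_date >= budget * t:
            alerts.append(f"ACTUAL spend passed {t:.0%} of budget")
        elif forecast >= budget * t:
            alerts.append(f"FORECAST spend will pass {t:.0%} of budget")
    return alerts

# Hypothetical numbers: $600 spent by day 10 of a 30-day month against a $1,000 budget.
for alert in billing_alerts(600.0, day_of_month=10, days_in_month=30, budget=1000.0):
    print(alert)
```

Forecast-based alerts matter because they fire while there is still time to act, not after the budget is gone.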
Leverage a content delivery network
Stay within an availability zone (or region) where you are not trying to improve availability. Needless region-to-region data transfer costs are a killer
Leverage data compression
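Because data transfer is often billed per byte, compressing repetitive payloads such as JSON can cut network charges at the cost of some CPU. This sketch measures the effect with the standard library's `gzip`; the payload is made up, and real savings depend on how repetitive your data is.

```python
import gzip
import json

# Hypothetical, highly repetitive payload similar to a typical API response.
records = [{"id": i, "status": "ok", "region": "us-east-1"} for i in range(1000)]
raw = json.dumps(records).encode("utf-8")
compressed = gzip.compress(raw)

print(f"raw: {len(raw)} bytes, gzip: {len(compressed)} bytes "
      f"({len(compressed) / len(raw):.0%} of original)")
assert gzip.decompress(compressed) == raw  # lossless round trip
```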
Have an effective tagging strategy
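A tagging strategy only helps cost allocation if it is enforced. This sketch reports resources missing required tags; the required keys, resource IDs, and inventory format are hypothetical, and in practice the inventory would come from your provider's resource listing or a policy engine.

```python
from typing import Dict, List, Set

# Hypothetical tagging policy.
REQUIRED_TAGS: Set[str] = {"owner", "cost-center", "environment"}

def missing_tags(resources: Dict[str, Dict[str, str]]) -> Dict[str, List[str]]:
    """Map each resource ID to the required tag keys it lacks or leaves empty."""
    report = {}
    for resource_id, tags in resources.items():
        absent = sorted(k for k in REQUIRED_TAGS if not tags.get(k))
        if absent:
            report[resource_id] = absent
    return report

inventory = {
    "i-0abc": {"owner": "payments", "cost-center": "cc-101", "environment": "prod"},
    "vol-0def": {"owner": "search"},
}
print(missing_tags(inventory))  # {'vol-0def': ['cost-center', 'environment']}
```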
Place the accountability (and budget) for network charges on your application and solution teams
For compute and storage, utilize reserved resources/instances
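Reservations pay off only when a resource runs most of the time, since a reservation is paid for every hour of the term while on-demand is paid only for hours used. This back-of-the-envelope sketch compares one year of each; the hourly rates are hypothetical, not real prices, and it ignores upfront-payment options.

```python
HOURS_PER_YEAR = 24 * 365

def yearly_cost(on_demand_hourly: float, reserved_hourly: float, utilization: float) -> dict:
    """Compare one year of on-demand vs reserved cost at a given utilization (0..1)."""
    return {
        "on_demand": on_demand_hourly * HOURS_PER_YEAR * utilization,  # pay only for hours used
        "reserved": reserved_hourly * HOURS_PER_YEAR,                  # pay for every hour of the term
    }

def break_even_utilization(on_demand_hourly: float, reserved_hourly: float) -> float:
    """Utilization above which the reservation is cheaper."""
    return reserved_hourly / on_demand_hourly

# Hypothetical rates for illustration only.
print(yearly_cost(0.10, 0.065, utilization=1.0))  # steady workload: reserved is cheaper
print(yearly_cost(0.10, 0.065, utilization=0.4))  # bursty workload: on-demand is cheaper
print(f"break-even at {break_even_utilization(0.10, 0.065):.0%} utilization")
```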
Moving to microservices? Think about network, storage, and monitoring costs
Treat Your Infrastructure like Software
Focus on Your Team, Not on the Cost
- What is the impact of valuable/senior employees resigning because they are not comfortable with the new stack?
- What is your training budget?
- What is the impact of lost productivity during training?
- How do you manage the drop in code quality, reliability, performance, and productivity caused by the new stack?

Effectively Navigating Organizational Politics

Who was (or was not) involved in making the decision?
Who does this decision impact?
What did that process look like?
Are the decision makers perceived as having the authority to make this decision?
Who is accountable for the impact of this decision?
