AWS SOLUTION ARCHITECT CERTIFICATION

Foreword

About this document

This document summarizes all the notes I took while studying to pass both the AWS Cloud Practitioner and AWS Solution Architect Professional exams. In hindsight, I should have skipped the AWS Cloud Practitioner exam and gone straight to the Solution Architect Pro exam. The first one does not provide much credibility and is quite easy if you've been architecting native solutions on AWS for more than a couple of years, while the second is actually quite hard.

General exam tips

  • Start reading the last sentence of the question, which is the actual question. Then read the entire question. This will allow you to focus on the part of the context that matters. Those questions are very long and confusing.
  • Know your jargon (please refer to the Jargon section in the Annex). There is nothing more frustrating than not being able to understand a question because of some obscure acronym.
  • Review all the Basic skills in the Annex. This is not related to AWS specifically, but it is expected that, as a solution architect, you possess those skills.
  • Read the AWS Well-Architected Framework including all the Lenses which are extensions of the well-architected framework applied to specific industries. Those use cases have a high probability to appear in the exam.
  • Invalidating options:
    • Look for fallacies in the choices. Very often, you can rule an option out simply because it contains a fallacy. For example, an option suggesting to add a deny rule in a security group can easily be discarded, as SGs don't support deny rules.
    • Look for options with missing information. For example, if an option suggests RDS, and the question only mentions that the client needs to migrate their DB without specifying the DB type, you can invalidate the RDS option, as you need to know the DB type to recommend RDS or not.
  • Choose the cheapest solution when multiple options are valid.

Key concepts before starting

You must know the networking basics

There are no shortcuts here. If you suck at networking, you will not be able to leverage the basic AWS goodness (the same applies to any other cloud). This is not exhaustive, but the Networking section can get you started.

What are AWS fully-managed services, and what do they have to do with VPC connectivity and network cost?

AWS started as a cloud provider offering a safe and highly configurable remote network environment called a VPC, where you could provision your own servers to create your own services. As AWS grew, it started to offer fully-managed services so you would not have to re-invent the wheel and could choose a buy-over-build option (S3 and SQS were the first ones in 2006). Those fully-managed services are now the ones grabbing the most attention.

From a network perspective, what is important to realize is that those fully-managed services are not collocated in your VPC. When AWS (or any 3rd-party cloud partner) uses the term fully-managed, it means those services are hosted in a VPC that you cannot directly access. That VPC is managed by the service provider (AWS or the 3rd-party partner). To access them, you must create IAM resources (roles, policies, ...) that allow traffic for certain types of actions between your VPC and the VPC where the service is hosted. Then, each request to those fully-managed services goes over the public internet. This is critical to understand, as it means that:

  • Outbound traffic occurs each time you make a request to those services. This is not free. With AWS, ingress is free, but egress is not. Requesting S3 objects from EC2 should bear almost no cost, as the request itself is insignificant egress, but requesting an S3 object, modifying it in your EC2 instance and then saving it back to S3 will incur egress cost.
  • Resources located in your private subnet will not be able to access AWS services via their DNS name. To access AWS services from a private subnet, you can:
    1. Add a NAT gateway to your private subnet to allow access to the AWS service. The NAT incurs cost and its bandwidth is limited.
    2. Set up a VPN or Direct Connect to your corporate network which has access to the public internet. This is extremely tedious to set up, it will be expensive and you will also pay for the traffic.
    3. (recommended) Set up VPC endpoints if the AWS service supports it (see the sketch below).
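
For example, a gateway VPC endpoint lets resources in private subnets reach S3 without a NAT or internet gateway. A minimal CLI sketch, assuming hypothetical VPC/route-table IDs and the us-east-1 S3 service name:

    # Create a gateway endpoint for S3. It injects a route for the S3
    # prefix list into the given route tables, so no NAT/IGW is needed.
    aws ec2 create-vpc-endpoint \
      --vpc-id vpc-0123456789abcdef0 \
      --service-name com.amazonaws.us-east-1.s3 \
      --route-table-ids rtb-0123456789abcdef0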

Support plans

There are four different types of support plan:

  1. Basic (Free): It only covers questions around your account and billing, and the access to the community forum for all other questions.
  2. Developer ($29/month):
    • Can ask technical questions to a support center and expect a response within 12 to 24 hours via email.
    • Can get general guidance when you request architecture support.
  3. Business ($100/month)
    • 24/7 email, chat and phone support.
    • Production system down response time < 1 hour.
    • Access to the full set of Trusted Advisor checks to help optimize your infra.
    • AWS Support API access. You need this if you want to integrate with 3rd-party tools (e.g., Jira, Trello, Asana, ...). See the sketch after this list.
    • Access to Infrastructure Event Management (IEM) for an extra fee.
  4. Enterprise ($15,000/month)
    • Same as Business, plus:
      • Business-critical system down response time < 15 min.
      • Assigned Technical Account Manager (TAM) (pretty much your AWS EA), who is also expected to be proactive.
      • Access to an AWS Concierge team for non-urgent account and billing enquiries.


Exam questions

  • Which plan do you need to access a TAM? A. Enterprise
  • Which is the cheapest plan that allows integration with a 3rd-party ticketing system (e.g., JIRA)? Business, which exposes the AWS Support API.
  • Which is the cheapest plan that offers an official AWS response within 24 hours? Developer
  • Which is the cheapest plan that offers Production system down response time within an hour? Business

Core services

IAM

  • There are 3 different ways to access an AWS account:
    1. Programmatic with access key and secret
    2. AWS Console access
    3. AWS SDK
  • IAM is configured globally. It's not region-dependent.
  • Don't fuck up the root account. Use MFA to secure it.
  • CARPE Access: The way you configure IAM access in other services (e.g., configuring your Lambda to read/write to DynamoDB) is by filling out the CARPE config using IAM values. CARPE stands for:
    • (C) Conditions: WHEN can this happen.
    • (A) Action: WHAT can be done.
    • (R) Resources: WHICH item can this happen to.
    • (P) Principal: WHO can do it.
    • (E) Effect: WHAT will be granted or denied.
  • When creating a Role there are 2 different aspects that must be created:
    1. Policy: The policy defines what can be accessed and how. The following policy allows specific actions on all resources in Cloudwatch:
     {
     	"Version": "2012-10-17",
     	"Statement": [
     		{
     			"Action": [
     				"logs:CreateLogGroup",
     				"logs:CreateLogStream",
     				"logs:PutLogEvents",
     				"logs:DescribeLogGroups",
     				"logs:DescribeLogStreams"
     			],
     			"Effect": "Allow",
     			"Resource": "*"
     		}
     	]
     }

    All actions per AWS service can be found at https://iam.cloudonaut.io/.

    2. Trust relationship: It establishes who is allowed to use this role. The following example shows that only VPC Flow Logs is allowed to use the previous policy:
     {
     	"Version": "2012-10-17",
     	"Statement": [
     		{
     			"Sid": "",
     			"Effect": "Allow",
     			"Principal": {
     				"Service": "vpc-flow-logs.amazonaws.com"
     			},
     			"Action": "sts:AssumeRole"
     		}
     	]
     } 
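
As a rough sketch, the two halves above (permissions policy + trust relationship) could be wired together with the AWS CLI as follows. The role and policy names are arbitrary, and the two JSON files are assumed to contain the documents shown above:

    # 1. Create the role; the trust relationship says who may assume it.
    aws iam create-role \
      --role-name flow-logs-role \
      --assume-role-policy-document file://trust-relationship.json

    # 2. Attach the permissions policy; it says what the role may do.
    aws iam put-role-policy \
      --role-name flow-logs-role \
      --policy-name flow-logs-cloudwatch-policy \
      --policy-document file://policy.json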

VPC

The first concept to understand that relates to networking in AWS is the Virtual Private Cloud (VPC).

VPC facts sheet:

  • A VPC is a virtual network that isolates your AWS account from all the others in AWS.
  • A VPC is specific to a region. There is no such thing as a VPC that spans across multiple regions.
  • A VPC spans all the Availability Zones in its region.
  • VPC components are:
    • Subnet: A segment of a VPC’s IP address range where you can place groups of isolated resources.
    • ENI: An Elastic Network Interface is a logical networking component in a VPC that represents a virtual network card.
    • Route tables: Sets of rules that determine where network traffic is directed.
    • Security groups: Virtual firewalls that control inbound and outbound traffic for resources.
    • Internet Gateway: The Amazon VPC side of a connection to the public Internet. You can create an Internet Gateway from your VPC settings.
    • NAT Gateway: A highly available, managed Network Address Translation (NAT) service for your resources in a private subnet to access the Internet.
    • Virtual private gateway: The Amazon VPC side of a VPN connection.
    • Peering Connection: A peering connection enables you to route traffic via private IP addresses between two peered VPCs.
    • VPC Endpoints: Enables private connectivity to services hosted in AWS, from within your VPC without using an Internet Gateway, VPN, Network Address Translation (NAT) devices, or firewall proxies.
    • Egress-only Internet Gateway: A stateful gateway to provide egress only access for IPv6 traffic from the VPC to the Internet.
  • VPC traffic can be monitored with VPC Flow Logs

Subnet

A subnet is a segment of a VPC’s IP address range where you can place groups of isolated resources.

Subnet facts sheet:

  • A subnet is specific to an Availability Zone.
  • There are 3 types of subnets:
    • Public subnet: A subnet is labelled public if it is connected to an internet gateway.
    • Private subnet: A subnet is labelled private if it is NOT connected to an internet gateway.
    • VPN-only subnet: A subnet is labelled VPN-only if it is private and has its traffic routed to a virtual private gateway for a Site-to-Site VPN connection.

ENI

  • An Elastic Network Interface is a logical networking component in a VPC that represents a virtual network card.
  • Often, ENIs are not explicitly exposed to you. Instead, they are automatically provisioned as a side-effect (e.g., creating a new Lambda configured with VPC access creates at least one ENI in that VPC).
  • Each ENI consumes one IP address for itself, so be careful of side-effects that provision too many ENIs, otherwise you'll exhaust your CIDR block (e.g., it used to be that too many Lambdas connected to VPCs would exhaust the ENIs). That's why a subnet's CIDR block should be large enough.
  • With EC2, ENIs limits depend on the instance type and size.

Route tables

  • By default, a VPC is associated with a route table called the main route table. All subnets in that VPC also use that main route table.
  • It is possible to add specific route tables to specific subnets. Those route tables are called subnet route tables.
  • A subnet can only be associated with one route table, but a route table can be used by multiple subnets.
  • Example of route table:
Destination    Target
10.0.0.0/16    local

The above route table shows that the CIDR 10.0.0.0/16 routes traffic to the subnet's local gateway. If this is the only rule in the route table, the only place traffic can go is the local gateway. If you expect traffic to go somewhere else, you must add another rule. For example, assuming an internet gateway was set up and you expect any traffic not aimed at the local gateway to go to the internet gateway, you would add the following rule:

Destination    Target
10.0.0.0/16    local
0.0.0.0/0      igw-id

Where 0.0.0.0/0 means any traffic. Note that route tables do not evaluate rules in order: the most specific route wins (longest prefix match), so traffic to 10.0.0.0/16 stays local while everything else goes to the internet gateway.
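
A sketch of adding that second rule with the CLI (IDs are placeholders):

    # Send all non-local traffic to the internet gateway, which makes
    # the subnets associated with this route table public.
    aws ec2 create-route \
      --route-table-id rtb-0123456789abcdef0 \
      --destination-cidr-block 0.0.0.0/0 \
      --gateway-id igw-0123456789abcdef0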


  • VPC peering example:
Destination       Target
10.0.0.0/16       local
192.168.0.0/24    pcx-123456789

In the example above, all traffic matching the CIDR block 192.168.0.0/24 is routed from the current VPC through the peering connection pcx-123456789 to the peered VPC (more about this in the VPC peering section). 192.168.0.0/24 must also exist in the peered VPC. It often happens that this block is the entire external VPC CIDR block.

Security groups

  • Security groups are equivalent to firewall rules. They apply to many AWS services, but not all (which is confusing, as there is no rule to know when a service uses them or not).
  • SGs can only manage allow rules, NOT deny rules. For rules that deny traffic, use NACLs (Network Access Control Lists).
  • AWS services that require security groups:
    • EC2 instances
    • Services that launch EC2 instances:
      • AWS Elastic Beanstalk
      • AWS Elastic MapReduce
    • Services that use EC2 instances (without appearing directly in the EC2 service):
      • AWS RDS
      • AWS Redshift
      • AWS ElastiCache
      • AWS CloudSearch
    • Elastic Load Balancing
  • Rules are defined using the CIDR notation:
    • 0.0.0.0/0 means letting everything in (/0 means all IPs, 0.0.0.0 means any IP).
    • x.x.x.x/32 means restricting access to the single IP address x.x.x.x (/32 matches exactly one IP).
  • In order to let traffic into your Security Groups, make sure that the ports are open (Exam tip: You must know the Popular ports).
  • Changes saved in SGs take effect immediately.
  • By default, all inbound rules are disabled and all outbound rules are allowed.
  • You can't block access to IPs or ports using SGs (instead, you need Network Access Control List).
  • SGs are stateful: if traffic is allowed in by an inbound rule, the response traffic is automatically allowed back out. So even if you delete all outbound rules, the server will still be able to respond to traffic allowed by the inbound rules.
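
A minimal sketch of adding an allow rule to an SG (the group ID is a placeholder):

    # Allow inbound HTTPS from anywhere. Because SGs are stateful, the
    # response traffic is automatically allowed back out.
    aws ec2 authorize-security-group-ingress \
      --group-id sg-0123456789abcdef0 \
      --protocol tcp \
      --port 443 \
      --cidr 0.0.0.0/0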

Internet Gateway

  • An Internet Gateway is a logical connection between a VPC and the Internet.
  • It is not a physical or virtual device.
  • Only one can be associated with each VPC.
  • If a VPC does not have an Internet Gateway, then the resources in the VPC cannot be accessed from the Internet (unless the traffic flows via a corporate network and VPN/Direct Connect).
  • A subnet is deemed to be a Public Subnet if it has a Route Table that directs traffic to the Internet Gateway.

NAT Gateway & NAT instances

To know more about what NATs are and what they do, please refer to the NAT (Network Address Translation) section in the annex.

  • A NAT gateway or instance allows resources in a private subnet to make requests to the public internet without the inverse being possible, hence keeping the resource private.
  • A NAT MUST BE created in a public subnet.
  • NAT Gateways are specific to an Availability Zone.
  • Though it is possible to create a single NAT to serve multiple resources across multiple AZs, this is discouraged, as losing the AZ where the NAT lives would break those resources in other AZs. The recommended way is to create one NAT per AZ and configure the route table in each AZ so that resources in an AZ use a NAT in the same AZ.
  • There are 2 types of NAT in AWS:
    • NAT instances: These are EC2 instances that you provision yourself in a public subnet. You are responsible to maintain these yourself.
    • NAT Gateway: That's the AWS-managed solution that can replace the NAT instances.
  • An AWS NAT Gateway is a fully-managed AWS NAT service that supports 5 Gbps of bandwidth and can automatically scale up to 45 Gbps (while a self-managed NAT instance can only support 5 Gbps).
  • AWS NAT Gateway supports up to 55,000 concurrent connections (about 900 new connections/sec, i.e., ~55,000/min) per unique destination (i.e., combination of IP, port and protocol).
  • Supported protocols:
    • TCP
    • UDP
    • ICMP (used by routers and the ping command)
  • You do not control traffic via security groups or ACLs with NAT. Instead, you use ACLs on subnets to control the subnet traffic coming from NATs.
  • NATs can only be accessed from resources that are explicitly inside their subnet. This means that VPC peering, Site-to-Site VPN or Direct Connect won't be able to expose NATs to their resources.
  • NATs automatically receive a public IP.
  • NAT vs internet gateway:
    • An internet gateway is NOT a NAT and is NOT a physical device.
    • It is a logical connection between an Amazon VPC and the Internet.
    • There can only be one IG per VPC.
    • An IG has NO bandwidth limit.
  • The traditional rule in the route table for a service in a private subnet that wishes to access the internet looks like this:
Destination    Target
0.0.0.0/0      nat-gateway-id

As you can see, this rule applies to the entire private subnet. If you wish to control internet access for specific services in this private subnet, use NACLs on that subnet.
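
A sketch of the two steps involved (all IDs are placeholders):

    # 1. Create the NAT gateway in a PUBLIC subnet, backed by an Elastic IP.
    aws ec2 create-nat-gateway \
      --subnet-id subnet-0aaa11111111111aa \
      --allocation-id eipalloc-0bbb2222222222bb

    # 2. In the PRIVATE subnet's route table, default-route to the NAT.
    aws ec2 create-route \
      --route-table-id rtb-0ccc3333333333cc \
      --destination-cidr-block 0.0.0.0/0 \
      --nat-gateway-id nat-0ddd4444444444dd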

Security Group vs NACL

  • Security groups are tied to an instance whereas Network ACLs are tied to the subnet.
  • Network Access Control Lists (NACLs) are rules you associate with a subnet. This helps secure who can access a subnet. You can both whitelist and blacklist traffic.
  • State: Stateful vs Stateless:
    • Security groups are stateful: return traffic for an allowed connection is automatically allowed. e.g. If you allow incoming port 80, the responses on that connection are automatically allowed out.
    • Network ACLs are stateless: return traffic must be explicitly allowed. e.g. If you allow incoming port 80, you also need a rule allowing the outgoing response traffic.
  • Allow or Deny rules:
    • Security groups support allow rules only (everything not explicitly allowed is denied). e.g. You cannot deny a certain IP address from establishing a connection.
    • Network ACLs support allow and deny rules. With deny rules, you can explicitly block a certain IP address from establishing a connection, e.g., block IP address 123.201.57.39 from connecting to an EC2 instance.
  • Rule process order: A security group evaluates all of its rules before allowing traffic, whereas a NACL processes rules in number order (the rule with the lowest number gets processed first) and the first match wins.
  • Defense order: Security group first layer of defense, whereas Network ACL is second layer of the defense.
  • NACLs are more complicated to set up than SGs. Please refer to the Ephemeral ports section to learn more about one of these extra difficulties.
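
A sketch of the deny rule mentioned above (the NACL ID is a placeholder). Note the low rule number so it is evaluated before the generic allow rules:

    # Deny all traffic from a single IP; NACL rules are evaluated in
    # rule-number order, lowest first, and the first match wins.
    aws ec2 create-network-acl-entry \
      --network-acl-id acl-0123456789abcdef0 \
      --ingress \
      --rule-number 50 \
      --protocol -1 \
      --cidr-block 123.201.57.39/32 \
      --rule-action deny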

VPC Flow Logs

  • VPC Flow Logs is a feature that enables you to capture information about the IP traffic going to and from ENIs in your VPC.
  • Flow logs can be configured at 3 levels:
    • VPC: This will capture traffic of all ENIs in that VPC.
    • Subnet: This will capture traffic of all ENIs in that subnet only.
    • ENI: This will capture traffic of that ENI only.
  • Those logs can be sent to CloudWatch or S3.
  • It can log IP traffic (just IP) to and from network interfaces in your VPC.
  • Log example:
@ingestionTime	1615939365538
@log			990933012216:flow-log-goanna-poc
@logStream		eni-0b3761ff9ea9368b0-all
@message		2 990933012216 eni-0b3761ff9ea9368b0 195.54.161.152 172.30.0.216 56507 15328 6 1 40 1615939325 1615939336 REJECT OK
@timestamp		1615939325000

where the structure of @message depends on how the Flow Logs message format is configured. By default, it can be read as follows:

- version		2
- accountId		990933012216
- interfaceId	eni-0b3761ff9ea9368b0
- srcAddr		195.54.161.152
- dstAddr		172.30.0.216
- srcPort		56507
- dstPort		15328
- protocol		6
- packets		1
- bytes			40
- start			1615939325
- end			1615939336
- action		REJECT
- logStatus		OK
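
A sketch of enabling flow logs at the VPC level, shipping to CloudWatch via a role like the one shown in the IAM section (IDs and names are placeholders):

    aws ec2 create-flow-logs \
      --resource-type VPC \
      --resource-ids vpc-0123456789abcdef0 \
      --traffic-type ALL \
      --log-destination-type cloud-watch-logs \
      --log-group-name flow-log-demo \
      --deliver-logs-permission-arn arn:aws:iam::123456789012:role/flow-logs-role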

Peering

Please refer to the VPC peering section.

EC2

EC2 overview

  • EC2 stands for Elastic Compute Cloud.
  • Behind the scenes, EC2 uses one of the following hypervisors:
    • Xen (legacy)
    • Nitro (new)
  • The tenancy defines whether the instance runs on dedicated hardware or shared (default) hardware.
  • 4 pricing models:
    1. On-demand (pay as you go (hour or second))
    2. Reserved (pay upfront, up to 3 years). Minimum 1 year commitment. There are 3 types of reserved instances:
      1. Standard
      2. Convertible
      3. Scheduled if the RI must only run at specific time of the day.
    3. Spot (if you cancel the instance, you are charged for the hour; if EC2 cancels it, the partial hour is not billed)
    4. dedicated - There are two types:
      • dedicated instance:
        • Your EC2 instance runs on dedicated hardware that can be shared with YOUR other EC2 instances.
        • Available as reserved, spot and on-demand.
      • dedicated host:
        • Your EC2 instance has exclusive hardware allocation (best for running software with specific licenses).
        • Only available as reserved.
  • There is an awful lot of EC2 instance types. Here is a mnemonic: FIGHT DR MC PXZ
    • F: FPGA (field programmable gate array), used in biotech. With those, you can reprogram your hardware to accelerate your app.
    • I: IOPS
    • G: Graphics or GPU
    • H: High Disk Throughput
    • T: T2 micro, i.e., cheap
    • D: Dense storage. 10s to 100s of GB RAM and PT disk. Used for huge data processing.
    • R: RAM
    • M: Main purpose
    • C: CPU
    • P: Picture analysis, which is a trick to refer to ML. These are tensor flow GPU equipped instances.
    • X: Xtreme memory
    • Z: Xtreme memory & CPU
  • The main ways to connect to your EC2 instance are:
    1. SSH on port 22 for Linux (need a private key .pem file)
    2. Remote Desktop Protocol on port 3389 for Windows (need a private key in a .ppk file (which you can convert from the .pem file))
    3. HTTP on port 80 and HTTPS on port 443
  • Termination protection is a flag on your EC2 config that, if turned on, will prompt you to confirm whether or not you are sure you want to delete the EC2 instance. This is to prevent accidental EC2 termination. It is turned off by default.
  • Though you can configure the EC2 instance's AWS CLI with access keys and secrets so it can perform some actions, this is highly discouraged: a hacker could steal those details to hack your AWS account. Instead, it is suggested to use roles. Create a role in IAM by selecting the right service (e.g., EC2) and then the right level of access (e.g., S3 read/write). Then select your EC2 instance in the console and attach that role. The role will work immediately; however, you need to make sure that you've flushed any hardcoded AWS CLI config from the EC2 instance first.
  • When you have a shit ton of EC2 instances, you have what's called an EC2 Fleet. To alleviate the maintenance of a fleet, you can use AWS Systems Manager, which is a simple agent installed on all instances of your fleet. It can then run the same command or patch on all instances at once (see the sketch after this list). It works on-premise too. It also integrates with CloudWatch.
  • Placement Groups is a feature that enables EC2 instances to interact with each other via high bandwidth, low latency connections. It helps to influence how EC2 deploys new instances. By default, EC2 will optimize for fail over by distributing new instances in different AZs, but in some cases you want to have control over this, and that's where placement groups come in. There are 3 types of placement groups:
    1. Clustered:
      • Description: You want your EC2 instances to be in the same AZ to minimise latencies.
      • Use case: Low network latency / high network throughput.
    2. Partitioned:
      • Description: You want your EC2 instances spread across logical partitions, such that instances in the same partition share the same hardware, isolated from the other partitions.
      • Use case: Hosting Cassandra, Hadoop, HDFS system, ...
    3. Spread:
      • Description: Spread you EC2 instances on different hardware.
      • Use case: Minimize correlated failure.
    • Important facts:
      • Clustered PGs cannot span multiple AZs, while Partitioned and Spread PGs can.
      • A PG's name in an account must be unique.
      • Only the following type of EC2 instances can be launched in a PG:
        • Compute optimized.
        • Memory optimized.
        • Storage optimized.
      • AWS recommends homogenous instances (i.e., instances of the same size and family) within a clustered PG.
      • You can't merge PGs.
      • You cannot move an existing instance inside a PG. To migrate one, create an AMI and then launch a new instance from that AMI.
      • PGs allow the use of Jumbo Frames, which increase the MTU (Maximum Transmission Unit) from 1,500 bytes to 9,001 bytes.
  • Once you've SSH'd into an instance, you can get metadata of the instance or the user data it was launched with using the following commands:
    • curl http://169.254.169.254/latest/meta-data
    • curl http://169.254.169.254/latest/user-data
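
A sketch of the Systems Manager fleet-wide command mentioned above (the tag key/value are placeholders; the instances must run the SSM agent and have an instance profile allowing SSM):

    # Run the same patch command on every instance tagged Fleet=web,
    # without SSHing into any of them.
    aws ssm send-command \
      --document-name "AWS-RunShellScript" \
      --targets "Key=tag:Fleet,Values=web" \
      --parameters 'commands=["sudo yum update -y"]'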

EC2 auto-scaling

  • To scale out EC2 instances you need two things:
    1. Launch configuration: This configures the AMI, instance type, VPC and SG that the EC2 instances use.
    2. ASG (Auto-Scaling Group): The auto scaling group uses a launch configuration to launch EC2 instances. Keep in mind that:
      • There is no guarantee that the auto-scaling group maintains an exactly homogeneous EC2 count across multiple AZs. What it will do is make sure that there is the right number of instances overall.
      • If you need homogeneity of your EC2 count across AZs, you need to add reserved instances in those AZs.
      • The ASG is responsible for monitoring the health of all instances via health checks and replace the ones that are unhealthy.
      • There are 4 ways to configure an ASG's scaling behavior:
        1. Maintain: Keep a minimum specific number of instances in place.
        2. Manual: Set up min. max. or specific number of instances.
        3. Schedule: Increase or decreased based on schedule.
        4. Dynamic: Have rules (CPU, memory, traffic) to scale. The dynamic type uses what's called auto scaling policies (see the sketch at the end of this section):
          1. Tracking target: e.g., 70% above max CPU utilization.
          2. Simple scaling: Keep adding instances after the health check and cooldown periods expire.
          3. Step scaling: For more custom logic.
  • Launch configuration cannot be updated. If you need to change it, you have to create a new one.
  • The cooldown period (default 300 sec.) is the amount of time to wait before adding a new instance. This prevents over-provisioning: instead of adding many instances at once, we add them one by one, and each time a new instance is added, we wait a little (the cooldown period) before measuring the system as a whole and deciding whether we need more. It is automatically applied with dynamic scaling, optional with manual scaling, but not supported for scheduled scaling.
  • The health check grace period is the time AWS waits after an EC2 instance is launched before doing a health check.
  • AWS Auto Scaling is a service that can control the auto scaling policies for non-EC2 services (e.g., DynamoDB).
  • AWS Predictive Scaling uses ML to help you predict when you should scale and how.
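
A sketch of a "tracking target" policy (the ASG name is a placeholder):

    # Keep the fleet's average CPU at ~70%; AWS adds/removes instances
    # (respecting the cooldown logic described above) to hold that target.
    aws autoscaling put-scaling-policy \
      --auto-scaling-group-name web-asg \
      --policy-name cpu-70 \
      --policy-type TargetTrackingScaling \
      --target-tracking-configuration '{
        "PredefinedMetricSpecification": { "PredefinedMetricType": "ASGAverageCPUUtilization" },
        "TargetValue": 70.0
      }'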

Exam tips

  • How does an ASG know whether an EC2 instance is healthy or not? There are 3 ways to report instance status to an ASG:
    • EC2 instance: In that case, any state other than running is considered unhealthy. This means that restarting the instance will have as a side-effect to terminate the instance and restart a new one.
    • ELB: The ELB pings the instances to know if they are OK.
    • Custom health check

EC2 provisioning model

  1. On-demand: This is the usual default pricing model. You pay for what you use.
  2. RI (Reserved Instance):
    • Two scopes:
      • Regional(default):
        • EXAM This scope DOES NOT RESERVE CAPACITY!!! Instead, if an instance is available, the discount is applied when it is used.
        • It does not guarantee capacity in specific AZs. Instead, the instance is used in whichever AZ it is available.
        • 20 max/region (can be increased)
        • Can change the size if same family and if Linux/Unix and shared tenancy
      • Zonal:
        • This option allows to guarantee capacity in a specific AZ.
        • 20 max/AZ (can be increased)
        • Cannot change size

      EXAM TIP: If you need guaranteed reservation, you must choose the zonal scope.

    • Two classes:
      • standard:
        • Can change the size (exam: only for Linux in regional mode with default tenancy (i.e., shared)), AZ and networking type.
        • To apply changes use the ModifyReservedInstances API or the console
      • convertible:
        • Same as standard + change OS, instance family, tenancy, payment options.
        • More expensive than standard.
        • Cannot be resold on the secondary market
        • Can only exchange for equal or higher price
        • To apply changes use the ExchangeReservedInstances API or the console
    • There are 3 different payment options:
      1. AURI (All Upfront)
      2. PURI (Partial Upfront)
      3. NURI (No Upfront)
  3. Spot: You specify a bid price, and when the instance's price falls below that bid, the instances are yours (see the sketch at the end of this section). There are two request types:
    • one-time: When the spot instance is terminated by AWS because a higher bidder appeared, the processes are killed and the data are lost.
    • persistent: Same as one-time request except that the context is preserved, and when I get the lease back, my processes and data are restored.
  4. Reserved on-demand:
    • Allows to guarantee on-demand capacity in a specific region without having to commit to 1 year.
    • No discount. This is charged at the same price as on-demand. However, it is possible to leverage Savings Plans.
    • AZ specific only. This pricing option is not available across an entire region.
  5. Savings Plans (released in Nov. 2019):
    • Aims to simplify the savings. Before this option was released, mixing the above 4 options to reach optimal cost savings had become a total mess of incomprehensible complexity.
    • Same as RI convertible + region flexibility.
    • As RI, you must choose between 1 or 3 years commitment.
    • Supports EC2, Fargate, Lambda but NOT RDS.
    • Two packages:
      • Compute: The most flexible and broadest package (EC2, Fargate, Lambda), but the least discounted.
      • EC2: Cheapest but less flexible (cannot change the region or family).
    • Cannot be resold on the secondary market
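
A sketch of a one-time spot request using the run-instances market options (AMI, instance type and price are placeholders):

    # Cap the price you are willing to pay; AWS may reclaim the instance
    # if capacity runs out or the price rises above your max.
    aws ec2 run-instances \
      --image-id ami-0123456789abcdef0 \
      --instance-type m5.large \
      --instance-market-options \
        'MarketType=spot,SpotOptions={MaxPrice=0.05,SpotInstanceType=one-time}'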

EC2 pricing

Pricing variables

  1. Clock hours (based on the instance type you pay per hours or per seconds)
  2. Instance type
  3. Pricing model (reserved, spot, on-demand, dedicated)
  4. Number of instances
  5. Load balancing (network LB is more expensive than App LB)
  6. Detailed monitoring is more expensive than the standard 5 min.
  7. Auto scaling
  8. Elastic IP addresses: each newly provisioned IP costs something.
  9. OS (Windows more expensive than Linux)

Pricing model

The five pricing models (on-demand, RI, spot, reserved on-demand, Savings Plans) are the ones described in the EC2 provisioning model section above. A few additional pricing facts about RIs (Reserved Instances):

  • They apply to EC2 and RDS.
  • The only two options are a 1-year (36% to 40% saving) or 3-year (56% to 62% saving) commitment.
  • Above $500,000 of reserved instances in a single region, you receive a 5% discount on your next RI. Above $4,000,000, you receive 10%.
  • There are 3 different payment options:
    1. AURI (All Upfront)
    2. PURI (Partial Upfront)
    3. NURI (No Upfront)

Lightsail

Lightsail is a weird offering that fits neither in EC2 nor in serverless.

  • It is a PaaS from AWS. Though it uses EC2 behind the scenes, you won't see those instances if you go to EC2.
  • Lightsail is a packaged server for very cheap hosting with a simplified interface (meant to compete with DigitalOcean).

ELB

ELB overview

  • Elastic Load Balancer
  • To use an ELB, you need three components (see the sketch at the end of this section):
    1. ELB
    2. Target Group: That's a group that contains resources that can consume the traffic.
    3. Listener: That's a rule that is attached to the ELB that defines which traffic goes to which target group.
  • ELB provisions load balancers in each AZ that you enabled. It's important to understand that ELB is a regional concept over a zonal physical reality. That's why the next concept, called cross-zone, is also important.
  • cross-zone:
    • This is the ability for ELB to route traffic equally to all machines across AZs. Otherwise, the traffic is split equally between AZs, which means that if an AZ has fewer instances than another, the instances in that AZ will be more stressed.
    • Whether this feature is toggled by default depends on the type of ELB:
      • Application LB: Cross-zone toggled by default.
      • Network LB: Cross-zone NOT toggled by default.
      • Classic LB: It depends on the method used to create the ELB:
        • API or CLI: Cross-zone NOT toggled by default.
        • Console: Cross-zone toggled by default.
  • There are three types of ELB:
    1. Classic:
      • Legacy AWS load balancer.
      • Supports:
        • TCP
        • SSL/TLS
        • HTTP/S (via HTTP/1.1)
      • DOES NOT support:
        • WebSocket
        • HTTP/2
      • Operates at both layer 4 and layer 7 of the OSI (Open Systems Interconnection) model.
    2. Application:
      • Default choice for Web Apps.
      • Operates at layer 7 of the OSI model.
      • Supports:
        • HTTP/S (via HTTP/2)
        • WebSocket
      • DOES NOT support:
        • Layer-4 (TCP/UDP) load balancing
    3. Network:
      • Use it for systems that must provide high performance even with millions of connections per second.
      • Operates at layer 4 of the OSI model.
      • Supports:
        • UDP
        • TCP
        • TLS

    More about the differences between those three types in the next section.

  • ELB can route the same traffic to different destinations based on weight. The weight can be any number; it does not go from 0 to 1, for example. Instead, the sum of all weights makes up 100%, and the portion of traffic that goes to one destination or another is determined as that destination's weight divided by the total weight.
  • EXAM ELB does not support mutual authentication. If you need to implement mutual authentication while using an ELB, the only option is to use a Network Load Balancer because the ALB always terminates SSL while NLB supports TCP/443 and won't modify the request header so that the request can be mutually authenticated by the backend. As of Sep 2020, API Gateway supports mutual TLS authentication, so moving forward, this could be the way.
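
A sketch of wiring up the three components listed at the top of this section (names, IDs and ARNs are placeholders):

    # 1. The load balancer itself, spanning two AZs.
    aws elbv2 create-load-balancer \
      --name web-alb --type application \
      --subnets subnet-0aaa11111111111aa subnet-0bbb2222222222bb

    # 2. A target group holding the resources that consume the traffic.
    aws elbv2 create-target-group \
      --name web-tg --protocol HTTP --port 80 \
      --vpc-id vpc-0123456789abcdef0 --target-type instance

    # 3. A listener tying the two together.
    aws elbv2 create-listener \
      --load-balancer-arn <ALB_ARN> --protocol HTTP --port 80 \
      --default-actions Type=forward,TargetGroupArn=<TG_ARN>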

Classic vs Application vs Network ELBs

In most cases, you should use the Application or Network load balancer rather than the Classic. The Classic does not support the following features:

  • WebSocket (only for Application LB)
  • HTTP2 (only for Application LB)
  • Elastic IP (i.e., static IP in the Cloud that guarantees that the service's IP does not change even if the service is reprovisioned). Elastic IP is only available with Network LBs.

If you have a standard Web App, most likely, you'll use the Application LB. The key features of that LB are:

  • WebSocket
  • HTTP2
  • Layer 7 in the OSI, which means you can route based on the HTTP header.

Reasons to use Application LB over Classic LB:

  • You need HTTP2 or WebSocket.

Reasons to use Application LB over Network LB:

  • You need HTTP/S or WebSocket.
  • You need to route based on the HTTP header.
  • You need to support sticky session.

Reasons to use Network LB over Application LB:

  • You need static IP out-of-the-box (if you still need an Application LB with static IP, you need to setup extra services. The most common solution is to use AWS Global Accelerator. To learn more about this, please refer to the article titled Using static IP addresses for Application Load Balancers).
  • You need to deal with millions of TCP/UDP requests per second.
  • You need to deal with TCP or UDP.

Exam questions

  • Which ELB type uses round robin?
    • The Classic LB only uses it for TCP listeners.
    • The ALB first selects a target group based on the routing rule, then uses a round-robin strategy to select a node.

EBS

  • Elastic Block Store is your virtual disk in the cloud. There are 5 different types of EBS:
    1. GP2 (SSD) - General purpose SSD
    2. IO1 (SSD) - Optimized for IOPS (good for DB)
    3. ST1 (Magnetic) - Low cost HDD (good for big data and data warehouse)
    4. SC1 (Magnetic) - Cold HDD for less frequently accessible files (good for files server)
    5. Standard (Magnetic) - (good for infrequent workloads)


  • Max size is 16TiB
  • The killer feature is snapshots, as they are very cost-effective and do not take much space. Snapshots are saved to S3. They differ from copies by being incremental: a snapshot only stores the difference between the previous snapshot and the current volume.
  • EBS volume CANNOT be copied. Instead, use a snapshot.
  • EBS volumes must be created in the same AZ as their associated EC2 instances.
  • To manage snapshots, use the AWS Data Lifecycle Manager.
  • EBS is not available cross-region or cross-AZ. This means you must snapshot volumes regularly if you wish to improve business continuity.
  • Sharing a snapshot:
    • Same region & non-encrypted: That's the easiest. Just modify the permissions. You can share with other AWS accounts, or even make it public.
    • Different region & non-encrypted: You have to do the same as above in terms of permissions, but there is an extra step for the recipient: they will have to find that snapshot in the original region, then manually copy it to the region of their choice.
    • Encrypted snapshot:
      • A snapshot encrypted with a default CMK (one generated by AWS for you) cannot be shared. You will have to decrypt it, then re-encrypt it with a customer managed CMK (in practice, by copying the snapshot and re-encrypting the copy).
      • When you share an encrypted snapshot, you must also share the customer managed CMK that was used for that snapshot. Users of your shared CMK must have the following four permissions:
        1. kms:DescribeKey
        2. kms:CreateGrant
        3. kms:GenerateDataKey
        4. kms:ReEncrypt
  • Though there is an option to use RAID for EBS, this is usually not necessary because:
    • EBS supports high availability through server replication within the AZ.
    • EBS supports high IOPS; you can even choose an EBS-optimized instance.
  • The pricing model works with pre-provisioning. That means that you may pay more than you use. That's to be opposed to S3 or EFS where you pay for what you use.
  • EBS can be updated (both type and size) on the fly (i.e., no need to stop the EC2 instance).
  • How do you move EBS volumes and EC2 instances to other AZs or regions? (see also the sketch at the end of this section)
    1. Create a snapshot of the root EBS.
    2. Create an AMI from the snapshot.
    3. (only for other region), copy that AMI to another region.
    4. Create a new EC2 instance from that AMI.
  • When an EC2 instance is terminated, by default, its root EBS (i.e., the one with the OS) is also deleted, HOWEVER additional volumes are not.
  • Historically, the root EBS volume could not be encrypted out-of-the-box (you had to use a 3rd-party tool such as BitLocker); today it can, as shown in the exam questions below. Additional volumes have always supported encryption.
  • Instance stores (aka ephemeral storage) are an alternative to EBS if you need high IOPS at low latency. The difference with EBS is that the instance store is physically attached to the instance's host, whereas EBS volumes live outside of the instance. That's how instance stores offer better IOPS performance. However, it comes at one big cost: if the instance fails, you lose all the data in that storage too (hence why it is also called ephemeral storage).
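
A sketch of the snapshot/copy flow used for cross-region moves and encrypted sharing (IDs, regions and the key alias are placeholders):

    # 1. Snapshot the volume (snapshots land in S3, incrementally).
    aws ec2 create-snapshot --volume-id vol-0123456789abcdef0

    # 2. Copy it to another region, re-encrypting it with a customer
    #    managed CMK in the destination region along the way.
    aws ec2 copy-snapshot \
      --source-region us-east-1 \
      --source-snapshot-id snap-0123456789abcdef0 \
      --region eu-west-1 \
      --encrypted \
      --kms-key-id alias/my-cmk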

Exam questions

  • How to encrypt EBS volumes (including the root volume):
    1. Use case 1 - Encrypting the root volume during the EC2 creation.
      1. Select an AMI.
      2. Select encrypted.
    2. Use case 2 - Encrypt an existing unencrypted EC2 instance:
      1. Take a snapshot of the root volume.
      2. Create an AMI from the snapshot, and select encrypt.
      3. Start a new EC2 instance from the encrypted AMI.
  • How to move an encrypted volume from one account to the other?

EFS

  • Elastic File System (EXAM: Does not support Windows!!!)
  • This is an alternative to EBS. It is considered a Storage service like S3. It is more expensive but has multiple advantages over EBS:
    • Allows multiple EC2 instances to access the same EFS.
    • Allow on-premise servers to access EFS.
  • Uses the NFSv4 protocol.
  • The pricing model is pay-for-what-you-use (as opposed to EBS, where you have to pre-provision). However, EFS is roughly 3 times more expensive than EBS and 20 times more expensive than S3.
  • Can scale up to petabytes.
  • Data is stored across multiple AZs within a region.
  • Read after write consistency.
  • EFS supports cross-region sync.
  • Can be mounted on-premise, but be careful with security (recommended to use AWS DataSync).
  • A popular use case is to sync files (e.g., webserver) securely between on-premise and AWS using EFS with DataSync, then mount those EFS in multiple AZs for high-availability.
  • Amazon FSx is similar to EFS but for HPC (High Performance Computing). It supports Windows and Linux (for Linux, you must use Amazon FSx for Lustre; Lustre is an open-source file system for HPC).
  • EBS vs EFS: In a nutshell, only use EFS if you need a big-data solution that must be accessed by multiple instances.

S3

  • The key aspects of S3 are:
    1. Key
    2. Value
    3. Version ID
    4. Metadata
    5. Subresources
      1. Access control
      2. Torrent
  • Supports cross-region replication
  • Support the BitTorrent protocol
  • Supports the Requester Pays feature, but it only works if the requester has an AWS account.
  • Interesting use cases:
    • Data lake when combined with Athena or Redshift spectrum or Quicksight
    • IoT streaming data when combined with Kinesis firehose
    • ML/AI when combined with Rekognition, Lex or MXNet
    • Analytics when combined with S3 Management Analytics
  • Up to 5TB per file. Can store 0-byte files. However, the biggest object that can be uploaded in a single operation is 5GB. If you need to upload bigger objects, use multipart upload (recommended for files larger than 100MB).
  • Read-after-write consistency for PUTs of new objects (if you add a file, you can read it immediately); eventual consistency for overwrite PUTs and DELETEs.
  • Access Control Lists allows to manage access control for specific files, where Bucket Policies applies to entire bucket.
  • A bucket's URL is as follows: https://s3-<region>.amazonaws.com/<bucket-name>
  • 7 different types of storage. All of them are 99.999999999% (11 9's) durable except RRS, which is 99.99%.
    1. Standard: 99.99% avail., 11 9's durable, stored redundantly across multiple devices in multiple facilities.
    2. IA (Infrequent Access): Same as standard but cheaper if you access data infrequently (retrieval fee).
    3. One Zone - IA: Cheapest option. Like IA but cheaper because there is only one facility.
    4. RRS (Reduced Redundancy Storage): This is LEGACY, replaced by IA. It is a less durable storage type (99.99% durable). Use RRS if you are storing non-critical data that can be easily reproduced.
    5. Intelligent Tiering: Uses ML to move your data cost efficiently across the right tier
    6. Glacier: Archive files. Retrieval time is configurable from minutes to hours.
    7. Glacier Deep Archive: Like Glacier, but retrieval time starts at 12 hours.
  • To speed up uploads globally, toggle Transfer Acceleration on your bucket. This routes uploads through the AWS edge locations, which then use the AWS private backbone to reach your bucket.
  • There are 3 ways to restrict bucket access:
    1. Bucket policies
    2. Object policies
    3. IAM policies to users or groups
  • To make all files public, you have to:
    • Make the bucket public
    • Either manually make each object public, or add a new policy to your bucket that says that all object should be public.
  • You can use S3 Lifecycle Management to define policies for when to move objects to different tiers. You can also use tags so that a policy is applied to any S3 bucket that has those tags.
  • Signed URLs:
    • This is a CloudFront feature that allows attaching an access policy to a file. Example:
      • Which user or IP can read the file.
      • How long is the file available.
    • There are two types of policies:
      • Custom: It offers all the features.
      • Canned: It only supports a single file and the expiration date.
  • S3 supports the following 6 features to protect your data:
    1. Permissions: Use bucket-level or object-level permissions alongside IAM policies to protect resources from unauthorized access and to prevent information disclosure, data integrity compromise or deletion.
    2. Versioning: Amazon S3 supports object versions. Versioning is disabled by default. Enable versioning to store a new version for every modified or deleted object from which you can restore compromised objects if necessary.
    3. Replication: Amazon S3 replicates each object across all Availability Zones within the respective region. Replication can provide data and service availability in the case of system failure, but provides no protection against accidental deletion or data integrity compromise – it replicates changes across all Availability Zones where it stores copies.
    4. Backup: You can use application-level technologies to manually back up data stored in Amazon S3 to other AWS regions or to on-premises backup systems.
    5. Encryption – server side: (not automatic, you have to toggle it) Amazon S3 supports server-side encryption of user data. Server-side encryption is transparent to the end user. AWS generates a unique encryption key for each object, and then encrypts the object using AES-256.
    6. Encryption – client side: With client-side encryption you create and manage your own encryption keys. Keys you create are not exported to AWS in clear text. Your applications encrypt data before submitting it to Amazon S3, and decrypt data after receiving it from Amazon S3. Data is stored in an encrypted form, with keys and algorithms only known to you.
  • Note: the globally available read/write feature named Global Tables belongs to DynamoDB, not S3; for globally available S3 data, use cross-region replication.
  • You are billed based on the following 6 aspects:
    1. Storage
    2. Request
    3. Storage management
    4. Data transfer
    5. Transfer acceleration
    6. Cross-region replication
  • There are 3 types of encryption:
    1. Encryption in transit, which means the file is encrypted while transiting. But before and after, it is in the clear. The typical transit encryption mechanism is SSL.
    2. Client side encryption which means the client is responsible for encrypting the file before storing it to S3.
    3. Server side encryption (aka SSE), which means the server encrypts the file (see the sketch after this list). There are 3 ways of implementing it:
      1. S3 Managed Keys (aka SSE-S3) where AWS manages the keys for you.
      2. AWS Key Management service (aka SSE-KMS) where you and AWS manage the keys.
      3. Server Side Encryption with customer provider keys (aka SSE-C)
  • A few notes about replication:
    • Versioning must be enabled.
    • Files existing prior to toggling replication are not replicated.
    • Delete markers are not replicated (a delete marker is a placeholder for a versioned object that is used when that object is deleted; in reality, versioned objects are not physically deleted, they just appear so).
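
A sketch of enforcing SSE-KMS as the default server-side encryption on a bucket (the bucket name and key alias are placeholders):

    # Every new object is encrypted with the given CMK unless the
    # request specifies otherwise.
    aws s3api put-bucket-encryption \
      --bucket my-bucket \
      --server-side-encryption-configuration '{
        "Rules": [{
          "ApplyServerSideEncryptionByDefault": {
            "SSEAlgorithm": "aws:kms",
            "KMSMasterKeyID": "alias/my-cmk"
          }
        }]
      }'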

Exam questions

  • How to restrict access to the public to all S3 files in a bucket and only let CloudFront access it?
    • Create an OAI (Origin Access Identity) user in CloudFront.
    • Update the S3 bucket policy so that only the OAI user can read the bucket.

Glacier

  • Data is encrypted by default. There are 3 types of Glacier retrieval times:
    1. Expedited : 1–5 minutes
    2. Standard : 3–5 hours
    3. Bulk : 5–12 hours
  • Core component is the Vault lock. Glacier Vault Lock is an immutable way to set policies on Glacier vaults.
  • Policies regulate the object lifecycles rules, while the access are managed by IAM.
  • Vault lock policies are important for compliance and regulation scenarios.
  • Applying a vault lock policy is an asynchronous process: initiating the lock returns a lock ID, and you then have up to 24 hours to confirm (complete) the lock using that ID, otherwise the policy is not applied (see the sketch after this list).
  • Objects in a locked vault are immutable: an archive cannot be modified, though it can be deleted or re-uploaded as a new version.
  • Max object file is 40TB.
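
A sketch of the two-step lock process described above (the vault name is a placeholder; the JSON file wraps the lock policy document):

    # 1. Initiate the lock; the response contains the lock ID.
    aws glacier initiate-vault-lock \
      --account-id - --vault-name compliance-vault \
      --policy file://vault-lock-policy.json

    # 2. Confirm within 24 hours (or abort-vault-lock to start over).
    aws glacier complete-vault-lock \
      --account-id - --vault-name compliance-vault \
      --lock-id <LOCK_ID_FROM_STEP_1>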

CloudWatch

  • You can only log to log groups. This means that if you need to log, you first need to create a log group, then point to it (see the sketch at the end of this section).
  • Metrics can be created based on logs in a log group. You cannot log directly to a CloudWatch metric; that does not make any sense (exam fallacy).
  • Used for performance monitoring. It usually consists of those 4 metric types:
    • CPU
    • Network
    • Disk
    • Status check (e.g., Route 53 health check)
  • CloudTrail vs CloudWatch
    • Use CW to monitor performance.
    • Use CT to monitor AWS API calls (usually used for auditing).
  • Can monitor most of AWS as well as your apps on AWS.
  • By default CW on EC2 runs every 5 min, but you can change that to 1 min with detailed monitoring on.
  • You can trigger notifications with CW Alarm
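
A sketch of the two building blocks above: a log group to log into, and an alarm on a standard EC2 metric (names, instance ID and topic ARN are placeholders):

    # You can only log to log groups, so create one first.
    aws logs create-log-group --log-group-name my-app-logs

    # Alarm when an instance averages > 80% CPU for two 5-min periods.
    aws cloudwatch put-metric-alarm \
      --alarm-name cpu-high \
      --namespace AWS/EC2 --metric-name CPUUtilization \
      --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
      --statistic Average --period 300 --evaluation-periods 2 \
      --threshold 80 --comparison-operator GreaterThanThreshold \
      --alarm-actions <SNS_TOPIC_ARN>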

KMS

  • Key Management Service
  • CMK (Customer Master Key). This is the core asset managed by KMS:
    • Two families:
      • Symmetric: 256-bits CMK used for encryption/decryption only.
      • Asymmetric: There are 2 options:
        • RSA for both encryption/decryption and signing/verification.
        • ECC curve for signing/verification only.
    • Three types:
      • Customer managed CMKs: These are the ones that you create and manage yourself in your account.
      • AWS managed CMKs: These are the ones that AWS automatically creates as part of other AWS services. They are prefixed with aws/<SERVICE NAME>.
      • AWS owned CMKs: These are global cross-account CMKs that AWS uses across multiple accounts. You cannot view or manage them.
  • KMS can only encrypt/decrypt data no larger than 4KB
  • To encrypt/decrypt data larger than 4KB, use data keys. KMS can generate those data keys (symmetric or asymmetric), but it's up to the user to store them. The workflow to encrypt/decrypt large data with a data key is called envelope encryption (see the sketch at the end of this section) and it works as follows:
    1. Create a new CMK or use an existing one.
    2. Generate a new data key. This will return:
      • plain text data key
      • encrypted data key called a CipherTextBlob. This encrypted key contains metadata so that KMS knows which CMK was used.
    3. Use the plain text data key to encrypt the large data.
    4. Get rid of the plain text data key for security reason and store the CipherTextBlob somewhere safe for decryption.
    5. To decrypt, use the CipherTextBlob with KMS's Decrypt API to get a new plain text data key (this is the same value as the one we disposed in step 4).
    6. Use the plain text data key to decrypt the large encrypted data.
    7. Get rid of the plain text data key for security reason.
  • Key policies are used to define who can use and manage your CMKs. You cannot edit key policies for AWS managed CMKs.
  • Grant:
    • They are an alternative to key policies. They are more granular.
    • They are eventually consistent, so if you need to use them immediately, use the grant token returned by certain APIs.
  • The process to use your own key with KMS for S3 or EBS or RDS works as follows:
    1. Create a new CMK in KMS with no material (material refers to the master key bits and pieces).
    2. Generate your key and encrypt it with a wrapping key.
    3. Import the encrypted key with the import token.
    4. When creating an S3 bucket or EBS or RDS, select the CMK.
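
A sketch of the envelope encryption workflow described above, from the CLI (the key alias is a placeholder; the local encryption step is delegated to whatever tool you prefer):

    # 1. Ask KMS for a data key: the response contains a Plaintext key
    #    and an encrypted copy of it (the CiphertextBlob).
    aws kms generate-data-key --key-id alias/my-cmk --key-spec AES_256

    # 2. Encrypt the large file locally with the plaintext key, then
    #    discard the plaintext key and store only the CiphertextBlob.

    # 3. Later, hand the CiphertextBlob back to KMS to recover the same
    #    plaintext key and decrypt the file locally.
    aws kms decrypt --ciphertext-blob fileb://ciphertext-blob.bin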

Database services

NoSQL

DynamoDB

  • Fully-managed NoSQL DB.
  • Max record is 400KB.
  • Stored on SSDs, spread across 3 geographically distinct data centres.
  • DynamoDB is broken down into 10GB partitions with 3,000 RCU (Read Capacity Units) and 1,000 WCU (Write Capacity Units) each. To know how many partitions you need, pick the higher number between what you need for storage and what you need for throughput. For example, if you need to store 15GB of data, you need 2 partitions. But if on top of that you need 6,000 reads per second, then you need 3 partitions.
  • Supports 2 different types of read consistency:
    • Eventual read consistency (default). This is the fastest type of read, but it does not guarantee that the read is the latest.
    • Strongly consistent. That's an option you pass to the read request. It is slower because it waits until all the writes have been completed (this is not available on global secondary indexes).
  • Why is DynamoDB called schemaless when I'm forced to define a schema for each table? The answer has been documented in the Is DynamoDB truly schemaless? section under the annex.
  • Main operations:
    • GET: Super fast based on the primary key.
    • POST: Insert data.
    • Query: Super fast based on the primary or secondary index.
    • Scan: Fast to slow. It scans all properties.
  • Supports 3 types of indexes:
    1. Primary key (aka HASH): That can be an attribute (aka partition) (not nested) or a composite made of a partition and a sort attribute (aka range).
    2. Global secondary index (GSI):
      • It is technically another physical table fully managed by AWS. The user does not explicitly see that table; AWS simply manages the index.
      • This index can be totally different from the primary key.
      • Up to 20 different GSIs per table can be configured.
      • They can be added after the table creation.
      • They incur twice the throughput for both reads and writes. Indeed, the read happens first on the index table, then on the base table, while the writes happen first on the base table and then on the index table.
      • Use case: The client requires to access and sort the data using totally different attributes than the primary key.
    3. Local secondary index (LSI):
      • Also uses a separate physical table, like the GSI.
      • This index must be composite, the partition must be the primary key and the range can be any other attribute.
      • Up to 5 different LSIs per table can be configured.
      • They CANNOT be added after the table creation.
      • They incur twice the write throughput
  • Trick: Use a secondary index to replace the need to create a new table with specific features. This allows you to:
    1. Create read replicas (eventually consistent though).
    2. Create a special tier for premium customers.
  • DynamoDB supports a TTL (Time To Live) feature that allows it to automatically drop certain records based on a specific field (typically a timestamp). The advantage is that this process does not consume any read/write capacity (see the sketch after this list).
  • DAX (DynamoDB Accelerator) is the built-in caching system to speed up DynamoDB. It provides 2 main advantages:
    1. Increased read performance (from milliseconds to sub-milliseconds).
    2. Decreased costs by avoiding over-provisioned READ throughput.
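
A minimal sketch (boto3) contrasting the read options above: a strongly consistent query on the primary key vs an (always eventually consistent) query on a GSI. The table and index names are hypothetical:

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("orders")  # hypothetical table

# Strongly consistent read on the primary key (not available on GSIs).
latest = table.query(
    KeyConditionExpression=Key("customerId").eq("c-123"),
    ConsistentRead=True,
)["Items"]

# Query on a GSI: always eventually consistent.
shipped = table.query(
    IndexName="status-index",  # hypothetical GSI
    KeyConditionExpression=Key("status").eq("SHIPPED"),
)["Items"]
```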

RDS

  • Relational Database Service.
  • Though RDS runs on virtual machines similar to EC2, you cannot SSH into them. You cannot patch them either. Patching and managing the OS is AWS' responsibility.
  • Except Aurora, RDS is not serverless.
  • RDS has 2 main features:
    • Multi AZ for disaster recovery.
    • Read replica for performance (up to 5 replicas) (For RDS MSSQL, only the 2016-2019 Enterprise edition supports read-replicas, but without cross-region support).
  • Retention period is the number of days AWS keep RDS backups (default is 7 days, max is 35 days).
  • EXAM: To enable read replicas, you need to enable backups. This also means that to disable backups (retention period set to 0), you first need to disable read replicas.
  • You can create read replicas in different AZs and regions.
  • You can create read replicas of read replicas.
  • You can promote a read replica to master, but this will break the current read replica process.
  • You can force the failover process to another AZ by rebooting the RDS instance.
  • OLTP (Online Transaction Processing) is for CRUD operations whereas OLAP (Online Analytical Processing) is for big data processing, typically using a data warehousing DB.
  • AWS database services fall into 4 main categories (RDS itself supports 6 engines):
    1. RDS for SQL OLTP. It supports:
      • MySQL (Exam: Support both InnoDB and MyISAM but InnoDB is recommended as MyISAM does not support read replication and reliable crash recovery)
      • PostgreSQL
      • MS SQL
      • Oracle
      • MariaDB
      • Aurora
    2. DynamoDB for NoSQL OLTP
    3. AWS Redshift for OLAP and data warehousing
    4. AWS ElastiCache using either Memcache or Redis
  • Memorize the following table:
| Database Engine | Range of Provisioned IOPS | Range of Storage |
| --- | --- | --- |
| MariaDB | 1,000–80,000 IOPS | 100 GiB–64 TiB |
| SQL Server, Enterprise and Standard editions | 1,000–64,000 IOPS* | 20 GiB–16 TiB |
| SQL Server, Web and Express editions | 1,000–64,000 IOPS* | 100 GiB–16 TiB |
| MySQL | 1,000–80,000 IOPS | 100 GiB–64 TiB |
| Oracle | 1,000–80,000 IOPS | 100 GiB–64 TiB |
| PostgreSQL | 1,000–80,000 IOPS | 100 GiB–64 TiB |
  • Option Group. Each DB engine under RDS can be configured differently. Those specific configuration settings can be seen as some type of environment variables (some of them truly are). Those settings are stored in what AWS generically calls an option group.
  • RDS supports Reserved Instances for 1 or 3 years. This is also available for multi-AZ deployments.

Backups

  • There are 2 types of backup:
    • Automated backups, i.e., backups that are taken automatically and preserved for a certain time called a retention period (going from 1 to 35 days). Those backups also preserve the transaction logs, meaning you can restore to any point in time. When you delete the RDS instance, all automated backups are also deleted.
    • Snapshots, i.e., manual backups. When the RDS instance is deleted, snapshots are preserved.
  • Both the automated backup and snapshot are stored in S3.
  • When you restore from a backup, you get a new RDS instance on a new endpoint (see the sketch below).
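
A minimal sketch of such a restore with boto3 (instance identifiers are hypothetical):

```python
import boto3

rds = boto3.client("rds")

# Restoring never touches the original instance: it creates a NEW one.
rds.restore_db_instance_to_point_in_time(
    SourceDBInstanceIdentifier="prod-db",            # hypothetical
    TargetDBInstanceIdentifier="prod-db-restored",
    UseLatestRestorableTime=True,                    # or RestoreTime=<datetime>
)

# Once the new instance is available, fetch its NEW endpoint.
endpoint = rds.describe_db_instances(
    DBInstanceIdentifier="prod-db-restored"
)["DBInstances"][0]["Endpoint"]["Address"]
```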

Encryption

  • Encryption at rest is supported for all RDS instance types.
  • Encryption is done using the AWS Key Management Service (KMS). Once your RDS instance is encrypted, the DB data, its backups and its read replicas are all encrypted.

Aurora

  • Aurora is the only RDS option that supports automatic failover on read-replicas.
  • Aurora only supports failover across AZs, not regions. To support cross-region replication, you need to use Aurora Global Database.
  • Storage starts at 10GB, and then auto-scales in 10GB increments up to 64TB.
  • Compute can scale up to 32 vCPUs.
  • Memory can scale up to 244GB.
  • Always keeps 6 copies of your data, 2 copies in each AZ with minimum 3 AZs (which means it's only available in regions with at least 3 AZs).
  • Support read replicas up to 15.
  • By default, supports a single master (called primary and used for write and read ops). There is a multi-master mode.
  • Supports creating endpoints in order to target the primary DB separately from the load-balanced read-replicas.
  • Only the endpoint targeting the primary DB is ACID. Reads from the read-replicas are only eventually consistent.

RDS Proxy

RDS proxy is a fully-managed AWS service that manages connection pools for the following 3 AWS RDS platforms (as of 2021):

  • MySQL
  • PostgreSQL
  • Aurora

Business continuity & reboot

  • Reboots occur either manually or automatically during the maintenance window (managed by AWS).
  • An RDS instance/cluster in pending-reboot will only reboot with a manual reboot.
  • A reboot occurs during the maintenance window when:
    • The backup retention period for a DB instance is changed from 0 to a nonzero value or from a nonzero value to 0.
    • The DB instance class is changed.
    • The storage class is changed (e.g., standard to SSD).
  • Reboots create an outage, which can be mitigated if proactive measures are taken. Those measures are:
    • Enabling Multi-AZ and then checking the Reboot With Failover option.
    • For Aurora, the same as above applies except that the failover is automatic.

    In both situations, there will be some downtime, but Aurora should be able to limit the downtime to under 10 seconds, while the other RDS engines will recover in between 30 and 60 seconds.

  • The reboot time depends on the activity on the DB. That's why it is advised to choose reboot windows that correspond to low-activity periods to reduce the service interruption.

Redshift

  • Petabytes fully-managed data warehouse service based on PostgreSQL.
  • Only available in 1 AZ, but you can restore from a snapshot in a different AZ in case of outage.
  • Does not support multi-AZ, therefore the best HA strategy is to use a multi-node cluster.
  • Super cheap (starts at $0.25 per hour, i.e., ~$180/month).
  • Used for OLAP (Online Analytic Processing), i.e., running complex queries on your data instead of simply reading records (OLTP). OLAP requires a different type of architecture, both in terms of software and infrastructure. In Redshift case, the key design details are:
    • High columnar compression.
    • No need for indexes or materialized views, which reduces the overall DB size when compared to traditional RDBMS.
    • Uses Massive Parallel Processing (MPP).
  • 2 types of configuration:
    • Single node (160GB)
    • Multi node:
      • Leader node manages client connections and receives queries.
      • Compute nodes (up to 128) store data and perform queries.
  • Backups:
    • Automatic backup with 1 day retention period by default (max 35 days).
    • Attempts to maintain 3 copies (original, read replica, and S3 backup).
    • Can asynchronously create snapshots in S3 in a different region for DR.
  • Pricing model:
    • Charged per compute node (not charged for leader node).
    • Charged for backup.
  • Encryption:
    • Encrypted using AES-256 by default.
    • Redshift manages keys automatically, but you can take over using a Hardware Security Module (HSM) or KMS.

Spectrum

  • Spectrum is a Data Lake service for Redshift. It allows Redshift to query S3 data in place (without loading it first) so that any analytic BI tool that works with PostgreSQL will work.
  • Use Spectrum over Athena if you want to perform complex joins.

Neptune

  • The Graph DB service is called AWS Neptune.

Quantum Ledger & Managed Blockchain

  • Quantum Ledger (QLDB) is a fully-managed, immutable and cryptographically verifiable ledger database (centrally owned, unlike a blockchain).
  • Managed Blockchain is a framework that allows you to implement Hyperledger Fabric or Ethereum and host your implementation on AWS.

Timestream

  • Time-series DB

DocumentDB

  • MongoDB compatible.

AWS ElasticSearch (aka ES)

  • ELK stack = ElasticSearch + LogStash + Kibana

Redis vs Memcache

There are 2 types of ElastiCache:

  • Memcached: Use this for simple use cases. It does not support:
    • Backup and restore.
    • Advanced data types.
    • Sorting.
    • Pub/Sub.
    • Persistence.
    • Multi AZ.
  • Redis: Use this for more advanced scenarios where Memcached falls short (it supports all the things that Memcached does not support above).

  • Redis supports all Memcache features + more.
  • Memcached does not support replication -> when a node fails, there is data loss. That's why you should use more than one node per shard and distribute nodes across AZs. This will not prevent data loss, but it will minimize its effect.
  • Redis supports replication groups to support failover of the primary node.

Event-sourcing services

SNS

  • 256KB max record

SQS

  • FIFO means that it maintains order.
  • Message retention: 4 days by default, up to 14 days.
  • 256KB max record size, but this can be extended to 2GB with the SQS Extended Client Library for Java (which offloads the payload to S3).

AWS MQ

  • Similar to AWS SQS.
  • You should use SQS whenever possible.
  • Use it when migrating from on-prem and a pure drop-in replacement for MQTT, WebSocket, ActiveMQ, JMS, NMS is required.

Kinesis

  • Data is processed in shards.
  • Each shard can ingest 1,000 records/sec.
  • Default limit is 500 shards but you can request a limit increase.
  • Max record size is 1MB.
  • Default retention is 24 hours but can be increased to 7 days.
  • KCL (Kinesis Client Library)
  • KPL (Kinesis Producer Library)
  • To read data from Kinesis Data stream you can:
    • Use the KCL (Kinesis Client Library) (recommended)
    • Use the Kinesis API
  • If you need to stream the data straight into storage (e.g., S3, RedShift, ElasticSearch, Splunk), you need Kinesis Firehose
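
A minimal producer sketch with boto3 (the stream name is hypothetical; the KPL/KCL remain the recommended path for production):

```python
import json
import boto3

kinesis = boto3.client("kinesis")

# Records are routed to a shard by hashing the partition key;
# each shard ingests up to 1,000 records/sec.
kinesis.put_record(
    StreamName="clickstream",                      # hypothetical stream
    Data=json.dumps({"page": "/home"}).encode(),
    PartitionKey="user-42",                        # determines the target shard
)
```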

CloudWatch events

EventBridge

Choosing the correct event-sourcing service

EventBridge vs CloudWatch Events, Kinesis and SNS

DevOps services

System Manager

  • SSM (formerly known as Simple System Manager) is a configuration management service/tool in a similar category to Ansible, except it operates across all the services in your AWS account.
  • Used to manage huge fleet of EC2 instances.
  • Can also be used to manage on-premise instances.
  • Requires an agent to be installed. Most recent AMIs have it by default.
  • Its different components are:
    • Inventory
    • State manager (configured via a Command document and a Policy document): Filter instances by config and tag them.
    • Logging (recommended to replace this with CloudWatch if possible)
    • Parameter store: Equivalent to environment variables. Can leverage AWS Secrets Manager.
    • Insight dashboard: Centralized view of AWS CloudTrail, AWS Config and Trusted Advisor.
    • Resource groups: Use existing tags to group resources.
    • Automation (configured via an Automation document): Automatically run tasks like shutting down instances.
    • Run command (configured via a Command document): Automatically execute scripts on the instance without requiring SSH.
    • Patch Manager: Automatically run patches.
    • Maintenance windows: Assign specific windows to do things (e.g., running Patch Manager)
  • SSM can also manage instances outside of AWS, including:
    • On-prem.
    • Other Cloud provider.
  • A managed instance is an instance managed by SSM.
  • SSM contains five capabilities:
    • Operations Management: This is the overall dashboard that provides a helicopter view of outstanding items across all AWS accounts and more. You can also use it to create OpsItems, i.e., DevOps tickets on items that must be fixed.
    • Application Management: This helps manage your apps with the following specific tools:
      • Resource groups: Using tags, you can group your resources in logical units to make sense of them and even apply specific actions to them.
      • Parameter store (see the sketch after this list):
        • Stores key-value pairs that can be used by your apps. The value can be stored in plain text or encrypted using KMS.
        • Overlaps with AWS Secrets Manager. Secrets Manager only works with encrypted data. It supports the following additional features:
          • Password generator
          • Automated secret rotation
          • Cross account access
      • AppConfig: Maintain config files for your app. Can also reference values stored in Parameter store.
    • Actions & Change: Infrastructure maintenance and change planning
      • Automation: Script your automation (e.g., OS patches, backup, secret rotation).
      • Change calendar
      • Maintenance windows
    • Instances & Nodes: Helps manage your EC2 fleet. Services include:
      • Session manager: This is an in-browser terminal to access your instances, even if they are in a private network, which removes the need for a bastion server or SSH configuration. [Exam]:
        • Advantages over a bastion server are:
          • No need to open inbound SSH ports or PowerShell ports.
          • Access via Session Manager can be tracked via CloudTrail.
          • Manage user access in a central place using IAM only. You set policies between instances and users. YOU DO NOT SET IAM POLICIES ON SESSION MANAGER.
          • One click access via console or CLI.
          • Support for port forwarding (no need for a bastion server anymore).
          • Supports both Windows and Linux.
          • Can log session to the following destinations:
            • S3
            • CloudTrail
            • CloudWatch Logs
            • EventBridge
            • SNS
        • Disadvantages:
          • It is not agentless.
      • Patch Manager
    • Shared Resources: Stores SSM document which are used to configure SSM.
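
As flagged above, here is a minimal sketch of Parameter Store used as encrypted app config with boto3 (the parameter name and value are hypothetical):

```python
import boto3

ssm = boto3.client("ssm")

# Store an encrypted value (SecureString values are encrypted with KMS).
ssm.put_parameter(
    Name="/myapp/db/password",   # hypothetical parameter
    Value="s3cr3t",
    Type="SecureString",
    Overwrite=True,
)

# Read it back, decrypted on the fly.
password = ssm.get_parameter(
    Name="/myapp/db/password",
    WithDecryption=True,
)["Parameter"]["Value"]
```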

CloudTrail

  • CloudTrail is an auditing service which allows you to track API calls made to AWS (e.g., creation of an EC2 instance).
  • The difference between CloudTrail and CloudWatch is that the first focuses on auditing while the other focuses on performance.
  • CloudTrail is a per region service. So you need to enable it in all regions :(.
  • Using CloudTrail, it is possible to track all AWS account activities and therefore create a global view of the expenditure, which then leads to consolidated billing. To do so:
    1. Turn on CloudTrail in the paying account.
    2. Create an S3 bucket in the paying account and attach a policy to it so it is accessible to all the other AWS accounts (cross-account access, see the sketch below).
    3. Turn on CloudTrail in all the AWS accounts so they log to the shared S3 bucket.
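
A hedged sketch of step 2 with boto3: the bucket policy lets the CloudTrail service write each account's logs under its own prefix. The bucket name and account ID are hypothetical:

```python
import json
import boto3

bucket = "org-cloudtrail-logs"  # hypothetical shared bucket
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "cloudtrail.amazonaws.com"},
            "Action": "s3:GetBucketAcl",
            "Resource": f"arn:aws:s3:::{bucket}",
        },
        {
            "Effect": "Allow",
            "Principal": {"Service": "cloudtrail.amazonaws.com"},
            "Action": "s3:PutObject",
            # One such statement (or prefix) per member account ID.
            "Resource": f"arn:aws:s3:::{bucket}/AWSLogs/111122223333/*",
            "Condition": {
                "StringEquals": {"s3:x-amz-acl": "bucket-owner-full-control"}
            },
        },
    ],
}

boto3.client("s3").put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy))
```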

CloudFormation

  • Allows you to automate your infrastructure using IaC with JSON or YAML templates.
  • A template uses the following properties:
    • AWSTemplateFormatVersion
    • Description
    • Metadata
    • Parameters
    • Mappings
    • Conditions: Series of programmatic boolean values that can be used in other sections.
    • Transform
    • Resources
    • Outputs
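
To make the template anatomy concrete, here is a minimal hypothetical sketch (all names are made up) exercising most of the properties above, with the YAML held in a Python string and deployed via boto3:

```python
import boto3

TEMPLATE = """
AWSTemplateFormatVersion: '2010-09-09'
Description: Minimal demo stack
Parameters:
  Env:
    Type: String
    AllowedValues: [dev, prod]
Conditions:
  IsProd: !Equals [!Ref Env, prod]
Resources:
  DemoBucket:
    Type: AWS::S3::Bucket
Outputs:
  BucketName:
    Value: !Ref DemoBucket
"""

cf = boto3.client("cloudformation")
cf.validate_template(TemplateBody=TEMPLATE)          # catches syntax errors early
cf.create_stack(
    StackName="demo-stack",
    TemplateBody=TEMPLATE,
    Parameters=[{"ParameterKey": "Env", "ParameterValue": "dev"}],
)
```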

X-Ray

Service that allows advanced tracing to trace microservices and debug the shit out of them. It is built into Lambda, but can also be used through an agent that you must install in your app.

AWS OpsWorks

  • Fully-managed service that implements one of the following:
    • Chef
    • Puppet
    • Stack (custom AWS solution compatible with Chef recipes)
  • Though it is a global service, it can only create stacks on a per region basis. It is not possible to:
    • Manage a stack in one region from another stack in another region.
    • Clone a stack in one region to another one.

DevOps exam tips

  • There are a few nasty questions about the Resources property (a sketch combining the policy attributes below follows this list).
    • Resource types: You should be familiar with a few specific resource types:
      • AWS::CloudFormation::CustomResource: This one allows you to fire an AWS Lambda to access custom data not supported by CF.
    • Resource policies: You should be familiar with a few specific resource policies:
      • CreationPolicy attribute:
        • It allows you to explicitly tell CF when the resource is considered created. For example, for an auto-scaling group that creates 10 instances, we may want to consider the auto-scaling group created when at least 3 instances are running.
        • Its value is a complex object that depends on the resource type.
        • Only applies to:
          • AWS::AutoScaling::AutoScalingGroup
          • AWS::EC2::Instance
          • AWS::CloudFormation::WaitCondition
      • DeletionPolicy attribute:
        • It helps define extra steps before a resource is deleted when the stack is deleted. For example, you may want to snapshot an EBS volume before the stack is deleted.
        • String value which can only be one of the following enumeration:
          • Delete: That's the normal behavior for most of the default resources except AWS::RDS::DBCluster, AWS::RDS::DBInstance for which the default is Snapshot.
          • Retain: When the stack is deleted, retain that resource.
          • Snapshot: When the stack is deleted, snapshot that resource. Only applies to:
            • AWS::EC2::Volume (aka EBS)
            • AWS::ElastiCache::CacheCluster
            • AWS::ElastiCache::ReplicationGroup
            • AWS::Neptune::DBCluster
            • AWS::RDS::DBCluster
            • AWS::RDS::DBInstance
            • AWS::Redshift::Cluster
      • UpdatePolicy attribute:
        • Manages the way updates are applied.
        • Its value is a complex object that depends on the resource type.
        • Only applies to:
          • AWS::AutoScaling::AutoScalingGroup: Exam. The exam seems to like questions about auto scaling so memorize the following:
            • In this case, the UpdatePolicy attribute can contain the following three attributes:
              • AutoScalingReplacingUpdate:
                • If you need AutoScalingReplacingUpdate to take precedence over AutoScalingRollingUpdate, set WillReplace to true.
              • AutoScalingRollingUpdate
              • AutoScalingScheduledAction: That's the easiest to use. Set IgnoreUnmodifiedGroupSizeProperties to true if you want to make sure that a CF update to a scaling group that uses a scheduled behavior does not reset the scaling setting while a schedule is in effect.
          • AWS::ElastiCache::ReplicationGroup
          • AWS::Elasticsearch::Domain
          • AWS::Lambda::Alias
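
As announced above, an illustrative YAML fragment (held in a Python string; it is NOT a complete deployable template and all names/values are hypothetical) combining the three policy attributes:

```python
# CreationPolicy: the ASG is "created" only once 3 instances have signalled.
# UpdatePolicy: rolling updates + don't reset scheduled scaling settings.
# DeletionPolicy: snapshot the EBS volume before the stack deletes it.
RESOURCES_FRAGMENT = """
Resources:
  WebAsg:
    Type: AWS::AutoScaling::AutoScalingGroup
    CreationPolicy:
      ResourceSignal:
        Count: 3
        Timeout: PT15M
    UpdatePolicy:
      AutoScalingRollingUpdate:
        MinInstancesInService: 1
      AutoScalingScheduledAction:
        IgnoreUnmodifiedGroupSizeProperties: true
    Properties:
      MinSize: '3'
      MaxSize: '10'
      # ... launch template and other required properties omitted

  DataVolume:
    Type: AWS::EC2::Volume
    DeletionPolicy: Snapshot
    Properties:
      Size: 100
      AvailabilityZone: us-east-1a
"""
```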

DevOps best practices

  • When code is distributed across multiple AWS accounts (e.g., multiple teams are responsible for their own account), then it makes sense to have a dedicated DevOps AWS account. Because CodeBuild, CodeCommit and CodePipeline all support cross-account access (i.e., they can operate on resources in another account), we can pull all the resources from the other accounts and centralize all the CI/CD automation in a single account.

Serverless services

Lambda

  • Lambdas are fully-managed AWS services (hence managed in their own VPC).
  • They can be connected to your own VPC (to gain access to your RDS DB on your private subnet for example), but this requires care. It used to create a lot of negative side-effects but this was mostly fixed in Nov 2019.

Lambda connected to a VPC

This use case occurs when a lambda needs to communicate directly with resources located in a private subnet (e.g., RDS DB). The usual confusion with this setup is the belief that the Lambda is provisioned inside your own VPC. It is not. As mentioned previously, a Lambda is always private and always provisioned inside AWS' own VPC. Instead, this setup is a network configuration that provisions a new ENI that allows that Lambda to reliably and safely access your private resources. Though there are legitimate cases to set this up, AWS recommends avoiding it if possible as this setup creates a series of non-trivial side-effects:

  • At least one new ENI is provisioned, which increases the cold start (this used to be far worse, but in September 2019, AWS made a significant optimization).
  • That ENI assigns a private IP address to that lambda (1). So even if the subnet associated with that lambda is public, that lambda won't be able to ingress/egress to the internet. Instead, a NAT must be added in the public subnet, and a route table mapping must redirect the lambda subnet traffic to that NAT.

(1) Public subnet can have private IPs. To understand that concept, please refer to the Public vs Private IP addresses section.

References

AWS step functions

Use it to coordinate AWS services.

Elastic Beanstalk

  • For the practitioner exam, you just need to know it's a service that lets you simply deploy code without worrying too much about AWS.

Containers services

ECS (Elastic Container Service)

  • Elastic Container Service has two responsibilities:
    1. Manage the lifecycle of tasks (e.g., deploy containers on existing EC2 instances). A task defines:
      • Docker image
      • VM size
      • Networking
      • Logging
      • Bootstrap script commands
      • IAM roles
    2. Running the container (which could be replaced by Fargate)
  • ECS is free. You just pay for the EC2 instances running.
  • When ECS runs the container, it does so in your own EC2 fleet. You are responsible for managing those EC2 instances as usual, which could be a pain.
  • ECS is not involved in managing your actual infrastructure, which sucks as now you have two problems: managing Docker and managing EC2.
  • If you want to forget about managing EC2, then use AWS Fargate.

ECR (Elastic Container Registry)

Amazon Elastic Container Registry (ECR) is a fully-managed Docker container registry that makes it easy for developers to store, manage, and deploy Docker container images.

Fargate

  • Fargate uses ECS and replaces its second step (i.e., running containers).
  • Fargate is considered a serverless solution. It eliminates the need to manage EC2 fleets and scale up/down automatically.
  • The combo ECS/Fargate is a competitor to any Kubernetes solution (e.g., AWS EKS). Choose Kubernetes if you need fine grained control.
  • Kubernetes will provide greater portability than Fargate (Fargate is AWS specific).
  • Choosing between AWS Fargate and AWS ECS is easy (choose Fargate) but choosing between Fargate and AWS EKS really depends on your business case.

Exam tips

  • Fargate is usually used for transient payload as it is a serverless solution. If a solution requires a fixed and finite amount of servers using containers, ECS is probably the right choice.

EKS (Elastic Kubernetes Service)

Amazon Elastic Kubernetes Service (Amazon EKS) makes it easy to deploy, manage, and scale containerized applications using Kubernetes on AWS.

Networking services

Route 53

Route 53 knowledge prerequisites

  • The root domain is the naked domain (e.g., the root domain of https://www.example.com is example.com).
  • A CNAME cannot be placed at the root domain level.
  • CNAME records must point to another domain name, never to an IP address.
  • CNAME records can point to other CNAME records, but this is not considered a good practice as it is inefficient.
  • A hosted zone is a container for DNS records. There are two types of hosted zones:
    • public: Stores DNS records to route traffic on the internet.
    • private: Stores DNS records to route traffic in your VPC.

Route 53 overview

  • Can log DNS queries to CloudWatch (via a Log Group only in US East). Useful for traffic analysis.
  • An alias is like a CNAME where the target is an AWS service (e.g., CloudFront, ELB, API Gateway).
  • When an alias is defined on the root domain, it is called a zone apex alias (unlike CNAMEs, aliases are allowed at the zone apex).

Cloudfront

  • The 3 key terms are:
    1. Edge
    2. Origin
    3. Distribution (just the name of your edge).
  • Behaviors determine which backend (static S3 or dynamic EC2) is served based on the URL path.
  • There are 2 types of distributions:
    1. Web
    2. RTMP (media streaming like flash)
  • You can also write to an edge (e.g., transfer acceleration)
  • You can invalidate cached objects, but you'll be charged for doing so.

Direct Connect

  • AWS Direct Connect is a cloud service solution that allows you to establish a dedicated network connection from your premises to AWS.
  • It is complicated to set up as you need the intervention of your ISP. EXAM: Direct Connect can support, in theory, bandwidth ranging from 1Gbps to 10 Gbps.
  • It is not inherently redundant, which means you should design redundancy by adding an extra VPN connection or a secondary Direct Connect.

VPC endpoints

Useful examples of VPC endpoints can be seen at https://docs.aws.amazon.com/vpc/latest/userguide/vpc-peer-region-example.html

  • VPC endpoints are horizontally scaled, redundant, and highly available VPC components that allow your VPC components to privately serve (provide) services to other AWS VPCs (in the same account or in other accounts) or consume other services (themselves exposed via a VPC endpoint).
  • VPC endpoints are REGIONAL ONLY. This means that you cannot access them from a different region. The usual solution around this limitation is to add Inter-region VPC peering, which allows you to connect VPCs across regions. To learn more about those types of solutions, please refer to the link above.
  • You pay per hour (roughly USD 10 per month) and per GB.
  • Use VPC endpoints over NAT, internet gateways, VPN, or Direct Connect when your aim is to connect services across AWS accounts rather than exposing AWS services to external systems (web, corporate network, ...).
  • Can decrease cost by saving on egress: if some of your AWS processes (e.g., an app on EC2) are in a public subnet and send a lot of data via the internet back to an AWS service (e.g., Kinesis Streams), then you are incurring a cost for traffic leaving AWS. Configuring an interface endpoint for Kinesis and enabling private DNS (which is the default when creating an endpoint) will automatically change that behavior because the private DNS will resolve the Kinesis endpoint to the new VPC endpoint. The traffic will stay inside the AWS backbone and not incur any costs.
(Two screenshots comparing the traffic flow with Private DNS OFF vs Private DNS ON omitted.)
  • There are two types of VPC endpoints:
    • AWS PrivateLink used by interface endpoints & endpoint services: More details below.
    • Gateway endpoints: Similar to AWS PrivateLinks but for S3 and DynamoDB.
  • AWS PrivateLink is a bit confusing because this service is not explicitly used. Instead, you interact with an interface endpoint (in the AWS console, they are labelled as endpoint) or an endpoint service:
    • interface endpoint uses AWS PrivateLink to provision ENIs (Elastic Network Interfaces) in front of resources in a private AWS VPC so they can be accessed (consume or be consumed) by other VPC endpoints.
    • endpoint service uses AWS PrivateLink to provision ENIs (Elastic Network Interfaces) in front of an NLB (Network Load Balancer) that sits in a private AWS VPC so it can be accessed (consume or be consumed) by other VPC endpoints.

    In both cases, the firewall must be configured via a Security Group, otherwise no traffic is allowed.

  • Endpoint policies are used to manage access between consumers and providers.
  • To let a consumer located in an on-prem network access a VPC endpoint, either Direct Connect or a VPN must be set up.
  • Limitations:
    • Only supports IPv4 traffic over TCP.
    • Only supports connecting consumer and provider across the same region.
    • Consumer and provider must be in the same region and must have AZs overlap. If there is no AZ overlap, the consumer won't be able to discover the provider's VPC endpoint.
    • For the endpoint service, the associated Network Load Balancer can support 55,000 simultaneous connections or about 55,000 connections per minute to each unique target (IP address and port).
  • Troubleshooting:
    • Both consumer and provider are in the same region and all endpoint policies have been set correctly, but the consumer cannot see the VPC endpoint. This is most likely because the provider's VPC endpoint is in a different AZ than the consumer.

Big data services

Athena

  • Serverless querying service which allows you to query your S3 files with SQL (see the sketch below).
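
A minimal sketch with boto3 (database, table and result bucket are hypothetical; Athena is asynchronous so the query state must be polled):

```python
import time
import boto3

athena = boto3.client("athena")

qid = athena.start_query_execution(
    QueryString="SELECT status, COUNT(*) AS hits FROM access_logs GROUP BY status",
    QueryExecutionContext={"Database": "weblogs"},              # hypothetical
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)["QueryExecutionId"]

# Poll until the query reaches a terminal state.
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
```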

Glue

  • AWS Glue is a fully-managed ETL service.
  • It is made of the following components:
    • Crawler: Can crawl many different sources (CSV files, JSON files, RDBMS, ...) and extract schemas from those data sources.
    • Classifier: A classifier is a function that a crawler uses to infer a data schema. There are many out-of-the-box classifiers. The crawler uses them in sequence until the probability for the data pattern is high enough. You can also create your own classifier.
    • ETL job: Wrangle a data source into a destination.
    • Data Catalog: Metadata store about data sources. It will contain the data source (e.g., S3, Redshift, RDS, DynamoDB) and add the schema on top of it. It's very useful for AWS Athena, for example: Athena uses the Data Catalog a bit like a table, though the underlying data could be a flat file.
  • Supported data sources
    • Aurora
    • RDS for MySQL
    • RDS for Oracle
    • RDS for PostgreSQL
    • RDS for SQL Server
    • Redshift
    • DynamoDB
    • S3
    • MSK (AWS Managed Streaming for Kafka)
    • Kinesis Data Streams
    • Apache Kafka
    • MySQL, Oracle, Microsoft SQL Server PostgreSQL running on EC2
  • Glue ETL allows you to code your own ETL script. It supports two languages:
    • Python
    • Scala

Elastic MapReduce (EMR)

When a cluster is deleted, the HDFS data is also deleted. If you need to persist it, you need to explicitly export it to S3 using the EMRFS file system.

ML/AI services

| Service | Description |
| --- | --- |
| SageMaker | Managed Jupyter Notebooks that can be used to build and train models as well as analyse data. Once the models are built, they can be scaled by exposing them via endpoints so that other services can query them. |
| Comprehend | Uses NLP to build sentiment analysis. |
| Forecast | Uses time-series to make predictions. |
| Lex | Transforms speech into commands (ASR, Automatic Speech Recognition) or text into commands (NLU, Natural Language Understanding). That's what's being used behind Alexa. |
| Personalize | Recommendation engine based on groups (e.g., customer data). |
| Polly | Converts text to voice in many different languages. |
| Rekognition | Image processing. |
| Textract | Extracts text from pictures or documents. |
| Transcribe | Voice to text. |
| Translate | Translates text from one language to another. |

Examples:

  • Simple Alexa skill: Lex (text to command) -> Polly (answer to voice)
  • Alexa skill with recommendation: Lex (text to command) -> Personalize (add suggestions to answer) -> Polly (answer to voice)

IoT services

AWS IoT is broken down in the following services:

  • IoT devices:
    • Amazon FreeRTOS: RTOS that can manage microcontrollers and connect directly to AWS Cloud (via AWS IoT Core) or to a local gateway (via AWS Greengrass). The main advantage of using Amazon FreeRTOS is portability across devices with different microcontrollers. It supports:
      • Local and cloud connectivity.
      • OTA (Over-The-Air) updates
      • Data encryption
      • Key management
      • Integrated code signing
    • AWS IoT Device SDK: The SDK offers the same features as Amazon FreeRTOS for situations where Amazon FreeRTOS is not available. The SDK supports the following languages: C, C++, Java, Javascript, Python, Android, iOS.
  • IoT Edge processing or network gateways:
    • AWS Greengrass: This is software that can extend your AWS account functionalities (AWS Lambda, storage) to the edge.
  • IoT Cloud:
    • AWS IoT Core:
      • Supports WebSocket, MQTT and HTTP
      • Rules Engine is used to route messages to any AWS service. The syntax to create a rule is similar to a SQL query: SELECT * FROM topic WHERE condition -> invoke action (see the sketch after this list).
      • Device Shadow is a sort of device proxy that allows your apps to control the device state, even when the device is offline. The device shadow persists the desired device state (e.g., show green light) and the AWS IoT Device SDK or the Amazon FreeRTOS will try to sync the device to that state (or the inverse).
  • IoT Security:
    • AWS IoT Device Defender:
      • Helps implement security best practices (e.g., constant monitoring) or compliance rules for your fleet of devices (e.g., checking that certs are not shared across devices).
  • IoT Analytics
    • AWS IoT Analytics: Adhoc analytics.
      1. Create a new analytic. This will create in the background:
        • Channel to absorb the data from your topic.
        • Data store stores your messages.
        • Pipeline links the channel to the data store.
        • Data sets are SQL queries against the data store which can be run on a specific schedule.
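
As flagged above, a minimal sketch of a Rules Engine rule with boto3: route messages matching the SQL filter to a Kinesis stream. All names and the role ARN are hypothetical:

```python
import boto3

iot = boto3.client("iot")

iot.create_topic_rule(
    ruleName="high_temperature",
    topicRulePayload={
        # SQL-like filter over the MQTT topic.
        "sql": "SELECT * FROM 'sensors/+/telemetry' WHERE temperature > 60",
        "actions": [{
            "kinesis": {
                "streamName": "iot-alerts",                               # hypothetical
                "roleArn": "arn:aws:iam::111122223333:role/iot-to-kinesis",
            }
        }],
    },
)
```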

Security services

WAF

  • Web Application Firewall helps protect against programmatic attacks such as:
    • SQL injection
    • Suspicious|blacklisted IPs, countries, headers
    • Suspicious URL string
    • Suspicious scripts (cross-site scripting)
  • WAF is configured via ACLs (Access Control List)
  • It operates on layer 7 of the OSI.
  • Can protect the following services:
    • CloudFront
    • API Gateway
    • Application Load Balancer
    • AppSync
  • It cannot protect an ECS cluster. If you need to do so, put the cluster behind an ALB.

Shield

  • AWS Shield is a security service that is provided out-of-the-box on all AWS accounts for free.
  • It is enabled by default and comes in 2 types:
    • Standard
    • Advanced ($3000/month)
  • Its main purpose is to protect against DDoS. The most common attacks are:
    • UDP reflection attacks: OSI Layer 3 (Network) Protected by AWS Shield standard
    • TCP Syn Flood attack: OSI Layer 4 (Transport) Protected by AWS Shield standard
    • TLS abuse: OSI Layer 6 (Presentation)
    • HTTP floods, DNS query floods: OSI Layer 7 (Application) Protected by AWS Shield advanced
  • AWS Shield advanced is not free and helps protect against advanced DDoS.
  • EXAM: If you plan to test the robustness of your AWS infrastructure with a planned DDoS simulation or a planned load test, you should let AWS know. Otherwise, AWS Shield may automatically block your requests.

GuardDuty

  • Threat detection service that continuously analyses CloudTrail, VPC Flow Logs and DNS logs.
  • Pricing is based on:
    • Number of CloudTrail events analysed (pay per million).
    • Volume of VPC Flow Logs and DNS logs analysed (per GB)

Macie

  • Service that can detect whether your S3 data or CloudTrail logs contain sensitive PII (Personally Identifiable Information, e.g., credit card numbers).
  • It uses ML and NLP to do that.
  • Useful for PCI-DSS compliance.

Firewall Manager

  • AWS Firewall Manager helps manage rules referred to as ACLs (Access Control Lists) and apply them across multiple accounts to AWS WAF, AWS Shield and AWS VPC Security Groups.
  • It is useful to centrally manage and monitor security groups across multiple accounts.
  • AWS Config must be enabled.
  • It must be set up in the Organization management account. The account where AWS Firewall Manager is set up is referred to as the Firewall Manager administrator account.

Inspector

Agent that you install on each of your EC2 instances. It will run diagnostics to report vulnerabilities and issues based on best-practice rules.

Trusted Advisor

AWS Trusted Advisor is a service that analyses your entire AWS Account to provide suggestions on how to optimise:

  1. Costs.
  2. Performance.
  3. Security.
  4. Fault tolerance.

It comes in 2 flavors:

  • Core Checks & Recommendations (FREE)
  • Full trusted advisor (Only for business and enterprise support plans)

Personal Health Dashboard

Diagnostic tool that shows all the critical events that may have affected your AWS resources. It offers:

  1. A personalized view of the services' health.
  2. Detailed troubleshooting guidance.

AWS Assurance Program aka AWS Compliance Program

This is the AWS initiative to be compliant with as many regulations and standards as possible. The cornerstones of this program are:

  1. Compliance with Laws and Regulations
  2. Certifications/Attestations

Its components are:

  1. Risk Management
  2. Control Environment
  3. Information Security

Directory and identity services

Cloud Directory

  • Cloud-native solution for hierarchical data with complex relationships. This is not an active directory. Instead, see this service as a way to build bespoke HR SaaS products.
  • Fully managed, hierarchical data store in AWS cloud.
  • Amazon Cloud Directory enables you to build flexible cloud-native directories for organizing hierarchies of data along multiple dimensions. With Cloud Directory, you can create directories for a variety of use cases, such as organizational charts, course catalogs, and device registries. While traditional directory solutions, such as Active Directory Lightweight Directory Services (AD LDS) and other LDAP-based directories, limit you to a single hierarchy, Cloud Directory offers you the flexibility to create directories with hierarchies that span multiple dimensions. For example, you can create an organizational chart that can be navigated through separate hierarchies for reporting structure, location, and cost center.
  • Main benefits:
    1. Organize hierarchies across multiple dimensions.
    2. Scales automatically to millions of objects.

Cognito

Login and signup for apps.

Microsoft AD (aka AWS Managed Directory or AWS Directory)

Fully-managed AD.

AD Connector

Used to let on-prem AD users log in to AWS. It supports MFA.

Simple AD

AWS fully-managed Samba 4 AD. It DOES NOT support:

  • MFA
  • Trust relationships with other domains

Security Token Service (STS)

AWS Security Token Service (STS) allows you to issue temporary, limited-privilege credentials for IAM users. Here is a typical use case:

  1. User logs in with an identity provider
  2. STS issues a token
  3. That user uses that token to access AWS

The workflow above describes what is sometimes referred to as a TVM (Token Vending Machine). For mobile apps, AWS recommends using AWS Cognito with its SDK instead.
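
A minimal sketch of the TVM flow with boto3 (the role ARN is hypothetical):

```python
import boto3

sts = boto3.client("sts")

# 2. STS issues temporary credentials for a limited-privilege role.
creds = sts.assume_role(
    RoleArn="arn:aws:iam::111122223333:role/limited-reader",  # hypothetical
    RoleSessionName="user-42",
    DurationSeconds=900,  # temporary by design
)["Credentials"]

# 3. The user accesses AWS with those temporary credentials.
s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
```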

CloudHSM

CloudHSM is physical hardware that AWS can either lease to you (pay per hour) or buy on your behalf (~$5,000) and that is installed on a rack in your VPC. It is an alternative to AWS KMS. It is not multi-tenant.

AWS Artifact

This is just a list of documents that allows you to manage your agreements with AWS. You can also find the compliance reports that AWS holds against various standards (ISO whatever).

Security exam questions

  • Identity-based policies are policies attached to identities while resource-based policies are policies attached to resources.

Hybrid cloud services

AWS Storage Gateway

  • File system mounted on top of S3. Typically used as a hybrid solution for new users migrating from on-premises to AWS.
  • It is a physical or virtual appliance that can be used to cache S3 locally at a customer's site.
  • Prerequisites: iSCSI vs NFS
    • The NFS protocol allows you to share a file system over the network with multiple clients.
    • The iSCSI protocol allows you to share a block device over the network with a single client.
  • A popular use case is Cloud Migration. The gateway is installed on premises and, over time, syncs all the files to S3 or Glacier.
  • There are 4 modes:
    1. File gateway (NFS/SMB): On-premises NFS/SMB mount that syncs to S3.
    2. Volume gateway stored mode (iSCSI only) (old name: Gateway-stored volumes): Similar to File gateway but asynchronously via iSCSI.
    3. Volume gateway cached mode (iSCSI only) (old name: Gateway-cached volumes): Access S3 files via the on-premise iSCSI interface.
    4. Tape gateway (iSCSI only) (old name: Gateway-virtual tape library): Used for backup.
  • AWS Storage Gateway supports bandwidth throttling. This is useful when you're syncing files with an on-premises location that has a limited amount of bandwidth (you don't want to piss off the rest of the office by exhausting all the bandwidth).
  • Storage Gateway is not a direct sync process. This means that the data goes first from on-prem to the storage gateway disk, then from the storage gateway to S3.

Migration services

DMS (Database Migration Service)

Full AWS DMS Guide available at https://gist.github.com/nicolasdao/36630bac9aac704712dba8efc1d82f5e.

DMS description

Database Migration Service is an AWS service that can connect endpoints (source endpoint and target endpoint) to:

  • Migrate data which is referred to as Full load.
  • Sync data storage (e.g., AWS Aurora with AWS ElasticSearch) which is referred to as CDC (Change Data Capture).
  • Full load is supported by most DB engines (even MSSQL Web edition), but CDC is not (e.g., MSSQL Web Edition). That's because CDC requires the DB engine itself to support CDC.
  • To migrate data, DMS requires that:
    • At least one endpoint (source or target) is hosted on AWS.
    • The DB engine of the source and target are supported. This means that DMS cannot migrate data from on-prem to another on-prem.
  • DMS can migrate data from different DB engines, as long as they are supported.

How does DMS work?

SUPER GOTCHA: DMS reserves the CIDR block 192.168.0.0/24 (I have no clue why but this is detailed at https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Troubleshooting.html#CHAP_Troubleshooting.General.ReservedIP). For this reason, do NOT host your endpoints on IPs in that CIDR block.

  • There are 3 actors in a data migration process involving DMS:
    1. Source: The DB that holds the data to be migrated/replicated.
    2. Replication servers: EC2 instances (supports multi-AZ deployment for high availability) that host the DMS processes called Replication Tasks.
    3. Target: The DB that receives the migrated data.
  • DMS can replicate the data using two processes:
    • Full load: Moving data from the source to the target.
    • CDC (Change Data Capture): Capturing real-time changes via the source's native transaction log API and applying those changes to the target. Those 2 processes can be combined during a data migration. It usually works in 3 steps:
      1. Table full load and CDC start together. However, the CDC does not apply the changes to the target yet (logical, as the full load is not over yet). Instead, the CDC caches those transactions on the replication server.
      2. When the full load is finished, the cached CDC changes are applied.
      3. Once the cached CDC changes have been applied, the CDC keeps going to keep the source and target in sync.
  • The 3-step process described above explains why the EC2 instances must be chosen carefully. The CDC process is the one that will predominantly determine how big and fast your EC2 must be. Indeed, the CDC keeps changes in memory. If it runs out of memory, it will use the disk, which will create synchronization delays. If the DMS console shows a CDCLatencyTarget higher than the CDCLatencySource, this probably means that the target cannot ingest the DMS writes fast enough.
  • LOBs (Large Object Binary) can seriously degrade the DMS migration/sync performance. DMS supports 2 modes for LOBs:
    • Full LOB mode: All LOB are transferred sequentially, which can create problems if some are really big.
    • Limited LOB mode: Allows you to define a Max LOB Size and will dramatically improve performance by truncating LOBs that exceed that threshold.
  • DMS does not manage schema migration. You either have to manage that yourself, or use tools such as AWS SCT (Schema Conversion Tool)
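
A minimal sketch of a combined full load + CDC task with boto3 (all ARNs are hypothetical; the table mapping selects every table in the public schema):

```python
import json
import boto3

dms = boto3.client("dms")

dms.create_replication_task(
    ReplicationTaskIdentifier="pg-to-aurora",
    SourceEndpointArn="arn:aws:dms:...:endpoint:SRC",        # hypothetical
    TargetEndpointArn="arn:aws:dms:...:endpoint:TGT",        # hypothetical
    ReplicationInstanceArn="arn:aws:dms:...:rep:INSTANCE",   # hypothetical
    MigrationType="full-load-and-cdc",   # full load first, then CDC keeps syncing
    TableMappings=json.dumps({
        "rules": [{
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "all-public-tables",
            "object-locator": {"schema-name": "public", "table-name": "%"},
            "rule-action": "include",
        }]
    }),
)
```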

DMS pricing

The actual DMS service is free, but its hosting is not. DMS must be hosted on EC2 instances which you pay for.

  • EC2: You pay based on the instance size: https://aws.amazon.com/dms/pricing/.
  • Data transfer:
    • Free:
      • Getting data in AWS.
    • Not free:
      • Getting data outside of AWS.
      • Moving data in AWS in different AZs or regions.

SCT (Schema Conversion Tool)

Tool you have to install on a local machine that can access the on-prem DB as well as your target AWS account. It also supports converting Stored-Procedures. If it fails, it details the failed objects.

Snowball - Snowmobile - Snowcone

  • Snowball: Physical box containing a compute and storage unit (50TB or 80TB).
  • Snowmobile: Same as Snowball but for data in the exabyte range. You can transfer 100PB per Snowmobile.
  • Snowcone: Smaller version of the Snowball. Cost $60 for 5 days then $6/day. 8TB of storage.

AWS DataSync

Service used to transfer huge amounts of files from your on-premises file systems to AWS. It can work via AWS Direct Connect.

AWS ADS (Application Discovery Service)

This is an agent that must be installed on on-prem machines; it sends encrypted data about the infra usage to AWS Migration Hub. You can also use that data with AWS TCO (Total Cost of Ownership) to see your savings when moving to AWS.

External tools

Server: VMWare vSphere and Microsoft Hyper-V are virtualization tools that can replicate VMs to AWS and create periodic AMIs.

AWS management services

Organization

  • This is a concept more than an explicit service.
  • Organization refers to managing more than one AWS account via a master AWS account (renamed as Management Account to be politically correct). With that topology, the management account does not contain any active AWS services other than services used to administrate the other accounts. Those other AWS accounts are called Member Accounts, which can be grouped inside OUs (Organization Units).
  • OUs (Organization Units) allow to apply the same policies to multiple Member Accounts, hence simplifying the admin tasks.
  • Organization exists in two modes:
    • All features enabled (default)
    • Consolidated billing only
  • Organization allows to:
    • Consolidate and centralize billing and aggregate resource cost. The resource cost aggregation allows you to reach cheaper tiers.
    • Centralize account management (only available if the Org is in All features enabled mode):
      • Compliance & security
      • EXAM: SCPs (Service Control Policies) are a type of organization policy that you can use to manage permissions in your organization. It helps being nazi on what can be done in certain accounts (see the sketch after this list).
        • Allows you to control which services/resources/actions can be accessed and by whom. IMPORTANT: SCPs do NOT grant permission. Instead, they blacklist which services are available in one or many AWS accounts. If a service is available, IAM policies are still required to let users access that service.
        • When set, even Admins of the member account can't override those policies, even if they explicitly define IAM policies in the member account.
        • SCPs only affect users and roles in the Member Accounts, not in the Management Account.
        • SCPs do not affect service-linked roles.
        • SCPs can be configured in two ways:
          • Deny list (default & recommended): All services are allowed except the ones explicitly denied.
          • Allow list (deprecated): All services are denied except the ones explicitly allowed.
      • Tagging policies: Allow to create tagging rules for all accounts.
      • Backup policies: Self-explanatory.
      • AI data management policies: Control how AI data are stored or collected.
      • IAM: Manage users for all AWS accounts. If a user's access is managed in both the Management Account and a Member Account, then what they can access must be defined in both IAM setups.
  • Popular ways to structure accounts:
    • Identity account holds all the IAM users and groups
    • Logging account: That's used for compliance
    • Publishing account: Contains your own AMIs or specific service catalog (ECS).

    EXAM: There is no such thing as a compliance account. Usually, when a question refers to compliance, it means audit, which means the logging account.
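
As flagged in the SCP bullet above, a minimal sketch with boto3: create a deny-list SCP and attach it to an OU (the OU ID is hypothetical):

```python
import json
import boto3

org = boto3.client("organizations")

scp = org.create_policy(
    Name="deny-redshift",
    Description="Member accounts may not use Redshift",
    Type="SERVICE_CONTROL_POLICY",
    Content=json.dumps({
        "Version": "2012-10-17",
        "Statement": [{"Effect": "Deny", "Action": "redshift:*", "Resource": "*"}],
    }),
)

# Remember: this only blacklists the service. IAM policies are still
# needed in each member account to actually grant access to anything.
org.attach_policy(
    PolicyId=scp["Policy"]["PolicySummary"]["Id"],
    TargetId="ou-xxxx-yyyyyyyy",  # hypothetical OU
)
```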

Tags

  • You can tag any AWS resources
  • Tags are global
  • Resource Groups are a way to group resources. You can group using tags.
  • Resource Groups are regional. That means that it will be created based on the currently selected regions.
  • You can combine AWS Systems Manager with AWS Resource Groups to automate all the EC2 instances in a resource group.
  • Tagging best practices:
    1. Always use a standardized, case-sensitive format for tags, and implement it consistently across all resource types.
    2. Consider tag dimensions that support the ability to manage resource access control, cost tracking, automation, and organization.
    3. Implement automated tools to help manage resource tags.
    4. Err on the side of using too many tags rather than too few tags.
    5. Remember that it is easy to modify tags to accommodate changing business requirements, however consider the ramifications of future changes, especially in relation to tag-based access control, automation, or upstream billing reports.

Landing zone

Landing zone is a preconfigured AWS Account topology based on best practices. It helps set up an AWS Organization with the following four-account structure:

  • AWS management account (previously known as master account) which contains:
    • SSO service
    • AWS Landing Zone configuration stored in S3
    • SCPs (Service Control Policies)
  • Core OU (Organization Unit) with the following three accounts:
    • Shared service account: Contains shared services (e.g., AWS Managed Directory to power SSO).
    • Log archive account: Aggregate all CloudTrail and AWS Config logs into S3.
    • Security account:
      • Defines two roles:
        • Auditor (read-only)
        • Admin (full-access)
      • Manages GuardDuty for all accounts.

RAM

  • Resource Access Management is used to centrally manage resources across multiple AWS accounts in your organization. Typical resources you would share across accounts:
    • AWS Transit Gateways
    • Subnets
    • AWS License Manager configurations
    • Route 53 Resolver rules
  • Conceptually, once this feature is enabled in your organization, you can share an AWS resource from any account.

Service catalog

  • AWS Service Catalog helps us define what AWS services are available to our AWS account users. This is a powerful service that allows dev teams to self-provision without having to ask for permission (self-sufficiency). We can, for example, define a policy that says that in dev, large instances are not available because that's overkill. The service catalog removes the need to provision excessive IAM policies.
  • SC is usually provisioned using CloudFormation templates.
  • To configure the AWS service catalog we use three types of constraints:
    • Launch constraint: This removes the need to set up specific IAM roles and policies for each user by defining what can or cannot be provisioned.
    • Notification constraint: Defines what type of SNS notification we receive when resources are deployed.
    • Template constraint: a kind of wizard which, based on the answers, restricts what can be deployed (for example, choosing the target env).
  • EXAM: Only support CloudWatch events for monitoring. As of 2020, it does not support Lambda triggers.

Service Catalog vs Service Control Policies

Though those two services can help allow or deny access to AWS services, their purpose is different.

  • SCPs (Service Control Policies) are defined at the organization level and only help deny AWS services. They don't manage who can or cannot access the AWS services. This has to be done explicitly via IAM policies, which is the standard way of managing service access.
  • Service Catalog can achieve the same as above, but it can also explicitly define IAM policies so specific users can or cannot access the service. On top of that it can also:
    • Include full-fledged applications instead of AWS services only.
    • Group applications and AWS services in Portfolios.
    • Automatically launch and provision applications via a WYSIWYG assistant.

As you can see, Service Catalog's purpose is geared towards facilitating the provisioning of applications and services following some form of compliance, while SCPs focus mainly on restricting service access for security.

Config

  • Audit tool that allows you to be a total Nazi about your AWS setup: do this, don't do that. Very useful if you're implementing ITIL.
  • Tracks the configuration changes of your AWS resources.

AWS Quick Starts

This is just a series of CloudFormation templates architected by experts. For example, if you need an Elasticsearch setup, select the one you need and then select your AWS Account, and it will run a CloudFormation template.

Productivity and Business services

WorkSpace

Desktop with any OS as a service. Handy if you need to have employees do their job in a very controlled desktop environment.

AppStream

Same as AWS WorkSpaces except it's a single app. Useful to demo a software product to a client when you only have access to a browser.

Connect

Call center as a service where customers can be offered multiple-choice options when they call.

Chime

Video and meeting calls.

WorkDocs

Similar to Dropbox and Google Docs.

WorkMail

Fully-managed email and calendar service (supports Microsoft Exchange).

WorkLink

  • Reverse proxy to access apps hosted in your VPC and only accessible to employees. It's kind of like a VPN but one step below and easier to use.
  • Typically used to secure website only accessible to employees.

Alexa for Business

Alexa skills but just for your enterprise.

Other services

AWS Pinpoint

Marketing automation service whose core value proposition is to reach customers based on data segmentation via:

  • email
  • SMS
  • Push notification
  • Voice

AWS Migration Hub

Management service that allows you to track your migration progress. This tool does not provide migration tooling; it only allows you to track progress.

AWS Simple Workflow Service

  • Legacy predecessor of AWS Step Functions. AWS recommends using Step Functions. However, there is still one legitimate use case for SWF: manual intervention (e.g., KYC).
  • With Amazon SWF, developers get full control over implementing processing steps and coordinating the tasks that drive them, without worrying about underlying complexities such as tracking their progress and keeping their state.
  • It makes it easy to coordinate work across distributed application components.

Networking guidelines

Networking overview

Before diving into this section, you might want to refresh your mind with some basics in the Networking section in the Annex.

  • The allowed block sizes for a VPC are between /16 and /28.
  • NAT does not support IPv6, because IPv6 addresses can only be public. If you need to prevent inbound connections to your private IPv6 instances, you need to add an egress-only internet gateway to your VPC, and then connect your private IPv6 EC2 instances to it.
  • You must disable the source/destination check on a NAT server. That flag is true by default on each EC2 instance and enforces that the instance is the source or destination of any traffic it handles. By definition, that's not the case for a NAT server, therefore you must disable this setting.
  • NAT comes in two flavors on AWS:
    1. NAT instance: You have to spawn an AMI on an EC2 instance as well as manage that instance yourself. You are also responsible to manage the route table for that instance.
    2. NAT Gateway: That's an AWS-managed NAT service. You also have to set up a default route to that NAT. It is AZ specific, so if you need redundancy, you need to create a NAT gateway in each AZ.
  • NAT Gateway vs NAT instance:
    • NAT-G is limited to 45Gbps while NAT-I throughput depends on the instance type.
    • Both solutions use an elastic IP but with NAT-G, it cannot be detached.
    • NAT-G cannot be used as a Bastion Server.
  • The broadcast address on a network is the network portion + the host portion made of 1s. For example:
    • 192.168.8.16/24 -> 192.168.8.255
    • 192.168.8.16/16 -> 192.168.255.255
    • 192.168.8.16/28 -> 192.168.8.31
  • AWS does not allow multicast (sending a message to multiple IPs on the same network) or broadcast (sending a message to all IPs on the same network) because that's done at the Data Link layer and allowing it would open the door to some customers affecting others. That's why AWS only allows unicast.
  • AWS always reserves the first 4 IPs in your VPC network plus the broadcast address. For example, if your VPC is 10.0.0.0/24, those addresses are:
    • 10.0.0.0: Network address
    • 10.0.0.1: VPC router
    • 10.0.0.2: AWS DNS
    • 10.0.0.3: For future use
    • 10.0.0.255: Broadcast address. AWS does not support broadcast, so it reserves that address. This means that 251 IPs are available (from 10.0.0.4 to 10.0.0.254). If your VPC is 192.168.8.16/28, those addresses are:
    • 192.168.8.16: Network address
    • 192.168.8.17: VPC router
    • 192.168.8.18: AWS DNS
    • 192.168.8.19: For future use
    • 192.168.8.31: Broadcast address (reserved since AWS does not support broadcast). This means that 11 IPs are available (from 192.168.8.20 to 192.168.8.30). The arithmetic is verified in the sketch after this list.
  • There is no deterministic link between an AZ label and its physical data center. Each AWS account is different. This means that us-west-2a in your account may map to a different data center than in mine. AWS does that to load-balance the traffic more homogeneously.
  • You can increase your network performance on EC2 instances with the following two interfaces:
    • Intel 82599 VF interface (up to 10Gbps). Requirements:
      • Instance types: C3, C4, D2, I2, M4 (excluding m4.16xlarge), and R3
      • OS: HVM AMI with Linux kernel > 2.6.32
    • Elastic Network Adapter (up to 25Gbps). Requirements:
      • Instance types: current generation instance types, other than C4, D2, M4 instances smaller than m4.16xlarge, or T2
  • When using a custom domain on CloudFront, you must provide an SSL certificate (which can be managed by AWS Certificate Manager), and you must also enable SNI (Server Name Indication) (WARNING: very old browsers don't support it). SNI allows CloudFront to know which domain is being requested and serve the right cert.
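
As a quick illustration of the source/destination check mentioned in the list above, here is a minimal sketch using boto3 (the instance ID is a hypothetical placeholder):

  import boto3  # assumes AWS credentials are configured

  ec2 = boto3.client('ec2')

  # A NAT instance forwards traffic that is neither sourced from nor
  # destined to itself, so the source/destination check must be off.
  ec2.modify_instance_attribute(
      InstanceId='i-0123456789abcdef0',  # hypothetical NAT instance ID
      SourceDestCheck={'Value': False},
  )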
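
And to double-check the broadcast-address and reserved-IP arithmetic above, a small sketch using Python's standard ipaddress module:

  import ipaddress

  net = ipaddress.ip_network('192.168.8.16/28')
  print(net.network_address)    # 192.168.8.16 (network address)
  print(net.broadcast_address)  # 192.168.8.31 (reserved: AWS does not support broadcast)

  # AWS reserves the first 4 IPs plus the broadcast, i.e., 5 addresses:
  print(net.num_addresses - 5)  # 11 usable IPs (192.168.8.20 to 192.168.8.30)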

VPC peering

Inter & intra VPC

  • This is only possible via AWS's backbone and not the internet.
  • There are four ways to peer VPCs (EXAM: The following solutions only work between VPCs hosted on AWS. If you see these solutions mentioned with a 3rd party using them to connect to your AWS system, those options can be discarded and considered fallacies):
    1. VPC peering. This is an explicit setup between two VPCs. VPC peering allows the participants to access the resources of one another. This feature is also available for VPCs in different regions (called Inter-region VPC peering). VPC peering has some important limitations though (EXAM!!!):
      • VPC peering only works when all VPCs are in AWS. No such thing as VPC peering between an AWS VPC and a corporate network.
      • VPCs with overlapping CIDR blocks CANNOT BE PEERED. This is still true even if one of the VPCs has a secondary CIDR block that does not overlap. If two VPCs must communicate though they have overlapping CIDR blocks, a transit VPC can be used to work around that limitation (obviously making sure the transit VPC does not have a CIDR overlap).
      • Transitive peering NOT supported. This means that if VPC-A is peered to VPC-B and VPC-B is peered with VPC-C, then VPC-A is not peered with VPC-C.
      • Similarly to transitive peering, peered VPCs cannot access the VPC endpoints or the VPN/Direct Connect systems in the other VPC.
    2. VPC endpoints (aka PrivateLinks): Refer to the VPC endpoints section.
    3. Transit VPC:
      • It peers VPCs via VPN.
      • Hub-and-spoke architecture. Use this when you want all networks located in different parts to safely communicate with each other via a central VPC. That's how you could have a multi-cloud solution. The AWS VPC could link your Google Cloud and Azure Cloud.
      • You have to implement it yourself. This means creating a new VPC and hosting EC2 instances with the right software (e.g., CISCO CSRs).
      • The max throughput for each VPN is ~1.25Gbps.
      • Can connect VPCs from different accounts, regions, even outside of AWS (i.e., corporate network).
    4. Transit Gateway
      • Fully-managed AWS service that delivers the advantages of the transit VPC without its disadvantages:
        • Not bound to VPN
        • Not restricted to AWS VPCs only. This means you can use this to also connect a corporate network over VPN.
      • Is REGIONAL and can only connect VPCs in the same region.
      • Supports peering with another TG in another region.
      • Max bandwidth per AZ is 50Gbps.
      • Can have multiple Transit Gateways per region (though this is not recommended) BUT they cannot be peered.
      • It is recommended to place the Transit Gateway in a dedicated AWS Account (usually referred to as the Network Service Account).
      • If you only manage a single connection between two VPCs, VPC Peering might be better for the following reasons:
        • Lower cost
        • No bandwidth limits (though both options are capped at 10Gbps for placement groups and 5Gbps for individual instances).
        • Placement group support. Transit Gateway does NOT support PGs.
        • Lower latency because there is one less hop.
        • Security Groups compatibility
  • BGP (Border Gateway Protocol) is designed to exchange routing information on the Internet. It is used to propagate information about the network to allow dynamic routing. In the context of this exam you must know that:
    • BGP is required when using Direct Connect and optional when using VPN.
    • The alternative to using BGP with an AWS VPC is static routes.
    • BGP requires:
      • TCP on port 179
      • Ephemeral ports
    • Systems on a BGP network are uniquely identified by their ASN (Autonomous System Number).
    • BGP prioritizes traffic based on weight. So we could have rules saying that our traffic between our on-prem and AWS goes both via VPN and Direct Connect, but the first has a weight of 100 and the second a weight of 0, which means the traffic goes through VPN. If we want to change that, we can increase the weight of the Direct Connect connection to 150.

Connecting AWS VPCs to on-premise corporate network

  • In this context, there are two types of gateways:
    1. Customer Gateway, i.e., the customer's on-prem gateway that we need to connect from/to.
    2. AWS VPG (Virtual Private Gateway) is an access gateway to a VPC. Each time you need to connect a VPC, you need a VPG.
  • There are 6 main ways to establish a connection between VPG and the Customer Gateway:
    1. VPN:
      • That's the easiest, but the least reliable.
      • EXAM: As of 2020, a VPN's max bandwidth cannot exceed the speed of your internet connection, which is usually around 500Mbps.
    2. VPN CloudHub: Same as above, but for use cases where you need to connect more than one location to the same AWS VPC.
    3. AWS Direct Connect: More complicated as you need the intervention of your ISP. EXAM: Direct Connect can support, in theory, bandwidth ranging from 1Gbps to 10Gbps.
    4. AWS Direct Connect + VPN: The usual use case is a big corporation with multiple sub-companies under it. They want each company to be separated from the others from a security point of view.
    5. Software VPN: This is the ultimate do-it-yourself option.
    6. DGW (Direct Connect Gateway)
  • AWS Direct Connect is not inherently redundant, which means you should design redundancy by adding an extra VPN connection or a secondary Direct Connect.

VPG vs DGW vs TGW

Best article on this topic at https://www.megaport.com/blog/aws-vgw-vs-dgw-vs-tgw/

All these solutions aim at connecting networks to AWS VPCs.

  • VPG (Virtual Private Gateway) is the oldest and was designed to connect a corporate network to AWS VPCs via Direct Connect. The issue is that this setup was one-to-one: for each VPC, and therefore for each region, a new Direct Connect connection was needed.
  • DGW (Direct Connect Gateway) is an evolution of the VPG and allows connecting multiple VPCs across regions to a single Direct Connect Gateway, so that the corporate network only needs a single Direct Connect connection. However, the limitation is that VPCs connected to the DGW cannot talk to each other directly. Instead, they have to transit via the corporate network.
  • TGW (Transit Gateway) was created after the DGW and aimed not so much at helping connect to corporate networks, but at facilitating communication between VPCs across accounts IN THE SAME REGION. This removes the need to transit via the corporate network. It also supports TGW peering to overcome the same-region limitation.

Security guidelines

Security common knowledge

  • AWS Shield standard is automatically included for free in any AWS Account.
  • EXAM: The steps in which AWS evaluates whether a request is allowed or not are:
    1. Authenticates the principal
    2. Determines which policy to apply to the request
    3. Evaluates the policy types and arranges an order of evaluation
    4. Processes the policies against the request context to determine if it is allowed
  • EXAM: Users assuming a role: For a user to be able to assume a role, the policy document that describes that role must define two things (see the sketch below):
    1. The Action must contain sts:AssumeRole, where STS (Security Token Service) is the component responsible for issuing access tokens with the correct claims.
    2. The Principal must contain the user's ARN. That's the Trust policy part.
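
To make those two parts concrete, here is a minimal sketch of a trust policy attached to a role at creation time via boto3 (the role name and user ARN are hypothetical). The user additionally needs an identity policy allowing sts:AssumeRole on the role's ARN:

  import json

  import boto3

  iam = boto3.client('iam')

  # Trust policy: WHO (Principal) may perform WHAT (sts:AssumeRole).
  trust_policy = {
      'Version': '2012-10-17',
      'Statement': [{
          'Effect': 'Allow',
          'Principal': {'AWS': 'arn:aws:iam::111122223333:user/alice'},  # hypothetical user
          'Action': 'sts:AssumeRole',
      }],
  }

  iam.create_role(
      RoleName='demo-role',  # hypothetical
      AssumeRolePolicyDocument=json.dumps(trust_policy),
  )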

AWS Shared Responsibility Model

  • AWS is responsible for the Security OF the cloud.
  • You are responsible for the Security IN the cloud.
  • Shared Controls in the context of the Shared Responsibility Model means the controls that AWS has provided to the customers, for which AWS has a responsibility, but the customer also has a responsibility. There are 3 types of shared controls:
    1. Patch Management: AWS must patch the hypervisor that manages its compute services, but the customer is also responsible for patching the OS they decided to install.
    2. Configuration Management: AWS is responsible for properly configuring its Cloud, and the customer is responsible for properly configuring their app.
    3. Awareness & Training: AWS is responsible to stay up to date with the latest best practices, and so is the customer.
  • Try to memorize the diagram shown in figure shared_responsibility_model
  • The rule of thumb for figuring out whether you are responsible or not is to ask yourself if you could do it yourself. If you see a question about security and you don't know how you would do this, then most likely it is not your responsibility.
  • EC2 sucks as you have an awful lot of responsibilities in terms of the underlying OS. You have to:
    • Patch it
    • Update it
  • Encryption is a shared responsibility: you are responsible for encrypting the data in the first place, but AWS is responsible for the encryption tools. So if your data was encrypted but the encryption itself was flawed, that might be AWS's fault.
  • If you're providing your own key, encryption is your sole responsibility.

Popular attack types

DDoS

  • UDP reflection attacks: OSI Layer 3 (Network). Protected by AWS Shield Standard.
  • TCP SYN flood attack: OSI Layer 4 (Transport). Protected by AWS Shield Standard.
  • TLS abuse: OSI Layer 6 (Presentation).
  • HTTP floods, DNS query floods: OSI Layer 7 (Application). Protected by AWS Shield Advanced.

Programmatic

  • SQL injection: Protected by AWS WAF
  • Suspicious|blacklisted IPs, countries, headers: Protected by AWS WAF
  • Suspicious URL string: Protected by AWS WAF
  • Suspicious scripts (cross-site scripting): Protected by AWS WAF

Benefits from using Amazon CloudFront and Amazon Route 53 include:

  • AWS Shield DDoS mitigation systems that are integrated with AWS edge services, reducing time-to-mitigate from minutes to sub-second.
  • Stateless SYN Flood mitigation techniques that proxy and verify incoming connections before passing them to the protected service.
  • Automatic traffic engineering systems that can disperse or isolate the impact of large volumetric DDoS attacks.
  • Application layer defense when combined with AWS WAF that does not require changing your current application architecture (for example, in an AWS Region or on-premises datacenter).

Migration strategies

Six common strategies

  1. Rehost aka Lift & shift: Same architecture moved to the Cloud.
  2. Replatform aka Lift & reshape: Same architecture but new services (e.g., from managed MySQL to Aurora)
  3. Repurchase aka Drop & shop: Migrate from legacy system to new one (e.g., abandon CRM to Salesforce)
  4. Rearchitect: Redesign to Cloud native.
  5. Retire: Completely drop the component as nobody is using it anymore.
  6. Retain/Not moving: Some components are too mission critical and too complicated to be migrated now, so we keep them where they are.

Exam tips

  • When a question mentions a project's timeframe longer than 6 months, this means you can go ape shit and redesign everything (rearchitect).

Cloud Adoption Framework

EXAM: This framework is aimed mainly at internal stakeholders. External stakeholders are not explicitly excluded from it, but they are not considered an explicit top priority.

Building blocks

  1. Business capabilities
    1. Business:
      • Most of the time we want to increase business agility. What this means or how it is perceived varies from one business to another, so it is important to discover the meaning of business agility for each client.
      • What the business does won't inherently change, but HOW they do it will. We want to link the SKILLS and PROCESSES attached to each WHAT so we can adapt those to fit the Cloud.
      • Roles:
        • Business managers/Strategy stakeholders
        • Finance managers/Budget owners
    2. People:
      • Recruitment
      • Adjusting KPIs and incentives
      • Career management (moving from managing physical racks to managing Cloud stability)
      • Training
      • Moving from a culture of slow big releases to fast and frequent small releases.
      • Roles: HR managers
    3. Governance
      • Roles:
        • CIO/Portfolio managers: That can be challenging for them because migration does not happen overnight. This means that there is a transition period where they will have three problems rather than one:
          • Managing the legacy
          • Managing the new Cloud
          • Bridging the two together
        • BA/Project managers
        • Enterprise architect
  2. Technical capabilities
    1. Platform
      • Significant change in compute, network, storage and database. Those become more automated.
      • Roles:
        • CTO
        • IT managers/Solution Architect
    2. Security
      • Roles:
        • CISO (Chief Information Security Officer)
        • Security managers
        • Compliance managers
    3. Operations
      • Monitoring.
      • Roles:
        • Operation/Support managers

Action plan

How to get started

This framework is like any other framework: if you cargo-cult it, it won't deliver any value. Therefore, it is important to be honest and to:

  1. Identify challenges and outcomes: Unfortunately, clients want to "move to the cloud" without clearly defining an outcome. An outcome is not "gaining agility" or "being disruptive". An outcome would be shipping new features to production 4 times a month instead of 4 times a year, or reducing outages by 80%.
  2. Identify the stakeholders: Define the people you need to get on board:
    1. Identify the capabilities involved in the migration.
    2. Identify the owner for each capability.
  3. Assess the stakeholders' needs, fears and challenges: You need to get them on your side.

Migration tools

  • Storage:
    • AWS Storage gateway
    • AWS Snowball
  • Server: VMware vSphere and Microsoft Hyper-V are virtualization tools that can replicate VMs to AWS and create periodic AMIs.
  • Data:
    • DMS (Database Migration Service): AWS fully-managed service that helps establish a connection between a source (e.g., on-prem) DB, a replication instance and a target DB.
    • SCT (Schema Conversion Tool): Tool you have to install on a local machine that can access the on-prem DB as well as your target AWS account. It also supports converting Stored-Procedures. If it fails, it details the failed objects.

    For the exam, check the supported systems for those tools.

  • Apps and systems: AWS Application Discovery Service is an agent that you install on all your on-prem machines; it gathers data about your on-prem systems and outputs a report on the TCO (Total Cost of Ownership).
  • Network typical use case:
    1. Start with VPN
    2. As usage grows, add Direct Connect but keep VPN for redundancy.
    3. Migrate from VPN to Direct Connect using BGP (Border Gateway Protocol)

Disaster recovery strategies

  1. backup & restore: Simple and cheap, but slow and not really flexible (e.g., Snowball, virtual tape library, Storage Gateway).
  2. pilot light: A minimal replica of your on-prem on AWS. It is also simple and cost effective but requires manual intervention. Also, you have to keep AMIs and data storage in sync.
  3. warm standby: Similar to pilot light, but it is already accepting load (e.g., a staging environment).
  4. multi site: The on-prem is fully replicated in the cloud and traffic is load-balanced between the two sites. If one of them fails, the other takes over automatically.

Scaling guidelines

Auto scaling

  • Move from tightly coupled systems to loosely coupled systems, because you can then scale each component separately.

Exam question: we need loosely coupled systems to gain atomic functional units.

  • To scale out EC2 instances you need:
    • Auto-scaling group: The auto scaling group uses a launch configuration to launch EC2 instances. Keep in mind that:
      • There is no guarantee that the auto-scaling group maintains an exact homogeneous EC2 count across multiple AZs. What it will do is make sure that there is the right number of instances.
      • If you need homogeneity of your EC2 count across AZs, you need to add reserved instances in those AZs.
    • Launch configuration: This configures the type of auto-scaling group, AMI, VPC, SG that the EC2 instances use. The four types of auto-scaling that can be used with a launch configuration are:
      1. Maintain: Keep a minimum specific number of instances in place.
      2. Manual: Set up a min, max, or specific number of instances.
      3. Schedule: Increase or decrease based on a schedule.
      4. Dynamic: Have rules (CPU, memory, traffic) to scale. The dynamic type uses what are called Auto scaling policies:
        1. Target tracking: e.g., keep average CPU utilization at 70% (see the sketch after this list).
        2. Simple scaling: Keep adding instances after the health check and cooldown periods expire.
        3. Step scaling: For more custom logic.
  • Launch configuration cannot be updated. If you need to change it, you have to create a new one.
  • The cooldown period (default 300 sec.) is the amount of time we wait before adding a new instance. This prevents scenarios where we overprovision. Instead, we add new instances one by one: each time a new instance is added, we wait a little (the cooldown period) before measuring the system as a whole and deciding whether we need to add more. It is automatically applied with dynamic scaling, optional with manual scaling, but not supported for scheduled scaling.
  • The health check grace period is the time AWS waits after an EC2 instance is launched before doing a health check.
  • AWS Auto Scaling is a service that can control the auto scaling policies for non-EC2 services (e.g., DynamoDB).
  • AWS Predictive Scaling uses ML to help you predict when you should scale and how.
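
As an illustration of the target tracking policy mentioned in the list above, here is a minimal boto3 sketch (the Auto Scaling group name is hypothetical):

  import boto3

  autoscaling = boto3.client('autoscaling')

  # Target tracking: the group adds/removes instances to keep its
  # average CPU utilization around 70%.
  autoscaling.put_scaling_policy(
      AutoScalingGroupName='demo-asg',  # hypothetical
      PolicyName='keep-cpu-at-70',
      PolicyType='TargetTrackingScaling',
      TargetTrackingConfiguration={
          'PredefinedMetricSpecification': {
              'PredefinedMetricType': 'ASGAverageCPUUtilization',
          },
          'TargetValue': 70.0,
      },
  )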

Business continuity guidelines

  • Jargon
    • BC (Business continuity): Minimize business activity disruption when something bad happens.
    • DR (Disaster Recovery): Act of responding to failures that threat BC.
    • RTO (Recovery Time Objective): Time we set ourselves to complete DR. RTO is measured between the incident and its recovery.
    • RPO (Recovery Point Objective): Acceptable amount of data loss measured in time during an incident. RPO can be measured between the last backup and the incident.
  • Business continuity plan:
    1. Start with defining RTO and RPO
    2. #1 justifies the amount of effort the company is ready to invest in HA.
    3. The effort spent on HA should be inversely proportional to the likelihood of having an incident that requires DR.
    4. If we have to do a DR, that strategy will be scoped against the RTO and RPO defined in #1.
  • Disaster categories:
    1. hardware
    2. deployment
    3. (over)load (e.g., DDoS)
    4. data (bad conversion from one type to another)
    5. credentials expiration (e.g., failing to renew SSL)
    6. dependency (e.g., S3 fails and the rest falls like dominos)
    7. infrastructure
    8. exhaustion (e.g., no more capacity in an AZ to provision more EC2)
    9. human
  • AWS mentions 4 different types of DR strategies:
    1. backup & restore: Simple and cheap, but slow and not really flexible (e.g., Snowball, virtual tape library, Storage Gateway).
    2. pilot light: A minimal replica of your on-prem on AWS. It is also simple and cost effective but requires manual intervention. Also, you have to keep AMIs and data storage in sync.
    3. warm standby: Similar to pilot light, but it is already accepting load (e.g., a staging environment).
    4. multi site: The on-prem is fully replicated in the cloud and traffic is load-balanced between the two sites. If one of them fails, the other takes over automatically.
  • For EBS, choose the RAID type based on your BC needs and your snapshot strategy.
  • For HA NAT Gateways, set them up in each AZ and have routes for the private subnets to use the local gateway.

Deployment & Operation management

  • Deployment categories:
    • Big bang: All at once.
    • Phased rollout: One feature at a time with a clear cutover.
    • Parallel adoption: One feature at a time with the two systems working side-by-side.
  • Deployment types:
    • Rolling: Terminate old instances and start new, updated ones. If there is a fleet behind a load balancer, there should be no downtime.
    • A/B testing: Deploy new instances and split the traffic between new and old.
    • Canary: Progressively ramp up traffic from old to new.
    • Blue/Green: Deploy an entire new system and use Route 53 to switch between one or the other.
  • EC2 instance upgrades can be done using two different strategies:
    • in-place: The EC2 instances stay the same and the app is updated. Use CodeDeploy to perform that type of update.
    • disposable: You provision new machines with the new code and replace the old machines with the new ones. This is done with Elastic Beanstalk, CloudFormation or OpsWorks.
  • AWS Tools:
    • CodeCommit: Similar to GitHub but on AWS.
    • CodeBuild: Builds artifacts
    • CodeDeploy: Deploys artifacts
    • CodePipeline: Orchestration tool to manage all the above.
  • CloudFormation:
    • If you need to perform some custom actions (e.g., fetch data in some custom way via HTTP), use the AWS::CloudFormation::CustomResource.
    • Stack Policy:
      • This is a document that enforces rules to prevent deploying a template update that could adversely impact your infra.
      • By default, as soon as you add a policy, it protects everything. This means that the mere existence of a policy will prevent any action on any resource, which obviously sucks. That's why all stack policies must contain this statement in their Statement array:

        {
          "Statement": [
            {
              "Effect": "Allow",
              "Action": "Update:*",
              "Principal": "*",
              "Resource": "*"
            }
          ]
        }

      • A policy can only be added via the console or the CLI.
      • Once a policy is added, it cannot be removed. It can only be updated. Furthermore, this update can only be done via the CLI (see the sketch below).
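
For completeness, here is a minimal sketch applying the default-allow stack policy shown above via boto3 (the CLI equivalent is aws cloudformation set-stack-policy; the stack name is hypothetical):

  import json

  import boto3

  cfn = boto3.client('cloudformation')

  # Without this default-allow statement, the mere existence of the
  # policy would block every update on every resource.
  stack_policy = {
      'Statement': [{
          'Effect': 'Allow',
          'Action': 'Update:*',
          'Principal': '*',
          'Resource': '*',
      }],
  }

  cfn.set_stack_policy(
      StackName='demo-stack',  # hypothetical
      StackPolicyBody=json.dumps(stack_policy),
  )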

Cost management

Cost management guidelines

  • Strategies are:
    • Appropriate provisioning: Only provision what you need (delete what you are not using) and merge small things into bigger ones if you can (e.g., merging small DBs in a bigger one).
    • Right sizing: That's where the loosely coupled architecture really helps because we can downscale and upscale in an optimal way.
    • Purchase options: Reserved vs spot vs on-demand + EC2 fleet
    • Geographic selection: If your AWS service does not need to be in a specific geolocation, choose the cheapest region.
    • Managed services: Replace self-managed services with AWS fully-managed services if this move frees up your resources.
    • Optimized data transfer: Getting data in is free, but getting it out can be non-negligible depending on your use case. Consider the following options:
      • Cross-regions transfer might bite you.
      • Direct Connect could be a good cost saving option.
  • Leverage tagging to create resource groups that give you good reporting on your cost allocation.
  • If tagging is critical to you, you can enforce it via AWS Config.
  • RI (Reserved Instance):
    • Above $500,000 of reserved instances in a single region, you receive a 5% discount on your next RI. Above $4,000,000, you receive 10%.
    • The tenancy defines whether the instance runs on dedicated hardware or shared (default) hardware.
    • There are two types of dedicated:
      • dedicated instance:
        • Your EC2 instance runs on dedicated hardware that can be shared with others of YOUR EC2 instances.
        • Available as reserved, spot and on-demand.
      • dedicated host:
        • Your EC2 instance has exclusive hardware allocation (best for running software with specific licenses).
        • Only available as reserved.
    • Can be AZ specific (optional). If that option is not set, they are Region specific, which means we have a 100% guarantee of being able to provision an instance somewhere in the region, without controlling the AZ.
    • There are two types:
      • standard: Can change the size (exam: only for Linux in regional mode with default tenancy, i.e., shared), AZ and networking type.
      • convertible: Same as standard + change OS, instance family, tenancy and payment options. More expensive than standard.
  • Spot:
    • There are two request types:
      • one-time: When the spot instance is terminated by AWS because a higher bidder appeared, the processes are killed and the data are lost.
      • persistent: Same as one-time request except that the context is preserved, and when I get the lease back, my processes and data are restored.
  • Tools:
    • AWS Budgets: Creates alarms based on budget thresholds. Used to raise awareness.
    • Trusted Advisor: Can scan the accounts for saving tips.

Billing & Pricing

General Philosophy

  • Pay as you go
  • Pay for what you use
  • Pay less as you use more
  • Pay even less when you reserve capacity

CAPEX vs OPEX

  • CAPEX stands for Capital Expenditure. You pay upfront. That's prior to cloud.
  • OPEX stands for Operational Expenditure. You pay for what you use.

5 Billing Policies

  1. Pay as you go
  2. Pay less when you reserve
  3. Pay even less per unit when you use more
  4. Pay less as AWS grows
  5. Custom pricing

3 Main Drivers of Cost

  1. Compute
  2. Storage
  3. Data outbound (leaving your environment)

Data Transfer

  • Data transferred within the same AZ (over private IPs) is free; data transferred across AZs in the same region is charged.
  • Data transferred across regions is charged on both sides.

4 EC2 Pricing Models

  1. On-demand
  2. Dedicated instances
  3. Spot instances
  4. Reserved instances

What are the FREE AWS services?

  1. Amazon VPC
  2. Elastic Beanstalk
  3. IAM
  4. CloudFormation
  5. Auto Scaling
  6. OpsWorks
  7. Consolidated Billing

EC2 pricing drivers

  1. Clock hours (based on the instance type, you pay per hour or per second)
  2. Instance type
  3. Pricing model (reserved, spot, on-demand, dedicated)
  4. Number of instances
  5. Load balancing (network LB is more expensive than App LB)
  6. Detailed monitoring is more expensive than the standard 5 min.
  7. Auto scaling
  8. Elastic IP addresses: each Elastic IP beyond the one attached to a running instance costs something.
  9. OS (Windows more expensive than Linux)

EC2 Reserved Pricing

  • 1 year contract: Save between 36% (no upfront with monthly) to 40% (upfront)
  • 3 years contract: Save between 56% (no upfront with monthly) to 62% (upfront)
  • There are 3 different payment options:
    1. AURI (All Upfront)
    2. PURI (Partial Upfront)
    3. NURI (No Upfront)

Lambdas Pricing

  • Pay per request: 1st million free, then $0.20 per million.
  • Pay per duration: the total time the functions ran, multiplied by the RAM of the function (with 3.2 million seconds of free compute at 128MB, i.e., 400,000 GB-seconds). See the worked example below.
  • Pay per data exchanged with other services (e.g., S3)
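
A worked example of the duration charge, assuming the commonly cited rate of roughly $0.0000166667 per GB-second and the workload numbers below (both are illustrative assumptions; check the current price list):

  # Hypothetical workload: 5 million invocations/month, 200ms each, 512MB RAM.
  invocations = 5_000_000
  duration_s = 0.2
  memory_gb = 512 / 1024

  gb_seconds = invocations * duration_s * memory_gb                  # 500,000 GB-s
  billable_gb_s = max(gb_seconds - 400_000, 0)                       # minus the free tier
  duration_cost = billable_gb_s * 0.0000166667                       # ~$1.67
  request_cost = max(invocations - 1_000_000, 0) / 1_000_000 * 0.20  # ~$0.80

  print(round(duration_cost + request_cost, 2))                      # ~2.47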

EBS Pricing

Pay per:

  • Volume (GB)
  • Snapshot (GB)
  • Data transfer

S3 Pricing

Pay per:

  • Storage class
  • Storage (GB)
  • Requests (GET, PUT, COPY)
  • Data transfer

Glacier Pricing

Pay per:

  • Storage
  • Data retrieval time

Snowball Pricing

Pay per:

  • Size (50TB is $200 and 80TB is $250)
  • Daily charge. First 10 days free, then $15 a day
  • Data transfer. Data from Snowball to S3 is free, but from S3 to snowball is not.

RDS Pricing

Pay per:

  • Clock hours
  • DB characteristics (MySQL, MS SQL, Aurora, ...)
  • DB purchase type (i.e., H1, T2, X3, ...)
  • Number of instances
  • Storage (GB)
  • Additional storage
  • Requests
  • Deployment type
  • Data transfer

DynamoDB Pricing

Pay per:

  • Provisioned WRITE throughput
  • Provisioned READ throughput
  • Storage (GB)

CloudFront Pricing

Pay per:

  • Traffic distribution
  • Requests
  • Data transfer OUT

AWS Budgets vs AWS Cost Explorer

  • Budgets is used to predict how much you're gonna pay BEFORE it happens.
  • Cost Explorer is used to analyse where your money was spent AFTER it happened.

Consolidated Billing

  • It allows consolidating multiple AWS Accounts into a single Paying Account, letting you move to cheaper tiers thanks to AWS economies of scale.
  • If one account has 5 reserved EC2 instances but is only using 3, and a second account is using 4 on-demand EC2 instances, then the second account will be charged as if it had 2 on-demand and 2 reserved.
  • To toggle this feature, you need to create an Organization. The organization will then invite other AWS Account.
  • You can't manage the organization using IAM policies. Instead, you must use your root account.

Billing Calculators

Simple Monthly Calculator

It's just a form where you input what you're using and it will tell you how much you'll spend per month.

Total Cost of Ownership (TCO)

  • This is a more advanced calculator which compares how much you would save by moving your infra from on-premise or hybrid to AWS.
  • You can update the methodology as well as the assumptions, and then generate a report that you present to your CTO.
  • The main methodology takes into account:
    • Server costs (incl. OS licenses, virtualization software maintenance)
    • Storage costs
    • Network costs
    • IT labor costs

AWS Organization

  • It comes in 2 flavors:
    1. Full access
    2. Consolidated Billing only
  • In full access you can create Organization Units (e.g., Dev, Serverless, ...) and attach AWS accounts to those OUs. You can then apply policies to an OU or directly to an AWS Account (e.g., prevent usage of EC2).
  • Best practices:
    • Your paying account should only be used for billing. Don't deploy resources there.
  • You can have a max of 20 linked AWS accounts under an Org.

CloudTrail, or how to consolidate billing

  • CloudTrail is an auditing service which allows to track API calls with AWS (e.g., creation of an EC2 instance).
  • The difference between CloudTrail and CloudWatch is that the first focuses on auditing while the other focuses on performance.
  • CloudTrail is a per-region service, so you need to enable it in all regions :(.
  • Using CloudTrail, it is possible to track all AWS account activities and therefore create a global view of the expenditure, which then leads to consolidated billing. To do so:
    1. Turn on CloudTrail in the paying account.
    2. Create an S3 bucket in the paying account and attach a policy to it so it is accessible to all the other AWS accounts (cross-account access, see the sketch below).
    3. Turn on CloudTrail in all the AWS accounts to log into the shared S3 bucket.
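
A sketch of the bucket policy for step 2, following the standard CloudTrail bucket policy shape (GetBucketAcl plus PutObject with the bucket-owner-full-control ACL); the bucket name and account IDs are hypothetical:

  import json

  import boto3

  s3 = boto3.client('s3')

  # CloudTrail may read the bucket ACL and write log files, with one
  # AWSLogs/<account-id>/ prefix per linked AWS account.
  policy = {
      'Version': '2012-10-17',
      'Statement': [
          {
              'Effect': 'Allow',
              'Principal': {'Service': 'cloudtrail.amazonaws.com'},
              'Action': 's3:GetBucketAcl',
              'Resource': 'arn:aws:s3:::demo-billing-trail',
          },
          {
              'Effect': 'Allow',
              'Principal': {'Service': 'cloudtrail.amazonaws.com'},
              'Action': 's3:PutObject',
              'Resource': [
                  'arn:aws:s3:::demo-billing-trail/AWSLogs/111122223333/*',
                  'arn:aws:s3:::demo-billing-trail/AWSLogs/444455556666/*',
              ],
              'Condition': {'StringEquals': {'s3:x-amz-acl': 'bucket-owner-full-control'}},
          },
      ],
  }

  s3.put_bucket_policy(Bucket='demo-billing-trail', Policy=json.dumps(policy))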

AWS Well-Architected Framework

Full doc at https://aws.amazon.com/architecture/well-architected

This framework establishes five pillars for architecting in the Cloud. It is also useful to assess where a client sits in their Cloud journey:

  1. Operational excellence
  2. Security
  3. Reliability
  4. Performance efficiency
  5. Cost optimization

It also defines what is referred to as the General Design Principles:

  1. Stop guessing your capacity needs.
  2. Test systems at production scale (you can, now that infra is a commodity).
  3. Automate to make architectural experimentation easier.
  4. Allow for evolutionary architectures.
  5. Drive architectures using data.
  6. Improve through game days (similar to Netflix Chaos Monkey).

Exam tip: For whatever reason, the exam defines the General Design Principles as follows:

  1. Scalability
  2. Disposable resources
  3. Automation
  4. Loose coupling
  5. Managed services instead of servers
  6. Flexible data storage options

Operational excellence

  1. Perform operations as code
  2. Make frequent, small, reversible changes
  3. Refine operations procedures frequently
  4. Anticipate failure
  5. Learn from all operational failures

Best practices:

  • Prepare (using AWS Config)
  • Operate (using AWS CloudWatch)
  • Evolve (using AWS ES (Elasticsearch Service) to analyse the logs)

Security

  1. Implement a strong identity foundation
  2. Enable traceability
  3. Apply security at all layers
  4. Automate security best practices
  5. Protect data in transit and at rest
  6. Keep people away from data
  7. Prepare for security events

Best practices:

  1. IAM
  2. Detective Controls (using CloudWatch, CloudTrail, S3): processing of logs and monitoring of events that allow for auditing, automated analysis, and alarming.
  3. Infrastructure Protection (using AWS VPC)
  4. Data Protection (AWS ELB, then all AWS storage and all AWS RDS, AWS KMS (Key Management Service))
  5. Incident Response (AWS IAM for managing the incident response team, AWS CloudFormation is used to create a clean room, AWS CloudWatch is used to send response)

Reliability

  1. Automatically recover from failure
  2. Test recovery procedures
  3. Scale horizontally to increase aggregate workload availability
  4. Stop guessing capacity
  5. Manage change in automation (Changes to your infrastructure should be made using automation)

Best practices:

  1. Foundations (using IAM & VPC to provision the right resources, Trusted Advisor to monitor service limits, Shield to protect against DDoS), i.e., making sure your infra has what it takes (network provisioning, ...) to support your organization.
  2. Change Management (using CloudTrail, AWS Config, Auto Scaling and CloudWatch), i.e., monitoring changes in your IT consumption and workload in order to provision the correct resources.
  3. Failure Management (using CloudFormation, S3 and Glacier for backups, and KMS), i.e., how to restore when failure occurs.

Performance efficiency

  1. Democratize advanced technologies (buy advanced tech via AWS services rather than building your own (e.g., AI/ML))
  2. Go global in minutes
  3. Use serverless architectures
  4. Experiment more often
  5. Consider mechanical sympathy (choose the right service based on the business case)

Best practices:

  1. Selection i.e., what service to choose based on your workload and constraints. There are usually 4 variables:
    1. Compute (using Auto Scaling)
    2. Storage (using EBS, S3)
    3. Network (using VPC, Direct Connect, Route 53)
    4. DB (using RDS, DynamoDB)
  2. Review (following the AWS blog)
  3. Monitoring (using CloudWatch to trigger Lambdas)
  4. Tradeoffs (using ElastiCache, CloudFront, Snowball, and RDS read replicas to alleviate the tradeoffs)

Cost optimization

  1. Implement Cloud Financial Management
  2. Adopt a consumption model (only pay for what you use)
  3. Measure overall efficiency
  4. Stop spending money on undifferentiated heavy lifting (let AWS pay for the infra mgmt)
  5. Analyze and attribute expenditure (AWS lets you associate budgets to teams)

Best practices:

  1. Expenditure Awareness (using Cost Explorer and Budget)
  2. Cost-Effective Resources (using Cost Explorer for Reserved instances, CloudWatch + Trusted Advisor to determine the right sizes, Aurora and RDS to remove licenses, Direct Connect and CloudFront to optimize data transfer)
  3. Matching supply and demand (using Auto Scaling)
  4. Optimizing Over Time (using AWS blog for best practices and Trusted Advisor to optimize resources management)

Stuff that can be flushed afterwards

Wordpress BS

  • A WordPress plugin is required to leverage caching. Those plugins typically interact with a caching layer (e.g., Memcached) that sits between WP and the DB.
  • The core file in WordPress is wp-config.php. For the exam, it is important to know what this file does and what it does not:
    • It manages:
      • DB creds and host. That's the core function.
      • Security keys for login.
      • debug mode on/off
      • PHP memory limit
    • It does not manage:
      • Plugins
      • Caching

Troubleshooting RDS

Original doc at https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_Troubleshooting.html#CHAP_Troubleshooting.Backup.Retention

  • Can't connect to Amazon RDS DB instance:
    • Most likely due to a missing or incorrect inbound rule in your SG.
    • If you set up your RDS as public, make sure it is in a public subnet.
    • To test your connection, use one of the following command based on your OS:
      • Linux/Unix: nc -zv DB-instance-endpoint port (nc is netcat)
      • Windows: telnet DB-instance-endpoint port
  • Insufficient DB instance capacity: There are two scenarios where this could happen:
    • Restoring from a snapshot in an AZ that does not support the instance class. To fix this issue, try another AZ, change the instance class, or don't explicitly specify the AZ.
    • Restoring an EC2-classic platform snapshot which is not in a VPC.
  • Can't set backup retention period to 0: Disable read replicas first.

Keeping Amazon ElasticSearch in sync with your main DB

  • You must use AWS DMS and cannot use AWS Glue as AWS ES does not support Glue.
  • AWS ES can only ingest a fixed amount of writes; beyond that, it throws a 429. To not exceed that ingestion rate limit:
    • Calculate the ingestion rate limit for AWS ES.
    • Adjust the AWS DMS MaxFullLoadSubTasks and ParallelLoadThreads to not exceed the ingestion rate limit.

Oracle BS

  • RAC (Real Application Cluster) is not supported on RDS, which means you must host it yourself.
  • Supported licenses:
    • SE1 - Standard Edition One (both BYO and AWS-managed): This license started with version 11g.
    • SE2 - Standard Edition Two (both BYO and AWS-managed): This license started with version 12c.
    • SE - Standard Edition (BYO only)
    • EE - Enterprise Edition (BYO only)
  • Upgrading the license.
    • If it is managed by AWS, this can be done by changing the config; however, beware that this will create an outage.
    • If you're using BYO, you have to snapshot the DB, then create a new DB with the new license and DB engine from that snapshot.
  • Oracle Data Guard is a DR tool that protects your DB. It can technically create replicas, but that's not its main purpose.

Popular SaaS BS

  • Tableau is a BI tool that can integrate with Athena or Redshift/Redshift Spectrum. It does not integrate with S3 directly.

SCPs errors

  • An SCP can only be a single JSON object. If you need multiple statements, the Statement field supports an array, so put all the rules in that array (see the sketch below).
  • An SCP document cannot exceed 5,120 bytes. If a document exceeds that limit, try removing all white spaces.
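
A minimal sketch of a single SCP document whose Statement field is an array with several rules, attached via boto3 (the policy name is hypothetical):

  import json

  import boto3

  org = boto3.client('organizations')

  # One JSON object; all the rules live inside the Statement array.
  scp = {
      'Version': '2012-10-17',
      'Statement': [
          {'Effect': 'Allow', 'Action': '*', 'Resource': '*'},
          {'Effect': 'Deny', 'Action': 'ec2:*', 'Resource': '*'},
      ],
  }

  org.create_policy(
      Name='deny-ec2',  # hypothetical
      Description='Allow everything except EC2 actions',
      Type='SERVICE_CONTROL_POLICY',
      Content=json.dumps(scp),
  )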

FMEA (Failure Mode and Effect Analysis)

This is a Pro tip, and might not be asked in the exam. This framework helps with planning DR and HA.

  1. For each business capability (e.g., invoicing, shipping, selling), list the following:
    1. What could go wrong
    2. What impact it might have
    3. What is the likelihood of happening
    4. What is our ability to detect and react
  2. Prioritize each component of your analysis. You can use the RPN (Risk Priority Number) score, which is calculated as follows: RPN = severity * probability * detection

Example:

Capability: Invoicing

Failure mode | Cause | Current control
Pricing unavailable | Retail price incorrect in ERP | Master data maintenance audit report
Pricing incorrect | Retail price not assigned in ERP | None
Slow to build invoice | Invoicing system is slow | None
Unable to build invoice | Invoicing system is offline | Uptime monitor

Using the RPN formula, we can create this table:

Failure mode | Customer impact | Likelihood | Detect and react | RPN
Pricing unavailable | 7 | 3 | 2 | 42
Pricing incorrect | 8 | 3 | 9 | 216
Slow to build invoice | 5 | 2 | 9 | 90
Unable to build invoice | 8 | 3 | 2 | 48

We can see that Pricing incorrect has the highest RPN and therefore the highest priority. We should do whatever we can to lower that number.

Popular questions

  • What is an AWS region made of? A region is made of multiple availability zones (i.e., data centers).

  • Why should you use one region over another?

    1. Data sovereignty
    2. End-user latency
    3. Service availability: not all AWS services are available everywhere. us-east-1 is the one that gets everything first.
  • What are the 4 different support plans?

    1. Basic (Free)

      • It only covers questions around your account and billing, plus access to the community forum for all other questions.
    2. Developer ($29/month)

      • Can ask technical questions to a support center and expect response within 12 to 24 hours via email.
      • Can provide general guidance when you request Architecture Support.
    3. Business ($100/month)

      • 24/7 email, chat and phone support.
      • Production system down response time < 1 hour.
      • Full Trusted Advisor access to help optimize your infra.
      • AWS Support API access. You need this if you need to integrate with 3rd parties (e.g., Jira, Trello, Asana, ...).
      • Access to IEM (Infra Event Management) for an extra fee.
    4. Enterprise ($15,000/month)

      • Same as Business, plus:
        • Business-critical system down response time < 15 min.
        • An assigned Technical Account Manager (TAM) (pretty much your AWS EA), who should also be proactive.
        • AWS Concierge for non-urgent account and billing enquiries.

  • How to set up a billing alarm?

    1. Go to your account and select My Billing Dashboard. There, in the left menu, click Billing preferences and toggle the Receive Billing Alerts checkbox (this enables billing metric data to be published to CloudWatch).
    2. Go to CloudWatch, make sure the N. Virginia region has been selected, under Alarms click Billing, and then click the Create alarm button.
    3. There will be a step to create an SNS topic.
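
    The same alarm can also be scripted. A minimal boto3 sketch, assuming billing alerts are already enabled and an SNS topic exists (the topic ARN and the $100 threshold are hypothetical):

      import boto3

      # Billing metrics only live in us-east-1 (N. Virginia).
      cloudwatch = boto3.client('cloudwatch', region_name='us-east-1')

      cloudwatch.put_metric_alarm(
          AlarmName='monthly-bill-over-100-usd',
          Namespace='AWS/Billing',
          MetricName='EstimatedCharges',
          Dimensions=[{'Name': 'Currency', 'Value': 'USD'}],
          Statistic='Maximum',
          Period=21600,  # 6 hours
          EvaluationPeriods=1,
          Threshold=100.0,
          ComparisonOperator='GreaterThanThreshold',
          AlarmActions=['arn:aws:sns:us-east-1:111122223333:billing-alerts'],  # hypothetical
      )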
  • Name all the compute services:

    • EC2
    • Lambda
    • Elastic Beanstalk
  • Which services are global?

    1. IAM
    2. Route 53
    3. CloudFront
    4. SNS
    5. SES
  • Which services can be used On-Premise?

    1. Snowball (big hard drive to ship data to AWS)
    2. Snowball Edge (same as Snowball but with a CPU to run lambdas; it's like a walking Lambda)
    3. Storage Gateway: either virtual or physical storage that stays in your data center and replicates your files to S3.
    4. Snowcone
    5. CodeDeploy: helps deploy your code to AWS EC2, but can also be used to deploy your code to your on-premise machines.
    6. OpsWorks: similar to Elastic Beanstalk. It uses Chef. You can use OpsWorks to deploy to EC2 or to your on-premise machines.
    7. IoT Greengrass: collects data in your data center and streams it to AWS.
  • What are the 6 benefits of the Cloud?

    1. Trade capital expense for variable expense.
    2. Massive economies of scale.
    3. Stop guessing about capacity.
    4. Increase speed and agility.
    5. Stop spending money on running and maintaining data centers.
    6. Go global in minutes.
  • What are the 3 types of Cloud offerings?

    1. IaaS like EC2
    2. PaaS like Elastic Beanstalk
    3. SaaS like Stripe
  • What are the 3 different types of Cloud Deployments?

    1. Public Cloud like AWS, Google, Azure
    2. Hybrid
    3. Private, i.e., on-premise, like VMware or OpenStack in your own data center.
  • What do you need to allow a private subnet to access the internet? A NAT Gateway.

  • What are the Auto Scaling Component?

    1. Auto Scaling Group: This is the group of EC2 instances that scale identically.
    2. Launch Configuration: This is the Auto scaling group configuration.
    3. Auto Scaling Policies: The config uses policies to determine how to scale.
  • How do you manage AWS Service Limits?

    1. To increase them, use the AWS Support Center and choose Service Limit Increase.
    2. To monitor your current service limits status, use AWS Trusted Advisor and look for the Service Limit Check section.
  • How do you provision SSL certs?

    1. Use AWS ACM (Certificate Manager).
    2. Use IAM for regions that don't support ACM
  • How do you manage encryption keys?

    1. AWS KMS (Key Management Service). KMS is a managed service that makes it easy for you to create and control the encryption keys used to encrypt your data, and uses FIPS 140-2 validated hardware security modules to protect the security of your keys.
    2. AWS HSM (Hardware Security Module). HSM enables you to easily generate and use your own encryption keys on the AWS Cloud. With CloudHSM, you can manage your own encryption keys using FIPS 140-2 Level 3 validated HSMs.
  • What's the difference between AWS KMS and AWS HSM?

    • KMS only uses symmetric keys, while HSM also supports asymmetric keys.
    • HSM uses its own dedicated hardware, whereas KMS is a managed service with shared infra. Depending on your compliance requirements, you may have to choose HSM.
  • What is AWS CAF? The AWS Cloud Adoption Framework (AWS CAF) helps organizations design and travel an accelerated path to successful cloud adoption.

  • What is AWS Service Catalog?

    • AWS Service Catalog allows organizations to create and manage catalogs of IT services that are approved for use on AWS.
    • Its main benefit is to centrally manage commonly deployed IT services.
  • What are the 4 advantages of AWS Security?

    1. Keep Your Data Safe
    2. Meet Compliance Requirements
    3. Save Money
    4. Scale Quickly
  • When are Elastic IPs free? In a nutshell, you can have one EIP for free. In detail, an EIP is free when:

    1. The Elastic IP address is associated with an EC2 instance.
    2. The instance associated with the Elastic IP address is running.
    3. The instance has only one Elastic IP address attached to it.

    You're charged by the hour for each Elastic IP address that doesn't meet these conditions.

  • What is AWS Global Accelerator

    • AWS Global Accelerator is a networking service that improves the availability and performance of the applications that you offer to your global users.
  • What are the server-based AWS Services?

    • EC2
    • RDS
    • Redshift
    • EMR

Annex

Jargon

  • AD (Active Directory) is a hierarchical DB that stores objects and relations, a bit like a phone book. The most well-known AD is Microsoft AD. An open-source version of MS-AD is Samba AD. In the Cloud, AWS uses AWS Cloud Directory.
  • ASR (Automatic Speech Recognition) Used in AWS Lex.
  • BC (Business continuity): Minimize business activity disruption when something bad happens.
  • Connection-based protocol means that the server and client can acknowledge whether messages have been received (e.g., TCP), as opposed to connectionless (e.g., UDP).
  • DR (Disaster Recovery): Act of responding to failures that threat BC.
  • Elastic means the system can change size (both up and down) based on the payload.
  • Fault tolerant means that the system can still work when one of its components shits itself. This is a hardware solution that guarantees almost no interruption.
  • Highly available sees the system as a whole and tries to maintain as much uptime as possible. Interruption is allowed but not guaranteed. Usually cheaper than fault tolerance.
  • HPC (High Performance Computing).
  • hub-and-spoke network is one where all nodes talk to each other via a single node (the hub). This is the opposite of a point-to-point network.
  • IBM WebSphere is a legacy web server used to develop websites or web APIs.
  • ICMP (Internet Control Message Protocol) protocol. That's the protocol used by ping and routers (exam question).
  • IDS (Intrusion Detection System)
  • IPS (Intrusion Prevention System)
  • LDAP (Lightweight Directory Access Protocol) is a client/server protocol to query an AD.
  • LOB (Large Objects). This is mentioned in AWS DMS, which can be configured in Full LOB mode or Limited LOB mode (for performance).
  • Lotus Notes: At its core, it is an email server, but it is also used as enterprise development software to build internal apps and manage documents.
  • MEAN Stack: MongoDB, Express, Angular, Node.
  • MTTR (Mean Time To Recovery): that's a KPI used to measure highly available systems.
  • MTU (Maximum Transmission Unit). The default is 1500 (i.e., packets of 1500 bytes). If you use a higher number, it is very likely that other systems will break each packet down to drop under 1500. The two systems that allow using higher values (called Jumbo Frames) are EC2 Placement Groups and AWS Direct Connect.
  • NACL (Network Access Control List) is used to configure the firewall of your VPC subnets.
  • NLU (Natural Language Understanding) Used in AWS Lex.
  • Oracle Solaris is a legacy OS which is not supported on AWS because it requires SPARC processors.
  • OSI (Open System Interconnection) Layer. It divides network communication into 7 layers. Layers 1-4 are called the lower layers, i.e., hardware layers, mainly concerned with moving data around. Layers 5-7 are called the upper layers, and contain application-level data. In that model, each layer has a specific job to do and does it sequentially, starting with layer 7 down to layer 1.
  • RPG: High-level programming language from IBM.
  • RPO (Recovery Point Objective): Acceptable amount of data loss measured in time, i.e., the maximum age of the backup used to consider a system properly restored. For example, if the RPO is 1 hour, then as long as we can restore from a backup that is at most 1 hour old, the system is considered successfully recovered. The RPO can be measured between the last backup and the incident, and it determines the backup frequency in a DR strategy.
  • RTO (Recovery Time Objective): Time we set ourself to complete DR. RTO is measured between the incident and its recovery.
  • Scalable means that the system can grow when the payload increases.
  • SIEM (Security Information and Event Management) is a system made of the logs from the IPS and IDS and the logic that can take actions based on those logs to improve the security of your infrastructure.
  • SPOF (Single Point of Failure)
  • SSE (Server-Side Encryption). This is often mentioned with the following techniques:
    • SSE-S3: S3 managed key service.
    • SSE-KMS: S3 automatically uses a data key generated using an AWS KMS CMK.
    • SSE-C: You manage your own custom keys.
  • Stateful in the context of security means that return traffic for an allowed inbound connection is automatically allowed outbound (e.g., security groups), as opposed to stateless (e.g., NACLs), where inbound and outbound rules are evaluated independently.

Is DynamoDB truly schemaless?

A schemaless data store is one that does not require defining any schema upfront. So why is DynamoDB categorized as a schemaless data store even though it must contain a schema definition when it is created?

The answer is that DynamoDB is not a pure schemaless DB. AWS probably categorizes it as such for marketing and sales reasons. Technically, it is possible to insert objects whose schema is a superset of the table schema, which means that it kind of looks like there is no schema for those attributes which have not been defined in the table's schema. This begs the question: why does DynamoDB require a schema if it allows inserting attributes that are not part of the schema? The answer is indexing. The 2 key value propositions of DynamoDB are its horizontal scale and its linear read/write throughput. To accomplish this, DynamoDB leverages partitions, which explains why it requires a primary key index. Creating a table requires first creating a schema made of attributes, and then choosing one attribute to be the primary key. So what about the other attributes? Those attributes can be used to help partition the data set based on how you intend to query your data. By defining a schema upfront, you're anticipating that you may have to perform faster queries, but you're not sure which ones yet. Because secondary indexes (i.e., indexes other than the primary key) can only be made from top-level attributes of type N, S or B, it is recommended to explicitly define a table's schema based on your domain object.
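
To make this concrete, here is a minimal boto3 sketch, assuming a hypothetical Music table: only the key attributes are declared upfront, yet items may carry any additional attributes:

  import boto3

  dynamodb = boto3.client('dynamodb')

  # Only the attributes used in keys (or secondary indexes) are declared.
  dynamodb.create_table(
      TableName='Music',  # hypothetical
      AttributeDefinitions=[
          {'AttributeName': 'Artist', 'AttributeType': 'S'},
          {'AttributeName': 'SongTitle', 'AttributeType': 'S'},
      ],
      KeySchema=[
          {'AttributeName': 'Artist', 'KeyType': 'HASH'},      # partition key
          {'AttributeName': 'SongTitle', 'KeyType': 'RANGE'},  # sort key
      ],
      BillingMode='PAY_PER_REQUEST',
  )
  dynamodb.get_waiter('table_exists').wait(TableName='Music')

  # Items can be a superset of that "schema": Year was never declared.
  dynamodb.put_item(
      TableName='Music',
      Item={
          'Artist': {'S': 'Acme Band'},
          'SongTitle': {'S': 'Happy Day'},
          'Year': {'N': '2015'},
      },
  )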

Basic skills

Tools

  • nc (netcat) is a Linux/Unix tool to do anything TCP/UDP. You often use it to test connecting to a remote device such as a database.
  • telnet is similar to nc but for Windows.
  • rsync (remote sync) is a Linux/Unix tool for copying and syncing files between folders, either locally or remotely (via an SSH connection).

Storage

  • RAID. Besides a standard single disk, there are 4 common RAID configurations:
    • standard: Normal disk.
    • RAID0: Single partition striped over 2 disks, which doubles the read/write throughput (normal, as you have 2 pieces of hardware helping you read and write). This offers the highest read/write throughput, but if one disk fails, both disks are unusable.
    • RAID1: Two mirrored disks, so if one goes down, the other takes over. Read/write throughput is lower, and you need double the storage, but FT is improved.
    • RAID5: Three disks. Offers the same FT level as RAID1 but improves read throughput, while write throughput is lower.
    • RAID6: Four disks. Offers the maximum level of FT because 2 disks can fail. The read throughput is as good as RAID0, while write throughput is the worst.

    AWS recommends NOT USING RAID5 or RAID6 because they consume too much IOPS over the network (up to 20 to 30%), which dilutes their benefits.

Networking

Refresher example

Very briefly, remember that a VPC is specific to a region (no such thing as a VPC that covers multiple regions) and that subnets are specific to an availability zone (no such thing as a subnet that span across multiple AZs). That means that if you're in the Sydney region (which only has three AZs: ap-southeast-2a, ap-southeast-2b and ap-southeast-2c) and if you want to configure both a private subnet and a public subnet, then you'll end up with six subnets: three private subnets, one for each AZ and three public subnets, one for each AZ.

The following two common scenarios are quick refreshers:

1: Securing an EC2 public access:

How would you configure an EC2 instance so that it cannot be DIRECTLY and publicly accessed, but could be indirectly accessed publicly via a load balancer?

  1. Make sure the LB is in a public subnet, otherwise, it won't be able to receive traffic from the web.
  2. Make sure that the EC2 instance is associated with a subnet (private or public) in the same AZ as the LB's subnet.
  3. Add a security group to the load balancer (let's call it SG-LB) and another one to the EC2 instance (let's call it SG-EC2).
  4. Configure SG-LB to accept traffic for HTTP (port 80) and HTTPS (port 443) for source 0.0.0.0/0.
  5. Configire SG-EC2 to accept traffic for HTTP (port 80) and HTTPS (port 443) for source SG-LB.

Notice that in #2, it does not matter whether the EC2 is in a public or private subnet. Ideally, it is in a private subnet to guarantee that no public traffic can access it, even if an engineer misconfigured an SG.
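
Step 5 (a security group rule whose source is another security group) is the key trick; here is a minimal boto3 sketch with hypothetical group IDs:

  import boto3

  ec2 = boto3.client('ec2')

  # SG-EC2 accepts HTTPS only from members of SG-LB, never from the internet.
  ec2.authorize_security_group_ingress(
      GroupId='sg-0123456789abcdef0',  # hypothetical SG-EC2
      IpPermissions=[{
          'IpProtocol': 'tcp',
          'FromPort': 443,
          'ToPort': 443,
          'UserIdGroupPairs': [{'GroupId': 'sg-0fedcba9876543210'}],  # hypothetical SG-LB
      }],
  )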

2: Load balancing public traffic safely:

How many load balancers do you need to manage traffic across two regions, and how do you configure them safely?

We need to setup:

  • Two load balancers, one per region. That's because VPCs can only exist in a single region; there is no such thing as a VPC across multiple regions.
  • A DNS configuration (e.g., AWS Route 53) that can route traffic to the correct LB based on its source.
  • public subnets for all load balancers (as many as there are AZs per region). That allows the LB to receive public traffic.
  • (optional) a private subnet for each EC2 instance in each AZ for each region.
  • Security groups on each LB to allow traffic.
  • Security group on each EC2 instance to allow traffic from the LB.

Public vs Private IP addresses

Understanding this concept is highly critical. It is unfortunately not obvious to determine when an IP is public or private. Things become even more confusing when public and private subnets are thrown into the mix. This section aims at clarifying what they are.

  • A public IP is one that is public on the internet. When a service is hosted on AWS with a public IP, that service can egress to the public internet and accept ingress from it too.
  • A private IP is one that can only be used to communicate inside a VPC. When a service is hosted on AWS with a private IP only, that service cannot egress to the public internet, and the public internet cannot reach it either.
  • A public subnet is a subnet that possesses a route table mapping between 0.0.0.0/0 and an internet gateway.
  • Hosting services inside a public subnet does not mean that service is publicly accessible. If that service uses a private address, than that service cannot send or receive traffic to or from the internet. However, than service can become publicly accessible if a public IP is associated to it.
  • Only public subnets can associate public IPs to their hosted services.
  • Public subnets can contain private IPs.
  • A private subnet is one that does not contain any mapping between IPs and internet gateway in its routing table.
  • By default, when an EC2 is launched inside a public subnet, it receives both a public and private IP.
  • When a services inside a private subnet needs to access the internet, a new rule must be added into its route table that sends traffic to a NAT located in the public subnet.
  • AWS lambdas connected to a VPC are assigned a private IP address. So even if that subnet associated to that lambda is public, that lambda won't be able to ingress/egress to the internet. Instead, a NAT must be added in the public subnet, and a route table mapping must redirect the lambda subnet traffic to that NAT.
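
To make the route-table mechanics concrete, here is a minimal boto3 sketch that turns an existing subnet into a public subnet. All IDs are hypothetical placeholders:

  import boto3

  ec2 = boto3.client("ec2")

  vpc_id = "vpc-0123456789abcdef0"        # hypothetical
  subnet_id = "subnet-0123456789abcdef0"  # hypothetical

  # Create and attach an internet gateway to the VPC.
  igw_id = ec2.create_internet_gateway()["InternetGateway"]["InternetGatewayId"]
  ec2.attach_internet_gateway(InternetGatewayId=igw_id, VpcId=vpc_id)

  # The 0.0.0.0/0 -> IGW route is what makes the subnet "public".
  rt_id = ec2.create_route_table(VpcId=vpc_id)["RouteTable"]["RouteTableId"]
  ec2.create_route(RouteTableId=rt_id, DestinationCidrBlock="0.0.0.0/0", GatewayId=igw_id)
  ec2.associate_route_table(RouteTableId=rt_id, SubnetId=subnet_id)

  # Auto-assign public IPs to instances launched in this subnet.
  ec2.modify_subnet_attribute(SubnetId=subnet_id, MapPublicIpOnLaunch={"Value": True})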

DNS vs DHCP

DNS (Domain Name System) vs DHCP (Dynamic Host Configuration Protocol).

  • DNS maps a domain to an IP.
  • DHCP associates IPs to devices (aka hosts). In a nutshell, the DHCP server manages IPs on a network. Its most essential tasks are:
    • Assigning the subnet mask.
    • Handing out the IPs of:
      • The default gateway.
      • The DNS server.

OSI

OSI (Open Systems Interconnection) is a network model made of 7 layers. To remember those layers, use this mnemonic:

  1. Please: Physical
  2. Do: Data link (MAC)
  3. Not: Network
  4. Throw: Transport (e.g., TCP)
  5. Sausage: Session
  6. Pizza: Presentation (e.g., SSL, compression)
  7. Away: Application (e.g., Web browser)

The first two layers (Physical and Data link) are AWS's responsibility. The rest is yours.

Ephemeral ports

What are they?

Those ports are short-lived ports used by a client machine to make requests. They exist to allow a client to make multiple concurrent requests from the same IP. Without ephemeral ports, a client would only be able to make sequential requests, forcing each request to wait for the port to be free before using it.

(Exam:) The range of ephemeral ports varies from one system to another:

  • 1024-65535:
    • NAT gateway
    • AWS Lambda
    • ELB
  • 32768-61000: Many Linux kernels
  • 1025-5000: Windows Server up to and including 2003
  • 49152-65535:
    • Windows server 2008+
    • Mac OS

So as you can see, the range that should cover all the use cases is 1024-65535.

When do things get messy because of them?

The usual best practice to secure subnets is to define NACLs (i.e., Network Access Control Lists). A common mistake is to forget to include those EPs in the outbound rules. For example:

Inbound rules:

Rule  Protocol  Port range  Source     Allow/Deny
100   HTTPS     443         0.0.0.0/0  ALLOW

Outbound rules:

Rule  Protocol  Port range  Destination  Allow/Deny
100   HTTPS     443         0.0.0.0/0    ALLOW

This will not work. The inbound rule is correct, as it allows any source to access the subnet using TCP on port 443. However, the server will not respond back to a destination on port 443. Instead, it will respond to the client's port, which is whatever ephemeral port the client, based on its OS, randomly chose to make that request. The outbound rule must be changed to one of the following two options:

Rule  Protocol  Port range  Destination  Allow/Deny
100   TCP       *           0.0.0.0/0    ALLOW

or

Rule  Protocol  Port range  Destination  Allow/Deny
100   TCP       1024-65535  0.0.0.0/0    ALLOW

As discussed previously, that range covers virtually all systems' ephemeral port ranges.
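
Here is a minimal boto3 sketch of that fix. The NACL ID is a hypothetical placeholder; protocol "6" is TCP:

  import boto3

  ec2 = boto3.client("ec2")

  nacl_id = "acl-0123456789abcdef0"  # hypothetical

  # Inbound rule 100: allow HTTPS from anywhere.
  ec2.create_network_acl_entry(
      NetworkAclId=nacl_id, RuleNumber=100, Protocol="6", RuleAction="allow",
      Egress=False, CidrBlock="0.0.0.0/0", PortRange={"From": 443, "To": 443})

  # Outbound rule 100: allow responses back to the client's ephemeral port.
  ec2.create_network_acl_entry(
      NetworkAclId=nacl_id, RuleNumber=100, Protocol="6", RuleAction="allow",
      Egress=True, CidrBlock="0.0.0.0/0", PortRange={"From": 1024, "To": 65535})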

And even though it is often on the outbound rules that those EPs must be configured (because the usual use case is a client connecting to a server, i.e., ingressing), this does not mean that inbound rules are always worry-free. Consider the case of an EC2 server that needs to periodically install patches. That server will initiate the connection using one of its available EPs, and the patching server will respond using that same EP. In that case, the inbound rules table needs to include a rule to cover that use case.

NAT (Network Address Translation)

  • A NAT helps services contact and receive responses from the public internet, while preventing the public internet from contacting them.
  • A NAT is a physical device (typically a server) that forwards traffic from the instances in the private subnet to the internet or other AWS services, and then sends the responses back to the instances. It deals with outbound traffic, not inbound traffic (services outside the network cannot initiate connections to resources inside the network).
  • NAT vs proxy: A NAT works somewhat like a forward proxy, but is implemented differently. A forward proxy typically operates at layer 7 of the OSI model (application), while a NAT operates at layers 3 (network) and 4 (transport).
  • NAT vs internet gateway:
    • An internet gateway is NOT a NAT and is NOT a physical device.
    • It is a logical connection between an Amazon VPC and the Internet.
    • There can only be one IG per VPC.
    • An IG has NO bandwidth limit.

The topic specific to AWS NAT is detailed under the NAT Gateway & NAT instances section.
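
In the meantime, here is a minimal boto3 sketch of wiring a private subnet to a NAT gateway. All IDs are hypothetical placeholders:

  import boto3

  ec2 = boto3.client("ec2")

  public_subnet_id = "subnet-0aaaaaaaaaaaaaaaa"  # hypothetical
  private_rt_id = "rtb-0bbbbbbbbbbbbbbbb"        # hypothetical

  # A NAT gateway lives in a PUBLIC subnet and needs an Elastic IP.
  eip = ec2.allocate_address(Domain="vpc")
  nat_id = ec2.create_nat_gateway(
      SubnetId=public_subnet_id,
      AllocationId=eip["AllocationId"])["NatGateway"]["NatGatewayId"]

  # The private subnet's route table sends internet-bound traffic to the NAT,
  # enabling outbound-only access (no inbound connections from the internet).
  ec2.create_route(RouteTableId=private_rt_id, DestinationCidrBlock="0.0.0.0/0",
                   NatGatewayId=nat_id)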

Popular ports

  • SSH: 22
  • PowerShell remoting: 5985 (HTTP) and 5986 (HTTPS)
  • HTTPS: 443
  • Windows RDP: 3389
  • MSSQL DB: 1433
  • MySQL DB: 3306
  • PostgreSQL DB: 5432

AWS Gateways

There is a plethora of gateways in networking, which means this concept must be mastered to be a Cloud engineer. The list below contains the most common gateways in AWS:

  • Local route: This is the default entry (target "local") in any subnet's route table. It covers the VPC's CIDR, and it is what components inside the VPC (across all of its subnets) use to communicate with each other by default. The local route cannot reach destinations outside the VPC; to establish connections beyond it, you will have to add other gateways (internet gateway, NAT, virtual private gateway, peering, ...) and map them to destination CIDRs in the routing tables.
  • Internet gateway: This is a gateway that can send/receive traffic to/from the internet. When an internet gateway is configured in a routing table of a VPC subnet, the subnet is called a public subnet.

CIDR

  • Classless Inter-Domain Routing is a method for allocating IP addresses and for IP routing. Its aim was to replace the classful network approach, which led to IP exhaustion with the rise of the internet.
  • CIDR represents IPv4 addresses encoded in 32 bits. They are broken down between their network portion (the part of the IP that defines the network via a mask) and their host portion (the part of the IP that defines the device). The network portion is always made of the leftmost bits. Masking is the process of annotating how many bits on the left are used for the network portion.
  • CIDR uses a compact notation of masking. It works as follow:
    • /24: The first 24 bits from the left are used for the network.
    • /16: The first 16 bits from the left are used for the network.

If you have a subnet 10.0.0.0/24, this means that the network address uses the first 24 bits, which leaves 8 bits (from 0 to 255) for the hosts (from 10.0.0.0 to 10.0.0.255). It could, for example, be divided into two subnets of equal size, each using 128 IPs (verified in the sketch after this list):

  • 10.0.0.0/25 for IPs between 10.0.0.0 and 10.0.0.127
  • 10.0.0.128/25 for IPs between 10.0.0.128 and 10.0.0.255
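
You can verify this split with Python's standard ipaddress module:

  import ipaddress

  net = ipaddress.ip_network("10.0.0.0/24")
  print(net.num_addresses)                    # 256
  print(list(net.subnets(prefixlen_diff=1)))  # [IPv4Network('10.0.0.0/25'), IPv4Network('10.0.0.128/25')]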

Examples:

  • 3 subnets in a VPC with CIDR 172.31.0.0/16 that can each host 4096 (2^12) devices:

    • subnet-b8956bde: 172.31.0.0/20
    • subnet-a6523bfe: 172.31.16.0/20
    • subnet-6a10f822: 172.31.32.0/20
  • 3 subnets in a VPC with CIDR 172.30.0.0/16 that can each host 256 (2^8) devices:

    • subnet-0c2372ca21e0b94ba: 172.30.0.0/24
    • subnet-07a710fa68cb9bce0: 172.30.1.0/24
    • subnet-0b4969850765857d3: 172.30.2.0/24

VPN

VPN max bandwidth is usually ~1.25 Gbps. This means that any solution relying on VPN tunnelling is bound by this limit. To achieve higher bandwidth on AWS:

  • Use multiple VPN tunnels (called VPN aggregation) to aggregate the bandwidth. However, this setup does not change the fact that a single flow is still limited to 1.25 Gbps.
  • Switch to Direct Connect.

DevOps

  • CI (Continuous Integration) is the process of automating your build and making sure that the master branch is stable and has been automatically tested.
  • CD (Continuous Delivery) is the next step after CI and includes deployment at the click of a button. This means that a person (aka release manager) can decide to deploy whenever they see fit.
  • CD (Continuous Deployment) goes one step further than Continuous Delivery by automating the deployment as soon as the master branch (or whichever branch is the deployment source of truth) receives a new commit.

Security

  • IAM roles use two types of policies (see the sketch after this list):
    • The permission policy defines what can or cannot be accessed.
    • The trust policy defines who can assume the role.
  • Service-linked roles:
    • Predefined IAM roles that only AWS services can assume. This simplifies the setup of AWS services, as those services usually must call other AWS services. Without this concept, you would have to create those very verbose roles yourself and then attach them to the service, which would make AWS services harder to get started with.
    • Those roles are automatically created for you when you start using an AWS service. Depending on the service, a WYSIWYG form may ask you a few questions to help set it up.
    • EXAM: Service-linked roles only apply to services, not users. If you see a solution that mentions updating the trust policy of an SLR to allow a user, this is a fallacy.
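
To illustrate the two policy types, here is a boto3 sketch that creates a role the EC2 service can assume (trust policy) with read access to one bucket (permission policy). The role, policy and bucket names are hypothetical placeholders:

  import json

  import boto3

  iam = boto3.client("iam")

  # Trust policy: WHO can assume the role (here, the EC2 service).
  trust_policy = {
      "Version": "2012-10-17",
      "Statement": [{
          "Effect": "Allow",
          "Principal": {"Service": "ec2.amazonaws.com"},
          "Action": "sts:AssumeRole",
      }],
  }
  iam.create_role(RoleName="demo-ec2-role",
                  AssumeRolePolicyDocument=json.dumps(trust_policy))

  # Permission policy: WHAT the role can access (here, objects in one bucket).
  permission_policy = {
      "Version": "2012-10-17",
      "Statement": [{
          "Effect": "Allow",
          "Action": "s3:GetObject",
          "Resource": "arn:aws:s3:::my-example-bucket/*",
      }],
  }
  iam.put_role_policy(RoleName="demo-ec2-role", PolicyName="demo-s3-read",
                      PolicyDocument=json.dumps(permission_policy))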

Resources
