Setup Docker on Amazon Linux 2023

The following guide covers setting up Docker with Docker Compose v2 on Amazon Linux 2023. The steps are intended for AL2023 on EC2 but should mostly work for AL2023 VMs running on other hypervisors.

Install and configure Docker on Amazon Linux 2023

Check for new updates

Get the host's current Amazon Linux 2023 release:

rpm -q system-release --qf "%{VERSION}\n"

OR

cat /etc/amazon-linux-release

To find out the LATEST SYSTEM RELEASE of Amazon Linux 2023:

# Print only the latest available release version
sudo dnf check-release-update --refresh --latest-only --version-only
# For more verbose output, use:
#sudo dnf check-release-update

To upgrade the Amazon Linux 2023 based host within the CURRENT SYSTEM RELEASE, i.e. the release reported by cat /etc/amazon-linux-release:

sudo dnf check-update --refresh
sudo dnf upgrade --refresh

To upgrade the Amazon Linux 2023 based host to a SPECIFIC SYSTEM RELEASE:

sudo dnf check-update --refresh --releasever=2023.5.20240722
sudo dnf update --refresh --releasever=2023.5.20240722

To upgrade the Amazon Linux 2023 based host to the LATEST SYSTEM RELEASE:

sudo dnf check-update --refresh --releasever=latest
sudo dnf upgrade --refresh --releasever=latest

Note: Using sudo dnf upgrade --releasever=latest updates all packages, including system-release. Afterwards, the version remains locked to the new system-release unless you set the persistent override described below.

To permanently switch the host to always get the latest system release updates:

# This command only needs to be run once
sudo touch /etc/dnf/vars/releasever && echo 'latest' | sudo tee /etc/dnf/vars/releasever
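
Verify the override took effect (the file should now contain the word latest):

cat /etc/dnf/vars/releasever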

Then it's just a matter of running the following commands to update via latest:

sudo dnf check-update --refresh
sudo dnf upgrade --refresh

To get more details about the current repos:

dnf repoinfo
dnf repolist all --verbose

Install Base OS Packages

Install the following packages, which are generally good to have on any host:

sudo dnf install --allowerasing -y \
  kernel-modules-extra \
  dnf-plugins-core \
  dnf-plugin-release-notification \
  dnf-plugin-support-info \
  dnf-utils \
  git-core \
  git-lfs \
  grubby \
  kexec-tools \
  chrony \
  audit \
  dbus \
  dbus-daemon \
  polkit \
  systemd-pam \
  systemd-container \
  udisks2 \
  crypto-policies \
  crypto-policies-scripts \
  openssl \
  nss-util \
  nss-tools \
  dmidecode \
  nvme-cli \
  lvm2 \
  dosfstools \
  e2fsprogs \
  xfsprogs \
  xfsprogs-xfs_scrub \
  attr \
  acl \
  shadow-utils \
  shadow-utils-subid \
  fuse3 \
  squashfs-tools \
  star \
  gzip \
  pigz \
  bzip2 \
  zstd \
  xz \
  unzip \
  p7zip \
  numactl \
  iproute \
  iproute-tc \
  iptables-nft \
  nftables \
  conntrack-tools \
  ipset \
  ethtool \
  net-tools \
  iputils \
  traceroute \
  mtr \
  telnet \
  whois \
  socat \
  bind-utils \
  tcpdump \
  cifs-utils \
  nfsv4-client-utils \
  nfs4-acl-tools \
  libseccomp \
  psutils \
  python3 \
  python3-pip \
  python3-psutil \
  python3-policycoreutils \
  policycoreutils-python-utils \
  bash-completion \
  vim-minimal \
  wget \
  jq \
  awscli-2 \
  ec2rl \
  ec2-utils \
  htop \
  sysstat \
  fio \
  inotify-tools \
  rsync

(Optional) Remove EC2 Hibernation Agent

Run the following command to remove the EC2 Hibernation Agent:

sudo dnf remove -y ec2-hibinit-agent

(Optional) Install EC2 Instance Connect Utility

sudo dnf install --allowerasing -y ec2-instance-connect ec2-instance-connect-selinux

(Optional) Install Smart-Restart Utility

Amazon Linux now ships with the smart-restart package. The smart-restart utility restarts systemd services after system updates, i.e. whenever a package is installed or removed using the system's package manager (dnf <update|upgrade|downgrade>).

smart-restart uses needs-restarting from the dnf-utils package and a custom denylisting mechanism to determine which services need to be restarted and whether a system reboot is advised. If a system reboot is advised, a reboot hint marker file is generated (/run/smart-restart/reboot-hint-marker).

sudo dnf install --allowerasing -y smart-restart python3-dnf-plugin-post-transaction-actions

After the installation, the subsequent transactions will trigger the smart-restart logic.
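
To check whether a reboot is advised after a transaction, test for the reboot hint marker mentioned above:

test -f /run/smart-restart/reboot-hint-marker && echo 'Reboot advised'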

(Optional) Enable Kernel Live Patching (KLP)

Run the following command to install the kernel live patching feature:

sudo dnf install --allowerasing -y kpatch-dnf kpatch-runtime

Enable the service:

sudo dnf kernel-livepatch -y auto
sudo systemctl daemon-reload
sudo systemctl enable --now kpatch.service
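
Verify the service is active and list any loaded live patches (the kpatch tool is shipped by kpatch-runtime):

sudo systemctl status kpatch.service
sudo kpatch list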

(Optional) Install Amazon EFS Utils

sudo dnf install --allowerasing -y amazon-efs-utils
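
As a usage sketch, amazon-efs-utils provides the efs mount helper; the filesystem ID below is a placeholder that you would replace with your own:

sudo mkdir -pv /mnt/efs
sudo mount -t efs -o tls fs-0123456789abcdef0:/ /mnt/efs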

(Optional) Enable FIPS Mode on the Host

This step is safe to skip as it only applies to specific end-user environments. I would recommend reading up on FIPS compliance, validation, and certification before enabling FIPS mode on EC2 instances.

sudo dnf install --allowerasing -y crypto-policies crypto-policies-scripts
sudo fips-mode-setup --check
sudo fips-mode-setup --enable
sudo fips-mode-setup --check
sudo systemctl reboot

(Optional) Setup Amazon SSM Agent

Install the Amazon SSM Agent:

sudo dnf install --allowerasing -y amazon-ssm-agent

The following is a tweak, which should resolve a reported issue where the agent starts before the network is fully ready.

Add the following drop-in to make sure networking is up, DNS resolution works, and cloud-init has finished before the Amazon SSM agent is started.

sudo mkdir -pv /etc/systemd/system/amazon-ssm-agent.service.d

cat <<'EOF' | sudo tee /etc/systemd/system/amazon-ssm-agent.service.d/00-override.conf
[Unit]
# To have a service start after cloud-init.target, it requires the
# addition of DefaultDependencies=no. The default is
# DefaultDependencies=yes, which results in the default target, e.g.
# multi-user.target, depending on the service.
#
# See the following for more details: https://serverfault.com/a/973985
Wants=network-online.target
After=network-online.target nss-lookup.target cloud-init.target
DefaultDependencies=no
ConditionFileIsExecutable=/usr/bin/amazon-ssm-agent

EOF
sudo systemctl daemon-reload
sudo systemctl enable --now amazon-ssm-agent.service
sudo systemctl try-reload-or-restart amazon-ssm-agent.service
sudo systemctl status amazon-ssm-agent.service

Verify:

systemd-delta --type=extended
systemctl show amazon-ssm-agent --all
# systemctl show <unit>.service --property=<PROPERTY_NAME>
# systemctl show <unit>.service --property=<PROPERTY_NAME1>,<PROPERTY_NAME2>
systemctl show amazon-ssm-agent.service --property=After,Wants

(Optional) Install and setup the Unified CloudWatch Agent

Install the Unified CloudWatch Agent:

sudo dnf install --allowerasing -y amazon-cloudwatch-agent collectd

Add the following drop-in to make sure networking is up, DNS resolution works, and cloud-init has finished before the unified CloudWatch agent is started.

sudo mkdir -pv /etc/systemd/system/amazon-cloudwatch-agent.service.d

cat <<'EOF' | sudo tee /etc/systemd/system/amazon-cloudwatch-agent.service.d/00-override.conf
[Unit]
# To have a service start after cloud-init.target, it requires the
# addition of DefaultDependencies=no. The default is
# DefaultDependencies=yes, which results in the default target, e.g.
# multi-user.target, depending on the service.
#
# See the following for more details: https://serverfault.com/a/973985
Wants=network-online.target
After=network-online.target nss-lookup.target cloud-init.target
DefaultDependencies=no
ConditionFileIsExecutable=/opt/aws/amazon-cloudwatch-agent/bin/start-amazon-cloudwatch-agent

EOF
sudo systemctl daemon-reload
sudo systemctl enable --now amazon-cloudwatch-agent.service
sudo systemctl try-reload-or-restart amazon-cloudwatch-agent.service
sudo systemctl status amazon-cloudwatch-agent.service
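
Once an agent configuration exists, it can be loaded with amazon-cloudwatch-agent-ctl. A minimal sketch, assuming the config is stored in the hypothetical SSM parameter AmazonCloudWatch-linux (a name matching the policy below):

sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
  -a fetch-config -m ec2 -s -c ssm:AmazonCloudWatch-linux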

The current version of the CloudWatchAgentServerPolicy:

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "cloudwatch:PutMetricData",
                    "ec2:DescribeVolumes",
                    "ec2:DescribeTags",
                    "logs:PutLogEvents",
                    "logs:DescribeLogStreams",
                    "logs:DescribeLogGroups",
                    "logs:CreateLogStream",
                    "logs:CreateLogGroup"
                ],
                "Resource": "*"
            },
            {
                "Effect": "Allow",
                "Action": [
                    "ssm:GetParameter"
                ],
                "Resource": "arn:aws:ssm:*:*:parameter/AmazonCloudWatch-*"
            }
        ]
    }

(Optional) Install Ansible

Run the following to install Ansible on the host:

sudo dnf install -y \
  python3-psutil \
  ansible \
  ansible-core \
  sshpass

Configure sane defaults for the OS

Configure the locale:

sudo localectl set-locale LANG=en_US.UTF-8

Verify:

localectl

Configure the hostname:

sudo hostnamectl set-hostname --static <hostname>
sudo hostnamectl set-chassis vm

Verify:

hostnamectl

Set the system timezone to UTC and ensure chronyd is enabled and started:

sudo systemctl enable --now chronyd
sudo timedatectl set-timezone Etc/UTC
sudo timedatectl set-ntp true

Verify:

timedatectl

Configure logging:

sudo mkdir -pv /etc/systemd/journald.conf.d

cat <<'EOF' | sudo tee /etc/systemd/journald.conf.d/00-override.conf
[Journal]
SystemMaxUse=100M
RuntimeMaxUse=100M
RuntimeMaxFileSize=10M
RateLimitIntervalSec=1s
RateLimitBurst=10000

EOF

sudo systemctl daemon-reload
sudo systemctl try-reload-or-restart systemd-journald.service
sudo systemctl status systemd-journald.service

Configure custom MOTD banner:

# Disable the AL2023 MOTD banner (found at /usr/lib/motd.d/30-banner):
sudo ln -s /dev/null /etc/motd.d/30-banner
cat <<'EOF' | sudo tee /etc/motd.d/31-banner
   ,     #_
   ~\_  ####_
  ~~  \_#####\
  ~~     \###|
  ~~       \#/ ___   Amazon Linux 2023 (Docker Optimized)
   ~~       V~' '->
    ~~~         /
      ~~._.   _/
         _/ _/
       _/m/'
EOF

AL2023 uses pam_motd, see: http://www.linux-pam.org/Linux-PAM-html/sag-pam_motd.html

Configure a sane user environment for the current user e.g. ec2-user

touch ~/.{profile,bashrc,bash_profile,bash_login,bash_logout,hushlogin}

mkdir -pv "${HOME}"/bin
mkdir -pv "${HOME}"/.config/{systemd,environment.d}
mkdir -pv "${HOME}"/.config/systemd/user/sockets.target.wants
mkdir -pv "${HOME}"/.local/share/systemd/user
mkdir -pv "${HOME}"/.local/bin
#cat <<'EOF' | tee ~/.config/environment.d/environment_vars.conf
#PATH="${HOME}/bin:${HOME}/.local/bin:${PATH}"
#
#EOF
sudo loginctl enable-linger $(whoami)
systemctl --user daemon-reload

Note: If you need to switch to the root user, use the following instead of sudo su - <user>.

# sudo machinectl shell <username>@
sudo machinectl shell root@

Install and configure Moby aka Docker on the host

Run the following command to install moby aka docker:

sudo dnf install --allowerasing -y \
  docker \
  containerd \
  runc \
  container-selinux \
  cni-plugins \
  oci-add-hooks \
  amazon-ecr-credential-helper \
  udica

Add the current user e.g. ec2-user to the docker group:

sudo groupadd docker
sudo usermod -aG docker $USER
newgrp docker

Configure the following docker daemon settings:

test -d /etc/docker || sudo mkdir -pv /etc/docker

test -f /etc/docker/daemon.json || cat <<'EOF' | sudo tee /etc/docker/daemon.json
{
  "debug": false,
  "experimental": false,
  "exec-opts": ["native.cgroupdriver=systemd"],
  "userland-proxy": false,
  "live-restore": true,
  "log-level": "warn",
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m",
    "max-file": "3"
  }
}
EOF

Enable and start the docker and containerd service(s):

sudo systemctl enable --now docker.service containerd.service
sudo systemctl status docker containerd
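
To confirm the daemon is healthy, run a quick smoke test (hello-world is pulled from Docker Hub):

docker info
docker run --rm hello-world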

Install the Docker Compose v2 CLI Plugin

Install the Docker Compose v2 CLI plugin with the following commands.

To install the plugin for all users:

sudo mkdir -p /usr/local/lib/docker/cli-plugins

sudo curl -sL https://github.com/docker/compose/releases/latest/download/docker-compose-linux-"$(uname -m)" \
  -o /usr/local/lib/docker/cli-plugins/docker-compose

# Set ownership to root and make executable
test -f /usr/local/lib/docker/cli-plugins/docker-compose \
  && sudo chown root:root /usr/local/lib/docker/cli-plugins/docker-compose
test -f /usr/local/lib/docker/cli-plugins/docker-compose \
  && sudo chmod +x /usr/local/lib/docker/cli-plugins/docker-compose

(Optional) To install only for the local user e.g. ec2-user, run the following commands:

mkdir -p "${HOME}/.docker/cli-plugins" && touch "${HOME}/.docker/config.json"
curl -sL https://github.com/docker/compose/releases/latest/download/docker-compose-linux-"$(uname -m)" \
  -o "${HOME}/.docker/cli-plugins/docker-compose"

cat <<'EOF' | tee -a "${HOME}/.bashrc"

# https://specifications.freedesktop.org/basedir-spec/latest/index.html
XDG_CONFIG_HOME="${HOME}/.config"
XDG_DATA_HOME="${HOME}/.local/share"
XDG_RUNTIME_DIR="${XDG_RUNTIME_DIR:-/run/user/$(id -u)}"
DBUS_SESSION_BUS_ADDRESS="unix:path=${XDG_RUNTIME_DIR}/bus"
export XDG_CONFIG_HOME XDG_DATA_HOME XDG_RUNTIME_DIR DBUS_SESSION_BUS_ADDRESS 

# Docker
DOCKER_TLS_VERIFY=1
#DOCKER_CONFIG=/usr/local/lib/docker
DOCKER_CONFIG="${DOCKER_CONFIG:-$HOME/.docker}"
export DOCKER_CONFIG DOCKER_TLS_VERIFY
#DOCKER_HOST="unix:///run/user/$(id -u)/docker.sock"
#export DOCKER_HOST

EOF

Verify the plugin is installed correctly with the following command(s):

docker compose version
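
As a minimal usage sketch, the hypothetical docker-compose.yml below starts an nginx container (from the ECR Public mirror of the Docker official image) and then tears it down:

mkdir -p ~/compose-demo && cd ~/compose-demo
cat <<'EOF' | tee docker-compose.yml
services:
  web:
    image: public.ecr.aws/docker/library/nginx:stable
    ports:
      - "8080:80"
EOF
docker compose up -d
curl -s http://localhost:8080 >/dev/null && echo OK
docker compose down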

(Optional) Install the Docker Scout Plugin

Install docker scout with the following commands:

curl -sSfL https://raw.githubusercontent.com/docker/scout-cli/main/install.sh | sh -s --
chmod +x $HOME/.docker/scout/docker-scout

(Skip) Install the Docker Buildx Plugin

Note: You can safely skip this step, as the version of Moby shipped in AL2023 bundles the buildx plugin by default.

(Optional) Install the docker buildx plugin with the following commands:

sudo curl -sSfL 'https://github.com/docker/buildx/releases/download/v0.14.0/buildx-v0.14.0.linux-amd64' \
  -o /usr/local/lib/docker/cli-plugins/docker-buildx

# Note: buildx release assets embed the version in the file name, so there is
# no stable "latest" download URL like there is for docker-compose above.

# Set ownership to root and make executable
test -f /usr/local/lib/docker/cli-plugins/docker-buildx \
  && sudo chown root:root /usr/local/lib/docker/cli-plugins/docker-buildx
test -f /usr/local/lib/docker/cli-plugins/docker-buildx \
  && sudo chmod +x /usr/local/lib/docker/cli-plugins/docker-buildx

cp /usr/local/lib/docker/cli-plugins/docker-buildx "${HOME}/.docker/cli-plugins/docker-buildx"

docker buildx install

(Optional) Install the EC2 Nitro Enclave CLI tool

This is optional; skip it unless you plan to use Nitro Enclaves.

sudo dnf install --allowerasing -y aws-nitro-enclaves-cli aws-nitro-enclaves-cli-devel

Add the user to the ne group:

sudo groupadd ne
sudo usermod -aG ne $USER
newgrp ne

Enable and start the service:

sudo systemctl enable --now nitro-enclaves-allocator.service
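
The allocator reserves CPUs and memory for enclaves via /etc/nitro_enclaves/allocator.yaml. A minimal sketch, assuming the default memory_mib and cpu_count keys in that file; the values below are illustrative and should be sized for your own workload:

sudo sed -i 's/^memory_mib:.*/memory_mib: 1024/;s/^cpu_count:.*/cpu_count: 2/' /etc/nitro_enclaves/allocator.yaml
sudo systemctl restart nitro-enclaves-allocator.service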

(Optional) Install the Nvidia Drivers

To install the Nvidia drivers:

sudo dnf install -y wget kernel-modules-extra kernel-devel gcc dkms

Add the Nvidia Driver and CUDA repository:

sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/amzn2023/x86_64/cuda-amzn2023.repo
sudo dnf clean expire-cache

Install the Nvidia driver + CUDA toolkit from the Nvidia repo:

sudo dnf module install -y nvidia-driver:latest-dkms
sudo dnf install -y cuda-toolkit

(Alternative) Download the driver install script and run it to install the Nvidia drivers:

curl -sL 'https://us.download.nvidia.com/tesla/535.161.08/NVIDIA-Linux-x86_64-535.161.08.run' -O
sudo sh NVIDIA-Linux-x86_64-535.161.08.run -a -s --ui=none -m=kernel-open

Verify:

nvidia-smi

For the Nvidia container runtime, add the nvidia container repo:

curl -sL 'https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo' | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
sudo dnf clean expire-cache
sudo dnf check-update

Install and configure the nvidia-container-toolkit:

sudo dnf install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker

Restart the docker and containerd services:

sudo systemctl restart docker containerd

To run a container (Ubuntu or AL2023 based) with access to the host GPUs:

docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
docker run --rm --runtime=nvidia --gpus all public.ecr.aws/amazonlinux/amazonlinux:2023 nvidia-smi

(Optional) Configure the aws-cli for the ec2-user

# configure region
aws configure set default.region $(curl --noproxy '*' -w "\n" -s -H "X-aws-ec2-metadata-token: $(curl --noproxy '*' -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")" http://169.254.169.254/latest/dynamic/instance-identity/document | jq -r .region)
# use regional endpoints
aws configure set default.sts_regional_endpoints regional
# get credentials from imds
aws configure set default.credential_source Ec2InstanceMetadata
# get credentials last for 1hr
aws configure set default.duration_seconds 3600
# set default pager
aws configure set default.cli_pager ""
# set output to json
aws configure set default.output json

Verify:

aws configure list
aws sts get-caller-identity

(Optional) Create your first Amazon Linux 2023 based container(s)

Login to the AWS ECR service:

aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws

To create an AL2023 based container:

docker pull public.ecr.aws/amazonlinux/amazonlinux:2023
docker run -it --security-opt seccomp=unconfined public.ecr.aws/amazonlinux/amazonlinux:2023 /bin/bash

Performance Tuning for Amazon Linux 2023

EC2 Bandwidth Limits

ethtool -S eth0 | grep -E 'err|exceeded|missed'

NIC Tuning

#sudo ethtool -G eth0 tx 1024 rx 4096
sudo ethtool -G eth0 tx 1024 rx 8192
ethtool -c eth0

# Typical ENA interrupt coalescing settings reported by ethtool -c:
#   rx-adaptive on
#   rx usecs 20
#   tx usecs 64 (default)
#ethtool -C eth0 adaptive-rx off rx-usecs 0 tx-usecs 0
grep Tx-Rx /proc/interrupts

GRUB Configuration

uname -sr; cat /proc/cmdline
sudo grubby --update-kernel=ALL --args="intel_idle.max_cstate=1 processor.max_cstate=1 cpufreq.default_governor=performance"
sudo grubby --update-kernel=ALL --args="swapaccount=1 psi=1"

Verify:

sudo grubby --info=ALL

To reboot the host:

sudo systemctl reboot

sysctl

# start with 50-70
echo 50 | sudo tee /proc/sys/net/core/busy_read
echo 50 | sudo tee /proc/sys/net/core/busy_poll
echo 0 | sudo tee /proc/sys/net/ipv4/tcp_sack
cat <<'EOF' | sudo tee /etc/sysctl.d/99-custom-tuning.conf
# Custom kernel sysctl configuration file
#
# Disclaimer: These settings are not a one size fits all and you will need to test and validate them in your own environment.
#
# https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt
# https://www.kernel.org/doc/Documentation/sysctl/net.txt
# https://www.kernel.org/doc/Documentation/networking/proc_net_tcp.txt
# https://www.kernel.org/doc/Documentation/networking/scaling.txt
# https://www.kernel.org/doc/Documentation/networking/multiqueue.txt
# https://www.kernel.org/doc/Documentation/networking/ena.txt
#
# For binary values, 0 is typically disabled, 1 is enabled.
#
# See sysctl(8) and sysctl.conf(5) for more details.
#
# AWS References:
#
# - https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-network-bandwidth.html
# - https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ena-nitro-perf.html
# - https://github.com/amzn/amzn-drivers/blob/master/kernel/linux/ena/ENA_Linux_Best_Practices.rst
# - https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ena-improve-network-latency-linux.html
# - https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ena-express.html
# - https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/processor_state_control.html
# - https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/monitoring-network-performance-ena.html
# - https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/install-and-configure-cloudwatch-agent-using-ec2-console.html
# - https://github.com/amzn/amzn-ec2-ena-utilities/tree/main
#
# Misc References:
#
# - https://github.com/leandromoreira/linux-network-performance-parameters
# - https://oxnz.github.io/2016/05/03/performance-tuning-networking/
# - https://www.speedguide.net/articles/linux-tweaking-121
# - https://www.tweaked.io/guide/kernel/
# - http://rhaas.blogspot.co.at/2014/06/linux-disables-vmzonereclaimmode-by.html
# - https://fasterdata.es.net/host-tuning/linux/
# - https://documentation.suse.com/sles/15-SP5/html/SLES-all/cha-tuning-network.html
# - https://blog.packagecloud.io/monitoring-tuning-linux-networking-stack-receiving-data/
# - https://blog.packagecloud.io/monitoring-tuning-linux-networking-stack-sending-data/
# - https://blog.cloudflare.com/optimizing-tcp-for-high-throughput-and-low-latency/
# - https://github.com/myllynen/rhel-performance-guide
# - https://github.com/myllynen/rhel-troubleshooting-guide
# - https://www.brendangregg.com/linuxperf.html

# Adjust the kernel printk to minimize serial console logging.
# The defaults are very verbose and can have a performance impact.
# Note: 4 4 1 7 should also be fine, just not debug i.e. 7
kernel.printk=3 4 1 7

# man 7 sched
#
# This feature aimed at improving system responsiveness under load by
# automatically grouping task groups with similar execution patterns.
# While beneficial for desktop responsiveness, in server environments,
# especially those running Kubernetes, this behavior might not always
# be desirable as it could lead to uneven distribution of CPU resources
# among pods.
#
# The use of the cgroups(7) CPU controller to place processes in cgroups
# other than the root CPU cgroup overrides the effect of auto-grouping.
#
# This setting enables better interactivity for desktop workloads and is
# not typically suitable for many server-type workloads e.g. postgresdb.
#
# https://cateee.net/lkddb/web-lkddb/SCHED_AUTOGROUP.html
# https://www.postgresql.org/message-id/[email protected]
kernel.sched_autogroup_enabled=0

# Controls how long a task may stay resident on a CPU core after its
# last run before the scheduler considers migrating it. This is a
# heuristic for estimating cache misses: tasks that still have much of
# their data in the local CPU caches (L1, and also L2) are cheaper to
# briefly queue and re-run on the core they last ran on, since cache
# misses cost quite a lot cpu-cycle wise, while tasks with little or no
# cached data may be better off migrated to a less busy core, as they
# must re-cache their data anyway.
#
# The heuristic uses, among other things, the time since the task last
# ran to estimate how much of its data is probably still cached; the
# longer a task has not run, the more likely its data was evicted to
# make room for other tasks. Setting this too high can let cache
# penalties add up, but setting it too low is also not ideal, as task
# migration is not exactly free (inter-CPU locks must often be acquired
# to move a task to another core's run queue). The default may be a bit
# too low for modern systems and hypervisor workloads, so you could try
# 5ms instead of the 0.5ms default and observe how your system is
# affected; a higher CPU load may be desired here, as the CPU simply
# gets used more efficiently (less time wasted on task migrations, more
# throughput).
#
# A lower value e.g. 500000 (0.5 ms) may improve the responsiveness for certain workloads.
#kernel.sched_migration_cost_ns=500000

# Address space layout randomization (ASLR).
# 0 = no randomization (everything static), 1 = conservative, 2 = full (default).
#kernel.randomize_va_space=1

# For rngd
#kernel.random.write_wakeup_threshold=3072

# Prevent ebpf privilege escalation, see the following:
# https://lwn.net/Articles/742170
# https://www.suse.com/support/kb/doc/?id=000020545
# https://discourse.ubuntu.com/t/unprivileged-ebpf-disabled-by-default-for-ubuntu-20-04-lts-18-04-lts-16-04-esm/27047
# 0 = re-enable, 1 = disable, 2 = disable but allow admin to re-enable without a reboot
#kernel.unprivileged_bpf_disabled=0

# Rootless Containers
# https://github.com/containers/podman/blob/main/docs/tutorials/rootless_tutorial.md
user.max_user_namespaces=28633

# When set to "enabled", all users are allowed to use userfaultfd syscalls.
# https://lwn.net/Articles/782745/
#vm.unprivileged_userfaultfd=1

# Specifies the minimum number of kilobytes to keep free across the system. 
# This is used to determine an appropriate value for each low memory zone, 
# each of which is assigned a number of reserved free pages in proportion 
# to their size.
#
# Setting min_free_kbytes to an extremely low value prevents the system from 
# reclaiming memory, which can result in system  hangs and OOM-killing processes.
#
# However, setting min_free_kbytes too high e.g. 5–10% of total system memory can 
# cause the system to enter an out-of-memory state immediately, resulting in the
# system spending too much time trying to reclaim memory. 
#
# As a rule of thumb, set this value to between 1-3% of available system
# memory and adjust it up or down to meet the needs of your application
# workload. It is not recommended that the setting of vm.min_free_kbytes
# exceed 5% of the system's physical memory.
#
# Ensure that the reserved kernel memory is sufficient to sustain a high
# rate of packet buffer allocations as the default value may be too small.
# awk 'BEGIN {OFMT = "%.0f";} /MemTotal/ {print "vm.min_free_kbytes =", $2 * .03;}' /proc/meminfo
vm.min_free_kbytes=1048576

# Maximum number of memory map areas a process may have (memory map areas are used
# as a side-effect of calling malloc, directly by mmap and mprotect, and also when
# loading shared libraries).
vm.max_map_count=262144

vm.overcommit_memory=1

# Make sure the host does not try to swap too early.
# https://access.redhat.com/solutions/6785021
# https://access.redhat.com/solutions/7042476
# vm.force_cgroup_v2_swappiness=1
vm.swappiness=10

# The maximum percentage of dirty system memory.
# https://www.suse.com/support/kb/doc/?id=000017857
vm.dirty_ratio = 10

# Percentage of dirty system memory at which background writeback will start.
# (default 10)
vm.dirty_background_ratio=5

# Some kernels won't allow dirty_ratio to be set below 5%.
# Therefore when dealing with larger amounts of system memory,
# percentage ratios might not be granular enough. If that is the 
# case, then use the below instead of the settings above.
#
# Configure 600 MB maximum dirty cache
#vm.dirty_bytes=629145600

# Spawn background write threads once the cache holds 300 MB
#vm.dirty_background_bytes=314572800

# The value in file-max denotes the maximum number of file handles that the Linux kernel will allocate.
# When you get lots of error messages about running out of file handles, you may want to increase this limit.
# Attempts to allocate more file descriptors than file-max are reported with printk; look for
# "VFS: file-max limit <number> reached" in the kernel logs.
fs.file-max=1048576

# Maximum number of concurrent asynchronous I/O operations (you might need to
# increase this limit further if you have a lot of workloads that use the AIO
# subsystem e.g. MySQL, MariaDB, etc.)
# 524288, 1048576, etc.
fs.aio-max-nr=524288

# Upper limit on the number of watches that can be created per real user ID
# Raise the limit for watches to the limit i.e. 524,288
# https://man7.org/linux/man-pages/man7/inotify.7.html
fs.inotify.max_user_watches=524288

# Suppress logging of net_ratelimit callback
#net.core.message_cost=0

# Increasing this value for high speed cards may help prevent losing packets
# https://access.redhat.com/solutions/1241943
net.core.netdev_max_backlog = 2000
net.core.netdev_budget = 600

# The maximum number of "backlogged sockets, accept and syn queues are governed by 
# net.core.somaxconn and net.ipv4.tcp_max_syn_backlog. The maximum number of 
# "backlogged sockets". The net.core.somaxconn setting caps both queue sizes.
# Ensure that net.core.somaxconn is always set to a value equal to or greater than 
# tcp_backlog e.g. net.core.somaxconn >= 4096.
#
# Increase number of incoming connections
net.core.somaxconn = 4096
net.ipv4.tcp_max_syn_backlog = 4096

# Increase UDP Buffers
# Maximum Receive/Transmit Window Size
# if netstat -us is reporting errors, another underlying issue may 
# be preventing the application from draining its receive queue.
# https://github.com/quic-go/quic-go/wiki/UDP-Buffer-Sizes
# https://medium.com/@CameronSparr/increase-os-udp-buffers-to-improve-performance-51d167bb1360
# The maximum allowed (16MB) receive socket buffer (size in bytes)
net.core.rmem_max=16777216
# The maximum allowed (16MB) send socket buffer (size in bytes)
net.core.wmem_max=16777216

# The default socket receive buffer (size in bytes)
#net.core.rmem_default=31457280
#net.core.wmem_default=

# Increase linux auto-tuning of TCP buffer limits to 16MB to prevent dropped packets.
# https://blog.cloudflare.com/the-story-of-one-latency-spike/
net.ipv4.tcp_rmem=4096 87380 16777216
net.ipv4.tcp_wmem=4096 65536 16777216

# Enable busy poll mode
# Busy poll mode reduces latency on the network receive path. When you enable busy poll 
# mode, the socket layer code can directly poll the receive queue of a network device. 
# The downside of busy polling is higher CPU usage in the host that comes from polling 
# for new data in a tight loop. There are two global settings that control the number of 
# microseconds to wait for packets for all interfaces.
# ethtool -k eth0
#net.core.busy_read=50
#net.core.busy_poll=50

# It's recommended to use a 'fair queueing' qdisc e.g. fq or fq_codel.
#
# - fq or fq_codel can be safely used as a drop-in replacement for pfifo_fast.
# - fq or fq_codel is required to use tcp_bbr as it requires fair queuing.
# - fq_codel is best for forwarding/routers which don't originate local
#   traffic and for hypervisors, and is the best general-purpose qdisc.
# - fq is best for fat servers with tcp-heavy workloads, particularly at 10GigE+.
#
# - BBR supports fq_codel in Linux Kernel version 4.13 and later.
# - BBR must be used with fq qdisc with pacing enabled, since pacing is integral to the BBR design 
#   and implementation. BBR without pacing would not function properly and may incur unnecessary 
#   high packet loss rates.
#
# http://man7.org/linux/man-pages/man8/tc-fq.8.html
# https://github.com/systemd/systemd/blob/main/sysctl.d/50-default.conf
# https://www.bufferbloat.net/projects/codel/wiki/
# https://github.com/systemd/systemd/issues/9725#issuecomment-412286509
# https://forum.vyos.io/t/bbr-and-fq-as-new-defaults/12344
# https://research.google/pubs/pub45646/
# https://github.com/google/bbr/blob/master/README
net.core.default_qdisc = fq_codel
#net.ipv4.tcp_congestion_control = bbr

# TCP ECN: 0 = disabled, 1 = negotiate ECN for both active (outgoing) and
# passive (incoming) connections, 2 = enable ECN only when requested by
# incoming connections (the kernel default).
#
# Turning on ECN lets AQM sort out the congestion backpressure
# without incurring packet losses and retransmissions.
#
# To make the best use of this, ECN (sysctl net.ipv4.tcp_ecn) really
# needs to be enabled on the end hosts as well.
#
# https://github.com/systemd/systemd/pull/9143
# https://github.com/systemd/systemd/issues/9748
net.ipv4.tcp_ecn = 2
net.ipv4.tcp_ecn_fallback = 1

# Bump the TTL from the default of 64 to 127 on AWS
net.ipv4.ip_default_ttl = 127

# Enable forwarding so that docker networking works as expected.
# Enable IPv4 forwarding
net.ipv4.ip_forward = 1
net.ipv4.conf.all.forwarding = 1
# Enable IPv6 forwarding
net.ipv6.conf.default.forwarding = 1
net.ipv6.conf.all.forwarding = 1

# Disables ICMP redirect sending
net.ipv4.conf.eth0.send_redirects=0
net.ipv4.conf.all.send_redirects=0
net.ipv4.conf.default.send_redirects=0

# Disables ICMP redirect acceptance
net.ipv4.conf.all.accept_redirects=0
net.ipv4.conf.default.accept_redirects=0
net.ipv6.conf.all.accept_redirects=0
net.ipv6.conf.default.accept_redirects=0

net.ipv4.conf.all.secure_redirects=0
net.ipv4.conf.default.secure_redirects=0

# Increase the local outgoing port range
net.ipv4.ip_local_port_range = 10000	65535
#net.ipv4.ip_local_reserved_ports=

# Enable Multipath TCP
net.mptcp.enabled = 1

# Enable low latency mode for TCP, intended to give preference to low latency 
# over higher throughput. Setting to 1 will disable IPv4 tcp pre-queue processing.
#net.ipv4.tcp_low_latency = 1

# Enable TCP Window Scaling
net.ipv4.tcp_window_scaling = 1

# RFC 1323, Support for IPV4 TCP window sizes larger than 64K, which is generally 
# needed on high bandwidth networks. Tells the kernel how much of the socket buffer 
# space should be used for TCP window size and how much to save for an application buffer.
net.ipv4.tcp_adv_win_scale = 1

#net.ipv4.tcp_no_metrics_save = 1

#net.ipv4.tcp_moderate_rcvbuf = 1

# Disable the TCP timestamps option for better CPU utilization.
#net.ipv4.tcp_timestamps = 0

# Recommended for hosts with jumbo frames enabled. Default in AWS.
net.ipv4.tcp_mtu_probing = 1

# Enable to send data in the opening SYN packet.
net.ipv4.tcp_fastopen = 1

# Protect Against TCP Time-Wait Assassination Attacks
net.ipv4.tcp_rfc1337 = 1

# Enable the TCP selective ACKs option for better throughput.
#net.ipv4.tcp_sack = 1

# https://blog.cloudflare.com/optimizing-the-linux-stack-for-mobile-web-per/
# https://access.redhat.com/solutions/168483
# Use this parameter to ensure that the maximum speed is used from beginning
# also for previously idle TCP connections. Avoid falling back to slow start
# after a connection goes idle keeps our cwnd large with the keep alive
# connections (kernel > 3.6).
net.ipv4.tcp_slow_start_after_idle = 0

# The maximum times an IPV4 packet can be reordered in a TCP packet stream without 
# TCP assuming packet loss and going into slow start.
#net.ipv4.tcp_reordering = 3

# The net.ipv4.tcp_tw_recycle option is quite problematic for public-facing servers as it 
# will not handle connections from two different computers behind the same NAT device, which 
# is a problem hard to detect and waiting to bite you in the ass.
#net.ipv4.tcp_tw_recycle=

net.ipv4.tcp_tw_reuse=1

# Decrease the time default value for connections to keep alive.
#net.ipv4.tcp_keepalive_time = 300
#net.ipv4.tcp_keepalive_probes = 5
#net.ipv4.tcp_keepalive_intvl = 15

# Decrease the time default value for tcp_fin_timeout connection, FIN-WAIT-2
#net.ipv4.tcp_fin_timeout = 15

# Reduce TIME_WAIT from the 120s default to 30-60s
#net.netfilter.nf_conntrack_tcp_timeout_time_wait = 30

# Reduce FIN_WAIT from the 120s default to 30-60s
#net.netfilter.nf_conntrack_tcp_timeout_fin_wait = 30

EOF

To apply these settings:

sudo systemctl daemon-reload
sudo sysctl --system
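
Spot-check a few of the values to confirm they were applied:

sysctl net.core.somaxconn net.core.default_qdisc vm.swappiness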

Unofficial Guide to Amazon Linux 2023

Amazon Linux 2023 Resources

AL2023 Repository Details

# cdn.amazonlinux.com (x86_64)
https://cdn.amazonlinux.com/al2023/core/mirrors/latest/x86_64/mirror.list
https://cdn.amazonlinux.com/al2023/core/guids/<guid>/x86_64/

# cdn.amazonlinux.com (aarch64)
https://cdn.amazonlinux.com/al2023/core/mirrors/latest/aarch64/mirror.list
https://cdn.amazonlinux.com/al2023/core/guids/<guid>/aarch64/

# al2023-repos-us-east-1-<guid>.s3.dualstack.<region>.amazonaws.com
https://al2023-repos-<region>-<guid>.s3.dualstack.<region>.amazonaws.com/core/mirrors/<releasever>/x86_64/mirror.list
https://al2023-repos-<region>-<guid>.s3.dualstack.<region>.amazonaws.com/core/guids/<guid>/x86_64/<rest_of_url>
https://al2023-repos-<region>-<guid>.s3.dualstack.<region>.amazonaws.com/core/mirrors/<releasever>/SRPMS/mirror.list
https://al2023-repos-<region>-<guid>.s3.dualstack.<region>.amazonaws.com/kernel-livepatch/mirrors/al2023/x86_64/mirror.list

Docker Resources

Containers Resources

Setup local dns caching service on Amazon Linux 2023

The following steps can be used to set up a local DNS caching service (dnsmasq) to cache DNS lookups on AL2023.

sudo dnf install --allowerasing -y dnsmasq bind-utils

Backup the default configuration:

sudo cp /etc/dnsmasq.conf{,.bak}

Configure dnsmasq:

cat <<'EOF' | sudo tee /etc/dnsmasq.conf
# https://thekelleys.org.uk/dnsmasq/docs/dnsmasq-man.html
# https://thekelleys.org.uk/gitweb/?p=dnsmasq.git


## Server Configuration

# The user which dnsmasq will change to after startup
user=dnsmasq

# The group which dnsmasq will run as
group=dnsmasq

# PID file
pid-file=/var/run/dnsmasq.pid

# The alternative would be just 127.0.0.1 without ::1
listen-address=::1,127.0.0.1

# port=53 or port=0 to disable the dnsmasq DNS server functionality.
port=53

# For a local only DNS resolver use interface=lo + bind-interfaces
# See for more details: https://serverfault.com/a/830737
#
# Listen only on the specified interface(s).
interface=lo

# dnsmasq binds to the wildcard address, even if it is only
# listening on some interfaces. It then discards requests that
# it shouldn't reply to. This has the advantage of working
# even when interfaces come and go and change address.
bind-interfaces

#bind-dynamic

# Do not listen on the specified interface(s). 
#except-interface=eth0
#except-interface=eth1

## DHCP Server

# Turn off DHCP and TFTP Server features
#no-dhcp-interface=eth0

#dhcp-authoritative

# Dynamic range of IPs to make available to LAN PC and the lease time. 
# Ideally set the lease time to 5m only at first to test everything 
# works okay before you set long-lasting records.
#dhcp-range=192.168.1.100,192.168.1.253,255.255.255.0,16h

# Provide IPv6 DHCPv6 leases, where the range is constructed using the 
# network interface as prefix.
#dhcp-range=::f,::ff,constructor:eth0

# Set default gateway
# dhcp-option=3,192.168.1.1
#dhcp-option=option:router,192.168.1.1

# If your dnsmasq server is also doing the routing for your network, 
# you can use option 121 to push a static route out. where x.x.x.x is
# the destination LAN, yy is the CIDR notation (usually /24) and 
# z.z.z.z is the host that will be doing the routing.
#dhcp-option=121,x.x.x.x/yy,z.z.z.z

# Set DNS servers to announce
# dhcp-option=6,192.168.1.10
#dhcp-option=option:dns-server,192.168.1.10

# Optionally set a domain name
#domain=local

# To have dnsmasq assign static IPs to some of the clients, you can specify
# a static assignment i.e. Hosts NIC MAC addresses to IP address.
#dhcp-host=aa:bb:cc:dd:ee:ff,fw01,192.168.1.1,infinite  
#dhcp-host=aa:bb:cc:dd:ee:ff,sw01,192.168.1.2,infinite  
#dhcp-host=aa:bb:cc:ff:dd:ee,dns01,192.168.1.10,infinite

## Name Resolution Options

# Specify the upstream AWS VPC Resolver within this config file
# https://docs.aws.amazon.com/vpc/latest/userguide/vpc-dns.html#AmazonDNS
# Setting this does not suppress reading of /etc/resolv.conf, use --no-resolv to do that.
#server=169.254.169.253
#server=fd00:ec2::253

# Specify upstream servers directly
#server=/ec2.internal/169.254.169.253
#server=/compute.internal/169.254.169.253

# IPv6 addresses may include an %interface scope-id
#server=/ec2.internal/fd00:ec2::253%eth0
#server=/compute.internal/fd00:ec2::253%eth0

# https://tailscale.com/kb/1081/magicdns
# https://tailscale.com/kb/1217/tailnet-name
#server=/beta.tailscale.net/100.100.100.100@tailscale0

# To query all upstream servers simultaneously
#all-servers

# Query upstream servers in order
strict-order

# Later versions of Windows make periodic DNS requests which don't get sensible answers
# from the public DNS and can cause problems by triggering dial-on-demand links.
# This flag turns on an option to filter such requests.
#filterwin2k

# Specify the upstream resolver within another file
resolv-file=/etc/resolv.dnsmasq

# Uncomment if you specify the upstream servers in here, so dnsmasq no
# longer polls the /etc/resolv.conf file for changes.
#no-poll

# Uncomment if you specify the upstream server in here so you don't read 
# /etc/resolv.conf. Get upstream servers only from cli or dnsmasq conf.
#no-resolv

# Whenever /etc/resolv.conf is re-read or the upstream servers are set via DBus, clear the 
# DNS cache. This is useful when new nameservers may have different data than that held in cache. 
#clear-on-reload

# Additional hosts files to include
#addn-hosts=/etc/dnsmasq-blocklist

# Send queries for internal domain to another internal resolver
#address=/int.example.com/10.10.10.10

# Examples of blocking TLDs or subdomains
#address=/.local/0.0.0.0
#address=/.example.com/0.0.0.0

# Return answers to DNS queries from /etc/hosts and --interface-name and
# --dynamic-host which depend on the interface over which the query was received.
#localise-queries

# Never forward addresses in the non-routed address spaces
bogus-priv

# Never forward plain names
domain-needed

# Reject private addresses from upstream nameservers
stop-dns-rebind

# Disable the above entirely by commenting out the option OR allow RFC1918 responses 
# from specific domains by commenting out and/or adding additional internal domains.
#rebind-domain-ok=/int.example.com/
# https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-naming.html
rebind-domain-ok=/ec2.internal/compute.internal/local/

# Exempt 127.0.0.0/8 and ::1 from rebinding checks
rebind-localhost-ok

# Set the maximum number of concurrent DNS queries.
# The default value is 150. Adjust to your needs.
#dns-forward-max=150

# Set the size of dnsmasq's cache, default is 150 names
cache-size=1000

# Without this option being set, the cache statistics are also available in 
# the DNS as answers to queries of class CHAOS and type TXT in domain bind. 
no-ident

# The following directive controls whether negative caching 
# should be enabled or not. Negative caching allows dnsmasq 
# to remember “no such domain” answers from the parent 
# nameservers, so it does not query for the same non-existent 
# hostnames again and again.
#no-negcache

# Negative replies from upstream servers normally contain 
# time-to-live information in SOA records which dnsmasq uses 
# for caching. If the replies from upstream servers omit this 
# information, dnsmasq does not cache the reply. This option 
# gives a default value for time-to-live (in seconds) which 
# dnsmasq uses to cache negative replies even in the absence 
# of an SOA record.  
neg-ttl=60

# Uncomment to enable validation of DNS replies and cache DNSSEC data.

# Validate DNS replies and cache DNSSEC data.
#dnssec

# As a default, dnsmasq checks that unsigned DNS replies are legitimate: this entails 
# possible extra queries even for the majority of DNS zones which are not, at the moment,
# signed. 
#dnssec-check-unsigned

# Copy the DNSSEC Authenticated Data bit from upstream servers to downstream clients.
#proxy-dnssec

# https://data.iana.org/root-anchors/root-anchors.xml
#conf-file=/usr/share/dnsmasq/trust-anchors.conf

# The root DNSSEC trust anchor
#
# Note that this is a DS record (ie a hash of the root Zone Signing Key)
# If was downloaded from https://data.iana.org/root-anchors/root-anchors.xml
#trust-anchor=.,19036,8,2,49AAC11D7B6F6446702E54A1607371607A1A41855200FD2CE1CDDE32F24E8FB5

## Logging directives

#log-async
#log-dhcp

# Uncomment to log all queries
#log-queries

# Uncomment to log to stdout
#log-facility=-

# Uncomment to log to /var/log/dnsmasq.log
log-facility=/var/log/dnsmasq.log

EOF

Create the following file with the upstream resolvers:

cat <<'EOF' | sudo tee /etc/resolv.dnsmasq
nameserver 169.254.169.253
#nameserver fd00:ec2::253

EOF

Validate the configuration:

sudo dnsmasq --test

Make sure that systemd-resolved is not configured to be a stub resolver:

sudo mkdir -pv /etc/systemd/resolved.conf.d

cat <<'EOF' | sudo tee /etc/systemd/resolved.conf.d/00-override.conf
[Resolve]
DNS=127.0.0.1
FallbackDNS=169.254.169.253
DNSStubListener=no
MulticastDNS=no
LLMNR=no

EOF

sudo systemctl daemon-reload
sudo systemctl restart systemd-resolved

Unlink the stub and re-create the /etc/resolv.conf file:

sudo unlink /etc/resolv.conf
cat <<'EOF' | sudo tee /etc/resolv.conf
nameserver 127.0.0.1
nameserver ::1
search ec2.internal
options edns0 timeout:1 attempts:5
#options trust-ad

EOF

Enable and start the service:

sudo systemctl enable --now dnsmasq.service
sudo systemctl restart dnsmasq.service

Verify:

dig aws.amazon.com @127.0.0.1
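
Running the same query twice should show the second answer served from the local cache, i.e. a Query time close to 0 msec:

dig +noall +stats aws.amazon.com @127.0.0.1 | grep 'Query time'
dig +noall +stats aws.amazon.com @127.0.0.1 | grep 'Query time'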

Setup Homebrew on Amazon Linux 2023

#!/bin/bash
# https://brew.sh/
# https://docs.brew.sh/Homebrew-on-Linux#install
sudo dnf groupinstall 'Development Tools'
sudo dnf install --allowerasing -y procps-ng curl file git git-lfs
# set a password, as the homebrew script won't allow running as root i.e. sudo
sudo passwd ec2-user
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
# add to bash env
(echo; echo 'eval "$(/home/linuxbrew/.linuxbrew/bin/brew shellenv)"') >> /home/ec2-user/.bashrc
eval "$(/home/linuxbrew/.linuxbrew/bin/brew shellenv)"
# Ensure `/home/linuxbrew/.linuxbrew/bin:/home/linuxbrew/.linuxbrew/sbin` is in your PATH
echo $PATH
# View formulae details about a homebrew package:
#brew info jq
# install packages via homebrew:
#brew install gcc

Setup DataDog Vector on Amazon Linux 2023

Note: This is a work in progress at the moment.

Install DataDog Vector for exporting journaled logs to CloudWatch Logs

Either use the setup script or manually add the repo and install the package.

Setup script:

bash -c "$(curl -L https://setup.vector.dev)"

Manually add repo:

cat <<'EOF' | sudo tee /etc/yum.repos.d/vector.repo
[vector]
name = Vector
baseurl = https://yum.vector.dev/stable/vector-0/$basearch/
enabled=1
gpgcheck=1
repo_gpgcheck=1
priority=1
gpgkey=https://keys.datadoghq.com/DATADOG_RPM_KEY_CURRENT.public
       https://keys.datadoghq.com/DATADOG_RPM_KEY_B01082D3.public
       https://keys.datadoghq.com/DATADOG_RPM_KEY_FD4BF915.public

EOF

Install the Vector package:

sudo dnf install -y vector

Backup the default configuration file and then configure the Vector service:

sudo mv /etc/vector/vector.yaml{,.bak}
cat <<'EOF' | sudo tee /etc/vector/vector.yaml
sources:
  my_journald_source:
    type: "journald"

sinks:
  my_cloudwatch_sink:
    type: "aws_cloudwatch_logs"
    inputs:
      - "my_journald_source"
    compression: "gzip"
    encoding:
      codec: "json"
    #create_missing_group: true
    #create_missing_stream: true
    #endpoint: http://127.0.0.0:5000/path/to/service
    group_name: "prodenv"
    region: "us-east-1"
    stream_name: "prodsite/{{ host }}"

EOF

Verify the configuration file is valid:

vector validate
#vector --config /etc/vector/vector.yaml --require-healthy
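
Once the configuration validates, enable and start the service (the vector RPM ships a vector.service unit):

sudo systemctl enable --now vector.service
sudo systemctl status vector.service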

Example cloud-init user-data for Amazon Linux 2023

#cloud-config
# vim:syntax=yaml
disable_ec2_metadata: false
datasource:
  Ec2:
    timeout: 50
    max_wait: 120
    metadata_urls:
      - http://169.254.169.254:80
      - http://[fd00:ec2::254]:80
      #- http://instance-data:8773
    apply_full_imds_network_config: true
# boot commands
# default: none
# This is very similar to runcmd, but commands run very early
# in the boot process, only slightly after a 'boothook' would run.
# - bootcmd will run on every boot
# - INSTANCE_ID variable will be set to the current instance ID
# - 'cloud-init-per' command can be used to make bootcmd run exactly once
bootcmd:
  - systemctl stop amazon-ssm-agent
package_update: false
package_upgrade: false
package_reboot_if_required: false
packages:
  - docker
manage_resolv_conf: true
resolv_conf:
  nameservers: ['169.254.169.253']
  searchdomains:
    - ec2.internal
  domain: ec2.internal
  options:
    timeout: 5
# set the locale to a given locale
# default: en_US.UTF-8
locale: en_US.UTF-8
# disable ssh access as root.
# if you want to be able to ssh in to the system as the root user
# rather than as the 'ec2-user' user, then you must set this to false
# default: true
disable_root: true
write_files:
  - path: /etc/systemd/system/amazon-ssm-agent.service.d/00-override.conf
    permissions: "0644"
    content: |
      [Unit]
      # To have a service start after cloud-init.target, it requires the
      # addition of DefaultDependencies=no. The default is
      # DefaultDependencies=yes, which results in the default target, e.g.
      # multi-user.target, depending on the service.
      #
      # See the following for more details: https://serverfault.com/a/973985
      Wants=network-online.target
      After=network-online.target nss-lookup.target cloud-init.target
      DefaultDependencies=no
      ConditionFileIsExecutable=/usr/bin/amazon-ssm-agent