quanhua92/k3s_or_nomad.md

Last active November 30, 2025 06:46

Star (0) You must be signed in to star a gist
Fork (0) You must be signed in to fork a gist

Select an option

Learn more about clone URLs
Clone this repository at <script src="https://gist.github.com/quanhua92/8993aee236a56441c49b4b034298afa2.js"></script>
Save quanhua92/8993aee236a56441c49b4b034298afa2 to your computer and use it in GitHub Desktop.

HA Nomad Cluster

Raw

Nomad generally uses less CPU and RAM than K3s for the same basic cluster functionality. Nomad is widely regarded as one of the most efficient and lightweight orchestrators available.

For your 5-VPS setup, Nomad's single-binary, minimal approach will result in lower resource utilization and operational overhead compared to K3s, which, despite being "lightweight," still runs the full set of Kubernetes control plane components.

⚖️ Nomad vs. K3s Resource Footprint

The difference in resource consumption stems from the architectural complexity of the two systems.

Resource	Nomad Agent (Server/Client)	K3s Agent / Server
CPU	Lower Baseline. Nomad's scheduler is very fast and efficient, leading to minimal CPU usage during idle periods and faster scheduling under load.	Higher Baseline. K3s runs the Kubernetes API server, controller manager, and scheduler—all of which require a higher CPU baseline, even in their lightweight K3s form.
RAM	Lower. Nomad's core agent footprint is smaller. Its Raft consensus is built into the agent and is generally less memory-hungry than a full-featured Kubernetes control plane.	Higher. K3s servers still require sufficient RAM (e.g., 2 GB minimum) to run the Kubernetes control plane and its datastore (defaulting to SQLite or embedded etcd).
Storage (Data)	Low. Nomad stores its state in memory and uses disk primarily for Raft snapshots and logging.	Medium. K3s, by default, uses the embedded SQLite database, which is simple but still requires constant, fast I/O on the disk, making disk performance a key factor.
Bandwidth (Gossip)	Lower and more efficient. Nomad uses the Serf gossip protocol for cluster membership, which is known for being extremely bandwidth-efficient.	Higher. K3s uses the standard Kubernetes API model for internal communication, which can generate more network traffic, especially for large clusters.

Why Nomad is Leaner:

Single Binary: Nomad is a single binary that handles everything—scheduling, client execution, and Raft consensus. This eliminates the overhead of inter-process communication that Kubernetes components inherently require.
Focus on Scheduling: Nomad's core purpose is efficiently placing jobs. It doesn't include components like Ingress controllers, CNI, or CSI drivers in its core binary (it uses plugins for that), keeping the agent very minimal.
Bin Packing: Nomad's scheduling algorithm is optimized for bin packing—cramming as many jobs as possible onto a node—to maximize infrastructure utilization, not just minimize its own overhead.

For your small cluster of 5 VPS, the resource difference will be noticeable, with Nomad requiring less total resources to keep the orchestration layer running, leaving more resources for your Rust containers.

Raw

nomad_daily_usage.md

Estimating daily usage for Nomad on a 5-VPS cluster requires breaking down the resource consumption of the orchestration layer itself (Nomad agents) and the application workloads.

Here's an estimate of the resources consumed by the Nomad orchestration layer running on your five VPS:

📊 Estimated Resource Usage for Nomad Agents (Orchestration Layer Only)

Nomad is highly efficient. The vast majority of resource usage will come from your applications (your Rust binary), not the orchestrator.

1. CPU Usage (Per Agent)

Idle Servers (3 VPS): 50 MHz to 200 MHz (0.05 to 0.2 CPU cores). The CPU is mainly consumed by the Raft consensus and the scheduler loop. This scales slightly with the number of jobs being scheduled.
Idle Clients (2 VPS): <50 MHz (less than 0.05 CPU cores). The client is mostly waiting for instructions and periodically sending heartbeats.

2. RAM Usage (Per Agent)

Servers (3 VPS): 256 MB to 512 MB. RAM is the most critical resource for servers, as they hold the entire cluster state (Raft log, job specifications, allocation data) in memory. For a small 5-node cluster, 256 MB is usually sufficient.
Clients (2 VPS): 64 MB to 128 MB. Very lightweight, primarily used for tracking local allocations and communicating with Docker.

3. Network Bandwidth (Total Daily Traffic)

The bandwidth is primarily event-driven, not constant.

Baseline (Idle Traffic): <10 GB per day total across all 5 VPS. This low baseline traffic is generated by the Serf gossip protocol (membership/health checks) and client heartbeats (periodic check-ins). This is extremely minimal.
Event-Driven (Bursty Traffic): This is highly variable.
- Job Deployment: Deploying a job (e.g., submitting your Rust binary) involves sending the HCL file and receiving the allocation status, which is minimal data.
- Container Pulls: The single largest bandwidth event is pulling the container image (my-rust-app:latest). This happens whenever you deploy a new version or when a node is added. The usage depends entirely on the size of your binary's Docker image.

💡 Estimating Total VPS Usage

To estimate the actual resource usage for your VPS, you need to add your application requirements.

Example Estimation for Your Rust Binary

Assume you deploy 3 replicas of your Rust service across the 5 VPS:

Resource	Nomad Orchestration Layer	Application (3 Replicas)	Total Cluster Usage (Estimate)
CPU (Peak MHz)	≈ 600 MHz (All 5 agents)	3 × 200 MHz (Example)	≈ 1200 MHz (1.2 cores)
RAM (Peak MB)	≈ 1000 MB (All 5 agents)	3 × 512 MB (Example)	≈ 2536 MB (2.5 GB)

Conclusion

Nomad's low overhead means you can confidently purchase VPS that meet your application's resource needs, knowing that the orchestrator will consume less than one full CPU core and about 1 GB of RAM across the cluster.

Raw

nomad_ha_cluster.md

This is the final, consolidated guide for setting up a highly available (HA) Nomad cluster on your 5 VPS, using IPs 10.0.0.1 through 10.0.0.5, with essential security enabled.

The cluster will be composed of 3 Nomad Servers (10.0.0.1, 10.0.0.2, 10.0.0.3) and 2 Nomad Clients (10.0.0.4, 10.0.0.5).

🚀 Step 1: Preparation (All 5 VPS)

Perform these actions on all five servers:

Install Nomad: Download the single binary and make it executable.

# Download Nomad (check HashiCorp website for the latest version)
curl -LO https://releases.hashicorp.com/nomad/1.7.6/nomad_1.7.6_linux_amd64.zip
unzip nomad_*.zip
sudo mv nomad /usr/local/bin/

# Create necessary directories
sudo mkdir -p /etc/nomad.d
sudo mkdir -p /opt/nomad/data
sudo mkdir -p /etc/nomad.d/certs

Install Docker: Ensure Docker is installed and running on all 5 VPS, as it's required for the Nomad docker task driver.
Generate Gossip Key (Encryption): Run this ONCE on any machine and save the outputted key.
```
NOMAD_ENCRYPT_KEY=$(nomad operator keygen)
echo $NOMAD_ENCRYPT_KEY
```

🔒 Step 2: Configure and Enable TLS (All 5 VPS)

This secures the communication between agents (mTLS) and is crucial for production.

Generate Root CA: Run this ONCE on 10.0.0.1 and copy the ca.pem file to the /etc/nomad.d/certs/ directory on all other 4 VPS.
```
nomad tls ca create -keysize 2048
sudo cp ca.pem /etc/nomad.d/certs/
```
Generate Certificates:
- Servers (10.0.0.1 - 10.0.0.3): Run this command on each server host and move the files to the certs directory.
```
nomad tls cert create -server
sudo mv *.pem /etc/nomad.d/certs/
```
- Clients (10.0.0.4 - 10.0.0.5): Run this command on each client host and move the files to the certs directory.
```
nomad tls cert create -client
sudo mv *.pem /etc/nomad.d/certs/
```

⚙️ Step 3: Configure and Start Server Agents (10.0.0.1 - 10.0.0.3)

Create /etc/nomad.d/config.hcl on the three Server VPS (10.0.0.1, 10.0.0.2, 10.0.0.3). Replace <GOSSIP_KEY> with the key from Step 1.

/etc/nomad.d/config.hcl (Server Agents)

datacenter = "dc1"
data_dir   = "/opt/nomad/data"
bind_addr  = "10.0.0.1" # IMPORTANT: Change this to the specific IP of the VPS

server {
  enabled          = true
  bootstrap_expect = 3
}

client {
  enabled = true # All servers should also be clients to run jobs
  servers = ["10.0.0.1:4647", "10.0.0.2:4647", "10.0.0.3:4647"]
}

# --- Security Configuration ---
encrypt_key = "<GOSSIP_KEY>"

tls {
  rpc_upgrade_mode = true
  ca_file          = "/etc/nomad.d/certs/ca.pem"
  cert_file        = "/etc/nomad.d/certs/cli.pem"
  key_file         = "/etc/nomad.d/certs/cli-key.pem"
  verify_server_hostname = true
  verify_https           = true
}

Start the Service on all three Server VPS:

sudo systemctl enable nomad
sudo systemctl start nomad

Verify Status (wait a minute for Raft to form):
```
nomad server members
```
You should see all three IPs with a leader designated.

🖥️ Step 4: Configure and Start Client Agents (10.0.0.4 - 10.0.0.5)

Create /etc/nomad.d/config.hcl on the two Client VPS (10.0.0.4, 10.0.0.5).

/etc/nomad.d/config.hcl (Client Agents)

datacenter = "dc1"
data_dir   = "/opt/nomad/data"
bind_addr  = "10.0.0.4" # IMPORTANT: Change this to the specific IP of the VPS

client {
  enabled = true
  servers = ["10.0.0.1:4647", "10.0.0.2:4647", "10.0.0.3:4647"]
  
  # Enable the Docker driver
  options = {
    "docker.privileged.enabled" = "true" # Required for some container setups
  }
}

server {
  enabled = false
}

# --- Security Configuration (Same as Servers) ---
encrypt_key = "<GOSSIP_KEY>"

tls {
  rpc_upgrade_mode = true
  ca_file          = "/etc/nomad.d/certs/ca.pem"
  cert_file        = "/etc/nomad.d/certs/cli.pem"
  key_file         = "/etc/nomad.d/certs/cli-key.pem"
  verify_server_hostname = true
  verify_https           = true
}

Start the Service on both Client VPS:

sudo systemctl enable nomad
sudo systemctl start nomad

Verify Status from any Server VPS:
```
nomad node status
```
You should see all 5 IPs listed, with a status of ready.

🎯 Step 5: Run Your First Job

You can now submit a workload to your secure cluster from any machine that has the Nomad CLI and access to the cluster network.

Set the Environment Variable to target the leader's IP (e.g., 10.0.0.1).

export NOMAD_ADDR=https://10.0.0.1:4646
# You may also need to set the TLS CA path if running remotely
export NOMAD_CACERT=/etc/nomad.d/certs/ca.pem

Run the Job (using the HCL file you created previously):
```
nomad job run my-web-app.hcl
```

You now have a secure, highly available container orchestrator that is much simpler to manage than Kubernetes!

Configure System Service

To make Nomad run reliably on your VPS and automatically restart after reboots or crashes, you need a systemd unit file.

You will create one single file on all 5 VPS that tells the operating system how to manage the Nomad agent. The configuration file (/etc/nomad.d/config.hcl) you created previously determines if it runs as a server, client, or both.

📄 Essential File: nomad.service

Create this file on all 5 VPS at the location: /etc/systemd/system/nomad.service.

[Unit]
Description=Nomad
Documentation=https://www.nomadproject.io/docs/
# Waits for the network to be fully configured before starting Nomad
Wants=network-online.target
After=network-online.target
# If you decide to integrate Consul later, uncomment these lines:
# Wants=consul.service
# After=consul.service

[Service]
# Execute Nomad as the 'nomad' user (recommended for security)
# Note: You'll need to create this user first and adjust ownership of /opt/nomad/data
# User=nomad
# Group=nomad

# Set the binary path and tell it to load ALL configuration files 
# from the /etc/nomad.d directory.
ExecStart=/usr/local/bin/nomad agent -config /etc/nomad.d

# Reload the configuration without stopping the running agent
ExecReload=/bin/kill -HUP $MAINPID

# Keep the service running even if the main process exits, 
# as long as the OS thinks it's a good time to restart it.
Restart=on-failure
RestartSec=10s

# Kill only the main process (Nomad handles its child processes)
KillMode=process
KillSignal=SIGINT

# Increase limits for file descriptors and processes (essential for high concurrency)
LimitNOFILE=65536
LimitNPROC=infinity

[Install]
# This line ensures Nomad starts automatically when the system boots up
WantedBy=multi-user.target

Post-Creation Steps (All 5 VPS)

After creating or updating the /etc/systemd/system/nomad.service file:

Reload systemd: You must tell systemd to load the new service definition.
```
sudo systemctl daemon-reload
```
Enable the service: This creates the necessary symlink to start Nomad at boot.
```
sudo systemctl enable nomad
```
Start the service: (If not already running, or to restart with the new config).
```
sudo systemctl start nomad
```
Check Status: Confirm it is running and healthy.
```
sudo systemctl status nomad
```

Raw

nomad_jobs.md

That's a great approach to understanding Nomad. The three common job types are differentiated by the value of the type stanza, which defines the scheduler's behavior regarding failure, scaling, and placement.

Here are the most straightforward demos and first principles for the three common job types:

🥇 1. Service Jobs (Long-Lived Applications)

First Principles

The primary goal is High Availability (HA). The service must run forever, and if it fails, it must be restarted immediately and automatically. Scaling is manual (or triggered by HPA).

The Demo: A Highly Available Web Server (Your Rust API)

Job Feature	Explanation
`type = "service"`	Tells Nomad that this is a long-running application.
`count = 3`	Specifies Desired State: Always keep three instances running across the cluster.
`restart` stanza	Defines the failure tolerance (e.g., attempt 3 restarts within 30 minutes).

HCL Snippet

This job ensures three replicas of an HTTP echo server are constantly available, automatically restarting or relocating them upon failure.

job "web-api" {
  type = "service"
  datacenters = ["dc1"]

  group "server" {
    count = 3
    
    restart {
      attempts = 3
      interval = "30m"
      delay = "15s"
      mode = "fail" # Stop trying if attempts are exhausted
    }

    network { port "http" { static = 8080 } }

    task "echo" {
      driver = "docker"
      config { image = "hashicorp/http-echo" }
      service { name = "web-service" port = "http" }
    }
  }
}

🥈 2. Batch Jobs (Short-Lived Tasks)

First Principles

The primary goal is Execution and Completion. The task should run, finish its work, and then be removed. Restarting is usually undesirable unless the task is idempotent.

The Demo: A Database Migration Script

Job Feature	Explanation
`type = "batch"`	Tells Nomad to run the job until it hits a terminal state (success or failure).
`count = 1`	Typically runs only once.
`restart` stanza	Often set to `mode = "fail"` or omitted, as a failed batch job usually requires human review.

HCL Snippet

This job runs a script once to simulate a database migration.

job "db-migration" {
  type = "batch"
  datacenters = ["dc1"]
  priority = 100 # Batch jobs often have high priority for immediate execution

  group "migration" {
    count = 1
    
    # Do NOT automatically restart if it fails, as a migration should be reviewed
    restart { mode = "fail" } 

    task "migrate" {
      driver = "docker"
      config {
        image = "my-repo/migration-tool:latest"
        # Command to run the migration and then exit
        command = "/usr/local/bin/migrate"
      }
    }
  }
}

🥉 3. System Jobs (Host-Specific Agents)

First Principles

The primary goal is Guaranteed Placement. The task must run exactly once on every eligible Nomad client, regardless of resource constraints (within reason).

The Demo: A Logging Agent (e.g., Fluent Bit)

Job Feature	Explanation
`type = "system"`	Tells Nomad to run the job on every client and to automatically schedule it when new clients join the cluster.
`count` stanza	Ignored. The count is determined by the number of eligible clients.
`constraint` stanza	Often used to limit placement (e.g., only on nodes tagged as "Linux" or "Storage").

HCL Snippet

This job ensures a logging agent runs on every machine to collect host logs.

job "node-logger" {
  type = "system" # <--- IMPORTANT: This forces placement on all clients
  datacenters = ["dc1"]

  group "agent" {
    # No 'count' stanza is needed

    task "fluent-bit" {
      driver = "docker"
      config {
        image = "fluent/fluent-bit:latest"
        # Requires mounting the host log directory
        volumes = ["/var/log:/var/log"]
      }
      
      # Use a constraint to target specific OS types if needed
      constraint {
        attribute = "${attr.kernel.name}"
        value = "linux"
      }
    }
  }
}

You can find more detailed examples and demonstrations of how Nomad handles different workloads, including these three types, in video: From Zero to WOW! with Nomad.

quanhua92/k3s_or_nomad.md

Select an option

No results found

Select an option

No results found

⚖️ Nomad vs. K3s Resource Footprint

Why Nomad is Leaner:

📊 Estimated Resource Usage for Nomad Agents (Orchestration Layer Only)

1. CPU Usage (Per Agent)

2. RAM Usage (Per Agent)

3. Network Bandwidth (Total Daily Traffic)

💡 Estimating Total VPS Usage

Example Estimation for Your Rust Binary

Conclusion

🚀 Step 1: Preparation (All 5 VPS)

🔒 Step 2: Configure and Enable TLS (All 5 VPS)

⚙️ Step 3: Configure and Start Server Agents (10.0.0.1 - 10.0.0.3)

/etc/nomad.d/config.hcl (Server Agents)

🖥️ Step 4: Configure and Start Client Agents (10.0.0.4 - 10.0.0.5)

/etc/nomad.d/config.hcl (Client Agents)

🎯 Step 5: Run Your First Job

Configure System Service

📄 Essential File: nomad.service

Post-Creation Steps (All 5 VPS)

🥇 1. Service Jobs (Long-Lived Applications)

First Principles

The Demo: A Highly Available Web Server (Your Rust API)

HCL Snippet

🥈 2. Batch Jobs (Short-Lived Tasks)

First Principles

The Demo: A Database Migration Script

HCL Snippet

🥉 3. System Jobs (Host-Specific Agents)

First Principles

The Demo: A Logging Agent (e.g., Fluent Bit)

HCL Snippet