A comprehensive guide covering everything that can go wrong with Docker and `docker compose` on RHEL (with SELinux disabled, local XFS storage, using the modern `docker compose` plugin), with particular focus on inter-container communication issues, filesystem migration problems, and networking edge cases that match the production setup.

Comprehensive Docker on RHEL Troubleshooting Guide

Introduction

Migrating a containerized environment between different host systems can expose numerous failure modes. In this scenario, Docker Engine v26.1.3 (with the Docker Compose v2.27.0 plugin) was moved from an Ubuntu setup (using S3-backed storage for volumes) to a Proxmox-based RHEL host. The new host uses local storage on an XFS filesystem for Docker’s data-root, and SELinux has been fully disabled. After migration, containers could ping each other, yet HTTP requests (e.g. via curl) between them failed to get proper responses. This document serves as a detailed guide to all possible failure modes, edge cases, and troubleshooting areas for such a setup. We will cover everything from filesystem quirks and storage driver issues to networking, DNS, firewall rules, volume permissions, and container startup sequencing. The guide is organized into clear sections and subsections, making it easy to scan and find relevant insights.

Environment Summary:

  • OS / Platform: RHEL (Red Hat Enterprise Linux) on Proxmox VM (local disks)
  • Docker Engine: 26.1.3 (Community Edition)
  • Docker Compose: v2.27.0 (Docker CLI plugin, not the legacy docker-compose binary)
  • Filesystem: XFS for Docker’s data directory (e.g. /var/lib/docker), local disk storage
  • Storage Driver: OverlayFS (using overlay2)
  • Security: SELinux disabled (not enforcing), no Kubernetes in use
  • Migrated From: Ubuntu host (previously using S3-backed storage for some volumes)

Observed Symptoms Post-Migration:

  • Containers can reach each other by network (ICMP ping succeeds), indicating basic connectivity.
  • However, services in containers do not respond to HTTP requests from peers (e.g. curl between containers times out or fails).
  • The issue affects application-layer communication (web APIs, file uploads) despite low-level network being up.

This guide will delve into numerous topics to diagnose and resolve such problems. We begin with filesystem and storage driver considerations (especially XFS and overlay2 specifics), then examine network and connectivity issues (bridging, Docker networking, firewall, DNS), and proceed to container runtime aspects (permissions, volumes, daemon configuration, startup order, health checks, etc.). We also highlight known gotchas when migrating data between different Linux distributions (Ubuntu -> RHEL) and provide recommendations for each area. Technical depth and actionable steps are emphasized, as the target audience is technically advanced.

1. Filesystem and Storage Driver Considerations

One of the first areas to inspect is the host filesystem and Docker’s storage driver configuration. In our scenario, the Docker data resides on an XFS filesystem and uses the overlay2 storage driver. Both XFS and overlay/overlay2 have specific requirements and potential pitfalls that could impact container behavior after a migration.

1.1 XFS Filesystem Requirements (d_type Support)

Docker’s overlay/overlay2 driver requires that the backing filesystem (XFS in this case) is formatted with d_type support enabled (this corresponds to ftype=1 in XFS format options). Without d_type, the overlay driver cannot reliably identify directory entries, leading to errors in file operations and even Docker daemon startup failures (Docker on CentOS 7 with xfs filesystem can cause trouble when d_type is not supported | by Khushal Bisht | Medium) (Docker failed to initialize due to xfs filesystem format on Linux in Bitbucket Cloud | Bitbucket Cloud | Atlassian Support). Common issues when running Docker on an XFS volume without d_type include inability to chown/chmod files, failures to delete files, and weird file listing outputs (e.g., files showing up as ???????? with ls). For instance, a Stack Overflow report described files that could not be deleted and appeared as question marks when listed, in a Docker setup using XFS with ftype=0 (Docker volume mapping file corruption when filesystem is xfs and storage driver is overlay - Stack Overflow) (Docker volume mapping file corruption when filesystem is xfs and storage driver is overlay - Stack Overflow).

Docker Engine will actually refuse to use overlay2 on XFS if d_type is off – producing an error like: “overlay2: the backing xfs filesystem is formatted without d_type support… Backing filesystems without d_type support are not supported” (Docker failed to initialize due to xfs filesystem format on Linux in Bitbucket Cloud | Bitbucket Cloud | Atlassian Support). In other words, if your XFS is missing ftype=1, Docker’s overlay2 driver won’t function correctly or at all. The solution is to reformat or recreate the XFS filesystem with -n ftype=1 (enabling d_type) (Docker on CentOS 7 with xfs filesystem can cause trouble when d_type is not supported | by Khushal Bisht | Medium). This may involve moving data off, reformatting, and moving data back – a non-trivial operation if the system is already in use.

How to Check: Run xfs_info <mountpoint> (for example, xfs_info /var/lib/docker if that’s an XFS mount). In the output, look for the “naming” section and the ftype value. ftype=1 means d_type is enabled, whereas ftype=0 means it is disabled (Docker on CentOS 7 with xfs filesystem can cause trouble when d_type is not supported | by Khushal Bisht | Medium). If it’s 0, you have likely identified a critical issue. In that case, your main options are to recreate the filesystem with -n ftype=1 or to relocate Docker’s data-root to a filesystem that does support d_type (Docker volume mapping file corruption when filesystem is xfs and storage driver is overlay - Stack Overflow).
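
A minimal check sequence along these lines, assuming the Docker data-root is the default /var/lib/docker and sits on the XFS mount in question:

    # Is the XFS filesystem behind Docker formatted with d_type support?
    xfs_info /var/lib/docker | grep ftype        # expect "ftype=1"

    # What did Docker itself detect?
    docker info 2>/dev/null | grep -E 'Storage Driver|Backing Filesystem|Supports d_type'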

In our context, ensure that the RHEL host’s XFS volume for Docker was formatted with ftype=1. Many modern RHEL/CentOS installations do set ftype=1 by default on XFS, but if the system was created with older defaults or migrated from an older image, it’s worth verifying. Not meeting this requirement could cause Docker to either refuse to start the containers or exhibit strange file behavior inside containers. For example, one known manifestation was file corruption and “No such file or directory” errors for volume-mounted files until the XFS ftype issue was resolved (Docker volume mapping file corruption when filesystem is xfs and storage driver is overlay - Stack Overflow) (Docker volume mapping file corruption when filesystem is xfs and storage driver is overlay - Stack Overflow).

1.2 Overlay2 Storage Driver Edge Cases

Assuming the XFS d_type prerequisite is satisfied, the overlay2 storage driver should operate normally. However, overlay filesystems have their own quirks and edge cases to consider, especially after migrating data:

  • Verify Overlay2 is Actually in Use: Run docker info and check the “Storage Driver” and “Backing filesystem” lines. It should report Storage Driver: overlay2 and Backing Filesystem: xfs with Supports d_type: true (Docker on CentOS 7 with xfs filesystem can cause trouble when d_type is not supported | by Khushal Bisht | Medium). If it shows a different driver (like devicemapper or vfs) unexpectedly, it means Docker possibly fell back due to an issue. For example, if Supports d_type: false was detected, Docker might have automatically switched to devicemapper or refused to start. Using a slower/unsupported driver could degrade performance or functionality, so ensure overlay2 is active. The docker info output will also list any warnings (for instance, it would warn if using loopback devicemapper, or if there’s an overlay issue).

  • Layer Count and Performance: Overlay2 supports up to 128 lower layers by default (OverlayFS storage driver - Docker Docs). In most cases this is plenty (typical images have far fewer layers), but if you have extremely layered images (perhaps built via many incremental steps), hitting this limit could cause errors in container start. This is rare, but worth noting as a theoretical edge case. More commonly, having many layers can slow down container startup due to overlay lookups. If the migration involved rebuilding images differently, consider flattening images or using multi-stage builds to reduce layers if performance is an issue.

  • Overlay2 on XFS Performance Considerations: There are some reports that overlay2 on XFS can exhibit different performance characteristics compared to ext4. For example, one user observed that a container workload generated massively more write I/O on XFS than on ext4 for the same task (600MB of writes on ext4 vs. 16GB on XFS) (Why is my XFS Block IO insane compared to my EXT4 Block IO? - General - Docker Community Forums). The operations on XFS involved significantly more low-level file events (observing 5 operations on XFS for what was 2 operations on ext4 in one test) (Why is my XFS Block IO insane compared to my EXT4 Block IO? - General - Docker Community Forums) (Why is my XFS Block IO insane compared to my EXT4 Block IO? - General - Docker Community Forums). This suggests that certain file-intensive workloads (especially those creating/deleting many small files rapidly) may incur extra overhead on XFS with overlay. The exact cause can be kernel-specific, but it’s something to keep in mind: after migration, if you notice containers have become I/O bound or slower in file operations, the underlying filesystem difference might be a factor. While not a “failure” mode per se, it’s an edge performance pitfall. If encountered, options include tuning XFS mount options, using prjquota for better space management, or in extreme cases considering a different filesystem for that workload.

  • Inode Exhaustion / XFS Quota: On ext4, one has to watch for running out of inodes if there are millions of small files in containers/volumes. XFS dynamically allocates inodes, so it’s less likely to hit an inode limit, but do monitor disk space. If using XFS project quotas (common if one sets --storage-opt size=... on containers), be aware of a known issue: Docker’s use of XFS project quotas can sometimes leave residual project IDs or not fully clean up after container removal (Docker Run Error with --storage-opt and Overlay2 on XFS with pquota). This could manifest as an inability to start new containers with --storage-opt size or weird quota exceeded errors even when space is free. If you’re not explicitly using size quotas, this likely won’t apply, but it’s good to note for completeness.

  • Data Integrity After Migration: If the migration involved copying the actual Docker directory (/var/lib/docker) with the overlay2 directories from Ubuntu to RHEL, this can be problematic. The overlay2 directory contains many symlinks and metadata files (for layer diffs, etc.). Copying those between different filesystems or OSes could break references or lose extended attributes. For instance, overlay uses special whiteout files (character device files with 0/0 device numbers) to mark deletions; if a copy tool wasn’t aware of these, it might not preserve them. A safer migration approach is to re-pull or re-build images on the new host or use docker save/docker load for images, rather than copying the overlay directories directly (see the sketch after this list). If you did copy them, double-check that containers actually launch and that there aren’t hidden inconsistencies (check container logs for file not found errors that could stem from missing whiteouts or incomplete layers).

  • Orphaned/Residual Data: After getting everything running on the new host, you might want to clean up any leftover artifacts from the migration. Use docker image prune and docker container prune as appropriate to remove dangling images or stopped containers. Since space is now local, old layers from pre-migration that are not used could consume space. Also, ensure the old volumes from the Ubuntu host were properly brought in (we’ll cover volume migration in detail later). If any overlay layer directories were half-copied or not needed, removing them requires caution (only let Docker do it via prune or docker system prune if you’re sure things are not referenced).
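
As referenced above, a sketch of migrating images via export/import instead of copying overlay2 directories (image names are placeholders):

    # On the old (Ubuntu) host: export the images you need
    docker save -o app-images.tar myorg/web:1.2 myorg/worker:1.2

    # Copy app-images.tar to the new RHEL host, then:
    docker load -i app-images.tar

    # Or simply re-pull everything from the registry on the new host
    docker compose pull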

In summary, for filesystem and storage: the top priority is confirming XFS is formatted correctly for overlay2 (Docker on CentOS 7 with xfs filesystem can cause trouble when d_type is not supported | by Khushal Bisht | Medium) and that Docker is indeed using overlay2 without errors. Once that baseline is confirmed, monitor for any performance or storage oddities that might relate to overlay on XFS, and adjust accordingly. With the filesystem stable, we can move on to networking aspects, which in this case appear to be where the major symptoms lie.

2. Networking and Connectivity Issues

Networking issues are often the culprit when containers can ping each other (ICMP) but higher-level protocols like HTTP fail. In a Docker context, “ping works but HTTP doesn’t” usually means that basic IP connectivity is present, but something is interfering with TCP/HTTP traffic or the application isn’t responding. Key areas to investigate include the Docker bridge network setup, host firewall rules (iptables/firewalld), container DNS settings, and any Docker network configuration mismatches after migration. We will also consider MTU or fragmentation issues that might allow small packets (ping) but drop larger ones.

2.1 Docker Bridge Network and Inter-Container Communication

By default, Docker creates a bridge network interface (docker0) on the host for containers. In Docker Compose, a user-defined bridge network (often named after your compose project) is typically used. Containers on the same bridge network should be able to communicate freely (all ports open) unless something on the host is restricting it. Key points to verify:

  • IP Forwarding and Bridge Config: The host’s kernel must allow packet forwarding between interfaces (this is usually enabled by Docker). Check sysctl net.ipv4.ip_forward; Docker sets it to 1 when starting if it’s not already. Also check that the policy for the FORWARD chain in iptables is set to ACCEPT (Docker usually modifies this). If the policy is DROP or there are restrictive rules, container-to-container traffic could be blocked at the host level. Docker automatically adds rules to iptables to allow established/related connections and any traffic on the docker networks, but these could be missing if the host firewall is interfering (discussed in the next subsection). A quick way to verify these basics from the host is sketched after this list.

  • Docker Network Subnet Changes: When you moved to the new host, the default bridge network subnet might differ from the old one (especially if you didn’t copy Docker’s network configs). For example, Docker might have used 172.17.0.0/16 on Ubuntu and also uses 172.17.0.0/16 on RHEL by default. If, however, the old host had a custom bridge IP or if the new host had an existing network conflict, Docker could choose a different subnet (Docker can auto-pick a non-conflicting subnet). If your containers’ applications had any hard-coded IPs or access lists expecting the old subnet, that could break connectivity. It’s generally better to rely on container names (DNS) rather than fixed IPs, but it’s worth confirming that no such assumption was baked into your configs.

  • User-Defined Networks in Compose: Docker Compose v2 will usually create a network named <project>_default. All services in the compose join that network by default. Ensure that all the relevant containers are indeed on the same network (you can verify with docker compose ps or docker inspect <container> to see its networks). If you accidentally launched containers in different networks (say by running parts of the compose separately with different project names), they wouldn’t be able to talk except via the host. This doesn’t seem to be the case here, but it’s a general edge-case to mention.

  • Exposed vs Published Ports: Remember that within a Docker network, containers can reach each other on the container’s internal ports regardless of whether those ports are published to the host. “Exposing” ports in Dockerfile or Compose is just documentation; actual connectivity depends on being on the same network. So if Container A needs to talk to Container B’s service, ensure you use B’s internal port. (E.g., if B runs on port 8080 internally, A should target B:8080). If you changed the compose file or forgot the correct ports, that could lead to a ping working (ICMP doesn’t use ports) but the TCP connection being refused or not established. Double-check the docker-compose YAML for correct port mappings and service interconnections.

  • Name Resolution: On a user-defined bridge network (which is what Compose creates), Docker provides an embedded DNS server to resolve container names; on the legacy default bridge, automatic name resolution is not available without --link. So container “web” can resolve “db” to the appropriate IP, etc. If ping by container name works, DNS is fine. If you find yourself able to ping an IP but not a name, then DNS might be an issue. That doesn’t appear to be the primary problem here since ping did succeed presumably by name or IP. However, confirm that within a container, the file /etc/resolv.conf is pointing to Docker’s DNS (127.0.0.11, the embedded DNS, on user-defined networks) or appropriate servers. If a container was started with --network=host by any chance (bypassing Docker networking), then the ping would be on the host network and other rules apply. But here it sounds like standard bridging.
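
A quick host-side sketch of the checks above (the compose network name is an example):

    # Kernel forwarding and the iptables FORWARD policy
    sysctl net.ipv4.ip_forward                   # expect "= 1"
    iptables -L FORWARD -n | head -3             # policy should be ACCEPT, or Docker's rules must allow the traffic

    # Which networks exist, and which containers are attached to the compose network?
    docker network ls
    docker network inspect myproject_default --format '{{json .Containers}}'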

In essence, ensure that the basic Docker network plumbing is correct: containers on the same network should have unrestricted access to each other by default (no Docker-level filtering of ports). If that’s not the case, it usually points to a host-level policy blocking traffic. This leads us to the host firewall discussion.

2.2 Host Firewall (firewalld/iptables) and Docker

RHEL systems use firewalld as a front-end to iptables/nftables. Docker, on the other hand, manipulates iptables rules directly to enable container networking (NAT, port forwarding, inter-container traffic). This can lead to conflicts. A known issue on CentOS/RHEL is that firewalld can override or flush Docker’s iptables rules, causing connectivity problems that are sometimes non-obvious. In fact, Red Hat’s documentation has noted that firewalld and Docker can conflict, especially if firewalld is restarted while Docker is running (Docker and firewalld mess in CentOS 7 | by Azhagarasu A | Medium).

One real-world symptom: A container’s web page was accessible and basic pings worked, but file uploads via that web app failed due to firewalld blocking certain connections (Docker and firewalld mess in CentOS 7 | by Azhagarasu A | Medium) (Docker and firewalld mess in CentOS 7 | by Azhagarasu A | Medium). The culprit was that firewalld had removed or disallowed the Docker-added NAT rules, so larger or multi-packet exchanges were being dropped. In the user’s words: “The webpage has a function to upload files, but it was not accepting file uploads... It took me quite long to figure out firewalld is the culprit” (Docker and firewalld mess in CentOS 7 | by Azhagarasu A | Medium). This aligns closely with our scenario (ping works, HTTP POST fails), strongly hinting that firewalld on the RHEL host may be interfering with Docker networking.

Specific behaviors to understand:

  • firewalld vs Docker iptables rules: When firewalld is active, if it reloads or if certain events occur, it might flush chains it doesn’t recognize. Docker creates custom chains (DOCKER, DOCKER-USER) for handling container NAT and filtering. If those get wiped or if the default policies are strict, traffic from containers might be blocked. An example error from firewalld logs: “iptables: No chain/target/match by that name” referring to Docker’s rules (Docker and firewalld mess in CentOS 7 | by Azhagarasu A | Medium). Essentially, Docker expects its chains to be present; if firewalld removed them, Docker’s rules no longer function, preventing outgoing NAT or inter-container traffic (Docker and firewalld mess in CentOS 7 | by Azhagarasu A | Medium) (Docker and firewalld mess in CentOS 7 | by Azhagarasu A | Medium).

  • Forwarding chain default: RHEL systems using firewalld often have a default DROP for forwarding, expecting firewalld to open necessary paths. If Docker’s rules are not in effect, this could mean container-to-container forwarding is blocked. Ping might still work because ICMP could be accepted as related traffic or because the containers are on the same bridge (Linux may allow bridge local traffic). But TCP being blocked is consistent with a DROP policy on forwarding.

  • Masquerade (NAT) issues: If containers need internet access or if you access containers from the host network, Docker’s MASQUERADE rule is needed (in iptables POSTROUTING chain for docker0). Firewalld might remove this rule (Docker and firewalld mess in CentOS 7 | by Azhagarasu A | Medium), leading to containers unable to reach external networks, or external clients unable to reach containers on published ports.

Troubleshooting Tip: Check if firewalld is running: systemctl status firewalld. If it is, a quick test is to disable it temporarily: systemctl stop firewalld. Then test your container HTTP connections again. If they start working when firewalld is off, you’ve confirmed the issue. (If firewalld was already disabled, then obviously this isn’t the culprit in your case, but given SELinux was manually disabled, it’s possible firewalld was left active unknowingly.)
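
A quick, reversible test sequence along those lines (service names are examples; curl must exist in the client container):

    systemctl status firewalld                      # is it active at all?
    systemctl stop firewalld                        # temporarily stop it
    docker exec -it web curl -sv http://api:8080/   # re-run the failing request between containers
    systemctl start firewalld                       # restore the previous state after the test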

Solutions: There are two general solutions if firewalld is the cause:

  1. Disable firewalld and use raw iptables: On a dedicated Docker host, many administrators choose to disable firewalld altogether and let Docker manage iptables directly (Docker and firewalld mess in CentOS 7 | by Azhagarasu A | Medium). This avoids the conflict entirely. If you go this route, ensure you have no other services that rely on firewalld’s management. You’d manage any needed host firewall rules via direct iptables or another mechanism.
  2. Configure firewalld to coexist with Docker: This can be done by adding the docker bridge to a trusted zone or configuring firewalld to allow the traffic. For instance, you can instruct firewalld to accept traffic on the docker0 interface (e.g., firewall-cmd --permanent --zone=trusted --add-interface=docker0). You also want to ensure masquerading is allowed for the docker zone. Some guides recommend switching Docker to use iptables=false and then manually replicating needed rules in firewalld, but this is complex and not usually needed if simply trusting the interface works. The important thing is that firewalld should not be removing Docker’s rules. In RHEL8/9 (with nftables backend), the interplay is a bit different but the core idea remains that Docker’s iptables-nft entries need to persist.
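
A sketch of option 2 on a default setup (the interface and zone names are assumptions; check firewall-cmd --get-active-zones for your host):

    # Trust traffic arriving on the Docker bridge
    firewall-cmd --permanent --zone=trusted --add-interface=docker0

    # Allow masquerading in the zone that holds the host's uplink (often "public")
    firewall-cmd --permanent --zone=public --add-masquerade

    # Apply the changes, then restart Docker so it re-creates its iptables chains
    firewall-cmd --reload
    systemctl restart docker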

Given that SELinux is disabled (which might indicate a mindset of reducing security layers to simplify troubleshooting), you might opt to also disable firewalld for now, at least to confirm it’s the cause of the HTTP blockage. The Medium article on Docker and firewalld explicitly notes that disabling firewalld resolved the file upload issue (Docker and firewalld mess in CentOS 7 | by Azhagarasu A | Medium). If disabling it is not an option (maybe the host serves other roles and needs a firewall), then configuring as above is the path forward.

2.3 IPTables and Kernel Modules on RHEL

Another Fedora/RHEL-specific networking hiccup can come from required kernel modules for iptables NAT not being loaded. Docker usually loads these, but in some cases (especially with a minimal installation), you might find that certain netfilter modules (like xt_nat, xt_conntrack, br_netfilter) weren’t loaded, leading to incomplete firewall rules.

There was a case on Fedora where inside containers, certain iptables commands failed with “No chain/target/match by that name”, which was resolved by loading the xt_nat and related modules on the host (Ping (icmp) work but not tcp · Issue #14 · qoomon/docker-host · GitHub). Essentially, if NAT rules aren’t working, ensure the nf_nat kernel module is active. Also, for containers to communicate across bridges, Linux needs br_netfilter to pass bridged traffic to iptables (Docker typically enables this via sysctl net.bridge.bridge-nf-call-iptables=1). If that sysctl or module is turned off, then the firewall might not even see the packets.

Action items:

  • Run lsmod | grep -E 'xt_nat|xt_multiport|nf_conntrack' to see if these modules are loaded. If not, load them with modprobe -a xt_nat xt_multiport nf_conntrack (and ensure they load on boot); a short sketch follows this list. The Fedora Docker issue was fixed by loading xt_nat and xt_multiport (Ping (icmp) work but not tcp · Issue #14 · qoomon/docker-host · GitHub).
  • Ensure br_netfilter is loaded if you need iptables rules on bridged traffic. Check sysctl net.bridge.bridge-nf-call-iptables = 1 (and same for ip6tables if IPv6).
  • Since you’re on RHEL, be aware of the nftables vs iptables situation. RHEL8+ by default uses the iptables-nft backend (iptables commands translate to nftables). Docker doesn’t natively speak nftables yet; it relies on the iptables interface. Make sure the system’s iptables command is using the legacy backend if issues persist. In some cases, switching to the legacy iptables backend (or ensuring the compatibility mode is working) can solve strange firewall behaviors (Native support for nftables · Issue #1472 · docker/for-linux - GitHub) (How to install Docker CE on RHEL 8 - LinuxConfig.org). On RHEL, you can install iptables-services and set update-alternatives to use iptables-legacy if needed, or just ensure the nft compatibility is functioning. Check iptables -L -t nat (for the DOCKER NAT chain) and iptables -L (for the DOCKER and DOCKER-USER filter chains) to confirm Docker’s rules are present.
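
Putting the first two action items into commands (run as root; the modules-load file name is an assumption):

    # Check which netfilter helpers are loaded, and load any that are missing
    lsmod | grep -E 'xt_nat|xt_multiport|nf_conntrack|br_netfilter'
    modprobe -a xt_nat xt_multiport nf_conntrack br_netfilter

    # Persist the modules across reboots
    printf '%s\n' xt_nat xt_multiport nf_conntrack br_netfilter > /etc/modules-load.d/docker-netfilter.conf

    # Make bridged container traffic visible to iptables
    sysctl -w net.bridge.bridge-nf-call-iptables=1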

In summary, the host’s networking should be configured such that Docker’s own iptables rules are intact and effective. Firewalld can disrupt them (Docker and firewalld mess in CentOS 7 | by Azhagarasu A | Medium), and missing kernel modules can prevent those rules from working (Ping (icmp) work but not tcp · Issue #14 · qoomon/docker-host · GitHub). Both need to be addressed to get reliable container connectivity.

2.4 Container DNS Resolution

DNS issues can sometimes masquerade as “network” issues. For example, if containers can ping each other by IP but not by name, or if they can ping external IPs (8.8.8.8) but not resolve domains, then DNS is the problem. After migrating to RHEL, it’s worth checking a few things about DNS:

  • Host DNS Setup: On Ubuntu, systemd-resolved might have been in use, meaning /etc/resolv.conf could point to 127.0.0.53 and act as a stub resolver. Docker would typically copy the host’s resolv.conf into containers (unless overridden). On RHEL, NetworkManager is usually managing /etc/resolv.conf (often linking it to something in /run). If the host’s resolv.conf on RHEL points to an internal DNS (like a company DNS server or 127.0.0.1 because dnsmasq is running), the containers will use the same and that DNS must be reachable from containers. If it’s 127.0.0.1 (host localhost), containers cannot resolve via that by default (since 127.0.0.1 inside a container is the container itself). In such cases, containers would fail DNS resolution. The fix is to configure Docker with a DNS server (e.g., your real DNS IP) or ensure the host’s DNS resolver is accessible. For instance, if NetworkManager uses dnsmasq on the host at 127.0.0.1, you can either disable that so that resolv.conf lists the real DNS servers, or use Docker’s --dns option globally via daemon.json.

  • Internal Container Name Resolution: In compose setups, Docker’s embedded DNS handles container names on the user-defined network. This generally works out of the box. But if, for some reason, you see containers not resolving each other (perhaps you see errors like “hostname not found”), verify that the containers are on the same network and that the Docker DNS is not overridden. One edge case: if you explicitly set a container’s DNS (via dns: in compose or daemon), it might bypass Docker’s internal DNS, thus not knowing container names. It’s best either not to override DNS where internal name resolution is needed, or to use Docker’s network alias feature if required. By default, no config change means names resolve fine.

  • Connection to External Services: If any container needs to reach an external host (like an API or database outside this Docker host), DNS issues could also cause “hangs” that might look like connectivity issues. For example, a web app container trying to call api.someservice.local might hang if it can’t resolve the name due to DNS misconfiguration. So ensure external DNS is working from within containers (docker exec <container> nslookup google.com or similar test).

  • Testing and Fixing DNS: A quick test is docker run --rm busybox nslookup example.com to see if DNS works in a new container. If it fails, focus on the host’s resolv.conf and Docker’s config. You can set custom DNS servers in /etc/docker/daemon.json like:

    {"dns": ["8.8.8.8", "8.8.4.4"]}

    as a blunt solution (remember to restart Docker), or fix the host’s DNS resolution so it doesn’t rely on localhost. Docker’s embedded DNS server is always in play on user-defined networks, so container-name resolution itself needs no extra daemon setting.
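
After adjusting /etc/docker/daemon.json as above, a short verification sequence (service names are examples):

    # Restart the daemon so the new DNS setting takes effect
    systemctl restart docker

    # External resolution from a throwaway container
    docker run --rm busybox nslookup example.com

    # Service-to-service name resolution (requires getent or nslookup inside the image)
    docker exec -it web sh -c 'getent hosts db || nslookup db'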

DNS might not be the direct cause of HTTP between containers failing (since presumably they pinged by name or IP already), but it’s part of a comprehensive check. It’s especially relevant if any part of your system changed IP schemes (for instance, if you had hostnames in config files now pointing wrong due to environment change).

2.5 Network MTU and Fragmentation (Edge Case)

One less common but possible cause for a situation where “ping works but large requests don’t” is an MTU or fragmentation issue. If the network MTU is misconfigured, small packets (like ping’s 64 bytes) go through, but larger TCP packets might get dropped if they can’t be fragmented or if a blackhole prevents ICMP Fragmentation Needed messages.

  • Check MTU: On the host, check ip link show docker0 and the host interface. If the host’s NIC (or bridge in Proxmox) is, say, 1500 and docker0 is also 1500, that’s normal. If the host network is using a tunnel or VLAN that effectively lowers MTU (common in cloud environments or some SDNs), you might need to set Docker’s MTU accordingly. For instance, if the host’s actual throughput MTU is 1450 (typical when encapsulating in VXLAN or GRE), Docker0 at 1500 could cause containers to send packets that get dropped upstream. While Proxmox bridging locally shouldn’t change MTU, if the underlying network used by Proxmox has special settings, consider it. Proxmox often defaults to standard Ethernet MTU though.

  • Testing: You can test for MTU issues by pinging one container from another with a large packet size, e.g., ping -s 2000 -M do <other_container_ip> to send a 2000-byte ICMP payload with Don’t Fragment set. If you see “fragmentation needed” messages or the large ping fails, there may be a path MTU issue.

  • Solution: If MTU is an issue, you can add { "mtu": 1400 } (for example) to Docker daemon.json to make docker0 use that MTU. This ensures containers use smaller packets that won’t get dropped.
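
Putting the MTU checks together (interface and container names are examples):

    # Compare MTUs on the Docker bridge and the host uplink
    ip link show docker0 | grep mtu
    ip link show eth0 | grep mtu

    # Probe with a large, non-fragmentable ICMP payload from one container to another
    docker exec -it web ping -c 3 -s 2000 -M do db

    # If a lower MTU is needed, set it in /etc/docker/daemon.json and restart Docker:
    #   { "mtu": 1400 }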

MTU problems typically would also affect ping if the ping payload is big enough, so it’s not the first suspect unless you notice issues like only small responses succeed. It’s included here for completeness, especially because after migrations, sometimes the networking environment (cloud vs on-prem) changes dramatically.

Recap for Networking: Ensure the Docker network is set up with proper iptables rules (consider firewalld impact) (Docker and firewalld mess in CentOS 7 | by Azhagarasu A | Medium), required kernel modules are loaded (Ping (icmp) work but not tcp · Issue #14 · qoomon/docker-host · GitHub), and DNS is functioning for containers. The fact that containers can ping means the basic network namespace and routing is fine; so focus on things that would block TCP or higher-level traffic – firewall rules and possibly application issues. Now, with the low-level network addressed, let’s explore issues at the container and application level, including volumes and permissions.

3. Container Runtime and Volume Issues

Having addressed the host-level factors (filesystem and network), we turn to container-specific issues that often arise after migration. These include file permission mismatches on volumes, differences in user IDs, container health check timing, and application services not starting or binding correctly. We also consider the Docker daemon’s configuration (daemon.json) and how misconfigurations there could lead to subtle problems.

3.1 Volume Permissions and UID/GID Mismatches

Migrating volumes between Linux hosts can easily result in file ownership and permission mismatches, especially if the container uses non-root users. Docker volumes (the contents in /var/lib/docker/volumes/<name>/_data) carry Linux ownership info as numeric UIDs/GIDs. On the source Ubuntu system, files might have been created by certain UIDs; on the new host, those UIDs correspond to potentially different users (or no user, if the UID was only present inside the container). Since Docker does not translate UIDs (it’s not like NFS with idmapping; it’s a direct bind of the ext4/XFS data), the numeric ownership must align with what the container’s expected user is.

Common scenario: An application inside the container runs as a user (say appuser with UID 1000). When the volume was first created on the old host, if it was a named volume, Docker would have initialized it with the content and permissions from the image. If that image had /data owned by appuser:appuser (1000:1000), then the volume’s files are owned by 1000 on the host as well (containers - docker: migrating volumes with correct permissions - Server Fault). Now, on the new host, if you copied those files and preserved ownership, they should still be 1000. If the container’s appuser is UID 1000, then it should match up. But problems arise if:

  • The ownership was not preserved during copy (e.g., everything became owned by root on the new host). Then the app inside the container might not be able to write to its volume (Permission denied errors).
  • The image or user changed. For example, if you updated the container image and it now expects files to be owned by UID 1001 instead of 1000 (maybe the appuser UID changed), the volume would still have old ownership because Docker only sets it on first creation (containers - docker: migrating volumes with correct permissions - Server Fault). After migration, the volume is not empty, so Docker doesn’t re-initialize it. This is a subtle case where an image change leads to a stale ownership on an existing volume (containers - docker: migrating volumes with correct permissions - Server Fault).
  • You bind-mounted directories from the host where permissions differ (not exactly this case, since using named volumes, but if any container uses bind mount, e.g., for logs, ensure the host path permissions are correct).

Symptoms of permission issues: The container’s application might fail to start, write, or read files. In a web service context, a container might be running but any attempt to, say, upload a file or write to a database fails, causing the service to error out – which might present to the outside as the service not responding properly. This could indeed cause HTTP requests to hang or fail (the app might be stuck waiting on an I/O that is failing, etc.). For example, consider an HTTP POST upload that tries to save to /uploads volume, but the container process lacks permission – it might throw an internal error or hang, and from the client side you just see no response or a timeout.

Troubleshooting permissions:

  • Enter the container (docker exec -it <container> sh) and try to touch or modify files in the mounted volume path. See if you get “Permission denied”. Also ls -l to see ownership from inside (it will show numeric IDs if it doesn’t map to a known user in the container’s /etc/passwd).
  • On the host, look at the volume directory and check ownership (ls -n to see numeric IDs). Compare that with the Dockerfile or container’s expected UID. Many official images run as root by default (UID 0), in which case permissions are less often an issue (root can typically write anywhere, unless SELinux was on). But a lot of modern images use non-root users for security.
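
A minimal check sequence for both sides (container, volume, and path names are examples):

    # Inside the container: who does the process run as, and can it write to the volume path?
    docker exec -it app sh -c 'id && ls -ln /data && touch /data/.writetest'

    # On the host: numeric ownership of the named volume's contents
    ls -ln /var/lib/docker/volumes/appdata/_data | head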

Fixing permissions: If you find a mismatch, you have a few options:

  • The simplest fix is often to chown the volume directory on the host to the expected UID:GID. For example, if the container app runs as 1000:1000 and the volume files are currently owned by 0:0, run chown -R 1000:1000 /var/lib/docker/volumes/<volname>/_data on the host (while no container is using the volume), or do it from a temporary container: docker run --rm -v volname:/target alpine chown -R 1000:1000 /target.
  • Alternatively, you can modify the container startup (entrypoint script) to adjust permissions at runtime. This is less ideal in production but sometimes done: e.g., an entrypoint that does chown -R appuser /data if it detects a root-owned volume. This ensures whenever the container starts, it fixes perms. But this needs the container running as a user that can chown (root or CAP_CHOWN).
  • Ensure that when migrating volumes next time, you preserve owners. Using tar is recommended (containers - docker: migrating volumes with correct permissions - Server Fault): for instance, docker run --rm -v volname:/data busybox tar -cC /data . > vol.tar on old host, then move tar, then on new host cat vol.tar | docker run --rm -i -v volname:/data busybox tar -xC /data (containers - docker: migrating volumes with correct permissions - Server Fault). This method keeps ownership and is safer.
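
End to end, the tar-based volume migration might look like this (the volume name is an example):

    # On the old host: archive the volume contents with ownership intact
    docker run --rm -v appdata:/data busybox tar -cC /data . > appdata.tar

    # Copy appdata.tar to the new host, then restore into a volume of the same name
    docker volume create appdata
    cat appdata.tar | docker run --rm -i -v appdata:/data busybox tar -xC /data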

We should also note the security angle: The ServerFault discussion pointed out the risk that if a host user has the same UID as a sensitive container user, they could access the container’s files on the host (containers - docker: migrating volumes with correct permissions - Server Fault) (containers - docker: migrating volumes with correct permissions - Server Fault). That’s partly why named volumes reside in /var/lib/docker/volumes which is root-only accessible. In our case, with SELinux off, there’s no MAC to prevent host access, but normally one might use user namespaces to mitigate host<->container UID overlap (containers - docker: migrating volumes with correct permissions - Server Fault). We won’t digress into user namespaces deeply, but just be aware if you had enabled userns-remap on old or new host, that would intentionally shift UIDs and could definitely complicate volume permissions. The default, however, is user namespaces disabled (so UID 1000 in container == UID 1000 on host filesystem for volumes).

To conclude on permissions: Volume permission issues are a prime suspect for post-migration problems, especially with symptoms like inability to handle file uploads or a service working partially (serving static files fine but failing on writes). They are relatively easy to fix by aligning ownerships. Always ensure the container’s expected user can read/write the needed directories on the volume.

3.2 SELinux (Disabled vs Enabled) and Volume Mounts

In the new RHEL environment, SELinux is disabled (set to permissive or disabled mode). While this removes one layer of potential file access problems, it’s worth discussing because if SELinux were enabled (the default on RHEL), it would need to be addressed for containers to use volumes from the host.

When SELinux is enforcing, files on the host have contexts. Docker by default, if not told otherwise, will mount host directories (for volumes or binds) with the host context, which likely is not accessible by the container domain. The result is permission denied errors even for root inside container, due to context mismatch. The fix for that is to use :Z or :z options on volume definitions, which tells Docker to relabel the files with a context that allows container access (Docker-compose volumes give permission denied after upgrade to ...). In Docker Compose, you can specify :Z on the volume mount (e.g., /host/path:/container/path:Z). This gives the host files a label like container_file_t so that container can read/write.
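
For illustration only (it has no effect while SELinux is disabled), a relabeled bind mount might look like this (the path and image are placeholders):

    # The :Z suffix tells Docker to relabel the host path so the container domain can access it
    docker run -d --name web -v /srv/uploads:/var/www/uploads:Z nginx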

However, since SELinux was disabled, none of this applies at the moment – which is probably one reason it was disabled, to avoid dealing with it. The downside is you lose SELinux protection. For completeness: if you ever enable SELinux again, remember to either relabel existing volume files or add :Z. Otherwise containers that try to read/write those volumes will get immediate permission errors (which could look similar to a standard permission issue). The error messages in logs often mention avc: denied in audit logs or mention SELinux, making it diagnosable.

In summary: With SELinux off, you won’t have SELinux-caused errors, but it’s good to be aware that a properly configured RHEL Docker host would keep it on and use the :Z flag for binds or ensure volumes get correct label. Some people disable SELinux after encountering this exact issue (“why does my container get EACCES on volume writes? – oh SELinux, turn it off”). At least we know SELinux is not directly responsible for the current connectivity issue (that’s more network), but it could have been a factor in volume permission issues if it was on.

3.3 Differences in Docker Compose v2 (Compose CLI Plugin)

The environment now uses the Docker Compose v2 plugin (docker compose command) instead of the old docker-compose Python binary. In theory, Compose v2 is mostly a drop-in replacement and uses the same YAML format and Docker API. However, there are some subtle differences and it’s worth ensuring none are affecting your setup:

  • Compose File Version Field: Compose v2 ignores the version: '2' or version: '3' field in the YAML (Docker-compose vs docker compose - Docker Community Forums). All the old keys are generally supported, but if you had an old Compose file version 2.x with deprecated keys, v2 might warn or error. Check the output of docker compose up for any warnings or differences.

  • Order of Startup: Compose v2 might start services slightly faster or in parallel compared to v1. If your application was relying on the old compose bringing containers up in a certain order (though v1 never officially waited for dependencies beyond depends_on for start order), you might see race conditions now if one container starts quicker. This ties into the next section on startup sequence.

  • Environment File Parsing or Variable Substitution: In some edge cases, Compose v2 had minor differences in how it interpolates ${VAR} in the YAML. Ensure your environment variables are coming through as expected (e.g., docker compose config to see the resolved config).

  • Networking Differences: Compose v2 uses Docker’s networking under the hood just like v1. One thing to note: by default it creates a network with name like project_default. If your old setup explicitly named networks, those names carry over. There’s typically no difference here, but just ensure you didn’t rely on the old behavior where v1 might reuse existing networks by name in some cases. Compose v2 tends to be strict that if a network name is defined in the YAML, it will create (or reuse if exists and external) that exact name. Same for volumes.

  • CLI command differences: The docker compose syntax is slightly different from docker-compose. Most commands are the same, but check any scripts or aliases. For example, docker-compose logs -f maps directly to docker compose logs -f, and docker-compose down -v behaves the same as docker compose down -v. One notable thing: docker compose stores some project metadata (labels, events) slightly differently via the Docker API, but that shouldn’t affect runtime.

Overall, it’s unlikely that the Compose version itself broke container connectivity. But one could imagine a scenario where, say, Compose v2 did not recreate a network that was expected, or it didn’t populate an env var that an app needed to bind to the correct interface, etc. So just keep in mind to double-check the compose deployment output for anything unusual.

3.4 Container Startup Order, Dependencies, and Health Checks

After migration, the timing at which containers come up may differ. Perhaps the new host is faster (or slower), or the Compose v2 runs things in parallel. This can cause race conditions if not handled. For example, imagine a web application container that, on startup, immediately tries to connect to a database container. If the DB is not up yet, the connection will fail. The app might then either crash or go into a retry loop. If it crashes, Compose might restart it (depending on restart policy), but repeated failures could mean the service never becomes reachable (yet you can ping the container itself, because it’s there, just the app inside isn’t ready).

Docker Compose depends_on: In compose file version 3, depends_on no longer waits for the dependency to be “healthy”; it only ensures start order. If you want the web container to wait for the DB, either implement a wait-for-it script or use the healthcheck mechanism with depends_on: condition: service_healthy (supported by the Compose Specification that the docker compose plugin uses). Check whether your compose file had any such logic that did not carry over.
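
A minimal sketch of such a healthcheck-gated dependency, written as a compose override file (service names and the pg_isready test are placeholders for whatever your stack actually runs):

    # docker-compose.override.yml (sketch): start "web" only after "db" reports healthy
    cat <<'EOF' > docker-compose.override.yml
    services:
      db:
        healthcheck:
          test: ["CMD-SHELL", "pg_isready -U postgres || exit 1"]
          interval: 5s
          timeout: 3s
          retries: 10
      web:
        depends_on:
          db:
            condition: service_healthy
    EOF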

Health checks: If your containers have healthcheck defined in the Dockerfile or compose, the Compose v2 will report health status. If a container is unhealthy, it may not be used by others if you built logic around that. For instance, you might have configured a load balancer container to only proxy when backend is healthy. Ensure that all containers are passing health checks. You can see this with docker ps (it will show healthy/unhealthy) or docker inspect on the container for the Health status. If something like the DB is unhealthy, the app may be refusing connections accordingly.

Sequencing fix: If you suspect an order issue, a quick remedy is to (temporarily) start dependencies first manually, or add some sleep in entrypoints (not ideal for production, but as a test). The better solution is to use healthchecks and have the app wait until DB is available. Many orchestration setups use tools or scripts for this.

In our issue of ping vs HTTP, it’s probably not just startup order, because presumably after things settle we still see no HTTP. However, consider if container A can ping B, but when A tries to HTTP to B, maybe B’s service isn’t actually up (maybe B container is running, but its server process failed to start properly). So it’s not a network drop; it’s that the service isn’t listening. This leads us to check container logs.

Examine Container Logs: Always do docker compose logs <service> (or docker logs <container>) to see if, for example, the web server inside container B crashed or threw an error on startup. You might find clues like an exception trace, missing file, or port binding issue (e.g., "Address already in use" or "Failed to bind to port"). If the application inside failed to bind to the expected interface/port, it could be listening on the wrong place. For instance, if an app is configured to listen only on localhost (127.0.0.1 inside container), then other containers cannot reach it (they’d have to be on the same container’s loopback, which they are not). This is a classic mistake: in development one might bind a service to 127.0.0.1, which is fine in a single VM, but in Docker, you often need to bind to 0.0.0.0 so it’s accessible to other containers. So verify that the apps are binding to the container’s network interface, not just loopback. (Ping would work regardless because ping doesn’t rely on the service listening, it’s an ICMP echo to the OS, not to the application.)
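
To confirm what the service inside the target container is actually bound to (the container name is an example; ss or netstat must exist in the image):

    # Listening sockets inside container B: look for 0.0.0.0:<port> rather than 127.0.0.1:<port>
    docker exec -it b sh -c 'ss -tlnp 2>/dev/null || netstat -tlnp'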

Container Restart Policies: If containers are set to restart (e.g., always or on-failure), you might have containers continually restarting in the background if they keep encountering an error. Use docker ps -a to see if containers are flapping (you’ll see status Up (restarting) or many exited instances). A constantly restarting container might respond to ping intermittently (when up) but never complete an HTTP request. Identify such cases and fix the underlying issue (likely configuration or dependency not met causing crash).

3.5 Docker Daemon Configuration (daemon.json) and Systemd

Misconfigurations in Docker’s daemon configuration can also create issues, though they usually would affect all containers broadly. Let’s review some relevant settings one might have in /etc/docker/daemon.json on the new host:

  • Data Root: If you moved the Docker data directory, you might have a data-root entry pointing to the XFS mount (e.g., /var/lib/docker or a custom path). Ensure this is correct. If Docker was started without knowing the data location, it might have created a new empty directory elsewhere (like if someone set it to /mnt/docker but you put data in /var/lib/docker, that would cause Docker not to see your migrated data). So double-check that Docker is indeed using the intended directory (again docker info shows Docker Root Dir).

  • iptables: There’s a flag "iptables": true/false. It should be true (default) to allow Docker to add iptables rules for networks. If someone had set it false (perhaps in an attempt to avoid firewall issues), then Docker would not manage iptables at all – meaning you’d have to manually set up NAT, etc. This is unlikely, but if it were the case, container networking (especially internet access and cross-network) would break. Our symptom is internal HTTP failing, which could happen if iptables was false and the FORWARD chain policy was drop (no automatic rule to allow container traffic). So ensure "iptables": true (or the absence of that key, which defaults to true).

  • Bridge Settings: Options like "bip" (to set a custom bridge IP subnet) or "fixed-cidr" might be present if a custom network range was desired. If so, ensure they don’t conflict with something in the new environment. If the old system had a "bip": "172.18.0.1/16" to avoid a conflict with something, and the new doesn’t need it (or worse, now that conflicts with something else), that could cause networking issues. In such a conflict scenario, Docker might fail to bring up the network or containers might get unexpected IPs. Usually Docker would error on startup if it can't create the bridge IP.

  • DNS settings: As noted, "dns": [...] may be configured. If it points to something not reachable (like an old DNS server IP), containers may have trouble with DNS. Adjust to either use local resolvers or public ones as needed for the environment.

  • Logging driver: By default, Docker uses the json-file logging driver, which writes container stdout/stderr to JSON files on disk. On RHEL, sometimes people use journald so logs go to system journal. If the logging driver changed or log options changed, that could affect your ability to see container logs or could cause performance issues if mis-set (e.g., if it was set to something like syslog which is not running). It usually wouldn’t break connectivity, but not seeing logs could hamper debugging. Preferably, set a log rotation policy if not already (e.g., "log-opts": {"max-size": "10m", "max-file": "5"}) to avoid unlimited logs filling the disk.

  • Cgroup Driver: Modern Docker default is to use the cgroupfs driver on cgroup v1 and systemd driver on cgroup v2. RHEL8 might be cgroup v1 by default (unless enabled v2), RHEL9 is cgroup v2. Docker v26 likely uses systemd driver on cgroup v2 systems. Mismatches here can cause warnings (e.g., if Docker was told to use systemd driver but cgroup v1 is in use, or vice versa). Check docker info for “Cgroup Driver” (Docker on CentOS 7 with xfs filesystem can cause trouble when d_type is not supported | by Khushal Bisht | Medium). If there’s a problem, Docker might log errors but usually it won’t prevent containers from running; it would only matter if you were integrating with Kubernetes (which demands systemd driver on cgroup v2, etc.). Just ensure there’s no red flag in docker info about cgroups.

  • Live Restore / Userland Proxy / Experimental: These are less likely to cause the issues at hand, but for completeness: if live-restore was on, it allows Docker to upgrade without stopping containers – irrelevant here. The userland proxy (handles hairpin connections for published ports) is on by default; disabling it could cause inability for host to access a container’s published port via the external IP (not our case of container->container though). Experimental features likely off unless explicitly enabled.

In essence, review the daemon.json and confirm it matches expectations for the new environment. If uncertain, comparing it with the old environment’s config (if available) might reveal something forgotten. Sometimes migrations miss copying daemon settings – e.g., the old host might have had log driver set to “local” with compression, and new uses json-file with default (could fill disk eventually, but not immediate connectivity issue).
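
A quick way to audit the live configuration against expectations, without digging through files (a sketch using docker info's template support):

    # What the running daemon actually uses
    cat /etc/docker/daemon.json
    docker info --format 'root: {{.DockerRootDir}}  storage: {{.Driver}}  logging: {{.LoggingDriver}}  cgroup driver: {{.CgroupDriver}}'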

Systemd integration: Since Docker is a system service, also ensure the systemd unit file hasn’t been modified in a way that impacts things:

  • Check if any drop-in files exist under /etc/systemd/system/docker.service.d/. For instance, some installations add a drop-in to set HTTP proxy for Docker, or to set --iptables=false, etc. Ensure none of those have unwanted options.

  • The ordering: Docker service should start after network is online. Normally the unit has After=network-online.target and Wants=network-online.target. If for some reason Docker starts very early and perhaps network wasn’t fully ready, there could be issues pulling images etc., but since we’re focusing on container-to-container, that’s not a likely problem.

  • Managing containers directly through systemd units (as is common with Podman, which generates a unit per container) is not applicable here: everything runs under the Docker daemon, so there are no per-container systemd units to check.

To wrap up this section: a careful audit of Docker’s config and the container runtime behavior can reveal misconfigurations or differences from the old environment that cause issues. We suspect volume permissions and firewall were bigger factors, but one should not ignore checking container logs and daemon settings, as they often provide clues (“why is my service not responding? oh, it crashed on startup because of X, as the log says”).

4. Application-Level Issues and Case Studies

Finally, let’s consider the applications running inside the containers and how certain issues manifest at that level. We tie together the previous topics with concrete scenarios and troubleshooting steps. This section is more of a “checklist” of known issues by symptom, which can serve as actionable troubleshooting steps.

4.1 Case: Containers can ping each other but HTTP requests fail

This is the primary symptom we have. Let’s analyze it systematically:

  • Ping succeeds: network namespaces are connected, IP routing is fine, so containers are on the same network. We can largely rule out basic L3 connectivity issues (e.g., no typos in IP, network exists).
  • HTTP fails: Could be a TCP connection failing (no response) or connecting but not receiving data. We need to determine if it’s a connection refusal/reset or a hang (timeout). Using curl -v http://othercontainer:port/ from one container can show if the TCP connection was made or not (it will say “Connected to ...” or it will say “Connection timed out” or “Connection refused”).

If Connection refused occurs immediately, that typically means nothing is listening on that port on the target container (the OS sent back a RST). This points to the service inside the target container not running or not bound correctly. Check that container’s process is actually running and listening on the intended port. Use docker exec <target> netstat -tlnp (if netstat or ss is available in the container) to see listening ports. If it's not listening, focus on the application startup (logs, config).

If Connection timed out, the packets are not being answered at all. This could be a firewall dropping them (as we suspect with firewalld) or the packets never reaching the target. Because ping worked, we know ICMP went through, so TCP being dropped suggests a firewall or similar filtering. On the target container, run tcpdump -n port <port> (if tcpdump is installed; otherwise run tcpdump -i any port <port> on the host and filter by the container’s IP) to see whether the SYN packet arrives. If the SYN arrives but nothing answers, something inside the container is filtering it (a closed port would normally send back a RST rather than stay silent). If the SYN never arrives, the host firewall most likely dropped it. In our context, firewalld dropping forwarded packets is the prime suspect for a timeout scenario. The Medium article showed exactly this situation, where firewalld blocked certain HTTP requests (specifically file-upload POSTs) (Docker and firewalld mess in CentOS 7 | by Azhagarasu A | Medium) (Docker and firewalld mess in CentOS 7 | by Azhagarasu A | Medium).
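A minimal host-side capture, assuming a target container IP of 172.18.0.5 and service port 8080 (both placeholders):

```bash
# Watch for the TCP handshake; adjust the IP and port to your containers
tcpdump -ni any 'host 172.18.0.5 and tcp port 8080'
# SYN seen but no SYN-ACK back -> nothing answering, or filtering inside the container
# SYN never seen               -> dropped before delivery; suspect the host firewall
```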

Assuming the firewall scenario: the fix we discussed (disable or adjust firewalld) should resolve it. After applying, test again with curl. If it now connects and gets a response, we’ve solved it.
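Two common approaches are sketched below: disable firewalld entirely (acceptable for a dedicated container host), or keep it and explicitly trust the Docker bridge and enable masquerading. The interface and zone names are the usual defaults and may differ on your system:

```bash
# Option A: dedicated container host, simply stop and disable firewalld
systemctl disable --now firewalld

# Option B: keep firewalld, but trust the Docker bridge and masquerade outbound traffic
firewall-cmd --permanent --zone=trusted --change-interface=docker0
firewall-cmd --permanent --zone=public --add-masquerade
firewall-cmd --reload
systemctl restart docker   # let Docker re-create its iptables rules after the reload
```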

If the issue was an application not listening (refused), then fix the application config: e.g., modify it to bind 0.0.0.0, correct the port, or ensure the service is actually running. This might involve editing config files or passing environment variables; some apps listen on 127.0.0.1 by default for security, which inside a container makes them unreachable from other containers.

  • Application-level firewall: It’s rare, but some containers might run an internal firewall (for example, if you containerized a full ufw-enabled Ubuntu or a CentOS image with firewalld inside). Containers are usually minimal and don’t run their own firewall, but if you built one from a full OS image (especially one running systemd), check for this. It is mentioned here only for completeness.

  • DNS confusion: Ensure curl is using the correct hostname. If you used container names that happen to match DNS entries from the old environment, the container might resolve the name via external DNS instead of Docker’s internal resolver. To avoid that, use the explicit container name or alias provided by Docker’s network; in compose, <service> normally resolves to the right container IP. If in doubt, use the IP directly as a test to bypass DNS: if the IP works but the name doesn’t, you have a DNS issue in the Docker network. A quick resolution check is sketched after this list.
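The check below compares what the client container resolves against what Docker actually assigned; app and web are placeholder service names, and getent may be missing from very minimal images (fall back to ping -c1 web to see the resolved IP):

```bash
# What does the client container resolve the target name to?
docker compose exec app getent hosts web

# What IP did Docker actually assign to the target container?
docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}} {{end}}' "$(docker compose ps -q web)"
```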

In practice, a step-by-step approach when facing this symptom:

  1. From container A, run ping B – it works.
  2. From container A, run curl B:port – it fails.
  3. From container A, run nc -zv B port (if netcat installed) to see if TCP can connect at all.
  4. Check container B’s health: docker compose logs B – any errors? Is it up and running the app? Does it log that it started HTTP server on port X?
  5. Check host firewall: iptables -L -n -v | grep <B's IP or port> to see if anything dropping.
  6. (If possible) Turn off firewalld and retry.
  7. Check container B’s process listening as mentioned.
  8. If still no luck, try a quick workaround: exec into container B and run a simple python3 -m http.server <port> (or install socat to echo) on that port as a test service, then curl it from A again. If that works, the issue is B’s own application not responding; if it still fails, the problem is definitely in the network plumbing (see the sketch below).
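A sketch of the step-8 workaround, assuming placeholder service names a and b, an arbitrary test port of 8080, python3 in B’s image, and curl in A’s image:

```bash
# In container B: serve something trivial on the test port
docker compose exec b python3 -m http.server 8080

# In a second shell, from container A:
docker compose exec a curl -v http://b:8080/
# Succeeds -> the network path is fine; focus on B's real application
# Fails    -> the problem is in the network plumbing (firewall, bridge, DNS)
```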

Given the strong evidence pointing to firewalld, we expect HTTP connectivity to start working once it is addressed, assuming there are no other underlying application problems.

4.2 Case: Volume Mounted but Application Can’t Read/Write (Permission Denied)

We covered this earlier under permissions, but let’s frame it as a troubleshooting scenario: suppose after migration one of your containers (say a database or a file storage service) starts but does not function correctly. Checking its logs, you find errors like “Could not open file XYZ: Permission denied” or “Database error: unable to write to disk”. This clearly indicates a permission issue on a mounted volume.

Steps to troubleshoot:

  • Identify which volume or path the error refers to. For example, a MySQL container might say cannot access /var/lib/mysql/mysql.ibd.
  • Verify that path is indeed a Docker volume or bind from host.
  • On the host, check ownership: ls -ln /var/lib/docker/volumes/yourvol/_data/… Who owns the files, and does that UID match the user the service runs as inside the container? (Check with docker exec <container> id mysql; the official MySQL image typically runs as UID 999, while RHEL-packaged MySQL uses UID 27.) If they don’t match, that is the problem.
  • Fix with an appropriate chown matching the UID the container expects. For example, if the container’s mysql user is UID 999, run chown -R 999:999 /var/lib/docker/volumes/yourvol/_data; the official Postgres image also runs as UID 999, and many web-app images run as UID 1000. (A container-side sketch follows this list.)
  • Restart the container and see if it can now access.
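A sketch using a throwaway container, so ownership is viewed and fixed from the container’s side; the volume name dbdata, the service name db, and UID/GID 999 are placeholders:

```bash
# Inspect ownership as seen through the volume mount
docker run --rm -v dbdata:/data alpine ls -ln /data

# Fix ownership in place, then restart the affected service
docker run --rm -v dbdata:/data alpine chown -R 999:999 /data
docker compose restart db
```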

One more angle: If the volume was originally on Ubuntu and perhaps had different ACLs or extended attributes, moving to RHEL (with XFS) might drop those. Usually, standard POSIX perms suffice, but if you had POSIX ACLs set on ext4, those wouldn’t have copied unless specifically preserved. You might need to re-set any special ACLs (rare in containers, but mentioning just in case).

Additionally, if the service inside the container gave up starting because of the permission error, it may be stuck in a crash loop. After fixing permissions, run docker compose restart <service> or bring the stack down and up again.

4.3 Case: Container Service Unresponsive or Crashing After Migration

Consider a scenario where a container appears to run (no obvious errors on start) but doesn’t function normally – maybe it responds slowly or not at all to requests, or exits unexpectedly after some time:

  • This could be an incompatible data format if a database or service version changed. For example, if moving to a new image inadvertently upgraded a database, the old data files might need migration; the service may detect the mismatch and refuse to start, or run in a degraded state. Check container logs for messages about data upgrades or errors opening files (beyond permission errors).
  • Memory/CPU limits: If the old host had no resource limits but someone added them on the new host (through compose or a daemon default), the container might be OOM-killed or throttled. If a container hits a memory limit, the kernel kills it and it restarts; during that time it is unreachable. OOM events show up in dmesg (kernel OOM-killer messages) and in docker inspect as OOMKilled: true (see the check sketched after this list). Ensure any resource constraints are appropriate for your workload.
  • Another subtle one: Time zone or locale differences. If your containers rely on the host for time zone (most don’t, but if you bind mount /etc/localtime or something), a difference could possibly break things expecting consistent time. Probably not causing HTTP failure directly, but e.g., if JWT tokens or auth sessions are time-sensitive, a time sync issue could cause requests to be rejected. Verify host time and container time are correct (use date inside).
  • External dependency changes: If the containers need to talk to something outside themselves (like an external S3 or auth service), ensure the new environment can reach those. A firewall could be blocking egress (less likely if firewalld off, but maybe corporate network rules?).
  • Proxmox virtualization quirks: If the RHEL VM on Proxmox has resource limits, or if Proxmox’s virtio drivers misbehave, you might see network or disk performance problems. For instance, if a virtio RNG device is not attached, containers can block waiting for entropy (relevant only if something like SSL key generation hangs). Ensure the VM has adequate resources and the usual guest tooling installed (such as the QEMU guest agent, though it does not directly affect Docker).
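To quickly check whether a container has been OOM-killed or is restarting (the container name app is a placeholder):

```bash
# Was the last exit caused by the kernel OOM killer, and how often has it restarted?
docker inspect -f 'OOMKilled={{.State.OOMKilled}} ExitCode={{.State.ExitCode}} Restarts={{.RestartCount}}' app

# Kernel-side evidence of the OOM killer firing
dmesg | grep -iE 'killed process|out of memory'

# Current usage versus any configured limits
docker stats --no-stream
```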

4.4 Case: Issues with Services Handling File Uploads or Large Data Transfers

This case was explicitly called out: services that accept curl POSTs and file uploads. We already touched on potential causes, notably the firewalld blocking scenario (Docker and firewalld mess in CentOS 7 | by Azhagarasu A | Medium). Let’s compile a targeted checklist for this kind of issue:

  • Firewall (MSS/fragmentation): Large file uploads could be split into many packets. A firewall might allow small packets but choke on large streams if not properly configured. Firewalld interfering with NAT or connection tracking could manifest more on long-lived connections (like a big upload) than on quick small ones. Ensure the firewall is not dropping these; after our earlier fix, retest file uploads.
  • Proxy or Web Server Buffers: If the service consists of, say, Nginx proxying to an app, and Nginx’s client_max_body_size is too low, it will reject large uploads with an HTTP error (413 Request Entity Too Large). That wouldn’t exactly be “ping works but HTTP doesn’t” (you’d get a specific error), but check your web server configs if applicable. Similarly, if some app server has a request size limit, that could give errors.
  • Timeouts: Did the migration introduce any new network latency? Probably negligible if everything is local, but if the upload goes to a service that in turn uploads to S3 or similar, a different internet route on the new host could introduce timeouts. Check application logs for new timeout errors.
  • Disk Throughput: Writing a large upload to disk – if the new disk is slower or near full, the app might choke. XFS on local disk should be fine, likely faster than S3 storage from before, but check disk I/O just in case (iostat etc).
  • Testing manually: You can use curl -T file http://container/upload or similar to test an upload. Also test downloading a file of similar size to ensure large responses work too (see the sketch below).
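A minimal smoke test, assuming curl and dd are available in the client image, the client service is named a, and http://b:8080/upload and the download URL are hypothetical endpoints:

```bash
# Generate a roughly 100 MB file and push it through the suspect path
docker compose exec a sh -c 'dd if=/dev/zero of=/tmp/big.bin bs=1024 count=102400 && curl -v -T /tmp/big.bin http://b:8080/upload'

# Also exercise a large response in the other direction
docker compose exec a curl -sv -o /dev/null http://b:8080/some-large-file
```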

In the Medium example, after disabling firewalld, file uploads worked normally (Docker and firewalld mess in CentOS 7 | by Azhagarasu A | Medium). So we expect similar.

4.5 Case: Migrating Data Between Different Linux Distributions

While not a “case” of failure per se, it’s worth summarizing known general issues when migrating Docker data from Ubuntu to RHEL:

  • Line Endings or File Modes: Not usually an issue between ext4 and XFS, but between Windows and Linux or if copying via git, etc., sometimes text files could get DOS line endings which could break scripts. Ensure your scripts (entrypoint, etc.) didn’t accidentally get ^M line endings in transfer (if you tarred things this wouldn’t happen).

  • OS Specific Configuration: Ubuntu’s Docker setup might have certain defaults different from RHEL’s (like cgroup driver or apparmor vs selinux). On RHEL, AppArmor is not typically in use (that’s Ubuntu’s way of restricting containers). If on Ubuntu your containers had AppArmor profiles applied (by Docker’s default, it applies a profile named docker-default), on RHEL AppArmor isn’t running, so that restriction is gone. That should generally mean things are more open now, not causing failure. If anything, a container doing something borderline might now succeed (but that’s fine).

  • Case Sensitivity: Both Ubuntu’s ext4 and RHEL’s XFS are case-sensitive by default, so file-name case behavior shouldn’t change. XFS has no case-insensitive mount option (ext4 has one, but it is rarely used), so no difference is likely here.

  • Architecture: Ensure you didn’t unintentionally move to a different CPU architecture (both likely x86_64, but if, say, the new host was ARM – not in this case – that would obviously break containers built for x86).

  • Kernel differences: Ubuntu vs RHEL kernels might have different defaults for certain sysctls. For example, RHEL might have more conservative TCP settings (like smaller TCP backlog or lower conntrack max). If your application is very network-heavy, you might need to tune those. Check /etc/sysctl.conf or differences in /proc/sys if relevant. One that sometimes matters: fs.file-max or fs.inotify.max_user_watches – if a container uses many file descriptors or watches (some apps like Node or etcd do), the limits on host can matter. If you encounter errors about too many open files or inotify limits, consider raising those on RHEL to match what Ubuntu had.

  • Systemd nuances: RHEL’s systemd may treat cgroups slightly differently. Historically, stopping the Docker service caused systemd to send SIGTERM to container processes as well (because they lived in the docker.service cgroup); Docker’s live-restore can mitigate that. If you rarely stop Docker, this is not an issue, but be aware that systemctl stop docker stops all containers by design. Ubuntu behaves the same way, so this is not a migration difference, but it still catches people by surprise.

  • Volume plugins: If your old environment used a volume plugin for S3 (for example, an NFS or cloud-storage driver), and you have now moved the data to local volumes, ensure the compose file no longer references the old driver. If a volume is defined with a driver that is not installed on the RHEL host, Docker will typically fail to create or use it rather than silently falling back to the local driver, or you may end up with a fresh empty volume under a new definition, which looks like data loss until fixed. Ideally, the migration should adjust the volume definitions to use the local driver (the default). Double-check the compose YAML for any driver: key under volumes: (for example, rexray/s3) and remove the driver specification or set it explicitly to local; a sketch follows.
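A sketch of the compose-level change; the volume name appdata and the plugin name are placeholders:

```yaml
volumes:
  appdata:
    # Old host (hypothetical): a plugin that is not installed on the RHEL box
    # driver: rexray/s3
    # New host: omit "driver" entirely, or state the default explicitly
    driver: local
```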

5. Recommendations and Best Practices

In this section, we distill the troubleshooting into actionable recommendations to ensure a stable Docker/Compose setup on RHEL:

  • Verify Filesystem Compatibility: Always ensure your Docker storage meets the driver requirements. For XFS + overlay2, d_type must be enabled (Docker on CentOS 7 with xfs filesystem can cause trouble when d_type is not supported | by Khushal Bisht | Medium). Check docker info for “Supports d_type: true” (Docker on CentOS 7 with xfs filesystem can cause trouble when d_type is not supported | by Khushal Bisht | Medium). If not, address it immediately by migrating to a compatible filesystem or reformatting (Docker volume mapping file corruption when filesystem is xfs and storage driver is overlay - Stack Overflow). This prevents a whole class of obscure errors.

  • Maintain Docker and OS Configuration Parity: Match key Docker daemon settings between old and new environments (storage driver, data-root path, etc.). Differences can lead to unexpected behavior. For example, if SELinux is enabled, use :Z on volume binds (or disable SELinux only if absolutely necessary, as was done here; configuring it properly is preferable). If firewalld is present, either disable it or integrate Docker’s network into it (e.g., trust docker0) to avoid Docker network rules being wiped (Docker and firewalld mess in CentOS 7 | by Azhagarasu A | Medium).

  • Monitor and Configure the Firewall: If running firewalld, be mindful that restarting it can break Docker networking (Docker and firewalld mess in CentOS 7 | by Azhagarasu A | Medium). Either stop using firewalld in favor of raw iptables (suitable for a dedicated container host) (Docker and firewalld mess in CentOS 7 | by Azhagarasu A | Medium), or configure it using persistent rules to accept Docker traffic (masquerading and forwarding). This will save you from headaches where containers suddenly cannot talk to each other or the internet.

  • Check Networking Basics: After any migration or major change, do a quick audit:

    • docker network ls and docker network inspect to ensure the expected networks and subnets.
    • iptables -L -v -n (or nft list ruleset if using nft) to ensure Docker’s rules are present (look for DOCKER chain, rules permitting traffic from docker0).
    • Ping and test connectivity between containers, and from container to host and host to container for any needed paths.
    • Set net.bridge.bridge-nf-call-iptables=1 via sysctl if not already, to allow iptables to see bridged traffic (usually handled, but just in case).
    • If issues, use tools like curl, wget, dig/nslookup, and tcpdump inside containers for pinpointing.
  • Ensure Proper Volume Migration: When copying volume data between hosts, preserve ownership and permissions. Use tar or rsync with -a. After moving, verify the UID/GID of critical files and folders matches the container’s expectations (containers - docker: migrating volumes with correct permissions - Server Fault). If not, correct them before starting the service to avoid runtime errors. This includes database data directories, uploaded file storage directories, etc. A quick docker run --rm -v yourvol:/data alpine ls -lR /data can show you the ownership from container perspective.

  • Leverage Health Checks: Add healthcheck directives to important services in your compose file. This lets Docker track whether a container’s application is actually responding. Coupled with depends_on: condition: service_healthy, you can ensure dependent containers wait until the service is really up, which avoids many startup-timing issues where one container tries to use another too early. A minimal example appears after this list.

  • Logging and Monitoring: Keep an eye on Docker’s logs (journalctl -u docker) and container logs (docker compose logs). After migration, some warnings might only appear there; for example, Docker might log a warning about falling back to an alternate cgroup driver or about AppArmor (which is not present on RHEL, so Docker may warn but continue). Container application logs will be your first indicator if something isn’t right internally (stack traces, etc.). Set up log rotation for container logs to prevent the disk from filling, if you haven’t already.

  • Resource Limits and Performance Tuning: Now that you are on a new system, adjust resources if needed. Ensure the VM or hardware has enough CPU and RAM for all containers. If you notice high I/O on XFS, measure it, and only if performance is a real bottleneck consider experimenting with ext4 for Docker’s data directory (otherwise XFS is fine and is the RHEL default). On XFS, defragmentation (xfs_fsr) or allocating more space can also help if performance degrades over time, though this is rarely needed for Docker volumes, which consist mostly of small files.

  • Documentation and Automation: Document the configuration changes made (like disabling SELinux, firewalld adjustments, any sysctl tweaks) for future reference or for others who might maintain this system. Consider automating these settings via Ansible or scripts so that if you rebuild the host or scale out to more nodes, you apply consistent fixes (for example, an Ansible task to ensure ftype=1 on Docker storage or to configure firewalld for Docker).
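A minimal healthcheck/depends_on sketch; service names, images, the port, and the /health endpoint are placeholders, and this particular test command assumes curl exists in the web image:

```yaml
services:
  web:
    image: example/web:latest          # placeholder image
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 10s
      timeout: 3s
      retries: 5
      start_period: 30s
  worker:
    image: example/worker:latest       # placeholder image
    depends_on:
      web:
        condition: service_healthy
```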

With these practices, you can prevent many of the edge cases from occurring or quickly catch them if they do.

Conclusion

Migrating Docker and Docker Compose setups across different environments (Ubuntu to RHEL, cloud-backed storage to local XFS, etc.) is a complex task that can surface a wide range of issues. In this guide, we covered filesystem-related pitfalls (like the XFS d_type requirement for overlay2 (Docker failed to initialize due to xfs filesystem format on Linux in Bitbucket Cloud | Bitbucket Cloud | Atlassian Support)), networking edge cases (firewalld conflicts causing ping to work but HTTP to fail (Docker and firewalld mess in CentOS 7 | by Azhagarasu A | Medium), kernel module issues (Ping (icmp) work but not tcp · Issue #14 · qoomon/docker-host · GitHub), DNS and MTU considerations), storage driver quirks (overlay2 behavior on XFS vs ext4 (Why is my XFS Block IO insane compared to my EXT4 Block IO? - General - Docker Community Forums)), volume permission problems (UID/GID mismatches leading to access denied (containers - docker: migrating volumes with correct permissions - Server Fault)), daemon and system settings (docker daemon.json configurations, SELinux, cgroup, etc.), and application-level concerns (service binding, health checks, startup order).

Each section provided detailed analysis and actionable solutions or checks, with references to known issues and official recommendations. By systematically going through these areas, a technically advanced user can troubleshoot and resolve the kinds of problems observed in the migration scenario.

Next Steps: Apply the recommended fixes (particularly around firewall and permissions), then re-test the inter-container HTTP communication. It’s advisable to test each service in isolation (if possible) and then as an integrated stack to ensure everything functions as expected. Monitor the system over time for any lingering warnings or performance issues (e.g., use docker stats, docker inspect for resource usage, etc.). With the information in this guide, you should be equipped to pinpoint any residual issues or confidently declare the migration a success once all tests pass.

Lastly, as an ongoing practice, keep Docker and the OS updated to benefit from the latest fixes – for example, newer Docker releases might offer better nftables integration or improved logging, and RHEL updates could include kernel fixes that improve overlay2 on XFS. Always review release notes for changes that might affect your setup.

By covering all these bases, you can ensure a robust Docker environment on RHEL that runs your migrated containers smoothly and reliably, while being prepared to tackle any edge-case issues that arise.
