Docker Internals: Isolation, Resource Control, and Kernel Mechanics
How Docker uses NAMESPACES, CGROUP and more for isolation
Introduction
Docker containers are more than just lightweight alternatives to virtual machines—they are a sophisticated abstraction built on decades of Linux kernel innovation. To truly grasp how Docker achieves isolation and efficiency, we must dissect the kernel primitives it leverages: namespaces, cgroups, capabilities, and filesystem layering. This deep dive will unravel these components, their interactions, and their role in shaping containerized environments.
Linux Namespaces: The Foundation of Isolation
Namespaces are the cornerstone of container isolation. They partition global system resources (like processes, networks, and filesystems) into distinct views, ensuring each container operates in its own virtualized environment. Docker uses six primary namespaces, each targeting a specific resource type:
a. PID Namespace (Process Isolation)
Purpose: Isolates process IDs. A container’s init process (PID 1) is unaware of processes outside its namespace.
Mechanics:
The host sees container processes with regular PIDs (e.g.,
PID 5000
), but inside the container they start atPID 1
.ProcFS (
/proc
) inside the container reflects only its own processes.E.g.
# On the host: ps aux | grep nginx # Shows PID 5000 # Inside the container: ps aux # Shows PID 1 (nginx)
Limitation: Without additional tools (e.g.,
--pid=host
), a container cannot inspect or signal host processes.
b. Mount Namespace (Filesystem Isolation)
Purpose: Provides an isolated filesystem tree. Containers see their own root (
/
) and mounted volumes.Mechanics:
Docker uses OverlayFS to stack read-only image layers and a writable container layer.
Each container’s filesystem is a union of:
Lowerdir: Read-only layers from the Docker image.
Upperdir: Writable layer for container-specific changes.
Merged: Unified view presented to the container.
Example:
mount -t overlay overlay -o lowerdir=/lower,upperdir=/upper,workdir=/work /merged
Use
docker inspect <container-id>
to view the OverlayFS layers.
c. Network Namespace (Network Virtualization)
Purpose: Each container gets its own network stack (interfaces, IPs, routing tables).
Mechanics:
Docker creates a virtual Ethernet pair (veth): One end in the container (
eth0
), the other on the host (attached todocker0
bridge).Network traffic is routed through iptables rules for NAT and port forwarding.
Example:
# View host-side veth interface: ip link show docker0 # Inside the container: ip addr show eth0 # Shows container-specific IP.
d. UTS Namespace (Hostname Isolation)
Purpose: Isolates the hostname and domain name.
Example:
docker run --hostname mycontainer alpine hostname # Output: "mycontainer"
e. IPC Namespace (Inter-Process Communication)
Purpose: Restricts shared memory (SHM), semaphores, and message queues to container-local processes.
Use Case: Prevents conflicts when multiple containers use POSIX shared memory.
f. User Namespace (UID/GID Remapping)
Purpose: Maps container users to non-root users on the host for security.
Mechanics:
A container running as "root" (UID 0) can map to a non-privileged host UID (e.g., 1000).
Enabled via
/etc/subuid
and/etc/subgid
configuration.
Example:
docker run --user 1000:1000 alpine id # Runs as UID 1000 inside and outside.
Control Groups (cgroups): Resource Governance
Cgroups regulate resource usage (CPU, memory, I/O) and prevent noisy neighbors. Docker uses cgroups v1/v2 to enforce limits and prioritize resources.
Key cgroup Subsystems
Subsystem | Purpose | Docker Flag Example |
cpu | Allocate CPU time | --cpus=2 |
memory | Limit RAM and swap | --memory=1g |
blkio | Throttle disk I/O | --device-read-bps=/dev/sda:1mb |
pids | Limit process count | --pids-limit=100 |
devices | Allow/deny device access | --device=/dev/ttyUSB0 |
How Docker Configures cgroups
Directory Structure: Each container’s cgroups are under
/sys/fs/cgroup/<subsystem>/docker/<container-id>
.Example: Setting a memory limit:
echo 500000000 > /sys/fs/cgroup/memory/docker/<container-id>/memory.limit_in_bytes
Out-of-Memory (OOM) Killer: If a container exceeds its memory limit, the kernel terminates processes (often the container itself).
cgroups v2 Unification
Unified Hierarchy: cgroups v2 merges all controllers into a single hierarchy, simplifying resource management.
Docker Support: Enabled via
"cgroupfs"
or"systemd"
drivers in/etc/docker/daemon.json
.
Security Hardening: Beyond Isolation
Docker layers security mechanisms to minimize attack surfaces:
a. Linux Capabilities
Purpose: Split root privileges into granular units (e.g.,
CAP_NET_ADMIN
for network config).Docker’s Defaults: Drops risky capabilities like
CAP_SYS_ADMIN
orCAP_NET_RAW
.Customization:
docker run --cap-add NET_ADMIN --cap-drop CHOWN ...
b. Seccomp (Secure Computing Mode)
Purpose: Restrict system calls (syscalls) a container can make.
Default Profile: Docker blocks ~44 dangerous syscalls (e.g.,
reboot
,syslog
).Custom Profiles: Override with
--security-opt seccomp=profile.json
.
c. AppArmor/SELinux
Mandatory Access Control (MAC): Define fine-grained policies for file/network access.
Example: Docker’s default AppArmor profile prevents containers from writing to
/proc
or/sys
.
The Container Filesystem: Layers and Storage Drivers
Docker’s efficient image management relies on copy-on-write (CoW) filesystems:
OverlayFS in Action
Layers:
Base Layer (Lowerdir): Read-only image layers (e.g.,
alpine:latest
).Container Layer (Upperdir): Writable layer for runtime changes.
Merged View: Unified filesystem presented to the container.
Example: Modifying a file in the container copies it to the Upperdir, leaving the Lowerdir intact.
Alternative Storage Drivers
aufs: Original Docker driver (deprecated).
btrfs/zfs: Advanced CoW with snapshot support.
devicemapper: For block storage in enterprise environments.
Docker’s Runtime Architecture
When you run docker run
, a chain of tools collaborates:
Docker CLI: Parses the command.
Docker Engine (dockerd): Manages the API and image distribution.
containerd: Handles container lifecycle (start/stop/pull).
runc: OCI-compliant runtime that creates namespaces, cgroups, and starts the process.
OCI (Open Container Initiative) Spec
Standardization: Defines container image formats and runtime rules.
runc: Reference implementation of OCI, used by Docker to spawn containers.
Advanced Use Cases
Rootless Docker: Runs containers as non-root users via user namespaces.
Custom Namespaces: Share namespaces between containers (e.g.,
--net=container:web
).cgroup v2 Resource Limits:
# Limit CPU usage using cgroup v2 docker run --cpus=2 --memory=1g ...
Conclusion
Docker’s elegance lies in its clever use of Linux kernel primitives. Namespaces isolate, cgroups govern, and layered filesystems optimize storage. By understanding these internals, developers can debug issues, optimize performance, and harden security. Want to go further? Experiment with custom cgroup policies, explore rootless containers, or write your own Seccomp profiles!
Further Exploration
If you’d like a visual walkthrough, check out this excellent YouTube video:
Cgroups, namespaces, and beyond: what are containers made from?