Introduction

Docker containers are more than just lightweight alternatives to virtual machines—they are a sophisticated abstraction built on decades of Linux kernel innovation. To truly grasp how Docker achieves isolation and efficiency, we must dissect the kernel primitives it leverages: namespaces, cgroups, capabilities, and filesystem layering. This deep dive will unravel these components, their interactions, and their role in shaping containerized environments.

Linux Namespaces: The Foundation of Isolation

Namespaces are the cornerstone of container isolation. They partition global system resources (like processes, networks, and filesystems) into distinct views, ensuring each container operates in its own virtualized environment. Docker uses six primary namespaces, each targeting a specific resource type:

a. PID Namespace (Process Isolation)

Purpose: Isolates process IDs. A container’s init process (PID 1) is unaware of processes outside its namespace.

Mechanics:
The host sees container processes with regular PIDs (e.g., PID 5000), but inside the container they start at PID 1.
ProcFS (/proc) inside the container reflects only its own processes.

E.g.

  # On the host:
  ps aux | grep nginx  # Shows PID 5000
  # Inside the container:
  ps aux  # Shows PID 1 (nginx)

Limitation: Without additional tools (e.g., --pid=host), a container cannot inspect or signal host processes.

b. Mount Namespace (Filesystem Isolation)

Purpose: Provides an isolated filesystem tree. Containers see their own root (/) and mounted volumes.
Mechanics:
- Docker uses OverlayFS to stack read-only image layers and a writable container layer.
- Each container’s filesystem is a union of:
  - Lowerdir: Read-only layers from the Docker image.
  - Upperdir: Writable layer for container-specific changes.
  - Merged: Unified view presented to the container.
- Example:
```
  mount -t overlay overlay -o lowerdir=/lower,upperdir=/upper,workdir=/work /merged
```
  Use docker inspect <container-id> to view the OverlayFS layers.

c. Network Namespace (Network Virtualization)

Purpose: Each container gets its own network stack (interfaces, IPs, routing tables).
Mechanics:
- Docker creates a virtual Ethernet pair (veth): One end in the container (eth0), the other on the host (attached to docker0 bridge).
- Network traffic is routed through iptables rules for NAT and port forwarding.

Example:

  # View host-side veth interface:
  ip link show docker0
  # Inside the container:
  ip addr show eth0  # Shows container-specific IP.

d. UTS Namespace (Hostname Isolation)

Purpose: Isolates the hostname and domain name.

Example:

  docker run --hostname mycontainer alpine hostname  # Output: "mycontainer"

e. IPC Namespace (Inter-Process Communication)

Purpose: Restricts shared memory (SHM), semaphores, and message queues to container-local processes.
Use Case: Prevents conflicts when multiple containers use POSIX shared memory.

f. User Namespace (UID/GID Remapping)

Purpose: Maps container users to non-root users on the host for security.
Mechanics:
- A container running as "root" (UID 0) can map to a non-privileged host UID (e.g., 1000).
- Enabled via /etc/subuid and /etc/subgid configuration.

Example:

  docker run --user 1000:1000 alpine id  # Runs as UID 1000 inside and outside.

Control Groups (cgroups): Resource Governance

Cgroups regulate resource usage (CPU, memory, I/O) and prevent noisy neighbors. Docker uses cgroups v1/v2 to enforce limits and prioritize resources.

Key cgroup Subsystems

Subsystem	Purpose	Docker Flag Example
cpu	Allocate CPU time	`--cpus=2`
memory	Limit RAM and swap	`--memory=1g`
blkio	Throttle disk I/O	`--device-read-bps=/dev/sda:1mb`
pids	Limit process count	`--pids-limit=100`
devices	Allow/deny device access	`--device=/dev/ttyUSB0`

How Docker Configures cgroups

Directory Structure: Each container’s cgroups are under /sys/fs/cgroup/<subsystem>/docker/<container-id>.

Example: Setting a memory limit:

  echo 500000000 > /sys/fs/cgroup/memory/docker/<container-id>/memory.limit_in_bytes

Out-of-Memory (OOM) Killer: If a container exceeds its memory limit, the kernel terminates processes (often the container itself).

cgroups v2 Unification

Unified Hierarchy: cgroups v2 merges all controllers into a single hierarchy, simplifying resource management.
Docker Support: Enabled via "cgroupfs" or "systemd" drivers in /etc/docker/daemon.json.

Security Hardening: Beyond Isolation

Docker layers security mechanisms to minimize attack surfaces:

a. Linux Capabilities

Purpose: Split root privileges into granular units (e.g., CAP_NET_ADMIN for network config).
Docker’s Defaults: Drops risky capabilities like CAP_SYS_ADMIN or CAP_NET_RAW.

Customization:

  docker run --cap-add NET_ADMIN --cap-drop CHOWN ...

b. Seccomp (Secure Computing Mode)

Purpose: Restrict system calls (syscalls) a container can make.
Default Profile: Docker blocks ~44 dangerous syscalls (e.g., reboot, syslog).
Custom Profiles: Override with --security-opt seccomp=profile.json.

c. AppArmor/SELinux

Mandatory Access Control (MAC): Define fine-grained policies for file/network access.
Example: Docker’s default AppArmor profile prevents containers from writing to /proc or /sys.

The Container Filesystem: Layers and Storage Drivers

Docker’s efficient image management relies on copy-on-write (CoW) filesystems:

OverlayFS in Action

Layers:
- Base Layer (Lowerdir): Read-only image layers (e.g., alpine:latest).
- Container Layer (Upperdir): Writable layer for runtime changes.
- Merged View: Unified filesystem presented to the container.
Example: Modifying a file in the container copies it to the Upperdir, leaving the Lowerdir intact.

Alternative Storage Drivers

aufs: Original Docker driver (deprecated).
btrfs/zfs: Advanced CoW with snapshot support.
devicemapper: For block storage in enterprise environments.

Docker’s Runtime Architecture

When you run docker run, a chain of tools collaborates:

Docker CLI: Parses the command.
Docker Engine (dockerd): Manages the API and image distribution.
containerd: Handles container lifecycle (start/stop/pull).
runc: OCI-compliant runtime that creates namespaces, cgroups, and starts the process.

OCI (Open Container Initiative) Spec

Standardization: Defines container image formats and runtime rules.
runc: Reference implementation of OCI, used by Docker to spawn containers.

Advanced Use Cases

Rootless Docker: Runs containers as non-root users via user namespaces.
Custom Namespaces: Share namespaces between containers (e.g., --net=container:web).

cgroup v2 Resource Limits:

  # Limit CPU usage using cgroup v2
  docker run --cpus=2 --memory=1g ...

Conclusion

Docker’s elegance lies in its clever use of Linux kernel primitives. Namespaces isolate, cgroups govern, and layered filesystems optimize storage. By understanding these internals, developers can debug issues, optimize performance, and harden security. Want to go further? Experiment with custom cgroup policies, explore rootless containers, or write your own Seccomp profiles!

Further Exploration

If you’d like a visual walkthrough, check out this excellent YouTube video:

Cgroups, namespaces, and beyond: what are containers made from?

Docker Internals: Isolation, Resource Control, and Kernel Mechanics

How Docker uses NAMESPACES, CGROUP and more for isolation

Table of contents

Introduction

Linux Namespaces: The Foundation of Isolation

a. PID Namespace (Process Isolation)

b. Mount Namespace (Filesystem Isolation)

c. Network Namespace (Network Virtualization)

d. UTS Namespace (Hostname Isolation)

e. IPC Namespace (Inter-Process Communication)

f. User Namespace (UID/GID Remapping)

Control Groups (cgroups): Resource Governance

Key cgroup Subsystems

How Docker Configures cgroups

cgroups v2 Unification

Security Hardening: Beyond Isolation

a. Linux Capabilities

b. Seccomp (Secure Computing Mode)

c. AppArmor/SELinux

The Container Filesystem: Layers and Storage Drivers

OverlayFS in Action

Alternative Storage Drivers

Docker’s Runtime Architecture

OCI (Open Container Initiative) Spec

Advanced Use Cases

Conclusion

Further Exploration