GOAL: This project sums up the knowledge you have learned about operating systems: you will build a small container runtime that can run a containerized process on Linux.
Throughout the course, we have been using Docker and Dev Containers to run our code. But what exactly is a container?
A container is a lightweight and isolated execution environment that encapsulates an application and its dependencies. It provides a consistent and reproducible environment across different systems, allowing applications to run reliably regardless of the underlying infrastructure.
Containers are created from container images, which are self-contained packages that include the application code, runtime, system tools, libraries, and configuration files required for the application to run.
A container runtime is responsible for managing the lifecycle of containers. It provides an interface between the container and the host operating system, orchestrating the necessary resources and ensuring isolation and security within the container environment.
One of the most important features of a container runtime is isolation: whatever happens inside a container should not affect the host system or other containers.
Modern container runtimes provide isolation for many OS abstractions, such as CPU, memory, network, etc. In this project, we focus on two essential abstractions: processes and files.
Our container runtime should provide isolation for processes. For example, the processes running on the host should not be visible to the processes inside the container.
Let's check how Docker isolates processes. We use the `ps` command to see information about running processes. If you execute `ps -A`, it lists all processes that are running on the system.
PID TTY TIME CMD
1 ? 06:30:08 systemd
2 ? 00:00:44 kthreadd
3 ? 00:00:00 rcu_gp
4 ? 00:00:00 rcu_par_gp
6 ? 00:00:00 kworker/0:0H-kblockd
...
This is the result of `ps -A` on my server. We can see many processes are running.
Let us create a new Docker container and execute the same command inside the container:
$ docker run --rm -it alpine
$ ps -A
PID USER TIME COMMAND
1 root 0:00 /bin/sh
7 root 0:00 ps -A
The first command creates a new Docker container using the `alpine` image. It opens a shell inside the container, so we run `ps -A`. Even though many processes are running on the host system, they are not visible inside the container.
In Linux, the isolation of processes is achieved using the PID namespace. A process ID (PID) is a unique number that identifies a running process. Every process belongs to a PID namespace and it can only see processes in the same PID namespace.
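To get a feel for how a PID namespace is created, here is a minimal, self-contained sketch (not part of the project skeleton) that uses `clone` with `CLONE_NEWPID`. When run as root on Linux, the child reports its PID as 1 because it is the first process in its new namespace.

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

static char stack[1024 * 1024]; /* stack for the child; clone takes its top */

static int child(void *arg) {
    /* Inside the new PID namespace, this process sees itself as PID 1. */
    printf("child PID inside the namespace: %d\n", (int)getpid());
    return 0;
}

int main(void) {
    /* CLONE_NEWPID places the child in a fresh PID namespace (requires root). */
    pid_t pid = clone(child, stack + sizeof(stack), SIGCHLD | CLONE_NEWPID, NULL);
    if (pid < 0) { perror("clone"); return 1; }
    waitpid(pid, NULL, 0);
    return 0;
}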
Our container runtime also supports isolating the filesystem. Each container has its own filesystem and cannot affect the host or other containers.
Let's check how Docker provides filesystem isolation. We use the same `alpine` image.
In one terminal, let us create a container and create a file `foo` in its root directory.
$ docker run --rm -it alpine
$ echo foo > foo
$ ls /
bin etc home media opt root sbin sys usr
dev foo lib mnt proc run srv tmp var
Open another terminal, create a new container, and run `ls`.
$ docker run --rm -it alpine
$ ls /
bin etc lib mnt proc run srv tmp var
dev home media opt root sbin sys usr
Even though both containers are created from the same image, `foo` is not visible in the new container.
One way to achieve this is to use the Overlay Filesystem. The overlay filesystem is a type of union filesystem that allows multiple directories to be mounted together, presenting a single unified view. It provides a way to overlay a read-write filesystem on top of a read-only filesystem, creating a combined view that appears as a single coherent filesystem.
When a file or directory is accessed, the Overlay filesystem looks for it in the topmost layer first. If the file is found, it is returned. If not, the filesystem searches the lower layers in a specific order until it locates the file. This allows modifications to be made to the topmost layer, while the lower layers remain unchanged. Changes made to the topmost layer are stored separately, without modifying the underlying read-only layers.
Let's see how the overlay filesystem works with a simple example.
To use the overlay filesystem, we need three directories: `lower`, `upper`, and `work`. `lower` will be a read-only directory that provides an "image", `upper` will store all changes on top of `lower`, and `work` is used by the overlay filesystem as a workspace.
First, we use the following command to create the directories. `merged` is the directory where we will mount the overlay filesystem.
mkdir lower upper work merged
Then, we create a read-only file in the lower directory.
echo "this is in lower" > lower/foo
Now, we are ready to create a new overlay filesystem. Use the following command to create an overlay filesystem and mount it at `merged`.
mount -t overlay overlay -o lowerdir=lower,upperdir=upper,workdir=work merged
Now the overlay filesystem is mounted at `merged`. It provides the unified view of `lower` and `upper`.
$ ls merged
foo
$ cat merged/foo
this is in lower
Let's make a change to the file in the overlay filesystem.
echo "new foo" > merged/foo
Because `lower/foo` is read-only, it cannot be modified. Instead, the overlay filesystem creates `upper/foo` with the updated content. Because `upper/foo` now exists, `merged/foo` refers to the updated content at `upper/foo`.
$ cat merged/foo
new foo
$ cat upper/foo
new foo
$ cat lower/foo
this is in lower
To provide an isolated filesystem for each container, we make an overlay filesystem. The `lower` directory is the container image that stores all files and directories needed for that container. Each container gets a unique `upper` directory, so changes by the container do not affect other containers.
The goal of this project is to complete `container.c`. It is capable of creating a container from an image and executing a command inside the container.
In `container.c`, an image is a directory under `./images` that stores all the files and directories required for the system. You can think of an image directory as a snapshot of the system root directory.
The easiest way to create an image is to use the `docker export` command. Let us create an image directory from the `alpine` Docker image.
First, we create a Docker container using `docker run --rm -it alpine sh`. This opens a shell inside the newly created container.
Second, we need to get the ID of the container. Open a new terminal and run `docker ps` to see the list of running Docker containers.
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
f1cf18783484 alpine "sh" 2 minutes ago Up 2 minutes boring_sanderson
Find the container and copy the container ID (`f1cf18783484` in this case).
Then, we run `docker export {container ID} > alpine.tar` to create a tarball of the image.
docker export f1cf18783484 > alpine.tar
Finally, we extract the files in the tarball to `./images/{image name}` using the following commands.
$ mkdir images/alpine
$ tar -xf alpine.tar -C images/alpine
$ ls images/alpine
bin dev etc home lib media mnt opt proc root run sbin srv sys tmp usr var
Now, we have the image directory called `alpine`. We use this identifier to specify the image.
`container` takes three or more arguments.
$ ./container
Usage: ./container [ID] [IMAGE] [CMD]...
- The first argument (ID) specifies the unique ID of the container.
  - Docker assigns this at random, but we require the user to provide one.
  - The ID can be at most 16 characters (`CONTAINER_ID_MAX`).
- The second argument (IMAGE) specifies the image to create a container from.
  - `./images/{IMAGE}` must exist and store all files required for this container.
- The rest of the arguments specify the command to run inside the container.
  - There can be more than one, as the user might provide options.
For example, `./container my-container alpine echo "hello world"` will
- Create a container with ID `my-container`
- Use the image located at `./images/alpine`
- Execute `echo "hello world"` inside the container
sudo ./container my-container alpine echo "hello world"
hello world
You will need to complete two functions in `container.c`: `main` and `container_exec`.
`main` is the entry point of the command-line interface. It needs to parse the command-line arguments (`argv`) and create a child process by calling `clone` with appropriate parameters.
int clone_flags = SIGCHLD | CLONE_NEWNS | CLONE_NEWPID;
int pid = clone(container_exec, &child_stack, clone_flags, &container);
`clone` works similarly to the `fork` system call and creates a child process. The child process executes the `container_exec` function, which takes `container` as its argument, just like how we passed arguments to a new thread with `pthread_create`. By passing these three flags, the child process will have separate PID and mount namespaces that provide isolation.
Add fields to the `container` struct and fill in the values in `main` so `container_exec` will have enough information to create a container from an image and execute the command.
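The exact fields are up to you. As a rough sketch (the field names here are illustrative, not required), the struct might carry the container ID, the image path, and the command:

/* Illustrative layout; adapt it to the skeleton's existing definitions.
 * CONTAINER_ID_MAX comes from container.c; PATH_MAX is from <limits.h>. */
struct container {
    char id[CONTAINER_ID_MAX + 1]; /* first command-line argument */
    char image[PATH_MAX];          /* path to ./images/{IMAGE} */
    char **argv;                   /* NULL-terminated command, e.g. &argv[3] */
};

Note that `main`'s `argv` array is already NULL-terminated by the C standard, so pointing at the first command argument is enough.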
`main` executes `container_exec` in a child process with separate PID and mount namespaces. `container_exec` needs to
- create and mount an overlay filesystem
- call `change_root`
- use `execvp` to run the command
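Putting those steps together, the body of `container_exec` might be organized roughly as follows. This is only a sketch: it assumes the illustrative `struct container` above, omits error handling, and leaves the overlay mount itself for the next section.

int container_exec(void *arg) {
    struct container *c = (struct container *)arg; /* pointer passed by main via clone */
    char merged[PATH_MAX];
    snprintf(merged, sizeof(merged), "/tmp/container/%s/merged", c->id);

    /* 1. Create the upper, work, and merged directories, then mount the
     *    overlay filesystem at `merged` (see below). */

    /* 2. Make the merged directory the root of this container. */
    change_root(merged);

    /* 3. Replace this process with the requested command. */
    execvp(c->argv[0], c->argv);
    perror("execvp"); /* reached only if execvp fails */
    return 1;
}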
`container_exec` needs to create an overlay filesystem. The `merged` directory will have everything inside the image directory plus the changes made inside the container, and will be used as the root of the filesystem inside the container.
To create an overlay filesystem, use the `mount` function.
int mount(const char *source, const char *target,
const char *filesystemtype, unsigned long mountflags,
const void *data);
- `source` is often a path referring to a device. Because we are not mounting a device, use the dummy string `"overlay"`.
- `target` specifies the directory at which to create the mount point. Use the `merged` directory path: `/tmp/container/{id}/merged`.
- `filesystemtype` specifies the type of the filesystem. Use `"overlay"`.
- `mountflags` provides options. Use `MS_RELATIME`.
- `data` provides options specific to the filesystem. The overlay filesystem takes the three arguments (lowerdir, upperdir, workdir) in the format `lowerdir={lowerdir},upperdir={upperdir},workdir={workdir}`. Construct a string of this format and pass the pointer, as sketched below.
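One way to build the `data` string is with `snprintf`. A minimal sketch, assuming `lower_dir`, `upper_dir`, `work_dir`, and `merged_dir` already hold the absolute paths (these variable names are illustrative):

/* Assumes <stdio.h>, <limits.h>, and <sys/mount.h> are included. */
char options[3 * PATH_MAX];
snprintf(options, sizeof(options), "lowerdir=%s,upperdir=%s,workdir=%s",
         lower_dir, upper_dir, work_dir);
/* Mount the overlay filesystem at merged_dir using the options string. */
if (mount("overlay", merged_dir, "overlay", MS_RELATIME, options) < 0) {
    perror("mount");
}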
`lowerdir` should be the image directory. In principle, `upperdir` and `workdir` can be any directory, but in order for the overlay filesystem to work inside the Dev Container, those directories must be inside `/tmp/container`. `main` creates this directory.
Use `/tmp/container/{id}/upper` and `/tmp/container/{id}/work` for `upperdir` and `workdir`, respectively. In order for `mount` to work, those directories must exist. Use `mkdir` to create a directory if it does not exist, as sketched below.
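A small helper (the name is illustrative) can create each required directory, ignoring the error when it already exists:

#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>

/* Create a directory with mode 0755, tolerating an already-existing one. */
static int mkdir_if_missing(const char *path) {
    if (mkdir(path, 0755) < 0 && errno != EEXIST) {
        perror(path);
        return -1;
    }
    return 0;
}

Call it for `/tmp/container/{id}` and each of the `upper`, `work`, and `merged` directories before calling `mount`.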
For example, if the current directory is `/workspaces/project5-container`, the container ID is `my-container`, and the image name is `alpine`, you need to call
mount(
    "overlay",
    "/tmp/container/my-container/merged",
    "overlay",
    MS_RELATIME,
    "lowerdir=/workspaces/project5-container/images/alpine,upperdir=/tmp/container/my-container/upper,workdir=/tmp/container/my-container/work"
);
Now, the overlay filesystem is mounted at `/tmp/container/{id}/merged`. We want the child process to treat this as the root directory.
`pivot_root` is the system call to achieve this. Because calling `pivot_root` is complex and tedious, we have provided a helper function to do so.
void change_root(const char* path)
Provide the path to the "merged" directory to the `change_root` function. It will call `pivot_root` to change the root directory to the "merged" directory and ensure the process cannot access directories outside of it.
`change_root` also does a couple more things to ensure the container works properly, such as setting the `PATH` environment variable.
At this point, the child process has its own PID namespace and the overlay filesystem as its root directory. The last step is to execute the specified command.
Use `execvp(3)` so it can execute commands without specifying the full path to the executable.
int execvp(const char *file, char *const argv[]);
`file` specifies the name of the command and `argv` specifies the entire argument list. `argv` needs to be NULL-terminated.
For example, if the command is `echo "hello world"`, you should call
char *argument_list[] = {"echo", "hello world", NULL};
execvp(argument_list[0], argument_list);
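In this project, the command arrives as the tail of `main`'s `argv`, which the C standard already terminates with a NULL pointer. So, assuming the struct stores that pointer (as in the earlier illustrative sketch), the call can simply be:

/* c->argv points at &argv[3] from main, which is already NULL-terminated. */
execvp(c->argv[0], c->argv);
perror("execvp"); /* execvp returns only on failure */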
You can test that your container runtime is working properly using the `alpine` image described earlier. In particular, we want to make sure processes and the filesystem are isolated.
To check the process isolation, we can use the `ps` command.
$ sudo ./container my-container alpine sh
--- inside container ---
$ ps -A
PID USER TIME COMMAND
1 root 0:00 sh
2 root 0:00 ps -A
`ps -A` must not print the processes running on the host. The command used to create the container (`sh` in the above example) should have PID 1.
To check the filesystem isolation, you can use `cd` inside the container to try to get out of the filesystem. If `change_root` is called properly, you should not be able to get out of the overlay filesystem.
$ sudo ./container my-container alpine sh
# inside container
$ cd /../../
$ ls
bin dev etc home lib media mnt opt proc root run sbin srv sys tmp usr var
Any changes made inside the container should be visible in the `upper` directory.
$ sudo ./container my-container alpine sh
--- inside container ---
$ echo hello from container > hello.txt
$ exit
--- returned to host ---
$ sudo cat /tmp/container/my-container/upper/hello.txt
hello from container
While our container runtime is minimal, it is capable of running a variety of images. Once you are done testing with the alpine image, try using your container runtime to execute your favorite image.
For example, here is how to execute JavaScript (Node.js) using the `node:18-alpine` image.
# follow similar steps to create an image directory
$ docker pull node:18-alpine
$ docker run --rm -it node:18-alpine sh
# in a different terminal
$ docker ps # copy the container ID
$ docker export {container-id} > node.tar
$ mkdir images/node
$ tar -xf node.tar -C images/node
$ sudo ./container node-container node node
Welcome to Node.js v18.16.0.
Type ".help" for more information.
>
Error: Could not open history file.
REPL session history will not be persisted.
> console.log("hello, world!")
hello, world!
undefined
>
Try running your favorite programming language with your container runtime!
- `make` must create the `container` executable.
- All source files must be formatted using clang-format. Run `make format` to format `.c` and `.h` files.
- The filesystem can sometimes enter a bad state. If the filesystem behaves weirdly, try running the "Dev Containers: Rebuild Container" command in VS Code. This will recreate the Dev Container and will likely resolve the issue. You can also try restarting Docker Desktop.