jvns/executing-file.md

Last active August 30, 2024 17:24

Star (51) You must be signed in to star a gist
Fork (3) You must be signed in to fork a gist

Learn more about clone URLs
Clone this repository at <script src="https://gist.github.com/jvns/7688286.js"></script>
Save jvns/7688286 to your computer and use it in GitHub Desktop.

Download ZIP

What happens when I run ./hello

Raw

executing-file.md

(You can comment here.)

Here I'm trying to understand what happens when I run

./hello

#include <stdio.h>

int main() {
    printf("Hello!\n");
}

a simple "Hello World" program written in C, in Unix -- what I'd have to do if I wanted to write an OS that could execute it.

I'm going to assume that ./hello is statically linked, because that sounds simpler to deal with. It's worth noting that a statically linked hello is 868K on my machine. Eep.

I compiled it using

gcc -static hello.c -o hello

Any (nice!) comments or clarifications are appreciated.

Read from the filesystem

To run a program, I have to be able to find the program. So there would need to be some kind of filesystem and I would need to read the file from somewhere.

Copy the text into memory

In a Unix system, executables are in the ELF format.

So I would need to copy the "text" of the program somewhere.

Copy the data into memory

There is a string in the program. It needs to go somewhere.

Give the program a stack pointer

This program doesn't actually allocate memory, so perhaps it does not need a heap and it doesn't matter where the heap pointer is. It does need a stack. stack overflow question on how the stack works in assembly

Implement system calls

hello has some system calls in it. I found this out by running

objdump -d -M intel hello | grep 'syscall'

syscall is an assembly instruction for making a system call. That looks like

  401385:       b8 03 00 00 00          mov    eax,0x3
  40138a:       0f 05                   syscall

The number stored in eax is the system call that is called. In this case, 3

There are 119 instances of syscall, and it's using several different system calls. This is worrying.

(Explained more in this stackoverflow question)

Making sure a stack overflow doesn't happen

I have no idea how the OS would check up on the program. I guess it doesn't just let the program run, but takes away control periodically and makes sure the stack pointer hasn't moved too far. How would it take away control? Hmm.

When there is a stack overflow I guess it sends a signal to the program, which is a POSIX thing.

I do not understand this.

???

There are no mallocs in the program, so I would not need to allocate memory for it or anything.

What else?!??

Outstanding questions

How long would this take for a human (where human = me) to write from scratch?
Is there a way to write a smaller program with less system calls and magic? There are like 50 system calls and what are they even doing?
Do I need a heap if I never use malloc?
Could I write my own printf in assembly that does less and is simpler? Just printing a string is pretty easy...
How do I kill a program?

Useful links

mjdominus commented Nov 28, 2013

I don't think you should list "copy the data somewhere" as a separate step. By the time the linker has finished creating the executable program, the static string "Hello!\n" has been embedded into the executable, and the linker has calculated where it is and arranged that the parts of the program that need it are passed a pointer to the actual data. (If you run the strings utility on the executable file you should see the string hiding in there.) So the data is embedded in the program text and needn't be copied separately.

The statically linked hello executable has to include all the code for printf and for the C standard I/O library, which is surprisingly complex underneath.

One thing I would have included that you omitted is the system calls. At some point printf has to tell the OS that it actually wants to write data somewhere. At the library level, the way this works is that somewhere down in the guts of the standard I/O library (which is now part of your program, remember) there is an invocation of the Unix "system call" write. Typically the way this is implemented at the object level is that the program loads a magic number into one register, identifying which system call it wants to make (on my system I think the magic number is 4, and is listed in /usr/include/x86_64-linux-gnu/asm/unistd_32.h.), and the arguments of write into other registers (or perhaps it pushes them on the stack; it varies from architecture to architecture). The arguments in this case are a pointer to the data to be written, a length, in this case 7, and a "file descriptor", which in this case is 1, representing standard output by convention.

Then it executes a special machine instruction that causes a context switch to the kernel. The kernel has a write function in it that gathers the arguments and checks them for validity. For example, it would be bad if you could write a copy of some other process's memory onto the disk just by passing the write function a pointer to it. The kernel checks the arguments and if they're valid it copies the 7 bytes somewhere. It stores the result of the write operation into the process structure; the C .library will take care of arranging that this result appears to have been returned from the write function when the process resumes. Then the kernel context-switches back, or perhaps runs a completely different process.

What happens to the 7 bytes inside the write call is very interesting and depends on where the process's standard output is pointing. This is recorded in the process structure in the process's open file table. The open file table maps file descriptors to "file pointers"; the file pointer in turn records an offset (maybe) and a pointer into the kernel's open file table. The open file table in turn will record whether the descriptor is attached to a network socket, a disk file, a pipe, or a device, and relevant details about each.

In the most common case standard output is attached to a terminal device. In this case the open file table has two numbers. The "major device number" says that the device is a terminal, and indexes a table of pointers inside the kernel that point to the device driver routines that have the code for opening, closing, reading from, and writing to the terminal. The "minor device number" says which of the many otherwise identical terminal devices is being used, and is usually just passed to the device driver routines as a parameter.

The terminal device in this case isn't a physical terminal, which would require sending the data out some physical interface; it's what's called a "pseudo-tty". A pseudo-tty has two ends: a "slave" end that behaves like a real terminal, and a "master" end that can be used to control the terminal, read what was written to it, and send a reply. Data written to the slave side of a pseudo-tty is saved inside a kernel buffer until another process that has the master side open chooses to read it. Then that other process gets the seven bytes of data. In this case that process is your xterm terminal program. It then figures out what that should look like on its window and sends a bunch of requests to the X server via a network socket to tell it what to paint on the screen.

If the hello program's standard output has been directed to a file instead of to the pseudo-tty, the disposition of the seven bytes is somewhat different. The kernel's open file table will record the identity of the disk device on which the file resides, and the "i-number" of the file on that device. (Every file has an i-number, and each file on a device has a different one.) The "i-number" is so called because it is the "index number" of the file in the disk's index; the first data on the disk is a giant table of "inodes" ("index nodes") which records the file's owner, its permissions, and (most important) where on the disk its data is located. The kernel has previously located the inode on disk and loaded it into memory; this is what happens when you open a file. It knows the offset into the file that at which the write should take place. So it can calculate which disk block needs to be modified. Using the device number of the device on which the file resides, it calls the device driver to read the correct block from the disk; it then modifies the right seven bytes, and it leaves the block in a kernel "dirty buffer" area of blocks that have been modified but not yet written to disk. At some point in the not-too-far future, it will call the device driver again to write the modified data back to the right place on the disk. If the system crashes before that happens, the write is lost.

I hope this was not too much detail, and was what you were looking for.

mjdominus commented Nov 28, 2013

Regarding malloc, you might be mistaken. The standard I/O library has a buffer for every open I/O structure, and typically there are at least three, for standard input, standard output, and standard error output. In ancient versions of the standard I/O library, these buffers were statically allocated, but I bet they aren't in this case. The linker plays some trickery to arrange that main is called when the program begins, by linking in a file usually named something like crt, for "C run-time". The crt file initializes the stack pointers, sets up the program's arguments so that they can be read via argv, and other similar tasks'; it might also allocate memory for standard I/O buffers.

On my system, the statically-linked executable makes these system calls:

    execve("./hello", ["./hello"], [/* 48 vars */]) = 0
    uname({sys="Linux", node="ortolan", ...}) = 0
    brk(0)                                  = 0xe52000
    brk(0xe531c0)                           = 0xe531c0
    arch_prctl(ARCH_SET_FS, 0xe52880)       = 0
    brk(0xe741c0)                           = 0xe741c0
    brk(0xe75000)                           = 0xe75000
    fstat(1, {st_mode=S_IFCHR|0600, st_rdev=makedev(136, 4), ...}) = 0
    mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd900b2e000
    write(1, "Hello!\n", 7)                 = 7
    exit_group(7)                           = ?

(To get this, I ran strace -o hello.out ./hello. strace runs a program and prints a report of all the system calls the program makes.)

Those brk and mmap calls are the ones that allocate memory. Somewhere way down in the guts of the malloc library there are calls to brk or to mmap; there have to be, because malloc allocates memory, it has to get that memory from somewhere, and the only place to get it is by asking the kernel. I don't know for sure what the memory is being allocated for, but it is being allocated, and I will guess that the mmap call there is to allocate the output buffer for the printf. This is mainly because it allocates 4096 bytes, which is a likely size for such a buffer.

(Addendum: When I replce the printf("Hello!\n") with write(1, "Hello!\n", 7), the fstat and mmap calls disappear from the output of strace, but not the brk calls, so I think my guess about the standard I/O library making calling mmap to allocate a buffer was correct.)

funkaoshi commented Nov 28, 2013

The program is big because it's linking against the C runtime library. That's why the code balloons like that. If you search around on the Internet, you'll find people who try and write "tiny" C runtimes. That's the underlying issue, really. You could get around this all by writing your Hello World in assembly, I suppose.

You might want to read about the UNIX system calls. The kill() call is what you use to kill a process.

When I was in university they made us write an operating system in 3rd year. (We didn't do it from scratch, but used this thing called Nachos.) That class was 3-4 months long. If you were interested you could probably work through the course on your own. When I took the class you had to have already taken a class on concurrency and some other things you might need to teach yourself first.

Author

jvns commented Nov 28, 2013

Hmm, so I'm not sure that I understand how context switching and multitasking works. I read the Wikipedia article, where it says

A program is running for a slice of time
Something happens (an interrupt?)
When the interrupt gets called, the hardware looks up the interrupt handler (some kernel code)
The kernel code decides which program to continue executing and loads its state into the registers, storing the state of the old program

1, 3, and 4 make sense to me, but I don't yet understand how the interrupt would get triggered. The Wikipedia article mentions a "timer interrupt", which sounds like a program's execution always gets interrupted every fixed amount of time if they haven't. I don't see how that works - if the CPU is busy executing something, how does the interrupt get called?

danellis commented Nov 28, 2013

If you're actually planning on writing a toy OS, then you could probably start out by not caring about loading a program from a file, and instead start with a couple of hard-coded processes and figure out how to do the context-switching between them. And have you seen the OSdev wiki? It's very useful for this stuff: http://wiki.osdev.org/Main_Page

So regarding stack overflows, your virtual memory space is laid out like this:

+---------------+
|    Stack      |
|      |        |
|      v        |
+---------------+
:               :
+---------------+
|      ^        |
|      |        |
|     Heap      |
+---------------+
|     Data      |
+---------------+
|     Code      |
+---------------+

(Please excuse the crudity of this diagram. I didn't have time to built it to scale or paint it.)

At the bottom of that space is your code, followed by your global and static variables. Above that is your heap. At the top, coming downwards, is your stack. Between the two is "empty" space. That is, those virtual addresses are not mapped to any physical memory. Apart from saving RAM that can be mapped into other processes, these pages act as "guards". When your stack tries to grow downwards beyond its allocated space, a page fault is triggered, which the OS uses as a signal that it needs to allocate more physical pages to that process.

This also brings us to the 'brk' syscall that @mjdominus mentioned. It changes the location of the "program break", which is the end of the data section. If you move it up, you get space that you can use for the heap. If you need more heap space, you move it up some more. Note that it is a syscall, so it's the kernel that is doing this, and can map in more physical memory. The program break also represents the limit of how far down the stack can grow. When the heap and the stack meet, that's when you've run out of virtual address space.

"Do I need a heap if I never use malloc?"

Depends! If you're linking with libraries, they will likely use malloc even if you're not. In some scenarios, though, such as embedded software, it's common to not use malloc at all, in which case no, you don't need a heap.

"Could I write my own printf in assembly that does less and is simpler? Just printing a string is pretty easy..."

Absolutely. And I plan to do the same myself soon, because printf is ENORMOUS. If you think about all the things it does... formatting integers as decimal or hex, formatting floating point numbers, padding, field widths, thousands separators, etc, it's no wonder. You could write a much smaller one yourself to just do the things you need.

"So I would need to copy the "text" of the program somewhere."

If everything is in RAM, you don't need to copy anything. You'd want to load the text and data segments separately, though, so they can have different memory protections. The BSS section does need to be zeroed out, though -- that's what crt.o does. (If you're on a system where your code is in flash, you need the additional step of copying the data segment (initialized data) into RAM so it can be writable!)

danellis commented Nov 28, 2013

" if the CPU is busy executing something, how does the interrupt get called?"

It's done by the hardware. The timer is a physical unit within the CPU. When it reaches a preset value (or zero if it's counting downwards), it generates an interrupt, which causes the CPU to stop whatever it's doing and jump to the interrupt handler. I don't know the details of how this is done on x86, but on ARM it stacks a few registers for you (including the program counter), then loads an address from the "vector table" (a table that has the addresses of all your fault handlers and interrupt handlers), switches to privileged mode and jumps to that address. That would be the addresses of your scheduler, which choose the next task to run and switches the stack pointer to the one for that task. Then, when it returns, the CPU unstacks the registers, but this time they're the saved ones of a different process, including a different saved program counter, so it "returns" to a different process. Of course, in a x86 system you also have to change the virtual-to-physical page mappings to those of the next process so that it's mapped into the same virtual memory space.

danluu commented Nov 28, 2013

I don't know the details of how this is done on x86, but on ARM

x86 has the same mechanism as ARM[1].

If the question is what actually keeps track of the time in order to trigger a timer interrupt, there are a few different mechanisms on x86. Modern OSes will use APIC or x2APIC timer interrupts. But if you just want to do something simple yourself, setting up the PIT is much simpler.

[1] Well, pretty much. The IDT is more general; more entires, and it has a gate descriptor, instead of just being basically a pointer.

mjdominus commented Nov 29, 2013

Regarding those 50 system calls you see in the object dump, and the large size, my guess is that the linker is including a bunch of subroutines that you never call, and most of the stuff you see is not actually related to your program. For example, when I run nm on my executable, I get a huge listing of functions, including a bunch of time zone functions and Yacc parser functions that I certainly don't use. (There's probably some incantation you can give to the linker to prevent this, but I hate the linker and never touch it unless I have to.)

To see the system calls that your program actually performs as it executes, use strace as I described above.

mjdominus commented Nov 29, 2013

Regarding your #2 above, the interrupt gets called periodically by the hardware, as @danellis remarked. But there's also a very important additional circumstance in which an interrupt occurs: there is a machine instruction which causes one when it is executed, and this is precisely how the hello program notifies the kernel that it wants to perform the write call. (At least, that's definitely how it works on a System/360 machine, which I am ashamed to say is the most recent architecture I am really familiar with. Most of my career has been at the source code level.)

You also asked about stack overflows. @danellis described how the stack grows downwards in memory. When your program tries to access an address off the bottom of the stack, the MMU generates an interrupt. The kernel gets control, and checks to see if your process is allowed to have another page of stack space; if so it gets the MMU to allocate one and map it into your process's address space, and then everything continues normally. But if it decides to refuse the request, perhaps because your process is marked in the kernel as only being allowed a limited amount of stack space, it instead sends your process a SEGV (‘segmentation violation’) signal, which then gets handled as usual; the default is for the process to exit immediately and the kernel copies its entire address space into a file named core in the process's working directory. The shell command ulimit, which is a thin wrapper around the kernel's ulimit system call, is for setting the stack size and other limits.

I don't know if there's any way for the process to find out how big its stack is. It can use ulimit to find out the maximum allowed size. I don't think there's a way for it to catch the interrupt generated by the MMU when it starts a new stack page. It can change the way it handles the SEGV signal. One thing it can do is to ask the kernel to transfer control to a "signal handler" function in case of SEGV, instead of killing the process instantly. The signal handler can try to reduce the size of the data segment so that the stack has more room to grow.

Author

jvns commented Nov 29, 2013

from @nevyn on Twitter:

afaik, the system does no bounds checks of memory. Instead, the hardware MMU starts a signal chain when bad access happens.

Author

jvns commented Nov 29, 2013

@danellis suggested

By the way, you might find 'Advanced Programming in the Unix Environment' by W Richard Stevens to be useful. It's a classic.

danellis commented Nov 29, 2013

"I don't know if there's any way for the process to find out how big its stack is."

I don't know of a portable way. You can get the address of the bottom of the stack using a function that returns the address of an argument of local variable, but the top not only differs per platform, but is randomized slightly for security purposes. You can see how much memory is reserved for your stack space in /proc/self/maps, though:

$ grep '\[stack]' /proc/self/maps 
7fff227eb000-7fff2280c000 rw-p 00000000 00:00 0                          [stack]

rythie commented Nov 29, 2013

On your query about interrupts...

Interrupts can be called by any hardware that has a IRQ line. Interrupts are sent directly to one of the CPUs and they have a location in memory setup to jump to handle that interrupt quickly and get back to what they were doing.

Network interfaces, USB, disk controllers usually do interrupts when they have something ready to send, i.e. a new packet has come in the network interface (though sometimes they batch them which is called interrupt mitigation). There is also a programable timer chip on the motherboard, which in Linux OS is typically programmed to interrupt a set number of times a second (e.g. 1000 on my box) which calls the scheduler to run.

Processes run until they block (by doing a system call) or run out of their timeslice (~20ms). When a process blocks, the system handles the system call and typically that has some wait in it, e.g. it asks for something from disk that might take 1-10ms. The OS then does something else in that 1-10ms. When the data comes back from the disk an interrupt is raised and the system puts that into memory. The original process is now runable again and will be ran by the scheduler based on it's algorithm. If the process runs over it's timeslice (yours wouldn't - a but one doing some CPU intensive stuff would) some thing else can be scheduled for a bit before it get a chance to run again.

lenary commented Dec 5, 2013

Here's another resource for ELF Files: http://i.imgur.com/GZ5a0sb.png

I think it's really neat, how helpful it is though, remains to be seen.

jvns/executing-file.md

Read from the filesystem

Copy the text into memory

Copy the data into memory

Give the program a stack pointer

Implement system calls

Making sure a stack overflow doesn't happen

???

Outstanding questions

Useful links

mjdominus commented Nov 28, 2013

Uh oh!

mjdominus commented Nov 28, 2013

Uh oh!

funkaoshi commented Nov 28, 2013

Uh oh!

jvns commented Nov 28, 2013

Uh oh!

danellis commented Nov 28, 2013

Uh oh!

danellis commented Nov 28, 2013

Uh oh!

danluu commented Nov 28, 2013

Uh oh!

mjdominus commented Nov 29, 2013

Uh oh!

mjdominus commented Nov 29, 2013

Uh oh!

jvns commented Nov 29, 2013

Uh oh!

jvns commented Nov 29, 2013

Uh oh!

danellis commented Nov 29, 2013

Uh oh!

rythie commented Nov 29, 2013

Uh oh!

lenary commented Dec 5, 2013

Uh oh!