I was too lazy to come up with a title. If I ever publish this somewhere else, I'll probably try to figure one out.
Let's look at the program below:
#include <stdio.h>
#define DEFAULT_MSG "Default message"
static void say(const char *msg) {
    if (msg == NULL) {
        msg = DEFAULT_MSG;
    }
    printf("(0x%lx) %s\n", (size_t)msg, msg);
}
int main(int argc, char **argv) {
    if (argc > 1)
        say(argv[1]);
    else
        say(NULL);
    return 0;
}
Let's compile it using a bunch of unnecessary flags:
gcc test.c -o test -W -Wall -Wextra -pedantic -Wcast-align -Wcast-qual -Wconversion -Wwrite-strings -Wfloat-equal -Wpointer-arith -Wformat=2 -Winit-self -Wuninitialized -Wshadow -Wstrict-prototypes -Wmissing-declarations -Wmissing-prototypes -Wno-unused-parameter -Wbad-function-cast -Wunreachable-code -O0 -g
And run it:
~$ ./test
(0x402004) Default message
~$ ./test hello!
(0x7ffd713d125a) hello!
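The C programming language represents strings as null-terminated character arrays: the end of a string is marked by a byte with value 0 (NUL), not by the NULL pointer. C also has no way of passing a string to a function by value; it can only pass it by reference. Although it may seem a little unintuitive, this is the same as passing a pointer to the string's first character by value.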
In the example above, the function say takes a pointer to a constant string, which means that we must not modify the referenced string through that pointer. This is not to be confused with a constant pointer to a string, which would mean that the pointer itself cannot be modified, but may still be used to modify the referenced string.
For instance, declaring msg as a constant pointer to a constant string would not compile, because say assigns a new value to msg:
static void say(const char * const msg) {
    if (msg == NULL) {
        msg = DEFAULT_MSG;
    }
    printf("(0x%lx) %s\n", (size_t)msg, msg);
}
test.c: In function ‘say’:
test.c:6:13: error: assignment of read-only parameter ‘msg’
msg = DEFAULT_MSG;
^
Exercise for the reader: what could go wrong when using a constant pointer to a string to modify argv[1]?
Now, let's rewind a little and see how strings are stored in our program.
First, we should distinguish two types of strings: “constant strings” (const char *) and “dynamic strings” (char *).
One important remark is that we can always get a const char * from a char *, since the only difference lies in promising the compiler that we will not use that particular pointer to modify the string.
Now, the preprocessor macro DEFAULT_MSG will be replaced by its value during preprocessing, before compilation proper, so the compiler will see the function say like so:
static void say(const char *msg) {
    if (msg == NULL) {
        msg = "Default message";
    }
    printf("(0x%lx) %s\n", (size_t)msg, msg);
}
Constant strings are stored in the compiled object file and later placed by the linker in the final binary. We can verify this by compiling the intermediate object file, linking it, and peeking inside both files.
~$ gcc -c test.c -o test.o
~$ xxd test.o | grep -C 2 Default
00000090: f048 83c0 0848 8b00 4889 c7e8 a0ff ffff .H...H..H.......
000000a0: eb0a bf00 0000 00e8 94ff ffff b800 0000 ................
000000b0: 00c9 c344 6566 6175 6c74 206d 6573 7361 ...Default messa
000000c0: 6765 0028 3078 256c 7829 2025 730a 0000 ge.(0x%lx) %s...
000000d0: 4743 433a 2028 474e 5529 2038 2e33 2e30 GCC: (GNU) 8.3.0
~$ gcc test.o -o test
~$ xxd test | grep -C 2 Default
00001fe0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00001ff0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00002000: 0100 0200 4465 6661 756c 7420 6d65 7373 ....Default mess
00002010: 6167 6500 2830 7825 6c78 2920 2573 0a00 age.(0x%lx) %s..
00002020: 011b 033b 3c00 0000 0600 0000 00f0 ffff ...;<...........
~$ readelf -S test
There are 35 section headers, starting at offset 0x4f38:
Section Headers:
[Nr] Name Type Address Offset
Size EntSize Flags Link Info Align
<--snip-->
[15] .rodata PROGBITS 0000000000402000 00002000
0000000000000020 0000000000000000 A 0 0 4
<--snip-->
[30] .debug_str PROGBITS 0000000000000000 00003e11
0000000000000657 0000000000000001 MS 0 0 1
<--snip-->
Key to Flags:
W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
L (link order), O (extra OS processing required), G (group), T (TLS),
C (compressed), x (unknown), o (OS specific), E (exclude),
By looking at the hex dump we can see that our string is located at offset 0x2004 in the binary file. Using readelf, we can study each of its sections, in particular the .rodata section, which the program loader will map at address 0x402000 from the file contents at offset 0x2000. We can also see that the section is aligned on a 4-byte boundary. The flags show that the section is only allocated (A): it is marked neither writable nor executable.
One may be inclined to think that the S (strings) flag should be set here as well. Exercise for the reader: why is it not set?
Looking at the contents of the .rodata section, we see our message and an extra string. Looking back at the code, we notice that it corresponds to the format string we passed to printf! (Exercise for the reader: where is the ^J coming from?)
~$ readelf -p .rodata test
String dump of section '.rodata':
[ 4] Default message
[ 14] (0x%lx) %s^J
Now, our default message is located 4 bytes into the .rodata section, which is mapped at 0x402000. This means that it will be available at runtime at address 0x402004, which corresponds to what we see when we print its address.
~$ ./test
(0x402004) Default message