Troubleshooting a running application can be difficult, usually it starts around checking log output and then following through the likely code paths to get an idea of where a failure may occur. In a development environment, you might attach a debugger a step through source, but troubleshooting isn't always that convenient. There are several helpful tools that can assist, but one that gives the most comprehensive view of a running application is strace
. With strace
you are able to see all of the system calls an application makes to get a detailed understanding of what is going on "under the hood" in order to troubleshoot an issue.
Take a simple "hello world" F# application, the kind you get from dotnet new console -lang F# -n strace-sample"
. Build it with dotnet build
and then launch it with strace
to get a trace of all the system calls in a file called trace.log
(adjusting for your build output path if on a different framework version):
strace -o trace.log dotnet bin/Debug/netcoreapp2.2/strace-sample.dll
Along with the expected console output, you'll also get a new file, trace.log
- a huge one, mine came to almost 6,000 lines! Here are the first 8 lines:
execve("/usr/bin/dotnet", ["dotnet", "bin/Debug/netcoreapp2.2/strace-s"...], 0x7ffd0da40d78 /* 57 vars */) = 0
brk(NULL) = 0x24db000
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=103259, ...}) = 0
mmap(NULL, 103259, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f2e84896000
close(3) = 0
And if you scroll 5000+ system calls later, you'll see this:
write(24, "Hello World from F#!", 20)
The write
statement is a system call, the same statement would be output by compiling and tracing the system calls for this code:
#include <unistd.h>
void main (void) {
write (STDOUT_FILENO, "Hello world from F#!", 20);
}
Before you can say "what a bloated framework, I should write everything in C!", please keep in mind that no one outside of a demo ever ships a "hello world" console application. Plan to take the time to understand all of those system calls, the various libraries that are loaded, garbage collection, and heap alignment so that you can make an informed decision.
It's a good excercise to read through a few of these now, while the application is working as expected. Protip: read some strace
output for a few applications now to become familiar - when you're faced with an issue with an application, you don't want the added stress of learning how to read a trace so take the time now. You can get a quick reference for any of the system calls with man
, for example man openat
.
execve
- this is followed by the path todotnet
, then its parameters, executing the entry point.brk
- this most likely called bymalloc
to allocate memory for a path string for the next few calls.access
- this checks if a file exists (F_OK), is readable (R_OK), writable (W_OK), or executable (X_OK).openat
- this open a file at the specified path, returning number, the file descriptor.fstat
- gets astats
structure for the file, which includesst_size
for the size of the file in bytes.mmap
- the size fromfstat
is passed into this call along with the file descriptor fromopenat
, loading the contents of this file into memory.close
- this closes the file descriptor fromopenat
There are thousands of lines of system calls for bootstrapping the CLR and loading all the dependencies. The Linux system calls are very well documented, so no need to go through all of them here. Now for actually troubleshooting an application.
This is a simple application that creates a timer that does a DNS lookup every 5 seconds and then prints the address if found. It looks fairly safe, checking that there is an item in the array of addresses before printing it.
[<EntryPoint>]
let main argv =
use timer = new System.Timers.Timer (5000.)
timer.Enabled <- true
timer.AutoReset <- true
timer.Elapsed.AddHandler (fun _ _ ->
printfn "Looking up DNS address."
let addr = System.Net.Dns.GetHostAddresses "example.com"
match addr |> Array.tryHead with
| Some firstAddr -> printfn "Address: %O" firstAddr
| None -> printfn "No address found."
)
/// Run until we enter hit <Return>.
stdin.ReadLine () |> ignore
0 // return an integer exit code
Let's give it a try:
$ dotnet bin/Debug/netcoreapp2.2/strace-sample.dll
Looking up DNS address.
Looking up DNS address.
Looking up DNS address.
Looking up DNS address.
Looking up DNS address.
It's not working properly. There is no message about finding the address, nor is there an error that the address is not found. That doesn't leave a lot to go on, unfortunately. This isn't a big deal in a small example, but in a real application, maybe we only hit this case when we misconfigured some service address, and we don't realize until it's deployed. Too late to attach a debugger.
Let's attach with strace
to see the system calls. To do so, first get the process ID with pgrep
.
$ pgrep -a dotnet
6414 dotnet bin/Debug/netcoreapp2.2/strace-sample.dll
Now we can attach to process 6414, but need to have elevated privileges to do so:
sudo strace -f -p 6414 -o running.log
The -f
option says to follow threads, -p
specifies the process ID, and -o
specifies the output file.
Now the trace output goes to a file named running.log
that we can view to see what is going on after. There are multiple threads running handling the timer, and the system calls are interleaved. The number in the first column is the thread ID, and you can see that 3266 is the thread of execution where the println and then the subsequent calls to perform the DNS lookup being, with a stat
on /etc/resolv.conf
to get the system's DNS resolvers.
3266 write(27, "Looking up DNS address.", 23 <unfinished ...>
3264 <... clone resumed> child_stack=0x7feb24072fb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7feb240739d0, tls=0x7feb24073700, child_tidptr=0x7feb240739d0) = 3311
3311 <... set_robust_list resumed> ) = 0
3266 <... write resumed> ) = 23
3264 futex(0x7fea80003ac8, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
3311 sigaltstack(NULL, <unfinished ...>
3266 write(27, "\n", 1 <unfinished ...>
3311 <... sigaltstack resumed> {ss_sp=NULL, ss_flags=SS_DISABLE, ss_size=0}) = 0
3266 <... write resumed> ) = 1
3311 mmap(NULL, 16384, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0 <unfinished ...>
3266 stat("/etc/resolv.conf", <unfinished ...>
With multithreaded code, there will be a lot of futex
calls - these are where one thread typically will issue some operation, sleep, and then when that operation is completed, the other thread will wake it up. Let's ignore those for now, filter out some noise and just follow that thread.
It opens the /etc/hosts
file to read, then closes it. This is trying to resolve the domain locally.
3266 stat("/etc/resolv.conf", <unfinished ...>
3266 <... stat resumed> {st_mode=S_IFREG|0644, st_size=318, ...}) = 0
3266 openat(AT_FDCWD, "/etc/hosts", O_RDONLY|O_CLOEXEC <unfinished ...>
3266 <... openat resumed> ) = 41
3266 fstat(41, <unfinished ...>
3266 <... fstat resumed> {st_mode=S_IFREG|0644, st_size=221, ...}) = 0
3266 read(41, <unfinished ...>
3266 <... read resumed> "127.0.0.1\tlocalhost\n127.0.1.1\tub"..., 4096) = 221
3266 read(41, <unfinished ...>
3266 <... read resumed> "", 4096) = 0
3266 close(41 <unfinished ...>
3266 <... close resumed> ) = 0
Since it couldn't resolve locally, it creates a socket, then connects to the DNS server (from /etc/resolv.conf
read earlier), and then sends a message to it. You'll see the example.com
domain sent here. It's the DNS lookup.
3266 socket(AF_INET, SOCK_DGRAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP <unfinished ...>
3266 <... socket resumed> ) = 41
3266 connect(41, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("127.0.0.53")}, 16 <unfinished ...>
3266 <... connect resumed> ) = 0
3266 poll([{fd=41, events=POLLOUT}], 1, 0) = 1 ([{fd=41, revents=POLLOUT}])
3266 sendmmsg(41, [{msg_hdr={msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\322\272\1\0\0\1\0\0\0\0\0\1\7example\3com\0\0\1\0\1\0\0)"..., iov_len=40}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, msg_len=40}, {msg_hdr={msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="r\303\1\0\0\1\0\0\0\0\0\1\7example\3com\0\0\34\0\1\0\0)"..., iov_len=40}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, msg_len=40}], 2, MSG_NOSIGNAL) = 2
Then it makes the poll
system call, which tells the OS to wait for an event, in this case POLLIN
means waiting until there is data to read. Then it calls recvfrom
to read the message from the socket. This is the DNS reply.
3266 poll([{fd=41, events=POLLIN}], 1, 5000) = 1 ([{fd=41, revents=POLLIN}])
3266 ioctl(41, FIONREAD, [40]) = 0
3266 recvfrom(41, "\322\272\201\203\0\1\0\0\0\0\0\1\7example\3com\0\0\1\0\1\0\0)"..., 2048, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("127.0.0.53")}, [28->16]) = 40
3266 poll([{fd=41, events=POLLIN}], 1, 4990) = 1 ([{fd=41, revents=POLLIN}])
3266 ioctl(41, FIONREAD, [40]) = 0
3266 recvfrom(41, "r\303\201\203\0\1\0\0\0\0\0\1\7example\3com\0\0\34\0\1\0\0)"..., 65536, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("127.0.0.53")}, [28->16]) = 40
3266 close(41) = 0
Next, the thread enters some code that it doesn't want any interruption, so it masks OS signals with rt_sigprocmask
, then it makes a uname
and then unblocks signals again, gets the thread ID with gettid
, and calls a futex
- the thread is done with what it needed to do here. We never saw it go to either of the cases in the match statement, because both of them print.
3266 rt_sigprocmask(SIG_BLOCK, [HUP USR1 USR2 PIPE ALRM CHLD TSTP URG VTALRM PROF WINCH IO], [], 8) = 0
3266 uname({sysname="Linux", nodename="ubuntu", ...}) = 0
3266 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
3266 mprotect(0x7feb291b0000, 4096, PROT_READ|PROT_WRITE) = 0
3266 mprotect(0x7feb291b0000, 4096, PROT_NONE) = 0
3266 gettid() = 3266
3266 futex(0x7fea7c02a700, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
3266 <... futex resumed> ) = 1
3266 futex(0x7fea7c02a6b0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
Between the two rt_sigprocmask
system calls, the code was unwinding the stack. If you adjust the code to use a domain that will resolve, then you will find that the system calls to rt_sigprocmask
and uname
don't occur. If you spot these, most likely some exception is occurring. Thinking back to the code, GetHostAddresses
can throw an exception, so let's catch it:
timer.Elapsed.AddHandler (fun _ _ ->
try
printfn "Looking up DNS address."
let addr = System.Net.Dns.GetHostAddresses "example.com"
match addr |> Array.tryHead with
| Some firstAddr -> printfn "Address: %O" firstAddr
| None -> printfn "No address found."
with
| ex -> printfn "Failed: %O" ex
)
Run it again using strace -f dotnet bin/Debug/netcoreapp2.2/strace-sample.dll
to output directly to the console.
Message received, socket closed, rt_sigprocmask
, uname
, rt_sigprocmask
combination, but then write
calls with the stack trace before it wraps up the thread.
[pid 9670] recvfrom(48, "\250\7\201\203\0\1\0\0\0\0\0\1\7example\3com\0\0\34\0\1\0\0)"..., 65536, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("127.0.0.53")}, [28->16]) = 40
[pid 9670] close(48) = 0
[pid 9670] rt_sigprocmask(SIG_BLOCK, [HUP USR1 USR2 PIPE ALRM CHLD TSTP URG VTALRM PROF WINCH IO], [], 8) = 0
[pid 9670] uname({sysname="Linux", nodename="ubuntu", ...}) = 0
[pid 9670] rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
[pid 9670] write(27, "Failed: ", 8Failed: ) = 8
[pid 9670] write(27, "System.Net.Internals.SocketExcep"..., 256System.Net.Internals.SocketExceptionFactory+ExtendedSocketException (00000005, 6): No such device or address
at System.Net.Dns.InternalGetHostByName(String hostName)
at System.Net.Dns.GetHostAddresses(String hostNameOrAddress)
at Program.main@6-4.) = 256
[pid 9670] write(27, "Invoke(Object _arg2, ElapsedEven"..., 94Invoke(Object _arg2, ElapsedEventArgs _arg1) in /home/user/src/strace-sample/Program.fs:line 9) = 94
[pid 9670] write(27, "\n", 1
) = 1
[pid 9670] mprotect(0x7f74b42a9000, 4096, PROT_READ|PROT_WRITE) = 0
[pid 9670] mprotect(0x7f74b42a9000, 4096, PROT_NONE) = 0
[pid 9670] gettid() = 9670
[pid 9670] futex(0x7f740c002e74, FUTEX_WAKE_PRIVATE, 1) = 1
Now the code is handling the exception, but more importantly, this shows how to trace through system calls to find where the system calls no longer match what you expect the application to do - the place in the code where there could be an issue. Tracing code when it's working properly gives a great deal of insight into how the CoreCLR interacts with the operating system, and also helps to gain familiarity with reading traces and recognizing the types of system calls that occur when errors happen.
I can confirm strace can help a lot with hunting some cryptic issues (such as assembly resolution). This is what I've used recently: