The ultimate goal of debugging is to find the truth of the situation. Something is off: you expect the code or system to do x and it is off doing y. As we deploy code in different environments, at very large scale, or across distributed systems, you can't always step through the code in a debugger or reproduce the problem perfectly, but you still need to be able to make progress.
At the end of the day there is code running somewhere in some order. Your goal is to figure out what is actually running. In a perfect universe you could pause and step through all the code running on every CPU in the system. Until then, we are trying to get as close as possible to that level of understanding.
Obviously, if you can run something locally and step through it in a debugger you should, but for many systems this is not possible.
The next step is to add logging to see what is going on.
Again, there may be systems where you can't log or can't get access to the logs. Then you need to be creative.
The goal is not the log but the information about what is going on.
So you could:
- Update something you can control: write to a db, a file, or anywhere your code can access. Write out some information you can review later, put a message on a queue, or make a dummy API call.
- For an API or similar system, have it return its internal state directly or append a logging message to the response.
Again, the goal is not the log but understanding what is happening. Find any way you can to get that information.
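As a minimal sketch of the "write somewhere you control" idea, the snippet below appends one JSON line per event to a file. The names `breadcrumb`, `read_breadcrumbs`, and `BREADCRUMB_PATH` are made up for illustration, and the temp-directory location is only an assumption. Swap in whatever your code can actually reach (a db table, a queue, a dummy API call).

```python
import json
import os
import tempfile
import time

# Hypothetical location: any place the code is allowed to write will do.
BREADCRUMB_PATH = os.path.join(tempfile.gettempdir(), "debug_breadcrumbs.jsonl")


def breadcrumb(event, path=BREADCRUMB_PATH, **fields):
    """Append one JSON line describing what the code is doing right now."""
    record = {"ts": time.time(), "pid": os.getpid(), "event": event, **fields}
    try:
        with open(path, "a", encoding="utf-8") as f:
            # default=str keeps odd values (datetimes, objects) from raising.
            f.write(json.dumps(record, default=str) + "\n")
    except OSError:
        # Debug output should never take down the code being debugged.
        pass


def read_breadcrumbs(path=BREADCRUMB_PATH):
    """Load every breadcrumb back, in the order it was written."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Sprinkle calls like `breadcrumb("before_charge", order_id=order_id)` through the suspect path, run it, then read the file back to see which steps actually ran and with what values.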
When things are just not working and you don't have a good starting place, go to ground: get the most basic thing working. Comment out all the code except maybe a dummy return value. Does that work? Then build on that.
Another method for tracking down issues is to bisect: comment out half of the code flow and run it. Does the first half work? Then swap and test the other half. Halve again, and keep going until you have isolated the problem. This works well with the go to ground method.
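The halving idea can be sketched as a binary search over the steps of a flow. This is only an illustration under a few assumptions: the flow can be expressed as an ordered list of steps, you can run just the first n of them, and once the flow breaks it stays broken for every longer prefix. The names `run_prefix` and `first_bad_step` are hypothetical.

```python
def run_prefix(steps, n, value):
    """Run only the first n steps of the flow, feeding each result forward."""
    for step in steps[:n]:
        value = step(value)
    return value


def first_bad_step(num_steps, works_with_first):
    """Return the index of the first step that breaks the flow, or None.

    works_with_first(n) should return True when running only the first
    n steps behaves correctly. Assumes zero steps always "works" and that
    a broken prefix stays broken as more steps are added.
    """
    if works_with_first(num_steps):
        return None  # the whole flow works, nothing to find
    lo, hi = 0, num_steps  # invariant: first lo steps work, first hi steps fail
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if works_with_first(mid):
            lo = mid
        else:
            hi = mid
    return hi - 1  # the step that turned a working prefix into a failing one
```

With a four-step flow this needs about two or three runs instead of four, and the saving grows quickly as the flow gets longer. The check passed in as `works_with_first` can be as crude as "did it raise an exception" or "does the dummy return value still come back".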
Certain bugs span time. Many things are updating, and there may be background tasks or distributed agents. Just like in a murder investigation, build a timeline of events.
Sequence diagrams can be helpful here.
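One rough way to build such a timeline is to collect timestamped events from each source and merge them into a single ordered list. The sketch below assumes each source can give you events with a timezone-aware timestamp. The `Event` type and the `build_timeline` and `format_timeline` helpers are hypothetical names. Clocks on different machines can be skewed, so treat the ordering of events that are very close together with suspicion.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Event:
    when: datetime  # timezone-aware, ideally normalized to UTC
    source: str     # which system or agent reported it
    what: str       # what happened


def build_timeline(*sources):
    """Merge events from any number of sources into one time-ordered list."""
    events = [event for source in sources for event in source]
    return sorted(events, key=lambda e: e.when)


def format_timeline(events):
    """Render each event as an offset in seconds from the first event."""
    if not events:
        return []
    start = events[0].when
    return [
        f"+{(e.when - start).total_seconds():8.3f}s  {e.source:<10} {e.what}"
        for e in events
    ]
```

Reading the merged list top to bottom often shows the surprise directly, for example a background task touching a record between an API request and its response.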
Many times we assume things without realizing we are doing it. "That code stopped working but it hasn't changed, so it must be this other change that just occurred." That is not a bad assumption most of the time, but it is still an assumption and can be wrong.
That code that has been running for years could now be running with different data, and the failure could be unrelated to any other change.
When debugging there can be many possible explanations for what is happening. Pick one and assume it is true, then follow logically from that assumption. If the assumption is true, what would you expect to see happening? Does it match? Then you can be more confident in that assumption. If not, move on to the next.
https://gist.github.com/Jbot29/b8a4a1923455aec06bc4fd4aaef96d39