Problem solving on unix-linux systems
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+ ###All credit goes to the author, Dan Stromberg### +
+ http://stromberg.dnsalias.org/~strombrg/Problem-solving-on-unix-linux-systems.html +
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Note: This web page was automatically created from a PalmOS "pedit32" memo.
Problem solving on unix/linux systems
This document covers generic problem solving approaches that have
proved useful on unix and/or linux systems. Some of it applies to other
operating systems as well.
If you see a method of solving problems on unix and/or linux systems that
isn't here, please let me know: strombrg at dcs dot nac dot uci dot edu.
I'll of course credit the source.
These are not, at this point, listed in any particular order, but they
may be someday. :)
1) Get the full text of any error messages. Take a guess what they mean,
and try to address the problem based on that.
2) Get the full text of any error messages, and google for them.
Leave out anything very system-specific, like pid numbers or values
of pointers (other than the NULL pointer). Often someone will have
already solved the problem you're seeing, and there'll be an answer to
your question in some archive somewhere. Googling in both the web and
usenet is generally a good idea. You may or may not want to restrict
your usenet search to a particular usenet group - sometimes this can
increase the relevancy of the results, but of course it can also cut
down greatly on the number of hits you get.
3) Run df. A lot of problems can be quickly tracked down by just
checking if any filesystems are full, or any remote (EG, NFS) mounts
are having problems.
4) Try truss/strace/par/trace/&c. These programs can list system
calls being executed by a program. Often the content of the system call
trace, near the bottom, will give a fair indication of what is wrong.
If one of the last things is trying to do something with a file, and an
"Esomething" error status is returned, there's a good chance that's
the problem. Alternatively, if the last thing is successfully reading
a config file but shortly thereafter giving an error anyway (via write()
or whatever, or perhaps not giving an error at all!), then there's a good
chance that the error is in that config file. It's often worth trying
something like this on both the client and the server. If it's hard to
fire up a tracer against a client quickly enough, then run "echo $$" in
the client's shell, and "truss -f -p" the resulting pid from another window.
This will truss (or whatever) your shell and its subprocesses. It's also sometimes
helpful to truss -f -p inetd's pid, xinetd's pid, or another daemon's pid
(like sshd's). If tracing httpd, you may have to kill and restart httpd
under truss, or change httpd's config file to only spawn one child (for
example). Sometimes if you're on a busy system, you'll get flooded with
information doing this. In such a situation, you can sometimes move
to another representative system, or set up a tight while loop that
will initiate your truss of a relevant process as soon as possible
after it is exec'd, by ps | grep'ing again and again. See also http://stromberg.dnsalias.org/~dstromberg/debugging-with-syscall-tracers.html
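For instance, a Linux-flavored sketch of the "echo $$" trick (substitute truss
on Solaris; the output filename, and the pid you attach to, are whatever "echo
$$" printed for you):
    # In the window where the problem happens, note the shell's pid:
    echo $$
    # From a second window, attach to that pid and follow its children:
    strace -f -o /tmp/shell.trace -p 12345
    # Reproduce the problem in the first window, then read the end of the trace:
    tail -50 /tmp/shell.trace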
5) You can usually tell which NFS mount is having problems by one of
three methods:
5a) Run df &. Wait a long time. Eventually, df will probably tell
you which NFS server is down.
5b) Run df &. Note the last filesystem listed. It is probably the
-next- filesystem in the machine's filesystem list that has the problem.
You can often list these filesystems by inspecting /etc/mtab, /etc/mnttab,
or running the mount command with no arguments.
5c) Use a system call tracer on df &. This will most likely identify
which filesystem is having problems pretty quickly. I generally prefer
this method of the three.
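As a concrete example of 5c (strace is the Linux name; use truss or the local
equivalent elsewhere, and treat the filename as just a suggestion):
    # Trace df; if it hangs, the last lines of the trace usually name the
    # mount point it got stuck on:
    strace -o /tmp/df.trace df &
    sleep 10
    tail -5 /tmp/df.trace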
6) If the problem you are troubleshooting is network related, fire up
a sniffer on the traffic. ethereal/tethereal, snoop and tcpdump -v are
pretty good at annotating network conversations with useful information.
Even if the traffic is encrypted, you can sometimes make an educated guess
about where the problem lies based on the last host to send anything as
part of the conversation. Also, sometimes you can give sniffers keys
that they can use to decrypt traffic.
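For example, watching NFS traffic with tcpdump (the interface name and server
name are placeholders):
    tcpdump -v -i eth0 host nfsserver.example.com and port 2049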
7) truss and such will probably detect this to some extent, but check
if the user in question is up to or exceeding their hard quota, or has
exceeded their soft quota for more than the specified amount of time
(usually one week). This problem can often lead to other problems -
for example, X11 credential forwarding may mysteriously fail if the
homedir is not writeable.
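A quick way to check (Linux flags shown; they vary a bit by OS, and the
username and filesystem are placeholders):
    # Per-user view:
    quota -v username
    # Or, as root, a summary for a whole quota-enabled filesystem:
    repquota /home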
8) Check for permissions problems. Again, truss and such will help you
pinpoint this fairly quickly, but it can still sometimes help to think
"If I were this program, what files would I need, and do I have the
needed access?"
9) Try to eliminate as many variables as you can. Compare across
machines. Do all machines of the same OS type have the same problem?
Consider entire platforms as well as increasingly minor releases of
the software. Also compare across users: Is the problem unique to a
specific user or group of users? If so, why?
10) Check if the program, or the components of the program, have been
modified recently. ls -l `which chmod`, for
example. Also, get a list of libraries used by the program, and see if
they've been updated. You can usually do this with "ldd /bin/ls" or
"odump -Dl /bin/ls" or "dump -X 32 -Tv /bin/ls". Another alternative
is to strings the binary ("strings -a `which
chmod` | grep / | less -sc"), and then checking each of the files and/or
directories the program references.
11) If one system is working, and another is not, compare the md5sums
of the files from step 10 on a working system and a nonworking system.
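One way to do that comparison (the hostnames are placeholders; run the first
command on each machine, then diff the two result files):
    md5sum /bin/ls `ldd /bin/ls | awk '{print $3}' | grep ^/` > /tmp/sums.`hostname`
    # After copying one file across (or via a shared filesystem):
    diff /tmp/sums.goodhost /tmp/sums.badhost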
12) If one user is working, and another user is not, there is a good
chance there's a permissions problem, which again, truss and co. can help
you identify. Another major class of problems comes from differences
in environment variables. To track down this kind of problem, "su
- okuser" followed by "env | sort > /tmp/env.okuser; exit" and
then "su - baduser" followed by "env | sort > /tmp/env.baduser".
You can then "diff -u /tmp/env.okuser /tmp/env.baduser" to determine
what differences the users have in their environments. If there are
a lot of differences, you can binary search on the differences, until
you pinpoint the one that matters. I've also sometimes replaced an
entire environment with that of another user, to see if there is any
variable leading to the trouble, or if it is really something else.
Please note that this sort+diff method isn't perfect, especially
if some environment variables contain newlines. See also http://stromberg.dnsalias.org/~dstromberg/env-search.html
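Once the diff points at a suspect variable, you can test it in isolation
(SOME_VAR and failing-program are placeholders):
    su - baduser
    env SOME_VAR='value copied from the ok user' failing-program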
13) Sometimes it is helpful to set up a cron job or while loop that will
save the status of a particular thing (like "ps axf", "hps", "netstat
-a", "uptime" and so on) in a series of files, named by date +%whatever.
Then when a system finally crashes, you can get some idea of what was
happening at the time, by looking at the last item(s) in your output.
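A minimal while-loop version of this (the commands, interval and filenames
are just examples):
    while true
    do
        stamp=`date +%Y%m%d-%H%M%S`
        ps axf > /var/tmp/ps.$stamp
        netstat -a > /var/tmp/netstat.$stamp
        uptime > /var/tmp/uptime.$stamp
        sleep 60
    done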
14) Sometimes it is helpful to see if a particular kind of problem
is always happening at the same time every day. This tends to lead
to hypotheses like "is it a cron job?" or "Is it a user with regular
behavior?" Checking nagios can help with this.
15) If you're dealing with a network service, try to replicate
the problem (in a minimalist way) by telnet'ing to the port on
the host (optionally, from the client), or using the "ssl-connect"
program to connect to an openssl-encrypted service - see also http://stromberg.dnsalias.org/~dstromberg/ssl-connect.html
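For example (hostnames and ports are placeholders; openssl s_client is a
widely available alternative to ssl-connect):
    # Plain-text service:
    telnet mailhost.example.com 25
    # SSL/TLS-wrapped service:
    openssl s_client -connect webhost.example.com:443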
16) If there is a technologically-enforced licensing scheme involved,
check if any license servers have died, or if any licenses have expired,
or if any license server configuration changes have been made (check
both the license manager(s)' input data, as well as its executable and
dependent libraries - see if any changes have been made recently).
17) Ask users when they first noticed the problem. This can lead to
recalling a change that was made around that time.
18) If you have one group of users with a problem, and another
group of users without a problem, you can binary search their
config file keywords, much as was mentioned above for environment
variable issues. You can also do a quick, rudimentary check of
users' config files using the "classify" program, or my "equivs"
program. classify has more flexible options, but my equivs
program is usually faster on large collections of input files. http://stromberg.dnsalias.org/~dstromberg/software/
19) If you're on an AIX system, and you're seeing strange shared library
conflicts, study up on "loader domains". Question: Do any other *ix's
have "loader domains" or something similar to them?
20) Check any and all relevant logs! If you don't find anything, go
check any logs that have changed recently (works best on relatively
quiet systems). This is triply true if a truss (or similar) shows the
program writing to a log file, or opening a socket or door to syslog.
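One way to find recently-changed logs (GNU find syntax; adjust the age to
taste):
    find /var/log -type f -mmin -60 -exec ls -l {} \;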
21) If you're having trouble finding stuff in your syslog files, consider
combining them into one big file. Also, a script that pulls out anything
you've had trouble with before from your syslog data is a really good
way to be proactive.
22) Don't rule out multi-variable problems or holistic situations
unnecessarily. While it's usually best to initially assume a
single-problem issue, and that reductionistic analysis will work,
eventually solution-resistant problems call for considering things like
"OK, are there two variables (or more) in specific combinations) that
give the failure, while other combinations of the same variables give
working results?" To sum this up in programmer/logician terms, in the
two variable case, sometimes "a and b" yields problems, but sometimes it's
"not a and b" or "a and not b" or "not a and not b".
23) Try getting a backtrace. This may help you, or it may help the
people you request help from. Usually you can do this with "gdb program
[core]" followed by "run -a arg1 arg2 arg3 ... argn" followed by "bt".
Newer gdb's don't seem to want the -a anymore.
24) Try other forms of debugging - whatever's available. If you're a
programmer, you may want to try ddd or similar on C/C++/whatever programs.
If you're troubleshooting an sh/ash/ksh/bash script, try throwing in
"set -x" (and optionally, "set +x") here and there, to put the error
in context. If you're troubleshooting a csh/tcsh script, try putting a
"-x" on the #! line (the first line).
25) If you're on a mixed wordsize (EG 32 bit and 64 bit) system, are you
getting a bad combination of 32 bit and 64 bit libraries at load time?
Or are you seeing libraries that are available for 32 bit systems,
but not for 64 bit systems (or vice-versa)?
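The "file" command will tell you the word size of a binary and of the
libraries the loader actually picks (someprog is a placeholder):
    file `which someprog`
    ldd `which someprog` | awk '{print $3}' | grep ^/ | xargs file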
26) If your OS has a "map the 0th page to something innocuous and
writable" option, go ahead and try it, but be sure to report the crash
to the developers/maintainers anyway. This can sometimes help make null
pointer dereferencing relatively toothless. Some OSes put a "bomb" at the
0th page, so that programmers can catch their errors early. Others don't.
On Solaris 8 (maybe earlier), we have /usr/lib/0@0.so.1 - which you
should sometimes be able to eliminate problems with through LD_PRELOAD.
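For example, on Solaris (assuming that library is present, and with
crashing-program as a placeholder):
    LD_PRELOAD=/usr/lib/0@0.so.1 crashing-program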
27) Can you move the application to another machine, on which it -will- work?
28) Can you upgrade the operating system on the machine(s) that is/are
having problems?
29) Can you put a different operating system on the same hardware, that
will fix the problem? (EG, there are many *ix's that run on x86 hardware.
If you're having problems with NetBSD, maybe try Fedora. If you're
having problems with Fedora, maybe try DragonFlyBSD. If you're having
problems with DragonFly, maybe try SuSE. And so on.) When considering
this, keep in mind that in some environments, it's helpful to cut down
on the number of OSes in play. In others, you can choose whatever's best
for just the single job at hand. Bear in mind that a large number of
OSes means extra labor put into patching, as compared to a small number
of OSes. Some folks like to just compile their own binaries from the same
sources, and there can be a place for that, but don't underestimate the
value of a vendor or distributor doing quality testing on the programs
you're using, in the environment you're using them.
30) A tool like nagios, netreo or bigbrother can help you recognize
patterns in a problem. EG, does it happen at the same time of day,
5 days a week? Is it happening to all the Suns we support?
31) If you are trying to sort out trouble with an RPC service,
rpcinfo is your friend, in addition to some of the other methods.
If you "rpcinfo -p <hostname>", that should tell you what
RPC services the host in question has registered. You can then
"rpcinfo -u <hostname> <rpcservice>" to list the
readiness of the UDP versions of a service, and you can do the same
for the TCP versions of a service with the "-t" option. See also http://stromberg.dnsalias.org/~dstromberg/rpc-health.html
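For example (the hostname is a placeholder; mountd is just a commonly
registered RPC service):
    rpcinfo -p nfsserver.example.com
    rpcinfo -u nfsserver.example.com mountd
    rpcinfo -t nfsserver.example.com mountd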
32) Try ping. :) If a machine isn't pingable, try traceroute or mtr.
traceroute and mtr will be more useful if you've saved a copy of what
they should normally look like in advance - that is, unless you have a
network small enough to know how it's supposed to look without that. :)
Be aware though, that if your network has redundant paths built into it,
sometimes what you saved won't correspond to the path you're seeing at
the time you investigate a problem.
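A simple way to keep such a baseline (the hostname and filenames are
placeholders):
    # Ahead of time:
    traceroute importanthost.example.com > /var/tmp/traceroute.importanthost.baseline
    # During an incident:
    traceroute importanthost.example.com | diff /var/tmp/traceroute.importanthost.baseline -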
33) Check if the problem is DNS-related. Try "dig hostname.uci.edu",
and "dig hostname.uci.edu mx" and "dig -x 128.200.34.1" and such.
Some weird network problems can be traced to slow DNS resolution, say,
because of a down DNS server timing out before a good DNS server answers.
Another common problem: programs that verify that a host has a good
source address will reverse-resolve the client's IP address - and
some of these programs will reject requests from hosts that don't have
proper reverse resolution configured (ask your DNS people about "the PTR
record"). Make sure that your /etc/resolv.conf is set up correctly too.
Also, sometimes what -seems- to be a DNS problem can end up being a bad
entry in the NIS "hosts" map. I recommend that you keep your NIS hosts
map zero-length.
These are from Shane Chen on the OCLUG mailing list, on the subject of
tracking down DNS problems:
* Figure out the condition of your ns servers by pinging them.
Are they up? Is the latency bad? Are they dropping packets?
* Check their performance by manually resolving against them. Something
like `time host google.com ns_server.foo`. How
long is it taking to resolve something? How long does it take to resolve
the same domain if you try another name server (e.g. ns1.earthlink.net)?
* See if there's any difference between pinging a host by FQDN and by IP
(preferably some domain you haven't already resolved via your local name
server - run `host foo.bar ns1.earthlink.net`, then
ping the IP first, followed by the domain).
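Two more quick checks along these lines (the address is a placeholder):
    # Does the client's address reverse-resolve cleanly?
    dig -x 192.0.2.1 +short
    # And which resolvers is this box actually configured to use?
    cat /etc/resolv.conf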
34) Another class of problems can be tracked down to trouble in some form
of name service switch configuration. Some hosts put this information
in /etc/nsswitch.conf, /etc/svc.conf, or even /etc/resolv.conf.
35) If you're having (or suspect you're having) NIS problems, try
ypcat'ing the relevant maps, EG "ypcat passwd". Some weird NIS problems
can be traced back to a corrupted map, or to a map that some OSes require
and others don't (EG, maps that speed up getpwuid lookups through indexing -
a good sniffer is your friend here). Other NIS problems can be traced
to an outdated NIS slave or master that hasn't been updated in a while -
"ypwhich" and "ypwhich -m" can be helpful. You can also get a list of
map aliases with "ypcat -x".
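You can also compare a map's order number (its timestamp) on the master
against a suspect slave with yppoll (the hostnames are placeholders):
    yppoll -h nismaster passwd.byname
    yppoll -h suspectslave passwd.byname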
36) Try to get an easy way of replicating the problem. If it's a
complaint from only a single user, consider using x11vnc or similar
so you can see the problem "first hand" over the network. http://stromberg.dnsalias.org/~dstromberg/vnc.html#addons
37) If you suspect a particular process on a system of causing
load problems, or other forms of problems, one way of testing that
hypothesis is to kill the process. But there's a more subtle way too:
kill -STOP <pid>, monitor how the system changes, and then kill
-CONT <pid> to make the process pick up where it left off.
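For example (the pid is a placeholder):
    kill -STOP 12345          # pause the suspect process
    uptime; vmstat 5 6        # watch load for a bit
    kill -CONT 12345          # let it pick up where it left off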
Less technical items:
1) Post to newsgroups or bulletin boards or mailing lists -relevant-
to the difficulty you're faced with. Seriously consider reading
any relevant FAQ's -first-! Schedule yourself times to check in
on the message thread you've created. Consider hanging around
on that forum a while longer to contribute a couple/few solutions
(or more) yourself, to repay the group for its help. Read this:
http://www.catb.org/~esr/faqs/smart-questions.html !
2) Also, sometimes using some form of chat channel, like IRC or an Instant
Messaging service, can be helpful for quick turnaround, but often will
not give your question exposure to the large number of eyes that a bbs,
mailing list or newsgroup will.
3) Contact the relevant vendor, vendors, author, authors, maintainer,
or maintainers, if any. If you "have no vendor", consider signing up
with one of the many consulting businesses that are springing up, which
specialize in support of other people's opensource software. You like
to be thanked for being helpful; so do the people you're asking for
help from. In the case of an opensource author or maintainer, be sure
to mention how valuable the software system is to you or your clients'
endeavors, if it is.
4) Setting user expectations: I've found that the single most useful
phrase in helping endusers understand the nature of IT jobs, is to say
"OK, that's one hurdle cleared. Now we have to check and see if there
are any others."
5) Smile and try to enjoy your work. This will often spread to your
users in the form of greater user satisfaction. EG, if you grimace on
the phone, sometimes people pick up on that.