Strategies for problem solving

In this chapter I’ll go into detail about some of the strategies I have used in the past to solve problems.

Manual pages, program documentation

There are many places I can turn to for help. Usually one of the first is the manual pages and other program documentation. On Debian I use dpkg -L packagename to obtain a list of all files belonging to a package. Manual pages are located under /usr/share/man. More information can often be found under /usr/share/doc/packagename. Here I look for files whose names start with README. Sometimes there is a package named packagename-doc which contains further documentation for packagename. For self-compiled software I can look in the source archives.

There is also a program called apropos which lets me search for keywords in the descriptions of the manual pages. Within a manual page I can use the slash (/) to search.

In documentation directories I use grep -ri to search.
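The search through a documentation directory can be wrapped in a small helper. This is only a minimal sketch: the function name doc_search is hypothetical, and it assumes the documentation is stored as plain files below a directory such as /usr/share/doc/packagename.

```shell
# doc_search DIR KEYWORD
# (hypothetical helper, for illustration only)
# Lists the README files below DIR, then every file below DIR
# that mentions KEYWORD.
doc_search() {
    dir=$1
    keyword=$2
    echo "README files:"
    find "$dir" -type f -name 'README*' | sort
    echo "Files mentioning '$keyword':"
    # -r: recurse, -i: ignore case, -l: print file names only
    grep -ril -- "$keyword" "$dir" | sort
}
```

A call could look like doc_search /usr/share/doc/packagename locale. Many files under /usr/share/doc are gzip-compressed. Plain grep does not find text in those, so for individual compressed files zgrep may be needed.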

Unfortunately this won’t help if I have removed nearly all documentation due to a shortage of space.

Internet, search engines

Another easy strategy, which I also often use as a starting point, is to search the internet. Often someone else has had the same problem and has perhaps even found a solution to it, which is even better. I limit myself to about 10 to 15 minutes when searching the internet for a solution. To do this I have to narrow the problem down far enough to come up with suitable search terms. I often find the necessary information in the system logs.

I look for keywords in those log lines which point to the problem and copy them to the search form in the browser. Often I take the whole line and just remove the parts that are probably different on other machines (date, pid, computer name, addresses, …). I look in the results to see whether these describe my problem or - even better - contain a solution. If I get too many results, I look at those that come closest to my problem for further keywords. It’s great when I am able to solve my problem. I note it down in my journal and then get on with my day.
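Removing the machine-specific parts of a log line can be partly automated. The following is a sketch under assumptions: the function name sanitize_log_line is hypothetical, and the pattern assumes the traditional syslog layout (month, day, time, host name at the start of the line, PIDs in square brackets). Addresses and other details still have to be removed by hand.

```shell
# sanitize_log_line
# (hypothetical helper) Reads syslog lines on stdin and removes
# the leading timestamp and host name as well as PIDs like [17414].
sanitize_log_line() {
    sed -E \
        -e 's/^[A-Z][a-z]{2} +[0-9]+ [0-9:]{8} [^ ]+ //' \
        -e 's/\[[0-9]+\]//g'
}
```

I could use it as tail -n 1 logfile | sanitize_log_line, where logfile stands for whichever log file contains the message, and paste the result into the search form.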

For example I found the following log line on one of my routers:

Dec  5 05:17:01 baas authpriv.err CRON[17414]: \
pam_env(cron:session): Unable to open env file: \
/etc/default/locale: No such file or directory

Hence I used the following keywords for the internet search:

authpriv CRON pam_env "Unable to open env file" \
"No such file or directory"

I didn’t come up with much using Google, but DuckDuckGo gave me information about Debian bug #442049 among other things. I entered this number into the bug database. The solution in this case was easy. The missing file /etc/default/locale belonged to a locales package which wasn’t used on the router. pam_env was trying to access this file without checking its existence. The workaround:

# touch /etc/default/locale

If I can’t find a solution in the manual pages or through an internet search, I try to solve the problem alone or at least to narrow it down so that I can ask a specific question in one of the support forums. Here I proceed differently depending on whether I suspect the problem to be more on the local machine or on the network.

Strategies for local problems on the computer

When the problem is that a program doesn’t work correctly or even at all, I have to observe what it’s doing in the first place. I have already looked in the system logs and maybe have an idea of what may be the cause.

If I suspect a problem with a shell script, I can start it with sh -x scriptname and get more information about the sequence of actions. This may already help or point me towards another program that I should be looking closer at.
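With longer scripts the trace quickly scrolls off the screen, so it can help to write it to a file. A minimal sketch, with the hypothetical function name trace_script:

```shell
# trace_script SCRIPT TRACEFILE
# (hypothetical helper) Runs SCRIPT under "sh -x". The trace is
# written to stderr, so redirecting stderr collects it in
# TRACEFILE while the normal output stays on stdout.
trace_script() {
    sh -x "$1" 2> "$2"
}
```

After trace_script scriptname scriptname.trace I can read the trace at leisure. Note that the error messages of the script itself end up in the same file, because they are written to stderr as well.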

For problems with Perl programs I can use perl -d scriptname to start the Perl debugger. This requires at least a basic knowledge of this programming language and the Perl debugger.

Tip

The manual pages of Unix and Linux systems are traditionally grouped in sections. Popular sections are

1 for executable programs

2 for system calls

3 for library functions

5 for file formats

8 for executable programs for system administration

If there are manual pages with the same name in different sections, man shows the first one it finds in its section search order. To get a page from a specific section I put the section before the name of the page:

$ man 2 open

If I want to look at all of the pages with the same name, I can use the option -a:

$ man -a open
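To see in advance which sections contain a page with a given name, the directories below the manual path can be inspected. The following sketch assumes the usual layout with one directory manN per section below /usr/share/man; the function name man_sections is hypothetical.

```shell
# man_sections NAME [MANDIR]
# (hypothetical helper) Prints the sections in which a page
# called NAME exists. MANDIR defaults to /usr/share/man.
man_sections() {
    name=$1
    mandir=${2:-/usr/share/man}
    for page in "$mandir"/man*/"$name".*; do
        [ -e "$page" ] || continue   # the pattern matched nothing
        dir=${page%/*}               # e.g. /usr/share/man/man2
        echo "${dir##*/man}"         # e.g. 2
    done
}
```

The command whatis (or man -f) gives similar information together with the one-line descriptions, provided its index database is installed and up to date.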

When I come across a binary program (ELF executable), I can use strace to get an overview of the system calls. To do this I start strace with the command line strace -f -o xyz.strace programname or strace -f -o xyz.strace -p pid if the program is already running and has pid as its process id. Then I find the system calls of the program in the text file xyz.strace and may already see the cause of the problem. If the program generates error messages, I search for these messages in xyz.strace and see what has happened immediately before this. This method allows me to easily localize problems with access rights. Of course I need to have knowledge about the system calls which I can find in section two of the manual pages.
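Searching the trace file can be sketched with two small helpers. The function names are hypothetical; the patterns assume the usual strace line format in which a failed call ends in "= -1" followed by the name of the error code.

```shell
# failed_syscalls TRACEFILE
# (hypothetical helper) Prints all system calls that returned -1
# together with an error code such as EACCES or ENOENT.
failed_syscalls() {
    grep -E ' = -1 E[A-Z0-9]+ ' "$1"
}

# before_message TRACEFILE MESSAGE [LINES]
# (hypothetical helper) Prints the line containing MESSAGE and the
# LINES lines before it (default 10).
before_message() {
    grep -F -B "${3:-10}" -- "$2" "$1"
}
```

By default strace abbreviates strings to 32 characters, so I either search for a short fragment of the error message or raise the limit with the option -s when recording the trace.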

If a program crashes, I can try to produce a core dump and analyze this with a debugger. To do this I must tell the system to write a core dump when a program crashes:

$ ulimit -c 1000000

Here the option -c 1000000 sets the maximum size of the core dump file that may be written; bash counts this value in blocks of 1024 bytes. The exact meaning of this command is in the man page of the shell (for instance man bash and then search for ulimit).
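Because the limit only applies to the current shell and the programs started from it, it is worth checking before reproducing the crash. A minimal sketch with the hypothetical function name core_dumps_enabled; programname is a placeholder.

```shell
# core_dumps_enabled
# (hypothetical helper) Succeeds if the current shell allows core
# files of a non-zero size.
core_dumps_enabled() {
    [ "$(ulimit -c)" != "0" ]
}

# Typical sequence (programname is a placeholder):
#   ulimit -c unlimited
#   core_dumps_enabled && ./programname
#   gdb ./programname core
```

Where the kernel writes the core file and what it is called depends on the setting in /proc/sys/kernel/core_pattern, so the file is not necessarily named core in the current directory.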

If I want information about running programs, I can use lsof, strace and fuser depending on the kind of information I want to get.

If I suspect missing or wrong libraries, I can get help from ldd.
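ldd marks libraries it cannot resolve with the words "not found", which makes them easy to filter. A sketch with the hypothetical function name ldd_missing:

```shell
# ldd_missing
# (hypothetical helper) Reads the output of ldd on stdin and
# prints the names of the libraries that could not be found.
ldd_missing() {
    awk '/=> not found/ { print $1 }'
}

# Usage (the path is a placeholder):
#   ldd /path/to/program | ldd_missing
```

If the list is empty but the program still misbehaves, the full ldd output shows which path each library is actually loaded from.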

Troubleshooting mount problems

The following example of tracking down a problem with the read-only mount of the root partition is taken from a real case and should clarify the procedure.

On the computer in question I noticed the following line on the console:

Remounting / as read-only ... mount: / is busy

After logging in I could verify this with:

# mount
...
/dev/hda2 on / type ext2 (rw,noatime,errors=continue)
...
# remountro
mount: / is busy

This was not what I wanted for this machine. So I had to find out which processes prevented the read-only mount. For the most part these are processes which have opened a file for writing on the partition in question. I used the program fuser to find out which processes:

# fuser -vm /
                     USER        PID ACCESS COMMAND
/:                   root     kernel mount /
...
                     root       1467 Frce.  dhclient
...

In this case it was only one process and it still had a file open. I could verify this by stopping the process and then remounting the root partition read-only using remountro.

The next step was to find out which file this process kept open, so that I could move this file to some other place if necessary. The program lsof helped me do this:

# lsof -p 1431
COMMAND   PID USER  FD  TYPE..NAME
dhclient 1431 root cwd   DIR../
dhclient 1431 root rtd   DIR../
dhclient 1431 root txt   REG../sbin/dhclient
dhclient 1431 root mem   REG../lib/libnss_files-...
dhclient 1431 root mem   REG../lib/libc-2.11.2.so
dhclient 1431 root mem   REG../lib/ld-2.11.2.so
dhclient 1431 root   0u  CHR../dev/null
dhclient 1431 root   1u  CHR../dev/null
dhclient 1431 root   2u  CHR../dev/null
dhclient 1431 root   3w  REG../var/lib/dhcp/dhclient\
.eth0.leases
dhclient 1431 root   4u pack..type=SOCK_PACKET
dhclient 1431 root   5u IPv4..*:bootpc

I had restarted the computer to restore the same conditions and this time the dhclient process had the PID 1431. The process kept /var/lib/dhcp/dhclient.eth0.leases open for writing. Thus this file had to be moved to another place.
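With processes that keep many files open, the interesting lines can be filtered from the lsof output. The following sketch assumes the default lsof columns, with the file descriptor in the fourth column and the type in the fifth; the function name files_open_for_writing is hypothetical.

```shell
# files_open_for_writing
# (hypothetical helper) Reads "lsof -p PID" output on stdin and
# prints the regular files that are open for writing (w) or for
# reading and writing (u).
files_open_for_writing() {
    awk '$4 ~ /^[0-9]+[wu]/ && $5 == "REG" { print $NF }'
}

# Usage:
#   lsof -p 1431 | files_open_for_writing
```

The sketch prints only the last field of each line, so file names containing spaces are cut off.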

I first tried adding /var/lib/dhcp to the variable VOYAGE_SYNC_DIRS in the file /etc/default/voyage-util. Unfortunately this didn’t work:

# mount
...
/dev/hda2 on / type ext2 (rw,noatime,errors=continue)
...
tmpfs on /var/lib/dhcp type tmpfs (rw,no...,mode=755)
# fuser -vm /
...
                      root       1440 Frce. dhclient
...
# lsof -p 1440
COMMAND   PID USER  FD TYPE..NAME
dhclient 1440 root cwd  DIR../
...
dhclient 1440 root   3w REG../var/lib/dhcp/dhclien...

The file dhclient.eth0.leases was still open in the root partition even though its path referred to another file system. In this case it turned out that a race condition caused the DHCP client to open the file before /var/lib/dhcp was mounted as a tmpfs. The ultimate solution was to make sure the network was initialized after voyage-sync was executed.

Later I found out that with Voyage Linux /var/lib/dhcp is actually a link to /lib/init/rw/var/lib/dhcp and the tmpfs under /lib/init/rw is set up much earlier. I don’t know why it wasn’t the case with this machine.

Strategies for network problems

A very good book for troubleshooting network problems is Network Troubleshooting Tools by Joseph D. Sloan.

I divide problems in the network arbitrarily into completely defective connections, partially defective connections, and random dropouts or performance problems. I also check for faults in this order.

First of all I check whether there is any connection between the affected systems. I can use ping, for example, to make an initial assessment.

If this doesn’t prove fruitful, I check whether the addresses of both systems are set correctly in terms of the network and if there are routes to the respective network in case both machines are not on the same network segment. The programs netstat, ifconfig, ip, route and arp help me do this:

$ /sbin/ifconfig
...
eth0      Link encap:Ethernet  HWaddr 00:0d:b9:21:...
          inet addr:192.168.1.254  Bcast:192.168.1...
          inet6 addr: fe80::20d:b9ff:fe21:715c/64 ...
          UP BROADCAST RUNNING MULTICAST MTU:1500 ...
          RX packets:1755606 errors:0 dropped:75 o...
          TX packets:2584367 errors:0 dropped:0 ov...
          collisions:0 txqueuelen:1000 
          RX bytes:212300464 (202.4 MiB) TX bytes:...
          Interrupt:10 Base address:0xc000
...
$ ip addr show
...
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 150...
    link/ether 00:0d:b9:21:71:5c brd ff:ff:ff:ff:f...
    inet 192.168.1.254/24 brd 192.168.1.255 scope ...
    inet6 fe80::20d:b9ff:fe21:715c/64 scope link 
       valid_lft forever preferred_lft forever
...

The output from ip is shorter but contains the information necessary for a diagnosis, so I prefer this tool. It shows me the IP address, network mask and Ethernet MAC address. arp enables me to see whether the address of the computer has been added to the ARP cache:

$ ping 192.168.1.254
$ /usr/sbin/arp -n 192.168.1.254
Address      HWtype HWaddress        Flags Mask Iface
192.168.1.254 ether 00:0d:b9:21:71:5c C          eth0

If ping does not work but the MAC address appears in the ARP cache, this points to a host firewall suppressing the ping (ICMP echo) messages.
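The combination of ping and the ARP cache can be sketched as a small script. Both function names are hypothetical. The sketch uses ip neigh instead of arp to read the neighbour cache, and it only makes sense for hosts on the same network segment, because only those appear in the ARP cache.

```shell
# interpret_reachability PING_RESULT ARP_RESULT
# (hypothetical helper) PING_RESULT is "ok" or "fail",
# ARP_RESULT is "mac" or "nomac".
interpret_reachability() {
    case "$1:$2" in
        ok:*)     echo "reachable" ;;
        fail:mac) echo "up but ping filtered" ;;
        *)        echo "no contact" ;;
    esac
}

# check_host ADDRESS
# (hypothetical helper) Sends one ping, then looks for a MAC
# address (lladdr) in the neighbour cache.
check_host() {
    host=$1
    if ping -c 1 -W 2 "$host" > /dev/null 2>&1; then
        ping_result=ok
    else
        ping_result=fail
    fi
    if ip neigh show "$host" 2> /dev/null | grep -q lladdr; then
        arp_result=mac
    else
        arp_result=nomac
    fi
    interpret_reachability "$ping_result" "$arp_result"
}
```

A call such as check_host 192.168.1.254 then prints one of the three verdicts. The verdict "up but ping filtered" is only a hint, as described above, not a proof.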

If the connection spans different networks, I first use ping to see whether I can reach the gateway to the other network. I determine the gateway using netstat -r (or route, which delivers the same output) or ip route show:

$ netstat -rn
Kernel IP routing table
Destination Gateway       Genmask       Flags MSS ...
192.168.1.0 0.0.0.0       255.255.255.0 U       0 ...
0.0.0.0     192.168.1.254 0.0.0.0       UG      0 ...
$ ip route show
192.168.1.0/24 dev eth0 proto kernel scope link sr...
default via 192.168.1.254 dev eth0
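The default gateway can be picked out of the ip route show output and passed directly to ping. A minimal sketch with the hypothetical function name default_gateway:

```shell
# default_gateway
# (hypothetical helper) Reads "ip route show" output on stdin and
# prints the address of the first default gateway.
default_gateway() {
    awk '$1 == "default" && $2 == "via" { print $3; exit }'
}

# Usage:
#   ping -c 3 "$(ip route show | default_gateway)"
```

If the destination is reached through a more specific route than the default route, ip route get followed by the destination address shows which gateway is actually used.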

If the connection runs through several networks I can try to discover the path using traceroute. Here I have to remember that traceroute may be disrupted by packet filters which suppress ICMP messages, or by NAT. If I keep this in mind, traceroute can sometimes help me locate the fault.

If the addresses and routes are configured correctly and the gateways - if necessary - reachable, I get out the big gun and monitor the line on both ends to see whether the datagrams are being sent and received at all. I use tcpdump and/or wireshark to do this.

If I see more datagrams on one computer than on the other, I can act on the assumption that there are packet drops or a firewall in the network. Then I can leave both computers as they are and turn my attention to the network.

If I see the same datagrams on both computers but one of them does not send a reply, I can assume that there is a packet filter on that computer. This can be verified and corrected with iptables.
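To compare both ends it helps to use the same capture filter on each machine. The following sketch builds such a filter for tcpdump; the function name capture_filter, the interface name and the file names are hypothetical, and the addresses are placeholders. If NAT is involved, the addresses differ between the two ends and the filter has to be adapted.

```shell
# capture_filter HOST_A HOST_B [PORT]
# (hypothetical helper) Prints a tcpdump filter expression for
# the traffic between two hosts, optionally limited to one port.
capture_filter() {
    filter="host $1 and host $2"
    if [ -n "$3" ]; then
        filter="$filter and port $3"
    fi
    echo "$filter"
}

# Usage on each of the two machines (run as root):
#   tcpdump -n -i eth0 -w side-a.pcap "$(capture_filter 192.168.1.10 192.168.2.20 80)"
```

The files written with -w can afterwards be opened in wireshark and compared side by side.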

If ping works between two computers, this does not necessarily mean that I can reach the service that I want to use.

For security reasons many services are bound to the loopback interface (address 127.0.0.1) after installation and are therefore not available through the network. I can use netstat -ntl for TCP services and netstat -aun for UDP services to verify this. Here I should see the external IP address or 0.0.0.0 in the column Local Address, followed by a colon (:) and the port number of the service. If I don’t see this, I must look in the configuration of the service. If the service is bound to the external interface and still doesn’t answer, I use iptables to see if there are any filter rules that prevent it from answering. If there are no filter rules I can look in the files /etc/hosts.allow and /etc/hosts.deny.
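The check of the Local Address column can be sketched as a filter for the netstat output. The function name listening_on is hypothetical, and the sketch assumes the column layout of netstat -ntl without further options.

```shell
# listening_on PORT
# (hypothetical helper) Reads "netstat -ntl" output on stdin and
# prints the local addresses on which PORT is in state LISTEN.
listening_on() {
    awk -v port="$1" '
        $6 == "LISTEN" {
            # the port is the part after the last colon
            n = split($4, part, ":")
            if (part[n] == port) print $4
        }'
}

# Usage:
#   netstat -ntl | listening_on 25
```

If the only line printed starts with 127.0.0.1 (or ::1 for IPv6), the service is bound to the loopback interface and cannot be reached through the network.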

Finally I can monitor the network interface using tcpdump and see how the computer reacts to connection requests. strace allows me to see whether the datagrams arriving at the interface are handled by the right server process.

If I do have a connection to the service but there are performance problems or dropouts, then I must capture the whole session and analyze it with wireshark.

If I assume there are network problems, I can use ping -f or a program like iperf to do a performance test.
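For a quick first figure the packet loss can be read from the summary line that ping prints at the end. A sketch with the hypothetical function name packet_loss, assuming the summary format of the common Linux ping:

```shell
# packet_loss
# (hypothetical helper) Reads the output of ping on stdin and
# prints the packet loss percentage from the summary line.
packet_loss() {
    sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}

# Usage (the address is a placeholder):
#   ping -c 20 192.168.1.254 | packet_loss
```

Unlike ping -f, a plain ping with -c does not need root privileges, but it also puts far less load on the network, so it only shows gross losses.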

If I have eliminated all of the other problems so far and tcpdump is showing me that there is a connection to the service despite the error messages, I must analyze the protocol of the service. I can use the system logs together with a sufficient debugging level if the service supports this. Otherwise, with cleartext protocols, I can look at the captured session with wireshark (option Follow TCP Stream). With cleartext protocols I can also start a session manually with netcat or telnet, and for SSL connections with openssl:

$ openssl s_client -connect webserver:443
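Instead of typing the protocol by hand, a prepared request can be piped into the connection. The following sketch assumes that the service is an HTTP server; the function name http_head_request is hypothetical and webserver is a placeholder.

```shell
# http_head_request HOST
# (hypothetical helper) Prints a minimal HTTP/1.0 HEAD request
# with the line endings (CR LF) that the protocol requires.
http_head_request() {
    printf 'HEAD / HTTP/1.0\r\nHost: %s\r\n\r\n' "$1"
}

# Usage:
#   http_head_request webserver | nc webserver 80
#   http_head_request webserver | openssl s_client -quiet -connect webserver:443
```

Depending on the netcat variant, an additional option may be needed to make it wait for the answer after the input has ended.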

A method for analyzing the TFTP protocol can be found in the section on this protocol in the chapter about protocols and mechanisms.