Components of a Linux system

In this chapter I will go into detail about the components that make up a Linux system. To do this, I divided the system into the following areas:

Kernel: The glue that holds everything together.
File system: The place where everything (programs and data) is permanently deposited.
RAM: All of the programs and data must be copied here before they can be used. Most of the time little consideration is given to this topic. Because these machines only have a little RAM, I believe a few thoughts are appropriate.
I/O system: This provides the basis for communicating with the environment via the network, serial connections, keyboard, screen and other means.
System programs: All of the programs which aren’t vital for the intended use of the machine, but are instead there to maintain stable operation.
User programs: All programs which are vital for the intended use of the machine.

Of course these areas overlap and some programs may be considered to be a system program or a user program - depending on what they are being used for.

Kernel

The kernel is - rightly so - only considered a component of the total system. Efforts are being made, for instance with Debian, to make the operating system usable with the BSD kernel or with Hurd. Nevertheless the kernel plays a central role in the overall system as the gateway between hardware and software, as communication interface between the various processes running and as the entity that allocates the resources to the processes.

Of particular interest is how well the kernel supports the hardware in the ALIX machines, and whether I need external kernel modules or can get by with userland programs.

Tip

The Geode LX800 processor is not compatible with i686.

Since there are no more i586 software packages from Debian, I have to install the i486 versions.

Kernel modules

The standard kernel includes GPL drivers for all of the hardware plugged into the ALIX boards.

leds-alix

This module is used to control the LEDs on the boards. The ledtrig-* modules provide activation triggers. Since kernel version 2.6.30, these modules have been included in the standard kernel. That means I don’t need separate modules for Debian 6 Squeeze. For older kernels (for instance from Debian 5 Lenny), I need the package leds-alix-source, which I can compile and install with module-assistant (m-a).
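Once the LED driver and a trigger module are loaded, the LEDs can be driven from the shell through the kernel's LED class in sysfs. The following is a minimal sketch: the function name led_trigger and the LED name alix:1 are assumptions, so check `ls /sys/class/leds/` for the names your kernel actually exports.

```shell
# Sketch: select the trigger of an LED through the sysfs LED class.
# LEDS_DIR can be overridden for testing; the default is the real sysfs path.
LEDS_DIR="${LEDS_DIR:-/sys/class/leds}"

led_trigger() {
    # $1 = LED name (assumed here: alix:1), $2 = trigger (e.g. heartbeat, none)
    led="$LEDS_DIR/$1"
    if [ ! -w "$led/trigger" ]; then
        echo "LED $1 not found or not writable" >&2
        return 1
    fi
    echo "$2" > "$led/trigger"
}
```

As root, `led_trigger alix:1 heartbeat` would let the first LED blink with the system load, provided the ledtrig-heartbeat module is available.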

rtc

I need this module to access the hardware clock.

geode-aes, geode-rng

The AMD Geode LX800 processor, which is built into some of the ALIX boards, contains an on-chip 128-bit AES crypto acceleration block and a true random number generator. Using these is faster than the corresponding software implementations and frees up the CPU for other purposes. To use the hardware random number generator I need rngd from the rng-tools package.

geodewdt

A watchdog driver is available starting in kernel version 2.6.33 which restarts the computer automatically if it gets stuck.

i2c_core

This module for the I²C bus is needed to access the sensors.

lm90

This is the driver for the sensor chip. Currently it identifies an LM86 sensor.

scx200_acb

This is the driver for the ACCESS bus of the Geode processors and the CS5536 chips.

cs5535_gpio

The GPIO module to access the LEDs and the button is included in kernel version 2.6.33 and above. For older kernels it must be compiled as an additional kernel module.

cs5536, pata_cs5536

Modules for the compact flash “hard disk”. Depending on the kernel version and kernel options, the CF card is accessible either as /dev/hda or /dev/sda. This can lead to problems at boot time if you want to test another kernel version. In this case it would be better to identify the disk using its UUID or label.

cs5536	IDE stack -> /dev/hda
pata_cs5536	libata (PATA) stack -> /dev/sda

via_rhine

The ethernet module.

File system

The file system is where everything that makes up the system (programs and data) is stored permanently. It pays to look at it a little more closely. On the one hand, the various file systems are more or less suitable for different media. Particularly with flash media, I want to avoid writing to the same spot over and over, and I want to keep at least the operating system data read-only to minimize the damage caused by excessive write access. On the other hand, some programs need permanent storage to keep data across a reboot or because there is not enough RAM available.

If I have mounted the root file system read-only there are still ways to enable write access for processes:

With overlay file systems like AUFS
By mounting writable file systems like tmpfs at certain mount points
With symbolic links to writable file systems

SquashFS

This is a compressed read-only file system for Linux version 2.4 and above. The kernel accesses this file system through a kernel module as a virtual file system (VFS). SquashFS has been included in the standard kernel since kernel version 2.6.29. For older kernel versions a separate module has to be compiled.

The complete UIDs and GIDs as well as the file creation times are stored in SquashFS. Duplicate files are stored only once. Files are compressed with deflate (zlib) or with the more effective Lempel-Ziv-Markov chain algorithm (LZMA). SquashFS is often used together with UnionFS or the more modern AUFS to allow the processes at least temporary write access.

To work with SquashFS on Debian you have to install the package squashfs-tools, which contains the programs mksquashfs for creating a SquashFS and unsquashfs for extracting one without mounting it. On MS Windows, 7-Zip offers at least read access.

For initial experiments you can convert a part of your file system into a SquashFS:

# mksquashfs /usr/local /mnt/local.sqsh
...
# du -s /mnt/local.sqsh /usr/local
94156   /mnt/local.sqsh
219252  /usr/local/
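
The two sizes in the listing above can be turned into a compression ratio with a little shell arithmetic. This is only a sketch; the helper names size_kib and squash_ratio are made up for this example.

```shell
# Size of a file or directory tree in KiB, as reported by du.
size_kib() {
    du -sk "$1" | cut -f1
}

# Size of the image as a percentage of the original (integer division).
# $1 = size of the SquashFS image, $2 = size of the original tree.
squash_ratio() {
    echo $(( $1 * 100 / $2 ))
}
```

With the values from the listing, `squash_ratio 94156 219252` prints 42, so the image takes up a little over two fifths of the original space. On a live system the call would be `squash_ratio "$(size_kib /mnt/local.sqsh)" "$(size_kib /usr/local)"`.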

The newly formed SquashFS is already considerably smaller than the original file system. Since I want to try out the SquashFS together with the overlay file system AUFS, I create three mount points:

# mkdir /mnt/local
# mkdir /mnt/local-ro
# mkdir /mnt/local-rw
# mount /mnt/local.sqsh /mnt/local-ro -t squashfs \
  -o loop
# mount -t aufs \
  -o dirs=/mnt/local-rw=rw:/mnt/local-ro=ro \
  aufs /mnt/local
# mount
...
/dev/loop0 on /mnt/local-ro type squashfs (rw)
aufs on /mnt/local type aufs \
(rw,dirs=/mnt/local-rw=rw:/mnt/local-ro=ro)

The SquashFS is mounted at /mnt/local-ro/ where it is read-only. I want to use this file system read-write at /mnt/local/. AUFS redirects all my write access to /mnt/local-rw/:

# echo foo> /mnt/local-ro/var/foo
-bash: /mnt/local-ro/var/foo: Read-only file system
# echo foo> /mnt/local/var/foo
# diff -r /mnt/local-ro /usr/local
# diff -r /mnt/local /usr/local
Only in /mnt/local/var: foo
# cat /mnt/local/var/foo 
foo
# cat /mnt/local-rw/var/foo 
foo

I can’t write to /mnt/local-ro/, even though mount showed that it was mounted read-write, since this is impossible with SquashFS. Therefore I create the file in /mnt/local/ and find it eventually in /mnt/local-rw/.
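
Because every write ends up in the writable branch, listing that branch shows exactly what has changed since the overlay was mounted. A small sketch, assuming that AUFS keeps its internal bookkeeping in entries whose names start with .wh. (whiteouts); the function name overlay_changes is hypothetical.

```shell
# List the regular files that were created or modified through the overlay.
# $1 = writable branch, e.g. /mnt/local-rw
overlay_changes() {
    # Entries named .wh.* are AUFS bookkeeping and are skipped.
    ( cd "$1" && find . -name '.wh.*' -prune -o -type f -print | sort )
}
```

After the experiment above, `overlay_changes /mnt/local-rw` should list ./var/foo.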

Overlay root file system with AUFS

With this solution the root file system gets mounted read-only and an AUFS is laid over the whole directory tree, redirecting all write access to a temporary file system in RAM (tmpfs). The whole thing is set up by a script in the InitRD which moves the root partition to /ro/, mounts a tmpfs at /rw/ and an AUFS at /, which reads from /ro/ and writes to /rw/. This script is activated by the kernel command line argument aufs=tmpfs. Furthermore, the script creates two other scripts named remountrw and remountro, which remount the root partition at /ro/ in read-write or read-only mode. I set up my first ALIX systems this way.

Everything that any program writes into any file ends up in tmpfs, as long as there is enough RAM available, and disappears after a reboot. If I want to make some changes permanent, I have to call remountrw, move the changes from /rw/ to /ro/ and then call remountro.
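
Moving the changes by hand gets tedious, so the copy step can be scripted. The following is a rough sketch and not the script from the InitRD: the function name persist_changes is hypothetical, only regular files are handled, and deletions (recorded by AUFS as .wh.* whiteout entries) are deliberately not replayed.

```shell
# Copy changed regular files from the tmpfs branch to the root branch.
# $1 = writable branch (e.g. /rw), $2 = root partition branch (e.g. /ro)
persist_changes() {
    ( cd "$1" && find . -name '.wh.*' -prune -o -type f -print ) |
    while IFS= read -r f; do
        # Create the target directory, then copy with mode and timestamps.
        mkdir -p "$2/$(dirname "$f")"
        cp -p "$1/$f" "$2/$f"
    done
}
```

The call would be wrapped as `remountrw; persist_changes /rw /ro; remountro`. It is worth reviewing the file list first, since log files and other runtime data would be copied as well.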

One advantage of this solution is that I don’t have to think about which program is trying to write which file. Everything goes into tmpfs and is gone after rebooting.

The downside is that system upgrades also land under /rw/ and have to be copied afterwards to /ro/. Alternatively I can start the system without the kernel option aufs=tmpfs and boot the system just like any normal system. This is not a very good option for systems which are supposed to run permanently. Therefore I turn to the next solution which is used, among others, by Voyage Linux.

Read-only root file system with multiple tmpfs

The central idea of this solution is to mount the root file system read-only and to not allow any process to write to this file system. For directories that traditionally contain writable files (e.g. /var/run/) I mount tmpfs (RAM disks) at these points to allow write access.

For the directories /var/run/ and /var/lock/ this is already available in standard Debian Linux if I set the following in the file /etc/default/rcS:

# /etc/default/rcS
# ...
RAMRUN=yes
RAMLOCK=yes

This is even better supported in Voyage Linux. Here I can specify further directories to be mounted as tmpfs in the file /etc/default/voyage-util by adding them to the variable VOYAGE_SYNC_DIRS. These directories are also saved automatically at shutdown and filled with the saved files at system boot. If I want to save the files in these directories manually, I do it like this:

# remountrw
# /etc/init.d/voyage-sync sync
# remountro

With this solution I have to think about which directories should be writable (for instance the directory with the leases of the DHCP daemon). But afterwards a system upgrade is as easy as this:

# remountrw
# apt-get update && apt-get upgrade
# /etc/init.d/voyage-sync sync
# remountro

This is why I moved away from the AUFS solution and am now using Voyage Linux on my ALIX machines.

The package flashybrid on standard Debian GNU/Linux functions similarly to voyage-util on Voyage Linux.

Tip

The package flashybrid from Debian isn’t as low maintenance as voyage-util from Voyage Linux. With a few adjustments, however, it does do what I want it to do.

After installing the package I set the variable ENABLED=yes in the file /etc/default/flashybrid.
I create a directory /ram/ under which flashybrid mounts all the tmpfs.
I configure the maximum RAM for the tmpfs in the file /etc/flashybrid/config.
I determine which directories are to be provided as RAM-disk in the file /etc/flashybrid/ramstore. These are filled from the root partition at boot time and saved at shutdown with fh-sync.
I configure all of the directories which only contain temporary files in /etc/flashybrid/ramtmp.
To make sure that /etc/init.d/flashybrid gets started at boot time, I use insserv flashybrid.
Some services are started before flashybrid and keep files in the root file system open. To close these files, I have to use this workaround in /etc/rc.local:
```
/etc/init.d/rsyslog restart
/etc/init.d/cron restart
/etc/init.d/nfs-common restart
/etc/init.d/portmap restart
mountro
```
To find out which services have to be restarted, please see chapter Strategies for problem solving.
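
One possible way to spot such services, independent of what the referenced chapter describes, is to look for files that are open for writing on the root file system. This sketch assumes the lsof package is installed and relies on its default column layout, where the fourth column holds the file descriptor followed by its access mode (w for write, u for read and write). The function name open_for_write is hypothetical, and file names containing spaces are truncated to their last word.

```shell
# Read lsof output on stdin and print "COMMAND NAME" for every file
# that is open for writing.
open_for_write() {
    awk 'NR > 1 && $4 ~ /^[0-9]+[uw]/ { print $1, $NF }' | sort -u
}
```

Running `lsof / | open_for_write` as root before calling mountro lists the candidates for a restart.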

Flashybrid provides the commands mountro, mountrw and fh-sync, which have the same function as their corresponding commands in Voyage Linux.

Writable file systems

If I need a writable file system for my project, I recommend a modern CF card (CompactFlash 5.0 or later) that supports the TRIM command. Together with a suitable file system (btrfs, ext4, FAT or GFS2) and a current kernel (version 2.6.33 or later), the operating system can tell the CF card which sectors are no longer needed, so the card does not have to copy them when it reorganizes its flash blocks. By doing this, and by leaving some space free, I can extend the lifetime of the CF card.
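
Whether the kernel considers a card capable of discarding sectors can be read from sysfs on kernels that expose the discard_max_bytes attribute of the block queue. A minimal sketch; the function name supports_discard and the SYSFS_BLOCK override are assumptions made for this example.

```shell
# SYSFS_BLOCK can be overridden for testing; the default is the real path.
SYSFS_BLOCK="${SYSFS_BLOCK:-/sys/block}"

# Succeeds if the kernel reports a non-zero discard size for the device.
# $1 = block device name without /dev/, e.g. sda
supports_discard() {
    f="$SYSFS_BLOCK/$1/queue/discard_max_bytes"
    [ -r "$f" ] && [ "$(cat "$f")" -gt 0 ]
}
```

For example, `supports_discard sda && echo "TRIM available"`. A value of 0, or a missing file on older kernels, means no discard requests are passed to the card.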

Identifying partitions with UUID or label

Depending on the kernel version and options, the first CF card will either be called /dev/hda or /dev/sda. Thus the system boot may fail if I just want to test a new kernel. In this case it may be beneficial to identify the partition using its UUID. To do this, I do the following:

After the machine has booted I look in /dev/disk/by-uuid/ to find out which UUIDs the individual partitions have:

$ ls -l /dev/disk/by-uuid/
total 0
lrwxrwxrwx 1 root root 10 2011-11-25 07:43
f779141e-e3b1-4521-9333-9dde9de0b64f -> ../../sda1

(Output is wrapped for better legibility.)

Afterwards I change the entry for /dev/sda1 in the file /etc/fstab to UUID=f779141e-e3b1-4521-9333-9dde9de0b64f and do the same for the other partitions. I change the kernel option root in the Grub boot entry (file /boot/grub/menu.lst) accordingly:

root=/dev/disk/by-uuid/f779141e-...-9dde9de0b64f

Another possibility is to use filesystem labels. I can use e2label, for instance, to write these on an ext2fs.

# e2label /dev/sda1 rootfs

I change the line for the root file system in /etc/fstab like this:

LABEL=rootfs   /  ext3    errors=remount-ro 0 1

And the Grub boot entry looks like this for the kernel option root:

root=LABEL=rootfs
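
Editing /etc/fstab by hand is error-prone when several partitions are involved, so the replacement of a device node by its UUID can be scripted. A sketch under the assumption that the device node is the first field of the fstab line and is not preceded by whitespace; the function name fstab_use_uuid is hypothetical.

```shell
# Rewrite fstab (read from stdin) so that one device is referenced by UUID.
# $1 = device node (e.g. /dev/sda1), $2 = UUID of that partition
fstab_use_uuid() {
    # Only the first field of the matching line is replaced, so the
    # original spacing of the remaining columns is preserved.
    awk -v dev="$1" -v id="UUID=$2" \
        '$1 == dev { sub(/^[^ \t]+/, id) } { print }'
}
```

A possible call is `fstab_use_uuid /dev/sda1 "$(blkid -s UUID -o value /dev/sda1)" < /etc/fstab > /tmp/fstab.new`, followed by a manual comparison before the new file replaces the old one.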

Random Access Memory

Before programs and data can be used by the CPU, they have to be copied into RAM. This is where everything happens, but this type of memory is also often scarce on these small computers. This is the most important thing that can be said about RAM, but because I consider RAM to be so important, I want to go further into detail.

In order to execute a program in a process, it has to be loaded into RAM. Only the parts of the program that will be executed next get loaded, not the whole program (unless the whole program fits into a single memory page, but you’ll have a hard time finding such a program). If a program is used by several processes, its code is loaded only once; only the stack and the heap are private to each process. It is advisable to look out for programs which don’t use much memory but still provide the functionality you need. It is also advantageous when a program like busybox can replace as many other programs as possible, because it is shared by many processes and so saves space both in RAM and on the file system.

I will need more RAM if I use RAM directly as a file system through overlay file systems, tmpfs or loopback mounts. This RAM is no longer available as working memory for processes.

Finally the kernel uses all of the memory that is not used for any of the purposes mentioned above as a buffer for file system access. I usually don’t need to pay attention to this because this memory is automatically freed up when it is needed for other purposes.

The main memory of the 32-bit x86 computer architecture is divided into three areas:

ZONE_DMA: from 0 to 16 MiB. This range contains memory pages which may be used by devices for DMA.
ZONE_NORMAL: from 16 MiB to 896 MiB. This range contains regular memory pages.
ZONE_HIGHMEM: over 896 MiB. This range contains memory pages which are not permanently mapped into the kernel’s address space on a 32-bit CPU. This range has no relevance for ALIX computers.

Analyzing memory usage

I can use the programs free, top, ps and pmap in order to analyze the memory usage of a Linux system.

The program free gives me an overview of the current allocation of the total usable system memory:

$ free
             total   used   free shared buffers cached
Mem:        255488 135984 119504      0    6588 108732
-/+ buffers/cache:  20664 234824
Swap:            0      0      0

I never see the entire physical memory under total, because the memory reserved by the hardware and the kernel has already been subtracted.

The memory labelled buffers holds raw block device data, such as file system metadata, that the kernel keeps in RAM. The memory marked cached is the page cache: it holds the contents of files that have been read or written, so that repeated access, for instance by multiple processes using the same file, does not have to go back to the storage medium.
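
The line -/+ buffers/cache in the output of free is simple arithmetic on the Mem line: buffers and cache are added to the free memory and subtracted from the used memory. The same calculation can be done on /proc/meminfo, which is handy on systems where free is not installed. A sketch; the function name mem_summary is made up for this example.

```shell
# Read /proc/meminfo-style input and print two numbers in KiB:
# memory used by processes, and memory free if buffers and cache are dropped.
mem_summary() {
    awk '$1 == "MemTotal:" { total = $2 }
         $1 == "MemFree:"  { free = $2 }
         $1 == "Buffers:"  { buffers = $2 }
         $1 == "Cached:"   { cached = $2 }
         END { avail = free + buffers + cached
               print total - avail, avail }'
}
```

Called as `mem_summary < /proc/meminfo`. Fed with the values from the free listing above (total 255488, free 119504, buffers 6588, cached 108732), it prints 20664 234824, which matches the second line of that output.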

I can use the program top to isolate processes which use a particularly large amount of memory. It provides an overview of the processes, CPU load and total memory consumption in the header lines, and below these there is a table containing the data of the individual processes. The output is updated continuously and can be modified. Pressing ? calls up a brief help page explaining the possible modifications. It is interesting to sort the table by memory consumption, which I can do by pressing M (upper case):

Output mangled to fit

top - 08:29:03 up 125 days, 21:33,  1 user,  load a..
Tasks:  54 total,   1 running,  53 sleeping,   0 st..
Cpu(s):  0.4%us,  0.2%sy,  0.0%ni, 99.4%id,  0.0%wa..
Mem:    255488k total,   136172k used,   119316k fr..
Swap:        0k total,        0k used,        0k fr..

 PID USER   ..VIRT  RES  SHR..%MEM    TIME+  COMMAND
9006 mathias..6220 4924 1340.. 1.9   0:03.21 bash
1031 snmp   ..8832 4268 2660.. 1.7 186:43.97 snmpd
 954 ntp    ..4576 1920 1480.. 0.8   9:52.96 ntpd
9037 mathias..2324 1096  876.. 0.4   0:00.74 top
9005 root   ..2396 1048  788.. 0.4   0:01.44 dropbear
 898 root   ..3808  928  740.. 0.4   0:27.43 cron
2260 root   ..2960  900  672.. 0.4   0:06.71 pppd
 842 dnsmasq..4116  840  656.. 0.3   0:14.54 dnsmasq
 220 root   ..2252  720  396.. 0.3   0:00.23 udevd
 263 root   ..2248  688  364.. 0.3   0:00.06 udevd

The following columns are the most important columns for analyzing memory:

VIRT: stands for the virtual size of the process. This includes all code, data and shared libraries plus pages that have been swapped out and pages that have been mapped but not used. In other words all of the memory this process could use.
RES: is the resident size. This is the physical memory of a process which has not been swapped out to disk. This is used to compute the value of the %MEM column.
SHR: is the shared memory size. This is the part of VIRT that can be shared with other processes.
%MEM: the percentage of the available physical memory which the process currently uses. I sort by this column (by pressing M) to find the processes and programs which use the most memory and are therefore candidates for further investigation.

Using the program ps I can get a snapshot of the memory currently being consumed by all of the processes:

Output mangled to fit

$ ps aux
USER     PID %CPU %MEM  VSZ  RSS TTY..COMMAND
root       1  0.0  0.2 2024  676 ?  ..init [2]  
root       2  0.0  0.0    0    0 ?  ..[kthreadd]
root       3  0.0  0.0    0    0 ?  ..[ksoftirqd/0]
root       4  0.0  0.0    0    0 ?  ..[watchdog/0]
root       5  0.0  0.0    0    0 ?  ..[events/0]
...
snmp    1031  0.1  1.6 8832 4268 ?  ../usr/sbin/snmpd
root    1033  0.0  0.1 1480  396 ?  ../usr/sbin/udhcp
root    1056  0.0  0.2 1700  536 tty../sbin/getty -L
root    1305  0.0  0.2 2248  544 ?  ..udevd --daemon
root    2260  0.0  0.3 2960  900 ?  ../usr/sbin/pppd
root    9005  0.0  0.4 2396 1048 ?  ../usr/sbin/dropb
mathias 9006  0.1  1.9 6220 4928 pts..-bash
root    9041  0.0  0.0    0    0 ?  ..[flush-8:0]
mathias 9042  0.0  0.3 2344  904 pts..ps aux

To find out which process is consuming the most memory I sort by column 6:

Output mangled to fit

$ ps aux|sort -n -k6 -r |head
mathias 9006  0.1  1.9 6220 4928 pts..-bash
snmp    1031  0.1  1.6 8832 4268 ?  ../usr/sbin/snmpd
ntp      954  0.0  0.7 4576 1920 ?  ../usr/sbin/ntpd 
root    9005  0.0  0.4 2396 1048 ?  ../usr/sbin/dropb
root     898  0.0  0.3 3808  928 ?  ../usr/sbin/cron
mathias 9054  0.0  0.3 2344  908 pts..ps aux
root    2260  0.0  0.3 2960  900 ?  ../usr/sbin/pppd 
dnsmasq  842  0.0  0.3 4116  840 ?  ../usr/sbin/dnsma
mathias 9057  0.0  0.3 2036  768 pts..less -S
root     220  0.0  0.2 2252  720 ?  ..udevd --daemon
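
The same column can also be added up to get a rough figure for the memory held by all processes together. This is only a sketch with a hypothetical function name, and the result is an upper bound: pages shared between processes, such as libraries, are counted once per process.

```shell
# Read `ps aux` output on stdin and print the sum of column 6 (RSS in KiB).
rss_total() {
    # Skip the header line; "+ 0" makes sure a number is printed
    # even if there are no process lines.
    awk 'NR > 1 { sum += $6 } END { print sum + 0 }'
}
```

The call is `ps aux | rss_total`.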

The columns VSZ (virtual memory size, 5), RSS (resident set size, 6) and PID (process ID, 2) are the most interesting for analyzing memory. I use the last one to investigate a process further using pmap:

Output mangled to fit

$ sudo pmap -d 1031
1031:   /usr/sbin/snmpd -Lsd -Lf /dev/null -u snmp \
-g snmp -I -smux -p /var/run/snmpd.pid
Address  Kbytes Mode  Offset   Device    Mapping
08048000     24 r-x-- 0..00000 000:00010 snmpd
0804e000      4 rw--- 0..05000 000:00010 snmpd
09cd4000   1156 rw--- 0..00000 000:00000   [ anon ]
b70bf000     40 r-x-- 0..00000 000:00010 libnss_fi...
b70c9000      4 r---- 0..09000 000:00010 libnss_fi...
b70ca000      4 rw--- 0..0a000 000:00010 libnss_fi...
...
b77c7000      4 r-x-- 0..00000 000:00000   [ anon ]
b77c8000    108 r-x-- 0..00000 000:00010 ld-2.11.2.so
b77e3000      4 r---- 0..1a000 000:00010 ld-2.11.2.so
b77e4000      4 rw--- 0..1b000 000:00010 ld-2.11.2.so
bfc54000    332 rw--- 0..00000 000:00000   [ stack ]
mapped: 8828K  writeable/private: 2172K  shared: 0K

The memory marked writeable/private in the last line of the output is the memory that the process uses only for itself and doesn’t share with other processes.

Swappiness

If I have to swap memory despite all of my efforts to reduce memory consumption, I can at least influence whether the kernel prefers to swap out processes and data or to shrink the buffer caches when all of the free memory is taken. This requires a kernel version of at least 2.6. The parameter swappiness takes an integer value between 0 and 100. 100 means the kernel prefers to swap out processes and 0 means the kernel first shrinks the buffer caches. The default is 60; a value of 20 or less is recommended for laptops. You can change this value at runtime like this:

# sysctl -w vm.swappiness=30

or:

# echo 30 > /proc/sys/vm/swappiness

If the system runs without swap memory, this parameter is irrelevant.
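
Both commands above only last until the next reboot. To make the setting permanent, the value can be written to /etc/sysctl.conf, which is read at boot time. The sketch below only produces the configuration line and checks the range first; the function name swappiness_conf is hypothetical.

```shell
# Print a sysctl.conf line for the given swappiness value.
# Fails if the argument is not an integer between 0 and 100.
swappiness_conf() {
    case "$1" in
        ''|*[!0-9]*) echo "not a number: $1" >&2; return 1 ;;
    esac
    [ "$1" -le 100 ] || { echo "out of range: $1" >&2; return 1; }
    # Unlike on the sysctl command line, spaces around "=" are allowed
    # inside sysctl.conf.
    echo "vm.swappiness = $1"
}
```

As root, `swappiness_conf 30 >> /etc/sysctl.conf && sysctl -p` appends the line and loads it immediately. On a read-only root file system the partition has to be remounted read-write first.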

I/O subsystem

The I/O subsystem’s job is to communicate with the environment. The kernel is responsible for managing the devices and provides the low-level drivers, so this is where I have to look for drivers that suit my hardware.

It is very important to first identify the hardware built into the computer. There are a few programs which I can use to do this:

lspci: lists the devices on the PCI bus.
lsusb: does the same for the USB.
lscpu: provides information about the CPU which supplements the information from /proc/cpuinfo
lshw: finds out nearly everything about the hardware that can be found using software.
dmesg: shows the kernel messages. Particularly with misidentified hardware, it shows the kernel’s view of the device, or whether the kernel has recognized the hardware at all.

Using the output of these programs in an internet search, I can usually find the right driver for hardware hitherto unknown to me.
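
For PCI devices that already have a driver, lspci can report the driver itself when called with the option -k. The following sketch condenses that output into one line per device; the function name pci_drivers is hypothetical, and devices without a bound driver are simply omitted.

```shell
# Read `lspci -k` output on stdin and print "slot driver" for every
# device that has a kernel driver bound to it.
pci_drivers() {
    # Device lines start with the slot address; detail lines are indented.
    awk '/^[0-9a-f]/ { slot = $1 }
         /Kernel driver in use:/ { print slot, $NF }'
}
```

The call is `lspci -k | pci_drivers`. Devices that appear in plain lspci output but not here are the ones that still need a driver.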

System programs

These are programs that do not serve the actual purpose of the system but instead ensure that it stays operational.

The first process to start after the system boots is init. Traditionally there are System V and BSD init programs, which work in a similar fashion and differ only in the way they process the start and stop scripts of the system’s services. Because most traditional services were geared towards server systems that boot very infrequently, these init programs cannot easily be optimized for a fast system boot. Therefore recent projects have been trying to find a substitute for init that allows more flexibility and shorter boot times for the entire system.

Other important system programs for logging into the system are getty for the serial console on ALIX and sshd which enables you to login via the network. These programs are actually designed for the system’s user but I regard the system administrator to be a user who has to log in to acquire an overview of the system or to diagnose a problem. In the same way the display manager on a graphical system or an HTTP server with a web administration system could be regarded as a system program.

Tip

I like using dropbear as an SSH daemon. It is designed for environments with little memory. It implements most of the features of the SSH2 protocol and others like X11 and authentication agent forwarding.

I consider syslogd and klogd to be essential for every system. These often provide valuable hints should errors arise and, if monitored regularly, help to avoid some problems beforehand. I need a syslogd that uses resources sparingly, especially on limited platforms like ALIX machines. I have had good experiences with busybox-syslogd. It doesn’t write to files but uses a memory range of a certain size for the log messages and can forward them to external log servers. I can use logread to read the local messages.

Because many systems which work together over the network depend on synchronized clocks (to correlate log messages, for cryptographic systems like Kerberos and other things), I consider ntp to be essential as well. In a system without a network this may not be required.

I use an SNMP daemon if I want to use this protocol to monitor the device.

Tip

In Debian’s default settings, snmpd logs every access. This is particularly annoying if access occurs at short intervals, for instance with Nagios. Then the system log fills up with data which is not relevant to any problem. To prevent this I have to change the command line parameter for logging. The original option for logging via syslogd is -Lsd. In my opinion -LSwd is better: it means that only messages with a priority of warning or above are logged to syslogd. This has to be changed in the variable SNMPDOPTS in the file /etc/default/snmpd, and afterwards I have to restart snmpd.
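
The change to SNMPDOPTS can also be made with sed instead of an editor. A small sketch, assuming the variable is assigned on a single line that starts with SNMPDOPTS= and contains the default -Lsd; the function name quiet_snmpd_opts is hypothetical.

```shell
# Read the snmpd defaults file on stdin and switch the syslog option
# from -Lsd (log everything) to -LSwd (warning and above).
quiet_snmpd_opts() {
    # Only the SNMPDOPTS line is touched; all other lines pass through.
    sed '/^SNMPDOPTS=/ s/-Lsd/-LSwd/'
}
```

The function is fed with the defaults file named above and its output is written to a temporary file, which I compare with the original before replacing it and restarting snmpd.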

User programs

These depend on the purpose of the machine. They can be DHCP, DNS, HTTP or other servers, MP3 streaming clients, or Asterisk for a telephone system.

Due to the wide range of possible application fields and the array of available programs, I won’t go into detail here.
