Components of a Linux system
In this chapter I will go into detail about the components that make up a Linux system. To do this, I divided the system into the following areas:
- Kernel
- The glue that holds everything together.
- File system
- The place where everything (programs and data) is permanently deposited.
- RAM
- All of the programs and data must be copied here before they can be used. Most of the time little consideration is given to this topic. Because these machines only have a little RAM, I believe a few thoughts are appropriate.
- I/O system
- This provides the basis for communicating with the environment via the network, serial connections, keyboard, screen and other means.
- System programs
- All of the programs which aren’t vital for the intended use of the machine, but are instead there to maintain stable operation.
- User programs
- All programs which are vital for the intended use of the machine.
Of course these areas overlap and some programs may be considered to be a system program or a user program - depending on what they are being used for.
Kernel
The kernel is - rightly so - only considered a component of the total system. Efforts are being made, for instance with Debian, to make the operating system usable with the BSD kernel or with Hurd. Nevertheless the kernel plays a central role in the overall system as the gateway between hardware and software, as communication interface between the various processes running and as the entity that allocates the resources to the processes.
Of particular interest is how well the kernel supports the hardware in the ALIX machines, and whether I need external kernel modules or can get by with userland programs.
Kernel modules
The standard kernel includes GPL drivers for all of the hardware plugged into the ALIX boards.
- leds-alix
- This module is used to control the LEDs on the boards. The ledtrig-* modules provide activation triggers. Since kernel version 2.6.30, these modules have been included in the standard kernel. That means I don’t need separate modules for Debian 6 Squeeze. For older kernels (for instance from Debian 5 Lenny), I need the package leds-alix-source which I can compile and install with module-assistant (m-a).
- rtc
- I need this module to access the hardware clock.
- geode-aes, geode-rng
- The AMD Geode LX800 processor, which is built into some of the ALIX boards, contains an on-chip 128-bit AES crypto acceleration block and a true random number generator. Using these is faster than the equivalent software algorithms and frees up the CPU for other purposes. I need rngd from the rng-tools package to use the hardware random number generator.
- geodewdt
- A watchdog driver is available starting in kernel version 2.6.33 which restarts the computer automatically if it gets stuck.
- i2c_core
- This module for the I²C bus is needed to access the sensors.
- lm90
- This is the driver for the sensor chip. Currently it identifies a lm86 sensor.
- scx200_acb
- This is the driver for the ACCESS bus of the Geode processors and the CS5536 chips.
- cs5535_gpio
- The GPIO module to access the LEDs and button comes with kernel version 2.6.33 and above, otherwise it must be compiled as an additional kernel module.
- cs5536, pata_cs5536
- Modules for the compact flash “hard disk”.
Depending on the kernel version and kernel options, the CF card is accessible
either as /dev/hda or /dev/sda.
This can lead to problems at boot time if you want to test another kernel
version.
In this case it would be better to identify the disk using its UUID or label.
cs5536 (legacy IDE stack) -> /dev/hda
pata_cs5536 (libata PATA stack) -> /dev/sda
- via_rhine
- The ethernet module.
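To see which of these modules are actually loaded, a small check against /proc/modules can help. The following is a minimal sketch, not a definitive tool: the function name missing_modules is my own, and it assumes that /proc/modules lists one module per line with the name (written with underscores) in the first field. Drivers compiled directly into the kernel do not show up there and will be reported as missing.

```shell
#!/bin/sh
# Print every requested module that does not appear in a
# /proc/modules-style file.
# Usage: missing_modules MODULES_FILE module...
missing_modules() {
    modules_file=$1
    shift
    for mod in "$@"; do
        # Loaded modules are listed with underscores instead of dashes.
        name=$(printf '%s' "$mod" | sed 's/-/_/g')
        if ! awk -v m="$name" '$1 == m { found = 1 } END { exit !found }' \
                "$modules_file"; then
            printf '%s\n' "$mod"
        fi
    done
}

# Example on a running system (module names taken from the list above):
# missing_modules /proc/modules geode-aes geode-rng lm90 scx200_acb
```

Anything the function prints is either not loaded or built in, so I would cross-check with dmesg before drawing conclusions.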
File system
The file system is the place where everything that makes up the system (programs, data) is stored permanently. It pays to look a little closer at this. On the one hand, the various file systems are suited to different media to a greater or lesser extent. Particularly with flash media I want to avoid writing to the same spot over and over, and to keep at least the operating system data read-only to minimize damage caused by excessive write access. On the other hand, some programs need permanent storage to keep data across a reboot or when there is not enough RAM available.
If I have mounted the root file system read-only there are still ways to enable write access for processes:
- With overlay file systems like AUFS
- By mounting writable file systems like tmpfs at certain mount points
- With symbolic links to writable file systems
SquashFS
This is a compressed read-only file system for Linux, available from kernel version 2.4 onwards. The kernel accesses it through a kernel module via the virtual file system (VFS) layer. SquashFS has been included in the standard kernel since kernel version 2.6.29. For older kernel versions a separate module has to be compiled.
UIDs and GIDs are stored in full, as are file creation times. Duplicate files are stored only once. Files are compressed with deflate (zlib) or with the more effective Lempel-Ziv-Markov chain algorithm (LZMA). SquashFS is often used together with UnionFS or the more modern AUFS to allow the processes at least temporary write access.
To work with SquashFS on Debian you have to install the package squashfs-tools, which contains the programs mksquashfs for creating a SquashFS and unsquashfs for extracting one without mounting it. On MS Windows, 7-Zip offers at least read access.
For initial experiments you can convert a part of your file system into a SquashFS:
# mksquashfs /usr/local /mnt/local.sqsh
...
# du -s /mnt/local.sqsh /usr/local
94156 /mnt/local.sqsh
219252 /usr/local/
The newly formed SquashFS is already considerably smaller than the original file system. Since I want to try out the SquashFS together with the overlay file system AUFS, I create three mount points:
# mkdir /mnt/local
# mkdir /mnt/local-ro
# mkdir /mnt/local-rw
# mount /mnt/local.sqsh /mnt/local-ro -t squashfs \
-o loop
# mount -t aufs \
-o dirs=/mnt/local-rw=rw:/mnt/local-ro=ro \
aufs /mnt/local
# mount
...
/dev/loop0 on /mnt/local-ro type squashfs (rw)
aufs on /mnt/local type aufs \
(rw,dirs=/mnt/local-rw=rw:/mnt/local-ro=ro)
The SquashFS is mounted at /mnt/local-ro/ where it is read-only. I want to use this file system read-write at /mnt/local/. AUFS redirects all my write access to /mnt/local-rw/:
# echo foo> /mnt/local-ro/var/foo
-bash: /mnt/local-ro/var/foo: Read-only file system
# echo foo> /mnt/local/var/foo
# diff -r /mnt/local-ro /usr/local
# diff -r /mnt/local /usr/local
Only in /mnt/local/var: foo
# cat /mnt/local/var/foo
foo
# cat /mnt/local-rw/var/foo
foo
I can’t write to /mnt/local-ro/, even though mount showed that it was mounted read-write, since this is impossible with SquashFS. Therefore I create the file in /mnt/local/ and find it eventually in /mnt/local-rw/.
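To get a quick overview of everything that has ended up in the writable branch, something like the following could be used. It is only a sketch: the function name list_branch_changes is made up for this example, and it assumes that AUFS keeps its own bookkeeping in entries whose names start with ".wh.", which are skipped here.

```shell
#!/bin/sh
# Print the regular files stored in a writable branch, relative to
# the root of that branch.
# Usage: list_branch_changes BRANCH_DIR
list_branch_changes() {
    branch=$1
    (
        cd "$branch" || exit 1
        # Skip entries named .wh.* (assumed to be AUFS bookkeeping).
        find . -name '.wh.*' -prune -o -type f -print | sed 's|^\./||' | sort
    )
}

# Example with the mount points used above:
# list_branch_changes /mnt/local-rw
```

For the experiment above I would expect this to print var/foo.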
Overlay root file system with AUFS
With this solution the root file system gets mounted read-only and there is an AUFS laid over the whole directory tree which redirects all write access to a temporary file system in RAM (tmpfs). The whole thing is set up through a script in the InitRD which moves the root partition to /ro/, mounts a tmpfs at /rw/ and an AUFS at /, which reads from /ro/ and writes to /rw/. This script is activated by the kernel command line argument aufs=tmpfs. Furthermore, the script creates two other scripts named remountrw and remountro, which can remount the root partition at /ro/ in read-write mode or read-only mode. I set up my first ALIX systems this way.
Everything that gets written by any program into any file ends up in tmpfs, as long as there is enough RAM available, and disappears after a reboot. If I want to make some changes permanent, I have to call remountrw, move the changes from /rw/ to /ro/ and then call remountro.
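Moving a single changed file from the writable layer to the read-only layer could look roughly like this. This is a sketch under assumptions: persist_file is a hypothetical helper of my own, it handles regular files only, and the caller is expected to wrap it in remountrw and remountro as described above.

```shell
#!/bin/sh
# Copy one changed file from the writable layer to the read-only layer,
# creating missing parent directories on the way.
# Usage: persist_file RW_ROOT RO_ROOT RELATIVE_PATH
persist_file() {
    rw_root=$1
    ro_root=$2
    rel=$3
    if [ ! -f "$rw_root/$rel" ]; then
        echo "not a regular file: $rw_root/$rel" >&2
        return 1
    fi
    mkdir -p "$ro_root/$(dirname "$rel")" &&
        cp -a "$rw_root/$rel" "$ro_root/$rel"
}

# Example with the mount points and scripts named in the text:
# remountrw && persist_file /rw /ro etc/hostname && remountro
```

Copying file by file keeps me aware of what I actually make permanent, which is the main point of this setup.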
One advantage of this solution is that I don’t have to think about which program is trying to write which file. Everything goes into tmpfs and is gone after rebooting.
The downside is that system upgrades also land under /rw/ and have to be copied afterwards to /ro/. Alternatively I can start the system without the kernel option aufs=tmpfs and boot the system just like any normal system. This is not a very good option for systems which are supposed to run permanently. Therefore I turn to the next solution which is used, among others, by Voyage Linux.
Read-only root file system with multiple tmpfs
The central idea of this solution is to mount the root file system read-only and to not allow any process to write to this file system. For directories that traditionally contain writable files (e.g. /var/run/) I mount tmpfs (RAM disks) at these points to allow write access.
For the directories /var/run/ and /var/lock/ this is already available in standard Debian Linux if I set the following in the file /etc/default/rcS:
# /etc/default/rcS
# ...
RAMRUN=yes
RAMLOCK=yes
This is even better supported in Voyage Linux. Here I can specify further directories to be mounted as tmpfs in the file /etc/default/voyage-util by adding them to the variable VOYAGE_SYNC_DIRS. These directories are also automatically saved at shutdown, and filled with the saved files at system boot. If I want to manually save the files in these directories, I do it like this:
# remountrw
# /etc/init.d/voyage-sync sync
# remountro
With this solution I have to think about which directories should be writable (for instance the directory with the leases of the DHCP daemon). But afterwards a system upgrade is as easy as this:
# remountrw
# apt-get update && apt-get upgrade
# /etc/init.d/voyage-sync sync
# remountro
This is why I moved away from the AUFS solution and am now using Voyage Linux on my ALIX machines.
The package flashybrid on standard Debian GNU/Linux functions similarly to voyage-util on Voyage Linux.
Flashybrid provides the commands mountro, mountrw and fh-sync, which have the same function as their corresponding commands in Voyage Linux.
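For directories that none of these tools cover, plain /etc/fstab entries are another way to get a tmpfs at a given mount point. The following sketch only prints candidate lines; the function name tmpfs_fstab_lines, the chosen mount options and the example directory are my own assumptions, not something Voyage Linux or flashybrid prescribes.

```shell
#!/bin/sh
# Print one /etc/fstab line per directory that mounts a size-limited
# tmpfs at that directory.
# Usage: tmpfs_fstab_lines SIZE dir...
tmpfs_fstab_lines() {
    size=$1
    shift
    for dir in "$@"; do
        printf 'tmpfs %s tmpfs nosuid,nodev,size=%s 0 0\n' "$dir" "$size"
    done
}

# Example: review the output before appending it to /etc/fstab.
# tmpfs_fstab_lines 8m /var/tmp
```

Limiting the size matters on these machines because every byte stored in a tmpfs is taken from RAM.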
Writable file systems
If I need a writable file system for my project, I recommend a modern CF card (with CompactFlash 5.0 or later) that supports the TRIM command. Together with a suitable file system (btrfs, ext4fs, fat or gfs2) and a current kernel (starting with version 2.6.33) the operating system can tell the CF card which sectors aren’t needed anymore and no longer need to be copied. By doing this, and by leaving some space free, I can extend the lifetime of the CF card.
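Whether the kernel considers a card capable of discards (TRIM) can be read from sysfs. This is a minimal sketch, assuming a kernel that exposes the attribute queue/discard_max_bytes for block devices, where a value of 0 means discards are not supported. The function name supports_discard and the optional second argument (an alternative sysfs root, useful for testing) are my own additions.

```shell
#!/bin/sh
# Report whether the kernel advertises discard (TRIM) support for a disk.
# Usage: supports_discard DEVICE [SYSFS_ROOT]
# Returns 0 if supported, 1 if not, 2 if the attribute is missing.
supports_discard() {
    dev=$1
    sysfs=${2:-/sys}
    file=$sysfs/block/$dev/queue/discard_max_bytes
    # Attribute missing: kernel too old or no such disk.
    [ -r "$file" ] || return 2
    # A value of 0 means the device does not accept discards.
    [ "$(cat "$file")" -gt 0 ]
}

# Example:
# supports_discard sda && echo "sda accepts discards"
```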
Identifying partitions with UUID or label
Depending on the kernel version and options, the first CF card will either be called /dev/hda or /dev/sda. Thus the system boot may fail if I just want to test a new kernel. In this case it may be beneficial to identify the partition using its UUID. To do this, I do the following:
After the machine has booted I look in /dev/disk/by-uuid/ to find out which UUIDs the individual partitions have:
$ ls -l /dev/disk/by-uuid/
total 0
lrwxrwxrwx 1 root root 10 2011-11-25 07:43
f779141e-e3b1-4521-9333-9dde9de0b64f -> ../../sda1
(Output is wrapped for better legibility.)
Afterwards I change the entry for /dev/sda1 in the file /etc/fstab to
UUID=f779141e-e3b1-4521-9333-9dde9de0b64f and do the same for the other
partitions.
I change the kernel option root in the Grub boot entry (file
/boot/grub/menu.lst) accordingly:
root=/dev/disk/by-uuid/f779141e-...-9dde9de0b64f
Another possibility is to use filesystem labels. I can use e2label, for instance, to write these on an ext2fs.
# e2label /dev/sda1 rootfs
I change the line for the root file system in /etc/fstab like this:
LABEL=rootfs / ext3 errors=remount-ro 0 1
And the Grub boot entry looks like this for the kernel option root:
root=LABEL=rootfs
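Editing /etc/fstab by hand is fine for one partition. For several machines a small filter may be more convenient. The following is a sketch only: fstab_replace_device is a hypothetical helper that reads an fstab on standard input and writes the modified version to standard output, so nothing is changed until I copy the result into place myself.

```shell
#!/bin/sh
# Replace the first field of every fstab line whose device is exactly
# OLD_DEVICE with NEW_SPEC. Comment lines and other entries pass through.
# Usage: fstab_replace_device OLD_DEVICE NEW_SPEC < fstab
fstab_replace_device() {
    awk -v old="$1" -v new="$2" '
        # Swap only the first field so the original spacing is preserved.
        $1 == old { sub(/^[ \t]*[^ \t]+/, new) }
        { print }
    '
}

# Example:
# fstab_replace_device /dev/sda1 LABEL=rootfs < /etc/fstab > /tmp/fstab.new
```

The UUID or label to put in can also be looked up with blkid, for instance blkid /dev/sda1.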
Random Access Memory
Before programs and data can be used by the CPU, they have to be copied into RAM. This is where everything happens, but this type of memory is also often scarce on these small computers. This is the most important thing that can be said about RAM, but because I consider RAM to be so important, I want to go further into detail.
In order to execute a program in a process, it has to be loaded into RAM. Only the parts of the program that will be executed next get loaded and not the whole program (except when the whole program fits onto a single page, but you’ll have a hard time finding such a program). If a program is used by different processes it will only be loaded once. Only the stack and the heap are used privately by the process. It is advisable to look out for programs that don’t use much memory but still provide the functionality you need. It is also advantageous when a program like busybox can replace as many other programs as possible, because it is loaded only once for all the processes that use it, which saves space both in RAM and on the file system.
I will need more RAM if I use RAM directly as a file system through overlay file systems, tmpfs or loopback mounts. This RAM is no longer available as working memory for processes.
Finally the kernel uses all of the memory that is not used for any of the purposes mentioned above as a buffer for file system access. I usually don’t need to pay attention to this because this memory is automatically freed up when it is needed for other purposes.
The main memory of the 32-bit x86 computer architecture is divided into three areas:
- ZONE_DMA
- from 0 to 16 MiB. This range contains memory pages which may be used by devices for DMA.
- ZONE_NORMAL
- from 16 MiB to 896 MiB. This range contains regular memory pages.
- ZONE_HIGHMEM
- over 896 MiB. This range contains memory pages that are not permanently mapped into the address space of the 32-bit CPU. This range has no relevance for ALIX computers.
Analyzing memory usage
I can use the programs free, top, ps and pmap in order to analyze the memory usage of a Linux system.
The program free gives me an overview of the current allocation of the total usable system memory:
$ free
total used free shared buffers cached
Mem: 255488 135984 119504 0 6588 108732
-/+ buffers/cache: 20664 234824
Swap: 0 0 0
I never see the entire physical memory under total because the memory reserved by the hardware and the kernel has already been subtracted.
The memory labelled buffers contains temporary data from the processes running, like input queues, file buffers, output queues and so on. The memory marked cached contains buffered file accesses, for instance if multiple processes are accessing the same file.
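The line "-/+ buffers/cache" in the output above is simple arithmetic on the Mem line: used minus buffers minus cached, and free plus buffers plus cached. The following sketch recomputes both numbers. It assumes the classic column layout of free shown above (newer versions of free arrange the columns differently), and the function name free_without_cache is made up for this example.

```shell
#!/bin/sh
# Read "free" output in the classic layout on stdin and print two numbers:
# memory used by processes and memory available to processes, in KiB.
free_without_cache() {
    awk '$1 == "Mem:" {
        used = $3; free = $4; buffers = $6; cached = $7
        print used - buffers - cached, free + buffers + cached
    }'
}

# Example:
# free | free_without_cache
```

With the values shown above this gives 135984 - 6588 - 108732 = 20664 and 119504 + 6588 + 108732 = 234824, which matches the second line of the free output.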
I can use the program top to isolate processes which use a particularly
large amount of memory.
It provides an overview of the processes, CPU load and total memory consumption
in the head lines and below these there is a table containing the data of the
individual processes.
The output is updated continuously and can be modified.
By pressing ? I am able to call up a brief help page explaining the possible
modifications.
It is interesting to sort the table by memory consumption, which
I can do by pressing M (capital letter):
(Output is shortened to fit.)
top - 08:29:03 up 125 days, 21:33, 1 user, load a..
Tasks: 54 total, 1 running, 53 sleeping, 0 st..
Cpu(s): 0.4%us, 0.2%sy, 0.0%ni, 99.4%id, 0.0%wa..
Mem: 255488k total, 136172k used, 119316k fr..
Swap: 0k total, 0k used, 0k fr..
PID USER ..VIRT RES SHR..%MEM TIME+ COMMAND
9006 mathias..6220 4924 1340.. 1.9 0:03.21 bash
1031 snmp ..8832 4268 2660.. 1.7 186:43.97 snmpd
954 ntp ..4576 1920 1480.. 0.8 9:52.96 ntpd
9037 mathias..2324 1096 876.. 0.4 0:00.74 top
9005 root ..2396 1048 788.. 0.4 0:01.44 dropbear
898 root ..3808 928 740.. 0.4 0:27.43 cron
2260 root ..2960 900 672.. 0.4 0:06.71 pppd
842 dnsmasq..4116 840 656.. 0.3 0:14.54 dnsmasq
220 root ..2252 720 396.. 0.3 0:00.23 udevd
263 root ..2248 688 364.. 0.3 0:00.06 udevd
The following columns are the most important columns for analyzing memory:
- VIRT
- stands for the virtual size of the process. This includes all code, data and shared libraries plus pages that have been swapped out and pages that have been mapped but not used. In other words all of the memory this process could use.
- RES
- is the resident size. This is the physical memory of a process which has not been swapped out to disk. This is used to compute the value of the %MEM column.
- SHR
- is the shared memory size. This is the part of VIRT that can be shared with other processes.
- %MEM
- the percentage of the available physical memory which the process currently uses. I use this column (sorting by it with a capital M) to find the processes and programs which use the most memory and are therefore candidates for further investigation.
Using the program ps I can get a snapshot of the memory currently being consumed by all of the processes:
(Output is shortened to fit.)
$ ps aux
USER PID %CPU %MEM VSZ RSS TTY..COMMAND
root 1 0.0 0.2 2024 676 ? ..init [2]
root 2 0.0 0.0 0 0 ? ..[kthreadd]
root 3 0.0 0.0 0 0 ? ..[ksoftirqd/0]
root 4 0.0 0.0 0 0 ? ..[watchdog/0]
root 5 0.0 0.0 0 0 ? ..[events/0]
...
snmp 1031 0.1 1.6 8832 4268 ? ../usr/sbin/snmpd
root 1033 0.0 0.1 1480 396 ? ../usr/sbin/udhcp
root 1056 0.0 0.2 1700 536 tty../sbin/getty -L
root 1305 0.0 0.2 2248 544 ? ..udevd --daemon
root 2260 0.0 0.3 2960 900 ? ../usr/sbin/pppd
root 9005 0.0 0.4 2396 1048 ? ../usr/sbin/dropb
mathias 9006 0.1 1.9 6220 4928 pts..-bash
root 9041 0.0 0.0 0 0 ? ..[flush-8:0]
mathias 9042 0.0 0.3 2344 904 pts..ps aux
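Such a snapshot can also be totalled. The sketch below adds up the RSS column of ps aux output; the function name sum_rss is my own, and it assumes RSS is the sixth field as in the output above. Pages shared between processes are counted once per process, so the total overstates the real consumption and is best used for comparisons over time.

```shell
#!/bin/sh
# Sum the RSS column (field 6) of "ps aux" output read from stdin, in KiB.
# The header line is skipped.
sum_rss() {
    awk 'NR > 1 { total += $6 } END { print total + 0 }'
}

# Example:
# ps aux | sum_rss
```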
To find out which process is consuming the most memory I sort by column 6:
(Output is shortened to fit.)
$ ps aux|sort -n -k6 -r |head
mathias 9006 0.1 1.9 6220 4928 pts..-bash
snmp 1031 0.1 1.6 8832 4268 ? ../usr/sbin/snmpd
ntp 954 0.0 0.7 4576 1920 ? ../usr/sbin/ntpd
root 9005 0.0 0.4 2396 1048 ? ../usr/sbin/dropb
root 898 0.0 0.3 3808 928 ? ../usr/sbin/cron
mathias 9054 0.0 0.3 2344 908 pts..ps aux
root 2260 0.0 0.3 2960 900 ? ../usr/sbin/pppd
dnsmasq 842 0.0 0.3 4116 840 ? ../usr/sbin/dnsma
mathias 9057 0.0 0.3 2036 768 pts..less -S
root 220 0.0 0.2 2252 720 ? ..udevd --daemon
The columns VSZ (virtual memory size, 5), RSS (resident set size, 6) and PID (process ID, 2) are the most interesting for analyzing memory. I use the last one to investigate a process further using pmap:
(Output is shortened to fit.)
$ sudo pmap -d 1031
1031: /usr/sbin/snmpd -Lsd -Lf /dev/null -u snmp \
-g snmp -I -smux -p /var/run/snmpd.pid
Address Kbytes Mode Offset Device Mapping
08048000 24 r-x-- 0..00000 000:00010 snmpd
0804e000 4 rw--- 0..05000 000:00010 snmpd
09cd4000 1156 rw--- 0..00000 000:00000 [ anon ]
b70bf000 40 r-x-- 0..00000 000:00010 libnss_fi...
b70c9000 4 r---- 0..09000 000:00010 libnss_fi...
b70ca000 4 rw--- 0..0a000 000:00010 libnss_fi...
...
b77c7000 4 r-x-- 0..00000 000:00000 [ anon ]
b77c8000 108 r-x-- 0..00000 000:00010 ld-2.11.2.so
b77e3000 4 r---- 0..1a000 000:00010 ld-2.11.2.so
b77e4000 4 rw--- 0..1b000 000:00010 ld-2.11.2.so
bfc54000 332 rw--- 0..00000 000:00000 [ stack ]
mapped: 8828K writeable/private: 2172K shared: 0K
The memory marked writeable/private in the last line of the output is the memory that the process uses only for itself and doesn’t share with other processes.
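If I want to compare this figure across several processes, I can cut it out of the summary line. This is a sketch that assumes the summary format of pmap -d shown above; the function name pmap_private_kb is hypothetical.

```shell
#!/bin/sh
# Extract the writeable/private figure (in KiB) from "pmap -d" output
# read from stdin.
pmap_private_kb() {
    awk '$1 == "mapped:" {
        for (i = 1; i < NF; i++)
            if ($i == "writeable/private:") {
                value = $(i + 1)
                sub(/K$/, "", value)   # strip the unit suffix
                print value
            }
    }'
}

# Example:
# sudo pmap -d 1031 | pmap_private_kb
```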
Swappiness
If I have to swap memory, despite all of my efforts to reduce memory consumption, I can at least influence whether the kernel prefers to swap out processes and data or reduce buffer caches when all of the free memory is taken. I have to use a kernel version of at least 2.6 to do this. There is a parameter swappiness which is adjusted as an integer between 0 and 100. 100 means the kernel prefers to swap out processes and 0 means the kernel first reduces buffer caches. The default is 60; a value of 20 or less is recommended for laptops. You can change this value at runtime like this:
# sysctl -w vm.swappiness=30
or:
# echo 30 > /proc/sys/vm/swappiness
If the system runs without swap memory, this parameter is irrelevant.
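Both commands only change the running system. To keep the value across reboots it is usually entered in /etc/sysctl.conf. The following sketch sets or replaces such an entry; the function name set_sysctl_entry is my own, and it assumes the file uses lines of the form "key = value". It needs write access to the file, so on a read-only root the partition has to be remounted read-write first.

```shell
#!/bin/sh
# Set KEY = VALUE in a sysctl.conf-style file, replacing an existing
# entry for KEY or appending a new one.
# Usage: set_sysctl_entry FILE KEY VALUE
set_sysctl_entry() {
    file=$1
    key=$2
    value=$3
    tmp=$(mktemp) || return 1
    awk -v key="$key" -v value="$value" '
        # Split each line at "=" (optional blanks before it) to get the key.
        { line = $0; sub(/^[ \t]+/, "", line); split(line, kv, /[ \t]*=/) }
        kv[1] == key { print key " = " value; done = 1; next }
        { print }
        END { if (!done) print key " = " value }
    ' "$file" > "$tmp" && cat "$tmp" > "$file"
    status=$?
    rm -f "$tmp"
    return $status
}

# Example:
# set_sysctl_entry /etc/sysctl.conf vm.swappiness 30 && sysctl -p
```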
I/O subsystem
The I/O subsystem’s job is to communicate with the environment. The kernel allocates the devices and provides the low-level drivers, so this is where I look for a driver suitable for each device.
It is very important to first identify the hardware built into the computer. There are a few programs which I can use to do this:
- lspci
- lists the devices on the PCI bus.
- lsusb
- does the same for the USB.
- lscpu
- provides information about the CPU which supplements the information from /proc/cpuinfo
- lshw
- finds out nearly everything about the hardware that can be found using software.
- dmesg
- shows the kernel messages and, particularly with misidentified hardware, can show how the kernel sees the hardware or whether it has recognized it at all.
Using the output of these programs in an internet search, I can usually find the right driver for hardware hitherto unknown to me.
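On a minimal installation not all of these programs are present. The following sketch checks which ones are installed before I try to run them; the function name available_tools is made up for this example.

```shell
#!/bin/sh
# Print those of the given commands that are installed, one per line.
# Usage: available_tools command...
available_tools() {
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
        fi
    done
}

# Example: save the output of every available tool for a later search.
# for t in $(available_tools lspci lsusb lscpu lshw dmesg); do
#     "$t" > "/tmp/hw-$t.txt" 2>&1
# done
```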
System programs
These are programs that do not serve the actual purpose of the system but instead ensure that it remains operational.
The first process to load after the system boots is init. Traditionally there are System-V and BSD init programs that work in a similar fashion and only differ in the way they process the start and stop scripts of the system services. Because most traditional services were geared towards server systems that would boot very infrequently, the init programs cannot be optimized easily for a fast system boot. Therefore recent projects have been trying to find a substitute for init that allows more flexibility and shorter boot times for the entire system.
Other important system programs for logging into the system are getty for the serial console on ALIX and sshd, which enables you to log in via the network. These programs are actually designed for the system’s user, but I regard the system administrator as a user who has to log in to get an overview of the system or to diagnose a problem. In the same way the display manager on a graphical system or an HTTP server with a web administration system could be regarded as a system program.
I consider syslogd and klogd to be essential for every system. These often provide valuable hints should errors arise and, if monitored regularly, help to avoid some problems beforehand. I need a syslogd that uses resources sparingly, especially on limited platforms like ALIX machines. I have had good experiences with busybox-syslogd. It doesn’t write to files but uses a memory range of a certain size for the log messages and can forward them to external log servers. I can use logread to read the local messages.
Because many systems that work together over the network depend on synchronized clocks (to correlate log messages, for cryptographic systems like Kerberos, and for other things), I consider ntp to be essential as well. On a system without a network this may not be required.
I use an SNMP daemon if I want to use this protocol to monitor the device.
User programs
These depend on the purpose of the machine. They can be DHCP, DNS, HTTP or other servers, MP3 streaming clients, or Asterisk for a telephone system.
Due to the wide range of possible application fields and the array of available programs, I won’t go into detail here.