Protocols and mechanisms

In this chapter I’ll discuss some of the protocols and mechanisms that require at least a minimum of understanding.

Bootloader

A bootloader is the first program that gets loaded by the firmware (the BIOS in IBM-compatible PCs) and executed. The bootloader then loads further parts of the operating system, usually the kernel.

Traditionally the bootloader consists of at least two parts. The first part (stage 1) is tiny and is placed in the master boot record (MBR) of the hard disk. Its job is to load the second part (stage 2), which often displays a menu for selecting the kernel and entering additional kernel parameters, and then loads the selected kernel.

There are three main bootloaders for Linux on x86 systems: LILO, GRUB and SYSLINUX. Each one has its specific advantages and drawbacks, and their suitability varies depending on the use case. Below I will go into a bit of detail about the three bootloaders.

LILO

Linux Loader (LILO) is a well-proven bootloader for Linux. LILO is usually configured in the file /etc/lilo.conf. Details can be found in the manual. In addition to Linux, LILO can load other operating systems through chain loading.

After the configuration has been changed or a kernel has been upgraded, the program /sbin/lilo has to be run because this bootloader can’t deal with file systems. The program /sbin/lilo determines which hard disk blocks are to be loaded by the bootloader. If I forget to run /sbin/lilo, LILO can’t load the desired kernel and the system may fail to boot.

LILO’s drawback is also its advantage because it means LILO is not limited to known file systems and can also load the kernel from unknown file systems as long as these are not compressed or encrypted.

GRUB

The Grand Unified Bootloader (GRUB) was initially developed as part of the GNU Hurd Project. Because GRUB is more flexible and can deal with file systems - here it isn’t necessary to run a program to determine the hard disk blocks after every kernel update - it has replaced LILO in many systems.

At the moment GRUB is being reworked. The new version is called GRUB 2, the old one GRUB Legacy.

In the old GRUB version there was a stage 1.5 between stage 1 and stage 2 which could read only one file system type. Stage 1.5 was located in the blocks between the MBR and the first partition. There, the appropriate version was installed for reading the file system containing stage 2.

In the new GRUB, stage 2 is divided into a kernel and loadable modules. The kernel only contains essential code for decompression, hard disk access, a shell and an ELF loader for modules. During installation the modules for the file system and the remaining components are appended to the kernel. Because it is compressed it usually fits in the area between the MBR and the first partition.

SYSLINUX

The SYSLINUX project creates a series of lightweight bootloaders for IBM-compatible devices, in particular:

SYSLINUX: for booting from FAT or NTFS file systems
ISOLINUX: for booting from CD-ROM ISO 9660 file systems
PXELINUX: for booting from a network server with a Preboot Execution Environment (PXE)
EXTLINUX: for booting from Linux ext2, ext3 or btrfs file systems
MEMDISK: for booting older operating systems like MS-DOS from these media

The use of SYSLINUX is recommended in these special situations. I’ll go into detail about PXELINUX in the next section.

Preboot Execution Environment (PXE)

Using PXE I can start a computer with software loaded from the network. The code that enables PXE boot is mostly located in the network interface of the computer but can also come from a floppy disk, a USB device or a CD-ROM. The computer uses this code to communicate with a DHCP server in order to get information about the network and the next server. Afterwards it communicates with a TFTP server to load the operating system.

First the computer uses DHCP to search for a PXE compatible redirection service to obtain a valid network configuration and to get information about available boot servers. When it has received this information it then contacts the boot server to load the Network Bootstrap Program (NBP) via TFTP. The NBP then takes control of the process.

PXELINUX is an NBP that is suitable for my purposes.

The DHCP configuration suitable for PXE boot with PXELINUX could look like this if I use the ISC DHCP server:

allow booting;
allow bootp;
group {
    next-server <TFTP server address>;
    filename "/pxelinux.0";
    host <hostname> {
        hardware ethernet <ethernet address>;
    }
}

This is only the part of the server configuration that is responsible for PXE boot. I group together all of the computers that use the same bootloader and use the MAC address to identify them. Here it is possible to assign fixed addresses in the block following host using fixed-address <hostname>.

PXELINUX

PXELINUX is a component of the SYSLINUX project. Most Linux distributions contain SYSLINUX. The documentation is in the files syslinux.txt and pxelinux.txt, which can most often be found in the directory /usr/share/doc/syslinux if the package is installed.

To use PXELINUX I copy the file pxelinux.0, which is part of the software package, onto a TFTP server and create a directory named pxelinux.cfg/. This directory will contain the configuration files for PXELINUX. The names of the configuration files depend on the MAC and IP address of the booting computer. PXELINUX looks for its configuration file in the pxelinux.cfg/ directory in the following order:

First it looks for a file named after the client UUID, if the PXE stack provides one. The standard UUID format uses hexadecimal numbers with lower case letters, for instance b8945908-d6a6-41a9-611d-74a6ab80b83d.
Next it looks for a file with a name pertaining to the type of hardware and hardware address (MAC), all in hexadecimal numbers with lower case letters which are separated by hyphens. Using an ethernet card with the MAC address 00:0D:B9:22:7D:24 it would look for a file named 01-00-0d-b9-22-7d-24. Note the prepended string 01-.
Next it looks for a file named after the IPv4 address encoded as a hexadecimal number with upper case letters (for instance, with IP address 192.168.1.5 it would look for a file named C0A80105). SYSLINUX includes a program called gethostip which computes the hexadecimal number for any IP address.
If there is no file with a name like the IP address in the previous step, it removes one hexadecimal digit from the end of the name and tries again until it finds a file or until there is no hexadecimal digit left in the name. Thus it is possible to use the same configuration file for just one IP address, sixteen addresses, 256, 4096, and so on, depending on the length of the file name.
If it still hasn’t found a file, it takes the file named default (in lower case) as a last resort.
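
The search order above can be sketched in a few lines of Python. This is only an illustration of the naming scheme, not code taken from PXELINUX. The function name pxelinux_config_candidates is my own, and the UUID step is skipped when no UUID is passed in.

```python
import ipaddress


def pxelinux_config_candidates(mac, ip, uuid=None):
    """Return the configuration file names to try, in search order."""
    candidates = []
    if uuid:
        # Client UUID in lower case, only if the PXE stack provides one.
        candidates.append(uuid.lower())
    # Hardware type 01 (Ethernet) followed by the MAC address with hyphens.
    candidates.append("01-" + mac.lower().replace(":", "-"))
    # IPv4 address as eight upper case hexadecimal digits.
    hex_ip = "%08X" % int(ipaddress.IPv4Address(ip))
    # Every further attempt drops one digit from the end of the name.
    for length in range(len(hex_ip), 0, -1):
        candidates.append(hex_ip[:length])
    # Last resort.
    candidates.append("default")
    return candidates


for name in pxelinux_config_candidates("00:0D:B9:22:7D:24", "192.168.1.5"):
    print("pxelinux.cfg/" + name)
```

For the example addresses used above, this lists 01-00-0d-b9-22-7d-24 first, then C0A80105, C0A8010 and so on down to C, and finally default.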

Since version 3.20 PXELINUX restarts the computer after a timeout if a configuration file hasn’t been found. Thus the computer remains active even when there are problems with the boot server.

PXELINUX needs a TFTP server which understands the tsize extension of the TFTP protocol. This could be tftp-hpa for instance.

The file syslinux.txt describes the directives available in the configuration file. The most important ones for PXELINUX are:

LOCALBOOT 0: with PXELINUX this means that the computer will boot from the local disk instead of from a kernel loaded through the network. 0 means a normal system start.
I always write this into the default file to avoid accidental installation.
SERIAL port [[baudrate] flowcontrol]: opens the serial interface as a console. For ALIX computers this directive looks like this:

  SERIAL 0 38400 0

DEFAULT kernel options: defines the kernel command line. This is used if PXELINUX starts automatically. It is possible to use a label here.
LABEL label: this entry is most often followed by a KERNEL entry which specifies the kernel and an APPEND entry which specifies the kernel command line.
I use the following entries, for instance, to install Linux via PXE boot on ALIX computers:

(The line break in the APPEND entry was added for formatting.)

  serial 0 38400
  console 0
  label linux
    KERNEL vmlinuz
    APPEND initrd=initrd console=ttyS0,38400n8 \
           root=/dev/hda1

This means the files vmlinuz and initrd are on the same TFTP server.

udev - managing devices dynamically

Linux uses udev to dynamically provide device files under /dev/. Originally the directory /dev/ contained a fixed set of device files for every device that might be connected to the computer. Accordingly the directory was crowded and confusing, and you didn’t know whether a file represented a device that was really connected until you tried it. Later devfs dynamically created device files under /dev/ for devices that were actually connected to the computer. Udev is the current way of managing connected devices.

To do this, udev relies on the information which the kernel provides via sysfs and on the rules stipulated by the Linux distribution and the user of the computer. Thus it is possible to:

rename device files
assign alternate or persistent names for a device via symbolic links
determine the name of a device file through the output of a program
change the access rights and ownership of device files
start a program if certain devices are connected to the computer
rename network interfaces

udev rules

Udev rules are read from files containing the suffix .rules in their names under the directories /lib/udev/rules.d/, /etc/udev/rules.d/ and /dev/.udev/rules.d/ (for temporary rules). All rule files are sorted by name and processed in lexical order regardless of the directory they are in. File names must be unique: duplicate file names are ignored. Files under /etc/udev/rules.d/ take precedence over those under /lib/udev/rules.d/. Thus it is possible to deactivate rule files under /lib/udev/rules.d/.
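
The ordering and precedence described above can be modelled roughly like this. It is a simplified sketch, not udev’s actual implementation: the function name effective_rule_files and the file names in the example are made up, and the directories are passed in from lowest to highest precedence.

```python
import posixpath


def effective_rule_files(directories):
    """Return the rule files that take effect, in processing order.

    `directories` is a list of (path, file names) pairs, ordered from
    lowest to highest precedence.
    """
    chosen = {}
    for path, names in directories:
        for name in names:
            if name.endswith(".rules"):
                # A later directory overrides an earlier one for the same name.
                chosen[name] = path
    # Files are processed in lexical order of their names,
    # regardless of the directory they are in.
    return [posixpath.join(chosen[name], name) for name in sorted(chosen)]


# Hypothetical file names, for illustration only.
for rule_file in effective_rule_files([
    ("/lib/udev/rules.d", ["50-udev-default.rules", "60-persistent-storage.rules"]),
    ("/etc/udev/rules.d", ["60-persistent-storage.rules", "10-local.rules"]),
]):
    print(rule_file)
```

In this example the copy of 60-persistent-storage.rules under /etc/udev/rules.d/ replaces the one under /lib/udev/rules.d/, which is how a distribution rule file can be overridden or deactivated.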

In the rule files, blank lines and lines starting with # are ignored. All other lines are interpreted as rules. Every rule must be on its own line and consist of one or more key-value pairs which are separated by a comma (,).

There are two kinds of keys: match and assignment. Once all of the match keys match their values, the assignment keys are assigned their value. The operator determines how matching and assignment are carried out. When writing my own rules I have to consult the udev manual page because some keys can be used for matching as well as for assignment (for instance ATTR{key} and ENV{key}).

I can use wild cards when matching. These have the following meaning:

*: matches zero, one or more arbitrary characters.
?: matches exactly one arbitrary character.
[]: matches exactly one of the characters or character ranges given in the square brackets (for instance [0-9] matches any digit). If the first character between the brackets is an exclamation mark (!), it matches any single character that is not listed in the brackets.
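
To get a feeling for these wild cards I can try them out in Python. The fnmatch module of the standard library understands the same shell-style syntax, so it serves as an approximation here; udev does its own matching, and the helper name udev_like_match is my own.

```python
from fnmatch import fnmatchcase


def udev_like_match(value, pattern):
    """Approximate a udev match using shell-style wild cards."""
    return fnmatchcase(value, pattern)


# * matches zero, one or more arbitrary characters.
print(udev_like_match("ttyUSB0", "ttyUSB*"))
print(udev_like_match("ttyUSB", "ttyUSB*"))
# ? matches exactly one arbitrary character.
print(udev_like_match("sda", "sd?"))
print(udev_like_match("sda1", "sd?"))
# [] matches one character from a set, [!...] one character not in the set.
print(udev_like_match("ttyS3", "ttyS[0-9]"))
print(udev_like_match("sda", "sd[!a]"))
```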

Some of the keys allow substitutions in their values. The complete list can be found on the manual page. Some important substitutions are:

$kernel | %k: the kernel name for the device
$number | %n: the kernel number for the device, for instance a partition number of a hard disk.
$result | %c: the output of an external program
$$: the dollar sign itself
%%: the percent sign itself

Information from sysfs

If I want to write udev rules, I try to describe the device as closely as possible. I can obtain the value of the different keys of a connected device using the udevadm program with the info command:

# udevadm info --query all --name /dev/ttyUSB0 \
          --attribute-walk

This command gives me all the information about the device currently accessible via /dev/ttyUSB0 that I need to identify it in my udev rules.

To get the information regarding a network card I use

# udevadm info --query all \
          --path /sys/class/net/eth0 \
          --attribute-walk

Developing rules with udevadm

The program udevadm does more than provide information about the connected devices. With it I can also monitor udev for events like the connection or removal of a USB device, test the rules I have written and affect the status of the running udev process. More information is available on the manual page.

DHCP

Dynamic Host Configuration Protocol (DHCP) is used to assign a network configuration to a client computer. The protocol is defined in RFC2131 and uses the UDP ports 67 (for the server or relay agent) and 68 (for the client).

DHCP is an extension of the Bootstrap Protocol (BOOTP). It is widely compatible with the latter and can - with some limitations - work together with BOOTP clients and servers.

The client and server send different messages depending on the state of the client and the validity of its network information.

DHCPDISCOVER: is sent by the client to the servers in the local network as a broadcast message.
DHCPOFFER: is the response from the servers after getting a DHCPDISCOVER message from the client.
DHCPREQUEST: is sent by the client to request one of the addresses offered by the servers. This message is also sent to ask the server for a renewal of the lease time for the requested address.
DHCPACK: is the acknowledgement by the server of the address request sent by the client.
DHCPNAK: is sent by the server if it refuses the request from the client.
DHCPDECLINE: is sent by the client if the address offered by the server is already in use.
DHCPRELEASE: is sent by the client to release resources.
DHCPINFORM: is sent by the client for queries regarding data without IP addresses, for instance, because it has its address manually set.

The server may work in three different modes which influence the lease time of an address assignment.

Using static allocation, the mapping of IP addresses to MAC addresses is determined beforehand by the administrator.

With automatic allocation the DHCP server has an IP address range from which it permanently assigns addresses to client MAC addresses. If this range is completely used up, no further client can get an IP address from this server.

Dynamic allocation is like automatic allocation with the exception that an IP address is only assigned for a determined amount of time and the client has to renew the assignment in time. The time during which an assignment is valid is called lease time.

Communication sequence

An initial assignment goes like this:

The client sends a UDP broadcast datagram with a DHCPDISCOVER message from address 0.0.0.0:68 to address 255.255.255.255:67.
One or more DHCP servers send DHCPOFFER messages as a UDP broadcast to 255.255.255.255:68 with the source port 67.
The client chooses one of the offers and sends a DHCPREQUEST message to the chosen server. The server is identified through the server ID in the message. The other servers interpret this message as a rejection of their offers and can offer their addresses elsewhere.
The chosen server acknowledges its offer with more relevant data (DHCPACK) or it withdraws its offer (DHCPNAK).
Before using the address, the client checks if it is already in use and if it is being used it rejects the address with DHCPDECLINE.
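
To make the first message of this exchange more tangible, here is a minimal sketch that assembles the payload of a DHCPDISCOVER message. It assumes the BOOTP message layout from RFC2131 and contains only the one option that marks the message type. The function name build_dhcpdiscover and the transaction ID are my own choices, and the sketch only builds the bytes, it doesn’t send them.

```python
import struct

DHCP_MAGIC_COOKIE = b"\x63\x82\x53\x63"
OPTION_MESSAGE_TYPE = 53
OPTION_END = 255
DHCPDISCOVER = 1


def build_dhcpdiscover(mac, xid):
    """Build a minimal DHCPDISCOVER payload for a MAC and transaction ID."""
    chaddr = bytes.fromhex(mac.replace(":", ""))
    header = struct.pack(
        "!BBBBIHH4s4s4s4s16s64s128s",
        1,        # op: request from client to server
        1,        # htype: Ethernet
        6,        # hlen: length of an Ethernet address
        0,        # hops
        xid,      # transaction ID chosen by the client
        0,        # secs
        0x8000,   # flags: ask the server to answer by broadcast
        b"",      # ciaddr: client has no address yet (padded with zeros)
        b"",      # yiaddr
        b"",      # siaddr
        b"",      # giaddr: filled in by a relay agent, if any
        chaddr,   # client hardware address, padded to 16 bytes
        b"",      # sname
        b"",      # file
    )
    options = bytes([OPTION_MESSAGE_TYPE, 1, DHCPDISCOVER, OPTION_END])
    return header + DHCP_MAGIC_COOKIE + options


packet = build_dhcpdiscover("00:0d:b9:22:7d:24", 0x12345678)
print(len(packet), packet[236:].hex())
```

Actually sending this datagram would need a UDP socket bound to port 68 with broadcasting enabled, which usually requires administrator rights.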

A refresh with dynamic allocation looks like this:

The client is told the lease time along with the IP address.
After half of the lease time has expired, the client sends a DHCPREQUEST message using unicast to the DHCP server to renew the lease.
When the server sends a DHCPACK message with a new lease time, the refresh is completed and the client can continue using the address. After half of the new lease time has expired, the client starts the next refresh.
If the server sends a DHCPNAK, the client must deactivate the use of the IP address on its network interface and begin a new initial assignment.
If the client doesn’t receive an answer from the server after 7/8 of the lease time, it sends a DHCPREQUEST message as a broadcast to get a renewal from any server.
If the client could not renew its IP address at the end of the lease time, it must deactivate the use of the address and begin a new initial assignment.
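
The two points in time mentioned above are easy to calculate. The following sketch uses the fractions given in the text (one half and 7/8 of the lease time); the function name renewal_times is my own.

```python
def renewal_times(lease_seconds):
    """Return (renew, rebind) offsets in seconds from the start of the lease."""
    # After half of the lease time: unicast DHCPREQUEST to the known server.
    renew = lease_seconds * 0.5
    # After 7/8 of the lease time: broadcast DHCPREQUEST to any server.
    rebind = lease_seconds * 0.875
    return renew, rebind


renew, rebind = renewal_times(3600)
print(renew, rebind)
```

With a lease time of one hour the client would try to renew after 30 minutes and fall back to a broadcast after 52.5 minutes.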

DHCP for different subnetworks

Remote networks may be connected to a DHCP server via DHCP relay agents. The relay agent receives the broadcast messages from the client and forwards them to the servers. The agent adds the IP address of the interface on which it received the broadcast to the datagram so that the server can determine the network for which the client should be configured. The relay agent receives the answer from the server at UDP port 67 and forwards it to the client’s port 68.

Security

DHCP is easy to disrupt since DHCP clients accept every DHCP server. If a foreign DHCP server is accidentally integrated into a network, the network can be paralyzed to a large extent.

An attacker can register all addresses of a DHCP server to prevent this server from responding to further requests (DHCP Starvation Hack). Afterwards the attacker can behave like a DHCP server.

When using DHCP in production networks, appropriate arrangements should be made like manageable switches and monitoring for unauthorized DHCP servers.

IPv6

IPv6 does not need a DHCP service to configure network addresses. To distribute other information there is the protocol DHCPv6 which is specified in RFC3315 and does approximately the same for IPv6 as DHCPv4 does for IPv4. Unlike DHCPv4 the communication runs from UDP port 546 of the client to port 547 of the server.

Trivial File Transfer Protocol (TFTP)

TFTP is a simple protocol for transferring files between computers. The protocol uses connectionless protocols like UDP for file transfer. It is especially designed to load the operating system using firmware or small bootloaders and has the following characteristics:

reading and writing files from or to a server
no directory listings
no authentication, compression or encryption

RFC1350 describes the version of the protocol that is currently in use. RFC2347 describes how to extend the protocol with options and RFC2349 defines the transfer size option (tsize) needed for PXELINUX. This option enables the sending party to inform the receiving party of the size of the file to be transferred.

TFTP file transfer is always done between a client and a server and is initiated by the client. After a connection has been established this distinction is irrelevant for the protocol and it is more helpful to distinguish between sender and receiver. However a distinction between client and server is necessary for establishing the connection because it is only the server which waits for connections with a predetermined transfer identifier (TID). The TIDs of both partners are nothing more than the UDP ports in use, and the server waits at UDP port 69 for connections.

Establishing a connection

A client initiates a TFTP connection by sending a read request (RRQ) or write request (WRQ) and possibly some options to UDP port 69 of the server computer. The client’s UDP port is selected randomly. The server determines a random UDP port for its side of this connection and sends its response to the client address and port. The server’s initial response may contain:

an OACK message to accept or relay options
an ACK message to accept the write request without options
a message containing the first data block if the server sends the file but doesn’t accept options
an error message

Most error messages instantly close down the connection. Details can be found in the RFCs referred to above.
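
A read request with the tsize option, as PXELINUX would need it, could be assembled like this. This is a sketch under the assumption of the packet layout from RFC1350 and RFC2347 (opcode, then zero-terminated strings); the function name build_rrq is my own and the code only builds the request, it doesn’t talk to a server.

```python
import struct

OPCODE_RRQ = 1


def build_rrq(filename, mode="octet", options=None):
    """Build a TFTP read request, optionally with name/value options."""
    packet = struct.pack("!H", OPCODE_RRQ)
    packet += filename.encode("ascii") + b"\0"
    packet += mode.encode("ascii") + b"\0"
    for name, value in (options or {}).items():
        # Every option is a pair of zero-terminated strings.
        packet += name.encode("ascii") + b"\0"
        packet += str(value).encode("ascii") + b"\0"
    return packet


# In a read request the client sends tsize with the value 0
# and expects the real size in the OACK message from the server.
request = build_rrq("pxelinux.0", options={"tsize": 0})
print(request)
```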

Data transfer

When it comes to the actual data transfer it is more helpful to distinguish between sender and receiver because client and server behave exactly the same when they send or receive a file.

During data transfer the sender always sends one data packet with 512 bytes of data (unless there was a different block size negotiated when the connection was established as described in RFC2348) together with the respective block number. The receiver always answers with an acknowledgement message, which contains the same block number, or with an error message. If one message gets lost in transmission, the last message is retransmitted after a timeout.

Data transfer ends when a message with less data than the negotiated block size (or 512 bytes if nothing was negotiated) is sent. If the size of the file is an integer multiple of the block size, the sender must send a data message with 0 bytes of data to end the transfer. The transfer also ends when there is an error.
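
The rule for ending a transfer can be illustrated with a small generator that cuts a file into data messages. It is a minimal sketch assuming the DATA packet layout from RFC1350 (opcode 3, block number, data). The name data_packets is my own, and files that need more than 65535 blocks are out of scope here.

```python
import struct

OPCODE_DATA = 3


def data_packets(payload, block_size=512):
    """Yield the DATA packets a sender would transmit for `payload`."""
    block_number = 1
    offset = 0
    while True:
        chunk = payload[offset:offset + block_size]
        yield struct.pack("!HH", OPCODE_DATA, block_number) + chunk
        if len(chunk) < block_size:
            # A short (possibly empty) block ends the transfer.
            break
        offset += block_size
        block_number += 1


# 1024 bytes are an integer multiple of 512, so a third packet
# with 0 bytes of data is needed to end the transfer.
for packet in data_packets(b"x" * 1024):
    print(len(packet) - 4)
```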

Analyzing a TFTP session

When analyzing the protocol at network level with tcpdump or wireshark it is important to know that I can’t easily filter the connection using ports. I must capture all of the UDP datagrams and possibly the ICMP messages between the two computers to analyze a TFTP session. One possible command line for tcpdump would be for instance:

# tcpdump -w tftp.pcap \
  \( icmp or udp \) and host client and host server

After finishing the capture I can determine the client’s UDP port and use this knowledge to filter out the whole session:

# tcpdump -n -r tftp.pcap \
  host server and port 69
# tcpdump -nv -r tftp.pcap \
  host client and port clientport

If even the first datagram from the server is missing, an ICMP port unreachable message can signal that there is no TFTP daemon running on the server computer.

Zero Configuration Networking

The origin of Zero Configuration Networking (Zeroconf) dates back to the 1990s. At that time Stuart Cheshire was working on networking computers using IP without explicit configuration. This technology has been integrated into Apple computers for some time now, first under the name Rendezvous and later under the name Bonjour. It is available for Mac OS X, Windows, Linux, BSD UNIX and other operating systems. A very good book about Zeroconf is Zero Configuration Networking / The Definitive Guide by Stuart Cheshire and Daniel H. Steinberg.

From a technical standpoint Zeroconf is the combination of three technologies: Dynamic Configuration of IPv4 Link-Local Addresses (RFC3927), Multicast-DNS and DNS Service Discovery. The goal of Zeroconf is to make setting up a networking device as easy as turning on a table lamp: plug it in, switch it on, it works.

Link local addressing

Every device in an IP network needs at least one unique IP address. If I operate my own network or connect to a professionally managed network, the address is either configured manually or provided through DHCP. Both methods require preparation work which I don’t want to do if I just want to temporarily connect two laptops with an ethernet cable. This is precisely where I can use the automatic assignment of addresses according to RFC3927, for which the address range from 169.254.1.0 to 169.254.254.255 is reserved.

The procedure goes like this:

The computer chooses a random address in the mentioned range.
The computer sends ARP requests to find out whether another computer is already using this address. To do this it sets the sender IP address to 0.0.0.0 and the sender hardware address to its own hardware address.
a) If another computer claims the address by responding to the ARP request, the computer starts over from the first step.
b) If there is no response after a few seconds, the computer sends several ARP announcements to claim this address for itself and to clear possible stale entries for this address in the ARP caches of other computers.
If another computer tries to use this address at a later point in time, the computer defends the address by responding to the ARP requests.
If there are later conflicts, for instance because network segments which had been separated are then connected or because a rogue computer isn’t playing by the rules, the standard requires the following: If a computer sees an ARP request for its own address from another computer, it sends, at most, one ARP response to raise its claim. If the other device doesn’t give up the address, the computer has to abandon the use of the address and start over from the first step.
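
The first step of this procedure, choosing a random address from the reserved range 169.254.1.0 to 169.254.254.255, can be sketched as follows. The function name pick_link_local_address is my own. Seeding the random number generator with the hardware address is an assumption I make here so that a computer tends to pick the same address again. The ARP probing from the later steps is not part of this sketch.

```python
import ipaddress
import random

FIRST = ipaddress.IPv4Address("169.254.1.0")
LAST = ipaddress.IPv4Address("169.254.254.255")


def pick_link_local_address(rng):
    """Pick a random candidate address from the link-local range."""
    offset = rng.randrange(int(LAST) - int(FIRST) + 1)
    return FIRST + offset


# Assumption: seed with the hardware address of the interface.
rng = random.Random("00:0d:b9:22:7d:24")
print(pick_link_local_address(rng))
# After a conflict, simply draw the next candidate from the same generator.
print(pick_link_local_address(rng))
```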

Multicast DNS

After getting an IP address for my computer I then need a name to refer to this IP address. To do this, I can use Multicast DNS (mDNS) if there is no configured DNS server.

There is no central authority within mDNS. Instead every client who wants to make a query sends its request through multicast to every interested machine in the network and responds when it sees a query for its own name.

Zeroconf uses the domain .local to distinguish between local names and existing domains. Like the addresses in the network 169.254/16, names in this domain are only unique in the local network. Names in this top level domain are usually resolved with mDNS. Every mDNS query is sent to multicast address 224.0.0.251 (FF02::FB with IPv6) and port 5353.

Multicast DNS identifies three categories of queries:

one-shot queries
one-shot queries that accumulate multiple responses
continuous, ongoing queries

For one-shot queries where the client expects just one answer, it only sends a DNS query to UDP port 5353 at address 224.0.0.251. This functionality is enough to resolve, for instance, http://some-name.local/ in a web browser.
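
Such a one-shot query is an ordinary DNS query, only the destination is different. The following sketch builds the query for an address record of some-name.local. It assumes the standard DNS message format (a header with six 16-bit fields followed by the question), and the function name build_query is my own.

```python
import struct

MDNS_ADDRESS = "224.0.0.251"
MDNS_PORT = 5353


def build_query(hostname, qtype=1, qclass=1):
    """Build a DNS question for `hostname` (qtype 1 = A, qclass 1 = IN)."""
    # Header: ID 0, no flags, one question, no other records.
    header = struct.pack("!HHHHHH", 0, 0, 1, 0, 0, 0)
    question = b""
    for label in hostname.rstrip(".").split("."):
        encoded = label.encode("ascii")
        # Every label is preceded by its length.
        question += bytes([len(encoded)]) + encoded
    # A zero length byte ends the name, then query type and class follow.
    question += b"\0" + struct.pack("!HH", qtype, qclass)
    return header + question


query = build_query("some-name.local")
print(query.hex())
```

To actually send the query I would pass it, together with the pair (MDNS_ADDRESS, MDNS_PORT), to the sendto method of a UDP socket and then wait for the first response.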

With the second type of query the client knows that there may be multiple responses. It collects the responses and possibly repeats the query. With a repeated query it sends a list of already received responses so that these need not be sent another time. Follow-up queries are sent at a decreasing rate, in other words, with a growing interval between them.

Ongoing queries are used for instance for lists of available services. These queries are also sent at increasing intervals (up to one hour) and contain a list of the known responses. Responses to these queries go to the multicast address so that they can be registered by all interested computers. To keep the list up-to-date without reducing the interval, new computers use unsolicited responses when they arrive in a network to indicate their presence. Every response contains a lifetime, and just before this lifetime expires the computer asks again for a response. If a computer notices that one of its responses is invalid (for instance because it is going to shut down) it sends a goodbye message, which is an unsolicited response with a lifetime of zero.

A computer does the following to claim a name in mDNS:

The computer chooses a name for its IP address.
The computer asks three times, with a 250 ms idle time between requests, whether the name is already in use.
a) If there is a response within 750 ms it goes back to the first step (and possibly sends a message to the user interface).
b) If two computers claim the same name at the same time, there is a procedure to resolve this conflict.
c) If there is no answer, the computer starts to claim the name with unsolicited responses. Its neighbors will replace all information regarding this name with the new information.
A computer must recognize conflicts at any time and be able to send mDNS messages, not only in the probing phase.

DNS Service Discovery (DNS-SD)

With the first two components we can determine an IP address and a unique name in the network without explicit configuration. However it is better to be able to choose the service I need at the time from a list of services. IP addresses are established when a computer connects to the network and they change often. The names are usually controlled by the user and are relatively permanent. But the services I am interested in can be found with DNS-SD.

If DNS-SD uses mDNS, the same rules for conflict resolution apply. Services are advertised under the subdomains _tcp and _udp (for example: Internet Printing Protocol _ipp._tcp.local). There was a free registry for services until 2010 at www.dns-sd.org but now these services may be registered directly with IANA.

Because there may be different protocols for the same service - for instance UNIX LPR (_printer._tcp), IPP (_ipp._tcp), TCP connections to port 9100 (_pdl-datastream._tcp), remote USB emulations (_riousbprint._tcp) for printer services - one of these protocols is called the flagship protocol. Any computer that wants to register one of these protocols with mDNS must also register the flagship protocol so that the mDNS conflict resolution can work if two computers try to register two different print services with the same name. If one computer does not use the flagship protocol, it sets its port to 0 to make it clear that the name is claimed but that this protocol is not in service.

DNS-SD TXT records allow more information to be provided that is not otherwise available with the protocol.

DNS Service Discovery may use standard DNS and - using this - extend beyond the limit of the local network.

On Linux the Avahi framework provides appropriate software for Zero Configuration Networking. I can find Zeroconf extensions for other software, for instance the Apache web server, on Debian-derived distributions using apt-cache search zeroconf.
