8. Appendix

8.1 Adapt your logging techniques to the technology characteristics

Broadly speaking, four main dimensions shape how we use logging and metrics:

  1. The end-to-end connection quality (bandwidth, reliability) between the originating node and the collecting/ingest system - “how much data can we transmit?”
  2. The processing power and storage on the originating node - “how much data can we store and forward from the node?”
  3. What permissions we have to transmit data from the originating node - “how far can we push the data?”
  4. What is the cost to transmit data - “how much per gigabyte?”

For example, consider a stock control system for a large manufacturer running in a secure data centre. Although the software system has access to Gigabit-speed internet bandwidth via fiber connectivity (high connection quality - #1 above) and runs on x86 VMs (good processing power - #2 above), the company refuses to allow log data and metrics to be shipped outside the data centre (limited ability to “push” the data - #3). In this case, the log aggregation system and metrics collector need to run inside the same data centre, where they will likely face resource constraints (storage and processing power), so the emission of log and metrics data needs to be controlled.

Conversely, consider a building management system whose smart sensors are deployed in customers’ offices and communicate with a central web-based system over the customers’ internet connections as part of the provided service. The constraints here are very different: the sensors might be reasonably capable ARM-based devices with plenty of onboard storage, and we can assume that the network connection is fast, but we should take care not to swamp the customers’ connections with too much data, as this may cost them additional fees.

A SaaS-based online shopping website running in 2024 likely has excellent bandwidth, capable runtime hardware nodes, permission to send data to a collector/aggregator, and reasonable bandwidth costs.

Logging and metrics for applications running on x86-based machines

If your software runs directly on a machine that resembles a commodity x86-based server or desktop machine (whether 32-bit or 64-bit, physical or virtual), then it is usually safe to assume that:

  • the machine has plenty of RAM and CPU power, and probably multiple CPU cores
  • there is ample local storage for files (including log files)
  • the machine has a fast Ethernet connection (1 Gbit/s or more)

In this context, a reasonable model for dealing with log data and metrics is to use a locally-installed log agent (a separate daemon/service) that watches log files written to local storage and then forwards these logs to a central location. Metrics can be derived from log data or can be emitted directly from the application with an assumption that the metrics will be collected more or less immediately by the central server.
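As a minimal sketch of this model - assuming a hypothetical log path and collector URL, and the third-party Python “requests” library - a local agent might tail the application’s log file and forward each new line to the central collector:

    # Minimal log-forwarding agent sketch: tail a local log file and ship
    # new lines to a central collector. The path and URL are illustrative
    # placeholders, not a real product's configuration.
    import time
    import requests  # third-party HTTP library, assumed to be installed

    LOG_PATH = "/var/log/myapp/app.log"                # hypothetical app log
    COLLECTOR_URL = "https://logs.example.com/ingest"  # hypothetical collector

    def follow(path):
        """Yield lines appended to the file, like 'tail -f'."""
        with open(path, "r") as f:
            f.seek(0, 2)  # start at the end of the file
            while True:
                line = f.readline()
                if not line:
                    time.sleep(0.5)  # wait for more output
                    continue
                yield line.rstrip("\n")

    for line in follow(LOG_PATH):
        # Forward each line; a production agent would batch, retry,
        # and buffer locally to survive collector outages.
        requests.post(COLLECTOR_URL, json={"message": line})

Real agents (such as Fluentd or Filebeat) add batching, backpressure, and durable buffering on top of this basic loop.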

Logging and metrics for containers and containerized applications

If your software runs in a container (Docker, LXC, etc.), good practice is to avoid writing to the local filesystem inside the container. Storage for containers is heavily virtualised and not optimal for simultaneous use by many hundreds of containers.

Instead, containerized applications should write logs to STDOUT / STDERR, relying on the container fabric or helper containers to listen for this STDOUT output and forward the log messages to a central collector. Generally, the log listeners will decorate the log messages with details of the container ID and type, providing more context for the log messages when they appear in the central aggregator.
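A minimal sketch of this pattern in Python - the JSON field names here are illustrative, not a required schema - configures the standard logging module to write structured lines to STDOUT:

    # Container-friendly logging sketch: write structured JSON to STDOUT
    # and let the container runtime or a sidecar forward and decorate it.
    import json
    import logging
    import sys

    class JsonFormatter(logging.Formatter):
        def format(self, record):
            return json.dumps({
                "level": record.levelname,
                "logger": record.name,
                "message": record.getMessage(),
            })

    handler = logging.StreamHandler(sys.stdout)  # STDOUT, not a local file
    handler.setFormatter(JsonFormatter())
    logging.basicConfig(level=logging.INFO, handlers=[handler])

    logging.getLogger("orders").info("order accepted")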

Time-series metrics for containerized applications can rapidly saturate a network due to the large number of running containers and the resulting volume of metrics traffic. In these cases, metrics can be most effectively managed by using a pre-aggregation helper.

The containerized applications send their metrics to an aggregator container running on the same host (thereby using only the host’s network card and not the switched network itself). The aggregator then decorates the application metrics with system-level metrics and (possibly) compresses or collates the application metrics before sending them to the central collector.
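The sketch below illustrates the idea - the port number and the colon-separated wire format are assumptions for this example, not a specific product’s protocol. The application fires counters over UDP to localhost, and the aggregator sums them over a flush interval so that only one summary per metric leaves the host:

    # Pre-aggregation sketch: app-side emit() plus a host-local aggregator.
    import socket
    import time
    from collections import Counter

    AGG_ADDR = ("127.0.0.1", 8125)  # hypothetical local aggregator address

    def emit(metric, value=1):
        """Application side: fire-and-forget UDP that stays on the host."""
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.sendto(f"{metric}:{value}".encode(), AGG_ADDR)

    def run_aggregator(flush_seconds=10):
        """Aggregator side: sum counters, then forward one summary."""
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.bind(AGG_ADDR)
        sock.settimeout(1.0)
        counts = Counter()
        deadline = time.time() + flush_seconds
        while time.time() < deadline:
            try:
                data, _ = sock.recvfrom(1024)
                metric, value = data.decode().split(":")
                counts[metric] += int(value)
            except socket.timeout:
                continue
        return counts  # one summary per metric, ready for the collector

This is essentially how StatsD-style aggregation works: many cheap increments on the host become one datapoint per metric per interval on the network.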

Logging and metrics for Serverless / Function-as-a-Service (FaaS) and Platform-as-a-Service (PaaS)

For Serverless or Function-as-a-Service applications (such as AWS Lambda, Azure Functions, and Google Cloud Functions) and applications using a Platform-as-a-Service framework like CloudFoundry or OpenShift, the logging and metrics options are typically (deliberately) restricted, with the platform or runtime hiding the implementation details.

For FaaS and PaaS applications, you will typically either use a library specific to the platform, or send data to a default endpoint (STDOUT or localhost) for the platform to collect. For Serverless/FaaS systems, a “log reflector” approach using one or more additional helper functions seems to work well.
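As a small illustration - the field names are our own, not a platform requirement - a function handler can write one structured JSON line per invocation to STDOUT; on AWS Lambda, for example, anything written to STDOUT is collected into CloudWatch Logs:

    # FaaS logging sketch: one structured JSON log line per invocation,
    # written to STDOUT for the platform's log pipeline to collect.
    import json
    import time

    def handler(event, context):
        start = time.time()
        result = {"status": "ok"}  # placeholder for the real work
        print(json.dumps({
            "event_type": "invocation",
            "status": result["status"],
            "duration_ms": round((time.time() - start) * 1000, 2),
        }))
        return result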

Logging and metrics for IoT / connected devices and embedded systems

With the increase in internet-connected devices (“Internet of Things”), autonomous vehicles, medical and agricultural devices, building sensors, and industrial automation, we need to consider sensible approaches to log aggregation and metrics for IoT devices. Such systems are typically characterised by:

  • Limited compute resources (slow CPU, limited RAM, storage that is slow and/or limited in size, limited bandwidth)
  • Only occasional network connectivity (often by design)
  • Protocols other than HTTP as the initial transport mechanism (MQTT, LoRaWAN, etc.)

In these cases, we need to adopt a pragmatic approach to logging and metrics that still gives us the benefits of rich information and invaluable insights into the operational effectiveness of the device or system without compromising CPU/RAM/network limits.

We should be highly selective about what data we send and when we send it from the remote device to any data collection endpoint; if possible, send only essential information in a compressed form. On some IoT networks - such as LoRaWAN - data transmission bandwidth is extremely restricted, so use the bandwidth wisely. It is also important to limit the size of on-device storage used for log and metrics data: overwrite old data if necessary.
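The sketch below shows one way to apply these constraints - the record shape and the send() callback are placeholders; a real device would publish over its actual transport (MQTT, LoRaWAN, and so on). A bounded buffer overwrites the oldest data automatically, and a flush sends one compressed batch when a connection window opens:

    # IoT buffering sketch: bounded on-device storage plus compressed,
    # batched transmission when connectivity is available.
    import json
    import zlib
    from collections import deque

    buffer = deque(maxlen=200)  # oldest records are dropped automatically

    def record(timestamp, value):
        """Store only the essential fields to keep each record small."""
        buffer.append({"t": timestamp, "v": value})

    def flush(send):
        """When a connection window opens, send one compressed batch."""
        if not buffer:
            return
        payload = zlib.compress(json.dumps(list(buffer)).encode())
        send(payload)  # placeholder for the device's real transport
        buffer.clear()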

Logging and metrics for mobile apps

Personal mobile devices (phones, tablets, and similar devices) sit between x86 machines and IoT devices in terms of storage and processing capabilities. When generating and sending log data and metrics from mobile devices, we need to expect that:

  • The network connectivity is intermittent
  • The network connectivity is expensive for the user
  • The device has limited storage space remaining

This means we should be quite selective about what data we send and when we send it from the mobile device to any remote data collection endpoint: wait for a Wi-Fi connection (rather than cellular data) to send log data and metrics, and even then send only essential information in a compressed form. It is also important to limit the size of on-device storage used for log and metrics data, overwriting old data if necessary.
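The same buffering idea applies on mobile, with the upload gated on connection type. In this sketch, is_on_wifi() and send() stand in for the platform’s connectivity and networking APIs (for example, ConnectivityManager on Android):

    # Mobile upload-policy sketch: defer sending until an unmetered
    # (Wi-Fi) connection is available, then ship one compressed batch.
    import json
    import zlib
    from collections import deque

    pending = deque(maxlen=500)  # bounded local store; old entries drop off

    def log_event(event):
        pending.append(event)

    def maybe_upload(is_on_wifi, send):
        if not is_on_wifi() or not pending:
            return  # wait for Wi-Fi and for something worth sending
        batch = zlib.compress(json.dumps(list(pending)).encode())
        send(batch)  # essential data only, in compressed form
        pending.clear()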

8.2 Understand how the complexity of modern distributed systems drives a need for a focus on operability

“As systems become more complex, this reductionist way of understanding them fails; they behave in ways that cannot feasibly be predicted from understanding of the individual parts, or were not expected by the system designer who assembled the parts, or both.” – Jeffrey Mogul, p.293

Modern computer systems are typically highly distributed, multi-node systems that act as part of a larger service. In order to cope with the many different failure modes of these kinds of systems, we need to address operability as a first-class concern.

Many professional and college/university courses in software engineering, computer science, and programming have only recently begun to tackle the distributed nature of modern software, leaving many software professionals unaware of the need for operability as a foundation for effective distributed software systems.

This effect has been exacerbated by the way in which people have misinterpreted the lack of an operability focus in the Agile Manifesto to mean that operability (and operational concerns in general) are not important for Agile software development. The Agile Manifesto was written at a time when most software development was for user-installed desktop PC software whose operational needs were very simple (a single computer). Software in 2024 and beyond needs an additional focus on operability above and beyond the user focus of the Agile Manifesto.

As the Internet of Things (IoT) drives a proliferation of network-connected devices for both consumer and industrial applications, the potential for unpredictable side-effects in software interactions increases (see Leveson2017a). A strong commitment to operability as a core aspect of modern software is crucial to ensuring that these interconnected systems work effectively.