Chapter 2: Engineering concerns and platform security
If cryptography is the proverbial vault door, it makes sense to evaluate the rest of the building, and the foundation it’s built on, before deciding on the specifics of the vault door. It’s often repeated that a secure system is only as secure as its weakest component, which means the other parts of the system must be up to par before cryptography is useful. In this chapter, we’ll look at the engineering concerns around secure systems, particularly for Unix-based systems.
Basic security
Security should provide authentication, authorisation, and auditing. Authentication means that the system verifies the identity of parties interacting with the system; authorisation verifies that they should be allowed to carry out this interaction; and auditing creates a log of security events that can be verified and checked to ensure the system is providing security. The end goal is some assurance that the system is secure.
Authentication
Authentication asks the question, “who am I talking to?”; it attempts to verify the identity of some party. Passwords are one means of authentication; they aren’t a strong authentication mechanism because anyone who knows the password (whether because they chose it, were given it, or guessed it) will be authenticated. Multifactor authentication attempts to provide a stronger basis for assurance of an identity, and is based on three factors:
- Something you know (such as a password)
- Something you have (such as an authentication token of some kind)
- Something you are (such as biometrics)
The most common multifactor configuration in use is two-factor authentication employing the first two factors. A user might be required to enter both their password and a time-based one-time password (TOTP) generated by an app on their phone. The assumption here is that the key used to generate the TOTP is present only on the authenticating service (e.g. the mail provider) and the user’s phone, and that the user’s phone hasn’t been compromised or the key extracted from it. Both sides share the key for this one-time password (OTP), and they do not have to communicate any additional information after setup to verify the password.
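To make this concrete, here is a minimal sketch of the TOTP computation using only the standard library. The totp function is a hypothetical name, the secret shown is the RFC 6238 test key, and a real deployment would also handle secret provisioning, clock-skew windows, and rate limiting.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha1"
	"encoding/binary"
	"fmt"
	"time"
)

// totp computes a six-digit time-based one-time password from a shared
// secret and a timestamp, using 30-second time steps (RFC 6238).
func totp(secret []byte, t time.Time) string {
	counter := uint64(t.Unix() / 30)

	var msg [8]byte
	binary.BigEndian.PutUint64(msg[:], counter)

	mac := hmac.New(sha1.New, secret)
	mac.Write(msg[:])
	sum := mac.Sum(nil)

	// Dynamic truncation (RFC 4226, section 5.3).
	offset := sum[len(sum)-1] & 0x0f
	code := binary.BigEndian.Uint32(sum[offset:offset+4]) & 0x7fffffff
	return fmt.Sprintf("%06d", code%1000000)
}

func main() {
	// The RFC 6238 test key; real secrets are provisioned during
	// enrolment and held by both the server and the phone.
	secret := []byte("12345678901234567890")
	fmt.Println(totp(secret, time.Now()))
}
```

Because both sides compute the same function from the shared key and the current time, no further communication is needed to verify the password.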
This is in contrast with two-step verification; an example is an SMS code sent to the phone. The user and server have to communicate, albeit over a different channel than the browser, to share this code, and that communication still provides a channel for intercepting the code.
Authorisation
Authorisation asks the question, “should you be doing this?” Authorisation relies on an access control mechanism of some kind. This might be as simple as an access control list, where the system has a list of parties that should have access to a resource or should be allowed to perform some operation. The Unix security model uses a set of access control lists for reading, writing, and executing by the owner, the group the resource belongs to, and the world. It employs “discretionary access control”: a user can explicitly change the values of those access control lists, giving other users and groups permission at their discretion. A mandatory access control model (such as provided by SELinux or AppArmor) operates on security levels or labels; each label is given a set of capabilities. Users or processes are given a label or security level, and they can only operate within the confines of the permitted capabilities.
As an example, a user might create a text file and opt to make it world-readable in the DAC model: any process or user can now access it. In the MAC model, access to that file would be restricted by label. If a process or user doesn’t have permissions based on their label, they cannot access it, and the original user simply cannot share the text file in this way. The labels are assigned by an administrator or security officer, and the user cannot change this. Access control is no longer at the user’s discretion, but mandatory.
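As a minimal sketch of DAC in Go, assuming a hypothetical notes.txt: the owner creates the file privately, then widens access at their own discretion, which is exactly what a MAC policy could forbid.

```go
package main

import (
	"log"
	"os"
)

func main() {
	// Create a file readable and writable only by its owner (0600).
	f, err := os.OpenFile("notes.txt", os.O_CREATE|os.O_WRONLY, 0600)
	if err != nil {
		log.Fatal(err)
	}
	f.Close()

	// Under DAC, the owner may later grant read access to the group
	// and the world (0644) at their discretion; under MAC, the label
	// policy decides, not the owner.
	if err := os.Chmod("notes.txt", 0644); err != nil {
		log.Fatal(err)
	}
}
```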
An alternative to access control lists is the role-based access control (RBAC) security model. On Unix systems, root has full control over the system; in a role-based system, this control is split among several roles, each of which has the minimum set of permissions needed to carry out that role. This model is also more fine-grained than an access control list in that it can grant permissions for specific operations.
Auditing
Security efforts are for nought if no one is auditing the system to ensure it is operating correctly. An audit log that records security events should be available, with access restricted to auditors. The events recorded and the details present will vary based on the requirements of the system. Auditors should also be confident that the audit log has not been tampered with.
An attacker that successfully authenticates may not leave any indication that the system is compromised. The only way to identify the compromise is through positive auditing: that is, auditing the record of events that succeeded. Whether the risk of such a compromise outweighs the need to maintain usage privacy needs to be considered.
Policy
There should be a set of policies that clearly specify the authentication, authorisation, and auditing in a system. For a large organisation, this may be fairly complex. For an application given to users, it might be as simple as password-based authentication and printing security failures to the system log. This isn’t an engineering concern, per se, but it must factor into a secure application.
Specifications
Specifications are a critical part of building a secure system: they define what its behaviour should be. From a security perspective, it’s also important to understand what its behaviour must not be. The security model is part of the system’s specification, and care should be taken to build it properly.
Testing is an important part of the specification as well: it provides assurance that the system behaves according to the specification. Unit tests verify the code paths within each unit of the system, functional tests verify that components and the system as a whole behave as they should, and regression tests make sure that bugs aren’t re-introduced. Integration tests may also be useful to verify compatibility.
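As an illustration of positive and negative testing, here is a sketch of a Go test file that checks both that a valid message authenticates and that a tampered one is rejected. The hmacTag helper is hypothetical, shown inline so the example is self-contained.

```go
package mac

import (
	"crypto/hmac"
	"crypto/sha256"
	"testing"
)

// hmacTag authenticates a message under a shared key.
func hmacTag(key, msg []byte) []byte {
	h := hmac.New(sha256.New, key)
	h.Write(msg)
	return h.Sum(nil)
}

// Positive test: a valid message verifies against its tag.
func TestTagVerifies(t *testing.T) {
	key := []byte("example key")
	msg := []byte("attack at dawn")
	tag := hmacTag(key, msg)
	if !hmac.Equal(tag, hmacTag(key, msg)) {
		t.Fatal("valid tag was rejected")
	}
}

// Negative test: a tampered message must fail verification.
func TestTamperedMessageFails(t *testing.T) {
	key := []byte("example key")
	msg := []byte("attack at dawn")
	tag := hmacTag(key, msg)
	msg[0] ^= 1 // flip one bit to simulate tampering
	if hmac.Equal(tag, hmacTag(key, msg)) {
		t.Fatal("tampered message was accepted")
	}
}
```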
Building secure systems depends on writing correct code. A system that ships with defects ships with security flaws that will probably subvert any cryptographic security built in. The more code that is present in a system, the greater the attack surface: only the minimum code needed to implement a system that fits the specifications should be used. This applies to any libraries used by the system as well: where possible, remove any unused functionality. This lowers the cost of the system, too: less test code has to be written, which reduces both the time and financial costs.
Security models
One of the key steps in designing a secure system is building a security model. A security model describes the assumptions made about the system and the conditions under which security can be provided, and identifies the threats to the system and their capabilities. A secure system cannot be built unless its characteristics and the problems it is intended to solve are understood. A security model should include an analysis of the system’s attack surface (what components can come under attack), realistic threat vectors (where attacks come from), the parties that can attack the system and their capabilities, and what countermeasures will be required to provide security. The security model should not just cover the cryptographic components: the environment and platforms that the system will run in and on must be considered as well. It’s also important to consider which problems are technical in nature and which are social in nature. Trust is also a key consideration in this model; that is, understanding the roles, intended capabilities, and interactions of legitimate parties, as well as the impact of a compromised trusted party.
From experience, it is extremely difficult to bolt security onto a system after it has been designed or built. It is important to begin discussing the security requirements during the initial stages of the system’s design, for the same reasons it is important to consider the other technical requirements. Failure to consider the load level of the system, for example, may result in poor architectural decisions that add a great deal of technical debt and impede building a stable, reliable system. In the same manner, failure to consider the security requirements may result in similarly poor architectural decisions. A secure system must be reliable and it must be correct; most security vulnerabilities arise from exploiting parts of a system that do not behave correctly. Proper engineering is key to secure systems; clear specifications and both positive and negative testing (testing both that the system behaves correctly and that it fails gracefully) will greatly improve the system’s ability to fulfill its security objectives. It’s useful to consider security as a performance metric. The performance of a secure system relates to its ability to operate without being compromised, and its ability to recover from a compromise. The non-security performance of a secure system must also be considered: if the system is too slow or too difficult to use, it won’t be used. An unused secure system is an insecure system, as it fails to provide any security at all.
The security components must service the other objectives of the system; they must do something useful inside the specifications of the system.
As part of the specifications of a secure system, the choice of cryptographic algorithms must also be made. In this book, we will prefer the NaCl algorithms for greenfield designs: they were designed by a well-respected cryptographer who is known for writing well-engineered code, they have a simple interface that is easy to use, and they were not designed by NIST. They offer high performance and strong security properties. In other cases, compatibility with existing systems or standards (such as FIPS) is required; in those cases, the compatible algorithms should be chosen.
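As a taste of that interface, here is a minimal sketch of symmetric encryption with NaCl’s secretbox via the golang.org/x/crypto/nacl/secretbox package. Prefixing the ciphertext with the nonce is a convention of this example, not something the package mandates.

```go
package main

import (
	"crypto/rand"
	"fmt"
	"io"

	"golang.org/x/crypto/nacl/secretbox"
)

func main() {
	// Generate a random 256-bit key and 192-bit nonce.
	var key [32]byte
	var nonce [24]byte
	if _, err := io.ReadFull(rand.Reader, key[:]); err != nil {
		panic(err)
	}
	if _, err := io.ReadFull(rand.Reader, nonce[:]); err != nil {
		panic(err)
	}

	// Seal appends the authenticated ciphertext to its first argument,
	// so the result is nonce || ciphertext.
	box := secretbox.Seal(nonce[:], []byte("hello, world"), &nonce, &key)

	// Open authenticates before decrypting; ok is false on any failure.
	var n [24]byte
	copy(n[:], box[:24])
	msg, ok := secretbox.Open(nil, box[24:], &n, &key)
	fmt.Println(ok, string(msg))
}
```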
On errors
The more information an attacker has about why a cryptographic operation failed, the better the chances that they will be able to break the system. There are attacks, for example, that operate by distinguishing between decryption failures and padding failures. In this book, we either signal an error using a bool or with a generic error value such as “encryption failed”.
We’ll also check assumptions as early as possible, and bail as soon as we see something wrong.
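A sketch of both ideas, assuming the nonce-prefixed secretbox layout from the previous example; the decrypt function and ErrDecrypt are hypothetical names. Every failure, whether a malformed input or a failed authentication, collapses into the same generic error.

```go
package secure

import (
	"errors"

	"golang.org/x/crypto/nacl/secretbox"
)

// ErrDecrypt is the only decryption error callers ever see; it
// deliberately doesn't say why decryption failed.
var ErrDecrypt = errors.New("decryption failed")

const nonceSize = 24

func decrypt(key *[32]byte, box []byte) ([]byte, error) {
	// Check assumptions first: bail if the input is too short to
	// contain a nonce and a valid box.
	if len(box) <= nonceSize+secretbox.Overhead {
		return nil, ErrDecrypt
	}

	var nonce [nonceSize]byte
	copy(nonce[:], box[:nonceSize])

	// secretbox signals failure with a bool; we fold it into the
	// same generic error as the length check above.
	msg, ok := secretbox.Open(nil, box[nonceSize:], &nonce, key)
	if !ok {
		return nil, ErrDecrypt
	}
	return msg, nil
}
```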
Input sanitisation
A secure system also has to scrutinise its inputs and outputs carefully to ensure that they do not degrade security or provide a foothold for an attacker. It’s well understood in software engineering that input from the outside world must be sanitised; sanity checks should be conducted on the data and the system should refuse to process invalid data.
There are two ways to do this: blacklisting (a default allow) and whitelisting (a default deny). Blacklisting is a reactive measure that involves responding to known bad inputs; a blacklisting system will always be reacting to new bad input it detects, perhaps via an attack. Whitelisting decides on a set of correct inputs, and only permits those. It’s more work up front to determine what correct inputs look like, but it affords a higher assurance in the system. It can also be useful in testing assumptions about inputs: if valid input is routinely rejected by the whitelist, perhaps the assumptions about what the incoming data looks like are wrong.
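A whitelisting sketch: usernames are checked against a single known-good pattern, and everything else is denied by default. The CheckUsername name and the pattern itself are assumptions for illustration.

```go
package input

import "regexp"

// validUsername describes the only acceptable usernames: a lowercase
// letter followed by 2–31 lowercase letters, digits, underscores, or
// hyphens.
var validUsername = regexp.MustCompile(`^[a-z][a-z0-9_-]{2,31}$`)

// CheckUsername implements default deny: anything that doesn't match
// the whitelist is rejected; no blacklist is needed.
func CheckUsername(name string) bool {
	return validUsername.MatchString(name)
}
```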
Memory
At some point, given current technologies, sensitive data will have to be loaded into memory. Go is a managed-memory language, which means the user has little control over memory, presenting additional challenges for ensuring the security of a system. Recent vulnerabilities such as Heartbleed show that anything that is in memory can be leaked to an attacker with access to that memory. In the case of Heartbleed, it was an attacker with network access to a process that had secrets in memory. Process isolation is one countermeasure: preventing an attacker from accessing a process’s memory space will help mitigate successful attacks against the system. However, an attacker who can access the machine, whether via a physical console or via a remote SSH session, now potentially has access to the memory space of any process running on that machine. This is where other security mechanisms are crucial for a secure system: they prevent an attacker from reaching that memory space.
It’s not just the memory space of any process that’s running on the machine that’s vulnerable, though. Any memory swapped to disk is now accessible via the file system. A secret swapped to disk now has two places where it’s present. If the process is running on a laptop that is put to sleep, that memory is often written to disk. If a peripheral has direct memory access (DMA), and many of them do, that peripheral has access to all the memory in the machine, including the memory space of every process. If a program crashes and dumps core, that memory is often written to a core file. The CPU caches can also store secrets, which might be an additional attack surface, particularly on shared environments (such as a VPS).
There are a few methods to mitigate this: using the stack to prevent secrets from entering the heap, and attempting to zero sensitive data in memory when it’s no longer needed (though this is not always effective; see [Perc14]). In this book, we’ll do this where it makes sense, but the caveats should be considered.
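A best-effort zeroisation sketch; the Zero helper is a hypothetical name. As [Perc14] discusses, this is no guarantee: the runtime may already have copied the secret (during garbage collection or stack growth), or it may have been swapped to disk.

```go
package memutil

// Zero overwrites every byte of buf. In Go this is best effort only:
// it destroys this copy of the secret, but not any copies the runtime
// or operating system may have made.
func Zero(buf []byte) {
	for i := range buf {
		buf[i] = 0
	}
}
```

A caller might pair this with defer (e.g. defer memutil.Zero(key)) so that the wipe happens even on early returns.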
There is also no guarantee that secrets stored on disk can be completely and securely erased (short of applying a healthy dose of thermite). If a sector on disk has failed, the disk controller might mark the block as bad and attempt to copy the data to another sector, leaving that data still on the hardware. The disk controller might be subverted, as disk drives contain drive controllers with poorly (if at all) audited firmware.
In short, given our current technologies, memory is a difficult attack surface to secure. It’s helpful to ask the following questions for each secret:
- Does it live on disk for long-term storage? If so, who has access to it? What authorisation mechanisms ensure that only authenticated parties have access?
- When it’s loaded in memory, who owns it? How long does it live in memory? What happens when it’s no longer used?
- If the secret lives on a virtual machine, how much trust can be placed in parties that have access to the host machine? Can other tenants (i.e. users of other virtual machines) find a way to access secrets? What happens when the machine is decommissioned?
Randomness
Cryptographic systems rely on sources of sufficiently random data. We want the data from these sources to be indistinguishable from ideally random data (a uniform distribution over the range of possible values). There has historically been a lot of confusion between the options available on Unix platforms, but the right answer (see [Ptac14]) is to use /dev/urandom. Fortunately, crypto/rand.Reader in the Go standard library uses this on Unix systems.
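A minimal sketch of drawing key material from crypto/rand; io.ReadFull guards against short reads, and if the platform CSPRNG fails, there is no sensible fallback.

```go
package main

import (
	"crypto/rand"
	"fmt"
	"io"
)

func main() {
	// crypto/rand.Reader reads from the platform CSPRNG
	// (/dev/urandom on Unix systems).
	key := make([]byte, 32)
	if _, err := io.ReadFull(rand.Reader, key); err != nil {
		// No fallback is sensible here; bail out.
		panic(err)
	}
	fmt.Printf("%x\n", key)
}
```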
Ensuring the platform has sufficient randomness is another problem, which mainly comes down to ensuring that the kernel’s PRNG is properly seeded before being used for cryptographic purposes. This is a problem particularly with virtual machines, which may be duplicated elsewhere or start from a known or common seed. In this case, it might be useful to feed additional sources of entropy into the kernel’s PRNG, such as a hardware RNG that writes to it. The host machine may also be able to observe the guest’s PRNG state via disk or memory access, which must be considered as well.
Time
Some protocols rely on clocks being synced between peers. This has historically been a challenging problem. For example, audit logs often rely on the clock to identify when an event occurred. One of the major challenges in cryptographic systems is checking whether a key has expired; if the time is off, the system may incorrectly refuse to use a key that hasn’t expired yet or use a key that has expired. Sometimes the clock is used as a source of unique values, which shouldn’t be relied on. Another use case is as a monotonically-increasing counter; a clock regression (e.g. via NTP) makes it not-so-monotonic. Authentication that relies on time-based one-time passwords also requires an accurate clock.
Having a real-time clock is useful, but not every system has one. Real-time clocks can also drift based on the physical properties of the hardware. Network time synchronisation works most of the time, but it is subject to network failures. Virtual machines may be at the mercy of the host’s clock.
Using the clock itself as a monotonic counter can also lead to issues: a clock that has drifted forward may be set back to the correct time (e.g. via NTP), which results in the counter stepping backwards. The CPU’s tick counter, which counts ticks since startup, may be used instead, perhaps bootstrapped with the current timestamp and with the latest counter value stored persistently (though consider: what happens if the stored value is replaced with an earlier value, or removed?).
It helps to treat clock values with suspicion. We’ll make an effort to use counters instead of the clock where it makes sense.
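Here is a sketch of a persistent monotonic counter under stated assumptions: the file path and eight-byte big-endian format are illustrative, it is not safe for concurrent use, and a real system would still have to defend against rollback or deletion of the stored file (the question raised above).

```go
package counter

import (
	"encoding/binary"
	"os"
)

// Counter hands out strictly increasing values, persisting the latest
// value so a restart doesn't repeat one.
type Counter struct {
	path string
	last uint64
}

// Open loads the last persisted value, if any.
func Open(path string) (*Counter, error) {
	c := &Counter{path: path}
	buf, err := os.ReadFile(path)
	switch {
	case err == nil && len(buf) == 8:
		c.last = binary.BigEndian.Uint64(buf)
	case err != nil && !os.IsNotExist(err):
		return nil, err
	}
	return c, nil
}

// Next persists the incremented value before returning it, so a crash
// can't hand out the same value twice.
func (c *Counter) Next() (uint64, error) {
	c.last++
	var buf [8]byte
	binary.BigEndian.PutUint64(buf[:], c.last)
	if err := os.WriteFile(c.path, buf[:], 0600); err != nil {
		return 0, err
	}
	return c.last, nil
}
```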
Side channels
A side channel is an attack surface on a cryptographic system that is based entirely on the physical implementation; while the algorithm may be sound and correct, the implementation may leak information due to physical phenomena. An attacker can observe timings between operations or differences in power usage to deduce information about the private key or the original message.
Some of these types of side channels include:
- timing: an observation of the time it takes some piece of the system to carry out an operation. Attackers have even used this to successfully attack systems over the network ([Brum03]).
- power consumption: this is often used against smart cards; the attacker observes how power usage changes for various operations.
- power glitching: the power to the system is glitched, or brought to near the shutdown value for the CPU. Sometimes this causes systems to fail in unexpected ways that reveal information about keys or messages.
- EM leaks: some circuits will leak electromagnetic emissions (such as RF waves), which can be observed.
These attacks can be surprisingly effective and devastating. Cryptographic implementations have to be designed with these channels in mind (such as using the constant-time functions in crypto/subtle); the security model should consider the potential for these attacks and possible countermeasures.
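For example, comparing MACs with bytes.Equal returns as soon as two bytes differ, leaking through timing how many leading bytes matched. A sketch of the constant-time alternative:

```go
package main

import (
	"crypto/subtle"
	"fmt"
)

func main() {
	expected := []byte("expected MAC value")
	received := []byte("received MAC value")

	// ConstantTimeCompare's running time depends only on the length
	// of its inputs, not on where they first differ; it returns 1
	// when the slices are equal.
	if subtle.ConstantTimeCompare(expected, received) == 1 {
		fmt.Println("match")
	} else {
		fmt.Println("no match")
	}
}
```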
Privacy and anonymity
When designing the system, it should be determined what measure of privacy and anonymity should be afforded. In a system where anonymity is needed, perhaps the audit log should not record when events succeed but only when they fail. Even these failures can leak information: if a user mistypes their password, will it compromise their identity? If information such as IP addresses is recorded, it can be used to de-anonymise users when combined with other data (such as activity logs from the user’s computer). Think carefully about how the system should behave in this regard.
Trusted computing
One problem with the underlying platform is ensuring that it hasn’t been subverted; malware and rootkits can render other security measures ineffective. It would be nice to have some assurance that the parties involved are running on platforms with a secure configuration. Efforts like the Trusted Computing Group’s Trusted Computing initiative aim to provide some measure of platform integrity and authenticity for the participants in the system, but the solutions are complex and fraught with caveats.
Virtual environments
The cloud is all the rage these days, and for good reason: it provides a cost-effective way to deploy and manage servers. However, there’s an old adage in computer security that an attacker with physical access to the machine can compromise any security on it, and cloud computing makes getting “physical” access to some of these machines much easier. The hardware is emulated in software, so an attacker who gains access to the host (even via a remote SSH session or similar) has equivalent access. This makes the task of securing sensitive data, like cryptographic keys, in the cloud a dubious prospect given current technologies. If the host isn’t trusted, how can the virtual machine be trusted? This doesn’t just mean trust in the parties that own or operate the host: is their management software secure? How difficult is it for an attacker to gain access to the host? Can another tenant (or user on another virtual machine on the host) gain access to the host or to other virtual machines they shouldn’t be able to? If the virtual machine is decommissioned, is the drive sufficiently wiped so that it never ends up in another tenant’s hands? Security models for systems deployed in a virtual environment need to consider the security of the host provider and infrastructure in addition to the system being developed, including the integrity of the images that are being run.
Public key infrastructure
When deploying a system that uses public key cryptography, determining how to trust and distribute public keys becomes a challenge that adds extra engineering complexity and costs to the system. A public key by itself carries no identity information; some format that contains the required identity and metadata needs to be specified. There are standards for this, such as the dreaded X.509 certificate format (which mates a public key with information about the holder of the private key, along with a signature from a party vouching for that binding). Deciding on what identity information to include and how it is to be verified should be considered, as should the lifetime of keys and how to enforce key expirations, if needed. There are administrative and policy considerations that need to be made; PKI is largely not a cryptographic problem, but it does have cryptographic impact.
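As a small illustration of the identity and lifetime metadata a certificate binds to a public key, here is a sketch that inspects a PEM-encoded certificate with crypto/x509; the cert.pem filename is an assumption.

```go
package main

import (
	"crypto/x509"
	"encoding/pem"
	"fmt"
	"log"
	"os"
	"time"
)

func main() {
	certPEM, err := os.ReadFile("cert.pem")
	if err != nil {
		log.Fatal(err)
	}

	block, _ := pem.Decode(certPEM)
	if block == nil || block.Type != "CERTIFICATE" {
		log.Fatal("no certificate found")
	}

	cert, err := x509.ParseCertificate(block.Bytes)
	if err != nil {
		log.Fatal(err)
	}

	// The certificate carries the identity and metadata the key alone
	// lacks: who holds it, who vouches for it, and how long it's valid.
	fmt.Println("subject:", cert.Subject)
	fmt.Println("issuer: ", cert.Issuer)
	fmt.Println("expires:", cert.NotAfter)
	if time.Now().After(cert.NotAfter) {
		fmt.Println("certificate has expired")
	}
}
```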
Key rotation is one of the challenges of PKI. It requires determining the cryptoperiod of a key (how long it should be valid for); a given key can generally only encrypt or sign so much data before it must be replaced (so that it doesn’t repeat messages, for example). In the case of TLS, many organisations are using certificates with short lifetimes: if a key is compromised and revocation isn’t effective, the damage will be limited. Key rotation problems can also act as a DoS attack: if the rotation is botched, it can leave the system unusable until fixed.
Key revocation is part of the key rotation problem: how can a key be marked as compromised or lost? It turns out that marking the key this way is the easy part; letting others know is not. There are a few approaches to this in TLS: certificate revocation lists (CRLs), which contain a list of revoked keys; OCSP (the Online Certificate Status Protocol), which provides a means of querying an authoritative source as to whether a key is valid; and TACK and Certificate Transparency, which have yet to see large-scale adoption. Both CRLs and OCSP are problematic: what if a key compromise is combined with a DDoS against the CRL or OCSP server? Users may not see that a key was revoked. Some clients refuse to accept a certificate if the OCSP server can’t be reached, but then what happens in the case of a normal network outage? CRLs are generally published on set schedules, and users have to request the CRL every so often to update it. How often should they check? Even if they check every hour, that leaves up to an hour window in which a compromised key might still be trusted.
Due to these concerns and the difficulty in providing a useful public key infrastructure, PKI tends to be a dirty word in the security and cryptographic communities.
What cryptography does not provide
Though the encryption methods we’ll discuss provide strong security guarantees, none provides any sort of message-length obscurity; depending on the system, this may make the plaintext predictable even in the presence of strong security guarantees. There’s also nothing in encryption that hides when a message is sent if an attacker is monitoring the communications channel. Many cryptosystems also do not hide who is communicating; many times this is evident just from watching the communications channel (such as tracking IP addresses). By itself, cryptography will not provide strong anonymity, but it might serve as a building block in such a system. This sort of communications channel monitoring is known as traffic analysis, and defeating it is challenging.
Also, despite the unforgeability guarantees that we’ll provide, cryptography won’t do anything to prevent replay attacks. Replay attacks are similar to spoofing attacks: an attacker captures previously sent messages and replays them. An example would be recording a financial transaction and replaying it to steal money. Message numbers are how we will approach this problem: a system should never repeat messages, and repeated messages should be dropped. That is something that needs to be handled by the system, and isn’t solved by cryptography.
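A sketch of the message-number approach; the Message and Receiver types are hypothetical, and a real protocol would also bind the sequence number into the authenticated data so an attacker can’t alter it.

```go
package replay

import "errors"

var ErrReplay = errors.New("message replayed or out of order")

// Message pairs a payload with its sequence number; in practice the
// number would be covered by the message's authentication tag.
type Message struct {
	Seq     uint64
	Payload []byte
}

// Receiver tracks the highest sequence number seen so far.
type Receiver struct {
	lastSeq uint64
}

// Accept rejects any message numbered at or below the last one seen,
// so a captured message can't be replayed. Sequence numbers start at 1.
func (r *Receiver) Accept(m *Message) error {
	if m.Seq <= r.lastSeq {
		return ErrReplay
	}
	r.lastSeq = m.Seq
	return nil
}
```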
Data lifetimes
In this book, when we send encrypted messages, we prefer to do so using ephemeral message keys that are erased when communication is complete. This means that a message can’t be decrypted later using the same key it was encrypted with; while this is good for security, it means the burden of figuring out how to store messages (if that is a requirement) falls on the system. Some systems, such as Pond (https://pond.imperialviolet.org/), enforce a one-week lifetime for messages, and this forced erasure of messages is considered the social norm; such factors will have to play into decisions about how to store decrypted messages and how long to retain them.
There’s also the transition between data storage and message traffic: message traffic is encrypted with ephemeral keys that are never stored, while stored data needs to be encrypted with a long-term key. The system architecture should account for these different types of cryptography, and ensure that stored data is protected appropriately.
Options, knobs, and dials
The more options a system has for picking the cryptography it uses, the greater the opportunity for making a cryptographic mistake. For this reason, we’ll avoid providing such options here. Typically, we’ll prefer NaCl’s cryptography, which has a simple interface without any options, and which is efficient and secure. When designing a system, it helps to make a well-informed, opinionated choice of the cryptography used. The property of cryptographic agility, or being able to switch the cryptography out, may be useful in recovering from a suspected failure. However, it may be prudent to step back, consider why the failure happened, and incorporate that into future revisions.
Compatibility
The quality of cryptographic implementations ranges wildly. The fact that there is a good implementation of something in Go doesn’t mean that a good implementation will exist in the language used to build other parts of the system. This must factor into the system’s design: how easy is it to integrate with other components, or to build a client library in another language? Fortunately, NaCl is widely available; this is another reason why we will prefer it.
Conclusion
Cryptography is often seen as the fun part of building secure systems, but there is a lot of other work that needs to be done before cryptography enters the picture. It’s not a one-size-fits-all security solution; we can’t just sprinkle some magic crypto dust on an insecure system and make it suddenly secure. We also have to make sure we understand our problems, and ensure that the problems we are trying to solve are actually the right problems to solve with cryptography. The challenges of building secure systems are even more difficult in virtual environments. It is crucial that a security model be part of the specification, and that proper software engineering techniques are observed to ensure the correctness of the system. Remember: one flaw, and the whole system can come crashing down. While it may not be the end of the world as we know it, it can cause significant embarrassment and financial costs, not only to you, but to the users of the system.
Further reading
- [Anders08] R. Anderson. Security Engineering: A Guide to Building Dependable Distributed Systems, Second Edition. Wiley, April 2008.
- [Brum03] D. Brumley, D. Boneh. “Remote timing attacks are practical.” In Proceedings of the 12th USENIX Security Symposium, 2003.
- [Ferg10] N. Ferguson, B. Schneier, T. Kohno. Cryptography Engineering. Wiley, March 2010.
- [Graff03] M. G. Graff, K. R. van Wyk. Secure Coding: Principles and Practices. O’Reilly Media, June 2003.
- [Perc14] C. Percival. “How to zero a buffer.” http://www.daemonology.net/blog/2014-09-04-how-to-zero-a-buffer.html 2014-09-04.
- [Ptac14] T. Ptacek. “How to safely generate a random number.” http://sockpuppet.org/blog/2014/02/25/safely-generate-random-numbers/ 2014-02-25.
- [Viega03] J. Viega, M. Messier. Secure Programming Cookbook for C and C++. O’Reilly Media, July 2003.