
How computers interpret information

As we discussed in the first section of the book, one of the things that programmers do is build things up out of blocks. The first case of that we’ll see is combining different data types.

Data types tell the computer which operations are legal between two chunks of data. Addition between two numbers makes sense. Addition between two sentences makes less sense. As different data types are actually using the same underlying system (zeros and ones), there need to be some rules in place which prevent things from clobbering each other.
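Here is a minimal sketch of that idea in Python (the choice of Python is an assumption made for the examples, not something the text requires). Adding two numbers is legal, while adding a number to a piece of text is refused.

```python
# Adding two numbers is a legal operation.
total = 2 + 3  # 5

# Adding a number to a piece of text is not legal, so Python refuses
# and raises a TypeError instead of guessing what we meant.
try:
    result = 2 + "three"
except TypeError as error:
    result = None
    message = str(error)
```

The data type is what lets the computer notice the problem before the zeros and ones get mixed up.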

Let’s look at how a computer might try to represent the following real world things:

numbers
text
pictures
dates
lists
places
truth
uncertainty
nothingness

These real world things, or in some cases concepts, are only a small set of the things that programmers might want to represent. However, once we have these, they can act as building blocks to create more complex structures. This may sound quite vague at this stage, because we haven’t discussed how to do that. However, we’ll get there.

Numbers

Numbers are interesting. Numbers go up to infinity, yet there is finite space within your computer. Therefore, we need to make trade-offs. You may have encountered a form field in the past that refused to accept a number because it was too large.

Integers

Let’s start by talking about integers. If you remember, integers are whole numbers on the number line. They don’t have a decimal point. Here are some integers: -1, 0, 1, 2, 3.

An 8-bit computer, like those that were common in the 80s, stores an integer using 8 zeros and ones. That is, 00000001 represents 1, 00000010 represents 2 and so on up to 255 (11111111). Each position is called a bit. As a group of 8, they’re called a byte.
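If you have Python to hand, you can check these patterns yourself. This is only a quick sketch using Python's built-in `format` and `int` functions.

```python
# format(n, "08b") writes n as 8 binary digits, padding with zeros on the left.
one = format(1, "08b")     # "00000001"
two = format(2, "08b")     # "00000010"
top = format(255, "08b")   # "11111111"

# int(text, 2) goes the other way, from binary digits back to a number.
back = int("00000010", 2)  # 2
```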

When you hit the top, you flip back down to 0. Perhaps think of it as a car’s odometer. Many modern programming languages will take care of this problem for you by dynamically changing the type rather than letting it overflow back to zero. They will simply use more bytes. However, other systems like databases are often much more rigid. That’s because databases are less able to expand the space they have allocated.
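The odometer behaviour can be imitated in Python. Python's own integers grow as needed, so the sketch below keeps only the lowest 8 bits by hand. The helper name `add_8bit` is made up for illustration.

```python
def add_8bit(a, b):
    """Add two numbers the way an 8-bit counter would, keeping only 8 bits."""
    return (a + b) & 0xFF  # 0xFF is 255, which is eight ones in binary

top = add_8bit(254, 1)      # 255, the highest value that fits
wrapped = add_8bit(255, 1)  # 0, the odometer has rolled over

# Python's built-in integers simply use more space instead of wrapping.
unbounded = 255 + 1         # 256
```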

If you see a limit imposed at 255, it has probably been imposed by technical constraints rather than design. One of the more famous examples of this is Pac-Man, which would crash after level 255. Not every program knows how to deal with its odometer being reset.

Whenever you see something that goes between 0 and 255, it probably means that the value is stored in a single byte. Often the programmer wants to represent a value between 0 and 1, in which case 255 means 100%. Do IP addresses make more sense now?

More modern computers use more bits to represent integers. When you talk to a programmer, and use the term int (shorthand for integer), they will assume that the computer is using 32 bits to represent integers. 32 bits allows programmers to have an odometer that goes up to 2³² − 1, or 4,294,967,295.
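The arithmetic behind that figure is short enough to check. In this sketch the function name `largest_unsigned` is hypothetical, and it assumes all of the bits are used for counting upwards from zero.

```python
def largest_unsigned(bits):
    """Largest whole number that fits when every bit counts up from zero."""
    return 2**bits - 1

largest_8 = largest_unsigned(8)    # 255
largest_32 = largest_unsigned(32)  # 4294967295
```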

Negative numbers

So far, we’ve only talked about the numbers from 0 to infinity. What about -5 and friends? Well, programmers have decided to use up some of their valuable bits to enable a wider range of numbers to be represented. The first bit acts like a switch. If it’s 0, the number is zero or positive. If it’s 1, the number is negative.
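As a rough illustration, assuming the widely used two's complement scheme (the text does not name a particular scheme), Python can show the 8 bits of a small negative number. In that scheme the first bit is 1 for negative numbers and 0 otherwise. The helper name `to_8_bits` is made up for this sketch.

```python
def to_8_bits(n):
    """Show n as 8 bits using two's complement (valid for -128 to 127)."""
    return format(n & 0xFF, "08b")

plus_five = to_8_bits(5)    # "00000101", first bit is 0
minus_five = to_8_bits(-5)  # "11111011", first bit is 1
```

Spending one bit on the sign halves the largest number that fits, which is the trade-off being described.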

Real numbers

Decimals and fractions are also hard, for similar reasons to integers. We, the human operators, want to be able to represent every possible number. However, our computer tools have finite space. With numbers that have decimals, the trade-offs are more difficult.

A real number is called a float, as its decimal point can float around. The number 4.552 is much different than 45.52.
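A small Python sketch shows one visible consequence of those trade-offs. Some decimals cannot be stored exactly, so sums can be very slightly off, and comparisons are usually made with a tolerance instead.

```python
import math

total = 0.1 + 0.2

# The stored result is very close to 0.3, but not exactly 0.3.
exactly_equal = (total == 0.3)          # False

# math.isclose compares with a small tolerance instead.
close_enough = math.isclose(total, 0.3)  # True
```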

Text

To the computer, text looks the same as a long list of integers. Sections of text are called strings by programmers. Interestingly, it is perfectly acceptable to have a string of length zero.

The matching between an integer stored in a string and the character that appears on the screen is described by an encoding. Roughly, an encoding is an arbitrary mapping between numbers and characters. The encoding that is worth spending a minute or two on is ASCII.

You can read a copy of the ASCII standard in RFC 20, published in plain text in 1969 by Vint Cerf.

These days, ASCII has been supplemented by a much more complex system called Unicode. Unicode allows for every language to be processed by computers.
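To make the "list of integers" idea concrete, here is a short Python sketch. It uses UTF-8 as the Unicode encoding, which is an assumption for the example, since the text does not pick one.

```python
# Every character has a number, and every number maps back to a character.
code = ord("A")      # 65
letter = chr(65)     # "A"

# Encoding a string gives the integers the computer actually stores.
ascii_numbers = list("Hi".encode("ascii"))  # [72, 105]

# Characters outside ASCII need Unicode, and may take more than one byte.
utf8_numbers = list("é".encode("utf-8"))    # [195, 169]

# A string of length zero is perfectly legal.
empty_length = len("")                      # 0
```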

What text looks like as a programmer

If you see something wrapped up in quotes, it is probably a string:

"Hi there String Two, how are you?"
"Hello String One, I am great thank you."

Depending on the programming language that you choose to use, single (') and double quotes (") might mean different things. We will try to stick to double quotes within this book to prevent confusion.

Stumbling blocks

Non-printable characters

One of the things to grasp when you encounter a programming language is that everything is a character. When you push the backspace key on the keyboard, a backspace character is sent to the application that you are typing in. It is up to that application to know what to do with a backspace character, that is, to remove the character next to the cursor.

Here are the typical ways to write the new line, tab and backspace characters.

Backspace	`\b`
New line	`\n`
Tab	`\t`

Note that the representation of the string that you’re seeing does not reflect the underlying storage. Even though you are being presented with two characters on the screen, the computer is only storing one.

Programming languages use escape characters to represent non-printable characters on the screen. Most programming languages use a backslash as an escape character (\). When an escape character is encountered, the computer will change the way it interprets the following character. In the case of \n, it will build a new line, rather than a literal n. If you are wondering how to print out an actual backslash, given that it is being used as an escape character, you might have guessed that you can just add a second backslash, e.g. \\.
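You can confirm that each escape sequence is stored as a single character by asking Python for the length of the string. This is just a sketch using Python's string rules, and other languages may differ in the details.

```python
newline = "\n"    # typed as two characters, stored as one
tab = "\t"
backslash = "\\"  # two backslashes typed, one backslash stored

lengths = (len(newline), len(tab), len(backslash))  # (1, 1, 1)

# The new line character really does split the text into two lines.
two_lines = "first\nsecond"
line_count = len(two_lines.splitlines())            # 2
```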

Line endings

Line endings are a real pain. Operating systems behave differently. In MS Windows, lines within files end with both a carriage return and a line feed character (\r\n). In OS X and Linux, they end with just a line feed character (\n), which is the same thing as the new line character.
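The sketch below assumes the usual conventions: a carriage return followed by a line feed (`\r\n`) on Windows, and a line feed alone (`\n`) on Linux. It shows why the same text can fail a simple comparison, and one way to normalise it.

```python
windows_text = "one\r\ntwo\r\n"
unix_text = "one\ntwo\n"

# The raw strings differ, because the Windows version has extra characters.
same_text = (windows_text == unix_text)                            # False

# splitlines() understands both conventions, so the lines match.
same_lines = windows_text.splitlines() == unix_text.splitlines()   # True

# Replacing \r\n with \n is a common way to normalise text.
normalised = windows_text.replace("\r\n", "\n")
```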

Pictures

The data storing pictures is typically referred to as “binary data”.

The reference to the term binary relates to the fact that the raw zeros and ones are the best we can get. It is extremely hard, without context such as a file extension, for a programmer to know what to do with binary data. To programmers who do not have access to the specialised encoding or format, binary data is opaque and unwieldy.

When someone describes something on the web as being “machine readable”, they mean the information is presented as strings which follow a well-defined encoding.

In many ways, binary data is like a string. It is a sequence of bytes. The difference is that as a programmer, we don’t know which operations are legal. Do we treat the bytes like ASCII, like Unicode, like an array of integers or what?

Pictures are encoded in an elaborate encoding called a file format. The difference between a file format and an encoding is somewhat murky, although you could say that a file format is intended to be much more general than an encoding designed to handle text.

There is no easy way to know from looking at binary data what it represents. Note: There are some harder ways. Every PNG file starts with the same eight bytes, which include the letters “PNG”. If you feel like spending some time wandering through Wikipedia, you might decide to look up “Magic numbers”. A magic number is a fixed value at the start of a file that provides some of that missing context.
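As a sketch of how a magic number can be used, the Python below checks whether some bytes begin with the eight-byte PNG signature. The function name `looks_like_png` is made up for illustration, and a real program would normally read the bytes from a file.

```python
# The fixed eight bytes at the start of every PNG file.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def looks_like_png(data):
    """Return True if the bytes begin with the PNG signature."""
    return data.startswith(PNG_SIGNATURE)

png_like = looks_like_png(PNG_SIGNATURE + b"...rest of the picture...")  # True
not_png = looks_like_png(b"GIF89a...")                                   # False
```

A matching signature is only a strong hint. It does not prove that the rest of the file is a valid picture.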

Dates

Computer designers cheat. They’ve picked an arbitrary date, 1 January 1970, and simply count the seconds since.

Just like email and web servers, there are actually time servers on the Internet that tell your computer the correct time. See http://www.ntp.org/.

What actually happens inside the computer may surprise you. As dealing with time is tricky, given that there are leap years and leap seconds, it seems natural that there would be an elaborate scheme for handling that.

There isn’t. Computers store a date as an int. That int is only converted when you ask for the representation of a date as a string, say.
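Counting seconds from 1 January 1970 can be tried out with Python's standard `datetime` module. The sketch below works in UTC, which is an assumption made to keep the results the same on every machine.

```python
from datetime import datetime, timezone

# Zero seconds after the starting point is the starting point itself.
epoch = datetime.fromtimestamp(0, tz=timezone.utc)
epoch_text = epoch.isoformat()  # "1970-01-01T00:00:00+00:00"

# One day is 86,400 seconds, so this is 2 January 1970.
one_day_later = datetime.fromtimestamp(86400, tz=timezone.utc)

# Going the other way: a date becomes a plain count of seconds.
seconds_at_2000 = int(datetime(2000, 1, 1, tzinfo=timezone.utc).timestamp())
```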

Lists

Lists of things are useful. They allow us to perform operations on a group of entities, rather than on one entity at a time.

The syntax varies, but the concept is fairly similar between languages. A string can often be considered a list of characters. Some languages will not allow you to mix data types within a list, i.e. they only allow homogeneous lists. Others are much more flexible and allow heterogeneous lists.
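Python happens to be one of the flexible languages, so it makes a convenient sketch of these ideas. Other languages will use different syntax.

```python
# A heterogeneous list: a whole number, a string and a float together.
mixed = [1, "two", 3.0]

# A string can be turned into a list of its characters.
letters = list("cat")                 # ["c", "a", "t"]

# One operation applied to a whole group, rather than one item at a time.
doubled = [n * 2 for n in [1, 2, 3]]  # [2, 4, 6]
```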

Programming languages will often provide programmers with several types of collections, lists or things that create sequences of elements. When you’re learning, pick the thing with the simplest name. It will generally be the most general and most widely applicable data structure.

One thing to note is that you should generally be wary of the term set. Most programming languages will use the term set to mean an implementation of a mathematical set, which is an unordered collection of unique objects. Also, if you hear the term bag being used, it means something similar to a set, in that there is no inherent ordering, except that it allows duplicates.
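In Python, the difference looks roughly like this. Using `collections.Counter` to stand in for a bag is a choice made for this sketch, since Python has no type that is literally called a bag.

```python
from collections import Counter

items = ["apple", "pear", "apple"]

# A set keeps each unique object once and has no inherent order.
as_set = set(items)      # {"apple", "pear"}

# A bag also has no order, but remembers how many of each it holds.
as_bag = Counter(items)  # apple: 2, pear: 1
```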

Truth values

We haven’t even really started programming yet. However, I can assure you that you will be spending a great deal of time testing against Boolean values. Named after logician George Boole, Boolean values are related to logic and are especially related to making decisions.

Deciding what represents truth and what represents falsehood is an area where programming languages differ widely.

As a very general rule, data that looks like a 0 will represent falsehood, whereas non-0 data will represent truth. Jargon that is sometimes used is to describe some values as “truthy” and some values as “falsey”. In many languages, there will be special keywords to describe Boolean values explicitly. These are generally some variant of True and False, or true and false. Note, however, that when used this way they do not have quotes around them. In the Lisp family of languages, the keywords are shorter still: Common Lisp uses t and nil, while Scheme uses #t and #f. In the Python programming language, empty strings and lists also evaluate to false.
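Python's built-in `bool` function reports whether a value is truthy or falsey, which makes the general rule easy to explore. Treat this as a sketch of Python's rules only, since other languages decide differently.

```python
# Values that look like zero, or are empty, are falsey in Python.
falsey = [bool(0), bool(0.0), bool(""), bool([]), bool(None)]  # all False

# Non-zero numbers and non-empty containers are truthy.
truthy = [bool(1), bool(-1), bool("hello"), bool([0])]         # all True

# A common surprise: the text "0" is a non-empty string, so it is truthy.
surprise = bool("0")                                           # True
```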

Uncertainty

Representing uncertainty is generally left up to the programmer, rather than the programming language. In order to do this, we would use a float.

Probability	Likely representation
100%	`1.0`
70%	`0.7`
0%	`0.0`

How would this be used in practice? Well, by taking advantage of the float data type, we can make use of all of its legal operations such as multiplication.

When using a standard float, the computer will not know inherently that you are representing a probability value though. We may encounter a situation where a value falls outside of the bounds of 0.0 and 1.0. If you wish to create your own bespoke data type to place constraints, you are able to do that. Other programming books can show you how.
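Short of a bespoke data type, a simple check can at least catch values that stray outside the bounds. In this sketch the function name `check_probability` and the example probabilities are made up, and multiplying the two values assumes the events are independent.

```python
def check_probability(value):
    """Return value unchanged, or raise an error if it is not between 0.0 and 1.0."""
    if not 0.0 <= value <= 1.0:
        raise ValueError("a probability must be between 0.0 and 1.0")
    return value

rain = check_probability(0.7)  # 70%
wind = check_probability(0.5)  # 50%

# Multiplication is a legal float operation, so we get it for free.
# Assuming the two events are independent, this is the chance of both.
both = rain * wind             # about 0.35
```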

Nothingness

Representing the property of non-existence is not just a metaphysical problem for computers. Like many things we’ve discussed, different languages make different choices on how to handle it. Most languages will have a keyword that will represent nothingness. Some common variants include:

null
nil
undefined
none

None and friends can lead to odd consequences.

You should be aware that JavaScript contains both null and undefined. They represent different things. null represents a variable that has deliberately been left empty, but undefined means that JavaScript knows nothing at all about the value. In a sense, undefined is more nothingness than null. Other programming languages do not make that distinction. The big problem is that they compare as equal under loose comparison (==) but not under strict comparison (===), so this can trip you up when you’re still learning.
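Python, by comparison, has a single value for nothingness, written `None`. The sketch below shows where it tends to appear and one of the odd consequences. The function name `find_price` and the prices are made up for illustration.

```python
def find_price(prices, item):
    """Look up an item's price. dict.get returns None when the item is missing."""
    return prices.get(item)

prices = {"tea": 2.5}
missing = find_price(prices, "coffee")  # None

# The usual way to test for nothingness in Python is "is None".
is_nothing = missing is None            # True

# None is not the same thing as zero, even though both are falsey.
same_as_zero = (missing == 0)           # False

# Trying to do arithmetic with None is an error rather than a silent zero.
try:
    total = missing + 1
except TypeError:
    total = None
```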

Up next

Create a comprehensive website backup