A New Route to Programming

Tim McNamara

A Motivating Introduction
- What you will learn
Some guiding principles
First steps programming
How computers interpret information
- Numbers
- Text
- Pictures
- Dates
- Lists
- Truth values
- Uncertainty
- Nothingness
Create a comprehensive website backup

A Motivating Introduction

If you have chosen this book, over many others, it is likely that you are slightly more inquisitive and creative than other people who learn programming. You might be more interested in some of the human factors involved in learning to code. This is not an ordinary introduction to computer programming.

What you will learn

This book is not the world’s best introduction to programming. Its main goal is to give you the confidence to pick up and learn the best introductions to programming.

We will be working through several small projects, which together will teach you programming by stealth. You should have so much fun walking yourself through the examples that you can’t help but learn. At least, that is the idea.

This book places people first. People are often tentative and cautious. The approach taken when writing this book is to explicitly acknowledge that.

This book will be most helpful to people who are interested in understanding how programming might be useful in their day-to-day lives.

Some guiding principles

As we go through this book, here are some principles that might help you get to grips with the new environment that you are exploring.

Programming is cultural

Something quite curious about programming is that it is extremely subjective. Although computer technology, and therefore programming, seems very objective, there is huge cultural diversity. Technologists group themselves into very strong, and often adversarial, communities.

What do I mean? Let’s start with programming languages. Some languages may be affiliated with specific corporate interests. Others will be dismissed because they have origins from a particular computing heritage. Others still will be very fashionable.

Let’s try to walk through some examples. If you have not heard of any of these terms before, fear not. Think of them as different types of ice cream.

Java is associated with large corporations and serious business, both because of its high adoption within those environments and also due to the fact that

Some programming language communities are very divisive. Adobe Flash was originally heralded as the way to bring advanced interactivity to websites. Its syntax was similar to the common JavaScript language, but it had far more capability. Unfortunately though, Flash websites were unusable for website users relying on screen readers. They also required very large downloads. This led to Flash web developers being seen as people who cared much more about the aesthetics and presentation of content than about the content being accessible to the widest audience possible.

These cultural styles do change. As soon as Apple, Inc. dropped support for Flash when it released the iPhone, its popularity diminished. JavaScript used to be seen as a clunky, brittle, yet necessary evil. It is now seen as an extremely powerful system for developing software that is very accessible.

Culture goes further than that though. Every part of software development includes communities devoted to specific text editors, specific hardware that software is written on, specific operating systems to run text editors, specific methods of collaborating with others, …

When you begin to write your first code, you should ignore almost all of this. You should learn the programming language which someone close to you knows. If no one you know knows programming, read the appendix (note to self: write the appendix). The quicker you can get up and running, the quicker you can start to learn.

Programmers use analogies

Much of the difficulty in learning programming comes from the fact that your brain does not know how to categorise the information that it is receiving. This is not helped by the fact that programmers will tend to use analogy a great deal. They do this because the underlying technology is very complex.

When you come into a new field of expertise, your brain will attempt to attach new information into pre-existing buckets.

As you will see, you might find out that you cannot create threads with strings. Nor does glue code have anything to do with objects.

Just like legal jargon, things that sound like ordinary English can mean something quite different when used within the field. Jurists themselves have a term for jargon: a term of art.

Interface is a particularly confusing analogy. It will be easier for you to understand once you have started programming. A coarse definition is that an interface is how programs allow other programs to communicate with them.

Programmers like to simplify

One part of programming that you will not be able to overcome is that programmers like to make things simple for themselves. Unfortunately, the effect of simplification can be to exclude others.

It is hard to describe this phenomenon without further context. Just be aware that it’s a problem. When you encounter it, you should be reassured that

Programming works by building up components

Much of programming involves combining simple things that achieve a little by themselves into complex things that achieve a great deal.

Ultimately, everything that you see on a computer monitor is the product of a process that started with zeros and ones interacting. The wonders that are enabled by technology today, including the programming languages used to build those wonders, are possible due to the fact that people spent a great deal of time thinking about how to get zeros and ones to interact.

As you read through other technical material, this will be known as abstraction. Unlike abstraction in the creative arts, abstraction in programming terms tends to make things more useful and practical. If it helps, you can substitute the word think of things

Programming works by breaking down problems

Problems are complex. In order to achieve an overarching goal, many things need to happen correctly, in the right order. When building a technology system, it is also important to make sure that different parts of the solution are able to inform the other parts that they are done and that there are processes in place for dealing with errors.

Programming works by making trade offs

Every time we make a decision about which components to build up together, or which parts of a problem should be split into sub-problems, we make trade offs.

The trade offs that programmers make mean that there is a great deal of scope for creativity. As you

Programmers will have the

Because programmers need to make decisions

Naming things is hard

As you enter your programming life, you should be very prepared to spend many minutes looking at the computer deciding what things should be called. It will be frustrating. It will be mostly unavoidable.

First steps programming

In the next section of the book, we’ll be discussing what geeks mean when they use the term data. It is a relatively abstract topic, which is likely to lose people who are more practically inclined. So before that, we will take our first steps with programming by interacting with real web pages. This may sound daunting, but you will be fine. We are trying to get an appreciation for what programming can offer us.

Step 1: Open a web browser

You will need to use a laptop or desktop machine, rather than a tablet or mobile device. That is because the mobile versions of browsers have fewer options for developers.

I recommend using a recent version of Firefox or Chrome.

If you do not have access to one of these browsers, you should install a browser extension called Firebug Lite. Firebug will enable you to follow along, but it might take a few minutes to set up. When you open its web page, click on the big red button on the top right.

Step 2: Open its web developer console

Internet browsers come built with several tools that make it easy for web developers to make sense of what is going on with a browser. One of them is an interactive console that allows us to do programming right there in the browser.

If you hunt around in the tools menu, you’ll find it.

Browser	Menu Path
Firefox	Tools → Web Developer → Web Console
Chrome	Tools → JavaScript Console

If things have worked correctly, a sub-window will pop up from the bottom of the page. What you are interested in is a greater than sign (>).

Step 3: Begin to code

Let’s demonstrate that computers can count. Type 1 + 1 and push Enter. The computer should respond with 2.

> 1 + 1
2

If that worked, I would like to give you a small taste of some of the power you have at your fingertips.

You are not going to hurt the computer by making mistakes.

One of the good things about JavaScript in the browser is that nothing you do can affect the system that is running the browser. The browser acts as a shield as well as an enabler.

Visit http://www.nzherald.co.nz. If your web console is still open, you may see a bunch of garbage appear in front of you. Ignore all of that. Click on a news story headline. Once you are inside an article, enter the following line into the > prompt and push Enter:

> $("#articleBody p").text()

All going well, you will see a large block of text appear. The formatting is terrible, but you should be able to make out the content. I hope this has given you a taste of some of the really complex things you will be able to achieve by working through the book.

What have we just done? We have sucked out the paragraph tags from the HTML element with the id articleBody and then asked for the text from those p elements to be returned to us.

Some pointers for the inquisitive:

Perhaps think of the dollar sign as shorthand for the word “select”. As I mentioned in the first chapter, programmers like to simplify at the expense of ease of learning.
The brackets (parentheses, for non-New Zealanders), ask JavaScript to execute the command to the left. When something appears within the brackets, we are providing context. In the case of the $ command, we are providing "#articleBody p" as context. The double-quotes are significant here, as they tell JavaScript to treat these characters as a single string of characters. Formally, this extra context is known as providing arguments to functions, although we have only provided a single argument to $.
Once a function returns a value, we can ask the value to provide commands. $("#articleBody p") returns a list of HTML elements. Those elements understand the text function, which is called with zero arguments by just using empty brackets ().
We call functions on return values by connecting them with a dot.

Seeking early reader feedback

Did this section make sense to you? It introduces a huge number of concepts. I am worried that may clobber uncertain readers.

How computers interpret information

As we discussed in the first section of the book, one of the things that programmers do is build up blocks. The first case that we’ll see of that is building together different data types.

Data types tell the computer which operations are legal between two chunks of data. Addition between two numbers makes sense. Addition between two sentences makes less sense. As different data types are actually using the same underlying system (zeros and ones), there need to be some rules in place which prevent things from clobbering each other.

Let’s look at how a computer might try to represent the following real world things:

numbers
text
pictures
dates
lists
places
truth
uncertainty
nothingness

These real world things, or in some cases concepts, are only a small set of the things that programmers might want to represent. However, once we have these, they can act as building blocks to create more complex structures. This may sound quite vague at this stage, because we haven’t discussed how to do that. However, we’ll get there.

Numbers

Numbers are interesting. Numbers go up to infinity, yet there is finite space within your computer. Therefore, we need to make trade offs. You may have encountered a form field in the past that refused.

Integers

Let’s start by talking about integers. If you remember, integers are whole numbers on the number line. They don’t have a decimal point. Here are some integers: -1, 0, 1, 2, 3.

An 8 bit computer, like those that were common in the 80s, stores an integer using 8 zeros and ones. That is, 00000001 represents 1, 00000010 represents 2 and so on up till 255 (11111111). Each position is called a bit. As a group of 8, they’re called a byte.

When you hit the top, you flip back down to 0. Perhaps think of it as a car’s odometer. Many modern programming languages will take care of this problem for you by dynamically changing the type before it can overflow back to zero. They will simply use more bytes. However, other systems like databases are often much more rigid. That’s because databases are less able to expand the space they can allocate.

If you see a limit imposed at 255, it has probably been imposed by technical constraints rather than design. One of the more famous examples of this is Pac-Man, which would crash after level 255. Not every program knows how to deal with its odometer being reset.

Whenever you see something that goes between 0 and 255, it probably means that a programmer wants to represent the values between 0 and 1. 255 here means 100%. Do IP addresses make more sense now?

More modern computers use more bits to represent integers. When you talk to a programmer, and use the term int (shorthand for integer), they will assume that the computer is using 32 bits to represent integers. 32 bits allows programmers to have an odometer that goes up to 2³² − 1, or 4,294,967,295.
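
If you would like to see the odometer in action, here is a minimal sketch in Python, the language we will meet properly later in the book. Python's own integers grow as needed, so the wrap-around is imitated here with a remainder calculation. The names as_bits and add_8_bit are invented for this illustration.

```python
def as_bits(number, width=8):
    """Show a non-negative integer as a string of zeros and ones."""
    return format(number, "0{}b".format(width))


def add_8_bit(a, b):
    """Add two numbers like an 8 bit odometer, flipping back to 0 after 255."""
    return (a + b) % 256


print(as_bits(1))         # 00000001
print(as_bits(2))         # 00000010
print(as_bits(255))       # 11111111
print(add_8_bit(255, 1))  # 0
print(2 ** 32 - 1)        # 4294967295
```

The last line is the top of the 32 bit odometer mentioned above.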

Negative numbers

So far, we’ve only talked about the numbers from 0 to infinity. What about -5 and friends? Well, programmers have decided to use up some of their valuable bits to enable a wider range of numbers to be represented. The first bit acts like a switch. If it’s 0, it’s

Real numbers

Decimals and fractions are also hard, for similar reasons to integers. We, the human operators, want to be able to represent every possible number. However, our computer tools have finite space. With numbers that have decimals, the trade offs are more difficult.

A real number is called a float, as its decimal point can float around. The number 4.552 is much different than 45.52.
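
One consequence of that finite space can be sketched in a few lines of Python. This is only an illustration of the general behaviour of floats, not something specific to this book's projects. Some decimals cannot be stored exactly, so programmers tend to ask whether two floats are close rather than whether they are equal.

```python
import math

total = 0.1 + 0.2

# The stored values are very slightly off, so an exact comparison fails.
print(total == 0.3)              # False

# Asking "are these close enough?" behaves the way a person would expect.
print(math.isclose(total, 0.3))  # True
```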

Text

To the computer, text looks the same as a long list of integers. Sections of text are called strings by programmers. Interestingly, it is perfectly acceptable to have a string of length zero.

The matching between an integer that appears in a string and a character that might appear on the screen is described in an encoding. Coarsely, an encoding is an arbitrary mapping between a number and a character. The encoding that might be worthwhile spending a minute or two on is ASCII.

You can read the original ASCII standard, published in plain text in 1969 by Vint Cerf as RFC 20.

These days, ASCII has been supplemented by a much more complex system called Unicode. Unicode allows for every language to be processed by computers.
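
Here is a small sketch, in Python, of text as a list of integers. The example words are my own choice. The ord and chr functions convert between a character and its number, and encode turns a string into the raw bytes that a particular encoding would store.

```python
text = "Hi"

# Each character has a number behind it.
codes = [ord(character) for character in text]
print(codes)                # [72, 105]

# Going the other way, numbers can be turned back into characters.
print(chr(72) + chr(105))   # Hi

# In ASCII, every character takes exactly one byte.
print(list(text.encode("ascii")))   # [72, 105]

# Unicode copes with characters that ASCII has no number for.
# "\u0101" is a way of typing the letter a with a macron on top.
word = "M\u0101ori"
utf8_bytes = word.encode("utf-8")
print(len(word), len(utf8_bytes))   # 5 6
```

The last line shows five characters taking six bytes, because one of the characters needs two bytes in the UTF-8 encoding of Unicode.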

What text looks like as a programmer

If you see something wrapped up in quotes, it is probably a string:

"Hi there String Two, how are you?"
"Hello String One, I am great thank you."

Depending on the programming language that you choose to use, single (') and double quotes (") might mean different things. We will try to stick to double quotes within this book to prevent confusion.

Stumbling blocks

Non-printable characters

One of the things to grasp when you encounter a programming language is that everything is a character. When you push the backspace key on the keyboard, a backspace character is sent to the application that you are writing in. It is up to that application to know what to do with a backspace character, that is, remove the character next to the cursor.

Here are the typical ways to write the backspace, new line and tab characters.

Backspace	`\b`
New line	`\n`
Tab	`\t`

Note, that the representation of the string that you’re seeing does not reflect the underlying storage. Even though you are being presented with two characters on the screen, the computer is only storing one.

Programming languages use escape characters to represent non-printable characters on the screen. Most programming languages use a backslash as an escape character (\). When an escape character is encountered, the computer will change the way it interprets the following character. In the case of \n, it will build a new line, rather than a literal n. If you are wondering how to print out an actual backslash, given that it is being used as an escape character, you might have guessed that you can just add a second backslash, e.g. \\.
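
You can check the claim that two characters on the screen are stored as one. This is a small Python sketch with variable names made up for the purpose. The len function reports how many characters a string holds.

```python
new_line = "\n"
print(len(new_line))    # 1

# Two backslashes in the source code become one stored backslash.
backslash = "\\"
print(len(backslash))   # 1
print(backslash)        # \

# A tab sits between the two words when this is printed.
columns = "first\tsecond"
print(columns)
print(len(columns))     # 12
```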

Line endings

Line endings are a real pain. Operating systems behave differently. In MS Windows, lines within files end with both a carriage return character and a line feed character. In OS X and Linux, they end with a line feed character alone, which is also known as a new line character.
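
As a rough sketch of why this matters, here are the same two lines of text written the MS Windows way (a carriage return, \r, followed by a line feed, \n) and the Linux way (a line feed alone). The example strings are invented. Python's splitlines method is one tool that copes with both.

```python
windows_text = "first\r\nsecond\r\n"
linux_text = "first\nsecond\n"

# To the computer these are different strings of different lengths.
print(windows_text == linux_text)            # False
print(len(windows_text) - len(linux_text))   # 2

# splitlines understands both conventions.
print(windows_text.splitlines())   # ['first', 'second']
print(linux_text.splitlines())     # ['first', 'second']
```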

Pictures

The data storing pictures is typically referred to as “binary data”.

The reference to the term binary relates to the fact that the raw zeros and ones are the best we can get. It is extremely hard, without context such as a file extension, for a programmer to know what to do with binary data. To programmers who do not have access to the specialised encoding or format, binary data is opaque and unwieldy.

When someone describes something on the web as being “machine readable”, they mean the information is presented as strings which follow a well-defined encoding.

In many ways, binary data is like a string. It is a sequence of bytes. The difference is that as a programmer, we don’t know which operations are legal. Do we treat the bytes like ASCII, like Unicode, like an array of integers or what?

Pictures are encoded in an elaborate encoding called a file format. The difference between a file format and an encoding is somewhat murky, although you could say that a file format is intended to be much more general than an encoding designed to handle text.

There is no easy way to know from looking at binary data what it represents. Note: There are some harder ways. Every PNG file has the letters “PNG” within its first four bytes. If you feel like spending some time wandering through Wikipedia, you might decide to look up “Magic numbers”. A magic number is some context.
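
To make the idea of a magic number concrete, here is a hedged sketch in Python. The eight bytes in PNG_SIGNATURE are the ones that begin every PNG file. The function name looks_like_png and the sample data are made up for this example, and no real picture is involved.

```python
# The first eight bytes of every PNG file. The letters PNG sit
# in positions two to four, after one non-printable byte.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def looks_like_png(data):
    """Guess whether some binary data is a PNG by checking how it starts."""
    return data.startswith(PNG_SIGNATURE)


pretend_picture = PNG_SIGNATURE + b"the rest of the picture would follow"

print(looks_like_png(pretend_picture))   # True
print(looks_like_png(b"Hello there"))    # False
print(pretend_picture[1:4])              # b'PNG'
```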

Dates

Computer designers cheat. They’ve picked an arbitrary date, 1 January 1970, and simply count the seconds since.

Just like email and web servers, there are actually time servers on the Internet that tell your computer the correct time. See http://www.ntp.org/.

What actually happens inside the computer may surprise you. As dealing with time is tricky, given that there are leap years and leap seconds, it seems natural that there would be an elaborate scheme for handling that.

Those at date as an int. When you ask for the representation of a date as a string, say.
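
Counting seconds from 1 January 1970 can be tried out with a short Python sketch. It uses Python's built-in datetime module and asks for the UTC time zone so that the answer does not depend on where your computer thinks it is. The variable names are my own.

```python
import datetime

utc = datetime.timezone.utc

# Zero seconds after the starting point is the starting point itself.
start = datetime.datetime.fromtimestamp(0, tz=utc)
print(start.isoformat())   # 1970-01-01T00:00:00+00:00

# One day's worth of seconds later, it is the next day.
one_day = 24 * 60 * 60
next_day = datetime.datetime.fromtimestamp(one_day, tz=utc)
print(next_day.date())     # 1970-01-02
```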

Lists

Lists of things are useful. That allows us to perform operations on a group of entities, rather than .

The syntax varies, but the concept is fairly similar between languages. Strings can often be considered a list of characters. Some languages will not allow you to mix data types within a list, i.e. they have homogeneous lists. Others are much more flexible and allow heterogeneous lists.

Programming languages will often provide programmers with several types of collections, lists or things that create sequences of elements. When you’re learning, pick the thing with the simplest name. It will generally be the most general and most widely applicable data structure.

One thing to note is that you should generally be wary of the term set. Most programming languages will use the term set to mean an implementation of a mathematical set, which is an unordered collection of unique objects. Also, if you hear the term bag being used, it means something similar to a set in that there is no inherent ordering, except that it allows duplicates.
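
Here is how those ideas might look in Python, as a sketch with made-up data. Python has no type called a bag. Counter, from its collections module, is used here as one reasonable stand-in, because it remembers how many times each item appears.

```python
from collections import Counter

# A list keeps its order and allows duplicates.
fruit_list = ["apple", "pear", "apple"]
print(fruit_list[0])     # apple
print(len(fruit_list))   # 3

# A set throws duplicates away and has no order.
fruit_set = set(fruit_list)
print(len(fruit_set))    # 2

# A bag has no order either, but keeps count of duplicates.
fruit_bag = Counter(fruit_list)
print(fruit_bag["apple"])   # 2

# A string can be treated as a list of characters.
print(list("cat"))       # ['c', 'a', 't']

# Python lists are heterogeneous: mixing data types is allowed.
mixed = [1, "two", 3.0]
print(len(mixed))        # 3
```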

Truth values

We haven’t even really started programming yet. However, I can assure you that you will be spending a great deal of time testing against Boolean values. Named after the logician George Boole, Boolean values are related to logic and are especially related to making decisions.

Deciding what represents truth and what represents falsehood is an area where programming languages are at their most unique.

As a very general rule, data that looks like a 0 will represent falsehood, whereas non-0 values will represent truth. Jargon that is sometimes used is to describe some values as “truthy” and some values as “falsey”. In many languages, there will be special keywords to describe Boolean values explicitly. These are generally some variant of True and False, or true and false. Note however, that when used this way they do not have quotes around them. In a fairly uncommon family of languages called Lisp, the keywords are shorter still: t and nil in Common Lisp, #t and #f in Scheme. In the Python programming language, empty strings and empty lists also evaluate to false.
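
A quick way to explore what counts as truthy is to ask the language directly. This sketch uses Python, where the bool function reports how a value would be treated when making a decision. The list of examples is my own. Other languages will give different answers for some of these.

```python
examples = [0, 1, -5, "", "0", [], [0], True, False]

# Ask Python how it would treat each value when making a decision.
results = [bool(value) for value in examples]

for value, result in zip(examples, results):
    print(repr(value), "is treated as", result)
```

Two results are worth a second look. The string "0" is truthy in Python, because it is a string that is not empty, even though it looks like a 0. Likewise, [0] is truthy because the list is not empty.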

Uncertainty

Representing uncertainty is generally left up to the programmer, rather than the programming language. In order to do this, we would use a float.

Probability	Likely representation
100%	`1.0`
70%	`0.7`
0%	`0.0`

How would this be used in practice? Well, by taking advantage of the float data type, we can make use of all of its legal operations such as multiplication.

When using a standard float, the computer will not know inherently that you are representing a probability value though. We may encounter a situation where a value falls outside of the bounds of 0.0 and 1.0. If you wish to create your own bespoke data type to place constraints, you are able to do that. Other programming books can show you how.
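
As a sketch of both ideas, here is some Python that multiplies two probabilities and guards against values that fall out of bounds. The function check_probability and the weather example are invented for illustration, and the multiplication assumes that the two events have nothing to do with each other.

```python
def check_probability(value):
    """Return the value unchanged, or complain if it is not between 0.0 and 1.0."""
    if not 0.0 <= value <= 1.0:
        raise ValueError("a probability must be between 0.0 and 1.0")
    return value


chance_of_rain = check_probability(0.7)
chance_bus_is_late = check_probability(0.5)

# Assuming the two events are independent, multiply to get
# the chance of both happening.
chance_of_both = chance_of_rain * chance_bus_is_late
print(chance_of_both)   # 0.35

# A plain float would happily hold 1.5. Our check refuses it.
try:
    check_probability(1.5)
except ValueError as problem:
    print("Refused:", problem)
```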

Nothingness

Representing the property of non-existence is not just a metaphysical problem for computers. Like many things we’ve discussed, different languages make different choices on how to handle it. Most languages will have a keyword that will represent nothingness. Some common variants include:

null
nil
undefined
none

None and friends can lead to odd consequences.

You should be aware that JavaScript contains both null and undefined. They represent different things. null represents an empty variable, but undefined means that JavaScript knows nothing at all. In a sense, undefined is more nothingness than null. Other programming languages do not make that distinction. The big problem is that they do not compare as strictly equal (null === undefined is false, even though the looser null == undefined is true), so this can trip you up when you’re still learning.
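
Python, which we will use later in the book, has a single keyword for nothingness: None. Here is a small sketch of how it behaves. The variable and function names are made up. Note that None is not the same thing as an empty string or a zero, even though all three are falsey.

```python
middle_name = None

# The usual way to test for nothingness in Python is "is None".
print(middle_name is None)   # True

# None is falsey, but it is not equal to other falsey values.
print(bool(middle_name))     # False
print(middle_name == "")     # False
print(middle_name == 0)      # False


def first_long_word(words):
    """Return the first word with more than five letters, if there is one."""
    for word in words:
        if len(word) > 5:
            return word
    # Reaching the end without returning anything gives back None.


print(first_long_word(["tiny", "enormous"]))   # enormous
print(first_long_word(["a", "bb"]))            # None
```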

Create a comprehensive website backup

The first thing that we will do is look at the process of backing up a website. Through the process, we’ll play with some of the data types that were introduced at the start of the book. Which data types will we be looking at? Well, strings, dates and integers.

This section will be the first time that we will look at making decisions.

An Overview of the Approach

The actual mechanics of what we will be doing are quite simple. We will:

use an external program to download a website on our behalf
use some software that we will write together to store the download as a .zip file whose name includes the date

Before we get to the mechanics of how this works though, there is quite a lot of background material available for you to look through in case you don’t have a lot of knowledge about how web pages come to be displayed on your computer monitor or how web links actually work.

Tools

I’m going to be using a tool called wget to do the actual downloading. wget is a tool that runs on Linux and OS X. If you are running Windows, you can download web pages using an Internet browser and push Ctrl+S to save the page. You won’t have the advantage of being able to deal with the whole of a website, but that’s not necessary while you’re starting out.

For processing data, we’ll be using a programming language called Python.

Python is a handy programming language to learn if you don’t already know one. It tends to have fewer arcane syntax rules and the language community prefers to settle on agreed conventions for things. Other languages pride themselves on saying that there is more than one way to do it. In the Python world, the preference is for one obvious way. Fewer things to learn should help you be productive relatively quickly.

The reason that we will be using Python, rather than JavaScript introduced earlier, is that we need to deal with files on your machine. Working with JavaScript in the browser means interacting with web pages delivered via the Internet.

Defining a web page

In their simplest form, web pages are made up of text. However, these days the only web pages that only contain text tend to be written by academics. Most pages reference other resources, like images, JavaScript code, something you may not have heard before called style sheets and perhaps even font files. The job of the web browser is to gather up all of these resources and render a single page for the reader.

The text of a web page is defined within a language called HTML or Hypertext Markup Language. So, what does that actually mean?

Hypertext refers to the ability for one piece of text to link to other pieces of text, anywhere in the world.
Markup refers to the way that internal structure is defined. In a markup language, sections such as paragraphs are marked up with opening and closing tags. In HTML, the tag for a paragraph is <p>, and its associated closing tag is </p>. To be very explicit, a small paragraph might look like this in source form, <p>Some text</p>, but will be rendered as Some text.
Language just means that there is a defined grammar for markup and adding hyperlinks. You can’t just add your own tags or syntax for defining tags at a whim.

How web pages link to each other

Within HTML, the anchor tag (<a>) provides for cross-referencing. We haven’t touched upon it yet, but tags have hidden attributes as well as text. One of the <a> tag’s attributes is href, or hyperlink reference.

Let’s see if this makes any sense. Can you make out what is happening here?

<p>
    Some text.
    <a href="http://www.example.org/">
      Click here
    </a> 
    for more info.
</p>

If you are somewhat baffled, we are nesting our <a> tag within the base <p> tag. We are marking up the “Click here” text as a source anchor with a hyperlink to example.org.
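
Programs can read the href attribute too. Here is a hedged sketch in Python that pulls the link out of the snippet above using HTMLParser, a tool that comes with Python. The class name LinkCollector is my own invention, and real-world pages will need more care than this.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Remember the href of every <a> tag that goes past."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, one per attribute.
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)


snippet = """
<p>
    Some text.
    <a href="http://www.example.org/">
      Click here
    </a>
    for more info.
</p>
"""

collector = LinkCollector()
collector.feed(snippet)
print(collector.links)   # ['http://www.example.org/']
```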

Tags nested inside other tags are referred to as children of the broader tags, which are referred to as parents. The ability to nest creates a structure that is known to computer scientists as a tree. The element at the base is known as the root and each descendant element is a branch, extending to elements with no children, the leaves. Remember what I said at the start, computer scientists like to use metaphor!
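
To see the tree, we can ask Python to indent each tag according to how deeply it is nested. This is a rough sketch only. The class name TreePrinter and the small page are made up, and the approach assumes that every tag that is opened is also closed, which is not always true of real web pages.

```python
from html.parser import HTMLParser


class TreePrinter(HTMLParser):
    """Record each opening tag, indented by how many parents it has."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + tag)
        self.depth += 1   # anything that follows is a child of this tag

    def handle_endtag(self, tag):
        self.depth -= 1   # step back up to the parent


page = (
    "<html><head><title>An example webpage</title></head>"
    "<body><p>Some text. "
    '<a href="http://www.example.org/">Click here</a>'
    " for more info.</p></body></html>"
)

printer = TreePrinter()
printer.feed(page)
print("\n".join(printer.lines))
```

The html tag is printed first with no indentation, as it is the root. The a tag is printed last and is indented the furthest, as it is a leaf.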

What a full web page looks like

Now that we have seen what a snippet of HTML looks like, it will be worthwhile to expose you to an HTML document in its full glory. Note that formatting is arbitrary and does not affect how things are rendered¹.

<!DOCTYPE html>
<html lang="en">
<head>
  <title>An example webpage</title>
</head>
<body>
<p>
    Some text.
    <a href="http://www.example.org/">
      Click here
    </a> 
    for more info.
</p>
</body>
</html>

If you copy this code into a text editor and save it as some.html, you will be able to open the file in your web browser just like a page from the Internet.

For people who have not encountered HTML before, let me explain what is going on here, section by section.

<!DOCTYPE html>

When files are sent over the Internet, it is sometimes difficult for the receiver to know exactly how to interpret the content. You may have noticed that there are many URLs that don’t contain file extensions. Web pages tell their receivers what they are on their first line.

For people who are very interested in the details, <!DOCTYPE html> is a document type declaration. Historically, such declarations described which tags are legal syntax. In effect, you are telling browsers to apply the, somewhat messy and flexible, rules of real-world, hand-written HTML to whatever follows.

<html lang="en">
  …
</html>

This is the root tag of HTML. It enables us to hang the rest of our source off of it. Browsers will cope if you leave it out, however it is good form. As you become a proficient programmer, you will find yourself wanting your source code to be dapper.

<html lang="en">
<head>
  <title>An example webpage</title>
</head>
…
</html>

The <head> element is somewhat interesting. Its contents are largely invisible to viewers of the web page. The only thing in here so far is the <title>, which defines what appears as the text at the top of a browser tab. Real-life web pages will put lots of extra info in here, like external CSS and JavaScript files, as well as hints for social media sites and search engines about how to describe the content when it is shared or appears in search results.

<html lang="en">
…
<body>
<p>
    Some text. 
    <a href="http://www.example.org/">
      Click here
    </a> 
    for more info.
</p>
</body>
</html>

The <body> tag is where the content which is visible to the reader appears.

External resources

Understanding the distinction between a web page and its component resources is important when trying to use automated tools. Unless told otherwise, most tools will ignore content that they can’t read like images and content they don’t care about like typography instructions. Knowing that a web page only has one URL, but is actually made up of content from several, can greatly reduce confusion.

Here are the resources that make up most web pages:

JavaScript: A programming language that runs on all web browsers. As we’ve discovered, it is a way to add interaction to websites. It’s also the tool to build web apps. If you have heard the term “Front-end Developer”, read it as “JavaScript Programmer”.
CSS: Cascading Style Sheets, are declarations for typography and layout that web browsers understand.
Images: and possibly other multimedia such as videos and sound.

A note about embedded videos

If you see a video embedded into a web page, it is actually a separate page. If you look at the source, it is very likely that there are <iframe> tags. Although it may only look like a video, your browser has actually pulled down a whole other website which is being served by the video host.

Respecting robots.txt

There is a page on most websites that you are unlikely to have seen before. It sits at the base of a website at /robots.txt. This file declares rules to automated web agents, such as search engines and web harvesters. The tool that we’re using, wget, respects these rules unless told to explicitly ignore them.
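
If you are curious about what these rules look like, here is a sketch using RobotFileParser, a tool that comes with Python for reading them. The rules below are invented for the example and are not the real rules of any website. They tell every automated agent to stay out of anything under /private/.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules, in the same style as a real robots.txt file.
rules = """
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# Ask whether an agent calling itself "wget" may fetch each page.
home_allowed = parser.can_fetch("wget", "http://www.example.org/index.html")
private_allowed = parser.can_fetch("wget", "http://www.example.org/private/a.html")

print(home_allowed)      # True
print(private_allowed)   # False
```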

How we select specific elements

Note to self

Explain CSS selectors

Downloading web sites

This first command gets us most of the way there:

$ wget --mirror www.example.org

If you do not have Linux or OS X running, you will not be able to run wget.

Note to self

Add install instructions.

Note to self

Create an example archive so Windows users can carry on with the examples.

We are asking wget to replicate, or mirror the web site www.example.org. To be honest with you though, www.example.org is a fairly boring website to mirror because it only has one page. Feel free to try mirroring another website that you have a copyright licence to.

This one is more complete. The \ character tells the computer to ignore the newline and means that we can have each option on its own line.

$ wget --mirror \
       --local-encoding="utf-8" \
       --adjust-extension \
       --convert-links \
       --page-requisites \
       www.example.org

This invocation is far more thorough and will allow you to view the downloaded site.

Compress the downloads

Now that we have a complete website, we should store it in a compressed format for backup purposes. Rather than needing to run an application by hand to do that for us each time, let’s write a computer program to do that for us.

As well as explaining the code that follows, part of what I will be showing you is instructions on how to interpret code examples like this so that you can follow along with web tutorials.

I will be introducing some programming concepts. These include variables, flow control, interacting with the outside world, as well as defining and using functions.

#! /usr/bin/env python

import argparse
import datetime
import shutil
import os

def today():
    t = datetime.date.today()
    t_as_string = str(t)
    return t_as_string

def archive_site(site_loc, dest):
    site_dir, site_file = os.path.split(site_loc)
    archive = site_file + "-" + today()
    archive_loc = os.path.join(dest, archive)
    shutil.make_archive(base_name=archive_loc,
      format="zip",
      root_dir=site_loc,
      verbose=True)

p = argparse.ArgumentParser(
      description='Compress a website mirror.')
p.add_argument('location',
       help='Where the website mirror is now.')
p.add_argument('destination',
              nargs='?',
              default=os.path.curdir,
              help='The directory to save to.')
args = p.parse_args()

archive_site(args.location, args.destination)

Wow, this is a lot of code to take in. I am really glad you’ve managed to make it all the way through it. Let’s go through each section step by step.

#! /usr/bin/env python

Do you remember how the document type declaration at the top of an HTML file (<!DOCTYPE html>) helped your web browser know what is legal syntax? The same thing can happen inside your computer too. If you have only used computers running an MS Windows operating system, this may look very odd. Computers from the UNIX operating system heritage, such as Linux and OS X, do not rely on file extensions to decide how to run a file. Instead, they look for a special comment in a file’s first line to figure out what to do with it. In this case, we are telling the computer to run the /usr/bin/env program and give it the argument python. It is a long way of saying that we want this piece of text to be run by the Python language interpreter, i.e. something that knows what to do with Python source code.

import argparse
import datetime
import shutil
import os

This chunk of code is somewhat opaque. I wish it didn’t need to be at the top, because in many ways it is the hardest bit of code to explain. The simplest way to think of it is that we would like to pull someone else’s pre-written code into this file. We call those chunks of external code modules.

Here is what each of those modules does:

argparse: Parses command line arguments. This will allow the script that we are writing to take in context provided by you when you run the code. What we will be doing is asking for the location of the website mirror that we have previously downloaded.
datetime: Contains many functions allowing programmers to deal with times, dates and time zones. In our code, we will actually be accessing date, a part of datetime which doesn’t know anything about times. It does, however, know today’s date.
shutil: The term shutil is shorthand for shell utilities. shutil provides several tools for interacting with files. In our case, we will be relying on shutil to create a zip archive for us. The term shell can be a difficult analogy to grasp. It can be thought of as the way that people interact with the operating system. It is in some sense the surface which is visible to you.
os: The os module provides programmatic access to operating system details like files and, most relevantly to us, directories. If Python is running on MS Windows, the os module will use Windows-specific commands. If it is running on VxWorks, the operating system that powers most Mars rovers, it will run commands specific to VxWorks. In our code, we’ll be using it to split apart and join together file paths in an operating system agnostic way.

Let’s carry on to the first piece of our own code, defining the today function.

def today():
    t = datetime.date.today()
    t_as_string = str(t)
    return t_as_string

Even though this section is quite short, at four lines, I’m going to spend quite a bit of time working through them.

Here is an overhead view of what is happening to start us off. We are telling Python to define a function called today, which accepts no arguments. Once we’re in the function, we find out what day it is, with the help of the datetime.date.today function and the help of a temporary variable called t. We then convert t to a string and return that string to whoever called the function with the use of another temporary variable.

def today():: Here we use the def keyword to define a function called today. Function names can contain letters, digits and the underscore, but cannot start with a digit. The brackets, or parentheses if you prefer, are placeholders for any arguments that the function might accept. In our case, we don’t have any arguments that we want to accept. If we wanted to specify the time zone, we could do so here. This statement ends with a colon. You will gain a feel for when to use a colon over time. Think of it as telling Python that you wish to logically demarcate a block of code. Everything indented after the colon is associated with the def keyword. Once the source code returns to the same indentation as def, it is no longer part of the definition.
t = datetime.date.today(): We do three important things in this line. We call datetime.date.today(), we create our first variable, t, and we assign the result of the call to that variable. You may be confused by the use of the equals sign here. Many programming languages do this, unfortunately. Although equals implies a symmetrical relationship, there is actually a transfer of information from right to left. Whatever is returned from datetime.date.today() will be assigned to t after the call has been completed. You will often read tutorials and listen to talks where people even say “t equals datetime dot date dot today”. However, I prefer to say “t is assigned to”.
t_as_string = str(t): Do you remember me describing how the computer stores times as integers earlier on in the book? Well, here we force Python to represent whatever is in the t variable, that is an object representing today’s date, as a string.
return t_as_string: This asks Python to return the string representation of today’s date to the caller. Caller is just jargon for whichever function will end up calling today in the future.
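If you would like to see the conversion for yourself, here is a small sketch. It uses a fixed date, rather than today’s date, so that the output is the same whenever you run it. The variable name d is just one I have picked for the example.

```python
import datetime

# A fixed date, so that the output is the same whenever this is run.
d = datetime.date(2013, 8, 11)

# The date object itself, as Python describes it.
print(repr(d))  # datetime.date(2013, 8, 11)

# The same date converted to a string.
d_as_string = str(d)
print(d_as_string)  # 2013-08-11
```

The first line printed is a description of an object. The second is plain text that we can join to other pieces of text.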

In Python, we can overwrite previously assigned variables. That means that we could have written our code like this:

def today():
    t = datetime.date.today()
    t = str(t)
    return t

In fact, there is nothing stopping us from simply doing string conversion on the same line as the return statement:

def today():
    t = datetime.date.today()
    return str(t)

To go even further, we could have done away with all of our temporary variables. We could have written everything on a single line if we had wanted to:

def today():
    return str(datetime.date.today())

Every programmer will have different stylistic preferences. It is a value statement to decide between conciseness and readability. As your programming experience matures, you will find yourself becoming very picky about which style you prefer. This is perfectly natural, but be wary of building artificial walls and disregarding technical competence due to style choices.

def archive_site(site_loc, dest):
    site_dir, site_file = os.path.split(site_loc)
    archive = site_file + "-" + today()
    archive_loc = os.path.join(dest, archive)
    shutil.make_archive(base_name=archive_loc,
      format="zip",
      root_dir=site_loc,
      verbose=True)

This piece of code is complex, but will introduce some very valuable tools for you as you progress.

def archive_site(site_loc, dest):: We’re using the def keyword to define another function, called archive_site. archive_site takes two arguments, one called site_loc and the other dest. The site_loc variable refers to the current location of the site that we have downloaded, and dest is where we want the zipped version to end up.
site_dir, site_file = os.path.split(site_loc): This shows us how to handle a function that returns two values. We can assign each of them to their own variable. If we only had one variable, say site_path_parts, then that variable would be assigned a tuple (an ordered list) of two members. If you were wondering what os.path.split(site_loc) does, it takes a file path name, say C:\Users\Tim\Documents\magnum-opus.docx, and splits it into two: C:\Users\Tim\Documents and magnum-opus.docx.
archive = site_file + "-" + today(): This is where we finally get to use the today function that we defined earlier on! The + operator, when applied to strings, joins them together. If we substitute the variable site_file with "timmcnamara.co.nz" and today() with "2013-08-11", we get "timmcnamara.co.nz" + "-" + "2013-08-11". What we will end up with is the archive variable being assigned something like "timmcnamara.co.nz-2013-08-11". We don’t need to add a file extension, as it is added by the command that creates the .zip later on.
archive_loc = os.path.join(dest, archive): We now join together the destination folder provided by the user with the file name that we have just created as archive.
shutil.make_archive(base_name=archive_loc,: We ask shutil to zip up the directory for us. Here, you can see us using a Python feature called named arguments, which are also known as keyword arguments. Our variable archive_loc is being assigned to the variable base_name, which has been defined in the internal definition of shutil.make_archive. If you want to learn more about the arguments I have specified, you can read the documentation yourself online.
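The path handling in this function can be tried out on its own. This is a sketch with made-up names: mirrors/www.example.org stands in for a downloaded site and backups for the destination folder.

```python
import os

site_loc = "mirrors/www.example.org"  # hypothetical location of a mirror

# os.path.split returns two values, which we unpack into two variables.
site_dir, site_file = os.path.split(site_loc)
print(site_dir)   # mirrors
print(site_file)  # www.example.org

# With a single variable, we receive both values together as a tuple.
site_path_parts = os.path.split(site_loc)
print(site_path_parts)  # ('mirrors', 'www.example.org')

# Joining strings with +, then building the path to the destination.
archive = site_file + "-" + "2013-08-11"
print(archive)  # www.example.org-2013-08-11
archive_loc = os.path.join("backups", archive)
print(archive_loc)  # the separator depends on your operating system

# Watch out: a trailing slash leaves the second value empty.
with_slash = os.path.split("mirrors/www.example.org/")
print(with_slash)  # ('mirrors/www.example.org', '')
```

The last example is worth remembering. If the location is typed with a slash at the end, site_file will be empty and the archive name will start with a hyphen.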

p = argparse.ArgumentParser(
      description='Compress a website mirror.')
p.add_argument('location',
       help='Where the website mirror is now.')
p.add_argument('destination',
              nargs='?',
              default=os.path.curdir,
              help='The directory to save to.')
args = p.parse_args()

This block of code can be handled as a single unit, but introduces a very important part of Python which I have not yet touched upon: objects. In programming land, an object is data which also has functions. In this code, we create an instance of the ArgumentParser class of objects and assign that instance to the variable p.

Creating a new instance of a class is just like calling a function. Within the Python programming community, the convention is to use CamelCase names to designate a class. So, what is an object? An object carries its own data, often referred to as its state, and functions to manipulate that state. An object’s functions are known as its methods. It takes time to understand what is going on, as the metaphor of a real-world object goes most of the way but will not get you all the way. You will find yourself staring at the screen a little bit as you try to figure it out. Here, we are creating an object which knows how to parse command line arguments and assigning that to p. Calling its parse_args method gives us a second object, which we assign to args. A command line argument is a string that is passed into the program from the outside world.
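You can experiment with the parser without leaving Python. In this sketch, the list handed to parse_args stands in for the words that would be typed on the command line, and the names www.example.org and backups are made up. I have assumed nargs='?' for the destination, which tells argparse that the argument may be left out so that the default can apply.

```python
import argparse
import os

p = argparse.ArgumentParser(description="Compress a website mirror.")
p.add_argument("location", help="Where the website mirror is now.")
p.add_argument("destination", nargs="?", default=os.path.curdir,
               help="The directory to save to.")

# Passing a list stands in for what would be typed on the command line.
args = p.parse_args(["www.example.org", "backups"])
print(args.location)     # www.example.org
print(args.destination)  # backups

# With the destination left out, the default is used instead.
args_default = p.parse_args(["www.example.org"])
print(args_default.destination)  # .
```

The single dot that is printed last is shorthand for the current directory, which is what os.path.curdir holds.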

archive_site(args.location, args.destination)

This is where we do the magic! Our args object has two attributes, location and destination, that we access with the dot operator (.). Feeding them into the archive_site function that we’ve defined will create an archive of the site downloaded at the start. Voilà!
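If you would like to check what ends up inside an archive, the zipfile module can list its contents. The sketch below is a stand-alone experiment rather than part of our script. It builds a tiny pretend mirror in a temporary directory, zips it, and then looks inside. The file names are invented, and I have assumed that we want the contents of the mirror at the top level of the zip, so only root_dir is given.

```python
import os
import shutil
import tempfile
import zipfile

# Work in a throwaway directory so nothing on your computer is touched.
with tempfile.TemporaryDirectory() as work:
    # A stand-in for a downloaded mirror: one directory with one page.
    site_loc = os.path.join(work, "www.example.org")
    os.mkdir(site_loc)
    with open(os.path.join(site_loc, "index.html"), "w") as f:
        f.write("<h1>Hello</h1>")

    # make_archive adds the .zip extension and returns the final file name.
    archive_loc = os.path.join(work, "www.example.org-2013-08-11")
    zip_path = shutil.make_archive(base_name=archive_loc, format="zip",
                                   root_dir=site_loc)

    # List what ended up inside the archive.
    with zipfile.ZipFile(zip_path) as z:
        names = z.namelist()

print(os.path.basename(zip_path))  # www.example.org-2013-08-11.zip
print(names)
```

The list that is printed should include index.html, the one page in our pretend mirror.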

I would love to have your feedback on the book so far.

You can contact me via newroute@timmcnamara.co.nz or post your feedback publicly at https://leanpub.com/a-new-route-to-programming/feedback.

The notable exception is the <pre> tag, which we have not encountered yet. <pre> is shorthand for pre-formatted. The way the text within it is laid out in source code form is the layout which will appear to the website reader.
