A Motivating Introduction
If you have chosen this book, over many others, it is likely that you are slightly more inquisitive and creative than other people who learn programming. You might be more interested in some of the human factors involved in learning to code. This is not an ordinary introduction to computer programming.
What you will learn
This book is not the world’s best introduction to programming. Its main goal is to give you the confidence to pick up and learn the best introductions to programming.
We will be working through several small projects, which together will teach you programming by stealth. You should have so much fun walking yourself through the examples that you can’t help but learn. At least, that is the idea.
This book places people first. People are often tentative and cautious. The approach taken when writing this book is to explicitly acknowledge that.
This book will be most helpful to people who are interested in understanding how programming might be useful in their day-to-day lives.
Some guiding principles
As we go through this book, here are some principles that might help you get to grips with the new environment that you are exploring.
Programming is cultural
Something quite curious about programming is that it is extremely subjective. Although computer technology, and therefore programming, seems very objective, there is huge cultural diversity. Technologists group themselves into very strong, and often adversarial, communities.
What do I mean? Let’s start with programming languages. Some languages may be affiliated with specific corporate interests. Others will be dismissed because they have origins from a particular computing heritage. Others still will be very fashionable.
Let’s try to walk through some examples. If you have not heard of any of these terms before, fear not. Think of them as different types of ice cream.
Java is associated with large corporations and serious business, in part because of its high adoption within those environments.
Some programming language communities are very divisive. Adobe Flash was originally heralded as the way to bring advanced interactivity to websites. Its syntax was similar to the common JavaScript language, but it had far more capability. Unfortunately though, Flash websites were unusable for website users relying on screen readers. They also required very large downloads. This led to Flash web developers being seen as people who cared much more about the aesthetics and presentation of content than about that content being accessible to the widest audience possible.
These cultural styles do change. When Apple, Inc. released the iPhone without support for Flash, Flash’s popularity diminished. JavaScript used to be seen as a clunky, brittle, yet necessary evil. It is now seen as an extremely powerful system for developing software that is very accessible.
Culture goes further than that though. Every part of software development includes communities devoted to specific text editors, specific hardware that software is written on, specific operating systems to run text editors, specific methods of collaborating with others, …
When you begin to write your first code, you should ignore almost all of this. You should learn the programming language which someone close to you knows. If no one you know knows programming, read the appendix (note to self: write the appendix). The quicker you can get up and running, the quicker you can start to learn.
Programmers use analogies
Much of the difficulty in learning programming comes from the fact that your brain does not know how to categorise the information that it is receiving. This is not helped by the fact that programmers will tend to use analogy a great deal. They do this because the underlying technology is very complex.
When you come into a new field of expertise, your brain will attempt to attach new information into pre-existing buckets.
As you will see, you might find out that you cannot create threads with strings. Nor does glue code have anything to do with objects.
Just like legal jargon, things that sound like ordinary English can mean something quite different when used within the field. Jurists themselves have a term for jargon: a term of art.
Programmers like to simplify
One part of programming that you will not be able to overcome is that programmers like to make things simple for themselves. Unfortunately, the effect of simplification can be to exclude others.
It is hard to describe this phenomenon without further context. Just be aware that it’s a problem. When you encounter it, you should be reassured that
Programming works by building up components
Much of programming involves combining simple things that achieve a little by themselves into complex things that achieve a great deal.
Ultimately, everything that you see on a computer monitor is the product of a process that started with zeros and ones interacting. The wonders that are enabled by technology today, including the programming languages used to build those wonders, are possible because people spent a great deal of time thinking about how to get zeros and ones to interact.
As you read through other technical material, this will be known as abstraction. Unlike abstraction in the creative arts, abstraction in programming terms tends to make things more useful and practical. If it helps, you can substitute the word think of things
Programming works by breaking down problems
Problems are complex. In order to achieve an overarching goal, many things need to happen correctly, in the right order. When building a technology system, it is also important to make sure that different parts of the solution are able to inform the other parts that they are done and that there are processes in place for dealing with errors.
Programming works by making trade offs
Every time we make a decision about which components to build up together, or which parts of a problem should be split into sub-problems, we make trade offs.
The trade offs that programmers make mean that there is a great deal of scope for creativity. As you
Programmers will have the
Because programmers need to make decisions
Naming things is hard
As you enter your programming life, you should be very prepared to spend many minutes looking at the computer deciding what things should be called. It will be frustrating. It will be mostly unavoidable.
First steps programming
In the next section of the book, we’ll be discussing what geeks mean when they use the term data. It is a relatively abstract topic, which is likely to lose people who are more practically inclined. So first, we will be taking our first steps with programming by interacting with real web pages. This may sound daunting, but you will be fine. We are trying to get an appreciation for what programming can offer us.
Step 1: Open a web browser
You will need to use a laptop or desktop machine, rather than a tablet or mobile device. That is because the mobile versions of browsers have fewer options for developers.
I recommend using a recent version of Firefox or Chrome.
If you do not have access to one of these browsers, you should install a browser extension called Firebug Lite. Firebug will enable you to follow along, but it might take a few minutes to set up. When you open its web page, click on the big red button on the top right.
Step 2: Open its web developer console
Internet browsers come built with several tools that make it easy for web developers to make sense of what is going on with a browser. One of them is an interactive console that allows us to do programming right there in the browser.
If you hunt around in the tools menu, you’ll find it.
Browser | Menu Path |
---|---|
Firefox | Tools → Web Developer → Web Console |
Chrome | Tools → JavaScript Console |
If things have worked correctly, a sub-window will pop up from the bottom of the page. What you are interested in is a greater than sign (>).
Step 3: Begin to code
Let’s demonstrate that computers can count. Type 1 + 1 and push Enter. The computer should respond with 2.
    > 1 + 1
    2
If that worked, I would like to give you a small taste of some of the power you have at your fingertips.
You are not going to hurt the computer by making mistakes. One of the good things about JavaScript in the browser is that nothing you do can affect the system that is running the browser. The browser acts as a shield as well as an enabler.
Visit http://www.nzherald.co.nz. If your web console is still open, you may see a bunch of garbage appear in front of you. Ignore all of that. Click on a news story headline. Once you are inside an article, enter the following line into the > prompt and push Enter:
    > $("#articleBody p").text()
All going well, you will see a large block of text appear. The formatting is terrible, but you should be able to make out the content. I hope this has given you a taste of some of the really complex things you will be able to achieve by working through the book.
What have we just done? We have sucked out the paragraph (p) tags from the HTML element with the id articleBody and then asked for the text from those p elements to be returned to us.
Some pointers for the inquisitive:
- Perhaps think of the dollar sign as shorthand for the word “select”. As I mentioned in the first chapter, programmers like to simplify at the expense of ease of learning.
- The brackets (parentheses, for non-New Zealanders) ask JavaScript to execute the command to the left. When something appears within the brackets, we are providing context. In the case of the $ command, we are providing "#articleBody p" as context. The double-quotes are significant here, as they tell JavaScript to treat these characters as a single string of characters. Formally, this extra context is known as providing arguments to functions, although we have only provided a single argument to $.
- Once a function returns a value, we can ask the value to provide commands. $("#articleBody p") returns a list of HTML elements. Those elements understand the text function, which is called with zero arguments by just using empty brackets ().
- We call functions on return values by connecting them with a dot.
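The same three ideas (arguments inside brackets, empty brackets for zero arguments, and dots to call functions on return values) turn up in most languages. Here is a minimal sketch in Python, the language we will use later in the book. The headline text and the variable names are made up for illustration.

```python
# A string, wrapped in double quotes.
headline = "  local cat wins election  "

# strip() is called with zero arguments, so the brackets are empty.
trimmed = headline.strip()

# Calls can be chained with dots: each one acts on the previous return value.
shouted = headline.strip().upper()

# replace() takes two arguments, separated by a comma.
corrected = shouted.replace("CAT", "DOG")

print(trimmed)
print(shouted)
print(corrected)
```

Reading the chained line from left to right is usually the easiest way to follow it: start with the headline, strip the spaces from its ends, then convert what comes back to upper case.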
How computers interpret information
As we discussed in the first section of the book, one of the things that programmers do is build up blocks. The first case that we’ll see of that is building together different data types.
Data types tell the computer which operations are legal between two chunks of data. Addition between two numbers makes sense. Addition between two sentences makes less sense. As different data types are actually using the same underlying system (zeros and ones), there need to be some rules in place which prevent things from clobbering each other.
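To make the idea of legal operations concrete, here is a small sketch in Python. It is only an illustration: other languages make different choices about what happens when types are mixed, and the variable names are my own.

```python
# Adding two numbers does arithmetic.
number_sum = 1 + 1

# "Adding" two strings joins them end to end instead.
text_sum = "1" + "1"

# Mixing the two is not a legal operation in Python, so it raises an error.
try:
    mixed = "1" + 1
except TypeError as error:
    mixed = None
    message = str(error)

print(number_sum)
print(text_sum)
print(mixed)
```

The error is the data type rules doing their job. Python refuses to guess whether you meant arithmetic or joining.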
Let’s look at how a computer might try to represent the following real world things:
- numbers
- text
- pictures
- dates
- lists
- places
- truth
- uncertainty
- nothingness
These real world things or in some cases concepts are only a small set of things that programmers might want to represent. However, once we have these they can act as building blocks to create more complex structures. This may sound quite vague at this stage, because we haven’t discussed how to do that. However, we’ll get there.
Numbers
Numbers are interesting. Numbers go up to infinity, yet there is finite space within your computer. Therefore, we need to make trade offs. You may have encountered a form field in the past that refused.
Integers
Let’s start by talking about integers. If you remember, integers are whole numbers on the number line. They don’t have a decimal point. Here are some integers: -1, 0, 1, 2, 3.
An 8 bit computer, like those that were common in the 80s, stores an integer using 8 zeros and ones. That is, 00000001 represents 1, 00000010 represents 2 and so on up till 255 (11111111). Each position is called a bit. As a group of 8, they’re called a byte.
When you hit the top, you flip back down to 0. Perhaps think of it as a car’s odometer. Many modern programming languages will take care of this problem for you by dynamically changing the type before it overflows back to zero. They will simply use more bytes. However, other systems like databases are often much more rigid. That’s because databases are less able to expand the space they can allocate.
If you see a limit imposed at 255, it has probably been imposed by technical constraints rather than design. One of the more famous examples of this is Pac-Man, which would crash after level 255. Not every program knows how to deal with its odometer being reset.
More modern computers use more bits to represent integers. When you talk to a programmer, and use the term int (shorthand for integer), they will assume that the computer is using 32 bits to represent integers. 32 bits allows programmers to have an odometer that goes up to 2^32 − 1, or 4,294,967,295.
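If you would like to see the odometer in action, here is a rough sketch in Python. Python’s own integers grow as needed, so the flip back to zero has to be imitated. The function name add_like_an_odometer is made up for this example.

```python
BITS = 8
SIZE = 2 ** BITS          # 256 different values fit into 8 bits


def add_like_an_odometer(a, b):
    """Add two numbers, flipping back to 0 after 255."""
    return (a + b) % SIZE


largest_8_bit = SIZE - 1
largest_32_bit = 2 ** 32 - 1

print(format(1, "08b"))               # the number 1 written as 8 bits
print(format(2, "08b"))
print(format(largest_8_bit, "08b"))
print(add_like_an_odometer(255, 1))   # the odometer flips over
print(255 + 1)                        # Python's own integers simply grow
print(largest_32_bit)
```

The % sign gives the remainder after division, which is one common way of imitating a counter that wraps around.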
Negative numbers
So far, we’ve only talked about the numbers from 0 to infinity. What about -5 and friends? Well, programmers have decided to use up some of their valuable bits to enable a wider range of numbers to be represented. The first bit acts like a switch. If it’s 0, it’s
Real numbers
Decimals and fractions are also hard, for similar reasons to integers. We, the human operators, want to be able to represent every possible number. However our computer tools have finite space. With numbers that have decimals, the trade offs are more difficult.
A real number is called a float, as its decimal point can float around. The number 4.552 is much different than 45.52.
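One consequence of that trade off is easy to see for yourself. This is a minimal sketch in Python, although most languages that use standard floats behave the same way.

```python
import math

total = 0.1 + 0.2

print(total)                      # very close to 0.3, but not exactly 0.3
print(total == 0.3)
print(math.isclose(total, 0.3))   # compare with a small tolerance instead
```

Neither 0.1 nor 0.2 can be stored exactly in the space available, so the tiny errors show up in the sum. The usual workaround is to ask whether two floats are close, rather than whether they are equal.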
Text
To the computer, text looks the same as a long list of integers. Sections of text are called strings by programmers. Interestingly, it is perfectly acceptable to have a string of length zero.
Matching between an integer that appears in a string and a character that might appear on the screen is described in an encoding. Coarsely, an encoding is an arbitrary mapping between a number and a character. The encoding that might be worthwhile spending a minute or two on is ASCII.
These days, ASCII has been supplemented by a much more complex system called Unicode. Unicode allows for every language to be processed by computers.
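Here is a small sketch in Python of the mapping between characters and numbers. The example words are my own. UTF-8, which appears below, is one common way of storing Unicode text as bytes.

```python
letter = "A"
code = ord(letter)            # from a character to its number
back = chr(code)              # and from the number back to the character

ascii_bytes = "Hi".encode("ascii")

# \u0101 is the letter a with a macron, which ASCII cannot represent.
maori_word = "M\u0101ori"
utf8_bytes = maori_word.encode("utf-8")

print(code, back)
print(list(ascii_bytes))                  # the integers behind "Hi"
print(len(maori_word), len(utf8_bytes))   # 5 characters stored in 6 bytes
```

Notice that the number of characters and the number of bytes are no longer the same once we step outside ASCII.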
What text looks like as a programmer
If you see something wrapped up in quotes, it is probably a string:
"Hi there String Two, how are you?"
"Hello String One, I am great thank you."
Depending on the programming language that you choose to use, single (') and double quotes (") might mean different things. We will try to stick to double quotes within this book to prevent confusion.
Stumbling blocks
Non-printable characters
One of the things to grasp when you encounter a programming language is that everything is a character. When you push the backspace key on the keyboard, a backspace character is sent to the application that you are writing on. It is up to that application to know what to do with a backspace character, that is remove the last character next to the cursor.
Here are the typical ways to write the backspace, new line and tab characters.
Character | Escape sequence |
---|---|
Backspace | \b |
New line | \n |
Tab | \t |
Note, that the representation of the string that you’re seeing does not reflect the underlying storage. Even though you are being presented with two characters on the screen, the computer is only storing one.
Programming languages use escape characters to represent non-printable characters on the screen. Most programming languages use a backslash as an escape character (\). When an escape character is encountered, the computer will change the way it interprets the following character. In the case of \n, it will build a new line, rather than a literal n. If you are wondering how to print out an actual backslash, given that it is being used as an escape character, you might have guessed that you can just add a second backslash, e.g. \\.
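You can check the claim that two characters on the screen are stored as one. This is a minimal sketch in Python, and the variable names are my own.

```python
two_lines = "first\nsecond"
newline = "\n"
backslash = "\\"

print(two_lines)        # appears on two separate lines
print(len(newline))     # 1: we typed two characters, the computer stores one
print(len(backslash))   # 1: the first backslash only escapes the second
print(backslash)
```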
Line endings
Line endings are a real pain. Operating systems behave differently. In MS Windows, lines within files end with both a carriage return character and a line feed character. In OS X and Linux, they end with a single line feed character, which is another name for the new line character.
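Here is a short sketch in Python of what the two conventions look like inside a string. In escape character form, a carriage return is written \r and a line feed is written \n. The text itself is made up.

```python
windows_text = "line one\r\nline two\r\n"
unix_text = "line one\nline two\n"

# splitlines() understands both conventions.
print(windows_text.splitlines())
print(unix_text.splitlines())

# Converting Windows line endings to the Linux and OS X style.
converted = windows_text.replace("\r\n", "\n")
print(converted == unix_text)
```

Many tools do this conversion quietly for you, which is convenient right up until the moment one of them does not.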
Pictures
The data storing pictures is typically referred to as “binary data”.
The reference to the term binary relates to the fact that the raw zeros and ones are the best we can get. It is extremely hard, without context such as a file extension, for a programmer to know what to do with binary data. To programmers who do not have access to the specialised encoding or format, binary data is opaque and unwieldy.
In many ways, binary data is like a string. It is a sequence of bytes. The difference is that as a programmer, we don’t know which operations are legal. Do we treat the bytes like ASCII, like Unicode, like an array of integers or what?
Pictures are encoded in an elaborate encoding called a file format. The difference between a file format and an encoding is somewhat murky, although you could say that a file format is intended to be much more general than an encoding designed to handle text.
There is no easy way to know from looking at binary data what it represents. Note: There are some harder ways. Every PNG file has the letters “PNG” within its first few bytes. If you feel like spending some time wandering through Wikipedia, you might decide to look up “Magic numbers”. A magic number is some context.
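As a sketch of how a magic number can be used, here is some Python that checks whether a chunk of binary data starts with the eight bytes that begin every PNG file. The function name and the two pretend files are made up for illustration.

```python
# The 8 bytes that every PNG file starts with.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def looks_like_png(data):
    """Return True if the bytes start with the PNG signature."""
    return data.startswith(PNG_SIGNATURE)


# Made-up stand-ins for the contents of two files.
pretend_png = PNG_SIGNATURE + b"...the rest of the picture..."
pretend_text = b"Hello, I am plain text."

print(looks_like_png(pretend_png))
print(looks_like_png(pretend_text))
```

A matching signature is only a hint. It tells you what a file claims to be, not that the rest of its contents are valid.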
Dates
Computer designers cheat. They’ve picked an arbitrary date, 1 January 1970, and simply count the seconds since.
Just like email and web servers, there are actually time servers on the Internet that tell your computer the correct time. See http://www.ntp.org/.
What actually happens inside the computer may surprise you. As dealing with time is tricky, given that there are leap years and leap seconds, it seems natural that there would be an elaborate scheme for handling that.
There isn’t one. The computer stores a date as an int. The more elaborate work only happens when you ask for the representation of a date as a string, say.
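Here is a minimal sketch in Python of counting seconds from 1 January 1970. I am assuming that we want the answers in UTC, which is the time zone the count is defined in.

```python
import datetime

UTC = datetime.timezone.utc

# Zero seconds after the starting point.
epoch = datetime.datetime.fromtimestamp(0, tz=UTC)

# One day is 60 * 60 * 24 seconds.
one_day_later = datetime.datetime.fromtimestamp(60 * 60 * 24, tz=UTC)

# The current moment, as a plain integer.
now = datetime.datetime.now(tz=UTC)
seconds_since_epoch = int(now.timestamp())

print(epoch.isoformat())
print(one_day_later.isoformat())
print(seconds_since_epoch)
```

The last number printed will be different every time you run the code, because the count keeps ticking.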
Lists
Lists of things are useful. They allow us to perform operations on a group of entities, rather than one at a time.
The syntax varies, but the concept is fairly similar between languages. Strings can often be considered a list of characters. Some languages will not allow you to mix data types within a list, i.e. they only allow homogeneous lists. Others are much more flexible and allow heterogeneous lists.
Programming languages will often provide programmers with several types of collections, lists or things that create sequences of elements. When you’re learning, pick the thing with the simplest name. It will generally be the most general and most widely applicable data structure.
One thing to note is that you should generally be wary of the term set. Most programming languages will use the term set to mean an implementation of a mathematical set, which is an unordered collection of unique objects. Also, if you hear the term bag being used, it means something similar to a set in that there is no inherent ordering, except that it allows duplicates.
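To compare the three ideas side by side, here is a small sketch in Python. Python has no type that is literally called a bag. Counter, from its standard library, is the closest fit that I know of. The bird names are made up.

```python
from collections import Counter

votes = ["kiwi", "tui", "kiwi", "kea"]

as_list = list(votes)      # ordered, duplicates kept
as_set = set(votes)        # unordered, duplicates dropped
as_bag = Counter(votes)    # unordered, duplicates counted

print(as_list[0])                  # lists remember their order
print(len(as_list), len(as_set))   # 4 votes, but only 3 different birds
print(as_bag["kiwi"])              # how many times "kiwi" appears
print(list("kiwi"))                # a string treated as a list of characters
```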
Truth values
We haven’t even really started programming yet. However, I can assure you that you will be spending a great deal of time testing against Boolean values. Named after the logician George Boole, Boolean values are related to logic and are especially related to making decisions.
Deciding what represents truth and what represents falsehood is an area where programming languages are at their most unique.
As a very general rule, data that looks like a 0 will represent falsehood, whereas non-0 values will represent truth. Jargon that is sometimes used is to describe some values as “truthy” and some values as “falsey”. In many languages, there will be special keywords to describe Boolean values explicitly. This is generally some variant of True or true. Note however, that when used this way they do not have quotes around them. In Common Lisp, a fairly uncommon language, the keywords are t and nil. In the Python programming language, empty strings and lists also evaluate to false.
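Here is a quick sketch of how Python in particular sorts values into truthy and falsey. Other languages will give different answers for some of these, so treat it as an illustration of one language’s choices.

```python
examples = [0, 1, 0.0, "", "0", "hello", [], [0], None]

for value in examples:
    if value:
        print(repr(value), "is truthy")
    else:
        print(repr(value), "is falsey")
```

The one that catches people out is "0". It is a string containing one character, so Python considers it truthy even though it looks like a zero.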
Uncertainty
Representing uncertainty is generally left up to the programmer, rather than the programming language. In order to do this, we would use a float.
Probability | Likely representation |
---|---|
100% | 1.0 |
70% | 0.7 |
0% | 0.0 |
How would this be used in practice? Well, by taking advantage of the float data type, we can make use of all of its legal operations such as multiplication.
When using a standard float, the computer will not know inherently that you are representing a probability value though. We may encounter a situation where a value falls outside of the bounds of 0.0 and 1.0. If you wish to create your own bespoke data type to place constraints, you are able to do that. Other programming books can show you how.
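Short of building a bespoke data type, a simple check can catch values that stray outside the bounds. This is only a sketch in Python. The function name check_probability and the rain figures are made up, and multiplying two probabilities like this assumes that the two events do not influence each other.

```python
def check_probability(p):
    """Return p unchanged, or complain if it is not between 0.0 and 1.0."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("a probability must be between 0.0 and 1.0")
    return p


rain_today = check_probability(0.5)
rain_tomorrow = check_probability(0.5)

# Multiply to get the chance of rain on both days.
rain_both_days = rain_today * rain_tomorrow

print(rain_both_days)
```

Calling check_probability(1.5) would stop the program with an error, which is usually better than quietly carrying on with a nonsense value.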
Nothingness
Representing the property of non-existence is not just a metaphysical problem for computers. Like many things we’ve discussed, different languages make different choices on how to handle it. Most languages will have a keyword that will represent nothingness. Some common variants include:
- null
- nil
- undefined
- none
None and friends can lead to odd consequences.
You should be aware that JavaScript contains both null and undefined.
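In Python, the keyword is None. Here is a minimal sketch of where you are likely to bump into it. The names and ages are made up.

```python
ages = {"Aroha": 34}

known = ages.get("Aroha")
unknown = ages.get("Bob")     # there is no entry for Bob, so we get None

print(known)
print(unknown)
print(unknown is None)

# None is falsey, but so is 0, so it is safer to test for None explicitly.
newborn_age = 0
print(bool(newborn_age), newborn_age is None)
```

The last line is the kind of odd consequence worth watching for. A newborn’s age of 0 is a perfectly good answer, yet a careless truth test would treat it the same as having no answer at all.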
Create a comprehensive website backup
The first thing that we will do is look at the process of backing up a website. Through the process, we’ll play with some of the data types that were introduced at the start of the book. Which data types will we be looking at? Well, strings, dates and integers.
This section will be the first time that we will look at making decisions.
An Overview of the Approach
The actual mechanics of what we will be doing are quite simple:
- use an external program to download a website on our behalf and store it as a .zip file.
- use some software that we will write together to store that .zip file in a folder that has the date
Before we get to the mechanics of how this works though, there is quite a lot of background material available for you to look through in case you don’t have a lot of knowledge about how web pages come to be displayed on your computer monitor or how web links actually work.
Tools
I’m going to be using a tool called wget to do the actual downloading. wget is a tool that runs on Linux and OS X. If you are running Windows, you can download web pages using an Internet browser and push Ctrl+S to save the page. You won’t have the advantage of being able to deal with the whole of a website, but that’s not necessary while you’re starting out.
For processing data, we’ll be using a programming language called Python.
Python is a handy programming language to learn if you don’t already know one. It tends to have fewer arcane syntax rules and the language community prefers to settle on agreed conventions for things. Other languages pride themselves on saying that there is more than one way to do it. In the Python world, having fewer things to learn should help you be productive relatively quickly.
The reason that we will be using Python, rather than JavaScript introduced earlier, is that we need to deal with files on your machine. Working with JavaScript in the browser means interacting with web pages delivered via the Internet.
Defining a web page
In their simplest form, web pages are made up of text. However, these days the only web pages that only contain text tend to be written by academics. Most pages reference other resources, like images, JavaScript code, something you may not have heard before called style sheets and perhaps even font files. The job of the web browser is to gather up all of these resources and render a single page for the reader.
The text of a web page is defined within a language called HTML or Hypertext Markup Language. So, what does that actually mean?
- Hypertext refers to the ability for one piece of text to link to other pieces of text, anywhere in the world.
- Markup refers to the way that internal structure is defined. In a markup language, sections such as paragraphs are marked up with opening and closing tags. In HTML, the tag for a paragraph is <p>, and its associated closing tag is </p>. To be very explicit, a small paragraph might look like this in source form, <p>Some text</p>, but will be rendered as Some text.
- Language just means that there is a defined grammar for markup and adding hyperlinks. You can’t just add your own tags or syntax for defining tags at a whim.
How web pages link to each other
Within HTML, the anchor tag (<a>) provides for cross-referencing. We haven’t touched upon it yet, but tags have hidden attributes as well as text. One of the <a> tag’s attributes is href, or hyperlink reference.
Let’s see if this makes any sense. Can you make out what is happening here?
    <p>
      Some text.
      <a href="http://www.example.org/">Click here</a>
      for more info.
    </p>
If you are somewhat baffled, we are nesting our <a> tag within the base <p> tag. We are marking up the “Click here” text as a source anchor with a hyperlink to example.org.
Tags nested inside other tags are referred to as children of the broader tags, which are referred to as parents. This structure, which comes from the ability to nest, is known to computer scientists as a tree. The element at the base is known as the root and each descendant element is a branch, extending to elements with no children, the leaves. Remember what I said at the start, computer scientists like to use metaphor!
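If you would like to see the tree for yourself, here is a rough sketch using the HTML parser that comes with Python. The class name TreePrinter is made up. The depth counter assumes that every opening tag has a matching closing tag, which is true of our snippet but not of every web page.

```python
from html.parser import HTMLParser


class TreePrinter(HTMLParser):
    """Record each opening tag with its depth, and collect link targets."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Indent each tag by how deeply it is nested.
        self.outline.append("  " * self.depth + tag)
        self.depth += 1
        # attrs is a list of (name, value) pairs, such as ("href", "...").
        for name, value in attrs:
            if tag == "a" and name == "href":
                self.links.append(value)

    def handle_endtag(self, tag):
        self.depth -= 1


snippet = (
    '<p>Some text. <a href="http://www.example.org/">Click here</a> '
    "for more info.</p>"
)

parser = TreePrinter()
parser.feed(snippet)

print("\n".join(parser.outline))
print(parser.links)
```

The indented outline shows a sitting inside p, which is the parent and child relationship described above.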
What a full web page looks like
Now that we have seen what a snippet of HTML looks like, it will be worthwhile to expose you to an HTML document in its full glory. Note that formatting is arbitrary and does not affect how things are rendered.
    <!DOCTYPE html>
    <html lang="en">
      <head>
        <title>An example webpage</title>
      </head>
      <body>
        <p>
          Some text.
          <a href="http://www.example.org/">Click here</a>
          for more info.
        </p>
      </body>
    </html>
Let me explain what is going on here, section by section, for people who have not encountered HTML before.
    <!DOCTYPE html>
When files are sent over the Internet, it is sometimes difficult for the receiver to know exactly how to interpret the content. You may have noticed that there are many URLs that don’t contain file extensions. Web pages tell their receivers what they are on their first line.
For people who are very interested in the details, <!DOCTYPE html> is a document type declaration. The declaration describes which tags are legal syntax. In effect, you are telling browsers to apply the, somewhat messy and flexible, rules of real-world, hand-written HTML to whatever follows.
    <html lang="en">
      …
    </html>
This is the root tag of HTML. It enables us to hang the rest of our source off of it. Browsers will cope if you leave it out, however it is good form. As you become a proficient programmer, you will find yourself wanting your source code to be dapper.
    <html lang="en">
      <head>
        <title>An example webpage</title>
      </head>
      …
    </html>
The <head> element is somewhat interesting. Its contents are largely invisible to viewers of the web page. The only thing in here so far is the <title>, which defines what appears as the text at the top of a browser tab. Real-life web pages will put lots of extra info in here, like external CSS and JavaScript files, as well as hints for social media sites and search engines about how to describe the content when it is shared or appears in search results.
    <html lang="en">
      …
      <body>
        <p>
          Some text.
          <a href="http://www.example.org/">Click here</a>
          for more info.
        </p>
      </body>
    </html>
The <body> tag is where the content which is visible to the reader appears.
External resources
Understanding the distinction between a web page and its component resources is important when trying to use automated tools. Unless told otherwise, most tools will ignore content that they can’t read like images and content they don’t care about like typography instructions. Knowing that a web page only has one URL, but is actually made up of content from several, can greatly reduce confusion.
Here are the resources that make up most web pages:
- JavaScript
- A programming language that runs on all web browsers. As we’ve discovered, it is a way to add interaction to websites. It’s also the tool to build web apps. If you have heard the term “Front-end Developer”, read it as “JavaScript Programmer”.
- CSS
- Cascading Style Sheets, are declarations for typography and layout that web browsers understand.
- Images
- and possibly other multimedia such as videos and sound.
A note about embedded videos
If you see a video embedded into a web page, it is actually a separate web page in its own right. If you look at the source, it is very likely that there are <iframe> tags. Although it may only look like a video, your browser has actually pulled down a whole other website which is being served by the video host.
Respecting robots.txt
There is a page on most websites that you are unlikely to have seen before. It sits at the base of a website at /robots.txt. This file declares rules to automated web agents, such as search engines and web harvesters. The tool that we’re using, wget, respects these rules unless told to explicitly ignore them.
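To give a feel for what those rules look like, here is a small sketch using the robots.txt reader from Python’s standard library. The rules, the web addresses and the agent name my-backup-script are all made up for illustration. A real program would read the rules from the website itself.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt, written out as a list of lines.
rules = [
    "User-agent: *",
    "Disallow: /private/",
]

robots = RobotFileParser()
robots.parse(rules)

# Ask whether our (hypothetical) program may fetch each address.
allowed_home = robots.can_fetch(
    "my-backup-script", "http://www.example.org/index.html")
allowed_private = robots.can_fetch(
    "my-backup-script", "http://www.example.org/private/diary.html")

print(allowed_home)
print(allowed_private)
```

The star after User-agent means that the rule applies to every automated agent, and Disallow lists the part of the site that those agents are asked to stay out of.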
How we select specific elements
Note to self: explain CSS selectors
Downloading web sites
This first command gets us most of the way there:
    $ wget --mirror www.example.org
If you do not have Linux or OS X running, you will not be able to run
We are asking wget to replicate, or mirror, the web site www.example.org. To be honest with you though, www.example.org is a fairly boring website to mirror because it only has one page. Feel free to try mirroring another website that you have a copyright licence to.
This one is more complete. The \ character tells the computer to ignore the newline and means that we can have each option on its own line.
    $ wget --mirror \
        --local-encoding="utf-8" \
        --adjust-extension \
        --convert-links \
        --page-requisites \
        www.example.org
This invocation is far more thorough and will allow you to view the downloaded site.
Compress the downloads
Now that we have a complete website, we should store it in a compressed format for backup purposes. Rather than needing to run an application by hand to do that for us each time, let’s write a computer program to do that for us.
As well as explaining the code that follows, part of what I will be showing you is instructions on how to interpret code examples like this so that you can follow along with web tutorials.
I will be introducing some programming concepts. These include variables, flow control, interacting with the outside world, as well as defining and using functions.
    #!/usr/bin/env python
    import argparse
    import datetime
    import shutil
    import os


    def today():
        # Today's date as a string, such as "2024-01-31".
        t = datetime.date.today()
        t_as_string = str(t)
        return t_as_string


    def archive_site(site_loc, dest):
        # Name the archive after the mirror's folder plus today's date.
        site_dir, site_file = os.path.split(site_loc)
        archive = site_file + "-" + today()
        archive_loc = os.path.join(dest, archive)
        # Zip up everything inside site_loc.
        shutil.make_archive(
            base_name=archive_loc,
            format="zip",
            root_dir=site_loc,
            verbose=True)


    p = argparse.ArgumentParser(description='Compress a website mirror.')
    p.add_argument('location', help='Where the website mirror is now.')
    # nargs='?' makes the destination optional, so the default can apply.
    p.add_argument('destination', nargs='?', default=os.path.curdir,
                   help='The directory to save to.')
    args = p.parse_args()
    archive_site(args.location, args.destination)
Wow, this is a lot of code to take in. I am really glad you’ve managed to make it all the way through it. Let’s go through each section step by step.

    #!/usr/bin/env python

Do you remember how the document type declaration of HTML (<!DOCTYPE html>) helped your web browser know what is legal syntax? The same thing can happen inside your computer too, and this first line is how. If you have only used computers running an MS Windows operating system, this may look very odd. Computers from the UNIX operating system heritage, such as Linux and OS X, do not rely on file extensions. Instead, they look for a special comment in a file’s first line to figure out what to do with it. In this case, we are telling the computer to run the /usr/bin/env program and give it the argument python. It is a long way of saying that we want this piece of text to be run by the Python language interpreter, i.e. something that knows what to do with Python source code.
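To make the split between the program and its argument concrete, here is a minimal sketch written in Python itself. The function name parse_shebang is hypothetical, and the sketch only imitates what a UNIX-like operating system does with the first line; it is not how the operating system is actually implemented.

```python
def parse_shebang(first_line):
    """Split a shebang line into the program to run and its arguments.

    Returns None when the line is not a shebang line at all.
    """
    if not first_line.startswith("#!"):
        return None
    # Drop the leading "#!" marker, then split what is left on whitespace.
    parts = first_line[2:].split()
    if not parts:
        return None
    program, arguments = parts[0], parts[1:]
    return program, arguments


print(parse_shebang("#!/usr/bin/env python"))   # → ('/usr/bin/env', ['python'])
print(parse_shebang("import argparse"))         # → None
```

The first call shows the program (/usr/bin/env) separated from its argument (python). The second shows that an ordinary line of code is not treated as a shebang.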

    import argparse
    import datetime
    import shutil
    import os

This chunk of code is somewhat opaque. I wish it didn’t need to be at the top, because in many ways it is the hardest bit of code to explain. The simplest way to think of it is that we would like to pull in someone else’s pre-written code into this file. We call those chunks of external code modules.
Here is what each of those modules does:
- argparse
- Parses command line arguments. This will allow the script that we are writing to take in context provided by you when you run the code. What we will be doing is asking for the location of the website mirror that we have previously downloaded.
- datetime
- Contains many functions allowing programmers to deal with times, dates and time zones. In our code, we will actually be accessing the date class, which doesn’t know anything about times. It does, however, know today’s date.
- shutil
- The term shutil is shorthand for shell utility. shutil provides several tools for interacting with files. In our case, we will be relying on shutil to create a zip archive for us. The term shell can be a difficult analogy to grasp. It can be thought of as the way that people interact with the operating system. It is in some sense the surface which is visible to you.
- os
- The os module provides programmatic access to operating system details like files and, most relevantly to us, directories. If Python is running on MS Windows, the os module will use Windows-specific commands. If it is running on VxWorks, the operating system that powers most Mars rovers, it will run commands specific to VxWorks. In our code, we’ll be using it to split apart and join together file paths in an operating system agnostic way.
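If you would like to poke at these modules yourself, here is a small sketch that uses each one once. The folder and file names (backups, my-site) are made up for illustration, and nothing is written to your disk.

```python
import argparse
import datetime
import os
import shutil

# datetime: build a date object. A fixed date keeps the example repeatable.
d = datetime.date(2013, 8, 11)
print(str(d))   # → 2013-08-11

# os: join and split file paths using the separator of your operating system.
path = os.path.join("backups", "my-site")
folder, name = os.path.split(path)
print(folder, name)     # → backups my-site

# shutil: ask which archive formats it can create on this computer.
format_names = [fmt for fmt, description in shutil.get_archive_formats()]
print(format_names)

# argparse: parse a list of strings instead of the real command line.
parser = argparse.ArgumentParser()
parser.add_argument("location")
args = parser.parse_args(["my-site"])
print(args.location)    # → my-site
```

On a typical Python installation the list of archive formats includes "zip", which is the one our script relies on.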
Let’s carry on to the first piece of our own code, defining the today function.

    def today():
        t = datetime.date.today()
        t_as_string = str(t)
        return t_as_string

Even though this section is quite short, at four lines, I’m going to spend quite a bit of time working through them.
Here is an overhead view of what is happening to start us off. We are telling Python to define a function called today, which accepts no arguments. Once we’re in the function, we find out what day it is, with the help of the datetime.date.today function and a temporary variable called t. We then convert t to a string, store it in another temporary variable, and return that string to whoever called the function.
def today():
- Here we use the def keyword to define a function called today. Function names can start with a letter or an underscore, and can contain digits after that. The brackets, or parentheses if you prefer, are placeholders for any arguments that the function might accept. In our case, we don’t have any arguments that we want to accept. If we wanted to specify the time zone, we could do so here. This statement ends with a colon. You will gain a feel for when to use a colon over time. Think of it as telling Python that you wish to logically demarcate a block of code. Everything indented after the colon is associated with the def keyword. Once the source code returns to the same indentation as def, it is no longer part of the definition.
t = datetime.date.today()
- We do three important things in this line. We create our first variable, t. We call datetime.date.today(). We then assign the result of that call to t. You may be confused by the use of the equals sign here. Many programming languages will do this, unfortunately. Although equals implies a symmetrical relationship, there is actually a transfer of information from right to left. Whatever is returned from datetime.date.today() will be assigned to t after the call has been completed. You will often read tutorials and listen to talks where people even say “t equals datetime dot date dot today”. However, I prefer to say “t is assigned to”.
t_as_string = str(t)
- Do you remember me describing how the computer stores times as integers earlier on in the book? Well, here we force Python to represent whatever is in the t variable, that is an object representing today’s date, as a string.
return t_as_string
- This asks Python to return the string representation of today’s date to the caller. Caller is just jargon for whichever piece of code ends up calling today in the future.
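The difference between a date object and its string form is easier to see with a fixed date. This sketch assumes a made-up date, 11 August 2013, so that the result is the same whenever you run it. It also shows assignment moving information from right to left.

```python
import datetime

# Build a date object by hand instead of asking for today's date.
t = datetime.date(2013, 8, 11)
print(type(t))          # a date object, not text

# str() asks the date object for its text form.
t_as_string = str(t)
print(type(t_as_string))
print(t_as_string)      # → 2013-08-11

# Assignment is not symmetrical: the right-hand side is worked out first,
# then the result is stored under the name on the left.
count = 1
count = count + 1
print(count)            # → 2
```

The line count = count + 1 would make no sense as a statement of equality, but it is perfectly reasonable as an instruction to compute a value and store it.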
In Python, we can overwrite previously assigned variables. That means that we could have written our code like this:

    def today():
        t = datetime.date.today()
        t = str(t)
        return t

In fact, there is nothing stopping us from simply doing the string conversion on the same line as the return statement:

    def today():
        t = datetime.date.today()
        return str(t)

To go even further, we could have done away with all of our temporary variables. We could have written everything on a single line if we had wanted to:

    def today():
        return str(datetime.date.today())

Every programmer will have different stylistic preferences. It is a value statement to decide between conciseness and readability. As your programming experience matures, you will find yourself becoming very picky about which style you prefer. This is perfectly natural, but be wary of building artificial walls and disregarding technical competence due to style choices.

    def archive_site(site_loc, dest):
        site_dir, site_file = os.path.split(site_loc)
        archive = site_file + "-" + today()
        archive_loc = os.path.join(dest, archive)
        shutil.make_archive(base_name=archive_loc,
                            format="zip",
                            root_dir=site_loc,
                            base_dir=site_loc,
                            verbose=True)

This piece of code is complex, but will introduce some very valuable tools for you as you progress.
def archive_site(site_loc, dest):
- We’re using the def keyword to define another function, called archive_site. archive_site takes two arguments, one called site_loc and the other dest. The site_loc variable refers to the current location of the site that we have downloaded; dest is where we want the zipped version to end up.
site_dir, site_file = os.path.split(site_loc)
- This shows us how to handle a function that returns two values. We can assign each of them to their own variable. If we only had one variable, say site_path_parts, then that variable would be assigned a tuple (an ordered list) of two members. If you were wondering what os.path.split(site_loc) does, it takes a file path name, say C:\Users\Tim\Documents\magnum-opus.docx and splits it into two: C:\Users\Tim\Documents and magnum-opus.docx.
archive = site_file + "-" + today()
- This is where we finally get to use the today function that we defined earlier on! The + operator, when applied to strings, joins them together. If we substitute the variable site_file with "timmcnamara.co.nz" and today() with "2013-08-11", we get "timmcnamara.co.nz" + "-" + "2013-08-11". What we will end up with is the archive variable being assigned something like "timmcnamara.co.nz-2013-08-11". We don’t need to add a file extension, as it is added by the command that creates the .zip later on.
archive_loc = os.path.join(dest, archive)
- We now join together the destination folder provided by the user with the file name that we have just created as archive.
shutil.make_archive(base_name=archive_loc,
- We ask shutil to zip up the directory for us. Here, you can see us using a Python feature called named arguments (the Python documentation calls them keyword arguments). Our variable archive_loc is being assigned to the variable base_name that has been defined in the internal definition of shutil.make_archive. If you want to learn more about the arguments I have specified, you can read the documentation yourself online.
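The steps above can be tried out safely in a throwaway folder. The sketch below is an illustration under a few assumptions: the folder name example.org and the file index.html are made up, the date string is fixed rather than taken from today(), and only root_dir is passed to shutil.make_archive so that the archive holds the folder’s contents directly.

```python
import ntpath
import os
import shutil
import tempfile

# ntpath is the Windows flavour of os.path; it can be used on any system.
site_dir, site_file = ntpath.split(r"C:\Users\Tim\Documents\magnum-opus.docx")
print(site_dir)     # → C:\Users\Tim\Documents
print(site_file)    # → magnum-opus.docx

# With a single variable on the left, we receive the whole tuple instead.
site_path_parts = ntpath.split(r"C:\Users\Tim\Documents\magnum-opus.docx")
print(len(site_path_parts))     # → 2

with tempfile.TemporaryDirectory() as workspace:
    # Create a pretend website mirror containing one small file.
    site_loc = os.path.join(workspace, "example.org")
    os.mkdir(site_loc)
    with open(os.path.join(site_loc, "index.html"), "w") as f:
        f.write("<!DOCTYPE html><title>Hello</title>")

    # Create a separate folder to hold the archive.
    dest = os.path.join(workspace, "backups")
    os.mkdir(dest)

    # Build the archive name the same way archive_site does.
    site_dir, site_file = os.path.split(site_loc)
    archive = site_file + "-" + "2013-08-11"
    archive_loc = os.path.join(dest, archive)

    # Keyword arguments name each value as it is passed in.
    created = shutil.make_archive(base_name=archive_loc, format="zip", root_dir=site_loc)
    created_name = os.path.basename(created)
    archive_exists = os.path.exists(created)

print(created_name)     # → example.org-2013-08-11.zip
print(archive_exists)   # → True
```

Everything created inside the temporary folder is deleted again when the with block ends, so the sketch leaves nothing behind.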

    p = argparse.ArgumentParser(description='Compress a website mirror.')
    p.add_argument('location', help='Where the website mirror is now.')
    # nargs='?' makes this argument optional, so that the default can apply.
    p.add_argument('destination', nargs='?', default=os.path.curdir, help='The directory to save to.')
    args = p.parse_args()

This block of code can be handled as a single unit, but introduces a very important part of Python which I have not yet touched upon: objects. In programming land, an object is data which also has functions. In this code, we create an instance of the ArgumentParser class of objects and assign that instance to the variable p.
Creating a new instance of a class is just like calling a function. Within the Python programming community, the convention is to use CamelCase names to designate a class. So, what is an object? An object carries its own data, often referred to as its state, and functions to manipulate that state. An object’s functions are known as its methods. It takes time to understand what is going on, as the metaphor of a real-world object goes most of the way but will not get you all the way. You will find yourself staring at the screen a little bit as you try to figure it out. Here, we are creating an object which knows how to parse command line arguments and assigning it to p. Calling its parse_args method does the parsing and gives us the results, which we assign to args. A command line argument is a string that is passed into the program from the outside world.
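A tiny home-made class can make the words state and method less abstract. The Counter class below is hypothetical and is not part of our script; it is only a sketch of an object that carries some data and a function to change it.

```python
class Counter:
    """An object that remembers how many times it has been clicked."""

    def __init__(self):
        # This is the object's state: data that it carries around.
        self.clicks = 0

    def click(self):
        # This is a method: a function that manipulates the object's state.
        self.clicks = self.clicks + 1


# Creating an instance looks just like calling a function.
c = Counter()
c.click()
c.click()
print(c.clicks)     # → 2
```

Notice the CamelCase name for the class and the dot used to reach both the method (c.click) and the data (c.clicks).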

    archive_site(args.location, args.destination)

This is where we do the magic! Our args object has two attributes, location and destination, that we access with the dot operator (.). Feeding them into the archive_site function that we’ve defined will create an archive of the site downloaded at the start. Voilà!
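You can watch the parser at work without opening a terminal by handing parse_args a list of strings. This is a sketch rather than part of the script: the values my-site and backups are made up, and destination is declared with nargs='?' (an assumption of this sketch) so that it may be left out and fall back to its default.

```python
import argparse
import os

p = argparse.ArgumentParser(description='Compress a website mirror.')
p.add_argument('location', help='Where the website mirror is now.')
# nargs='?' lets a positional argument be omitted, so the default is used.
p.add_argument('destination', nargs='?', default=os.path.curdir,
               help='The directory to save to.')

# Pretend the user typed only the location.
args = p.parse_args(['my-site'])
print(args.location)        # → my-site
print(args.destination)     # → .

# Pretend the user typed both values.
args_both = p.parse_args(['my-site', 'backups'])
print(args_both.destination)    # → backups
```

The single dot printed for the destination is os.path.curdir, which is shorthand for the directory you are currently in.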
- The notable exception is the <pre> tag, which we have not encountered yet. <pre> is shorthand for pre-formatted. The way the text within it is laid out in source code form is the layout which will appear to the website reader.