
Create a comprehensive website backup

The first thing that we will do is look at the process of backing up a website. Through the process, we’ll play with some of the data types that were introduced at the start of the book. Which data types will we be looking at? Strings, dates and integers.

This section will be the first time that we will look at making decisions.

An Overview of the Approach

The actual mechanics of what we will be doing are quite simple.

Use an external program to download a website on our behalf.
Use some software that we will write together to store the download as a .zip file whose name includes the date.

Before we get to the mechanics of how this works though, there is quite a lot of background material available for you to look through in case you don’t have a lot of knowledge about how web pages come to be displayed on your computer monitor or how web links actually work.

Tools

I’m going to be using a tool called wget to do the actual downloading. wget is a tool that runs on Linux and OS X. If you are running Windows, you can download web pages using an Internet browser and push Ctrl+S to save the page. You won’t have the advantage of being able to deal with the whole of a website, but that’s not necessary while you’re starting out.

For processing data, we’ll be using a programming language called Python.

Python is a handy programming language to learn if you don’t already know one. It tends to have fewer arcane syntax rules, and the language community prefers to settle on agreed conventions for things. Other languages pride themselves on saying that there is more than one way to do it. In the Python world, the preference is for one obvious way. Fewer things to learn should help you be productive relatively quickly.

The reason that we will be using Python, rather than JavaScript introduced earlier, is that we need to deal with files on your machine. Working with JavaScript in the browser means interacting with web pages delivered via the Internet.

Defining a web page

In their simplest form, web pages are made up of text. However, these days the only web pages that only contain text tend to be written by academics. Most pages reference other resources, like images, JavaScript code, something you may not have heard of before called style sheets and perhaps even font files. The job of the web browser is to gather up all of these resources and render a single page for the reader.

The text of a web page is defined within a language called HTML or Hypertext Markup Language. So, what does that actually mean?

Hypertext refers to the ability for one piece of text to link to other pieces of text, anywhere in the world.
Markup refers to the way that internal structure is defined. In a markup language, sections such as paragraphs are marked up with opening and closing tags. In HTML, the tag for a paragraph is <p>, and its associated closing tag is </p>. To be very explicit, a small paragraph might look like this in source form, <p>Some text</p>, but will be rendered as Some text.
Language just means that there is a defined grammar for markup and adding hyperlinks. You can’t just add your own tags or syntax for defining tags at a whim.

How web pages link to each other

Within HTML, the anchor tag (<a>) provides for cross-referencing. We haven’t touched upon it yet, but tags have hidden attributes as well as text. One of the <a> tag’s attributes is href, or hyperlink reference.

Let’s see if this makes any sense. Can you make out what is happening here?

<p>
    Some text.
    <a href="http://www.example.org/">
      Click here
    </a> 
    for more info.
</p>

If you are somewhat baffled, we are nesting our <a> tag within the base <p> tag. We are marking up the “Click here” text as a source anchor with a hyperlink to example.org.

Tags nested inside other tags are referred to as children of the broader tags, which are referred to as parents. This nested structure is known to computer scientists as a tree. The element at the base is known as the root and each descendant element is a branch, extending to elements with no children, the leaves. Remember what I said at the start: computer scientists like to use metaphor!
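If you would like to see this tree for yourself, here is a minimal sketch that uses the html.parser module that ships with Python. The class name TreePrinter and the variable names are my own, chosen only for illustration. It notes down each opening tag, indented by how deeply it is nested, and it assumes that every tag in the snippet is properly closed.

```python
from html.parser import HTMLParser

class TreePrinter(HTMLParser):
    """Collects each opening tag, indented to show how deeply it is nested."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, one pair per attribute.
        attr_text = "".join(' %s="%s"' % (name, value) for name, value in attrs)
        self.lines.append("  " * self.depth + tag + attr_text)
        self.depth += 1

    def handle_endtag(self, tag):
        # Assumes every tag is closed, as in our snippet.
        self.depth -= 1

snippet = """
<p>
    Some text.
    <a href="http://www.example.org/">
      Click here
    </a>
    for more info.
</p>
"""

printer = TreePrinter()
printer.feed(snippet)
print("\n".join(printer.lines))
```

The parent, p, is printed first, and its child, a, is printed one level in along with its href attribute.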

What a full web page looks like

Now that we have seen what a snippet of HTML looks like, it will be worthwhile to expose you to an HTML document in its full glory. Note that formatting is arbitrary and does not affect how things are rendered¹.

<!DOCTYPE html>
<html lang="en">
<head>
  <title>An example webpage</title>
</head>
<body>
<p>
    Some text.
    <a href="http://www.example.org/">
      Click here
    </a> 
    for more info.
</p>
</body>
</html>

If you copy this code into a text editor and save it as some.html, you will be able to open the file in your web browser just like a page from the Internet.

For people who have not encountered HTML before, let me explain what is going on here, section by section.

<!DOCTYPE html>

When files are sent over the Internet, it is sometimes difficult for the receiver to know exactly how to interpret the content. You may have noticed that there are many URLs that don’t contain file extensions. Web pages tell their receivers what they are on their first line.

For people who are very interested in the details, <!DOCTYPE html> is a document type declaration. In older versions of HTML, the declaration pointed to a formal description of which tags are legal syntax. In effect, you are now telling browsers to apply the rules of modern HTML, which were written to cope with somewhat messy, real-world, hand-written pages, to whatever follows.

<html lang="en">
  …
</html>

This is the root tag of HTML. It enables us to hang the rest of our source off it. Browsers will cope if you leave it out, but including it is good form. As you become a proficient programmer, you will find yourself wanting your source code to be dapper.

<html lang="en">
<head>
  <title>An example webpage</title>
</head>
…
</html>

The <head> element is somewhat interesting. Its contents are largely invisible to viewers of the web page. The only thing in here so far is the <title>, which defines what appears as the text at the top of a browser tab. Real-life web pages will put lots of extra info in here, like references to external CSS and JavaScript files, as well as hints for social media sites and search engines about how to describe the content when it is shared or appears in search results.

<html lang="en">
…
<body>
<p>
    Some text. 
    <a href="http://www.example.org/">
      Click here
    </a> 
    for more info.
</p>
</body>
</html>

The <body> tag is where the content which is visible to the reader appears.

External resources

Understanding the distinction between a web page and its component resources is important when trying to use automated tools. Unless told otherwise, most tools will ignore content that they can’t read, like images, and content they don’t care about, like typography instructions. Knowing that a web page only has one URL, but is actually made up of content from several, can greatly reduce confusion.

Here are the resources that make up most web pages:

JavaScript: A programming language that runs in all web browsers. As we’ve discovered, it is a way to add interaction to websites. It’s also the tool used to build web apps. If you have heard the term “Front-end Developer”, read it as “JavaScript Programmer”.
CSS: Cascading Style Sheets are declarations for typography and layout that web browsers understand.
Images: and possibly other multimedia such as videos and sound.
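To make the distinction between a page and its resources concrete, here is a rough sketch that lists the external resources a page refers to. The page content and the file names (style.css, app.js and logo.png) are invented for this example, as is the class name ResourceFinder. It only looks at three kinds of tag, so treat it as a starting point rather than a complete tool.

```python
from html.parser import HTMLParser

class ResourceFinder(HTMLParser):
    """Collects the external resources that a page refers to."""

    def __init__(self):
        super().__init__()
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Scripts and images point to their files with the src attribute.
        if tag in ("script", "img") and "src" in attrs:
            self.resources.append((tag, attrs["src"]))
        # Style sheets are pulled in by <link> tags using href.
        elif tag == "link" and "href" in attrs:
            self.resources.append((tag, attrs["href"]))

# A made-up page, for illustration only.
page = """
<html>
<head>
  <link rel="stylesheet" href="style.css">
  <script src="app.js"></script>
</head>
<body>
  <img src="logo.png" alt="Logo">
</body>
</html>
"""

finder = ResourceFinder()
finder.feed(page)
for tag, url in finder.resources:
    print(tag, url)
```

One page, with one URL, turns out to depend on three further files.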

A note about embedded videos

If you see a video embedded into a web page, it is actually a separate web page sitting inside the one you are reading. If you look at the source, it is very likely that there are <iframe> tags. Although it may only look like a video, your browser has actually pulled down a whole other website which is being served by the video host.

Respecting robots.txt

There is a page on most websites that you are unlikely to have seen before. It sits at the base of a website at /robots.txt. This file declares rules to automated web agents, such as search engines and web harvesters. The tool that we’re using, wget, respects these rules unless told to explicitly ignore them.
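If you are curious about how a tool decides what it is allowed to fetch, Python’s standard library includes a robots.txt reader. The rules below are made up for this example, and the /private/ path is hypothetical. This is only a sketch of how such a check can work, not a description of what wget does internally.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt, for illustration only.
rules = """
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Ask whether an agent calling itself "wget" may fetch each address.
allowed_page = parser.can_fetch("wget", "http://www.example.org/index.html")
blocked_page = parser.can_fetch("wget", "http://www.example.org/private/diary.html")

print(allowed_page)
print(blocked_page)
```

The first answer is True and the second is False, because the second address sits under the disallowed /private/ path.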

How we select specific elements

Note to self

Explain CSS selectors

Downloading web sites

This first command gets us most of the way there:

$ wget --mirror www.example.org

If you do not have Linux or OS X running, you will not be able to run wget.

Note to self

Add install instructions.

Note to self

Create an example archive so Windows users can carry on with the examples.

We are asking wget to replicate, or mirror, the web site www.example.org. To be honest with you though, www.example.org is a fairly boring website to mirror because it only has one page. Feel free to try mirroring another website that you have a copyright licence to.

This one is more complete. The \ character tells the computer to ignore the newline, which means that we can have each option on its own line.

$ wget --mirror \
       --local-encoding="utf-8" \
       --adjust-extension \
       --convert-links \
       --page-requisites \
       www.example.org

This invocation is far more thorough and will allow you to view the downloaded site.
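Later on you may want to start the download from Python rather than typing the command by hand. Here is one possible sketch using the subprocess module. The function names mirror_command and mirror_site are my own inventions, and mirror_site will only work on a machine that has wget installed, so the example below builds the command without running it.

```python
import subprocess

def mirror_command(site):
    """Build the wget invocation as a list, one item per argument."""
    return [
        "wget",
        "--mirror",
        "--local-encoding=utf-8",
        "--adjust-extension",
        "--convert-links",
        "--page-requisites",
        site,
    ]

def mirror_site(site):
    """Run wget. check=True raises an error if wget reports a failure."""
    subprocess.run(mirror_command(site), check=True)

# Show the command that would be run, without running it.
print(" ".join(mirror_command("www.example.org")))
```

Passing the command as a list means that we do not need the \ characters. Each option is already its own item.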

Compress the downloads

Now that we have a complete website, we should store it in a compressed format for backup purposes. Rather than needing to run an application by hand to do that for us each time, let’s write a computer program to do that for us.

As well as explaining the code that follows, part of what I will be showing you is instructions on how to interpret code examples like this so that you can follow along with web tutorials.

I will be introducing some programming concepts. These include variables, flow control, interacting with the outside world, as well as defining and using functions.

#!/usr/bin/env python

import argparse
import datetime
import shutil
import os

def today():
    t = datetime.date.today()
    t_as_string = str(t)
    return t_as_string

def archive_site(site_loc, dest):
    site_dir, site_file = os.path.split(site_loc)
    archive = site_file + "-" + today()
    archive_loc = os.path.join(dest, archive)
    shutil.make_archive(base_name=archive_loc,
      format="zip",
      root_dir=site_loc,
      verbose=True)

p = argparse.ArgumentParser(
      description='Compress a website mirror.')
p.add_argument('location',
       help='Where the website mirror is now.')
p.add_argument('destination',
              nargs='?',
              default=os.path.curdir,
              help='The directory to save to.')
args = p.parse_args()

archive_site(args.location, args.destination)

Wow, this is a lot of code to take in. I am really glad you’ve managed to make it all the way through it. Let’s go through each section step by step.

#!/usr/bin/env python

Do you remember how the declaration at the top of our HTML page (<!DOCTYPE html>) helped your web browser know how to treat what follows? The same thing can happen inside your computer too. If you have only used computers running an MS Windows operating system, this may look very odd. Computers from the UNIX operating system heritage, such as Linux and OS X, do not rely on file extensions to decide how to run a script. Instead, they look for a special comment in a file’s first line to figure out what to do with it. In this case, we are telling the computer to run the /usr/bin/env program and give it the argument python. It is a long way of saying that we want this piece of text to be run by the Python language interpreter, i.e. something that knows what to do with Python source code.

import argparse
import datetime
import shutil
import os

This chunk of code is somewhat opaque. I wish it didn’t need to be at the top, because in many ways it is the hardest bit of code to explain. The simplest way to think of it is that we would like to pull in someone else’s pre-written code into this file. We call those chunks of external code modules.

Here is what each of those modules do:

argparse: Parses command line arguments. This will allow the script that we are writing to take in context provided by you when you run the code. What we will be doing is asking for the location of the website mirror that we have previously downloaded.
datetime: Contains many functions allowing programmers to deal with times, dates and time zones. In our code, we will actually be accessing date, a class within the module, which doesn’t know anything about times. It does, however, know today’s date.
shutil: The term shutil is shorthand for shell utilities. shutil provides several tools for interacting with files. In our case, we will be relying on shutil to create a zip archive for us. The term shell can be a difficult analogy to grasp. It can be thought of as the way that people interact with the operating system. It is in some sense the surface which is visible to you.
os: The os module provides programmatic access to operating system details like files and, most relevantly to us, directories. If Python is running on MS Windows, the os module will use Windows-specific commands. If it is running on VxWorks, the operating system that powers most Mars rovers, it will run commands specific to VxWorks. In our code, we’ll be using it to split apart and join together file paths in an operating system agnostic way.
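To show what operating system agnostic means in practice, here is a small sketch. Behind the scenes, os.path is one of two modules: ntpath on Windows, or posixpath on Linux and OS X. Normally you would just use os.path and let Python choose. I am naming them directly here only so that you can see both behaviours on whichever computer you have. The file paths are made up.

```python
import ntpath      # the Windows flavour of os.path
import posixpath   # the Linux and OS X flavour of os.path

# Splitting a path gives us the folder and the final name.
windows_path = r"C:\Users\Tim\Documents\magnum-opus.docx"
print(ntpath.split(windows_path))

unix_path = "/home/tim/documents/magnum-opus.docx"
print(posixpath.split(unix_path))

# Joining uses whichever separator the operating system expects.
print(ntpath.join("backups", "www.example.org-2013-08-11"))
print(posixpath.join("backups", "www.example.org-2013-08-11"))
```

The Windows version splits and joins with backslashes, and the other version uses forward slashes. Our own code never has to know which one is in play.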

Let’s carry on to the first piece of our own code, defining the today function.

def today():
    t = datetime.date.today()
    t_as_string = str(t)
    return t_as_string

Even though this section is quite short, at four lines, I’m going to spend quite a bit of time working through them.

Here is an overhead view of what is happening to start us off. We are telling Python to define a function called today, which accepts no arguments. Once we’re in the function, we find out what day it is, with the help of the datetime.date.today function and the help of a temporary variable called t. We then convert t to a string and return that string to whoever called the function with the use of another temporary variable.

def today():: Here we use the def keyword to define a function called today. Function names can contain letters, digits and the underscore, but cannot start with a digit. The brackets, or parentheses if you prefer, are placeholders for any arguments that the function might accept. In our case, we don’t have any arguments that we want to accept. If we wanted to specify the time zone, we could do so here. This statement ends with a colon. You will gain a feel for when to use a colon over time. Think of it as telling Python that you wish to logically demarcate a block of code. Everything indented beneath the line with the colon is associated with the def keyword. Once the source code returns to the same indentation as def, it is no longer part of the definition.
t = datetime.date.today(): We do three important things in this line. The first is the creation of our first variable! The second is a call to datetime.date.today(). The third is assigning the result of that call to our variable, t. You may be confused by the use of the equals sign here. Many programming languages will do this, unfortunately. Although equals implies a symmetrical relationship, there is actually a transfer of information from right to left. Whatever is returned from datetime.date.today() will be assigned to t after the call has been completed. You will often read tutorials and listen to talks where people even say “t equals datetime dot date dot today”. However, I prefer to say “t is assigned to”.
t_as_string = str(t): Do you remember me describing how the computer stores times as integers earlier on in the book? Well, here we force Python to represent whatever is in the t variable, that is an object representing today’s date, as a string.
return t_as_string: This asks Python to return the string representation of today’s date to the caller. Caller is just jargon for whichever function will end up calling today in the future.
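If you would like to see the difference between a date object and its string form, here is a small experiment that you can run. I have used a fixed date, rather than today’s, so that the result is the same whenever you try it. The variable d is just a name that I have picked for the example.

```python
import datetime

# A fixed date, so that the result is the same whenever you run this.
d = datetime.date(2013, 8, 11)

print(repr(d))   # how Python describes the date object itself
print(str(d))    # the string form, as returned by our today function

# Assignment moves information from right to left.
t = datetime.date.today()
t_as_string = str(t)

# The two variables hold different types of data.
print(type(t).__name__, type(t_as_string).__name__)
```

The string form is year, month and day, separated by hyphens. That makes it convenient for file names, because sorting the names alphabetically also sorts them by date.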

In Python, we can overwrite previously assigned variables. That means that we could have written our code like this:

def today():
    t = datetime.date.today()
    t = str(t)
    return t

In fact, there is nothing stopping us from simply doing string conversion on the same line as the return statement:

def today():
    t = datetime.date.today()
    return str(t)

To go even further, we could have done away with all of our temporary variables. We could have written everything on a single line if we had wanted to:

def today():
    return str(datetime.date.today())

Every programmer will have different stylistic preferences. It is a value statement to decide between conciseness and readability. As your programming experience matures, you will find yourself becoming very picky about which style you prefer. This is perfectly natural, but be wary of building artificial walls and disregarding technical competence due to style choices.

def archive_site(site_loc, dest):
    site_dir, site_file = os.path.split(site_loc)
    archive = site_file + "-" + today()
    archive_loc = os.path.join(dest, archive)
    shutil.make_archive(base_name=archive_loc,
      format="zip",
      root_dir=site_loc,
      verbose=True)

This piece of code is complex, but will introduce some very valuable tools for you as you progress.

def archive_site(site_loc, dest):: We’re using the def keyword to define another function, called archive_site. archive_site takes two arguments, one called site_loc and the other dest. The site_loc variable refers to the site’s current location that we have downloaded, dest is where we want the zipped version to end up.
site_dir, site_file = os.path.split(site_loc): This shows us how to handle a function that returns two values. We can assign each of them to their own variable. If we only had one variable, say site_path_parts, then that variable would be assigned a tuple (an ordered list) of two members. If you were wondering what os.path.split(site_loc) does, it takes a file path name, say C:\Users\Tim\Documents\magnum-opus.docx and splits it into two: C:\Users\Tim\Documents and magnum-opus.docx.
archive = site_file + "-" + today(): This is where we finally get to use the today function that we defined earlier on! The + operator, when applied to strings, joins them together. If we substitute the variable site_file with "timmcnamara.co.nz" and today() with "2013-08-11", we get "timmcnamara.co.nz" + "-" + "2013-08-11". What we will end up with is the archive variable being assigned something like "timmcnamara.co.nz-2013-08-11". We don’t need to add a file extension, as it is added by the command that creates the .zip later on.
archive_loc = os.path.join(dest, archive): We now join together the destination folder provided by the user with the file name that we have just created as archive.
shutil.make_archive(base_name=archive_loc, ⋯: We ask shutil to zip up the directory for us. Here, you can see us using a Python feature called named arguments. Our variable archive_loc is being assigned to the variable base_name that has been defined in the internal definition of shutil.make_archive. If you want to learn more about the arguments I have specified, you can read the documentation yourself online.
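If you don’t have a downloaded site to hand, you can still watch these steps work. The sketch below builds a pretend one-page mirror inside a throwaway folder, archives it, and then peeks inside the .zip file. The folder contents are invented, the date is fixed so that the result is predictable, and this version passes only root_dir to make_archive, so the site’s files sit at the top level of the archive.

```python
import os
import shutil
import tempfile
import zipfile

# Work in a throwaway directory so that nothing real is touched.
workspace = tempfile.mkdtemp()

# A pretend one-page mirror standing in for what wget would download.
site_loc = os.path.join(workspace, "www.example.org")
os.mkdir(site_loc)
with open(os.path.join(site_loc, "index.html"), "w") as f:
    f.write("<p>Some text.</p>")

# Two return values unpacked into two variables.
site_dir, site_file = os.path.split(site_loc)

# A fixed date instead of today(), to keep the example predictable.
archive = site_file + "-" + "2013-08-11"
archive_loc = os.path.join(workspace, archive)

# make_archive adds the ".zip" extension and returns the finished file's path.
zip_path = shutil.make_archive(base_name=archive_loc, format="zip",
                               root_dir=site_loc)
zip_name = os.path.basename(zip_path)

# Open the archive and list what ended up inside it.
with zipfile.ZipFile(zip_path) as z:
    names = z.namelist()

print(zip_name)
print(names)

# Tidy up the throwaway directory.
shutil.rmtree(workspace)
```

You should see a file name ending in .zip, followed by a list containing index.html.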

p = argparse.ArgumentParser(
      description='Compress a website mirror.')
p.add_argument('location',
       help='Where the website mirror is now.')
p.add_argument('destination',
              nargs='?',
              default=os.path.curdir,
              help='The directory to save to.')
args = p.parse_args()

This block of code can be handled as a single unit, but introduces a very important part of Python which I have not yet touched upon: objects. In programming land, an object is data which also has functions. In this code, we create an instance of the ArgumentParser class of objects and assign that instance to the variable p.

Creating a new instance of a class is just like calling a function. Within the Python programming community, the convention is to use CamelCase names to designate a class. So, what is an object? An object carries its own data, often referred to as its state, and functions to manipulate that state. An object’s functions are known as its methods. It takes time to understand what is going on, as the metaphor of a real-world object goes most of the way but will not get you all the way. You will find yourself staring at the screen a little bit as you try to figure it out. Here, we are creating an object which knows how to parse command line arguments and assigning it to p. Calling its parse_args method gives us a second object, which holds the arguments that were found, and we assign that to args. A command line argument is a string that is passed into the program from the outside world.
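A handy trick for experimenting is that parse_args will accept a list of strings in place of what was typed on the command line. The sketch below uses that to show the parser at work. The argument values are made up, and the variable names both and only_one are my own. The sketch passes nargs='?' for destination, which is what allows that argument to be left out so that the default applies.

```python
import argparse
import os

p = argparse.ArgumentParser(description='Compress a website mirror.')
p.add_argument('location', help='Where the website mirror is now.')
# nargs='?' makes this argument optional, so that the default can apply.
p.add_argument('destination', nargs='?', default=os.path.curdir,
               help='The directory to save to.')

# Normally parse_args() reads what was typed on the command line.
# Passing a list lets us pretend instead.
both = p.parse_args(['www.example.org', 'backups'])
print(both.location, both.destination)

# With the destination left out, the default is used.
only_one = p.parse_args(['www.example.org'])
print(only_one.location, only_one.destination)
```

In the second case, destination falls back to os.path.curdir, which is Python’s name for the current directory.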

archive_site(args.location, args.destination)

This is where we do the magic! Our args object has two attributes, location and destination, that we access with the dot operator (.). Feeding them into the archive_site function that we’ve defined will create an archive of the site downloaded at the start. Voilà!

I would love to have your feedback on the book so far.

You can contact me via newroute@timmcnamara.co.nz or post your feedback publicly at https://leanpub.com/a-new-route-to-programming/feedback.

Up next

Conduct a Miniature Web Harvest