10. Modules And Packages

Modules and packages are the last organizational units of code that we discuss. They provide the means by which large programs can be developed and shared.

10.1 Modules

Modules enable the reuse of programs. A module is a file with a .py extension that contains a collection of definitions and statements. The contents of a module can be used by importing the module either into another module or into the interpreter. To illustrate this, our favourite Account class, shown in the following snippet, is saved in a module called account.py.

    class Account:
        num_accounts = 0

        def __init__(self, name, balance):
            self.name = name 
            self.balance = balance 
            Account.num_accounts += 1

        def del_account(self):
            Account.num_accounts -= 1

        def deposit(self, amt):
            self.balance = self.balance + amt 

        def withdraw(self, amt):
            self.balance = self.balance - amt 

        def inquiry(self):
            return self.balance 

To re-use the module definitions, the import statement is used to import the module as shown in the following snippet.

    Python 3.4.2 (v3.4.2:ab2c023a9432, Oct  5 2014, 20:42:22) 
    [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import account
    >>> acct = account.Account("obi", 10)
    >>> acct
    <account.Account object at 0x101b6e358>
    >>> 

All executable statements contained within a module are executed when the module is imported. A module is also an object that has a type - module; as such, all generic operations that apply to objects can be applied to modules. The following snippet shows some perhaps unintuitive ways of manipulating module objects.

    Python 3.4.2 (v3.4.2:ab2c023a9432, Oct  5 2014, 20:42:22) 
    [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import account
    >>> type(account)
    <class 'module'>
    >>> getattr(account, 'Account') # access the Account class using getattr
    <class 'account.Account'>
    >>> account.__dict__
    {'json': <module 'json' from '/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/json/__init__.py'>,
     '__cached__': '/Users/c4obi/writings/scratch/src/__pycache__/account.cpython-34.pyc',
     '__loader__': <_frozen_importlib.SourceFileLoader object at 0x10133d4e0>,
     '__doc__': None,
     '__file__': '/Users/c4obi/writings/scratch/src/account.py',
     'Account': <class 'account.Account'>,
     '__package__': '',
     '__builtins__': { ...}
     ...
     }

Each module possesses its own unique global namespace that is used by all functions and classes defined within the module; when this feature is properly used, it eliminates worries about name clashes with third party modules. The dir() function, called without an argument from within a module, lists the names available in that module's namespace; called with a module object as its argument, it lists the names defined in that module.
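
For instance, passing an imported module object to dir() lists the names defined in that module; the exact output below is illustrative and will vary slightly between Python versions.

    >>> import account
    >>> dir(account)
    ['Account', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__',
     '__name__', '__package__', '__spec__']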

As mentioned, a module can import another module; when this happens, and depending on the form of the import, the imported module's name, some of the names defined within the imported module, or all names defined within the imported module are placed in the namespace of the module doing the importing. For example, from account import Account imports and places the Account name from the account module into the namespace, import account imports and adds the account name referencing the whole module to the namespace, while from account import * imports and adds all names in the account module, except those that start with an underscore, to the current namespace. Using from module import * as a form of import is strongly advised against as it may import names that the developer is not aware of and that conflict with names used in the module doing the importing. Python has the __all__ special variable that can be used within modules. The value of the __all__ variable should be a list containing the names within a module that are imported from that module when the from module import * syntax is used. Defining this variable is entirely optional on the part of the developer. We illustrate the use of the __all__ special variable with the following example.

    __all__ = ['Account']

    class Account:
        num_accounts = 0

        def __init__(self, name, balance):
            self.name = name 
            self.balance = balance 
            Account.num_accounts += 1

        def del_account(self):
            Account.num_accounts -= 1

        def deposit(self, amt):
            self.balance = self.balance + amt 

        def withdraw(self, amt):
            self.balance = self.balance - amt 

        def inquiry(self):
            return self.balance 

    class SharedAccount:
        pass

Importing from this module using the wildcard form only places the names listed in __all__ in the current namespace:

    >>> from account import *
    >>> dir()
    ['Account', '__builtins__', '__doc__', '__loader__', '__name__', '__package__', '__spec__'] # only Account has been imported
    >>> 

The name of an imported module is obtained by referencing the __name__ attribute of the imported module. When a module is executed directly as a script, its __name__ value is set to __main__. Python modules can be executed with python module.py <arguments>. A corollary of the fact that the __name__ of the currently executing module is set to __main__ is that we can have a recipe such as the following.

    if __name__ == "__main__":
        # run some code

This makes the module usable as a standalone script as well as an importable module. A popular use of the above recipe is for running unit tests; we can run the module as a standalone program to test it but then import it into another module without running the test cases.
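
A minimal sketch of this pattern is shown below; the test module name and the test case are hypothetical and assume the account.py module defined earlier is on the import path.

    # account_test.py (hypothetical) - usable as a script and as a module
    import unittest

    from account import Account

    class AccountTest(unittest.TestCase):
        def test_deposit(self):
            acct = Account("obi", 10)
            acct.deposit(5)
            self.assertEqual(acct.inquiry(), 15)

    if __name__ == "__main__":
        # Runs only when executed directly, e.g. `python account_test.py`;
        # importing account_test elsewhere does not trigger the tests.
        unittest.main()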

Reloading Modules

Once a module has been imported into the interpreter, any change to that module is not reflected within the interpreter. However, Python provides the importlib.reload() function that can be used to re-import a module into the current namespace.
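
A short interactive sketch of reloading, assuming account.py has been edited on disk after the initial import:

    >>> import importlib
    >>> import account
    >>> # ... account.py is edited on disk ...
    >>> account = importlib.reload(account)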

10.2 How are Modules found?

Import statements are able to import modules that are in any of the paths given by the sys.path variable. The import system uses a greedy strategy in which the first module found is imported. The content of the sys.path variable is unique to each Python installation. An example of the value of the sys.path variable on a Mac operating system is shown in the following snippet.

    >>> import sys
    >>> sys.path
    ['', '/Library/Frameworks/Python.framework/Versions/3.4/lib/python34.zip',
     '/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4',
     '/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/plat-darwin',
     '/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/lib-dynload',
     '/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages']

The sys.path list can be modified at runtime by adding or removing elements from this list. However, when the interpreter is started conventionally, the sys.path list contains paths that come from three sources namely: sys.prefix, PYTHONPATH and initialization by the site.py module.

  1. sys.prefix: This variable specifies the base location for a given Python installation. From this base location, the Python interpreter can work out the location of the Python standard library modules. The location of the standard library is given by the following paths.
    sys.prefix + '/lib/python3X.zip'
    sys.prefix + '/lib/python3.X'
    sys.prefix + '/lib/python3.X/plat-sysname' 
    sys.exec_prefix + '/lib/python3.X/lib-dynload'

The paths of the standard library can be found by running the Python interpreter with the -S option; this prevents the site.py initialization that adds the third-party package paths to the sys.path list. The location of the standard library can also be overridden by defining the PYTHONHOME environment variable, which replaces the values of sys.prefix and sys.exec_prefix.
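
The base locations themselves can be inspected from the interpreter; the values shown below are illustrative and will differ between installations.

    >>> import sys
    >>> sys.prefix
    '/Library/Frameworks/Python.framework/Versions/3.4'
    >>> sys.exec_prefix
    '/Library/Frameworks/Python.framework/Versions/3.4'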

  2. PYTHONPATH: Users can define the PYTHONPATH environment variable and the value of this variable is added ahead of the installation-dependent default paths in the sys.path list. This variable can be set to the directory where a user keeps user-defined modules.
  3. site.py: This is a path configuration module that is loaded during the initialization of the interpreter. This module adds site-specific paths to the module search path. site.py starts by constructing up to four directories from a prefix and a suffix. For the prefix, it uses sys.prefix and sys.exec_prefix. For the suffix, it uses the empty string and then lib/site-packages on Windows or lib/pythonX.Y/site-packages on Unix and Macintosh. For each of these distinct combinations that refers to an existing directory, the directory is added to sys.path and further inspected for configuration files. The configuration files are files with a .pth extension whose contents are additional items, one per line, to be added to sys.path (see the sketch after this list). Non-existing items are never added to sys.path, and no check is made that an item refers to a directory rather than a file. Each item is added to sys.path only once. Blank lines and lines beginning with # are skipped, while lines starting with import followed by a space or tab are executed. After these path manipulations, an attempt is made to import a module named sitecustomize that can perform arbitrary site-specific customizations; it is typically created by a system administrator in the site-packages directory. If this import fails with an ImportError exception, it is silently ignored. After this, if ENABLE_USER_SITE is true, an attempt is made to import a module named usercustomize that can perform arbitrary user-specific customizations. This file is intended to be created in the user site-packages directory, which is part of sys.path unless disabled with the -s option. Any ImportError is again silently ignored.
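
As a concrete illustration, a hypothetical .pth file dropped into the site-packages directory might look like the following; the file name and directory paths are made up for the example.

    # mylibs.pth (hypothetical) - placed in the site-packages directory.
    # Each plain line naming an existing directory is appended to sys.path.
    /Users/c4obi/projects/shared-libs

    # Blank lines and lines starting with '#' are skipped.
    # Lines starting with 'import ' are executed:
    import sys; sys.path.append('/Users/c4obi/projects/experimental')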

10.3 Packages

Just as modules provide a means for organizing statements and definitions, packages provide a means for organizing modules. A close but imperfect analogy for the relationship of packages to modules is that of folders to files on computer file systems. A package, just like a folder, can be composed of a number of module files. In Python however, packages are just like modules; in fact all packages are modules but not all modules are packages. The difference between a module and a package is the presence of a __path__ special attribute on a package object that does not have a None value. Packages can have sub-packages and so on; when referencing a package and its corresponding sub-packages the dot notation is used, so a complex number sub-package within a mathematics package would be referenced as math.complex.
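
This distinction can be observed directly in the interpreter; for example, the standard library json package carries a __path__ attribute while the plain math module does not.

    >>> import json, math
    >>> hasattr(json, '__path__')   # json is a package
    True
    >>> hasattr(math, '__path__')   # math is a plain module
    False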

There are currently two types of packages: regular packages and namespace packages.

Regular Packages

A regular package is one that consists of a group of modules in a folder with an __init__.py module within the folder. The presence of this __init__.py file within the folder causes the interpreter to treat the folder as a package. An example of a package structure is the following.

    parent/         <----- folder
        __init__.py
        one/        <------ sub-folder
            __init__.py
            a.py
        two/        <------ sub-folder
            __init__.py
            b.py

The parent, one and two folders are all packages because each contains an __init__.py module within its folder. one and two are sub-packages of the parent package. Whenever a package is imported, the __init__.py module of that package is executed. One can think of the __init__.py as the store of attributes for the package - only symbols defined in this module are attributes of the imported package. Assuming the __init__.py module of the above parent package is empty and the package is imported using import parent, the parent package will have no module or sub-package as an attribute. The following code listing shows this.

        >>> import parent
        >>> dir()
        ['__builtins__', '__doc__', '__loader__', '__name__', '__package__', '__spec__', 'parent']
        >>> dir(parent)
        ['__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__']

As the example shows, none of the modules or sub-packages is listed as an attribute of the imported package object. On the other hand, if a symbol, package = "testing packages", is defined in the __init__.py module of the parent package and the parent package is imported, the package object has this symbol as an attribute as shown in the following code listing.

        >>> import parent
        >>> dir()
        ['__builtins__', '__doc__', '__loader__', '__name__', '__package__', '__spec__', 'parent']
        >>> dir(parent)
        ['__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', 'package']
        >>> parent.package
        'testing packages'
        >>> 

When a sub-package is imported, all __init__.py modules in parent packages are imported in addition to the __init__.py module of the sub-package. Sub-packages are referenced during import using the dot notation just like modules in packages are. In the previous package structure, the notation would be parent.one to reference the one sub-package. Packages support the same kind of import semantics as modules; individual modules or packages can be imported as in the following example.

    # import the module a
    import parent.one.a

When the above method is used, the fully qualified name for the module, parent.one.a, must be used to access any symbol in the module. Note that with this form of import, the last component must be either a module or a sub-package; classes, functions or variables defined within modules are not allowed. It is also possible to import just the module or sub-package that is needed as the following example shows.

    # importing just required module
    from parent.one import a

    # importing just required sub-package
    from parent import one

Symbols defined in the a module, or modules in the one package, can then be referenced using dot notation with just a or one as the prefix. The import forms, from package import * or from package.subpackage import *, can be used to import all the modules in a package or sub-package. This form of import should however be used carefully, if at all, as it may import names into the namespace that cause naming conflicts. Packages support the __all__ variable (whose value should by convention be a list) for listing the modules or names that are visible when the package is imported using the from package import * syntax. If __all__ is not defined, the statement from package import * does not import all submodules from the package into the current namespace; rather it only ensures that the package has been imported, possibly running any initialization code in __init__.py, and then imports whatever symbols are defined in the __init__.py module, including any names defined and any submodules imported there.
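
A short sketch, based on the parent package structure from above, of how __all__ in a package's __init__.py controls the wildcard import; the choice of exported names is purely for illustration and the module path in the output is abbreviated.

    # parent/__init__.py
    __all__ = ['one']    # expose only the `one` sub-package to wildcard imports

With this in place, a wildcard import behaves as follows:

    >>> from parent import *
    >>> one
    <module 'parent.one' from 'parent/one/__init__.py'>
    >>> two
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    NameError: name 'two' is not defined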

Namespace Packages

A namespace package is a package in which the component modules and sub-packages may reside in multiple different locations. The various components may reside on different parts of the file system, in zip files, on the network or in any other location searched by the interpreter during the import process; however, when the package is imported, all components exist in a common namespace. To illustrate a namespace package, observe the following directory structures containing modules; both directories, apollo and gemini, could be located anywhere on the file system and not necessarily next to each other.

    apollo/
        space/
            test.py
    gemini/
        space/
            test1.py

In these directories, the name, space, is used as a common namespace and will serve as the package name. Observe the absence of __init__.py modules in either directory. The absence of this module within these directories is a signal to the interpreter that it should create a namespace package when it encounters them during an import. To be able to import this space package, the paths of its components must first be added to the interpreter's module search path, sys.path.

    >>> import sys
    >>> sys.path.extend(['apollo', 'gemini']) 
    >>> import space.test
    >>> import space.test1

Observe that the two different package directories are now logically regarded as a single namespace and either space.test or space.test1 can be imported as if they existed in the same package. The key to a namespace package is the absence of an __init__.py module in the top-level directory that serves as the common namespace. The absence of the __init__.py module causes the interpreter to create a list of all directories in its sys.path variable that contain a matching directory name rather than throw an exception. A special namespace package module is then created and a read-only copy of the list of directories is stored in its __path__ variable. The following code listing gives an example of this.

    >>> space.__path__
    _NamespacePath(['apollo/space', 'gemini/space'])

Namespace packages bring added flexibility to package manipulation because a namespace can be extended by anyone with their own code, thus eliminating the need to modify the package structures of third-party packages. For example, suppose a user had his or her own directory of code like this:

    my-package/
        space/
            custom.py

Once this directory is added to sys.path along with the other package directories, it seamlessly merges with the other space package directories and its contents can be imported along with any existing artefacts.

    >>> import space.custom 
    >>> import space.test 
    >>> import space.test1 

10.4 The Import System

The import statement and the importlib.import_module() function provide the required import functionality in Python. An import statement combines two actions:

  1. A search operation to find the requested module through a call to the __import__() function and
  2. A binding operation to add the module returned from the search to the current namespace.

If the __import__() call does not find the requested module then an ImportError is raised.
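
The search half can be exercised on its own with importlib.import_module(), which performs the same lookup as the import statement and returns the module object; the example below assumes the account.py module from earlier is importable.

    >>> import importlib
    >>> mod = importlib.import_module('account')
    >>> mod.Account
    <class 'account.Account'>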

The Import Search Process

The import mechanism uses the fully qualified name of the module for the search. When the fully qualified name is a sequence of names separated by dots, e.g. foo.bar.baz, the interpreter will attempt to import foo, followed by foo.bar, followed by foo.bar.baz. If any of these modules is not found then an ImportError is raised.

The sys.modules variable is a cache of all previously imported modules and is the first port of call in the module search process. If the requested module is present in the sys.modules cache then it is returned; if its entry exists but has a value of None, an ImportError is raised; otherwise the search continues. The sys.modules cache is writable so user code can manipulate the contents of the cache. An example of the content of the cache is shown in the following snippet.

    >>> import sys
    >>> sys.modules
    {'readline': <module 'readline' from '/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/lib-dynload/readline.so'>,
     'json.scanner': <module 'json.scanner' from '/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/json/scanner.py'>,
     '_sre': <module '_sre' (built-in)>,
     'copyreg': <module 'copyreg' from '/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/copyreg.py'>,
     '_collections_abc': <module '_collections_abc' from '/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/_collections_abc.py'>,
     'cl': <module 'cl' from '/Users/c4obi/writings/scratch/src/cl.py'>,
     'rlcompleter': <module 'rlcompleter' from '/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/rlcompleter.py'>,
     '_sitebuiltins': <module '_sitebuiltins' from '/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/_sitebuiltins.py'>,
     '_imp': <module '_imp' (built-in)>,
     '_json': <module '_json' from '/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/lib-dynload/_json.so'>,
     '_weakrefset': <module '_weakrefset' from '/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/_weakrefset.py'>,
     'json.decoder': <module 'json.decoder' from '/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/json/decoder.py'>,
     '_codecs': <module '_codecs' (built-in)>,
     'codecs': <module 'codecs' from '/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/codecs.py'>, ... }

Finders and Loaders

When a module is not found in the cache, the interpreter makes use of its import protocol to try and find the module. The Python import protocol defines finder and loader objects. Objects that implement both interfaces are called importers.

Finders define strategies for locating modules. Modules may be available locally on the file system in regular files or in zipped files, or in other locations such as a database or even at a remote location. Finders have to be able to deal with such locations if modules are going to be imported from any of them. By default, Python has support for finders that handle the following scenarios.

  1. Built-in modules,
  2. Frozen modules and
  3. Path based modules - this finder handles imports that have to interact with the import path given by the sys.path variable.

These finders are located in the sys.meta_path variable as shown in the following snippet.

    >>> import sys
    >>> sys.meta_path
    [<class '_frozen_importlib.BuiltinImporter'>, <class '_frozen_importlib.FrozenImporter'>, <class '_frozen_importlib.PathFinder'>]
    >>> 

The interpreter continues the search for the module by querying each finder on the meta path to find out which one can handle the module. Each finder object must implement a find_spec method that takes three arguments. The first is the fully qualified name of the module. The second is an import path that is used for the module search; this is None for top-level modules, but for sub-modules or sub-packages it is the value of the parent package's __path__. The third argument is an existing module object that is passed in by the system only when a module is being reloaded.

If one of the finders locates the module, it returns a module spec that is used by the interpreter's import machinery to create and load the module (loading is tantamount to executing the module). The loader carries out the module execution in the module's global namespace. This is done by a call to the importlib.abc.Loader.exec_module() method with the already created module object as argument.
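
Module specs can also be inspected directly using importlib.util.find_spec(), which runs the same finder machinery and returns the spec without loading the module; the file path in the output below is illustrative.

    >>> import importlib.util
    >>> spec = importlib.util.find_spec('json')
    >>> spec.name
    'json'
    >>> spec.origin
    '/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/json/__init__.py'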

Customizing the import process

The import process can be customized via import hooks. There are two types of hooks: meta hooks and import path hooks.

Meta hooks

These are called at the start of the import process, immediately after the sys.modules cache lookup and before any other processing. These hooks can override the search processes of any of the default finders. Meta hooks are registered by adding new finder objects to the sys.meta_path variable.

To understand how a custom meta path hook can be implemented, a very simple case is illustrated. In online Python interpreters, some built-in modules such as os are disabled or restricted to prevent malicious use. A very simple way to achieve this is to implement a meta import hook that raises an exception any time a restricted import is attempted; the following snippet shows such an example.

    class RestrictedImportFinder:
        """A meta path finder that blocks imports of restricted modules."""

        def __init__(self):
            self.restr_module_names = ['os']

        def find_spec(self, fqn, path=None, target=None):
            # called by the import machinery with the fully qualified module
            # name; raise immediately if the name is on the restricted list
            if fqn in self.restr_module_names:
                raise ImportError("%s is a restricted module and cannot be imported" % fqn)
            return None
      

    import sys
    # remove os from sys.module cache
    del sys.modules['os']
    sys.meta_path.insert(0, RestrictedImportFinder())
    import os

    Traceback (most recent call last):
      File "test_concat.py", line 16, in <module>
        import os
      File "test_concat.py", line 9, in find_spec
        raise ImportError("%s is a restricted module and cannot be imported" % fqn)
    ImportError: os is a restricted module and cannot be imported

Import Path hooks

These hooks are called as part of sys.path or package.__path__ processing. Recall from the previous discussion that the path based finder is one of the default meta path finders and that it works with the entries in the sys.path variable. The path based finder delegates the job of finding modules on the sys.path entries to other finders - these are the import path hooks. The sys.path_hooks variable is a collection of built-in path entry finders. By default, the Python interpreter has support for processing files in zip archives and normal files in directories as shown in the following snippet.

    >>> import sys
    >>> sys.path_hooks
    [<class 'zipimport.zipimporter'>, <function FileFinder.path_hook.<locals>.path_hook_for_FileFinder at 0x1003c1b70>]

Each hook knows how to handle a particular kind of path entry. For example, the following snippet attempts to get a finder for one of the entries in sys.path.

    >>> sys.path_hooks
    [<class 'zipimport.zipimporter'>, <function FileFinder.path_hook.<locals>.path_hook_for_FileFinder at 0x1003c1b70>]
    # sys.prefix is a directory
    >>> path = sys.prefix
    # sys.path_hooks[0] is associated with zip files
    >>> finder = sys.path_hooks[0](path)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    zipimport.ZipImportError: not a Zip file
    >>> finder = sys.path_hooks[1](path)
    >>> finder
    FileFinder('/Library/Frameworks/Python.framework/Versions/3.4')
    >>> 

New import path hooks can be added by inserting new callables into the sys.path_hooks list.
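
A minimal sketch of a custom path hook is shown below. It simply prints each directory it is asked about and then defers to the standard FileFinder machinery; the hook name and the printed message are made up for the example, and raising ImportError tells the import machinery to try the next hook in sys.path_hooks.

    import os
    import sys
    from importlib.machinery import FileFinder, SourceFileLoader, SOURCE_SUFFIXES

    def noisy_path_hook(path):
        # Only handle ordinary directories; defer everything else
        # (e.g. zip files) to the remaining hooks in sys.path_hooks.
        if not os.path.isdir(path):
            raise ImportError("noisy_path_hook only handles directories")
        print("creating a finder for", path)
        # Build a FileFinder that recognises plain .py source files.
        return FileFinder(path, (SourceFileLoader, SOURCE_SUFFIXES))

    sys.path_hooks.insert(0, noisy_path_hook)
    sys.path_importer_cache.clear()   # drop finders cached by the old hooks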

Why You Probably Should Not Reload Modules…

Now that we understand that the last step of a module import is the execution of the module code within the module's global namespace, it is clearer why it may be a bad idea to use importlib.reload to reload modules that have changed.

A module reload does not purge the global namespace of objects from the module being reloaded. Imagine a module, Foo, that has a function, print_name, imported into another module, Bar; the function, Foo.print_name, is referenced by a variable, x, in the module, Bar. Now if the implementation of print_name is changed for some reason and Foo is then reloaded in Bar, something interesting happens. Since the reload of the module Foo causes an exec of the module contents without any prior clean-up, the reference that x holds to the previous implementation of Foo.print_name persists; we thus end up with two implementations in play and this is most probably not the behaviour expected.
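
A small interactive sketch of this stale-reference behaviour, assuming a hypothetical module foo.py that defines a print_name function and is edited between the import and the reload:

    >>> import importlib
    >>> import foo                  # hypothetical module defining print_name()
    >>> x = foo.print_name          # x references the current implementation
    >>> # ... foo.py is edited on disk ...
    >>> importlib.reload(foo)       # re-executes foo.py; no prior clean-up
    <module 'foo' from '.../foo.py'>
    >>> foo.print_name is x         # the new and the old implementation co-exist
    False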

For this reason, reloading a module may be worth avoiding in any sufficiently complex Python program.

10.5 Distributing Python Programs

Python provides the distutils module for packaging up Python code for distribution. Assuming the program has been properly written, documented and structured, distributing it is relatively straightforward using distutils. One just has to:

1. write a setup script (setup.py by convention)
2. (optional) write a setup configuration file
3. create a source distribution
4. (optional) create one or more built (binary) distributions

A distutils set-up script is by convention named setup.py. For a program with the following package structure,

    parent/
        __init__.py
        spam.py
        one/
            __init__.py
            a.py
        two/
            __init__.py
            b.py

an example of a simple setup.py file is given in the following snippet.

    from distutils.core import setup

    setup(name='parent',
          version='1.0',
          author="xxxxxx",
          maintainer="xxxx",
          maintainer_email="xxxxx",
          py_modules=['spam'],
          packages=['one', 'two'],
          scripts=[]
          )

The setup.py file must exist at the top level directory so in this case, it should exist at parent/setup.py. The values used in the set-up script are mostly self-explanatory: py_modules contains the names of all single-file Python modules, packages contains a list of all packages, and scripts contains a list of all scripts within the program. The arguments shown here are not exhaustive of the possible parameters.

Once the setup.py file is ready, the following command is used at the command line to create an archive file for distribution.

    python setup.py sdist

sdist will create an archive file (e.g., a tarball on Unix or a ZIP file on Windows) containing the setup script setup.py and the program's modules and packages. The archive file will be named parent-1.0.tar.gz (or .zip) and will unpack into a directory named parent-1.0. To install the created distribution, the archive is unpacked and python setup.py install is run inside the directory. This installs the package in the site-packages directory of the installation.

One can also create one or more built distributions for programs. For instance, if running a Windows machine, one can make installation easy for end users by creating an executable installer with the bdist_wininst command. For example:

    python setup.py bdist_wininst

will create an executable installer, parent-1.0.win32.exe, in the current directory.

Other useful built distribution formats are RPM, implemented by the bdist_rpm command, bdist_pkgtool for Solaris, and bdist_sdux for HP-UX. It is important to note that the use of distutils assumes that the end user of a distributed package already has the Python interpreter installed.