5. Code Objects

Code objects are essential building blocks of the Python virtual machine. Code objects encapsulate the Python virtual machine’s bytecode; we may call the bytecode the assembly language of the Python virtual machine.

Code objects, as the name suggests, represent compiled executable Python code. We came across code objects earlier when we discussed Python source compilation. The compilation process maps each code block to a code object. As described in the brilliant Python documentation:

A Python program is constructed from code blocks. A block is a piece of Python program text that is executed as a unit. The following are blocks: a module, a function body, and a class definition. Each command typed interactively is a block. A script file (a file given as standard input to the interpreter or specified as a command-line argument to the interpreter) is a code block. A script command (a command specified on the interpreter command line with the ‘-c’ option) is a code block. The string argument passed to the built-in functions eval() and exec() is a code block.

The code object contains runnable bytecode instructions that alter the state of the Python VM when run. Given a function, we can access its code object using the __code__ attribute as in the following snippet.

Listing 5.1: Function code objects
    def return_author_name(): 
        return "obi Ike-Nwosu"

    >>> return_author_name.__code__
    <code object return_author_name at 0x102279270, file "<stdin>", line 1>

For other code blocks, one can obtain the code object by compiling the code. The built-in compile function provides a facility for this. Code objects possess several fields that the interpreter loop uses during execution, and we look at some of these in the following sections.
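
As a minimal sketch of this, the built-in compile function can turn a string of source into a code object, either for a module-style block or for a single expression. The source strings and the "<example>" filename below are placeholders chosen for illustration only.

```python
import types

# Compile a module-style block ("exec" mode) and a single expression ("eval" mode).
module_code = compile("x = 6 * 7", "<example>", "exec")
expr_code = compile("1 + 2", "<example>", "eval")

print(type(module_code) is types.CodeType)  # both results are code objects
print(module_code.co_name)                  # name given to a module-level block

# Code objects are run with exec() or evaluated with eval().
namespace = {}
exec(module_code, namespace)
print(namespace["x"])
print(eval(expr_code))
```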

5.1 Exploring code objects

An excellent way to start with code objects is to compile a simple function and inspect the resulting code object. We use the simple fizzbuzz function shown in Listing 5.2 as a guinea pig.

Listing 5.2: Function code objects attributes of Fizzbuzz function
    co_argcount = 1
    co_cellvars = ()
    co_code = b'|\x00d\x01\x16\x00d\x02k\x02r\x1e|\x00d\x03\x16\x00d\x02k\x02r\x1ed\x04S\x00n,|\x00d\x01\x16\x00d\x02k\x02r0d\x05S\x00n\x1a|\x00d\x03\x16\x00d\x02k\x02rBd\x06S\x00n\x08t\x00|\x00\x83\x01S\x00d\x00S\x00'
    co_consts = (None, 3, 0, 5, 'FizzBuzz', 'Fizz', 'Buzz')
    co_filename = /Users/c4obi/projects/python_source/cpython/fizzbuzz.py
    co_firstlineno = 6
    co_flags = 67
    co_freevars = ()
    co_kwonlyargcount = 0
    co_lnotab = b'\x00\x01\x18\x01\x06\x01\x0c\x01\x06\x01\x0c\x01\x06\x02'
    co_name = fizzbuzz
    co_names = ('str',)
    co_nlocals = 1
    co_stacksize = 2
    co_varnames = ('n',)

The fields shown are almost self-explanatory except for the co_lnotab and co_code fields that seem to contain gibberish.
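
A listing like the one above can be produced with a few lines of introspection. The sketch below is hedged in two ways: the body of fizzbuzz is a reconstruction inferred from the constants and names in Listing 5.2 rather than the original source file, and describe_code is a hypothetical helper. Values such as co_flags, co_stacksize and co_firstlineno depend on the CPython version and on where the function is defined.

```python
def fizzbuzz(n):
    # Reconstructed from the constants in Listing 5.2 (an assumption).
    if n % 3 == 0 and n % 5 == 0:
        return 'FizzBuzz'
    elif n % 3 == 0:
        return 'Fizz'
    elif n % 5 == 0:
        return 'Buzz'
    else:
        return str(n)

# Attribute names taken from Listing 5.2; co_code and co_lnotab are left out
# because their raw bytes are discussed separately.
FIELDS = (
    "co_argcount", "co_cellvars", "co_consts", "co_filename",
    "co_firstlineno", "co_flags", "co_freevars", "co_kwonlyargcount",
    "co_name", "co_names", "co_nlocals", "co_stacksize", "co_varnames",
)

def describe_code(code):
    """Return the selected co_* attributes of a code object as a dict."""
    return {name: getattr(code, name) for name in FIELDS}

info = describe_code(fizzbuzz.__code__)
for name, value in info.items():
    print(f"{name} = {value!r}")
```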

  1. co_argcount: This is the number of arguments to a code block and is non-zero only for function code blocks. The value is set to the count of the argument set of the code block’s AST during the compilation process. The evaluation loop makes use of this value during the set-up for code evaluation to carry out sanity checks, such as checking that all arguments are present, and to store locals.
  2. co_code: This holds the sequence of bytecode instructions executed by the evaluation loop. Each instruction is composed of an opcode and an oparg, the argument to the opcode where one exists. For example, co.co_code[0] returns the first byte of the first instruction, 124, which maps to the LOAD_FAST opcode.
  3. co_consts: This field is a tuple of constants, such as string literals and numeric values, contained within the code object. The example from above shows the content of this field for the fizzbuzz function. The values in this tuple are integral to code execution as they are the values referenced by the LOAD_CONST opcode. The argument to a bytecode instruction such as LOAD_CONST is the index into this tuple of constants. Consider the co_consts value of (None, 3, 0, 5, 'FizzBuzz', 'Fizz', 'Buzz') for the fizzbuzz function and compare it with the disassembled code object below.
           Listing 5.3: Cross section of bytecode instructions for Fizzbuzz function
               0 LOAD_BUILD_CLASS
               2 LOAD_CONST               0 (<code object Test at 0x101a02810, file "fizzbuzz.py", line 1>)
               4 LOAD_CONST               1 ('Test')
               6 MAKE_FUNCTION            0
               8 LOAD_CONST               1 ('Test')
              10 CALL_FUNCTION            2
              12 STORE_NAME               0 (Test)
    
              ...
    
              66 LOAD_GLOBAL              0 (str)
              68 LOAD_FAST                0 (n)
              70 CALL_FUNCTION            1
              72 RETURN_VALUE
              74 LOAD_CONST               0 (None)
              76 RETURN_VALUE
    

    Recall that during the compilation process, a return None is added if there is no return statement at the end of a function so we can tell that the bytecode instruction at offset 74 is a LOAD_CONST for a None value. The opcode argument is a 0, and we can see that the None value has an index of 0 in the constants list from where the LOAD_CONST instruction loads it.

  4. co_filename: This field, as the name suggests, contains the name of the file that holds the source code from which the code object was compiled.
  5. co_firstlineno: This gives the line number on which the source for the code object begins. This value plays quite an essential role during activities such as debugging code.
  6. co_flags: This field is a set of bit flags that indicates the kind of code object. For example, if the code object is that of a coroutine, the 0x0080 flag is set. Other flags include CO_NESTED, which indicates that a code object is nested within another code block, and CO_VARARGS, which indicates that a code block takes variable arguments. These flags affect the behaviour of the evaluation loop during bytecode execution.
  7. co_lnotab: This contains a string of bytes used to compute the source line number that corresponds to the instruction at a given bytecode offset. For example, the dis function makes use of this when calculating line numbers for instructions.
  8. co_varnames: This is the collection of locally defined names in a code block. Contrast this with co_names.
  9. co_names: This is the collection of non-local names used within the code object. For example, the snippet in listing 5.4 references a non-local variable, p.
           Listing 5.4: Illustrating local and non-local names
         def test_non_local():
             x = p + 1
             return x
    

    Listing 5.5 is the result of introspecting the code object for the function in Listing 5.4.

           Listing 5.5: Code object attributes for the test_non_local function
         co_argcount = 0
         co_cellvars = ()
         co_code = b't\x00d\x01\x17\x00}\x00|\x00S\x00'
         co_consts = (None, 1)
         co_filename = /Users/c4obi/projects/python_source/cpython/fizzbuzz.py
         co_firstlineno = 18
         co_flags = 67
         co_freevars = ()
         co_kwonlyargcount = 0
         co_lnotab = b'\x00\x01\x08\x01'
         co_name = test_non_local
         co_names = ('p',)
         co_nlocals = 1
         co_stacksize = 2
         co_varnames = ('x',)
    

    From this example, the difference between co_names and co_varnames is noticeable. co_varnames holds the locally defined names while co_names holds the non-locally defined names. Do note that an error for the missing name p is raised only when the program is executed, not when it is compiled. Listing 5.6 shows the bytecode instructions for the function in Listing 5.4, and it is an easy set to understand.

           Listing 5.6: Bytecode instructions for test_non_local function
         0 LOAD_GLOBAL              0 (p)
         2 LOAD_CONST               1 (1)
         4 BINARY_ADD
         6 STORE_FAST               0 (x)
         8 LOAD_FAST                0 (x)
        10 RETURN_VALUE
    

    Note how, rather than a LOAD_FAST as seen in the previous example, we have a LOAD_GLOBAL instruction for the name p. Later, when we discuss the evaluation loop, we will look at an optimisation that makes the LOAD_FAST instruction fast, as its name suggests.

  10. co_nlocals: This is a numeric value that represents the number of local names used by the code object. In the immediate past example from Listing 5.4, the only local variable used is x and thus this value is 1 for the code object of that function.
  11. co_stacksize: The Python virtual machine is stack-based, i.e. values used in evaluation and results of the evaluation are read from and written to an execution stack. This co_stacksize value is the maximum number of items that exist on the evaluation stack at any point during the execution of the code block.
  12. co_freevars: The co_freevars field is a collection of the names of free variables used within the code block. This field is mostly relevant to nested functions that form closures. Free variables are variables that are used within a block but not defined within that block; this does not include global variables. The concept of a free variable is best illustrated with an example, as shown in listing 5.7.
           Listing 5.7: A simple nested function
         def f(*args):
             x = 1
             def g():
                 n = x
    

    The co_freevars field is empty for the code object of the f function while that of the g function contains the name x. Free variables are strongly interrelated with cell variables.

  13. co_cellvars: The co_cellvars field is a collection of the names of variables that require cell storage objects during the execution of a code object. Take the snippet in Listing 5.7: the co_cellvars field of the code object for the function f contains just the name x, while that of the nested function’s code object is empty; recall from the discussion on free variables that the co_freevars collection of the nested function’s code object consists of just this name, x. This captures the relationship between cell variables and free variables - a free variable in a nested scope is a cell variable within the enclosing scope. Special cell objects are required to store the values of the variables in this collection during the execution of the code object. This is so because each of these values is used by nested code objects whose lifetime may exceed that of the enclosing code object's execution. Hence, such values require storage in locations that do not get deallocated after the execution of the code object.
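
The relationship between free variables and cell variables can be checked directly from Python. The sketch below is a variant of the nested function in the listing above; unlike that listing, f returns g so that the inner function stays reachable after f has finished, which is an addition made here for illustration.

```python
def f():
    x = 1
    def g():
        return x
    return g

g = f()

print(f.__code__.co_cellvars)  # names that f must keep in cell objects
print(g.__code__.co_freevars)  # names that g takes from an enclosing scope
print(f.__code__.co_freevars)  # f itself has no free variables

# The cell outlives the call to f, so g can still read the value of x.
print(g.__closure__[0].cell_contents)
print(g())
```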

The bytecode - co_code in more detail.

The actual virtual machine instructions for a code object, the bytecode, are contained in the co_code field of a code object as previously mentioned. The byte code from the fizzbuzz function, for example, is the string of bytes shown in listing 5.7.

Listing 5.7: Bytecode string for fizzbuzz function
b'|\x00d\x01\x16\x00d\x02k\x02r\x1e|\x00d\x03\x16\x00d\x02k\x02r\x1ed\x04S\x00n,|\x00d\x01\x16\x00d\x02k\x02r0d\x05S\x00n\x1a|\x00d\x03\x16\x00d\x02k\x02rBd\x06S\x00n\x08t\x00|\x00\x83\x01S\x00d\x00S\x00'

To get a human-readable version of the byte string, we use the dis function from the dis module, as shown in listing 5.8.

Listing 5.8: Bytecode instruction disassembly for fizzbuzz function
  7           0 LOAD_FAST                0 (n)
              2 LOAD_CONST               1 (3)
              4 BINARY_MODULO
              6 LOAD_CONST               2 (0)
              8 COMPARE_OP               2 (==)
             10 POP_JUMP_IF_FALSE       30
             12 LOAD_FAST                0 (n)
             14 LOAD_CONST               3 (5)
             16 BINARY_MODULO
             18 LOAD_CONST               2 (0)
             20 COMPARE_OP               2 (==)
             22 POP_JUMP_IF_FALSE       30

              ...

 14     >>   66 LOAD_GLOBAL              0 (str)
             68 LOAD_FAST                0 (n)
             70 CALL_FUNCTION            1
             72 RETURN_VALUE
        >>   74 LOAD_CONST               0 (None)
             76 RETURN_VALUE

The first column of the output shows the source line number for that instruction; it is printed only for the first instruction of each line, and multiple instructions may map to the same line number. This value is calculated using information from the co_lnotab field of a code object. The second column is the offset of the given instruction from the start of the bytecode. Assuming the bytecode string is contained in an array, this value is the index of the instruction in the array. The third column is the human-readable name of the instruction's opcode; the full range of opcodes is found in the Include/opcode.h header file. The fourth column is the argument to the instruction.
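
The mapping from bytecode offsets to source lines can be inspected without decoding the line-number table by hand. The sketch below uses dis.findlinestarts from the standard library on a small three-line module; the source string and the "<example>" filename are placeholders, and the exact offsets printed depend on the CPython version in use.

```python
import dis

source = "a = 1\nb = 2\nc = a + b\n"
code = compile(source, "<example>", "exec")

# findlinestarts yields (bytecode offset, source line) pairs decoded from the
# code object's line-number table. Entries without a real source line are skipped.
line_starts = [(offset, line) for offset, line in dis.findlinestarts(code) if line]

for offset, line in line_starts:
    print(f"offset {offset:>3} starts line {line}")
```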

The first LOAD_FAST instruction takes the argument 0. This value is an index into the co_varnames array. The last column, in parentheses, is the value that the argument refers to, provided by the dis function for ease of use. Some instructions do not take explicit arguments: notice that the BINARY_MODULO and RETURN_VALUE instructions have none. Recall that the Python virtual machine is stack-based, so these instructions read their operands from the top of the stack.
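
The same columns are available programmatically through dis.get_instructions. The sketch below uses a small hypothetical function, remainder, rather than fizzbuzz; the opcode names and offsets it prints vary between CPython versions, so no particular output is assumed here.

```python
import dis

def remainder(n):
    return n % 3

code = remainder.__code__
instructions = list(dis.get_instructions(code))

for ins in instructions:
    # offset, opcode name, raw argument, and the argument as dis displays it
    print(ins.offset, ins.opname, ins.arg, ins.argrepr)

# The instruction that loads n carries an index into co_varnames as its argument.
load_n = next(ins for ins in instructions if ins.argval == "n")
print(load_n.arg, code.co_varnames[load_n.arg])
```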

Bytecode instructions are two bytes in size - one byte for the opcode and the second byte for the argument to the opcode. In the case where the opcode does not take an argument, the second byte is zeroed out. The 16 bits of an instruction are structured as shown in figure 5.0. The opcode is the first byte in memory and the argument is the second; on a little-endian machine, such as the one on which this book was written, this means that when the pair is read as a single 16-bit word the opcode occupies the lower 8 bits and the argument occupies the higher 8 bits.

Figure 5.0: Bytecode instruction format showing opcode and oparg
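
The two-byte format makes a bytecode string easy to split by hand. As a minimal sketch, the snippet below splits the co_code value of the test_non_local function shown earlier into (opcode, oparg) pairs; split_instructions is a hypothetical helper and does no opcode name lookup, since opcode numbering differs between CPython versions.

```python
# Bytecode string copied from the test_non_local code object shown earlier.
co_code = b't\x00d\x01\x17\x00}\x00|\x00S\x00'

def split_instructions(bytecode):
    """Split a bytecode string into (opcode, oparg) pairs of one byte each."""
    return [(bytecode[i], bytecode[i + 1]) for i in range(0, len(bytecode), 2)]

pairs = split_instructions(co_code)
for offset, (opcode, oparg) in zip(range(0, len(co_code), 2), pairs):
    print(f"offset {offset:>2}: opcode {opcode:>3}, oparg {oparg}")
```

The opcode 124 in the fifth pair is the LOAD_FAST opcode mentioned earlier, and 100 in the second pair is LOAD_CONST.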

Sometimes, the argument to an opcode is too large to fit into the default single byte. The Python virtual machine makes use of the EXTENDED_ARG opcode for these kinds of arguments. The argument is split into bytes (we assume here that it fits into two bytes, but the logic is easily extended past two): the most significant byte becomes the argument to an EXTENDED_ARG opcode while the least significant byte becomes the argument to the actual opcode. The EXTENDED_ARG opcode(s) come before the actual opcode in the instruction sequence, and the full argument is rebuilt by shifting the bytes seen so far to the left and or’ing in the next byte. For example, the value 321 cannot fit into a single byte, so passing it as an argument to the LOAD_CONST opcode requires EXTENDED_ARG. The binary representation of this value is 0b101000001, so the actual opcode (LOAD_CONST) takes the least significant byte (0b01000001, 65 in decimal) as its argument while the EXTENDED_ARG opcode takes the remaining byte (0b1) as its argument; thus, the sequence of instructions output is (144, 1), (100, 65).
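
The splitting and rebuilding of a large argument can be sketched in a few lines. The encode and decode functions below are hypothetical helpers written for illustration, not the interpreter's own code, and the opcode numbers 144 and 100 are the ones quoted in the paragraph above; other CPython versions may number opcodes differently.

```python
EXTENDED_ARG = 144  # opcode numbers as quoted in the text
LOAD_CONST = 100

def encode(opcode, arg):
    """Emit (opcode, byte) pairs, prefixing one EXTENDED_ARG per extra byte."""
    extra = []
    high = arg >> 8
    while high:
        extra.append((EXTENDED_ARG, high & 0xFF))
        high >>= 8
    # The most significant byte must come first in the instruction stream.
    return extra[::-1] + [(opcode, arg & 0xFF)]

def decode(pairs):
    """Rebuild full arguments by accumulating bytes across EXTENDED_ARG prefixes."""
    result, pending = [], 0
    for opcode, byte in pairs:
        arg = (pending << 8) | byte
        if opcode == EXTENDED_ARG:
            pending = arg  # carry the partial argument to the next instruction
        else:
            result.append((opcode, arg))
            pending = 0
    return result

encoded = encode(LOAD_CONST, 321)
print(encoded)
print(decode(encoded))
```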

The documentation for the dis module contains a comprehensive list and explanation of all opcodes currently implemented by the virtual machine.

5.2 Code Objects within other code objects

Another code object that is worth looking at is that of a module. What does the result of compiling a module look like? To find out, we use the compile function in Python to compile a module, fizzbuzz.py, with the content shown in listing 5.9.

Listing 5.9: Nested function to illustrate nested code objects
    def f():
        print(c)
        a = 1
        b = 3
        def g():
            print(a+b)
            c=2
            def h():
                print(a+b+c)

Listing 5.10 is the disassembly of the module code object that results.

Listing 5.10: Bytecode instruction disassembly for listing 5.9
    0 LOAD_CONST               0 (<code object f at 0x102a028a0, file "fizzbuzz.py", line 1>)
    2 LOAD_CONST               1 ('f')
    4 MAKE_FUNCTION            0
    6 STORE_NAME               0 (f)
    8 LOAD_CONST               2 (None)
   10 RETURN_VALUE

The instruction at byte offset 0 loads the code object for our function definition, f. The MAKE_FUNCTION instruction then creates a function object from it, and STORE_NAME binds that function to the name f. Listing 5.11 shows the attributes of the module's code object.

Listing 5.11: Code object attributes for the module in listing 5.9
    co_argcount = 0
    co_cellvars = ()
    co_code = b'd\x00d\x01\x84\x00Z\x00d\x02S\x00'
    co_consts = (<code object f at 0x1022029c0, file "fizzbuzz.py", line 1>, 'f', None)
    co_filename = fizzbuzz.py
    co_firstlineno = 1
    co_flags = 64
    co_freevars = ()
    co_kwonlyargcount = 0
    co_lnotab = b''
    co_name = <module>
    co_names = ('f',)
    co_nlocals = 0
    co_stacksize = 2
    co_varnames = ()

As would be expected in a module, the fields related to code object arguments (co_argcount, co_kwonlyargcount) are all zero. The co_code field contains the bytecode instructions shown in listing 5.10. The co_consts field is an interesting one. The constants in the field are a code object, the string 'f' and None. The code object is that of the function, the value ‘f’ is the name of the function, and None is the return value of the module's code block - recall that the Python compiler adds a return None statement to a code object without one.
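
The contents of a module's co_consts field can be checked from Python itself. The sketch below compiles a small placeholder module, which is not the one from listing 5.9, picks the code objects out of its constants, and then executes the module code object to show the point at which a function object comes into existence.

```python
import types

source = '''
def f():
    a = 1
    def g():
        return a
    return g
'''
module_code = compile(source, "<example>", "exec")

# Compilation only produces code objects; they sit in the module's co_consts.
nested = [c for c in module_code.co_consts if isinstance(c, types.CodeType)]
print([c.co_name for c in nested])
print(module_code.co_argcount, module_code.co_names)

# A function object exists only after the module code object has been executed.
namespace = {}
exec(module_code, namespace)
print(type(namespace["f"]) is types.FunctionType)
```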

Notice that function objects are not created during the module’s compilation. What we have are just code objects - it is during the execution of the module's code object that the function gets created, as seen in Listing 5.10. Inspecting the attributes of the code object for f shows that it is also composed of other code objects, as shown in listing 5.12.

Listing 5.12: Code object attributes for the function f from listing 5.9
    co_argcount = 0
    co_cellvars = ('a', 'b')
    co_code = b't\x00t\x01\x83\x01\x01\x00d\x01\x89\x00d\x02\x89\x01\x87\x00\x87\x01f\x02d\x03d\x04\x84\x08}\x00d\x00S\x00'
    co_consts = (None, 1, 3, <code object g at 0x101a028a0, file "fizzbuzz.py", line 5>, 'f.<locals>.g')
    co_filename = fizzbuzz.py
    co_firstlineno = 1
    co_flags = 3
    co_freevars = ()
    co_kwonlyargcount = 0
    co_lnotab = b'\x00\x01\x08\x01\x04\x01\x04\x01'
    co_name = f
    co_names = ('print', 'c')
    co_nlocals = 1
    co_stacksize = 3
    co_varnames = ('g',)

The same logic explained earlier on applies here with the function object created only during the execution of the code object.
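
Since nested code objects live in the co_consts field of the enclosing code object, the whole tree can be walked recursively. The sketch below compiles the source of listing 5.9 from a string; walk is a hypothetical helper, and the "<example>" filename is a placeholder.

```python
import types

source = '''
def f():
    print(c)
    a = 1
    b = 3
    def g():
        print(a+b)
        c=2
        def h():
            print(a+b+c)
'''

def walk(code, depth=0):
    """Yield (depth, code object) for a code object and everything nested in it."""
    yield depth, code
    for const in code.co_consts:
        if isinstance(const, types.CodeType):
            yield from walk(const, depth + 1)

# Compiling does not run the module, so the undefined name c raises no error here.
for depth, code in walk(compile(source, "<example>", "exec")):
    print("  " * depth + code.co_name, code.co_cellvars, code.co_freevars)
```

The printout also ties back to cell and free variables: a name that is free in h or g shows up as a cell variable in the function that defines it.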

5.3 Code Objects in the VM

Like most built-in types, there is a code type that defines the code object type and a PyCodeObject structure for code object instances. The code type is similar to other type objects that have been discussed in previous sections, so we do not reproduce it here. Listing 5.13 shows the structure used to represent code object instances.

Listing 5.13: Code object implementation in C
typedef struct {
    PyObject_HEAD
    int co_argcount;        /* #arguments, except *args */
    int co_kwonlyargcount;    /* #keyword only arguments */
    int co_nlocals;        /* #local variables */
    int co_stacksize;        /* #entries needed for evaluation stack */
    int co_flags;        /* CO_..., see below */
    int co_firstlineno;   /* first source line number */
    PyObject *co_code;        /* instruction opcodes */
    PyObject *co_consts;    /* list (constants used) */
    PyObject *co_names;        /* list of strings (names used) */
    PyObject *co_varnames;    /* tuple of strings (local variable names) */
    PyObject *co_freevars;    /* tuple of strings (free variable names) */
    PyObject *co_cellvars;      /* tuple of strings (cell variable names) */
   
    unsigned char *co_cell2arg; /* Maps cell vars which are arguments. */
    PyObject *co_filename;    /* unicode (where it was loaded from) */
    PyObject *co_name;        /* unicode (name, for reference) */
    PyObject *co_lnotab;    /* string (encoding addr<->lineno mapping) See
                   Objects/lnotab_notes.txt for details. */
    void *co_zombieframe;     /* for optimization only (see frameobject.c) */
    PyObject *co_weakreflist;   /* to support weakrefs to code objects */
    /* Scratch space for extra data relating to the code object.
       Type is a void* to keep the format private in codeobject.c to force
       people to go through the proper APIs. */
    void *co_extra;
} PyCodeObject;

The fields are almost all the same as those found in a Python code object, except for co_cell2arg, co_zombieframe, co_weakreflist and co_extra. co_weakreflist and co_extra are not really interesting fields at this point, and co_cell2arg, as its comment states, maps cell variables that are also arguments. The rest of the fields serve the same purpose as their counterparts in the Python code object. The co_zombieframe field exists for optimisation purposes. It holds a reference to a frame object that was previously used as the context for executing the code object. This frame is then reused as the execution frame when the code object is executed again, avoiding the overhead of allocating memory for another frame object.