Friday, August 14, 2015

Lists, Comprehensions, and Generators

In A Student’s Guide to Python for Physical Modeling, we emphasized NumPy arrays and paid less attention to Python lists. The reason is simple: In most scientific computing applications, NumPy arrays store data more efficiently and speed up mathematical calculations, sometimes a thousandfold.

However, there are some applications where a Python list is the better choice. There are also times when the choice between a list and an array has little or no effect on performance. In such cases a list can make your code easier to read and understand, and that is always a good thing.

In this post, I will describe Python lists and explain a special Python construct for creating lists called a list comprehension. I will also describe a similar construct called a generator expression.

Lists

A list is an ordered collection of items. You may have made a “To Do” list this morning or a grocery list for a recent trip to the store. In computer science, a list is a data structure that supports a few basic methods like insert, remove, append, and find. You probably used several of these operations with your own list. Perhaps you penciled in a new task later in the day (append), then crossed tasks off the list as you completed them (remove).

An array is a rigid data structure that stores a fixed number of identical elements (integers, floats, eight-character strings, etc.). If operations like insert, remove, or append are important parts of a computational task, a more flexible data structure like a list may be appropriate. The type of data may also suggest that a Python list is a better choice than a NumPy array. For instance, how would you initialize an array to store a grocery list? Furthermore, if the number of items to be stored is not known at the outset, it may be easier to store the data in a list and convert it to an array later. (Perhaps you are taking data at regular intervals but do not know how long an experiment will run.) Finally, if you are not worried about performance and scaling, a Python list might be a simpler option than a NumPy array. If you just need the first 20 perfect cubes, do you really want to import all of NumPy?

Let’s use the example of a grocery list to create and transform a simple list. A more useful example in scientific computing might be managing a collection of data files to process into stunning figures for your latest report, but the principles are the same.

To create a Python list, just enclose a comma-separated list of elements inside square brackets.

groceries = ['milk', 'eggs', 'orange juice']

Some functions also return a list. I can create the same list from a string using the split method:

groceries = "milk, eggs, apples, orange juice".split(',')

To find an item in a list, use the index method. It will return the index of the first occurrence of the item you request if it is in the list and a ValueError if it is not.

groceries.index('eggs')
groceries.index('bread')

It looks like I forgot the bread! I can add it to the list using the append method:

groceries.append('bread')

Later, I see some orange juice in the back of the refrigerator (not yet expired …), so I will delete that item from the list using the remove method:

groceries.remove('orange juice')

Since I am headed to ABC Grocery, where everything is organized alphabetically, I will sort the list and review it:

groceries.sort()
print(groceries)

One more useful operation is joining lists, or concatenation. In Python, the addition operator for lists is defined to join two lists. The extend method of a list will also join two lists. Calling append with a list as its argument will not add each element to the original list. It will make the argument list the final element of the calling list. I.e., you will have a list within a list, not a concatenation of the two lists.

If I had two separate grocery lists, I could join them into a single list in a variety of ways:

old_list = ['bananas', 'coffee']
new_list = groceries + old_list         # Addition creates a new list.
groceries += old_list                   # In-place addition extends original list.
old_list.extend(groceries)              # extend also extends original list.

bad_list = ['bananas', 'coffee']
bad_list.append(groceries)              # append does NOT concatenate lists.

After this set of commands, groceries and new_list contain the same elements. old_list contains 'bananas' and 'coffee' twice, since the commands append old_list to groceries first, and then append the new groceries to the original old_list. As you can see, bad_list did not merge the two lists properly.

In case you are skeptical of the usefulness of Python lists in scientific endeavors, here is a function that creates a list of the first N Fibonacci numbers.

def fibonacci(N):
    if N == 0: return [1]       # Handle unusual request for 0th number.
    fib = [1, 1]                # Create a list for all other values of N.
    for k in range(1,N):
        # Next Fibonacci number is the sum of the previous two.
        fib.append(fib[-1] + fib[-2])
    return fib

If you are still skeptical of the possible utility of a Python list, try writing the same function using NumPy arrays and using it to compute fibonacci(100). I’ve included a solution at the end of the post.

The command inside the loop could also have been written using list addition. Either of the following commands will work:

fib += [fib[-1] + fib[-2]]
fib = fib + [fib[-1] + fib[-2]]

The second approach is less efficient because it makes a copy of the list every time a new element is added, but it is syntactically correct. Be aware that you can only use list addition to join two lists — not a list and some other object you would like to put inside it. The following alternatives would result in errors:

fib += fib[-1] + fib[-2]
fib += (fib[-1] + fib[-2])
fib = fib + (fib[-1] + fib[-2])

List Comprehensions

Sometimes a list is not a collection of random elements; it has a logical structure instead. For instance, suppose you want to find the sum of the first 20 perfect cubes. You could create an array or a list of cubes, then add them up. The familiar procedure using a NumPy array is

import numpy as np
cubes = np.arange(1,21)**3
cubes.sum()

This does not require too much typing, and the purpose of the code is fairly clear. However, compare it with the following code:

cubes = [n**3 for n in range(1,21)]
sum(cubes)

This is an example of a list comprehension: a Python expression inside of the square brackets that denote a list. The statement is similar to the notation used in mathematics to define a set. It also clearly describes what the list contains. Note the similarity of the list comprehension to the following loop, which creates the same list:

cubes = []                  # Initialize empty list.
for n in range(1,21):
    cubes.append(n**3)      # Append perfect cubes.

The list comprehension effectively compresses all of this into a single Python statement.

A list comprehension defines a new list from another collection of objects via an expression like “for n in ...” Rather than using range, you can build one list from another:

poly = [(x+1)*(x-1) for x in cubes]

You can also apply conditional statements in a list comprehension:

even_squares = [n**2 for n in range(1,51) if n%2 == 0]
odd_squares  = [n**2 for n in range(1,51) if n%2 == 1]

You can cram quite a lot of code into a list comprehension, but it is not always advisable:

pythagoras = [(a,b,c)   for a in range(1,31) for b in range(a,31) \
                        for c in range(1,31) if a**2 + b**2 == c**2]

Despite the length and complexity of this single expression, its meaning is still fairly clear.

Returning to our original task, we can do even better than adding up the first 20 perfect cubes. Using a nested list comprehension, we can make a list of sums of cubes!

sums_of_cubes = [sum([n**3 for n in range(1,N+1)]) for N in range(1,21)]

Generators

A list comprehension creates a Python list that stores all of the elements in a single data structure. Sometimes this is exactly what you need. Other times, you simply want to iterate over all of the items in a list. If you never need all of the items in the list at once, you can use a generator expression instead. A generator expression looks like a list comprehension, except that you enclose the expression in round parentheses instead of square brackets — (...) instead of [...]. Despite the round parentheses, a generator expression does not create a tuple, and there is no such thing as a “tuple comprehension”.

cube_list = [n**3 for n in range(1,101)]
cube_generator = (n**3 for n in range(1,101))

A generator is simpler than a list. You cannot insert, remove, or append items, nor can you search or sort a generator. A generator knows how to produce the next item in a sequence, and little else. Once it has reached the end of its sequence, it does even less.

for x in cube_list: print(x)            # Prints numbers stored in list.
for x in cube_list: print(x)            # Prints numbers stored in list again.

for x in cube_generator: print(x)       # Prints numbers provided by generator.
for x in cube_generator: print(x)       # Prints nothing.  Generator is finished.

The advantages of a generator over a list are size and speed. Compare the output of the __sizeof__() method for the following lists and generators. This method returns the size of the object in bytes.

cube_list = [n**3 for n in range(1,10)]
cube_generator = (n**3 for n in range(1,10))
print(cube_list.__sizeof__())
print(cube_generator.__sizeof__())

cube_list = [n**3 for n in range(1,10**3)]
cube_generator = (n**3 for n in range(1,10**3))
print(cube_list.__sizeof__())
print(cube_generator.__sizeof__())

cube_list = [n**3 for n in range(1,10**6)]
cube_generator = (n**3 for n in range(1,10**6))
print(cube_list.__sizeof__())
print(cube_generator.__sizeof__())

The list grows from 168 bytes to 9 kB to 8.7 MB, while the generator remains a constant 48 bytes. Also, you may have noticed a delay while Python created the large list during the last set of commands.

I generally prefer a generator when I iterate over a large sequence of items once — especially if the program might exit the loop before reaching the end of the sequence.

Summary

NumPy arrays are often the most efficient data structure for numerical work in Python. However, there are some tasks for which a Python list is a better choice — often when organizing data rather than processing data. Python offers a compact syntax for creating lists called a list comprehension. A generator expression is similar, but creates an object that can produce a sequence without storing all of its elements. A generator is often a better choice than a list or an array when iterating over a large sequence of items.




NumPy version of fibonacci(N)

Here is a version of the fibonacci(N) function above that uses NumPy arrays.

import numpy as np

def Fibonacci(N):
    if N == 0: return np.array(1)   # Handle unusual request for 0th number.
    fib = np.zeros(N+1, dtype=int)  # Initialize list for all other values of N.
    fib[0], fib[1] = 1, 1
    for k in range(2,N+1):
        # Next Fibonacci number is the sum of the previous two.
        fib[k] = fib[k-1] + fib[k-2]
    return fib

Perhaps you came up with a more elegant solution. I find this version more difficult to code and more confusing to read. Plus, using a NumPy array forces a compromise: Either use floating point numbers and lose significant digits for N > 78, or use integers and encounter an overflow error for N > 91. In either case, you cannot generate the 100th Fibonacci number!

1 comment:

  1. This comment has been removed by a blog administrator.

    ReplyDelete

To avoid duplication and inappropriate content, your comment will be reviewed before being published. Thank you for your input!


-- Jesse & Phil