## Thursday, September 24, 2015

### Speeding Up Python — Part 1: Profiling

When people argue about programming languages, a common critique of Python is, “It’s slow.” This is occasionally followed by, “A program written in C will run a thousand times faster.” Such generalization carry little weight. Python is often fast enough, and a well-written Python program can run significantly faster than a poorly-written C program. Plus, Moore’s Law implies that computers today are over a thousand times faster than those of 15 years ago: You can do with Python today what was only possible with a highly optimized, compiled program in 2000.

It is also important to consider development time. Suppose a C program takes a week to write and debug and 1 minute to run, and an equivalent Python program takes a day to write and debug and 1000 minutes (about a day) to run. The “slow” Python program will finish running five days earlier than the “fast” C program! If you already know Python and don’t know any C, then the time difference will be even greater.

In short, you need not avoid Python or learn some other programming language just because someone tells you Python is slow. Of course, sometimes there is a need for speed. If you want to eke out the best performance from the available hardware, you may need to learn a compiled language. However, you might want to see how much you can improve a Python program first.

The goal of this post and its sequel is to provide some tools and tips for improving the performance of Python programs. In this post, we will look at some profiling tools — sophisticated stopwatches for timing programs as they execute. In the next post, we will use these tools to demonstrate some general principles that will help you speed up your Python programs.

Before proceeding, I offer this advice: If your program already runs fast enough, do not bother with profiling and optimization. There are an endless number of interesting problems waiting to be solved, and the question of how to improve the performance of a particular program by 20 percent is probably not one of them.

I have included a sample module <calculator.py> and a sample script <test.py> at the end of this post, which I will use to illustrate some of the profiling tools. You can copy and paste these into your own working directory to replicate the examples, or you can try the profiling tools on some of your own modules and scripts.

## How Long Does It Really Take?

To improve the performance of a program, it is useful to gather quantitative data on how long it takes to run. This is called profiling. If the program takes a long time to run, you may be able to use your wristwatch to time it. For more accurate measurements, IPython provides some useful “magic” commands. These are commands preceded by a percent sign that must be entered at the IPython command prompt. (See Section 1.2.1 of A Student’s Guide to Python for Physical Modeling.)

All the commands that follow should by entered at the IPython command prompt.

### The %time Command

%time is a basic stopwatch. It will tell you how much time elapses on your computer’s internal clock while a command is executed.

Try the following commands at an IPython command prompt and investigate the information provided by %time:

%time 2**100
%time pow(2,100)


You should see something like this:

CPU times: user 6 us, sys: 2 us, total: 8 us
Wall time: 14.1 us


The output includes several times: user, sys, total, and Wall time. Wall time is the time you would have measured with your stopwatch, were your reflexes fast enough. It is not a very good metric of how efficient a program is because it includes the time your job spent waiting in line to run as well as interruptions by other processes that your operating system thought were more important. user measures how much time your CPU spent running your code. sys is the amount of time devoted to such processes as memory access, reading and writing data, gathering input, and displaying output. total is the sum of user and sys. It is the best measure of performance, and it may be significantly less than Wall time.

Run the commands above several times. You may notice minor differences in the elapsed times, as well as a few significant variations. To get an accurate measure of performance, it is best to average over many repetitions of the same command. This is what the %timeit magic command does.

### The %timeit Command

Try the same operations as before, but use the %timeit command instead of %time:

%timeit 2**100
%timeit pow(2,100)


The output should be something like this:

$%timeit 2**100 10000000 loops, best of 3: 45.4 ns per loop  This means that Python inserted the command 2**100 inside a loop and carried out the operation ten million times. It evaluated 3 such loops. It recorded the total time for each loop, and then divided by 10 million. The best result from the 3 loops was an average execution time of 45.4 ns. (This is less than the result of %time, which includes the time required to transform the string "2**100" into instructions your CPU understands.) You can already see the potential benefits of profiling. While 2**100 takes a mere 45 ns, pow(2,100) takes 1,230 ns — 27 times as long. If I am repeatedly computing large powers of integers, I can save time by using x**y instead of pow(x,y). You may notice that %timeit does not execute different commands the same number of times. It uses an adaptive method to get as many iterations of a command as possible without taking too long. Its default is to do three loops of a large number of iterations, but you can modify this. For example, to force %timeit to use 10 loops of 1 million iterations, you would type %timeit -r 10 -n 1000000 pow(2,100)  This method of specifying options will look strange if you have not worked at a UNIX command line. The hyphens and letters like “-r” are called option flags. -r tells Python to set the number of repetitions to whatever number comes next. Likewise, the -n tells Python to set the number of iterations in each loop to whatever number comes next. The command to time comes last. It may look jumbled and confusing, but don’t worry — Python knows what to do! You can find out more about %time and %timeit at the IPython command prompt: %time? %timeit?  ### The %run -t Command You can time the evaluation of an entire script by supplying an option flag to the %run magic command: %run -t test.py  This will run the script and report the time it took to execute. You can repeat a script several times by supplying an additional option flag: %run -t -N 10 test.py  This will run the script 10 times and report the total and average time of execution. Note that you must use a capital N here. Lower case n means something different to the %run command. ## Which Part Takes the Longest? You can accomplish a lot with the profiling tools mentioned so far. With %timeit, you can profile individual functions and commands. With %run -t, you can assess the effects of changes to a script. However, neither of these tools provides information on how time is divided among functions within a script, or subroutines within a function. You could try stepping through the program and using %timeit on each line to see which ones take the longest, but there are better ways. Spyder and the profile module allow you to see how much time Python spends executing individual functions, and the line_profiler.py module can measure the time spent on each line of a program! This allows you to identify the elements of a program that take the most time — the bottlenecks of your program. You can then focus your efforts on optimizing those portions of your code that will have the greatest impact on overall performance. To run the examples below, first execute the following commands (or run the test.py script) to import NumPy and the calculator module (included at the end of this post) and create two random arrays A and B. import numpy as np import calculator as calc M = 10**3 N = 10**3 A = np.random.random((M,N)) B = np.random.random((M,N))  ### The profile Module The profile module can be accessed from the command line or within scripts. It’s output is not always easy to understand, but it is useful for identifying which functions are consuming the most time. To use the module, import it and use its run method. import profile profile.run('calc.hypotenuse(A,B)', sort='tottime')  This command will execute the command calc.hypotenuse(A,B) and display profiling statistics on the screen. I have used the optional keyword argument sort to instruct the method to display the most time-consuming functions at the top of the output. The default is to sort by function or method name. The output is plain text: In [10]: profile.run('calc.hypotenuse(A,B)', sort='tottime') 1000014 function calls in 3.943 seconds Ordered by: internal time ncalls tottime percall cumtime percall filename:lineno(function) 2 2.010 1.005 2.018 1.009 calculator.py:16(multiply) 1 0.999 0.999 1.003 1.003 calculator.py:3(add) 1 0.743 0.743 0.917 0.917 calculator.py:29(sqrt) 1000000 0.170 0.000 0.170 0.000 {built-in method sqrt} 4 0.016 0.004 0.016 0.004 {built-in method zeros} 1 0.004 0.004 3.943 3.943 <string>:1(<module>) 1 0.000 0.000 3.943 3.943 {built-in method exec} 1 0.000 0.000 3.938 3.938 calculator.py:42(hypotenuse) 1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:2264(_handle_fromlist) 1 0.000 0.000 0.000 0.000 {built-in method hasattr} 1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}  The results show us which functions and modules were used during the execution of a script, how much time was spent on each, how many times each was called, and the file and line number where each function is defined. This tool is useful for identifying which functions consume the most time. It does not provide a detailed analysis of individual functions, so you may still need to use %lprun. The cProfile module contains the same methods and produces the same output as the profile module, but it takes less time to run. (Many of the Python routines in profile are rewritten as C routines in cProfile.) If you are doing a lot of profiling, you can speed up the process by replacing import profile with import cProfile as profile  ### Profiling with Spyder All of the tools mentioned so far are command-line tools that can be used with IPython, whether or not you are using Spyder. Spyder offers an additional option. You can use the menu command Run > Profile or the shortcut key <F10> to run a script using Spyder’s profiler. It will produce output like this: Here we see information similar to the output of profile.run, but in a format that is easier to interpret. ### The %lprun Command If you have successfully identified the function that consumes the most time, you may wish to dissect it further to find out why. The line_profiler package, written by Robert Kern, does exactly this. It allows you to see how much time Python spends on each individual line of code. The module is part of the Anaconda distribution. If you used the Miniconda installer, you may need to manually install this package from the command line: $ conda install line_profiler


The package can do quite a lot, but we are only going to look at one of its tools — an IPython magic command for profiling. To make this command available in IPython, we need to load it using another magic command:

%load_ext line_profiler


This gives you access to a magic command called %lprun if the line_profiler module is installed correctly.

There are two modes available with %lprun. The first is “function mode”. This allows you to designate a specific function to be analyzed when you execute a command or series of commands. The second is “module mode”, which will analyze all of the functions in a module you designate.

To profile the add function in the calculator module with %lprun type the following:

%lprun -f calc.add calc.add(A,B)


The -f option indicates function mode. The next item is the name of the function to analyze. (Be sure you provide the name, and not a function call. Do not include parentheses or arguments.) The last item is the Python statement to execute. I have instructed Python to gather information on the add function in the calculator module (imported as calc) while it evaluates the statement calc.add(A,B). Here is the output:

Timer unit: 1e-06 s

Total time: 2.94468 s
File: calculator.py

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
4                                              """
5                                              Add two arrays using a Python loop.
6                                              x, y: 2D arrays with the same shape.
7                                              """
8         1            7      7.0      0.0     m,n = x.shape
9         1         5704   5704.0      0.2     z = np.zeros((m,n))
10      1001         1044      1.0      0.0     for i in range(m):
11   1001000       872878      0.9     29.6         for j in range(n):
12   1000000      2065045      2.1     70.1             z[i,j] = x[i,j] + y[i,j]
13         1            1      1.0      0.0     return z



The total time is not that useful. It includes some of the overhead of analyzing the code line-by-line. If you are interested in the total execution time, use %timeit. The most useful information here is in the “% Time” column. This is the percentage of the total execution time spent on each line. Here, we see that most of the time (70.1 percent) is spent adding the elements of the arrays. However, it may surprise you to see that almost 30 percent of the time is spent on Line 11, evaluating the statement “for j in range(n)”.

Just seeing how time is spent during the function call can suggest ways to speed up the code. For example, if so much time is spent iterating over the values of the index, maybe a Python loop is a poor method for adding arrays …

It is also possible to use %lprun to analyze all of the functions in a module at once. This will print out a lot of information, but sometimes this is what you want.

%lprun -m calculator calc.hypotenuse(A,B)


The -m option indicates module mode, the next item is the name of the module to analyze, and the last item is the Python statement to execute. I have instructed Python to gather information on all of the functions in the calculator module while it evaluates the statement calc.hypotenuse(A,B).

## Tips for Profiling

If Carl von Clausewitz were a computer programmer rather than a military strategist, he might have said, “The enemy of a good program is the dream of a perfect program.” The most important rules of profiling are

• Avoid unnecessary profiling.

• Avoid premature profiling.

Profiling is time-consuming. Unless you need a working program to run faster — or you simply want to learn about profiling — skip it. When you use profiling tools, you should only analyze a working program. Remember, the goal is to identify and eliminate bottlenecks. You cannot diagnose the most time-consuming step of a program is until the entire program is working. Profiling and “optimizing” code too early slow down development and often produce unintended consequences.

Profiling tools can provide a glut of information that is difficult to digest. If you are trying to speed up a program (for example, the test.py script at the end of this post), you might try the following procedure:

1. Use %lprun in function mode, the profile module, or Spyder’s profiler to analyze the primary function (e.g., hypotenuse(A,B)) and identify bottlenecks.

2. Use %lprun in function mode to dissect the secondary functions that consume the most time (e.g., multiply(x,y)).

3. Use %timeit to find faster alternatives to the most time-consuming operations.

4. Repeat steps 1–3 until your program is fast enough.

Analyzing the primary function is important. You might be able to speed up a secondary function by a factor of 1,000; however, if that function only takes 1 percent of the total run time of your program, you haven’t gained much. On the other hand, if another function takes 90 percent of the run time and you speed it up by a factor of 2, you have made a significant improvement.

There are many more profiling tools available in Python. Delve into the timeit, profile, and line_profiler modules if you need to go beyond the techniques described here.

## Summary

The first step in improving the performance of your code is quantifying performance. IPython provides several tools that allow you to time statements, functions, code fragments, and scripts. These tools will allow you to identify the portions of your program that consume the most time — the bottlenecks. By focusing on these, you can get most out of your efforts toward optimization. Once your program is fast enough, you can move on to something more interesting!

## Code Samples

### The calculator.py Module

This module uses NumPy arrays for storage, but executes array math using Python loops.

# -----------------------------------------------------------------------------
# calculator.py
# -----------------------------------------------------------------------------
import numpy as np

"""
Add two arrays using a Python loop.
x and y must be two-dimensional arrays of the same shape.
"""
m,n = x.shape
z = np.zeros((m,n))
for i in range(m):
for j in range(n):
z[i,j] = x[i,j] + y[i,j]
return z

def multiply(x,y):
"""
Multiply two arrays using a Python loop.
x and y must be two-dimensional arrays of the same shape.
"""
m,n = x.shape
z = np.zeros((m,n))
for i in range(m):
for j in range(n):
z[i,j] = x[i,j] * y[i,j]
return z

def sqrt(x):
"""
Take the square root of the elements of an arrays using a Python loop.
"""
from math import sqrt
m,n = x.shape
z = np.zeros((m,n))
for i in range(m):
for j in range(n):
z[i,j] = sqrt(x[i,j])
return z

def hypotenuse(x,y):
"""
Return sqrt(x**2 + y**2) for two arrays, a and b.
x and y must be two-dimensional arrays of the same shape.
"""
xx = multiply(x,x)
yy = multiply(y,y)
return sqrt(zz)


### The test.py Script

This is a short script that creaates some arrays and uses the calculator module.

# -----------------------------------------------------------------------------
# test.py
# -----------------------------------------------------------------------------
import numpy as np
import calculator as calc

M = 10**3
N = 10**3

A = np.random.random((M,N))
B = np.random.random((M,N))

calc.hypotenuse(A,B)


To avoid duplication, spam, and inappropriate content, we will review comments before posting them.

— Jesse & Phil