Python’s best-known scientific computing libraries have lots of documentation about how to use them. But how do those tools work under the hood? Google searches turn up far fewer answers to that question.
Luckily the tools are open source, so I can dive into the code myself.
I did a post about diving into the Scipy CSR Matrix. As it turns out, the code reconstitutes sparse arrays into dense ones with a pair of for loops in C++. It’s fast because C is fast.
But that got me thinking about vectorization in Numpy. Numpy arrays tout a performance (speed) feature called vectorization.
The generally held impression among the scientific computing community is that vectorization is fast because it replaces the loop (running each item one by one) with something else that runs the operation on several items in parallel. That technique is commonly called parallelization in the wider programming community.
But what if these lauded “vectorizations” are, at their core, just for loops in a fast language (namely C) without any parallel magic? Let’s find out.
How do we know if Numpy’s speed is due to C or to parallelization?
Let’s do this the hard way, because the more gray hairs we give ourselves the better we’ll remember the lesson :).
We’ll write the loops in C and compare their speed to Numpy’s. If the loops are beating Numpy, then maybe Numpy is just loops. If Numpy beats the loops, maybe something more clever is going on in there.
Before we do this, let’s acknowledge that this has been done. Charles Menguy shared some results on StackOverflow in 2012:
I was trying to figure out the fastest way to do matrix multiplication and tried 3 different ways:
- Pure python implementation: no surprises here.
- Numpy implementation using numpy.dot(a, b)
- Interfacing with C using the ctypes module in Python.
Charles’ test purported to demonstrate that numpy operates faster than a straight looping C implementation. We’re not going to just leave it at that, but I figured I’d share the chart because charts are pretty:
Wait, why can’t we just leave it at that?
The thing is, the implementation deployed for this particular test is more naive than a looping implementation has to be. Granted, it is meant to work for arrays of arbitrary size and dimension, but it doesn’t actually achieve that, so I cannot give it full marks on that front. Additionally, the stride is implicitly, rather than explicitly, defined. Also, the matmult method does a multiplication operation that could benefit from a structural efficiency gained by transposing one of the factors, and this implementation doesn’t do that (you’ll see why this is useful in a diagram we’ll draw later).
I redid the C with an explicitly defined matrix size and dimensionality. Matrix-handling libraries commonly have separate array manipulation implementations not only for separate numbers of dimensions, but even for separate data types.
I also added a struct for object clarity.
C Experiment Number 1: Straight C with ctypes
Here’s all the code on Github, if you’re into that kind of thing.
#include <stdio.h>
#include <stdlib.h>

typedef struct Collection {
    int size;
    float data[2000][2000];
} Collection;

void multiply(Collection *a, Collection *b, Collection *output, int x_dim, int y_dim) {
    int i, j;
    for (i = 0; i < x_dim; ++i) {
        for (j = 0; j < y_dim; ++j) {
            output->data[i][j] = a->data[i][j] * b->data[i][j];
        }
    }
}
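The post doesn’t show the build step inline, but to give ctypes something to load, this file has to be compiled into a shared library first. Something along the lines of gcc -shared -fPIC -O2 -o multiply.so multiply.c (exact flags vary a bit by platform and compiler) produces the multiply.so that the Python code below opens.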
Here are the multiplication methods:
import ctypes
import time
from _ctypes import Structure, POINTER, byref
from ctypes import c_int, c_float
import numpy as np

def c_multiplication(a, b):
    class Collection(Structure):
        _fields_ = [("size", c_int),
                    ("data", (c_float * a.shape[1]) * a.shape[0])]

    libmatmult = ctypes.CDLL("./multiply.so")
    libmatmult.multiply.argtypes = [POINTER(Collection), POINTER(Collection), POINTER(Collection), c_int, c_int]
    libmatmult.multiply.restype = None

    t_a = Collection()
    t_b = Collection()
    t_c = Collection()

    # Load the numpy arrays into the C structs, element by element.
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            t_a.data[i][j] = a[i][j]
            t_c.data[i][j] = a[i][j]

    for i in range(b.shape[0]):
        for j in range(b.shape[1]):
            t_b.data[i][j] = b[i][j]

    start = time.time()
    libmatmult.multiply(byref(t_a), byref(t_b), byref(t_c), a.shape[0], a.shape[1])
    end = time.time()
    print(end - start)

    # Unload the C output array back into a python-usable numpy array.
    c_result = np.ones(a.shape)
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            c_result[i][j] = t_c.data[i][j]

    return c_result

def python_multiplication(a, b):
    product = np.ones(a.shape)
    for row_index in np.arange(a.shape[0]):
        for col_index in np.arange(a.shape[1]):
            product[row_index][col_index] = a[row_index][col_index] * b[row_index][col_index]
    return product
and the calls:
import numpy as np
import time
from multiply import c_multiplication, python_multiplication

a = np.random.rand(2000, 2000)
b = np.random.rand(2000, 2000)

print("Timing python multiplication:\n")
start = time.time()
python_result = python_multiplication(a, b)
end = time.time()
print(end - start)

print("Timing C multiplication:\n")
c_result = c_multiplication(a, b)

print("Timing numpy multiplication:\n")
start = time.time()
python_result = a * b
end = time.time()
print(end - start)
Results:
It looks like our straight Python takes 4 seconds (ouch), followed by the Numpy implementation at 0.013 seconds, followed by our C at 0.009 seconds. These numbers differ a little each time I run this code, but these results are typical.
Worth noting: the C timing beats the Numpy timing here, but as you can see, the C time does not include loading the data into a C array and then back into a python-usable array. Since I do that element by element in Python, it wouldn’t be fair to count it against the C implementation. But since Numpy takes and returns a python-usable collection within its timed span, this timing method isn’t exactly fair to Numpy, either.
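If you want the end-to-end number for the ctypes path, conversions and all, you can move the clock outside the call in the timing script above. A sketch of that, reusing the names already defined there:

print("Timing C multiplication, conversions included:\n")
start = time.time()
c_result = c_multiplication(a, b)   # includes loading into C structs and unloading back out
end = time.time()
print(end - start)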
C Experiment Number 2: Cython Conversion of Straight Python
Let’s try this with Cython instead of with ctypes. Cython handles its own python to C conversion, so we will literally copy our python implementation into a Cython rig first and see if we pick up any speed that way.
Python code converted to Cython:
import numpy as np

def cython_multiplication(a, b):
    product = np.ones(a.shape)
    for row_index in np.arange(a.shape[0]):
        for col_index in np.arange(a.shape[1]):
            product[row_index][col_index] = a[row_index][col_index] * b[row_index][col_index]
    return product
Updated execution:
import ctypes
import time
from _ctypes import Structure, POINTER, byref
from ctypes import c_int, c_float
import pyximport; pyximport.install()
import multiplication
from multiply import c_multiplication, python_multiplication
import numpy as np

a = np.random.rand(2000, 2000)
b = np.random.rand(2000, 2000)

print("Timing python multiplication:\n")
start = time.time()
python_result = python_multiplication(a, b)
end = time.time()
print(end - start)

print("Timing ctypes C multiplication:\n")
c_result = c_multiplication(a, b)

print("Timing cython multiplication:\n")
start = time.time()
python_result = multiplication.cython_multiplication(a, b)
end = time.time()
print(end - start)

print("Timing numpy multiplication:\n")
start = time.time()
python_result = a * b
end = time.time()
print(end - start)
Updated results:
Yow. Our Cython version takes 3.2 seconds—better than the straight Python without conversion, but nowhere near as fast as the straight C or Numpy.
Of course, the C part is still faster. That’s why the Cython version can load a python collection into C, run the transformation, unload the C array back into python, and still finish faster than the pure python implementation did, despite having an extra step at the beginning and the end.
Here’s the Python interaction heat map for this code. We have a lot of Python interaction going on here as indicated by the yellow highlights: a middle yellow line on the import statement, then five lines of intense yellow, and a faint yellow on that return value. In fact, not a single line of our code avoids Python interaction.
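For anyone who wants to reproduce this view: it comes from Cython’s annotation feature. Running cython -a multiplication.pyx produces an HTML report in which each line is shaded more intensely yellow the more it has to call back into the Python interpreter.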
The Python interaction is a pretty gaping speed leak. Let’s do this same thing in a more C-ish manner and see if it’s any faster. Note that we have even defined our API in C style: we pass a return value, and we pass the number of dimensions as integers. I got lazy at the end and used np.asarray() to convert my return value. Sue me.
C Experiment Number 3: Cython, But Make it C-ish
New code for Cython:
import numpy as np

def cython_multiplication(double[:,:] a, double[:,:] b, double[:,:] out, int x_dims, int y_dims):
    cdef int i, j

    for i in range(x_dims):
        for j in range(y_dims):
            out[i, j] = a[i, j] * b[i, j]

    return np.asarray(out)
Updated execution:
print("Timing cython multiplication:\n") | |
start = time.time() | |
python_result = multiplication.cython_multiplication(a, b, c, 2000, 2000) | |
end = time.time() | |
print(end – start) |
Updated results:
We’re down to 0.015 seconds on that Cython code! We picked up a lot of speed dropping straight into C. Python interaction analysis is much less yellow:
The remaining yellow is:
- Line 1 (middle yellow): imports a Python API.
- Line 3 (intense yellow): takes in 5 Python objects.
- Line 8 (faint yellow): converts 3 Python collections to C arrays.
- Line 10 (middle yellow): converts a C object into a Python collection.
Now we have something a lot faster than pure Python that takes in the Python inputs and returns a Python-manipulable collection.
But numpy’s multiplication is still clocking in at a 33% time savings off this.
So, the emperor is wearing clothes! The speed of C alone doesn’t fully explain Numpy’s speed. Numpy did all that same stuff faster than a naive looping C implementation.
What sort of implementation would mean less looping?
Well, instead of looping over elements in a blocking synchronous queue, it might take all the independent operations and do them at once.
For elementwise multiplication, for example, all of the calculations are independent. You don’t need the result of any particular element by element multiplication to calculate any of the others. One could, theoretically, open as many threads as there are elements in the base arrays and run all the calculations concurrently. The sky’s the limit! Well, not really: your CPU is the limit. But you get the idea.
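To make the idea concrete, here is a minimal sketch of what chunk-level parallelism could look like from Python, handing independent row ranges to a thread pool. This is an illustration of the concept, not a claim about Numpy’s internals, and the helper name is made up for this post:

import numpy as np
from concurrent.futures import ThreadPoolExecutor

def parallel_elementwise(a, b, n_workers=4):
    out = np.empty_like(a)
    # Chunk boundaries along the first axis; every chunk is independent of the others.
    bounds = np.linspace(0, a.shape[0], n_workers + 1, dtype=int)

    def multiply_chunk(start, stop):
        out[start:stop] = a[start:stop] * b[start:stop]

    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for start, stop in zip(bounds[:-1], bounds[1:]):
            pool.submit(multiply_chunk, start, stop)
    # Leaving the with-block waits for every chunk to finish.
    return out

Whether those workers genuinely run at the same time depends on what happens underneath (Python’s GIL, whether the underlying operation releases it, how many cores you have), which is exactly the question we’re chasing here.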
The approach is even viable if some of the calculations do depend on one another. Take the dot product as an example: to calculate a dot product, one takes two arrays with corresponding numbers of elements along their inner dimensions, multiplies the corresponding elements together, and then sums up those products. For two plain vectors, that sum is the scalar result; for matrices, each element of the output (whose dimensions correspond to the factors’ outer dimensions) is one such sum of products.
In this case, the sum operation needs the multiplication operations to finish before it can finish because it needs to sum up the products. So we cannot put each element in its own thread and call it a day. We could, however, parallelize the multiplication operations, then run addition later.
Here’s a picture of a naive dot product implementation on the left, then an illustration of how we can imagine a parallelized version to work on the right. We don’t know at this point whether this is how vectorization works in Numpy, but this is how it could work if the code is indeed taking advantage of parallelization:
A note: this is the thing I mentioned about transposing helping us out. When we transpose the second array for doing the dot product, we can now do the multiplication calculations element-wise by the outer dimension, rather than having to move over one dimension in the first factor and then loop over the other one in the other factor. We can even use the same indices this way.
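To pin that down, here is a tiny Python sketch of the transposed version of the naive structure. The function name is mine and this is only an illustration, not Numpy’s implementation: with the second factor transposed, each output element is an elementwise multiply of two rows followed by a sum, and both factors are walked with the same pair of indices.

import numpy as np

def naive_dot_with_transpose(a, b):
    bt = b.T.copy()                  # copy so each row of bt sits contiguously in memory
    out = np.empty((a.shape[0], bt.shape[0]))
    for i in range(a.shape[0]):
        for j in range(bt.shape[0]):
            # Same index pattern on both factors: row i of a against row j of bt.
            out[i, j] = np.sum(a[i] * bt[j])
    return out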
So, the million dollar question: does Numpy do this? Let’s see what we can learn about how this works, first from a few papers and then from the code itself.
The Scuttlebutt
If you’re looking to familiarize yourself with Numpy use cases, I can recommend this paper by Van der Walt, Colbert, and Varoquaux as a useful rundown. But it doesn’t address how the features work underneath the API.
If you want to know all the gritty implementation details, the piece you’re looking for is Beautiful Code, Chapter 19: “Multidimensional Iterators in Numpy” by Travis E. Oliphant.
I won’t quote the entire chapter in this post, but the chapter first discusses the historical challenges of iteration over memory addresses, then explains the pieces necessary for a general solution. With that explanation complete, we arrive at the following signature for the base iterator object in Numpy. I have added arrows with brief explanations, but for full context I recommend the chapter itself:
typedef struct {                              <-- immutable key value data store to iterate over given array
    PyObject_HEAD
    int nd_m1;                                <-- number of dimensions in array minus 1
    npy_intp index;                           <-- counter for element that iterator is at
    npy_intp size;                            <-- number of elements in array
    npy_intp coords[NPY_MAXDIMS];             <-- the N-dimensional location of the current element
    npy_intp dims_m1[NPY_MAXDIMS];            <-- number of elements along each dimension minus 1
    npy_intp strides[NPY_MAXDIMS];            <-- number of bytes separating the elements in each dimension
    npy_intp backstrides[NPY_MAXDIMS];        <-- number of bytes to subtract to return to beginning of dimension
    npy_intp factors[NPY_MAXDIMS];            <-- for converting from 1D index to N-dimensional coords in 1D arrays
    PyArrayObject *ao;                        <-- pointer to the underlying array this iterator is built from
    char *dataptr;                            <-- pointer to the (first byte of) the current value of the array
    npy_bool contiguous;                      <-- TRUE (1) if the array elements are all next to each other in memory
} PyArrayIterObject;                          <-- name of struct object
Sure enough, the PyArrayIterObject appears in the Numpy elementwise iteration function:
/*********************** Element-wise Array Iterator ***********************/
/* Aided by Peter J. Verveer's nd_image package and numpy's arraymap    ****/
/* and Python's array iterator                                            ***/

/* get the dataptr from its current coordinates for simple iterator */
static char*
get_ptr_simple(PyArrayIterObject* iter, npy_intp *coordinates)
{
    npy_intp i;
    char *ret;

    ret = PyArray_DATA(iter->ao);

    for(i = 0; i < PyArray_NDIM(iter->ao); ++i) {
        ret += coordinates[i] * iter->strides[i];
    }

    return ret;
}
By the way folks, that right there is how you do attribution in code.
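As a side note, the arithmetic get_ptr_simple performs (the array’s base pointer plus the sum of coordinate times stride for each dimension) is easy to check from plain Python, since arrays expose their strides. A quick sanity check of that formula against a small contiguous array:

import numpy as np

a = np.arange(12, dtype=np.float64).reshape(3, 4)   # C-ordered; strides are (32, 8) bytes
coords = (2, 1)

# base + sum(coordinate * stride), expressed as a byte offset from the start of the data
byte_offset = sum(c * s for c, s in zip(coords, a.strides))   # 2*32 + 1*8 = 72

# For a contiguous array, that byte offset divided by the item size is the flat index.
assert a.flat[byte_offset // a.itemsize] == a[coords]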
We also see the struct in use in the C-converted code from our own elementwise multiplication function. Here, we see a PyArrayIterObject generated to manage iteration over our output array, out:
Mind you, this is the base iterator for the Numpy arrays. It goes element by element in a given dimension, just like the iterators you’re used to. The design of this particular iterator takes advantage of a number of performance enhancements related to memory usage and compute time.
For example:
A common motif in NumPy is to gain speed by concentrating optimizations on the loop over a single dimension where simple striding can be assumed. Then an iteration strategy that iterates over all but the last dimension is used. This was the approach introduced by NumPy’s predecessor, Numeric, to implement the math functionality.
In NumPy, a slight modification to the NumPy iterator makes it possible to use this basic strategy in any code. The modified iterator is returned from the constructor as follows:
it = PyArray_IterAllButAxis(array, &dim). The PyArray_IterAllButAxis function takes a NumPy array and the address of an integer representing the dimension to remove from the iteration. The integer is passed by reference (the & operator) because if the dimension is specified as -1, the function determines which dimension to remove from iteration and places the number of that dimension in the argument. When the input dimension is -1, the routine chooses the dimension with the smallest nonzero stride.
Another choice for the dimension to remove might have been the dimension with the largest number of elements. That choice would minimize the number of outer loop iterations and reserve the most elements for the presumably fast inner loop. The problem with that choice is that getting information in and out of memory is often the slowest part of an algorithm on general-purpose processors.
As a result, the choice made by NumPy is to make sure that the inner loop is proceeding with data that is as close together as possible. Such data is more likely to be accessed more quickly during the speed-critical inner loop.
OK, so each loop should run fast. But to implement true parallelization, Numpy would need to run multiple iterators at the same time. How does Numpy handle multiple iterations within an array or collection of arrays?
Multiple Iterations
Another common task in NumPy is to iterate over several arrays in concert. For example, the implementation of array addition requires iterating over both arrays using a connected iterator so that the output array is the sum of each element of the first array [sic] multiplied by* each element of the second array.
*This is exactly what the chapter says. This paragraph starts by exemplifying elementwise addition in arrays, but then explains that as taking the sum of elementwise multiplication. Those two operations are not the same thing. I suspect what happened here is that the example was originally elementwise multiplication (that is, exactly what we are doing), but an editor suggested elementwise addition instead because addition is a sort of universal base case for scalar operations, and most of this paragraph got changed, but the explanation of the operation did not. We will operate under the assumption that this whole thing is about elementwise addition and “multiplied by” should say “and” (for addition of two elements).
So anyway, what does the chapter say we can do about that?
This can be accomplished using a different iterator for each of the input elements [the factor arrays] and an iterator for the output array in the normal fashion.
Alternatively, NumPy provides a multi-iterator object that can simplify dealing with several iterators at once.
This is where the parallelization would happen: multiple iterators running at the same time on an array or combination of arrays. In this case, we have an iterator for each individual input—an input being one of the addend arrays, rather than the elements thereof.
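As far as I can tell, this is the same machinery that np.broadcast exposes at the Python level: it builds one object that holds an iterator per input, broadcasts their shapes together, and hands back matching elements in lockstep. A small demonstration with made-up arrays:

import numpy as np

a = np.array([[1.0, 2.0],
              [3.0, 4.0]])
b = np.array([10.0, 20.0])           # gets broadcast across a's rows

multi = np.broadcast(a, b)           # one object coordinating an iterator per input
print(multi.shape)                   # (2, 2)
print([x * y for x, y in multi])     # [10.0, 40.0, 30.0, 80.0]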
Here’s the method that returns us that multi-iterator object in Numpy:
/*NUMPY_API
 * Get MultiIterator from array of Python objects and any additional
 *
 * PyObject **mps - array of PyObjects
 * int n - number of PyObjects in the array
 * int nadd - number of additional arrays to include in the iterator.
 *
 * Returns a multi-iterator object.
 */
NPY_NO_EXPORT PyObject *
PyArray_MultiIterFromObjects(PyObject **mps, int n, int nadd, ...)
{
    va_list va;
    PyArrayMultiIterObject *multi;
    PyObject *current;
    PyObject *arr;
    int i, ntot, err=0;

    ntot = n + nadd;
    if (ntot < 1 || ntot > NPY_MAXARGS) {
        PyErr_Format(PyExc_ValueError,
                     "Need at least 1 and at most %d "
                     "array objects.", NPY_MAXARGS);
        return NULL;
    }
    multi = PyArray_malloc(sizeof(PyArrayMultiIterObject));
    if (multi == NULL) {
        return PyErr_NoMemory();
    }
    PyObject_Init((PyObject *)multi, &PyArrayMultiIter_Type);

    for (i = 0; i < ntot; i++) {
        multi->iters[i] = NULL;
    }
    multi->numiter = ntot;
    multi->index = 0;

    va_start(va, nadd);
    for (i = 0; i < ntot; i++) {
        if (i < n) {
            current = mps[i];
        }
        else {
            current = va_arg(va, PyObject *);
        }
        arr = PyArray_FROM_O(current);
        if (arr == NULL) {
            err = 1;
            break;
        }
        else {
            multi->iters[i] = (PyArrayIterObject *)PyArray_IterNew(arr);
            if (multi->iters[i] == NULL) {
                err = 1;
                break;
            }
            Py_DECREF(arr);
        }
    }
    va_end(va);

    if (!err && PyArray_Broadcast(multi) < 0) {
        err = 1;
    }
    if (err) {
        Py_DECREF(multi);
        return NULL;
    }
    PyArray_MultiIter_RESET(multi);
    return (PyObject *)multi;
}
I’m not seeing any split-and-parallelize logic in what I’m reading from iterators.c in the Numpy library.
There’s no mention of the iterations running concurrently in the chapter, either. The word “simultaneous” does come up in the context of broadcasting. That said, I interpret the author(s) to be saying that multiple broadcasting operations can be run on the same collection of inputs, not that any iterations are happening at the same time. My interpretation could be wrong. Read the chapter! Draw your own conclusions. Write comments about them.
Admittedly, element-wise threading for multiplication could be a lot of threads. It could be up to as many threads as the factors have elements, since all operations are independent. How about dot products, though? That’s a more contained number of potential simultaneous operations: one for each element along the factors’ inner dimension, then a final one to add up all the products. It’s less complicated because there is less pressure to trade off between the number of threads and the number of operations run in a given thread. So how does Numpy do it? Here’s the code:
/**begin repeat
 *
 * #name = BYTE, UBYTE, SHORT, USHORT, INT, UINT,
 *         LONG, ULONG, LONGLONG, ULONGLONG,
 *         FLOAT, DOUBLE, LONGDOUBLE,
 *         DATETIME, TIMEDELTA#
 * #type = npy_byte, npy_ubyte, npy_short, npy_ushort, npy_int, npy_uint,
 *         npy_long, npy_ulong, npy_longlong, npy_ulonglong,
 *         npy_float, npy_double, npy_longdouble,
 *         npy_datetime, npy_timedelta#
 * #out = npy_long, npy_ulong, npy_long, npy_ulong, npy_long, npy_ulong,
 *        npy_long, npy_ulong, npy_longlong, npy_ulonglong,
 *        npy_float, npy_double, npy_longdouble,
 *        npy_datetime, npy_timedelta#
 */
static void
@name@_dot(char *ip1, npy_intp is1, char *ip2, npy_intp is2, char *op, npy_intp n,
           void *NPY_UNUSED(ignore))
{
    @out@ tmp = (@out@)0;
    npy_intp i;

    for (i = 0; i < n; i++, ip1 += is1, ip2 += is2) {
        tmp += (@out@)(*((@type@ *)ip1)) *
               (@out@)(*((@type@ *)ip2));
    }
    *((@type@ *)op) = (@type@) tmp;
}
/**end repeat**/
Whelp, that’s a loop.
So at least in these two cases, Numpy vectorization does not appear to be using parallelization.
That doesn’t mean it’s implemented wrong. I did no research into whether parallelization would, in fact, be simpler or faster in this case than what’s already happening. And the degree to which data scientists talk up Numpy’s speed tells me that, as far as many practitioners are concerned, Numpy’s existing optimizations are puh-lenty fast enough.
However, they’re not fast for the reason that data scientists say they are. They’re not running independent operations concurrently.
They’re looping fast. They’re looping really fast. They employ some genius optimizations to loop really really fast.
But they’re looping.
Conclusion
Numpy vectorization has a reputation in the industry for being fast by capitalizing on parallelization.
Indeed, Numpy is fast. Its iterators are implemented in C, but they are fast for reasons beyond the raw speed of C itself. The base iterator does an excellent job of encapsulating the kernel of commonality between collection operations. As technologists, we can benefit from studying this code as an example of thoughtful software design.
Moreover, that iterator demonstrates some really great choices to get the loops in C to run as fast as possible.
But, as far as I can tell from reading the code and its analysis, it does not capitalize on parallelization.
So what’s the lesson here? The lesson is much like the one we learned from studying the history of API design: just because lots of people think something doesn’t mean it’s true. And if we investigate what we’re learning rather than taking it at face value, sometimes the underlying truth can be just as interesting.
If you love getting into the technical weeds as much as I do, I also recommend:
- This three-part series on the history of API design
A reader added this useful note in the comments: there may still be some parallel evaluation going on, but not thread-level parallelism. Instead, the C compiler (using the kind of dependency analysis discussed above) can generate SIMD instructions that perform up to four basic arithmetic operations at once using extra-wide registers. Pasting the straight C code from this post into https://godbolt.org/ shows SIMD instructions like mulps and addps appearing at high enough optimisation levels. Identifying opportunities to use this kind of optimisation is what compiler people usually mean by “vectorisation”, and it’s really cool stuff: https://en.wikipedia.org/wiki/Automatic_vectorization