Python’s best known scientific computing libraries have lots of documentation about how to use them.* *

**But how do the tools work under the hood?** Google searches turn up fewer answers to this.

Luckily the tools are open source, so I can dive into the code myself.

I did a post about diving into the Scipy CSR Matrix. As it turns out, the code reconstitutes sparse arrays into dense ones with a pair of `for`

loops in C++. It’s fast because C is fast.

But that got me thinking about vectorization in Numpy. Numpy arrays tout a performance (speed) feature called *vectorization. *

The generally held impression among the scientific computing community is that vectorization is fast because it replaces the *loop* (running each item one by one) with something else that runs the operation on several items *in parallel*. That technique is commonly called **parallelization** in the wider programming community.

But what if these lauded “vectorizations” are, at their core, just `for`

loops in a fast language (namely C) without any parallel magic? Let’s find out.

### How do we know if Numpy’s speed is due to C or to parallelization?

Let’s do this the hard way, because the more gray hairs we give ourselves the better we’ll remember the lesson :).

We’ll write the loops in C and compare their speed to Numpy’s. If the loops are beating Numpy, then maybe Numpy *is* just loops. If Numpy beats the loops, maybe something more clever is going on in there.

Before we do this, let’s acknowledge that this has been done. Charles Menguy shared some results on StackOverflow in 2012:

I was trying to figure out the fastest way to do matrix multiplication and tried 3 different ways:

- Pure python implementation: no surprises here.
- Numpy implementation using
`numpy.dot(a, b)`

- Interfacing with C using
`ctypes`

module in Python.

Charles’ test purported to demonstrate that numpy operates *faster* than a straight looping C implementation. We’re not going to just leave it at that, but I figured I’d share the chart because charts are pretty:

### Wait, why can’t we just leave it at that?

The thing is, the implementation deployed for this particular test is more naive than a looping implementation has to be. Granted it is meant to work for an array of arbitrary size and dimension, but it doesn’t do that, so I cannot give it full marks for this. Additionally, the stride is implicitly, rather than explicitly, defined. Also, the `matmult`

method does a multiplication operation that could benefit from a structural efficiency by transposing one of the factors, and this implementation doesn’t do that (you’ll see why this is useful in a diagram we’ll draw later).

I redid the C with an explicitly defined matrix size and dimensionality. Matrix-handling libraries commonly have separate array manipulation implementations not only for separate numbers of dimensions, but even for separate *data types*.

I also added a struct for object clarity.

### C Experiment Number 1: Straight C with ctypes

Here’s all the code on Github, if you’re into that kind of thing.

#include <stdio.h> | |

#include <stdlib.h> | |

typedef struct Collection { | |

int size; | |

float data[2000][2000]; | |

} Collection; | |

void multiply(Collection *a, Collection *b, Collection *output, int x_dim, int y_dim) { | |

int i,j; | |

for (i = 0; i < x_dim; ++i) { | |

for (j = 0; j < y_dim; ++j) { | |

output->data[i][j] = a->data[i][j] * b->data[i][j]; | |

} | |

} | |

} |

Here are the multiplication methods:

import ctypes | |

import time | |

from _ctypes import Structure, POINTER, byref | |

from ctypes import c_int, c_float | |

import numpy as np | |

def c_multiplication(a, b): | |

class Collection(Structure): | |

_fields_ = [("size", c_int), | |

("data", (c_float * a.shape[1]) * a.shape[0])] | |

libmatmult = ctypes.CDLL("./multiply.so") | |

libmatmult.multiply.argtypes = [POINTER(Collection)] | |

libmatmult.multiply.restype = None | |

t_a = Collection() | |

t_b = Collection() | |

t_c = Collection() | |

for i in range(a.shape[0]): | |

for j in range(a.shape[1]): | |

t_a.data[i][j] = a[i][j] | |

t_c.data[i][j] = a[i][j] | |

for i in range(b.shape[0]): | |

for j in range(b.shape[1]): | |

t_b.data[i][j] = b[i][j] | |

start = time.time() | |

libmatmult.multiply(byref(t_a), byref(t_b), byref(t_c), a.shape[0], a.shape[1]) | |

end = time.time() | |

print(end – start) | |

c_result = np.ones(a.shape) | |

for i in range(a.shape[0]): | |

for j in range(a.shape[1]): | |

c_result = (t_c.data[i][j]) | |

return c_result | |

def python_multiplication(a,b): | |

product = np.ones(a.shape) | |

for row_index in np.arange(a.shape[0]): | |

for col_index in np.arange(a.shape[1]): | |

product[row_index][col_index] = a[row_index][col_index] * b[row_index][col_index] | |

return product |

and the calls:

import numpy as np | |

import time | |

from multiply import c_multiplication, python_multiplication | |

a = np.random.rand(2000,2000) | |

b = np.random.rand(2000,2000) | |

print("Timing python multiplication:\n") | |

start = time.time() | |

python_result = python_multiplication(a, b) | |

end = time.time() | |

print(end – start) | |

print("Timing C multiplication:\n") | |

c_result = c_multiplication(a, b) | |

print("Timing numpy multiplication:\n") | |

start = time.time() | |

python_result = a * b | |

end = time.time() | |

print(end – start) |

Results:

It looks like our straight Python takes 4 seconds (ouch), followed by the Numpy implementation at 0.013 seconds, followed by our C at 0.009 seconds. These numbers differ a little each time I run this code, but these results are typical.

Worth noting: the C timing is faster than the Numpy timing right now, but as you can see, the C time does *not* include loading the data into a C array and then back into a python-usable array. Since I do that element by element with python, it wouldn’t be a fair comparison to the C implementation with that in there. But since Numpy takes *and* returns a python-usable collection, this timing method isn’t exactly fair to Numpy.

### C Experiment Number 2: Cython Conversion of Straight Python

Let’s try this with Cython instead of with ctypes. Cython handles its own python to C conversion, so we will literally copy our python implementation *into* a Cython rig first and see if we pick up any speed that way.

Python code converting to Cython:

import numpy as np | |

def cython_multiplication(a,b): | |

product = np.ones(a.shape) | |

for row_index in np.arange(a.shape[0]): | |

for col_index in np.arange(a.shape[1]): | |

product[row_index][col_index] = a[row_index][col_index] * b[row_index][col_index] | |

return product |

updated execution:

import ctypes | |

import time | |

from _ctypes import Structure, POINTER, byref | |

from ctypes import c_int, c_float | |

import pyximport; pyximport.install() | |

import multiplication | |

import numpy as np | |

a = np.random.rand(2000,2000) | |

b = np.random.rand(2000,2000) | |

print("Timing python multiplication:\n") | |

start = time.time() | |

python_result = python_multiplication(a, b) | |

end = time.time() | |

print(end – start) | |

print("Timing ctypes C multiplication:\n") | |

c_result = c_multiplication(a, b) | |

print("Timing cython multiplication:\n") | |

start = time.time() | |

python_result = multiplication.cython_multiplication(a,b) | |

end = time.time() | |

print(end – start) | |

print("Timing numpy multiplication:\n") | |

start = time.time() | |

python_result = a * b | |

end = time.time() | |

print(end – start) |

Updated results:

Yow. Our Cython version takes 3.2 seconds—better than the straight Python without conversion, but nowhere near as fast as the straight C or Numpy.

Of course, the C part is still faster. That’s why the Cython one can load a python collection *into* C, run the transformation, unload the C array *back into python*, and still finish faster than the python implementation did despite having an extra step at the beginning and the end.

Here’s the Python interaction heat map for this code. We have a *lot* of Python interaction going on here as indicated by the yellow highlights: a middle yellow line on the import statement, then five lines of intense yellow, and a faint yellow on that return value. In fact, not a single line of our code avoids Python interaction.

The Python interaction is a pretty gaping speed leak. Let’s do this same thing in a more C-ish manner and see if it’s any faster. Note that we have even defined our API in C style: we pass a return value, and we pass the number of dimensions as integers. I got lazy at the end and used `np.asarray()`

to convert my return value. Sue me.

### C Experiment Number 3: Cython, But Make it C-ish

New code for Cython:

import numpy as np | |

def cython_multiplication(double[:,:] a, double[:,:] b, double[:,:] out, int x_dims, int y_dims): | |

cdef int i, j | |

for i in range(x_dims): | |

for j in range(y_dims): | |

out[i, j] = a[i,j] * b[i,j] | |

return np.asarray(out) |

Updated execution:

print("Timing cython multiplication:\n") | |

start = time.time() | |

python_result = multiplication.cython_multiplication(a, b, c, 2000, 2000) | |

end = time.time() | |

print(end – start) |

Updated results:

We’re down to 0.015 seconds on that Cython code! We picked up a lot of speed dropping straight into C. Python interaction analysis is much less yellow:

The remaining yellow is:

Line 1 (middle yellow): Import a Python API

Line 3:(intense yellow) Take in 5 Python objects

Line 8:(faint yellow) Convert 3 Python collections to C arrays

Line 10:(middle yellow) Convert a C object into a Python collection.

Now we have something a lot faster than pure Python that takes in the Python inputs and returns a Python-manipulable collection.

But numpy’s multiplication is still clocking in at a 33% time savings off this.

So, the emperor is wearing clothes! The speed of C alone doesn’t fully explain Numpy’s speed. Numpy did all that same stuff faster than a naive looping C implementation.

### What sort of implementation would mean *less* looping?

Well, instead of looping over elements in a blocking synchronous queue, it might take all the independent operations and do them at once.

For elementwise multiplication, for example, all of the calculations are *independent*. You don’t need the result of any particular element by element multiplication to calculate any of the others. One could, theoretically, open as many threads as there are elements in the base arrays and run *all the calculations concurrently*. The sky’s the limit! Well, not really: your CPU is the limit. But you get the idea.

The approach is even viable if some of the calculations *do* depend on one another. Take the dot product as an example: to calculate a dot product, one takes two arrays with corresponding numbers of elements in their inner dimensions, calculates an array with dimensions corresponding to the factors’ outer dimensions, then *sums up* those products to produce a scalar result.

In this case, the sum operation needs the multiplication operations to finish before it can finish because it needs to sum up the products. So we cannot put each element in its own thread and call it a day. We *could*, however, parallelize the multiplication operations, then run addition later.

Here’s a picture of a naive dot product implementation on the left, then an illustration of how we can imagine a parallelized version to work on the right. We don’t *know* at this point whether this is how vectorization works in Numpy, but this is how it *could* work if the code is indeed taking advantage of parallelization:

A note: this is the thing I mentioned about transposing helping us out. When we transpose the second array for doing the dot product, we can now do the multiplication calculations *element-wise* by the outer dimension, rather than having to move over one dimension in the first factor and then loop over the other one in the other factor. We can even use the same indices this way.

So, the million dollar question: *does Numpy do this? *Let’s see what we can learn about how this works, first from a few papers and then from the code itself.

### The Scuttlebutt

If you’re looking to familiarize yourself with Numpy use cases, I can recommend this paper by Van der Walt, Colbert, and Varoquaux as a useful rundown. But it doesn’t address how the features work underneath the API.

If you want to know all the gritty implementation details, the piece you’re looking for is *Beautiful Code*, Chapter 19: “Multidimensional Iterators in Numpy” by Travis E. Oliphant.

I won’t quote the entire chapter in this post, but the chapter first discusses the historical challenges of iteration over memory addresses, then explains the pieces necessary for a general solution. With that explanation complete, we arrive at the following signature for the base iterator object in Numpy. I have added arrows with brief explanations, but for full context I recommend the chapter itself:

typedef struct {<————————-immutable key value data store to iterate over given array | |

PyObject_HEAD | |

int nd_m1;<————————number of dimensions in array minus 1 | |

npy_intp index;<——————-counter for element that iterator is at | |

npy_intp size;<——————–number of elements in array | |

npy_intp coords[NPY_MAXDIMS];<—–the N-dimensional location of the current element | |

npy_intp dims_m1[NPY_MAXDIMS];<—-number of elements along each dimension minus 1 | |

npy_intp strides[NPY_MAXDIMS];<—-number of bytes separating the elements in each dimension | |

npy_intp backstrides[NPY_MAXDIMS];<number of bytes to subtract to return to beginning of dimension | |

npy_intp factors[NPY_MAXDIMS];<—-for converting from 1D index to N-dimensional coords in 1D arrays | |

PyArrayObject *ao;<—————-pointer to the underlying array this iterator is built from | |

char *dataptr;<——————–pointer to the (first byte of) the current value of the array | |

npy_bool contiguous;<————–TRUE (1) if the array elements are all next to each other in memory | |

} PyArrayIterObject;<——————name of struct object |

Sure enough, the `PyArrayIterObject`

appears in the Numpy elementwise iteration function:

/*********************** Element-wise Array Iterator ***********************/ | |

/* Aided by Peter J. Verveer's nd_image package and numpy's arraymap ****/ | |

/* and Python's array iterator ***/ | |

/* get the dataptr from its current coordinates for simple iterator */ | |

static char* | |

get_ptr_simple(PyArrayIterObject* iter, npy_intp *coordinates) | |

{ | |

npy_intp i; | |

char *ret; | |

ret = PyArray_DATA(iter->ao); | |

for(i = 0; i < PyArray_NDIM(iter->ao); ++i) { | |

ret += coordinates[i] * iter->strides[i]; | |

} | |

return ret; | |

} |

By the way folks, *that right there* is how you do attribution in code.

We also see the struct in use in the C-converted code from our own elementwise multiplication function. Here, we see the a PyArrayIterObject generated to manage iteration over our output array, `out`

:

Mind you, this is the *base* iterator for the Numpy arrays. It goes element by element in a given dimension, just like the iterators you’re used to. The design of this *particular *iterator takes advantage of a number of performance enhancements related to memory usage and compute time.

For example:

A common motif in NumPy is to gain speed by concentrating optimizations on the loop over a single dimension where simple striding can be assumed. Then an iteration strategy that iterates over all but the last dimension is used. This was the approach introduced by NumPy’s predecessor, Numeric, to implement the math functionality.

In NumPy, a slight modification to the NumPy iterator makes it possible to use this basic strategy in any code. The modified iterator is returned from the constructor as follows:

it = PyArray_IterAllButAxis(array, &dim).The PyArray_IterAllButAxis function takes a NumPy array and the address of an integer representing the dimension to remove from the iteration. The integer is passed by reference (the & operator) because if the dimension is specified as –1, the function determines which dimension to remove from iteration and places the number of that dimension in the argument. When the input dimension is –1, the routine chooses the dimension with the smallest nonzero stride.

Another choice for the dimension to remove might have been the dimension with the largest number of elements. That choice would minimize the number of outer loop iterations and reserve the most elements for the presumably fast inner loop. The problem with that choice is that getting information in and out of memory is often the slowest part of an algorithm on general-purpose processors.

As a result, the choice made by NumPy is to make sure that the inner loop is proceeding with data that is as close together as possible. Such data is more likely to be accessed more quickly during the speed-critical inner loop.

OK, so each loop should run *fast*. But to implement *true* parallelization, Numpy would need to run multiple iterators *at the same time*. How does Numpy handle multiple iterations within an array or collection of arrays?

## Multiple Iterations

Another common task in NumPy is to iterate over several arrays in concert. For example, the implementation of array addition requires iterating over both arrays using a connected iterator so that the output array is the sum of each element of the first array [sic]multiplied by* each element of the second array.

*This is exactly what the chapter says. This paragraph starts by exemplifying elementwise addition in arrays, but then explains that as taking the sum of elementwise multiplication. Those two operations are not the same thing. I suspect what happened here is that the example was *originally* elementwise multiplication (that is, exactly what we are doing), but an editor suggested elementwise addition instead because addition is a sort of universal base case for scalar operations, and most of this paragraph got changed, but the explanation of the operation did not. We will operate under the assumption that this whole thing is about elementwise addition and “multiplied by” should say “and” (for addition of two elements).

So anyway, what does the chapter say we can do about that?

This can be accomplished using a different iterator for each of the input elements [the factor arrays] and an iterator for the output array in the normal fashion.

Alternatively, NumPy provides a multi-iterator object that can simplify dealing with several iterators at once.

This is where the parallelization would happen: multiple iterators running at the same time on an array or combination of arrays. In this case, we have an iterator for each individual input—an input being one of the addend arrays, rather than the elements thereof.

Here’s the method that returns us that multi-iterator object in Numpy:

/*NUMPY_API | |

* Get MultiIterator from array of Python objects and any additional | |

* | |

* PyObject **mps — array of PyObjects | |

* int n – number of PyObjects in the array | |

* int nadd – number of additional arrays to include in the iterator. | |

* | |

* Returns a multi-iterator object. | |

*/ | |

NPY_NO_EXPORT PyObject * | |

PyArray_MultiIterFromObjects(PyObject **mps, int n, int nadd, …) | |

{ | |

va_list va; | |

PyArrayMultiIterObject *multi; | |

PyObject *current; | |

PyObject *arr; | |

int i, ntot, err=0; | |

ntot = n + nadd; | |

if (ntot < 1 || ntot > NPY_MAXARGS) { | |

PyErr_Format(PyExc_ValueError, | |

"Need at least 1 and at most %d " | |

"array objects.", NPY_MAXARGS); | |

return NULL; | |

} | |

multi = PyArray_malloc(sizeof(PyArrayMultiIterObject)); | |

if (multi == NULL) { | |

return PyErr_NoMemory(); | |

} | |

PyObject_Init((PyObject *)multi, &PyArrayMultiIter_Type); | |

for (i = 0; i < ntot; i++) { | |

multi->iters[i] = NULL; | |

} | |

multi->numiter = ntot; | |

multi->index = 0; | |

va_start(va, nadd); | |

for (i = 0; i < ntot; i++) { | |

if (i < n) { | |

current = mps[i]; | |

} | |

else { | |

current = va_arg(va, PyObject *); | |

} | |

arr = PyArray_FROM_O(current); | |

if (arr == NULL) { | |

err = 1; | |

break; | |

} | |

else { | |

multi->iters[i] = (PyArrayIterObject *)PyArray_IterNew(arr); | |

if (multi->iters[i] == NULL) { | |

err = 1; | |

break; | |

} | |

Py_DECREF(arr); | |

} | |

} | |

va_end(va); | |

if (!err && PyArray_Broadcast(multi) < 0) { | |

err = 1; | |

} | |

if (err) { | |

Py_DECREF(multi); | |

return NULL; | |

} | |

PyArray_MultiIter_RESET(multi); | |

return (PyObject *)multi; | |

} |

I’m not seeing any split-and-parallelize logic in what I’m reading from `iterators.c`

in the Numpy library.

There’s no mention of the iterations running concurrently in the chapter, either. The word “simultaneous” does come up in the context of broadcasting. That said, I interpret the author(s) to be saying that multiple broadcasting operations can be run on the same collection of inputs—not that any iteration are happening at the same time. My interpretation could be wrong. Read the chapter! Draw your own conclusions. Write comments about them.

Admittedly, element-wise *threading* for multiplication could be a lot of threads. It could be up to as many threads as the factors have elements, since all operations are independent. How about dot products, though? That’s a more contained number of potential simultaneous operations: one for each inner dimension of the factors, then a final one to add up all the products. It’s less complicated because there is less pressure to trade off between number of threads and number of operations run in a given thread. So how does Numpy do it? Here’s the code:

/**begin repeat | |

* | |

* #name = BYTE, UBYTE, SHORT, USHORT, INT, UINT, | |

* LONG, ULONG, LONGLONG, ULONGLONG, | |

* FLOAT, DOUBLE, LONGDOUBLE, | |

* DATETIME, TIMEDELTA# | |

* #type = npy_byte, npy_ubyte, npy_short, npy_ushort, npy_int, npy_uint, | |

* npy_long, npy_ulong, npy_longlong, npy_ulonglong, | |

* npy_float, npy_double, npy_longdouble, | |

* npy_datetime, npy_timedelta# | |

* #out = npy_long, npy_ulong, npy_long, npy_ulong, npy_long, npy_ulong, | |

* npy_long, npy_ulong, npy_longlong, npy_ulonglong, | |

* npy_float, npy_double, npy_longdouble, | |

* npy_datetime, npy_timedelta# | |

*/ | |

static void | |

@name@_dot(char *ip1, npy_intp is1, char *ip2, npy_intp is2, char *op, npy_intp n, | |

void *NPY_UNUSED(ignore)) | |

{ | |

@out@ tmp = (@out@)0; | |

npy_intp i; | |

for (i = 0; i < n; i++, ip1 += is1, ip2 += is2) { | |

tmp += (@out@)(*((@type@ *)ip1)) * | |

(@out@)(*((@type@ *)ip2)); | |

} | |

*((@type@ *)op) = (@type@) tmp; | |

} | |

/**end repeat**/ |

Whelp, that’s a loop.

So at least in these two cases, Numpy vectorization *does not* appear to be using parallelization.

That doesn’t mean it’s implemented wrong. I did no research into whether parallelization would, in fact, be simpler or faster in this case than what’s already happening. And the degree to which data scientists talk up Numpy’s speed tells me that, as far as many practitioners are concerned, Numpy’s existing optimizations are puh-lenty fast enough.

However, they’re not fast for the reason that data scientists say they are. They’re not running independent operations concurrently.

They’re looping fast. They’re looping *really *fast. They employ some genius optimizations to loop *really really *fast.

But they’re looping.

### Conclusion

Numpy vectorization has a reputation in the industry for being fast by capitalizing on parallelization.

Indeed, Numpy is fast. Its iterators compile back to C, but their implementation is fast for reasons other than C itself. The base iterator does an excellent job of encapsulating the kernel of commonality between collection operations. As technologists, we can benefit from studying this code as an example of thoughtful software design.

Moreover, that iterator demonstrates some really great choices to get the loops in C to run as fast as possible.

But, as far as I can tell from reading the code and its analysis, it does not capitalize on parallelization.

So what’s the lesson here? The lesson is much like the lesson we learned from studying the history of API design. The lesson is that just because lots of people think something, doesn’t mean it’s true. And if we investigate what we’re learning rather than take it at face value, sometimes the underlying truth can be just as interesting.

### If you love getting into the technical weeds as much as I do, I also recommend:

This three part series on the history of API design

This is a great analysis. Excellent.

Thank you!!!!! This comment made my day :).

this question has been bugging my mind for days about what was really happening under the hood until i stumbled upon this artcle.

Great work Chelsea, keep it up:)

There may still be some parallel evaluation going on, but not thread-level parallelism: instead, the C compiler (using the kind of dependency analysis you talked about) can generate SIMD instructions that can perform up to four basic arithmetic operations at once using extra-wide registers. Pasting your C code into https://godbolt.org/, I see SIMD instructions like mulps and addps appearing at high enough optimisation levels. Identifying opportunities to use this kind of optimisation is what compiler people usually mean by “vectorisation”, and it’s really cool stuff: https://en.wikipedia.org/wiki/Automatic_vectorization