Are you writing Python to perform transformations on a large set of numbers all at once?
Much of the Python community implements that with numpy, a library specifically designed to represent and manipulate vectors, matrices, and tensors as n-dimensional arrays.
Its API, like the API of OpenCV that we talked about last month, has some inconsistencies. This means that, if you have a piece of code that does A, you cannot necessarily extrapolate from that code to figure out what piece of code you need to do something similar to A. I’ll share some examples in the API review section of this post.
Resources I Recommend for Numpy
There’s a helpful iPython notebook by Volodymyr Kuleshov and Isaac Caswell from the CS231n Python tutorial by Justin Johnson. The tutorial and the notebook cover more stuff than numpy alone, but they offer a cursory view that you can supplement with the documentation when you need more information.
API Review for Numpy
Let me start with my favorite features of the Numpy API, plus some examples.*
*All examples in Python 2.7. Upversioning to 3.6 only affects the print syntax in these examples. If you wish to use 3.6, encapsulate all print arguments in parentheses. Or, if you’re into overkill, install a version compatibility package like python-future or six.
1. Boolean Array Indexing
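import numpy as np
a = np.array([[1,2], [3, 4], [5, 6]])
bool_idx = (a > 2) # Find the elements of a that are bigger than 2
print bool_idx
[[False False]
[ True True]
[ True True]]
print a[bool_idx] # Select the elements where bool_idx is True
[3 4 5 6]
print a[a > 2] # The same selection in a single expression
[3 4 5 6]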
You can use boolean array indexing in numpy the same way you would use masks in pandas to select and operate only on certain values in your data. It’s especially helpful, of course, for conditions relevant to a characteristic of the data values as opposed to their location within the structure of the data.
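As a minimal sketch of the "select and operate" use, assuming the same array a as above: a boolean mask can also sit on the left-hand side of an assignment, so only the matching values change. The names clipped and mask are illustrative, and print is written with parentheses so the snippet runs under both Python 2.7 and 3.

```python
import numpy as np

a = np.array([[1, 2], [3, 4], [5, 6]])

# Work on a copy so the original data stays untouched.
clipped = a.copy()

# Build the mask from the values themselves, not from their positions.
mask = clipped > 2

# Assign through the mask: only elements where mask is True are replaced.
clipped[mask] = 0

print(clipped)
# [[1 2]
#  [0 0]
#  [0 0]]

# Masks can be combined with & (and), | (or), and ~ (not).
print(a[(a > 2) & (a % 2 == 0)])  # [4 6]
```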
2. An instance method for dot products
These both work:
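A minimal sketch, assuming two small vectors v and w (illustrative values, not taken from the original post):

```python
import numpy as np

v = np.array([9, 10])
w = np.array([11, 12])

# Instance method on the ndarray.
print(v.dot(w))      # 219

# Function on the numpy module.
print(np.dot(v, w))  # 219
```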
I can see arguments both for representing dot product operations as a static method on the numpy library and as an instance method on each ndarray.* That said, I find the instance method version provides more semantic clarity for my implementations that use the dot product. If you disagree with me, you’re in luck; numpy also has a static method (strictly speaking, a module-level function), np.dot, for finding the dot product of two ndarrays.
*ndarray stands for n-dimensional array, and it is the name of the object that numpy allows us to create and manipulate. It should be noted that this object is an array-like representation of vector/matrix/tensor data, rather than a python array or even an array of data. We want to keep in mind the abstraction we have made so we can apply manipulations that make sense for our original data.
3. Broadcasting
Broadcasting describes numpy’s implementation of tiling. That is, asked to perform an operation on two ndarrays of different shapes, numpy will treat the smaller one as if it were repeated to make it the appropriate size to perform the operation with the larger one.
x = np.array([[1,2,3], [4,5,6], [7,8,9]])
v = np.array([1, 0, 1])
y = x + v # Add v to each row of x using broadcasting
print y
[[ 2 2 4]
[ 5 5 7]
[ 8 8 10]]
I like broadcasting because it is a performant way for me to clearly express the kernel of my kernels (ba-dum-tsh!)
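To see what broadcasting saves you from writing, here is a sketch that builds the tiled array by hand with np.tile and checks that it gives the same sum. The name vv is illustrative.

```python
import numpy as np

x = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
v = np.array([1, 0, 1])

# Explicit tiling: stack three copies of v to match x's shape.
vv = np.tile(v, (3, 1))
print(vv.shape)  # (3, 3)

# Broadcasting gives the same result without building vv by hand.
print(np.array_equal(x + v, x + vv))  # True
```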
That said, the feature is, strictly speaking, a violation of the design principle about making your API easy to use and difficult to misuse. It’s totally possible to misuse, or rather, not even realize you are using, a feature that tiles your arrays without so much as a warning message. So, like git push -f and git co [filename], please proceed with the utmost caution.
Kuleshov and Caswell distilled the rules of whether/how broadcasting happens like so:
- If the arrays do not have the same rank, prepend the shape of the lower rank array with 1s until both shapes have the same length.
- The two arrays are said to be compatible in a dimension if they have the same size in the dimension, or if one of the arrays has size 1 in that dimension.
- The arrays can be broadcast together if they are compatible in all dimensions.
- After broadcasting, each array behaves as if it had shape equal to the elementwise maximum of shapes of the two input arrays.
- In any dimension where one array had size 1 and the other array had size greater than 1, the first array behaves as if it were copied along that dimension.
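The rules above can be sketched as a small function over shape tuples. broadcast_shape is a hypothetical helper written for this illustration, not part of numpy; the last lines compare it against the shape numpy itself reports through np.broadcast.

```python
import numpy as np

def broadcast_shape(shape_a, shape_b):
    """Hypothetical helper: apply the broadcasting rules to two shape tuples."""
    # Rule 1: prepend 1s to the shorter shape until both have the same length.
    ndim = max(len(shape_a), len(shape_b))
    a = (1,) * (ndim - len(shape_a)) + tuple(shape_a)
    b = (1,) * (ndim - len(shape_b)) + tuple(shape_b)

    result = []
    for da, db in zip(a, b):
        # Rules 2 and 3: sizes must match, or one of them must be 1.
        if da != db and da != 1 and db != 1:
            raise ValueError("shapes %r and %r are not compatible" % (shape_a, shape_b))
        # Rule 4: where one size is 1, the other size wins.
        result.append(db if da == 1 else da)
    return tuple(result)

print(broadcast_shape((3, 3), (3,)))  # (3, 3)
print(broadcast_shape((4, 1), (3,)))  # (4, 3)
print(broadcast_shape((3,), (3, 1)))  # (3, 3)

# Cross-check against numpy's own answer for the same pair of shapes.
print(np.broadcast(np.empty((4, 1)), np.empty((3,))).shape)  # (4, 3)
```

Note the third call: a (3,) array and a (3, 1) column are compatible and produce a (3, 3) result. If you meant an elementwise operation on two length-3 vectors, that is exactly the kind of silent tiling to watch for.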
Here is a more detailed resource on how broadcasting works, if you’re extra curious :).
More Idiosyncrasies of the Numpy API
Broadcasting isn’t the only sorta odd thing about the numpy API.
1. Inconsistent descriptive structure of shape and rank
Shape and rank each describe the row and column structure of an ndarray.
a = np.array([1, 2, 3])
print a.shape # (3,)
When I’m working with matrices I generally expect the dimensions to be in nrows, ncolumns format, but these appear to come the other way around. Additionally, it seems odd to leave the vertical dimension blank rather than list it as 1. I don’t understand the value that this provides.
Additionally, the shape representation depends, evidently, on an ndarray’s shape itself. What would you expect the shape of b = np.array([[1,2,3],[4,5,6]]) to be? Maybe (3,) only happens when there’s just one dimension, and it will be (3,2). Or maybe the second place in the shape tuple always records one fewer than the number of rows. So then it should be (3,1).
b.shape is (2,3). So in this case we do get nrows x ncolumns format.
print np.ones((1,2)) # [[ 1. 1.]]
print np.ones((2,)) # [ 1. 1.]
Just one array the second time, instead of an array of arrays. What you’re seeing in these inconsistencies is the difference between shape and rank, which numpy’s API seems to go out of its way at times to obfuscate.
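A short sketch of that difference, using the ndim attribute (numpy’s name for what this post calls rank) next to shape. The variable names are illustrative.

```python
import numpy as np

row_matrix = np.ones((1, 2))  # 2-D: one row, two columns
vector = np.ones((2,))        # 1-D: two elements, no row/column distinction

print(row_matrix.ndim)   # 2
print(row_matrix.shape)  # (1, 2)
print(vector.ndim)       # 1
print(vector.shape)      # (2,)

# The rank is the length of the shape tuple.
print(len(vector.shape) == vector.ndim)  # True

# reshape turns the 1-D vector into an explicit row or column.
print(vector.reshape(1, 2).shape)  # (1, 2)
print(vector.reshape(2, 1).shape)  # (2, 1)
```

Read this way, (3,) is a one-element tuple describing a rank 1 array, not a row-and-column pair with a missing entry.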
2. Creating random ndarrays
To obtain a 2×2 ndarray of random numbers, you do:
e = np.random.random((2,2))
Why is random…twice? No, just once does not work:
e = np.random((2,2))
TypeError: 'module' object is not callable
Which would make sense if we could never call a function directly on np and get an instance. But it works for .ones(), as we see in the previous example. It also works for .eye() to represent the identity matrix. So why doesn’t it work for a random array?
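One way to poke at this, as a sketch: np.ones and np.eye are functions on the numpy module, while np.random is itself a submodule that holds its own functions. The checks below use the standard library’s types.ModuleType.

```python
import types
import numpy as np

# np.ones and np.eye are functions, so they can be called directly.
print(callable(np.ones))  # True
print(callable(np.eye))   # True

# np.random is a submodule, which is why calling it raises TypeError.
print(isinstance(np.random, types.ModuleType))  # True

# The array-producing function lives inside that submodule.
e = np.random.random((2, 2))
print(e.shape)  # (2, 2)
```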
3. Slices modify the original ndarray
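a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]]) # illustrative array; any 2-D ndarray works
print a[0, 1] # 2
b = a[:2, 1:3] # illustrative slice; b is a view onto a's data
b[0, 0] = 77 # b[0, 0] is the same piece of data as a[0, 1]
print a[0, 1] # 77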
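Even when you assign the slice to a new variable, you’re still pointing to the original object’s data. This is not the behavior I expected, since slicing a plain Python list gives you a copy. It should be noted that integer array indexing (like b = a[[0, 1]]) does not work this way; it returns a copy.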
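A sketch of how to check whether you are holding a view or a copy, and how to opt out of the sharing with .copy(). The array and the names view, safe, and picked are illustrative; np.shares_memory is numpy’s own check.

```python
import numpy as np

a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

# A slice is a view: it shares memory with a.
view = a[:2, 1:3]
print(np.shares_memory(a, view))  # True

# An explicit copy owns its data, so writes to it leave a alone.
safe = a[:2, 1:3].copy()
safe[0, 0] = 77
print(a[0, 1])  # 2

# Integer array indexing also returns a copy rather than a view.
picked = a[[0, 1]]
print(np.shares_memory(a, picked))  # False
```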
4. Integer indexing + slicing gives an array of lower rank, whereas only slicing gives an array of the same rank.
a = np.array([[1,2], [3, 4], [5, 6]])
a[[0, 1, 2], [0, 1, 0]]
is the same as
np.array([a[0, 0], a[1, 1], a[2, 0]])
Maybe this stems from the general confusion about rank, but I don’t see the practical application for this inconsistency.
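To make the rank difference in this heading concrete, here is a sketch that selects the same row two ways and compares shapes. The array and variable names are illustrative.

```python
import numpy as np

a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

# Integer index + slice: the indexed dimension is dropped.
row_r1 = a[1, :]
print(row_r1.shape)  # (4,)

# Slice only: both dimensions survive, even though one has size 1.
row_r2 = a[1:2, :]
print(row_r2.shape)  # (1, 4)

# Integer array indexing picks one element per (row, column) pair.
b = np.array([[1, 2], [3, 4], [5, 6]])
print(b[[0, 1, 2], [0, 1, 0]])  # [1 4 5]
```

Both selections hold the same four values; only the rank differs, which matters as soon as the result is fed into a dot product or a broadcast.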
Numpy is a powerful library, and it’s definitely the library of choice for batch numeric manipulation within the python community. It has some really useful features for writing expressive, performant linear algebra transformations. That said, using the API sometimes feels like riding a bike that got assembled just a little bit off balance. Some of those inconsistencies trip you up with errors and others just produce the wrong result with no warning. For this reason, I highly recommend having an idea of what final result you’re looking for as you write transformations until you become familiar with numpy.