Python has several popular scientific computing libraries, complete with lots of documentation about how to use them.
But how do the tools work under the hood? Google searches turn up fewer answers to this.
Luckily the tools are open source, so I can dive into the code myself.
We’ll start with the Scipy CSR sparse matrix. What is this thing, and how does it work?
Before We Begin: Notes on My Process
- The code is in gists for accessibility. It represents the Scipy library as of pull time on Sunday, September 23, 2018, at 3:24 PM.
- This post examines the code. It is neither a critique nor an endorsement of the code. I’ll point out some structural choices the programmers made alongside some alternatives, but I will not make definitive judgments about the “correct” choice.
- I used the PyCharm IDE to navigate through method implementations, fold and unfold code, annotate code, read commit messages, and generate gists of the code I wanted to show you. PyCharm Community Edition is available for free right here.
What is Scipy?
Scipy is a Python-based ecosystem of open-source software for mathematics, science, and engineering. The CSR sparse matrix is a data type inside of scipy that the library uses to represent sparse matrices.
What are sparse matrices?
In general: they are collections in which the vast majority of the items are some default value (usually
For Python scientific computing: they are enormous (thousands to millions of items) arrays where the vast majority of the items are zero.
A lot of scientific computing (and data science in general) relies (for now) on our ability to represent our data in numerical terms. That includes inputs like music, photographs, and text. We have techniques to turn these kinds of inputs into numbers, and sometimes those techniques give us enormous sparse matrices.
Take count vectorization, for example. Suppose you have a million sentences, and you want numeric representations of all of them. So you make an array with a zero in it for every word in the English language, with the first zero representing “a”, the second zero representing “aardvark,” and so on. Now, for each sentence, you start with that giant array, and you replace that zero with the number of times that word appears in your sentence. You make one of these for every single sentence. For each sentence, a very small number of these zeros get replaced. But each array is still the length of the entire English lexicon (230,000ish). It would take up a lot of memory, and iterating over it would be a saga.
What is the CSR Sparse Matrix?
CSR stands for “compressed sparse row,” and it’s a nod to the function of the CSR sparse matrix—a data type implemented in Scipy specifically to store sparse matrices as much smaller objects than an array with a crapload of zeros. This makes them faster and more lightweight so they can run on smaller/fewer machines.
Use case: the
TfidfVectorizer in scikit-learn returns its encodings as a CSR because the tf-idf encodings, like the word count encodings we discussed above, are mostly zeros for any given text in a much larger body of texts.
How does the CSR Sparse Matrix Work?
Let’s take a look!
First I cloned the repo from here and then opened it in PyCharm.
The class hierarchy goes
csr.py/csr_matrix --> compressed.py/_cs_matrix --> data.py/_data_matrix --> base.py/spmatrix. The
spmatrix cannot be instantiated directly: instead, clients should instantiate a subclass. This is because each sparse matrix type still needs specific behavior that spmatrix does not define itself.
Let’s go to the most specific class in the hierarchy and pretend we are instantiating it with a dense array. A dense array, in contrast to a sparse array, is the full-size, full-memory collection representation, crapload of zeros and all.
csr_matrix does not have its own initializer, so it leans on its superclass for that functionality. We find the following initializer in
I have ellipsed some code here to focus our discussion. We’re imagining ourselves saying something like
c = csr_matrix(array), where
array is a dense numpy array. It’s not a sparse matrix (so our code path skips the conditional on line 7) and it’s not a tuple (so it skips the conditional on line 10). We get
np.asarray(arg1) on line 16, and then on line 25 we get .
check_shape(shape). Let’s look at the implementation of that.
That implementation lives in
sputils.py. Here is the first line:
What follows is 56 lines of code to allow this csr_matrix to imitate the return value of
array.shape. Why is this method imitating
.shape? Because a
csr_matrix does not store all the values like a dense array does.
So how does it store the values? Let’s return to the
_cs_matrix initializer to find out.
Saving the Locations of Nonzero Data
On line 21 of the
_cs_matrix initializer, we set
self (the data representation) to a
coo_matrix. ‘COO’ stands for ‘coordinate.’ This type of matrix stores information about the locations of things in the data.
Here is the
What is happening here? First, the initializer re-represents the input dense array in at least 2 dimensions. If it receives a one-dimensional numpy array
[0,1,0,0,0,3,0,0,0,1], it converts that to
[[0,1,0,0,0,3,0,0,0,1]]. This step simplifies the way that sparse matrix data types index the nonzero values.
See that call to
M.nonzero() on line 24? That’s where we make these indices. Some types of sparse arrays have their own implementations of this method, but we will stick with the
csr_matrix for now, which does not. Instead, it uses the version of
.nonzero() that lives on
This method first takes the data and converts it into coordinate format. Now, to access the original data, a client would call
.data on the coordinate object. Next, this method creates a 2d mask of that data that contains
False wherever it has a zero value and
True wherever it has a nonzero value. It then returns the coordinates at which that data has a nonzero value. If we were to call this method and then zip the return values together, we would get a list of tuples that represent coordinate pairs. Then, if we were to display our data array as a field of x and y values, with
[0,0] at the top left and
[n,n] at the bottom right, our zipped up result would give us the coordinates of every item in the array with a nonzero value.
Before we move on, we have an opportunity here to talk about a tradeoff the Numpy team faced in the design of this location encoding.
[[t,h,i,s]] when passed in a collection with one dimension like
[t,h,i,s]. Here is what happens when you save the indices of the nonzero items in such a collection:
This array of 29 items, 7 of which are nonzero, gets two 7-item coordinate arrays. The X-coordinate array, highlighted in blue, will always contain exclusively zeros for a one-dimensional matrix. We know that this array will be a zeros array of the same length as the second array, which gives us the indices of the nonzero items in the way we’re familiar with from
np.argsort(arr), or general ordinal collection indexing for that matter.
In order to not store this sort of superfluous first array, the Numpy team could have made a number of choices.
For example, instead of converting arrays to at least 2d,
.nonzero() could check the dimensions and only return one index array out of this method for a 1d array versus two for a 2d array. This inconsistency in the number of return values from this method, though, would have forced any client who calls
nonzero() to check the number of return values before using them for anything. Each client would also have to either handle 1d and 2d arrays differently, with ordinal indexing for 1d arrays and coordinate indexing for 2d arrays, or independently new up an
np.zeros(indices.shape) to make their own first array for indexing here. This would complicate the tasks of both reading and writing client code.
For another example, they could instantiate separate classes of coordinate matrices: one for 1d arrays and another for 2d arrays. The
spmatrix could then instantiate one or the other conditionally.
The 1d implementation could preserve a consistent interface with the 2d
coo_matrix class by responding to calls to
.nonzero() first with a
np.zeros(indices.shape) , then with the indices collection itself.
For clients, nothing would change, but the zero-index matrix would instantiate on demand rather than live indefinitely in the data representation. We’d get a nice API and a smaller data footprint, but this solution would also mean another class for contributors to maintain. Any changes to indexing behavior could have to happen in two places.
I can understand why the team might therefore avoid splitting this class until they have a few different behavioral differences that merit separate classes. In the meantime the current implementation banks on the data being sparse enough a 1.5x data footprint for a 1d collection is worth the simplicity of one fewer matrix class.
A Note: CSR Sparse Matrices Don’t Support Tensors.
When we instantiate
coo_matrix from a tensor (matrix with more than 2 dimensions), we see the result on line 20:
raise TypeError('expected dimension <= 2 array or matrix'). Scientists who need a sparse matrix of a tensor either extend
coo_matrix or reimplement
sparray for tensors. There is an ongoing effort to get such a representation inside of scipy itself. In the meantime, another common approach is to reshape 3 and higher dimensional data into a very long one-dimensional array and use it like that. We see this approach in Andrew Ng's vectorized images for training deep learning classifiers.
Saving the Values of Nonzero Data
Besides all of the coordinates, the sparse matrix can assume that every other item is a zero. So it doesn't have to store that—it can fill it in later! But what about the values of the nonzero items that do have coordinates? We store the nonzero values in addition to their locations.
We can see that in the initializer of the
coo_matrix on line 25:
We're saving off those nonzero values in
Memory and Space in Sparse vs. Dense Matrices
Let's analyze the space constraints for this data type versus a dense collection. Remember, we store three things for each nonzero item: x coordinate, y coordinate, value. If one third of the array has nonzero values, this implementation of the sparse matrix requires the same number of memory allocation spots for the data as the dense matrix. If more than a third of the array has nonzero values, the dense representation takes fewer memory allocation spots!
The data type we're exploring here relies on sparseness for efficiency. We use it for collections with way more zeros than nonzero items.
For example, I'm building a text processing model right now with tf-idf vectorizations of the text. The texts I'm vectorizing range between 250 words and 1000 words. Each one gets vectorized into a tf-idf representation of the 40,000 words, bigrams, and trigrams in the text with the highest tf-idf scores. If every word of the largest sample were also part of either a bigram or trigram (unlikely), it would still only have 2,000 values filled in this 40,000 item collection, translating to 6,000 memory allocation spots for this data.
That's 24 kilobytes (4 bytes per integer), which is 15% of the 160 kilobytes it would take to store a dense representation. Now imagine 100,000 of those in a file. That makes a huge difference in how I can store and process my data.
How does our sparse matrix get recomposed as a dense matrix?
When we call
.toarray() on a
csr_matrix, the class relies on the implementation in its superclass,
Another matrix, the compressed sparse column matrix (
csc_matrix), also inherits this method. CSR and CSC matrices are very similar: one stores data in rows and the other stores data in columns. The conditional logic on lines 5, 6, and 11-16 rotate the output array as needed to return the data in the right shape for its data type.
More germane to our question about how we rebuild a dense representation for these sparse matrices is
self._process_toarray_args(order, out), which lives in the base
spmatrix class and looks like this:
This method takes the
shape, which is stored on the class from the length of horizontal and vertical indices we use to store nonzero item locations. It returns a numpy array of that shape with all zeros. This is our starting point.
Now we fill in the values. That happens in line 17 of
.toarray() when we call
_sparsetools.csr_todense(M, N, x.indptr, x.indices, x.data, y) . That method is written in C++. There's a Python generator for it here:
Wanna see some C++? Of course you do! Let's look at that method:
.csr_todense() takes a number of rows, a number of columns, a collection of x coordinates, a collection of y coordinates, then the nonzero values and, finally, a data array.
.toarray() passed in all these values: that final data array is the
np.zeros() template array from
I know this code is not Python, but you might recognize the general structure: a nested
.csr_todense() method loops through each x coordinate and y coordinate. It updates each location by adding the next item in the values array.
0.0 + some nonzero integer = the nonzero integer. The method returns the data array at the end with the nonzero values in their proper places.
And voilà! It has recomposed a sparse matrix to a dense representation! The dense version takes up more memory, but we can store it as a list in a dataframe or otherwise manipulate it the way we would an ordinary array or list. (By the way, this is how you can store the CSR output of a
sklearn TfidVectorizer or
CountVectorizer in a dataframe. Save it to an array, convert the array to a list, and stick it in a column in the dataframe. Later you can get the column value, convert to a list, then to an array, to use it as training or test data for a machine learning model.
I hope you have enjoyed diving into this library as much as I have. We saw a small slice of the choices that programmers make to represent huge amounts of data in processing-friendly formats.
In the future, I'll like to do more of these posts. Look forward to similar posts in the future about the XGBoost DMatrix as well as Numpy and Pandas vectorization.