Code Mechanic: The Scipy CSR Sparse Matrix

Reading Time: 9 minutes

Python has several popular scientific computing libraries, complete with lots of documentation about how to use them. 

But how do the tools work under the hood? Google searches turn up fewer answers to this.

Luckily the tools are open source, so I can dive into the code myself.

We’ll start with the Scipy CSR sparse matrix. What is this thing, and how does it work?



Before We Begin: Notes on My Process

  • The code is in gists for accessibility. It represents the Scipy library as of pull time on Sunday, September 23, 2018, at 3:24 PM.
  • This post examines the code. It is neither a critique nor an endorsement of the code. I’ll point out some structural choices the programmers made alongside some alternatives, but I will not make definitive judgments about the “correct” choice.
  • I used the PyCharm IDE to navigate through method implementations, fold and unfold code, annotate code, read commit messages, and generate gists of the code I wanted to show you. PyCharm Community Edition is available for free right here.

What is Scipy?

Scipy is a Python-based ecosystem of open-source software for mathematics, science, and engineering. The CSR sparse matrix is a data type inside of scipy that the library uses to represent sparse matrices. 

What are sparse matrices?

In general: they are collections in which the vast majority of the items are some default value (usually None or 0.0).

For Python scientific computing: they are enormous (thousands to millions of items) arrays where the vast majority of the items are zero.

A lot of scientific computing (and data science in general) relies (for now) on our ability to represent our data in numerical terms. That includes inputs like music, photographs, and text. We have techniques to turn these kinds of inputs into numbers, and sometimes those techniques give us enormous sparse matrices.

Take count vectorization, for example. Suppose you have a million sentences, and you want numeric representations of all of them. So you make an array with a zero in it for every word in the English language, with the first zero representing “a”, the second zero representing “aardvark,” and so on. Now, for each sentence, you start with that giant array, and for each word that appears in the sentence you replace its zero with the number of times the word appears. You make one of these for every single sentence. For each sentence, a very small number of these zeros get replaced. But each array is still the length of the entire English lexicon (230,000ish). It would take up a lot of memory, and iterating over it would be a saga.
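Here is a minimal sketch of that idea with a made-up six-word vocabulary standing in for the full lexicon:

import numpy as np

# Hypothetical toy vocabulary standing in for the full English lexicon
vocab = ["a", "aardvark", "cat", "dog", "sat", "the"]
sentence = "the cat sat"

# Start with a zero for every word in the vocabulary...
counts = np.zeros(len(vocab), dtype=int)

# ...and bump the count for each word that actually appears
for word in sentence.split():
    counts[vocab.index(word)] += 1

print(counts)  # [0 0 1 0 1 1] -- mostly zeros, and that's with only six words in the vocabulary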

What is the CSR Sparse Matrix?

CSR stands for “compressed sparse row,” and it’s a nod to the function of the CSR sparse matrix—a data type implemented in Scipy specifically to store sparse matrices as much smaller objects than an array with a crapload of zeros. This makes them faster and more lightweight so they can run on smaller/fewer machines.

Use case: the TfidfVectorizer in scikit-learn returns its encodings as a CSR because the tf-idf encodings, like the word count encodings we discussed above, are mostly zeros for any given text in a much larger body of texts.
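You can see that for yourself with a couple of lines (a sketch; the toy corpus is made up):

from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["the cat sat on the mat", "the dog ate my homework"]  # hypothetical corpus
encodings = TfidfVectorizer().fit_transform(corpus)

print(type(encodings))                    # the csr_matrix class from scipy.sparse
print(isinstance(encodings, csr_matrix))  # True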

How does the CSR Sparse Matrix Work?

Let’s take a look!

First I cloned the repo from here and then opened it in PyCharm.

[Screenshot: the scipy repository open in PyCharm]

The class hierarchy goes csr.py/csr_matrix --> compressed.py/_cs_matrix --> data.py/_data_matrix --> base.py/spmatrix. The spmatrix cannot be instantiated directly: instead, clients should instantiate a subclass. This is because each sparse matrix type still needs specific behavior that spmatrix does not define itself.


class spmatrix(object):
    """ This class provides a base class for all sparse matrices.  It
    cannot be instantiated.  Most of the work is provided by subclasses.
    """

    __array_priority__ = 10.1
    ndim = 2

    def __init__(self, maxprint=MAXPRINT):
        self._shape = None
        if self.__class__.__name__ == 'spmatrix':
            raise ValueError("This class is not intended"
                             " to be instantiated directly.")
        self.maxprint = maxprint

(base.py)
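A quick sketch of that guard in action (with the 2018-era code shown above):

from scipy.sparse import spmatrix, csr_matrix

try:
    spmatrix()   # the base class refuses to be instantiated
except ValueError as e:
    print(e)     # "This class is not intended to be instantiated directly."

c = csr_matrix([[0, 0, 3], [4, 0, 0]])  # a subclass works fine
print(c.shape)                          # (2, 3)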

Let’s go to the most specific class in the hierarchy and pretend we are instantiating it with a dense array. A dense array, in contrast to a sparse array, is the full-size, full-memory collection representation, crapload of zeros and all.

csr_matrix does not have its own initializer, so it leans on its superclass for that functionality. We find the following initializer in _cs_matrix:


class _cs_matrix(_data_matrix, _minmax_mixin, IndexMixin):
    """base matrix class for compressed row and column oriented matrices"""

    def __init__(self, arg1, shape=None, dtype=None, copy=False):
        _data_matrix.__init__(self)

        if isspmatrix(arg1):
            ...
        elif isinstance(arg1, tuple):
            ...
        else:
            # must be dense
            try:
                arg1 = np.asarray(arg1)
            except Exception:
                raise ValueError("unrecognized {}_matrix constructor usage"
                                 "".format(self.format))
            from .coo import coo_matrix
            self._set_self(self.__class__(coo_matrix(arg1, dtype=dtype)))

        # Read matrix dimensions given, if any
        if shape is not None:
            self._shape = check_shape(shape)
        else:
            self.check_format(full_check=False)

(compressed.py)

I have ellipsed some code here to focus our discussion. We’re imagining ourselves saying something like c = csr_matrix(array), where array is a dense numpy array. It’s not a sparse matrix (so our code path skips the isspmatrix(arg1) branch) and it’s not a tuple (so it skips the tuple branch). We fall into the dense branch, where arg1 becomes np.asarray(arg1) and gets funneled through coo_matrix; then check_shape comes into play to record the matrix dimensions (on the shape argument here if one was passed in, or on M.shape inside coo_matrix if not). Let’s look at the implementation of that.

That implementation lives in sputils.py. Here is the first line:


def check_shape(args, current_shape=None):
    """Imitate numpy.matrix handling of shape arguments"""

(sputils.py)

What follows is 56 lines of code to allow this csr_matrix to imitate the return value of array.shape. Why is this method imitating .shape? Because a csr_matrix does not store all the values like a dense array does.
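For example, a csr_matrix built from a one-dimensional dense array still reports a two-dimensional shape (a small sketch):

import numpy as np
from scipy.sparse import csr_matrix

dense = np.array([0, 1, 0, 0, 0, 3, 0, 0, 0, 1])
sparse = csr_matrix(dense)

print(dense.shape)   # (10,)
print(sparse.shape)  # (1, 10) -- sparse matrices are always 2d, so the imitation reports two dimensions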

So how does it store the values? Let’s return to the _cs_matrix initializer to find out.

Saving the Locations of Nonzero Data

In the dense branch of the _cs_matrix initializer, the call self._set_self(self.__class__(coo_matrix(arg1, dtype=dtype))) sets self (the data representation) from a coo_matrix. ‘COO’ stands for ‘coordinate.’ This type of matrix stores information about the locations of things in the data.
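To make that concrete before we read the initializer, here is a quick sketch of what a coo_matrix keeps around for a small, made-up matrix:

import numpy as np
from scipy.sparse import coo_matrix

M = np.array([[1, 0, 2],
              [0, 0, 3],
              [4, 0, 0]])
c = coo_matrix(M)

print(c.row)   # [0 0 1 2] -- row coordinate of each nonzero item
print(c.col)   # [0 2 2 0] -- column coordinate of each nonzero item
print(c.data)  # [1 2 3 4] -- the nonzero values themselves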

Here is the coo_matrix initializer:


class coo_matrix(_data_matrix, _minmax_mixin):
    format = 'coo'

    def __init__(self, arg1, shape=None, dtype=None, copy=False):
        _data_matrix.__init__(self)

        if isinstance(arg1, tuple):
            ...
        else:
            if isspmatrix(arg1):
                ...
            else:
                #dense argument
                M = np.atleast_2d(np.asarray(arg1))

                if M.ndim != 2:
                    raise TypeError('expected dimension <= 2 array or matrix')
                else:
                    self._shape = check_shape(M.shape)

                self.row, self.col = M.nonzero()
                self.data = M[self.row, self.col]
                self.has_canonical_format = True

        if dtype is not None:
            self.data = self.data.astype(dtype, copy=False)

        self._check()

(coo.py)

What is happening here? First, the initializer re-represents the input dense array in at least 2 dimensions. If it receives a one-dimensional numpy array [0,1,0,0,0,3,0,0,0,1], it converts that to [[0,1,0,0,0,3,0,0,0,1]]. This step simplifies the way that sparse matrix data types index the nonzero values.
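A quick sketch of that step in isolation:

import numpy as np

arr = np.array([0, 1, 0, 0, 0, 3, 0, 0, 0, 1])
print(np.atleast_2d(arr).shape)         # (1, 10) -- a 1d input becomes a single-row 2d array
print(np.atleast_2d(np.eye(3)).shape)   # (3, 3)  -- anything already 2d passes through untouched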

See that call to M.nonzero()? That’s where we make these indices. Some types of sparse arrays have their own implementations of this method, but we will stick with the csr_matrix for now, which does not. Instead, it uses the version of .nonzero() that lives on spmatrix itself:


def nonzero(self):
    """nonzero indices

    Returns a tuple of arrays (row,col) containing the indices
    of the non-zero elements of the matrix.

    Examples
    --------
    >>> from scipy.sparse import csr_matrix
    >>> A = csr_matrix([[1,2,0],[0,0,3],[4,0,5]])
    >>> A.nonzero()
    (array([0, 0, 1, 2, 2]), array([0, 1, 2, 0, 2]))
    """
    # convert to COOrdinate format
    A = self.tocoo()
    nz_mask = A.data != 0
    return (A.row[nz_mask], A.col[nz_mask])

(base.py)

This method first converts the data into coordinate format; a client could then reach the stored values by calling .data on the coordinate object. Next, it builds a boolean mask over those stored values that contains False wherever the value is zero and True wherever it is nonzero. It then returns the row and column coordinates at which the data is nonzero. If we were to call this method and then zip the return values together, we would get a list of tuples that represent coordinate pairs. Then, if we were to display our data array as a field of x and y values, with [0,0] at the top left and [n,n] at the bottom right, our zipped-up result would give us the coordinates of every item in the array with a nonzero value.
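A sketch of that zipping, using the matrix from the docstring example above:

from scipy.sparse import csr_matrix

A = csr_matrix([[1, 2, 0],
                [0, 0, 3],
                [4, 0, 5]])
rows, cols = A.nonzero()

# Zip the two index arrays into (row, col) coordinate pairs
print([(int(r), int(c)) for r, c in zip(rows, cols)])
# [(0, 0), (0, 1), (1, 2), (2, 0), (2, 2)]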

Before we move on, we have an opportunity here to talk about a tradeoff the Scipy team faced in the design of this location encoding.

Remember that np.atleast_2d(np.asarray(arg1)) returns [[t,h,i,s]] when passed a collection with one dimension like [t,h,i,s]. Here is what happens when you save the indices of the nonzero items in such a collection:

[Screenshot: the two coordinate arrays for a sparse matrix built from a one-dimensional array]

This array of 29 items, 7 of which are nonzero, gets two 7-item coordinate arrays. The first coordinate array, highlighted in blue, will always contain nothing but zeros for a one-dimensional input. In other words, it is a zeros array of the same length as the second array, which holds the indices of the nonzero items in the way we’re familiar with from np.argsort(arr), or general ordinal collection indexing for that matter.
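We can reproduce that with the ten-item array from earlier (a sketch):

import numpy as np
from scipy.sparse import coo_matrix

c = coo_matrix(np.array([0, 1, 0, 0, 0, 3, 0, 0, 0, 1]))

print(c.row)   # [0 0 0] -- always all zeros for a one-dimensional input
print(c.col)   # [1 5 9] -- the familiar ordinal indices of the nonzero items
print(c.data)  # [1 3 1]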

To avoid storing this sort of superfluous first array, the Scipy team could have made a number of choices.

For example, instead of converting arrays to at least 2d, .nonzero() could check the dimensions and return one index array for a 1d array versus two for a 2d array. This inconsistency in the number of return values, though, would force any client who calls nonzero() to check how many values came back before using them for anything. Each client would also have to either handle 1d and 2d arrays differently, with ordinal indexing for 1d arrays and coordinate indexing for 2d arrays, or independently new up an np.zeros(indices.shape[0]) to make their own first array for indexing here. This would complicate the tasks of both reading and writing client code.

For another example, they could instantiate separate classes of coordinate matrices: one for 1d arrays and another for 2d arrays. The spmatrix could then instantiate one or the other conditionally.

The 1d implementation could preserve a consistent interface with the 2d coo_matrix class by responding to calls to .nonzero() first with an np.zeros(indices.shape[0]), then with the indices collection itself.

For clients, nothing would change, but the zero-index matrix would instantiate on demand rather than live indefinitely in the data representation. We’d get a nice API and a smaller data footprint, but this solution would also mean another class for contributors to maintain. Any changes to indexing behavior would have to happen in two places.

I can understand why the team might therefore avoid splitting this class until they have a few behavioral differences that merit separate classes. In the meantime, the current implementation banks on the data being sparse enough that a 1.5x data footprint for a 1d collection is worth the simplicity of one fewer matrix class.

A Note: CSR Sparse Matrices Don’t Support Tensors.

When we instantiate coo_matrix from a tensor (an array with more than 2 dimensions), we hit the guard in the dense branch: raise TypeError('expected dimension <= 2 array or matrix'). Scientists who need a sparse representation of a tensor either extend coo_matrix or reimplement sparray for tensors. There is an ongoing effort to get such a representation inside of scipy itself. In the meantime, another common approach is to reshape 3 and higher dimensional data into a very long one-dimensional array and use it like that. We see this approach in Andrew Ng's vectorized images for training deep learning classifiers.
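Here is a sketch of that reshaping workaround (the shapes and values are made up for illustration):

import numpy as np
from scipy.sparse import csr_matrix

# A hypothetical stack of 100 8x8 grayscale images: shape (100, 8, 8)
images = np.zeros((100, 8, 8))
images[0, 3, 3] = 255.0

# csr_matrix(images) would raise the TypeError above, so flatten each image into one long row first
flattened = images.reshape(100, -1)   # shape (100, 64): one row per image
sparse_images = csr_matrix(flattened)
print(sparse_images.shape)            # (100, 64)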

Saving the Values of Nonzero Data

Outside of the stored coordinates, the sparse matrix can assume that every item is a zero. So it doesn't have to store those values; it can fill them in later! But what about the values of the nonzero items that do have coordinates? We store those values in addition to their locations.

We can see that in the coo_matrix initializer, in the line self.data = M[self.row, self.col]:


class coo_matrix(_data_matrix, _minmax_mixin):
    ...
    format = 'coo'

    def __init__(self, arg1, shape=None, dtype=None, copy=False):
        _data_matrix.__init__(self)

        if isinstance(arg1, tuple):
            ...
        else:
            if isspmatrix(arg1):
                ...
            else:
                #dense argument
                M = np.atleast_2d(np.asarray(arg1))

                if M.ndim != 2:
                    raise TypeError('expected dimension <= 2 array or matrix')
                else:
                    self._shape = check_shape(M.shape)

                self.row, self.col = M.nonzero()
                self.data = M[self.row, self.col]
                self.has_canonical_format = True

        if dtype is not None:
            self.data = self.data.astype(dtype, copy=False)

        self._check()

(coo.py)

We're saving off those nonzero values in self.data.
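That line is ordinary numpy fancy indexing: pass the row coordinates and the column coordinates as index arrays, and you get back the value at each (row, col) pair. A quick sketch:

import numpy as np

M = np.array([[1, 0, 2],
              [0, 0, 3],
              [4, 0, 0]])
rows, cols = M.nonzero()

print(M[rows, cols])  # [1 2 3 4] -- exactly what coo_matrix stashes in self.data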

Memory and Space in Sparse vs. Dense Matrices

Let's analyze the space constraints for this data type versus a dense collection. Remember, we store three things for each nonzero item: x coordinate, y coordinate, value. If one third of the array has nonzero values, this implementation of the sparse matrix requires the same number of memory allocation spots for the data as the dense matrix. If more than a third of the array has nonzero values, the dense representation takes fewer memory allocation spots!

The data type we're exploring here relies on sparseness for efficiency. We use it for collections with way more zeros than nonzero items.

For example, I'm building a text processing model right now with tf-idf vectorizations of the text. The texts I'm vectorizing range between 250 words and 1000 words. Each one gets vectorized against the 40,000 words, bigrams, and trigrams with the highest tf-idf scores across the corpus. If every word of the largest sample were also part of either a bigram or trigram (unlikely), it would still only have 2,000 values filled in this 40,000 item collection, translating to 6,000 memory allocation spots for this data.

That's 24 kilobytes (at 4 bytes per number), which is 15% of the 160 kilobytes it would take to store a dense representation. Now imagine 100,000 of those in a file. That makes a huge difference in how I can store and process my data.
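If you want to check this kind of arithmetic yourself, a csr_matrix keeps its values and index arrays in .data, .indices, and .indptr, so you can add up their byte counts directly (a rough sketch; exact numbers depend on your dtypes):

import numpy as np
from scipy.sparse import csr_matrix

dense = np.zeros((1, 40000), dtype=np.float32)
nonzero_positions = np.random.choice(40000, size=2000, replace=False)
dense[0, nonzero_positions] = 1.0  # 2,000 nonzero entries

sparse = csr_matrix(dense)
sparse_bytes = sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes

print(dense.nbytes)   # 160000
print(sparse_bytes)   # roughly 16000: 2,000 values + 2,000 column indices + a tiny row-pointer array

The real CSR object actually comes in under the three-numbers-per-item estimate above, because it compresses the per-item row coordinates into a short row-pointer array. That compression is the "compressed" in "compressed sparse row."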

How does our sparse matrix get recomposed as a dense matrix?

When we call .toarray() on a csr_matrix, the class relies on the implementation in its superclass, _cs_matrix:


class _cs_matrix(_data_matrix, _minmax_mixin, IndexMixin):
    ...
    def toarray(self, order=None, out=None):
        if out is None and order is None:
            order = self._swap('cf')[0]
        out = self._process_toarray_args(order, out)
        if not (out.flags.c_contiguous or out.flags.f_contiguous):
            raise ValueError('Output array must be C or F contiguous')
        # align ideal order with output array order
        if out.flags.c_contiguous:
            x = self.tocsr()
            y = out
        else:
            x = self.tocsc()
            y = out.T
        M, N = x._swap(x.shape)
        _sparsetools.csr_todense(M, N, x.indptr, x.indices, x.data, y)
        return out

(compressed.py)

Another matrix, the compressed sparse column matrix (csc_matrix), also inherits this method. CSR and CSC matrices are very similar: one stores data by rows and the other by columns. The conditional logic around order and out.flags at the top of the method rotates the output array as needed so that each type returns its data in the right shape.
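A quick sanity check of that claim (a sketch):

import numpy as np
from scipy.sparse import csr_matrix

A = csr_matrix([[1, 2, 0],
                [0, 0, 3],
                [4, 0, 5]])

# The CSC version stores the same data column-by-column, but both recompose to the same dense array
print(np.array_equal(A.toarray(), A.tocsc().toarray()))  # True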

More germane to our question about how we rebuild a dense representation for these sparse matrices is self._process_toarray_args(order, out), which lives in the base spmatrix class and looks like this:


class spmatrix(object):
    ...
    def _process_toarray_args(self, order, out):
        if out is not None:
            if order is not None:
                raise ValueError('order cannot be specified if out '
                                 'is not None')
            if out.shape != self.shape or out.dtype != self.dtype:
                raise ValueError('out array must be same dtype and shape as '
                                 'sparse matrix')
            out[...] = 0.
            return out
        else:
            return np.zeros(self.shape, dtype=self.dtype, order=order)

(base.py)

This method takes the shape stored on the matrix (recorded via check_shape back when the matrix was built) and returns a numpy array of that shape filled with zeros. This is our starting point.

Now we fill in the values. That happens at the end of .toarray(), when we call _sparsetools.csr_todense(M, N, x.indptr, x.indices, x.data, y). That method is written in C++. There's a Python generator for it here:


"""
python generate_sparsetools.py
Generate manual wrappers for C++ sparsetools code.
Type codes used:
'i': integer scalar
'I': integer array
'T': data array
'B': boolean array
'V': std::vector<integer>*
'W': std::vector<data>*
'*': indicates that the next argument is an output argument
'v': void
'l': 64-bit integer scalar
See sparsetools.cxx for more details.
"""
...
# csr.h
CSR_ROUTINES = """
...
csr_todense v iiIIT*T
...
"""

Wanna see some C++? Of course you do! Let's look at that method:


/*
 * Compute B += A for CSR matrix A, C-contiguous dense matrix B
 *
 * Input Arguments:
 *   I  n_row           - number of rows in A
 *   I  n_col           - number of columns in A
 *   I  Ap[n_row+1]     - row pointer
 *   I  Aj[nnz(A)]      - column indices
 *   T  Ax[nnz(A)]      - nonzero values
 *   T  Bx[n_row*n_col] - dense matrix in row-major order
 *
 */
template <class I, class T>
void csr_todense(const I n_row,
                 const I n_col,
                 const I Ap[],
                 const I Aj[],
                 const T Ax[],
                       T Bx[])
{
    T * Bx_row = Bx;
    for(I i = 0; i < n_row; i++){
        for(I jj = Ap[i]; jj < Ap[i+1]; jj++){
            Bx_row[Aj[jj]] += Ax[jj];
        }
        Bx_row += (npy_intp)n_col;
    }
}

(csr.h)

csr_todense() takes a number of rows, a number of columns, a row-pointer array (Ap, which is the matrix's indptr), an array of column indices (Aj), the nonzero values (Ax) and, finally, an output array (Bx). .toarray() passed in all these values: that final output array is the np.zeros() template array from self._process_toarray_args(order, out).

I know this code is not Python, but you might recognize the general structure: a nested for loop.

The csr_todense() method loops through each row and, within each row, through that row's stored entries (the slice between Ap[i] and Ap[i+1]). It updates each location by adding the stored value at its column index: 0.0 + some nonzero value = the nonzero value. When the loops finish, the output array holds the nonzero values in their proper places.
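If C++ isn't your thing, here's a rough Python transcription of the same loop. It isn't the real implementation, but it shows the mechanics: Ap is the row-pointer array (x.indptr), so the slice Ap[i]:Ap[i+1] picks out the column indices and values that belong to row i.

import numpy as np
from scipy.sparse import csr_matrix

def csr_todense_sketch(n_row, n_col, Ap, Aj, Ax):
    # Start from the all-zeros template, just like _process_toarray_args does
    out = np.zeros((n_row, n_col), dtype=np.asarray(Ax).dtype)
    for i in range(n_row):                  # one pass per row
        for jj in range(Ap[i], Ap[i + 1]):  # the stored entries belonging to row i
            out[i, Aj[jj]] += Ax[jj]        # drop each value at its column index
    return out

# The example matrix from the nonzero() docstring above
A = csr_matrix([[1, 2, 0], [0, 0, 3], [4, 0, 5]])
print(csr_todense_sketch(*A.shape, A.indptr, A.indices, A.data))
# [[1 2 0]
#  [0 0 3]
#  [4 0 5]]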

And voilà! It has recomposed a sparse matrix to a dense representation! The dense version takes up more memory, but we can store it as a list in a dataframe or otherwise manipulate it the way we would an ordinary array or list. (By the way, this is how you can store the CSR output of a sklearn TfidfVectorizer or CountVectorizer in a dataframe: save it to an array, convert the array to a list, and stick it in a column in the dataframe. Later you can get the column value, convert it to a list, then to an array, to use it as training or test data for a machine learning model.)

Like so:


import numpy as np
import pandas as pd
from nltk import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

df = pd.read_csv('my_data_with_text.csv')
df.columns  #id, text, category

texts = np.array(df['text'])  #text contents in dataframe to array for processing
vocab_length = len(word_tokenize(' '.join(texts)))  #concatenate all the texts and tokenize the whole corpus

vectorizer = TfidfVectorizer(ngram_range=(1, 3), max_features=vocab_length)  #make Tfidf Vectorizer
tfidf_encodings = vectorizer.fit_transform(texts)  #encode the text

df['tfidf'] = list(tfidf_encodings.toarray())  #vectorized texts to dense list format for storage in dataframe
vectors_for_training = np.array(df['tfidf'].tolist())  #get the vectors back out of the dataframe for use in something else

X_train, X_test, y_train, y_test = train_test_split(vectors_for_training, df['category'].tolist())

nb_classifier = MultinomialNB()
nb_classifier.fit(X_train, y_train)
nb_predictions = nb_classifier.predict(df.tfidf.tolist())

#DO NOT DO:
df.to_csv('with_encoding.csv')  #Stores the first and last 3 items in each vector as a string like "[0.0, 0.0, 0.0...0.0, 0.0, 0.0]"

Conclusion

I hope you have enjoyed diving into this library as much as I have. We saw a small slice of the choices that programmers make to represent huge amounts of data in processing-friendly formats.

In the future, I'd like to do more of these posts: look out for similar dives into the XGBoost DMatrix as well as Numpy and Pandas vectorization.

If you like nerding out about code, you might also like:

Git commit messages served six ways

Breaking rules in Rails to build a process-oriented import class

My favorite code examination strategy—comes with colored pens!
