Python has several popular scientific computing libraries, complete with lots of documentation about how to use them.
But how do these tools work under the hood? Google searches turn up far fewer answers to that question.
Luckily the tools are open source, so I can dive into the code myself.
We’ll start with the Scipy CSR sparse matrix. What is this thing, and how does it work?
Before We Begin: Notes on My Process
- The code is in gists for accessibility. It represents the Scipy library as of pull time on Sunday, September 23, 2018, at 3:24 PM.
- This post examines the code. It is neither a critique nor an endorsement of the code. I’ll point out some structural choices the programmers made alongside some alternatives, but I will not make definitive judgments about the “correct” choice.
- I used the PyCharm IDE to navigate through method implementations, fold and unfold code, annotate code, read commit messages, and generate gists of the code I wanted to show you. PyCharm Community Edition is available for free right here.
What is Scipy?
Scipy is a Python-based ecosystem of open-source software for mathematics, science, and engineering. The CSR sparse matrix is a data type inside of scipy that the library uses to represent sparse matrices.
What are sparse matrices?
In general: they are collections in which the vast majority of the items are some default value (usually None or 0.0).
For Python scientific computing: they are enormous (thousands to millions of items) arrays where the vast majority of the items are zero.
A lot of scientific computing (and data science in general) relies (for now) on our ability to represent our data in numerical terms. That includes inputs like music, photographs, and text. We have techniques to turn these kinds of inputs into numbers, and sometimes those techniques give us enormous sparse matrices.
Take count vectorization, for example. Suppose you have a million sentences, and you want numeric representations of all of them. So you make an array with a zero in it for every word in the English language, with the first zero representing “a”, the second zero representing “aardvark,” and so on. Now, for each sentence, you start with that giant array, and you replace each word's zero with the number of times that word appears in your sentence. You make one of these arrays for every single sentence. For each sentence, a very small number of the zeros get replaced. But each array is still the length of the entire English lexicon (230,000ish). It would take up a lot of memory, and iterating over it would be a saga.
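Here is a toy sketch of that idea, with a made-up six-word stand-in for the full lexicon:

from collections import Counter

# Hypothetical six-word stand-in for the ~230,000-word English lexicon.
vocabulary = ["a", "aardvark", "cat", "mat", "sat", "the"]

def count_vectorize(sentence):
    # One slot per vocabulary word; most slots stay at zero.
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocabulary]

print(count_vectorize("the cat sat"))  # [0, 0, 1, 0, 1, 1]

Scale the vocabulary up to the whole English language and almost every slot stays zero: that's the sparse matrix scenario.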
What is the CSR Sparse Matrix?
CSR stands for “compressed sparse row,” and it’s a nod to the function of the CSR sparse matrix—a data type implemented in Scipy specifically to store sparse matrices as much smaller objects than an array with a crapload of zeros. This makes them faster and more lightweight so they can run on smaller/fewer machines.
Use case: the TfidfVectorizer in scikit-learn returns its encodings as a CSR matrix, because the tf-idf encodings, like the word count encodings we discussed above, are mostly zeros for any given text in a much larger body of texts.
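You can check that for yourself (the three-sentence corpus here is made up):

from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import csr_matrix

corpus = ["the cat sat", "the dog ran", "the bird flew"]
encodings = TfidfVectorizer().fit_transform(corpus)

print(isinstance(encodings, csr_matrix))  # True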
How does the CSR Sparse Matrix Work?
Let’s take a look!
First I cloned the repo from here and then opened it in PyCharm.
The class hierarchy goes csr.py/csr_matrix --> compressed.py/_cs_matrix --> data.py/_data_matrix --> base.py/spmatrix. The spmatrix cannot be instantiated directly: instead, clients should instantiate a subclass. This is because each sparse matrix type still needs specific behavior that spmatrix does not define itself.
class spmatrix(object):
    """ This class provides a base class for all sparse matrices.  It
    cannot be instantiated.  Most of the work is provided by subclasses.
    """

    __array_priority__ = 10.1
    ndim = 2

    def __init__(self, maxprint=MAXPRINT):
        self._shape = None
        if self.__class__.__name__ == 'spmatrix':
            raise ValueError("This class is not intended"
                             " to be instantiated directly.")
        self.maxprint = maxprint
Let’s go to the most specific class in the hierarchy and pretend we are instantiating it with a dense array. A dense array, in contrast to a sparse array, is the full-size, full-memory collection representation, crapload of zeros and all.
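Here is a minimal sketch of that scenario (the array values are made up):

import numpy as np
from scipy.sparse import csr_matrix

# A dense array: the full-size representation, zeros and all.
array = np.array([[0, 0, 3, 0],
                  [0, 0, 0, 0],
                  [4, 0, 0, 1]])
c = csr_matrix(array)

print(c.nnz)  # 3 -- only the three nonzero values get stored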
csr_matrix does not have its own initializer, so it leans on its superclass for that functionality. We find the following initializer in _cs_matrix:
class _cs_matrix(_data_matrix, _minmax_mixin, IndexMixin):
    """base matrix class for compressed row and column oriented matrices"""

    def __init__(self, arg1, shape=None, dtype=None, copy=False):
        _data_matrix.__init__(self)

        if isspmatrix(arg1):
            …
        elif isinstance(arg1, tuple):
            …
        else:
            # must be dense
            try:
                arg1 = np.asarray(arg1)
            except Exception:
                raise ValueError("unrecognized {}_matrix constructor usage"
                                 "".format(self.format))
            from .coo import coo_matrix
            self._set_self(self.__class__(coo_matrix(arg1, dtype=dtype)))

        # Read matrix dimensions given, if any
        if shape is not None:
            self._shape = check_shape(shape)
        else:
            …

        self.check_format(full_check=False)
I have ellipsed some code here to focus our discussion. We’re imagining ourselves saying something like c = csr_matrix(array), where array is a dense numpy array. It’s not a sparse matrix (so our code path skips the isspmatrix(arg1) branch) and it’s not a tuple (so it skips the isinstance(arg1, tuple) branch). We fall through to the dense branch and get np.asarray(arg1); then, when a shape is given, the initializer calls check_shape(shape). Let’s look at the implementation of that.
That implementation lives in sputils.py. Here is how it starts:
def check_shape(args, current_shape=None):
    """Imitate numpy.matrix handling of shape arguments"""
What follows is 56 lines of code to allow this csr_matrix to imitate the return value of array.shape. Why is this method imitating .shape? Because a csr_matrix does not store all the values the way a dense array does.
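The upshot, from the client's perspective (a small made-up check):

import numpy as np
from scipy.sparse import csr_matrix

dense = np.zeros((5, 8))
dense[2, 3] = 7.0

print(dense.shape)              # (5, 8)
print(csr_matrix(dense).shape)  # (5, 8) -- same interface, no dense storage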
So how does it store the values? Let's return to the _cs_matrix initializer to find out.
Saving the Locations of Nonzero Data
In the dense branch of the _cs_matrix initializer, we set self (the data representation) using a coo_matrix. ‘COO’ stands for ‘coordinate.’ This type of matrix stores information about the locations of things in the data.
Here is the coo_matrix initializer:
class coo_matrix(_data_matrix, _minmax_mixin):
    …
    format = 'coo'

    def __init__(self, arg1, shape=None, dtype=None, copy=False):
        _data_matrix.__init__(self)

        if isinstance(arg1, tuple):
            …
        else:
            if isspmatrix(arg1):
                …
            else:
                # dense argument
                M = np.atleast_2d(np.asarray(arg1))

                if M.ndim != 2:
                    raise TypeError('expected dimension <= 2 array or matrix')
                else:
                    self._shape = check_shape(M.shape)

                self.row, self.col = M.nonzero()
                self.data = M[self.row, self.col]
                self.has_canonical_format = True

        if dtype is not None:
            self.data = self.data.astype(dtype, copy=False)

        self._check()
What is happening here? First, the initializer re-represents the input dense array in at least 2 dimensions. If it receives a one-dimensional numpy array [0,1,0,0,0,3,0,0,0,1], it converts that to [[0,1,0,0,0,3,0,0,0,1]]. This step simplifies the way that sparse matrix data types index the nonzero values.
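A quick check of that conversion:

import numpy as np

arr = np.array([0, 1, 0, 0, 0, 3, 0, 0, 0, 1])

print(np.atleast_2d(arr))        # [[0 1 0 0 0 3 0 0 0 1]]
print(np.atleast_2d(arr).shape)  # (1, 10)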
See that call to M.nonzero()? That's where we make these indices. (Since M is a plain numpy array at this point, that particular call resolves to numpy's implementation, but the sparse matrix classes expose a nonzero() of their own as well.) Some types of sparse arrays have their own implementations of this method, but we will stick with the csr_matrix for now, which does not. Instead, it uses the version of .nonzero() that lives on spmatrix itself:
def nonzero(self):
    """nonzero indices

    Returns a tuple of arrays (row,col) containing the indices
    of the non-zero elements of the matrix.

    Examples
    --------
    >>> from scipy.sparse import csr_matrix
    >>> A = csr_matrix([[1,2,0],[0,0,3],[4,0,5]])
    >>> A.nonzero()
    (array([0, 0, 1, 2, 2]), array([0, 1, 2, 0, 2]))
    """
    # convert to COOrdinate format
    A = self.tocoo()
    nz_mask = A.data != 0
    return (A.row[nz_mask], A.col[nz_mask])
This method first takes the data and converts it into coordinate format. Now, to access the original data, a client would call .data on the coordinate object. Next, this method creates a mask of that data that contains False wherever it has a zero value and True wherever it has a nonzero value. It then returns the coordinates at which that data has a nonzero value. If we were to call this method and then zip the return values together, we would get a list of tuples that represent coordinate pairs. Then, if we were to display our data array as a field of x and y values, with [0,0] at the top left and [n,n] at the bottom right, our zipped-up result would give us the coordinates of every item in the array with a nonzero value.
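Here is what that zipping looks like with the matrix from the docstring example above:

import numpy as np
from scipy.sparse import csr_matrix

A = csr_matrix([[1, 2, 0],
                [0, 0, 3],
                [4, 0, 5]])
rows, cols = A.nonzero()

print(list(zip(rows, cols)))  # [(0, 0), (0, 1), (1, 2), (2, 0), (2, 2)]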
Before we move on, we have an opportunity here to talk about a tradeoff the Scipy team faced in the design of this location encoding.
Remember that np.atleast_2d(np.asarray(arg1)) returns [[t,h,i,s]] when passed a collection with one dimension like [t,h,i,s]. Here is what happens when you save the indices of the nonzero items in such a collection: an array of 29 items, 7 of which are nonzero, gets two 7-item coordinate arrays. The first coordinate array will always contain exclusively zeros for a one-dimensional matrix. We know that this array will be a zeros array of the same length as the second array, which gives us the indices of the nonzero items in the way we're familiar with from np.argsort(arr), or general ordinal collection indexing for that matter.
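A smaller version of the same effect, using the ten-item array from earlier:

import numpy as np

arr = np.atleast_2d(np.array([0, 1, 0, 0, 0, 3, 0, 0, 0, 1]))
rows, cols = arr.nonzero()

print(rows)  # [0 0 0] -- all zeros, always, for a 1d input
print(cols)  # [1 5 9] -- the familiar ordinal indices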
To avoid storing this sort of superfluous first array, the Scipy team could have made a number of choices.
For example, instead of converting arrays to at least 2d, .nonzero() could check the dimensions and return just one index array for a 1d array versus two for a 2d array. This inconsistency in the number of return values, though, would have forced any client who calls nonzero() to check the number of return values before using them for anything. Each client would also have to either handle 1d and 2d arrays differently, with ordinal indexing for 1d arrays and coordinate indexing for 2d arrays, or independently new up an np.zeros(indices.shape[0]) to make their own first array for indexing. This would complicate the tasks of both reading and writing client code.
For another example, they could implement separate classes of coordinate matrices: one for 1d arrays and another for 2d arrays. The spmatrix could then instantiate one or the other conditionally. The 1d implementation could preserve a consistent interface with the 2d coo_matrix class by responding to calls to .nonzero() first with an np.zeros(indices.shape[0]), then with the indices collection itself.
For clients, nothing would change, but the zero-index matrix would instantiate on demand rather than live indefinitely in the data representation. We'd get a nice API and a smaller data footprint, but this solution would also mean another class for contributors to maintain. Any changes to indexing behavior would have to happen in two places.
I can understand why the team might therefore avoid splitting this class until a few behavioral differences accumulate that merit separate classes. In the meantime, the current implementation banks on the data being sparse enough that the 1.5x data footprint for a 1d collection is worth the simplicity of one fewer matrix class.
A Note: CSR Sparse Matrices Don’t Support Tensors.
When we instantiate coo_matrix from a tensor (an array with more than 2 dimensions), we see the result in the dense branch above: raise TypeError('expected dimension <= 2 array or matrix'). Scientists who need a sparse representation of a tensor either extend coo_matrix or reimplement sparray for tensors. There is an ongoing effort to get such a representation inside of scipy itself. In the meantime, another common approach is to reshape 3-and-higher-dimensional data into a very long one-dimensional array and use it like that, as sketched below. We see this approach in Andrew Ng's vectorized images for training deep learning classifiers.
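A minimal sketch of that flattening approach, with a made-up stack of tiny images:

import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical stack of three 4x4 "images", mostly zeros.
images = np.zeros((3, 4, 4))
images[0, 1, 2] = 255
images[2, 0, 0] = 128

flattened = images.reshape(3, -1)  # each image becomes one long 1d row
sparse = csr_matrix(flattened)     # 2 dimensions, so this is allowed

print(sparse.shape)  # (3, 16)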
Saving the Values of Nonzero Data
For every location without stored coordinates, the sparse matrix can assume the item is a zero. So it doesn't have to store those values: it can fill them in later! But what about the values of the nonzero items that do have coordinates? We store those values in addition to their locations.
We can see that in the initializer of the coo_matrix, in the line that assigns self.data:
class coo_matrix(_data_matrix, _minmax_mixin):
    ...
    format = 'coo'

    def __init__(self, arg1, shape=None, dtype=None, copy=False):
        _data_matrix.__init__(self)

        if isinstance(arg1, tuple):
            ...
        else:
            if isspmatrix(arg1):
                ...
            else:
                # dense argument
                M = np.atleast_2d(np.asarray(arg1))

                if M.ndim != 2:
                    raise TypeError('expected dimension <= 2 array or matrix')
                else:
                    self._shape = check_shape(M.shape)

                self.row, self.col = M.nonzero()
                self.data = M[self.row, self.col]
                self.has_canonical_format = True

        if dtype is not None:
            self.data = self.data.astype(dtype, copy=False)

        self._check()
We're saving off those nonzero values in self.data.
Memory and Space in Sparse vs. Dense Matrices
Let's analyze the space constraints for this data type versus a dense collection. Remember, we store three things for each nonzero item: x coordinate, y coordinate, value. If one third of the array has nonzero values, this implementation of the sparse matrix requires the same number of memory allocation spots for the data as the dense matrix. If more than a third of the array has nonzero values, the dense representation takes fewer memory allocation spots!
The data type we're exploring here relies on sparseness for efficiency. We use it for collections with way more zeros than nonzero items.
For example, I'm building a text processing model right now with tf-idf vectorizations of the text. The texts I'm vectorizing range between 250 words and 1000 words. Each one gets vectorized into a tf-idf representation of the 40,000 words, bigrams, and trigrams in the text with the highest tf-idf scores. If every word of the largest sample were also part of either a bigram or trigram (unlikely), it would still only have 2,000 values filled in this 40,000 item collection, translating to 6,000 memory allocation spots for this data.
That's 24 kilobytes (4 bytes per integer), which is 15% of the 160 kilobytes it would take to store a dense representation. Now imagine 100,000 of those in a file. That makes a huge difference in how I can store and process my data.
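The arithmetic behind those numbers, spelled out (the 2,000-value worst case and 4-byte items come from the estimates above):

# Sparse: row coordinate + column coordinate + value per nonzero item.
values_filled = 2000
bytes_per_item = 4
sparse_bytes = values_filled * 3 * bytes_per_item  # 6,000 slots -> 24,000 bytes

# Dense: one slot for every feature, zeros included.
dense_bytes = 40000 * bytes_per_item               # 160,000 bytes

print(sparse_bytes / dense_bytes)  # 0.15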
How does our sparse matrix get recomposed as a dense matrix?
When we call .toarray() on a csr_matrix, the class relies on the implementation in its superclass, _cs_matrix:
class _cs_matrix(_data_matrix, _minmax_mixin, IndexMixin):
    ...
    def toarray(self, order=None, out=None):
        if out is None and order is None:
            order = self._swap('cf')[0]
        out = self._process_toarray_args(order, out)
        if not (out.flags.c_contiguous or out.flags.f_contiguous):
            raise ValueError('Output array must be C or F contiguous')
        # align ideal order with output array order
        if out.flags.c_contiguous:
            x = self.tocsr()
            y = out
        else:
            x = self.tocsc()
            y = out.T
        M, N = x._swap(x.shape)
        _sparsetools.csr_todense(M, N, x.indptr, x.indices, x.data, y)
        return out
Another matrix, the compressed sparse column matrix (csc_matrix), also inherits this method. CSR and CSC matrices are very similar: one stores data by row and the other by column. The conditional logic around order and contiguity orients the output array as needed to return the data in the right shape for its data type.
More germane to our question about how we rebuild a dense representation for these sparse matrices is self._process_toarray_args(order, out), which lives in the base spmatrix class and looks like this:
class spmatrix(object):
    ...
    def _process_toarray_args(self, order, out):
        if out is not None:
            if order is not None:
                raise ValueError('order cannot be specified if out '
                                 'is not None')
            if out.shape != self.shape or out.dtype != self.dtype:
                raise ValueError('out array must be same dtype and shape as '
                                 'sparse matrix')
            out[...] = 0.
            return out
        else:
            return np.zeros(self.shape, dtype=self.dtype, order=order)
This method takes the shape, which was stored on the class when we first built the matrix (remember check_shape?). It returns a numpy array of that shape containing all zeros. This is our starting point.
Now we fill in the values. That happens at the bottom of .toarray(), when we call _sparsetools.csr_todense(M, N, x.indptr, x.indices, x.data, y). That method is written in C++. There's a Python generator for it here:
""" | |
python generate_sparsetools.py | |
Generate manual wrappers for C++ sparsetools code. | |
Type codes used: | |
'i': integer scalar | |
'I': integer array | |
'T': data array | |
'B': boolean array | |
'V': std::vector<integer>* | |
'W': std::vector<data>* | |
'*': indicates that the next argument is an output argument | |
'v': void | |
'l': 64-bit integer scalar | |
See sparsetools.cxx for more details. | |
""" | |
... | |
# csr.h | |
CSR_ROUTINES = """ | |
... | |
csr_todense v iiIIT*T | |
... | |
""" |
Wanna see some C++? Of course you do! Let's look at that method:
/*
 * Compute B += A for CSR matrix A, C-contiguous dense matrix B
 *
 * Input Arguments:
 *   I  n_row           - number of rows in A
 *   I  n_col           - number of columns in A
 *   I  Ap[n_row+1]     - row pointer
 *   I  Aj[nnz(A)]      - column indices
 *   T  Ax[nnz(A)]      - nonzero values
 *   T  Bx[n_row*n_col] - dense matrix in row-major order
 *
 */
template <class I, class T>
void csr_todense(const I n_row,
                 const I n_col,
                 const I Ap[],
                 const I Aj[],
                 const T Ax[],
                       T Bx[])
{
    T * Bx_row = Bx;
    for(I i = 0; i < n_row; i++){
        for(I jj = Ap[i]; jj < Ap[i+1]; jj++){
            Bx_row[Aj[jj]] += Ax[jj];
        }
        Bx_row += (npy_intp)n_col;
    }
}
.csr_todense() takes a number of rows, a number of columns, a row pointer array (which marks where each row's entries begin and end), a column index array, the nonzero values and, finally, a data array. .toarray() passed in all these values: that final data array is the np.zeros() template array from self._process_toarray_args(order, out).

I know this code is not Python, but you might recognize the general structure: a nested for loop.

The .csr_todense() method loops through each row and, within each row, through that row's stored entries. It updates each location by adding the stored value into the template. 0.0 + some nonzero value = the nonzero value. The method finishes with the nonzero values in their proper places in the data array.
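To make that loop concrete, here is a pure-Python sketch of the same logic (a teaching aid, not Scipy's code; the function name is mine):

import numpy as np
from scipy.sparse import csr_matrix

def csr_todense_sketch(n_row, n_col, indptr, indices, data):
    # Start from the all-zeros template and add each stored value
    # into its row and column, just like the C++ loop.
    out = np.zeros((n_row, n_col), dtype=data.dtype)
    for i in range(n_row):                          # each row
        for jj in range(indptr[i], indptr[i + 1]):  # that row's stored entries
            out[i, indices[jj]] += data[jj]
    return out

c = csr_matrix([[1, 2, 0], [0, 0, 3], [4, 0, 5]])
dense = csr_todense_sketch(c.shape[0], c.shape[1], c.indptr, c.indices, c.data)

print(np.array_equal(dense, c.toarray()))  # True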
And voilà! It has recomposed a sparse matrix into a dense representation! The dense version takes up more memory, but we can store it as a list in a dataframe or otherwise manipulate it the way we would an ordinary array or list. (By the way, this is how you can store the CSR output of a sklearn TfidfVectorizer or CountVectorizer in a dataframe. Save it to an array, convert the array to a list, and stick it in a column in the dataframe. Later you can get the column value, convert it to a list, then to an array, to use it as training or test data for a machine learning model.)
Like so:
import pandas as pd
import numpy as np
from nltk import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

df = pd.read_csv('my_data_with_text.csv')
df.columns  # id, text, category

texts = np.array(df['text'])  # text contents in dataframe to array for processing
vocab_length = len(set(word_tokenize(' '.join(texts))))  # concatenate all the texts and tokenize the whole corpus

vectorizer = TfidfVectorizer(ngram_range=(1, 3), max_features=vocab_length)  # make Tfidf Vectorizer
tfidf_encodings = vectorizer.fit_transform(texts)  # encode the text

df['tfidf'] = list(tfidf_encodings.toarray())  # vectorized texts to dense list format for storage in dataframe

vectors_for_training = np.array(df['tfidf'].tolist())  # get the vectors back out of the dataframe for use in something else
X_train, X_test, y_train, y_test = train_test_split(vectors_for_training, df['category'].tolist())

nb_classifier = MultinomialNB()
nb_classifier.fit(X_train, y_train)
nb_predictions = nb_classifier.predict(np.array(df['tfidf'].tolist()))

# DO NOT DO:
df.to_csv('with_encoding.csv')  # Stores the first and last 3 items in each vector as a string like "[0.0, 0.0, 0.0...0.0, 0.0, 0.0]"
Conclusion
I hope you have enjoyed diving into this library as much as I have. We saw a small slice of the choices that programmers make to represent huge amounts of data in processing-friendly formats.
In the future, I'd like to do more of these posts. Look for similar deep dives into the XGBoost DMatrix as well as Numpy and Pandas vectorization.
If you like nerding out about code, you might also like:
Git commit messages served six ways
Breaking rules in Rails to build a process-oriented import class
My favorite code examination strategy—comes with colored pens!