This is the third post in a three-part case study. We walk through time and space efficiency considerations as we write an in-memory database.
In the previous post of this series, we implemented transaction support in an in-memory database. However, we did it in an extremely space-inefficient way. In case you read that post a while ago, here are a few quotes to refresh your memory:
Transactions allow us to stage a series of changes to the database that we can either commit (save) or roll back (discard) all at once. We begin a transaction by calling
.begin, save the changes of the transaction by calling
.commit, or discard those changes by calling
.rollback.
[But, in our previous implementation], when we open our first transaction, we immediately double our database footprint. When our database only has three things in it, maybe this isn’t a big deal. If our database had 100,000 things in it, this would become a much bigger deal.
So what are we going to do about this extremely space-intensive solution?
Recursion to the rescue!
What we’ll do next: we’ll make it possible to stack multiple transactions on top of each other like this:
Then, when we need to know the state of the database at the current transaction, we recursively look back through the database versions to get the most recent value at each key. You can see how we do this in the
merge_candidate and
count_candidate methods in the implementation below.
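To make the lookback concrete, here is a minimal sketch of the recursion. The name `value_at` and the structure are my own hypothetical illustration, not the post's actual implementation:

```ruby
# Walk the stack of change-set hashes from the newest tier down,
# returning the first value recorded for the key.
def value_at(db_versions, tier, key)
  return nil if tier < 0                 # ran out of tiers: key not set
  if db_versions[tier].key?(key)
    db_versions[tier][key]               # most recent value wins
  else
    value_at(db_versions, tier - 1, key) # recurse into the older tier
  end
end

versions = [
  { "a" => 1, "b" => 2 },  # committed base state
  { "b" => 20 },           # transaction 1 overrides "b"
  { "c" => 3 }             # transaction 2 adds "c"
]

value_at(versions, 2, "b")  # => 20, found one tier down
value_at(versions, 2, "a")  # => 1, falls all the way to the base
```

The recursion terminates either at the first tier that has written the key, or below tier 0 with `nil`.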
Then, inside of
commit, we squash down the latest transaction change set onto the previous one. In
rollback, we delete the latest transaction change set.
Now we’ll be keeping each of our change sets in separate hashes:
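As a rough sketch of that squash-and-discard bookkeeping, treating the change sets as an array of hashes (these free-standing functions are my own illustration, not the post's code):

```ruby
# Each element is one tier's change set; index 0 is the committed state.
def commit(db_versions)
  latest = db_versions.pop         # take the newest change set...
  db_versions[-1].merge!(latest)   # ...and squash it onto the tier below
end

def rollback(db_versions)
  db_versions.pop  # discard the newest change set entirely
end

versions = [{ "a" => 1 }, { "a" => 2, "b" => 3 }]
commit(versions)
versions  # => [{ "a" => 2, "b" => 3 }]
```

Note that `merge!` lets the newer tier's values win on key collisions, which is exactly the "most recent value at each key" rule from the lookback.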
A Sidenote on API Design: You’ll notice that we also introduced the concept of a
@tier. This keeps track of which transaction branch we are on, so we can do nested transactions.
I get that, when I add the code for
@tier, it kind of feels like I did the code sample version of this:
In our previous implementation, we would always make our changes on whatever the last hash in the database list was, and then merge those changes with the second to last hash in the list. Technically, we could still do that, indexing our
@db_versions array with
-2 all over the place.
If a new developer were to show up, they’d have to figure out the significance of these unlabeled variables with values of
-2. To make this easier, we assign a name to each piece of our code—preferably a name that describes what that piece of code means. I chose
@tier to represent this concept, like the tiers of this cake tray:
@tier represents the number of our latest transaction: the highest tier on the cake tray. And it allows us to manage nested transactions so we can do stuff like this:
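Here is a compressed, hypothetical version of such a session. The class below is a stand-in I wrote for illustration; only the `.begin`/`.commit`/`.rollback` shape and the `@tier` idea come from the post:

```ruby
class NestedDb
  def initialize
    @db_versions = [{}]  # tier 0: the committed state
    @tier = 0            # index of the newest open transaction
  end

  def begin               # call with an explicit receiver: db.begin
    @tier += 1
    @db_versions.push({})
  end

  def commit
    latest = @db_versions.pop
    @db_versions[-1].merge!(latest)  # squash onto the tier below
    @tier -= 1
  end

  def rollback
    @db_versions.pop  # throw the newest change set away
    @tier -= 1
  end

  def set(key, value)
    @db_versions[@tier][key] = value
  end

  def get(key)
    @db_versions.reverse_each { |cs| return cs[key] if cs.key?(key) }
    nil
  end
end

db = NestedDb.new
db.set("flavor", "vanilla")    # committed base state (tier 0)
db.begin                       # tier 1
db.set("flavor", "chocolate")
db.begin                       # tier 2, nested inside tier 1
db.set("flavor", "strawberry")
db.rollback                    # drop tier 2; "chocolate" is visible again
db.commit                      # squash tier 1 onto tier 0
db.get("flavor")               # => "chocolate"
```

Each `begin` adds a tier, and `@tier` always points at the one currently receiving writes, so nesting falls out of the bookkeeping for free.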
Wait, what about deletion?
You’ll notice that we have been omitting key deletion so far in our transaction examples.
If we delete keys only from the top-tier change set, our current implementation of
merge_candidate could still have values for those keys, provided that those keys are still present at a lower tier.
We could also try “overwriting” those keys at a lower tier by setting the key’s value to
nil in the database. But, as we discussed in the previous post, that equates a key with a
nil value to no key at all in our database.
If we were to do that, we would need to make sure that every use case treated those two things the same. And even if they did, we would then end up keeping the keys for key-value pairs that we have deleted, which takes up extra space.
We need a way to delete the keys, so we get the behavior described in this test:
Here is what we’ll do: inside the
.delete method, we’ll add the key being deleted to a list of deletions. Then, in
merge_candidate, once we have our view of the database from the perspective of our current transaction, we rip out the keys in the deletion list before returning the finished hash:
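A simplified sketch of that rip-out step. I'm using a single flat deletion list here for clarity; a real implementation would likely need a per-tier list so rollback could restore deletions:

```ruby
# Fold the tiers together (newest wins), then strip out every key that
# appears in the deletion list before returning the merged view.
def merge_candidate(db_versions, deletions)
  merged = db_versions.reduce({}) { |acc, change_set| acc.merge(change_set) }
  deletions.each { |key| merged.delete(key) }  # rip out deleted keys
  merged
end

versions  = [{ "a" => 1, "b" => 2 }, { "c" => 3 }]
deletions = ["b"]  # "b" was deleted inside the open transaction

merge_candidate(versions, deletions)  # => { "a" => 1, "c" => 3 }
```

Because the strip happens last, a key deleted at any tier disappears from the merged view even though a lower tier still holds a value for it.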
This is starting to shape up pretty well.
A Sidenote on API Design: The recursive methods
merge_candidate and
count_candidate have a lot of code in common. Why didn’t I factor out a method that takes a hash and a block, and have each of these methods call that one rather than each containing its own logic?
Before I do that refactor, I want to be certain that the operations I am performing are, in fact, the same. If it turns out later that I was wrong, I have put either myself or my team in a bind, because factoring down (separating things) is often significantly harder than factoring up (combining things). I explain my cost evaluation for refactors in this keynote talk on the technology and psychology of refactoring.
Full disclosure: I stand by this refactoring choice, but not everyone agrees with it. I implemented the in-memory database you see here in four hours as a code challenge in an interview process, and I think this choice not to refactor might have been one of the reasons the company did not move forward with me after this code challenge 🙃.
Nevertheless, I thought the exercise was valuable to share as a demonstration of the way I consider space and time constraints while writing software.
That brings me to a time efficiency constraint of this implementation:
merge_candidate and
count_candidate do an iterative lookback through each individual transaction currently open for the database. Iteration is frequently a culprit for speed leaks in code that needs to be super fast. NumPy gets around this by dropping into C. Dask gets around this by parallelizing iterative processes across cores. Even if we don’t mess with threads or shell out to a faster language, are there things we could do to speed up our code?
We could try a few things.
For one, we could stop comparing keys for which we already have a value from a more current transaction. I chose not to do this for a few reasons:
- The list of keys not to consider could itself significantly increase the database memory footprint
- We’d still have to iteratively remove those keys from each less recent version
- When we remove those keys, we’d have to store them rather than delete them because they’d still be in effect if more recent transactions were all rolled back.
So I’m not convinced it would save any time, and it would definitely cost space.
How about caching? We could cache the output of
merge_candidate and
count_candidate, then invalidate the cache anytime a write method (like
.delete) is called. That would slow down the writes to speed up the reads (like
.count) since we currently call those candidate methods inside the read methods.
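For illustration, here's a memoized sketch of what that cache might look like, assuming a single tier and a `.count` read that tallies values (the class and all names are my guesses at the shape, not the original code):

```ruby
class CachedDb
  def initialize
    @db_versions = [{}]
    @merged = nil  # memoized merge_candidate result
  end

  def set(key, value)
    @merged = nil  # every write invalidates the cache
    @db_versions[-1][key] = value
  end

  def merge_candidate
    # Recompute the merged view only when the memo has been cleared.
    @merged ||= @db_versions.reduce({}) { |acc, cs| acc.merge(cs) }
  end

  def count(value)
    merge_candidate.values.count(value)  # reads hit the cache
  end
end
```

Every write pays a small cost (clearing the memo), and repeated reads between writes reuse one merged view instead of re-walking the tiers.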
Before I start down any of these routes, though, it’s wise to determine the benefit I would derive from doing so.
So I’m making a risk calculation: “How likely is it that a user has 100,000 transactions nested inside each other in this database, and what are the consequences if they do?”
Consequences: reads will be slow. Maybe prohibitively slow.
Likelihood: please let me know if you think of something here, because I’m coming up blank on a scenario where it would be wise to open 100,000 transactions nested inside one another.
Maybe we would ship this to users, I’d be wrong about that being an unlikely thing, and then we’d need to consider some of our proposed solutions. But before I introduce the complexity of cache invalidation to a code base, I need solid evidence that it’s solving a problem. If the database is slow in a scenario that no one gets into, that’s not a problem.
Again, not everyone will agree with this choice. A comment from an interviewer during my code review tipped me off that my decision here was probably another reason that the company did not move forward with me from this code challenge 🙃.
Since this is an implementation walkthrough, I don’t know how much it makes sense to write a conclusion. That said, I hope the decisions we’ve discussed in these time and space efficiency posts have reinforced a few ideas:
- We sometimes face a tradeoff between a time-efficient implementation and a space-efficient one.
- Even as we try to build a small, fast program, it makes sense to optimize for a clear API, which will save ourselves and other developers valuable time when maintaining this program in the future.
- When a program can be made smaller or faster, we want to consider how beneficial those changes will be. They may not be worth the risk.