Let’s talk about time and space efficiency in software. We’ll do it by talking through a sample problem. This is the first post in a three-part series:
This series is a supplement to theoretical resources, not a replacement. If you learn best from specific examples rather than general explanations, this post might help you.
This post is not an introduction to big O notation. If that’s what you’re looking for, you want:
Our Example Problem
An in-memory database…keeps the whole dataset in RAM.
Each time you query the database or update data, you only access main memory; no disk is involved in these operations.
As we implement this database, we’ll talk through additional tradeoffs in implementation details that affect:
- speed of record retrieval
- speed of adding, changing, removing records
- amount of memory the database takes up
Let’s get started.
Part 1: The Get and Set Operations
We’ll write our database in Ruby, and we’ll interact with it in a command line console.
First, we want to get and set items in our database, like this:
We start out in our test_database.rb file. If you’ve ever heard me open my mouth before on any subject remotely related to engineering, you already know that I think testing is important.
Our tests will help us drive out the API we want. Then, as we regularly rethink our implementation choices over performance concerns, our tests will ensure that our API stays consistent and our database still works.
So now we arrive at some implementation decisions.
What sort of data structure would make sense for this database?
- Use an array. Allow the array indices to index the objects, and iterate through them as needed. Row-oriented databases, like Postgres, and column-oriented databases, like Cassandra, do a (very, very) fancy version of this. Which one you choose depends on which axis you need to retrieve quickly. We touched on the tradeoffs of that decision in this post.
- Use a hash. Treat the keys as the indices to our values, and grab them out by those keys. Hash key lookup is fast AF. This post offers a pretty spectacular rundown on how and why. The tradeoff here: we’re no longer grouping our data by the rows or the columns. For us, right now, not an issue: we do not currently have a use case for iterating or grouping. We will in a second.
Let’s store our data as a hash:
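The actual implementation lives in the screenshots, but the idea can be sketched in a few lines (the class and method names here are my assumption, not necessarily what the post uses):

```ruby
# Minimal sketch of a hash-backed get/set database.
class TestDatabase
  def initialize
    @db = {}
  end

  # Hash key lookup is O(1) on average. fetch's second argument is the
  # default returned when the key is absent -- here, the string "NULL".
  def get(key)
    @db.fetch(key, "NULL")
  end

  def set(key, value)
    @db[key] = value
  end
end

db = TestDatabase.new
db.set("greeting", "hello")
db.get("greeting")  # => "hello"
db.get("missing")   # => "NULL"
```

Because the backing store is a hash, both get and set stay fast no matter how many records we add; what we give up is any cheap way to iterate in row or column order.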
Part 2: The Count Operation
Now that we can set keys and values, we want to be able to count how many times a given value exists in our database. (This is not necessary for keys, because setting a new value for a key overwrites the old one, so each key appears exactly once. That’s fine as long as database keys are unique, which is usually what developers expect.)
These tests describe the functionality we’re trying to get:
Ding-dangit, we just made a storage decision based on not needing to group stuff, and now you’re telling me we need to aggregate the counts of these values?
Well, that’s going to be slow. It’s going to be slow because, to do it, we have to iterate over every value in the database, check whether each one matches the thing we want a count of, and increment a counter when it does.
Maybe. But maybe not.
What if, instead, we traded in some extra space to save a bunch of time on this call? What if, for example, we kept a second hash with counts in it?
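Here’s a sketch of that trade (again, names are my assumption): set does a little extra bookkeeping in a second hash, @count, so that count becomes a single O(1) lookup instead of a full scan.

```ruby
# Space-for-time trade: maintain a second hash of value => occurrence count.
class TestDatabase
  def initialize
    @db = {}
    @count = Hash.new(0)  # default of 0 for values we've never seen
  end

  def set(key, value)
    # Overwriting a key un-counts its old value, keeping @count in sync.
    @count[@db[key]] -= 1 if @db.key?(key)
    @db[key] = value
    @count[value] += 1
  end

  def count(value)
    @count[value]  # Hash.new(0) gives us a forgiving 0 for absent values
  end
end

db = TestDatabase.new
db.set("a", "x")
db.set("b", "x")
db.set("a", "y")   # "x" under key "a" is overwritten
db.count("x")      # => 1
db.count("nope")   # => 0
```

Note the decrement in set: if we only ever incremented, overwritten values would keep stale counts.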
How much extra space would that take up?
Would it double our database footprint? Probably not. Why not:
- We only add key-value pairs here for unique values, not all values in the database
- The value for each @count key is an integer. Integers do not take up much space in the grand scheme of data structure space requirements.
As the database grows, the size of the @count hash relative to the @db hash will continue to shrink. How fast it shrinks depends on how often duplicate values are added, as well as on how complex the values in the database are.
Also as the database grows, we have solutions for space. We can buy more space. We can shard the database to spread it across more space. Advances in computing hardware offer us more space.
Time? None of that is true. You can’t buy time. And if a customer leaves your service because the calls are too slow, it’s really expensive to buy them back, or to buy back their friends from the bad word of mouth they’ve heard. All else equal, space inefficiency is far cheaper than time inefficiency. (There are exceptions to this: microprocessors working in isolated environments, for example. The “all else equal” is there for a reason. This is a rule of thumb, not a commandment carved in stone).
Part 3: The Delete Operation
Now we need to be able to delete keys from the database.
A Sidenote on API Design: is it weird that I’m returning the string “NULL” here? Yes. I’m doing that so you can quickly and clearly tell, in the irb console, where my database has returned a null value and where I’m calling a method that always has no return value (see the db.delete call in the screenshot above). To have the database return nil itself in the case of no value, change the fetch default on line 17 of our current database implementation from “NULL” to nil.
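The two behaviors differ only in the default argument handed to Hash#fetch (shown here on a raw hash rather than the full database class):

```ruby
db = {}
db.fetch("missing", "NULL")  # => "NULL"  -- visible marker in the irb console
db.fetch("missing", nil)     # => nil     -- the quieter alternative
```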
Here’s how we want delete to work in test format:
Another Sidenote on API Design: Notice that my database API is on the forgiving end of the spectrum. Looking for a count on a value that isn’t there? You get 0. Looking to get a key that isn’t there? You get “NULL.” Looking to delete a key that isn’t there? You get “That key isn’t in the database!” I’m not throwing exceptions.
Here is my choice situated among some alternatives, ordered from most forgiving to least forgiving:
- I could have executed with no message (like bash’s $ rm -rf somedirectory)
- I could have executed with a message (this is what I chose)
- I could have returned a tuple of values: the first a success response, the second an error. Swift APIs often do this, and some HTTP APIs try to facilitate this with a result key and an error key in the response.
- I could have raised an error in the weird cases. Python dictionaries do this in their dict["somekey"] syntax: if “somekey” isn’t there, I get a KeyError.
“Forgiving” sounds like a positive word, but here it’s a descriptor—not a normative judgement. There are tradeoffs. The more forgiving the API, the less likely it is that a harmless anomaly interrupts the code flow, but the more likely it is that a problem slips through unnoticed.
It is possible, for example, to $ rm -rf / on your console without so much as an error message (FOR THE LOVE OF GOD, DO NOT RUN THAT). That’s probably too forgiving of an API.
That said, having a database API that is relatively unforgiving, in the context of transactions, can lead to problems too. Maybe a transaction needs to make 10,000 changes and one of them is a little messed up. Instead of running the other 9999, the database rolls back, and the one issue has to be hunted down and fixed before the whole transaction runs again. We talked about how to write to a database in that exact situation right here.
So, for this example, I chose something pretty forgiving. The “right” choice here depends on your use case.
OK, let’s look at our implementation of delete:
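A sketch of what that forgiving delete might look like (the real implementation is in the screenshot; names and the exact message are assumptions carried over from the tests above):

```ruby
# Hash-backed database with a count hash that delete keeps in sync.
class TestDatabase
  def initialize
    @db = {}
    @count = Hash.new(0)
  end

  def set(key, value)
    @count[@db[key]] -= 1 if @db.key?(key)
    @db[key] = value
    @count[value] += 1
  end

  def delete(key)
    if @db.key?(key)
      @count[@db[key]] -= 1  # un-count the value we're about to remove
      @db.delete(key)
      nil
    else
      "That key isn't in the database!"  # forgiving: a message, not an exception
    end
  end
end

db = TestDatabase.new
db.set("a", "x")
db.delete("a")  # => nil
db.delete("a")  # => "That key isn't in the database!"
```

Note that delete has to touch @count too: the speed of count rests on that hash never drifting out of sync with the real data.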
Conclusion: We have our CRUD commands!
```ruby
db.set "My", "Value"
db.set "My", "Different Value"
```
We’ve made a few performance decisions so far. The performance decisions get a little more interesting when we start adding transaction support.
This post is getting a bit long, though, so we will add transaction support in the next post of this series.
If you’re enjoying this series, you might also like:
This post on modifying authentication behavior in devise (a Ruby auth gem)
This post on mentors and sponsors (I don’t know…I’m proud of this one, and maybe sometimes you like a break from reading code?)