There are two courses that seem to universally strike fear in my students’ hearts. One of them is algorithms.

I hear about this dingdang class from multiple students every quarter. Most of my own algorithms knowledge comes from learning on the job; I have never taken a class on the subject. However, I have a lot of experience getting up to speed on things without the aid of a mandatory curriculum. So I decided to walk myself through an algorithms class and share my process.

Here’s the first thing to know about me: I don’t follow MOOCs. Why: I am not smart in the way that MOOCs (and CS classes especially) bank on the students being smart.

- I don’t read fast.
- I can’t pay attention for long to a lector who is below about a 7 out of 10 engaging. That knocks out most algorithms instructors for institutional reasons we might get to later.
- Mathematical notation does not come naturally to me. I might do better with a cheat sheet, but unfortunately I’ve never found a good one for the kind of notation you see beyond an introductory calculus class. So I tend to have to painstakingly sit there and translate equations into plain English.

Most of academic tradition caters to the most gifted, knowledgable, or advanced students in the class. I think this is because university professors are almost exclusively people for whom this system worked well. Their teaching propagates traditions that, as far as they know, work fine. But that doesn’t work for boneheads like me—at least not without supplementation.

### I don’t mind starting with a substrate, though.

At first I chose the 2012 recording of MIT’s “Introduction to Algorithms” on YouTube as my substrate. I’ll go through the videos one by one, but I’ll also share with you the supplemental images, videos, readings, and exercises I use to bridge the gaps between what *I* think are the biggest missing stairs that these lectures ask students to leap over.

They start with this video, which is, I guess, fine.

I learned from it:

- Why the efficiency of a lot of algorithms is n log n (indicates some kind of binary search is going on). Log, in this case, is log base 2, since stuff is getting split into 2 parts
- The students who ask questions in this class are, to me, very obviously the ones who don’t need to be here, which is one of the pitfalls of pedagogical tradition in general and probably honestly MIT specifically
- theta and T are assumed knowledge here. So I looked that up and was enlightened. It turns out the two are
*interchangeable*terms that refer to “the function we use to calculate the running time of this algorithm. Theta-n refers to the running time as a function of n, n being the number of items in the input list.” CS profs:**please, never use two interchangeable terms without telling the class they are interchangeable.**This is a super-efficient way to lose a lot of people.

While I was rooting around on the internet, I also found this visualization of different common running efficiencies. It gave me a great way to understand at a glance *how much faster* one procedure might be than another based on their thetas.

I *later* found this introductory video, which worked a lot better for me. I’m including it here because this is the one I’d recommend first to someone getting started on a video series about algorithms, but in reality I found this series after I was about 8 videos into my original series, and I’ll talk more about it later.

After “peak finding” I watched this video, which in my opinion should probably have been before “peak finding”:

Honestly, that video is excellent, and it motivated me to go deeper on which of these algorithms power the parts of Python that I use on a daily basis. Such a tie-in will also be useful for teaching my Python programming students. I find it much easier to connect with material if I have a plausible use case. It doesn’t even have to be something I *will* do with it—it could just be something I *might* do with it.

So then I watched this video:

The time-space tradeoff is an interesting one. I teach my students about caching and mention this tradeoff, so it might be useful, in the event that sorting comes up, to mention that insertion sort vs. merge sort demonstrate this same tradeoff since insertion sorting can be done in place and merge sorting requires copying stuff.

Here’s insertion sort:

I didn’t fully get how merge sort worked from this lecture, so I picked up a book called *A Common Sense Guide to Data Structures and Algorithms* by Jay Wengrow. This is a trick that I *love* using for learning something: interleave the perspectives of an academic and a practitioner. I find that the academic material often helps me understand the theoretical underpinnings and (when it’s good) the history of how and why we got to this. The practitioner material, by contrast, focuses on the part I would need to know to interact with this on a daily basis.

I stumbled upon this trick while working as a machine learning engineer. All of the eight “not-yet-at-deep-learning” courses I took required the students to implement gradient descent from scratch—which, for the uninitiated, is analogous to teaching structural engineers to solve calculator problems with a pencil and paper. You do it to teach people how to *think*, but you don’t expect them to do that *on the job*. Every machine learning library does gradient descent for you; you will never have to do this on your own at a job. I’m glad to have implemented it, to secure my knowledge of the procedure.

The issue isn’t what these courses include: it’s what they omit. *None* of those classes went over “how easy it is to accidentally make a racist model and deploy it to prod.” And I saw that reflected in industry: data scientists with Ivy League Triple Crowns would walk into my conference room for a job interview and wax poetic about, I am not kidding you, the system they built to judge creditworthiness *based on a person’s passport photo*. Any decent practitioner would *walk off* a job that asked them to do that, and any decent resource *written* by a practitioner addresses model bias almost *immediately*. So I want *both* perspectives to make sure I know what’s going on.

Anyway, I searched the index of *A Common Sense Guide *for “merge sort” (page 222). There I found a sentence like “merge sort is a beautiful algorithm; you should look it up.” So I looked it up. Turn your sound *down* before you press play on this, and focus on the illustrations because the formula in the video has an inaccuracy (see the comments for details).

**Here’s what’s interesting to me about merge sort: it strikes me as a sort of backwards version of quick sort.** This curriculum hasn’t gotten to quick sort yet, but this video provides the clearest example I’ve ever seen of the concept:

You’ll notice commenters under that video complaining that Derrick’s implementation copies the data. It does. Meanwhile, merge sort requires copying data, so for the purposes of this comparison it’s a perfectly acceptable implementation to use.

Quick sort splits the data on a pivot element, sorts it into “less than pivot” and “greater than pivot”, then calls itself on those two collections, and finally returns the resulting (sorted) collections concatenated with the pivot in the middle. At the “bottom recursion,” quick sort already knows that each item in each sub-collection falls between the greatest item in the next “lesser” sub-collection and below the smallest item in the next “greater” sub-collection. The task that remains is only to go back up the tree gluing the sub-collections together.

Merge sort, at the “bottom recursion,” has not begun sorting yet at all—it has instead, essentially, done indiscriminate binary splits, and it handles sorting on the way back “up” the recursion tree (that’s what you’re seeing in the merge sort video when each element is getting outlined in red—the algorithm is comparing the smallest remaining item in each of the two sub-collections it’s merging to determine which one to tack onto the tail of the sorted, combined copy).

Now, what’s my opinion on which approach is “better?” I don’t have one. Instead, I’m recording this observation because identifying exactly *how* two related concepts relate to each other helps me remember them both later.

Okay. We’re three videos into the algorithm series, and I’ve provided you with my supplemental resources, pedgogical thinking (as a teacher), and autodidactic thinking (as a student) that helps me understand and recall these concepts. I think that’s enough raw material for now. We’ll continue with some more algorithms videos in the next post of this series.

### If you liked this piece, you might also like:

The compilers series—yes, this is really the kind of person I am

This post about engineering for edge cases, so you get some practical stuff with your theoretical stuff

This piece on documentation, which people who spend their spare time learning about algorithms should probably also read 🙂