I teach a Mobile Software Development class. My students ask excellent questions. If I cannot answer a question thoroughly, concisely, and accurately on the spot, I defer and answer after class in a follow-up. This is one such question.
Question: We learned about using `git log` to see our history of commits, about `git log path/to/file` to see commits that changed a specific file while it has had its current filename, and about `git log --follow path/to/file` to see commits that changed a specific file throughout its history, even if it has been renamed.
How does git know that a file has been renamed?
I decided to track down the answer to this to not only show you the answer, but also to show you how I go about answering questions like these.
I started with a Google search “How does git follow renames” and found this StackOverflow answer in the first three results. People have varying opinions on StackOverflow. I sometimes find it useful as a starting point for general questions or for googling very specific error messages. If I think that the answer I’m looking for is more likely to be best explained in somebody’s blog post, I’ll filter out StackOverflow answers in my Google search results by adding “-stackoverflow” to the end of my search query terms, like so:
In any case, even when I look at StackOverflow, I verify what I find there. My favorite sources for getting information about software are:
– the implementation of the software itself (git is open source)
– running my own experiments with the software
– the documentation
You’ll see all of those in this answer.
So first, we can learn something by reading the output that we see when we use git.
Git tracks changes to a code base through additions and subtractions. The underlying implementation doesn’t have a concept of “Change A to B.” Instead, it’s “Delete A, add B.”
Here’s a one-line example. When we make a new file with some text in it,
And then change that text,
We changed the line “I am a file.” to say “You’ll never guess.” But to git, that looks like:
1. Remove the line “I am a file.”
2. Add the line “You’ll never guess.”
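To make the "delete A, add B" idea concrete without git, here is a minimal sketch using Python's standard `difflib` module. It is not git's implementation, and the helper name `changed_lines` is made up for illustration; it only shows that a line-based diff reports an edit as one removal plus one addition.

```python
import difflib

# One-line "file" before and after the edit described above.
old = ["I am a file.\n"]
new = ["You'll never guess.\n"]


def changed_lines(before, after):
    """Return only the removed ('-') and added ('+') lines of a unified diff.

    Hypothetical helper for illustration; git produces its diffs with its own code.
    """
    diff = difflib.unified_diff(
        before, after, fromfile="a/i_am_a_file.txt", tofile="b/i_am_a_file.txt"
    )
    return [
        line.rstrip("\n")
        for line in diff
        # Keep change lines, skip the '---' / '+++' file headers.
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]


print(changed_lines(old, new))
```

There is no "changed" marker in the result, only a line that went away and a line that appeared.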
It works the same way for whole files in git.
So if I now rename the file “i_am_a_file.txt” to “guess_what_I_am.txt”:
To git, that looks like:
1. Remove the file “i_am_a_file.txt”
2. Add the file “guess_what_I_am.txt”
This is true even if we rename a file to which we have made no other changes whatsoever.
Here, I undid the change on line 2 of the file. You can tell it is unchanged because when I run “git status,” git reports no changes. Then I rename the file with the mv command (moving a file and renaming it are the same thing as far as your command line is concerned). Git again reports that we have deleted a file, but have obtained a new, untracked file. It isn’t checking the contents.
When we _commit_ the file with the new name, though:
Git now detects the rename! What gives?
Here’s how git detects a rename under the hood:
- Were any files _deleted_ in this commit? That is, are there files that were committed in the last commit, that as of this commit, are _gone_? These are candidate files that might have been renamed.
- Were any files _added_ in this commit? That is, are there files that were not there in the last commit, and as of this commit, exist? These are candidates for files _to which the deleted files_ might have been renamed.
- At this point, git runs a diff algorithm on the candidate files. Running it across the whole repo all the time would be computationally expensive, so git first narrows the rename candidates down to the newly deleted files and the newly added files.
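The three steps above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not git's code: commits are modeled as plain dictionaries mapping path to contents, the names `similarity` and `detect_renames` are hypothetical, and `difflib.SequenceMatcher` stands in for whatever similarity score git actually computes.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in similarity score in [0, 1]. Assumption: git scores files its own way."""
    return SequenceMatcher(None, a, b).ratio()


def detect_renames(before: dict, after: dict, threshold: float = 0.5):
    """Pair deleted paths with added paths whose contents are similar enough."""
    deleted = [p for p in before if p not in after]  # candidates that were renamed
    added = [p for p in after if p not in before]    # candidates they were renamed to
    renames = []
    for old_path in deleted:
        # Only deleted-vs-added pairs are compared, never the whole repo.
        scored = [(similarity(before[old_path], after[new_path]), new_path)
                  for new_path in added]
        if not scored:
            continue
        score, best = max(scored)
        if score >= threshold:
            renames.append((old_path, best))
            added.remove(best)  # each added file can be the target of one rename
    return renames


before = {"i_am_a_file.txt": "I am a file.\nSecond line.\n", "README": "hello\n"}
renamed = {"guess_what_I_am.txt": "I am a file.\nSecond line.\n", "README": "hello\n"}
rewritten = {"guess_what_I_am.txt": "zzzz\n", "README": "hello\n"}

print(detect_renames(before, renamed))    # identical contents: reported as a rename
print(detect_renames(before, rewritten))  # mostly different: a delete plus an add
```

The unchanged `README` never enters the comparison at all, which is the point of narrowing down to deleted and added files first.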
You have access to the command git uses to do this: `git diff -M`. The command `git diff` shows you the individual changes in a file, a branch, or the whole repo if you want, depending on what you pass in. The `-M` flag stands for “detect move” or, for our purposes, detect a rename.
Here’s what the git documentation says about that flag:
Detect renames. If n is specified, it is a threshold on the similarity index (i.e. amount of addition/deletions compared to the file’s size). For example, -M90% means Git should consider a delete/add pair to be a rename if more than 90% of the file hasn’t changed. Without a % sign, the number is to be read as a fraction, with a decimal point before it. I.e., -M5 becomes 0.5, and is thus the same as -M50%. Similarly, -M05 is the same as -M5%. To limit detection to exact renames, use -M100%. The default similarity index is 50%.
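The number format in that passage is easy to misread, so here is a small sketch of the rule exactly as the documentation states it. The function `parse_similarity` is a hypothetical helper written for this post, not git's own parser, and it only handles the forms the quoted text describes.

```python
def parse_similarity(n: str) -> float:
    """Turn the n in -M<n> into a fraction, following the quoted documentation.

    Hypothetical helper: an illustration of the documented rule, not git's parser.
    """
    if n.endswith("%"):
        # "-M90%" means 90 percent.
        return float(n[:-1]) / 100
    if not n.isdigit():
        raise ValueError(f"unsupported threshold: {n!r}")
    # Without a % sign, read the digits with a decimal point in front: "5" -> 0.5.
    return float("0." + n)


for arg in ["90%", "5", "50%", "05", "5%", "100%"]:
    print(f"-M{arg} -> {parse_similarity(arg)}")
```

Note the trap the documentation warns about: `-M5` and `-M05` are a factor of ten apart.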
How does the diff algorithm work?
That depends. Git gives you four choices for which diff algorithm to use:
Choose a diff algorithm. The variants are as follows:
- `default`, `myers`: The basic greedy diff algorithm. Currently, this is the default.
- `minimal`: Spend extra time to make sure the smallest possible diff is produced.
- `patience`: Use “patience diff” algorithm when generating patches.
- `histogram`: This algorithm extends the patience algorithm to “support low-occurrence common elements”.
The documentation suggests that Myers is the default algorithm, and the default similarity index is 50%. So, theoretically, a new file has to be more than half the same as a deleted file in a given commit to be labeled a rename. Sure enough, if I change what I think is probably “more than half” the file and then rename it, git no longer picks up the rename. It instead logs this as the deletion of an old file and the addition of a new file:
I am making a guess as to what constitutes “more than half.” Out of curiosity, I decided to go see if I could understand this better. Here’s the repo containing git. I decided to search it for mentions of this (you can do this by adding “/search” to the end of any repo URL on GitHub).
I searched “Myers.” Here were my search results. Of particular interest was this:
Let’s click on that and have a look. It goes to this file. The comments at the top of the file tell us that this is LibXDiff by Davide Libenzi. Interesting. Sounds like a dependency.
When we google that, we get to a description of LibXDiff:
The LibXDiff library implements basic and yet complete functionalities to create file differences/patches to both binary and text files.
That sounds right.
The line we saw in our search is the beginning of this long comment in the file.
This sounds like it’s describing the algorithm. I don’t really know what this means, and I don’t think I need to understand the ins and outs of the Myers algorithm to grasp the basics of git’s rename detection. So I’m going to stop here. (You’re more than welcome to look up the paper that the comment recommends if you’re dying to know how Myers works, of course.)
But worth noting: git doesn’t store data changes as renames. It detects them after the fact when asked to. It stores changes as additions and subtractions. And it will fetch data for you in those terms, too, unless you specifically use the `--follow` flag to ask it to use its diff strategy to look for those renames for you:
Juicy Gossip: There is evidently a modicum of drama around the `--follow` flag, as git’s notorious progenitor Linus Torvalds supposedly said it’s a backport for ‘SVN noobs’ and renames don’t matter.
First of all, on insulting the intelligence of subversion users: Linus is notorious for denigrating other programmers. In a talk at Google he felt compelled to reiterate every 6 minutes, on average, that he thinks programmers who aren’t him are stupid and incompetent. After 25 years he finally stepped away from leadership on the Linux kernel in part, we’re told, precisely to address his pattern of vicious verbal abuse toward other technologists.
On “renames don’t matter”: I think renames matter, but more importantly, so does Linus. Here, I searched the git repo for commits authored or committed by him that explicitly mention renames in the commit message. There are over 100. A lot of them mention “rename” because they deal with git functionality for renames in git repos, but on the first four pages of results, I found five commits exclusively or chiefly devoted to renaming a command or a module in git itself. If Linus really thought renames didn’t matter, he wouldn’t rename things.
It’s a reminder: what people say in this industry often won’t hold up to research.
If you liked this piece, you might also like:
The time we discovered that people who tout numpy vectorization don’t know what it does
The time we discovered that people who scoff at SOAP don’t know what it does
The time we discovered that classic signs of bug-prone code don’t seem to cause bugs
Thank you for the nice overview!
In a way, it is true that “renames don’t matter”, because internally the content is kept as a nameless blob. The names are associated to the blobs in the context of a given work-tree. Of course, without the names such blobs would be very much useless.
The renames basically dissociate a blob from an old name and associate it with a new name. A tricky thing begins when one tries to merge renames across branches.
I went through this interesting read, but my question still remains: how can I fool git into considering that at least 50% of a file is different? The files are too small for a change to look like a difference, and the compiler ignores comments if we add them.