git finding duplicate commits (by patch-id)

Question

I'd like a recipe for finding duplicated changes. patch-id is likely to be the same but the commit attributes may not be.

This seems to be an intended use of patch-id:

git patch-id --help

IOW, you can use this thing to look for likely duplicate commits.

I imagine that stringing together "git log", "git patch-id" and uniq could do the job badly, but if someone has a command that does the job well, I'd appreciate it.

This is a fascinating feature. Out of curiosity, how far back in the past are you intending to look? I could see some creative integration uses for this (i.e. "my contributor doesn't know how to rebase"), but over long history it would be less effective...? — Christopher, Jul 23 '12 at 02:23
The issue appeared in a week long history of a single branch, so my use case was quite gentle (git log -p was enough). The patch-id comment got me curious though... Searching all history could be painful. — bsb, Jul 25 '12 at 00:43

score 12 · Answer 1 · answered Jul 23 '12 at 07:31

Because the duplicate changes are unlikely to be on the same branch (except when there are reverts in between them), you could use git cherry:

git cherry [-v] [<upstream> [<head> [<limit>]]]

Where upstream would be the branch to check for duplicates of changes in head.
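As a rough illustration: git cherry -v prefixes each commit of head with "-" when upstream already contains a change with the same patch-id, and with "+" otherwise. A small wrapper that keeps only the duplicates could look like the sketch below. The helper name already_upstream and the branch names main and topic are placeholders, not part of the original answer.

```shell
# Sketch: list commits on <head> whose change already exists on <upstream>.
# already_upstream is a hypothetical helper name; main/topic are placeholders.
already_upstream() {
    upstream=$1
    head=${2:-HEAD}
    # git cherry marks equivalent commits with "-" and unmatched ones with "+".
    # cut drops the two-character marker prefix, leaving "<sha> <subject>".
    git cherry -v "$upstream" "$head" | grep '^-' | cut -c3-
}

# Example: already_upstream main topic
```

Note that git cherry only compares the commits that are on head but not on upstream, so it will not report duplicates that live on the same branch.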

Slipp D. Thompson · Answer 2 · 2016-12-06T22:07:33.107

For looking for duplicates of a specific commit, this may work for you.

First, determine the patch id of the target commit:

$ THE_COMMIT_REF_OR_SHA_YOURE_SEEKING_DUPES_OF='7a3e67c'
$ git show $THE_COMMIT_REF_OR_SHA_YOURE_SEEKING_DUPES_OF | git patch-id
f6ea51cd6acd30cd627ce1a56e2733c1777d5b52 7a3e67ce38dbef471889d9f706b9161da7dc5cf3

The first SHA is the patch-id. Next, list the patch ids for every commit and keep only those that match:

$ for c in $(git rev-list --all); do git show $c | git patch-id; done | grep 'f6ea51cd6acd30cd627ce1a56e2733c1777d5b52'
f6ea51cd6acd30cd627ce1a56e2733c1777d5b52 5028e2b5500bd5f4637531e337e17b73f5d0c0b1
f6ea51cd6acd30cd627ce1a56e2733c1777d5b52 7a3e67ce38dbef471889d9f706b9161da7dc5cf3
f6ea51cd6acd30cd627ce1a56e2733c1777d5b52 929c66b5783a0127a7689020d70d398f095b9e00

All together, with a few extra flags, and in the form of a utility script:

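#!/bin/sh
# List every commit whose patch-id matches that of the given commit (default: HEAD).
TARGET_COMMIT_SHA="${1:-HEAD}"

# The patch-id is the first field of git patch-id's output.
TARGET_COMMIT_PATCHID=$(
git show --patch-with-raw "$TARGET_COMMIT_SHA" |
    git patch-id |
    cut -d' ' -f1
)
# Compute the patch-id of every commit and keep the commits that match.
MATCHING_COMMIT_SHAS=$(
for c in $(git rev-list --all); do
    git show --patch-with-raw "$c" |
        git patch-id
done |
    grep -F "$TARGET_COMMIT_PATCHID" |
    cut -d' ' -f2
)

echo "$MATCHING_COMMIT_SHAS"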

Usage:

$ git list-dupe-commits 7a3e67c
5028e2b5500bd5f4637531e337e17b73f5d0c0b1
7a3e67ce38dbef471889d9f706b9161da7dc5cf3
929c66b5783a0127a7689020d70d398f095b9e00

It isn't terribly speedy, but for most repos should get the job done (just measured 36 seconds for a repo with 826 commits and a 158MB .git dir on a 2.4GHz Core 2 Duo).

I may be the only one confused, but just in case, "target-commit" is not a literal; replace it with the SHA of the commit you want to get a patch ID for. — Jimothy, Aug 20 '14 at 14:10
@Jimothy Yep, or a branch name or a tag name (any ref, I guess). I'll see if I can make it a bit clearer. — Slipp D. Thompson, Aug 20 '14 at 18:43

score 4 · Answer 3 · answered Jul 25 '12 at 04:30

I have a draft that works on a toy repo, but as it keeps the patch->commit map in memory it might have problems on large repos:

# print commit pairs with the same patch-id
for c in $(git rev-list HEAD); do \
    git show $c | git patch-id; done \
| perl -anle '($p,$c)=@F;print "$c $s{$p}" if $s{$p};$s{$p}=$c'

The output should be pairs of commits with the same patch-id (3 duplicates A B C come out as "A B" then "B C").
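If one line per group of duplicates is preferred over overlapping pairs, the Perl step could be swapped for something like the awk sketch below. It reads the same "<patch-id> <commit>" lines on stdin. The helper name group_dupes is made up for illustration, and like the Perl version it keeps the whole map in memory.

```shell
# Sketch: read "<patch-id> <commit>" lines on stdin and print one line per
# patch-id that occurs more than once, listing all of its commits.
# group_dupes is a hypothetical helper name.
group_dupes() {
    awk '
        {
            # Append the commit to the list kept for this patch-id.
            if ($1 in commits) commits[$1] = commits[$1] " " $2
            else               commits[$1] = $2
            count[$1]++
        }
        END {
            # Only patch-ids seen at least twice are duplicates.
            for (p in count) if (count[p] > 1) print commits[p]
        }
    '
}

# Example:
#   for c in $(git rev-list HEAD); do git show "$c" | git patch-id; done | group_dupes
```

With this, three duplicates A B C come out as a single line "A B C". The order in which groups are printed is unspecified.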

Change the git rev-list command to restrict the commits checked:

git log --format=%H HEAD somefile

Append "| xargs git show" to view the commits in detail, or "| xargs git show -s --oneline" for a summary:

0569473 add 6-8
5e56314 add 6-8 again
bece3c3 comment
e037ed6 add comment again

It turns out patch-id didn't work in my original case as there were additional changes in that later commit. "git log -S" was more useful.
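For reference, git log -S<string> (the "pickaxe") lists commits that change the number of occurrences of a string, so it can still find a change that was re-applied as part of a larger commit. A minimal sketch follows; the helper name commits_changing and the search string in the example are placeholders.

```shell
# Sketch: list commits that add or remove occurrences of a literal string.
# commits_changing is a hypothetical helper; extra arguments go to git log.
commits_changing() {
    needle=$1
    shift
    git log --oneline -S"$needle" "$@"
}

# Example: commits_changing 'some_function(' HEAD -- somefile
```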

bsb

If you want to look at just at raw diffs between a commit and its parent, you could do something like `git diff $c~1 $c | git patch-id`. It's going to misbehave on merge commits. Following both merge parents is a more complex problem. – Christopher Jul 25 '12 at 09:57
It looks like patch-id finds the same diff? `git diff HEAD~1 HEAD | git patch-id` gives `3318362fa07e580.. 000000000000..`, while `git show HEAD | git patch-id` gives `3318362fa07e580.. c397c4cdc426..` – bsb Jul 26 '12 at 23:02
@bsb Are you sure you wanted to write `git show $c | git patch-id`? `git show` prints metadata, but `git patch-id` needs a patch as input... – Daniel Alder Apr 23 '14 at 16:02
@daniel-alder, I think I need the `show` rather than `diff` since that allows the Perl to print the duplicate commits (otherwise I just get a whole lot of zero shas). [The code](http://tinyurl.com/git-patch-id) skips non-diff input (although perhaps older versions don't support this, what version are you using?) – bsb Apr 28 '14 at 06:01
@bsb thx for this explanation. I checked again and saw the diff. `git show` and `git patch-id` seem to cooperate nicely, but only for normal commits. for merges it doesn't seem to show any diff, that was my problem. tested with 1.7.10.4 and 1.9.1 – Daniel Alder Apr 28 '14 at 19:52

unagi · Answer 4 · 2017-08-31T00:11:24.577

To search for duplicate commits of commit $hash, excluding merge commits:

git rev-list --no-merges --all | xargs -r git show | git patch-id \
    | grep ^$(git show $hash|git patch-id|cut -c1-40) | cut -c42-81 \
    | xargs -r git show -s --oneline

For searching the duplicate of a merge commit $mergehash, replace $(git show $hash|git patch-id|cut -c1-40) above by one of the two patch IDs (1st column) given by git diff-tree -m -p $mergehash | git patch-id. They correspond to the diffs of the merge commit with each of its two parents.
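To print those per-parent patch IDs without copying them by hand, the diff-tree pipeline can be wrapped in a small function. This is only a sketch; merge_patch_ids is a hypothetical helper name.

```shell
# Sketch: print the patch IDs of a merge commit, one per parent.
# merge_patch_ids is a hypothetical helper name.
merge_patch_ids() {
    # -m makes diff-tree emit a separate diff against each parent;
    # the first column of git patch-id's output is the patch ID.
    git diff-tree -m -p "$1" | git patch-id | cut -d' ' -f1
}

# Example: merge_patch_ids "$mergehash"
```

A parent whose diff against the merge is empty contributes no line, so fewer IDs than parents may be printed.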

To find duplicates of all commits, excluding merge commits:

git rev-list --no-merges --all | xargs -r git show | git patch-id \
    | sort | uniq -w40 -D | cut -c42-81 \
    | xargs -r git log --no-walk --pretty=format:"%h %ad %an (%cn) %s" --date-order --date=iso

The search for duplicate commits can be extended or limited by changing the arguments to git rev-list, which accepts numerous options. For example, to limit the search to a specific branch specify its name instead of the option --all; or to search in the last 100 commits pass the arguments HEAD ^HEAD~100.

Note that these commands are fast since they use no shell loop, and batch-process commits.

To include merge commits, remove the option --no-merges, and replace xargs -r git show by xargs -r -L1 git diff-tree -m -p. This is much slower because git diff-tree is executed once per commit.

Explanation:

The first line generates a map of the patch IDs with the commit hashes (2-column data, of 40 characters each).
The second line only keeps commit hashes (2nd column) corresponding to the duplicate patch IDs (1st column).
The last line prints custom information about the duplicate commits.

score 2 · Answer 5 · answered Jul 28 '17 at 12:26

The nifty command suggested by bsb requires a couple of small tweaks:

(1) Instead of git show, which runs git diff-tree --cc, the command should use

    git diff-tree -p

Otherwise git patch-id generates spurious null SHA1 hashes.

(2) When the pipe to xargs is used, xargs should have the -L 1 argument. Otherwise a triplicated commit will not be paired with an equivalent commit.

Here's an alias to go in ~/.gitconfig:

dup = "!f() { for c in $(git rev-list HEAD); do git diff-tree -p $c | git patch-id; done | perl -anle '($p,$c)=@F;print \"$c $s{$p}\" if $s{$p};$s{$p}=$c' | xargs -L 1 git show -s --oneline; }; f" # "git dup" lists duplicate commits

score 0 · Answer 6 · answered Jun 17 '19 at 16:07

For anyone wanting to do this in Windows PowerShell, the equivalent of the command in unagi's answer is:

git rev-list --no-merges --all  | %{&git.exe show $_} | 
  git patch-id | ConvertFrom-String -PropertyNames PatchId, Commit | 
  Group-Object PatchId | Where-Object count -gt 1 | 
  %{$_.group.Commit + " "}

Gives an output like:

1605e0e1e13d7b3f456c20432d8edec664ca7117
1e8efa8f2f01962a2c08fd25caf687d330383428

b45b6db084b27ae420ac8e9cf6511110ebb46513
4a2e1e3ba5a9a1d5db1d00343813e1404f6124e2

With the duplicate commit hashes grouped together.

CAUTION: On my repo this was a slow command so be sure to filter the call to rev-list appropriately!

score 0 · Answer 7 · answered Aug 31 '20 at 16:10

Make sure to use a recent version of Git

The git log --format=%H approach mentioned in the answer by bsb (the OP) does not always yield unique patch IDs.

That is because, before Git 2.29 (Q4 2020), the patch-id computation did not ignore the "incomplete last line" marker the way it ignores whitespace.

See commit 82a6201 (19 Aug 2020) by René Scharfe (rscharfe).
(Merged by Junio C Hamano -- gitster -- in commit 5122614, 24 Aug 2020)

patch-id: ignore newline at end of file in diff_flush_patch_id()

Reported-by: Tilman Vogel
Initial-test-by: Tilman Vogel
Signed-off-by: René Scharfe

Whitespace is ignored when calculating patch IDs.
This is done by removing all whitespace from diff lines before hashing them, including a newline at the end of a file.
If that newline is missing, however, diff reports that fact in a separate line containing "\ No newline at end of file\n", and this marker is hashed like a context line.

This goes against our goal of making patch IDs independent of whitespace.

Use the same heuristic that 2485eab55cc (git-patch-id: do not trip over "no newline" markers, 2011-02-17) added to git patch-id instead and skip diff lines that start with a backslash and a space and are longer than twelve characters.
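To check whether the installed Git already includes this fix, a script could compare the version number first. The following is a minimal sketch; git_at_least is a hypothetical helper name, and it assumes git version prints the usual "git version X.Y.Z" line.

```shell
# Sketch: succeed if the installed Git is at least <major>.<minor>.
# git_at_least is a hypothetical helper name.
git_at_least() {
    # "git version 2.29.0" -> the third field is the version number.
    git version | awk -v maj="$1" -v min="$2" '{
        split($3, v, ".")
        # Exit status 0 (success) when the version is new enough, 1 otherwise.
        exit !((v[1] + 0) > maj || ((v[1] + 0) == maj && (v[2] + 0) >= min))
    }'
}

# Example:
#   git_at_least 2 29 || echo "Git older than 2.29: see the patch-id fix above" >&2
```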
