11

I'd like a recipe for finding duplicated changes. patch-id is likely to be the same but the commit attributes may not be.

This seems to be an intended use of patch-id:

git patch-id --help

IOW, you can use this thing to look for likely duplicate commits.

I imagine that stringing together "git log", "git patch-id" and uniq could do the job badly, but if someone has a command that does the job well, I'd appreciate it.

Amber
bsb
  • This is a fascinating feature. Out of curiosity, how far back in the past are you intending to look? I could see some creative integration uses for this (i.e. "my contributor doesn't know how to rebase"), but over long history it would be less effective...? – Christopher Jul 23 '12 at 02:23
  • The issue appeared in a week long history of a single branch, so my use case was quite gentle (git log -p was enough). The patch-id comment got me curious though... Searching all history could be painful. – bsb Jul 25 '12 at 00:43

7 Answers

12

Because the duplicate changes are unlikely to be on the same branch (except when there are reverts in between them), you could use git cherry:

git cherry [-v] [<upstream> [<head> [<limit>]]]

Where upstream would be the branch to check for duplicates of changes in head.
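
A minimal sketch of what that looks like, run in a throwaway repository (the branch name topic, the file names, and the commit messages are all made up for illustration). One change is committed on topic and then cherry-picked onto the original branch. git cherry marks it with "-" because an equivalent patch already exists upstream, while "+" marks commits with no equivalent there.

```shell
#!/bin/sh
# Throwaway demo repo; every name below is illustrative only.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "demo@example.com"
git config user.name "Demo"

echo base > file.txt
git add file.txt
git commit -q -m "base"
main=$(git symbolic-ref --short HEAD)   # whatever the default branch is called

git checkout -q -b topic
echo fix >> file.txt
git commit -q -am "fix on topic"
echo extra > other.txt
git add other.txt
git commit -q -m "topic-only change"

# Let the original branch diverge, then duplicate the first topic commit onto it.
git checkout -q "$main"
echo unrelated > main.txt
git add main.txt
git commit -q -m "unrelated work on main"
git cherry-pick topic~1 >/dev/null

# "-" = an equivalent change is already in $main, "+" = not there yet.
cherry_out=$(git cherry -v "$main" topic)
echo "$cherry_out"
```

The lines starting with "-" are the duplicates; their SHAs are the ones on topic, not the copies on the upstream branch.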

robinst
11

For looking for duplicates of a specific commit, this may work for you.

First, determine the patch id of the target commit:

$ THE_COMMIT_REF_OR_SHA_YOURE_SEEKING_DUPES_OF='7a3e67c'
$ git show $THE_COMMIT_REF_OR_SHA_YOURE_SEEKING_DUPES_OF | git patch-id
f6ea51cd6acd30cd627ce1a56e2733c1777d5b52 7a3e67ce38dbef471889d9f706b9161da7dc5cf3

The first SHA is the patch-id. Next, list the patch ids for every commit and keep only those that match:

$ for c in $(git rev-list --all); do git show $c | git patch-id; done | grep 'f6ea51cd6acd30cd627ce1a56e2733c1777d5b52'
f6ea51cd6acd30cd627ce1a56e2733c1777d5b52 5028e2b5500bd5f4637531e337e17b73f5d0c0b1
f6ea51cd6acd30cd627ce1a56e2733c1777d5b52 7a3e67ce38dbef471889d9f706b9161da7dc5cf3
f6ea51cd6acd30cd627ce1a56e2733c1777d5b52 929c66b5783a0127a7689020d70d398f095b9e00
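
To see why the first column is what identifies a duplicate, here is a small sketch in a throwaway repository (the branch name copy, the file name, and the commit messages are invented for the demo). The same edit is committed independently on two branches with different messages, so the commit SHAs differ while the patch IDs should match.

```shell
#!/bin/sh
# Throwaway demo repo; every name below is illustrative only.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "demo@example.com"
git config user.name "Demo"

echo base > file.txt
git add file.txt
git commit -q -m "base"
main=$(git symbolic-ref --short HEAD)

# The same edit, committed independently on two branches.
git checkout -q -b copy
echo change >> file.txt
git commit -q -am "change (copy)"

git checkout -q "$main"
echo change >> file.txt
git commit -q -am "change (original)"

# Column 1 is the patch-id, column 2 is the commit SHA.
id_a=$(git show "$main" | git patch-id)
id_b=$(git show copy | git patch-id)
echo "$id_a"
echo "$id_b"
```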

All together, with a few extra flags, and in the form of a utility script:

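#!/bin/sh
# List every commit whose patch-id matches that of the given commit.
# Defaults to HEAD when no commit is given.
TARGET_COMMIT_SHA="${1:-HEAD}"

# Patch-id of the target: first column of the `git patch-id` output.
TARGET_COMMIT_PATCHID=$(
git show --patch-with-raw "$TARGET_COMMIT_SHA" |
    git patch-id |
    cut -d' ' -f1
)
# Compute the patch-id of every commit and keep the SHAs whose patch-id matches.
MATCHING_COMMIT_SHAS=$(
for c in $(git rev-list --all); do
    git show --patch-with-raw "$c" |
        git patch-id
done |
    grep -F "$TARGET_COMMIT_PATCHID" |
    cut -d' ' -f2
)

echo "$MATCHING_COMMIT_SHAS"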

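Usage (with the script saved as an executable named git-list-dupe-commits on your PATH, so that git can run it as a subcommand):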

$ git list-dupe-commits 7a3e67c
5028e2b5500bd5f4637531e337e17b73f5d0c0b1
7a3e67ce38dbef471889d9f706b9161da7dc5cf3
929c66b5783a0127a7689020d70d398f095b9e00

It isn't terribly speedy, but for most repos should get the job done (just measured 36 seconds for a repo with 826 commits and a 158MB .git dir on a 2.4GHz Core 2 Duo).

Slipp D. Thompson
  • I may be the only one confused, but just in case, "target-commit" is not a literal; replace it with the SHA of the commit you want to get a patch ID for. – Jimothy Aug 20 '14 at 14:10
  • @Jimothy Yep, or a branch name or a tag name (any ref, I guess). I'll see if I can make it a bit clearer. – Slipp D. Thompson Aug 20 '14 at 18:43
4

I have a draft that works on a toy repo, but as it keeps the patch->commit map in memory it might have problems on large repos:

# print commit pairs with the same patch-id
for c in $(git rev-list HEAD); do \
    git show $c | git patch-id; done \
| perl -anle '($p,$c)=@F;print "$c $s{$p}" if $s{$p};$s{$p}=$c'

The output should be pairs of commits with the same patch-id (3 duplicates A B C come out as "A B" then "B C").

Change the git rev-list command to restrict the commits checked:

git log --format=%H HEAD -- somefile

Append "| xargs git show" to view the commits in detail, or "| xargs git show -s --oneline" for a summary:

0569473 add 6-8
5e56314 add 6-8 again
bece3c3 comment
e037ed6 add comment again

It turns out patch-id didn't work in my original case as there were additional changes in that later commit. "git log -S" was more useful.
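
For that situation, a hedged sketch of the pickaxe search: git log -S<string> lists the commits that change the number of occurrences of a string, so it still finds a change that was re-applied inside a larger commit. The repository, the file names, and the search string frobnicate() below are invented for the demo.

```shell
#!/bin/sh
# Throwaway demo repo; every name below is illustrative only.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "demo@example.com"
git config user.name "Demo"

echo "start" > a.txt
git add a.txt
git commit -q -m "initial"

echo "frobnicate()" >> a.txt
git commit -q -am "add frobnicate call"

# The same line re-added elsewhere, bundled with an unrelated edit,
# so this commit's diff (and patch-id) differs from the previous one.
echo "frobnicate()" > b.txt
echo "unrelated" >> a.txt
git add a.txt b.txt
git commit -q -m "add frobnicate call again, plus other edits"

# List every commit that adds or removes an occurrence of the string.
pickaxe_out=$(git log -S'frobnicate()' --format='%h %s')
echo "$pickaxe_out"
```

Both "add frobnicate call" commits are listed and "initial" is not, even though the two commits are not patch-identical.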

bsb
  • If you want to look just at raw diffs between a commit and its parent, you could do something like `git diff $c~1 $c | git patch-id`. It's going to misbehave on merge commits. Following both merge parents is a more complex problem. – Christopher Jul 25 '12 at 09:57
  • It looks like patch-id finds the same diff? `git diff HEAD~1 HEAD | git patch-id` gives `3318362fa07e580.. 000000000000..`, and `git show HEAD | git patch-id` gives `3318362fa07e580.. c397c4cdc426..` – bsb Jul 26 '12 at 23:02
  • @bsb Are you sure you wanted to write `git show $c | git patch-id`? `git show` prints metadata, but `git patch-id` needs a patch as input... – Daniel Alder Apr 23 '14 at 16:02
  • @daniel-alder, I think I need the `show` rather than `diff` since that allows the Perl to print the duplicate commits (otherwise I just get a whole lot of zero shas). [The code](http://tinyurl.com/git-patch-id) skips non-diff input (although perhaps older versions don't support this, what version are you using?) – bsb Apr 28 '14 at 06:01
  • @bsb thx for this explanation. I checked again and saw the diff. `git show` and `git patch-id` seem to cooperate nicely, but only for normal commits. for merges it doesn't seem to show any diff, that was my problem. tested with 1.7.10.4 and 1.9.1 – Daniel Alder Apr 28 '14 at 19:52
3

To search for duplicate commits of commit $hash, excluding merge commits:

git rev-list --no-merges --all | xargs -r git show | git patch-id \
    | grep ^$(git show $hash|git patch-id|cut -c1-40) | cut -c42-81 \
    | xargs -r git show -s --oneline

For searching the duplicate of a merge commit $mergehash, replace $(git show $hash|git patch-id|cut -c1-40) above by one of the two patch IDs (1st column) given by git diff-tree -m -p $mergehash | git patch-id. They correspond to the diffs of the merge commit with each of its two parents.

To find duplicates of all commits, excluding merge commits:

git rev-list --no-merges --all | xargs -r git show | git patch-id \
    | sort | uniq -w40 -D | cut -c42-81 \
    | xargs -r git log --no-walk --pretty=format:"%h %ad %an (%cn) %s" --date-order --date=iso

The search for duplicate commits can be extended or limited by changing the arguments to git rev-list, which accepts numerous options. For example, to limit the search to a specific branch specify its name instead of the option --all; or to search in the last 100 commits pass the arguments HEAD ^HEAD~100.

Note that these commands are fast since they use no shell loop, and batch-process commits.

To include merge commits, remove the option --no-merges, and replace xargs -r git show by xargs -r -L1 git diff-tree -m -p. This is much slower because git diff-tree is executed once per commit.

Explanation:

  • The first line generates a map of the patch IDs with the commit hashes (2-column data, of 40 characters each).

  • The second line only keeps commit hashes (2nd column) corresponding to the duplicate patch IDs (1st column).

  • The last line prints custom information about the duplicate commits.
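
The -w and -D options of uniq used in the second line are GNU extensions as far as I know, so they may be missing on other systems. A rough, portable sketch of the same grouping step with awk follows; it is an alternative I am assuming is acceptable here, not part of the commands above, and the function name list_duplicate_patches is made up.

```shell
#!/bin/sh
# Hypothetical helper: reads the two-column "patch-id commit" output of
# `git patch-id` on stdin and prints only the lines whose patch-id
# occurs more than once.
list_duplicate_patches() {
    awk '
        # Count each patch-id and remember every line that carries it.
        { count[$1]++; lines[$1] = lines[$1] $0 "\n" }
        # At the end, print the remembered lines of repeated patch-ids.
        END { for (id in count) if (count[id] > 1) printf "%s", lines[id] }
    '
}

# Usage inside a repository:
#   git rev-list --no-merges --all | xargs git show | git patch-id \
#       | list_duplicate_patches | cut -d" " -f2
```

Unlike sort | uniq, this keeps the whole map in awk's memory and prints the groups in no particular order, which should be fine for small and medium repositories.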

unagi
2

The nifty command suggested by bsb requires a couple of small tweaks:

(1) Instead of git show, which runs git diff-tree --cc, the command should use

    git diff-tree -p

Otherwise git patch-id generates spurious null SHA1 hashes.

(2) When the pipe to xargs is used, xargs should have the -L 1 argument. Otherwise a triplicated commit will not be paired with an equivalent commit.

Here's an alias to go in ~/.gitconfig:

dup = "!f() { for c in $(git rev-list HEAD); do git diff-tree -p $c | git patch-id; done | perl -anle '($p,$c)=@F;print \"$c $s{$p}\" if $s{$p};$s{$p}=$c' | xargs -L 1 git show -s --oneline; }; f" # "git dup" lists duplicate commits
Gidfiddle
0

For anyone wanting to do this in Windows PowerShell, the equivalent of the command in unagi's answer is:

git rev-list --no-merges --all  | %{&git.exe show $_} | 
  git patch-id | ConvertFrom-String -PropertyNames PatchId, Commit | 
  Group-Object PatchId | Where-Object count -gt 1 | 
  %{$_.group.Commit + " "}

Gives an output like:

1605e0e1e13d7b3f456c20432d8edec664ca7117
1e8efa8f2f01962a2c08fd25caf687d330383428

b45b6db084b27ae420ac8e9cf6511110ebb46513
4a2e1e3ba5a9a1d5db1d00343813e1404f6124e2

With the duplicate commit hashes grouped together.

CAUTION: On my repo this was a slow command so be sure to filter the call to rev-list appropriately!

James Close
0

Make sure to use a recent version of Git

The git log --format=%H mentioned by the OP bsb's answer is not always unique.

That is because, before Git 2.29 (Q4 2020), the patch-id computation did not ignore the "incomplete last line" marker the way it ignores whitespace.

See commit 82a6201 (19 Aug 2020) by René Scharfe (rscharfe).
(Merged by Junio C Hamano -- gitster -- in commit 5122614, 24 Aug 2020)

patch-id: ignore newline at end of file in diff_flush_patch_id()

Reported-by: Tilman Vogel
Initial-test-by: Tilman Vogel
Signed-off-by: René Scharfe

Whitespace is ignored when calculating patch IDs.
This is done by removing all whitespace from diff lines before hashing them, including a newline at the end of a file.
If that newline is missing, however, diff reports that fact in a separate line containing "\ No newline at end of file\n", and this marker is hashed like a context line.

This goes against our goal of making patch IDs independent of whitespace.

Use the same heuristic that 2485eab55cc (git-patch-id: do not trip over "no newline" markers, 2011-02-17) added to git patch-id instead and skip diff lines that start with a backslash and a space and are longer than twelve characters.

VonC