I just started learning Git by reading the Git Community Book. The book says that SVN and CVS store the differences between files, whereas Git stores a snapshot of all the files.

But I didn't really get what they mean by "snapshot". Does Git really make a copy of all the files in each commit? That's what I understood from their explanation.

PS: If anyone has a better source for learning Git, I would appreciate it.

Morty
  • Here's [a brilliant post](http://nfarina.com/post/9868516270/git-is-simpler) that explains in detail how Git works. What you're looking for is probably the § about the object database. – greg0ire Nov 20 '11 at 00:04
  • Excellent article that contains links to other great resources. I've had fun with these for a couple of hours. – mihai Mar 11 '15 at 14:07
  • I found this really nice article describing Git from the inside out: http://maryrosecook.com/blog/post/git-from-the-inside-out – Anubis Jan 15 '16 at 02:38
  • @Sumudu The post 'Git from the inside out' by Mary Rose Cook is indeed brilliant. – drlolly Dec 07 '21 at 16:31

2 Answers

Git does include, for each commit, a full copy of all the files, except that for content already present in the Git repo, the snapshot simply points to that content rather than duplicating it.
That also means that several files with the same content are stored only once.

So a snapshot is basically a commit, referring to the content of a directory structure.
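
As a rough illustration of why identical content is stored only once, here is a minimal Python sketch of how a blob id is derived from content alone. It assumes a repository using the default SHA-1 object format; the function name `git_blob_id` is made up for this example and is not part of Git.

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object id for a blob with this content.

    The id covers a small header ("blob <size>" plus a NUL byte)
    followed by the raw bytes. The file name is not part of it.
    """
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()

# Two files with different names but identical bytes get the same id,
# so the object database only needs to keep one copy.
readme = b"test content\n"
copy_of_readme = b"test content\n"
print(git_blob_id(readme) == git_blob_id(copy_of_readme))  # True

# The well-known id of the empty blob.
print(git_blob_id(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

The same id can be checked against a real installation with `git hash-object <file>`.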

Some good references are:

You tell Git you want to save a snapshot of your project with the git commit command and it basically records a manifest of what all of the files in your project look like at that point

Lab 12 illustrates how to get previous snapshots


The Pro Git book has a more comprehensive description of a snapshot:

The major difference between Git and any other VCS (Subversion and friends included) is the way Git thinks about its data.
Conceptually, most other systems store information as a list of file-based changes. These systems (CVS, Subversion, Perforce, Bazaar, and so on) think of the information they keep as a set of files and the changes made to each file over time.

[Figure: delta-based VCS]

Git doesn’t think of or store its data this way. Instead, Git thinks of its data more like a set of snapshots of a mini filesystem.
Every time you commit, or save the state of your project in Git, it basically takes a picture of what all your files look like at that moment and stores a reference to that snapshot.
To be efficient, if files have not changed, Git doesn’t store the file again—just a link to the previous identical file it has already stored.
Git thinks about its data more like this:

[Figure: snapshot-based VCS]

This is an important distinction between Git and nearly all other VCSs. It makes Git reconsider almost every aspect of version control that most other systems copied from the previous generation. This makes Git more like a mini filesystem with some incredibly powerful tools built on top of it, rather than simply a VCS.
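
The "snapshot with links to unchanged files" idea described above can be sketched in a few lines of Python. This is a deliberately simplified toy, not Git's actual format: real Git uses separate tree and commit objects and hashes a header along with the content. The names `objects`, `store`, and `snapshot` are invented for the example.

```python
import hashlib

objects: dict[str, bytes] = {}   # toy object database: id -> content

def store(content: bytes) -> str:
    """Store content under its hash; identical content is stored once."""
    oid = hashlib.sha1(content).hexdigest()
    objects.setdefault(oid, content)
    return oid

def snapshot(files: dict[str, bytes]) -> dict[str, str]:
    """Record a manifest mapping every path to the id of its content."""
    return {path: store(data) for path, data in files.items()}

first = snapshot({"a.txt": b"alpha\n", "b.txt": b"beta\n"})
second = snapshot({"a.txt": b"alpha\n", "b.txt": b"beta, edited\n"})

# Each snapshot lists every file, yet the unchanged file is not duplicated.
print(first["a.txt"] == second["a.txt"])  # True
print(len(objects))                       # 3 objects for 4 file entries
```

Each manifest is complete on its own, which is what makes it a snapshot, while the storage cost grows only with content that actually changed.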


Jan Hudec adds this important comment:

While that's true and important on the conceptual level, it is NOT true at the storage level.
Git does use deltas for storage.
Not only that, but it's more efficient at it than any other system. Because it does not keep per-file history, when it wants to do delta compression, it takes each blob, selects some blobs that are likely to be similar (using heuristics that include the closest approximation of the previous version and some others), tries to generate the deltas, and picks the smallest one. This way it can (often, depending on the heuristics) take advantage of other similar files, or of older versions that are more similar than the previous one. The "pack window" parameter allows trading performance for delta compression quality. The default (10) generally gives decent results, but when space is limited or to speed up network transfers, git gc --aggressive uses a value of 250, which makes it run very slowly but provides extra compression for history data.
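
The "try several candidate bases and keep the smallest delta" idea from the comment above can be sketched as follows. This is only a toy under stated assumptions: it uses Python's `difflib` in place of Git's real delta algorithm, the per-instruction costs (4 bytes for a copy, 1 byte of header per literal run) are made up, and Git's actual heuristics for ordering candidates are not modelled. `delta_size` and `choose_base` are hypothetical names.

```python
import difflib

def delta_size(base: bytes, target: bytes) -> int:
    """Rough size of a copy/insert delta that rebuilds target from base."""
    matcher = difflib.SequenceMatcher(None, base, target, autojunk=False)
    size = 0
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            size += 4                # assumed fixed cost of a "copy" instruction
        elif tag in ("replace", "insert"):
            size += 1 + (j2 - j1)    # literal bytes plus an assumed 1-byte header
        # "delete" costs nothing: those base bytes are simply never copied
    return size

def choose_base(target: bytes, candidates: list[bytes], window: int = 10):
    """Try the last `window` candidates; return (index, size) of the smallest delta.

    Returns None when no delta is smaller than storing the target whole.
    """
    best = None
    start = max(0, len(candidates) - window)
    for index in range(start, len(candidates)):
        size = delta_size(candidates[index], target)
        if size < len(target) and (best is None or size < best[1]):
            best = (index, size)
    return best

old = b"line one\nline two\nline three\n" * 20
new = old + b"line four\n"
unrelated = bytes(range(256)) * 3

# The older version of the same file wins over the unrelated blob.
print(choose_base(new, [unrelated, old]))
```

A larger `window` means more candidates are tried per object, which is the speed-versus-compression trade-off the comment describes.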

VonC
  • @JanHudec good point. I have included your comment in the answer for more visibility. – VonC Jan 03 '13 at 14:00
  • Does anyone know the computer science term for the Git-like storage pattern, aka hash-based value store? (or something similar) – Joannes Vermorel Nov 24 '14 at 12:00
  • @JoannesVermorel in its very first commit for Git itself, Linus Torvalds described it simply as a "content-addressable collection of objects": https://github.com/git/git/commit/e83c5163316f89bfbde7d9ab23ca2e25604af290 – VonC Nov 24 '14 at 12:03
  • These two sections of the documentation seem somewhat contradictory: http://git-scm.com/book/en/v2/Getting-Started-Git-Basics, http://www.git-scm.com/book/en/v2/Git-Internals-Packfiles. It would be nice if they were clearer, in a higher-level layman's explanation, about how they make Git efficient – dukeofgaming Jan 03 '15 at 23:31
  • In the context of the OP's actual question, the first paragraph seems really misleading. It's not until you get to the final paragraph that we learn that, oh yes, in fact Git **does** "store [...] differences between files". Really wish that info was flagged up top and not buried so deep. That said, thanks for at least including the real story somewhere in your answer ;) – Josh O'Brien Jan 27 '15 at 20:50
  • Parts of this answer are [translated into Russian for StackOverflow.RU](http://ru.stackoverflow.com/a/425121/) – Nick Volynkin Aug 21 '15 at 14:08
  • @NickVolynkin I have fixed the typo, and added an interesting link to illustrate delta compression (https://gist.github.com/matthewmccullough/2695758) – VonC Aug 21 '15 at 14:14
  • @NickVolynkin Great! I am glad those answers are finding a larger audience. – VonC Aug 21 '15 at 14:15
  • @NickVolynkin hopefully not just my answers ;) torek's answers on git are pretty sick too! (http://stackoverflow.com/search?tab=votes&q=user%3a1256452%20[git]) – VonC Aug 21 '15 at 14:18
  • @NickVolynkin I suspect other heuristics, but I am not sure. – VonC Aug 21 '15 at 14:26
  • Thanks for the link and clarification, I've updated my answer. I removed the comments which are no longer relevant. – Nick Volynkin Aug 21 '15 at 14:42
  • @JanHudec is there a place where I can look at git's schema? – Thomas Apr 14 '16 at 17:13
  • @Thomas Sure: http://stackoverflow.com/a/9478566/6309 and http://stefan.saasen.me/articles/git-clone-in-haskell-from-the-bottom-up/#pack_file_format – VonC Apr 14 '16 at 18:23
  • Another good book: Git From The Bottom Up: http://ftp.newartisans.com/pub/git.from.bottom.up.pdf – Jonas Berlin Aug 21 '17 at 19:34
  • "Git does store for each commit a full copy of all the files". Given the last paragraph, that is simply wrong! – Rastapopoulos Jun 11 '18 at 14:15
  • @Rastapopoulos considering that, [as commented here](https://stackoverflow.com/questions/21151945/what-does-git-fsck-stand-for/21157751#comment31845645_21151971), Git originally was envisioned as a versioned file system, it does have a full copy of all files per commit (I have edited the answer). But yes, that copy, if packed, involves deltification: see "[Are Git's pack files deltas rather than snapshots?](https://stackoverflow.com/a/24978239/6309)" – VonC Jun 11 '18 at 15:54
  • @Rastapopoulos For more on pack files, see "[Is the git binary diff algorithm (delta storage) standardized?](https://stackoverflow.com/a/9478566/6309)". This differs from the *initial* storage ("[Git Internals - Packfiles](https://git-scm.com/book/en/v2/Git-Internals-Packfiles)"), called the “loose” object format. See "[When and how does git use deltas for storage?](https://stackoverflow.com/a/50496599/6309)" – VonC Jun 11 '18 at 16:07
  • @VonC You say "does have a full copy of all files per commit", but the next sentence contradicts that... I think it is obvious to anyone who uses Git that Git can reconstitute a copy of a file in a given commit. – Rastapopoulos Jun 29 '18 at 16:27
  • I second @Rastapopoulos, it's still confusing. – Max Barraclough Nov 18 '20 at 20:43
  • @MaxBarraclough https://stackoverflow.com/a/51884316/6309 might help clarify that 2011 answer. – VonC Nov 18 '20 at 20:56
  • @MaxBarraclough On the "snapshot will simply point to said content rather than duplicate it" part, see also https://stackoverflow.com/a/61563261/6309 – VonC Nov 18 '20 at 21:00

Git logically stores each file under the SHA-1 hash of its content. This means that if you have two files with exactly the same content in a repository (or if you rename a file), only one copy is stored.

But this also means that when you modify a small part of a file and commit, another full copy of the file is stored. Git solves this with pack files. Once in a while, all the “loose” objects from a repo (not just file contents, but objects holding commit and directory information too) are gathered and compressed into a pack file. The pack file is compressed using zlib, and similar files are also delta-compressed.

The same format is also used when pulling or pushing (at least with some protocols), so those files don't have to be recompressed again.

The result is that a Git repository, containing the whole uncompressed working copy, recent files that are not yet delta-compressed, and delta-compressed older files, is usually relatively small, two times smaller than the size of the working copy. This means it's smaller than an SVN repo with the same files, even though SVN doesn't store the history locally.
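
To make the "another copy of the file is stored" point concrete, here is a small Python sketch of how a loose blob is written before any packing happens. It assumes the default SHA-1 object format; `loose_object` and `read_loose_object` are names invented for this example, and nothing is written to disk.

```python
import hashlib
import zlib

def loose_object(content: bytes) -> tuple[str, bytes]:
    """Return (object id, zlib-compressed bytes) for a loose blob."""
    raw = b"blob " + str(len(content)).encode("ascii") + b"\0" + content
    return hashlib.sha1(raw).hexdigest(), zlib.compress(raw)

def read_loose_object(data: bytes) -> bytes:
    """Decompress a loose object and strip its header."""
    raw = zlib.decompress(data)
    _header, _nul, content = raw.partition(b"\0")
    return content

v1 = b"".join(b"line %d\n" % i for i in range(200))
v2 = v1 + b"one more line\n"          # a small edit to the same file

for content in (v1, v2):
    oid, data = loose_object(content)
    # Loose objects live at .git/objects/<first 2 hex digits>/<remaining 38>.
    print(f".git/objects/{oid[:2]}/{oid[2:]}", len(data), "bytes")
```

Both versions are stored whole, each compressed on its own. Only when objects are packed can the second version be stored as a delta against the first.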

svick