3

I have a Git repository that is very large (larger than 1 GB), and there is always an issue when we have to set up the repository on a new local instance. Is there a proven approach to solving this?

prajwal_stha
  • Yes, you can remove the large binary files from the repo which are causing it to bloat to 1GB (check SO for how to do this). If you really don't have any such files, and all 1GB is source code, then you must be sitting on a really large codebase. – Tim Biegeleisen Jan 03 '18 at 06:15
  • What about partial cloning with subversion (rather than `git`)? – iBug Jan 03 '18 at 06:18
  • Possible duplicate of [How to handle a large git repository?](https://stackoverflow.com/questions/12855926/how-to-handle-a-large-git-repository) – Rumid Jan 03 '18 at 21:35

3 Answers

4

If you don't need the full history right away, and you're using a fairly recent version of Git (1.9 or later; the `--shallow-since` and `--shallow-exclude` options need Git 2.11 or later), then you can do a shallow clone:

  • `git clone --depth 5 user@host:repo.git` will truncate repo history to the 5 most recent commits on each branch
  • `git clone --shallow-since=2017-12-01 user@host:repo.git` will truncate repo history to everything since 1 December 2017
  • `git clone --shallow-exclude=abc1234 user@host:repo.git` will clone every revision except for the specified one and the ones that are reachable from the specified one. You can use `--shallow-exclude` several times to specify several unwanted revisions.

You can also clone a single branch with something like `git clone --branch master --single-branch user@host:repo.git`, which will only pull down the history of the master branch on the specified repo.
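As a rough illustration of the shallow clone described above, the sketch below builds a throwaway repository in a temporary directory (the directory names, file name, and commit count are made up for the demo), clones it with `--depth 5`, and later fetches the rest of the history with `git fetch --unshallow`. It uses a `file://` URL because Git ignores `--depth` for plain local-path clones.

```shell
set -eu
work=$(mktemp -d)

# Throwaway "remote" with 8 commits (hypothetical demo data).
git init -q "$work/origin"
cd "$work/origin"
for i in 1 2 3 4 5 6 7 8; do
    echo "$i" > file.txt
    git add file.txt
    git -c user.name=Demo -c user.email=demo@example.com commit -q -m "commit $i"
done

# Shallow clone: only the 5 most recent commits are transferred.
cd "$work"
git clone -q --depth 5 "file://$work/origin" shallow
shallow_count=$(git -C shallow rev-list --count HEAD)
echo "commits after shallow clone: $shallow_count"

# Later, when the full history is needed, convert to a complete clone.
git -C shallow fetch -q --unshallow
full_count=$(git -C shallow rev-list --count HEAD)
echo "commits after --unshallow: $full_count"
```

The same `--unshallow` step should work against a real remote once the initial shallow clone has made the new machine usable.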

There's a bit more detail at https://www.atlassian.com/blog/git/handle-big-repositories-git which may be helpful - especially if you're dealing with a repo that has large binary assets.

Jim Redmond
2

Set up a "depot" clone on a shared filesystem that holds the old history, which won't change. Make all your further clones with `--reference` pointing at that repo, and its contents won't be duplicated in the new clones. Read the clone docs for usage advice, e.g. what to do before losing (or if you might lose) access to your reference depot.
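A minimal sketch of that setup, using temporary directories as stand-ins for the real upstream and the shared filesystem (all paths and names here are hypothetical). The new clone records the depot in `.git/objects/info/alternates` and borrows its objects; the last two commands show one way to make the clone self-contained again before the depot goes away.

```shell
set -eu
work=$(mktemp -d)

# Stand-in for the real upstream repository.
git init -q "$work/upstream"
cd "$work/upstream"
echo hello > README
git add README
git -c user.name=Demo -c user.email=demo@example.com commit -q -m "initial"

# The "depot": a one-time bare clone kept on a shared location.
cd "$work"
git clone -q --bare "file://$work/upstream" depot.git

# A developer clone that borrows objects from the depot.
git clone -q --reference "$work/depot.git" "file://$work/upstream" dev
cat dev/.git/objects/info/alternates

# Before losing access to the depot: copy borrowed objects locally,
# then stop referring to the depot.
git -C dev repack -a -d -q
rm dev/.git/objects/info/alternates
commits=$(git -C dev rev-list --count HEAD)
echo "clone still has $commits commit(s) without the depot"
```

Newer Git versions also offer `git clone --reference <depot> --dissociate <url>`, which borrows from the depot only during the clone itself.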

jthill
  • Will you please elaborate on setting up "depot" clone repository? – prajwal_stha Jan 03 '18 at 07:53
  • The simplest way is to clone to a widely shared filesystem location and never push, pull or fetch there again; use it only as a reference. You can optionally delete references to more recent code, so that all clones referencing it get the recent content in their own object db rather than reading it from the shared depot. If access to the shared fs is slow, that will make a difference. – jthill Jan 03 '18 at 15:29
0

There is now microsoft/scalar. It started three years ago as GVFS, then became VFS for Git, and moved into its own repository.
Since August 2019, it has been called Scalar.

Scalar: A set of tools and extensions for Git to allow very large monorepos to run on Git without a virtualization layer

If your repo is hosted on a service that supports the GVFS Protocol, such as Azure Repos, then scalar clone <url> will create a local enlistment with abilities like on-demand object retrieval, background maintenance tasks, and automatically sets Git config values and hooks that enable performance enhancements.
Scalar also assists in setting up sparse enlistments.

It is documented in Git 2.35 (Q1 2022):

scalar: start documenting the command

Signed-off-by: Johannes Schindelin

Scalar is an opinionated repository management tool.

By creating new repositories or registering existing repositories with Scalar, your Git experience will speed up.
Scalar sets advanced Git config settings, maintains your repositories in the background, and helps reduce data sent across the network.

An important Scalar concept is the enlistment: this is the top-level directory of the project.
It usually contains the subdirectory src/ which is a Git worktree. This encourages the separation between tracked files (inside src/) and untracked files, such as build artifacts (outside src/).

When registering an existing Git worktree with Scalar whose name is not src, the enlistment will be identical to the worktree.

The scalar command implements various subcommands, and different options depending on the subcommand.


With Git 2.27 (Q2 2020), "git fetch" offers better support for scalar clone.

The commit message below also explains how scalar clone differs from a regular git clone and how it handles larger repositories.

See commit b739d97 (13 Mar 2020) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit 4cd9bb4, 25 Mar 2020)

connected.c: reprepare packs for corner cases

Helped-by: Jeff King
Helped-by: Junio Hamano
Signed-off-by: Derrick Stolee

While updating the microsoft/git fork on top of v2.26.0-rc0 and consuming that build into Scalar, I noticed a corner case bug around partial clone.

The "scalar clone" command can create a Git repository with the proper config for using partial clone with the "blob:none" filter.
Instead of calling "git clone", it runs "git init" then sets a few more config values before running "git fetch".

In our builds on v2.26.0-rc0, we noticed that our "git fetch" command was failing with

error: https://github.com/microsoft/scalar did not send all necessary objects

This does not happen if you copy the config file from a repository created by "git clone --filter=blob:none <url>", but it does happen when adding the config option "core.logAllRefUpdates = true".

By debugging, I was able to see that the loop inside check_connected() that checks if all refs are contained in promisor packs actually did not have any packfiles in the packed_git list.

I'm not sure what corner-case issues caused this config option to prevent the reprepare_packed_git() from being called at the proper spot during the fetch operation. This approach requires a situation where we use the remote helper process, which makes it difficult to test.

It is possible to place a reprepare_packed_git() call in the fetch code closer to where we receive a pack, but that leaves an opening for a later change to re-introduce this problem.
Further, a concurrent repack operation could replace the pack-file list we already loaded into memory, causing this issue in an even harder to reproduce scenario.

It is really the responsibility of anyone looping through the list of pack-files for a certain object to fall back to reprepare_packed_git() on a fail-to-find. The loop in check_connected() does not have this fallback, leading to this bug.

We _could_ try looping through the packs and only reprepare the packs after a miss, but that change is more involved and has little value.
Since this case is isolated to the case when opt->check_refs_are_promisor_objects_only is true, we are confident that we are verifying the refs after downloading new data. This implies that calling reprepare_packed_git() in advance is not a huge cost compared to the rest of the operations already made.
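The "blob:none" partial clone that this commit message refers to can be tried with plain Git. The sketch below is only an approximation: it uses `git clone --filter=blob:none` rather than Scalar's own init, config, and fetch sequence, and it runs against a throwaway local repository (hypothetical paths and file names). The `uploadpack.allowFilter` setting on the stand-in upstream is an assumption needed so that the local "server" honors the filter.

```shell
set -eu
work=$(mktemp -d)

# Stand-in upstream with two files (hypothetical demo data).
git init -q "$work/upstream"
cd "$work/upstream"
echo "first" > a.txt
echo "second" > b.txt
git add .
git -c user.name=Demo -c user.email=demo@example.com commit -q -m "two files"
# Let this local stand-in server honor --filter requests.
git config uploadpack.allowFilter true

# Partial clone: commits and trees are fetched, blobs are left out
# until something (such as a checkout) actually needs them.
cd "$work"
git clone -q --no-checkout --filter=blob:none "file://$work/upstream" partial

# List reachable objects; missing ones are printed with a leading "?".
missing=$(git -C partial rev-list --objects --missing=print HEAD | grep -c '^?' || true)
echo "blobs not downloaded yet: $missing"
```

Dropping `--no-checkout` would make Git fetch just the blobs needed for the checked-out commit, which is where the savings on a large history come from.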


With Git 2.35 (Q1 2022), add pieces from "scalar" to contrib/.

See commit ddc35d8, commit 4582676, commit cb59d55, commit 4368e40, commit 546f822, commit f5f0842, commit 9187659, commit 829fe56, commit 0a43fb2, commit cd5a9ac (03 Dec 2021) by Johannes Schindelin (dscho).
See commit d85ada7 (03 Dec 2021) by Matthew John Cheetham (mjcheetham).
See commit 7020c88, commit 2b71045, commit c76a53e, commit d0feac4 (03 Dec 2021) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit 62e83d4, 21 Dec 2021)

scalar: implement the clone subcommand

Signed-off-by: Johannes Schindelin

This implements Scalar's opinionated clone command: it tries to use a partial clone and sets up a sparse checkout by default.
In contrast to git clone(man), scalar clone sets up the worktree in the src/ subdirectory, to encourage a separation between the source files and the build output (which helps Git tremendously because it avoids untracked files that have to be specifically ignored when refreshing the index).

Also, it registers the repository for regular, scheduled maintenance, and configures a flurry of configuration settings based on the experience and experiments of the Microsoft Windows and the Microsoft Office development teams.

Note: since the scalar clone command is by far the most commonly called scalar subcommand, we document it at the top of the manual page.
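The sparse checkout part of that description can also be explored with plain Git, without Scalar. This is a small sketch on a throwaway repository; the `app` and `docs` directory names are made up, and it assumes a Git version that ships `git sparse-checkout` (2.25 or later).

```shell
set -eu
work=$(mktemp -d)

# Stand-in upstream with two top-level directories (hypothetical layout).
git init -q "$work/upstream"
cd "$work/upstream"
mkdir app docs
echo "code" > app/main.txt
echo "manual" > docs/guide.txt
echo "readme" > README
git add .
git -c user.name=Demo -c user.email=demo@example.com commit -q -m "initial layout"

cd "$work"
git clone -q "file://$work/upstream" sparse

# Cone mode keeps files at the repository root plus the listed directories.
git -C sparse sparse-checkout init --cone
git -C sparse sparse-checkout set app
ls sparse
```

After these commands the working tree should contain `README` and `app/` but not `docs/`; running `git sparse-checkout disable` restores the full tree.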


Git 2.36 (Q2 2022) includes new options for the scalar command:

See commit 2ae8eb5 (28 Jan 2022) by Johannes Schindelin (dscho).
(Merged by Junio C Hamano -- gitster -- in commit ff6f169, 17 Feb 2022)

scalar: accept -C and -c options before the subcommand

Signed-off-by: Johannes Schindelin

The git executable has these two very useful options:

-C <directory>:
switch to the specified directory before performing any actions

-c <key>=<value>:
temporarily configure this setting for the duration of the specified scalar subcommand

With this commit, we teach the scalar executable the same trick.
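To see what these two options do, here is a small sketch using the git executable itself, since Scalar may not be installed everywhere. The repository path and user names are hypothetical. Per the commit above, `scalar -C <directory> -c <key>=<value> <subcommand>` is meant to behave the same way.

```shell
set -eu
work=$(mktemp -d)
git init -q "$work/repo"

# Persistent setting stored in the repository's config file.
git -C "$work/repo" config user.name "Persistent User"

# -C picks the directory, -c overrides a setting for this one command only.
temporary=$(git -C "$work/repo" -c user.name="Temp User" config user.name)
persistent=$(git -C "$work/repo" config user.name)

echo "with -c:    $temporary"
echo "without -c: $persistent"
```

The override never touches the config file, so the second lookup still reports the persistent value.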

VonC