I've noticed that when Python packages are installed in a conda environment, they normally end up under /home/user/anaconda3/envs/env_name/ when installed with conda, and under /home/user/anaconda3/envs/env_name/lib/python3.6/site-packages/ when installed with pip inside that environment.

But conda caches all the recently downloaded packages too.

So, my question is: why doesn't conda install all packages in a central location and then, when a package is installed into a specific environment, create a link to that location rather than installing a copy there?

I've noticed that environments grow quite big and that this method would probably be able to save a bit of space.

lahsuk

1 Answer

Conda already does this. However, because it leverages hardlinks, it is easy to overestimate the space really being used, especially if one only looks at the size of a single env at a time.
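As a minimal sketch of what a hardlink is (not conda's own code), the following Python snippet creates a file and a second name for it with `os.link`, then checks that both names point at the same inode. The file names `pkgs_copy.bin` and `env_copy.bin` are hypothetical stand-ins for a file in the package cache and the same file inside an env.

```python
import os
import tempfile


def hardlink_demo():
    """Create a file plus a hard link to it and report what the filesystem sees."""
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical stand-ins for a cached package file and its env counterpart.
        cached = os.path.join(tmp, "pkgs_copy.bin")
        linked = os.path.join(tmp, "env_copy.bin")
        with open(cached, "wb") as fh:
            fh.write(b"x" * 4096)
        # Add a second directory entry for the same data; nothing is copied.
        os.link(cached, linked)
        a, b = os.stat(cached), os.stat(linked)
        return {
            "same_inode": (a.st_dev, a.st_ino) == (b.st_dev, b.st_ino),
            "link_count": a.st_nlink,
            "size_each": (a.st_size, b.st_size),
        }


result = hardlink_demo()
print(result)
```

Both names report the full file size, which is why a tool that looks at one directory at a time appears to see the data twice.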

To illustrate, let's use du to inspect the real disk usage. First, if I measure each environment directory in a separate du call, I get the uncorrected per-env usage:

$ for d in envs/*; do du -sh "$d"; done
2.4G    envs/pymc36
1.7G    envs/pymc3_27
1.4G    envs/r-keras
1.7G    envs/stan
1.2G    envs/velocyto

which is what it might look like from a GUI.

Instead, if I let du count them together in a single invocation, it counts each hardlinked file only once (under the first directory where it encounters it), and I get

$ du -sh envs/*
2.4G    envs/pymc36
326M    envs/pymc3_27
820M    envs/r-keras
927M    envs/stan
548M    envs/velocyto

One can see that a significant amount of space is already being saved here.
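The difference between the two listings can be reproduced with a rough Python sketch of the same bookkeeping: remember every `(device, inode)` pair already seen and skip repeats. This is an approximation under stated assumptions, not how du is implemented; it sums apparent file sizes (`st_size`) rather than allocated blocks, and the helper names `tree_size` and `usage_report` as well as the toy `pkgs`/`env` directories are made up for illustration.

```python
import os
import tempfile


def tree_size(root, seen):
    """Sum file sizes under root, skipping inodes already recorded in `seen`."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            st = os.lstat(os.path.join(dirpath, name))
            key = (st.st_dev, st.st_ino)
            if key in seen:
                continue  # already counted through another hard link
            seen.add(key)
            total += st.st_size
    return total


def usage_report(dirs):
    """Map each dir to (size counted alone, size counted with shared dedup)."""
    shared_seen = set()
    report = {}
    for d in dirs:
        # A fresh set mimics one du call per directory; the shared set
        # mimics a single du call over all directories.
        report[d] = (tree_size(d, set()), tree_size(d, shared_seen))
    return report


# Toy layout: one 1000-byte file hardlinked from pkgs into env,
# plus a 10-byte file that exists only in env.
with tempfile.TemporaryDirectory() as tmp:
    pkgs = os.path.join(tmp, "pkgs")
    env = os.path.join(tmp, "env")
    os.makedirs(pkgs)
    os.makedirs(env)
    with open(os.path.join(pkgs, "lib.bin"), "wb") as fh:
        fh.write(b"x" * 1000)
    os.link(os.path.join(pkgs, "lib.bin"), os.path.join(env, "lib.bin"))
    with open(os.path.join(env, "own.txt"), "wb") as fh:
        fh.write(b"y" * 10)
    sizes = list(usage_report([pkgs, env]).values())

print(sizes)
```

Counted alone, the toy env looks as large as its own file plus the linked one; counted after `pkgs`, only the file unique to the env remains.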

Most of the hardlinks go back to the pkgs directory, so if we include that as well:

$ du -sh pkgs envs/*
8.2G    pkgs
400M    envs/pymc36
116M    envs/pymc3_27
 92M    envs/r-keras
 62M    envs/stan
162M    envs/velocyto

one can see that, outside of the shared packages, the envs are fairly light. If you're concerned about the size of my pkgs, note that I have never run conda clean on this system, so my pkgs directory is full of tarballs and superseded packages, plus some infrastructure I keep in base (e.g., Jupyter, Git, etc.).

merv
  • Can I ask why the size of your envs changes before and after you include `pkgs`? – Tian Sep 24 '19 at 02:03
  • @Tian sure. That's because `pkgs` is the central repository for package code and much of what is hardlinked goes back to there. Everything goes into there first and gets linked out to the envs whenever possible. – merv Sep 24 '19 at 02:21
  • @merv For me, `for d in envs/*; do du -sh $d; done` and `du -sh envs/*` show the same result (both show 3~5GB for each env). Why is that? Does it mean hardlinks are never used? (I'm using `miniconda` and my `conda` version is `4.8.3`.) – user3595632 Jan 19 '21 at 09:32
  • @user3595632 Possibly. Is your package cache on a different file system? For example, check `conda config --show pkgs_dirs envs_dirs`. There are also other config settings related to linking behavior that may be worth checking. – merv Jan 19 '21 at 16:00
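
Hardlinks cannot span file systems, so one quick check for the situation in the last comment is to compare the device IDs of the two directories. The sketch below assumes a default-style layout under `~/anaconda3`; both paths are hypothetical and should be replaced with whatever `pkgs_dirs` and `envs_dirs` report on your machine. A matching device ID is a necessary condition for hardlinking, not a guarantee that conda actually used hardlinks.

```python
import os


def same_filesystem(path_a, path_b):
    """Return True if both paths report the same device ID."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


# Hypothetical locations; substitute the directories from your conda config.
pkgs_dir = os.path.expanduser("~/anaconda3/pkgs")
envs_dir = os.path.expanduser("~/anaconda3/envs")

if os.path.isdir(pkgs_dir) and os.path.isdir(envs_dir):
    print("same file system:", same_filesystem(pkgs_dir, envs_dir))
else:
    print("adjust pkgs_dir and envs_dir to match your installation")
```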