
This is an intrinsically "practical" question, but it leads to a well-defined mathematical problem. Let me start with the practical part:

I regularly back up my data. My backup strategy is differential backups: the first backup is a full backup containing all files, and subsequent backups are either full backups or differential backups, which contain only the data that has been added/changed since the last full backup.[1]

My backup storage is limited to $C$. I use the term "snapshot" to refer to the data from either a full backup or a differential backup.

Question: After how many differential backups should I make a new full backup to maximize the number of snapshots I can squeeze in the storage capacity $C$?

Intuition: A full backup takes up more space than a differential backup, but the more data has been added since the last full backup, the larger each additional differential backup becomes. There is therefore a trade-off between making one additional large full backup and accumulating many ever-larger differential backups.


I tried to formalize my problem as follows:

Denote the data newly added between $t-1$ and $t$ by $s(t)$, and let $s(0) = d_0$ (a constant, the size of the initial data). Ignoring deletes, the total data at time $\tilde{t}$ is $f(\tilde{t})=\sum_{t=0}^{\tilde{t}} s(t)$, which equals the size of a full backup at that time. The size of a differential backup ($d$) at time $\tilde{t}$ depends on the time $b$ of the last full backup: $d(\tilde{t},b) = \sum_{t=b+1}^{\tilde{t}} s(t)$.
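As a quick numeric illustration of these definitions (toy numbers, purely hypothetical):

```python
# Toy illustration of f and d; the numbers are made up.
s = [10, 2, 3, 1, 4]          # s(0) = d_0 = 10, then newly added data per period

def f(t):
    """Size of a full backup at time t: all data accumulated up to and including t."""
    return sum(s[:t + 1])

def d(t, b):
    """Size of a differential backup at time t when the last full backup was at time b."""
    return sum(s[b + 1:t + 1])

print(f(3))     # 10 + 2 + 3 + 1 = 16
print(d(3, 1))  # 3 + 1 = 4
```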

With 1 full backup and only differential backups afterwards, I do not run out of storage capacity $C$ as long as $T$ is small enough[2] that $$f(0) + \sum_{\tilde{t}=1}^T d(\tilde{t}, 0) \leq C.$$

With 2 full backups, each followed by differential backups, $T_1$ must be small enough such that $$f(0) + \sum_{\tilde{t}=1}^{T_0} \Big[ d(\tilde{t}, 0) \Big] + f(T_0+1) + \sum_{\tilde{t}=T_0+2}^{T_1} d(\tilde{t}, T_0+1) \leq C.$$

I think the pattern becomes clear now:[3] with $N$ full backups, $T_{N-1}$ must be small enough such that $$f(0) + \sum_{\tilde{t}=1}^{T_0} \Big[ d(\tilde{t}, 0) \Big] + \sum_{n=1}^{N-1} \Big[ f(T_{n-1}+1) + \sum_{\tilde{t}=T_{n-1}+2}^{T_n} d(\tilde{t}, T_{n-1}+1) \Big] \leq C. \tag{1} \label{eq:cond}$$

This leads to the ultimate question:

Which $N$ maximizes the largest $T_{N-1}$ for which equation $(\ref{eq:cond})$ holds?

I am not sure how to approach a solution to this problem, nor am I sure if further assumptions are needed to make progress. Maybe it is helpful to assume that the amount of new data in each period $s(t)$ is constant ($s(t) = \bar{s} ~ \forall ~ t$)?

Plugging my definitions of $f(\tilde{t})$ and $d(\tilde{t}, b)$ into equation $(\ref{eq:cond})$ yields: $$d_0 + \Big[ \sum_{\tilde{t}=1}^{T_0} \sum_{t=1}^{\tilde{t}} s(t) \Big] + \sum_{n=1}^{N-1} \Bigg[ \Big[ \sum_{t=0}^{T_{n-1}+1} s(t) \Big] + \Big[ \sum_{\tilde{t}=T_{n-1}+2}^{T_n} \sum_{t=T_{n-1}+2}^{\tilde{t}} s(t) \Big] \Bigg] \leq C, \tag{1'} \label{eq:cond-plugged-in}$$ but I do not know how to proceed from here.
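For what it's worth, a rough sketch of what happens under the constant-rate assumption $s(t) = \bar{s}$ for $t \geq 1$ suggested above: then $f(\tilde{t}) = d_0 + \tilde{t}\,\bar{s}$ and $d(\tilde{t}, b) = (\tilde{t}-b)\,\bar{s}$, so every double sum in $(\ref{eq:cond-plugged-in})$ collapses to a triangular number. Writing $k_n$ for the number of differential backups following the $n$-th full backup ($k_0 = T_0$ and $k_n = T_n - T_{n-1} - 1$ for $n \geq 1$), condition $(\ref{eq:cond-plugged-in})$ becomes $$N d_0 + \bar{s} \sum_{n=1}^{N-1} (T_{n-1}+1) + \bar{s} \sum_{n=0}^{N-1} \frac{k_n(k_n+1)}{2} \leq C,$$ but even in this simplified form I do not see how to optimize over $N$ and the $T_n$.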

Any help, either on tackling equation $(\ref{eq:cond-plugged-in})$ or completely different strategies to solve the problem are highly appreciated!


[1] Note that I am not referring to incremental backups, where each backup contains only the data since the previous incremental backup. Differential backups always contain all data since the last full backup.

[2] I assume that $C$ is large enough for at least 1 full and 1 differential backup.

[3] To be more explicit, the conditions for 3 and 4 full backups are $$f(0) + \sum_{\tilde{t}=1}^{T_0} d(\tilde{t}, 0) + f(T_0+1) + \sum_{\tilde{t}=T_0+2}^{T_1} d(\tilde{t}, T_0+1) + f(T_1+1) + \sum_{\tilde{t}=T_1+2}^{T_2} d(\tilde{t}, T_1+1) \leq C,$$

$$f(0) + \sum_{\tilde{t}=1}^{T_0} d(\tilde{t}, 0) + f(T_0+1) + \sum_{\tilde{t}=T_0+2}^{T_1} d(\tilde{t}, T_0+1) + f(T_1+1) + \sum_{\tilde{t}=T_1+2}^{T_2} d(\tilde{t}, T_1+1) + f(T_2+1) + \sum_{\tilde{t}=T_2+2}^{T_3} d(\tilde{t}, T_2+1) \leq C.$$

CL.
    This is a crosspost of this math.SE question; see OR chat. – CL. Mar 07 '21 at 11:03
  • What kind of answer are you looking for? A closed form expression? An algorithm solving your problem? (In which case dynamic programming gives an easy pseudo-polynomial time solution) – Tassle Mar 07 '21 at 19:41
  • The assertion that $f(\tilde{t})$ is the size of a full backup at time $\tilde{t}$ ignores not only deletions but also overwriting of old data with updated data. It is correct that a full backup will use no more than $f(\tilde{t})$ space. – prubin Mar 07 '21 at 21:49
  • After making a full backup, would you not delete all previous backups (full and differential)? – prubin Mar 07 '21 at 21:54
  • @Tassle Actually, either would be great. I didn't expect a closed-form solution to be possible at all, but seeing one would be fascinating. However, an algorithm would also help - anything that tackles the underlying practical problem is highly appreciated. – CL. Mar 08 '21 at 08:02
  • @prubin: Regarding the overwrites - yes, I abstracted from overwrites as well. In fact, I don't think modified or removed files are very important in my case and I considered the problem to be complicated enough even without adding them to the model. ;-) – CL. Mar 08 '21 at 08:06
  • @prubin: Regarding deleting old backups after a new full backup - no, I try to keep them in order to preserve history. Although I said before that overwrites / deletes play no important role, they may happen. Therefore, if a file exists in $t$, is accidentally deleted in $t+1$, a full backup occurs at $t+2$, and in $t+3$ the deletion is discovered, restoring the file is only possible if pre-$t+2$ snapshots still exist. Actually, this is basically the reason why my objective is to maximize the number of snapshots that can be squeezed into $C$ in the first place. – CL. Mar 08 '21 at 08:12
  • Thanks to both of you for taking the time to read and think through this question! – CL. Mar 08 '21 at 08:13
  • I'm still missing something here. Given what you said about recovering an incorrectly deleted file, why not do one full backup and then just incremental backups? What's the motivation for subsequent full backups? Restoring a full backup is faster than restoring a gaggle of incremental backups, but execution time does not seem to be a consideration here. – prubin Mar 09 '21 at 17:10
  • @prubin Smart questions … these are details I tried to leave out of the question for brevity – but in case you're curious: I try to maintain a high level of redundancy. To this end, I use 3 external drives. All 3 hold identical copies of the full backup(s), but each differential backup is only stored on 1 drive (cycling through the drives). If 1 drive dies, I lose at most 1 diff. backup. While incremental backups would be great, this does not work with several drives: [cont.] – CL. Mar 10 '21 at 08:59
  • AFAIK, most software (at least what I use: Cobian Backup) utilizes the Archive Bit to determine if a file has changed since the last inc. backup: a file that has changed since the last full backup is only included in the 1st inc. backup. This defeats redundancy because the inc. backups from all drives are required to restore. I therefore use diff. backups and only set the archive bit on full backups. [Actually, you've got me thinking about incremental backups that are cloned across drives - but that's a different story.] – CL. Mar 10 '21 at 08:59
  • If you are storing each differential backup on one drive, the assertion that a drive failure costs you only one differential backup holds only if you are doing a full backup after every three differentials. So solving your optimization problem would change that, no? Anyway, for the problem you stated, I agree with Tassle that you can solve it by dynamic programming (or as a longest path problem on a layered digraph, which is equivalent) as long as you know the values of $s(t)$. – prubin Mar 11 '21 at 16:58
  • @prubin Why would you say that "the assertion that a drive failure costs you only one differential backup holds only if you are doing a full backup after every three differentials"? Are you maybe confusing differential and incremental backups in this argument? A restore only requires 1 full + 1 differential backup. If the drive with the latest differential backup dies, I lose the latest snapshot (not more, because the previous diff. backup is on a different drive). If one of the two other drives fails, I lose no data. – CL. Mar 12 '21 at 07:40
  • You're right, I was thinking incremental rather than differential. – prubin Mar 13 '21 at 18:43

1 Answer


Here is an easy dynamic programming solution which runs in $O(n^2)$ time, where $n$ is the number of updates to the data.

Create an array of partial sums $p$ of size $n$, such that $p[j]=\sum_{i=0}^{j}s(i)$ for all $0\leq j \leq n-1$.

For $0\leq t \leq k \leq n-1$, let $m[k][t]$ denote the minimum amount of storage required to store the first $k+1$ snapshots, given that the last full backup happened at snapshot $t$.

We have:

  • $m[0][0] = s(0)$
  • $m[i][i] = \min_{0\leq t \leq i-1}\{m[i-1][t]\}+p[i]$ for all $0<i\leq n-1$
  • $m[i][t] = m[i-1][t] + (p[i]-p[t])$ for all $0\leq t<i \leq n-1$

These relations let you compute the whole matrix $m$ in $O(n^2)$ time. If you then let $m'[i] = \min_{0\leq t \leq i}\{m[i][t]\}$ denote the minimum amount of storage required to store the first $i+1$ snapshots for all $0\leq i \leq n-1$, the maximum number of snapshots you can store is one more than the maximum $i$ such that $m'[i]\leq C$.

Note that you can also compute $m'$ as you are computing $m$ and stop whenever you find that $m'[i] > C$. You can also easily adapt this approach to handle deletes.
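A minimal sketch of this dynamic program in Python (my own illustration, not the answer's code; names chosen to mirror the notation above):

```python
# Sketch of the O(n^2) dynamic program described above.
# `s` is the list s(0), ..., s(n-1) of per-period data sizes, C the storage capacity.
def max_snapshots(s, C):
    n = len(s)

    # Partial sums: p[j] = s(0) + ... + s(j) = size of a full backup at time j.
    p = []
    total = 0
    for v in s:
        total += v
        p.append(total)

    INF = float("inf")
    # m[i][t]: minimum storage for the first i+1 snapshots, last full backup at time t.
    m = [[INF] * n for _ in range(n)]
    m[0][0] = s[0]

    best = 1 if s[0] <= C else 0      # number of snapshots known to fit so far
    for i in range(1, n):
        # Option 1: snapshot i is a new full backup of size p[i].
        m[i][i] = min(m[i - 1][t] for t in range(i)) + p[i]
        # Option 2: snapshot i is a differential backup w.r.t. the full backup at time t.
        for t in range(i):
            m[i][t] = m[i - 1][t] + (p[i] - p[t])

        m_prime = min(m[i][t] for t in range(i + 1))   # m'[i]
        if m_prime <= C:
            best = i + 1
        else:
            break   # m' only grows with i (sizes are non-negative), so stop here
    return best
```

With toy numbers, e.g. `max_snapshots([10, 2, 3, 1, 4, 2], 40)` returns 6, i.e. all six snapshots fit (using exactly 40 units of storage).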


Now to actually find out when you should do a full backup and when you should do a differential backup, there are two main approaches (essentially doing the same thing). One involves storing some additional information while computing $m'$, the other a sort of backtracking to figure out which choices led to the values of $m$ and $m'$ we end up with.

I prefer the former, and it goes like this:

  • Let $q[i]$ denote the last time at or before $i$ at which we have to do a full backup in order to store the first $i+1$ snapshots using $m'[i]$ storage. In other words, $q[i]=t$, where $0\leq t \leq i$ is the value which minimizes $m[i][t]$. You can compute $q$ while computing $m'$.
  • To store the $i+1$ first snapshots with the least amount of storage, you need to do your last full backup at time $t=q[i]$ and minimize the storage required for the first $t$ snapshots. To minimize the storage of the first $t$ snapshots you need to do your last full backup at time $t'=q[t-1]$ and minimize the storage required for the first $t'$ snapshots. To minimize the storage of the first $t'$ snapshots you need to do your last full backup at time $t''=q[t'-1]$... And so on until you reach $0$, where you have no choice but to do a full backup.

The values $t,t',t'',t'''\ldots$ you will reach in this process tell you exactly when you have to do a full backup in order to store the first $i+1$ snapshots using $m'[i]$ storage (which is the minimum possible).
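Again as an illustrative sketch (my own names, not from the answer), the same DP with the $q$ bookkeeping and the backtracking step might look like this:

```python
# Same DP as in the sketch above, additionally recording q[i] = argmin_t m[i][t]
# and backtracking through q to recover the full-backup times.
def full_backup_times(s, C):
    n = len(s)
    p = []
    total = 0
    for v in s:
        total += v
        p.append(total)

    INF = float("inf")
    m = [[INF] * n for _ in range(n)]
    m[0][0] = s[0]
    q = [0] * n                       # q[0] = 0: the first snapshot must be a full backup
    best = 1 if s[0] <= C else 0
    for i in range(1, n):
        m[i][i] = min(m[i - 1][t] for t in range(i)) + p[i]
        for t in range(i):
            m[i][t] = m[i - 1][t] + (p[i] - p[t])
        q[i] = min(range(i + 1), key=lambda t: m[i][t])
        if m[i][q[i]] <= C:
            best = i + 1
        else:
            break

    # Backtrack: the last full backup for the first `best` snapshots is at q[best-1];
    # then repeat for the snapshots before that full backup, until time 0 is reached.
    times = []
    i = best - 1
    while i >= 0:
        t = q[i]
        times.append(t)
        i = t - 1
    return best, sorted(times)
```

For the toy input above, `full_backup_times([10, 2, 3, 1, 4, 2], 40)` returns `(6, [0, 2])`: full backups at times $0$ and $2$, differential backups at the remaining times.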

Tassle
  • Thanks, that looks great! I'll look into this as soon as possible. – CL. Mar 09 '21 at 08:51
  • Thank you again for your answer! I now found the time to … try to understand it. There's one thing that I cannot get my mind around: once I know the maximum number of snapshots that can be stored, how do I know what frequency of full backups is needed to reach this maximum? Is this something I can see from $m$? Sorry for my naiveté; I'm new to OR. – CL. Mar 14 '21 at 17:23
  • By the way, I tried to implement the algorithm suggested in this answer in R and my current approach is https://gist.github.com/ClaudiusL/ed53fc4fd55c1b8dfa4212bd699625bb – CL. Mar 14 '21 at 17:26
  • @CL. You are welcome! I have edited my answer to explain how to do that. Note that none of this is specific to OR, dynamic programming is a widely used technique in algorithms design generally. I encourage you to google the subject if you are interested in algorithms! The general principle is very simple once the first difficulties are overcome, and it is very useful. – Tassle Mar 14 '21 at 22:03
  • Thanks a lot for your effort (and patience), Tassle! – CL. Mar 28 '21 at 19:41