
Let me give an example.

I have a 1 TB CSV file. I need to keep the raw file and also produce a sanitized version of it.

After sanitization, roughly 10% of the entries have changed.

Both files need to be stored for years.

So, what's the best way to handle the two files?

I want to avoid duplicating the file so I can save about 0.9 TB of disk usage. The best approach I have found is to store only the ~10% of entries that were sanitized; when somebody wants to read an entry that does not exist in the sanitized file, a software layer falls back to the raw file.

So I can save about 0.9 TB of disk space, but I need a software layer to handle requests for entries.

The problem is that I haven't found existing software that does this, so I would need to write it myself.
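To make clear what I mean by the software layer, here is a minimal sketch in Python. It assumes each CSV row has a unique key in its first column, and that `raw.csv` and `sanitized.csv` are placeholder names (the sanitized file holds only the changed rows):

```python
import csv

def load_delta(delta_path):
    """Map key -> sanitized row, for the ~10% of entries that changed."""
    delta = {}
    with open(delta_path, newline="") as f:
        for row in csv.reader(f):
            delta[row[0]] = row
    return delta

def get_entry(key, delta, raw_path):
    """Return the sanitized row if the entry was changed, else the raw row."""
    if key in delta:
        return delta[key]
    # Fall back to the raw file for unchanged entries.
    with open(raw_path, newline="") as f:
        for row in csv.reader(f):
            if row[0] == key:
                return row
    return None

# Example usage:
# delta = load_delta("sanitized.csv")
# print(get_entry("12345", delta, "raw.csv"))
```

A real implementation would obviously need an index (key to byte offset, or a SQLite table) instead of scanning 1 TB per lookup, but this is the lookup logic I have in mind.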

However, the Linux Btrfs filesystem has a deduplication feature: I could keep both the raw file and the sanitized file and let the filesystem handle the deduplication. But in a test, I estimated it would save only about 0.5 TB, because of the space needed for metadata.
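One variation I am considering (a sketch only, with big assumptions: the filesystem supports reflinks like Btrfs or XFS, and sanitized values have the same byte length as the originals so unchanged data stays at the same offsets) is to make a copy-on-write clone and overwrite only the changed byte ranges, instead of deduplicating after the fact:

```python
import subprocess

RAW = "raw.csv"            # hypothetical paths
SANITIZED = "sanitized.csv"

# 1. Reflink clone: no data blocks are duplicated yet.
subprocess.run(["cp", "--reflink=always", RAW, SANITIZED], check=True)

# 2. Overwrite only the changed byte ranges in place. Btrfs copies just the
#    touched extents; everything else stays shared with raw.csv.
#    `changes` would come from the sanitization step: (offset, new_bytes) pairs.
changes = [(1024, b"REDACTED"), (4096, b"anonymous@example.com")]

with open(SANITIZED, "r+b") as f:
    for offset, data in changes:
        f.seek(offset)
        f.write(data)
```

Sharing is per extent, so many small scattered edits may unshare more space than the bytes actually changed, and if sanitization shifts offsets this doesn't work at all. That is why I'm unsure whether the filesystem route really fits my case.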

I'm a little lost here:

How can I meet my requirement and save disk space without writing the software myself?

Or perhaps the question should be: what methodology should I apply to save disk space when storing and operating on mostly-duplicate files like these?

If someone would like to recommend a book, I'm open to reading it.
