Storing large textual datasets generated from Python

Question

This post might very well be marked as a duplicate, but I've done my research online and couldn't find any helpful information regarding this question.

I'm currently working with a very large textual dataset generated from Python (around 10 TB). I need a reliable way to store this dataset other than a single plain .txt file.

As I understand it, text is stored as characters at one byte per character (at least for ASCII text like mine). Is there a way to shorten a string to save storage while still preserving its original content? For example, could a 10-character string go through some operation that produces a string of fewer than 10 characters yet still holds the same information?

Here is an example of the data that I'm trying to store (the records are in no specific order):

1=>[32,543,7638,5436,764,54363,2345,5246,76534653,354]
2=>[3241,563,3425,764432,54326,5437,637,5432,6546,367,63734,43]
and so on...
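
To make the record format concrete, here is a minimal sketch of how one such line could be parsed and re-encoded as fixed-width binary integers instead of decimal text. The function names (parse_record, pack_record, unpack_record) are hypothetical, and the layout assumes every key and value fits in an unsigned 32-bit integer, which holds for the sample above but which I have not checked for the whole dataset.

```python
import struct

def parse_record(line):
    """Split a 'key=>[a,b,c]' line into (int key, list of int values)."""
    key_text, _, values_text = line.strip().partition("=>")
    values = [int(v) for v in values_text.strip("[]").split(",") if v]
    return int(key_text), values

def pack_record(key, values):
    """Encode as little-endian uint32s: key, value count, then the values.

    Assumption: every number fits in 32 bits (0 .. 4294967295).
    """
    return struct.pack(f"<{len(values) + 2}I", key, len(values), *values)

def unpack_record(blob):
    """Inverse of pack_record: recover (key, values) from the bytes."""
    key, count = struct.unpack_from("<2I", blob, 0)
    values = list(struct.unpack_from(f"<{count}I", blob, 8))
    return key, values

line = "1=>[32,543,7638,5436,764,54363,2345,5246,76534653,354]"
key, values = parse_record(line)
blob = pack_record(key, values)

# The round trip must give back exactly what was parsed.
assert unpack_record(blob) == (key, values)
print(len(line), "characters as text,", len(blob), "bytes packed")
```

With fixed four-byte fields, short numbers cost more than they do as text and long numbers cost less, so whether this saves anything depends on how large the values typically are.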

If there is no such thing as shortening a string, what is a good approach to storing a dataset this large? In other words, is there a way to store a 10 TB dataset in less than 10 TB on the hard drive?
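
To show the kind of round trip I am after, here is a minimal sketch using the zlib module from Python's standard library, which does lossless compression. The rows are randomly generated stand-ins in the same shape as my records, not my real data, so the ratio it prints is only an illustration and may not carry over to the actual dataset.

```python
import random
import zlib

random.seed(0)  # fixed seed so the made-up data is reproducible

# Build made-up rows in the same 'key=>[v1,v2,...]' shape as the sample.
rows = []
for key in range(1, 2001):
    values = [random.randint(1, 10**6) for _ in range(random.randint(8, 12))]
    rows.append(f"{key}=>[{','.join(map(str, values))}]")

raw = "\n".join(rows).encode("ascii")

# Level 9 is the slowest, strongest setting zlib offers.
compressed = zlib.compress(raw, 9)

# Lossless: decompressing gives back the original bytes exactly.
restored = zlib.decompress(compressed)
assert restored == raw

print(len(raw), "bytes raw,", len(compressed), "bytes compressed")
```

The text uses only digits and a few punctuation characters, so each byte carries much less than eight bits of information, which is the slack a general-purpose compressor can remove.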
