-1

I am writing a program which creates a file with 1 million random inputs, sorts them and then writes the sorted list to a new file.

Is there an efficient way I can sort the elements without having to copy the entire input file into an array -> sort -> write back to the output file?

  • 5
    Why don't you want to copy the input file into an array? (~4MB of memory isn't all that scary) – Mark Elliot Sep 21 '13 at 22:02
  • 3
    What's efficient? If sorting the entire dataset in RAM is a unit of efficiency, how many times slower is still efficient? BTW what's your platform? If you cannot load a million numbers in RAM you must be programming for a toaster, but why a toaster would need to sort a million numbers? ;) – n. 1.8e9-where's-my-share m. Sep 21 '13 at 22:10
  • Read the file looking for 0s and write any that you find to the output. Now reset the read position and look for 1s. Repeat 2^32 times (or whatever INT_MAX is on your system). No need to store a thing in RAM! *disclaimer: using this method in real tasks might get you fired. – Dave Sep 21 '13 at 22:21
  • Does: `system("sort -n input-file >output-file");` count as a solution? How much memory is available? – Jonathan Leffler Sep 21 '13 at 22:21
  • Seriously though, a bucket-based solution may be your friend (see the sketch after these comments). Load in all numbers between 0 and 1023 in the first pass, sort them, and print them, then clear that and load in the numbers from 1024 to 2047, etc. Finding the largest value in the array is possible during the first pass, so you know when to stop and what bucket size is best. This falls down if you have a large number of the same value, in which case you could store the values as `value, quantity` pairs. No point doing any of this non-optimal stuff if you can just load the lot into memory though. – Dave Sep 21 '13 at 22:25
  • you could try an *external sorting* algorithm ([here's an example in Python](http://stackoverflow.com/a/16954837/4279)) that also works for an unknown integer range. – jfs Sep 21 '13 at 23:00
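
A rough sketch of that bucket idea, for illustration only: the file names, bucket width, and use of `qsort` per bucket are my own assumptions, and it assumes non-negative values that are not close to INT_MAX.

#include <stdio.h>
#include <stdlib.h>

#define BUCKET_RANGE 1024   /* assumed width of the value range handled per pass */

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

int main(void)
{
    FILE *in = fopen("input.txt", "r");    /* file names are assumptions */
    FILE *out = fopen("output.txt", "w");
    int v, max = 0;

    if (!in || !out)
        return 1;

    /* first pass: find the largest value so we know when to stop */
    while (fscanf(in, "%d", &v) == 1)
        if (v > max)
            max = v;

    /* one pass per bucket: collect, sort and print only values in [lo, lo + BUCKET_RANGE) */
    for (int lo = 0; lo <= max; lo += BUCKET_RANGE) {
        int *buf = NULL;
        size_t count = 0, cap = 0;

        rewind(in);
        while (fscanf(in, "%d", &v) == 1) {
            if (v < lo || v >= lo + BUCKET_RANGE)
                continue;
            if (count == cap) {
                cap = cap ? cap * 2 : 256;
                buf = realloc(buf, cap * sizeof *buf);
                if (!buf)
                    return 1;
            }
            buf[count++] = v;
        }

        qsort(buf, count, sizeof *buf, cmp_int);
        for (size_t i = 0; i < count; i++)
            fprintf(out, "%d\n", buf[i]);
        free(buf);
    }

    fclose(in);
    fclose(out);
    return 0;
}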

3 Answers

1

I suggest reading Column 1 of Programming Pearls, 2nd edition. Jon Bentley, the author, solves exactly this problem using a bit vector, sorting about 10 million integers using only about 1MB of memory. But that will only work if the integers in the input are bounded to a known range.
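
Roughly, the idea in code (the bound MAX_VALUE and the file names here are assumptions, and it only works if every value is distinct and below the bound):

#include <stdio.h>
#include <stdlib.h>

#define MAX_VALUE 10000000   /* assumed known upper bound on the input values */

int main(void)
{
    /* one bit per possible value: bit i is set if i appears in the input */
    unsigned char *bits = calloc(MAX_VALUE / 8 + 1, 1);
    FILE *in = fopen("input.txt", "r");    /* file names are assumptions */
    FILE *out = fopen("output.txt", "w");
    int n;

    if (!bits || !in || !out)
        return 1;

    while (fscanf(in, "%d", &n) == 1)
        bits[n / 8] |= 1u << (n % 8);        /* mark value n as present */

    for (n = 0; n < MAX_VALUE; n++)          /* walk the bitmap in order */
        if (bits[n / 8] & (1u << (n % 8)))
            fprintf(out, "%d\n", n);

    fclose(in);
    fclose(out);
    free(bits);
    return 0;
}

With MAX_VALUE of 10 million, the bitmap is about 1.2MB; with duplicates you need the counting variant mentioned in the comments below.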

Filipe Gonçalves
  • 20,077
  • 6
  • 49
  • 66
  • Here's that Bentley Column 1 article: http://netlib.bell-labs.com/cm/cs/pearls/cto.html – Mike Makuch Sep 21 '13 at 22:23
  • Heh. Is that bound and unique? Otherwise I can't see how he fits 10M integers in 8M bits. – jthill Sep 21 '13 at 22:24
  • @jthill yep, bounded and unique. And even if it's not unique, you can easily extend it to count each occurrence within some limits, for example, if any integer shows up at most 15 times, you can use 4 bits per number, and that does not add much more memory. – Filipe Gonçalves Sep 21 '13 at 22:46
0

Well, you can use a binary search tree, and to write the values back to the output file in sorted order you can use this in-order traversal:

struct node {
    int num;                  /* the member that carries the number */
    struct node *left;
    struct node *right;
};

void output_to_file(FILE *file, struct node *binTree)
{
    if (binTree != NULL)
    {
        output_to_file(file, binTree->left);
        fprintf(file, "%d\n", binTree->num);
        output_to_file(file, binTree->right);
    }
}
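
For completeness, a hedged sketch of how the tree might be built from the input file before calling output_to_file; the insert helper and file names are my additions, not part of the answer, and they reuse the struct node definition above.

#include <stdio.h>
#include <stdlib.h>

struct node *insert(struct node *root, int num)
{
    if (root == NULL) {
        root = malloc(sizeof *root);
        if (root == NULL)
            exit(1);
        root->num = num;
        root->left = root->right = NULL;
    } else if (num < root->num) {
        root->left = insert(root->left, num);
    } else {
        root->right = insert(root->right, num);   /* duplicates end up on the right */
    }
    return root;
}

int main(void)
{
    FILE *in = fopen("input.txt", "r");    /* file names are assumptions */
    FILE *out = fopen("output.txt", "w");
    struct node *root = NULL;
    int num;

    if (!in || !out)
        return 1;

    while (fscanf(in, "%d", &num) == 1)
        root = insert(root, num);

    output_to_file(out, root);   /* in-order traversal prints the numbers sorted */

    fclose(in);
    fclose(out);
    return 0;
}

Note that an unbalanced tree like this degenerates into a linked list if the input is already (nearly) sorted.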
Farouq Jouti
  • 1,647
  • 9
  • 15
  • 1
    That's at least as complicated as storing the data in an array. – Mats Petersson Sep 21 '13 at 22:16
  • 1
    No, it's complicated even if you are familiar with those constructs. And it also means storing all the data in memory - just taking a lot more memory than using an array (because unless you spend even more time doing some sort of block allocation, each node is allocated independently and thus takes up 16-64 bytes per node, meaning the allocation of 1M integers now takes 16-64MB instead of the 4MB that an array takes). – Mats Petersson Sep 21 '13 at 22:21
  • So, are you using an array of nodes to store your nodes, or are you using `malloc`? If you are using `malloc`, then it's 40 bytes per node of "admin" space that `malloc` needs so it can put the block back when you later call `free`. And my machine uses 8 bytes for pointers, so that would be 16 bytes of overhead per node, even if you start with `node array[1000000];` and then dole out nodes from that. – Mats Petersson Sep 21 '13 at 22:28
0

The "trivial" solution is to repeatedly iterate over the input, each time finding the "lowest[1] number not yet stored in the output file", until there are no numbers left to find. However, that will be tediously slow, since it probably means 1M reads of the entire unsorted file. But it won't take up much memory...
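
A hedged sketch of that trivial approach, mostly to make the cost obvious (file names are assumptions; it does one full pass over the input per distinct value):

#include <stdio.h>
#include <limits.h>

int main(void)
{
    FILE *in = fopen("input.txt", "r");    /* file names are assumptions */
    FILE *out = fopen("output.txt", "w");
    int prev_set = 0, prev = 0;

    if (!in || !out)
        return 1;

    for (;;) {
        int v, best = INT_MAX, best_count = 0;

        /* one full pass: find the smallest value not yet written, and how often it occurs */
        rewind(in);
        while (fscanf(in, "%d", &v) == 1) {
            if (prev_set && v <= prev)
                continue;                  /* already written in an earlier pass */
            if (v < best) {
                best = v;
                best_count = 1;
            } else if (v == best) {
                best_count++;
            }
        }

        if (best_count == 0)
            break;                         /* nothing left to write */

        while (best_count--)
            fprintf(out, "%d\n", best);
        prev = best;
        prev_set = 1;
    }

    fclose(in);
    fclose(out);
    return 0;
}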

Unfortunately, any other method will involve reading the file (or some portion of it) into memory. On-disk sorting is a pretty rubbish method when there is enough memory to read the whole thing into RAM - if you were to, say, sort a 40GB list of numbers on a machine with 16GB, then you have no choice, of course. But reading it all into memory and then writing it back out again is always the best choice if there is enough memory.
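
For comparison, the plain read-everything, qsort, write-back version that the question is trying to avoid, which is usually the right answer for a million integers (the file names and the fixed count of 1,000,000 values are assumptions):

#include <stdio.h>
#include <stdlib.h>

#define N 1000000   /* assumed number of values in the input file */

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

int main(void)
{
    int *values = malloc(N * sizeof *values);   /* ~4MB for 1M ints */
    FILE *in = fopen("input.txt", "r");         /* file names are assumptions */
    FILE *out = fopen("output.txt", "w");
    size_t count = 0;

    if (!values || !in || !out)
        return 1;

    while (count < N && fscanf(in, "%d", &values[count]) == 1)
        count++;

    qsort(values, count, sizeof *values, cmp_int);

    for (size_t i = 0; i < count; i++)
        fprintf(out, "%d\n", values[i]);

    fclose(in);
    fclose(out);
    free(values);
    return 0;
}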

[1] Or highest, if you want to sort "high to low".

Mats Petersson
  • 123,518
  • 13
  • 127
  • 216