126

I have a fixed-width-field file which I'm trying to sort using the UNIX (Cygwin, in my case) sort utility.

The problem is there is a two-line header at the top of the file which is being sorted to the bottom of the file (as each header line begins with a colon).

Is there a way to tell sort either "pass the first two lines across unsorted" or to specify an ordering which sorts the colon lines to the top? The remaining lines always start with a 6-digit number (which is actually the key I'm sorting on), if that helps.

Example:

:0:12345
:1:6:2:3:8:4:2
010005TSTDOG_FOOD01
500123TSTMY_RADAR00
222334NOTALINEOUT01
477821USASHUTTLES21
325611LVEANOTHERS00

should sort to:

:0:12345
:1:6:2:3:8:4:2
010005TSTDOG_FOOD01
222334NOTALINEOUT01
325611LVEANOTHERS00
477821USASHUTTLES21
500123TSTMY_RADAR00
Rob Gilliam

12 Answers

152
(head -n 2 <file> && tail -n +3 <file> | sort) > newfile

The parentheses create a subshell, wrapping up the stdout so you can pipe it or redirect it as if it had come from a single command.
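
A minimal sketch of the same idea on part of the question's sample data. The helper name `sort_keep_header` and the scratch file from `mktemp` are illustrative assumptions, not part of the original answer:

```shell
# Hypothetical helper: print the first N lines of a regular file unchanged,
# then the remaining lines sorted. The file is read twice, so this only
# works on real files, not on pipes.
sort_keep_header() {
    n=$1
    file=$2
    # The subshell groups both outputs into a single stdout stream.
    (head -n "$n" "$file" && tail -n +"$((n + 1))" "$file" | sort)
}

sample=$(mktemp)
printf '%s\n' ':0:12345' ':1:6:2:3:8:4:2' \
    '010005TSTDOG_FOOD01' '500123TSTMY_RADAR00' '222334NOTALINEOUT01' > "$sample"

# Prints the two colon lines first, then the data lines in ascending order.
sort_keep_header 2 "$sample"
rm -f "$sample"
```

Because the whole group writes to one stdout, the result can be redirected or piped onward just like the output of a single command.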

BobS
  • 1
    Thanks; I'm accepting this answer as it seems most complete and concise (and I understand what it's doing!) - it should be "head -n 2", though :-) – Rob Gilliam Jan 28 '13 at 14:18
  • 8
    Is there a way to have this version work on piped-in data? I tried with `tee >(head -n $header_size) | tail -n +$header_size | sort`, but head seems to run after the `tail|sort` pipe, so the header ends up printed in the end. Is this deterministic or a race condition? – Damien Pollet Nov 17 '14 at 17:34
  • You could probably piece together something where you use `cat` to redirect the stdin to a temporary file, then run the above command on that new file, but it's starting to get ugly enough that it's probably better to use one of the awk-based solutions given in the other responses. – BobS Nov 23 '14 at 00:04
  • @DamienPollet: See [Dave](http://stackoverflow.com/users/3398351/dave)'s [answer](http://stackoverflow.com/a/22281855/15168). – Jonathan Leffler Feb 01 '15 at 22:38
  • @DamienPollet: See [freeseek's](http://stackoverflow.com/users/4339300/freeseek) [answer](http://stackoverflow.com/a/27368739/343388) – fess . May 04 '15 at 03:55
  • If you're using zsh (and have `multios` set, which you should by default) you can also use `cat $file > >(head -n2) > >(tail -n+3 | sort)`. For files it's not much of a change, but for sorting command output it has the advantage that it'll only run the command once. – rookie1024 Nov 24 '17 at 03:10
  • I wonder why this doesn't work with piped in data. It seems that the tail doesn't run or gets nothing after the head. Probably because more than 1 line is being read, leaving nothing for tail. – CMCDragonkai Apr 23 '18 at 03:07
  • @CMCDragonkai, this is because head consumes more data than it prints, probably the first 4k block read, leaving very little data for tail. This is why the `read -r; echo "$REPLY"` solution is the correct one (i.e., never use `head` followed by `tail` on (the same) piped data!). Please upvote this comment to the top, so people don't erroneously use this (so called accepted) answer instead of the much better solution provided by freeseek's correct (for far more use cases, including piped data) answer, below. – Michael Goldshteyn Apr 16 '21 at 15:33
101

If you don't mind using awk, you can take advantage of awk's built-in pipe abilities, e.g.:

extract_data | awk 'NR<3{print $0;next}{print $0| "sort -r"}' 

This prints the first two lines verbatim and pipes the rest through sort -r, which sorts in reverse order; drop the -r for an ascending sort.

Note that this has the very specific advantage of being able to selectively sort parts of a piped input. Approaches that run head and tail as separate commands need a plain file that can be read more than once. This works on anything.
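
As a rough sketch of the piped case, here is the same awk idea wrapped in a function with a configurable header size and an ascending sort. The function name `awk_keep_header` and the awk variable `n` are illustrative assumptions:

```shell
# Hypothetical helper: pass the first N lines of stdin through unchanged and
# pipe every later line to sort. stdin is read only once, so this works on
# pipes as well as files.
awk_keep_header() {
    awk -v n="$1" 'NR <= n { print; next } { print | "sort" }'
}

# Header lines stay on top; the data lines come out in ascending order.
printf '%s\n' ':0:12345' ':1:6:2:3:8:4:2' \
    '500123TSTMY_RADAR00' '010005TSTDOG_FOOD01' | awk_keep_header 2
```

This assumes that awk writes out the header lines before `sort` produces its output, since `sort` only emits anything once awk closes the pipe at exit.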

Hotschke
Dave
  • 3
    Very nice, and it works with arbitrary pipes, not only files! – lapo Nov 24 '14 at 16:16
  • 7
    Beautiful, awk never stops surprising me. Also, you don't need the `$0`, `print` is enough. – nachocab Jan 28 '15 at 20:50
  • 1
    @SamWatkins [freeseek's](http://stackoverflow.com/users/4339300/freeseek) [answer](http://stackoverflow.com/a/27368739/343388) is less ugly. – fess . May 04 '15 at 03:56
  • What's the -r option doing to sort? Is this supposed to be reverse sort? – W7GVR May 14 '15 at 20:55
54

In simple cases, sed can do the job elegantly:

    your_script | (sed -u 1q; sort)

or equivalently,

    cat your_data | (sed -u 1q; sort)

The key is in the 1q -- print first line (header) and quit (leaving the rest of the input to sort).

For the example given, 2q will do the trick.

The -u switch (unbuffered) is required for those seds (notably, GNU's) that would otherwise read the input in chunks, thereby consuming data that you want to go through sort instead.

Andrea
  • 1
    Hi, @Andrea; welcome to Stack Overflow. I'm afraid your answer doesn't work, at least not when I'm testing it in Git Bash on Windows (I've moved on from Cygwin, the shell I was using at a different job 6 years ago). The sed command pulls all the data off the stdin, leaving no data to pass to sort. Try changing the command to cat your_data | (sed 1q ; wc -l) to see what I mean. – Rob Gilliam May 17 '19 at 11:54
  • 1
    This could work if you pass the input in a second time to the sed command, like this: cat sortMe.csv | (sed 1q sortMe.csv; sort -t, -k3 -rn) > sorted.csv – Harrison Cramer Jun 05 '20 at 03:00
  • 1
    IMO this is the simplest solution here and easiest to remember. It works with piped data with no special considerations or awkward quoting and escaping, and does not need to be used multiple times if you are sorting on multiple columns by a chain of piped sort commands with the -s flag. eg. `bgzip -dc somefile.tsv.gz | (sed -u 2q; sort -k 3,3 -n | sort -k 2,2 -n -s | sort -k 1,1 -s) | bgzip -c > my_sorted_file.tsv.gz`. Key though is the edit adding the `-u` flag which ought to have solved @RobGilliam's problem above. – slowkoni Dec 14 '20 at 20:53
  • 1
    Can you explain a bit how the pipe and the parentheses work? – Gqqnbig Mar 17 '21 at 11:49
  • I would use `head -n 1` instead of `sed -u 1q`. This head command is POSIX and much more portable than dealing with sed's `-u` flag. – dan Jun 03 '22 at 01:11
45

Here is a version that works on piped data:

(read -r; printf "%s\n" "$REPLY"; sort)

If your header has multiple lines:

(for i in $(seq $HEADER_ROWS); do read -r; printf "%s\n" "$REPLY"; done; sort)

This solution is from here
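
A minimal sketch of the multi-line variant as a reusable function, written for a POSIX-style shell. The name `read_keep_header` is an illustrative assumption, and `IFS=` is added so that leading and trailing whitespace in header lines survives when reading into a named variable:

```shell
# Hypothetical helper: copy the first N lines of stdin verbatim using the
# shell's own read builtin, then let sort consume whatever is left.
read_keep_header() {
    i=0
    # Check the counter before calling read, so no extra line is consumed.
    while [ "$i" -lt "$1" ] && IFS= read -r line; do
        printf '%s\n' "$line"
        i=$((i + 1))
    done
    sort
}

# The two colon lines pass through untouched; the rest is sorted.
printf '%s\n' ':0:12345' ':1:6:2:3:8:4:2' \
    '500123TSTMY_RADAR00' '010005TSTDOG_FOOD01' | read_keep_header 2
```

The loop and `sort` share the same stdin, so this relies on `read` taking no more input than the lines it returns.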

freeseek
  • 13
    nice. for the single header case I use `extract_data | (read h; echo "$h"; sort)` it's short enough to remember. your example covers more edge cases. :) This is the best answer. works on pipes. no awk. – fess . May 04 '15 at 03:51
  • 1
    Ok, I straced this and it seems that bash goes to special lengths to make this work. In general, if you coded this in C or another language it would not work because stdio would read more than just the first header line. If you run it on a seekable file, bash reads a larger chunk (128 bytes in my test), then lseeks back to after the end of the first line. If you run it on a pipe, bash reads one char at a time until it passes the end of the line. – Sam Watkins May 05 '15 at 09:01
  • Nice! If you just want to eat the header, it's even easier to remember: `extract_data | (read; sort)` – Jason Suárez Jan 27 '17 at 18:53
  • This one is almost perfect but you need to use "IFS= read" instead of "read" to keep leading and trailing spaces. – Stanislav German-Evtushenko Jun 23 '17 at 11:27
  • 9
    This should be the accepted answer in my opinion. Simple, concise and more flexible in that it also works on piped data. – Paul I Nov 21 '17 at 23:06
8

You can use tail -n +3 <file> | sort ... (tail will output the file contents from the 3rd line).

Anton Kovalenko
4
head -2 <your_file> && nawk 'NR>2' <your_file> | sort

example:

> cat temp
10
8
1
2
3
4
5
> head -2 temp && nawk 'NR>2' temp | sort -r
10
8
5
4
3
2
1
Vijay
3

It only takes 2 lines of code...

head -1 test.txt > a.tmp; 
tail -n+2 test.txt | sort -n >> a.tmp;

For numeric data, -n is required. For an alphabetical sort, -n is not required.

Example file:
$ cat test.txt

header
8
5
100
1
-1

Result:
$ cat a.tmp

header
-1
1
5
8
100
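
The effect of `-n` on the values above can be checked with a short sketch. The variable names are illustrative, and `LC_ALL=C` is an assumption added here to make the text ordering predictable across locales:

```shell
# Same values as the example file, without the header line.
values=$(printf '%s\n' 8 5 100 1 -1)

# Byte-wise text ordering: "100" sorts before "5" because '1' < '5'.
lexical=$(printf '%s\n' "$values" | LC_ALL=C sort)

# Numeric ordering: each line is compared as a number.
numeric=$(printf '%s\n' "$values" | LC_ALL=C sort -n)

printf 'without -n:\n%s\n' "$lexical"   # -1 1 100 5 8, one per line
printf 'with -n:\n%s\n' "$numeric"      # -1 1 5 8 100, one per line
```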

Rodrigo Taboada
  • 2
    Isn't this basically the same answer as the accepted answer? (Except BobS's approach puts the result on stdout, allowing you to send the result through other filters before being written to file, if necessary) – Rob Gilliam Feb 02 '15 at 09:57
2

Here's a bash function that takes exactly the same arguments as sort and supports both files and pipes.

function skip_header_sort() {
    if [[ $# -gt 0 ]] && [[ -f ${@: -1} ]]; then
        local file=${@: -1}
        set -- "${@:1:$(($#-1))}"
    fi
    awk -vsargs="$*" 'NR<2{print; next}{print | "sort "sargs}' $file
}

How it works. This line checks if there is at least one argument and if the last argument is a file.

    if [[ $# -gt 0 ]] && [[ -f ${@: -1} ]]; then

This saves the file name in a separate variable, since we're about to remove the last argument.

        local file=${@: -1}

Here we remove the last argument, since we don't want to pass it to sort as an option.

        set -- "${@:1:$(($#-1))}"

Finally, we do the awk part, passing the arguments (minus the last argument if it was the file) to sort in awk. This was originally suggested by Dave, and modified to take sort arguments. We rely on the fact that $file will be empty if we're piping, and is thus ignored.

    awk -vsargs="$*" 'NR<2{print; next}{print | "sort "sargs}' $file

Example usage with a comma separated file.

$ cat /tmp/test
A,B,C
0,1,2
1,2,0
2,0,1

# SORT NUMERICALLY SECOND COLUMN
$ skip_header_sort -t, -nk2 /tmp/test
A,B,C
2,0,1
0,1,2
1,2,0

# SORT REVERSE NUMERICALLY THIRD COLUMN
$ cat /tmp/test | skip_header_sort -t, -nrk3
A,B,C
0,1,2
2,0,1
1,2,0
flu
0

With Python:

import sys
HEADER_ROWS=2

for _ in range(HEADER_ROWS):
    sys.stdout.write(next(sys.stdin))
for row in sorted(sys.stdin):
    sys.stdout.write(row)
crusaderky
0

Here's a bash shell function derived from the other answers. It handles both files and pipes. First argument is the file name or '-' for stdin. Remaining arguments are passed to sort. A couple examples:

$ hsort myfile.txt
$ head -n 100 myfile.txt | hsort -
$ hsort myfile.txt -k 2,2 | head -n 20 | hsort - -r

The shell function:

hsort ()
{
   if [ "$1" = "-h" ]; then
       echo "Sort a file or standard input, treating the first line as a header.";
       echo "The first argument is the file or '-' for standard input. Additional";
       echo "arguments to sort follow the first argument, including other files.";
       echo "File syntax : $ hsort file [sort-options] [file...]";
       echo "STDIN syntax: $ hsort - [sort-options] [file...]";
       return 0;
   elif [ -f "$1" ]; then
       local file=$1;
       shift;
       # Quote "$file" and "$@" so names and options containing spaces survive.
       (head -n 1 "$file" && tail -n +2 "$file" | sort "$@");
   elif [ "$1" = "-" ]; then
       shift;
       # read takes only the header line, leaving the rest of stdin for sort.
       (read -r; printf "%s\n" "$REPLY"; sort "$@");
   else
       >&2 echo "Error. File not found: $1";
       >&2 echo "Use either 'hsort <file> [sort-options]' or 'hsort - [sort-options]'";
       return 1 ;
   fi
}
JonDeg
0

This is the same as Ian Sherbin's answer, but my implementation is:

cut -d'|' -f3,4,7 $arg1 | uniq > filetmp.tc
head -1 filetmp.tc > file.tc;
tail -n+2 filetmp.tc | sort -t"|" -k2,2 >> file.tc;
Peter Wilson
Bik
-5
cat file_name.txt | sed 1d | sort 

This will do what you want.

Sathish G
  • 1
    1) This only removes the header line and sorts the rest, it doesn't sort everything below the header line leaving the header intact. 2) it removes the first line only, when the header is actually two lines (read the question). 3) Why do you use "cat file_name.txt | sed 1d" when "sed 1d < file_name.txt" or even just "sed 1d file_name.txt" has the same effect? – Rob Gilliam Mar 09 '16 at 13:46