4

I have a big file containing 27 columns and nearly 6 million rows. The following is a little example of my file

head data
0.65   0.722222   1.0      0.75     0
0.35   0.277778   0.0      0.25     0
0      0.666667   0.75     0.5      0.5625
0      0.333333   0.25     0.5      0.4375 

Rows are SNPs and columns are individuals. and I have "2 rows per each SNP" (one for reference allele and the other for alternative allele). In the example above I showed data for 2 SNPs (rows 1 and 2 correspond to SNP1 and rows 3 and 4 correspond to SNP2). I want to check if for each SNP, both ref and alt allele frequencies are 0, replace them with 9. this is my desired output:

head desired
0.65   0.722222   1.0      0.75     9
0.35   0.277778   0.0      0.25     9
9      0.666667   0.75     0.5      0.5625
9      0.333333   0.25     0.5      0.4375 
terdon
  • 10,071
  • 5
  • 22
  • 48
Anna1364
  • 516
  • 2
  • 8
  • 1
    Do you only want to replace a field with 9 if the same field of the other line is 0 or if any of them are? For example, if you have a line with 0.65 0.722222 1.0 0.75 0 and the corresponding line is 0 0.722222 1.0 0.75 0.3 (so field 1 is 0 in one line and the last field is 0 in the other), what should happen? Should both 0s be made 9s? Neither? – terdon Dec 24 '17 at 11:12

1 Answers1

2

A possible approach in plain python:

#!/usr/bin/env python3

import sys

def both_zero(freq1, freq2):
    return freq1 == "0" and freq2 == "0"

with open(sys.argv[1], "r") as snp_file:
    for line1 in snp_file:
        # Consume one extra line to get the frequencies
        # for the alternative allele
        line2 = snp_file.readline()
        # "zip" the frequencies in pairs
        freq_pairs = zip(line1.strip().split(), line2.strip().split())
        recoded_pairs = [
            ("9", "9") if both_zero(freq1, freq2) else (freq1, freq2)
            for (freq1, freq2) in freq_pairs]
        # "unzip" the recoded frequencies into separate objects
        # (see https://stackoverflow.com/a/19343/1878788)
        (recoded1, recoded2) = zip(*recoded_pairs)
        print(*recoded1, sep="\t")
        print(*recoded2, sep="\t")

Calling it with your example snp frequencies file as command-line argument:

$ ./recode_absent.py snps.txt
0.65    0.722222        1.0     0.75    9
0.35    0.277778        0.0     0.25    9
9       0.666667        0.75    0.5     0.5625
9       0.333333        0.25    0.5     0.4375
bli
  • 3,130
  • 2
  • 15
  • 36
  • thanks. I just tried your python code, but I got a syntax error: File "./replace.py", line 21 print(*recoded1, sep="\t") ^ SyntaxError: invalid syntax – Anna1364 Jan 28 '18 at 20:05
  • @Anna1364 Maybe you are using python2 and not python3. Either try to add from __future__ import print_function at the top of the script, or try to execute the script with python3 – bli Jan 29 '18 at 12:28