-1

I'm working with GWAS data that has 2 million columns and 522 rows. I need to replace "00" with "N/A" throughout the data. Since the file is huge, I'm reading it line by line with `open()`.

sample data:

ID,kgp11270025,kgp570033,rs707,kgp7500
1,CT,GT,CA,00
200,00,TG,00,GT
300,AA,00,CG,AA
400,GG,CC,AA,TA 

Desired Output:

ID,kgp11270025,kgp570033,rs707,kgp7500
1,CT,GT,CA,N/A
200,N/A,TG,N/A,GT
300,AA,N/A,CG,AA
400,GG,CC,AA,TA 

The code I wrote:

import re

input_file = "test.csv"
output_file = "testresult.csv"

# print("Processing data from", input_file)
with open(input_file) as f:
    lineno = 0
    for line in f:
        lineno = lineno + 1
        if (lineno == 1):
            #need to skip first line
            # print("Skipping line 1 which is a header")
            print(line.rstrip())
        else:
            # print("Processing line {}".format(lineno))
            line = re.sub(r',00', ',N/A', line.rstrip())
            print(line)
    # print("Processed {} lines".format(lineno))

What am I missing?

Raju Natha
  • What errors/outputs are you getting with your current code? – Aravind G. May 18 '22 at 10:12
  • Why did you declare `output_file = "testresult.csv"` and then never use that variable? – Daweo May 18 '22 at 10:15
  • When I use `print(line)`, it shows fine, but when I tried to `write(line)` to the output file it didn't work; I don't know exactly what's going wrong – Raju Natha May 18 '22 at 10:16
  • @Daweo, I'm not sure the code I used to `write(line)` is correct; can you help me with that? I'll try – Raju Natha May 18 '22 at 10:18
  • Does this answer your question? [Directing print output to a .txt file](https://stackoverflow.com/questions/36571560/directing-print-output-to-a-txt-file) – SuperStormer May 22 '22 at 20:41

2 Answers

1

when I use print(line), it shows fine

Then just use the `file` keyword argument of `print`, as follows:

import re

input_file = "test.csv"
output_file = "testresult.csv"

# print("Processing data from", input_file)
with open(input_file) as f, open(output_file, "w") as g:
    lineno = 0
    for line in f:
        lineno = lineno + 1
        if (lineno == 1):
            #need to skip first line
            # print("Skipping line 1 which is a header")
            print(line.rstrip(),file=g)
        else:
            # print("Processing line {}".format(lineno))
            line = re.sub(r',00', ',N/A', line.rstrip())
            print(line,file=g)
    # print("Processed {} lines".format(lineno))

Note that for the input file, passing just the file name is sufficient because the default mode is read-text ("r"), but specifying write mode ("w") is required for the output file.
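Since the question mentions that `write(line)` wasn't working, here is a minimal sketch of the equivalent `write()`-based variant. The one difference from `print` is that `write` does not append a newline, so you must add `"\n"` yourself. The sample input is written inline here (hypothetical file names matching the question) so the sketch is self-contained:

```python
import re

# Hypothetical sample input, created here only so the sketch runs on its own.
with open("test.csv", "w") as t:
    t.write("ID,kgp11270025,rs707\n1,CT,00\n200,00,GT\n")

with open("test.csv") as f, open("testresult.csv", "w") as g:
    for lineno, line in enumerate(f, start=1):
        if lineno == 1:
            # Header passes through unchanged.
            g.write(line.rstrip() + "\n")
        else:
            # write() does not add a newline, so append it explicitly.
            g.write(re.sub(r',00', ',N/A', line.rstrip()) + "\n")
```

After running, `testresult.csv` contains the header followed by the substituted rows.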

Daweo
0

You could use pandas to do this easily:

import pandas as pd
df = pd.read_csv('test.csv', dtype=str)
df = df.replace('00', 'N/A')
df.to_csv('test-result.csv', index=False)

For very large CSV files, you could do this:

header = True
for chunk in pd.read_csv('test.csv', chunksize=your_chunk_size, dtype=str):
    chunk = chunk.replace('00', 'N/A')
    chunk.to_csv('test-result.csv', index=False, header=header, mode='a')
    header = False

Aravind G.
  • Ya, since I have a 3 GB file, pandas isn't able to read the whole data file; that's why I need to read it line by line – Raju Natha May 18 '22 at 10:21
  • @RajuNatha For large CSV files, you can [read](https://stackoverflow.com/a/25962187/16424700) and [store](https://stackoverflow.com/a/38531304/16424700) the CSV files in chunks. Let me update my answer to accommodate this. – Aravind G. May 18 '22 at 10:26
  • @RajuNatha Check updated answer. – Aravind G. May 18 '22 at 10:32