
I have a directory full of data files in the following format:

4 2 5 7
1 4 9 8
8 7 7 1
  4 1 4
    1 5
    2 0
    1 0
    0 0
    0 0

They are separated by tabs. The third and fourth columns contain useful information until they reach the zeroes, at which point they are arbitrarily filled with zeroes until the end of the file.

I want to get the length of the longest column, not counting the zero values at the bottom. In this case the longest column is column 3, with a length of 7, because we disregard the zeroes at the bottom. Then I want to transform all the other columns by padding them with zeroes until their length equals the length of column 3 (except column 4, because it is already filled down with zeroes). Finally I want to get rid of all the zeroes beyond that maximum length in every column, so my desired file output is as follows:

4 2 5 7
1 4 9 8
8 7 7 1
0 4 1 4
0 0 1 5
0 0 2 0
0 0 1 0

These files have roughly 100,000 rows each on average, so processing them takes a while, and I can't really find an efficient way of doing this. Because file reading goes line by line, am I right in assuming that in order to find the length of a column we need to process, in the worst case, N rows, where N is the length of the entire file? When I just ran a script to print out all the rows, it took about 10 seconds per file. Also, I'd like to modify the file in place (overwrite it).

vgbcell
  • 2 things: First of all, it really helps us if you show what code you have, that way we can help you improve your work. Secondly, bear in mind that printing to the console is not a good indicator of processing time. It's the act of displaying it on the screen that's likely taking up much of your 10 seconds. If you instead read and then write to a different file, I expect you'll find it takes 2 seconds or fewer. – Scott Mermelstein Oct 11 '16 at 16:42
  • Does the first zero and/or blank in a column mean that all subsequent items in that column are also zero or blank? – Robᵩ Oct 11 '16 at 17:41
  • @Rob all subsequent items are also zero or blank, yes – vgbcell Oct 11 '16 at 17:57

2 Answers


Hi, I would use Pandas and NumPy for this:

import pandas as pd
import numpy as np

# The data files have no header row, so pass header=None
df = pd.read_csv('csv.csv', delimiter='\t', header=None)
# Turn zeroes into NaN so the trailing filler rows become entirely NaN
df = df.replace(0, np.nan)
# Drop rows from the end while the last row is all NaN
while df.tail(1).isnull().all().all():
    df = df.iloc[:-1]
# Restore the zeroes (this also fills the blank cells in columns 1 and 2),
# go back to integers, and write the result out
df = df.replace(np.nan, 0).astype(int)
df.to_csv('csv2.csv', sep='\t', index=False, header=False)  # I used a different name just for testing

You create a DataFrame with your CSV data.
There are a lot of built-in functions that deal with NaN values, so change all 0s to NaN first. Then start at the end with tail(1) and check whether that row is all() NaN; if so, drop the last row and repeat. Finally turn the NaNs back into 0s and write the result out. I did this with 100k rows and it takes only a few seconds.
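If the row-by-row while loop turns out to be slow when there are many trailing filler rows, a vectorized trim over the same df should also work. This is only a sketch, assuming df still has its default integer index from read_csv and the zeroes have already been replaced with NaN as above:

# Sketch: find the last row that still has real (non-NaN) data and slice up to it
mask = df.notnull().any(axis=1)      # True for rows containing at least one real value
last_valid = mask[mask].index[-1]    # label of the last such row (default RangeIndex assumed)
df = df.loc[:last_valid]             # .loc slicing is inclusive of the end label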

EoinS

Here are two ways to do it:

# Read in the lines and fill in the zeroes
with open('input.txt') as input_file:
    data = [[item.strip() or '0' 
             for item in line.split('\t')]
            for line in input_file]

# Delete lines near the end that are only zeroes
while set(data[-1]) == {'0'}:
    del data[-1]

# Write out the lines
with open('output.txt', 'wt') as output_file:
    output_file.writelines('\t'.join(line) + '\n' for line in data)

Or, since once the zeroes start every subsequent value is also zero (per the comments), you can stream the file and stop at the first all-zero row:

with open('input.txt') as input_file:
    with open('output.txt', 'wt') as output_file:
        for line in input_file:
            line = line.split('\t')
            # Empty cells become '0'
            line = [item.strip() or '0' for item in line]
            # Stop at the first row that is nothing but zeroes
            if all(item == '0' for item in line):
                break
            output_file.write('\t'.join(line))
            output_file.write('\n')
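
The question also asks for the file to be modified in place. Neither snippet above does that, but a common pattern is to write to a temporary file and then swap it over the original with os.replace. The following is only a sketch built on the second approach; the rewrite_in_place name and the '.tmp' suffix are my own choices, not anything from this answer:

import os

def rewrite_in_place(path):
    tmp_path = path + '.tmp'  # hypothetical temporary file next to the original
    with open(path) as input_file, open(tmp_path, 'wt') as output_file:
        for line in input_file:
            fields = [item.strip() or '0' for item in line.split('\t')]
            if all(item == '0' for item in fields):
                break  # everything from here on is zero filler
            output_file.write('\t'.join(fields) + '\n')
    os.replace(tmp_path, path)  # atomically replace the original file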
Robᵩ