
I have a Python script that reads a number of CSV files and creates a new CSV file containing the last line of each file read. The script is this:

    import pandas as pd
    import glob
    import os

    path = r'Directory of the files read\*common_file_name_part.csv'
    r_path = r'Directory where the resulting file is saved.'
    if os.path.exists(r_path + 'csv'):
       os.remove(r_path + 'csv')
    if os.path.exists(r_path + 'txt'):
       os.remove(r_path + 'txt')

    files = glob.glob(path)
    column_list = [None] * 44
    for i in range(44):
        column_list[i] = str(i + 1)

    df = pd.DataFrame(columns = column_list)
    for name in files:
        df_n = pd.read_csv(name, names = column_list)
        df = df.append(df_n.iloc[-1], ignore_index=True)
        del df_n

    df.to_csv(r_path + 'csv', index=False, header=False)
    del df

The files all share a common name ending and have a unique name beginning. The resulting file doesn't have an extension so I can do some checks. My problem is that the files have a variable number of lines and columns, even within the same file, and I can't read them properly. If I don't specify the column names, pandas takes the first line as the column names, which causes a lot of columns to be lost from some of the files. I've also tried reading the files without headers, by writing:

    df = pd.read_csv(r_path, header=None)

but it doesn't seem to work. I wanted to upload some files as an example, but I don't know how; if someone tells me how, I'll be happy to do it.

  • Is there a reason you need to use pandas? Are you really just interested in the last line of each CSV file? – Chris Doyle Jul 30 '19 at 10:25
  • Yes. I just want the last line of each file. – George Tsaki Jul 30 '19 at 10:27
  • Then you should be able to do this without pandas. Googling `get last line of file in python` should give you many better approaches. – Chris Doyle Jul 30 '19 at 10:28
  • Pandas is the only way I found for reading CSV easily, because every other method removes the delimiter, which is something I don't want. There might be more available, but I haven't found them, unfortunately. – George Tsaki Jul 30 '19 at 10:34
  • But you don't seem to be actually parsing or doing anything with the CSV; you just want to read the last line from each file and write them all into a single new file? – Chris Doyle Jul 30 '19 at 10:36
  • Can you share an example of the end of one of your files? – Chris Doyle Jul 30 '19 at 10:37
  • You can easily get the last line with `for row in open("my_file.csv"): pass`. It isn't very elegant and doesn't clean up nicely, but it's a single line of code and doesn't load the entire file into memory. – Tim Jul 30 '19 at 10:38
  • Do you want the last line of the CSV files or the last row in the CSV files? – Samuel Lopez Jul 30 '19 at 10:39
  • @Tim How do I save that row? I mean, what type does Python read it as? – George Tsaki Jul 30 '19 at 10:42
  • @SamuelLopez Isn't it the same thing? – George Tsaki Jul 30 '19 at 10:42
  • @GeorgeTsaki A CSV file is just text. The row is a string; just write it out to a file. You can literally do all of that pandas code with a couple of lines of Python. – Tim Jul 30 '19 at 10:45
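A minimal sketch of the no-pandas approach suggested in the comments (the function name and paths are hypothetical; it assumes each file ends with a complete data line):

```python
import glob

def collect_last_lines(pattern, out_path):
    """Write the last non-blank line of every file matching pattern to out_path."""
    with open(out_path, "w") as out:
        for name in sorted(glob.glob(pattern)):
            last = None
            with open(name) as f:
                for line in f:          # streams the file; O(1) memory
                    if line.strip():    # skip trailing blank lines
                        last = line
            if last is not None:
                out.write(last.rstrip("\n") + "\n")

# Example (hypothetical paths):
# collect_last_lines(r"Directory of the files read\*common_file_name_part.csv", "result")
```

Because the lines are copied verbatim, the delimiters are preserved exactly as they appear in the source files.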

2 Answers


You can preprocess your files to pad the rows that have fewer than the maximum number of columns. Ref: Python csv; get max length of all columns then lengthen all other columns to that length

You can also use the sep argument, or, if pandas fails to read your CSV correctly, read the file as fixed-width. See the answers on this SO question: Read CSV into a dataFrame with varying row lengths using Pandas
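A minimal sketch of that padding idea (the helper name is hypothetical; it loads the whole file into memory, which is fine for small files):

```python
import csv

def pad_rows(in_path, out_path):
    """Rewrite a ragged CSV so every row has the same number of columns."""
    with open(in_path, newline="") as f:
        rows = list(csv.reader(f))
    width = max(len(r) for r in rows)   # the longest row defines the width
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        for r in rows:
            # Pad short rows with empty cells on the right
            writer.writerow(r + [""] * (width - len(r)))
```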

suitsense
  • Or he could just not use pandas, since he doesn't care about any of the data in the CSV file and is only interested in writing the last line of each file to a single file. – Chris Doyle Jul 30 '19 at 10:49
  • Yes, I think that will be better if there is no data manipulation whatsoever! He can use the builtin csv module instead. – suitsense Jul 30 '19 at 11:07

It looks like you actually have two problems:

  1. getting a complete list of all the columns in all of the files

  2. reading the last line from each file and merging into the correct columns

To solve this, the standard Python csv module makes more sense than pandas.

I will assume you have identified the list of files you need and that it's in your `files` variable.

First get all the headers

import csv

# Use a set to eliminate duplicates
headers = set()

# Read the header from each file
for file in files:
    with open(file) as f:
        reader = csv.reader(f)

        # Read the first line as this will be the header
        header = next(reader)

        # Update the set with the list of headers
        headers.update(header)

print("Headers:", headers)

Now read the last lines and write them to the result file

Use a DictReader and DictWriter so each row is handled as a dict mapped to the header.

with open(r_path, "w", newline="") as f_out:
    # The option extrasaction="ignore" allows rows that don't
    # provide every column when calling writerow
    writer = csv.DictWriter(f_out, fieldnames=headers, extrasaction="ignore")
    writer.writeheader()

    # Read the last line of each file
    for file in files:
        with open(file, newline="") as f_in:
            reader = csv.DictReader(f_in)

            # Read all rows, keeping only the last one
            row = None
            for row in reader:
                pass

            # Write the last row into the result file
            if row is not None:
                writer.writerow(row)
Tim
  • I am getting an error that I am missing the fieldnames argument in DictWriter. Since I am not familiar with the DictWriter, what should I put as fieldnames? – George Tsaki Jul 30 '19 at 11:14
  • Updated the argument; obviously I can't test this as I don't have a big pile of CSV files ;) – Tim Jul 30 '19 at 11:18
  • Now it says that it's missing the f positional argument, without actually missing it. – George Tsaki Jul 30 '19 at 11:23
  • `csv.DictReader(f_in)` needed the file handle. – Tim Jul 30 '19 at 11:49
  • Now that I tested the code, it looks like it produces a very messed-up file. The order of the cells is changed, some rows have a lot of spaces between cells, and there are also spaces between rows. Lastly, the first line seems to be triple the size it should be. – George Tsaki Jul 30 '19 at 12:10
  • You said your files have different columns... this merges all of them; I can't fix your input data! Preserving the column order could be done by using a list and checking if each column exists before appending it. But it usually doesn't matter, as the header is there to define each column's name. – Tim Jul 30 '19 at 12:23
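The order-preserving variant described in the last comment could be sketched like this (a hypothetical helper, not part of the answer: it collects column names into a list in first-seen order instead of the unordered set used above):

```python
import csv

def collect_headers_ordered(files):
    """Collect column names across files, keeping first-seen order."""
    headers = []
    for file in files:
        with open(file, newline="") as f:
            header = next(csv.reader(f))    # first line is the header
        for col in header:
            if col not in headers:          # append only new columns
                headers.append(col)
    return headers
```

Passing the resulting list as `fieldnames` to the DictWriter keeps the output columns in a stable, predictable order.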