0

I have a very large .csv file which got corrupted when I have opened it in MS Excel. The file has a datetime field (corresponding to a column in Excel/ Python dataframe) in "dd/mm/yyyy hh:mm:ss" format which seems to have got converted into "mm:ss.0:" format when I did some editing.

The datetime in my file is cyclic for every 37 rows. That means the first 37 rows correspond to today. The next 37 rows correspond to to yesterday, so on. I am not bothered about getting back the time part (hours or minutes or seconds). I need to get back at least the date.

Is there a way I can restore the datetime to original format or at least get the dates?

I searched for information on different forums on internet, but did not find any solution. Appreciate inputs.

The top few rows in the file earlier appear as (dummy data):

CN,NC,ND,Date
ccc,37.0,0.0,2022-06-03 08:47:12.167884,
fff,103221.0,316.0,2022-06-03 08:47:12.167884,
ggg,2273.0,0.0,2022-06-03 08:47:12.167884,
gfg,36189.0,41.0,2022-06-03 08:47:12.167884,
fkg,25130.0,38.0,2022-06-03 08:47:12.167884,
hgb,41297.0,81.0,2022-06-03 08:47:12.167884,
rgy,0.0,0.0,2022-06-03 08:47:12.167884,
kju,4188.0,78.0,2022-06-03 08:47:12.167884,
tgy,12522.0,17.0,2022-06-03 08:47:12.167884,

and now they appear as (dummy data):

CN,NC,ND,Date
ccc,37,0,47:12.2
fff,103221,316,47:12.2
ggg,2273,0,47:12.2
gfg,36189,41,47:12.2
fkg,25130,38,47:12.2
hgb,41297,81,47:12.2
rgy,0,0,47:12.2
kju,4188,78,47:12.2
tgy,12522,17,47:12.2
Srinivas
  • 464
  • 3
  • 14
  • So, before you had `02/06/2022 13:37:19` and now the column reads `13:37:0` - but you know that every 37 rows exactly, the date goes back 1 day ? If this is the case, awk could be a quick solution. Just make a counter of row and div it by 37, subtract the result from today's date. – MyICQ Jun 04 '22 at 07:07
  • What is in the CSV file itself (when you examine it with a text program like Notepad++)? How are you editing, saving and opening it into Excel? – Ron Rosenfeld Jun 04 '22 at 11:09
  • @MyICQ. What you stated about the date is correct. However, I did not understand your suggested solution. Can you please elaborate? – Srinivas Jun 04 '22 at 16:26
  • @ron Rosenfeld, the file contains well defined columns (header) and rows, ie., it is well structured and each row contains the same fields and data types. I opened using Excel and then did some copy/ paste and saved back. That changed the format of date fields (column). – Srinivas Jun 04 '22 at 16:30
  • I have added sample data for better understanding. – Srinivas Jun 04 '22 at 16:51
  • I **Open** the file in Excel, and I see just that strange time format. However, if I select one of the cells, and examine the formula bar, the entire date/time is present. I was able to display`06/03/2022 08:47:12.168` merely by changing the `numberformat` of the cell. If your work requires that Seconds have more than three decimals, you will need to use a different method as that is a limitation of the Excel numberformat string. So the data is in Excel. The next steps depend on how you edit, and whether you need to retain seconds to six decimals in the resultant file. Need more info. – Ron Rosenfeld Jun 04 '22 at 17:33
  • How is this related to Python (tagged)? What are you doing in Excel, what in Python? – FObersteiner Jun 04 '22 at 18:16
  • @Ron Rosenfeld, that is true, when you save the data in notepad (example), save as .csv file and then open in excel, you will see the full date in correct format in the formula bar. But I think the moment you save the file in excel (perhaps with some editing), the format of the date is lost and then end up with the problem I am facing. I do not needs hours/ minutes/ seconds, just retrieving the dates will save my day. – Srinivas Jun 04 '22 at 18:18
  • @FObersteiner, :)) the csv file was created using Python. So, I just tagged that way. I will be happy if I can know how to retrieve the dates even with Python code. – Srinivas Jun 04 '22 at 18:21
  • What if you change the format in Excel to display the full date before saving? Or maybe use another program that doesn't claim to "think" for you, like Libre office? – FObersteiner Jun 04 '22 at 18:27
  • Merely change the format in Excel before you save the file as CSV. – Ron Rosenfeld Jun 04 '22 at 18:28
  • @Ron Rosenfeld, problem is that I have already saved, now I want to retrieve. – Srinivas Jun 04 '22 at 18:42
  • In that case, the data is gone. You will leave the need to derive it from some logical method or go back to the original data source. Just to be clear, the data was corrupted not when you opened it with Excel, but when you saved it after opening it in Exce do you have backup or perhaps and XLSx file that might have the original data? – Ron Rosenfeld Jun 04 '22 at 20:28

1 Answers1

0

If you are sure that the data is gone, verify using a text editor/ viewer (and you do not have a backup), this little awk script can get some of the formatting back. Obviously what's gone is gone. You can't get the hour back, and you can't get the msec correct.

This script takes your data

CN,NC,ND,Date
ccc,37,0,47:12.2
fff,103221,316,47:12.2
ggg,2273,0,47:12.2
gfg,36189,41,47:12.2

and forms back into something with dates, based on the 37-line-rule you set (line numbers included for clarity here):

CN,NC,ND,Date
1   :ccc,37,0,2022-06-04 08:47:12.2       // today's date, or some start date
2   :fff,103221,316,2022-06-04 08:47:12.2
3   :ggg,2273,0,2022-06-04 08:47:12.2
...
38   :fff,103221,316,2022-06-03 08:47:12.2 // after every 37 lines, subtract 1 day
39   :ggg,2273,0,2022-06-03 08:47:12.2
40   :gfg,36189,41,2022-06-03 08:47:12.2

This is run by awk, which you can download for your operating system for free. For Windows, see here

BEGIN{
    FS=","
    OFS=","
    secperday=86400     # 1 day = 86400 seconds
}
{
    if (NR==1){
      print    # output the header untouched
    }
     else
    {
      mypos=NR-2                 # -2 because of the header
      daysoffset=int(mypos/37)   # change offset every 37 lines.
      mydate=strftime("%Y-%m-%d 08:", systime()-(daysoffset*secperday) )
      outdate=mydate$4           # combine computed date and original field 4
      print $1,$2,$3,outdate     # finally output the result
    }
}

If you want just the date, change the format above to not include "08:" and remove the line below, combining original field 4.

You can of course do this in python as well, easily.

MyICQ
  • 554
  • 1
  • 6
  • 20
  • Thanks for your suggestion. I will try that. For the time being, I solved my problem in Python. Thanks a lot everyone. – Srinivas Jun 05 '22 at 06:47