-1

I'm looking at files of data. The data always has some form of timestamp that is consistent throughout a given file, but the format varies widely from file to file.

A given file often starts with something like 01/02/17. It is easy to guess that the year is last but is it DMY or MDY? Some later timestamp might be 27/02/17 which is unambiguous, how can I then go back and make sure 01/02/17 was parsed DMY, or MDY if I later found 01/31/17?

I tried to do a wrapper around dateutil.parser but I could not find a way to get any format information back (even just that the format was ambiguous and so the default order was used)

Somethings that I have attempted: 1)

DATETIMEFORMAT = '%m/%d/%y %H:%M'
df[DATETIME] = pd.to_datetime(df[DATETIME],format=DATETIMEFORMAT)

where I have to go in and manually change DATETIMEFORMAT for each file after looking at the file.

2)

datecol = []
for idx in range(df.shape[0]):
  tmpdate = parser.parse(df[DATETIME][idx])
  datecol.append(pd.to_datetime(tmpdate))
df[DATETIME] = datecol

using python's dateutil.parser, this will handle most timestamps correctly but I can't set a default order that will hold for all files. I was working to get format information out when I came across

[https://stackoverflow.com/questions/53892450/get-the-format-in-dateutil-parse][1]

  1. a combination of the two
for idx in range(df.shape[0]):
  tmpdate = parser.parse(df[DATETIME][idx])
  if not is_ambiguous(tmpdate):
     formatstr = get_format_str(df,tmpdate,idx)
     break

df[DATETIME] = pd.to_datetime(df[DATETIME],format=formatstr)

where I look for an unambiguous timestamp and try to use that to recreate the format string for pd.to_datetime().

Ddavid
  • 9
  • 1
  • 1
    Please read [What topics can I ask about here?](https://stackoverflow.com/help/on-topic) and the [Welcome to Stack Overflow](https://stackoverflow.com/tour) introduction tour. "Show me how to solve this coding problem" is [off-topic for Stack Overflow](https://meta.stackoverflow.com/questions/284236/why-is-can-someone-help-me-not-an-actual-question). We expect you to make an [honest attempt at the solution](https://meta.stackoverflow.com/questions/261592/how-much-research-effort-is-expected-of-stack-overflow-users), and then ask a specific question about your implementation. – martineau Feb 17 '21 at 18:49

1 Answers1

1

The unfortunate news is ambiguous datetimes can be impossible to parse accurately. In theory, adding logic to perform sanity checks (checking if month is higher than 12, etc) could work, although this is not reliable.

A decent option is to map specific files to datetime formats, to specify when parsing. For example:

File Format
FileA mm-dd-yy
FileB dd-mm-yy

Your program logic can look up what format to use when parsing timestamps for a specific file.

Otherwise, converting whatever is producing the files upstream to use ISO-8601 will make things easier in the future for everyone.

Jack Casey
  • 1,318
  • 9
  • 18