1

I am using pyspark in databricks with a JSON file to clean data. The expression in the eval brackets comes from the JSON file.

One of the issues I am facing is manipulating timestamps/string.

I am trying to find the difference in months between a timestamp column and a single date (which is a string)

See code below.

import pyspark.sql.functions as F
df2 = df2.withColumn('test', eval("months_between( F.to_date(F.col('period_name')), lit('31/03/2019'))"))

It doesn't throw an error but evaluates to null.

Tiger_Stripes
  • 325
  • 4
  • 10
  • [Using `eval` is generally bad practice](https://stackoverflow.com/questions/1832940/why-is-using-eval-a-bad-practice). I don't see any need for it in this context. – pault Nov 07 '19 at 17:07

2 Answers2

2

Using eval is bad practice. I see no need for it in this case.

You can maintain the flexibility of reading the query from a json file using pyspark.sql.functions.expr instead

df2 = df2.withColumn(
    'test', 
    expr("months_between(to_date(period_name), to_date('31/03/2019', 'dd/MM/yyyy'))")
)
pault
  • 37,170
  • 13
  • 92
  • 132
0

Just change your eval query to:

df2 = df2.withColumn('test', eval("months_between(F.to_date(F.col('period_name')), F.to_date(F.lit('31/03/2019'), 'dd/MM/yyyy'))"))

This should convert the literal string column to datetype too.

If your date string is not yyyy-MM-dd, then you will need to specify the format.

pissall
  • 6,460
  • 1
  • 20
  • 39
  • Thanks this does work but weirdly I had to change the order of my date to yyyy-mm-dd otherwise it would still return null – Tiger_Stripes Nov 07 '19 at 15:17
  • @Tiger_Stripes I have updated my answer to include date format – pissall Nov 07 '19 at 15:33
  • Why use `eval` at all? Why not just: `df2.withColumn('test',F.months_between(F.to_date(F.col('period_name')), F.to_date(F.lit('31/03/2019'), 'dd/MM/yyyy'))))` – pault Nov 07 '19 at 17:06
  • @pault He says that the query/code is coming as a string from a json object. – pissall Nov 07 '19 at 17:08
  • but your answer modified the expression inside the `eval` which means that OP can do the same, which means they can avoid using `eval` altogether. – pault Nov 07 '19 at 17:10
  • @pault It’s possible it could be coming from another process. It’s mentioned in the question, so it’s same to assume it could be a constraint – pissall Nov 07 '19 at 18:06