0

I have a PySpark dataframe of 100k rows with the following schema:

+-------------+----------------+-----------------+-----------------+
|transactionID|transactionDate |account          |transactionAmount|
+-------------+----------------+-----------------+-----------------+
|6327jq39407  |1577868357      |Alice            |            30.09|
|6327jq39408  |1577868358      |Alice            |            11.82|
|6327jq39409  |1577870358      |Alice            |          3000.00|
|6327jq39410  |1577956658      |Alice            |           500.21|
|6327jq39411  |1578912392      |Bill             |             1.33|
|6327jq39412  |1577953658      |Alice            |            53.21|
+-------------+----------------+-----------------+-----------------+

My goal is to create a column calculating the 1 day rolling sum of the TransactionAmount column for each account, excluding transactions within the last 1 second.

From this post, I came up with the following code:

minutes= lambda i: i * 60
hours = lambda i: minutes(i) * 60
days = lambda i: hours(i) * 24

w = (Window()
    .partitionBy(col('account'))
    .orderBy('transactionDate')
    .rangeBetween(-days(1), -1)))
    
with_rolling_sum=sdf.select(*,sum(transactionAmount).over(w).alias('RollingSumOneDay))

For 99985 of the transactions in the original df, this calculates the correct value. However upon a manual check, there are 15 transactions where the rolling window excludes transactions that should be included in the 86,400 second window. In all instances of this error, the excluded transactions are >80,000 seconds away from the current transaction. Example below:

+-------------+----------------+-------+-----------------+--------------+----------+
|transactionID|transactionDate |account|transactionAmount|rollingSumOneD|correctSum| 
+-------------+----------------+------ +-----------------+--------------+----------+
|6327jq39407  |1577868357      |Alice  |            30.09|             0|         0|
|6327jq39408  |1577868358      |Alice  |            11.82|         30.09|     30.09|
|6327jq39409  |1577870358      |Alice  |          3000.00|         41.91|     41.91|
|6327jq39410  |1577956658      |Alice  |           500.21|       3041.91|   3041.91|
|6327jq39411  |1578912392      |Bill   |             1.33|             0|         0|
|6327jq39412  |1577953658      |Alice  |            53.21|       3512.02|   3542.11|
+-------------+----------------+-------+-----------------+--------------+----------+

I'm not sure what could be causing this calculation error given that the window is explicitly defined and that in most situations the rolling sum is calculated correctly. Does anyone have any experience with this type of error?

ryrodrig
  • 19
  • 2
  • Tried to replicate your code but the result of *rollingSumOneD* is different from your example. Are you sure you didn't miss something? – AdibP Sep 05 '21 at 01:17

0 Answers0