
The code in the user guide is as follows:

import polars as pl


def get_person() -> pl.Expr:
    return pl.col("first_name") + pl.lit(" ") + pl.col("last_name")

q = (
    dataset.lazy()
    .sort("birthday")
    .groupby(["state"])
    .agg(
        [
            get_person().first().alias("youngest"),
            get_person().last().alias("oldest"),
        ]
    )
    .limit(5)
)

df = q.collect()
df

1. Could the actual execution order of sort().groupby() be groupby first and then sort, similar to pandas?
An answer by @tvashtar to this question provides some tips.

lemmingxuan
  • I think you should ask one question per SO post. Can you split up this post in multiple questions? – ritchie46 May 08 '22 at 06:44
  • OK, I have edited the question. – lemmingxuan May 08 '22 at 06:46
  • I don't think pandas first does the `groupby`. Pandas does not reorder operations. – ritchie46 May 08 '22 at 07:32
  • I removed that content. SO is a good place to ask questions, but it seems to be focused on a single question per post, so maybe it's not a good channel for continued discussion. My original question has been answered (thanks to the author for his hard work). Further study makes me think there may be a small problem with the original answer, but it seems difficult to get in touch with the original author quickly. – lemmingxuan May 10 '22 at 00:50

1 Answer


The logical order of a polars query is the order you read it from top to bottom.

q = (
    dataset.lazy()
    .sort("birthday")
    .groupby(["state"])
    .agg(
        [
            get_person().first().alias("youngest"),
            get_person().last().alias("oldest"),
        ]
    )
    .limit(5)
)

This snippet has the following order of operations: sort -> groupby/agg -> limit.

Note that polars may choose to execute the query in a different order if and only if the outcome is the same. This might be done for performance reasons.

1. Could the actual execution order of sort().groupby() be groupby first and then sort, similar to pandas?

I don't think that pandas does this. The result would be incorrect if it did. The outcome of a `first` aggregation depends on row order, so if polars ran the sort after the groupby operation, the outcome of the query would change. That reordering is therefore not a valid optimization.
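
To make the point concrete, here is a minimal plain-Python sketch (it is not polars code, and the sample rows and the `first_per_state` helper are made up for illustration). It takes the first row per state once on sorted input and once on unsorted input, and the two results differ, which is why the sort cannot be moved past the aggregation.

```python
# Hypothetical sample data; names, states and birthdays are invented.
rows = [
    {"state": "CA", "name": "Ann Lee", "birthday": "1990-05-01"},
    {"state": "CA", "name": "Bob Roy", "birthday": "1950-03-12"},
    {"state": "NY", "name": "Cy Poe", "birthday": "1975-07-30"},
    {"state": "NY", "name": "Di Fox", "birthday": "1940-01-20"},
]


def first_per_state(records):
    """Return the first name encountered for each state (a stand-in for a `first` aggregation)."""
    firsts = {}
    for record in records:
        # setdefault keeps the value from the first record seen for a state.
        firsts.setdefault(record["state"], record["name"])
    return firsts


# sort -> groupby/agg: the order written in the query.
sorted_then_grouped = first_per_state(sorted(rows, key=lambda r: r["birthday"]))

# groupby/agg on the unsorted input: what skipping or delaying the sort would see.
grouped_without_sort = first_per_state(rows)

print(sorted_then_grouped)
print(grouped_without_sort)
```

Because the two dictionaries disagree, sorting the aggregated output afterwards could not recover the first result; the sort has to happen before the rows are grouped.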

ritchie46
  • So it doesn't group by `state` and then sort `birthday` within each group. If that's true, I think it may not be a good example for the user guide. – lemmingxuan May 08 '22 at 07:44
  • Why isn't it a good example? I don't think the user guide claimed it sorted the birthday within that group. – ritchie46 May 08 '22 at 08:33