
For the below dataframe

df=spark.createDataFrame(data=[('Alice',4.300),('Bob',7.677)],schema=['name','High'])

When I try to find min & max I am only getting min value in output.

df.agg({'High':'max','High':'min'}).show()
+---------+
|min(High)|
+---------+
|      4.3|
+---------+

Why can't agg() give both max & min like in Pandas?

GeorgeOfTheRF
  • If someone is still wondering about _Why can't agg() give both max & min like in Pandas?_: **it will not work in pandas either**, because `agg()` in both pandas and PySpark accepts a dictionary, and a dictionary can't have more than one entry with the same key. Hence `df.agg({'High':'max','High':'min'}).show()` is really `df.agg({'High':'min'}).show()`, because `'High':'max'` was overwritten by `'High':'min'` – Ankit Agrawal Apr 01 '21 at 16:10
  • CONTD: the syntax in pandas would be `df.agg({'High': {'min(High)': np.min, 'max(High)': np.max}})` – Ankit Agrawal Apr 01 '21 at 16:13
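The duplicate-key collapse the comment describes is plain Python dict behavior, and in modern pandas the usual way to get several aggregates of one column is to map it to a *list* of functions (the nested-dict form quoted above was deprecated and later removed). A minimal sketch; the sample frame mirrors the question's data:

```python
import pandas as pd

# A dict literal silently keeps only the last value for a repeated key,
# so the second 'High' entry overwrites the first.
exprs = {'High': 'max', 'High': 'min'}
print(exprs)  # {'High': 'min'}

# In pandas, map the column to a list of functions instead:
pdf = pd.DataFrame({'name': ['Alice', 'Bob'], 'High': [4.300, 7.677]})
print(pdf.agg({'High': ['min', 'max']}))  # rows 'min' and 'max' of column High
```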

1 Answer


As you can see in the documentation for `agg`:

agg(*exprs)

Compute aggregates and returns the result as a DataFrame.

The available aggregate functions are avg, max, min, sum, count.

If exprs is a single dict mapping from string to string, then the key is the column to perform aggregation on, and the value is the aggregate function.

Alternatively, exprs can also be a list of aggregate Column expressions.

Parameters: exprs – a dict mapping from column name (string) to aggregate functions (string), or a list of Column.

You can use a list of Column expressions and apply whichever aggregate function you need to each column, like this:

>>> from pyspark.sql import functions as F

>>> df.agg(F.min(df.High),F.max(df.High),F.avg(df.High),F.sum(df.High)).show()
+---------+---------+---------+---------+
|min(High)|max(High)|avg(High)|sum(High)|
+---------+---------+---------+---------+
|      4.3|    7.677|   5.9885|   11.977|
+---------+---------+---------+---------+
titiro89