pyspark max string length for each column in the dataframe

Question

I am trying this in Azure Databricks. Please let me know which PySpark libraries need to be imported, and the code needed to get the output below.

Example input dataframe (`>` marks the longest value in each column and is not part of the data):

|     column1     |    column2    | column3  |  column4  |
|-----------------|---------------|----------|-----------|
| a               | bbbbb         | cc       | >dddddddd |
| >aaaaaaaaaaaaaa | bb            | c        | dddd      |
| aa              | >bbbbbbbbbbbb | >ccccccc | ddddd     |
| aaaaa           | bbbb          | ccc      | d         |

Output dataframe:

| column  | maxLength |
|---------|-----------|
| column1 |        14 |
| column2 |        12 |
| column3 |         7 |
| column4 |         8 |

score 4 · Accepted Answer · answered Nov 04 '20 at 07:05

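>>> from pyspark.sql import functions as sf
>>> # Sample data; `sc` is the SparkContext that the pyspark shell predefines.
>>> df = sc.parallelize([['a','bbbbb','ccc','ddd'],['aaaa','bbb','ccccccc', 'dddd']]).toDF(["column1", "column2", "column3", "column4"])
>>> # Replace every value with its string length, keeping the column names.
>>> df1 = df.select([sf.length(col).alias(col) for col in df.columns])
>>> # An empty groupby() aggregates over the whole dataframe: one row of per-column maxima.
>>> df1.groupby().max().show()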
+------------+------------+------------+------------+
|max(column1)|max(column2)|max(column3)|max(column4)|
+------------+------------+------------+------------+
|           4|           5|           7|           4|
+------------+------------+------------+------------+

Then melt (unpivot) the previous one-row dataframe so that each column becomes a row of the form (column, maxLength).
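The melt step is not spelled out in this answer, so here is one possible sketch of it rather than the linked recipe. It assumes Spark SQL's `stack` generator function is acceptable for the unpivot and that column names contain no quotes or backticks. The helper names `build_stack_expr` and `max_string_lengths`, and the intermediate aliases `col_name` and `max_len`, are made up for illustration.

```python
def build_stack_expr(columns):
    """Build a Spark SQL stack() expression that turns N columns into N rows."""
    # Each pair is a literal column name followed by a reference to that column.
    pairs = ", ".join(f"'{c}', `{c}`" for c in columns)
    return f"stack({len(columns)}, {pairs}) as (col_name, max_len)"


def max_string_lengths(df):
    """Return a two-column (column, maxLength) dataframe for a PySpark dataframe."""
    # Imported here so build_stack_expr can be used without Spark installed.
    from pyspark.sql import functions as sf

    # One wide row: the maximum string length of each column, under its own name.
    wide = df.select([sf.max(sf.length(c)).alias(c) for c in df.columns])
    # Unpivot the wide row, then rename to the headers requested in the question.
    return wide.selectExpr(build_stack_expr(df.columns)).toDF("column", "maxLength")
```

Calling `max_string_lengths(df).show()` on the sample `df` above should give one row per column, with the same maxima as the wide result (4, 5, 7, 4). Computing `max(length(...))` in a single `select` also avoids the intermediate `df1`.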

E.ZY.

Thanks, this worked and also takes less time to execute :) :) – VIGNESH R Nov 04 '20 at 12:34
@VIGNESHR Glad to know that your issue has been resolved. You can accept it as an answer (click on the check mark beside the answer to toggle it from greyed out to filled in). This can be beneficial to other community members. Thank you. – CHEEKATLAPRADEEP-MSFT Nov 09 '20 at 06:43
