
I have a DataFrame with a column of type String. I want to change the column type to Double in PySpark.

Here is how I did it:

from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import DoubleType

toDoublefunc = UserDefinedFunction(lambda x: x, DoubleType())
changedTypedf = joindf.withColumn("label", toDoublefunc(joindf['show']))

I just want to know whether this is the right way to do it, since I am getting an error while running Logistic Regression, and I wonder whether this is the cause.

Abhishek Choudhary

5 Answers


There is no need for a UDF here. Column already provides a cast method that accepts a DataType instance:

from pyspark.sql.types import DoubleType

changedTypedf = joindf.withColumn("label", joindf["show"].cast(DoubleType()))

or a short string:

changedTypedf = joindf.withColumn("label", joindf["show"].cast("double"))

where the canonical string names (other variations can be supported as well) correspond to the simpleString value. So for the atomic types:

from pyspark.sql import types

for t in ['BinaryType', 'BooleanType', 'ByteType', 'DateType',
          'DecimalType', 'DoubleType', 'FloatType', 'IntegerType',
          'LongType', 'ShortType', 'StringType', 'TimestampType']:
    print(f"{t}: {getattr(types, t)().simpleString()}")
BinaryType: binary
BooleanType: boolean
ByteType: tinyint
DateType: date
DecimalType: decimal(10,0)
DoubleType: double
FloatType: float
IntegerType: int
LongType: bigint
ShortType: smallint
StringType: string
TimestampType: timestamp

and, for example, for complex types:

types.ArrayType(types.IntegerType()).simpleString()   
'array<int>'
types.MapType(types.StringType(), types.IntegerType()).simpleString()
'map<string,int>'
zero323
    Using the `col` function also works. `from pyspark.sql.functions import col`, `changedTypedf = joindf.withColumn("label", col("show").cast(DoubleType())) ` – Staza Apr 03 '18 at 22:38
  • What are the possible values of cast() argument (the "string" syntax)? – Wirawan Purwanto Nov 28 '18 at 01:56
  • I can't believe how terse Spark doc was on the valid string for the datatype. The closest reference I could find was this: https://docs.tibco.com/pub/sfire-analyst/7.7.1/doc/html/en-US/TIB_sfire-analyst_UsersGuide/connectors/apache-spark/apache_spark_data_types.htm . – Wirawan Purwanto Nov 28 '18 at 02:07
  • How to convert multiple columns in one go? – hui chen Dec 30 '19 at 14:38
  • How do I change nullable to false? – pitchblack408 Jun 27 '20 at 01:59

Preserve the name of the column and avoid adding an extra column by using the same name as the input column:

from pyspark.sql.types import DoubleType
changedTypedf = joindf.withColumn("show", joindf["show"].cast(DoubleType()))
Duckling

The given answers are enough to deal with the problem, but I want to share another way, which may have been introduced in a newer version of Spark (I am not sure about it), so the existing answers did not cover it.

We can reference the column in a Spark statement with the col("column_name") function:

from pyspark.sql.functions import col
changedTypedf = joindf.withColumn("show", col("show").cast("double"))
serkan kucukbay
  • Thank you! Using `'double'` is more elegant than `DoubleType()` which may also need to be imported. – ZygD Sep 30 '21 at 15:02

PySpark version:

df = <source data>
df.printSchema()

from pyspark.sql.types import *

# Change column type
df_new = df.withColumn("myColumn", df["myColumn"].cast(IntegerType()))
df_new.printSchema()
df_new.select("myColumn").show()
Cristian

The solution was simple:

from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import DoubleType

toDoublefunc = UserDefinedFunction(lambda x: float(x), DoubleType())
changedTypedf = joindf.withColumn("label", toDoublefunc(joindf['show']))
Abhishek Choudhary