Pyspark: how do I select all columns except for one by name?

Question

I have a Spark DataFrame and I want to run `array = np.array(df.collect())` on all of my columns except the first one (which I want to identify by name or by position). How do I do that?

Use `drop`: `array = np.array(df.drop("some_column_to_exclude").collect())` or a list comp: `array = np.array(df.select(*[c for c in df.columns if c != "some_column_to_exclude"]).collect())`. Looking for a dupe... — pault, Nov 01 '18 at 17:28

score 1 · Accepted Answer · answered Nov 01 '18 at 17:31


I did it this way:

s = list(set(con.columns) - {'FAULTY'}) 

array = np.array(con.select(s).collect())


LN_P



This is fine as long as you don't care about preserving the order of the columns, since a set is unordered. However, I would recommend `drop` here: `array = np.array(con.drop("FAULTY").collect())`. Or, since it's the first column, you can do `array = np.array(con.select(con.columns[1:]).collect())` – pault Nov 01 '18 at 17:34
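
The ordering point can be checked without a Spark session, because `columns` on a DataFrame is a plain Python list of strings. The sketch below uses a made-up list (`columns`, standing in for `con.columns`) and compares the set-difference approach with two order-preserving alternatives. The column names other than `FAULTY` are assumptions for illustration only.

```python
# Hypothetical column names standing in for con.columns (a plain list of str).
columns = ["FAULTY", "a", "b", "c"]

# Set difference: the right names, but a set's iteration order is not guaranteed.
via_set = list(set(columns) - {"FAULTY"})

# List comprehension: keeps the original left-to-right order.
via_filter = [c for c in columns if c != "FAULTY"]

# Slicing: drops the first column by position, also order-preserving.
via_slice = columns[1:]

print(sorted(via_set) == sorted(via_filter))  # True: same names either way
print(via_filter)                             # ['a', 'b', 'c']
print(via_slice)                              # ['a', 'b', 'c']
```

Either `via_filter` or `via_slice` could then be passed to `con.select(...)` so that the resulting array keeps the DataFrame's column order.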

score 0 · Answer 2 · answered Nov 01 '18 at 17:31


You can try:

first_col = 'name_of_your_first_column'
df_exclude = df.select([c for c in df.columns if c != first_col]).collect()  # exact name match, not a substring test
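
When filtering column names against a single string, it matters whether the comparison is `!=` or `not in`: `not in` against a string is a substring test, so it can drop more columns than intended. The sketch below shows the difference on a made-up list of names. `columns_except` is a hypothetical helper, not part of PySpark, and the column names are assumptions.

```python
def columns_except(columns, excluded):
    """Return `columns` without `excluded`, preserving order (hypothetical helper)."""
    return [c for c in columns if c != excluded]


# Made-up column names; "id" happens to be a substring of "customer_id".
columns = ["customer_id", "id", "amount"]
first_col = "customer_id"

# Substring test: "id" is found inside "customer_id", so it is dropped as well.
by_substring = [c for c in columns if c not in first_col]

# Equality test: only the exact name is dropped.
by_equality = columns_except(columns, first_col)

print(by_substring)  # ['amount']
print(by_equality)   # ['id', 'amount']
```

With a real DataFrame, the result would be used as `df.select(columns_except(df.columns, first_col))`.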


pvy4917

