2

I have a Spark DataFrame and I want to run `array = np.array(df.collect())` on all of my columns except the first one (which I want to identify by name or by position). How do I do that?

pault
LN_P
  • Use `drop`: `array = np.array(df.drop("some_column_to_exclude").collect())` or a list comp: `array = np.array(df.select(*[c for c in df.columns if c != "some_column_to_exclude"]).collect())`. Looking for a dupe... – pault Nov 01 '18 at 17:28

2 Answers

1

I did it this way:

s = list(set(con.columns) - {'FAULTY'}) 

array = np.array(con.select(s).collect())
LN_P
  • 1
    This is fine as long as you don't care about maintaining the order of the columns. However, using `drop` here would be my recommendation. `array = np.array(df.drop("FAULTY").collect())`. Or since it's the first column, you can do `array = np.array(con.select(con.columns[1:]).collect())` – pault Nov 01 '18 at 17:34
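To illustrate the ordering point from the comment above, here is a minimal sketch in plain Python. The `columns` list is a hypothetical stand-in for `con.columns`; no Spark session is needed to see the difference.

```python
# Hypothetical column names standing in for con.columns
columns = ["FAULTY", "a", "b", "c"]

# Set difference keeps the right members, but a set has no defined order,
# so the resulting list may not match the DataFrame's column order.
unordered = list(set(columns) - {"FAULTY"})

# Order-preserving alternatives:
by_name = [c for c in columns if c != "FAULTY"]  # exclude by name
by_position = columns[1:]                        # exclude the first column

print(sorted(unordered) == sorted(by_name))  # same members either way
print(by_name)
print(by_position)
```

Either ordered list can then be passed to `select`, for example `np.array(con.select(by_name).collect())`, so the array columns line up with the original DataFrame.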
0

You can try:

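first_col = 'name_of_your_first_column'
# Compare with != so only an exact match is excluded; `not in` on a string is a substring test
cols_to_keep = [c for c in df.columns if c != first_col]
array = np.array(df.select(cols_to_keep).collect())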
pvy4917
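One detail worth checking when filtering column names against a single string: `x not in some_string` is a substring test in Python, not an equality test. The sketch below uses hypothetical column names to show how a column whose name happens to be contained in the excluded name would be dropped too, and how `!=` avoids that.

```python
first_col = "name_of_your_first_column"

# Hypothetical column list; "name" is a substring of first_col
columns = [first_col, "name", "value"]

# Substring test: also drops "name", because "name" occurs inside first_col
substring_filtered = [c for c in columns if c not in first_col]

# Equality test: drops only the exact match
equality_filtered = [c for c in columns if c != first_col]

print(substring_filtered)
print(equality_filtered)
```

If several columns need excluding, testing membership in a list or set of names (for example `c not in {"col_a", "col_b"}`) gives exact matching as well.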